Abstract
Memory copying is one of the most common operations in modern software. It is usually performed as a synchronous (sync) CPU procedure, which incurs overheads such as cache pollution and CPU stalling, especially when copying large data in bulk. To mitigate this, several works build on I/OAT, a dedicated and popular hardware copy engine on Intel platforms, but they still suffer from several problems: (1) no atomic allocation/revocation at the granularity of an I/OAT channel; (2) no interrupt support; and (3) complicated programming interfaces. We propose RAMCI, an asynchronous (async) memory copying mechanism based on the Intel I/OAT engine. RAMCI not only reduces the overheads of sync copying but also overcomes the three issues above through (1) a lock mechanism built on the low-level CAS instruction; (2) a lightweight interrupt mechanism that signals the completion of a memory copy, replacing polling, which consumes substantial CPU resources; and (3) a set of well-defined, abstract interfaces that let programmers use the underlying free I/OAT channels transparently. To support these interfaces, we introduce a novel scheduler for the I/OAT channels. It splits the source data into several pieces and assigns each piece a dedicated I/OAT channel so that the pieces are transferred in parallel. We evaluate RAMCI and compare it with other memory copying mechanisms in four NUMA scenarios. The experimental results show that RAMCI improves memory copying performance by up to 4.68\(\times\) while preserving almost the full capacity for parallel computing.
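The two ideas named in the abstract, CAS-based channel allocation and splitting a copy into per-channel pieces, can be illustrated with a minimal sketch in C11. This is not RAMCI's implementation: the channel count, the names `channel_try_acquire`, `channel_release`, `split_copy` and `struct piece`, and the even-split policy are all assumptions made for illustration, and the actual submission of descriptors to the I/OAT hardware is omitted.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define NUM_CHANNELS 4 /* hypothetical channel count, not taken from the paper */

/* One ownership flag per channel: 0 = free, 1 = owned. */
static atomic_int channel_busy[NUM_CHANNELS];

/* Try to claim a channel with a single compare-and-swap.
 * Returns true only for the caller that flips the flag from 0 to 1. */
static bool channel_try_acquire(int ch)
{
    assert(ch >= 0 && ch < NUM_CHANNELS);
    int expected = 0;
    return atomic_compare_exchange_strong(&channel_busy[ch], &expected, 1);
}

/* Give a channel back so that other callers can claim it. */
static void channel_release(int ch)
{
    assert(ch >= 0 && ch < NUM_CHANNELS);
    atomic_store(&channel_busy[ch], 0);
}

/* One piece of a larger copy, bound to the channel that would move it. */
struct piece {
    int channel;
    size_t offset;
    size_t len;
};

/* Claim up to max_pieces free channels and split len bytes across them
 * as evenly as possible (an assumed policy). Returns the number of
 * pieces written to out, or 0 if no channel was free. */
static size_t split_copy(size_t len, struct piece *out, size_t max_pieces)
{
    size_t n = 0;
    for (int ch = 0; ch < NUM_CHANNELS && n < max_pieces; ch++) {
        if (channel_try_acquire(ch))
            out[n++].channel = ch;
    }
    if (n == 0)
        return 0;

    size_t base = len / n, rem = len % n, off = 0;
    for (size_t i = 0; i < n; i++) {
        out[i].offset = off;
        out[i].len = base + (i < rem ? 1 : 0); /* first rem pieces get one extra byte */
        off += out[i].len;
    }
    return n;
}
```

A caller would hand each returned piece to its channel and call `channel_release` once that piece completes. Because the CAS succeeds for exactly one contender, two threads can never be given the same channel, which is the property that atomic allocation at channel granularity requires.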
Acknowledgements
This work was funded by the National Natural Science Foundation of China under grant numbers 61972164, 61772211, and U1811263, by the Guangdong Basic and Applied Basic Research Foundation under grant number 2019A1515011160, and by the Guangzhou Key Laboratory of Big Data and Intelligent Education under grant number 201905010009.
Cite this article
Chen, Z., Li, D., Wang, Z. et al. RAMCI: a novel asynchronous memory copying mechanism based on I/OAT. CCF Trans. HPC 3, 129–143 (2021). https://doi.org/10.1007/s42514-021-00063-y
DOI: https://doi.org/10.1007/s42514-021-00063-y