Abstract
Despite increasing investment in integrated GPUs and next-generation interconnects, discrete GPUs connected by PCIe still dominate the market, and the management of data communication between CPU and GPU continues to evolve. Initially, programmers explicitly controlled data transfers between CPU and GPU. To simplify programming and enable system-wide atomic memory operations, GPU vendors have developed a programming model that provides a single virtual address space for accessing all CPU and GPU memories in the system. In this model, a page migration engine automatically migrates pages between CPU and GPU on demand. To meet the needs of high-performance workloads, page sizes keep growing. Because the interconnect offers low bandwidth and high latency compared with GDDR, migrating a larger page takes longer, which can reduce the overlap of computation and communication, waste time migrating unrequested data, block subsequent requests, and cause serious performance degradation. In this paper, we propose partial page migration, which migrates only the requested part of a page, thereby reducing the migration unit, shortening migration latency, and avoiding the performance degradation of full page migration as pages grow larger. We show that partial page migration can largely hide the performance overheads of full page migration. Compared with programmer-controlled data transfers at a 2 MB page size and 16 GB/s PCIe bandwidth, full page migration is 72.72× slower, while our partial page migration achieves a 1.29× speedup. When the PCIe bandwidth increases to 96 GB/s, full page migration is 18.85× slower, while our partial page migration provides a 1.37× speedup. Additionally, we examine the impact of PCIe bandwidth and migration unit size on execution time, enabling designers to make informed decisions.
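The core mechanism can be illustrated with a short sketch. The fragment below is a minimal, hypothetical fault handler, not the paper's actual implementation: on a GPU-side page fault it transfers only the fixed-size sub-block containing the faulting address and tracks residency with a per-page bitmap. All names here (`PAGE_SIZE`, `BLOCK_SIZE`, `dma_copy_to_gpu`, `handle_gpu_fault`) are illustrative assumptions, with sizes chosen to match the 2 MB page discussed in the abstract.

```cpp
#include <bitset>
#include <cstdint>
#include <unordered_map>

// Illustrative sizes: a 2 MB page split into 4 KB migration units.
constexpr uint64_t PAGE_SIZE  = 2ull * 1024 * 1024;          // 2 MB page
constexpr uint64_t BLOCK_SIZE = 4ull * 1024;                 // 4 KB migration unit
constexpr uint64_t BLOCKS_PER_PAGE = PAGE_SIZE / BLOCK_SIZE; // 512 units per page

// Per-page residency bitmap: which sub-blocks already live in GPU memory.
struct PageState {
    std::bitset<BLOCKS_PER_PAGE> resident;
};

std::unordered_map<uint64_t, PageState> page_table; // keyed by page base address

// Stand-in for the actual DMA over PCIe; a real handler would program
// the copy engine here (hypothetical helper).
void dma_copy_to_gpu(uint64_t src_addr, uint64_t bytes) {
    (void)src_addr;
    (void)bytes;
}

// On a GPU-side fault, migrate only the 4 KB block containing the
// faulting address instead of the whole 2 MB page.
void handle_gpu_fault(uint64_t fault_addr) {
    const uint64_t page_base = fault_addr & ~(PAGE_SIZE - 1);
    const uint64_t block_idx = (fault_addr - page_base) / BLOCK_SIZE;

    PageState &ps = page_table[page_base];
    if (ps.resident.test(block_idx))
        return; // this sub-block was migrated earlier; no PCIe traffic needed

    // Transfer one migration unit (4 KB) rather than the full page (2 MB).
    dma_copy_to_gpu(page_base + block_idx * BLOCK_SIZE, BLOCK_SIZE);
    ps.resident.set(block_idx);
}
```

Under this scheme each fault moves only `BLOCK_SIZE` bytes over PCIe, so fault latency scales with the migration unit rather than the page size, which is the effect the speedup numbers above quantify.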
Acknowledgements
We thank the anonymous reviewers for their valuable feedback. This work was supported by NSFC (Grant No. 61472431).
Author information
Shiqing Zhang received the BS degree in computer science from the National University of Defense Technology (NUDT), China in 2016, where she is currently pursuing the MS degree. Her research interests include parallel programming and optimization techniques.
Zheng Qin received the BS degree in computer science from the National University of Defense Technology (NUDT), China in 2016, where he is currently pursuing the MS degree. His research interests include machine learning, computer vision, and deep learning acceleration.
Yaohua Yang received the BS degree in software engineering from Shandong University, China. Currently, he is a graduate student at the National University of Defense Technology, China. His research interests include high performance processor architecture and optimization techniques.
Li Shen received the BS, MS, and PhD degrees in computer science from the National University of Defense Technology (NUDT), China. Currently he is a professor at NUDT, China. His research interests include high performance processor architecture, parallel programming, and optimization techniques.
Zhiying Wang received the BS, MS and PhD degrees in computer science from the National University of Defense Technology (NUDT), China. Currently he is a professor at NUDT, China. His research interests include processor architecture, on-chip interconnect, and information security.
Cite this article
Zhang, S., Qin, Z., Yang, Y. et al. Transparent partial page migration between CPU and GPU. Front. Comput. Sci. 14, 143101 (2020). https://doi.org/10.1007/s11704-018-7386-4