Abstract
Despite increasing investment in integrated GPUs and next-generation interconnects, discrete GPUs connected by PCIe still dominate the market, and the management of data communication between CPU and GPU continues to evolve. Initially, programmers explicitly controlled data transfers between CPU and GPU. To simplify programming and enable system-wide atomic memory operations, GPU vendors have developed a programming model that provides a single virtual address space for accessing all CPU and GPU memories in the system. In this model, a page migration engine automatically migrates pages between CPU and GPU on demand. To meet the needs of high-performance workloads, page sizes keep growing. Because the interconnect offers low bandwidth and high latency compared with GDDR, migrating a larger page takes longer, which can reduce the overlap of computation and communication, waste time migrating unrequested data, block subsequent requests, and cause serious performance degradation. In this paper, we propose partial page migration, which migrates only the requested part of a page, thereby reducing the migration unit, shortening migration latency, and avoiding the performance degradation of full page migration as pages grow larger. We show that partial page migration can largely hide the performance overheads of full page migration. Compared with programmer-controlled data transfers at a 2 MB page size and 16 GB/s PCIe bandwidth, full page migration is 72.72× slower, while our partial page migration achieves a 1.29× speedup. When the PCIe bandwidth increases to 96 GB/s, full page migration is 18.85× slower, while our partial page migration provides a 1.37× speedup. Additionally, we examine the impact of PCIe bandwidth and migration unit size on execution time, enabling designers to make informed decisions.
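The core mechanism can be illustrated with a short sketch. The fragment below is a minimal, hypothetical fault handler, not the paper's actual implementation: on a GPU-side page fault it transfers only the fixed-size sub-block containing the faulting address and tracks residency with a per-page bitmap. All names here (`PAGE_SIZE`, `BLOCK_SIZE`, `dma_copy_to_gpu`, `handle_gpu_fault`) are illustrative assumptions, with sizes chosen to match the 2 MB page discussed in the abstract.

```cpp
#include <bitset>
#include <cstdint>
#include <unordered_map>

// Illustrative sizes: a 2 MB page split into 4 KB migration units.
constexpr uint64_t PAGE_SIZE  = 2ull * 1024 * 1024;          // 2 MB page
constexpr uint64_t BLOCK_SIZE = 4ull * 1024;                 // 4 KB migration unit
constexpr uint64_t BLOCKS_PER_PAGE = PAGE_SIZE / BLOCK_SIZE; // 512 units per page

// Per-page residency bitmap: which sub-blocks already live in GPU memory.
struct PageState {
    std::bitset<BLOCKS_PER_PAGE> resident;
};

std::unordered_map<uint64_t, PageState> page_table; // keyed by page base address

// Stand-in for the actual DMA over PCIe; a real handler would program
// the copy engine here (hypothetical helper).
void dma_copy_to_gpu(uint64_t src_addr, uint64_t bytes) {
    (void)src_addr;
    (void)bytes;
}

// On a GPU-side fault, migrate only the 4 KB block containing the
// faulting address instead of the whole 2 MB page.
void handle_gpu_fault(uint64_t fault_addr) {
    const uint64_t page_base = fault_addr & ~(PAGE_SIZE - 1);
    const uint64_t block_idx = (fault_addr - page_base) / BLOCK_SIZE;

    PageState &ps = page_table[page_base];
    if (ps.resident.test(block_idx))
        return; // this sub-block was migrated earlier; no PCIe traffic needed

    // Transfer one migration unit (4 KB) rather than the full page (2 MB).
    dma_copy_to_gpu(page_base + block_idx * BLOCK_SIZE, BLOCK_SIZE);
    ps.resident.set(block_idx);
}
```

Under this scheme each fault moves only `BLOCK_SIZE` bytes over PCIe, so fault latency scales with the migration unit rather than the page size, which is the effect the speedup numbers above quantify.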
Acknowledgements
We thank the anonymous reviewers for their valuable feedback. This work was supported by NSFC (Grant No. 61472431).
Author information
Shiqing Zhang received the BS degree in computer science from the National University of Defense Technology (NUDT), China in 2016, where she is currently pursuing the MS degree. Her research interests include parallel programming and optimization techniques.
Zheng Qin received the BS degree in computer science from the National University of Defense Technology (NUDT), China in 2016, where he is currently pursuing the MS degree. His research interests include machine learning, computer vision, and deep learning acceleration.
Yaohua Yang received the BS degree in software engineering from Shandong University, China. Currently, he is a graduate student at the National University of Defense Technology, China. His research interests include high performance processor architecture and optimization techniques.
Li Shen received the BS, MS, and PhD degrees in computer science from the National University of Defense Technology (NUDT), China. Currently he is a professor at NUDT, China. His research interests include high performance processor architecture, parallel programming, and optimization techniques.
Zhiying Wang received the BS, MS and PhD degrees in computer science from the National University of Defense Technology (NUDT), China. Currently he is a professor at NUDT, China. His research interests include processor architecture, on-chip interconnect, and information security.
Cite this article
Zhang, S., Qin, Z., Yang, Y. et al. Transparent partial page migration between CPU and GPU. Front. Comput. Sci. 14, 143101 (2020). https://doi.org/10.1007/s11704-018-7386-4