Abstract
In this paper, we propose OMPICUDA, a program development toolkit for hybrid CPU/GPU clusters. With this toolkit, users can develop applications for a hybrid CPU/GPU cluster with a familiar programming model, i.e., compound OpenMP and MPI, instead of mixed CUDA and MPI or a software distributed shared memory (SDSM) system. In addition, through an extended device directive, users can select the type of resource used to execute each parallel region of the same program according to that region's properties. Finally, the toolkit provides a set of data-partition interfaces that allow users to achieve load balance at the application level regardless of which resource types execute their programs.
Acknowledgements
We would like to thank the National Science Council of the Republic of China for its grant support under project number NSC 99-2221-E-151-055-MY3.
Cite this article
Li, HF., Liang, TY. & Chiu, JY. A compound OpenMP/MPI program development toolkit for hybrid CPU/GPU clusters. J Supercomput 66, 381–405 (2013). https://doi.org/10.1007/s11227-013-0912-0