Abstract
This paper addresses the multi-GPU simulation of a petroleum reservoir on a 3D structured grid, where each grid point is represented by its state. Under the Darcy model for porous media, each grid point is related to its six neighbors by a linear relation. The system equation is a sparse linear system AX = b, where A is a hepta-diagonal matrix, b is a vector of model parameters, and X is the state vector. The simulation repeatedly computes (1) A and b given X and (2) X given A and b; the second step is the focus of this paper. BiCG-Stab is an iterative method for solving AX = b for X. This work develops a scalable multi-GPU approach for solving large sparse systems AX = b using BiCG-Stab. We extend a previously developed storage scheme for blocked hepta-diagonal matrices to minimize the storage and computing overheads of the sparse matrix–vector multiply (SpMV) used in BiCG-Stab. To honor the data dependencies among BiCG-Stab tasks, we propose a hierarchical multi-GPU synchronization scheme that reduces polling, combines barriers, terminates the BiCG-Stab iteration, and assembles the solution X. We present a multi-GPU implementation of BiCG-Stab that distributes operations over all units within a GPU and over all GPUs. Reduce-add operations are orchestrated by assembling partial results across all units and all GPUs. Since the cuSparse library works only on a single GPU, we present a multi-GPU SpMV. In the evaluation, we test pinned and paged memory for implementing multi-GPU synchronization and communication. The scalability of the multi-GPU implementation of BiCG-Stab is demonstrated by a smooth increase in the computational load as the problem size grows to a billion, which is useful for developing scalable petroleum reservoir simulation on a cluster of GPUs.
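To make the terms in the abstract concrete, the following is a minimal single-process Python sketch of a hepta-diagonal SpMV, a chunked reduce-add dot product, and an unpreconditioned BiCG-Stab loop. It is not the paper's blocked storage scheme or its multi-GPU implementation. The function names (`build_heptadiagonal`, `spmv`, `dot`, `bicgstab`), the band layout, the test coefficients (7 on the diagonal, -1 for each neighbor), and the `parts` parameter that stands in for per-GPU partial sums are all assumptions made for illustration.

```python
import math

def build_heptadiagonal(nx, ny, nz, diag_value=7.0):
    """Hypothetical test matrix: one band per stencil offset on an nx*ny*nz grid."""
    n = nx * ny * nz
    offsets = [-nx * ny, -nx, -1, 0, 1, nx, nx * ny]
    bands = [[0.0] * n for _ in offsets]
    for z in range(nz):
        for y in range(ny):
            for x in range(nx):
                i = x + nx * (y + ny * z)      # linear index of grid point (x, y, z)
                bands[3][i] = diag_value
                if z > 0:      bands[0][i] = -1.0
                if y > 0:      bands[1][i] = -1.0
                if x > 0:      bands[2][i] = -1.0
                if x < nx - 1: bands[4][i] = -1.0
                if y < ny - 1: bands[5][i] = -1.0
                if z < nz - 1: bands[6][i] = -1.0
    return offsets, bands

def spmv(offsets, bands, x):
    """y = A x, where row i of band d holds the coefficient A[i, i + offsets[d]]."""
    n = len(x)
    y = [0.0] * n
    for off, band in zip(offsets, bands):
        lo, hi = max(0, -off), min(n, n - off)  # rows whose column i + off is in range
        for i in range(lo, hi):
            y[i] += band[i] * x[i + off]
    return y

def dot(u, v, parts=4):
    """Reduce-add: each chunk yields a partial sum, then the partials are added."""
    step = max(1, (len(u) + parts - 1) // parts)
    partials = [sum(a * b for a, b in zip(u[s:s + step], v[s:s + step]))
                for s in range(0, len(u), step)]
    return sum(partials)

def bicgstab(offsets, bands, b, tol=1e-10, max_iter=200):
    """Unpreconditioned BiCG-Stab starting from x = 0; returns (x, iterations)."""
    n = len(b)
    x = [0.0] * n
    r = list(b)                                 # r = b - A*0
    r_hat = list(r)                             # fixed shadow residual
    rho = alpha = omega = 1.0
    v = [0.0] * n
    p = [0.0] * n
    b_norm = math.sqrt(dot(b, b)) or 1.0
    for k in range(1, max_iter + 1):
        rho_new = dot(r_hat, r)
        beta = (rho_new / rho) * (alpha / omega)
        p = [ri + beta * (pi - omega * vi) for ri, pi, vi in zip(r, p, v)]
        v = spmv(offsets, bands, p)
        alpha = rho_new / dot(r_hat, v)
        s = [ri - alpha * vi for ri, vi in zip(r, v)]
        if math.sqrt(dot(s, s)) / b_norm < tol:  # converged after the half step
            return [xi + alpha * pi for xi, pi in zip(x, p)], k
        t = spmv(offsets, bands, s)
        omega = dot(t, s) / dot(t, t)
        x = [xi + alpha * pi + omega * si for xi, pi, si in zip(x, p, s)]
        r = [si - omega * ti for si, ti in zip(s, t)]
        rho = rho_new
        if math.sqrt(dot(r, r)) / b_norm < tol:
            return x, k
    return x, max_iter
```

Each iteration needs two SpMV calls and several dot products. In a multi-GPU setting these are the operations whose partial results have to be combined across devices, which is the role the chunked `dot` plays in this sketch.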
Data Availability
Not Applicable.
Code availability
Not Applicable.
Acknowledgements
This research was funded through a research project by the National Plan for Science, Technology, and Innovation (MAARIFAH), King Abdulaziz City for Science and Technology, through the Science & Technology Unit at King Fahd University of Petroleum and Minerals (KFUPM), Kingdom of Saudi Arabia, award number 12-INF3008-04. Thanks to King Fahd University of Petroleum & Minerals (KFUPM) for computing support.
Author information
Contributions
Not Applicable.
Ethics declarations
Conflict of interest/Competing interests
The authors have no competing interests to declare that are relevant to the content of this article.
Ethics approval
Not Applicable.
Consent to participate
Not Applicable.
Consent for publication
Not Applicable.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Al-Mouhamed, M., Firdaus, L., Khan, A.H. et al. SpMV and BiCG-Stab sparse solver on Multi-GPUs for reservoir simulation. Multimed Tools Appl 83, 23563–23597 (2024). https://doi.org/10.1007/s11042-023-16185-0
DOI: https://doi.org/10.1007/s11042-023-16185-0