SpMV and BiCG-Stab sparse solver on Multi-GPUs for reservoir simulation

Al-Mouhamed, Mayez; Firdaus, Lutfi; Khan, Ayaz H.; Mohammad, Nazeeruddin

doi:10.1007/s11042-023-16185-0

Published: 17 August 2023

Volume 83, pages 23563–23597, (2024)

Multimedia Tools and Applications

Mayez Al-Mouhamed¹, Lutfi Firdaus¹, Ayaz H. Khan¹ (ORCID: orcid.org/0000-0003-1167-7319) & Nazeeruddin Mohammad²

Abstract

This paper presents a multi-GPU simulation of a petroleum reservoir on a 3D structured grid, where each grid point is represented by its state. Under the Darcy model for porous media, each grid point is related to its six neighbors by a linear relation. The resulting system is a sparse linear system AX = b, where A is a hepta-diagonal matrix, b holds the model parameters, and X is the state vector. The simulation repeatedly computes (1) A and b given X and (2) X given A and b; step (2) is the focus of this paper. BiCG-Stab is an iterative procedure for solving AX = b for X, and this work develops a scalable multi-GPU approach for solving large sparse systems AX = b with it. We extend a previously developed storage scheme for blocked hepta-diagonal matrices to minimize the storage and computing overheads of the sparse matrix–vector multiply (SpMV) used in BiCG-Stab. To honor data dependencies among BiCG-Stab tasks, we propose a hierarchical multi-GPU synchronization scheme that reduces polling, combines barriers, terminates the BiCG-Stab iteration, and assembles the solution X. We present a multi-GPU implementation of BiCG-Stab that distributes operations over all units within a GPU and over all GPUs. Reduce-add operations are orchestrated by assembling partial results across all units and all GPUs. Since the cuSPARSE library works only on a single GPU, we present an SpMV for multiple GPUs. In the evaluation, we test pinned and paged memory for implementing multi-GPU synchronization and communication. The scalability of the multi-GPU BiCG-Stab implementation is demonstrated by a smooth increase in computational load as the problem size grows to a billion, which is useful for developing scalable petroleum reservoir simulation on a cluster of GPUs.
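
To make the structure described in the abstract concrete, the following is a minimal single-process sketch in NumPy, not the authors' implementation. It assumes a model 7-point stencil (diagonal 6 + shift, neighbor couplings of -1) in place of the Darcy coefficients, a plain per-diagonal storage in place of the paper's blocked storage scheme, and grid dimensions of at least 2 in each direction. The names `build_hepta`, `spmv`, `dot_partitioned`, `bicgstab`, and the `parts` parameter are hypothetical; `parts` merely stands in for the number of GPUs whose partial sums are combined by a reduce-add.

```python
import numpy as np

def build_hepta(nx, ny, nz, shift=1.0):
    """Model hepta-diagonal matrix as {offset: diagonal}; diag[r] holds A[r, r + offset].
    Assumes nx, ny, nz >= 2 so that the seven offsets are distinct."""
    n = nx * ny * nz
    idx = np.arange(n)
    i, j, k = idx % nx, (idx // nx) % ny, idx // (nx * ny)
    diags = {0: np.full(n, 6.0 + shift)}
    # A grid point couples to a neighbor only if that neighbor lies inside the grid.
    diags[-1] = np.where(i > 0, -1.0, 0.0)
    diags[+1] = np.where(i < nx - 1, -1.0, 0.0)
    diags[-nx] = np.where(j > 0, -1.0, 0.0)
    diags[+nx] = np.where(j < ny - 1, -1.0, 0.0)
    diags[-nx * ny] = np.where(k > 0, -1.0, 0.0)
    diags[+nx * ny] = np.where(k < nz - 1, -1.0, 0.0)
    return diags

def spmv(diags, x):
    """y = A x, touching only the seven stored diagonals."""
    n = x.size
    y = np.zeros(n)
    for off, d in diags.items():
        if off >= 0:
            y[:n - off] += d[:n - off] * x[off:]
        else:
            y[-off:] += d[-off:] * x[:n + off]
    return y

def dot_partitioned(u, v, parts):
    """Reduce-add: each partition (a stand-in for one GPU) forms a partial dot
    product, and the partial results are then summed."""
    chunks = zip(np.array_split(u, parts), np.array_split(v, parts))
    return sum(float(np.dot(a, b)) for a, b in chunks)

def bicgstab(diags, b, tol=1e-10, max_iter=500, parts=4):
    """Unpreconditioned BiCG-Stab with a zero initial guess.
    Breakdown cases (e.g. rho == 0) and b == 0 are not handled in this sketch."""
    dot = lambda u, v: dot_partitioned(u, v, parts)
    x = np.zeros_like(b)
    r = b - spmv(diags, x)
    r_hat = r.copy()                      # fixed shadow residual
    rho = alpha = omega = 1.0
    v = np.zeros_like(b)
    p = np.zeros_like(b)
    b_norm = np.sqrt(dot(b, b))
    for it in range(1, max_iter + 1):
        rho_new = dot(r_hat, r)
        beta = (rho_new / rho) * (alpha / omega)
        p = r + beta * (p - omega * v)
        v = spmv(diags, p)                # first SpMV of the iteration
        alpha = rho_new / dot(r_hat, v)
        s = r - alpha * v
        if np.sqrt(dot(s, s)) <= tol * b_norm:
            return x + alpha * p, it      # converged after the half step
        t = spmv(diags, s)                # second SpMV of the iteration
        omega = dot(t, s) / dot(t, t)
        x = x + alpha * p + omega * s
        r = s - omega * t
        rho = rho_new
        if np.sqrt(dot(r, r)) <= tol * b_norm:
            return x, it
    return x, max_iter
```

A typical call would be `diags = build_hepta(6, 5, 4)`, then `b = spmv(diags, x_true)` for some known `x_true`, and finally `x, iterations = bicgstab(diags, b)`. Each iteration needs two SpMVs and several dot products, which is why the abstract singles out SpMV and reduce-add as the operations to distribute across GPUs.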

Data Availability

Not Applicable.

Code availability

Not Applicable.

Acknowledgements

This research was funded through a research project by the National Plan for Science, Technology, and Innovation (MAARIFAH), King Abdulaziz City for Science and Technology, through the Science & Technology Unit at King Fahd University of Petroleum and Minerals (KFUPM), Kingdom of Saudi Arabia, award number 12-INF3008-04. Thanks to King Fahd University of Petroleum & Minerals (KFUPM) for computing support.

Author information

Mayez Al-Mouhamed, Lutfi Firdaus, Ayaz H. Khan and Nazeeruddin Mohammad contributed equally to this work.

Authors and Affiliations

Computer Engineering Department, King Fahd University of Petroleum and Minerals, Dhahran, Kingdom of Saudi Arabia
Mayez Al-Mouhamed, Lutfi Firdaus & Ayaz H. Khan
Cybersecurity Center, Prince Mohammad Bin Fahd University, Dhahran, Kingdom of Saudi Arabia
Nazeeruddin Mohammad

Contributions

Not Applicable.

Corresponding author

Correspondence to Ayaz H. Khan.

Ethics declarations

Conflict of interest/Competing interests

The authors have no competing interests to declare that are relevant to the content of this article.

Ethics approval

Not Applicable.

Consent to participate

Not Applicable.

Consent for publication

Not Applicable.

Additional information

Publisher's note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Al-Mouhamed, M., Firdaus, L., Khan, A.H. et al. SpMV and BiCG-Stab sparse solver on Multi-GPUs for reservoir simulation. Multimed Tools Appl 83, 23563–23597 (2024). https://doi.org/10.1007/s11042-023-16185-0

Received: 08 February 2022
Revised: 23 May 2023
Accepted: 03 July 2023
Published: 17 August 2023
Issue Date: March 2024
DOI: https://doi.org/10.1007/s11042-023-16185-0
