
SpMV and BiCG-Stab sparse solver on Multi-GPUs for reservoir simulation


Abstract

This paper addresses a multi-GPU simulation of a petroleum reservoir using a 3D structured grid, where each point is represented by its state. Using the Darcy model for porous media, each grid point is related to its six neighbors by a linear relation. The resulting system is a sparse linear system AX = b, where A is a hepta-diagonal matrix, b is a vector of model parameters, and X is a state vector. The simulation repeatedly computes (1) A and b given X and (2) X given A and b; step (2) is the focus of this paper. BiCG-Stab is an iterative method for solving AX = b for X. This work develops a scalable multi-GPU approach for solving large sparse systems AX = b using BiCG-Stab. We extend a previously developed storage scheme for blocked hepta-diagonal matrices to minimize the storage and computing overheads of the sparse matrix–vector multiply (SpMV) used in BiCG-Stab. To honor data dependencies among BiCG-Stab tasks, we propose a hierarchical multi-GPU synchronization scheme that reduces polling, combines barriers, terminates the BiCG-Stab iteration, and assembles the solution X. We present a multi-GPU implementation of BiCG-Stab that distributes operations over all units within a GPU and over all GPUs. Reduce-add operations are orchestrated by assembling partial results across all units and all GPUs. Since the cuSPARSE library works only on a single GPU, we present an SpMV for multiple GPUs. In the evaluation, we test pinned and paged memory for implementing multi-GPU synchronization and communication. The scalability of the multi-GPU implementation of BiCG-Stab shows a smooth increase in computational load as the problem size grows to a billion, which is useful for developing scalable petroleum reservoir simulation on a cluster of GPUs.
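To make the SpMV structure concrete, the sketch below shows a single-GPU CUDA kernel for a hepta-diagonal matrix stored by diagonals, as it arises from a 7-point stencil on an nx x ny x nz grid. The kernel, its names, and the diagonal layout (diag[d*n + i] holding the entry of row i on diagonal d, with explicit zeros stored where a stencil neighbor crosses a grid face) are illustrative assumptions, not the paper's storage scheme or implementation.

#include <cuda_runtime.h>

// Hepta-diagonal SpMV y = A*x for a 7-point stencil on an nx x ny x nz grid.
// diag[d*n + i] is the entry of row i on diagonal d; diagonals are ordered by
// column offset {-nx*ny, -nx, -1, 0, +1, +nx, +nx*ny}. Entries whose stencil
// neighbor crosses a grid face are assumed to be stored as explicit zeros.
__global__ void spmv_hepta(int n, int nx, int ny,
                           const double* __restrict__ diag,
                           const double* __restrict__ x,
                           double* __restrict__ y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    const int off[7] = { -nx * ny, -nx, -1, 0, 1, nx, nx * ny };
    double sum = 0.0;
    for (int d = 0; d < 7; ++d) {
        int j = i + off[d];
        if (j >= 0 && j < n)                       // clip offsets that leave the matrix
            sum += diag[(size_t)d * n + i] * x[j];
    }
    y[i] = sum;
}

// Host-side launch on one GPU; a multi-GPU run would assign each device a
// contiguous block of rows plus a halo of nx*ny boundary values of x.
void spmv_hepta_launch(int n, int nx, int ny,
                       const double* d_diag, const double* d_x, double* d_y)
{
    int threads = 256;
    int blocks  = (n + threads - 1) / threads;
    spmv_hepta<<<blocks, threads>>>(n, nx, ny, d_diag, d_x, d_y);
}

The remaining BiCG-Stab building blocks (dot products and AXPY updates) parallelize the same way, with each GPU reducing its partial dot products before the partial results are assembled across all units and all GPUs, as described above.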


Data Availability

Not Applicable.

Code availability

Not Applicable.


Acknowledgements

This research was funded through a research project by the National Plan for Science, Technology, and Innovation (MAARIFAH), King Abdulaziz City for Science and Technology, through the Science & Technology Unit at King Fahd University of Petroleum and Minerals (KFUPM), the Kingdom of Saudi Arabia, award number 12-INF3008-04. Thanks to King Fahd University of Petroleum & Minerals (KFUPM) for computing support.

Author information


Contributions

Not Applicable.

Corresponding author

Correspondence to Ayaz H. Khan.

Ethics declarations

Conflict of interest/Competing interests

The authors have no competing interests to declare that are relevant to the content of this article.

Ethics approval

Not Applicable.

Consent to participate

Not Applicable.

Consent for publication

Not Applicable.

Additional information

Publisher's note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Al-Mouhamed, M., Firdaus, L., Khan, A.H. et al. SpMV and BiCG-Stab sparse solver on Multi-GPUs for reservoir simulation. Multimed Tools Appl 83, 23563–23597 (2024). https://doi.org/10.1007/s11042-023-16185-0
