Abstract
The abundant data parallelism of many-core GPUs has become a key resource for improving the accuracy of scientific and engineering simulations. In many cases, most of the simulation time is spent in the linear solver, whose cost is dominated by sparse matrix–vector multiplication. In forward petroleum oil and gas reservoir simulation, applying a stencil relationship to a structured grid leads to a family of generalized hepta-diagonal solver matrices with notable regularity and structural uniqueness. We present a customized storage scheme that exploits the generalized hepta-diagonal sparsity pattern and stencil regularity to optimize both storage and matrix–vector computation. We also present an in-kernel optimization for implementing sparse matrix–vector multiply (SpMV) and the biconjugate gradient stabilized (BiCG-Stab) solver. The in-kernel approach avoids the multiple kernel invocations associated with using numerical library operators. To remain in-kernel, a lock-free inter-block synchronization is used in which thread blocks that finish early are assigned independent computations so they do not repeatedly poll global memory. Further optimizations combine reductions and collective write operations to memory. The in-kernel optimization is particularly beneficial for the iterative structure of BiCG-Stab: it preserves vector data locality and avoids storing vector data back to memory and reloading it on each kernel exit and re-entry. The evaluation uses generalized hepta-diagonal matrices derived from a range of structured grids in forward reservoir simulation. Results show the advantage of the proposed generalized hepta-diagonal custom storage scheme over standard library formats such as compressed sparse row, hybrid sparse, and diagonal formats. With the proposed optimizations, SpMV and BiCG-Stab are noticeably faster than implementations that exit and re-enter multiple kernels when the solver is built by invoking numerical library operators.
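To illustrate the kind of structure the abstract refers to, the following is a minimal, hypothetical sketch of a DIA-style seven-diagonal storage scheme for a 7-point stencil on an nx × ny × nz structured grid. The paper's actual custom scheme is not reproduced here; the diagonal offsets, lexicographic ordering, and zero-padding convention below are illustrative assumptions, and the serial loop stands in for what the paper implements as a GPU kernel.

```python
# Hypothetical DIA-style storage for a generalized hepta-diagonal matrix
# arising from a 7-point stencil on a structured grid (illustrative only;
# the paper's custom scheme and GPU kernel are not reproduced here).

def hepta_offsets(nx, ny):
    # Neighbor offsets of the 7-point stencil under lexicographic
    # (x-fastest) ordering of grid cells: the cell itself, its two
    # x-neighbors, two y-neighbors, and two z-neighbors.
    return [-nx * ny, -nx, -1, 0, 1, nx, nx * ny]

def hepta_spmv(diags, offsets, x):
    """Compute y = A @ x where A is stored as seven dense diagonals.

    diags[d][i] holds A[i][i + offsets[d]]; entries whose column index
    falls outside [0, n) are ignored, matching DIA-format padding.
    """
    n = len(x)
    y = [0.0] * n
    for d, off in enumerate(offsets):
        for i in range(n):
            j = i + off
            if 0 <= j < n:
                y[i] += diags[d][i] * x[j]
    return y
```

Storing only the seven diagonals replaces the row-pointer and column-index arrays of a general format such as CSR with a fixed set of offsets, which is the kind of saving in both storage and index arithmetic that a stencil-aware scheme can exploit.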
Acknowledgements
This project was funded by the National Plan for Science, Technology, and Innovation (MAARIFAH), King Abdulaziz City for Science and Technology, through the Science & Technology Unit at King Fahd University of Petroleum and Minerals (KFUPM), Kingdom of Saudi Arabia, award number 12-INF3008-04. The authors thank KFUPM for computing support.
Al-Mouhamed, M.A., Khan, A.H. SpMV and BiCG-Stab optimization for a class of hepta-diagonal-sparse matrices on GPU. J Supercomput 73, 3761–3795 (2017). https://doi.org/10.1007/s11227-017-1972-3