iBet uBet web content aggregator. Adding the entire web to your favor.
iBet uBet web content aggregator. Adding the entire web to your favor.



Link to original content: https://doi.org/10.1007/s00791-014-0223-x
Matrix-free GPU implementation of a preconditioned conjugate gradient solver for anisotropic elliptic PDEs | Computing and Visualization in Science Skip to main content
Log in

Matrix-free GPU implementation of a preconditioned conjugate gradient solver for anisotropic elliptic PDEs

  • Published:
Computing and Visualization in Science

    We’re sorry, something doesn't seem to be working properly.

    Please try refreshing the page. If that doesn't work, please contact support so we can address the problem.

Abstract

Many problems in geophysical and atmospheric modelling require the fast solution of elliptic partial differential equations (PDEs) in “flat” three dimensional geometries. In particular, an anisotropic elliptic PDE for the pressure correction has to be solved at every time step in the dynamical core of many numerical weather prediction (NWP) models, and equations of a very similar structure arise in global ocean models, subsurface flow simulations and gas and oil reservoir modelling. The elliptic solve is often the bottleneck of the forecast, and to meet operational requirements an algorithmically optimal method has to be used and implemented efficiently. Graphics Processing Units (GPUs) have been shown to be highly efficient (both in terms of absolute performance and power consumption) for a wide range of applications in scientific computing, and recently iterative solvers have been parallelised on these architectures. In this article we describe the GPU implementation and optimisation of a Preconditioned Conjugate Gradient (PCG) algorithm for the solution of a three dimensional anisotropic elliptic PDE for the pressure correction in NWP. Our implementation exploits the strong vertical anisotropy of the elliptic operator in the construction of a suitable preconditioner. As the algorithm is memory bound, performance can be improved significantly by reducing the amount of global memory access. We achieve this by using a matrix-free implementation which does not require explicit storage of the matrix and instead recalculates the local stencil. Global memory access can also be reduced by rewriting the PCG algorithm using loop fusion and we show that this further reduces the runtime on the GPU. We demonstrate the performance of our matrix-free GPU code by comparing it both to a sequential CPU implementation and to a matrix-explicit GPU code which uses existing CUDA libraries. The absolute performance of the algorithm for different problem sizes is quantified in terms of floating point throughput and global memory bandwidth.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Subscribe and save

Springer+ Basic
$34.99 /Month
  • Get 10 units per month
  • Download Article/Chapter or eBook
  • 1 Unit = 1 Article or 1 Chapter
  • Cancel anytime
Subscribe now

Buy Now

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Fig. 1
Fig. 2
Fig. 3
Fig. 4
Fig. 5
Fig. 6
Fig. 7

Similar content being viewed by others

References

  1. Ament, M., Knittel, G., Weiskopf, D., Strasser, W.: A parallel preconditioned conjugate gradient solver for the poisson problem on a multi-GPU platform. In: 18th Euromicro International Conference on Parallel, Distributed and Network-Based Processing (PDP), pp. 583–592 (2010)

  2. Bell, N., Garland, M.: Implementing sparse matrix-vector multiplication on throughputoriented processors. In: Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis, ACM, New York, NY, USA, SC ’09, pp 18:1–18:11 (2009)

  3. Bolz, J., Farmer, I., Grinspun, E., Schröder, P.: Sparse matrix solvers on the GPU: conjugate gradients and multigrid. ACM Trans. Graphics 22, 917–924 (2003)

    Article  Google Scholar 

  4. Börm, S., Hiptmair, R.: Analysis of tensor product multigrid. Numer. Algorithms 26, 200–201 (1999)

    Google Scholar 

  5. Brannick, J., Chen, Y., Hu, X., Zikatanov, L.: Parallel unsmoothed aggregation algebraic multigrid algorithms on GPUs. arXiv preprint arXiv:1302.2547 (2013)

  6. Briggs, W., Henson, V., McCormick, S.: A Multigrid Tutorial. Society for Industrial and Applied Mathematics (2000)

  7. Cantwell, C., Sherwin, S., Kirby, R., Kelly, P.: From h to p efficiently: strategy selection for operator evaluation on hexahedral and tetrahedral elements. Comput. Fluids 43(1):23–28, Symposium on High Accuracy Flow Simulations. Special Issue Dedicated to Prof. Michel Deville (2011)

  8. Carvalho, R., Martins, C., Batalha, R., Camargos, A.: 3D parallel conjugate gradient solver optimized for GPUs. In: 14th Biennial IEEE Conference on Electromagnetic Field Computation (CEFC), 2010 , p. 1 (2010)

  9. Cevahir, A., Nukada, A., Matsuoka, S.: Fast conjugate gradients with multiple GPUs. In: Allen, Gabrielle and Nabrzyski, Jarosław and Seidel, Edward and Albada, GeertDick and Dongarra, Jack and Sloot, Peter M.A (eds) Lecture Notes in Computer Science, vol 5544. Springer, Berlin (2009)

  10. Chronopoulos, A., Gear, C.: s-Step iterative methods for symmetric linear systems. J. Comput. Appl. Math. 25(2), 153–168 (1989)

    Article  MathSciNet  MATH  Google Scholar 

  11. Davies, T., Cullen, M.J.P., Malcolm, A.J., Mawson, M.H., Staniforth, A., White, A.A., Wood, N.: A new dynamical core for the Met Office’s global and regional modelling of the atmosphere. Q. J. R. Meteorol. Soc. 131(608), 1759–1782 (2005)

    Article  Google Scholar 

  12. Davis, T.A., Hu, Y.: The University of Florida sparse matrix collection. ACM Trans. Math. Softw. 38(1), 1:1–1:25 (2011)

    MathSciNet  Google Scholar 

  13. de Jong, M.: Developing a CUDA solver for large sparse matrices for MARIN. Master’s thesis, Delft Institute of Applied Mathematics (2012)

  14. Dehnavi, M., Fernández, D., Giannacopoulos, D.: Finite-element sparse matrix vector multiplication on graphic processing units. IEEE Trans. Magn. 46(8), 2982–2985 (2010)

    Article  Google Scholar 

  15. Dehnavi, M., Fernández, D., Giannacopoulos, D.: Enhancing the performance of conjugate gradient solvers on graphic processing units. IEEE Trans. Magn. 47(5), 1162–1165 (2011)

    Article  Google Scholar 

  16. Fringer, O.B., Gerritsen, M.: An unstructured-grid, finite-volume, nonhydrostatic, parallel coastal ocean simulator. Ocean Model. 14, 139–173 (2006)

    Article  Google Scholar 

  17. Georgescu, S., Okuda, H.: Conjugate gradients on multiple GPUs. Int. J. Numer. Methods Fluids 64(10–12), 1254–1273 (2010)

    Article  MathSciNet  MATH  Google Scholar 

  18. Geveler, M., Ribbrock, D., Göddeke, D., Zajac, P., Turek, S.: Efficient Finite Element Geometric Multigrid Solvers for Unstructured Grids on GPUs. Techn. Univ., Fak. für Mathematik (2011)

  19. Goodnight, N., Woolley, C., Lewin, G., Luebke, D., Humphreys, G.: A multigrid solver for boundary value problems using programmable graphics hardware. In: ACM SIGGRAPH 2005 Courses, ACM, New York, NY, USA, SIGGRAPH ’05 (2005)

  20. Helfenstein, R., Koko, J.: Parallel preconditioned conjugate gradient algorithm on GPU. J. Comput. Appl. Math. 236(15):3584–3590, Proceedings of the Fifteenth International Congress on Computational and Applied Mathematics (ICCAM-2010), Leuven, Belgium, 5–9 July, 2010 (2012)

  21. Hestenes, M.R., Stiefel, E.: Methods of conjugate gradients for solving linear systems. J. Res. Natl. Bur. Stand. 49(6), 409–436 (1952)

    Article  MathSciNet  MATH  Google Scholar 

  22. Knibbe, H., Oosterlee, C., Vuik, C.: GPU implementation of a Helmholtz Krylov solver preconditioned by a shifted Laplace multigrid method. J. Comput. Appl. Math. 236, 281–293 (2011)

    Article  MathSciNet  MATH  Google Scholar 

  23. Kwizak, M., Robert, A.J.: A semi-implicit scheme for grid point atmospheric models of the primitive equations. Mon. Weather Rev. 99, 32 (1971)

    Article  Google Scholar 

  24. Lacroix, S., Vassilevski, Y., Wheeler, J., Wheeler, M.: Iterative solution methods for modeling multiphase flow in porous media fully implicitly. SIAM J. Sci. Comput. 25(3), 905–926 (2003)

    Article  MathSciNet  MATH  Google Scholar 

  25. Li, R., Saad, Y.: GPU-accelerated preconditioned iterative linear solvers. J. Supercomput. 63, 443–466 (2013)

    Article  Google Scholar 

  26. Markall, G.R., Ham, D.A., Kelly, P.H.: Towards generating optimised finite element solvers for GPUs from high-level specifications. Procedia Comput. Sci. 1(1):1815–1823, iCCS 2010 (2010)

  27. Marshall, J., Adcroft, A., Hill, C., Perelman, L., Heisey, C.: A finite-volume, incompressible Navier Stokes model for studies of the ocean on parallel computers. J. Geophys. Res. 102, 5753–5766 (1997)

    Article  Google Scholar 

  28. Melvin, T., Dubal, M., Wood, N., Staniforth, A., Zerroukat, M.: An inherently mass-conserving iterative semi-implicit semi-Lagrangian discretization of the non-hydrostatic vertical-slice equations. Q. J. R. Meteorol. Soc. 136(648), 799–814 (2010)

    Google Scholar 

  29. Menon, S., Perot, J.: Implementation of an efficient conjugate gradient algorithm for Poisson solutions on graphics processors. In: Proceedings of the 2007 Meeting of the Canadian CFD Society, Toronto Canada (2007)

  30. Michels, D.: Sparse-matrix-cg-solver in cuda. In: Proceedings of the 15th Central European Seminar on Computer Graphics (2011)

  31. Müller, E., Scheichl, R.: Massively parallel solvers for elliptic PDEs in numerical weather- and climate prediction. accepted for publication in Q. J. R. Meteorol. Soc. (2014)

  32. nVidia Corporation (2009) Fermi architecture whitepaper. http://www.nvidia.co.uk/content/PDF/fermi_white_papers/NVIDIA_Fermi_Compute_Architecture_Whitepaper.pdf. Accessed 9 Feb 2013

  33. nVidia Corporation (2012) CUDA programming guide. http://docs.nvidia.com/cuda/index.html. Accessed 9 Feb 2013

  34. nVidia Corporation (2013) CuSPARSE Library. https://developer.nvidia.com/cusparse

  35. Piotrowski, Z.P., Wyszogrodzki, A.A., Smolarkiewicz, P.K.: Towards petascale simulation of atmospheric circulations with soundproof equations. Acta Geophys. 59(6), 1294–1311 (2011)

  36. Press, W.H., Teukolsky, S.A., Vetterling, W.T., Flannery, B.P.: Numerical Recipes 3rd Edition: The Art of Scientific Computing, 3rd edn. Cambridge University Press, New York, NY (2007)

    Google Scholar 

  37. Reguly, I., Giles, M.: Efficient sparse matrix-vector multiplication on cache-based GPUs. In: Innovative Parallel Computing (InPar), 2012, pp. 1–12 (2012)

  38. Robert, A.: A stable numerical integration scheme for the primitive meteorological equations. Atmosphere-Ocean 19(1), 35–46 (1981)

  39. Saad, Y.: Iterative Methods for Sparse Linear Systems, 2nd edn. Society for Industrial and Applied Mathematics (2003)

  40. Skamarock, W.C., Smolarkiewicz, P.K., Klemp, J.B.: Preconditioned conjugate-residual solvers for Helmholtz equations in nonhydrostatic models. Mon. Weather Rev. 125(4), 587–599 (1997)

    Article  Google Scholar 

  41. Smolarkiewicz, P.K., Margolin, L.G.: Variational solver for elliptic problems in atmospheric flows. Appl. Math. Comput. Sci. 4, 101–125 (1994)

  42. Smolarkiewicz, P.K., Margolin, L.G.: On forward-in-time differencing in fluids: an Eulerian/semi-Lagrangian nonhydrostatic model for stratified flows. Atmosphere-Ocean 45(1), 127–152 (1997)

    Article  Google Scholar 

  43. Thomas, S.J., Malevsky, A.V., Desgagne, M., Benoit, R., Pellerin, P., Valin, M.: Massively parallel implementation of the mesoscale compressible community model. Span pp. 1–19 (1997)

  44. Toselli, A., Widlund, O.: Domain Decomposition Methods: Algorithms and Theory, Springer Series in Computational Mathematics. Springer, Berlin (2005)

    Google Scholar 

  45. Trottenberg, U., Oosterlee, C.W., Schüller, A.: Multigrid. Academic Press, London (2001)

  46. Verschoor, M., Jalba, A.C.: Analysis and performance estimation of the Conjugate Gradient method on multiple GPUs. Parallel. Comput. 38(10–11), 552–575 (2012)

    Article  MathSciNet  Google Scholar 

  47. Wood, N., Staniforth, A., White, A., Allen, T., Diamantakis, M., Gross, M., Melvin, T., Smith, C., Vosper, S., Zerroukat, M., Thuburn, J.: An inherently mass-conserving semi-implicit semi-Lagrangian discretisation of the deep-atmosphere global nonhydrostatic equations. Q. J. R. Meteorol. Soc. 140(682), 1505–1520 (2013)

  48. Zhang Xianyi, Z.C., Wang, Q.: OpenBLAS. http://xianyi.github.com/OpenBLAS/. Accessed 9 Feb 2013 (2012)

Download references

Acknowledgments

We would like to thank all members of the GungHo! project and in particular Chris Maynard and David Ham for useful and inspiring discussions. The numerical experiments in this work were carried out on a node of the aquila supercomputer at the University of Bath and we are grateful to Steven Chapman for his continuous and tireless technical support which was essential for the success of this project. The contribution of EM and RS was funded as part of the NERC project on “Next Generation Weather and Climate Prediction” (NGWCP), Grant no. NE/J005576/1.

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Eike Müller.

Additional information

Communicated by Gabriel Wittum.

Appendix: A interleaved PCG kernels

Appendix: A interleaved PCG kernels

The kernels for the interleaved PCG algorithm described in Sect. 4.3 are shown in Algorithms 2 and 3. Using similar notation as in Sect. 3.1, the index pair \((i',j')\) runs over all immediate neighbours of a horizontal cell \((i,j)\), i.e. \((i',j')\in \mathcal {N}(i,j) = \{(i-1,j),(i+1,j),(i,j-1),(i,j+1)\}\), and we define \(\alpha _{ij}=\sum _{(i',j')\in \mathcal {N}(i,j)}\alpha _{i'j'}\). The coefficients \(a_k'\), \(b_k'\) and \(c_k'\) are defined in Eq. (17).

In the GPU implementation each thread calculates dot products and norms in one column. To obtain global sums, these need to be reduced with an additional BLAS operation. However, as this operation only operates on two-dimensional (horizontal) vectors, its cost is very small (\({<}1\,\%\) of the time per iteration).

figure b
figure c

Rights and permissions

Reprints and permissions

About this article

Check for updates. Verify currency and authenticity via CrossMark

Cite this article

Müller, E., Guo, X., Scheichl, R. et al. Matrix-free GPU implementation of a preconditioned conjugate gradient solver for anisotropic elliptic PDEs. Comput. Visual Sci. 16, 41–58 (2013). https://doi.org/10.1007/s00791-014-0223-x

Download citation

  • Received:

  • Accepted:

  • Published:

  • Issue Date:

  • DOI: https://doi.org/10.1007/s00791-014-0223-x

Mathematics Subject Classification

Navigation