Abstract
Many problems in geophysical and atmospheric modelling require the fast solution of elliptic partial differential equations (PDEs) in “flat” three dimensional geometries. In particular, an anisotropic elliptic PDE for the pressure correction has to be solved at every time step in the dynamical core of many numerical weather prediction (NWP) models, and equations of a very similar structure arise in global ocean models, subsurface flow simulations and gas and oil reservoir modelling. The elliptic solve is often the bottleneck of the forecast, and to meet operational requirements an algorithmically optimal method has to be used and implemented efficiently. Graphics Processing Units (GPUs) have been shown to be highly efficient (both in terms of absolute performance and power consumption) for a wide range of applications in scientific computing, and recently iterative solvers have been parallelised on these architectures. In this article we describe the GPU implementation and optimisation of a Preconditioned Conjugate Gradient (PCG) algorithm for the solution of a three dimensional anisotropic elliptic PDE for the pressure correction in NWP. Our implementation exploits the strong vertical anisotropy of the elliptic operator in the construction of a suitable preconditioner. As the algorithm is memory bound, performance can be improved significantly by reducing the amount of global memory access. We achieve this by using a matrix-free implementation which does not require explicit storage of the matrix and instead recalculates the local stencil. Global memory access can also be reduced by rewriting the PCG algorithm using loop fusion and we show that this further reduces the runtime on the GPU. We demonstrate the performance of our matrix-free GPU code by comparing it both to a sequential CPU implementation and to a matrix-explicit GPU code which uses existing CUDA libraries. The absolute performance of the algorithm for different problem sizes is quantified in terms of floating point throughput and global memory bandwidth.
Similar content being viewed by others
References
Ament, M., Knittel, G., Weiskopf, D., Strasser, W.: A parallel preconditioned conjugate gradient solver for the poisson problem on a multi-GPU platform. In: 18th Euromicro International Conference on Parallel, Distributed and Network-Based Processing (PDP), pp. 583–592 (2010)
Bell, N., Garland, M.: Implementing sparse matrix-vector multiplication on throughputoriented processors. In: Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis, ACM, New York, NY, USA, SC ’09, pp 18:1–18:11 (2009)
Bolz, J., Farmer, I., Grinspun, E., Schröder, P.: Sparse matrix solvers on the GPU: conjugate gradients and multigrid. ACM Trans. Graphics 22, 917–924 (2003)
Börm, S., Hiptmair, R.: Analysis of tensor product multigrid. Numer. Algorithms 26, 200–201 (1999)
Brannick, J., Chen, Y., Hu, X., Zikatanov, L.: Parallel unsmoothed aggregation algebraic multigrid algorithms on GPUs. arXiv preprint arXiv:1302.2547 (2013)
Briggs, W., Henson, V., McCormick, S.: A Multigrid Tutorial. Society for Industrial and Applied Mathematics (2000)
Cantwell, C., Sherwin, S., Kirby, R., Kelly, P.: From h to p efficiently: strategy selection for operator evaluation on hexahedral and tetrahedral elements. Comput. Fluids 43(1):23–28, Symposium on High Accuracy Flow Simulations. Special Issue Dedicated to Prof. Michel Deville (2011)
Carvalho, R., Martins, C., Batalha, R., Camargos, A.: 3D parallel conjugate gradient solver optimized for GPUs. In: 14th Biennial IEEE Conference on Electromagnetic Field Computation (CEFC), 2010 , p. 1 (2010)
Cevahir, A., Nukada, A., Matsuoka, S.: Fast conjugate gradients with multiple GPUs. In: Allen, Gabrielle and Nabrzyski, Jarosław and Seidel, Edward and Albada, GeertDick and Dongarra, Jack and Sloot, Peter M.A (eds) Lecture Notes in Computer Science, vol 5544. Springer, Berlin (2009)
Chronopoulos, A., Gear, C.: s-Step iterative methods for symmetric linear systems. J. Comput. Appl. Math. 25(2), 153–168 (1989)
Davies, T., Cullen, M.J.P., Malcolm, A.J., Mawson, M.H., Staniforth, A., White, A.A., Wood, N.: A new dynamical core for the Met Office’s global and regional modelling of the atmosphere. Q. J. R. Meteorol. Soc. 131(608), 1759–1782 (2005)
Davis, T.A., Hu, Y.: The University of Florida sparse matrix collection. ACM Trans. Math. Softw. 38(1), 1:1–1:25 (2011)
de Jong, M.: Developing a CUDA solver for large sparse matrices for MARIN. Master’s thesis, Delft Institute of Applied Mathematics (2012)
Dehnavi, M., Fernández, D., Giannacopoulos, D.: Finite-element sparse matrix vector multiplication on graphic processing units. IEEE Trans. Magn. 46(8), 2982–2985 (2010)
Dehnavi, M., Fernández, D., Giannacopoulos, D.: Enhancing the performance of conjugate gradient solvers on graphic processing units. IEEE Trans. Magn. 47(5), 1162–1165 (2011)
Fringer, O.B., Gerritsen, M.: An unstructured-grid, finite-volume, nonhydrostatic, parallel coastal ocean simulator. Ocean Model. 14, 139–173 (2006)
Georgescu, S., Okuda, H.: Conjugate gradients on multiple GPUs. Int. J. Numer. Methods Fluids 64(10–12), 1254–1273 (2010)
Geveler, M., Ribbrock, D., Göddeke, D., Zajac, P., Turek, S.: Efficient Finite Element Geometric Multigrid Solvers for Unstructured Grids on GPUs. Techn. Univ., Fak. für Mathematik (2011)
Goodnight, N., Woolley, C., Lewin, G., Luebke, D., Humphreys, G.: A multigrid solver for boundary value problems using programmable graphics hardware. In: ACM SIGGRAPH 2005 Courses, ACM, New York, NY, USA, SIGGRAPH ’05 (2005)
Helfenstein, R., Koko, J.: Parallel preconditioned conjugate gradient algorithm on GPU. J. Comput. Appl. Math. 236(15):3584–3590, Proceedings of the Fifteenth International Congress on Computational and Applied Mathematics (ICCAM-2010), Leuven, Belgium, 5–9 July, 2010 (2012)
Hestenes, M.R., Stiefel, E.: Methods of conjugate gradients for solving linear systems. J. Res. Natl. Bur. Stand. 49(6), 409–436 (1952)
Knibbe, H., Oosterlee, C., Vuik, C.: GPU implementation of a Helmholtz Krylov solver preconditioned by a shifted Laplace multigrid method. J. Comput. Appl. Math. 236, 281–293 (2011)
Kwizak, M., Robert, A.J.: A semi-implicit scheme for grid point atmospheric models of the primitive equations. Mon. Weather Rev. 99, 32 (1971)
Lacroix, S., Vassilevski, Y., Wheeler, J., Wheeler, M.: Iterative solution methods for modeling multiphase flow in porous media fully implicitly. SIAM J. Sci. Comput. 25(3), 905–926 (2003)
Li, R., Saad, Y.: GPU-accelerated preconditioned iterative linear solvers. J. Supercomput. 63, 443–466 (2013)
Markall, G.R., Ham, D.A., Kelly, P.H.: Towards generating optimised finite element solvers for GPUs from high-level specifications. Procedia Comput. Sci. 1(1):1815–1823, iCCS 2010 (2010)
Marshall, J., Adcroft, A., Hill, C., Perelman, L., Heisey, C.: A finite-volume, incompressible Navier Stokes model for studies of the ocean on parallel computers. J. Geophys. Res. 102, 5753–5766 (1997)
Melvin, T., Dubal, M., Wood, N., Staniforth, A., Zerroukat, M.: An inherently mass-conserving iterative semi-implicit semi-Lagrangian discretization of the non-hydrostatic vertical-slice equations. Q. J. R. Meteorol. Soc. 136(648), 799–814 (2010)
Menon, S., Perot, J.: Implementation of an efficient conjugate gradient algorithm for Poisson solutions on graphics processors. In: Proceedings of the 2007 Meeting of the Canadian CFD Society, Toronto Canada (2007)
Michels, D.: Sparse-matrix-cg-solver in cuda. In: Proceedings of the 15th Central European Seminar on Computer Graphics (2011)
Müller, E., Scheichl, R.: Massively parallel solvers for elliptic PDEs in numerical weather- and climate prediction. accepted for publication in Q. J. R. Meteorol. Soc. (2014)
nVidia Corporation (2009) Fermi architecture whitepaper. http://www.nvidia.co.uk/content/PDF/fermi_white_papers/NVIDIA_Fermi_Compute_Architecture_Whitepaper.pdf. Accessed 9 Feb 2013
nVidia Corporation (2012) CUDA programming guide. http://docs.nvidia.com/cuda/index.html. Accessed 9 Feb 2013
nVidia Corporation (2013) CuSPARSE Library. https://developer.nvidia.com/cusparse
Piotrowski, Z.P., Wyszogrodzki, A.A., Smolarkiewicz, P.K.: Towards petascale simulation of atmospheric circulations with soundproof equations. Acta Geophys. 59(6), 1294–1311 (2011)
Press, W.H., Teukolsky, S.A., Vetterling, W.T., Flannery, B.P.: Numerical Recipes 3rd Edition: The Art of Scientific Computing, 3rd edn. Cambridge University Press, New York, NY (2007)
Reguly, I., Giles, M.: Efficient sparse matrix-vector multiplication on cache-based GPUs. In: Innovative Parallel Computing (InPar), 2012, pp. 1–12 (2012)
Robert, A.: A stable numerical integration scheme for the primitive meteorological equations. Atmosphere-Ocean 19(1), 35–46 (1981)
Saad, Y.: Iterative Methods for Sparse Linear Systems, 2nd edn. Society for Industrial and Applied Mathematics (2003)
Skamarock, W.C., Smolarkiewicz, P.K., Klemp, J.B.: Preconditioned conjugate-residual solvers for Helmholtz equations in nonhydrostatic models. Mon. Weather Rev. 125(4), 587–599 (1997)
Smolarkiewicz, P.K., Margolin, L.G.: Variational solver for elliptic problems in atmospheric flows. Appl. Math. Comput. Sci. 4, 101–125 (1994)
Smolarkiewicz, P.K., Margolin, L.G.: On forward-in-time differencing in fluids: an Eulerian/semi-Lagrangian nonhydrostatic model for stratified flows. Atmosphere-Ocean 45(1), 127–152 (1997)
Thomas, S.J., Malevsky, A.V., Desgagne, M., Benoit, R., Pellerin, P., Valin, M.: Massively parallel implementation of the mesoscale compressible community model. Span pp. 1–19 (1997)
Toselli, A., Widlund, O.: Domain Decomposition Methods: Algorithms and Theory, Springer Series in Computational Mathematics. Springer, Berlin (2005)
Trottenberg, U., Oosterlee, C.W., Schüller, A.: Multigrid. Academic Press, London (2001)
Verschoor, M., Jalba, A.C.: Analysis and performance estimation of the Conjugate Gradient method on multiple GPUs. Parallel. Comput. 38(10–11), 552–575 (2012)
Wood, N., Staniforth, A., White, A., Allen, T., Diamantakis, M., Gross, M., Melvin, T., Smith, C., Vosper, S., Zerroukat, M., Thuburn, J.: An inherently mass-conserving semi-implicit semi-Lagrangian discretisation of the deep-atmosphere global nonhydrostatic equations. Q. J. R. Meteorol. Soc. 140(682), 1505–1520 (2013)
Zhang Xianyi, Z.C., Wang, Q.: OpenBLAS. http://xianyi.github.com/OpenBLAS/. Accessed 9 Feb 2013 (2012)
Acknowledgments
We would like to thank all members of the GungHo! project and in particular Chris Maynard and David Ham for useful and inspiring discussions. The numerical experiments in this work were carried out on a node of the aquila supercomputer at the University of Bath and we are grateful to Steven Chapman for his continuous and tireless technical support which was essential for the success of this project. The contribution of EM and RS was funded as part of the NERC project on “Next Generation Weather and Climate Prediction” (NGWCP), Grant no. NE/J005576/1.
Author information
Authors and Affiliations
Corresponding author
Additional information
Communicated by Gabriel Wittum.
Appendix: A interleaved PCG kernels
Appendix: A interleaved PCG kernels
The kernels for the interleaved PCG algorithm described in Sect. 4.3 are shown in Algorithms 2 and 3. Using similar notation as in Sect. 3.1, the index pair \((i',j')\) runs over all immediate neighbours of a horizontal cell \((i,j)\), i.e. \((i',j')\in \mathcal {N}(i,j) = \{(i-1,j),(i+1,j),(i,j-1),(i,j+1)\}\), and we define \(\alpha _{ij}=\sum _{(i',j')\in \mathcal {N}(i,j)}\alpha _{i'j'}\). The coefficients \(a_k'\), \(b_k'\) and \(c_k'\) are defined in Eq. (17).
In the GPU implementation each thread calculates dot products and norms in one column. To obtain global sums, these need to be reduced with an additional BLAS operation. However, as this operation only operates on two-dimensional (horizontal) vectors, its cost is very small (\({<}1\,\%\) of the time per iteration).
Rights and permissions
About this article
Cite this article
Müller, E., Guo, X., Scheichl, R. et al. Matrix-free GPU implementation of a preconditioned conjugate gradient solver for anisotropic elliptic PDEs. Comput. Visual Sci. 16, 41–58 (2013). https://doi.org/10.1007/s00791-014-0223-x
Received:
Accepted:
Published:
Issue Date:
DOI: https://doi.org/10.1007/s00791-014-0223-x