Abstract
We present new performance models and more compact data structures for cache blocking when applied to sparse matrix-vector multiply (SpM × V). We extend our prior models by relaxing the assumption that the vectors fit in cache and find that the new models are accurate enough to predict optimum block sizes. In addition, we determine criteria that predict when cache blocking improves performance. We conclude with architectural suggestions that would make memory systems execute SpM × V faster.
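To make the kernel concrete, the following is a minimal C sketch of cache-blocked SpM × V under the common assumption that the matrix has been pre-split into row-by-column cache blocks, each stored in its own CSR structure; the names (csr_block, spmv_cache_blocked) are illustrative only and are not the authors' data structure or the OSKI interface.

/*
 * Minimal sketch of cache-blocked SpMV (y += A*x), assuming A has been
 * pre-split into rectangular cache blocks, each stored in CSR form.
 * Names here (csr_block, spmv_cache_blocked) are hypothetical, not the
 * paper's actual data structure or the OSKI API.
 */
typedef struct {
    int nrows;        /* rows in this block                       */
    int row_start;    /* first row of the block in the full A     */
    int col_start;    /* first column of the block in the full A  */
    int *ptr;         /* CSR row pointers, length nrows+1         */
    int *ind;         /* column indices, local to the block       */
    double *val;      /* nonzero values                           */
} csr_block;

/* y += A*x, with A given as an array of cache blocks.  Each block touches
 * only a contiguous window of x and y, so those windows can stay resident
 * in cache while the block's nonzeros stream through memory once. */
void spmv_cache_blocked(const csr_block *blocks, int nblocks,
                        const double *x, double *y)
{
    for (int b = 0; b < nblocks; ++b) {
        const csr_block *blk = &blocks[b];
        const double *xb = x + blk->col_start;  /* x window reused by block  */
        double *yb = y + blk->row_start;        /* y window updated by block */
        for (int i = 0; i < blk->nrows; ++i) {
            double sum = yb[i];
            for (int k = blk->ptr[i]; k < blk->ptr[i + 1]; ++k)
                sum += blk->val[k] * xb[blk->ind[k]];
            yb[i] = sum;
        }
    }
}

The block dimensions are the tuning parameters: they should be chosen so that the x and y windows of one block fit in cache, which is precisely the choice the paper's performance models aim to predict.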
Cite this article
Nishtala, R., Vuduc, R.W., Demmel, J.W. et al. When cache blocking of sparse matrix vector multiply works and why. AAECC 18, 297–311 (2007). https://doi.org/10.1007/s00200-007-0038-9