Performance Modeling and Analysis of Cache Blocking in Sparse Matrix Vector Multiply

Rajesh Nishtala and Richard W. Vuduc and James W. Demmel and Katherine A. Yelick

EECS Department, University of California, Berkeley

Technical Report No. UCB/CSD-04-1335

, 2004

We consider the problem of building high-performance implementations of sparse matrix-vector multiply (SpM x V), or y = y + A * x, which is an important and ubiquitous computational kernel. Prior work indicates that cache blocking of SpM x V is extremely important for some matrix and machine combinations, with speedups as high as 3x. In this paper we present a new, more compact data structure for cache blocking for SpM x V and look at the general question of when and why performance improves. Cache blocking appears to be most effective when simultaneously 1) the vector x does not fit in cache 2) the vector y fits in cache 3) the non zeros are distributed throughout the matrix and 4) the non zero density is sufficiently high. In particular we find that cache blocking does not help with band matrices no matter how large x and y are since the matrix structure already lends itself to the optimal access pattern.

Prior work on performance modeling assumed that the matrices were small enough so that x and y fit in the cache. However when this is not the case, the optimal block sizes picked by these models may have poor performance motivating us to update these performance models. In contrast, the optimum block sizes predicted by the new performance models generally match the measured optimum block sizes and therefore the models can be used as a basis for a heuristic to pick the block size.

We conclude with architectural suggestions that would make processor and memory systems more amenable to SpM x V.

BibTeX citation:

@techreport{Nishtala:CSD-04-1335,
    Author= {Nishtala, Rajesh and Vuduc, Richard W. and Demmel, James W. and Yelick, Katherine A.},
    Title= {Performance Modeling and Analysis of Cache Blocking in Sparse Matrix Vector Multiply},
    Year= {2004},
    Url= {http://www2.eecs.berkeley.edu/Pubs/TechRpts/2004/5535.html},
    Number= {UCB/CSD-04-1335},
    Abstract= {We consider the problem of building high-performance implementations of sparse matrix-vector multiply (SpM x V), or <i>y</i> = <i>y</i> + <i>A</i> * x, which is an important and ubiquitous computational kernel. Prior work indicates that cache blocking of SpM x V is extremely important for some matrix and machine combinations, with speedups as high as 3x. In this paper we present a new, more compact data structure for cache blocking for SpM x V and look at the general question of when and why performance improves. Cache blocking appears to be most effective when simultaneously 1) the vector <i>x</i> does not fit in cache 2) the vector <i>y</i> fits in cache 3) the non zeros are distributed throughout the matrix and 4) the non zero density is sufficiently high. In particular we find that cache blocking does not help with band matrices no matter how large <i>x</i> and <i>y</i> are since the matrix structure already lends itself to the optimal access pattern. <p>Prior work on performance modeling assumed that the matrices were small enough so that <i>x</i> and <i>y</i> fit in the cache. However when this is not the case, the optimal block sizes picked by these models may have poor performance motivating us to update these performance models. In contrast, the optimum block sizes predicted by the new performance models generally match the measured optimum block sizes and therefore the models can be used as a basis for a heuristic to pick the block size. <p>We conclude with architectural suggestions that would make processor and memory systems more amenable to SpM x V.},
}

EndNote citation:

%0 Report
%A Nishtala, Rajesh 
%A Vuduc, Richard W. 
%A Demmel, James W. 
%A Yelick, Katherine A. 
%T Performance Modeling and Analysis of Cache Blocking in Sparse Matrix Vector Multiply
%I EECS Department, University of California, Berkeley
%D 2004
%@ UCB/CSD-04-1335
%U http://www2.eecs.berkeley.edu/Pubs/TechRpts/2004/5535.html
%F Nishtala:CSD-04-1335