Avoiding Communication in Computing Krylov Subspaces
James Demmel and Mark Frederick Hoemmen and Marghoob Mohiyuddin and Katherine A. Yelick
EECS Department, University of California, Berkeley
Technical Report No. UCB/EECS-2007-123
October 9, 2007
http://www2.eecs.berkeley.edu/Pubs/TechRpts/2007/EECS-2007-123.pdf
Our goal is to minimize the communication costs of Krylov Subspace Methods (KSMs) to solve either Ax=b or Ax = lambda x, when A is a large sparse matrix. By communication costs we mean both bandwidth and latency costs, either between processors on a parallel computer, or between levels of a memory hierarchy on a sequential computer. As the cost of communication is growing exponentially relative to computation on modern computers, reducing communication is becoming ever more important. It is possible to reduce communication costs to near their theoretical minima: On a parallel computer this means latency costs are independent of the dimension k of the Krylov subspace, as opposed to growing proportionally to k for a conventional implementation. On a sequential computer, this mean latency and bandwidth costs are independent of k. Achieving this speedup requires a new algorithmic formulation of KSMs.
In this paper we present just part of this new formulation, namely computing a basis of the Krylov subspace spanned by [x,Ax, A^2x, ... A^kx]; other papers will present the rest of the KSM formulation. We present theory, performance models, and computational results: Our parallel performance models of 2D and 3D model problems predict speedups over a conventional algorithm of up to 15x on a model Petaflop machine, and up to 22x on a model of the Grid. Our sequential performance models of the same model problems predict speedups over a conventional algorithm of up to 10x on an out-of-core implementation, and up to 2.5x on Intel Clovertown, where we use our ideas to reduce off-chip latency and bandwidth to DRAM. Finally, we measured a speedup of over 3x on an actual out-of-core implementation.
We also consider the related kernel [Ax,MAx,AMAx,...,A(MA)^{k-1}x,(MA)^kx], which arises in preconditioned KSMs. Under certain mathematical conditions on A and the preconditioner $M$, we show how to avoid latency and bandwidth for this kernel as well.
BibTeX citation:
@techreport{Demmel:EECS-2007-123, Author= {Demmel, James and Hoemmen, Mark Frederick and Mohiyuddin, Marghoob and Yelick, Katherine A.}, Title= {Avoiding Communication in Computing Krylov Subspaces}, Year= {2007}, Month= {Oct}, Url= {http://www2.eecs.berkeley.edu/Pubs/TechRpts/2007/EECS-2007-123.html}, Number= {UCB/EECS-2007-123}, Abstract= {Our goal is to minimize the communication costs of Krylov Subspace Methods (KSMs) to solve either Ax=b or Ax = lambda x, when A is a large sparse matrix. By communication costs we mean both bandwidth and latency costs, either between processors on a parallel computer, or between levels of a memory hierarchy on a sequential computer. As the cost of communication is growing exponentially relative to computation on modern computers, reducing communication is becoming ever more important. It is possible to reduce communication costs to near their theoretical minima: On a parallel computer this means latency costs are independent of the dimension k of the Krylov subspace, as opposed to growing proportionally to k for a conventional implementation. On a sequential computer, this mean latency and bandwidth costs are independent of k. Achieving this speedup requires a new algorithmic formulation of KSMs. In this paper we present just part of this new formulation, namely computing a basis of the Krylov subspace spanned by [x,Ax, A^2x, ... A^kx]; other papers will present the rest of the KSM formulation. We present theory, performance models, and computational results: Our parallel performance models of 2D and 3D model problems predict speedups over a conventional algorithm of up to 15x on a model Petaflop machine, and up to 22x on a model of the Grid. Our sequential performance models of the same model problems predict speedups over a conventional algorithm of up to 10x on an out-of-core implementation, and up to 2.5x on Intel Clovertown, where we use our ideas to reduce off-chip latency and bandwidth to DRAM. Finally, we measured a speedup of over 3x on an actual out-of-core implementation. We also consider the related kernel [Ax,MAx,AMAx,...,A(MA)^{k-1}x,(MA)^kx], which arises in preconditioned KSMs. Under certain mathematical conditions on A and the preconditioner $M$, we show how to avoid latency and bandwidth for this kernel as well.}, }
EndNote citation:
%0 Report %A Demmel, James %A Hoemmen, Mark Frederick %A Mohiyuddin, Marghoob %A Yelick, Katherine A. %T Avoiding Communication in Computing Krylov Subspaces %I EECS Department, University of California, Berkeley %D 2007 %8 October 9 %@ UCB/EECS-2007-123 %U http://www2.eecs.berkeley.edu/Pubs/TechRpts/2007/EECS-2007-123.html %F Demmel:EECS-2007-123