Communication-Avoiding QR Decomposition for GPUs
Michael Anderson, Grey Ballard, James Demmel, and Kurt Keutzer
EECS Department, University of California, Berkeley
Technical Report No. UCB/EECS-2010-131
October 8, 2010
http://www2.eecs.berkeley.edu/Pubs/TechRpts/2010/EECS-2010-131.pdf
We describe an implementation of the Communication-Avoiding QR (CAQR) factorization that runs entirely on a single graphics processor (GPU). We show that the reduction in memory traffic provided by CAQR allows us to outperform existing parallel GPU implementations of QR for a large class of tall-skinny matrices. Other GPU implementations of QR handle panel factorizations by either sending the work to a general-purpose processor or using entirely bandwidth-bound operations, incurring data transfer overheads. In contrast, our QR is done entirely on the GPU using compute-bound kernels, meaning performance is good regardless of the width of the matrix. As a result, we outperform CULA, a parallel linear algebra library for GPUs, by up to 13x for tall-skinny matrices.
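To make the idea concrete, the following is a minimal NumPy sketch of the TSQR reduction that underlies CAQR: factor a tall-skinny matrix block row by block row with independent local QRs, then combine the small R factors with one more QR. The function name, the block size, and the flat two-level combine are illustrative assumptions; the paper's version runs as compute-bound CUDA kernels on the GPU, not as NumPy calls.

import numpy as np

def tsqr(A, block_rows=256):
    """Return Q (m x n) and R (n x n) with A = Q @ R, for tall-skinny A."""
    m, n = A.shape
    # Assumes every block has at least n rows (true here since 256 >= 32).
    blocks = [A[i:i + block_rows] for i in range(0, m, block_rows)]

    # Local QR of each block row: independent, compute-bound work.
    Qs, Rs = zip(*(np.linalg.qr(b) for b in blocks))

    # Combine the stacked n x n R factors with a single additional QR
    # (a flat stand-in for the paper's reduction tree).
    Q2, R = np.linalg.qr(np.vstack(Rs))

    # Fold the combine step back into the local Q factors.
    Q = np.vstack([Qs[k] @ Q2[k * n:(k + 1) * n] for k in range(len(blocks))])
    return Q, R

# Quick check on a random tall-skinny matrix.
A = np.random.randn(4096, 32)
Q, R = tsqr(A)
assert np.allclose(Q @ R, A)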
We also discuss stationary video background subtraction as a motivating application. We apply a recent statistical approach that requires repeatedly computing the singular value decomposition of a tall-skinny matrix. Using CAQR as the first step in computing the singular value decomposition, we obtain the answer 3x faster than with a traditional bandwidth-bound GPU QR factorization tuned specifically for that matrix size, and 34x faster than with Intel's Math Kernel Library (MKL) QR factorization.
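The QR-first route to the SVD of a tall-skinny matrix is standard: factor A = QR, take the SVD of the small n x n factor R, and recover the left singular vectors from Q. The sketch below shows only that linear algebra in NumPy under assumed sizes; it is not the paper's GPU pipeline, where the QR step is CAQR.

import numpy as np

def svd_via_qr(A):
    """Thin SVD of a tall-skinny A: returns U (m x n), s (n,), Vt (n x n)."""
    Q, R = np.linalg.qr(A)        # in the paper, this step is CAQR on the GPU
    Ur, s, Vt = np.linalg.svd(R)  # SVD of the small n x n factor is cheap
    return Q @ Ur, s, Vt          # U = Q @ Ur gives the left singular vectors

# Example shapes only: e.g. video frames stacked as columns of a tall-skinny matrix.
A = np.random.randn(100_000, 50)
U, s, Vt = svd_via_qr(A)
assert np.allclose(U * s @ Vt, A)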
BibTeX citation:
@techreport{Anderson:EECS-2010-131,
    Author = {Anderson, Michael and Ballard, Grey and Demmel, James and Keutzer, Kurt},
    Title  = {Communication-Avoiding QR Decomposition for GPUs},
    Year   = {2010},
    Month  = {Oct},
    Url    = {http://www2.eecs.berkeley.edu/Pubs/TechRpts/2010/EECS-2010-131.html},
    Number = {UCB/EECS-2010-131}
}
EndNote citation:
%0 Report
%A Anderson, Michael
%A Ballard, Grey
%A Demmel, James
%A Keutzer, Kurt
%T Communication-Avoiding QR Decomposition for GPUs
%I EECS Department, University of California, Berkeley
%D 2010
%8 October 8
%@ UCB/EECS-2010-131
%U http://www2.eecs.berkeley.edu/Pubs/TechRpts/2010/EECS-2010-131.html
%F Anderson:EECS-2010-131