Communication-Avoiding QR Decomposition for GPUs
Michael Anderson, Grey Ballard, James Demmel, and Kurt Keutzer
EECS Department, University of California, Berkeley
Technical Report No. UCB/EECS-2010-131
October 8, 2010
http://www2.eecs.berkeley.edu/Pubs/TechRpts/2010/EECS-2010-131.pdf
We describe an implementation of the Communication-Avoiding QR (CAQR) factorization that runs entirely on a single graphics processor (GPU). We show that the reduction in memory traffic provided by CAQR allows us to outperform existing parallel GPU implementations of QR for a large class of tall-skinny matrices. Other GPU implementations of QR handle panel factorizations by either sending the work to a general-purpose processor or using entirely bandwidth-bound operations, incurring data transfer overheads. In contrast, our QR is done entirely on the GPU using compute-bound kernels, meaning performance is good regardless of the width of the matrix. As a result, we outperform CULA, a parallel linear algebra library for GPUs, by up to 13x for tall-skinny matrices.
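To make the idea concrete, the following is a minimal NumPy sketch of the TSQR reduction that underlies CAQR: factor a tall-skinny matrix block row by block row with independent local QRs, then combine the small R factors with one more QR. The function name, the block size, and the flat two-level combine are illustrative assumptions; the paper's version runs as compute-bound CUDA kernels on the GPU, not as NumPy calls.

import numpy as np

def tsqr(A, block_rows=256):
    """Return Q (m x n) and R (n x n) with A = Q @ R, for tall-skinny A."""
    m, n = A.shape
    # Assumes every block has at least n rows (true here since 256 >= 32).
    blocks = [A[i:i + block_rows] for i in range(0, m, block_rows)]

    # Local QR of each block row: independent, compute-bound work.
    Qs, Rs = zip(*(np.linalg.qr(b) for b in blocks))

    # Combine the stacked n x n R factors with a single additional QR
    # (a flat stand-in for the paper's reduction tree).
    Q2, R = np.linalg.qr(np.vstack(Rs))

    # Fold the combine step back into the local Q factors.
    Q = np.vstack([Qs[k] @ Q2[k * n:(k + 1) * n] for k in range(len(blocks))])
    return Q, R

# Quick check on a random tall-skinny matrix.
A = np.random.randn(4096, 32)
Q, R = tsqr(A)
assert np.allclose(Q @ R, A)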
We also discuss stationary video background subtraction as a motivating application. We apply a recent statistical approach that requires repeatedly computing the singular value decomposition of a tall-skinny matrix. Using CAQR as the first step in computing the singular value decomposition, we obtain the answer 3x faster than with a traditional bandwidth-bound GPU QR factorization tuned specifically for that matrix size, and 34x faster than with Intel's Math Kernel Library (MKL) QR factorization.
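The QR-first route to the SVD of a tall-skinny matrix is standard: factor A = QR, take the SVD of the small n x n factor R, and recover the left singular vectors from Q. The sketch below shows only that linear algebra in NumPy under assumed sizes; it is not the paper's GPU pipeline, where the QR step is CAQR.

import numpy as np

def svd_via_qr(A):
    """Thin SVD of a tall-skinny A: returns U (m x n), s (n,), Vt (n x n)."""
    Q, R = np.linalg.qr(A)        # in the paper, this step is CAQR on the GPU
    Ur, s, Vt = np.linalg.svd(R)  # SVD of the small n x n factor is cheap
    return Q @ Ur, s, Vt          # U = Q @ Ur gives the left singular vectors

# Example shapes only: e.g. video frames stacked as columns of a tall-skinny matrix.
A = np.random.randn(100_000, 50)
U, s, Vt = svd_via_qr(A)
assert np.allclose(U * s @ Vt, A)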
BibTeX citation:
@techreport{Anderson:EECS-2010-131,
    Author = {Anderson, Michael and Ballard, Grey and Demmel, James and Keutzer, Kurt},
    Title  = {Communication-Avoiding QR Decomposition for GPUs},
    Year   = {2010},
    Month  = {Oct},
    Url    = {http://www2.eecs.berkeley.edu/Pubs/TechRpts/2010/EECS-2010-131.html},
    Number = {UCB/EECS-2010-131}
}
EndNote citation:
%0 Report
%A Anderson, Michael
%A Ballard, Grey
%A Demmel, James
%A Keutzer, Kurt
%T Communication-Avoiding QR Decomposition for GPUs
%I EECS Department, University of California, Berkeley
%D 2010
%8 October 8
%@ UCB/EECS-2010-131
%U http://www2.eecs.berkeley.edu/Pubs/TechRpts/2010/EECS-2010-131.html
%F Anderson:EECS-2010-131