Communication-Avoiding QR Decomposition for GPUs

Michael Anderson, Grey Ballard, James Demmel and Kurt Keutzer

EECS Department
University of California, Berkeley
Technical Report No. UCB/EECS-2010-131
October 8, 2010

http://www2.eecs.berkeley.edu/Pubs/TechRpts/2010/EECS-2010-131.pdf

We describe an implementation of the Communication-Avoiding QR (CAQR) factorization that runs entirely on a single graphics processor (GPU). We show that the reduction in memory traffic provided by CAQR allows us to outperform existing parallel GPU implementations of QR for a large class of tall-skinny matrices. Other GPU implementations of QR handle panel factorizations by either sending the work to a general-purpose processor or using entirely bandwidth-bound operations, incurring data transfer overheads. In contrast, our QR is done entirely on the GPU using compute-bound kernels, meaning performance is good regardless of the width of the matrix. As a result, we outperform CULA, a parallel linear algebra library for GPUs, by up to 13x for tall-skinny matrices.

We also discuss stationary video background subtraction as a motivating application. We apply a recent statistical approach, which requires many iterations of computing the singular value decomposition of a tall-skinny matrix. Using CAQR as a first step to getting the singular value decomposition, we are able to get the answer 3x faster than if we use a traditional bandwidth-bound GPU QR factorization tuned specifically for that matrix size, and 34x faster than if we use Intel's Math Kernel Library (MKL) QR factorization.


BibTeX citation:

@techreport{Anderson:EECS-2010-131,
    Author = {Anderson, Michael and Ballard, Grey and Demmel, James and Keutzer, Kurt},
    Title = {Communication-Avoiding QR Decomposition for GPUs},
    Institution = {EECS Department, University of California, Berkeley},
    Year = {2010},
    Month = {Oct},
    URL = {http://www2.eecs.berkeley.edu/Pubs/TechRpts/2010/EECS-2010-131.html},
    Number = {UCB/EECS-2010-131},
    Abstract = {We describe an implementation of the Communication-Avoiding QR (CAQR) factorization that runs entirely on a single graphics processor (GPU). We show that the reduction in memory traffic provided by CAQR allows us to outperform existing parallel GPU implementations of QR for a large class of tall-skinny matrices. Other GPU implementations of QR handle panel factorizations by either sending the work to a general-purpose processor or using entirely bandwidth-bound operations, incurring data transfer overheads. In contrast, our QR is done entirely on the GPU using compute-bound kernels, meaning performance is good regardless of the width of the matrix. As a result, we outperform CULA, a parallel linear algebra library for GPUs, by up to 13x for tall-skinny matrices.

We also discuss stationary video background subtraction as a motivating application. We apply a recent statistical approach, which requires many iterations of computing the singular value decomposition of a tall-skinny matrix. Using CAQR as a first step to getting the singular value decomposition, we are able to get the answer 3x faster than if we use a traditional bandwidth-bound GPU QR factorization tuned specifically for that matrix size, and 34x faster than if we use Intel's Math Kernel Library (MKL) QR factorization.}
}

EndNote citation:

%0 Report
%A Anderson, Michael
%A Ballard, Grey
%A Demmel, James
%A Keutzer, Kurt
%T Communication-Avoiding QR Decomposition for GPUs
%I EECS Department, University of California, Berkeley
%D 2010
%8 October 8
%@ UCB/EECS-2010-131
%U http://www2.eecs.berkeley.edu/Pubs/TechRpts/2010/EECS-2010-131.html
%F Anderson:EECS-2010-131