### James Demmel, David Eliahu, Armando Fox, Shoaib Ashraf Kamil, Benjamin Lipshitz, Oded Schwartz and Omer Spillinger

###
EECS Department

University of California, Berkeley

Technical Report No. UCB/EECS-2012-205

October 24, 2012

### http://www2.eecs.berkeley.edu/Pubs/TechRpts/2012/EECS-2012-205.pdf

Communication-optimal algorithms are known for square matrix multiplication. Here, we obtain the first communication-optimal algorithm for all dimensions of rectan- gular matrices. Combining the dimension-splitting technique of Frigo, Leiserson, Prokop and Ramachandran (1999) with the recursive BFS/DFS approach of Ballard, Demmel, Holtz, Lipshitz and Schwartz (2012) allows for a communication-optimal as well as cache- and network-oblivious algorithm. Moreover, the implementation is simple: approximately 50 lines of code for the shared-memory version. Since the new algorithm minimizes communication across the network, between NUMA domains, and between levels of cache, it performs well in practice on both shared- and distributed-memory machines. We show significant speedups over existing parallel linear algebra libraries both on a 32-core shared-memory machine and on a distributed-memory supercomputer.

BibTeX citation:

@techreport{Demmel:EECS-2012-205, Author = {Demmel, James and Eliahu, David and Fox, Armando and Kamil, Shoaib Ashraf and Lipshitz, Benjamin and Schwartz, Oded and Spillinger, Omer}, Title = {Communication-Optimal Parallel Recursive Rectangular Matrix Multiplication}, Institution = {EECS Department, University of California, Berkeley}, Year = {2012}, Month = {Oct}, URL = {http://www2.eecs.berkeley.edu/Pubs/TechRpts/2012/EECS-2012-205.html}, Number = {UCB/EECS-2012-205}, Abstract = {Communication-optimal algorithms are known for square matrix multiplication. Here, we obtain the first communication-optimal algorithm for all dimensions of rectan- gular matrices. Combining the dimension-splitting technique of Frigo, Leiserson, Prokop and Ramachandran (1999) with the recursive BFS/DFS approach of Ballard, Demmel, Holtz, Lipshitz and Schwartz (2012) allows for a communication-optimal as well as cache- and network-oblivious algorithm. Moreover, the implementation is simple: approximately 50 lines of code for the shared-memory version. Since the new algorithm minimizes communication across the network, between NUMA domains, and between levels of cache, it performs well in practice on both shared- and distributed-memory machines. We show significant speedups over existing parallel linear algebra libraries both on a 32-core shared-memory machine and on a distributed-memory supercomputer.} }

EndNote citation:

%0 Report %A Demmel, James %A Eliahu, David %A Fox, Armando %A Kamil, Shoaib Ashraf %A Lipshitz, Benjamin %A Schwartz, Oded %A Spillinger, Omer %T Communication-Optimal Parallel Recursive Rectangular Matrix Multiplication %I EECS Department, University of California, Berkeley %D 2012 %8 October 24 %@ UCB/EECS-2012-205 %U http://www2.eecs.berkeley.edu/Pubs/TechRpts/2012/EECS-2012-205.html %F Demmel:EECS-2012-205