Speeding up distributed machine learning algorithms via coding theoretic tools

Kangwook Lee, Ramtin Pedarsani¹, Dimitris Papailiopoulos² and Kannan Ramchandran

Distributed machine learning algorithms, which dominate the use of modern large-scale computing platforms, face several types of randomness, uncertainty, and system noise, including straggler nodes, system failures, maintenance outages, and communication bottlenecks. In this project, we view distributed machine learning algorithms through a coding-theoretic lens and show how codes can equip them with robustness against this system noise. We focus on two of the most basic building blocks of distributed learning algorithms: data shuffling and matrix multiplication. In data shuffling, we use codes to reduce communication bottlenecks, and show that coded shuffling reduces the communication cost by up to a factor of O(n) over uncoded shuffling, where n is the number of worker nodes. For matrix multiplication, we use codes to alleviate the effect of slow worker nodes, commonly referred to as the straggler problem. We show that coded matrix multiplication is O(log(n)) times faster than uncoded matrix multiplication.
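To make the straggler-mitigation idea concrete, the following is a minimal sketch of coded matrix-vector multiplication with a toy (n, k) = (3, 2) code: the data matrix is split into two row blocks, a third worker computes a parity task, and the master recovers the full product from the results of any two of the three workers. The worker names, block sizes, and the simple sum parity are illustrative assumptions for this sketch, not the exact scheme used in the project.

import numpy as np

# Toy (n, k) = (3, 2) coded matrix-vector multiplication (illustrative sketch).
# A is split row-wise into A1, A2; a third "parity" task computes (A1 + A2) x,
# so the product A x can be recovered from ANY 2 of the 3 worker results.

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 3))   # data matrix, split into two row blocks
x = rng.standard_normal(3)

A1, A2 = A[:2], A[2:]             # row blocks assigned to workers 1 and 2
tasks = {
    "w1": A1 @ x,                 # worker 1: systematic block
    "w2": A2 @ x,                 # worker 2: systematic block
    "w3": (A1 + A2) @ x,          # worker 3: parity block
}

# Suppose worker 2 straggles: decode A2 x = (A1 + A2) x - A1 x.
survivors = {k: v for k, v in tasks.items() if k != "w2"}
decoded = np.concatenate([survivors["w1"], survivors["w3"] - survivors["w1"]])

assert np.allclose(decoded, A @ x)  # full product recovered without worker 2

With an (n, k) maximum distance separable code, the same idea generalizes: the master waits only for the fastest k of n workers, which is what yields the O(log(n)) speedup over waiting for all n uncoded tasks.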

¹UC Berkeley
²UC Berkeley