Scalable, Efficient Deep Learning by Means of Elastic Averaging
Kevin Peng
EECS Department, University of California, Berkeley
Technical Report No. UCB/EECS-2018-77
May 18, 2018
http://www2.eecs.berkeley.edu/Pubs/TechRpts/2018/EECS-2018-77.pdf
Much effort has been dedicated to optimizing the parallel performance of machine learning frameworks, both in training speed and in model quality. Most current approaches rely on parameter servers or distributed allreduce operations, but each of these suffers from various bottlenecks or must trade model quality for efficiency. We present BIDMach, a distributed machine learning framework that eliminates some of these bottlenecks without sacrificing model quality. We accomplish this with a grid layout in which each allreduce involves only n^(1/d) nodes instead of the usual n, a modified allreduce operation that is resistant to lagging nodes, an architecture that responds to nodes leaving and joining, and elastic averaging to mitigate the effect of worker iteration divergence.
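To make the two key ideas in the abstract concrete, here is a minimal Python sketch. BIDMach itself is written in Scala, and the names and parameter values below (grid_allreduce, elastic_average_step, lr, rho) are illustrative assumptions, not taken from the thesis. The sketch simulates how a d-dimensional grid decomposes an allreduce into d phases that each involve only n^(1/d) nodes, and shows an elastic-averaging update in the spirit of EASGD (Zhang et al., 2015) that pulls each worker toward a shared center variable.

def grid_allreduce(values, m, d):
    # Simulate an allreduce over n = m**d nodes arranged in a
    # d-dimensional grid. Each phase reduces along one dimension, so
    # a node exchanges data with only m = n**(1/d) peers per phase
    # instead of communicating with all n nodes at once.
    n = m ** d
    assert len(values) == n
    vals = list(values)
    for dim in range(d):
        stride = m ** dim
        new_vals = [0.0] * n
        for i in range(n):
            # Nodes that share every grid coordinate except `dim`
            # form a group of m; each group sums its current values.
            base = i - ((i // stride) % m) * stride
            new_vals[i] = sum(vals[base + j * stride] for j in range(m))
        vals = new_vals
    return vals  # after d phases, every node holds the global sum

def elastic_average_step(x_i, x_center, grad_i, lr=0.01, rho=0.1):
    # One elastic-averaging update in the spirit of EASGD (Zhang et
    # al., 2015): the worker parameter x_i is pulled toward the center
    # variable x_center, while the center drifts toward the worker.
    # lr and rho are illustrative values, not BIDMach's settings.
    x_i_new = x_i - lr * (grad_i + rho * (x_i - x_center))
    x_center_new = x_center + lr * rho * (x_i - x_center)
    return x_i_new, x_center_new

# Sanity check: 16 nodes in a 4x4 grid (d = 2), so each phase involves
# only 4 = 16**(1/2) nodes from any one node's point of view.
print(grid_allreduce(list(range(16)), m=4, d=2)[0])  # 120 on every node

Because workers are only loosely coupled through the center variable, an update of this form tolerates workers that run different numbers of iterations, which is how elastic averaging mitigates the worker iteration divergence mentioned above.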
Advisor: John F. Canny
BibTeX citation:
@mastersthesis{Peng:EECS-2018-77,
    Author   = {Peng, Kevin},
    Title    = {Scalable, Efficient Deep Learning by Means of Elastic Averaging},
    School   = {EECS Department, University of California, Berkeley},
    Year     = {2018},
    Month    = {May},
    Url      = {http://www2.eecs.berkeley.edu/Pubs/TechRpts/2018/EECS-2018-77.html},
    Number   = {UCB/EECS-2018-77},
    Abstract = {Much effort has been dedicated to optimizing the parallel performance of machine learning frameworks in the domains of training speed and model quality. Most current approaches make use of parameter servers or distributed allreduce operations. However, these approaches all suffer from various bottlenecks or must trade model quality for efficiency. We present BIDMach, a distributed machine learning framework that eliminates some of these bottlenecks without sacrificing model quality. We accomplish this by using a grid layout that enables allreduces with only n^(1/d) nodes instead of the usual n, a modified allreduce operation resistant to lagging nodes, an architecture responsive to nodes leaving and joining, and elastic averaging to mitigate the effect of worker iteration divergence.}
}
EndNote citation:
%0 Thesis
%A Peng, Kevin
%T Scalable, Efficient Deep Learning by Means of Elastic Averaging
%I EECS Department, University of California, Berkeley
%D 2018
%8 May 18
%@ UCB/EECS-2018-77
%U http://www2.eecs.berkeley.edu/Pubs/TechRpts/2018/EECS-2018-77.html
%F Peng:EECS-2018-77