Scalable, Efficient Deep Learning by Means of Elastic Averaging
Kevin Peng
EECS Department, University of California, Berkeley
Technical Report No. UCB/EECS-2018-77
May 18, 2018
http://www2.eecs.berkeley.edu/Pubs/TechRpts/2018/EECS-2018-77.pdf
Much effort has been dedicated to optimizing the parallel performance of machine learning frameworks, both in training speed and in model quality. Most current approaches rely on parameter servers or distributed allreduce operations, but each of these suffers from various bottlenecks or must trade model quality for efficiency. We present BIDMach, a distributed machine learning framework that eliminates some of these bottlenecks without sacrificing model quality. We accomplish this with a grid layout in which each allreduce involves only n^(1/d) nodes instead of the usual n, a modified allreduce operation that is resistant to lagging nodes, an architecture that responds to nodes leaving and joining, and elastic averaging to mitigate the effect of worker iteration divergence.
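To make the two key ideas in the abstract concrete, here is a minimal Python sketch. BIDMach itself is written in Scala, and the names and parameter values below (grid_allreduce, elastic_average_step, lr, rho) are illustrative assumptions, not taken from the thesis. The sketch simulates how a d-dimensional grid decomposes an allreduce into d phases that each involve only n^(1/d) nodes, and shows an elastic-averaging update in the spirit of EASGD (Zhang et al., 2015) that pulls each worker toward a shared center variable.

def grid_allreduce(values, m, d):
    # Simulate an allreduce over n = m**d nodes arranged in a
    # d-dimensional grid. Each phase reduces along one dimension, so
    # a node exchanges data with only m = n**(1/d) peers per phase
    # instead of communicating with all n nodes at once.
    n = m ** d
    assert len(values) == n
    vals = list(values)
    for dim in range(d):
        stride = m ** dim
        new_vals = [0.0] * n
        for i in range(n):
            # Nodes that share every grid coordinate except `dim`
            # form a group of m; each group sums its current values.
            base = i - ((i // stride) % m) * stride
            new_vals[i] = sum(vals[base + j * stride] for j in range(m))
        vals = new_vals
    return vals  # after d phases, every node holds the global sum

def elastic_average_step(x_i, x_center, grad_i, lr=0.01, rho=0.1):
    # One elastic-averaging update in the spirit of EASGD (Zhang et
    # al., 2015): the worker parameter x_i is pulled toward the center
    # variable x_center, while the center drifts toward the worker.
    # lr and rho are illustrative values, not BIDMach's settings.
    x_i_new = x_i - lr * (grad_i + rho * (x_i - x_center))
    x_center_new = x_center + lr * rho * (x_i - x_center)
    return x_i_new, x_center_new

# Sanity check: 16 nodes in a 4x4 grid (d = 2), so each phase involves
# only 4 = 16**(1/2) nodes from any one node's point of view.
print(grid_allreduce(list(range(16)), m=4, d=2)[0])  # 120 on every node

Because workers are only loosely coupled through the center variable, an update of this form tolerates workers that run different numbers of iterations, which is how elastic averaging mitigates the worker iteration divergence mentioned above.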
Advisor: John F. Canny
BibTeX citation:
@mastersthesis{Peng:EECS-2018-77,
    Author   = {Peng, Kevin},
    Title    = {Scalable, Efficient Deep Learning by Means of Elastic Averaging},
    School   = {EECS Department, University of California, Berkeley},
    Year     = {2018},
    Month    = {May},
    Url      = {http://www2.eecs.berkeley.edu/Pubs/TechRpts/2018/EECS-2018-77.html},
    Number   = {UCB/EECS-2018-77},
    Abstract = {Much effort has been dedicated to optimizing the parallel performance of machine learning frameworks in the domains of training speed and model quality. Most current approaches make use of parameter servers or distributed allreduce operations. However, these approaches all suffer from various bottlenecks or must trade model quality for efficiency. We present BIDMach, a distributed machine learning framework that eliminates some of these bottlenecks without sacrificing model quality. We accomplish this by using a grid layout that enables allreduces with only n^(1/d) nodes instead of the usual n, a modified allreduce operation resistant to lagging nodes, an architecture responsive to nodes leaving and joining, and elastic averaging to mitigate the effect of worker iteration divergence.}
}
EndNote citation:
%0 Thesis
%A Peng, Kevin
%T Scalable, Efficient Deep Learning by Means of Elastic Averaging
%I EECS Department, University of California, Berkeley
%D 2018
%8 May 18
%@ UCB/EECS-2018-77
%U http://www2.eecs.berkeley.edu/Pubs/TechRpts/2018/EECS-2018-77.html
%F Peng:EECS-2018-77