Learning Rate Estimation for Stochastic Gradient Descent
Nadia Hyder and Gerald Friedland
EECS Department, University of California, Berkeley
Technical Report No. UCB/EECS-2022-155
May 20, 2022
http://www2.eecs.berkeley.edu/Pubs/TechRpts/2022/EECS-2022-155.pdf
State-of-the-art gradient descent optimizers all attempt to tune the learning rate so that the minimum of the loss function is found without overshooting it, and without approaching it so slowly that training ends before it is reached. Yet current approaches fail to consider what the shape of the error function means. In this work, we conduct experiments to better understand the complexity of error functions and develop systematic methods of measuring learning rate using concepts from information theory and fractal geometry. The experiments suggest three findings: (1) loss curves obtained from training on random, unlearnable data resemble exponential decay, (2) oversized networks are less sensitive to hyperparameters, and (3) fractal dimension can be a useful heuristic for learning rate scaling. Together, these three findings support the view that the underlying complexity of the learning problem should be accounted for when measuring, rather than selecting, the learning rate.
Advisors: Gerald Friedland and David Bamman
BibTeX citation:
@mastersthesis{Hyder:EECS-2022-155,
    Author= {Hyder, Nadia and Friedland, Gerald},
    Title= {Learning Rate Estimation for Stochastic Gradient Descent},
    School= {EECS Department, University of California, Berkeley},
    Year= {2022},
    Month= {May},
    Url= {http://www2.eecs.berkeley.edu/Pubs/TechRpts/2022/EECS-2022-155.html},
    Number= {UCB/EECS-2022-155},
    Abstract= {State-of-the-art gradient descent optimizers all attempt to tune learning rate such that we can find the minimum of the loss function without overshooting or approaching it so slowly that we fail to reach it by the end of training. Yet, current approaches fail to consider what the shape of the error function means. In this work, we conduct experiments to better understand complexity of error functions and develop systematic methods of measuring learning rate using concepts from information theory and fractal geometry. Experiments conducted suggest a few findings: (1) resulting loss curves from training over random, unlearnable data resemble exponential decay, (2) oversized networks are less sensitive to hyperparameters, and (3) fractal dimension can be a useful heuristic for learning rate scaling. Together, these 3 findings solidify that the underlying complexity of the learning problem should be accounted for when measuring – rather than selecting – learning rate.},
}
EndNote citation:
%0 Thesis
%A Hyder, Nadia
%A Friedland, Gerald
%T Learning Rate Estimation for Stochastic Gradient Descent
%I EECS Department, University of California, Berkeley
%D 2022
%8 May 20
%@ UCB/EECS-2022-155
%U http://www2.eecs.berkeley.edu/Pubs/TechRpts/2022/EECS-2022-155.html
%F Hyder:EECS-2022-155