Learning Rate Estimation for Stochastic Gradient Descent

Nadia Hyder and Gerald Friedland

EECS Department
University of California, Berkeley
Technical Report No. UCB/EECS-2022-155
May 20, 2022

http://www2.eecs.berkeley.edu/Pubs/TechRpts/2022/EECS-2022-155.pdf

State-of-the-art gradient descent optimizers all attempt to tune the learning rate so that the minimum of the loss function can be found without overshooting it or approaching it so slowly that it is never reached by the end of training. Yet current approaches fail to consider what the shape of the error function means. In this work, we conduct experiments to better understand the complexity of error functions and develop systematic methods of measuring the learning rate using concepts from information theory and fractal geometry. Our experiments suggest three findings: (1) the loss curves that result from training on random, unlearnable data resemble exponential decay, (2) oversized networks are less sensitive to hyperparameters, and (3) fractal dimension can be a useful heuristic for learning rate scaling. Together, these three findings reinforce the view that the underlying complexity of the learning problem should be accounted for when measuring, rather than selecting, the learning rate.
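To make finding (3) concrete, the following is a minimal sketch, not the report's implementation, of how a box-counting fractal dimension of a loss curve could be estimated and used to scale a base learning rate. The function names and the scaling rule (dividing the base rate by the estimated dimension) are illustrative assumptions, not taken from the report.

import numpy as np

def box_counting_dimension(y, scales=(2, 4, 8, 16, 32, 64)):
    """Estimate the fractal dimension of the curve (i, y[i]) by box counting."""
    # Normalize the curve into the unit square so box sizes are comparable.
    x = np.linspace(0.0, 1.0, len(y))
    y = (y - y.min()) / (y.max() - y.min() + 1e-12)
    counts = []
    for s in scales:
        eps = 1.0 / s
        # Assign each sample point to a grid cell and count occupied cells.
        cells = set(zip((x / eps).astype(int), (y / eps).astype(int)))
        counts.append(len(cells))
    # Slope of log N(eps) versus log(1/eps) approximates the box-counting dimension.
    slope, _ = np.polyfit(np.log(scales), np.log(counts), 1)
    return slope

def scaled_learning_rate(base_lr, loss_curve):
    """Hypothetical heuristic: damp the learning rate as curve complexity grows."""
    d = box_counting_dimension(np.asarray(loss_curve, dtype=float))
    return base_lr / max(d, 1.0)

if __name__ == "__main__":
    steps = np.arange(500)
    # Synthetic loss curve: exponential decay plus noise (cf. finding (1)).
    noise = 0.05 * np.random.default_rng(0).standard_normal(500)
    loss = np.exp(-steps / 100.0) + noise
    print("estimated fractal dimension:", round(box_counting_dimension(loss), 3))
    print("scaled learning rate:", scaled_learning_rate(0.1, loss))

A smooth curve yields a dimension near 1, leaving the base rate essentially unchanged, while a noisier, more complex curve yields a larger dimension and a smaller scaled rate.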

Advisors: Gerald Friedland and David Bamman


BibTeX citation:

@mastersthesis{Hyder:EECS-2022-155,
    Author = {Hyder, Nadia and Friedland, Gerald},
    Title = {Learning Rate Estimation for Stochastic Gradient Descent},
    School = {EECS Department, University of California, Berkeley},
    Year = {2022},
    Month = {May},
    URL = {http://www2.eecs.berkeley.edu/Pubs/TechRpts/2022/EECS-2022-155.html},
    Number = {UCB/EECS-2022-155},
    Abstract = {State-of-the-art gradient descent optimizers all attempt to tune the learning rate so that the minimum of the loss function can be found without overshooting it or approaching it so slowly that it is never reached by the end of training. Yet current approaches fail to consider what the shape of the error function means. In this work, we conduct experiments to better understand the complexity of error functions and develop systematic methods of measuring the learning rate using concepts from information theory and fractal geometry. Our experiments suggest three findings: (1) the loss curves that result from training on random, unlearnable data resemble exponential decay, (2) oversized networks are less sensitive to hyperparameters, and (3) fractal dimension can be a useful heuristic for learning rate scaling. Together, these three findings reinforce the view that the underlying complexity of the learning problem should be accounted for when measuring, rather than selecting, the learning rate.}
}

EndNote citation:

%0 Thesis
%A Hyder, Nadia
%A Friedland, Gerald
%T Learning Rate Estimation for Stochastic Gradient Descent
%I EECS Department, University of California, Berkeley
%D 2022
%8 May 20
%@ UCB/EECS-2022-155
%U http://www2.eecs.berkeley.edu/Pubs/TechRpts/2022/EECS-2022-155.html
%F Hyder:EECS-2022-155