Measuring Generalization and Overfitting in Machine Learning

Rebecca Roelofs

EECS Department
University of California, Berkeley
Technical Report No. UCB/EECS-2019-102
June 19, 2019

http://www2.eecs.berkeley.edu/Pubs/TechRpts/2019/EECS-2019-102.pdf

Due to the prevalence of machine learning algorithms and the potential for their decisions to profoundly impact billions of human lives, it is crucial that they are robust, reliable, and understandable. This thesis examines key theoretical pillars of machine learning surrounding generalization and overfitting, and tests the extent to which empirical behavior matches existing theory. We develop novel methods for measuring overfitting and generalization, and we characterize how reproducible observed behavior is across differences in optimization algorithm, dataset, task, evaluation metric, and domain.

First, we examine how optimization algorithms bias machine learning models towards solutions with varying generalization properties. We show that adaptive gradient methods empirically find solutions with inferior generalization behavior compared to those found by stochastic gradient descent. We then construct an example using a simple overparameterized model that corroborates the algorithms’ empirical behavior on neural networks.

Next, we study the extent to which machine learning models have overfit to commonly reused datasets in both academic benchmarks and machine learning competitions. We build new test sets for the CIFAR-10 and ImageNet datasets and evaluate a broad range of classification models on the new datasets. All models experience a drop in accuracy, which indicates that current accuracy numbers are susceptible to even minute natural variations in the data distribution. Surprisingly, despite several years of adaptively selecting the models to perform well on these competitive benchmarks, we find no evidence of overfitting. We then analyze data from the machine learning platform Kaggle and find little evidence of substantial overfitting in ML competitions. These findings speak to the robustness of the holdout method across different data domains, loss functions, model classes, and human analysts.
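The flavor of the overparameterized construction can be illustrated with a minimal sketch (not the thesis's actual example): on an underdetermined least-squares problem, plain gradient descent from a zero initialization converges to the minimum-norm interpolating solution, while an Adagrad-style per-coordinate rescaling (standing in here for the adaptive-gradient family) interpolates the training data equally well but lands on a different solution. All dimensions, step sizes, and the choice of Adagrad are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 5, 20                          # fewer samples than parameters: overparameterized
X = rng.normal(size=(n, d))
y = rng.normal(size=n)

def train(adaptive, steps=20000, lr=0.01):
    """Minimize ||Xw - y||^2 / (2n) with plain GD or an Adagrad-style update."""
    w = np.zeros(d)
    h = np.zeros(d)                   # accumulated squared gradients (Adagrad)
    for _ in range(steps):
        g = X.T @ (X @ w - y) / n
        if adaptive:
            h += g * g
            w -= 0.1 * g / (np.sqrt(h) + 1e-8)   # per-coordinate rescaled step
        else:
            w -= lr * g
    return w

w_gd = train(adaptive=False)
w_ada = train(adaptive=True)
w_min = X.T @ np.linalg.solve(X @ X.T, y)  # minimum-norm interpolator

# Both methods drive the training residual to ~0, but GD recovers the
# minimum-norm solution (its iterates stay in the row space of X) while
# the adaptive method converges to a different interpolator.
print("train residuals:", np.linalg.norm(X @ w_gd - y), np.linalg.norm(X @ w_ada - y))
print("dist to min-norm:", np.linalg.norm(w_gd - w_min), np.linalg.norm(w_ada - w_min))
```

The two solutions fit the training data identically, so any difference in test behavior comes entirely from the optimizer's implicit bias, which is the phenomenon the chapter studies.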

Overall, our work suggests that the true concern for robust machine learning is distribution shift rather than overfitting, and designing models that still work reliably in dynamic environments is a challenging but necessary undertaking.

Advisors: James Demmel and Benjamin Recht


BibTeX citation:

@phdthesis{Roelofs:EECS-2019-102,
    Author = {Roelofs, Rebecca},
    Title = {Measuring Generalization and Overfitting in Machine Learning},
    School = {EECS Department, University of California, Berkeley},
    Year = {2019},
    Month = {Jun},
    URL = {http://www2.eecs.berkeley.edu/Pubs/TechRpts/2019/EECS-2019-102.html},
    Number = {UCB/EECS-2019-102},
    Abstract = {Due to the prevalence of machine learning algorithms and the potential for their decisions to profoundly impact billions of human lives, it is crucial that they are robust, reliable, and understandable.  This thesis examines key theoretical pillars of machine learning surrounding generalization and overfitting, and tests the extent to which empirical behavior matches existing theory.  We develop novel methods for measuring overfitting and generalization, and we characterize how reproducible observed behavior is across differences in optimization algorithm, dataset, task, evaluation metric, and domain.

First, we examine how optimization algorithms bias machine learning models towards solutions with varying generalization properties.  We show that adaptive gradient methods empirically find solutions with inferior generalization behavior compared to those found by stochastic gradient descent. We then construct an example using a simple overparameterized model that corroborates the algorithms’ empirical behavior on neural networks.
    
Next, we study the extent to which machine learning models have overfit to commonly reused datasets in both academic benchmarks and machine learning competitions.  We build new test sets for the CIFAR-10 and ImageNet datasets and evaluate a broad range of classification models on the new datasets.  All models experience a drop in accuracy, which indicates that current accuracy numbers are susceptible to even minute natural variations in the data distribution.  Surprisingly, despite several years of adaptively selecting the models to perform well on these competitive benchmarks, we find no evidence of overfitting.  
We then analyze data from the machine learning platform Kaggle and find little evidence of substantial overfitting in ML competitions.  These findings speak to the robustness of the holdout method across different data domains, loss functions, model classes, and human analysts.

Overall, our work suggests that the true concern for robust machine learning is distribution shift rather than overfitting, and designing models that still work reliably in dynamic environments is a challenging but necessary undertaking.}
}

EndNote citation:

%0 Thesis
%A Roelofs, Rebecca
%T Measuring Generalization and Overfitting in Machine Learning
%I EECS Department, University of California, Berkeley
%D 2019
%8 June 19
%@ UCB/EECS-2019-102
%U http://www2.eecs.berkeley.edu/Pubs/TechRpts/2019/EECS-2019-102.html
%F Roelofs:EECS-2019-102