The Effect of Model Size on Worst-Group Generalization

Alan Pham

EECS Department, University of California, Berkeley

Technical Report No. UCB/EECS-2022-138

May 18, 2022

http://www2.eecs.berkeley.edu/Pubs/TechRpts/2022/EECS-2022-138.pdf

Overparameterization has been shown to result in poor test accuracy on rare subgroups in a variety of settings where subgroup information is known. To gain a more complete picture, we consider the case where subgroup information is unknown. We investigate the effect of model size on worst-group generalization under empirical risk minimization (ERM) across a wide range of settings, varying: 1) architecture (ResNet, VGG, or BERT), 2) domain (vision or natural language processing), 3) model size (width or depth), and 4) initialization (pre-trained or random weights). Our systematic evaluation reveals that increasing model size does not hurt, and may help, worst-group test performance under ERM across all setups. In particular, increasing pre-trained model size consistently improves performance on Waterbirds and MultiNLI. We advise practitioners to use larger pre-trained models when subgroup labels are unknown.
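
For readers unfamiliar with the metric, worst-group test performance can be illustrated with a minimal sketch: accuracy is computed separately for each subgroup, and the lowest value is reported. The function name `worst_group_accuracy`, the group identifiers, and the toy data below are assumptions made for illustration only; they are not taken from the report and do not reproduce its evaluation code.

```python
from collections import defaultdict


def worst_group_accuracy(y_true, y_pred, groups):
    """Return (worst_group, worst_accuracy, per_group_accuracy).

    Hypothetical helper: `groups` holds one subgroup identifier per example.
    """
    if not (len(y_true) == len(y_pred) == len(groups)):
        raise ValueError("y_true, y_pred and groups must have the same length")
    if len(y_true) == 0:
        raise ValueError("at least one example is required")

    correct = defaultdict(int)
    total = defaultdict(int)
    for t, p, g in zip(y_true, y_pred, groups):
        total[g] += 1
        correct[g] += int(t == p)

    # Accuracy within each subgroup, then the minimum over subgroups.
    per_group = {g: correct[g] / total[g] for g in total}
    worst = min(per_group, key=per_group.get)
    return worst, per_group[worst], per_group


# Toy data (made up): group "a" is classified perfectly, group "b" is not.
y_true = [0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 1, 1, 0]
groups = ["a", "a", "a", "b", "b", "b"]

worst, worst_acc, per_group = worst_group_accuracy(y_true, y_pred, groups)
average_acc = sum(int(t == p) for t, p in zip(y_true, y_pred)) / len(y_true)
print(f"average accuracy: {average_acc:.3f}")
print(f"worst group: {worst}, accuracy: {worst_acc:.3f}")
```

In this toy case the average accuracy hides the weaker subgroup, which is why the two numbers are reported separately. Note that subgroup labels are only needed here for evaluation; the ERM training studied in the report does not use them.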

Advisor: Joseph Gonzalez

BibTeX citation:

@mastersthesis{Pham:EECS-2022-138,
    Author= {Pham, Alan},
    Title= {The Effect of Model Size on Worst-Group Generalization},
    School= {EECS Department, University of California, Berkeley},
    Year= {2022},
    Month= {May},
    Url= {http://www2.eecs.berkeley.edu/Pubs/TechRpts/2022/EECS-2022-138.html},
    Number= {UCB/EECS-2022-138},
    Abstract= {Overparameterization has been shown to result in poor test accuracy on rare subgroups in a variety of settings where subgroup information is known. To gain a more complete picture, we consider the case where subgroup information is unknown. We investigate the effect of model size on worst-group generalization under empirical risk minimization (ERM) across a wide range of settings, varying: 1) architecture (ResNet, VGG, or BERT), 2) domain (vision or natural language processing), 3) model size (width or depth), and 4) initialization (pre-trained or random weights). Our systematic evaluation reveals that increasing model size does not hurt, and may help, worst-group test performance under ERM across all setups. In particular, increasing pre-trained model size consistently improves performance on Waterbirds and MultiNLI. We advise practitioners to use larger pre-trained models when subgroup labels are unknown.},
}

EndNote citation:

%0 Thesis
%A Pham, Alan
%T The Effect of Model Size on Worst-Group Generalization
%I EECS Department, University of California, Berkeley
%D 2022
%8 May 18
%@ UCB/EECS-2022-138
%U http://www2.eecs.berkeley.edu/Pubs/TechRpts/2022/EECS-2022-138.html
%F Pham:EECS-2022-138