Scaling SGD Batch Size to 32K for ImageNet Training

Yang You, Igor Gitman and Boris Ginsburg

EECS Department
University of California, Berkeley
Technical Report No. UCB/EECS-2017-156
September 16, 2017

http://www2.eecs.berkeley.edu/Pubs/TechRpts/2017/EECS-2017-156.pdf

The most natural way to speed up the training of large networks is to use data parallelism on multiple GPUs. To scale Stochastic Gradient (SG) based methods to more processors, one needs to increase the batch size to make full use of the computational power of each GPU. However, maintaining the accuracy of the network as the batch size increases is not trivial. Currently, the state-of-the-art method is to increase the Learning Rate (LR) in proportion to the batch size, and to use a special learning rate "warm-up" policy to overcome the initial optimization difficulty.
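
As a rough illustration of the scaling-plus-warm-up recipe mentioned above, the following minimal sketch computes a learning rate that grows in proportion to the batch size and is reached through a linear ramp. The function name, the linear shape of the ramp, the choice to start the ramp at the base LR, and the example values are all assumptions made for illustration; the abstract does not specify them.

```python
def warmup_scaled_lr(step, base_lr, base_batch, batch, warmup_steps):
    """Return the LR at a given step (hypothetical helper, not from the report).

    The target LR follows the linear scaling rule: base_lr * batch / base_batch.
    During the first `warmup_steps` steps the LR ramps linearly from base_lr
    up to that target (an assumed warm-up shape).
    """
    target_lr = base_lr * batch / base_batch
    if step < warmup_steps:
        # Linear interpolation between the base LR and the scaled target.
        return base_lr + (target_lr - base_lr) * step / warmup_steps
    return target_lr


if __name__ == "__main__":
    # Assumed example values: base LR 0.1 at batch 256, scaled to batch 8192.
    for step in (0, 50, 100, 200):
        print(step, warmup_scaled_lr(step, 0.1, 256, 8192, warmup_steps=100))
```

With these assumed values the target LR is 32 times the base LR, which is the kind of large value the warm-up is meant to make reachable.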

By controlling the LR during the training process, one can efficiently use large batches in ImageNet training. For example, Batch-1024 for AlexNet and Batch-8192 for ResNet-50 are successful applications. However, for ImageNet-1k training, state-of-the-art AlexNet only scales the batch size to 1024 and ResNet-50 only scales it to 8192. The reason is that the learning rate cannot be scaled to a large value. To enable large-batch training for general networks or datasets, we propose Layer-wise Adaptive Rate Scaling (LARS). LARS uses different LRs for different layers based on the norm of the weights and the norm of the gradients. By using the LARS algorithm, we can scale the batch size to 32768 for ResNet-50 and 8192 for AlexNet. Large batches can make full use of the system's computational power. For example, batch-4096 can achieve a 3x speedup over batch-512 for ImageNet training with the AlexNet model on a DGX-1 station (8 P100 GPUs).
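
The idea of a per-layer LR driven by weight and gradient norms can be sketched as follows. This is a simplified illustration, not the report's exact algorithm: it assumes the local LR is the global LR times a trust coefficient times the ratio of the weight norm to the gradient norm, and it omits momentum and weight decay. The names `layerwise_lr`, `sgd_step`, `trust_coef`, and `eps` are hypothetical.

```python
import math


def l2_norm(values):
    """Euclidean norm of a flat list of floats."""
    return math.sqrt(sum(v * v for v in values))


def layerwise_lr(weights, grads, global_lr, trust_coef=0.001, eps=1e-9):
    """Per-layer LR proportional to ||w|| / ||g|| (assumed form).

    trust_coef and eps are illustrative defaults, not values from the report.
    """
    w_norm = l2_norm(weights)
    g_norm = l2_norm(grads)
    if w_norm == 0.0 or g_norm == 0.0:
        # Fall back to the global LR when the ratio is undefined.
        return global_lr
    return global_lr * trust_coef * w_norm / (g_norm + eps)


def sgd_step(layers, grads, global_lr):
    """One plain SGD update where each layer gets its own LR."""
    updated = []
    for w, g in zip(layers, grads):
        lr = layerwise_lr(w, g, global_lr)
        updated.append([wi - lr * gi for wi, gi in zip(w, g)])
    return updated


if __name__ == "__main__":
    layers = [[3.0, 4.0], [0.3, 0.4]]
    grads = [[0.6, 0.8], [6.0, 8.0]]
    print(sgd_step(layers, grads, global_lr=1.0))
```

Under this assumed form, the size of each layer's update depends on the layer's weight norm rather than on the raw gradient magnitude, so a layer with unusually large gradients does not receive a disproportionately large step.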


BibTeX citation:

@techreport{You:EECS-2017-156,
    Author = {You, Yang and Gitman, Igor and Ginsburg, Boris},
    Title = {Scaling SGD Batch Size to 32K for ImageNet Training},
    Institution = {EECS Department, University of California, Berkeley},
    Year = {2017},
    Month = {Sep},
    URL = {http://www2.eecs.berkeley.edu/Pubs/TechRpts/2017/EECS-2017-156.html},
    Number = {UCB/EECS-2017-156},
    Abstract = {The most natural way to speed up the training of large networks is to use data parallelism on multiple GPUs. To scale Stochastic Gradient (SG) based methods to more processors, one needs to increase the batch size to make full use of the computational power of each GPU. However, maintaining the accuracy of the network as the batch size increases is not trivial. Currently, the state-of-the-art method is to increase the Learning Rate (LR) in proportion to the batch size, and to use a special learning rate "warm-up" policy to overcome the initial optimization difficulty.

By controlling the LR during the training process, one can efficiently use large batches in ImageNet training.
For example, Batch-1024 for AlexNet and Batch-8192 for ResNet-50 are successful applications.
However, for ImageNet-1k training, state-of-the-art AlexNet only scales the batch size to 1024 and ResNet-50 only scales it to 8192.
The reason is that the learning rate cannot be scaled to a large value.
To enable large-batch training for general networks or datasets, we propose Layer-wise Adaptive Rate Scaling (LARS).
LARS uses different LRs for different layers based on the norm of the weights and the norm of the gradients.
By using the LARS algorithm, we can scale the batch size to 32768 for ResNet-50 and 8192 for AlexNet.
Large batches can make full use of the system's computational power.
For example, batch-4096 can achieve a 3x speedup over batch-512 for ImageNet training with the AlexNet model on a DGX-1 station (8 P100 GPUs).}
}

EndNote citation:

%0 Report
%A You, Yang
%A Gitman, Igor
%A Ginsburg, Boris
%T Scaling SGD Batch Size to 32K for ImageNet Training
%I EECS Department, University of California, Berkeley
%D 2017
%8 September 16
%@ UCB/EECS-2017-156
%U http://www2.eecs.berkeley.edu/Pubs/TechRpts/2017/EECS-2017-156.html
%F You:EECS-2017-156