Large-Batch Training for LSTM and Beyond

Yang You, James Demmel, Kurt Keutzer, Cho-Jui Hsieh, Chris Ying and Jonathan Hseu

EECS Department
University of California, Berkeley
Technical Report No. UCB/EECS-2018-138
November 14, 2018

http://www2.eecs.berkeley.edu/Pubs/TechRpts/2018/EECS-2018-138.pdf

Large-batch training approaches have enabled us to apply large-scale distributed processing. By scaling the batch size from 256 to 64K, researchers have been able to reduce the training time of ResNet50 on the ImageNet dataset from 29 hours to 8.6 minutes. However, there are three problems in current large-batch research: (1) Although RNN techniques like LSTM have been widely used in many real-world applications, current large-batch research focuses only on CNN applications. (2) Even for CNN applications, there is no automated technique for extending the batch size beyond 8K; doing so requires significant hyper-parameter tuning. (3) To keep the variance in the gradient expectation constant, theory suggests that the Sqrt Scaling scheme should be used in large-batch training. Unfortunately, there are no successful applications of the Sqrt Scaling scheme. In this paper, we propose a new approach called linear-epoch gradual-warmup (LEGW, or Leg-Warmup) for better large-batch training. We observe that LEGW achieves much better results than the previous Linear Scaling learning-rate scheme. With LEGW, we are able to conduct large-batch training for both CNNs and LSTMs with the Sqrt Scaling scheme; LEGW makes Sqrt Scaling practical and achieves much better results than the Linear Scaling learning-rate scheme. For LSTM applications, we are able to scale the batch size by 64 times without losing accuracy and without tuning the hyper-parameters. For CNN applications, LEGW maintains constant accuracy when we scale the batch size to 32K and works better than previous large-batch auto-tuning techniques. We also provide theoretical explanations for LEGW.
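
To make the scaling schemes concrete: with a batch-size ratio k = B / B_base, Linear Scaling multiplies the baseline learning rate by k, Sqrt Scaling multiplies it by sqrt(k), and LEGW additionally lengthens the gradual warmup period linearly in k. The Python sketch below illustrates this combination under stated assumptions; the baseline batch size, learning rate, and warmup length are hypothetical placeholders, not settings taken from the report.

# Illustrative sketch of LEGW (linear-epoch gradual-warmup) with Sqrt Scaling.
# Baseline values are hypothetical placeholders, not the report's settings.
import math

BASE_BATCH = 256        # hypothetical baseline batch size
BASE_LR = 0.1           # hypothetical baseline (peak) learning rate
BASE_WARMUP_EPOCHS = 1  # hypothetical baseline warmup length, in epochs

def legw_lr(epoch, batch_size):
    """Learning rate at a (fractional) epoch for a given batch size.

    Sqrt Scaling: the peak learning rate grows with sqrt(k), k = batch_size / BASE_BATCH.
    Linear-epoch warmup: warmup length grows linearly with k, so the rate
    ramps from 0 to the peak over BASE_WARMUP_EPOCHS * k epochs.
    """
    k = batch_size / BASE_BATCH
    peak_lr = BASE_LR * math.sqrt(k)         # Sqrt Scaling of the peak learning rate
    warmup_epochs = BASE_WARMUP_EPOCHS * k   # warmup epochs scale linearly with k
    if epoch < warmup_epochs:
        return peak_lr * epoch / warmup_epochs  # linear ramp during warmup
    return peak_lr                              # post-warmup decay schedule omitted

# Example: scaling the batch size 64x lengthens the warmup 64x and raises
# the peak learning rate by a factor of 8 (= sqrt(64)).
print(legw_lr(epoch=10.0, batch_size=256 * 64))

In this view, the only hyper-parameter that changes with batch size is implied by k itself, which is why no per-batch-size tuning is needed.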


BibTeX citation:

@techreport{You:EECS-2018-138,
    Author = {You, Yang and Demmel, James and Keutzer, Kurt and Hsieh, Cho-Jui and Ying, Chris and Hseu, Jonathan},
    Title = {Large-Batch Training for LSTM and Beyond},
    Institution = {EECS Department, University of California, Berkeley},
    Year = {2018},
    Month = {Nov},
    URL = {http://www2.eecs.berkeley.edu/Pubs/TechRpts/2018/EECS-2018-138.html},
    Number = {UCB/EECS-2018-138},
    Abstract = {Large-batch training approaches have enabled us to apply large-scale distributed processing.
By scaling the batch size from 256 to 64K, researchers have been able to reduce the training time of ResNet50 on the ImageNet dataset from 29 hours to 8.6 minutes.
However, there are three problems in current large-batch research:
(1) Although RNN techniques like LSTM have been widely used in many real-world applications, current large-batch research focuses only on CNN applications.
(2) Even for CNN applications, there is no automated technique for extending the batch size beyond 8K; doing so requires significant hyper-parameter tuning.
(3) To keep the variance in the gradient expectation constant, theory suggests that the Sqrt Scaling scheme should be used in large-batch training. Unfortunately, there are no successful applications of the Sqrt Scaling scheme.
In this paper, we propose a new approach called linear-epoch gradual-warmup (LEGW, or Leg-Warmup) for better large-batch training.
We observe that LEGW achieves much better results than the previous Linear Scaling learning-rate scheme.
With LEGW, we are able to conduct large-batch training for both CNNs and LSTMs with the Sqrt Scaling scheme; LEGW makes Sqrt Scaling practical and achieves much better results than the Linear Scaling learning-rate scheme.
For LSTM applications, we are able to scale the batch size by 64 times without losing accuracy and without tuning the hyper-parameters.
For CNN applications, LEGW maintains constant accuracy when we scale the batch size to 32K and works better than previous large-batch auto-tuning techniques.
We also provide theoretical explanations for LEGW.}
}

EndNote citation:

%0 Report
%A You, Yang
%A Demmel, James
%A Keutzer, Kurt
%A Hsieh, Cho-Jui
%A Ying, Chris
%A Hseu, Jonathan
%T Large-Batch Training for LSTM and Beyond
%I EECS Department, University of California, Berkeley
%D 2018
%8 November 14
%@ UCB/EECS-2018-138
%U http://www2.eecs.berkeley.edu/Pubs/TechRpts/2018/EECS-2018-138.html
%F You:EECS-2018-138