Zhuohan Li, Siyuan Zhuang, Shiyuan Guo, Danyang Zhuo, Hao Zhang, Dawn Song, and Ion Stoica

EECS Department, University of California, Berkeley

Technical Report No. UCB/EECS-2021-258

December 16, 2021

http://www2.eecs.berkeley.edu/Pubs/TechRpts/2021/EECS-2021-258.pdf

Model parallelism has become a necessity for training modern large-scale deep language models. In this work, we identify a new dimension, orthogonal to existing model-parallel approaches: thanks to the autoregressive property of Transformer-based language models, pipeline parallelism can be performed within a single training sequence. This enables a more fine-grained pipeline than previous work. With this key idea, we design TeraPipe, a high-performance token-level pipeline-parallel algorithm for synchronous model-parallel training of Transformer-based language models. We develop a novel dynamic-programming algorithm that computes the optimal pipelining execution scheme for a given model and cluster configuration. We show that TeraPipe speeds up training of the largest GPT-3 model (175 billion parameters) by 5.0x on an AWS cluster of 48 p3.16xlarge instances compared with state-of-the-art model-parallel methods.
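To make the dynamic-programming idea above concrete, here is a minimal Python sketch, not the report's implementation: it assumes a user-supplied latency model slice_latency(k) (e.g., profiled forward-pass time for a span of k tokens, non-decreasing in k) and searches for the contiguous token-slice widths that minimize an estimated pipeline latency of the form "sum of slice times plus (num_stages - 1) times the bottleneck slice time." The function name, the cost model, and the latency formula are illustrative assumptions, not the report's exact formulation.

    # Hypothetical sketch of the token-slicing search described in the abstract.
    # Assumes `slice_latency(k)` (profiled time to process a k-token slice on one
    # pipeline stage) is non-decreasing in k; the end-to-end latency is modeled as
    #   sum_i t(s_i) + (num_stages - 1) * max_i t(s_i),
    # i.e., the bottleneck slice must still traverse the remaining stages.
    def optimal_token_slicing(seq_len, num_stages, slice_latency):
        best_latency, best_slices = float("inf"), []
        # Enumerate the candidate bottleneck slice width; every slice is then
        # constrained to be no wider than this bound.
        for max_width in range(1, seq_len + 1):
            t_max = slice_latency(max_width)
            # dp[p]: minimal sum of slice latencies covering the first p tokens
            # using slices of width <= max_width; choice[p]: width of the last slice.
            dp = [0.0] + [float("inf")] * seq_len
            choice = [0] * (seq_len + 1)
            for p in range(1, seq_len + 1):
                for width in range(1, min(max_width, p) + 1):
                    cand = dp[p - width] + slice_latency(width)
                    if cand < dp[p]:
                        dp[p], choice[p] = cand, width
            latency = dp[seq_len] + (num_stages - 1) * t_max
            if latency < best_latency:
                # Recover the slice widths by walking back through the choices.
                slices, p = [], seq_len
                while p > 0:
                    slices.append(choice[p])
                    p -= choice[p]
                best_latency, best_slices = latency, slices[::-1]
        return best_latency, best_slices

    # Toy usage with an assumed affine cost model (hypothetical numbers):
    # a fixed per-slice overhead plus a per-token compute cost.
    latency, slices = optimal_token_slicing(
        seq_len=32, num_stages=4, slice_latency=lambda k: 0.5 + 0.1 * k
    )
    print(latency, slices)

The outer enumeration fixes the slowest slice while the inner dynamic program minimizes the remaining pipeline time, then the best candidate overall is taken; the report's actual algorithm and cost model may differ in detail.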

Advisor: Ion Stoica


BibTeX citation:

@mastersthesis{Li:EECS-2021-258,
    Author= {Li, Zhuohan and Zhuang, Siyuan and Guo, Shiyuan and Zhuo, Danyang and Zhang, Hao and Song, Dawn and Stoica, Ion},
    Title= {TeraPipe: Token-Level Pipeline Parallelism for Training Large-Scale Language Models},
    School= {EECS Department, University of California, Berkeley},
    Year= {2021},
    Month= {Dec},
    Url= {http://www2.eecs.berkeley.edu/Pubs/TechRpts/2021/EECS-2021-258.html},
    Number= {UCB/EECS-2021-258},
    Abstract= {Model parallelism has become a necessity for training modern large-scale deep language models. In this work, we identify a new and orthogonal dimension from existing model parallel approaches: it is possible to perform pipeline parallelism within a single training sequence for Transformer-based language models thanks to its autoregressive property. This enables a more fine-grained pipeline compared with previous work. With this key idea, we design TeraPipe, a high-performance token-level pipeline parallel algorithm for synchronous model-parallel training of Transformer-based language models. We develop a novel dynamic programming-based algorithm to calculate the optimal pipelining execution scheme given a specific model and cluster configuration. We show that TeraPipe can speed up the training by 5.0x for the largest GPT-3 model with 175 billion parameters on an AWS cluster with 48 p3.16xlarge instances compared with state-of-the-art model-parallel methods.},
}

EndNote citation:

%0 Thesis
%A Li, Zhuohan 
%A Zhuang, Siyuan 
%A Guo, Shiyuan 
%A Zhuo, Danyang 
%A Zhang, Hao 
%A Song, Dawn 
%A Stoica, Ion 
%T TeraPipe: Token-Level Pipeline Parallelism for Training Large-Scale Language Models
%I EECS Department, University of California, Berkeley
%D 2021
%8 December 16
%@ UCB/EECS-2021-258
%U http://www2.eecs.berkeley.edu/Pubs/TechRpts/2021/EECS-2021-258.html
%F Li:EECS-2021-258