Zhuohan Li, Siyuan Zhuang, Shiyuan Guo, Danyang Zhuo, Hao Zhang, Dawn Song and Ion Stoica
EECS Department
University of California, Berkeley
Technical Report No. UCB/EECS-2021-258
December 16, 2021
http://www2.eecs.berkeley.edu/Pubs/TechRpts/2021/EECS-2021-258.pdf
Model parallelism has become a necessity for training modern large-scale deep language models. In this work, we identify a new dimension that is orthogonal to existing model-parallel approaches: thanks to the autoregressive property of Transformer-based language models, it is possible to perform pipeline parallelism within a single training sequence. This enables a finer-grained pipeline than previous work. With this key idea, we design TeraPipe, a high-performance token-level pipeline-parallel algorithm for synchronous model-parallel training of Transformer-based language models. We develop a novel dynamic-programming algorithm that computes the optimal pipelining execution scheme for a given model and cluster configuration. We show that TeraPipe speeds up training of the largest GPT-3 model, with 175 billion parameters, by 5.0x over state-of-the-art model-parallel methods on an AWS cluster of 48 p3.16xlarge instances.
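The dynamic-programming step mentioned above can be illustrated with a small, self-contained sketch. The Python code below is hypothetical and not TeraPipe's actual implementation: it assumes a user-supplied cost model t(s) giving the time one pipeline stage spends on a slice of s tokens (assumed non-decreasing in s), and it approximates total pipeline latency for a K-stage pipeline as the sum of per-slice times plus (K - 1) times the largest slice time. Under these assumptions, it enumerates a cap on the largest slice and runs a prefix dynamic program to choose token-slice boundaries within a single training sequence.

    # Hypothetical sketch, not TeraPipe's implementation.
    # Cost model and objective are illustrative:
    #   total_time ~= sum_i t(s_i) + (num_stages - 1) * max_i t(s_i)
    # where s_1, ..., s_M are the chosen slice lengths and t(s) is the
    # per-stage processing time of a slice of s tokens (non-decreasing in s).

    def optimal_slices(seq_len, num_stages, t):
        """Return (estimated_time, slice_lengths) minimizing the objective above."""
        best_time, best_slices = float("inf"), None

        # Enumerate a cap on the largest slice; for each cap, run a prefix DP.
        for cap_len in range(1, seq_len + 1):
            cap_time = t(cap_len)
            # dp[j] = (min sum of slice times covering the first j tokens, slicing)
            dp = [(float("inf"), None)] * (seq_len + 1)
            dp[0] = (0.0, [])
            for j in range(1, seq_len + 1):
                for s in range(1, min(cap_len, j) + 1):
                    prev_sum, prev_slices = dp[j - s]
                    if prev_sum + t(s) < dp[j][0]:
                        dp[j] = (prev_sum + t(s), prev_slices + [s])
            total, slices = dp[seq_len]
            if slices is not None:
                total += (num_stages - 1) * cap_time
                if total < best_time:
                    best_time, best_slices = total, slices

        return best_time, best_slices

    if __name__ == "__main__":
        # Toy cost model: a fixed per-slice overhead plus a per-token cost, so
        # many tiny slices pay overhead while one giant slice serializes the
        # pipeline.  Real costs would be measured on the target hardware.
        cost = lambda s: 1.0 + 0.1 * s
        time, slices = optimal_slices(seq_len=64, num_stages=8, t=cost)
        print(f"estimated time: {time:.1f}, slice lengths: {slices}")

The sketch captures the trade-off described in the abstract: many small slices keep all pipeline stages busy but pay a per-slice overhead, while a few large slices waste time filling and draining the pipeline.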
Advisor: Ion Stoica
BibTeX citation:
@mastersthesis{Li:EECS-2021-258,
    Author = {Li, Zhuohan and Zhuang, Siyuan and Guo, Shiyuan and Zhuo, Danyang and Zhang, Hao and Song, Dawn and Stoica, Ion},
    Title = {TeraPipe: Token-Level Pipeline Parallelism for Training Large-Scale Language Models},
    School = {EECS Department, University of California, Berkeley},
    Year = {2021},
    Month = {Dec},
    URL = {http://www2.eecs.berkeley.edu/Pubs/TechRpts/2021/EECS-2021-258.html},
    Number = {UCB/EECS-2021-258},
    Abstract = {Model parallelism has become a necessity for training modern large-scale deep language models. In this work, we identify a new and orthogonal dimension from existing model parallel approaches: it is possible to perform pipeline parallelism within a single training sequence for Transformer-based language models thanks to its autoregressive property. This enables a more fine-grained pipeline compared with previous work. With this key idea, we design TeraPipe, a high-performance token-level pipeline parallel algorithm for synchronous model-parallel training of Transformer-based language models. We develop a novel dynamic programming-based algorithm to calculate the optimal pipelining execution scheme given a specific model and cluster configuration. We show that TeraPipe can speed up the training by 5.0x for the largest GPT-3 model with 175 billion parameters on an AWS cluster with 48 p3.16xlarge instances compared with state-of-the-art model-parallel methods.}
}
EndNote citation:
%0 Thesis
%A Li, Zhuohan
%A Zhuang, Siyuan
%A Guo, Shiyuan
%A Zhuo, Danyang
%A Zhang, Hao
%A Song, Dawn
%A Stoica, Ion
%T TeraPipe: Token-Level Pipeline Parallelism for Training Large-Scale Language Models
%I EECS Department, University of California, Berkeley
%D 2021
%8 December 16
%@ UCB/EECS-2021-258
%U http://www2.eecs.berkeley.edu/Pubs/TechRpts/2021/EECS-2021-258.html
%F Li:EECS-2021-258