TeraPipe: Token-Level Pipeline Parallelism for Training Large-Scale Language Models

Zhuohan Li, Siyuan Zhuang, Shiyuan Guo, Danyang Zhuo, Hao Zhang, Dawn Song and Ion Stoica

EECS Department
University of California, Berkeley
Technical Report No. UCB/EECS-2021-258
December 16, 2021

http://www2.eecs.berkeley.edu/Pubs/TechRpts/2021/EECS-2021-258.pdf

Model parallelism has become a necessity for training modern large-scale deep language models. In this work, we identify a new dimension that is orthogonal to existing model-parallel approaches: pipeline parallelism can be performed within a single training sequence for Transformer-based language models thanks to their autoregressive property. This enables a finer-grained pipeline than previous work. With this key idea, we design TeraPipe, a high-performance token-level pipeline-parallel algorithm for synchronous model-parallel training of Transformer-based language models. We develop a novel dynamic-programming-based algorithm to compute the optimal pipelining execution scheme for a given model and cluster configuration. We show that TeraPipe can speed up training by 5.0x for the largest GPT-3 model, with 175 billion parameters, on an AWS cluster of 48 p3.16xlarge instances compared with state-of-the-art model-parallel methods.
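
As a rough illustration of the two ideas in the abstract, the sketch below slices one training sequence into token-level pipeline units and uses a small dynamic program to choose the slice boundaries. Everything in it is an assumption made for the example: the analytical cost model in slice_time, the toy SEQ_LEN and NUM_STAGES values, and the cap-enumeration DP are not TeraPipe's implementation (the report's system works with measured per-slice execution times on real hardware). The latency estimate uses the standard pipeline bound sum(t_i) + (K - 1) * max(t_i) for K stages when slice i takes time t_i on every stage.

    # Illustrative sketch only: toy cost model and hypothetical parameters,
    # not TeraPipe's implementation.

    SEQ_LEN = 32          # tokens in one training sequence (toy size)
    NUM_STAGES = 4        # pipeline stages (layer partitions across devices)


    def slice_time(start: int, length: int) -> float:
        """Assumed per-stage forward time of the token slice [start, start+length).

        Toy model: a fixed launch overhead, a term linear in slice length
        (feed-forward work), and a term proportional to length * context
        (causal attention over all earlier tokens). A real system would
        measure these times instead of modeling them.
        """
        context = start + length
        return 1.0 + 0.5 * length + 0.02 * length * context


    def pipeline_latency(slice_lengths):
        """Forward latency of a NUM_STAGES-stage pipeline over the given slices.

        With per-slice time t_i identical on every stage, the classic pipeline
        bound is sum(t_i) + (K - 1) * max(t_i).
        """
        starts = [sum(slice_lengths[:i]) for i in range(len(slice_lengths))]
        times = [slice_time(s, l) for s, l in zip(starts, slice_lengths)]
        return sum(times) + (NUM_STAGES - 1) * max(times)


    def optimal_slicing(seq_len: int):
        """Dynamic program over slice boundaries (a sketch of the DP idea).

        For each candidate cap on the largest slice time, find the partition
        that minimizes total per-stage work subject to that cap, then keep
        the cap whose partition gives the best overall pipeline latency.
        """
        # Candidate caps: every achievable single-slice time.
        caps = sorted({slice_time(s, l)
                       for s in range(seq_len)
                       for l in range(1, seq_len - s + 1)})

        best_latency, best_slices = float("inf"), None
        for cap in caps:
            # dp[i] = (min total time to cover tokens [0, i), slice lengths used)
            dp = [(0.0, [])] + [(float("inf"), None)] * seq_len
            for i in range(1, seq_len + 1):
                for length in range(1, i + 1):
                    t = slice_time(i - length, length)
                    if t > cap or dp[i - length][1] is None:
                        continue
                    cand = dp[i - length][0] + t
                    if cand < dp[i][0]:
                        dp[i] = (cand, dp[i - length][1] + [length])
            if dp[seq_len][1] is None:
                continue
            latency = pipeline_latency(dp[seq_len][1])
            if latency < best_latency:
                best_latency, best_slices = latency, dp[seq_len][1]
        return best_slices, best_latency


    if __name__ == "__main__":
        whole = pipeline_latency([SEQ_LEN])          # no token-level slicing
        slices, sliced = optimal_slicing(SEQ_LEN)    # DP-chosen slicing
        print(f"one slice of {SEQ_LEN} tokens: latency {whole:.1f}")
        print(f"slices {slices}: latency {sliced:.1f}")

Running the sketch shows the trade-off the abstract alludes to: a single large slice keeps per-slice overhead low but forces each pipeline stage to wait for the whole sequence, while many small slices shrink the pipeline bubble at the cost of more overhead; the dynamic program balances the two.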

Advisor: Ion Stoica


BibTeX citation:

@mastersthesis{Li:EECS-2021-258,
    Author = {Li, Zhuohan and Zhuang, Siyuan and Guo, Shiyuan and Zhuo, Danyang and Zhang, Hao and Song, Dawn and Stoica, Ion},
    Title = {TeraPipe: Token-Level Pipeline Parallelism for Training Large-Scale Language Models},
    School = {EECS Department, University of California, Berkeley},
    Year = {2021},
    Month = {Dec},
    URL = {http://www2.eecs.berkeley.edu/Pubs/TechRpts/2021/EECS-2021-258.html},
    Number = {UCB/EECS-2021-258},
    Abstract = {Model parallelism has become a necessity for training modern large-scale deep language models. In this work, we identify a new dimension that is orthogonal to existing model-parallel approaches: pipeline parallelism can be performed within a single training sequence for Transformer-based language models thanks to their autoregressive property. This enables a finer-grained pipeline than previous work. With this key idea, we design TeraPipe, a high-performance token-level pipeline-parallel algorithm for synchronous model-parallel training of Transformer-based language models. We develop a novel dynamic-programming-based algorithm to compute the optimal pipelining execution scheme for a given model and cluster configuration. We show that TeraPipe can speed up training by 5.0x for the largest GPT-3 model, with 175 billion parameters, on an AWS cluster of 48 p3.16xlarge instances compared with state-of-the-art model-parallel methods.}
}

EndNote citation:

%0 Thesis
%A Li, Zhuohan
%A Zhuang, Siyuan
%A Guo, Shiyuan
%A Zhuo, Danyang
%A Zhang, Hao
%A Song, Dawn
%A Stoica, Ion
%T TeraPipe: Token-Level Pipeline Parallelism for Training Large-Scale Language Models
%I EECS Department, University of California, Berkeley
%D 2021
%8 December 16
%@ UCB/EECS-2021-258
%U http://www2.eecs.berkeley.edu/Pubs/TechRpts/2021/EECS-2021-258.html
%F Li:EECS-2021-258