Zhuohan Li, Siyuan Zhuang, Shiyuan Guo, Danyang Zhuo, Hao Zhang, Dawn Song and Ion Stoica
EECS Department
University of California, Berkeley
Technical Report No. UCB/EECS-2021-258
December 16, 2021
http://www2.eecs.berkeley.edu/Pubs/TechRpts/2021/EECS-2021-258.pdf
Model parallelism has become a necessity for training modern large-scale deep language models. In this work, we identify a new dimension that is orthogonal to existing model-parallel approaches: thanks to the autoregressive property of Transformer-based language models, it is possible to perform pipeline parallelism within a single training sequence. This enables a finer-grained pipeline than previous work. With this key idea, we design TeraPipe, a high-performance token-level pipeline-parallel algorithm for synchronous model-parallel training of Transformer-based language models. We develop a novel dynamic-programming algorithm that computes the optimal pipelining execution scheme for a given model and cluster configuration. We show that TeraPipe speeds up training of the largest GPT-3 model, with 175 billion parameters, by 5.0x over state-of-the-art model-parallel methods on an AWS cluster of 48 p3.16xlarge instances.
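The dynamic-programming step mentioned above can be illustrated with a small, self-contained sketch. The Python code below is hypothetical and not TeraPipe's actual implementation: it assumes a user-supplied cost model t(s) giving the time one pipeline stage spends on a slice of s tokens (assumed non-decreasing in s), and it approximates total pipeline latency for a K-stage pipeline as the sum of per-slice times plus (K - 1) times the largest slice time. Under these assumptions, it enumerates a cap on the largest slice and runs a prefix dynamic program to choose token-slice boundaries within a single training sequence.

    # Hypothetical sketch, not TeraPipe's implementation.
    # Cost model and objective are illustrative:
    #   total_time ~= sum_i t(s_i) + (num_stages - 1) * max_i t(s_i)
    # where s_1, ..., s_M are the chosen slice lengths and t(s) is the
    # per-stage processing time of a slice of s tokens (non-decreasing in s).

    def optimal_slices(seq_len, num_stages, t):
        """Return (estimated_time, slice_lengths) minimizing the objective above."""
        best_time, best_slices = float("inf"), None

        # Enumerate a cap on the largest slice; for each cap, run a prefix DP.
        for cap_len in range(1, seq_len + 1):
            cap_time = t(cap_len)
            # dp[j] = (min sum of slice times covering the first j tokens, slicing)
            dp = [(float("inf"), None)] * (seq_len + 1)
            dp[0] = (0.0, [])
            for j in range(1, seq_len + 1):
                for s in range(1, min(cap_len, j) + 1):
                    prev_sum, prev_slices = dp[j - s]
                    if prev_sum + t(s) < dp[j][0]:
                        dp[j] = (prev_sum + t(s), prev_slices + [s])
            total, slices = dp[seq_len]
            if slices is not None:
                total += (num_stages - 1) * cap_time
                if total < best_time:
                    best_time, best_slices = total, slices

        return best_time, best_slices

    if __name__ == "__main__":
        # Toy cost model: a fixed per-slice overhead plus a per-token cost, so
        # many tiny slices pay overhead while one giant slice serializes the
        # pipeline.  Real costs would be measured on the target hardware.
        cost = lambda s: 1.0 + 0.1 * s
        time, slices = optimal_slices(seq_len=64, num_stages=8, t=cost)
        print(f"estimated time: {time:.1f}, slice lengths: {slices}")

The sketch captures the trade-off described in the abstract: many small slices keep all pipeline stages busy but pay a per-slice overhead, while a few large slices waste time filling and draining the pipeline.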
Advisor: Ion Stoica
BibTeX citation:
@mastersthesis{Li:EECS-2021-258,
    Author = {Li, Zhuohan and Zhuang, Siyuan and Guo, Shiyuan and Zhuo, Danyang and Zhang, Hao and Song, Dawn and Stoica, Ion},
    Title = {TeraPipe: Token-Level Pipeline Parallelism for Training Large-Scale Language Models},
    School = {EECS Department, University of California, Berkeley},
    Year = {2021},
    Month = {Dec},
    URL = {http://www2.eecs.berkeley.edu/Pubs/TechRpts/2021/EECS-2021-258.html},
    Number = {UCB/EECS-2021-258},
    Abstract = {Model parallelism has become a necessity for training modern large-scale deep language models. In this work, we identify a new and orthogonal dimension from existing model parallel approaches: it is possible to perform pipeline parallelism within a single training sequence for Transformer-based language models thanks to its autoregressive property. This enables a more fine-grained pipeline compared with previous work. With this key idea, we design TeraPipe, a high-performance token-level pipeline parallel algorithm for synchronous model-parallel training of Transformer-based language models. We develop a novel dynamic programming-based algorithm to calculate the optimal pipelining execution scheme given a specific model and cluster configuration. We show that TeraPipe can speed up the training by 5.0x for the largest GPT-3 model with 175 billion parameters on an AWS cluster with 48 p3.16xlarge instances compared with state-of-the-art model-parallel methods.}
}
EndNote citation:
%0 Thesis
%A Li, Zhuohan
%A Zhuang, Siyuan
%A Guo, Shiyuan
%A Zhuo, Danyang
%A Zhang, Hao
%A Song, Dawn
%A Stoica, Ion
%T TeraPipe: Token-Level Pipeline Parallelism for Training Large-Scale Language Models
%I EECS Department, University of California, Berkeley
%D 2021
%8 December 16
%@ UCB/EECS-2021-258
%U http://www2.eecs.berkeley.edu/Pubs/TechRpts/2021/EECS-2021-258.html
%F Li:EECS-2021-258