Hardware Software Co-design and Architectural Optimization of Deep Learning Models for Natural Language Processing
Thanakul Wattanawong and Kurt Keutzer
EECS Department, University of California, Berkeley
Technical Report No. UCB/EECS-2023-92
May 11, 2023
http://www2.eecs.berkeley.edu/Pubs/TechRpts/2023/EECS-2023-92.pdf
Transformer models are achieving state-of-the-art performance across tasks in natural language processing, computer vision, and other domains. However, the amount of compute required to perform inference with Transformers has grown significantly over the past few years, making them unusable at the edge or in low-power electronics. There is therefore an increasing need to improve their efficiency, with opportunities ranging from architectural modifications to the design of domain-specific accelerators (DSAs). In this work we present two approaches to optimizing the inference of Transformer models. The first is a hardware-software co-design approach that jointly optimizes the hardware architecture alongside the Transformer architecture. We present a framework based on Neural Architecture Search (NAS) and evolutionary search that practitioners can use to find the best-matched hardware configuration and Transformer architecture satisfying their performance criteria. We optimize the Transformer for both inference latency and power consumption using the Energy-Delay Product (EDP) metric, and find that the framework can attain a 2.2× EDP improvement over the baseline while tolerating a 0.1-point perplexity degradation, and 10.6× with a 1-point degradation. The survey paper that this work contributed to further combined these results with other improvements for an overall 88.7× speedup. Building on insights from that survey, the second approach consists of architectural optimizations of the feedforward network in small Transformers, designed to reduce inference FLOPs and energy consumption. We find that the feedforward network, which accounts for roughly 60% of the parameter count and inference FLOPs of T5-Mini, can be removed with only a 2.7-point loss on MNLI-mm, a standard natural language inference benchmark. Along with a number of other ablations, we find that structural weight reparametrization can reduce inference FLOPs and parameters by about 30% with only a one-point drop on MNLI-mm.
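For reference, the Energy-Delay Product used as the co-design objective is simply the energy consumed per inference multiplied by the inference latency, so lowering either quantity improves the score. The sketch below illustrates how candidate hardware/architecture pairs might be ranked by EDP under a perplexity constraint; it is a minimal illustration only, and the class and function names are hypothetical rather than the thesis's actual framework API.

    from dataclasses import dataclass

    @dataclass
    class Candidate:
        """A hypothetical (hardware config, Transformer architecture) pair
        produced by a NAS + evolutionary search loop."""
        name: str
        energy_joules: float    # estimated energy per inference
        latency_seconds: float  # estimated inference latency
        perplexity: float       # predicted language-modeling perplexity

    def edp(c: Candidate) -> float:
        # Energy-Delay Product: energy per inference times latency.
        return c.energy_joules * c.latency_seconds

    def best_under_constraint(candidates, baseline_ppl, max_ppl_degradation):
        """Return the lowest-EDP candidate whose perplexity stays within
        `max_ppl_degradation` points of the baseline (e.g. 0.1 or 1.0)."""
        feasible = [c for c in candidates
                    if c.perplexity - baseline_ppl <= max_ppl_degradation]
        return min(feasible, key=edp) if feasible else None

    # Toy usage with made-up numbers.
    if __name__ == "__main__":
        pool = [
            Candidate("baseline",    energy_joules=1.0, latency_seconds=1.00, perplexity=20.0),
            Candidate("smaller-ffn", energy_joules=0.6, latency_seconds=0.75, perplexity=20.1),
            Candidate("aggressive",  energy_joules=0.3, latency_seconds=0.35, perplexity=21.0),
        ]
        print(best_under_constraint(pool, baseline_ppl=20.0, max_ppl_degradation=0.1))

With a 0.1-point budget the "smaller-ffn" candidate would be selected; relaxing the budget to 1 point admits the more aggressive configuration with a much lower EDP, mirroring the 2.2× versus 10.6× trade-off reported above.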
BibTeX citation:
@mastersthesis{Wattanawong:EECS-2023-92,
    Author = {Wattanawong, Thanakul and Keutzer, Kurt},
    Title  = {Hardware Software Co-design and Architectural Optimization of Deep Learning Models for Natural Language Processing},
    School = {EECS Department, University of California, Berkeley},
    Year   = {2023},
    Month  = {May},
    Url    = {http://www2.eecs.berkeley.edu/Pubs/TechRpts/2023/EECS-2023-92.html},
    Number = {UCB/EECS-2023-92}
}
EndNote citation:
%0 Thesis
%A Wattanawong, Thanakul
%A Keutzer, Kurt
%T Hardware Software Co-design and Architectural Optimization of Deep Learning Models for Natural Language Processing
%I EECS Department, University of California, Berkeley
%D 2023
%8 May 11
%@ UCB/EECS-2023-92
%U http://www2.eecs.berkeley.edu/Pubs/TechRpts/2023/EECS-2023-92.html
%F Wattanawong:EECS-2023-92