Thanakul Wattanawong and Kurt Keutzer

EECS Department, University of California, Berkeley

Technical Report No. UCB/EECS-2023-92

May 11, 2023

http://www2.eecs.berkeley.edu/Pubs/TechRpts/2023/EECS-2023-92.pdf

Transformer models are achieving state-of-the-art performance across tasks in natural language processing, computer vision, and other domains. However, the amount of compute required to perform inference with Transformers has grown significantly over the past few years, making them unusable at the edge or in low-power electronics. There is therefore an increasing need to improve their efficiency, with opportunities ranging from architectural modifications to domain-specific accelerators (DSAs). In this work we present two approaches to optimizing the inference of Transformer models. The first is a hardware-software co-design approach that jointly optimizes the hardware architecture alongside the Transformer architecture. We present a framework based on Neural Architecture Search (NAS) and evolutionary search that practitioners can use to find the best-matched hardware configuration and Transformer architecture satisfying the required performance criteria. We optimize the Transformer for both inference latency and power consumption using a metric called the Energy-Delay Product (EDP), and find that the framework attains a 2.2× EDP improvement over the baseline while tolerating a 0.1-point perplexity degradation, and 10.6× with a 1-point degradation. The survey paper that this work contributed to combined these results with other improvements for an overall 88.7× speedup. Building on insights gained from the survey, the second approach explores architectural optimizations of the feedforward networks in small Transformers, aimed at reducing inference FLOPs and energy consumption. We find that the feedforward network, which accounts for roughly 60% of the parameter count and inference FLOPs of T5-Mini, can be removed with only a 2.7-point loss on MNLI-mm, a standard natural language inference benchmark. Along with a number of other ablations, we find that structural weight reparametrization can reduce inference FLOPs and parameters by about 30% with only a one-point drop on MNLI-mm.
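
As a minimal illustration of the Energy-Delay Product objective mentioned in the abstract, the Python sketch below computes EDP from a candidate's power and latency and ranks candidates by it. The function name, candidate names, and numbers are illustrative assumptions, not measurements or interfaces from the report:

    def energy_delay_product(power_watts, latency_s):
        # EDP = energy * delay, where energy = average power * latency.
        energy_joules = power_watts * latency_s
        return energy_joules * latency_s

    # Hypothetical (hardware, Transformer) candidates from a co-design search.
    candidates = [
        {"name": "baseline",    "power_watts": 2.0, "latency_s": 0.050},
        {"name": "co-designed", "power_watts": 1.5, "latency_s": 0.030},
    ]
    for c in candidates:
        c["edp"] = energy_delay_product(c["power_watts"], c["latency_s"])
    best = min(candidates, key=lambda c: c["edp"])
    print(best["name"], best["edp"])

The structural weight reparametrization result can likewise be sketched. The example below merges two parallel linear projections into a single projection at inference time, one common form of structural reparametrization; the report's specific scheme and dimensions may differ, and this NumPy code only illustrates the general idea under those assumptions:

    import numpy as np

    rng = np.random.default_rng(0)
    d_model, d_ff = 256, 1024            # illustrative sizes, not the report's
    W1 = rng.standard_normal((d_model, d_ff))
    W2 = rng.standard_normal((d_model, d_ff))
    x = rng.standard_normal((8, d_model))

    y_train = x @ W1 + x @ W2            # two matmuls during training
    W_merged = W1 + W2                   # merge once, offline
    y_infer = x @ W_merged               # one matmul and one weight matrix at inference
    assert np.allclose(y_train, y_infer)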


BibTeX citation:

@mastersthesis{Wattanawong:EECS-2023-92,
    Author= {Wattanawong, Thanakul and Keutzer, Kurt},
    Title= {Hardware Software Co-design and Architectural Optimization of Deep Learning Models for Natural Language Processing},
    School= {EECS Department, University of California, Berkeley},
    Year= {2023},
    Month= {May},
    Url= {http://www2.eecs.berkeley.edu/Pubs/TechRpts/2023/EECS-2023-92.html},
    Number= {UCB/EECS-2023-92},
    Abstract= {Transformer models are achieving state-of-the-art performance across tasks in natural
language processing, computer vision, and other domains. However, the amount of compute required to
perform inference with Transformers has grown significantly over the past few years, making them
unusable at the edge or in low-power electronics. There is therefore an increasing need to improve
their efficiency, with opportunities ranging from architectural modifications to domain-specific
accelerators (DSAs). In this work we present two approaches to optimizing the inference of
Transformer models. The first is a hardware-software co-design approach that jointly optimizes the
hardware architecture alongside the Transformer architecture. We present a framework based on
Neural Architecture Search (NAS) and evolutionary search that practitioners can use to find the
best-matched hardware configuration and Transformer architecture satisfying the required
performance criteria. We optimize the Transformer for both inference latency and power consumption
using a metric called the Energy-Delay Product (EDP), and find that the framework attains a 2.2×
EDP improvement over the baseline while tolerating a 0.1-point perplexity degradation, and 10.6×
with a 1-point degradation. The survey paper that this work contributed to combined these results
with other improvements for an overall 88.7× speedup. Building on insights gained from the survey,
the second approach explores architectural optimizations of the feedforward networks in small
Transformers, aimed at reducing inference FLOPs and energy consumption. We find that the
feedforward network, which accounts for roughly 60% of the parameter count and inference FLOPs of
T5-Mini, can be removed with only a 2.7-point loss on MNLI-mm, a standard natural language
inference benchmark. Along with a number of other ablations, we find that structural weight
reparametrization can reduce inference FLOPs and parameters by about 30% with only a one-point
drop on MNLI-mm.},
}

EndNote citation:

%0 Thesis
%A Wattanawong, Thanakul 
%A Keutzer, Kurt 
%T Hardware Software Co-design and Architectural Optimization of Deep Learning Models for Natural Language Processing
%I EECS Department, University of California, Berkeley
%D 2023
%8 May 11
%@ UCB/EECS-2023-92
%U http://www2.eecs.berkeley.edu/Pubs/TechRpts/2023/EECS-2023-92.html
%F Wattanawong:EECS-2023-92