End-to-end Model Inference and Training on Gemmini
Pranav Prakash
EECS Department
University of California, Berkeley
Technical Report No. UCB/EECS-2021-37
May 9, 2021
http://www2.eecs.berkeley.edu/Pubs/TechRpts/2021/EECS-2021-37.pdf
Gemmini is an open-source generator of systolic-array architectures, enabling systematic exploration of the accelerator design space. However, the lack of existing software support for Gemmini in machine-learning frameworks (e.g., PyTorch or TensorFlow) poses a significant bottleneck to neural-network model evaluation and serving.
To address these limitations, we present the design and implementation of a Gemmini backend for Microsoft's ONNX Runtime engine. By extending ONNX Runtime's support for heterogeneous architectures and model-graph transformations, the Gemmini backend accelerates the primary computational kernels (matrix multiplications, convolutions, and pooling) while ensuring interoperability between the channel layouts expected by Gemmini and those used by the rest of ONNX Runtime. We benchmark our implementation on a broad set of networks, including ResNet, BERT, and Mask R-CNN; our results show that the Gemmini backend is a performant drop-in replacement for accelerating real-world workloads.
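The channel-layout interoperability mentioned above can be sketched with a simple transpose at the backend boundary. This is an illustrative sketch only: the assumption that Gemmini's convolution kernels expect NHWC while ONNX Runtime operates on NCHW tensors, and the helper names `nchw_to_nhwc`/`nhwc_to_nchw`, are not taken from the report.

```python
import numpy as np

def nchw_to_nhwc(x: np.ndarray) -> np.ndarray:
    """Move the channel axis last: (N, C, H, W) -> (N, H, W, C)."""
    return np.ascontiguousarray(np.transpose(x, (0, 2, 3, 1)))

def nhwc_to_nchw(x: np.ndarray) -> np.ndarray:
    """Move the channel axis back: (N, H, W, C) -> (N, C, H, W)."""
    return np.ascontiguousarray(np.transpose(x, (0, 3, 1, 2)))

# Round-tripping a tensor through both layouts is lossless, so the
# backend can convert at kernel boundaries without changing results.
x = np.arange(2 * 3 * 4 * 5, dtype=np.int8).reshape(2, 3, 4, 5)
y = nchw_to_nhwc(x)
assert y.shape == (2, 4, 5, 3)
assert np.array_equal(nhwc_to_nchw(y), x)
```

In a real backend these transposes would be inserted (or elided, when layouts already match) during the graph-transformation pass rather than performed eagerly per call.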
Advisor: Krste Asanović
BibTeX citation:
@mastersthesis{Prakash:EECS-2021-37,
    Author = {Prakash, Pranav},
    Title = {End-to-end Model Inference and Training on Gemmini},
    School = {EECS Department, University of California, Berkeley},
    Year = {2021},
    Month = {May},
    URL = {http://www2.eecs.berkeley.edu/Pubs/TechRpts/2021/EECS-2021-37.html},
    Number = {UCB/EECS-2021-37},
    Abstract = {Gemmini is an open-source generator of systolic-array architectures, enabling systematic exploration of the accelerator design space. However, the lack of existing software support for Gemmini in machine-learning frameworks (e.g., PyTorch or TensorFlow) poses a significant bottleneck to neural-network model evaluation and serving. To address these limitations, we present the design and implementation of a Gemmini backend for Microsoft's ONNX Runtime engine. By extending ONNX Runtime's support for heterogeneous architectures and model-graph transformations, the Gemmini backend accelerates the primary computational kernels (matrix multiplications, convolutions, and pooling) while ensuring interoperability between the channel layouts expected by Gemmini and those used by the rest of ONNX Runtime. We benchmark our implementation on a broad set of networks, including ResNet, BERT, and Mask R-CNN; our results show that the Gemmini backend is a performant drop-in replacement for accelerating real-world workloads.}
}
EndNote citation:
%0 Thesis
%A Prakash, Pranav
%T End-to-end Model Inference and Training on Gemmini
%I EECS Department, University of California, Berkeley
%D 2021
%8 May 9
%@ UCB/EECS-2021-37
%U http://www2.eecs.berkeley.edu/Pubs/TechRpts/2021/EECS-2021-37.html
%F Prakash:EECS-2021-37