End-to-end Model Inference and Training on Gemmini
Pranav Prakash
EECS Department, University of California, Berkeley
Technical Report No. UCB/EECS-2021-37
May 9, 2021
http://www2.eecs.berkeley.edu/Pubs/TechRpts/2021/EECS-2021-37.pdf
Gemmini is an open-source generator for systolic-array architectures, allowing for systematic exploration of the accelerator design space. However, the lack of existing software support for Gemmini in machine-learning frameworks (e.g., PyTorch or TensorFlow) is a significant bottleneck to neural network model evaluation and serving.
To address these limitations, we present the design and implementation of a Gemmini backend for Microsoft's ONNX Runtime engine. By extending ONNX Runtime's support for heterogeneous architectures and model graph transformations, the Gemmini backend accelerates the primary computational kernels (matrix multiplications, convolutions, and pooling) while ensuring interoperability between the channel layouts expected by Gemmini and those used by the rest of ONNX Runtime. We benchmark our implementation on a broad set of networks, including ResNet, BERT, and Mask R-CNN; our results show that the Gemmini backend is a performant drop-in replacement for accelerating real-world workloads.
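For context, a minimal sketch of how such a backend would be invoked through ONNX Runtime's execution-provider interface is shown below. The provider name "GemminiExecutionProvider" and the model file name are illustrative assumptions; the report does not specify the registered identifier.

    # Minimal sketch, assuming a Gemmini-enabled ONNX Runtime build.
    # "GemminiExecutionProvider" is a hypothetical provider name.
    import numpy as np
    import onnxruntime as ort

    # List providers in priority order; nodes the Gemmini provider cannot
    # claim (anything beyond matmul, conv, and pooling) fall back to the CPU.
    providers = ["GemminiExecutionProvider", "CPUExecutionProvider"]
    session = ort.InferenceSession("resnet50.onnx", providers=providers)

    # Run a single inference on a dummy input batch.
    input_name = session.get_inputs()[0].name
    batch = np.random.rand(1, 3, 224, 224).astype(np.float32)
    outputs = session.run(None, {input_name: batch})

Because partitioning happens at session creation, the same script runs unchanged with or without the accelerator, which is what makes the backend a drop-in replacement.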
Advisor: Krste Asanović
BibTeX citation:
@mastersthesis{Prakash:EECS-2021-37,
    Author = {Prakash, Pranav},
    Title = {End-to-end Model Inference and Training on Gemmini},
    School = {EECS Department, University of California, Berkeley},
    Year = {2021},
    Month = {May},
    Url = {http://www2.eecs.berkeley.edu/Pubs/TechRpts/2021/EECS-2021-37.html},
    Number = {UCB/EECS-2021-37},
    Abstract = {Gemmini is an open-source generator for systolic-array architectures, allowing for systematic exploration of the accelerator design space. However, the lack of existing software support for Gemmini in machine-learning frameworks (e.g., PyTorch or TensorFlow) is a significant bottleneck to neural network model evaluation and serving. To address these limitations, we present the design and implementation of a Gemmini backend for Microsoft's ONNX Runtime engine. By extending ONNX Runtime's support for heterogeneous architectures and model graph transformations, the Gemmini backend accelerates the primary computational kernels (matrix multiplications, convolutions, and pooling) while ensuring interoperability between the channel layouts expected by Gemmini and those used by the rest of ONNX Runtime. We benchmark our implementation on a broad set of networks, including ResNet, BERT, and Mask R-CNN; our results show that the Gemmini backend is a performant drop-in replacement for accelerating real-world workloads.}
}
EndNote citation:
%0 Thesis
%A Prakash, Pranav
%T End-to-end Model Inference and Training on Gemmini
%I EECS Department, University of California, Berkeley
%D 2021
%8 May 9
%@ UCB/EECS-2021-37
%U http://www2.eecs.berkeley.edu/Pubs/TechRpts/2021/EECS-2021-37.html
%F Prakash:EECS-2021-37