End-to-end Model Inference and Training on Gemmini
Pranav Prakash
EECS Department
University of California, Berkeley
Technical Report No. UCB/EECS-2021-37
May 9, 2021
http://www2.eecs.berkeley.edu/Pubs/TechRpts/2021/EECS-2021-37.pdf
Gemmini is an open-source generator of systolic-array architectures, enabling systematic exploration of the accelerator design space. However, the lack of existing software support for Gemmini in machine-learning frameworks (e.g., PyTorch or TensorFlow) poses a significant bottleneck to neural-network model evaluation and serving.
To address these limitations, we present the design and implementation of a Gemmini backend for Microsoft's ONNX Runtime engine. By extending ONNX Runtime's support for heterogeneous architectures and model-graph transformations, the Gemmini backend accelerates the primary computational kernels (matrix multiplications, convolutions, and pooling) while ensuring interoperability between the channel layouts expected by Gemmini and those used by the rest of ONNX Runtime. We benchmark our implementation on a broad set of networks, including ResNet, BERT, and Mask R-CNN; our results show that the Gemmini backend is a performant drop-in replacement for accelerating real-world workloads.
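The channel-layout interoperability mentioned above can be sketched with a simple transpose at the backend boundary. This is an illustrative sketch only: the assumption that Gemmini's convolution kernels expect NHWC while ONNX Runtime operates on NCHW tensors, and the helper names `nchw_to_nhwc`/`nhwc_to_nchw`, are not taken from the report.

```python
import numpy as np

def nchw_to_nhwc(x: np.ndarray) -> np.ndarray:
    """Move the channel axis last: (N, C, H, W) -> (N, H, W, C)."""
    return np.ascontiguousarray(np.transpose(x, (0, 2, 3, 1)))

def nhwc_to_nchw(x: np.ndarray) -> np.ndarray:
    """Move the channel axis back: (N, H, W, C) -> (N, C, H, W)."""
    return np.ascontiguousarray(np.transpose(x, (0, 3, 1, 2)))

# Round-tripping a tensor through both layouts is lossless, so the
# backend can convert at kernel boundaries without changing results.
x = np.arange(2 * 3 * 4 * 5, dtype=np.int8).reshape(2, 3, 4, 5)
y = nchw_to_nhwc(x)
assert y.shape == (2, 4, 5, 3)
assert np.array_equal(nhwc_to_nchw(y), x)
```

In a real backend these transposes would be inserted (or elided, when layouts already match) during the graph-transformation pass rather than performed eagerly per call.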
Advisor: Krste Asanović
BibTeX citation:
@mastersthesis{Prakash:EECS-2021-37,
    Author = {Prakash, Pranav},
    Title = {End-to-end Model Inference and Training on Gemmini},
    School = {EECS Department, University of California, Berkeley},
    Year = {2021},
    Month = {May},
    URL = {http://www2.eecs.berkeley.edu/Pubs/TechRpts/2021/EECS-2021-37.html},
    Number = {UCB/EECS-2021-37},
    Abstract = {Gemmini is an open-source generator of systolic-array architectures, enabling systematic exploration of the accelerator design space. However, the lack of existing software support for Gemmini in machine-learning frameworks (e.g., PyTorch or TensorFlow) poses a significant bottleneck to neural-network model evaluation and serving. To address these limitations, we present the design and implementation of a Gemmini backend for Microsoft's ONNX Runtime engine. By extending ONNX Runtime's support for heterogeneous architectures and model-graph transformations, the Gemmini backend accelerates the primary computational kernels (matrix multiplications, convolutions, and pooling) while ensuring interoperability between the channel layouts expected by Gemmini and those used by the rest of ONNX Runtime. We benchmark our implementation on a broad set of networks, including ResNet, BERT, and Mask R-CNN; our results show that the Gemmini backend is a performant drop-in replacement for accelerating real-world workloads.}
}
EndNote citation:
%0 Thesis
%A Prakash, Pranav
%T End-to-end Model Inference and Training on Gemmini
%I EECS Department, University of California, Berkeley
%D 2021
%8 May 9
%@ UCB/EECS-2021-37
%U http://www2.eecs.berkeley.edu/Pubs/TechRpts/2021/EECS-2021-37.html
%F Prakash:EECS-2021-37