Enabling Full-Stack DNN Accelerator Design and Evaluation on Synthesizable Hardware
Hasan Genc
EECS Department, University of California, Berkeley
Technical Report No. UCB/EECS-2024-161
August 8, 2024
http://www2.eecs.berkeley.edu/Pubs/TechRpts/2024/EECS-2024-161.pdf
The growing diversity of computationally demanding DNN workloads, together with the long-running decline of technology-scaling trends, has motivated the design of a great many diverse specialized hardware accelerators. While these accelerators provide significant improvements to performance and energy consumption — making the modern wave of AI innovation possible — they introduce significant challenges to computer architects, programmers, and hardware designers, due to the difficulty of (i) exploring the very broad design space they represent, (ii) translating such designs rapidly to high-quality RTL and software libraries, and (iii) evaluating such designs in realistic full-system contexts early on in the design process.
Prior work has attempted to address these difficulties by proposing new accelerator design frameworks which allow users to change only a few settings in a config file, or a few lines of a domain-specific language, from which they can rapidly generate new synthesizable hardware, or new high-level models that can guide architectural decisions. Many of these frameworks also attempt to make accelerator design more principled by separating the different concerns which go into accelerator design, so that each can be explored individually, and in relation to other design choices.
However, prior accelerator generators and design frameworks often lack the ability to provide users visibility into the impact that the full system and software stack have upon DNN accelerator performance, such as the potential for outer caches, virtual address translation mechanisms, or host CPUs to bottleneck performance if they have not been carefully tuned along with the accelerator’s functional units or spatial arrays. Prior accelerator design frameworks which separate out different design concerns are also not capable of generating high-quality RTL for both dense and sparse accelerator ASICs, which limits their ability to cover various modern workloads which sparsify DNN layers to improve performance or energy efficiency.
This thesis presents two projects, Gemmini and Stellar, which address these difficulties. Gemmini is a DNN accelerator evaluation framework which, while generating efficient spatial arrays and accelerators, is primarily intended to help users evaluate the impact of SoC components outside of the accelerator itself, such as external caches or virtual address translation mechanisms, upon overall DNN accelerator performance. Stellar is another framework which provides abstractions that help users design and explore different components of both dense and sparse accelerators, while separating out the different concerns that go into accelerator design, such as an accelerator’s functionality, its dataflow, the sparse/dense data formats it supports, its load-balancing strategies, and the private memory buffers it is equipped with. Gemmini-generated dense DNN accelerators achieve 87% of the performance of prior state-of-the-art accelerators such as NVDLA on image classification networks such as ResNet50, and enable insights into how minor changes to system components such as TLBs can improve end-to-end DNN performance by up to 15%. Stellar-generated accelerators achieve up to 92% of the performance of hand-written accelerators, with less than 15% area overhead and power overheads as low as 7% on various DNN layers.
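As a rough illustration of the kind of first-order design-space exploration such frameworks support, the sketch below separates a spatial array's shape and dataflow from the workload and applies a simple analytical cycle model. All names and the cost model are hypothetical, for illustration only; they do not reflect Gemmini's or Stellar's actual APIs.

```python
from dataclasses import dataclass
from math import ceil

# Hypothetical, illustrative names -- not Gemmini's or Stellar's real APIs.

@dataclass
class SpatialArray:
    rows: int
    cols: int
    dataflow: str  # e.g. "weight-stationary" or "output-stationary"

@dataclass
class Workload:
    M: int  # output rows
    N: int  # output columns
    K: int  # reduction dimension

def matmul_cycles(arr: SpatialArray, w: Workload) -> int:
    """First-order cycle estimate for a tiled matmul on a systolic array:
    each (rows x cols) output tile streams K partial sums through the
    array, plus a pipeline-fill latency of rows + cols cycles per tile."""
    tiles = ceil(w.M / arr.rows) * ceil(w.N / arr.cols)
    return tiles * (w.K + arr.rows + arr.cols)

arr = SpatialArray(rows=16, cols=16, dataflow="weight-stationary")
print(matmul_cycles(arr, Workload(M=64, N=64, K=64)))  # 16 tiles * 96 cycles = 1536
```

Sweeping such a model over array shapes or dataflows is the kind of exploration these generators automate before committing a design to RTL; the real frameworks additionally model (or simulate) the memory system and host interactions that this toy model omits.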
Advisor: Krste Asanović
BibTeX citation:
@phdthesis{Genc:EECS-2024-161,
    Author = {Genc, Hasan},
    Title = {Enabling Full-Stack DNN Accelerator Design and Evaluation on Synthesizable Hardware},
    School = {EECS Department, University of California, Berkeley},
    Year = {2024},
    Month = {Aug},
    Url = {http://www2.eecs.berkeley.edu/Pubs/TechRpts/2024/EECS-2024-161.html},
    Number = {UCB/EECS-2024-161},
    Abstract = {The growing diversity of computationally demanding DNN workloads, together with the long-running decline of technology-scaling trends, has motivated the design of a great many diverse specialized hardware accelerators. While these accelerators provide significant improvements to performance and energy consumption — making the modern wave of AI innovation possible — they introduce significant challenges to computer architects, programmers, and hardware designers, due to the difficulty of (i) exploring the very broad design space they represent, (ii) translating such designs rapidly to high-quality RTL and software libraries, and (iii) evaluating such designs in realistic full-system contexts early on in the design process. Prior work has attempted to address these difficulties by proposing new accelerator design frameworks which allow users to change only a few settings in a config file, or a few lines of a domain-specific language, from which they can rapidly generate new synthesizable hardware, or new high-level models that can guide architectural decisions. Many of these frameworks also attempt to make accelerator design more principled by separating the different concerns which go into accelerator design, so that each can be explored individually, and in relation to other design choices. However, prior accelerator generators and design frameworks often lack the ability to provide users visibility into the impact that the full system and software stack have upon DNN accelerator performance, such as the potential for outer caches, virtual address translation mechanisms, or host CPUs to bottleneck performance if they have not been carefully tuned along with the accelerator’s functional units or spatial arrays. Prior accelerator design frameworks which separate out different design concerns are also not capable of generating high-quality RTL for both dense and sparse accelerator ASICs, which limits their ability to cover various modern workloads which sparsify DNN layers to improve performance or energy efficiency. This thesis presents two projects, Gemmini and Stellar, which address these difficulties. Gemmini is a DNN accelerator evaluation framework which, while generating efficient spatial arrays and accelerators, is primarily intended to help users evaluate the impact of SoC components outside of the accelerator itself, such as external caches or virtual address translation mechanisms, upon overall DNN accelerator performance. Stellar is another framework which provides abstractions that help users design and explore different components of both dense and sparse accelerators, while separating out the different concerns that go into accelerator design, such as an accelerator’s functionality, its dataflow, the sparse/dense data formats it supports, its load-balancing strategies, and the private memory buffers it is equipped with. Gemmini-generated dense DNN accelerators achieve 87% of the performance of prior state-of-the-art accelerators such as NVDLA on image classification networks such as ResNet50, and enable insights into how minor changes to system components such as TLBs can improve end-to-end DNN performance by up to 15%. Stellar-generated accelerators achieve up to 92% of the performance of hand-written accelerators, with less than 15% area overhead and power overheads as low as 7% on various DNN layers.}
}
EndNote citation:
%0 Thesis
%A Genc, Hasan
%T Enabling Full-Stack DNN Accelerator Design and Evaluation on Synthesizable Hardware
%I EECS Department, University of California, Berkeley
%D 2024
%8 August 8
%@ UCB/EECS-2024-161
%U http://www2.eecs.berkeley.edu/Pubs/TechRpts/2024/EECS-2024-161.html
%F Genc:EECS-2024-161