A Hardware Accelerator for Computing an Exact Dot Product

Jack Koenig

EECS Department
University of California, Berkeley
Technical Report No. UCB/EECS-2018-51
May 11, 2018

http://www2.eecs.berkeley.edu/Pubs/TechRpts/2018/EECS-2018-51.pdf

We study the implementation of a hardware accelerator that computes a dot product of IEEE-754 floating-point numbers exactly. The accelerator accumulates intermediate floating-point products into a wide fixed-point representation (640 bits for single precision, 4288 bits for double precision). We designed the accelerator as a generator in Chisel, which can synthesize various configurations of the accelerator that make different area-performance trade-offs. We integrated eight different configurations into an SoC comprising a RISC-V in-order scalar core, split L1 instruction and data caches, and a unified L2 cache. In a TSMC 45 nm technology, the accelerator area ranges from 0.05 mm^2 to 0.32 mm^2, and all configurations could be clocked at frequencies in excess of 900 MHz. The accelerator successfully saturates the SoC’s memory system, achieving the same per-element efficiency (1 cycle per element) as Intel MKL running on an x86 machine with a similar cache configuration.
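The idea of accumulating products into a wide fixed-point register can be illustrated in software. The following is a minimal Python sketch for single precision, not the report's Chisel design: the function names (`decompose`, `accumulate`, `exact_dot`) are hypothetical, and Python's arbitrary-precision integers stand in for the hardware register. Each binary32 value is an integer multiple of 2^-149, so every product is an integer multiple of 2^-298 with magnitude below 2^256; the product bits therefore span 554 positions. The remaining 86 bits of the 640-bit register presumably provide headroom for carries over long accumulations, which is an assumption here rather than a statement from the abstract.

```python
import struct
from fractions import Fraction
from typing import Sequence, Tuple

MIN_EXP = -149           # exponent of the smallest binary32 subnormal (2**-149)
SCALE = 2 * (-MIN_EXP)   # every product is an integer multiple of 2**-298


def decompose(x: float) -> Tuple[int, int]:
    """Return (m, e) with round_to_binary32(x) == m * 2**e and m a signed integer.

    struct.pack rounds x to binary32 first; infinities and NaNs are rejected.
    """
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    sign = -1 if bits >> 31 else 1
    exp = (bits >> 23) & 0xFF
    frac = bits & 0x7FFFFF
    if exp == 0xFF:
        raise ValueError("infinity and NaN are not supported in this sketch")
    if exp == 0:                       # zero or subnormal: no implicit leading 1
        return sign * frac, MIN_EXP
    return sign * (frac | (1 << 23)), exp - 127 - 23


def accumulate(xs: Sequence[float], ys: Sequence[float]) -> int:
    """Sum all products exactly into one integer, in units of 2**-298."""
    acc = 0  # stands in for the wide fixed-point register
    for x, y in zip(xs, ys):
        mx, ex = decompose(x)
        my, ey = decompose(y)
        # ex + ey >= -298, so the shift amount is never negative.
        acc += (mx * my) << (ex + ey + SCALE)
    return acc


def exact_dot(xs: Sequence[float], ys: Sequence[float]) -> Fraction:
    """Exact dot product of the binary32 roundings of xs and ys."""
    return Fraction(accumulate(xs, ys), 1 << SCALE)
```

For example, `exact_dot([1e20, 1.0, -1e20], [1.0, 1.0, 1.0])` returns exactly 1, whereas evaluating `1e20 * 1.0 + 1.0 * 1.0 + -1e20 * 1.0` left to right in floating point loses the middle term and yields 0.0. A single rounding back to floating point would be applied only once, after the whole sum has been accumulated.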

Advisors: Krste Asanović and Jonathan Bachrach


BibTeX citation:

@mastersthesis{Koenig:EECS-2018-51,
    Author = {Koenig, Jack},
    Title = {A Hardware Accelerator for Computing an Exact Dot Product},
    School = {EECS Department, University of California, Berkeley},
    Year = {2018},
    Month = {May},
    URL = {http://www2.eecs.berkeley.edu/Pubs/TechRpts/2018/EECS-2018-51.html},
    Number = {UCB/EECS-2018-51},
    Abstract = {We study the implementation of a hardware accelerator that computes a dot product of IEEE-754 floating-point numbers exactly. The accelerator accumulates intermediate floating-point products into a wide fixed-point representation (640 bits for single precision, 4288 bits for double precision). We designed the accelerator as a generator in Chisel, which can synthesize various configurations of the accelerator that make different area-performance trade-offs.
We integrated eight different configurations into an SoC comprising a RISC-V in-order scalar core, split L1 instruction and data caches, and a unified L2 cache. In a TSMC 45 nm technology, the accelerator area ranges from 0.05 mm^2 to 0.32 mm^2, and all configurations could be clocked at frequencies in excess of 900 MHz. The accelerator successfully saturates the SoC’s memory system, achieving the same per-element efficiency (1 cycle per element) as Intel MKL running on an x86 machine with a similar cache configuration.}
}

EndNote citation:

%0 Thesis
%A Koenig, Jack
%T A Hardware Accelerator for Computing an Exact Dot Product
%I EECS Department, University of California, Berkeley
%D 2018
%8 May 11
%@ UCB/EECS-2018-51
%U http://www2.eecs.berkeley.edu/Pubs/TechRpts/2018/EECS-2018-51.html
%F Koenig:EECS-2018-51