A Hardware Accelerator for Computing an Exact Dot Product

Jack Koenig

EECS Department, University of California, Berkeley

Technical Report No. UCB/EECS-2018-51

May 11, 2018

http://www2.eecs.berkeley.edu/Pubs/TechRpts/2018/EECS-2018-51.pdf

We study the implementation of a hardware accelerator that computes a dot product of IEEE-754 floating-point numbers exactly. The accelerator uses a wide fixed-point representation (640 bits for single precision, 4288 bits for double precision) into which intermediate floating-point products are accumulated. We designed the accelerator as a generator in Chisel, which can synthesize various configurations of the accelerator that make different area-performance trade-offs. We integrated eight different configurations into an SoC comprising a RISC-V in-order scalar core, split L1 instruction and data caches, and a unified L2 cache. In a TSMC 45 nm technology, the accelerator area ranges from 0.05 mm^2 to 0.32 mm^2, and all configurations could be clocked at frequencies in excess of 900 MHz. The accelerator successfully saturates the SoC’s memory system, achieving the same per-element efficiency (1 cycle per element) as Intel MKL running on an x86 machine with a similar cache configuration.
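The accumulation scheme summarized in the abstract can be illustrated with a small software model. The following is a minimal sketch in Python, not the report's Chisel design: the names `to_binary32`, `exact_dot`, `FRAC_BITS`, and `ACC_BITS` are hypothetical, inputs are assumed to be finite, and a Python integer stands in for the wide fixed-point register. It relies on the fact that every product of two single-precision values is an integer multiple of 2^-298 (the smallest subnormal is 2^-149) with magnitude below 2^256, so a 640-bit fixed-point accumulator holds each product exactly and leaves headroom for many additions.

```python
import struct
from fractions import Fraction

FRAC_BITS = 2 * 149  # assumption of this sketch: fixed-point LSB weight is 2**-298
ACC_BITS = 640       # single-precision accumulator width quoted in the abstract


def to_binary32(x: float) -> float:
    """Round a Python float to the nearest IEEE-754 binary32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def exact_dot(xs, ys) -> Fraction:
    """Exact dot product of finite binary32 inputs, returned as a Fraction."""
    acc = 0  # Python int modelling the wide fixed-point accumulator
    for x, y in zip(xs, ys):
        nx, dx = to_binary32(x).as_integer_ratio()
        ny, dy = to_binary32(y).as_integer_ratio()
        # dx * dy is a power of two that divides 2**FRAC_BITS,
        # so rescaling the product to the fixed-point grid is exact.
        acc += (nx * ny) * ((1 << FRAC_BITS) // (dx * dy))
        # Check that the running sum still fits in a signed ACC_BITS-wide register.
        assert abs(acc).bit_length() < ACC_BITS
    return Fraction(acc, 1 << FRAC_BITS)


# Cancellation example: ordinary floating-point accumulation loses the 1.0.
xs = [2.0**60, 1.0, -2.0**60]
ys = [1.0, 1.0, 1.0]
naive = 0.0
for x, y in zip(xs, ys):
    naive += x * y
exact = exact_dot(xs, ys)
```

In this example the naive loop absorbs the middle term into 2^60 and ends at zero, while the fixed-point accumulator keeps it. The sketch returns the exact rational value and leaves the final rounding to the caller; a hardware unit would instead round the accumulator once to the output format.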

Advisors: Krste Asanović and Jonathan Bachrach

BibTeX citation:

@mastersthesis{Koenig:EECS-2018-51,
    Author= {Koenig, Jack},
    Title= {A Hardware Accelerator for Computing an Exact Dot Product},
    School= {EECS Department, University of California, Berkeley},
    Year= {2018},
    Month= {May},
    Url= {http://www2.eecs.berkeley.edu/Pubs/TechRpts/2018/EECS-2018-51.html},
    Number= {UCB/EECS-2018-51},
    Abstract= {We study the implementation of a hardware accelerator that computes a dot product of IEEE-754 floating-point numbers exactly. The accelerator uses a wide fixed-point representation (640 bits for single precision, 4288 bits for double precision) into which intermediate floating-point products are accumulated. We designed the accelerator as a generator in Chisel, which can synthesize various configurations of the accelerator that make different area-performance trade-offs.
We integrated eight different configurations into an SoC comprising a RISC-V in-order scalar core, split L1 instruction and data caches, and a unified L2 cache. In a TSMC 45 nm technology, the accelerator area ranges from 0.05 mm^2 to 0.32 mm^2, and all configurations could be clocked at frequencies in excess of 900 MHz. The accelerator successfully saturates the SoC’s memory system, achieving the same per-element efficiency (1 cycle per element) as Intel MKL running on an x86 machine with a similar cache configuration.},
}

EndNote citation:

%0 Thesis
%A Koenig, Jack 
%T A Hardware Accelerator for Computing an Exact Dot Product
%I EECS Department, University of California, Berkeley
%D 2018
%8 May 11
%@ UCB/EECS-2018-51
%U http://www2.eecs.berkeley.edu/Pubs/TechRpts/2018/EECS-2018-51.html
%F Koenig:EECS-2018-51