A Hardware Accelerator for Computing an Exact Dot Product
Jack Koenig
EECS Department, University of California, Berkeley
Technical Report No. UCB/EECS-2018-51
May 11, 2018
http://www2.eecs.berkeley.edu/Pubs/TechRpts/2018/EECS-2018-51.pdf
We study the implementation of a hardware accelerator that computes a dot product of IEEE-754 floating-point numbers exactly. The accelerator accumulates intermediate floating-point products into a wide fixed-point representation (640 bits for single precision, 4288 bits for double precision). We designed the accelerator as a generator in Chisel, which can synthesize various configurations of the accelerator that make different area-performance trade-offs. We integrated eight different configurations into an SoC comprising a RISC-V in-order scalar core, split L1 instruction and data caches, and a unified L2 cache. In a TSMC 45 nm technology, the accelerator area ranges from 0.05 mm^2 to 0.32 mm^2, and all configurations could be clocked at frequencies in excess of 900 MHz. The accelerator successfully saturates the SoC’s memory system, achieving the same per-element efficiency (1 cycle per element) as Intel MKL running on an x86 machine with a similar cache configuration.
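The idea of accumulating products into a wide fixed-point value can be illustrated in software. The following is a minimal Python sketch for double precision, not the report's hardware design: it assumes an unbounded Python integer in place of the fixed-width accumulator, and the names `SCALE`, `exact_dot`, `round_to_double`, and `naive_dot` are hypothetical. Every finite double is an integer multiple of 2^-1074, so every product of two doubles is an integer multiple of 2^-2148 and can be added to an integer accumulator with no rounding. A single product is also smaller than 2^2048 in magnitude, so it fits in 4196 bits; reading the remaining bits of a 4288-bit accumulator as headroom for carries is this sketch's interpretation, not a statement from the abstract.

```python
from fractions import Fraction

# Assumption of this sketch: the accumulator's least significant bit has
# weight 2**-2148, the smallest possible product of two double subnormals.
SCALE = 1 << 2148


def exact_dot(xs, ys):
    """Accumulate x*y exactly as an integer count of 2**-2148 units."""
    acc = 0
    for x, y in zip(xs, ys):
        nx, dx = x.as_integer_ratio()  # x == nx / dx, dx a power of two
        ny, dy = y.as_integer_ratio()
        # dx * dy divides SCALE, so this integer division is exact.
        acc += nx * ny * (SCALE // (dx * dy))
    return acc


def round_to_double(acc):
    """Round the exact accumulator to the nearest double, once, at the end."""
    return float(Fraction(acc, SCALE))


def naive_dot(xs, ys):
    """Ordinary floating-point dot product, rounding after every operation."""
    total = 0.0
    for x, y in zip(xs, ys):
        total += x * y
    return total


xs = [1e16, 1.0, -1e16]
ys = [1.0, 1.0, 1.0]
print(naive_dot(xs, ys))                    # the 1.0 term is lost to rounding
print(round_to_double(exact_dot(xs, ys)))   # the 1.0 term survives
```

In this example the naive loop absorbs the middle term when it is added to 1e16, while the integer accumulator keeps it until the single final rounding.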
Advisors: Krste Asanović and Jonathan Bachrach
BibTeX citation:
@mastersthesis{Koenig:EECS-2018-51,
    Author= {Koenig, Jack},
    Title= {A Hardware Accelerator for Computing an Exact Dot Product},
    School= {EECS Department, University of California, Berkeley},
    Year= {2018},
    Month= {May},
    Url= {http://www2.eecs.berkeley.edu/Pubs/TechRpts/2018/EECS-2018-51.html},
    Number= {UCB/EECS-2018-51},
    Abstract= {We study the implementation of a hardware accelerator that computes a dot product of IEEE-754 floating-point numbers exactly. The accelerator uses a wide (640 or 4288 bits for single or double-precision respectively) fixed-point representation into which intermediate floating-point products are accumulated. We designed the accelerator as a generator in Chisel, which can synthesize various configurations of the accelerator that make different area-performance trade-offs. We integrated eight different configurations into an SoC comprised of RISC-V in-order scalar core, split L1 instruction and data caches, and unified L2 cache. In a TSMC 45 nm technology, the accelerator area ranges from 0.05mm^2 to 0.32mm^2, and all configurations could be clocked at frequencies in excess of 900 MHz. The accelerator successfully saturates the SoC’s memory system, achieving the same per-element efficiency (1 cycle-per-element) as Intel MKL running on an x86 machine with a similar cache configuration.},
}
EndNote citation:
%0 Thesis
%A Koenig, Jack
%T A Hardware Accelerator for Computing an Exact Dot Product
%I EECS Department, University of California, Berkeley
%D 2018
%8 May 11
%@ UCB/EECS-2018-51
%U http://www2.eecs.berkeley.edu/Pubs/TechRpts/2018/EECS-2018-51.html
%F Koenig:EECS-2018-51