Decoupled Vector-Fetch Architecture with a Scalarizing Compiler
Yunsup Lee
EECS Department, University of California, Berkeley
Technical Report No. UCB/EECS-2016-117
May 24, 2016
http://www2.eecs.berkeley.edu/Pubs/TechRpts/2016/EECS-2016-117.pdf
As we approach the end of conventional technology scaling, computer architects are forced to incorporate specialized and heterogeneous accelerators into general-purpose processors for greater energy efficiency. Among the prominent accelerators that have recently become more popular are data-parallel processing units, such as classic vector units, SIMD units, and graphics processing units (GPUs). Surveying a wide range of data-parallel architectures and their parallel programming models and compilers reveals an opportunity to construct a new data-parallel machine that is highly performant and efficient, yet a favorable compiler target that maintains the same level of programmability as the others.
In this thesis, I present the Hwacha decoupled vector-fetch architecture as the basis of a new data-parallel machine. I reason through the design decisions while describing its programming model, microarchitecture, and LLVM-based scalarizing compiler that efficiently maps OpenCL kernels to the architecture. The Hwacha vector unit is implemented in Chisel as an accelerator attached to a RISC-V Rocket control processor within the open-source Rocket Chip SoC generator. Using complete VLSI implementations of Hwacha, including a cache-coherent memory hierarchy in a commercial 28 nm process and simulated LPDDR3 DRAM modules, I quantify the area, performance, and energy consumption of the Hwacha accelerator. These numbers are then validated against an ARM Mali-T628 MP6 GPU, also built in a 28 nm process, using a set of OpenCL microbenchmarks compiled from the same source code with our custom compiler and ARM's stock OpenCL compiler.
Advisor: Krste Asanović
BibTeX citation:
@phdthesis{Lee:EECS-2016-117,
    Author = {Lee, Yunsup},
    Title = {Decoupled Vector-Fetch Architecture with a Scalarizing Compiler},
    School = {EECS Department, University of California, Berkeley},
    Year = {2016},
    Month = {May},
    Url = {http://www2.eecs.berkeley.edu/Pubs/TechRpts/2016/EECS-2016-117.html},
    Number = {UCB/EECS-2016-117},
    Abstract = {As we approach the end of conventional technology scaling, computer architects are forced to incorporate specialized and heterogeneous accelerators into general-purpose processors for greater energy efficiency. Among the prominent accelerators that have recently become more popular are data-parallel processing units, such as classic vector units, SIMD units, and graphics processing units (GPUs). Surveying a wide range of data-parallel architectures and their parallel programming models and compilers reveals an opportunity to construct a new data-parallel machine that is highly performant and efficient, yet a favorable compiler target that maintains the same level of programmability as the others. In this thesis, I present the Hwacha decoupled vector-fetch architecture as the basis of a new data-parallel machine. I reason through the design decisions while describing its programming model, microarchitecture, and LLVM-based scalarizing compiler that efficiently maps OpenCL kernels to the architecture. The Hwacha vector unit is implemented in Chisel as an accelerator attached to a RISC-V Rocket control processor within the open-source Rocket Chip SoC generator. Using complete VLSI implementations of Hwacha, including a cache-coherent memory hierarchy in a commercial 28 nm process and simulated LPDDR3 DRAM modules, I quantify the area, performance, and energy consumption of the Hwacha accelerator. These numbers are then validated against an ARM Mali-T628 MP6 GPU, also built in a 28 nm process, using a set of OpenCL microbenchmarks compiled from the same source code with our custom compiler and ARM's stock OpenCL compiler.},
}
EndNote citation:
%0 Thesis
%A Lee, Yunsup
%T Decoupled Vector-Fetch Architecture with a Scalarizing Compiler
%I EECS Department, University of California, Berkeley
%D 2016
%8 May 24
%@ UCB/EECS-2016-117
%U http://www2.eecs.berkeley.edu/Pubs/TechRpts/2016/EECS-2016-117.html
%F Lee:EECS-2016-117