Decoupled Vector-Fetch Architecture with a Scalarizing Compiler

Yunsup Lee

EECS Department, University of California, Berkeley

Technical Report No. UCB/EECS-2016-117

May 24, 2016

http://www2.eecs.berkeley.edu/Pubs/TechRpts/2016/EECS-2016-117.pdf

As we approach the end of conventional technology scaling, computer architects are forced to incorporate specialized and heterogeneous accelerators into general-purpose processors for greater energy efficiency. Among the accelerators that have recently grown in popularity are data-parallel processing units, such as classic vector units, SIMD units, and graphics processing units (GPUs). Surveying a wide range of data-parallel architectures and their parallel programming models and compilers reveals an opportunity to construct a new data-parallel machine that is highly performant and efficient, yet remains a favorable compiler target with the same level of programmability as the others.

In this thesis, I present the Hwacha decoupled vector-fetch architecture as the basis of a new data-parallel machine. I reason through the design decisions while describing its programming model, microarchitecture, and LLVM-based scalarizing compiler that efficiently maps OpenCL kernels to the architecture. The Hwacha vector unit is implemented in Chisel as an accelerator attached to a RISC-V Rocket control processor within the open-source Rocket Chip SoC generator. Using complete VLSI implementations of Hwacha, including a cache-coherent memory hierarchy in a commercial 28 nm process and simulated LPDDR3 DRAM modules, I quantify the area, performance, and energy consumption of the Hwacha accelerator. These numbers are then validated against an ARM Mali-T628 MP6 GPU, also built in a 28 nm process, using a set of OpenCL microbenchmarks compiled from the same source code with our custom compiler and ARM's stock OpenCL compiler.

Advisor: Krste Asanović

BibTeX citation:

@phdthesis{Lee:EECS-2016-117,
    Author= {Lee, Yunsup},
    Title= {Decoupled Vector-Fetch Architecture with a Scalarizing Compiler},
    School= {EECS Department, University of California, Berkeley},
    Year= {2016},
    Month= {May},
    Url= {http://www2.eecs.berkeley.edu/Pubs/TechRpts/2016/EECS-2016-117.html},
    Number= {UCB/EECS-2016-117},
    Abstract= {As we approach the end of conventional technology scaling, computer architects are forced to incorporate specialized and heterogeneous accelerators into general-purpose processors for greater energy efficiency.  Among the accelerators that have recently grown in popularity are data-parallel processing units, such as classic vector units, SIMD units, and graphics processing units (GPUs).  Surveying a wide range of data-parallel architectures and their parallel programming models and compilers reveals an opportunity to construct a new data-parallel machine that is highly performant and efficient, yet remains a favorable compiler target with the same level of programmability as the others.

In this thesis, I present the Hwacha decoupled vector-fetch architecture as the basis of a new data-parallel machine.  I reason through the design decisions while describing its programming model, microarchitecture, and LLVM-based scalarizing compiler that efficiently maps OpenCL kernels to the architecture.  The Hwacha vector unit is implemented in Chisel as an accelerator attached to a RISC-V Rocket control processor within the open-source Rocket Chip SoC generator. Using complete VLSI implementations of Hwacha, including a cache-coherent memory hierarchy in a commercial 28 nm process and simulated LPDDR3 DRAM modules, I quantify the area, performance, and energy consumption of the Hwacha accelerator.  These numbers are then validated against an ARM Mali-T628 MP6 GPU, also built in a 28 nm process, using a set of OpenCL microbenchmarks compiled from the same source code with our custom compiler and ARM's stock OpenCL compiler.},
}

EndNote citation:

%0 Thesis
%A Lee, Yunsup
%T Decoupled Vector-Fetch Architecture with a Scalarizing Compiler
%I EECS Department, University of California, Berkeley
%D 2016
%8 May 24
%@ UCB/EECS-2016-117
%U http://www2.eecs.berkeley.edu/Pubs/TechRpts/2016/EECS-2016-117.html
%F Lee:EECS-2016-117