Daniel Grubb

EECS Department, University of California, Berkeley

Technical Report No. UCB/

December 1, 2025

Systems-on-chip (SoCs) for mobile devices need flexible yet energy-efficient processing capabilities in order to target evolving multimedia and embedded machine-learning applications while coping with constrained device battery life. Digital signal processors (DSPs) are frequently integrated into these SoCs under a wide variety of physical constraints to fill this role. DSPs have evolved significantly over the last four decades, but have always had the goal of providing programmable, domain-specialized computation in constrained deployments, guided by trends in silicon design and the important applications of the day. The emergence of modern scalable vector ISAs, such as the RISC-V “V” vector extension (RVV), presents the opportunity to design DSPs by building on a scalable, efficient vector machine template.

This thesis presents techniques for designing and implementing domain-specialized short-vector machines, suitable for deployment as DSPs in mobile SoCs. First, this thesis discusses a generator for baseline short-vector machines implementing the RVV 1.0 specification, including quantitative performance, power, and area (PPA) results and a qualitative comparison to other DSP core paradigms. This thesis presents techniques for specializing short-vector machines for mobile multimedia workloads to enhance their efficiency and enable new PPA trade-offs, including the design of novel vector instructions that compose neatly with the baseline vector machine. This thesis presents contributions to Hammer, a reusable physical design framework, as a means for rapidly evaluating and implementing these designs. Finally, this thesis presents Cygnus, a 1 GHz heterogeneous octa-core RISC-V vector processor for digital signal processing. Cygnus demonstrates the silicon implementation of instances of the short-vector machine generator described in this thesis, organized in a big/little architecture. Cygnus achieves over 90% datapath utilization on matrix multiplication and convolution kernels, and 414 GOPS/W and 109 GFLOPS/W on INT8 and FP32 matrix multiplication kernels, respectively.

Advisor: Borivoje Nikolic


BibTeX citation:

@phdthesis{Grubb:31915,
    Author= {Grubb, Daniel},
    Title= {Design Techniques for Specialized Short-Vector Machines},
    School= {EECS Department, University of California, Berkeley},
    Year= {2025},
    Number= {UCB/},
    Abstract= {Systems-on-chip (SoCs) for mobile devices need flexible yet energy-efficient processing capabilities
in order to target evolving multimedia and embedded machine-learning applications
while coping with constrained device battery life. Digital signal processors (DSPs) are frequently
integrated into these SoCs under a wide variety of physical constraints to fill this
role. DSPs have evolved significantly over the last four decades, but have always had the
goal of providing programmable, domain-specialized computation in constrained deployments,
guided by trends in silicon design and the important applications of the day. The
emergence of modern scalable vector ISAs, such as the RISC-V “V” vector extension (RVV),
presents the opportunity to design DSPs by building on a scalable, efficient vector machine
template.
This thesis presents techniques for designing and implementing domain-specialized short-vector
machines, suitable for deployment as DSPs in mobile SoCs. First, this thesis discusses
a generator for baseline short-vector machines implementing the RVV 1.0 specification,
including quantitative performance, power, and area (PPA) results and a qualitative
comparison to other DSP core paradigms. This thesis presents techniques for specializing
short-vector machines for mobile multimedia workloads to enhance their efficiency and enable
new PPA trade-offs, including the design of novel vector instructions that compose
neatly with the baseline vector machine. This thesis presents contributions to Hammer, a
reusable physical design framework, as a means for rapidly evaluating and implementing
these designs. Finally, this thesis presents Cygnus, a 1 GHz heterogeneous octa-core RISC-V
vector processor for digital signal processing. Cygnus demonstrates the silicon implementation
of instances of the short-vector machine generator described in this thesis organized in
a big/little architecture. Cygnus achieves over 90% datapath utilization on matrix multiplication
and convolution kernels, and 414 GOPS/W and 109 GFLOPS/W on INT8 and FP32
matrix multiplication kernels, respectively.},
}

EndNote citation:

%0 Thesis
%A Grubb, Daniel 
%T Design Techniques for Specialized Short-Vector Machines
%I EECS Department, University of California, Berkeley
%D 2025
%8 December 1
%@ UCB/
%F Grubb:31915