Design Techniques for Specialized Short-Vector Machines
Daniel Grubb
EECS Department, University of California, Berkeley
Technical Report No. UCB/
December 1, 2025
Systems-on-chip (SoCs) for mobile devices need flexible yet energy-efficient processing capabilities in order to target evolving multimedia and embedded machine-learning applications while coping with constrained device battery life. Digital signal processors (DSPs) are frequently integrated into these SoCs under a wide variety of physical constraints to fill this role. DSPs have evolved significantly over the last four decades, but have always had the goal of providing programmable, domain-specialized computation in constrained deployments, guided by trends in silicon design and the important applications of the day. The emergence of modern scalable vector ISAs, such as the RISC-V “V” vector extension (RVV), presents the opportunity to design DSPs by building on a scalable, efficient vector machine template.
This thesis presents techniques for designing and implementing domain-specialized short-vector machines, suitable for deployment as DSPs in mobile SoCs. First, this thesis discusses a generator for baseline short-vector machines implementing the RVV 1.0 specification, including quantitative performance, power, and area (PPA) results and a qualitative comparison to other DSP core paradigms. This thesis presents techniques for specializing short-vector machines for mobile multimedia workloads to enhance their efficiency and enable new PPA trade-offs, including the design of novel vector instructions that compose neatly with the baseline vector machine. This thesis presents contributions to Hammer, a reusable physical design framework, as a means for rapidly evaluating and implementing these designs. Finally, this thesis presents Cygnus, a 1 GHz heterogeneous octa-core RISC-V vector processor for digital signal processing. Cygnus demonstrates the silicon implementation of instances of the short-vector machine generator described in this thesis, organized in a big/little architecture. Cygnus achieves over 90% datapath utilization on matrix multiplication and convolution kernels, and 414 GOPS/W and 109 GFLOPS/W on INT8 and FP32 matrix multiplication kernels, respectively.
Advisors: Borivoje Nikolic
BibTeX citation:
@phdthesis{Grubb:31915,
Author= {Grubb, Daniel},
Title= {Design Techniques for Specialized Short-Vector Machines},
School= {EECS Department, University of California, Berkeley},
Year= {2025},
Number= {UCB/},
Abstract= {Systems-on-chip (SoCs) for mobile devices need flexible yet energy-efficient processing capabilities
in order to target evolving multimedia and embedded machine-learning applications
while coping with constrained device battery life. Digital signal processors (DSPs) are frequently
integrated into these SoCs under a wide variety of physical constraints to fill this
role. DSPs have evolved significantly over the last four decades, but have always had the
goal of providing programmable, domain-specialized computation in constrained deployments,
guided by trends in silicon design and the important applications of the day. The
emergence of modern scalable vector ISAs, such as the RISC-V “V” vector extension (RVV),
presents the opportunity to design DSPs by building on a scalable, efficient vector machine
template.
This thesis presents techniques for designing and implementing domain-specialized short-vector
machines, suitable for deployment as DSPs in mobile SoCs. First, this thesis discusses
a generator for baseline short-vector machines implementing the RVV 1.0 specification,
including quantitative performance, power, and area (PPA) results and a qualitative
comparison to other DSP core paradigms. This thesis presents techniques for specializing
short-vector machines for mobile multimedia workloads to enhance their efficiency and enable
new PPA trade-offs, including the design of novel vector instructions that compose
neatly with the baseline vector machine. This thesis presents contributions to Hammer, a
reusable physical design framework, as a means for rapidly evaluating and implementing
these designs. Finally, this thesis presents Cygnus, a 1 GHz heterogeneous octa-core RISC-V
vector processor for digital signal processing. Cygnus demonstrates the silicon implementation
of instances of the short-vector machine generator described in this thesis organized in
a big/little architecture. Cygnus achieves over 90% datapath utilization on matrix multiplication
and convolution kernels, and 414 GOPS/W and 109 GFLOPS/W on INT8 and FP32
matrix multiplication kernels, respectively.},
}
EndNote citation:
%0 Thesis
%A Grubb, Daniel
%T Design Techniques for Specialized Short-Vector Machines
%I EECS Department, University of California, Berkeley
%D 2025
%8 December 1
%@ UCB/
%F Grubb:31915