End-to-end Heterogeneous System Design for Hyperscale Big-Data Processing

Abraham Gonzalez

EECS Department, University of California, Berkeley

Technical Report No. UCB/

December 1, 2025

Modern hyperscale cloud data center applications continue to grow ever larger, more complex, and more distributed as they tackle exponential increases in computing demand. Of these systems, big-data processing on hyperscale data stores has emerged as an integral solution for collecting and analyzing the quintillions of bytes of data generated per day. At the same time, the Moore’s Law hardware scaling slowdown motivates new solutions for warehouse-scale system design through novel tools focused on rapid evaluation of complete systems.

This dissertation focuses on identifying key bottlenecks in hyperscale big-data processing systems, developing new agile hardware/software co-design methodologies built for heterogeneous data processing, and utilizing these techniques to improve end-to-end accelerator latencies for big-data processing remote procedure calls (RPCs).

First, we discuss the characterization of Spanner, BigTable, and BigQuery, three key hyperscale big-data processing systems serving live traffic at Google. This work demonstrates that a mix of fixed-function accelerators, a “sea of accelerators”, is necessary for holistic system acceleration because no single bottleneck dominates.

Next, we describe the Chipyard framework, an agile tool for developing specialized RISC-V systems-on-chip (SoCs), and the FireSim FPGA-accelerated hardware simulation platform used to simulate complex SoCs at scale. Chipyard and FireSim enable rapid composition, verification, and validation of large-scale heterogeneous SoCs built for hyperscale systems. We then present the first Chipyard test chip, a 16 mm² multi-core/accelerator SoC in Intel 22FFL, that demonstrates the ability of the frameworks to quickly produce silicon-proven results.

Finally, we introduce novel end-to-end RPC accelerator scheduling techniques for big-data processing. Using Chipyard and FireSim to evaluate domain-specific RPC accelerators running novel fleet-representative benchmarks, we show that greedy work-stealing policies that load-balance payloads across CPUs and accelerators can improve 99th-percentile latency by over 5x, without degrading average latency, for variable data processing payloads.

Advisors: Borivoje Nikolic and Krste Asanović

BibTeX citation:

@phdthesis{Gonzalez:31916,
    Author= {Gonzalez, Abraham},
    Editor= {Asanović, Krste and Nikolic, Borivoje and Shao, Sophia and Ranganathan, Parthasarathy},
    Title= {End-to-end Heterogeneous System Design for Hyperscale Big-Data Processing},
    School= {EECS Department, University of California, Berkeley},
    Year= {2025},
    Number= {UCB/},
    Abstract= {Modern hyperscale cloud data center applications continue to grow ever larger, more complex, and more distributed as they tackle exponential increases in computing demand. Of these systems, big-data processing on hyperscale data stores has emerged as an integral solution for collecting and analyzing the quintillions of bytes of data generated per day. At the same time, the Moore’s Law hardware scaling slowdown motivates new solutions for warehouse-scale system design through novel tools focused on rapid evaluation of complete systems.

This dissertation focuses on identifying key bottlenecks in hyperscale big-data processing systems, developing new agile hardware/software co-design methodologies built for heterogeneous data processing, and utilizing these techniques to improve end-to-end accelerator latencies for big-data processing remote procedure calls (RPCs).

First, we discuss the characterization of Spanner, BigTable, and BigQuery, three key hyperscale big-data processing systems serving live traffic at Google. This work demonstrates that a mix of fixed-function accelerators, a “sea of accelerators”, is necessary for holistic system acceleration because no single bottleneck dominates.

Next, we describe the Chipyard framework, an agile tool for developing specialized RISC-V systems-on-chip (SoCs), and the FireSim FPGA-accelerated hardware simulation platform used to simulate complex SoCs at scale. Chipyard and FireSim enable rapid composition, verification, and validation of large-scale heterogeneous SoCs built for hyperscale systems. We then present the first Chipyard test chip, a 16 mm² multi-core/accelerator SoC in Intel 22FFL, that demonstrates the ability of the frameworks to quickly produce silicon-proven results.

Finally, we introduce novel end-to-end RPC accelerator scheduling techniques for big-data processing. Using Chipyard and FireSim to evaluate domain-specific RPC accelerators running novel fleet-representative benchmarks, we show that greedy work-stealing policies that load-balance payloads across CPUs and accelerators can improve 99th-percentile latency by over 5x, without degrading average latency, for variable data processing payloads.},
}

EndNote citation:

%0 Thesis
%A Gonzalez, Abraham 
%E Asanović, Krste 
%E Nikolic, Borivoje 
%E Shao, Sophia 
%E Ranganathan, Parthasarathy 
%T End-to-end Heterogeneous System Design for Hyperscale Big-Data Processing
%I EECS Department, University of California, Berkeley
%D 2025
%8 December 1
%@ UCB/
%F Gonzalez:31916