Agile Hardware/Software Co-Design for Hyperscale Cloud Systems

Sagar Karandikar

EECS Department, University of California, Berkeley

Technical Report No. UCB/EECS-2024-222

December 19, 2024

http://www2.eecs.berkeley.edu/Pubs/TechRpts/2024/EECS-2024-222.pdf

Global reliance on cloud services, powered by transformative technologies such as generative AI, machine learning, and big-data analytics, is driving exponential growth in demand for hyperscale cloud compute infrastructure. Meanwhile, the breakdown of classical hardware scaling (e.g., Moore’s Law) is hampering growth in compute supply. Building domain-specific hardware can address this supply-demand gap, but catching up with exponential demand requires developing new hardware rapidly and with confidence that performance/efficiency gains will compound in the context of a complete system. These are challenging tasks given the status quo in hardware design, even before accounting for the immense scale of the cloud.

This dissertation focuses on two themes: (1) Developing agile, end-to-end HW/SW co-design tools that challenge the status quo in hardware design across system scale. (2) Leveraging these tools and hyperscale datacenter fleet profiling insights to architect/implement state-of-the-art domain-specific hardware to address key inefficiencies in hyperscale systems.

We first cover the FireSim FPGA-accelerated hardware simulation platform, which automatically constructs high-performance, cycle-exact, scale-out simulations of novel hardware designs derived from tapeout-friendly RTL, empowering hardware designers and domain experts alike to rapidly co-design systems. FireSim unlocks innovation in datacenter hardware with the ability to scale to massive, distributed simulations of specialized datacenter clusters. Next, we cover Chipyard, a platform for agile construction, evaluation, and tape-out of specialized RISC-V System-on-Chip (SoC) designs using an RTL-generator-driven approach.

We then cover Hyperscale SoC, a cloud-optimized server chip built, evaluated, and taped out with FireSim/Chipyard. Hyperscale SoC includes several new domain-specific accelerators for expensive but foundational overheads in hyperscale servers, including (de)serialization, (de)compression, and more. This SoC demonstrates a new paradigm of data-driven, end-to-end HW/SW co-design, combining key insights from profiling Google’s global datacenter fleet with the ability to rapidly build/evaluate novel HW/SW systems in FireSim/Chipyard.

Advisors: Krste Asanović

BibTeX citation:

@phdthesis{Karandikar:EECS-2024-222,
    Author= {Karandikar, Sagar},
    Title= {Agile Hardware/Software Co-Design for Hyperscale Cloud Systems},
    School= {EECS Department, University of California, Berkeley},
    Year= {2024},
    Month= {Dec},
    Url= {http://www2.eecs.berkeley.edu/Pubs/TechRpts/2024/EECS-2024-222.html},
    Number= {UCB/EECS-2024-222},
    Abstract= {Global reliance on cloud services, powered by transformative technologies such as generative AI, machine learning, and big-data analytics, is driving exponential growth in demand for hyperscale cloud compute infrastructure. Meanwhile, the breakdown of classical hardware scaling (e.g., Moore’s Law) is hampering growth in compute supply. Building domain-specific hardware can address this supply-demand gap, but catching up with exponential demand requires developing new hardware rapidly and with confidence that performance/efficiency gains will compound in the context of a complete system. These are challenging tasks given the status quo in hardware design, even before accounting for the immense scale of the cloud.

This dissertation focuses on two themes: (1) Developing agile, end-to-end HW/SW co-design tools that challenge the status quo in hardware design across system scale. (2) Leveraging these tools and hyperscale datacenter fleet profiling insights to architect/implement state-of-the-art domain-specific hardware to address key inefficiencies in hyperscale systems.

We first cover the FireSim FPGA-accelerated hardware simulation platform, which automatically constructs high-performance, cycle-exact, scale-out simulations of novel hardware designs derived from tapeout-friendly RTL, empowering hardware designers and domain experts alike to rapidly co-design systems. FireSim unlocks innovation in datacenter hardware with the ability to scale to massive, distributed simulations of specialized datacenter clusters. Next, we cover Chipyard, a platform for agile construction, evaluation, and tape-out of specialized RISC-V System-on-Chip (SoC) designs using an RTL-generator-driven approach.

We then cover Hyperscale SoC, a cloud-optimized server chip built, evaluated, and taped out with FireSim/Chipyard. Hyperscale SoC includes several new domain-specific accelerators for expensive but foundational overheads in hyperscale servers, including (de)serialization, (de)compression, and more. This SoC demonstrates a new paradigm of data-driven, end-to-end HW/SW co-design, combining key insights from profiling Google’s global datacenter fleet with the ability to rapidly build/evaluate novel HW/SW systems in FireSim/Chipyard.},
}

EndNote citation:

%0 Thesis
%A Karandikar, Sagar
%T Agile Hardware/Software Co-Design for Hyperscale Cloud Systems
%I EECS Department, University of California, Berkeley
%D 2024
%8 December 19
%@ UCB/EECS-2024-222
%U http://www2.eecs.berkeley.edu/Pubs/TechRpts/2024/EECS-2024-222.html
%F Karandikar:EECS-2024-222