FireAxe: Partitioned FPGA-Accelerated Simulation of Large-Scale RTL Designs

Joonho Whangbo and Krste Asanović and Borivoje Nikolic

EECS Department, University of California, Berkeley

Technical Report No. UCB/EECS-2024-220

December 19, 2024

http://www2.eecs.berkeley.edu/Pubs/TechRpts/2024/EECS-2024-220.pdf

Pre-silicon validation and end-to-end system evaluation are integral parts of hardware development as they provide architects with insights about the complex interactions between various hardware components, system software, and application code. Although this process can be accelerated using FPGAs as a simulation host, existing platforms fall short when the resource requirements of a custom hardware design exceed a single FPGA.

We present FireAxe, an open-source FPGA-accelerated RTL simulation platform that supports push-button user-guided partitioning across multiple FPGAs, using a compiler called FireRipper. Given a partition point, FireRipper automatically maps a monolithic RTL design onto multiple FPGAs while providing hardware designers quick feedback about the partition interface and expected simulation performance. Furthermore, FireRipper enables users to choose between an exact-mode which provides cycle-exact results with RTL-level fidelity, or a fast-mode that improves simulation rate while sacrificing fidelity only at the partition boundary. Built on FireSim, FireAxe preserves the ability to elastically scale simulations from on-premises FPGAs to cloud FPGAs. For example, pulling out a core from a system-on-chip (SoC) onto a separate FPGA, we achieve simulation rates of 1.6 MHz using on-premises FPGAs connected by direct-attach cables and 1 MHz on AWS F1 FPGAs using peer-to-peer PCIe.

To show FireAxe’s ability to enable pre-silicon performance validation at unprecedented scale, we show several case studies. First, we replicate full-stack system-level effects such as latency spikes from garbage collection in a Golang application on an SoC containing 4 out-of-order (OoO) cores. We also boot Linux on, to our knowledge, the largest OoO core ever cycle-exactly simulated in academia. Lastly, we simulate a system-on-chip containing 24 OoO cores mapped onto five datacenter-class FPGAs. We discover an RTL bug when trying to run Linux user-space applications that did not appear with less substantial software stacks. This was discovered in less than 2 hours using FireAxe and would have taken weeks in a commercial software RTL simulator.

Advisors: Krste Asanović and Sophia Shao

BibTeX citation:

@mastersthesis{Whangbo:EECS-2024-220,
    Author= {Whangbo, Joonho and Asanović, Krste and Nikolic, Borivoje},
    Title= {FireAxe: Partitioned FPGA-Accelerated Simulation of Large-Scale RTL Designs},
    School= {EECS Department, University of California, Berkeley},
    Year= {2024},
    Month= {Dec},
    Url= {http://www2.eecs.berkeley.edu/Pubs/TechRpts/2024/EECS-2024-220.html},
    Number= {UCB/EECS-2024-220},
    Abstract= {Pre-silicon validation and end-to-end system evaluation are integral parts of hardware development as they provide architects with insights about the complex interactions between various hardware components, system software, and application code. Although this process can be accelerated using FPGAs as a simulation host, existing platforms fall short when the resource requirements of a custom hardware design exceed a single FPGA.

We present FireAxe, an open-source FPGA-accelerated RTL simulation platform that supports push-button user-guided partitioning across multiple FPGAs, using a compiler called FireRipper. Given a partition point, FireRipper automatically maps a monolithic RTL design onto multiple FPGAs while providing hardware designers quick feedback about the partition interface and expected simulation performance. Furthermore, FireRipper enables users to choose between an exact-mode which provides cycle-exact results with RTL-level fidelity, or a fast-mode that improves simulation rate while sacrificing fidelity only at the partition boundary. Built on FireSim, FireAxe preserves the ability to elastically scale simulations from on-premises FPGAs to cloud FPGAs. For example, pulling out a core from a system-on-chip (SoC) onto a separate FPGA, we achieve simulation rates of 1.6 MHz using on-premises FPGAs connected by direct-attach cables and 1 MHz on AWS F1 FPGAs using peer-to-peer PCIe.

To show FireAxe’s ability to enable pre-silicon performance validation at unprecedented scale, we show several case studies. First, we replicate full-stack system-level effects such as latency spikes from garbage collection in a Golang application on an SoC containing 4 out-of-order (OoO) cores. We also boot Linux on, to our knowledge, the largest OoO core ever cycle-exactly simulated in academia. Lastly, we simulate a system-on-chip containing 24 OoO cores mapped onto five datacenter-class FPGAs. We discover an RTL bug when trying to run Linux user-space applications that did not appear with less substantial software stacks. This was discovered in less than 2 hours using FireAxe and would have taken weeks in a commercial software RTL simulator.},
}

EndNote citation:

%0 Thesis
%A Whangbo, Joonho 
%A Asanović, Krste 
%A Nikolic, Borivoje 
%T FireAxe: Partitioned FPGA-Accelerated Simulation of Large-Scale RTL Designs
%I EECS Department, University of California, Berkeley
%D 2024
%8 December 19
%@ UCB/EECS-2024-220
%U http://www2.eecs.berkeley.edu/Pubs/TechRpts/2024/EECS-2024-220.html
%F Whangbo:EECS-2024-220