Improving FPGA Simulation Capacity with Automatic Resource Multi-Threading
Albert Magyar
EECS Department, University of California, Berkeley
Technical Report No. UCB/EECS-2022-24
May 1, 2022
http://www2.eecs.berkeley.edu/Pubs/TechRpts/2022/EECS-2022-24.pdf
Modern system-on-a-chip (SoC) development is a highly complex process that spans multiple levels of design abstraction and must reconcile cross-cutting requirements. With a rapidly evolving ecosystem of domain-specific accelerators and wide design spaces to search, the ability to rapidly evaluate potential chip designs has never been more important. In the modern chip-design landscape, field-programmable gate arrays (FPGAs) play a critical role in delivering this simulation capability due to their unique ability to emulate concrete, register-transfer level (RTL) designs at speeds sufficient to run real applications spanning trillions of cycles of simulated target-design execution. However, the use of FPGAs for logic emulation presents challenges, including the perennial difficulty of effectively mapping large target designs to the finite resources of a given FPGA platform.
To help address this challenge, this dissertation presents a novel approach that manages these limitations through automatic resource-efficiency optimizations, which reduce the number of FPGA resources required to faithfully implement cycle-accurate emulators of large chips without the tedious manual effort and complexity of previous FPGA-optimized simulation techniques. By replacing target-design memories that have logic-intensive read and write ports with resource-efficient, cycle-accurate models that serially access FPGA memory primitives, Golden Gate simulators avoid the disproportionate impact of FPGA-hostile memory design patterns on simulators of high-performance processor cores. Drawing inspiration from software simulators and specialized emulators, where common code may be executed repeatedly to model an arbitrary number of copies of a given block, I also introduce an automatic instance-threading optimization, through which the logic resources required to simulate a given module are shared across multiple instances, radically reducing their collective footprint.
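The memory optimization described above can be illustrated with a small behavioral sketch: a multi-read-port target memory is modeled, cycle-accurately, by serializing its accesses through a single-ported "host" memory over several host cycles per target cycle. This is a conceptual sketch in Python under assumed names; the classes and methods below are hypothetical and do not reflect Golden Gate's actual implementation, which operates on RTL.

```python
class SinglePortHostMem:
    """Stand-in for an FPGA memory primitive with one access port."""
    def __init__(self, depth):
        self.data = [0] * depth
        self.host_cycles = 0          # host cycles consumed so far

    def read(self, addr):
        self.host_cycles += 1         # one host cycle per serialized access
        return self.data[addr]

    def write(self, addr, value):
        self.host_cycles += 1
        self.data[addr] = value


class SerializedMemModel:
    """Cycle-accurate model of a 3-read, 1-write target memory.

    Each target_cycle() call advances the simulated design by one target
    cycle, internally spending several host cycles to serialize all port
    accesses through the single-ported host memory.
    """
    def __init__(self, depth):
        self.host_mem = SinglePortHostMem(depth)

    def target_cycle(self, read_addrs, write_addr=None, write_data=None):
        # Serialize the three target read ports over three host cycles.
        read_values = [self.host_mem.read(a) for a in read_addrs]
        # The write port is serialized after the reads, so reads in a
        # target cycle observe the state from the previous target cycle.
        if write_addr is not None:
            self.host_mem.write(write_addr, write_data)
        return read_values


mem = SerializedMemModel(depth=16)
mem.target_cycle(read_addrs=[0, 0, 0], write_addr=0, write_data=42)
print(mem.target_cycle(read_addrs=[0, 1, 2]))   # reads see the prior write
print(mem.host_mem.host_cycles)                 # host cycles exceed target cycles
```

The key trade-off this sketch captures is that the simulator spends extra host cycles (here, up to four per target cycle) in exchange for needing only a single-ported FPGA memory primitive instead of logic-intensive multi-port emulation.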
To support the use of these optimizations across a broad array of user designs, they are integrated as contributions to Golden Gate, an extensible compiler that translates RTL designs into cycle-accurate FPGA simulators as part of the open-source FireSim FPGA simulation framework. By structuring simulators as modular dataflow networks, Golden Gate provides the flexibility to compose the two optimizations along with the ability to combine them with software co-simulation or other advanced simulation features. To evaluate the performance of the optimizations and to validate the optimizing compiler stack, these techniques are applied to two input designs: a general-purpose SoC with multiple out-of-order cores and a domain-specific accelerator with multiple systolic array co-processors. In each case, finite programmable logic resources limit the maximum number of cores, and therefore the size of the system, that can effectively be simulated on a platform consisting of cloud-hosted Xilinx VU9P FPGAs. However, with optimizations enabled in Golden Gate through simple compiler directives, the same FPGA platform supports configurations of each system with an eight-fold increase in core count relative to the baseline, making it possible to simulate sixteen out-of-order cores or eight accelerator cores at high speed, with deterministic, cycle-accurate results. Ultimately, this significant increase in per-FPGA capability broadens the utility of commodity FPGAs in simulating ever-growing chips, while the convenience of automatic compiler optimization helps support designer productivity in a rapidly accelerating hardware ecosystem.
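The instance-threading optimization can likewise be sketched behaviorally: a single shared copy of a module's update logic is evaluated serially for each instance, with each instance keeping only its own state. The names below are illustrative assumptions for exposition, not Golden Gate's actual API.

```python
def accumulator_logic(state, inp):
    """State-update function for one module instance: a saturating
    8-bit accumulator. Defined once, no matter how many instances exist."""
    nxt = min(state + inp, 255)
    return nxt, nxt                      # (next state, output)


class ThreadedInstances:
    """Simulates N instances of a module with one shared copy of its
    logic, threading each instance's private state through it in turn."""
    def __init__(self, n):
        self.states = [0] * n            # one state slot per instance

    def target_cycle(self, inputs):
        outputs = []
        # Serially "thread" each instance through the shared logic; in
        # hardware this saves N-1 copies of the module's logic resources
        # at the cost of extra host cycles per target cycle.
        for i, inp in enumerate(inputs):
            self.states[i], out = accumulator_logic(self.states[i], inp)
            outputs.append(out)
        return outputs


insts = ThreadedInstances(4)
insts.target_cycle([1, 2, 3, 4])
print(insts.target_cycle([10, 10, 10, 10]))   # [11, 12, 13, 14]
```

As in software simulators, the per-instance results are identical to instantiating the logic N times; only the host-time/resource trade-off changes, which is what makes the optimization compatible with deterministic, cycle-accurate simulation.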
Advisors: Krste Asanović and Jonathan Bachrach
BibTeX citation:
@phdthesis{Magyar:EECS-2022-24,
    Author = {Magyar, Albert},
    Title = {Improving FPGA Simulation Capacity with Automatic Resource Multi-Threading},
    School = {EECS Department, University of California, Berkeley},
    Year = {2022},
    Month = {May},
    Url = {http://www2.eecs.berkeley.edu/Pubs/TechRpts/2022/EECS-2022-24.html},
    Number = {UCB/EECS-2022-24}
}
EndNote citation:
%0 Thesis
%A Magyar, Albert
%T Improving FPGA Simulation Capacity with Automatic Resource Multi-Threading
%I EECS Department, University of California, Berkeley
%D 2022
%8 May 1
%@ UCB/EECS-2022-24
%U http://www2.eecs.berkeley.edu/Pubs/TechRpts/2022/EECS-2022-24.html
%F Magyar:EECS-2022-24