
Christopher Celio
David A. Patterson
Krste Asanović

Electrical Engineering and Computer Sciences
University of California at Berkeley

Technical Report No. UCB/EECS-2015-167
http://www.eecs.berkeley.edu/Pubs/TechRpts/2015/EECS-2015-167.html

June 13, 2015
The Berkeley Out-of-Order Machine (BOOM):
An Industry-Competitive, Synthesizable, Parameterized RISC-V Processor

Christopher Celio, David Patterson, and Krste Asanović
University of California, Berkeley, California 94720–1770
celio@eecs.berkeley.edu

BOOM is a work-in-progress. Results shown are preliminary and subject to change as of 2015 June.

1. The Berkeley Out-of-Order Machine

BOOM is a synthesizable, parameterized, superscalar out-of-order RISC-V core designed to serve as the prototypical baseline processor for future micro-architectural studies of out-of-order processors. Our goal is to provide a readable, open-source implementation for use in education, research, and industry.

BOOM is written in roughly 9,000 lines of the hardware construction language Chisel. We leveraged Berkeley’s open-source Rocket-chip SoC generator, allowing us to quickly bring up an entire multi-core processor system (including caches and uncore) by replacing the in-order Rocket core with an out-of-order BOOM core. BOOM supports atomics, IEEE 754-2008 floating-point, and page-based virtual memory. We have demonstrated BOOM running Linux, SPEC CINT2006, and CoreMark.

BOOM, configured similarly to an ARM Cortex-A9, achieves 3.91 CoreMarks/MHz with a core size of 0.47 mm² in TSMC 45 nm excluding caches (and 1.1 mm² with 32 kB L1 caches). The in-order Rocket core has been successfully demonstrated to reach over 1.5 GHz in IBM 45 nm SOI, with the SRAM access being the critical path. As BOOM instantiates the same caches as Rocket, BOOM should be similarly constrained to 1.5 GHz. So far we have not found it necessary to deeply pipeline BOOM to keep the logic faster than the SRAM access. With modest resource sizes matching the synthesizable MIPS32 74K, the the worst case path for BOOM’s logic is ~2.2 GHz in TSMC 45 nm (~30 FO4).

2. Leveraging New Infrastructure

The feasibility of BOOM is in large part due to the available infrastructure that has been developed in parallel at Berkeley.

BOOM implements the open-source RISC-V ISA, which was designed from the ground-up to enable VLSI-driven computer architecture research. It is clean, realistic, and highly extensible. Available software includes the GCC and LLVM compilers and a port of the Linux operating system.[6]

BOOM is written in Chisel, an open-source hardware construction language developed to enable advanced hardware design using highly parameterized generators. Chisel allows designers to utilize concepts such as object orientation, functional programming, parameterized types, and type inference. From a single Chisel source, Chisel can generate a cycle-accurate C++ simulator, Verilog targeting FPGA designs, and Verilog targeting ASIC tool-flows.[2]

UC Berkeley also provides the open-source Rocket-chip SoC generator, which has been successfully taped out seven times in two different, modern technologies.[6, 10] BOOM makes significant use of Rocket-chip as a library – the caches, the uncore, and functional units all derive from Rocket. In total, over 11,500 lines of code is instantiated by BOOM.

3. Methodology: What We Plan to Do

The typical methodology for single-core studies, as gathered from an informal sampling of ISCA 2014 papers, is to use CPU2006 coupled with a SimPoints[12]-inspired methodology to choose the most representative section of the reference input set. Each sampling point is typically run for around 10-100 million instructions of detailed software-based simulation.

The average CPU2006 benchmark is roughly 2.2 trillion instructions, with many of the benchmarks exhibiting multiple phases of execution.[9] While completely untenable for software simulators, FPGA-based simulators can bring runtimes to within reason – a 50 MHz FPGA simulation can take over 12 hours for a single benchmark. Moreover, we hope to utilize an FPGA cluster to run all the SPEC workloads in parallel.

4. Comparison to Commercial Designs

Table 1 shows preliminary results of BOOM and Rocket for the CoreMark EEMBC benchmark (we use CoreMark because ARM does not offer SPEC results for the A9 and A15 cores). Our aim is to be competitive in both performance and area against low-power, embedded out-of-order cores.
Table 1: CoreMark results.

<table>
<thead>
<tr>
<th>Processor</th>
<th>Core Area (core+L1s)</th>
<th>CoreMark MHz/Cores</th>
<th>Freq (MHz)</th>
<th>CoreMark Core IPC</th>
</tr>
</thead>
<tbody>
<tr>
<td>Intel Atom E5-2670v3 (Sandy)</td>
<td>≈18 mm² @ 22nm</td>
<td>7.96</td>
<td>2.400</td>
<td>25.077</td>
</tr>
<tr>
<td>Intel Xeon E5-2650v3 (Haswell)</td>
<td>≈14 mm² @ 32nm</td>
<td>5.60</td>
<td>1.500</td>
<td>18.914</td>
</tr>
<tr>
<td>ARM Cortex-A55*</td>
<td>2.8 mm² @ 28nm</td>
<td>4.72</td>
<td>2.116</td>
<td>9.977</td>
</tr>
<tr>
<td>RISC BOOM two-wide*</td>
<td>1.4 mm² @ 45nm</td>
<td>3.79</td>
<td>1.400</td>
<td>7.056</td>
</tr>
<tr>
<td>RISC BOOM four-wide*</td>
<td>1.1 mm² @ 44nm</td>
<td>2.51</td>
<td>1.500</td>
<td>5.865</td>
</tr>
<tr>
<td>ARM Cortex-A53 (Kylig)</td>
<td>≈2.5 mm² @ 40nm</td>
<td>2.91</td>
<td>1.400</td>
<td>5.189</td>
</tr>
<tr>
<td>MIPS 74K</td>
<td>2.5 mm² @ 65nm</td>
<td>2.58</td>
<td>1.500</td>
<td>3.916</td>
</tr>
<tr>
<td>RV64 Rocket*</td>
<td>0.5 mm² @ 46nm</td>
<td>2.32</td>
<td>1.500</td>
<td>3.880</td>
</tr>
<tr>
<td>ARM Cortex-A55*</td>
<td>0.5 mm² @ 40nm</td>
<td>2.13</td>
<td>1.000</td>
<td>2.125</td>
</tr>
</tbody>
</table>

Results collected from *the authors (using gcc51 -O3 and perf) [3], or ‡ [1]. The Intel core areas include the L1 and L2 caches.

Table 2: A sample of academic out-of-order processors.

<table>
<thead>
<tr>
<th>Processor</th>
<th>Fully synthesizable</th>
<th>FPAA</th>
<th>Parameterized</th>
<th>Floating point</th>
<th>Atomic support</th>
<th>L1 cache</th>
<th>L1 cache</th>
<th>Virtual memory</th>
<th>boots Linux</th>
<th>Instruction set</th>
<th>ISA Alpha (sub-set)</th>
<th>SPARC+8</th>
<th>PISA (sub-set)</th>
<th>RISC-V</th>
</tr>
</thead>
<tbody>
<tr>
<td>IVM [13]</td>
<td>✓</td>
<td>✓</td>
<td>✓</td>
<td>✓</td>
<td>✓</td>
<td>✓</td>
<td>✓</td>
<td>✓</td>
<td>✓</td>
<td>✓</td>
<td>✓</td>
<td>✓</td>
<td>✓</td>
<td>✓</td>
</tr>
<tr>
<td>SCOORE [1]</td>
<td>✓</td>
<td>✓</td>
<td>✓</td>
<td>✓</td>
<td>✓</td>
<td>✓</td>
<td>✓</td>
<td>✓</td>
<td>✓</td>
<td>✓</td>
<td>✓</td>
<td>✓</td>
<td>✓</td>
<td>✓</td>
</tr>
<tr>
<td>FabScalar[1,11]</td>
<td>✓</td>
<td>✓</td>
<td>✓</td>
<td>✓</td>
<td>✓</td>
<td>✓</td>
<td>✓</td>
<td>✓</td>
<td>✓</td>
<td>✓</td>
<td>✓</td>
<td>✓</td>
<td>✓</td>
<td>✓</td>
</tr>
<tr>
<td>Sharing [14]</td>
<td>✓</td>
<td>✓</td>
<td>✓</td>
<td>✓</td>
<td>✓</td>
<td>✓</td>
<td>✓</td>
<td>✓</td>
<td>✓</td>
<td>✓</td>
<td>✓</td>
<td>✓</td>
<td>✓</td>
<td>✓</td>
</tr>
</tbody>
</table>

Information was gathered from publicly available code at [4].

5. Related Work

There have been many academic efforts to implement out-of-order cores. The Illinois Verilog Model (IVM) is a 4-issue, out-of-order core designed to study transient faults.[13] The Santa Cruz Out-of-Order RISC Engine (SCOORE) was designed to efficiently target both ASIC and FPGA generation. However, SCOORE lacks a synthesizable fetch unit.

FabScalar is a tool for composing synthesizable out-of-order cores. It searches through a library of parameterized components of varying width and depth, guided by performance constraints given by the designer. FabScalar has been demonstrated on an FPGA,[8] however, as FabScalar did not implement caches, all memory operations were treated as cache hits. Later work incorporated the OpenSPARC T2 caches in a tape-out of FabScalar.[11]

The Sharing Architecture is composed of a two-wide out-of-order core (or “slice”) that can be combined with other slices to form a single, larger out-of-order core. By implementing a slice in RTL, they were able to accurately demonstrate the area costs associated with reconfigurable, virtual cores.[14]

6. Lessons Learned

Single-board FPGAs have gotten more capable of handling mobile processor designs. Chisel provides a back-end mechanism to generate memories optimized for FPGAs, but requires no changes to the processor’s source code. While some coding patterns map poorly to FPGAs (e.g., large variable shifters), generally techniques that map well to ASICs also map well to FPGAs.

Re-use is critical. Some of the most difficult parts of building a processor – for example the cache coherency system, the privileged ISA support, and the FPGA and ASIC flows – came to BOOM “for free” via the Rocket-chip SoC generator.

And as the Rocket-chip SoC evolves, BOOM inherits the new improvements.

Benchmarks are harder to use than they should be. Benchmarks can be difficult to work with and exhibit poor performance portability across different processors, address modes, and ISAs. Many benchmarks (like CoreMark) are written to target 32-bit addresses, which can cause poor code generation for 64-bit processors. We built a histogram generator into the RISC-V ISA simulator to help direct us to potential problem areas. However, additional compiler optimizations are needed to improve 64-bit RISC-V code generation.

We were also surprised to find that SPECINT contains significant floating point code – an integer core may spend over half its time executing software FP routines. As academic SPEC results are typically reported in terms of CPI, we must be careful to not optimize for the wrong cases. We added hardware FP support to BOOM to address this issue.

Finally, we found SPEC difficult to work with, especially in non-native environments. We created the Speckle wrapper to help facilitate cross-compiling and generating portable directories to run on simulators and FPGAs.[5]

Diagnosing bugs that occur billions of cycles into a program is hard. We mostly rely on a Chisel-generated C++ simulator for debugging, but at roughly 30 KIPS, 1 billion cycles takes 8 hours. A torture-test generator (and a suite of small test codes) is invaluable.

Acknowledgments

Research partially funded by DARPA Award Number HR0011-12-2-0016, the Center for Future Architecture Research, a member of STARnet, a Semiconductor Research Corporation program sponsored by MARCO and DARPA, and ASPIRE Lab industrial sponsors and affiliates Intel, Google, Huawei, Nokia, NVIDIA, Oracle, and Samsung. Any opinions, findings, conclusions, or recommendations in this paper are solely those of the authors and does not necessarily reflect the position or the policy of the sponsors.

References


