A Hardware Accelerator Generator for Zstandard Decompression

Junsun Choi

EECS Department, University of California, Berkeley

Technical Report No. UCB/EECS-2024-236

December 20, 2024

http://www2.eecs.berkeley.edu/Pubs/TechRpts/2024/EECS-2024-236.pdf

In hyperscale cloud computing systems, lossless data compression and decompression (referred to as “(de)compression”) is a common low-level function used heavily across applications. (De)compression is unique in that it trades CPU cycles for reduced storage or network bandwidth, and it is considered part of the datacenter tax. (De)compression accounts for 3% of the fleet-wide CPU cycles in Google's datacenters, ranking second only to protocol buffer serialization and deserialization, so accelerating it is anticipated to deliver significant total cost of ownership (TCO) savings in hyperscale systems. Among the various (de)compression algorithms, Zstandard decompression consumes the largest share of the (de)compression CPU cycles in Google's fleet, at 25.8%, making it the top candidate for optimization through a custom hardware accelerator.

A large body of prior work has proposed enhanced microarchitectural features for existing (de)compression hardware, but the effect of high-level design parameters on the feasibility of integrating decompression hardware has not been thoroughly explored. This thesis therefore presents a Zstandard decompression accelerator generator that exposes tunable design parameters, including the history buffer size and the number of bits used in speculative Huffman decoding (referred to as “Huffman speculation bits”). The generator is open-sourced and integrated into an open-source RISC-V SoC ecosystem for performance and area evaluation of accelerator designs under different placement configurations. With this approach, the thesis performs a design space exploration spanning a 15.1x range in SoC speedup and a 1.5x range in SoC silicon area. The exploration clarifies the impact of history buffer size, Huffman speculation bits, and accelerator placement on the SoC's quality-of-result (QoR), enabling an in-depth assessment of design strategies for the accelerated SoC in hyperscale systems. The final optimized SoC with the Zstandard decompression accelerator is 5.6x faster than a single Xeon core while consuming only about 12% of the Xeon core's area.
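The “Huffman speculation bits” parameter can be illustrated with a table-based decoder: with S speculation bits, a 2^S-entry lookup table resolves any Huffman code of length at most S in a single step, so larger S raises the fast-path hit rate at the cost of exponentially more table area. The following is a minimal Python sketch of that idea only, not the thesis's hardware design; the code table and symbols are invented for illustration, and codes longer than S bits (which a real design must route to a fallback path) are simply skipped here.

```python
# Hedged sketch of table-based speculative Huffman decoding.
# With S "speculation bits", a 2^S-entry table maps the next S bits of the
# input directly to (symbol, code_length) whenever the code fits in S bits.

def build_spec_table(codes, spec_bits):
    """codes: dict mapping symbol -> (codeword_as_int, code_length_in_bits)."""
    table = [None] * (1 << spec_bits)
    for sym, (cw, length) in codes.items():
        if length > spec_bits:
            continue  # too long for the speculative table; needs a slow path
        # Every spec_bits-wide pattern whose prefix equals the codeword
        # resolves to this symbol.
        for pad in range(1 << (spec_bits - length)):
            table[(cw << (spec_bits - length)) | pad] = (sym, length)
    return table

def decode(bitstring, codes, spec_bits):
    """Decode a string of '0'/'1' characters, assuming all codes fit in
    spec_bits (no fallback path in this sketch)."""
    table = build_spec_table(codes, spec_bits)
    out, pos = [], 0
    while pos < len(bitstring):
        # Peek spec_bits bits (zero-padded at the end of the stream).
        window = bitstring[pos:pos + spec_bits].ljust(spec_bits, "0")
        sym, length = table[int(window, 2)]
        out.append(sym)
        pos += length  # consume only the bits of the matched code
    return out
```

For example, with the toy prefix code {a: 0, b: 10, c: 11} and S = 2, `decode("01011", codes, 2)` walks the stream one table lookup per symbol. In hardware, the same trade-off appears as the width of the speculative lookup structure.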

Advisors: Borivoje Nikolic and Sophia Shao


BibTeX citation:

@mastersthesis{Choi:EECS-2024-236,
    Author= {Choi, Junsun},
    Title= {A Hardware Accelerator Generator for Zstandard Decompression},
    School= {EECS Department, University of California, Berkeley},
    Year= {2024},
    Month= {Dec},
    Url= {http://www2.eecs.berkeley.edu/Pubs/TechRpts/2024/EECS-2024-236.html},
    Number= {UCB/EECS-2024-236},
    Abstract= {In hyperscale cloud computing systems, lossless data compression and decompression (referred to as “(de)compression”) is a common low-level function used heavily across applications. (De)compression is unique in that it trades CPU cycles for reduced storage or network bandwidth, and it is considered part of the datacenter tax. (De)compression accounts for 3% of the fleet-wide CPU cycles in Google's datacenters, ranking second only to protocol buffer serialization and deserialization, so accelerating it is anticipated to deliver significant total cost of ownership (TCO) savings in hyperscale systems. Among the various (de)compression algorithms, Zstandard decompression consumes the largest share of the (de)compression CPU cycles in Google's fleet, at 25.8%, making it the top candidate for optimization through a custom hardware accelerator.

A large body of prior work has proposed enhanced microarchitectural features for existing (de)compression hardware, but the effect of high-level design parameters on the feasibility of integrating decompression hardware has not been thoroughly explored. This thesis therefore presents a Zstandard decompression accelerator generator that exposes tunable design parameters, including the history buffer size and the number of bits used in speculative Huffman decoding (referred to as “Huffman speculation bits”). The generator is open-sourced and integrated into an open-source RISC-V SoC ecosystem for performance and area evaluation of accelerator designs under different placement configurations. With this approach, the thesis performs a design space exploration spanning a 15.1x range in SoC speedup and a 1.5x range in SoC silicon area. The exploration clarifies the impact of history buffer size, Huffman speculation bits, and accelerator placement on the SoC's quality-of-result (QoR), enabling an in-depth assessment of design strategies for the accelerated SoC in hyperscale systems. The final optimized SoC with the Zstandard decompression accelerator is 5.6x faster than a single Xeon core while consuming only about 12% of the Xeon core's area.},
}

EndNote citation:

%0 Thesis
%A Choi, Junsun 
%T A Hardware Accelerator Generator for Zstandard Decompression
%I EECS Department, University of California, Berkeley
%D 2024
%8 December 20
%@ UCB/EECS-2024-236
%U http://www2.eecs.berkeley.edu/Pubs/TechRpts/2024/EECS-2024-236.html
%F Choi:EECS-2024-236