Franklin Huang

EECS Department, University of California, Berkeley

Technical Report No. UCB/EECS-2024-120

May 17, 2024

http://www2.eecs.berkeley.edu/Pubs/TechRpts/2024/EECS-2024-120.pdf

Machine learning systems today are developing in two opposite directions, both of which demand increasing hardware awareness to make productization feasible. On one hand, stable scaling laws in large language models (LLMs) push system scale ever larger; on the other hand, robotics and wearable applications require neural networks to fit into small systems with extremely tight compute, memory, and power budgets.

Even though Moore's Law and architectural innovations have sustained compute performance growth, the widening processor-memory gap demands near-term innovation. While anticipated hardware advances such as GDDR7, HBM4, and UCIe are expected to narrow this gap, challenges in the memory hierarchy will persist. It is therefore crucial to design kernels that improve inference throughput by efficiently utilizing and managing the memory hierarchy.

This technical report explores improvements in the compression of a broad range of language models and compares them to the state of the art. While quantization benefits all models, this report finds that sparsity and entropy-coding methods are particularly effective for smaller models, reducing the bitrate to as low as 1.96 bits per weight with minimal accuracy loss. In contrast, larger language models derive greater benefit from enhanced data reuse and page-based memory management techniques.
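
As a rough illustration (a toy sketch of my own, not the report's actual compression pipeline), the Python snippet below magnitude-prunes a weight matrix, uniformly quantizes the surviving weights to 4 bits, and reports the Shannon entropy of the resulting symbols, which lower-bounds what an ideal entropy coder could achieve; the pruning rule, bit width, and sparsity level are assumptions chosen only to show how the effective bitrate can fall below the nominal quantization width.

import numpy as np

def estimate_bits_per_weight(weights, num_bits=4, sparsity=0.5):
    """Estimate an entropy-coded bitrate for pruned, quantized weights.

    Magnitude-prune a fraction of the weights to zero, uniformly quantize
    the rest to num_bits levels, and return the Shannon entropy of the
    resulting symbol distribution in bits per weight.
    """
    w = weights.flatten().astype(np.float64)

    # Magnitude pruning: zero out the smallest-magnitude fraction of weights.
    threshold = np.quantile(np.abs(w), sparsity)
    w = np.where(np.abs(w) < threshold, 0.0, w)

    # Uniform symmetric quantization of the surviving weights.
    max_abs = np.max(np.abs(w))
    scale = max_abs / (2 ** (num_bits - 1) - 1) if max_abs > 0 else 1.0
    q = np.round(w / scale).astype(np.int64)

    # Empirical symbol distribution and its Shannon entropy (bits per weight).
    _, counts = np.unique(q, return_counts=True)
    probs = counts / counts.sum()
    return float(-(probs * np.log2(probs)).sum())

# Example on a hypothetical layer: once the zero symbol dominates, a pruned,
# 4-bit-quantized layer typically codes to well under 4 bits per weight.
rng = np.random.default_rng(0)
print(f"{estimate_bits_per_weight(rng.normal(size=(1024, 1024))):.2f} bits/weight")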

Specifically, in CodeGen applications where parallel sampling improves accuracy, these strategies have demonstrated the potential to reduce the memory capacity and bandwidth requirements of attention kernels by 15x. When evaluating problem-solving ability, parallel sampling from a smaller model effectively matches a single sample from a larger model while using one tenth the memory and parameter count. These results make local deployment feasible for both real-time embedded systems and language-model applications.
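
To make the memory argument concrete, the following back-of-envelope sketch estimates the attention KV-cache footprint when the prompt's key/value pages are shared across parallel samples, in the spirit of page-based memory management; the model shape, sequence lengths, and sample count are hypothetical assumptions, not the configuration measured in the report.

# Hypothetical KV-cache sizing for parallel sampling with a shared prompt.
# All shapes and lengths below are illustrative assumptions.

def kv_cache_bytes(tokens, layers=32, heads=32, head_dim=128, bytes_per_elem=2):
    """Bytes of key+value cache for one sequence of `tokens` tokens."""
    return 2 * tokens * layers * heads * head_dim * bytes_per_elem

prompt_len, gen_len, n_samples = 2000, 128, 20

# Naive: every sample keeps its own copy of the prompt's KV entries.
naive = n_samples * kv_cache_bytes(prompt_len + gen_len)

# Shared prefix: one copy of the prompt's pages, plus per-sample pages
# covering only the newly generated tokens.
shared = kv_cache_bytes(prompt_len) + n_samples * kv_cache_bytes(gen_len)

print(f"naive : {naive / 2**30:.1f} GiB")
print(f"shared: {shared / 2**30:.1f} GiB  ({naive / shared:.1f}x smaller)")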

Advisor: Borivoje Nikolic


BibTeX citation:

@mastersthesis{Huang:EECS-2024-120,
    Author= {Huang, Franklin},
    Title= {Machine Learning Systems with Reduced Memory Requirements},
    School= {EECS Department, University of California, Berkeley},
    Year= {2024},
    Month= {May},
    Url= {http://www2.eecs.berkeley.edu/Pubs/TechRpts/2024/EECS-2024-120.html},
    Number= {UCB/EECS-2024-120},
    Abstract= {Machine learning systems today are developing in two opposite directions, both of which demand increasing hardware awareness to make productization feasible. On one hand, stable scaling laws in large language models (LLMs) push system scale ever larger; on the other hand, robotics and wearable applications require neural networks to fit into small systems with extremely tight compute, memory, and power budgets.

Even though Moore's Law and architectural innovations have sustained compute performance growth, the widening processor-memory gap demands near-term innovation. While anticipated hardware advances such as GDDR7, HBM4, and UCIe are expected to narrow this gap, challenges in the memory hierarchy will persist. It is therefore crucial to design kernels that improve inference throughput by efficiently utilizing and managing the memory hierarchy.

This technical report explores improvements in the compression of a broad range of language models and compares them to the state of the art. While quantization benefits all models, this report finds that sparsity and entropy-coding methods are particularly effective for smaller models, reducing the bitrate to as low as 1.96 bits per weight with minimal accuracy loss. In contrast, larger language models derive greater benefit from enhanced data reuse and page-based memory management techniques.

Specifically, in CodeGen applications where parallel sampling improves accuracy, these strategies have demonstrated the potential to reduce the memory capacity and bandwidth requirements of attention kernels by 15x. When evaluating problem-solving ability, parallel sampling from a smaller model effectively matches a single sample from a larger model while using one tenth the memory and parameter count. These results make local deployment feasible for both real-time embedded systems and language-model applications.},
}

EndNote citation:

%0 Thesis
%A Huang, Franklin 
%T Machine Learning Systems with Reduced Memory Requirements
%I EECS Department, University of California, Berkeley
%D 2024
%8 May 17
%@ UCB/EECS-2024-120
%U http://www2.eecs.berkeley.edu/Pubs/TechRpts/2024/EECS-2024-120.html
%F Huang:EECS-2024-120