Full Stack Approach for Efficient Deep Learning Inference
Sehoon Kim
EECS Department
University of California, Berkeley
Technical Report No. UCB/EECS-2024-210
December 12, 2024
http://www2.eecs.berkeley.edu/Pubs/TechRpts/2024/EECS-2024-210.pdf
Recent advancements in AI technologies have led to unprecedented growth in model sizes, particularly with the advent of large language models (LLMs). While these models have shown great capabilities in various domains, their exponential scaling has introduced significant inference-time overheads, such as increased memory requirements, latency, and computational costs, thereby making efficient deployment and serving challenging. This thesis addresses these challenges through a full-stack approach that enhances efficiency across four key components of the AI inference stack: model optimization, inference methods, model architectures, and applications.
For model optimization, we introduce quantization techniques to optimize inference-time compute and memory requirements. I-BERT optimizes compute by leveraging integer-only quantization, which achieves up to a 3.5× latency speedup and enables deployment of Transformer architectures on integer-only hardware. SqueezeLLM, which employs extremely low-bit weight quantization, effectively reduces memory requirements without sacrificing accuracy during LLM inference. For enhanced inference methods, we present the Big Little Decoder, a speculative decoding framework that accelerates autoregressive LLM inference by up to 2× through a collaboration between small and large models. Regarding model architectures, we propose an efficient design for speech recognition using a Temporal U-Net structure, which improves inference efficiency by shortening input sequence lengths. Finally, at the application level, we introduce LLMCompiler, a framework for efficiently orchestrating multiple function calls in LLM-based applications, which reduces execution latency and costs while enhancing robustness by decomposing complex user inputs into smaller, easier tasks. Collectively, these contributions provide a full-stack strategy for optimizing AI model inference from low-level systems to high-level applications to enable the efficient deployment and serving of state-of-the-art AI solutions.
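To make the draft-and-verify idea behind speculative decoding concrete, the sketch below shows a toy decoding loop in Python. It is an illustration only, assuming hypothetical draft_model and target_model callables over a toy vocabulary and a simplified threshold-based acceptance rule; it is not the Big Little Decoder's actual fallback policy or the thesis implementation.

# Illustrative sketch only: a toy draft-and-verify loop in the spirit of
# speculative decoding. draft_model, target_model, the toy vocabulary, and the
# threshold-based acceptance rule are hypothetical stand-ins.
import random

VOCAB = list("abcde")

def draft_model(context):
    # Cheap "small" model: proposes a next token with a confidence score.
    return random.choice(VOCAB), 0.9

def target_model(context, candidates):
    # Expensive "large" model: scores the candidate tokens it is given.
    return [random.random() for _ in candidates]

def speculative_decode(prompt, draft_len=4, max_tokens=20, accept=0.3):
    tokens = list(prompt)
    while len(tokens) < max_tokens:
        # 1) Draft: the small model proposes draft_len tokens autoregressively.
        draft = []
        for _ in range(draft_len):
            tok, _ = draft_model(tokens + draft)
            draft.append(tok)
        # 2) Verify: the large model scores the whole draft at once
        #    (a single batched forward pass in a real system).
        scores = target_model(tokens, draft)
        # 3) Accept the longest agreeing prefix; on the first rejection,
        #    append one token from the large model so the loop always advances.
        rejected = False
        for tok, score in zip(draft, scores):
            if score < accept:
                rejected = True
                break
            tokens.append(tok)
        if rejected:
            best = max(VOCAB, key=lambda t: target_model(tokens, [t])[0])
            tokens.append(best)
    return "".join(tokens)

if __name__ == "__main__":
    random.seed(0)
    print(speculative_decode("ab"))

In this simplified loop the large model is consulted once per drafted block rather than once per token, which is the source of the latency savings the abstract describes; the acceptance criterion and fallback step would differ in a real system.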
Advisor: Kurt Keutzer
"; ?>
BibTeX citation:
@phdthesis{Kim:EECS-2024-210,
    Author = {Kim, Sehoon},
    Title = {Full Stack Approach for Efficient Deep Learning Inference},
    School = {EECS Department, University of California, Berkeley},
    Year = {2024},
    Month = {Dec},
    URL = {http://www2.eecs.berkeley.edu/Pubs/TechRpts/2024/EECS-2024-210.html},
    Number = {UCB/EECS-2024-210},
    Abstract = {Recent advancements in AI technologies have led to unprecedented growth in model sizes, particularly with the advent of large language models (LLMs). While these models have shown great capabilities in various domains, their exponential scaling has introduced significant inference-time overheads, such as increased memory requirements, latency, and computational costs, thereby making efficient deployment and serving challenging. This thesis addresses these challenges through a full-stack approach that enhances efficiency across four key components of the AI inference stack: model optimization, inference methods, model architectures, and applications. For model optimization, we introduce quantization techniques to optimize inference-time compute and memory requirements. I-BERT optimizes compute by leveraging integer-only quantization, which achieves up to a 3.5× latency speedup and enables deployment of Transformer architectures on integer-only hardware. SqueezeLLM, which employs extremely low-bit weight quantization, effectively reduces memory requirements without sacrificing accuracy during LLM inference. For enhanced inference methods, we present the Big Little Decoder, a speculative decoding framework that accelerates autoregressive LLM inference by up to 2× through a collaboration between small and large models. Regarding model architectures, we propose an efficient design for speech recognition using a Temporal U-Net structure, which improves inference efficiency by shortening input sequence lengths. Finally, at the application level, we introduce LLMCompiler, a framework for efficiently orchestrating multiple function calls in LLM-based applications, which reduces execution latency and costs while enhancing robustness by decomposing complex user inputs into smaller, easier tasks. Collectively, these contributions provide a full-stack strategy for optimizing AI model inference from low-level systems to high-level applications to enable the efficient deployment and serving of state-of-the-art AI solutions.}
}
EndNote citation:
%0 Thesis
%A Kim, Sehoon
%T Full Stack Approach for Efficient Deep Learning Inference
%I EECS Department, University of California, Berkeley
%D 2024
%8 December 12
%@ UCB/EECS-2024-210
%U http://www2.eecs.berkeley.edu/Pubs/TechRpts/2024/EECS-2024-210.html
%F Kim:EECS-2024-210