Isaac Ong

EECS Department, University of California, Berkeley

Technical Report No. UCB/EECS-2024-108

May 16, 2024

http://www2.eecs.berkeley.edu/Pubs/TechRpts/2024/EECS-2024-108.pdf

In light of the rapidly increasing size of large language models (LLMs), this work addresses the challenge of serving LLMs efficiently within the limits of modern GPU memory. We observe that LLM inference is unique compared to that of other models due to the wide variation in input lengths, a factor not adequately addressed by existing work. Current inference engines typically employ a static partitioning strategy, which is sub-optimal given the variability in input lengths and the diversity of GPU specifications. To overcome these challenges, we propose dynamic partitioning for distributed LLM inference, which switches between partitioning strategies at inference time to optimize for both GPU characteristics and input length. We systematically search for all Pareto-optimal partitioning strategies for distributed LLM inference with respect to their computational requirements, communication overhead, and memory demands. Based on this search, we identify three Pareto-optimal strategies that cater to different scenarios and implement an inference engine that supports dynamic partitioning. Our evaluation, conducted on NVIDIA L4 and A100 GPUs using the Llama 2 family of models, demonstrates significant improvements over existing approaches: reductions of up to 40% in time to first token and up to 18% in latency, underlining the effectiveness of dynamic partitioning. Our findings pave the way for more efficient utilization of GPU resources in distributed LLM inference, accommodating the evolving landscape of model sizes and architectures.
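
To make the mechanism concrete, the sketch below shows one way an inference engine might dispatch each request to a partitioning strategy based on input length and GPU characteristics. This is a minimal illustration under stated assumptions, not the report's implementation: the strategy names, the GPU profile fields, the length threshold, and the selection policy are all hypothetical.

    # Illustrative sketch only: a hypothetical per-request dispatcher that
    # picks a partitioning strategy from the input length and a GPU profile.
    # The strategies, fields, and thresholds below are assumptions for
    # exposition, not the three strategies identified in the report.
    from dataclasses import dataclass
    from enum import Enum, auto


    class PartitionStrategy(Enum):
        TENSOR_PARALLEL = auto()    # shard weights across GPUs, all-reduce per layer
        PIPELINE_PARALLEL = auto()  # split layers across GPUs, pass activations
        SEQUENCE_PARALLEL = auto()  # split the input sequence across GPUs


    @dataclass
    class GPUProfile:
        memory_gb: float            # per-GPU memory capacity
        interconnect_gbps: float    # e.g. NVLink-class vs. PCIe-class bandwidth


    def choose_strategy(input_len: int, gpu: GPUProfile,
                        long_prompt_threshold: int = 2048) -> PartitionStrategy:
        """Pick a partitioning strategy for one request (hypothetical policy)."""
        if input_len >= long_prompt_threshold:
            # Long prompts are compute-heavy during prefill; splitting the
            # sequence keeps every GPU busy and reduces time to first token.
            return PartitionStrategy.SEQUENCE_PARALLEL
        if gpu.interconnect_gbps < 100:
            # Slow interconnects make per-layer all-reduces expensive, so
            # pipeline parallelism's point-to-point transfers are cheaper.
            return PartitionStrategy.PIPELINE_PARALLEL
        # Short prompts on well-connected GPUs: tensor parallelism gives the
        # lowest per-token latency during decoding.
        return PartitionStrategy.TENSOR_PARALLEL


    if __name__ == "__main__":
        a100 = GPUProfile(memory_gb=80, interconnect_gbps=600)  # NVLink-class
        l4 = GPUProfile(memory_gb=24, interconnect_gbps=32)     # PCIe-class
        print(choose_strategy(input_len=4096, gpu=a100))  # SEQUENCE_PARALLEL
        print(choose_strategy(input_len=128, gpu=l4))     # PIPELINE_PARALLEL

In a real engine, the switching decision would presumably be driven by profiling or a cost model over compute, communication, and memory, rather than a fixed threshold as in this toy policy.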

Advisor: Ion Stoica


BibTeX citation:

@mastersthesis{Ong:EECS-2024-108,
    Author= {Ong, Isaac},
    Title= {Efficient Distributed LLM Inference with Dynamic Partitioning},
    School= {EECS Department, University of California, Berkeley},
    Year= {2024},
    Month= {May},
    Url= {http://www2.eecs.berkeley.edu/Pubs/TechRpts/2024/EECS-2024-108.html},
    Number= {UCB/EECS-2024-108},
}

EndNote citation:

%0 Thesis
%A Ong, Isaac 
%T Efficient Distributed LLM Inference with Dynamic Partitioning
%I EECS Department, University of California, Berkeley
%D 2024
%8 May 16
%@ UCB/EECS-2024-108
%U http://www2.eecs.berkeley.edu/Pubs/TechRpts/2024/EECS-2024-108.html
%F Ong:EECS-2024-108