Task Scheduling for Decentralized LLM Serving in Heterogeneous Networks

Elden Ren

EECS Department, University of California, Berkeley

Technical Report No. UCB/EECS-2024-111

May 16, 2024

http://www2.eecs.berkeley.edu/Pubs/TechRpts/2024/EECS-2024-111.pdf

The rapid expansion of generative AI and its integration into daily workflows have magnified the demand for large language model (LLM) inference services. However, LLM deployment is often burdened by the high cost and limited availability of GPU resources. A Decentralized Physical Infrastructure Network (DePIN) tailored for LLM inference could leverage idle GPU resources worldwide to enable decentralized serving of LLMs. In this paper, we address the challenge of scheduling model pipelines over a heterogeneous network to enable fast decentralized serving. We focus on time per output token (TPOT), which measures the latency to generate each token after the first. We present a novel heuristic scheduling algorithm that enables fast inference in decentralized LLM serving systems and remains practical for large, heterogeneous networks. Experimental results demonstrate the feasibility of using consumer-grade GPUs for low-latency LLM inference and validate the effectiveness of our algorithm: it achieves uniformly lower TPOT than the integer programming baseline while requiring less execution time.
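For concreteness, the minimal Python sketch below illustrates how a TPOT value can be computed from request timings as the average inter-token latency after the first token; the function name and the timing numbers are illustrative assumptions, not taken from the report.

def time_per_output_token(total_latency_s: float,
                          time_to_first_token_s: float,
                          num_output_tokens: int) -> float:
    """Average latency per generated token, excluding the first token."""
    if num_output_tokens < 2:
        raise ValueError("TPOT requires at least two output tokens")
    # TPOT = (end-to-end latency - time to first token) / (tokens after the first)
    return (total_latency_s - time_to_first_token_s) / (num_output_tokens - 1)

# Hypothetical example: a 128-token completion that took 6.4 s end to end,
# with 0.8 s spent before the first token appeared.
tpot = time_per_output_token(total_latency_s=6.4,
                             time_to_first_token_s=0.8,
                             num_output_tokens=128)
print(f"TPOT: {tpot * 1000:.1f} ms/token")  # about 44.1 ms/token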

Advisor: Dawn Song


BibTeX citation:

@mastersthesis{Ren:EECS-2024-111,
    Author= {Ren, Elden},
    Title= {Task Scheduling for Decentralized LLM Serving in Heterogeneous Networks},
    School= {EECS Department, University of California, Berkeley},
    Year= {2024},
    Month= {May},
    Url= {http://www2.eecs.berkeley.edu/Pubs/TechRpts/2024/EECS-2024-111.html},
    Number= {UCB/EECS-2024-111},
    Abstract= {The rapid expansion of generative AI and its integration into daily workflows have magnified the demand for large language model (LLM) inference services. However, LLM deployment is often burdened by the high cost and limited availability of GPU resources. A Decentralized Physical Infrastructure Network (DePIN) tailored for LLM inference could leverage idle GPU resources worldwide to enable decentralized serving of LLMs. In this paper, we address the challenge of scheduling model pipelines over a heterogeneous network to enable fast decentralized serving. We focus on time per output token (TPOT), which measures the latency to generate each token after the first. We present a novel heuristic scheduling algorithm that enables fast inference in decentralized LLM serving systems and remains practical for large, heterogeneous networks. Experimental results demonstrate the feasibility of using consumer-grade GPUs for low-latency LLM inference and validate the effectiveness of our algorithm: it achieves uniformly lower TPOT than the integer programming baseline while requiring less execution time.},
}

EndNote citation:

%0 Thesis
%A Ren, Elden 
%T Task Scheduling for Decentralized LLM Serving in Heterogeneous Networks
%I EECS Department, University of California, Berkeley
%D 2024
%8 May 16
%@ UCB/EECS-2024-111
%U http://www2.eecs.berkeley.edu/Pubs/TechRpts/2024/EECS-2024-111.html
%F Ren:EECS-2024-111