Aditya Ramkumar

EECS Department, University of California, Berkeley

Technical Report No. UCB/EECS-2022-88

May 13, 2022

http://www2.eecs.berkeley.edu/Pubs/TechRpts/2022/EECS-2022-88.pdf

Serverless computing has the potential to make application deployment over cloud resources more seamless and scalable. However, most work from major cloud providers has focused on easily divisible resources like CPUs and local DRAM. Since many applications, such as ML inference, benefit from specialized hardware like GPUs, bringing GPUs into the serverless setting will unlock new use cases. However, serverless GPUs come with a unique set of challenges, including price, cold starts, and memory management. In this thesis, I address some theoretical and practical issues that prevent serverless GPUs from becoming a reality.

Because of their utility, GPUs in a serverless setting are often in high demand and limited supply, which can overwhelm the system. Users of these GPUs often have tight cost or latency constraints (for example, inference for self-driving cars). In this thesis, I explore scheduling policies that efficiently allocate GPU resources to requests based on user-provided Service Level Objectives. I further consider a heterogeneous set of resources (both CPUs and GPUs) and explore how policies rooted in admission control can keep the system from being overwhelmed.
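To make the idea of SLO-driven admission control concrete, the following toy sketch (my own illustration, not the policy evaluated in the thesis) admits a request onto a GPU only if its latency SLO can still be met, falls back to a CPU when possible, and otherwise rejects it. The request fields, runtime estimates, and simple queue model are assumptions made for this example.

# Illustrative sketch only: a toy SLO-aware admission-control policy over a
# heterogeneous CPU/GPU pool. All names and the queue/runtime model are
# assumptions for this example, not the scheduler studied in the thesis.
from dataclasses import dataclass
from typing import Optional

@dataclass
class Request:
    slo_deadline_ms: float   # user-provided latency SLO
    gpu_runtime_ms: float    # estimated runtime on a GPU
    cpu_runtime_ms: float    # estimated runtime on a CPU

@dataclass
class Pool:
    gpu_queue_ms: float = 0.0   # work already queued on GPUs
    cpu_queue_ms: float = 0.0   # work already queued on CPUs

def admit(req: Request, pool: Pool) -> Optional[str]:
    """Return 'gpu', 'cpu', or None (reject) depending on whether the SLO can be met."""
    # Prefer a GPU if the request would still finish within its SLO.
    if pool.gpu_queue_ms + req.gpu_runtime_ms <= req.slo_deadline_ms:
        pool.gpu_queue_ms += req.gpu_runtime_ms
        return "gpu"
    # Fall back to a CPU if that still satisfies the SLO.
    if pool.cpu_queue_ms + req.cpu_runtime_ms <= req.slo_deadline_ms:
        pool.cpu_queue_ms += req.cpu_runtime_ms
        return "cpu"
    # Reject rather than admit work that would miss its SLO and overload the system.
    return None

pool = Pool()
print(admit(Request(slo_deadline_ms=50, gpu_runtime_ms=10, cpu_runtime_ms=80), pool))  # 'gpu'

Rejecting up front is what keeps the shared pool from being overwhelmed: requests that cannot meet their SLO anyway never consume GPU time.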

GPU scheduling in a FaaS setting also runs into some inherent system limitations. Most existing solutions require applications to carefully design their tasks to share resources manually and to clean up properly when they finish, which adds overhead for application developers and leaves room for inefficient resource utilization. Another challenge is that initializing a new accelerator resource incurs significant startup latency due to container and language-runtime initialization. Recently, Nathan Pemberton proposed a new Kernel as a Service (KaaS) paradigm, where the system is responsible for managing GPU memory and schedules user kernels across the entire pool of available GPUs rather than relying on static allocations. Since KaaS manages resources at the system level, it opens up a new set of challenges around request scheduling. I explore various policies and evaluate them on metrics such as cold-start minimization, average wait time, and fairness.
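As a rough illustration of the kind of decision a system-managed scheduler can make, the sketch below (a toy of my own, not Pemberton's design or the policies evaluated in the thesis) dispatches each kernel to a pooled GPU that already holds its inputs in memory, falling back to the least-loaded GPU otherwise; this is one simple way to trade off cold starts against wait time. The data structures are assumptions made for this example.

# Illustrative sketch only: a toy KaaS-style dispatcher. The system, not the
# user, tracks which objects are resident in each GPU's memory and routes each
# kernel invocation accordingly.
from dataclasses import dataclass, field

@dataclass
class Gpu:
    name: str
    resident: set = field(default_factory=set)  # object ids currently in GPU memory
    queued_ms: float = 0.0                      # outstanding work on this GPU

def place_kernel(inputs: set, runtime_ms: float, gpus: list) -> Gpu:
    """Pick a GPU for one kernel: warm (data-resident) GPUs first, then least loaded."""
    warm = [g for g in gpus if inputs <= g.resident]
    target = min(warm or gpus, key=lambda g: g.queued_ms)
    target.resident |= inputs      # inputs become resident, so later kernels hit a warm GPU
    target.queued_ms += runtime_ms
    return target

pool = [Gpu("gpu0"), Gpu("gpu1")]
print(place_kernel({"model-weights"}, 5.0, pool).name)  # 'gpu0' (least loaded)
print(place_kernel({"model-weights"}, 5.0, pool).name)  # 'gpu0' again (warm hit avoids a cold start)

Even this toy exposes the tension the thesis evaluates: always chasing warm GPUs minimizes cold starts but can concentrate load, hurting average wait time and fairness.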

Advisor: Joseph Gonzalez


BibTeX citation:

@mastersthesis{Ramkumar:EECS-2022-88,
    Author= {Ramkumar, Aditya},
    Title= {Making the Most of Serverless Accelerators},
    School= {EECS Department, University of California, Berkeley},
    Year= {2022},
    Month= {May},
    Url= {http://www2.eecs.berkeley.edu/Pubs/TechRpts/2022/EECS-2022-88.html},
    Number= {UCB/EECS-2022-88},
    Abstract= {Serverless computing has the potential to make application deployment over cloud resources more seamless and scalable. However, most work from major cloud providers has focused on easily divisible resources like CPUs and local DRAM. Since many applications, such as ML inference, benefit from specialized hardware like GPUs, bringing GPUs into the serverless setting will unlock new use cases. However, serverless GPUs come with a unique set of challenges, including price, cold starts, and memory management. In this thesis, I address some theoretical and practical issues that prevent serverless GPUs from becoming a reality.

Because of their utility, GPUs in a serverless setting are often in high demand and limited supply, which can overwhelm the system. Users of these GPUs often have tight cost or latency constraints (for example, inference for self-driving cars). In this thesis, I explore scheduling policies that efficiently allocate GPU resources to requests based on user-provided Service Level Objectives. I further consider a heterogeneous set of resources (both CPUs and GPUs) and explore how policies rooted in admission control can keep the system from being overwhelmed.

GPU scheduling in a FaaS setting also runs into some inherent system limitations. Most existing solutions require applications to carefully design their tasks to share resources manually and to clean up properly when they finish, which adds overhead for application developers and leaves room for inefficient resource utilization. Another challenge is that initializing a new accelerator resource incurs significant startup latency due to container and language-runtime initialization. Recently, Nathan Pemberton proposed a new Kernel as a Service (KaaS) paradigm, where the system is responsible for managing GPU memory and schedules user kernels across the entire pool of available GPUs rather than relying on static allocations. Since KaaS manages resources at the system level, it opens up a new set of challenges around request scheduling. I explore various policies and evaluate them on metrics such as cold-start minimization, average wait time, and fairness.},
}

EndNote citation:

%0 Thesis
%A Ramkumar, Aditya 
%T Making the Most of Serverless Accelerators
%I EECS Department, University of California, Berkeley
%D 2022
%8 May 13
%@ UCB/EECS-2022-88
%U http://www2.eecs.berkeley.edu/Pubs/TechRpts/2022/EECS-2022-88.html
%F Ramkumar:EECS-2022-88