Ziming Mao and Tian Xia and Zhanghao Wu and Wei-Lin Chiang and Tyler Griggs and Romil Bhardwaj and Zongheng Yang and Scott Shenker and Ion Stoica

EECS Department, University of California, Berkeley

Technical Report No. UCB/EECS-2025-227

December 31, 2025

http://www2.eecs.berkeley.edu/Pubs/TechRpts/2025/EECS-2025-227.pdf

Recent years have witnessed explosive growth in AI models. The high cost of hosting AI services on GPUs and their demanding service requirements make it both timely and challenging to lower serving costs while guaranteeing service quality. Although spot instances have long been offered at a large discount, spot preemptions have discouraged users from hosting model replicas on them when serving AI models.

To address this, we introduce SkyServe, a system that efficiently serves AI models over a mixture of spot and on-demand replicas across regions and clouds. SkyServe intelligently spreads spot replicas across different failure domains (e.g., regions or clouds) to improve availability and reduce correlated preemptions, overprovisions more cheap spot replicas than strictly required as a safeguard against possible preemptions, and dynamically falls back to on-demand replicas when spot replicas become unavailable. We compare SkyServe with both research and production systems on real AI workloads: SkyServe reduces cost by up to 44% while achieving high resource availability compared to using on-demand replicas. Additionally, SkyServe improves P50, P90, and P99 latency by up to 2.6x, 3.1x, and 2.7x, respectively, compared to other research and production systems.
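The abstract describes three policies: spreading spot replicas across failure domains, overprovisioning spot capacity as a preemption buffer, and falling back to on-demand replicas when spot capacity falls short. A minimal sketch of such a policy is below; all names, signatures, and numbers (e.g. the overprovision factor) are hypothetical illustrations, not SkyServe's actual algorithm.

```python
# Hypothetical sketch of a mixed spot/on-demand provisioning policy in the
# spirit of the abstract. Not SkyServe's real implementation.

def plan_replicas(target, zones, spot_available, overprovision_factor=1.5):
    """Return (spot_plan, num_on_demand) for a desired replica count.

    - Overprovision spot replicas beyond `target` as a preemption buffer.
    - Spread spot replicas round-robin across failure domains (zones)
      to reduce correlated preemptions.
    - Fall back to on-demand replicas for any shortfall in spot capacity.
    """
    want_spot = int(target * overprovision_factor)
    num_spot = min(want_spot, spot_available)

    # Round-robin placement spreads correlated-preemption risk.
    spot_plan = {z: 0 for z in zones}
    for i in range(num_spot):
        spot_plan[zones[i % len(zones)]] += 1

    # On-demand fallback covers whatever spot capacity cannot.
    num_on_demand = max(0, target - num_spot)
    return spot_plan, num_on_demand


plan, on_demand = plan_replicas(
    target=4,
    zones=["us-east-1", "us-west-2", "eu-west-1"],
    spot_available=5,
)
# With ample spot capacity, 5 overprovisioned spot replicas are spread
# across the three zones and no on-demand fallback is needed.
```

The key property of such a policy is that total serving capacity never drops below the target: any spot shortfall (including after a preemption, by re-running the planner) is covered by on-demand replicas.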

Advisors: Scott Shenker and Ion Stoica


BibTeX citation:

@mastersthesis{Mao:EECS-2025-227,
    Author= {Mao, Ziming and Xia, Tian and Wu, Zhanghao and Chiang, Wei-Lin and Griggs, Tyler and Bhardwaj, Romil and Yang, Zongheng and Shenker, Scott and Stoica, Ion},
    Title= {SkyServe: Serving AI Models across Regions and Clouds with Spot Instances},
    School= {EECS Department, University of California, Berkeley},
    Year= {2025},
    Month= {Dec},
    Url= {http://www2.eecs.berkeley.edu/Pubs/TechRpts/2025/EECS-2025-227.html},
    Number= {UCB/EECS-2025-227},
    Abstract= {Recent years have witnessed explosive growth in AI models. The high cost of hosting AI services on GPUs and their demanding service requirements make it both timely and challenging to lower serving costs while guaranteeing service quality. Although spot instances have long been offered at a large discount, spot preemptions have discouraged users from hosting model replicas on them when serving AI models.

To address this, we introduce SkyServe, a system that efficiently serves AI models over a mixture of spot and on-demand replicas across regions and clouds. SkyServe intelligently spreads spot replicas across different failure domains (e.g., regions or clouds) to improve availability and reduce correlated preemptions, overprovisions more cheap spot replicas than strictly required as a safeguard against possible preemptions, and dynamically falls back to on-demand replicas when spot replicas become unavailable. We compare SkyServe with both research and production systems on real AI workloads: SkyServe reduces cost by up to 44% while achieving high resource availability compared to using on-demand replicas. Additionally, SkyServe improves P50, P90, and P99 latency by up to 2.6x, 3.1x, and 2.7x, respectively, compared to other research and production systems.},
}

EndNote citation:

%0 Thesis
%A Mao, Ziming 
%A Xia, Tian 
%A Wu, Zhanghao 
%A Chiang, Wei-Lin 
%A Griggs, Tyler 
%A Bhardwaj, Romil 
%A Yang, Zongheng 
%A Shenker, Scott 
%A Stoica, Ion 
%T SkyServe: Serving AI Models across Regions and Clouds with Spot Instances
%I EECS Department, University of California, Berkeley
%D 2025
%8 December 31
%@ UCB/EECS-2025-227
%U http://www2.eecs.berkeley.edu/Pubs/TechRpts/2025/EECS-2025-227.html
%F Mao:EECS-2025-227