Providing Efficient Fault Tolerance in Distributed Systems

Siyuan Zhuang

EECS Department, University of California, Berkeley

Technical Report No. UCB/EECS-2024-86

May 10, 2024

http://www2.eecs.berkeley.edu/Pubs/TechRpts/2024/EECS-2024-86.pdf

The exponential growth in data and computational demands is transforming the approach to system design, particularly in tackling large-scale problems such as training large language models. This shift necessitates the widespread adoption of distributed systems. Simultaneously, systems and applications are becoming increasingly heterogeneous and sophisticated. In this evolving landscape, a critical challenge arises: supporting a wide range of distributed applications while simultaneously achieving computational efficiency and fault tolerance. This thesis explores the development of universal distributed systems that provide efficient fault tolerance for modern applications. The key idea is to exploit the semantics of workloads at all layers of distributed systems. At the communication layer, we introduce Hoplite, a distributed object store that dynamically exploits data transfer patterns and employs fine-grained pipelining to gain efficiency. Hoplite also reschedules tasks to mitigate the effects of failures. At the task execution layer, ExoFlow leverages the semantics of tasks and data passing between tasks to separate execution and recovery units within workflow systems. This approach ensures exactly-once failure recovery semantics while minimizing checkpointing overhead. Together, these contributions demonstrate a full-stack approach to building universal, efficient, and fault-tolerant distributed systems.

Advisors: Ion Stoica and Dawn Song

BibTeX citation:

@phdthesis{Zhuang:EECS-2024-86,
Author= {Zhuang, Siyuan},
Title= {Providing Efficient Fault Tolerance in Distributed Systems},
School= {EECS Department, University of California, Berkeley},
Year= {2024},
Month= {May},
Url= {http://www2.eecs.berkeley.edu/Pubs/TechRpts/2024/EECS-2024-86.html},
Number= {UCB/EECS-2024-86},
Abstract= {The exponential growth in data and computational demands is transforming the approach to system design, particularly in tackling large-scale problems such as training large language models. This shift necessitates the widespread adoption of distributed systems. Simultaneously, systems and applications are becoming increasingly heterogeneous and sophisticated. In this evolving landscape, a critical challenge arises: supporting a wide range of distributed applications while simultaneously achieving computational efficiency and fault tolerance.
This thesis explores the development of universal distributed systems that provide efficient fault tolerance for modern applications. The key idea is to exploit the semantics of workloads at all layers of distributed systems. At the communication layer, we introduce Hoplite, a distributed object store that dynamically exploits data transfer patterns and employs fine-grained pipelining to gain efficiency. Hoplite also reschedules tasks to mitigate the effects of failures. At the task execution layer, ExoFlow leverages the semantics of tasks and data passing between tasks to separate execution and recovery units within workflow systems. This approach ensures exactly-once failure recovery semantics while minimizing checkpointing overhead. Together, these contributions demonstrate a full-stack approach to building universal, efficient, and fault-tolerant distributed systems.},
}

EndNote citation:

%0 Thesis
%A Zhuang, Siyuan 
%T Providing Efficient Fault Tolerance in Distributed Systems
%I EECS Department, University of California, Berkeley
%D 2024
%8 May 10
%@ UCB/EECS-2024-86
%U http://www2.eecs.berkeley.edu/Pubs/TechRpts/2024/EECS-2024-86.html
%F Zhuang:EECS-2024-86