Providing Efficient Fault Tolerance in Distributed Systems
Siyuan Zhuang
EECS Department, University of California, Berkeley
Technical Report No. UCB/EECS-2024-86
May 10, 2024
http://www2.eecs.berkeley.edu/Pubs/TechRpts/2024/EECS-2024-86.pdf
The exponential growth in data and computational demands is transforming the approach to system design, particularly in tackling large-scale problems such as training large language models. This shift necessitates the widespread adoption of distributed systems. Simultaneously, systems and applications are becoming increasingly heterogeneous and sophisticated. In this evolving landscape, a critical challenge arises: supporting a wide range of distributed applications while simultaneously achieving computational efficiency and fault tolerance. This thesis explores the development of universal distributed systems that provide efficient fault tolerance for modern applications. The key idea is to exploit the semantics of workloads at all layers of distributed systems. At the communication layer, we introduce Hoplite, a distributed object store that dynamically exploits data transfer patterns and employs fine-grained pipelining to gain efficiency. Hoplite also reschedules tasks to mitigate the effects of failures. At the task execution layer, ExoFlow leverages the semantics of tasks and data passing between tasks to separate execution and recovery units within workflow systems. This approach ensures exactly-once failure recovery semantics while minimizing checkpointing overhead. Together, these contributions demonstrate a full-stack approach to building universal, efficient, and fault-tolerant distributed systems.
Advisors: Ion Stoica and Dawn Song
BibTeX citation:
@phdthesis{Zhuang:EECS-2024-86,
    Author= {Zhuang, Siyuan},
    Title= {Providing Efficient Fault Tolerance in Distributed Systems},
    School= {EECS Department, University of California, Berkeley},
    Year= {2024},
    Month= {May},
    Url= {http://www2.eecs.berkeley.edu/Pubs/TechRpts/2024/EECS-2024-86.html},
    Number= {UCB/EECS-2024-86},
    Abstract= {The exponential growth in data and computational demands is transforming the approach to system design, particularly in tackling large-scale problems such as training large language models. This shift necessitates the widespread adoption of distributed systems. Simultaneously, systems and applications are becoming increasingly heterogeneous and sophisticated. In this evolving landscape, a critical challenge arises: supporting a wide range of distributed applications while simultaneously achieving computational efficiency and fault tolerance. This thesis explores the development of universal distributed systems that provide efficient fault tolerance for modern applications. The key idea is to exploit the semantics of workloads at all layers of distributed systems. At the communication layer, we introduce Hoplite, a distributed object store that dynamically exploits data transfer patterns and employs fine-grained pipelining to gain efficiency. Hoplite also reschedules tasks to mitigate the effects of failures. At the task execution layer, ExoFlow leverages the semantics of tasks and data passing between tasks to separate execution and recovery units within workflow systems. This approach ensures exactly-once failure recovery semantics while minimizing checkpointing overhead. Together, these contributions demonstrate a full-stack approach to building universal, efficient, and fault-tolerant distributed systems.},
}
EndNote citation:
%0 Thesis
%A Zhuang, Siyuan
%T Providing Efficient Fault Tolerance in Distributed Systems
%I EECS Department, University of California, Berkeley
%D 2024
%8 May 10
%@ UCB/EECS-2024-86
%U http://www2.eecs.berkeley.edu/Pubs/TechRpts/2024/EECS-2024-86.html
%F Zhuang:EECS-2024-86