Stephanie Wang

EECS Department, University of California, Berkeley

Technical Report No. UCB/EECS-2024-3

January 11, 2024

http://www2.eecs.berkeley.edu/Pubs/TechRpts/2024/EECS-2024-3.pdf

Commodity hardware is reaching fundamental limits, while the demands of data-intensive applications continue to grow. Thus, we now rely on horizontal scale-out and hardware accelerators to improve application performance and scale, while developing a myriad of distributed execution frameworks specialized to specific application domains, from data analytics to machine learning. While this reduces the burden for certain applications, it also creates three problems: (1) duplicated system implementation effort, (2) reduced framework evolvability, and (3) difficulty interoperating efficiently between applications, especially when large data is involved.

This thesis describes the first steps towards a distributed "operating system" that can provide essential services to such data-intensive applications. This would allow currently monolithic frameworks to be built as libraries instead, making them easier to build, evolve, and compose. Towards this vision, we propose an intermediate and interoperable execution layer that handles common problems in distributed execution and memory management.

We first propose distributed futures, a general-purpose programming interface that extends the RPC abstraction with pass-by-reference semantics and a shared address space. Distributed futures act as a virtual memory-like abstraction for the distributed setting, enabling distributed memory management to be factored out into a common system. Next, we present a design for this system that provides flexible fault tolerance with low overheads, beginning with a fault-tolerant architecture for distributed futures that provides automatic memory management.
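To make the interface concrete, the following is a minimal, purely illustrative Python sketch of what a distributed futures API could look like. The remote decorator, the get helper, and the thread pool standing in for remote workers are hypothetical stand-ins, not the system described in this thesis; they only illustrate the pass-by-reference idea.

import concurrent.futures

# Hypothetical stand-in for a distributed futures API: a local thread pool
# plays the role of remote workers. In a real distributed futures system,
# remote(f)(x) would schedule f on some worker and immediately return a
# future (an object reference), and passing that reference to another call
# would let the data move worker-to-worker instead of through the caller.
_workers = concurrent.futures.ThreadPoolExecutor()

def remote(f):
    """Make f return a future instead of a value when called."""
    def submit(*args):
        def run():
            # Resolve any argument that is itself a future, mimicking
            # pass-by-reference semantics.
            resolved = [a.result() if isinstance(a, concurrent.futures.Future) else a
                        for a in args]
            return f(*resolved)
        return _workers.submit(run)
    return submit

def get(ref):
    """Block until a future is ready and return its value."""
    return ref.result()

@remote
def load(n):
    return list(range(n))

@remote
def total(xs):
    return sum(xs)

# total receives a reference to load's result; the caller never touches
# the intermediate list.
print(get(total(load(1000))))  # 499500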

We show how this system factors out system complexity from data-intensive applications without sacrificing performance, using MapReduce workloads as an example. Finally, we show how stronger recovery guarantees can be layered on top of this core architecture to provide greater recovery flexibility to end applications. Thus, we show how an end-to-end approach to fault tolerance can expand system generality.
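As a purely illustrative example of this kind of composition (and reusing the hypothetical remote/get sketch above), a toy word-count MapReduce can be written directly against the futures interface, with map outputs shuffled to the reducer by reference rather than through the driver:

@remote
def map_task(lines):
    # Count words in one input shard.
    counts = {}
    for line in lines:
        for word in line.split():
            counts[word] = counts.get(word, 0) + 1
    return counts

@remote
def reduce_task(*partials):
    # Merge the per-shard counts into a single dictionary.
    merged = {}
    for part in partials:
        for word, n in part.items():
            merged[word] = merged.get(word, 0) + n
    return merged

shards = [["the quick brown fox"], ["jumps over the lazy dog"]]
map_refs = [map_task(shard) for shard in shards]   # futures, not values
counts = get(reduce_task(*map_refs))               # intermediates passed by reference
print(counts["the"])  # 2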

Advisor: Ion Stoica


BibTeX citation:

@phdthesis{Wang:EECS-2024-3,
    Author= {Wang, Stephanie},
    Title= {Towards a Distributed OS for Data-Intensive Cloud Applications},
    School= {EECS Department, University of California, Berkeley},
    Year= {2024},
    Month= {Jan},
    Url= {http://www2.eecs.berkeley.edu/Pubs/TechRpts/2024/EECS-2024-3.html},
    Number= {UCB/EECS-2024-3},
    Abstract= {Commodity hardware is reaching fundamental limits, while the demands of data-intensive applications continue to grow. Thus, we now rely on horizontal scale-out and hardware accelerators to improve application performance and scale, while developing a myriad of distributed execution frameworks that are specialized to specific application domains, from data analytics to machine learning. While this reduces burden for certain applications, it also creates three problems: (1) duplicated system implementation effort, (2) reduced framework evolvability, and (3) difficulty interoperating efficiently between applications, especially when large data is involved.

This thesis describes the first steps towards a distributed ``operating system'' that can provide essential services to such data-intensive applications. This would allow currently monolithic frameworks to be built as libraries instead, making them easier to build, evolve, and compose. Towards this vision, we propose an intermediate and interoperable execution layer that handles common problems in distributed execution and memory management.

We first propose distributed futures, a general-purpose programming interface that extends the RPC abstraction with pass-by-reference semantics and a shared address space. Distributed futures act as a virtual memory-like abstraction but for the distributed setting, enabling distributed memory management to be factored out into a common system. Next, we present a design for this system that provides flexible fault tolerance with low overheads. We first present a fault-tolerant architecture for distributed futures that provides automatic memory management.

We show how this system factors out system complexity from data-intensive applications without sacrificing performance, using MapReduce workloads as an example. Finally, we show how stronger recovery guarantees can be layered on top of this core architecture to provide greater recovery flexibility to end applications. Thus, we show how an end-to-end approach to fault tolerance can expand system generality.},
}

EndNote citation:

%0 Thesis
%A Wang, Stephanie 
%T Towards a Distributed OS for Data-Intensive Cloud Applications
%I EECS Department, University of California, Berkeley
%D 2024
%8 January 11
%@ UCB/EECS-2024-3
%U http://www2.eecs.berkeley.edu/Pubs/TechRpts/2024/EECS-2024-3.html
%F Wang:EECS-2024-3