Secure, Expressive, and Debuggable Large-Scale Analytics

Ankur Dave

EECS Department
University of California, Berkeley
Technical Report No. UCB/EECS-2020-143
August 12, 2020

http://www2.eecs.berkeley.edu/Pubs/TechRpts/2020/EECS-2020-143.pdf

Growing volumes of data collection, outsourced computing, and demand for complex analytics have led to the rise of big data analytics frameworks such as MapReduce and Apache Spark. However, these systems fall short in processing sensitive data, graph querying, and debugging. This dissertation addresses these remaining challenges in analytics by introducing three systems built on top of Spark: Oblivious Coopetitive Queries (OCQ), GraphFrames, and Arthur. OCQ focuses on the setting of coopetitive analytics, which refers to cooperation among competing parties to run queries over their joint data. OCQ is an efficient, general framework for oblivious coopetitive analytics using hardware enclaves. GraphFrames is an integrated system that lets users combine graph algorithms, pattern matching, and relational queries, each of which typically requires a specialized engine, and optimizes work across them. Arthur is a debugger for Apache Spark that provides a rich set of analysis tools at close to zero runtime overhead through selective replay of data flow applications. Together, these systems bring Apache Spark closer to the goal of a unified analytics platform that retains the flexibility, extensibility, and performance of relational systems.

Advisors: Ion Stoica and Raluca Ada Popa


BibTeX citation:

@phdthesis{Dave:EECS-2020-143,
    Author = {Dave, Ankur},
    Title = {Secure, Expressive, and Debuggable Large-Scale Analytics},
    School = {EECS Department, University of California, Berkeley},
    Year = {2020},
    Month = {Aug},
    URL = {http://www2.eecs.berkeley.edu/Pubs/TechRpts/2020/EECS-2020-143.html},
    Number = {UCB/EECS-2020-143},
    Abstract = {Growing volumes of data collection, outsourced computing, and demand for complex analytics have led to the rise of big data analytics frameworks such as MapReduce and Apache Spark. However, these systems fall short in processing sensitive data, graph querying, and debugging. This dissertation addresses these remaining challenges in analytics by introducing three systems built on top of Spark: Oblivious Coopetitive Queries (OCQ), GraphFrames, and Arthur. OCQ focuses on the setting of coopetitive analytics, which refers to cooperation among competing parties to run queries over their joint data. OCQ is an efficient, general framework for oblivious coopetitive analytics using hardware enclaves. GraphFrames is an integrated system that lets users combine graph algorithms, pattern matching, and relational queries, each of which typically requires a specialized engine, and optimizes work across them. Arthur is a debugger for Apache Spark that provides a rich set of analysis tools at close to zero runtime overhead through selective replay of data flow applications. Together, these systems bring Apache Spark closer to the goal of a unified analytics platform that retains the flexibility, extensibility, and performance of relational systems.}
}

EndNote citation:

%0 Thesis
%A Dave, Ankur
%T Secure, Expressive, and Debuggable Large-Scale Analytics
%I EECS Department, University of California, Berkeley
%D 2020
%8 August 12
%@ UCB/EECS-2020-143
%U http://www2.eecs.berkeley.edu/Pubs/TechRpts/2020/EECS-2020-143.html
%F Dave:EECS-2020-143