Queries on Compressed Data

Anurag Khandelwal

EECS Department
University of California, Berkeley
Technical Report No. UCB/EECS-2019-141
November 2, 2019

http://www2.eecs.berkeley.edu/Pubs/TechRpts/2019/EECS-2019-141.pdf

Low-latency, high-throughput systems for serving interactive queries are crucial to today's web services. Building such systems for today's web services is challenging due to the massive volumes of data they must cater to, and the requirement for supporting sophisticated queries (e.g., searches, filters, aggregations, regular expression matches, graph queries, etc.). Several recent approaches have highlighted the importance of in-memory storage for meeting the low-latency and high-throughput requirements, but these approaches are unable to sustain this performance when the data grows larger than DRAM capacity. Existing systems thus achieve these goals either by assuming large enough DRAM (too expensive) or by supporting only a limited set of queries (e.g., key-value stores).

In this dissertation, we explore algorithmic and data structure-driven solutions to these system design problems. We present Succinct, a distributed data store that addresses these challenges using a fundamentally new approach --- executing a wide range of queries (e.g., search, random access, range, wildcard) directly on a compressed representation of the input data --- thereby enabling efficient execution of queries on data sizes much larger than DRAM capacity. We then describe BlowFish, a system that builds on Succinct to enable a dynamic storage-performance tradeoff in data stores, providing applications the flexibility to modify the storage and performance fractionally, just enough to meet the desired goals. Finally, we explore approaches that enable even richer query semantics on compressed data, including graph queries using ZipG, a memory efficient graph store, and regular expression queries using Sprint, a query rewriting technique.

Advisor: Ion Stoica


BibTeX citation:

@phdthesis{Khandelwal:EECS-2019-141,
    Author = {Khandelwal, Anurag},
    Title = {Queries on Compressed Data},
    School = {EECS Department, University of California, Berkeley},
    Year = {2019},
    Month = {Nov},
    URL = {http://www2.eecs.berkeley.edu/Pubs/TechRpts/2019/EECS-2019-141.html},
    Number = {UCB/EECS-2019-141},
    Abstract = {Low-latency, high-throughput systems for serving interactive queries are crucial to today's web services. Building such systems for today's web services is challenging due to the massive volumes of data they must cater to, and the requirement for supporting sophisticated queries (e.g., searches, filters, aggregations, regular expression matches, graph queries, etc.). Several recent approaches have highlighted the importance of in-memory storage for meeting the low-latency and high-throughput requirements, but these approaches are unable to sustain this performance when the data grows larger than DRAM capacity. Existing systems thus achieve these goals either by assuming large enough DRAM (too expensive) or by supporting only a limited set of queries (e.g., key-value stores).

In this dissertation, we explore algorithmic and data structure-driven solutions to these system design problems. We present Succinct, a distributed data store that addresses these challenges using a fundamentally new approach --- executing a wide range of queries (e.g., search, random access, range, wildcard) <em>directly</em> on a compressed representation of the input data --- thereby enabling efficient execution of queries on data sizes much larger than DRAM capacity. We then describe BlowFish, a system that builds on Succinct to enable a dynamic storage-performance tradeoff in data stores, providing applications the flexibility to modify the storage and performance fractionally, just enough to meet the desired goals. Finally, we explore approaches that enable even richer query semantics on compressed data, including graph queries using ZipG, a memory efficient graph store, and regular expression queries using Sprint, a query rewriting technique.}
}

EndNote citation:

%0 Thesis
%A Khandelwal, Anurag
%T Queries on Compressed Data
%I EECS Department, University of California, Berkeley
%D 2019
%8 November 2
%@ UCB/EECS-2019-141
%U http://www2.eecs.berkeley.edu/Pubs/TechRpts/2019/EECS-2019-141.html
%F Khandelwal:EECS-2019-141