Ryan Jay Huebsch and Minos Garofalakis and Joseph M. Hellerstein and Ion Stoica

EECS Department, University of California, Berkeley

Technical Report No. UCB/EECS-2006-98

July 12, 2006

http://www2.eecs.berkeley.edu/Pubs/TechRpts/2006/EECS-2006-98.pdf

An important challenge in modern distributed querying is to efficiently process multiple continuous aggregation queries simultaneously. However, processing each of these queries independently may be prohibitive due to network bandwidth constraints. Multiquery optimizations can be used to share computations across queries in order to reduce the overall network communication. In this paper, we consider this problem in the context of distributed aggregation queries that vary in their selection predicates. We identify settings in which a large set of n such queries can be answered by executing k less than n different queries. The k queries are revealed by analyzing a boolean matrix capturing the connection between data and the queries that they satisfy, in a manner akin to familiar techniques from linear algebra. Indeed, we identify a class of linear aggregate functions (including SUM, COUNT and AVERAGE), and show that the sharing potential for such queries can be optimally recovered using standard matrix decompositions from computational linear algebra. Unfortunately, for some other typical aggregation functions (including MIN and MAX) we find that optimal sharing maps to the NP-hard set basis problem. However, for those scenarios, we present a family of heuristic algorithms that perform well for moderately-sized matrices. We also present an overall distributed system architecture to exploit sharing opportunities, and experimentally evaluate the benefits of our techniques via a novel, flexible random workload generator we develop for this setting.


BibTeX citation:

@techreport{Huebsch:EECS-2006-98,
    Author= {Huebsch, Ryan Jay and Garofalakis, Minos and Hellerstein, Joseph M. and Stoica, Ion},
    Title= {Sharing Aggregate Computation for Distributed Queries},
    Year= {2006},
    Month= {Jul},
    Url= {http://www2.eecs.berkeley.edu/Pubs/TechRpts/2006/EECS-2006-98.html},
    Number= {UCB/EECS-2006-98},
    Abstract= {An important challenge in modern distributed querying is to efficiently process multiple continuous aggregation queries simultaneously.  However, processing each of these queries independently may be prohibitive due to network bandwidth constraints. Multiquery optimizations can be used to share computations across queries in order to reduce the overall network communication.  In this paper, we consider this problem in the context of distributed aggregation queries that vary in their selection predicates.  We identify settings in which a large set of n such queries can be answered by executing k less than n different queries.  The k queries are revealed by analyzing a boolean matrix capturing the connection between data and the queries that they satisfy, in a manner akin to familiar techniques from linear algebra. Indeed, we identify a class of linear aggregate functions (including SUM, COUNT and AVERAGE), and show that the sharing potential for such queries can be optimally recovered using standard matrix decompositions from computational linear algebra. Unfortunately, for some other typical aggregation functions (including MIN and MAX) we find that optimal sharing maps to the NP-hard set basis problem.  However, for those scenarios, we present a family of heuristic algorithms that perform well for moderately-sized matrices. We also present an overall distributed system architecture to exploit sharing opportunities, and experimentally evaluate the benefits of our techniques via a novel, flexible random workload generator we develop for this setting.},
}

EndNote citation:

%0 Report
%A Huebsch, Ryan Jay 
%A Garofalakis, Minos 
%A Hellerstein, Joseph M. 
%A Stoica, Ion 
%T Sharing Aggregate Computation for Distributed Queries
%I EECS Department, University of California, Berkeley
%D 2006
%8 July 12
%@ UCB/EECS-2006-98
%U http://www2.eecs.berkeley.edu/Pubs/TechRpts/2006/EECS-2006-98.html
%F Huebsch:EECS-2006-98