The Case for Evaluating MapReduce Performance Using Workload Suites

Yanpei Chen and Archana Ganapathi and Rean Griffith and Randy H. Katz

EECS Department, University of California, Berkeley

Technical Report No. UCB/EECS-2011-21

March 30, 2011

http://www2.eecs.berkeley.edu/Pubs/TechRpts/2011/EECS-2011-21.pdf

MapReduce systems face enormous challenges due to the growth, diversity, and consolidation of the data and computation involved. Provisioning, configuring, and managing large-scale MapReduce clusters require realistic, workload-specific performance insights that existing MapReduce benchmarks are ill-equipped to supply. In this paper, we build the case for going beyond benchmarks for MapReduce performance evaluations. We analyze and compare two production MapReduce traces to develop a vocabulary for describing MapReduce workloads. We show that existing benchmarks fail to capture the rich workload characteristics observed in traces, and we propose a framework to synthesize and execute representative workloads. We demonstrate that performance evaluations using realistic workloads give cluster operators new ways to identify workload-specific resource bottlenecks and to make workload-specific choices of MapReduce task schedulers. We expect that, once available, workload suites would allow cluster operators to accomplish previously challenging tasks beyond what we can now imagine, thus serving as a useful tool to help design and manage MapReduce systems.

BibTeX citation:

@techreport{Chen:EECS-2011-21,
    Author= {Chen, Yanpei and Ganapathi, Archana and Griffith, Rean and Katz, Randy H.},
    Title= {The Case for Evaluating MapReduce Performance Using Workload Suites},
    Year= {2011},
    Month= {Mar},
    Url= {http://www2.eecs.berkeley.edu/Pubs/TechRpts/2011/EECS-2011-21.html},
    Number= {UCB/EECS-2011-21},
    Abstract= {MapReduce systems face enormous challenges due to the growth, diversity, and consolidation of the data and computation involved. Provisioning, configuring, and managing large-scale MapReduce clusters require realistic, workload-specific performance insights that existing MapReduce benchmarks are ill-equipped to supply. In this paper, we build the case for going beyond benchmarks for MapReduce performance evaluations. We analyze and compare two production MapReduce traces to develop a vocabulary for describing MapReduce workloads. We show that existing benchmarks fail to capture the rich workload characteristics observed in traces, and we propose a framework to synthesize and execute representative workloads. We demonstrate that performance evaluations using realistic workloads give cluster operators new ways to identify workload-specific resource bottlenecks and to make workload-specific choices of MapReduce task schedulers. We expect that, once available, workload suites would allow cluster operators to accomplish previously challenging tasks beyond what we can now imagine, thus serving as a useful tool to help design and manage MapReduce systems.},
}

EndNote citation:

%0 Report
%A Chen, Yanpei 
%A Ganapathi, Archana 
%A Griffith, Rean 
%A Katz, Randy H. 
%T The Case for Evaluating MapReduce Performance Using Workload Suites
%I EECS Department, University of California, Berkeley
%D 2011
%8 March 30
%@ UCB/EECS-2011-21
%U http://www2.eecs.berkeley.edu/Pubs/TechRpts/2011/EECS-2011-21.html
%F Chen:EECS-2011-21