A Methodology for Understanding MapReduce Performance Under Diverse Workloads

Yanpei Chen, Archana Sulochana Ganapathi, Rean Griffith and Randy H. Katz

EECS Department
University of California, Berkeley
Technical Report No. UCB/EECS-2010-135
November 9, 2010

http://www2.eecs.berkeley.edu/Pubs/TechRpts/2010/EECS-2010-135.pdf

MapReduce is a popular, but still insufficiently understood paradigm for large-scale, distributed, data-intensive computation. The variety of MapReduce applications and deployment environments makes it difficult to model MapReduce performance and generalize design improvements. In this paper, we present a methodology to understand performance tradeoffs for MapReduce workloads. Using production workload traces from Facebook and Yahoo, we develop an empirical workload model and use it to generate and replay synthetic workloads. We demonstrate how to use this methodology to answer "what-if" questions pertaining to system size, data intensity and hardware/software configuration.


BibTeX citation:

@techreport{Chen:EECS-2010-135,
    Author = {Chen, Yanpei and Ganapathi, Archana Sulochana and Griffith, Rean and Katz, Randy H.},
    Title = {A Methodology for Understanding MapReduce Performance Under Diverse Workloads},
    Institution = {EECS Department, University of California, Berkeley},
    Year = {2010},
    Month = {Nov},
    URL = {http://www2.eecs.berkeley.edu/Pubs/TechRpts/2010/EECS-2010-135.html},
    Number = {UCB/EECS-2010-135},
    Abstract = {MapReduce is a popular, but still insufficiently understood paradigm for large-scale, distributed, data-intensive computation. The variety of MapReduce applications and deployment environments makes it difficult to model MapReduce performance and generalize design improvements. In this paper, we present a methodology to understand performance tradeoffs for MapReduce workloads. Using production workload traces from Facebook and Yahoo, we develop an empirical workload model and use it to generate and replay synthetic workloads. We demonstrate how to use this methodology to answer "what-if" questions pertaining to system size, data intensity and hardware/software configuration.}
}

EndNote citation:

%0 Report
%A Chen, Yanpei
%A Ganapathi, Archana Sulochana
%A Griffith, Rean
%A Katz, Randy H.
%T A Methodology for Understanding MapReduce Performance Under Diverse Workloads
%I EECS Department, University of California, Berkeley
%D 2010
%8 November 9
%@ UCB/EECS-2010-135
%U http://www2.eecs.berkeley.edu/Pubs/TechRpts/2010/EECS-2010-135.html
%F Chen:EECS-2010-135