Yanpei Chen and Archana Sulochana Ganapathi and Rean Griffith and Randy H. Katz

EECS Department, University of California, Berkeley

Technical Report No. UCB/EECS-2010-135

November 9, 2010

http://www2.eecs.berkeley.edu/Pubs/TechRpts/2010/EECS-2010-135.pdf

MapReduce is a popular, but still insufficiently understood paradigm for large-scale, distributed, data-intensive computation. The variety of MapReduce applications and deployment environments makes it difficult to model MapReduce performance and generalize design improvements. In this paper, we present a methodology to understand performance tradeoffs for MapReduce workloads. Using production workload traces from Facebook and Yahoo, we develop an empirical workload model and use it to generate and replay synthetic workloads. We demonstrate how to use this methodology to answer "what-if" questions pertaining to system size, data intensity and hardware/software configuration.


BibTeX citation:

@techreport{Chen:EECS-2010-135,
    Author= {Chen, Yanpei and Ganapathi, Archana Sulochana and Griffith, Rean and Katz, Randy H.},
    Title= {A Methodology for Understanding MapReduce Performance Under Diverse Workloads},
    Year= {2010},
    Month= {Nov},
    Url= {http://www2.eecs.berkeley.edu/Pubs/TechRpts/2010/EECS-2010-135.html},
    Number= {UCB/EECS-2010-135},
    Abstract= {MapReduce is a popular, but still insufficiently understood paradigm for large-scale, distributed, data-intensive computation. The variety of MapReduce applications and deployment environments makes it difficult to model MapReduce performance and generalize design improvements. In this paper, we present a methodology to understand performance tradeoffs for MapReduce workloads. Using production workload traces from Facebook and Yahoo, we develop an empirical workload model and use it to generate and replay synthetic workloads. We demonstrate how to use this methodology to answer "what-if" questions pertaining to system size, data intensity and hardware/software configuration.},
}

EndNote citation:

%0 Report
%A Chen, Yanpei 
%A Ganapathi, Archana Sulochana 
%A Griffith, Rean 
%A Katz, Randy H. 
%T A Methodology for Understanding MapReduce Performance Under Diverse Workloads
%I EECS Department, University of California, Berkeley
%D 2010
%8 November 9
%@ UCB/EECS-2010-135
%U http://www2.eecs.berkeley.edu/Pubs/TechRpts/2010/EECS-2010-135.html
%F Chen:EECS-2010-135