Chukwa: A system for reliable large-scale log collection

Ariel Rabkin and Randy H. Katz

EECS Department, University of California, Berkeley

Technical Report No. UCB/EECS-2010-25

March 5, 2010

http://www2.eecs.berkeley.edu/Pubs/TechRpts/2010/EECS-2010-25.pdf

Large Internet services companies like Google, Yahoo, and Facebook use the MapReduce programming model to process log data. MapReduce is designed to work on data stored in a distributed filesystem like Hadoop's HDFS. As a result, a number of companies have developed log collection systems that write to HDFS. These systems have a number of common weaknesses, induced by the semantics of the filesystem. They impose a delay, often several minutes, before data is available for processing. They are difficult to integrate with existing applications. They cannot reliably handle concurrent failures. We present a system, called Chukwa, that adds the needed semantics for log collection and analysis. Chukwa uses an end-to-end delivery model that leverages local on-disk log files when possible, easing integration with legacy systems. Chukwa offers a choice of delivery models, making subsets of the collected data available promptly for clients that require it, while reliably storing a copy in HDFS. We demonstrate that our system works correctly on a 200-node testbed and can collect in excess of 200 MB/sec of log data. We supplement these measurements with a set of case studies.

Advisor: Randy H. Katz

BibTeX citation:

@mastersthesis{Rabkin:EECS-2010-25,
    Author= {Rabkin, Ariel and Katz, Randy H.},
    Title= {Chukwa: A system for reliable large-scale log collection},
    School= {EECS Department, University of California, Berkeley},
    Year= {2010},
    Month= {Mar},
    Url= {http://www2.eecs.berkeley.edu/Pubs/TechRpts/2010/EECS-2010-25.html},
    Number= {UCB/EECS-2010-25},
    Abstract= {Large Internet services companies like Google, Yahoo, and Facebook use the MapReduce programming model to process log data.  MapReduce is designed to work on data stored in a distributed filesystem like Hadoop's HDFS. As a result, a number of companies have developed log collection systems that write to HDFS.  These systems have a number of common weaknesses, induced by the semantics of the filesystem. They impose a delay, often several minutes, before data is available for processing. They are difficult to integrate with existing applications.  They cannot reliably handle concurrent failures. We present a system, called Chukwa, that adds the needed semantics for log collection and analysis. Chukwa uses an end-to-end delivery model that leverages local on-disk log files when possible, easing integration with legacy systems. Chukwa offers a choice of delivery models, making subsets of the collected data available promptly for clients that require it, while reliably storing a copy in HDFS. We demonstrate that our system works correctly on a 200-node testbed and can collect in excess of 200 MB/sec of log data. We supplement these measurements with a set of case studies.},
}

EndNote citation:

%0 Thesis
%A Rabkin, Ariel
%A Katz, Randy H.
%T Chukwa: A system for reliable large-scale log collection
%I EECS Department, University of California, Berkeley
%D 2010
%8 March 5
%@ UCB/EECS-2010-25
%U http://www2.eecs.berkeley.edu/Pubs/TechRpts/2010/EECS-2010-25.html
%F Rabkin:EECS-2010-25