Large-Scale System Problems Detection by Mining Console Logs

Wei Xu and Ling Huang and Armando Fox and David A. Patterson and Michael Jordan

EECS Department, University of California, Berkeley

Technical Report No. UCB/EECS-2009-103

July 21, 2009

http://www2.eecs.berkeley.edu/Pubs/TechRpts/2009/EECS-2009-103.pdf

Surprisingly, console logs rarely help operators detect problems in large-scale datacenter services, for they often consist of the voluminous intermixing of messages from many software components written by independent developers. We propose a general methodology to mine this rich source of information to automatically detect system runtime problems. We first parse console logs by combining source code analysis with information retrieval to create composite features. We then analyze these features using machine learning to detect operational problems. We show that our method enables analyses that are impossible with previous methods because of its superior ability to create sophisticated features. We also show how to distill the results of our analysis to an operator-friendly one-page decision tree showing the critical messages associated with the detected problems. We validate our approach using the Darkstar online game server and the Hadoop File System, where we detect numerous real problems with high accuracy and few false positives. In the Hadoop case, we are able to analyze 24 million lines of console logs in 3 minutes. Our methodology works on textual console logs of any size and requires no changes to the service software, no human input, and no knowledge of the software’s internals.
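As a rough illustration of the two stages the abstract describes (parsing log lines into features, then analyzing those features), the following is a minimal sketch and not the authors' implementation. The message templates, the sample log lines, and the names `TEMPLATES`, `SAMPLE_LINES`, `count_vectors`, and `flag_unusual` are all made up for illustration; in the report the templates come from source code analysis rather than being written by hand. The abstract does not name the learning algorithm, so a simple comparison against the most common count vector stands in for it here.

```python
import re
from collections import Counter, defaultdict

# Hypothetical message templates (assumption: hand-written here, whereas the
# report derives message structure from the service's source code).
TEMPLATES = {
    "receive": re.compile(r"Receiving block (\S+)"),
    "stored": re.compile(r"Received block (\S+)"),
    "delete": re.compile(r"Deleting block (\S+)"),
}

# Made-up sample log lines, not taken from any real system.
SAMPLE_LINES = [
    "INFO Receiving block blk_1 src: /10.0.0.1",
    "INFO Received block blk_1 of size 67108864",
    "INFO Receiving block blk_2 src: /10.0.0.2",
    "INFO Received block blk_2 of size 67108864",
    "INFO Receiving block blk_3 src: /10.0.0.3",
    "WARN Deleting block blk_3 file /data/blk_3",
]


def count_vectors(lines):
    """Group messages by the identifier they mention and count each type."""
    counts = defaultdict(Counter)
    for line in lines:
        for name, pattern in TEMPLATES.items():
            match = pattern.search(line)
            if match:
                counts[match.group(1)][name] += 1
                break  # each line matches at most one template
    return counts


def flag_unusual(counts):
    """Return identifiers whose count vector differs from the most common one.

    This is a stand-in for the machine learning step, chosen only because it
    is easy to follow; it is not the method used in the report.
    """
    keys = sorted(TEMPLATES)
    vectors = {ident: tuple(c[k] for k in keys) for ident, c in counts.items()}
    typical, _ = Counter(vectors.values()).most_common(1)[0]
    return sorted(ident for ident, vec in vectors.items() if vec != typical)


if __name__ == "__main__":
    print(flag_unusual(count_vectors(SAMPLE_LINES)))
```

In this toy input, two identifiers share the same sequence of message types and one does not, so only the odd one is reported. Grouping counts by identifier is one plausible reading of "composite features"; the report itself should be consulted for the actual feature construction.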


BibTeX citation:

@techreport{Xu:EECS-2009-103,
    Author= {Xu, Wei and Huang, Ling and Fox, Armando and Patterson, David A. and Jordan, Michael},
    Title= {Large-Scale System Problems Detection by Mining Console Logs},
    Year= {2009},
    Month= {Jul},
    Url= {http://www2.eecs.berkeley.edu/Pubs/TechRpts/2009/EECS-2009-103.html},
    Number= {UCB/EECS-2009-103},
    Abstract= {Surprisingly, console logs rarely help operators detect
problems in large-scale datacenter services, for they often
consist of the voluminous intermixing of messages
from many software components written by independent
developers. We propose a general methodology to mine
this rich source of information to automatically detect
system runtime problems. We first parse console logs
by combining source code analysis with information retrieval
to create composite features. We then analyze
these features using machine learning to detect operational
problems. We show that our method enables analyses
that are impossible with previous methods because of
its superior ability to create sophisticated features. We
also show how to distill the results of our analysis to
an operator-friendly one-page decision tree showing the
critical messages associated with the detected problems.
We validate our approach using the Darkstar online game
server and the Hadoop File System, where we detect numerous
real problems with high accuracy and few false
positives. In the Hadoop case, we are able to analyze 24
million lines of console logs in 3 minutes. Our methodology
works on textual console logs of any size and requires
no changes to the service software, no human input,
and no knowledge of the software’s internals.},
}

EndNote citation:

%0 Report
%A Xu, Wei 
%A Huang, Ling 
%A Fox, Armando 
%A Patterson, David A. 
%A Jordan, Michael 
%T Large-Scale System Problems Detection by Mining Console Logs
%I EECS Department, University of California, Berkeley
%D 2009
%8 July 21
%@ UCB/EECS-2009-103
%U http://www2.eecs.berkeley.edu/Pubs/TechRpts/2009/EECS-2009-103.html
%F Xu:EECS-2009-103