Authors
Mike Chen, Alice X Zheng, Jim Lloyd, Michael I Jordan, Eric Brewer
Publication date
2004/5/17
Conference
International Conference on Autonomic Computing, 2004. Proceedings.
Pages
36-43
Publisher
IEEE
Description
We present a decision tree learning approach to diagnosing failures in large Internet sites. We record runtime properties of each request and apply automated machine learning and data mining techniques to identify the causes of failures. We train decision trees on the request traces from time periods in which user-visible failures are present. Paths through the tree are ranked according to their degree of correlation with failure, and nodes are merged according to the observed partial order of system components. We evaluate this approach using actual failures from eBay, and find that, among hundreds of potential causes, the algorithm successfully identifies 13 out of 14 true causes of failure, along with 2 false positives. We discuss some results in applying simplified decision trees on eBay's production site for several months. In addition, we give a cost-benefit analysis of manual vs. automated diagnosis systems. Our …
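The pipeline outlined in the description (learn a decision tree over per-request properties, then rank root-to-leaf paths by how strongly they associate with failure) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the ID3-style information-gain splitter, the `host` and `version` feature names, the toy request trace, and ranking by raw failed-request count. It is not the authors' implementation, and it omits the node-merging step entirely.

```python
import math
from collections import Counter, defaultdict


def entropy(labels):
    """Shannon entropy (in bits) of a list of boolean failure labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())


def split(rows, feature):
    """Group (features, failed) rows by their value for one feature."""
    groups = defaultdict(list)
    for feats, failed in rows:
        groups[feats[feature]].append((feats, failed))
    return groups


def best_feature(rows, features):
    """Return the feature with the highest positive information gain, or None."""
    base = entropy([failed for _, failed in rows])
    best, best_gain = None, 0.0
    for f in features:
        remainder = sum(
            len(g) / len(rows) * entropy([failed for _, failed in g])
            for g in split(rows, f).values()
        )
        gain = base - remainder
        if gain > best_gain + 1e-12:
            best, best_gain = f, gain
    return best


def leaf_paths(rows, features, path=()):
    """Grow the tree recursively; yield (path, failures, total) for each leaf."""
    f = best_feature(rows, features)
    if f is None:  # no split reduces entropy, so this node is a leaf
        yield path, sum(1 for _, failed in rows if failed), len(rows)
        return
    remaining = [x for x in features if x != f]
    for value, group in sorted(split(rows, f).items()):
        yield from leaf_paths(group, remaining, path + ((f, value),))


def rank_paths(rows, features):
    """Keep leaves where most requests failed; rank by failed-request count.

    Both the majority filter and the ranking key are assumptions standing in
    for the paper's "degree of correlation with failure".
    """
    failing = [leaf for leaf in leaf_paths(rows, features) if 2 * leaf[1] > leaf[2]]
    return sorted(failing, key=lambda leaf: leaf[1], reverse=True)


# Hypothetical request trace: (runtime properties, request failed?).
rows = (
    [({"host": "h1", "version": "v1"}, False)] * 10
    + [({"host": "h2", "version": "v1"}, False)] * 10
    + [({"host": "h1", "version": "v2"}, True)] * 6
    + [({"host": "h2", "version": "v2"}, True)] * 6
    + [({"host": "h3", "version": "v1"}, True)] * 4
)

ranked = rank_paths(rows, ["host", "version"])
for path, failures, total in ranked:
    print(path, f"{failures}/{total} requests failed")
```

In this toy trace the splitter picks `version` first because it separates failures more cleanly than `host`, so the `version = v2` path ranks above the narrower `version = v1, host = h3` path.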
Total citations
[Chart: citations per year, 2004 to 2024]
Scholar articles
M Chen, AX Zheng, J Lloyd, MI Jordan, E Brewer - International Conference on Autonomic Computing …, 2004