Thesis: Automated diagnosis of chronic problems in production systems
[Thesis document] [Slides]
Large distributed systems are susceptible to chronic performance
problems where the system still works, but with degraded
performance. Chronic performance problems occur intermittently or
affect a subset of end-users.
This dissertation presents a top-down diagnostic framework for
diagnosing chronic performance problems. The
framework comprises of four components. First, an extensible
log-analysis framework that extracts end-to-end causal flows using
common white-box (application) logs in the production system;
these end-to-end flows capture the user's experience with the
system. Second, anomaly-detection tools exploit heuristics and a
peer-comparison approach to label each end-to-end flow as successful
or failed. Third, statistical diagnostic tool
combines white-box metrics with black-box metrics (e.g., CPU usage)
to localize the source of the problem by identifying
attributes that are more correlated with failed flows than successful
ones. Fourth, a visualization tool that uses peer-comparison to
highlight anomalous nodes in a parallel-computing cluster.
The diagnostic framework has been used to localize real incidents at
an academic cloud-computing cluster that runs the Hadoop
parallel-processing framework, and a production Voice-over-IP system
at a major Internet Services Provider.