Major outages, like the April 2011 Amazon EC2 incident that brought down its cloud-storage service,
get a lot of press because of their high visibility. However, large distributed systems are more
likely to experience chronic problems: performance degradations or request failures that occur
intermittently or affect only a small subset of end users. Chronics are elusive to diagnose because
they fly under the radar of operations teams, often remaining below alarm thresholds. Operators
might dismiss these problems if they were one-off incidents, but their recurrent nature erodes
customer satisfaction over time.
My research focuses on a "top-down" approach to diagnosing chronics in distributed systems that
starts by identifying user-visible symptoms of a problem (e.g., performance degradation, exceptions),
and drills down to identify the components and associated resource-usage metrics (e.g., CPU, memory)
that are most indicative of the symptoms. Diagnosis proceeds in two phases: (i)
an anomaly-detection phase that uses peer-comparison to identify "odd-man-out" behavior;
and (ii) a problem-localization phase that pinpoints the root cause of these problems by identifying
request features that best distinguish successful requests from anomalous ones.
Peer-comparison is an attractive option for anomaly detection because it is relatively robust
to workload changes: peers execute similar workloads in any given period of time.
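To make the anomaly-detection phase concrete, the sketch below shows one way peer-comparison could flag
"odd-man-out" peers from resource-usage metrics; the median-profile distance and MAD-based threshold are
simplifying assumptions for illustration, not necessarily the statistics used in my diagnosis tools.
\begin{verbatim}
# Illustrative peer-comparison anomaly detection (phase i).
# Assumes each peer reports a vector of resource-usage metrics
# (e.g., mean CPU and memory utilization) over the same time window.
from statistics import median

def detect_odd_men_out(peer_metrics, threshold=3.0):
    """peer_metrics: dict mapping peer name -> list of metric values.
    Returns peers whose metrics deviate most from their peers."""
    names = list(peer_metrics)
    n_metrics = len(next(iter(peer_metrics.values())))

    # Median profile across peers, one value per metric.
    profile = [median(peer_metrics[p][i] for p in names)
               for i in range(n_metrics)]

    # Distance of each peer from the median profile.
    dists = {p: sum(abs(v - m) for v, m in zip(peer_metrics[p], profile))
             for p in names}

    # Robust threshold: flag peers whose distance is far from the
    # median distance, scaled by the median absolute deviation (MAD).
    med = median(dists.values())
    mad = median(abs(d - med) for d in dists.values()) or 1e-9
    return [p for p, d in dists.items() if (d - med) / mad > threshold]

# Example: node3's CPU and memory usage stand out from its peers.
usage = {"node1": [0.42, 0.55], "node2": [0.40, 0.57],
         "node3": [0.95, 0.98], "node4": [0.41, 0.54]}
print(detect_odd_men_out(usage))   # -> ["node3"]
\end{verbatim}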
My approach uses unmodified application-level and system-level logs for diagnosis, making it amenable
to use in production systems. I validate the approach using fault-injection experiments and analysis
of real incidents in two production systems: the Hadoop parallel-processing framework and
a Voice-over-IP (VoIP) system at a large telecommunications provider.
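To complement the sketch above, the following sketch illustrates the problem-localization phase on
labeled request logs; the request attributes and the smoothed failure-rate contrast used for scoring
are assumptions made for illustration, not the exact features or statistic from the production studies.
\begin{verbatim}
# Illustrative problem localization (phase ii): rank request features
# (attribute=value pairs drawn from logs) by how well they distinguish
# anomalous requests from successful ones.
from collections import Counter

def rank_suspect_features(requests, top_k=3):
    """requests: list of (attributes, ok) pairs, where attributes is a
    dict such as {"server": "s3", "client": "c7"} and ok is True for a
    successful request. Returns the top_k most suspicious features."""
    fail_counts, ok_counts = Counter(), Counter()
    total_fail = total_ok = 0
    for attrs, ok in requests:
        total_ok += ok
        total_fail += not ok
        for feature in attrs.items():          # e.g., ("server", "s3")
            (ok_counts if ok else fail_counts)[feature] += 1

    def score(feature):
        # Contrast of smoothed failure and success rates for the feature;
        # features seen mostly in anomalous requests score highest.
        p_fail = (fail_counts[feature] + 1) / (total_fail + 2)
        p_ok = (ok_counts[feature] + 1) / (total_ok + 2)
        return p_fail - p_ok

    features = set(fail_counts) | set(ok_counts)
    return sorted(features, key=score, reverse=True)[:top_k]
\end{verbatim}
In this form, a feature such as ("server", "s3") that appears in most anomalous requests but few
successful ones surfaces at the top of the ranking, pointing operators toward the suspect component.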