| THESIS RESEARCH 
 Thesis: Automated diagnosis of chronic problems in production systems [Thesis document] [Slides] Large distributed systems are susceptible to chronic performance
				problems where the system still works, but with degraded
				performance. Chronic performance problems occur intermittently or
				affect a subset of end-users. This dissertation presents a top-down diagnostic framework for
				diagnosing chronic performance problems.  The
				framework comprises of four components. First, an extensible
				log-analysis framework that extracts end-to-end causal flows using
				common white-box (application) logs in the production system;
				these end-to-end flows capture the user's experience with the
				system. Second, anomaly-detection tools exploit heuristics and a
				peer-comparison approach to label each end-to-end flow as successful
				or failed.  Third, statistical diagnostic tool
				combines white-box metrics with black-box metrics (e.g., CPU usage) 
				to localize the source of the problem by identifying
				attributes that are more correlated with failed flows than successful
				ones. Fourth, a visualization tool that uses peer-comparison to
				highlight anomalous nodes in a parallel-computing cluster. The diagnostic framework has been used to localize real incidents at
				an academic cloud-computing cluster that runs the Hadoop
				parallel-processing framework, and a production Voice-over-IP system
				at a major Internet Services Provider.  |