Major outages, like the April 2011 Amazon EC2 incident that brought down its cloud-storage service,
get a lot of press because of their high visibility. However, large distributed systems are more
likely to experience chronic problems: performance degradations or request failures that occur
intermittently or affect only a small subset of end users. Chronics are elusive to diagnose because
they fly under the radar of operations teams, often remaining below alarm thresholds. Operators
might dismiss these problems if they were one-off incidents, but their recurrent nature erodes
customer satisfaction over time.
My research focuses on a "top-down" approach to diagnosing chronics in distributed systems that
starts by identifying user-visible symptoms of a problem (e.g., performance degradation, exceptions),
and drills down to identify the components and associated resource-usage metrics (e.g., CPU, memory)
that are most indicative of the symptoms. Diagnosis proceeds in two phases: (i)
an anomaly-detection phase that uses peer-comparison to identify "odd-man-out" behavior;
and (ii) a problem-localization phase that pinpoints the root cause of these problems by identifying
request features that best distinguish successful requests from anomalous ones.
Peer-comparison is an attractive option for anomaly detection because it is relatively robust
to workload changes: peers execute similar workloads in any given period of time.
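To make the anomaly-detection phase concrete, the sketch below shows one way peer-comparison could flag
"odd-man-out" peers from resource-usage metrics; the median-profile distance and MAD-based threshold are
simplifying assumptions for illustration, not necessarily the statistics used in my diagnosis tools.
\begin{verbatim}
# Illustrative peer-comparison anomaly detection (phase i).
# Assumes each peer reports a vector of resource-usage metrics
# (e.g., mean CPU and memory utilization) over the same time window.
from statistics import median

def detect_odd_men_out(peer_metrics, threshold=3.0):
    """peer_metrics: dict mapping peer name -> list of metric values.
    Returns peers whose metrics deviate most from their peers."""
    names = list(peer_metrics)
    n_metrics = len(next(iter(peer_metrics.values())))

    # Median profile across peers, one value per metric.
    profile = [median(peer_metrics[p][i] for p in names)
               for i in range(n_metrics)]

    # Distance of each peer from the median profile.
    dists = {p: sum(abs(v - m) for v, m in zip(peer_metrics[p], profile))
             for p in names}

    # Robust threshold: flag peers whose distance is far from the
    # median distance, scaled by the median absolute deviation (MAD).
    med = median(dists.values())
    mad = median(abs(d - med) for d in dists.values()) or 1e-9
    return [p for p, d in dists.items() if (d - med) / mad > threshold]

# Example: node3's CPU and memory usage stand out from its peers.
usage = {"node1": [0.42, 0.55], "node2": [0.40, 0.57],
         "node3": [0.95, 0.98], "node4": [0.41, 0.54]}
print(detect_odd_men_out(usage))   # -> ["node3"]
\end{verbatim}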
My approach uses unmodified application-level and system-level logs for diagnosis, making it amenable
to use in production systems. I validate the approach using fault-injection experiments and analysis
of real incidents in two production systems: the Hadoop parallel-processing framework and
a Voice-over-IP (VoIP) system at a large telecommunications provider.
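To complement the sketch above, the following sketch illustrates the problem-localization phase on
labeled request logs; the request attributes and the smoothed failure-rate contrast used for scoring
are assumptions made for illustration, not the exact features or statistic from the production studies.
\begin{verbatim}
# Illustrative problem localization (phase ii): rank request features
# (attribute=value pairs drawn from logs) by how well they distinguish
# anomalous requests from successful ones.
from collections import Counter

def rank_suspect_features(requests, top_k=3):
    """requests: list of (attributes, ok) pairs, where attributes is a
    dict such as {"server": "s3", "client": "c7"} and ok is True for a
    successful request. Returns the top_k most suspicious features."""
    fail_counts, ok_counts = Counter(), Counter()
    total_fail = total_ok = 0
    for attrs, ok in requests:
        total_ok += ok
        total_fail += not ok
        for feature in attrs.items():          # e.g., ("server", "s3")
            (ok_counts if ok else fail_counts)[feature] += 1

    def score(feature):
        # Contrast of smoothed failure and success rates for the feature;
        # features seen mostly in anomalous requests score highest.
        p_fail = (fail_counts[feature] + 1) / (total_fail + 2)
        p_ok = (ok_counts[feature] + 1) / (total_ok + 2)
        return p_fail - p_ok

    features = set(fail_counts) | set(ok_counts)
    return sorted(features, key=score, reverse=True)[:top_k]
\end{verbatim}
In this form, a feature such as ("server", "s3") that appears in most anomalous requests but few
successful ones surfaces at the top of the ranking, pointing operators toward the suspect component.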