
about me

My name is Soila Pertet Kavulya. I am a PhD student in the Department of Electrical and Computer Engineering at Carnegie Mellon University, where I am advised by Professor Priya Narasimhan.

My research focuses on diagnosing problems in distributed systems. I have applied my diagnosis algorithms to MapReduce systems, Internet services, automotive systems, and group communication systems.

 

CONTACT

Soila Pertet Kavulya
Department of Electrical and Computer Engineering
Carnegie Mellon University
Pittsburgh, PA 15213
Email: spertet@ece.cmu.edu


THESIS RESEARCH

Major outages, like the Amazon EC2 incident in April 2011 that brought down their cloud storage, get a lot of press due to their high visibility. However, large distributed systems are more likely to experience chronic problems: performance degradations or request failures that occur intermittently or affect a small subset of end users. Chronics are hard to diagnose because they fly under the radar of operations teams; they may not be severe enough to set off alarm thresholds. Operators could ignore these problems if they were one-off incidents, but their recurrent nature erodes customer satisfaction over time.

My research focuses on a "top-down" approach to diagnosing chronics in distributed systems. The approach starts by identifying user-visible symptoms of a problem (e.g., performance degradation, exceptions), and drills down to identify the components and associated resource-usage metrics (e.g., CPU, memory) that are most strongly indicative of the symptoms. Diagnosis proceeds in two phases: (i) an anomaly-detection phase that uses peer comparison to identify "odd-man-out" behavior, and (ii) a problem-localization phase that localizes the root cause of these problems by identifying the request features that best distinguish successful requests from anomalous requests.
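As a rough illustration of the problem-localization idea (not the algorithm used in the thesis), the sketch below ranks feature values by how much more often requests carrying them fail than requests overall. The function name `rank_features`, the scoring rule (failure rate minus overall failure rate), and the feature names `server` and `customer` are all assumptions made for this example.

```python
from collections import Counter

def rank_features(requests):
    """Rank (feature, value) pairs by how strongly they are tied to failures.

    `requests` is a list of (features, failed) tuples, where `features` is a
    dict of request attributes and `failed` is a bool. The score is a simple
    stand-in chosen for illustration: the failure rate among requests with
    that feature value minus the overall failure rate.
    """
    if not requests:
        raise ValueError("need at least one request")
    total = Counter()
    failed = Counter()
    for features, is_failed in requests:
        for item in features.items():
            total[item] += 1
            if is_failed:
                failed[item] += 1
    overall = sum(1 for _, is_failed in requests if is_failed) / len(requests)
    scores = {item: failed[item] / total[item] - overall for item in total}
    # Highest score first: the feature values most indicative of failure.
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Hypothetical request log: every failure involves server "s2".
example_requests = [
    ({"server": "s1", "customer": "a"}, False),
    ({"server": "s1", "customer": "b"}, False),
    ({"server": "s2", "customer": "a"}, True),
    ({"server": "s2", "customer": "b"}, True),
    ({"server": "s3", "customer": "a"}, False),
    ({"server": "s3", "customer": "b"}, False),
]
top_feature, top_score = rank_features(example_requests)[0]
```

In this toy log the top-ranked pair is the server that handled every failed request, while the customer values score near zero because failures are spread evenly across them.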

Peer comparison is an attractive option for anomaly detection because it is relatively robust to workload changes: peers execute similar workloads in a given period of time, so a workload shift affects all of them alike. My approach uses unmodified application-level and system-level logs for diagnosis, making it amenable to use in production systems. I validate my approach using fault-injection experiments and analysis of real incidents in two production systems: the Hadoop parallel-processing framework and a production Voice-over-IP system at a large telecommunications provider.
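A minimal sketch of peer comparison, assuming each peer reports one numeric metric for the same time window: a peer is flagged when it deviates from the median of all peers by more than a chosen multiple of the median absolute deviation (MAD). The median/MAD rule, the function name `odd_ones_out`, the threshold of 3.0, and the node names are illustrative assumptions, not the statistics used in the thesis.

```python
from statistics import median

def odd_ones_out(metric_by_peer, threshold=3.0):
    """Return the peers whose metric deviates strongly from the peer median.

    `metric_by_peer` maps a peer name to one metric value (for example, CPU
    utilization) observed in the same time window. Because every peer is
    compared against the others, a workload change that shifts all peers
    together does not trigger an alarm.
    """
    values = list(metric_by_peer.values())
    center = median(values)
    mad = median([abs(v - center) for v in values])
    if mad == 0:
        # More than half the peers agree exactly; flag any peer that differs.
        return [p for p, v in metric_by_peer.items() if v != center]
    return [p for p, v in metric_by_peer.items()
            if abs(v - center) / mad > threshold]

# Hypothetical CPU utilization per node during one window.
cpu_by_node = {
    "node1": 0.41,
    "node2": 0.39,
    "node3": 0.40,
    "node4": 0.95,
    "node5": 0.42,
}
suspects = odd_ones_out(cpu_by_node)
```

Here only the node whose utilization is far from its peers is flagged. Scaling every value by the same factor leaves the result unchanged, which is the robustness to workload changes described above.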

Selected Publications