Dependable, Online Upgrades in Distributed Systems

I take a holistic approach and focus on upgrading distributed systems end-to-end. I identify the leading causes of both unplanned and planned downtime due to upgrades in large-scale distributed systems, I address them through a new upgrading approach, and I propose a new methodology for evaluating the dependability of online upgrades.

Traditional fault-tolerance approaches concentrate almost entirely on responding to, avoiding, or tolerating unexpected faults or security violations. However, scheduled events, such as software upgrades, account for most system unavailability. Historically employed in the telecommunications industry, online upgrades are now required in large-scale systems such as electrical utilities, assembly-line manufacturing, customer support, e-commerce, and banking.

I establish an upgrade-centric fault model by analyzing independent sources of fault data through statistical clustering techniques (widely used in the natural sciences for creating taxonomies of living organisms). My model focuses on human errors in the upgrade procedure, which break hidden dependencies (e.g., specifying wrong service locations, creating database-schema mismatches, introducing shared-library conflicts) in the system under upgrade. There are four common types of upgrade faults:

  1. Simple configuration or procedural errors (e.g., typos)
  2. Semantic configuration errors, which indicate a misunderstanding of the configuration directives used
  3. Broken environmental dependencies (e.g., library or port conflicts)
  4. Data-access errors, which prevent access to persistent data
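The four fault types above can be emulated as fault-injection operators. The following is an illustrative sketch (not code from my fault-injection harness), using a hypothetical service configuration whose keys and values are invented for the example:

```python
# Sketch: one injector per upgrade-fault type, applied to a hypothetical
# service configuration. All names and values here are illustrative.
import copy

def inject_typo(config):
    # 1. Simple configuration/procedural error: a typo in a directive value.
    faulty = copy.deepcopy(config)
    faulty["db_host"] = faulty["db_host"].replace("db", "bd", 1)
    return faulty

def inject_semantic_error(config):
    # 2. Semantic configuration error: the value is well-formed, but the
    # operator misunderstands the directive (milliseconds vs. seconds).
    faulty = copy.deepcopy(config)
    faulty["timeout_s"] = faulty["timeout_s"] * 1000
    return faulty

def inject_env_conflict(config):
    # 3. Broken environmental dependency: bind to a port that another
    # service is assumed to hold already.
    faulty = copy.deepcopy(config)
    faulty["port"] = 80
    return faulty

def inject_data_access_error(config):
    # 4. Data-access error: credentials no longer match the data store,
    # so the application cannot reach its persistent data.
    faulty = copy.deepcopy(config)
    faulty["db_password"] = ""
    return faulty

base = {"db_host": "db1.example.com", "timeout_s": 30,
        "port": 8080, "db_password": "secret"}
for injector in (inject_typo, inject_semantic_error,
                 inject_env_conflict, inject_data_access_error):
    print(injector.__name__, injector(base))
```

Each injector leaves the original configuration untouched and returns a faulty copy, so a single baseline can drive many independent fault-injection experiments.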

These faults represent the leading causes of upgrade failure in distributed systems. I also identify incompatible schema changes and computationally-intensive data conversions as the leading causes of planned downtime in a popular Internet system (Wikipedia). Previous approaches, which focus on upgrading individual components of distributed systems rather than performing end-to-end upgrades, often induce unplanned downtime by breaking hidden dependencies in the system under upgrade, and they cannot always prevent planned downtime in the presence of complex schema changes.

I address these causes of downtime through a system called Imago [Middleware 2009]. Imago provides the AIR properties:

By installing the new version in a parallel universe (a distinct collection of resources), Imago isolates the production system from the upgrade operations and avoids breaking hidden dependencies. Imago performs the end-to-end upgrade atomically, while enabling the complex data and schema conversions that commonly impose planned downtime. I evaluate the dependability of online-upgrade approaches using my upgrade-centric fault model to drive fault-injection experiments. These experiments suggest that Imago is more resilient than previous approaches to the common upgrade faults. Moreover, Imago can prevent unplanned downtime during end-to-end upgrades of distributed systems.
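The parallel-universe approach and the atomic switchover can be sketched in a few lines. This is a simplified illustration of the idea, not Imago's implementation; it assumes clients reach the production system through a single indirection point, and all class and variable names are hypothetical:

```python
# Sketch of a parallel-universe upgrade with an atomic switchover.
# Assumption: clients resolve the production system through one
# indirection point (here, a lock-guarded reference).
import threading

class Switchover:
    def __init__(self, production):
        self._lock = threading.Lock()
        self._active = production          # the system clients currently see

    def resolve(self):
        with self._lock:
            return self._active

    def upgrade(self, new_universe, convert):
        # The new version runs in a parallel universe: the long-running
        # data conversion happens here, without touching production.
        convert(self._active, new_universe)
        # The end-to-end switchover itself is a single atomic pointer flip.
        with self._lock:
            old, self._active = self._active, new_universe
        return old                         # retained, e.g., for rollback

old_universe = {"version": 1, "data": [1, 2, 3]}
new_universe = {"version": 2, "data": []}
sw = Switchover(old_universe)
sw.upgrade(new_universe,
           lambda src, dst: dst["data"].extend(x * 10 for x in src["data"]))
print(sw.resolve()["version"])  # 2
```

Because the conversion runs entirely against the parallel universe, its cost does not translate into planned downtime; only the final pointer flip is visible to clients.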

Imago harnesses the opportunities provided by new technologies, such as cloud computing, to reduce resource overhead by temporarily leasing hardware and storage resources during the upgrade. Through the separation of concerns between the functional aspects of the upgrade (e.g., data conversion) and the online-upgrade mechanisms (e.g., atomic switchover), Imago enables an upgrades-as-a-service model [FAST 2010]. This approach has the potential to eliminate planned downtime for competitive upgrades. Amazon recently awarded me a research grant [Amazon Research Grant, 2009] for conducting a large-scale experiment (350 machines and 3 terabytes of data), which emulates a major upgrade of Wikipedia using Amazon's cloud-computing infrastructure.


