Adaptive Dependability for Fault-Tolerant Middleware

I architected and implemented the MEAD (Middleware for Embedded Adaptive Dependability) system, which served as a fault-tolerance platform in the DARPA ARMS-II and PCES-II programs. MEAD uses system-call interception and does not modify the applications or the operating systems. In this manner, MEAD enhances legacy CORBA applications with replication and recovery mechanisms, and it allows users to select the appropriate trade-offs among resource usage, fault-tolerance and performance [CC:PE 2005].

Using fault-tolerant middleware usually demands an in-depth understanding of fault-tolerance semantics. For instance, the Fault-Tolerant CORBA standard specifies ten low-level parameters for tuning the replication mechanisms: replication style, membership style, consistency style, fault-monitoring style, fault-monitoring granularity, location of factories, initial number of replicas, minimum number of replicas, fault-monitoring interval and timeout, and checkpoint interval.

Instead, MEAD provides versatile dependability [ADS III]. Low-level knobs control the internal fault-tolerant mechanisms of the infrastructure (e.g., the degree and style of replication). In contrast, high-level knobs regulate external properties (e.g., scalability, availability) that are relevant to the system's users, and they hide the internal implementation details. For example, a low-level knob allows switching between active and passive replication on the fly, and a high-level knob tunes the system's scalability.

MEAD also helped us discover the magical 1% phenomenon [Middleware 2005]. While MEAD's maximum response time is not predictable, the latency outliers are limited to only 1% of the remote invocations. We showed that this is a general phenomenon through a broad empirical study of unpredictability in 15 additional systems, ranging from simple transport protocols to fault-tolerant, middleware-based enterprise applications [COMNET 2012]. The maximum latency is not influenced by the system's workload, cannot be regulated through configuration parameters and is not correlated with the system's resource consumption. The high-latency outliers (up to three orders of magnitude higher than the average latency) have multiple causes and may originate in any component of the system. However, after selectively filtering 1% of the invocations with the highest recorded response-times, the latency becomes bounded with high statistical confidence. We verified this result on different operating systems (Linux 2.4, Linux 2.6, Linux-rt, TimeSys), middleware platforms (CORBA and EJB), programming languages (C, C++ and Java), replication styles (active and warm passive) and applications (e-commerce and online gaming) [Middleware 2007; COMNET 2012]. Moreover, this phenomenon occurs at all the layers of middleware-based systems, from the communication protocols to the business logic. This suggests that, while the emergent behavior of middleware is not strictly predictable, distributed systems could cope with the inherent unpredictability by focusing on statistical measures, such as the 99th latency percentile.

My contributions to the MEAD project include:

Publications

Journal Articles and Book Chapters

Conference and Workshop Papers