Tudor Dumitras :: Research :: Adaptive Dependability for Fault-Tolerant Middleware

Adaptive Dependability for Fault-Tolerant Middleware

I architected and implemented the MEAD (Middleware for Embedded Adaptive Dependability) system, which served as a fault-tolerance platform in the DARPA ARMS-II and PCES-II programs. MEAD uses system-call interception and does not modify the applications or the operating systems. In this manner, MEAD enhances legacy CORBA applications with replication and recovery mechanisms, and it allows users to select the appropriate trade-offs among resource usage, fault-tolerance and performance [CC:PE 2005].

Using fault-tolerant middleware usually demands an in-depth understanding of fault-tolerance semantics. For instance, the Fault-Tolerant CORBA standard specifies ten low-level parameters for tuning the replication mechanisms: replication style, membership style, consistency style, fault-monitoring style, fault-monitoring granularity, location of factories, initial number of replicas, minimum number of replicas, fault-monitoring interval and timeout, and checkpoint interval.

Instead, MEAD provides versatile dependability [ADS III]. Low-level knobs control the internal fault-tolerant mechanisms of the infrastructure (e.g., the degree and style of replication). In contrast, high-level knobs regulate external properties (e.g., scalability, availability) that are relevant to the system's users, and they hide the internal implementation details. For example, a low-level knob allows switching between active and passive replication on the fly, and a high-level knob tunes the system's scalability.

MEAD also helped us discover the magical 1% phenomenon [Middleware 2005]. While MEAD's maximum response time is not predictable, the latency outliers are limited to only 1% of the remote invocations. We showed that this is a general phenomenon through a broad empirical study of unpredictability in 15 additional systems, ranging from simple transport protocols to fault-tolerant, middleware-based enterprise applications [COMNET 2012]. The maximum latency is not influenced by the system's workload, cannot be regulated through configuration parameters and is not correlated with the system's resource consumption. The high-latency outliers (up to three orders of magnitude higher than the average latency) have multiple causes and may originate in any component of the system. However, after selectively filtering 1% of the invocations with the highest recorded response-times, the latency becomes bounded with high statistical confidence. We verified this result on different operating systems (Linux 2.4, Linux 2.6, Linux-rt, TimeSys), middleware platforms (CORBA and EJB), programming languages (C, C++ and Java), replication styles (active and warm passive) and applications (e-commerce and online gaming) [Middleware 2007; COMNET 2012]. Moreover, this phenomenon occurs at all the layers of middleware-based systems, from the communication protocols to the business logic. This suggests that, while the emergent behavior of middleware is not strictly predictable, distributed systems could cope with the inherent unpredictability by focusing on statistical measures, such as the 99^th latency percentile.

My contributions to the MEAD project include:

I implemented the first version of the system, MEAD v1.0
I implemented MEAD's adaptation mechanisms, e.g., switching between active and passive replication on-the-fly and tuning the system scalability
I discovered the magical 1% phenomenon
I collaborated with my colleagues Soila Pertet, Joseph Slember and Deepti Srivastava, who used MEAD for developing proactive recovery, mechanisms to compensate for nondeterminism and mode-driven fault-tolerance.

Publications

Journal Articles and Book Chapters

T. Dumitraş and P. Narasimhan
A Study of Unpredictability in Fault-Tolerant Middleware
Computer Networks (COMNET), Elsevier, available online 29 October 2012, ISSN 1389-1286

P. Narasimhan, T. Dumitraş, A. Paulos, S. Pertet, C. Reverte, J. Slember and D. Srivastava
MEAD: Support for Real-Time, Fault-Tolerant CORBA
Concurrency and Computation: Practice and Experience, vol. 17, no. 12, Wiley and Sons, 2005
[Source code]

T. Dumitraş, D. Srivastava and P. Narasimhan
Architecting and Implementing Versatile Dependability
Architecting Dependable Systems Vol. III, C. Gacek, et al., eds., Springer-Verlag, 2005 (LNCS 3549)

Conference and Workshop Papers

T. Dumitraş and P. Narasimhan
Got Predictability? Experiences with Fault-Tolerant Middleware
ACM/IFIP/USENIX Middleware Conference, Nov. 2007
[Supplemental information]

T. Dumitraş and P. Narasimhan
Fault-Tolerant Middleware and the Magical 1%
ACM/IFIP/USENIX Middleware Conference, Nov.-Dec. 2005 (LNCS 3790)
[Supplemental information]
T. Dumitraş, P. Narasimhan
An Architecture for Versatile Dependability
DSN Workshop on Architecting Dependable Systems (WADS), Jun. 2004