Fault Tolerant Computing -- Draft

Carnegie Mellon University
18-849b Dependable Embedded Systems
Spring 1999

Ying Shi


Users care not only about whether a system is working but also about whether it is working correctly, particularly in safety-critical cases, so Fault Tolerant Computing (FTC) has played an important role since the early fifties.  FTC is a very broad research area: it covers a variety of techniques for making systems fault tolerant, modeling and testing to support system development, and benchmarking to evaluate and compare systems.  The concepts, models, and methodologies of fault tolerant computing are still quite controversial for software, while they are considered fairly mature for hardware.  Since real-time critical systems are most often embedded systems, fault tolerant computing is applied far more strictly in the development and operation of embedded systems.

Related Topics:



A number of recent trends, such as harsh environments, novice users, larger and more complex systems, and downtime costs, have accelerated interest in making general-purpose computer systems fault tolerant.  The primary goals of fault tolerance are to avoid downtime and to ensure correct operation even in the presence of faults.  More specifically, the application goals are high availability, long life, postponed maintenance, high-performance computing, and critical computations.  System performance, minimally defined to be the number of results per unit time times the uninterrupted length of time of correct processing, should not be compromised.  In real systems, however, price-performance trade-offs must be made; fault-tolerance features will incur some costs in hardware, in performance, or both.

Fault tolerance features basically allow the computer to keep executing in the presence of defects.  These systems are usually classified as either highly reliable or highly available.  Reliability, as a function of time, is the conditional probability that the system has survived the interval [0,t], given that it was operational at time t=0.  Highly reliable systems are used in situations in which repair cannot take place (e.g. spacecraft) or in which the computer is performing a critical function for which even the small amount of time lost due to repairs cannot be tolerated (e.g. flight-control computers).  Availability is the intuitive sense of reliability.  A system is available if it is able to perform its intended function at the moment the function is required.  Formally, the availability of a system as a function of time is the probability that the system is operational at the instant of time t.  If the limit of this function exists as t goes to infinity, it expresses the expected fraction of time that the system is available to perform useful computations.  Availability is frequently used as a figure of merit in systems for which service can be delayed or denied for short periods without serious consequences.  For a system in which downtime costs tens of thousands of dollars per minute (e.g. an airline reservation system), an increase of only 0.1% in availability makes a substantial difference.  In general, highly available systems are easier to build than highly reliable systems because of the more stringent requirements imposed by the reliability definition.
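The two definitions above can be made concrete with a little arithmetic.  The following is a minimal sketch that is not part of the original text.  It assumes a constant failure rate, so that reliability decays exponentially with time, and it uses the standard steady-state availability MTTF / (MTTF + MTTR).  The function names and the sample numbers are illustrative assumptions.

```python
import math

MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes in a non-leap year


def reliability(t_hours, mttf_hours):
    """R(t) = exp(-t / MTTF): probability of surviving the interval [0, t].

    Assumption: constant failure rate (exponential failure model).
    """
    return math.exp(-t_hours / mttf_hours)


def steady_state_availability(mttf_hours, mttr_hours):
    """Long-run fraction of time the system is operational.

    MTTF is the mean time to failure, MTTR the mean time to repair.
    """
    return mttf_hours / (mttf_hours + mttr_hours)


def downtime_minutes_per_year(availability):
    """Expected minutes of downtime per year at a given availability."""
    return (1.0 - availability) * MINUTES_PER_YEAR
```

Under these assumptions, raising availability from 99.8% to 99.9% removes about 525.6 minutes of downtime per year.  At an assumed cost of $10,000 per minute, that is roughly $5.3 million per year, which is why a 0.1% change matters for such systems.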

Fault-tolerant techniques and architectures found their way into mainstream computer design when computers began to be used in situations in which failure could endanger life or property, or could cause significant economic loss.  Many examples of fault-tolerant systems exist today; August, Parallel, Tandem, AT&T 3B20D, Stratus, and Intel 432 are some well-known ones.

In the next section, I will point out some important fault tolerance concepts.

Key Concepts

Faults and their manifestation

It is necessary to understand how a system fails before designing a fault-tolerant system.  Basically, failures start from a physical failure, logical faults then arise, and system errors are the result.  The definitions usually involved in this propagation process are as follows:

Transient faults and intermittent faults are the major source of system errors.  The distinction between these two types of faults is the ability to repair them: transient faults are considered not repairable, and intermittent ones repairable.  The manifestations of transient and intermittent faults, and of incorrect hardware or software design, are much more difficult to determine than those of permanent faults.

System Fault Response stages

Table-1 lists the ten system fault response stages, giving each stage an explanation and, where needed, some additional points that deserve attention.

Table-1  System Fault Response Stages

Name of Fault Response Stage | Explanation | Extra mention
Fault confinement | Limit the scope of a fault's effects to a local area, or protect other areas of the system from being contaminated by the fault. | This technique may be applied in both hardware and software.  For instance, it can be achieved by liberal use of fault detection circuits, consistency checks before performing a function ("mutual suspicion"), and multiple requests/confirmations before performing a function.
Fault detection | Locate the fault. | Multiple techniques have been developed and applied for fault detection.  They can basically be classified into off-line fault detection and on-line fault detection.  With off-line detection, the device is unable to perform any function during the test, while with on-line detection, operation continues while the tests and the consequent work are being carried out.  Thus, off-line detection assures integrity before and possibly at intervals during operation, but not during the entire time of operation, while on-line techniques have to guarantee system integrity all through the detection stage.  The arbitrary period that passes before detection occurs is called fault latency.
Fault masking | Also called static redundancy.  Fault masking techniques hide the effects of failures by letting redundant information outweigh the incorrect information.  Majority voting is an example of fault masking. | In its pure form, masking provides no detection.  However, many fault-masking techniques can be extended to provide on-line detection as well.  Otherwise, off-line detection techniques are needed to discover failures.
Retry | In some cases, a second attempt at an operation is effective enough, especially for transient faults that cause no physical damage.  Retry can be applied more than once; when a certain number of attempts is reached, the system needs to go through diagnosis and detection. | It may appear that "retry" should be attempted after recovery is effected.  But many times an operation that failed will execute correctly if it is tried again immediately.  For instance, a transient error may prevent a successful operation, but an immediate retry will succeed since the transient will have died away a few moments later.
Diagnosis | The diagnosis stage becomes necessary when detection cannot provide the fault location and other fault information. | Refer to the article "Diagnosis", also listed on this course web page, for more detail.
Reconfiguration | If a fault is detected and a permanent failure located, the system may be able to reconfigure its components to replace the failed component or to isolate it from the rest of the system.  The component may be replaced by backup spares.  Alternatively, it may simply be switched off and the system capability degraded, which is called graceful degradation. | Graceful degradation is one of the dynamic redundancy techniques.
Recovery | After detection and possibly reconfiguration, the effects of errors must be eliminated.  Normally the system operation is backed up to some point in its processing that preceded the fault detection, and operation recommences from this point.  This form of recovery, often called rollback, usually entails strategies using backup files, checkpointing, and journaling. | In recovery, error latency becomes an important issue because the rollback must go far enough to avoid the effects of undetected errors that occurred before the detected one.
Restart | Restart may be needed if too much information is damaged by an error, or if the system is not designed for recovery. | A "hot" restart, a resumption of all operations from the point of fault detection, is possible only if no unrecoverable damage has occurred.  A "warm" restart implies that only some of the processes can be resumed without loss.  A "cold" restart corresponds to a complete reload of the system, with no processes surviving.
Repair | Replace the damaged component. | It can be either off-line or on-line.
Reintegration | Finally, the repaired device or module is reintegrated into the system. | For on-line repair in particular, this has to be done without delaying system operation.
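
The retry stage in Table-1 can be sketched as follows.  This is a minimal illustration under stated assumptions, not a prescribed implementation: TransientFault, DiagnosisRequired, retry_then_escalate, and FlakyOperation are hypothetical names invented for this sketch, and the escalation step simply raises an exception where a real system would enter the diagnosis stage.

```python
class TransientFault(Exception):
    """Hypothetical exception for a fault that may clear by itself."""


class DiagnosisRequired(Exception):
    """Hypothetical escalation raised when the retry budget is exhausted."""


def retry_then_escalate(operation, max_attempts=3):
    """Run operation(), retrying immediately on a transient fault.

    Returns (result, attempts_used).  After max_attempts failures the
    fault is no longer treated as transient and diagnosis is requested.
    """
    last_fault = None
    for attempt in range(1, max_attempts + 1):
        try:
            return operation(), attempt
        except TransientFault as fault:
            last_fault = fault  # remember the cause, then try again
    raise DiagnosisRequired(f"failed {max_attempts} times") from last_fault


class FlakyOperation:
    """Test double: fails a fixed number of times, then succeeds."""

    def __init__(self, failures_before_success):
        self.remaining = failures_before_success

    def __call__(self):
        if self.remaining > 0:
            self.remaining -= 1
            raise TransientFault("simulated transient")
        return "ok"
```

For example, retry_then_escalate(FlakyOperation(2)) succeeds on the third attempt, while an operation that keeps failing past the retry budget raises DiagnosisRequired.  The retry limit of three is an arbitrary choice for the sketch.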

Reliability and Availability Techniques

Two approaches to increasing system reliability are fault avoidance and fault tolerance.  Fault avoidance results from conservative design practices such as the use of high-reliability parts.  Though the goal of fault avoidance is to reduce the likelihood of failure, even after the most careful application of fault-avoidance techniques, failures will occur eventually owing to defects in the system.  Fault tolerance, in comparison, approaches system design with the assumption that defects are very likely to surface anyway during system operation, so the design is oriented towards keeping the system operating correctly in the presence of defects.  Redundancy is a classic technique used in both fault avoidance and fault tolerance approaches.  With redundancy, a system can in most cases pass through the ten fault response stages listed above.
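
To give a feel for what redundancy buys, here is a small sketch that is not from the original text.  It compares a single module with triple modular redundancy (three copies plus a majority voter), assuming independent module failures and a perfect voter.  Under those assumptions the redundant system works whenever at least two of the three modules work.  The function names are illustrative, and math.comb requires Python 3.8 or later.

```python
from math import comb


def k_of_n_reliability(k, n, r):
    """Probability that at least k of n modules are working.

    Assumptions: modules fail independently and each has reliability r.
    """
    return sum(comb(n, i) * r**i * (1 - r) ** (n - i) for i in range(k, n + 1))


def tmr_reliability(r):
    """Reliability of a 2-out-of-3 system with a perfect voter.

    Algebraically this equals 3*r**2 - 2*r**3.
    """
    return k_of_n_reliability(2, 3, r)
```

With a module reliability of 0.9 the triplicated system reaches 0.972.  With a module reliability of 0.4 it drops to 0.352, so under these assumptions redundancy only helps when the individual modules are already more likely to work than to fail.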

Table-2 gives a taxonomy of the reliability techniques.

Table-2  Taxonomy of Reliability Techniques

Region | Techniques
Fault avoidance | Environment modification; quality changes; component integration level
Fault detection | Duplication; error detection codes (M-of-N codes, parity, checksums, arithmetic codes, cyclic codes); self-checking and fail-safe logic; watchdog timers and time-outs; consistency and capability checks
Static redundancy/masking redundancy | NMR/voting; error correcting codes (Hamming SEC/DED, other codes); masking logic (interwoven logic, coded-state machines)
Dynamic redundancy | Reconfigurable duplication; reconfigurable NMR; backup sparing; graceful degradation
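
As one concrete instance of the NMR/voting entry in Table-2, the sketch below shows a majority voter over the outputs of N redundant modules.  It is an illustration under assumptions, not an implementation taken from the referenced literature: the name majority_vote and the returned tuple are invented here, and module outputs are assumed to be hashable values that can be compared for exact equality.  The voter also reports which modules disagreed, which is one simple way of extending masking to provide on-line detection.

```python
from collections import Counter


def majority_vote(outputs):
    """Return (value, dissenters) for a list of module outputs.

    value is the output produced by a strict majority of the modules.
    dissenters lists the indices of modules that disagreed with it.
    Raises ValueError when no strict majority exists.
    """
    if not outputs:
        raise ValueError("no module outputs to vote on")
    value, count = Counter(outputs).most_common(1)[0]
    if count * 2 <= len(outputs):
        raise ValueError("no majority: the fault cannot be masked")
    dissenters = [i for i, out in enumerate(outputs) if out != value]
    return value, dissenters
```

For example, majority_vote([7, 9, 7]) returns (7, [1]): the faulty output of module 1 is masked, and its index is available to later stages such as diagnosis or reconfiguration.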

There are other useful fault tolerance techniques as well, such as hardware redundancy, n-version programming, and graceful degradation.  Their details are not covered here and can be found in [1].

Available tools and metrics (I will find more for the final write-up, ying)

Relationship to other topics


In this article, I have introduced basic fault tolerance concepts and the techniques and tools used to achieve this special system feature, and described the core issue, the fault, along with its manifestation and behavior.  In general, fault tolerant computing can be considered the study of faults and failures: understanding how faults and failures behave is the reasonable starting point for stopping their effects, and all of these techniques and tools were developed to probe this behavior and then to stop its propagation.  Because most of the techniques and tools were originally created to cope with hardware defects, or are more effective when applied to hardware, software fault tolerance is still not as mature as hardware fault tolerance.  Software fault tolerance research is drawing more and more attention, as the majority of system defects are shown to be software defects.

Annotated Reference List

[1] Siewiorek & Swarz, "Reliable Computer Systems - Design and Evaluation", Digital Press
[2] FTCS (International Symposium on Fault-Tolerant Computing) proceedings

Loose Ends
