Fault Tolerant Computing - Draft

Carnegie Mellon University
18-849b Dependable Embedded Systems
Spring 1999

Ying Shi

Abstract:

Users care not only whether a system is working but also whether it is working correctly, particularly in safety-critical applications. For this reason Fault Tolerant Computing (FTC) has played an important role since the early 1950s. It is a very broad research area that covers the many categories of techniques for making systems fault tolerant, the modeling and testing that support system development, and the benchmarking used to evaluate and compare systems. The concepts, models, and methodologies of fault tolerant computing are still very much debated for software, while they are considered fairly mature for hardware. Because real-time critical systems are most often embedded systems, fault tolerant computing is applied far more strictly in the development and operation of embedded systems.

Introduction

A number of recent trends, such as harsh environments, novice users, larger and more complex systems, and rising downtime costs, have accelerated interest in making general-purpose computer systems fault tolerant. The primary goals of fault tolerance are to avoid downtime and to ensure correct operation even in the presence of faults. Stated in terms of application areas, the goals are high availability, long life, postponed maintenance, high-performance computing, and critical computations. System performance, minimally defined as the number of results per unit time multiplied by the uninterrupted length of time of correct processing, should not be compromised. In real systems, however, price-performance trade-offs must be made; fault-tolerance features will incur some cost in hardware, in performance, or both.

Fault tolerance features allow the computer to keep executing in the presence of defects. Fault-tolerant systems are usually classified as either highly reliable or highly available. Reliability, as a function of time, is the conditional probability that the system has survived the interval [0,t], given that it was operational at time t=0. Highly reliable systems are used in situations in which repair cannot take place (e.g. spacecraft) or in which the computer is performing a critical function for which even the small amount of time lost to repairs cannot be tolerated (e.g. flight-control computers). Availability captures the intuitive sense of reliability. A system is available if it is able to perform its intended function at the moment the function is required. Formally, the availability of a system as a function of time is the probability that the system is operational at the instant of time t. If the limit of this function exists as t goes to infinity, it expresses the expected fraction of time that the system is available to perform useful computations. Availability is frequently used as a figure of merit in systems for which service can be delayed or denied for short periods without serious consequences. For a system in which downtime costs tens of thousands of dollars per minute (e.g. an airline reservation system), an increase of only 0.1% in availability makes a substantial difference. In general, highly available systems are easier to build than highly reliable systems because of the more stringent requirements imposed by the reliability definition.
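The two measures can be made concrete with a short calculation. The Python sketch below is illustrative only. It assumes a constant failure rate, so that R(t) = exp(-lambda * t), and it assumes the common steady-state formula A = MTTF / (MTTF + MTTR). Neither assumption comes from the definitions above, and the function names and numbers are hypothetical.

```python
import math

MINUTES_PER_YEAR = 365 * 24 * 60


def reliability(failure_rate_per_hour, hours):
    """R(t) under the ASSUMED constant-failure-rate (exponential) model."""
    return math.exp(-failure_rate_per_hour * hours)


def steady_state_availability(mttf_hours, mttr_hours):
    """Long-run fraction of time the system is up (ASSUMED formula)."""
    return mttf_hours / (mttf_hours + mttr_hours)


def downtime_minutes_per_year(availability):
    """Expected minutes of downtime in a 365-day year."""
    return (1.0 - availability) * MINUTES_PER_YEAR


if __name__ == "__main__":
    # Hypothetical mission: 1,000 hours at one failure per 10,000 hours.
    print(f"R(1000 h) = {reliability(1e-4, 1000):.3f}")
    # Effect of raising availability from 99.8% to 99.9% (a 0.1% gain).
    saved = downtime_minutes_per_year(0.998) - downtime_minutes_per_year(0.999)
    print(f"Downtime avoided: {saved:.0f} minutes per year")
```

Under these assumptions, a gain of 0.1% in availability removes roughly 526 minutes of downtime per year, which is why such a small increase matters when each minute is expensive.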

Fault-tolerant techniques and architectures found their way into mainstream computer design when computers began to be used in situations in which failure could endanger life or property, or could cause significant economic loss. Many examples of fault-tolerant systems exist today; August, Parallel, Tandem, AT&T 3B20D, Stratus, and Intel 432 are some well-known ones.

In the next section, I point out some important fault tolerance concepts.

Key Concepts

Faults and their manifestation

Understanding how a system fails is necessary before designing a fault-tolerant system. Basically, a physical failure occurs first, logical faults then arise from it, and system errors are the result. The terms involved in this propagation process are usually defined as follows:

Failure. Physical change in hardware.
Fault. Erroneous state of hardware or software resulting from failures of components, physical interference from the environment, operator error, or incorrect design.
Error. Manifestation of a fault within a program or data structure. The error may occur some distance from the fault site.
Permanent. Describes a failure, fault, or error that is continuous and stable. In hardware, permanent failure reflects an irreversible physical change. The word "hard" is used interchangeably with the word permanent.
Intermittent. Describes a fault or error that is only occasionally present due to unstable hardware or varying hardware or software states (e.g. as a function of load or activity).
Transient. Describes a fault or error resulting from temporary environmental conditions. The word "soft" is used interchangeably with transient.

Transient and intermittent faults are the major source of system errors. The distinction between the two is whether they can be repaired: transient faults are considered not repairable, while intermittent faults are considered repairable. The manifestations of transient and intermittent faults, and of incorrect hardware or software design, are much more difficult to determine than those of permanent faults.

System Fault Response stages

Table-1 lists the ten system fault response stages, with an explanation of each stage and some additional points that need attention.

Table-1 System Fault Response Stages

Stage	Explanation	Notes
Fault confinement	Limit the scope of a fault's effects to a local area, protecting other areas of the system from contamination by the fault.	This technique may be applied in both hardware and software. For instance, it can be achieved by liberal use of fault detection circuits, consistency checks before performing a function ("mutual suspicion"), and multiple requests/confirmations before performing a function.
Fault detection	Recognize that a fault has occurred and, where possible, locate it. Many techniques have been developed for fault detection; they fall into two basic classes, off-line and on-line. With off-line detection, the device cannot perform any useful function while it is under test; with on-line detection, operation continues while the checks and any consequent work are carried out.	Off-line detection therefore assures integrity before operation and possibly at intervals during it, but not throughout operation, whereas on-line techniques must preserve system integrity while detection is in progress. The arbitrary period that elapses before detection occurs is called fault latency.
Fault masking	Also called static redundancy. Fault masking techniques hide the effects of failures by letting redundant information outweigh the incorrect information. Majority voting is an example of fault masking.	In its pure form, masking provides no detection. However, many fault-masking techniques can be extended to provide on-line detection as well. Otherwise, off-line detection techniques are needed to discover failures.
Retry	In some cases a second attempt at an operation is effective, especially for transient faults that cause no physical damage. Retry can be applied more than once; when a set number of attempts is reached without success, the system must fall back on diagnosis and detection.	It may appear that "retry" should be attempted after recovery is effected. But many times an operation that failed will execute correctly if it is tried again immediately. For instance, a transient error may prevent a successful operation, but an immediate retry will succeed since the transient will have died away a few moments later.
Diagnosis	The diagnosis stage becomes necessary when detection does not provide the fault location and other fault information.	Refer to the article "Diagnosis", also listed on this course web page, for more detail.
Reconfiguration	If a fault is detected and a permanent failure located, the system may be able to reconfigure its components to replace the failed component or to isolate it from the rest of the system. The component may be replaced by backup spares. Alternatively, it may simply be switched off and the system capability degraded, which is called graceful degradation.	Graceful degradation is one of the dynamic redundancy techniques.
Recovery	After detection and possibly reconfiguration, the effects of errors must be eliminated. Normally the system operation is backed up to some point in its processing that preceded the fault detection, and operation recommences from this point. This form of recovery, often called rollback, usually entails strategies using backup files, checkpointing, and journaling.	In recovery, error latency becomes an important issue because the rollback must go far enough to avoid the effects of undetected errors that occurred before the detected one.
Restart	Restart is needed when recovery is not possible, either because too much information has been damaged by an error or because the system is not designed for recovery. A "hot" restart, a resumption of all operations from the point of fault detection, is possible only if no unrecoverable damage has occurred. A "warm" restart implies that only some of the processes can be resumed without loss. A "cold" restart corresponds to a complete reload of the system, with no processes surviving.
Repair	Replace the damaged component. Repair can be performed either off-line or on-line.
Reintegration	Finally, the repaired device or module is reintegrated into the system. For on-line repair in particular, this has to be done without delaying system operation.
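The Retry and Recovery stages in Table-1 can be sketched together in a few lines of Python. This is a minimal illustration and not a description of any real system: the TransientFault exception, the function name, and the dictionary used as system state are all hypothetical. The sketch takes a checkpoint, retries the operation a bounded number of times, and rolls the state back to the checkpoint after every failed attempt so that partial updates do not survive.

```python
import copy


class TransientFault(Exception):
    """Hypothetical signal raised when an operation detects an error."""


def run_with_retry_and_rollback(state, operation, max_retries=3):
    """Apply operation(state) in place; retry, rolling back on each failure.

    state must be a dict. Returns True if the operation eventually
    succeeded. Returns False if every attempt failed; state then equals
    the checkpoint and the caller should move on to diagnosis.
    """
    checkpoint = copy.deepcopy(state)        # checkpoint taken before the step
    for _ in range(1 + max_retries):         # first try plus the retries
        try:
            operation(state)
            return True
        except TransientFault:
            # Rollback: discard partial updates made by the failed attempt.
            state.clear()
            state.update(copy.deepcopy(checkpoint))
    return False


if __name__ == "__main__":
    attempts = {"count": 0}

    def flaky_deposit(account):
        """Fails twice, as if disturbed by a transient, then succeeds."""
        attempts["count"] += 1
        account["balance"] += 50             # partial update
        if attempts["count"] <= 2:
            raise TransientFault("simulated transient")
        account["committed"] = True

    account = {"balance": 100, "committed": False}
    print(run_with_retry_and_rollback(account, flaky_deposit), account)
```

The retry limit reflects the point made in the table: an immediate retry often succeeds for a transient, but after a set number of failures the fault should be treated as something that needs diagnosis.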

Reliability and Availability Techniques

Two approaches to increasing system reliability are fault avoidance and fault tolerance. Fault avoidance results from conservative design practices such as the use of high-reliability parts. Its goal is to reduce the likelihood of failure, but even after the most careful application of fault-avoidance techniques, failures will eventually occur owing to defects in the system. Fault tolerance instead starts from the assumption that defects are very likely to surface during system operation anyway, so the design is oriented towards keeping the system operating correctly in the presence of defects. Redundancy is a classic technique used in both the fault avoidance and the fault tolerance approaches. With redundancy, a system is much more likely to get through the ten fault response stages listed above.
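Majority voting over redundant modules is the simplest way to see redundancy at work. The Python sketch below is an illustration under stated assumptions: the function names are hypothetical, module outputs are assumed to be hashable values, and the reliability formula assumes independent module failures and a perfect voter. The voter also reports which replicas disagreed, which shows how a masking technique can be extended to provide on-line detection.

```python
from collections import Counter


def majority_vote(outputs):
    """Return (value, disagreeing_indices) for a list of replica outputs.

    Raises ValueError when no value is produced by a strict majority,
    i.e. when the fault can no longer be masked.
    """
    value, count = Counter(outputs).most_common(1)[0]
    if count <= len(outputs) // 2:
        raise ValueError("no majority: fault cannot be masked")
    disagreeing = [i for i, out in enumerate(outputs) if out != value]
    return value, disagreeing


def tmr_reliability(r):
    """Reliability of three modules with a perfect voter: 3r^2 - 2r^3.

    ASSUMES independent module failures; r is the reliability of one module.
    """
    return 3 * r**2 - 2 * r**3


if __name__ == "__main__":
    # Replica 2 is faulty; its output is outvoted and its index reported.
    print(majority_vote([7, 7, 3]))
    print(f"{tmr_reliability(0.9):.3f}")
```

With a module reliability of 0.9, the triplicated arrangement reaches 0.972 under these assumptions. The same formula shows that triplication lowers reliability once a single module drops below 0.5, so redundancy is not automatically a gain.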

Table-2 summarizes the reliability techniques.

Table-2 Taxonomy of Reliability Techniques

Region	Techniques
Fault avoidance	Environment modification; Quality changes; Component integration level
Fault detection	Duplication; Error detection codes (M-of-N codes, Parity, Checksums, Arithmetic codes, Cyclic codes); Self-checking and fail-safe logic; Watchdog timers and time-outs; Consistency and capability checks
Static redundancy/masking redundancy	NMR/voting; Error correcting codes (Hamming SEC/DED, Other codes); Masking logic (Interwoven logic, Coded-state machines)
Dynamic redundancy	Reconfigurable duplication; Reconfigurable NMR; Backup sparing; Graceful degradation; Reconfiguration; Recovery
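Of the error detection codes in Table-2, parity is the easiest to demonstrate. The Python sketch below is a minimal illustration with hypothetical function names; it stores one even-parity bit alongside a data word and checks it on read.

```python
def even_parity_bit(data):
    """Return the bit that makes the total number of 1s even."""
    return bin(data).count("1") % 2


def encode(data):
    """Store the word together with its parity bit."""
    return data, even_parity_bit(data)


def check(data, parity):
    """True if the stored word is consistent with its parity bit."""
    return even_parity_bit(data) == parity


if __name__ == "__main__":
    word, parity = encode(0b10110010)
    print(check(word, parity))               # clean word
    print(check(word ^ 0b00000100, parity))  # one bit flipped
    print(check(word ^ 0b00000110, parity))  # two bits flipped
```

A single flipped bit changes the parity and is detected. Two flipped bits cancel each other and pass the check, which is the reason stronger codes such as Hamming SEC/DED appear in the table.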

Other useful fault tolerance techniques include hardware redundancy, n-version programming, and graceful degradation. Their details are not covered here and can be found in [1].

Available tools and metrics (I will find more for the final write-up, ying)

Fault injection is one of the well-known techniques for measuring a system's fault tolerance capability.
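A toy fault injection campaign can be written in a few lines of Python. The sketch below is illustrative only and does not model any real tool: the target is a hypothetical 8-bit word stored in three copies and read through a bitwise two-out-of-three vote, the fault model is random bit flips, and all names and parameters are assumptions. Coverage is measured as the fraction of injection experiments in which the voted output is still correct.

```python
import random


def bitwise_majority(a, b, c):
    """Bit-by-bit two-out-of-three vote over three copies of a word."""
    return (a & b) | (a & c) | (b & c)


def inject_bit_flips(copies, n_flips, width, rng):
    """Return new copies with n_flips distinct (copy, bit) positions flipped."""
    faulty = list(copies)
    positions = [(c, b) for c in range(len(copies)) for b in range(width)]
    for copy_index, bit in rng.sample(positions, n_flips):
        faulty[copy_index] ^= 1 << bit
    return faulty


def measure_coverage(n_flips, trials=2000, width=8, seed=1999):
    """Fraction of injection experiments whose voted output is still correct."""
    rng = random.Random(seed)                # fixed seed for repeatability
    tolerated = 0
    for _ in range(trials):
        golden = rng.getrandbits(width)      # fault-free reference value
        faulty = inject_bit_flips([golden] * 3, n_flips, width, rng)
        if bitwise_majority(*faulty) == golden:
            tolerated += 1
    return tolerated / trials


if __name__ == "__main__":
    for n in (1, 2):
        print(n, "flip(s): coverage", measure_coverage(n))
```

In this fault model every single flip is masked, so coverage for one flip is 1.0. Two flips defeat the vote only when they hit the same bit position in two different copies, which is 24 of the 276 possible pairs, so measured coverage for two flips should come out near 91 percent.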

Relationship to other topics

Fault Injection. As mentioned above, fault injection is a very useful technique for measuring a system's fault tolerance capability. It works together with test generation tools, which generate the faults to be injected into the system. By measuring the coverage, that is, the fraction of injected faults the system is able to tolerate, we get an idea of the system's fault tolerance capability.
Software Reliability. Software reliability is getting more and more attention from researchers in the FTC area, as software appears to be the cause of the vast majority of system defects. Although existing fault tolerance techniques seem to work reasonably well to ensure hardware reliability, they do not have the same effect when applied to software. Hardware wears out as a function of time, while software never does. The big obstacle facing software reliability has been shown to be complexity.
Software Testing. Testing is a necessary step towards software reliability, and it is always an important tool for assessing a system's fault tolerance capability. Because no testing method can explore the population space thoroughly, especially for software with its prevalent complexity, software testing is often considered an "art" in this research field.

Conclusions

In this article I have introduced basic fault tolerance concepts, along with the techniques and tools used to achieve fault tolerance, and I have described the core issue: faults, their manifestation, and their behavior. In general, fault tolerant computing can be considered a study of faults and failures, because understanding how faults and failures behave is the reasonable starting point for stopping their effects as system defects; the techniques and tools have all been developed to probe this behavior and then to stop its propagation. Because most of the techniques and tools were originally created to cope with hardware defects, or are more effective when applied to hardware, software fault tolerance is still less mature than hardware fault tolerance. Software fault tolerance research is drawing more and more attention, as the majority of system defects have been shown to be software defects.

Annotated Reference List

[1] Siewiorek & Swarz, "Reliable Computer Systems - Design and Evaluation", Digital Press
[2] FTCS proceedings

Loose Ends
