Ying Shi
Fault tolerance features allow a computer to keep executing in the presence of defects. Such systems are usually classified as either highly reliable or highly available. Reliability, as a function of time, is the conditional probability that the system has survived the interval [0, t], given that it was operational at time t = 0. Highly reliable systems are used in situations in which repair cannot take place (e.g. spacecraft) or in which the computer performs a critical function for which even the small amount of time lost to repairs cannot be tolerated (e.g. flight-control computers).

Availability corresponds to the intuitive sense of reliability: a system is available if it is able to perform its intended function at the moment the function is required. Formally, the availability of a system as a function of time is the probability that the system is operational at time t. If the limit of this function exists as t goes to infinity, it expresses the expected fraction of time that the system is available to perform useful computations. Availability is frequently used as a figure of merit for systems in which service can be delayed or denied for short periods without serious consequences. For a system in which downtime costs tens of thousands of dollars per minute (e.g. an airline reservation system), an increase of only 0.1% in availability makes a substantial difference. In general, highly available systems are easier to build than highly reliable systems because of the more stringent requirements imposed by the definition of reliability.
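To make these definitions concrete, the short Python sketch below computes the reliability of a component under the common constant-failure-rate (exponential) assumption, R(t) = exp(-λt), and the steady-state availability as MTTF/(MTTF + MTTR). The numeric failure rate, mission time, and repair time are made-up values for illustration, not figures from this article.

```python
import math

def reliability(failure_rate: float, t: float) -> float:
    """R(t) = exp(-lambda * t): probability the component survives [0, t],
    assuming a constant failure rate (exponential failure law)."""
    return math.exp(-failure_rate * t)

def steady_state_availability(mttf: float, mttr: float) -> float:
    """Limiting availability: the expected fraction of time the system is up."""
    return mttf / (mttf + mttr)

# Hypothetical figures: failure rate of 1e-4 per hour, a 10,000-hour mission,
# MTTF of 10,000 hours, and MTTR of 2 hours.
print(f"R(10,000 h)  = {reliability(1e-4, 10_000):.3f}")              # ~0.368
print(f"Availability = {steady_state_availability(10_000, 2):.5f}")   # ~0.99980
```

With these made-up numbers, an availability of about 0.99980 corresponds to roughly 1.75 hours of downtime per year; a 0.1% change in availability moves that figure by nearly nine hours per year, which is why even small availability gains matter for systems like the ones mentioned above.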
Fault-tolerant techniques and architectures found their way into mainstream computer design when computers began to be used in situations in which failure could endanger life or property, or could cause significant economic loss. Many examples of fault-tolerant systems can be found nowadays; for instance, August, Parallel, Tandem, the AT&T 3B20D, Stratus, and the Intel 432 are some well-known fault-tolerant systems.
In the next section, I will point out some important fault tolerance concepts.
Understanding how a system fails is certainly necessary before designing a fault-tolerant system. Basically, the process starts with a physical failure, from which logical faults arise, and system errors result in turn. The terms usually used to describe this propagation are failure (the physical change in the hardware), fault (the erroneous logical state caused by the failure), and error (the manifestation of the fault in the system's programs or data).
System Fault Response Stages
Table-1 lists the ten system fault response stages, giving each stage a detailed explanation and some additional points that deserve attention.
Table-1 System Fault Response Stages
Fault Response Stage | Explanation | Additional notes |
Fault confinement | Limit the scope of a fault's effects to a local area, and protect other areas of the system from being contaminated by the fault. | This technique may be applied in both hardware and software. For instance, it can be achieved by liberal use of fault detection circuits, consistency checks before performing a function ("mutual suspicion"), and multiple requests/confirmations before performing a function. |
Fault detection | Detect that a fault has occurred. Multiple techniques have been developed and applied for fault detection; they can be broadly classified into off-line and on-line fault detection. With off-line detection, the device cannot perform useful work during the test, while with on-line detection, operation can continue while the tests are applied. | Thus, off-line detection assures integrity before and possibly at intervals during operation, but not during the entire time of operation, while on-line techniques must guarantee system integrity throughout the detection stage. The arbitrary period that passes before detection occurs is called fault latency. |
Fault masking | Also called static redundancy, fault masking techniques hide the effects of failures by letting redundant correct information outweigh the incorrect information. Majority voting is an example of fault masking. | In its pure form, masking provides no detection. However, many fault-masking techniques can be extended to provide on-line detection as well; otherwise, off-line detection techniques are needed to discover failures. (A minimal majority-voting sketch appears after Table-1.) |
Retry | In some cases a second attempt at an operation is enough, especially for transient faults that cause no physical damage. Retry can be applied more than once; when a certain retry count is reached, the system needs to go through detection and diagnosis. | It may appear that retry should only be attempted after recovery is effected, but many times an operation that failed will execute correctly if it is tried again immediately. For instance, a transient error may prevent a successful operation, yet an immediate retry will succeed because the transient will have died away a few moments later. (A bounded-retry sketch appears after Table-1.) |
Diagnosis | The diagnosis stage becomes necessary when detection does not provide the fault's location or other information about it. | Refer to the "Diagnosis" article, also listed on this course web page, for more detail. |
Reconfiguration | If a fault is detected and a permanent failure located, the system may be able to reconfigure its components to replace the failed component or to isolate it from the rest of the system. The component may be replaced by a backup spare. Alternatively, it may simply be switched off and the system capability degraded, an approach called graceful degradation. | Graceful degradation is one of the dynamic redundancy techniques. |
Recovery | After detection and possibly reconfiguration, the effects of errors must be eliminated. Normally the system operation is backed up to some point in its processing that preceded the fault detection, and operation recommences from that point. This form of recovery, often called rollback, usually entails strategies using backup files, checkpointing, and journaling. | In recovery, error latency becomes an important issue because the rollback must go far enough to avoid the effects of undetected errors that occurred before the detected one. (A checkpoint-and-rollback sketch appears after Table-1.) |
Restart | Restart may be necessary if too much information is damaged by an error, or if the system is not designed for recovery. A "hot" restart, a resumption of all operations from the point of fault detection, is possible only if the damage is recoverable. A "warm" restart implies that only some of the processes can be resumed without loss. A "cold" restart corresponds to a complete reload of the system, with no processes surviving. | |
Repair | Replace the failed component. Repair can be performed either off-line or on-line. | |
Reintegration | Finally, the repaired device or module is reintegrated into the system. For on-line repair in particular, this must be accomplished without delaying system operation. | |
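As a concrete illustration of the fault-masking stage in Table-1, here is a minimal sketch of majority voting over triplicated results in Python. The replicated computation and the injected faulty replica are hypothetical examples, not part of the original text.

```python
from collections import Counter

def vote(results):
    """Majority voting (fault masking): return the value produced by most
    replicas. A single faulty replica is silently outvoted; in this pure
    form no error is reported, i.e. masking provides no detection."""
    value, count = Counter(results).most_common(1)[0]
    if count <= len(results) // 2:
        raise RuntimeError("no majority: too many replicas disagree")
    return value

# Hypothetical triplicated computation in which one replica is faulty.
replicas = [lambda x: x * x, lambda x: x * x, lambda x: x * x + 1]
outputs = [f(4) for f in replicas]   # [16, 16, 17]
print(vote(outputs))                 # 16 -- the faulty result is masked
```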
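The retry stage can likewise be sketched as a bounded-retry wrapper: a transient fault that disappears on an immediate reattempt is tolerated, and a persistent fault is escalated. The operation name, the exception used to signal a transient fault, and the retry limit below are illustrative assumptions.

```python
import time

def with_retry(operation, attempts=3, delay=0.01):
    """Retry a failed operation a bounded number of times. A transient
    fault often disappears on an immediate reattempt; once the retry
    budget is exhausted, escalate to detection/diagnosis."""
    last_error = None
    for _ in range(attempts):
        try:
            return operation()
        except IOError as err:        # assumed signal for a transient fault
            last_error = err
            time.sleep(delay)         # brief pause before the next attempt
    raise RuntimeError("persistent fault: escalate to diagnosis") from last_error

# Hypothetical flaky operation: fails twice with a transient error, then succeeds.
calls = {"count": 0}
def flaky_read():
    calls["count"] += 1
    if calls["count"] < 3:
        raise IOError("transient bus error")
    return "data"

print(with_retry(flaky_read))   # "data", recovered after two retries
```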
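Finally, the recovery stage's checkpoint-and-rollback idea can be sketched on a toy in-memory state; the class name and the point at which a fault is "detected" are assumptions made purely for illustration.

```python
import copy

class CheckpointedState:
    """Toy rollback recovery: periodically checkpoint the state so that,
    when a fault is detected, processing can recommence from the last
    known-good point instead of starting over."""
    def __init__(self):
        self.state = {"total": 0, "processed": []}
        self._checkpoint = copy.deepcopy(self.state)

    def checkpoint(self):
        self._checkpoint = copy.deepcopy(self.state)

    def rollback(self):
        self.state = copy.deepcopy(self._checkpoint)

    def add(self, value):
        self.state["total"] += value
        self.state["processed"].append(value)

s = CheckpointedState()
s.add(5)
s.checkpoint()              # known-good state: total == 5
s.add(7)                    # suppose a fault is detected after this update
s.rollback()                # discard the work done since the checkpoint
print(s.state["total"])     # 5 -- operation recommences from here
```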
Reliability and Availability Techniques
Two approaches to increasing system reliability are fault avoidance and fault tolerance. Fault avoidance results from conservative design practices such as the use of high-reliability parts. Although the goal of fault avoidance is to reduce the likelihood of failure, even after the most careful application of fault-avoidance techniques, failures will eventually occur owing to defects in the system. In comparison, fault tolerance is the stronger approach: it starts from the assumption that defects are very likely to surface during the system's operational stage anyway, so the design is oriented toward keeping the system operating correctly in the presence of those defects. Redundancy is the classic technique used in both the fault-avoidance and fault-tolerance approaches; with redundancy, a system is far more likely to make it through the ten fault response stages listed above.
Table-2 gives a concise tabular description of the reliability techniques.
Table-2 Taxonomy of Reliability Techniques
Region | Technique | Additional notes |
Fault avoidance | Environment modification; quality changes; component integration level | |
Fault detection | Duplication; error detection codes (M-of-N codes, parity, checksums, arithmetic codes, cyclic codes); self-checking and fail-safe logic; watchdog timers and time-outs; consistency and capability checks | See the parity-check sketch after this table. |
Static redundancy / masking redundancy | NMR/voting; error-correcting codes (Hamming SEC/DED, other codes); masking logic (interwoven logic, coded-state machines) | |
Dynamic redundancy | Reconfigurable duplication; reconfigurable NMR; backup sparing; graceful degradation; reconfiguration; recovery | |
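As a small example of the error-detection codes listed in the fault-detection row of Table-2, the sketch below implements a simple even-parity check over a byte string; the sample message and the injected single-bit error are assumptions for illustration. A parity check detects any odd number of flipped bits but cannot locate or correct them.

```python
def parity_bit(data: bytes) -> int:
    """Even parity: return 0 or 1 so that the total number of 1 bits in
    the data plus the parity bit is even."""
    ones = sum(bin(byte).count("1") for byte in data)
    return ones % 2

def check(data: bytes, stored_parity: int) -> bool:
    """Detect (but neither locate nor correct) any odd number of bit errors."""
    return parity_bit(data) == stored_parity

message = b"fault tolerance"
p = parity_bit(message)

corrupted = bytes([message[0] ^ 0x01]) + message[1:]   # flip a single bit
print(check(message, p))      # True  -- no error detected
print(check(corrupted, p))    # False -- the single-bit error is detected
```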
There are other useful fault tolerance techniques as well, such as hardware redundancy, N-version programming, and graceful degradation. Their details are not covered here and can be found in [1].