Checkpointing and Recovery
Available tools, techniques, and metrics
Relationship to other topics
Annotated Reference List
In order to adequately understand software
fault tolerance it is important to understand the nature of the problem that
software fault tolerance is supposed to solve. Software faults are all design
faults.[Storey96] Software manufacturing, the
reproduction of software, is considered to be perfect. The source of the
problem being solely design faults is very different than almost any other
system in which fault tolerance is a desired property. This inherent issue,
that software faults are the result of human error in interpreting a
specification or correctly implementing an algorithm, creates issues which must
be dealt with in the fundamental approach to software fault tolerance.
Current software fault tolerance methods are
based on traditional hardware fault tolerance. The deficiency with this
approach is that traditional hardware fault tolerance was designed to conquer
manufacturing faults primarily, and environmental and other faults secondarily.
Design diversity was not a concept applied to the solutions to hardware fault
tolerance, and to this end, N-Way redundant systems solved many single errors
by replicating the same hardware. Software fault tolerance tries to leverage
the experience of hardware fault tolerance to solve a different problem, but by
doing so creates a need for design diversity in order to properly create a redundant
diversity is a solution to software fault tolerance only so far as it is
possible to create diverse and equivalent specifications so that programmers
can create software which has different enough designs that they don't share
similar failure modes. Design diversity and independent failure modes have been
shown to be a particularly difficult problem though, as evidenced in [DeVale99]. The issue still remains that for a complex
problem, the need for humans to solve that problem error free is not easily
Fault tolerance is defined as how to provide, by redundancy, service complying with the specification in spite of faults having occurred or occurring. (Laprie 1996). There are some important concepts buried within the text of this definition that should be examined. Primarily, Laprie argues that fault tolerance is accomplished using redundancy. This argument is good for errors which are not caused by design faults, however, replicating a design fault in multiple places will not aide in complying with a specification. It is also important to note the emphasis placed on the specification as the final arbiter of what is an error and what is not. Design diversity increases pressure on the specification creators to make multiple variants of the same specification which are equivalent in order to aide the programmer in creating variations in algorithms for the necessary redundancy. The definition itself may no longer be appropriate for the type of problems that current fault tolerance is trying to solve, both hardware and software.
Randell argues that the difference between fault tolerance versus exception handling is that exception handling deviates from the specification and fault tolerance attempts to provide services compliant with the specification after detecting a fault. [Lyu95] This is an important difference to realize between trying to construct robust software versus trying to construct reliable software. Reliable software will accomplish its task under adverse conditions while robust software will be able to indicate a failure correctly, (hopefully without the entire system failing.) (For more information on the definitions and differences between reliability, robustness, and fault masking see traditional reliability.)
Software fault tolerance is mostly based on traditional hardware fault tolerance. N-version programming closely parallels N-way redundancy in the hardware fault tolerance paradigm. Recovery blocks, are modeled after what Randell discovered was the current ad hoc method being employed in safety critical software. [Lyu95] The ad hoc method used the concept of retrying the same operation in hopes that the problem would be resolved when the second try occurred. Traditional hardware fault tolerance tried to solve a few common problems which plagued earlier computer hardware. One of the largest problems facing computer hardware was and may still be, manufacturing faults. Another common hardware problem, whose sources may be very diverse is transient faults. These two types of faults can generally be effectively guarded against using redundant hardware of the same type, however, redundant hardware of the same type will not mask a design fault. Recovery blocks may be a good solution to transient faults, however, it faces the same inherent problem that N-version programming does in that they do not offer (sufficient) protection against design faults. (It is possible for a limited class of design faults to be recovered from using distributed N-version programming. Memory leaks, which are a design fault, can cause a local heap to grow beyond the limits of its computer system. Using distributed N-version programming or one of its variants, it is possible that distributed heaps could run out of memory at different times and still be consistent with respect to valid data state.) [Murray98]
Software faults are most often caused by design faults. Design faults occur when a designer, (in this case a programmer,) either misunderstands a specification or simply makes a mistake. Software faults are common for the simple reason that the complexity in modern systems is often pushed into the software part of the system. It is estimated that 60-90% of current computer errors are from software faults. [Gray91] Software faults may also occur from hardware; these faults are usually transitory in nature, and can be masked using a combination of current software and hardware fault tolerance techniques.
The current assumption is that software cannot be made without bugs. This
assumption may be mostly true, but software does not have to be as traditional
buggy as it is now. For example, the Tandem Guardian 90 operating system showed
that for all of installed field systems, that for a period of less than a year,
(specific time period not given,) that a total of 200 errors were reported; 179
of those error were attributed to software faults. The Tandem data shows that
it is not necessary for software to be inherently buggy, however, the cost and
overhead for replicated processes and the time and effort spent on making
software correct must be taken into account.[Lee93]
Metrics in the area of software fault tolerance, (or software faults,) are generally pretty poor. The data sets that have been analyzed in the past are surely not indicative of today's large and complex software systems. The analysis by [DeVale99] of various POSIX systems has the largest applicable data set found in the literature. Some of the advantages of the [DeVale99] research are the fact that the systems are commercially developed, the systems adhere to the same specification, and the systems are large enough that testing them shows an array of problems. Still, the [DeVale99] research is not without its problems; operating systems may be a more unique case than application software; operating systems may share more heritage from projects like Berkeley's Unix or the Open Software Foundation's research projects. The issue with gathering good metrics data is the cost involved in developing multiple versions of complex robust software. Operating systems offer the advantage of many organizations building their own versions of this complex software.
The results of the [DeVale99] and [Knight86] research show that software errors may be correlated in N-version software systems. The results of these studies imply that the failure mode for programmers is not unique, destroying a major tenant of the N-version software fault tolerance technique. It is important to remember however, the the [Knight86] research, like most other publications in this area, are case studies, and may not be an in-depth study across enough variety of software systems to be a conclusive result.
Software fault tolerance has an extreme lack of tools in order to aide the
programmer in making reliable system. This lack of adequate tools is not very
different from the general lack of functional tools in software development
that go beyond an editor and a compiler. The ability to semi-automate the
adding of fault tolerance into software would be a significant enhancement to
the market today. One of the biggest issues facing the development of software
fault tolerant systems is the cost currently required to develop these systems.
Enhanced and functional tools, that can easily accomplish their task, would
surely be welcomed in the market place.
The recovery block method is a simple method developed by Randell from what was observed as somewhat current practice at the time. [Lyu95] The recovery block operates with an adjudicator which confirms the results of various implementations of the same algorithm. In a system with recovery blocks, the system view is broken down into fault recoverable blocks. The entire system is constructed of these fault tolerant blocks. Each block contains at least a primary, secondary, and exceptional case code along with an adjudicator. (It is important to note that this definition can be recursive, and that any component may be composed of another fault tolerant block composed of primary, secondary, exceptional case, and adjudicator components.) The adjudicator is the component which determines the correctness of the various blocks to try. The adjudicator should be kept somewhat simple in order to maintain execution speed and aide in correctness. Upon first entering a unit, the adjudicator first executes the primary alternate. (There may be N alternates in a unit which the adjudicator may try.) If the adjudicator determines that the primary block failed, it then tries to roll back the state of the system and tries the secondary alternate. If the adjudicator does not accept the results of any of the alternates, it then invokes the exception handler, which then indicates the fact that the software could not perform the requested operation.
Recovery block operation still has the same dependency which most software fault tolerance systems have: design diversity. The recovery block method increases the pressure on the specification to be specific enough to create different multiple alternatives that are functionally the same. This issue is further discussed in the context of the N-version method.
The recovery block system is also complicated by the fact that it requires the ability to roll back the state of the system from trying an alternate. This may be accomplished in a variety of ways, including hardware support for these operations. This try and rollback ability has the effect of making the software to appear extremely transactional, in which only after a transaction is accepted is it committed to the system. There are advantages to a system built with a transactional nature, the largest of which is the difficult nature of getting such a system into an incorrect or unstable state. This property, in combination with checkpointing and recovery may aide in constructing a distributed hardware fault tolerant system.
The N-version software concept attempts to parallel the traditional hardware fault tolerance concept of N-way redundant hardware. In an N-version software system, each module is made with up to N different implementations. Each variant accomplishes the same task, but hopefully in a different way. Each version then submits its answer to voter or decider which determines the correct answer, (hopefully, all versions were the same and correct,) and returns that as the result of the module. This system can hopefully overcome the design faults present in most software by relying upon the design diversity concept. An important distinction in N-version software is the fact that the system could include multiple types of hardware using multiple versions of software. The goal is to increase the diversity in order to avoid common mode failures. Using N-version software, it is encouraged that each different version be implemented in as diverse a manner as possible, including different tool sets, different programming languages, and possibly different environments. The various development groups must have as little interaction related to the programming between them as possible. [Avizienis85] N-version software can only be successful and successfully tolerate faults if the required design diversity is met.
The dependence on appropriate specifications in N-version software, (and recovery blocks,) can not be stressed enough. The delicate balance required by the N-version software method requires that a specification be specific enough so that the various versions are completely inter-operable, so that a software decider may choose equally between them, but cannot be so limiting that the software programmers do not have enough freedom to create diverse designs. The flexibility in the specification to encourage design diversity, yet maintain the compatibility between versions is a difficult task, however, most current software fault tolerance methods rely on this delicate balance in the specification.
The N-version method presents the possibility of various faults being generated, but successfully masked and ignored within the system. It is important, however, to detect and correct these faults before they become errors. First, the classification of faults applied to N-version software method: if only a single version in an N-version system, the error is classified as a simplex fault. If M versions within an N-version system have faults, the the fault is declared to be an M-plex fault. M-plex faults are further classified into two classes of faults of related and independent types. Detecting, classifying, and correcting faults is an important task in any fault tolerant system for long term correct operation.
The differences between the recovery block method and the N-version method are not too numerous, but they are important. In traditional recovery blocks, each alternative would be executed serially until an acceptable solution is found as determined by the adjudicator. The recovery block method has been extended to include concurrent execution of the various alternatives. The N-version method has always been designed to be implemented using N-way hardware concurrently. In a serial retry system, the cost in time of trying multiple alternatives may be too expensive, especially for a real-time system. Conversely, concurrent systems require the expense of N-way hardware and a communications network to connect them. Another important difference in the two methods is the difference between an adjudicator and the decider. The recovery block method requires that each module build a specific adjudicator; in the N-version method, a single decider may be used. The recovery block method, assuming that the programmer can create a sufficiently simple adjudicator, will create a system which is difficult to enter into an incorrect state. The engineering tradeoffs, especially monetary costs, involved with developing either type of system have their advantages and disadvantages, and it is important for the engineer to explore the space to decide on what the best solution for his project is.
Self checking software is not a rigorously described method in the literature, but rather a more ad hoc method used in some important systems. [Lyu95] Self-checking software has been implemented in some extremely reliable and safety-critical systems already deployed in our society, including the Lucent ESS-5 phone switch and the Airbus A-340 airplanes. [Lyu95]
Self-checking software are the extra checks, often including some amount checkpointing and rollback recovery methods added into fault-tolerant or safety critical systems. Other methods including separate tasks that "walk" the heap finding and correcting data defects and the options of using degraded performance algorithms. While self-checking may not be a rigorous methodology, it has shown to be surprisingly effective.
The obvious problem with self-checking software is its lack of rigor. Code
coverage for a fault tolerant system is unknown. Furthermore, just how reliable
a system made with self-checking software? Without the proper rigor and
experiments comparing and improving self-checking software cannot effectively
Without software fault tolerance, it is
generally not possible to make a truly fault tolerant system. As previously
mentioned, it is estimated that 60-90% of current failures are software
failures. This means, that a larger focus on software reliability and fault
tolerance is necessary in order to ensure a fault tolerant system.
An ultra-fault tolerant system needs
software fault tolerance in order to create a system that is ultra-reliable.
These systems are very necessary for missions in which the system may not be
accessible. For example, space missions, or very deep undersea communications
systems, are not easily accessible. These missions require systems whose
reliability ensures that the system will operate throughout its mission life.
Current software fault tolerance is
based on traditional hardware fault tolerance, (for better or worse.) Both
hardware and software fault tolerance are beginning to face the new class of
problems of dealing with design faults. Hardware designers will soon face how
to create a microprocessor that effectively uses one billion transistors; as
part of that daunting task, making the microprocessor correct becomes more
challenging. In the future, hardware and software may cooperate more in
achieving fault tolerance for the system as a whole.
Software methodology may be one of the
best ways to build in software fault tolerance. Building correct software would
make large strides in system dependability. Using a system that is mostly
correct, with some more simple fault tolerance techniques may be the best
system solution in the future.
The view that software has to have bugs will
have to be conquered. If software cannot be made (at least relatively,) bug
free then the next generation of safety critical systems will be very flawed.
Reliable computing systems, often used for transaction servers, made by
companies like Tandem, Stratos, and IBM, have shown that reliable computers can
currently be made, however, they have also demonstrated that the cost is
Currently, the technologies used in these
systems do not appear to scale well for the embedded market place. The solution
may be that a networked world is indeed a better solution, in that reliable
systems with humans watching over them, may be the final solution, and that
ubiquitous networking to these reliable systems may solve the embedded fault
tolerance issue. Another possible panacea is the evolving application of
degraded performance. While degraded performance may not be the ultimate
solution, or acceptable in all cases, by limiting the amount of complexity
necessary, it may go a long way toward being able to create correct and fault
tolerant software. In the end, a solution that is cost effective enough to be
applied to the embedded world of computing systems is in dire need. As today's
common appliances, including automobiles, become increasingly computer
automated and relied upon by society, software fault tolerance becomes more
N. Storey, Safety-Critical Computer
Systems. Harlow, England: Addison-Wesley, 1996.
Good introductory information on safety-critical computers.
M. R. Lyu,
ed., Software Fault Tolerance Chichester, England: John Wiley and Sons,
The authoritative book on the subject of software fault tolerance written by the experts in the field.
P. Murray, R. Fleming, P. Harry, and P. Vickers,
"Somersault Software Fault-Tolerance,"
whitepaper, Palo Alto, California, 1998.
An interesting paper on distributed rollback and recovery. It mentions an single interesting possibility of fault tolerance.
Gray and D. P. Siewiorek, "High-Availability Computer Systems,"
IEEE Computer, 24(9):39-48, September 1991.
A good discussion of the number of software failures occuring in today's high-reliability systems.
J. C. Knight and N. G. Leveson, "An Experimental
Evaluation of the Assumption of Independence in Multi-version
Programming", IEEE Transactions on Software Engineering, Vol.
SE-12, No. 1 (January 1986), pp. 96-109.
The original work on disputing the results that N-version programming works.
A. Avizeinis, "The N-Version Approach to
Fault-Tolerant Software", IEEE Transactions of Software
Engineering, Vol. SE-11, No. 12 (December 1985), pp. 1491-1501.
A paper describing N-version programming written by the original creator of the concept. A good in depth discussion of the concept and how to apply it.
I. Lee and R. K. Iyer, "Faults, Symptoms, and
Software Fault Tolerance in the Tandem GUARDIAN90 Operating System", IEEE
1993, pp. 20-29.
Presentation of good quality commericial data of on an operating system that is supposed to be one of the most fault tolerant. It supports the view that most of the problems in highly available/reliable computers are the software.
Go To Project Page