Software Fault Tolerance

Carnegie Mellon University
18-849b Dependable Embedded Systems
Spring 1998
Author: Chris Inacio


Complex safety critical systems currently being designed and built are often difficult, multi-disciplinary undertakings. Part of these systems is often a computer control system. To ensure that these systems perform as specified, even under extreme conditions, it is important to have a fault tolerant computing system, in both hardware and software. Current methods for software fault tolerance include recovery blocks, N-version programming, and self-checking software. Through the rest of this discourse on software fault tolerance, we will describe the nature of the software problem, discuss the current methodologies for solving these problems, and conclude with some thoughts on future research directions.


Software fault tolerance is the ability of software to detect and recover from a fault that is happening or has already happened, in either the software or the hardware of the system on which the software is running, in order to provide service in accordance with the specification. Software fault tolerance is a necessary component in constructing the next generation of highly available and reliable computing systems, from embedded systems to data warehouse systems. Software fault tolerance is not a solution unto itself, however; it is just one piece necessary to create the next generation of systems.


In order to adequately understand software fault tolerance, it is important to understand the nature of the problem that software fault tolerance is supposed to solve. Software faults are all design faults. [Storey96] Software manufacturing, the reproduction of software, is considered to be perfect. That the source of the problem is solely design faults makes software very different from almost any other system in which fault tolerance is a desired property. This inherent issue, that software faults are the result of human error in interpreting a specification or correctly implementing an algorithm, raises problems which must be dealt with in the fundamental approach to software fault tolerance.

Current software fault tolerance methods are based on traditional hardware fault tolerance. The deficiency of this approach is that traditional hardware fault tolerance was designed to conquer manufacturing faults primarily, and environmental and other faults secondarily. Design diversity was not a concept applied to hardware fault tolerance solutions; N-way redundant systems masked many single faults by replicating the same hardware. Software fault tolerance tries to leverage the experience of hardware fault tolerance to solve a different problem, but in doing so creates a need for design diversity in order to properly create a redundant system.

Design diversity is a solution to software fault tolerance only insofar as it is possible to create diverse and equivalent specifications, so that programmers can create software with designs different enough that they do not share similar failure modes. Achieving design diversity and independent failure modes has proven to be a particularly difficult problem, however, as evidenced in [DeVale99]. The underlying issue remains: for a complex problem, humans cannot easily produce an error-free solution.

Key Concepts

Fault Tolerance

Fault tolerance is defined as how to provide, by redundancy, service complying with the specification in spite of faults having occurred or occurring (Laprie 1996). There are some important concepts buried within the text of this definition that should be examined. Primarily, Laprie argues that fault tolerance is accomplished using redundancy. This argument holds for errors which are not caused by design faults; however, replicating a design fault in multiple places will not aid in complying with a specification. It is also important to note the emphasis placed on the specification as the final arbiter of what is an error and what is not. Design diversity increases pressure on the specification creators to produce multiple variants of the same specification which are equivalent, in order to aid the programmers in creating the algorithmic variations needed for redundancy. The definition itself may no longer be appropriate for the type of problems that current fault tolerance, both hardware and software, is trying to solve.

Randell argues that the difference between fault tolerance and exception handling is that exception handling deviates from the specification, whereas fault tolerance attempts to provide services compliant with the specification after detecting a fault. [Lyu95] This is an important difference to realize between trying to construct robust software and trying to construct reliable software. Reliable software will accomplish its task under adverse conditions, while robust software will be able to indicate a failure correctly (hopefully without the entire system failing). (For more information on the definitions of and differences between reliability, robustness, and fault masking, see traditional reliability.)

Software Fault Tolerance, The History

Software fault tolerance is mostly based on traditional hardware fault tolerance. N-version programming closely parallels N-way redundancy in the hardware fault tolerance paradigm. Recovery blocks are modeled after what Randell observed was the ad hoc method then being employed in safety critical software: retrying the same operation in the hope that the problem would be resolved on the second try. [Lyu95] Traditional hardware fault tolerance tried to solve a few common problems which plagued earlier computer hardware. One of the largest problems facing computer hardware was, and may still be, manufacturing faults. Another common hardware problem, whose sources may be very diverse, is transient faults. These two types of faults can generally be guarded against effectively using redundant hardware of the same type; however, redundant hardware of the same type will not mask a design fault. Recovery blocks may be a good solution to transient faults, but they face the same inherent problem as N-version programming: neither offers (sufficient) protection against design faults. (A limited class of design faults can be recovered from using distributed N-version programming. Memory leaks, which are a design fault, can cause a local heap to grow beyond the limits of its computer system. Using distributed N-version programming or one of its variants, it is possible that distributed heaps could run out of memory at different times and still be consistent with respect to valid data state.) [Murray98]

Software Errors (Bugs)

Software faults are most often caused by design faults. Design faults occur when a designer (in this case, a programmer) either misunderstands a specification or simply makes a mistake. Software faults are common for the simple reason that the complexity of modern systems is often pushed into the software part of the system. It is estimated that 60-90% of current computer errors are from software faults. [Gray91] Software faults may also arise from hardware; these faults are usually transitory in nature, and can be masked using a combination of current software and hardware fault tolerance techniques.

The current assumption is that software cannot be made without bugs. This assumption may be mostly true, but software does not have to be as traditionally buggy as it is now. For example, the Tandem Guardian 90 operating system showed that, across all installed field systems over a period of less than a year (the specific time period is not given), a total of 200 errors were reported; 179 of those errors were attributed to software faults. The Tandem data shows that it is not necessary for software to be inherently buggy; however, the cost and overhead of replicated processes, and the time and effort spent making the software correct, must be taken into account. [Lee93]


Tools, Techniques, and Metrics


Metrics in the area of software fault tolerance (or software faults) are generally poor. The data sets that have been analyzed in the past are surely not indicative of today's large and complex software systems. The analysis by [DeVale99] of various POSIX systems has the largest applicable data set found in the literature. Some of the advantages of the [DeVale99] research are that the systems are commercially developed, that they adhere to the same specification, and that they are large enough that testing them exposes an array of problems. Still, the [DeVale99] research is not without its problems: operating systems may be a special case compared to application software, since operating systems may share more heritage from projects like Berkeley's Unix or the Open Software Foundation's research projects. The difficulty in gathering good metrics data is the cost involved in developing multiple versions of complex, robust software. Operating systems offer the advantage that many organizations build their own versions of this complex software.

The results of the [DeVale99] and [Knight86] research show that software errors may be correlated in N-version software systems. The results of these studies imply that the failure modes of programmers are not unique, undermining a major tenet of the N-version software fault tolerance technique. It is important to remember, however, that the [Knight86] research, like most other publications in this area, is a case study, and may not span enough variety of software systems to be a conclusive result.


Software fault tolerance has an extreme lack of tools to aid the programmer in making reliable systems. This lack of adequate tools is not very different from the general lack of functional tools in software development beyond an editor and a compiler. The ability to semi-automate the addition of fault tolerance to software would be a significant enhancement to the market today. One of the biggest issues facing the development of software fault tolerant systems is the cost currently required to develop them. Enhanced and functional tools that can easily accomplish their task would surely be welcomed in the marketplace.


Recovery Blocks

The recovery block method is a simple method developed by Randell from what was observed as somewhat current practice at the time. [Lyu95] The recovery block operates with an adjudicator, which confirms the results of various implementations of the same algorithm. In a system with recovery blocks, the system view is broken down into fault recoverable blocks, and the entire system is constructed of these fault tolerant blocks. Each block contains at least a primary, a secondary, and exceptional case code, along with an adjudicator. (It is important to note that this definition can be recursive: any component may itself be composed of a fault tolerant block with primary, secondary, exceptional case, and adjudicator components.) The adjudicator is the component which determines the correctness of the various alternates; it should be kept somewhat simple in order to maintain execution speed and aid in correctness. Upon entering a unit, the adjudicator executes the primary alternate. (There may be N alternates in a unit for the adjudicator to try.) If the adjudicator determines that the primary block failed, it rolls back the state of the system and tries the secondary alternate. If the adjudicator does not accept the results of any of the alternates, it invokes the exception handler, which indicates that the software could not perform the requested operation.
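The control flow described above can be sketched as follows. This is a minimal illustration, not Randell's original formulation: the names (`recovery_block`, `acceptance_test`) and the deep-copy checkpointing are assumptions made for the example.

```python
import copy

def recovery_block(state, alternates, acceptance_test):
    """Try each alternate in order; roll back state between failed attempts."""
    for alternate in alternates:
        checkpoint = copy.deepcopy(state)       # save state before the attempt
        try:
            result = alternate(state)
            if acceptance_test(result):         # the adjudicator accepts
                return result
        except Exception:
            pass                                # treat a crash as a failed attempt
        state.clear()
        state.update(checkpoint)                # roll back to the checkpoint
    raise RuntimeError("all alternates failed") # exceptional case: signal failure

# Hypothetical alternates: the primary has a seeded design fault,
# the secondary is a simpler variant of the same computation.
def primary(state):
    return state["x"] / 0                       # faulty alternate (crashes)

def secondary(state):
    return state["x"] + 1                       # correct alternate

print(recovery_block({"x": 41}, [primary, secondary], lambda r: r == 42))  # 42
```

Note how the failed primary leaves no trace: its effects are discarded by the rollback, which is what gives recovery blocks their transactional character.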

Recovery block operation still has the same dependency which most software fault tolerance systems have: design diversity. The recovery block method increases the pressure on the specification to be specific enough to create different multiple alternatives that are functionally the same. This issue is further discussed in the context of the N-version method.

The recovery block system is also complicated by the fact that it requires the ability to roll back the state of the system after trying an alternate. This may be accomplished in a variety of ways, including hardware support for these operations. This try-and-rollback ability makes the software appear extremely transactional: only after a transaction is accepted is it committed to the system. There are advantages to a system built with a transactional nature, the largest being that it is difficult to get such a system into an incorrect or unstable state. This property, in combination with checkpointing and recovery, may aid in constructing a distributed hardware fault tolerant system.

N-version Software

The N-version software concept attempts to parallel the traditional hardware fault tolerance concept of N-way redundant hardware. In an N-version software system, each module is made with up to N different implementations. Each variant accomplishes the same task, but hopefully in a different way. Each version then submits its answer to a voter or decider, which determines the correct answer (hopefully, all versions were the same and correct) and returns it as the result of the module. This system can hopefully overcome the design faults present in most software by relying upon the design diversity concept. An important distinction in N-version software is that the system could include multiple types of hardware running multiple versions of software. The goal is to increase the diversity in order to avoid common mode failures. Using N-version software, it is encouraged that each version be implemented in as diverse a manner as possible, including different tool sets, different programming languages, and possibly different environments. The various development groups must have as little programming-related interaction between them as possible. [Avizienis85] N-version software can successfully tolerate faults only if the required design diversity is achieved.
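The voting step above can be sketched as a simple majority decider. This is an illustrative assumption, not a prescribed implementation: real deciders must also handle inexact matches (e.g. floating-point results), which this sketch ignores.

```python
from collections import Counter

def n_version(inputs, versions):
    """Run N independently developed variants and majority-vote the results."""
    results = [version(inputs) for version in versions]
    value, count = Counter(results).most_common(1)[0]
    if count >= len(versions) // 2 + 1:     # require a strict majority
        return value
    raise RuntimeError("no majority: versions disagree")

# Three hypothetical variants of one specification ("sum the inputs");
# the third contains a seeded design fault.
v1 = lambda xs: sum(xs)
v2 = lambda xs: xs[0] + sum(xs[1:])
v3 = lambda xs: max(xs)                     # faulty variant

print(n_version([1, 2, 3], [v1, v2, v3]))   # 6 (fault masked by the 2-of-3 vote)
```

The sketch also shows the method's weakness: if the versions' failures are correlated, a wrong answer can win the vote just as easily as a right one.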

The dependence on appropriate specifications in N-version software (and recovery blocks) cannot be stressed enough. The delicate balance required by the N-version software method demands a specification that is specific enough that the various versions are completely inter-operable, so that a software decider may choose equally between them, yet not so limiting that the programmers lack the freedom to create diverse designs. Keeping the specification flexible enough to encourage design diversity while maintaining compatibility between versions is a difficult task; however, most current software fault tolerance methods rely on this delicate balance in the specification.

The N-version method presents the possibility of various faults being generated, but successfully masked and ignored within the system. It is important, however, to detect and correct these faults before they become errors. First, the classification of faults applied to the N-version software method: if only a single version in an N-version system has a fault, it is classified as a simplex fault. If M versions within an N-version system have faults, the fault is declared an M-plex fault. M-plex faults are further classified into two classes: related and independent faults. Detecting, classifying, and correcting faults is an important task in any fault tolerant system for long-term correct operation.

The differences between the recovery block method and the N-version method are not numerous, but they are important. In traditional recovery blocks, each alternative is executed serially until an acceptable solution is found, as determined by the adjudicator. The recovery block method has been extended to include concurrent execution of the various alternatives, while the N-version method has always been designed to run concurrently on N-way hardware. In a serial retry system, the time cost of trying multiple alternatives may be too expensive, especially for a real-time system; conversely, concurrent systems require the expense of N-way hardware and a communications network to connect it. Another important difference between the two methods is the difference between the adjudicator and the decider. The recovery block method requires that each module have a specific adjudicator, whereas in the N-version method a single decider may be used. The recovery block method, assuming that the programmer can create a sufficiently simple adjudicator, will create a system which is difficult to enter into an incorrect state. The engineering tradeoffs, especially monetary costs, involved in developing either type of system have their advantages and disadvantages, and it is important for engineers to explore the design space and decide on the best solution for their project.

Self-Checking Software

Self-checking software is not a rigorously described method in the literature, but rather a more ad hoc method used in some important systems. [Lyu95] Self-checking software has been implemented in some extremely reliable and safety-critical systems already deployed in our society, including the Lucent 5ESS phone switch and the Airbus A340 airplanes. [Lyu95]

Self-checking software consists of extra checks, often including some amount of checkpointing and rollback recovery, added into fault-tolerant or safety critical systems. Other methods include separate tasks that "walk" the heap, finding and correcting data defects, and the option of using degraded-performance algorithms. While self-checking may not be a rigorous methodology, it has proven to be surprisingly effective.
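The heap-walking audit task mentioned above can be sketched as follows. The function name, data layout, and invariant are all hypothetical, chosen only to illustrate the pattern of a background task that detects and repairs data defects:

```python
def audit_and_repair(table):
    """Walk a data structure, removing entries that violate the invariant
    that every stored value must be non-negative. Returns the defect count."""
    defects = [key for key, value in table.items() if value < 0]
    for key in defects:
        del table[key]                      # repair by discarding corrupt entries
    return len(defects)

# Simulate a corrupted entry ("b") in an otherwise healthy table.
table = {"a": 1, "b": -5, "c": 3}
repaired = audit_and_repair(table)
print(repaired, sorted(table))              # 1 ['a', 'c']
```

In a deployed system such an audit would run periodically as its own task; the essential idea is the same: the program carries an explicit statement of its own invariants and actively enforces them.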

The obvious problem with self-checking software is its lack of rigor. Code coverage for such a fault tolerant system is unknown. Furthermore, just how reliable is a system made with self-checking software? Without proper rigor and experiments, comparing and improving self-checking software cannot be done effectively.

Relationship to other topics

Fault Tolerance


Without software fault tolerance, it is generally not possible to make a truly fault tolerant system. As previously mentioned, it is estimated that 60-90% of current failures are software failures. This means that a larger focus on software reliability and fault tolerance is necessary in order to ensure a fault tolerant system.

Ultra Fault Tolerance


An ultra-fault tolerant system needs software fault tolerance in order to be ultra-reliable. Such systems are necessary for missions in which the system may not be accessible, for example space missions or very deep undersea communications systems. These missions require systems whose reliability ensures that the system will operate throughout its mission life.

Hardware Fault Tolerance


Current software fault tolerance is based on traditional hardware fault tolerance, for better or worse. Both hardware and software fault tolerance are beginning to face a new class of problems: dealing with design faults. Hardware designers will soon face the task of creating a microprocessor that effectively uses one billion transistors; as part of that daunting task, making the microprocessor correct becomes more challenging. In the future, hardware and software may cooperate more in achieving fault tolerance for the system as a whole.

Software Methodology


Software methodology may be one of the best ways to build in software fault tolerance. Building correct software would make large strides in system dependability. A system that is mostly correct, combined with some simpler fault tolerance techniques, may be the best solution in the future.


Software fault tolerance is an immature area of research. As more and more complex systems are designed and built, especially safety critical systems, software fault tolerance and the next generation of hardware fault tolerance will need to evolve to solve the design fault problem. Unlike fault tolerance practiced in any other field, the necessity of designing fault tolerance into the system for design faults and unexpected circumstances has never been greater. The current generation of software fault tolerance methods cannot adequately compensate for these faults. Part of the next generation of software fault tolerance methods will have to include an in-depth look at how to combat the increasing cost of building correct software. It will be necessary for the next generation of fault tolerance methods to be cost effective enough to be applied to the safety critical systems that will require them.


The view that software has to have bugs will have to be conquered. If software cannot be made (at least relatively) bug-free, then the next generation of safety critical systems will be very flawed. Reliable computing systems, often used as transaction servers, made by companies like Tandem, Stratus, and IBM, have shown that reliable computers can currently be made; however, they have also demonstrated that the cost is significant.

Currently, the technologies used in these systems do not appear to scale well to the embedded marketplace. One solution may be that a networked world is indeed the better answer: reliable systems with humans watching over them, reached by ubiquitous networking, may solve the embedded fault tolerance issue. Another possible approach is the evolving application of degraded performance. While degraded performance may not be the ultimate solution, or acceptable in all cases, by limiting the amount of complexity necessary it may go a long way toward making it possible to create correct and fault tolerant software. In the end, a solution that is cost effective enough to be applied to the embedded world of computing systems is badly needed. As today's common appliances, including automobiles, become increasingly computer automated and relied upon by society, software fault tolerance becomes ever more necessary.
