Fault injection is a phrase covering a variety of techniques for inducing faults in systems in order to measure their response. It can be used in both electronic hardware systems and software systems to measure the fault tolerance of the system. For hardware, faults can be injected into simulations of the system as well as into the implementation itself, both at the pin or external level and, more recently, at an internal level for some chips. For software, faults can be injected into simulations of software systems, such as distributed systems, or into running software systems, at levels ranging from CPU registers to memory, disk, and network. Fault injection is best used as a means of measuring the fault tolerance or robustness of a system, especially for stress testing a system that may experience faults too infrequently for normal testing to exercise. While the theory behind fault injection is still being developed, the mechanisms are well understood. For an embedded system designer attempting to measure the degree to which a design is resistant to faults, fault injection can be a useful technique for quantifying this aspect of the design.
Fault injection is a testing technique used in computer systems to test both hardware and software. It is the deliberate introduction of faults into a system, and the subsequent examination of the system for the errors and failures that result. It can be performed on simulations and models, on working prototypes, or on systems in the field. Weaknesses in the interactions of a system can be discovered in this manner, but it is a haphazard way of debugging the design errors in a system. Fault injection is better used to test the resilience of a fault-tolerant system against known faults, and thereby measure the effectiveness of its fault tolerance measures.
There are two main issues in fault injection, and it is along these axes that different fault injection techniques may be divided. The first axis is that of simulation versus execution. In the former, a model of the system is developed and faults are introduced into that model; the model is then simulated to find the effects of the fault on the operation of the system. These methods are often slower to test, but easier to change. In the latter, the system itself is deployed, some mechanism is found to cause faults in the system, and its execution is then observed to determine the effects of the fault. These techniques are more useful for analyzing final designs, but the system is typically more difficult to modify afterwards.
The second axis is that of invasive versus non-invasive techniques. The problem with sufficiently complex systems, particularly time-dependent ones, is that it may be impossible to remove the footprint of the testing mechanism from the behavior of the system, independent of the fault injected. For example, a real-time communication protocol that would ordinarily meet a deadline for a particular task might miss it because of the extra latency induced by the fault injection mechanism. Invasive techniques are those which leave behind such a footprint during testing. Non-invasive techniques mask their presence so as to have no effect on the system other than the faults they inject.
Fault injection is still a somewhat new technique, however, and work continues to determine what kinds of systems it can be applied to, and which systems are appropriate to test in this manner. While there are well-understood mechanisms for injecting faults into certain kinds of systems, such as distributed systems, for others, such as VLSI circuits, basic techniques are still being designed. Often the method for inserting faults is very application-specific rather than generalized, which makes comparison of testing methods difficult. Finally, even when results have been gathered, researchers are still uncertain or divided as to exactly what the results mean and how they should be used.
In discussing fault injection, we will treat techniques for hardware and software separately because they are so distinct from one another. Once we have covered some of the details, we will then discuss what fault injection does and does not measure, and what it can be used for.
Hardware fault injection is used to inject faults into hardware and examine the effects. Typically this is performed on VLSI circuits at the transistor level, both because these circuits are complex enough to warrant characterization through fault injection rather than a performance range, and because transistor-level faults are the best understood basic faults in such circuits. Transistors are typically given stuck-at, bridging, or transient faults, and the results are examined in the operation of the circuit. Such faults may be injected into software simulations of the circuits, or into production circuits cut from the wafer.
Hardware simulations typically begin with a high-level description of the circuit. This description is turned into a transistor-level description, and faults are injected into it. Typically these are stuck-at or bridging faults, as software simulation is most often used to determine the response to manufacturing defects. The system is then simulated to evaluate the response of the circuit to that particular fault. Since this is a simulation, a new fault can then be easily injected and the simulation re-run to gauge the response to the new fault. It takes time to construct the model, insert the faults, and simulate the circuit, but modifications to the circuit are easier to make than later in the design cycle. This sort of testing would be used to check a circuit early in the design cycle. These simulations are non-intrusive, since the simulation functions normally apart from the introduction of the fault.
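The idea of injecting a stuck-at fault into a simulated circuit can be sketched in a few lines. The netlist structure and gate names below are illustrative, not taken from any real tool: each gate drives a named net, and a stuck-at fault pins one net to 0 or 1 regardless of what the logic computes, so the fault-free and faulty runs can be compared.

```python
# Minimal gate-level simulation sketch (hypothetical netlist format):
# each entry is (output_net, operation, input_nets).
NETLIST = [
    ("n1", "AND", ("a", "b")),
    ("n2", "OR",  ("n1", "c")),
]

OPS = {"AND": lambda x, y: x & y, "OR": lambda x, y: x | y}

def simulate(inputs, stuck_at=None):
    """Evaluate the netlist; stuck_at=(net, value) forces that net's value."""
    nets = dict(inputs)
    for out, op, (i1, i2) in NETLIST:
        val = OPS[op](nets[i1], nets[i2])
        if stuck_at and out == stuck_at[0]:
            val = stuck_at[1]          # inject the stuck-at fault here
        nets[out] = val
    return nets

# Re-running with a new fault is just another call with a different stuck_at.
good = simulate({"a": 1, "b": 1, "c": 0})
bad = simulate({"a": 1, "b": 1, "c": 0}, stuck_at=("n1", 0))
print(good["n2"], bad["n2"])  # the stuck-at-0 fault on n1 flips the output n2
```

A real tool would derive the netlist from a hardware description and sweep many fault sites and input vectors, but the compare-against-golden-run structure is the same.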
Hardware fault injection experiments are performed on actual examples of the circuit after fabrication. The circuit is subjected to some sort of interference to produce the fault, and the resulting behavior is examined. So far this has been done only with transient faults, as the difficulty and expense of introducing stuck-at and bridging faults into a fabricated circuit has not been overcome. The circuit is attached to a testing apparatus which operates it and examines its behavior after the fault is injected. It takes time to prepare the circuit and test it, but such tests generally proceed faster than simulation. This approach is, rather obviously, used to test circuits just before or in production. These injections are non-intrusive, since they do not alter the behavior of the circuit other than to introduce the fault. Should special circuitry be included to cause or simulate faults in the finished circuit, it would most likely affect the timing or other characteristics of the circuit, and therefore be intrusive.
Software fault injection is used to inject faults into the operation of software and examine the effects. This is generally used on code that has communicative or cooperative functions so that there is enough interaction to make fault injection useful. All sorts of faults may be injected, from register and memory faults, to dropped or replicated network packets, to erroneous error conditions and flags. These faults may be injected into simulations of complex systems where the interactions are understood though not the details of implementation, or they may be injected into operating systems to examine the effects.
Software simulations typically work from a high-level description of a system, in which the protocols or interactions are known but not the details of implementation. The injected faults tend to be mis-timings, missing messages, replays, or other faults in the communication of the system. The simulation is then run to discover the effects of the faults. Because of the abstract nature of these simulations, they may run faster than the actual system would, but they do not necessarily capture the timing behavior of the final system. This sort of testing would be performed to verify a protocol, or to examine the resistance of an interaction to faults. It would typically be done early in the design cycle, so as to flesh out the higher-level details before attempting the task of implementation. These simulations are non-intrusive, as they are simulated, but they may not capture the exact behavior of the system.
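A simulation of this kind can be sketched without any implementation detail at all. The channel and receiver below are illustrative inventions: the "faulty channel" drops or replays messages, the communication faults described above, and the run shows whether a simple sequence-number protocol tolerates them.

```python
import random

random.seed(42)  # deterministic fault pattern for a repeatable experiment

def faulty_channel(messages, drop_prob=0.2, replay_prob=0.2):
    """Deliver messages, injecting dropped and replayed messages at random."""
    delivered = []
    for msg in messages:
        if random.random() < drop_prob:
            continue                     # fault: message silently dropped
        delivered.append(msg)
        if random.random() < replay_prob:
            delivered.append(msg)        # fault: message replayed
    return delivered

def receive(delivered):
    """A receiver that tolerates replays by tracking sequence numbers."""
    seen, accepted = set(), []
    for seq, payload in delivered:
        if seq in seen:
            continue                     # duplicate suppressed by the protocol
        seen.add(seq)
        accepted.append((seq, payload))
    return accepted

sent = [(i, "msg%d" % i) for i in range(10)]
accepted = receive(faulty_channel(sent))
print(len(accepted), "of", len(sent), "unique messages accepted")
```

The simulation verifies the protocol's replay handling (no duplicate is ever accepted) while saying nothing about timing, exactly the trade-off described above.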
Software fault injections into running systems are more oriented towards implementation details, and can address program state as well as communication and interactions. Faults include mis-timings, missing messages, replays, corrupted memory or registers, faulty disk reads, and almost any other state the hardware provides access to. The system is then run with the fault to examine its behavior. These tests tend to take longer because they encapsulate all of the operation and detail of the system, but they more accurately capture its timing aspects. This testing is performed to verify the system's reaction to introduced faults and to catalog the faults successfully dealt with. It is done later in the design cycle to show the performance of a final or near-final design. These tests can be non-intrusive, especially if timing is not a concern, but if timing is at all involved, the time required for the injection mechanism to inject the faults can disrupt the activity of the system and produce timing results that are not representative of the system without the fault injection mechanism deployed. This occurs because the injection mechanism runs on the same system as the software being tested.
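The simplest state-level fault of this kind is a bit flip in memory. The sketch below is illustrative (the buffer, checksum, and function names are invented for this example): it injects a single-bit fault into a byte buffer and checks whether a simple checksum detects the corruption, mirroring the inject-then-observe structure of runtime injection tools.

```python
import random

random.seed(7)  # repeatable fault location

def flip_random_bit(buf):
    """Inject a transient memory fault: flip one random bit in a bytearray."""
    i = random.randrange(len(buf))
    bit = 1 << random.randrange(8)
    buf[i] ^= bit
    return i, bit

def checksum(buf):
    """A toy 8-bit additive checksum standing in for a detection mechanism."""
    return sum(buf) & 0xFF

data = bytearray(b"critical application state")
before = checksum(data)
flip_random_bit(data)                    # the injected fault
after = checksum(data)
print("fault detected:", before != after)
```

Real tools such as those surveyed below do this through CPU traps or driver modifications rather than direct buffer access, which is what makes them intrusive on the system under test.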
The main problem with fault injection is knowing what to do with it. At first glance, it would seem to be a good tool for debugging a system and detecting any flaws within it. Once one examines the procedures and the information gained, however, it becomes apparent that fault injection is good at testing known sorts of bugs or defects, but poor at testing novel faults or defects, which are precisely the sorts of defects we would want to discover. What emerges, therefore, is that fault injection is not really suited for debugging and improving the system so much as it is suited for testing the fault tolerant features of the system.[Voas95] A known fault is injected and the results are examined to see if the system can respond correctly despite the fault.
Along these lines, there are two proposed uses for fault injection. The first is verification of a system. If a system is designed to tolerate a certain class of faults, or to exhibit certain behavior in the presence of certain faults, then these faults can be directly injected into the system to examine their effects. The system will either behave appropriately or not, and its fault tolerance is measured accordingly.[Arlat] For certain classes of ultra-dependable systems in which the occurrence of errors is too infrequent to effectively test the system in the field, fault injection can be a powerful tool for accelerating the occurrence of faults in the system and verifying that the system works properly.
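Verification of a stated fault tolerance claim can be made concrete with a small, invented example: a triple-modular-redundancy voter is claimed to tolerate any single faulty replica, so the harness below injects every fault in that class and checks the claim directly. The function names are illustrative, not from any cited tool.

```python
def vote(a, b, c):
    """Majority voter over three redundant replicas."""
    return (a & b) | (b & c) | (a & c)

def verify_single_fault_tolerance(correct):
    """Inject a wrong value into each replica in turn; the claimed fault
    class is 'any single replica fails', so we enumerate it exhaustively."""
    wrong = 1 - correct
    for faulty in range(3):
        replicas = [correct] * 3
        replicas[faulty] = wrong         # the injected fault
        if vote(*replicas) != correct:
            return False                 # claim falsified by this fault
    return True

ok = verify_single_fault_tolerance(1) and verify_single_fault_tolerance(0)
print("tolerates all single faults:", ok)
```

Because the fault class is known in advance, the test is exhaustive over it; this is exactly the regime where fault injection verifies rather than debugs.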
The other proposed use for fault injection is less well understood, because the problem it addresses is poorly understood. Robustness is used in regard to systems these days almost synonymously with fault tolerance, but robustness actually embraces more than this. There is no really good definition of robustness, but it is something along the lines of "the capability of a system to behave correctly in unusual conditions." The difficulty lies in creating unusual conditions so as to test the system for robustness. Fault injection has been proposed as a method to address this problem, by including unusual conditions as well as faults. This would provide us with a metric for measuring the robustness of a system.
There are two difficulties that must be addressed before this use of fault injection can be fully applied. The first is the disparate nature of systems, and the ways in which they can fail or experience faults. Unless two systems are set to accomplish the exact same task, determining their relative robustness is a difficult task. A good metric for robustness would be able to resolve this difference. Secondly, it is not yet certain how our metric should be biased. Common practice is to have the test distribution mirror the real-world distribution of occurrence of faults. If we are truly testing the system's response to unusual situations, however, it might be better to bias the test towards the less frequently encountered conditions.[Voas93] While there is agreement that fault injection can serve as a metric for robustness, the exact mechanisms of doing so are still poorly understood.
For hardware simulation, most tools take a hardware specification and inject faults into it for simulation. One such tool is MEFISTO, which injects faults into VHDL descriptions of circuits and simulates them.[Jenn] It takes advantage of the manner in which systems are specified in VHDL to alter signals and values in the circuit.
For hardware execution, several types of tools exist. One is a pin-level tester, which manipulates the voltages at the pins in order to induce faulty or unusual conditions. The MESSALINE project is an example of this sort of testing regime.[Arlat] Pin-level testing provides only limited reach into the internals of the circuit, however, so several non-contact methods also exist to induce faults in the interior of the chip. The FIST project used heavy-ion radiation to inject random transient faults into the interior of a chip for testing. The MARS project extended this with electromagnetic fields to create faults in the interior of the chip. These methods, however, tend to produce random rather than targeted faults.[Hsueh] A third, very new method of fault injection addresses this concern: Laser Fault Injection (LFI) uses a laser to inject faults precisely into the interior of the chip at specific times.[Samson] This allows a higher level of control and a much better data set than the other two methods.
Most of the software tools that exist are for testing actual systems, not simulations. This is probably due to the difficulty of correctly capturing high-level behavior before the implementation is finished, and the relative ease of inserting faults into running systems through the debugging facilities provided by modern hardware. FERRARI is a testing system that introduces CPU, memory, and bus faults through CPU traps during normal execution. FTAPE is a system that introduces CPU, memory, and disk faults through altered drivers and OS modifications.[Hsueh] DOCTOR is a tool for introducing faults into a distributed real-time system under synthetic workloads, introducing CPU, memory, and network faults through time-outs, traps, and code modification.[Han] Xception causes multiple sorts of faults through hardware exception triggers. ORCHESTRA is a distributed system testbed that tests protocols by inserting faults through a Fault Injection Layer introduced between the protocol and the communication layer beneath it.[Hsueh] All of these methods are invasive, because they operate on the same system as the software they test.
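The layered approach attributed to ORCHESTRA can be sketched generically: a wrapper sits between the protocol and the transport and applies scripted fault actions to each message. Everything below (class name, script format, the list standing in for a transport) is an illustrative invention, not ORCHESTRA's actual interface.

```python
class FaultInjectionLayer:
    """A scripted shim between a protocol and its communication layer."""

    def __init__(self, transport, script):
        self.transport = transport       # underlying "communication layer"
        self.script = script             # fault action keyed by message index
        self.count = 0

    def send(self, msg):
        action = self.script.get(self.count, "pass")
        self.count += 1
        if action == "drop":
            return                       # fault: message never reaches transport
        if action == "corrupt":
            msg = msg[:-1] + b"?"        # fault: payload corrupted in flight
        self.transport.append(msg)

wire = []                                # the transport is just a list here
layer = FaultInjectionLayer(wire, script={1: "drop", 2: "corrupt"})
for m in [b"hello", b"world", b"again"]:
    layer.send(m)
print(wire)  # → [b'hello', b'agai?']
```

Because the shim sees every message, faults can be targeted deterministically at specific protocol events, which is what distinguishes this design from random interference at the hardware level.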
Fault injection has been proposed as a possible metric for all of the above properties of a system and its software.
Fault injection can be used to accelerate testing of a system in which the normal occurrence of faults is too sparse to permit proper testing.
Fault injection can be used to show that a system does prevent certain faults from becoming hazards.
Fault injection has been proposed as a step towards safety or quality verification or certification.
Faults often manifest as exceptions in the system, and fault injection is perhaps the best method to directly test the exception handling of a system, by injecting exceptions.
At this point the mechanisms for fault injection are fairly well understood. More investigation is needed into accurate fault injection of executing hardware, and into new methods of injecting faults into software systems, but the basics of operation are already known. What is necessary now is to disseminate the knowledge beyond the researchers whose focus is fault injection, and to find uses for it in general testing practice. There are certainly systems in existence whose testing could be improved by the addition of fault injection, but testing tools may be platform-specific, or not easily adaptable to a given process. The development of more adaptable tools would aid in this effort.
In addition, the proper use of fault injection as a measurement or verification tool, rather than a debugging tool, needs to be delivered along with the tools. Fault injection is poorly suited to discovering novel design faults in a system, but well suited to testing its fault tolerance capabilities. This proper use as a verification tool, rather than as a tool for improving performance during development, is important to remember.
Finally, the results of fault injection need to be given meaning. At an absolute level, they describe the system's ability to resist certain faults and its susceptibility to others. This can also be thought of as a way of testing the robustness of the system, its ability to operate under unusual conditions. If it is to be used as such, though, then its relationship to robustness needs to be better defined, and a framework for understanding the numbers that result needs to be built. While some of the theoretical underpinnings are there, the actual interpretation of the practice is not.
This paper describes some of the basic theory behind fault injection and the interpretation of its results, and also discusses the MESSALINE tool.
This article provides an overview of fault injection, and lists several tools currently available for such operations.
This paper details the operation of the DOCTOR software fault injection tool.
This article surveys a number of fault injection tools, and provides a good introduction to the basics of fault injection.
This paper details the operation of the VHDL fault injection and instrumentation tool MEFISTO.
This paper discusses the operation and use of the new technique of laser fault injection into executing circuits.
While the experiment performed in this paper is not revealing in what it says about the system tested, it is revealing in that it compares simulation and execution fault injection techniques on the same system.
This paper discusses the proper view of fault injection as a testing and verification tool, rather than a debugging tool.
This paper describes the inversion of test pattern frequencies from the usual observed workload to putting the emphasis on the unusual cases.