Exception Handling

Carnegie Mellon University
18-849b Dependable Embedded Systems
Spring 1999

Author: Charles P. Shelton

Abstract:

Exceptional conditions are things that occur in a system that are not expected or are not a part of normal system operation. When the system handles these exceptional conditions improperly, it can lead to failures and system crashes. Exception failures are estimated to cause two thirds of system crashes and fifty percent of computer system security vulnerabilities. Exception handling is especially important in embedded and real-time computer systems because software in these systems cannot easily be fixed or replaced, and they must deal with the unpredictability of the real world. Robust exception handling in software can improve software fault tolerance and fault avoidance, but no structured techniques exist for implementing dependable exception handling. However, many exceptional conditions can be anticipated when the system is designed, and protection against these conditions can be incorporated into the system. Traditional software engineering techniques such as code walkthroughs and software testing can illuminate more exceptional conditions to be caught, such as bad input for functions and memory and data errors. However, it is impossible to cover all exceptional cases. It is also difficult to design a dependable system that can tolerate truly unexpected conditions. In these cases, some form of graceful degradation is necessary to safely bring down the system without causing major hazards.

Introduction

Exception handling is the method of building a system to detect and recover from exceptional conditions. Exceptional conditions are any unexpected occurrences that are not accounted for in a system's normal operation. It is difficult to protect a system from the effects of exceptional conditions because, by nature, not all unusual occurrences can be anticipated when the system is designed. Some examples of exceptional conditions are incorrect inputs from the user, bit-level memory or data corruption, software design defects that cause a system to enter an undefined state, and environmental anomalies. If these exceptional conditions are not properly caught and handled, they can cause an error or failure in the system. Failures due to exceptions are estimated to account for two thirds of system crashes and fifty percent of system security vulnerabilities [Maxion98].

Exception handling is different from fault tolerance. Fault tolerance focuses on keeping known error states from causing system failures. Exception handling deals with the undefined and unanticipated conditions that, if left unchecked, can propagate through the system and cause a fault. Exception handling is more like fault avoidance or fault containment. I submit that exception handling is more difficult than fault tolerance because it must deal with all the unpredictabilities of the system.

When designing an embedded system, exception handling is usually focused on software. In fact, more than two thirds of code written for systems is devoted to properly detecting and handling exceptions. However, most software testing efforts focus on exercising the correct operation of code, and not determining how robust it is to exceptional conditions [Cristian80]. Therefore, exception handling code is the least tested and most susceptible to bugs.

Exception handling should not be ignored in system components other than software. Hardware and user interface components should have some built-in protection from exceptional conditions, in addition to system-level protection. This was one of the problems with the Therac-25 medical device. The Therac-20 had hardware interlocks to prevent lethal doses of radiation; these interlocks were removed in the Therac-25. Thus, unknown software defects that were effectively neutralized in the Therac-20 were exposed in the Therac-25 and caused several deaths (both machines used the same basic software). This illustrates the need for system-level as well as component-level exception handling mechanisms.

Unfortunately, no well-defined techniques exist for building robust exception handling into a system. Most methods are ad hoc and limited to what the design team can anticipate the system will encounter. Luckily, many of the most common problems can easily be avoided as long as code is written to check for them. Many exception failures in commercial libraries are linked to the absence of simple checks, such as verifying that a pointer is not null before dereferencing it, or that a file is open before attempting to read from or write to it. Good software engineering practices such as code reviews, code walkthroughs, and thorough testing can illuminate many of these exceptional conditions, but are limited to the software of the system. It is also difficult to model the complex interactions of system components at the design phase to determine where other problems lie.
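
The simple checks mentioned above can be illustrated with a minimal sketch. It is written in Python rather than in the C typical of commercial libraries, so a `None` test stands in for the null pointer check. The function name `read_text` and its error messages are hypothetical and chosen only for illustration.

```python
import io


def read_text(stream):
    """Read all text from a stream, checking simple preconditions first.

    Returns a (data, error) pair so the caller can see why a read was refused.
    """
    if stream is None:
        # Python analogue of checking a pointer for null before dereferencing it
        return None, "no stream given"
    if stream.closed:
        # Check that the file is open before attempting to read from it
        return None, "stream is not open"
    return stream.read(), None
```

A caller would inspect the second element of the pair before using the data, for example `data, err = read_text(io.StringIO("abc"))`.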

It is unrealistic to build a system that is completely bulletproof to exceptional conditions because we cannot anticipate all possible situations. Therefore it is necessary to build in default exception handlers that will attempt to recover from any of these unanticipated conditions. If the application is somewhat safety critical or has real-time deadlines, some form of graceful degradation must be put in place to reduce the harm or damage done by any system failures.

Key Concepts

Exception handling techniques can be separated into two broad categories: programmed exception handling and default exception handling. In some cases programmed exception handling is capable of doing forward error recovery, but both programmed and default exception handling methods can perform backward error recovery. Forward error recovery can mask any exceptional occurrences and continue normal operation. Backward error recovery must halt normal system execution and attempt to return to a previous normal state to continue execution and retry the operation. Checkpointing and recovery is a backward error recovery technique for tolerating transient or intermittent conditions.
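
As a rough illustration of forward error recovery, the sketch below masks an invalid sensor reading by substituting the last known-good value, so that normal operation continues without any rollback. The function name `read_speed` and the valid range of 0 to 300 are assumptions made for this example, not part of any particular system.

```python
def read_speed(raw, last_good):
    """Forward recovery: mask an invalid reading and continue operating."""
    try:
        value = float(raw)
    except (TypeError, ValueError):
        # The input is not a number at all; substitute the last good value
        return last_good
    if not (0.0 <= value <= 300.0):   # hypothetical valid range
        # Out-of-range values (including NaN) are masked the same way
        return last_good
    return value
```

The caller never sees the exceptional condition, which is the point of forward recovery. The cost is that a persistently failing sensor is silently hidden unless the substitution is also logged or counted.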

Programmed Exception Handling

Programmed exception handling modules are mechanisms built into software for specific exceptional cases that are known to be likely to occur. Since these occurrences are relatively well understood, protection for them can be incorporated into the system. When a program is executing, if one of the exceptional conditions is detected, control is passed from the main process block to the special exception handling block. This code will deviate from normal execution to compensate for the exceptional condition and will attempt to mask it to prevent propagating an error condition to higher levels in the software hierarchy.

If the condition cannot be recovered, the exception handler may call checkpointing recovery code to return the system to a known state before the exception occurrence and retry the operation.
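
A minimal sketch of a programmed handler that falls back on checkpoint recovery might look as follows. It assumes the system state fits in a dictionary and that transient conditions are signaled by a hypothetical `TransientError` exception; real checkpointing systems are considerably more involved.

```python
import copy


class TransientError(Exception):
    """Hypothetical exception type standing in for a transient condition."""


def run_with_checkpoint(state, operation, max_retries=3):
    """Run operation(state); on a transient error, roll back and retry."""
    checkpoint = copy.deepcopy(state)        # save a known-good state
    for _attempt in range(max_retries + 1):
        try:
            return operation(state)
        except TransientError:
            # Backward error recovery: restore the saved state in place,
            # then loop around to retry the operation
            state.clear()
            state.update(copy.deepcopy(checkpoint))
    raise RuntimeError("operation failed after retries")
```

If every retry fails, the sketch raises an error to the next level up, where a default handler would have to take over.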

Default Exception Handling

For all the exceptional conditions that are not anticipated by the system designers, default exception handlers must be built. The default handlers may be within the programming language or operating environment itself, transparent to the application developer. They must be a catch-all for any unexpected exceptions, and must also be responsible for containing exceptions due to design defects.

Exceptional conditions due to design defects are especially dangerous because they will always be present. If you knew about all design defects in a system a priori, they would have been eliminated before building the system. Since we have not yet learned how to design perfect systems, it is important that exception handlers can reduce the impact of design defects as much as possible.

In most cases, default exception handlers cannot do much to continue system operation. In the best cases they can use the checkpointing and recovery system to mask transient errors, but for truly exceptional conditions that cause error states, the best that can be hoped for is a graceful program termination.
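
The catch-all behavior described above can be sketched as a top-level wrapper around the application. The names `run_with_default_handler` and `safe_shutdown` are hypothetical; `safe_shutdown` stands for whatever application-specific action brings the system to a safe state.

```python
import sys


def run_with_default_handler(main, safe_shutdown):
    """Run main(); on any unanticipated exception, terminate gracefully.

    Returns 0 on normal completion and 1 if the default handler fired.
    """
    try:
        main()
        return 0
    except Exception as exc:
        # Catch-all for anything the programmed handlers did not anticipate
        print(f"unhandled {type(exc).__name__}: {exc}", file=sys.stderr)
        safe_shutdown()     # e.g. drive outputs to a safe state
        return 1
```

This handler makes no attempt to continue operation. It only records what happened and performs an orderly shutdown, which is often the most a default handler can do.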

In order to achieve robust operation, as much exception handling as possible is desired. However, exception handling overhead may be too great for real-time systems and make timing and scheduling difficult.

Real-Time System Constraints

In real-time systems, timing and meeting deadlines are the first priority, especially for safety critical systems. However, if exceptional conditions occur, there must be some detection and recovery mechanisms in place to prevent error propagation. The extent and complexity of the exception handling mechanisms will make it difficult to calculate and meet timing constraints [Colnaric93]. Either the scheduling will have to be worst-case, making performance worst-case, or exception handling will have to be sacrificed. This is a tradeoff between getting results on time, or getting correct results. Some research is being done in constructing models that use object-oriented techniques to account for both real-time constraints and exception handling mechanisms, so that they can be more easily and compatibly designed [Romanovsky98].
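
The cost of worst-case scheduling can be seen with a small calculation. The sketch below sums processor utilization (worst-case execution time divided by period) for a set of periodic tasks on a single processor, with and without the time needed by their exception handlers. All task timings are invented for illustration, and total utilization at or below 1.0 is used only as a necessary condition for schedulability, not a sufficient one.

```python
def utilization(tasks, include_handlers):
    """Sum worst-case execution time / period over all tasks.

    Each task is (normal_ms, handler_ms, period_ms).
    """
    total = 0.0
    for normal_ms, handler_ms, period_ms in tasks:
        wcet = normal_ms + (handler_ms if include_handlers else 0)
        total += wcet / period_ms
    return total


# Hypothetical task set: (normal time, handler time, period), in milliseconds
tasks = [(2, 3, 10), (4, 8, 20)]

without_handlers = utilization(tasks, include_handlers=False)   # about 0.4
with_handlers = utilization(tasks, include_handlers=True)       # about 1.1
```

In this invented task set the processor is lightly loaded in normal operation, yet budgeting for every handler to run in every period pushes utilization above 1.0, so the designer must either relax the periods or give up some exception handling.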

Available tools, techniques, and metrics

As discussed above, there are no mature methods for generating robust exception handlers or ensuring that all exceptions have been accounted for, but there is research being done in these areas. Extending traditional software engineering practices to use dependability cases for generating exceptional conditions is one technique. Another technique called Xept provides an instrumentation language for structured generation of wrappers for exceptional inputs to software library modules. Another problem is that there are no accepted ways of measuring how robust a system is to exceptional conditions. The Ballista project has developed a methodology for automatically testing and comparing the relative robustness of software modules.

Dependability Cases

It is hypothesized that exceptional conditions are not guarded in software because designers do not think of them. Dependability cases aim to provide a general framework and methodology for generating scenarios of exceptional conditions so the system designer can build exception handlers for them into the system. This technique, when used in conjunction with good software engineering processes, is supposed to improve software robustness. Hazard analysis techniques such as fault trees and fishbone diagrams are used to aid the designer in anticipating exceptional conditions. Using dependability cases, a taxonomy of exceptional conditions can be developed. For example, [Maxion98] describes the CHILDREN mnemonic for exceptions:

Computational problem
Hardware problem
I/O and file problems
Library function problem
Data input problem
Return-value problem: function or procedure call
External user/client problem
Null pointer and memory problems
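
One way to put such a taxonomy to work is to encode it so that observed exceptions can be tallied by category. The sketch below does this in Python. The category names come from the list above, but the mapping from Python's built-in exception types to categories is my own assumption and covers only a few of them.

```python
from enum import Enum


class Children(Enum):
    COMPUTATIONAL = "Computational problem"
    HARDWARE = "Hardware problem"
    IO_FILE = "I/O and file problems"
    LIBRARY = "Library function problem"
    DATA_INPUT = "Data input problem"
    RETURN_VALUE = "Return-value problem"
    EXTERNAL = "External user/client problem"
    NULL_MEMORY = "Null pointer and memory problems"


# Illustrative mapping only; not taken from [Maxion98]
_MAPPING = [
    (ArithmeticError, Children.COMPUTATIONAL),
    (OSError, Children.IO_FILE),
    (MemoryError, Children.NULL_MEMORY),
    (ValueError, Children.DATA_INPUT),
]


def classify(exc):
    """Return the category for an exception, or None if it is not covered."""
    for exc_type, category in _MAPPING:
        if isinstance(exc, exc_type):
            return category
    return None
```

An exception that falls outside the mapping returns `None`, which makes gaps in the taxonomy visible instead of hiding them.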

However, since it is impossible to anticipate and cover all exceptional conditions, it is unclear how much of an improvement dependability cases can make in the system's software. Whatever taxonomy of exceptional conditions we develop, it may exclude a key class of exceptions, leaving the system vulnerable. Even so, this more structured approach is better than ad hoc methods.

Xept

Xept is a method of generating wrappers for software modules. Using an instrumentation language, you can generate code to check for exceptional inputs before passing parameters to library functions [Vo97]. This is particularly useful for Commercial Off-The-Shelf (COTS) software where source code may not be available and the programmer only has access to the module interface. Many COTS software modules are not as robust as they could be, and extra protection must be built into the system if you use these components in your software. Xept provides a structured method of instrumenting application code to mask and handle exceptions in library code. However, in order to generate these exception handlers, the conditions to be protected against must already be known. Xept does not detect exceptional conditions; it only provides a way of correcting for them.
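
The general idea of wrapping a library function with an input check can be sketched in Python as shown below. This is not Xept's instrumentation language, which is described in [Vo97]; the `guarded` decorator and the `safe_sqrt` wrapper are hypothetical names used only to show the wrapper pattern.

```python
import functools
import math


def guarded(check, on_bad):
    """Build a wrapper that validates arguments before calling a function."""
    def decorate(func):
        @functools.wraps(func)
        def wrapper(*args):
            if not check(*args):
                # Mask the exceptional input instead of passing it on
                return on_bad
            return func(*args)
        return wrapper
    return decorate


# Wrap a library function without touching its source code
safe_sqrt = guarded(
    lambda x: isinstance(x, (int, float)) and x >= 0,
    on_bad=float("nan"),
)(math.sqrt)
```

As with Xept, the wrapper can only guard against conditions that are already known, here a negative or non-numeric argument.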

Ballista

The Ballista software testing methodology focuses on passing exceptional inputs at the module level and recording the results. Ballista is completely automated and can demonstrate repeatable, atomic responses to exceptional conditions from unexpected parameters. It is scalable because testing is based on the parameters passed to the function, not the function's operation [Kropp98]. Therefore, once test cases for data types are developed, any function that uses those data types can be tested. This is ideal for testing COTS software and making comparisons between different implementations of the same application programming interface (API). Since Ballista focuses on repeatable results, it is only useful for component testing and cannot detect exceptional conditions due to complex interactions between system components. Also, when testing modules, the tester must come up with the exceptional inputs for the data types to be tested. However, as the system grows, a database of exceptional values is being kept and can be reused for the same data type.
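
The data-type-based approach can be illustrated with a small sketch in the same spirit. It is not the Ballista tool itself; the value pools, the `robustness_test` harness, and the `char_at` function under test are all hypothetical. Test values are kept per data type, and any function is exercised with every combination of values for its parameter types.

```python
import itertools

# Hypothetical pools of ordinary and exceptional values, keyed by data type
TEST_VALUES = {
    "int": [0, -1, 2**31 - 1],
    "str": ["", "abc", None],
}


def robustness_test(func, param_types):
    """Call func with every combination of test values; record each outcome."""
    results = []
    pools = [TEST_VALUES[t] for t in param_types]
    for args in itertools.product(*pools):
        try:
            func(*args)
            outcome = "ok"
        except Exception as exc:
            outcome = type(exc).__name__   # record how the call failed
        results.append((args, outcome))
    return results


def char_at(text, index):
    """Example function under test: no input checking at all."""
    return text[index]


results = robustness_test(char_at, ["str", "int"])
```

Because the pools are keyed by data type and not by function, any other function taking a string and an integer can reuse them unchanged.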

Relationship to other topics

Exception handling is a method of achieving system robustness, and is also related to fault tolerance and error recovery.

Robustness - Exception handling is a technique for designing a robust system. Robustness is defined as the degree to which a system can function in the presence of invalid inputs or stressful environmental conditions. These are exceptional conditions.
Software Testing - Testing is currently the only metric we have for measuring how well a system can handle exceptional conditions. It is also used to uncover any cases previously unanticipated. Unfortunately, the problem of completely testing any system for all possible occurrences is intractable.
Fault Tolerant Computing - Fault tolerant computing is similar to robustness and exception handling, but deals with controlling and containing system or component errors after they have occurred. Exception handling attempts to keep unanticipated conditions from causing faults.
Software Fault Tolerance - Fault tolerance in software is especially important since software is quickly becoming the most complex and integral part of any embedded system. Software exception handling can improve software fault tolerance by preventing exceptional conditions from becoming software faults.
Checkpoint/Recovery - Checkpoint/Recovery is a method that can recover from some transient and intermittent failures and can mask exceptional occurrences.
Security - Many security vulnerabilities are caused by not properly containing exceptional conditions. For example, many security holes are caused by race conditions and not detecting a memory buffer overflow. These vulnerabilities can be exploited by people to gain access to and tamper with restricted systems.
Human Interface/Human Error - Since input from a human user is one of the most likely places that exceptional and invalid inputs can be generated in an embedded system, the user interface should be able to prevent the operator from causing a fault condition. The interface should constrain the user to entering only valid inputs into the system.

Conclusions

The following ideas are the important ones to take away from reading about this topic:

Exception handling differs from fault tolerance, but they are related. Fault tolerance deals with correcting for known error conditions. Exception handling can be seen as fault avoidance or fault containment. Unexpected conditions must be masked before they can cause a fault in the system.
It is not possible to cover every exception within a closed system. There are unanticipated situations that the system cannot compensate for.
Where you draw the system boundary determines the level of exception handling you can do. For example, if you only look at the software, environmental exceptional conditions cannot be sufficiently handled. If a human operator is part of the system, there may be more exceptions that can be covered, but with less certainty.
Coverage is a major problem. It is unrealistic to cover all exceptional conditions because they are not predictable.
It is difficult to develop strategies to safely handle exceptions for unanticipated situations. Most methods are ad hoc and based on previous experience.
In real-time systems, there is a tension between developing robust exception handlers for safety and correctness, and meeting timing constraints.

Annotated Reference List

[Colnaric93] Colnaric, Matjaz; Halang, Wolfgang A., "Exception Handling and Predictability in Hard Real-Time Systems." SAFECOMP 93. 12th International Conference on Computer Safety, Reliability and Security, October 1993, p. 371-378.
This paper discusses the concerns of implementing exception handling and accounting for unpredictability in the face of the timing constraints in hard real-time systems.
[Cristian80] Cristian, Flaviu, "Exception Handling And Software-Fault Tolerance." 10th International Symposium on Fault-Tolerant Computing, October 1980, p. 97-103.
Basic concepts in software exception handling and mathematical definitions.
[Kropp98] Kropp, Nathan P.; Koopman, Philip J.; Siewiorek, Daniel P., "Automated robustness testing of off-the-shelf software components." Twenty-Eighth Annual International Symposium on Fault-Tolerant Computing, June 1998, p. 230-239.
Motivation, methodology, and results of applying the Ballista software testing technology to POSIX operating systems.
[Maxion98] Maxion, Roy A.; Olszewski, Robert T., "Improving Software Robustness With Dependability Cases." Twenty-Eighth Annual International Symposium on Fault-Tolerant Computing, June 1998, p. 346-355.
Introduces technique of dependability cases and how it can help improve exception handling.
[Romanovsky98] Romanovsky, Alexander; Xu, Jie; Randell, Brian, "Exception Handling in Object-Oriented Real-Time Distributed Systems." First International Symposium on Object-Oriented Real-Time Distributed Computing (ISORC '98), April 1998, p. 32-42.
Research into using object-oriented programming techniques to build structured exception handling into real-time systems.
[Vo97] Vo, Kiem-Pheng; Wang, Yi-Min; Chung, P.Emerald; Huang, Yennun, "Xept: A Software Instrumentation Method For Exception Handling." Eighth International Symposium on Software Reliability Engineering, November 1997, p. 60-69.
Information about Xept: the motivation, methodology, and the instrumentation language developed.
