Carnegie Mellon University
18-849b Dependable Embedded Systems
Author: Charles P. Shelton
Exceptional conditions are things that occur in a system that are not expected
or are not a part of normal system operation. When the system handles these
exceptional conditions improperly, it can lead to failures and system crashes.
Exception failures are estimated to cause two thirds of system crashes and
fifty percent of computer system security vulnerabilities. Exception handling
is especially important in embedded and real-time computer systems because
software in these systems cannot easily be fixed or replaced, and they must
deal with the unpredictability of the real world. Robust exception handling in
software can improve software fault tolerance and fault avoidance, but no
structured techniques exist for implementing dependable exception handling.
However, many exceptional conditions can be anticipated when the system is
designed, and protection against these conditions can be incorporated into the
system. Traditional software engineering techniques such as code walkthroughs
and software testing can illuminate more exceptional conditions to be caught,
such as bad input for functions and memory and data errors. However, it is
impossible to cover all exceptional cases. It is also difficult to design a
dependable system that can tolerate truly unexpected conditions. In these
cases, some form of graceful degradation is necessary to safely bring down the
system without causing major hazards.
Exception handling is the method of building a system to detect and recover
from exceptional conditions. Exceptional conditions are any unexpected
occurrences that are not accounted for in a system's normal operation. It is
difficult to protect a system from the effects of exceptional conditions
because, by nature, not all unusual occurrences can be anticipated when the
system is designed. Some examples of exceptional conditions are incorrect
inputs from the user, bit level memory or data corruption, software design
defects that cause a system to enter an undefined state, and environmental
anomalies. If these exceptional conditions are not properly caught and handled,
they can cause an error or failure in the system. Failures due to exceptions
are estimated to account for two thirds of system crashes and fifty percent of
system security vulnerabilities [Maxion98].
Exception handling is different from fault tolerance. Fault tolerance
focuses on keeping known error states from causing system failures. Exception
handling deals with the undefined and unanticipated conditions that, if left
unchecked, can propagate through the system and cause a fault. Exception
handling is more like fault avoidance or fault containment. I submit that
exception handling is more difficult than fault tolerance because it must deal
with all the unpredictabilities of the system.
When designing an embedded system, exception handling is usually focused on
software. In fact, more than two thirds of code written for systems is devoted
to properly detecting and handling exceptions. However, most software testing
efforts focus on exercising the correct operation of code, and not determining
how robust it is to exceptional conditions [Cristian80]. Therefore, exception handling code is the
least tested and most susceptible to bugs.
Exception handling should also not be ignored in system components other
than software. Hardware and user interface components should also have some
built-in protection from exceptional conditions as well as having some
system-level protection. This was one of the problems with the Therac-25
medical device. The Therac-20 had hardware interlocks to prevent lethal doses
of radiation that were removed in the Therac-25. Thus, unknown software defects
that were effectively neutralized in the Therac-20 were exposed in the
Therac-25 and caused several deaths (both machines used the same basic
software). This illustrates the need for system-level as well as
component-level exception handling mechanisms.
Unfortunately, no well defined techniques exist for building robust
exception handling into a system. Most methods are ad hoc and limited to what
the design team can anticipate the system will encounter. Luckily, many of the
most common problems can easily be avoided as long as code is written to check
for them. Many exception failures in commercial libraries are linked to missing
simple checks, such as verifying that a pointer is not null before dereferencing
it or that a file is open before attempting to read or write to it. Good
software engineering practices such as code reviews, code walkthroughs, and
thorough testing can illuminate many of these exceptional conditions, but are
limited to the software of the system. It is also difficult to model the
complex interactions of system components at the design phase to determine
where other problems lie.
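The two checks mentioned above can be sketched as follows. This is a minimal
illustration in Python, where None stands in for a null pointer; the function
name read_first_line is hypothetical and not taken from any cited library.

```python
import io
from typing import Optional, TextIO


def read_first_line(stream: Optional[TextIO]) -> Optional[str]:
    """Return the first line of stream, or None if the stream is unusable."""
    # Guard 1: the reference itself may be missing (analogue of a null pointer).
    if stream is None:
        return None
    # Guard 2: the file may already be closed; reading it would raise ValueError.
    if stream.closed:
        return None
    return stream.readline().rstrip("\n")


# Usage with an in-memory file standing in for a real one.
demo = io.StringIO("first\nsecond\n")
first = read_first_line(demo)
```

Returning None is only one possible policy; a real system might instead report
the condition to a caller that knows how to recover.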
It is unrealistic to build a system that is completely bulletproof to
exceptional conditions because we cannot anticipate all possible situations.
Therefore it is necessary to build in default exception handlers that will
attempt to recover from any of these unanticipated conditions. If the
application is somewhat safety critical or has real-time deadlines, some form
of graceful degradation must be put in place to reduce the harm or damage done
by any system failures.
Exception handling techniques can be separated into two broad categories:
programmed exception handling and default exception handling. In some cases
programmed exception handling is capable of doing forward error recovery, but
both programmed and default exception handling methods can perform backward
error recovery. Forward error recovery can mask any exceptional occurrences and
continue normal operation. Backward error recovery must halt normal system
execution and attempt to return to a previous normal state to continue
execution and retry the operation. Checkpointing and recovery is a technique of
backward error recovery for tolerating transient or intermittent conditions.
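Backward error recovery with a checkpoint might look like the sketch below.
It assumes the system state fits in a dictionary that can be deep-copied, and
the names TransientError and run_with_checkpoint are invented for this
example.

```python
import copy


class TransientError(Exception):
    """Hypothetical marker for a condition that may succeed on retry."""


def run_with_checkpoint(state, operation, max_retries=3):
    """Apply operation(state) in place; roll back and retry on TransientError."""
    checkpoint = copy.deepcopy(state)  # save a known-good state
    for attempt in range(max_retries + 1):
        try:
            operation(state)
            return attempt  # number of retries that were needed
        except TransientError:
            # Backward recovery: discard partial updates, restore the checkpoint.
            state.clear()
            state.update(copy.deepcopy(checkpoint))
    raise RuntimeError("operation failed after retries; state restored")


# Usage: an operation that fails on its first two calls, then succeeds.
calls = {"n": 0}


def flaky(state):
    state["total"] += 10  # partial update happens before the failure
    calls["n"] += 1
    if calls["n"] < 3:
        raise TransientError("glitch")


state = {"total": 0}
retries = run_with_checkpoint(state, flaky)
```

Because the partial update is rolled back before each retry, the final state
reflects exactly one successful execution of the operation.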
Programmed Exception Handling
Programmed exception handling modules are mechanisms built into software for
specific exceptional cases that are known to be likely to occur. Since these
occurrences are relatively well understood, protection for them can be
incorporated into the system. When a program is executing, if one of the
exceptional conditions is detected, control is passed from the main process
block to the special exception handling block. This code will deviate from
normal execution to compensate for the exceptional condition and will attempt
to mask it to prevent propagating an error condition to higher levels in the
system. If the condition cannot be recovered, the exception
handler may call checkpointing recovery code to return the system to a known
state before the exception occurrence and retry the operation.
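As a rough sketch of a programmed handler that masks an anticipated
condition, consider a sensor read that occasionally fails. The classes
SensorReadError and TemperatureMonitor are hypothetical, and substituting the
last good reading is just one possible masking policy.

```python
class SensorReadError(Exception):
    """Hypothetical exception raised by a sensor driver on a bad read."""


class TemperatureMonitor:
    def __init__(self, read_sensor, initial=20.0):
        self._read_sensor = read_sensor  # callable returning a float; may raise
        self._last_good = initial
        self.masked_faults = 0

    def sample(self):
        try:
            value = self._read_sensor()
        except SensorReadError:
            # Programmed handler for an anticipated condition: mask it by
            # substituting the last good reading and continue normally.
            self.masked_faults += 1
            return self._last_good
        self._last_good = value
        return value


# Usage: a fake sensor that fails on its second read.
readings = iter([21.5, SensorReadError("timeout"), 22.0])


def fake_sensor():
    item = next(readings)
    if isinstance(item, Exception):
        raise item
    return item


monitor = TemperatureMonitor(fake_sensor)
samples = [monitor.sample() for _ in range(3)]
```

The caller never sees the failed read, which is the masking behavior described
above. The fault counter lets higher levels notice if masking happens too
often.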
Default Exception Handling
For all the exceptional conditions that are not anticipated by the system
designers, default exception handlers must be built. The default handlers may
be within the programming language or operating environment itself, transparent
to the application developer. They must be a catch-all for any unexpected
exceptions, and must also be responsible for containing exceptions due to
design defects. Exceptional conditions due to design defects are especially
dangerous because they will always be present. If you knew about all design
defects in a system a priori, they would have been eliminated before building
the system. Since we have not yet learned how to design perfect systems, it is
important that exception handlers can reduce the impact of design defects as
much as possible.
In most cases, default exception handlers cannot do much to continue system
operation. In the best cases they can use the checkpointing and recovery system
to mask transient errors, but for truly exceptional conditions that cause error
states, the best that can be hoped for is a graceful program termination.
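A default handler of this kind can be sketched as a catch-all wrapper around
the main task. The function and parameter names here are assumptions made for
illustration; enter_safe_state stands for whatever the application needs to do
before stopping, such as disabling actuators.

```python
def run_with_default_handler(task, enter_safe_state, log):
    """Run task(); on any unanticipated exception, shut down gracefully."""
    try:
        task()
        return True
    except Exception as exc:  # catch-all: the specific cause is unknown
        log(f"unhandled {type(exc).__name__}: {exc}")
        enter_safe_state()  # bring the system down without causing hazards
        return False


# Usage: a task containing a design defect the designers did not anticipate.
events = []


def defective_task():
    return 1 / 0


completed = run_with_default_handler(
    defective_task,
    enter_safe_state=lambda: events.append("safe state entered"),
    log=events.append,
)
```

The handler does not try to continue the task. It only records what happened
and terminates in a controlled way.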
In order to achieve robust operation, as much exception handling as possible
is desired. However, exception handling overhead may be too great for real-time
systems and make timing and scheduling difficult.
Real-Time System Constraints
In real-time systems, timing and meeting deadlines are the first priority,
especially for safety critical systems. However, if exceptional conditions
occur, there must be some detection and recovery mechanisms in place to prevent
error propagation. The extent and complexity of the exception handling
mechanisms will make it difficult to calculate and meet timing constraints
[Colnaric93]. Either the scheduling will have to be
worst-case, making performance worst-case, or exception handling will have to
be sacrificed. This is a tradeoff between getting results on time, or getting
correct results. Some research is being done in constructing models that use
object-oriented techniques to account for both real-time constraints and
exception handling mechanisms, so that they can be more easily and compatibly
combined [Romanovsky98].
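The worst-case tradeoff described above can be made concrete with a small
budget check. The task names and timing figures below are invented for
illustration and are not drawn from [Colnaric93]; the sketch only assumes
that a handler's execution time adds to the task's normal execution time.

```python
def fits_deadline(normal_ms, handler_ms, deadline_ms):
    """True if the task still meets its deadline when the handler also runs."""
    return normal_ms + handler_ms <= deadline_ms


# (name, normal path ms, handler ms, deadline ms), all invented figures
tasks = [
    ("read_sensor", 2.0, 1.0, 5.0),
    ("update_control", 4.0, 3.0, 6.0),
]

over_budget = [
    name for name, normal, handler, deadline in tasks
    if not fits_deadline(normal, handler, deadline)
]
```

Here update_control meets its deadline on the normal path but not when its
handler runs, so the designer must either relax the deadline, simplify the
handler, or accept that the exception will not be handled in time.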
Available tools, techniques, and metrics
As discussed above, there are no mature methods for generating robust exception
handlers or ensuring that all exceptions have been accounted for, but there is
research being done in these areas. Extending traditional software engineering
practices to use dependability cases for generating exceptional conditions is
one technique. Another technique called Xept provides an instrumentation
language for structured generation of wrappers for exceptional inputs to
software library modules. Another problem is that there are no accepted ways of
measuring how robust a system is to exceptional conditions. The Ballista
project has developed a methodology for automatically testing and comparing the
relative robustness of software modules.
It is hypothesized that exceptional conditions are not guarded in software
because designers do not think of them. Dependability cases aim to provide a
general framework and methodology for generating scenarios of exceptional
conditions so the system designer can build exception handlers for them into
the system. This technique, when used in conjunction with good software
engineering processes, is supposed to improve software robustness. Hazard
analysis techniques such as fault trees and fishbone diagrams are used to aid
the designer in anticipating exceptional conditions. Using dependability cases,
a taxonomy of exceptional conditions can be developed. For example,
[Maxion98] describes the CHILDREN mnemonic for classes of exceptional
conditions:
- Computational problem
- Hardware problem
- I/O and file problems
- Library function problem
- Data input problem
- Return-value problem: function or procedure call
- External user/client problem
- Null pointer and memory problems
However, since it is impossible to anticipate and cover all exceptional
conditions, it is unclear how much of an improvement dependability cases can
make in the system's software. Whatever taxonomy of exceptional conditions we
develop, it may exclude a key class of exceptions, leaving the system
vulnerable. Even so, this more structured approach is better than ad hoc
methods.
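The CHILDREN categories could be kept as a simple review checklist, as in
the sketch below. The uncovered helper is a hypothetical convenience and is
not part of the method described in [Maxion98].

```python
CHILDREN = {
    "C": "Computational problem",
    "H": "Hardware problem",
    "I": "I/O and file problems",
    "L": "Library function problem",
    "D": "Data input problem",
    "R": "Return-value problem: function or procedure call",
    "E": "External user/client problem",
    "N": "Null pointer and memory problems",
}


def uncovered(handled_letters):
    """Return the categories that a module's review has not yet addressed."""
    return [name for letter, name in CHILDREN.items()
            if letter not in handled_letters]


# Usage: a review that has considered everything except memory problems.
remaining = uncovered({"C", "H", "I", "L", "D", "R", "E"})
```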
Xept is a method of generating wrappers for software modules. Using an
instrumentation language, you can generate code to check for exceptional inputs
before passing parameters to library functions [Vo97]. This
is particularly useful for Commercial Off-The-Shelf (COTS) software where
source code may not be available and the programmer only has access to the
module interface. Many COTS software modules are not as robust as they can be,
and extra protection must be built into the system if you use these components
in your software. Xept provides a structured method of instrumenting
application code to mask and handle exceptions in library code. However, in
order to generate these exception handlers, the conditions to be protected
against must already be known. Xept does not detect exceptional conditions; it
only provides a way of correcting for them.
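The wrapper idea can be illustrated in Python with a decorator. This is not
Xept's instrumentation language, only a sketch of the same principle: check
the inputs at the module interface and handle exceptional ones without
touching the library source. The guard function is hypothetical, and
math.sqrt stands in for a library routine that cannot be modified.

```python
import functools
import math


def guard(precondition, on_violation):
    """Wrap a function so exceptional inputs are intercepted before the call."""
    def decorate(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            if not precondition(*args, **kwargs):
                # Handle the exceptional input without calling the library.
                return on_violation(*args, **kwargs)
            return func(*args, **kwargs)
        return wrapper
    return decorate


# math.sqrt raises on negative or non-numeric input; the wrapper masks that.
safe_sqrt = guard(
    precondition=lambda x: isinstance(x, (int, float)) and x >= 0,
    on_violation=lambda x: float("nan"),
)(math.sqrt)

root = safe_sqrt(9)
```

As with Xept, the precondition must be written by someone who already knows
which inputs are exceptional; the wrapper does not discover them.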
The Ballista software testing methodology focuses on passing exceptional inputs
at the module level and recording the results. Ballista is completely automated
and can demonstrate repeatable, atomic responses to exceptional conditions from
unexpected parameters. It is scalable because testing is based on the
parameters passed to the function, not the function's operation
[Kropp98]. Therefore, once test cases for data types are
developed, any function that uses those data types can be tested. This is ideal
for testing COTS software and making comparisons between different
implementations of the same application programming interface (API). Since
Ballista focuses on repeatable results, it is only useful for component testing
and cannot detect exceptional conditions due to complex interactions between
system components. Also, when testing modules, the tester must come up with the
exceptional inputs for the data types to be tested. However, as the system
grows, a database of exceptional values is being kept and can be reused for the
same data type.
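The data-type-based approach can be sketched as a small harness. This is not
the Ballista tool itself; the value pools, the robustness_test function, and
the function under test are all invented for illustration, and outcomes are
recorded far more coarsely than in [Kropp98].

```python
import itertools

# Hypothetical pools of exceptional values, keyed by parameter data type.
EXCEPTIONAL_VALUES = {
    "int": [0, -1, 2**31 - 1, -(2**31)],
    "str": ["", "\x00", "a" * 10000, None],
}


def robustness_test(func, param_types):
    """Call func with every combination of exceptional values; record outcomes."""
    results = []
    pools = [EXCEPTIONAL_VALUES[t] for t in param_types]
    for args in itertools.product(*pools):
        try:
            func(*args)
            outcome = "returned"
        except Exception as exc:
            outcome = type(exc).__name__
        results.append((args, outcome))
    return results


def char_at(text, index):  # function under test, chosen only for illustration
    return text[index]


results = robustness_test(char_at, ["str", "int"])
```

The pools are defined per data type rather than per function, so any other
function taking a string and an integer could be tested with the same call,
which is the source of the scalability described above.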
Relationship to other topics
Exception handling is a method of achieving system robustness, and is also
related to fault tolerance and error recovery.
- Robustness - Exception handling is a technique
for designing a robust system. Robustness is defined as the degree to
which a system can function in the presence of invalid inputs or stressful
environmental conditions. These are exceptional conditions.
- Software Testing - Testing is currently the
only metric we have for measuring how well a system can handle exceptional
conditions. It is also used to uncover any cases previously
unanticipated. Unfortunately, the problem of completely testing any
system for all possible occurrences is intractable.
- Fault Tolerant Computing - Fault tolerant
computing is similar to robustness and exception handling, but deals with
controlling and containing system or component errors after they have
occurred. Exception handling attempts to keep unanticipated conditions
from causing faults.
- Software Fault Tolerance - Fault
tolerance in software is especially important since software is quickly
becoming the most complex and integral part of any embedded system.
Software exception handling can improve software fault tolerance by preventing
exceptional conditions from becoming software faults.
- Checkpoint/Recovery - Checkpoint/Recovery is a
method that can recover from some transient and intermittent failures and can
mask exceptional occurrences.
- Security - Many security vulnerabilities are
caused by not properly containing exceptional conditions. For example,
many security holes are caused by race conditions and undetected memory
buffer overflows. These vulnerabilities can be exploited by people to gain
access to and tamper with restricted systems.
- Human Interface/Human Error - Since input from a
human user is one of the most likely places that exceptional and invalid inputs can
be generated in an embedded system, the user interface should be able to
prevent the operator from causing a fault condition. The interface should
constrain the user to only entering valid inputs into the system.
The following ideas are the important ones to take away from reading about this
topic:
- Exception handling differs from fault tolerance, but they are related.
Fault tolerance deals with correcting for known error conditions. Exception
handling can be seen as fault avoidance or fault containment. Unexpected
conditions must be masked before they can cause a fault in the system.
- It is not possible to cover every exception within a closed system. There
are unanticipated situations that the system cannot compensate for.
- Where you draw the system boundary determines the level of exception
handling you can do. For example, if you only look at the software,
environmental exceptional conditions cannot be sufficiently handled. If a human
operator is part of the system, there may be more exceptions that can be
covered, but with less certainty.
- Coverage is a major problem. It is unrealistic to cover all exceptional
conditions because they are not predictable.
- It is difficult to develop strategies to safely handle exceptions for
unanticipated situations. Most methods are ad hoc and based on previous
experience.
- In real-time systems, there is a tension between developing robust
exception handlers for safety and correctness, and meeting timing constraints.
Annotated Reference List
- [Colnaric93] Colnaric, Matjaz; Halang, Wolfgang
A., "Exception Handling and Predictability in Hard Real-Time
Systems." SAFECOMP 93. 12th International Conference on Computer
Safety, Reliability and Security, October 1993, p. 371-378.
This paper discusses the concerns of implementing exception handling and
accounting for unpredictability in the face of the timing constraints in hard
real-time systems.
- [Cristian80] Cristian, Flaviu, "Exception
Handling And Software-Fault Tolerance." 10th International Symposium on
Fault-Tolerant Computing, October 1980, p. 97-103.
Basic concepts in software exception handling and mathematical definitions.
- [Kropp98] Kropp, Nathan P.; Koopman, Philip J.;
Siewiorek, Daniel P., "Automated robustness testing of off-the-shelf
software components." Twenty-Eighth Annual International Symposium on
Fault-Tolerant Computing, June 1998, p. 230-239.
Motivation, methodology, and results of applying the Ballista software testing
technology to POSIX operating systems.
- [Maxion98] Maxion, Roy A.; Olszewski, Robert T.,
"Improving Software Robustness With Dependability Cases."
Twenty-Eighth Annual International Symposium on Fault-Tolerant Computing, June
1998, p. 346-355.
Introduces the technique of dependability cases and how it can help improve
software robustness.
- [Romanovsky98] Romanovsky, Alexander; Xu, Jie;
Randell, Brian, "Exception Handling in Object-Oriented Real-Time
Distributed Systems." First International Symposium on Object-Oriented
Real-Time Distributed Computing (ISORC '98), April 1998, p. 32-42.
Research into using object-oriented programming techniques to build structured
exception handling into real-time systems.
- [Vo97] Vo, Kiem-Pheng; Wang, Yi-Min; Chung, P.Emerald;
Huang, Yennun, "Xept: A Software Instrumentation Method For Exception
Handling." Eighth International Symposium on Software Reliability
Engineering, November 1997, p. 60-69.
Information about Xept, the motivation, methodology, and the instrumentation
language.