Carnegie Mellon University
Author: Michael Collins
Systems failures do not occur in a vacuum; while a single event may trigger the failure, investigation often reveals that a history of managerial and technical decisions produced the conditions that turned a single event into a disaster. At a minimum, investigating case studies provides lessons on what to avoid. Through the systematic study of failure, it may be possible to draw general conclusions and improve practice as a whole.
Unfortunately, good systems failure studies are rare. Embedded systems failure is a volatile topic, and the field is filled with noise, urban myth, and political agendas. To build case studies, engineers must develop some investigative skill. We discuss sources of case studies, examples of the good case studies that do exist, and techniques that can be used to separate noise from real information.
Engineers aren't perfect: systems fail for various reasons, ranging from inevitable hardware breakdowns to avoidable loss of morale. However, the accidents that result in system failures are not unique and have often been encountered before. For this reason, case studies are a valuable tool for designers: design is a discipline best learned by example, and examples of failures are as illustrative as studies of successes.
While case studies are generally understood to be a benefit for designers, there are very few good ones available. There are pragmatic reasons for this: the politicization of systems failure and the accompanying need to find a victim, liability protection, a tendency to focus on single failure points, and the fact that ignoring embarrassing mistakes is a part of human nature [Petroski94].
Because of this reluctance, case studies are usually collected as a last resort. In the 19th century, bridge collapses eventually forced civil engineers to systematically collect and study the contributing factors. In the railroad era, good bridge design was necessary for human safety and economic health; at a certain point, the cost of failure became too great. We can draw parallels between the situation of bridge engineers in the 19th century and embedded systems designers today: embedded systems are critical in maintaining safety, trade and communications. Just like a bridge failure, the failure of an embedded system can cost human lives.
At a minimum, systems failures should be studied to prevent the same failure from occurring twice. However, in addition to uncovering specific engineering failures, it is possible to draw broader conclusions about engineering practice by studying failures. Systems failures happen for a reason, and these reasons need not be unique to either the situation or the discipline. Petroski, in particular, uses historical examples to demonstrate several common failures in the design process. Design is not unique to software engineering, and Petroski's rubrics are applicable outside the civil examples he uses [Petroski94].
The primary problem in studying case studies is finding them. There are relatively few good case studies, and few sources of good source material. While it is acknowledged that lessons learned from systems failure can apply to multiple disciplines, most case histories are written for a narrow audience of professionals [Petroski94]. In addition to the rarity of good information, there is a large amount of bad information. Nebulous computer failures can serve as a convenient scapegoat when the actual system failure is more complicated; Sudden Acceleration Syndrome, in which operator errors were blamed on microchip failures, is a good example of this problem. Given the dearth of good material and the abundance of bad, engineers will often have to do their own investigative work.
While there are no exact numbers available, it is likely that the most extensively studied case histories are drawn from civil engineering. This is understandable given that civil engineering is the oldest engineering discipline; the first true case studies are drawn from Vitruvius in the 1st century AD [Vitruvius]. While embedded systems design is a relatively young field, there are several good (if unverified) archives of computer systems failures. The Usenet newsgroup comp.risks comprises a large archive of computer failures, stretching back approximately fifteen years. Coupled with its parent archive in Software Engineering Notes, the risks archive is arguably the largest collection of systems failure reports in existence. Similar Internet-based resources exist, most notably the Air Failures Page and pages devoted to Chernobyl and the Ariane V.
Because of these archives, the difficulty in studying embedded systems failures lies in filtering, rather than finding, good information. While the comp.risks archive is extensive, it is still a Usenet newsgroup and must be viewed with a certain amount of suspicion. Comp.risks is best used as a starting point for further research. The summary printed in Software Engineering Notes is generally more thorough.
While good case studies are rare, there are several examples. Leveson has written the definitive autopsy of the Therac-25 incident [Leveson93]. Other case studies, and a revision of her Therac-25 study, are included in Leveson's book Safeware [Leveson95]. The Rogers Commission study of the Challenger disaster is also excellent [Rogers86]. Petroski's Design Paradigms is devoted exclusively to case studies [Petroski94].
Since case studies serve as a data source for nearly every other topic, several key concepts of data analysis apply to working with them.
The motivation for studying case studies is to learn from mistakes: this implies that a good case study must provide an unvarnished history of the failure and its surrounding factors. Unfortunately, as discussed above, there is a wealth of bad and unverified information.
Case studies have to be viewed with a certain amount of cynicism. Most failures are too complex to be explained by a single magic bullet.
An illustrative example comes from the early history of comp.risks [RISKS1.27]. The article in question is a rather lengthy scare story involving computer viruses and their potential effect on the banking system. The article's author writes an investment/religious newsletter advocating the end of modern banking and a return to 'Biblically sound' economics. While a bit extreme, the example illustrates the problem of author bias.
One of the more frustrating problems in studying systems failures is that, ultimately, it is entirely possible for there to be no single point of failure. Multiple factors may contribute equally, and the actual source of failure may still be debated years after the fact. Consequently, while a single incident (often mechanical) may trigger a disaster, there is often a chain of events leading up to that incident.
Petroski lists several examples of design failures, drawn primarily from civil engineering case studies, and argues that the lessons learned from those failures extend to other disciplines as well. His examples include scalability failures from Vitruvius, logical failures in the design process, and cases where historical patterns of success mask errors in design [Petroski94]. These are lessons about how the design process works, and they are not exclusive to any technology. Indeed, state-of-the-art techniques like formal verification are intended to uncover exactly such logical failures in design.
Petroski points out that one of the traditional difficulties in systems design is that system architects work ahead of their theoretical knowledge [Petroski94]. In most cases, an engineer has to make educated guesses, and if those choices are going to cost, he must be able to convince his client that this is the best course of action. In this role, case studies should be used to provide guidelines for design: on the need for caution, on how people should be managed in a system, and so on.
A good example of how case studies are used in design comes from John Roebling's design of the Brooklyn Bridge. As noted above, civil engineers in the 19th century lacked the necessary aerodynamic knowledge to properly model suspension bridges. After studying multiple bridge collapses, Roebling used all the evidence available to decide that the only logical approach was to design the bridge as tough as possible. Roebling built the bridge to handle six times the stresses he expected it to encounter in even the worst situations. During the bridge's construction, he placed extensive restrictions on the number of people allowed to stand on the bridge, their movement patterns and other factors. When asked why he was convinced that the bridge would stand, Roebling would always point out just how strong he was building the bridge, and exactly why it was so strong [Petroski94] [Bentley89]. While it might appear that this kind of caution is obvious, it should be noted that Roebling's bridge is one of the few examples of 19th century suspension bridges still standing. All of the bridges built by his contemporaries have collapsed [Bentley89].
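Roebling's six-fold margin is an instance of a factor of safety: the ratio of a design's capacity to the worst-case load it is expected to bear. A minimal sketch in Python (the load figure below is invented purely for illustration; only the six-fold ratio comes from the account above):

```python
def factor_of_safety(capacity: float, expected_load: float) -> float:
    """Ratio of what the design can bear to the worst-case expected load."""
    return capacity / expected_load

# Hypothetical worst-case load figure, for illustration only.
expected_load = 12_000.0       # tons (invented)
capacity = 6 * expected_load   # design for six times the expected stresses

print(factor_of_safety(capacity, expected_load))  # -> 6.0
```

The point is not the arithmetic but the design stance: when theory lags practice, the margin absorbs what the models cannot predict.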
At this point, it may be appropriate to discuss some of the initial trends observed from reading these case studies. The points discussed below are intended to assist in studying cases; Leveson and Petroski both discuss how these lessons can be used to assist the design process.
While a single point of failure may trigger an accident, it does not follow that a single point is the cause. Systems failures are the result of complex interactions between designers, operators and systems, and disasters occur when points of failure in all these components interact at once. A single point failure may trigger the accident, but it is often simply the most obvious source. Depending on one's viewpoint, the Therac-25 deaths can be viewed as a consequence of poor software design, a poor user interface, a poor safety culture, or other factors [Leveson93].
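One way to make this concrete is a toy fault-tree calculation. This is not drawn from Leveson; the contributing factors and probabilities below are invented for illustration. The structure, though, mirrors the argument: the disaster requires several independent weaknesses to line up at once, so no single one is "the" cause.

```python
# Toy fault-tree sketch with invented, assumed-independent factors.
p_software_defect = 0.01   # a latent defect is active (invented figure)
p_confusing_ui    = 0.05   # the operator misreads the interface (invented)
p_weak_oversight  = 0.10   # the safety process misses the interaction (invented)

# AND gate: the accident needs all three weaknesses simultaneously.
p_accident = p_software_defect * p_confusing_ui * p_weak_oversight

print(f"{p_accident:.6f}")
```

Removing any one branch drives the combined probability down, which is why fixating on a single "root cause" after the fact understates how many places the accident could have been prevented.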
Because systems failures are so complex, it is likely that at least some of the contributing factors are nontechnical. Managerial problems can be as much of a cause as faulty circuitry: the Rogers commission points out that NASA's safety culture had decayed extensively before the Challenger accident [Rogers86]. Leveson stresses the importance of safety culture in her book, noting that the Bhopal accident was preceded by waves of downsizing and a consequent loss of employee morale [Leveson95]. Failure is multidisciplinary.
On a more morbid note, systems failures often end up in litigation, and sometimes the best place to look for an exhaustive analysis of a system failure is the courtroom or the Senate record.
There are two broad categories of references for this paper. The first list consists of useful case studies and books dealing with case histories. The second list consists of case history repositories.
Of the risk studies available, Leveson's Therac-25 paper [Leveson93] and the Rogers commission study are the best general-purpose case studies available. Of the other books mentioned, Petroski's book is the most extensive argument about the role of case studies in the design process, while Leveson's and Neumann's books are primarily discussions about safety and risk in software systems.
These references point to archives of risk information. Unlike the books mentioned above, this information is essentially raw data; investigators will need to analyze this material and verify it independently of the source.
The most glaring problem with analyzing these case studies is the dearth of usable material. Comp.risks is, unfortunately, an only marginally verified source. Interestingly enough, the call for better case studies is not new; engineers have been discussing it since the 19th century.
There are several tasks necessary for building a good source of case study material. A good taxonomy must be built, and a large collection of primary sources must be archived. To some extent, the comp.risks archive will help any investigator seeking to do this; at least with comp.risks, there's a starting point.