Carnegie Mellon University
Author: Michael Collins
Systems failures do not occur in a vacuum; while a single event may trigger the failure, investigation often reveals that a history of managerial and technical decisions produced the conditions that turned a single event into a disaster. At a minimum, investigating case studies provides lessons on what to avoid. Through the systematic study of failure, it may be possible to draw general conclusions and improve practice as a whole.
Unfortunately, good systems failure studies are rare. Embedded systems failure is a volatile topic, and the field is filled with noise, urban myth, and political agendas. To build case studies, engineers must develop some investigative skill. We discuss sources of case studies, examples of the good case studies that do exist, and techniques that can be used to separate noise from real information.
Engineers aren't perfect: systems fail for various reasons, ranging from inevitable hardware breakdowns to avoidable loss of morale. However, the accidents that result in system failures are not unique and have often been encountered before. For this reason, case studies are a valuable tool for designers: design is a discipline best learned by example, and examples of failures are as illustrative as studies of successes.
While case studies are generally understood to be a benefit for designers, there are very few good ones available. There are pragmatic reasons for this: the politicization of systems failure and the accompanying need to find a victim, liability protection, a tendency to focus on single failure points, and the fact that ignoring embarrassing mistakes is a part of human nature [Petroski94].
Because of this reluctance, case studies are usually collected as a last resort. In the 19th century, bridge collapses eventually forced civil engineers to systematically collect and study the contributing factors. In the railroad era, good bridge design was necessary for human safety and economic health; at a certain point, the cost of failure became too great. We can draw parallels between the situation of bridge engineers in the 19th century and embedded systems designers today: embedded systems are critical in maintaining safety, trade and communications. Just like a bridge failure, the failure of an embedded system can cost human lives.
At a minimum, systems failures should be studied to prevent the same failure from occurring twice. However, in addition to uncovering specific engineering failures, it is possible to draw broader conclusions about engineering practice by studying failures. Systems failures happen for a reason, and these reasons need not be unique to either the situation or the discipline. Petroski, in particular, uses historical examples to demonstrate several common failures in the design process. Design is not unique to software engineering, and Petroski's rubrics are applicable outside the civil examples he uses [Petroski94].
The primary problem in studying case studies is finding them. There are relatively few good case studies, and few sources of good source material. While it is acknowledged that lessons learned from systems failure can apply to multiple disciplines, most case histories are written for a narrow audience of professionals [Petroski94]. In addition to the rarity of good information, there is a large amount of bad information. Nebulous computer failures can serve as a convenient scapegoat when the actual system failure is more complicated; Sudden Acceleration Syndrome, in which operator errors were blamed on microchip failures, is a good example of this problem. Given the dearth of good material and the abundance of bad, engineers will often have to do their own investigative work.
While there are no exact numbers available, it is likely that the most extensively studied case histories are drawn from civil engineering. This is understandable given that civil engineering is the oldest engineering discipline; the first true case studies are drawn from Vitruvius in the 1st century AD [Vitruvius]. While embedded systems design is a relatively young field, there are several good (if unverified) archives of computer systems failures. The Usenet newsgroup comp.risks comprises a large archive of computer failures, stretching back approximately fifteen years. Coupled with its parent archive in Software Engineering Notes, the risks archive is arguably the largest collection of systems failure reports in existence. Similar Internet-based resources exist, most notably the Air Failures Page and pages devoted to Chernobyl and the Ariane V.
Because of these archives, the difficulty in studying embedded systems failures lies in filtering, rather than finding, good information. While the comp.risks archive is extensive, it is still a Usenet newsgroup and must be viewed with a certain amount of suspicion. Comp.risks is best used as a starting point for further research. The summary printed in Software Engineering Notes is generally more thorough.
While good case studies are rare, there are several examples. Leveson has written the definitive autopsy of the Therac-25 incident [Leveson93]. Other case studies, and a revision of her Therac-25 study, are included in Leveson's book Safeware [Leveson95]. The Rogers Commission study of the Challenger disaster is also excellent [Rogers86]. Petroski's Design Paradigms is devoted exclusively to case studies [Petroski94].
Since case studies serve as a data source for nearly every other topic, several key concepts of data analysis apply to working with them.
The motivation for studying case studies is to learn from mistakes: this implies that a good case study must provide an unvarnished history of the failure and its surrounding factors. Unfortunately, as discussed above, there is a wealth of bad and unverified information.
Case studies have to be viewed with a certain amount of cynicism. Most failures are too complex to be explained by a single magic bullet.
An illustrative example comes from the early history of comp.risks [RISKS1.27]. The article in question is a rather lengthy scare story involving computer viruses and their potential effect on the banking system. The article's author writes an investment/religious newsletter advocating the end of modern banking and a return to 'Biblically sound' economics. While a bit extreme, the example illustrates the problem of author bias.
One of the more frustrating problems in studying systems failures is that, ultimately, it is entirely possible for there to be no single point of failure. Multiple factors may contribute equally, and the actual source of failure may still be debated years after the fact. Consequently, while a single incident (often mechanical) may trigger a disaster, there is often a chain of events leading up to that incident.
Petroski lists several examples of design failures, drawn primarily from civil engineering case studies, and argues that the lessons learned from those failures extend to other disciplines as well. His examples include scalability failures from Vitruvius, logical failures in the design process, and cases where historical patterns of success mask errors in design [Petroski94]. These are lessons about how the design process works, and they are not exclusive to any technology. Indeed, state-of-the-art techniques like formal verification are intended to uncover exactly such logical failures in design.
Petroski points out that one of the traditional difficulties in systems design is that system architects work ahead of their theoretical knowledge [Petroski94]. In most cases, an engineer has to make educated guesses, and if those choices are going to cost, he must be able to convince his client that this is the best course of action. In this role, case studies should be used to provide guidelines for design: on the need for caution, on how people should be managed in a system, and so on.
A good example of how case studies are used in design comes from John Roebling's design of the Brooklyn Bridge. As noted above, civil engineers in the 19th century lacked the necessary aerodynamic knowledge to properly model suspension bridges. After studying multiple bridge collapses, Roebling used all the evidence available to decide that the only logical approach was to design the bridge as tough as possible. Roebling built the bridge to handle six times the stresses he expected it to encounter in even the worst situations. During the bridge's construction, he placed extensive restrictions on the number of people allowed to stand on the bridge, their movement patterns and other factors. When asked why he was convinced that the bridge would stand, Roebling would always point out just how strong he was building the bridge, and exactly why it was so strong [Petroski94] [Bentley89]. While it might appear that this kind of caution is obvious, it should be noted that Roebling's bridge is one of the few examples of 19th century suspension bridges still standing. All of the bridges built by his contemporaries have collapsed [Bentley89].
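Roebling's six-fold margin is an instance of a factor of safety: the ratio of a design's capacity to the worst-case load it is expected to bear. A minimal sketch in Python (the load figure below is invented purely for illustration; only the six-fold ratio comes from the account above):

```python
def factor_of_safety(capacity: float, expected_load: float) -> float:
    """Ratio of what the design can bear to the worst-case expected load."""
    return capacity / expected_load

# Hypothetical worst-case load figure, for illustration only.
expected_load = 12_000.0       # tons (invented)
capacity = 6 * expected_load   # design for six times the expected stresses

print(factor_of_safety(capacity, expected_load))  # -> 6.0
```

The point is not the arithmetic but the design stance: when theory lags practice, the margin absorbs what the models cannot predict.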
At this point, it may be appropriate to discuss some of the initial trends observed from reading these case studies. The points discussed below are intended to assist in studying cases; Leveson and Petroski both discuss how these lessons can be used to assist the design process.
While a single point of failure may trigger an accident, it does not follow that a single point is the cause. Systems failures are the result of complex interactions between designers, operators and systems, and disasters occur when points of failure in all these components interact at once. A single point failure may trigger the accident, but it is often simply the most obvious source. Depending on one's viewpoint, the Therac-25 deaths can be viewed as a consequence of poor software design, a poor user interface, a poor safety culture, or other factors [Leveson93].
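One way to make this concrete is a toy fault-tree calculation. This is not drawn from Leveson; the contributing factors and probabilities below are invented for illustration. The structure, though, mirrors the argument: the disaster requires several independent weaknesses to line up at once, so no single one is "the" cause.

```python
# Toy fault-tree sketch with invented, assumed-independent factors.
p_software_defect = 0.01   # a latent defect is active (invented figure)
p_confusing_ui    = 0.05   # the operator misreads the interface (invented)
p_weak_oversight  = 0.10   # the safety process misses the interaction (invented)

# AND gate: the accident needs all three weaknesses simultaneously.
p_accident = p_software_defect * p_confusing_ui * p_weak_oversight

print(f"{p_accident:.6f}")
```

Removing any one branch drives the combined probability down, which is why fixating on a single "root cause" after the fact understates how many places the accident could have been prevented.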
Because systems failures are so complex, it is likely that at least some of the contributing factors are nontechnical. Managerial problems can be as much of a cause as faulty circuitry: the Rogers commission points out that NASA's safety culture had decayed extensively before the Challenger accident [Rogers86]. Leveson stresses the importance of safety culture in her book, noting that the Bhopal accident was preceded by waves of downsizing and a consequent loss of employee morale [Leveson95]. Failure is multidisciplinary.
On a more morbid note, systems failures often end up in litigation, and sometimes the best place to look for an exhaustive analysis of a system failure is the courtroom or the Senate record.
There are two broad categories of references for this paper. The first list consists of useful case studies and books dealing with case histories. The second list consists of case history repositories.
Of the risk studies available, Leveson's Therac-25 paper [Leveson93] and the Rogers commission study are the best general-purpose case studies available. Of the other books mentioned, Petroski's book is the most extensive argument about the role of case studies in the design process, while Leveson's and Neumann's books are primarily discussions about safety and risk in software systems.
These references point to archives of risk information. Unlike the books mentioned above, this information is essentially raw data; investigators will need to analyze this material and verify it independently of the source.
The most glaring problem with analyzing these case studies is the dearth of usable material. Comp.risks is, unfortunately, an only marginally verified source. Interestingly enough, the call for better case studies is not new; engineers have been discussing it since the 19th century.
There are several tasks necessary for building a good source of case study material. A good taxonomy must be built, and a large collection of primary sources must be archived. To some extent, the comp.risks archive will help any investigator seeking to do this; at least with comp.risks, there's a starting point.