Software Safety

Carnegie Mellon University 18-849b Dependable Embedded Systems Spring 1998 Author: Michael Scheinholtz

Abstract:

High-pressure steam engines were used in the 1800s for their profit potential, despite the high risk of explosion. Now, computers are being trusted to control many highly dangerous systems, sometimes without any hardware backup, because they allow businesses to make money. They appear to be powerful and cheap, but are they well understood? We must look carefully at how we use our technology, or risk even greater peril than that of the high-pressure steam operator.

Safety is an emergent system property, and one component cannot make a system safe. Computers and software add an unpredictable element to the system, but there are a number of ways to deal with safety issues. First, it is important to consider safety from the very beginning of system design, and a safety team, responsible for system safety issues, should be created. Second, extensive safety analysis should be done to identify as many safety issues as possible. This analysis should better prepare the system designers to deal with safety issues, and it might make it easier to figure out which system functions are safety critical. More effort can then be spent on making those identified functions safe and correct. Finally, it is important to use diverse safety mechanisms to increase the chances that one will catch the safety hazard. Although these are not foolproof methods, they are a good starting point for safe system design.

Required Reading:

Leveson, Nancy, "High-Pressure Steam Engines and Computer Software," IEEE Software, October, 1994

Introduction

With the ever increasing power of the microprocessor, software has become a key technology for implementing complex systems. Some of these systems can be quite hazardous to the health of humans or the environment. This paper will examine the driving forces behind this trend, potential problems, and some possible solutions for dealing with software in high risk systems.

The first important issue is why software has become so prevalent in safety critical systems. Software is a very flexible medium capable of expressing a wide range of behaviors. Computers are fast and relatively cheap, and there are some actions in a system that only a computer is capable of performing. These characteristics allow engineers to do things like make jet engines more fuel efficient, help chemical plants be more productive, and take some tedious tasks away from human operators. There is also a certain amount of myth that surrounds software and adds to its popularity. Some of the myths are: software is cheap because you only make it once, computers are more reliable than mechanical devices, and software is easy to change. Most of these apparent features stem from software's flexibility, the speed of modern microprocessors, and the notion that software never wears out. While there is truth to each myth, they all come with a downside. The downside comes from the amount of complexity we now put into software. To make the jet engine fuel efficient, the controlling software becomes very complex. So despite the absence of manufacturing costs or end-of-life wear out, software is becoming the most expensive part of the system to create, and because so much of a system depends on software for control, it can no longer be looked at as a side issue. Now the software is the system.

All high risk systems should be concerned with safety. Safety can be defined as being free of accidents or loss [Leveson95]. The most difficult thing about safety is that it is an emergent property of the system's behavior. An emergent property is one that is not the result of any one sub-system; it is a result of the interaction of many sub-systems. This interaction presents a number of problems to designers because it breaks through the layers of abstraction they use to combat complexity. On a large design team, this means many of the smaller design groups must be able to understand how the system works as a whole. This makes the job of building a safe system much more difficult.

Software presents some new problems for system engineers that do not appear in mechanical systems. Software is inherently imperfect. While the same can be said of a mechanical system, software has a much larger testing space because it has non-linear properties. Verifying a complex piece of software through testing is effectively impossible [Leveson95_2]. It may, at first glance, appear that the best way to write safe software is to make it as reliable as possible. Unfortunately, even perfectly correct software could trigger an accident because the actions a software engineer thought were appropriate really put the system into an unsafe state. Software engineers must understand the safety issues involved with the software they write. System engineers are in a situation where they have highly complex, dangerous systems being controlled by a technology that we cannot easily verify. Until software engineers have more control over software, it will be difficult to make it safe, but it is unlikely that people will stop using software to build systems. It is much too attractive an option, despite the potential risks. Complex, dangerous systems are a fact of modern life; this paper will closely examine the risks behind using software in potentially hazardous systems, and some of the techniques that can be used to reduce that risk.

System Safety

The Importance of Safety

No real system can be made completely safe. The safest jet is the one that does not fly, the safest car does not run, and the safest boat never sets out to sea. Human imperfection creeps into everything we build, but our society still relies on these imperfect machines for its day to day operation. Since complex systems cannot be made completely safe and since we still rely on these systems, we have to settle for degrees of safety. The problem with degrees of safety is that no one wants to be one of the casualties of a machine. People want perfect safety even though they can't have it. So the companies that make potentially dangerous machines must balance how safe they make their systems and how much each system costs against how much money they will lose in lawsuits. Witness the proliferation of warning labels on products as companies attempt to protect themselves from litigation. The government doesn't worry about profit in the same manner as most companies; it worries about the cost of a particular machine, and balances safety against system cost and development time. Safety is a big consideration with any dangerous system. It affects the software, hardware, and the mechanical devices, and should be a major area of understanding for any embedded systems engineer.

Safety Factors

One of the things that makes safety so difficult is that each system has its own unique set of behaviors that are safe. The safety procedures that work on a jet engine controller may not work in a chemical plant. So the safety engineers must understand the system they are working on so they will know what the fail-safe mode for each sub-system is. They must know whether a particular valve should fail open or closed in the event of a power outage, or whether to turn the pumps on or off. This level of understanding requires a great deal of communication between separate groups, and overall system level knowledge.
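One way to make this system-specific knowledge explicit is to record each actuator's fail-safe state in a single table that every design group can review. The sketch below is only illustrative: the actuator names and the states chosen for them are hypothetical, and in a real plant each entry would come out of the hazard analysis rather than out of the code.

```python
from enum import Enum


class ValveState(Enum):
    OPEN = "open"
    CLOSED = "closed"


class PumpState(Enum):
    ON = "on"
    OFF = "off"


# Hypothetical fail-safe table for an imaginary plant. Every entry is an
# assumption made for illustration, not a recommendation for a real system.
FAIL_SAFE_STATE = {
    "coolant_inlet_valve": ValveState.OPEN,   # keep coolant flowing
    "reagent_feed_valve": ValveState.CLOSED,  # stop feeding the reaction
    "coolant_pump": PumpState.ON,
    "reagent_pump": PumpState.OFF,
}


def fail_safe_state(actuator: str):
    """Return the reviewed fail-safe state, refusing to guess for unknown actuators."""
    try:
        return FAIL_SAFE_STATE[actuator]
    except KeyError:
        raise KeyError(f"no reviewed fail-safe state for {actuator!r}") from None
```

The lookup deliberately raises an error for an actuator that is missing from the table, since a silent default would hide exactly the kind of gap in system knowledge described above.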

Safety can be affected by factors that are completely outside of the actual, physical system. Money is a primary factor. There is always a fundamental tension between the desire to make a safe product and the desire to make money. This can be seen in what happened with the Ford Pinto. The Pinto had a defective fuel system: if the car was hit from behind, the fuel tank would sometimes explode. Ford could have fixed this, but it chose not to [Birsch94]. The company either believed the car was safe enough as it was, or crossed its fingers and hoped that there wouldn't be too many lawsuits. It is an unpleasant reality that engineering designs must sometimes be compromised because of business considerations.

Another major external problem software must deal with is the interaction with human users. Humans can have a profound influence on the safety of a system; they spend much of their time interacting with the software. The control program that deals with system operators must balance several important factors. Humans are better at dealing with unique situations, but computers are much better at doing boring, repetitive tasks. Also, humans tend to stop paying attention to a system if it doesn't give them anything to do, and they will be unprepared when an important event occurs. So there must be a middle ground, where the operators do some work, but it is varied and interesting enough that they pay attention to the system. Another important trait for the software is that it must provide usable information to the operator so the operator will be able to act in a time of emergency. Human operators can be very helpful in an emergency situation, given the right information, and every step should be taken to make sure they are prepared when emergencies arise.

Safety is an Emergent System Property

No single component can make a complex system safe. There isn't a safety subsystem that can assure a system will always remain safe. It could be argued that a kill switch that would shut down a system in an emergency is a single safety sub-system, but that isn't really an isolated system. The switch must know when to shut down, and there can't be another part of the overall system that will override the switch, for instance a maintenance mode. Issues like these are what make safety such a hard problem. Safety issues break through all the levels of abstraction, and dealing with them requires understanding the system as a whole.
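The kill switch and maintenance mode interaction can be sketched in a few lines. This is a hypothetical illustration, not a real interlock design: the class names and the `inhibited` flag are assumptions introduced here. Each class does what its own specification says, yet the combination is unsafe.

```python
class KillSwitch:
    """Shuts the machine down on overtemperature, unless it has been inhibited."""

    def __init__(self):
        self.inhibited = False

    def should_shut_down(self, overtemperature: bool) -> bool:
        return overtemperature and not self.inhibited


class MaintenanceMode:
    """Lets a technician run the machine with the interlock bypassed."""

    def enter(self, switch: KillSwitch) -> None:
        switch.inhibited = True

    def leave(self, switch: KillSwitch) -> None:
        switch.inhibited = False


switch = KillSwitch()
tripped_normally = switch.should_shut_down(overtemperature=True)

MaintenanceMode().enter(switch)
tripped_in_maintenance = switch.should_shut_down(overtemperature=True)
```

Here `tripped_normally` is true and `tripped_in_maintenance` is false. Neither class contains a bug when reviewed alone; the hazard only appears when the two are examined together, which is the sense in which safety is emergent.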

Software Accidents

Software has become responsible for most of the critical functions of complex systems. It's everywhere: inside toasters, nuclear power plants, airplanes, and space ships. Since it controls the system, it has the potential to cause a great deal of harm. Software can cause harm for two main reasons: it may have been erroneously implemented, or it may have been designed incorrectly. There is an important distinction between the two. Software that was designed incorrectly most likely had incorrect requirements; there was something about the system's environment that the designers didn't understand or didn't anticipate. Erroneously implemented software, meaning software that deviates from the requirements, can produce incorrect responses in known or unknown situations. Both can cause serious safety problems, and both are nearly impossible to eliminate. What follows is a more detailed examination of the unique safety problems software presents.

Implementation Problems

Buggy software presents a barrier to making truly safe systems. Despite the high reliability of most computer chips, software allows for a new and different set of reliability problems. Software sub-systems have a much broader set of failure modes than the mechanical subsystems they replace, which makes verifying them extremely difficult.

Verifying a mechanical device requires much less effort than verifying a piece of software. Mechanical devices are roughly linear; if you test the endpoints and the middle of an input range on a mechanical device you can get a pretty clear picture of how well that device functions. This may be a somewhat simplistic example, but it is true that mechanical systems are much better behaved than software. It is possible to just double the strength of every mechanical component and assume that everything will work out. This cannot be done in software because software is non-linear. Any potential input to a software artifact could cause a failure, and the degree of failure does not relate to the amount of error in an input. One flipped bit can crash an entire operating system, but the same piece of software may deal well with completely corrupt packets. This nonlinearity makes complete software testing impossible in practice. A program that takes one 32 bit integer input would require 2³² tests to prove by testing alone that it is correct.
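A small sketch can make both points concrete. The routine below is hypothetical and contains one deliberately planted defect; the test rate of one million tests per second is an arbitrary assumption used only to put the size of the input space in perspective.

```python
def scale_reading(raw: int) -> int:
    """Hypothetical sensor scaling routine with one deliberately planted bug."""
    if raw == 123_456_789:  # a single bad input out of 2**32
        raise ZeroDivisionError("planted defect")
    return raw // 2


# Endpoint-and-midpoint testing, which says a lot about many mechanical
# parts, passes here even though the routine is defective.
for sample in (0, 2**31, 2**32 - 1):
    assert scale_reading(sample) == sample // 2

# Size of the exhaustive test space for one and for two 32-bit inputs.
tests_one_input = 2**32
tests_two_inputs = 2**64

# Assumed rate: one million tests per second (illustrative only).
seconds_per_year = 365 * 24 * 3600
minutes_for_one_input = tests_one_input / 1_000_000 / 60
years_for_two_inputs = tests_two_inputs / 1_000_000 / seconds_per_year
```

Under that assumed rate, one 32-bit input takes a little over an hour to test exhaustively, while two 32-bit inputs take several hundred thousand years. Real programs also carry internal state, which enlarges the space further.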

However, software is not an inherently flawed technology. It just happens to be a very flexible medium with a quick turnaround time for modifications. The side effect of this flexibility is that much of a system's complexity is put into the software, and modern software engineering is not yet up to the challenge of dealing with complex software systems.

Software engineering is slowly moving to deal with the unique problems of software, but its progress has fallen far behind the advanced functions software is capable of performing. The lure of flexibility and feature creep can turn a clean, well designed program into an unintelligible mess. It's a situation that's quite similar to the problems with high pressure steam engines in the 1800s [Leveson94]. A very powerful, profitable technology, the high pressure steam engine, was used despite the safety hazard associated with it: frequent boiler explosions. Scientists and engineers had only a vague understanding of how a boiler worked; their knowledge lagged behind the demand for high pressure steam engines. Eventually regulation, improved scientific understanding, and a dissemination of that scientific knowledge to the boilermakers reduced the hazards of high pressure steam engines. Software engineering must catch up with the speed and power of computers, or we will continue to deal with software explosions.

Requirements Problems

Perfect software does not mean safe software, and buggy software can still operate without producing safety hazards. Requirements misunderstandings can lead to some of the most difficult safety problems. Picture a valve controlling the flow of a liquid into a chemical reaction. In an emergency the software controlling the valve performs its safety function and dutifully closes the valve. The software worked correctly, as far as the software engineer was concerned. The only problem was that the valve was supposed to be left open in an emergency. This requirements problem would escape all manner of formal proof, and any software testing method based on the requirements document. Problems like these emphasize the importance of safety analysis and system level testing.
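The valve scenario above can be sketched as follows. Everything here is hypothetical: the function names, the wording of the requirement, and the constant recording what the process really needs are assumptions made to illustrate why a requirements-based test cannot expose this kind of error.

```python
from enum import Enum


class Valve(Enum):
    OPEN = "open"
    CLOSED = "closed"


def valve_command(emergency: bool, requested: Valve) -> Valve:
    """Implements the written requirement: 'close the valve in an emergency'."""
    return Valve.CLOSED if emergency else requested


def passes_requirements_test() -> bool:
    """A requirements-based test can only confirm agreement with the document."""
    return valve_command(True, Valve.OPEN) is Valve.CLOSED


# What the process actually needs. In this hypothetical, the plant engineers
# knew it but it never reached the requirements document.
PHYSICALLY_SAFE_IN_EMERGENCY = Valve.OPEN

software_is_correct = passes_requirements_test()
software_is_safe = valve_command(True, Valve.OPEN) is PHYSICALLY_SAFE_IN_EMERGENCY
```

The implementation is correct with respect to its specification (`software_is_correct` is true) and unsafe with respect to the plant (`software_is_safe` is false). No amount of testing or proof against the written requirement changes that; only knowledge from outside the document can.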

A great emphasis must also be placed on communication between project groups. The software engineers must understand the system they control, and the other system engineers must understand the software. Without outside input it would be impossible to generate good requirements for the software no matter what analysis method is used.

Designing Safe Software

Software system safety is the notion that software will execute within a system context without contributing to hazards. Safe software design can be a challenging task for any project. No technique in use now can guarantee the safety of a design, but some can increase the probability of having a safe design at the end of the development cycle. Most of the design techniques deal with reducing complexity, promoting system wide understanding, and compensating for the idiosyncrasies of software.

Safety Management

Good management is critical to having a successful safety plan. Management must promote the organization's safety culture. Having a strong safety culture means that everyone in the organization, from project managers to administrative assistants, must buy into safety and be safety conscious while doing their jobs. The safety culture must also extend from the people who designed the product to the people who operate and maintain the product. If the safety culture declines, and people become sloppy when doing their jobs, major disasters can result. The Challenger disaster can be partially attributed to a decline in the safety culture at NASA. NASA had become somewhat complacent and overconfident in its systems. It was on a tight shuttle launch schedule and ignored many of the warnings about potential problems with the booster rockets. Had a stronger safety culture been in place, NASA might have decided to fix the booster rocket engines rather than continue the Challenger launch. Leveson documents some of the Challenger disaster in her book [Leveson95_4], and a copy of the Rogers Commission report is available here [Rogers].

Managers should also foster communication between project teams. Every team must understand how their part of the project affects system safety. Without communication, teams may not have a clear enough picture of the system to make informed decisions about safety. Since safety is an emergent property, some of the boundaries between sub-systems must be blurred to help safety. Also, a great deal of formal analysis must be done to try and catch as many safety situations as possible before the system is implemented. This can only be done with clear communication between groups.

System Level Design

The first part of safety design is realizing that safety must be dealt with at the system level. Each design team on a project must be made aware of the safety issues they face, and a small group of engineers with system wide knowledge is needed to monitor safety issues. This group of safety engineers should have an understanding of the system as a whole, and have system safety as their primary goal. They must also have enough control of the design process to change the design if safety issues arise. Without such power, the safety engineers will be unable to deal with safety problems if those problems conflict with the design schedule or profit margins.

The second principle is that safety should be considered from the start of the design process. It's much easier to change ideas before a system has been built. In other words, the worst, most costly mistakes are made on the first day of design. Changes are always needed in a complex project, but the changes get more expensive later in the design cycle. Every important architectural change should be examined by the safety engineers for possible impact on system safety.

Complexity

Complexity is the core challenge of any large project. Unfortunately, many of the hazardous systems we create are quite complicated, and our society depends on its nuclear power plants, oil refineries, jet planes, and automobiles. These systems become more complex with every generation. Modern industrial societies seem to drive the trend towards more and more complicated systems, and the system engineers are forced to confront the problems that come with the added complexity.

Complexity affects the project at all levels, and any method that can reduce the complexity of the design should be considered. Safety is an extremely complex property because no one subsystem is responsible for it, and it is completely relative to the environment in which the system operates. There is nothing close to a generalized way to make a system safe, especially in a unique product. Software has thrown another level of complexity into modern systems because it is so difficult to verify that it works properly. Our ability to correctly design software does not match the power a software system can wield. It may seem that software has no power in a system, but the mechanical devices it controls can produce deadly effects if they are controlled improperly.

It is clear, however, that simple systems are easier to make safe than complicated systems. Any technique that can reduce the amount of safety related functionality in a system is useful. Keeping the system software simple will make verification easier, and may allow the designers to formally verify some parts of the safety system. A simple, clean design is also much easier to modify than a complicated one. Of course sometimes the safest thing to do in an emergency situation is not the simplest, and even if software is correct it may still encounter situations for which it has no correct action. Identifying the safety critical sections of the software and focusing the development and testing effort on them can be beneficial in light of schedule problems. The trouble comes when the wrong safety critical functions are chosen.

Diversity

Software alone should not be depended on to keep a system safe. Diversity in system safety features can help make up for software's frailties. There are many ways to have diversity in a system, and all of them should be considered, if feasible. The goal of diversifying the system is to gain failure independence. That is, if you add redundant systems that fail independently of each other, the overall reliability of a system will increase. The same principle applies to safety. If you have redundant safety systems that have failure independence, it's more likely that one of the systems will work and do the correct thing in an emergency.
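The arithmetic behind failure independence is worth seeing once. The sketch below assumes a deliberately simple model that is not taken from the source: each redundant system fails on demand with a given probability, and an optional shared event defeats all of them at once. The probabilities used are hypothetical.

```python
from math import prod


def probability_all_fail(failure_probabilities):
    """Probability that every redundant system fails, assuming independence."""
    return prod(failure_probabilities)


def probability_all_fail_common_cause(failure_probabilities, common_cause):
    """Same, except that with probability `common_cause` one shared event
    (for example a shared requirements error) defeats every system at once."""
    independent_part = probability_all_fail(failure_probabilities)
    return common_cause + (1 - common_cause) * independent_part


# Two safety systems, each assumed to fail on 1 demand in 100.
independent = probability_all_fail([0.01, 0.01])
with_shared_cause = probability_all_fail_common_cause([0.01, 0.01], 0.001)
```

With truly independent failures the pair fails together on about 1 demand in 10,000. Adding even a 1 in 1,000 shared cause makes the combined figure roughly ten times worse, which is why the redundancy only pays off when the failure modes really are independent.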

One method would be to diversify the software. There is some controversy over the effectiveness of things like N-version programming [Leveson95_3], where two or more different teams implement a piece of software from the same set of requirements. I believe that having two pieces of software implemented with different levels of complexity in mind, i.e. different requirements but the same functionality, may be helpful. Jet engine controllers could use one loop as the highly optimized, fuel efficient software control loop, and also have a very simple backup software control loop that is easy to test and possibly prove.
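A minimal sketch of that idea follows, with made-up control laws and an assumed output envelope of plus or minus 1.0. It is not how any real engine controller works; it only shows the structure of an optimized loop supervised by a simple one.

```python
def complex_controller(speed_error: float) -> float:
    """Stand-in for the optimized loop, assumed too complex to verify fully."""
    return 0.8 * speed_error + 0.05 * speed_error ** 3


def simple_controller(speed_error: float) -> float:
    """Plain proportional control, small enough to test thoroughly or prove."""
    return 0.5 * speed_error


def select_fuel_command(speed_error: float, limit: float = 1.0) -> float:
    """Use the optimized command only while it stays inside the envelope."""
    command = complex_controller(speed_error)
    if abs(command) <= limit:
        return command
    # Fall back to the simple loop and clamp its output to the envelope.
    fallback = simple_controller(speed_error)
    return max(-limit, min(limit, fallback))
```

The safety argument then rests on `simple_controller`, the envelope check, and the clamp, all of which are short enough to review line by line, rather than on the optimized loop.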

Software should also be used in conjunction with other types of safety systems. The most obvious choice is to use mechanical backups along with the safety software. Removing the mechanical backups and depending solely on software was part of the reason for the Therac-25 incident [Leveson95_5]. Mechanical devices fail differently than software and are easier to test. I do not believe there is ever a justifiable reason, at least from a safety standpoint, to remove them. It's usually about cost.

There can even be more diversity in the sensing systems. Alarms can be set off not just by redundant temperature sensors, but by temperature or pressure sensors. The diversity in what each sensor is reading may help the safety systems discover unsafe states more quickly.
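As a sketch of sensing diversity, the function below raises an alarm from either of two different physical quantities. The thresholds and units are hypothetical, and taking the middle temperature reading is an assumption added here to tolerate one bad sensor.

```python
def alarm(temperatures_c, pressure_kpa,
          max_temperature_c=350.0, max_pressure_kpa=900.0):
    """Raise the alarm if either physical quantity indicates an unsafe state."""
    # Middle reading of the redundant temperature sensors (upper middle for
    # an even count), so a single wild reading does not decide the outcome.
    ordered = sorted(temperatures_c)
    middle_temperature = ordered[len(ordered) // 2]
    too_hot = middle_temperature > max_temperature_c
    over_pressure = pressure_kpa > max_pressure_kpa
    return too_hot or over_pressure
```

If all three temperature sensors share a fault and read low, for example `alarm([20.0, 20.0, 20.0], 950.0)`, the pressure reading still trips the alarm. Redundant sensors of the same kind would not have helped in that case.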

Use as much diversity in a system as can be afforded. It's one of the most effective ways to increase system safety and reliability.

Tools and Metrics

Tools

Few tools exist to help the software engineer cope with safety problems, and many of the traditional methods of software engineering aren't effective. What follows is a brief discussion of several software engineering methods and how they can or cannot deal with safety issues.

Safety Analysis: Many safety analysis methods exist to help designers identify potential safety problems. None of these methods will find every single potential hazard, but they help. Some of the methods, such as fault tree analysis, can be used to isolate the parts of the software that can directly cause an unsafe state. These sections are called the safety critical functions.
Formal Methods: Formal methods can be useful, but they can only make the quality of the software better. They cannot make it safe. Safe software requires correct requirements. Formal methods are also somewhat limited in their scope, but they certainly can be applied to safety critical functions to make sure those functions work as intended.

Software Reliability: Any method that can improve the reliability of software is helpful, but it is important to remember the difference between safety and reliability. Just because a system is reliable-- meaning it is operational for long periods of time -- does not make it safe. It is quite possible that when the software does fail it will put the system into an unsafe state, or even if it is operating correctly according to its requirements it may still perform an unsafe behavior. Safety does not come from correct software alone; it comes from understanding the system as a whole.

Software Fault Tolerance: Software fault tolerance has been a hotly contested topic, and there is a great deal of disagreement over its effectiveness.

Software Safety Standards: Standards have their good and bad points. A standard can provide a point of reference for safety development, and a project that follows a respected standard may be less liable in a lawsuit. The downside of safety standards, at least for software, is that they are usually only concerned with processes. They tend to specify the process that should be used to construct a safe system; the standards do not actually say anything about the final product [Fenton98].
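The fault tree analysis mentioned under Safety Analysis above can be sketched with two gate functions. The tree, its event names, and the per-demand probabilities are all hypothetical, and the gates assume independent input events, which a real analysis would have to justify.

```python
from math import prod


def and_gate(*probabilities):
    """Output event occurs only if all input events occur (independent inputs)."""
    return prod(probabilities)


def or_gate(*probabilities):
    """Output event occurs if at least one input event occurs (independent inputs)."""
    return 1 - prod(1 - p for p in probabilities)


# Hypothetical basic-event probabilities per demand, for illustration only.
sensor_fails = 1e-3
software_misses_alarm = 1e-2
relief_valve_sticks = 1e-4

# Top event "vessel overpressure is not relieved": the automatic shutdown
# fails (sensor OR software) AND the mechanical relief valve also fails.
shutdown_fails = or_gate(sensor_fails, software_misses_alarm)
top_event = and_gate(shutdown_fails, relief_valve_sticks)
```

With these made-up numbers the top event comes out at roughly one in a million demands. The more useful result for software engineers is structural: any software function that appears as a basic event in the tree, here the alarm handling, is by construction one of the safety critical functions.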

Metrics

Safety verification is extremely difficult. Since most interesting systems are complex, and safety comes from the interaction of many system components, it's pretty much impossible to verify how safe a system is. There are some general methods that can be used. I will group them by whether or not they require the execution of the program.

Dynamic Analysis: Dynamic analysis requires the execution of the software to check all of the system's safety features. It has the ability to catch unanticipated safety problems, but it cannot prove that a system is safe. Some examples would be injecting a fault into a system to see how it responds, general software testing, or user testing.
Static Analysis: Static analysis examines the code and design documents of the system without running it. It is similar to a structured code review. A system can be proven to match its requirements, but static analysis will not catch any unsafe states that the requirements miss. Some examples are fault tree analysis, HAZOP, or formal code proving methods.
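Fault injection, one of the dynamic techniques listed above, can be sketched on a toy heater controller. The controller, its plausibility range, and the fault model are all assumptions made for this illustration.

```python
import math


def heater_command(temperature_c: float, setpoint_c: float = 80.0) -> bool:
    """Return True to switch the heater on. Invalid readings force it off."""
    if math.isnan(temperature_c) or not (-40.0 <= temperature_c <= 200.0):
        return False  # fail-safe: an implausible reading never enables heat
    return temperature_c < setpoint_c


def inject_faults(reading: float):
    """Yield corrupted versions of a good reading (hypothetical fault model)."""
    yield float("nan")      # sensor disconnected
    yield -273.15           # stuck at an impossible low value
    yield reading * 1000.0  # scaling or unit error
    yield float("inf")      # overflow in conversion


# The safety property under test: no injected fault may turn the heater on.
heater_stays_off = all(not heater_command(bad) for bad in inject_faults(20.0))
```

A controller that only compared the reading with the setpoint would switch the heater on for the stuck-low value, and the injected fault would reveal it. As with any dynamic method, passing this check shows only that these particular faults are handled, not that the system is safe.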

Relationship to other topics

Software safety relates to any area involving software design and verification, and any area where safety issues are important. That is a very broad statement, but safety is a big issue in many embedded systems.

Verification, Validation, and Certification: Software is notoriously difficult to verify and safety is also quite difficult. Yet safety certification is a critical issue both in legal matters-- did the company absolve itself from liability by following the safety standard-- and in systems that are unique.
Social and Legal Concerns: Much litigation revolves around system safety, and therefore software safety. It is impossible to make a system completely safe, so often companies will allow their legal departments to determine how safe is safe enough. This is probably not the most appealing way to deal with safety, but it is a situation which will confront every embedded system engineer at some point in their career.

Safety Critical Systems Analysis: Safety analysis deals with understanding what a system needs to do in order to be safe. This is critical information needed to make software safe.

Conclusions

Safety is an extremely complex issue that depends on many factors inside and outside a system. Software is a relatively new technology that is becoming increasingly responsible for the safety of a system. Unfortunately, it is hard to make correct software and even harder to make safe software. Correct software is hard because of the complexity engineers thrust into software and its inherent nonlinearity. Safety is hard because it is an emergent property of the system, and it depends on software. The difference between safe and correct software can clearly be seen when a piece of software that is safe for one system is moved to another. Because the environment changes, problems with the software that were hidden on the first system appear in the new system. Even perfect software may not be safe because the requirements used to develop it may be wrong. Understanding the requirements for safety is key to making safe software, but that is a very hard problem. It requires the software engineers to understand the system, and the other engineers to understand the software.

Software safety is a huge problem that is becoming more critical due to the increasing number of dangerous systems controlled by software, but there are some techniques that can help. Diversity in safety systems can make the safety mechanism more reliable and more likely to detect an unsafe state. Depending on software alone to handle safety can be disastrous; see the Therac-25. With diversity, the strengths of one technology can be used to hide the weakness of another. This gives failure independence. Software is better at performing complicated tasks than a mechanical system, but the mechanical system can be more effective in simple situations because it can be free of software's complexity.

People play a key role in software safety in many different ways. First, humans create the software, and software is almost exclusively an artifact of human intelligence. It has no moving parts, and the laws of physics don't apply to it. It is an abstraction, and the only way to really understand it is to understand humans. Humans also play a key role as operators of the systems. A software control system must balance how much responsibility is given to the user so the user will be attentive but not overly stressed should anything unusual occur.

There is no way to measure software safety. This is an unfortunate problem because most consumers want to know how safe the products they buy are. The best we can do, since we cannot verify the safety of a product just by examining it, is to try to make sure the process used to create the software was sound. Structured safety reviews can be done, along with formal analysis of safety problems and of the software itself. The hope behind all of this focus on methodology is that it will produce safe software. This isn't the ideal solution but it's the best we have at the moment. Safety is really much more art than engineering. We can't tell how safe a system is until it has been built and run out in the field for a long period of time. It's almost necessary to wait for the end of a product's life to determine its safety.

Every effort should be made to ensure that a safety critical project is designed by a talented, experienced team of safety engineers. This is certainly not a general solution, as there will always be a shortage of talented safety engineers, but it is something to keep in mind. The importance of ability cannot be overestimated.

Software safety is a problem that will continue to plague computerized societies in the near future. The only way to overcome this issue is to recognize the limits of our ability to understand software and use it with caution until we understand it better. Once we have a better understanding of software and its impact on system safety, we will be better able to use it in safety critical systems.

Annotated Reference List

[Leveson95_2] Leveson, p. 33

[Leveson95_3] Leveson, p. 435

[Leveson95_5] Leveson, p. 569

[Leveson95_4] Leveson, p. 515

The [Leveson95] citations all refer to Leveson's book Safeware, an excellent book that covers many different aspects of safety engineering. I've focused on the sections concerned with software in safety critical systems, and some of the case studies.

[Hansen98] Hansen, Kirsten M., Ravn, Anders P., Stavridou, Victoria, From Safety Analysis to Software Requirements, IEEE Transactions on Software Engineering, Vol. 24, No. 7, July 1998

A decent overview of the different techniques that can be used for software requirements in safety critical systems.

[Murphy98] Murphy, Niall, Safe Systems Through Better User Interfaces, Embedded Systems Programming, Vol. 11, No. 5, August 1998

A good description of how the user can affect safety in a system.

[Fenton98] Fenton, Norman E., Neil, Martin, A Strategy for Improving Safety Related Software Engineering Standards, IEEE Transactions on Software Engineering, Vol. 24, No. 11, November 1998

A discussion of the weaknesses of most software safety standards and ways they can be improved.

[Bowen93] Bowen, Jonathan, Stavridou, Victoria, Safety-Critical Systems, Formal Methods and Standards, IEE/BCS Software Engineering Journal, 8(4), pp 189-209, July 1993

Discusses how formal methods can be used to improve the quality of safety critical systems.

[Leveson94] Leveson, Nancy, "High-Pressure Steam Engines and Computer Software," IEEE Software, October, 1994

This paper gives an excellent encapsulation of the problems caused by using computer software in safety critical systems. It does this by comparing problems with computer software to the problems caused by high pressure steam engines in the 1800s.

[Birsch94] Birsch, Douglass, Fielder, John H., The Ford Pinto Case: A Study in Applied Ethics, Business, and Technology, Albany, NY: State University of New York Press, 1994

A number of papers that discuss the Ford Pinto fuel tank incident.

[Rogers] http://www.ksc.nasa.gov/shuttle/missions/51-l/docs/rogers-commission/table-of-contents.html

The findings of the committee investigating the Challenger disaster.

Loose Ends and other Information

A requirements specification done by Nancy Leveson:
Leveson, Nancy G., Heimdahl, Requirements Specifications for Process-Control Systems, IEEE Transactions on Software Engineering, Vol. 20, No. 9, pp 684-707, September 1994

Oxford Page on safety critical systems by Jonathan Bowen.
http://www.comlab.ox.ac.uk/archive/safety.html

A web copy of the Rogers Commission's report on the Challenger accident:
http://www.ksc.nasa.gov/shuttle/missions/51-l/docs/rogers-commission/table-of-contents.html
