Safety Critical Systems Analysis

Carnegie Mellon University 18-849b Dependable Embedded Systems Spring 1998 Authors: Robert Slater

Abstract:

Safety analysis is a method for evaluating the hazards and risks posed by a system and ways to minimize them. Many guidelines exist to guide safety analyses, but all study two main areas. Hazard analysis is the first stage, in which the system is studied for situations in which potential harm could result, and the frequency with which those situations occur. Risk analysis is the second stage, in which the possible outcomes of the hazards and the frequency of appearance of each outcome are determined. This allows sources of potential harm in the system to be prioritized and dealt with to increase the safety of the system. Many standards exist for acceptable levels of safety in different industries, but sometimes it is a judgement call as to when the system is safe enough. In many cases, the best safety analyses are performed by those expert in the analysis techniques, and novices are best tutored in the techniques before performing them independently. For embedded systems in which there is the potential for harm to a person or the environment, safety analysis can be a useful way to quantify that potential and minimize it, but its most effective use lies in the hands of those familiar with it.

Introduction

Safety critical systems exist all around us, from nuclear power plants to chemical processing plants to heart monitors and emergency phone systems such as 911 in the United States. The industries that support these systems have put a great deal of deliberation and thought into making their systems as safe as possible, both in providing their designed function and in preventing their malfunctioning or defects. What remains a difficult task, however, is quantifying the safety inherent in a system. Safety is a nebulous concept, and is therefore difficult to define or measure. Should safety be measured in the amount of harm done, or perhaps a ratio of the amount of harm done vs. the potential to do harm? Over what time span does one measure the harm a system can do? And how can one system be said to be safer than another? While these industries have no universal answer to these questions, they do have a collection of techniques which helps address them. Known as safety-critical systems analyses, these techniques can be used to assess the level of safety inherent in a system, and possible improvements that can be made. Furthermore, they take steps towards addressing those questions of measurability and quantification that seem so intractable.

This analysis typically takes two forms, hazard analysis and risk analysis. Hazard analysis is the examination of a system for potential to cause harm. In it the system or a model of the system is examined for ways in which it can cause harm or dangerous situations. Risk analysis examines the potential damage that can result from the hazards present in a system.[Storey] It examines the types of harm that can occur in hazardous situations caused by the system and their likelihood of occurrence. When tied together, the two forms of analysis can provide a detailed and potentially prioritized list of the potential harm that a system can cause. That list can be used in an iterative design process to refine the system and add safeguards for particularly dangerous outcomes. It is the combination of these two forms of analysis that proves most effective, and it is their use throughout the design process that produces the safest systems.

For embedded systems designers, safety analysis is relevant because of the increasing automation of safety critical systems. The increases in adaptability and response time are too attractive not to include. However, as we are increasingly finding in systems in the field, the added complexity of automation causes unsafe conditions to be overlooked or improperly protected against. As projects like the drive-by-wire car and automated emergency-call systems become feasible, embedded system designers will need the methodology of safety-critical analysis to ensure that their systems meet the safety requirements for these kinds of applications. They introduce guidelines, quantification, and methodical process to an unclear aspect of design that embedded designers may not initially be equipped to handle on their own.

Key Concepts

As stated above, there are two main branches of safety critical system analysis: hazard analysis and risk analysis. We will discuss both of these, in addition to the fuzzy human factors that make the area difficult to approach.

Hazard Analysis

A hazard is a situation in which there is actual or potential danger to people or the environment.[Storey] Hazard analysis, accordingly, is a method for examining a system to determine how it can cause hazards to occur, and, in some cases, how to prevent those hazards from occurring. While the actual techniques may vary in their approaches, they all have certain aspects in common. They typically will have a suggested model of the system to use, which hopefully exposes the activity and components of the system in a meaningful way so as to examine them for hazards. They will have a method of examining the different parts of the model that is systematic and attempts to be as complete as possible. This methodology will typically produce its results in a standard format, so that they can be read and interpreted without retracing the preparer's thinking. Lastly, the analysis technique may have additional guidelines for the process of making the analysis.

The system model is one of the most crucial parts of the analysis, second only, perhaps, to the actual method of examination. It places limits on the way the system is examined, and those limits can be fatal weaknesses at the heart of an analysis. Limits, in the forms of levels of abstraction, are necessary to allow people to perform the analysis, as humans are bad at dealing with high levels of complexity. The danger exists, however, that potential hazards will be hidden in the abstractions, and thereby go unexamined. It is important, therefore, that the model for the system extend to the appropriate levels of detail as well as supporting analysis at higher levels of abstraction.

The next portion of hazard analysis is the actual mechanism of analysis. Typically this will involve taking apart the system model and examining each portion and interaction of the model for hazards that might be caused by that component. This methodology should address every portion of the system, in addition to ensuring that each section is examined adequately. Some methodologies will go so far as to have checklists or forms to fill out for each component or hazard. Once the examination has been performed, these forms, diagrams, or other tools of the method can be used to quickly summarize the result. This organized visual feedback can be the most powerful part of the analysis, making the results of the study available and understandable with a much lower investment of time and effort.

Finally, the more developed and involved methods of analysis will have guidelines of a greater scope, concerning the process rather than the procedure of the analysis. In many cases a team is specified to perform the analysis, preferably with multiple areas of expertise so as to cover all aspects of the system. The amount of time spent in meetings and the frequency of meetings are even specified in some types of analysis. The important things that these guidelines provide are expertise in the proper areas and diligence in performing the analysis. A team can double-check the results of one member and discuss disputed points, while limits on meeting time prevent overwork and exhaustion, so that the team can apply the proper concentration to the analysis. Also, there is typically an admonition to employ the analysis throughout the design cycle. Early use of analysis can prevent the large hazards, and continued use will further refine and improve the system.

There also exists a branch of hazard analysis called probabilistic hazard analysis, which attempts to place a chance of occurrence on each hazard in addition to identifying it. This can be based on field data, component lifetimes, standards, or any other numerical data which give some idea of the conditions the system is likely to be placed in, and the behavior of the system in those conditions. Structural and mechanical analyses tend to use load and capacity distributions to calculate probabilities of failure, but a large number of statistical methods exist for this kind of analysis.[Blockley] What is important to know is the statistical reasoning behind the method used, and that the end result is the capacity to know to some degree of accuracy the probability of a hazard actually occurring with the system in operation.
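The load and capacity calculation mentioned above can be illustrated with a minimal sketch. It assumes, purely for illustration, that load and capacity are independent and normally distributed; the function names and the numbers are hypothetical and are not taken from [Blockley]. Under those assumptions the safety margin (capacity minus load) is itself normal, and the probability of failure is the probability that the margin falls below zero.

```python
import math


def normal_cdf(z: float) -> float:
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def failure_probability(mean_capacity: float, sd_capacity: float,
                        mean_load: float, sd_load: float) -> float:
    """P(load > capacity), assuming independent normal load and capacity."""
    margin_mean = mean_capacity - mean_load
    margin_sd = math.sqrt(sd_capacity ** 2 + sd_load ** 2)
    return normal_cdf(-margin_mean / margin_sd)


# Hypothetical component: capacity 500 +/- 40, load 350 +/- 30 (same units).
p_fail = failure_probability(mean_capacity=500.0, sd_capacity=40.0,
                             mean_load=350.0, sd_load=30.0)
print(f"Probability of failure per demand: {p_fail:.5f}")
```

With real data the distributions are rarely this well behaved, which is one reason the statistical reasoning behind the chosen method matters more than the resulting number.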

Risk Analysis

Risk is a combination of the frequency or probability of a specified event, and its consequence. Risk analysis is the counterpart to hazard analysis, taking a list of hazards and producing a list of possible outcomes and their likelihood of happening. Classically, probabilistic risk analysis is used to describe this process, while risk analysis refers only to the examination of outcomes. In common usage, however, risk analysis without probabilities is hardly ever performed, and so risk analysis is typically used to refer to the combination of the two.[Blockley]

The first part of risk analysis is an examination of the possible results of a hazard. Many hazards can produce a range of actual harm done, and so each hazard must be examined to determine the possibilities. Often these can be categorized into different levels of harm, distinguished by the amount of harm done.[Storey] Multiple fatalities would be more serious than a single fatality, a fatality more serious than a major injury, and so on. In this way the analysis can be speeded up by needing to be less precise without losing the major insight as to how much harm can be done.

Once this enumeration of possibilities has been done, the analysis can proceed into its probabilistic phase. Each potential harm is associated with a probability of occurring. Much like probabilistic hazard analysis, the numerical data can be derived from field data, component lifetimes, and other sources, and manipulated statistically. These calculations tend not to be absolutely correct, but the relative probability of two events is typically preserved. Thus, if an injury occurs twice a year and a fatality once every five years according to the calculations, while those might not be the actual observed rates of occurrence, injuries should tend to occur more often than fatalities. A similar categorization can be performed as above, in which ranges of probability are lumped together. Again, the intuition as to frequency of occurrence is preserved, even if the absolute numbers are not. Thus each hazard has a list of possible harms done and probabilities of each.

The combination of these two values can be an extremely valuable tool for prioritizing further work and determining when the system is safe enough. Highly damaging and likely hazards can be addressed in the refinement of the design before rarer or less dangerous ones. Not only are resources used more efficiently to improve safety in this way, but they are also targeted at the problems most likely to cause the system to be dangerous. It can also be used to establish the break-even point, when the cost of developing further safety is greater than the cost of dealing with the harm caused by the hazard. Once this line is crossed where the system is 'safe enough', further safety work can be held off until the next design cycle.
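The prioritization described above can be sketched in a few lines. The severity weights, the frequency class boundaries, and the rule of multiplying the two are assumptions made for this illustration; actual standards define their own categories and risk matrices. The first two outcomes reuse the rates from the example above (an injury twice a year, a fatality once every five years); the third is hypothetical.

```python
from dataclasses import dataclass

# Hypothetical severity categories, ordered as in the text.
SEVERITY_WEIGHT = {
    "minor injury": 1,
    "major injury": 2,
    "single fatality": 3,
    "multiple fatalities": 4,
}


def frequency_class(events_per_year: float) -> int:
    """Lump a calculated rate into a coarse class (boundaries are assumed)."""
    if events_per_year >= 1.0:
        return 4  # frequent
    if events_per_year >= 0.1:
        return 3  # probable
    if events_per_year >= 0.001:
        return 2  # occasional
    return 1      # remote


@dataclass
class Outcome:
    hazard: str
    severity: str
    events_per_year: float

    def risk_score(self) -> int:
        """Combine severity and frequency class into one ranking number."""
        return SEVERITY_WEIGHT[self.severity] * frequency_class(self.events_per_year)


def prioritize(outcomes):
    """Return outcomes ordered from highest to lowest risk score."""
    return sorted(outcomes, key=lambda o: o.risk_score(), reverse=True)


outcomes = [
    Outcome("operator caught in mechanism", "major injury", 2.0),
    Outcome("guard interlock bypassed", "single fatality", 0.2),
    Outcome("sharp edge on housing", "minor injury", 0.01),
]
ranked = prioritize(outcomes)
for o in ranked:
    print(o.risk_score(), o.hazard, "->", o.severity)
```

Note that the rarer fatality outranks the more frequent injury under these particular weights; a different weighting could reverse the order, which is exactly the kind of judgement the analysis leaves to its practitioners.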

Fuzzy Societal Factors

There are shortcomings, however, to safety analysis, and the greatest are the poor definition of the problem to be solved, the heavy dependence upon the expertise of the analysts, and the indeterminate nature of the results of the investigation. Safety is not an easily measurable or definable property of a system, and therefore it is difficult to separate the safety properties or components of a system from the normal functioning ones. Since in many cases safety is an emergent property of the system as a whole, and dependent upon its interaction with its environment, which may be changing or ill-defined, the task of isolating and describing the 'safety' of the system becomes incredibly difficult. Those performing the analysis must have insight into the operation of the system and what it means to be safe, and the quality of the analysis is directly dependent upon the quality of this insight. If the analysts are inexperienced, or not familiar with the ways in which systems present risks, then the analysis will suffer, and no methodology can completely make up for shortcomings of those performing the analysis. Finally, once the results have been gathered, because of the poorly defined nature of safety, it is difficult to interpret the results or come to any conclusions regarding their correctness. All too often the true test of how safe a system is occurs in the field, and the best measure we have is the hazards and harm it does cause, rather than the ones the analysis has prevented.[Leveson]

It is difficult to answer the question of what safety is, or how it applies to a specific system. One definition is that it is a property of a system that it will not endanger human life or the environment.[Storey] Defining danger to human life or the environment, though, is an exercise in enumeration, rather than a procession from basic principles, and drawing the line between what is dangerous and non-dangerous is often a matter of debate. Is one particle per million of a certain chemical dangerous or not? Is 55 m.p.h. the best speed limit, or does the improved performance of modern cars warrant an increase in the speed limit? What is under debate here is the tradeoff between safety and other concerns, such as cost or performance. It is easy to measure the cost of better containment of chemicals, or the time saved in having a higher speed limit. Higher cancer rates in an area, or a higher number of highway accidents and fatalities, however, are less easy to measure, although still quantifiable, but how are the two compared?

The answer is: on an ad hoc basis. Those performing the analysis must consider each part of the system and each interaction, and relate it to its environment, and then consider how it might cause damage. If there is no one with experience regarding that particular component or subsystem, then the analysis of that particular component has a large possibility of being inadequate. Furthermore, when the next level of abstraction is reached, more hazards may be hidden behind that abstraction, as the interactions of the component have larger effects that are not accounted for. Expertise in that particular component's operation is important to understand its larger purpose in the system. Expertise in its interactions with the environment is essentially expertise in the safety aspects of that component. If either is lacking, then the analysis will be incomplete. Unfortunately, while many people may be expert in the operation of a component or sub-system, knowledge about interactions with people and the environment is not necessarily as widespread. Thus we see in many types of safety analysis the mention that experienced practitioners provide the best analyses, and training is necessary to allow others to practice safety analysis effectively on their own. This safety expertise is still held by people, rather than having been encapsulated within the analytical methods themselves, and will remain so for the foreseeable future.[Leveson]

Given competent practitioners, however, and a substantial-looking analysis, how does one view the results? It is difficult to accept any guarantees of completeness, as even a simple system interacting with a typical environment has many complicated interactions, many of which are poorly understood and potentially dangerous. It is often small details that are overlooked that combine to cause danger and, in some cases, disaster. Often the basic assumptions behind a particular part of the analysis may be flawed, such as the performance required of a bearing when in the field the system will be used beyond the designed parameters, putting undue stress on the bearing. If the system is operated outside the specified environment, then the analysis is invalid. Overlooking an interaction or making a mistaken assumption is so easy that it is likely to happen at least once in any analysis.

Quantification of results is also a problem. While the analysis may provide numbers that help define how dangerous a risk is, those numbers may have no bearing on reality. They may be based upon older systems that have some fundamental or perhaps subtle difference in their operation. Often, if no data exists, the numbers may be based on the gut instinct of the person performing the analysis. Such a stab in the dark may be fairly accurate if made by someone competent, but even competent people and analysts can make mistakes, or misjudge dangers based on personal prejudices. Human error and the difficulty of prediction make it hard to have great confidence in this stage of the analysis.

And even if the analysis is complete, and the risks defined and prioritized, it is also difficult to determine what safety level is great enough. Sometimes regulating agencies will have guidelines or standards to follow, which make the decision easy. Other times, however, the tradeoffs are more difficult to determine, and the decisions harder to make. When is the cost of making a car safer greater than the cost of the lawsuits it prevents? And if there is a societal cost in terms of medical expenses, work lost, emotional trauma, and recovery work, who is accounting for that? Often this falls upon a project manager or an executive who is not expert in the system. Sometimes the guarantees that must be met for increased safety impact time to market, or the performance needed from a piece of equipment, rather than mere dollar cost of the system. The other side of the equation is usually well defined, and the benefits of increased safety difficult to conceptualize.

What all of this amounts to is that the problem of safety is ill defined, and our current attempts to address it are incomplete. Those who have addressed or concerned themselves with safety in the past will be able to use that experience to aid further work, but those without that experience will not have enough insight into safety concerns to do as effective a job. The results are unverifiable, though in general some level of confidence will be able to be held in them. Finally, the tradeoffs between safety and other concerns such as cost and performance are sometimes difficult to perceive, and complete safety is almost always unattainable.

Available tools, techniques, and metrics

There are many forms of safety analysis, and some of the major ones are discussed below. Some are useful tools or techniques, while others are designed to be comprehensive. All of the analytical methods listed below have been used to good effect in the past. If one of them seems applicable to a particular problem or system, seek out further sources and, preferably, someone who has used that method in the past for further guidance.

Checklists

Despite their seeming simplicity, checklists are a form of safety analysis. As an example, an airplane is a safety critical system. As one level of analysis, a pilot must complete a pre-flight checklist before flight to ensure that the plane is working properly. This checklist is a simple form of safety analysis. Checklists are generally useful where a problem is well understood, and examination rather than system analysis is the goal.

Fault Tree Analysis

Fault trees developed in the aerospace industries, but have found uses in many areas, most recently software analysis. Fault trees operate by developing a list of the faults that can occur in a system, and attempting to trace them back to their root causes. The reason that they are called fault trees is that there is a tree-like formal notation that accompanies the analysis, in which different types of events are specified by differently shaped containers, and the events are linked logically in tree-like structures to lead up to the eventual fault of the system. While this method can be used to show complicated interactions, it is still subject to the danger of overlooking aspects of the system, as these are mostly enumerated. It is advisable to combine this with another, more methodical approach to ensure the completeness of the analysis. An example is shown below.[Leveson]
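As a supplementary illustration (a hypothetical sketch, not the example figure from [Leveson]), the logical structure of a fault tree can be represented as nested AND and OR gates over basic events. The tank scenario, the event names, and the probabilities are invented for this sketch. The quantitative evaluation additionally assumes that basic events are independent and that no basic event appears in more than one branch; real trees often violate both assumptions.

```python
from dataclasses import dataclass
from typing import List, Union


@dataclass
class BasicEvent:
    """A root cause at a leaf of the tree (hypothetical names and numbers)."""
    name: str
    probability: float


@dataclass
class Gate:
    """An intermediate or top event combining its inputs with AND or OR."""
    name: str
    kind: str  # "AND" or "OR"
    inputs: List[Union["Gate", BasicEvent]]


def root_causes(node) -> List[str]:
    """Trace a fault back to the basic events that can contribute to it."""
    if isinstance(node, BasicEvent):
        return [node.name]
    names: List[str] = []
    for child in node.inputs:
        names.extend(root_causes(child))
    return names


def probability(node) -> float:
    """Probability of the event, assuming independent, non-repeated inputs."""
    if isinstance(node, BasicEvent):
        return node.probability
    child_probs = [probability(child) for child in node.inputs]
    if node.kind == "AND":  # all inputs must occur
        result = 1.0
        for p in child_probs:
            result *= p
        return result
    if node.kind == "OR":   # at least one input occurs
        none_occur = 1.0
        for p in child_probs:
            none_occur *= 1.0 - p
        return 1.0 - none_occur
    raise ValueError(f"unknown gate kind: {node.kind}")


top = Gate("tank overflows", "AND", [
    Gate("inflow not stopped", "OR", [
        BasicEvent("level sensor fails", 0.01),
        BasicEvent("controller fault", 0.02),
    ]),
    BasicEvent("relief valve stuck", 0.05),
])

print(root_causes(top))
print(f"P(top event) = {probability(top):.5f}")
```

The tree only evaluates what has been written into it; a root cause nobody thought to enumerate is silently absent from both the list and the probability.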

Event Tree Analysis

Event trees function similarly to fault trees, but in the opposite direction. An event tree attempts to enumerate a list of components and subsystems and determine the result of their operation or non-operation. In this way all sequences of possible events are covered involving those components. As with fault trees, enumeration is the main form of choosing subsystems and components to examine, so a more methodical approach should be coupled with event tree analysis for greater completeness. An example is shown below.[Storey]
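As a supplementary illustration (a hypothetical sketch, not the example figure from [Storey]), an event tree can be generated by branching on the operation or non-operation of each listed subsystem after an initiating event. The subsystem names, failure probabilities, and initiating frequency are invented, and the subsystems are assumed to fail independently of one another.

```python
from itertools import product


def event_tree(initiating_frequency, subsystems):
    """Enumerate every works/fails sequence and its frequency.

    subsystems is a list of (name, probability_of_failure) pairs,
    assumed independent for this sketch.
    """
    sequences = []
    for states in product((True, False), repeat=len(subsystems)):
        frequency = initiating_frequency
        labels = []
        for (name, p_fail), works in zip(subsystems, states):
            frequency *= (1.0 - p_fail) if works else p_fail
            labels.append(f"{name} {'works' if works else 'fails'}")
        sequences.append((tuple(labels), frequency))
    return sequences


# Hypothetical initiating event: coolant leak, 0.1 occurrences per year.
subsystems = [("alarm", 0.1), ("automatic shutdown", 0.01)]
tree = event_tree(0.1, subsystems)

for labels, frequency in tree:
    print(f"{' / '.join(labels)}: {frequency:.6f} per year")
```

The number of sequences doubles with each subsystem added, so practical event trees prune branches that cannot occur or that do not change the outcome.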

Failure Modes and Effects (and Criticality) Analysis

Failure Modes and Effects Analysis (FMEA) and Failure Modes and Effects and Criticality Analysis (FMECA) function much like a checklist, only a more organized one. There is a standard form which must be filled out, in which each subsystem or component is listed, along with the different ways in which that particular component can fail. Once these failure modes have been listed, the effects of that failure are listed. In the criticality analysis, each failure mode is associated with a frequency, and each effect with a 'danger rating'. These numbers are used to provide some idea of exactly how much risk that failure mode places upon users or the environment. Once these have been collected, each failure mode has a possible protective measure listed with it. Criticality analysis adds a cost of protection number here. This provides a list of hazards, risks, and possible countermeasures, and the criticality analysis orders them according to the level of danger they represent. The danger here is, again, that of leaving something out in the course of listing the possibilities.[Storey]
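The worksheet described above maps naturally onto a table of records. The sketch below is one possible layout, not a standard form: the field names, the 1 to 10 danger scale, the cost figures, and the choice to rank by frequency multiplied by danger rating are all assumptions made for illustration, as are the example rows.

```python
from dataclasses import dataclass


@dataclass
class FailureMode:
    component: str
    mode: str
    effect: str
    frequency_per_year: float  # criticality analysis: how often it fails
    danger_rating: int         # criticality analysis: 1 (minor) to 10 (severe)
    protective_measure: str
    protection_cost: float     # criticality analysis: cost of the measure

    @property
    def criticality(self) -> float:
        """Assumed ranking rule: frequency times danger rating."""
        return self.frequency_per_year * self.danger_rating


def ordered_by_criticality(rows):
    """Order failure modes from most to least critical."""
    return sorted(rows, key=lambda row: row.criticality, reverse=True)


worksheet = [
    FailureMode("pump", "seal leak", "fluid on floor", 0.5, 3,
                "drip tray and inspection", 200.0),
    FailureMode("relief valve", "stuck closed", "vessel overpressure", 0.02, 10,
                "redundant valve", 1500.0),
    FailureMode("level sensor", "reads low", "tank overfill", 0.2, 6,
                "second independent sensor", 400.0),
]

for row in ordered_by_criticality(worksheet):
    print(f"{row.criticality:4.1f}  {row.component}: {row.mode} -> {row.effect}")
```

A multiplicative ranking lets a frequent nuisance outrank a rare catastrophe, as it does here, so many practitioners review the highest danger ratings separately rather than relying on the product alone.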

HAZard OPerability Studies

HAZard Operability Studies (HAZOPS) is a methodology for safety analysis that is highly rigorous, precise, and involved. A system model is constructed and each component is described with a list of attributes that describe the operation of that component. A list of guidewords with well defined meanings is then applied to each attribute to determine the effect of the deviation from normal operation that the guideword describes. For example, a pipe might have the attribute flow, for which the guideword backwards would mean backwards flow through the pipe. By having a well defined set of guidewords and a good system model, as well as expert analysts, this method attempts to be completely rigorous in its application. In addition to this mechanism of analysis, it also has guidelines describing the process of meeting and conducting the analysis.[MoD]
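The systematic character of the method comes from applying every guideword to every attribute of every component. A minimal sketch of that enumeration follows; the guideword list, its definitions, and the small system model are hypothetical and are not taken from [MoD], although the pipe, flow, and backwards combination is the one used in the text above.

```python
# Hypothetical guidewords with their agreed meanings.
GUIDEWORDS = {
    "no": "the attribute is completely absent",
    "more": "the attribute is higher than intended",
    "less": "the attribute is lower than intended",
    "backwards": "the attribute acts in the opposite direction",
}

# Hypothetical system model: component -> attributes describing its operation.
MODEL = {
    "pipe": ["flow", "pressure"],
    "tank": ["level"],
}


def deviations(model, guidewords):
    """Yield every (component, attribute, guideword) combination to examine."""
    for component, attributes in model.items():
        for attribute in attributes:
            for word in guidewords:
                yield component, attribute, word


worklist = list(deviations(MODEL, GUIDEWORDS))
for component, attribute, word in worklist:
    print(f"{component}: {word} {attribute} ({GUIDEWORDS[word]})")
```

Some generated combinations will be meaningless for a given attribute (a backwards tank level, for instance). Deciding which deviations are credible, and what their causes and consequences are, remains the job of the study team.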

Relationship to other topics

Ultra-dependability

Since ultra-dependable systems cannot be tested to assure the requisite level of reliability, safety analysis could prove useful as a means of determining the safety level of such a system. The only problem is the amount of trust which can be placed in the analysis.

Software Reliability, Software Fault Tolerance, Software Safety

Safety analysis is being targeted at all of these aspects of software in order to improve its quality and address concerns that, up until now, have not really been applied to software.

Multi-disciplinary Design

Since most systems being analyzed for safety purposes will have subsystems designed by different specialties or professions, from electrical, chemical, and mechanical engineers to medical professionals, multi-disciplinary knowledge is necessary to construct a good system model and to examine that model.

Social & Legal Concerns, Ethics

The question of how safe is safe enough is often answered in the social or legal area. In some areas, ethics may delineate risks which are monetarily tenable but still need to be protected against.

Validation, Verification, & Certification

In the process of certifying systems for safety, safety analysis is often used as a means of constructing a safety case, or persuasive argument that the system is safe.

Conclusions

Safety Critical Systems Analysis is an attempt to solve a poorly defined problem. Safety is an ill-defined property of a system, and one that can rarely be confined to one portion of the system. Therefore considering the safety of a system involves examining the system as a whole, and its interactions, a task to which people are ill suited. Instead, these analyses break both the system and the analysis itself down into manageable parts. They take the model and its components, and examine each for the hazards they represent. Then the hazards are examined for the potential risks they present. Together they represent a list of the dangers the system presents to people and its environment, and also prioritize those dangers in their severity and in the necessity of their being prevented.

What is necessary to remember is that the analysis is being conducted by humans, who are prone to error. Components or interactions can be poorly understood. Hazards can be overlooked. Risks can be over- or underestimated. And the results, finally, are subject to human judgement as to how safe is safe enough, and how much work should be put into making the system safer. Those with experience in safety analysis will do a better job of analysis, and those with a greater interest in safety will err on the side of caution. The best thing that can be said about safety critical systems analysis is that, despite its imperfections, it has a history of being successful, and in some cases highly so. It is our best tool in addressing a problem that is poorly understood, and far too often undervalued.

Annotated Reference List

Blockley, David. "Engineering Safety." McGraw-Hill, 1992.

This is a very complete book detailing a lot of the basic ideas in safety analysis from the point of view of civil engineering. While it doesn't address embedded concerns directly, many of the ideas are applicable.

Leveson, Nancy. "Safeware: System Safety and Computers." Addison Wesley, 1995.

This book covers the area of safety analysis from the perspective of computer systems. It begins to apply safety analysis to the problems faced by computers and software, but these methods have not yet been proven in this field. It also covers a lot of other issues related to safety analysis, hazards, and risks, and is probably a good text for the computer professional who has to deal with safety issues, but does not want to delve too deeply into the material.

MoD. "Interim Defense Standard 00-58 HAZOP Studies on Systems Containing Programmable Electronics." Aug. 1996.

This standard details the application of the HAZOPS approach for electrical and computing systems. It provides more detail on the HAZOPS procedure and potential ways of adapting it for embedded applications. A copy of this, and other safety-related standards, can be found at http://www.seasys.demon.co.uk/

Storey, Neil. "Safety-Critical Computer Systems." Addison Wesley, 1996.

This book covers the basics of safety and safety analysis in its initial chapters, but provides more of a surface skim than an in depth look.
