An ultradependable system has such a low failure rate that prolonged testing cannot confirm that the target failure rate has been reached. The sheer number of embedded systems being deployed in potentially safety critical situations creates a need to develop ultradependable, or ultraquality, systems that do not cost significantly more than a current design. Unfortunately, there is no formal, well understood way to solve this problem. It requires understanding many different engineering disciplines and a system level perspective of what is being designed. The best way to make ultradependable systems right now is to use proven algorithms and components in the product. Also, applying quality control techniques to the development process may improve the product's quality more than extensive testing can. Even with these techniques, creating an ultradependable system is an extremely difficult and expensive task that requires a great deal of skill from the engineers and managers on the project.
Embedded systems pervade modern society. From consumer electronics to automobiles to satellites, they represent the largest segment of the computer industry. Because they are so pervasive, our society has come to depend on these systems for its day-to-day operation. The disruption caused by the outage of the Galaxy IV satellite [web1] shows how the failure of one system can affect society. According to CNN, 80%-90% of the pagers in the United States ceased to function. This interrupted business, people's personal lives, and a hospital's ability to contact its medical staff. Clearly, if modern society is going to continue to depend on these systems, steps must be taken to assure they will be available when we need them.
One way to make systems more dependable is to look into ultradependability techniques. Ultradependability means that a system has been designed to operate for such a long period of time without defects that testing becomes impractical. A more quantifiable definition would be: a system that has a 10^-6 to 10^-9 chance of failure per operating hour. Even at the 10^-6 rate, a fleet of 1,000 units would have to operate for 1,000 hours before the first failure would be expected. This puts an obvious strain on standard testing methodologies. Even if an ultradependable product passes a standard body of tests, that does not guarantee the product has a failure rate of 10^-6. Testing can really only reveal that a product does not have an adequate MTBF (mean time between failures); it cannot certify that a system is ultradependable.
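To see why, note that for a constant failure rate the expected number of failures is simply the rate times the total unit-hours of exposure. A back-of-the-envelope check, assuming the 10^-6 per hour rate:

    E[\text{failures}] = \lambda N T = 10^{-6}\,\text{hr}^{-1} \times 1000\ \text{units} \times 1000\ \text{hr} = 1

At the 10^-9 end of the range the same calculation needs 10^9 unit-hours, roughly a million units running for a thousand hours, before a single failure is expected.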
Is there really a need for products with this level of dependability? Certainly the average consumer product, despite our society's dependence on electronic gizmos, does not need to be ultradependable. However, many systems are much more critical, possibly safety critical or critical to the infrastructure of our society, and equally as numerous. Automobiles and jet planes are two systems which may need ultradependability techniques. Many jet planes fly every day, they depend on many embedded systems to fly, and failure of any of those embedded systems may cause a major disaster. With operating lives of up to 20 or 30 years and tens of thousands of planes flying, the number of hours the fleet operates becomes enormous. When faced with an enormous number of operating hours, the chances of there being an error during operation go up dramatically. In situations like these, where the system must be as close to flawless as possible, embedded systems designers must look to make their systems ultradependable.
Unfortunately, ultradependability is extremely difficult to achieve. It is a complex, emergent property that imposes a requirement on designers that they can never truly verify. Techniques do exist to design ultradependable systems; they rely on redundancy and on components and algorithms that have proven themselves over time. This paper will examine the issues surrounding ultradependability: the need for it, the problems it presents, and some possible ways to combat those problems.
Modern society needs ultradependable systems. One reason for this simply has to do with economies of scale. When 10 million copies of a system are deployed, it stands to reason that rare, exceptional conditions start to matter. Even if only 0.5% of a fleet encounters those rare failures, that is still 50,000 systems that must handle the exceptional conditions. Many embedded systems number in the tens to hundreds of millions; automobiles, for example: 200 million cars operate in the United States. If it is assumed that a car is driven 1.12 hours per day, and each car has a failure rate of 10^-9 failures per hour, that still gives approximately 82 failures per year (one every 4.5 days). Most cars' real failure rates are much worse than 10^-9.
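The arithmetic behind the 82-failures-per-year figure, using the fleet size, daily usage, and failure rate above:

    200 \times 10^{6}\ \text{cars} \times 1.12\ \text{hr/day} \times 365\ \text{days/yr} \approx 8.2 \times 10^{10}\ \text{hr/yr}

    8.2 \times 10^{10}\ \text{hr/yr} \times 10^{-9}\ \text{failures/hr} \approx 82\ \text{failures/yr}

which works out to one failure somewhere in the fleet every 365/82, or about 4.5, days.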
The need for ultradependability is strongest when the system is life critical, or difficult to repair due to a hostile environment. Large fleets of these systems exacerbate the situation. Any system that will be deployed for the military needs to be ultradependable. The military has all three aspects requiring an ultradependable system: the system is life critical, deployed in large numbers, and difficult to repair during battle. Satellites, rockets, and other space-bound machines rely on ultradependable embedded systems to operate reliably. They may or may not be life critical, and they are not deployed in large numbers, but they are certainly difficult to repair. Finally, airplanes, both commercial and military, are a classic example of a system that needs to be ultradependable. Airplanes are life critical, difficult to repair while operating, and they operate in relatively large numbers. Many airplanes are also used much longer than originally anticipated. The B-52 has been in operation since 1955, yet the Air Force still depends on it [web2]; it has had to be dependable long past its original point of retirement. Clearly there is a need for ultradependability for certain types of systems.
Consumer products, however, don't necessarily need to be ultradependable, despite their large numbers and extensive use. It could be argued that some consumer electronics are critical to the operation of society, and some systems, like pagers, may be used for life critical purposes. However, if a few of these systems fail, it is unlikely that anyone will die; if a few airplanes fail, it is quite likely that people will die. Consumer electronics are also easily replaceable, unlike space vehicles or military equipment. The lack of a life critical nature and the ease of repair make it unnecessary to have ultradependable consumer electronics, despite their numbers.
Making a system ultradependable is one of the most difficult problems systems engineers face. I'll examine six different problems designers must cope with when developing an ultradependable system. Some of the problems are external to the system; others are a result of deficiencies in the technologies we use to build ultradependable systems.
Mechanical components can wear out and give way under the stress of extreme environmental conditions. Designers must try to anticipate all possible loads and stresses on the system and design it to withstand them. Of course, with an ultradependable system it is impossible to anticipate all possible problems. Fortunately, we have a good understanding of how to make mechanical systems strong and resilient. A typical solution to this problem is to use conservatism, a margin of error in the design, to account for unexpected situations; it just takes money.
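That conservatism is usually expressed as a safety factor. A minimal sketch of the standard definition (the threshold of 2 is purely illustrative; real projects set it per component and per industry):

    \text{safety factor} = \frac{\text{component strength}}{\text{worst-case anticipated load}} \ge 2

A larger factor costs weight and money, but it buys tolerance for the loads nobody anticipated.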
Failure rates of different system components, in failures per million operating hours:

Component | Failure Rate
Military microprocessor | 0.022
Automotive microprocessor (1987 data) | 0.12
Electric motor | 2.17
Lead/acid battery | 16.9
Oil pump | 37.3
Human: single operator, best case | 100 (per million actions)
Automotive wiring harness (luxury) | 775
Human: crisis intervention | 300,000 (per million actions)
The mechanical parts of a system, while they fail more often than most components, fail in ways that we understand and can predict. This understanding makes mechanical components easier to deal with than computer hardware or software.
Integrated circuits have become extremely reliable. Development methods like Six Sigma manufacturing have reduced manufacturing defects to extremely low levels, and microchip designers also have very effective CAD tools that hide a great deal of complexity. The data for microprocessors listed above is somewhat outdated; Six Sigma quality techniques give manufacturing error rates of 3.5 parts per million [Fieler91]. Wiring is a problem in many systems (note the failure rate for wiring harnesses in automobiles above), but the speed and power of modern microprocessors allow a reduction in the number of chips per sub-system. Fewer chips mean less wiring and fewer critical components that can break.
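The benefit of fewer chips falls directly out of the standard series reliability model: a sub-system that needs all n of its components works only if every one of them does. A rough illustration, assuming independent failures and an arbitrary per-chip reliability of 0.999 over some mission time:

    R_{\text{sys}} = \prod_{i=1}^{n} R_i = R^n, \qquad 0.999^{10} \approx 0.990, \qquad 0.999^{5} \approx 0.995

Halving the chip count roughly halves the chance that the sub-system fails.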
Computer hardware shares an important trait with mechanical hardware: replication can be used to mask transient and permanent hardware failures. There is a downside to this, though: it assumes the failures are independent. A problem hardware components face now is that their designs are so complex that design errors are becoming a more prominent cause of failures. Design errors cannot be corrected through replication, and the more complex hardware gets, the worse this problem will become.
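The payoff of replication under the independence assumption is easy to quantify. For triple modular redundancy, where a voter takes the majority output of three identical modules, the standard calculation (with an illustrative per-module reliability of R = 0.99) gives:

    R_{\text{TMR}} = R^3 + 3R^2(1 - R) = 3R^2 - 2R^3 = 3(0.99)^2 - 2(0.99)^3 \approx 0.9997

The improvement from 0.99 to 0.9997 evaporates if the modules share a design error, because then all three fail together and the majority is wrong.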
Computer software is becoming the major limiting factor for ultradependable systems, but it is not a fundamentally flawed technology. It has certain strengths that are useful, and some traits that can be both a strength and a weakness. Software has no manufacturing defects and it doesn't wear out, but this means all of its errors are design errors. That alone is not such a bad thing, but software is also very flexible, and its flexibility leads designers to put a great deal of functionality and complexity into it. Complexity leads to errors. Design errors are the most difficult errors to deal with, and software accumulates a lot of them because it shoulders the brunt of the system's complexity. Its flexibility can be too strong a lure. Below are some numbers from Tandem that support the notion that software is the bottleneck. It must be remembered, though, that Tandem machines use the best available dependable hardware methods; these numbers are at the far end of the spectrum, but they are still representative of the software problem.
Fraction of system failures by source:

Source of Error | Tandem (Gray 1985) | Tandem (Gray 1987)
Hardware | .18 | .19
Software | .26 | .43
Maintenance | .25 | .13
Operations | .17 | .13
Environment | .14 | .12
Software is also difficult to fully understand because it is inherently non-linear. By non-linear I mean that software is not an analog system; it has discrete states. A one-bit error can bring an entire system down, while a much larger error may do nothing. This makes software extremely difficult to verify. This non-linearity, and the fact that design errors are common to every copy rather than independent, make it difficult to use replication. Replication is the classic way to get reliability from a sub-system, and since it cannot be applied to software, we are left with the software problem. The software problem, a term that has been in use since the 1960's, will continue to exist and affect ultradependability until there are better tools and techniques to deal with software design and complexity.
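Here is a toy sketch (all names hypothetical) of why replicating software buys nothing against a design error: three replicas built from the same flawed design agree on the wrong answer, and a majority voter faithfully reports it.

    # Three "replicas" share one design error: the routine was never
    # designed for negative input, so it silently returns a complex
    # number instead of flagging a fault. Every copy inherits the flaw.
    def flawed_sqrt(x):
        return x ** 0.5

    replicas = [flawed_sqrt, flawed_sqrt, flawed_sqrt]

    def vote(results):
        # 2-of-3 majority: return any value at least two replicas agree on.
        for r in results:
            if results.count(r) >= 2:
                return r
        raise RuntimeError("no majority")

    outputs = [f(-4.0) for f in replicas]
    print(vote(outputs))  # unanimous, and unanimously wrong: (1.22e-16+2j)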
Since ultradependable systems are expected to be reliable throughout their lifetimes, end of life issues become important. Components will fail during operation as they drift into the far end of the bathtub curve. There must be a way to deal reliably with component failure. If it is possible to repair the component, then it should be possible to keep the system operating while the component is being fixed. It should also be hard for the maintenance person to break the system or install the component incorrectly. In systems that are unrepairable, say on a long space flight like a mission to Mars, component failure must be anticipated and dealt with in such a way that human intervention will not be needed.
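The bathtub curve itself is commonly modeled with a Weibull hazard function; a sketch of that standard form, with characteristic life \eta and shape parameter \beta:

    h(t) = \frac{\beta}{\eta} \left( \frac{t}{\eta} \right)^{\beta - 1}

\beta < 1 gives the falling infant-mortality region, \beta = 1 the flat useful-life region of constant failure rate, and \beta > 1 the rising wear-out region at end of life.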
A harsh environment adds to the difficulty of making systems ultradependable. Extreme ranges of pressure and temperature can wreak havoc with components, as can moisture, shock, EMI, and grit. It is important to determine where the product will operate, so the environmental conditions can be anticipated and dealt with appropriately.
Complexity is the reason building perfect software is so hard, and the reason building hardware is becoming more difficult. Software's biggest limiting factor is the complexity we place in it. There is also a great deal of complexity in the interactions between sub-systems, and anyone interested in making ultradependable systems must pay attention to those interactions. Dependability will always be limited by the amount of complexity designers can deal with. Abstraction is one of the tools designers use to fight complexity, and we must always be aware of how much complexity we can handle. Keeping systems as simple as possible will help us understand them and help us make them work under any foreseeable condition.
Verifying that a system is ultradependable is a nearly impossible task. By definition, the system is so reliable that it will essentially never fail in its operating lifetime; to confirm that by testing, you would have to build the system and test it over its entire life cycle. Testing can be used to determine that a system is not ultradependable, and it certainly should be done, but other methods must be found to verify ultradependability.
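The scale of the problem can be made concrete. If a fleet accumulates T failure-free unit-hours of testing, the probability of that outcome under a true failure rate \lambda is e^{-\lambda T}, so demonstrating \lambda \le 10^{-9} per hour with even 90% confidence requires:

    e^{-\lambda T} \le 0.1 \quad \Rightarrow \quad T \ge \frac{\ln 10}{\lambda} \approx 2.3 \times 10^{9}\ \text{unit-hours}

That is on the order of a quarter of a million unit-years of failure-free testing, which is why testing alone cannot certify ultradependability.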
Some methods go beyond normal testing and may help to improve ultradependability verification. Fault injection can be used to test the system's responses to exceptional conditions; while this is helpful, and may speed up the testing process, it still does not verify a system's dependability. Formal methods may also be useful: if you could prove the software meets its requirements, it would greatly improve the quality of the overall system. The problem here is that the requirements must themselves anticipate all exceptional conditions; if they did not, the system may still have many flaws when it is deployed in its true environment. Right now, what is actually done for ultradependability verification is process certification. Quality techniques are used to verify that the product was developed under a high quality process, on the assumption that a high quality process will produce a high quality product. This is the best we can do right now.
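A minimal sketch of the fault injection idea (all names hypothetical): corrupt state deliberately and confirm the system's detection mechanism notices, rather than waiting years for a real upset to occur.

    import random

    def checksum(state):
        # Toy integrity check: XOR together every 16-bit word of state.
        c = 0
        for word in state:
            c ^= word
        return c

    def inject_bit_flip(state):
        # Flip one random bit in one random word, mimicking a transient upset.
        i = random.randrange(len(state))
        state[i] ^= 1 << random.randrange(16)

    state = [0x1A2B, 0x3C4D, 0x5E6F, 0x7081]
    good = checksum(state)

    trials, detected = 10000, 0
    for _ in range(trials):
        copy = list(state)
        inject_bit_flip(copy)
        if checksum(copy) != good:
            detected += 1

    # An XOR checksum catches every single-bit flip, so this reports 10000 of 10000.
    print("detected", detected, "of", trials, "injected faults")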
Ultradependability is an emergent property of the system, which makes it an extremely complex and difficult property to achieve. The current best practice is to rely on proven techniques and algorithms; in a sense, this trades performance for reliability. [4] The space shuttle's computers are not nearly as fast as a ground-based commercial system, but they are much more reliable, because they use proven techniques and chips; proven, in this case, means older.
A variation of this approach is the quality control approach, which relies on quality control to assure that the process used for product development is sound. The hope is that if the process was good, then the product will be dependable. Quality techniques have reduced the number of bugs in both software and hardware: Six Sigma manufacturing has greatly improved the quality of hardware design and manufacturing, and techniques like Cleanroom software engineering and structured design reviews have been effective at reducing the number of errors in software.
There is no good, generalized way to make an ultradependable system beyond the standard reliability techniques. Essentially, the more reliability you can afford, the more you get, and that method does not work for software. Keep the system as simple as possible, verify or prove as much of it as you can, and then hope for the best. Some more specific design techniques that are applicable to sub-systems will be discussed in the next section.
Hardware can rely on standard redundancy techniques to achieve high degrees of reliability (see Electronic Reliability). N-version hardware and voting algorithms use redundancy to mask transient hardware failures. Since it is quite likely that components will fail over time, sufficient redundancy and failover mechanisms must be included to assure continuous operation. However, as hardware gets more complex, an interesting problem arises: more errors come from design problems, and design errors are not independent. That makes standard redundancy techniques ineffective, because they all rely on failure independence. At this point the hardware failure mechanism begins to look like the software failure mechanism. So be sure the components used in an ultradependable system are reliable. Most people use chips that have been in production for a long period of time with a good track record. The more that is known about the reliability of the components in a system, the easier it will be to make it ultradependable.
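A minimal sketch of the voting idea (the function and its tolerance are hypothetical): a 2-of-3 voter over redundant channels masks a single transient fault, but only while faults remain independent.

    def majority_vote(a, b, c, tolerance=0.5):
        # Return a value that at least two of three redundant channels
        # agree on, masking a single faulty channel.
        if abs(a - b) <= tolerance:
            return (a + b) / 2
        if abs(a - c) <= tolerance:
            return (a + c) / 2
        if abs(b - c) <= tolerance:
            return (b + c) / 2
        raise RuntimeError("no two channels agree; fault cannot be masked")

    # One channel suffers a transient upset; the voter masks it.
    print(majority_vote(20.1, 20.1, 87.3))  # -> 20.1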
Reducing the number of chips in a system reduces its points of failure, so it may seem like a good idea to move more functionality into the software and save on chips. Remember, though, that software complexity is a difficult problem. The more complex the software, the more likely it is to fail, and redundancy techniques do not work well with software. So it may be a win to include more chips, with backups, and make the software simpler.
The best software reliability technique is to keep the software as simple as possible; complexity is the source of many software errors. Using proven algorithms is another common practice. The essential point of these two methods is that software can get quite complex and hard to understand, and for it to be dependable it must be easy to understand and verify. If the amount of software is kept small, it is much more likely that formal methods can be used to prove it correct, and testing will have more coverage.
Reusing software may be a tempting option, but unfortunately, software does not always transfer well from one product to another. Certainly, using COTS (commercial off-the-shelf) software can save design time, but the quality of the reused software must be verified. The larger the system being reused, the more likely it is that there will be problems. Systems like the Therac-25 and the Ariane 5 rocket have demonstrated that reusing software must be done with caution. The main problem is that factors or requirements that were not important in the original system may become important in the new system. The Therac-20 had hardware safety locks that hid its software's safety problems; the Ariane 4 rocket was not fast enough to trigger the floating point overflow that later brought down Ariane 5. Hidden problems like these make reusing large software systems a difficult proposition.
No amount of extra money can make software reliable; replication simply does not fix design errors. The best place to spend money on software is probably on hiring talented, disciplined personnel. Making ultradependable software is an art, and it is best practiced by skilled artists. Regardless of the people in the development group, quality control techniques should be used. One important technique is some form of peer review. Code reviews have proven much more effective at catching bugs than testing alone, with an average defect detection rate of 55 to 60%, versus 25% for unit testing, 35% for function testing, and 45% for integration testing [Jones86]. Any quality control method will help if it makes the developers focus on writing good software.
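Those percentages also show why quality comes from stacking techniques rather than picking one. Under the optimistic assumption that the stages catch defects independently, code review on top of all three levels of testing would leave only

    (1 - 0.60)(1 - 0.25)(1 - 0.35)(1 - 0.45) \approx 0.11

of the original defects, roughly an 89% combined removal rate; no single stage comes close.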
Ultradependability relies on a number of topic areas for its foundation. It is a broad, high level concept that must be concerned with every phase of a system's life. Some of the areas that affect dependability are listed below.
Multi-disciplinary Design: Ultradependability relies on quality at every part of the design phase, and an understanding of system level issues. Only by pooling the resources of many branches of engineering can an ultradependable system be created.
Software Reliability: Software is a key bottleneck in making a design ultradependable. Any techniques that can be used to make software more dependable will help.
Safety Critical Systems/Analysis: The design phase will usually make or break a complicated system. Anticipating as many hazards as possible will make it more likely that the design will hold up over time. Exceptional conditions are a real challenge when making an ultradependable system.
End-of-Life Wearout and Retirement: An ultradependable system runs for such a long time that end of life issues become important. The system must be able to detect a component failure, preferably before it happens, and be able to deal with that failure. Systems that can't be easily repaired, like the Voyager space probe, need a way to deal with their irreplaceable components breaking.
Maintenance and Reliability: Any system that must run for tens of thousands of hours will require maintenance. It is a fact of life that components wear out and must be replaced. It is equally important that the system not fail during routine maintenance, and that maintenance not be able to damage the system.
Diagnosis and Prognosis: If an ultradependable system does fail, it is critical to know why, so that the problem can be fixed.
System Life Cycle: Ultradependability requires an understanding of how the system will behave throughout its life cycle. Without this understanding, it is possible the system will begin to show a great many failures toward the end of its life.
Ultradependability is an extremely challenging problem, and it is growing in importance. The proliferation of embedded systems in life critical and unrepairable situations has spurred the need for ultradependability techniques. Ultradependability is needed to make sure life critical systems don't fail, especially when there are large numbers of the same design; large numbers of systems increase the chance that one of them will encounter an exceptional condition it cannot handle. Ultradependability is also needed for systems that cannot be easily repaired, like a satellite or a space probe.
The best way to achieve ultradependability is to combine proven techniques with proven components. This helps assure that the components have been tested as much as possible and have been exposed to as many exceptional conditions as possible. Also, quality control should be applied to the development process to assure the highest quality in the end product. This is the best we can do at the moment. There isn't a good way to verify that a system is ultradependable; we can only verify that a system is not ultradependable. The major limiting factor in the development of ultradependable systems is our inability to deal with high levels of complexity. As we develop new tools and abstractions, especially for software, our ability should improve. Perhaps formal methods, fault injection, or Cleanroom software engineering will mature enough to make developing ultradependable systems easy. Right now, however, it is a matter of using a skilled design team with a good process and proven components.
[Siewiorek90] Siewiorek, Daniel P., Hsiao, M. Y., Rennels, David, Gray, James, and Williams, Thomas, "Ultradependable Architectures," Annual Review of Computer Science, 1990.
[Siewiorek90_2] Siewiorek, p. 504.
A rare paper devoted to ultradependable systems. It describes some of the problems in the field, and the necessary developments that must occur for progress to be made.
[Fieler91] Fieler, Paul E., and Loverro, Nick, Jr., "Defects Tail Off with Six-Sigma Manufacturing," IEEE Circuits & Devices Magazine, Vol. 7, No. 5, pp. 18-20, 48, 1991.
A reference giving a brief overview of six sigma manufacturing. It defines what it is without going into too much detail.
[web1] http://www.cnn.com/TECH/space/9805/20/satellite.outage/
CNN's description of the Galaxy IV satellite outage.
[web2] http://www.csd.uwo.ca/~pettypi/elevon/baugher_us/b052i.html
Some information on the B-52 Stratofortress; used only to get the date of its commissioning.
[Rechtin97] Rechtin, Eberhardt, and Maier, Mark W., The Art of Systems Architecting, pp. 14-17, CRC Press, 1997.
An excellent reference about systems architecting that also touches on some ultraquality principles.
[Jones86] Jones, Capers, Programming Productivity: Issues for the Eighties, 2nd ed., Los Angeles: IEEE Computer Society Press, 1986.
Contains some numbers on the effectiveness of code reviews versus various testing methods.
A description of GUARDS (Generic Upgradable Architecture for Real-time Dependable Systems): http://www.cs.york.ac.uk/~ljerka/guards.html