Carnegie Mellon University
18-849b Dependable Embedded Systems
Spring 1999
Author: Michael Carchia
Electrical systems are ubiquitous in today's world, and in many cases it is absolutely necessary that they do not fail. Designers of these systems must be aware of the various points of failure and how to address them through sound design. Unlike mechanical parts, electrical components generally do not wear out per se. However, discrete analog component parameters tend to drift over time and can cause problems in sensitive designs, and integrated circuits can undergo electromigration. Environmental effects such as corrosion, vibration, and temperature are also of extreme concern, and transient stresses such as electrostatic discharge (ESD) and lightning can cause failures. Various environmental stresses are outlined below along with a discussion of failure mechanisms. Lastly, when it is absolutely necessary to ensure that an electrical system will not fail beyond some tolerance, how does one approach the problem? Some useful design principles are presented.
The main difference between electrical and mechanical reliability is that, generally speaking, electronic systems do not wear out (with some exceptions). While there are arguably some wearout mechanisms, such as electromigration and component parameter drift, electronic systems behave fundamentally differently from mechanical ones. Central to any discussion of reliability is the concept of the bathtub curve, which plots failure rate against time. Shown below, the curve can be broken up into three portions.
Zone I (the rapidly decreasing part of the curve), referred to as the burn-in period or infant mortality stage, is characterized by failures due to manufacturing defects. Zone II is the useful life stage and is characterized by a constant failure rate due to random failures. Zone III is termed the wearout period and is characterized by an increasing failure rate as a result of equipment aging and deterioration. Because modern electronic equipment is largely made up of semiconductor devices that have no real short-term wearout mechanism, the existence of a Zone III for electronic systems is something of a gray area [ERS87, p. 22]. For most electronic components, Zone III is relatively flat. What is important for designers to understand is how electrical systems fail. In subsequent sections, the more common causes of electronic system failure are outlined along with some common methods of protection. Afterwards, various classes of reliability prediction models are presented. There are numerous models, most of which fall under the five classifications given.
As mentioned previously, whether electrical systems exhibit wearout behavior is debatable. Electromigration, for instance, might be considered a wearout mechanism. Over time, high current densities in thin-film conductors on integrated circuits can cause voids or hillocks. Pictured below is an interconnect damaged by significant momentum transfer from electrons to conductor atoms [Conyers].
Over time, analog components can drift from their specified values. This drift can be accelerated by factors such as temperature. Therefore, critical circuits need to be designed with a level of tolerance that can cope with the parameter drift of their components.
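One way to check whether a circuit tolerates drift is a worst-case corner analysis. The following is a minimal sketch, assuming a simple two-resistor voltage divider; the supply voltage, resistor values, initial tolerance, and drift allowance are all hypothetical numbers chosen for illustration and do not come from any datasheet.

```python
from itertools import product


def divider_vout(vin, r_top, r_bottom):
    """Output voltage of an unloaded resistive divider."""
    return vin * r_bottom / (r_top + r_bottom)


def worst_case_vout(vin, r_top_nom, r_bottom_nom, tol):
    """Evaluate the divider at every +/- tolerance corner.

    The divider output is monotonic in each resistor value, so the
    extremes occur at the corners of the tolerance box.
    """
    corners = []
    for s_top, s_bottom in product((-1, 1), repeat=2):
        r_top = r_top_nom * (1 + s_top * tol)
        r_bottom = r_bottom_nom * (1 + s_bottom * tol)
        corners.append(divider_vout(vin, r_top, r_bottom))
    return min(corners), max(corners)


# Hypothetical design values (assumptions for illustration only).
VIN = 5.0             # supply voltage, volts
R_TOP = 10_000.0      # nominal upper resistor, ohms
R_BOTTOM = 10_000.0   # nominal lower resistor, ohms
INITIAL_TOL = 0.01    # 1% purchase tolerance
DRIFT = 0.02          # assumed 2% end-of-life drift allowance

v_min, v_max = worst_case_vout(VIN, R_TOP, R_BOTTOM, INITIAL_TOL + DRIFT)
print(f"Vout range over life: {v_min:.3f} V to {v_max:.3f} V")
```

If the downstream circuit cannot accept the full range reported, the design needs tighter parts, a smaller drift allowance, or a topology that is less sensitive to the drifting parameter.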
Modern electronic components are prone to damage from high currents due to their delicate nature and inability to sink heat. Thus transient stresses such as those due to electrostatic discharge (ESD), lightning, and power supply transients from switching or lighting can cause system failures [O’Conner88]. Some methods to protect against transient voltages include:
Typically a problem in avionics and military equipment, excessive heat can wreak havoc in an electrical system. Component parameter values usually vary with temperature, and it is important not to exceed the manufacturer’s rated temperature range. Above the rated maximum, parts are no longer guaranteed to be within specification. Typically, this maximum ranges from 80 °C to 150 °C. Thus thermal design can be an important aspect of a system’s overall design. Components generate heat in operation, and when this is combined with ambient temperature and solar radiation, excessive temperatures can be reached. Common methods to provide thermal protection include:
There is a variety of reliability prediction modeling techniques. Rather than listing them all here, we can classify them into five main categories:
Similar Equipment Techniques. In order to estimate the level of reliability, the equipment under consideration is compared with similar equipment of known reliability.
Similar Complexity Techniques. The reliability of a design is estimated by comparing its relative complexity with an item of similar complexity.
Prediction by Function Techniques. Correlations between function and reliability are considered in order to obtain reliability prediction of a new design.
Part Count Techniques. Reliability is estimated as a function of the number of parts involved.
Stress Analysis Techniques. Failure rate is a function of individual part failure rates and takes into consideration part type, operational stress level, and derating characteristics of each part [ERS87, p. 169].
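As a rough illustration of the part count idea, the sketch below sums per-part failure rates over a bill of materials. It assumes a series system (any single part failure fails the system) and constant failure rates. The `PartType` class, the optional `quality_factor` multiplier, and every numeric value are hypothetical and are not taken from any handbook.

```python
from dataclasses import dataclass


@dataclass
class PartType:
    name: str
    quantity: int
    failure_rate: float          # failures per 10^6 hours (hypothetical)
    quality_factor: float = 1.0  # assumed multiplier for part quality


def parts_count_failure_rate(parts):
    """Sum quantity * failure rate * quality factor over all part types."""
    return sum(p.quantity * p.failure_rate * p.quality_factor for p in parts)


# Hypothetical bill of materials (values are assumptions for illustration).
BOM = [
    PartType("resistor", 40, 0.002),
    PartType("ceramic capacitor", 25, 0.004),
    PartType("logic IC", 6, 0.05),
    PartType("connector", 2, 0.1, quality_factor=2.0),
]

system_rate = parts_count_failure_rate(BOM)  # failures per 10^6 hours
mtbf_hours = 1e6 / system_rate               # valid only for a constant rate
print(f"System failure rate: {system_rate:.3f} per 10^6 h")
print(f"Estimated MTBF: {mtbf_hours:,.0f} h")
```

A stress analysis technique would refine this by adjusting each part's failure rate for its operating stress level and derating rather than using a single generic rate per part type.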
Useful in understanding these techniques is the exponential distribution, one of the most important distributions in reliability calculations. It is used heavily for reliability prediction of electronic equipment because of the general lack of a wearout mechanism. An exponential distribution has a constant failure rate, which corresponds to random system failures that are not associated with wear, corrosion, etc. [ERS87, p. 22]. An exponential distribution is good for modeling:
Not all electrical components follow an exponential failure distribution. For instance, electrolytic capacitors can break down over time. Thus, it is not safe to say that electrical systems cannot wear out.
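For components that do fit the constant-failure-rate assumption, the exponential model gives reliability R(t) = exp(-λt) and a mean time between failures of 1/λ. The sketch below evaluates both; the failure rate used is a hypothetical value chosen only to make the numbers readable.

```python
import math


def reliability(failure_rate, hours):
    """Probability of surviving `hours` under a constant failure rate."""
    return math.exp(-failure_rate * hours)


def mtbf(failure_rate):
    """Mean time between failures for an exponential distribution."""
    return 1.0 / failure_rate


# Hypothetical failure rate: 1e-5 failures per hour (an assumption).
LAMBDA = 1e-5

for hours in (1_000, 10_000, 100_000):
    print(f"R({hours} h) = {reliability(LAMBDA, hours):.4f}")

print(f"MTBF = {mtbf(LAMBDA):,.0f} h")
```

Note that the probability of surviving to the MTBF itself is exp(-1), roughly 37%, so an MTBF figure should not be read as a guaranteed service life.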
For a more detailed discussion of reliability prediction models, one can consult [ERS87].
There are a great many design principles that can be used to promote system reliability. These include the following:
Part selection, control, and derating. Since an electronic system is inherently made up of discrete components, the selection and quality of these components are of crucial importance. Choosing the right type of part for the job can mean the difference between reliability and unreliability. Part selection involves decisions such as TTL vs. ECL, the use of plastic encapsulated devices, and surface mount vs. through-hole technologies. Furthermore, the performance of critical parts should meet certain industry guidelines.
Reliable circuit design. As a general rule, simpler designs are more reliable. Thus there should be a push for simplicity throughout all phases of the design process. The necessity of every part should be questioned, and design simplifications should be employed where available, whether by simplifying the circuit or simply by using fewer parts. The use of standard components and circuits is also recommended (where a component could be as complex as a microprocessor). Reliable circuit design also entails a parameter degradation analysis. Since component parameters are known to drift over time, one must ensure that different tolerances cannot combine in a way that degrades system functionality.
Redundancy. The use of multiple components with the same function can be a useful tool if applied properly. There is a wide variety of redundancy techniques. See <hyperlink> for a discussion of redundancy.
Designing for the environment. Given the environmental stresses mentioned in the Main Stresses & Protection Method section, one can improve the reliability of an electrical system by carefully detailing all the subtleties of the target environment. Environments can be very harsh and the system being designed might have to function in the presence of [ERS87, p. 328-9]:
Because of this, reliability cannot be an afterthought and must be a goal from the beginning of the project.
Some of the following topics are related to electronic/electrical reliability and may be worth reading.
The following are the key ideas for this topic: