Non-Operating Reliability

Carnegie Mellon University
18-849b Dependable Embedded Systems
Spring 1999

Author: Michael Carchia

Abstract:

Today, large portions of safety critical embedded systems such as automotive electronics or safety equipment spend the majority of their life in the non-operating state. The non-operating environment is characterized by parts or systems that are connected to a functioning device where there is a reduction or elimination of the physical and electrical stresses compared with the operating condition. While current literature may focus on the operating reliability of embedded systems, the non-operating state also requires attention from system designers. This text explains the non-operating environment, describes various failure mechanisms, and outlines some reliability models. Systems designed for high operating reliability do not necessarily perform well (or at all) after long periods of exposure to the non-operating environment. To handle the non-operating environment properly, non-operating failures need to be taken into consideration from the design stage of the lifecycle. Furthermore, the relevant environmental concerns depend on the environmental factors associated with each target environment. To address this, a physics-of-failure based approach to the design cycle is outlined.

Introduction

Consider for a moment a missile defense system that may lie inactive in times of peace. The reliability of this system is crucial since its correct operation could literally save the lives of millions. Similarly, the correct operation of a fire alarm system in a city skyscraper plays a comparable role in the lives of many. Safety critical embedded systems are everywhere, and some of them spend a large portion of their lives in the inactive state. When these systems are called into action, it is important that they work flawlessly. For this to become a reality, designers need to consider the effects of the non-operating environment closely and compensate for them early in the design phase.

There is a distinction between dormancy and storage, but for the sake of this discussion we will group them together. This text is meant to give the reader an introductory understanding of non-operating reliability. If the distinction is important to the reader, further information can be found in the references section.

Dormancy is defined as the state in which the equipment is in its normal operational configuration and connected, but not operating. For testing purposes, equipment in the dormant state may be cycled on and off. During dormancy, the electrical stresses normally experienced under operational conditions are usually eliminated or reduced. [Pecht95]

Storage is defined as the state in which the system, subsystem, or component is totally inactive and resides in a storage area. The product may have to be unpacked and connected to a power source to be tested. [Pecht95]

Together, these two conditions form the non-operating state and are quite common in the useful lives of many embedded systems. Harris [1980] has compiled a list of typical values for time spent in dormancy of many different types of equipment. This list, shown below in Figure 1, demonstrates that the non-operating state can make up a considerable portion of the lifetime of a system.

Figure 1. Typical Values for Percentage of Calendar Time for Equipment in the Dormant Condition [Harris80].

Equipment                              Calendar time dormant

DOMESTIC APPLIANCES
- Television Sets                      75%
- Kitchen Electrical Appliances        97%

CARS
- Personal Use                         93%
- Taxis                                38%

PROFESSIONAL EQUIPMENT
- Personal Calculators                 98%
- Small Copying Machine                >75%
- Electronic Test Equipment            >90%

INDUSTRIAL EQUIPMENT
- Safety Equipment                     98%
- Standby Power                        >90%
- Valves (most)                        >75%
- Air Conditioning                     50-80%
- Built-in Test Equipment (MIL)        99%
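To make the dormancy percentages in Figure 1 concrete, the following minimal Python sketch shows how the dormant fraction weights an overall failure rate. It assumes a constant failure rate in each state and a simple time-weighted average; both are illustrative assumptions, and the failure-rate values are hypothetical placeholders, not data from [Harris80].

```python
def effective_failure_rate(dormant_fraction, rate_operating, rate_dormant):
    """Time-weighted failure rate per calendar hour.

    Assumes constant failure rates in both states (illustrative assumption).
    """
    if not 0.0 <= dormant_fraction <= 1.0:
        raise ValueError("dormant_fraction must be between 0 and 1")
    return (dormant_fraction * rate_dormant
            + (1.0 - dormant_fraction) * rate_operating)


def dormant_share_of_failures(dormant_fraction, rate_operating, rate_dormant):
    """Fraction of expected failures that originate in the dormant state."""
    total = effective_failure_rate(dormant_fraction, rate_operating, rate_dormant)
    if total == 0.0:
        raise ValueError("both failure rates are zero")
    return dormant_fraction * rate_dormant / total


if __name__ == "__main__":
    # Personal car from Figure 1: dormant 93% of calendar time.
    # Hypothetical rates: the dormant rate is ten times lower than operating.
    share = dormant_share_of_failures(0.93, rate_operating=1e-5, rate_dormant=1e-6)
    print(f"Share of failures from the dormant state: {share:.2f}")
```

With these placeholder numbers, the dormant state contributes more than half of the expected failures even though its assumed failure rate is ten times lower, simply because the equipment spends so much of its life there.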

Key Concepts

Systems shown to be reliable under operating conditions aren’t necessarily going to be reliable after periods of exposure to a non-operating environment. What follows is a description of the non-operating environment, its subtleties, and some failure mechanisms associated with them.

The Non-Operating Environments

A system may be situated in numerous non-operating environments throughout its lifetime. Some of these environments may be of concern because they can harm a system, while others may be of negligible importance. Systems may lie inactive in the field (subject to possibly harsh environmental factors) or elsewhere (possibly en route for maintenance). During these times, systems may come into contact with numerous environmental stresses, which may be natural (such as adverse weather) or man-made (such as mishandling or abuse). The following is an overview, taken from Pecht [1995], of some of the environments designers should be aware of aside from the field environment.

Storage. While in storage, parts or systems may or may not be in a controlled environment. Thus, factors such as moisture from condensation may become an issue. Furthermore, diurnal temperatures can range from -50 °C to 75 °C. In poorly ventilated storage, this can produce extreme temperatures that may wreak havoc [Livesay78]. For instance, because thermal expansion coefficients vary greatly between materials, surface mounted ICs can literally pop off their circuit board at extreme temperatures.

Receipt Screening. Prior to being placed in storage, parts are subjected to receipt screening that may involve the removal of some of the system's protective coverings. Receipt screening therefore exposes parts to thermal, biological, and humidity stresses. Human or mechanical handling can also cause shock or particulate contamination.

Repair/Modification. While systems are undergoing repair, they are exposed to stresses associated with manufacturing as well as stresses from transportation, storage, and packaging. These can include mechanical shock, physical deformation, and electromagnetic radiation. Replacement parts may also be introduced with reliabilities different from those of the parts being replaced.

Test. Systems and parts that need to be tested or re-certified are subject to environments similar to those cited above for Repair/Modification.

Movement/Transportation. Parts and systems being transported by air, sea, rail, truck, or mail can be subjected to a broad range of extremely adverse stresses. Parts may experience thermal and biological stresses, acceleration, acoustic noise, vibration, mechanical shock, radiation, pressure, and physical impact, to name a few. Furthermore, facilities at intermediate stops are more likely to have personnel inexperienced with the handling and care of certain parts.

Failure Mechanisms

Aside from the subtle non-operating environments mentioned previously, one must also consider the extent to which the designed system will lie inactive in the target field environment. To approach this, some familiarity with the failure mechanisms is useful, so that one knows what breaks and can protect against system failure. Four main classes of failure mechanisms are outlined: mechanical, electrical, corrosion, and radiation.

Mechanical Failure Mechanisms. The main mechanisms are fatigue and fracture of various system components. This can be a result of temperature cycling in the presence of varying coefficients of expansion. Vibration and shock are frequent accelerators of mechanical failures. Sand and dust are common causes of increased wear and friction.

Temperature - Temperature cycling can cause fatigue and fracture by inducing stresses on different components as well as inside components. For instance, different thermal expansion coefficients can cause IC dies to fracture and solder joints to break. The possibility of damage is dependent on the magnitude of temperature cycling and length of exposure time. Temperature cycling of 20 degrees Celsius is possible in some environments and prolonged exposure can result in component failures.
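The role of mismatched expansion coefficients can be illustrated with a first-order calculation: the unconstrained differential strain between two bonded materials is the difference of their coefficients of thermal expansion (CTE) multiplied by the temperature swing. The Python sketch below applies that relation. The CTE values are rough illustrative figures for silicon and a glass-epoxy board chosen as assumptions for the example, and the result says nothing by itself about fatigue life.

```python
def mismatch_strain(cte_a_ppm_per_c, cte_b_ppm_per_c, delta_t_c):
    """First-order differential strain between two bonded materials.

    strain = |alpha_a - alpha_b| * delta_T, with CTEs given in ppm per deg C.
    """
    if delta_t_c < 0:
        raise ValueError("delta_t_c is a swing magnitude and must be >= 0")
    return abs(cte_a_ppm_per_c - cte_b_ppm_per_c) * 1e-6 * delta_t_c


if __name__ == "__main__":
    # Assumed, approximate CTE values (illustrative only).
    SILICON_DIE = 2.6    # ppm per deg C
    CIRCUIT_BOARD = 16.0  # ppm per deg C

    # 20 deg C cycle mentioned above, and the -50 C to 75 C storage extreme.
    for swing in (20.0, 125.0):
        strain = mismatch_strain(SILICON_DIE, CIRCUIT_BOARD, swing)
        print(f"swing {swing:5.1f} C -> differential strain {strain:.2e}")
```

Each cycle repeats this strain at the interface, which is why both the magnitude of the swing and the exposure time matter.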

Shock and Vibration - Shocks and vibrations can cause and accelerate many mechanical failures. In particular, flexing of leads and interconnects is common as well as the damaging of components such as bearings. Dampers can be used to absorb shocks and somewhat isolate components.

Sand and Dust - Sand and dust are particularly dangerous to moving parts and optical surfaces.

Electrical Failure Mechanisms. Large currents such as those due to electrostatic discharge (ESD) and lightning can damage integrated circuits. ESD is caused by a large potential difference that can form when two different materials are rubbed together and then separated.

Corrosion Failure Mechanisms. Corrosion is the chemical process of a metal interacting with its surrounding environment. This process can destroy integrated circuits and degrade component parameters. Conditions that accelerate corrosion include high relative humidity, high temperatures, and the presence of dirt or dust [Pecht95]. The most important factor in corrosion is the presence of moisture. Corrosion can arise in numerous ways; for example, it can be caused by moisture ingress and the entrance of contaminants, either during manufacture or through loss of hermeticity. There are a variety of corrosive effects, and Pecht [1995] gives a detailed list of the types of corrosion. Some of them are briefly listed below:

Galvanic Corrosion
Crevice Corrosion
Defects in Passivation
Pitting Corrosion
Surface Oxidation
Corrosion due to Microorganisms

Radiation Failure Mechanisms. There are many different types of radiation effects, many of which cause both mechanical and electrical degradation. Mechanical degradation alters the properties of materials; for instance, radiation can alter the mechanical, optical, thermal, and electrical properties of metals. Electrical degradation shows up during operation: due to the accumulation of alpha particles, bits can be flipped during operation and cause system failure [Pecht95].

Available tools, techniques, and metrics

Parts that spend a large portion of their life in the dormant state require special attention in a reliability analysis. The following are a few methods for assessing and predicting non-operating reliability. The models are often crude and approximate at best, and many rely on field data that may not exist for a particular application. They are mentioned here as a survey of the techniques available for predicting reliability, and are followed by a physics-of-failure approach to design and reliability assessment.

Specific Field Data. Sometimes, you might be lucky enough to have specific field data from which the reliability of your system can be determined. This is common in the automotive industry, where data on certain parts may have been collected from older models of an automobile. If these parts are similar enough, one can extrapolate the expected reliability of the part in question from this data.
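As a rough illustration of working from field data, the Python sketch below pools observations from several fleets into a single point estimate of the failure rate (failures divided by accumulated unit-hours). The fleet numbers are invented for the example, and the estimate assumes the pooled parts are similar enough to share one constant failure rate, which is exactly the judgment described above.

```python
def pooled_failure_rate(observations):
    """Point estimate of failures per unit-hour from pooled field data.

    observations: list of (units, hours_per_unit, failures) tuples.
    Assumes all pooled parts share one constant failure rate.
    """
    total_hours = sum(units * hours for units, hours, _ in observations)
    total_failures = sum(failures for _, _, failures in observations)
    if total_hours <= 0:
        raise ValueError("no exposure time recorded")
    return total_failures / total_hours


if __name__ == "__main__":
    # Hypothetical dormant-time records from two older vehicle models.
    fleets = [
        (5000, 8000, 12),  # units, dormant hours per unit, failures observed
        (2000, 6000, 3),
    ]
    rate = pooled_failure_rate(fleets)
    print(f"Estimated failure rate: {rate:.3e} failures per unit-hour")
```

A point estimate like this carries no confidence bounds; with only a handful of observed failures, the uncertainty can be large.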

RADC-TR-85-91 Method. Essentially, a failure rate is obtained from a base failure rate that is modified by different environmental factors. This is done to account for factors specific to certain components. The method is limited and can result in gross inaccuracies, both because the document is out of date and because it assumes that all failures are exponentially distributed with constant failure rates.
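The general shape of such a prediction, a base failure rate scaled by multiplicative factors and then used in an exponential reliability model, can be sketched in a few lines of Python. The factor names and all numeric values below are hypothetical and are not taken from RADC-TR-85-91; the sketch only shows the structure of the calculation and the constant-failure-rate assumption criticized above.

```python
import math


def adjusted_failure_rate(base_rate, factors):
    """Base failure rate scaled by the product of modifying factors."""
    if base_rate < 0 or any(f < 0 for f in factors.values()):
        raise ValueError("rates and factors must be non-negative")
    return base_rate * math.prod(factors.values())


def reliability(failure_rate, hours):
    """Survival probability under a constant failure rate: exp(-rate * t)."""
    return math.exp(-failure_rate * hours)


if __name__ == "__main__":
    # Hypothetical factor names and values, for illustration only.
    factors = {"environment": 2.0, "quality": 1.5, "temperature": 1.2}
    rate = adjusted_failure_rate(1e-8, factors)  # failures per hour
    ten_years = 10 * 8760
    print(f"Adjusted rate: {rate:.2e} per hour")
    print(f"Ten-year non-operating reliability: {reliability(rate, ten_years):.4f}")
```

Because every factor multiplies the result, an error in any single factor or in the base rate propagates directly into the prediction.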

MIL-HDBK-217 "Zero Electrical Stress". This method is similar to the RADC-TR-85-91 method and has similar problems. It can be extremely inaccurate.

The "K" Factor Approach. This method assumes there is a direct relationship between the operating and non-operating reliabilities of electronics. A "K" factor is applied as a multiplier to the operating reliability to obtain the non-operating reliability. For electronic components, a 30:1 or 60:1 ratio can be used [Harris80]. This method can also be very unreliable because it fundamentally assumes that the stresses in the operating and non-operating conditions are the same [Pecht95]. At best it is a rough approximation.
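A minimal Python sketch of the "K" factor idea follows. It reads the 30:1 and 60:1 ratios as operating failure rate to non-operating failure rate; that reading is an assumption made for this example, as is the operating failure rate used.

```python
def non_operating_failure_rate(rate_operating, k_ratio):
    """Scale an operating failure rate down by an assumed K ratio.

    k_ratio is interpreted here as (operating rate) / (non-operating rate).
    """
    if k_ratio <= 0:
        raise ValueError("k_ratio must be positive")
    return rate_operating / k_ratio


if __name__ == "__main__":
    rate_operating = 3e-6  # hypothetical operating failures per hour
    for k in (30, 60):
        rate = non_operating_failure_rate(rate_operating, k)
        print(f"K = {k}:1 -> non-operating rate {rate:.2e} per hour")
```

The two quoted ratios alone give predictions that differ by a factor of two, which illustrates why the approach is only a rough approximation.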

Many of the above models tend to disregard the details of specific components and group similar parts into the same category. In doing so, the accuracy of the reliability prediction is compromised.

If one knows what can go wrong in a system, then one can design around such faults from the early stages of a project. Thus, a physics-of-failure based approach seems a reasonable starting point for achieving satisfactory non-operating reliability. Pecht [1995] outlines a physics-based approach with the following steps:

Define realistic system requirements
Define the system usage environment
Identify potential failure sites and failure mechanisms (FMECA)
Characterize the materials and the manufacturing and assembly processes
Design reliable products within the capabilities of the materials and manufacturing processes used
Qualify the manufacturing and assembly processes
Control the manufacturing and assembly processes
Manage the life cycle of the product

The above approach to design, reliability assessment, testing and screening uses knowledge about the cause of potential failures and circumvents them via robust design and manufacturing practices.

Relationship to other topics

A study of non-operating reliability is an extension of other reliability topics such as traditional reliability. Furthermore, it is peripherally related to subjects such as field data and maintenance, as well as many others.

Conclusions

The following are the key ideas for this topic:

There is a lot of industrial equipment that spends the majority of its useful life in the non-operating state.
Designing for operating reliability is not necessarily the same as designing for non-operating reliability.
Threats come from different places depending on the non-operating environment. Designers need to clearly define and understand their target environment. Furthermore, they need to be aware of subtle secondary environments that the device may be subject to such as during transportation, modification and storage.

There are numerous common failure mechanisms and system designers should be familiar with each of them. These failure mechanisms have been broken up into four different categories: mechanical, electrical, corrosion, and radiation.

There are some models that can be used to predict non-operating reliability, but they can be grossly inaccurate at times. For this reason, a physics-based approach is recommended. It is by designing around the failure mechanisms that one can avoid system failure.

Annotated References

[Harris80] Harris, A.P. Reliability in the dormant condition. Microelectronics and Reliability. Vol 20. p33-44. 1980.
This paper discusses various reliability models, discusses non-operating concerns and contains non-operating failure data.
[Livesay78] Livesay, B.R. The reliability of devices in storage environments. Solid State Technology. October 1978. p63-8.
This paper discusses the physical and chemical processes leading to the degradation of electronic components in storage environments.
[Pecht95] Pecht, J. and Pecht, M. Long-term non-operating reliability of electronic products. Boca Raton, FL: CRC Press, 1995.
This is a useful book on non-operating reliability that covers numerous topics and can be used to find more information about each such topic. It was referred to heavily in the writing of this text.