Maintenance and Dependability

Carnegie Mellon University

18-849b Dependable Embedded Systems

Spring 1999

Author: Adrian Drury


Abstract:

Maintenance is an important part of the life-cycle of embedded systems, and must be considered from the design stage through the end-of-life stage of the system. Maintenance covers two aspects of systems - operation and performance. Maintenance is generally performed in anticipation of, or in reaction to, a failure, in order to ensure or restore system performance to specified levels. Improperly performed or poorly timed maintenance can make matters worse by introducing faulty parts or maintainer errors, or by reducing profits. A systematic and structured approach to system maintenance, starting during the design process, is necessary to ensure proper and cost-effective maintenance.


Introduction

Maintenance of any kind performed on a system is a consequence of the fact that systems (or components) deteriorate and fail. Any product or system that comes with maintenance directions or procedures carries an implicit statement that there is a non-zero probability that the system could at some point operate outside its specified parameters. Failure to perform the maintenance needed to preserve the dependability of a system can have effects ranging from benign to catastrophic. Developing effective maintenance procedures can be at worst a circular process -- procedures cannot be tested unless something deteriorates or fails. Accelerated stress testing can be used to induce failures, but there is no guarantee that all failures that should be covered in a maintenance plan will be exposed. Also, maintenance procedures are generally not extensively tested until a product has been deployed, at which point properly performed maintenance can be critical.

Overview of Topics

Maintenance has close ties to a variety of other topics relevant to dependable embedded systems design. Dependability is the most obvious tie, because without maintenance, dependability declines. Other topics include economic concerns, such as profits and business models. System life-cycle is closely related, because maintenance is an important part of the lifetime of a system. Human factors are also important because human error during maintenance can cause further problems. Additionally, diagnosis is used to tell what's wrong with a system, or what needs maintenance. Other factors that are related to these topics include project budget, time-to-market and quality.

Dependability

Dependability is obviously a desirable system attribute. Even if a system is designed to be "dependable," it is likely that it will need maintenance at some point in its life. Generally, if a system is designed poorly, maintenance cannot improve its poor performance. Maintenance can simply restore or prolong a previous state of operation. Of course, a poorly designed system could be retrofitted during a maintenance procedure, but retrofitting goes further than maintenance.

Profits and Economics of Maintenance

Profits and business models are strongly related to maintenance, and affect the design decisions that are made. These economic considerations cover a broad range of other topics, which are discussed below. How is your business model affected if there is a low availability of working systems which need to be repaired often? What are the economic benefits and design considerations of disposal versus repair at the system or component level? Who will perform maintenance when it is necessary, and how do the choices affect recommended types of maintenance? What aspects of system maintenance are safety-critical, and how does that affect the system design? Also, how do maintenance contracts affect design decisions?

Repair or Replace

The economic benefits of disposal and repair are often approached most easily from an accounting point of view. If the cost of designing for maintenance is much higher than the cost of not doing so, and the applications in which the product will be used are such that replacement is feasible, then disposal may be a viable option. The expected lifetime of the system must be taken into account as well. It may in fact be cheaper, over the expected lifetime of a product, to design for maintenance than to maintain inventories of replacements that may never be used. On a system level, mechanical systems are virtually always repaired rather than disposed of and replaced, because of the cost involved. Electronic systems are sometimes repaired, but often that repair is done through the disposal and replacement of a subassembly or component. Electronic components are virtually impossible to repair in a cost-effective manner, while many mechanical components can be repaired economically. On a related note, regulatory agencies may require replacement of failed or degraded components, instead of repair, because of the failure modes associated with the components, or their criticality in the system.
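
The accounting comparison described above can be sketched in a few lines. This is only an illustration: the cost model (a purchase cost, a fixed overhead such as a spares inventory, and a cost per failure) and every figure in it are assumptions made for the example, not data from any real product.

```python
def lifetime_cost(unit_cost, cost_per_failure, expected_failures, fixed_overhead=0.0):
    """Total cost of ownership over the expected lifetime of one system."""
    return unit_cost + fixed_overhead + cost_per_failure * expected_failures

# Hypothetical figures, chosen only for illustration.
REPAIRABLE = {"unit_cost": 1200.0, "cost_per_failure": 150.0}   # designed for maintenance
DISPOSABLE = {"unit_cost": 400.0, "cost_per_failure": 400.0,    # each failure buys a new unit
              "fixed_overhead": 100.0}                          # holding a spares inventory

def cheaper_option(expected_failures):
    """Return the name of the cheaper design for a given number of lifetime failures."""
    repair = lifetime_cost(expected_failures=expected_failures, **REPAIRABLE)
    dispose = lifetime_cost(expected_failures=expected_failures, **DISPOSABLE)
    return "repairable" if repair < dispose else "disposable"

for n in (1, 2, 4, 8):
    print(n, cheaper_option(n))
```

With these made-up numbers the disposable design wins when few failures are expected, and the repairable design wins once the expected number of failures over the product's life grows, which is why the expected lifetime has to enter the comparison.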

If there is a low availability of systems that need to be repaired often, it can make sense to simply swap known-good components or subassemblies for faulty ones and diagnose and repair the faulty pieces at a slower pace. For example, if there are a limited number of aircraft necessary for a particular mission, the time necessary to diagnose and repair a malfunctioning engine may make it worthwhile to have extra engines on hand and simply replace an entire engine, and later fix the faulty one.
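
The effect of swapping in a spare can be illustrated with the standard steady-state availability relation, availability = MTBF / (MTBF + downtime per failure). The sketch below is a minimal example; the hour figures are hypothetical and are not taken from any aircraft program.

```python
def availability(mtbf_hours, downtime_hours):
    """Steady-state availability: fraction of time the system is usable,
    given mean time between failures and mean downtime per failure."""
    return mtbf_hours / (mtbf_hours + downtime_hours)

# Hypothetical figures, chosen only for illustration.
MTBF = 500.0             # mean operating hours between engine failures
REPAIR_IN_PLACE = 72.0   # hours grounded while the engine is diagnosed and fixed
SWAP_SPARE = 8.0         # hours grounded while a known-good spare engine is fitted

in_place = availability(MTBF, REPAIR_IN_PLACE)
with_spare = availability(MTBF, SWAP_SPARE)
print(f"repair in place: {in_place:.3f}")
print(f"swap in a spare: {with_spare:.3f}")
```

The faulty engine still has to be diagnosed and repaired, but that work no longer counts against the aircraft's downtime. The price of the higher availability is the cost of holding spare engines.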

Personnel

The issue of who performs maintenance when it is necessary is an important one from the point of view of profits. There are endless variations on who can perform which maintenance, how, and when, but three common situations are covered here. The first approach is for the owner or user of the system to perform the maintenance themselves. In a safety-critical system, this may not be allowed unless the owner has special maintenance training or certification, or hires a suitably trained or certified third party to perform the maintenance. Commercial airplane maintenance is one example of this situation. A second approach is for the producer of the system to have an in-house maintenance staff which performs all maintenance, either at the system's location or on the producer's premises. If the maintainers and the designers work closely together, this approach generally results in the highest quality maintenance. People knowledgeable about the design and functionality of the system are arguably best qualified to maintain it. A third approach is for a third party to provide maintenance. If the maintenance personnel are well trained, this approach can result in maintenance as good as would be provided by the system provider, and may result in quicker service if the third party happens to be located closer to the user's location.

Safety-critical portions of a system pose unique issues in maintenance. Safety-critical parts of a system may have requirements associated with them that virtually require (or exclude) certain types of maintenance. However, safety requirements can affect design decisions to the same degree that profit and business-model considerations do.

Maintenance Contracts

One final consideration related to profits and business models is maintenance contracts. Maintenance contracts are usually profitable for the system providers. However, other design and maintenance decisions can be affected. For example, if scheduled maintenance visits are required (or performed), it makes sense to design the system for preventative maintenance tasks, as opposed to unscheduled corrective maintenance.

Human Factors

Alongside the economic aspects of maintenance is the very real possibility that a condition requiring maintenance can be made worse by improperly performed maintenance or other human involvement. A study of the US public switched telephone network over two years [Kuhn97] found that twenty-five percent of telephone outages were caused by maintenance errors on the part of telephone company personnel. Hardware and software failures caused a combined thirty-six percent of outages. No matter what decisions are made about how and by whom maintenance is performed, human error needs to be carefully considered as a potential problem source.

Diagnostics

Built-in system diagnostics can be an invaluable troubleshooting aid when performing maintenance. Their use needs to be weighed against a variety of factors. Economics play a large role again. Are built-in diagnostics worth the money? Should an external diagnostic tool be manufactured instead and provided only to service providers? If the system will never be repaired, but only replaced, built-in diagnostics are useless except for indicating system failure, and for testing during development. Even there, their use is questionable given the cost of incorporating them in the design in the first place. If built-in diagnostics are available for troubleshooting, the question of how much information will be made available about their function arises. Perhaps it is economically advantageous to patent the diagnostic interface to the system and license or charge third-party service centers to use the diagnostic tools.

As an example of the direct economic costs related to maintenance, consider the maintenance of weapons systems. For the naval weapons systems of the United States, "The direct maintenance cost of aircraft and ships is at least $15 billion per year." [Technology99] This is admittedly beyond the realm of "dependable embedded systems" but the need for dependability in weapons systems is obvious. Personnel are also required for maintenance. "Forty-seven percent of the Navy's active duty enlisted force (173,000 sailors) and 24 percent of the Marine Corps (37,600 marines) are assigned to maintenance functions." [Technology99] Similarly, in space applications (which are some of the most dependable systems developed), operations and maintenance account for nearly half of the life-cycle cost [Wall98].

Types of Maintenance

Maintenance operations have been categorized based on their frequency and their motivating factors. Four of the most common designations are described below: predictive, preventative, corrective and failure-finding.

Predictive maintenance involves a series of steps prior to actually performing maintenance. It begins with sampling physical data over time, such as vibration or particulate matter in oil. Analysis is then performed on the collected data to create an appropriate maintenance schedule, and maintenance is performed according to the schedule. This type of maintenance analysis works well for mechanical systems because the failure modes are well understood. Additionally there is historical data useful for creating and validating performance and maintenance models for mechanical systems.
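
The sample, analyze, and schedule sequence can be sketched as follows. This is a minimal illustration that assumes a linear wear trend and a fixed alarm threshold. The readings, the threshold, and the function names are hypothetical, and real predictive maintenance programs generally use richer models than a straight-line fit.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def hours_until_threshold(hours, readings, threshold):
    """Extrapolate the fitted trend to the alarm threshold.

    Returns the operating hours remaining after the last sample, or None
    if the readings show no upward trend.
    """
    slope, intercept = fit_line(hours, readings)
    if slope <= 0:
        return None
    crossing = (threshold - intercept) / slope
    return max(0.0, crossing - hours[-1])

# Hypothetical vibration samples (mm/s) taken every 100 operating hours.
hours = [0, 100, 200, 300, 400]
vibration = [2.0, 2.2, 2.4, 2.6, 2.8]
ALARM_THRESHOLD = 4.0  # hypothetical limit at which maintenance is required

remaining = hours_until_threshold(hours, vibration, ALARM_THRESHOLD)
print(f"schedule maintenance within about {remaining:.0f} operating hours")
```

The schedule is derived from measured condition rather than from a fixed interval, which is what distinguishes this approach from preventative maintenance.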

Preventative maintenance refers to maintenance performed when a system is functioning properly to prevent a later failure. Generally, it is performed on a regular basis and the maintenance will be performed regardless of whether functionality or performance is degraded. The frequency of the maintenance is generally constant, and is usually based on the expected life of the components being maintained, but there is not necessarily any monitoring occurring at the same time (as there would be in predictive maintenance). One common example is lubrication of mechanical systems after a certain number of operating hours. Another is replacement of lightning arresters in jet engines after a certain number of lightning strikes.

Corrective maintenance refers to maintenance done to correct a problem when something has failed, or is failing. The need for corrective maintenance can be beneficial or detrimental depending on the product and the profit model used during the design phase of the product. On the most obvious level, corrective maintenance is detrimental to operation because it means that something failed, and the system is (probably) not available during the time needed to perform the maintenance. On the other hand, it may be that the economics and planned functionality of a system are such that using a cheaper, replaceable device for which failure is anticipated, makes sense.

Failure-finding maintenance involves checking a (quiescent) part of a system to see if it is still working. This is most often performed on portions of a system dedicated to safety -- protective devices. This is an important type of maintenance check to perform because a failure in a safety system can go unnoticed until another part of the system fails, at which point the combined effect can be far more catastrophic.

Reliability Centered Maintenance

Reliability Centered Maintenance (RCM) is a well-known maintenance methodology. It is an approach to maintenance that helps to answer the question of what maintenance to do at any particular time. RCM started in the aviation industry in the 1960s to reduce maintenance costs and increase the safety and reliability of the maintenance that was performed. It is used today in a wide variety of industries, and has benefits applicable to dependable embedded systems. RCM covers a broad series of steps, from the product design phase to the deployment and maintenance of a system [Aladon99]. The first step in applying RCM techniques is to establish the user's expectations about various characteristics of the system on which RCM will be performed. Then, all the ways the system can fail must be identified, and an FMEA or FMECA is performed to identify root causes of the failure modes. From that information, an appropriate combination of types of maintenance is selected, and an appropriate schedule of those maintenance actions is selected. The maintenance plan is then implemented, and data is collected to refine and improve the maintenance schedule.
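
One common way to order the failure modes that come out of an FMEA is a risk priority number (RPN), the product of severity, occurrence, and detection scores. The sketch below uses that convention purely as an illustration. It is an assumption that an RCM program would rank failure modes this way, since RCM practice also relies on its own decision logic, and the failure modes and scores are invented.

```python
from dataclasses import dataclass

@dataclass
class FailureMode:
    name: str
    severity: int    # 1 (negligible) .. 10 (catastrophic)
    occurrence: int  # 1 (rare) .. 10 (frequent)
    detection: int   # 1 (easily detected) .. 10 (hidden until it causes harm)

    @property
    def rpn(self) -> int:
        """Risk priority number: higher values deserve attention first."""
        return self.severity * self.occurrence * self.detection

# Hypothetical failure modes and scores, for illustration only.
modes = [
    FailureMode("bearing wear", 6, 7, 3),
    FailureMode("relief valve stuck closed", 9, 2, 9),
    FailureMode("indicator lamp burnt out", 2, 5, 2),
]

ranked = sorted(modes, key=lambda m: m.rpn, reverse=True)
for m in ranked:
    print(f"{m.rpn:4d}  {m.name}")
```

In this made-up data the hidden failure of a protective device ranks first even though it is rare, which is the kind of item that failure-finding checks are aimed at.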

RCM is an expensive method to deploy, because of the high initial costs. While most of the implementations of RCM have been successful [D'Addio],[Bowler95], there have been some unsuccessful implementations [Bowler95], making an economic evaluation of RCM an important procedure before its adoption. An economic evaluation of RCM can be difficult to do effectively. [Bowler95] discusses this issue from the perspective of viewing RCM as an investment.

As an investment, RCM should not be undertaken if the financial benefits cannot be shown to outweigh the costs. Traditionally, the financial benefits and costs associated with RCM have been difficult to measure. That is partly because the areas of savings can be vague ("improve plant availability"), and partly because there are not often clear cause and effect relationships in the sphere of RCM evaluation. Costs are generally simpler to identify than benefits. Costs include initial outlays primarily for training, and ongoing costs, including maintenance and support personnel, and expenditures associated with maintenance instituted as a result of RCM findings. Benefit amounts can generally be identified through a series of steps. First, identify the current problems that would be eliminated or lessened through RCM. Second, estimate how much improvement would result for the identified problems through the adoption of RCM. Lastly, quantify each of the improvements in the larger sense of company performance (profits, plant availability, personnel cost, etc.). When that quantification is done, then the economic benefits of RCM can be evaluated to see if its adoption makes economic sense.
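
The three benefit-estimation steps and the cost categories above can be combined into a simple net-benefit calculation. The sketch below is a minimal illustration under stated assumptions: a fixed evaluation horizon, no discounting of future cash flows, and invented problem names and monetary figures. It is not the method of [Bowler95].

```python
from dataclasses import dataclass

@dataclass
class Problem:
    name: str
    annual_cost: float         # step 1: what the problem currently costs per year
    expected_reduction: float  # step 2: fraction RCM is estimated to remove (0..1)

def annual_benefit(problems):
    """Step 3: quantify the estimated improvements as money saved per year."""
    return sum(p.annual_cost * p.expected_reduction for p in problems)

def net_benefit(problems, initial_cost, annual_ongoing_cost, years):
    """Benefits minus costs over the evaluation horizon (undiscounted)."""
    return years * (annual_benefit(problems) - annual_ongoing_cost) - initial_cost

# Hypothetical inputs, for illustration only.
problems = [
    Problem("unplanned outages", annual_cost=200_000.0, expected_reduction=0.30),
    Problem("maintenance overtime", annual_cost=50_000.0, expected_reduction=0.20),
]
result = net_benefit(problems, initial_cost=100_000.0,
                     annual_ongoing_cost=30_000.0, years=5)
print(f"net benefit over 5 years: {result:,.0f}")
```

A positive result suggests adoption may make economic sense under the stated estimates. The conclusion is only as good as the reduction fractions in step 2, which the text identifies as the hard part to measure.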

COTS Maintenance Issues

There is an increasing tendency to use COTS (Commercial Off The Shelf) assemblies in a wide variety of systems today, ranging from other commercial products to military equipment. COTS equipment must be maintained just as custom-designed equipment must, but because of the relatively recent adoption of COTS parts in dependable systems, issues related to maintenance and reliability have not been well thought out. In a study of the use of COTS equipment in military applications, [DeBusk98] covers a number of important points.

First, there is no difference in the importance of proper maintenance procedures and proper reliability and dependability information between COTS and custom parts. Designers should not be lulled into a false sense of security because "someone else" designed and mass produced the product. Nor should they be unjustifiably worried that COTS parts are more inherently likely to fail.

Second, it is vital that designers and users have reliability predictions for the COTS equipment they will be using. Often this data is available directly from the manufacturer. Sometimes it may have to be accumulated through field observation. Regardless of how the data is accumulated, it may be necessary to make adjustments, especially if reliability data is included from MIL-HDBK-217 sources.

Third, product quality needs to be monitored continuously to see if environmental stress screening (ESS), to reduce numbers of latent defects, is warranted. While common for military hardware suppliers, it is usually less necessary for COTS equipment because of the significantly higher volumes and tighter manufacturing controls.

Fourth, there is a critical difference between evaluating failure modes and characteristics of COTS and custom designed equipment. In custom designed equipment, an FMECA is used to evaluate the design of the system. With a COTS device, that is generally not an issue. The level of detail about the design of a COTS assembly is usually not enough to perform an FMECA, and therefore evaluation must be limited to the interface of the assembly.

Finally, it is important to keep track of failure and fault information for COTS equipment. This is partly because better predictions about the use or failure of the equipment can be made with field data. Additionally, trends in failures can be tracked and reported to the manufacturer of the assembly. The manufacturer may very well have data about similar failures in other installations of the assembly and be able to provide advice about corrective action. The manufacturer may also take corrective action themselves, in future product revisions.
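
Field tracking of this kind can start with something as simple as a point estimate of mean time between failures (MTBF), compared against the manufacturer's prediction. The sketch below is a minimal illustration. The field log, the predicted figure, and the 0.75 reporting factor are all hypothetical, and a point estimate from a handful of failures carries wide uncertainty.

```python
def observed_mtbf(unit_hours, failures):
    """Point estimate of MTBF: total operating hours across all fielded
    units divided by the total number of failures observed."""
    total_failures = sum(failures)
    if total_failures == 0:
        return None  # no failures yet, so the data gives no point estimate
    return sum(unit_hours) / total_failures

# Hypothetical field log: operating hours and failure count per installed unit.
unit_hours = [4000, 5200, 3100, 4700]
failures = [1, 0, 2, 1]
PREDICTED_MTBF = 6000.0  # hypothetical manufacturer prediction, in hours

mtbf = observed_mtbf(unit_hours, failures)
# Flag the assembly for a report to the manufacturer when the field estimate
# falls well below the prediction (the 0.75 factor is an arbitrary choice).
needs_report = mtbf is not None and mtbf < 0.75 * PREDICTED_MTBF
print(f"observed MTBF: {mtbf:.0f} h, report to manufacturer: {needs_report}")
```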


Relationships to Other Topics


Conclusions

Maintenance is a complex part of the lifetime of a dependable embedded system. Design and maintenance must be planned together in order to ensure an efficient and cost-effective outcome over the life of the product. There are a variety of approaches to maintenance, and different approaches are applicable based on the expected use and maintenance schedule of an item. Economic considerations are tightly related to maintenance and system lifecycle; it is clear that failure to consider design's effects on maintenance, and vice versa, can have adverse effects on profit.


References

[Kuhn97] Kuhn, Richard D. Sources of Failure in the Public Switched Telephone Network. Pp 31-36, IEEE ???, April 1997.
A discussion of the causes of failure in the telephone system, based on a study of two years of data.

[Anderson90] Anderson, Ronald T. and Neri, Lewis. Reliability-Centered Maintenance: Management and Engineering Methods.
Discussion of RCM in a military context, with complete information about approach to RCM and implementation information.

[Technology99] Technology for the United States Navy and Marine Corps, 2000-2035, Becoming a 21st-Century Force, Volume 8: Logistics. http://www2.nas.edu/nsb2/lgindex.htm Viewed April 27, 1999.
Report on logistics information as it relates to the armed services.

[Wall98] Wall, J., Sinnadurai, N. The Past, Present and Future of EEE Components for Space Applications; COTS - The Next Generation. 1998 IEEE International Frequency Control Symposium. pp. 392-404.
Good information about maintenance as a part of system life.

[Aladon99] http://www.aladon.co.uk Viewed April 27, 1999.
Extensive information about RCM from an RCM consulting company.

[DeBusk98] DeBusk, Billy M., Jr. Managing the Reliability of COTS-based Military Systems. IEEE Proceedings Annual Reliability and Maintainability Symposium. 1998. pp. 394-400.
Investigation of issues relating to the use of COTS equipment in military equipment.

[D'Addio] D'Addio, G. F., Firpo, P. Savio, S. Reliability Centered Maintenance: A Solution to Optimize Mass Transit System Cost Effectiveness.
Economic analysis of RCM in a specific setting - a mass transit system.

[Bowler95] Bowler, D.J., Primrose, P.L., Leonard, R. Economic evaluation of reliability-centered maintenance (RCM): an electricity transmission industry perspective. IEE Proceedings Gener. Transm. Distrib. Vol. 142, No. 1, January 1995. pp. 9-16.
Another example of economic analysis of RCM in a specific setting - an electric power grid.