Diagnosis

Carnegie Mellon University
18-849b Dependable Embedded Systems
Spring 1998

Ying Shi

Abstract:

Diagnosis is a very important and essential fault tolerant computing technique involved in maintenance. It is an action of investigation and determining the cause of a problem, or error, in location and nature. As it’s generally classified into retrospective approach and predictive approach, with tow basic approaches of specification-based diagnosis and symptom-based diagnosis. Symptom-based diagnosis is considered more plausible towards future work in FTC. Specification-based diagnosis typical with black-box testing is the major diagnosis approach for embedded system.

Introduction

Figure-1.

Diagnosis is one of the system failure response stages, as shown in Figure-1. At the first stage of fault confinement, fault effects are limited to one system area, v.s. get propagated (see FT of Ying) and contaminate other system areas; Fault detection stage recognizes unexpected system event has occurred; Diagnosis becomes necessary step when the detection technique does not provide information about the location and/or properties of the failure; Reconfiguration comes into stage when a fault is detected and a permanent failure is located; At the stage of recovery varieties of techniques are utilized to eliminate fault effects; After the undamaged information is recovered, system goes into the restart stage; In the repair stage, we replace those system components diagnosed as having failed; And in the stage of reintegration, the repaired module gets reintegrated into the system.

Diagnosis is an important function in computer systems, both for manual repair and for automatic reconfiguration in fault-tolerant systems. A fault may cause the system to violate its requirements, i.e., to fail. Even if the errors caused by a given fault are masked by redundancy, it is desirable to eliminate the fault. Otherwise, additional faults, accumulating over time, may eventually exceed the fault-masking capabilities of the system and cause system failure. Some form of repair or reconfiguration ability is thus generally necessary, except possibly in fault-masking designs for very short mission times.

Diagnosis is an important aspect of a system's response to failure. Diagnosis involves inferring the process that generated a set of symptoms, results, or outcomes. Diagnosis is a significant proportion of maintenance procedures, not only to isolate the failed component but also to ensure that the repair operation is successful. Generally speaking, diagnosis can be either retrospective or predictive. Retrospective diagnosis seeks to determine what caused a system failure – the “what happened?” question. It can increase system availability by facilitating the quick revival of fallen systems. Predictive diagnosis seeks to determine when a failure will occur – the “what if?” question. Predictive diagnosis (failure prediction) can turn corrective maintenance into preventive maintenance, thereby increasing perceived system reliability. Retrospective diagnosis enables operators and maintainers to assess problems immediately and restore service quickly, while predictive diagnosis can be used to guide preventive and preemptive maintenance. In each case, system downtime is reduced.

Diagnosis is often associated with and sometimes confused with testing. It is important to understand the distinction between diagnosis and testing. Testing is a measurement procedure whose goal is to provide information sufficient for determining whether or not a unit under test responds to a stimulus in accordance with a prespecified standard. Diagnosis is a constrained search procedure whose goal is to guide the administration of tests. Results of testing are often used diagnostically to provide additional symptomatic evidence for deciding which further tests to perform. In practice, the two processes – testing and diagnosis – are often mixed indistinguishably.

Key Concepts

The two basic approaches to the diagnosis of system failure are the specification-based and symptom-based approaches. In the approach of specification-based diagnosis, system design specifications provide information for determining the expected behavior of a system under particular conditions. Diagnostic tests based on this projected behavior are then developed. Traditional diagnostic programs, which help isolate a fault after it has surfaced, are exemplary of the specification-based approach. These programs diagnoses by returning good or bad results from prespecified tests. Such programs are limited in important ways, particularly in their ability to isolate unanticipated faults. Symptoms-based diagnosis is the identification of a fault condition based on its symptoms. This type of diagnosis uses the information captured in system event logs, which sometimes called error logs, and real-time monitoring data to diagnose faults. More or less like a patient's uncomfortable symptoms history tells a story to a doctor, past sequences of events can be reconstructed from system event logs, often revealing the circumstances under which an error or other significant event occurred. Symptom-based diagnosis bases its diagnostic judgments on evaluations of system behavior. The symptoms are then the indicators in a symptom-based approach. Moreover, it is a common observation that standard diagnostic programs cannot stress a system in the same way that an actual system workload can, and hence they are often incapable of replicating a failure.

Specification-based Diagnosis

The core of specification-based diagnosis is testing, which can be characterized as a black box experiment. Black box testing is to apply stimuli to the input terminals and then observe responses, which are called terminal characteristics, on the output terminals. The terminal characteristics may be electrical (such as a straight-line relationship between voltage and current for a resistor), combinational ( such as an AND gate), sequential ( such as a counter), or even complex systems ( such as a microprocessor on a chip). As the functions of the component become more complex, the kindof testing becomes a problem, for there is less direct control and less direct observability of internal behavior. Also it requires that the manipulation of external inputs must establish a certain condition in a component deep in the recesses of the black box, and the outputs of that component must be propagated to the output terminals. As system complexity increases, not only are there more components, but each component is also harder to test.

Testing involves more than just the maintenance activity, it comes into act through most of the stages in the life of a digital system. During the first stage, specification and design, the faults of most concern are logic errors in the algorithms. During the prototype development stage, any number of failures are possible, including logical design errors, wring mistakes, incorrect time- all of which can lead to different functional behavior. Failed components can also cause altered functional behavior. The former, designated as a logical fault. With logical faults, the proper algorithm must ultimately be distinguished from any arbitrary algorithm. Here testing involves many similarities to proving programs correct; However, given a correct design, there are many fewer faulty behaviors caused by a malfunction. The component interconnections limit the number of realizable fault behaviors. In prototype development, the final errors in the design and proposed implementation are sought by testing. Physical connectivity may cause timing errors and coupling between multiple signal lines. Subjecting a small number of systems to design maturity testing establishes baseline failure manifestations and MTTF.

During the manufacturing and installation stages of a system, the main goal is acceptance testing. Here, problems of design have been resolved, and testing focuses on the mass-produced black boxes. The faults are primarily structural , but there may be any number of them resulting from the assembly process. When a system malfunctions during the operational stage, maintenance testing is used to isolate and repair faults. This form of testing is perhaps the easiest form of testing, since at this stage there are few structural faults. Frequently, maintenance tests are run during system idle time to detect failures and increase confidence in the correct functioning of the system. There is a significant trend toward remote diagnosis, either to pinpoint failures before dispatching field service personnel or to issue instructions for customer repair.

At any of the stages of system life, testing can occur at each level in the physical hierarchy. It is extremely important to understand at what level and stage a testing technique is aimed. Thus, Table-2 classifies types of testing by combining some of the levels and stages. At the design stage, issues around testings are design validation, lower confidence level test, sequential probability ratio test, and weibull sequential test. At the production stage, there are paramitric testing, acceptance testing, system-level testing, and design for testability. At the field operation stage, we have testing methods such as margining, built-ins, and synthetic load.

Table-2 Testing based on system level and life stage

Level \ Stage->	Design	Production	Operation
Circuit	Simulation	Parametric	Margining
Logic	simulation	Acceptance test/incoming inspection	Diagnostics/built-in test
System	Design maturity test	Process maturity test	Synthetic load/remote diagnosis

Symptom-based Diagnosis

Analysis of system error files indicates that many permanent hardware failures are preceded by a period of instability . Frequently , the period of instability can be detected by observing trends. If a characteristic symptom can be identified in the trend data, the diagnostic time and hence the period of instability can be reduced. This approach is often called trend analysis or symptom-based diagnosis.

Symptom-based diagnosis, which relies almost entirely on observed performance of the system under diagnostic procedure, is relatively new. It was explored in two independent efforts in the early 1980s. Maxion employed a real-time monitoring technique to gather data about disk failures in a distributed system. Analysis of this data yielded competent faults. Tsao used "tuple" analysis to group symptom data into meaningful units. These units, or tuples, were then mapped into fault mechanisms. When a particular tuple was observed during normal operation, its underlying fault mechanism could be inferred from previous experience. Lin&Siewiorek extended these ideas by developing the frame dispersion algorithm for determining the relationship between errors by examining their closeness in time and space. Maxion & siewiorek and Maxion applied similar techniques to diagnosis of communication networks. This work involves a real-time problem-solving entity called a diagnostic server. The diagnostic server is comprised of four main elements: a knowledge kernel in which are kept general and domain-specific problem-solving strategies; an information gatherer, or sensor, that collects system performance data relevant to the problem at hand; and hypothesis generator that forms a ranked set of hypotheses based on various heuristics; and an experimenter, or effector, that provides mechanisms for confirming or denying hypotheses. The system, though in its early stages of development, has been successful in identifying a number of network performance anomalies.

Methods of Analysis

One trend analysis method, called tuple extraction or tupling, as mentioned above [Tsao and Siewiorek,1983], employs a data grouping or clustering technique . Tuples are clusters, or groups, of event-log entries exhibiting temporal or spatial patterns of features. This approach to the data clustering technique is based on the observation that because computers have mechanisms for both hardware and software detection of faults, single error events can propagate through a system, causing multiple entries in the event log. Tuple extraction clusters those entries into tuples, collections of machine events whose logical grouping is based primarily on their proximity in time and in hardware space. A tuple may contain from one to several hundred event-log entries. Formation into tuples reduces the number of logical entities in an event log. Also matching algorithms are used to determine the uniqueness of a tuple, which further reduces the volume of log information and demonstrates the validity of the "hierarchy of significance" of data in a tuple.

The concept of observing system trends for failure prediction was investigated by Nassar around mid eighties, although this method has been neither thoroughly tested nor implemented, at least some results clearly indicate that failure prediction based on an increase in error rate, a threshold error number, a CPU utilization threshold, or a combination of these factors may be feasible. And soon Iyer developed a probabilistic model to characterize the relationship among errors recorded in a system error log, which was used to automatically detect symptoms of frequently occurring persistent errors in two large control data cyber systems.

The Dispersion Frame Technique(DFT) was developed, based on the observation that electromechanical and electronic devices experience an increasing error rate prior to catastrophic failure. The technique determines the relationship between error occurrences by examining their closeness in time and space. As general case is that users tend not to tolerate too much system defects before the repair, so the data points able to be collected from the event log are not sufficient enough to apply those existing fairly accepted statistic tools, and this is the motive for the DFT heuristic technique, who makes decision based on as few as three to five prefailure messages to predict system failure behavior.

Impact of Symptom-based Diagnosis

There are several reasons to go after symptom-based diagnosis. Strategies for sumptom-based diagnosis closely reflect the strategies employed by personnel performing troubleshooting, thus making symptom-based diagnosis easier to understand and follow. The symptoms on which the approach is based are frequently more accurate indicators in a specification-based approach. Moreover it is a common observation that standard diagnostic programs can, and hence are often incapable of replicating a failure.

During the second half of the decade of the 1970s, Digital Equipment corporation started a shift from specification-based diagnosis, in which the field service engineer would execute diagnostic programs in an attempt to recreate the failure, to symptom-based diagnosis, in which error data is analyzed to capture information about the failure. The latter approach, termed symptom-directed diagnosis (SDD), is based upon techniques similar to those outlines in the previous section and is embodies in programs such as SPEAR( systems package for error analysis and reporting) and VAXsim-PLUS. These programs analyze error log data using heuristically established rules to locate the failing FRU. From the distribution of onsite time spent by a field engineer when specification-based diagnosis was prevalent, the vase majority of the time was spent executing diagnostic programs and observing the system to identify the problem, whereas only 10 percent of the time was spent making the physical repair, and a further 15 percent of the time was devoted to validating that the repair activity fixed the original problem. SDD, in conjunction with the DEC Remote Diagnostic Center, analyzes the system prior to dispatching a field service engineer; this results in a two thirds decrease in on-site time. SDD is not only a powerful technique for reducing repair time but also it quickly identifies problems, thereby decreasing the number of system crashes and increasing system availability.

Practical Problem with Diagnosis

However, an important practical problem in diagnosis is discriminating between permanent faults and transient faults. In many computer systems. the majority of errors are due to transient faults. many heuristic methods have been used for discriminating between transient and permanent faults; however, rarely is there works stating this decision problem in clear probabilistic terms.

The purpose of diagnosis is often limited to identifying the hardware module(s) affected by faults. However, to choose the right fault treatment action it is also necessary to discriminate between permanent and transient faults. With transient faults, modules momentarily become prone to behaving erroneously, though they do not suffer any permanent damage. The natural way of dealing with a transient fault is to keep using the affected module, after recovering any data error caused by the transient fault. In many computer systems, transient faults are known to be the great majority of causes of errors. However, discriminating between transient and permanent faults is difficult.

Diagnosis is based on the observation of erroneous behavior. A general description of diagnosis is as follows. A module( which, depending on the specific design, may be as small as a subset of a chip or as large as a whole computer or more) is monitored, either by observing side effects of its intended computations (typically, signals from error-detection mechanisms) or by applying test procedures every now and then and observing the module's response. Some of the observed behaviors can be diagnosed as erroneous, in the sense that they would not happen unless there is a fault in the system. But a module may behave erroneously for different reasons: a permanent fault, a transient fault or propagation of errors from another module. decisions about fault treatment should be quite different in the three cases, yet the symptom along is insufficient for deciding among them.

Compared to "ordinary" reliability modeling, reasoning about diagnosis is somewhat counter-intuitive. Reliability modeling predicts the probabilities of certain sequences of events given some assumptions. the model plays the role of an omniscient observer. Instead, a diagnostic procedure takes these assumptions, this set of sequences and their probabilities as inputs, but can not depend on knowing the real sequence under way, and involves questions like " if I observe certain symptoms, which are consistent with more than one of the possible sequences, which sequence is really taking place?"

Designers have used many heuristics for discriminating between transient and permanent fault, spanning from simple retry to rather sophisticated off-line error log audit and trend analysis. Heuristics are suggested by intuitive reasoning, and then validated by experiment or modeling. Most on-line techniques use thresholding schemes. They count errors, and when the count crosses a pre-set threshold a permanent fault is assumed.

Available tools, techniques, and metrics

Specification-based diagnosis tools:

"black-box" testing : a fairly mature technique, and this approach is immensely used in embedded systems

Symptom-based diagnosis tools:

tupling : efficient and plausible technique/tool towards automatic diagnosis
DFT heuristic : the validation experiments are under its way in CMU at this moment.

And many algorithms exist for automatic diagnosis as well.

Relationship to other topics

Fault Tolerance Computing. Diagnosis is one of the system fault response stages, which help to provide such information of the faults as location and the like. The failure prediction techniques based on system behavior observation diagnosis can preapproach the defect system component before it causes serious problem, by doing so, system shows a higher reliability seen by user.
Maintenance. Diagnosis is a big piece of maintenance work. It significantly determines the effectiveness of maintenance, due to it's maintenance guidance role

Conclusions

Specification-based diagnosis, frequently referred to as testing, is primarily focused at the lower levels of abstraction (through the logic level) and the earlier stages in the development of a system (through installation). While specification-based diagnosis is frequently applied at the system level and during operational life, the large variation in system configurations and operational environments diminishes the effectiveness of this approach.

Symptom-based diagnosis is rapidly becoming the preferred approach for system-level diagnosis during operational life. Based on the observation that systems usually shows a period of potentially more and more unreliable behavior prior to a catastrophic failure, trends can then be identified and used to isolate the faulty field replaceable unit. If the system has been designed to tolerate these intermittent faults, the user will perceive no system outages during diagnosis. Trend analysis develops a model of normal system behavior and watches for a deviation that signifies the onset of abnormal behavior. Since normal system workloads tend to stress systems differently than specification-based diagnostics, their workloads will uncover problems that are not stimulated by test programs. Trend analysis can also adaptively learn the changing normal usage pattern of individual systems.

I'd like to mention besides is that a new trend in diagnosis employs tools and techniques from artificial intelligence to perform automatic diagnosis of complex systems.

Annotated Reference List

"Optimal Discrimination between Transient and Permanent Faults"
Siewiorek & Swarts, "Reliable Computer Systems - Design and Evaluation", Digital Press, 1992

Loose Ends

Go To Project Page