Lost Messages and System Failures

Philip J. Koopman, Jr
Carnegie Mellon University, Engineering Design Research Center

Draft text for article published in:
Embedded Systems Programming, 9(11), October 1996, pp. 38-52

(For a nicely typeset and edited version, please refer to that published article.)

An Excel Spreadsheet is available with equations from this paper -- our web server is mangling it, so please e-mail me at koopman@cmu.edu for a copy.

Abstract

Lost or delayed messages are common in networked communications, and require a change in thinking to avoid designing systems that are too brittle. If you fail to anticipate the loss of multiple messages in a row, you can experience system failures. This article discusses the potential severity of the problem, and shows how to predict system failures based on the likelihood of message losses, with Ethernet and Echelon's LonTalk used as examples. The article discusses sources of message losses and steps you can take to reduce or eliminate problems.

Introduction

How Often Will the System Fail?
Fault Model
System Failure Rates

Bit Errors Due To Noise

Transmitter Collisions
Ethernet Burst Message Losses
LonTalk Burst Message Losses
LonTalk Steady-State Message Losses

An Example of LonTalk Failure Rates
What Can You Do About It?
Conclusion

Introduction

When you make the leap from a centralized embedded processor to a distributed embedded system, you must deal with a fact of life: communication networks can't guarantee 100% on-time message delivery. Even with properly working network hardware, messages can be lost due to collisions or corruption by noise.

If your system is built to tolerate it, occasional message loss is no big deal. But if you fail to anticipate the loss of multiple messages in a row, you can experience system failures. Even if message losses are relatively infrequent, long periods of operation multiplied by many units in the field can turn small potential problems into large maintenance headaches.

First I will discuss the potential severity of the problem, and show how to predict system failures based on the likelihood of message losses. Then, I will discuss sources of message losses and steps you can take to reduce or eliminate problems.

How Often Will the System Fail?

For our purposes, we're going to assume that nothing is broken in the system. This means that failures will occur due to chance message losses that happen in the normal course of events. In order to get a grip on the issues, we need to make some assumptions about how the system is going to fail.

Fault Model

A fault model is nothing more than a statement of how the system is expected to fail. The fault model I use for message losses is that the intended message does not get delivered within a specified deadline. This deadline takes into account any retries or other corrective action that the lower layers of the protocol may take.

It could be that the deadline is relatively long. For a system that sends periodic messages the deadline may be several periods long (for example, messages sent every 50 msec may be able to occasionally tolerate a 200 msec delay or even miss messages entirely for 200 msec without causing problems bad enough to be a "system failure"). Also, the deadline may be different for each different type of message in the system.

The key to determining failure rates is to determine how many times the system can attempt to transmit a message within the failure deadline. For a system that uses acknowledgements or collision detection hardware, the number of attempts is limited by the time it takes to attempt a transmission and find out that it didn't work. If the message is sent without a way to know if it was successfully transmitted, the number of attempts is the number of consecutive tries to transmit updates of that same information that can be lost outright. The number of message transmission attempts for most cases is then:

[equation]

For example, a 37 msec message deadline with a 10 msec timeout means that a total of 4 message transmission attempts (1 initial attempt plus 3 retries) may be made before the system fails, with the fourth attempt delivered within 7 msec (having been sent at 30 msec after the first attempt) . This example could either be due to a 10 msec acknowledgement timeout period, or that a control loop that can tolerate losing only 3 messages in a row.

System Failure Rates

Given the number of times the network can attempt to send a message, the system failure rate is simply the probability that all attempts fail. If we assume that failures are independent, the probability of the system failing when sending any given message is:

[equation]

While the failure probability is typically quite small for any particular message, things can add up quickly in an embedded system that is transmitting tens or hundreds of messages per second for years at a time. To put things in perspective, the failure rate per year is:

[equation]

To get an idea of how failure rates can multiply, Figure 1 shows the number of failures per year for a system. Eight different lines on the log-log plot show the curves different values of max_attempts message failures required to lead to system failure. As an example, if 4 messages lost in a row causes a system failure, this will happen 63 times per year with a only a 1% message failure rate. If a system failure requires a service call, then a service call every 5.8 days might be too frequent to be acceptable for many systems, especially when multiplied by thousands of systems in the field.

Figure 1. A system that tolerates many lost messages in a row will fail dramatically less often.

Figure 1 can be used to make some gross generalizations about message loss rates based on the robustness of the system. At less than 0.001% message failure rate, most systems will operate without problems. At between 0.01% and 1%, systems that are fairly robust will be OK, but attention must be paid to what happens when several messages are lost in a row. Care must be taken because these errors may be infrequent enough not to show up in prototypes, but instead only manifest once large-scale system production has begun.

Above about 10%, most systems will suffer from frequent errors. Unfortunately, as we shall see, it is fairly easy to get into that situation using some off-the-shelf communication solutions.

Bit Errors Due To Noise

Noise is an unpleasant fact of life. Any time a message is transmitted on a communications network, there is a probability that some of the bits in the message will be corrupted by noise. The probability of corruption varies widely depending on the quality and length of the network wire, the type of coupling between the transmitter and the wire, and the amount of noise in the environment. We'll assume that the corrupted message is caught by an error checking code and discarded at the receiver.

The effects of noise are typically expressed as the probability that any given bit will be corrupted, such as 10^-6 (each bit has a one-in-a-million chance of being corrupted). One of the first things you should attempt to determine when using a communications network is what this bit error rate is. Unfortunately, this information may be hard to come by because of a general lack of experimental data and high variability between installations. When in doubt, you can either design robustly with an assumed high bit error rate (such as 10^-5 or whatever value is prudent for your application), or live with the costs of reworking the hopefully few installations that have noise problems.

In order to compute the number of messages lost to bit errors we compute the probability that all bits will be uncorrupted in a message and subtract from one:

[equation]

For example, if a typical message is 96 bits long, and the bit error rate is 10^-6, then the message error rate is:

[equation]
or in other words, 0.0096% of messages will be lost due to bit errors.

Figure 1 shows that for this example running at 200 messages per second, the system should be built so as not to fail unless three or more messages are lost in a row. If the system fails with only two lost messages, approximately 58 failures per year can be expected because of noise.

Transmitter Collisions

In some communication systems, the protocol guarantees that there will be no collisions between transmitters. In other words, collision-free protocols guarantee by their very operation that no two transmitters can transmit at the same time. Commercially available collision-free protocols include Token Bus (e.g., Arcnet), Bit-wise arbitration (e.g., CAN when no two transmitters have the same priority), Master-based Polling (e.g., MIL-STD-1553), and Time Division Multiple Access (e.g., ARINC-629).

However, some network protocols permit collisions because it is one of the few practical ways to efficiently accommodate a large number of nodes on a single communication medium. For example, Ethernet uses a collision-based approach that works well for PCs and workstations, but is not suitable for time-critical applications without alteration. Other protocols, such as Echelon's LonTalk protocol, are designed to tolerate collisions and can be used for many embedded applications.

A collision can happen any time there is more than one transmitter with a message to send when using a collision-based protocol. But, collisions happen most frequently when there is a lot of traffic on a network with limited bandwidth, or when there are a large number of messages from different transmitters in a burst of traffic. Such message traffic bursts can be common in embedded applications that employ synchronized control loops, trigger multiple actions from an external event, or employ a single master broadcasting to multiple slave systems. Collisions are a normal part of protocol operation, and are not inherently bad as long as message delivery deadlines are met.

Figure 2. Each message is followed by one or more random access time slots that can be used to transmit the next message.

Figure 2 shows that after a message is transmitted in either Ethernet or LonTalk, a number of random-access idle slots elapses on the network. Any transmitter having traffic to send randomly selects one of the slots, so each slot is either idle or selected by at least one transmitter. The first non-idle slot is when the next message is sent. If that slot was chosen by a single lucky transmitter, that transmitter will send its message without a collision (transmitters selecting later slots notice that a message transmission is already in progress, and so do not attempt to transmit). If the first non-idle slot is, by chance, selected by more than one transmitter, then those transmitters will suffer a collision and have to try again.

Ethernet Burst Message Losses

Let's look at Ethernet as an example, and consider a case when multiple transmitters each have messages ready to transmit simultaneously. Ethernet uses a binary exponential slot allocation. This means that on the first transmission attempt a transmitter always tries the first slot; on the second attempt it randomly selects either slot 0 or 1; on the third attempt slots 0-3; and so on until on the eleventh attempt it selects slots 0-1023 (the number of slots available is 2^i-1 on the ith transmission attempt). Ethernet also has a limit of 16 total transmission attempts, on the assumption that simultaneous bursts of traffic typically do not happen from more than a handful of transmitters in a typical computer network. So, if a large burst of traffic occurs, all messages that have not been sent at the end of 16 attempts will be discarded.

Ethernet works fine for office automation and workstation applications, but can have problems with bursty traffic found in embedded systems. As Figure 3 shows, bursts of more than 12 messages will cause standard Ethernet systems to operate in the failure-prone right-hand side of Figure 1, with 10% or higher message losses. There are other reasons not to use Ethernet for embedded systems, including the fact that it has poor efficiency at high loads because of all the time spent experiencing collisions. [See Upender & Koopman, "Communication Protocols for Embedded Systems," Embedded Systems Programming, 7(11), November 1994, pp. 46-58 ].

Figure 3. When Ethernet is heavily loaded with a burst of traffic, simulation results show that some messages are lost when they are discarded after 16 transmission attempts.

LonTalk Burst Message Losses

Like Ethernet, Echelon's LonTalk protocol uses a collision-based protocol with slots after each message. Instead of a binary exponential backoff algorithm, LonTalk simply uses 16 slots after each message. In the special case of multi-cast message acknowledgements, LonTalk increases the number of slots to 16 times the number of acknowledgements expected, which greatly reduces the potential for collision. However, we are at this point considering message transmissions, not acknowledgements.

If LonTalk is used with collision detection, then messages are essentially never lost. This is because transmitters try 255 times to deliver each message, so the burst of messages eventually all gets through.

However, in many installations collision detection is not used with LonTalk. In particular, physical limits of noise, distance, or transmission medium make it impractical to use collision detection in the harsh environments typical of many embedded systems.

When LonTalk is used without collision detection, messages involved in collisions are lost rather than retried, and the receiver must wait until the next periodically generated message of that type to get the information. If message acknowledgements are used as an indirect method of collision detection (messages lost in collisions will fail to generate an acknowledgement), the timeout value used to wait for the acknowledgement sets a limit on the number of transmission attempts possible before missing deadlines.

Figure 4 shows the message losses that can be expected when using LonTalk without collision detection hardware under bursty traffic conditions. Any such LonTalk-based system that generates bursts of 3 or more messages at a time can expect collisions involving more than 10% of the messages. This could result in a significant problem with lost messages, depending on how robust the system is to missed messages.

Figure 4. Significant numbers of messages are lost due to collisions in LonTalk when operated without collision detection.

LonTalk Steady-State Message Losses

Not all embedded systems generate bursts of messages. But in many embedded systems, particularly those with limited message delivery bandwidth, the network may still have backed up message traffic on a regular basis. In addition to the situation where the network is simply being loaded to capacity, this heavy steady-state loading can take place because of bursts of traffic that are close enough in time so as to merge into a single period of heavy message transmissions. Also, once the network gets behind because of a burst of traffic or time spent retrying collided messages, it can take a while for the limited bandwidth available to catch up with the message backlog.

It is important to point out that LonTalk is in fact designed to tolerate much more abuse in terms of heavy message traffic than a non-embedded network protocol such as Ethernet. LonTalk is stable under continuous heavy loads, whereas standard Ethernet is simply not designed to work in such a situation. And, LonTalk is able to function even when collision detection can't be used -- Ethernet requires collision detection to work at all.

However, if the communication network is heavily loaded, LonTalk can still lose a significant percentage of messages.

In the bursty traffic situation just discussed, the math was messy enough that a simulation was the best way to gain insight. However, if we assume a relatively constant network message load that has several messages ready to transmit at any given time, we can arrive at an analytic solution for LonTalk message loss rates.

Let us suppose that there be T transmitters actively selecting from among S slots in a LonTalk network. What we want to compute is the expected number of messages lost due to collision out of the total number of messages that were supposed to be sent (in this context a retried message counts as an attempt to send a new message).

First we compute the probability that n transmitters will collide in the mth slot (numbered from 1 to S). The probability that any given n transmitters all occupy the same slot m out of S slots is:

[equation]

But, there are in general several possible combinations of n out of T transmitters that may all occupy that slot. Using combinatorial mathematics, the total probability of n transmitters occupying the same slot m is:

[equation]

In addition, for there to be a collision the mth slot must be the first slot selected by any transmitter, so the the remaining transmitters must all select slots after the mth slot:

[equation]

So, the probability that there will be a collision in the mth slot is:

[equation]

In a LonTalk system that uses collision detection, the probability of a collision is simply the sum of probabilities of a collision in each slots:

[equation]

Figure 5 shows this equation evaluated for 16 slots. Because LonTalk attempts 255 retransmissions, essentially all messages will be transmitted even in the face of a high collision rate. Of course, there may be many retransmissions required for any particular message, so system failures could possibly be caused by this retry latency. The system failure rate can be computed using the probability of collision and the number of successive collisions that can be tolerated for any particular message using the equation given earlier.

Figure 5. LonTalk's 16-slot approach results in a significant number of collisions under steady-state loading, although essentially all messages eventually get through if collision detection is used.

It's important to realize, however, that the number of collisions is only half the story. In a LonTalk system that doesn't use collision detection it is the number of messages that is lost rather than the collision rate that is important in determining the potential for system failure.

The average number of messages lost per transmission attempt is the sum of the probabilities of losing messages P_nm for each number-of-transmitter/ slot-number pair weighted by the number of messages n involved in a collision.

[equation]

In order to get a normalized result, we must compare the number of messages lost in collisions to the total number of messages submitted to the network (both lost and sent messages). The probability of an attempt to transmit being successful in sending one message is 1 minus the sum of all possible failures:

[equation]

Finally, the probability of losing a message due to collisions in the transmission process is the ratio of lost messages to total messages sent:

[equation]

Figure 6 shows the results of using this analytic solution for 16 slots, as well as for LonTalk's "adaptive slot" mode that is only used for multicast message acknowledgements. As this graph shows, any LonTalk system without collision detection hardware that on average has more than one transmitter active at a time will lose more than 10% of messages due to collisions. As we saw in Figure 1, this is likely to lead to unacceptable system operation, depending on the number of consecutive lost messages that can be tolerated.

Figure 6. Without collision detection a steady-state load of even two simultaneously active LonTalk transmitters can lead to significant message losses in normal 16-slot operation.

In fact, Figure 6 is a generalization of the earlier burst model in that a burst of traffic simply starts with, say, 16 active transmitters, then decrements through 15, 14, and down to 1. Burst transmission events are roughly equivalent to a steady-state condition involving about half the number of transmitters.

Tables 1 and 2 give some Monte Carlo simulation results that can be used to check the analytic equations presented above. The analytic message losses shown for the LonTalk burst simulations were obtained by having the program compute the analytic probability for each transmission attempt in the simulation and accumulate it for comparison with the simulation results.

Table 1. Simulation and analytic results for burst traffic on LonTalk.

10000000 Simulation Trials

Steady-state   Simulated    Analytic
Active         Message      Message
Transmitters   Losses       Losses
------------   --------     --------
   1           0.000000     0.000000
   2           0.117851     0.117647
   3           0.171214     0.171123
   4           0.221529     0.221453
   5           0.268899     0.268758
   6           0.313329     0.313216
   7           0.355372     0.354997
   8           0.394456     0.394262
   9           0.431029     0.431159
  10           0.465779     0.465830
  11           0.498364     0.498409
  12           0.528870     0.529019
  13           0.557682     0.557779
  14           0.584955     0.584799
  15           0.610523     0.610184
  16           0.634046     0.634031

Table 2. Simulation and analytic results for steady-state LonTalk traffic.

10000000 Simulation Trials

Burst          Average        Simulated  Analytic   
Active         Transmitters   Message    Message
Transmitters   Active         Losses     Losses
------------   ------------   --------   ---------
   1           1.000000       0.000000   0.000000
   2           1.516091       0.062356   0.062500
   3           2.021316       0.100348   0.100342
   4           2.519947       0.132297   0.132179
   5           3.012929       0.160971   0.160978
   6           3.500631       0.187594   0.187744
   7           3.983836       0.212862   0.212964
   8           4.461014       0.236893   0.236814
   9           4.934062       0.259739   0.259548
  10           5.402001       0.281307   0.281263
  11           5.865488       0.302090   0.301994
  12           6.324401       0.321752   0.321919
  13           6.778946       0.341182   0.340867
  14           7.230104       0.359191   0.359244
  15           7.675349       0.376889   0.376715
  16           8.117022       0.393270   0.393644

An Example of LonTalk Failure Rates

As an example, let us assume that LonTalk is being used with a bit error rate of 10^-6 and an average of only three transmitters active for each message transmission. Further assume that the system transmits 200 unacknowledged messages per second in operation, and can tolerate 8 lost messages in a row. Because acknowledgements are not in use, this means that control loops can withstand up to 8 missed periodic messages. Most systems I have seen tolerate fewer lost messages, and tend to have a higher average number of active transmitters, but this can vary considerably depending on the application.

If we assume 96 bits per message, this example has a message error rate due to bit errors of:

[equation]
The failure rate from this source is:

[equation]
In other words, you don't have to worry about bit errors as long as you can retry transmission enough times. So far, so good. But, what about losses due to message collisions?

If the system predominantly operates with an average of three active transmitters in 16-slot operation, LonTalk can be expected to lose 17.11% of all messages. In this case, there are:

[equation]
or, approximately a failure every two hours. This is just frequent enough to cause problems, but could be a real challenge to debug.

An obvious approach to fixing this is to use message acknowledgements so that the system retransmits any lost messages. Unfortunately, this at a minimum doubles the message traffic to 400 messages per second, which exceeds LonTalk's capacity for 78 Kbps networks. If the faster (but more expensive) 1.25 Mbps network were used, up to 15 retries would be needed to reduce failure rates to less than once every 10 years. Using a faster network might well work for this example, because there would probably be time for 15 retries within 8 message periods so long as other messages in the system were not unduly delayed.

What Can You Do About It?

The best way to deal with the message loss problem is to make your system robust in the face of losing several messages in a row. But, this is not always practical, nor is it necessarily easy.

What happens when a system fails due to lost messages is, of course, heavily dependent on the application. If the network is transmitting messages for a control loop, some systems will lose control after missing a deadline of only a transmission period or two. In a well-designed system, this will either cause a degraded mode of operation or activate mechanical safeties. However, either case may prompt a service call or cause system damage. In a system that does not take into account the possibility of lost messages, the results can be more severe, including potentially killing somebody.

In general the most important thing is to get a feel for whether message losses are going to be a problem in your system. Find out the noise characteristics of your operating environment and compute how often you will lose two, three, or more message transmission attempts due to noise. If your system can't tolerate the problem, look into investing in better shielding, a more robust network physical layer technology, or even fiber optics. Or, use error correcting codes instead of simply error detecting codes so that messages survive a corrupted bit

Message losses from collisions are a more serious problem, because in many cases they will happen much more frequently. Finding out the number of simultaneously active transmitters is critical -- and may be difficult. I have found that simulations help estimate the number of active transmitters with a minimum of fuss, and can yield better insight into some aspects of system operation than laboratory or field experiments. But, there is nothing like a lab test to give you that warm fuzzy feeling that your simulations are in fact correct, and that you are simulating the aspects of the system that actually matter. I recommend you simulate the collision mechanism for the protocol you are using and get a prediction of message loss rates for your system, then conduct some modest experiments with the actual hardware to confirm your results.

If you find that you will suffer too many message collisions, then you can reduce the frequency of potential system failures in a number of ways:

Use collision detection. However, collision detection may not work reliably in many embedded applications due to, for example, use of transformer-coupled network connections or high signal attenuation on long wiring runs.
Use acknowledgements rather than simply waiting for the next periodic message. Unfortunately, some protocols require separate messages for each acknowledgement, which may double the network traffic or worse.
Reduce the timeout interval for message acknowledgements to increase the number of transmission attempts within the deadline period. This may increase the network traffic if too many acknowledgement timeouts are due to network delays rather than lost messages.
Send each message twice to increase the odds that it will arrive intact. This doubles network traffic.
Reduce the probability of collisions by using priority-based message transmission. Be careful that the priority-based transmission system you use works well when transmitting onto an idle network -- some collision-based priority mechanisms do not, resulting in inconsistent prioritization of messages.

If none of those techniques are sufficient, you may have to redesign your system to reduce the average number of concurrently active transmitters. In order to do this, you will need to address some or all of the underlying reasons for having many active transmitters:

Heavily loaded network with insufficient bandwidth reserve to handle traffic bursts.
Synchronization of tasks to a real time scheduling clock.
External events triggering a flurry of communication activity.
Synchronization to rotating machinery.
Synchronization to control loops with similar periods.
Accidental synchronization of periodic messages that drift in and out of phase.

If all else fails, you can always switch to a collision-free protocol, although they have their drawbacks as well.

Conclusion

Lost or delayed messages are common in networked communications, and require a change in thinking to avoid designing systems that are too brittle. In some cases system failures may happen seldom enough that they may not be seen in prototyping or pilot installations, or may be blamed on other causes.

On the other hand, there is no sense spending a lot of energy chasing down problems that just aren't important. A good rule of thumb is that if communication errors account for less than 1% to 10% of expected system failures, then your time is probably better spent working on the problems accounting for the other 90% to 99% of failures instead.

If you use an embedded communication network, your system will eventually experience lost or excessively delayed messages. Will you be able to estimate in advance how often it will happen? Is it a big enough problem to be important? Will your software be able to handle it? What will be the costs if it can't?

Phil Koopman

Lost Messages and System Failures

Abstract

Contents

Introduction

How Often Will the System Fail?

Fault Model

System Failure Rates

Bit Errors Due To Noise

Transmitter Collisions

Ethernet Burst Message Losses

LonTalk Burst Message Losses

LonTalk Steady-State Message Losses

An Example of LonTalk Failure Rates

What Can You Do About It?

Conclusion