Carnegie Mellon University
18-849b Dependable Embedded Systems
Author: Leo Rollins
Communication is essential to achieving a dependable distributed embedded system. Designers of these systems are faced with several challenges in specifying the communication network. Complex systems usually require some sort of shared media network. In this environment, the designer must recognize the fundamental trade-off that exists between the efficiency and the predictability of the network. Given this trade-off, the designer must evaluate and select the communication network. Particular attention must be given to the protocols, which determine much of the network behavior. Finally, many error detection methods are available which are necessary to build a reliable communication system.
Most historical communication systems can be considered "embedded" from at least one perspective: they have a very narrowly defined task. They are not designed for general purpose communication. For instance, telephones were conceived only for the purpose of voice transmission. However, this has been changing in recent years with the design of integrated services networks. These networks are designed to carry different types of communication including voice, data and video signals. Even systems with a single original purpose, like telephony, have been exploited for other traffic, such as data transfer between computers. Another development that has increased interest in general purpose communication is the Internet. Once computers across the world began to be connected, the problem of incompatible networks became apparent. The OSI (Open Systems Interconnection) Reference Model was developed in an attempt to solve this compatibility problem. This model divides the communication system into seven layers which provide varying levels of service. The layers were intended to provide standard interfaces and services, so that various protocols, machines and network types could coexist.
Despite the spread of general purpose networking ideas, there are still many closed systems which have very specific purposes. In this environment, a simple and efficient protocol can be enforced without the danger of incompatibilities. An example is the set of devices in a modern automobile that communicate over an in-vehicle network. From the perspective of the author, these narrowly defined closed systems are considered embedded communication systems. Even so, there is increasing interest in connecting embedded systems to larger networks for status monitoring purposes. Just as embedded systems have borrowed communication protocols and technology from larger communication systems, they are likely to borrow many of the interconnection and standardization ideas in the near future.
The majority of embedded communication systems can be classified as either point-to-point networks (data links) or shared media networks (data highways). It is important to understand the trade-off between these two types of systems. In point-to-point networks, each node of the system is connected to every other node. These systems are simple and reliable. Reliability is high since correct transmission between two nodes only depends on a single transmitter and receiver. Since each link is dedicated to communication between two nodes, it is easy to meet real-time deadlines without any sophisticated scheduling mechanism. In shared media systems all nodes are connected together using a ring or bus topology. The primary motivation for shared media is the reduction in wiring (and thus cost). These networks are easily extendable without adding any new data ports to individual nodes. Limited new cabling is required.
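The wiring argument can be made concrete with a little arithmetic. The Python sketch below assumes a fully connected point-to-point network (one dedicated link per pair of nodes) and a single shared bus with one tap per node; the function names are illustrative and not taken from any standard.

```python
def point_to_point_links(nodes: int) -> int:
    """Dedicated links needed so that every node reaches every other node."""
    return nodes * (nodes - 1) // 2


def point_to_point_ports_per_node(nodes: int) -> int:
    """Data ports each node needs in the fully connected case."""
    return nodes - 1


def shared_bus_taps(nodes: int) -> int:
    """Connections to the medium when all nodes share one bus."""
    return nodes


if __name__ == "__main__":
    for n in (4, 10, 30):
        print(n, point_to_point_links(n), shared_bus_taps(n))
```

Under these assumptions the number of point-to-point links grows quadratically with the node count, while the number of bus taps grows linearly, which is the cost motivation described above.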
The price for scalability and reduced cost of a shared media network is the complexity that must be added to the network protocol. Some means must be added to arbitrate for access to the shared media. The remaining discussion in this paper applies mainly to shared media embedded communication systems.
In practice, communication systems may not be purely event or state based. A communication protocol may contain some properties of each. However, it is instructive to examine the fundamental differences between an event based system and a state based system. One of the fundamental trade-offs between these two types of systems is the efficient use of resources found in event based systems versus the predictability of the network found in state based systems. The primary resources of concern in the network are bandwidth (the amount of data that can be transmitted per unit time) and the buffer space required at nodes to process incoming or outgoing messages.
In an event based communication system, messages are generated and transmitted in response to "events" detected at a local node in the network. Examples of "events" include changes in the value of process variables, new alarm conditions that have been detected, conditions that represent alarms clearing, or requests by other nodes for data. An example of an event based communication system is the typical office network. Messages are generated by users whenever they send data to printers, access data on shared network drives, run applications that exist on other machines or send email to others in the network.
One goal of event based communication is the efficient use of network bandwidth. By transmitting only necessary data, an efficient use of network bandwidth is assured. However, since data is transmitted only when there is a change at the source node, every message becomes important. This places additional requirements on the communication system to assure that each message is delivered successfully. One mechanism to do this is for destination nodes to acknowledge each successful transmission and request a retry for each corrupted message. If an acknowledgement is not generated within a specified timeout, the source node may also repeat the message of its own accord. Note that this acknowledge and retry mechanism consumes some additional network bandwidth.
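The acknowledge and retry mechanism can be sketched in a few lines of Python. This is a minimal illustration, not any particular protocol: the `transmit` callable is a hypothetical stand-in for "send the message and wait up to the timeout for an acknowledgement", returning `True` only if the acknowledgement arrived in time, and `make_lossy_link` is a test helper invented for this example.

```python
from typing import Callable


def send_with_retry(transmit: Callable[[bytes], bool],
                    message: bytes,
                    max_attempts: int = 3) -> int:
    """Send until acknowledged; return the number of transmissions used."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        # transmit() returns False when no acknowledgement arrived in time.
        if transmit(message):
            return attempt
    raise TimeoutError(f"no acknowledgement after {max_attempts} attempts")


def make_lossy_link(losses_before_success: int) -> Callable[[bytes], bool]:
    """Hypothetical link that drops the first few transmissions."""
    remaining = [losses_before_success]

    def transmit(_message: bytes) -> bool:
        if remaining[0] > 0:
            remaining[0] -= 1
            return False
        return True

    return transmit


if __name__ == "__main__":
    print(send_with_retry(make_lossy_link(2), b"alarm: high pressure"))
```

The return value makes the bandwidth cost visible: a message that needs three attempts occupies the channel three times, in addition to the acknowledgement traffic.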
Consider the example of an event based distributed monitoring system. This system monitors plant conditions and generates alarms when certain conditions are detected. During normal operation, the network should be lightly loaded with few alarm conditions. During system upsets, many messages will be required due to multiple alarm conditions and changing state. It is difficult to predict the maximum number of messages that might be exchanged during this situation. Many nodes may compete for the communication channel. Therefore it is difficult to confirm that a system design will contain adequate resources (bandwidth and buffers) to handle the load. For a system with safety functions, the network is at its worst (in terms of delay and lost messages) when it is needed the most. This condition is sometimes referred to as the alarm flood problem. One potential solution to this problem is to design an overly conservative network in order to meet the worst case situation. This approach may not be feasible in a small embedded system with cost constraints.
In a state based communication system, messages represent the entire state of a node. For instance, all of the alarms for a node are transmitted as either on or off in its message. A node sends its fixed size message at pre-defined, regular intervals. Access to the media is easily scheduled, since the message requirements of each node never change. Network load is fixed and can be easily calculated during system design. An example of a state based system is a distributed process control system. Each node has a fixed number of inputs, calculated values, and alarm conditions that it sends in its message to other nodes in the network.
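Because every node sends a fixed size message at a fixed interval, the network load can be computed directly at design time. The following Python sketch shows one way to do that; it assumes that message sizes already include any framing overhead and ignores media access overhead, and the function name and example figures are hypothetical.

```python
from typing import Iterable, Tuple


def bus_utilization(messages: Iterable[Tuple[int, float]],
                    bit_rate: float) -> float:
    """Fraction of the channel used by periodic state messages.

    messages: (size_in_bits, period_in_seconds) for each node's message.
    bit_rate: raw channel capacity in bits per second.
    """
    offered_load = sum(size / period for size, period in messages)
    return offered_load / bit_rate


if __name__ == "__main__":
    # Two hypothetical nodes on a 1 Mbit/s channel.
    nodes = [(1000, 0.010), (500, 0.005)]
    print(f"utilization: {bus_utilization(nodes, 1_000_000):.0%}")
```

A result comfortably below 1.0 indicates that the fixed schedule fits on the channel; the same calculation for an event based system would require an estimate of the worst case message rate.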
The state based system is less efficient in terms of network bandwidth than the event based system. Network bandwidth is sacrificed for the predictability of regular message size and regular access to the communication channel. Note that some reduction in the overall data is possible. Each piece of data occupies a fixed location in the message, so only the value itself needs to be sent. Information about what each data point represents is not required to be transmitted with the message.
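A fixed message layout of this kind can be illustrated with Python's standard `struct` module. The layout below (two 16-bit analog values followed by one byte of alarm bits) and the field names are assumptions made for the example; a real system would define its own layout.

```python
import struct

# Hypothetical layout: little-endian, two unsigned 16-bit values, one alarm byte.
STATE_LAYOUT = struct.Struct("<HHB")


def pack_state(temperature_raw: int, pressure_raw: int, alarm_bits: int) -> bytes:
    """Encode the node state; only values are sent, no tags or names."""
    return STATE_LAYOUT.pack(temperature_raw, pressure_raw, alarm_bits)


def unpack_state(payload: bytes) -> tuple:
    """Decode by position: the meaning of each field comes from the layout."""
    return STATE_LAYOUT.unpack(payload)


if __name__ == "__main__":
    message = pack_state(2150, 1013, 0b00000101)
    print(len(message), unpack_state(message))
```

Sender and receiver must agree on the layout in advance, which is the price paid for not transmitting a description of each data point.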
State based systems can be designed to tolerate the occasional missed message. Re-transmission may not be necessary, since the entire state will be transmitted again in the next time interval. If messages are transmitted at twice the required frequency, the system can meet its deadlines even if every second message is corrupted. In order to tolerate two corrupted messages in a row, each node could be designed to transmit its messages at three times the required frequency.
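The rule of thumb generalizes: to tolerate k consecutive corrupted messages, transmit at k + 1 times the required frequency. The Python sketch below, with hypothetical function names, computes the send period and checks a loss pattern against the required update period.

```python
from typing import Sequence


def send_period(required_period_s: float, tolerated_losses: int) -> float:
    """Send interval that still meets the deadline after consecutive losses."""
    return required_period_s / (tolerated_losses + 1)


def longest_gap(delivered: Sequence[bool], period_s: float) -> float:
    """Longest time without a good message, given per-slot delivery results."""
    gap = worst = 0
    for ok in delivered:
        gap += 1
        if ok:
            worst = max(worst, gap)
            gap = 0
    # Include any run of losses at the end of the observed pattern.
    return max(worst, gap) * period_s


if __name__ == "__main__":
    period = send_period(1.0, tolerated_losses=1)
    # Every second message corrupted, sent at twice the required frequency.
    print(period, longest_gap([True, False, True, False, True], period))
```

In this example the receiver still sees a fresh state at least once per required period, at the cost of doubling the bandwidth used.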
One difficulty in state based systems is transient data. It is important for a source node to maintain momentary signals for a sufficient duration that all nodes will see the data. Although the data persists for only a fraction of one message time, a source node may need to transmit the data in several successive messages. This situation is sometimes referred to as the "pulse-stretching" problem. An example of transient data is a momentary push-button which is a hard-wired input to a single node. Assume that indications of button presses are needed at some other node in the system. If the condition that the button had been pushed was transmitted in only a single message, and that message was lost, other nodes would never be informed that the button had been pressed.
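One simple way to stretch a pulse is to latch the momentary input and report it in a fixed number of successive state messages. The Python class below is a sketch of that idea; the class name, method names, and the choice of hold count are assumptions for illustration.

```python
class PulseStretcher:
    """Holds a momentary input true for several outgoing state messages."""

    def __init__(self, hold_messages: int) -> None:
        self.hold_messages = hold_messages
        self._remaining = 0

    def sample(self, pressed: bool) -> None:
        """Call whenever the hard-wired input is read; a press restarts the hold."""
        if pressed:
            self._remaining = self.hold_messages

    def next_message_bit(self) -> bool:
        """Call once per outgoing state message to get the bit to transmit."""
        if self._remaining > 0:
            self._remaining -= 1
            return True
        return False


if __name__ == "__main__":
    button = PulseStretcher(hold_messages=3)
    button.sample(True)   # momentary press seen between two messages
    button.sample(False)  # button already released
    print([button.next_message_bit() for _ in range(5)])
```

With a hold count of three, the press survives the loss of up to two consecutive messages, under the assumption that the receiver treats the repeated indications as a single press.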
According to [Kopetz97], there has never been, nor will there ever be, a perfect real-time protocol. This is because there are fundamental conflicts in the requirements that we would like to place upon the communication system. These requirements are the best features of both event based and state based systems. The conflicts reflect a trade-off between the efficiency and flexibility found in the event-based system and the predictability found in the state-based system. Trade-offs exist for external control versus composability, flexibility versus error detection and protectiveness, sporadic data versus regular data, spontaneous service versus interface simplicity, and probabilistic access versus replica determinism. For a detailed discussion of each specific trade-off refer to [Kopetz97].
Even though no "best" protocol exists, embedded system designers are not relieved of the task of specifying an appropriate communication system. Therefore it is important to focus on the key differentiating factors found in the protocols. The OSI Reference Model, shown in Figure 1, can be used to examine communication protocols. A brief description of the function of each of the layers is provided below. For more information refer to [Spragins94].
Embedded systems tend to focus on layer 1 (physical) and layer 2 (data link) and use minimal or non-existent upper layers. Two reasons for this focus may be: 1) embedded communication systems are simple and do not require upper level services, and 2) upper layers add overhead that cannot be tolerated in some real-time systems. However, this situation may change as complexity increases in embedded systems and users demand more features such as interoperability with other networks. In order for multiple networks to communicate, a common interface is needed. A common upper layer in the protocol may provide this interface.
Within the data link layer a sub-layer called media access control (MAC) exists which determines many of the characteristics of a shared media communication system. Several media access techniques have been proposed and successfully used in popular protocols. Some common media access techniques and protocols which use these techniques are covered below.
Individual protocol studies have been undertaken and published in journals. Unfortunately, only limited comparisons of the protocols used in embedded systems have been performed. Refer to [Koopman94] for a qualitative comparison of protocols used in embedded systems and more detailed coverage on individual media access techniques.
Although the protocols themselves are not strictly event or state based, they often lend themselves more easily to one or the other system type. For instance, CSMA/CD used in Ethernet is a probabilistic access method. Event based systems fit well with this protocol because of the sporadic nature of messages. Time division multiplexing protocols break up the network bandwidth into time slices for individual nodes. State based systems can efficiently use one or more time slices to send their regular data.
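The time slice idea can be shown with a minimal round-robin schedule. The Python sketch below assumes equal length slots assigned to nodes in a fixed order and uses integer microseconds to avoid rounding issues; the function name and the slot length are hypothetical and do not describe any specific protocol.

```python
from typing import Sequence


def slot_owner(time_us: int, nodes: Sequence[str], slot_us: int) -> str:
    """Return the node that owns the channel at a given time.

    Slots of slot_us microseconds are handed out in list order and the
    schedule repeats once every node has had its slot.
    """
    slot_index = time_us // slot_us
    return nodes[slot_index % len(nodes)]


if __name__ == "__main__":
    nodes = ["A", "B", "C"]
    for t in (0, 1500, 2999, 3000):
        print(t, slot_owner(t, nodes, slot_us=1000))
```

Because ownership depends only on the time and the static node list, every node can compute the schedule locally, provided the nodes share a sufficiently accurate common time base.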
Error detection and diagnostics are important in any embedded system and especially in safety critical systems. Communication systems are fairly advanced in their capabilities for detecting, tolerating and sometimes correcting errors. In Table 1, some typical errors for communication systems are listed. Along with each error type, the typical defenses available in the communication system are discussed. Knowledge of common error types and defenses is invaluable to the system designer.
|Error type|Typical defense|
|repeated message|serial number|
|station run-on|anti-jabber circuit|
|failure propagation|surge protection|
|intermittent errors|statistical counters|
|interface H/W errors|loopback testing|
|cable breaks|dynamic reconfiguration|
Channel noise - Noise is typically induced in communication channels from the environment or cross-talk from adjacent wires. A technique to reduce the noise is the use of fiber optics for the communication channel. Fibers are immune to electromagnetic interference. Cyclic redundancy checks (CRCs) are often appended to messages. These checks allow the detection of all single and many multiple bit errors induced in messages. More sophisticated error coding techniques can also be used to correct bit errors.
Stale messages - Old messages that do not represent accurate real-time data may be present in the system. Some protocols include a time-stamp that is inserted by the source to mark the message age. Note that this implies some global time base.
Repeated messages - In certain failures of a host node or its network interface, the same message may be continuously repeated. Some protocols include a serial number for each message. Destination nodes can easily detect repeated or out of sequence messages.
Failure propagation - In a shared media system it is important to prevent failures in one node from propagating to other nodes. Surge protection is often included to prevent electrical failure propagation. Fiber optic cables serve as galvanic isolation between nodes. Redundant networks can also prevent propagation.
Station run-on - Stations may fail in such a way that they monopolize the shared media. Some protocols, such as Ethernet, contain an anti-jabber supervisory circuit. These circuits bound the time that any station is allowed access to the media. The station will be locked out until a specified silence period is observed.
Memory errors - Internal to a node, a message may be copied several times. Copies are typically made in DMA transactions or other exchanges between a host node and its network interface chipset. It is possible to add information checksums to messages that can be used to detect memory errors in the copy process.
Interface hardware failures - Communication interface hardware can fail. Diagnostics are included in many communication systems that allow loop-back testing of the interfaces.
Intermittent errors - Errors may begin to occur at an elevated rate that is still below the threshold for system errors. However, the increase in these errors may signal a bad part or connection in the system. Many communication chipsets include statistical counters that show error rates and types. If reported to the system level, these errors can signal maintenance action before system failures occur.
Cable breaks - The loss of communication through cable breaks is normally detectable through loss of signal. However, some communication systems include fault-tolerant capabilities that tolerate cable breaks. One example is FDDI, which is configured as counter rotating rings. Individual stations can reconfigure the rings to bypass the cable breaks.
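Several of the defenses above (CRC, serial number, time-stamp) can be combined in a single receive-side check. The Python sketch below is illustrative only: the frame layout, the field widths, and the names `build_frame` and `check_frame` are assumptions and do not correspond to any particular protocol. It uses the standard library's `zlib.crc32` for the CRC and assumes that sender and receiver share a millisecond time base.

```python
import struct
import zlib

# Hypothetical header: 16-bit serial number, 32-bit time-stamp in milliseconds.
HEADER = struct.Struct("<HI")
CRC_FIELD = struct.Struct("<I")


def build_frame(serial: int, timestamp_ms: int, payload: bytes) -> bytes:
    """Prefix the payload with the header and append a CRC-32 over both."""
    body = HEADER.pack(serial, timestamp_ms) + payload
    return body + CRC_FIELD.pack(zlib.crc32(body))


def check_frame(frame: bytes, last_serial: int, now_ms: int, max_age_ms: int):
    """Return (status, serial); status is 'ok', 'corrupt', 'repeated' or 'stale'."""
    if len(frame) < HEADER.size + CRC_FIELD.size:
        return "corrupt", None
    body = frame[:-CRC_FIELD.size]
    (received_crc,) = CRC_FIELD.unpack(frame[-CRC_FIELD.size:])
    if zlib.crc32(body) != received_crc:
        return "corrupt", None          # channel noise or memory error
    serial, timestamp_ms = HEADER.unpack_from(body)
    if serial == last_serial:
        return "repeated", serial       # same message sent again
    if now_ms - timestamp_ms > max_age_ms:
        return "stale", serial          # too old to represent current data
    return "ok", serial


if __name__ == "__main__":
    frame = build_frame(serial=7, timestamp_ms=1000, payload=b"data")
    print(check_frame(frame, last_serial=6, now_ms=1050, max_age_ms=100))
```

This sketch only detects an immediate repeat of the previous serial number and does not handle serial number or time-stamp wraparound; a real receiver would need both.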
Protocol Analyzers can be attached to most networks to examine data at the bit, character and frame level. Headers for common protocols can be automatically decoded. These analyzers are particularly useful in examining errors and violations of protocols.
Time domain reflectometers can analyze the cabling and connections in networks. Versions of this equipment exist for electrical and fiber optic media. They are useful in finding cable breaks and bad connections, and in determining cable lengths. These instruments work by sending a wave down a cable and examining the reflections. Each reflection represents a connection or imperfection in the cable. In fiber optics, connection quality is extremely important. These instruments can determine the loss in signal level introduced by each connection.
Formal methods techniques have been applied to the verification of communication protocols (Petri Nets, LOTOS, SDL, Z). Petri Nets, in particular, were created to analyze communication networks. The verification of a protocol is normally required during the standardization process. The embedded system designer can therefore ensure some level of correctness by choosing a standardized protocol.
A more likely effort for embedded system designers is the selection of a communication system. In order to do this, a good grasp of the requirements and key issues involved in the decision is needed. A list of the issues, along with specific recommendations, is presented in [Preckshot93] NUREG/CR-6082, Data Communication. This document was developed as a guide for regulatory authorities to use when evaluating proposed systems. Even though it is intended for the nuclear industry, it is applicable to other embedded systems because it asks focused questions about the communication system.
The common metrics published in manufacturers' literature are data rate and error rate. Protocol studies give more detailed measures of the performance that may be expected. These studies involve complex modeling and simulation techniques. It is not surprising that large scale quantitative comparisons between many protocols have not been attempted. Examples of the metrics found in protocol studies are throughput versus load, delay versus throughput, and worst case utilization.
In general the communication theory and analysis techniques are quite mature. However, the process of selecting the communication system is ad hoc at best.
Communication may be considered as a form of I/O. However, a more applicable relationship may be the current trend to use field busses to communicate with I/O.
The communication architecture is often the method used for achieving a dependable embedded system, usually through redundancy.
Embedded communication systems are most often real-time systems. The real-time topic covers schedulability, which is important in shared media networks.
Fault Tolerant Computing
Communication enables fault tolerant computing through the use of error detection.
Error coding techniques are often used in communication for error detection, error correction, reliability, compression, and optimum signal-to-noise ratio.
Formal methods are often used for communication protocol verification.
A fundamental trade-off exists between efficiency and predictability in the selection of an event-based system or a state based system. Regardless of the decision, there are shortcomings that must be addressed. The lack of a detailed quantitative comparison of protocols places the burden of communication system evaluation squarely on the embedded system designer. Many of the properties of the communication system are determined by the media access protocol. Therefore, embedded system designers should focus on the media access method when determining what protocol to use. Other significant factors to consider include the communication technology and its cost or longevity. Often overlooked in design are the error conditions. Communication systems have a variety of mechanisms that can be used to detect errors. Utilizing these detection methods, the designer can build a reliable communication system.
Notes: Good qualitative comparison of protocols, especially the variety of available media access methods. Practical. Written at an introductory level. Examines different media access methods in some detail.
Notes: Wide variety of information on real-time systems. A key discussion examines five fundamental trade-offs in the ideal requirements of communication systems. The communication section shows a bias towards the Time-Triggered Protocol (TTP), about which the author has written other papers.
Notes: Written from a safety system assessor's viewpoint for critical systems. However, this document asks all the questions an embedded system designer should ask himself. The appendix is more tutorial in nature.
Notes: Good source for mathematics of communication, queuing theory, and metrics. Unfortunately, the examples covered are standard communication protocols rather than embedded system protocols. This reference also provides a good background on the OSI Reference Model and communication networks in general.
Notes: The discussion is more tailored to using existing telecom networks. Not so applicable to embedded systems. This may be a future issue. Does cover the issue of QOS and mixing of traffic types.
Notes: High level architecture of two safety critical communication systems. One is based on Ethernet, the other on FDDI.
Notes: Tries to predict if Ethernet will spread to factory floor. May be too opinionated. Not very strong in supporting arguments. May have hidden agenda, since it is on his employer's web site.
Notes: This is a big issue for some protocols like TDMA (and for some applications that require sequence-of-events). Gets into the Byzantine Generals problem for synchronization in the presence of faulty clocks. There are lots of papers about this issue. [Kopetz97] has one also.
Notes: Good idea because the network connection is passive and therefore more reliable. Unfortunately, this method may not be supported by current technology. High losses in stars only work with strong transmitters. Ok in early days when fiber optic transmitters were high power. Now most transmitters are LEDs with lower power output.
Notes: Analysis of the problems with protocols for certain applications, but only covers 3 (LonTalk, CAN and IEEE-1394).
Notes: In depth analysis of MAC types. Tends toward mathematical.