Embedded Communication

Carnegie Mellon University
18-849b Dependable Embedded Systems
Spring 1999

Author: Leo Rollins

Abstract:

Communication is essential to achieving a dependable distributed embedded system. Designers of these systems are faced with several challenges in specifying the communication network. Complex systems usually require some sort of shared media network. In this environment, the designer must recognize the fundamental trade-off that exists between the efficiency and the predictability of the network. Given this trade-off, the designer must evaluate and select the communication network. Particular attention must be given to the protocols, which determine much of the network behavior. Finally, many error detection methods are available which are necessary to build a reliable communication system.

Introduction

Most historical communication systems can be considered "embedded" from at least one perspective: they have a very narrowly defined task. They are not designed for general purpose communication. For instance, telephones were conceived only for the purpose of voice transmission. However, this has been changing in recent years with the design of integrated services networks. These networks are designed to carry different types of communication including voice, data and video signals. Even systems with a single original purpose like telephony have been exploited for the transfer of other traffic, like data transfer for computers. Another development that has increased interest in general purpose communication is the Internet. Once computers across the world began to be connected, the problem of incompatible networks became apparent. The OSI (Open Systems Interconnection) Reference Model was developed in an attempt to solve this compatibility problem. This model divides the communication system into seven layers which provide varying levels of service. The layers were intended to provide standard interfaces and services, so that various protocols, machines and network types could coexist.

Despite the spread of general purpose networking ideas, there are still many closed systems which have very specific purposes. In this environment, a simple and efficient protocol can be enforced without the danger of incompatibilities. An example is the network of devices in a modern automobile. From the perspective of the author, these narrowly defined closed systems are considered embedded communication systems. Even here, there is increasing interest in connecting embedded systems to larger networks for status monitoring purposes. Just as embedded systems have borrowed communication protocols and technology from larger communication systems, they are likely to borrow many of the interconnection and standardization ideas in the near future.

The majority of embedded communication systems can be classified as either point-to-point networks (data links) or shared media networks (data highways). It is important to understand the trade-off between these two types of systems. In point-to-point networks, each node of the system is connected to every other node. These systems are simple and reliable. Reliability is high since correct transmission between two nodes only depends on a single transmitter and receiver. Since each link is dedicated to communication between two nodes, it is easy to meet real-time deadlines without any sophisticated scheduling mechanism. In shared media systems all nodes are connected together using a ring or bus topology. The primary motivation for shared media is the reduction in wiring (and thus cost). These networks are easily extendable without adding any new data ports to individual nodes. Limited new cabling is required.

The price for scalability and reduced cost of a shared media network is the complexity that must be added to the network protocol. Some means must be added to arbitrate for access to the shared media. The remaining discussion in this paper applies mainly to shared media embedded communication systems.

Key Concepts

Event versus State Based Communication

In practice, communication systems may not be purely event or state based. A communication protocol may contain some properties of each. However, it is instructive to examine the fundamental differences between an event based system and a state based system. One of the fundamental trade-offs between these two types of systems is the efficient use of resources found in event based systems versus the predictability of the network found in state based systems. The primary resources of concern in the network are bandwidth (the amount of data that can be transmitted per unit time) and the buffer space required at nodes to process incoming or outgoing messages.

In an event based communication system, messages are generated and transmitted in response to "events" detected at a local node in the network. Examples of "events" include changes in the value of process variables, new alarm conditions that have been detected, conditions that represent alarms clearing, or requests by other nodes for data. An example of an event based communication system is the typical office network. Messages are generated by users whenever they send data to printers, access data on shared network drives, run applications that exist on other machines or send email to others in the network.

One goal of event based communication is the efficient use of network bandwidth. By transmitting only necessary data, an efficient use of network bandwidth is assured. However, since data is transmitted only when there is a change at the source node, every message becomes important. This places additional requirements on the communication system to assure that each message is delivered successfully. One mechanism to do this is for destination nodes to acknowledge each successful transmission and request a retry for each corrupted message. If an acknowledgement is not generated within a specified timeout, the source node may also repeat the message of its own accord. Note that this acknowledge and retry mechanism consumes some additional network bandwidth.
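The acknowledge and retry mechanism described above can be sketched as follows. This is only an illustration, not a particular protocol: the function names, the `deliver` callback (assumed to return True when an acknowledgement arrives before the timeout) and the attempt limit are all hypothetical.

```python
def send_with_retry(deliver, message, max_attempts=4):
    """Send a message until it is acknowledged or the attempts run out.

    deliver(message) is an assumed channel function that returns True
    when an acknowledgement arrives before the timeout, False otherwise.
    Returns the number of attempts used, or None if the sender gave up.
    """
    for attempt in range(1, max_attempts + 1):
        if deliver(message):
            return attempt
    return None


def lossy_channel(outcomes):
    """Build a deliver() whose successive results follow a fixed list.

    Once the list is exhausted, every later transmission succeeds.
    """
    remaining = iter(outcomes)

    def deliver(message):
        return next(remaining, True)

    return deliver
```

Each failed attempt costs one extra transmission, which is the additional bandwidth the paragraph above refers to.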

Consider the example of an event based distributed monitoring system. This system monitors plant conditions and generates alarms when certain conditions are detected. During normal operation, the network should be lightly loaded with few alarm conditions. During system upsets, many messages will be required due to multiple alarm conditions and changing state. It is difficult to predict the maximum number of messages that might be exchanged during this situation. Many nodes may compete for the communication channel. Therefore it is difficult to confirm that a system design will contain adequate resources (bandwidth and buffers) to handle the load. For a system with safety functions, the network is at its worst (in terms of delay and lost messages) when it is needed the most. This condition is sometimes referred to as the alarm flood problem. One potential solution to this problem is to design an overly conservative network in order to meet the worst case situation. This approach may not be feasible in a small embedded system with cost constraints.

In a state based communication system, messages represent the entire state of a node. For instance, all of the alarms for a node are transmitted as either on or off in its message. A node sends its fixed size message at pre-defined, regular intervals. Access to the media is easily scheduled, since the message requirements of each node never change. Network load is fixed and can be easily calculated during system design. An example of a state based system is a distributed process control system. Each node has a fixed number of inputs, calculated values, and alarm conditions that it sends in its message to other nodes in the network.

The state based system is less efficient in terms of network bandwidth than the event based system. Network bandwidth is sacrificed for the predictability of regular message size and regular access to the communication channel. Note that some reduction in the overall data is possible. Each piece of data occupies a fixed location in the message. Therefore the data can be restricted to the value alone. Information about what each data point represents does not need to be transmitted with the message.
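As a small illustration of a fixed-position state message, the sketch below packs a node's state into a frame that carries values only. The layout (two 16-bit inputs, one 32-bit calculated value, one byte of alarm bits) and the function names are assumptions made for this example; sender and receiver are assumed to agree on the layout at design time.

```python
import struct

# Hypothetical layout, little-endian with no padding:
#   H = 16-bit input, H = 16-bit input, f = 32-bit float, B = 8 alarm bits
STATE_FORMAT = "<HHfB"


def pack_state(input_a, input_b, calculated, alarm_bits):
    """Encode one node's complete state as a fixed-size frame."""
    return struct.pack(STATE_FORMAT, input_a, input_b, calculated, alarm_bits)


def unpack_state(frame):
    """Decode a frame; the meaning of each field is implied by its position."""
    return struct.unpack(STATE_FORMAT, frame)
```

Because no tags or names travel with the data, every frame has the same length, which is what makes the network load easy to calculate in advance.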

State based systems can be designed to tolerate the occasional missed message. Re-transmission may not be necessary, since the entire state will be transmitted again in the next time interval. If messages are transmitted at twice the required frequency, the system can meet its deadlines even if every second message is corrupted. In order to tolerate two corrupted messages in a row, each node could be designed to transmit its messages at three times the required frequency.
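The arithmetic above generalizes: to tolerate k corrupted messages in a row, a node sends at k + 1 times the required frequency. The sketch below states that rule and checks a loss pattern against it; the function names are hypothetical and time is measured in send intervals.

```python
def send_rate_multiplier(consecutive_losses_tolerated):
    """Factor by which the send frequency exceeds the required frequency."""
    return consecutive_losses_tolerated + 1


def worst_update_gap(received):
    """Longest wait, in send intervals, between good messages.

    received is a list of booleans, one per send interval, True when the
    message in that interval arrived intact. A trailing run of losses is
    counted as the gap accumulated so far.
    """
    gap = worst = 0
    for ok in received:
        gap += 1
        if ok:
            worst = max(worst, gap)
            gap = 0
    return max(worst, gap)
```

The deadline is met as long as `worst_update_gap(...)` does not exceed `send_rate_multiplier(k)`, since that many send intervals make up one required update period.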

One difficulty in state based systems is transient data. It is important for a source node to maintain momentary signals for a sufficient duration that all nodes will see the data. Although the data persists for only a fraction of one message time, a source node may need to transmit the data in several successive messages. This situation is sometimes referred to as the "pulse-stretching" problem. An example of transient data is a momentary push-button which is a hard-wired input to a single node. Assume that indications of button presses are needed at some other node in the system. If the condition that the button had been pushed was transmitted in only a single message, and that message was lost, other nodes would never be informed that the button had been pressed.
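One way to picture pulse stretching is a counter at the source node that holds a momentary signal for a fixed number of outgoing messages. The sketch below is an assumption-laden illustration: `stretch_pulse` and `hold_messages` are hypothetical names, and each list element stands for one message interval.

```python
def stretch_pulse(samples, hold_messages):
    """Return the value transmitted in each successive state message.

    samples[i] is True if the button was pressed at any time during
    message interval i. A press is reported in hold_messages consecutive
    messages so that losing one message does not lose the event.
    """
    remaining = 0
    transmitted = []
    for pressed in samples:
        if pressed:
            remaining = hold_messages  # (re)start the hold period
        transmitted.append(remaining > 0)
        if remaining > 0:
            remaining -= 1
    return transmitted
```

For example, a press seen in a single interval with `hold_messages=3` is reported in three successive messages, so up to two of them can be lost.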

Finding the Best Real-Time Protocol

According to [Kopetz97] there has never been, nor will there ever be, a perfect real-time protocol. This is because there are fundamental conflicts in the requirements that we would like to place upon the communication system. These requirements are the best features of both event based and state based systems. The conflicts reflect the trade-off between the efficiency and flexibility found in the event based system and the predictability found in the state based system. Trade-offs exist for external control versus composability, flexibility versus error detection and protectiveness, sporadic data versus regular data, spontaneous service versus interface simplicity, and probabilistic access versus replica determinism. For a detailed discussion of each specific trade-off refer to [Kopetz97].

Even though no "best" protocol exists, embedded system designers are not relieved of the task of specifying an appropriate communication system. Therefore it is important to focus on the key differentiating factors found in the protocols. The OSI Reference Model, shown in Figure 1, can be used to examine communication protocols. A brief description of the function of each of the layers is provided below. For more information refer to [Spragins94].

Figure 1: OSI Reference Model

Layer 7: Application - Provides standard interfaces for different types of data transfer such as mail or file transfer.

Layer 6: Presentation - Allows data to be presented to the application in the native format allowing communication between systems with different data representations.

Layer 5: Session - Provides a means for applications to structure a dialogue between each other.

Layer 4: Transport - Provides transparent transfer of data and end-to-end control of message transfer.

Layer 3: Network - Provides an abstraction from the particular communication technology used at lower layers. Includes the functions of routing and relaying messages within the network.

Layer 2: Data link - Provides the procedures for access to the channel, initiating and closing links between stations, grouping of characters into messages or frames, error control and frame synchronization.

Sublayer LLC - Logical Link control is concerned primarily with the establishment and termination of a virtual connection between two stations in a network.
Sublayer MAC - Media Access Control is concerned primarily with arbitrating for and granting access to the communication channel.

Layer 1: Physical - Provides the electrical or optical transmission characteristics and representation of signals. This layer also includes the procedures used to initiate or close communication on a physical link.

Embedded systems tend to focus on layer 1 (physical) and layer 2 (data link) and use minimal or non-existent upper layers. Two reasons for this focus may be that (1) embedded communication systems are simple and do not require upper level services, and (2) upper layers add overhead that cannot be tolerated in some real-time systems. However, this situation may change as complexity increases in embedded systems and users demand more features such as interoperability with other networks. In order for multiple networks to communicate, a common interface is needed. A common upper layer in the protocol may provide this interface.

Within the data link layer a sub-layer called media access control (MAC) exists which determines many of the characteristics of a shared media communication system. Several media access techniques have been proposed and successfully used in popular protocols. Some common media access techniques and protocols which use these techniques are covered below.

CSMA/CD - Carrier sense multiple access with collision detection. Each node monitors the channel or carrier to determine when the channel is idle. This is known as carrier sense. If the node has a message to send it begins transmission. The node continues to monitor the channel while it is transmitting. Another node in the network could also begin transmitting on the clear channel. In this case a collision would be detected by both nodes. The nodes would stop transmitting their messages and send a jam signal for a duration long enough for all nodes in the network to see the collision. Each node then computes and waits for a random interval before retrying its transmission. The Ethernet protocol used in office LANs (local area networks) popularized this access method. It was later standardized as IEEE 802.3.
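The random retry interval can be illustrated with the truncated binary exponential backoff that IEEE 802.3 specifies: after the n-th collision a node waits a random number of slot times drawn from 0 to 2^k - 1, where k = min(n, 10). The sketch below shows only that calculation; the function name and the `rng` parameter are assumptions for this example, and carrier sensing, the jam signal and the attempt limit are left out.

```python
import random

SLOT_TIME_BITS = 512  # slot time of 10 Mbit/s Ethernet, in bit times


def backoff_slots(collision_count, rng=random):
    """Number of slot times to wait after the given number of collisions.

    The range doubles with each collision up to the tenth, so repeated
    collisions spread the retries over a progressively longer window.
    """
    k = min(collision_count, 10)
    return rng.randrange(2 ** k)
```

Because the wait is random, the access delay has no fixed upper bound, which is why this method is described as probabilistic.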

CSMA/CA - Carrier sense multiple access with collision avoidance. Initial access to a clear channel is handled much as in CSMA/CD. However, after a collision and a jam signal, stations use contention slots to resolve access to the channel. Each slot gives a node or group of nodes priority of access during that slot. Because of these priority contention slots, some collisions that would have occurred on retries in CSMA/CD are avoided in CSMA/CA. The slot assignments are rotated between successive collisions to ensure fairness. An example of this protocol is LonWorks.

Polling - In polling, a single master node controls access to the channel. All other nodes are polled sequentially to determine if they have messages to send. If they have messages, they are granted access to send their messages. Note that this method relies heavily on correct operation of the master. Intel's BitBus and many fieldbus communication protocols use this method.
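A single polling cycle can be sketched as a loop run by the master. The data structure here (a dictionary mapping node names to queues of pending messages) and the function name are hypothetical, chosen only to show the order in which access is granted.

```python
def poll_cycle(nodes):
    """Run one polling cycle and return the messages in transmission order.

    nodes maps a node name to its list of pending messages. The master
    visits the nodes in a fixed order (the dictionary's insertion order)
    and lets each polled node send everything it has queued.
    """
    sent = []
    for name in nodes:
        queue = nodes[name]
        while queue:
            sent.append((name, queue.pop(0)))
    return sent
```

Nothing is transmitted unless the master asks for it, which is why a failed master silences the whole network.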

Bit Dominance - In bit dominance protocols, all nodes are synchronized. Each node begins transmission on a clear channel by sending its node or message ID, which indicates the priority of its transmission. The medium must be one on which one bit value (the dominant bit) overrides the other whenever both are sent at the same time. If 1's dominate over 0's, the node with the highest ID wins the bidding because the 1's in its ID dominate any 0's sent by other nodes; a node that sends a 0 but observes a 1 on the channel stops transmitting. The Controller Area Network (CAN), which is used heavily in automobiles, is based on this access method. In CAN the dominant bit is 0, so the message with the lowest numeric identifier wins.
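The bidding process can be simulated bit by bit. The sketch below follows the convention of the description above, in which a 1 is dominant and IDs are sent most significant bit first; the function name and the assumption that all contending IDs are unique are mine.

```python
def arbitrate(ids, width):
    """Return the ID that wins bitwise arbitration among the given IDs.

    Assumes unique, non-empty ids that fit in `width` bits, sent most
    significant bit first on a medium where a 1 overrides a 0.
    """
    contenders = set(ids)
    for bit in reversed(range(width)):
        # The channel carries a 1 if any remaining contender sends a 1.
        bus = max((node_id >> bit) & 1 for node_id in contenders)
        # A node that sent 0 but observes 1 has lost and stops sending.
        contenders = {n for n in contenders if (n >> bit) & 1 == bus}
    (winner,) = contenders
    return winner
```

The winner's message is never corrupted by the losers, so no bandwidth is lost to collisions. For a 0-dominant medium the same loop applies with `min` in place of `max`.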

Token Passing - In token passing, access to the channel is determined by the holder of a token. When this node is finished transmitting, it passes the token in a special message to the next node in the network. If a node has no messages to send, it simply passes the token. Special bidding processes are often required to establish the initial holder of the token and how long each node may hold the token. IBM's token ring, token bus and FDDI (fiber distributed data interface) all use some form of token passing.

TDMA - Time division multiple access. In this access method, the bandwidth of the network is sliced into slots. Each node is allocated one or more slots where it has sole access to the channel. The slots repeat continuously, giving each node periodic access to the channel. ARINC 629 (ARINC stands for Aeronautical Radio, Incorporated) is a protocol established for embedded airplane networks that uses this method. Another TDMA protocol designed specifically for fault tolerant real-time applications is TTP (time-triggered protocol). This protocol is a relatively recent development (1993) and its applications in real embedded systems are unknown to the author.
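A repeating slot schedule can be expressed as a simple lookup from time to slot owner. This generic sketch does not reproduce ARINC 629 or TTP; the schedule, the microsecond time base and the function name are assumptions, and all nodes are assumed to share a common clock.

```python
def slot_owner(schedule, slot_time_us, now_us):
    """Return the node that has sole access to the channel at time now_us.

    schedule lists one node name per slot; the whole list repeats
    continuously. Times are integers in microseconds.
    """
    slot_index = (now_us // slot_time_us) % len(schedule)
    return schedule[slot_index]
```

A node that appears twice in the schedule, such as "A" in `["A", "B", "A", "C"]`, gets access twice per round, which suits a state based node that must send at a higher frequency than its neighbours.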

Individual protocol studies have been undertaken and published in journals. Unfortunately, only limited comparisons of the protocols used in embedded systems have been performed. Refer to [Koopman94] for a qualitative comparison of protocols used in embedded systems and more detailed coverage on individual media access techniques.

Although the protocols themselves are not strictly event or state based, they often lend themselves more easily to one or the other system type. For instance, CSMA/CD used in Ethernet is a probabilistic access method. Event based systems fit well with this protocol because of the sporadic nature of messages. Time division multiplexing protocols break up the network bandwidth into time slices for individual nodes. State based systems can efficiently use one or more time slices to send their regular data.

Error Detection / Diagnostics

Error detection and diagnostics are important in any embedded system and especially in safety critical systems. Communication systems are fairly advanced in their capabilities for detecting, tolerating and sometimes correcting errors. In Table 1, some typical errors for communication systems are listed. Along with each error type, the typical defenses available in the communication system are discussed. Knowledge of common error types and defenses is invaluable to the system designer.

**Table 1: Common Communication Errors**
Error Type	Defense
channel noise	CRC, fiber optics
stale message	time-stamp
repeated message	serial number
station run-on	anti-jabber circuit
failure propagation	surge protection, redundant network, fiber optics
memory errors	checksum
intermittent errors	statistical counters
interface H/W errors	loopback testing
cable breaks	dynamic reconfiguration, redundant network

Channel noise - Noise is typically induced in communication channels from the environment or cross-talk from adjacent wires. A technique to reduce the noise is the use of fiber optics for the communication channel. Fibers are immune to electromagnetic interference. Cyclic redundancy check (CRC) values are often appended to messages. These checksums allow the detection of all single-bit and many multiple-bit errors induced in messages. More sophisticated error coding techniques can also be used to correct bit errors.
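Appending and checking a CRC can be sketched with the CRC-32 routine in Python's standard `zlib` module. The frame layout (payload followed by a 4-byte big-endian CRC) and the function names are assumptions for this example; embedded protocols typically use shorter CRCs computed in hardware.

```python
import zlib


def append_crc(payload: bytes) -> bytes:
    """Return the payload followed by its 4-byte CRC-32."""
    return payload + zlib.crc32(payload).to_bytes(4, "big")


def check_crc(frame: bytes) -> bool:
    """Recompute the CRC over the payload and compare with the received one."""
    payload, received = frame[:-4], frame[-4:]
    return zlib.crc32(payload).to_bytes(4, "big") == received
```

Flipping any single bit of the frame makes `check_crc` return False, so the receiver can discard the message or request a retry.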

Stale messages - Old messages that do not represent accurate real-time data may be present in the system. Some protocols include a time-stamp that is inserted by the source to mark the message age. Note that this implies some global time base.

Repeated messages - In certain failures of a host node or its network interface, the same message may be continuously repeated. Some protocols include a serial number for each message. Destination nodes can easily detect repeated or out of sequence messages.
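A receiver's serial number check can be sketched as a comparison against the last number seen. The 8-bit wrapping counter, the function name and the three result labels are assumptions for this illustration.

```python
def classify(last_seq, seq, modulus=256):
    """Classify a received serial number against the previous one.

    Serial numbers are assumed to increase by one per message and wrap
    around at `modulus`.
    """
    delta = (seq - last_seq) % modulus
    if delta == 0:
        return "repeat"
    if delta == 1:
        return "ok"
    return "out_of_sequence"
```

With a wrapping counter the receiver cannot tell a late message from a large gap, so both are reported as out of sequence here.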

Failure propagation - In a shared media system it is important to prevent failures in one node from propagating to other nodes. Surge protection is often included to prevent electrical failure propagation. Fiber optic cables serve as galvanic isolation between nodes. Redundant networks can also prevent propagation.

Station run-on - Stations may fail in such a way that they monopolize the shared media. Some protocols, such as Ethernet, contain an anti-jabber supervisory circuit. These circuits bound the time that any station is allowed access to the media. The station will be locked out until a specified silence period is observed.

Memory errors - Internal to a node, a message may be copied several times. Copies are typically made in DMA transactions or other exchanges between a host node and its network interface chipset. It is possible to add information checksums to messages that can be used to detect memory errors in the copy process.

Interface hardware failures - Communication interface hardware can fail. Diagnostics are included in many communication systems that allow loop-back testing of the interfaces.

Intermittent errors - Errors may begin to occur at a higher rate that is still below the threshold at which the system reports a failure. However, the increase in these errors may signal a bad part or connection in the system. Many communication chipsets include statistical counters that show error rates and types. If reported to the system level, these errors can signal maintenance action before system failures occur.
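How counters might be turned into a maintenance signal is sketched below. The class, the outcome labels and the warning ratio are hypothetical; a real chipset exposes its counters through registers, and the threshold would come from the system's own requirements.

```python
from collections import Counter


class LinkStats:
    """Tally message outcomes and flag a rising error ratio."""

    def __init__(self, warn_ratio):
        self.counts = Counter()
        self.warn_ratio = warn_ratio

    def record(self, outcome):
        """outcome is a label such as "ok", "crc_error" or "timeout"."""
        self.counts[outcome] += 1

    def error_ratio(self):
        total = sum(self.counts.values())
        if total == 0:
            return 0.0
        return (total - self.counts["ok"]) / total

    def needs_maintenance(self):
        """True once the error ratio exceeds the warning level."""
        return self.error_ratio() > self.warn_ratio
```

Keeping a separate count per error type preserves the information about what kind of fault is developing.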

Cable breaks - The loss of communication through cable breaks is normally detectable through loss of signal. However, some communication systems include fault-tolerant capabilities that tolerate cable breaks. One example is FDDI, which is configured as counter rotating rings. Individual stations can reconfigure the rings to bypass the cable breaks.

Available tools, techniques, and metrics

Tools

Protocol Analyzers can be attached to most networks to examine data at the bit, character and frame level. Headers for common protocols can be automatically decoded. These analyzers are particularly useful in examining errors and violations of protocols.

Time domain reflectometers can analyze the cabling and connections in networks. Versions of this equipment exist for electrical and fiber optic media. They are useful in finding cable breaks, bad connections and determining cable lengths. These instruments work by sending a pulse down a cable and examining the reflections. Each reflection represents a connection or imperfection in the cable. In fiber optics, connection quality is extremely important. These instruments can determine the loss in signal level introduced by each connection.

Techniques

Formal methods techniques have been applied to the verification of communication protocols (Petri Nets, Lotos, SDL, Z …). Petri Nets, in particular, were created to analyze communication networks. The verification of a protocol is normally required during the standardization process. Some level of correctness can be ensured by the embedded system designer if he chooses a standardized protocol.

A more likely effort for embedded system designers is the selection of a communication system. This requires a firm grasp of the requirements and key issues involved in the decision. A list of the issues, with specific recommendations, is presented in [Preckshot93] NUREG/CR-6082, Data Communications. This document was developed as a guide for regulatory authorities to use when evaluating proposed systems. Even though it is intended for the nuclear industry, it is applicable to other embedded systems because it asks focused questions about the communication system.

Metrics

The common metrics published in manufacturers' literature are data rate and error rate. Protocol studies give more detailed measures of the performance that may be expected. These studies involve complex modeling and simulation techniques. It is not surprising that large scale quantitative comparisons between many protocols have not been attempted. Examples of the metrics found in protocol studies are throughput versus load, delay versus throughput, and worst case utilization.

In general the communication theory and analysis techniques are quite mature. However, the process of selecting the communication system is ad hoc at best.

Relationship to other topics

I/O

Communication may be considered as a form of I/O. However, a more applicable relationship may be the current trend to use field busses to communicate with I/O.

Distributed Dependability

The communication architecture is often the method used for achieving a dependable embedded system, usually through redundancy.

Real-Time Systems

Embedded communication systems are most often real-time systems. The real-time topic covers schedulability, which is important in shared media networks.

Fault Tolerant Computing

Communication enables fault tolerant computing through the use of error detection.

Error Coding

Error coding techniques are often used in communication for error detection, error correction, reliability, compression, and optimum signal-to-noise ratio.

Formal Methods

Formal methods are often used for communication protocol verification.

Conclusions

A fundamental trade-off exists between efficiency and predictability in the selection of an event-based system or a state based system. Regardless of the decision, there are shortcomings that must be addressed. The lack of a detailed quantitative comparison of protocols places the burden of communication system evaluation squarely on the embedded system designer. Many of the properties of the communication system are determined by the media access protocol. Therefore, embedded system designers should focus on the media access method when determining what protocol to use. Other significant factors to consider include the communication technology and its cost or longevity. Often overlooked in design are the error conditions. Communication systems have a variety of mechanisms that can be used to detect errors. Utilizing these detection methods, the designer can build a reliable communication system.

Annotated References

[Koopman94] Koopman, P.J., and Upender, B.P., "Communication Protocols for Embedded Systems", Embedded Systems Programming, 7(11), November 1994, pp. 46-48, http://www.cs.cmu.edu/People/koopman/protsrvy/protsrvy.html, Accessed: May 8, 1999.

Notes: Good qualitative comparison of protocols, especially the variety of available media access methods. Practical. Written at an introductory level. Examines different media access methods in some detail.

[Kopetz97] Kopetz, H., Real-Time Systems, Design Principles for Distributed Embedded Applications, Kluwer Academic Publishers, 1997, Chpt. 7-8.

Notes: Wide variety of information on real-time systems. A key discussion examines five fundamental trade-offs in the ideal requirements of communication systems. The communication section shows a bias towards the Time-Triggered Protocol (TTP), about which the author has written other papers.

[Preckshot93] Preckshot, G.G., Data Communications, NUREG/CR-6082, 1993.

Notes: Written from a safety system assessor's viewpoint for critical systems. However, this document asks all the questions an embedded system designer should ask himself. The appendix is more tutorial in nature.

[Spragins94] Spragins, J.D., Hammond, J.L., and Pawlikowski, K., Telecommunications Protocols and Design, Addison Wesley Publishing, 1994.

Notes: Good source for mathematics of communication, queuing theory, and metrics. Unfortunately, the examples covered are standard communication protocols rather than embedded system protocols. This reference also provides a good background on the OSI Reference Model and communication networks in general.

Loose Ends

In synchronous networks, there are fault tolerant clock synchronization issues. These issues are an incarnation of the Byzantine generals problem.
Some systems may require a mixture of synchronous and asynchronous data.
CORBA techniques may be applied to embedded communication to achieve interoperability.
