Models for Distributed Embedded System Design: an essay

Philip Koopman
2/28/96
koopman@cmu.edu

Abstract

Distributed embedded systems are difficult to design correctly. Three types of models should be used when designing such systems: analytic models, executable simulations, and prototypes. Each of the three model types has both strengths and weaknesses; using all three greatly increases the likelihood of producing a correct design.

A Need for Modeling

When designing embedded systems, going to a distributed approach offers many potential advantages compared to a centralized approach. Distributed designs can be more scalable, offer cleaner separation of tasks for design teams, and facilitate use of commercial off-the-shelf hardware and software building blocks. (In this context, distributed systems encompass both logically distributed systems such as multiple tasks running on a single CPU, and physically distributed systems such as multiple computers on a communication network.)

However, it can be difficult to design distributed systems correctly. This is because the distributed system must manifest a correct emergent behavior involving a collection of loosely coupled components. The correctness of this emergent behavior, or even what the emergent behavior is, may not be obvious from the point of view of any single component in the system. Furthermore, many distributed systems are too complex for a human designer to understand without considerable study.

One solution to designing a system correctly is to create models that help the designers understand and evaluate both the system requirements and the implementation. We think that three modeling techniques are required in order to successfully design distributed systems: analysis, simulation, and prototyping.

Analytic Models

Analysis involves the use of mathematical approaches to create high-level abstractions of system properties, most notably performance. Analytic models are typically succinct mathematical equations that may be evaluated for any set of conditions to predict system properties. Different analytic models are typically required to express different categories of system properties.

Analytic techniques include:

"Plumbing diagrams" that examine steady-state capacities and flow rates (e.g., bus bandwidth vs. average message traffic)
Queuing models that estimate backlogs and latencies (e.g., a model to determine expected queue length)
Probabilistic models that compute the likelihood of events (e.g., the probability of message loss due to collision on a CSMA network, or the occurrence rate of double-bit errors given a single-bit DRAM error rate)
Empirical models that extrapolate past system characteristics to similar new systems (e.g., failure rate rules-of-thumb)
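
As a rough illustration of how small such models can be, the sketch below codes three of the techniques listed above in Python. The function names, the M/M/1 assumption (Poisson arrivals, exponential service times, a single server), the assumption that bit errors are independent, and every numeric value are illustrative choices made here, not figures from any particular system.

```python
from math import comb

def bus_utilization(msgs_per_sec, bits_per_msg, bus_bits_per_sec):
    """Plumbing diagram: average fraction of bus bandwidth consumed."""
    return msgs_per_sec * bits_per_msg / bus_bits_per_sec

def mm1_mean_queue_length(arrival_rate, service_rate):
    """Queuing model: mean number of messages in an M/M/1 system
    (waiting plus in service), which is rho / (1 - rho)."""
    rho = arrival_rate / service_rate
    if rho >= 1.0:
        raise ValueError("unstable: arrivals are at least as fast as service")
    return rho / (1.0 - rho)

def prob_k_bit_errors(word_bits, k, p_bit):
    """Probabilistic model: chance of exactly k errors among independent bits."""
    return comb(word_bits, k) * p_bit**k * (1.0 - p_bit) ** (word_bits - k)

# Hypothetical numbers, chosen only to exercise the formulas.
print(bus_utilization(500, 100, 1_000_000))  # 500 msgs/s of 100 bits on a 1 Mbit/s bus
print(mm1_mean_queue_length(50.0, 100.0))    # server loaded to half its capacity
print(prob_k_bit_errors(64, 2, 1e-9))        # double-bit error in a 64-bit word
```

Each function is only as good as its assumptions. The M/M/1 result, for example, grows without bound as utilization approaches 1, and says nothing about traffic that is burstier than a Poisson process.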

Analytic models typically have the advantage of being readily grasped. In many cases, analytic models can be constructed in a few hours. They are often relatively cheap to evaluate, and can give a reasonable estimate of system characteristics quickly. Also, analytic models can provide insight into the dynamics of the system to help guide design activities.

On the other hand, analytic models can only be created by people having keen insight into the system being designed. There is always a risk that the system properties being analyzed are not the ones that will ultimately dominate the system's characteristics. Also, in some cases too many simplifying assumptions must be made in order to create tractable models.

So, while analysis can provide quick answers during the exploration phase of a design, it is possible that the answers are not good approximations to reality. It is also possible that while analysis answers questions correctly, it may not provide insight into whether the right questions have been asked.

Simulation Models

Simulation involves the use of executable computer programs to demonstrate emergent system behavior. Building an executable model at even a high level of abstraction forces the designer to think through issues that otherwise might be swept under the rug with a non-executable specification technique. More than one simulation technique and corresponding model are often desirable for any particular system, depending on the aspects that must be studied.

Simulation techniques include:

Discrete-event simulations for queuing behavior and communication protocol performance, including execution of queue-based models and Petri nets
Cycle-based simulations for processors, memories, and network performance
"Continuous time" simulators for non-digital processes
Coupled analytic models in a spreadsheet-type environment, to explore their combined behavior through "what-if" evaluation trials
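
To make the first technique concrete, here is a minimal sketch of a discrete-event simulation in Python. It models a single shared resource (say, a bus) serving messages in arrival order. The function name, the exponential arrival and service times, and the rates used are assumptions made for illustration only.

```python
import heapq
import random
from collections import deque

ARRIVAL, DEPARTURE = 0, 1

def simulate_single_server(arrival_rate, service_rate, n_messages, seed=1):
    """Return the mean time a message spends queued plus in service."""
    rng = random.Random(seed)
    events = [(rng.expovariate(arrival_rate), ARRIVAL)]  # heap of (time, kind)
    waiting = deque()    # arrival times of messages not yet in service
    in_service = None    # arrival time of the message being served
    arrived = 0
    completed = 0
    total_delay = 0.0
    while events:
        now, kind = heapq.heappop(events)  # jump straight to the next event
        if kind == ARRIVAL:
            arrived += 1
            if arrived < n_messages:       # schedule the following arrival
                gap = rng.expovariate(arrival_rate)
                heapq.heappush(events, (now + gap, ARRIVAL))
            if in_service is None:         # server idle: start service now
                in_service = now
                done = now + rng.expovariate(service_rate)
                heapq.heappush(events, (done, DEPARTURE))
            else:
                waiting.append(now)
        else:                              # DEPARTURE
            total_delay += now - in_service
            completed += 1
            if waiting:                    # start serving the next message
                in_service = waiting.popleft()
                done = now + rng.expovariate(service_rate)
                heapq.heappush(events, (done, DEPARTURE))
            else:
                in_service = None
    return total_delay / completed

print(simulate_single_server(arrival_rate=50.0, service_rate=100.0, n_messages=20_000))
```

The event heap is what distinguishes this style from a cycle-based simulation: simulated time jumps from one event to the next rather than advancing in fixed ticks.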

Simulations require a "workload", or stimulus at an appropriately high level of abstraction. Simulations may be fed by:

Random inputs, with varying probability distributions that are used to provide system stimuli according to predefined criteria.
Abstract workloads, in which characteristics of the workload are abstracted to, for example, a set of periodic and aperiodic events (often with "noise" in the form of probabilistic timing jitter and drift).
Traces, in which stimuli are provided via time-stamped data files captured from the output of other simulations, instrumented prototypes, or production systems.
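
The three kinds of workload can be sketched in a few lines of Python. The generator names, the uniform jitter distribution, the exponential inter-arrival times, and the "sensor" and "operator" sources are all hypothetical, chosen only to show the shape of each workload type.

```python
import random

def periodic_events(period, jitter, duration, seed=0):
    """Abstract workload: one event per period, shifted by uniform jitter."""
    rng = random.Random(seed)
    times = []
    k = 0
    while k * period < duration:
        nominal = k * period
        times.append(max(0.0, nominal + rng.uniform(-jitter, jitter)))
        k += 1
    return times

def random_events(rate, duration, seed=0):
    """Random input: exponentially distributed gaps between events."""
    rng = random.Random(seed)
    times = []
    t = rng.expovariate(rate)
    while t < duration:
        times.append(t)
        t += rng.expovariate(rate)
    return times

def merge_into_trace(**streams):
    """Trace: time-stamped (time, source) records, sorted by time."""
    return sorted((t, name) for name, ts in streams.items() for t in ts)

trace = merge_into_trace(
    sensor=periodic_events(period=0.5, jitter=0.01, duration=2.0),
    operator=random_events(rate=1.0, duration=2.0, seed=7),
)
for timestamp, source in trace:
    print(f"{timestamp:.4f} {source}")
```

A trace in this form could equally be read back from a file recorded on an instrumented prototype, which lets the same simulation be driven by synthetic or measured stimuli.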

Simulations provide an important intermediate capability between analytic models and actual prototypes. By building a model of the system and executing it, designers can see what behavior emerges. With appropriate instrumentation and attention, a simulation can reveal unexpected interactions and performance bottlenecks that are missed by analysis. In particular, simulations are valuable for studying "fine-grain", detailed interactions that deal with specific sequences of events rather than the broad-brush steady-state approach typical of analytic methods.

Simulations can also be superior to prototypes in many cases. It is relatively simple to create arbitrary initial conditions (controllability) and detailed monitoring devices (observability) in a simulation. Controllability is important to investigate conditions that are unlikely to happen in practice, or are too expensive to create in the laboratory more than once. With the complete controllability offered by digital computer simulations, it is generally easy to repeat experiments in order to evaluate potential design changes.

Simulations also offer superior observability, since any state within the model is available as a value in some memory location. An important implication of complete observability is that it is usually straightforward to freeze operation of a system and capture the complete state when an infrequent bug occurs.
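
A small sketch can show both properties at once. In the toy model below, a fixed random seed supplies controllability (every run with that seed is identical), and a deep copy of the model state supplies observability (the complete state is frozen at the moment the problem appears). The queue model, its probabilities, and the overflow condition standing in for an "infrequent bug" are all assumptions made for illustration.

```python
import copy
import random

def run_until_overflow(queue_limit, seed, max_steps=100_000):
    """Step a toy queue model; return a snapshot of its state on overflow."""
    rng = random.Random(seed)  # controllability: the seed fixes every random choice
    state = {"step": 0, "queue": [], "served": 0}
    for step in range(max_steps):
        state["step"] = step
        if rng.random() < 0.6:                     # a message arrives
            state["queue"].append(step)
        if state["queue"] and rng.random() < 0.5:  # the oldest message is served
            state["queue"].pop(0)
            state["served"] += 1
        if len(state["queue"]) > queue_limit:      # the condition being hunted
            return copy.deepcopy(state)            # observability: freeze everything
    return None

first = run_until_overflow(queue_limit=8, seed=42)
again = run_until_overflow(queue_limit=8, seed=42)
print(first == again)  # compares the two frozen snapshots
print(first)
```

Because the run is repeatable, a candidate design change (for example, a larger queue limit) can be evaluated against exactly the same sequence of events that triggered the original failure.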

So, simulation provides an intermediate step between quick tradeoff studies performed by analysis and detailed validation provided by prototyping.

Prototypes

Prototyping involves the creation of actual or approximated system hardware and software for evaluation. Prototypes can usually be created much more quickly than production units because of relaxed manufacturability, tooling, material cost and life-cycle requirements. Typically, prototypes are expensive on a per-unit basis, and so can be built only in limited quantities.

Prototypes potentially offer an exact model of the final system in all important aspects.

On the other hand, prototypes may be difficult and expensive to change. Setting initial conditions and providing appropriate instrumentation may be difficult, time-consuming, and expensive. And, prototypes may be too few in number to obtain meaningful predictions about performance scalability.

Making Use of All Three Modeling Types

Analysis, simulation, and prototyping are all required for successful system development. Having only one or two of the three model types is not sufficient to be sure that a distributed system will be designed correctly (not to mention on time and on budget).

Analytic models should be employed first in order to get the broad brush strokes of the system's characteristics. To the extent that similar systems have been built before, analytic models will in general be helpful in providing guidance. However, for areas in which the new system is novel, it may be impossible to accurately predict which system characteristics must be monitored for potential problems. Analysis should always be attempted as the first step of a design, but its limitations should be well understood.

Even if the system design is familiar, it is likely that analytic techniques are unavailable for some important facets of the design. There is always the temptation to make simplifications that help the system fit into known analytic solutions, whether such simplifications are warranted or not. As a result, analytic results must be treated with caution and attention to limitations in their applicability.

After an initial analytic modeling attempt, it is vital to build a simulation of the system. In the presence of good analysis, the simulation will validate the analytic models. In the absence of thorough analysis (because, for example, the system is so novel that it is not apparent what should be analyzed), execution of a simulation can provide a way to gain enough insight to attempt analytic model creation.

It is common for there to be a tightly coupled iteration between analysis and simulation. In fact, it is often desirable to have multiple analysis approaches and multiple simulation approaches in order to converge on answers that are understood, explainable, and reproducible by more than one technique.
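
One way to picture this iteration is to compute the same quantity by two independent routes and compare. The sketch below, under the assumption of an M/M/1 queue (Poisson arrivals, exponential service, one server), sets the closed-form mean waiting time against a simulation based on the Lindley recursion. The function names and rates are illustrative, and the recursion is just one convenient simulation technique among several.

```python
import random

def analytic_mean_wait(arrival_rate, service_rate):
    """M/M/1 mean time spent waiting before service: rho / (mu - lambda)."""
    rho = arrival_rate / service_rate
    return rho / (service_rate - arrival_rate)

def simulated_mean_wait(arrival_rate, service_rate, n_messages, seed=1):
    """Lindley recursion: wait[n+1] = max(0, wait[n] + service[n] - gap[n+1])."""
    rng = random.Random(seed)
    wait = 0.0
    total = 0.0
    for _ in range(n_messages):
        total += wait
        service = rng.expovariate(service_rate)
        gap = rng.expovariate(arrival_rate)  # time until the next arrival
        wait = max(0.0, wait + service - gap)
    return total / n_messages

predicted = analytic_mean_wait(50.0, 100.0)
observed = simulated_mean_wait(50.0, 100.0, n_messages=200_000)
print(f"analysis {predicted:.5f} s, simulation {observed:.5f} s")
```

Close agreement raises confidence in both models. A large gap means that at least one of them contains a bug or rests on an assumption the other does not share, and finding out which is usually instructive.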

Finally, when analysis and simulation both suggest that the system is well designed, prototypes should be constructed and instrumented to verify results.

Summary Tradeoffs for Three Types of Models

Analysis Pro:

When tractable, gives an answer that is complete and mathematically well characterized for the stated assumptions
Performance for various scenarios can typically be computed quickly, and graphed for sanity checks
Encourages understanding of the fundamental mechanisms of the system
Exploring scalability issues is easy (if assumptions support it...)
Costs little to change for exploration of the design space
Equations are very portable, and can be readily disseminated
May be amenable to formal methods for correctness and stability proofs

Analysis Con:

Too much simplification may be required in order to achieve tractability
The analysis might be wrong because of incorrect assumptions or unwarranted simplifications, and these problems may be obscured by a desire to have a "clean" solution
Mathematics is intimidating to many customers
May be intractable, especially for non-linear aspects of the system.
Requires some insight into the system in order to be sure the right things are being analyzed
May provide no evidence that critical behaviors are being missed

Simulation Pro:

Does not require a priori understanding of emergent behavior -- an empirical approach
Finds unexpected events (if the designer is alert); can help break out of incorrect designer mind-sets about system behavior
Experiments can be done faster than real time (in some cases). Parallel or networked computing can be used to do multiple simulations at once (e.g., one simulation per workstation at night).
Can take into account detailed interactions and dependencies rather than steady-state approximations (computer architecture experience indicates that this is vitally important for getting accurate performance estimates)
Can be used to attempt to reproduce transient problems seen in a prototype -- a cheap way to test hypotheses as to causes
Provides good controllability; relatively easy to set up boundary case initial conditions
Simulations are moderately portable, depending on the simulation language used
Can be scaled semi-automatically, by compiling versions with more copies of system elements (at the expense of longer run times)

Simulation Con:

Gives representative cases; many runs required to generate statistical evidence of coverage (and, no guarantee probabilistic simulations will find the worst case)
Simulation bugs are likely (which is why results should be compared with analysis)
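
The need for many runs can be illustrated with a toy model of a shared channel. In the sketch below, each node transmits in a time slot with a fixed probability, and a slot with two or more senders counts as a collision. This slotted model, the 1.96 normal-approximation factor for a 95% interval, and all parameter values are simplifying assumptions made here, not a faithful CSMA model.

```python
import random
import statistics

def collision_fraction(n_nodes, p_send, n_slots, seed):
    """One run: fraction of time slots in which two or more nodes transmit."""
    rng = random.Random(seed)
    collisions = 0
    for _ in range(n_slots):
        senders = sum(rng.random() < p_send for _ in range(n_nodes))
        if senders >= 2:
            collisions += 1
    return collisions / n_slots

def replicate(n_runs, **params):
    """Independent runs with different seeds.

    Returns the mean, an approximate 95% half-width, and the worst run seen.
    """
    results = [collision_fraction(seed=s, **params) for s in range(n_runs)]
    mean = statistics.mean(results)
    half_width = 1.96 * statistics.stdev(results) / n_runs ** 0.5
    return mean, half_width, max(results)

mean, half_width, worst = replicate(30, n_nodes=10, p_send=0.1, n_slots=2000)
print(f"collision fraction {mean:.4f} +/- {half_width:.4f}, worst run {worst:.4f}")
```

The interval describes the average case only. The worst run observed is merely the worst of the runs that happened to be made, which is why a probabilistic simulation offers no guarantee of finding the true worst case.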

Prototyping Pro:

There's nothing like the Real Thing
Very high, potentially perfect level of accuracy
An expensive, but effective, way of finding simulation bugs

Prototyping Con:

If the prototype has mistakes, they may be expensive and time-consuming to correct
There is often not enough time, money, or facilities to create a prototype that is scaled as large as the biggest fielded system. This is especially true if the design is intended to grow into bigger applications as computing costs decline over time.
Experiments are done in real time, not faster
If there is only one test rig, it becomes a resource bottleneck (and is costly to replicate)

Phil Koopman -- koopman@cmu.edu