Stack Computers: the new wave © Copyright 1989, Philip Koopman, All Rights Reserved.
Chapter 6. Understanding Stack Machines
There are three components to the performance of processing interrupts. The first component is the amount of time that elapses between the time that an interrupt request is received by the processor and the time that the processor takes action to begin processing the interrupt service routine. This delay is called interrupt latency.
The second component of interrupt service performance is interrupt processing time. This is the amount of time that the processor spends actually saving the machine state of the interrupted job and diverting execution to the interrupt service routine. Usually the amount of machine state saved is minimal, on the presumption that the interrupt service routine can minimize costs by saving only those additional registers that it plans to use. Sometimes, one sees the term "interrupt latency" used to describe the sum of these first two components.
The third component of interrupt service performance is what we shall call state saving overhead. This is the amount of time taken to save machine registers that are not automatically saved by the interrupt processing logic, but which must be saved in order for the interrupt service routine to do its job. The state saving overhead can vary considerably, depending upon the complexity of the interrupt service routine. In the extreme case, state saving overhead can involve a complete context switch between multi-tasking jobs.
Of course, the costs of restoring all the machine state and returning to the interrupted routine are a consideration in determining overall system performance. We shall not consider them explicitly here, since they tend to be roughly equal to the state saving time (since everything that is saved must be restored), and are not as important in meeting a time-critical deadline for responding to an interrupt.
CISC machines may have instructions which take a very long time to execute, degrading interrupt latency. Stack machines, like RISC machines, can have very low interrupt latency. This is because most stack machine instructions take only a single cycle, so at worst only a few clock cycles elapse before an interrupt request is acknowledged and interrupt processing begins.
Once the interrupt is processed, however, the difference between RISC and stack machines becomes apparent. RISC machines must go through a tricky pipeline saving procedure upon recognizing an interrupt, as well as a pipeline restoring procedure when returning from the interrupt, in order to avoid losing information about partially processed instructions. Stack machines, on the other hand, have no instruction execution pipeline, so only the address of the next instruction to be executed needs to be saved. This means that stack machines can treat an interrupt as a hardware generated procedure call. Of course, since procedure calls are very fast, interrupt processing time is very low.
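The mechanism can be sketched in a few lines (a toy Python model; the field names and vector address are illustrative, not any particular processor's):

```python
from types import SimpleNamespace

# Toy model: an interrupt behaves exactly like a hardware-generated call.
# Only the next-instruction address is saved; there is no pipeline state.
cpu = SimpleNamespace(pc=0x100, return_stack=[])

def take_interrupt(cpu, vector):
    cpu.return_stack.append(cpu.pc)   # save return address, as a call would
    cpu.pc = vector                   # jump to the interrupt service routine

def return_from_interrupt(cpu):
    cpu.pc = cpu.return_stack.pop()   # resume the interrupted program
```

Because entry and exit are just a call and a return, an interrupt costs no more than any other subroutine invocation.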
There is one possible problem with stack machine interrupt response latency: the handling of streamed instructions and microcoded loops.
Streamed instructions are used to repetitively execute an operation such as writing the top data stack element to memory. These instructions are implemented using an instruction repeat feature on the NC4016 and RTX 2000, an instruction buffer on the M17, and microcoded loops on the CPU/16 and RTX 32P. These primitives are very useful since they can be used to build efficient string manipulation primitives and stack underflow/overflow service routines. The problem is that, in most cases, these instructions are also non-interruptible.
One solution is to make these instructions interruptible with extra control hardware, which may increase processor complexity quite a bit. A potentially hard problem that non-stack processors have with this solution is the issue of saving intermediate results. With a stack processor this is not a problem, since intermediate results are already resident on a stack, which is the ideal mechanism for saving state during an interrupt.
Another approach that is used by stack processors is to use a software restriction on the size of the repeat count allowed to be used with streaming instructions. This means that if a block of 100 characters is to be moved in memory, the action may be accomplished by moving several groups of 8 characters at a time. This keeps interrupt latency reasonable without sacrificing much performance. As expected, there is a tradeoff between absolute machine efficiency (with long streamed instructions) and interrupt response latency.
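The chunking policy can be sketched as follows (illustrative Python with hypothetical names; a real implementation would issue one streamed instruction per burst):

```python
MAX_BURST = 8  # longest uninterruptible repeat count allowed (illustrative)

def bounded_move(mem, src, dst, count, max_burst=MAX_BURST):
    """Move `count` cells from src to dst in bursts of at most max_burst.
    Between bursts the (simulated) processor may accept interrupts."""
    moved = 0
    while moved < count:
        burst = min(max_burst, count - moved)
        for i in range(burst):           # one uninterruptible streamed burst
            mem[dst + moved + i] = mem[src + moved + i]
        moved += burst
        # interrupt window: a pending interrupt would be taken here
    return moved
```

A 100-element move thus becomes twelve 8-element bursts plus one 4-element burst, bounding worst-case latency to one short burst.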
In microcoded machines, the tradeoffs are much the same. However, there is a very simple microcode strategy to provide the best of both worlds which is designed into the RTX 32P commercial version. This strategy is having a condition code bit visible to the microcode indicating whether an interrupt is pending. At each iteration of a microcoded loop, the interrupt pending bit is tested, with no cost in execution time. If no interrupt is pending, another iteration is made through the loop. If an interrupt is pending, the address of the streamed instruction is pushed onto the return stack as the address to be executed upon return from the interrupt, and the interrupt is allowed to be processed. As long as the streamed instruction keeps all its state on the stack (which is simple with an operation such as a character block move), there is very little overhead associated with this method when processing an interrupt, and no overhead during normal program execution.
Let us examine three different degrees of state saving required by different interrupt categories: fast interrupts, lightweight threads for multi-tasking, and full context switching.
Fast interrupts are the kind most frequently seen at run time. These interrupts do things such as add a few milliseconds to the time-of-day counter, or copy a byte from an input port to a memory buffer. When conventional machines handle this kind of interrupt, they must usually save two or three registers in program memory to create working room in the register file. In stack machines, absolutely no state saving is required. The interrupt service routine can simply push its information on top of the stack without disturbing information from the program that was interrupted. So, for fast service interrupts, stack machines have zero state saving overhead.
Lightweight threads are tasks in a multi-tasking system with an execution strategy similar to that of the interrupts just described. They can reap the benefits of multi-tasking without the cost of starting and stopping full-fledged processes. A stack machine can implement lightweight threads simply by requiring that each task run a short sequence of instructions when invoked, then relinquish control to the central task manager. This can be called non-preemptive, or cooperative, task management. If each task starts and stops its operation with no parameters on the stack, there is no overhead for context switches between tasks. The cost for this method of multi-tasking is essentially zero, since a task only relinquishes its control to the task manager at a logical breaking point in the program, where the stack probably would have been empty anyway.
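Python generators give a convenient model of this cooperative scheme (the task bodies and round-robin scheduler here are illustrative):

```python
from collections import deque

# Sketch: cooperative ("lightweight thread") scheduling. Each task runs a
# short burst of work and yields at a logical breaking point with nothing
# on its stack, so switching tasks costs nothing beyond requeueing.
def task(name, steps, log):
    for i in range(steps):
        log.append((name, i))   # a short burst of useful work
        yield                   # relinquish control to the task manager

def run_round_robin(tasks):
    ready = deque(tasks)
    while ready:
        t = ready.popleft()
        try:
            next(t)             # run one burst of the task
            ready.append(t)     # still runnable: put it back in line
        except StopIteration:
            pass                # task finished; drop it
```

Each `yield` is the "logical breaking point" of the text: no state crosses the switch, so the manager merely picks the next task.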
From these two examples, we can see that interrupt processing and lightweight thread multi-tasking are very inexpensive on stack processors. The only issue that remains open is that of full-fledged, preemptive multi-tasking accomplished with context switching.
Context switching overhead is usually said to be the reason why "stack machines are no good at multi-tasking." The argument behind such reasoning is usually based on having to save a tremendous amount of stack buffer space into program memory. This idea that stack machines are any worse at multi-tasking than other machines is patently false.
Context switching is a potentially expensive operation on any system. In RISC and CISC computers with cache memories, context switching can be more expensive than the manufacturers would have one believe, as a result of hidden performance degradations caused by increased cache misses after the context switch. To the extent that RISC machines use large register files, they face exactly the same problems that are faced by stack machines. An added disadvantage of RISC machines is that their random access to registers dictates saving all registers (or adding complicated hardware to detect which registers are in use), whereas a stack machine can readily save only the active area of the stack buffer.
Table 6.7 shows data gathered from a trace-driven simulation of the number of memory cycles spent saving and restoring data stack elements for Forth programs in a context switching environment. The programs simulated were Queen, Hanoi, and a Quick-sort program. Small values of N were used for Queen and Hanoi in order to keep the running time of the simulator reasonable. Both the effects of stack overflow and underflow as well as context switching were measured, since they interact heavily in such an environment.
Table 6.7. Memory cycles expended for Data Stack spills for different buffer sizes and context swapping frequencies.
Combination of Queens, Qsort, Hanoi
# Instructions = 36678
Average stack depth = 12.1
Maximum stack depth = 36
Buffer Size   timer=100   timer=500   timer=1000   timer=10000
     2          17992       16334        16124         15916
     4          12834        9924         9524          9214
     8           8440        3950         3430          2910
    12          10380        3944         3068          2314
    16          11602        2642         1620           632
    20          12886        3122         1846           626
    24          13120        2876         1518           330
    28          14488        3058         1584           242
    32          15032        3072         1556           124
    36          15458        3108         1568            82
Table 6.7(a) Page-managed buffer management.
Buffer Size   timer=100   timer=500   timer=1000   timer=10000
     2          26424       24992        24798         24626
     4          11628        8912         8548          8282
     8           7504        3378         2762          2314
    12           6986        1930         1286           630
    16           7022        1876         1144           322
    20           7022        1852         1084           180
    24           7022        1880         1066           124
    28           7022        1820         1062            90
    32           7022        1828         1060            80
    36           7022        1822         1048            80
Table 6.7(b) Demand-fed buffer management.
Figure 6.4 -- Overhead for page managed stack.
Table 6.7(a) and Figure 6.4 show the results for a page-managed stack. The notation "timer=xxx" in the table (and "xxx CLOCKS/SWITCH" in the figure) indicates the number of clock cycles between context switches. At 100 clock cycles between context switches, the number of memory cycles expended on managing the stack at first decreases as the buffer size increases, because a larger buffer reduces the spilling rate while the program accesses the stack. As the buffer size grows beyond 8 elements, however, memory traffic increases again, since the increasingly large buffers are copied in and out of memory on every context switch.
Notice how the program behaves at 500 cycles between context switches. Even at this relatively high switching rate (which corresponds to 20 000 context switches per second on a 10 MHz processor -- an excessively high rate in practice), the cost of context switching is only about 0.08 clocks per instruction for stack buffer sizes greater than 12. Since in this experiment each instruction averaged 1.688 clocks without context switching overhead, this amounts to only a 4.7% overhead. At 10 000 cycles between context switches (1 millisecond per switch at 10 MHz), the overhead is less than 1%.
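The arithmetic can be checked directly from Table 6.7(a), taking buffer size 24 at timer=500 as a representative point above 12 elements:

```python
# Checking the overhead figures quoted above against Table 6.7(a).
instructions = 36678      # total instructions in the simulated run
base_cpi = 1.688          # clocks per instruction, no context switching
spill_cycles = 2876       # page-managed, buffer size 24, timer=500

cpi_overhead = spill_cycles / instructions   # ~0.078 clocks/instruction
percent = cpi_overhead / base_cpi * 100      # ~4.6%, i.e. "about 4.7%"
```

Substituting the timer=10000 figures (e.g. 242 cycles at buffer size 28) gives well under 1% by the same formula.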
How is it possible to have such low overhead? One reason is that the average stack depth is only 12.1 elements during the execution of these three heavily recursive programs. That means that, since there is never very much information on the stack, very little information needs to be saved on a context switch. In fact, compared to a 16-register CISC machine, the stack machine simulated in this experiment actually has less state to save on a context switch.
Figure 6.5 -- Overhead for demand fed managed stack.
Table 6.7b and Figure 6.5 show the results of the same simulation run using a demand-fed stack management algorithm. In these results, the rise on the 100-cycle-interval curve when more than 12 elements are in the stack buffer is almost nonexistent. This is because the stack was not refilled when restoring the machine state, but rather was allowed to refill during program execution in a demand-driven fashion. For reasonable context switching frequencies (less than 1000 per second), the demand-fed strategy is somewhat better than the paged strategy, but not by an overwhelming margin.
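A minimal model of the demand-fed policy (illustrative Python, counting memory cycles the way the table does: one per element spilled or refilled):

```python
# Sketch of demand-fed stack buffer management: elements spill to memory
# only when the on-chip buffer overflows, and are fetched back only when
# it underflows; a context switch saves only the active part of the buffer.
class DemandFedStack:
    def __init__(self, buf_size):
        self.buf = []            # on-chip buffer (top of stack at the end)
        self.spilled = []        # in-memory overflow area
        self.buf_size = buf_size
        self.mem_cycles = 0      # memory traffic spent on spills/refills

    def push(self, x):
        if len(self.buf) == self.buf_size:
            self.spilled.append(self.buf.pop(0))  # spill bottom element
            self.mem_cycles += 1
        self.buf.append(x)

    def pop(self):
        if not self.buf and self.spilled:
            self.buf.append(self.spilled.pop())   # refill on demand
            self.mem_cycles += 1
        return self.buf.pop()

    def context_switch_out(self):
        self.spilled.extend(self.buf)             # save only active elements
        self.mem_cycles += len(self.buf)
        self.buf = []
```

Because refilling is deferred until the program actually pops, a switch back into a task costs nothing up front, which is why the 100-cycle curve barely rises for large buffers.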
There is an approach that can be used with stack machines which can eliminate even the modest costs associated with context switching that we have seen. Instead of using a single large stack for all programs, high-priority/time-critical portions of a program can be assigned their own stack space. This means that each process uses a stack pointer and stack limit registers to carve out a piece of the stack for its use. Upon encountering a context switch, the process manager simply saves the current stack pointer for the process, since it already knows what the stack limits are. When the new stack pointer value and stack limit registers are loaded, the new process is ready to execute. No time at all is spent copying stack elements to and from memory.
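A sketch of this partitioned-buffer scheme (illustrative Python; a real machine would hold the pointer and limit values in registers):

```python
# Sketch: each time-critical process owns a fixed slice of the stack
# buffer. A context switch only changes which pointer/limit pair is
# active; no stack elements are copied to or from memory.
class PartitionedStacks:
    def __init__(self, total=128, slices=4):
        self.buf = [0] * total
        size = total // slices                  # 128 / 4 = 32 elements each
        self.limits = [(i * size, (i + 1) * size) for i in range(slices)]
        self.sp = [lo for lo, hi in self.limits]  # per-process stack pointers
        self.current = 0

    def switch_to(self, pid):
        # The old process's stack pointer is already saved in self.sp;
        # "loading" the new pointer and limits is just selecting pid.
        self.current = pid

    def push(self, x):
        _, hi = self.limits[self.current]
        assert self.sp[self.current] < hi, "stack limit exceeded"
        self.buf[self.sp[self.current]] = x
        self.sp[self.current] += 1

    def pop(self):
        lo, _ = self.limits[self.current]
        assert self.sp[self.current] > lo, "stack underflow"
        self.sp[self.current] -= 1
        return self.buf[self.sp[self.current]]
```

Each process's data sits undisturbed in its slice across switches, so switch cost is independent of stack depth.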
The amount of stack memory needed by most programs is typically rather small. Furthermore, it can be guaranteed by design to be small in short, time-critical processes. So, even a modest stack buffer of 128 elements can be divided up among four processes with 32 elements each. If more than four processes are needed by the multi-tasking system, one of the buffers can be designated the low priority scratch buffer, which is to be shared using copy-in and copy-out among all the low priority tasks.
From this discussion we can see that the notion that stack processors have too large a state to save for effective multi-tasking is a myth. In fact, in many cases stack processors can be better at multi-tasking and interrupt processing than any other kind of computer.
Hayes and Fraeman (1988) have independently obtained results for stack spilling and context switching costs on the FRISC 3 similar to the results reported in this chapter.