Stack Computers: the new wave © Copyright 1989, Philip Koopman, All Rights Reserved.

Chapter 6. Understanding Stack Machines


Since stack machines depend on accessing a high-speed stack memory on every instruction, the characteristics of use of the stack memory are of vital importance. In particular, as processors get faster, the question is: how much stack memory needs to be placed on-chip to obtain good performance? The answer to this question is crucial, since it affects the cost and performance of high-end stack processors that place stack memory on-chip.

An equally important question is how should the stacks be managed, especially in the realm of multi-tasking environments?

6.4.1 Estimating stack size: An experiment

The first question, the one of the size of the on-chip stack buffer, is best resolved by a simulation of various programs with different size stack buffers. This simulation measures the amount of traffic to and from memory generated by stack overflows and underflows. Overflows need to copy elements from the hardware stack buffer to a save area in memory. Underflows cause a copying of the elements back from memory to the stack buffer.

                      FRAC     LIFE     MATH     FIB     HANOI   QUEENS


# INSTRUCTIONS     2051600  1296143  6133519   880997   235665   140224


MAX STACK DEPTH         44        6       23       25       52       29


# STACK OPERANDS   3670356  1791638 11786764  1483760   446642   257320


# STACK SPILLS                                                         

FOR BUFFER SIZE:                                                       

0                  3670356  1791638 11786764  1483760   446642   257320

2                   838960   148448  3919622   370940   155656    41426

4                   202214     4098  1313566    92732    69608     9216

8                    39040        0   238020    13526    32752      512

12                   10236        0    28300     1970     8184      196

16                    3580        0      800      284     4088       64

20                    1532        0      280       38     2040       22

24                     636        0        0        2     1016       10

28                     220        0        0        0      504        2

32                      92        0        0        0      248        0

36                      36        0        0        0      120        0

40                      10        0        0        0       56        0

44                       0        0        0        0       24        0

48                       0        0        0        0        8        0

52                       0        0        0        0        0        0

Table 6.5. Memory cycles expended for Data Stack spills.

[Figure 6.2]
Figure 6.2 -- Data stack spilling.

Table 6.5 and Figure 6.2 show the results of a simulator that monitored the number of memory cycles spent on data stack buffer spilling and restoring for Life, Hanoi, Frac, Fib, Math, and Queens. While the "toy" benchmarks Fib, Hanoi, and Queens are not representative of typical programs, all are deeply recursive and are representative of the worst one might expect of stack programs.

The spilling algorithm that was used spilled exactly one element from the stack buffer each time a push operation was performed on a full stack, and read exactly one element into the stack buffer each time a read/pop operation was performed on an empty stack buffer. The simulation assumed that hardware automatically handled the spilling with a cost of one memory cycle per element read or written. The RTX 32P instruction set was used for this simulation, so each instruction was approximately twice as complex as would be seen in a hardwired processor such as the RTX 2000. The number of cycles measured were memory cycles, not microcycles. The purpose of the simulation was to show the best behavior that could be expected, which is certainly within a factor of three or four of the costs for most implementations.

Surprisingly, Frac behaves almost as badly as Hanoi when using the stack. This is because Frac pushes 6 elements onto the data stack at each step of a recursive subdivision algorithm for dividing a mesh of points. As is obvious, any recursive program has the potential to generate a large number of elements on the stack.

The good news about stack size is that stack overflow and underflow memory traffic tapers off at a steep exponential rate for all programs. At a stack buffer size of 24, even Hanoi generates a stack spill on fewer than 1% of instructions. As a practical matter, a stack size of 32 will eliminate stack buffer overflows for almost all programs.

                      FRAC     LIFE     MATH     FIB     HANOI   QUEENS


# INSTRUCTIONS     2051600  1296143  6133519   880997   235665   140224


MAX STACK DEPTH         14        7       30       22       14       39


# STACK OPERANDS    725224   680676  3199170   185472    41056    53722


# STACK SPILLS                                                         

FOR BUFFER SIZE:                                                       

0                   725224   680676  3199170   185472    41056    53722

2                   326778   135608  1235678    57312    12310    26070

4                   179938      118   642798    21890     2048    13306

8                    27932        0   273686     3192      128     1158

12                     132        0    57262      464        8      572

16                       0        0    13442       66        0      314

20                       0        0     1062        8        0      154

24                       0        0      382        0        0       62

28                       0        0       42        0        0       32

32                       0        0        0        0        0       16

36                       0        0        0        0        0        8

40                       0        0        0        0        0        0

Table 6.6. Memory cycles expended for Return Stack spills.

[Figure 6.3]
Figure 6.3 -- Return stack spilling.

Table 6.6 and Figure 6.3 show simulator results for Return Stack spills and restores for the same programs. The results are similar, except that Math emerges as an unexpectedly heavy user of the Return Stack. This is because the math package was written to be extremely modular and easy to port between systems, so it uses a large number of deeply nested subroutines. Also, Math uses the Return Stack for storing a large number of temporary variables to manipulate 48-bit data on a 16-bit processor.

6.4.2 Overflow handling

Now that we have examined how stack overflows and underflows occur in program execution, how should they be handled? Four possible ways of handling spills are: to ensure that they never happen, and treat them as catastrophic system failures; to use a demand-driven stack controller; to use a paging stack control mechanism; or to use a data cache memory. Each approach has its strengths and weaknesses. A very large stack memory

The simplest way to solve the stack problem is simply to assume that stack overflows will never happen. While this may seem like a foolish strategy at first, it has some merit. The nicest result from using this strategy is that system performance is totally predictable (no stack spilling traffic to slow down the system) and that no stack management hardware is required.

This approach of using a very large stack memory to avoid overflows is the one taken by the MISC M17, which has a stack overflow only when program memory capacity is exceeded. The approach taken by the NC4016 is to use high speed off-chip stack memories that may be expanded to be several thousand elements deep. Both these processors have solved the stack overflow problem by simply designing it away. The price paid when using this approach is a tradeoff between off-chip memory size/speed and processor speed.

In the case where small on-chip stacks are used, the approach of treating overflows as a catastrophic system event when programs are being debugged can still be taken by simply declaring that the programmer has only X elements on the stacks to work with and is responsible for never overflowing this limit. This approach is very practical if only small, simple programs are being written and the value of X is greater than 16 or 32. The WISC CPU/16 uses this approach with a stack size of 256 elements to keep the hardware simple. Demand fed single-element stack manager

Given that stack overflows are allowed to occur on a regular basis, the most conceptually appealing way to deal with the problem is to use a demand fed stack manager that moves single elements on and off the stack as required.

To implement this strategy, the stack buffer is set up as a circular buffer with a head and tail pointer. A pointer to memory is also needed to keep track of the top element of the memory-resident portion of the stack. Whenever a stack overflow is encountered, the bottom-most buffer-resident element is copied to memory, freeing a buffer location. Whenever an underflow is encountered, one element from memory is copied into the buffer. This technique has the appeal that the processor never moves a stack element to or from memory unless absolutely necessary, guaranteeing the minimum amount of stack traffic.

A possible embellishment of this scheme would be to have the stack manager always keep a few elements empty and at least several elements full on the stack. This management could be done using otherwise unused memory cycles, and would reduce the number of overflow and underflow pauses. Unfortunately, this embellishment is of little value on real stack machines, since they all strive to use program memory 100% of the time for fetching instructions and data, leaving no memory bandwidth left over for the stack manager to use.

The benefit to demand-fed stack management is that very good use is made of available stack buffer elements. Therefore, it is suitable for use in systems where chip space for stack buffers is at a premium. As an additional benefit, the stack underflows and overflows are spread throughout program execution at a maximum of two per instruction for the case of a data stack spill combined with a subroutine return underflow. The cost of this good performance is that reasonably complex control hardware and three counters for each stack are needed to implement the scheme.

The FRISC 3 stack management scheme is similar to the demand-fed strategy. The architects of this system have done considerable research in this area. A generalization of this algorithm, called the cutback-K algorithm, was proposed by Hasegawa & Shigei (1985). Stanley & Wedig (1987) have also discussed top-of-stack buffer management for RISC machines. Paging stack manager

An alternative to the demand-fed strategy is to generate an interrupt on stack overflow and underflow, then use software to manage the stack spill. This approach uses less control hardware than the demand-fed method, but requires a stack buffer that is somewhat bigger to reduce the frequency of the interrupts.

The general strategy used in this scheme is to have limit registers pointing to locations near the top and bottom of the stack buffer space. When an instruction causes the stack pointer to be less than underflow pointer, a half-buffer full of elements is copied from program memory. When an instruction exceeds the overflow pointer, a half-buffer full of elements is copied into program memory.

The paging scheme allows arbitrarily sized sections of a large stack memory to be used by different procedures on a time-sliced basis. Because of this, the stack buffer appears as a section of special memory, not as a circular buffer. Therefore, in practice, a stack overflow actually involves copying a half-buffer of elements to memory, then relocating the other half-buffer to place it at the start of the stack buffer area.

The cost of the paging management method is about twice that of the demand-fed method in terms of memory cycles spent shuffling elements. Also, the buffer size must be twice that of the demand-fed buffers to guarantee the same number of consecutive pushes and pops between overflows and underflows, although in practice that increase in size is seldom needed.

An interesting approach to using this method is to declare as a matter of programming style that stack overflows and underflows are unlikely and undesirable, since a buffer size of 32 essentially eliminates them anyway. Then the paging method provides an inexpensive hardware means for affording graceful degradation of a program that exceeds its buffer size. This way an ill behaved program will still function properly (although more slowly), while the operating system can generate a warning message identifying the culprit.

The RTX 2000 and the RTX 32P both use this paging method for stack management. An associative cache

The method used by many conventional processors for managing the program stack is to use a conventional data cache memory, usually mapped into the program memory space. This method involves significant hardware complexity but does not provide any advantage over the previously mentioned methods for stack machines, since stack machines do not skip about much in accessing their stack elements. It does provide an advantage when variable length data structures such as strings and records are pushed onto a "stack" as defined in a C or Ada programming environment.

Other publications of interest that discuss the stack management issue are: Blake (1977), Hennesy (1984), Prabhala & Sethi (1977), and Sites (1979).


HOME Phil Koopman --