Stack Computers: the new wave © Copyright 1989, Philip Koopman, All Rights Reserved.
Chapter 2. A Taxonomy of Hardware Stack Support
Figure 2.1 -- The three-axis stack machine design space.
The stack computer design space may be categorized by coordinates along a three axis system as shown in Figure 2.1. The three dimensions of the design space are: number of stacks supported by the hardware, the size of any dedicated buffer for stack elements, and how many operands are permitted by the instruction format.
In some respects these three dimensions can present a continuum, but for the purposes of this taxonomy we shall break the design space into 12 categories, with the three dimensions having the possible values of:
The most obvious example of a stack supported function is a single stack used to support subroutine return addresses. Often times this stack also is used to pass parameters to subroutines. Sometimes one or more additional stacks are added to allow processing subroutine calls without affecting parameter lists, or to allow processing values on an expression stack separately from subroutine information.
Single Stack computers are those computers with exactly one stack supported by the instruction set. This stack is often intended for state saving for subroutine calls and interrupts. It may also be used for expression evaluation. In either case, it is probably used for subroutine parameter passing by compilers for some languages. In general, a single stack leads to simple hardware, but at the expense of intermingling data parameters with return address information.
An advantage of having a single stack is that it is easier for an operating system to manage only one block of variable sized memory per process. Machines built for structured programming languages often employ a single stack that combines subroutine parameters and the subroutine return address, often using some sort of frame pointer mechanism.
A disadvantage of a single stack is that parameter and return address information are forced to become mutually well nested. This imposes an overhead if modular software design techniques force elements of a parameter list to be propagated through multiple layers of software interfaces, repeatedly being copied into new activation records.
Multiple Stack computers have two or more stacks supported by the instruction set. One stack is usually intended to store return addresses, the other stack is for expression evaluation and subroutine parameter passing. Multiple stacks allow separating control flow information from data operands.
In the case where the parameter stack is separate from the return address stack, software may pass a set of parameters through several layers of subroutines with no overhead for recopying the data into new parameter lists.
An important advantage of having multiple stacks is one of speed. Multiple stacks allow access to multiple values within a clock cycle. As an example, a machine that has simultaneous access to both a data stack and a return address stack can perform subroutine calls and returns in parallel with data operations.
The amount of dedicated memory used to buffer stack elements is a crucial performance issue. Implementation strategies range from using only program memory to store stack elements, to having a few top of stack registers in the processor, to having a completely separate stack memory unit. The taxonomy divides the design space into those designs that have stacks residing mostly in program memory (with perhaps a few buffering elements in the CPU) and those designs that provide significant stack buffering.
An architecture with a Small Stack Buffer typically views the stack as a reserved portion of the general purpose program memory address space. Stacks use the same memory subsystem as instructions and variables, allowing the regular memory reference instructions to access stack operands if desired. Stack elements may also be addressed by an offset from a stack pointer or frame pointer into memory.
To be competitive in speed, a stack machine must have at least one or two stack elements buffered inside the processor. To see the reason for this, consider an addition operation on a machine with unbuffered stacks. A single instruction fetch for the addition would generate three more memory cycles to fetch both operands and store the result. With two elements in a stack buffer, only one additional memory cycle is generated by an addition. This memory cycle is used to fetch the new second-from-top stack element, filling the hole created by the addition's consumption of a stack argument.
A small stack buffer with primary stacks residing in program memory allows quick switching between stacks for different tasks since the stack elements are predominately memory resident at all times.
The fact that a small dedicated stack buffer is simple to implement and easy to manage makes it very popular. In particular, the fact that most stack elements reside in main memory makes managing pointers, strings, and other data structures quite easy. The disadvantage of this approach is that significant main memory bandwidth is consumed to read and write stack elements.
If an architecture has a large enough stack buffer that main memory bandwidth is usually not consumed to access stack elements, then the architecture has a Large Stack Buffer. This large buffer may take one of several forms. It may be a large set of registers accessed using a register window scheme such as that used by the RISC I processor (Sequin & Patterson 1982), a separate memory unit that is isolated from program memory, or a dedicated stack memory cache in the processor (Ditzel & McLellan 1982). In any event, the stack buffer is considered "large" if several levels of subroutines (say, 5 or more) may be processed without exhausting the capacity of the stack memory. In the case of a stack that is only used as an expression evaluation stack, "large" may only be approximately 16 elements, since single expressions seldom nest very deeply (Haley 1962). In Chapter 6, we shall examine some program execution statistics that will give more insight into how large is large enough.
An advantage of a large stack buffer is that program memory cycles are not consumed while accessing data elements and subroutine return addresses. This can lead to significant speedups, particularly in subroutine intensive environments.
A disadvantage of a separate stack memory unit is that it may not be large enough for all applications. In this case a spilling of data into program memory to make room for new stack entries may be required. Also, saving the entire stack buffer when switching between tasks in a multitasking environment may impose an unacceptably large context switching overhead, although it should be noted that this can be solved by dividing the stack memory into separate areas for separate tasks. At a lower level, separate data buses for off-chip stack memories and program memory will add pins and expense to a microprocessor.
Clearly, the delineation between "large" and "small" stack buffers can get hazy, but in practice it is usually clear which of these two alternatives the designer had in mind.
The number of operands in the machine instruction format might at first not seem to have much to do with hardware support for stacks. In practice however, the number of addressing modes has a tremendous affect on how the stacks are constructed and how the stacks can be used by programs.
0-Operand instructions do not allow any operands to be associated with the opcode. All operations are implicitly specified to be performed on the top stack element(s). This kind of addressing is often called "pure" stack addressing.
A 0-operand stack architecture must, of course, use one of its stacks for expression evaluation.
Even in a pure stack machine, there must be a few instructions that specify addresses for loading and storing variables in program memory, loading literal (constant) values, subroutine calls, and conditional branching instructions. These instructions tend to have extremely simple formats, often just using the memory word after the opcode to hold the operand.
There are several advantages to the simplicity of 0-operand instructions. One is that only the top one or two stack locations can be referenced by an instruction. This can simplify construction of the stack memory by allowing the use of a single ported memory with one or two top-of-stack registers. A speed advantage may also be gained by loading the operand registers in parallel with instruction decoding, since the operands for each instruction are known in advance to be the top stack elements. This can completely eliminate the need for pipelining to fetch and store operands.
Another advantage is that individual instructions can be extremely compact, with an 8 bit instruction format sufficing for 256 different opcodes. Furthermore, instruction decoding is simplified, since no operand addressing modes need be interpreted by the decoding hardware.
A disadvantage to the 0-operand addressing mode is that complex addressing modes for data structure accessing may take several instructions to synthesize. Also, data elements that are deeply buried on the stack can be difficult to access if provisions are not made for copying the Nth-deep data stack element to the top of the stack.
A machine with a 1-operand instruction format usually performs operations on the specified operand and uses the top stack element as the implicit second operand. 1-operand addressing, also called stack/accumulator addressing, offers more flexibility than 0-operand addressing, since it combines the fetching of an operand with the operation on the stack.
Keedy (1978) has argued that a stack/accumulator architecture uses fewer instructions than a pure stack architecture for expression evaluation. His argument suggests that overall program size for 1-operand designs may be smaller than for 0-operand design. Of course, there is a tradeoff involved. Since the operand is specified by the instruction, an efficient implementation must either incorporate an operand fetching pipeline or have a longer clock cycle to allow for operand access time. In the case when an operand is resident on a subroutine parameter stack or evaluation stack, the stack memory must be addressed with the offset of the operand to fetch the element. This requires more execution time or more pipelining hardware than having the top elements prefetched and waiting for an operation.
A 1-operand stack architecture almost always has an evaluation stack. Most 1-operand architectures also support a 0-operand addressing mode to save instruction bits when the operand field would be unused.
2-operand instruction formats, which for the purposes of this taxonomy include 3-operand instruction formats as a special case, allow each instruction to specify both a source and a destination. In the case where stacks are only used to store return addresses, a 2-operand machine is simply a general purpose register machine. If subroutine parameters are passed on the stack, then the 2 operands either specify an offset from a stack or frame pointer, or specify a pair of registers in the current register window for the operation. 2-operand machines do not need an expression evaluation stack, but place the burden of tracking intermediate results for evaluated expressions on the compiler.
2-operand machines offer a maximum of flexibility, but require more complicated hardware to perform efficiently. Since no operands are known before an instruction is decoded, a data pipeline and dual ported register file must be used to supply operands to the execution unit.
Phil Koopman -- koopman@cmu.edu