Chapter 3. Multiple-stack, 0-operand Machines

3.2 A GENERIC STACK MACHINE

Before embarking on a detailed review of real MS0 and ML0 designs, a baseline for comparison needs to be established. Therefore, we shall explore the design for a canonical ML0 machine. The design presented is as simple as possible to present a common point of departure for comparing other designs.

3.2.1 Block diagram

Figure 3.1 is a block diagram of the Canonical Stack Machine. Each box on the diagram represents a logical resource for the machine corresponding to the essential minimum components for an ML0 design. These components are: the data bus, the data stack (DS), the return stack (RS), the arithmetic/logic unit (ALU) with its top of stack register (TOS), the program counter (PC), program memory with a memory address register (MAR), control logic with an instruction register (IR), and an input/output section (I/O).

Figure 3.1 -- The canonical stack machine.

3.2.1.1 Data bus

For simplicity, the Canonical Stack Machine has a single bus connecting the system resources. Real processors may have more than one data path to allow for parallel operation of instruction fetching and calculations. In the Canonical Stack Machine, the data bus allows a single transmitting functional block and a single receiving functional block during any single operation cycle.

3.2.1.2 Data stack

The data stack is a memory with an internal mechanism to implement a LIFO stack. A common implementation for this might be a conventional memory with an up/down counter used for address generation. The data stack allows two operations: push and pop. The push operation allocates a new cell on the top of the stack and writes the value on the data bus into that cell. The pop operation places the value on the top cell of the stack onto the data bus, then deletes the cell, exposing the next cell on the stack for the next processor operation.

3.2.1.3 Return stack

The return stack is a LIFO stack implemented in an identical manner to the data stack. The only difference is that the return stack is used to store subroutine return addresses instead of instruction operands.

3.2.1.4 ALU and top-of-stack register

The ALU functional block performs arithmetic and logical computations on pairs of data elements. One of the data element pairs is the top-of-stack (TOS) register, which holds the topmost element of the data stack as used by the programmer. Thus, the top element of the data stack block is actually the second item on the data stack as perceived by the programmer, while the top perceived data stack element is kept in the one register TOS buffer at the ALU. This scheme allows using a single ported data stack memory while allowing operations, such as addition, on the top two stack elements.

The ALU supports the standard primitive operations needed by any computer. For the purposes of our illustration, this includes addition, subtraction, logical functions (AND, OR, XOR), and test for zero. For the purposes of this conceptual design, all arithmetic will be integer. There is no reason why floating point arithmetic could not be added to the ALU as a generalization of the concept.

3.2.1.5 Program counter

The program counter holds the address of the next instruction to be executed. The PC may be loaded from the bus to implement branches, or may be incremented to fetch the next sequential instruction from program memory.

3.2.1.6 Program memory

The program memory block has both a Memory Address Register (MAR) and a reasonable amount of random access memory. To access the memory, first the MAR is written with the address to be read or written. Then, on the next system cycle, the program memory is either read onto or written from the system data bus.

3.2.1.7 I/O

As with most conceptual designs, the issue of input/output from and to the outside world will be swept under the rug. Suffice it to say that there is some system resource, the I/O box, that handles this task.

3.2.2 Data operations

Table 3.1 shows the minimum set of operators for the Canonical Stack Machine. This set of operators was chosen in order to illustrate the use of the machine -- it is obviously not adequate for efficient program execution. In particular, multiply, divide and shift operations have been omitted in the interest of simplicity. Notation derived from the Forth language (see Section 3.3) has been used to maintain consistency with discussions of instruction sets in later chapters. An important point to note is that Forth notation often makes extensive use of special characters, such as ! (which is pronounced "store" in Forth) and @ (which is pronounced "fetch" in Forth).

```Instr-     Data Stack

uction  input   -> output   Function

program memory

+       N1 N2   -> N3         Add N1 and N2, giving sum N3

-       N1 N2   -> N3         Subtract N2 from N1, giving

difference N3

>R      N1      ->            Push N1 onto the return stack

@       ADDR    -> N1         Fetch the value at location

returning N1

AND     N1 N2   -> N3         Perform a bitwise AND on N1 and

N2, giving result N3

DROP    N1      ->            Drop N1 from the stack

DUP     N1      -> N1 N1      Duplicate N1, returning a

second copy of it on the stack

OR      N1 N2   -> N3         Perform a bitwise OR on N1 and

N2, giving result N3

OVER    N1 N2   -> N1 N2 N1   Push a copy of the second

element on the stack, N1, onto

the top of the stack

R>              -> N1         Pop the top element of the

return stack, and push it onto

the data stack as N1

SWAP    N1 N2   -> N2 N1      Swap the order of the top two

stack elements

XOR     N1  N2  -> N3         Perform a bitwise eXclusive OR

on N1 and N2, giving result N3

[IF]    N1      ->            If N1 is false (value is 0)

perform a branch to the address

in the next program cell,

otherwise continue

[CALL]          ->            Perform a subroutine call to

the address in the next program

cell

[EXIT]          ->            Perform a subroutine return

[LIT]           -> N1         Treat the value in the next

program cell as an integer

constant, and push it onto the

stack as N1

```

Table 3.1 Canonical Stack Machine instruction set

3.2.2.1 Reverse Polish Notation

Stack machines execute data manipulation operations using postfix operations. These operations are usually called "Reverse Polish" after the Reverse Polish Notation (RPN) that is often used to describe postfix operations. Postfix operations are distinguished by the fact that the operands come before the operation. For example, an expression in conventional (infix) notation might be represented as:

```(12 + 45) * 98

```

In this expression, parentheses are used to force the addition to occur before the multiplication. Even in expressions without parentheses, an implied operator precedence is in force. For example, without parentheses the multiplication would be done before the addition. The equivalent to the above parenthesized expression would be written in postfix notation as:

```98    12  45  +    *

```

In postfix notation, the operator acts upon the most recently seen operands, and uses an implied stack for evaluation. In this postfix example, the numbers 98, 12, and 45 would be pushed onto a stack as shown in Figure 3.2. Then the + operator would act upon the top two stack elements (namely 45 and 12) leaving the result 57. Then the * operator would work upon the new two top stack elements 57 and 98 leaving 5586.

Figure 3.2 -- An example stack machine.

Postfix notation has an economy of expression when compared to infix notation in that neither operator precedence nor parentheses are necessary. It is much better suited to the needs of computers. In fact, compilers as a matter of course translate infix expressions in languages such as C and FORTRAN into postfix machine code, sometimes using explicit register allocation instead of an expression stack.

The Canonical Stack Machine described in the preceding section is designed to execute postfix operations directly without burdening the compiler with register allocation bookkeeping.

3.2.2.2 Arithmetic and logical operators

In order to accomplish basic arithmetic, the Canonical Stack Machine needs both arithmetic and logical operators. In this and the following sections each instruction will be discussed and described in terms of a register transfer level pseudocode, which should be self explanatory. For example, the first operation is addition.

Operation: +
Stack: N1 N2 -> N3
Description: Add N1 and N2, giving sum N3
Pseudocode: TOSREG <= TOSREG + POP(DS)

For the + operation, the top two elements of the user's perceived data stack N1 and N2 are popped and added, with the result N3 pushed back onto the stack. From an implementation point of view, this would mean popping the DS (which gives N1) and adding it to the TOSREG value which contains N2. The result is placed back into TOSREG, leaving N3 in at the top of the user's perceived data stack. The DS element which is accessed by POP(DS) is actually the second-to-top element of the stack as seen by the programmer, but is the top element on the actual hardware stack. This operation of POP(DS) is consistent with the notion of TOSREG as a top-of-stack register as seen by the user. Note that by keeping the TOSREG as a 1-element stack buffer, a POP of element N2 and a subsequent PUSH of element N3 were eliminated.

Operation: -
Stack: N1 N2 -> N3
Description: Subtract N2 from N1, giving difference N3
Pseudocode: TOSREG <= POP(DS) - TOSREG

Operation: AND
Stack: N1 N2 -> N3
Description: Perform a logical AND on N1 and N2, giving result N3
Pseudocode: TOSREG <= TOSREG and POP(DS)

Operation: OR
Stack: N1 N2 -> N3
Description: Perform a logical OR on N1 and N2, giving result N3
Pseudocode: TOSREG <= TOSREG or POP(DS)

Operation: XOR
Stack: N1 N2 -> N3
Description: Perform a logical eXclusive OR on N1 and N2, giving result N3
Pseudocode: TOSREG <= TOSREG xor POP(DS)

It is obvious that the top-of-stack buffer register saves a considerable amount of work when performing these operations.

3.2.2.3 Stack manipulation operators

One of the problems associated with pure stack machines is that they are able to access only the top two stack elements for arithmetic operations. Therefore, some overhead instructions must be spent on preparing operands to be consumed by other operations. Of course, it should be said that some register-based machines also spend a large number instructions doing register-to-register moves to prepare for operations, so the question of which approach is better becomes complicated.

The following instructions all deal with manipulating stack elements.

Operation: DROP
Stack: N1 ->
Description: Drop N1 from the stack
Pseudocode: TOSREG <= POP(DS)

In this and many subsequent instruction definitions, notation similar to TOSREG <= POP(DS) is used. In order to accomplish this operation, the data stack information is placed onto the data bus, then brought through the ALU by performing a dummy operation (such as adding 0) to be placed in the top-of-stack register.

Operation: DUP
Stack: N1 -> N1 N1
Description: Duplicate N1, returning a second copy of it on the stack
Pseudocode: PUSH(DS) <= TOSREG

Operation: OVER
Stack: N1 N2 -> N1 N2 N1
Description: Push a copy of the second element on the stack, N1, onto the top of the stack
Pseudocode: PUSH(RS) <= TOSREG (Place N2 on RS)
TOSREG <= POP(DS) (Place N1 into TOSREG)
PUSH(DS) <= TOSREG (Push N1 onto DS)
PUSH(DS) <= POP(RS) (Push N2 onto DS)

OVER seems conceptually simple when looking at the description. However, the operation is complicated by the need to store N2 temporarily to get it out of the way. In actual machines, one or more temporary storage registers are added to the system to reduce this thrashing for OVER, SWAP, and other stack manipulation instructions.

Operation: SWAP
Stack: N1 N2 -> N2 N1
Description: Swap the order of the top two stack elements
Pseudocode: PUSH(RS) <= TOSREG (Save N2 on the RS)
TOSREG <= POP(DS) (Place N1 into TOSREG)
PUSH(DS) <= POP(RS) (Push N2 onto DS)

Operation: >R (Pronounced "to R")
Stack: N1 ->
Description: Push N1 onto the return stack
Pseudocode: PUSH(RS) <= TOSREG
TOSREG <= POP(DS)

The instruction >R and its complement R> allow shuffling data elements between the data and return stacks. This technique is used to access stack elements buried more than two elements deep on the stacks by temporarily placing the topmost data stack elements on the return stack.

Operation: R> (Pronounced "R from")
Stack: -> N1
Description: Pop the top element of the return stack, and push it onto the data stack as N1
Pseudocode: PUSH(DS) <= TOSREG
TOSREG <= POP(RS)

3.2.2.4 Memory fetching and storing

Even though all arithmetic and logical operations are performed on data elements on the stack, there must be some way of loading information onto the stack before operations are performed, and storing information from the stack into memory. The Canonical Stack Machine uses a simple load/store architecture, so has only a single load instruction `@' and a single store instruction `!' .

Since instructions do not have an operand field, the memory address is taken from the stack. This eases access to data structures, since the stack may be used to hold a pointer that indexes through array elements. Since memory must be accessed once for the instruction and once for the data, these instructions take two memory cycles to execute.

Operation: ! (Pronounced "store")
Description: Store N1 at location ADDR in program memory
Pseudocode: MAR <= TOSREG
MEMORY <= POP(DS)
TOSREG <= POP(DS)

Operation: @ (Pronounced "fetch")
Description: Fetch the value at location ADDR in program memory, returning N1
Pseudocode: MAR <= TOSREG
TOSREG <= MEMORY

3.2.2.5 Literals

Somehow there must be a way to get a constant value onto the stack. The instruction to do this is called the literal instruction, which is often called a load-immediate instruction in register-based architectures. The literal instruction uses two consecutive instruction words: one for the actual instruction, and one to hold a literal value to be pushed onto the stack. Literal requires two memory cycles, one for the instruction, one for the data element.

Operation: [LIT]
Stack: -> N1
Description: Treat the value in the next program cell as an integer constant, and push it onto the stack as N1
Pseudocode: MAR <= PC (Address of literal)
PC <= PC + 1
PUSH(DS) <= TOSREG
TOSREG <= MEMORY

This implementation assumes that the PC is pointing to the location of the next instruction word after the current opcode.

3.2.3 Instruction execution

Thus far we have ignored the mechanics of how an instruction actually gets fetched from program memory and executed. This execution sequence involves a typical sequence of instruction fetch, instruction decode, and instruction execute.

3.2.3.1 Program Counter

The Program Counter is the register that holds a pointer to the next instruction to be executed. After fetching an instruction, the program counter is automatically incremented to point to the next word in memory. In the case of a branch or subroutine call instruction, the program counter is loaded with a new value to accomplish the branch.

An implied instruction fetching sequence which is performed before every instruction is:

Operation: [Fetch next instruction]
Pseudocode: MAR <= PC
INSTRUCTION REGISTER <= MEMORY
PC <= PC + 1

3.2.3.2 Conditional branching

In order to be able to make decisions, the machine must have available some method for conditional branching. The Canonical Stack Machine uses the simplest method possible. A conditional branch may be performed based on whether the top stack element is equal to 0. This approach eliminates the need for condition codes, yet allows implementation of all control flow structures.

Operation: [IF]
Stack: N1 ->
Description: If N1 is false (value is 0) perform a branch to the address contained in the next program cell, otherwise continue
Pseudocode: if TOSREG is 0
MAR <= PC
PC <= MEMORY
else
PC <= PC + 1
endif
TOSREG <= POP(DS)

3.2.3.3 Subroutine calls

Finally, the Canonical Stack Machine must have a method of efficiently implementing subroutine calls. Since there is a dedicated return address stack, subroutine calls simply require pushing the current program counter value onto the stack, then loading the program counter with a new value. We will assume that the instruction format for subroutine calls allows specifying a full address for the subroutine call within a single instruction word, and will ignore the mechanics of actually extracting the address field from the instruction. Real machines in later chapters will offer a variety of methods for accomplishing this with extremely low hardware overhead.

Subroutine returns are accomplished by simply popping the return address from the top of the return address stack and placing the address in the program counter. Since data parameters are maintained on the data stack, no pointers or memory locations need be manipulated by the subroutine call instruction.

Operation: [CALL]
Stack: ->
Description: Perform a subroutine
Pseudocode: PUSH(RS) <= PC (Save return address)
PC <= INSTRUCTION REGISTER ADDRESS FIELD

Operation: [EXIT]
Stack: ->
Description: Perform a subroutine return
Pseudocode: PC <= POP(RS)

3.2.3.4 Hardwired vs. microcoded instructions

While the Canonical Stack Machine description has been kept free from implementation considerations for conceptual simplicity, a discussion of one major design tradeoff that is seen in real implementations is in order. The tradeoff is one between hardwired control and microcoded control. An introduction to the concepts of hardwired versus microcoded implementation techniques may be found in (Koopman 1987a).

Hardwired designs traditionally allow faster and more space efficient implementations to be made. The cost for this performance increase is usually increased complexity in designing decoding circuitry, and a major risk of requiring a complete control logic redesign if the instruction set specification is changed near the end of the product design cycle.

With a stack machine, the instruction format is extremely simple (just an opcode) and the usual combinatorial explosion of operand/type combinations is completely absent. For this reason hardwired design of a stack machine is relatively straightforward.

As an additional benefit, if a stack machine has a 16 bit or larger word length, the word size is very large compared to the few bits needed to specify the possible operations. Hardwired stack machines usually exploit this situation by using pre-decoded instruction formats to further simplify control hardware and increase flexibility. Pre-decoded (also called unencoded) instructions have a microcode-like format in that specific bit fields of the instruction perform specific tasks. This allows for the possibility of combining several independent operations (such as DUP and [EXIT]) in the same instruction.

While 16-bit instructions may seem wastefully large, the selection of a fixed length instruction simplifies hardware for decoding, and allows a subroutine call to be encoded in the same length word as other instructions. A simple strategy for encoding a subroutine call is to simply set the highest bit to 0 for a subroutine call (giving a 15 bit address field) or 1 for an opcode (giving a 15 bit unencoded instruction field). In general, the speed advantage of a fixed length instruction when combined with the possibility of compressing multiple operations into the same instruction makes the selection of a fixed length instruction justifiable. The technique of unencoded hardwired design on stack machines was pioneered by the Novix NC4016 and has since been used by other machines.

With all the benefits of hardwired instruction decoding for stack machines, it might seem at first glance that microcoded instruction execution would never be used. However, there are several advantages to using a microcoded implementation scheme.

The major advantage to a microcoded approach is flexibility. Since most combinations of bits in an unencoded hardwired instruction format are not useful, a microcoded machine can use fewer bits to specify the same possible instructions, including optimized instructions that perform a sequence of stack functions. This leaves room for user-specified opcodes in the architecture. A microcoded machine can have several complex, multicycle user-specific instructions that would be unfeasible to implement using hardwired design techniques. If some or all of the microcode memory is constructed of RAM instead ROM, then the instruction set can be customized for each user or even each application program.

One potential drawback is that using a microcoded design often establishes a microinstruction fetching pipeline to avoid a speed penalty for accessing microcode program memory. This can result in a requirement that instructions take more than one clock cycle, whereas hardwired designs are optimized for single-clock-cycle execution.

As it turns out, this is not really a penalty. Using a realistic match of processor speed and affordable memory speed, most processors can perform two internal stack operations in the time it takes for a single memory access. Thus, both a hardwired design and microcoded design can execute instructions in a single memory cycle. Furthermore, since a microcoded design can slip in twice as many operations per memory cycle period, opportunities for code optimization and user-specified instructions are that much greater.

As a practical matter, microcoded implementations are more convenient to implement in discrete component designs, so they predominate in board-level implementations. Most single-chip implementations are hardwired.

3.2.4 State changing

An important consideration in real time control applications is how the processor can handle interrupts and task switches. The specified instruction set for the Canonical Stack Machine sidesteps these issues to a certain extent, so we will talk about the standard ways of handling these events to build a base upon which to contrast designs in the next sections.

A potential liability for stack machines with independent stack memories is the large state that must be saved if the stacks are swapped into program memory to change tasks. We will see how this state change can be avoided much of the time. Chapter 6 speaks about advanced design techniques that can further reduce the penalties of a task switch.

3.2.4.1 Stack overflow/underflow interrupts

Interrupts are caused either by exceptional events, such as stack overflow, or by requests for I/O service. Both events require quick resolution without disturbing the flow of the current task.

Stack Overflow/Underflow is by far the most frequent exceptional condition on a stack machine, so it will be used as an example for how "large grain" interrupts are handled.

Stack Overflow/Underflow occurs when the hardware stack memory capacity is exceeded by the application program. Several possible responses to this situation include: ignore the overflow and let the software crash (an easy to implement but rather messy solution), halt the program with a fatal execution error, or copy a portion of the stack memory to program memory to allow continued program execution. Clearly, the last alternative gives the most flexibility, but a fatal execution error may be acceptable and simpler to implement in some environments.

Other exceptional conditions such as a memory parity may be handled by the system as well.

Exceptional conditions can take a long time to resolve, but all have the property of requiring the state of the current task to remain intact during resolution so that the task may be restarted if possible. Thus, these conditions require no action from the processor except to force a hardware generated subroutine call to a condition handling code.

3.2.4.2 I/O service interrupts

I/O servicing is a potentially frequent event that must be handled quickly for real time control tasks. Fortunately, interrupts usually require very little processing and almost no temporary storage. For this reason, stack machines treat interrupts as hardware generated subroutine calls.

These subroutine calls push parameters onto the stack, perform their calculations, then perform a subroutine exit to restart the program that was interrupted. The only constraint is that the interrupt service routine must leave no "garbage" behind on the stack.

Interrupts are much less expensive on stack machines than on conventional machines for several reasons: registers need not be saved since the stack allocates working storage automatically, there are no condition code flags to be saved since the only branch conditions are kept as flags on the data stack, and most stack processors have a short or nonexistent data pipeline, so there is no penalty for saving the pipeline state when an interrupt is received.