Stack Computers: the new wave © Copyright 1989, Philip Koopman, All Rights Reserved.
Chapter 5. Architecture of 32-bit Systems
The Wright State University's SF1 (which stands for Stack Frame computer number 1) is an experimental multi-stack processor designed to efficiently execute high level languages, including both Forth and C. It is designed to have a large number of stacks, using five stacks in the implementation described here. While the SF1 has its roots in the Forth language, it crosses the boundary between ML0 and ML2 machines by allowing each instruction to address any of the elements of two stacks directly from the stack memory. It has an interesting mix of the features found on the FRISC 3 and the RTX 32P, as well as some unique innovations.
Wright State University has developed a series of stack based computers starting with RUFOR (Grewe & Dixon 1984), which was purely a Forth-based processor built with bit-slice components. In 1985-1986, a computer architecture class built a discrete component prototype of a more generalized stack processor called the SF1. In 1986-1987, a VLSI class extended that architecture and made a multi-chip custom silicon implementation which was the VLSI version of the SF1. The following description is of this VLSI SF1 implementation.
The intended application area for the SF1 is real time control using Forth, C, and other high level languages.
Figure 5.6 is an architectural block diagram of the SF1.
Figure 5.6 -- SF1 block diagram.
The SF1 has two busses. The MBUS is multiplexed to transfer addresses to program memory and then instructions and data to or from program memory. The SBUS is used to transfer data between system resources. The two-bus design allows instructions to be fetched on the MBUS while data operations are taking place on the SBUS.
The ALU has an associated top-of-stack register (TOS) which receives the results of all ALU operations. The ALU input register (ALUI) acts as a holding buffer to contain the second operand for the ALU. ALUI may be loaded from either the SBUS for most operations, or the MBUS for memory fetches. Both the ALUI and the TOS may be routed to the MBUS or the SBUS. The TOS register by convention contains the top stack element of whatever stack is being used for a particular instruction, although it is up to the programmer to ensure it is managed properly.
There are eight different sources and destinations connected to the stack bus: S, F, R, L, G, C, I, and P:
All eight are referred to as stacks in the machine's reference material, but in reality the C, I, and P resources are special non-stack structures. The stacks L, G, F, and R are for the most part interchangeable in practice, and may be used for any purpose. The S stack is somewhat specialized in that all subroutine return addresses are automatically pushed onto the S stack.
Any one of the top 8192 elements of these stacks may be specified as a bus source or destination. Whenever a stack is read, the top element may be either retained or popped. When a stack is popped, the top element is always shifted out of the stack memory, regardless of which element was actually read.
Similarly, when a stack is used as a bus destination, any one of the top 8192 elements of the stack may be written to, or the top stack element may be pushed with a value from the SBUS.
The C bus source is used to return a 13 bit signed constant from the address field of the instruction. P is used to load and store the program counter. I is used to address an I/O address space of 8K words.
The Program Counter (PC) is a counter that can be asserted on the MBUS to provide addresses for instructions as well as loaded from the MBUS for jumps and subroutine calls. The PC can also be read and written on the SBUS to save and restore subroutine return addresses.
The SF1 has two instruction formats as shown in Figure 5.7. The first instruction format is used for jumps and subroutine calls, the second for all other instructions.
Figure 5.7(a) -- SF1 instruction formats -- jump/call.
Figure 5.7a shows the jump/subroutine call instruction format. This instruction format is selected by a 0 in bit 0 of the instruction. Bit 1 of the instruction selects a jump if set to 1, or a subroutine call if set to 0. Bits 2-31 of the instruction select a word-aligned address as the jump/call target. This instruction format is quite similar to that of the RTX 32P shown in figure 5.4, but without the opcode in the highest order bits. Both jump and subroutine call instructions take one clock cycle.
Figure 5.7(b) -- SF1 instruction formats -- operation.
Figure 5.7b shows the operation instruction format, which is more like the FRISC 3 ALU instruction format shown in Figure 5.2c. Bit 0 is a constant 1 which selects this instruction format.
Bit 1 selects between a no-branch and a skip operation. If a skip operation is selected and the zero status flag is set, the next instruction in the instruction stream is fetched. This can be used to implement a conditional branch-on-zero instruction sequence.
Bits 2-7 select the ALU operations. A special ALU operation returns the status flags from the previous ALU instruction. These flags can be used as an offset into a multi-way jump table for branching on multiple condition codes. This conditional branching is slower than using a skip instruction, but is more flexible.
Before covering the operation of bits 8-28, we should describe the way the SBUS works during a clock cycle. The SBUS is used twice during each clock cycle. During the first half of the clock cycle, the SBUS is used to read one of 8 bus sources. The data read is always placed into the ALUI register. During the second half of the clock cycle, the ALU performs its operation on the new ALUI value and the old TOS value. Simultaneously, the old TOS value is written to one of 8 bus destinations. Bit 29 in the instruction format can override the selection of TOS as the value to be written by forcing ALUI (which was just loaded on the first half of the clock cycle) to be asserted on the SBUS during the second half of the clock cycle.
Bits 8-11 select the SBUS destination. This destination is written with the TOS value set by the previous instruction during the second half of the clock cycle. Bit 8 selects whether the destination stack is pushed or just written. Similarly, bits 12-15 select the SBUS source, which is read during the first half of the clock cycle. Bit 12 selects whether the source stack is popped.
Bits 16-28 provide an address that is used when reading and writing the stacks. This address allows reading or writing any one of the top 8K stack elements directly. Note that there is only one address in the instruction, so both the source and destination stacks must use the same address on a given cycle.
Bit 29 is used to override the selection of TOS as the value to be written to the SBUS destination. When set, this bit uses the ALUI register instead of the TOS register. This allows direct data movement between any two SBUS resources by loading the value into the ALUI during the first half clock cycle, then storing that same value during the second half clock cycle.
Bits 30-31 are used to control memory accesses. Bit 31 selects an extended instruction cycle, which uses a second clock cycle to access RAM via the MBUS (the first clock cycle is used to fetch the next instruction). Bit 30 specifies a RAM read or write operation. The TOS register is read during the first clock cycle to provide an address, then read or written again during the second clock cycle to provide or receive the RAM data. Note that RAM reads and writes are performed on the second of two clock cycles, so bits 2-29 may be used to perform a normal instruction on the first of the two cycles. This first clock cycle is often used to reload the TOS register (which contained an address) with the value to be written into the RAM during the second clock cycle.
Once again, we see the importance of providing a dedicated path for instruction fetching in the form of the MBUS, with a second path for data manipulations in the form of the SBUS. As with the other stack machines, the SF1 is designed to support fast instruction execution and, in particular, quick subroutine calls.
The use of operands in the SF1 operation instruction format is novel. The use of a single top-of-stack register as one input for all ALU operations and the fact that only a single address field is provided makes the architecture feel like a 1-operand stack machine. However, the fact that both a source and destination may be specified for each instruction makes the machine feel more like a 2-operand machine. Perhaps this instruction format is more properly called a 1½ operand instruction, since only a single address is available while both a source and destination may be selected.
The reason for having the top 8K elements of each stack directly addressable is to provide support for languages such as C which have the notion of a stack frame. In addition, one of the stacks can be used as a very large (8K word) register file by simply never pushing or popping that particular stack.
The reason for having several hardware stacks is to support fast context switching in real time control applications. Although the implementation described only contains five hardware stacks, this number can be increased in other versions of the design. A simple way to allocate the stacks would be to dedicate one hardware stack to each of four high priority tasks, with the fifth stack saved and restored to and from program memory as required to process low priority tasks.
Subroutine returns are accomplished under program control by popping a stack and writing the top stack element into the PC. Because of a prefetch pipeline, the instruction following the subroutine return is also executed before the return takes effect.
32 Bit literal values are obtained by using PC-relative addressing with a 13 bit offset to access a constant stored in the program space. This constant is typically placed after an unconditional branch, or after the subroutine return at the end of the procedure in which it is used.
The SF1 is implemented on 3.0 micron CMOS MOSIS technology using a full-custom approach. Two custom chip designs are used. One chip contains the ALU and PC, while the other chip implements a 32-bit wide stack. Control and instruction decoding is accomplished using programmable logic, but will eventually be incorporated onto custom VLSI as well.
The implementation of the stack chips is quite different than what has been seen on other stack machines. Since the stack must be designed for random access of elements, an obvious design method would be to incorporate an adder with a stack pointer into a standard memory. This method has the disadvantage that it is slow and difficult to expand to multi-chip stacks.
The approach taken in the SF1 is completely different. Each stack memory is actually a giant shift register that actually moves the stack elements between adjacent memory words when the stack is pushed or popped. Addressing the Nth word in the stack is done simply by addressing the Nth word in memory, since the top element on the stack is always kept shifted into the 0th memory address. One disadvantage of this approach is that shift register cells are larger than regular memory cells, so the largest stack chip made for the SF1 contains only 128 words by 32 bits of memory.
The SF1 is primarily a research platform, with an emphasis on real time control with fast context switching (by dedicating a stack chip to each task) and support for high level languages.
The information in this section is based on the description of the SF1 given by Dixon (1987) and Longway (1988).
Phil Koopman -- koopman@cmu.edu