Stack Computers: the new wave © Copyright 1989, Philip Koopman, All Rights Reserved.
Chapter 5. Architecture of 32-bit Systems
The Harris Semiconductor RTX 32P is a 32-bit member of the Real Time Express (RTX) processor family. The RTX 32P is a prototype machine that is the basis of Harris' commercial 32-bit stack machine design.
The RTX 32P is a CMOS chip implementation of the WISC Technologies CPU/32 (Koopman 1987c) which was originally built using discrete TTL components. The CPU/32 was in turn developed from the WISC CPU/16 described in Chapter 4. Because of this history, the RTX 32P is a microcoded machine, with on-chip microcode RAM and on-chip stacks.
The RTX 32P is a 2-chip stack processor designed primarily for maximum flexibility as an architectural evaluation platform. It contains very large data and return stacks on-chip, as well as a large amount of on-chip microcode memory. This large amount of high speed RAM forced the design to use two chips, but this was consistent with the goal of producing a research and development vehicle. Real time control is the primary application area for the RTX 32P.
The primary language for programming the RTX 32P is Forth. However, the RTX 32P's commercial successor will be enhanced for excellent support of more conventional languages such as C, Ada, and Pascal; special purpose languages such as LISP and Prolog; and functional programming languages.
An important design philosophy of the RTX 32P is that as processor speeds increase, an ALU can be cycled twice for every off-chip memory access. Therefore the RTX 32P executes two microinstructions for each main memory access, including instruction fetches. Every instruction is two or more clock cycles in length, with a different microinstruction executed on each clock cycle. The reasons for adopting this strategy are discussed at greater length in Section 9.4.
Figure 5.3 is an architectural block diagram of the RTX 32P.
Figure 5.3 -- RTX 32P block diagram.
The Data Stack and Return Stack are implemented as identical hardware stacks consisting of a 9-bit up/down counter (the Stack Pointer) feeding an address to a 512-element by 32-bit wide memory. The stack pointers are readable and writable by the system to provide an efficient way of accessing deeply buried stack elements.
The ALU section includes a standard multifunction ALU with a DHI register for holding intermediate results. By convention, the DHI register acts as a buffer for the top stack element. This means that the Data Stack Pointer actually addresses the element perceived by the programmer to be the second-from-top stack element. The result is that an operation on the top two stack elements, such as addition, can be performed in a single cycle, with the B side of the ALU reading the second stack element from the Data Stack and the A side of the ALU reading the top stack element from the DHI register.
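The top-of-stack buffering described above can be illustrated with a small software model. This is only a sketch: the class name, the method names, and the direction in which the stack pointer moves are assumptions made for illustration, not details taken from the hardware.

```python
MASK32 = 0xFFFFFFFF

class DataStack:
    """Toy model: DHI buffers the top element, stack RAM holds the rest."""

    def __init__(self, depth=512):
        self.depth = depth
        self.ram = [0] * depth  # 512 x 32-bit stack memory
        self.sp = 0             # 9-bit pointer; addresses the second-from-top element
        self.dhi = 0            # top-of-stack buffer register

    def push(self, value):
        # Spill the old top into stack RAM, then load the new top into DHI.
        self.sp = (self.sp + 1) % self.depth  # growth direction is an assumption
        self.ram[self.sp] = self.dhi
        self.dhi = value & MASK32

    def pop(self):
        value = self.dhi
        self.dhi = self.ram[self.sp]
        self.sp = (self.sp - 1) % self.depth
        return value

    def add(self):
        # A side reads DHI, B side reads the RAM word at sp, result lands in DHI.
        self.dhi = (self.dhi + self.ram[self.sp]) & MASK32
        self.sp = (self.sp - 1) % self.depth
```

In this model `add` touches the stack RAM once and the DHI register once, which is the property that lets the real machine combine both operands in a single cycle.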
The Data Latch on the B side of the ALU input is a normally transparent latch that can be used to retain data for one clock cycle. This speeds up swap operations between the DHI register and the Data Stack.
There are no condition codes visible to machine language programs. Add with carry and other multiple precision operations are supported by microcoded instructions that push the carry flag onto the data stack as a logical value (0 for carry clear, -1 for carry set).
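As a rough illustration of carrying the flag on the stack, the sketch below models an add-with-carry step using the stack effect listed for ADC in Table 5.2(e) (N1 N2 CIN -> N3 COUT). The Python function names and the `add64` helper are hypothetical and only show how two such steps could chain into a double-precision add.

```python
MASK32 = 0xFFFFFFFF

def to_flag(carry):
    # Logical flag as described in the text: 0 for clear, -1 (all ones) for set.
    return MASK32 if carry else 0

def adc(n1, n2, cin):
    """Model of N1 N2 CIN -> N3 COUT on 32-bit cells."""
    total = (n1 & MASK32) + (n2 & MASK32) + (1 if cin else 0)
    return total & MASK32, to_flag(total > MASK32)

def add64(a, b):
    """Hypothetical helper: 64-bit add built from two 32-bit ADC steps."""
    lo, carry = adc(a & MASK32, b & MASK32, 0)
    hi, carry = adc((a >> 32) & MASK32, (b >> 32) & MASK32, carry)
    return (hi << 32) | lo, carry
```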
The DLO register acts as a temporary holding register for intermediate results within a single instruction. Both the DHI and DLO registers are shift registers, connected to allow 64-bit shifting for multiplication and division.
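To show why a linked 64-bit shift path between DHI and DLO is useful, here is a textbook shift-and-add unsigned multiply that performs one step per bit. It is a generic algorithm sketch, not the RTX 32P's actual microcode; the function name and the assignment of the multiplier to DLO are assumptions.

```python
MASK32 = 0xFFFFFFFF

def umul32(multiplicand, multiplier):
    """32 x 32 -> 64-bit unsigned multiply, one add-and-shift step per bit."""
    dhi = 0                    # accumulates the high half of the product
    dlo = multiplier & MASK32  # multiplier bits shift out as product bits shift in
    for _ in range(32):
        carry = 0
        if dlo & 1:
            total = dhi + (multiplicand & MASK32)
            dhi = total & MASK32
            carry = total >> 32
        # Shift the carry:DHI:DLO chain right by one bit.
        dlo = (dlo >> 1) | ((dhi & 1) << 31)
        dhi = (dhi >> 1) | (carry << 31)
    return dhi, dlo            # high and low 32-bit halves of the product
```

After 32 steps the register pair holds the full 64-bit product, with the high half in `dhi` and the low half in `dlo`.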
An off-chip Host Interface is used to connect to the personal computer host. Since all on-chip storage is RAM-based, an external host is required for initializing the CPU.
The RTX 32P has no program counter. Every instruction contains the address of the next instruction or refers to the address on the top of the return address stack. This design decision is in keeping with the observation that Forth programs contain a very high proportion of subroutine calls. Section 6.3.3 discusses the effects of the RTX 32P's instruction format in greater detail.
Instead of a program counter, the block described as the Memory Address Logic contains a Next Address Register (NAR), which holds the pointer for fetching the next instruction. The Memory Address Logic uses the top element of the Return Stack to address memory for subroutine returns, while it uses the RAM address register (ADDR REG) for doing memory fetches and stores efficiently. The Memory Address Logic also contains an increment-by-4 circuit for generating return addresses for subroutine call operations. Since the Return Stack and Memory Address Logic can be isolated from the system Data Bus, subroutine calls, subroutine returns, and unconditional jumps can be performed in parallel with other operations. This results in these control transfer operations costing zero clock cycles in many cases.
Program memory is organized as up to 4 Gbytes of memory, addressable on byte boundaries. Instructions and 32-bit data items are required to be aligned on 32-bit memory boundaries, since data is accessed in 32-bit words from memory. The actual RTX 32P chips can address only 8M bytes because of a limited number of pins on the package.
Microprogram Memory is an on-chip read/write memory containing 2K elements by 30 bits. The memory is addressed as 256 pages of 8 words each. Each opcode in the machine is allocated its own page of 8 words. The Microprogram Counter supplies a 9-bit page address, of which only the lowest 8 bits are used in this implementation. This scheme allows supplying 3 bits from the current microinstruction, the lowest bit of which is the result of a 1-in-8 conditional microbranch selection, as the address for the next microinstruction within the same microcode page. This allows conditional branching and looping during the execution of a single opcode.
Instruction decoding is accomplished simply by loading the 9-bit opcode into the Microprogram Counter and using that as the page address to Microprogram Memory. Since the Microprogram Counter is built with a counter circuit, operations can span more than one 8-microinstruction page if required.
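The paged addressing can be sketched as a simple address calculation. The sketch assumes that the page number forms the high-order bits of a flat 2K-word address and that the unused ninth opcode bit is simply ignored; the function and constant names are illustrative only.

```python
WORDS_PER_PAGE = 8
PAGE_COUNT = 256  # 256 pages x 8 words = 2K microinstructions

def microcode_address(opcode, offset):
    """Flat Microprogram Memory address for a 9-bit opcode and a 3-bit offset."""
    if not 0 <= opcode < 512:
        raise ValueError("opcode must fit in 9 bits")
    if not 0 <= offset < WORDS_PER_PAGE:
        raise ValueError("offset must fit in 3 bits")
    page = opcode % PAGE_COUNT  # only the low 8 bits are used (assumed masking)
    return page * WORDS_PER_PAGE + offset
```

Under these assumptions, decoding an instruction needs no lookup table: the opcode itself is the page address of its first microinstruction.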
The Microinstruction Register (MIR) holds the output of the Microprogram Memory. This allows the next microinstruction to be accessed from Microprogram Memory in parallel with execution of the current microinstruction. The MIR completely removes the Microprogram Memory access delay from the system's critical path. Its use also enforces a lower limit of two clock cycles on instructions. If an instruction could be accomplished in a single clock cycle, a second no-op microinstruction must be added to allow the next instruction to flow through the MIR fetching sequence properly.
The Host Interface allows the RTX 32P to operate in two possible modes: Master Mode and Slave Mode. In Slave Mode, the RTX 32P is controlled by the personal computer host to allow program loading, microprogram loading, and alteration of any register or memory location on the system for initialization or debugging. In Master Mode, the RTX 32P runs its program freely, while the host computer monitors a status register for a request for service. While the RTX 32P is in Master Mode the host computer may enter a dedicated service loop, or may perform other tasks such as prefetching the next block of a disk input stream or displaying an image, and only periodically poll the status register. The RTX 32P will wait for service from the host for as long as is necessary.
Figure 5.4 -- RTX 32P instruction format.
The RTX 32P has only one instruction format, shown in Figure 5.4. Every instruction contains a 9-bit opcode which is used as the page number for addressing microcode. It also contains a 2-bit program flow control field that invokes either an unconditional branch, a subroutine call, or a subroutine exit. In the case of either a subroutine call or unconditional branch, bits 2-22 are used to specify the high 21 bits of a 23-bit word-aligned target address. This design limits program sizes to 8M bytes unless the page register in the Memory Address Logic is used with special far jump and call instructions. Data fetches and stores see the memory as a contiguous 4G byte address space.
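The single instruction format can be sketched as a pack/unpack pair. The text fixes only the address field (bits 2-22); placing the opcode in bits 23-31 and the flow control field in bits 0-1 is an inference from the field widths, and the numeric codes for the flow control field are purely hypothetical.

```python
# Hypothetical encodings for the 2-bit program flow control field.
FLOW_JUMP, FLOW_CALL, FLOW_EXIT = 0, 1, 2

ADDRESS_MASK = 0x007FFFFC  # bits 2-22: high 21 bits of a 23-bit byte address

def encode(opcode, flow, target=0):
    """Pack opcode (assumed bits 23-31), target (bits 2-22), flow (assumed bits 0-1)."""
    if not 0 <= opcode < 512:
        raise ValueError("opcode must fit in 9 bits")
    if not 0 <= flow < 4:
        raise ValueError("flow control field is 2 bits")
    if target & ~ADDRESS_MASK:
        raise ValueError("target must be word-aligned and below 8M bytes")
    # A word-aligned byte address already has its high 21 bits in positions 2-22.
    return (opcode << 23) | target | flow

def decode(word):
    """Return (opcode, flow, target) from a 32-bit instruction word."""
    return (word >> 23) & 0x1FF, word & 0x3, word & ADDRESS_MASK
```

Because targets are word-aligned, the two low address bits are always zero, which is what frees bits 0-1 of the word for another field.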
Wherever possible, the RTX 32P's compiler compacts an opcode followed by a subroutine call, return, or jump into a single instruction. In those cases where such compaction is not possible, a NOP opcode is compiled with a call, jump, or return, or a jump to next in-line instruction is compiled with an opcode. Tail-end recursion elimination is performed by compressing a subroutine call followed by a subroutine return into a simple jump to the beginning of the subroutine that was to be called, saving the cost of the return that would otherwise be executed in the calling routine.
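The compaction rules above can be sketched as a two-pass peephole optimizer. The token and instruction representations below are hypothetical, and the sketch ignores complications such as branch targets that fall between a call and its return.

```python
def eliminate_tail_calls(tokens):
    """Rewrite each call immediately followed by a return as a jump."""
    out = []
    i = 0
    while i < len(tokens):
        is_call = tokens[i][0] == "call"
        if is_call and i + 1 < len(tokens) and tokens[i + 1][0] == "ret":
            out.append(("jump", tokens[i][1]))
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out

def compact(tokens):
    """Merge each opcode with a following call, jump, or return where possible."""
    tokens = eliminate_tail_calls(tokens)
    out = []
    i = 0
    while i < len(tokens):
        tok = tokens[i]
        if tok[0] == "op":
            nxt = tokens[i + 1] if i + 1 < len(tokens) else None
            if nxt is not None and nxt[0] != "op":
                out.append((tok[1],) + nxt)        # opcode shares the instruction
                i += 2
            else:
                out.append((tok[1], "jump-next"))  # jump to next in-line instruction
                i += 1
        else:
            out.append(("NOP",) + tok)             # control transfer with NOP opcode
            i += 1
    return out
```

For example, the token stream for `DUP + SQUARE ;` (two opcodes, a call, and a return) compacts to two instructions: `DUP` with a jump to the next instruction, and `+` combined with a jump to `SQUARE`.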
Since the RTX 32P uses RAM for the microcode memory, the microcode may be completely changed by the user if desired. The standard software environment for the CPU/32 is a version of MVP-FORTH, a FORTH-79 dialect (Haydon 1983). Some of the Forth instructions included in the standard microcoded instruction set are shown in Table 5.2. One thing that is noticeable in this instruction set is the number and complexity of instructions supported.
    !          DDROP
    +          DDUP
    +!         DNEGATE
    -          DROP
    0          DSWAP
    0<         DUP
    0=         I
    0BRANCH    I'
    1+         J
    1-         LEAVE
    2*         LIT
    2/         NEGATE
    <          NOP
    PICK       NOT
    ROLL       OR
    =          OVER
    >R         R>
    ?DUP       R@
    @          ROT
    ABS        S->D
    AND        SWAP
    BRANCH     U*
    D!         U/MOD
    D+         XOR
    D@
Table 5.2(a) RTX 32P Instruction Set Summary -- Forth Primitives. (see Appendix B for descriptions)
    <variable> @      (fetch a variable)
    <variable> @ +    (fetch and add a variable)
    <variable> !      (store a variable)
    @ +
    DUP @
    LIT +
    OVER +
    OVER -
    R> DROP
    R> SWAP >R
    SWAP !
    SWAP -
    SWAP DROP

The RTX 32P instruction set may be extended by the user to incorporate any other stack manipulation primitives required for a particular application.
Table 5.2(b) RTX 32P Instruction Set Summary -- Compound Forth Primitives.
    OPCODE     DATA STACK    RETURN STACK
    HALT          ->             ->
               Returns control to host processor
    SYSCALL     N ->             ->
               Requests I/O service number N from host
    DOVAR         -> ADDR        ->
               Used to implement Forth variables
    DOCON         -> N           ->
               Used to implement Forth constants
Table 5.2(c) RTX 32P Instruction Set Summary -- Special Words.
The following Forth operations have microcoded support words that do most of their work:

    SP@       (fetch contents of data stack pointer)
    SP!       (initialize data stack pointer)
    RP@       (fetch contents of return stack pointer)
    RP!       (initialize return stack pointer)
    MATCH     (string compare primitive)
    ABORT"    (error checking & reporting word)
    +LOOP     (variable increment loop)
    /LOOP     (variable unsigned increment loop)
    CMOVE     (string move)
    <CMOVE    (reverse order string move)
    DO        (loop initialization)
    ENCLOSE   (text parsing primitive)
    LOOP      (increment by 1 loop)
    FILL      (block memory initialization word)
    TOGGLE    (bit mask/set primitive)
Table 5.2(d) RTX 32P Instruction Set Summary -- Support Words for High Level Operations.
    OPCODE      DATA STACK                 RETURN STACK
    <UNORM>     EXP1 U2 -> EXP3 U4            ->
                Floating point normalize of unsigned 32-bit mantissa
    ADC         N1 N2 CIN -> N3 COUT          ->
                Add with carry. CIN and COUT are logical flags on the stack.
    ASR         N1 -> N2                      ->
                Arithmetic shift right.
    BYTE-ROLL   N1 -> N2                      ->
                Rotate right by 8 bits.
    D+!         D ADDR ->                     ->
                Sum D into 32-bit number at ADDR.
    D>R         D ->                          -> D
                Move D to return stack.
    DLSLN       D1 N2 -> D3                   ->
                Logical shift left of D1 by N2 bits.
    DLSR        D1 -> D2                      ->
                Logical shift right of D1 by 1 bit.
    DLSRN       D1 N2 -> D3                   ->
                Logical shift right of D1 by N2 bits.
    DR>         -> D                        D ->
                Move D from return stack to data stack.
    DROT        D1 D2 D3 -> D2 D3 D1          ->
                Perform double-precision ROT.
    LSLN        N1 N2 -> N3                   ->
                Logical shift left of N1 by N2 bits.
    LSR         N1 -> N2                      ->
                Logical shift right of N1 by 1 bit.
    LSRN        N1 N2 -> N3                   ->
                Logical shift right of N1 by N2 bits.
    Q+          Q1 Q2 -> Q3                   ->
                128-bit addition.
    QLSL        Q1 -> Q2                      ->
                Logical shift left of Q1 by 1 bit.
    RLC         N1 CIN -> N2 COUT             ->
                Rotate left through carry N1 by 1 bit. CIN is carry-in, COUT is carry-out.
    RRC         N1 CIN -> N2 COUT             ->
                Rotate right through carry N1 by 1 bit. CIN is carry-in, COUT is carry-out.

Note: The RTX 32P uses RAM microcode memory, so the user may add or modify any instructions desired. The above list merely indicates the instructions supplied with the standard development software package.
Table 5.2(e) RTX 32P Instruction Set Summary -- Extended Math & Floating Point Support Words.
Table 5.2(b) shows some common Forth word combinations that are available as single instructions. Table 5.2(c) shows some words that are used to support underlying Forth operations such as subroutine call and exit. Table 5.2(d) lists some high level Forth words that are directly supported by specialized microcode. Table 5.2(e) shows words that were added in microcode to support extended precision integer operations and 32-bit floating point calculations.
Since the instructions vary considerably in complexity, their execution times vary accordingly. Simple instructions that manipulate data on the stack such as + and SWAP take 2 microcycles (one memory cycle) each. Complex instructions such as Q+ (128-bit addition) may take 10 or more microinstructions, but are still much faster than comparable high level code. If desired, microcoded loops can be written that can potentially last thousands of clock cycles to do things such as block memory moves.
Figure 5.5 -- RTX 32P microinstruction format.
As mentioned earlier, each instruction invokes a sequence of microinstructions on a Microprogram Memory page corresponding to the 9-bit opcode for the instruction. Figure 5.5 shows the microinstruction format. The microcode used is horizontal, which means that there is only one format for microcode, and that the format is broken into separate fields to control different portions of the machine.
As with the WISC CPU/16, the simplicity of the stack machine approach and the RTX 32P hardware results in a simple microcode format, in this case only using 30 bits per microinstruction. The microcode format of the RTX 32P is similar to that of the CPU/16 discussed in the previous chapter.
Bits 0-3 of the microinstruction specify the source of the system Data Bus. Two of the bus sources are used as special control signals to configure the RTX 32P for one-clock-cycle-per-bit multiplication and nonrestoring division of 32/64 bit numbers.
Bits 4-7 specify the Data Bus destination. Two special cases for destinations exist: DLO may be independently specified as a bus destination using bits 22-23, and the DHI register is always loaded with the ALU output. Bits 8-9 and 10-11 specify Data Stack Pointer and Return Stack Pointer control, respectively. Bits 12-13 control a shifter on the output of the ALU. This shifter allows shifting left or right, as well as an 8-bit rotation function.
Bits 14-15 of the microinstruction are unused, and therefore not included in the Microcode RAM. Bits 16-20 control the function of the ALU. Bit 21 specifies a carry-in of 0 or 1. To synthesize multiple precision arithmetic, the microcode does a conditional microbranch based on the carry-out of the low half of the result, and then forces the next carry-in to 0 or 1 as appropriate. Bits 22-23 control the loading and shifting of the DLO register.
Bits 24-28 of the microinstruction are used to compute a 3-bit offset into the microprogram page for fetching the next microinstruction. Bits 24-26 select one of eight condition codes to form the lowest address bit, while bits 27-28 are used as constants to generate the two high order address bits. This allows jumping and 2-way conditional branching anywhere within the microprogram page on every clock cycle. Bit 29 can be used to increment the contents of the 9-bit Microprogram Counter to allow opcodes to use more than 8 Microcode Memory locations. Bit 30 initiates the instruction decoding sequence for the next instruction. This is required since instructions are a variable number of clock cycles long. Bit 31 controls the return address incrementer for use as a counter into memory for block data accesses.
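The next-microinstruction offset can be sketched as follows. The sketch assumes that bit 28 is the more significant of the two constant bits and that the eight selectable conditions are presented as a list of booleans; the actual ordering of the condition inputs is not given in the text, so the list is hypothetical.

```python
def next_offset(microword, conditions):
    """3-bit offset of the next microinstruction within the current page.

    conditions: eight booleans, indexed by the select field (hypothetical order).
    """
    if len(conditions) != 8:
        raise ValueError("expected exactly eight condition inputs")
    select = (microword >> 24) & 0x7  # bits 24-26 choose one of eight conditions
    high = (microword >> 27) & 0x3    # bits 27-28 supply the two high offset bits
    low = 1 if conditions[select] else 0
    return (high << 1) | low
```

Each microinstruction therefore names a pair of adjacent successors (an even offset and the following odd offset) and the selected condition picks between them; an unconditional jump would select a condition input that always has the same value.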
One microinstruction is executed on every clock cycle, with two or more microinstructions executed for every machine macroinstruction.
The heritage of the WISC CPU/16 in the RTX 32P architecture is unmistakable. The most obvious area of improvement is the addition of more efficient Memory Address Logic and the isolation of the Return Address Stack from the Data Bus during subroutine call and return operations. These changes, along with the RTX 32P's unique instruction format, allow subroutine calls, returns, and jumps to be processed "for free" to the extent that they can be combined with opcodes.
The RTX 32P's clock runs at twice the speed that main memory can be accessed, thus giving two clock cycles per memory cycle, and a minimum of two clock cycles per instruction.
There are a number of uses for the RTX 32P's instruction format, many of which are not immediately obvious. One of them is for executing conditional branches. The RTX 32P does not have direct hardware support for conditional branches, since this would slow down the rest of the hardware too much on other instructions or require excessively fast program memory. Conditional branches are accomplished by using a special 0BRANCH opcode combined with a subroutine call to the branch target. The subroutine call is processed by the hardware in parallel with the opcode's evaluation of whether the top stack element is zero (in which case the branch is taken). If the branch is to be taken, the Return Stack is popped, converting the subroutine call to just a jump, and execution continues. If the branch is not to be taken, the microcode pops the Return Stack and uses the value to fetch the branch fall-through instruction, in effect performing an immediate subroutine return. The cost for this conditional branch is 3 clock cycles to take a branch, 4 clock cycles to not take a branch. Remember that on this processor each memory cycle is 2 clock cycles.
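The conditional branch trick can be modeled in a few lines. This is a behavioral sketch only: the function name and list-based stacks are illustrative, the assumption that 0BRANCH consumes the tested flag follows ordinary Forth usage rather than anything stated here, and the returned cycle counts are simply the figures quoted above.

```python
def zero_branch(addr, target, data_stack, return_stack):
    """Model 0BRANCH compiled at addr together with a subroutine call to target.

    Returns (next_instruction_address, clock_cycles).
    """
    # Hardware performs the call in parallel with the opcode.
    return_stack.append(addr + 4)    # return address from the increment-by-4 circuit
    flag = data_stack.pop()          # assumed: the tested flag is consumed
    fall_through = return_stack.pop()
    if flag == 0:
        # Branch taken: the saved return address is discarded, so the call
        # has become a plain jump.
        return target, 3
    # Branch not taken: the popped value is used at once, acting as an
    # immediate subroutine return.
    return fall_through, 4
```

In both cases the Return Stack ends up exactly as it started, so the borrowed call leaves no trace.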
Another interesting capability of the RTX 32P is quick access of any memory location as a variable. Even though the 0-operand instruction format would seem to require a second memory location to specify the variable address, the following operation can be used. A special opcode is compiled with a subroutine call, where the address of the "subroutine" is actually the address of the variable desired to be fetched. The microcode then "steals" the variable value as the instruction fetching logic reads it in, then forces a subroutine return before the value can be executed as an instruction.
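The variable fetch trick can be sketched the same way. The dictionary standing in for program memory and the function name are hypothetical; the model only captures the order of events described above.

```python
def fetch_variable(addr, var_addr, memory, data_stack, return_stack):
    """Model the special opcode compiled at addr with a 'call' to var_addr.

    memory maps word-aligned byte addresses to 32-bit values.
    Returns the address of the next instruction to execute.
    """
    return_stack.append(addr + 4)  # the hardware call saves the return address
    fetched = memory[var_addr]     # instruction fetch logic reads the variable
    data_stack.append(fetched)     # microcode takes the word as data
    return return_stack.pop()      # forced return before the word can execute
```

The variable's address travels in the instruction's own address field, so no second memory word is needed to hold it.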
The point of discussing these two methods is to illustrate that there are several significant capabilities of the hardware that are not immediately obvious to programmers who are used to more conventional machines. These capabilities are especially useful in programming data structure accesses (for example, expert system decision trees), and actually allow direct execution of data structures. This direct execution is accomplished by storing the data in a tagged format having a 9-bit tag (corresponding to special user-defined opcodes) and a 23-bit address that is a subroutine call or jump to the next data element in the structure, or a subroutine return for a nil pointer.
An important implementation feature of the RTX 32P is that all resources on the machine can be directly controlled by the host computer. This can be done because the host interface supports Microinstruction Register load and single-step clock features. With these features, any microinstruction desired can be executed by first loading values into any or all registers in the system, loading a microinstruction, cycling the clock, then reading data values back to examine the results. This design technique makes writing microcode extremely straightforward, eliminating the need for expensive external analysis hardware. It also makes testing and diagnostic programs very simple to write.
The RTX 32P supports interrupt handling, including interrupt on stack underflow and overflow for both the Data Stack and Return Stack. The usual technique for handling these overflows and underflows is to page in or out half the on-chip stack contents to a holding area in program memory. This allows programs to use arbitrarily deep stacks. With a 512 element hardware stack buffer size, typical Forth programs never experience a stack overflow.
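The overflow and underflow handling can be illustrated with a simplified model in which the on-chip buffer is a Python list and the holding area in program memory is a second list. The class and attribute names are hypothetical, and a real handler would move stack pointers around a circular buffer rather than slice lists.

```python
class SpillingStack:
    """Bounded on-chip stack that pages half its contents out on overflow."""

    def __init__(self, capacity=512):
        self.capacity = capacity
        self.buffer = []  # on-chip stack, top of stack at the end
        self.spill = []   # holding area in program memory

    def push(self, value):
        if len(self.buffer) == self.capacity:
            # Overflow interrupt: page out the oldest half of the buffer.
            half = self.capacity // 2
            self.spill.extend(self.buffer[:half])
            del self.buffer[:half]
        self.buffer.append(value)

    def pop(self):
        if not self.buffer:
            if not self.spill:
                raise IndexError("stack underflow")
            # Underflow interrupt: page the most recently saved half back in.
            half = self.capacity // 2
            self.buffer = self.spill[-half:]
            del self.spill[-half:]
        return self.buffer.pop()
```

Moving half the buffer at a time leaves room in both directions after each interrupt, so a program that hovers near the boundary does not trigger a handler on every push and pop.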
The RTX 32P is implemented on 2.5 micron CMOS standard cell technology in a 2-chip set. The data path chip, which contains the ALU, data stack, and ALU bits of the microcode memory, is an 84 pin Leadless Chip Carrier (LCC). The control chip, which contains the rest of the system, is packaged in a 145 pin Pin Grid Array (PGA). The RTX 32P runs at 8 MHz.
The RTX 32P is designed for real time control applications, especially in the area of embedded systems with low power and small size requirements. As was mentioned previously, the RTX 32P is a prototyping vehicle for a commercial processor which, as of this writing, is planned to be called the RTX 4000. This new processor will have several features that make it suitable for use in real time control applications and personal computer coprocessor acceleration tasks including: a mixture of ROM and RAM microcode to shrink the system onto a single chip, stand-alone operation, on-chip hardware support for floating point math, a significantly faster clock speed, and on-chip support for dynamic program memory chips. Some versions of the chip may not have all these features. In addition, architectural enhancements will be made to support languages such as C, Ada, and LISP by allowing use of the address field in the instruction to specify fast-access 21-bit literals. This will allow crucial operations such as frame-pointer-plus-offset addressing to run at high speed.
The information in this section is based on the descriptions of the WISC CPU/32 in Koopman (1987c) and Koopman (1987d), and the introduction of the RTX 32P in Koopman (1989).
Phil Koopman -- koopman@cmu.edu