Stack Computers: the new wave © Copyright 1989, Philip Koopman, All Rights Reserved.
Chapter 4. Architecture of 16-bit Systems
The WISC Technologies CPU/16 was designed by this author as a very simple (in terms of TTL component count) stack machine with a good mixture of flexibility and speed. The WISC CPU/16 uses discrete MSI components throughout. It is a 16-bit machine that features a completely RAM-based microcode memory (writable control store) to allow full user programmability. The CPU/16 is implemented as a pair of printed circuit boards that plug into an IBM PC compatible computer as a coprocessor.
The name WISC comes from "Writable Instruction Set Computer," although a more complete term for the technology used would be "WISC/Stack," since hardware stacks are an integral part of the design.
The primary purpose for developing the CPU/16 was to investigate technology and design alternatives before designing the RTX 32P described in Chapter 5. The resulting product is a reasonably fast processor in its own right, and has an very simple and uncluttered design. The original wire-wrapped prototype for the CPU/16 fit onto a single IBM PC expansion card (13 inches by 4 inches) with 16K bytes of program memory. The use of RAM for microcode memory and the simple microinstruction format makes the processor useful as a teaching tool for computer design courses.
Figure 4.1 is an architectural block diagram of the CPU/16.
Figure 4.1 -- CPU/16 block diagram.
The Data Stack and Return Stack are implemented as identical hardware stacks consisting of an 8-bit up/down counter (the Stack Pointer) feeding an address to a 256 by 16 bit memory. The stack pointers are readable and writable by the system to provide an efficient capability to access deeply buried stack elements.
The ALU section includes a standard multifunction ALU built from 74LS181 chips with a DHI register for holding intermediate results. By convention, the DHI register acts as a buffer for the top stack element. This means that the Data Stack Pointer actually addresses the element perceived by the programmer to be the second-to-top stack element. The result is that an operation on the top two stack elements, such as addition, can be performed in a single cycle, with the A side of the ALU reading the second stack element from the Data Stack and the B side of the ALU reading the top stack element from the Data Hi register.
There are no condition codes visible to machine language programs. Add-with-carry and other multiple precision operations are supported by microcoded instructions that push the carry flag onto the data stack as a logical value (0 for carry clear, -1 for carry set).
The DLO register acts as a temporary holding register for intermediate results within a single instruction. Both the DHI and DLO registers are shift registers, connected to allow 32-bit shifting for multiplication and division.
The Program Counter is connected directly to the memory address bus. This allows fetching the next instruction in parallel with data operations in the rest of the system. Thus, the system can overlap data operations involving the ALU and the Data Stack with instruction fetching operations. In order to save the program counter for subroutine call operations, the Program Counter Save register captures the program counter value before it is loaded with the subroutine starting address. The Program Counter Save register is then pushed onto the Return Stack during the subroutine calling process. During subroutine returns, the saved Program Counter value is routed from the Return Stack through the ALU to add 1 before storing as the new Program Counter value. This saves a Program Counter increment that would otherwise cost a clock cycle.
Program memory is organized as 64K words of 16 bits. It is accessed on word boundaries only, but a microcoded byte-swapping operation is supported to allow for manipulation of single-byte quantities.
Microprogram Memory is a read/write memory containing 2K elements by 32 bits. The memory is addressed as 256 pages of 8 words each. The Microprogram Counter supplies an 8 bit page address, and microprograms execute within the 8 word page. This scheme allows supplying only 3 bits of the next microprogram instruction from the microinstruction, one bit of which is the result of a 1-in-8 conditional microbranch selection. This allows conditional branching and looping during the execution of a single opcode.
Instruction decoding is accomplished simply by loading an 8 bit opcode into the Microprogram Counter and using that as the page address to Microprogram Memory. Since the Microprogram Counter is built with counter hardware, operations can span more than one 8-microinstruction page if required.
The Microinstruction Register holds the output of the Microprogram Memory, forming a 1-stage pipeline. This pipeline allows the next microinstruction to be accessed from Microprogram Memory in parallel with execution of the current microinstruction. This completely removes the delay of Microprogram Memory access time from the system's critical path. It also enforces a lower limit of two clock cycles on instructions. If an instruction only requires a single clock cycle, a second no-op microinstruction must be added to allow the next instruction to flow through the pipeline properly.
The Host interface block allows the CPU/16 to operate in two possible modes: Master Mode and Slave Mode. In Slave Mode, the CPU/16 is controlled by the personal computer host to allow program loading, microprogram loading, and alteration of any register or memory location on the system for initialization or debugging. In Master Mode, the CPU/16 runs its program freely, while the host computer monitors a status register for a request for service. While the CPU/16 is in master mode, the host computer may enter a dedicated service loop. Alternately, the host computer may perform other tasks, such as prefetching the next block of a disk input stream or displaying an image, and only periodically poll the status register. The CPU/16 will wait for service from the host for as long as is necessary.
The CPU/16 has two instruction formats: one for invoking microcoded opcodes and one for subroutine calls.
Figure 4.2a -- CPU/16 instruction formats -- operation.
Figure 4.2b -- CPU/16 instruction formats -- subroutine call.
Figure 4.2a shows the operation instruction format used to invoke microcoded instructions. Since 256 possible opcodes are supported by the 256 pages of Microcode Memory, only 8 bits of each instruction are needed to specify the opcode. This results in an instruction format for a microcoded opcode which has the highest 8 bits set to ones. This allows the subroutine call format, shown in Figure 4.2b, to be any address that does not have all 8 highest bits set to 1. The strategy eliminates the constraint of 15-bit subroutine addresses found on the other designs in this chapter. A disadvantage of this strategy is that parameters for instructions cannot be contained in the instruction word. As a consequence, targets for conditional branches are stored in the memory word after the instruction, as opposed to as a small offset within the instruction. This design tradeoff was made in the interest of minimizing the amount of instruction decoding logic used.
Since the CPU/16 uses RAM chips for the microcode memory, the microcode may be completely changed by the user if desired. The standard software environment for the CPU/16 is MVP-FORTH, a FORTH-79 dialect (Haydon 1983). Some of the Forth instructions included in the standard microcoded instruction set are shown in Table 4.1. Of course, other software environments are possible, but none except Forth have been implemented.
! DDROP + DDUP +! DNEGATE - DROP 0 DSWAP 0< DUP 0= I 0BRANCH I' 1+ J 1- LEAVE 2* LIT 2/ NEGATE < NOP PICK NOT ROLL OR = OVER >R R> ?DUP R@ @ ROT ABS S->D AND SWAP BRANCH U* D! U/MOD D+ XOR D@
4.1(a) CPU/16 Instruction Set Summary -- Forth Primitives. (see Appendix B for descriptions)
@ + @ - DROP ; DROP DUP I + I + @ OVER + OVER - R> DROP R> SWAP >R SWAP - SWAP DROP DUP @ SWAP 1+ (fetch & increment address) DUP ROT ROT ! 1- (store & decrement address) @ @ (indirect fetch) @ ! (indirect store) DUP @ @ 1 ROT +! (auto-postincrement indirect fetch for software stack) -1 OVER +! @ ! (auto-predecrement indirect store for software stack)
Table 4.1(b). CPU/16 Instruction Set Summary -- Compound Forth Primitives.
OPCODE DATA STACK RETURN STACK DOCOL -> -> ADDR Performs a subroutine call SEMIS -> ADDR -> Performs a subroutine return HALT -> -> Returns control to host processor SYSCALL N -> -> Requests I/O service number N from host DOVAR -> ADDR -> Used to implement Forth variables DOCON -> N -> Used to implement Forth constants
Table 4.1(c) CPU/16 Instruction Set Summary -- Special Words.
The following Forth operations have microcoded support words for inner loops or the run-time action: SP@ (fetch contents of data stack pointer) SP! (initialize data stack pointer) RP@ (fetch contents of return stack pointer) RP! (initialize return stack pointer) MATCH (string compare primitive) ABORT" (error checking & reporting word) +LOOP (variable increment loop) /LOOP (variable unsigned increment loop) CMOVE (string move) <CMOVE (reverse order string move) DO (loop initialization) ENCLOSE (text parsing primitive) LOOP (increment by 1 loop) FILL (block memory initialization word) TOGGLE (bit mask/set primitive)
Table 4.1(d) CPU/16 Instruction Set Summary -- Support Words for High Level Operations.
OPCODE DATA STACK RETURN STACK <UDNORM> EXP1 UD2 -> EXP2 UD4 -> Floating point normalize of unsigned 32-bit mantissa ADC N1 N2 CIN -> N3 COUT -> Add with carry. CIN and COUT are logical flags on the stack. ASR N1 -> N2 -> Arithmetic shift right. BYTESWAP N1 -> N2 -> Swap high and low bytes within N1. D+! D ADDR -> -> Sum D into 32-bit number at ADDR. D>R D -> -> D Move D to return stack. DLSLN D1 N2 -> D3 -> Logical shift left of D1 by N2 bits. DLSR D1 -> D2 -> Logical shift right of D1 by 1 bit. DLSRN D1 N2 -> D3 -> Logical shift right of D1 by N2 bits. DR> -> D D -> Move D from return stack to data stack. DROT D1 D2 D3 -> D2 D3 D1 -> Perform double-precision ROT. LSLN N1 N2 -> N3 -> Logical shift left of N1 by N2 bits. LSR N1 -> N2 -> Logical shift right of N1 by 1 bit. LSRN N1 N2 -> N3 -> Logical shift right of N1 by N2 bits. Q+ Q1 Q2 -> Q3 -> 64-bit addition. QLSL Q1 -> Q2 -> Logical shift left of Q1 by 1 bit. RLC N1 CIN -> N2 COUT -> Rotate left through carry N1 by 1 bit. CIN is carry-in, COUT is carry-out. RRC N1 CIN -> N2 COUT -> Rotate right through carry N1 by 1 bit. CIN is carry-in, COUT is carry-out. TDUP D1 N2 -> D1 N2 D1 N2 -> Duplicate a temporary floating point number (32-bit mantissa, 16-bit integer). Note: The CPU/16 uses RAM microcode memory, so the user may add or modify any instructions desired. The above list merely indicates the instructions supplied with the standard development software package.
Table 4.1(e) CPU/16 Instruction Set Summary -- Extended Math & Floating Point Support Words.
One thing that is noticeable in this instruction set is the diversity of instructions supported. The instructions in Table 4.1a are a very large set of Forth primitive operations. Table 4.1b shows some common Forth word combinations that are available as single instructions. Table 4.1c shows some words that are used to support underlying Forth operations such as subroutine call and exit. Table 4.1d lists some high level Forth words that use specialized microcode to speed up their execution. Table 4.1e shows words which were added to support extended precision integer operations and 32-bit floating point calculations.
Execution time of instructions varies according to the considerable range in complexity of the instructions. Simple instructions that manipulate data on the stack such as + and SWAP take 2 or 3 microcycles each. Complex instructions take more clock cycles (e.g. Q+, which is 64-bit addition, takes 18 cycles) but are still much faster than comparable high level code. If desired, microcoded loops can be written that can potentially last thousands of clock cycles for block memory moves and other repetitive operations.
As mentioned earlier, each instruction invokes a sequence of microinstructions on a Microprogram Memory page corresponding to the 8-bit opcode for the instruction. Figure 4.3 shows the microcode format for a microinstruction. The microcode used is horizontal, which means that there is only one format for microcode which is broken into separate fields to control different portions of the machine.
Figure 4.3 -- CPU/16 microinstruction format.
Because of the simplicity of the stack machine approach and the CPU/16 hardware, only 32 bits are needed in each microinstruction. This 32-bit format can be contrasted with the microinstruction formats of 48 bits and wider found in other horizontally microcoded machines, such as any machine using the AMD 2900 series bit-slice components. This simplicity makes microprogramming not much harder than assembly language programming on a conventional machine.
As an example, the pseudocode description for the addition operation on the
Canonical Stack Machine was:
TOSREG <= TOSREG + POP(DS)
The same operation in CPU/16 microcode would be written as the
microinstruction:
SOURCE=DS ALU=A+B DEST=DHI INC[DP]
Where the microoperation SOURCE=DS routes the current top element of the
hardware Data Stack onto the Data Bus, ALU=A+B directs to ALU to add the A
input (from the Data Bus) and the B input (the top-of-stack element buffered in
DHI), and DEST=DHI deposits the result back into the Data Hi register.
Meanwhile, the INC[DP] microoperation increments the Data Stack Pointer after
the Data Stack has been read, thus popping the stack.
The CPU/16 is very similar to the Canonical Stack Machine. This probably has a lot to do with the fact that both were designed by this author, and the major goal for both the Canonical Stack Machine and the CPU/16 is simplicity.
The major efficiency improvement of the CPU/16 over the Canonical Stack Machine is the replacement of the Memory Address Register with the Program Counter. This has the advantage of allowing the next instruction to be fetched without tying up the data bus, so that stack computations can be overlapped with instruction fetches. A disadvantage of this technique is that the @ and ! operations require overwriting the Program Counter with the memory address, then restoring the Program Counter with the contents of the Program Counter Save register. Obviously the program counter and a memory address register (or the DHI register) could be multiplexed onto the RAM Address bus, but this would introduce added complexity and components.
The DLO register was added to the design primarily to provide for efficient 32-bit shifting to support multiplication and division. However, the presence of an intermediate storage register measurably improves performance, because four intermediate results are available at any one time (DHI, DLO, Data Stack, and a temporary result pushed onto the Return Stack). For example, the DLO register is used as the intermediate storage location for the SWAP operation, which is conceptually cleaner than using the Return Stack for this purpose.
An important implementation feature of the CPU/16 is that all resources on the machine can be directly controlled by the host computer. This can be done because the host interface supports Microinstruction Register load and single-step clock features. With these features, any microinstruction desired can be executed by first loading values into any or all registers in the system, loading a microinstruction, cycling the clock, then reading data values back to examine the results. This design technique makes writing microcode extremely straightforward and avoids the need for expensive microcode development support hardware. It also makes diagnostic programs very simple to write.
The CPU/16 is not designed to handle interrupts.
The CPU/16 is constructed using conservative (some might even say obsolete) 74LS00 series chips and relatively slow 150 ns static RAMs for stack and program memories. The design priorities for the CPU/16 are, in decreasing order of importance: simplicity, minimum design & development costs, compactness, flexibility, and speed. The CPU/16 clock cycle time is 280 ns, with an average of three clock cycles per instruction.
Discrete components were chosen because they are inexpensive and require little initial development investment when compared to a single-chip gate array. Discrete component designs are also much easier and cheaper to change for bug extermination and design upgrades. This is in keeping with the philosophy of a design exploration project. It also results in a much slower processor than could be produced using a single-chip design. Even so, at the time of its introduction, the CPU/16 was speed competitive with the slower versions of the Novix NC4016 (the leading stack machine of the time) when running many application programs.
In order to allow increased flexibility and to limit the required microinstruction width, the CPU/16 uses discrete ALU chips (74LS181) instead of bit-sliced components. The primary application area is general purpose stack processing as a coprocessor for IBM PC family of personal computers. While the redefinable instruction set makes the machine suitable for most languages, the primary application language is Forth.
An additional application area that is attractive is that of a teaching aid in computer architecture courses. Since the machine is constructed using less than 100 simple TTL chips including memory, a student can readily understand the design. An additional result of using discrete component technology is that all signals in the system are accessible with external probes, making the system suitable for experimentation by students learning how hardware, software, and microcode interact.
The information in this section is derived from the CPU/16 Technical Reference Manual (Koopman 1986). Additional information about the CPU/16 may be found in Haydon & Koopman (1986) and Koopman & Haydon (1986). Also, Koopman (1987b) describes a conceptual WISC architecture that is extremely similar to the CPU/16.
Phil Koopman -- koopman@cmu.edu