Stack Computers: the new wave © Copyright 1989, Philip Koopman, All Rights Reserved.
Chapter 4. Architecture of 16-bit Systems
The Novix NC4016, formerly called the NC4000, is a 16-bit stack based microprocessor designed to execute primitives of the Forth programming language. It was the first single-chip Forth computer to be built, and originated many of the features found on subsequent designs. Intended applications are real time control and high speed execution of the Forth language for general purpose programming.
The NC4016 uses dedicated off-chip stack memories for the Data Stack and the Return Stack. Since three separate groups of pins connect the two stacks and the RAM data bus to the NC4016, it can execute most instructions in a single clock cycle.
Figure 4.6 shows the block diagram of the NC4016.
Figure 4.6 -- NC4016 block diagram.
The ALU section contains a 2-element buffer for the top elements of the data stack: T (Top) for the top data stack element, and N (Next) for the second-from-top data stack element. It also contains a special MD register for support of multiplication and division as well as an SR register for fast integer square roots. The ALU may perform operations on the T register and any one of the N, MD, or SR registers.
The Data Stack is an off-chip memory holding 256 elements. The data stack pointer is on-chip and provides a stack address to the off-chip memory. A separate 16-bit stack data bus allows the Data Stack to be read or written in parallel with other operations. As noted previously, the top two Data Stack elements are buffered by the T and N registers in the ALU.
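The buffering arrangement described above can be pictured with a small software model. This is only a sketch: the class name, the direction in which the stack pointer moves, and the wrap-around on overflow are assumptions made for illustration, not details taken from the NC4016 documentation.

```python
class DataStackModel:
    """Toy model: T and N held on-chip, deeper elements in a 256-word memory."""

    DEPTH = 256  # off-chip stack memory size given in the text

    def __init__(self):
        self.t = 0                     # top-of-stack register (T)
        self.n = 0                     # next-on-stack register (N)
        self.mem = [0] * self.DEPTH    # off-chip stack memory
        self.sp = 0                    # on-chip pointer (assumed to wrap at 256)

    def push(self, value):
        # N spills to off-chip memory, T moves to N, the new value enters T.
        self.sp = (self.sp + 1) % self.DEPTH
        self.mem[self.sp] = self.n
        self.n = self.t
        self.t = value & 0xFFFF        # 16-bit data path

    def pop(self):
        # T is returned, N moves to T, N is refilled from off-chip memory.
        value = self.t
        self.t = self.n
        self.n = self.mem[self.sp]
        self.sp = (self.sp - 1) % self.DEPTH
        return value
```

In this model every push or pop touches the off-chip memory at most once, which is consistent with the stack bus operating in parallel with the rest of the instruction.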
The Return Stack is a separate memory that is very similar to the Data Stack, with the exception that only the top return stack element is buffered on-chip, in the Index register. Since Forth keeps loop counters as well as subroutine return addresses on the return stack, the Index register can be decremented to implement countdown loops efficiently.
The stacks do not have on-chip underflow or overflow protection. In a multitasking environment, an off-chip stack page register can be controlled using the I/O ports to give each task a separate piece of a larger than 256 word stack memory. This gives hardware protection to avoid one task overwriting another task's stack, and reduces context swapping overhead to a minimum.
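One way to picture the stack page register is as a set of high-order address bits placed above the 8-bit on-chip stack pointer. The sketch below assumes that arrangement; the function name and the idea of one page number per task are illustrative, not taken from a Novix design.

```python
STACK_WORDS_PER_PAGE = 256  # size addressed by the on-chip stack pointer


def stack_memory_address(page, stack_pointer):
    """Combine an external page number with the on-chip stack pointer.

    Assumption: the page register simply supplies the address bits above
    the 8 bits driven by the NC4016 stack pointer.
    """
    if not 0 <= stack_pointer < STACK_WORDS_PER_PAGE:
        raise ValueError("stack pointer must fit in 8 bits")
    if page < 0:
        raise ValueError("page number must be non-negative")
    return page * STACK_WORDS_PER_PAGE + stack_pointer
```

Because the stack pointer can never exceed 8 bits, a task running in page 3 can only reach addresses 0x300 through 0x3FF, so a context switch reduces to writing a new page number rather than copying stack contents.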
The Program Counter points to the location of the next instruction to be fetched from external program memory. It is automatically altered by the jump, loop, and subroutine call instructions. Program memory is arranged in 16 bit words. Byte addressing is not directly supported.
The NC4016 also has two I/O buses leading off-chip on dedicated pins. The B-port is a 16-bit I/O bus, and the X-port is a 5-bit I/O bus. The I/O ports allow direct access to I/O devices for control applications without stealing bandwidth from the memory bus. Some bits of the I/O ports can also be used to extend the program memory address space by providing high order memory address bits.
The NC4016 can use four separate 16-bit busses for data transfers on every clock cycle for high performance (program memory, Data Stack, Return Stack, and I/O busses).
The NC4016 pioneered the use of unencoded instruction formats for stack machines. In the NC4016 the ALU instruction is formatted in independent fields of bits that simultaneously control different parts of the machine, much like horizontal microcode. The NC4016, and many of its Forth processor successors, are the only 16-bit computers that use this technique. Using an unencoded instruction format allows simple hardware decoding of instructions. Figure 4.7 shows the instruction formats for the NC4016.
Figure 4.7(a) -- NC4016 instruction formats -- subroutine call.
Figure 4.7a shows the instruction format for subroutine calls. In this format, the highest bit of the instruction is set to 0, and the remainder of the instruction is used to hold a 15-bit subroutine address. This limits programs to 32K words of memory.
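The subroutine call format is simple enough to decode in a couple of lines. The following sketch follows the description above (highest bit 0, remaining 15 bits are the address); the function name is hypothetical.

```python
def decode_call(instruction):
    """Return the 15-bit call target, or None if this is not a call."""
    instruction &= 0xFFFF          # instructions are 16-bit words
    if instruction & 0x8000:       # highest bit set: some other format
        return None
    return instruction & 0x7FFF    # 15-bit subroutine address


# The largest reachable address is 0x7FFF, i.e. 32K words of program memory.
MAX_CALL_ADDRESS = 0x7FFF
```

Since the address occupies the whole instruction word apart from one bit, no separate operand fetch is needed, which fits the single-cycle subroutine call described elsewhere in this section.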
Figure 4.7(b) -- NC4016 instruction formats -- conditional branch.
Figure 4.7b shows the conditional branch instruction format. Bits 12 and 13 select one of three operations: a branch if T is zero, an unconditional branch, or a decrement and branch-if-zero using the Index register for implementing loops. Bits 0-11 specify the lowest 12 bits of the target address, restricting the branch target to be in the same 4K word block of memory as the branch instruction (program memory is addressed in words, not bytes).
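The effect of the 12-bit branch field can be sketched as follows. The code assumes that the upper address bits come from the address of the branch instruction itself, as the text states; the function name is hypothetical, and real hardware might instead use the already-incremented program counter.

```python
def branch_target(branch_address, instruction):
    """Form a branch target from the instruction's low 12 bits.

    The upper bits are taken from the address of the branch instruction
    (an assumption based on the description in the text).
    """
    block_base = branch_address & ~0x0FFF   # start of the 4096-address block
    offset = instruction & 0x0FFF           # bits 0-11 of the instruction
    return block_base | offset
```

For example, a branch located at address 0x2345 whose low 12 bits are 0xABC would transfer control to 0x2ABC under these assumptions; no target outside 0x2000-0x2FFF can be expressed.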
Figure 4.7(c) -- NC4016 instruction formats -- ALU operation.
Figure 4.7c shows the format of the ALU instruction. This instruction has several bit fields that control various resources on the chip. Bits 0 and 1 control the operation of the shifter at the ALU output. Bit 2 specifies a nonrestoring division cycle. Bit 3 enables shifting of the T and N registers connected as a 32-bit shift register.
Bit 5 of the ALU instruction indicates a subroutine return operation. This allows subroutine returns to be combined with preceding arithmetic operations to obtain "free" subroutine returns in many cases.
Bit 6 specifies whether a stack push is to be performed. Together with bit 4, it controls pushing and popping stack elements.
Bits 7 and 8 control the input select for the ALU and can also specify a step for iterative multiply or square root functions. Bits 9-11 specify the ALU function to be performed.
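The field layout of the ALU instruction described above can be summarized by a routine that splits an instruction word into its fields. This is a sketch only: the field names are invented for illustration, the meaning of each field value is not modeled, and bits 12-15 (which select the instruction format) are ignored because the text does not give their encoding.

```python
from collections import namedtuple

# Field names are hypothetical labels for the bit ranges given in the text.
AluFields = namedtuple(
    "AluFields",
    "shift divide_step shift32 bit4 ret push input_select alu_function",
)


def split_alu_instruction(instruction):
    """Extract the independent control fields of an ALU instruction."""

    def bits(low, width):
        return (instruction >> low) & ((1 << width) - 1)

    return AluFields(
        shift=bits(0, 2),         # bits 0-1: shifter at the ALU output
        divide_step=bits(2, 1),   # bit 2: nonrestoring division cycle
        shift32=bits(3, 1),       # bit 3: T and N as a 32-bit shift register
        bit4=bits(4, 1),          # bit 4: stack control, used with bit 6
        ret=bits(5, 1),           # bit 5: subroutine return
        push=bits(6, 1),          # bit 6: stack push
        input_select=bits(7, 2),  # bits 7-8: ALU input select / step
        alu_function=bits(9, 3),  # bits 9-11: ALU function
    )
```

Because each field drives its own part of the chip, decoding amounts to routing bits, with no lookup table or sequencing logic; this is the sense in which the format resembles horizontal microcode.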
Figure 4.7(d) -- NC4016 instruction formats -- memory reference.
Figure 4.7d shows the format of a memory reference instruction. These instructions take two clock cycles: one cycle for the instruction fetch, and one clock cycle for the actual reading or writing of the operand. The address for the memory access is always taken from the T register. Bit 12 indicates whether the operation is a memory read or write. Bits 0-4 specify a small constant that can be added to or subtracted from the T value to perform autoincrement or autodecrement addressing functions. Bits 5-11 of this instruction specify ALU and control functions almost identical to those used in the ALU instruction format.
Figure 4.7(e) -- NC4016 instruction formats -- user space/register transfer/literal.
Figure 4.7e shows the miscellaneous instruction format. This instruction can be used to read or write a 32-word "user space" residing in the first 32 words of program memory, saving the time taken to push a memory address on the stack before performing the fetch or store. It can also be used to transfer values between registers within the chip, or push either a 5-bit literal (in a single clock cycle) or a 16-bit literal (in two clock cycles) onto the stack. Bits 5-11 of this instruction specify ALU and control functions very similar to those in the ALU instruction format.
The NC4016 is specifically designed to execute the Forth language. Because of the unencoded format of many of the instructions, machine operations that correspond to a sequence of Forth operations can be encoded in a single instruction. Table 4.2 shows the Forth primitives and instruction sequences supported by the NC4016.
: (subroutine call)    AND
; (subroutine exit)    BRANCH
!                      DROP
+                      DUP
-                      I
0                      LIT
0<                     NOP
0BRANCH                OR
1+                     OVER
1-                     R>
2*                     R@
>R                     SWAP
@                      XOR
Table 4.2(a) NC4016 Instruction Set Summary -- Forth Primitives. (see Appendix B for descriptions)
nn                  @ +
nn !                @ +c
nn +                @ -
nn +c               @ -c
nn -                @ SWAP -
nn -c               @ SWAP -c
nn @                @ OR
nn @ +              @ XOR
nn @ +c             @ AND
nn @ -              DROP DUP
nn @ -c             DUP nn !
nn @ AND            DUP nn ! +
nn @ SWAP -         DUP nn ! -
nn @ SWAP -c        DUP nn ! AND
nn @ OR             DUP nn ! OR
nn @ XOR            DUP nn ! SWAP -
nn AND              DUP nn ! XOR
nn I@               DUP nn I!
nn I@ +             DUP nn I! +
nn I@ -             DUP nn I! -
nn I@ AND           DUP nn I! AND
nn I@ OR            DUP nn I! OR
nn I@ SWAP -        DUP nn I! SWAP -
nn I@ XOR           DUP nn I! XOR
nn I@!              DUP @ SWAP nn +
nn I!               DUP @ SWAP nn -
nn OR               OVER +
nn SWAP -           OVER +c
nn SWAP -c          OVER -
nn XOR              OVER -c
lit +               OVER SWAP -
lit +c              OVER SWAP -c
lit -               R> DROP
lit -c              R> SWAP >R
lit AND             SWAP -
lit OR              SWAP -c
lit SWAP -          SWAP DROP
lit SWAP -c         SWAP OVER !
lit XOR             SWAP OVER ! nn +
                    SWAP OVER ! nn -

Notes:
"nn" represents a 5 bit literal or user offset value.
"lit" represents a 16 bit literal stored in the memory location after the instruction.
Table 4.2(b) NC4016 Instruction Set Summary -- Compound Forth Primitives.
INSTRUCTION   DATA STACK     RETURN STACK
nn I@         -> N           ->
    Fetch the value from internal register nn (stored as a 5 bit literal in the instruction).
nn I!         N ->           ->
    Store N into the internal register nn (stored as a 5 bit literal in the instruction).
+c            N1 N2 -> N3    ->
    Add with carry (using internal carry bit).
-c            N1 N2 -> N3    ->
    Subtract with borrow (using internal carry bit).
*'            D1 -> D2       ->
    Unsigned Multiply step (takes two 16 bit numbers and produces a 32 bit product).
*-            D1 -> D2       ->
    Signed Multiply step (takes two 16 bit numbers and produces a 32 bit product).
*F            D1 -> D2       ->
    Fractional Multiply step (takes two 16 bit fractions and produces a 32 bit product).
*/'           D1 -> D2       ->
    Divide step (takes a 16 bit dividend and divisor and produces a 16 bit remainder and quotient).
*/''          D1 -> D2       ->
    Last Divide step (to perform non-restoring division fixup).
2/            N1 -> N2       ->
    Arithmetic shift right (same as division by two for non-negative integers).
D2/           D1 -> D2       ->
    32 bit arithmetic shift right (same as division by two for non-negative integers).
S'            D1 -> D2       ->
    Square Root step.
TIMES         ->             N1 -> N2
    Count-down loop using top of return stack as a counter.
Table 4.2(c) NC4016 Instruction Set Summary -- Special Purpose Words.
The internal structure of the NC4016 is designed for single clock cycle instruction execution. All primitive operations except memory fetch, memory store, and long literal fetch execute in a single clock cycle. This requires many more on-chip interconnection paths than are present on the Canonical Stack Machine, but provides much better performance.
The NC4016 allows combining nonconflicting sequential operations into the same instruction. For example, a value can be fetched from memory and added to the top stack element using the sequence @ + in a Forth program. These operations can be combined into a single instruction on the NC4016.
The NC4016 subroutine return bit allows combining a subroutine return with other instructions in a similar manner. This results in most subroutine exit instructions executing "for free" in combination with other instructions. An optimization that is performed by NC4016 compilers is tail-end recursion elimination. Tail-end recursion elimination involves replacing a subroutine call/subroutine exit instruction pair by an unconditional branch to the subroutine that would have been called.
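Tail-end recursion elimination as described above can be sketched as a peephole pass over a symbolic instruction list. The tuple representation and the opcode names CALL, EXIT, and BRANCH are hypothetical; an actual NC4016 compiler works on encoded 16-bit instructions.

```python
def eliminate_tail_calls(code):
    """Replace each CALL immediately followed by EXIT with a BRANCH.

    `code` is a list of (opcode, operand) tuples in an invented symbolic
    form. The pass assumes no other instruction jumps to the EXIT that
    it removes.
    """
    result = []
    i = 0
    while i < len(code):
        opcode, operand = code[i]
        next_is_exit = i + 1 < len(code) and code[i + 1][0] == "EXIT"
        if opcode == "CALL" and next_is_exit:
            result.append(("BRANCH", operand))  # callee returns for us
            i += 2                              # skip the CALL and the EXIT
        else:
            result.append((opcode, operand))
            i += 1
    return result
```

A practical version would also have to confirm that the callee is reachable by a branch, since the branch format carries only 12 address bits while the call format carries 15.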
Another innovation of the NC4016 is the mechanism to access the first 32 locations of program memory as global "user" variables. This mechanism can ease problems associated with implementing high level languages by allowing key information for a task, such as the pointer to an auxiliary stack in main memory, to be kept in a rapidly accessible variable. It also allows reasonable performance using high level language compilers, which may have originally been developed for register machines, by allowing the 32 fast-access variables to be used to simulate a register set.
The NC4016 is implemented using fewer than 4000 gates on a 3.0 micron HCMOS gate array technology, packaged in a 121 pin Pin Grid Array (PGA). The NC4016 runs at up to 8 MHz.
When the NC4016 was designed, gate array technology did not permit placing the stack memories on-chip. Therefore a minimum NC4016 system consists of three 16-bit memories: one for programs and data, one for the data stack, and one for the return stack.
Because the NC4016 executes most instructions, including conditional branches and subroutine calls, in a single cycle, there is a significant delay between the beginning of the clock cycle and the time that the memory address is valid for fetching the next instruction. This delay is approximately half the clock cycle, so program memory has only the remaining half cycle in which to respond: its access time must be approximately half the clock period.
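As a rough illustration of this timing constraint, the following sketch estimates the time available to program memory at a given clock rate. The function name is hypothetical, the one-half fraction is the approximate figure from the text, and setup times and buffer delays are ignored.

```python
def memory_access_budget_ns(clock_hz, address_delay_fraction=0.5):
    """Estimate the time left for a program memory access, in nanoseconds.

    address_delay_fraction is the share of the cycle that passes before
    the next instruction address is valid (about one half per the text).
    """
    period_ns = 1e9 / clock_hz
    return period_ns * (1.0 - address_delay_fraction)


# At the 8 MHz maximum clock rate the cycle is 125 ns long,
# leaving roughly 62.5 ns for the memory access.
budget_at_8_mhz = memory_access_budget_ns(8_000_000)
```

Under these assumptions an 8 MHz system needs program memory with an access time in the neighborhood of 60 ns, which is why fast static RAM is a natural fit for this design.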
The NC4016 was originally designed as a proof-of-concept and prototype machine. It therefore has some inconveniences that can be largely overcome by software and external hardware. For example, the NC4016 was intended to handle interrupts, but a bug in the gate array design causes improper interrupt response. Novix has since published an application note showing how to use a 20-pin PAL to overcome this problem. A successor product will eliminate these implementation difficulties and add additional capabilities.
The NC4016 is aimed at the embedded control market. It delivers very high performance with a reasonably small system. Among the appropriate applications for the NC4016 are: laser printer control, graphics CRT display control, telecommunications control (T1 switches, facsimile controllers, etc.), local area network controllers, and optical character recognition.
The information in this section is derived from Golden et al. (1985), Miller (1987), Stephens & Watson (1985), and Novix's Programmers' Introduction to the NC4016 Microprocessor (Novix 1985).