Stack Computers: the new wave © Copyright 1989, Philip Koopman, All Rights Reserved.
Chapter 4. Architecture of 16-bit Systems
The MISC M17 microprocessor was designed by Minimum Instruction Set Computer, Inc., as a low cost, embedded microprocessor. In order to achieve low system cost, the M17 keeps its two stacks in program memory with a few top-of-stack buffer registers on the chip. Other design tradeoffs have been made to keep both chip production costs and total system costs low, while maintaining reasonably high system performance.
The MISC M17 is aimed at high volume embedded control applications that require a low cost processor chip with reasonably high performance. Its performance is reasonably high when compared to other stack machines, and very high when compared to standard microcontrollers.
Figure 4.4 shows the block diagram of the M17.
Figure 4.4 -- M17 block diagram.
Both the Data Stack and Return Stack reside in program memory, with the top elements of each held in on-chip registers for speed. The X, Y, and Z registers hold the top three elements of the data stack, with X being the top element. These registers are connected with multiplexers so that values can be transferred between registers in a single clock cycle. Simultaneously, the Z register can be read from or written to the portion of the stack resident in program memory. Thus, a Data Stack popping operation (Forth DROP operation) is accomplished by simultaneously reading Z from memory, copying Z to Y, and copying Y to X. Similarly, a Data Stack pushing operation (such as the Forth DUP operation) is accomplished by copying X to Y while retaining the old value of X, copying Y to Z, and writing Z to program memory.
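The register shuffling described above can be sketched in Python. This is only a toy model under stated assumptions, not the M17's actual logic: the class name `DataStack`, the list standing in for program memory, and the initial stack pointer value are all invented for illustration, and stack underflow is not handled.

```python
class DataStack:
    """Toy model of the M17 data stack: X, Y, Z on chip, the rest in memory."""

    def __init__(self, memory_words=16):
        self.mem = [0] * memory_words   # stand-in for program memory (assumption)
        self.dsp = memory_words         # assumed start; stack grows toward low addresses
        self.x = self.y = self.z = 0    # X is the top of stack

    def push(self, value):
        # Old Z is written to memory while Y -> Z and X -> Y; X takes the new value.
        self.dsp -= 1
        self.mem[self.dsp] = self.z
        self.z, self.y, self.x = self.y, self.x, value

    def dup(self):
        # DUP retains the old X and copies it into Y.
        self.push(self.x)

    def drop(self):
        # DROP: Y -> X and Z -> Y, while Z is refilled from memory.
        self.x, self.y = self.y, self.z
        self.z = self.mem[self.dsp]
        self.dsp += 1


s = DataStack()
for v in (1, 2, 3, 4):
    s.push(v)
s.dup()   # X and Y now both hold 4, Z holds 3, and 2 has been spilled to memory
```

In the real chip all of these transfers happen in the same clock cycle; the sequential assignments here only model the end result.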
The LASTX register can be updated with the contents of the X register on each instruction cycle. It therefore contains the top-of-stack value that was overwritten by the previous instruction, which is useful for many instruction sequences.
The ALU on the M17 is designed to generate all possible ALU functions simultaneously, only at the last moment selecting the correct function output for writing back to the X and/or Y registers. This technique allows the ALU delay to overlap the instruction decoding time, since once the instruction is decoded its only task is to select the correct ALU output from the functions already computed.
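The "compute everything, select late" idea can be illustrated with a short Python sketch. The set of functions, their names, and the operand order for subtraction are assumptions chosen for the example; the source does not list the M17's actual ALU functions here.

```python
MASK = 0xFFFF  # 16-bit data path


def alu_all_functions(x, y):
    """Compute every candidate result up front (hypothetical function set)."""
    return {
        "add": (x + y) & MASK,
        "sub": (y - x) & MASK,   # operand order is an assumption
        "and": x & y,
        "or": x | y,
        "xor": x ^ y,
    }


def execute(op_name, x, y):
    # In hardware the candidates are produced in parallel with instruction decode.
    results = alu_all_functions(x, y)
    # Once decoding finishes, the only remaining work is this selection.
    return results[op_name]
```

In software this wastes work, but in hardware the unused results cost only gates, and the selection step is much shorter than a full ALU delay.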
The M17 has an 8 bit I/O bus that allows concurrent operations in the ALU while performing data transfers. This feature, found on all the 16 bit single-chip Forth machines discussed here, allows high speed I/O without tying up the memory data bus.
The Return Stack is kept in program memory, just as the Data Stack is. The top element of the Return Stack is buffered in the INDEX register. The INDEX register doubles as a count-down counter for use in program loops and the instruction repeat feature.
The Instruction Pointer is a conventional program counter that can be loaded from the Instruction Register for subroutine calls, from the memory data bus for branches, or from the INDEX register for subroutine returns. The INDEX register can also be loaded from the Instruction Pointer to save the return address for subroutine calls.
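The interplay between the Instruction Pointer, the INDEX register, and the in-memory part of the Return Stack might be modeled as follows. This is a sketch under assumptions: the dictionary-based state, the function names, and the convention that the Instruction Pointer already holds the return address at the time of the call are all illustrative choices, not taken from the M17 manual.

```python
def call(state, mem, target):
    """Subroutine call: spill INDEX to memory, save the return address in INDEX."""
    state["rsp"] += 1                  # return stack grows toward high addresses
    mem[state["rsp"]] = state["index"]
    state["index"] = state["ip"]       # INDEX loaded from the Instruction Pointer
    state["ip"] = target               # target comes from the instruction itself


def ret(state, mem):
    """Subroutine return: reload the Instruction Pointer from INDEX."""
    state["ip"] = state["index"]
    state["index"] = mem[state["rsp"]]  # refill INDEX from the in-memory stack
    state["rsp"] -= 1
```

A call followed by a return leaves the Instruction Pointer, INDEX, and the Return Stack Pointer exactly as they were before the call.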
The Return Stack Pointer is an up/down counter that contains the memory address of the top element of the return stack resident in program memory (which is actually the second-from-top element visible to the programmer, since the INDEX register contains the top element). Similarly, the Data Stack Pointer points to the top data stack element resident in program memory, which is actually the fourth element on the stack since the top three elements are buffered in X, Y, and Z. The data stack grows from high memory locations to low memory locations. The return stack grows up from low memory locations to high memory. With this arrangement, the free space between the top of the data stack and top of return stack can be shared for more efficient use of memory space.
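As a small illustration of the shared free space, the number of unused words between the two stacks can be computed from the two pointers. The function name and the collision check are assumptions for this sketch; it also assumes both stacks sit in one contiguous region, as in the simple systems described above.

```python
def free_words(dsp, rsp):
    """Words still unused between the two stacks.

    dsp: address of the top data stack word in memory (stack grows downward)
    rsp: address of the top return stack word in memory (stack grows upward)
    """
    if dsp <= rsp:
        raise ValueError("stacks have collided")
    return dsp - rsp - 1
```

Either stack may consume the remaining words, so neither needs a fixed reservation of its own.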
The M17 directly addresses five segments of up to 64K words of 16 bit wide memory. Byte swapping, byte packing, and byte unpacking instructions are available to allow access to 8 bit quantities. The M17 provides five signal pins to indicate which memory space is active: data stack, return stack, code space, A buffer, and B buffer. The activated pin indicates which address space is being used by the address bus. In simple systems, these pins can be ignored. For somewhat larger systems, each pin can control its own memory chips, providing five independent banks of 64K words of memory. Using a companion memory controller chip, up to 16 MWords of memory can be addressed.
The M17 takes two clock cycles for each instruction: one clock cycle to load the instruction from program memory, and another clock cycle to perform the operation while doing a read from or write to one of the stacks in program memory. By performing two-cycle instruction execution, the memory bus is kept continuously busy, and simple systems can operate with only two 8-bit memory packages.
The M17 also has six instruction cache registers. These registers form a short history buffer that retains sequences of consecutive instructions as they are executed. If a repeat sequence is triggered with a special instruction, from one to six of these retained instructions are formed into a loop and repeated until an exit condition is true. The loop executes at one clock cycle per instruction instead of two on the second and subsequent iterations, since instructions do not need to be fetched from memory. In order to simplify the interrupt and control logic, these loops are required to be properly aligned within an address range evenly divisible by 8. The sequence is interruptible, but the interrupt service routine is responsible for saving a special flag if it intends to use a repeat sequence itself.
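The timing benefit of the repeat feature can be estimated with a few lines of Python. This is a rough sketch: the function name is hypothetical, and the model ignores the cost of the instruction that triggers the repeat and of any exit condition handling.

```python
def repeat_loop_cycles(n_instructions, iterations, cache_size=6):
    """Clock cycles for a cached repeat sequence (simplified model)."""
    if not 1 <= n_instructions <= cache_size:
        raise ValueError("sequence does not fit in the instruction cache")
    if iterations < 1:
        raise ValueError("need at least one iteration")
    first_pass = 2 * n_instructions                     # fetched from memory
    later_passes = n_instructions * (iterations - 1)    # replayed from the cache
    return first_pass + later_passes


cached = repeat_loop_cycles(4, 10)   # 8 + 36 = 44 cycles
uncached = 2 * 4 * 10                # 80 cycles if every instruction were fetched
```

For long-running loops the cost approaches one clock cycle per instruction, which is half that of ordinary execution.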
A final feature of the M17 is that it can support variable length clock cycles by using an asynchronous memory interface. In the asynchronous mode of operation, the M17 provides a memory request signal for each memory cycle. The responding memory device is responsible for asserting a device ready signal when its data is valid. This handshaking process actually eliminates the need for an oscillator, and results in asynchronous operation of the system. One advantage of this scheme is that different speed memory devices may be used with different device ready delays to avoid wasting memory bandwidth. Another advantage is that a very short delay can be provided for clock cycles that do not address memory, allowing internal operation cycles to proceed faster than memory reference cycles. In extremely cost sensitive applications, an ordinary clock oscillator can be used to run the entire system.
Figure 4.5 shows the instruction formats for the M17. Instructions are accomplished in two clock cycles: one for the instruction fetch, and one for the operation and stack memory access. All of the Canonical Stack Machine's primitive operations listed in Table 3.1 can be executed in a single instruction cycle (two clock cycles). The details of operation of some instructions are slightly different on the M17 to accomplish single instruction cycle execution. For example, a memory store operation does not pop the data and address from the stack because this would require two additional memory transactions.
Figure 4.5(a) -- M17 instruction formats -- subroutine call.
Figure 4.5a shows the subroutine call instruction. A subroutine call is made by using the address of the subroutine (which must be an even address) as the instruction. The zero in bit 0 of the instruction designates a subroutine call. This forces subroutines to start on even memory locations, but allows code to span the entire 64K words of address space.
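Because the call instruction is simply the target address, encoding and recognizing it takes very little logic. The following Python sketch shows the idea; the function names are hypothetical, and only the bit 0 rule comes from the text.

```python
def is_subroutine_call(instruction):
    """A zero in bit 0 marks the instruction word as a subroutine call."""
    return (instruction & 1) == 0


def encode_call(address):
    """The call instruction is the (even) 16-bit subroutine address itself."""
    if not 0 <= address <= 0xFFFF:
        raise ValueError("address outside the 64K word space")
    if address & 1:
        raise ValueError("subroutines must start on even addresses")
    return address
```

All other instruction types therefore have a one in bit 0, which is why odd addresses cannot be call targets.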
Figure 4.5(b) -- M17 instruction formats -- conditional instruction template.
The M17 has three conditional instructions: SET, RETURN, and JUMP. Figure 4.5b shows the format of a generic conditional instruction. Bits 6-15 indicate which conditions are selected as inputs into a logical OR condition evaluation function. For example, if bits 15 and 13 are set, a "less than or equal to zero" condition is selected. When bit 5 is set, it causes a logical inversion of the condition value. For example, if bits 15, 13, and 5 are set, a "greater than zero" condition is selected. Bit 4 controls the INDEX register and its function. For RETURN, it allows programmer control of the return stack drop. For SET and JUMP, it selects a step that tests INDEX for zero and decrements it. In this way many useful conditions based on the data in X, Y, Z, or INDEX can be created in one instruction step.
It is important to note that conditional instructions in the M17 do not change the data on the stack. They simply extract a condition code value from the data in the system and perform a conditional operation. For example, selecting the carry out condition (bit 9) will give a carry bit as if X and Y were added, but does not actually modify the contents of either X or Y. The results of the conditional evaluation are not retained unless the SET instruction is used.
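The OR-then-invert evaluation can be sketched as follows. The text only states that bits 15 and 13 together select "less than or equal to zero"; which bit means "negative" and which means "zero", and the fact that both test X, are assumptions made purely for this illustration. The other condition bits are omitted.

```python
INVERT_BIT = 5

# Assumed meanings: the source does not say which of the two bits is which.
CONDITION_TESTS = {
    15: lambda x: bool(x & 0x8000),   # assumed: X is negative (sign bit set)
    13: lambda x: x == 0,             # assumed: X is zero
}


def evaluate_condition(instruction, x):
    """OR together the selected conditions, then optionally invert the result."""
    selected = [
        test(x)
        for bit, test in CONDITION_TESTS.items()
        if instruction & (1 << bit)
    ]
    result = any(selected)
    if instruction & (1 << INVERT_BIT):
        result = not result
    return result   # X itself is never modified


LESS_OR_EQUAL_ZERO = (1 << 15) | (1 << 13)
GREATER_THAN_ZERO = LESS_OR_EQUAL_ZERO | (1 << INVERT_BIT)
```

Note that the function only reads its operand, mirroring the point that conditional instructions leave the stack contents unchanged.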
Figure 4.5(c) -- M17 instruction formats -- set user flag.
Figure 4.5c shows the format of the SET conditional instruction. This instruction sets the User Flag, which may be thought of as a conventional condition code register, with the value of the condition code selected by bits 4-15. The User Flag can be tested by other instructions in the program for later branching. Bit 3 specifies whether the top stack element is to be popped (equivalent to a Forth DROP operation) after the evaluation is performed.
Figure 4.5(d) -- M17 instruction formats -- conditional return.
Figure 4.5d shows the format of the conditional subroutine RETURN instruction. When bit 4 is 0, the instruction performs as a conditional subroutine return, performing the return and popping the return address from the Return Stack (resident in the INDEX register) only if the condition evaluates as true. When bit 4 is set to 1, the branch to the address at the top of the Return Stack is still made, but the return stack is only popped if the condition is false. This is a convenient way of implementing a BEGIN...UNTIL_FALSE conditional control structure that stores the start address of the loop in INDEX and uses data stack conditions for determining when to terminate.
Figure 4.5(e) -- M17 instruction formats -- conditional jump.
A conditional JUMP instruction is shown in Figure 4.5e. This instruction evaluates the specified condition and jumps if it is true. The destination address is stored at the memory location after the JUMP instruction. If the jump condition is false, the M17 skips the jump destination value and executes the instruction in the next memory location (the second word after the JUMP instruction). The JUMP instruction can be used to implement a countdown loop using the INDEX register by setting bit 4 to 1.
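The two possible successors of a JUMP instruction can be expressed in a few lines. This sketch assumes a word-addressed memory modeled as a Python list; the function name is hypothetical.

```python
def next_ip_after_jump(memory, ip, condition_true):
    """Return the next instruction address; ip is the address of the JUMP itself."""
    if condition_true:
        return memory[ip + 1]   # destination is stored in the word after JUMP
    return ip + 2               # skip over the destination word


memory = [0] * 8
memory[3] = 0x0040              # destination word following a JUMP at address 2
taken = next_ip_after_jump(memory, 2, True)       # 0x0040
not_taken = next_ip_after_jump(memory, 2, False)  # 4
```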
Figure 4.5(f) -- M17 instruction formats -- process.
Figure 4.5f shows the PROCESS instruction format. This instruction has several independent control fields, reminiscent of the horizontal microcode format seen in the CPU/16. Bits 3-5 specify control for the Z register, bits 6-7 for the Y register, bit 13 for the LASTX register and bit 14 for the X register. Additionally, bits 8-12 select the ALU/shifter function to be performed, with the results loaded into the X or Y register. Finally, bit 15 can cause the Data Stack Pointer to be updated by the instruction.
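The field layout just described maps directly onto shifts and masks. The following Python sketch extracts the fields named in the text; the dictionary keys are invented names, and bits 0-2 are left undecoded because their meaning is not given here.

```python
def decode_process(instruction):
    """Split a 16-bit PROCESS instruction word into its control fields."""
    return {
        "z_control": (instruction >> 3) & 0b111,        # bits 3-5
        "y_control": (instruction >> 6) & 0b11,         # bits 6-7
        "alu_function": (instruction >> 8) & 0b11111,   # bits 8-12
        "lastx_control": (instruction >> 13) & 1,       # bit 13
        "x_control": (instruction >> 14) & 1,           # bit 14
        "dsp_update": (instruction >> 15) & 1,          # bit 15
    }


word = (1 << 15) | (0b10101 << 8) | (0b10 << 6) | (0b011 << 3)
fields = decode_process(word)
```

Because each field drives its own register or functional unit, several of these actions can take place in the same instruction, which is what makes the format resemble horizontal microcode.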
Figure 4.5(g) -- M17 instruction formats -- access.
Figure 4.5g shows the ACCESS instruction format. This instruction has a very similar format to the PROCESS instruction. The major difference is that bits 8-11 specify a source or a source/destination pair for routing data around the processor. Bits 12 and 14 control the updating of the source and destination registers, allowing exchanges between internal registers.
The M17 handles interrupts as a hardware forced subroutine call to memory address 0. Another address can be supplied by the interrupting device. It also has a context register which allows saving the state of the processor when receiving an interrupt.
The biggest difference between the M17 and the Canonical Stack Machine described in Chapter 3 is that the M17's stack memory and program memory accesses use the same bus, and may reside in the same memory chips. In order to maintain a reasonably high level of performance, the M17 buffers the top three Data Stack elements and the top Return Stack element in internal registers.
In contrast to the single internal bus used by the Canonical Stack Machine, the M17 provides a rich interconnect structure between registers. These interconnects not only allow moving data along the LASTX/X/Y/Z register chain to perform pushes and pops, but also allow routing to perform fairly complex stack manipulations within a single decode/execute clock cycle pair.
Since stacks are kept in program memory, a multiplexer is used to select the address to be fed to program memory. An advantage of placing stacks in program memory is that the amount of information that must be saved from the chip on a context swap is quite low. Instead of copying the elements of an on-chip stack into a holding area of main memory, the top-of-stack registers can be flushed to memory and the stack pointers redirected to point to a different memory block to activate a new task.
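A context swap of this kind might look like the following Python sketch. It is an illustration under assumptions, not the M17's actual mechanism: the `Task` record, the register dictionary, and the function names are hypothetical, and LASTX and any flags are ignored.

```python
from dataclasses import dataclass


@dataclass
class Task:
    dsp: int   # saved Data Stack Pointer
    rsp: int   # saved Return Stack Pointer


def flush_registers(mem, regs):
    """Spill X, Y, Z and INDEX to their in-memory stacks and return the pointers."""
    dsp, rsp = regs["dsp"], regs["rsp"]
    for name in ("z", "y", "x"):   # deepest element first, so X ends up on top
        dsp -= 1                   # data stack grows toward low addresses
        mem[dsp] = regs[name]
    rsp += 1                       # return stack grows toward high addresses
    mem[rsp] = regs["index"]
    return Task(dsp, rsp)


def restore_registers(mem, task):
    """Reload the top-of-stack registers from a previously flushed task."""
    dsp, rsp = task.dsp, task.rsp
    regs = {}
    for name in ("x", "y", "z"):
        regs[name] = mem[dsp]
        dsp += 1
    regs["index"] = mem[rsp]
    rsp -= 1
    regs["dsp"], regs["rsp"] = dsp, rsp
    return regs
```

Only four data words and two pointers move per task, regardless of how deep either stack is; switching tasks amounts to flushing one task and restoring another whose stacks live in a different memory block.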
The M17 is implemented using 6600 gates on 2.0 micron HCMOS gate array technology, packaged in a 68 pin Plastic Leadless Chip Carrier (PLCC). This technology choice is meant to keep development and production costs low while providing reasonably high performance. The main off-chip component required for operation is a 16-bit wide bank of memory to hold the program and stacks.
The maximum clock speed on the M17 is approximately 15 MHz using 30 ns static RAMs, and 6 MHz using 120 ns static RAMs. Each instruction takes two clock cycles. Sequences stored in the six-element instruction cache execute at the rate of one clock cycle per instruction.
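The instruction rates implied by these figures follow from simple division. The sketch below takes the approximate clock speeds quoted above at face value; the function name is hypothetical, and the results are peak rates that ignore memory wait states and other system effects.

```python
def instructions_per_second(clock_hz, cycles_per_instruction=2):
    """Peak instruction rate for a given clock and cycles per instruction."""
    return clock_hz / cycles_per_instruction


fast_ram = instructions_per_second(15_000_000)        # 7.5 million per second
slow_ram = instructions_per_second(6_000_000)         # 3 million per second
cached_loop = instructions_per_second(15_000_000, 1)  # 15 million per second
```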
Several features of the MISC M17 are directed to the designer of low cost, high performance products. Example applications include remote sensing for smoke stacks, mines, hazardous areas, and remote equipment installations. The decision to place stacks in program memory results in lower system cost and complexity. The asynchronous memory bus protocol allows coupling high speed processing and data transmission operations without complicating the interface to low speed data acquisition devices.
The information in this section is derived from the MISC M17 Technical Reference Manual (MISC 1988).