Stack Computers: the new wave © Copyright 1989, Philip Koopman, All Rights Reserved.
Chapter 6. Understanding Stack Machines
Now that we have a conceptual understanding of how stack machines differ from other computers, let us look at some quantitative results that show how stack machines perform. Measurements of instruction frequencies and code sizes for stack-based and register-based machines abound (references include: Blake (1977), Cooke & Donde (1982), Cooke & Lee (1980), Cragon (1979), Haikala (1982), McDaniel (1982), Sweet & Sandman (1982), and Tanenbaum (1978)). Unfortunately, most of these measurements are for programs written in conventional languages, not in an inherently stack-based language such as Forth. Hayes et al. (1987) have previously published execution statistics for Forth programs, but we shall expand upon their findings.
The results in this chapter are all based on programs written in Forth, since these programs take the most advantage of the capabilities of a stack machine. The cautions about benchmarks are still applicable, so these results should be used as only rough approximations to "truth" (whatever that is).
Six different benchmark programs are referred to in the following sections. Except as noted, all programs are written for a 16-bit Forth system. They are as follows:
Frac: a fractal landscape generation program that uses a random number generator (always seeded with the same initial value, for consistency) to generate a graphics image. (Koopman 1987e, Koopman 1987f)
Life: a simple implementation of Conway's game of Life on an 80 column by 25 row character display. Each program run computes ten generations of a screen full of gliders.
Math: a 32-bit floating point package written in high level Forth code with no machine-specific primitives for normalization, etc. Each program run generates a table of sine, cosine, and tangent values for integer degrees between 1 and 10. (Koopman 1985)
Compile: a script used to compile several Forth programs, measuring the execution of the Forth compiler itself.
Fib: computation of the 24th Fibonacci number using a recursive procedure (commonly called "dumb" Fibonacci); see the sketch following this list.
Hanoi: the Towers of Hanoi problem, written as a recursive procedure. Each program run computes the result for 12 disks.
Queens: the N Queens problem (derived from the 8 queens on a chess board puzzle) written as a recursive procedure. The program finds the first acceptable placement for N queens on an N x N board. Each program run computes the result for N = 12 queens.
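Since Fib is the simplest of these benchmarks, a sketch of it conveys the flavor of the test programs. The definition below is an illustrative rendering of the "dumb" recursive algorithm (assuming FIB(1) = FIB(2) = 1 and the standard word RECURSE), not the exact benchmark source:

    \ "Dumb" recursive Fibonacci: makes an exponential number of
    \ subroutine calls, which is exactly what the benchmark stresses.
    : FIB  ( n -- fib[n] )
       DUP 2 > IF
          DUP 1- RECURSE          \ compute fib(n-1)
          SWAP 2 - RECURSE  +     \ compute fib(n-2), then add
       ELSE
          DROP 1                  \ fib(1) = fib(2) = 1
       THEN ;

    24 FIB U.   \ benchmark case; prints 46368 (too big for a signed 16-bit cell)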
The three programs which represent the best mix of different application areas are Math, which uses intensive stack manipulation to manage 32-bit quantities (and a 48-bit temporary floating point format) on a 16-bit stack; Life, which does intensive management of an array of memory cells with much conditional branching; and Frac, which does graphics line drawing and rudimentary graphics projections.
The compilation benchmark is also useful in that it reflects the activities of a compiler, which must tokenize the input stream and perform identifier searches.
NAMES            FRAC      LIFE      MATH   COMPILE       AVE
CALL           11.16%    12.73%    12.59%    12.36%    12.21%
EXIT           11.07%    12.72%    12.55%    10.60%    11.74%
VARIABLE        7.63%    10.30%     2.26%     1.65%     5.46%
@               7.49%     2.05%     0.96%    11.09%     5.40%
0BRANCH         3.39%     6.38%     3.23%     6.11%     4.78%
LIT             3.94%     5.22%     4.92%     4.09%     4.54%
+               3.41%    10.45%     0.60%     2.26%     4.18%
SWAP            4.43%     2.99%     7.00%     1.17%     3.90%
R>              2.05%     0.00%    11.28%     2.23%     3.89%
>R              2.05%     0.00%    11.28%     2.16%     3.87%
CONSTANT        3.92%     3.50%     2.78%     4.50%     3.68%
DUP             4.08%     0.45%     1.88%     5.78%     3.05%
ROT             4.05%     0.00%     4.61%     0.48%     2.29%
USER            0.07%     0.00%     0.06%     8.59%     2.18%
C@              0.00%     7.52%     0.01%     0.36%     1.97%
I               0.58%     6.66%     0.01%     0.23%     1.87%
=               0.33%     4.48%     0.01%     1.87%     1.67%
AND             0.17%     3.12%     3.14%     0.04%     1.61%
BRANCH          1.61%     1.57%     0.72%     2.26%     1.54%
EXECUTE         0.14%     0.00%     0.02%     2.45%     0.65%

Instructions: 2,051,600 1,296,143 6,133,519   447,050
Table 6.1. Dynamic Instruction Execution Frequencies for important Forth primitives.
Table 6.1 shows dynamic instruction execution frequencies for the most frequently executed primitives for Frac, Life, Math, and Compile. The dynamic frequency of an instruction is the number of times it is executed during a program run. Appendix C contains the unabridged version of the instruction frequencies given in Table 6.1. The AVE column shows the equally weighted average of the four benchmarks, which is a rough approximation of execution frequency for most Forth programs. The Forth words selected for inclusion in this table were either in the top ten of the AVE column or among the top ten words for a particular program. For example, EXECUTE has an AVE value of only 0.65%, but has a Compile value of 2.45%, which was the tenth largest in the Compile measurements.
The first thing that is obvious about these numbers is that subroutine calls and exits dominate all other operations. This well-known fact is why the Forth-derived stack processors place such heavy emphasis on performing subroutine calls and subroutine exits efficiently, in combination with other instructions. The subroutine exit numbers are lower than the subroutine call numbers because some Forth operations pop the return stack to climb up through two levels of subroutine calls at once, performing a conditional premature exit of the calling routine.
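The return stack popping mentioned above is easy to demonstrate. In the hypothetical sketch below (an illustration for a traditional threaded Forth, not benchmark code), dropping the top return address inside ?BAIL causes its exit to return directly to the caller's caller:

    \ Conditional premature exit of the calling routine.
    : ?BAIL  ( f -- )
       IF  R> DROP  THEN ;   \ discard the caller's return address

    : PROCESS  ( n -- )
       DUP 0= ?BAIL          \ if n is zero, PROCESS itself exits here
       . ;                   \ otherwise carry on normally

(For simplicity this sketch leaves the zero on the data stack when it bails out; a production word would clean up.)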
The amount of time spent on stack manipulation primitives is also interesting. Approximately 25% of all the instructions in the sample were spent manipulating the stacks. At first this seems rather high. However, since stack processors all have some capability for combining stack manipulations with other useful work (such as the combination OVER - ), this raw count is much higher than the cost actually seen in practice. Also, this 25% was skewed by up to 5% by the very high usage of >R and R> in the floating point math package to manipulate 32-bit quantities. This cost would not be present on a 32-bit processor, or on a 16-bit processor that used a fast-access user memory space (such as that of the NC4016 and RTX 2000) to store intermediate results.
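The kind of return stack traffic that skews this number can be seen in an illustrative sketch (not code from the Math package). On a 16-bit Forth, a 32-bit value occupies two 16-bit stack cells, and the high-order cell is frequently parked on the return stack while the low-order cell is processed:

    \ Increment a 32-bit value held as two 16-bit cells ( lo hi ).
    : D1+  ( lo hi -- lo' hi' )
       >R                \ park the high cell on the return stack
       1+ DUP 0=         \ bump the low cell; did it wrap around to zero?
       IF  R> 1+  ELSE  R>  THEN ;   \ propagate the carry if so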
Also of interest is that the process of getting data onto the stack to be manipulated is very important (this process involves VARIABLE , @ , LIT , CONSTANT , and USER ). Fortunately, stack machines are able to combine these instructions with other operations as well.
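A hypothetical fragment shows how naturally these instructions arise in ordinary code; each commented line below produces one of the patterns that stack machines can fold into a single combined instruction:

    VARIABLE SCORE                 \ illustrative variable, not from the benchmarks
    : BUMP-SCORE  ( n -- )
       SCORE @                     \ variable address followed by fetch
       +                           \ ... then arithmetic on the fetched value
       SCORE ! ;                   \ variable address followed by store

    : PENALIZE  ( -- )  -10 BUMP-SCORE ;   \ literal followed by a call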
As a final observation, many of the instructions shown in Appendix C have dynamic execution frequencies of less than 1%. However, these instructions should not immediately be dismissed as unimportant, because many of them can have long execution times if not supported by the hardware. It is not enough to just look at the execution frequency to determine the importance of an instruction.
NAMES            FRAC      LIFE      MATH   COMPILE       AVE
CALL           16.82%    31.44%    37.61%    17.62%    25.87%
LIT            11.35%     7.22%    11.02%     8.03%     9.41%
EXIT            5.75%     7.22%     9.90%     7.00%     7.47%
@              10.81%     1.27%     1.40%     8.88%     5.59%
DUP             4.38%     1.70%     2.84%     4.18%     3.28%
0BRANCH         3.01%     2.55%     3.67%     3.16%     3.10%
PICK            6.29%     0.00%     1.04%     4.53%     2.97%
+               3.28%     2.97%     0.76%     4.61%     2.90%
SWAP            1.78%     5.10%     1.19%     3.16%     2.81%
OVER            2.05%     5.10%     0.76%     2.05%     2.49%
!               3.28%     2.12%     0.90%     2.99%     2.32%
I               1.37%     5.10%     0.11%     1.62%     2.05%
DROP            2.60%     0.85%     1.69%     2.31%     1.86%
BRANCH          1.92%     0.85%     2.09%     2.05%     1.73%
R>              0.55%     0.00%     4.68%     0.77%     1.50%
>R              0.55%     0.00%     4.11%     0.77%     1.36%
C@              0.00%     3.40%     0.61%     0.34%     1.09%
=               0.14%     2.76%     0.29%     0.26%     0.86%

Instructions:     731       471     2,777     1,171
Table 6.2. Static Instruction Execution Frequencies for important Forth primitives.
Table 6.2 shows static instruction compilation frequencies for the most often compiled primitives for Frac, Life, and Math, and for the programs being compiled in the Compile benchmark (which includes Frac, Queens, Hanoi, and Fib). The static frequency of an instruction is the number of times it appears in the source program. The AVE column shows the equally weighted average of the four benchmarks, which is a rough approximation of compilation frequency for most Forth programs. The Forth words selected for inclusion in this table were either in the top ten of the AVE column or among the top ten words for a particular program.
In the static measurements, subroutine calls are very frequent, accounting for about one in four instructions compiled. Note that Frac is counted twice, since it is also included in the Compile benchmark, so the reported subroutine call average is somewhat lower than it would otherwise be.
With subroutine exits so common, it is no wonder that most of the stack machines have a mechanism for combining subroutine exits with other instructions. An important additional observation is that subroutine calls are even more common than subroutine exits in the source code, making them still more attractive to combine with other operations.
The RTX 32P discussed in Chapter 5 is unique in that it has only a single instruction format that combines both an opcode and a subroutine call/subroutine return/unconditional branch. While at first this may seem to be wasteful of memory, there are significant performance gains to be made, and the memory cost is relatively low. Unfortunately, this single instruction format is only useful for 32-bit processors, since 16-bit processors do not have enough bits in an instruction to combine both an opcode and a large address field.
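Given the field widths described later in this chapter (9-bit opcode, 21-bit address, 2-bit control), the RTX 32P instruction word can be pictured roughly as follows; the field ordering shown is an assumption for illustration, not the documented encoding:

    | 9-bit opcode | 21-bit subroutine call / branch address | 2-bit control |
      (9 + 21 + 2 = 32 bits; the control field presumably selects among
       call, exit, and unconditional branch behavior)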
Tables 6.3 and 6.4 are execution and compilation statistics gathered from versions of Frac, Life, and Math that were rewritten to take advantage of the capabilities of the 32-bit processor.
Table 6.3 has four profiles of dynamic program execution with different optimizations for the RTX 32P. Part (a) of the table shows the results of executing programs with no compression of opcodes and subroutines, and no peephole optimization of adjacent opcodes (opcode combination).
Table 6.3. Dynamic Instruction Execution Frequencies for RTX 32P Instruction types.
                   FRAC      LIFE      MATH    AVE
OP               57.54%    46.07%    49.66%    51%
CALL             19.01%    26.44%    19.96%    22%
EXIT             10.80%    12.53%    16.25%    13%
OP+CALL           0.00%     0.00%     0.00%     0%
OP+EXIT           0.00%     0.00%     0.00%     0%
CALL+EXIT         0.00%     0.00%     0.00%     0%
OP+CALL+EXIT      0.00%     0.00%     0.00%     0%
COND              5.89%     9.95%     6.56%     7%
LIT               6.76%     5.01%     7.57%     6%
LIT-OP            0.00%     0.00%     0.00%     0%
VARIABLE-OP       0.00%     0.00%     0.00%     0%
VARIABLE-OP-OP    0.00%     0.00%     0.00%     0%
OP-OP             0.00%     0.00%     0.00%     0%

Instructions: 8,381,513 1,262,079   940,448
Table 6.3(a) Instruction compression OFF, Opcode combination OFF.
                   FRAC      LIFE      MATH    AVE
OP               50.92%    42.22%    45.94%    46%
CALL             17.81%    28.31%    21.42%    23%
EXIT             12.48%    13.42%    17.45%    14%
OP+CALL           0.00%     0.00%     0.00%     0%
OP+EXIT           0.00%     0.00%     0.00%     0%
CALL+EXIT         0.00%     0.00%     0.00%     0%
OP+CALL+EXIT      0.00%     0.00%     0.00%     0%
COND              6.82%    10.66%     7.05%     8%
LIT               2.60%     1.94%     2.53%     2%
LIT-OP            5.21%     3.43%     5.59%     5%
VARIABLE-OP       2.67%     0.00%     0.01%     1%
VARIABLE-OP-OP    1.49%     0.00%     0.01%     1%
OP-OP             4.72%     3.68%     1.76%     3%

Instructions: 7,250,149 1,178,235   875,882
Table 6.3(b) Instruction compression OFF, Opcode combination ON.
Part (b) of the table shows the effects of combining common opcode sequences (such as SWAP DROP , OVER + , Variable @ , and Variable @ + ) into single instructions. The row marked OP-OP is the number of combinations of two opcodes treated as a single opcode in the OP, OP+CALL, OP+EXIT, and OP+CALL+EXIT measurements. The special cases of LITERAL + , LITERAL AND , etc. are all designated LIT-OP. The special cases of Variable @ and Variable ! are designated VARIABLE-OP. The special cases of Variable @ + and Variable @ - are designated VARIABLE-OP-OP. All the literal and variable special cases require a full instruction to hold an opcode and address, so they cannot be combined with other instructions. For the example programs, peephole optimization of opcodes achieved approximately a 10% reduction in the number of instructions executed.
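As a concrete illustration (a contrived example, not benchmark code), consider a word containing two of the sequences named above. On a compiler that performs this optimization, each commented pair would be emitted as a single combined instruction rather than two:

    : CLIP+  ( a b limit -- result )   \ add a and b, but never exceed limit
       >R
       OVER +         \ "OVER +"    -> one combined OP-OP instruction
       SWAP DROP      \ "SWAP DROP" -> one combined OP-OP instruction
       R> MIN ;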
                   FRAC      LIFE      MATH    AVE
OP               48.84%    31.26%    40.81%    40%
CALL              8.46%    22.20%    15.53%    15%
EXIT              4.57%     0.00%     4.80%     3%
OP+CALL          13.93%    11.47%     6.68%    11%
OP+EXIT           7.71%    15.96%    12.90%    12%
CALL+EXIT         0.80%     0.00%     2.04%     1%
OP+CALL+EXIT      0.15%     0.00%     0.03%     0%
COND              7.23%    12.69%     7.99%     9%
LIT               8.31%     6.39%     9.22%     8%
LIT-OP            0.00%     0.00%     0.00%     0%
VARIABLE-OP       0.00%     0.00%     0.00%     0%
VARIABLE-OP-OP    0.00%     0.00%     0.00%     0%
OP-OP             0.00%     0.00%     0.00%     0%

Instructions: 6,827,482   990,313   772,865
Table 6.3(c) Instruction compression ON, Opcode combination OFF.
Part (c) of the table shows the effects of using instruction compression instead of opcode combination. This means that wherever possible, opcodes are combined with following subroutine calls and exits. Subroutine calls followed by exits are also combined into unconditional jumps to accomplish tail-end recursion elimination. The result is that a total of 24% of all instructions executed combine an opcode with a subroutine call and/or return. This translates into about 40% of all subroutine calls in the original program being executed "for free." Almost all of the subroutine exits are executed "for free" as well, the exceptions being special instructions such as literals and return stack manipulations that cannot be combined with subroutine exits.
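Tail-end recursion elimination is easy to see in a small example (again illustrative, not from the benchmarks). In the word below, the recursive call is immediately followed by the implicit subroutine exit compiled by the semicolon, so the call/exit pair compresses into a single unconditional jump and the recursion runs without growing the return stack:

    \ Euclid's greatest common divisor, written recursively.
    : GCD  ( a b -- gcd )
       ?DUP IF          \ if b is zero, a is the answer
          TUCK MOD      \ ( b a-mod-b )
          RECURSE       \ tail call: compressed into a plain jump
       THEN ;           \ the implicit EXIT is absorbed by the jump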
                   FRAC      LIFE      MATH    AVE
OP               39.05%    24.91%    39.19%    34%
CALL              6.75%    24.27%    15.94%    16%
EXIT              6.54%     0.01%    10.78%     6%
OP+CALL          12.71%    12.53%     6.87%    11%
OP+EXIT           6.78%    17.44%     7.40%    11%
CALL+EXIT         0.95%     0.01%     2.10%     1%
OP+CALL+EXIT      0.09%     0.00%     0.03%     0%
COND              7.84%    13.86%     8.21%    10%
LIT               3.00%     2.52%     2.95%     3%
LIT-OP            6.00%     4.45%     6.51%     6%
VARIABLE-OP       3.08%     0.00%     0.01%     1%
VARIABLE-OP-OP    1.72%     0.00%     0.01%     1%
OP-OP             5.44%     4.79%     2.05%     4%

Instructions: 6,294,109   906,469   752,257
Table 6.3(d) Instruction compression ON, Opcode combination ON.
Part (d) of the table shows the effects of turning on both opcode combination and instruction compression. The resulting code takes 25% fewer instructions than the original programs. This performance speedup is possible at almost no software or processing hardware expense because of the inherent parallelism between subroutine calls and opcodes.
An interesting point is that the instruction count for the Math benchmark dropped from 6.1 million instructions on the 16-bit system to only 940 thousand instructions on the RTX 32P, testimony to the need for a 32-bit processor when doing floating point calculations. The Life benchmark (which is mostly 8-bit data manipulation) remained almost the same between systems. The Frac benchmark apparently increased by a factor of four, but this is because the 32-bit version used a higher graphics resolution, requiring four times as many points to be computed and therefore approximately four times as many instructions.
The performance speedup of combining opcodes with subroutine calls is worthwhile, especially since it takes essentially no extra hardware inside the processor. In fact, it actually simplifies the hardware by requiring only one instruction format. The question that must still be resolved is, what is the cost in memory space?
Fortunately, Forth programs have a static subroutine call frequency that is even higher than the dynamic frequency. This provides a ripe opportunity for opcode/subroutine call combinations. Table 6.4 shows the difference in static program size between raw programs with no compression and programs on which both instruction compression and opcode combination have been performed.
Table 6.4. Static Instruction Compilation Frequencies for RTX 32P Instruction types.
                   FRAC      LIFE      MATH    AVE
OP               48.40%    51.46%    44.72%    48%
CALL             28.48%    33.01%    35.64%    32%
EXIT              5.12%     6.41%     7.55%     6%
OP+CALL           0.00%     0.00%     0.00%     0%
OP+EXIT           0.00%     0.00%     0.00%     0%
CALL+EXIT         0.00%     0.00%     0.00%     0%
OP+CALL+EXIT      0.00%     0.00%     0.00%     0%
COND              3.52%     4.46%     4.04%     4%
LIT              14.48%     4.66%     8.05%     9%
LIT-OP            0.00%     0.00%     0.00%     0%
VARIABLE-OP       0.00%     0.00%     0.00%     0%
VARIABLE-OP-OP    0.00%     0.00%     0.00%     0%
OP-OP             0.00%     0.00%     0.00%     0%

Instructions:     1,250       515     2,422
Table 6.4(a) Instruction compression OFF, Opcode combination OFF.
                   FRAC      LIFE      MATH    AVE
OP               33.71%    35.78%    37.05%    36%
CALL             17.33%    21.94%    27.03%    22%
EXIT              1.47%     2.87%     2.39%     2%
OP+CALL          11.65%    21.15%    10.54%    14%
OP+EXIT           3.78%     4.70%     1.73%     3%
CALL+EXIT         1.05%     1.04%     4.02%     2%
OP+CALL+EXIT      0.42%     0.00%     1.17%     1%
COND              4.62%     6.00%     4.98%     5%
LIT              16.17%     4.18%     8.61%    10%
LIT-OP            2.83%     2.08%     1.32%     2%
VARIABLE-OP       5.46%     0.26%     1.01%     2%
VARIABLE-OP-OP    1.47%     0.00%     0.15%     1%
OP-OP             2.73%     5.22%     1.98%     3%

Instructions:       952       383     1,965
Table 6.4(b) Instruction compression ON, Opcode combination ON.
The RTX 32P uses 9-bit opcodes, 21-bit addresses, and 2-bit control fields. If we were to assume an optimally packed instruction format, we might design one that used 11 bits to specify an opcode with a single subroutine exit bit, and a 23-bit subroutine call/jump format. Also, let us be generous and assume that this instruction format would get all subroutine exits for free by combining them with opcodes or by using a jump instead of a call. This supposes a machine with a variable instruction width (11 or 23 bits), but let us not worry about that, since we are computing a theoretical minimum.
In the optimized form, the three programs together would consist of 1953 opcodes (at 11 bits each, 21,483 bits), 1389 subroutine calls (at 23 bits each, 31,947 bits), and 565 combined opcode/address instructions (at 34 bits each, 19,210 bits). This adds up to a total of 72,640 bits.
Now consider the actual program compiled using the optimizations on the RTX 32P. Since each instruction uses a fixed 32-bit encoding with some potentially unused fields, the total is 3300 instructions (952 + 383 + 1965 from Table 6.4(b)) at 32 bits each, or 105,600 bits. The memory cost is then 32,960 bits, or 31% of memory "wasted" over the theoretical minimum.
Of course, designing a machine to use 11-bit opcodes and 23-bit subroutine calls would be a neat trick. In a more practical vein, we should consider that the number of "empty" opcode fields in the compressed version of the programs is 766 (at 9 bits each, 6,894 bits), and the number of "empty" subroutine call fields is 917 (at 23 bits each, 21,091 bits). This is a total of 27,985 bits, only 27% of program memory, "wasted" in exchange for executing 25% fewer instructions. So, we are getting a significant speedup at a relatively low cost even when compared against a variable-length instruction format.
There is a slight problem with the measurements presented in this section: they are for several relatively small programs. The programs do perform some fairly complex operations, so this observation is in part supporting evidence that stack machine programs are compact. The problem is that very large Forth programs are difficult to find. Nevertheless, the programs were chosen to represent a reasonable cross section of commonly used Forth code and, in the author's considered opinion, the results are reasonably close to those that would be obtained by measuring a larger sample of programs.
Of course, one way to get much larger programs would be to use the output of a conventional language compiler, but that kind of code would probably have different characteristics, because programmers solve problems much differently in C or FORTRAN than they do in Forth. We shall revisit that thought in Chapter 7.