Stack Computers: the new wave © Copyright 1989, Philip Koopman, All Rights Reserved.
Chapter 9. The Future of Stack Computers
Perhaps the greatest challenge that has faced computer architects over the years is the problem of memory bandwidth. Memory bandwidth is the amount of information that can be transferred to and from memory per unit time. Put another way, memory bandwidth determines how often values can be accessed from memory.
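As a rough illustration, the required bandwidth is simply the product of the access rate and the width of each access. The following sketch uses assumed numbers (a hypothetical 8 MHz processor with a 16-bit bus, fetching one word per clock) rather than figures for any particular machine:

    /* A back-of-the-envelope bandwidth calculation using assumed
     * numbers: an 8 MHz processor with a 16-bit bus that fetches
     * one memory word every clock cycle.
     */
    #include <stdio.h>

    int main(void)
    {
        double clock_hz  = 8.0e6;  /* assumed processor clock: 8 MHz */
        double bus_bytes = 2.0;    /* assumed bus width: 16 bits     */
        double accesses_per_cycle = 1.0;  /* one access per clock    */

        /* Bandwidth needed to keep the processor busy, bytes/second. */
        double bandwidth = clock_hz * bus_bytes * accesses_per_cycle;
        printf("Required memory bandwidth: %.0f bytes/second\n", bandwidth);
        return 0;
    }

Even this modest hypothetical processor demands 16 million bytes per second from its memory system, and the demand scales directly with clock rate and bus width.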
The crux of the issue is that program memory is usually much bigger than the processor in terms of the number of transistors or other devices it contains. This means that the CPU can easily be made faster than the memory while staying within budget. This follows from the general observation that, for any given fabrication technology and geometry, faster components tend to be more expensive, consume more power, and so on.
The specter of memory bandwidth limitations has come and gone throughout computer design history. In the beginning, people were grateful that computers ran at all, so the relative speeds of program memory and processor were not a real issue. When electronic computers came along, much of the memory capacity of a computer was in fact magnetic tape used for data files, but it was better than nothing.
The magnetic core memories used in early large computers were very slow. This spawned very complex instruction sets that packed a lot of work into each instruction. It also made microcode practical, since a small amount of microcode memory could be made as fast as the CPU without breaking the budget, while the large amount of program memory could not. Once semiconductor memory became somewhat affordable, cache memories that captured small pieces of the program, especially loops, for reuse at execution time became popular. Cache memories grew bigger and bigger, so that more and more of each program resided in cache memory, which was fast enough to match the speed of the processor.
Then, microprocessors came into being. The early ones were slow enough that available program memory chips were sufficiently fast (and again, the issue was not how fast they ran as much as the wonder that they ran at all). The memory bandwidth problem was forestalled for a while. Microprocessor manufacturers followed in the footsteps of the big systems and used microcode with complex instruction sets.
Mainstream microprocessors grew steadily faster, then started needing "wait states" when accessing memory. A wait state is a clock cycle wasted by a processor that must wait for memory to respond because the processor is faster than its memory chips. One easy solution to this problem is to spend a lot of money on faster memory chips. The other easy solution is simply to wait for memory chip manufacturers to make faster memories at an affordable cost. Finally, CISC microprocessors introduced cache memory and other techniques borrowed from large systems in an attempt to move more and more of each program into fast memory.
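A small sketch of the arithmetic, with assumed figures, shows why wait states hurt: with two wait states added to a one-cycle access, every memory reference takes three clocks, and two-thirds of the processor's memory cycles are spent idle.

    #include <stdio.h>

    /* Effective memory cycle count when wait states are added: each
     * access takes the base cycle plus any wait states, so a fraction
     * of the processor's time is simply wasted.  The numbers here are
     * assumed for illustration.
     */
    int main(void)
    {
        int base_cycles = 1;   /* ideal: one clock per memory access   */
        int wait_states = 2;   /* assumed: memory needs 2 extra clocks */

        double slowdown = (double)(base_cycles + wait_states) / base_cycles;
        printf("Each access takes %dx as long; ", (int)slowdown);
        printf("%.0f%% of cycles are spent waiting\n",
               100.0 * wait_states / (base_cycles + wait_states));
        return 0;
    }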
RISC machines upset the applecart by claiming that processors with conventional microcode were bad. They instead use a very low level instruction set that can best be described as compiler-generated microcode residing in program memory. This approach is claimed to give significant advantages, but at an admitted increase in memory bandwidth requirements. RISC machines depend heavily on cache memory for performance.
With the latest generations of computers there is a new problem: cache memory chips will not be fast enough to keep up with the processors of the future. This is not really because processor transistor switching speeds are increasing faster than memory chip speeds (they probably aren't). The problem is that the pins into and out of the processor and memory chips are beginning to dominate the timing picture.
Pins become the bottleneck because transistors are getting smaller and faster as chips become denser, while pins that can be soldered and connected together have not become much smaller except in very exotic packaging technologies. The number of electrons that must be pushed out onto a pin and its attached wire becomes significant compared to the ability of the transistors to push electrons, so the pins become a bottleneck. This in turn means that any off-chip memory is up to an order of magnitude slower than on-chip memory solely because of the delays introduced by going between chips.
Now we may have a situation where all off-chip memory is too slow to keep the processor busy. This creates a need for an additional layer in the memory hierarchy: on-chip cache memory. Unfortunately, there is a fundamental problem with this approach compared to the previous ones. Printed circuit boards may be made quite large without any problem: the yield of a circuit board degrades only linearly with the number of chips, and circuit boards are repairable if defects are discovered. The yield of chips, on the other hand, grows exponentially worse with area, and chips are not easily repairable.
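One common first-order way to see the exponential behavior (offered here as an illustration; the model is not developed in this book) is a Poisson defect model, in which a die of area A built on a process with defect density D is defect-free with probability e^(-D*A):

    #include <math.h>
    #include <stdio.h>

    /* A Poisson defect model of chip yield, used here purely as an
     * illustration: the probability that a die of area A contains no
     * defects is exp(-D * A) for defect density D, so yield falls off
     * exponentially as die area grows.  The defect density is assumed.
     */
    int main(void)
    {
        double defect_density = 1.0;   /* assumed: 1 defect per cm^2 */
        for (double area = 0.25; area <= 2.0; area *= 2.0) {
            double yield = exp(-defect_density * area);
            printf("die area %.2f cm^2 -> yield %.1f%%\n",
                   area, 100.0 * yield);
        }
        return 0;
    }

At the assumed defect density, doubling a large die's area to make room for more on-chip memory more than halves the expected yield.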
With separate cache memory chips, adding more chips to a printed circuit board can provide as much cache memory as is needed, within reason. However, if a single-chip system does not have enough on-chip cache memory, increasing the chip size to provide more can make the processor unmanufacturable because of yield problems. The tremendous amounts of memory needed by high-speed processors, especially RISC machines, seem to indicate that the best we may hope for is a modest amount of high-speed cache memory on-chip and a large amount of slower cache memory off-chip. Now program performance is at the mercy of the hit ratios of two different caches. Is this the best that we can hope for?
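The cost of being at the mercy of two caches can be made concrete with the standard average-access-time formula for a two-level cache hierarchy; the hit ratios and cycle counts below are assumed for illustration only:

    #include <stdio.h>

    /* Average memory access time with two levels of cache, using the
     * standard hit-ratio formula; all times and hit ratios below are
     * assumed values, not measurements of any particular machine.
     */
    int main(void)
    {
        double t1 = 1.0,  h1 = 0.90;  /* on-chip cache: 1 cycle, 90% hits  */
        double t2 = 4.0,  h2 = 0.95;  /* off-chip cache: 4 cycles, 95% hits */
        double t3 = 20.0;             /* main memory: 20 cycles             */

        /* Misses in the first cache go to the second; misses there go
         * on to main memory, so performance depends on BOTH hit ratios.
         */
        double amat = h1 * t1
                    + (1.0 - h1) * (h2 * t2 + (1.0 - h2) * t3);
        printf("average access time: %.2f cycles\n", amat);
        return 0;
    }

Even with a 90% hit rate on-chip and a 95% hit rate off-chip, the assumed system averages 1.38 cycles per access, 38% slower than the ideal single-cycle case.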
Stack machines provide a very different way to solve the problem. Conventional machines with their caches attempt to capture bits and pieces of programs as they run, in an attempt to reuse already-fetched instructions as they are executed in loops or very frequently called procedures. Part of what interferes with cache performance is that conventional programming languages and compilers produce rambling, sprawling code. Stack machines, on the other hand, encourage the writing of compact code with much reuse of procedures.
The impact of this program behavior is that stack machines should not use dynamically allocated cache memory. They should instead use small, statically allocated or operating-system-managed program memory for high speed execution of selected subroutines. Frequently used subroutines may be placed in these portions of program memory, which can be used freely by the compiler and the user with the knowledge that they will run quickly. Since stack machine code is very compact, a significant amount of a program can reside in high speed on-chip memory. This can further encourage the use of modular, reused procedures with the knowledge that they will actually help performance, instead of hurting it as is too often the case on other machines. And since the on-chip program memory does not need the complex and bulky control circuitry required to manage a cache, all the more room is available for extra program memory. On the 16-bit stack processors, it is quite reasonable for an entire real time control program and the data memory for its local variables to reside entirely on-chip. As process technology reaches sub-micron levels, the same will begin to be true for 32-bit stack processors as well.
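As a sketch of what compiler- and programmer-managed placement might look like, here is how a modern embedded C toolchain could pin a hot subroutine into fast memory. The section name and the linker script that would map it onto on-chip RAM are assumptions; the essential point is only that placement is decided statically rather than by cache hardware at run time:

    /* A sketch of programmer-managed placement in the style of a
     * modern embedded C toolchain.  The section name ".onchip_text"
     * and the linker script mapping it to on-chip RAM are assumed;
     * the attribute syntax is GCC's.
     */
    #define ON_CHIP __attribute__((section(".onchip_text")))

    /* A frequently called inner-loop routine, pinned to fast on-chip
     * program memory by the programmer's annotation.
     */
    ON_CHIP int checksum(const unsigned char *buf, int len)
    {
        int sum = 0;
        while (len-- > 0)
            sum += *buf++;
        return sum;
    }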
To see how this different approach to memory hierarchies might work, consider a microcoded machine such as the RTX 32P. Large dynamic RAMs may be used to contain the bulk of a program and its data. This is really an extreme case, because programs for the RTX 32P seldom exceed the capacity of its static memory chips, but let us assume the need anyway. The dynamic RAMs form a storage element for the very highest levels of the program, which are executed infrequently, and for data that is accessed sparsely or infrequently.
Next, static memory chips are added to the system. These are used for the medium-level layers of the program that are executed fairly frequently. Also, program data that will be manipulated frequently may be resident in this memory, or may be copied in from the dynamic memory for a period of time when it will be needed. In practice, there may be two levels of static memory chips: large slow ones and small fast ones, each with different power, cost, and printed circuit board space characteristics.
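Staging data between the levels can be as simple as an explicit block copy, in the style of the overlays used on earlier systems. A minimal sketch, with assumed buffer names and sizes:

    #include <string.h>

    /* A sketch of explicit, overlay-style data staging: before a
     * period of heavy use, a block is copied from slow dynamic RAM
     * into fast static RAM, worked on there, and copied back.  The
     * buffer names and sizes are assumptions for illustration.
     */
    #define BLOCK 256

    static int slow_dram[BLOCK];   /* stand-in for a bulk DRAM region */
    static int fast_sram[BLOCK];   /* stand-in for a fast SRAM region */

    void process_block(void)
    {
        /* Stage the data into fast memory... */
        memcpy(fast_sram, slow_dram, sizeof fast_sram);

        /* ...do the frequent manipulation at SRAM speed... */
        for (int i = 0; i < BLOCK; i++)
            fast_sram[i] *= 2;

        /* ...and write the results back to bulk memory. */
        memcpy(slow_dram, fast_sram, sizeof slow_dram);
    }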
On-chip program memory comes next in the hierarchy. The inner loops of important procedures in the program can reside here for quick access by the processor. Several hundred bytes of RAM for program and data can easily fit onto the processor chip. In the case of chips that run dedicated programs (often the case in the real time embedded system environment), several thousand bytes of program ROM may reside on-chip. In practice, any language can place many common subroutines in ROM to assist the programmer and the compiler.
Finally, microcode memory resides on-chip for the actual control of the CPU. In the sense of the memory hierarchy, microcode memory may be thought of as just another level of program memory. It contains the most frequently executed actions for the processor, which correspond to supported machine instructions. Once again, a mixture of ROM and RAM is appropriate. And of course, the data stack acts as a fast access device for holding intermediate computation results.
What we have, then, is a layered hierarchy of memory sizes and speeds throughout the system. The concept of a hierarchy is not new. What is new is the idea that it need not be managed by hardware at run time; the compiler and programmer can easily manage it. The key is that, since stack programs are very small, significant amounts of code can reside at each level of a statically allocated hierarchy. Since stack machines support very fast procedure calls, only the inner loops or small, frequently executed segments of code need be stored in high-speed memory, not the entire bulk of user-defined procedures. This means that dynamic memory allocation is really not required.
Phil Koopman -- koopman@cmu.edu