Quantitative Analyses of Program Execution

In earlier chapters, the fetch-execute cycle was described in the form: "fetch an instruction, execute that instruction, fetch the next instruction, etc." This gives the impression of a straight-line, linear progression of program execution. In fact, the processor architectures of today have many advanced features that go beyond this simple paradigm. These features include pipelining, in which several instructions sharing the same hardware can simultaneously be in various phases of execution; superscalar execution, in which several instructions are executed simultaneously using different portions of the hardware, with possibly only some of the results contributing to the overall computation; very long instruction word (VLIW) architectures, in which each instruction word specifies multiple instructions of smaller widths that are executed simultaneously; and parallel processing, in which multiple processors are coordinated to work on a single problem.

In this chapter we cover issues that relate to these features. The discussion begins with the issues that led to the emergence of the reduced instruction set computer (RISC), along with examples of RISC features and characteristics. Following that, we cover an advanced feature used specifically in SPARC architectures: overlapping register windows. We then cover two important architectural features: superscalar execution and VLIW architectures. We then move into the topic of parallel processing, touching both on parallel architectures and program decomposition. The chapter concludes with case studies covering Intel's Merced architecture, the PowerPC 601, and an example of a pervasive parallel architecture that can be found in a home videogame system.

10.1 Quantitative Analyses of Program Execution

Prior to the late 1970s, computer architects exploited improvements in integrated circuit technology by increasing the complexity of instructions and addressing modes, as the benefits of such improvements were thought to be obvious. It became an effective selling strategy to have more complex instructions and more complex addressing modes than a competing processor. Increases in architectural complexity catered to the belief that a significant barrier to better machine performance was the semantic gap: the gap between the meanings of high-level language statements and the meanings of machine-level instructions.

Unfortunately, as computer architects attempted to close the semantic gap, they sometimes made it worse. The IBM 360 architecture has the MVC (move character) instruction, which copies a string of up to 256 bytes between two arbitrary locations. If the source and destination strings overlap, then the overlapped portion is copied one byte at a time. The runtime analysis that determines the degree of overlap adds a significant overhead to the execution time of the MVC instruction. Measurements show that overlaps occur only a few percent of the time, and that the average string size is only eight bytes. In general, faster execution results when the MVC instruction is ignored entirely and its function is instead synthesized with simpler instructions (a simple copy loop of the kind sketched below). Although a greater number of instructions may be executed without the MVC instruction, on average, fewer clock cycles are needed to implement the copy operation without using MVC than by using it.

Long-held views began to change in 1971, when Donald Knuth published a landmark analysis of typical FORTRAN programs, showing that most of the statements are simple assignments. Later research by John Hennessy at Stanford University and David Patterson at the University of California at Berkeley confirmed that most complex instructions and addressing modes went largely unused by compilers. These researchers popularized the use of program analysis and benchmark programs to evaluate the impact of architecture upon performance.

Figure 10-1, taken from [Knuth, 1971], summarizes the frequency of occurrence of instructions in a mix of programs written in a variety of languages. Nearly half of all instructions are assignment statements. Interestingly, arithmetic and other "more powerful" operations account for only 7% of all instructions. Thus, if we want to improve the performance of a computer, our efforts are better spent optimizing the instructions that account for the greatest percentage of execution time rather than focusing on instructions that are inherently complex but rarely occur.

    Statement     Average Percent of Time
    Assignment    47
    If            23
    Call          15
    Loop           6
    Goto           3
    Other          7

Figure 10-1    Frequency of occurrence of instruction types for a variety of languages. The percentages do not sum to 100 due to roundoff. Adapted from [Knuth, 1971].

Related metrics are shown in Figure 10-2. From the figure, the number of terms in an assignment statement is normally small. The most frequent case (80%) is the simple variable assignment, X ← Y. There are only a few local variables in each procedure, and only a few arguments are normally passed to a procedure. We can see from these measurements that the bulk of computer programs are very simple at the instruction level, even though more complex programs could potentially be created. This means that there may be little or no payoff in increasing the complexity of the instructions.

    Number                                          0    1    2    3    4   ≥5
    Percentage of terms in assignments              –   80   15    3    2    –
    Percentage of locals in procedures             22   17   20   14    8   20
    Percentage of parameters in procedure calls    41   19   15    9    7    8

Figure 10-2    Percentages showing the complexity of assignments and procedure calls. Adapted from [Tanenbaum, 1999].
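To make the MVC example concrete, here is a minimal sketch in C of the "simpler instructions" alternative (the function name and the non-overlap assumption are ours, not from the text): a plain forward byte copy, which compiles down to a short loop of loads and stores with no runtime overlap analysis.

```c
#include <stddef.h>

/* A plain forward byte copy: one load and one store per byte, with no
 * runtime overlap analysis. It is only safe when the source and
 * destination regions do not overlap -- which the measurements cited
 * above suggest is the common case. */
void copy_bytes(unsigned char *dst, const unsigned char *src, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = src[i];
}
```

For the rare overlapping case, a compiler or library can fall back on a more careful routine; the point is that the common case no longer pays for an overlap check on every copy.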
Discouragingly, analyses of compiled code showed that compilers usually did not take advantage of the complex instructions and addressing modes made available by computer architects eager to close the semantic gap. One important reason for this phenomenon is that it is difficult for a compiler to analyze the code in sufficient detail to locate areas where the new instructions can be used effectively, because of the great difference in meaning between most high-level language constructs and the expression of those constructs in assembly language. This observation, and the ever-increasing speed and capacity of integrated circuit technology, converged to bring about an evolution from complex instruction set computer (CISC) machines to RISC machines.

A basic tenet of current computer architecture is to make the frequent case fast, and this often means making it simple. Since the assignment statement happens so frequently, we should concentrate on making it fast and simple as a consequence. One way to simplify assignments is to force all communication with memory into just two commands: LOAD and STORE. The LOAD/STORE model is typical of RISC architectures. We saw the LOAD/STORE concept in Chapter 4 with the ld and st instructions for the ARC.

By restricting memory accesses to LOAD/STORE instructions only, other instructions can only access data that is stored in registers. There are two consequences of this, both good and bad: (1) accesses to memory can be easily overlapped, since there are fewer side effects than would occur if different instruction types could access memory (this is good); and (2) there is a need for a large number of registers (this seems bad, but read on).

A simpler instruction set results in a simpler and typically smaller CPU, which frees up space on a microprocessor to be used for something else, like registers. Thus, the need for more registers is balanced to a degree by the newly vacant circuit area, or chip real estate as it is sometimes called. A key problem lies in how to manage these registers, which is discussed in Section 10.4.

10.1.1 QUANTITATIVE PERFORMANCE ANALYSIS

When we estimate machine performance, the measure that is generally most important is execution time, T. When considering the impact of some performance improvement, the effect of the improvement is usually expressed in terms of the speedup, S, taken as the ratio of the execution time without the improvement (T_wo) to the execution time with the improvement (T_w):

S = \frac{T_{wo}}{T_{w}}

For example, if adding a 1 MB cache module to a computer system results in lowering the execution time of some benchmark program from 12 seconds to 8 seconds, then the speedup would be 12/8 = 1.5, or 50%.
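As a quick check of that arithmetic, here is a throwaway C sketch (the function name is ours) that applies the definition to the cache example:

```c
#include <stdio.h>

/* Speedup as defined above: execution time without the improvement
 * divided by execution time with it. */
static double speedup(double t_without, double t_with)
{
    return t_without / t_with;
}

int main(void)
{
    /* The 1 MB cache example: 12 seconds drops to 8 seconds. */
    printf("S = %.2f\n", speedup(12.0, 8.0));  /* prints S = 1.50 */
    return 0;
}
```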
Speedup can also be expressed directly as a percentage:

S = \frac{T_{wo} - T_{w}}{T_{w}} \times 100\%

We can develop a more fine-grained equation for estimating T if we have information about the machine's clock period, τ, the number of clock cycles per instruction, CPI, and a count of the number of instructions executed by the program during its execution, IC. In this case the total execution time for the program is given by:

T = IC \times CPI \times \tau

CPI and IC can be expressed either as an average over the instruction set and a total count, respectively, or summed over each kind and number of instructions in the instruction set and program. Substituting this equation into the percentage form of the speedup equation, we get:

S = \frac{IC \times CPI_{wo} \times \tau_{wo} - IC \times CPI_{w} \times \tau_{w}}{IC \times CPI_{w} \times \tau_{w}} \times 100\%

These equations, and others derived from them, are useful in computing and estimating the impact of changes in instructions and architecture upon performance.

EXAMPLE: CALCULATING SPEEDUP FOR A NEW INSTRUCTION SET

Suppose we wish to estimate the speedup obtained by replacing a CPU having an average CPI of 5 with another CPU having an average CPI of 3.5, with the clock period increased from 100 ns to 120 ns. Assuming the instruction count IC is unchanged, it cancels out of the ratio, and the equation above becomes:

S = \frac{5 \times 100 - 3.5 \times 120}{3.5 \times 120} \times 100\% = \frac{500 - 420}{420} \times 100\% \approx 19\%

Thus, without actually running a benchmark program we can estimate the impact of an architectural change upon performance. ■
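The same estimate can be reproduced with a few lines of C built on the model T = IC × CPI × τ (the function name and the arbitrary instruction count are ours; IC cancels out of the ratio):

```c
#include <stdio.h>

/* Execution-time model from the text: T = IC * CPI * tau. */
static double exec_time(double ic, double cpi, double tau_ns)
{
    return ic * cpi * tau_ns;   /* result in nanoseconds */
}

int main(void)
{
    double ic = 1e6;                           /* cancels in the ratio */
    double t_wo = exec_time(ic, 5.0, 100.0);   /* old CPU: CPI 5, tau 100 ns */
    double t_w  = exec_time(ic, 3.5, 120.0);   /* new CPU: CPI 3.5, tau 120 ns */

    /* Percentage speedup: (T_wo - T_w) / T_w * 100% */
    printf("S = %.0f%%\n", (t_wo - t_w) / t_w * 100.0);  /* prints S = 19% */
    return 0;
}
```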

10.2 From CISC to RISC