CHAPTER 10   TRENDS IN COMPUTER ARCHITECTURE
In the earlier chapters, the fetch-execute cycle is described in the form: “fetch an instruction, execute that instruction, fetch the next instruction, etc.” This gives the impression of a straight-line linear progression of program execution. In fact, the processor architectures of today have many advanced features that go beyond this simple paradigm. These features include pipelining, in which several instructions sharing the same hardware can simultaneously be in various phases of execution; superscalar execution, in which several instructions are executed simultaneously using different portions of the hardware, with possibly only some of the results contributing to the overall computation; very long instruction word (VLIW) architectures, in which each instruction word specifies multiple instructions of smaller widths that are executed simultaneously; and parallel processing, in which multiple processors are coordinated to work on a single problem.
In this chapter we cover issues that relate to these features. The discussion begins with issues that led to the emergence of the reduced instruction set computer (RISC), and examples of RISC features and characteristics. Following that, we cover an advanced feature used specifically in SPARC architectures: overlapping register windows. We then cover two important architectural features: superscalar execution and VLIW architectures. We then move into the topic of parallel processing, touching both on parallel architectures and program decomposition. The chapter concludes with case studies covering Intel’s Merced architecture, the PowerPC 601, and an example of a pervasive parallel architecture that can be found in a home videogame system.
10.1 Quantitative Analyses of Program Execution
Prior to the late 1970s, computer architects exploited improvements in integrated circuit technology by increasing the complexity of instructions and addressing modes, as the benefits of such improvements were thought to be obvious. It became an effective selling strategy to have more complex instructions and more complex addressing modes than a competing processor. Increases in architectural complexity catered to the belief that a significant barrier to better machine performance was the semantic gap—the gap between the meanings of high-level language statements and the meanings of machine-level instructions.
Unfortunately, as computer architects attempted to close the semantic gap, they sometimes made it worse. The IBM 360 architecture has the MVC (move character) instruction, which copies a string of up to 256 bytes between two arbitrary locations. If the source and destination strings overlap, then the overlapped portion is copied one byte at a time. The runtime analysis that determines the degree of overlap adds a significant overhead to the execution time of the MVC instruction. Measurements show that overlaps occur only a few percent of the time, and that the average string size is only eight bytes. In general, faster execution results when the MVC instruction is entirely ignored, and instead its function is synthesized with simpler instructions. Although a greater number of instructions may be executed without the MVC instruction, on average, fewer clock cycles are needed to implement the copy operation without using MVC than by using it.

Long-held views began to change in 1971, when Donald Knuth published a landmark analysis of typical FORTRAN programs, showing that most of the statements are simple assignments. Later research by John Hennessy at Stanford University and David Patterson at the University of California at Berkeley confirmed that most complex instructions and addressing modes went largely unused by compilers. These researchers popularized the use of program analysis and benchmark programs to evaluate the impact of architecture upon performance.
Figure 10-1, taken from [Knuth, 1971], summarizes the frequency of occurrence of instructions in a mix of programs written in a variety of languages. Nearly half of all instructions are assignment statements. Interestingly, arithmetic and other “more powerful” operations account for only 7% of all instructions. Thus, if we want to improve the performance of a computer, our efforts would be better spent optimizing instructions that account for the greatest percentage of execution time, rather than focusing on instructions that are inherently complex but rarely occur.
Related metrics are shown in Figure 10-2. From the figure, the number of terms in an assignment statement is normally just a few. The most frequent case (80%) is the simple variable assignment, X ← Y. There are only a few local variables in each procedure, and only a few arguments are normally passed to a procedure. We can see from these measurements that the bulk of computer programs are very simple at the instruction level, even though more complex programs could potentially be created. This means that there may be little or no payoff in increasing the complexity of the instructions.
Discouragingly, analyses of compiled code showed that compilers usually did not take advantage of the complex instructions and addressing modes made available by computer architects eager to close the semantic gap. One important reason for this phenomenon is that it is difficult for a compiler to analyze the code in sufficient detail to locate areas where the new instructions can be used effectively, because of the great difference in meaning between most high-level language constructs and the expression of those constructs in assembly language. This
Statement     Average Percent of Time
Assignment            47
If                    23
Call                  15
Loop                   6
Goto                   3
Other                  7

Figure 10-1   Frequency of occurrence of instruction types for a variety of languages. The percentages do not sum to 100 due to roundoff. Adapted from [Knuth, 1971].
Number of Terms      Percentage of
in Assignments       Assignments
1                         80
2                         15
3                          3
4                          2
≥5                         –

Number of Locals     Percentage of
in Procedures        Procedures
0                         22
1                         17
2                         20
3                         14
4                          8
≥5                        20

Number of Parameters   Percentage of
in Procedure Calls     Procedure Calls
0                         41
1                         19
2                         15
3                          9
4                          7
≥5                         8

Figure 10-2   Percentages showing complexity of assignments and procedure calls. Adapted from [Tanenbaum, 1999].
observation, and the ever-increasing speed and capacity of integrated circuit technology, converged to bring about an evolution from complex instruction set computer (CISC) machines to RISC machines.
A basic tenet of current computer architecture is to make the frequent case fast, and this often means making it simple. Since the assignment statement happens so frequently, we should concentrate on making it fast and simple as a consequence. One way to simplify assignments is to force all communication with memory into just two commands: LOAD and STORE. The LOAD/STORE model is typical of RISC architectures. We see the LOAD/STORE concept in Chapter 4 with the ld and st instructions for the ARC. By restricting memory accesses to LOAD/STORE instructions only, other instructions can only access data that is stored in registers. There are two consequences of this, both good and bad: (1) accesses to memory can be easily overlapped, since there are fewer side effects than would occur if different instruction types could access memory (this is good); and (2) there is a need for a large number of registers (this seems bad, but read on).
A simpler instruction set results in a simpler and typically smaller CPU, which frees up space on a microprocessor to be used for something else, like registers. Thus, the need for more registers is balanced to a degree by the newly vacant circuit area, or chip real estate as it is sometimes called. A key problem lies in how to manage these registers, which is discussed in Section 10.4.
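As an illustration of the LOAD/STORE restriction, the sketch below shows how an assignment such as Z = X + Y must be broken into explicit loads, a register-to-register add, and a store. The helper function is hypothetical (it is not the ARC assembler), but the emitted mnemonics follow the ARC style of Chapter 4:

```python
# Emit an ARC-style instruction sequence for dest = src1 + src2
# under a LOAD/STORE model: only ld and st may touch memory,
# and the addition operates purely on registers.
def compile_assignment(dest, src1, src2):
    return [
        f"ld   [{src1}], %r1",    # load first operand into a register
        f"ld   [{src2}], %r2",    # load second operand
        f"addcc %r1, %r2, %r3",   # arithmetic is register-to-register only
        f"st   %r3, [{dest}]",    # one store writes the result back
    ]

for line in compile_assignment("Z", "X", "Y"):
    print(line)
```

Note that the complex-instruction alternative, a single memory-to-memory add, is exactly what the LOAD/STORE model forbids; the cost of the extra instructions is offset by simpler, faster hardware.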
10.1.1 QUANTITATIVE PERFORMANCE ANALYSIS

When we estimate machine performance, the measure that is generally most important is execution time, T. When considering the impact of some performance improvement, the effect of the improvement is usually expressed in terms of the speedup, S, taken as the ratio of the execution time without the improvement (T_wo) to the execution time with the improvement (T_w):

S = T_wo / T_w

For example, if adding a 1 MB cache module to a computer system results in lowering the execution time of some benchmark program from 12 seconds to 8 seconds, then the speedup would be 12/8 = 1.5, or 50%. An equation to calculate speedup as a direct percent can be represented as:

S(%) = (T_wo − T_w) / T_w × 100%
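The cache example can be checked numerically; a minimal sketch (the function names are ours, not from the text):

```python
# Speedup as a ratio, S = T_wo / T_w, and as a direct percent,
# S(%) = (T_wo - T_w) / T_w * 100, for the 12 s -> 8 s cache example.
def speedup(t_without, t_with):
    return t_without / t_with

def speedup_percent(t_without, t_with):
    return (t_without - t_with) / t_with * 100

print(speedup(12, 8))          # 1.5
print(speedup_percent(12, 8))  # 50.0
```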
We can develop a more fine-grained equation for estimating T if we have information about the machine’s clock period, τ, the number of clock cycles per instruction, CPI, and a count of the number of instructions executed by the program during its execution, IC. In this case the total execution time for the program is given by:

T = IC × CPI × τ

CPI and IC can be expressed either as an average over the instruction set and total count, respectively, or summed over each kind and number of instructions in the instruction set and program. Substituting this equation into the one for speedup, we get:

S = (IC_wo × CPI_wo × τ_wo) / (IC_w × CPI_w × τ_w)

These equations, and others derived from them, are useful in computing and estimating the impact of changes in instructions and architecture upon performance.
EXAMPLE: CALCULATING SPEEDUP FOR A NEW INSTRUCTION SET

Suppose we wish to estimate the speedup obtained by replacing a CPU having an average CPI of 5 with another CPU having an average CPI of 3.5, with the clock period increased from 100 ns to 120 ns. The instruction count IC is unchanged, and so it cancels out of the ratio. The equation above becomes:

S = (IC × 5 × 100 ns) / (IC × 3.5 × 120 ns) = 500/420 ≈ 1.19

for a speedup of about 19%. Thus, without actually running a benchmark program we can estimate the impact of an architectural change upon performance.
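The same estimate can be reproduced by plugging the values into the execution-time model; a short sketch (the function name and the placeholder instruction count are ours):

```python
# Estimated speedup from the model T = IC * CPI * tau.
# IC is identical for both machines, so any positive value cancels out.
def speedup(cpi_wo, tau_wo, cpi_w, tau_w, ic=1_000_000):
    t_without = ic * cpi_wo * tau_wo   # old CPU: CPI = 5, tau = 100 ns
    t_with = ic * cpi_w * tau_w        # new CPU: CPI = 3.5, tau = 120 ns
    return t_without / t_with

s = speedup(5, 100e-9, 3.5, 120e-9)
print(round(s, 2))  # 1.19
```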
■
10.2 From CISC to RISC