CHAPTER 10 TRENDS IN COMPUTER ARCHITECTURE
and simulation, and parallel scientific applications. As programmers of language XYZ (whatever language that may be), we sometimes find ourselves wanting to write XYZ programs that make calls to Fortran routines. This is easy to do once we understand what is happening.
There are two significant issues that need to be addressed: (1) differences in routine labels; (2) differences in subroutine linkage. In Fortran, the source code labels are prepended with two underscores in the assembly code. A C program (if C is language XYZ) that makes a call to Fortran routine add_two would then make a call to __add_two, which also must be declared as external in the C source code and declared as global in the Fortran program.
If all of the parameters that are passed to the Fortran routines are pointers, then everything will work. If any scalars are passed, there will be trouble, because C (like Java) uses call-by-value for scalars whereas Fortran uses call-by-reference. We need to “trick” the C compiler into using call-by-reference by making it explicit: wherever a Fortran routine expects a scalar in its argument list, we pass a pointer to the scalar in the C code.
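A minimal sketch of the pointer trick follows. No real Fortran object file is assumed here: add_two_by_ref is a hypothetical C stand-in that receives its arguments the way a Fortran routine would (as addresses), and call_add_two is a hypothetical C caller. The external label that a real Fortran compiler emits is system-dependent and is not modeled.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a Fortran routine: like Fortran, it receives
 * the ADDRESS of each scalar argument (call-by-reference). */
int add_two_by_ref(const int *a, const int *b)
{
    assert(a != NULL && b != NULL); /* a by-reference callee needs valid addresses */
    return *a + *b;
}

/* C caller: x and y arrive by value, so their addresses are passed
 * explicitly to match the callee's by-reference convention. */
int call_add_two(int x, int y)
{
    return add_two_by_ref(&x, &y);
}
```

Calling call_add_two(10, 4) hands the callee two addresses rather than two integers, which is what the Fortran side expects.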
As a practical consideration, the gcc compiler will compile Fortran programs. It knows what to do by observing the extension of the source file, which should be .f for Fortran.
Now, let us take a look at how an optimizing compiler improves the code. Figure 10-11 shows the optimized code using the compiler’s -O flag. Notice there is not a single nop, ld, or st instruction. Wasted cycles devoted to nop instructions have been reclaimed, and memory references devoted to stack manipulation have been eliminated.
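For orientation, a C source file along the following lines would be consistent with the listing in Figure 10-11, in which the constants 10 and 4, the call to add_two, and the call to printf are all visible. This is a reconstruction offered as an assumption, not the original add.c; run_example is a hypothetical name used in place of main, and the format string is assumed.

```c
#include <assert.h>
#include <stdio.h>

/* Leaf routine: in the optimized listing it is a single add placed in
 * the delay slot of retl. */
int add_two(int a, int b)
{
    return a + b;
}

/* Plays the role of main in the listing: load 10 and 4, call add_two,
 * then pass the result to printf. */
int run_example(void)
{
    int c = add_two(10, 4);
    assert(c == 14); /* sanity check on the two constants used above */
    printf("c = %d\n", c);
    return c;
}
```

Because both arguments and the result stay in registers, nothing in this program forces a load or store, which matches the absence of ld and st in the optimized output.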
10.5 Multiple Instruction Issue (Superscalar) Machines – The PowerPC 601
In the earlier pipelining discussion, we saw how several instructions can be in various phases of execution at once. Here, we look at superscalar architecture, where, with separate execution units, several instructions can be executed simultaneously. In a superscalar architecture, there might be one or more separate Integer Units (IUs), Floating Point Units (FPUs), and Branch Processing Units (BPUs). This implies that instructions need to be scheduled into the various execution units, and further, that instructions might be executed out-of-order.
! Output produced by -O optimization for gcc compiler
        .file   "add.c"
        .section        ".rodata"
        .align 8
.LLC0:
        .asciz  "c = %d\n"
        .section        ".text"
        .align 4
        .global main
        .type   main,#function
        .proc   04
main:
        !#PROLOGUE# 0
        save %sp,-112,%sp
        !#PROLOGUE# 1
        mov 10,%o0
        call add_two,0
        mov 4,%o1
        mov %o0,%o1
        sethi %hi(.LLC0),%o0
        call printf,0
        or %o0,%lo(.LLC0),%o0
        ret
        restore
.LLfe1:
        .size   main,.LLfe1-main
        .align 4
        .global add_two
        .type   add_two,#function
        .proc   04
add_two:
        !#PROLOGUE# 0
        !#PROLOGUE# 1
        retl
        add %o0,%o1,%o0
.LLfe2:
        .size   add_two,.LLfe2-add_two
        .ident  "GCC: (GNU) 2.7.2"
Figure 10-11 SPARC code generated with the -O optimization flag.
Out-of-order execution means that instructions need to be examined prior to dispatching them to an execution unit, not only to determine which unit should execute them, but also to determine whether executing them out of order would result in an incorrect program because of dependencies between the instructions. This in turn implies an Instruction Unit that can prefetch instructions into an instruction queue, determine the kinds of instructions and the dependence relations among them, and schedule them into the various execution units.
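The dependence check that must precede out-of-order dispatch can be sketched as follows. This is a simplified model under stated assumptions, not the 601's actual issue logic: the Instr type and depends_on function are hypothetical, and every instruction is assumed to read two registers and write one.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical three-register instruction: dst = src1 op src2. */
typedef struct {
    int dst;
    int src1;
    int src2;
} Instr;

/* Returns true when `later` may not be reordered ahead of `earlier`
 * because the two instructions touch a common register in a conflicting way. */
bool depends_on(const Instr *earlier, const Instr *later)
{
    assert(earlier != NULL && later != NULL);
    /* read after write: later reads what earlier produces */
    bool raw = later->src1 == earlier->dst || later->src2 == earlier->dst;
    /* write after read: later overwrites an operand earlier still needs */
    bool war = later->dst == earlier->src1 || later->dst == earlier->src2;
    /* write after write: both write the same register */
    bool waw = later->dst == earlier->dst;
    return raw || war || waw;
}
```

Two instructions for which depends_on returns false can be sent to different execution units in either order without changing the program's result, at least as far as register operands are concerned.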
10.6 Case Study: The PowerPC™ 601 as a Superscalar Architecture
As an example of a modern superscalar architecture, let us examine the Motorola PowerPC™ 601. The 601 has actually been superseded by more powerful members of the PowerPC family, but it will serve to illustrate the important features of superscalar architectures without introducing unnecessary complexity into our discussion.
10.6.1 INSTRUCTION SET ARCHITECTURE OF THE POWERPC 601
The 601 is a 32-bit general register RISC machine whose ISA includes:
• 32 32-bit general purpose integer registers (GPRs);
• 32 64-bit floating point registers (FPRs);
• 8 4-bit condition code registers;
• nearly 50 special-purpose 32-bit registers that are used to control various aspects of memory management and the operating system;
• over 250 instructions (many of which are special-purpose).
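One way to picture the eight 4-bit condition code registers is as eight fields packed into a single 32-bit word. The sketch below makes that assumption for illustration, with field 0 placed in the most significant four bits; cr_get_field and cr_set_field are hypothetical helper names, not part of any PowerPC toolchain.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative model: eight 4-bit condition fields in one 32-bit word,
 * field 0 occupying the most significant four bits. */
uint32_t cr_get_field(uint32_t cr, unsigned field)
{
    assert(field < 8);
    unsigned shift = 4u * (7u - field);
    return (cr >> shift) & 0xFu;
}

/* Returns a copy of cr with one field replaced and the other seven untouched. */
uint32_t cr_set_field(uint32_t cr, unsigned field, uint32_t value)
{
    assert(field < 8 && value <= 0xFu);
    unsigned shift = 4u * (7u - field);
    uint32_t mask = (uint32_t)0xF << shift;
    return (cr & ~mask) | (value << shift);
}
```

Because each update touches only its own field, an instruction that sets field 2 leaves intact the condition bits that another instruction placed in field 5.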
10.6.2 HARDWARE ARCHITECTURE OF THE POWERPC 601
Figure 10-12, taken from the Motorola PowerPC 601 user’s manual, shows the microarchitecture of the 601. The flow of instructions and data proceeds via the System Interface, shown at the bottom of the figure, into the 32-KB cache. From there, instructions are fetched eight at a time into the Instruction Unit, shown at the top of the figure. The issue logic within the Instruction Unit examines the instructions in the queue for their kind and dependence, and issues them into one of the three execution units: IU, BPU, or FPU.
Figure 10-12 The PowerPC 601 architecture (adapted from the Motorola PowerPC 601 user manual). Blocks labeled in the figure: Real Time Clock; Instruction Unit (instruction queue, issue logic); BPU (CTR, LR); CR; IU (GPR file); FPU (FPR file, FPSCR); MMU (UTLB, ITLB, BAT array); Memory Unit (read queue, write queue, snoop); 32-KB instruction and data cache with tag memory; System Interface.
The IU is shown as containing the GPR file and an integer exception register, XER, which holds information about exceptions that arise within the IU. The IU can execute most integer instructions in a single clock cycle, thus obviating the need for any kind of pipelining of integer instructions. The FPU contains the FPRs and the floating point status and control register (FPSCR). The FPSCR contains information about floating point exceptions and the type of result produced by the FPU. The FPU is pipelined, and most FP instructions can be issued at a rate of one per clock.
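The benefit of issuing one instruction per clock into a pipelined unit can be put in numbers with a short sketch. The stage count k is a hypothetical parameter here, not the actual depth of the 601's FPU, and the model assumes no stalls.

```c
#include <assert.h>

/* Cycles to complete n instructions on a k-stage pipeline that accepts one
 * new instruction per clock and never stalls: k cycles for the first
 * instruction, then one more cycle for each of the remaining n - 1. */
unsigned long pipelined_cycles(unsigned long n, unsigned long k)
{
    assert(k >= 1);
    return n == 0 ? 0 : k + n - 1;
}

/* The same work with no overlap: each instruction occupies all k cycles. */
unsigned long unpipelined_cycles(unsigned long n, unsigned long k)
{
    assert(k >= 1);
    return n * k;
}
```

With an assumed k of 4, one hundred instructions need 103 cycles when pipelined and 400 cycles when not, so the sustained rate approaches one instruction per clock as n grows.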
As we mentioned above in the section on pipelining, branch instructions, especially conditional branch instructions, pose a bottleneck when trying to overlap instruction execution. This is because the branch condition must first be ascertained to be true; for example, “branch on plus” must test the N flag to ascertain that it is cleared. The branch address must then be computed, which often involves address arithmetic. Only then can the PC be loaded with the branch address.
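The three steps just described (test the condition, compute the address, load the PC) can be sketched as follows. The BranchState type and branch_on_plus function are hypothetical, and the sketch assumes 4-byte instructions and a PC-relative displacement; neither assumption comes from the text.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical processor state for this sketch. */
typedef struct {
    uint32_t pc;    /* address of the branch instruction itself */
    bool n_flag;    /* set when the most recent result was negative */
} BranchState;

/* "Branch on plus": taken only when the N flag is clear.
 * Returns the value to be loaded into the PC. */
uint32_t branch_on_plus(const BranchState *s, int32_t displacement)
{
    assert(s != NULL);
    bool taken = !s->n_flag;                          /* step 1: test the condition */
    uint32_t target = s->pc + (uint32_t)displacement; /* step 2: address arithmetic */
    return taken ? target : s->pc + 4u;               /* step 3: choose the new PC  */
}
```

Each step depends on the one before it, which is why a conditional branch tends to hold up the instructions that follow until all three have completed.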
The 601 attacks this problem in several ways. First, as mentioned above, there are eight 4-bit condition code registers instead of the usual one. This allows up to eight instructions to have separate condition code bits, and therefore not interfere with each other’s ability to set condition codes. Second, the BPU looks in the instruction queue, and if it finds a conditional branch instruction it proceeds to compute the branch target address ahead of time, and fetches instructions at the branch target. If the branch is taken, this results in effectively a zero-cycle branch, since the instruction at the branch target has already been fetched in anticipation of the branch condition being satisfied. The BPU also has a link register (LR) in which it can store subroutine return addresses, thus saving one GPR, as well as several other registers used for special purposes. Note that the BPU can issue its addresses over a separate bus directly to the MMU and Memory Unit for prefetching instructions.
The RTC unit shown in the figure is a Real Time Clock which has a calendar range of 137 years, with an accuracy of 1 ns.
The MMU and Memory Unit assist in fetching both instructions and data. Note the separate path for data items that goes from the cache directly to the GPR and FPR register files.
The PowerPC 601 and its more powerful descendants are typical of the modern general purpose microprocessor. Current microprocessor families are superscalar in design, often having several of each kind of execution unit.
10.7 VLIW Machines