Multiple Instruction Issue (Superscalar) Machines – The PowerPC 601 Case Study: The PowerPC™ 601 as a Superscalar Architecture

CHAPTER 10 TRENDS IN COMPUTER ARCHITECTURE 423

and simulation, and parallel scientific applications. As programmers of language XYZ (whatever language that may be), we sometimes find ourselves wanting to write XYZ programs that make calls to Fortran routines. This is easy to do once we understand what is happening. There are two significant issues that need to be addressed: (1) differences in routine labels; (2) differences in subroutine linkage.

In Fortran, the source code labels are prepended with two underscores in the assembly code. A C program (if C is language XYZ) that makes a call to Fortran routine add_two would then make a call to __add_two, which also must be declared as external in the C source code and declared as global in the Fortran program. If all of the parameters that are passed to the Fortran routines are pointers, then everything will work correctly. If any scalars are passed, there will be trouble, because C (like Java) uses call-by-value for scalars whereas Fortran uses call-by-reference. We need to “trick” the C compiler into using call-by-reference by making it explicit: wherever a Fortran routine expects a scalar in its argument list, we pass a pointer to the scalar in the C code.

As a practical consideration, the gcc compiler will compile Fortran programs. It knows what to do by observing the extension of the source file, which should be .f for Fortran.

Now, let us take a look at how an optimizing compiler improves the code. Figure 10-11 shows the optimized code using the compiler’s -O flag. Notice there is not a single nop, ld, or st instruction. Wasted cycles devoted to nop instructions have been reclaimed, and memory references devoted to stack manipulation have been eliminated.

10.5 Multiple Instruction Issue (Superscalar) Machines – The PowerPC 601

    ! Output produced by -O optimization for gcc compiler
            .file   "add.c"
            .section ".rodata"
            .align 8
    .LLC0:
            .asciz  "c = %d\n"
            .section ".text"
            .align 4
            .global main
            .type   main,#function
            .proc   04
    main:
            !#PROLOGUE# 0
            save    %sp,-112,%sp
            !#PROLOGUE# 1
            mov     10,%o0
            call    add_two,0
            mov     4,%o1
            mov     %o0,%o1
            sethi   %hi(.LLC0),%o0
            call    printf,0
            or      %o0,%lo(.LLC0),%o0
            ret
            restore
    .LLfe1:
            .size   main,.LLfe1-main
            .align 4
            .global add_two
            .type   add_two,#function
            .proc   04
    add_two:
            !#PROLOGUE# 0
            !#PROLOGUE# 1
            retl
            add     %o0,%o1,%o0
    .LLfe2:
            .size   add_two,.LLfe2-add_two
            .ident  "GCC: (GNU) 2.7.2"

Figure 10-11 SPARC code generated with the -O optimization flag.

In the earlier pipelining discussion, we saw how several instructions can be in various phases of execution at once. Here, we look at superscalar architecture, where, with separate execution units, several instructions can be executed simultaneously. In a superscalar architecture, there might be one or more separate Integer Units (IUs), Floating Point Units (FPUs), and Branch Processing Units (BPUs). This implies that instructions need to be scheduled into the various execution units, and further, that instructions might be executed out-of-order.

Out-of-order execution means that instructions need to be examined prior to dispatching them to an execution unit, not only to determine which unit should execute them, but also to determine whether executing them out of order would result in an incorrect program, because of dependencies between the instructions. This in turn implies an Instruction Unit, IU, that can prefetch instructions into an instruction queue, determine the kinds of instructions and the dependence relations among them, and schedule them into the various execution units.
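The examination that precedes dispatch can be made concrete with a small sketch. The Python below is an illustrative model only, not the 601's issue logic: the `Instr` record, the register names, and the rule of one unit of each kind are all assumptions made for the example. It classifies the dependence between two instructions as read-after-write (RAW), write-after-read (WAR), or write-after-write (WAW), and allows simultaneous issue only when the two instructions use different units and no dependence exists.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Instr:
    """A hypothetical, simplified instruction: target unit plus register sets."""
    name: str
    unit: str                        # "IU", "FPU", or "BPU"
    reads: frozenset = frozenset()   # registers the instruction reads
    writes: frozenset = frozenset()  # registers the instruction writes


def hazards(earlier: Instr, later: Instr) -> set:
    """Return the dependence kinds that forbid moving `later` ahead of `earlier`."""
    found = set()
    if earlier.writes & later.reads:
        found.add("RAW")  # later needs a value that earlier produces
    if earlier.reads & later.writes:
        found.add("WAR")  # later overwrites a value that earlier still needs
    if earlier.writes & later.writes:
        found.add("WAW")  # both write one register; the final value depends on order
    return found


def can_issue_together(a: Instr, b: Instr) -> bool:
    """Assumes one unit of each kind: issue together only if units differ
    and the pair has no dependence."""
    return a.unit != b.unit and not hazards(a, b)


# Illustrative instructions (register names are assumptions for the example).
add = Instr("add r3,r1,r2", "IU", frozenset({"r1", "r2"}), frozenset({"r3"}))
fmul = Instr("fmul f3,f1,f2", "FPU", frozenset({"f1", "f2"}), frozenset({"f3"}))
sub = Instr("sub r4,r3,r1", "IU", frozenset({"r3", "r1"}), frozenset({"r4"}))
```

Here `add` and `fmul` touch disjoint registers and different units, so they may issue in the same cycle, whereas `sub` reads the `r3` that `add` writes and must wait for it.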

10.6 Case Study: The PowerPC™ 601 as a Superscalar Architecture

As an example of a modern superscalar architecture let us examine the Motorola PowerPC™ 601. The 601 has actually been superseded by more powerful members of the PowerPC family, but it will serve to illustrate the important features of superscalar architectures without introducing unnecessary complexity into our discussion.

10.6.1 INSTRUCTION SET ARCHITECTURE OF THE POWERPC 601

The 601 is a 32-bit general register RISC machine whose ISA includes:

• 32 32-bit general purpose integer registers (GPRs);
• 32 64-bit floating point registers (FPRs);
• 8 4-bit condition code registers;
• nearly 50 special-purpose 32-bit registers that are used to control various aspects of memory management and the operating system;
• over 250 instructions (many of which are special-purpose).

10.6.2 HARDWARE ARCHITECTURE OF THE POWERPC 601

Figure 10-12, taken from the Motorola PowerPC 601 user’s manual, shows the microarchitecture of the 601. The flow of instructions and data proceeds via the System Interface, shown at the bottom of the figure, into the 32-KByte cache. From there, instructions are fetched eight at a time into the Instruction Unit, shown at the top of the figure. The issue logic within the Instruction Unit examines the instructions in the queue for their kind and dependence, and issues them into one of the three execution units: IU, BPU, or FPU.

The IU is shown as containing the GPR file and an integer exception register, XER, which holds information about exceptions that arise within the IU.
The IU can execute most integer instructions in a single clock cycle, thus obviating the need for any kind of pipelining of integer instructions. The FPU contains the FPRs and the floating point status and control register (FPSCR). The FPSCR contains information about floating point exceptions and the type of result produced by the FPU. The FPU is pipelined, and most FP instructions can be issued at a rate of one per clock.

Figure 10-12 The PowerPC 601 architecture (adapted from the Motorola PowerPC 601 user manual). [Block diagram: Instruction Unit (instruction queue, issue logic); BPU (CTR, CR, LR); IU (GPR file); FPU (FPR file, FPSCR); Real Time Clock; MMU (UTLB, ITLB, BAT array); 32-KB instruction and data cache with tag memory; Memory Unit (read queue, write queue, snoop); System Interface.]

As we mentioned above in the section on pipelining, branch instructions, especially conditional branch instructions, pose a bottleneck when trying to overlap instruction execution. This is because the branch condition must first be ascertained to be true; for example, “branch on plus” must test the N flag to ascertain that it is cleared. The branch address must then be computed, which often involves address arithmetic. Only then can the PC be loaded with the branch address.

The 601 attacks this problem in several ways. First, as mentioned above, there are eight 4-bit condition code registers instead of the usual one. This allows up to eight instructions to have separate condition code bits, and therefore not interfere with each other’s ability to set condition codes.
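To see why separate condition code fields avoid interference, consider the following minimal Python sketch. It assumes the eight 4-bit fields are packed into a single 32-bit word with field 0 in the most significant bits, and that each field holds the bits LT, GT, EQ, SO in that order, as in the PowerPC condition register. The helper names are hypothetical, and the SO (summary overflow) bit is simply left at 0 here.

```python
CR_FIELDS = 8    # number of condition code fields
FIELD_BITS = 4   # bits per field: LT, GT, EQ, SO


def _shift(n: int) -> int:
    """Bit offset of field n, with field 0 in the most significant nibble."""
    if not 0 <= n < CR_FIELDS:
        raise ValueError("field index must be in 0..7")
    return FIELD_BITS * (CR_FIELDS - 1 - n)


def get_field(cr: int, n: int) -> int:
    """Extract the 4-bit condition code held in field n."""
    return (cr >> _shift(n)) & 0xF


def set_field(cr: int, n: int, value: int) -> int:
    """Return cr with field n replaced by value; other fields are untouched."""
    mask = 0xF << _shift(n)
    return (cr & ~mask & 0xFFFFFFFF) | ((value & 0xF) << _shift(n))


def compare(a: int, b: int) -> int:
    """Signed compare producing LT, GT, EQ bits (SO left at 0 in this sketch)."""
    if a < b:
        return 0b1000
    if a > b:
        return 0b0100
    return 0b0010


cr = 0
cr = set_field(cr, 0, compare(1, 2))  # first compare writes field 0
cr = set_field(cr, 1, compare(5, 5))  # second compare writes field 1
print(f"{cr:08x}")                    # prints 82000000
```

The second compare leaves field 0 intact, so a branch that depends on the first compare can still test its result after later compare instructions have executed.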
The BPU looks in the instruction queue, and if it finds a conditional branch instruction it proceeds to compute the branch target address ahead of time, and fetches instructions at the branch target. If the branch is taken, this results in what is effectively a zero-cycle branch, since the instruction at the branch target has already been fetched in anticipation of the branch condition being satisfied. The BPU also has a link register (LR) in which it can store subroutine return addresses, thus saving one GPR; it has several other registers used for special purposes as well. Note that the BPU can issue its addresses over a separate bus directly to the MMU and Memory Unit for prefetching instructions.

The RTC unit shown in the figure is a Real Time Clock which has a calendar range of 137 years, with an accuracy of 1 ns.

The MMU and Memory Unit assist in fetching both instructions and data. Note the separate path for data items that goes from the cache directly to the GPR and FPR register files.

The PowerPC 601 and its more powerful descendants are typical of the modern general purpose microprocessor. Current microprocessor families are superscalar in design, often having several of each kind of execution unit.
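As a rough illustration of the BPU's look-ahead described earlier in this section, the Python sketch below scans an instruction queue for the first conditional branch and computes its target before the branch reaches execution. The `QueuedInstr` record and the use of the mnemonic `bc` to mark a conditional branch are assumptions made for the example, and the target is taken to be the branch's own address plus a signed byte displacement (relative addressing).

```python
from typing import NamedTuple, Optional, Sequence


class QueuedInstr(NamedTuple):
    """A hypothetical entry in the instruction queue."""
    address: int           # byte address of the instruction
    opcode: str            # "bc" marks a conditional branch in this sketch
    displacement: int = 0  # signed byte offset, meaningful only for branches


def first_branch_target(queue: Sequence[QueuedInstr]) -> Optional[int]:
    """Return the target of the first conditional branch in the queue, or None.

    A real BPU would hand this address to the memory system so that the
    target instructions are fetched before the condition is resolved.
    """
    for instr in queue:
        if instr.opcode == "bc":
            return instr.address + instr.displacement
    return None


queue = [
    QueuedInstr(0x100, "add"),
    QueuedInstr(0x104, "bc", -0x20),  # backward branch, e.g. the end of a loop
    QueuedInstr(0x108, "fmul"),
]
target = first_branch_target(queue)
```

If the branch is then taken, execution can continue from the already fetched instruction at `target`, which is what makes the branch effectively free.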

10.7 VLIW Machines