Trends in Computer Architecture: Pipelining the Datapath

• The RISC instruction set should be designed with pipelined architectures in mind.
• There is no requirement that CISC instructions be maintained as integrated wholes; they can be decomposed into sequences of simpler RISC instructions.

The result is that RISC architectures have characteristics that distinguish them from CISC architectures:

• All instructions are of fixed length, one machine word in size.
• All instructions perform simple operations that can be issued into the pipeline at a rate of one per clock cycle. Complex operations are now composed of simple instructions by the compiler.
• All operands must be in registers before being operated upon. There is a separate class of memory access instructions: LOAD and STORE. This is referred to as a LOAD-STORE architecture.
• Addressing modes are limited to simple ones. Complex addressing calculations are built up using sequences of simple operations.
• There should be a large number of general registers for arithmetic operations, so that temporary variables can be stored in registers rather than on a stack in memory.

In the next few sections, we explore additional motivations for RISC architectures, and special characteristics that make RISC architectures effective.

10.3 Pipelining the Datapath

The flow of instructions through a pipeline follows the steps normally taken when an instruction is executed. In the discussion below we consider how three classes of instructions (arithmetic, branch, and load-store) are executed, and then we relate this to how the instructions are pipelined.

10.3.1 ARITHMETIC, BRANCH, AND LOAD-STORE INSTRUCTIONS

Consider the "normal" sequence of events when an arithmetic instruction is executed in a load-store machine:

1. Fetch the instruction from memory;
2. Decode the instruction (it is an arithmetic instruction, but the CPU has to find that out through a decode operation);
3. Fetch the operands from the register file;
4. Apply the operands to the ALU;
5. Write the result back to the register file.

There are similar patterns for other instruction classes. For branch instructions the sequence is:

1. Fetch the instruction from memory;
2. Decode the instruction (it is a branch instruction);
3. Fetch the components of the address from the instruction or register file;
4. Apply the components of the address to the ALU (address arithmetic);
5. Copy the resulting effective address into the PC, thus accomplishing the branch.

The sequence for load and store instructions is:

1. Fetch the instruction from memory;
2. Decode the instruction (it is a load or store instruction);
3. Fetch the components of the address from the instruction or register file;
4. Apply the components of the address to the ALU (address arithmetic);
5. Apply the resulting effective address to memory along with a read (load) or write (store) signal. If it is a write signal, the data item to be written must also be retrieved from the register file.

The three sequences above show a high degree of similarity in what is done at each stage: (1) fetch, (2) decode, (3) operand fetch, (4) ALU operation, (5) result writeback.
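The commonality among the three sequences can be made explicit in a short sketch. This is an illustration only: the phase names follow the text, but the step descriptions are paraphrases, not real instruction encodings.

```python
# The three execution sequences share one five-phase skeleton.
# Each instruction class takes exactly one action per phase.

PHASES = ["fetch", "decode", "operand fetch", "ALU operation", "writeback"]

SEQUENCES = {
    "arithmetic": ["fetch instruction", "decode", "fetch operands from registers",
                   "apply operands to ALU", "write result to register file"],
    "branch":     ["fetch instruction", "decode", "fetch address components",
                   "compute effective address in ALU", "copy address into PC"],
    "load/store": ["fetch instruction", "decode", "fetch address components",
                   "compute effective address in ALU", "apply address to memory"],
}

# Every class fills all five phases, which is what makes a single
# shared pipeline structure possible.
assert all(len(steps) == len(PHASES) for steps in SEQUENCES.values())
print("all classes use", len(PHASES), "phases")
```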
These five phases are similar to the four phases discussed in Chapters 4 and 6, except that we have refined the fourth phase, "execute," into two subphases: ALU operation and writeback, as illustrated in Figure 10-3. A result writeback is not always needed, and one way to deal with this is to have two separate subphases (ALU and writeback) with a bypass path for situations when a writeback is not needed. For this discussion, we take a simpler approach, and force all instructions to go entirely through each phase, whether or not that is actually needed.

10.3.2 PIPELINING INSTRUCTIONS

In practice, each CPU designer approaches the design of the pipeline from a different perspective, depending upon the particular design goals and instruction set. For example, the original SPARC implementation had only four pipeline stages, while some floating point pipelines may have a dozen or more stages.

Each of the execution units performs a different operation in the fetch-execute cycle. After the Instruction Fetch unit finishes its task, the fetched instruction is handed off to the Decode unit. At this point, the Instruction Fetch unit can begin fetching the next instruction, which overlaps with the decoding of the previous instruction. When the Instruction Fetch and Decode units complete their tasks, they hand off the remaining tasks to the next units (Operand Fetch is the next unit for Decode). The flow of control continues until all units are filled.

10.3.3 KEEPING THE PIPELINE FILLED

Notice an important point: although it takes multiple steps to execute an instruction in this model, on average, one instruction can be executed per cycle as long as the pipeline stays filled. The pipeline does not stay filled, however, unless we are careful as to how instructions are ordered. We know from Figure 10-1 that approximately one in every four instructions is a branch.
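The overlapped flow described above can be sketched as a toy scheduler. This is a hypothetical illustration assuming an ideal, stall-free four-stage pipeline; the function name and instruction list are ours, not from any real design.

```python
# Toy model of an ideal four-stage pipeline: instruction i occupies
# stage s during cycle i + s + 1, so a new instruction enters
# Instruction Fetch every cycle while earlier ones move down the pipe.

STAGES = ["Instruction Fetch", "Decode", "Operand Fetch", "Execute"]

def pipeline_schedule(instructions):
    """Map each cycle number to the {stage: instruction} pairs active in it."""
    schedule = {}
    for i, instr in enumerate(instructions):
        for s, stage in enumerate(STAGES):
            schedule.setdefault(i + s + 1, {})[stage] = instr
    return schedule

sched = pipeline_schedule(["addcc", "ld", "srl", "subcc"])
print(sched[4])  # by cycle 4 the pipeline is full: all four stages are busy
```

Note that four instructions complete in 4 + 4 - 1 = 7 cycles rather than 16, which is the source of the "one instruction per cycle" average once the pipeline is full.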
We cannot fetch the instruction that follows a branch until the branch completes execution. Thus, as soon as the pipeline fills, a branch is encountered, and then the pipeline has to be flushed by filling it with no-operations (NOPs). These NOPs are sometimes referred to as pipeline bubbles.

Figure 10-3: Four-stage instruction pipeline. (Stages: Instruction Fetch, Decode, Operand Fetch, Execute, where Execute comprises the ALU Op. and Writeback subphases.)

A similar situation arises with the LOAD and STORE instructions. They generally require an additional clock cycle in which to access memory, which has the effect of expanding the Execute phase from one cycle to two cycles at times. The "wait" cycles are filled with NOPs.

Figure 10-4 illustrates the pipeline behavior during a memory reference and also during a branch for the ARC instruction set. The addcc instruction enters the pipeline on time step (cycle) 1. On cycle 2, the ld instruction, which references memory, enters the pipeline and addcc moves to the Decode stage. The pipeline continues filling with the srl and subcc instructions on cycles 3 and 4, respectively. On cycle 4, the addcc instruction is executed and leaves the pipeline. On cycle 5, the ld instruction reaches the Execute level, but does not finish execution because an additional cycle is needed for memory references. The ld instruction finishes execution during cycle 6.

Branch and Load Delay Slots

The ld and st instructions both require five cycles, but the remaining instructions require only four. Thus, an instruction that follows an ld or st should not use the register that is being loaded or stored. A safe approach is to insert a NOP after an ld or an st, as shown in Figure 10-5a. The extra cycle (or cycles, depending on the architecture) for a load is known as a delayed load, since the data from the load is not immediately available on the next cycle.
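The "safe approach" of inserting a NOP after each load or store can be sketched as a single pass over the instruction stream. The helper below is hypothetical, and the instruction strings are a toy representation rather than real assembler output.

```python
# Insert a nop after every ld or st, so that the instruction that
# follows never reads a register whose load has not yet completed.

def insert_load_delay_nops(program):
    out = []
    for instr in program:
        out.append(instr)
        if instr.split()[0] in ("ld", "st"):
            out.append("nop")  # fill the load delay slot defensively
    return out

print(insert_load_delay_nops(
    ["addcc r1, 10, r1", "ld r1, r2", "subcc r2, r4, r4"]))
```

This is safe but wasteful: every ld or st now costs an extra issue slot even when the next instruction did not conflict, which is why compilers try to fill the slot with useful work instead.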
Figure 10-4: Pipeline behavior during a memory reference and a branch. (The stages Instruction Fetch, Decode, Operand Fetch, and Execute are traced over cycles 1 through 8 for the sequence addcc, ld, srl, subcc, be; the ld occupies Execute for an extra memory-reference cycle, and nops trail the be.)

A delayed branch is similar, as shown for the be instruction in cycles 5 through 8 of Figure 10-4. The position occupied by this NOP instruction is known as a load delay slot or a branch delay slot, respectively.

It is often possible for the compiler to find a nearby instruction to fill the delay slot. In Figure 10-5a, the srl instruction can be moved to the position of the nop, since its register usage does not conflict with the surrounding code, and reordering instructions this way does not impact the result. After replacing the nop line with the srl line, the code shown in Figure 10-5b is obtained. This is the code that is traced through the pipeline in Figure 10-4.

Speculative Execution of Instructions

An alternative approach to dealing with branch behavior in pipelines is to simply guess which way the branch will go, and then undo any damage if the wrong path is taken. Statistically, loops are executed more often than not, and so it is usually a good guess to assume that a branch that exits a loop will not be taken. Thus, a processor can start processing the next instruction in anticipation of the direction of the branch. If the branch goes the wrong way, then the execution phase for the next instruction, and any subsequent instructions that enter the pipeline, can be stopped so that the pipeline can be flushed. This approach works well for a number of architectures, particularly those with slow cycle speeds or deep pipelines. For RISCs, however, the overhead of determining when a branch goes the wrong way, and then cleaning up any side effects caused by wrong instructions entering the pipeline, is generally too great.
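The compiler trick of filling a delay slot can be sketched as below. This is a deliberately crude model, not a real scheduler: it treats every register mentioned in an instruction as both read and written, so its conflict test is conservative, and the helper names are ours.

```python
# Fill the first nop delay slot with a later instruction whose
# registers do not conflict with the code between the slot and it
# (including the ld just before the slot).

import re

def regs(instr):
    """All register tokens (r0, r1, ...) mentioned in an instruction."""
    return set(re.findall(r"\br\d+\b", instr))

def fill_delay_slot(program):
    prog = list(program)
    if "nop" not in prog:
        return prog
    i = prog.index("nop")
    for j in range(i + 1, len(prog)):
        candidate = prog[j]
        if candidate == "nop" or candidate.split()[0].startswith("b"):
            continue  # never move branches or other nops into the slot
        surrounding = prog[max(i - 1, 0):j]
        if all(regs(candidate).isdisjoint(regs(x)) for x in surrounding):
            prog[i] = candidate  # migrate the instruction into the slot
            del prog[j]
            break
    return prog

prog = ["addcc r1, 10, r1", "ld r1, r2", "nop",
        "subcc r2, r4, r4", "be label", "srl r3, r5"]
print(fill_delay_slot(prog))  # srl moves into the slot, as in Figure 10-5b
```

On this input the subcc candidate is rejected (it reads r2, which the ld writes), and srl is chosen, reproducing the transformation from Figure 10-5a to 10-5b.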
The nop instruction is normally used in RISC pipelines when something useful cannot be found to replace it.

(a) with a nop inserted:
    addcc r1, 10, r1
    ld    r1, r2
    nop
    subcc r2, r4, r4
    be    label
    srl   r3, r5

(b) with srl migrated to the nop position:
    addcc r1, 10, r1
    ld    r1, r2
    srl   r3, r5
    subcc r2, r4, r4
    be    label

Figure 10-5: SPARC code, (a) with a nop inserted, and (b) with srl migrated to the nop position.

EXAMPLE: ANALYSIS OF PIPELINE EFFICIENCY

In this example, we analyze the efficiency of a pipeline. A processor has a five-stage pipeline. If a branch is taken, then four cycles are needed to flush the pipeline. The branch penalty b is thus 4. The probability P_b that a particular instruction is a branch is .25. The probability P_t that the branch is taken is .5. We would like to compute the average number of cycles needed to execute an instruction, and the execution efficiency.

When the pipeline is filled and there are no branches, then the average number of cycles per instruction, CPI_No_Branch, is 1. The average number of cycles per instruction when there are branches is then:

    CPI_Avg = (1 - P_b)(CPI_No_Branch) + P_b[P_t(1 + b) + (1 - P_t)(CPI_No_Branch)] = 1 + bP_bP_t.

After making substitutions, we have:

    CPI_Avg = (1 - .25)(1) + .25[.5(1 + 4) + (1 - .5)(1)] = 1.5 cycles.

The execution efficiency is the ratio of the cycles per instruction when there are no branches to the cycles per instruction when there are branches. Thus we have:

    Execution efficiency = CPI_No_Branch / CPI_Avg = 1/1.5 = 67%.

The processor runs at 67% of its potential speed as a result of branches, but this is still much better than the five cycles per instruction that might be needed without pipelining.

There are techniques for improving the efficiency. As stated above, we know that loops are usually executed more than once, so we can guess that a branch out of a loop will not be taken and be right most of the time.
We can also run simulations on the non-loop branches, get a statistical sampling of which branches are likely to be taken, and then guess the branches accordingly. As explained above, this approach works best when the pipeline is deep or the clock rate is slow. ■
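The arithmetic in the example above can be checked with a few lines of code. The function name is ours; the formula is the one derived in the example.

```python
# Average CPI in a pipeline with branch penalty b:
# CPI_Avg = (1 - Pb)*CPI + Pb*[Pt*(1 + b) + (1 - Pt)*CPI]
# which reduces to 1 + b*Pb*Pt when CPI_No_Branch = 1.

def average_cpi(p_branch, p_taken, penalty, cpi_no_branch=1.0):
    return ((1 - p_branch) * cpi_no_branch
            + p_branch * (p_taken * (1 + penalty)
                          + (1 - p_taken) * cpi_no_branch))

cpi = average_cpi(p_branch=0.25, p_taken=0.5, penalty=4)
efficiency = 1.0 / cpi
print(cpi, round(efficiency, 2))  # 1.5 cycles per instruction, efficiency 0.67
```

Reducing any of the three factors helps: halving the taken-branch probability P_t (for instance, through the prediction techniques just described) drops the average CPI from 1.5 to 1.25.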

10.4 Overlapping Register Windows