Maximum Speedup ≤≤ Number of stages Speedup ≤ Time for unpipelined operation Time for longest stage

Lecture 13: Pipeline Control and Exceptions Professor Mike Schulte Computer Architecture ECE 201 Recap: Ideal Pipelining

  IF DCD EX MEM WB

  IF DCD EX MEM WB

  IF DCD EX MEM WB

  IF DCD EX MEM WB

  IF DCD EX MEM WB

  Example: 40ns data path, 5 stages, Longest stage is 10 ns, Speedup

  ≤

  4 Assume instructions

  are completely independent!

Maximum Speedup ≤≤ Number of stages Speedup ≤ ≤ Time for unpipelined operation Time for longest stage

Recap: Control Diagram

  IR <- Mem[PC]; PC < PC+4; A <- R[rs]; B<– R[rt] S <– A + B; S <– A or ZX; S <– A + SX; S <– A + SX; If Cond PC < PC+SX; M <– S M <– Mem[S] Mem[S] <- B M <– S R[rd] <– S; R[rt] <– S; R[rd] <– M;

  Equal .

  A M S Reg File

  Reg File

  IR Exec PC

  . Mem B

  Next PC Inst

  D Mem

  Data Mem Access

But recall use of “Data Stationary Control”

  ° The Main Control generates the control signals during

Reg/Dec

  • Control signals for Exec (ExtOp, ALUSrc, ...) are used 1 cycle later
  • Control signals for Mem (MemWr Branch) are used 2 cycles later
  • Control signals for Wr (MemtoReg MemWr) are used 3 cycles later

  Reg/Dec Exec Mem Wr ExtOp ExtOp ALUSrc ALUSrc

  Ex/ Mem

  IF/ID Register

  ID/Ex Register ALUOp ALUOp

  Mem Main /Wr

  RegDst RegDst Control Register Register

  MemW MemW MemW r r r Branch Branch Branch MemtoReg MemtoReg MemtoReg MemtoReg

  RegWr RegWr RegWr RegWr

Datapath + Data Stationary Control

  im

  Let’s Try it Out 10 lw r1, r2(35) 14 addI r2, r2, 3 20 sub r3, r4, r5 24 beq r6, r7, 100 30 ori r8, r9, 17 34 add r10, r11, r12 100 and r13, r14, 15 these addresses are octal

  v

  wb rw

  v

  me wb rw

  v

  ex me wb rw

  Exec Reg .

  File Mem

  M

  WB Ctrl

  Mem Ctrl

  D Decode

  IR Inst . Mem

  PC Next PC

  A B S Reg File

  Access Data Mem

  rs rt op rs rt fun

Start: Fetch 10

  Exec Reg .

  File Mem

  Access Data Mem

  A B S Reg File

  IR Inst . Mem

  D Decode

  Mem Ctrl

  WB Ctrl

  M rs rt im

  10 lw r1, r2(35) 14 addI r2, r2, 3 20 sub r3, r4, r5 24 beq r6, r7, 100 30 ori r8, r9, 17 34 add r10, r11, r12 100 and r13, r14, 15 n n n n

  IF

  PC Next PC

  10

  =

Fetch 14, Decode 10

  File Mem

  10 lw r1, r2(35) 14 addI r2, r2, 3 20 sub r3, r4, r5 24 beq r6, r7, 100 30 ori r8, r9, 17 34 add r10, r11, r12 100 and r13, r14, 15 n n n lw r1, r2(35)

  Exec Reg .

  =

  14

  PC Next PC

  IF

  ID

  im

  A B S Reg File

  2 rt

  M

  WB Ctrl

  Mem Ctrl

  D Decode

  IR Inst . Mem

  Access Data Mem

Fetch 20, Decode 14, Exec 10

  2 rt

  =

  20

  PC Next PC

  IF EX

  ID

  35 10 lw r1, r2(35) 14 addI r2, r2, 3 20 sub r3, r4, r5 24 beq r6, r7, 100 30 ori r8, r9, 17 34 add r10, r11, r12 100 and r13, r14, 15 n n lw r1 addI r2, r2, 3

  Exec Reg .

  File Mem

  WB Ctrl

  Mem Ctrl

  D Decode

  IR Inst . Mem

  B S Reg File

  r2

  Access Data Mem

  M

Fetch 24, Decode 20, Exec 14, Mem 10

  File Mem

  5

  Exec Reg .

  =

  24

  PC Next PC

  IF EX M

  ID

  3 10 lw r1, r2(35) 14 addI r2, r2, 3 20 sub r3, r4, r5 24 beq r6, r7, 100 30 ori r8, r9, 17 34 add r10, r11, r12 100 and r13, r14, 15 n lw r1 sub r3, r4, r5 addI r2, r2, 3

  4

  r2

  M

  WB Ctrl

  Mem Ctrl

  D Decode

  IR Inst . Mem

  Reg File

  r2+35

  B

  Access Data Mem

Fetch 30, Dcd 24, Ex 20, Mem 14, WB 10

  Exec Reg .

  Note Delayed Branch: always execute ori after beq

  =

  30

  PC Next PC

  IF EX M WB

  ID

  7 10 lw r1, r2(35) 14 addI r2, r2, 3 20 sub r3, r4, r5 24 beq r6, r7, 100 30 ori r8, r9, 17 34 add r10, r11, r12 100 and r13, r14, 15

lw

r1

beq r6, r7 100 addI r2 sub r3

  

M[r2+35]

  6

  WB Ctrl

  Mem Ctrl

  D Decode

  IR Inst . Mem

  Reg File

  r4 r5 r2+3

  Access Data Mem

  File Mem

Fetch 100, Dcd 30, Ex 24, Mem 20, WB 14

  Access Data Mem

  100 ori r8, r9 17

  File Mem

  =

  100

  PC Next PC

  IF EX M WB

  ID

  10 lw r1, r2(35) 14 addI r2, r2, 3 20 sub r3, r4, r5 24 beq r6, r7, 100 30 ori r8, r9, 17 34 add r10, r11, r12 100 and r13, r14, 15 beq addI

r2

sub r3 r4-r5

  r6 r7

r2+3

  r1=M[r2+35] 9 xx

  WB Ctrl

  Mem Ctrl

  D Decode

  IR Inst . Mem

  Reg File

  Exec Reg .

Fetch 104, Dcd 100, Ex 30, Mem 24, WB 20

  Exec Reg .

  File Mem

  Access Data Mem

  Reg File

  IR Inst . Mem

  D Decode

  Mem Ctrl

  WB Ctrl

  10 lw r1, r2(35) 14 addI r2, r2, 3 20 sub r3, r4, r5 24 beq r6, r7, 100 30 ori r8, r9, 17 34 add r10, r11, r12 100 and r13, r14, 15

  ID EX M WB

  PC Next PC

  ___

  = Fill it in yourself!

  ?

Fetch 110, Dcd 104, Ex 100, Mem 30, WB 24

  Access Data Mem

  Reg File

  IR Inst . Mem

  D Decode

  Mem Ctrl

  WB Ctrl

  10 lw r1, r2(35) 14 addI r2, r2, 3 20 sub r3, r4, r5 24 beq r6, r7, 100 30 ori r8, r9, 17 34 add r10, r11, r12 100 and r13, r14, 15

  EX M WB

  PC Next PC

  ___

  = Fill it in yourself!

  ? ? ? ?

  Exec Reg .

  File Mem

Fetch 114, Dcd 110, Ex 104, Mem 100, WB 30

  Exec Reg .

  ? ? ? ?

   IF DCD EX Mem WB

   IF DCD EX Mem WB

  

IF DCD OF Ex RS WAR Data Hazard

(write after read)

   (write after write)

   IF DCD OF Ex Mem

RAW (read after write) Data Hazard

WAW Data Hazard

   IF DCD EX Mem WB

  IFetch DCD ° ° °

Control Hazard

  IFetch DCD ° ° ° Structural Hazard

  ? ?

  = Fill it in yourself!

  File Mem

  ___

  PC Next PC

  M WB

  10 lw r1, r2(35) 14 addI r2, r2, 3 20 sub r3, r4, r5 24 beq r6, r7, 100 30 ori r8, r9, 17 34 add r10, r11, r12 100 and r13, r14, 15

  WB Ctrl

  Mem Ctrl

  D Decode

  IR Inst . Mem

  Reg File

  Access Data Mem

Pipeline Hazards Again I-Fet ch DCD MemOpFetch OpFetch Exec Store

Data Hazards

  ° Avoid some “by design”

  • eliminate WAR by always fetching operands early (DCD) in pipe
  • eleminate WAW by doing all WBs in order (last stage, static)
  • stall or forward (if possible)

  ° Detect and resolve remaining ones

   IF DCD EX Mem WB

   IF DCD OF Ex Mem RAW Data Hazard WAW Data Hazard

  

IF DCD OF Ex RS RAW Data Hazard

   IF DCD EX Mem WB

IF DCD EX Mem WB Hazard Detection

  ° Suppose instruction i is about to be issued and a predecessor instruction j is in the instruction pipeline. ° A RAW hazard exists on register

Wregs( j )

  ρ ρ if

  ρ ∈ ρ ∈

Rregs( i )

  ∩ ∩

  • Keep a record of pending writes (for inst's in the pipe) and compare with operand regs of current instruction.
  • When instruction issues, reserve its result register.
  • When on operation completes, remove its write reservation.

  ° A WAW hazard exists on register ρ ρ if ρ ∈ ρ ∈ Wregs( i ) ∩ ∩ Wregs( j ) ° A WAR hazard exists on register ρ ρ if ρ ∈ ρ ∈ Wregs( i ) ∩ ∩ Rregs( j )

Record of Pending Writes

  IAU

  npc ° Current operand

registers

  I mem

  ° Pending writes

  Regs op rw rs rt PC

  

° hazard <=

B A im n op rw

  ((rs == rw & regW ) OR

  ex) ex

  ((rs == rw & regW ) OR

  mem) me

  alu ((rs == rw & regW ) OR

  wb) wb S n op rw ((rt == rw & regW ) OR ex) ex

  ((rt == rw & regW ) OR

  mem) me

  D mem ((rt == rw ) & regW )

  wb wb m n op rw

  Regs

Resolve RAW by forwarding

  IAU

  ° Detect nearest valid write op npc operand register and forward into

  I mem

  op latches, bypass ing

  Regs

  remainder of the

  op rw rs rt PC

Forward pipe mux

  • Increase muxes to

  B A

  im n op rw

  add paths from pipeline registers

  alu

  • Data Forwarding =

  S

  n op rw

Data Bypassing

  D mem

  m

  n op rw Regs

What about memory operations? ° If instructions are initiated in order and operations always occur in the same

  

op Rd Ra Rb

stage, there can be no hazards between memory operations! ° What does delaying WB on arithmetic operations cost? – cycles ?

op Rd Ra Rb

  • – hardware ? A B ° What about data dependence on loads? R1 <- R4 + R5 R2 <- Mem[ R2 + I ] R3 <- R2 + R1 Rd R => " Delayed Loads " T Rd to reg file Compiler Avoiding Load Stalls: scheduled unscheduled 54% gcc 31% 42% spice 14% 65% tex 25% 0% 20% 40% 60% 80% % loads stalling pipeline

What about Interrupts, Traps, Faults?

  ° External

Interrupts:

  • Allow pipeline to drain,
  • Load PC with interupt address

  ° Faults (within instruction, restartable)

  • Force trap instruction into IF
  • disable writes till trap hits WB
  • must save multiple PCs or PC + state

  

Refer to MIPS solution

Exception Handling

  IAU

  npc

  I mem detect bad instruction address

  lw $2,20($5)

  Regs PC

  

detect bad instruction

B A im n op rw detect overflow

  alu

  S detect bad data address

  D mem

  m

  Regs Allow exception to take effect

Exception Problem

  ° Exceptions/Interrupts : 5 instructions executing in 5 stage pipeline

  • How to stop the pipeline?
  • Restart?
  • Who caused the interrupt?

  Stage Problem interrupts occurring

  IF Page fault on instruction fetch; misaligned memory access; memory-protection violation

  ID Undefined or illegal opcode EX Arithmetic exception MEM Page fault on data fetch; misaligned memory access; memory-protection violation; memory error

  ° Load with data page fault, Add with instruction page fault? ° Solution 1: interrupt vector/instruction

Resolution: Freeze above & Bubble Below npc

  alu

  B

  D mem

  m

  IAU PC Regs

  A

  im op rw n op rw n op rw n op rw rs rt

  bubble freeze

  I mem Regs

  S

MIPS R3000 Instruction Pipeline Decode Inst Fetch ALU / E.A Memory Write Reg Reg. Read TLB I-Cache RF Operation WB E.A. TLB D-Cache Resource Usage TLB TLB I-cache RF WB ALUALU D-Cache

  Write in phase 1, read in phase 2 => eliminates bypass from WB

Recall: Data Hazard on r1

  Time (clock cycles)

  IF

ID/RF EX MEM WB ALU Reg Reg Im Dm add r1 ,r2,r3

  I ALU n Im Reg Dm Reg s sub r4, r1 ,r3 t

  

ALU

r.

  

Im Reg Dm Reg

and r6, r1 ,r7

  O ALU r Im Reg Dm Reg or r8, r1 ,r9 d

  ALU e

  Im Reg Dm Reg r xor r10, r1 ,r11

  With MIPS R3000 pipeline, no need to forward from WB stage

MIPS R3000 Multicycle Operations op Rd Ra Rb Ex: Multiply, Divide, Cache Miss Stall all stages above multicycle operation in the pipeline mul Rd Ra Rb A B Drain (bubble) stages below it Use control word of local stage state to step through multicycle operation Rd R Rd T to reg file Issues in Pipelined design

  IF D Ex M W Limitation ° Pipelining IF D Ex M W IF D Ex M W IF D Ex M W Issue rate, FU stalls, FU depth ° Super-pipeline

  • - Issue one instruction per (fast) cycle
  • IF D Ex M W<
  • - ALU takes multiple cycles
  • IF D Ex M W IF D Ex M W IF D Ex M W Clock skew, FU stalls, FU depth ° Super-scalar IF D Ex M W IF D Ex M W Hazard resolutio
  • - Issue multiple scalar
  • IF D Ex M W instructions per cycle IF D Ex M W ° VLIW (“EPIC”
  • - Each instruction specifies
  • IF D Ex M W Ex M W

    Packing

    multiple scalar operations Ex M W<
  • - Compiler determines parallelism
  • Ex M W ° Vector operations IF D Ex M W Ex M W Applicabilit
  • - Each instruction specifies
  • Ex M W series of identical operations Ex M W

    Partitioned Instruction Issue (simple Superscalar)

      independent int and FP issue to separate pipelines

    I-Cache Int Reg Inst Issue FP Reg and Bypass Operand / Result Busses Int Unit Load / FP Add FP Mul Store Unit D-Cache Single Issue Total Time = Int Time + FP Time Max Speedup: Total Time MAX(Int Time, FP Time) Multiple Pipes/ Harder Superscalar

      IR0

    IR1 Issues:

      

    Reg. File ports

    Register File Detecting Data Dependences A B B A Bypassing

      

    RAW Hazard

    WAR Hazard

    R R D$ D$ Multiple load/store ops? T T Branches

    Branch penalties in superscalar Example: resolved in op-fetch stage, single exposed delay (ala MIPS, Sparc)

      I-fetch Branch delay Squash 2

      I-fetch

    Branch Squash 1 delay Summary

      ° Pipelines pass control information down the pipe just as data moves down pipe

      ° Forwarding/Stalls handled by local control ° Exceptions stop the pipeline ° MIPS I instruction set architecture made pipeline visible (delayed branch, delayed load)

      ° More performance from deeper pipelines, parallelism