Maximum Speedup ≤≤ Number of stages Speedup ≤ Time for unpipelined operation Time for longest stage
Lecture 13: Pipeline Control and Exceptions Professor Mike Schulte Computer Architecture ECE 201 Recap: Ideal Pipelining
IF DCD EX MEM WB
IF DCD EX MEM WB
IF DCD EX MEM WB
IF DCD EX MEM WB
IF DCD EX MEM WB
Example: 40ns data path, 5 stages, Longest stage is 10 ns, Speedup
≤
4 Assume instructions
are completely independent!
Maximum Speedup ≤≤ Number of stages Speedup ≤ ≤ Time for unpipelined operation Time for longest stage
Recap: Control Diagram
IR <- Mem[PC]; PC < PC+4; A <- R[rs]; B<– R[rt] S <– A + B; S <– A or ZX; S <– A + SX; S <– A + SX; If Cond PC < PC+SX; M <– S M <– Mem[S] Mem[S] <- B M <– S R[rd] <– S; R[rt] <– S; R[rd] <– M;
Equal .
A M S Reg File
Reg File
IR Exec PC
. Mem B
Next PC Inst
D Mem
Data Mem Access
But recall use of “Data Stationary Control”
° The Main Control generates the control signals during
Reg/Dec
- Control signals for Exec (ExtOp, ALUSrc, ...) are used 1 cycle later
- Control signals for Mem (MemWr Branch) are used 2 cycles later
- Control signals for Wr (MemtoReg MemWr) are used 3 cycles later
Reg/Dec Exec Mem Wr ExtOp ExtOp ALUSrc ALUSrc
Ex/ Mem
IF/ID Register
ID/Ex Register ALUOp ALUOp
Mem Main /Wr
RegDst RegDst Control Register Register
MemW MemW MemW r r r Branch Branch Branch MemtoReg MemtoReg MemtoReg MemtoReg
RegWr RegWr RegWr RegWr
Datapath + Data Stationary Control
im
Let’s Try it Out 10 lw r1, r2(35) 14 addI r2, r2, 3 20 sub r3, r4, r5 24 beq r6, r7, 100 30 ori r8, r9, 17 34 add r10, r11, r12 100 and r13, r14, 15 these addresses are octal
v
wb rw
v
me wb rw
v
ex me wb rw
Exec Reg .
File Mem
M
WB Ctrl
Mem Ctrl
D Decode
IR Inst . Mem
PC Next PC
A B S Reg File
Access Data Mem
rs rt op rs rt fun
Start: Fetch 10
Exec Reg .
File Mem
Access Data Mem
A B S Reg File
IR Inst . Mem
D Decode
Mem Ctrl
WB Ctrl
M rs rt im
10 lw r1, r2(35) 14 addI r2, r2, 3 20 sub r3, r4, r5 24 beq r6, r7, 100 30 ori r8, r9, 17 34 add r10, r11, r12 100 and r13, r14, 15 n n n n
IF
PC Next PC
10
=
Fetch 14, Decode 10
File Mem
10 lw r1, r2(35) 14 addI r2, r2, 3 20 sub r3, r4, r5 24 beq r6, r7, 100 30 ori r8, r9, 17 34 add r10, r11, r12 100 and r13, r14, 15 n n n lw r1, r2(35)
Exec Reg .
=
14
PC Next PC
IF
ID
im
A B S Reg File
2 rt
M
WB Ctrl
Mem Ctrl
D Decode
IR Inst . Mem
Access Data Mem
Fetch 20, Decode 14, Exec 10
2 rt
=
20
PC Next PC
IF EX
ID
35 10 lw r1, r2(35) 14 addI r2, r2, 3 20 sub r3, r4, r5 24 beq r6, r7, 100 30 ori r8, r9, 17 34 add r10, r11, r12 100 and r13, r14, 15 n n lw r1 addI r2, r2, 3
Exec Reg .
File Mem
WB Ctrl
Mem Ctrl
D Decode
IR Inst . Mem
B S Reg File
r2
Access Data Mem
M
Fetch 24, Decode 20, Exec 14, Mem 10
File Mem
5
Exec Reg .
=
24
PC Next PC
IF EX M
ID
3 10 lw r1, r2(35) 14 addI r2, r2, 3 20 sub r3, r4, r5 24 beq r6, r7, 100 30 ori r8, r9, 17 34 add r10, r11, r12 100 and r13, r14, 15 n lw r1 sub r3, r4, r5 addI r2, r2, 3
4
r2
M
WB Ctrl
Mem Ctrl
D Decode
IR Inst . Mem
Reg File
r2+35
B
Access Data Mem
Fetch 30, Dcd 24, Ex 20, Mem 14, WB 10
Exec Reg .
Note Delayed Branch: always execute ori after beq
=
30
PC Next PC
IF EX M WB
ID
7 10 lw r1, r2(35) 14 addI r2, r2, 3 20 sub r3, r4, r5 24 beq r6, r7, 100 30 ori r8, r9, 17 34 add r10, r11, r12 100 and r13, r14, 15
lw
r1
beq r6, r7 100 addI r2 sub r3
M[r2+35]
6
WB Ctrl
Mem Ctrl
D Decode
IR Inst . Mem
Reg File
r4 r5 r2+3
Access Data Mem
File Mem
Fetch 100, Dcd 30, Ex 24, Mem 20, WB 14
Access Data Mem
100 ori r8, r9 17
File Mem
=
100
PC Next PC
IF EX M WB
ID
10 lw r1, r2(35) 14 addI r2, r2, 3 20 sub r3, r4, r5 24 beq r6, r7, 100 30 ori r8, r9, 17 34 add r10, r11, r12 100 and r13, r14, 15 beq addI
r2
sub r3 r4-r5r6 r7
r2+3
r1=M[r2+35] 9 xx
WB Ctrl
Mem Ctrl
D Decode
IR Inst . Mem
Reg File
Exec Reg .
Fetch 104, Dcd 100, Ex 30, Mem 24, WB 20
Exec Reg .
File Mem
Access Data Mem
Reg File
IR Inst . Mem
D Decode
Mem Ctrl
WB Ctrl
10 lw r1, r2(35) 14 addI r2, r2, 3 20 sub r3, r4, r5 24 beq r6, r7, 100 30 ori r8, r9, 17 34 add r10, r11, r12 100 and r13, r14, 15
ID EX M WB
PC Next PC
___
= Fill it in yourself!
?
Fetch 110, Dcd 104, Ex 100, Mem 30, WB 24
Access Data Mem
Reg File
IR Inst . Mem
D Decode
Mem Ctrl
WB Ctrl
10 lw r1, r2(35) 14 addI r2, r2, 3 20 sub r3, r4, r5 24 beq r6, r7, 100 30 ori r8, r9, 17 34 add r10, r11, r12 100 and r13, r14, 15
EX M WB
PC Next PC
___
= Fill it in yourself!
? ? ? ?
Exec Reg .
File Mem
Fetch 114, Dcd 110, Ex 104, Mem 100, WB 30
Exec Reg .
? ? ? ?
IF DCD EX Mem WB
IF DCD EX Mem WB
IF DCD OF Ex RS WAR Data Hazard
(write after read)(write after write)
IF DCD OF Ex Mem
RAW (read after write) Data Hazard
WAW Data HazardIF DCD EX Mem WB
IFetch DCD ° ° °
Control Hazard
IFetch DCD ° ° ° Structural Hazard
? ?
= Fill it in yourself!
File Mem
___
PC Next PC
M WB
10 lw r1, r2(35) 14 addI r2, r2, 3 20 sub r3, r4, r5 24 beq r6, r7, 100 30 ori r8, r9, 17 34 add r10, r11, r12 100 and r13, r14, 15
WB Ctrl
Mem Ctrl
D Decode
IR Inst . Mem
Reg File
Access Data Mem
Pipeline Hazards Again I-Fet ch DCD MemOpFetch OpFetch Exec Store
Data Hazards
° Avoid some “by design”
- eliminate WAR by always fetching operands early (DCD) in pipe
- eleminate WAW by doing all WBs in order (last stage, static)
- stall or forward (if possible)
° Detect and resolve remaining ones
IF DCD EX Mem WB
IF DCD OF Ex Mem RAW Data Hazard WAW Data Hazard
IF DCD OF Ex RS RAW Data Hazard
IF DCD EX Mem WB
IF DCD EX Mem WB Hazard Detection
° Suppose instruction i is about to be issued and a predecessor instruction j is in the instruction pipeline. ° A RAW hazard exists on register
Wregs( j )
ρ ρ if
ρ ∈ ρ ∈
Rregs( i )
∩ ∩
- Keep a record of pending writes (for inst's in the pipe) and compare with operand regs of current instruction.
- When instruction issues, reserve its result register.
- When on operation completes, remove its write reservation.
° A WAW hazard exists on register ρ ρ if ρ ∈ ρ ∈ Wregs( i ) ∩ ∩ Wregs( j ) ° A WAR hazard exists on register ρ ρ if ρ ∈ ρ ∈ Wregs( i ) ∩ ∩ Rregs( j )
Record of Pending Writes
IAU
npc ° Current operand
registers
I mem
° Pending writes
Regs op rw rs rt PC
° hazard <=
B A im n op rw((rs == rw & regW ) OR
ex) ex
((rs == rw & regW ) OR
mem) me
alu ((rs == rw & regW ) OR
wb) wb S n op rw ((rt == rw & regW ) OR ex) ex
((rt == rw & regW ) OR
mem) me
D mem ((rt == rw ) & regW )
wb wb m n op rw
Regs
Resolve RAW by forwarding
IAU
° Detect nearest valid write op npc operand register and forward into
I mem
op latches, bypass ing
Regs
remainder of the
op rw rs rt PC
Forward pipe mux
- Increase muxes to
B A
im n op rw
add paths from pipeline registers
alu
- Data Forwarding =
S
n op rw
Data Bypassing
D mem
m
n op rw Regs
What about memory operations? ° If instructions are initiated in order and operations always occur in the same
op Rd Ra Rb
stage, there can be no hazards between memory operations! ° What does delaying WB on arithmetic operations cost? – cycles ?op Rd Ra Rb
- – hardware ? A B ° What about data dependence on loads? R1 <- R4 + R5 R2 <- Mem[ R2 + I ] R3 <- R2 + R1 Rd R => " Delayed Loads " T Rd to reg file Compiler Avoiding Load Stalls: scheduled unscheduled 54% gcc 31% 42% spice 14% 65% tex 25% 0% 20% 40% 60% 80% % loads stalling pipeline
What about Interrupts, Traps, Faults?
° External
Interrupts:
- Allow pipeline to drain,
- Load PC with interupt address
° Faults (within instruction, restartable)
- Force trap instruction into IF
- disable writes till trap hits WB
- must save multiple PCs or PC + state
Refer to MIPS solution
Exception Handling
IAU
npc
I mem detect bad instruction address
lw $2,20($5)
Regs PC
detect bad instruction
B A im n op rw detect overflowalu
S detect bad data address
D mem
m
Regs Allow exception to take effect
Exception Problem
° Exceptions/Interrupts : 5 instructions executing in 5 stage pipeline
- How to stop the pipeline?
- Restart?
- Who caused the interrupt?
Stage Problem interrupts occurring
IF Page fault on instruction fetch; misaligned memory access; memory-protection violation
ID Undefined or illegal opcode EX Arithmetic exception MEM Page fault on data fetch; misaligned memory access; memory-protection violation; memory error
° Load with data page fault, Add with instruction page fault? ° Solution 1: interrupt vector/instruction
Resolution: Freeze above & Bubble Below npc
alu
B
D mem
m
IAU PC Regs
A
im op rw n op rw n op rw n op rw rs rt
bubble freeze
I mem Regs
S
MIPS R3000 Instruction Pipeline Decode Inst Fetch ALU / E.A Memory Write Reg Reg. Read TLB I-Cache RF Operation WB E.A. TLB D-Cache Resource Usage TLB TLB I-cache RF WB ALUALU D-Cache
Write in phase 1, read in phase 2 => eliminates bypass from WB
Recall: Data Hazard on r1
Time (clock cycles)
IF
ID/RF EX MEM WB ALU Reg Reg Im Dm add r1 ,r2,r3
I ALU n Im Reg Dm Reg s sub r4, r1 ,r3 t
ALU
r.
Im Reg Dm Reg
and r6, r1 ,r7O ALU r Im Reg Dm Reg or r8, r1 ,r9 d
ALU e
Im Reg Dm Reg r xor r10, r1 ,r11
With MIPS R3000 pipeline, no need to forward from WB stage
MIPS R3000 Multicycle Operations op Rd Ra Rb Ex: Multiply, Divide, Cache Miss Stall all stages above multicycle operation in the pipeline mul Rd Ra Rb A B Drain (bubble) stages below it Use control word of local stage state to step through multicycle operation Rd R Rd T to reg file Issues in Pipelined design
IF D Ex M W Limitation ° Pipelining IF D Ex M W IF D Ex M W IF D Ex M W Issue rate, FU stalls, FU depth ° Super-pipeline
- - Issue one instruction per (fast) cycle IF D Ex M W<
- - ALU takes multiple cycles IF D Ex M W IF D Ex M W IF D Ex M W Clock skew, FU stalls, FU depth ° Super-scalar IF D Ex M W IF D Ex M W Hazard resolutio
- - Issue multiple scalar IF D Ex M W instructions per cycle IF D Ex M W ° VLIW (“EPIC”
- - Each instruction specifies IF D Ex M W Ex M W
- - Compiler determines parallelism Ex M W ° Vector operations IF D Ex M W Ex M W Applicabilit
- - Each instruction specifies Ex M W series of identical operations Ex M W
Packing
multiple scalar operations Ex M W<Partitioned Instruction Issue (simple Superscalar)
independent int and FP issue to separate pipelines
I-Cache Int Reg Inst Issue FP Reg and Bypass Operand / Result Busses Int Unit Load / FP Add FP Mul Store Unit D-Cache Single Issue Total Time = Int Time + FP Time Max Speedup: Total Time MAX(Int Time, FP Time) Multiple Pipes/ Harder Superscalar
IR0
IR1 Issues:
Reg. File ports
Register File Detecting Data Dependences A B B A Bypassing
RAW Hazard
WAR Hazard
R R D$ D$ Multiple load/store ops? T T BranchesBranch penalties in superscalar Example: resolved in op-fetch stage, single exposed delay (ala MIPS, Sparc)
I-fetch Branch delay Squash 2
I-fetch
Branch Squash 1 delay Summary
° Pipelines pass control information down the pipe just as data moves down pipe
° Forwarding/Stalls handled by local control ° Exceptions stop the pipeline ° MIPS I instruction set architecture made pipeline visible (delayed branch, delayed load)
° More performance from deeper pipelines, parallelism