Lecture 11: Exceptions and Pipelining Basics
Professor Mike Schulte
Computer Architecture ECE 201
Exceptions
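[Figure: a user program is interrupted by an exception, control transfers to the system exception handler, and the handler returns from the exception to the user program.]
° Normal control flow: sequential, jumps, branches, calls, returns
° Exception = unprogrammed control transfer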
- system takes action to handle the exception
- must record the address of the offending instruction
- record any other information necessary to return afterwards
- must save & restore user state
Two Types of Exceptions
° Interrupts
- caused by external events
- asynchronous to program execution
- may be handled between instructions
- simply suspend and resume user program
° Traps (or Exceptions)
- caused by internal events
- exceptional conditions (overflow)
- invalid instruction
- faults (non-resident page in memory)
- internal hardware error
- synchronous to program execution
- condition must be remedied by the handler
- instruction may be retried or simulated and program continued, or program may be aborted
MIPS convention
° Exception means any unexpected change in control flow, without distinguishing internal or external.
° Use the term interrupt only when the event is externally caused.
Type of event                     From where?   MIPS terminology
I/O device request                External      Interrupt
Invoke OS from user program       Internal      Exception
Arithmetic overflow               Internal      Exception
Using an undefined instruction    Internal      Exception
Hardware malfunctions             Either        Exception or Interrupt
Additions to MIPS ISA to support Exceptions?
° EPC: a 32-bit register used to hold the address of the affected instruction.
° Cause: a register used to record the cause of the exception. To simplify the discussion, assume
- undefined instruction=0
- arithmetic overflow=1
° Status: interrupt mask and enable bits; determines what exceptions can occur.
° Control signals to write EPC, Cause, and Status
° Be able to write the exception address into the PC: widen the PC mux so the PC can be set to the exception address (C000 0000 hex).
° May have to undo PC = PC + 4, since we want EPC to point to the offending instruction (not its successor): EPC = PC - 4
Big Picture: user / system modes
° By providing two modes of execution (user/system) it is possible for the computer to manage itself
- operating system is a special program that runs in the privileged mode and has access to all of the resources of the computer
- presents “virtual resources” to each user that are more convenient than the physical resources
  - files vs. disk sectors
  - virtual memory vs. physical memory
- protects each user program from others
° Exceptions allow the system to take action in response to events that occur while a user program is executing
- O/S begins at the handler
How Control Detects Exceptions in our FSD
° Undefined instruction: detected when no next state is defined from state 1 for the op value.
- We handle this exception by defining the next state value for all op values other than lw, sw, 0 (R-type), jmp, beq, and ori as a new state
° Arithmetic overflow: the ALU includes logic to detect overflow, and a signal called Overflow is provided as an output from the ALU. This signal is used in the modified finite state machine to specify an additional possible next state.
° Note: the challenge in designing the control of a real machine is to handle the different interactions between instructions and other exception-causing events such that the control logic remains small and fast.
- Complex interactions make the control unit the most challenging aspect of hardware design
Modification to the Control Specification
[Figure: modified finite state diagram. The register transfers in each state are:]
Instruction fetch:                      IR <= MEM[PC]; PC <= PC + 4
Decode:                                 A <= R[rs]; B <= R[rt]
Undefined instruction (op is "other"):  EPC <= PC - 4; PC <= exp_addr; cause <= 0 (UI)
R-type:                                 S <= A fun B; R[rd] <= S
ORi:                                    S <= A or ZX; R[rt] <= S
LW:                                     S <= A + SX; M <= MEM[S]; R[rt] <= M
SW:                                     S <= A + SX; MEM[S] <= B
BEQ:                                    S <= A - B; if Equal: PC <= PC + SX || 00; if ~Equal: no change
Overflow (additional condition from Datapath to indicate overflow):
                                        EPC <= PC - 4; PC <= exp_addr; cause <= 1 (Ovf)
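The exception actions in the control specification (EPC <= PC - 4, cause <= code, PC <= exp_addr) can be sketched in Python. This is only a minimal sketch of the register transfers, not the lecture's hardware: the `Machine` class and the `fetch` and `raise_exception` functions are hypothetical names, the starting PC is an arbitrary value, and the exception address and cause codes are the simplified ones assumed in this lecture.

```python
from dataclasses import dataclass

EXP_ADDR = 0xC0000000   # exception address assumed in this lecture
CAUSE_UNDEFINED = 0     # undefined instruction
CAUSE_OVERFLOW = 1      # arithmetic overflow


@dataclass
class Machine:
    pc: int = 0
    epc: int = 0
    cause: int = 0


def fetch(m: Machine) -> int:
    """Return the address of the fetched instruction, then PC <= PC + 4."""
    addr = m.pc
    m.pc += 4
    return addr


def raise_exception(m: Machine, cause: int) -> None:
    """EPC <= PC - 4 (undo the increment), cause <= code, PC <= exp_addr."""
    m.epc = m.pc - 4
    m.cause = cause
    m.pc = EXP_ADDR


m = Machine(pc=0x00400010)          # arbitrary example address
offending = fetch(m)                # PC now points at the successor
raise_exception(m, CAUSE_OVERFLOW)  # overflow detected while executing
print(hex(m.epc), m.cause, hex(m.pc))
```

Because the PC was already incremented during fetch, subtracting 4 is what makes EPC name the offending instruction rather than its successor.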
Pipelining is Natural!
° Pipelining provides a method for executing multiple instructions at the same time.
° Laundry Example
° Ann, Brian, Cathy, Dave (A, B, C, D) each have one load of clothes to wash, dry, and fold
° Washer takes 30 minutes
° Dryer takes 40 minutes
° “Folder” takes 20 minutes
Sequential Laundry
[Figure: timeline from 6 PM to midnight. Loads A, B, C, D run one after another in task order, each taking 30, 40, and 20 minutes for washing, drying, and folding.]
° Sequential laundry takes 6 hours for 4 loads
° If they learned pipelining, how long would laundry take?
Pipelined Laundry: Start work ASAP
[Figure: timeline from 6 PM to midnight. Loads A, B, C, D overlap in task order; the time axis is divided into segments of 30, 40, 40, 40, 40, and 20 minutes.]
° Pipelined laundry takes 3.5 hours for 4 loads
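The 6 hour and 3.5 hour figures can be checked with a short Python sketch. It assumes each machine handles one load at a time and a load moves to the next machine as soon as both the load and the machine are free; the function names are illustrative, not from the lecture.

```python
STAGE_MINUTES = [30, 40, 20]  # washer, dryer, "folder" (from the example)


def sequential_minutes(n_loads: int) -> int:
    """Each load finishes all three stages before the next load starts."""
    return n_loads * sum(STAGE_MINUTES)


def pipelined_minutes(n_loads: int) -> int:
    """Each load starts a stage as soon as the load and the stage are both free."""
    stage_free = [0] * len(STAGE_MINUTES)  # time at which each stage becomes free
    for _ in range(n_loads):
        ready = 0  # time at which this load is ready for its next stage
        for s, duration in enumerate(STAGE_MINUTES):
            start = max(ready, stage_free[s])
            ready = start + duration
            stage_free[s] = ready
    return stage_free[-1]


print(sequential_minutes(4) / 60)  # 6.0 hours
print(pipelined_minutes(4) / 60)   # 3.5 hours
```

The pipelined total is 30 + 4 x 40 + 20 minutes: once the pipeline is full, the 40-minute dryer sets the pace.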
Pipelining Lessons
° Pipelining doesn’t help latency of single task, it helps throughput of entire workload
° Pipeline rate limited by slowest pipeline stage
° Multiple tasks operating simultaneously using different resources
° Potential speedup = Number pipe stages
° Unbalanced lengths of pipe stages reduces speedup
° Time to “fill” pipeline and time to “drain” it reduces speedup
° Stall for Dependences
The Five Stages of the Load Instruction
        Cycle 1   Cycle 2   Cycle 3   Cycle 4   Cycle 5
Load    Ifetch    Reg/Dec   Exec      Mem       Wr
° Ifetch: Instruction Fetch
- Fetch the instruction from the Instruction Memory
° Reg/Dec: Registers Fetch and Instruction Decode
° Exec: Calculate the memory address
° Mem: Read the data from the Data Memory
° Wr: Write the data back to the register file
Pipelined Execution
° On a pipelined processor, multiple instructions are in various stages at the same time.
° Assume each instruction takes five cycles
[Figure: program flow runs downward, time runs to the right]
IFetch Dcd    Exec   Mem    WB
       IFetch Dcd    Exec   Mem    WB
              IFetch Dcd    Exec   Mem    WB
                     IFetch Dcd    Exec   Mem    WB
                            IFetch Dcd    Exec   Mem    WB
                                   IFetch Dcd    Exec   Mem    WB
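The staircase pattern above can be generated with a small Python sketch. It assumes an ideal pipeline in which one instruction enters per cycle and nothing stalls; `total_cycles`, `stage_in_cycle`, and `diagram` are illustrative names.

```python
STAGES = ["IFetch", "Dcd", "Exec", "Mem", "WB"]


def total_cycles(n_instructions: int, n_stages: int = len(STAGES)) -> int:
    """Cycles to finish n instructions: one per cycle, plus the drain time."""
    return n_instructions + n_stages - 1


def stage_in_cycle(instr: int, cycle: int):
    """Stage held by instruction `instr` (0-based) in `cycle` (1-based), or None."""
    idx = cycle - 1 - instr
    return STAGES[idx] if 0 <= idx < len(STAGES) else None


def diagram(n_instructions: int) -> str:
    """One text row per instruction, one column per clock cycle."""
    rows = []
    for i in range(n_instructions):
        cells = [stage_in_cycle(i, c) or ""
                 for c in range(1, total_cycles(n_instructions) + 1)]
        rows.append(" ".join(f"{cell:<6}" for cell in cells).rstrip())
    return "\n".join(rows)


print(diagram(6))
print("cycles:", total_cycles(6))
```

`stage_in_cycle` also answers questions such as which instruction is using the ALU in a given cycle, under the same no-stall assumption.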
Single Cycle, Multiple Cycle, vs. Pipeline
[Figure: clock timing of a Load, a Store, and an R-type instruction under three implementations.
 Single Cycle Implementation: Load and Store each take one long cycle (Cycle 1, Cycle 2); part of the Store cycle is wasted.
 Multiple Cycle Implementation: Load uses Ifetch, Reg, Exec, Mem, Wr in cycles 1 to 5; Store uses Ifetch, Reg, Exec, Mem in cycles 6 to 9; R-type begins with Ifetch in cycle 10.
 Pipeline Implementation: Load, Store, and R-type each pass through Ifetch, Reg, Exec, Mem, Wr, starting in successive cycles.]
Graphically Representing Pipelines
° Can help with answering questions like:
- How many cycles does it take to execute this code?
- What is the ALU doing during cycle 4?
- Are two instructions trying to use the same resource at the same time?
[Figure: Inst 0 and Inst 1 plotted in instruction order against time (clock cycles), each using Im, Reg, ALU, Dm, Reg.]
Why Pipeline? Because the resources are there!
[Figure: Inst 0 through Inst 4 plotted in instruction order against time (clock cycles), each using Im, Reg, ALU, Dm, Reg in successive cycles.]
Why Pipeline?
° Suppose
- 100 instructions are executed
- The single cycle machine has a cycle time of 45 ns
- The multicycle and pipeline machines have cycle times of 10 ns
- The multicycle machine has a CPI of 4.6
° Single Cycle Machine
- 45 ns/cycle x 1 CPI x 100 inst = 4500 ns
° Multicycle Machine
- 10 ns/cycle x 4.6 CPI x 100 inst = 4600 ns
° Ideal pipelined machine
- 10 ns/cycle x (1 CPI x 100 inst + 4 cycle drain) = 1040 ns
° Ideal pipelined vs. single cycle speedup
- 4500 ns / 1040 ns = 4.33
° What has not yet been considered?
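The three execution times and the speedup can be reproduced with a few lines of Python. This is a sketch of the arithmetic only; `exec_time_ns` is an illustrative helper, and the 4-cycle drain is the one assumed in the example.

```python
def exec_time_ns(cycle_ns: float, cycles: float) -> float:
    """Execution time = cycle time x number of cycles."""
    return cycle_ns * cycles


N = 100  # instructions executed

single = exec_time_ns(45, 1 * N)      # single cycle machine, CPI = 1
multi = exec_time_ns(10, 4.6 * N)     # multicycle machine, CPI = 4.6
pipe = exec_time_ns(10, 1 * N + 4)    # ideal pipeline, plus 4 cycles to drain

print(f"single cycle: {single:.0f} ns")
print(f"multicycle:   {multi:.0f} ns")
print(f"pipelined:    {pipe:.0f} ns")
print(f"speedup vs. single cycle: {single / pipe:.2f}")
```

As N grows, the 4 drain cycles matter less and the speedup over the single cycle machine approaches the ratio of cycle times, 45 / 10 = 4.5.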
Can pipelining get us into trouble?
° Yes: Pipeline Hazards
- structural hazards: attempt to use the same resource two different ways at the same time
  - E.g., two instructions try to read the same memory at the same time
- data hazards: attempt to use item before it is ready
  - instruction depends on result of prior instruction still in the pipeline
      add r1, r2, r3
      sub r4, r2, r1
- control hazards: attempt to make a decision before condition is evaluated
  - branch instructions
      beq r1, loop
      add r1, r2, r3
° Can always resolve hazards by waiting
- pipeline control must detect the hazard
- take action (or delay action) to resolve hazards
Single Memory is a Structural Hazard
[Figure: Load and Instr 1 through Instr 4 plotted in instruction order against time (clock cycles), each using Mem, Reg, ALU, Mem, Reg. With a single memory, the Load's data access falls in the same cycle as a later instruction's fetch.]
Detection is easy in this case! (right half highlight means read, left half write)
Structural Hazards limit performance
° Example: if there are 1.3 memory accesses per instruction and only one memory access per cycle, then average CPI is at least 1.3
- otherwise the resource would be more than 100% utilized
° Solution 1: Use separate instruction and data memories
° Solution 2: Allow memory to read and write more than one word per cycle
° Solution 3: Stall
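The CPI bound in the example, and the effect of the first two solutions, can be sketched in Python. This is a simplified bandwidth model, not a cycle-accurate simulation, and `cpi_lower_bound` is an illustrative name.

```python
def cpi_lower_bound(accesses_per_instr: float, accesses_per_cycle: float) -> float:
    """A memory that serves `accesses_per_cycle` cannot sustain fewer cycles
    per instruction than accesses / bandwidth (and never fewer than 1)."""
    return max(1.0, accesses_per_instr / accesses_per_cycle)


# One shared memory, one access per cycle: 1 fetch + 0.3 data accesses per instruction.
shared = cpi_lower_bound(1.3, 1)

# Solution 1: separate instruction and data memories, each serving one access per cycle.
split = max(cpi_lower_bound(1.0, 1), cpi_lower_bound(0.3, 1))

# Solution 2: one memory that can serve two accesses per cycle.
dual_port = cpi_lower_bound(1.3, 2)

print(shared, split, dual_port)
```

With Solution 3 (stall), the machine simply accepts the shared-memory bound.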
Control Hazard Solutions
° Stall: wait until decision is clear
- It's possible to move up the decision to the 2nd stage by adding hardware to check registers as they are being read
[Figure: Add, Beq, and Load plotted in instruction order against time (clock cycles).]
° Impact: 2 clock cycles per branch instruction => slow
Control Hazard Solutions
° Predict: guess one direction then back up if wrong
- Predict not taken
° Impact: 1 clock cycle per branch instruction if right, 2 if wrong (right about 50% of time)
° More dynamic scheme: history of 1 branch (right about 90% of time)
[Figure: Add, Beq, and Load plotted in instruction order against time (clock cycles).]
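The impact of prediction accuracy on branch cost can be sketched in Python, using the costs stated above (1 cycle if the guess is right, 2 if wrong). The branch outcome pattern below is a made-up loop branch used only for illustration, so its accuracies are not the lecture's 50% and 90% figures; all function names are hypothetical.

```python
def cycles_per_branch(accuracy: float, right: int = 1, wrong: int = 2) -> float:
    """Average cycles per branch for a given prediction accuracy."""
    return accuracy * right + (1 - accuracy) * wrong


def accuracy_not_taken(outcomes) -> float:
    """Static scheme: always predict not taken."""
    return sum(1 for taken in outcomes if not taken) / len(outcomes)


def accuracy_one_bit(outcomes, initial: bool = False) -> float:
    """History of 1 branch: predict whatever the branch did last time."""
    prediction, hits = initial, 0
    for taken in outcomes:
        hits += (prediction == taken)
        prediction = taken
    return hits / len(outcomes)


# Hypothetical loop branch: taken 9 times, then falls through, repeated 10 times.
outcomes = ([True] * 9 + [False]) * 10

for name, acc in [("predict not taken", accuracy_not_taken(outcomes)),
                  ("1-bit history", accuracy_one_bit(outcomes))]:
    print(f"{name}: accuracy {acc:.2f}, cycles/branch {cycles_per_branch(acc):.2f}")
```

With the accuracies quoted in the lecture, the same formula gives 1.5 cycles per branch at 50% and 1.1 at 90%.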
Control Hazard Solutions
° Redefine branch behavior (takes place after next instruction): “delayed branch”
° Impact: 1 clock cycle per branch instruction if an instruction can be found to put in the “slot” (about 50% of time)
° Launching more instructions per clock cycle => delayed branch is less useful
[Figure: Add, Beq, Misc, and Load plotted in instruction order against time (clock cycles).]
Data Hazard on r1
add r1,r2,r3
sub r4,r1,r3
and r6,r1,r7
or r8,r1,r9
xor r10,r1,r11
Problem: r1 cannot be read by other instructions before it is written by the add
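Which of these instructions actually read r1 too early can be checked with a short Python sketch. It assumes the five-stage pipeline of this lecture with no forwarding, and a register file that writes in the first half of a cycle and reads in the second half, so a consumer three or more instructions behind the producer is safe. The parser handles only the three-register format used here; `parse` and `raw_hazards` are illustrative names.

```python
import re


def parse(line: str):
    """Split 'add r1,r2,r3' into (op, destination, [sources])."""
    op, rest = line.split(None, 1)
    regs = re.findall(r"r\d+", rest)
    return op, regs[0], regs[1:]


def raw_hazards(program, min_safe_distance: int = 3):
    """Return (producer, consumer) pairs that are too close together for the
    consumer to read the producer's result from the register file."""
    instrs = [parse(line) for line in program]
    hazards = []
    for i, (_, dest, _) in enumerate(instrs):
        for j in range(i + 1, min(i + min_safe_distance, len(instrs))):
            if dest in instrs[j][2]:
                hazards.append((program[i], program[j]))
    return hazards


program = [
    "add r1,r2,r3",
    "sub r4,r1,r3",
    "and r6,r1,r7",
    "or r8,r1,r9",
    "xor r10,r1,r11",
]

for producer, consumer in raw_hazards(program):
    print(f"hazard: '{consumer}' needs r1 before '{producer}' writes it")
```

Under these assumptions only the sub and the and are hazards; the or reads r1 in the same cycle the add writes it, and the xor reads it one cycle later.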
Data Hazard on r1:
- Dependencies backwards in time are hazards
[Figure: add r1,r2,r3; sub r4,r1,r3; and r6,r1,r7; or r8,r1,r9; xor r10,r1,r11 plotted in instruction order against time (clock cycles), with stages IF, ID/RF, EX, MEM, WB.]
Data Hazard Solution:
- “Forward” result from one stage to another
- “or” OK if define read/write properly
[Figure: add r1,r2,r3; sub r4,r1,r3; and r6,r1,r7; or r8,r1,r9; xor r10,r1,r11 plotted in instruction order against time (clock cycles), with stages IF, ID/RF, EX, MEM, WB.]
Forwarding (or Bypassing): What about Loads
- Dependencies backwards in time are hazards
- Can’t solve with forwarding:
- Must delay/stall instruction dependent on loads
[Figure: lw r1,0(r2) followed by sub r4,r1,r3 plotted against time (clock cycles), with stages IF, ID/RF, EX, MEM, WB.]
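The load case can be sketched in Python as well. The sketch assumes the five-stage pipeline with forwarding, where a loaded value is available only after the MEM stage, so an instruction that uses it immediately after the load must wait one cycle. It recognizes only `lw` and three-register ALU instructions, and `parse` and `load_use_stalls` are illustrative names.

```python
import re


def parse(line: str):
    """Split 'lw r1,0(r2)' or 'sub r4,r1,r3' into (op, destination, [sources])."""
    op, rest = line.split(None, 1)
    regs = re.findall(r"r\d+", rest)
    return op, regs[0], regs[1:]


def load_use_stalls(program) -> int:
    """Count the one-cycle stalls needed when an instruction reads the
    destination of the load immediately before it."""
    instrs = [parse(line) for line in program]
    stalls = 0
    for prev, cur in zip(instrs, instrs[1:]):
        if prev[0] == "lw" and prev[1] in cur[2]:
            stalls += 1
    return stalls


print(load_use_stalls(["lw r1,0(r2)", "sub r4,r1,r3"]))   # 1: must stall
print(load_use_stalls(["add r1,r2,r3", "sub r4,r1,r3"]))  # 0: forwarding suffices
```

Placing an independent instruction between the load and its first use removes the stall, since the dependent instruction then reaches EX after the load's MEM stage has finished.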