panel-plw-2007.ppt 340KB Jun 23 2011 12:05:46 PM
Lizy Kurian John, LCA, UT Austin
The University of Texas at Austin
What Programming Language/Compiler Researchers Should Know about Computer Architecture
Lizy Kurian John
Department of Electrical and Computer Engineering
Somebody once said
“Computers are dumb actors and compilers/programmers are the master playwrights.”
Computer Architecture Basics
ISAs
RISC vs CISC
Assembly language coding
Datapath (ALU) and controller
Pipelining
Caches
Out of order execution
Basics
ILP
DLP
TLP
Massive parallelism
SIMD/MIMD
VLIW
Performance and Power metrics
Hennessy and Patterson architecture books
ASPLOS, ISCA, Micro, HPCA
The Bottom Line
Programming language choice affects performance and power (e.g., Java)
Compilers affect performance and power
A Java Hardware Interpreter
Radhakrishnan, Ph.D. 2000 (ISCA 2000, ICS 2001)
This technique was used by Nazomi Communications and Parthus (Chicory Systems)
[Figure: a Java class file feeds a hardware bytecode translator in the Fetch stage, so bytecodes flow through the same Decode/Execute pipeline as a native executable]
HardInt Performance
[Chart: 4-way issue machine; execution cycles (millions) for db, javac, jess, mpeg, and mtrt under JDK 1.1.6 Interpreter, JDK 1.1.6 JIT, JDK 1.2 Interpreter, JDK 1.2 JIT, and Hard-Int]
• Hard-Int performs consistently better than the interpreter
• In JIT mode, significant performance boost in 4 of the 5 benchmarks
Compiler and Power
[Figure: two schedules of the same DDG (nodes A–F) over 4 cycles — one reaches Peak Power = 3 with Energy = 6, the other Peak Power = 2 with the same Energy = 6]
Valluri et al., 2001 HPCA workshop
Quantitative study: examined the influence of state-of-the-art optimizations on the energy and power of the processor
Optimizations studied:
Standard -O1 to -O4 of DEC Alpha’s cc compiler
Four individual optimizations – simple basic-block instruction scheduling, loop unrolling, function inlining, and aggressive global scheduling
Standard Optimizations on Power
(all values normalized to 100 at -O0)

Benchmark  Opt  Energy  Exec Time  Insts  Avg Power  IPC
compress   O0   100     100        100    100        100
           O1   74.48   81.55      81.52  91.33      99.96
           O2   75.13   81.44      82.04  92.25      100.73
           O3   75.13   81.44      82.04  92.25      100.73
           O4   79.01   82.77      86.11  95.45      104.03
go         O0   100     100        100    100        100
           O1   66.2    64.13      68.94  103.23     107.5
           O2   62.62   61.31      63.01  102.14     102.78
           O3   62.62   61.31      63.01  102.14     102.78
           O4   63.67   62.19      63.75  102.38     102.51
—          O0   100     100        100    100        100
           O1   81.32   83.66      83.18  97.2       99.42
           O2   79.6    75.97      82.97  104.78     109.21
           O3   79.6    75.97      82.97  104.78     109.21
           O4   85.71   77.89      90.96  110.05     116.78
Somebody once said
“Computers are dumb actors and compilers/programmers are the master playwrights.”
A large part of modern out-of-order processors is hardware that could have been eliminated if a good compiler existed.
Let me get more arrogant
A large part of modern out-of-order processors was designed because computer architects thought compiler writers could not do a good job.
Value Prediction
It is a slap in the face
(Shen and Lipasti)
Value Locality
The likelihood that an instruction’s computed result, or a similar predictable result, will occur again soon
Observation – a limited set of unique values constitutes the majority of the values produced and consumed during execution
Causes of value locality
Data redundancy – many 0s, sparse matrices, white space in files, empty cells in spreadsheets
Program constants – often loaded repeatedly from memory
Computed branches – the base address for jump tables is a run-time constant
Virtual function calls – involve code that loads a function pointer, which is a run-time constant
Causes of value locality
Memory alias resolution – the compiler conservatively generates code that may contain stores aliasing with loads
Register spill code – stores and subsequent loads of the same value
Convergent algorithms – parts of the algorithm converge before global convergence
2 Extremist Views
Anything that can be done in
hardware should be done in
hardware.
Anything that can be done in
software should be done in
software.
What do we need?
The Dumb actor
Or the defiant actor – who pays very little attention to the script?
Challenging all compiler writers
The last 15 years were the defiant actor’s era
What about the next 15? TLP, multithreading, parallelizing compilers –
It’s time for a lot more dumb acting from the architect’s side
And it’s time for some good scriptwriting from the compiler writer’s side
Compiler Optimizations
cc – Native C compiler on the DEC Alpha 21064 running the OSF1 operating system
gcc – Used to study the effect of individual optimizations
Std Optimization Levels in cc
-O0 – No optimizations performed
-O1 – Local optimizations such as CSE, copy propagation, IVE, etc.
-O2 – Inline expansion of static procedures and global optimizations such as loop unrolling and instruction scheduling
-O3 – Inline expansion of global procedures
-O4 – S/W pipelining, loop vectorization, etc.
Std Optimization Levels in gcc
-O0 – No optimizations performed
-O1 – Local optimizations such as CSE, copy propagation, dead-code elimination, etc.
-O2 – Aggressive instruction scheduling
-O3 – Inlining of procedures
Almost the same optimizations in each level of cc and gcc
In cc and gcc, optimizations that increase ILP are in levels -O2, -O3, and -O4
cc was used wherever possible; gcc was used where specific hooks were required
Individual Optimizations
Four gcc optimizations, all applied on top of -O1
-fschedule-insns – Local register allocation followed by basic-block list scheduling
-fschedule-insns2 – Postpass scheduling
-finline-functions – Integrate all simple functions into their callers
-funroll-loops – Perform loop unrolling
Some observations
Energy consumption reduces when the number of instructions is reduced, i.e., when the total work done is less, energy is less
Power dissipation is directly related to IPC – the more work done per cycle, the higher the average power
Observations (contd.)
Function inlining was found to be
good for both power and energy
Unrolling was found to be good for
energy consumption but bad for
power dissipation
MMX/SIMD
Automatic use of SIMD instructions by compilers is still difficult 10+ years after the introduction of MMX.
Standard Optimizations on Power (Contd)
(all values normalized to 100 at -O0)

Benchmark  Opt  Energy  Exec Time  Insts  Avg Power  IPC
su2cor     O0   100     100        100    100        100
           O1   97.38   100.24     92.49  97.15      92.27
           O2   97.69   99.38      92.49  98.3       93.07
           O3   97.69   99.38      92.49  98.3       93.07
           O4   98.31   99.27      92.84  99.02      93.51
swim       O0   100     100        100    100        100
           O1   42.09   51.04      33.21  82.46      65.06
           O2   40.99   47.52      33.1   86.28      69.67
           O3   40.99   46.37      33.1   87.65      71.38
saxpy      O0   100     100        100    100        100
           O1   30.1    36.64      20.01  82.15      54.63
           O2   28.93   34.01      19.05  85.06      56.01
           O3   28.93   34.01      19.05  85.06      56.01