panel-plw-2007.ppt 340KB Jun 23 2011 12:05:46 PM
Lizy Kurian John, LCA, UT Austin
The University of Texas at Austin
What Programming Language/Compiler Researchers Should Know about Computer Architecture
Lizy Kurian John
Department of Electrical and Computer Engineering
Somebody once said
“Computers are dumb actors and compilers/programmers are the master playwrights.”
Computer Architecture Basics
ISAs
RISC vs CISC
Assembly language coding
Datapath (ALU) and controller
Pipelining
Caches
Out of order execution
Basics
ILP
DLP
TLP
Massive parallelism
SIMD/MIMD
VLIW
Performance and Power metrics
Hennessy and Patterson architecture books
ASPLOS, ISCA, Micro, HPCA
The Bottom Line
Programming language choice affects performance and power (e.g., Java)
Compilers affect performance and power
A Java Hardware Interpreter
Radhakrishnan, Ph.D. 2000 (ISCA 2000, ICS 2001)
This technique was used by Nazomi Communications and Parthus (Chicory Systems)
[Figure: a Java class file feeds a hardware bytecode translator in the Fetch stage, so bytecodes flow through the same Decode/Execute pipeline as a native executable]
HardInt Performance
[Chart: 4-way issue machine; execution cycles (millions) for db, javac, jess, mpeg, and mtrt under JDK 1.1.6 Interpreter, JDK 1.1.6 JIT, JDK 1.2 Interpreter, JDK 1.2 JIT, and Hard-Int]
• Hard-Int performs consistently better than the interpreter
• In JIT mode, significant performance boost in 4 of the 5 benchmarks
Compiler and Power
[Figure: two schedules of the same DDG (nodes A–F) over 4 cycles — one reaches Peak Power = 3 with Energy = 6, the other Peak Power = 2 with the same Energy = 6]
Valluri et al., 2001 HPCA workshop
Quantitative study: examined the influence of state-of-the-art optimizations on the energy and power of the processor
Optimizations studied:
Standard -O1 to -O4 of DEC Alpha’s cc compiler
Four individual optimizations – simple basic-block instruction scheduling, loop unrolling, function inlining, and aggressive global scheduling
Standard Optimizations on Power
(all values normalized to 100 at -O0)

Benchmark  Opt  Energy  Exec Time  Insts  Avg Power  IPC
compress   O0   100     100        100    100        100
           O1   74.48   81.55      81.52  91.33      99.96
           O2   75.13   81.44      82.04  92.25      100.73
           O3   75.13   81.44      82.04  92.25      100.73
           O4   79.01   82.77      86.11  95.45      104.03
go         O0   100     100        100    100        100
           O1   66.2    64.13      68.94  103.23     107.5
           O2   62.62   61.31      63.01  102.14     102.78
           O3   62.62   61.31      63.01  102.14     102.78
           O4   63.67   62.19      63.75  102.38     102.51
—          O0   100     100        100    100        100
           O1   81.32   83.66      83.18  97.2       99.42
           O2   79.6    75.97      82.97  104.78     109.21
           O3   79.6    75.97      82.97  104.78     109.21
           O4   85.71   77.89      90.96  110.05     116.78
Somebody once said
“Computers are dumb actors and compilers/programmers are the master playwrights.”
A large part of modern out-of-order processors is hardware that could have been eliminated if a good compiler existed.
Let me get more arrogant
A large part of modern out-of-order processors was designed because computer architects thought compiler writers could not do a good job.
Value Prediction
It is a slap in the face
(Shen and Lipasti)
Value Locality
The likelihood that an instruction’s computed result, or a similar predictable result, will occur again soon
Observation – a limited set of unique values constitutes the majority of the values produced and consumed during execution
Causes of value locality
Data redundancy – many 0s, sparse matrices, white space in files, empty cells in spreadsheets
Program constants – often loaded repeatedly from memory
Computed branches – the base address for jump tables is a run-time constant
Virtual function calls – involve code that loads a function pointer, which is a run-time constant
Causes of value locality
Memory alias resolution – the compiler conservatively generates code that may contain stores aliasing with loads
Register spill code – stores and subsequent loads of the same value
Convergent algorithms – parts of the algorithm converge before global convergence
2 Extremist Views
Anything that can be done in
hardware should be done in
hardware.
Anything that can be done in
software should be done in
software.
What do we need?
The Dumb actor
Or the defiant actor – who pays very little attention to the script?
Challenging all compiler writers
The last 15 years were the defiant actor’s era
What about the next 15? TLP, multithreading, parallelizing compilers –
It’s time for a lot more dumb acting from the architect’s side
And it’s time for some good scriptwriting from the compiler writer’s side
Compiler Optimizations
cc – Native C compiler on the DEC Alpha 21064 running the OSF1 operating system
gcc – Used to study the effect of individual optimizations
Std Optimization Levels in cc
-O0 – No optimizations performed
-O1 – Local optimizations such as CSE, copy propagation, IVE, etc.
-O2 – Inline expansion of static procedures and global optimizations such as loop unrolling and instruction scheduling
-O3 – Inline expansion of global procedures
-O4 – S/W pipelining, loop vectorization, etc.
Std Optimization Levels in gcc
-O0 – No optimizations performed
-O1 – Local optimizations such as CSE, copy propagation, dead-code elimination, etc.
-O2 – Aggressive instruction scheduling
-O3 – Inlining of procedures
Almost the same optimizations in each level of cc and gcc
In cc and gcc, optimizations that increase ILP are in levels -O2, -O3, and -O4
cc was used wherever possible; gcc was used where specific hooks were required
Individual Optimizations
Four gcc optimizations, all applied on top of -O1
-fschedule-insns – Local register allocation followed by basic-block list scheduling
-fschedule-insns2 – Postpass scheduling
-finline-functions – Integrate all simple functions into their callers
-funroll-loops – Perform loop unrolling
Some observations
Energy consumption reduces when the number of instructions is reduced, i.e., when the total work done is less, energy is less
Power dissipation is directly related to IPC – the more work done per cycle, the higher the average power
Observations (contd.)
Function inlining was found to be
good for both power and energy
Unrolling was found to be good for
energy consumption but bad for
power dissipation
MMX/SIMD
Automatic use of SIMD instructions by compilers is still difficult 10+ years after the introduction of MMX.
Standard Optimizations on Power (Contd)
(all values normalized to 100 at -O0)

Benchmark  Opt  Energy  Exec Time  Insts  Avg Power  IPC
su2cor     O0   100     100        100    100        100
           O1   97.38   100.24     92.49  97.15      92.27
           O2   97.69   99.38      92.49  98.3       93.07
           O3   97.69   99.38      92.49  98.3       93.07
           O4   98.31   99.27      92.84  99.02      93.51
swim       O0   100     100        100    100        100
           O1   42.09   51.04      33.21  82.46      65.06
           O2   40.99   47.52      33.1   86.28      69.67
           O3   40.99   46.37      33.1   87.65      71.38
saxpy      O0   100     100        100    100        100
           O1   30.1    36.64      20.01  82.15      54.63
           O2   28.93   34.01      19.05  85.06      56.01
           O3   28.93   34.01      19.05  85.06      56.01