Future Computer Advances are
Between a Rock (Slow Memory)
and a Hard Place (Multithreading)
Mark D. Hill
Computer Sciences Dept.
and Electrical & Computer Engineering Dept.
University of Wisconsin—Madison
Multifacet Project (www.cs.wisc.edu/multifacet)
October 2004
Full Disclosure: Consult for Sun & US NSF
© 2004 Mark D. Hill

Wisconsin Multifacet Project

Executive Summary: Problem
• Expect computer performance doubling every 2 years
• Derives from Technology & Architecture
• Technology will advance for ten or more years

• But Architecture faces a Rock: Slow Memory
– a.k.a. Wall [Wulf & McKee 1995]

• Prediction: Popular Moore’s Law (doubling
performance) will end soon, regardless of
the real Moore’s Law (doubling transistors)

Executive Summary: Recommendation
• Chip Multiprocessing (CMP) Can Help
– Implement multiple processors per chip
– >>10x cost-performance for multithreaded workloads
– What about software with one apparent thread?

• Go to Hard Place: Mainstream Multithreading
– Make most workloads flourish with chip multiprocessing

– Computer architects can help, but long run
– Requires moving multithreading from CS fringe to center
(algorithms, programming languages, …, hardware)

• Necessary For Restoring Popular Moore’s Law

Outline
• Executive Summary
• Background
– Moore’s Law
– Architecture
– Instruction Level Parallelism
– Caches
• Going Forward Processor Architecture Hits Rock
• Chip Multiprocessing to the Rescue?
• Go to the Hard Place of Mainstream Multithreading

Society Expects A Popular Moore’s Law
• Computing critical: commerce, education, engineering, entertainment, government, medicine, science, …
– Servers (> PCs)
– Clients (= PCs)
– Embedded (< PCs)


• Come to expect a misnamed “Moore’s Law”
– Computer performance doubles every two years (same cost)
  → Progress in next two years = All past progress

• Important Corollary
– Computer cost halves every two years (same performance)
  → In ten years, same performance for 3% (sales tax – Jim Gray)

• Derives from Technology & Architecture
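
The two claims on this slide follow from simple doubling arithmetic. A minimal sketch, assuming an idealized exact doubling every two years (the function names are illustrative, not from the talk):

```python
def performance_after(years, doubling_period=2.0):
    """Relative performance after `years`, assuming exact doubling every `doubling_period` years."""
    return 2.0 ** (years / doubling_period)


def cost_fraction_after(years, halving_period=2.0):
    """Fraction of today's cost needed for the same performance after `years`."""
    return 1.0 / performance_after(years, halving_period)


# Progress in the next two years equals the level reached so far.
level_now = performance_after(20)
level_next = performance_after(22)
gain_next_two_years = level_next - level_now
print(gain_next_two_years == level_now)  # True

# Five halvings in ten years: 1/32 of the cost, roughly 3%.
ten_year_cost = cost_fraction_after(10)
print(ten_year_cost)
```

Ten years is five doublings, so the cost fraction is 1/32, which is where the "3%" figure comes from.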

(Technologist’s) Moore’s Law Provides Transistors

Number of transistors
per chip doubles every
two years (18 months)

Merely a “Law” of
Business Psychology


Performance from Technology & Architecture

Reprinted from Hennessy and Patterson, “Computer Architecture:
A Quantitative Approach,” 3rd Edition, 2003, Morgan Kaufmann Publishers.

Architects Use Transistors To Compute Faster

• Bit Level Parallelism (BLP) within Instructions
[Figure: Instrns vs. Time]

• Instruction Level Parallelism (ILP) among Instructions
[Figure: Instrns vs. Time]

• Scores of speculative instructions look sequential!

Architects Use Transistors To Tolerate Slow Memory
• Cache
– Small, Fast Memory
– Holds information (expected) to be used soon
– Mostly Successful

• Apply Recursively
– Level-one cache(s)
– Level-two cache

• Most of microprocessor
die area is cache!
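
The recursive use of caches can be made concrete with the standard average memory access time calculation, applied from the outermost level inward. A minimal sketch; the latencies and miss rates below are illustrative assumptions, not measurements from the talk:

```python
def average_access_time(levels, memory_latency):
    """Average access time in cycles for a cache hierarchy.

    `levels` lists (hit_latency, miss_rate) pairs from the level-one cache
    outward; each miss rate is local to its level. `memory_latency` is the
    cost of going all the way to main memory.
    """
    time = memory_latency
    # Work inward: each level pays its hit latency plus its miss rate
    # times the average cost of the level behind it.
    for hit_latency, miss_rate in reversed(levels):
        time = hit_latency + miss_rate * time
    return time


# Hypothetical numbers: L1 = 2 cycles, 5% misses; L2 = 12 cycles, 20% misses;
# main memory = 300 cycles.
hierarchy = [(2, 0.05), (12, 0.20)]
with_caches = average_access_time(hierarchy, 300)
without_caches = average_access_time([], 300)
print(with_caches, without_caches)
```

Under these assumed numbers the hierarchy cuts the average access from 300 cycles to under 6, which is the sense in which caches are "mostly successful".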

Outline
• Executive Summary
• Background
• Going Forward Processor Architecture Hits Rock

– Technology Continues
– Slow Memory
– Implications

• Chip Multiprocessing to the Rescue?
• Go to the Hard Place of Mainstream Multithreading

Future Technology Implications
• For (at least) ten years, Moore’s Law continues
– More repeated doublings of number of transistors per chip
– Faster transistors

• But hard for processor architects to use
– More transistors due to global wire delays
– Faster transistors due to too much dynamic power


• Moreover, hitting a Rock: Slow Memory
– Memory access = 100s of floating-point multiplies!
– a.k.a. Wall [Wulf & McKee 1995]


Rock: Memory Gets (Relatively) Slower

Reprinted from Hennessy and Patterson, “Computer Architecture:
A Quantitative Approach,” 3rd Edition, 2003, Morgan Kaufmann Publishers.


Impact of Slow Memory (Rock)
• Off-Chip Misses are now hundreds of cycles
[Figure: Instrns vs. Time, alternating Compute Phases and Memory Phases. Good Case!]

• More Realistic Case
[Figure: Instrns vs. Time for instructions I1, I2, I3, I4; window = 4 (64)]

Implications of Slow Memory (Rock)
• Increasing Memory Latency hides Compute Phase
• Near Term Implications
– Reduce memory latency
– Fewer memory accesses
– More Memory Level Parallelism (MLP)

• Longer Term Implications
– What can single-threaded software do while waiting 100
instruction opportunities, 200, 400, … 1000?
– What can amazing speculative hardware do?
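
How growing latency hides the compute phase, and how memory-level parallelism pushes back, can be sketched with a deliberately simple model: execution time is compute cycles plus miss latency, with overlapping misses dividing the exposed latency by the MLP. The model and all numbers are illustrative assumptions, not from the talk:

```python
def execution_cycles(compute_cycles, misses, miss_latency, mlp=1.0):
    """Total cycles when `mlp` misses overlap on average (mlp=1: fully serialized)."""
    return compute_cycles + (misses / mlp) * miss_latency


def compute_fraction(compute_cycles, misses, miss_latency, mlp=1.0):
    """Share of total time spent in compute phases."""
    total = execution_cycles(compute_cycles, misses, miss_latency, mlp)
    return compute_cycles / total


# Hypothetical workload: 1000 compute cycles and 10 off-chip misses.
for latency in (100, 200, 400, 1000):
    serialized = compute_fraction(1000, 10, latency)
    overlapped = compute_fraction(1000, 10, latency, mlp=4)
    print(latency, round(serialized, 3), round(overlapped, 3))
```

In this model, quadrupling MLP offsets a fourfold latency increase exactly, which is why the near-term list includes both fewer accesses and more MLP.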


Assessment So Far
• Appears
– Popular Moore’s Law (doubling performance)
will end soon, regardless of the
real Moore’s Law (doubling transistors)
• Processor performance hitting Rock (Slow Memory)
• No known way to overcome this, unless
• Redefine performance in Popular Moore’s Law
– From Processor Performance
– To Chip Performance

Outline
• Executive Summary
• Background
• Going Forward Processor Architecture Hits Rock
• Chip Multiprocessing to the Rescue?
– Small & Large CMPs
– CMP Systems
– CMP Workload

• Go to the Hard Place of Mainstream Multithreading

Performance for Chip, not Processor or Thread
• Chip Multiprocessing (CMP)
• Replicate Processor
• Private L1 Caches
– Low latency
– High bandwidth

• Shared L2 Cache
– Larger than if private


Piranha Processing Node
Next few slides from Luiz Barroso’s ISCA 2000 presentation of
Piranha: A Scalable Architecture Based on Single-Chip Multiprocessing

[Figure: Piranha processing node. Eight CPUs, each with I$ and D$, connect through the ICS to eight L2$ banks and eight MEM-CTLs, plus HE, RE, and Router]

Alpha core: 1-issue, in-order, 500MHz
L1 caches: I&D, 64KB, 2-way
Intra-chip switch (ICS): 32GB/sec, 1-cycle delay
L2 cache: shared, 1MB, 8-way
Memory Controller (MC): RDRAM, 12.8GB/sec (8 banks @1.6GB/sec)
Protocol Engines (HE & RE): prog., 1K instr., even/odd interleaving
System Interconnect: 4-port Xbar router, topology independent, 32GB/sec total bandwidth (4 Links @ 8GB/s)

Single-Chip Piranha Performance
[Figure: bar chart of Normalized Execution Time, split into CPU, L2Hit, and L2Miss, for OLTP and DSS. Configurations: P1 (500 MHz, 1-issue), INO (1GHz, 1-issue), OOO (1GHz, 4-issue), P8 (500MHz, 1-issue). Bar labels: 350, 233, 191, 145, 100, 100, 44, 34]

• Piranha’s performance margin 3x for OLTP and 2.2x for DSS
• Piranha has more outstanding misses → better utilizes memory system

Simultaneous Multithreading (SMT)
• Multiplex S logical processors on each processor
– Replicate registers, share caches, & manage other parts
– Implementation factors keep S small, e.g., 2-4

• Cost-effective gain if threads available
– E.g., S=2 → 1.4x performance

• Modest cost
– Limits waste if additional logical processor(s) not used

• Worthwhile CMP enhancement


Small CMP Systems
• Use One CMP (with C cores of S-way SMT)
– C=[2,16] & S=[2,4] → C*S = [4,64]
– Size of a small PC!

• Directly Connect CMP (C) to
Memory Controller (M) or DRAM
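
The thread counts quoted for these systems are products of cores per chip and SMT ways. A minimal sketch; the `chips` parameter is an added generalization for multi-chip systems, and the function name is illustrative:

```python
def hardware_threads(cores_per_chip, smt_ways, chips=1):
    """Logical processors visible to software: chips * C cores * S-way SMT."""
    return chips * cores_per_chip * smt_ways


# Single-CMP range: C in [2, 16], S in [2, 4].
smallest = hardware_threads(2, 2)
largest = hardware_threads(16, 4)
print(smallest, largest)  # 4 64

# Multi-chip examples: 2 chips of 4 cores x 4-way, 16 chips of 16 cores x 4-way.
print(hardware_threads(4, 4, chips=2), hardware_threads(16, 4, chips=16))  # 32 1024
```

Every one of these logical processors needs a runnable thread to be useful, which is the software question the rest of the talk turns to.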


Medium CMP Systems
• Use 2-16 CMPs (with C cores of S-way SMT)
– Smaller: 2*4*4 = 32
– Larger: 16*16*4 = 1024
– In a single cabinet

• Connecting CMPs & Memory Controllers/DRAM raises many issues
[Figure: two ways to connect CMPs (C) and memory controllers (M): Dance Hall and Processor-Centric]

Inflection Points
• Inflection point occurs when
– Smooth input change leads to
– Disruptive output change

• Enough transistors for …
– 1970s simple microprocessor
– 1980s pipelined RISC
– 1990s speculative out-of-order
– 2000s …

• CMP will be Server Inflection Point
– Expect >10x performance for less cost
– Implying, >>10x cost-performance
– Early CMPs like old SMPs but expect dramatic advances!


So What’s Wrong with CMP Picture?
• Chip Multiprocessors
– Allow profitable use of more transistors
– Support modest to vast multithreading
– Will be inflection point for commercial servers

• But
– Many workloads have single thread (available to run)
– Even if single thread solves a problem formerly done by
many people in parallel (e.g., clerks in payroll processing)

• Go to a Hard Place
– Make most workloads flourish with CMPs
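
One standard way to quantify the single-thread problem is Amdahl's Law, which the slides do not cite by name and is brought in here only as an illustration. A minimal sketch, assuming a fixed fraction of the work can be spread evenly over the available processors:

```python
def amdahl_speedup(parallel_fraction, processors):
    """Speedup when `parallel_fraction` of the work scales perfectly across `processors`."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / processors)


# A workload with one available thread gains nothing from more processors.
print(amdahl_speedup(0.0, 8))  # 1.0

# Hypothetical workloads: half parallel on 8 processors, 99% parallel on 64.
half_parallel = amdahl_speedup(0.5, 8)
mostly_parallel = amdahl_speedup(0.99, 64)
print(half_parallel, mostly_parallel)
```

Under this model a chip with many processors pays off only when nearly all of the work is multithreaded, which is why the recommendation is to make most workloads flourish with CMPs rather than only the already-parallel ones.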

Outline
• Executive Summary
• Background
• Going Forward Processor Architecture Hits Rock
• Chip Multiprocessing to the Rescue?
• Go to the Hard Place of Mainstream Multithreading
– Parallel from Fringe to Center
– For All of Computer Science!

Thread Parallelism from Fringe to Center
• History
– Automatic Computer (vs. Human) → Computer
– Digital Computer (vs. Analog) → Computer

• Must Change
– Parallel Computer (vs. Sequential) → Computer
– Parallel Algorithm (vs. Sequential) → Algorithm
– Parallel Programming (vs. Sequential) → Programming
– Parallel Library (vs. Sequential) → Library
– Parallel X (vs. Sequential) → X

• Otherwise, repeated performance doublings unlikely

Computer Architects Can Contribute
• Chip Multiprocessor Design
– Transcend pre-CMP multiprocessor design
– Intra-CMP has lower latency & much higher bandwidth

• Hide Multithreading (Helper Threads)
• Assist Multithreading (Thread-Level Speculation)
• Ease Multithreaded Programming (Transactions)
• Provide a “Gentle Ramp to Parallelism” (Hennessy)

But All of Computer Science is Needed
• Hide Multithreading (Libraries & Compilers)
• Assist Multithreading (Development Environments)
• Ease Multithreaded Programming (Languages)
• Divide & Conquer Multithreaded Complexity
(Theory & Abstractions)
• Must Enable
– 99% of programmers think sequentially while
– 99% of instructions execute in parallel

• Enable a “Parallelism Superhighway”
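
One existing form of "hiding multithreading in a library" is a parallel map: the programmer writes a sequential-looking function and call, and the library spreads the work over threads. A minimal sketch using Python's standard `concurrent.futures`; the helper names `word_count` and `parallel_map` are illustrative, and the example is not from the talk:

```python
from concurrent.futures import ThreadPoolExecutor


def word_count(text):
    """Ordinary sequential code: no locks or threads in sight."""
    return len(text.split())


def parallel_map(function, items, max_workers=4):
    """Apply `function` to each item on a thread pool, preserving input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(function, items))


documents = ["a rock", "a hard place", "mainstream multithreading for all"]
counts = parallel_map(word_count, documents)
print(counts)  # [2, 3, 4]
```

In CPython, threads do not speed up CPU-bound pure-Python work because of the global interpreter lock; `ProcessPoolExecutor` offers the same `map` interface for that case. The point of the sketch is the programming model: the caller thinks sequentially while the library decides how to run in parallel.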

Summary
• (Single-Threaded) Computing faces a Rock: Slow Memory
• Popular Moore’s Law (doubling performance) will end soon
• Chip Multiprocessing Can Help
– >>10x cost-performance for multithreaded workloads
– What about software with one apparent thread?

• Go to Hard Place: Mainstream Multithreading
– Make most workloads flourish with chip multiprocessing
– Computer architects can help, but long run
– Requires moving multithreading from CS fringe to center

• Necessary For Restoring Popular Moore’s Law