
(1)

Advanced Computer Architecture – fall 2003, Technion

Multithreaded Architectures

Dr. Avi Mendelson


(2)

Overview

Multithreaded Architecture

Multithreaded Micro-Architecture

Conclusions


(3)

References

J. Silc, B. Robic, T. Ungerer. Asynchrony in Parallel Computing: From Dataflow to Multithreading. Parallel and Distributed Computing Practices, vol. 1, no. 1, March 1998. http://www-csd.ijs.si/silc/pdcp.html

R.S. Nikhil, G.M. Papadopoulos, Arvind. *T: A Multithreaded Massively Parallel Architecture. In Proc. 19th Annual International Symposium on Computer Architecture (ISCA), pp. 156-167, 1992.

A. Agarwal, J. Kubiatowicz, D. Kranz, B.-H. Lim, D. Yeung, G. D'Souza, M. Parkin. Sparcle: An Evolutionary Processor Design for Large-Scale Multiprocessors. IEEE Micro, vol. 13, no. 3, pp. 48-61, June 1993.

R. Alverson, D. Callahan, D. Cummings, B. Koblenz, A. Porterfield, B. Smith. The Tera Computer System. In Proc. 1990 International Conference on Supercomputing, pp. 1-6, June 1990.

H. Hirata, K. Kimura, S. Nagamine, Y. Mochizuki, A. Nishimura, Y. Nakase, T. Nishizawa. An Elementary Processor Architecture with Simultaneous Instruction Issuing from Multiple Threads. In Proc. 19th ISCA, pp. 136-145, 1992.

G.S. Sohi, S.E. Breach, T.N. Vijaykumar. Multiscalar Processors. In Proc. 22nd ISCA, 1995.

M. Fillo, S.W. Keckler, W.J. Dally, N.P. Carter, A. Chang, Y. Gurevich, W.S. Lee. The M-Machine Multicomputer. In Proc. 28th Annual International Symposium on Microarchitecture, pp. 146-156, 1995.

D.M. Tullsen, S.J. Eggers, H.M. Levy. Simultaneous Multithreading: Maximizing On-Chip Parallelism. In Proc. 22nd ISCA, 1995.

D.M. Tullsen, S.J. Eggers, J.S. Emer, H.M. Levy, J.L. Lo, R.L. Stamm. Exploiting Choice: Instruction Fetch and Issue on an Implementable Simultaneous Multithreading Processor. In Proc. 23rd ISCA, 1996.

J.L. Lo, S.J. Eggers, J.S. Emer, H.M. Levy, R.L. Stamm, D.M. Tullsen. Converting Thread-Level Parallelism to Instruction-Level Parallelism via Simultaneous Multithreading. ACM Transactions on Computer Systems, August 1997.

S.J. Eggers, J.S. Emer, H.M. Levy, J.L. Lo, R.L. Stamm, D.M. Tullsen. Simultaneous Multithreading: A Platform for Next-Generation Processors. IEEE Micro, October 1997.

J. Emer. Simultaneous Multithreading: Multiplying Alpha Performance. Microprocessor Forum, 1999. http://www.alphapowered.com/simu-multi-thread.ppt

H. Akkary (Intel), M. Driscoll (Portland State Univ.). A Dynamic Multithreading Processor. MICRO-31, 1998.

A. Roth, G. Sohi. Speculative Data-Driven Multithreading. Submitted to ASPLOS '00.

M. Slater. MicroUnity Lifts Veil on MediaProcessor: New Architecture Designed for Broadband Communications. Microprocessor Report, October 23, 1995. http://www.mpronline.com/mpr/h/19951023/091402.html


(4)

Goals of Multithreaded Architecture

Successful MTA must have:

Minimal impact on the conventional design

Improved throughput on multiple thread workloads

Multiple thread = multithreaded or multiprogrammed workload

Good cost/throughput

Minimal impact on single-thread performance

Would also like

Performance gain on multithreaded applications


(5)

Kinds of Multithreaded Architectures

Two dimensions

Primary: Front-end interleaving

Fine-grain (cycle-by-cycle) vs. coarse-grain (longer intervals)

Secondary: Back-end interleaving

Time multiplexing vs. space multiplexing

Depends on front-end

Three valid combinations

Blocked MT: coarse-grain FE + time-multiplexed BE

Interleaved MT: fine-grain FE + time-multiplexed BE

Simultaneous MT: fine-grain FE + space-multiplexed BE
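
The three combinations can be pictured as grids of issue slots. The sketch below is only an illustration: the function name `issue_grid`, the 2-wide machine, and the fixed `run_length` (standing in for "run until a long-latency event") are assumptions, not part of the lecture.

```python
def issue_grid(kind, n_threads, cycles, width=2, run_length=3):
    """Return a cycles x width grid naming the thread that owns each issue slot.

    kind is "blocked", "interleaved", or "simultaneous".
    run_length is a hypothetical stand-in for the point at which a blocked
    thread hits a long-latency event and is switched out.
    """
    grid = []
    for cycle in range(cycles):
        if kind == "blocked":
            # Coarse-grain front end: the same thread owns a run of cycles.
            owner = (cycle // run_length) % n_threads
            row = [owner] * width
        elif kind == "interleaved":
            # Fine-grain front end, time-multiplexed back end:
            # a new thread every cycle, but only one thread per cycle.
            owner = cycle % n_threads
            row = [owner] * width
        elif kind == "simultaneous":
            # Fine-grain front end, space-multiplexed back end:
            # slots of the same cycle are shared between threads.
            row = [(cycle + slot) % n_threads for slot in range(width)]
        else:
            raise ValueError(f"unknown kind: {kind}")
        grid.append(row)
    return grid


for kind in ("blocked", "interleaved", "simultaneous"):
    print(kind, issue_grid(kind, n_threads=2, cycles=4))
```

Only the simultaneous grid ever shows two different threads in the same row, which is what space multiplexing of the back end means.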


(6)

Throughput vs. Utilization

Not the same thing

Throughput: how many instructions complete per cycle

Utilization: how many resources busy per cycle

Can increase one without the other

Can increase one while decreasing the other
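
A small worked example may make the distinction concrete. The slot labels ("useful", "wasted") and the three toy schedules are invented for illustration; they are not measurements.

```python
# Hypothetical model: a schedule is a list of cycles, each cycle a list of
# issue slots. A slot is "useful" (its result is needed), "wasted" (executed
# but discarded, e.g. a predicated-off instruction), or None (empty).

def throughput(schedule):
    """Useful instructions completed per cycle."""
    useful = sum(slot == "useful" for cycle in schedule for slot in cycle)
    return useful / len(schedule)


def utilization(schedule):
    """Fraction of issue slots that are busy, whether useful or not."""
    busy = sum(slot is not None for cycle in schedule for slot in cycle)
    total = sum(len(cycle) for cycle in schedule)
    return busy / total


# 1-wide scalar pipe with one stall cycle out of four.
scalar = [["useful"], ["useful"], [None], ["useful"]]

# 4-wide superscalar: more work per cycle, but most slots stay empty.
superscalar = [["useful", "useful", None, None],
               ["useful", None, None, None],
               ["useful", "useful", "useful", None],
               [None, None, None, None]]

# Same machine with predication: extra slots are busy, but much of the
# extra work is thrown away.
predicated = [["useful", "useful", "wasted", "wasted"],
              ["useful", "wasted", None, None],
              ["useful", "useful", "useful", "wasted"],
              ["useful", "wasted", "wasted", None]]

for name, sched in [("scalar", scalar), ("superscalar", superscalar),
                    ("predicated", predicated)]:
    print(name, throughput(sched), utilization(sched))
```

In these toy schedules, going from scalar to superscalar doubles throughput (0.75 to 1.5) while utilization halves (0.75 to 0.375); adding predication lifts utilization to about 0.81 but throughput only to 1.75.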


(7)

Scalar Execution

Dependencies reduce throughput/utilization

[Figure omitted; axis: Time]


(8)

Superscalar Execution

Generally increases throughput, but decreases utilization

[Figure omitted; axis: Time]


(9)

Predication

Generally increases utilization, increases throughput less

(much of the utilization is thrown away)


(11)

Blocked Multithreading

May increase utilization and throughput, but must switch when the current thread enters a low-utilization/low-throughput section (e.g. L2 cache miss)


(12)

Fine Grained Multithreading

Increases utilization/throughput by reducing impact of dependences

[Figure omitted; axis: Time]


(13)

Simultaneous Multithreading

Increases utilization/throughput

[Figure omitted; axis: Time]


(14)

Blocked Multithreading

Critical decision: when to switch threads

Answer: when the current thread’s utilization/throughput is about to drop

Primary example: L2 cache miss

Requirements for throughput:

Thread-switch + pipe-fill time << blocking latency

Would like to get some work done before other thread comes back

Fast thread-switch: multiple register banks

Fast pipe-fill: short pipe

Examples

Macro-dataflow machine

MIT Alewife
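
The requirement "thread-switch + pipe-fill time << blocking latency" can be checked with a back-of-the-envelope model. This is a simplified textbook-style approximation, not a formula from the lecture: every thread is assumed to run for `run_cycles`, then block for `miss_latency`, and every switch is assumed to cost `switch_cost` cycles (switch plus pipe-fill). All names and numbers are illustrative.

```python
def blocked_mt_efficiency(n_threads, run_cycles, miss_latency, switch_cost):
    """Approximate fraction of cycles spent on useful work.

    Unsaturated: too few threads to cover the miss, so useful work grows
    linearly with the number of threads.
    Saturated: there is always a ready thread, so only switch cost is lost.
    """
    per_thread_period = run_cycles + switch_cost + miss_latency
    unsaturated = n_threads * run_cycles / per_thread_period
    saturated = run_cycles / (run_cycles + switch_cost)
    return min(unsaturated, saturated)


# Cheap switch (e.g. multiple register banks, short pipe): 1 cycle.
for n in (1, 4, 8):
    print("switch=1 ", n, round(blocked_mt_efficiency(n, 20, 100, 1), 3))

# Expensive switch: 30 cycles. Extra threads cannot recover the loss.
for n in (1, 4, 8):
    print("switch=30", n, round(blocked_mt_efficiency(n, 20, 100, 30), 3))
```

With a 1-cycle switch, eight threads reach about 95% efficiency in this model; with a 30-cycle switch the ceiling is 40% no matter how many threads are available, which is why fast thread switch and a short pipe matter.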


(15)

Interleaved Multithreading

Critical decision: none?

Requirements for throughput:

Enough threads to eliminate intra-thread hazards

Increasing the number of threads reduces single-thread performance

Examples:

Denelcor HEP: 8 threads (latencies were shorter then)
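
The trade-off between "enough threads" and single-thread performance can be sketched with a strict round-robin model. The assumptions are loud ones: one thread issues per cycle, every instruction depends on the previous instruction of the same thread (worst case), and `dependence_latency` is the number of cycles before a result can be consumed. The function names are hypothetical.

```python
def min_threads(dependence_latency):
    """Threads needed so that a thread's next turn comes only after its
    previous instruction has produced its result."""
    return max(1, dependence_latency)


def single_thread_rate(n_threads):
    """Fraction of the peak issue rate one thread sees under round-robin."""
    return 1 / n_threads


def bubbles(n_threads, dependence_latency, cycles):
    """Count issue slots lost because the thread whose turn it is
    still waits on its own previous instruction."""
    last_issue = [None] * n_threads
    wasted = 0
    for cycle in range(cycles):
        t = cycle % n_threads          # strict round-robin, no thread choice
        if last_issue[t] is None or cycle - last_issue[t] >= dependence_latency:
            last_issue[t] = cycle      # the thread issues in its slot
        else:
            wasted += 1                # hazard: the slot becomes a bubble
    return wasted


print(min_threads(8), single_thread_rate(8))
print(bubbles(8, 8, 64), bubbles(2, 8, 64), bubbles(1, 8, 64))
```

With an 8-cycle dependence latency, eight threads leave no bubbles in 64 cycles, while two threads waste 48 of the 64 slots; the price is that each of the eight threads runs at one eighth of the peak rate.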


(16)

Simultaneous Multi-threading

Critical decision: fetch-interleaving policy

Requirements for throughput:

Enough threads to utilize resources

Note: far fewer threads than are needed to stretch dependences in interleaved MT

Examples:

Compaq Alpha EV8


(17)

SMT Case Study: EV8

8-issue OOO processor

SMT Support

Multiple sequencers (PC): 4

More physical registers

Thread tags on all sequential resources: ROB, LSQ, etc.

Process tags on all address space resources: caches, TLBs, etc.

Notice: none of these things are in the core
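
As an illustration of what "thread tags on sequential resources" could look like, here is a toy reorder buffer whose entries carry a thread id. It is a sketch of the general idea only and says nothing about how EV8 implemented it; the class `TaggedROB` and its methods are hypothetical.

```python
from collections import deque


class TaggedROB:
    """Shared reorder buffer in which every entry carries its owner thread."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = deque()   # [thread_id, seq, done], in allocation order
        self.next_seq = 0

    def allocate(self, thread_id):
        """Add an entry for thread_id; the buffer is shared by all threads."""
        if len(self.entries) >= self.capacity:
            return None
        seq = self.next_seq
        self.next_seq += 1
        self.entries.append([thread_id, seq, False])
        return seq

    def complete(self, seq):
        """Mark the entry with this sequence number as executed."""
        for entry in self.entries:
            if entry[1] == seq:
                entry[2] = True

    def squash(self, thread_id, after_seq):
        """Drop one thread's entries younger than after_seq (e.g. after a
        branch mispredict); other threads are untouched."""
        self.entries = deque(e for e in self.entries
                             if not (e[0] == thread_id and e[1] > after_seq))

    def retire(self, thread_id):
        """Retire the oldest entry of thread_id if it is done.
        Program order is enforced per thread, not across threads."""
        for entry in self.entries:
            if entry[0] == thread_id:
                if entry[2]:
                    self.entries.remove(entry)
                    return entry[1]
                return None
        return None


rob = TaggedROB(capacity=8)
a0 = rob.allocate(0)
b0 = rob.allocate(1)
a1 = rob.allocate(0)
rob.complete(b0)
print(rob.retire(0))   # thread 0's oldest entry is not done yet
print(rob.retire(1))   # thread 1 retires independently of thread 0
rob.squash(0, a0)      # removes a1 only
print(list(rob.entries))
```

The tag is the only thread-aware part: allocation, completion, and storage are the same as in a single-threaded buffer, which fits the slide's remark that the core itself is unchanged.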


(18)

Basic EV8

[Figure: pipeline diagram. Stages: Fetch, Decode/Map, Queue, Reg Read, Execute, Dcache/Store Buffer, Reg Write, Retire. Structures: PC, Icache, Register Map, Dcache, Regs, Regs. Label: Thread-blind]


(19)

SMT EV8

[Figure: pipeline diagram. Stages: Fetch, Decode/Map, Queue, Reg Read, Execute, Dcache/Store Buffer, Reg Write, Retire. Structures: Icache, Dcache, PC, Register Map, Regs, Regs]


(20)

SMT Performance Study (U. Wash.)

Execution resources

IQ size - 2x32

EU - 3 FP, 4 Int/Mem, 2 Int

Fetch/rename/retire bandwidth - 8 instructions

Speedup of 4 threads - 2.1

Experiments

Fetch 4 instructions from 2 threads

Fetch priority to threads with the fewest instructions in Decoder/Renamer/IQ

Influence of MT on branch prediction and caches

Where are the bottlenecks

IQ size/EU number/memory - not a bottleneck

Fetch/BTB - potential bottleneck


(21)

Performance Scalability

[Figure: three bar charts; vertical axis 0% to 250% (0% to 300% for the third)]

Multiprogrammed Workload: SpecInt, SpecFP, Mixed Int/FP at 1T, 2T, 3T, 4T

Decomposed SPEC95 Applications: Turb3d, Swm256, Tomcatv at 1T, 2T, 3T, 4T

Third panel (untitled): Barnes, Chess, Sort, TP at 1T, 2T, 4T


(22)

Fetch Interleaving on SMT

What if one thread gets “stuck”?

Round-robin: eventually it will fill up the machine (not good)

ICOUNT: thread with fewest instructions in pipe has priority

Translation: thread doesn’t get to fetch until it gets “unstuck”

Variation: what if one thread is spinning?

Not really stuck, gets to keep fetching

Have to stick it artificially (QUIESCE)
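
The difference between round-robin and ICOUNT when one thread is stuck can be seen in a toy simulation. Everything here is an assumption made for illustration: a single shared queue of 16 entries, a 4-wide fetch from one thread per cycle, a 4-wide issue, two threads, and thread 0 permanently stuck. It is not the U. Wash. simulator or the EV8 policy.

```python
def simulate(policy, cycles=40, queue_size=16, fetch_width=4, issue_width=4,
             stuck_thread=0, n_threads=2):
    """Return (instructions issued per thread, final queue occupancy per thread)."""
    in_queue = [0] * n_threads   # entries each thread holds in the shared queue
    issued = [0] * n_threads
    for cycle in range(cycles):
        # Issue: only threads that are not stuck can drain the queue.
        budget = issue_width
        for t in range(n_threads):
            if t != stuck_thread:
                n = min(in_queue[t], budget)
                in_queue[t] -= n
                issued[t] += n
                budget -= n
        # Fetch: choose one thread for this cycle.
        if policy == "round_robin":
            t = cycle % n_threads
        elif policy == "icount":
            # Fewest instructions in the pipe wins; ties go to the lower id.
            t = min(range(n_threads), key=lambda i: in_queue[i])
        else:
            raise ValueError(f"unknown policy: {policy}")
        free = queue_size - sum(in_queue)
        in_queue[t] += min(fetch_width, free)
    return issued, in_queue


print("round_robin", simulate("round_robin"))
print("icount     ", simulate("icount"))
```

Under round-robin the stuck thread ends up owning all 16 queue entries and the healthy thread issues only 12 instructions in 40 cycles. Under ICOUNT the stuck thread holds 4 entries, never fetches again, and the healthy thread issues 152. A spinning thread would keep issuing in this model, so its count stays low and it keeps winning fetch slots, which is the case QUIESCE is meant to handle.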


(23)

Improving Performance on MT Apps

Shared memory apps:

Communicate through caches

Communication is faster if it happens in the same cache

No coherence overhead


(24)

Summary

Multithreaded Software

Multithreaded Architecture

Advantageous cost/throughput

Blocked MT

Good single thread performance

Good throughput

Needs fast thread switch and short pipe

Interleaved MT

Bad single thread performance

Good throughput

Needs many threads

Simultaneous MT

Good throughput

Good single thread performance

Good utilization

