(1)

Advanced Computer Architecture – fall 2003, Technion

Multithreaded Architectures

Dr. Avi Mendelson

(2)

Overview

- Multithreaded Architecture
- Multithreaded Micro-Architecture
- Conclusions

(3)

References

- “Asynchrony in Parallel Computing: From Dataflow to Multithreading” by Jurij Silc, Borut Robic, and Theo Ungerer, Parallel and Distributed Computing Practices Vol. 1, No. 1, March 1998. http://www-csd.ijs.si/silc/pdcp.html
- R.S. Nikhil, G.M. Papadopoulos and Arvind. *T: A Multithreaded Massively Parallel Architecture. In Proc. 19th Annual International Symposium on Computer Architecture, pp. 156-167, 1992.
- A. Agarwal, J. Kubiatowicz, D. Kranz, B.-H. Lim, D. Yeung, G. D'Souza, M. Parkin. Sparcle: An Evolutionary Processor Design for Large-Scale Multiprocessors. IEEE Micro, vol. 13, no. 3, pp. 48-61, June 1993.
- R. Alverson, D. Callahan, D. Cummings, B. Koblenz, A. Porterfield, B. Smith. The Tera Computer System. In Proc. 1990 International Conference on Supercomputing, pp. 1-6, June 1990.
- H. Hirata, K. Kimura, S. Nagamine, Y. Mochizuki, A. Nishimura, Y. Nakase and T. Nishizawa. An Elementary Processor Architecture with Simultaneous Instruction Issuing from Multiple Threads. ISCA ‘19, pp. 136-145, 1992.
- G.S. Sohi, S.E. Breach, T.N. Vijaykumar. Multiscalar Processors. ISCA ‘22, 1995.
- M. Fillo, S.W. Keckler, W.J. Dally, N.P. Carter, A. Chang, Y. Gurevich, W.S. Lee. The M-Machine Multicomputer. In Proc. 28th Annual Inter. Sym. on Microarchitecture, pp. 146-156, 1995.
- “Simultaneous Multithreading: Maximizing On-Chip Parallelism” by Tullsen, Eggers and Levy. ISCA ’95.
- D.M. Tullsen, S.J. Eggers, J.S. Emer, H.M. Levy, J.L. Lo, R.L. Stamm. Exploiting Choice: Instruction Fetch and Issue on an Implementable Simultaneous Multithreading Processor. ISCA ‘23, 1996.
- “Converting Thread-Level Parallelism to Instruction-Level Parallelism via Simultaneous Multithreading” by Lo, Eggers, Emer, Levy, Stamm and Tullsen in ACM Transactions on Computer Systems, August 1997.
- “Simultaneous Multithreading: A Platform for Next-Generation Processors” by Eggers, Emer, Levy, Lo, Stamm and Tullsen in IEEE Micro, October 1997.
- “Simultaneous Multithreading: Multiplying Alpha Performance”, Joel Emer, Microprocessor Forum 1999. http://www.alphapowered.com/simu-multi-thread.ppt
- “A Dynamic Multithreading Processor” by H. Akkary (Intel), M. Driscoll (Portland State Univ.). Micro-31, Nov. 1998.
- “Speculative Data-Driven Multithreading” by Amir Roth, Guri Sohi. Submitted to ASPLOS ‘00.
- “MicroUnity Lifts Veil on MediaProcessor: New Architecture Designed for Broadband Communications”, Michael Slater, Microprocessor Report, 10/23/95. http://www.mpronline.com/mpr/h/19951023/091402.html

(4)

Goals of Multithreaded Architecture

- A successful MTA must have:
  - Minimal impact on the conventional design
  - Improved throughput on multiple-thread workloads
    - Multiple thread = multithreaded or multiprogrammed workload
  - Good cost/throughput
  - Minimal impact on single-thread performance
- Would also like:
  - Performance gain on multithreaded applications

(5)

Kinds of Multithreaded Architectures

- Two dimensions
  - Primary: front-end interleaving
    - Fine-grain (cycle-by-cycle) vs. coarse-grain (longer intervals)
  - Secondary: back-end interleaving
    - Time multiplexing vs. space multiplexing
    - Depends on front-end
- Three valid combinations
  - Blocked MT: coarse-grain FE + time-multiplexed BE
  - Interleaved MT: fine-grain FE + time-multiplexed BE
  - Simultaneous MT: fine-grain FE + space-multiplexed BE
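
The three valid combinations can be written down as a small lookup table. This is only a sketch: the string labels `"coarse"`, `"fine"`, `"time"`, `"space"` and the function name `classify` are hypothetical names chosen for illustration, not terminology from the lecture.

```python
from typing import Optional

# Keys are (front-end interleaving, back-end multiplexing).
# The label strings are assumptions made for this sketch.
VALID_COMBINATIONS = {
    ("coarse", "time"): "Blocked MT",
    ("fine", "time"): "Interleaved MT",
    ("fine", "space"): "Simultaneous MT",
}


def classify(front_end: str, back_end: str) -> Optional[str]:
    """Return the kind of multithreading, or None if the pair is not valid."""
    return VALID_COMBINATIONS.get((front_end, back_end))


print(classify("fine", "space"))    # Simultaneous MT
print(classify("coarse", "space"))  # None
```

The fourth pair (coarse-grain FE with space-multiplexed BE) is absent, presumably because a back end can only share a cycle between threads if the front end supplies instructions from more than one thread at a time.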

(6)

Throughput vs. Utilization

- Not the same thing
  - Throughput: how many instructions complete per cycle
  - Utilization: how many resources are busy per cycle
- Can increase one without the other
- Can increase one while decreasing the other
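
A minimal sketch of the two metrics, assuming the only resources are issue slots and that each slot in a cycle is either idle (`None`), doing work that completes (`"useful"`), or doing work that is later discarded (`"wasted"`). The slot labels and the example grids are made up for illustration.

```python
def throughput(grid):
    """Completed instructions per cycle."""
    useful = sum(slot == "useful" for cycle in grid for slot in cycle)
    return useful / len(grid)


def utilization(grid):
    """Fraction of issue slots that are busy (useful or wasted)."""
    slots = [slot for cycle in grid for slot in cycle]
    busy = sum(slot is not None for slot in slots)
    return busy / len(slots)


# One issue slot per cycle, one cycle lost to a dependence.
scalar = [["useful"], [None], ["useful"], ["useful"]]

# Four issue slots per cycle, same dependence pattern.
wide = [
    ["useful", "useful", None, None],
    [None, None, None, None],
    ["useful", None, None, None],
    ["useful", "useful", "useful", None],
]

# Same as `wide`, but some idle slots now do work that is thrown away.
wide_wasted = [
    ["useful", "useful", "wasted", "wasted"],
    [None, None, None, None],
    ["useful", "wasted", None, None],
    ["useful", "useful", "useful", None],
]

for name, grid in [("scalar", scalar), ("wide", wide), ("wide_wasted", wide_wasted)]:
    print(name, throughput(grid), utilization(grid))
```

In this toy data, going from `scalar` to `wide` raises throughput (0.75 to 1.5) while lowering utilization (0.75 to 0.375), and going from `wide` to `wide_wasted` raises utilization (to 0.5625) with no change in throughput.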

(7)

Scalar Execution

Dependencies reduce throughput/utilization

[Figure omitted; axis label: Time]

(8)

Superscalar Execution

Generally increases throughput, but decreases utilization

[Figure omitted; axis label: Time]

(9)

Predication

Generally increases utilization, increases throughput less (much of the utilization is thrown away)

(11)

Blocked Multithreading

May increase utilization and throughput, but must switch when the current thread enters a low utilization/throughput section (e.g. L2 cache miss)

(12)

Fine Grained Multithreading

Increases utilization/throughput by reducing impact of dependences

[Figure omitted; axis label: Time]

(13)

Simultaneous Multithreading

Increases utilization/throughput

[Figure omitted; axis label: Time]

(14)

Blocked Multithreading

- Critical decision: when to switch threads
  - Answer: when the current thread's utilization/throughput is about to drop
  - Primary example: L2 cache miss
- Requirements for throughput:
  - Thread-switch + pipe-fill time << blocking latency
    - Would like to get some work done before the other thread comes back
  - Fast thread-switch: multiple register banks
  - Fast pipe-fill: short pipe
- Examples
  - Macro-dataflow machine
  - MIT Alewife
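
The throughput requirement can be sketched as simple arithmetic. This assumes a simplified model in which the second thread does no useful work until the switch and the pipe refill are both finished. The function name and all cycle counts below are illustrative assumptions, not measurements of any machine.

```python
def cycles_recovered(blocking_latency, switch_cycles, pipe_fill_cycles):
    """Cycles of useful work another thread can do while the first is blocked.

    Simplified model: nothing useful happens during the thread switch
    or while the pipe refills.
    """
    overhead = switch_cycles + pipe_fill_cycles
    return max(0, blocking_latency - overhead)


# Hypothetical numbers: fast switch and short pipe vs. a long miss.
print(cycles_recovered(blocking_latency=200, switch_cycles=2, pipe_fill_cycles=10))  # 188

# Hypothetical numbers: overhead exceeds the blocking latency, so nothing is gained.
print(cycles_recovered(blocking_latency=40, switch_cycles=20, pipe_fill_cycles=30))  # 0
```

Switching only pays when the overhead is much smaller than the blocking latency, which is why fast thread-switch and fast pipe-fill appear as requirements.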

(15)

Interleaved Multithreading

- Critical decision: none?
- Requirements for throughput:
  - Enough threads to eliminate intra-thread hazards
  - Increasing the number of threads reduces single-thread performance
- Examples:
  - Denelcor HEP: 8 threads (latencies were shorter then)
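
A toy model of both requirements, assuming strict round-robin issue of one instruction per cycle and that every instruction depends on the previous instruction of the same thread with a fixed latency. The function name `simulate_interleaved` and the latency of 4 cycles are assumptions for illustration.

```python
def simulate_interleaved(num_threads, dep_latency, cycles):
    """Return the number of instructions each thread issues.

    Strict round-robin: cycle c belongs to thread c % num_threads.
    The slot is wasted if that thread's previous result is not ready yet.
    """
    ready_at = [0] * num_threads  # first cycle at which each thread may issue
    issued = [0] * num_threads
    for cycle in range(cycles):
        t = cycle % num_threads
        if cycle >= ready_at[t]:
            issued[t] += 1
            ready_at[t] = cycle + dep_latency
    return issued


CYCLES = 16
for n in (1, 2, 4, 8):
    issued = simulate_interleaved(n, dep_latency=4, cycles=CYCLES)
    total_rate = sum(issued) / CYCLES    # machine throughput
    thread_rate = issued[0] / CYCLES     # single-thread performance
    print(n, total_rate, thread_rate)
```

With a latency of 4, machine throughput in this model reaches one instruction per cycle once there are 4 threads, since the hazard is hidden by the other threads. Going from 4 to 8 threads adds no throughput but halves the rate of each individual thread.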

(16)

Simultaneous Multithreading

- Critical decision: fetch-interleaving policy
- Requirements for throughput:
  - Enough threads to utilize resources
    - Notice: many fewer than needed to stretch dependences
- Examples:
  - Compaq Alpha EV8

(17)

SMT Case Study: EV8

- 8-issue OOO processor
- SMT support
  - Multiple sequencers (PC): 4
  - More physical registers
  - Thread tags on all sequential resources: ROB, LSQ, etc.
  - Process tags on all address space resources: caches, TLBs, etc.
- Notice: none of these things are in the core

(18)

Basic EV8

[Figure: pipeline diagram]
Stages: Fetch, Decode/Map, Queue, Reg Read, Execute, Dcache/Store Buffer, Reg Write, Retire
Structures: PC, Icache, Register Map, Dcache, Regs
Label: Thread-blind

(19)

SMT EV8

[Figure: pipeline diagram]
Stages: Fetch, Decode/Map, Queue, Reg Read, Execute, Dcache/Store Buffer, Reg Write, Retire
Structures: Icache, Dcache, PC, Register Map, Regs

(20)

SMT Performance Study (U. Wash.)

- Execution resources
  - IQ size - 2x32
  - EU - 3 FP, 4 Int/Mem, 2 Int
  - Fetch/rename/retire bandwidth - 8 instructions
- Speedup of 4 threads - 2.1
- Experiments
  - Fetch 4 instr from 2 threads
  - Fetch priority to threads with fewest instructions in Decoder/Renamer/IQ
  - Influence of MT on branch prediction and caches
- Where are the bottlenecks
  - IQ size/EU number/memory - not a bottleneck
  - Fetch/BTB - potential bottleneck

(21)

Performance Scalability

[Figure: three bar charts; y-axis 0% to 250% (0% to 300% in the third)]
Panel series: SpecInt, SpecFP, Mixed Int/FP (1T, 2T, 3T, 4T); Turb3d, Swm256, Tomcatv (1T, 2T, 3T, 4T); Barnes, Chess, Sort, TP (1T, 2T, 4T)
Panel titles: Decomposed SPEC95 Applications; Multiprogrammed Workload

(22)

Fetch Interleaving on SMT

- What if one thread gets “stuck”?
  - Round-robin: eventually it will fill up the machine (not good)
  - ICOUNT: thread with fewest instructions in pipe has priority
    - Translation: thread doesn’t get to fetch until it gets “unstuck”
- Variation: what if one thread is spinning?
  - Not really stuck, gets to keep fetching
  - Have to stick it artificially (QUIESCE)
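
The difference between the two fetch policies can be seen in a toy model. This is a sketch under strong assumptions: one fetch per cycle, a single shared pool of 16 instruction slots, a stuck thread that never retires anything, and other threads that retire one instruction per cycle. The function name `simulate_fetch` and all parameters are hypothetical, not the EV8 or U. Wash. design.

```python
def simulate_fetch(policy, cycles=40, capacity=16, stuck_thread=0, num_threads=2):
    """Return (in_flight, fetched) per thread after `cycles` cycles."""
    in_flight = [0] * num_threads
    fetched = [0] * num_threads
    last = num_threads - 1
    for _ in range(cycles):
        # Every thread except the stuck one retires one instruction per cycle.
        for t in range(num_threads):
            if t != stuck_thread and in_flight[t] > 0:
                in_flight[t] -= 1
        if sum(in_flight) >= capacity:
            continue  # machine is full, nobody can fetch
        if policy == "round_robin":
            t = (last + 1) % num_threads
        else:
            # "icount": fewest instructions in flight, ties go to the lowest id.
            t = min(range(num_threads), key=lambda i: in_flight[i])
        in_flight[t] += 1
        fetched[t] += 1
        last = t
    return in_flight, fetched


print(simulate_fetch("round_robin"))  # ([16, 0], [16, 15])
print(simulate_fetch("icount"))       # ([1, 1], [1, 39])
```

Under round-robin the stuck thread ends up holding all 16 slots and the other thread stops fetching. Under the ICOUNT-style policy the stuck thread holds a single slot and the other thread fetches almost every cycle. A spinning thread would not look stuck to this metric, because its instructions keep retiring.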

(23)

Improving Performance on MT Apps

- Shared memory apps:
  - Communicate through caches
  - Communication is faster if it happens in the same cache
    - No coherence overhead

(24)

Summary

- Multithreaded Software
- Multithreaded Architecture
  - Advantageous cost/throughput
- Blocked MT
  - Good single thread performance
  - Good throughput
  - Needs fast thread switch and short pipe
- Interleaved MT
  - Bad single thread performance
  - Good throughput
  - Needs many threads
- Simultaneous MT
  - Good throughput
  - Good single thread performance
  - Good utilization
