
Tomorrow’s Computing Engines
February 3, 1998
Symposium on High-Performance Computer Architecture

William J. Dally
Computer Systems Laboratory
Stanford University
billd@csl.stanford.edu

WJD Feb 3, 1998

Tomorrow's Computing Engines

1

Focus on Tomorrow, not Yesterday

Generals tend to always fight the last war

Computer architects tend to always design the last computer

old programs
old technology assumptions


Some Previous “Wars” (1/3)

MARS Router (1984)
Torus Routing Chip (1985)
Network Design Frame (1988)
Reliable Router (1994)

Some Previous “Wars” (2/3)

MDP Chip
J-Machine
Cray T3D
MAP Chip

Some Previous “Wars” (3/3)


Tomorrow’s Computing Engines

• Driven by tomorrow’s applications - media
• Constrained by tomorrow’s technology



90% of Desktop Cycles will Be Spent on ‘Media’ Applications by 2000
• Quote from Scott Kirkpatrick of IBM (talk abstract)
• Media applications include
– video encode/decode
– polygon & image-based graphics
– audio processing - compression, music, speech recognition/synthesis
– modulation/demodulation at audio and video rates

• These applications involve stream processing
• So do
– radar processing: SAR, STAP, MTI ...



Typical Media Kernel
Image Warp and Composite





• Read 10,000 pixels from memory
• Perform 100 16-bit integer operations on each pixel
• Test each pixel
• Write 3,000 result pixels that pass to memory


• Little reuse of data fetched from memory
– each pixel used once

• Little interaction between pixels
– very insensitive to operation latency

• Challenge is to maximize bandwidth
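
The kernel figures above can be turned into a rough arithmetic-intensity estimate. The sketch below only restates the slide's numbers; the function name and the choice to count one memory access per pixel read or written are assumptions made for illustration.

```python
# Figures taken from the slide.
PIXELS_READ = 10_000
OPS_PER_PIXEL = 100
PIXELS_WRITTEN = 3_000

def kernel_profile():
    """Operation and memory-access counts for the warp-and-composite kernel.

    Assumes one memory access per pixel read or written (an illustrative
    simplification, not a figure from the talk).
    """
    ops = PIXELS_READ * OPS_PER_PIXEL
    accesses = PIXELS_READ + PIXELS_WRITTEN
    return {
        "ops": ops,
        "memory_accesses": accesses,
        "ops_per_access": ops / accesses,
    }

profile = kernel_profile()
print(profile)
```

Under these assumptions the kernel performs roughly 77 arithmetic operations per memory access, which is why the slide frames the challenge as bandwidth rather than latency.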

Telepresence: A Driving Application

Acquire 2D Images -> Extract Depth (3D Images) -> Segmentation / Model Extraction -> Compression -> Channel -> Decompression -> Rendering -> Display 3D Scene

Most kernels: Latency insensitive

High ratio of arithmetic to memory references


Tomorrow’s Technology is Wire Limited
• Lots of devices
• A little faster
• Slow wires



Technology scaling makes communication the scarce resource

1997: 0.35 µm, 64Mb DRAM, 16 64b FP Proc, 400MHz; 18mm, 12,000 tracks, 1 clock
2007: 0.10 µm, 4Gb DRAM, 1K 64b FP Proc, 2.5GHz; 32mm, 90,000 tracks, 20 clocks

On-chip wires are getting slower

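[Figure: a wire of length y, drawn at width x1 and at scaled width x2. With repeaters, wire segments of delay RCy^2 alternate with gates of delay tg.]

Wire delay: $t_w = RCy^2$

$x_2 = s\,x_1$    0.5x
$R_2 = R_1/s^2$    4x
$C_2 = C_1$    1x
$t_{w2} = R_2 C_2 y^2 = t_{w1}/s^2$    4x
$t_{w2}/t_{g2} = t_{w1}/(t_{g1} s^3)$    8x
$v = 0.5\,(t_g RC)^{-1/2}$ (m/s), $v_2 = v_1 s^{1/2}$    0.7x
$v t_g = 0.5\,(t_g/RC)^{1/2}$ (m/gate), $v_2 t_{g2} = v_1 t_{g1} s^{3/2}$    0.35x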
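
The scaling factors on this slide can be checked numerically. The following is a minimal sketch, assuming gate delay scales linearly with s (tg2 = s * tg1), which is what the slide's tw2/tg2 relation implies, and a fixed wire length y. The function name `wire_scaling` is hypothetical.

```python
def wire_scaling(s):
    """Ratios (new / old) when feature size shrinks by factor s (x2 = s * x1).

    Assumes gate delay scales with s and wire length y is held constant.
    """
    r = 1 / s**2   # R2 / R1: resistance per unit length
    c = 1.0        # C2 / C1: capacitance per unit length is unchanged
    tw = r * c     # wire delay, tw = R * C * y^2 with y fixed
    tg = s         # gate delay ratio (assumption stated above)
    return {
        "R": r,
        "C": c,
        "tw": tw,
        "tw_over_tg": tw / tg,
        "v": (tg * r * c) ** -0.5,      # signal velocity, m/s
        "v_tg": (tg / (r * c)) ** 0.5,  # distance covered per gate delay
    }

for name, factor in wire_scaling(0.5).items():
    print(f"{name}: {factor:.2f}x")
```

With s = 0.5 this reproduces the factors listed on the slide: 4x, 1x, 4x, 8x, about 0.7x, and about 0.35x.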

Bandwidth and Latency of Modern VLSI

[Figure: log-scale plot of Bandwidth and Latency versus Size (1 to 10^5), with the Chip Boundary marked.]

Architecture for Locality

[Figure: block diagram. Off-chip RAM connects over the pins (Pin-Bandwidth, 2GB/s) to an on-chip Vector Reg File, Switch, and 104 32-bit ALUs. On-chip bandwidth labels: 50GB/s and 500GB/s.]

Exploit high on-chip bandwidth

Tomorrow’s Computing Engines


• Aimed at media processing
– stream based
– latency tolerant
– low-precision
– little reuse
– lots of conditionals

• Use the large number of devices available on future chips
• Make efficient use of scarce communication resources
– bandwidth hierarchy
– no centralized resources

• Approach the performance of a special-purpose processor

Why do Special-Purpose Processors Perform Well?

Lots (100s) of ALUs
Fed by dedicated wires/memories

Care and Feeding of ALUs

[Figure: an ALU and the structures that feed it. Instruction Bandwidth: IP, Instr. Cache, IR. Data Bandwidth: Regs.]

‘Feeding’ Structure Dwarfs ALU

Three Key Problems
• Instruction bandwidth
• Data bandwidth
• Conditional execution


A Bandwidth Hierarchy

[Figure: SDRAM (x4) -> Streaming Memory (1.6GB/s) -> Vector Register File (50GB/s) -> ALU Clusters (500GB/s); 13 ALUs per cluster.]

• Solves data bandwidth problem
• Matched to bandwidth curve of technology

A Streaming Memory System

[Figure: two Address Generators (inputs labeled D and IX) feed a Crossbar; each Crossbar output passes through a Reorder Queue to an SDRAM Bank.]

Streaming Memory Performance

[Figure: Bank Queue Effectiveness. Cycles/Access (0 to 1.8) versus Queue Size (1, 2, 4, 8, 16, 32, 64, Infinite).]

• Exploit latency insensitivity for improved bandwidth
• 1.75:1 Performance improvement from relatively short reorder queue
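
The idea of trading latency for bandwidth with a reorder queue can be illustrated with a toy model. This is a sketch under stated assumptions, not the simulator behind the slide's chart: a single bank with one open row, a hypothetical cost of 1 cycle for a row hit and 3 cycles for a row miss, and a greedy policy that serves the oldest queued request to the open row. It ignores starvation and does not reproduce the slide's 1.75:1 figure.

```python
from collections import deque

def cycles_per_access(rows, queue_size, hit_cycles=1, miss_cycles=3):
    """Average cycles per access for one bank with a reorder queue.

    rows: sequence of row numbers, one per request, in program order.
    hit_cycles and miss_cycles are illustrative assumptions.
    """
    pending = deque(rows)
    queue = []
    open_row = None
    cycles = 0
    served = 0
    while pending or queue:
        # Refill the reorder queue from the request stream.
        while pending and len(queue) < queue_size:
            queue.append(pending.popleft())
        # Prefer the oldest request that hits the open row, else the oldest.
        idx = next((i for i, r in enumerate(queue) if r == open_row), 0)
        row = queue.pop(idx)
        cycles += hit_cycles if row == open_row else miss_cycles
        open_row = row
        served += 1
    return cycles / served

# Requests that alternate between two rows thrash an in-order bank.
pattern = [0, 1] * 4
print(cycles_per_access(pattern, queue_size=1))  # in order: every access misses
print(cycles_per_access(pattern, queue_size=8))  # reordered: one miss per row
```

Because each request may wait in the queue, individual latency grows, which is acceptable only because the kernels are latency insensitive.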


Compound Vector Operations
1 Instruction does lots of work
Memory Instructions: LD Vd Vx
Compound Vector Instruction: Op V0 V1 V2 V3 V4 V5 V6 V7

[Figure: uIP indexes a Control Store whose microinstructions (Op Ra Rb, Op Ra Rb, ...) drive the ALUs; Mem, AG, and VRF blocks.]

1 CV Inst (50b) expands to:
uInst (300b) x 20 uInst/Op x 1000 el/vec = 6 x 10^6 b
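
The instruction-bandwidth saving on this slide is a single product, worked below. The constant and function names are hypothetical; the numbers are the slide's.

```python
# Figures taken from the slide.
CV_INST_BITS = 50           # one compound vector instruction
UINST_BITS = 300            # one microinstruction
UINSTS_PER_OP = 20          # microinstructions per compound operation
ELEMENTS_PER_VECTOR = 1000  # vector length

def expanded_bits():
    """Microinstruction bits issued on behalf of one compound vector instruction."""
    return UINST_BITS * UINSTS_PER_OP * ELEMENTS_PER_VECTOR

def expansion_ratio():
    """Bits of microcode driven per bit of fetched instruction."""
    return expanded_bits() / CV_INST_BITS

print(expanded_bits())    # 6000000
print(expansion_ratio())  # 120000.0
```

So each fetched instruction bit stands in for about 120,000 bits of control delivered locally from the control store.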

Scheduling by Simulated Annealing


• List scheduling assumes global communication
– does poorly when communication exposed

• View scheduling as a CAD problem (place and route)
– generate naïve ‘feasible’ schedule
– iteratively improve schedule by moving operations.

[Figure: Ready Ops placed on a grid of ALUs versus Time.]
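
The place-and-route view of scheduling can be sketched as follows. This is a minimal illustration, not the scheduler described in the talk: the energy function (schedule length), the one-cycle cross-ALU communication delay, the cooling parameters, and all function names are assumptions.

```python
import math
import random

def is_feasible(schedule, deps, comm_delay=1):
    """Each (alu, time) slot holds one op; consumers wait for producers."""
    if len(set(schedule.values())) != len(schedule):
        return False
    for op, preds in deps.items():
        alu, t = schedule[op]
        for p in preds:
            p_alu, p_t = schedule[p]
            # One cycle to compute, plus a hop if the value crosses ALUs.
            ready = p_t + 1 + (comm_delay if p_alu != alu else 0)
            if t < ready:
                return False
    return True

def energy(schedule):
    """Schedule length in cycles (an assumed energy function)."""
    return max(t for _, t in schedule.values()) + 1

def anneal(deps, n_alus, steps=5000, temp=2.0, cooling=0.999, seed=0):
    """deps maps every op to its predecessors, listed in topological order."""
    rng = random.Random(seed)
    ops = list(deps)
    # Naive feasible schedule: every op in sequence on ALU 0.
    current = {op: (0, i) for i, op in enumerate(ops)}
    horizon = len(ops)
    best = dict(current)
    for _ in range(steps):
        # Move one operation to a random (alu, time) slot.
        op = rng.choice(ops)
        trial = dict(current)
        trial[op] = (rng.randrange(n_alus), rng.randrange(horizon))
        if is_feasible(trial, deps):
            delta = energy(trial) - energy(current)
            # Metropolis rule: always accept improvements, sometimes accept worse.
            if delta <= 0 or rng.random() < math.exp(-delta / temp):
                current = trial
                if energy(current) < energy(best):
                    best = dict(current)
        temp *= cooling
    return best

deps = {"a": [], "b": [], "c": ["a", "b"], "d": [], "e": ["c", "d"]}
best = anneal(deps, n_alus=2)
print(best, energy(best))
```

Because the communication delay is part of the feasibility check, the search sees the cost of moving a value between ALUs directly, which is the point the slide makes against list scheduling.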

Typical Annealing Schedule

[Figure: annealing trace. Vertical axis 0 to 180, horizontal axis 1 to 18001 iterations; data labels 166 and 13; annotation "Energy function changed".]

Conventional Approaches to Data-Dependent Conditional Execution

[Figure: flow diagrams for a block A followed by a test x>0 that selects B, C (Y) or J, K (N).
Data-Dependent Branch: a wrong guess ("Whoops") incurs a Speculative Loss of D x W, ~1000.
Conditional form: y=(x>0); B if y; J if ~y; C if y; K if ~y. Exponentially Decreasing Duty Factor.]

Zero-Cost Conditionals
• Most Approaches to Conditional Operations are Costly
– Branching control flow - dead issue slots on mispredicted branches
– Predication (SIMD select, masked vectors) - large fraction of execution ‘opportunities’ go idle.

• Conditional Vectors
– append an element to an output stream depending on a case variable.

[Figure: a Result Stream is steered by a Case Stream {0,1} into Output Stream 0 or Output Stream 1.]
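
The conditional-vector operation can be sketched in a few lines. This is an illustrative software model only; the function name and the list-based streams are assumptions, and the hardware mechanism is not described on the slide.

```python
def conditional_split(result_stream, case_stream):
    """Append each result to output stream 0 or 1 according to its case value."""
    if len(result_stream) != len(case_stream):
        raise ValueError("streams must have the same length")
    out0, out1 = [], []
    for value, case in zip(result_stream, case_stream):
        (out1 if case else out0).append(value)
    return out0, out1

out0, out1 = conditional_split([10, 11, 12, 13], [0, 1, 1, 0])
print(out0, out1)  # [10, 13] [11, 12]
```

Each output stream is dense, so later kernels process only the elements that took that case instead of idling on masked-off slots.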

Application Sketch - Polygon Rendering
[Figure: a triangle with vertices V1, V2, V3 is scanned into spans (Y, X1 to X2) and then into pixels (X, Y).]

Stream records at each stage:
Vertex: X Y RGB UV
Span: Y X1 X2 RGB1 RGB UV1 UV
Pixel: X Y RGB UV
Textured Pixel: X Y RGB

Status
• Working simulator of Imagine
• Simple kernels running on simulator
– FFT

• Applications being developed
– Depth extraction, video compression, polygon rendering,
image-based graphics

• Circuit/Layout studies underway


Acknowledgements




• Students/Staff
– Don Alpert (Intel)
– Chris Buehler (MIT)
– J.P. Grossman (MIT)
– Brad Johanson
– Ujval Kapasi
– Brucek Khailany
– Abelardo Lopez-Lagunas
– Peter Mattson
– John Owens
– Scott Rixner

• Helpful Suggestions
– Henry Fuchs (UNC)
– Pat Hanrahan
– Tom Knight (MIT)
– Marc Levoy
– Leonard McMillan (MIT)
– John Poulton (UNC)

Conclusion
• Work toward tomorrow’s computing engines
• Targeted toward media processing
– streams of low-precision samples
– little reuse
– latency tolerant

• Matched to the capabilities of communication-limited technology
– explicit bandwidth hierarchy
– explicit communication between units
– communication exposed

• Insight not numbers
