Tomorrow’s Computing Engines
February 3, 1998
Symposium on High-Performance Computer Architecture
William J. Dally
Computer Systems Laboratory
Stanford University
billd@csl.stanford.edu
Focus on Tomorrow, not Yesterday
• Generals tend to fight the last war
• Computer architects tend to design the last computer
– old programs
– old technology assumptions
Some Previous “Wars” (1/3)
• MARS Router (1984)
• Torus Routing Chip (1985)
• Network Design Frame (1988)
• Reliable Router (1994)
Some Previous “Wars” (2/3)
• MDP Chip
• J-Machine
• Cray T3D
• MAP Chip
Some Previous “Wars” (3/3)
Tomorrow’s Computing Engines
• Driven by tomorrow’s applications - media
• Constrained by tomorrow’s technology
90% of Desktop Cycles will Be Spent on ‘Media’ Applications by 2000
• Quote from Scott Kirkpatrick of IBM (talk abstract)
• Media applications include
– video encode/decode
– polygon & image-based graphics
– audio processing - compression, music, speech recognition/synthesis
– modulation/demodulation at audio and video rates
• These applications involve stream processing
• So do
– radar processing: SAR, STAP, MTI ...
Typical Media Kernel
Image Warp and Composite
• Read 10,000 pixels from memory
• Perform 100 16-bit integer operations on each pixel
• Test each pixel
• Write the 3,000 result pixels that pass back to memory
• Little reuse of data fetched from memory
– each pixel used once
• Little interaction between pixels
– very insensitive to operation latency
• Challenge is to maximize bandwidth
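A minimal Python sketch of the kernel's shape, to make the stream pattern concrete (the per-pixel arithmetic is a placeholder, not the actual warp/composite math):

```python
def warp_and_composite(pixels, threshold=0x8000):
    """Stream-style media kernel: each pixel is read once, transformed
    by a fixed sequence of 16-bit integer operations, tested, and
    conditionally written back. No reuse, no inter-pixel dependence."""
    out = []
    for p in pixels:                      # read ~10,000 pixels from memory
        v = p & 0xFFFF
        for _ in range(100):              # ~100 16-bit integer ops per pixel
            v = (v * 3 + 1) & 0xFFFF      # placeholder op, wrapped to 16 bits
        if v > threshold:                 # test each pixel
            out.append(v)                 # only passing pixels (~3,000) written
    return out
```

Because each pixel is independent, the loop strip-mines across arbitrarily many ALUs; the memory stream rate, not operation latency, is the limit.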
Telepresence: A Driving Application
Acquire 2D Images → Extract Depth (3D Images) → Segmentation → Model Extraction → Compression → Channel → Decompression → Rendering → Display 3D Scene
Most kernels are latency insensitive, with a high ratio of arithmetic to memory references.
Tomorrow’s Technology is Wire Limited
• Lots of devices
• A little faster
• Slow wires
Technology scaling makes communication the scarce resource

                     1997              2007
Feature size         0.35 µm           0.10 µm
DRAM                 64 Mb             4 Gb
µP                   16 64b FP proc    1K 64b FP proc
Clock                400 MHz           2.5 GHz
Chip edge            18 mm             32 mm
Wiring tracks        12,000            90,000
Cross-chip latency   1 clock           20 clocks
On-chip wires are getting slower
Scale linear dimensions by s per generation (x2 = s·x1, s = 0.5); a wire of fixed length y has delay tw = RCy², and gate delay tg scales with s.
• R2 = R1/s² (4x)
• C2 = C1 (1x)
• tw2 = R2C2y² = tw1/s² (4x)
• tw2/tg2 = tw1/(tg1·s³) (8x)
• v = 0.5(tg·RC)^(-1/2) m/s, so v2 = v1·s^(1/2) (0.7x)
• v·tg = 0.5(tg/RC)^(1/2) m/gate, so v2·tg2 = v1·tg1·s^(3/2) (0.35x)
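The ratio column can be reproduced directly; a quick check assuming one generation with s = 0.5 (the other constants cancel in the ratios):

```python
# Per-generation ratios when linear dimensions scale by s = 0.5
# (gate delay tg also scales by s). Matches the factors on the slide.
s = 0.5
print(f"R2/R1             = {1 / s**2:.2f}x")  # wire resistance/length: 4x
print(f"C2/C1             = 1.00x")            # wire capacitance/length: ~1x
print(f"tw2/tw1           = {1 / s**2:.2f}x")  # fixed-length wire delay: 4x
print(f"(tw/tg)2/(tw/tg)1 = {1 / s**3:.2f}x")  # delay relative to a gate: 8x
print(f"v2/v1             = {s**0.5:.2f}x")    # signal velocity: 0.71x
print(f"(v tg)2/(v tg)1   = {s**1.5:.2f}x")    # reach per gate delay: 0.35x
```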
Bandwidth and Latency of Modern VLSI
[Log-log plot of bandwidth and latency vs. size: bandwidth falls and latency rises with size, with a sharp break at the chip boundary]
Architecture for Locality
[Block diagram: off-chip RAM (pin bandwidth, 2 GB/s) → vector register file (50 GB/s) → switch → 10^4 32-bit ALUs (500 GB/s)]
Exploit high on-chip bandwidth
Tomorrow’s Computing Engines
• Aimed at media processing
– stream based
– latency tolerant
– low-precision
– little reuse
– lots of conditionals
• Use the large number of devices available on future chips
• Make efficient use of scarce communication resources
– bandwidth hierarchy
– no centralized resources
• Approach the performance of a special-purpose processor
Why do Special-Purpose Processors Perform Well?
Lots (100s) of ALUs, fed by dedicated wires/memories
Care and Feeding of ALUs
[Datapath diagram: instruction bandwidth (IP → instruction cache → IR) and data bandwidth (register file) feed a single ALU]
‘Feeding’ Structure Dwarfs ALU
Three Key Problems
• Instruction bandwidth
• Data bandwidth
• Conditional execution
A Bandwidth Hierarchy
[Diagram: 4 SDRAM chips (1.6 GB/s) → streaming memory → vector register file (50 GB/s) → ALU clusters (500 GB/s), 13 ALUs per cluster]
• Solves data bandwidth problem
• Matched to bandwidth curve of technology
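The level-to-level ratios are the design point; a quick check of the slide's numbers:

```python
# Bandwidth ratios between levels of the hierarchy (rates from the slide).
sdram_gbs, vrf_gbs, cluster_gbs = 1.6, 50.0, 500.0

print(f"VRF : SDRAM     = {vrf_gbs / sdram_gbs:.0f}:1")     # ~31:1
print(f"cluster : VRF   = {cluster_gbs / vrf_gbs:.0f}:1")   # 10:1
print(f"cluster : SDRAM = {cluster_gbs / sdram_gbs:.0f}:1") # ~312:1
# A kernel therefore needs on the order of 300 words of cluster-level
# traffic per off-chip word to keep the ALUs busy -- the reuse profile
# of media kernels (many ops per element, each element fetched once).
```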
A Streaming Memory System
[Diagram: address generators issue stream references through a crossbar to per-bank reorder queues, each feeding an SDRAM bank]
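A toy Python sketch of the organization (bank count, address mapping, and timing here are illustrative assumptions, not Imagine's actual design):

```python
from collections import deque

NUM_BANKS = 4

def stream_access(addresses):
    """Toy streaming memory: references are routed through a crossbar to
    per-bank queues, each bank drains its queue independently (so accesses
    complete out of order), and a reorder step restores request order."""
    queues = [deque() for _ in range(NUM_BANKS)]
    for tag, addr in enumerate(addresses):            # address generator
        queues[addr % NUM_BANKS].append((tag, addr))  # crossbar -> bank queue
    done = {}
    while any(queues):
        for q in queues:                              # one access per bank
            if q:                                     # per 'cycle'
                tag, addr = q.popleft()
                done[tag] = addr                      # bank services reference
    return [done[t] for t in sorted(done)]            # reorder queue: in order
```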
Streaming Memory Performance
[Bar chart, "Bank Queue Effectiveness": cycles/access (0 to 1.8) vs. reorder-queue size (1, 2, 4, 8, 16, 32, 64, infinite); cycles/access falls as the queue grows]
• Exploit latency insensitivity for improved bandwidth
• 1.75:1 Performance improvement from relatively short reorder queue
Compound Vector Operations
1 instruction does lots of work
[Diagram: a compound vector instruction (Op V0 V1 V2 V3 V4 V5 V6 V7) and memory instructions (LD Vd Vx) are expanded by a microcode sequencer (uIP + control store) into per-cycle uInsts (Op Ra Rb) for the ALU clusters, address generator (AG), and vector register file (VRF)]
Instruction bandwidth: 1 CV inst (50 b) vs. 300 b/uInst × 20 uInst/op × 1000 el/vec = 6×10^6 b
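Spelling out the slide's instruction-bandwidth arithmetic:

```python
# One compound vector (CV) instruction stands in for the microinstruction
# traffic of an entire vector's worth of work (numbers from the slide).
uinst_bits, uinst_per_op, elems_per_vec = 300, 20, 1000
cv_bits = 50

micro_bits = uinst_bits * uinst_per_op * elems_per_vec   # 6,000,000 bits
print(f"uInst traffic per CV op: {micro_bits:,} b")
print(f"reduction: {micro_bits // cv_bits:,}:1")         # 120,000:1
```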
Scheduling by Simulated Annealing
• List scheduling assumes global communication
– does poorly when communication is exposed
• View scheduling as a CAD problem (place and route)
– generate a naïve ‘feasible’ schedule
– iteratively improve the schedule by moving operations
[Diagram: ready ops are placed onto an ALUs × time grid]
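A skeleton of the annealing loop, assuming a user-supplied cost function (schedule length plus exposed-communication violations) and move generator; the initial temperature and cooling rate are illustrative:

```python
import math, random

def anneal(schedule, cost, random_move, t0=10.0, cooling=0.9995, iters=20000):
    """Iteratively improve a naive feasible schedule by moving single
    operations between (ALU, cycle) slots, accepting uphill moves with
    temperature-dependent probability (classic simulated annealing)."""
    cur, cur_cost = schedule, cost(schedule)
    best, best_cost = cur, cur_cost
    t = t0
    for _ in range(iters):
        cand = random_move(cur)                  # move one op to a new slot
        cand_cost = cost(cand)
        if (cand_cost < cur_cost or
                random.random() < math.exp((cur_cost - cand_cost) / t)):
            cur, cur_cost = cand, cand_cost      # accept, possibly uphill
            if cur_cost < best_cost:
                best, best_cost = cur, cur_cost
        t *= cooling                             # cool slowly
    return best
```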
Typical Annealing Schedule
[Line chart: schedule cost vs. annealing iteration (~18,000 iterations); cost falls from 166 to 13, with a marked drop where the energy function was changed]
Conventional Approaches to Data-Dependent Conditional Execution
[Control-flow diagrams for "A; if (x > 0) then {B; C} else {J; K}" under conventional approaches:
• data-dependent branch - a misprediction ("whoops") wastes the speculative work in flight, a loss of D×W ≈ 1000 operations
• predication (y = (x > 0); B, C if y; J, K if ~y) - exponentially decreasing duty factor as conditionals nest]
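The duty-factor penalty compounds: assuming independent 50/50 conditions, each level of nesting halves the fraction of issued operations that do useful work.

```python
# Duty factor under predication: after n independent 50/50 data-dependent
# conditions, only 2^-n of the issued operations are useful on average.
for n in range(1, 6):
    print(f"{n} nested condition(s): duty factor = {0.5 ** n:.4f}")
```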
Zero-Cost Conditionals
• Most Approaches to Conditional Operations are Costly
– Branching control flow - dead issue slots on mispredicted branches
– Predication (SIMD select, masked vectors) - a large fraction of execution ‘opportunities’ goes idle
• Conditional Vectors
– append an element to an output stream depending on a case variable
[Diagram: a result stream is split by a case stream of {0, 1} into output stream 0 and output stream 1]
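A minimal sketch of the mechanism, with Python lists standing in for hardware stream buffers:

```python
def conditional_append(results, cases):
    """Route each result to output stream 0 or 1 according to its case
    value, so every execution slot produces useful work; downstream
    kernels then process each (dense) stream at full duty factor."""
    out = ([], [])
    for r, c in zip(results, cases):   # case stream of {0, 1}
        out[c].append(r)
    return out

stream0, stream1 = conditional_append([7, 3, 9, 1], [0, 1, 0, 1])
# stream0 == [7, 9], stream1 == [3, 1]
```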
Application Sketch - Polygon Rendering
[Diagram: triangle (V1, V2, V3) rasterization as a stream pipeline - vertex records (X Y RGB UV) → span records (Y X1 X2 RGB1 RGB UV1 UV) → pixel records (X Y RGB UV) → textured pixel records (X Y RGB)]
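Read as a chain of stream kernels, each consuming one record stream and producing the next; a structural sketch with stubbed kernel bodies (the record fields follow the slide, but the math does not attempt real rasterization):

```python
# Pipeline structure only: vertex -> span -> pixel -> textured pixel.

def spans(triangles):
    """Vertex stream ((x, y) tuples per triangle) -> span stream."""
    for v1, v2, v3 in triangles:
        yield {"y": v1[1], "x1": v1[0], "x2": v2[0]}   # stub: one span each

def pixels(span_stream):
    """Span stream -> pixel stream, walking across each span."""
    for s in span_stream:
        for x in range(int(s["x1"]), int(s["x2"])):
            yield {"x": x, "y": s["y"]}                # stub: no RGB/UV interp

def textured(pixel_stream, texel):
    """Pixel stream -> textured pixel stream via UV lookup (stubbed)."""
    for p in pixel_stream:
        yield {**p, "rgb": texel}                      # stub: constant texel
```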
Status
• Working simulator of Imagine
• Simple kernels running on simulator
– FFT
• Applications being developed
– Depth extraction, video compression, polygon rendering, image-based graphics
• Circuit/Layout studies underway
Acknowledgements
• Students/Staff
– Don Alpert (Intel)
– Chris Buehler (MIT)
– J.P. Grossman (MIT)
– Brad Johanson
– Ujval Kapasi
– Brucek Khailany
– Abelardo Lopez-Lagunas
– Peter Mattson
– John Owens
– Scott Rixner
• Helpful Suggestions
– Henry Fuchs (UNC)
– Pat Hanrahan
– Tom Knight (MIT)
– Marc Levoy
– Leonard McMillan (MIT)
– John Poulton (UNC)
Conclusion
• Work toward tomorrow’s computing engines
• Targeted toward media processing
– streams of low-precision samples
– little reuse
– latency tolerant
• Matched to the capabilities of communication-limited technology
– explicit bandwidth hierarchy
– explicit communication between units
– communication exposed
• Insight, not numbers