Total Data Volume: 98 Mb

Segments   SAS Data Volume   Compression Ratio
32         7.7 Mb            12x
64         5.3 Mb            17x
128        4.5 Mb            22x
256        3.6 Mb            27x
Ms of annual test cost savings
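As a rough cross-check of the compression figures above, the sketch below recomputes each ratio. It assumes the ratio is the total data volume divided by the SAS data volume, which the slide does not state. The quotients land near the printed ratios but not exactly on them (the 64-segment row differs most), so the slide may round differently or use more precise volumes.

```python
# Volumes in Mb, copied from the slide's table.
TOTAL_MB = 98.0
SAS_MB = {32: 7.7, 64: 5.3, 128: 4.5, 256: 3.6}


def compression_ratio(total_mb: float, compressed_mb: float) -> float:
    """Assumed definition: uncompressed volume divided by compressed volume."""
    return total_mb / compressed_mb


# Map each segment count to its recomputed ratio.
ratios = {segments: compression_ratio(TOTAL_MB, mb) for segments, mb in SAS_MB.items()}

for segments, ratio in ratios.items():
    print(f"{segments:>3} segments: {ratio:.1f}x")
```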
Computer Architecture Parallel Processing
Load Balancing for Parallel Visualization of Blood Head Vessel Angiography on Cluster of PCs.
Shared Channels in Interconnection Networks.
Study of modified Multistage Interconnection Networks for Networks-on-Chips.
Design of a Simulator for a Class of Dynamic Execution Processors.
Beyond Instruction-Level Parallelism in Processor Architecture.
Design and Performance Evaluation of a Distributed Crossbar Scheduler.
Software Pipelining for Reconfigurable Instruction Set Processors.
Large-Scale SMT Architectures
Scalable front end
• Multiple i-caches
• Scalable i-cache capacity
• Scalable i-cache bandwidth
One-level scalable and shareable data cache
• Split into multiple block-interleaved banks
• Each bank is single-ported and shared by all threads
• Parallel access to different banks through interconnect
• Complexity grows with number of ports and banks
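The banked data cache described above can be illustrated with a minimal sketch. The block size, the bank count, and the first-come arbitration are assumptions for illustration; the slides give none of them. The sketch shows how block interleaving maps addresses to banks, and why two threads that hit the same single-ported bank in one cycle cannot both be served.

```python
BLOCK_BYTES = 64  # assumed cache block size (not given in the slides)
NUM_BANKS = 8     # assumed number of D-cache banks (not given in the slides)


def bank_of(address: int, block_bytes: int = BLOCK_BYTES, num_banks: int = NUM_BANKS) -> int:
    """Block-interleaved mapping: consecutive blocks map to consecutive banks."""
    return (address // block_bytes) % num_banks


def schedule_cycle(requests):
    """Grant at most one request per single-ported bank in a cycle.

    `requests` is a list of (thread_id, address) pairs issued in the same cycle.
    Arbitration here is first-come in list order, a hypothetical policy.
    Returns (granted, stalled), both lists of (thread_id, address).
    """
    granted, stalled = [], []
    busy_banks = set()
    for thread_id, address in requests:
        bank = bank_of(address)
        if bank in busy_banks:
            # Bank port already taken this cycle: the request must retry.
            stalled.append((thread_id, address))
        else:
            busy_banks.add(bank)
            granted.append((thread_id, address))
    return granted, stalled


# Threads 0 and 2 collide on bank 0; threads 1 and 3 collide on bank 1.
requests = [(0, 0x0000), (1, 0x0040), (2, 0x0200), (3, 0x0048)]
granted, stalled = schedule_cycle(requests)
print("granted:", granted)
print("stalled:", stalled)
```

Requests that fall in different banks proceed in parallel through the interconnect, so raising the bank count reduces conflicts at the cost of a larger interconnect.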
[Figure: block diagram of the modular SMT processor. Four front-end/execution modules, each with PCs, I-cache, Decode/Rename, Int Q and FP Q, Registers, Bypass, and ALU, FPU and LS units; an Interconnect links them to Dcache Banks and Memory Modules.]
Modular and Scalable SMT, not a pure SMT
Most hardware resources have limited thread sharing
• I-caches, Decode logic, Queues, Registers, FUs
SPEC 2000 Simulation
• 8 simultaneous threads
Simulation Parameters
Issue/retirement width: 32 instructions/cycle
Scheduling Queue: 128 entries
Load-Store Queue: 64 entries
Other Resources:
• 24 simple ALUs
• 8 fully pipelined FPUs
• 4-cycle latency for FP add and FP multiply
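The parameters above can be collected into a small configuration record, which makes the resource budget per thread easy to read off. This is a sketch only: the class and field names are hypothetical, and the even per-thread split is an illustrative assumption, since the slides do not say how resources are divided among threads.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SimConfig:
    """Simulation parameters as listed on the slide (field names are hypothetical)."""
    threads: int = 8
    issue_retire_width: int = 32        # instructions per cycle
    scheduling_queue_entries: int = 128
    load_store_queue_entries: int = 64
    simple_alus: int = 24
    fpus: int = 8                       # fully pipelined
    fp_add_latency: int = 4             # cycles
    fp_mul_latency: int = 4             # cycles

    def per_thread(self, total: int) -> float:
        """Even split of a resource across threads (illustrative assumption)."""
        return total / self.threads


cfg = SimConfig()

# Aggregate IPC cannot exceed the issue and retirement width.
peak_ipc = cfg.issue_retire_width

print("peak aggregate IPC:", peak_ipc)
print("issue slots per thread:", cfg.per_thread(cfg.issue_retire_width))
print("scheduling queue entries per thread:", cfg.per_thread(cfg.scheduling_queue_entries))
print("simple ALUs per thread:", cfg.per_thread(cfg.simple_alus))
```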
[Figure: Instructions Per Cycle (IPC), axis from 2 to 24, for the workload 188.ammp, 183.equake, 177.mesa, 176.gcc, 197.parser, 255.vortex, 175.vpr, 181.mcf. Series: Ideal Mem, Ideal Cache, Latency 3, Latency 5, Latency 7, Latency 9. Data labels: 23.07, 20.46.]