A few issues on the design of future multicores
André Seznec
IRISA/INRIA
Single Chip Uniprocessor: the end of the road
(Very) wide-issue superscalar processors are not cost-effective:
More-than-quadratic complexity on many key components (sketched below):
• Register file
• Bypass network
• Issue logic
Limited performance return
Failure of EV8 = end of very wide-issue superscalar processors
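A rough sketch of where this super-linear growth comes from, using standard scaling arguments rather than figures from the talk: with issue width W, full bypassing needs a path from every producing unit to every consuming port, and every extra instruction issued per cycle adds register-file ports whose wiring grows each cell in both dimensions:

\[
\text{bypass paths} \;\propto\; W^2,
\qquad
\text{register-file area} \;\propto\; R \cdot P^2 \quad \text{with } P \propto W,
\]

so both structures grow at least quadratically with issue width, and wire delay makes the performance return on that area even smaller.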
Hardware thread parallelism
High-end single-chip components: chip multiprocessors:
• IBM Power 5, dual-core Intel Pentium 4, dual-core Athlon 64
• Many CMP SoCs for embedded markets
• Cell
(Simultaneous) multithreading:
• Pentium 4, Power 5
Thread parallelism
Expressed by the application developer (see the sketch after this list):
• Depends on the application itself
• Depends on the programming language or paradigm
• Depends on the programmer
Discovered by the compiler: automatic (static) parallelization
Exploited by the runtime: task scheduling
Dynamically discovered/exploited by hardware or software: speculative hardware/software threading
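A minimal sketch of developer-expressed parallelism (the first item above), assuming C with OpenMP as the programming paradigm; the loop and the function name are my illustration, not from the talk:

/* The developer asserts that the iterations are independent; the compiler and
   runtime then map them onto the available hardware threads/cores.
   Compile with -fopenmp (assumption: an OpenMP-capable compiler). */
void scale(float *out, const float *in, float k, int n)
{
    #pragma omp parallel for
    for (int i = 0; i < n; i++)
        out[i] = k * in[i];
}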
Direction of (single-chip) architecture: betting on the success of parallelism
(Future) applications are intrinsically parallel: as many simple cores as possible
(Future) applications are moderately parallel: a few complex, state-of-the-art superscalar cores
SSC: Sea of Simple Cores
FCC: Few Complex Cores
SSC: Sea of Simple Cores
FCC: Few Complex Cores
[Diagram: several 4-way out-of-order superscalar cores sharing an L3 cache]
Common architectural design issues
Instruction Set Architecture
Single ISA? An extension of "conventional" multiprocessors
• Shared or distributed memory?
Heterogeneous ISAs:
• A la Cell? (master processor + slave processors) x N
• A la SoC? Specialized coprocessors
Radically new architecture?
• Which one?
Hardware accelerators?
SIMD extensions: seem to be accepted; shift the burden to application developers and compilers (see the sketch below)
Reconfigurable datapaths: popular when you have a well-defined, intrinsically parallel application
Vector extensions: might be the right move when targeting essentially scientific computing
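As a concrete illustration of how SIMD extensions shift the burden to the developer, here is what an explicitly vectorized loop looks like with x86 SSE intrinsics; the example and its constraints are mine, not from the talk:

#include <xmmintrin.h>   /* x86 SSE intrinsics */

/* c[i] = a[i] + b[i]; the developer must guarantee that n is a multiple of 4
   and that the pointers are 16-byte aligned -- exactly the kind of detail
   the SIMD model pushes onto applications and compilers. */
void add4(const float *a, const float *b, float *c, int n)
{
    for (int i = 0; i < n; i += 4) {
        __m128 va = _mm_load_ps(a + i);            /* load 4 floats */
        __m128 vb = _mm_load_ps(b + i);
        _mm_store_ps(c + i, _mm_add_ps(va, vb));   /* one 4-wide add */
    }
}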
On-chip memory/processors/memory bandwidth
The uniprocessor credo was:
“Use the remaining silicon for caches”
New issue: an extra processor or more cache?
Extra processing power = increased memory bandwidth demand, increased power consumption, more temperature hot spots
Extra cache = decreased (external) memory demand
Memory hierarchy organization?
Flat: sharing a big L2/L3 cache?
[Diagram: a grid of μP-plus-private-cache tiles, all sharing one big L3 cache]
Flat: communication issues? Through the big cache
[Diagram: the same tiled grid; cores communicate through the shared L3 cache]
Flat: communication issues? Grid-like?
[Diagram: the same tiled grid; cores communicate over a grid-like interconnect to the L3 cache]
Hierarchical organization?
[Diagram: pairs of μP cores with private caches sharing an L2 cache; four such clusters share an L3 cache]
Hierarchical organization?
Arbitration at all levels
Coherency at all levels
Interleaving at all levels
Bandwidth dimensioning
NoC structure
Very dependent on the memory hierarchy organization!
+ sharing coprocessors/hardware accelerators
+ I/O buses/(processors?)
+ memory interface
+ network interface
Example
[Diagram: example hierarchical organization: three dual-core clusters, each with an L2 cache, sharing an L3 cache, plus a memory interface and I/O]
Multithreading?
An extra level of thread parallelism!
Might be an interesting alternative to prefetching on massively parallel applications
Power and thermal issues
Voltage/frequency scaling to adapt to the workload? (see the power relation below)
Adapting the workload to the available power?
Adapting/dimensioning the architecture to the power budget?
Activity migration for managing temperatures?
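The voltage/frequency-scaling question above rests on the textbook CMOS dynamic-power relation (a standard model, not a figure from the talk):

\[
P_{dyn} \;\approx\; \alpha\, C\, V^2 f,
\qquad f \propto V \;\Rightarrow\; P_{dyn} \propto f^3,
\]

so scaling frequency and voltage down together by 2x cuts dynamic power by roughly 8x while costing at most 2x in performance, which is why adapting the operating point to the workload (or to the power budget) is attractive.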
General issues for software/compiler
Parallelism detection and partitioning: find the correct granularity
Mastering memory bandwidth
Non-uniform memory latency
Optimizing sequential code portions
SSC design specificities
Basic core granularity
RISC cores
VLIW cores
In-order superscalar cores
Homogeneous vs. heterogeneous ISAs
Core specialization:
• RISC + VLIW or DSP slaves?
• Master core + a set of special-purpose cores?
Sharing issue
Simple cores: lots of duplication and lots of unused resources at any time
Adjacent cores can share:
• Caches
• Functional units: FP, mult/div, multimedia
• Hardware accelerators
An example of sharing
[Diagram: two pairs of μP cores; each pair shares an FP unit, instruction fetch, an IL1 and a DL1 cache; a hardware accelerator and the L2 cache are shared between the pairs]
Multithreading/prefetching
Multithreading: is the extra complexity worth it for simple cores?
Prefetching: is it worthwhile? Share prefetch engines?
Vision of an SSC (my own vision)
SSC: the basic brick
[Diagram: the basic brick: four clusters, each with two μP cores sharing an FP unit, an I-cache and a D-cache, all connected to a shared L2 cache]
[Diagram: a full SSC built from several such bricks (each a set of dual-core clusters around a shared L2 cache) sharing an L3 cache, together with a memory interface, a network interface, and a system interface]
FCC design specificities
Only limited thread parallelism available?
Focus on uniprocessor architecture:
• Find the correct tradeoff between complexity and performance
• Power and temperature issues
Vector extensions?
• Contiguous vectors (a la SSE)?
• Strided vectors in L2 caches (Tarantula-like)?
Performance enablers
SMT for parallel workloads?
Helper threads? Run-ahead threads?
Speculative multithreading hardware support
Intermediate design?
SSCs:
• Shine on massively parallel applications
• Poor/limited performance on sequential sections
FCCs:
• Moderate performance on parallel applications
• Good performance on sequential sections
Amdahl’s law
Mix of FCC and SSC
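A back-of-envelope C sketch of why Amdahl's law favors the mix; all numbers (a budget of 16 simple-core equivalents, a complex core costing 4 simple cores and running sequential code 2x faster, a 95% parallel fraction) are illustrative assumptions of mine, not figures from the talk:

#include <stdio.h>

int main(void)
{
    double p = 0.95;   /* parallel fraction of the workload (assumed) */

    /* SSC: 16 simple cores, sequential code runs on one simple core. */
    double ssc = 1.0 / ((1.0 - p) / 1.0 + p / 16.0);

    /* FCC: 4 complex cores, each 2x faster on sequential code,
       so 4 * 2x of throughput on parallel code. */
    double fcc = 1.0 / ((1.0 - p) / 2.0 + p / (4.0 * 2.0));

    /* Mix: 1 complex core + 12 simple cores; sequential code runs on the
       complex core, parallel code uses everything (2 + 12 = 14). */
    double mix = 1.0 / ((1.0 - p) / 2.0 + p / (2.0 + 12.0));

    printf("SSC %.1fx, FCC %.1fx, mix %.1fx\n", ssc, fcc, mix);
    return 0;
}

With these assumptions the mix wins (roughly 10.8x against 9.1x for the SSC and 7.0x for the FCC): the complex core keeps the sequential fraction short while the simple cores supply parallel throughput.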
The basic brick
[Diagram: the mixed basic brick: two clusters of two μP cores (each cluster sharing an FP unit, an I-cache and a D-cache) plus one "ultimate" out-of-order superscalar core, all around a shared L2 cache]
[Diagram: a chip built from four such mixed bricks (each with an "Ult. O-O-O" core, simple-core clusters and an L2 cache) around a shared L3 cache, with memory, network, and system interfaces]
Conclusion
The era of the uniprocessor has come to an end
No clear trend to follow
It might be time for more architectural diversity