A few issues on the design of future multicores André Seznec IRISA/INRIA


André Seznec, CAPS project-team, IRISA/INRIA

Single Chip Uniprocessor: the end of the road

(Very) wide issue superscalar processors are not cost effective:

More than quadratic complexity on many key components:

• Register file

• Bypass network

• Issue logic

Limited performance return

Failure of EV8 = end of very wide-issue superscalar processors
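The complexity argument above can be illustrated with a first-order model. This is only a sketch under stated assumptions, not a figure from the talk: it assumes two register reads and one write per issued instruction, a register cell whose area grows with the square of the port count, and a single-level bypass network in which every result can reach every operand input. Wire delay and issue-logic effects, which push real designs beyond quadratic growth, are not modeled.

```python
def register_file_ports(issue_width: int) -> int:
    # Assumption: each instruction reads two operands and writes one result.
    return 3 * issue_width


def relative_register_cell_area(issue_width: int) -> int:
    # Assumption: every port adds one wordline and one bitline to each cell,
    # so the cell area grows with the square of the number of ports.
    ports = register_file_ports(issue_width)
    return ports * ports


def bypass_paths(issue_width: int) -> int:
    # Assumption: each of the W results can be forwarded to each of the
    # 2*W operand inputs (one bypass level only).
    return issue_width * (2 * issue_width)


if __name__ == "__main__":
    for width in (2, 4, 8):
        print(width,
              register_file_ports(width),
              relative_register_cell_area(width),
              bypass_paths(width))
```

In this model, doubling the issue width quadruples both the register cell area and the number of bypass paths, while the best-case performance gain is at most a factor of two.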


Hardware thread parallelism

High-end single-chip components:

Chip multiprocessors:

• IBM Power 5, dual-core Intel Pentium 4, dual-core Athlon-64

• Many CMP SoCs for embedded markets

• Cell

(Simultaneous) multithreading:

• Pentium 4, Power 5

• Multithreading


Thread parallelism

Expressed by the application developer:

• Depends on the application itself

• Depends on the programming language or paradigm

• Depends on the programmer

Discovered by the compiler: Automatic (static) parallelization

Exploited by the runtime: Task scheduling

Dynamically discovered/exploited by hardware or software: Speculative hardware/software threading


Direction of (single-chip) architecture: betting on the success of parallelism

(Future) applications are intrinsically parallel: as many simple cores as possible

• SSC: Sea of Simple Cores

(Future) applications are moderately parallel: a few complex, state-of-the-art superscalar cores

• FCC: Few Complex Cores


SSC: Sea of Simple Cores


FCC: Few Complex Cores

[Figure: several 4-way out-of-order (O-O-O) superscalar cores sharing an L3 cache]


Common architectural design issues


Instruction Set Architecture

Single ISA ? Extension of “conventional” multiprocessors

• Shared or distributed memory ?

Heterogeneous ISAs:

• A la CELL ? (master processor + slave processors) x N

• A la SoC ? Specialized coprocessors

• Radically new architecture ? Which one ?


Hardware accelerators ?

SIMD extensions: seem to be accepted, but shift the burden to application developers and compilers

Reconfigurable datapaths: popular when you have a well-defined, intrinsically parallel application

Vector extensions: might be the right move when targeting essentially scientific computing


On-chip memory/processors/memory bandwidth

The uniprocessor credo was:

“Use the remaining silicon for caches”

New issue: an extra processor or more cache ?

Extra processing power =

• Increased memory bandwidth demand

• Increased power consumption, more temperature hot spots

Extra cache = decreased (external) memory demand
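The processor-versus-cache tradeoff can be sketched with a toy bandwidth model. Everything numeric here is an assumption for illustration: the miss rate follows the common square-root rule of thumb (miss rate proportional to one over the square root of the cache size), the on-chip cache is shared, and the base miss rate, access rate and line size are hypothetical values.

```python
import math


def miss_rate(cache_mb: float,
              base_miss_rate: float = 0.02,
              base_cache_mb: float = 1.0) -> float:
    # Assumption: square-root rule of thumb, anchored at a hypothetical
    # 2% miss rate for a 1 MB cache.
    return base_miss_rate * math.sqrt(base_cache_mb / cache_mb)


def off_chip_bandwidth(cores: int,
                       cache_mb: float,
                       accesses_per_core_per_ns: float = 0.5,
                       line_bytes: int = 64) -> float:
    # Off-chip traffic in bytes per nanosecond (that is, GB/s):
    # every miss in the shared on-chip cache fetches one cache line.
    misses_per_ns = cores * accesses_per_core_per_ns * miss_rate(cache_mb)
    return misses_per_ns * line_bytes


if __name__ == "__main__":
    print("4 cores, 1 MB:", off_chip_bandwidth(4, 1.0))
    print("8 cores, 1 MB:", off_chip_bandwidth(8, 1.0))
    print("4 cores, 4 MB:", off_chip_bandwidth(4, 4.0))
```

Under these assumptions, doubling the number of cores doubles the external bandwidth demand, whereas quadrupling the cache halves it.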


Memory hierarchy organization ?


Flat: sharing a big L2/L3 cache?

[Figure: twelve processors (μP), each with a private cache ($), all sharing one L3 cache]


Flat: communication issues ? Through the big cache

[Figure: twelve processors (μP), each with a private cache ($), communicating through the shared L3 cache]


Flat: communication issues ? Grid-like ?

[Figure: twelve processors (μP), each with a private cache ($), arranged as a grid above the shared L3 cache]


Hierarchical organization ?

[Figure: four clusters, each with two processors (μP) with private caches ($) sharing an L2 cache; the four L2 caches share one L3 cache]


Hierarchical organization ?

Arbitration at all levels

Coherency at all levels

Interleaving at all levels

Bandwidth dimensioning
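As an illustration of interleaving, the sketch below maps cache-line addresses onto the banks of one cache level. The line size, the bank count and the function name `bank_of` are hypothetical; a real design may pick different address bits or hash them.

```python
def bank_of(address: int, line_bytes: int = 64, banks: int = 8) -> int:
    # Low-order line-address interleaving: consecutive cache lines are
    # spread over consecutive banks.
    return (address // line_bytes) % banks


if __name__ == "__main__":
    # Sequential lines hit different banks and can be served in parallel.
    print([bank_of(line * 64) for line in range(8)])
    # A stride of banks * line_bytes always lands on the same bank.
    print([bank_of(line * 8 * 64) for line in range(4)])
```

The second access pattern shows why bandwidth dimensioning and interleaving interact: a pathological stride serializes all accesses on a single bank.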


NoC structure

Very dependent on the memory hierarchy organization !!

+ sharing coprocessors/hardware accelerators

+ I/O buses/(processors ?)

+ memory interface

+ network interface


Example

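[Figure: three clusters, each with two processors (μP) with private caches ($) sharing an L2 cache; a shared L3 cache connected to the memory interface and to I/O]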


Multithreading ?

An extra level of thread parallelism !!

Might be an interesting alternative to prefetching on massively parallel applications
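Why multithreading can stand in for prefetching is easiest to see with the classic latency-hiding model. This is a sketch under simplifying assumptions: every thread alternates a fixed compute phase with a fixed memory stall, switching threads is free, and the cycle counts are hypothetical.

```python
def core_utilization(threads: int,
                     compute_cycles: float,
                     miss_latency: float) -> float:
    # While one thread waits miss_latency cycles for memory, the other
    # threads can use the core; utilization saturates at 1.0.
    busy_fraction = compute_cycles / (compute_cycles + miss_latency)
    return min(1.0, threads * busy_fraction)


if __name__ == "__main__":
    for n in (1, 2, 4, 8, 16):
        print(n, core_utilization(n, compute_cycles=20, miss_latency=180))
```

With 20 compute cycles per 180-cycle miss, one thread keeps the core busy a tenth of the time, and about ten threads are needed to hide the latency completely, which is only realistic when the application offers that much parallelism.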


Power and thermal issues

Voltage/frequency scaling to adapt to the workload ?

Adapting the workload to the available power ?

Adapting/dimensioning the architecture to the power budget

Activity migration for managing temperatures ?
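The leverage of voltage/frequency scaling follows from the standard dynamic power relation P = C·V²·f. The sketch below assumes that the supply voltage can be lowered in proportion to the frequency and ignores static (leakage) power; the normalized values are hypothetical.

```python
def dynamic_power(capacitance: float, voltage: float, frequency: float) -> float:
    # Standard first-order CMOS dynamic power: P = C * V^2 * f.
    return capacitance * voltage ** 2 * frequency


def scaled_power(scale: float) -> float:
    # Assumption: voltage scales linearly with frequency, so power
    # scales with the cube of the frequency (normalized C = V = f = 1).
    return dynamic_power(1.0, scale, scale)


if __name__ == "__main__":
    one_fast_core = scaled_power(1.0)
    two_slow_cores = 2 * scaled_power(0.8)
    print("1 core at 100%:", one_fast_core)
    print("2 cores at 80%:", two_slow_cores)
```

In this model two cores at 80% frequency draw about the same power as one core at full speed, while offering up to 1.6 times the throughput if the workload is parallel.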


General issues for software/compiler

Parallelism detection and partitioning: find the correct granularity

Mastering memory bandwidth

Non-uniform memory latency

Optimizing sequential code portions


SSC design specificities


Basic core granularity

RISC cores

VLIW cores

In-order superscalar cores


Homogeneous vs. heterogeneous ISAs

Core specialization:

• RISC + VLIW or DSP slaves ?

• Master core + a set of special-purpose cores ?


Sharing issue

Simple cores: lots of duplication, and many resources are unused at any given time

Adjacent cores can share:

• Caches

• Functional units: FP, mult/div, multimedia

• Hardware accelerators


An example of sharing

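[Figure: two clusters, each with two processors (μP) sharing a floating-point unit (FP), a data L1 cache (DL1 $), an instruction fetch unit and an instruction L1 cache (IL1 $); both clusters share a hardware accelerator and an L2 cache]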


Multithreading/prefetching

Multithreading: is the extra complexity worth it for simple cores ?

Prefetching: is it worth it ? Sharing prefetch engines ?


Vision of an SSC (my own vision)


SSC: the basic brick

[Figure: four clusters around one L2 cache; each cluster has two processors (μP) sharing a floating-point unit (FP), a data cache (D $) and an instruction cache (I $)]


[Figure: four basic bricks (each with four two-processor clusters around an L2 cache) sharing an L3 cache, together with a memory interface, a network interface and a system interface]


FCC design specificities


Only limited available thread parallelism ?

Focus on uniprocessor architecture:

• Find the correct tradeoff between complexity and performance

• Power and temperature issues

Vector extensions ?

• Contiguous vectors (a la SSE) ?

• Strided vectors in L2 caches (Tarantula-like)


Performance enablers

SMT for parallel workloads ?

Helper threads ?

• Run-ahead threads

Speculative multithreading hardware support


Intermediate design ?

SSCs:

• Shine on massively parallel applications

• Poor/limited performance on sequential sections

FCCs:

• Moderate performance on parallel applications

• Good performance on sequential sections


Amdahl’s law

Mix of FCC and SSC
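Amdahl's law makes the case for a mix concrete. The sketch below compares three hypothetical chips; the core counts and the assumption that a complex core runs sequential code three times faster than a simple core are illustrative values, not data from the talk. Performance is normalized to one simple core.

```python
def speedup(parallel_fraction: float,
            sequential_perf: float,
            parallel_throughput: float) -> float:
    # Amdahl's law: the sequential part runs on the fastest single core,
    # the parallel part is spread over the whole chip.
    sequential_time = (1.0 - parallel_fraction) / sequential_perf
    parallel_time = parallel_fraction / parallel_throughput
    return 1.0 / (sequential_time + parallel_time)


# Hypothetical designs: (sequential_perf, parallel_throughput).
DESIGNS = {
    "SSC: 32 simple cores": (1.0, 32.0),
    "FCC: 4 complex cores": (3.0, 12.0),
    "Mix: 1 complex + 24 simple cores": (3.0, 27.0),
}

if __name__ == "__main__":
    for fraction in (0.5, 0.9, 0.99):
        for name, (seq, par) in DESIGNS.items():
            print(fraction, name, round(speedup(fraction, seq, par), 2))
```

With these numbers the mixed design beats both pure designs when 90% of the work is parallel, because the single complex core shortens the sequential sections while the simple cores provide the bulk of the parallel throughput.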


The basic brick

[Figure: an L2 cache shared by two two-processor clusters (μP FP μP, each with a D $ and an I $) and one "ultimate" out-of-order superscalar core]


[Figure: four such bricks (each with an L2 cache, two simple-core clusters with D $ and I $, and one ultimate out-of-order core) sharing an L3 cache, together with a memory interface, a network interface and a system interface]


Conclusion

The era of the uniprocessor has come to an end

No clear trend on how to continue

Might be time for more architectural diversity