A few issues on the design of future multicores
André Seznec
IRISA/INRIA
Single Chip Uniprocessor: the end of the road
(Very) wide-issue superscalar processors are not cost-effective:
More-than-quadratic complexity on many key components (sketched below):
• Register file
• Bypass network
• Issue logic
Limited performance return
Failure of EV8 = end of very wide-issue superscalar processors
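A rough sketch of where this super-linear growth comes from, using standard scaling arguments rather than figures from the talk: with issue width W, full bypassing needs a path from every producing unit to every consuming port, and every extra instruction issued per cycle adds register-file ports whose wiring grows each cell in both dimensions:

\[
\text{bypass paths} \;\propto\; W^2,
\qquad
\text{register-file area} \;\propto\; R \cdot P^2 \quad \text{with } P \propto W,
\]

so both structures grow at least quadratically with issue width, and wire delay makes the performance return on that area even smaller.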
Hardware thread parallelism
High-end single-chip components: chip multiprocessors:
• IBM Power 5, dual-core Intel Pentium 4, dual-core Athlon 64
• Many CMP SoCs for embedded markets
• Cell
(Simultaneous) multithreading:
• Pentium 4, Power 5
Thread parallelism
Expressed by the application developer (see the sketch after this list):
• Depends on the application itself
• Depends on the programming language or paradigm
• Depends on the programmer
Discovered by the compiler: automatic (static) parallelization
Exploited by the runtime: task scheduling
Dynamically discovered/exploited by hardware or software: speculative hardware/software threading
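A minimal sketch of developer-expressed parallelism (the first item above), assuming C with OpenMP as the programming paradigm; the loop and the function name are my illustration, not from the talk:

/* The developer asserts that the iterations are independent; the compiler and
   runtime then map them onto the available hardware threads/cores.
   Compile with -fopenmp (assumption: an OpenMP-capable compiler). */
void scale(float *out, const float *in, float k, int n)
{
    #pragma omp parallel for
    for (int i = 0; i < n; i++)
        out[i] = k * in[i];
}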
Direction of (single-chip) architecture: betting on the success of parallelism
(Future) applications are intrinsically parallel: as many simple cores as possible
(Future) applications are moderately parallel: a few complex, state-of-the-art superscalar cores
SSC: Sea of Simple Cores
FCC: Few Complex Cores
SSC: Sea of Simple Cores
FCC: Few Complex Cores
[Diagram: several 4-way out-of-order superscalar cores sharing an L3 cache]
Common architectural design issues
Instruction Set Architecture
Single ISA? An extension of "conventional" multiprocessors
• Shared or distributed memory?
Heterogeneous ISAs:
• A la Cell? (master processor + slave processors) x N
• A la SoC? Specialized coprocessors
Radically new architecture?
• Which one?
Hardware accelerators?
SIMD extensions: seem to be accepted; shift the burden to application developers and compilers (see the sketch below)
Reconfigurable datapaths: popular when you have a well-defined, intrinsically parallel application
Vector extensions: might be the right move when targeting essentially scientific computing
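As a concrete illustration of how SIMD extensions shift the burden to the developer, here is what an explicitly vectorized loop looks like with x86 SSE intrinsics; the example and its constraints are mine, not from the talk:

#include <xmmintrin.h>   /* x86 SSE intrinsics */

/* c[i] = a[i] + b[i]; the developer must guarantee that n is a multiple of 4
   and that the pointers are 16-byte aligned -- exactly the kind of detail
   the SIMD model pushes onto applications and compilers. */
void add4(const float *a, const float *b, float *c, int n)
{
    for (int i = 0; i < n; i += 4) {
        __m128 va = _mm_load_ps(a + i);            /* load 4 floats */
        __m128 vb = _mm_load_ps(b + i);
        _mm_store_ps(c + i, _mm_add_ps(va, vb));   /* one 4-wide add */
    }
}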
On-chip memory/processors/memory bandwidth
The uniprocessor credo was:
“Use the remaining silicon for caches”
New issue: an extra processor or more cache?
Extra processing power = increased memory bandwidth demand, increased power consumption, more temperature hot spots
Extra cache = decreased (external) memory demand
Memory hierarchy organization?
Flat: sharing a big L2/L3 cache?
[Diagram: a grid of μP-plus-private-cache tiles, all sharing one big L3 cache]
Flat: communication issues? Through the big cache
[Diagram: the same tiled grid; cores communicate through the shared L3 cache]
Flat: communication issues? Grid-like?
[Diagram: the same tiled grid; cores communicate over a grid-like interconnect to the L3 cache]
Hierarchical organization?
[Diagram: pairs of μP cores with private caches sharing an L2 cache; four such clusters share an L3 cache]
Hierarchical organization?
Arbitration at all levels
Coherency at all levels
Interleaving at all levels
Bandwidth dimensioning
NoC structure
Very dependent on the memory hierarchy organization!
+ sharing coprocessors/hardware accelerators
+ I/O buses/(processors?)
+ memory interface
+ network interface
Example
[Diagram: example hierarchical organization: three dual-core clusters, each with an L2 cache, sharing an L3 cache, plus a memory interface and I/O]
Multithreading?
An extra level of thread parallelism!
Might be an interesting alternative to prefetching on massively parallel applications
Power and thermal issues
Voltage/frequency scaling to adapt to the workload? (see the power relation below)
Adapting the workload to the available power?
Adapting/dimensioning the architecture to the power budget?
Activity migration for managing temperatures?
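The voltage/frequency-scaling question above rests on the textbook CMOS dynamic-power relation (a standard model, not a figure from the talk):

\[
P_{dyn} \;\approx\; \alpha\, C\, V^2 f,
\qquad f \propto V \;\Rightarrow\; P_{dyn} \propto f^3,
\]

so scaling frequency and voltage down together by 2x cuts dynamic power by roughly 8x while costing at most 2x in performance, which is why adapting the operating point to the workload (or to the power budget) is attractive.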
General issues for software/compiler
Parallelism detection and partitioning: find the correct granularity
Mastering memory bandwidth
Non-uniform memory latency
Optimizing sequential code portions
SSC design specificities
Basic core granularity
RISC cores
VLIW cores
In-order superscalar cores
Homogeneous vs. heterogeneous ISAs
Core specialization:
• RISC + VLIW or DSP slaves?
• Master core + a set of special-purpose cores?
Sharing issue
Simple cores: lots of duplication and lots of unused resources at any time
Adjacent cores can share:
• Caches
• Functional units: FP, mult/div, multimedia
• Hardware accelerators
An example of sharing
[Diagram: two pairs of μP cores; each pair shares an FP unit, instruction fetch, an IL1 and a DL1 cache; a hardware accelerator and the L2 cache are shared between the pairs]
Multithreading/prefetching
Multithreading: is the extra complexity worth it for simple cores?
Prefetching: is it worthwhile? Share prefetch engines?
Vision of an SSC (my own vision)
SSC: the basic brick
[Diagram: the basic brick: four clusters, each with two μP cores sharing an FP unit, an I-cache and a D-cache, all connected to a shared L2 cache]
[Diagram: a full SSC built from several such bricks (each a set of dual-core clusters around a shared L2 cache) sharing an L3 cache, together with a memory interface, a network interface, and a system interface]
FCC design specificities
Only limited thread parallelism available?
Focus on uniprocessor architecture:
• Find the correct tradeoff between complexity and performance
• Power and temperature issues
Vector extensions?
• Contiguous vectors (a la SSE)?
• Strided vectors in L2 caches (Tarantula-like)?
Performance enablers
SMT for parallel workloads?
Helper threads? Run-ahead threads?
Speculative multithreading hardware support
Intermediate design?
SSCs:
• Shine on massively parallel applications
• Poor/limited performance on sequential sections
FCCs:
• Moderate performance on parallel applications
• Good performance on sequential sections
Amdahl’s law
Mix of FCC and SSC
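A back-of-envelope C sketch of why Amdahl's law favors the mix; all numbers (a budget of 16 simple-core equivalents, a complex core costing 4 simple cores and running sequential code 2x faster, a 95% parallel fraction) are illustrative assumptions of mine, not figures from the talk:

#include <stdio.h>

int main(void)
{
    double p = 0.95;   /* parallel fraction of the workload (assumed) */

    /* SSC: 16 simple cores, sequential code runs on one simple core. */
    double ssc = 1.0 / ((1.0 - p) / 1.0 + p / 16.0);

    /* FCC: 4 complex cores, each 2x faster on sequential code,
       so 4 * 2x of throughput on parallel code. */
    double fcc = 1.0 / ((1.0 - p) / 2.0 + p / (4.0 * 2.0));

    /* Mix: 1 complex core + 12 simple cores; sequential code runs on the
       complex core, parallel code uses everything (2 + 12 = 14). */
    double mix = 1.0 / ((1.0 - p) / 2.0 + p / (2.0 + 12.0));

    printf("SSC %.1fx, FCC %.1fx, mix %.1fx\n", ssc, fcc, mix);
    return 0;
}

With these assumptions the mix wins (roughly 10.8x against 9.1x for the SSC and 7.0x for the FCC): the complex core keeps the sequential fraction short while the simple cores supply parallel throughput.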
The basic brick
[Diagram: the mixed basic brick: two clusters of two μP cores (each cluster sharing an FP unit, an I-cache and a D-cache) plus one "ultimate" out-of-order superscalar core, all around a shared L2 cache]
[Diagram: a chip built from four such mixed bricks (each with an "Ult. O-O-O" core, simple-core clusters and an L2 cache) around a shared L3 cache, with memory, network, and system interfaces]
Conclusion
The era of the uniprocessor has come to an end
No clear trend to follow
It might be time for more architectural diversity