Faster unicores are still needed, André Seznec, INRIA/IRISA

Page 1: Faster unicores  are still needed

1

Faster unicores are still needed

André Seznec

INRIA/IRISA

Page 2: Faster unicores  are still needed

2

DAL: Defying Amdahl’s Law

• ERC advanced grant to A. Seznec (2011-2016)

DAL objective:

« Given that Amdahl’s Law is Forever, propose (impact) the microarchitecture of the 2020 General Purpose manycore »

Page 3: Faster unicores  are still needed

3

Multicores are everywhere

• Multicores in servers, desktops, laptops: 2-4-8-12 out-of-order (O-O-O) cores

• Multicores in smart phones, tablets: 2-4 (not that simple) cores

• Manycores for niche markets: 48-80-100 simple cores (Tilera, Intel Phi)

Page 4: Faster unicores  are still needed

4

Multicore/multithread for everyone

• End user: improved usage comfort

  Can surf the web and listen to MP3s

• Parallel performance for the masses?

  Very few (scalable) mainstream parallel apps

  Graphics

  Niche market segments

Page 5: Faster unicores  are still needed

5

No parallel software bonanza in the near future

• Inheritance of sequential legacy codes

• Parallelism is not cost-effective for most apps

• Sequential programming will remain dominant

Page 6: Faster unicores  are still needed

6

Inheritance of sequential legacy codes

• Software is more resilient than hardware

  Apps survive and evolve for years, often decades

  Very few parallel apps now

• Redeveloping apps as parallel apps from scratch is unlikely

• Compute-intensive sections will be parallelized

  But significant code sections will remain sequential

Page 7: Faster unicores  are still needed

7

Parallelism is not cost-effective for most apps

• Why parallelism? Only for performance

• But costly:

  Difficult, time-consuming, error-prone

  Poorly portable: functionality and performance

Page 8: Faster unicores  are still needed

8

Sequential programming will remain dominant

Just easier

  The « Joe » programmer

  Portability, maintenance, debug

+ compiler to parallelize

+ parallel libraries

+ software components (developed by experts)

Page 9: Faster unicores  are still needed

9

Looking backwards

Page 10: Faster unicores  are still needed

10

2002: The End of the Uniprocessor Road

• Power and temperature walls: stopped the frequency increase

• 2x transistors: 5 %? 10 %? more performance (if any)

  Economic logic: buy smaller chips!

IC industry needs to sell new (expensive) chips:

Marketing:

« You need hyperthreading, 2, 4, 8 cores »

Page 11: Faster unicores  are still needed

11

Marketing multicores to the masses, 2002- ..

GREAT !!

SMT Dual-core

SMT

Quad-core

SMT

Page 12: Faster unicores  are still needed

12

And now ?

The end user is not such a fool ..

Page 13: Faster unicores  are still needed

13

Following the trend: 2020

• Silicon area, power envelope ≈ 100 Nehalem class cores

or

≈ 1,000 simple cores

(VLIW, in-order superscalar)

Page 14: Faster unicores  are still needed

14

Amdahl’s Law

“Cannot run faster than sequential part”

[Figure: execution time split into a sequential part and a parallel part]
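The bound quoted on this slide can be written down directly. The following is a minimal sketch of the textbook form of Amdahl's law, not code from the talk; the function names `amdahl_speedup` and `amdahl_limit` are illustrative.

```python
def amdahl_speedup(seq_fraction: float, n_procs: int) -> float:
    """Speedup over one processor when only the parallel part scales."""
    if not 0.0 <= seq_fraction <= 1.0:
        raise ValueError("seq_fraction must be in [0, 1]")
    if n_procs < 1:
        raise ValueError("n_procs must be at least 1")
    # Sequential part is untouched; parallel part is divided by n_procs.
    return 1.0 / (seq_fraction + (1.0 - seq_fraction) / n_procs)


def amdahl_limit(seq_fraction: float) -> float:
    """Upper bound on speedup with an unlimited number of processors."""
    return float("inf") if seq_fraction == 0.0 else 1.0 / seq_fraction


if __name__ == "__main__":
    # 1 % sequential code on 1000 processors stays below the 100x bound.
    print(amdahl_speedup(0.01, 1000), amdahl_limit(0.01))
```

With 1 % sequential code, 1000 processors give a speedup of about 91, while the bound for any processor count is 100.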

Page 15: Faster unicores  are still needed

15

OK, parallel applications do not scale

• Our recent study on parallel application scaling:

• In general: bp > -1: sublinear scaling

• Sometimes: bs > 0: sequential part increases

[Formula not recovered; its annotations: execution time, input set, processor number]

Page 16: Faster unicores  are still needed

16

But let us use a naive (overoptimistic) model

• A parallel application:

Parallel section: can use 1000 processors

Sequential section: run on a single processor

SEQ: constant fraction of sequential code

linear speed-up

Page 17: Faster unicores  are still needed

17

Complex cores against simple cores

• CC: 100 complex cores vs SC: 1000 simple cores

  with complex 2X faster than simple

if SEQ > 0.8 % then CC > SC

Page 18: Faster unicores  are still needed

18

And hybrid SC + CC ?

CC_SC: 50 complex + 500 simple cores

if SEQ > 0.2 % then CC_SC > SC

Page 19: Faster unicores  are still needed

19

And if ..

• Use a huge amount of resources for a single core:

  10X the area of the complex core

  10X the power of the complex core

Use all the uniprocessor techniques

  Very wide issue (8 – 16 ?)

  Ultimate frequency (« heat and run »)

  Helper threads

  Value prediction

Invent new techniques

Ultra Complex cores

Page 20: Faster unicores  are still needed

20

DAL architecture proposition

• Heterogeneous architecture:

  A few ultra complex cores to enable performance on sequential codes and/or critical sections

  A « sea » of simple cores for parallel sections

Page 21: Faster unicores  are still needed

21

For the naive model

« DAL » : UC_SC

5 ultra complex cores + 500 simple cores

• If SEQ > 0.13 % then « DAL » > SC

• « DAL » always better than UC, CC, CC_SC
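The thresholds quoted for the naive model can be checked with a few lines of arithmetic. This is a sketch under stated assumptions, not the author's model code. Speeds are relative to one simple core. Two assumptions are not stated on the slides and are chosen because they reproduce the quoted numbers: in the hybrid chips, parallel sections run on the 500 simple cores only, and an ultra complex core is taken to be 4X faster than a simple core (2X faster than a complex core). The names `Chip`, `exec_time` and `break_even` are illustrative.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chip:
    name: str
    seq_speed: float       # speed of the core running sequential code (simple core = 1)
    par_throughput: float  # aggregate throughput available to parallel sections


def exec_time(chip: Chip, seq: float) -> float:
    """Normalized execution time for a sequential fraction `seq` (linear speed-up)."""
    return seq / chip.seq_speed + (1.0 - seq) / chip.par_throughput


def break_even(chip: Chip, ref: Chip) -> float:
    """Sequential fraction above which `chip` is faster than `ref`."""
    gain = 1.0 / ref.seq_speed - 1.0 / chip.seq_speed            # saved per unit of sequential work
    loss = 1.0 / chip.par_throughput - 1.0 / ref.par_throughput  # lost per unit of parallel work
    return loss / (gain + loss)


SC = Chip("SC", seq_speed=1.0, par_throughput=1000.0)      # 1000 simple cores
CC = Chip("CC", seq_speed=2.0, par_throughput=200.0)       # 100 complex cores, each 2X
# Assumption: parallel sections use only the 500 simple cores in the hybrids.
CC_SC = Chip("CC_SC", seq_speed=2.0, par_throughput=500.0)
# Assumption: an ultra complex core is 4X a simple core (not stated on the slides).
DAL = Chip("DAL", seq_speed=4.0, par_throughput=500.0)

if __name__ == "__main__":
    for chip in (CC, CC_SC, DAL):
        print(f"{chip.name} beats SC when SEQ > {100 * break_even(chip, SC):.2f} %")
```

Under these assumptions the break-even points come out at about 0.79 %, 0.20 % and 0.13 %, matching the slides. DAL also beats CC and CC_SC for every nonzero SEQ, since its sequential term is smaller and its parallel term is no larger.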

Page 22: Faster unicores  are still needed

22

Need for research on faster unicores

• Silicon area is a 2nd-order issue

  Can use the area of 10 complex cores

• Power/energy is a 2nd-order issue

  Can use the power of 10 complex cores

Page 23: Faster unicores  are still needed

23

Ongoing work: Revisiting Value Prediction

with Arthur Pérais

Page 24: Faster unicores  are still needed

24

Value prediction?

Lipasti et al., Gabbay and Mendelson, 1996

Basic idea: eliminate (some) true data dependencies by predicting instruction results

[Figure: dependence graph over instructions I0, I1, I3, I4, I5, annotated +2, +3, +1, +3]

Page 25: Faster unicores  are still needed

25

Value Prediction:

• Large body of research, 1996-2002

• Quite efficient:

  Surprisingly high number of predictable instructions

• Not implemented so far:

  High cost: is it still relevant now?

  High penalty on mispredictions: don’t lose all the benefit

Page 26: Faster unicores  are still needed

26

Last Value Predictor

• Just predict the last produced value

  Set-associative table

  Use confidence counters

Analogy with PC-based branch prediction
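A last value predictor can be sketched in a few lines. This is an illustrative model, not the design evaluated in the talk: it assumes a direct-mapped table instead of a set-associative one, a full PC as tag, and a 3-bit confidence counter that must be saturated before a prediction is used and that resets on a wrong value. The class name `LastValuePredictor` is hypothetical.

```python
class LastValuePredictor:
    """Direct-mapped last value predictor with saturating confidence (sketch)."""

    def __init__(self, entries=1024, conf_max=7):
        self.entries = entries
        self.conf_max = conf_max          # 7 is the saturation value of a 3-bit counter
        self.table = [None] * entries     # each entry: [tag, last_value, confidence]

    def _index(self, pc):
        return pc % self.entries

    def predict(self, pc):
        """Return the predicted value, or None when not confident enough."""
        entry = self.table[self._index(pc)]
        if entry is not None and entry[0] == pc and entry[2] == self.conf_max:
            return entry[1]
        return None

    def update(self, pc, actual):
        """Train with the value actually produced by the instruction at pc."""
        i = self._index(pc)
        entry = self.table[i]
        if entry is None or entry[0] != pc:
            self.table[i] = [pc, actual, 0]              # allocate on a miss
        elif entry[1] == actual:
            entry[2] = min(entry[2] + 1, self.conf_max)  # same value again
        else:
            entry[1] = actual                            # value changed: retrain
            entry[2] = 0
```

The PC-indexed table and the confidence counter are what make the structure resemble a PC-based branch predictor; only the payload (a value instead of a direction) differs.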

Page 27: Faster unicores  are still needed

27

Stride value predictor

• Add last value + (last difference)

[Figure: table indexed by PC, feeding an adder]

Analogy with stride prefetcher, but also with loop predictor
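The stride scheme adds one field per entry. The sketch below is illustrative only: it assumes an unbounded table keyed by PC (a Python dict) and the same saturate-then-predict confidence policy as above. `StridePredictor` is a hypothetical name.

```python
class StridePredictor:
    """Predicts last value + last difference once the stride is stable (sketch)."""

    def __init__(self, conf_max=7):
        self.conf_max = conf_max
        self.table = {}  # pc -> [last_value, stride, confidence]

    def predict(self, pc):
        entry = self.table.get(pc)
        if entry is None or entry[2] < self.conf_max:
            return None                    # not confident: do not predict
        return entry[0] + entry[1]

    def update(self, pc, actual):
        entry = self.table.get(pc)
        if entry is None:
            self.table[pc] = [actual, 0, 0]
            return
        stride = actual - entry[0]
        if stride == entry[1]:
            entry[2] = min(entry[2] + 1, self.conf_max)  # stride confirmed
        else:
            entry[1] = stride                            # new stride: start over
            entry[2] = 0
        entry[0] = actual
```

A constant value is the stride-0 case, so this structure also covers what the last value predictor captures.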

Page 28: Faster unicores  are still needed

28

Finite Context Method predictors

Use the history of the last values produced by the instruction

[Figure: value history selected by PC]

Analogy with local history branch predictor
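A finite context method predictor can be sketched as two levels: a per-instruction history of recent values, and a table mapping that history to the value that followed it last time. The sketch assumes unbounded tables, an unhashed history, a second level keyed by (PC, history), and no confidence counters; all of these are simplifications. `FCMPredictor` is a hypothetical name.

```python
from collections import deque


class FCMPredictor:
    """Order-k finite context method value predictor (sketch)."""

    def __init__(self, order=2):
        self.order = order
        self.histories = {}  # level 1: pc -> last `order` values
        self.values = {}     # level 2: (pc, history) -> value that followed

    def predict(self, pc):
        history = self.histories.get(pc)
        if history is None or len(history) < self.order:
            return None
        return self.values.get((pc, tuple(history)))

    def update(self, pc, actual):
        history = self.histories.setdefault(pc, deque(maxlen=self.order))
        if len(history) == self.order:
            # Remember what came after the current context.
            self.values[(pc, tuple(history))] = actual
        history.append(actual)  # the oldest value falls out automatically
```

For a repeating sequence such as 1, 2, 3, 1, 2, 3, the context (1, 2) maps to 3, a pattern that neither a last value nor a stride predictor captures.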

Page 29: Faster unicores  are still needed

29

And global value history

• Makes no sense!

  Need the history of the last instructions

  Too late!!

• But global branch history!?!

  ITTAGE is the state-of-the-art indirect branch predictor!!

  And it predicts values!

Page 30: Faster unicores  are still needed

30

ITTAGE

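[Figure: VTAGE, the ITTAGE structure applied to values: a tagless base predictor indexed by pc, plus tagged components indexed by pc and global history h[0:L1], h[0:L2], h[0:L3]; tag comparisons (=?) select the prediction]

Longest matching component provides the prediction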
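The selection rule can be illustrated with a heavily simplified model. This is not the VTAGE design itself: it assumes dictionaries instead of fixed-size tables, exact (pc, history) keys instead of hashed indices and partial tags, no confidence or usefulness counters, and a naive policy that allocates in the next longer component on a misprediction. The history lengths and the name `VTAGESketch` are hypothetical.

```python
class VTAGESketch:
    """Value prediction from pc + global branch history, longest match wins (sketch)."""

    def __init__(self, history_lengths=(2, 4, 8)):
        self.history_lengths = history_lengths
        self.base = {}                                   # pc -> value (tagless base)
        self.tagged = [dict() for _ in history_lengths]  # (pc, history slice) -> value
        self.ghist = ()                                  # branch outcomes, newest last

    def record_branch(self, taken):
        longest = max(self.history_lengths)
        self.ghist = (self.ghist + (bool(taken),))[-longest:]

    def _key(self, pc, length):
        return (pc, self.ghist[-length:])

    def _provider(self, pc):
        # Search from the longest history down to the shortest.
        for i in reversed(range(len(self.history_lengths))):
            key = self._key(pc, self.history_lengths[i])
            if key in self.tagged[i]:
                return i, key
        return None, None

    def predict(self, pc):
        i, key = self._provider(pc)
        return self.tagged[i][key] if i is not None else self.base.get(pc)

    def update(self, pc, actual):
        i, key = self._provider(pc)
        predicted = self.tagged[i][key] if i is not None else self.base.get(pc)
        if i is not None:
            self.tagged[i][key] = actual
        else:
            self.base[pc] = actual
        if predicted is not None and predicted != actual:
            nxt = 0 if i is None else i + 1   # allocate with a longer history
            if nxt < len(self.tagged):
                length = self.history_lengths[nxt]
                self.tagged[nxt][self._key(pc, length)] = actual
```

Usage: call `record_branch` for each resolved branch, `predict(pc)` before the instruction executes, and `update(pc, value)` with the actual result. An instruction whose result depends on the direction of the preceding branch becomes predictable after a short warm-up, without any per-instruction value history.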

Page 31: Faster unicores  are still needed

31

The repair issue on misprediction

[Figure: instructions I0, I1, I3, I4, I5, with a misprediction marked]

Page 32: Faster unicores  are still needed

32

Pipeline squash

• Handled like an exception or a branch misprediction

• Very high penalty

I0 I1 I3 I4 I5

Page 33: Faster unicores  are still needed

33

Selective replay

• Cancel all dependent instructions, but save the others

• Very complex to implement:

  Unlimited dependence chains

I0 I1 I3 I4 I5

Page 34: Faster unicores  are still needed

34

Critical path

• Predicted value needed late in the pipeline:

  Dispatch time is sufficient

• Except that:

Page 35: Faster unicores  are still needed

35

A FCM implementation issue

[Figure: predictor indexed by PC, with a Speculative Window]

Must take the last local values

Might be a critical path

Page 36: Faster unicores  are still needed

36

Critical path on the stride value predictor

[Figure: stride predictor indexed by PC, with an adder and a Speculative Window]

Stride AND speculative last value must be high confidence

Can be reused on the next cycle

Page 37: Faster unicores  are still needed

37

Experiments

• 8-way superscalar, deep pipeline

• Use prediction only on high confidence

  3-bit counters: predict only when saturated, reset on misprediction

Page 38: Faster unicores  are still needed

38

Squashing

Page 39: Faster unicores  are still needed

39

Selective replay

Page 40: Faster unicores  are still needed

40

High confidence through probabilistic counters

• Need for very high confidence:

  95 % accuracy is insufficient, >> 99 % needed

TRADING ACCURACY AGAINST COVERAGE

• Saturation with only very low probability 1/32, 1/256
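A probabilistic confidence counter can be sketched as follows. The slides do not say which transitions are probabilistic; this sketch assumes every increment succeeds with probability `inc_prob` and that any misprediction resets the counter. The class name `ProbabilisticConfidence` and the injectable `rng` parameter are illustrative.

```python
import random


class ProbabilisticConfidence:
    """Small saturating counter whose increments succeed only with low probability."""

    def __init__(self, bits=3, inc_prob=1 / 32, rng=None):
        self.max_value = (1 << bits) - 1
        self.inc_prob = inc_prob
        self.rng = rng if rng is not None else random.Random()
        self.value = 0

    def confident(self):
        """Use the prediction only when the counter is saturated."""
        return self.value == self.max_value

    def update(self, correct):
        if not correct:
            self.value = 0                       # one wrong value loses all confidence
        elif self.value < self.max_value and self.rng.random() < self.inc_prob:
            self.value += 1                      # rare increment on a correct value
```

With 3 bits and a probability of 1/32, saturation takes 7 successful increments, hence 7 × 32 = 224 correct predictions on average. A small counter therefore behaves like a much wider one, which trades coverage for accuracy as the slide states.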

Page 41: Faster unicores  are still needed

41

Squashing

Page 42: Faster unicores  are still needed

42

And hybrids

Page 43: Faster unicores  are still needed

43

Current status

• All value predictors amenable to very high confidence

  No complex selective repair needed

• No need for local value prediction

  No complex critical path in the local value predictor

Page 44: Faster unicores  are still needed

44

Ongoing work: Selective Prediction of Predicated Instructions

with Nathanael Prémillieu

Page 45: Faster unicores  are still needed

45

Who cares about predicated instructions?

• CMOV in all ISAs

• ARM, Itanium : All instructions are predicated

out-of-order execution: just a nightmare

Page 46: Faster unicores  are still needed

46

The multiple definition problem

Before renaming:

I1: R1 ← R2, R3 (p)

I2: R4 ← R1, R2

After renaming:

I1: P1 ← P15, P22 (p)

I2: P13 ← ???, P15

??? is P1 if p is true, otherwise the previous mapping of R1

[Figure: Mapping Table]

Page 47: Faster unicores  are still needed

47

Expansion/Serialization

After renaming:

I1a: P1 ← P15, P22

I1b: P27 ← (p) ? P1 : P11

I2: P13 ← P27, P15

• Create an extra instruction

• Force I1b → I2 dependency
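The expansion step can be sketched as a toy renamer. This is an illustration of the idea on the slide, not a description of any real pipeline: it assumes instructions given as (destination, sources, predicate) tuples, physical registers handed out sequentially, and a destination that already has a mapping when the instruction is predicated. The function name `rename` and the micro-op labels "op" and "select" are hypothetical.

```python
from itertools import count


def rename(instrs, mapping, first_free):
    """Rename instructions, expanding each predicated one into op + select.

    instrs:  list of (dest, [sources], predicate or None), architectural names
    mapping: dict from architectural to physical register, updated in place
    """
    fresh = (f"P{n}" for n in count(first_free))  # sequential free list
    renamed = []
    for dest, sources, predicate in instrs:
        phys_sources = [mapping[s] for s in sources]
        if predicate is None:
            new = next(fresh)
            renamed.append((new, "op", phys_sources))
        else:
            # First half: compute the result unconditionally into a temporary.
            tmp = next(fresh)
            renamed.append((tmp, "op", phys_sources))
            # Second half: pick the new result or the old value of dest.
            new = next(fresh)
            renamed.append((new, "select", [predicate, tmp, mapping[dest]]))
        mapping[dest] = new  # later readers see exactly one definition
    return renamed


if __name__ == "__main__":
    table = {"R1": "P11", "R2": "P15", "R3": "P22"}
    program = [("R1", ["R2", "R3"], "p"), ("R4", ["R1", "R2"], None)]
    for line in rename(program, table, first_free=100):
        print(line)
```

The register numbers differ from the slide because the free list here is sequential, but the shape is the same: the consumer of R1 depends on the select micro-op, which is the serialization the slide refers to.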

Page 48: Faster unicores  are still needed

48

Aggressive serialization

I1: P18 ← (p) ? (op P15, P22) : P23

I2: P13 ← P18, P15

• No expansion, but an extra operand on I1:

  Complexity on register file, issue logic, bypass network

• Force I1 → I2 dependency

Page 49: Faster unicores  are still needed

49

Predicting the predicates

• Use branch history or branch+predicate history to predict the predicates

Eliminate multiple definitions

Predicate mispredictions become branch mispredictions

Page 50: Faster unicores  are still needed

50

Not that convincing !

Page 51: Faster unicores  are still needed

51

• Filter the predicate prediction

• Replay the mispredicted predicates at rename time

Page 52: Faster unicores  are still needed

52

Page 53: Faster unicores  are still needed

53

• Predicate prediction + filtering allows:

Better performance

Without aggressive out-of-order implementation

• Current compilers are « shy » about using predication

  Might be worth reconsidering

Page 54: Faster unicores  are still needed

54

Conclusion

Faster cores are needed:

Amdahl’s law,

Uniprocessor workload

Silicon, power, etc. are available:

Just grab the resource from the rest of the system

Do research as if area and power were not constraints:

Then, take into account the constraints

(or somebody else will manage to do it)