Page 1: Thread level parallelism: It's time now!

André Seznec

IRISA/INRIA

CAPS team

Page 2: Disclaimer

No nice mathematical formulas

No theorems

BTW: I did get the agrégation in mathematics, but that was a long time ago

Page 3: Focus of high-performance computer architecture

Up to 1980: mainframes

Up to 1990: supercomputers

Till now: general-purpose microprocessors

Coming: mobile computing, embedded computing

Page 4: Uniprocessor architecture has driven performance progress so far

The famous “Moore’s law”

Page 5: "Moore's Law": predicts exponential growth

Every 18 months:
• the number of transistors on a chip doubles
• the performance of the processor doubles
• the size of the main memory doubles

30 years of IRISA = 1,000,000x
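The 1,000,000x figure over 30 years is exactly what an 18-month doubling period predicts; a one-line sanity check of the slide's own arithmetic:

```python
# Doubling every 18 months for 30 years = 30*12/18 = 20 doublings.
doublings = 30 * 12 // 18      # 20
growth = 2 ** doublings
print(growth)                  # 1048576, i.e. roughly 1,000,000x
```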

Page 6: Moore's Law: transistors on a processor chip

[Chart: transistors per processor chip vs. year (1970-2010), log scale from 1 to 10^10; marked chips include the Intel 4004 and 80486, the HP 8500, and Montecito; annotated with IRISA's growth from 80 to 200, 350, and 550 persons.]

Page 7: Moore's Law: bits per DRAM chip

[Chart: bits per DRAM chip vs. year (1970-2010), log scale; generations 1K, 64K, 1M, 64M, 1Gbit; annotated with the successive IRISA directors: M. Métivier, G. Ruget, L. Kott, J.P. Verjus, J.P. Banatre, C. Labit.]

Page 8: Moore's Law: performance

[Chart: performance in MIPS vs. year (1970-2010), from 0.06 MIPS through 16 up to 3,000 and 20,000 MIPS; annotated with J. Lenfant, A. Seznec, P. Michaud, F. Bodin, I. Puaut and the CAPS group.]

Page 9: And parallel machines, so far…

Parallel machines have been built from every processor generation:

Hardware-coherent shared-memory multiprocessors:
• dual-processor boards
• up to 8-processor servers
• large (very expensive) hardware-coherent NUMA architectures

Distributed-memory systems:
• Intel iPSC, Paragon
• clusters, clusters of clusters…

Page 10: Parallel machines have not been mainstream so far

But it might change

But it will change

Page 11: What has prevented parallel machines from prevailing?

Economic issue:
• hardware cost grew superlinearly with the number of processors

Performance:
• never able to use the latest-generation microprocessor

Scalability issue:
• bus snooping does not scale well above 4-8 processors

Parallel applications are missing:
• writing parallel applications requires thinking "parallel"
• automatic parallelization only works on small segments

Page 12: What has prevented parallel machines from prevailing? (2)

We (the computer architects) were also guilty: we had just figured out how to use those transistors in a uniprocessor.

IC technology brought the transistors and the frequency.

We brought the performance: the compiler guys also helped a little bit.

Page 13: Up to now, what was microarchitecture about?

Memory access time is 100 ns.

Program semantics are sequential.

An instruction's life (fetch, decode, …, execute, …, memory access, …) is 10-20 ns.

How can we use the transistors to achieve the highest possible performance?

So far: up to 4 instructions every 0.3 ns.

Page 14: The architect's toolbox for uniprocessor performance

Pipelining

Instruction Level Parallelism

Speculative execution

Memory hierarchy

Page 15: Pipelining

Just slice the instruction life into equal stages and launch concurrent execution:

    time →
    I0: IF … DC … EX  M … CT
    I1:     IF … DC … EX  M … CT
    I2:         IF … DC … EX  M … CT
    I3:             IF … DC … EX  M … CT
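The payoff can be counted directly: with k equal stages, n instructions take n·k cycles sequentially but only k + (n-1) cycles pipelined (fill the pipe once, then finish one instruction per cycle). A quick sketch; the stage and instruction counts are illustrative, not from the talk:

```python
# Cycle counts for an ideal k-stage pipeline (no stalls), vs. running
# each instruction to completion before starting the next.
def unpipelined_cycles(n, k):
    return n * k

def pipelined_cycles(n, k):
    return k + (n - 1)

n, k = 1000, 5                      # illustrative values
print(unpipelined_cycles(n, k))     # 5000
print(pipelined_cycles(n, k))       # 1004: speedup approaches k
```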

Page 16: + Instruction-level parallelism

Launch several instructions per cycle:

    I0: IF … DC … EX  M … CT
    I1: IF … DC … EX  M … CT
    I2:     IF … DC … EX  M … CT
    I3:     IF … DC … EX  M … CT
    I4:         IF … DC … EX  M … CT
    I5:         IF … DC … EX  M … CT
    I6:             IF … DC … EX  M … CT
    I7:             IF … DC … EX  M … CT

Page 17: + out-of-order execution

To optimize resource usage: execute as soon as operands are valid.

    I0: IF DC EX   M    CT
    I1: IF DC wait wait EX  M  CT
    I2:    IF DC   EX   M   CT
    I3:    IF DC   EX   M   CT
    I4:       IF   DC   EX  M  CT
    I5:       IF   DC   EX  M  CT

Page 18: + speculative execution

10-15% of instructions are branches. On the Pentium 4, a branch's direction and target are known only at cycle 31!!

Predict and execute speculatively; validate at execution time.

State-of-the-art predictors:
• ≈ 2 mispredictions per 1000 instructions

Also predict:
• memory (in)dependency
• (limited) data values
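The predictors the slide quotes (≈2 mispredictions per 1000 instructions) are far more sophisticated, but the basic idea can be sketched with a classic 2-bit saturating-counter table, a minimal illustration rather than any predictor the talk refers to:

```python
# Minimal 2-bit saturating-counter branch predictor: one counter per
# table entry, indexed by (hashed) branch PC; >= 2 means "predict taken".
class TwoBitPredictor:
    def __init__(self, entries=1024):
        self.table = [1] * entries   # start weakly not-taken
        self.mask = entries - 1      # entries must be a power of two

    def predict(self, pc):
        return self.table[pc & self.mask] >= 2

    def update(self, pc, taken):
        # Saturating increment/decrement toward the observed outcome.
        i = pc & self.mask
        if taken:
            self.table[i] = min(3, self.table[i] + 1)
        else:
            self.table[i] = max(0, self.table[i] - 1)
```

The 2-bit hysteresis is what makes loop-closing branches cheap: a single not-taken exit does not flip a strongly-taken counter.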

Page 19: + memory hierarchy

    Processor
      I$ / D$: small, 1-2 cycles
      L2 cache: 10 cycles
      Main memory: 300 cycles ≈ 1000 instructions
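With the slide's latencies, the payoff of the hierarchy can be quantified as an average memory access time (AMAT); the miss rates below are illustrative assumptions, not figures from the talk:

```python
# Back-of-envelope AMAT using the slide's latencies (1-2, 10, 300 cycles).
# The 5% L1 and 20% L2 miss rates are assumed for illustration only.
def amat(l1_hit, l1_miss_rate, l2_hit, l2_miss_rate, mem):
    # Expected cycles per access, walking down the hierarchy.
    return l1_hit + l1_miss_rate * (l2_hit + l2_miss_rate * mem)

print(amat(1, 0.05, 10, 0.2, 300))   # 4.5 cycles on average,
                                     # vs. 300 with no caches at all
```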

Page 20: + prefetch

A miss in the L2 cache stalls the processor for 300 cycles!?!

Try to predict which memory blocks will miss in the near future and bring them into the L2 cache ahead of time.
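The talk does not commit to a mechanism; one classic scheme is a stride prefetcher, sketched below. It watches the miss-address stream and, once a constant stride repeats, predicts the next missing block:

```python
# Minimal stride prefetcher sketch (one illustrative technique, not the
# mechanism of any particular processor). It returns the address to
# prefetch into the L2 cache once a stride has been confirmed.
class StridePrefetcher:
    def __init__(self):
        self.last_addr = None
        self.stride = None

    def on_miss(self, addr):
        """Record a miss; return a predicted next miss address, or None."""
        pred = None
        if self.last_addr is not None:
            stride = addr - self.last_addr
            if stride == self.stride and stride != 0:
                pred = addr + stride     # stride confirmed: prefetch ahead
            self.stride = stride
        self.last_addr = addr
        return pred
```

A sequential array walk (misses at 0x1000, 0x1040, 0x1080, …) triggers prefetches from the third miss on.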

Page 21: Can we continue to just throw transistors at uniprocessors?

• Increasing instructions per cycle?
• Larger caches?
• New prefetch mechanisms?

Page 22: One billion transistors now!! The uniprocessor road seems over

A 16-32-way uniprocessor seems out of reach:

• just not enough ILP
• quadratic complexity of a few key components: register file, bypass network, issue logic, …
• very long intra-CPU communications would be needed to avoid temperature hot spots
• 5-7 years to design a 4-way superscalar core: how long to design a 16-way?

Page 23: One billion transistors: process/thread-level parallelism… it's time now!

Simultaneous multithreading: TLP on a uniprocessor!

Chip multiprocessor

Combine both !!

Page 24: Simultaneous multithreading (SMT): parallel processing on a uniprocessor

Functional units are underused on superscalar processors.

SMT: share the functional units of a superscalar processor between several processes.

Advantages:
• a single process can use all the resources
• dynamic sharing of all structures on parallel/multiprocess workloads

Page 25: Superscalar vs. SMT

[Figure: issue slots over time; the superscalar leaves many slots unused, while the SMT fills them with instructions from Threads 1-4.]

Page 26: The chip multiprocessor

Put a shared-memory multiprocessor on a single die:
• duplicate the processor, its L1 cache, and maybe its L2
• keep the caches coherent
• share the last level of the memory hierarchy (maybe)
• share the external interface (to memory and system)

Page 27: General-purpose chip multiprocessor (CMP): why it did not (really) appear before 2003

Till 2003, there was a better (economic) usage for the transistors, since single-process performance mattered most:
• more complex superscalar implementations
• more cache space

Diminishing returns!!

Page 28: General-purpose CMP: why it should still not appear as mainstream

No further (significant) benefit in complexifying single processors: logically, we should use smaller and cheaper chips, or integrate more functionality on the same chip:
• e.g. the graphics pipeline

Poor catalog of parallel applications:
• the single processor is still mainstream
• parallel programming is the privilege (knowledge) of a few

Page 29: General-purpose CMP: why they appear as mainstream now!

The economic factor:
• the consumer pays 1000-2000 euros for a PC
• the professional pays 2000-3000 euros for a PC

A constant: the processor represents 15-30% of the PC price

Intel and AMD will not cut their share

Page 30: General-purpose multicore SMT in 2005: an industry reality (Intel and IBM)

PCs:
• dual-core, 2-thread-SMT Pentium 4
• dual-core AMD64

Servers:
• Itanium Montecito: dual-core, 2-thread SMT
• IBM Power 5: dual-core, 2-thread SMT
• Sun Niagara: 8-core, 4-thread

Page 31: A 4-core 4-thread SMT: think about tuning for performance!

[Figure: four cores P0-P3, each running 4 threads, over a shared L3 cache and a huge main memory.]

Page 32: A 4-core 4-thread SMT: think about tuning for performance! (2)

Write the application with more than 16 threads

Take into account first-level memory hierarchy sharing!

Take into account second-level memory hierarchy sharing!

Optimize for instruction-level parallelism

Competing for memory bandwidth!!
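A structural sketch of the first piece of advice, writing the application with more threads than hardware contexts so the scheduler can balance load across the 4 cores x 4 SMT threads. Illustrative Python: CPython's GIL means this shows the decomposition, not genuine hardware speedup, and the chunk counts are assumptions:

```python
# Partition the work into more chunks (32) than hardware thread
# contexts (16), so idle contexts can always pick up remaining chunks.
from concurrent.futures import ThreadPoolExecutor

def partial_sum(chunk):
    return sum(chunk)

def parallel_sum(data, n_chunks=32):              # > 16 hardware threads
    size = (len(data) + n_chunks - 1) // n_chunks
    chunks = [data[i:i + size] for i in range(0, len(data), size)]
    with ThreadPoolExecutor(max_workers=16) as pool:
        return sum(pool.map(partial_sum, chunks))

print(parallel_sum(list(range(1000))))            # same result as sum(range(1000))
```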

Page 33: Hardware TLP is there!!

But where are the threads/processes ?

A unique opportunity for the software industry: hardware parallelism comes for free

Page 34: Waiting for the threads (1)

Generate threads to increase the performance of single threads:

Speculative threads:
• predict threads at medium granularity, either in software or in hardware

Helper threads:
• run ahead a speculative skeleton of the application to:
  - avoid branch mispredictions
  - prefetch data

Page 35: Waiting for the threads (2)

Hardware transient faults are becoming a concern:
• run the same thread twice on two cores and check integrity

Security:
• array bound checking is nearly free on an out-of-order core
• encryption/decryption
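The redundant-execution idea can be mimicked in software; a hedged sketch of the concept only, since real designs compare results in hardware, and the duplicated function must be deterministic for the comparison to be meaningful:

```python
# Software analogue of redundant multithreading for transient-fault
# detection: run the same computation on two threads and cross-check.
from concurrent.futures import ThreadPoolExecutor

def checked_run(fn, *args):
    """Run a deterministic fn twice concurrently; fail if results differ."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        a = pool.submit(fn, *args)
        b = pool.submit(fn, *args)
        ra, rb = a.result(), b.result()
    if ra != rb:
        raise RuntimeError("transient fault detected: results differ")
    return ra

print(checked_run(sum, range(100)))   # 4950, verified by both copies
```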

Page 36: Waiting for the threads (3)

Hardware clock frequency is limited by:
• the power budget, with every core running
• temperature hot spots

On a single-thread workload: increase the clock frequency and migrate the process between cores (to spread the heat)

Page 37: Conclusion

Hardware TLP is becoming mainstream for general-purpose computing

Moderate degrees of hardware TLP will be available in the mid-term

That is the first real opportunity for the whole software industry to go parallel!

But it might demand a new generation of application developers!!