When Less Is MOre (LIMO): Controlled Parallelism for Improved Efficiency Gaurav Chadha, Scott Mahlke, Satish Narayanasamy University of Michigan

Page 1: When Less Is  MOre  (LIMO): Controlled Parallelism for Improved Efficiency

When Less Is MOre (LIMO): Controlled Parallelism for Improved Efficiency

Gaurav Chadha, Scott Mahlke, Satish Narayanasamy

University of Michigan

Page 2: When Less Is  MOre  (LIMO): Controlled Parallelism for Improved Efficiency

Motivation
• Hardware trends
    o CMPs are ubiquitous.
    o More and more cores in a system
        • Mobile: Qualcomm Snapdragon, Samsung Exynos, NVIDIA Tegra 3.
        • Server: Tilera
• Multi-threaded applications are pervasive.
• But do we always want to maximize the number of threads? NO

Page 3: When Less Is  MOre  (LIMO): Controlled Parallelism for Improved Efficiency

Run fewer threads: DVFS
• Most multi-threaded applications stop scaling beyond a certain number of cores.
• It becomes counter-productive to run more threads.
• Maximum power budget is fixed for a system.
• Fewer cores can “borrow” power from disabled cores.
    o Intel Turbo Boost

[Figure: frequency vs. cores. Frequency increases in steps of 133 MHz.]

Page 4: When Less Is  MOre  (LIMO): Controlled Parallelism for Improved Efficiency

Scalability: Problems
• Too many threads
    o Increased contention for shared resources.
    o Increased synchronization costs.
• Too few threads
    o Underutilization of resources.

[Figure: "Parsec: speedup". Speedup over 1 thread (y-axis, 0 to 30) vs. #threads (x-axis: 1, 2, 4, 8, 16, 32; #threads = #cores). Series: blackscholes, bodytrack, canneal, dedup, facesim, ferret, fluidanimate, streamcluster, swaptions, vips, x264.]

Page 5: When Less Is  MOre  (LIMO): Controlled Parallelism for Improved Efficiency

Scalability: Fewer threads are better

• 4 threads best for streamcluster

[Figure: "Parsec: speedup". Speedup over 1 thread (y-axis, 0 to 30) vs. #threads (x-axis: 1, 2, 4, 8, 16, 32; #threads = #cores) for the 11 PARSEC benchmarks.]

Page 6: When Less Is  MOre  (LIMO): Controlled Parallelism for Improved Efficiency

Scalability: Fewer threads are as good

[Figure: "Parsec: speedup". Speedup over 1 thread (y-axis, 0 to 30) vs. #threads (x-axis: 1, 2, 4, 8, 16, 32; #threads = #cores) for the 11 PARSEC benchmarks.]

• Ferret, facesim, x264, and dedup show poor scalability

Page 7: When Less Is  MOre  (LIMO): Controlled Parallelism for Improved Efficiency

Scalability: Opportunities

[Figure: "Parsec: speedup". Speedup over 1 thread (y-axis, 0 to 30) vs. #threads (x-axis: 1, 2, 4, 8, 16, 32; #threads = #cores) for the 11 PARSEC benchmarks.]

• Run fewer threads
    o Disable some cores and increase frequency of the active ones.

Page 8: When Less Is  MOre  (LIMO): Controlled Parallelism for Improved Efficiency

Run fewer threads: DVFS

[Figure: "Parsec: speedup". Speedup over 1 thread (y-axis, 0 to 30) vs. #threads (x-axis: 1, 2, 4, 8, 16, 32; #threads = #cores) for the 11 PARSEC benchmarks.]
Frequency (GHz) at 1, 2, 4, 8, 16, 32 threads: 1.1 1.1 1.1 1.1 1.1 1.1

Page 9: When Less Is  MOre  (LIMO): Controlled Parallelism for Improved Efficiency

Run fewer threads: DVFS

[Figure, first panel: "Parsec: speedup". Speedup over 1 thread (y-axis, 0 to 30) vs. #threads (x-axis: 1, 2, 4, 8, 16, 32; #threads = #cores) for the 11 PARSEC benchmarks.]
Frequency (GHz) at 1, 2, 4, 8, 16, 32 threads: 1.1 1.1 1.1 1.1 1.1 1.1

[Figure, second panel: "with DVFS". Speedup over 1 thread (y-axis, 0 to 30) vs. #threads (x-axis: 1, 2, 4, 8, 16, 32). Series: streamclus...]
Frequency (GHz) at 1, 2, 4, 8, 16, 32 threads: 3.6 2.8 2.2 1.8 1.4 1.1

Page 10: When Less Is  MOre  (LIMO): Controlled Parallelism for Improved Efficiency

Run fewer threads: DVFS

[Figure: "Parsec: speedup". Speedup over 1 thread (y-axis, 0 to 30) vs. #threads (x-axis: 1, 2, 4, 8, 16, 32; #threads = #cores) for the 11 PARSEC benchmarks.]
Frequency (GHz) at 1, 2, 4, 8, 16, 32 threads: 1.1 1.1 1.1 1.1 1.1 1.1

Page 11: When Less Is  MOre  (LIMO): Controlled Parallelism for Improved Efficiency

Run fewer threads: DVFS

[Figure, first panel: "Parsec: speedup". Speedup over 1 thread (y-axis, 0 to 30) vs. #threads (x-axis: 1, 2, 4, 8, 16, 32; #threads = #cores) for the 11 PARSEC benchmarks.]
Frequency (GHz) at 1, 2, 4, 8, 16, 32 threads: 1.1 1.1 1.1 1.1 1.1 1.1

[Figure, second panel: "with DVFS". Speedup over 1 thread (y-axis, 0 to 30) vs. #threads (x-axis: 1, 2, 4, 8, 16, 32).]
Frequency (GHz) at 1, 2, 4, 8, 16, 32 threads: 3.6 2.8 2.2 1.8 1.4 1.1

Page 12: When Less Is  MOre  (LIMO): Controlled Parallelism for Improved Efficiency

Run fewer threads: DVFS

[Figure: "Parsec: speedup". Speedup over 1 thread (y-axis, 0 to 30) vs. #threads (x-axis: 1, 2, 4, 8, 16, 32; #threads = #cores) for the 11 PARSEC benchmarks.]
Frequency (GHz) at 1, 2, 4, 8, 16, 32 threads: 1.1 1.1 1.1 1.1 1.1 1.1

Page 13: When Less Is  MOre  (LIMO): Controlled Parallelism for Improved Efficiency

Run fewer threads: DVFS

[Figure, first panel: "Parsec: speedup". Speedup over 1 thread (y-axis, 0 to 30) vs. #threads (x-axis: 1, 2, 4, 8, 16, 32; #threads = #cores) for the 11 PARSEC benchmarks.]
Frequency (GHz) at 1, 2, 4, 8, 16, 32 threads: 1.1 1.1 1.1 1.1 1.1 1.1

[Figure, second panel: "with DVFS". Speedup over 1 thread (y-axis, 0 to 30).]
Frequency (GHz) at 1, 2, 4, 8, 16, 32 threads: 3.6 2.8 2.2 1.8 1.4 1.1

Page 14: When Less Is  MOre  (LIMO): Controlled Parallelism for Improved Efficiency

Run fewer threads: DVFS

• DVFS makes the case for fewer threads more compelling.
• With fewer threads
    o increase frequency
    o reduce contention.

[Figure, first panel: speedup over 1 thread (y-axis, 0 to 30) vs. #threads (x-axis: 1, 2, 4, 8, 16, 32; #threads = #cores) for the 11 PARSEC benchmarks.]
Frequency (GHz) at 1, 2, 4, 8, 16, 32 threads: 1.1 1.1 1.1 1.1 1.1 1.1

[Figure, second panel: "with DVFS". Same axes and benchmarks.]
Frequency (GHz) at 1, 2, 4, 8, 16, 32 threads: 3.6 2.8 2.2 1.8 1.4 1.1

Annotation: 5 out of 11 benchmarks

Who can decide the best number of threads?
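
The trade-off on this slide can be sketched numerically. The snippet below is a rough illustration, not the authors' model. It uses the two frequency rows shown on the slide, assumes (optimistically) that performance scales linearly with clock frequency, and substitutes a made-up speedup curve, `HYPOTHETICAL_SPEEDUP`, for real benchmark data.

```python
# Frequency rows as shown on the slide, indexed by thread count.
FREQ_FIXED_GHZ = {1: 1.1, 2: 1.1, 4: 1.1, 8: 1.1, 16: 1.1, 32: 1.1}
FREQ_DVFS_GHZ = {1: 3.6, 2: 2.8, 4: 2.2, 8: 1.8, 16: 1.4, 32: 1.1}

# Made-up speedup-over-1-thread curve at a fixed 1.1 GHz (NOT measured data).
HYPOTHETICAL_SPEEDUP = {1: 1.0, 2: 1.9, 4: 3.5, 8: 5.0, 16: 5.5, 32: 5.2}

BASE_GHZ = 1.1


def scaled_speedup(speedup, freq_ghz, base_ghz=BASE_GHZ):
    """Scale each speedup by freq/base, assuming performance is linear in frequency."""
    return {n: s * freq_ghz[n] / base_ghz for n, s in speedup.items()}


def best_thread_count(speedup):
    """Return the thread count with the highest speedup."""
    return max(speedup, key=speedup.get)


without_dvfs = scaled_speedup(HYPOTHETICAL_SPEEDUP, FREQ_FIXED_GHZ)
with_dvfs = scaled_speedup(HYPOTHETICAL_SPEEDUP, FREQ_DVFS_GHZ)

print("best without DVFS:", best_thread_count(without_dvfs))
print("best with DVFS:   ", best_thread_count(with_dvfs))
```

With this particular made-up curve the best thread count moves from 16 to 8 once the frequency boost is taken into account. Memory-bound programs gain less from a higher clock than the linear assumption suggests, so the shift should be read as an upper bound.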

Page 15: When Less Is  MOre  (LIMO): Controlled Parallelism for Improved Efficiency

DVFS in current systems

[Diagram: threads (y-axis) vs. execution progress (x-axis), with regions marked "Stalled". Labels: "10 threads stalled", "12 threads stalled", "16 threads stalled"; frequency labels: 1.1 GHz, 1.1 GHz, 1.1 GHz, 1.4 GHz. Callout: "Turbo Boost increases frequency".]

Programmer decides how many threads to run (e.g. 32 threads on 32 cores)

• Inputs change
• System resources change
• Different hardware configurations
• Program characteristics change

Page 16: When Less Is  MOre  (LIMO): Controlled Parallelism for Improved Efficiency

Our system

[Diagram: threads (y-axis) vs. execution progress (x-axis), with regions marked "Stalled" and "Disabled". Labels: "10 threads stalled", "12 threads stalled", "16 threads stalled/disabled"; frequency labels: 1.1 GHz, 1.1 GHz, 1.4 GHz. Callouts: "Detection logic pro-actively disables more threads", "Frequency is increased", "Turbo Boost".]

Page 17: When Less Is  MOre  (LIMO): Controlled Parallelism for Improved Efficiency

Less Is MOre (LIMO)
• Less Is MOre for efficiency
• Observation:
    o Most programs do not scale after a certain limit
    o DVFS can help provide better performance
• A runtime system (LIMO)
    o Monitors shared resource contention (shared cache, shared program variables)
    o Pro-actively disables threads
    o Employs DVFS

Page 18: When Less Is  MOre  (LIMO): Controlled Parallelism for Improved Efficiency

Outline

Roadblocks to scalability

LIMO

Methodology

Results

Conclusion

Page 19: When Less Is  MOre  (LIMO): Controlled Parallelism for Improved Efficiency

Roadblocks: Shared Cache

Roadblocks
    Physical shared resources
        Shared cache
    Program level shared resources

Page 20: When Less Is  MOre  (LIMO): Controlled Parallelism for Improved Efficiency

Roadblocks: Shared Cache

[Figure: speedup over 1 thread (y-axis, 0 to 9) vs. #threads (x-axis, 1 to 17). Annotations: "Working set fits in shared cache", "Best performance", "Working set does not fit in shared cache", "Working set too large".]

• Abstract representation of most multi-threaded programs
• The peak performance point shifts depending on working set size and shared cache size
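
One way to read the plot is that performance peaks at the largest thread count whose combined working set still fits in the shared cache. The following is a minimal sketch of that reading under the simplifying assumption that every thread adds the same amount to the working set. The function name and the sizes in the usage lines are hypothetical and are not taken from the slides.

```python
def max_threads_fitting_cache(per_thread_ws_bytes, shared_cache_bytes, num_cores):
    """Largest thread count (at least 1) whose combined working set fits in the shared cache.

    Assumes the combined working set grows linearly with the number of threads,
    which ignores data shared between threads.
    """
    if per_thread_ws_bytes <= 0:
        return num_cores
    fitting = shared_cache_bytes // per_thread_ws_bytes
    return max(1, min(num_cores, fitting))


MIB = 1024 * 1024
# Hypothetical sizes: 8 MiB shared cache, 32 cores.
print(max_threads_fitting_cache(1 * MIB, 8 * MIB, 32))     # limited by the cache
print(max_threads_fitting_cache(128 * 1024, 8 * MIB, 32))  # limited by the core count
```

A larger cache or a smaller per-thread working set moves the result toward the core count, which matches the slide's point that the peak shifts with both quantities.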

Page 21: When Less Is  MOre  (LIMO): Controlled Parallelism for Improved Efficiency

Roadblocks: Program Resources

Roadblocks
    Physical shared resources
        Shared cache
    Program level shared resources
        Synchronization stalls (locks)

Page 22: When Less Is  MOre  (LIMO): Controlled Parallelism for Improved Efficiency

Roadblocks: Program Resources

[Figure: speedup over 1 thread (y-axis, 0 to 6) vs. #threads (x-axis, 1 to 32). Annotations: "Increased parallelism gives more performance", "Best performance", "Increased synchronization costs hurt performance".]

Page 23: When Less Is  MOre  (LIMO): Controlled Parallelism for Improved Efficiency

LIMO

[Diagram: threads (y-axis) vs. execution progress (x-axis), with regions marked "Stalled" and "Disabled". Labels: "10 threads stalled", "12 threads stalled", "16 threads stalled/disabled", "8 threads disabled"; frequency labels: 1.1 GHz, 1.1 GHz, 1.4 GHz, 1.8 GHz. Callouts: "Pro-actively disables more threads", "Frequency is increased", "After 100 million instructions, working set size estimate calculated", "Working set of 10 threads fits in cache - 6 threads disabled".]

• 20 threads at 1.1 GHz: 20 * 1.1 = 22

• 16 threads at 1.4 GHz: 16 * 1.4 = 22.4

• 10 threads at 1.4 GHz: 10 * 1.4 = 14

• 8 threads at 1.8 GHz: 8 * 1.8 = 14.4
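
The four products above can be reproduced with a small helper. This is a sketch under one assumption that the slides do not state explicitly: the clock is set by the smallest frequency step that can host all active threads (so 20 threads run at the 32-core frequency and 10 threads at the 16-core frequency). The names `FREQ_STEPS`, `frequency_for`, and `aggregate_ghz` are illustrative.

```python
# (maximum active threads, frequency in GHz), using the rounded values from the slides.
FREQ_STEPS = [(1, 3.6), (2, 2.8), (4, 2.2), (8, 1.8), (16, 1.4), (32, 1.1)]


def frequency_for(active_threads):
    """Frequency of the smallest step that can host `active_threads` (assumed rule)."""
    if active_threads < 1:
        raise ValueError("need at least one active thread")
    for max_threads, ghz in FREQ_STEPS:
        if active_threads <= max_threads:
            return ghz
    raise ValueError("more active threads than cores")


def aggregate_ghz(active_threads):
    """Threads multiplied by frequency, the quantity compared on the slide."""
    return active_threads * frequency_for(active_threads)


for threads in (20, 16, 10, 8):
    print(threads, "threads:", round(aggregate_ghz(threads), 1))
```

The product is only a rough proxy for throughput. It shows that dropping from 20 to 16 threads, or from 10 to 8, costs nothing in aggregate clock cycles once the frequency step is crossed.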

Page 24: When Less Is  MOre  (LIMO): Controlled Parallelism for Improved Efficiency

Methodology: Configuration
• Modified timing simulator FeS2 which uses Simics.
• Hardware configuration:

Cores                            32, out-of-order
Caches                           Inclusive
Coherence protocol               MOESI directory
Topology                         Mesh
Off-chip memory bandwidth        5 Gbps
L1 data cache                    Private
L2 cache                         Shared
Main memory latency              156 cycles
L1 hit latency                   3 cycles
L2 hit latency                   11 cycles
Router + network link latency    5 cycles

Cores    Frequency (GHz)
4        2.268
8        1.8
16       1.429
32       1.134
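
The slides do not say how the per-configuration frequencies were derived. Reading the table as 4 cores at 2.268 GHz, 8 at 1.8, 16 at 1.429, and 32 at 1.134, the values appear consistent with a fixed chip power budget in which per-core power grows with the cube of frequency. The sketch below encodes that assumed model; `frequency_ghz` and its parameters are illustrative names.

```python
BASE_CORES = 32     # all cores active
BASE_GHZ = 1.134    # frequency with all cores active, from the table


def frequency_ghz(active_cores, base_cores=BASE_CORES, base_ghz=BASE_GHZ):
    """Frequency under a fixed power budget, assuming per-core power ~ f**3.

    Budget: active_cores * f**3 == base_cores * base_ghz**3.
    """
    if active_cores < 1:
        raise ValueError("need at least one active core")
    return base_ghz * (base_cores / active_cores) ** (1.0 / 3.0)


for cores in (4, 8, 16, 32):
    print(cores, "cores:", round(frequency_ghz(cores), 3), "GHz")
```

Rounded to three decimals, the model matches the four table entries, which suggests (but does not prove) that a cubic power model was used.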

Page 25: When Less Is  MOre  (LIMO): Controlled Parallelism for Improved Efficiency

Methodology: Simulation
• 9 evenly spaced checkpoints
• Timing simulations starting from these checkpoints
• 80 million useful instructions simulated per checkpoint
    o Statistics cleared after the first 20 million
    o Useful instructions: committed in user mode, excluding spin loops.
• Benchmarks from the PARSEC benchmark suite, Apache web server (httpd), speech recognition benchmark (sphinx) from ALP.

Page 26: When Less Is  MOre  (LIMO): Controlled Parallelism for Improved Efficiency

Example perf. breakdown

[Figure: Ferret. Instructions per ns (y-axis, 0 to 14) vs. execution interval (x-axis, 1 to 9). Series: 8t, 16t, 32t.]

Page 27: When Less Is  MOre  (LIMO): Controlled Parallelism for Improved Efficiency

Example perf. breakdown

[Figure, first panel: Ferret. Instructions per ns (y-axis, 0 to 14) vs. execution interval (x-axis, 1 to 9). Series: 8t, 32t, LIMO.]

[Figure, second panel: Ferret. % synchronization stalls (y-axis, 0 to 90) vs. execution interval (x-axis, 1 to 9). Series: 8t, 32t, LIMO.]

Page 28: When Less Is  MOre  (LIMO): Controlled Parallelism for Improved Efficiency

Example perf. breakdown

[Figure, first panel: Ferret. L2 load misses (y-axis, 0 to 400000) vs. execution interval (x-axis, 1 to 9). Series: 8t, 32t, LIMO.]

[Figure, second panel: Ferret. Instructions per ns (y-axis, 0 to 14) vs. execution interval (x-axis, 1 to 9). Series: 8t, 32t, LIMO.]

Page 29: When Less Is  MOre  (LIMO): Controlled Parallelism for Improved Efficiency

Example perf. breakdown

[Figure, first panel: Ferret. Instructions per ns (y-axis, 0 to 16) vs. execution interval (x-axis, 1 to 9). Series: 8t, 32t, LIMO.]

[Figure, second panel: Ferret. L2 load misses (y-axis, 0 to 400000) vs. execution interval (x-axis, 1 to 9). Series: 8t, 32t, LIMO.]

Page 30: When Less Is  MOre  (LIMO): Controlled Parallelism for Improved Efficiency

Example perf. breakdown

[Figure, first panel: Ferret. Cores active (y-axis, 0 to 14) vs. execution interval (x-axis, 1 to 9). Series: numProcs.]

[Figure, second panel: Ferret. Instructions per ns (y-axis, 0 to 16) vs. execution interval (x-axis, 1 to 9). Series: 8t, 32t, LIMO.]

Page 31: When Less Is  MOre  (LIMO): Controlled Parallelism for Improved Efficiency

% Performance Improvement

[Figure: speedup over 32t (y-axis, 0.6 to 2.2) for blackscholes, dedup, facesim, swaptions, vips, fluidanimate, httpd, sphinx, ferret, streamcluster, and mean. Series: TB_DVFS, LIMO. Group annotations: "Good scalability", "Reduced synchronization stalls", "Reduced thrashing in shared cache".]

Page 32: When Less Is  MOre  (LIMO): Controlled Parallelism for Improved Efficiency


Conclusion
• Scalability is difficult to achieve and predict.
• Determining best number of threads is hard.
    o Contention in shared hardware resources
    o Contention in program level shared objects
• LIMO frees the programmer from this burden.
    o Monitors shared resource contention (shared cache, shared program variables)
    o Pro-actively disables threads
    o Employs DVFS
• 14% average improvement in performance over all threads.

Page 33: When Less Is  MOre  (LIMO): Controlled Parallelism for Improved Efficiency

Thank you!