When Less Is MOre (LIMO): Controlled Parallelism for Improved Efficiency
Gaurav Chadha, Scott Mahlke, Satish Narayanasamy
University of Michigan
Motivation
• Hardware trends
  o CMPs are ubiquitous.
  o More and more cores in a system
    • Mobile: Qualcomm Snapdragon, Samsung Exynos, NVIDIA Tegra 3
    • Server: Tilera
• Multi-threaded applications are pervasive.
• But, do we always want to maximize the number of threads? NO
Run fewer threads: DVFS
• Most multi-threaded applications stop scaling beyond a certain number of cores.
• It becomes counter-productive to run more threads.
• Maximum power budget is fixed for a system.
• Fewer cores can "borrow" power from disabled cores.
  o Intel Turbo Boost
[Chart: frequency vs. number of active cores]
Frequency increases in steps of 133 MHz
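A minimal sketch of this stepped scaling, assuming a hypothetical 1.1 GHz base clock (only the 133 MHz step size comes from the slide):

```python
# Turbo Boost raises the clock in discrete 133 MHz steps.
# The step size is from the slide; the base frequency and the
# number of steps granted are illustrative assumptions.
BASE_GHZ = 1.1
STEP_GHZ = 0.133

def boosted_ghz(steps):
    """Clock after the power budget allows `steps` boost steps."""
    return BASE_GHZ + steps * STEP_GHZ

for steps in range(5):
    print(f"{steps} steps -> {boosted_ghz(steps):.3f} GHz")
```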
Scalability: Problems
• Too many threads
  o Increased contention for shared resources.
  o Increased synchronization costs.
• Too few threads
  o Underutilization of resources.
[Chart: "Parsec: speedup" — speedup over 1 thread vs. #threads (#threads = #cores) for blackscholes, bodytrack, canneal, dedup, facesim, ferret, fluidanimate, streamcluster, swaptions, vips, x264]
Scalability: Fewer threads are better
• 4 threads best for streamcluster
[Chart: Parsec speedup vs. #threads, as above]
Scalability: Fewer threads are as good
[Chart: Parsec speedup vs. #threads, as above]
• Ferret, facesim, x264, dedup show poor scalability
Scalability: Opportunities
[Chart: Parsec speedup vs. #threads, as above]
• Run fewer threads
  o Disable some cores and increase the frequency of the active ones.
Run fewer threads: DVFS
[Charts: Parsec speedup over 1 thread vs. #threads (#threads = #cores), without DVFS (frequency fixed at 1.1 GHz for all thread counts) and with DVFS (frequency of 3.6, 2.8, 2.2, 1.8, 1.4, 1.1 GHz at 1, 2, 4, 8, 16, 32 threads); streamcluster highlighted]
• DVFS makes the case for fewer threads more compelling.
• With fewer threads
  o increase frequency
  o reduce contention.
• With DVFS, 5 out of 11 benchmarks run best with fewer than 32 threads.
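To first order, DVFS rescales the fixed-frequency speedup curves by the clock ratio. A sketch of that rescaling follows; the per-thread-count frequencies come from the slide, while the base speedup curve is invented for illustration and the model ignores memory-bound effects:

```python
# First-order model: with DVFS, fewer active threads run at a
# higher clock, so a fixed-frequency speedup scales by f / 1.1.
# The frequencies are the slide's per-thread-count clocks; the
# base speedups below are invented for illustration.
FREQ_GHZ = {1: 3.6, 2: 2.8, 4: 2.2, 8: 1.8, 16: 1.4, 32: 1.1}

def dvfs_speedup(threads, base_speedup):
    """Rescale a fixed-1.1 GHz speedup by the DVFS clock ratio."""
    return base_speedup * FREQ_GHZ[threads] / 1.1

# A hypothetical benchmark that stops scaling at 4 threads:
base = {1: 1.0, 2: 1.9, 4: 3.2, 8: 3.0, 16: 2.4, 32: 1.8}
for t, s in base.items():
    print(f"{t:2d} threads: {dvfs_speedup(t, s):.1f}x with DVFS")
```

Under this model the peak shifts toward fewer, faster threads, which is the slide's point.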
Who can decide the best number of threads?
DVFS in current systems
[Timeline: the programmer decides how many threads to run (e.g. 32 threads on 32 cores). As execution progresses, 10, then 12, then 16 threads stall; only then does Turbo Boost increase the frequency, from 1.1 GHz to 1.4 GHz.]
A fixed choice cannot adapt when:
• Inputs change
• System resources change
• Hardware configurations differ
• Program characteristics change
Our system
[Timeline: as execution progresses, 10, then 12 threads stall at 1.1 GHz; the detection logic pro-actively disables more threads (16 threads stalled/disabled), and Turbo Boost increases the frequency to 1.4 GHz.]
Less Is MOre (LIMO)
• Less Is MOre for efficiency
• Observation:
  o Most programs do not scale beyond a certain limit
  o DVFS can help provide better performance
• A runtime system (sketched below)
  o Monitors shared resource contention (shared cache, shared program variables)
  o Pro-actively disables threads
  o Employs DVFS
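A minimal sketch of such a runtime loop, under stated assumptions: the monitoring hooks, thresholds, and working-set numbers below are stand-ins rather than the actual LIMO implementation; only the frequency table matches the hardware configuration given later in the talk.

```python
import random

# Illustrative LIMO-style controller. sample_stall_fraction and
# estimate_working_set are stand-ins for hardware counters; the
# threshold and cache size are invented constants.
CACHE_BYTES = 8 << 20            # assumed shared L2 capacity
STALL_THRESHOLD = 0.5            # assumed stall-fraction trigger
DVFS_GHZ = {4: 2.268, 8: 1.8, 16: 1.429, 32: 1.134}

def sample_stall_fraction():
    """Stand-in for reading synchronization-stall counters."""
    return random.random()

def estimate_working_set(threads):
    """Stand-in for the periodic working-set-size estimate."""
    return threads * (512 << 10)  # assume ~512 KB per thread

def limo_epoch(threads):
    """One epoch: disable threads when shared resources are
    contended, then raise the clock of the remaining cores."""
    contended = (sample_stall_fraction() > STALL_THRESHOLD
                 or estimate_working_set(threads) > CACHE_BYTES)
    if contended and threads > 4:
        threads //= 2             # pro-actively disable threads
    return threads, DVFS_GHZ[threads]

threads = 32
for epoch in range(4):
    threads, ghz = limo_epoch(threads)
    print(f"epoch {epoch}: {threads} threads at {ghz} GHz")
```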
Outline
• Roadblocks to scalability
• LIMO
• Methodology
• Results
• Conclusion
Roadblocks
• Physical shared resources
  o Shared cache
• Program-level shared resources
Roadblocks: Shared Cache
[Chart: speedup over 1 thread vs. #threads. Speedup rises while the working set fits in the shared cache, reaching best performance; it falls once the working set does not fit in the shared cache, and stays low when the working set is too large.]
• Abstract representation of most multi-threaded programs
• The peak performance point shifts depending on working set size and shared cache size (see the sketch below)
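A small sketch of the underlying check; both constants are assumed for illustration, since the slide only says the peak depends on these two quantities:

```python
# Largest thread count whose aggregate working set still fits in
# the shared cache. Both constants are illustrative assumptions.
CACHE_BYTES = 8 << 20       # assumed shared-cache capacity (8 MB)
PER_THREAD_WS = 1 << 20     # assumed per-thread working set (1 MB)

def threads_that_fit(cache=CACHE_BYTES, per_thread=PER_THREAD_WS):
    """Past this count, the working set spills out of the cache
    and the speedup curve turns over."""
    return max(1, cache // per_thread)

print(threads_that_fit())   # -> 8
```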
Roadblocks: Program Resources
• Physical shared resources
  o Shared cache
• Program-level shared resources
  o Synchronization stalls (locks)
[Chart: speedup over 1 thread vs. #threads. Increased parallelism gives more performance up to the best-performance point; beyond it, increased synchronization costs hurt performance.]
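A toy model reproducing this shape, for intuition only; the linear per-thread synchronization cost is an invented assumption:

```python
# Speedup rises as the 1/threads parallel time shrinks, until a
# synchronization term that grows with the thread count wins out.
# The 0.01 per-thread sync cost is an invented constant.
def speedup(threads, sync_cost=0.01):
    runtime = 1.0 / threads + sync_cost * threads
    return 1.0 / runtime

best = max(range(1, 33), key=speedup)
print(best, round(speedup(best), 2))  # peaks at 10 threads, ~5x
```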
LIMO
[Timeline: execution under LIMO. 10, then 12 threads stall at 1.1 GHz; LIMO pro-actively disables more threads (16 threads stalled/disabled) and the frequency is increased to 1.4 GHz. After 100 million instructions, a working-set-size estimate is calculated: the working set of 10 threads fits in the cache, so 6 threads are disabled; LIMO then pro-actively disables more threads, leaving 8 running at 1.8 GHz.]
Each step wins on aggregate throughput (threads × GHz):
• 20 threads at 1.1 GHz: 20 * 1.1 = 22
• 16 threads at 1.4 GHz: 16 * 1.4 = 22.4
• 10 threads at 1.4 GHz: 10 * 1.4 = 14
• 8 threads at 1.8 GHz: 8 * 1.8 = 14.4
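A one-liner reproduces these back-of-the-envelope numbers:

```python
# Aggregate-throughput proxy from the slide: active threads x GHz.
for threads, ghz in [(20, 1.1), (16, 1.4), (10, 1.4), (8, 1.8)]:
    print(f"{threads} threads at {ghz} GHz -> {threads * ghz:.1f}")
```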
Methodology: Configuration
• Modified timing simulator FeS2, which uses Simics.
• Hardware configuration:
Cores: 32, out-of-order
Caches: Inclusive
Coherence protocol: MOESI directory
Topology: Mesh
Off-chip memory bandwidth: 5 Gbps
L1 data cache: Private
L2 cache: Shared
Main memory latency: 156 cycles
L1 hit latency: 3 cycles
L2 hit latency: 11 cycles
Router + network link latency: 5 cycles
Cores / Frequency (GHz):
4: 2.268
8: 1.8
16: 1.429
32: 1.134
Methodology: Simulation
• 9 evenly spaced checkpoints
• Timing simulations starting from these checkpoints
• 80 million useful instructions simulated per checkpoint
  o Statistics cleared after the first 20 million
  o Useful instructions: committed in user mode, excluding spin loops.
• Benchmarks from the PARSEC benchmark suite, the Apache web server (httpd), and a speech recognition benchmark (sphinx) from ALP.
Example perf. breakdown: Ferret
[Charts vs. execution interval: instructions per ns for 8t, 16t, and 32t; then, comparing 8t, 32t, and LIMO: instructions per ns, % synchronization stalls, L2 load misses, and active cores (numProcs)]
% Performance Improvement
[Chart: speedup over 32t for TB_DVFS and LIMO on blackscholes, dedup, facesim, swaptions, vips, fluidanimate, httpd, sphinx, ferret, streamcluster, and the mean. Annotations: good scalability; reduced synchronization stalls; reduced thrashing in shared cache.]
Conclusion
• Scalability is difficult to achieve and predict.
• Determining the best number of threads is hard.
  o Contention in shared hardware resources
  o Contention in program-level shared objects
• LIMO frees the programmer from this burden.
  o Monitors shared resource contention (shared cache, shared program variables)
  o Pro-actively disables threads
  o Employs DVFS
• 14% average performance improvement over running all threads.
Thank you!