When Less Is MOre (LIMO): Controlled Parallelism for Improved Efficiency
Gaurav Chadha, Scott Mahlke, Satish Narayanasamy
University of Michigan
Motivation
• Hardware trends
  o CMPs are ubiquitous.
  o More and more cores in a system
    • Mobile: Qualcomm Snapdragon, Samsung Exynos, NVIDIA Tegra 3
    • Server: Tilera
• Multi-threaded applications are pervasive.
• But, do we always want to maximize the number of threads? NO
Run fewer threads: DVFS
• Most multi-threaded applications stop scaling beyond a certain number of cores.
• It becomes counter-productive to run more threads.
• Maximum power budget is fixed for a system.
• Fewer cores can "borrow" power from disabled cores.
  o Intel Turbo Boost
[Chart: frequency vs. number of active cores]
Frequency increases in steps of 133 MHz
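A minimal sketch of this stepped scaling, assuming a hypothetical 1.1 GHz base clock (only the 133 MHz step size comes from the slide):

```python
# Turbo Boost raises the clock in discrete 133 MHz steps.
# The step size is from the slide; the base frequency and the
# number of steps granted are illustrative assumptions.
BASE_GHZ = 1.1
STEP_GHZ = 0.133

def boosted_ghz(steps):
    """Clock after the power budget allows `steps` boost steps."""
    return BASE_GHZ + steps * STEP_GHZ

for steps in range(5):
    print(f"{steps} steps -> {boosted_ghz(steps):.3f} GHz")
```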
Scalability: Problems
• Too many threads
  o Increased contention for shared resources.
  o Increased synchronization costs.
• Too few threads
  o Underutilization of resources.
[Chart: "Parsec: speedup" — speedup over 1 thread vs. #threads (#threads = #cores) for blackscholes, bodytrack, canneal, dedup, facesim, ferret, fluidanimate, streamcluster, swaptions, vips, x264]
Scalability: Fewer threads are better
• 4 threads best for streamcluster
[Chart: Parsec speedup vs. #threads, as above]
Scalability: Fewer threads are as good
[Chart: Parsec speedup vs. #threads, as above]
• Ferret, facesim, x264, dedup show poor scalability
Scalability: Opportunities
[Chart: Parsec speedup vs. #threads, as above]
• Run fewer threads
  o Disable some cores and increase the frequency of the active ones.
Run fewer threads: DVFS
[Charts: Parsec speedup over 1 thread vs. #threads (#threads = #cores), without DVFS (frequency fixed at 1.1 GHz for all thread counts) and with DVFS (frequency of 3.6, 2.8, 2.2, 1.8, 1.4, 1.1 GHz at 1, 2, 4, 8, 16, 32 threads); streamcluster highlighted]
• DVFS makes the case for fewer threads more compelling.
• With fewer threads
  o increase frequency
  o reduce contention.
• With DVFS, 5 out of 11 benchmarks run best with fewer than 32 threads.
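To first order, DVFS rescales the fixed-frequency speedup curves by the clock ratio. A sketch of that rescaling follows; the per-thread-count frequencies come from the slide, while the base speedup curve is invented for illustration and the model ignores memory-bound effects:

```python
# First-order model: with DVFS, fewer active threads run at a
# higher clock, so a fixed-frequency speedup scales by f / 1.1.
# The frequencies are the slide's per-thread-count clocks; the
# base speedups below are invented for illustration.
FREQ_GHZ = {1: 3.6, 2: 2.8, 4: 2.2, 8: 1.8, 16: 1.4, 32: 1.1}

def dvfs_speedup(threads, base_speedup):
    """Rescale a fixed-1.1 GHz speedup by the DVFS clock ratio."""
    return base_speedup * FREQ_GHZ[threads] / 1.1

# A hypothetical benchmark that stops scaling at 4 threads:
base = {1: 1.0, 2: 1.9, 4: 3.2, 8: 3.0, 16: 2.4, 32: 1.8}
for t, s in base.items():
    print(f"{t:2d} threads: {dvfs_speedup(t, s):.1f}x with DVFS")
```

Under this model the peak shifts toward fewer, faster threads, which is the slide's point.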
Who can decide the best number of threads?
DVFS in current systems
[Timeline: the programmer decides how many threads to run (e.g. 32 threads on 32 cores). As execution progresses, 10, then 12, then 16 threads stall; only then does Turbo Boost increase the frequency, from 1.1 GHz to 1.4 GHz.]
A fixed choice cannot adapt when:
• Inputs change
• System resources change
• Hardware configurations differ
• Program characteristics change
Our system
[Timeline: as execution progresses, 10, then 12 threads stall at 1.1 GHz; the detection logic pro-actively disables more threads (16 threads stalled/disabled), and Turbo Boost increases the frequency to 1.4 GHz.]
Less Is MOre (LIMO)
• Less Is MOre for efficiency
• Observation:
  o Most programs do not scale beyond a certain limit
  o DVFS can help provide better performance
• A runtime system (sketched below)
  o Monitors shared resource contention (shared cache, shared program variables)
  o Pro-actively disables threads
  o Employs DVFS
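A minimal sketch of such a runtime loop, under stated assumptions: the monitoring hooks, thresholds, and working-set numbers below are stand-ins rather than the actual LIMO implementation; only the frequency table matches the hardware configuration given later in the talk.

```python
import random

# Illustrative LIMO-style controller. sample_stall_fraction and
# estimate_working_set are stand-ins for hardware counters; the
# threshold and cache size are invented constants.
CACHE_BYTES = 8 << 20            # assumed shared L2 capacity
STALL_THRESHOLD = 0.5            # assumed stall-fraction trigger
DVFS_GHZ = {4: 2.268, 8: 1.8, 16: 1.429, 32: 1.134}

def sample_stall_fraction():
    """Stand-in for reading synchronization-stall counters."""
    return random.random()

def estimate_working_set(threads):
    """Stand-in for the periodic working-set-size estimate."""
    return threads * (512 << 10)  # assume ~512 KB per thread

def limo_epoch(threads):
    """One epoch: disable threads when shared resources are
    contended, then raise the clock of the remaining cores."""
    contended = (sample_stall_fraction() > STALL_THRESHOLD
                 or estimate_working_set(threads) > CACHE_BYTES)
    if contended and threads > 4:
        threads //= 2             # pro-actively disable threads
    return threads, DVFS_GHZ[threads]

threads = 32
for epoch in range(4):
    threads, ghz = limo_epoch(threads)
    print(f"epoch {epoch}: {threads} threads at {ghz} GHz")
```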
Outline
• Roadblocks to scalability
• LIMO
• Methodology
• Results
• Conclusion
Roadblocks
• Physical shared resources
  o Shared cache
• Program-level shared resources
Roadblocks: Shared Cache
[Chart: speedup over 1 thread vs. #threads. Speedup rises while the working set fits in the shared cache, reaching best performance; it falls once the working set does not fit in the shared cache, and stays low when the working set is too large.]
• Abstract representation of most multi-threaded programs
• The peak performance point shifts depending on working set size and shared cache size (see the sketch below)
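A small sketch of the underlying check; both constants are assumed for illustration, since the slide only says the peak depends on these two quantities:

```python
# Largest thread count whose aggregate working set still fits in
# the shared cache. Both constants are illustrative assumptions.
CACHE_BYTES = 8 << 20       # assumed shared-cache capacity (8 MB)
PER_THREAD_WS = 1 << 20     # assumed per-thread working set (1 MB)

def threads_that_fit(cache=CACHE_BYTES, per_thread=PER_THREAD_WS):
    """Past this count, the working set spills out of the cache
    and the speedup curve turns over."""
    return max(1, cache // per_thread)

print(threads_that_fit())   # -> 8
```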
Roadblocks: Program Resources
• Physical shared resources
  o Shared cache
• Program-level shared resources
  o Synchronization stalls (locks)
[Chart: speedup over 1 thread vs. #threads. Increased parallelism gives more performance up to the best-performance point; beyond it, increased synchronization costs hurt performance.]
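A toy model reproducing this shape, for intuition only; the linear per-thread synchronization cost is an invented assumption:

```python
# Speedup rises as the 1/threads parallel time shrinks, until a
# synchronization term that grows with the thread count wins out.
# The 0.01 per-thread sync cost is an invented constant.
def speedup(threads, sync_cost=0.01):
    runtime = 1.0 / threads + sync_cost * threads
    return 1.0 / runtime

best = max(range(1, 33), key=speedup)
print(best, round(speedup(best), 2))  # peaks at 10 threads, ~5x
```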
LIMO
[Timeline: execution under LIMO. 10, then 12 threads stall at 1.1 GHz; LIMO pro-actively disables more threads (16 threads stalled/disabled) and the frequency is increased to 1.4 GHz. After 100 million instructions, a working-set-size estimate is calculated: the working set of 10 threads fits in the cache, so 6 threads are disabled; LIMO then pro-actively disables more threads, leaving 8 running at 1.8 GHz.]
Each step wins on aggregate throughput (threads × GHz):
• 20 threads at 1.1 GHz: 20 * 1.1 = 22
• 16 threads at 1.4 GHz: 16 * 1.4 = 22.4
• 10 threads at 1.4 GHz: 10 * 1.4 = 14
• 8 threads at 1.8 GHz: 8 * 1.8 = 14.4
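A one-liner reproduces these back-of-the-envelope numbers:

```python
# Aggregate-throughput proxy from the slide: active threads x GHz.
for threads, ghz in [(20, 1.1), (16, 1.4), (10, 1.4), (8, 1.8)]:
    print(f"{threads} threads at {ghz} GHz -> {threads * ghz:.1f}")
```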
Methodology: Configuration
• Modified timing simulator FeS2, which uses Simics.
• Hardware configuration:
Cores: 32, out-of-order
Caches: Inclusive
Coherence protocol: MOESI directory
Topology: Mesh
Off-chip memory bandwidth: 5 Gbps
L1 data cache: Private
L2 cache: Shared
Main memory latency: 156 cycles
L1 hit latency: 3 cycles
L2 hit latency: 11 cycles
Router + network link latency: 5 cycles
Cores / Frequency (GHz):
4: 2.268
8: 1.8
16: 1.429
32: 1.134
Methodology: Simulation
• 9 evenly spaced checkpoints
• Timing simulations starting from these checkpoints
• 80 million useful instructions simulated per checkpoint
  o Statistics cleared after the first 20 million
  o Useful instructions: committed in user mode, excluding spin loops.
• Benchmarks from the PARSEC benchmark suite, the Apache web server (httpd), and a speech recognition benchmark (sphinx) from ALP.
Example perf. breakdown: Ferret
[Charts vs. execution interval: instructions per ns for 8t, 16t, and 32t; then, comparing 8t, 32t, and LIMO: instructions per ns, % synchronization stalls, L2 load misses, and active cores (numProcs)]
% Performance Improvement
[Chart: speedup over 32t for TB_DVFS and LIMO on blackscholes, dedup, facesim, swaptions, vips, fluidanimate, httpd, sphinx, ferret, streamcluster, and the mean. Annotations: good scalability; reduced synchronization stalls; reduced thrashing in shared cache.]
Conclusion
• Scalability is difficult to achieve and predict.
• Determining the best number of threads is hard.
  o Contention in shared hardware resources
  o Contention in program-level shared objects
• LIMO frees the programmer from this burden.
  o Monitors shared resource contention (shared cache, shared program variables)
  o Pro-actively disables threads
  o Employs DVFS
• 14% average performance improvement over running all threads.
Thank you!