1
Thread criticality for power efficiency in CMPs
Khairul Kabir, Nov. 3rd, 2009
ECE 692 Topic Presentation
2
Why Thread Criticality Prediction?

Critical thread
− The one with the longest completion time in the parallel region

[Diagram: threads T0–T3 executing instructions toward a barrier; D-cache and I-cache misses stall some threads, so the others wait]

Problems
− Performance degradation
− Energy inefficiency

Sources of variability
− Algorithm, process variation, thermal emergencies, etc.

Purpose
− Load balancing for performance improvement
− Energy optimization using DVFS
3
Related Work

Instruction criticality [Fields et al., Tune et al. 2001, etc.]
− Identify critical instructions

Thrifty barrier [Li et al. 2005]
− Faster cores transition into a low-power mode based on a prediction of barrier stall time

DVFS for energy efficiency at barriers [Liu et al. 2005]
− Faster cores track their waiting time and predict the DVFS setting for the next execution of the same parallel loop

Meeting points [Cai et al. 2008]
− DVFS non-critical threads by tracking loop-iteration completion rates across cores (parallel loops)
4
Thread Criticality Predictors for Dynamic Performance, Power, and Resource Management in Chip Multiprocessors

Abhishek Bhattacharjee, Margaret Martonosi
Dept. of Electrical Engineering, Princeton University
5
What is This Paper About?

Thread criticality predictor (TCP) design
− Methodology
− Identify architectural events impacting thread criticality
− Introduce basic TCP hardware

Thread criticality predictor uses
− Apply to Intel's Threading Building Blocks (TBB)
− Apply to energy efficiency in barrier-based programs
6
Thread Criticality Prediction Goals

Design goals
1. Accuracy
2. Low-overhead implementation
   • Simple HW (allow SW policies to be built on top)
3. One predictor, many uses

Design decisions
1. Find a suitable architectural metric
2. History-based local approach versus thread-comparative approach
3. This paper: TBB and DVFS; other uses: shared last-level cache management, SMT and memory priority, …
7
Methodology

Evaluations on a range of architectures, spanning the high-performance and embedded domains:
– GEMS simulator – to evaluate performance on architectures representative of the high-performance domain
– ARM simulator – to evaluate the performance benefits of TCP-guided task stealing in Intel's TBB
– FPGA-based emulator – to assess energy savings from TCP-guided DVFS

Infrastructure   | Domain                                      | System                       | Cores         | Caches
GEMS simulator   | High-performance, wide-issue, out-of-order  | 16-core CMP with Solaris 10  | 4-issue SPARC | 32KB L1, 4MB L2
ARM simulator    | Embedded, in-order                          | 4–32 core CMP                | 2-issue ARM   | 32KB L1, 4MB L2
FPGA emulator    | Embedded, in-order                          | 4-core CMP with Linux 2.6    | 1-issue SPARC | 4KB I-cache, 8KB D-cache
8
Architectural Metrics

History-based TCP
– Requires repetitive barrier behavior
– Information local to the core: no communication
– Problem for in-order pipelines: variant IPCs

Inter-core TCP metrics
– Instruction count
– Cache misses
– Control-flow changes
– Translation lookaside buffer (TLB) misses

[Chart: Ocean, normalized time per iteration (relative to iteration 0) at barrier 8, broken into stall time and compute time]
9
Thread-Comparative Metrics for TCP: Instruction Counts

[Chart: % error of the metric in tracking compute time, for SPLASH-2 (Cholesky, FFT, Radix, Water-Sp, Volrend, Water-Nsq, Barnes, Ocean, LU) and PARSEC (Swaptions, Fluidanimate, Blackscholes, Streamcluster); series: in-order instruction count]
10
Thread-Comparative Metrics for TCP: L1 D-Cache Misses

[Chart: % error of the metric in tracking compute time, for the same SPLASH-2 and PARSEC benchmarks; series: in-order instruction count, in-order L1 D-cache misses per instruction]
11
Thread-Comparative Metrics for TCP: L1 I & D Cache Misses

[Chart: % error of the metric in tracking compute time, for the same SPLASH-2 and PARSEC benchmarks; series: in-order instruction count, in-order L1 D-cache misses per instruction, in-order L1 I & D cache misses per instruction]
12
Thread-Comparative Metrics for TCP: All L1 and L2 Cache Misses

[Chart: % error of the metric in tracking compute time, for the same SPLASH-2 and PARSEC benchmarks; series: in-order instruction count, in-order L1 D-cache misses per instruction, in-order L1 I & D cache misses per instruction, in-order L1 & L2 cache misses per instruction]
13
Thread-Comparative Metrics for TCP: All L1 and L2 Cache Misses

[Chart: same as the previous slide, with an additional series: out-of-order L1 & L2 cache misses per instruction]
14
Basic TCP Hardware

TCP hardware components
− Per-core criticality counters
− Interval bound register
15
Basic TCP Hardware

[Diagram: four cores (each with an L1 I$ and L1 D$) share an L2 cache; the TCP hardware sits alongside the L2 controller]

Example walkthrough (criticality counters start at 0 0 0 0):
− Core 1 suffers an L1 D$ miss at Inst 5 → counters become 0 1 0 0
− Core 2 suffers an L1 I$ miss at Inst 20 → counters become 0 1 1 0
− Core 1 suffers an L2 miss at Inst 25 → counters become 0 11 1 0 (L2 misses are weighted more heavily)

Per-core criticality counters track poorly cached, slow threads.
Criticality counters are periodically refreshed using the interval bound register.
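The counter scheme above can be sketched in a few lines. The weights are illustrative assumptions (an L1 miss adds 1, an L2 miss adds 10), chosen to mirror the walkthrough in which an L2 miss bumps a counter from 1 to 11:

```python
# Sketch of per-core criticality counters. Miss weights are assumptions,
# not the paper's exact values: L1 miss = +1, L2 miss = +10.
L1_WEIGHT = 1
L2_WEIGHT = 10

class CriticalityCounters:
    def __init__(self, num_cores):
        self.counters = [0] * num_cores

    def record_miss(self, core, level):
        """Charge a cache miss on `core` to its criticality counter."""
        self.counters[core] += L1_WEIGHT if level == "L1" else L2_WEIGHT

    def most_critical(self):
        """The slowest, most poorly cached thread has the largest counter."""
        return max(range(len(self.counters)), key=lambda c: self.counters[c])

    def refresh(self):
        """Periodic refresh, driven by the interval bound register."""
        self.counters = [0] * len(self.counters)

# Replaying the slide's walkthrough:
tcp = CriticalityCounters(4)
tcp.record_miss(1, "L1")    # Core 1: L1 D$ miss -> 0 1 0 0
tcp.record_miss(2, "L1")    # Core 2: L1 I$ miss -> 0 1 1 0
tcp.record_miss(1, "L2")    # Core 1: L2 miss    -> 0 11 1 0
print(tcp.counters)         # [0, 11, 1, 0]
```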
16
TBB Task Stealing & Thread Criticality

The TBB dynamic scheduler distributes tasks
– Each thread maintains a software queue filled with tasks
– On an empty queue, the thread "steals" a task from another thread's queue

Approach 1: Default TBB uses random task stealing
– More failed steals at higher core counts → poor performance

Approach 2: Occupancy-based task stealing [Contreras, Martonosi 2008]
– Steal based on the number of items in the SW queue
– Must track and compare maximum occupancy counts
17
TCP-Guided TBB Task Stealing

[Diagram: four cores with software task queues SW Q0–SW Q3 and a shared L2 cache; the TCP control logic holds the per-core criticality counters and the interval bound register]

Example walkthrough:
− Cache misses on Core 3 raise its criticality counter above the others
− When Core 2's queue empties, it issues a steal request; the TCP logic scans for the maximum counter value and directs Core 2 to steal from Core 3, the critical thread

• TCP initiates steals from the critical thread
• Modest message overhead: one L2 access latency
• Scalable: 14-bit criticality counters → 114 bytes of storage @ 64 cores
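The victim-selection step can be sketched as follows. This is a simplification under stated assumptions: the real TCP logic scans hardware counters at the L2 controller, and the queue contents here are hypothetical:

```python
# Sketch of TCP-guided victim selection for TBB task stealing: when a core's
# queue runs empty, steal from the core with the maximum criticality counter,
# i.e. the most critical thread.
def pick_victim(criticality_counters, thief, queues):
    """Return the core to steal from, or None if no other core has work."""
    candidates = [c for c in range(len(queues))
                  if c != thief and queues[c]]   # non-empty queues only
    if not candidates:
        return None
    return max(candidates, key=lambda c: criticality_counters[c])

# Slide scenario: Core 2's queue is empty; Core 3 is most critical (counter 21).
counters = [14, 5, 2, 21]
queues = [["Task 0"], ["Task 1"], [], ["Task 7"]]
print(pick_victim(counters, thief=2, queues=queues))  # 3
```

Core 2 then steals Task 7 from Core 3's queue, helping the critical thread drain its backlog.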
18
TCP-Guided TBB Task Stealing

[Charts: TBB with random task stealing vs. TBB with TCP-guided task stealing]
19
TCP-Guided TBB Performance

[Chart: % performance improvement versus random task stealing for Blackscholes, Fluidanimate, Swaptions, and Streamcluster at 4, 8, 16, and 32 cores; series: occupancy-based approach, criticality-based approach]

Avg. improvement over random (32 cores) = 21.6%
Avg. improvement over occupancy (32 cores) = 13.8%
20
Adapting TCP for Energy Efficiency in Barrier-Based Programs

[Diagram: threads T0–T3 executing toward a barrier; T1 stalls on an L2 D$ miss and becomes critical → DVFS T0, T2, T3]

Approach: DVFS non-critical threads to eliminate barrier stall time

Challenges:
• Relative criticalities
• Misprediction costs
• DVFS overheads
21
Hardware and Algorithm for TCP-Guided DVFS

TCP hardware components
− Criticality counters
− SST – Switching Suggestion Table
− SCT – Suggestion Confidence Table
− Interval bound register

TCP-guided DVFS algorithm – two key steps:
1. Use the SST to translate criticality counter values into thread criticalities
   − Triggered when a criticality counter value is above a pre-defined threshold T and the core is running at the nominal frequency
   − The criticality counter value is matched against SST entries
   − A frequency switch is suggested if the matching SST entry differs from the current frequency
2. Feed the suggested target frequency from the SST to the SCT
   − The SCT assesses confidence in the SST's DVFS suggestion
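As a rough structural sketch, the two-step flow can be expressed in Python. The SST contents, the threshold T, and the confidence limit below are all illustrative assumptions (the slide specifies the structures, not their values), and the counter-to-frequency mapping is hypothetical; only the threshold check, table lookup, and confidence gating follow the slide:

```python
# Structural sketch of TCP-guided DVFS. SST entries, T, and CONF_MAX are
# hypothetical values, not the paper's.
SST = [(0, 1.0), (20, 0.8), (40, 0.6)]  # (counter threshold, suggested freq ratio)
T = 10                                   # pre-defined criticality threshold
CONF_MAX = 3                             # SCT confidence needed before switching

def sst_lookup(counter_value):
    """Step 1: translate a criticality counter value into a frequency suggestion."""
    suggestion = SST[0][1]
    for threshold, freq in SST:
        if counter_value >= threshold:
            suggestion = freq
    return suggestion

def dvfs_decision(counter_value, current_freq, sct_confidence):
    """Step 2: apply the SST suggestion only once the SCT is confident.
    Returns (new_freq, new_confidence)."""
    if counter_value <= T or current_freq != 1.0:
        return current_freq, 0            # only cores at nominal freq are considered
    suggestion = sst_lookup(counter_value)
    if suggestion == current_freq:
        return current_freq, 0            # nothing to change
    if sct_confidence + 1 >= CONF_MAX:
        return suggestion, 0              # confident: perform the switch
    return current_freq, sct_confidence + 1  # not yet confident: wait

freq, conf = 1.0, 0
for _ in range(3):                        # a repeated suggestion builds confidence
    freq, conf = dvfs_decision(45, freq, conf)
print(freq)  # 0.6
```

The confidence gating is what protects against the temporal noise discussed on the next slide: a one-off counter spike does not immediately trigger a frequency change.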
22
TCP-Guided DVFS – Effect of Criticality Counter Threshold

− Lowest bar – pre-calculated or correct state, averaged across all barrier instances
− Central bar – learning time taken until the correct DVFS state is first reached
− Upper bar – prediction noise, i.e., time spent in erroneous DVFS states after the correct one has first been reached

A low threshold T increases susceptibility to temporal noise: without good suggestion confidence, too many frequency changes occur and performance overhead results.
23
TCP for DVFS: Results

[Chart: normalized energy savings (relative to the original benchmark, 0–0.3) for Cholesky, Radix, FFT, Barnes, Water-Sp, Water-Nsq, Ocean, Blackscholes, Streamcluster, Volrend, LU, and the average]

Average 15% energy savings
Benchmarks with more load imbalance generally save more energy
24
Conclusions

Goal 1: Accuracy
– Accurate TCPs based on simple cache statistics

Goal 2: Low-overhead hardware
– Scalable per-core criticality counters
– TCP placed in a central location where cache information is already available

Goal 3: Versatility
– TBB improved by 13.8% over the best known approach @ 32 cores
– DVFS used to achieve 15% energy savings
– Two uses shown, many others possible…
25
Meeting Points: Using Thread Criticality to Adapt Multi-core Hardware to Parallel Regions

Qiong Cai, José González, Ryan Rakvic, Grigorios Magklis, Pedro Chaparro, Antonio González
26
Introduction

Proposed applications

Thread delaying for multi-core systems
− Saves energy by scaling down the frequency and voltage of the cores running non-critical threads

Thread balancing for simultaneous multi-threaded cores
− Improves overall performance by giving higher priority to the critical thread

Meeting point thread characterization
− Identifies the critical thread of a single multi-threaded application
− Identifies the amount of slack of the non-critical threads
27
Example: a parallelized loop from PageRank (lz77 method)

#pragma omp parallel for
for (int i = 0; i < nb; i++) {
    Msizeu partstart = partition()[i];
    Msizeu partend   = partition()[i+1];
    diag(i).multply_transposed(...);
    offdiag.multply(...);
    for (int i = partstart; i < partend; i++) {
        res[i] *= c;
        normres += std::abs(res[i]);
    }
}

[Chart: time in milliseconds per iteration (1–101) for CPU0 and CPU1]

Observations:
1. The code is already written to achieve workload balance, but imbalance still exists: CPU1 is slower than CPU0.
2. Reasons for the imbalance: (i) different cache misses, (ii) different control paths

How To Find Critical Threads Dynamically?
28
Identification of Critical Threads

Identification technique
− Each thread increments a thread-private counter when it passes the meeting point
− The most critical thread is the one with the smallest counter
− A thread's slack is estimated as the difference between its counter and that of the slowest thread

Insertion of meeting points
− A place in the parallel region that is visited by all threads
− Can be done by the hardware, the compiler, or the programmer

#pragma omp parallel for
for (int i = 0; i < nb; i++) {
    Msizeu partstart = partition()[i];
    Msizeu partend   = partition()[i+1];
    diag(i).multply_transposed(...);
    offdiag.multply(...);
    for (int i = partstart; i < partend; i++) {
        res[i] *= c;
        normres += std::abs(res[i]);
    }
    asm("mp_inst");   // meeting point
}
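The identification technique above amounts to comparing meeting point counts. A minimal sketch (the counter values are hypothetical, standing in for the kind of lag CPU1 shows in the PageRank example):

```python
# Sketch of meeting point thread characterization: each thread increments a
# private counter at the meeting point; the smallest count marks the critical
# thread, and slack is measured against it.
def critical_thread(mp_counters):
    """The most critical thread has passed the meeting point the fewest times."""
    return min(range(len(mp_counters)), key=lambda t: mp_counters[t])

def slack(mp_counters, thread):
    """How far ahead `thread` is of the slowest thread, in meeting point visits."""
    return mp_counters[thread] - min(mp_counters)

# Hypothetical counts partway through execution: thread 1 lags behind.
counts = [52, 41, 50, 49]
print(critical_thread(counts))  # 1
print(slack(counts, 0))         # 11
```

Thread delaying uses these slacks to pick lower frequency/voltage levels for the threads that are ahead.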
29
Thread Delaying

Make non-critical threads run at a lower frequency/voltage level
− All threads then arrive at the barrier at the same time

Alternatively, the CPUs of non-critical threads can be put into deep sleep on arrival
− Consumes almost zero energy while asleep
− But this is not the most energy-efficient way to deal with workload imbalance
30
Thread Delaying

[Diagram: frequency-vs-time plots for Threads 1–4 (code regions A–D), all running at full frequency until the barrier; the area under each curve represents the energy needed to execute that thread's instructions]

Proposal: Energy = Activity × Capacitance × Voltage²
− Reduce voltage when executing parallel threads
− Delay the threads arriving early at the barrier
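The proposal can be made concrete with a toy calculation. The numbers are illustrative: the slide gives only E = Activity × Capacitance × V², and a linear frequency–voltage relationship is assumed here, so a non-critical thread with enough slack to run at 0.8× frequency can also run at 0.8× voltage:

```python
# Toy calculation for thread delaying, using the slide's E = A * C * V^2.
# Assumptions: frequency scales linearly with voltage; activity and effective
# capacitance (A, C) are arbitrary placeholder values.
def dynamic_energy(activity, capacitance, voltage):
    return activity * capacitance * voltage ** 2

A, C = 1e9, 1e-9
full    = dynamic_energy(A, C, 1.0)   # run fast, then idle at the barrier
delayed = dynamic_energy(A, C, 0.8)   # same work, spread out at lower voltage

savings = 1 - delayed / full
print(f"{savings:.0%}")  # 36%
```

The quadratic dependence on voltage is the whole argument: a 20% voltage reduction on a non-critical thread yields a 36% dynamic-energy reduction while costing no wall-clock time, since the thread would otherwise just wait at the barrier.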
31
Thread Delaying

[Diagram: the same four threads with their frequencies stepped down in stages (1, 2, 3), so that all threads reach the barrier at the same time]
32
Thread Delaying

[Diagram: energy bars for threads A–D before and after thread delaying, showing the energy saved on the delayed threads]
33
Implementation of Thread Delaying

HISTORY-TABLE
− An entry for each possible frequency level
− Each entry holds a 2-bit up-down saturating counter

MP-COUNTER-TABLE
− Contains as many entries as there are cores in the processor
− Each entry is a 32-bit counter
− Kept consistent among all cores

Implementation
− Each core broadcasts its counter value on every 10th execution of the meeting point instruction
− The thread delaying algorithm is then invoked
− The history table is updated
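The 2-bit up-down saturating counter in each HISTORY-TABLE entry is a standard hardware idiom; its mechanics can be sketched as below. How the paper maps counter values to frequency decisions is not spelled out on the slide, so only the counter itself is shown:

```python
# Sketch of a 2-bit up-down saturating counter, the per-frequency-level state
# kept in each HISTORY-TABLE entry.
class SaturatingCounter2Bit:
    def __init__(self):
        self.value = 0              # saturates within [0, 3]

    def up(self):
        self.value = min(self.value + 1, 3)

    def down(self):
        self.value = max(self.value - 1, 0)

ctr = SaturatingCounter2Bit()
for _ in range(5):
    ctr.up()                        # saturates at 3, not 5
print(ctr.value)  # 3
ctr.down()
print(ctr.value)  # 2
```

Saturation gives the table hysteresis: a frequency level has to be suggested repeatedly before its counter maxes out, and one contrary observation does not erase that history.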
34
Thread Balancing

Speeding up a parallel application running more than one thread
− Two-way in-order SMT with an issue bandwidth of two instructions per cycle
− If both threads have ready instructions, allow one issue from each
− If only one thread has ready instructions, it can issue up to two instructions per cycle
− If the threads belong to the same parallel application, prioritize the critical thread

Thread balancing
− Identify the critical thread
− Give the critical thread more priority in the issue logic
35
Thread Balancing Logic

The Meeting Point IP (MPIP) feeds the imbalance hardware logic, which maintains IterationDelta and FastTID and drives the issue prioritization logic.

Imbalance hardware logic pseudocode:
    if (MPIP)
        if (IterationDelta == 0)
            FastTID = TID
        if (TID == FastTID)      // fast thread
            IterationDelta++
        else
            IterationDelta--

Issue prioritization logic pseudocode:
    if (IterationDelta > 0)      // imbalance: prioritize issue
        Prioritize()
    else
        ;                        // balanced: no issue priority

Targeted for 2-way SMT:
− Imbalance hardware logic: identifies the critical thread
− Issue prioritization logic:
  − If a thread is critical and has two ready instructions, it is allowed to issue both instructions, regardless of the number of ready instructions the non-critical thread has
  − Otherwise, the base issue policy is applied
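The imbalance-tracking pseudocode above can be sketched directly. This is a simplified software model of the hardware for the 2-way SMT case, with the meeting point events made up for illustration:

```python
# Sketch of the imbalance hardware logic: IterationDelta tracks how far the
# "fast" thread has pulled ahead in meeting point visits; a positive delta
# means the other thread is critical and gets issue priority.
class ImbalanceLogic:
    def __init__(self):
        self.iteration_delta = 0
        self.fast_tid = None

    def meeting_point(self, tid):
        """Called when thread `tid` executes the meeting point instruction."""
        if self.iteration_delta == 0:
            self.fast_tid = tid          # at a tie, the next arriver is "fast"
        if tid == self.fast_tid:
            self.iteration_delta += 1
        else:
            self.iteration_delta -= 1

    def prioritized_thread(self):
        """Thread to prioritize in issue, or None when balanced (2-way SMT,
        so the critical thread is simply the one that is not FastTID)."""
        if self.iteration_delta > 0:
            return 1 - self.fast_tid
        return None

logic = ImbalanceLogic()
logic.meeting_point(0)    # thread 0 pulls ahead: delta = 1, FastTID = 0
logic.meeting_point(0)    # delta = 2
logic.meeting_point(1)    # thread 1 narrows the gap: delta = 1
print(logic.prioritized_thread())  # 1
```

With delta still positive, thread 1 is the laggard, so the issue logic lets it issue both of its ready instructions each cycle until the delta returns to zero.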
36
Simulation Framework and Benchmarks

SoftSDV for Intel64/IA32 processors
− Simulates multithreaded primitives, including locks, synchronization operations, shared memory, and events

RMS (Recognition, Mining, and Synthesis) benchmarks
− Highly data-intensive and highly parallel (computer vision, data mining, etc.)
− Benchmarks are parallelized with pthreads or OpenMP
− 99% of total execution is parallel for all except FIMI (28% coverage)
37
Performance Results for Thread Delaying

Baseline is aggressive
− Every core runs at full speed and stops when it completes; once a core stops, it consumes zero power

Saves 4%–44% energy
− Energy savings come from the large frequency decreases on non-critical threads

Normalized to baseline:
                     |   Gauss           | PageRank (lz77) | PageRank (sparse) | Summarization
                     | 2p    4p    8p    | 2p    8p        | 2p    4p    8p    | 2p    4p    8p
Execution time       | 0.99  1.00  1.01  | 1.00  1.02      | 1.01  1.00  0.98  | 1.01  1.01  1.02
Energy consumption   | 0.96  0.94  0.93  | 0.94  0.90      | 0.90  0.78  0.56  | 0.94  0.88  0.74
38
Performance Results for Thread Balancing

Baseline is aggressive
− Every core runs at full speed and stops when it completes

Performance benefit ranges from 1% to 20%
− The benefit correlates with imbalance levels
39
Conclusions

Meeting point thread characterization dynamically estimates the criticality of the threads in a parallel execution.

Thread delaying combines per-core DVFS with meeting point thread characterization to reduce energy consumption on non-critical threads.

Thread balancing gives the critical thread higher priority in the issue queue of an SMT core.
40
Comparison of the Two Papers

                                 | Thread criticality predictor                  | Meeting points
Target                           | A range of parallelization schemes            | Parallel loops
                                 | beyond parallel loops                         |
Critical thread identification   | Cache behavior (L1 and L2 cache misses)       | Meeting point counts
Performance balancing method     | Task stealing from the critical thread        | Prioritizing the critical thread
Needs extra hardware support     | Yes                                           | Yes
Energy saving technique          | DVFS                                          | DVFS
Benchmarks                       | SPLASH-2 and PARSEC                           | RMS
Evaluation                       | GEMS simulator, ARM-based simulator,          | SoftSDV
                                 | FPGA-based emulator                           |
41
Critiques

Paper 1
− Does not explain how to calculate the values for the SST
− The accuracy of barrier-based DVFS depends on the pre-calculated SST values

Paper 2
− The total number of times each thread visits the meeting point must be roughly the same, which means meeting point thread characterization cannot handle variable loop iteration sizes
− It works well only for parallel loops, and fails for any large parallel region without a parallel loop
− It may not always be feasible for hardware to detect the parallel loop and insert the meeting point
42
Thank you!