1
Thread criticality for power efficiency in CMPs
Khairul Kabir, Nov. 3rd, 2009
ECE 692 Topic Presentation
2
Why Thread Criticality Prediction?

Critical thread
− The one with the longest completion time in the parallel region

[Diagram: threads T0–T3 executing instructions toward a barrier; D-cache and I-cache misses stall some threads, so the others wait]

Problems
− Performance degradation
− Energy inefficiency

Sources of variability
− Algorithm, process variation, thermal emergencies, etc.

Purpose
− Load balancing for performance improvement
− Energy optimization using DVFS
3
Related Work

Instruction criticality [Fields et al., Tune et al. 2001, etc.]
− Identify critical instructions

Thrifty barrier [Li et al. 2005]
− Faster cores transition into a low-power mode based on a prediction of barrier stall time

DVFS for energy efficiency at barriers [Liu et al. 2005]
− Faster cores track their waiting time and predict the DVFS setting for the next execution of the same parallel loop

Meeting points [Cai et al. 2008]
− DVFS non-critical threads by tracking loop-iteration completion rates across cores (parallel loops)
4
Thread Criticality Predictors for Dynamic Performance, Power, and Resource Management in Chip Multiprocessors

Abhishek Bhattacharjee, Margaret Martonosi
Dept. of Electrical Engineering, Princeton University
5
What is This Paper About?

Thread criticality predictor (TCP) design
− Methodology
− Identify architectural events impacting thread criticality
− Introduce basic TCP hardware

Thread criticality predictor uses
− Apply to Intel's Threading Building Blocks (TBB)
− Apply to energy efficiency in barrier-based programs
6
Thread Criticality Prediction Goals

Design goals
1. Accuracy
2. Low-overhead implementation
   • Simple HW (allow SW policies to be built on top)
3. One predictor, many uses

Design decisions
1. Find a suitable architectural metric
2. History-based local approach versus thread-comparative approach
3. This paper: TBB and DVFS; other uses: shared last-level cache management, SMT and memory priority, …
7
Methodology

Evaluations on a range of architectures, spanning the high-performance and embedded domains:
– GEMS simulator – to evaluate performance on architectures representative of the high-performance domain
– ARM simulator – to evaluate the performance benefits of TCP-guided task stealing in Intel's TBB
– FPGA-based emulator – to assess energy savings from TCP-guided DVFS

Infrastructure   | Domain                                      | System                       | Cores         | Caches
GEMS simulator   | High-performance, wide-issue, out-of-order  | 16-core CMP with Solaris 10  | 4-issue SPARC | 32KB L1, 4MB L2
ARM simulator    | Embedded, in-order                          | 4–32 core CMP                | 2-issue ARM   | 32KB L1, 4MB L2
FPGA emulator    | Embedded, in-order                          | 4-core CMP with Linux 2.6    | 1-issue SPARC | 4KB I-cache, 8KB D-cache
8
Architectural Metrics

History-based TCP
– Requires repetitive barrier behavior
– Information local to the core: no communication
– Problem for in-order pipelines: variant IPCs

Inter-core TCP metrics
– Instruction count
– Cache misses
– Control-flow changes
– Translation lookaside buffer (TLB) misses

[Chart: Ocean, normalized time per iteration (relative to iteration 0) at barrier 8, broken into stall time and compute time]
9
Thread-Comparative Metrics for TCP: Instruction Counts

[Chart: % error of the metric in tracking compute time, for SPLASH-2 (Cholesky, FFT, Radix, Water-Sp, Volrend, Water-Nsq, Barnes, Ocean, LU) and PARSEC (Swaptions, Fluidanimate, Blackscholes, Streamcluster); series: in-order instruction count]
10
Thread-Comparative Metrics for TCP: L1 D-Cache Misses

[Chart: % error of the metric in tracking compute time, for the same SPLASH-2 and PARSEC benchmarks; series: in-order instruction count, in-order L1 D-cache misses per instruction]
11
Thread-Comparative Metrics for TCP: L1 I & D Cache Misses

[Chart: % error of the metric in tracking compute time, for the same SPLASH-2 and PARSEC benchmarks; series: in-order instruction count, in-order L1 D-cache misses per instruction, in-order L1 I & D cache misses per instruction]
12
Thread-Comparative Metrics for TCP: All L1 and L2 Cache Misses

[Chart: % error of the metric in tracking compute time, for the same SPLASH-2 and PARSEC benchmarks; series: in-order instruction count, in-order L1 D-cache misses per instruction, in-order L1 I & D cache misses per instruction, in-order L1 & L2 cache misses per instruction]
13
Thread-Comparative Metrics for TCP: All L1 and L2 Cache Misses

[Chart: same as the previous slide, with an additional series: out-of-order L1 & L2 cache misses per instruction]
14
Basic TCP Hardware

TCP hardware components
− Per-core criticality counters
− Interval bound register
15
Basic TCP Hardware

[Diagram: four cores (each with an L1 I$ and L1 D$) share an L2 cache; the TCP hardware sits alongside the L2 controller]

Example walkthrough (criticality counters start at 0 0 0 0):
− Core 1 suffers an L1 D$ miss at Inst 5 → counters become 0 1 0 0
− Core 2 suffers an L1 I$ miss at Inst 20 → counters become 0 1 1 0
− Core 1 suffers an L2 miss at Inst 25 → counters become 0 11 1 0 (L2 misses are weighted more heavily)

Per-core criticality counters track poorly cached, slow threads.
Criticality counters are periodically refreshed using the interval bound register.
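The counter scheme above can be sketched in a few lines. The weights are illustrative assumptions (an L1 miss adds 1, an L2 miss adds 10), chosen to mirror the walkthrough in which an L2 miss bumps a counter from 1 to 11:

```python
# Sketch of per-core criticality counters. Miss weights are assumptions,
# not the paper's exact values: L1 miss = +1, L2 miss = +10.
L1_WEIGHT = 1
L2_WEIGHT = 10

class CriticalityCounters:
    def __init__(self, num_cores):
        self.counters = [0] * num_cores

    def record_miss(self, core, level):
        """Charge a cache miss on `core` to its criticality counter."""
        self.counters[core] += L1_WEIGHT if level == "L1" else L2_WEIGHT

    def most_critical(self):
        """The slowest, most poorly cached thread has the largest counter."""
        return max(range(len(self.counters)), key=lambda c: self.counters[c])

    def refresh(self):
        """Periodic refresh, driven by the interval bound register."""
        self.counters = [0] * len(self.counters)

# Replaying the slide's walkthrough:
tcp = CriticalityCounters(4)
tcp.record_miss(1, "L1")    # Core 1: L1 D$ miss -> 0 1 0 0
tcp.record_miss(2, "L1")    # Core 2: L1 I$ miss -> 0 1 1 0
tcp.record_miss(1, "L2")    # Core 1: L2 miss    -> 0 11 1 0
print(tcp.counters)         # [0, 11, 1, 0]
```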
16
TBB Task Stealing & Thread Criticality

The TBB dynamic scheduler distributes tasks
– Each thread maintains a software queue filled with tasks
– On an empty queue, the thread "steals" a task from another thread's queue

Approach 1: Default TBB uses random task stealing
– More failed steals at higher core counts → poor performance

Approach 2: Occupancy-based task stealing [Contreras, Martonosi 2008]
– Steal based on the number of items in the SW queue
– Must track and compare maximum occupancy counts
17
TCP-Guided TBB Task Stealing

[Diagram: four cores with software task queues SW Q0–SW Q3 and a shared L2 cache; the TCP control logic holds the per-core criticality counters and the interval bound register]

Example walkthrough:
− Cache misses on Core 3 raise its criticality counter above the others
− When Core 2's queue empties, it issues a steal request; the TCP logic scans for the maximum counter value and directs Core 2 to steal from Core 3, the critical thread

• TCP initiates steals from the critical thread
• Modest message overhead: one L2 access latency
• Scalable: 14-bit criticality counters → 114 bytes of storage @ 64 cores
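The victim-selection step can be sketched as follows. This is a simplification under stated assumptions: the real TCP logic scans hardware counters at the L2 controller, and the queue contents here are hypothetical:

```python
# Sketch of TCP-guided victim selection for TBB task stealing: when a core's
# queue runs empty, steal from the core with the maximum criticality counter,
# i.e. the most critical thread.
def pick_victim(criticality_counters, thief, queues):
    """Return the core to steal from, or None if no other core has work."""
    candidates = [c for c in range(len(queues))
                  if c != thief and queues[c]]   # non-empty queues only
    if not candidates:
        return None
    return max(candidates, key=lambda c: criticality_counters[c])

# Slide scenario: Core 2's queue is empty; Core 3 is most critical (counter 21).
counters = [14, 5, 2, 21]
queues = [["Task 0"], ["Task 1"], [], ["Task 7"]]
print(pick_victim(counters, thief=2, queues=queues))  # 3
```

Core 2 then steals Task 7 from Core 3's queue, helping the critical thread drain its backlog.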
18
TCP-Guided TBB Task Stealing

[Charts: TBB with random task stealing vs. TBB with TCP-guided task stealing]
19
TCP-Guided TBB Performance

[Chart: % performance improvement versus random task stealing for Blackscholes, Fluidanimate, Swaptions, and Streamcluster at 4, 8, 16, and 32 cores; series: occupancy-based approach, criticality-based approach]

Avg. improvement over random (32 cores) = 21.6%
Avg. improvement over occupancy (32 cores) = 13.8%
20
Adapting TCP for Energy Efficiency in Barrier-Based Programs

[Diagram: threads T0–T3 executing toward a barrier; T1 stalls on an L2 D$ miss and becomes critical → DVFS T0, T2, T3]

Approach: DVFS non-critical threads to eliminate barrier stall time

Challenges:
• Relative criticalities
• Misprediction costs
• DVFS overheads
21
Hardware and Algorithm for TCP-Guided DVFS

TCP hardware components
− Criticality counters
− SST – Switching Suggestion Table
− SCT – Suggestion Confidence Table
− Interval bound register

TCP-guided DVFS algorithm – two key steps:
1. Use the SST to translate criticality counter values into thread criticalities
   − Triggered when a criticality counter value is above a pre-defined threshold T and the core is running at the nominal frequency
   − The criticality counter value is matched against SST entries
   − A frequency switch is suggested if the matching SST entry differs from the current frequency
2. Feed the suggested target frequency from the SST to the SCT
   − The SCT assesses confidence in the SST's DVFS suggestion
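As a rough structural sketch, the two-step flow can be expressed in Python. The SST contents, the threshold T, and the confidence limit below are all illustrative assumptions (the slide specifies the structures, not their values), and the counter-to-frequency mapping is hypothetical; only the threshold check, table lookup, and confidence gating follow the slide:

```python
# Structural sketch of TCP-guided DVFS. SST entries, T, and CONF_MAX are
# hypothetical values, not the paper's.
SST = [(0, 1.0), (20, 0.8), (40, 0.6)]  # (counter threshold, suggested freq ratio)
T = 10                                   # pre-defined criticality threshold
CONF_MAX = 3                             # SCT confidence needed before switching

def sst_lookup(counter_value):
    """Step 1: translate a criticality counter value into a frequency suggestion."""
    suggestion = SST[0][1]
    for threshold, freq in SST:
        if counter_value >= threshold:
            suggestion = freq
    return suggestion

def dvfs_decision(counter_value, current_freq, sct_confidence):
    """Step 2: apply the SST suggestion only once the SCT is confident.
    Returns (new_freq, new_confidence)."""
    if counter_value <= T or current_freq != 1.0:
        return current_freq, 0            # only cores at nominal freq are considered
    suggestion = sst_lookup(counter_value)
    if suggestion == current_freq:
        return current_freq, 0            # nothing to change
    if sct_confidence + 1 >= CONF_MAX:
        return suggestion, 0              # confident: perform the switch
    return current_freq, sct_confidence + 1  # not yet confident: wait

freq, conf = 1.0, 0
for _ in range(3):                        # a repeated suggestion builds confidence
    freq, conf = dvfs_decision(45, freq, conf)
print(freq)  # 0.6
```

The confidence gating is what protects against the temporal noise discussed on the next slide: a one-off counter spike does not immediately trigger a frequency change.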
22
TCP-Guided DVFS – Effect of Criticality Counter Threshold

− Lowest bar – pre-calculated or correct state, averaged across all barrier instances
− Central bar – learning time taken until the correct DVFS state is first reached
− Upper bar – prediction noise, i.e., time spent in erroneous DVFS states after the correct one has first been reached

A low threshold T increases susceptibility to temporal noise: without good suggestion confidence, too many frequency changes occur and performance overhead results.
23
TCP for DVFS: Results

[Chart: normalized energy savings (relative to the original benchmark, 0–0.3) for Cholesky, Radix, FFT, Barnes, Water-Sp, Water-Nsq, Ocean, Blackscholes, Streamcluster, Volrend, LU, and the average]

Average 15% energy savings
Benchmarks with more load imbalance generally save more energy
24
Conclusions

Goal 1: Accuracy
– Accurate TCPs based on simple cache statistics

Goal 2: Low-overhead hardware
– Scalable per-core criticality counters
– TCP placed in a central location where cache information is already available

Goal 3: Versatility
– TBB improved by 13.8% over the best known approach @ 32 cores
– DVFS used to achieve 15% energy savings
– Two uses shown, many others possible…
25
Meeting Points: Using Thread Criticality to Adapt Multi-core Hardware to Parallel Regions

Qiong Cai, José González, Ryan Rakvic, Grigorios Magklis, Pedro Chaparro, Antonio González
26
Introduction

Proposed applications

Thread delaying for multi-core systems
− Saves energy by scaling down the frequency and voltage of the cores running non-critical threads

Thread balancing for simultaneous multi-threaded cores
− Improves overall performance by giving higher priority to the critical thread

Meeting point thread characterization
− Identifies the critical thread of a single multi-threaded application
− Identifies the amount of slack of the non-critical threads
27
Example: a parallelized loop from PageRank (lz77 method)

#pragma omp parallel for
for (int i = 0; i < nb; i++) {
    Msizeu partstart = partition()[i];
    Msizeu partend   = partition()[i+1];
    diag(i).multply_transposed(...);
    offdiag.multply(...);
    for (int i = partstart; i < partend; i++) {
        res[i] *= c;
        normres += std::abs(res[i]);
    }
}

[Chart: time in milliseconds per iteration (1–101) for CPU0 and CPU1]

Observations:
1. The code is already written to achieve workload balance, but imbalance still exists: CPU1 is slower than CPU0.
2. Reasons for the imbalance: (i) different cache misses, (ii) different control paths

How To Find Critical Threads Dynamically?
28
Identification of Critical Threads

Identification technique
− Each thread increments a thread-private counter when it passes the meeting point
− The most critical thread is the one with the smallest counter
− A thread's slack is estimated as the difference between its counter and that of the slowest thread

Insertion of meeting points
− A place in the parallel region that is visited by all threads
− Can be done by the hardware, the compiler, or the programmer

#pragma omp parallel for
for (int i = 0; i < nb; i++) {
    Msizeu partstart = partition()[i];
    Msizeu partend   = partition()[i+1];
    diag(i).multply_transposed(...);
    offdiag.multply(...);
    for (int i = partstart; i < partend; i++) {
        res[i] *= c;
        normres += std::abs(res[i]);
    }
    asm("mp_inst");   // meeting point
}
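The identification technique above amounts to comparing meeting point counts. A minimal sketch (the counter values are hypothetical, standing in for the kind of lag CPU1 shows in the PageRank example):

```python
# Sketch of meeting point thread characterization: each thread increments a
# private counter at the meeting point; the smallest count marks the critical
# thread, and slack is measured against it.
def critical_thread(mp_counters):
    """The most critical thread has passed the meeting point the fewest times."""
    return min(range(len(mp_counters)), key=lambda t: mp_counters[t])

def slack(mp_counters, thread):
    """How far ahead `thread` is of the slowest thread, in meeting point visits."""
    return mp_counters[thread] - min(mp_counters)

# Hypothetical counts partway through execution: thread 1 lags behind.
counts = [52, 41, 50, 49]
print(critical_thread(counts))  # 1
print(slack(counts, 0))         # 11
```

Thread delaying uses these slacks to pick lower frequency/voltage levels for the threads that are ahead.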
29
Thread Delaying

Make non-critical threads run at a lower frequency/voltage level
− All threads then arrive at the barrier at the same time

Alternatively, the CPUs of non-critical threads can be put into deep sleep on arrival
− Consumes almost zero energy while asleep
− But this is not the most energy-efficient way to deal with workload imbalance
30
Thread Delaying

[Diagram: frequency-vs-time plots for Threads 1–4 (code regions A–D), all running at full frequency until the barrier; the area under each curve represents the energy needed to execute that thread's instructions]

Proposal: Energy = Activity × Capacitance × Voltage²
− Reduce voltage when executing parallel threads
− Delay the threads arriving early at the barrier
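The proposal can be made concrete with a toy calculation. The numbers are illustrative: the slide gives only E = Activity × Capacitance × V², and a linear frequency–voltage relationship is assumed here, so a non-critical thread with enough slack to run at 0.8× frequency can also run at 0.8× voltage:

```python
# Toy calculation for thread delaying, using the slide's E = A * C * V^2.
# Assumptions: frequency scales linearly with voltage; activity and effective
# capacitance (A, C) are arbitrary placeholder values.
def dynamic_energy(activity, capacitance, voltage):
    return activity * capacitance * voltage ** 2

A, C = 1e9, 1e-9
full    = dynamic_energy(A, C, 1.0)   # run fast, then idle at the barrier
delayed = dynamic_energy(A, C, 0.8)   # same work, spread out at lower voltage

savings = 1 - delayed / full
print(f"{savings:.0%}")  # 36%
```

The quadratic dependence on voltage is the whole argument: a 20% voltage reduction on a non-critical thread yields a 36% dynamic-energy reduction while costing no wall-clock time, since the thread would otherwise just wait at the barrier.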
31
Thread Delaying

[Diagram: the same four threads with their frequencies stepped down in stages (1, 2, 3), so that all threads reach the barrier at the same time]
32
Thread Delaying

[Diagram: energy bars for threads A–D before and after thread delaying, showing the energy saved on the delayed threads]
33
Implementation of Thread Delaying

HISTORY-TABLE
− An entry for each possible frequency level
− Each entry holds a 2-bit up-down saturating counter

MP-COUNTER-TABLE
− Contains as many entries as there are cores in the processor
− Each entry is a 32-bit counter
− Kept consistent among all cores

Implementation
− Each core broadcasts its counter value on every 10th execution of the meeting point instruction
− The thread delaying algorithm is then invoked
− The history table is updated
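The 2-bit up-down saturating counter in each HISTORY-TABLE entry is a standard hardware idiom; its mechanics can be sketched as below. How the paper maps counter values to frequency decisions is not spelled out on the slide, so only the counter itself is shown:

```python
# Sketch of a 2-bit up-down saturating counter, the per-frequency-level state
# kept in each HISTORY-TABLE entry.
class SaturatingCounter2Bit:
    def __init__(self):
        self.value = 0              # saturates within [0, 3]

    def up(self):
        self.value = min(self.value + 1, 3)

    def down(self):
        self.value = max(self.value - 1, 0)

ctr = SaturatingCounter2Bit()
for _ in range(5):
    ctr.up()                        # saturates at 3, not 5
print(ctr.value)  # 3
ctr.down()
print(ctr.value)  # 2
```

Saturation gives the table hysteresis: a frequency level has to be suggested repeatedly before its counter maxes out, and one contrary observation does not erase that history.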
34
Thread Balancing

Speeding up a parallel application running more than one thread
− Two-way in-order SMT with an issue bandwidth of two instructions per cycle
− If both threads have ready instructions, allow one issue from each
− If only one thread has ready instructions, it can issue up to two instructions per cycle
− If the threads belong to the same parallel application, prioritize the critical thread

Thread balancing
− Identify the critical thread
− Give the critical thread more priority in the issue logic
35
Thread Balancing Logic

The Meeting Point IP (MPIP) feeds the imbalance hardware logic, which maintains IterationDelta and FastTID and drives the issue prioritization logic.

Imbalance hardware logic pseudocode:
    if (MPIP)
        if (IterationDelta == 0)
            FastTID = TID
        if (TID == FastTID)      // fast thread
            IterationDelta++
        else
            IterationDelta--

Issue prioritization logic pseudocode:
    if (IterationDelta > 0)      // imbalance: prioritize issue
        Prioritize()
    else
        ;                        // balanced: no issue priority

Targeted for 2-way SMT:
− Imbalance hardware logic: identifies the critical thread
− Issue prioritization logic:
  − If a thread is critical and has two ready instructions, it is allowed to issue both instructions, regardless of the number of ready instructions the non-critical thread has
  − Otherwise, the base issue policy is applied
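The imbalance-tracking pseudocode above can be sketched directly. This is a simplified software model of the hardware for the 2-way SMT case, with the meeting point events made up for illustration:

```python
# Sketch of the imbalance hardware logic: IterationDelta tracks how far the
# "fast" thread has pulled ahead in meeting point visits; a positive delta
# means the other thread is critical and gets issue priority.
class ImbalanceLogic:
    def __init__(self):
        self.iteration_delta = 0
        self.fast_tid = None

    def meeting_point(self, tid):
        """Called when thread `tid` executes the meeting point instruction."""
        if self.iteration_delta == 0:
            self.fast_tid = tid          # at a tie, the next arriver is "fast"
        if tid == self.fast_tid:
            self.iteration_delta += 1
        else:
            self.iteration_delta -= 1

    def prioritized_thread(self):
        """Thread to prioritize in issue, or None when balanced (2-way SMT,
        so the critical thread is simply the one that is not FastTID)."""
        if self.iteration_delta > 0:
            return 1 - self.fast_tid
        return None

logic = ImbalanceLogic()
logic.meeting_point(0)    # thread 0 pulls ahead: delta = 1, FastTID = 0
logic.meeting_point(0)    # delta = 2
logic.meeting_point(1)    # thread 1 narrows the gap: delta = 1
print(logic.prioritized_thread())  # 1
```

With delta still positive, thread 1 is the laggard, so the issue logic lets it issue both of its ready instructions each cycle until the delta returns to zero.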
36
Simulation Framework and Benchmarks

SoftSDV for Intel64/IA32 processors
− Simulates multithreaded primitives, including locks, synchronization operations, shared memory, and events

RMS (Recognition, Mining, and Synthesis) benchmarks
− Highly data-intensive and highly parallel (computer vision, data mining, etc.)
− Benchmarks are parallelized with pthreads or OpenMP
− 99% of total execution is parallel for all except FIMI (28% coverage)
37
Performance Results for Thread Delaying

Baseline is aggressive
− Every core runs at full speed and stops when it completes; once a core stops, it consumes zero power

Saves 4%–44% energy
− Energy savings come from the large frequency decreases on non-critical threads

Normalized to baseline:
                     |   Gauss           | PageRank (lz77) | PageRank (sparse) | Summarization
                     | 2p    4p    8p    | 2p    8p        | 2p    4p    8p    | 2p    4p    8p
Execution time       | 0.99  1.00  1.01  | 1.00  1.02      | 1.01  1.00  0.98  | 1.01  1.01  1.02
Energy consumption   | 0.96  0.94  0.93  | 0.94  0.90      | 0.90  0.78  0.56  | 0.94  0.88  0.74
38
Performance Results for Thread Balancing

Baseline is aggressive
− Every core runs at full speed and stops when it completes

Performance benefit ranges from 1% to 20%
− The benefit correlates with imbalance levels
39
Conclusions

Meeting point thread characterization dynamically estimates the criticality of the threads in a parallel execution.

Thread delaying combines per-core DVFS with meeting point thread characterization to reduce energy consumption on non-critical threads.

Thread balancing gives the critical thread higher priority in the issue queue of an SMT core.
40
Comparison of the Two Papers

                                 | Thread criticality predictor                  | Meeting points
Target                           | A range of parallelization schemes            | Parallel loops
                                 | beyond parallel loops                         |
Critical thread identification   | Cache behavior (L1 and L2 cache misses)       | Meeting point counts
Performance balancing method     | Task stealing from the critical thread        | Prioritizing the critical thread
Needs extra hardware support     | Yes                                           | Yes
Energy saving technique          | DVFS                                          | DVFS
Benchmarks                       | SPLASH-2 and PARSEC                           | RMS
Evaluation                       | GEMS simulator, ARM-based simulator,          | SoftSDV
                                 | FPGA-based emulator                           |
41
Critiques

Paper 1
− Does not explain how to calculate the values for the SST
− The accuracy of barrier-based DVFS depends on the pre-calculated SST values

Paper 2
− The total number of times each thread visits the meeting point must be roughly the same, which means meeting point thread characterization cannot handle variable loop iteration sizes
− It works well only for parallel loops, and fails for any large parallel region without a parallel loop
− It may not always be feasible for hardware to detect the parallel loop and insert the meeting point
42
Thank you!