
PiPA: Pipelined Profiling and Analysis on Multi-core Systems

Qin Zhao, Ioana Cutcutache, Weng-Fai Wong


Why PiPA?

Code profiling and analysis
– very useful for understanding program behavior
– implemented using dynamic instrumentation systems
– several challenges: coverage, accuracy, overhead
  • overhead due to the instrumentation engine
  • overhead due to the profiling code

The performance problem!
– Cachegrind: ~100x slowdown
– Pin dcache: ~32x slowdown

Need faster tools!


Our Goals

Improve the performance
– reduce the overall profiling and analysis overhead
– but maintain the accuracy

How?
– parallelize!
– optimize

Keep it simple
– easy to understand
– easy to build new analysis tools


Previous Approach

Parallelized slice profiling
– SuperPin, Shadow Profiling

Suitable for simple, independent tasks

[Figure: execution timelines for the original application, the uninstrumented application, the instrumented application, and the SuperPinned application running instrumented slices; shading distinguishes instrumentation overhead from profiling overhead.]


PiPA Key Idea

Pipelining!

– Stage 0: the instrumented application
– Stage 1: profile processing
– Stage 2: parallel analysis (analysis on profile 1, profile 2, profile 3, profile 4, ...)

[Figure: timeline with threads or processes on one axis and time on the other; profile information flows from the instrumented application (stage 0) through profile processing (stage 1) to several parallel analysis tasks (stage 2); shading distinguishes the original application from the instrumentation and profiling overhead.]


PiPA Challenges

Minimize the profiling overhead
– Runtime Execution Profile (REP)

Minimize the communication between stages
– double buffering

Design efficient parallel analysis algorithms
– we focus on cache simulation


PiPA Prototype

Cache Simulation


Our Prototype

Implemented in DynamoRIO

Three stages
– Stage 0: instrumented application – collect REP
– Stage 1: parallel profile recovery and splitting
– Stage 2: parallel cache simulation

Experiments
– SPEC2000 & SPEC2006 benchmarks
– 3 systems: dual-core, quad-core, eight-core


Communication

Keys to minimizing the overhead
– double buffering
– shared buffers
– large buffers

Example – communication between stage 0 and stage 1

[Figure: the profiling thread at stage 0 fills shared buffers that are consumed by the processing threads at stage 1.]
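The slides do not show the buffer hand-off code. As a rough illustration of the double-buffering idea only (a sketch, not the authors' DynamoRIO implementation; all names are hypothetical), the profiling thread can publish a filled buffer and move on to a free one while a stage-1 thread drains the other:

    /* Sketch of a double-buffering hand-off between stage 0 and stage 1. */
    #include <pthread.h>
    #include <stddef.h>

    #define NUM_BUFFERS 2          /* double buffering; larger pools also work */
    #define BUFFER_SIZE (16 << 20) /* 16MB, the large buffer size used in the experiments */

    typedef struct {
        char   data[BUFFER_SIZE];
        size_t used;
    } profile_buffer_t;

    static profile_buffer_t buffers[NUM_BUFFERS];
    static int full[NUM_BUFFERS];                 /* 1 = ready for stage 1, 0 = free for stage 0 */
    static pthread_mutex_t lock    = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t  changed = PTHREAD_COND_INITIALIZER;

    /* Stage 0: called when the current buffer is full; returns the next free buffer. */
    profile_buffer_t *swap_full_buffer(int *idx) {
        pthread_mutex_lock(&lock);
        full[*idx] = 1;                           /* publish the filled buffer */
        pthread_cond_broadcast(&changed);
        *idx = (*idx + 1) % NUM_BUFFERS;
        while (full[*idx])                        /* wait until stage 1 has drained it */
            pthread_cond_wait(&changed, &lock);
        buffers[*idx].used = 0;
        pthread_mutex_unlock(&lock);
        return &buffers[*idx];
    }

    /* Stage 1: waits for a filled buffer, processes it, then hands it back. */
    void drain_one_buffer(int idx, void (*process)(profile_buffer_t *)) {
        pthread_mutex_lock(&lock);
        while (!full[idx])
            pthread_cond_wait(&changed, &lock);
        pthread_mutex_unlock(&lock);
        process(&buffers[idx]);                   /* profile recovery / splitting */
        pthread_mutex_lock(&lock);
        full[idx] = 0;                            /* buffer is free for stage 0 again */
        pthread_cond_broadcast(&changed);
        pthread_mutex_unlock(&lock);
    }

Because the buffers are shared, the only per-buffer cost is the flag update and wake-up; the profile data itself is never copied between the stages.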


Stage 0: Profiling

compact profile

minimal overhead


Stage 0: Profiling

Runtime Execution Profile (REP)
– fast profiling
– small profile size
– easy information extraction

Hierarchical structure
– profile buffers
  • data units
    – slots

Can be customized for different analyses
– in our prototype we consider cache simulation
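The REP layout shown in the example on the next slide can be pictured roughly as the following C structures. This is a sketch reconstructed from the fields visible in the figure, not the actual definitions from the paper:

    /* Sketch of the REP layout, based on the fields shown in the REP example. */
    typedef struct {
        unsigned int pc;      /* instruction address of the memory reference */
        int type;             /* read or write */
        int size;             /* access size in bytes (4 in the example) */
        int offset;           /* static displacement added to the recorded register value */
        int value_slot;       /* which recorded register value holds the base address */
        int size_slot;        /* slot holding a dynamic size, or -1 if the size is static */
    } rep_ref_t;

    typedef struct {
        unsigned int tag;     /* basic block tag, e.g. 0x080483d7 */
        int num_slots;        /* register-value slots recorded at run time */
        int num_refs;         /* memory references described by refs[] */
        rep_ref_t refs[];     /* ref0, ref1, ... */
    } rep_unit_static_t;      /* REPS: built once at instrumentation time */

    typedef struct {
        rep_unit_static_t *unit;  /* link back to the static description */
        unsigned int slots[];     /* recorded register values, e.g. eax, esp */
    } rep_unit_dynamic_t;         /* REPD: appended to the profile buffer at run time */

Splitting the static description (REPS) from the small dynamic record (REPD) is what keeps the profile compact: only the slot values change per execution of a basic block.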


REP Example

[Figure: REP layout for two basic blocks, bb1 and bb2.]

bb1: mov [eax + 0x0c] eax; mov ebp esp; pop ebp; return
bb2: pop ebx; pop ecx; cmp eax, 0; jz label_bb3

Static part (REPS) – one REP unit per basic block; for bb1:
– tag: 0x080483d7, num_slots: 2, num_refs: 3
– ref0: pc 0x080483d7, type read, size 4, offset 12, value_slot 1, size_slot -1
– ref1: pc 0x080483dc, type read, size 4, offset 0, value_slot 2, size_slot -1
– ref2: pc 0x080483dd, type read, size 4, offset 4, value_slot 2, size_slot -1

Dynamic part (REPD) – written into the profile buffer at run time via the profile base pointer: the recorded register values for the slots (eax, esp; 12 bytes per record in this example), links to the first buffer and the next buffer, and a canary zone.


Profiling Optimization

Store register values in the REP
– avoid computing the memory address

Register liveness analysis
– avoid register stealing if possible

Record a single register value for multiple references
– a single stack pointer value for a sequence of push/pop
– the base address for multiple accesses to the same structure

More in the paper
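The generated instrumentation itself is not shown in the deck. Conceptually, with these optimizations the inlined profiling code for bb1 in the running example only has to record two register values (a C pseudocode sketch using the hypothetical structures from the earlier sketch; bb1_reps is a hypothetical name for bb1's static REP unit, and the slot numbering from 1 follows the figure):

    /* Conceptual effect of the inlined profiling code for bb1 (a sketch, not the
       actual x86 code emitted through DynamoRIO). */
    extern rep_unit_static_t bb1_reps;    /* hypothetical: static REP unit for bb1 */

    void profile_bb1(rep_unit_dynamic_t *repd, unsigned int eax, unsigned int esp) {
        repd->unit     = &bb1_reps;
        repd->slots[0] = eax;   /* value_slot 1: base of the [eax + 0x0c] load */
        repd->slots[1] = esp;   /* value_slot 2: shared by the pop and the return */
        /* no effective address is computed here; stage 1 adds the static offsets later */
    }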


REP Example (revisited)

[Figure: the same REP layout as on the earlier REP Example slide, shown again to illustrate the optimization: the two stack references (pc 0x080483dc and 0x080483dd) share the single recorded esp value in value_slot 2, with static offsets 0 and 4.]


Profiling Overhead

[Charts: slowdown relative to native execution for SPECint2000, SPECfp2000, and overall SPEC2000 on the 2-core, 4-core, and 8-core systems, comparing optimized instrumentation against instrumentation without optimization.]

Average slowdown: ~3x


Stage 1: Profile Recovery

fast recovery


Stage 1: Profile Recovery

Need to reconstruct the full memory reference information
– <pc, address, type, size>

[Figure: the REP units in the profile buffer (e.g. the bb1 unit with tag 0x080483d7 and recorded register values 0x2304 and 0x141a, and the bb2 unit with 0x1423) are expanded back into a full reference trace.]

Recovered trace (excerpt):

PC           Address   Type   Size
...          ...       ...    ...
0x080483d7   0x2310    read   4
0x080483dc   0x141a    read   4
...          ...       ...    ...
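A recovery thread walks the buffer and expands each dynamic record into full <pc, address, type, size> tuples by adding the static offsets to the recorded register values. A minimal sketch, again using the hypothetical structures from the earlier sketch and assuming slots are numbered from 1 as in the example:

    /* Sketch of stage 1 recovery: expand one REPD record into full memory references. */
    typedef struct {
        unsigned int pc;
        unsigned int address;
        int type;
        int size;
    } mem_ref_t;

    int recover_unit(const rep_unit_dynamic_t *repd, mem_ref_t *out) {
        const rep_unit_static_t *reps = repd->unit;
        for (int i = 0; i < reps->num_refs; i++) {
            const rep_ref_t *ref = &reps->refs[i];
            out[i].pc = ref->pc;
            /* recorded base register value + static displacement,
               e.g. eax = 0x2304 plus offset 12 gives 0x2310 for ref0 */
            out[i].address = repd->slots[ref->value_slot - 1] + ref->offset;
            out[i].type = ref->type;
            out[i].size = (ref->size_slot >= 0)
                              ? (int)repd->slots[ref->size_slot - 1]
                              : ref->size;
        }
        return reps->num_refs;
    }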


Profile Recovery Overhead

Factor 1: buffer size

Experiments done on the 8-core system, using 8 recovery threads

[Chart: slowdown relative to native execution for SPECint2000, SPECfp2000, and SPEC2000 with small (64KB), medium (1MB), and large (16MB) buffers.]


Profile Recovery Overhead

Factor 2: the number of recovery threads

Experiments done on the 8-core system, using 16MB buffers

[Chart: slowdown relative to native execution for SPECint2000, SPECfp2000, and SPEC2000 with 0, 2, 4, 6, and 8 recovery threads.]


Profile Recovery Overhead

Factor 3: the number of available cores

Experiments done using 16MB buffers and 8 recovery threads

[Chart: slowdown relative to profiling alone for SPECint2000, SPECfp2000, and SPEC2000 on 2, 4, and 8 cores.]


Profile Recovery Overhead

Factor 4: the impact of using REP
– experiments done on the 8-core system with 16MB buffers and 8 threads

[Chart: per-benchmark slowdown relative to native execution across SPECfp2000 and SPECint2000, comparing PiPA using REP against PiPA using the standard <pc, address, type, size> profile format.]

PiPA-standard (standard profile format): 20.7x
PiPA-REP: 4.5x


Stage 2: Cache Simulation

parallel analysis

independent simulators



Stage 2: Parallel Cache Simulation

How to parallelize?
– split the address trace into independent groups
– two memory references that access different sets are independent

Set-associative caches
– partition the cache sets and simulate them using several independent simulators
– merge the results (number of hits and misses) at the end of the simulation

Example:
– 32K cache, 32-byte line, 4-way associative => 256 sets
– 4 independent simulators, each one simulates 64 sets (round-robin distribution)

[Figure: a trace of reads and writes (addresses 0xbf9c4614, 0xbf9c4705, 0xbf9c4a34, 0xbf9c4a60, 0xbf9c4a5c, 0xbf9c460d, ...) is split by set index into four independent groups; e.g. 0xbf9c4614, 0xbf9c4705, and 0xbf9c460d fall into the same group, while the remaining addresses are spread over the other groups.]
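For the example cache (32KB, 32-byte lines, 4-way associative, 256 sets), the set index of an address decides which of the 4 simulators handles it. A rough sketch of the split, assuming the round-robin distribution of sets mentioned on the slide:

    /* Sketch of splitting the reference stream among independent cache simulators. */
    #define LINE_SIZE      32    /* bytes per cache line */
    #define NUM_SETS       256   /* 32KB / 32-byte lines / 4 ways */
    #define NUM_SIMULATORS 4

    static unsigned int set_index(unsigned int addr) {
        return (addr / LINE_SIZE) % NUM_SETS;    /* cache set this address maps to */
    }

    /* Round-robin: simulator s owns sets s, s + 4, s + 8, ... */
    static int simulator_for(unsigned int addr) {
        return (int)(set_index(addr) % NUM_SIMULATORS);
    }

    /* Each simulator keeps hit/miss counters only for the sets it owns; the
       per-simulator counts are simply summed once the simulation finishes. */

Under this scheme, 0xbf9c4614 and 0xbf9c4705 map to the same simulator and therefore stay in the same group, while references that hit sets owned by other simulators can be processed concurrently.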


Cache Simulation Overhead

Experiments done on the 8-core system
– 8 recovery threads and 8 cache simulators

[Chart: per-benchmark slowdown relative to native execution across SPECfp2000 and SPECint2000 for PiPA and Pin dcache.]

PiPA: 10.5x average slowdown
Pin dcache: 32x average slowdown
PiPA speedup over dcache: 3x


SPEC 2006 Results

Experiments done using the 8-core system

[Chart: per-benchmark slowdown relative to native execution across SPECfp2006 and SPECint2006 for profiling only, profiling + recovery, and full cache simulation.]

Profiling: 3.27x
Profiling + recovery: 3.7x
Full cache simulation: 10.2x
Average speedup over dcache: 3x


Summary

PiPA is an effective technique for parallel profiling and analysis
– based on pipelining
– drastically reduces both
  • profiling time
  • analysis time
– full cache simulation incurs only a 10.5x slowdown

Runtime Execution Profile
– requires minimal instrumentation code
– compact enough to ensure efficient buffer usage
– makes it easy for later stages to recover the full trace

Parallel cache simulation
– the cache sets are partitioned among several independent simulators


Future Work

Design APIs
– hide the communication between the pipeline stages
– let tool writers focus only on the instrumentation and analysis tasks

Further improve the efficiency
– parallel profiling
– workload monitoring

More analysis algorithms
– branch prediction simulation
– memory dependence analysis
– ...


Pin Prototype

Second implementation, in Pin

Preliminary results
– 2.6x speedup over Pin dcache

Plan to release PiPA: www.comp.nus.edu.sg/~ioana