Pin Building Customized Program Analysis Tools with Dynamic Instrumentation

PLDI’05 1


CK Luk, Robert Cohn, Robert Muth, Harish Patil, Artur Klauser, Geoff Lowney, Steven Wallace, Kim Hazelwood

Intel

Vijay Janapa Reddi

University of Colorado

http://rogue.colorado.edu/Pin

Instrumentation

• Insert extra code into programs to collect information about execution
  – Program analysis:
    • Code coverage, call-graph generation, memory-leak detection
  – Architectural study:
    • Processor simulation, fault injection

• Existing binary-level instrumentation systems:
  – Static:
    • ATOM, EEL, Etch, Morph
  – Dynamic:
    • Dyninst, Vulcan, DTrace, Valgrind, Strata, DynamoRIO

Pin is a new dynamic binary instrumentation system

Advantages of Pin Instrumentation

1. Easy-to-use Instrumentation API
  – Instrumentation code written in C/C++/asm
  – ATOM-like API, based on procedure calls

2. Instrumentation tools portable across platforms
  – Same tools work on IA32, EM64T (x86-64), Itanium, ARM
  – Same tools work on Linux and Windows (ongoing work)

3. Low instrumentation overhead
  – Pin automatically optimizes instrumentation code
  – Pin can attach instrumentation to a running process

4. Robust
  – Handles mixed code and data, variable-length instructions, dynamically-generated code

5. Transparent
  – Application sees original addresses, values, and stack content

A Pintool for Tracing Memory Writes

    #include <stdio.h>
    #include "pin.H"

    FILE* trace;

    // Analysis routine: executed immediately before a write is executed
    VOID RecordMemWrite(VOID* ip, VOID* addr, UINT32 size) {
        fprintf(trace, "%p: W %p %d\n", ip, addr, size);
    }

    // Instrumentation routine: executed when an instruction is dynamically compiled
    VOID Instruction(INS ins, VOID* v) {
        if (INS_IsMemoryWrite(ins))
            INS_InsertCall(ins, IPOINT_BEFORE, AFUNPTR(RecordMemWrite),
                           IARG_INST_PTR, IARG_MEMORYWRITE_EA, IARG_MEMORYWRITE_SIZE,
                           IARG_END);
    }

    int main(int argc, char* argv[]) {
        PIN_Init(argc, argv);
        trace = fopen("atrace.out", "w");
        INS_AddInstrumentFunction(Instruction, 0);
        PIN_StartProgram();  // never returns
        return 0;
    }

• Same source code works on the 4 architectures

=> Pin takes care of different addressing modes

• No need to manually save/restore application state

=> Pin does it for you automatically and efficiently

Dynamic Instrumentation

[Figure, shown in three steps: the original code (basic blocks 1 to 7) on one side, Pin and its code cache on the other.]

1. Pin fetches the trace starting at block 1 and starts instrumentation. The compiled trace (1’, 2’, 7’) is placed in the code cache; its exits point back to Pin.
2. Pin transfers control into the code cache (block 1).
3. Pin fetches and instruments a new trace (3’, 5’, 6’) and connects it to the code already in the cache (trace linking).
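The fetch, execute, and link cycle can be illustrated with a toy dispatcher. This is a minimal sketch under loud assumptions, not Pin's implementation: every trace is a single block, `ToyVm` and its members are hypothetical names, and "compiling" just records the block id. It counts how often control falls back to the VM versus staying inside the code cache through a linked exit.

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <utility>
#include <vector>

// Toy model only: one block per trace, no real code generation.
struct ToyVm {
    std::set<int> codeCache;                 // blocks already compiled
    std::set<std::pair<int, int>> links;     // exits patched to jump trace-to-trace
    int compiles = 0;                        // traces compiled and instrumented
    int vmEntries = 0;                       // times control returned to the VM

    // Execute a path given as a sequence of original block ids.
    void run(const std::vector<int>& path) {
        for (std::size_t i = 0; i < path.size(); ++i) {
            const int cur = path[i];
            const bool linked =
                i > 0 && links.count(std::make_pair(path[i - 1], cur)) > 0;
            if (linked) continue;            // direct jump inside the code cache
            ++vmEntries;                     // exit stub pointed back to the VM
            if (codeCache.insert(cur).second) ++compiles;   // first visit: compile
            if (i > 0) links.insert(std::make_pair(path[i - 1], cur));  // patch exit
        }
        assert(compiles <= vmEntries);       // compilation only happens inside the VM
    }
};
```

Running the path 1, 2, 7 twice in a row compiles three traces and enters the VM four times; a later run of 1, 2, 7 enters the VM only once, because every exit on that path is already linked.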

Pin’s Software Architecture

[Figure: address-space diagram. Labels: Application, Pin, Pintool, Instrumentation APIs, Virtual Machine (VM), JIT Compiler, Emulation Unit, Code Cache, Operating System, Hardware.]

• 3 programs (Pin, Pintool, App) in same address space:
  – User-level only
• Instrumentation APIs:
  – Through which the Pintool communicates with Pin
• JIT compiler:
  – Dynamically compiles and instruments
• Emulation unit:
  – Handles instructions that can’t be directly executed (e.g., syscalls)
• Code cache:
  – Stores compiled code

=> Coordinated by the VM


Pin Internal Details

• Loading of Pin, Pintool, & Application

• An Improved Trace Linking Technique

• Register Re-allocation

• Instrumentation Optimizations

• Multithreading Support

Register Re-allocation

• Instrumented code needs extra registers. E.g.:
  – Virtual registers available to the tool
  – A virtual stack pointer pointing to the instrumentation stack
  – Many more …

• Approaches to get extra registers:
  1. Ad-hoc (e.g., DynamoRIO, Strata, DynInst)
     – Whenever you need a register, spill one and fill it afterward
  2. Re-allocate all registers during compilation
     a. Local allocation (e.g., Valgrind)
        – Allocate registers independently within each trace
     b. Global allocation (Pin)
        – Allocate registers across traces (can be inter-procedural)

Valgrind’s Register Re-allocation

Original Code:
      mov 1, %eax
      mov 2, %ebx
      cmp %ecx, %edx
      jz t
  t:  add 1, %eax
      sub 2, %ebx

After re-allocation:

Trace 1:
      mov 1, %eax
      mov 2, %esi
      cmp %ecx, %edx
      mov %eax, SPILLeax
      mov %esi, SPILLebx
      jz t’

  Virtual   Physical
  %eax      %eax
  %ebx      %esi
  %ecx      %ecx
  %edx      %edx

Trace 2:
  t’: mov SPILLeax, %eax
      mov SPILLebx, %edi
      add 1, %eax
      sub 2, %edi

  Virtual   Physical
  %eax      %eax
  %ebx      %edi
  %ecx      %ecx
  %edx      %edx

Simple but inefficient

• All modified registers are spilled at a trace’s end
• Refill registers at a trace’s beginning

Pin’s Register Re-allocation

Scenario (1): Compiling a new trace at a trace exit

Original Code:
      mov 1, %eax
      mov 2, %ebx
      cmp %ecx, %edx
      jz t
  t:  add 1, %eax
      sub 2, %ebx

After re-allocation:

Trace 1:
      mov 1, %eax
      mov 2, %esi
      cmp %ecx, %edx
      jz t’

  Virtual   Physical
  %eax      %eax
  %ebx      %esi
  %ecx      %ecx
  %edx      %edx

Compile Trace 2 using the binding at Trace 1’s exit:

Trace 2:
  t’: add 1, %eax
      sub 2, %esi

No spilling/filling needed across traces

Pin’s Register Re-allocation

Scenario (2): Targeting an already generated trace at a trace exit

Original Code:
      mov 1, %eax
      mov 2, %ebx
      cmp %ecx, %edx
      jz t
  t:  add 1, %eax
      sub 2, %ebx

After re-allocation:

Trace 1 (being compiled):
      mov 1, %eax
      mov 2, %esi
      cmp %ecx, %edx
      mov %esi, SPILLebx
      mov SPILLebx, %edi
      jz t’

  Virtual   Physical
  %eax      %eax
  %ebx      %esi
  %ecx      %ecx
  %edx      %edx

Trace 2 (in code cache):
  t’: add 1, %eax
      sub 2, %edi

  Virtual   Physical
  %eax      %eax
  %ebx      %edi
  %ecx      %ecx
  %edx      %edx

Minimal spilling/filling code
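The reconciliation in scenario (2) can be sketched as a comparison of two register bindings. The code below is an illustrative sketch, not Pin's allocator: `Binding` and `reconcile` are hypothetical names, register names are plain strings, and every mismatch goes through a spill slot as on the slide. All spills are emitted before all fills so that a fill never overwrites a physical register that still has to be saved.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Virtual register name -> physical register name, e.g. "ebx" -> "esi".
using Binding = std::map<std::string, std::string>;

// Emit the spill/fill code needed at a trace exit so that the registers match
// the binding expected at the entry of an already compiled target trace.
std::vector<std::string> reconcile(const Binding& exitBinding,
                                   const Binding& entryBinding) {
    std::vector<std::string> spills;
    std::vector<std::string> fills;
    for (const auto& kv : exitBinding) {
        const std::string& virt = kv.first;
        const std::string& physOut = kv.second;
        const auto it = entryBinding.find(virt);
        assert(it != entryBinding.end());   // assumption: both bindings cover the same registers
        const std::string& physIn = it->second;
        if (physIn == physOut) continue;    // same location: nothing to emit
        spills.push_back("mov %" + physOut + ", SPILL" + virt);
        fills.push_back("mov SPILL" + virt + ", %" + physIn);
    }
    spills.insert(spills.end(), fills.begin(), fills.end());
    return spills;
}
```

With the two bindings from the slide (virtual %ebx in %esi at the exit, in %edi at the target's entry), only the %ebx pair is emitted. When the bindings are identical, as in scenario (1), the result is empty.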


Instrumentation Optimizations

1. Inline instrumentation code into the application

2. Avoid saving/restoring eflags with liveness analysis

3. Schedule inlined instrumentation code

Example: Instruction Counting

Instrument without applying any optimization:

    BBL_InsertCall(bbl, IPOINT_BEFORE, AFUNPTR(docount), IARG_UINT32, BBL_NumIns(bbl), IARG_END)

Original code:
    cmov %esi, %edi
    cmp %edi, (%esp)
    jle <target1>

    add %ecx, %edx
    cmp %edx, 0
    je <target2>

Trace:
    mov %esp, SPILLappsp
    mov SPILLpinsp, %esp
    call <bridge>
    cmov %esi, %edi
    mov SPILLappsp, %esp
    cmp %edi, (%esp)
    jle <target1’>

    mov %esp, SPILLappsp
    mov SPILLpinsp, %esp
    call <bridge>
    add %ecx, %edx
    cmp %edx, 0
    je <target2’>

bridge():
    pushf
    push %edx
    push %ecx
    push %eax
    movl 0x3, %eax
    call docount
    pop %eax
    pop %ecx
    pop %edx
    popf
    ret

docount():
    add %eax, icount
    ret

33 extra instructions executed altogether

Inlining

Original code:
    cmov %esi, %edi
    cmp %edi, (%esp)
    jle <target1>

    add %ecx, %edx
    cmp %edx, 0
    je <target2>

Trace:
    mov %esp, SPILLappsp
    mov SPILLpinsp, %esp
    pushf
    add 0x3, icount
    popf
    cmov %esi, %edi
    mov SPILLappsp, %esp
    cmp %edi, (%esp)
    jle <target1’>

    mov %esp, SPILLappsp
    mov SPILLpinsp, %esp
    pushf
    add 0x3, icount
    popf
    add %ecx, %edx
    cmp %edx, 0
    je <target2’>

11 extra instructions executed

Inlining + eflags liveness analysis

Trace:
    mov %esp, SPILLappsp
    mov SPILLpinsp, %esp
    pushf
    add 0x3, icount
    popf
    cmov %esi, %edi
    mov SPILLappsp, %esp
    cmp %edi, (%esp)
    jle <target1’>

    add 0x3, icount
    add %ecx, %edx
    cmp %edx, 0
    je <target2’>

Inlining + eflags liveness analysis + scheduling

Trace:
    cmov %esi, %edi
    add 0x3, icount
    cmp %edi, (%esp)
    jle <target1’>

    add 0x3, icount
    add %ecx, %edx
    cmp %edx, 0
    je <target2’>
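The eflags decision in this example can be sketched as a forward scan over the instructions that follow the insertion point. This is a simplified illustration, not Pin's analysis: `Ins` and `flagsLiveBefore` are hypothetical names, and each instruction is modeled as reading and/or overwriting the flags as a whole, whereas real x86 instructions touch individual flag bits.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Simplified instruction model: does it read the flags, does it overwrite them?
struct Ins {
    bool readsFlags;
    bool writesFlags;
};

// Returns true if eflags may be live immediately before trace[pos], i.e. some
// instruction reads the flags before another one overwrites them.
bool flagsLiveBefore(const std::vector<Ins>& trace, std::size_t pos) {
    assert(pos <= trace.size());
    for (std::size_t i = pos; i < trace.size(); ++i) {
        if (trace[i].readsFlags) return true;    // value is consumed: must preserve it
        if (trace[i].writesFlags) return false;  // value is overwritten first: dead
    }
    return true;  // end of trace reached: be conservative
}
```

For the first block (cmov, cmp, jle) the flags are live before the cmov, which is why the unscheduled version keeps pushf/popf, but dead right after it, which is where the scheduled version places the counter update. For the second block (add, cmp, je) the flags are already dead at the start.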

Pin Instrumentation Performance

Runtime overhead of basic-block counting with Pin on IA32
(SPEC2K using reference data sets)

[Bar chart: average slowdown for SPECINT and SPECFP]

  Configuration                                       SPECINT   SPECFP
  Without optimization                                10.4      3.9
  Inlining                                            7.8       3.5
  Inlining + eflags liveness analysis                 2.8       1.5
  Inlining + eflags liveness analysis + scheduling    2.5       1.4

Comparison among Dynamic Instrumentation Tools

Runtime overhead of basic-block counting with three different tools

[Bar chart: average slowdown on SPECINT]

  Tool        Average slowdown
  Valgrind    8.3
  DynamoRIO   5.1
  Pin         2.5

• Valgrind is a popular instrumentation tool on Linux
  – Call-based instrumentation, no inlining
• DynamoRIO is the performance leader in binary dynamic optimization
  – Manually inline, no eflags liveness analysis and scheduling

Pin automatically provides efficient instrumentation

Pin Applications

• Sample tools in the Pin distribution:
  – Cache simulators, branch predictors, address tracer, syscall tracer, edge profiler, stride profiler
• Some tools developed and used inside Intel:
  – Opcodemix (analyze code generated by compilers)
  – PinPoints (find representative regions in programs to simulate)
  – A tool for detecting memory bugs
• Some companies are writing their own Pintools:
  – A major database vendor, a major search engine provider
• Some universities using Pin in teaching and research:
  – U. of Colorado, MIT, Harvard, Princeton, U of Minnesota, Northeastern, Tufts, University of Rochester, …

Conclusions

• Pin
  – A dynamic instrumentation system for building your own program analysis tools
  – Easy to use, robust, transparent, efficient
  – Tool source compatible on IA32, EM64T, Itanium, ARM
  – Works on large applications
    • database, search engine, web browsers, …
  – Available on Linux; Windows version coming soon

• Downloadable from http://rogue.colorado.edu/Pin
  – User manual, many example tools, tutorials
  – 3300 downloads since July 2004

Acknowledgments

• Prof Dan Connors
  – Hosting Pin website at U of Colorado
• Intel Bistro Team
  – Providing the Falcon decoder/encoder
  – Suggesting instrumentation scheduling
• Mark Charney
  – Providing the XED decoder/encoder
• Ramesh Peri
  – Implementing part of Itanium Instrumentation


Backup


Talk Outline

• A Sample Pintool

• Pin Internal Details

• Experimental Results

• Pin Applications

• Conclusions

Trace Linking

• Trace linking is a very effective optimization
  – Bypasses the VM when transferring from one trace to another
  – Slowdown without trace linking can be as much as 100x

• Linking direct branches/calls
  – Straightforward as targets are unique

• Linking indirect branches/calls & returns
  – More challenging because the target can be different each time
  – Our approach:
    • For all indirect control transfers, use chaining
    • For returns, further optimize with function cloning

Indirect Trace Linking

Original indirect jump:
    jmp [%eax]

Translated code (chain of predicted targets):
    mov [%eax], T
    jmp target_1’

  target_1’:
    if (T != target_1) jmp target_2’
    …

  target_N’:
    if (T != target_N) jmp LookupHtab
    …

  LookupHtab (slow path):
    if (hit) jmp translated[T]
    else call Pin

• Chains are built incrementally
  – Most recent target inserted at the chain’s head
• Hash table is local to each indirect jump

Improved prediction accuracy over existing schemes
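The lookup order (chain first, then the jump-local hash table, then Pin) can be sketched as a small class. This is a toy model under stated assumptions: `IndirectJumpSite`, `Path`, and the bounded chain length `maxChain` are all hypothetical and are not described on the slide. The bound is there only so that a target can fall off the chain and still be found in the hash table.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <deque>
#include <unordered_set>

enum class Path { Chain, HashTable, Vm };

// One instance per indirect jump: the chain and the hash table are local to it.
class IndirectJumpSite {
public:
    explicit IndirectJumpSite(std::size_t maxChain) : maxChain_(maxChain) {
        assert(maxChain > 0);
    }

    // Resolve one dynamic target and report which path served it.
    Path dispatch(std::uint32_t target) {
        if (std::find(chain_.begin(), chain_.end(), target) != chain_.end())
            return Path::Chain;              // a compare-and-jump in the chain matched
        if (translated_.count(target) > 0)
            return Path::HashTable;          // slow path: jump-local hash table hit
        translated_.insert(target);          // miss everywhere: the VM compiles the target
        chain_.push_front(target);           // most recent target goes to the chain's head
        if (chain_.size() > maxChain_) chain_.pop_back();   // assumed bound on chain length
        return Path::Vm;
    }

    std::uint32_t head() const {
        assert(!chain_.empty());
        return chain_.front();
    }

private:
    std::size_t maxChain_;
    std::deque<std::uint32_t> chain_;               // predicted targets, newest first
    std::unordered_set<std::uint32_t> translated_;  // targets that already have a translation
};
```

With a chain bound of 2, dispatching targets 0x10, 0x20, 0x30 in turn goes to the VM three times; a later jump to 0x10 is served by the hash table, while 0x20 and 0x30 are still matched in the chain.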

Return-Address Prediction

• Distinguish different callers to a function by cloning:

Original code:
  A():
    call F()
  B():
    call F()
  F():
    ret

No cloning:
  F’():
    pop T
    jmp A’
  A’:
    if (T != A) jmp B’
    …
  B’:
    if (T != B) jmp Lookuphtab1
    …

Cloning:
  F_A’():
    pop T
    jmp A’
  A’:
    if (T != A) jmp Lookuphtab1
    …
  F_B’():
    pop T
    jmp B’
  B’:
    if (T != B) jmp Lookuphtab2
    …

Prediction accuracy further improved

Pin Multithreading Support

• For instrumenting multithreaded programs:
  – Pin intercepts all threading-related system calls:
    • Create and start jitting a thread if a clone() is seen
  – Pin provides a “thread id” for pintools to index thread-local storage
  – Pin’s virtual registers are backed by a per-thread spilling area

• For writing multithreaded pintools:
  – Since libpthread cannot be linked into the pintool (to avoid conflicts in setting up signal handlers by two libpthreads):
    • Pin implements a subset of libpthread itself
    • Pin can also redirect libpthread calls in the pintool to the application’s libpthread

Instrumenting Multithreaded Programs

• Pin instruments multithreaded programs:
  – Spilling area has to be thread local
    • Create a new per-thread spilling area when a thread-create system call (e.g., clone()) is intercepted

• How to access the per-thread spilling area?
  – Steal a physical register to point to the per-thread spilling area
  – x86-specific optimization:
    • Initially assume a single-threaded program
      – Access the spilling area via its absolute address
    • If multiple threads are detected later:
      – Flush the code cache
      – Recompile with a physical register pointing to the per-thread spilling area
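The switch from absolute addressing to register-based addressing can be sketched as a small state machine. This is an illustrative toy, not Pin's code: `SpillManager`, the symbol `SPILL_AREA`, the placeholder register name `%spillbase`, the slot count, and the 4-byte slot size are all assumptions made for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

struct SpillManager {
    static constexpr std::size_t kSlots = 16;      // hypothetical slots per spilling area
    static constexpr std::size_t kSlotBytes = 4;   // assumed 32-bit slots

    std::map<int, std::vector<unsigned>> areas;    // thread id -> its spilling area
    bool registerBased = false;                    // false: single-thread assumption holds
    int codeCacheFlushes = 0;

    // Called when a thread-create system call (e.g., clone()) is intercepted.
    void onThreadCreate(int tid) {
        areas.emplace(tid, std::vector<unsigned>(kSlots, 0u));
        if (areas.size() > 1 && !registerBased) {
            registerBased = true;   // from now on a stolen register holds the area's base
            ++codeCacheFlushes;     // old code used absolute addresses: flush and recompile
        }
    }

    // Operand that newly compiled code would use to reach a spill slot.
    std::string spillOperand(std::size_t slot) const {
        assert(slot < kSlots);
        const std::string offset = std::to_string(slot * kSlotBytes);
        return registerBased ? offset + "(%spillbase)" : "SPILL_AREA+" + offset;
    }
};
```

While only one thread exists, slot 2 is addressed as `SPILL_AREA+8`. Once a second thread is created, the code cache is flushed once and the same slot becomes `8(%spillbase)`; further threads cause no additional flush.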

Optimizing Instrumentation Performance

Observations:
  – Slowdown is largely due to executing instrumentation code rather than dynamic compilation
    => Makes sense to spend more time optimizing
  – Focus on optimizing simple instrumentation tools:
    • Performance depends on how fast we can transition between the application and the tool
    • Simple yet commonly used (e.g., basic-block profiling)

Pin Source Code Organization

• Pin source organized into generic, architecture-dependent, OS-dependent modules:

  Architecture             #source files   #source lines
  Generic                  87 (48%)        53595 (47%)
  x86 (32-bit + 64-bit)    34 (19%)        22794 (20%)
  Itanium                  34 (19%)        20474 (18%)
  ARM                      27 (14%)        17933 (15%)
  TOTAL                    182 (100%)      114796 (100%)

~50% code shared among architectures

Pin Instrumentation Performance

Performance of basic-block counting with Pin/IA32

[Bar chart: normalized execution time (%) per benchmark. Integer benchmarks: bzip2, crafty, eon, gap, gcc, gzip, mcf, parser, perlbmk, twolf, vortex, vpr, INT-AriMean. Floating-point benchmarks: ammp, applu, apsi, art, equake, facerec, fma3d, galgel, lucas, mesa, mgrid, sixtrack, swim, wupwise, FP-AriMean. Series: Without optimization; Inlining; Inlining + eflags liveness analysis; Inlining + eflags liveness analysis + scheduling.]

Average slowdown:

  Configuration                              INT     FP
  Without optimization                       10.4x   3.9x
  Inlining                                   7.8x    3.5x
  Inlining + eflags analysis                 2.8x    1.5x
  Inlining + eflags analysis + scheduling    2.5x    1.4x

Comparison among Dynamic Instrumentation Tools

Performance of basic-block counting with three different tools

[Bar chart: normalized execution time (%), y-axis from 0 to 1600. Series: Valgrind, DynamoRIO, Pin/IA32.]

• Valgrind is a popular instrumentation tool on Linux
  – Call-based instrumentation, no inlining
• DynamoRIO is the performance leader in dynamic optimization
  – Manually inline, no eflags liveness analysis and scheduling

Pin automatically provides efficient instrumentation

Pin/IA32 Performance (no instrumentation)

[Bar chart: normalized execution time (%), y-axis from 0 to 400, per benchmark. Integer benchmarks: bzip2, crafty, eon, gap, gcc, gzip, mcf, parser, perlbmk, twolf, vortex, vpr, INT-AriMean. Floating-point benchmarks: ammp, applu, apsi, art, equake, facerec, fma3d, galgel, lucas, mesa, mgrid, sixtrack, swim, wupwise, FP-AriMean. Legend: Code Cache, JIT-Decode, JIT-Regalloc, JIT-Other, VM, Total.]

Pin/EM64T Performance (no instrumentation)

[Bar chart: normalized execution time (%), y-axis from 0 to 400, per benchmark. Integer benchmarks: bzip2, crafty, eon, gap, gcc, gzip, mcf, parser, perlbmk, twolf, vortex, vpr, INT-AriMean. Floating-point benchmarks: ammp, applu, apsi, art, equake, facerec, fma3d, galgel, lucas, mesa, mgrid, sixtrack, swim, wupwise, FP-AriMean. Legend: Code Cache, JIT-Decode, JIT-Regalloc, JIT-Other, VM, Total.]

Pin0/IPF Performance (no instrumentation)

[Bar chart: normalized execution time (%), y-axis from 0 to 400, per benchmark. Integer benchmarks: bzip2, crafty, eon, gap, gcc, gzip, mcf, parser, perlbmk, twolf, vortex, vpr, INT-AriMean. Floating-point benchmarks: ammp, applu, apsi, art, equake, facerec, fma3d, galgel, lucas, mesa, mgrid, sixtrack, swim, wupwise, FP-AriMean. Legend: Total.]
