Upload
dugan
View
66
Download
0
Tags:
Embed Size (px)
DESCRIPTION
Pin Building Customized Program Analysis Tools with Dynamic Instrumentation. CK Luk, Robert Cohn, Robert Muth, Harish Patil, Artur Klauser, Geoff Lowney, Steven Wallace, Kim Hazelwood Intel. Vijay Janapa Reddi University of Colorado. http://rogue.colorado.edu/Pin. Instrumentation. - PowerPoint PPT Presentation
Citation preview
PLDI’05 1
Pin Pin Building Customized Program Analysis Tools with Building Customized Program Analysis Tools with
Dynamic InstrumentationDynamic Instrumentation
CK Luk, Robert Cohn, Robert Muth, Harish Patil, Artur Klauser, Geoff Lowney, Steven Wallace, Kim Hazelwood
Intel
Vijay Janapa ReddiUniversity of Colorado
http://rogue.colorado.edu/Pin
PLDI’05 2
Instrumentation
• Insert extra code into programs to collect information about execution– Program analysis:
• Code coverage, call-graph generation, memory-leak detection– Architectural study:
• Processor simulation, fault injection
• Existing binary-level instrumentation systems:– Static:
• ATOM, EEL, Etch, Morph– Dynamic:
• Dyninst, Vulcan, DTrace, Valgrind, Strata, DynamoRIO
Pin is a new dynamic binary instrumentation system
PLDI’05 3
Advantages of Pin Instrumentation
1. Easy-to-use Instrumentation API– Instrumentation code written in C/C++/asm– ATOM-like API, based on procedure calls
2. Instrumentation tools portable across platforms– Same tools work on IA32, EM64T (x86-64), Itanium, ARM– Same tools work on Linux and Windows (ongoing work)
3. Low instrumentation overhead– Pin automatically optimizes instrumentation code– Pin can attach instrumentation to a running process
4. Robust– Handle mixed code and data, variable-length instructions,
dynamically-generated code
5. Transparent– Application sees original addresses, values, and stack content
PLDI’05 4
A Pintool for Tracing Memory Writes
#include <iostream>#include "pin.H"
FILE* trace;
VOID RecordMemWrite(VOID* ip, VOID* addr, UINT32 size) { fprintf(trace, “%p: W %p %d\n”, ip, addr, size);} VOID Instruction(INS ins, VOID *v) { if (INS_IsMemoryWrite(ins)) INS_InsertCall(ins, IPOINT_BEFORE, AFUNPTR(RecordMemWrite), IARG_INST_PTR, IARG_MEMORYWRITE_EA, IARG_MEMORYWRITE_SIZE,
IARG_END); }
int main(int argc, char * argv[]) { PIN_Init(argc, argv); trace = fopen(“atrace.out”, “w”); INS_AddInstrumentFunction(Instruction, 0); PIN_StartProgram(); return 0;}
executed when an instruction is dynamically compiled
executed immediately before a write is executed
• Same source code works on the 4 architectures
=> Pin takes care of different addressing modes
• No need to manually save/restore application state
=> Pin does it for you automatically and efficiently
PLDI’05 5
Dynamic Instrumentation
Original codeCode cache
Pin fetches trace starting block 1 and start instrumentation
7’
2’
1’
Pin
2 3
1
7
45
6
Exits point back to Pin
PLDI’05 6
Dynamic Instrumentation
Original codeCode cache
Pin transfers control intocode cache (block 1)
2 3
1
7
45
67’
2’
1’
Pin
PLDI’05 7
Dynamic Instrumentation
Original codeCode cache
7’
2’
1’
PinPin fetches and instrument a new trace
6’
5’
3’trace linking
2 3
1
7
45
6
PLDI’05 8
Pin’s Software Architecture
JIT Compiler
Emulation Unit
Virtual Machine (VM)
Code
Cache
Instrumentation APIs
Ap
pli
cati
on
Operating SystemHardware
PinPintool 3 programs (Pin, Pintool, App) in
same address space:
User-level only
Instrumentation APIs:
Through which Pintool communicates with Pin
JIT compiler:
Dynamically compile and instrument
Emulation unit:
Handle insts that can’t be directly executed (e.g., syscalls)
Code cache:
Store compiled code
=> Coordinated by VM
Address space
PLDI’05 9
Pin Internal Details
• Loading of Pin, Pintool, & Application
• An Improved Trace Linking Technique
• Register Re-allocation
• Instrumentation Optimizations
• Multithreading Support
PLDI’05 10
Register Re-allocation
• Instrumented code needs extra registers. E.g.:– Virtual registers available to the tool– A virtual stack pointer pointing to the instrumentation stack– Many more …
• Approaches to get extra registers:1. Ad-hoc (e.g., DynamoRIO, Strata, DynInst)
– Whenever you need a register, spill one and fill it afterward
2. Re-allocate all registers during compilationa. Local allocation (e.g., Valgrind)
– Allocate registers independently within each trace
b. Global allocation (Pin)– Allocate registers across traces (can be inter-procedural)
PLDI’05 11
Valgrind’s Register Re-allocation
mov 1, %eax
mov 2, %ebx
cmp %ecx, %edx
jz t
add 1, %eax
sub 2, %ebx
t:
Original Code
Simple but inefficient
• All modified registers are spilled at a trace’s end
• Refill registers at a trace’s beginning
Trace 1mov 1, %eax
mov 2, %esi
cmp %ecx, %edx
mov %eax, SPILLeax
mov %esi, SPILLebx
jz t’
mov SPILLeax, %eax
mov SPILLebx ,%edi
add 1, %eax
sub 2, %edi
Trace 2
t’:
re-allocate
%edx%edx
%ecx%ecx
%esi%ebx
%eax%eax
PhysicalVirtual
%edx%edx
%ecx%ecx
%edi%ebx
%eax%eax
PhysicalVirtual
PLDI’05 12
Pin’s Register Re-allocation
%edx%edx
%ecx%ecx
%esi%ebx
%eax%eax
PhysicalVirtual
Compile Trace 2 using the binding at Trace 1’s exit:
Scenario (1): Compiling a new trace at a trace exit
mov 1, %eax
mov 2, %ebx
cmp %ecx, %edx
jz t
add 1, %eax
sub 2, %ebx
t:
Original Code
re-allocate
Trace 2
mov 1, %eax
mov 2, %esi
cmp %ecx, %edx
jz t’
Trace 1
add 1, %eax
sub 2, %esi t’:
No spilling/filling needed across traces
PLDI’05 13
Pin’s Register Re-allocation
Minimal spilling/filling code
Scenario (2): Targeting an already generated trace at a trace exit
Trace 1 (being compiled)
mov 1, %eax
mov 2, %ebx
cmp %ecx, %edx
jz t
add 1, %eax
sub 2, %ebx
t:
Original Codemov 1, %eax
mov 2, %esi
cmp %ecx, %edx
mov %esi, SPILLebx
mov SPILLebx, %edi
jz t’
add 1, %eax
sub 2, %edi
Trace 2 (in code cache)
t’:
re-allocate
%edx%edx
%ecx%ecx
%esi%ebx
%eax%eax
PhysicalVirtual
%edx%edx
%ecx%ecx
%edi%ebx
%eax%eax
PhysicalVirtual
PLDI’05 14
Instrumentation Optimizations
1. Inline instrumentation code into the application
2. Avoid saving/restoring eflags with liveness analysis
3. Schedule inlined instrumentation code
PLDI’05 15
Example: Instruction Counting
pushf
push %edx
push %ecx
push %eax
movl 0x3, %eax
call docount
pop %eax
pop %ecx
pop %edx
popf
ret
bridge()Instrument without applying any optimization
docount()add %eax,icount
ret
Original code
33 extra instructions executed altogether
cmov %esi, %edicmp %edi, (%esp) jle <target1>
add %ecx, %edxcmp %edx, 0 je <target2>
mov %esp,SPILLappsp
mov SPILLpinsp,%espcall <bridge>cmov %esi, %edimov SPILLappsp,%esp cmp %edi, (%esp) jle <target1’>
Trace
mov %esp,SPILLappsp
mov SPILLpinsp,%espcall <bridge>add %ecx, %edx cmp %edx, 0 je <target2’>
BBL_InsertCall(bbl, IPOINT_BEFORE, docount(), IARG_UINT32, BBL_NumIns(bbl), IARG_END)
PLDI’05 16
Example: Instruction Counting
Inlining
Original code
11 extra instructions executed
cmov %esi, %edicmp %edi, (%esp) jle <target1>
add %ecx, %edxcmp %edx, 0 je <target2> mov %esp,SPILLappsp
mov SPILLpinsp,%esppushfadd 0x3, icount popfcmov %esi, %edimov SPILLappsp,%espcmp %edi, (%esp) jle <target1’>
Trace
mov %esp,SPILLappsp
mov SPILLpinsp,%esppushfadd 0x3, icount popfadd %ecx, %edxcmp %edx, 0 je <target2’>
PLDI’05 17
Example: Instruction Counting
Inlining + eflags liveness analysis
Original code
7 extra instructions executed
cmov %esi, %edicmp %edi, (%esp) jle <target1>
add %ecx, %edxcmp %edx, 0 je <target2>
Trace
add 0x3, icount add %ecx, %edxcmp %edx, 0 je <target2’>
mov %esp,SPILLappsp
mov SPILLpinsp,%esppushfadd 0x3, icount popfcmov %esi, %edimov SPILLappsp,%espcmp %edi, (%esp) jle <target1’>
PLDI’05 18
Example: Instruction Counting
Inlining + eflags liveness analysis + scheduling
Original code
2 extra instructions executed
cmov %esi, %edicmp %edi, (%esp) jle <target1>
add %ecx, %edxcmp %edx, 0 je <target2>
cmov %esi, %ediadd 0x3, icountcmp %edi, (%esp) jle <target1’>
Trace
add 0x3, icount add %ecx, %edxcmp %edx, 0 je <target2’>
PLDI’05 19
Pin Instrumentation Performance
Runtime overhead of basic-block counting with Pin on IA32
10.4
3.9
7.8
3.52.8
1.52.5
1.4
0123456789
1011
SPECINT SPECFP
Ave
rag
e S
low
do
wn
Without optimizationInliningInlining + eflags liveness analysisInlining + eflags liveness analysis + scheduling
(SPEC2K using reference data sets)
PLDI’05 20
Comparison among Dynamic Instrumentation Tools
Runtime overhead of basic-block counting with three different tools
• Valgrind is a popular instrumentation tool on Linux
• Call-based instrumentation, no inlining
• DynamoRIO is the performance leader in binary dynamic optimization
• Manually inline, no eflags liveness analysis and scheduling
Pin automatically provides efficient instrumentation
8.3
5.1
2.5
0123456789
SPECINT
Ave
rag
e S
low
do
wn
Valgrind DynamoRIO Pin
PLDI’05 21
Pin Applications• Sample tools in the Pin distribution:
– Cache simulators, branch predictors, address tracer, syscall tracer, edge profiler, stride profiler
• Some tools developed and used inside Intel:– Opcodemix (analyze code generated by compilers)– PinPoints (find representative regions in programs to simulate)– A tool for detecting memory bugs
• Some companies are writing their own Pintools:– A major database vendor, a major search engine provider
• Some universities using Pin in teaching and research:– U. of Colorado, MIT, Harvard, Princeton, U of Minnesota,
Northeastern, Tufts, University of Rochester, …
PLDI’05 22
Conclusions
• Pin– A dynamic instrumentation system for building your own
program analysis tools– Easy to use, robust, transparent, efficient– Tool source compatible on IA32, EM64T, Itanium, ARM– Works on large applications
• database, search engine, web browsers, …– Available on Linux; Windows version coming soon
• Downloadable from http://rogue.colorado.edu/Pin– User manual, many example tools, tutorials– 3300 downloads since 2004 July
PLDI’05 23
Acknowledgments
• Prof Dan Connors– Hosting Pin website at U of Colorado
• Intel Bistro Team– Providing the Falcon decoder/encoder– Suggesting instrumentation scheduling
• Mark Charney– Providing the XED decoder/encoder
• Ramesh Peri– Implementing part of Itanium Instrumentation
PLDI’05 24
Backup
PLDI’05 25
Talk Outline
• A Sample Pintool
• Pin Internal Details
• Experimental Results
• Pin Applications
• Conclusions
PLDI’05 26
Trace Linking
• Trace linking is a very effective optimization– Bypass VM when transferring from one trace to another – Slowdown without trace linking as much as 100x
• Linking direct branches/calls– Straightforward as targets are unique
• Linking indirect branches/calls & returns– More challenging because the target can be different
each time– Our approach:
• For all indirect control transfers, use chaining• For returns, further optimizes with function cloning
PLDI’05 27
Indirect Trace Linking
• Chains are built incrementally– Most recent target inserted at the chain’s head
• Hash table is local to each indirect jump
chain of predicted targets
original indirect jump
jmp [%eax]
mov [%eax], T
jmp target_1’
target_1’:
if (T != target_1)
jmp target_2’
…
target_N’:
if (T != target_N)
jmp LookupHtab
…
if (hit) jmp translated[T]else call Pin
LookupHtab:
Improved prediction accuracy over existing schemes
slow path
PLDI’05 28
Return-Address Prediction
• Distinguish different callers to a function by cloning:
ret
F():
call F()
A():
call F()
B():
F’():
pop T
jmp A’
A’:
if (T != A)
jmp B’
…
B’:
if (T != B)
jmp Lookuphtab1
…
no cloning
F_A’():pop T
jmp A’
A’:
if (T != A)
jmp Lookuphtab1
…
F_B’():pop T
jmp B’
B’:
if (T != B)
jmp Lookuphtab2
…
cloning
Prediction accuracy further improved
PLDI’05 29
Pin Multithreading Support
• For instrumenting multithreaded programs:– Pin intercepts all threading-related system calls:
• Create and start jitting a thread if a clone() is seen
– Pin provides a “thread id” for pintools to index thread-local storage
– Pin’s virtual registers are backed up by per-thread spilling area
• For writing multithreaded pintools:– Since Pin cannot link in libpthread in the pintool (to
avoid conflicts in setting up signal handlers by two libpthreads)
• Pin implements a subset of libpthread itself• Pin can also redirect libpthread calls in pintool to the
application’s libpthread
PLDI’05 30
Instrumenting Multithreaded Programs
• Pin instruments multithreaded programs:– Spilling area has to be thread local
• Create a new per-thread spilling area when a thread-create system call (e.g., clone()) is intercepted
• How to access to per-thread spilling area?– Steal a physical register to point to the per-thread spilling area
– x86-specific optimization:• Initially assuming single-threaded program
– Access to the spilling area via its absolute address
• If multiple threads detected later:– Flush the code cache
– Recompile with a physical register pointing to per-thread spilling area
PLDI’05 31
Optimizing Instrumentation Performance
Observations:– Slowdown largely due to executing instrumentation
code rather than dynamic compilation Make sense to spend more time to optimize
– Focus on optimizing simple instrumentation tools:• Performance depends on how fast we can transit
between the application and the tool• Simple yet commonly used (e.g., basic-block profiling)
PLDI’05 32
Pin Source Code Organization
• Pin source organized into generic, architecture-dependent, OS-dependent modules:
Architecture #source files #source lines
Generic 87 (48%) 53595 (47%)
x86 (32-bit + 64-bit) 34 (19%) 22794 (20%)
Itanium 34 (19%) 20474 (18%)
ARM 27 (14%) 17933 (15%)
TOTAL 182 (100%) 114796 (100%)
~50% code shared among architectures
PLDI’05 33
Pin Instrumentation Performance236 343
214
259 4
12
189
108 179
450
134 2
89
162
118
121
114
110
152
127
144
109
110
149
105
110
104
317
248
138
0
500
1000
1500
2000
bzi
p2
cra
fty
eo
n
ga
p
gcc
gzi
p
mcf
pa
rse
r
pe
rlb
mk
two
lf
vort
ex
vpr
INT
-Ari
Me
an
am
mp
ap
plu
ap
si art
eq
ua
ke
face
rec
fma
3d
ga
lge
l
luca
s
me
sa
mg
rid
sixt
rack
swim
wu
pw
ise
FP
-Ari
Me
an
No
rmali
zed
Execu
tio
n T
ime (
%)
Without optimizationInliningInlining + eflags liveness analysisInlining + eflags liveness analysis + scheduling
Performance of basic-block counting with Pin/IA32
Average slowdown
INT FP
Without optimization 10.4x 3.9x
Inlining 7.8x 3.5x
Inlining + eflags analysis 2.8x 1.5x
Inlining + eflags analysis + scheduling 2.5x 1.4x
PLDI’05 34
Comparison among Dynamic Instrumentation Tools
582
1091
860
1583
934
191
574
1220
391
817 93
6
834
479 61
7
606
633 71
8
158
480
793
269
520
320 50
8
236 34
3
259 41
2
189
108 179
450
134 28
9
162 25
1
0200400600800
1000120014001600
No
rmal
ized
Exe
cuti
on
Tim
e (%
)
Valgrind DynamoRIO Pin/IA32Performance of basic-block counting with three different tools
• Valgrind is a popular instrumentation tool on Linux
• Call-based instrumentation, no inlining
• DynamoRIO is the performance leader in dynamic optimization
• Manually inline, no eflags liveness analysis and scheduling
Pin automatically provides efficient instrumentation
PLDI’05 35
Pin/IA32 Performance (no instrumentation)
108
182
156
122
299
111
101 11
5
237
114
198
109
101
100
102
101
104
103 12
1
103
101
106
101
104
100 11
1 105
154
0
50
100
150
200
250
300
350
400
bzip2
craf
ty eon
gap
gcc
gzip
mcf
pars
er
perlb
mk
twol
f
vorte
xvp
r
INT-A
rtMea
n
amm
p
appl
uap
si art
equa
ke
face
rec
fma3
d
galg
el
luca
sm
esa
mgr
id
sixtra
cksw
im
wupwise
FP-AriM
ean
Nor
mal
ized
Exe
cutio
n T
ime
(%)
Code Cache JIT-Decode JIT-Regalloc JIT-Other VM Total
PLDI’05 36
Pin/EM64T Performance (no instrumentation)
105
159
144
148
376
107
100 11
2
296
111
183
110
101
101
103
101
102
104 11
1
104
101
105
101
103
101
106 104
163
0
50
100
150
200
250
300
350
400bz
ip2
craf
ty
eon
gap
gcc
gzip
mcf
pars
er
perlb
mk
twol
f
vort
ex vpr
INT
-AriM
ean
amm
p
appl
u
apsi art
equa
ke
face
rec
fma3
d
galg
el
luca
s
mes
a
mgr
id
sixt
rack
swim
wup
wis
e
FP
-AriM
ean
Nor
mal
ized
Exe
cutio
n T
ime
(%)
Code Cache JIT-Decode JIT-Regalloc JIT-Other VM Total
PLDI’05 37
Pin0/IPF Performance (no instrumentation)12
2
173
133
210
260
120
105 12
5
357
125 14
2
126
114
109 12
8
125
113 13
5
99
117
100
142
104 13
2
112
115 119
167
0
50
100
150
200
250
300
350
400
bzip2
craft
yeo
nga
pgc
cgz
ipm
cf
parse
r
perlb
mk
twolf
vorte
xvp
r
INT-A
riMea
n
ammp
applu ap
si art
equa
ke
face
rec
fma3
d
galge
l
lucas
mes
am
grid
sixtra
cksw
im
wupwise
FP-AriM
ean
No
rmal
ized
Exe
cuti
on
Tim
e (%
)
Total