1
Ubiquitous Memory Introspection (UMI)
Qin Zhao, NUS
Rodric Rabbah, IBM
Saman Amarasinghe, MIT
Larry Rudolph, MIT
Weng-Fai Wong, NUS
CGO 2007, March 14, 2007, San Jose, CA
2
The complexity crunch
hardware complexity ↑
+ software complexity ↑
⇓
too many things happening concurrently, complex interactions and side-effects
⇓
less understanding of program execution behavior ↓
3
The importance of program behavior characterization
• S/W developer
  – Computation vs. communication ratio
  – Function/path frequency
  – Test coverage
• H/W designer
  – Cache behavior
  – Branch prediction
• System administrator
  – Interaction between processes

Better understanding of program characteristics can lead to more robust software, and more efficient hardware and systems.
4
Common approaches to program understanding
• Overhead – space and time overhead
• Level of detail – coarse grained summaries vs. instruction level and contextual information
• Versatility – portability and ability to adapt to different uses
5
Common approaches to program understanding

                 Profiling    Simulation
Overhead         high         very high
Level of detail  very high    very high
Versatility      high         very high
6
Common approaches to program understanding

                 Profiling    Simulation   HW Counters
Overhead         high         very high    very low
Level of detail  very high    very high    very low
Versatility      high         very high    very low
7
Slowdown due to HW counters (counting L1 misses for 181.mcf on P4)

[Chart: execution time in seconds vs. HW counter sample size (native, 10, 100, 1K, 10K, 100K, 1M); % slowdown ranges from >2000% at the smallest sample size down to about 1% at the largest]
8
Common approaches to program understanding

                 Profiling    Simulation   HW Counters   Desirable Approach
Overhead         high         very high    very low      low
Level of detail  very high    very high    very low      high
Versatility      high         very high    very low      high
9
Common approaches to program understanding

                 Profiling    Simulation   HW Counters   UMI (the desirable approach)
Overhead         high         very high    very low      low
Level of detail  very high    very high    very low      high
Versatility      high         very high    very low      high
10
Key components
• Dynamic Binary Instrumentation
  – Complete coverage, transparent, language independent, versatile, …
• Bursty Simulation
  – Sampling and fast forwarding techniques
  – Detailed context information
  – Reasonable extrapolation and prediction
11
Ubiquitous Memory Introspection

Online mini-simulations analyze short memory access profiles recorded from frequently executed code regions.

• Key concepts
  – Focus on hot code regions
  – Selectively instrument instructions
  – Fast online mini-simulations
  – Actionable profiling results for online memory optimizations
12
Working prototype
• Implemented in DynamoRIO
  – Runs on Linux
  – Used on Intel P4 and AMD K7
• Benchmarks
  – SPEC 2000, SPEC 2006, Olden
  – Server apps: MySQL, Apache
  – Desktop apps: Acrobat Reader, MEncoder
13
UMI is cheap and non-intrusive (SPEC2K reference workloads on P4)

[Chart: % slowdown compared to native execution for DynamoRIO and UMI across SPECfp (168.wupwise, 171.swim, 172.mgrid, 173.applu, 177.mesa, 178.galgel, 179.art, 183.equake, 187.facerec, 188.ammp, 189.lucas, 191.fma3d, 200.sixtrack, 301.apsi), SPECint (164.gzip, 175.vpr, 176.gcc, 181.mcf, 186.crafty, 197.parser, 252.eon, 253.perlbmk, 254.gap, 255.vortex, 256.bzip2, 300.twolf), and Olden (em3d, health, mst, treeadd, tsp, ft)]

• Average overhead is 14%
• 1% more than DynamoRIO
14
What can UMI do for you?
• Inexpensive introspection everywhere
• Coarse grained memory analysis
  – Quick and dirty
• Fine grained memory analysis
  – Expose opportunities for optimizations
• Runtime memory-specific optimizations
  – Pluggable prefetching, learning, adaptation
15
Coarse grained memory analysis
• Experiment: measure cache misses in three ways
  – HW counters
  – Full cache simulator (Cachegrind)
  – UMI
• Report correlation between measurements
  – Linear relationship of two sets of data
  – 1 = strong positive correlation, -1 = strong negative correlation, 0 = no correlation
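The correlation measure described here is the standard Pearson coefficient. A minimal Python sketch, purely illustrative (the `pearson` helper is not part of the UMI tooling):

```python
def pearson(xs, ys):
    # Pearson correlation: 1 = strong positive, -1 = strong negative,
    # 0 = no linear relationship between the two measurement sets.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)
```

Feeding it per-benchmark miss counts from two measurement methods (e.g., HW counters and UMI) yields a single number summarizing how well they agree.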
16
Cache miss correlation results
• HW counter vs. Cachegrind
  – 0.99
  – 20x to 100x slowdown
• HW counter vs. UMI
  – 0.88
  – Less than 2x slowdown in worst case

[Chart: correlation coefficients on a 0-to-1 scale for (HW, CG) and (HW, UMI)]
17
What can UMI do for you?
• Inexpensive introspection everywhere
• Coarse grained memory analysis
  – Quick and dirty
• Fine grained memory analysis
  – Expose opportunities for optimizations
• Runtime memory-specific optimizations
  – Pluggable prefetching, learning, adaptation
18
Fine grained memory analysis
• Experiment: predict delinquent loads using UMI
  – Individual loads with cache miss rate greater than threshold
• Delinquent load set determined according to full cache simulator (Cachegrind)
  – Loads that contribute 90% of total cache misses
• Measure and report two metrics: recall and false positive

[Venn diagram: delinquent loads identified by Cachegrind vs. delinquent loads predicted by UMI; the overlap determines recall, and UMI-only predictions are false positives]
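The two metrics follow directly from the two sets in the diagram. A small illustrative sketch (function and argument names are hypothetical, not from the talk):

```python
def recall_and_false_positive(predicted, reference):
    # recall: fraction of the reference delinquent loads (Cachegrind)
    #         that the predictor (UMI) also flagged
    # false positive: fraction of predictions not in the reference set
    predicted, reference = set(predicted), set(reference)
    recall = len(predicted & reference) / len(reference)
    false_positive = len(predicted - reference) / len(predicted)
    return recall, false_positive
```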
19
UMI delinquent load prediction accuracy

                                  Recall               False positive
                                  (higher is better)   (lower is better)
benchmarks with ≥ 1% miss rate    88%                  55%
benchmarks with < 1% miss rate    26%                  59%
20
What can UMI do for you?
• Inexpensive introspection everywhere
• Coarse grained memory analysis
  – Quick and dirty
• Fine grained memory analysis
  – Expose opportunities for optimizations
• Runtime memory-specific optimizations
  – Pluggable prefetcher, learning, adaptation
21
Experiment: online stride prefetching
• Use results of delinquent load prediction
• Discover stride patterns for delinquent loads
• Insert instructions to prefetch data to L2
• Compare runtime for UMI and P4 with HW stride prefetcher
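Stride discovery over a recorded address profile can be sketched as finding the dominant delta between consecutive addresses of one load. A hedged illustration (the `min_share` cutoff is an assumption for the sketch, not a value from the talk):

```python
from collections import Counter

def dominant_stride(addresses, min_share=0.5):
    # Deltas between consecutive addresses recorded for a single load.
    deltas = [b - a for a, b in zip(addresses, addresses[1:])]
    if not deltas:
        return None
    # Keep the most frequent delta only if it explains enough of the
    # profile; otherwise no stable stride pattern was found.
    stride, count = Counter(deltas).most_common(1)[0]
    return stride if count / len(deltas) >= min_share else None
```

A load that walks an array of 8-byte elements would yield a stride of 8, which a prefetch instruction could then run ahead of.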
22
Data prefetching results summary

[Chart: running time normalized to native execution (lower is better) for 181.mcf, 171.swim, 172.mgrid, 179.art, 183.equake, 188.ammp, 191.fma3d, 301.apsi, em3d, mst, ft, and the average, comparing three configurations:
  – SW prefetching + DynamoRIO + UMI
  – HW prefetching + Native Binary
  – Combined prefetching + DynamoRIO + UMI]

More than offsets slowdowns from binary instrumentation!
23
The Gory Details
24
UMI components
• Region selector
• Instrumentor
• Profile analyzer
25
Region selector
• Identify representative code regions
  – Focus on traces, loops
  – Frequently executed code
  – Piggyback on binary instrumentor tricks
• Reinforce with sampling
  – Time based, or leverage HW counters
  – Naturally adapt to program phases
26
Instrumentor
• Record address references
  – Insert instructions to record the address referenced by each memory operation
• Manage profiling overhead
  – Clone code trace (akin to Arnold-Ryder scheme)
  – Selective instrumentation of memory operations
    • E.g., ignore stack and static data
27
Recording profiles

[Diagram: code traces T1 and T2 with instrumented memory operations (op1, op2, op3) writing their referenced addresses into per-trace address profiles; each trace keeps an execution counter, early trace exits are recorded, and page protection detects profile overflow]
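The profile buffer with an overflow trigger can be sketched as below. This is illustrative only: the class name is invented, and the real system detects overflow via page protection rather than an explicit per-write bounds check.

```python
class AddressProfile:
    # Fixed-capacity buffer of recorded addresses for one code trace.
    # When it fills, a handler (e.g., the mini-simulator) is invoked
    # on the completed profile and the buffer is reset.
    def __init__(self, capacity, on_full):
        self.buf, self.capacity, self.on_full = [], capacity, on_full

    def record(self, addr):
        self.buf.append(addr)
        if len(self.buf) >= self.capacity:
            self.on_full(self.buf)
            self.buf = []
```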
28
Mini-simulator
• Triggered when a code or address profile is full
• Simple cache simulator
  – Currently simulates the L2 cache of the host
  – LRU replacement
  – Improves approximations with techniques similar to offline fast forwarding simulators
    • Warm up and periodic flushing
• Other possible analyzers
  – Reference affinity model
  – Data reuse and locality analysis
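A cache simulator in this spirit, a small set-associative LRU model fed by a recorded address profile, might look like the following sketch. The geometry parameters are assumptions for illustration, not the authors' implementation.

```python
from collections import OrderedDict

class MiniCache:
    # Set-associative cache with LRU replacement; counts hits and
    # misses for the addresses in a recorded profile.
    def __init__(self, line_size=64, num_sets=64, ways=8):
        self.line_size, self.num_sets, self.ways = line_size, num_sets, ways
        self.sets = [OrderedDict() for _ in range(num_sets)]
        self.hits = self.misses = 0

    def access(self, addr):
        tag = addr // self.line_size
        s = self.sets[tag % self.num_sets]
        if tag in s:
            s.move_to_end(tag)       # refresh LRU position
            self.hits += 1
        else:
            self.misses += 1
            if len(s) >= self.ways:  # evict least recently used line
                s.popitem(last=False)
            s[tag] = True

    def flush(self):
        # Periodic flushing bounds warm-up error, in the spirit of
        # offline fast-forwarding simulators.
        for s in self.sets:
            s.clear()
```

Replaying each trace's address profile through `access` yields per-load hit/miss counts cheaply, without simulating the whole execution.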
29
Mini-simulations and parameter sensitivity

[Chart for 181.mcf: normalized performance, recall ratio, and false positive ratio vs. sampling frequency threshold (1 to 1024)]

• Regular data structures
• If the sampling threshold is too high, it starts to exceed loop bounds: profiling misses important loops
• Adaptive threshold is best
30
Mini-simulations and parameter sensitivity

[Chart for 197.parser: normalized performance, recall ratio, and false positive ratio vs. address profile length (64 to 32K)]

• Irregular data structures
• Need longer profiles to reach useful conclusions
31
Summary
• UMI is lightweight and has a low overhead
  – 1% more than DynamoRIO
  – Can be done with Pin, Valgrind, etc.
  – No added hardware necessary
  – No synchronization or syscall headaches
  – Other cores can do real work!
• Practical for extracting detailed information
  – Online and workload specific
  – Instruction level memory reference profiles
  – Versatile and user-programmable analysis
• Facilitates migration of offline memory optimizations to an online setting
32
Future work
• More types of online analysis
  – Include global information
  – Incremental (leverage previous execution info)
  – Combine analysis across multiple threads
• More runtime optimizations
  – E.g., advanced prefetch optimization
    • Hot data stream prefetch
    • Markov prefetcher
  – Locality enhancing data reorganization
    • Pool allocation
    • Cooperative allocation between different threads
• Your ideas here…