32
1 Ubiquitous Memory Introspection (UMI) Qin Zhao, NUS Rodric Rabbah, IBM Saman Amarasinghe, MIT Larry Rudolph, MIT Weng-Fai Wong, NUS CGO 2007, March 14 2007 San Jose, CA

1 Ubiquitous Memory Introspection (UMI) Qin Zhao, NUS Rodric Rabbah, IBM Saman Amarasinghe, MIT Larry Rudolph, MIT Weng-Fai Wong, NUS CGO 2007, March 14

Embed Size (px)

DESCRIPTION

3 S/W developer –Computation vs. communication ratio –Function/path frequency –Test coverage H/W designer –Cache behavior –Branch prediction System administrator –Interaction between processes The importance of program behavior characterization Better understanding of program characteristics can lead to more robust software, and more efficient hardware and systems

Citation preview

Page 1: 1 Ubiquitous Memory Introspection (UMI) Qin Zhao, NUS Rodric Rabbah, IBM Saman Amarasinghe, MIT Larry Rudolph, MIT Weng-Fai Wong, NUS CGO 2007, March 14

1

Ubiquitous Memory Introspection (UMI)

Qin Zhao, NUSRodric Rabbah, IBM

Saman Amarasinghe, MITLarry Rudolph, MIT

Weng-Fai Wong, NUS

CGO 2007, March 14 2007San Jose, CA

Page 2: 1 Ubiquitous Memory Introspection (UMI) Qin Zhao, NUS Rodric Rabbah, IBM Saman Amarasinghe, MIT Larry Rudolph, MIT Weng-Fai Wong, NUS CGO 2007, March 14

2

The complexity crunchhardware complexity ↑

+software complexity ↑

⇓too many things happening concurrently,

complex interactions and side-effects

⇓less understanding of program

execution behavior ↓

Page 3: 1 Ubiquitous Memory Introspection (UMI) Qin Zhao, NUS Rodric Rabbah, IBM Saman Amarasinghe, MIT Larry Rudolph, MIT Weng-Fai Wong, NUS CGO 2007, March 14

3

• S/W developer– Computation vs.

communication ratio– Function/path frequency– Test coverage

• H/W designer– Cache behavior– Branch prediction

• System administrator– Interaction between

processes

The importance of program behavior characterization

Better understanding of program characteristics can lead to more robust software, and more efficient hardware and systems

Page 4: 1 Ubiquitous Memory Introspection (UMI) Qin Zhao, NUS Rodric Rabbah, IBM Saman Amarasinghe, MIT Larry Rudolph, MIT Weng-Fai Wong, NUS CGO 2007, March 14

4

Common approaches toprogram understanding

Overhead

Level of detail

Versatility

Space and time overhead

Coarse grained summaries vs. instruction level and contextual information

Portability and ability to adapt to different uses

Page 5: 1 Ubiquitous Memory Introspection (UMI) Qin Zhao, NUS Rodric Rabbah, IBM Saman Amarasinghe, MIT Larry Rudolph, MIT Weng-Fai Wong, NUS CGO 2007, March 14

5

Common approaches toprogram understanding

Profiling Simulation

Overhead high very high

Level of detail very high very high

Versatility high very high

Page 6: 1 Ubiquitous Memory Introspection (UMI) Qin Zhao, NUS Rodric Rabbah, IBM Saman Amarasinghe, MIT Larry Rudolph, MIT Weng-Fai Wong, NUS CGO 2007, March 14

6

Common approaches toprogram understanding

Profiling Simulation

HW Counters

Overhead high very high very low

Level of detail very high very high very low

Versatility high very high very low

Page 7: 1 Ubiquitous Memory Introspection (UMI) Qin Zhao, NUS Rodric Rabbah, IBM Saman Amarasinghe, MIT Larry Rudolph, MIT Weng-Fai Wong, NUS CGO 2007, March 14

7

Slowdown due to HW counters (counting L1 misses for 181.mcf on P4)

0100200300400500600700800

native 10 100 1K 10K 100K 1M

HW counter sample size

Tim

e (s

econ

ds)

(>2000%)

(1%)(10%) (1%)(35%)(325%)

(% slowdown)

Page 8: 1 Ubiquitous Memory Introspection (UMI) Qin Zhao, NUS Rodric Rabbah, IBM Saman Amarasinghe, MIT Larry Rudolph, MIT Weng-Fai Wong, NUS CGO 2007, March 14

8

Common approaches toprogram understanding

Profiling Simulation

HW Counters

Desirable Approach

Overhead high very high very low low

Level of detail very high very high very low high

Versatility high very high very low high

Page 9: 1 Ubiquitous Memory Introspection (UMI) Qin Zhao, NUS Rodric Rabbah, IBM Saman Amarasinghe, MIT Larry Rudolph, MIT Weng-Fai Wong, NUS CGO 2007, March 14

9

Common approaches toprogram understanding

Profiling Simulation

HW Counters

Desirable Approach

Overhead high very high very low low

Level of detail very high very high very low high

Versatility high very high very low high

UMI

Page 10: 1 Ubiquitous Memory Introspection (UMI) Qin Zhao, NUS Rodric Rabbah, IBM Saman Amarasinghe, MIT Larry Rudolph, MIT Weng-Fai Wong, NUS CGO 2007, March 14

10

Key components• Dynamic Binary Instrumentation

– Complete coverage, transparent, language independent, versatile, …

• Bursty Simulation– Sampling and fast forwarding techniques– Detailed context information– Reasonable extrapolation and prediction

Page 11: 1 Ubiquitous Memory Introspection (UMI) Qin Zhao, NUS Rodric Rabbah, IBM Saman Amarasinghe, MIT Larry Rudolph, MIT Weng-Fai Wong, NUS CGO 2007, March 14

11

Ubiquitous Memory Introspection

online mini-simulations analyze short memory access profiles recorded from frequently executed code regions

• Key concepts– Focus on hot code regions– Selectively instrument instructions– Fast online mini-simulations– Actionable profiling results for online memory

optimizations

Page 12: 1 Ubiquitous Memory Introspection (UMI) Qin Zhao, NUS Rodric Rabbah, IBM Saman Amarasinghe, MIT Larry Rudolph, MIT Weng-Fai Wong, NUS CGO 2007, March 14

12

Working prototype• Implemented in DynamoRIO

– Runs on Linux– Used on Intel P4 and AMD K7

• Benchmarks– SPEC 2000, SPEC 2006, Olden– Server apps: MySQL, Apache– Desktop apps: Acrobat-reader,

MEncoder

Page 13: 1 Ubiquitous Memory Introspection (UMI) Qin Zhao, NUS Rodric Rabbah, IBM Saman Amarasinghe, MIT Larry Rudolph, MIT Weng-Fai Wong, NUS CGO 2007, March 14

13

UMI is cheap and non-intrusive(SPEC2K reference workloads on

P4)

Averag

e

0%

20%

40%

60%

80%

168.w

upwise

171.s

wim

172.m

grid

173.a

pplu

177.m

esa

178.g

algel

179.a

rt

183.e

quak

e

187.f

acere

c

188.a

mmp

189.l

ucas

191.f

ma3d

200.s

ixtrac

k

301.a

psi

164.g

zip

175.v

pr

176.g

cc

181.m

cf

186.c

rafty

197.p

arser

252.e

on

253.p

erlbm

k

254.g

ap

255.v

ortex

56.bz

ip2

300.t

wolfem

3dhe

alth

mst

treea

dd tspft

DynamoRIOUMI

•Average overhead is 14%•1% more than DynamoRIO

% s

low

dow

n co

mpa

red

to n

ativ

e ex

ecut

ion

Page 14: 1 Ubiquitous Memory Introspection (UMI) Qin Zhao, NUS Rodric Rabbah, IBM Saman Amarasinghe, MIT Larry Rudolph, MIT Weng-Fai Wong, NUS CGO 2007, March 14

14

What can UMI do for you?• Inexpensive introspection everywhere

• Coarse grained memory analysis– Quick and dirty

• Fine grained memory analysis– Expose opportunities for optimizations

• Runtime memory-specific optimizations– Pluggable prefetching, learning, adaptation

Page 15: 1 Ubiquitous Memory Introspection (UMI) Qin Zhao, NUS Rodric Rabbah, IBM Saman Amarasinghe, MIT Larry Rudolph, MIT Weng-Fai Wong, NUS CGO 2007, March 14

15

Coarse grained memory analysis

• Experiment: measure cache misses in three ways – HW counters– Full cache simulator (Cachegrind)– UMI

• Report correlation between measurements– Linear relationship of two sets of data

1-1strong

positive correlation

0strong negative

correlation

no correlation

Page 16: 1 Ubiquitous Memory Introspection (UMI) Qin Zhao, NUS Rodric Rabbah, IBM Saman Amarasinghe, MIT Larry Rudolph, MIT Weng-Fai Wong, NUS CGO 2007, March 14

16

Cache miss correlation results • HW counter vs. Cachegrind

– 0.99– 20x to 100x slowdown

• HW counter vs. UMI– 0.88– Less than 2x slowdown

in worst case0

0.2

0.4

0.6

0.8

1

(HW

, CG)

(HW

, UMI)

Page 17: 1 Ubiquitous Memory Introspection (UMI) Qin Zhao, NUS Rodric Rabbah, IBM Saman Amarasinghe, MIT Larry Rudolph, MIT Weng-Fai Wong, NUS CGO 2007, March 14

17

What can UMI do for you?• Inexpensive introspection everywhere

• Coarse grained memory analysis– Quick and dirty

• Fine grained memory analysis– Expose opportunities for optimizations

• Runtime memory-specific optimizations– Pluggable prefetching, learning, adaptation

Page 18: 1 Ubiquitous Memory Introspection (UMI) Qin Zhao, NUS Rodric Rabbah, IBM Saman Amarasinghe, MIT Larry Rudolph, MIT Weng-Fai Wong, NUS CGO 2007, March 14

18

Fine grained memory analysis

• Experiment: predict delinquent loads using UMI– Individual loads with cache miss rate greater

than threshold• Delinquent load set determined according

to full cache simulator (Cachegrind)– Loads that contribute 90% of total cache misses

• Measure and report two metricsDelinquent loadsidentified byCachegrind

Delinquent loads predicted by

UMIrecall falsepositive

Page 19: 1 Ubiquitous Memory Introspection (UMI) Qin Zhao, NUS Rodric Rabbah, IBM Saman Amarasinghe, MIT Larry Rudolph, MIT Weng-Fai Wong, NUS CGO 2007, March 14

19

UMI delinquent load prediction accuracy

Recall

(higher is better)

False positive

(lower is better)

benchmarks with ≥ 1% miss rate

88% 55%

benchmarks with < 1% miss rate

26% 59%

Page 20: 1 Ubiquitous Memory Introspection (UMI) Qin Zhao, NUS Rodric Rabbah, IBM Saman Amarasinghe, MIT Larry Rudolph, MIT Weng-Fai Wong, NUS CGO 2007, March 14

20

What can UMI do for you?• Inexpensive introspection everywhere

• Coarse grained memory analysis– Quick and dirty

• Fine grained memory analysis– Expose opportunities for optimizations

• Runtime memory-specific optimizations– Pluggable prefetcher, learning, adaptation

Page 21: 1 Ubiquitous Memory Introspection (UMI) Qin Zhao, NUS Rodric Rabbah, IBM Saman Amarasinghe, MIT Larry Rudolph, MIT Weng-Fai Wong, NUS CGO 2007, March 14

21

Experiment: online stride prefetching

• Use results of delinquent load prediction

• Discover stride patterns for delinquent loads

• Insert instructions to prefetch data to L2

• Compare runtime for UMI and P4 with HW stride prefetcher

Page 22: 1 Ubiquitous Memory Introspection (UMI) Qin Zhao, NUS Rodric Rabbah, IBM Saman Amarasinghe, MIT Larry Rudolph, MIT Weng-Fai Wong, NUS CGO 2007, March 14

22

Data prefetching results summary

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1.0

runn

ing

time

norm

aliz

ed to

nat

ive

exec

utio

n (lo

wer

is b

ette

r)

SW prefetching + DynamoRIO + UMI

HW prefetching + Native Binary

Combined prefetching + DynamoRIO + UMI

ft

Averag

e

0.20.30.40.50.60.70.80.91.01.1

181.m

cf

171.s

wim

172.m

grid

179.a

rt

183.e

quak

e

188.a

mmp

191.f

ma3d

301.a

psi

em3d mst ft

Averag

e

More than offsets slowdowns from binary instrumentation!

Page 23: 1 Ubiquitous Memory Introspection (UMI) Qin Zhao, NUS Rodric Rabbah, IBM Saman Amarasinghe, MIT Larry Rudolph, MIT Weng-Fai Wong, NUS CGO 2007, March 14

23

The Gory Details

Page 24: 1 Ubiquitous Memory Introspection (UMI) Qin Zhao, NUS Rodric Rabbah, IBM Saman Amarasinghe, MIT Larry Rudolph, MIT Weng-Fai Wong, NUS CGO 2007, March 14

24

UMI components• Region selector

• Instrumentor

• Profile analyzer

Page 25: 1 Ubiquitous Memory Introspection (UMI) Qin Zhao, NUS Rodric Rabbah, IBM Saman Amarasinghe, MIT Larry Rudolph, MIT Weng-Fai Wong, NUS CGO 2007, March 14

25

Region selector• Identify representative code regions

– Focus on traces, loops– Frequently executed code– Piggy back on binary instrumentor tricks

• Reinforce with sampling – Time based, or leverage HW counters– Naturally adapt to program phases

Page 26: 1 Ubiquitous Memory Introspection (UMI) Qin Zhao, NUS Rodric Rabbah, IBM Saman Amarasinghe, MIT Larry Rudolph, MIT Weng-Fai Wong, NUS CGO 2007, March 14

26

Instrumentor• Record address references

– Insert instructions to record address referenced by memory operation

• Manage profiling overhead– Clone code trace (akin to Arnold-Ryder

scheme)– Selective instrumentation of memory

operations• E.g., ignore stack and static data

Page 27: 1 Ubiquitous Memory Introspection (UMI) Qin Zhao, NUS Rodric Rabbah, IBM Saman Amarasinghe, MIT Larry Rudolph, MIT Weng-Fai Wong, NUS CGO 2007, March 14

27

Recording profilesCode Trace Profile

T1T1T2T1

counter

0x1040x0280x0130x012

0x1000x0240x011op3op2op1

Code Trace T1

Code Trace T2

0x0310x032op2op1

counterpage protection to detect profile overflow

Address Profiles

early trace exist

early trace exist

Page 28: 1 Ubiquitous Memory Introspection (UMI) Qin Zhao, NUS Rodric Rabbah, IBM Saman Amarasinghe, MIT Larry Rudolph, MIT Weng-Fai Wong, NUS CGO 2007, March 14

28

Mini-simulator• Triggered when code or address profile is full• Simple cache simulator

– Currently simulate L2 cache of host– LRU replacement– Improve approximations with techniques similar

to offline fast forwarding simulators• Warm up and periodic flushing

• Other possible analyzer– Reference affinity model– Data reuse and locality analysis

Page 29: 1 Ubiquitous Memory Introspection (UMI) Qin Zhao, NUS Rodric Rabbah, IBM Saman Amarasinghe, MIT Larry Rudolph, MIT Weng-Fai Wong, NUS CGO 2007, March 14

29

Mini-simulations and parameter sensitivity

181.mcf

0.000.200.400.600.801.001.20

1 2 4 8 16 32 64 128 256 512 1024sampling frequency threshold

recall falsepositive

1.40

normalized performance recall ratio false positive ratio

• Regular data structures• If sampling threshold too high, starts to exceed

loop bounds: miss out on profiling important loops

• Adaptive threshold is best

Page 30: 1 Ubiquitous Memory Introspection (UMI) Qin Zhao, NUS Rodric Rabbah, IBM Saman Amarasinghe, MIT Larry Rudolph, MIT Weng-Fai Wong, NUS CGO 2007, March 14

30

Mini-simulations and parameter sensitivity

0.000.200.400.600.801.001.201.40

64 128 256 512 1K 2K 4K 8K 32K

address profile length

197.parserrecall false

positive

16K

normalized performance recall ratio false positive ratio

• Irregular data structures• Need longer profiles to reach useful conclusions

Page 31: 1 Ubiquitous Memory Introspection (UMI) Qin Zhao, NUS Rodric Rabbah, IBM Saman Amarasinghe, MIT Larry Rudolph, MIT Weng-Fai Wong, NUS CGO 2007, March 14

31

Summary• UMI is lightweight and has a low overhead

– 1% more than DynamoRIO– Can be done with Pin, Valgrind, etc.– No added hardware necessary– No synchronization or syscall headaches– Other cores can do real work!

• Practical for extracting detailed information– Online and workload specific– Instruction level memory reference profiles– Versatile and user-programmable analysis

• Facilitate migration of offline memory optimizations to online setting

Page 32: 1 Ubiquitous Memory Introspection (UMI) Qin Zhao, NUS Rodric Rabbah, IBM Saman Amarasinghe, MIT Larry Rudolph, MIT Weng-Fai Wong, NUS CGO 2007, March 14

32

Future work• More types of online analysis

– Include global information– Incremental (leverage previous execution info)– Combine analysis across multiple threads

• More runtime optimizations – E.g., advanced prefetch optimization

• Hot data stream prefetch• Markov prefetcher

– Locality enhancing data reorganization• Pool allocation• Cooperative allocation between different threads

• Your ideas here…