A Lightweight Hybrid Hardware/Software Approach for Object-Relative Memory Profiling

Licheng Chen, Zehan Cui, Yungang Bao, Mingyu Chen, Yongbing Huang, and Guangming Tan

Institute of Computing Technology (ICT) Chinese Academy of Sciences (CAS)

ISPASS 2012, April 2, 2012

Background

• Memory behavior is a key factor in the performance of a program.
• Understanding memory behavior is essential for identifying bottlenecks in both the architecture and the application.
• For example:
– The TLB is an essential component of the memory system
– Applications' working sets tend to grow larger and larger, leading to serious TLB misses
– Study 1: TLB misses can degrade system performance by 5~14% [Bhargava’08]
– Study 2: a large number of TLB misses in multi-threaded programs are redundant and predictable, which implies optimization potential [Bhattacharjee’08]

Such studies are done by memory profiling.

Memory Profiling

• Memory profiling collects memory behavior information during the execution of programs.
• Profiling can be performed:
– for different hardware components: TLB / Cache / DRAM
– at different software levels: Objects (Array, List, etc.) / Function / Application / Whole System

Object Memory Profiling

• An object refers to a group of data stored as a unit [Wu’04]
– Distinguishes regular patterns from mixed and irregular traces
• Valuable for optimization:
– Memory trace compression
– Data layout
– Object-level prefetching
– Cache partitioning [Soft-OLP, PACT 2009]

[Figure: Whole System Traces, Application Traces, and Object Trace, ranging from irregular to regular]

Current Profiling Approaches

• Existing approaches:
– Compiler-driven: requires re-compiling/re-linking and source code
– Instrumentation: heavy overhead
– Simulation: accuracy problems, slow
– Performance counters: lack of detailed information
• None of them can observe page table walks caused by TLB misses
• We propose a hybrid hardware/software approach for object memory profiling:
– Accurate: real application & real system
– Lightweight
– Tracks page table walks at object level

Outline

• Background

• Design and Implementation

• Experimental Results

• Conclusion

An Overview

[Figure: the Physical Address Trace (0x398f24a, 0x398f24b, 0x398f24c, ...) is translated into a Virtual Address Trace (0x1f05000, 0x1f06000, 0x1f07000, ...), which is mapped to an Object Access Pattern (Matrix, VA: 0x1f05000)]

HMTT

• Hybrid Memory Trace Toolkit
– A DDR3 SDRAM-compatible memory trace monitoring system
– Adopts hardware snooping technology

[Photo labels: DIMM plugged on the other side; PCIE Cable Connector]

Memory Trace: <time_stamp, r/w, phy_addr>

Advantages:
• Platform independent
• Negligible overhead
• Full-system real memory traces, including OS and page table walks

Challenge (1)

• How to translate a physical address trace into the virtual address trace of a specific process?

• Modify the OS kernel to obtain the page table

• Look up each phy_addr in the dumped page table

• Generate the virtual trace of each process
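
The lookup step above can be sketched in a few lines of Python. This is a minimal sketch, not HMTT's actual implementation: only the `<time_stamp, r/w, phy_addr>` record layout comes from the slides. The names `TraceRecord` and `translate`, the 4 KB page size, and the contents of `reverse_pt` are assumptions made for illustration.

```python
from typing import NamedTuple

PAGE_SHIFT = 12                      # assumption: 4 KB pages
PAGE_MASK = (1 << PAGE_SHIFT) - 1

class TraceRecord(NamedTuple):
    time_stamp: int
    rw: str                          # "r" or "w"
    phy_addr: int

def translate(rec, reverse_pt):
    """Return (pid, virt_addr) for a record, or None if the frame is unmapped."""
    pfn = rec.phy_addr >> PAGE_SHIFT             # physical frame number
    entry = reverse_pt.get(pfn)
    if entry is None:
        return None
    pid, vpn = entry
    # Keep the in-page offset, swap the frame number for the virtual page number.
    return pid, (vpn << PAGE_SHIFT) | (rec.phy_addr & PAGE_MASK)

# Hypothetical dumped mapping: physical frame number -> (pid, virtual page number)
reverse_pt = {0x398F: (42, 0x1F05), 0x1AF4: (42, 0x1F06)}

trace = [
    TraceRecord(100, "r", 0x398F24A),
    TraceRecord(101, "w", 0x1AF4AA0),
    TraceRecord(102, "r", 0x7777000),            # frame not in the dumped table
]
virtual_trace = [translate(r, reverse_pt) for r in trace]
```

Records whose frame is not in the table come back as `None`, so a caller can filter them out or attribute them to another process.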

Challenge (2)

• How to synchronize hardware and software when a page table update occurs in the kernel?

• Physical page allocation/free in the kernel

• Trigger annotations in the OS VM module

• Update the dumped page table

• Send a sync_tag to the hardware
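
One way to picture why the sync_tag matters is an offline replay that applies page table updates at exactly the points where tags appear in the trace. The sketch below is hypothetical: the event tuples, the format of the `updates` log, and the one-update-per-tag pairing are assumptions for illustration, not the format HMTT uses.

```python
PAGE_SHIFT = 12  # assumption: 4 KB pages
PAGE_MASK = (1 << PAGE_SHIFT) - 1

def replay(events, updates, page_table):
    """Replay a trace in order, applying one logged update per sync_tag.

    events:     ("sync",) or ("access", phy_addr) tuples in trace order.
    updates:    kernel-side log of ("map", pfn, vpn) or ("unmap", pfn) entries,
                in the order in which the sync_tags were sent.
    page_table: dict pfn -> vpn, mutated in place.
    """
    pending = iter(updates)
    out = []
    for ev in events:
        if ev[0] == "sync":
            op = next(pending)
            if op[0] == "map":
                page_table[op[1]] = op[2]
            else:
                page_table.pop(op[1], None)
        else:
            phy = ev[1]
            vpn = page_table.get(phy >> PAGE_SHIFT)
            out.append(None if vpn is None
                       else (vpn << PAGE_SHIFT) | (phy & PAGE_MASK))
    return out

events = [("access", 0x5000), ("sync",), ("access", 0x5010),
          ("sync",), ("access", 0x5020)]
updates = [("map", 0x5, 0x1F05), ("unmap", 0x5)]
result = replay(events, updates, {})
```

The same physical frame translates differently before, between, and after the two tags, which is the behavior an unsynchronized page table dump would get wrong.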

Challenge (3)

• How to translate virtual addresses to objects without modifying source code?

[Figure: the call matrix = malloc(0x1000) is replaced by matrix = mymalloc(0x1000); the object "matrix" in the virtual address space is recorded in the Object-VA Mapping Table]

• The role of malloc() is to map VA to object

• Use dynamic library overriding to replace malloc()
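
The Object-VA Mapping Table can be thought of as a sorted list of address ranges. The following is a minimal Python sketch under stated assumptions: the class `ObjectTable`, its methods, and the second object are hypothetical, and the sketch assumes live objects never overlap and ignores free().

```python
import bisect

class ObjectTable:
    """Sorted table of [start, start + size) ranges, one per allocated object."""

    def __init__(self):
        self._starts = []    # sorted start addresses
        self._entries = []   # (start, size, name), parallel to _starts

    def record(self, name, start, size):
        """Log one allocation, as a wrapped malloc() might do."""
        i = bisect.bisect_left(self._starts, start)
        self._starts.insert(i, start)
        self._entries.insert(i, (start, size, name))

    def lookup(self, vaddr):
        """Return the name of the object containing vaddr, or None."""
        i = bisect.bisect_right(self._starts, vaddr) - 1
        if i >= 0:
            start, size, name = self._entries[i]
            if vaddr < start + size:
                return name
        return None

table = ObjectTable()
table.record("matrix", 0x1F05000, 0x1000)   # e.g. logged for malloc(0x1000)
table.record("vector", 0x2A00000, 0x400)    # hypothetical second object
```

Binary search keeps each lookup logarithmic in the number of objects, which matters when every record of a long trace has to be attributed.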

Put them all together

[Figure: the Physical Address Trace (0x398f24a, 0x398f24b, 0x398f24c, ...), with embedded sync_tag and page walk records, is translated through the Dumped Page Table into a Virtual Address Trace (0x1f05000, 0x1f06000, 0x1f07000, ...), which the Object-VA Mapping Table maps to an Object Access Pattern (Matrix, VA: 0x1f05000)]

Use the page table to distinguish three types of memory access:
• sync_tag → update the page table
• Access to the page table itself → page table walk due to a TLB miss
• Other memory access → translate to a virtual address
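
The three-way split above can be sketched as a small classifier. This is an illustrative sketch only: it assumes the sync_tag shows up as an access to one special physical address and that the set of frames holding page table pages is known from the dump. `SYNC_TAG_ADDR` and `pt_frames` are made-up values.

```python
PAGE_SHIFT = 12  # assumption: 4 KB pages

def classify(phy_addr, sync_tag_addr, page_table_frames):
    """Label one physical access as 'sync_tag', 'page_walk', or 'data'."""
    if phy_addr == sync_tag_addr:
        return "sync_tag"                     # update the dumped page table
    if (phy_addr >> PAGE_SHIFT) in page_table_frames:
        return "page_walk"                    # access to the page table itself
    return "data"                             # translate to a virtual address

SYNC_TAG_ADDR = 0xF0000000                    # hypothetical reserved address
pt_frames = {0x38D2C}                         # hypothetical page table frames

labels = [classify(a, SYNC_TAG_ADDR, pt_frames)
          for a in (0xF0000000, 0x38D2CFC0, 0x398F24A)]
```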

Evaluation Methodology

Processor: Intel Xeon E5504, 2.0GHz, 2 sockets, 4 cores per socket (8 cores in total)
Private Cache:
– L1 D-Cache: 32KB, 8-way, 64Byte/line
– L1 I-Cache: 32KB, 4-way, 64Byte/line
– L2: 256KB, 8-way, 64Byte/line
Shared Cache: L3, 4MB, 16-way, 64Byte/line
TLB (private):
– DTLB0: 64 entries for 4-KByte pages, 32 entries for huge pages (2MByte)
– TLB1: 512 entries for 4-KByte pages
Memory: DDR3-800 RDIMM, dual-rank, plugged into Socket 0, 4GB (0.25GB reserved for HMTT configuration and buffer, 3.75GB available to the system)
Operating System: CentOS 5.3, Linux kernel 2.6.32.18
Benchmarks:
– Multithreaded PARSEC 2.1
– A custom hybrid MPI/pthread implementation of BFS from Graph500-1.2

Validation

• For the SpMV benchmark (CSR): y = ax * xhost
– Our system is able to distinguish regular access patterns from irregular patterns
• Micro-benchmark: the error is less than 2%

Overhead

• Two main sources of overhead:
– Dumping page table traces: +dump_pt
– Dumping the object-VA mapping: +dump_obj
• Monitoring objects >= 4KB: these account for most memory references

[Figure: normalized overhead for Origin, +dump_pt, and +dump_obj; annotated overheads: <1% and <2%]

Case Study 1: BFS (Breadth-First Search)

• The column object accounts for about 71% of page walks, making it the key object
• Optimization: use huge pages for the column object
– Speedup: about 12% for 8 threads, 8% for 128 threads

[Figure: normalized speedup vs. number of threads (1 to 128), w/o hugetlb and w/ hugetlb; annotated value: 8.18%]

[Figure: percentage of page walks per object (rowstarts, column, pred, oldq, newq, visited) vs. number of threads (1, 2, 4, 32, 128)]
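
A rough way to see why huge pages can help here is TLB reach, the amount of memory covered when every TLB entry is in use. The sketch below uses the entry counts from the Evaluation Methodology slide; the helper `tlb_reach` is a name introduced for illustration, and actual miss rates also depend on the access pattern, not only on reach.

```python
KB, MB = 1024, 1024 * 1024

def tlb_reach(entries, page_size):
    """Bytes of address space covered when every TLB entry is in use."""
    return entries * page_size

reach_dtlb0_4k = tlb_reach(64, 4 * KB)      # DTLB0, 4-KByte pages
reach_tlb1_4k = tlb_reach(512, 4 * KB)      # TLB1, 4-KByte pages
reach_dtlb0_huge = tlb_reach(32, 2 * MB)    # DTLB0, 2-MByte huge pages
```

With 4-KByte pages the two levels cover 256 KB and 2 MB; the 32 huge-page entries cover 64 MB, so an object much larger than 2 MB can be covered by far fewer entries when it is placed in huge pages.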

Case Study 2: Canneal (PARSEC)

• Cache-aware simulated annealing (SA) to minimize the routing cost of a chip design
• Two objects contribute most of the memory accesses: _elements and _location

[Figure: Main Objects in Canneal. Number of memory requests (_elements_r, _elements_w, _location_r, _location_w, others) for 1, 2, 4, and 8 threads]

The number of memory accesses barely changes as the number of threads increases.

Case Study 2: Canneal

[Figure: number of page walks (total, _elements, _locations) vs. number of threads (1, 2, 4, 8)]

• The _elements object contributes most of the increased page walks
• Put the _elements object into huge pages to reduce TLB misses. Speedup: about 5% for 8 threads

[Figure: normalized speedup vs. number of threads (1, 2, 4, 8), w/o hugetlb and w/ hugetlb]

A Visual Demo of the HMTT

Conclusion

• We have designed and implemented a hybrid hardware/software approach to conduct object-relative memory profiling:
– Accurate: real application & real system
– Lightweight
– Tracks page table walks at object level

• We present two case studies to show how the approach can help users better understand memory behavior and optimize performance.

• We intend to use this approach to analyze virtual machines running on real machines.

Thanks! Questions?

Extra Slides

Memory Profiling Approaches

Approach             Accurate  Detailed  Low overhead  Page walks
Instrument           √         √         ×             ×
Simulator            *         √         ×             ×
Performance Counter  √         ×         √             *
Compiler             √         √         √             ×
Hybrid H/S           √         √         √             √

Note: √ = Yes, × = No, * = Maybe

Reverse Page Table

• Physical address → (pid, virtual address)

[Figure: a table indexed by physical page number (0 to N-1); each entry holds one or more (Vaddr, pid) pairs, e.g. Vaddr1 pid1 ... Vaddrk pidk]
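
A structure like this could be built by inverting the per-process page tables. The sketch below is a guess at the idea, not the kernel-side code: `build_reverse_table` and the toy page tables are hypothetical, and page numbers stand in for full addresses.

```python
from collections import defaultdict

def build_reverse_table(page_tables):
    """Invert {pid: {vpn: pfn}} into {pfn: [(vpn, pid), ...]}."""
    reverse = defaultdict(list)
    for pid, table in page_tables.items():
        for vpn, pfn in table.items():
            reverse[pfn].append((vpn, pid))
    return dict(reverse)

# Hypothetical dumped page tables for two processes that share frame 0x20
page_tables = {
    1: {0x1F05: 0x10, 0x7F00: 0x20},
    2: {0x7F80: 0x20},
}
reverse = build_reverse_table(page_tables)
```

Keeping a list per physical page is what allows a shared frame to map back to several (Vaddr, pid) pairs, as the figure suggests.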

Validation

Obj  Read       Write      Rate (measured R:W)  Per (expected R:W)  Error
a0   4,194,370  0          4:0                  4:0                 0%
a1   4,194,310  1,048,576  4:1                  4:1                 0%
a2   4,194,369  2,096,927  4:2                  4:2                 0%
a3   4,194,303  3,087,379  4:2.94               4:3                 2.04%
a4   4,194,436  4,149,586  4:3.96               4:4                 1.01%

Access objects with different patterns:
• a0: all read accesses, forward
• a1: 3/4 read and 1/4 write accesses, forward
• a2: 2/4 read and 2/4 write accesses, forward
• a3: 1/4 read and 3/4 write accesses, backward
• a4: all write accesses, backward

Size 256MB, access step 64B, requests: 4M
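
The Rate and Error columns can be reproduced from the Read and Write counts. The formula below is inferred from the numbers in the table, not stated on the slide: the measured ratio is normalized to 4 reads and rounded to two decimals, and the error is taken relative to that rounded measured value. The function name `ratio_and_error` is introduced for illustration.

```python
def ratio_and_error(reads, writes, expected_writes):
    """Return (writes per 4 reads, error in percent) for one object."""
    measured = round(4 * writes / reads, 2)
    if measured == expected_writes:
        return measured, 0.0
    error = abs(expected_writes - measured) / measured * 100
    return measured, round(error, 2)

# (reads, writes, expected writes per 4 reads), taken from the table
rows = {
    "a0": (4_194_370, 0, 0),
    "a1": (4_194_310, 1_048_576, 1),
    "a2": (4_194_369, 2_096_927, 2),
    "a3": (4_194_303, 3_087_379, 3),
    "a4": (4_194_436, 4_149_586, 4),
}
results = {name: ratio_and_error(*row) for name, row in rows.items()}
```

Under this reading, a3 gives 4:2.94 with a 2.04% error and a4 gives 4:3.96 with a 1.01% error, matching the table.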

HMTT Configuration Space

• A reserved physical memory region
• Can be accessed by source code and binary code
