A Lightweight Hybrid Hardware/Software Approach for Object-Relative Memory Profiling. Licheng Chen, Zehan Cui, Yungang Bao, Mingyu Chen, Yongbing Huang, and Guangming Tan. Institute of Computing Technology (ICT), Chinese Academy of Sciences (CAS). ISPASS 2012, April 2, 2012.
A Lightweight Hybrid Hardware/Software Approach for Object-Relative Memory Profiling
Licheng Chen, Zehan Cui, Yungang Bao, Mingyu Chen, Yongbing Huang, and Guangming Tan
Institute of Computing Technology (ICT) Chinese Academy of Sciences (CAS)
ISPASS 2012, April 2, 2012
Background
• Memory behavior is a key factor in the performance of a program.
• Understanding memory behavior is important for identifying the bottlenecks of both architectures and applications.
• For example:
– The TLB is an essential component of the memory system
– Applications’ working sets tend to grow larger and larger, leading to serious TLB misses
– Study 1: TLB misses can degrade system performance by 5~14% [Bhargava’08]
– Study 2: a large number of TLB misses in multi-threaded programs are redundant and predictable, which implies optimization potential [Bhattacharjee’08]
Done by memory profiling
Memory Profiling
• Memory profiling collects memory behavior information during the execution of programs.
• Profiling can be performed
– for different hardware components: TLB / Cache / DRAM
– at different software levels: Objects (Array, List, etc.) / Function / Application / Whole System
Object Memory Profiling
• An object refers to a group of data stored as a unit [Wu’04]
– Distinguish regular patterns from mixed and irregular traces
• Valuable for optimization
– Memory trace compression
– Data layout
– Object-level prefetching
– Cache partition [Soft-OLP, PACT 2009]
[Figure: Whole System Traces, Application Traces, Object Trace; from Irregular to Regular]
Current Profiling Approaches
• Existing approaches
– Compiler-driven: requires re-compile/re-link and source code
– Instrumentation: heavy overhead
– Simulation: accuracy problem, slow
– Performance Counter: lack of detailed information
• None of them can observe page table walks due to TLB misses
• We propose a hybrid hardware/software approach for object memory profiling
– Accurate: real application & real system
– Lightweight
– Track page table walks at object-level
Outline
• Background
• Design and Implementation
• Experimental Results
• Conclusion
An Overview
[Figure: Physical Address Trace (0x398f24a, 0x398f24b, 0x398f24c, …) → Virtual Address Trace (0x1f05000, 0x1f06000, 0x1f07000, …) → Object Access Pattern (Matrix, VA: 0x1f05000)]
HMTT
• Hybrid Memory Trace Toolkit
– A DDR3 SDRAM compatible memory trace monitoring system
– Adopts hardware snooping technology
[Figure: HMTT board; DIMM plugged on the other side; PCIE Cable Connector]
Memory Trace: <time_stamp, r/w, phy_addr>
Advantages:
• Platform independent
• Negligible overhead
• Full-system real memory traces, including OS, page table walks
Challenge (1)
• How to translate physical address trace to virtual address trace of a specific process?
• Modify OS kernel to obtain page table
• Lookup a phy_addr in the dumped page table
• Generate virtual trace of each process
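The lookup step can be sketched in Python as follows. This is a minimal illustration, not HMTT's actual implementation: the layout of the dumped page table (`{pid: {virtual page number: physical page number}}`), the 4 KB page size, and the names `build_reverse_table` and `translate` are all assumptions made for the example.

```python
from collections import defaultdict

PAGE_SHIFT = 12                      # assumes 4 KB pages
PAGE_MASK = (1 << PAGE_SHIFT) - 1

def build_reverse_table(dumped_page_tables):
    """Invert per-process page tables.

    dumped_page_tables: {pid: {virtual_page_number: physical_page_number}}
    (hypothetical layout). Returns {physical_page_number: [(pid, vpn), ...]}.
    """
    reverse = defaultdict(list)
    for pid, table in dumped_page_tables.items():
        for vpn, ppn in table.items():
            reverse[ppn].append((pid, vpn))
    return reverse

def translate(reverse, phy_addr, pid):
    """Return the virtual address of phy_addr in process pid, or None."""
    offset = phy_addr & PAGE_MASK
    for owner, vpn in reverse.get(phy_addr >> PAGE_SHIFT, ()):
        if owner == pid:
            return (vpn << PAGE_SHIFT) | offset
    return None

# Two processes sharing one physical page at different virtual addresses.
tables = {100: {0x1F05: 0x398F}, 200: {0x7000: 0x398F}}
rev = build_reverse_table(tables)
```

Applying `translate` to every record of the physical trace, once per process of interest, yields that process's virtual address trace.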
Challenge (2)
• How to synchronize hardware and software when a page table update occurs in the kernel?
• Physical page allocation/free in the kernel
• Trigger annotations in OS VM module
• Update dumped page table
• Send a sync_tag to hardware
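One way to picture the synchronization is to replay the hardware trace offline and apply one logged page table update each time a sync_tag is encountered. The sketch below assumes exactly that one-to-one pairing; the record shapes (`("sync",)`, `("mem", phy_addr)`), the update log format, and the function name `replay` are hypothetical, since the slides do not specify how the sync_tag is encoded.

```python
PAGE_SHIFT = 12                      # assumes 4 KB pages
PAGE_MASK = (1 << PAGE_SHIFT) - 1

def replay(trace, updates, page_table):
    """Replay a trace while keeping the dumped page table in sync.

    trace:      records in hardware order, ("sync",) or ("mem", phy_addr)
    updates:    kernel-side log, ("alloc" | "free", vpn, ppn), in order
    page_table: {ppn: vpn}, mutated in place
    Returns the virtual address for each "mem" record (None if unmapped).
    """
    pending = iter(updates)
    virtual_trace = []
    for record in trace:
        if record[0] == "sync":
            # Each sync_tag is assumed to mark exactly one logged update.
            op, vpn, ppn = next(pending)
            if op == "alloc":
                page_table[ppn] = vpn
            else:
                page_table.pop(ppn, None)
        else:
            phy_addr = record[1]
            vpn = page_table.get(phy_addr >> PAGE_SHIFT)
            virtual_trace.append(
                None if vpn is None
                else (vpn << PAGE_SHIFT) | (phy_addr & PAGE_MASK)
            )
    return virtual_trace

trace = [("mem", 0x5010), ("sync",), ("mem", 0x5010), ("sync",), ("mem", 0x5010)]
updates = [("alloc", 0x1F05, 0x5), ("free", 0x1F05, 0x5)]
result = replay(trace, updates, {})
```

The same physical address resolves differently before, between, and after the two updates, which is why the ordering information carried by the sync_tag matters.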
Challenge (3)
• How to translate virtual addresses to objects without modifying source code?
[Figure: matrix = malloc(0x1000) is replaced by matrix = mymalloc(0x1000), which records the object "matrix" and its range in the virtual address space in the Object-VA Mapping Table]
• The role of malloc() is to map VA to object
• Use dynamic library overwrite to replace malloc()
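The mapping table itself can be modeled as a sorted list of allocated ranges. The Python sketch below only models the table; in practice the malloc() replacement would live in a native shared library. The class name `ObjectMap`, its methods, and the assumption that allocations do not overlap are illustrative and not taken from the slides.

```python
import bisect

class ObjectMap:
    """Object-VA mapping table: virtual address ranges -> object names."""

    def __init__(self):
        self._starts = []    # sorted start addresses
        self._entries = []   # (start, size, name), parallel to _starts

    def record_alloc(self, name, start, size):
        """Called by the (hypothetical) malloc wrapper for each allocation."""
        i = bisect.bisect_left(self._starts, start)
        self._starts.insert(i, start)
        self._entries.insert(i, (start, size, name))

    def lookup(self, vaddr):
        """Return the name of the object containing vaddr, or None."""
        i = bisect.bisect_right(self._starts, vaddr) - 1
        if i >= 0:
            start, size, name = self._entries[i]
            if vaddr < start + size:
                return name
        return None

objects = ObjectMap()
objects.record_alloc("matrix", 0x1F05000, 0x1000)   # matrix = malloc(0x1000)
```

Binary search keeps each lookup logarithmic in the number of live objects, which matters when every record of a long trace has to be attributed.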
Put them all together
[Figure: Physical Address Trace (containing sync_tag and page walk records) → Dumped Page Table → Virtual Address Trace → Object-VA Mapping Table → Object Access Pattern (Matrix, VA: 0x1f05000)]
Use the page table to distinguish three types of memory access:
• Sync_tag → update the page table
• Access to the page table itself → page table walk due to TLB miss
• Other memory access → virtual address
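The three-way distinction can be written as a small classifier. This is a sketch under stated assumptions: records are shaped as `("sync",)` or `("mem", phy_addr)`, the set of physical frames that hold page tables is known from the dump, and the name `classify` and the extra "unmapped" outcome are additions made for the example.

```python
PAGE_SHIFT = 12   # assumes 4 KB pages

def classify(record, page_table_frames, mapped_frames):
    """Classify one trace record.

    page_table_frames: physical frames that store page table entries
    mapped_frames:     physical frames mapped into the profiled process
    """
    if record[0] == "sync":
        return "page_table_update"
    frame = record[1] >> PAGE_SHIFT
    if frame in page_table_frames:
        return "page_walk"          # the hardware walker read a page table entry
    if frame in mapped_frames:
        return "virtual_address"    # ordinary access, translatable to a VA
    return "unmapped"               # e.g. another process or the kernel

pt_frames = {0x10}
data_frames = {0x398F}
```

Page walk records can then be attributed to an object through the virtual page that the accessed page table entry describes, which is how per-object page walk counts become possible.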
Evaluation Methodology
Processor: Intel Xeon E5504, 2.0GHz, 2 sockets, 4 cores per socket (8 cores in total)
Private Cache:
– L1 D-Cache: 32KB, 8-way, 64Byte/line
– L1 I-Cache: 32KB, 4-way, 64Byte/line
– L2: 256KB, 8-way, 64Byte/line
Shared Cache: L3, 4MB, 16-way, 64Byte/line
TLB (private):
– DTLB0: 64 entries for 4-KByte pages, 32 entries for huge pages (2MByte)
– TLB1: 512 entries for 4-KByte pages
Memory: DDR3-800 RDIMM, dual-rank, plugged into Socket 0, 4GB (0.25GB reserved for HMTT configuration and buffer, 3.75GB available to the system)
Operating System: CentOS 5.3, Linux kernel 2.6.32.18
Benchmarks:
– Multithreaded PARSEC 2.1
– A custom hybrid MPI/pthread implementation of BFS from Graph500-1.2
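As a rough aid for reading the TLB figures above, the memory each TLB level can cover without a miss (its "reach") is simply entries times page size. The short calculation below uses only the entry counts and page sizes listed in the configuration; the helper name `tlb_reach` is introduced for illustration.

```python
KB = 1024
MB = 1024 * KB

def tlb_reach(entries, page_size):
    """Bytes of memory addressable through a TLB without missing."""
    return entries * page_size

reach = {
    "DTLB0, 4 KB pages": tlb_reach(64, 4 * KB),
    "DTLB0, 2 MB pages": tlb_reach(32, 2 * MB),
    "TLB1, 4 KB pages": tlb_reach(512, 4 * KB),
}

for level, size in reach.items():
    print(f"{level}: {size // KB} KB")
```

With 4 KB pages the first-level reach is 256 KB and the second-level reach is 2 MB, while the huge-page entries cover 64 MB, which suggests why moving a large, frequently walked object to huge pages can reduce TLB misses.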
Validation
• For the SpMV benchmark (CSR): y = ax * xhost
– Our system is able to distinguish regular access patterns from irregular patterns
• Micro-benchmark:
– The error is less than 2%
Overhead
• Two main sources of overhead:
– Dumping page table traces: +dump_pt
– Dumping object-VA mapping: +dump_obj
• Monitoring only objects >= 4KB, which account for most memory references
[Figure: normalized overhead (y-axis, 0.96 to 1.06) for Origin, +dump_pt, and +dump_obj; annotations: <1%, <2%]
Case Study 1: BFS (Breadth-First Search)
• The column object accounts for about 71% of page walks → key object
• Optimization: use huge pages for the column object
– Speedup: about 12% for 8-thread, 8% for 128-thread
[Figure: normalized speedup (y-axis, 0.8 to 1.4) vs. number of threads (1 to 128), w/o hugetlb and w/ hugetlb; annotation: 8.18%]
[Figure: percentage of page walks vs. number of threads (1, 2, 4, 32, 128) for objects rowstarts, column, pred, oldq, newq, visited]
Case Study 2: Canneal (PARSEC)
• Cache-aware simulated annealing (SA) to minimize the routing cost of a chip design
• Two objects contribute most of the memory accesses: _elements and _location
[Figure: Main Objects in Canneal; number of memory requests (y-axis, 0E+00 to 1E+09) for _elements_r, _elements_w, _location_r, _location_w, and others, with 1, 2, 4, and 8 threads]
The number of memory accesses barely changes as the number of threads increases.
Case Study 2: Canneal
[Figure: number of page walks (y-axis, 0E+00 to 3E+08) vs. number of threads (1, 2, 4, 8) for total, _elements, and _locations]
• The _elements object contributes most of the increase in page walks
• Put the _elements object into huge pages to reduce TLB misses
– Speedup: about 5% for 8-thread
[Figure: normalized speedup (y-axis, 0.9 to 1.1) vs. number of threads (1, 2, 4, 8), w/o hugetlb and w/ hugetlb]
A Visual Demo of the HMTT
Conclusion
• We have designed and implemented a hybrid hardware/software approach to conduct object-relative memory profiling.
– Accurate: real application & real system
– Lightweight
– Track page table walks at object-level
• We demonstrate two case studies to show how the approach can help users better understand memory behavior and optimize performance.
• We intend to use this approach to analyze virtual machines running on real machines.
Thanks! & Questions?
Extra Slides
Memory Profiling Approaches
Approach             Accurate  Detailed  Low overhead  Page walks+
Instrument           √         √         ×             ×
Simulator            *         √         ×             ×
Performance Counter  √         ×         √             *
Compiler             √         √         √             ×
Hybrid H/S           √         √         √             √
Note: √-Yes, ×-No, *-Maybe
Reverse Page Table
• Physical address → pid, virtual address
[Figure: table indexed by physical page number (0, 1, 2, 3, ..., N-1); each index holds a list of (Vaddr, pid) pairs, e.g. (Vaddr1, pid1) ... (Vaddrk, pidk)]
Validation
Obj  Read       Write      Rate    Per  Error
a0   4,194,370  0          4:0     4:0  0%
a1   4,194,310  1,048,576  4:1     4:1  0%
a2   4,194,369  2,096,927  4:2     4:2  0%
a3   4,194,303  3,087,379  4:2.94  4:3  2.04%
a4   4,194,436  4,149,586  4:3.96  4:4  1.01%
Access objects with different patterns:
• a0: all read accesses, forward
• a1: 3/4 read and 1/4 write accesses, forward
• a2: 2/4 read and 2/4 write accesses, forward
• a3: 1/4 read and 3/4 write accesses, backward
• a4: all write accesses, backward
[Figure: access patterns of a0 and a4]
Size: 256MB, access step: 64B, requests: 4M
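The Rate and Error columns can be reproduced from the raw read and write counts. The sketch below is a reading of the table, not a formula stated on the slides: it assumes Rate is the number of writes per 4 reads rounded to two decimals, that the Per column is the expected ratio, and that Error is the deviation of the expected value relative to the measured one. The function names are hypothetical.

```python
def measured_ratio(reads, writes):
    """Writes per 4 reads, rounded to two decimals (assumed Rate column)."""
    return round(writes / (reads / 4), 2)

def relative_error(measured, expected):
    """Percentage deviation of expected from measured; 0 when they agree."""
    if measured == expected:
        return 0.0
    return round(abs(expected - measured) / measured * 100, 2)

# (object, reads, writes, expected writes per 4 reads), taken from the table
rows = [
    ("a0", 4_194_370, 0, 0),
    ("a1", 4_194_310, 1_048_576, 1),
    ("a2", 4_194_369, 2_096_927, 2),
    ("a3", 4_194_303, 3_087_379, 3),
    ("a4", 4_194_436, 4_149_586, 4),
]

for name, reads, writes, expected in rows:
    rate = measured_ratio(reads, writes)
    print(name, f"4:{rate:g}", f"{relative_error(rate, expected)}%")
```

Under these assumptions the computed values agree with the table for all five objects, including the 2.04% and 1.01% entries for a3 and a4.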
HMTT Configuration Space
• A reserved physical memory region
• Can be accessed by source code and binary code