PATH: Page Access Tracking Hardware to Improve Memory Management
Reza Azimi, Livio Soares, Michael Stumm, Tom Walsh, and Angela Demke Brown
University of Toronto, Canada
Page Access Tracking Challenge
Storage management research:
Many sophisticated algorithms; most require accurate knowledge of the memory access trace.
Adopted mostly for file systems or databases; not straightforward for virtual memory.
Problem: limited page access tracking makes it hard to measure either reuse distance or temporal locality.
Conventional access tracking mechanisms:
Monitoring page faults: most page accesses are missed.
Scanning page-table reference bits: high scanning overhead => low scanning frequency.
Page Access Tracking Challenge (cont’d)
Access tracking with performance counters:
Statistical data sampling favours only hot pages; hard to track reuse distance or temporal locality.
Recording TLB misses: high overhead, since TLBs are small (TLB misses are very frequent) and TLB-miss handling is performance-critical.
Hardware approach [Zhou et al., ASPLOS’04]:
+ Effective for its purpose (but inflexible).
- Impractical hardware resource requirements: ~1 MB of hardware buffer per 1 GB of physical memory!
Software approach [Yang et al., OSDI’06]:
Divides pages into active and inactive sets, page-protecting members of the inactive set.
- Overhead can still be too high.
Page Access Tracking in Software
[Chart: performance of adaptive page replacement for FFT vs. runtime overhead of page access tracking in software. With a large active set the overhead is 10% but performance is poor; getting acceptable performance costs 90% overhead.]
Page Access Tracking Hardware (PATH)
Advantages:
Extra hardware resources required are small (around 10 KB).
Off the common path.
Scalable (does not grow with physical memory).
[Diagram: the CPU core’s virtual address is looked up in the TLB; on a TLB miss it goes to the page tables and is also appended to the Page Access Buffer, whose overflow interrupt spills entries into the Page Access Log.]
Information Provided by PATH
Raw form.
Abstraction: precise LRU stack.
Abstraction: Miss Rate Curve (MRC).
Basic Abstraction: LRU Stack
Accessed and updated for each entry in the Page Access Log.
Implementation:
Lookup: a page-table-like structure gives O(1) lookup time.
Update: a doubly linked list; a few pointers are updated for each page access.
[Diagram: the stack ordered from most recently accessed at the top to least recently accessed at the bottom.]
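The two-part structure above can be sketched in a few lines of Python. This is an illustrative sketch, not the paper’s implementation: a dict stands in for the page-table-like lookup structure, and a doubly linked list makes each update a handful of pointer operations.

```python
class Node:
    """One page's entry in the LRU stack."""
    __slots__ = ("page", "prev", "next")

    def __init__(self, page):
        self.page = page
        self.prev = self.next = None


class LRUStack:
    def __init__(self):
        self.table = {}    # page -> Node: plays the role of the page-table-like lookup, O(1)
        self.head = None   # most recently accessed
        self.tail = None   # least recently accessed

    def access(self, page):
        """Process one entry from the Page Access Log: move the page to the top."""
        node = self.table.get(page)
        if node is None:
            node = Node(page)
            self.table[page] = node
        else:
            self._unlink(node)
        self._push_front(node)

    def _unlink(self, node):
        # Splice the node out of the doubly linked list (a few pointer updates).
        if node.prev: node.prev.next = node.next
        else:         self.head = node.next
        if node.next: node.next.prev = node.prev
        else:         self.tail = node.prev
        node.prev = node.next = None

    def _push_front(self, node):
        # Insert at the most-recently-accessed end.
        node.next = self.head
        if self.head: self.head.prev = node
        self.head = node
        if self.tail is None:
            self.tail = node
```

After accesses 1, 2, 3, 1 the stack reads 1, 3, 2 from most to least recently accessed.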
Basic Abstraction: Miss Rate Curve (MRC)
[Plot: capacity misses vs. memory size.]
Basic info: the number of misses for a given memory size in a period of time.
Basic use: estimating the “memory needs” of an application.
Computing the MRC Online: Mattson’s Stack Algorithm
For LRU: memory sizes < LRU distance: miss; memory sizes >= LRU distance: hit.
[Diagram: the LRU stack from most recently accessed to least recently accessed; a page access is moved to the top, and its previous depth gives its LRU distance.]
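Since an access hits in an LRU-managed memory of m pages exactly when its stack distance is at most m, one pass over the trace yields the whole MRC. A minimal Python sketch of Mattson’s algorithm (using a plain list for clarity, not the O(1) lookup structure the slides describe):

```python
def lru_distances(trace):
    """For each access, return its LRU stack distance (1-based),
    or None for a first-touch (cold) access, i.e. infinite distance."""
    stack = []                          # index 0 = most recently used
    dists = []
    for page in trace:
        if page in stack:
            d = stack.index(page) + 1   # depth in the stack = reuse distance
            stack.remove(page)
        else:
            d = None                    # cold miss
        stack.insert(0, page)           # page moves to the top of the stack
        dists.append(d)
    return dists


def miss_rate_curve(trace, max_size):
    """misses[m-1] = capacity misses under LRU if memory held m pages."""
    dists = lru_distances(trace)
    return [
        # an access misses when its stack distance exceeds the memory size
        sum(1 for d in dists if d is None or d > m)
        for m in range(1, max_size + 1)
    ]
```

For the trace 1, 2, 1, 3, 2 the distances are None, None, 2, None, 3, so memories of 1, 2, and 3 pages incur 5, 4, and 3 misses respectively.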
Runtime Overhead Tradeoff
The larger the Page Access Buffer (active set), the more page accesses are filtered:
+ Less run-time overhead.
- A less accurate page access trace.
Runtime Overhead, Example: FFT
[Chart: runtime overhead (%) of PATH vs. the software approach, for active-set sizes of 128, 512, 2K, 4K, 8K, 16K, and 32K entries; y-axis from 0 to 200%.]
Runtime Overhead, Example: LU non-contiguous
[Chart: runtime overhead (%) of PATH vs. the software approach, for active-set sizes of 128, 512, 2K, 4K, 8K, 16K, and 32K entries; y-axis from 0 to 100%.]
Runtime Overhead Summary
Overall, a 2K-entry Page Access Buffer seems to be the best point in the tradeoff between performance and runtime overhead.
PATH’s overhead is less than 6% across a wide variety of applications, and negligible in most cases.
Case 1: Adaptive Page Replacement
Region-based page replacement:
Use different replacement policies for different regions of the virtual address space.
Rationale: each region is likely to contain a data structure with a fairly stable access pattern.
Low Inter-reference Recency Set (LIRS):
Handles sequential and looping patterns.
Requires tracking page accesses; originally developed for file system caching.
Easily enabled by the PATH-generated information.
Region-based Replacement
Using the MRC for comparison:
[Plot: miss rate vs. memory size under LRU and MRU replacement.]
Region-based Replacement (cont’d)
Dividing memory among regions:
Minimize the total miss rate by giving memory to the regions that have more “benefit per page”.
[Plots: capacity misses vs. memory size for Region 1 and Region 2.]
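The benefit-per-page idea can be sketched as a greedy allocator over the per-region MRCs. This is an assumed illustration, not the paper’s allocator: each page in turn goes to the region whose MRC drops the most for one extra page. (The greedy choice is only guaranteed optimal when the MRCs are convex.)

```python
def partition_memory(mrcs, total_pages):
    """Greedily divide total_pages among regions to minimize total misses.

    mrcs[r][m] = capacity misses of region r when given m pages (m = 0..len-1).
    Returns the per-region page allocation."""
    alloc = [0] * len(mrcs)
    for _ in range(total_pages):
        best, best_gain = None, 0
        for r, mrc in enumerate(mrcs):
            m = alloc[r]
            if m + 1 < len(mrc):
                gain = mrc[m] - mrc[m + 1]   # misses saved by one more page
                if gain > best_gain:
                    best, best_gain = r, gain
        if best is None:
            break                            # no region benefits from more memory
        alloc[best] += 1
    return alloc
```

With hypothetical MRCs [10, 5, 2, 2] and [10, 9, 8, 7] and 3 pages to divide, the first region gets 2 pages (big early gains) and the second gets 1.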
Simulation Results
LU-contiguous (SPLASH2)
Simulation Results
BT (NAS Benchmark)
Case 2: Prefetching
Spatial locality-based: prefetch pages spatially adjacent to the faulted page.
Advantages: simple, easy to implement, and effective in many cases.
Major drawback: oblivious to non-spatial access patterns.
Temporal locality-based: prefetch pages that are regularly accessed together; use PATH to track the temporal locality of pages.
Temporal Locality-based Prefetching
Page Proximity Graph (PPG): each page is a node; there is an edge from p to q if q is regularly accessed shortly after p (temporal locality).
PPG update: add page q to p’s proximity set if q repeatedly appears in the LRU stack in close proximity to p.
Basic prefetching scheme: breadth-first traversal starting from the faulted page.
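The PPG construction and the breadth-first prefetch can be sketched as follows. This is an assumed illustration; the `window`, `threshold`, and `budget` parameters are hypothetical knobs, not values from the paper.

```python
from collections import deque


def build_ppg(trace, window=2, threshold=2):
    """Build a Page Proximity Graph: add edge p -> q once q has followed p
    within `window` accesses at least `threshold` times."""
    counts = {}
    ppg = {}
    for i, p in enumerate(trace):
        for q in trace[i + 1 : i + 1 + window]:
            if q == p:
                continue
            counts[(p, q)] = counts.get((p, q), 0) + 1
            if counts[(p, q)] >= threshold:
                ppg.setdefault(p, set()).add(q)   # q is "regularly" near p
    return ppg


def prefetch_set(ppg, faulted, budget=4):
    """Breadth-first traversal from the faulted page, up to `budget` pages."""
    seen, order = {faulted}, []
    queue = deque([faulted])
    while queue and len(order) < budget:
        p = queue.popleft()
        for q in sorted(ppg.get(p, ())):   # sorted only for deterministic order
            if q not in seen:
                seen.add(q)
                order.append(q)
                queue.append(q)
                if len(order) == budget:
                    break
    return order
```

For the looping trace 1, 2, 3, 1, 2, 3 with window 1 and threshold 2, the graph holds edges 1 -> 2 and 2 -> 3, so a fault on page 1 prefetches pages 2 and 3.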
Prefetching
LU non-contiguous (SPLASH2)
Conclusions
Page Access Tracking Hardware:
Small (10 KB in size), low-overhead, and generic.
Cases studied: adaptive page replacement, process memory allocation (see paper), and prefetching.
Significant performance improvement can be achieved by tracking page accesses.
Future Directions
Other case studies: NUMA page placement; superpage management.
Per-thread page access tracking: augmenting page accesses with thread info.
Multiprocessor issues: combining traces collected on multiple CPUs.
Questions