Managing Distributed, Shared L2 Caches through OS-Level Page Allocation
Sangyeun Cho and Lei Jin
Dept. of Computer Science, University of Pittsburgh
Dec. 13 ’06 – MICRO-39
Multicore distributed L2 caches
L2 caches typically sub-banked and distributed
• IBM Power4/5: 3 banks
• Sun Microsystems T1: 4 banks
• Intel Itanium2 (L3): many “sub-arrays”
(Distributed L2 caches + switched NoC) → NUCA
Hardware-based management schemes
• Private caching
• Shared caching
• Hybrid caching
[Figure: tiled multicore; each tile has a processor core, a local L2 cache slice, and a router]
Private caching
1. L1 miss
2. L2 access
• Hit
• Miss
3. Access directory
• A copy on chip
• Global miss
+ short hit latency (always local)
– high on-chip miss rate
– long miss resolution time
– complex coherence enforcement
Shared caching
1. L1 miss
2. L2 access
• Hit
• Miss
+ low on-chip miss rate
+ straightforward data location
+ simple coherence (no replication)
– long average hit latency
Our work
Placing “flexibility” as the top design consideration
OS-level data to L2 cache mapping
• Simple hardware based on shared caching
• Efficient mapping maintenance at page granularity
Demonstrating the impact using different policies
Talk roadmap
Data mapping, a key property
Flexible page-level mapping
• Goals
• Architectural support
• OS design issues
Management policies
Conclusion and future work
Data mapping, the key
Data mapping = deciding data location (i.e., cache slice)
Private caching
• Data mapping determined by program location
• Mapping created at miss time
• No explicit control
Shared caching
• Data mapping determined by address
  slice number = (block address) % (Nslice)
• Mapping is static
• Cache block installation at miss time
• No explicit control
• (Run-time can impact location within slice)
Mapping granularity = block
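To make the contrast concrete, here is a minimal C sketch of both granularities; the constants and function names are assumptions for illustration, not figures from the talk:

```c
#include <stdint.h>

#define BLOCK_BITS 6    /* 64B cache blocks (assumed) */
#define PAGE_BITS  12   /* 4kB pages (assumed) */
#define N_SLICE    16   /* one cache slice per tile */

/* Conventional shared caching: slice picked per block. */
static inline unsigned slice_of_block(uint64_t addr)
{
    return (unsigned)((addr >> BLOCK_BITS) % N_SLICE);
}

/* Page-granularity mapping: slice picked per page, so the OS
 * steers placement simply by choosing which physical page
 * (and hence which page_num) it hands out. */
static inline unsigned slice_of_page(uint64_t addr)
{
    return (unsigned)((addr >> PAGE_BITS) % N_SLICE);
}
```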
Changing cache mapping granularity
Memory blocks → memory pages
• Miss rate?
• Impact on existing techniques? (e.g., prefetching)
• Latency?
Observation: page-level mapping
[Figure: memory pages of Program 1 and Program 2 steered to different cache slices by OS page allocation]
Mapping data to different $$ feasible
Key: OS page allocation policies
Flexible
Goal 1: performance management
Proximity-aware data mapping
Goal 2: power management
Usage-aware cache shut-off
Goal 3: reliability management
On-demand cache isolation
Goal 4: QoS management
Contract-based cache allocation
Architectural support

On an L1 miss, the cache slice is determined from the data address: [page_num | page offset]

Method 1: “bit selection”
• slice_num = (page_num) % (Nslice)
• data address: [other bits | slice_num | page offset]

Method 2: “region table”
• slice_num = slice_numx if regionx_low ≤ page_num ≤ regionx_high
• reg_table: entries of (regionx_low, regionx_high) → slice_numx

Method 3: “page table (TLB)”
• page_num ↔ slice_num
• each TLB entry (vpage_num → ppage_num) carries a slice_num
Simple hardware support enough
Combined scheme feasible
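A rough C model of the three methods; the structure and field names are illustrative, not taken from the paper:

```c
#include <stdint.h>
#include <stdbool.h>

#define PAGE_BITS 12   /* 4kB pages (assumed) */
#define N_SLICE   16
#define N_REGION  8    /* assumed region-table size */

/* Method 1: bit selection -- slice bits come straight from
 * the low-order page-number bits. */
static unsigned slice_bit_selection(uint64_t addr)
{
    uint64_t page_num = addr >> PAGE_BITS;
    return (unsigned)(page_num % N_SLICE);
}

/* Method 2: region table -- a few page-number ranges, each
 * pinned to a slice; anything else falls back to bit selection. */
struct region_entry {
    uint64_t low, high;     /* inclusive page_num range */
    unsigned slice_num;
    bool     valid;
};
static struct region_entry reg_table[N_REGION];

static unsigned slice_region_table(uint64_t addr)
{
    uint64_t page_num = addr >> PAGE_BITS;
    for (int i = 0; i < N_REGION; i++)
        if (reg_table[i].valid && page_num >= reg_table[i].low
                               && page_num <= reg_table[i].high)
            return reg_table[i].slice_num;
    return slice_bit_selection(addr);
}

/* Method 3: page table (TLB) -- each translation carries its
 * own slice number, allowing fully arbitrary per-page mapping. */
struct tlb_entry {
    uint64_t vpage_num;
    uint64_t ppage_num;
    unsigned slice_num;    /* looked up alongside the translation */
};
```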
Some OS design issues
Congruence group CG(i)
• Set of physical pages mapped to slice i
• A free list for each i → multiple free lists

On each page allocation, consider
• Data proximity
• Cache pressure
• (e.g.) Profitability function P = f(M, L, P, Q, C) — sketched in code below
  M: miss rates
  L: network link status
  P: current page allocation status
  Q: QoS requirements
  C: cache configuration
Impact on process scheduling
Leverage existing frameworks
• Page coloring – multiple free lists
• NUMA OS – process scheduling & page allocation
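A hedged sketch of how the congruence-group free lists and the profitability function might fit together in the allocator; the per-slice statistics and the particular form of f are placeholders, since the slide only names the inputs:

```c
#define N_SLICE 16

/* Congruence group CG(i): free list of the physical pages that
 * map to slice i -- one list per slice instead of one global list. */
struct page { struct page *next; };
static struct page *cg_free[N_SLICE];

/* Per-slice statistics the OS is assumed to track (stand-ins). */
static double miss_rate[N_SLICE];   /* M: miss rates              */
static double link_cost[N_SLICE];   /* L: network link status     */
static double pressure[N_SLICE];    /* P: page allocation status  */
static double qos_debt[N_SLICE];    /* Q: QoS requirements        */
static double cfg_bias[N_SLICE];    /* C: cache configuration     */

/* One illustrative profitability P = f(M, L, P, Q, C): prefer
 * close, lightly loaded slices. The real f is policy-defined. */
static double profitability(int s)
{
    return cfg_bias[s] - miss_rate[s] - link_cost[s]
         - pressure[s] - qos_debt[s];
}

/* On each allocation, pick the most profitable slice that still
 * has free pages, then pop a page from its congruence group. */
struct page *alloc_page(void)
{
    int best = -1;
    for (int s = 0; s < N_SLICE; s++) {
        if (!cg_free[s]) continue;
        if (best < 0 || profitability(s) > profitability(best))
            best = s;
    }
    if (best < 0) return 0;            /* out of memory */
    struct page *pg = cg_free[best];
    cg_free[best] = pg->next;
    return pg;
}
```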
Working example
[Figure: a program running on a 16-tile grid (slices 0–15); on each page allocation the OS evaluates the profitability function per slice, e.g., P(4) = 0.9, P(6) = 0.8, P(5) = 0.7, … and later P(1) = 0.95, P(6) = 0.9, P(4) = 0.8, …; successive pages are placed in slices 4, 1, and 6]
Static vs. dynamic mapping
• Static: program information needed (e.g., profile)
• Dynamic: proper run-time monitoring needed
Page mapping policies
Simulating private caching
For a page requested from a program running on core i, map the page to cache slice i
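Each policy in this part of the talk can be phrased as one allocator hook that picks a slice when a program on core i requests a page; a minimal C sketch of the private-caching case (names are mine, not the paper’s):

```c
#define N_SLICE 16

/* Simulating private caching: every page requested by a program
 * on core i is allocated from slice i's congruence group. */
static unsigned pick_slice_private(unsigned core)
{
    return core;    /* always the requesting core's local slice */
}
```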
[Chart: average L2 cache latency (cycles) vs. L2 cache slice size (128kB, 256kB, 512kB) for SPEC2k INT and SPEC2k FP, comparing private caching against the OS-based scheme]
Simulating private caching is simple
Similar or better performance
Simulating “large” private caching
For a page requested from a program running on core i, map the page to cache slice i; also spread pages
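The slide does not specify how pages are spread; one plausible reading, sketched here under that assumption, keeps pages local until the local slice fills up and then spills to the least-loaded slice:

```c
#define N_SLICE 16
#define PRESSURE_LIMIT 0.9   /* assumed spill threshold */

static double pressure[N_SLICE];   /* fraction of slice in use */

/* Simulating "large" private caching: local first, then spread. */
static unsigned pick_slice_large_private(unsigned core)
{
    if (pressure[core] < PRESSURE_LIMIT)
        return core;                        /* local slice has room */
    unsigned best = core;
    for (unsigned s = 0; s < N_SLICE; s++)  /* spill to the least-  */
        if (pressure[s] < pressure[best])   /* loaded slice         */
            best = s;
    return best;
}
```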
[Chart: relative performance (time⁻¹) of the OS-based scheme vs. private caching for gcc, parser, eon, twolf (SPEC2k INT) and wupwise, galgel, ammp, sixtrack (SPEC2k FP); 512kB cache slice; one bar reaches 1.93]
Simulating shared caching
For a page requested from a program running on core i, map the page to all cache slices (round-robin, random, …)
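Round-robin striping over all slices, one of the spreading options the slide lists; a minimal sketch reusing the hook interface from the earlier private-caching sketch:

```c
#define N_SLICE 16

/* Simulating shared caching: pages are striped over every slice,
 * regardless of which core requested them. */
static unsigned pick_slice_shared(void)
{
    static unsigned cursor;
    return cursor++ % N_SLICE;    /* round-robin over all slices */
}
```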
[Chart: average L2 cache latency (cycles) vs. L2 cache slice size (128kB, 256kB, 512kB) for SPEC2k INT and SPEC2k FP, comparing shared caching against the OS-based scheme; two off-scale points at 129 and 106 cycles]
Simulating shared caching is simple
Mostly similar behavior/performance
Pathological cases (e.g., applu)
Simulating clustered caching
For a page requested from a program running on a core of group j, map the page to any cache slice within group j (round-robin, random, …)
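The same hook, restricted to the requesting core’s group; the group size is an assumption (the experiment runs 4 cores with 512kB slices):

```c
#define N_SLICE    16
#define GROUP_SIZE 4   /* assumed: four 4-slice clusters */

/* Simulating clustered caching: round-robin, but only over the
 * slices belonging to the requesting core's group. */
static unsigned pick_slice_clustered(unsigned core)
{
    static unsigned cursor[N_SLICE / GROUP_SIZE];
    unsigned group = core / GROUP_SIZE;
    return group * GROUP_SIZE + (cursor[group]++ % GROUP_SIZE);
}
```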
[Figure: 16-tile grid with cache slices grouped into clusters. Chart: relative performance (time⁻¹) of private, OS-based clustered, and shared caching for FFT, LU, RADIX, and OCEAN]
4 cores used; 512kB cache slice
Simulating clustered caching is simple
Lower miss traffic than private
Lower on-chip traffic than shared
Profile-driven page mapping
Using profiling, collect:
• Inter-page conflict information
• Per-page access count information
Page mapping cost function (per slice)
• Given program location, page to map, and previously mapped pages:
  cost = (# conflicts × miss penalty) + weight × (# accesses × latency)
         (miss cost)                    (latency cost)
• weight as a knob: a larger value puts more weight on proximity (than on miss rate)
• Optimizes both miss rate and data proximity

Theoretically important to understand limits; can be practically important, too
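The cost function written out in C; the profile-input structure and the miss-penalty constant are illustrative stand-ins:

```c
/* Profile-derived inputs for one candidate (page, slice) pairing. */
struct candidate {
    unsigned conflicts;   /* inter-page conflicts on this slice  */
    unsigned accesses;    /* access count for this page          */
    double   latency;     /* core-to-slice round-trip latency    */
};

#define MISS_PENALTY 300.0   /* cycles; assumed value */

/* cost = (#conflicts x miss penalty) + weight x (#accesses x latency).
 * A larger weight favors proximity over miss rate; the mapper picks,
 * for each page, the slice that minimizes this cost. */
static double map_cost(const struct candidate *c, double weight)
{
    double miss_cost    = c->conflicts * MISS_PENALTY;
    double latency_cost = c->accesses * c->latency;
    return miss_cost + weight * latency_cost;
}
```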
Profile-driven page mapping, cont’d
[Chart: breakdown of L2 cache accesses (on-chip hit, miss, local, remote) as the weight knob varies, for ammp, art, bzip2, crafty, eon, equake, gap, gcc, gzip, mcf, mesa, mgrid, parser, twolf, vortex, vpr, and wupwise; 256kB L2 cache slice]
Profile-driven page mapping, cont’d

[Chart: number of pages mapped to each slice, arranged around the program location, for gcc; 256kB L2 cache slice]
Profile-driven page mapping, cont’d
[Chart: performance improvement over shared caching for ammp, art, bzip2, crafty, eon, equake, gap, gcc, gzip, mcf, mesa, mgrid, parser, twolf, vortex, vpr, and wupwise; individual improvements of -1%, 0%, 1%, 2%, 2%, 3%, 4%, 6%, 7%, 9%, 9%, 9%, 17%, 21%, 23%, 39%, and 108%; 256kB L2 cache slice]
Room for performance improvement
Achieves the best of the two baselines, or better than both
Dynamic mapping schemes desired
Isolating faulty caches
When there are faulty cache slices, avoid mapping pages to them
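Fault isolation falls out of the same allocation hook: mark slices faulty and skip their congruence groups. A sketch with assumed names (and assuming at least one healthy slice remains):

```c
#include <stdbool.h>

#define N_SLICE 16

static bool faulty[N_SLICE];   /* set for deleted/failed slices */

/* Wrap any base policy (private, shared, clustered, ...) so it
 * never returns a faulty slice. */
static unsigned pick_slice_fault_aware(unsigned base_choice)
{
    unsigned s = base_choice;
    while (faulty[s])
        s = (s + 1) % N_SLICE;   /* skip to the next healthy slice */
    return s;
}
```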
[Chart: relative L2 cache latency vs. number of cache slice deletions (0, 1, 2, 4, 8), shared caching vs. the OS-based scheme; 4 cores running a multiprogrammed workload; 512kB cache slice]
Conclusion
“Flexibility” will become important in future multicores
• Many shared resources
• Allows us to implement high-level policies

OS-level page-granularity data-to-slice mapping
• Low hardware overhead
• Flexible

Several management policies studied
• Mimicking private/shared/clustered caching is straightforward
• Performance-improving schemes
Future work
Dynamic mapping schemes
• Performance
• Power

Performance monitoring techniques
• Hardware-based
• Software-based
Data migration and replication support
Managing Distributed, Shared L2 Caches through OS-Level Page Allocation
Sangyeun Cho and Lei Jin
Dept. of Computer Science, University of Pittsburgh
Thank you!
Multicores are here
AMD Opteron dual-core (2005)
IBM Power5 (2004)
Sun Micro. T1, 8 cores (2005)
Intel Core2 Duo (2006)
Quad cores (2007)
Intel 80 cores? (2010?)
A multicore outlook
???
A processor model
Many cores (e.g., 16)
[Figure: tiled multicore; each tile has a processor core, a local L2 cache slice, and a router]
Private L1 I/D-$$
• 8kB~32kB

Local unified L2 $$
• 128kB~512kB
• 8~18 cycles

Switched network
• 2~4 cycles/switch

Distributed directory
• Scatter hotspots
Other approaches
Hybrid/flexible schemes
• “Core clustering” [Speight et al., ISCA 2005]
• “Flexible CMP cache sharing” [Huh et al., ICS 2004]
• “Flexible bank mapping” [Liu et al., HPCA 2004]

Improving shared caching
• “Victim replication” [Zhang and Asanovic, ISCA 2005]

Improving private caching
• “Cooperative caching” [Chang and Sohi, ISCA 2006]
• “CMP-NuRAPID” [Chishti et al., ISCA 2005]