Better than the Two: Exceeding Private and Shared Caches via Two-Dimensional Page Coloring Lei Jin and Sangyeun Cho Dept. of Computer Science University of Pittsburgh


Page 1: Better than the Two:  Exceeding Private and Shared Caches  via Two-Dimensional Page Coloring

Better than the Two: Exceeding Private and Shared Caches

via Two-Dimensional Page Coloring

Lei Jin and Sangyeun Cho

Dept. of Computer Science, University of Pittsburgh

Page 2: Better than the Two:  Exceeding Private and Shared Caches  via Two-Dimensional Page Coloring

CMPMSI’07 02/11/07

Multicore distributed L2 caches

L2 caches typically sub-banked and distributed
• IBM Power4/5: 3 banks
• Sun Microsystems T1: 4 banks
• Intel Itanium2 (L3): many “sub-arrays”

(Distributed L2 caches + switched NoC) → NUCA

Hardware-based management schemes
• Private caching
• Shared caching
• Hybrid caching

[Figure: a tile containing a processor core, a local L2 cache, and a router]

Page 3: Better than the Two:  Exceeding Private and Shared Caches  via Two-Dimensional Page Coloring

Private and shared caching

Private caching:

short hit latency (always local)

high on-chip miss rate

long miss resolution time

complex coherence enforcement

Shared caching:

low on-chip miss rate

straightforward data location

simple coherence (no replication)

long average hit latency

Page 4: Better than the Two:  Exceeding Private and Shared Caches  via Two-Dimensional Page Coloring

Other approaches

Hybrid/flexible schemes
• “Core clustering” [Speight et al., ISCA2005]
• “Flexible CMP cache sharing” [Huh et al., ICS2004]
• “Flexible bank mapping” [Liu et al., HPCA2004]

Improving shared caching
• “Victim replication” [Zhang and Asanovic, ISCA2005]

Improving private caching
• “Cooperative caching” [Chang and Sohi, ISCA2006]
• “CMP-NuRAPID” [Chishti et al., ISCA2005]

Page 5: Better than the Two:  Exceeding Private and Shared Caches  via Two-Dimensional Page Coloring

Motivation

Miss rate

Hit latency

What is the optimal balance between miss rate and hit latency?

Page 6: Better than the Two:  Exceeding Private and Shared Caches  via Two-Dimensional Page Coloring

Talk roadmap

Data mapping, a key property [Cho and Jin, MICRO 2006]

Two-dimensional (2D) page coloring algorithm

Evaluation and results

Conclusion and future work

Page 7: Better than the Two:  Exceeding Private and Shared Caches  via Two-Dimensional Page Coloring

Data mapping

Data mapping
• Memory data location in L2 cache

Private caching
• Data mapping determined by program location
• Mapping created at miss time
• No explicit control

Shared caching
• Data mapping determined by address
  slice number = (block address) % (Nslice)
• Mapping is static
• No explicit control

Page 8: Better than the Two:  Exceeding Private and Shared Caches  via Two-Dimensional Page Coloring

Change mapping granularity

Block granularity: slice number = (block address) % (Nslice)

Page granularity: slice number = (page address) % (Nslice)

[Figure: pages distributed across L2 slices]
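The two granularities can be compared with a small sketch. The block size, page size, and helper names below are assumptions for illustration; only the modulo mapping and the 16-slice configuration come from the slides.

```python
BLOCK_SIZE = 64    # bytes per cache block (assumed, not stated in the slides)
PAGE_SIZE = 4096   # bytes per page (assumed, not stated in the slides)
N_SLICE = 16       # one L2 slice per core in the 16-core CMP


def slice_by_block(phys_addr: int) -> int:
    """Block granularity: slice number = (block address) % N_SLICE."""
    return (phys_addr // BLOCK_SIZE) % N_SLICE


def slice_by_page(phys_addr: int) -> int:
    """Page granularity: slice number = (page address) % N_SLICE."""
    return (phys_addr // PAGE_SIZE) % N_SLICE


# Walk every block of physical page 5 and record which slices it touches.
page_base = 5 * PAGE_SIZE
offsets = range(0, PAGE_SIZE, BLOCK_SIZE)
block_slices = {slice_by_block(page_base + off) for off in offsets}
page_slices = {slice_by_page(page_base + off) for off in offsets}

print(sorted(block_slices))  # the page is spread over many slices
print(sorted(page_slices))   # the whole page lands in one slice
```

Under these assumed sizes, block-granularity mapping scatters one page over all slices, while page-granularity mapping keeps it in a single slice, which is what lets the physical page number choose the slice.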

Page 9: Better than the Two:  Exceeding Private and Shared Caches  via Two-Dimensional Page Coloring

OS controlled page mapping

[Figure: OS page allocation maps the memory pages of Program 1 and Program 2 from the virtual address space to the physical address space]
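One way to picture OS-controlled mapping is an allocator that keeps a free list per slice and hands out a physical page whose number maps to the requested slice. This is a hypothetical sketch, not the authors' implementation; the class name, the fallback policy, and the page-table layout are all assumptions.

```python
from collections import defaultdict, deque

N_SLICE = 16  # slices in the 16-core CMP


class ColorAwareAllocator:
    """Hypothetical OS page allocator with one free list per slice."""

    def __init__(self, n_phys_pages: int, n_slice: int = N_SLICE):
        self.n_slice = n_slice
        self.free = defaultdict(deque)
        for ppn in range(n_phys_pages):
            # Page-granularity mapping: slice = physical page number % n_slice.
            self.free[ppn % n_slice].append(ppn)
        self.page_table = {}  # (program, virtual page number) -> physical page number

    def allocate(self, program: str, vpn: int, want_slice: int) -> int:
        # Try the requested slice first, then fall back to the next slices
        # (the fallback order is an assumption made for this sketch).
        for offset in range(self.n_slice):
            color = (want_slice + offset) % self.n_slice
            if self.free[color]:
                ppn = self.free[color].popleft()
                self.page_table[(program, vpn)] = ppn
                return ppn
        raise MemoryError("no free physical pages")


allocator = ColorAwareAllocator(n_phys_pages=64)
ppn1 = allocator.allocate("program1", vpn=0, want_slice=3)
ppn2 = allocator.allocate("program2", vpn=0, want_slice=3)
print(ppn1 % N_SLICE, ppn2 % N_SLICE)  # both pages map to slice 3
```

The program never sees the choice: its virtual addresses are unchanged, and only the physical page picked by the OS determines the slice.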

Page 10: Better than the Two:  Exceeding Private and Shared Caches  via Two-Dimensional Page Coloring

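2D page coloring: the problem

[Figure: tiled CMP with processor P and pages placed on slices; five candidate colors are annotated with access and miss counts]

Network latency / hop = 3 cycles

Memory latency = 300 cycles

Cost(color #) = (# access x # hop x 3 cycles) + (# miss x 300 cycles)

access / miss per candidate color: 500 / 30, 500 / 3, 500 / 10, 500 / 7, 500 / 12

cost per candidate color: 9000, 6900, 9000, 8100, 9600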
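The cost figures on this slide can be reproduced with a short calculation. The access and miss counts and the two latencies come from the slide; the hop counts are not stated there and are inferred here as the values that make the formula match the listed costs (0 hops for the first candidate, 4 hops for the others), so treat them as an assumption.

```python
HOP_CYCLES = 3    # network latency per hop, from the slide
MEM_CYCLES = 300  # memory latency, from the slide


def placement_cost(accesses: int, hops: int, misses: int) -> int:
    """Cost(color) = (# access x # hop x 3 cycles) + (# miss x 300 cycles)."""
    return accesses * hops * HOP_CYCLES + misses * MEM_CYCLES


# (accesses, hops, misses) per candidate color; hop counts are inferred.
candidates = [
    (500, 0, 30),
    (500, 4, 3),
    (500, 4, 10),
    (500, 4, 7),
    (500, 4, 12),
]

costs = [placement_cost(a, h, m) for a, h, m in candidates]
best = min(range(len(costs)), key=costs.__getitem__)

print(costs)  # [9000, 6900, 9000, 8100, 9600]
print(best)   # 1
```

Under this reading, the local slice has the shortest hit latency but is not the cheapest choice: the candidate with only 3 misses wins even though every access pays 4 hops.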

Page 11: Better than the Two:  Exceeding Private and Shared Caches  via Two-Dimensional Page Coloring

2D coloring algorithm

Collect L2 reference trace

Derive conflict information [Sherwood et al., ICS1999]

Reference 1: Page A
Reference 2: Page B
Reference 3: Page B
Reference 4: Page C

Page 12: Better than the Two:  Exceeding Private and Shared Caches  via Two-Dimensional Page Coloring

2D coloring algorithm (cont’d)

Derive conflict information

Reference 1: Page A

Reference Matrix
  A B C
A 0 0 0
B 0 0 0
C 0 0 0

Conflict Matrix
  A B C
A 0 0 0
B 0 0 0
C 0 0 0

Page 13: Better than the Two:  Exceeding Private and Shared Caches  via Two-Dimensional Page Coloring

2D coloring algorithm (cont’d)

Derive conflict information

Reference 1: Page A

Reference Matrix
  A B C
A 0 0 0
B 1 0 0
C 1 0 0

Conflict Matrix
  A B C
A 0 0 0
B 0 0 0
C 0 0 0

Page 14: Better than the Two:  Exceeding Private and Shared Caches  via Two-Dimensional Page Coloring

2D coloring algorithm (cont’d)

Derive conflict information

Reference 1: Page A
Reference 2: Page B

Reference Matrix
  A B C
A 0 0 0
B 1 0 0
C 1 0 0

Conflict Matrix
  A B C
A 0 0 0
B 0 0 0
C 0 0 0

Page 15: Better than the Two:  Exceeding Private and Shared Caches  via Two-Dimensional Page Coloring

2D coloring algorithm (cont’d)

Derive conflict information

Reference 1: Page A
Reference 2: Page B

Reference Matrix
  A B C
A 0 1 0
B 1 0 0
C 1 1 0

Conflict Matrix
  A B C
A 0 0 0
B 0 0 0
C 0 0 0

[Animation: +1 is added to Conflict Matrix entry (B, A)]

Page 16: Better than the Two:  Exceeding Private and Shared Caches  via Two-Dimensional Page Coloring

2D coloring algorithm (cont’d)

Derive conflict information

Reference 1: Page A
Reference 2: Page B
Reference 3: Page B

Reference Matrix
  A B C
A 0 1 0
B 0 0 0
C 1 1 0

Conflict Matrix
  A B C
A 0 0 0
B 1 0 0
C 0 0 0

Page 17: Better than the Two:  Exceeding Private and Shared Caches  via Two-Dimensional Page Coloring

2D coloring algorithm (cont’d)

Derive conflict information

Reference 1: Page A
Reference 2: Page B
Reference 3: Page B
Reference 4: Page C

Reference Matrix
  A B C
A 0 1 0
B 0 0 0
C 1 1 0

Conflict Matrix
  A B C
A 0 0 0
B 1 0 0
C 0 0 0

Page 18: Better than the Two:  Exceeding Private and Shared Caches  via Two-Dimensional Page Coloring

2D coloring algorithm (cont’d)

Derive conflict information

Reference 1: Page A
Reference 2: Page B
Reference 3: Page B
Reference 4: Page C

Reference Matrix
  A B C
A 0 1 1
B 0 0 1
C 1 1 0

Conflict Matrix
  A B C
A 0 0 0
B 1 0 0
C 0 0 0

[Animation: +1 is added to Conflict Matrix entries (C, A) and (C, B)]

Page 19: Better than the Two:  Exceeding Private and Shared Caches  via Two-Dimensional Page Coloring

2D coloring algorithm (cont’d)

2D Page coloring

Reference 1: Page A
Reference 2: Page B
Reference 3: Page B
Reference 4: Page C

Reference Matrix
  A B C
A 0 1 1
B 0 0 1
C 0 0 0

Conflict Matrix
  A B C
A 0 0 0
B 1 0 0
C 1 1 0

Access Counter
A B C
1 2 1
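The matrix walk-through on the preceding slides can be replayed in a few lines. The update rule below is inferred from the before and after states shown on the slides, not quoted from them: on a reference to page P, row P of the Reference Matrix is added to row P of the Conflict Matrix and then cleared, and column P is set to 1 in every other row. The function name is hypothetical.

```python
def derive_conflicts(trace, pages):
    """Replay an L2 reference trace and build the three tables from the slides."""
    index = {page: i for i, page in enumerate(pages)}
    n = len(pages)
    reference = [[0] * n for _ in range(n)]
    conflict = [[0] * n for _ in range(n)]
    access = [0] * n

    for page in trace:
        p = index[page]
        access[p] += 1
        # Pages referenced since the last access to p count as conflicts for p.
        for q in range(n):
            conflict[p][q] += reference[p][q]
            reference[p][q] = 0
        # Mark p as "referenced since your last access" for every other page.
        for x in range(n):
            if x != p:
                reference[x][p] = 1
    return reference, conflict, access


reference, conflict, access = derive_conflicts(["A", "B", "B", "C"], ["A", "B", "C"])
print(reference)  # [[0, 1, 1], [0, 0, 1], [0, 0, 0]]
print(conflict)   # [[0, 0, 0], [1, 0, 0], [1, 1, 0]]
print(access)     # [1, 2, 1]
```

For the four-reference trace A, B, B, C this reproduces the final Reference Matrix, Conflict Matrix, and Access Counter shown above. Note that the second reference to B adds no conflict, because nothing else was referenced in between.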

Page 20: Better than the Two:  Exceeding Private and Shared Caches  via Two-Dimensional Page Coloring

2D coloring algorithm (cont’d)

2D Page coloring

Conflict Matrix
  A B C
A 0 0 0
B 1 0 0
C 1 1 0

Access Counter
A B C
1 2 1

Cost(color, page#) = (α x #Conflict(color) x mem latency) + ((1-α) x #Access x #hop(color) x hop delay)

Optimal color(page#) = {C | Cost(C) = MIN[Cost(color, page#)] for all colors}
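A rough sketch of the color selection follows, under several assumptions that the slides do not spell out: a color names one of 16 slices laid out as a 4x4 mesh, hop count is the Manhattan distance from the program's home tile, α weights the conflict term, and #Conflict(color) sums the conflict counts (in both directions) against pages already placed on that color. The latencies are taken from the earlier cost example, and α = 0.5 is chosen only to make the tiny example visible; the slides report much smaller values.

```python
MEM_LATENCY = 300  # cycles, from the earlier cost example
HOP_DELAY = 3      # cycles per hop, from the earlier cost example
MESH_WIDTH = 4     # assumed 4x4 layout for the 16 tiles


def hops(src: int, dst: int, width: int = MESH_WIDTH) -> int:
    """Manhattan distance between two tiles (assumed routing metric)."""
    sx, sy = src % width, src // width
    dx, dy = dst % width, dst // width
    return abs(sx - dx) + abs(sy - dy)


def cost(page, color, conflict, access, assigned, home, alpha):
    # Assumption: conflicts are counted against pages already on this color.
    n_conflict = sum(conflict[page][q] + conflict[q][page]
                     for q, c in assigned.items() if c == color)
    return (alpha * n_conflict * MEM_LATENCY
            + (1 - alpha) * access[page] * hops(home, color) * HOP_DELAY)


def optimal_color(page, conflict, access, assigned, home, alpha, n_colors=16):
    """Return the color with minimum cost (lowest color number wins ties)."""
    return min(range(n_colors),
               key=lambda c: cost(page, c, conflict, access, assigned, home, alpha))


# Conflict Matrix and Access Counter from the slide (pages A, B, C = 0, 1, 2).
conflict = [[0, 0, 0], [1, 0, 0], [1, 1, 0]]
access = [1, 2, 1]

assigned = {}
for page in range(3):
    assigned[page] = optimal_color(page, conflict, access, assigned, home=5, alpha=0.5)

print(assigned)  # {0: 5, 1: 1, 2: 4}
```

With α = 0 the conflict term vanishes and every page is placed on the home tile, which behaves like private caching. Larger α pushes conflicting pages apart at the price of extra hops.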

Page 21: Better than the Two:  Exceeding Private and Shared Caches  via Two-Dimensional Page Coloring

Experiments setup

Experiments were carried out using a simulator derived from the SimpleScalar toolset.

The simulator models a 16-core tile-based CMP.

Each core has a private 32KB I/D L1 and a 256KB slice of the globally shared L2 (16 x 256KB = 4MB total).

[Figure: flow of Profiling, 2D coloring, and Timing Simulation stages, with labels Trace, Page mapping, and Tuning α]

Page 22: Better than the Two:  Exceeding Private and Shared Caches  via Two-Dimensional Page Coloring

Optimal page mapping

[Figure: two histograms for gcc showing # of pages over tile coordinates (x, y), one for α = 1/64 and one for α = 1/256]

Page 23: Better than the Two:  Exceeding Private and Shared Caches  via Two-Dimensional Page Coloring

Access distribution

[Figure: two stacked-bar charts (0% to 100%), one splitting accesses into miss and hit, the other into remote and local; α ranges from 1/32 to 1/2048]

Page 24: Better than the Two:  Exceeding Private and Shared Caches  via Two-Dimensional Page Coloring

Relative performance

[Figure: bar chart (scale 0 to 3) comparing private, shared, line coloring, and 2D coloring]

Page 25: Better than the Two:  Exceeding Private and Shared Caches  via Two-Dimensional Page Coloring

Value of α

[Figure: chart of log(α) with base 1/2, scale 0 to 12]

Page 26: Better than the Two:  Exceeding Private and Shared Caches  via Two-Dimensional Page Coloring

Conclusions

With careful data placement, there is large room for performance improvement.

Dynamic mapping schemes that use hardware-assisted information may achieve similar performance improvement.

This method can also be applied to other optimization targets.

Page 27: Better than the Two:  Exceeding Private and Shared Caches  via Two-Dimensional Page Coloring

Current and future work

Dynamic mapping schemes
• Performance
• Power

Multiprogrammed and parallel workloads

Page 28: Better than the Two:  Exceeding Private and Shared Caches  via Two-Dimensional Page Coloring


Thank you & Questions?

Page 29: Better than the Two:  Exceeding Private and Shared Caches  via Two-Dimensional Page Coloring

Private caching

1. L1 miss
2. L2 access
   • Hit
   • Miss
3. Access directory
   • A copy on chip
   • Global miss

[Figure labels: L1 miss, Local L2 access]

short hit latency (always local)

high on-chip miss rate

long miss resolution time

complex coherence enforcement

Page 30: Better than the Two:  Exceeding Private and Shared Caches  via Two-Dimensional Page Coloring

Shared caching

1. L1 miss
2. L2 access
   • Hit
   • Miss

[Figure label: L1 miss]

low on-chip miss rate

straightforward data location

simple coherence (no replication)

long average hit latency
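The numbered lookup steps on the private and shared caching slides can be written side by side as a small sketch. The function names and boolean parameters are hypothetical; the step names follow the slides.

```python
def private_lookup(local_l2_hit: bool, copy_on_chip: bool) -> list:
    """Steps after an L1 miss under private caching."""
    steps = ["L1 miss"]
    if local_l2_hit:
        steps.append("L2 access: hit")
        return steps
    steps.append("L2 access: miss")
    # A local miss still has to consult the directory for other on-chip copies.
    if copy_on_chip:
        steps.append("access directory: a copy on chip")
    else:
        steps.append("access directory: global miss")
    return steps


def shared_lookup(home_l2_hit: bool) -> list:
    """Steps after an L1 miss under shared caching (no directory step)."""
    steps = ["L1 miss"]
    steps.append("L2 access: hit" if home_l2_hit else "L2 access: miss")
    return steps


print(private_lookup(local_l2_hit=False, copy_on_chip=True))
print(shared_lookup(home_l2_hit=False))
```

The extra directory step is where the long miss resolution time of private caching comes from. Shared caching resolves a miss in one L2 access, but that access may be to a remote slice.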

Page 31: Better than the Two:  Exceeding Private and Shared Caches  via Two-Dimensional Page Coloring

Performance

[Figure: bar chart of performance improvement over shared caching (scale 0% to 100%, with two off-scale bars labeled 141% and 150%) for ammp, art, crafty, gap, gcc, gzip, mcf, mgrid, twolf, vortex, and wupwise, comparing line placement and 2D coloring]