RegionScout: Exploiting Coarse Grain Sharing in Snoop Coherence

Andreas Moshovos
moshovos@eecg.toronto.edu
www.eecg.toronto.edu/aenao

© Moshovos


Improving Snoop Coherence

Conventional Considerations: Complexity and Correctness, NOT Power/Bandwidth

Can we (1) reduce power/bandwidth and (2) still leverage snoop coherence?
Snoop coherence remains attractive: simple, allows design re-use

Yes: exploit program behavior to dynamically identify requests that do not need snooping

[Figure: three CPUs, each with I$ and D$, connected through an interconnect to Main Memory]


RegionScout: Avoid Some Snoops

Frequent case: non-sharing even at a coarse level/Region
RegionScout: Dynamically Identify Non-Shared Regions

First request to a Region identifies it as not shared
Subsequent requests do not need to be broadcast

Uses imprecise information
Small structures
Layer on top of conventional coherence
No additional constraints

[Figure: three CPUs, each with I$ and D$, connected through an interconnect to Main Memory]


Roadmap

Conventional Coherence: The need for power-aware designs

Potential: Program Behavior

RegionScout: What and How

Implementation

Evaluation

Summary


Coherence Basics

Given a request for memory block X (address):
Detect where its current value resides

[Figure: three CPUs and Main Memory; a request for block X is snooped by the other CPUs, one of which reports a hit]


Conventional Coherence not Power-Aware/Bandwidth-Effective

All L2 tags see all accesses
Perf. & Complexity: Have L2 tags, why not use them
Power: All L2 tags consume power on all accesses

Bandwidth: broadcast all coherent requests

[Figure: three CPUs with L2 caches and Main Memory; a request is looked up in the L2 tags of the other two CPUs and misses in both]


RegionScout Motivation: Sharing is Coarse

Region: large contiguous memory area, power-of-2 size
A CPU asks for data block X in region R

1. No one else has X

2. No one else has any block in R

RegionScout Exploits this Behavior

Layered Extension over Snoop Coherence

[Figure: typical memory space snapshot, with addresses colored by owner(s)]


Optimization Opportunities

Power and Bandwidth
Originating node: avoid asking others
Remote node: avoid tag lookup

[Figure: three CPUs, each with I$ and D$, connected through a switch to Memory]


Potential: Region Miss Frequency

[Figure: Global Region Misses as % of all requests (y-axis, 0% to 100%) vs. Region Size (x-axis: 256, 512, 1K, 2K, 4K, 8K, 16K); series: p4.512K, p4.1M, p8.512K, p8.1M]

Even with a 16K Region, ~45% of requests miss in all remote nodes


RegionScout at Work: Non-Shared Region Discovery

First request detects a non-shared region

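[Figure: three CPUs and Main Memory, with numbered steps 1 to 3: the request is broadcast, both remote CPUs report a Region Miss, and the requester sees a Global Region Miss]

Record: Non-Shared Regions
Record: Locally Cached Regions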


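RegionScout at Work: Avoiding Snoops

Subsequent request avoids snoops

[Figure: three CPUs and Main Memory, with numbered steps 1 and 2: the requester finds the region recorded as a Global Region Miss and accesses Main Memory without snooping the other CPUs]

Record: Non-Shared Regions
Record: Locally Cached Regions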


RegionScout is Self-Correcting

Request from another node invalidates non-shared record

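[Figure: three CPUs and Main Memory, with numbered steps 1 and 2: a request from another CPU is broadcast and the snooping nodes update their records]

Record: Non-Shared Regions
Record: Locally Cached Regions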
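
The three behaviors above (discovery, snoop avoidance, self-correction) can be sketched as a toy model. This is only an illustration under stated assumptions, not the author's implementation: the names `Node`, `SnoopSystem`, and `REGION_BITS` are hypothetical, the 16K region size is just one of the sizes studied, each node checks its cached regions exactly rather than with an imprecise structure, and data transfer and block-level coherence states are omitted.

```python
REGION_BITS = 14  # assumption: 16K regions, so the region tag is addr >> 14


class Node:
    """One CPU node with a record of non-shared regions (hypothetical model)."""

    def __init__(self, name):
        self.name = name
        self.cached_blocks = set()  # block addresses held in the local cache
        self.non_shared = set()     # regions discovered to be non-shared

    def snoop(self, region):
        """Remote side of a broadcast: returns True on a region hit."""
        # Self-correction: another node is touching this region,
        # so any local "non-shared" record for it is no longer valid.
        self.non_shared.discard(region)
        return any(b >> REGION_BITS == region for b in self.cached_blocks)


class SnoopSystem:
    """Nodes on a broadcast interconnect; counts the broadcasts issued."""

    def __init__(self, names):
        self.nodes = [Node(n) for n in names]
        self.broadcasts = 0

    def request(self, requester, block_addr):
        region = block_addr >> REGION_BITS
        if region in requester.non_shared:
            # Subsequent request to a known non-shared region: no broadcast.
            requester.cached_blocks.add(block_addr)
            return "no-snoop"
        self.broadcasts += 1
        # A list (not a generator) so that every remote node really snoops.
        hits = [n.snoop(region) for n in self.nodes if n is not requester]
        requester.cached_blocks.add(block_addr)
        if not any(hits):
            # Global region miss: remember the region as non-shared.
            requester.non_shared.add(region)
            return "global-region-miss"
        return "shared"


system = SnoopSystem(["cpu0", "cpu1", "cpu2"])
cpu0, cpu1, _ = system.nodes
first = system.request(cpu0, 0x10000)    # broadcast, discovers the region
second = system.request(cpu0, 0x10040)   # same region, broadcast avoided
third = system.request(cpu1, 0x10080)    # invalidates cpu0's record
```

In this sketch only the first and third requests reach the interconnect; a later request from `cpu0` to the same region would be broadcast again because its record was invalidated by `cpu1`'s request.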


Implementation: Requirements

Requesting Node provides address:

At Originating Node (from CPU): Have I discovered that this region is not shared?

At Remote Nodes (from Interconnect): Do I have a block in the region?

[Figure: the CPU address is split into a Region Tag and an offset of lg(Region Size) bits]
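
As a small illustration of the address split described above, the following sketch extracts the region tag and the offset for a power-of-two region size. The function name `split_address` and the example address are assumptions made for illustration only.

```python
def split_address(addr: int, region_size: int) -> tuple[int, int]:
    """Split an address into (region_tag, offset) for a power-of-2 region size."""
    if region_size <= 0 or region_size & (region_size - 1):
        raise ValueError("region_size must be a power of two")
    offset_bits = region_size.bit_length() - 1  # lg(region_size)
    return addr >> offset_bits, addr & (region_size - 1)


# Two addresses inside the same 16K region share a region tag.
tag_a, offset_a = split_address(0x12345678, 16 * 1024)
tag_b, offset_b = split_address(0x12345000, 16 * 1024)
```

Both lookups (at the originating node and at remote nodes) would then operate on the region tag alone, ignoring the offset.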


Remembering Non-Shared Regions

Records non-shared regions
Lookup by Region portion prior to issuing a request
Snoop requests and invalidate

[Figure: the Region Tag of the address indexes the Non-Shared Region Table; each entry has a valid bit]

Few entries: 16x4 in most experiments
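
A minimal sketch of such a table follows. It assumes that "16x4" means 16 sets of 4 ways and that replacement within a set is LRU; neither detail is stated on the slide, and the class and method names are hypothetical.

```python
from collections import OrderedDict


class NonSharedRegionTable:
    """Small set-associative table of regions known to be non-shared.

    Assumptions (not from the slides): 16 sets x 4 ways, LRU replacement,
    set index = region_tag mod number of sets.
    """

    def __init__(self, num_sets=16, ways=4):
        self.ways = ways
        self.sets = [OrderedDict() for _ in range(num_sets)]

    def _set_for(self, region_tag):
        return self.sets[region_tag % len(self.sets)]

    def lookup(self, region_tag):
        """Checked before issuing a request: True means the snoop can be skipped."""
        entries = self._set_for(region_tag)
        if region_tag in entries:
            entries.move_to_end(region_tag)  # mark as most recently used
            return True
        return False

    def record(self, region_tag):
        """Called after a global region miss."""
        entries = self._set_for(region_tag)
        if region_tag not in entries and len(entries) >= self.ways:
            entries.popitem(last=False)  # evict the least recently used entry
        entries[region_tag] = True
        entries.move_to_end(region_tag)

    def invalidate(self, region_tag):
        """Called when a snooped request from another node touches the region."""
        self._set_for(region_tag).pop(region_tag, None)
```

Losing an entry to replacement is always safe in this scheme: the next request to that region is simply broadcast as in conventional snoop coherence.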


What Regions are Locally Cached?

If we had as many counters as regions:
Block Allocation: counter[region]++
Block Eviction: counter[region]--
Region cached only if counter[region] is non-zero

Not practical: e.g., 16K Regions and 4G Memory would need 256K counters

[Figure: the Region Tag of the address selects one counter]
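
The idealized scheme and its cost can be sketched as follows. The class name `ExactRegionCounters` is hypothetical, and a dictionary-backed `Counter` stands in for the one-counter-per-region array that the slide argues is impractical in hardware.

```python
from collections import Counter

REGION_SIZE = 16 * 1024       # 16K regions
MEMORY_SIZE = 4 * 1024 ** 3   # 4G of memory
NUM_REGIONS = MEMORY_SIZE // REGION_SIZE  # 2**32 / 2**14 = 2**18 = 256K counters


class ExactRegionCounters:
    """One counter per region: precise, but too large to build in hardware."""

    def __init__(self):
        self.counter = Counter()

    def block_allocated(self, region):
        self.counter[region] += 1

    def block_evicted(self, region):
        if self.counter[region] == 0:
            raise ValueError("eviction from a region with no cached blocks")
        self.counter[region] -= 1

    def is_cached(self, region):
        # A region is locally cached only while its counter is non-zero.
        return self.counter[region] != 0
```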


What Regions are Locally Cached?

Use few counters
Imprecise: records a superset of locally cached Regions
False positives: lost opportunity, correctness preserved

[Figure: the Region Tag of the address is hashed to select an entry of the Cached Region Hash; each entry holds a counter and a p bit]

“Counter”: + on block allocation, - on block eviction
Few entries, e.g., 256

P-bit: 1 if counter is non-zero, used for lookups
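
A minimal sketch of the hashed-counter idea follows. The hash function (region tag modulo the number of entries) is an assumption, since the slide does not specify one, and the p-bit is derived from the counter here rather than stored separately as it presumably would be in hardware. The class and method names are hypothetical.

```python
class CachedRegionHash:
    """Few counters shared by many regions; answers 'might I cache this region?'."""

    def __init__(self, num_entries=256):
        self.counters = [0] * num_entries

    def _index(self, region_tag):
        # Assumed hash: low-order bits of the region tag.
        return region_tag % len(self.counters)

    def block_allocated(self, region_tag):
        self.counters[self._index(region_tag)] += 1

    def block_evicted(self, region_tag):
        idx = self._index(region_tag)
        if self.counters[idx] == 0:
            raise ValueError("eviction without a matching allocation")
        self.counters[idx] -= 1

    def p_bit(self, region_tag):
        """True if some block from a region hashing to this entry is cached."""
        return self.counters[self._index(region_tag)] != 0


crh = CachedRegionHash()
crh.block_allocated(3)
exact_hit = crh.p_bit(3)         # region 3 really is cached
false_positive = crh.p_bit(259)  # 259 collides with 3 (259 % 256 == 3)
true_miss = crh.p_bit(4)         # nothing cached hashes to entry 4
```

Because several regions share one counter, a set p-bit can only overstate what is cached, never understate it. A false positive costs a missed chance to skip a snoop, while a clear p-bit reliably means that no block of the region is cached locally.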


Roadmap

Conventional Coherence

Program Behavior: Region Miss Frequency

RegionScout

Evaluation

Summary


Evaluation Overview

Methodology

Filter rates
Practical filters can capture many Region Misses

Interconnect bandwidth reduction


Methodology

In-house simulator based on SimpleScalar
Execution driven
All instructions simulated (MIPS-like ISA)
System calls faked by passing them to host OS
Synchronization using load-linked/store-conditional
Simple in-order processors
Memory requests complete instantaneously
MESI snoop coherence
1 or 2 level memory hierarchy
WATTCH power models

SPLASH II benchmarks
Scientific workloads
Feasibility study


Filter Rates

[Figure: Identified Global Region Misses (y-axis, 0% to 100%) vs. CRH Size (x-axis: 256, 512, 1K, 2K); series: p4.512K.R4K, p4.512K.R16K, p8.512K.R4K, p8.512K.R16K]

For small CRH, better to use large regions
Practical RegionScout filters capture a lot of the potential


Bandwidth Reduction

[Figure: Messages (y-axis, 0% to 100%) vs. Region Size (x-axis: 2K, 4K, 8K, 16K); series: p4.512K, p8.512K, p4.64K, p8.64K; part of the chart is annotated "CMP"]

Moderate bandwidth savings for SMP (15%-22%)
More so for CMP (>25%)


Related Work

RegionScout: Technical Report, Dec. 2003

Jetty: Moshovos, Memik, Falsafi, Choudhary, HPCA 2001

PST: Ekman, Dahlgren, and Stenström, ISLPED 2002

Coarse-Grain Coherence: Cantin, Lipasti, and Smith, ISCA 2005


Summary

Exploit program behavior/optimize a frequent case
  Many requests result in a global region miss

RegionScout
  Practical filter mechanism
  Dynamically detect would-be region misses
  Avoid broadcasts
  Save tag lookup power and interconnect bandwidth
  Small structures
  Layered extension over existing mechanisms
  Invisible to the programmer and the OS


RegionScout and Directories

Different information
  Directory: block-level sharing
  RegionScout: Region-level sharing

Could build Region-level directory
  This work serves as motivation

Directories use precise information
  RegionScout does not have to

Directories/Implementation
  RegionScout can approximate a directory
  If remote nodes sent sharing info as opposed to a single bit