Coarse-Grained Coherence Mikko H. Lipasti Associate Professor Electrical and Computer Engineering University of Wisconsin – Madison Joint work with: Jason Cantin , IBM (Ph.D. ’06) Natalie Enright Jerger Prof. Jim Smith Prof. Li-Shiuan Peh (Princeton) http://www.ece.wisc.edu/~pharm


Coarse-Grained Coherence

Mikko H. Lipasti
Associate Professor

Electrical and Computer Engineering
University of Wisconsin – Madison

Joint work with:
Jason Cantin, IBM (Ph.D. ’06)
Natalie Enright Jerger
Prof. Jim Smith
Prof. Li-Shiuan Peh (Princeton)

http://www.ece.wisc.edu/~pharm

Motivation
• Multiprocessors are commonplace
  • Historically, glass house servers
  • Now laptops, soon cell phones
• Most common multiprocessor
  • Symmetric processors w/ coherent caches
  • Logical extension of time-shared uniprocessors
  • Easy to program, reason about
  • Not so easy to build

Aug 30, 2007 Mikko Lipasti-University of Wisconsin

Coherence Granularity
• Track each individual word
  • Too much overhead
• Track larger blocks
  • 32B – 128B common
  • Less overhead, exploit spatial locality
  • Large blocks cause false sharing
• Solution: use multiple granularities
  • Small blocks: manage local read/write permissions
  • Large blocks: track global behavior

[Figure: processors P0 through P7]

Coarse-Grained Coherence
• Initially
  • Identify non-shared regions
  • Decouple obtaining coherence permission from data transfer
  • Filter snoops to reduce broadcast bandwidth
• Later
  • Enable aggressive prefetching
  • Optimize DRAM accesses
  • Customize protocol, interconnect to match

Coarse-Grained Coherence
• Optimizations lead to
  • Reduced memory miss latency
  • Reduced cache-to-cache miss latency
  • Reduced snoop bandwidth
  • Fewer exposed cache misses
  • Elimination of unnecessary DRAM reads
  • Power savings on bus, interconnect, caches, and in DRAM
  • World peace and end to global warming

Coarse-Grained Coherence Tracking
• Memory is divided into coarse-grained regions
  • Aligned, power-of-two multiple of cache line size
  • Can range from two lines to a physical page
• A cache-like structure is added to each processor for monitoring coherence at the granularity of regions
  • Region Coherence Array (RCA)

Region Coherence Arrays
• Each entry has an address tag, state, and count of lines cached by the processor
• The region state indicates if the processor and / or other processors are sharing / modifying lines in the region
• Customize policy/protocol/interconnect to exploit region state
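As a rough illustration of the entry layout described on this slide, the sketch below maps an address to its region and models one RCA entry. The 512B region size, the field names, and the helper `region_of` are assumptions made for the example; only the 64B line size and the three fields (tag, state, line count) come from the slides.

```python
from dataclasses import dataclass

LINE_BYTES = 64      # cache line size of the evaluated system
REGION_BYTES = 512   # assumed region size: a power-of-two multiple of LINE_BYTES


def region_of(addr: int) -> int:
    """Region number of a physical address (regions are aligned)."""
    return addr // REGION_BYTES


def lines_per_region() -> int:
    """Number of cache lines covered by one region."""
    return REGION_BYTES // LINE_BYTES


@dataclass
class RCAEntry:
    tag: int             # region address tag
    state: str = "I"     # region coherence state, e.g. "I", "CI", "DD"
    line_count: int = 0  # lines of this region cached by the local processor


# A first line of a region is brought into the local cache.
entry = RCAEntry(tag=region_of(0x12345678))
entry.state = "DI"
entry.line_count += 1
```

A real array would split the region number into set index and tag; that detail is left out here to keep the sketch short.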

Talk Outline
• Motivation
• Overview of Coarse-Grained Coherence
• Techniques
  • Broadcast Snoop Reduction [ISCA 2005]
  • Stealth Prefetching [ASPLOS 2006]
  • Power-Efficient DRAM Speculation
  • Hybrid Circuit Switching
  • Virtual Proximity
  • Circuit-switched snooping
• Research Group Overview

Unnecessary Broadcasts
[Figure: Y-axis: Requests (0% to 100%). Groups: Scientific Mean, Multiprogrammed Mean, Commercial Mean, Overall Mean. Legend: Write-back, Writes, Read, I-Fetch, DCB.]

Broadcast Snoop Reduction
• Identify requests that don’t need a broadcast
• Send data requests directly to memory w/o broadcasting
  • Reducing broadcast traffic
  • Reducing memory latency
• Avoid sending non-data requests externally

Example

Simulator Evaluation
• PHARMsim: near-RTL but written in C
  • Execution-driven simulator built on top of SimOS-PPC
• Four 4-way superscalar out-of-order processors
• Two-level hierarchy with split L1, unified 1MB L2 caches, and 64B lines
• Separate address / data networks – similar to Sun Fireplane

Workloads
• Scientific
  • Ocean, Raytrace, Barnes
• Multiprogrammed
  • SPECint2000_rate, SPECint95_rate
• Commercial (database, web)
  • TPC-W, TPC-B, TPC-H
  • SPECweb99, SPECjbb2000

Broadcasts Avoided
[Figure: Y-axis: Requests (0% to 100%). Groups: Scientific Mean, Multiprogrammed Mean, Commercial Mean, Overall Mean. Bars per group: Unnecessary, 128B, 256B, 512B, 1KB, 2KB, 4KB. Legend: Write-backs, I-Fetches, Writes, Reads, DCB.]

Execution Time
[Figure: Y-axis: Execution Time (0% to 100%). Groups: Scientific Mean, Multiprogrammed Mean, Commercial Mean, Overall Mean. Legend: Baseline, 128B, 256B, 512B, 1KB, 2KB, 4KB.]

Summary
• Eliminates nearly all unnecessary broadcasts
• Reduces snoop activity by 65%
  • Fewer broadcasts
  • Fewer lookups
• Provides modest speedup


Prefetching in Multiprocessors
• Prefetching
  • Anticipate future reference, fetch into cache
• Many prefetching heuristics possible
  • Current systems: next-block, stride
  • Proposed: skip pointer, content-based
  • Some/many prefetched blocks are not used
• Multiprocessor complications
  • Premature or unnecessary prefetches
  • Permission thrashing if blocks are shared
  • Separate study [ISPASS 2006]

Stealth Prefetching
• Lines from non-shared regions can be prefetched stealthily and efficiently
  • Without disturbing other processors
  • Without downgrades, invalidations
  • Without preventing them from obtaining exclusive copies
  • Without broadcasting prefetch requests
  • Fetched from DRAM with low overhead

Example

Stealth Prefetching
• After a threshold number of L2 misses (2), the rest of the lines from a region are prefetched
• These lines are buffered close to the processor for later use (Stealth Data Prefetch Buffer)
• After accessing the RCA, requests may obtain data from the buffer as they would from memory
  • To access data, region must be in valid state and a broadcast unnecessary for coherent access
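The trigger described on this slide can be sketched as a small model. This is only an illustration under assumptions: the class name, the 1KB region size, and the `broadcast_needed` flag (standing in for the RCA lookup) are invented for the example; the threshold of 2 misses and the 64B line size are from the slides.

```python
LINE_BYTES = 64
REGION_BYTES = 1024   # assumed region size
MISS_THRESHOLD = 2    # L2 misses to a region before the rest is prefetched


class StealthPrefetcher:
    def __init__(self):
        self.miss_count = {}  # region number -> L2 misses observed
        self.buffer = set()   # line addresses held in the prefetch buffer
        self.cached = set()   # line addresses already demand-fetched

    def on_l2_miss(self, addr, broadcast_needed):
        """Return "buffer" if the prefetch buffer supplies the line, else "demand"."""
        line = addr - addr % LINE_BYTES
        # The buffer may supply data only when no broadcast is required.
        if line in self.buffer and not broadcast_needed:
            self.buffer.discard(line)
            self.cached.add(line)
            return "buffer"
        self.cached.add(line)
        region = addr // REGION_BYTES
        self.miss_count[region] = self.miss_count.get(region, 0) + 1
        # On reaching the threshold in a non-shared region, prefetch the rest.
        if self.miss_count[region] == MISS_THRESHOLD and not broadcast_needed:
            base = region * REGION_BYTES
            for other in range(base, base + REGION_BYTES, LINE_BYTES):
                if other not in self.cached:
                    self.buffer.add(other)
        return "demand"
```

With these assumed sizes a region holds 16 lines, so the second miss to a non-shared region places the remaining 14 lines in the buffer.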

L2 Misses Prefetched
[Figure: Y-axis: L2 Misses (0% to 100%). Groups: Scientific, Multiprogrammed, Commercial, Arithmetic Mean. Legend: SP-512B, SP-1KB, SP-2KB, Perfect.]

Speedup
[Figure: Y-axis: Speedup (0% to 36%). Groups: Scientific, Multiprogrammed, Commercial, Arithmetic Mean. Legend: CGCT - 512B Region, SP-512B, SP-1KB, SP-2KB.]

Summary
Stealth Prefetching can prefetch data:
• Stealthily:
  • Only non-shared data prefetched
  • Prefetch requests not broadcast
• Aggressively:
  • Large regions prefetched at once, 80-90% timely
• Efficiently:
  • Piggybacked onto a demand request
  • Fetched from DRAM in open-page mode


Power-Efficient DRAM Speculation
• Modern systems overlap the DRAM access with the snoop, speculatively accessing DRAM before snoop response
  • Trading DRAM bandwidth for latency
  • Wasting power
• Approximately 25% of DRAM requests are reads that speculatively access DRAM unnecessarily

[Figure: timeline. Snoop path: Broadcast Req, Snoop Tags, Send Resp. Memory path: DRAM Read, Xmit Block.]

DRAM Operations
[Figure: Y-axis: DRAM Requests (0% to 100%). Groups: Scientific Mean, Multiprogrammed Mean, Commercial Mean, Overall Mean. Legend: Writes, Useful Reads, Misspeculated Reads.]

Power-Efficient DRAM Speculation
• Direct memory requests are non-speculative
• Lines from externally-dirty regions likely to be sourced from another processor’s cache
  • Region state can serve as a prediction
  • Need not access DRAM speculatively
• Initial requests to a region (state unknown) have a lower but significant probability of obtaining data from other processors’ caches
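One way to picture region state acting as a prediction is a small policy function that decides whether a broadcast read should also start a DRAM access. The function name and policy labels are hypothetical; the state abbreviations follow the region protocol table in the backup slides, where CD and DD mean other processors may hold modified copies and I means the state is unknown.

```python
def speculate_dram(region_state: str, policy: str = "no-spec-dirty") -> bool:
    """Return True if DRAM should be read in parallel with the snoop."""
    externally_dirty = region_state in ("CD", "DD")
    unknown = region_state == "I"
    if policy == "speculate-all":
        return True
    if policy == "no-spec-dirty":
        # Skip DRAM when another cache will probably source the line.
        return not externally_dirty
    if policy == "no-spec-dirty-or-unknown":
        return not (externally_dirty or unknown)
    if policy == "no-speculate":
        return False
    raise ValueError(f"unknown policy: {policy}")
```

When the function returns False, the DRAM read would only be issued after the snoop response shows that no cache supplies the data, which delays that read.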

Useless DRAM Reads
[Figure: Y-axis: DRAM Reads (0% to 100%). Groups: Scientific Mean, Multiprogrammed Mean, Commercial Mean, Overall Mean. Legend: Externally-Clean Region, Unknown Region State, Externally-Dirty Region.]

Useful DRAM Reads
[Figure: Y-axis: DRAM Reads (0% to 100%). Groups: Scientific Mean, Multiprogrammed Mean, Commercial Mean, Overall Mean. Bars per group: Ext-Dirty, Ext-Clean, Ext-Unknown. Legend: False Positives.]

DRAM Reads Performed/Delayed
[Figure: Y-axis: DRAM Reads (0% to 120%). Configurations: Baseline; CGCT, Speculate All; CGCT, Oracle Speculation; No-speculate Dirty Regions; No-speculate Dirty or Unknown Regions; No-speculate. Bars per configuration: Reads Performed, Reads Delayed. Data labels, in extraction order: 0.0%, 101.6%, 0.0%, 0.0%, 81.5%, 6.9%, 78.2%, 77.2%, 13.3%, 12.5%, 100.0%, 71.4%.]

Summary
Power-Efficient DRAM Speculation:
• Can reduce DRAM reads 20%, with less than 1% degradation in performance
  • 7% slowdown with nonspeculative DRAM
• Nearly doubles interval between DRAM requests, allowing modules to stay in low-power modes longer


Chip Multiprocessor Interconnect
• Options
  • Buses: don’t scale
  • Crossbars: too expensive
  • Rings: too slow
  • Packet-switched mesh
• Attractive for all the same 1990’s DSM reasons
  • Scalable
  • Low latency
  • High link utilization

CMP Interconnection Networks
• But…
  • Cables/traces are now on-chip wires
    • Fast, cheap, plentiful
    • Short: 1 cycle per hop
  • Router latency adds up
    • 3-4 cycles per hop
  • Store-and-forward
    • Lots of activity/power
• Is this the right answer?

Circuit-Switched Interconnects
• Communication patterns
  • Spatial locality to memory
  • Pairwise communication
• Circuit-switched links
  • Avoid switching/routing
  • Reduce latency
  • Save power?
• Poor utilization! Maybe OK

Router Design
• Switches consist of
  • Configurable crossbar
  • Configuration memory
• 4-stage router pipeline exposes only 1 cycle if CS (circuit-switched)
• Can also act as packet-switched network
• Design details in [CA Letters ’07]
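A back-of-the-envelope comparison shows why exposing 1 router cycle instead of 4 matters. The model below is an assumption made for illustration, not the evaluation methodology: it charges 1 wire cycle per hop, 4 router cycles per hop for a packet-switched traversal, and 1 router cycle per hop on an established circuit, and it ignores circuit setup, contention, and serialization.

```python
def path_latency(hops: int, router_cycles: int, wire_cycles: int = 1) -> int:
    """Zero-load latency in cycles for a message crossing `hops` hops."""
    return hops * (router_cycles + wire_cycles)


# Example: a 4-hop path under the assumed per-hop costs.
packet_switched = path_latency(hops=4, router_cycles=4)
circuit_switched = path_latency(hops=4, router_cycles=1)
print(packet_switched, circuit_switched)
```

Under these assumptions the gap grows linearly with hop count, so longer paths gain the most from an existing circuit.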

Protocol Optimization
• Initial 3-hop miss establishes CS path
• Subsequent miss requests
  • Sent directly on CS path to predicted owner
  • Also in parallel to home node
  • Predicted owner sources data early
  • Directory acks update to sharing list
• Benefits
  • Reduced 3-hop latency
  • Less activity, less power

Hybrid Circuit Switching (1)
• Hybrid Circuit Switching improves performance by up to 7%

Hybrid Circuit Switching (2)
• Positive interaction in co-designed interconnect & protocol
• More circuit reuse => greater latency benefit

Summary
Hybrid Circuit Switching:
• Routing overhead eliminated
• Still enable high bandwidth when needed
• Co-designed protocol
  • Optimize cache-to-cache transfers
• Substantial performance benefits
• To do: power analysis


Server Consolidation on CMPs
• CMP as consolidation platform
  • Simplify system administration
  • Save power, cost and physical infrastructure
• Study combinations of individual workloads in full system environment
  • Micro-coded hypervisor schedules VMs
• See An Evaluation of Server Consolidation Workloads for Multi-Core Designs in IISWC 2007 for additional details
  • Nugget: shared LLC a big win

Virtual Proximity
• Interactions between VM scheduling, placement, and interconnect
  • Goal: placement agnostic scheduling
  • Best workload balance
• Evaluate 3 scheduling policies
  • Gang, Affinity and Load Balanced
• HCS provides virtual proximity

Scheduling Algorithms
• Gang Scheduling
  • Co-schedules all threads of a VM
  • No idle-cycle stealing
• Affinity Scheduling
  • VMs assigned to neighboring cores
  • Can steal idle cycles across VMs sharing core
• Load Balanced Scheduling
  • Ready threads assigned to any core
  • Any/all VMs can steal idle cycles
  • Over time, VM fragments across chip

• Load balancing wins with fast interconnect
• Affinity scheduling wins with slow interconnect
• HCS creates virtual proximity

Virtual Proximity Performance
• HCS able to provide virtual proximity

• As physical distance (hop count) increases, HCS provides significantly lower latency

Summary
Virtual Proximity [in submission]
• Enables placement agnostic hypervisor scheduler
• Results:
  • Up to 17% better than affinity scheduling
  • Idle cycle reduction: 84% over gang and 41% over affinity
  • Low-latency interconnect mitigates increase in L2 cache conflicts from load balancing
    • L2 misses up by 10% but execution time reduced by 11%
• A flexible, distributed address mapping combined with HCS out-performs a localized affinity-based memory mapping by an average of 7%


Circuit-Switched Snooping (1)
• Scalable, efficient broadcasting on unordered network
  • Remove latency overhead of directory indirection
• Extend point-to-point circuit-switched links to trees
  • Low latency multicast via circuit-switched tree
• Help provide performance isolation as requests do not share same communication medium

Circuit-Switched Snooping (2)
• Extend Coarse Grain Coherence Tracking (CGCT)
  • Remove unnecessary broadcasts
  • Convert broadcasts to multicasts
• Effective in Server Consolidation Workloads
  • Very few coherence requests to globally shared data

Snooping Interconnect
• Switches consist of
  • Configurable crossbar
  • Configuration memory
• Circuits span two or more nodes, based on RCA
• Snooping occurs across circuits
  • All sharers in region join circuit
• Each link can physically accommodate multiple circuits

Circuit-Switched Snooping
• Use RCA to identify subsets of nodes that share data
• Create shared circuits among these nodes
• Design challenges
  • Multi-drop, bidirectional circuits
  • Memory ordering
• Results: very much in progress


Research Group Overview
• Faculty: Mikko Lipasti, since 1999
• Current MS/PhD students
  • Gordie Bell (also IBM), Natalie Enright Jerger, Erika Gunadi, Atif Hashmi, Eric Hill, Lixin Su, Dana Vantrease
• Graduates, current employment:
  • Intel: Ilhyun Kim, Morris Marden, Craig Saldanha, Madhu Seshadri
  • IBM: Trey Cain, Jason Cantin, Brian Mestan
  • AMD: Kevin Lepak
  • Sun Microsystems: Matt Ramsay, Razvan Cheveresan, Pranay Koka

Current Focus Areas
• Multiprocessors
  • Coherence protocol optimization
  • Interconnection network design
  • Fairness issues in hierarchical systems
• Microprocessor design
  • Complexity-effective microarchitecture
  • Scalable dynamic scheduling hardware
  • Speculation reduction for power savings
  • Transparent clock gating
  • Domain-specific ISA extensions
• Software
  • Java Virtual Machine run-time optimization
  • Workload development and characterization

Funding
• National Science Foundation
• Intel Research Council
• IBM Faculty Partnership Awards
• IBM Shared University Research equipment
• Schneider ECE Faculty Fellowship
• UW Graduate School

Questions?
http://www.ece.wisc.edu/~pharm


Backup Slides

Region Coherence Arrays
The regions are kept coherent with a protocol, which summarizes the local and global state of lines in the region

State              | Processor                  | Other Processors           | Broadcast Needed?
Invalid (I)        | No Cached Copies           | Unknown                    | Yes
Clean-Invalid (CI) | Unmodified Copies Only     | No Cached Copies           | No
Clean-Clean (CC)   | Unmodified Copies Only     | Unmodified Copies Only     | For Modifiable Copy
Clean-Dirty (CD)   | Unmodified Copies Only     | Modified/Unmodified Copies | Yes
Dirty-Invalid (DI) | Modified/Unmodified Copies | No Cached Copies           | No
Dirty-Clean (DC)   | Modified/Unmodified Copies | Unmodified Copies Only     | For Modifiable Copy
Dirty-Dirty (DD)   | Modified/Unmodified Copies | Modified/Unmodified Copies | Yes
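The "Broadcast Needed?" column of the region state table can be read as a lookup. The sketch below encodes it directly; the dictionary, the function name, and the `wants_modifiable` parameter are illustrative names, not part of the original protocol description.

```python
# Encodes the "Broadcast Needed?" column for each region state.
BROADCAST_NEEDED = {
    "I": "yes",
    "CI": "no",
    "CC": "for-modifiable",
    "CD": "yes",
    "DI": "no",
    "DC": "for-modifiable",
    "DD": "yes",
}


def needs_broadcast(region_state: str, wants_modifiable: bool) -> bool:
    """Decide whether a request must be broadcast, given the local region state."""
    rule = BROADCAST_NEEDED[region_state]
    if rule == "for-modifiable":
        # Others hold only unmodified copies: reads can skip the broadcast.
        return wants_modifiable
    return rule == "yes"
```

For example, a read to a region in state CC can go directly to memory, while a store to the same region still has to be broadcast.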

Region Coherence Arrays
• On cache misses, the region state is read to determine if a broadcast is necessary
• On external snoops, the region state is read to provide a region snoop response
  • Piggybacked onto the conventional response
  • Used to update other processors’ region state
• The regions are kept coherent with a protocol, which summarizes the local and global state of lines in the region

Coarse-Grain Coherence Tracking
[Figure: animated example. Two processors (P0, P1), each with a cache ($0, $1) and an RCA, connected through a network to memories M0 and M1. Region Coherence Array added; two lines per region.]
• P1 stores 100002
• MISS
• Snoop performed
• Response sent
• Data transfer
[Figure annotations: "Store: 100002"; "RFO: P1, 100002"; snoop response "Owned, Region Owned"; "Region not exclusive anymore"; "Hits in P0 cache". Line states shown: Invalid, Exclusive, Pending, Modified. Region states shown: Invalid, DI, Pending, DD.]

Overhead
• Storage for RCA
• Two bits in snoop response for region snoop response
  • Region Externally Clean/Dirty

2-way set-assoc. RCA, 48-bit addresses

RCA Size    | Bits / Set | Total Kilobytes | Tag Overhead | Cache Overhead
2K-Entries  | 74         | 9.3             | 5.0%         | 0.8%
4K-Entries  | 72         | 18.0            | 9.7%         | 1.5%
8K-Entries  | 70         | 35.0            | 48.6%        | 2.8%
16K-Entries | 68         | 68.0            | 88.3%        | 5.5%
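The "Total Kilobytes" column follows from the other two: a 2-way array with N entries has N/2 sets, and each set stores the listed number of bits. The short check below reproduces that arithmetic, assuming 2K means 2048 entries and a kilobyte is 1024 bytes; the function name is made up for the example.

```python
WAYS = 2  # 2-way set-associative RCA, as stated above the table


def rca_kilobytes(entries: int, bits_per_set: int) -> float:
    """Total RCA storage in kilobytes (1 KB = 1024 bytes)."""
    sets = entries // WAYS
    return sets * bits_per_set / 8 / 1024


for entries, bits in [(2048, 74), (4096, 72), (8192, 70), (16384, 68)]:
    print(entries, rca_kilobytes(entries, bits))
```

The first row works out to 9.25 KB, which the table rounds to 9.3; the remaining rows match exactly.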

Overhead
• RCA maintains inclusion over caches
  • RCA must respond correctly to external requests if lines cached
  • When regions evicted from RCA, their lines are evicted from the cache
• Replacement algorithm uses line count to favor regions with no lines cached
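A minimal sketch of a line-count-aware victim choice is shown below. The slide only says that regions with no cached lines are favored; the tuple layout and the least-recently-used tie-break are assumptions added for the example.

```python
def pick_victim(ways):
    """Choose the RCA entry to evict from one set.

    `ways` is a list of (tag, line_count, last_used) tuples.
    Entries with no cached lines are preferred, because evicting them
    forces no cache evictions; ties are broken by least recent use.
    """
    empty = [w for w in ways if w[1] == 0]
    candidates = empty if empty else ways
    return min(candidates, key=lambda w: w[2])
```

Evicting an entry with a nonzero line count would also require evicting those lines from the cache to preserve inclusion, which is why such entries are chosen last.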

Snoop Traffic – Peak
[Figure: Y-axis: Peak Broadcasts / 1000 CPU Cycles (0 to 16). Groups: Scientific Mean, Multiprogrammed Mean, Commercial Mean, Overall Mean. Bars per group: Baseline, 128B, 256B, 512B, 1KB, 2KB, 4KB.]

Snoop Traffic – Average
[Figure: Y-axis: Average Broadcasts / 1000 CPU Cycles (0 to 16). Groups: Scientific Mean, Multiprogrammed Mean, Commercial Mean, Overall Mean. Bars per group: Baseline, 128B, 256B, 512B, 1KB, 2KB, 4KB.]

Page 66:

Snoop Traffic

Peak snoop traffic is halved

Average snoop traffic reduced by nearly two thirds

The system is more scalable, and may effectively support more processors

Page 67:

Tag Lookups Filtered

• Coarse-Grain Coherence Tracking can be used to filter external snoops
• Send external requests to RCA first
• If region valid and line-count nonzero, send external request to cache
• Reduces power consumption in the cache tag arrays
• Increases broadcast snoop latency
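The filtering rule can be sketched as a single predicate. This is a minimal illustration; `SnoopFilterEntry`, the dictionary-based RCA, and the function name are hypothetical stand-ins for the hardware structures.

```python
from __future__ import annotations
from dataclasses import dataclass

@dataclass
class SnoopFilterEntry:
    valid: bool
    line_count: int  # lines of this region currently cached

def needs_tag_lookup(rca: dict[int, SnoopFilterEntry], region_tag: int) -> bool:
    """True only if an external request must be forwarded to the cache tags."""
    entry = rca.get(region_tag)
    # Forward only when the region is valid AND at least one of its lines is cached.
    return entry is not None and entry.valid and entry.line_count > 0

rca = {
    1: SnoopFilterEntry(valid=True, line_count=2),   # must probe the cache
    2: SnoopFilterEntry(valid=True, line_count=0),   # filtered: no lines cached
}
for tag in (1, 2, 3):                                # 3 is absent from the RCA
    print(tag, needs_tag_lookup(rca, tag))
```

The RCA is consulted before the cache, which is why the slide notes that snoop latency increases even as tag-array power drops.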

Page 68:

Tag Lookups Filtered

[Figure: External Requests (y-axis, 0% to 100%) for Oracle and region sizes 128B, 256B, 512B, 1KB, 2KB, 4KB; groups: Scientific Mean, Multiprogrammed Mean, Commercial Mean, Overall Mean; legend: Tag Lookups Filtered, Tag Lookups for Broadcasts Avoided, Write-back Tag Lookups]

Page 69:

Line Evictions for Inclusion

[Figure: Regions Evicted (y-axis, 0% to 100%) for region sizes 128B, 256B, 512B, 1KB, 2KB, 4KB; groups: Scientific Mean, Multiprogrammed Mean, Commercial Mean, Overall Mean; legend: 0 lines evicted through 8 lines evicted]

Page 70:

L2 Miss Ratio Increase

[Figure: L2 Miss Ratio (y-axis, 0% to 110%) for Baseline and region sizes 128B, 256B, 512B, 1KB, 2KB, 4KB; groups: Scientific Mean, Multiprogrammed Mean, Commercial Mean, Overall Mean]

Page 71:

Stealth Prefetching

Lines from a region may be prefetched again after a threshold number of L2 misses (currently 2).

A bit mask of the lines cached since the last prefetch is used to avoid prefetching useless data.

Page 72:

Stealth Prefetching

Prefetched lines are managed by a simple protocol

[State diagram: states Invalid, Pending Data, Pending Requested Data, Valid; transition labels: Line Prefetch Initiated; Processor Miss Request (twice); Data; Data, Send to Cache; Invalidate (twice)]
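A table-driven sketch of such a per-line protocol is shown below. The transcript preserves only the state names and arc labels, so which arc connects which pair of states is an assumption made here for illustration, not the protocol as presented.

```python
from enum import Enum, auto

class S(Enum):
    INVALID = auto()
    PENDING_DATA = auto()
    PENDING_REQUESTED_DATA = auto()
    VALID = auto()

# (state, event) -> (next state, forward the line to the cache?)
# ASSUMPTION: arcs are inferred from the labels; the diagram itself is not recoverable.
TRANSITIONS = {
    (S.INVALID, "line_prefetch_initiated"): (S.PENDING_DATA, False),
    (S.PENDING_DATA, "data"): (S.VALID, False),
    (S.PENDING_DATA, "processor_miss_request"): (S.PENDING_REQUESTED_DATA, False),
    (S.PENDING_DATA, "invalidate"): (S.INVALID, False),
    (S.PENDING_REQUESTED_DATA, "data"): (S.INVALID, True),   # "Data, Send to Cache"
    (S.VALID, "processor_miss_request"): (S.INVALID, True),
    (S.VALID, "invalidate"): (S.INVALID, False),
}

def step(state: S, event: str):
    """Apply one event; unknown (state, event) pairs leave the state unchanged."""
    return TRANSITIONS.get((state, event), (state, False))

# A prefetch that completes and is later consumed by a processor miss.
state = S.INVALID
for event in ("line_prefetch_initiated", "data", "processor_miss_request"):
    state, to_cache = step(state, event)
    print(event, "->", state.name, "(send to cache)" if to_cache else "")
```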

Page 73:

Prefetch Timeliness

[Figure: Timely Prefetches (y-axis, 0% to 100%) for SP-512B, SP-1KB, SP-2KB; groups: Scientific, Multiprogrammed, Commercial, Arithmetic Mean]

Page 74:

Data Traffic

[Figure: Data Traffic (y-axis, 0% to 150%) for Baseline, CGCT-512B, SP-512B, SP-1KB, SP-2KB; groups: Scientific, Multiprogrammed, Commercial, Arithmetic Mean]

Page 75:

Period Between DRAM Requests

[Figure: Processor Cycles (y-axis, 0 to 1000); groups: Scientific Mean, Multiprogrammed Mean, Commercial Mean, Overall Mean; legend: Baseline; CGCT, Speculate All; No-speculate Dirty Regions; No-speculate Dirty or Unknown Regions; No-speculate]

Page 76:

Switch design

Page 77:

Value-Aware Techniques

• Coherence misses in multiprocessors: Store Value Locality [Lepak ’03]
• Ensuring consistency: Value-based checks [Cain ’04]
• Reducing speculation: Operand significance; create (nearly) nonspeculative execution schedule
• Java Virtual Machine runtime optimization [Su]: Speculative optimizations [VEE ’07]

Page 78:

Complexity-Effective Techniques

• Scalable dynamic scheduling hardware: Half-price architecture [Kim ’03]; Macro-op scheduling [Kim ’03]; Operand significance [Gunadi]
• Scalable snoop-based coherence: Coarse-grained coherence [Cantin ’06]; Circuit-switched coherence [Enright]

Page 79:

Power-Efficient Techniques

• Reduced speculation [Gunadi]
• Clock gating [E. Hill]: Transparent pipelines need fine-grained stalls; redistribute coarse-grained stall cycles
• Circuit-switched coherence [Enright]: Reduce overhead of CMP cache coherence; improve latency, power

Page 80:

Cache Coherence Problem

[Diagram: processors P0 and P1 above Memory. Labels in order: Load A; A = 0; Load A; A = 0; Store A <= 1; 1; Load A]

Page 81:

Cache Coherence Problem

[Diagram: processors P0 and P1 above Memory. Labels in order: Load A; A = 0; Load A; A = 0; Store A <= 1; 1; Load A; A = 1]

Page 82:

Snoopy Cache Coherence

• All cache misses broadcast on shared bus: processors and memory snoop and respond
• Cache block permissions enforced: multiple readers allowed (shared state); only a single writer (exclusive state)
• Must upgrade block before writing to it: other copies invalidated
• Read/write-shared blocks bounce from cache to cache: migratory sharing
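The rules above can be exercised with a toy invalidation-based model. This is a simplified sketch, not the protocol evaluated in the talk: blocks hold a single value, there are only two states ("S" shared, "E" exclusive), nothing is ever evicted, and the class and method names are made up for the example.

```python
class SnoopyBus:
    """Toy bus: every miss and every upgrade is broadcast to all caches."""

    def __init__(self, n_caches: int):
        self.memory = {}                                 # block -> value
        self.caches = [{} for _ in range(n_caches)]      # block -> (state, value)
        self.broadcasts = 0

    def load(self, cpu: int, block: str) -> int:
        line = self.caches[cpu].get(block)
        if line is not None:
            return line[1]                               # hit: no bus traffic
        self.broadcasts += 1                             # miss: broadcast a read
        value, shared = self.memory.get(block, 0), False
        for other, cache in enumerate(self.caches):
            if other != cpu and block in cache:
                state, v = cache[block]
                if state == "E":                         # owner supplies the data
                    self.memory[block] = value = v
                cache[block] = ("S", v)                  # owner downgrades to shared
                shared = True
        self.caches[cpu][block] = ("S" if shared else "E", value)
        return value

    def store(self, cpu: int, block: str, value: int) -> None:
        line = self.caches[cpu].get(block)
        if line is None or line[0] != "E":
            self.broadcasts += 1                         # upgrade before writing
            for other, cache in enumerate(self.caches):
                if other != cpu:
                    cache.pop(block, None)               # other copies invalidated
        self.caches[cpu][block] = ("E", value)           # single writer

bus = SnoopyBus(2)
bus.load(0, "A"); bus.load(1, "A")    # both processors read A = 0
bus.store(0, "A", 1)                  # assumption: P0 is the writer
print(bus.load(1, "A"), bus.broadcasts)
```

In this run P1's second load misses because its copy was invalidated, so it observes the new value rather than a stale 0; all four misses and upgrades were broadcast, which is the traffic coarse-grain tracking aims to reduce.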

Page 83:

Example: Conventional Snooping

[Diagram: processors P0 and P1 with caches $0 and $1 (Tag, State, Data columns), memories M0 and M1, all connected by a Network; all cache entries start Invalid]

• P0 loads 10000₂: MISS (line 0010 goes Pending)
• Snoop performed: Read: P0, 10000₂
• Response sent: Invalid, Invalid
• Data transfer: line 0010 becomes Exclusive

Page 84:

Coarse-Grain Coherence Tracking

Region Coherence Array added; two lines per region

[Diagram: processors P0 and P1, each with a cache ($0, $1; Tag, State, Data columns) and a Region Coherence Array (RCA); memories M0 and M1; all connected by a Network; all entries start Invalid]

• P0 loads 10000₂: MISS (line 0010 goes Pending)
• Snoop performed: Read: P0, 10000₂
• Response sent: Invalid, Region Not Shared
• Data transfer: line 0010 becomes Exclusive; RCA entry 001 becomes DI

P0 has exclusive access to region

Page 85:

Coarse-Grain Coherence Tracking

Region Coherence Array added; two lines per region

[Diagram: processors P0 and P1, each with a cache and an RCA; memories M0 and M1; Network. $0 holds line 0010 Exclusive; P0's RCA holds region 001 in state DI]

• P0 loads 11000₂: MISS, Region Hit
• Direct request sent: Read: P0, 11000₂
• Data transfer: line 0011 goes Pending, then Exclusive

Exclusive region state, broadcast unnecessary
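The two-load example can be replayed with a few lines of address arithmetic. The sketch assumes 8-byte lines, which is inferred from the 4-bit line tags shown in the diagram rather than stated on these slides, and it treats `DI` as the only exclusive region state because that is the only one the example shows. All function names are hypothetical.

```python
LINE_BYTES = 8            # assumption: inferred from the tags in the example
LINES_PER_REGION = 2      # "two lines per region"
EXCLUSIVE_REGION_STATES = {"DI"}   # only state shown in the example

def line_tag(addr: int) -> int:
    return addr // LINE_BYTES

def region_tag(addr: int) -> int:
    return addr // (LINE_BYTES * LINES_PER_REGION)

def route_miss(rca: dict, addr: int) -> str:
    """'direct' if the region is held exclusively, otherwise 'broadcast'."""
    state = rca.get(region_tag(addr), "Invalid")
    return "direct" if state in EXCLUSIVE_REGION_STATES else "broadcast"

rca = {}
first = route_miss(rca, 0b10000)      # region unknown: must broadcast
rca[region_tag(0b10000)] = "DI"       # snoop response: Region Not Shared
second = route_miss(rca, 0b11000)     # same region: go straight to memory
print(format(line_tag(0b10000), "04b"), first)
print(format(line_tag(0b11000), "04b"), second)
```

Both addresses fall in region 001, so once the first broadcast establishes exclusive region access, the second miss needs no broadcast.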

Page 86:

Impact on Execution Time

[Figure: Execution Time (y-axis, 0% to 100%); groups: Scientific Mean, Multiprogrammed Mean, Commercial Mean, Overall Mean; legend: Baseline; CGCT, Speculate All; No-speculate Dirty Regions; No-speculate Dirty or Unknown Regions; No-speculate]

Page 87:

Stealth Prefetching

[Diagram: processors P0 and P1, each with a cache ($0, $1), an RCA, and an SDPB; memories M0 and M1; Network. $0 holds line 0100 Exclusive; P0's RCA holds region 001 in state DI; SDPB entries start Invalid]

• P0 loads 0x28: MISS, RCA Hit
• Direct request sent: Read: P0, 0x28, Prefetch: 1100₂
• Data transfer: line 0101 goes Pending, then Exclusive
• Prefetch data: SDPB entries 0110 and 0111 go Pending, then Valid

Assume 8-byte lines, 32-byte regions, 2-line threshold

Page 88:

Stealth Prefetching

[Diagram: processors P0 and P1, each with a cache ($0, $1), an RCA, and an SDPB; memories M0 and M1; Network. $0 holds lines 0100 and 0101 Exclusive; P0's SDPB holds lines 0110 and 0111 Valid]

• P0 loads 0x30: MISS, SDPB Hit
• Data transfer: SDPB returns data; SDPB entry 0110 becomes Invalid; line 0110 goes Pending, then Exclusive in $0

Assume 8-byte lines, 32-byte regions, 2-line threshold
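The numbers in this example can be checked with a short sketch using the stated parameters (8-byte lines, 32-byte regions, threshold of 2). One assumption is added: the prefetch mask covers every line of the region that has not yet been requested, which reproduces the 1100₂ mask in the example but may not be the exact hardware rule. The helper names are hypothetical.

```python
LINE_BYTES = 8
REGION_BYTES = 32
LINES_PER_REGION = REGION_BYTES // LINE_BYTES   # 4 lines per region
THRESHOLD = 2                                   # misses before prefetching

def region_base(addr: int) -> int:
    return addr - addr % REGION_BYTES

def line_index(addr: int) -> int:
    return (addr % REGION_BYTES) // LINE_BYTES

def prefetch_mask(missed_lines: set) -> int:
    """Bit i set means 'prefetch line i of the region'.

    ASSUMPTION: once the threshold is reached, prefetch every line of the
    region that has not been requested yet.
    """
    if len(missed_lines) < THRESHOLD:
        return 0
    requested = sum(1 << i for i in missed_lines)
    return ((1 << LINES_PER_REGION) - 1) & ~requested

def mask_to_addrs(base: int, mask: int) -> list:
    return [base + i * LINE_BYTES for i in range(LINES_PER_REGION) if (mask >> i) & 1]

# 0x20 is already cached (tag 0100); the miss on 0x28 is the second in the region.
missed = {line_index(0x20), line_index(0x28)}
mask = prefetch_mask(missed)
sdpb = set(mask_to_addrs(region_base(0x28), mask))
print(format(mask, "04b"), [hex(a) for a in sorted(sdpb)], 0x30 in sdpb)
# prints: 1100 ['0x30', '0x38'] True
```

The later load of 0x30 therefore finds its line in the prefetch buffer, matching the "MISS, SDPB Hit" step.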

Page 89:

Communication Latencies

                                 CC-NUMA            CMP
Local Cache Access               12                 12
Remote Cache-to-Cache Transfer   12 + 21 * H * 3    12 + 4 * H * 3
Local Memory Access              150                150
Remote Memory Access             150 + 21 * H * 2   150 + 4 * H * 2

(H = hop count)

• Remote cache access is 2-5x faster in CMPs than NUMA machines
• Lower communication latencies allow for more flexible thread placement
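The latency formulas can be evaluated for a few hop counts to see how the gap grows with distance. This is only a sketch of the arithmetic in the table; the function names and the chosen hop counts are assumptions, and the unit is taken to be cycles.

```python
CC_NUMA_HOP = 21   # per-hop cost from the CC-NUMA column
CMP_HOP = 4        # per-hop cost from the CMP column

def remote_cache_to_cache(per_hop: int, hops: int) -> int:
    """12 + per_hop * H * 3, as given in the table."""
    return 12 + per_hop * hops * 3

def remote_memory(per_hop: int, hops: int) -> int:
    """150 + per_hop * H * 2, as given in the table."""
    return 150 + per_hop * hops * 2

for hops in (1, 2, 4):   # hypothetical hop counts
    numa = remote_cache_to_cache(CC_NUMA_HOP, hops)
    cmp_latency = remote_cache_to_cache(CMP_HOP, hops)
    print(f"H={hops}: CC-NUMA {numa}, CMP {cmp_latency}, ratio {numa / cmp_latency:.2f}")
```

The fixed 12-cycle local access term matters less as H grows, so the ratio approaches the ratio of the per-hop costs.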

Page 90:

Configuration

Simulation Parameters

Cores            16 single-threaded light-weight, in-order
Interconnect     2-D Packet-Switched Mesh, 3-cycle router pipeline (baseline)
                 Hybrid Circuit-Switched Mesh, 4 Circuits
L1 Cache         Split I/D, 16KB each (2 cycles)
L2 Cache         Private, 128 KB (6 cycles)
L3 Cache         Shared, 16 MB (16 1MB banks), 12 cycles
Memory Latency   150 cycles

Workload Mixes

Mix 1   TPC-W (4) + TPC-H (4)
Mix 2   TPC-W (4) + SPECjbb (4)
Mix 3   TPC-H (4) + SPECjbb (4)

Page 91:

Effect of Memory Placement

• Load Balancing with HCS outperforms local placement
• Virtual proximity to memory home node