Coarse-Grained Coherence Mikko H. Lipasti Associate Professor Electrical and Computer Engineering University of Wisconsin – Madison Joint work with:Jason

Coarse-Grained Coherence

Mikko H. Lipasti
Associate Professor

Electrical and Computer Engineering
University of Wisconsin – Madison

Joint work with:
Jason Cantin, IBM (Ph.D. ’06)
Natalie Enright Jerger
Prof. Jim Smith
Prof. Li-Shiuan Peh (Princeton)

http://www.ece.wisc.edu/~pharm

Motivation

• Multiprocessors are commonplace
  • Historically, glass house servers
  • Now laptops, soon cell phones
• Most common multiprocessor
  • Symmetric processors w/coherent caches
  • Logical extension of time-shared uniprocessors
  • Easy to program, reason about
  • Not so easy to build

Aug 30, 2007 Mikko Lipasti-University of Wisconsin

Coherence Granularity

• Track each individual word
  • Too much overhead
• Track larger blocks
  • 32B – 128B common
  • Less overhead, exploit spatial locality
  • Large blocks cause false sharing
• Solution: use multiple granularities
  • Small blocks: manage local read/write permissions
  • Large blocks: track global behavior

[Figure: processors P0 through P7]


Coarse-Grained Coherence

• Initially
  • Identify non-shared regions
  • Decouple obtaining coherence permission from data transfer
  • Filter snoops to reduce broadcast bandwidth
• Later
  • Enable aggressive prefetching
  • Optimize DRAM accesses
  • Customize protocol, interconnect to match


Coarse-Grained Coherence

• Optimizations lead to
  • Reduced memory miss latency
  • Reduced cache-to-cache miss latency
  • Reduced snoop bandwidth
  • Fewer exposed cache misses
  • Elimination of unnecessary DRAM reads
  • Power savings on bus, interconnect, caches, and in DRAM
  • World peace and end to global warming


Coarse-Grained Coherence Tracking

• Memory is divided into coarse-grained regions
  • Aligned, power-of-two multiple of cache line size
  • Can range from two lines to a physical page
• A cache-like structure is added to each processor for monitoring coherence at the granularity of regions
  • Region Coherence Array (RCA)

Region Coherence Arrays

• Each entry has an address tag, state, and count of lines cached by the processor
• The region state indicates if the processor and / or other processors are sharing / modifying lines in the region
• Customize policy/protocol/interconnect to exploit region state
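The region bookkeeping described above can be sketched in a few lines of Python. This is a minimal illustration, not the hardware design: the 512B region size is just one of the sizes evaluated later in the talk, and the names `region_tag`, `RegionEntry`, and `RegionCoherenceArray` are hypothetical. The array is modeled as an unbounded dictionary rather than a set-associative structure.

```python
from dataclasses import dataclass

LINE_SIZE = 64     # bytes per cache line (the simulated system uses 64B lines)
REGION_SIZE = 512  # bytes per region; assumed here for illustration


def region_tag(addr: int, region_size: int = REGION_SIZE) -> int:
    """Return the number of the aligned region that contains addr."""
    # A region must be a power-of-two multiple of the line size.
    assert region_size % LINE_SIZE == 0
    assert region_size & (region_size - 1) == 0
    return addr // region_size


@dataclass
class RegionEntry:
    tag: int             # region address tag
    state: str = "I"     # region state label (hypothetical encoding)
    line_count: int = 0  # lines of this region cached by the local processor


class RegionCoherenceArray:
    """Toy RCA: unbounded and fully associative, for illustration only."""

    def __init__(self):
        self.entries = {}

    def on_line_fill(self, addr: int) -> RegionEntry:
        tag = region_tag(addr)
        entry = self.entries.setdefault(tag, RegionEntry(tag))
        entry.line_count += 1
        return entry

    def on_line_evict(self, addr: int) -> RegionEntry:
        entry = self.entries[region_tag(addr)]
        entry.line_count -= 1
        return entry
```

Two lines that fall in the same 512B region share one entry, so the line count rises to 2 after both are filled.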


Talk Outline

• Motivation
• Overview of Coarse-Grained Coherence
• Techniques
  • Broadcast Snoop Reduction [ISCA 2005]
  • Stealth Prefetching [ASPLOS 2006]
  • Power-Efficient DRAM Speculation
  • Hybrid Circuit Switching
  • Virtual Proximity
  • Circuit-switched snooping
• Research Group Overview


Unnecessary Broadcasts

[Figure: stacked bar chart. Y-axis: Requests (0% to 100%). Bars: Scientific Mean, Multiprogrammed Mean, Commercial Mean, Overall Mean. Legend: Write-back, Writes, Read, I-Fetch, DCB.]


Broadcast Snoop Reduction

• Identify requests that don’t need a broadcast
• Send data requests directly to memory w/o broadcasting
  • Reducing broadcast traffic
  • Reducing memory latency
• Avoid sending non-data requests externally

Example


Simulator Evaluation

• PHARMsim: near-RTL but written in C
  • Execution-driven simulator built on top of SimOS-PPC
• Four 4-way superscalar out-of-order processors
• Two-level hierarchy with split L1, unified 1MB L2 caches, and 64B lines
• Separate address / data networks, similar to Sun Fireplane


Workloads

• Scientific
  • Ocean, Raytrace, Barnes
• Multiprogrammed
  • SPECint2000_rate, SPECint95_rate
• Commercial (database, web)
  • TPC-W, TPC-B, TPC-H
  • SPECweb99, SPECjbb2000


Broadcasts Avoided

[Figure: stacked bar chart in four groups. Y-axis: Requests (0% to 100%). X-axis within each group: Unnecessary, 128B, 256B, 512B, 1KB, 2KB, 4KB. Legend: Write-backs, I-Fetches, Writes, Reads, DCB.]


Execution Time

[Figure: bar chart. Y-axis: Execution Time (0% to 100%). Legend: Baseline, 128B, 256B, 512B, 1KB, 2KB, 4KB.]


Summary

• Eliminates nearly all unnecessary broadcasts
• Reduces snoop activity by 65%
  • Fewer broadcasts
  • Fewer lookups
• Provides modest speedup


Talk Outline

• Motivation
• Overview of Coarse-Grained Coherence
• Techniques
  • Broadcast Snoop Reduction [ISCA 2005]
  • Stealth Prefetching [ASPLOS 2006]
  • Power-Efficient DRAM Speculation
  • Hybrid Circuit Switching
  • Virtual Proximity
  • Circuit-switched snooping



Prefetching in Multiprocessors

• Prefetching
  • Anticipate future reference, fetch into cache
• Many prefetching heuristics possible
  • Current systems: next-block, stride
  • Proposed: skip pointer, content-based
  • Some/many prefetched blocks are not used
• Multiprocessor complications
  • Premature or unnecessary prefetches
  • Permission thrashing if blocks are shared
  • Separate study [ISPASS 2006]


Stealth Prefetching

• Lines from non-shared regions can be prefetched stealthily and efficiently
  • Without disturbing other processors
    • Without downgrades, invalidations
    • Without preventing them from obtaining exclusive copies
  • Without broadcasting prefetch requests
  • Fetched from DRAM with low overhead

Example


Stealth Prefetching

• After a threshold number of L2 misses (2), the rest of the lines from a region are prefetched
• These lines are buffered close to the processor for later use (Stealth Data Prefetch Buffer)
• After accessing the RCA, requests may obtain data from the buffer as they would from memory
  • To access data, the region must be in a valid state and a broadcast must be unnecessary for coherent access
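The trigger described above can be sketched as follows. This is a minimal illustration under stated assumptions: 8 lines per region (for example 512B regions of 64B lines), a caller that already knows from the RCA whether the region is non-shared, and a single prefetch per region. `RegionPrefetchState` and `on_l2_miss` are hypothetical names.

```python
LINES_PER_REGION = 8  # assumed: 512B regions of 64B lines
MISS_THRESHOLD = 2    # threshold number of L2 misses from the slide


class RegionPrefetchState:
    """Per-region bookkeeping (hypothetical structure)."""

    def __init__(self):
        self.misses = 0          # L2 misses seen in this region so far
        self.demand_mask = 0     # bit i set => line i was fetched on demand
        self.prefetched = False  # True once the rest of the region was requested


def on_l2_miss(state, line_index, region_non_shared):
    """Record one L2 miss; return the line indices to prefetch (maybe empty)."""
    state.misses += 1
    state.demand_mask |= 1 << line_index
    # Prefetch only from non-shared regions, and only once the threshold is met.
    if state.prefetched or not region_non_shared:
        return []
    if state.misses < MISS_THRESHOLD:
        return []
    state.prefetched = True
    # Request every line of the region that was not already fetched on demand.
    return [i for i in range(LINES_PER_REGION)
            if not state.demand_mask & (1 << i)]
```

In this sketch the first miss returns nothing, and the second miss to the same non-shared region returns the six remaining lines, which would then be placed in the prefetch buffer.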


L2 Misses Prefetched

[Figure: bar chart. Y-axis: L2 Misses (0% to 100%). Groups: Scientific, Multiprogrammed, Commercial, Arithmetic Mean. Legend: SP-512B, SP-1KB, SP-2KB, Perfect.]


Speedup

[Figure: bar chart. Y-axis: Speedup (0% to 36%). Legend: CGCT-512B Region, SP-512B, SP-1KB, SP-2KB.]


Summary

Stealth Prefetching can prefetch data:

• Stealthily:
  • Only non-shared data prefetched
  • Prefetch requests not broadcast
• Aggressively:
  • Large regions prefetched at once, 80-90% timely
• Efficiently:
  • Piggybacked onto a demand request
  • Fetched from DRAM in open-page mode






Power-Efficient DRAM Speculation

• Modern systems overlap the DRAM access with the snoop, speculatively accessing DRAM before snoop response
  • Trading DRAM bandwidth for latency
  • Wasting power
• Approximately 25% of DRAM requests are reads that speculatively access DRAM unnecessarily

[Figure: timeline. Snoop path: Broadcast Req, Snoop Tags, Send Resp. Memory path: DRAM Read, Xmit Block.]


DRAM Operations

[Figure: stacked bar chart. Y-axis: DRAM Requests (0% to 100%). Recoverable bar labels: Commercial Mean, Overall Mean. Legend: Writes, Useful Reads, Misspeculated Reads.]


Power-Efficient DRAM Speculation

• Direct memory requests are non-speculative
• Lines from externally-dirty regions likely to be sourced from another processor’s cache
  • Region state can serve as a prediction
  • Need not access DRAM speculatively
• Initial requests to a region (state unknown) have a lower but significant probability of obtaining data from other processors’ caches
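Using region state as a prediction can be sketched as a small policy function. The policy names below mirror the configurations compared in the talk (speculate all, no-speculate for dirty regions, no-speculate for dirty or unknown regions, no-speculate), but the string encodings and the function `should_speculate` are assumptions made for illustration.

```python
def should_speculate(external_state: str, policy: str) -> bool:
    """Decide whether to start the DRAM read before the snoop response.

    external_state: "clean", "dirty", or "unknown" (what the region state
    says about other processors' copies; hypothetical encoding).
    """
    if policy == "speculate_all":
        return True
    if policy == "no_speculate":
        return False
    if policy == "no_spec_dirty":
        # Externally-dirty regions are likely served by another cache.
        return external_state != "dirty"
    if policy == "no_spec_dirty_or_unknown":
        # Speculate only when the region is known to be externally clean.
        return external_state == "clean"
    raise ValueError(f"unknown policy: {policy}")
```

A request to an externally-dirty region skips the speculative DRAM read under both selective policies, while a request to a region in an unknown state is speculated only under the less conservative one.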


Useless DRAM Reads

[Figure: stacked bar chart. Y-axis: DRAM Reads (0% to 100%). Legend: Externally-Clean Region, Unknown Region State, Externally-Dirty Region.]


Useful DRAM Reads

[Figure: bar chart in four groups. Y-axis: DRAM Reads (0% to 100%). X-axis within each group: Ext-Dirty, Ext-Clean, Ext-Unknown. Recoverable group labels: Commercial Mean, Overall Mean. Legend: False Positives.]


DRAM Reads Performed/Delayed

[Figure: bar chart. Y-axis: DRAM Reads (0% to 120%). Each configuration has two bars, Reads Performed and Reads Delayed. Configurations: Baseline; CGCT, Speculate All; CGCT, Oracle Speculation; No-speculate Dirty Regions; No-speculate Dirty or Unknown Regions; No-speculate. Data labels in extraction order (assignment to bars not recoverable): 0.0%, 101.6%, 0.0%, 0.0%, 81.5%, 6.9%, 78.2%, 77.2%, 13.3%, 12.5%, 100.0%, 71.4%.]


Summary

Power-Efficient DRAM Speculation:

• Can reduce DRAM reads 20%, with less than 1% degradation in performance
  • 7% slowdown with nonspeculative DRAM
• Nearly doubles interval between DRAM requests, allowing modules to stay in low-power modes longer






Chip Multiprocessor Interconnect

• Options
  • Buses: don’t scale
  • Crossbars: too expensive
  • Rings: too slow
  • Packet-switched mesh
• Attractive for all the same 1990s DSM reasons
  • Scalable
  • Low latency
  • High link utilization


CMP Interconnection Networks

• But…
  • Cables/traces are now on-chip wires
    • Fast, cheap, plentiful
    • Short: 1 cycle per hop
  • Router latency adds up
    • 3-4 cycles per hop
  • Store-and-forward
    • Lots of activity/power
• Is this the right answer?


Circuit-Switched Interconnects

• Communication patterns
  • Spatial locality to memory
  • Pairwise communication
• Circuit-switched links
  • Avoid switching/routing
  • Reduce latency
  • Save power?
• Poor utilization! Maybe OK


Router Design

• Switches consist of
  • Configurable crossbar
  • Configuration memory
• 4-stage router pipeline exposes only 1 cycle if circuit-switched (CS)
• Can also act as packet-switched network
• Design details in [CA Letters ’07]


Protocol Optimization

• Initial 3-hop miss establishes CS path
• Subsequent miss requests
  • Sent directly on CS path to predicted owner
  • Also in parallel to home node
  • Predicted owner sources data early
  • Directory acks update to sharing list
• Benefits
  • Reduced 3-hop latency
  • Less activity, less power

Hybrid Circuit Switching (1)

• Hybrid Circuit Switching improves performance by up to 7%

Hybrid Circuit Switching (2)

• Positive interaction in co-designed interconnect & protocol
• More circuit reuse => greater latency benefit



Summary

Hybrid Circuit Switching:

• Routing overhead eliminated
• Still enable high bandwidth when needed
• Co-designed protocol
  • Optimize cache-to-cache transfers
• Substantial performance benefits
• To do: power analysis





Server Consolidation on CMPs

• CMP as consolidation platform
  • Simplify system administration
  • Save power, cost and physical infrastructure
• Study combinations of individual workloads in full system environment
  • Micro-coded hypervisor schedules VMs
• See An Evaluation of Server Consolidation Workloads for Multi-Core Designs in IISWC 2007 for additional details
  • Nugget: shared LLC a big win


Virtual Proximity

• Interactions between VM scheduling, placement, and interconnect
  • Goal: placement agnostic scheduling
  • Best workload balance
• Evaluate 3 scheduling policies
  • Gang, Affinity and Load Balanced
• HCS provides virtual proximity


Scheduling Algorithms

• Gang Scheduling
  • Co-schedules all threads of a VM
  • No idle-cycle stealing
• Affinity Scheduling
  • VMs assigned to neighboring cores
  • Can steal idle cycles across VMs sharing core
• Load Balanced Scheduling
  • Ready threads assigned to any core
  • Any/all VMs can steal idle cycles
  • Over time, VM fragments across chip


• Load balancing wins with fast interconnect
• Affinity scheduling wins with slow interconnect
• HCS creates virtual proximity


• HCS able to provide virtual proximity

Virtual Proximity Performance


•As physical distance (hop count) increases, HCS provides significantly lower latency


Summary

Virtual Proximity [in submission]

• Enables placement agnostic hypervisor scheduler
• Results:
  • Up to 17% better than affinity scheduling
  • Idle cycle reduction: 84% over gang and 41% over affinity
  • Low-latency interconnect mitigates increase in L2 cache conflicts from load balancing
    • L2 misses up by 10% but execution time reduced by 11%

A flexible, distributed address mapping combined with HCS out-performs a localized affinity-based memory mapping by an average of 7%






Circuit-Switched Snooping (1)

• Scalable, efficient broadcasting on unordered network
  • Remove latency overhead of directory indirection
• Extend point-to-point circuit-switched links to trees
  • Low latency multicast via circuit-switched tree
• Help provide performance isolation as requests do not share same communication medium


Circuit-Switched Snooping (2)

• Extend Coarse Grain Coherence Tracking (CGCT)
  • Remove unnecessary broadcasts
  • Convert broadcasts to multicasts
• Effective in Server Consolidation Workloads
  • Very few coherence requests to globally shared data



Snooping Interconnect

• Switches consist of
  • Configurable crossbar
  • Configuration memory
• Circuits span two or more nodes, based on RCA
• Snooping occurs across circuits
  • All sharers in region join circuit
• Each link can physically accommodate multiple circuits


Circuit-Switched Snooping

• Use RCA to identify subsets of nodes that share data
• Create shared circuits among these nodes
• Design challenges
  • Multi-drop, bidirectional circuits
  • Memory ordering
• Results: very much in progress






Research Group Overview

• Faculty: Mikko Lipasti, since 1999
• Current MS/PhD students
  • Gordie Bell (also IBM), Natalie Enright Jerger, Erika Gunadi, Atif Hashmi, Eric Hill, Lixin Su, Dana Vantrease
• Graduates, current employment:
  • Intel: Ilhyun Kim, Morris Marden, Craig Saldanha, Madhu Seshadri
  • IBM: Trey Cain, Jason Cantin, Brian Mestan
  • AMD: Kevin Lepak
  • Sun Microsystems: Matt Ramsay, Razvan Cheveresan, Pranay Koka


Current Focus Areas

• Multiprocessors
  • Coherence protocol optimization
  • Interconnection network design
  • Fairness issues in hierarchical systems
• Microprocessor design
  • Complexity-effective microarchitecture
  • Scalable dynamic scheduling hardware
  • Speculation reduction for power savings
  • Transparent clock gating
  • Domain-specific ISA extensions
• Software
  • Java Virtual Machine run-time optimization
  • Workload development and characterization


Funding

• National Science Foundation
• Intel Research Council
• IBM Faculty Partnership Awards
• IBM Shared University Research equipment
• Schneider ECE Faculty Fellowship
• UW Graduate School


Questions?

http://www.ece.wisc.edu/~pharm


Backup Slides


Region Coherence Arrays

The regions are kept coherent with a protocol, which summarizes the local and global state of lines in the region

Region State         Processor                    Other Processors             Broadcast Needed?
Invalid (I)          No Cached Copies             Unknown                      Yes
Clean-Invalid (CI)   Unmodified Copies Only       No Cached Copies             No
Clean-Clean (CC)     Unmodified Copies Only       Unmodified Copies Only       For Modifiable Copy
Clean-Dirty (CD)     Unmodified Copies Only       Modified/Unmodified Copies   Yes
Dirty-Invalid (DI)   Modified/Unmodified Copies   No Cached Copies             No
Dirty-Clean (DC)     Modified/Unmodified Copies   Unmodified Copies Only       For Modifiable Copy
Dirty-Dirty (DD)     Modified/Unmodified Copies   Modified/Unmodified Copies   Yes
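The "Broadcast Needed?" column of the region states listed above maps directly onto a lookup. The sketch below is illustrative only: the state abbreviations come from the slide, while `BROADCAST_POLICY` and `broadcast_needed` are hypothetical names.

```python
# "always" / "never" / "modifiable" mirror Yes / No / For Modifiable Copy.
BROADCAST_POLICY = {
    "I": "always",       # Invalid: other processors' state is unknown
    "CI": "never",       # Clean-Invalid
    "CC": "modifiable",  # Clean-Clean
    "CD": "always",      # Clean-Dirty
    "DI": "never",       # Dirty-Invalid
    "DC": "modifiable",  # Dirty-Clean
    "DD": "always",      # Dirty-Dirty
}


def broadcast_needed(region_state: str, wants_modifiable_copy: bool) -> bool:
    """Return True if a request in this region state must be broadcast."""
    policy = BROADCAST_POLICY[region_state]
    if policy == "always":
        return True
    if policy == "never":
        return False
    # Other processors hold only unmodified copies: reads can go directly
    # to memory, but obtaining a modifiable copy still needs a broadcast.
    return wants_modifiable_copy
```

For example, a read miss in a Dirty-Invalid region can be sent directly to memory, while a store miss in a Clean-Clean region still requires a broadcast.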


Region Coherence Arrays

• On cache misses, the region state is read to determine if a broadcast is necessary
• On external snoops, the region state is read to provide a region snoop response
  • Piggybacked onto the conventional response
  • Used to update other processors’ region state



Coarse-Grain Coherence Tracking

[Figure: example with processors P0 and P1 (caches $0 and $1, each with an RCA), memories M0 and M1, and a network. Region Coherence Array added; two lines per region.]

• P1 stores to address 10000 (binary): MISS; line tag 0010 and region tag 001 become Pending
• Snoop performed: RFO: P1, 10000 (binary); hits in P0 cache
• Response sent: Owned, Region Owned; region not exclusive anymore (region state DD)
• Data transfer: P0's line becomes Invalid, P1's line becomes Modified


Overhead

• Storage for RCA
• Two bits in snoop response for region snoop response
  • Region Externally Clean/Dirty

2-way set-assoc. RCA, 48-bit addresses

RCA Size      Bits / Set   Total Kilobytes   Tag Overhead   Cache Overhead
2K-Entries    74           9.3               5.0%           0.8%
4K-Entries    72           18.0              9.7%           1.5%
8K-Entries    70           35.0              48.6%          2.8%
16K-Entries   68           68.0              88.3%          5.5%
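The "Total Kilobytes" column follows from the other two: a 2-way set-associative array with N entries has N/2 sets, and the storage is the number of sets times the bits per set. A short check, with `rca_kilobytes` as a hypothetical helper name:

```python
def rca_kilobytes(entries: int, bits_per_set: int, ways: int = 2) -> float:
    """Storage of a set-associative RCA in kilobytes (1 KB = 1024 bytes)."""
    sets = entries // ways
    total_bits = sets * bits_per_set
    return total_bits / 8 / 1024


# Rows of the table: (entries, bits per set)
for entries, bits in [(2048, 74), (4096, 72), (8192, 70), (16384, 68)]:
    print(f"{entries // 1024}K entries: {rca_kilobytes(entries, bits):.2f} KB")
```

The 2K-entry case works out to 9.25 KB, which the table reports rounded as 9.3; the other three rows come out to exactly 18.0, 35.0, and 68.0 KB.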


Overhead

RCA maintains inclusion over caches

RCA must respond correctly to external requests if lines cached

When regions evicted from RCA, their lines are evicted from the cache

Replacement algorithm uses line count to favor regions with no lines cached
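A victim-selection policy of this kind might look like the sketch below. The slide only says that the line count is used to favor regions with no lines cached; breaking ties by least-recent use is an assumption made here, and `choose_victim` and the tuple layout are hypothetical.

```python
def choose_victim(set_entries):
    """Pick the RCA entry to evict from one set.

    set_entries: list of (tag, line_count, last_use) tuples, where a
    smaller last_use means the entry was touched longer ago.
    """
    # Prefer regions with no lines cached: evicting them forces no
    # cache-line evictions to maintain inclusion.
    empty = [e for e in set_entries if e[1] == 0]
    candidates = empty or set_entries
    # Assumed tie-breaker: least recently used among the candidates.
    return min(candidates, key=lambda e: e[2])
```

With this policy a recently used region with zero cached lines is evicted before an older region that still has lines in the cache.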


Snoop Traffic – Peak

[Figure: bar chart in four groups. Y-axis: Peak Broadcasts / 1000 CPU Cycles (0 to 16). X-axis within each group: Baseline, 128B, 256B, 512B, 1KB, 2KB, 4KB.]

Snoop Traffic – Average

[Figure: bar chart in four groups. Y-axis: Average Broadcasts / 1000 CPU Cycles (0 to 16). X-axis within each group: Baseline, 128B, 256B, 512B, 1KB, 2KB, 4KB.]

Snoop Traffic

Peak snoop traffic is halved

Average snoop traffic reduced by nearly two thirds

The system is more scalable, and may effectively support more processors


Tag Lookups Filtered

• Coarse-Grain Coherence Tracking can be used to filter external snoops
  • Send external requests to RCA first
  • If region valid and line-count nonzero, send external request to cache
• Reduces power consumption in the cache tag arrays
• Increases broadcast snoop latency
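The filtering rule can be written as a single predicate. This is a sketch only: the RCA is modeled as a plain dictionary keyed by region number, and `forward_snoop_to_cache` and the field names are hypothetical.

```python
def forward_snoop_to_cache(rca: dict, region: int) -> bool:
    """Return True if an external request must look up the cache tags.

    rca maps region number -> {"valid": bool, "line_count": int}.
    """
    entry = rca.get(region)
    if entry is None or not entry["valid"]:
        return False  # region not tracked: no lines can be cached
    return entry["line_count"] > 0  # only probe if some line is cached


rca = {
    1: {"valid": True, "line_count": 3},
    2: {"valid": True, "line_count": 0},
}
```

Requests to region 1 are forwarded to the cache, while requests to region 2 or to an untracked region are answered from the RCA alone, which is where the tag-array power saving comes from.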


Tag Lookups Filtered

[Figure: stacked bar chart in four groups. Y-axis: External Requests (0% to 100%). X-axis within each group: Oracle, 128B, 256B, 512B, 1KB, 2KB, 4KB. Recoverable group labels: Commercial Mean, Overall Mean. Legend: Tag Lookups Filtered, Tag Lookups for Broadcasts Avoided, Write-back Tag Lookups.]


Line Evictions for Inclusion

[Figure: stacked bar chart in four groups. Y-axis: Regions Evicted (0% to 100%). X-axis within each group: 128B, 256B, 512B, 1KB, 2KB, 4KB. Legend: 0 lines evicted through 8 lines evicted.]


L2 Miss Ratio Increase

[Figure: bar chart. Y-axis: L2 Miss Ratio (0% to 110%). Legend: Baseline, 128B, 256B, 512B, 1KB, 2KB, 4KB.]


Lines from a region may be prefetched again after a threshold number of L2 misses (currently 2).

A bit mask of the lines cached since the last prefetch is used to avoid prefetching useless data

Stealth Prefetching


Stealth Prefetching

Prefetched lines are managed by a simple protocol

[Figure: state diagram. States: Invalid, Pending Data, Pending Requested Data, Valid. Transition labels: Line Prefetch Initiated; Processor Miss Request; Data; Data, Send to Cache; Invalidate.]


Prefetch Timeliness

[Figure: bar chart. Y-axis: Timely Prefetches (0% to 100%). Legend: SP-512B, SP-1KB, SP-2KB.]


Data Traffic

[Figure: bar chart. Y-axis: Data Traffic (0% to 150%). Legend: Baseline, CGCT-512B, SP-512B, SP-1KB, SP-2KB.]


Period Between DRAM Requests

[Figure: bar chart. Y-axis: Processor Cycles (0 to 1000). Legend: Baseline; CGCT, Speculate All; No-speculate Dirty Region; No-speculate Dirty or Unknown Regions; No-speculate.]


Switch design


Value-Aware Techniques

• Coherence misses in multiprocessors
  • Store Value Locality [Lepak ’03]
• Ensuring consistency
  • Value-based checks [Cain ’04]
• Reducing speculation
  • Operand significance
  • Create (nearly) nonspeculative execution schedule
• Java Virtual Machine runtime optimization [Su]
  • Speculative optimizations [VEE ’07]


Complexity-Effective Techniques

• Scalable dynamic scheduling hardware
  • Half-price architecture [Kim ’03]
  • Macro-op scheduling [Kim ’03]
  • Operand significance [Gunadi]
• Scalable snoop-based coherence
  • Coarse-grained coherence [Cantin ’06]
  • Circuit-switched coherence [Enright]


Power-Efficient Techniques

• Power-efficient techniques
  • Reduced speculation [Gunadi]
  • Clock gating [E. Hill]
    • Transparent pipelines need fine-grained stalls
    • Redistribute coarse-grained stall cycles
• Circuit-switched coherence [Enright]
  • Reduce overhead of CMP cache coherence
  • Improve latency, power


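Cache Coherence Problem

[Figure, two animation steps: processors P0 and P1 and Memory. Both processors Load A and cache A = 0; a Store A <= 1 follows, then another Load A. In the first step the cached copy still shows A = 0; in the second step the value shown is A = 1.]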


Snoopy Cache Coherence

• All cache misses broadcast on shared bus
  • Processors and memory snoop and respond
• Cache block permissions enforced
  • Multiple readers allowed (shared state)
  • Only a single writer (exclusive state)
    • Must upgrade block before writing to it
    • Other copies invalidated
• Read/write-shared blocks bounce from cache to cache
  • Migratory sharing
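The permission rules above (many readers or one writer, with invalidation on upgrade) can be illustrated with a toy model. This is a deliberately simplified sketch, not a full protocol: it uses only two states, "S" for a readable shared copy and "M" for the single writable copy, ignores data and memory, and `SnoopyBus` is a hypothetical name.

```python
class SnoopyBus:
    """Toy broadcast-based coherence model with per-cache block states."""

    def __init__(self, num_caches: int):
        # One dict per cache: block -> "S" (shared) or "M" (writable).
        self.state = [dict() for _ in range(num_caches)]

    def read(self, cpu: int, block: str) -> str:
        if block in self.state[cpu]:
            return "hit"
        # Miss is broadcast: a writer elsewhere downgrades to shared.
        for other, cache in enumerate(self.state):
            if other != cpu and cache.get(block) == "M":
                cache[block] = "S"
        self.state[cpu][block] = "S"
        return "miss"

    def write(self, cpu: int, block: str) -> str:
        if self.state[cpu].get(block) == "M":
            return "hit"
        had_copy = block in self.state[cpu]
        # Upgrade or miss is broadcast: all other copies are invalidated.
        for other, cache in enumerate(self.state):
            if other != cpu:
                cache.pop(block, None)
        self.state[cpu][block] = "M"
        return "upgrade" if had_copy else "miss"
```

When two processors alternately write and read the same block, each write invalidates the other copy and each read downgrades the writer, which is the cache-to-cache bouncing the slide calls migratory sharing.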


Example: Conventional Snooping

[Figure: processors P0 and P1 with caches $0 and $1 (columns: Tag, State, Data), memories M0 and M1, and a network.]

• P0 loads address 10000 (binary): MISS; line with tag 0010 goes from Invalid to Pending
• Snoop performed: Read: P0, 10000 (binary) is broadcast
• Response sent: Invalid, Invalid
• Data transfer: line becomes Exclusive


[Figure: the same system with a Region Coherence Array (RCA) next to each cache (columns: Tag, State).]

• P0 loads address 10000 (binary): MISS; line tag 0010 and region tag 001 become Pending
• Snoop performed: Read: P0, 10000 (binary) is broadcast
• Response sent: Invalid, Region Not Shared
• Data transfer: line becomes Exclusive, region 001 enters state DI
• P0 has exclusive access to region


[Figure: the same system; P0's RCA holds region 001 in state DI and its cache holds line 0010 in state Exclusive.]

• P0 loads address 11000 (binary): MISS, Region Hit; line tag 0011 becomes Pending
• Direct request sent: Read: P0, 11000 (binary)
• Data transfer: line becomes Exclusive
• Exclusive region state, broadcast unnecessary


Impact on Execution Time

[Figure: bar chart. Y-axis: Execution Time (0% to 100%). Legend: Baseline; CGCT, Speculate All; No-speculate Dirty Regions; No-speculate Dirty or Unknown Regions; No-speculate.]


Stealth Prefetching

[Figure: processors P0 and P1 with caches $0 and $1, RCAs, and a Stealth Data Prefetch Buffer (SDPB); memories M0 and M1; network.]

Assume 8-byte lines, 32-byte regions, 2-line threshold

• P0 loads 0x28: MISS, RCA Hit (region 001 in state DI); line tag 0101 becomes Pending
• Direct request sent: Read: P0, 0x28 with Prefetch: 1100 (binary)
• Data transfer: line 0101 becomes Exclusive
• Prefetch data: SDPB entries 0110 and 0111 go from Pending to Valid


Stealth Prefetching

[Figure: the same system; the SDPB holds lines 0110 and 0111 in state Valid.]

Assume 8-byte lines, 32-byte regions, 2-line threshold

• P0 loads 0x30: MISS, SDPB Hit; line tag 0110 becomes Pending
• Data transfer: the SDPB returns the data; line 0110 becomes Exclusive in the cache and Invalid in the SDPB

Communication Latencies

                                 CC-NUMA            CMP
Local Cache Access               12                 12
Remote Cache-to-Cache Transfer   12 + 21 * H * 3    12 + 4 * H * 3
Local Memory Access              150                150
Remote Memory Access             150 + 21 * H * 2   150 + 4 * H * 2

(H = hop count)

• Remote cache access is 2-5x faster in CMPs than NUMA machines
• Lower communication latencies allow for more flexible thread placement
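The latency formulas above are easy to evaluate for a few hop counts. In this sketch the per-hop costs 21 and 4 are taken from the table, the unit is presumably cycles (the table does not say), and the function names are hypothetical.

```python
NUMA_HOP = 21  # per-hop cost in the CC-NUMA column
CMP_HOP = 4    # per-hop cost in the CMP column


def remote_cache_latency(hops: int, per_hop: int) -> int:
    """Remote cache-to-cache transfer: 12 + per_hop * H * 3."""
    return 12 + per_hop * hops * 3


def remote_memory_latency(hops: int, per_hop: int) -> int:
    """Remote memory access: 150 + per_hop * H * 2."""
    return 150 + per_hop * hops * 2


for h in (1, 2, 4):
    numa = remote_cache_latency(h, NUMA_HOP)
    cmp_ = remote_cache_latency(h, CMP_HOP)
    print(f"H={h}: CC-NUMA {numa}, CMP {cmp_}, ratio {numa / cmp_:.2f}x")
```

At one hop the remote cache-to-cache transfer costs 75 versus 24, a ratio of about 3.1x, and the ratio grows with hop count toward 21/4 = 5.25x.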


Configuration

Simulation Parameters

Cores            16 single-threaded light-weight, in-order
Interconnect     2-D Packet-Switched Mesh, 3-cycle router pipeline (baseline)
                 Hybrid Circuit-Switched Mesh, 4 Circuits
L1 Cache         Split I/D, 16KB each (2 cycles)
L2 Cache         Private, 128 KB (6 cycles)
L3 Cache         Shared, 16 MB (16 1MB banks), 12 cycles
Memory Latency   150 cycles

Workload Mixes

Mix 1   TPC-W (4) + TPC-H (4)
Mix 2   TPC-W (4) + SPECjbb (4)
Mix 3   TPC-H (4) + SPECjbb (4)

Effect of Memory Placement

• Load Balancing with HCS outperforms local placement
• Virtual proximity to memory home node

