The Auction: Optimizing Banks Usage in Non-Uniform Cache Architectures Javier Lira ψ Carlos Molina ψ,ф Antonio González ψ,λ λ Intel Barcelona Research Center Intel Labs - UPC Barcelona, Spain [email protected] ф Dept. Enginyeria Informàtica Universitat Rovira i Virgili Tarragona, Spain [email protected] ψ Dept. Arquitectura de Computadors Universitat Politècnica de Catalunya Barcelona, Spain [email protected] ICS 2010, Tsukuba (Japan) – June 2, 2010


The Auction: Optimizing Banks Usage in Non-Uniform Cache Architectures

Javier Lira ψ

Carlos Molina ψ,ф

Antonio González ψ,λ

λ Intel Barcelona Research Center

Intel Labs - UPC

Barcelona, Spain

[email protected]

ф Dept. Enginyeria Informàtica

Universitat Rovira i Virgili

Tarragona, Spain

[email protected]

ψ Dept. Arquitectura de Computadors

Universitat Politècnica de Catalunya

Barcelona, Spain

[email protected]

ICS 2010, Tsukuba (Japan) – June 2, 2010

Outline

Introduction

Methodology

The Auction

Results

Enhanced Auction Approaches

Conclusions


Introduction

CMPs incorporate large and shared last-level caches.

Access latency in large caches is dominated by wire delays.

Traditional caches are no longer feasible as LLC in CMPs.

[Figure: Intel® Nehalem and IBM® Power7, annotated with "40-45%".]

Non-Uniform Cache Architecture

NUCA divides a large cache into smaller and faster banks.

Cache access latency consists of the routing and bank access latencies.

Banks close to the cache controller have lower latencies than banks farther away.
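As a rough illustration of how latency grows with distance, the sketch below adds routing latency and bank access latency for a bank that is a given number of hops away. The per-hop and bank costs reuse the values from the Methodology slide (1-cycle router, 1-cycle wire, 4-cycle bank). The function name, the hop counts, and the assumption that request and reply each cross one router and one link per hop are illustrative and not taken from the talk.

```python
ROUTER_DELAY = 1   # cycles per router traversed (Methodology slide)
WIRE_DELAY = 1     # cycles per link traversed (Methodology slide)
BANK_LATENCY = 4   # cycles to access one NUCA bank (Methodology slide)


def nuca_access_latency(hops: int) -> int:
    """Round-trip latency in cycles to a bank `hops` network hops away.

    Assumption (not from the talk): the request and the reply each
    traverse one router and one link per hop.
    """
    if hops < 0:
        raise ValueError("hops must be non-negative")
    routing = 2 * hops * (ROUTER_DELAY + WIRE_DELAY)
    return routing + BANK_LATENCY


for hops in (1, 4, 8):  # hypothetical distances
    print(hops, nuca_access_latency(hops))
```

Under these assumptions the routing term quickly outweighs the fixed bank access time as the bank moves away from the controller.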

Motivation

Banks work independently.

Most frequently accessed data concentrate in few banks.

In case of replacement…

A victim that is a good choice within a particular bank could be a poor choice when the whole NUCA is considered.

[Figure: CMP with eight cores (Core 0 to Core 7) around the NUCA cache, with an access to address @.]
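To make the point about per-bank decisions concrete, the sketch below compares the victim chosen by LRU inside one heavily used bank with the least recently used line across two banks. All bank names, line names, and timestamps are hypothetical, and the global LRU shown here is only a reference point for the motivation; it is not the Auction mechanism itself.

```python
# Hypothetical last-access timestamps (higher = more recent) for two banks.
banks = {
    "hot_bank": {"A": 90, "B": 95, "C": 99},
    "cold_bank": {"X": 5, "Y": 12, "Z": 40},
}


def local_lru_victim(bank: str) -> str:
    """Line that a bank-local LRU policy would evict from `bank`."""
    lines = banks[bank]
    return min(lines, key=lines.get)


def global_lru_victim() -> tuple:
    """(bank, line) of the least recently used line across all banks."""
    candidates = (
        (ts, bank, line)
        for bank, lines in banks.items()
        for line, ts in lines.items()
    )
    _, bank, line = min(candidates)
    return bank, line


print(local_lru_victim("hot_bank"))  # decision taken inside the hot bank
print(global_lru_victim())           # decision if the whole cache were visible
```

In this toy data the hot bank evicts a line that was used far more recently than every line held by the cold bank.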

The Auction

A collaborative replacement technique that finds the most appropriate data to evict, not only from a particular bank but from the whole NUCA cache.


Methodology

Simulation tools: Simics + GEMS, CACTI v6.0

Two scenarios:
• Multi-programmed: mix of SPEC CPU2006
• Parallel applications: PARSEC

Simulation parameters:
Number of cores: 8 (UltraSPARC IIIi)
Frequency: 1.5 GHz
Main memory size: 4 GBytes
Memory bandwidth: 512 Bytes/cycle
Private L1 caches: 8 x 32 KBytes, 2-way
Shared L2 NUCA cache: 8 MBytes, 256 banks
NUCA bank: 32 KBytes, 8-way
L1 cache latency: 3 cycles
NUCA bank latency: 4 cycles
Router delay: 1 cycle
On-chip wire delay: 1 cycle
Main memory latency: 250 cycles (from core)
Auction time-out: 150 cycles

Baseline NUCA cache architecture

• CMP-DNUCA

• 8 cores

• 256 banks

• 16-way bankset associativity (8 local + 8 central)

• LRU in the bank

• Zero-copy in the NUCA

[2] B. M. Beckmann and D. A. Wood. Managing wire delay in large chip-multiprocessor caches. MICRO ‘04
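The baseline figures can be cross-checked with a few lines of arithmetic. The sketch below assumes that "16-way bankset associativity" means each bankset spans 16 banks, which would give 16 banksets; that reading and the variable names are assumptions, not statements from the talk.

```python
NUM_BANKS = 256            # banks in the shared L2 NUCA cache
BANK_SIZE_KB = 32          # capacity of one bank
LOCAL_WAYS = 8             # local banks per bankset
CENTRAL_WAYS = 8           # central banks per bankset
BANKSET_WAYS = LOCAL_WAYS + CENTRAL_WAYS

total_mb = NUM_BANKS * BANK_SIZE_KB // 1024   # total NUCA capacity in MBytes
num_banksets = NUM_BANKS // BANKSET_WAYS      # assumed: one bankset = 16 banks

print(total_mb, num_banksets)
```

The capacity matches the 8 MBytes listed in the simulation parameters.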


The Auction

“The Auction is a collaborative replacement technique that finds the most appropriate data to evict, not only from a particular bank but from the whole NUCA cache.”

Auction participants:

• Owner: it owns the item, but wants to sell it. It is the bank where the replacement happens.
• Bidder: the potential owners of the auctioned item. They are the other banks from the bankset.
• Controller: it manages the current auction. New component: auction slots.

[Figure: CMP with eight cores (Core 0 to Core 7) and the auction slots.]

Step 1: Owner starts the auction
Step 2: Bids for the auctioned item
Step 3: Item is sold!

First Auction Approach: Base

Fills the gaps left by invalidating replicated data.

Owner: Invites all the other banks from the bankset.
Bidder: Bids only if taking the item causes no new replacement.
Controller: The first bid wins, but central banks are prioritised.
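The three roles of the base approach can be sketched as a small function. This is a minimal model, not the authors' implementation: the `Bank` class, the `has_free_way` flag, and the reading of "first bid wins, but prioritising central banks" as "the earliest bid from a central bank wins, otherwise the earliest bid overall" are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Bank:
    name: str
    is_central: bool
    has_free_way: bool  # True if the bank can take a line without evicting one


def run_base_auction(owner: Bank, bankset: List[Bank]) -> Optional[Bank]:
    """Return the bank that receives the evicted line, or None if nobody bids."""
    # Owner: invites every other bank in the bankset.
    invited = [b for b in bankset if b is not owner]
    # Bidder: bids only if accepting the line causes no new replacement.
    bids = [b for b in invited if b.has_free_way]  # list order = arrival order
    if not bids:
        return None  # no bids; the talk lists a 150-cycle auction time-out
    # Controller: first bid wins, with central banks taking priority (assumed).
    central_bids = [b for b in bids if b.is_central]
    return central_bids[0] if central_bids else bids[0]


owner = Bank("local-0", is_central=False, has_free_way=False)
bankset = [
    owner,
    Bank("local-1", is_central=False, has_free_way=True),
    Bank("central-3", is_central=True, has_free_way=True),
    Bank("central-5", is_central=True, has_free_way=False),
]
winner = run_base_auction(owner, bankset)
print(winner.name if winner else "no bids")
```

Here `local-1` bids first, but the central bank `central-3` is preferred under the assumed priority rule.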


Performance

[Figure: performance speed-up for MIX, blackscholes, bodytrack, canneal, dedup, facesim, ferret, fluidanimate, freqmine, raytrace, streamcluster, swaptions, vips, x264 and the harmonic mean (H-mean). Series: Baseline, Victim Cache, One-Copy, AUC-BASE. Off-scale bar labels: 1.23, 1.24, 1.34.]

• Significant benefits with large working sets
• Good performance in both scenarios
• Blindly relocating data could be harmful
• The Auction outperforms prior proposals

Energy consumption

[Figure: energy per instruction (normalized) for MIX, blackscholes, bodytrack, canneal, dedup, facesim, ferret, fluidanimate, freqmine, raytrace, streamcluster, swaptions, vips, x264 and the arithmetic mean (A-mean). Bars: A: Baseline, B: Victim Cache, C: One-Copy and D: AUC-BASE. Legend: Offchip, Dynamic.]

• Leakage dominates the energy consumption
• The Auction reduces the overall energy consumed


Enhanced Auction Approaches

Almost half of auctions finished without receiving bids.

We need… a metric to measure the quality of data.

By increasing auction accuracy…
• The controller has more options to decide the best destination.
• Auctions with no bids are reduced.

Auction-based global replacement policies.


Bank Usage Imbalance

Banks bid according to their usage rate.

Owner: Invites all the other banks from the bankset.
Bidder: Bids if it is less “used” than the owner.
Controller: The least “used” bidder wins.

Usage metric: capacity replacements per cache set.
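The usage-based rules can be sketched as follows. This is a minimal model under stated assumptions: each bank is reduced to a single hypothetical usage number (for example, a count of capacity replacements), and the function name and sample values are illustrative rather than taken from the talk.

```python
from typing import Dict, Optional


def run_imbalance_auction(owner: str, usage: Dict[str, int]) -> Optional[str]:
    """Pick the destination bank for the owner's evicted line.

    `usage` maps each bank in the bankset to its usage value
    (hypothetical numbers; higher means more heavily used).
    """
    owner_usage = usage[owner]
    # Bidder: bids only if it is less used than the owner.
    bids = {
        bank: value
        for bank, value in usage.items()
        if bank != owner and value < owner_usage
    }
    if not bids:
        return None  # auction ends without bids
    # Controller: the least used bidder wins.
    return min(bids, key=bids.get)


usage = {"b0": 12, "b1": 3, "b2": 7, "b3": 20}
print(run_imbalance_auction("b0", usage))  # b1 and b2 bid, b1 is least used
print(run_imbalance_auction("b1", usage))  # nobody is less used than b1
```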

Prioritising most accessed data

Keeps the most accessed data in the NUCA cache.

Owner: Invites all the other banks from the bankset.
Bidder: Bids if its LRU line has been accessed less often than the auctioned item.
Controller: The bidder with the least accessed LRU line wins.

Requires an access counter per line.
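A comparable sketch for the access-count rules is shown below. It assumes each bidder exposes the access counter of its LRU line for the relevant set; the function name and the sample counters are hypothetical, and what happens to the winner's displaced LRU line is outside the scope of this sketch.

```python
from typing import Dict, Optional


def run_access_auction(item_accesses: int,
                       lru_accesses: Dict[str, int]) -> Optional[str]:
    """Pick the bank whose LRU line should make room for the auctioned item.

    `item_accesses` is the access counter of the auctioned line;
    `lru_accesses` maps each bidder bank to the access counter of its
    LRU line (hypothetical values).
    """
    # Bidder: bids only if its LRU line was accessed less than the item.
    bids = {
        bank: count
        for bank, count in lru_accesses.items()
        if count < item_accesses
    }
    if not bids:
        return None  # the auctioned item is the least accessed candidate
    # Controller: the bidder with the least accessed LRU line wins.
    return min(bids, key=bids.get)


lru_accesses = {"b1": 9, "b2": 2, "b3": 5}
print(run_access_auction(6, lru_accesses))  # b2 and b3 bid, b2 wins
print(run_access_auction(1, lru_accesses))  # no bank bids
```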

Auction accuracy

[Figure: percentage of auctions (0 to 100) versus number of bids (0 to 10). Series: AUC-BASE, AUC-ENH1-IMB, AUC-ENH2-ACC.]

• Reduction of auctions that finish with no bids
• Controller decisions are more accurate

Auction network

[Figure: auction messages for MIX, blackscholes, bodytrack, canneal, dedup, facesim, ferret, fluidanimate, freqmine, raytrace, streamcluster, swaptions, vips, x264 and the arithmetic mean (A-Mean). Series: AUC-BASE, AUC-ENH1-IMB, AUC-ENH2-ACC.]

• At the cost of increasing network traffic

Performance

[Figure: performance speed-up for MIX, blackscholes, bodytrack, canneal, dedup, facesim, ferret, fluidanimate, freqmine, raytrace, streamcluster, swaptions, vips, x264 and the harmonic mean (H-mean). Series: Baseline, AUC-BASE, AUC-ENH1-IMB, AUC-ENH2-ACC. Off-scale bar labels: 1.23, 1.25, 1.28, 1.34, 1.28, 1.37.]

• By increasing auction accuracy, we make better replacement decisions
• Network contention is a key constraint


Conclusions

The decentralized nature of NUCA makes replacement policies ineffective.

The Auction finds the most appropriate data to evict, not only from a particular bank but from the whole NUCA cache.

The Auction adapts to the program behaviour and relocates data only when it is worthwhile.

By using auction-based replacement policies, the baseline NUCA improved its performance by 8% and reduced energy consumption by 4%.



Questions?

More results (1)

[Figure: misses per thousand instructions for MIX, blackscholes, bodytrack, canneal, dedup, facesim, ferret, fluidanimate, freqmine, raytrace, streamcluster, swaptions, vips, x264 and the arithmetic mean (A-mean). Series: Baseline, Victim Cache, One-Copy, AUC-BASE.]

More results (2)

[Figure: network traffic (normalized) for MIX, blackscholes, bodytrack, canneal, dedup, facesim, ferret, fluidanimate, freqmine, raytrace, streamcluster, swaptions, vips, x264 and the arithmetic mean (A-Mean). Bars: A: Baseline, B: One-Copy and C: AUC-BASE. Legend: Common Network Traffic, Auction.]

More results (3)

[Figure: energy per instruction (normalized) for MIX, blackscholes, bodytrack, canneal, dedup, facesim, ferret, fluidanimate, freqmine, raytrace, streamcluster, swaptions, vips, x264 and the arithmetic mean (A-mean). Bars: A: Baseline, B: AUC-BASE, C: AUC-ENH1-IMB and D: AUC-ENH2-ACC. Legend: Offchip, Dynamic.]