JIGSAW: SCALABLE SOFTWARE-DEFINED CACHES
Nathan Beckmann and Daniel Sanchez, MIT CSAIL
PACT'13 – Edinburgh, Scotland, Sep 11, 2013

Source: conferences.inf.ed.ac.uk/pact2013/slides/2013.jigsaw.pact.slides.pdf


Page 1

JIGSAW: SCALABLE SOFTWARE-DEFINED CACHES

Nathan Beckmann and Daniel Sanchez, MIT CSAIL

PACT'13 – Edinburgh, Scotland, Sep 11, 2013

Page 2

Summary

•  NUCA is giving us more capacity, but further away

•  Applications have widely varying cache behavior

•  Cache organization should adapt to the application

•  Jigsaw uses physical cache resources as building blocks of virtual caches, or shares

[Figure: miss curves, MPKI (0-40) vs. cache size (0-16MB), for libquantum, zeusmp, and sphinx3]

Page 3

Approach

•  Jigsaw uses physical cache resources as building blocks of virtual caches, or shares

[Figure: miss curves, MPKI (0-40) vs. cache size (0-16MB), for libquantum, zeusmp, and sphinx3; diagram of a tiled multicore with per-tile cache banks]

Page 4

Agenda

•  Introduction

•  Background
  ◦ Goals
  ◦ Existing Approaches

•  Jigsaw Design

•  Evaluation

Page 5

Goals

•  Make effective use of cache capacity

•  Place data for low latency

•  Provide capacity isolation for performance

•  Have a simple implementation

Page 6

Existing Approaches: S-NUCA

Spread lines evenly across banks

•  High Capacity
•  High Latency
•  No Isolation
•  Simple

Page 7

Existing Approaches: Partitioning

Isolate regions of the cache between applications

•  High Capacity
•  High Latency
•  Isolation
•  Simple

•  Jigsaw needs partitioning; it uses Vantage to get strong guarantees with no loss in associativity

Page 8

Existing Approaches: Private

Place lines in the local bank

•  Low Capacity
•  Low Latency
•  Isolation
•  Complex – requires an LLC directory

Page 9

Existing Approaches: D-NUCA

Placement, migration, and replication heuristics

•  High Capacity
  ◦ But beware of over-replication and restrictive mappings
•  Low Latency
  ◦ Doesn't fully exploit the capacity vs. latency tradeoff
•  No Isolation
•  Complexity Varies
  ◦ Private-baseline schemes require an LLC directory

Page 10

Existing Approaches: Summary

                 S-NUCA   Partitioning   Private   D-NUCA
High Capacity    Yes      Yes            No        Yes
Low Latency      No       No             Yes       Yes
Isolation        No       Yes            Yes       No
Simple           Yes      Yes            No        Depends

Page 11

Jigsaw

•  High Capacity – any share can take the full capacity; no replication

•  Low Latency – shares are allocated near the cores that use them

•  Isolation – partitions within each bank

•  Simple – low-overhead hardware, no LLC directory, software-managed

Page 12

Agenda

•  Introduction

•  Background

•  Jigsaw Design
  ◦ Operation
  ◦ Monitoring
  ◦ Configuration

•  Evaluation

Page 13

Jigsaw Components

[Diagram: Operation, Monitoring, and Configuration, connected by accesses, miss curves, and share size & placement]



Page 16

Operation: Access

[Diagram: on an L2 miss (e.g., LD 0x5CA1AB1E), the core's TLB supplies a share id; the classifier/STB maps the access to one of Share 1..N and steers it to the right LLC bank]

Data → shares, so no LLC coherence is required

Page 17

Data Classification

•  Jigsaw classifies data based on access pattern
  ◦ Thread, Process, Global, and Kernel

•  Data is lazily re-classified on a TLB miss
  ◦ Similar to R-NUCA, but:
    · R-NUCA: Classification → Location
    · Jigsaw: Classification → Share (sized & placed dynamically)
  ◦ Negligible overhead

Operating System example:
•  6 thread shares
•  2 process shares
•  1 global share
•  1 kernel share

Page 18

Operation: Share-Bank Translation Buffer (STB)

•  Address, Share → Bank, Partition

•  Gives the unique location of the line in the LLC

•  Hashes lines proportionally to bank allocations
  ◦ Share: a share split 2:1 across banks A and B hashes lines as A A B A A B
  ◦ STB: 4 entries, associative, exception on miss; each entry holds a share config of 64 bank/partition buckets (Bank/Part 0 ... Bank/Part 63)

•  An L1 miss presents the address and the share id (from the TLB); hash H selects a bucket, e.g., address 0x5CA1AB1E maps to bank 3, partition 5

•  400 bytes; low overhead
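The proportional-hashing idea can be sketched in a few lines of Python (an illustrative model, not the STB hardware; the 64-bucket config layout and the `stb_lookup` name are assumptions):

```python
# Sketch of STB-style bucket lookup. A share's config is a 64-entry array of
# (bank, partition) pairs; banks with larger allocations own proportionally
# more buckets, so a uniform hash of the address spreads lines proportionally.

def stb_lookup(address, share_config):
    """Map (address, share) -> (bank, partition) via a uniform hash."""
    bucket = hash(address) % len(share_config)   # stand-in for the hardware hash H
    return share_config[bucket]

# Example: the share gets 2/3 of its capacity in bank 0 (partition 3) and
# 1/3 in bank 1 (partition 5), so bank 0 owns twice as many buckets.
config = [(0, 3), (0, 3), (1, 5)] * 21 + [(0, 3)]   # 64 buckets total
bank, part = stb_lookup(0x5CA1AB1E, config)
```

Resizing or moving a share only rewrites this small bucket array, which is what keeps reconfiguration cheap.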


Page 20

Monitoring

•  Software requires miss curves for each share

•  Add utility monitors (UMONs) per tile to produce miss curves

•  Dynamic sampling to model the full LLC at each bank; see paper

[Diagram: UMON tag array (Way 0 ... Way N-1) of sampled addresses with per-way hit counters and a miss counter, from which the miss curve (misses vs. size) is built]

Page 21

Configuration

•  Software decides share configuration

•  Approach: Size → Place
  ◦ Solving independently is simple
  ◦ Sizing is hard, placing is easy

[Diagram: SIZE step (choosing allocations from miss curves, misses vs. size up to LLC size) feeding a PLACE step]

Page 22

Configuration: Sizing

•  Partitioning problem: divide cache capacity S among P partitions/shares to maximize hits

•  Use miss curves to describe partition behavior

•  NP-complete in general

•  Existing approaches:
  ◦ Hill climbing is fast but gets stuck in local optima
  ◦ UCP Lookahead is good but scales quadratically: O(P × S²)

Utility-Based Cache Partitioning, Qureshi and Patt, MICRO'06

Can we scale Lookahead?

Page 23

Configuration: Lookahead

•  UCP Lookahead: scan the miss curves to find the allocation that maximizes average cache utility (hits per byte)

[Figure: miss curve, misses vs. size up to LLC size; the scan steps along the curve toward the point of maximum cache utility]
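The scan can be sketched as follows (a simplified model of UCP Lookahead with miss curves as plain arrays; the function name and curve encoding are assumptions, and real UCP allocates at way granularity):

```python
# Lookahead sketch: repeatedly give capacity to the partition whose best
# next allocation has the highest marginal utility (hits gained per unit
# of capacity granted).

def lookahead(miss_curves, capacity):
    alloc = [0] * len(miss_curves)
    left = capacity
    while left > 0:
        best, best_util, best_step = None, 0.0, 0
        for p, curve in enumerate(miss_curves):
            a = alloc[p]
            # "look ahead": consider every feasible extension, not just +1,
            # so the scan can step over plateaus and cliffs in the curve
            for step in range(1, min(left, len(curve) - 1 - a) + 1):
                util = (curve[a] - curve[a + step]) / step   # hits per unit size
                if util > best_util:
                    best, best_util, best_step = p, util, step
        if best is None:          # no partition benefits from more capacity
            break
        alloc[best] += best_step
        left -= best_step
    return alloc
```

The inner scan over all extensions at every step is exactly what makes Lookahead O(P × S²) overall, the quadratic cost that Peekahead removes.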


Page 40

Configuration: Lookahead

•  Observation: Lookahead traces the convex hull of the miss curve

[Figure: miss curve with its convex hull, misses vs. size up to LLC size; the point of maximum cache utility lies on the hull]

Page 41

Convex Hulls

•  The convex hull of a curve is the set containing all lines between any two points on the curve, or "the curve connecting the points along the bottom"

[Figures: a miss curve and its lower convex hull, misses vs. size up to LLC size]
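That "curve along the bottom" can be computed in a single linear-time pass with a standard monotone-chain routine (a generic textbook algorithm, not the paper's exact code; the curve encoding is an assumption):

```python
# Lower convex hull of a miss curve via Andrew's monotone chain.
# Points are (size, misses); the hull keeps only the allocations on the
# bottom envelope, where marginal utility (the negated slope) is
# non-increasing as size grows.

def lower_hull(curve):
    """curve[s] = misses at size s; returns the sizes on the lower hull."""
    hull = []                      # indices (sizes) of hull points so far
    for s, m in enumerate(curve):
        # pop the last point while it lies on or above the new segment
        while len(hull) >= 2:
            s1, s2 = hull[-2], hull[-1]
            # cross product of (s1 -> s2) and (s1 -> s), using misses as y
            cross = (s2 - s1) * (m - curve[s1]) - (s - s1) * (curve[s2] - curve[s1])
            if cross <= 0:         # s2 is not strictly below segment s1 -> s
                hull.pop()
            else:
                break
        hull.append(s)
    return hull
```

Each size is pushed once and popped at most once, which is where the linear bound comes from.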

Page 42

Configuration: Peekahead

•  There are well-known linear-time algorithms to compute convex hulls

•  The Peekahead algorithm is an exact, linear-time implementation of UCP Lookahead

[Figures: miss curves with their convex hulls]

Page 43

Configuration: Peekahead

•  Peekahead computes all convex hulls encountered during allocation in linear time
  ◦ Starting from every possible allocation
  ◦ Up to any remaining cache capacity

[Figures: convex hulls of the miss curve starting from different allocations]

Page 44

Configuration: Peekahead

•  Knowing the convex hull, each allocation step is O(log P)
  ◦ Convex hulls have decreasing slope → decreasing average cache utility → only consider the next point on the hull
  ◦ Use a max-heap to compare between partitions

[Figure: per-partition hull segments; the animation picks the "Best Step" from the current allocation]
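The heap-driven allocation loop can be sketched like this (simplified: the hulls are assumed precomputed and fixed, whereas the real Peekahead also maintains hulls against the shrinking remaining capacity; all names are illustrative):

```python
import heapq

# Because a convex hull has non-increasing marginal utility, each partition
# only ever needs to offer its *next* hull segment; a max-heap (a min-heap
# on negated utilities) picks the best segment across partitions in O(log P).

def allocate(hulls, curves, capacity):
    """hulls[p]: sizes on partition p's lower hull (starting at 0);
    curves[p][s]: misses of partition p at size s."""
    alloc = [0] * len(hulls)
    heap = []                                  # (-utility, partition, hull index)
    for p, hull in enumerate(hulls):
        if len(hull) > 1:
            u = (curves[p][hull[0]] - curves[p][hull[1]]) / (hull[1] - hull[0])
            heapq.heappush(heap, (-u, p, 1))
    left = capacity
    while left > 0 and heap:
        neg_u, p, i = heapq.heappop(heap)
        step = min(hulls[p][i] - alloc[p], left)   # take the segment (or what fits)
        alloc[p] += step
        left -= step
        if alloc[p] == hulls[p][i] and i + 1 < len(hulls[p]):
            a, b = hulls[p][i], hulls[p][i + 1]    # offer the next hull segment
            u = (curves[p][a] - curves[p][b]) / (b - a)
            heapq.heappush(heap, (-u, p, i + 1))
    return alloc
```

On the example curves [[10, 4, 4, 4, 4], [10, 9, 8, 1, 1]] with hulls [0, 1, 4] and [0, 3, 4], allocating 4 units yields [1, 3], matching a quadratic Lookahead scan while doing only O(log P) work per step.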


Page 52

Configuration: Peekahead

•  Full runtime is O(P × S)
  ◦ P: number of partitions
  ◦ S: cache size

•  See the paper for additional examples, the algorithm, and corner cases

•  See the technical report for additional detail, proofs, and run-time analysis
  ◦ Jigsaw: Scalable Software-Defined Caches (Extended Version), Nathan Beckmann and Daniel Sanchez, Technical Report MIT-CSAIL-TR-2013-017, Massachusetts Institute of Technology, July 2013

Page 53

Re-configuration

•  When the STB changes, some addresses hash to different banks

•  Selective invalidation hardware walks the LLC and invalidates lines that have moved

•  Heavy-handed, but infrequent and avoids a directory
  ◦ Maximum of 300K cycles / 50M cycles = 0.6% overhead

[Diagram: address 0x5CA1AB1E hashed under the old bucket config (1/3 1/3 3/5 1/3 1/3 3/5) vs. the new one (1/3 4/9 3/5 1/3 1/3 3/5); only lines in changed buckets move]
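The invalidation criterion is easy to model in software (an illustrative sketch, not the walker hardware; the bucket-array configs and `moved_lines` name are assumptions):

```python
# After an STB change, a line must be invalidated iff its bucket now points
# at a different bank/partition. Only buckets whose entry changed cause
# invalidations, so small re-allocations move few lines.

def moved_lines(cached_addrs, old_cfg, new_cfg):
    """Addresses whose (bank, partition) changed and must be invalidated."""
    def lookup(addr, cfg):
        return cfg[hash(addr) % len(cfg)]     # stand-in for the hardware hash H
    return [a for a in cached_addrs
            if lookup(a, old_cfg) != lookup(a, new_cfg)]

# Example: the share grows in bank 0, so two of eight buckets change hands.
old = [(0, 3)] * 4 + [(1, 5)] * 4
new = [(0, 3)] * 6 + [(1, 5)] * 2
to_invalidate = moved_lines(range(16), old, new)
```

In the example only addresses hashing to the two reassigned buckets are invalidated; everything else stays put.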

Page 54

Design: Hardware Summary

•  Operation:
  ◦ Share-bank translation buffer (STB) handles accesses
  ◦ TLB augmented with a share id

•  Monitoring HW: produces miss curves

•  Configuration: invalidation HW

•  Partitioning HW (Vantage)

[Diagram: tile organization. Core with STB, TLBs, and L1I/L1D/L2 (modified structures); Jigsaw L3 bank with Vantage bank-partitioning HW, invalidation HW, and monitoring HW (new structures); NoC router]

Page 55

Agenda

•  Introduction

•  Background

•  Jigsaw Design

•  Evaluation
  ◦ Methodology
  ◦ Performance
  ◦ Energy

Page 56

Methodology

•  Execution-driven simulation using zsim

•  Workloads:
  ◦ 16-core single-threaded mixes of SPEC CPU2006 workloads
  ◦ 64-core multithreaded (4x16-thread) mixes of PARSEC

•  Cache organizations:
  ◦ LRU – shared S-NUCA cache with LRU replacement; baseline
  ◦ Vantage – S-NUCA with Vantage partitioning and UCP Lookahead
  ◦ R-NUCA – state-of-the-art shared-baseline D-NUCA organization
  ◦ IdealSPD ("shared-private D-NUCA") – private L3 + shared L4
    · 2x the capacity of the other schemes
    · Upper bound for private-baseline D-NUCA organizations
  ◦ Jigsaw

Page 57

Evaluation: Performance

•  16-core multiprogrammed mixes of SPEC CPU2006

•  Jigsaw achieves the best performance
  ◦ Up to 50% improved throughput, 2.2x improved weighted speedup
  ◦ Gmean +14% throughput, +18% weighted speedup

•  Jigsaw does even better on the most memory-intensive mixes
  ◦ Top 20% of LRU MPKI
  ◦ Gmean +21% throughput, +29% weighted speedup

Page 58

Evaluation: Performance

•  64-core multithreaded mixes of PARSEC

•  Jigsaw achieves the best performance
  ◦ Gmean +9% throughput, +9% weighted speedup

•  Remember that IdealSPD is an upper bound with 2x capacity

Page 59

Evaluation: Performance Breakdown

•  16-core multiprogrammed mixes of SPEC CPU2006

•  Break memory stalls down into network and DRAM components
  ◦ Normalized to LRU

•  R-NUCA is limited by capacity in these workloads (private data → local bank)

•  Vantage only benefits DRAM

•  IdealSPD acts as either a private organization (benefits network) or a shared organization (benefits DRAM)

•  Jigsaw is the only scheme to simultaneously benefit network and DRAM latency

[Figure: stall breakdown per scheme, with the "Optimum" marked]

Page 60

Evaluation: Energy

•  16-core multiprogrammed mixes

•  McPAT models of full-system energy (chip + DRAM)

•  Jigsaw achieves the best energy reduction
  ◦ Up to 72%, gmean of 11%
  ◦ Reduces both network and DRAM energy

Page 61

Conclusion

•  NUCA is giving us more capacity, but further away

•  Applications have widely varying cache behavior

•  Cache organization should adapt to meet application needs

•  Jigsaw uses physical cache resources as building blocks of virtual caches, or shares
  ◦ Sized to fit the working set
  ◦ Placed near the application for low latency

•  Jigsaw improves performance by up to 2.2x and reduces energy by up to 72%

[Figure: miss curves, MPKI (0-40) vs. cache size (0-16MB)]

Page 62

QUESTIONS

[Backup figures: miss curve (misses vs. size up to LLC size); UMON tag array with hash H3, limit register, and comparator]

Page 63

Placement

•  Greedy algorithm

•  Each share is allocated a budget

•  Shares take turns grabbing space in "nearby" banks
  ◦ Banks ordered by distance from the "center of mass" of the cores accessing the share

•  Repeat until budgets & banks are exhausted
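The greedy pass might look like this (a sketch under stated simplifying assumptions: Manhattan distances, unit-granularity grabs, strict round-robin turns; all names and the data layout are illustrative, not the paper's code):

```python
# Greedy placement: each share visits banks nearest its center of mass
# first, grabbing one unit of free capacity per turn until its budget
# (set by the sizing step) or all banks are exhausted.

def place(budgets, centers, bank_pos, bank_cap):
    """budgets[s]: capacity owed to share s; centers[s]: (x, y) center of mass
    of the cores accessing s; bank_pos[b]: (x, y) of bank b; bank_cap[b]: free
    capacity of bank b. Returns alloc[s] = {bank: capacity placed there}."""
    free = list(bank_cap)
    alloc = [dict() for _ in budgets]
    # each share ranks banks by Manhattan distance from its center of mass
    order = [sorted(range(len(bank_pos)),
                    key=lambda b: abs(bank_pos[b][0] - cx) + abs(bank_pos[b][1] - cy))
             for cx, cy in centers]
    left = list(budgets)
    while any(left) and any(free):
        for s in range(len(budgets)):          # shares take turns
            if left[s] == 0:
                continue
            for b in order[s]:                 # nearest bank with space wins
                if free[b] > 0:
                    alloc[s][b] = alloc[s].get(b, 0) + 1
                    free[b] -= 1
                    left[s] -= 1
                    break
    return alloc
```

Taking turns in small units is what keeps any one share from monopolizing the banks closest to everyone.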

Page 64

Monitoring

•  Software requires miss curves for each share

•  Add UMONs per tile
  ◦ Small tag array that models LRU on a sample of accesses
  ◦ Tracks hits per way and misses → miss curve

•  Changing the sampling rate models a larger cache

•  The STB spreads lines proportionally to partition size, so the sampling rate must compensate:

    Sampling Rate = UMON Lines / Modeled Cache Lines

    Sampling Rate = (Share Size / Partition Size) × UMON Lines / Modeled Cache Lines
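A worked example of the compensation, with made-up sizes (all numbers are illustrative, not from the paper):

```python
# A UMON with 512 tags modeling a 1M-line LLC samples at 512 / 2**20.
# If the local partition holds only 1/4 of a share, scale the rate by
# share_size / partition_size so the UMON still models the full share.
umon_lines, modeled_lines = 512, 2**20
base_rate = umon_lines / modeled_lines              # plain UMON sampling rate
share_size, partition_size = 2**19, 2**17           # in cache lines
rate = (share_size / partition_size) * base_rate    # compensated rate, 4x higher
```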

Page 65

Monitoring

•  The STB spreads addresses unevenly → change the sampling rate to compensate

•  Augment the UMON with a hash (shared with the STB) and a 32-bit limit register that gives fine control over the sampling rate

•  The UMON now models the full LLC capacity exactly
  ◦ Shares require only one UMON
  ◦ Max four shares / bank → four UMONs / bank → 1.4% overhead

[Diagram: UMON tag array (Way 0 ... Way N-1) with per-way hit counters; an address is sampled when H3(address) < Limit]

Page 66

Evaluation: Extra

•  See the paper for:
  ◦ Out-of-order results
  ◦ Execution-time breakdown
  ◦ Peekahead performance
  ◦ Sensitivity studies