Ioana Burcea* Stephen Somogyi § , Andreas Moshovos*, Babak Falsafi §# Predictor Virtualization *University of Toronto Canada § Carnegie Mellon University # École Polytechnique Fédérale de Lausanne ASPLOS 13 March 4, 2008


Ioana Burcea*

Stephen Somogyi§, Andreas Moshovos*, Babak Falsafi§#

Predictor Virtualization

*University of Toronto

Canada

§Carnegie Mellon University

#École Polytechnique Fédérale de Lausanne

ASPLOS 13

March 4, 2008


Why Predictors? History Repeats Itself

[Diagram: a CPU and its predictors: Branch Prediction, Prefetching, Value Prediction, Pointer Caching, Cache Replacement]

Application footprints grow

Predictors need to scale to remain effective


Extra Resources: CMPs With Large On-Chip Caches

[Diagram: four CPUs, each with its own I$ and D$, share an L2 cache of 10’s – 100’s of MB in front of main memory]


Predictor Virtualization

[Diagram: four CPUs, each with its own I$ and D$, above a shared L2 cache and the physical memory address space]


Predictor Virtualization (PV)

Emulate large predictor tables

Reduce predictor table dedicated resources


Research Contributions

PV: metadata stored in conventional cache hierarchy

Benefits: emulate larger tables → increased accuracy; less dedicated resources

Why now? Large caches / CMPs / need for larger predictors

Will this work? Metadata locality → intrinsically exploited by caches

First step: virtualized data prefetcher. Performance: within 1% on average. Space: 60KB down to < 1KB

Advantages of virtualization


Talk Road Map

PV architecture

PV in action: Virtualized “Spatial Memory Streaming” [ISCA 06]*

Conclusions

*[ISCA 06] S. Somogyi, T. Wenisch, A. Ailamaki, B. Falsafi, and A. Moshovos. “Spatial Memory Streaming”


PV Architecture

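[Diagram: an optimization engine sends a request to a predictor table and receives a prediction; the predictor table is marked “Virtualize”. Alongside: CPU with I$ and D$, L2 cache, main memory.]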


PV Architecture

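[Diagram: the optimization engine sends a request (an index) to the PVProxy and receives a prediction. The PVProxy contains the PVCache; the PVTable resides in the physical memory address space, starting at PVStart, and is reached through the L2 cache. Alongside: CPU with I$ and D$.]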
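The lookup path in this diagram can be sketched in Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the class and method names, the LRU replacement policy, the default capacity of 8 sets, and the dictionary standing in for the L2/memory hierarchy are all hypothetical. Write-back of evicted sets is omitted.

```python
from collections import OrderedDict

BLOCK_BYTES = 64  # one PVTable set per 64-byte cache block (per the slides)


class PVProxy:
    """Toy model: a small PVCache in front of a PVTable kept in memory."""

    def __init__(self, pv_start, memory, cache_sets=8):
        self.pv_start = pv_start      # base address of the PVTable (PVStart)
        self.memory = memory          # dict: address -> set contents (stand-in for L2/memory)
        self.cache_sets = cache_sets  # PVCache capacity, in table sets (assumed)
        self.pvcache = OrderedDict()  # set index -> set contents, kept in LRU order (assumed policy)
        self.hits = 0
        self.misses = 0

    def lookup(self, index):
        """Return the table set for `index`, filling the PVCache on a miss."""
        if index in self.pvcache:
            self.hits += 1
            self.pvcache.move_to_end(index)  # mark as most recently used
            return self.pvcache[index]
        self.misses += 1
        addr = self.pv_start + index * BLOCK_BYTES  # ordinary memory request
        entry = self.memory.get(addr)
        self.pvcache[index] = entry
        if len(self.pvcache) > self.cache_sets:
            self.pvcache.popitem(last=False)  # evict the least recently used set
        return entry
```

The point of the sketch is that the optimization engine only ever talks to `lookup`; whether the set came from the PVCache or from the memory hierarchy is hidden behind the proxy.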


PV: Variable Prediction Latency

[Diagram: the PV architecture (optimization engine, PVProxy with PVCache, L2 cache, PVTable at PVStart in the physical memory address space) annotated with three latency cases: Common Case, Infrequent, Rare]
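The effect of the three cases on prediction latency can be illustrated with a weighted average. The sketch below assumes the cases correspond to a PVCache hit, an L2 hit, and a main-memory access; that mapping, the function name, and every number in the usage example are hypothetical, not measurements from the talk.

```python
def average_prediction_latency(p_pvcache, p_l2, lat_pvcache, lat_l2, lat_memory):
    """Expected cycles per prediction for a three-level lookup.

    p_pvcache and p_l2 are the fractions of predictions served by the
    PVCache and the L2 cache; the remainder is assumed to go to memory.
    """
    if p_pvcache < 0 or p_l2 < 0 or p_pvcache + p_l2 > 1 + 1e-12:
        raise ValueError("fractions must be non-negative and sum to at most 1")
    p_memory = max(0.0, 1.0 - p_pvcache - p_l2)
    return p_pvcache * lat_pvcache + p_l2 * lat_l2 + p_memory * lat_memory


# Hypothetical example: 90% PVCache hits, 9% L2 hits, 1% memory accesses.
example = average_prediction_latency(0.90, 0.09, lat_pvcache=1, lat_l2=20, lat_memory=400)
```

With these made-up inputs the rare memory case still contributes the largest share of the average, which is why the common case has to dominate for virtualization to be cheap.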


Metadata Locality

Entry reuse

Temporal: one entry used for multiple predictions

Spatial (can be engineered): one miss overcome by several subsequent hits

Metadata access pattern predictability: predictor metadata prefetching


Spatial Memory Streaming [ISCA 06]*

[Diagram: regions of memory and their observed spatial patterns, e.g. 1100000001101… and 1100001010001…]

Spatial patterns stored in a pattern history table (PHT)

*[ISCA 06] S. Somogyi, T. Wenisch, A. Ailamaki, B. Falsafi, and A. Moshovos. “Spatial Memory Streaming”
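A spatial pattern of this kind can be represented as a bit vector with one bit per cache block in a region. The sketch below assumes 64-byte blocks and 32 blocks per region (figures that appear later in the deck) and assumes bit i stands for block i; the function names are hypothetical and this is not the SMS hardware algorithm itself.

```python
BLOCK_BYTES = 64          # cache block size
BLOCKS_PER_REGION = 32    # blocks in one spatial group
REGION_BYTES = BLOCK_BYTES * BLOCKS_PER_REGION


def spatial_pattern(addresses, region_base):
    """Return a 32-bit pattern with bit i set if block i of the region was accessed."""
    pattern = 0
    for addr in addresses:
        delta = addr - region_base
        if 0 <= delta < REGION_BYTES:  # ignore accesses outside the region
            pattern |= 1 << (delta // BLOCK_BYTES)
    return pattern


def prefetch_addresses(pattern, region_base):
    """Expand a stored pattern into the block addresses it predicts."""
    return [region_base + i * BLOCK_BYTES
            for i in range(BLOCKS_PER_REGION)
            if (pattern >> i) & 1]
```

Recording a pattern once and replaying it with `prefetch_addresses` on a later trigger access is the behaviour the pattern history table exists to support.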


Virtualizing “Spatial Memory Streaming” (SMS)

[Diagram: data access stream → Detector → patterns → Predictor; on a trigger access the Predictor supplies patterns that drive prefetches. Detector: ~1KB. Predictor: ~60 KB, marked “Virtualize”.]


Virtualizing SMS

Virtual table: 1K sets, 11 ways

PVCache: 8 sets, 11 ways

[Diagram: one set laid out as tag, pattern, tag, pattern, …, unused. Tag: 11 bits; pattern: 32 bits; unused: 39 bits]

Set entries → cache block – 64 bytes
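The sizes on this slide can be checked with a few lines of arithmetic. The sketch counts only tag and pattern bits, as the slide does; any valid or replacement bits a real design might need are not modelled, and the function name is hypothetical.

```python
TAG_BITS = 11
PATTERN_BITS = 32
WAYS = 11
BLOCK_BITS = 64 * 8  # one 64-byte cache block

ENTRY_BITS = TAG_BITS + PATTERN_BITS   # 43 bits per entry
SET_BITS = WAYS * ENTRY_BITS           # 473 bits per set
UNUSED_BITS = BLOCK_BITS - SET_BITS    # 39 bits left over in the block


def table_bytes(num_sets):
    """Storage for tags and patterns only, in bytes."""
    return num_sets * SET_BITS / 8


full_table = table_bytes(1024)  # 60544.0 bytes, roughly 59 KB
pv_cache = table_bytes(8)       # 473.0 bytes, under 1 KB
```

These totals line up with the "~60KB" and "< 1KB" figures quoted for the original and virtualized prefetchers.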


Current Implementation

Non-Intrusive

Virtual table stored in reserved physical address space

One table per core

Caches oblivious to metadata

Options

Predictor tables stored in virtual memory

Single, shared table per application

Caches aware of metadata


Simulation Infrastructure

SimFlex

Full-system simulator based on Simics

Base processor configuration

4-core CMP

8-wide OoO

256-entry ROB

L1D/L1I 64KB 4-way set-associative

UL2 8MB 16-way set-associative

Commercial workloads

TPC-C: DB2 and Oracle

TPC-H: Query 1, Query 2, Query 16, Query 17

SpecWeb: Apache and Zeus


Original Prefetcher – Accuracy vs. Predictor Size

[Chart: L1 read misses (0% to 140%), split into Covered, Uncovered, and Overpredictions, for Apache, Oracle, and Qry 17. Predictor configurations (sets-ways): Infinite, 1K-16a, 1K-11a, 512-11a, 256-11a, 128-11a, 64-11a, 32-11a, 16-11a, 8-11a.]

Small Tables Diminish Prefetching Accuracy


Virtualized Prefetcher - Performance

[Chart: speedup (0% to 70%) for Apache, Zeus, DB2, Oracle, Qry 1, Qry 2, Qry 16, Qry 17. Series: Original - 1K sets, Original - 16 sets, Original - 8 sets, Virtualized - 8 sets.]

Hardware Cost: Original Prefetcher ~60KB; Virtualized Prefetcher < 1KB


Impact on L2 Memory Requests

[Chart: increase in L2 memory requests (0% to 40%) for Apache, Oracle, and Qry 17 with PV - 8 sets]

Dark Side: Increased L2 Memory Requests


Impact of Virtualization on Off-Chip Bandwidth

[Chart: off-chip bandwidth increase (0% to 5%) for Apache, Qry17, and Oracle. Series: App L2 Misses, App L2 Write-backs, PV L2 Misses, PV L2 Write-backs. Annotations: “Indirect impact on performance” and “Direct impact on performance”.]

Minimal Impact on Off-Chip Bandwidth


Conclusions

Predictor Virtualization: metadata stored in conventional cache hierarchy

Benefits: emulate larger tables → increased accuracy; less dedicated resources

First step: virtualized data prefetcher. Performance: within 1% on average. Space: 60KB down to < 1KB

Opportunities: metadata sharing and persistence; application directed prediction; predictor adaptation


PV – Motivating Trends

Dedicating resources to predictors is hard to justify

Larger predictor tables → increased performance

Chip multiprocessors: space dedicated to predictors ↔ # processors

Memory hierarchies offer the opportunity

Increased capacity

Diminishing returns

Use conventional memory hierarchies to store predictor metadata


Virtualizing the Predictor Table

[Diagram: the PC and the trigger access address are split into a tag and an index into the Pattern History Table, whose rows hold tag/pattern pairs; the matching pattern drives the prefetches. The Pattern History Table is marked “Virtualize”.]

PHT stored in physical address space

Multiple PHT entries packed in one memory block

one memory request brings an entire table set


Packing Entries in One Cache Block

Index: PC + offset within spatial group

PC → 16 bits

32 blocks in a spatial group → 5 bit offset → 32 bit spatial pattern

16 bits + 5 bits → 21 bit index

Pattern table: 1K sets

10 bits to index the table → 11 bit tag

Cache block: 64 bytes

11 entries per cache block → pattern table is 1K sets, 11-way set associative

[Diagram: cache block layout: tag, pattern, tag, pattern, …, unused; bit positions marked 0, 11, 43, 54, 85]
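Packing a full set into one cache block can be sketched as follows. The field widths come from the slide; the bit ordering (entry 0 in the least significant bits, tag below pattern, little-endian bytes) and the function names are assumptions made for illustration only.

```python
TAG_BITS = 11
PATTERN_BITS = 32
ENTRY_BITS = TAG_BITS + PATTERN_BITS  # 43
WAYS = 11
BLOCK_BYTES = 64


def pack_set(entries):
    """Pack up to 11 (tag, pattern) pairs into one 64-byte block."""
    if len(entries) > WAYS:
        raise ValueError("a set holds at most 11 entries")
    word = 0
    for way, (tag, pattern) in enumerate(entries):
        if not (0 <= tag < 1 << TAG_BITS and 0 <= pattern < 1 << PATTERN_BITS):
            raise ValueError("tag or pattern out of range")
        entry = tag | (pattern << TAG_BITS)   # tag in the low 11 bits (assumed order)
        word |= entry << (way * ENTRY_BITS)   # entries laid out back to back
    return word.to_bytes(BLOCK_BYTES, "little")


def unpack_set(block):
    """Recover all 11 (tag, pattern) pairs from a 64-byte block."""
    word = int.from_bytes(block, "little")
    entries = []
    for way in range(WAYS):
        entry = (word >> (way * ENTRY_BITS)) & ((1 << ENTRY_BITS) - 1)
        entries.append((entry & ((1 << TAG_BITS) - 1), entry >> TAG_BITS))
    return entries
```

Because 11 entries use 473 of the 512 bits, the top 39 bits of the block stay zero in this layout, matching the "unused" field on the slide.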


Memory Address Calculation

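[Diagram: the PC (16 bits) and the offset (5 bits) form the index. 10 of those bits select the table set and the remaining bits form the tag. The 10 set-index bits, with 000000 appended as the block offset, are added to the PV Start Address to give the memory address.]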
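The address calculation can be written out as a short sketch. The bit widths are from the slides; placing the PC in the upper bits of the index and using the low 10 bits as the set index are assumptions, since the slide does not say which bits are used, and the function names are hypothetical.

```python
PC_BITS = 16
OFFSET_BITS = 5
SET_INDEX_BITS = 10
BLOCK_OFFSET_BITS = 6  # 64-byte blocks, hence the appended 000000


def pht_index(pc, offset):
    """Form the 21-bit index from 16 PC bits and the 5-bit offset (assumed order)."""
    return ((pc & ((1 << PC_BITS) - 1)) << OFFSET_BITS) | (offset & ((1 << OFFSET_BITS) - 1))


def split_index(index):
    """Split the 21-bit index into (11-bit tag, 10-bit set index)."""
    set_index = index & ((1 << SET_INDEX_BITS) - 1)  # low bits pick the set (assumed)
    tag = index >> SET_INDEX_BITS
    return tag, set_index


def pv_address(pv_start, set_index):
    """Memory address of the cache block that holds the given table set."""
    return pv_start + (set_index << BLOCK_OFFSET_BITS)
```

For example, `split_index(pht_index(0xABCD, 3))` yields a tag and a set index, and `pv_address` turns that set index into the block address the PVProxy would request from the L2 cache.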


Increase in Off-Chip Bandwidth – different L2 sizes

[Chart: off-chip bandwidth increase (0% to 45%), split into L2 Misses and Write-backs, for L2 sizes of 2MB, 4MB, and 8MB on Apache, Zeus, DB2, Oracle, Qry1, Qry2, Qry16, Qry17]


Increased L2 Latency

[Chart: speedup (0% to 60%) for SMS - 1K and SMS - PV8]


Conclusions

PV: metadata stored in conventional cache hierarchy

Benefits: less dedicated resources; emulate larger tables → increased accuracy

Example: virtualized data prefetcher. Performance: within 1% on average. Space: 60KB down to < 1KB

Why now? Large caches / CMPs / need for larger predictors

Will this work? Metadata locality → intrinsically exploited by caches; metadata access pattern predictability

Opportunities: metadata sharing and persistence; application directed prediction; predictor adaptation