Pınar Tözün Anastasia Ailamaki SLICC Self-Assembly of Instruction Cache Collectives for OLTP Workloads Islam Atta Andreas Moshovos


Pınar Tözün

Anastasia Ailamaki

SLICC: Self-Assembly of Instruction Cache Collectives for OLTP Workloads

Islam Atta

Andreas Moshovos


Online Transaction Processing (OLTP)

$100 Billion/Yr, +10% annually
• E.g., banking, online purchases, stock market…

Benchmarking
• Transaction Processing Performance Council (TPC)
• TPC-C: Wholesale retailer
• TPC-E: Brokerage market

OLTP drives innovation for HW and DB vendors

© Islam Atta 2


Transactions Suffer from Instruction Misses

Many concurrent transactions

[Figure: instruction footprint of each transaction over time, compared with the L1-I size]

Instruction Stalls due to L1 Instruction Cache Thrashing


Even on a CMP all Transactions Suffer

[Figure: transactions running over time on cores and their L1-I caches]

All caches thrashed with similar code blocks


Opportunity

Technology:
• CMP’s aggregate L1 instruction cache capacity is large enough

Application Behavior:
• Instruction overlap within and across transactions

[Figure: multiple threads spread over multiple L1-I caches over time]

Footprint over Multiple Cores → Reduced Instruction Misses


SLICC Overview

Dynamic Hardware Solution
• How to divide a transaction
• When to move
• Where to go

Performance
• Reduces instruction misses by 44% (TPC-C), 68% (TPC-E)
• Performance improves by 60% (TPC-C), 79% (TPC-E)

Robust:
• Non-OLTP workload remains unaffected


Talk Roadmap

• Intra/Inter-thread instruction locality is high
• SLICC Concept
• SLICC Ingredients
• Results
• Summary


OLTP Facts

Many concurrent transactions

Few DB operations
• 28 – 65KB

Few transaction types
• TPC-C: 5, TPC-E: 12

Transactions fit in 128-512KB

Overlap within and across different transactions

[Figure: DB operations R(), U(), I(), D(), IT(), ITP() shared by the Payment and New Order transactions]

CMPs’ aggregate L1-I cache is large enough


Instruction Commonality Across Transactions

[Figure: instruction commonality for TPC-C and TPC-E, in two panels (All Threads, Per Transaction Type); legend: Most, Few, Single; more yellow means more reuse]

Lots of code reuse

Even higher across same-type transactions


Requirements

Enable usage of aggregate L1-I capacity
• Large cache size without increased latency

Exploit instruction commonality
• Localizes common transaction instructions

Dynamic
• Independent of footprint size or cache configuration


Talk Roadmap

• Intra/Inter-thread instruction locality is high
• SLICC Concept
• SLICC Ingredients
• Results
• Summary


Example for Concurrent Transactions

[Figure: control flow graph of transactions T1, T2, T3, divided into code segments that can fit into L1-I]


Scheduling Threads

[Figure: threads T1, T2, T3 scheduled over time on cores 0-3 and their L1-I caches, Conventional vs. SLICC]

Conventional: Cache Filled 10 times
SLICC: Cache Filled 4 times
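The fill counts on this slide can be reproduced with a toy model. The sketch below is only an illustration: the segment sequences in `PATHS` are hypothetical and were picked to match the slide's totals, each L1-I is assumed to hold exactly one code segment, and the function names are not from the talk.

```python
def fills_conventional(paths):
    """Each thread stays on its own core, whose L1-I holds one segment at a time."""
    fills = 0
    for segments in paths.values():
        cached = None
        for seg in segments:
            if seg != cached:  # segment not resident: refill this core's L1-I
                fills += 1
                cached = seg
    return fills


def fills_collective(paths, num_cores):
    """Each segment is loaded once on some core; threads migrate to that core."""
    home = {}  # segment -> core whose L1-I holds it
    for segments in paths.values():
        for seg in segments:
            if seg not in home:
                if len(home) == num_cores:
                    raise ValueError("more segments than cores in this toy model")
                home[seg] = len(home)  # fill the next unused core's L1-I
    return len(home)


# Hypothetical segment sequences (assumption, not taken from the slide).
PATHS = {
    "T1": ["A", "B", "C", "D"],
    "T2": ["A", "B", "C"],
    "T3": ["A", "C", "D"],
}

print(fills_conventional(PATHS))             # 10
print(fills_collective(PATHS, num_cores=4))  # 4
```

The model ignores timing, queuing for a busy core, and migration cost; it only counts how often an L1-I has to be filled with a new segment.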


Talk Roadmap

• Intra/Inter-thread instruction locality is high
• SLICC Concept
• SLICC Ingredients
• Results
• Summary


Migration Ingredients

When to migrate?
Step 1: Detect: cache full
Step 2: Detect: new code segment

Where to go?
Step 3: Predict where the next code segment is


Migration Ingredients

When to migrate?
Step 1: Detect: cache full
Step 2: Detect: new segment

Where to go?
Step 3: Where is the next segment?

[Figure: T1 migrating across cores over time; labels: Idle cores, Loops, Return back]


Migration Ingredients

When to migrate?
Step 1: Detect: cache full
Step 2: Detect: new segment

Where to go?
Step 3: Where is the next segment?

[Figure: T2 migrating across cores over time]


Implementation

When to migrate?
Step 1: Detect: cache full (Miss Counter)
Step 2: Detect: new segment (Miss Dilution)

Where to go?
Step 3: Where is the next segment? (Find signature blocks on remote cores)
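Steps 1 and 2 can be sketched as a small state machine. This is a rough model under stated assumptions, not the hardware design: the class name, the default thresholds, and the length of the hit/miss window are all hypothetical, and `fill_up_t` / `dilution_t` simply borrow the threshold names used later in the deck.

```python
from collections import deque


class MigrationTrigger:
    """Toy model of cache-full detection followed by miss-dilution tracking."""

    def __init__(self, fill_up_t=256, dilution_t=10, window=100):
        self.fill_up_t = fill_up_t          # misses before the cache counts as full
        self.dilution_t = dilution_t        # recent misses that signal a new segment
        self.miss_count = 0                 # miss counter (step 1)
        self.recent = deque(maxlen=window)  # 1 = miss, 0 = hit; oldest entry dropped

    def access(self, hit):
        """Record one L1-I access; return True when migration should be considered."""
        if self.miss_count < self.fill_up_t:
            # Step 1: the cache is still filling up with the current segment.
            if not hit:
                self.miss_count += 1
            return False
        # Step 2: the cache is full, so track how many recent accesses missed.
        self.recent.append(0 if hit else 1)
        return sum(self.recent) >= self.dilution_t

    def reset(self):
        """Start over, for example after the thread has migrated away."""
        self.miss_count = 0
        self.recent.clear()


trigger = MigrationTrigger(fill_up_t=4, dilution_t=2, window=8)
warmup = [trigger.access(hit=False) for _ in range(4)]       # cache fills up
steady = [trigger.access(hit=True) for _ in range(5)]        # segment is resident
new_segment = [trigger.access(hit=False) for _ in range(2)]  # misses come back
print(new_segment)  # [False, True]
```

A burst of misses after the cache has filled is taken as the sign that the thread has entered a new code segment, which is when step 3 would start looking for a target core.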


Boosting Effectiveness

More overlap across transactions of the same type

SLICC: Transaction Type-oblivious

Transaction Type-aware
• SLICC-Pp: Pre-processing to detect similar transactions
• SLICC-SW: Software provides information


Talk Roadmap

• Intra/Inter-thread instruction locality is high
• SLICC Concept
• SLICC Ingredients
• Results
• Summary


Experimental Evaluation

How does SLICC affect INSTRUCTION misses?
Our primary goal

How does it affect DATA misses?
Expected to increase, by how much?

Performance impact:
Are DATA misses and MIGRATION OVERHEADS amortized?


Methodology

Simulation
• Zesto (x86)
• 16 OoO cores, 32KB L1-I, 32KB L1-D, 1MB per core L2
• QEMU extension
• User and Kernel space

Workloads
Shore-MT
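A quick check of the capacity argument with this configuration: the sketch below combines the 16 cores and 32KB L1-I listed here with the 128-512KB transaction footprint from the OLTP Facts slide. The variable names are illustrative, and the calculation ignores conflict misses and code shared between segments.

```python
CORES = 16                 # simulated configuration
L1I_KB_PER_CORE = 32       # simulated configuration
FOOTPRINT_KB = (128, 512)  # "Transactions fit in 128-512KB"

aggregate_kb = CORES * L1I_KB_PER_CORE
# Ceiling division: L1-I caches needed to hold a footprint of each size.
cores_needed = [-(-kb // L1I_KB_PER_CORE) for kb in FOOTPRINT_KB]

print(aggregate_kb)  # 512
print(cores_needed)  # [4, 16]
```

Under these assumptions even the largest footprint fits in the aggregate L1-I capacity, while no single 32KB cache can hold the smallest one.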


Effect on Misses

Baseline: no effort to reduce instruction misses

[Figure: I-MPKI and D-MPKI (MPKI axis, 0 to 45) for Base, SLICC, and SLICC-SW on TPC-C-10, TPC-E, and MapReduce]

Reduce I-MPKI by 58%. Increase D-MPKI by 7%.
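For readers unfamiliar with the metric, MPKI is misses per thousand instructions. The helper below is a minimal sketch of how a relative change such as the 58% reduction would be computed; the miss and instruction counts are made-up inputs for illustration only and are not measurements from the talk.

```python
def mpki(misses, instructions):
    """Misses per kilo-instruction."""
    return misses / (instructions / 1000)


def relative_change(before, after):
    """Percentage change from before to after (negative means a reduction)."""
    return (after - before) / before * 100


# Made-up counts, chosen only to illustrate the arithmetic.
base = mpki(misses=50_000, instructions=1_000_000)
improved = mpki(misses=21_000, instructions=1_000_000)

print(base, improved)                             # 50.0 21.0
print(round(relative_change(base, improved), 1))  # -58.0
```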


Performance

Next-line: always prefetch the next-line

PIF-No Overhead: upper bound for Proactive Instruction Fetch [Ferdman, MICRO’11]

[Figure: speedup (axis 1 to 2) of Next-Line, PIF-No Overhead, SLICC, and SLICC-SW on TPC-C-1, TPC-C-10, TPC-E, and MapReduce]

TPC-C: +60% TPC-E: +79%

Storage per core: PIF: ~40KB, SLICC: <1KB.


Summary

OLTP’s performance suffers due to instruction stalls.

Technology & Application Opportunities:
• Instruction footprint fits on aggregate L1-I capacity of CMPs.
• Inter- and intra-thread locality.

SLICC:
• Thread migration spreads instruction footprint over multiple cores.
• Reduce I-MPKI by 58%
• Improve performance relative to:
  Baseline: +70%
  Next-line: +44%
  PIF: ±2% to +21%


Email: [email protected]: http://islamatta.com

Thanks!


Why do data misses increase?

Example: thread migrates from core A to core B.
• Reads on core B miss on data that was fetched on core A.
• Writes on core B invalidate data on core A.
• When returning to core A, cache blocks might have been evicted by other threads.


SLICC Agent per Core

[Figure: block diagram of the per-core SLICC agent, with these parts]
• Cache Full Detection: MC (Miss Counter), threshold Fill-up_t, enables shifting
• Miss Dilution Tracking: MSV (Miss Shift-Vector) records Miss (1) / Hit (0); count of “1”s, threshold Dilution_t, enables searching
• Locating Missed Blocks on Remote Cores: Miss Tag-Queue (MTQ)
• Remote Cache Segment Search: Matched_t entries, enables migration and selects the matching core
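The "where to go" part of the agent can be sketched as a tag match between recently missed blocks and the contents of remote L1-I caches. This is an illustrative model only: `pick_target_core`, the MTQ depth, and the use of exact tag sets per core are assumptions made for the example (the deck later describes Bloom-filter cache signatures for this purpose), and the tie-breaking rule is arbitrary.

```python
from collections import deque


def pick_target_core(missed_tags, remote_tags, matched_t):
    """Return the remote core holding the most recently missed tags, or None.

    missed_tags: recently missed block tags (the MTQ contents)
    remote_tags: {core_id: set of block tags resident in that core's L1-I}
    matched_t:   minimum number of matching tags required before migrating
    """
    best_core, best_matches = None, 0
    for core_id, tags in remote_tags.items():
        matches = sum(1 for tag in missed_tags if tag in tags)
        if matches > best_matches:  # first core wins on ties
            best_core, best_matches = core_id, matches
    return best_core if best_matches >= matched_t else None


mtq = deque(maxlen=4)                 # hypothetical MTQ depth
for tag in (0x10, 0x11, 0x12, 0x13):  # tags of the latest L1-I misses
    mtq.append(tag)

remote = {1: {0x10, 0x40}, 2: {0x11, 0x12, 0x13}, 3: set()}
print(pick_target_core(mtq, remote, matched_t=2))  # 2
print(pick_target_core(mtq, remote, matched_t=4))  # None
```

A higher `matched_t` makes migration more conservative: the thread moves only when a remote cache clearly holds the segment it is about to execute.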


Detailed Methodology

• Zesto (x86)
• Qtrace (QEMU extension)
• Shore-MT


Hardware Cost


Larger I-caches?

[Figure: MPKI (axis 0 to 60), broken down into Conflict, Capacity, and Compulsory misses, and Speedup (axis 0 to 1.4) versus Cache Size (16K to 512K) for Instructions and Data on TPC-C-10, TPC-E, and MapReduce]


Different Replacement Policies?

[Figure: L1 Instruction MPKI (axis 0 to 40) for LRU, LIP, BIP, DIP, SRRIP, BRRIP, and DRRIP on TPC-C, TPC-E, and MapReduce]


Parameter Space (1)

[Figure: I-MPKI and D-MPKI (MPKI axis 0 to 70) and Speedup (axis 0 to 1.6) on TPC-C and TPC-E for Base and for Fill-up_t = 128, 256, 384, 512 (top labels) with Matched_t = 2, 4, 6, 8, 10 (bottom labels)]


Parameter Space (2)

[Figure: I-MPKI and D-MPKI (MPKI axis 0 to 60) and Speedup (axis 0 to 2) on TPC-C and TPC-E for Dilution_t = 2, 4, ..., 30]


Partial Bloom Filter

Cache Signature Accuracy

[Figure: BF Accuracy (%) (axis 96 to 101) on TPC-C and TPC-E; x-axis: 512, 1K, 2K, 4K, 8K]
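One way such a cache signature could work is sketched below. It assumes that "partial" means the filter is indexed by the low-order bits of the block address and that per-entry counters are used so evictions can be undone; the slide states neither, and the class and function names are hypothetical.

```python
class CacheSignature:
    """Counting filter indexed by the low-order bits of the block address."""

    def __init__(self, entries=2048):
        if entries <= 0 or entries & (entries - 1):
            raise ValueError("entries must be a positive power of two")
        self.mask = entries - 1
        self.counts = [0] * entries

    def insert(self, block_addr):
        """Call when a block is brought into the cache."""
        self.counts[block_addr & self.mask] += 1

    def remove(self, block_addr):
        """Call when a block is evicted from the cache."""
        idx = block_addr & self.mask
        if self.counts[idx] > 0:
            self.counts[idx] -= 1

    def might_contain(self, block_addr):
        """False means definitely absent; True may be a false positive."""
        return self.counts[block_addr & self.mask] > 0


def accuracy(signature, cached, queries):
    """Percentage of queries where the signature agrees with the real contents."""
    correct = sum(signature.might_contain(q) == (q in cached) for q in queries)
    return 100.0 * correct / len(queries)


sig = CacheSignature(entries=512)
cached = {3, 7, 515}
for addr in cached:
    sig.insert(addr)

# 519 shares its low-order bits with 7, so it is reported as a false positive.
print(accuracy(sig, cached, [3, 7, 8, 519]))  # 75.0
```

In this model a larger filter leaves fewer addresses sharing an entry, which is consistent with accuracy rising along the figure's x-axis if that axis is the filter size.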
