STRex: Boosting Instruction Cache Reuse in OLTP Workloads Through Stratified Transaction Execution

Islam Atta, Pınar Tözün*, Xin Tong, Anastasia Ailamaki*, Andreas Moshovos


Shark #1, Shark #2, Starfish


Chocolate Base: http://www.marthastewart.com/337010/chocolate-cupcakes

Vanilla Base: http://www.marthastewart.com/256334/vanilla-cupcakes


Swiss Meringue Buttercream: http://www.marthastewart.com/318727/swiss-meringue-buttercream-for-cupcakes


Had only one of these

[Figure: icing timeline for Shark #1, Shark #2, and Starfish (icings 1, 2, 3), with Empty, Wash, Fill steps; x-axis: Time.]


Sssshhh…

[Figure: icing timeline for Shark #1, Shark #2, and Starfish (icings 1, 2, 3), with Empty/Wash/Fill steps; x-axis: Time.]

When executing OLTP transactions, processors aren’t as clever


DB operations

Transaction

DB Query

Instruction Cache

Processor

Icing Cakes and OLTP Transactions


Transaction #1 Transaction #2 Transaction #3

Today’s Systems

Instruction Misses

Better Way


Unlike Icing Cakes…

Transaction Operations

Unclear Boundaries

Repeated, Conditional, Different


Dynamic Hardware Solution

• Breaks execution into L1I-sized sub-problems

• Time-multiplex to improve locality

Performance

Reduces instruction misses by up to 44%

Reduces data misses by up to 37%

Improves throughput by 35-55% for 2-16 cores

Robust:

• Non-OLTP workloads remain unaffected

STRex


• OLTP

• Characteristics

• Challenges

• Opportunities

• STREX

• SLICC and its limitations

• Results

• Summary

Roadmap


$100 Billion/Yr, +10% annually

•E.g., banking, online purchases, stock market…

Benchmarking

•Transaction Processing Performance Council (TPC)

•TPC-C: Wholesale retailer

•TPC-E: Brokerage market

Online Transaction Processing (OLTP)

OLTP drives innovation for HW and DB vendors


Many concurrent transactions

Transactions Suffer from Instruction Misses

[Figure: transaction instruction footprint versus L1-I size over time.]

Instruction stalls due to L1 instruction cache thrashing

Many concurrent transactions

Few DB operations

•28 – 65KB

Few transaction types

•TPC-C: 5, TPC-E: 12

Transactions fit in 128-512KB

OLTP Facts

Overlap within and across different transactions

R() U() I() D() IT() ITP()

Payment, New Order

CMPs’ aggregate L1-I cache is large enough
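As a rough illustration of the capacity argument, the sketch below computes how many private L1-I caches are needed to hold a transaction footprint. The 32KB per-core L1-I size is borrowed from the simulated configuration in the methodology slide; the function name `caches_needed` is hypothetical and the model ignores code shared between caches.

```python
import math

# Per-core L1-I size in KB (taken from the simulated configuration; an
# assumption for this sketch, not a property of every CMP).
L1I_KB = 32

def caches_needed(footprint_kb: int, l1i_kb: int = L1I_KB) -> int:
    """Minimum number of L1-I caches whose combined capacity covers the footprint."""
    return math.ceil(footprint_kb / l1i_kb)

# Footprint range quoted on the slide: transactions fit in 128-512KB.
for footprint_kb in (128, 512):
    print(footprint_kb, "KB footprint ->", caches_needed(footprint_kb), "caches")
```

Under these assumptions the quoted footprints map to between 4 and 16 caches, which is why the aggregate L1-I capacity of a multi-core chip can be sufficient while a single L1-I is not.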


Temporal Code Redundancy

[Figure: Payment transaction. x-axis: K-Instructions (0 to 190, time); y-axis: percentage of cache contents (0% to 100%); series: 1, <5, <10, >=10.]

Transactions perform similar operations in similar sequence


Why Is There So Much Instruction Overlap?

[Figure: operation flow of two transactions.]

Payment: IT(CUST), R(DIST), R(CUST), U(CUST), U(DIST), U(WH), I(HIST), R(WH)

New Order: R(DIST), I(NORD), R(WH), U(DIST), R(CUST), R(ITEM), R(STO), U(STO), I(OL), I(ORD), with a Loop (OL_CNT) and a Condition

Transactions are built using few DB operations

Similar transactions perform similar operations


Transaction #1 Transaction #2 Transaction #3

Today’s Systems

Instruction Misses

Stratified Execution


Challenges

Unclear Boundaries

Repeated, Conditional, Different


Generalized Transaction Scheduling

NP-Complete

Heuristic needed

“When you cannot solve a problem… think of a problem you can solve”

Pikos Apikos, MCMLXXXV


Identical Transactions

Conventional

STREX

Scheduling Identical Transactions

[Figure: execution timelines for identical Transactions A, B, and C, each split into Phase 1, Phase 2, and Phase 3, under conventional and STREX scheduling; sequences shown: A B C A B C A B C and A A A B B B C C C; legend: Miss, Overhead; x-axis: Time.]
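The benefit of stratifying identical transactions can be sketched with a toy model. It assumes, for illustration only, that each phase's code exactly fills the L1-I, so only one phase's code is resident at a time; `phase_loads` and the schedule encoding are hypothetical names, and context-switch overhead is not modelled.

```python
def phase_loads(schedule):
    """Count code-phase fetches for a cache that holds one phase's code at a time."""
    loads = 0
    resident = None  # phase whose code currently sits in the modelled L1-I
    for _txn, phase in schedule:
        if phase != resident:
            loads += 1
            resident = phase
    return loads

transactions = ["A", "B", "C"]  # three identical transactions
phases = [1, 2, 3]              # each phase is assumed to fill the L1-I exactly

# Conventional: each transaction runs all of its phases before the next starts.
conventional = [(t, p) for t in transactions for p in phases]
# Stratified: all transactions execute phase 1, then all execute phase 2, and so on.
stratified = [(t, p) for p in phases for t in transactions]

print(phase_loads(conventional))  # 9
print(phase_loads(stratified))    # 3
```

In this model the conventional order refetches every phase for every transaction, while the stratified order fetches each phase once and reuses it across the group.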


Optimal Scheduling for Identical Transactions

[Figure: Transactions A, B, and C sharing the L1-I across Phase 1, Phase 2, and Phase 3; x-axis: Time.]

Do not evict a red block


Implementation

[Figure: timeline of Transactions A, B, and C sharing one L1-I across phases.]

1. Same-type transaction groups

2. First thread is the lead

3. Phase # starts at ONE

4. Touched blocks marked with current phase #

5. Victim block tagged with current phase # → switch thread

6. Lead thread increments phase #
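The six steps above can be sketched as a small software simulation. This is a simplified model under stated assumptions, not the hardware design: the L1-I is modelled as a fully associative LRU cache of `capacity` blocks, each thread is a list of instruction-block addresses, threads take turns in a fixed round-robin order, the phase number is bumped at the start of every round (the lead's turn, with the role passing on if the lead has finished), and context-switch cost is ignored. The names `run_strex` and `run_baseline` are hypothetical.

```python
from collections import OrderedDict

def run_baseline(traces, capacity):
    """Run each thread to completion in turn on a plain LRU cache; return misses."""
    cache, misses = OrderedDict(), 0
    for trace in traces:
        for block in trace:
            if block in cache:
                cache.move_to_end(block)
            else:
                misses += 1
                if len(cache) >= capacity:
                    cache.popitem(last=False)  # evict the LRU block
                cache[block] = None
    return misses

def run_strex(traces, capacity):
    """Stratified execution of one same-type group (step 1); return misses."""
    cache = OrderedDict()        # block -> phase tag, ordered from LRU to MRU
    pos = [0] * len(traces)      # next position in each thread's trace
    misses, phase = 0, 0
    while any(pos[t] < len(traces[t]) for t in range(len(traces))):
        phase += 1               # steps 3 and 6: starts at one, bumped each round
        for t, trace in enumerate(traces):  # step 2: thread 0 runs first (lead)
            while pos[t] < len(trace):
                block = trace[pos[t]]
                if block in cache:
                    cache.move_to_end(block)
                elif len(cache) < capacity:
                    misses += 1
                else:
                    victim = next(iter(cache))       # LRU candidate
                    if cache[victim] == phase:       # step 5: switch thread
                        break
                    del cache[victim]
                    misses += 1
                cache[block] = phase                 # step 4: tag touched block
                pos[t] += 1
    return misses

# Three identical transactions, each touching 12 blocks, on a 4-block cache.
traces = [list(range(12)) for _ in range(3)]
print(run_baseline(traces, 4))  # 36
print(run_strex(traces, 4))     # 12
```

In this toy setup every thread misses on every block under the baseline, whereas with stratification only the first thread to reach each phase misses and the others hit on the blocks it brought in.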

Works Well for the General Case

Roadmap

• OLTP

• Characteristics

• Challenges

• Opportunities

• STREX

• SLICC and its limitations

• Results

• Summary


SLICC: Self-Assembly of Instruction Cache Collectives for OLTP Workloads

I. Atta, P. Tözün, A. Ailamaki, A. Moshovos

MICRO-45, December, 2012.

SLICC Concept

Technology:

• CMP’s aggregate L1 instruction cache capacity is large enough

[Figure: multiple threads spread over multiple L1-I caches; axis: Time.]

SLICC is similar to icing cakes with multiple icing bags

Condition: Aggregate cache capacity is sufficient

SLICC was Demonstrated on 16 cores

SLICC Needs Enough Cores

Few cores

Larger Footprint

Can these happen in practice?

1. Data center constraints limit core count

2. Increasing instruction footprints

Multiple L1-I caches

Roadmap

• OLTP

• Characteristics

• Challenges

• Opportunities

• STREX

• SLICC and its limitations

• Results

• Summary


Simulation

•Zesto (x86) (thank you to GTech)

•2-16 OoO cores, 32KB 8-way L1-I and L1-D, 1MB per core L2

•QTrace (Xin Tong’s QEMU extension)

Workloads

Methodology

Shore-MT


Effect on INSTRUCTION and DATA misses?

L1-I (instruction locality), L1-D (data sharing)

Performance impact:

Are CONTEXT SWITCHING OVERHEADS amortized?

Compared to SLICC

Measure sensitivity to available CORE COUNT

Experimental Evaluation


Baseline: no effort to reduce instruction misses

SLICC: distribute footprint across CMP cores/caches [Atta,MICRO’12]

L1 Miss per Kilo Instructions (MPKI): Instructions
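For readers unfamiliar with the metric, here is a minimal sketch of how MPKI and a relative miss reduction are computed. The function names and the counts in the example are made up for illustration and are not measurements from the talk.

```python
def mpki(misses: int, instructions: int) -> float:
    """Misses per kilo-instruction: misses normalised to 1000 executed instructions."""
    return 1000.0 * misses / instructions

def relative_reduction(baseline: float, improved: float) -> float:
    """Fraction by which `improved` is lower than `baseline`."""
    return (baseline - improved) / baseline

# Hypothetical counts, chosen only to exercise the formulas.
base = mpki(misses=40_000, instructions=1_000_000)
better = mpki(misses=24_000, instructions=1_000_000)
print(base, better, relative_reduction(base, better))
```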

[Figure: L1 instruction MPKI (I-MPKI, y-axis 0 to 45, with a "Better" arrow) for Baseline, SLICC, and STREX on TPC-C-10 and TPC-E with 2, 4, 8, and 16 cores.]

Baseline: no effort to reduce instruction misses

SLICC: distribute footprint across CMP cores/caches [Atta,MICRO’12]

L1 Miss per Kilo Instructions (MPKI): Data

[Figure: L1 data MPKI (D-MPKI, y-axis 0 to 35, with a "Better" arrow) for Baseline, SLICC, and STREX on TPC-C-10 and TPC-E with 2, 4, 8, and 16 cores.]

Throughput

[Figure: relative throughput (y-axis 0 to 7, with a "Better" arrow) for Base, SLICC, STREX, and STREX+SLICC on TPC-C-10 and TPC-E with 2, 4, 8, and 16 cores.]

Dynamic Hardware Solution

• Breaks execution into L1I-sized sub-problems

• Time-multiplex to improve locality

Performance

Reduces instruction misses by up to 44%

Reduces data misses by up to 37%

Improves throughput by 35-55% for 2-16 cores

Robust:

• Non-OLTP workloads remain unaffected

STRex


OLTP’s performance suffers due to instruction stalls

Application Opportunities: temporal code redundancy

SLICC: Thread Migration

• Sensitive to runtime core count

STREX: Thread Stratification

• Synchronize transaction execution on a single core

• Improve L1 instruction (and data) locality

Hybrid: Best of both Worlds

Summary


Thanks!

Email: [email protected]

http://islamatta.com


Larger L1-I caches? [DaMoN’12]

[Figure: MPKI broken down into Conflict, Capacity, and Compulsory misses (left axis, 0 to 60) and Speedup (right axis, 0 to 1.4) versus cache size (16K to 512K), for instructions and data on TPC-C-10, TPC-E, and MapReduce.]

STREX with Identical Transactions

[Figure: I-MPKI (y-axis 0 to 45) for Baseline and CTX-Identical per transaction type. TPC-C: Delivery, New Order, Payment, Stock; TPC-E: Broker, Customer, Market, Security, Tr_Stat, Tr_Upd, Tr_Look.]

Replacement Policies

[Figure: instruction misses per kilo-instruction (I-MPKI, y-axis 0 to 45) on TPC-C and TPC-E for LRU, LIP, BIP, SRRIP, BRRIP, STREX+LRU, STREX+BIP, and STREX+BRRIP.]

Thread Latency Trade-off

[Figure: thread latency distribution. x-axis: M-Cycles (2 to 50, More); y-axis: Frequency (0% to 100%). Series: Baseline (6.37), STREX-2T (5.96), STREX-4T (10.48), STREX-6T (15.25), STREX-8T (17.42), STREX-10T (14.83), STREX-12T (21.04), STREX-16T (21.77), STREX-20T (29.68), SLICC-2 (23.00), SLICC-4 (12.80), SLICC-8 (6.95), SLICC-16 (7.49).]

[Figure: relative throughput (y-axis 0 to 2) on TPC-C and TPC-E; x-axis: Base, 2, 4, 6, 8, 10, 12, 16, 20.]

Zesto (x86)

QTrace (QEMU extension)

Shore-MT

Detailed Methodology


Focus on OLTP

•Important Class of Applications

•Instruction stalls dominate performance

Other workloads?

•Data Serving

•Media Streaming

•Web Frontend

•SPECweb 2009

•Web Backend

Workloads

Similar to OLTP

Hardware Cost


Hybrid
