STREX: Boosting Instruction Cache Reuse in OLTP Workloads Through Stratified Transaction Execution
Islam Atta, Pınar Tözün*, Xin Tong, Anastasia Ailamaki*, Andreas Moshovos
Shark #1, Shark #2, Starfish
Chocolate Base: http://www.marthastewart.com/337010/chocolate-cupcakes
Vanilla Base: http://www.marthastewart.com/256334/vanilla-cupcakes
Swiss Meringue Buttercream: http://www.marthastewart.com/318727/swiss-meringue-buttercream-for-cupcakes
Had only one of these
[Figure: over time, Shark #1, Shark #2, and Starfish cupcakes are iced one by one, with an Empty, Wash, Fill step between each]
© Islam Atta 8
Sssshhh…
[Figure: the same cupcakes batched by type: all Shark #1, then all Shark #2, then all Starfish, with a single Empty/Wash/Fill step between types]
When executing OLTP transactions, processors aren’t as clever
[Diagram: a Transaction issues a DB Query, which runs DB operations on the Processor through its Instruction Cache]
Icing Cakes and OLTP Transactions
Transaction #1 Transaction #2 Transaction #3
Today’s Systems
Instruction Misses
Better Way
Unlike Icing Cakes…
Transaction Operations: Unclear Boundaries; Repeated, Conditional, Different
Dynamic Hardware Solution
• Breaks execution into L1-I-sized sub-problems
• Time-multiplexes threads to improve locality
Performance
Reduces instruction misses by up to 44%
Reduces data misses by up to 37%
Improves throughput by 35-55% for 2-16 cores
Robust:
• Non-OLTP workload remains unaffected
STREX
• OLTP
• Characteristics
• Challenges
• Opportunities
• STREX
• SLICC and its limitations
• Results
• Summary
Roadmap
$100 Billion/Yr, +10% annually
•E.g., banking, online purchases, stock market…
Benchmarking
•Transaction Processing Performance Council
•TPC-C: Wholesale retailer
•TPC-E: Brokerage market
Online Transaction Processing (OLTP)
OLTP drives innovation for HW and DB vendors
Many concurrent transactions
Transactions Suffer from Instruction Misses
[Figure: each transaction's instruction footprint exceeds the L1-I size, again and again over time]
Instruction Stalls due to L1 Instruction Cache Thrashing
Many concurrent transactions
Few DB operations
•28 – 65KB
Few transaction types
•TPC-C: 5, TPC-E: 12
Transactions fit in 128-512KB
OLTP Facts
Overlap within and across different transactions
R() U() I() D() IT() ITP()
Payment, New Order
CMPs’ aggregate L1-I cache is large enough
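The aggregate-capacity claim can be sanity-checked with the deck's own numbers; a minimal sketch, assuming the 32 KB per-core L1-I and 16-core configuration listed in the methodology slide:

```python
# Back-of-envelope check: can the chip's combined L1-I capacity
# hold a whole transaction's instruction footprint?
L1I_KB = 32             # per-core L1-I (methodology: 32KB 8-way)
CORES = 16
MAX_FOOTPRINT_KB = 512  # slide: transactions fit in 128-512KB

aggregate_kb = L1I_KB * CORES
print(aggregate_kb, aggregate_kb >= MAX_FOOTPRINT_KB)  # 512 True
```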
Temporal Code Redundancy
[Figure: for Payment, percentage of cache contents by reuse count (1, <5, <10, >=10) over execution time in K-instructions]
Transactions perform similar operations in similar sequence
Why Is There So Much Instruction Overlap?
Payment: IT(CUST), R(DIST), R(CUST), U(CUST), U(DIST), U(WH), I(HIST), R(WH)
New Order: R(DIST), I(NORD), R(WH), U(DIST), R(CUST), R(ITEM), R(STO), U(STO), I(OL), I(ORD), with a Loop (OL_CNT) and a Condition in its control flow
Transactions are built using few DB operations
Similar transactions perform similar operations
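A minimal sketch (illustrative, not the paper's analysis) of why the overlap arises, using the operation sequences from the Payment and New Order slides; the `op_types` helper is hypothetical, and R/U/I/IT abbreviate the generic read, update, insert, and index-probe code paths:

```python
# Both transactions are built from the same handful of generic DB
# operations; table names in () differ, but the code paths are shared.
payment = ["IT(CUST)", "R(DIST)", "R(CUST)", "U(CUST)",
           "U(DIST)", "U(WH)", "I(HIST)", "R(WH)"]
new_order = ["R(DIST)", "I(NORD)", "R(WH)", "U(DIST)", "R(CUST)",
             "R(ITEM)", "R(STO)", "U(STO)", "I(OL)", "I(ORD)"]

def op_types(seq):
    # "R(WH)" -> "R": the operation's code path is the same
    # regardless of which table it touches
    return {op.split("(")[0] for op in seq}

shared = op_types(payment) & op_types(new_order)
print(sorted(shared))  # ['I', 'R', 'U']
```

The read, update, and insert code paths are common to both transactions, which is exactly the instruction overlap the slide points to.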
Transaction #1 Transaction #2 Transaction #3
Today’s Systems
Instruction Misses
Stratified Execution
Challenges
Unclear Boundaries
Repeated, Conditional, Different
Generalized Transaction Scheduling
NP-Complete: a heuristic is needed
“When you cannot solve a problem… think of a problem you can solve”
Pikos Apikos, MCMLXXXV
Identical Transactions
Conventional
STREX
Scheduling Identical Transactions
[Figure: the Conventional interleaving A B C A B C A B C pays miss overhead at every switch, while STREX runs Transactions A, B, and C phase by phase (Phase 1, Phase 2, Phase 3) so each phase's code is loaded once and reused]
Optimal Scheduling for Identical Transactions
[Figure: Transactions A, B, and C execute Phase 1 together, then Phase 2, then Phase 3; the L1-I holds one phase's blocks at a time]
Do not evict a red block
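The benefit of the phase-grouped schedule above can be illustrated with a toy cache model (an assumption-laden sketch, not the paper's simulator): identical transactions share code, each phase's footprint exactly fills a fully associative LRU L1-I, and we compare running each transaction straight through against stratified, phase-by-phase execution:

```python
# Toy model: count L1-I misses under two schedules of identical
# transactions whose per-phase code footprint fills the cache.
from collections import OrderedDict

BLOCKS_PER_PHASE = 4   # assumed: each phase's code fills the cache
CACHE_CAPACITY = 4     # cache holds one phase's worth of blocks

def simulate(schedule):
    """Count instruction misses for a sequence of (txn, phase) steps."""
    cache, misses = OrderedDict(), 0
    for _txn, phase in schedule:
        for blk in range(BLOCKS_PER_PHASE):
            addr = (phase, blk)          # identical txns share phase code
            if addr in cache:
                cache.move_to_end(addr)  # LRU hit
            else:
                misses += 1
                cache[addr] = True
                if len(cache) > CACHE_CAPACITY:
                    cache.popitem(last=False)  # evict least recently used
    return misses

txns, phases = "ABC", (1, 2, 3)
run_to_completion = [(t, p) for t in txns for p in phases]  # A:1-3, B:1-3, ...
stratified        = [(t, p) for p in phases for t in txns]  # phase by phase
print(simulate(run_to_completion), simulate(stratified))    # 36 12
```

Under this model the straight-through schedule misses on every block, while the stratified schedule pays the misses for each phase only once and lets the other transactions reuse it.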
Implementation
[Figure: Transactions A, B, and C share one L1-I across Phases 1-3, with the Lead thread driving phase transitions]
1. Group same-type transactions
2. The first thread becomes the Lead
3. The phase # starts at ONE
4. Touched blocks are marked with the current phase #
5. A victim block tagged with the current phase # triggers a thread switch
6. The Lead thread increments the phase #
Works Well for the General Case
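The six steps can be sketched as a software model. This is hypothetical: the real STREX is a hardware mechanism, and the class and method names here are illustrative, not from the paper:

```python
# Hypothetical software model of the STREX phase protocol: cache blocks
# carry a phase tag, and evicting a block tagged with the current phase
# signals that the running thread has outgrown the cache-sized phase.
class Block:
    def __init__(self):
        self.phase = 0              # phase tag stored with the cache block

class StrexScheduler:
    def __init__(self, threads):
        self.threads = threads      # step 1: a same-type transaction group
        self.lead = threads[0]      # step 2: the first thread is the Lead
        self.phase = 1              # step 3: the phase # starts at ONE
        self.current = 0            # index of the thread that owns the core

    def on_touch(self, block):
        block.phase = self.phase    # step 4: mark with the current phase #

    def on_eviction(self, victim):
        # step 5: evicting current-phase code means this thread no longer
        # fits in the phase, so hand the L1-I to the next thread in turn
        if victim.phase == self.phase:
            self.current = (self.current + 1) % len(self.threads)
            if self.threads[self.current] is self.lead:
                self.phase += 1     # step 6: Lead starts the next phase
```

When control wraps back around to the Lead, the whole group advances to the next phase, so each phase's code is brought into the cache exactly once per group.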
Roadmap
• OLTP
• Characteristics
• Challenges
• Opportunities
• STREX
• SLICC and its limitations
• Results
• Summary
SLICC: Self-Assembly of Instruction Cache Collectives for OLTP Workloads
I. Atta, P. Tözün, A. Ailamaki, A. Moshovos
MICRO-45, December, 2012.
SLICC Concept
Technology:
• CMP’s aggregate L1 instruction cache capacity is large enough
Multiple L1-I caches
Multiple threads
[Figure: threads migrate across the multiple L1-I caches over time]
SLICC is similar to icing cakes with multiple icing bags
Condition: Aggregate cache capacity is sufficient
SLICC was Demonstrated on 16 cores
SLICC Needs Enough Cores
Few cores
Larger Footprint
Can these happen in practice?
1. Data center constraints limit core count
2. Increasing instruction footprints
Roadmap
• OLTP
• Characteristics
• Challenges
• Opportunities
• STREX
• SLICC and its limitations
• Results
• Summary
Simulation
•Zesto (x86) (thank you to GTech)
•2-16 OoO cores, 32KB 8-way L1-I and L1-D, 1MB per core L2
•QTrace (Xin Tong’s QEMU extension)
Workloads
Methodology
Shore-MT
Effect on INSTRUCTION and DATA misses?
L1-I (instruction locality), L1-D (data sharing)
Performance impact:
Are CONTEXT SWITCHING OVERHEADS amortized?
Compared to SLICC
Measure sensitivity to available CORE COUNT
Experimental Evaluation
Baseline: no effort to reduce instruction misses
SLICC: distribute footprint across CMP cores/caches [Atta,MICRO’12]
L1 Miss per Kilo Instructions (MPKI): Instructions
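MPKI normalizes miss counts per thousand retired instructions, which makes runs of different lengths comparable. A one-line sketch with illustrative numbers, not measured results:

```python
# MPKI: L1 misses per kilo (thousand) retired instructions.
def mpki(misses, instructions):
    return misses / (instructions / 1000.0)

# Illustrative: 4.5M L1-I misses over 100M instructions -> 45.0 MPKI.
print(mpki(4_500_000, 100_000_000))  # 45.0
```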
[Figure: I-MPKI for Baseline, SLICC, and STREX at 2, 4, 8, and 16 cores on TPC-C-10 and TPC-E; lower is better]
Baseline: no effort to reduce instruction misses
SLICC: distribute footprint across CMP cores/caches [Atta,MICRO’12]
L1 Miss per Kilo Instructions (MPKI): Data
[Figure: D-MPKI for Baseline, SLICC, and STREX at 2, 4, 8, and 16 cores on TPC-C-10 and TPC-E; lower is better]
Throughput
[Figure: throughput relative to Base for SLICC, STREX, and STREX+SLICC at 2, 4, 8, and 16 cores on TPC-C-10 and TPC-E; higher is better]
Dynamic Hardware Solution
• Breaks execution into L1-I-sized sub-problems
• Time-multiplexes threads to improve locality
Performance
Reduces instruction misses by up to 44%
Reduces data misses by up to 37%
Improves throughput by 35-55% for 2-16 cores
Robust:
• Non-OLTP workload remains unaffected
STREX
OLTP’s performance suffers due to instruction stalls
Application Opportunities: temporal code redundancy
SLICC: Thread Migration
• Sensitive to runtime core count
STREX: Thread Stratification
• Synchronize transaction execution on a single core
• Improve L1 instruction (and data) locality
Hybrid: Best of both Worlds
Summary
Email: [email protected]
Web: http://islamatta.com
Thanks!
Larger L1-I caches? [DaMoN’12]
[Figure: instruction and data MPKI, broken into Conflict, Capacity, and Compulsory misses, plus Speedup, versus cache size (16-512 KB) for TPC-C-10, TPC-E, and MapReduce]
STREX with Identical Transactions
[Figure: I-MPKI of Baseline vs. CTX-Identical per transaction type: TPC-C Delivery, New Order, Payment, Stock; TPC-E Broker, Customer, Market, Security, Tr_Stat, Tr_Upd, Tr_Look]
Replacement Policies
[Figure: I-MPKI for TPC-C and TPC-E under LRU, LIP, BIP, SRRIP, BRRIP, STREX+LRU, STREX+BIP, and STREX+BRRIP]
Thread Latency Trade-off
[Figure: thread-latency distribution in M-Cycles. Legend (mean): Baseline (6.37), STREX-2T (5.96), STREX-4T (10.48), STREX-6T (15.25), STREX-8T (17.42), STREX-10T (14.83), STREX-12T (21.04), STREX-16T (21.77), STREX-20T (29.68), SLICC-2 (23.00), SLICC-4 (12.80), SLICC-8 (6.95), SLICC-16 (7.49). Inset: relative throughput for TPC-C and TPC-E at group sizes 2-20]
Zesto (x86)
Qtrace (QEMU extension)
Shore-MT
Detailed Methodology
Focus on OLTP
•Important Class of Applications
•Instruction stalls dominate performance
Other workloads?
•Data Serving
•Media Streaming
•Web Frontend
•SPECweb 2009
•Web Backend
Workloads
Similar to OLTP
Hardware Cost
Hybrid