STREX: Boosting Instruction Cache Reuse in OLTP Workloads Through Stratified Transaction Execution
Islam Atta, Pınar Tözün*, Xin Tong, Anastasia Ailamaki*, Andreas Moshovos
Shark #1, Shark #2, Starfish
Chocolate Base: http://www.marthastewart.com/337010/chocolate-cupcakes
Vanilla Base: http://www.marthastewart.com/256334/vanilla-cupcakes
Swiss Meringue Buttercream: http://www.marthastewart.com/318727/swiss-meringue-buttercream-for-cupcakes
Had only one of these
[Figure: over time, Shark #1, Shark #2, and Starfish cupcakes are iced one by one, with an Empty, Wash, Fill step between each]
© Islam Atta 8
Sssshhh…
[Figure: the same cupcakes batched by type: all Shark #1, then all Shark #2, then all Starfish, with a single Empty/Wash/Fill step between types]
When executing OLTP transactions, processors aren’t as clever
[Diagram: a Transaction issues a DB Query, which runs DB operations on the Processor through its Instruction Cache]
Icing Cakes and OLTP Transactions
Transaction #1 Transaction #2 Transaction #3
Today’s Systems
Instruction Misses
Better Way
Unlike Icing Cakes…
Transaction Operations: Unclear Boundaries; Repeated, Conditional, Different
Dynamic Hardware Solution
• Breaks execution into L1-I-sized sub-problems
• Time-multiplexes threads to improve locality
Performance
Reduces instruction misses by up to 44%
Reduces data misses by up to 37%
Improves throughput by 35-55% for 2-16 cores
Robust:
• Non-OLTP workload remains unaffected
STREX
• OLTP
• Characteristics
• Challenges
• Opportunities
• STREX
• SLICC and its limitations
• Results
• Summary
Roadmap
$100 Billion/Yr, +10% annually
•E.g., banking, online purchases, stock market…
Benchmarking
•Transaction Processing Performance Council
•TPC-C: Wholesale retailer
•TPC-E: Brokerage market
Online Transaction Processing (OLTP)
OLTP drives innovation for HW and DB vendors
Many concurrent transactions
Transactions Suffer from Instruction Misses
[Figure: each transaction's instruction footprint exceeds the L1-I size, again and again over time]
Instruction Stalls due to L1 Instruction Cache Thrashing
Many concurrent transactions
Few DB operations
•28 – 65KB
Few transaction types
•TPC-C: 5, TPC-E: 12
Transactions fit in 128-512KB
OLTP Facts
Overlap within and across different transactions
R() U() I() D() IT() ITP()
Payment, New Order
CMPs’ aggregate L1-I cache is large enough
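The aggregate-capacity claim can be sanity-checked with the deck's own numbers; a minimal sketch, assuming the 32 KB per-core L1-I and 16-core configuration listed in the methodology slide:

```python
# Back-of-envelope check: can the chip's combined L1-I capacity
# hold a whole transaction's instruction footprint?
L1I_KB = 32             # per-core L1-I (methodology: 32KB 8-way)
CORES = 16
MAX_FOOTPRINT_KB = 512  # slide: transactions fit in 128-512KB

aggregate_kb = L1I_KB * CORES
print(aggregate_kb, aggregate_kb >= MAX_FOOTPRINT_KB)  # 512 True
```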
Temporal Code Redundancy
[Figure: for Payment, percentage of cache contents by reuse count (1, <5, <10, >=10) over execution time in K-instructions]
Transactions perform similar operations in similar sequence
Why Is There So Much Instruction Overlap?
Payment: IT(CUST), R(DIST), R(CUST), U(CUST), U(DIST), U(WH), I(HIST), R(WH)
New Order: R(DIST), I(NORD), R(WH), U(DIST), R(CUST), R(ITEM), R(STO), U(STO), I(OL), I(ORD), with a Loop (OL_CNT) and a Condition in its control flow
Transactions are built using few DB operations
Similar transactions perform similar operations
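A minimal sketch (illustrative, not the paper's analysis) of why the overlap arises, using the operation sequences from the Payment and New Order slides; the `op_types` helper is hypothetical, and R/U/I/IT abbreviate the generic read, update, insert, and index-probe code paths:

```python
# Both transactions are built from the same handful of generic DB
# operations; table names in () differ, but the code paths are shared.
payment = ["IT(CUST)", "R(DIST)", "R(CUST)", "U(CUST)",
           "U(DIST)", "U(WH)", "I(HIST)", "R(WH)"]
new_order = ["R(DIST)", "I(NORD)", "R(WH)", "U(DIST)", "R(CUST)",
             "R(ITEM)", "R(STO)", "U(STO)", "I(OL)", "I(ORD)"]

def op_types(seq):
    # "R(WH)" -> "R": the operation's code path is the same
    # regardless of which table it touches
    return {op.split("(")[0] for op in seq}

shared = op_types(payment) & op_types(new_order)
print(sorted(shared))  # ['I', 'R', 'U']
```

The read, update, and insert code paths are common to both transactions, which is exactly the instruction overlap the slide points to.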
Transaction #1 Transaction #2 Transaction #3
Today’s Systems
Instruction Misses
Stratified Execution
Challenges
Unclear Boundaries
Repeated, Conditional, Different
Generalized Transaction Scheduling
NP-Complete: a heuristic is needed
“When you cannot solve a problem… think of a problem you can solve”
Pikos Apikos, MCMLXXXV
Identical Transactions
Conventional
STREX
Scheduling Identical Transactions
[Figure: the Conventional interleaving A B C A B C A B C pays miss overhead at every switch, while STREX runs Transactions A, B, and C phase by phase (Phase 1, Phase 2, Phase 3) so each phase's code is loaded once and reused]
Optimal Scheduling for Identical Transactions
[Figure: Transactions A, B, and C execute Phase 1 together, then Phase 2, then Phase 3; the L1-I holds one phase's blocks at a time]
Do not evict a red block
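The benefit of the phase-grouped schedule above can be illustrated with a toy cache model (an assumption-laden sketch, not the paper's simulator): identical transactions share code, each phase's footprint exactly fills a fully associative LRU L1-I, and we compare running each transaction straight through against stratified, phase-by-phase execution:

```python
# Toy model: count L1-I misses under two schedules of identical
# transactions whose per-phase code footprint fills the cache.
from collections import OrderedDict

BLOCKS_PER_PHASE = 4   # assumed: each phase's code fills the cache
CACHE_CAPACITY = 4     # cache holds one phase's worth of blocks

def simulate(schedule):
    """Count instruction misses for a sequence of (txn, phase) steps."""
    cache, misses = OrderedDict(), 0
    for _txn, phase in schedule:
        for blk in range(BLOCKS_PER_PHASE):
            addr = (phase, blk)          # identical txns share phase code
            if addr in cache:
                cache.move_to_end(addr)  # LRU hit
            else:
                misses += 1
                cache[addr] = True
                if len(cache) > CACHE_CAPACITY:
                    cache.popitem(last=False)  # evict least recently used
    return misses

txns, phases = "ABC", (1, 2, 3)
run_to_completion = [(t, p) for t in txns for p in phases]  # A:1-3, B:1-3, ...
stratified        = [(t, p) for p in phases for t in txns]  # phase by phase
print(simulate(run_to_completion), simulate(stratified))    # 36 12
```

Under this model the straight-through schedule misses on every block, while the stratified schedule pays the misses for each phase only once and lets the other transactions reuse it.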
Implementation
[Figure: Transactions A, B, and C share one L1-I across Phases 1-3, with the Lead thread driving phase transitions]
1. Group same-type transactions
2. The first thread becomes the Lead
3. The phase # starts at ONE
4. Touched blocks are marked with the current phase #
5. A victim block tagged with the current phase # triggers a thread switch
6. The Lead thread increments the phase #
Works Well for the General Case
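The six steps can be sketched as a software model. This is hypothetical: the real STREX is a hardware mechanism, and the class and method names here are illustrative, not from the paper:

```python
# Hypothetical software model of the STREX phase protocol: cache blocks
# carry a phase tag, and evicting a block tagged with the current phase
# signals that the running thread has outgrown the cache-sized phase.
class Block:
    def __init__(self):
        self.phase = 0              # phase tag stored with the cache block

class StrexScheduler:
    def __init__(self, threads):
        self.threads = threads      # step 1: a same-type transaction group
        self.lead = threads[0]      # step 2: the first thread is the Lead
        self.phase = 1              # step 3: the phase # starts at ONE
        self.current = 0            # index of the thread that owns the core

    def on_touch(self, block):
        block.phase = self.phase    # step 4: mark with the current phase #

    def on_eviction(self, victim):
        # step 5: evicting current-phase code means this thread no longer
        # fits in the phase, so hand the L1-I to the next thread in turn
        if victim.phase == self.phase:
            self.current = (self.current + 1) % len(self.threads)
            if self.threads[self.current] is self.lead:
                self.phase += 1     # step 6: Lead starts the next phase
```

When control wraps back around to the Lead, the whole group advances to the next phase, so each phase's code is brought into the cache exactly once per group.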
Roadmap
• OLTP
• Characteristics
• Challenges
• Opportunities
• STREX
• SLICC and its limitations
• Results
• Summary
SLICC: Self-Assembly of Instruction Cache Collectives for OLTP Workloads
I. Atta, P. Tözün, A. Ailamaki, A. Moshovos
MICRO-45, December, 2012.
SLICC Concept
Technology:
• CMP’s aggregate L1 instruction cache capacity is large enough
Multiple L1-I caches
Multiple threads
[Figure: threads migrate across the multiple L1-I caches over time]
SLICC is similar to icing cakes with multiple icing bags
Condition: Aggregate cache capacity is sufficient
SLICC was Demonstrated on 16 cores
SLICC Needs Enough Cores
Few cores
Larger Footprint
Can these happen in practice?
1. Data center constraints limit core count
2. Increasing instruction footprints
Roadmap
• OLTP
• Characteristics
• Challenges
• Opportunities
• STREX
• SLICC and its limitations
• Results
• Summary
Simulation
•Zesto (x86) (thank you to GTech)
•2-16 OoO cores, 32KB 8-way L1-I and L1-D, 1MB per core L2
•QTrace (Xin Tong’s QEMU extension)
Workloads
Methodology
Shore-MT
Effect on INSTRUCTION and DATA misses?
L1-I (instruction locality), L1-D (data sharing)
Performance impact:
Are CONTEXT SWITCHING OVERHEADS amortized?
Compared to SLICC
Measure sensitivity to available CORE COUNT
Experimental Evaluation
Baseline: no effort to reduce instruction misses
SLICC: distribute footprint across CMP cores/caches [Atta,MICRO’12]
L1 Miss per Kilo Instructions (MPKI): Instructions
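MPKI normalizes miss counts per thousand retired instructions, which makes runs of different lengths comparable. A one-line sketch with illustrative numbers, not measured results:

```python
# MPKI: L1 misses per kilo (thousand) retired instructions.
def mpki(misses, instructions):
    return misses / (instructions / 1000.0)

# Illustrative: 4.5M L1-I misses over 100M instructions -> 45.0 MPKI.
print(mpki(4_500_000, 100_000_000))  # 45.0
```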
[Figure: I-MPKI for Baseline, SLICC, and STREX at 2, 4, 8, and 16 cores on TPC-C-10 and TPC-E; lower is better]
Baseline: no effort to reduce instruction misses
SLICC: distribute footprint across CMP cores/caches [Atta,MICRO’12]
L1 Miss per Kilo Instructions (MPKI): Data
[Figure: D-MPKI for Baseline, SLICC, and STREX at 2, 4, 8, and 16 cores on TPC-C-10 and TPC-E; lower is better]
Throughput
[Figure: throughput relative to Base for SLICC, STREX, and STREX+SLICC at 2, 4, 8, and 16 cores on TPC-C-10 and TPC-E; higher is better]
Dynamic Hardware Solution
• Breaks execution into L1-I-sized sub-problems
• Time-multiplexes threads to improve locality
Performance
Reduces instruction misses by up to 44%
Reduces data misses by up to 37%
Improves throughput by 35-55% for 2-16 cores
Robust:
• Non-OLTP workload remains unaffected
STREX
OLTP’s performance suffers due to instruction stalls
Application Opportunities: temporal code redundancy
SLICC: Thread Migration
• Sensitive to runtime core count
STREX: Thread Stratification
• Synchronize transaction execution on a single core
• Improve L1 instruction (and data) locality
Hybrid: Best of both Worlds
Summary
Email: [email protected]
Web: http://islamatta.com
Thanks!
Larger L1-I caches? [DaMoN’12]
[Figure: instruction and data MPKI, broken into Conflict, Capacity, and Compulsory misses, plus Speedup, versus cache size (16-512 KB) for TPC-C-10, TPC-E, and MapReduce]
STREX with Identical Transactions
[Figure: I-MPKI of Baseline vs. CTX-Identical per transaction type: TPC-C Delivery, New Order, Payment, Stock; TPC-E Broker, Customer, Market, Security, Tr_Stat, Tr_Upd, Tr_Look]
Replacement Policies
[Figure: I-MPKI for TPC-C and TPC-E under LRU, LIP, BIP, SRRIP, BRRIP, STREX+LRU, STREX+BIP, and STREX+BRRIP]
Thread Latency Trade-off
[Figure: thread-latency distribution in M-Cycles. Legend (mean): Baseline (6.37), STREX-2T (5.96), STREX-4T (10.48), STREX-6T (15.25), STREX-8T (17.42), STREX-10T (14.83), STREX-12T (21.04), STREX-16T (21.77), STREX-20T (29.68), SLICC-2 (23.00), SLICC-4 (12.80), SLICC-8 (6.95), SLICC-16 (7.49). Inset: relative throughput for TPC-C and TPC-E at group sizes 2-20]
Zesto (x86)
Qtrace (QEMU extension)
Shore-MT
Detailed Methodology
Focus on OLTP
•Important Class of Applications
•Instruction stalls dominate performance
Other workloads?
•Data Serving
•Media Streaming
•Web Frontend
•SPECweb 2009
•Web Backend
Workloads
Similar to OLTP
Hardware Cost
Hybrid