Thomas Kejser
http://blog.kejser.org
@thomaskejser
Super Scaling SQL Server
Diagnosing and Fixing Hard Problems
Thomas Kejser
• Formerly SQLCAT
• Tuning SQL Server since 6.5
• 15+ years of database experience
• http://blog.kejser.org
• CTO, Fusion-io Europe
Image(s): FreeDigitalPhotos.net
Best Practice
Performance vs. Scalability
• Performance: response time, resource use
• Scalability: adding more of a HW resource makes things faster
You can scale without having performance (ex: HADOOP)
You can perform without having scalability (ex: In Memory Engines)
Our Reasonably Priced Server
• 2-socket Xeon E5645
  • 2 x 6 cores
  • 2.4 GHz
  • NUMA enabled, HT off
• 12 GB RAM
• 1 ioDrive2 Duo
  • 2.4 TB flash
  • 4K formatted
  • 64K AUS
  • 1 stripe
• Power save off
• Win 2008R2
• SQL 2012
Image Source: DeviantArt
Between Disk and Memory
Diagram: four cores, each with private L1 and L2 caches, sharing an L3 cache.
Approximate latencies along the hierarchy: 1ns, 10ns, 100ns, 10us, 100us, 10ms.
The “cache out curve”
Chart: throughput/thread vs. data size — throughput collapses once the data size exceeds the cache size.
Service Time + Wait Time
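The cache-out curve is easy to reproduce in miniature. The sketch below is hypothetical (the function name is mine), and Python object overhead blurs the cliff edges compared to a C microbenchmark, but time per access still rises as the working set outgrows each cache level.

```python
import random
import time

def time_per_access(working_set_bytes, accesses=200_000):
    # Model the working set as 8-byte slots and hit them in random order.
    slots = working_set_bytes // 8
    data = list(range(slots))
    idx = [random.randrange(slots) for _ in range(accesses)]
    start = time.perf_counter()
    total = 0
    for i in idx:
        total += data[i]
    return (time.perf_counter() - start) / accesses

# Throughput per thread degrades as the working set falls out of L1, L2, L3...
for size in (32 * 1024, 256 * 1024, 4 * 1024 * 1024, 32 * 1024 * 1024):
    print(f"{size:>10} bytes: {time_per_access(size):.2e} sec/access")
```

On real hardware the jumps line up with the cache sizes; in Python the trend is visible but softer.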
NUMA Nodes
Diagram: two CPU sockets, each with cores, L2 caches, and a shared L3. Before a core can write a cache line owned by the other socket ("Can I write?"), ownership must be transferred over the inter-socket bus.
There are several of these curves
Chart: throughput vs. touched data size, with a drop at each boundary:
• CPU cache
• TLB
• NUMA remote
• Storage
Response time = Service Time + Wait Time
Algorithms and Data Structures
“Bottlenecks”
• DBA tasks
• Installation of OS and SQL
• Basic memory configuration
• Basic Perfmon-style monitoring
• Backup/Restore and HA setup
• Basic reading of a query plan
• Basic understanding of database structures
• Adding indexes to tables
• Running a Profiler trace
What you ALREADY know
Below the Surface
What we Need
• Free tools from MS
• Windows SDK
• In Win8: the “ADK”
• Need .NET 4 to install
Where Did the Time Go?
xperf -on Base -f Base.etl

SELECT TOP 100000 *
FROM LINEITEM
INNER JOIN ORDERS ON O_ORDERKEY = L_ORDERKEY

SQLCMD -E -S. -i "Select.sql"

xperf -stop
DEMO: Base profile with xperf
Right Click – Summary Table
What exactly is SQLNCLI?
Quantifying just how stupid XML is
SELECT TOP 1000000 *
FROM ORDERS
JOIN LINEITEM ON L_ORDERKEY = O_ORDERKEY
FOR XML RAW ('OUTPUT')
xperf -on Base -f Base.etl
With XML
“Native” Format
Which CPU cycles are Expensive?
“App” tier: web servers, licensing, >3K USD blades
Database tier: core licensing, >10K USD
<XML> ?
• What about the time INSIDE the process?
• What if the EXE won’t tell us?
Diving even Deeper
What is a Debug Symbol?
mov ax,10
mov bx,20
mov cx,3
push ax
push bx
push cx
call <address>

<address>:
push bp
mov bp,sp
mov ax,[bp+8]
mov bx,[bp+6]
mov cx,[bp+4]
add ax,bx
div cx
mov dx,ax
ret

Header
doStuff(10,20,3)
…
int doStuff(int a, int b, int c) { return (a + b) / c; }
Compiles
Compiles
myProg.exe
Machine Code
<address> = doStuff
Symbol table
myProg.pdb
Symbol
Build
Where do you get PDB files?
_NT_SYMBOL_PATH=SRV*C:\Symbols*http://msdl.microsoft.com/download/symbols
_NT_SYMCACHE_PATH=C:\SymCache
• Public Symbol Server
• Configure Environment
• Dbghelp.dll
• Auto Generated by Visual Studio:
Your Own Debug Symbols
• Symbols are indexed. Have to add them
Adding and Checking Your Symbols
cd Bin/x64/Release/
symstore add /f *.pdb /s C:\Symbols /t “MyExe”

• Validate that the symbols can resolve:
cd Bin/x64/Release/
symchk MyExe.exe /V
• Standard Xperf works fine for your own native code
• BUT: Before Windows 8, stack walking is broken for x64 .NET
• If you have .NET with 64-bit code, you must NGEN first:

Got .NET and x64?

ngen install Bin/x64/Release/MyExe.exe
(ngen lives here: %Windir%\Microsoft.NET\Framework64\<Version>\Ngen.exe)
• Free tool from MS: PerfView
.NET tracing is a pain, get a tool!
• Not to be confused with xperfview
• Same trace API and file format
• Helps set obscure .NET-specific trace flags
And Finally, You can do Very Cool Things
Did I tell you about interlocked operations?...
Whiteboard time!
• Consider again our LINEITEM table
What is SQL Server REALLY doing?
• How expensive is it to read from that?• Think ETL code and DW/BI queries
CREATE TABLE LINEITEM (
    [L_ORDERKEY] [int] NOT NULL,
    [L_PARTKEY] [int] NOT NULL,
    [L_SUPPKEY] [int] NOT NULL,
    [L_LINENUMBER] [int] NOT NULL,
    [L_QUANTITY] [decimal](15, 2) NOT NULL,
    [L_EXTENDEDPRICE] [decimal](15, 2) NOT NULL,
    [L_DISCOUNT] [decimal](15, 2) NOT NULL,
    [L_TAX] [decimal](15, 2) NOT NULL,
    [L_RETURNFLAG] [char](1) NOT NULL,
    [L_LINESTATUS] [char](1) NOT NULL,
    [L_SHIPDATE] [date] NOT NULL,
    [L_COMMITDATE] [date] NOT NULL,
    [L_RECEIPTDATE] [date] NOT NULL,
    [L_SHIPINSTRUCT] [char](25) NOT NULL,
    [L_SHIPMODE] [char](10) NOT NULL,
    [L_COMMENT] [varchar](44) NOT NULL
)
Quadrant chart — data returned vs. data touched: OLTP (small, small), BI/DW (small returned, big touched), Simulation (big returned, small touched), ETL (big, big).
SQLCMD – Native code Test
SQLCMD.EXE
Where does the time go?
Standard Reading of Data
xperf -on base -stackwalk profile -f stackwalk.etl

SQLCMD -S. -dSlam -E -Q "SELECT * FROM LINEITEM_tpch"
(55 sec)

xperf -stop
xperf -merge stackwalk.etl stackwalkmerge.etl
Details of the Time – Padding?
More Details – Conversion Work?
An Educated guess about improvements
Before: the original LINEITEM schema shown earlier ([decimal], [char], [date] columns).

After:
CREATE TABLE [dbo].[LINEITEM_native] (
    [L_ORDERKEY] [int] NOT NULL,
    [L_PARTKEY] [int] NOT NULL,
    [L_SUPPKEY] [int] NOT NULL,
    [L_LINENUMBER] [int] NOT NULL,
    [L_QUANTITY] money NOT NULL,
    [L_EXTENDEDPRICE] money NOT NULL,
    [L_DISCOUNT] money NOT NULL,
    [L_TAX] money NOT NULL,
    [L_RETURNFLAG] int NOT NULL,
    [L_LINESTATUS] int NOT NULL,
    [L_SHIPDATE] int NOT NULL,
    [L_COMMITDATE] int NOT NULL,
    [L_RECEIPTDATE] int NOT NULL,
    [L_SHIPINSTRUCT] [char](25) NOT NULL,
    [L_SHIPMODE] int NOT NULL,
    [L_COMMENT] char(44) NOT NULL
)
Getting Rid of Useless Work
Additional parameters for SQLCMD:
-a32767 -W -s";" -f437
x1.5
Unicode – 10% overhead?
Let's try that with Native and Unicode…
x5
• SQLNCLI is one of these in disguise: ODBC, OLEDB
• Pick good data types: MONEY over NUMERIC; UNICODE only if data arrives like this
• Native protocols vs. flexibility
Summary Moving Data
• Get the Windows 8 ADK or Windows 7 SDK
• Set up symbol paths
• xperf -on Base
  • Standard trace for time; narrow to process and DLL/EXE
• xperf -on Base -stackwalk Profile
  • Get to the call stack, find the offending function(s)
• Ease of use for .NET: perfview.exe
Summary – Xperf
Response time = Service Time + Wait Time
Introducing TPC-H
Loop Join
Diagram: for each of the m rows in the outer result, "find my match" by seeking into an n-row B-tree, costing log(n) reads per probe.
Complexity: O(m * log(n))
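As a sketch of this cost model (not SQL Server's actual operator; function names are mine), the sorted inner side plays the role of the B-tree, and each of the m outer rows pays a log2(n) binary search:

```python
from bisect import bisect_left

def loop_join(outer, inner_sorted):
    # O(m * log n): one binary "seek" into the sorted inner side per outer row.
    inner_keys = [k for k, _ in inner_sorted]
    result = []
    for key, outer_payload in outer:
        i = bisect_left(inner_keys, key)          # the log(n) reads
        if i < len(inner_keys) and inner_keys[i] == key:
            result.append((key, outer_payload, inner_sorted[i][1]))
    return result

orders = [(1, "o1"), (3, "o3"), (7, "o7")]
lineitems = [(1, "l1"), (2, "l2"), (3, "l3"), (7, "l7")]  # sorted on the key
print(loop_join(orders, lineitems))
```

With unique inner keys this returns one match per probe, which is exactly the B-tree seek the slide describes.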
Linked List vs. Tree
Diagram: finding a value in an n-element linked list takes up to n steps; in a balanced tree over the same 16 values (root 8), each comparison halves the range, so the search takes log2(n) steps.
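The two access patterns can be compared with toy functions (illustrative only, not a real index):

```python
def linked_list_steps(values, target):
    # A linked list must be walked one node at a time: up to n steps.
    for steps, v in enumerate(values, start=1):
        if v == target:
            return steps
    return len(values)

def tree_steps(values, target):
    # A balanced tree halves the candidate range per comparison: ~log2(n) steps.
    lo, hi, steps = 0, len(values) - 1, 0
    while lo <= hi:
        steps += 1
        mid = (lo + hi) // 2
        if values[mid] == target:
            return steps
        elif values[mid] < target:
            lo = mid + 1
        else:
            hi = mid - 1
    return steps

data = list(range(1, 2**16 + 1))   # n = 65536, log2(n) = 16
print(linked_list_steps(data, 60000), tree_steps(data, 60000))
```

For 64K values the list walk costs tens of thousands of steps; the tree descent never exceeds 17.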
Cluster on O_ORDERKEY Index on O_ORDERKEY
Basic argument for Cluster Indexes
CREATE UNIQUE CLUSTERED INDEX CIX_Key ON ORDERS_Cluster (O_ORDERKEY) WITH (FILLFACTOR = 100)

SELECT * FROM ORDERS_Cluster WHERE O_ORDERKEY = 3000000

CREATE UNIQUE INDEX IX_Key ON ORDERS_Heap (O_ORDERKEY) WITH (FILLFACTOR = 100)

SELECT * FROM ORDERS_Heap WHERE O_ORDERKEY = 3000000

Table 'ORDERS_Heap'. Scan count 0, logical reads 3, physical reads 0, read-ahead reads 0
Table 'ORDERS_Cluster'. Scan count 0, logical reads 4, physical reads 0, read-ahead reads 0
Cluster on O_ORDERKEY vs. heap + index on O_ORDERKEY
But what if we do this a lot?
CREATE INDEX IX_Customer ON ORDERS_Cluster (O_CUSTKEY)
WITH (FILLFACTOR = 100)
CREATE INDEX IX_Customer ON ORDERS_Heap (O_CUSTKEY)
WITH (FILLFACTOR = 100)
SELECT * FROM ORDERS_Heap WHERE O_CUSTKEY = 47480

SELECT * FROM ORDERS_Cluster WHERE O_CUSTKEY = 47480
Table 'ORDERS_Cluster'. Scan count 1, logical reads 27, physical reads 0
Table 'ORDERS_Heap'. Scan count 1, logical reads 11, physical reads 0
How many LOOP joins/sec/core?
7 Sec
What did we just measure?
xperf -on Base -stackwalk profile
About 40%...
• The query language itself
• Why so many ExecuteStmt?
• …With so much CPU use?
What is sqllang.dll?
A different way to Measure Loops
1 Sec
VS.
What does THAT look like?
Takeaway:
The T-SQL language itself is expensive
• Sample from LINEITEM
• Force loop join with index seeks
• Do 1.4M seeks
Test: Singleton Row Fetch
Singleton seeks – Cost of compression

Compression     Seek (1.4M seeks)   CPU load
None – Memory   13 sec              100% one core
PAGE – Memory   24 sec              100% one core
None – I/O      21 sec              100% one core
PAGE – I/O      32 sec              100% one core
Function                                         % Weight
CDRecord::LocateColumnInternal                   0.82%
DataAccessWrapper::DecompressColumnValue         0.47%
SearchInfo::CompareCompressedColumn              0.28%
PageComprMgr::DecompressColumn                   0.24%
AnchorRecordCache::LocateColumn                  0.18%
ScalarCompression::AddPadding                    0.04%
ScalarCompression::Compare                       0.11%
Additional runtime of GetNextRowValuesInternal   0.14%
Total compression                                2.28%
Total CPU (single core)                          8.33%
Compression %                                    27.00%
xperf -on base -stackwalk profile
Modern CPU
Diagram: two cores on the bus, each with a 32KB instruction cache, a 32KB data cache, and a 256K unified L2, sharing a 4MB L3 cache.
The B+ Tree
Diagram: a classic B+ tree descent vs. a Hekaton-style "loop" that reaches the row directly through a (hash) lookup table.
Merge Join
Diagram: both inputs are sorted on the join key; the m-row and n-row sides are merged in a single forward pass over each.
Complexity: O(m + n)
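A minimal merge join over pre-sorted inputs shows the O(m + n) single-pass behaviour (a sketch assuming unique keys per side; duplicates need extra advance logic, and this is not SQL Server's actual operator):

```python
def merge_join(left, right):
    # Both inputs must be sorted on the join key; each is read exactly once.
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        lk, lv = left[i]
        rk, rv = right[j]
        if lk == rk:
            out.append((lk, lv, rv))
            i += 1
            j += 1
        elif lk < rk:
            i += 1           # advance the side with the smaller key
        else:
            j += 1
    return out

a = [(1, "a1"), (2, "a2"), (4, "a4")]
b = [(2, "b2"), (3, "b3"), (4, "b4")]
print(merge_join(a, b))  # [(2, 'a2', 'b2'), (4, 'a4', 'b4')]
```

No hash table and no per-row seek: the sort order does all the work, which is why the plan shape around the sort matters so much.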
Merge Join – What is Fastest?
SELECT MAX(L_PARTKEY), MAX(O_ORDERDATE)
FROM LINEITEM
INNER MERGE JOIN ORDERS ON O_ORDERKEY = L_ORDERKEY

…or

SELECT MAX(L_PARTKEY), MAX(O_ORDERDATE)
FROM ORDERS
INNER MERGE JOIN LINEITEM ON O_ORDERKEY = L_ORDERKEY
Comparing the Query Plans
Digging in Deeper
Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0.
Table 'ORDERS'. Scan count 1, logical reads 22162, physical reads 0, read-ahead reads 0.
Table 'LINEITEM'. Scan count 1, logical reads 104522, physical reads 0, read-ahead reads 0.
SQL Server Execution Times: CPU time = 3265 ms, elapsed time = 3357 ms.

Table 'LINEITEM'. Scan count 1, logical reads 104522, physical reads 0, read-ahead reads 0.
Table 'ORDERS'. Scan count 1, logical reads 22162, physical reads 0, read-ahead reads 0.
SQL Server Execution Times: CPU time = 2469 ms, elapsed time = 2607 ms.
We can beat SQL Server at this game
SELECT MAX(O_ORDERDATE), MAX(MAX_P)
FROM (
    SELECT L_ORDERKEY, MAX(L_PARTKEY) AS MAX_P
    FROM LINEITEM
    GROUP BY L_ORDERKEY
) b
INNER MERGE JOIN ORDERS ON O_ORDERKEY = b.L_ORDERKEY
Hash Join
Diagram: hash the n-row input into an n-row hash table (one pass to build), then probe it with each of the m result rows ("find my match").
Complexity: O(m + 2n)
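The two phases can be sketched as below (illustrative, not SQL Server's operator); the slide counts the n-row input twice — once to read it, once to build the hash table:

```python
def hash_join(build, probe):
    table = {}
    for k, v in build:                 # build phase: O(n) inserts
        table.setdefault(k, []).append(v)
    out = []
    for k, v in probe:                 # probe phase: O(m) lookups
        for bv in table.get(k, ()):
            out.append((k, v, bv))
    return out

dim = [(1, "d1"), (2, "d2")]
fact = [(1, "f1"), (2, "f2"), (2, "f3"), (9, "f9")]
print(hash_join(dim, fact))
```

No sort order is required on either input — the price is the memory to hold the build side, which is exactly what goes wrong in the spill zone below.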
When Hash Joins hurt you
Chart: runtime (seconds) vs. hash memory (MB) — runtime explodes in the "Spill Zone!" once the hash table no longer fits its memory grant.
Hash Joins Don’t Scale in MSSQL
The Bottleneck Curve
ACCESS_METHODS_DATASET_PARENT:
“Used to synchronize child dataset access to the parent dataset during parallel operations.”
Books Online Story…
Image: FreeDigitalPhotos.net
Using XPERF to find documentation
xperf -on base+cswitch+dispatcher -stackwalk profile+readythread+cswitch
Let's dig in…
xperf -on base -stackwalk profile -f stackwalk.etl
What LATCH pattern do we see?
GetNextRangeForChildScan
Inside:TableScanNew
• Partition the table by a “random” value
• Modulo the Key for example
• Use SQL Server partition function/schema
The Fix?…
Diagram: rows hashed into 256 partitions (0-255).
Closer…
…But no Cigar
What is the Problem here?
Anti Scale Patterns
CPU Caches
Chart: memory access throughput vs. size of accessed memory (MB), for random pages, sequential pages, and a single page — throughput falls by orders of magnitude once accesses drop out of cache.
Goals:
• Compressed
• Prefetch friendly
• Cache-resident code
Example, Column Stores
ID Value
1 Beer
2 Beer
3 Vodka
4 Whiskey
5 Whiskey
6 Vodka
7 Vodka
ID Customer
1 Thomas
2 Thomas
3 Thomas
4 Christian
5 Christian
6 Alexei
7 Alexei
Product Customer
ID Date
1 2011-11-25
2 2011-11-25
3 2011-11-25
4 2011-11-25
5 2011-11-25
6 2011-11-25
7 2011-11-25
Date
ID Sale
1 2 GBP
2 2 GBP
3 10 GBP
4 5 GBP
5 5 GBP
6 10 GBP
7 10 GBP
Sale
Compression is Easy
ID Value
1-2 Beer
3 Vodka
4-5 Whiskey
6-7 Vodka
ID Customer
1-3 Thomas
4-5 Christian
6-7 Alexei
Product’ Customer’
ID Date
1-7 2011-11-25
Date’
ID Sale
1-2 2 GBP
3 10 GBP
4-5 5 GBP
6-7 10 GBP
Sale’
RL Value
2 Beer
1 Vodka
2 Whiskey
2 Vodka
RL Customer
3 Thomas
2 Christian
2 Alexei
Product’ Customer’
RL Date
7 2011-11-25
Date’
RL Sale
2 2 GBP
1 10 GBP
4 5 GBP
2 10 GBP
Sale’
Squeezing it even more
RL Value
2 Beer
1 Vodka
2 Whiskey
2 Vodka
Product’
RL Value
2 1
1 2
2 3
2 2
Product’
Beer = 1, Vodka = 2, Whiskey = 3
ID Value
1-2 Beer
3-3 Vodka
4-5 Whiskey
6-7 Vodka
Product’
Raw: (4+4+4+2) + (4+4+5+2) + (4+4+7+2) + (4+4+5+2) = 14B + 15B + 17B + 15B = 61B
RLE: (4+4+2) + (4+5+2) + (4+7+2) + (4+5+2) = 10B + 11B + 13B + 11B = 45B
Dictionary: (4+4) x 4 = 32B
RL Value
2 0x01
1 0x10
2 0x11
2 0x10
Product’
Bit-packed: 4 x 4B run lengths + 4 x 2 bits = 16B + 2B = 18B
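The byte arithmetic above can be checked mechanically — sizes follow the slide's conventions (4-byte integers, 2-byte variable-length overhead, and the 2-bit codes rounded up to 2 bytes):

```python
runs = [(2, "Beer"), (1, "Vodka"), (2, "Whiskey"), (2, "Vodka")]

# Raw: per run, a 4-byte start ID + 4-byte end ID + string + 2 bytes overhead
raw = sum(4 + 4 + len(v) + 2 for _, v in runs)
# RLE: replace the ID range with a single 4-byte run length
rle = sum(4 + len(v) + 2 for _, v in runs)
# Dictionary: strings become fixed 4-byte codes; the varlen overhead goes away
dictionary = sum(4 + 4 for _ in runs)
# Bit packing: 3 distinct values fit in 2 bits; 4 codes round up to 2 bytes
packed = sum(4 for _ in runs) + 2

print(raw, rle, dictionary, packed)  # 61 45 32 18
```

Each step roughly halves the footprint, which is what makes the column cache-resident.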
RL Value
2 Beer
1 Vodka
2 Whiskey
2 Vodka
RL Customer
3 Thomas
2 Christian
2 Alexei
Product’ Customer’
2 steps with Beer
2 steps with Thomas
Beer Thomas
Beer Thomas
SELECT Product, Customer FROM Table
1 step with Vodka
1 step with Thomas
Vodka Thomas
2 steps with Whiskey
2 steps with Christian
Whiskey Christian
Whiskey Christian
2 steps with Vodka (note: repeated value)
2 steps with Alexei
Vodka Alexei
Vodka Alexei
Hash Joining with Column Stores
RL Key
2 Beer
1 Vodka
2 Whiskey
2 Vodka
Table
Key Type
Beer Soft
Vodka Strong
Whiskey Strong
Vodka Strong
Dim Product
SELECT …
FROM Table
JOIN DimProduct ON Key
WHERE Type = ‘Strong’

1. Compute Bloom filter of keys belonging to ‘Strong’
2. Read RL = 2, Beer from Table
3. Compute Bloom value of Beer
4. Equal to filter value from 1? Yes. Output two rows (RL = 2)
5. Compute Bloom value for Vodka
6. Equal to filter value from 1? No. Do nothing
7. Compute Bloom value for Whiskey
8. Equal to filter value from 1? No. Do nothing
Can prefetch data (knows the RLE runs)
Can calculate match/no match using only the local CPU cache
Won't work for OLTP!
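The probe step can be sketched with a toy single-hash Bloom filter (a real implementation uses several hash functions; the names here are mine). A key that fails the probe is skipped without ever touching the dimension row:

```python
from zlib import crc32

def bloom_bit(key, m=64):
    # One bit in an m-bit filter word; crc32 keeps the demo deterministic.
    return 1 << (crc32(key.encode()) % m)

def build_filter(keys):
    f = 0
    for k in keys:
        f |= bloom_bit(k)
    return f

def probe(filter_word, key):
    # False means "definitely not on the build side"; True means "maybe, verify".
    return (filter_word & bloom_bit(key)) != 0

strong = ["Vodka", "Whiskey"]        # dimension keys surviving Type = 'Strong'
f = build_filter(strong)
for run_len, key in [(2, "Beer"), (1, "Vodka"), (2, "Whiskey"), (2, "Vodka")]:
    if probe(f, key):
        print(key, "maybe matches; output", run_len, "rows pending verification")
```

Because the whole filter is a few machine words, the match test runs entirely out of the local CPU cache, and one run-length test covers every row in the run.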
Why is it so hard to get joins right?
Chart: join cost as a function of the input sizes m and n — loop, merge, and hash joins each win in a different region, which is why getting join choice right is hard.
Desired Join     Join Hint                                   Query Hint
LOOP             [INNER | LEFT | CROSS | FULL] LOOP JOIN     OPTION (LOOP JOIN)
MERGE            [INNER | LEFT | CROSS | FULL] MERGE JOIN    OPTION (MERGE JOIN)
HASH             [INNER | LEFT | CROSS | FULL] HASH JOIN     OPTION (HASH JOIN)
LOOP with Seek   WITH FORCESEEK / WITH (INDEX (<name>))      N/A

Controlling Joins
Note: Join hints force the order of the ENTIRE join tree!
What Type of Workload?
Quadrant chart — data returned vs. data touched: OLTP, BI/DW, Simulation, ETL.
How to Classify?
OLTP BI/DW
Simulation ETL
• Full Scan/sec, Range Scans/sec, Probe Scans/sec
• Index Search/sec, Range Scans/sec
• Full Scan/sec, Range Scans/sec
• Bulk Copy Rows/sec?
There should ALWAYS be a fully indexed path to the data.
OLTP System Basic Query Pattern
Quadrant chart: OLTP is the small-touched, small-returned quadrant.
1. Find the worst CPU-consuming query with sys.dm_exec_query_stats
2. Add OPTION (LOOP JOIN) to the offending query
3. Check the estimated query plan
4. If a table spool is found: add an index to remedy and GOTO 3
5. Happy? If not, GOTO 1
The Super Quick OLTP Tuning Guide
The query will not be (much) worse than a full scan of a fact partition
DW/BI System Basic Query Pattern
Quadrant chart: BI/DW is the big-touched, small-returned quadrant.
1. Find the offending query
2. Add OPTION (HASH JOIN) to the query
3. Do the dimension tables have an indexed path to build the hash? If not, add an index
4. Do you get a fact table scan and a hash build of all dimensions? If not, check statistics (especially on facts and skewed columns)
5. Optimize fact table scans:
   1. Partitioning and partition elimination
   2. Column store if you have it
   3. Aggregate views
   4. Bitmap index pushdown (statistics!)
   5. Composite indexes (last resort!)
The Super Quick DW tuning Guide
The expected DW Query Plan
Diagram: Hash/Stream Aggregate on top of Hash Joins — a fact CSI scan on one side, dimension scans/seeks with batch builds on the other, and partial aggregates pushed down.
• At least enough RAM to hold the hash tables of the largest dimension
• De-normalisation helps… a LOT — especially for the large/large joins
• Likely: need to scan fast from disk if RAM is not big enough to hold the fact
• Compression REALLY matters
Things that Follow from desired DW Plan
Coffee Break
Response time = Service Time + Wait Time
Where EVERY Server wide diagnosis starts
SELECT * FROM sys.dm_os_wait_stats
WHERE wait_type NOT IN (SELECT wait_type FROM #ignorewaits)
  AND waiting_tasks_count > 0
ORDER BY wait_time_ms DESC
• Shows up as waits for PAGEIOLATCH
• You can dig into details with:
Common Problems - PAGEIO
SELECT * FROM sys.dm_io_virtual_file_stats(DB_ID(), NULL)
• Can also Xevent your way to it per query
CREATE EVENT SESSION [TraceIO] ON SERVER
ADD EVENT sqlserver.file_read_completed (
    ACTION (sqlserver.database_id, sqlserver.session_id)
)
• I/O, like memory, is a GLOBAL resource for the machine
• When does it make sense to partition a global resource?
  • When you deeply know the workload
  • When the workload is ALREADY partitioned
• When neither of those is true: DON'T partition
• If you have NAND/SSD – why bother?
The general I/O Guidance
A good way to Think of Spindle I/O
Diagram: DRAM on top; below it either a JBOD of small LUNs, each serving a sequential stream, or one large RAID LUN where the mixed sequential streams degenerate into RANDOM I/O.
Stripe vs. Concatenation
RAID 10 RAID 10
Concatenated LUN
RAID 10 RAID 10
Striped LUN
OLTP
• One big SAME setup: data files, tempdb
• Dedicate: transaction log
• DRAM: enough to hold most of the DB

Data Warehouse
• JBOD setup: data files, 1-2 per LUN
• SAME setup: tempdb
• Dedicate: transaction log
• DRAM: enough to hold the largest partition of the largest table
Rules of Thumb – Spindle I/O and DRAM
• Short Stroking
• Elevator Sort
• Sequential vs. Random
• Weaving
You can do a bit better… or worse
• Intentionally use a lower % of total space
• Tradeoff: space for speed
• Test: 15K RPM SAS spindle, 300GB
Short Stroking Disks
Chart: IOPS vs. % capacity used — the short-stroked disk sustains roughly 2x the IOPS of the full-stroked one (scale 150-400 IOPS).
Why does Short Stroking Work?
Disks are typically consumed "from the outside in". If partitions don't use the full disk size, the disk won't use the full platter either. The result: less head movement.
Adding Elevator Sorting
Chart: 8K random I/O — IOPS, average latency, and max latency for full stroke, outer/inner short stroke, elevator sort, and elevator + short stroked. Elevator sorting trades higher max latency for more IOPS.
Bat powered disk!
Why Chase Sequential I/O?
Chart: 8K block pattern, log(IOPS) and latency (ms) — sequential access delivers orders of magnitude more IOPS than full-stroke random, at a fraction of the latency.
• One SATA disk
• Two partitions
• One file on each
• Sequential read on each file
But all is not well!
File1 File2
I/O Weaving in action
Chart: 64K random vs. 64K dual sequential — two "sequential" streams on one disk weave into worse-than-random performance (IOPS and avg latency, ms). Source: Michael Anderson
Storage Pool and Weaving
Diagram: data and log files spread across a massive, thin-provisioned storage pool — each stream is sequential locally, but the pool sees RANDOM I/O!
The SAN will properly handle Sharing!
Green: Checkpoint, Red: tx/sec, Black: Disk Latency
The “cache out curve”
Numbers to Remember – Spindles

Characteristic          Typical
Throughput/Bandwidth    90-125 MB/sec — but ONLY if sequential access!
Operations per sec      10K RPM spindle: 100-130 IOPS; 15K RPM spindle: 150-180 IOPS; about 2x if short stroking (more later)
Latency                 3-5 ms (compare DRAM: 100 ns)
Capacity                100s of GB to single-digit TB

2012 numbers, will change in future.
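A quick back-of-envelope check of why these numbers matter (hypothetical helper; figures from the 2012-era table above): random I/O throughput is just IOPS times block size, which is why sequential access is worth chasing on spindles.

```python
def random_mb_per_sec(iops, block_kb):
    # Random throughput = operations per second x bytes per operation.
    return iops * block_kb / 1024.0

print(random_mb_per_sec(150, 8))    # 15K RPM spindle, 8K pages: ~1.2 MB/sec
print(random_mb_per_sec(150, 64))   # 64K read-ahead blocks: ~9.4 MB/sec
```

Compare either figure with the 90-125 MB/sec a single spindle can stream sequentially: random access wastes roughly 90-99% of the disk's bandwidth.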
• Few hundreds of IOPS
• Faster if short stroked
• Trade latency for speed with elevator sort
• Sequential is hard to get right
Summary so far.. Single Disk
• Wider stripes are neat — but scale is not linear
• Very deep queues help — but add latency
• Shared components
Why does a big RAID pile not solve this?
RAID Scale? Your mileage WILL vary with the hardware
Before After
Getting rid of Sharing
Diagram: the full I/O path — HBAs, switches, storage ports, cache, LUNs, disks, CPU — duplicated per workload ("x2") so nothing is shared.
NAND Flash Basics
Diagram: a NAND cell (control gate, floating gate with trapped electrons, oxide layer) builds up into 4K pages, pages into blocks, blocks into dies, and dies into a package.
NAND Flash Problems
• Erase cycles: around 100K; rebalancing and reclaim/trim
• Voltage measurement: gets worse with density, changes over time, depends on how you program
• Bit rot: must refresh even on read
• SLC easier to manage than MLC — but much more expensive!
Diagram: MLC voltage levels 00 / 01 / 10 / 11.
Lessons Learned: Try to Avoid Sharing
BAD BETTER BEST
The Network
• Only partially diagnosed as waits in sys.dm_os_wait_stats
• Task Manager gives a bit more information
• Need: transparency into the deep-level latencies and packets!
Common Problems: ASYNC_NETWORK, OLEDB
A common Wait Type
The database is really slow! The code takes forever to run!
• We may not always have insight into what is going on at the client…
Xperf Diagnosing the Network
xperf -on latency+network
Summary Table
Timeline of the network Traffic
ASYNC_NETWORK_IO, the typical issue
Handling network is EXPENSIVE
xperf -on latency
Short Story on DPC/ISR handling
Diagram: the NIC raises an IRQ over the PCI bus; the core (with its L1-L3 caches) halts execution and fires the ISR routine:

if (my interrupt) {
    <Mark Handled>
    Queue DPC
}

Later the DPC runs — <Do work needed>, <Wake Application> — and the core can run other stuff again.
It looks like this…
DPC / ISR
• Option 1: Use the HW vendor's tool
• Option 2: Use the Interrupt-Affinity Policy Tool from MS
Setting Interrupt Affinity
Jumbo Frame and SQL Packets
• Standard network payload (MTU): 1500 B
• Jumbo Frames: 9014 B (MTU)
• Standard SQL payload: 4096 B
• Largest: 32767 B

SELECT session_id, net_packet_size
FROM sys.dm_exec_connections

Server=foo;Packet size=32767
Single Threaded
Core Evolution
Moore’s “Law”:
“The number of transistors per square inch on integrated circuits has doubled every two years since the integrated circuit was invented”
• Never faster than a single core
• Smaller servers are faster than bigger ones
• Large L2 caches and more clock speed help
• The algorithm dictates speed
• Latency of wait time sets the upper limit
• Examples from MSSQL land:
  • Formula Engine in MSAS
  • Transaction log writes
  • INSERT/UPDATE/DELETE (as we shall see)
Single Threaded
VLF files
• When switching to a new VLF, it has to be "formatted" with an 8K sync write
• While this happens, transactions are blocked
• Too many VLFs = too much blocking
• Lesson: preallocate the database log file in big chunks
• Up to 128 log buffers per database
• Spawned on demand, not released once spawned
• Transactions will wait on LOGBUFFER if no buffer is available
• Think of this as a pipeline of commits waiting…

Diagram: VLF(1)…VLF(6), each formatted in 8K chunks; up to 128 log buffers of <=60K each.
Transaction Log Background
Diagram: threads T0…Tn take LOGCACHEACCESS to allocate a slot in the current log buffer (up to 128 slots, addressed by cache-line buffer offset) and memcpy their slot content in; full buffers go on the flush queue (LOGFLUSHQ), the Log Writer issues the WRITELOG I/O through an async I/O completion port, and on completion signals the thread that issued the commit.
• Speed is determined by Latency and Code Path
• Max Log Write Size: 60K
Zooming to the Log Writer
Diagram: writer queue → Log Writer → async I/O completion port; on completion, signal the thread that issued the commit. Latency dominates the round trip.
Long Distance Replication…
Diagram: the primary writes a log entry, sends the log over the network, the secondary writes it and acks back.
Executive Summary: the speed of light (c) is not fast enough!
• Perfmon will only show milliseconds
• What if we want microseconds?
Getting to the Real Latency
xperf -on latency
It’s in Memory, so it must be fast?
RAM (latency 15-30us) vs. DISK (latency <5us) — yet both runs take 1.5 sec.
No, Because…
This adds up to one core… it is doing all it can with the CPU it has
The Effect on UPDATE
Naïve: UPDATE MyBigTable SET c6 = 43
Parallel: UPDATE MyBigTable SET c6 = 43 WHERE key BETWEEN 10**9 * n AND 10**9 * (n+1) - 1
Chart: runtime (smaller is faster) — the parallel version wins.
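The parallel variant's BETWEEN predicates amount to carving the key space into disjoint, contiguous ranges, one per worker. A sketch of that range computation (function name is mine):

```python
def key_ranges(lo, hi, workers):
    # Split [lo, hi] into contiguous, disjoint ranges, mirroring the
    # WHERE key BETWEEN ... AND ... predicates of the parallel UPDATE.
    span = hi - lo + 1
    step = -(-span // workers)  # ceiling division
    ranges = []
    start = lo
    while start <= hi:
        ranges.append((start, min(start + step - 1, hi)))
        start += step
    return ranges

print(key_ranges(0, 10**9 - 1, 4))
```

Each worker then runs its own UPDATE over its range, so the single-threaded statement becomes N independent streams.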
Multi Threaded
What is Scalable?
Chart: throughput vs. some hardware resource (0-24 units), with three curves — Good (keeps climbing), So-so (flattens), Bad (declines). We want to live on the Good curve.
Amdahl’s Law of gated speedup
Chart: speedup factor vs. number of cores (up to 64), for P = 100%, 95%, 90%, 80%.

P = part of the program that can be made parallel (note that this may be 0... or 1)
N = number of CPU cores available

Speedup = 1 / ((1 - P) + P / N)
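The formula can be evaluated directly; a small sketch (function name is mine) shows how hard the serial fraction caps the curve:

```python
def amdahl_speedup(p, n):
    # Amdahl's law: speedup = 1 / ((1 - P) + P / N)
    return 1.0 / ((1.0 - p) + p / n)

print(amdahl_speedup(1.00, 64))  # perfectly parallel: 64x
print(amdahl_speedup(0.95, 64))  # 5% serial already limits 64 cores to ~15.4x
```

Even a 5% serial portion throws away three quarters of a 64-core machine — which is why the serialisation points in the following slides matter so much.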
Introducing Contention – Locks
Table A
Table B
Table C
INSERT TableA …
INSERT TableB …
INSERT TableC …
Diagram: each INSERT takes its own row locks (LCK) on its table.
Wait Stat: LCK_<X>
But those rows have to be stored…
Table A
Table B
Table C
Diagram: all three tables' pages are allocated from one data file in the filegroup.
It all Starts with Wait Stats
SELECT * FROM sys.dm_os_wait_stats
WHERE wait_type NOT IN (SELECT wait_type FROM #ignorewaits)
  AND waiting_tasks_count > 0
ORDER BY wait_time_ms DESC
DBCC PAGE
PFS – Hidden Single Page Contention
Diagram: a data file starts with GAM/SGAM and a PFS page, then one PFS page per 64MB; every INSERT must flip an "allocated" bit on the shared 8K PFS page (bitmap: 10010010), serialised by a PAGELATCH.
More Files
Diagram: the filegroup now contains several data files.
• Round robin between files
• More files, more allocation structures
• No affinity
How many more Files?
Chart: runtime and PAGELATCH_UP waits vs. number of data files (0-48) — waits fall by orders of magnitude as files are added, and runtime drops until the curve flattens.
• Shared, physical MEMORY structures can cause bottlenecks (ex: PFS)
• SQL Server must sync too…
• Understanding where a structure resides leads to the tuning fix
• Theory of the engine!
Concurrency: What we learned so far
• Commonly misdiagnosed
• CXPACKET does NOT (always) mean that your DOP is "too high"
CXPACKET
Charts: CXPACKET waits vs. throughput (MB/sec), and throughput vs. DOP (1-46) — CXPACKET waits grow with DOP even while throughput keeps improving.
CXPACKET = Issue may be elsewhere…
• What happens when you get things like: LATCH_<x>, PAGELATCH_<x>
• Step 1: Dig into:
Diagnosing Latches
SELECT * FROM sys.dm_os_latch_stats
Digging into Latches Again…
Zooming into the Ready Thread
Post Fix Pattern
GetNextRangeForChildScan
• Before: 6 GB/sec
• After: 20 GB/sec
• This sometimes works on clustered indexes too…
…Whiteboard
Speedup with Hash Partition of Heap
UPDATE Hotspot
Before: an 8K page holds three rows; LCK_U row locks are granular, but every update serialises on the page's single PAGELATCH_EX.
ALTER TABLE HotUpdates
ADD Padding CHAR(5000) NOT NULL DEFAULT (‘X’)
After
UPDATE Hack on Small Tables
Diagram: one row per 8K page (thanks to the CHAR(5000) filler) vs. three rows per page — the padded layout keeps concurrent updates from colliding on PAGELATCH_EX.
Test: Updates of pages

Compression     Update 1.4M   CPU Load
None – Memory   13 sec        100% one core
PAGE – Memory   54 sec        100% one core
None – I/O      17 sec        100% one core
PAGE – I/O      59 sec        100% one core

L_QUANTITY is NOT NULL, i.e. an in-place UPDATE
Function                                          CPU %
qsort                                             0.86
CDRecord::Resize                                  0.84
CDRecord::LocateColumnInternal                    0.36
perror                                            0.36
Page::CompactPage                                 0.36
ObjectMetadata::`scalar deleting destructor'      0.27
SearchInfo::CompareCompressedColumn               0.24
CDRecord::InitVariable                            0.19
CDRecord::LocateColumnWithCookie                  0.18
memcmp                                            0.16
PageDictionary::ValueToSymbol                     0.16
Record::DecompressRec                             0.14
PageComprMgr::DecompressColumn                    0.14
CDRecord::InitFixedFromOld                        0.1
SOS_MemoryManager::GetAddressInfo64               0.08
AnchorRecordCache::LocateColumn                   0.08
CDRecord::GetDataForAllColumns                    0.08
ScalarCompression::Compare                        0.07
PageComprMgr::CompressColumn                      0.07
Record::CreatePageCompressedRecNoCheck            0.06
memset                                            0.05
PageComprMgr::ExpandPrefix                        0.04
PageRef::ModifyColumnsInternal                    0.04
Page::ModifyColumns                               0.03
DataAccessWrapper::ProcessAndCompressBuffer       0.03
SingleColAccessor::LocateColumn                   0.03
CDRecord::BuildLongRegionBulk                     0.02
ChecksumSectors                                   0.02
Page::MCILinearRegress                            0.02
DataAccessWrapper::DecompressColumnValue          0.02
SOS_MemoryManager::GetAddressInfo                 0.02
CDRecord::FindDiff                                0.02
AnchorRecordCache::Init                           0.02
PageComprMgr::CombinePrefix                       0.01
Total                                             5.17
UPDATE Compression burners
Out of 8.55 … Approx: 60%
Compression and Locks
Xevent TraceLock Acquire/Release
High Res Timer
How long are locks held?
Chart: lock-held cycle count (avg and stddev, in CPU kilocycles) for PAGE vs. NONE compression — locks are held far longer on compressed pages.
• Sharing is generally bad for scale (but may be good for performance)
• PAGELATCH and LATCH diagnosis starts in sys.dm_os_latch_stats
• CXPACKET: only important if throughput drops when DOP goes up; if this happens, look for another wait/latch
• Table partitioning can be used to work around concurrency issues
Summary Concurrency – So Far..
The Paul Randal INSERT test
160M rows, executing at concurrency. Commit every 1K:
EASY tuning?
All is as Expected?
But Page Splits are Bad, right?
= BAD!
= Better!...
WRITELOG gone? Faster?
?
?
NO! sys.dm_os_wait_stats
And the Score Is…
Chart: runtime in seconds (0-35000) for newguid(), newsequentialid(), and IDENTITY.
What is going on here???
Diagram: the B-tree under HOBT_ROOT with its Min…Max boundaries — every sequential key lands at the same right-most edge.
Tricks to Work Around this
Diagram: partition the key space (0-1000, 1001-2000, 2001-3000, 3001-4000) and run concurrent INSERT streams against each partition — all cores at 100%.

Chart: runtime in seconds for newguid(), newsequentialid(), IDENTITY, IDENTITY + unique-index variants, and SPID + offset — the best schemes reach 600K-830K inserts/sec with all cores at ~100%.
• Don’t use sequential keys
  • Page splitting isn't so bad
  • Neither are GUIDs
• Generate keys wisely — ideally in the app server
• For "transparent" speedup, consider our old hash trick
Takeaways, INSERT workload
• Minimally logged
• Single, large executions (thousands of rows)
• Unsorted data
• Concurrent loaders
BULK INSERT Workload
Heap
Bulk Insert
Bulk Insert
Measure:
SELECT * FROM sys.dm_os_latch_stats
Observe waits on
ALLOC_FREESPACE_CACHE
Theory (just read BOL):“Used to synchronize the access to a cache of pages with available space for heaps and binary large objects (BLOBs). Contention on latches of this class can occur when multiple connections try to insert rows into a heap or BLOB at the same time. You can reduce this contention by partitioning the object.”
When does BULK INSERT scale break?
[Chart: throughput (MB/sec, 0–250) vs number of concurrent BULK INSERT streams (0–30)]
What is Happening here?
[Diagram: Bulk Insert allocates new pages from the HOBT cache of “fat chunks”, backed by free-page information (PFS/GAM/SGAM); ALLOC_FREESPACE_CACHE guards this structure, which lives in DRAM and L2]
• Break up table by “some key”
• Optional: switch out partitions
• Spin up multiple bulks
• Linear scale: 3GB/sec, 16M LINEITEM rows/sec
Breaking Through the Bottleneck
[Diagram: hash-partitioned heap; sample key values (425, 555, 215, 200, 101, 453, 666) routed to areas, with one Bulk Insert stream per partition]
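The routing step — giving each loader its own partition so no two streams share an ALLOC_FREESPACE_CACHE — amounts to a hash on the partitioning key. A minimal sketch, with Fibonacci hashing and a partition count chosen purely for illustration:

```c
#include <stdint.h>

/* Route a row to one of N partitions so concurrent BULK INSERT
 * streams never contend on the same freespace cache. Hash constant
 * is 2^64 / golden ratio (Fibonacci hashing); 8 partitions assumed. */
#define PARTITION_BITS 3    /* 2^3 = 8 partitions */

unsigned partition_for(uint64_t key)
{
    /* multiply, then keep the top bits: mixes low-entropy keys well */
    return (unsigned)((key * 11400714819323198485ull) >> (64 - PARTITION_BITS));
}
```

In the deck’s setup the same idea is expressed as table partitioning on the key, with one loader per partition; the hash just decides which partition a row (or loader) belongs to.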
BULK INSERT - Reloaded
• Thomas, you might have gotten 16M rows/sec at 3GB/sec insert speed
• But this was on heaps, I have a clustered table
• Alright then, let us hit a cluster index
[Diagram: clustered, partitioned table with key ranges 1–1000, 1001–2000, 2001–3000, 3001–4000; one X lock per partition]
Cluster Bulking – It seemed so plausible!
Cluster Bulking – Stage and Switch
Coffee Break
SPIN LOCKS
• Context switching is expensive
  • Typically 10K or more CPU cycles
• If you expect the resource to be held only briefly, why fall asleep?
What is a Spinlock?
spin_acquire(int* s)
{
    while (*s == 1)
        ;        /* spin while the lock is held */
    *s = 1;      /* take it (real code does this with an atomic test-and-set) */
}

spin_release(int* s)
{
    *s = 0;
}
• Acquire can be very expensive
• SQL Server implements a backoff mechanism
What is a backoff?
spin_acquire(int* s)
{
    int spins = 0;
    while (*s == 1)
    {
        spins++;
        if (spins > threshold)
        {
            <Sleep and WaitForResource>
        }
    }
    *s = 1;
}
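The backoff loop can be turned into a runnable sketch with C11 atomics; the threshold value and `sched_yield()` stand in for SQL Server’s internal sleep/wait logic and are assumptions:

```c
#include <stdatomic.h>
#include <sched.h>

/* Spinlock with backoff, sketched with C11 atomics. SPIN_THRESHOLD
 * and the yield are illustrative; the real engine uses exponential
 * waits and Xevent accounting. */
#define SPIN_THRESHOLD 4000

typedef struct { atomic_flag flag; } spinlock_t;

void spin_acquire(spinlock_t *s)
{
    int spins = 0;
    /* test_and_set returns the old value: true means someone holds it */
    while (atomic_flag_test_and_set_explicit(&s->flag, memory_order_acquire)) {
        if (++spins > SPIN_THRESHOLD) {
            sched_yield();   /* back off: give up the CPU instead of burning it */
            spins = 0;
        }
    }
}

void spin_release(spinlock_t *s)
{
    atomic_flag_clear_explicit(&s->flag, memory_order_release);
}
```

Unlike the naive version, the test-and-set makes the take atomic, so two spinners cannot both see 0 and both grab the lock.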
SELECT * FROM sys.dm_os_spinlock_stats
DBCC SQLPERF(spinlockstats)
Backoff
Life at 600K INSERT/sec
WRITELOG is I/O – right?
Should be the same as this… or?
No! Because:
• Step 1: Copy sqlservr.pdb to the BINN directory
• Step 2: DBCC TRACEON (3656, -1) • Step 3: Steal script from:
http://www.microsoft.com/en-us/download/details.aspx?id=26666
Note for 2012, you additionally need:• sqlmin.pdb, sqllang.pdb, sqldk.pdb
Diagnosing a Spinlock the Cool way!
Spinlock Walkthrough – Extended Events Script
--Get the type value for any given spinlock type
select map_value, map_key, name
from sys.dm_xe_map_values
where map_value IN ('SOS_CACHESTORE')

--Create the event session that will capture the callstacks to a bucketizer
create event session spin_lock_backoff on server
add event sqlos.spinlock_backoff (
    action (package0.callstack)
    where type = 144 /* SOS_CACHESTORE */
)
add target package0.asynchronous_bucketizer (
    set filtering_event_name = 'sqlos.spinlock_backoff',
        source_type = 1,
        source = 'package0.callstack')
with (MAX_MEMORY = 50MB, MEMORY_PARTITION_MODE = PER_NODE)
--Run this section to measure the contention
alter event session spin_lock_backoff on server state = start
A complete walkthrough of the technique can be found here:
http://sqlcat.com/msdnmirror/archive/2010/05/11/resolving-dtc-related-waits-and-tuning-scalability-of-dtc.aspx
--Wait to measure the number of backoffs over a 1 minute period
waitfor delay '00:01:00'
--To view the data:
--1. Ensure the sqlservr.pdb is in the same directory as the sqlservr.exe
--2. Enable this trace flag to turn on symbol resolution
DBCC traceon (3656, -1)
--Get the callstacks from the bucketizer target
select event_session_address, target_name, execution_count,
       cast(target_data as XML)
from sys.dm_xe_session_targets xst
inner join sys.dm_xe_sessions xs on (xst.event_session_address = xs.address)
where xs.name = 'spin_lock_backoff'
--Clean up the session
alter event session spin_lock_backoff on server state = stop
drop event session spin_lock_backoff on server
Of course, you can just use 2012…
How to improve a spinlock?
[Diagram: two CPU sockets, each with cores and L1–L3 cache; concurrent spin_acquire on the same int forces the cache line holding it to transfer back and forth between sockets]
CoreInfo.Exe – where are my cores?
CoreInfo.exe
Revisiting the TLOG
[Diagram: transaction log pipeline — the LOGCACHE_ACCESS spinlock guards the buffer offset (one cache line); each session (T0…Tn) allocates a slot (Slot 1…Slot 127) in the LOGBUFFER and memcpys its slot content; full buffers queue on LOGFLUSHQ for the Log Writer, which issues WRITELOG I/O via an async I/O completion port and signals the thread that issued the commit]
I/O Affinity Mask!
[Chart: throughput (0–250) for SPID + Offset vs SPID + Affinity, configured via sp_configure ‘affinity I/O mask’]
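The slot allocation in the diagram — many sessions claiming slots in one LOGBUFFER — boils down to an atomic increment on a shared offset, which is exactly the cache line LOGCACHE_ACCESS protects. A simplified sketch (the names and the 128-slot size mirror the diagram, not the engine code):

```c
#include <stdatomic.h>
#include <stdint.h>

/* Every committing session bumps one shared "buffer offset", so the
 * cache line holding next_slot ping-pongs between sockets under load. */
#define SLOTS 128

typedef struct {
    atomic_uint next_slot;          /* the contended buffer offset */
    const char *slot[SLOTS];
} log_buffer_t;

/* Claim a slot; returns -1 when the buffer is full and must be flushed. */
int alloc_slot(log_buffer_t *b, const char *rec)
{
    unsigned i = atomic_fetch_add(&b->next_slot, 1);
    if (i >= SLOTS)
        return -1;
    b->slot[i] = rec;               /* "MemCpy SlotContent" step */
    return (int)i;
}
```

Affinitizing sessions (SPID + affinity) keeps the claiming threads on fewer sockets, so the cache line travels less — which is what the chart above measures.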
Bulking at Concurrency
• What’s that spin?
xperf -on latency -stackwalk profile
xperf -d trace.etl
xperfview trace.etl
SELECT * FROM sys.dm_os_spinlock_stats
ORDER BY spins DESC
DBCC SQLPERF (spinlockstats)
?
SOS_OBJECT_STORE at high INSERT
• Observed: this spin happens when inserting
• Need: reduce locking overhead
• Fixes that work well here:
8x throughput
Bonus
• Let’s try something really silly:
• Run lots of: EXEC emptyProc
• This should be infinitely scalable, right?
Diagnosing another Spinlock
CREATE PROCEDURE emptyProc
AS
RETURN
Initial Diagnosis
MUTEX ??? … what Mutex?
Using the Spinlock Script gives us
Some cache. Which one?
Validating the Theory
CREATE PROCEDURE emptyProc0
AS
RETURN
GO
CREATE PROCEDURE emptyProc1
AS
RETURN
GO
…
CREATE PROCEDURE emptyProc31
AS
RETURN
What is the SOS_OBJECT_STORE?
Security Check?
Validating the new “fix”…
DECLARE @ParmDef NVARCHAR(500)
DECLARE @sql NVARCHAR(500)
SET @sql = N'INSERT INTO dbo_<t>.MyBigTable_<t> WITH (TABLOCK)
    (c1, c2, c3, c4, c5, c6)
    VALUES (@p1, @p2, @p3, @p4, @p5, @p6)'
SET @sql = REPLACE(@sql, '<t>', dbo.ZeroPad(@table, 3))
SET @ParmDef = '@p1 BIGINT, @p2 DATETIME, @p3 CHAR(111), @p4 INT, @p5 INT, @p6 BIGINT'
DECLARE @constDate DATETIME = '1974-12-22'

DECLARE @i INT
WHILE (1 = 1)
BEGIN
    BEGIN TRAN
    SET @i = 1
    WHILE @i <= 1000
    BEGIN
        EXEC sys.sp_executesql @sql, @ParmDef,
            @p1 = 1, @p2 = @constDate, @p3 = 'x', @p4 = 42, @p5 = 7, @p6 = 13
        SET @i = @i + 1
    END
    COMMIT TRAN
END
Consider this Test harness code…
Spinning on MUTEX
Diagnosing with the trace flag shows the offending stack:
CSecurityContext::GetUserTokenFromCache
This is REALLY expensive at scale:

WHILE @i <= 1000
BEGIN
    EXEC sys.sp_executesql @sql, …
    SET @i = @i + 1
END

It initializes a new execution context on every loop!
Fixing the MUTEX spin
• Instead of:

WHILE @i <= 1000
BEGIN
    EXEC sys.sp_executesql @sql, …
    SET @i = @i + 1
END

• Write:

SET @sql = N'DECLARE @i INT
WHILE (1 = 1)
BEGIN
    BEGIN TRAN
    SET @i = 1
    WHILE @i <= 1000
    BEGIN
        INSERT INTO dbo_<t>.MyBigTable_<t> WITH (TABLOCK)
            (c1, c2, c3, c4, c5, c6)
        VALUES (@p1, @p2, @p3, @p4, @p5, @p6)
        SET @i = @i + 1
    END
    COMMIT TRAN
END'

EXEC sys.sp_executesql @sql, @ParmDef
4x throughput
Bonus
• When all other bottlenecks are gone, sharing happens in the most unlikely places
• You can use spinlock Xevents inside SQL Server
  • Remember symbol files in BINN
  • Trace flag 3656
• This can also be done in XPERF for non-SQL apps
  • Ex: Analysis Services
Concurrency, Spinlock Summary
• Control of buffers and NUMA for Xperf setting
• By default:
  • 4MB mem
  • Spool to disk at root of C-drive
• Can do buffer/file control:
  • -BufferSize and -MaxBuffers
  • -MaxFile and -FileMode Circular
Xperf controlling buffers
• Round robin between NUMA nodes
• Inside the NUMA node: pick the one that looks the least busy
• This is NOT a perfect system
How SQL Server assigns threads
Xperf -on Latency+CSWITCH+DISPATCHER -stackWalk CSwitch+ReadyThread+ThreadCreate+Profile -BufferSize 1024 -MaxBuffers 1024 -MaxFile 1024 -FileMode Circular
REG ADD "HKLM\System\CurrentControlSet\Control\Session Manager\Memory Management" -v DisablePagingExecutive -d 0x1 -t REG_DWORD -f
Super Xperf
• All the tuning won’t help you if your model is wrong
• Tuning gets you far, but to really scale, you need a good data model
• This is what my other courses are about
But does the Data Model Work?
Q & A
Problem Statement
Queue Structure
[Diagram: ordered queue of Msg slots; Push at one end, Pop at the other; 300B per message]
The Naïve Approach
• Push
  • Seek first row
  • INSERT row
• Pop
  • Seek last row
  • DELETE/OUTPUT
[Diagram: B-tree keyed Min…Max holding Msg rows; Push hits one end, Pop the other]
Why this doesn’t Scale!
[Diagram: Push/Pop threads queue EX and SH PAGELATCHes on the root, Min and Max pages (reached from HOBT_ROOT via the virtual root, taken SH under HOBT_VIRTUAL_ROOT), plus row locks (LCK); Prev/Next pointers chain the leaf pages]
B-Tree Root Pages
Summarising the Problem
• Hot stuff:
  • Root
  • Min page
  • Max page
  • Intermediate pages
  • Alloc/Dealloc
• BUT: We must have order!
Cooling it down
What if…
• Push
  • Seek first value page
  • UPDATE reference count (Msg++)
• Pop
  • Seek last value page
  • UPDATE reference count (Msg--)

[Diagram: fixed Min/Max pages; Push updates Msg++, Pop updates Msg--]
Dissipate the Heat
[Diagram: queue split by last digit of the sequence (Last Digit = 0, 1, 2, …); each partition has its own Min page (Msg--) and Max page (Msg++)]
Eliminating Thread Contention
Queue Structure
Ordered
Push: Sequence++
Pop: Sequence++

[Diagram: messages 4…8 in sequence order]
VERY fast!
Ring Buffers
Queue Structure
Ordered
Push: Sequence++ Mod 100
Pop: Sequence++ Mod 100

[Diagram: slots 4–8 holding messages 104–108]
Summing Up Message Queue Hack
• UPDATE instead of INSERT/DELETE
• More partitions = more B-trees
• Ring buffer using modulo
• Find sweet spot concurrency
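The ring-buffer idea can be sketched outside SQL: push and pop only advance sequence counters and map them onto a fixed set of slots with modulo, so nothing is ever allocated or split. A single-threaded C sketch with 100 slots as on the slide (the T-SQL version pre-creates the rows and uses UPDATE under row locks instead):

```c
#include <stdint.h>

/* Queue as a ring buffer: fixed slots, sequence counters, modulo
 * mapping. No page allocations, no B-tree splits. */
#define RING_SLOTS 100

typedef struct {
    uint64_t push_seq, pop_seq;     /* monotonically increasing */
    int      msg[RING_SLOTS];
} ring_t;

int ring_push(ring_t *r, int m)
{
    if (r->push_seq - r->pop_seq >= RING_SLOTS)
        return 0;                   /* full: producer must wait */
    r->msg[r->push_seq++ % RING_SLOTS] = m;
    return 1;
}

int ring_pop(ring_t *r, int *m)
{
    if (r->pop_seq == r->push_seq)
        return 0;                   /* empty */
    *m = r->msg[r->pop_seq++ % RING_SLOTS];
    return 1;
}
```

Slot count is the “sweet spot” knob from the summary: enough slots to absorb bursts, few enough that the working set stays hot in cache.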