
Page 1: Master   tuning

Thomas Kejser

[email protected]

http://blog.kejser.org

@thomaskejser

Super Scaling SQL Server

Diagnosing and Fixing Hard Problems

Page 2: Master   tuning

Thomas Kejser

• Formerly SQLCAT
• Tuning SQL Server since 6.5
• 15+ Years of database experience
• http://blog.kejser.org
• CTO Fusion-io Europe

Page 3: Master   tuning

Image(s): FreeDigitalPhotos.net

VS. VS.

Best Practice

Page 4: Master   tuning

Performance vs. Scalability

Response Time

Resource Use

Adding more of a HW resource makes things faster

You can scale without having performance (ex: HADOOP)

You can perform without having scalability (ex: In Memory Engines)

Page 5: Master   tuning

Our Reasonably Priced Server

• 2 Socket Xeon E3645
• 2 x 6 Cores
• 2.4 GHz
• NUMA enabled, HT off
• 12 GB RAM
• 1 ioDrive2 Duo
• 2.4TB Flash
• 4K formatted
• 64K AUS
• 1 Stripe
• Power Save Off
• Win 2008R2
• SQL 2012

Image Source: DeviantArt

Page 6: Master   tuning

Between disk and Memory

[Figure: four cores, each with its own L1 and L2 cache, sharing one L3 cache. Latency scale with ticks at 1ns, 10ns, 100ns, 10us, 100us and 10ms]

Page 7: Master   tuning

The “cache out curve”

Data Size

Throughput/thread

Cache Size

Service Time + Wait Time

Page 8: Master   tuning

NUMA Nodes

[Figure: two CPUs (NUMA nodes), each with cores (C), L2 caches and an L3 cache. A "Can I write?" request between them costs bus transfers in both directions]

Page 9: Master   tuning

There are several of these curves

Throughput

Touched Data Size

CPU Cache

TLB

NUMA Remote

Storage

Page 10: Master   tuning

Response time = Service Time + Wait Time

Algorithms and Data Structures

“Bottlenecks”

Page 11: Master   tuning

• DBA tasks
  • Installation of OS and SQL
  • Basic Memory Configuration
  • Basic Perfmon style monitoring
  • Backup/Restore and HA setup
• Basic reading of a Query Plan
• Basic understanding of database structures
• Adding Indexes to tables
• Running a Profiler trace

What you ALREADY know

Page 12: Master   tuning

Below the Surface

Page 13: Master   tuning

What we Need

• Free tools from MS
• Windows SDK
• In Win8: The “ADK”
• Need .NET 4 to install

Page 14: Master   tuning

Where Did the Time Go?

Service Time + Wait Time

xperf -on Base -f Base.etl

SELECT TOP 100000 *
FROM LINEITEM
INNER JOIN ORDERS ON O_ORDERKEY = L_ORDERKEY

SQLCMD -E -S. -i "Select.sql"

?

xperf -stop

Page 15: Master   tuning

DEMOBASE profile with xperf

Service Time + Wait Time

Page 16: Master   tuning

Right Click – Summary Table

Service Time + Wait Time

Page 17: Master   tuning

What exactly is SQLNCLI?

Service Time + Wait Time

Page 18: Master   tuning

Quantifying just how stupid XML is

SELECT TOP 1000000 *
FROM ORDERS
JOIN LINEITEM ON L_ORDERKEY = O_ORDERKEY
FOR XML RAW ('OUTPUT')

xperf -on Base -f Base.etl

With XML

“Native” Format

Page 19: Master   tuning

Which CPU cycles are Expensive?

“App” tier: Web Server Licensing, >3K USD Blades

Database Tier: Core Licensing, >10K USD

<XML> ?

Service Time + Wait Time

Page 20: Master   tuning

• What about the time INSIDE the process?

• What if the EXE won’t tell us?

Diving even Deeper

Page 21: Master   tuning

What is a Debug Symbol?

mov ax,10
mov bx,20
mov cx,3
push ax
push bx
push cx
call <address>

<address>
push bp
mov bp,sp
mov ax,[bp+8]
mov bx,[bp+6]
mov cx,[bp+4]
add ax,bx
div cx
mov dx,ax
ret

Header
doStuff(10,20,3)

int doStuff(int a, int b, int c) { return (a + b) / c; }

Compiles

Compiles

myProg.exe

Machine Code

<address> = doStuff

Symbol table

myProg.pdb

Symbol

Build

Service Time + Wait Time

Page 22: Master   tuning

Where do you get PDB files?

_NT_SYMBOL_PATH=SRV*C:\Symbols*http://msdl.microsoft.com/download/symbols

_NT_SYMCACHE_PATH=C:\SymCache

• Public Symbol Server

• Configure Environment

• Dbghelp.dll

Service Time + Wait Time

Page 23: Master   tuning

• Auto Generated by Visual Studio:

Your Own Debug Symbols

Service Time + Wait Time

Page 24: Master   tuning

• Symbols are indexed. Have to add them

Adding and Checking Your Symbols

cd Bin/x64/Release/
symstore add /f *.pdb /s C:/Symbols /t "MyExe"

• Validate that the Symbols can resolve

cd Bin/x64/Release/
symchk MyExe.exe /V

Page 25: Master   tuning

• Standard Xperf works fine for your own native code

• BUT: Before Windows 8, stack walking is broken for x64 .NET

• If you have .NET with 64 bit code, you must NGEN first:

Got .NET and x64?

ngen install Bin/x64/Release/MyExe.exe
(ngen lives here: %Windir%\Microsoft.NET\framework64\<Version>\Ngen.exe)

Service Time + Wait Time

Page 26: Master   tuning

• Free tool from MS:

.NET tracing is a pain, get a tool!

• Not to be confused with xperfview
• Same trace API and file format
• Helps set obscure .NET specific trace flags

Service Time + Wait Time

Page 27: Master   tuning

And Finally, You can do Very Cool Things

Did I tell you about interlocked operations?...

Whiteboard time!

Service Time + Wait Time

Page 28: Master   tuning

• Consider again our LINEITEM table

What is SQL Server REALLY doing?

• How expensive is it to read from that?• Think ETL code and DW/BI queries

CREATE TABLE LINEITEM (
    [L_ORDERKEY] [int] NOT NULL,
    [L_PARTKEY] [int] NOT NULL,
    [L_SUPPKEY] [int] NOT NULL,
    [L_LINENUMBER] [int] NOT NULL,
    [L_QUANTITY] [decimal](15, 2) NOT NULL,
    [L_EXTENDEDPRICE] [decimal](15, 2) NOT NULL,
    [L_DISCOUNT] [decimal](15, 2) NOT NULL,
    [L_TAX] [decimal](15, 2) NOT NULL,
    [L_RETURNFLAG] [char](1) NOT NULL,
    [L_LINESTATUS] [char](1) NOT NULL,
    [L_SHIPDATE] [date] NOT NULL,
    [L_COMMITDATE] [date] NOT NULL,
    [L_RECEIPTDATE] [date] NOT NULL,
    [L_SHIPINSTRUCT] [char](25) NOT NULL,
    [L_SHIPMODE] [char](10) NOT NULL,
    [L_COMMENT] [varchar](44) NOT NULL
)

[Figure: workload quadrants along two Small-to-Big axes: OLTP, BI/DW, Simulation, ETL]

Service Time + Wait Time

Page 29: Master   tuning

SQLCMD – Native code Test

SQLCMD.EXE

Where does the time go?

Service Time + Wait Time

Page 30: Master   tuning

Standard Reading of Data

xperf -on base -stackwalk profile -f stackwalk.etl

SQLCMD -S. -dSlam -E -Q"SELECT * FROM LINEITEM_tpch"

55sec

xperf -stop

xperf -merge stackwalk.etl stackwalkmerge.etl

?

Service Time + Wait Time

Page 31: Master   tuning

Details of the Time – Padding?

Service Time + Wait Time

Page 32: Master   tuning

More Details – Conversion Work?

Page 33: Master   tuning

An Educated guess about improvements

Before

CREATE TABLE [dbo].[LINEITEM](
    [L_ORDERKEY] [int] NOT NULL,
    [L_PARTKEY] [int] NOT NULL,
    [L_SUPPKEY] [int] NOT NULL,
    [L_LINENUMBER] [int] NOT NULL,
    [L_QUANTITY] [decimal](15, 2) NOT NULL,
    [L_EXTENDEDPRICE] [decimal](15, 2) NOT NULL,
    [L_DISCOUNT] [decimal](15, 2) NOT NULL,
    [L_TAX] [decimal](15, 2) NOT NULL,
    [L_RETURNFLAG] [char](1) NOT NULL,
    [L_LINESTATUS] [char](1) NOT NULL,
    [L_SHIPDATE] [date] NOT NULL,
    [L_COMMITDATE] [date] NOT NULL,
    [L_RECEIPTDATE] [date] NOT NULL,
    [L_SHIPINSTRUCT] [char](25) NOT NULL,
    [L_SHIPMODE] [char](10) NOT NULL,
    [L_COMMENT] [varchar](44) NOT NULL
)

After

CREATE TABLE [dbo].[LINEITEM_native](
    [L_ORDERKEY] [int] NOT NULL,
    [L_PARTKEY] [int] NOT NULL,
    [L_SUPPKEY] [int] NOT NULL,
    [L_LINENUMBER] [int] NOT NULL,
    [L_QUANTITY] money NOT NULL,
    [L_EXTENDEDPRICE] money NOT NULL,
    [L_DISCOUNT] money NOT NULL,
    [L_TAX] money NOT NULL,
    [L_RETURNFLAG] int NOT NULL,
    [L_LINESTATUS] int NOT NULL,
    [L_SHIPDATE] int NOT NULL,
    [L_COMMITDATE] int NOT NULL,
    [L_RECEIPTDATE] int NOT NULL,
    [L_SHIPINSTRUCT] [char](25) NOT NULL,
    [L_SHIPMODE] int NOT NULL,
    [L_COMMENT] char(44) NOT NULL
)

Service Time + Wait Time

Page 34: Master   tuning

Getting Rid of Useless Work

Additional parameters for SQLCMD:

-a32767 -W -s";" -f437

x1.5

Service Time + Wait Time

Page 35: Master   tuning

Unicode – 10% overhead?

Service Time + Wait Time

Page 36: Master   tuning

Let's try that with Native and Unicode …

x5

Service Time + Wait Time

Page 37: Master   tuning

• SQLNCLI is one of these in disguise
  • ODBC
  • OLEDB
• Pick good data types
  • MONEY over NUMERIC
  • UNICODE if data arrives like this
• Native protocols vs. flexibility

Summary Moving Data

Page 38: Master   tuning

• Get
  • Windows 8 ADK
  • Windows 7 SDK
• Set up Symbol Paths
• xperf -on Base
  • Standard trace for time, narrow to process and DLL/EXE
• xperf -on Base -stackwalk Profile
  • Get to the call stack, find the offending function(s)
• Ease of use for .NET: perfview.exe

Summary – Xperf

Service Time + Wait Time

Page 39: Master   tuning

Response time = Service Time + Wait Time

Page 40: Master   tuning

Introducing TPC-H

Service Time + Wait Time

Page 41: Master   tuning

Loop Join

m row result: 1, 4, 3, 1, 3

Find my match

n row B-tree

Log(n) reads

Complexity: O(m * log(n))

Service Time + Wait Time
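The loop join complexity can be sketched in Python. This is a minimal illustration and not SQL Server's implementation: a sorted list searched with `bisect` stands in for the n row B-tree, and the names `loop_join`, `outer_rows` and `index_keys` are invented for the example.

```python
from bisect import bisect_left

def loop_join(outer_rows, index_keys):
    """Join each of the m outer rows against a sorted list of n unique keys.

    The sorted list is a stand-in for a B-tree: each binary search costs
    about log2(n) comparisons, so the whole join is O(m * log(n)).
    """
    matches = []
    for key in outer_rows:                      # m iterations
        pos = bisect_left(index_keys, key)      # ~log2(n) steps per seek
        if pos < len(index_keys) and index_keys[pos] == key:
            matches.append(key)
    return matches

outer = [1, 4, 3, 1, 3]      # the m row result from the slide (unsorted)
index = [1, 2, 3, 5, 7]      # hypothetical n row index (sorted, unique)
print(loop_join(outer, index))
```

Note that the outer input does not need to be sorted; every row pays its own seek.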

Page 42: Master   tuning

Linked List Tree

Linked List vs. Tree

Service Time + Wait Time

[Figure: finding a value takes up to n steps in a linked list, but about Log2(n) steps in a balanced tree]

Page 43: Master   tuning

Cluster on O_ORDERKEY Index on O_ORDERKEY

Basic argument for Cluster Indexes

Service Time + Wait Time

CREATE UNIQUE CLUSTERED INDEX CIX_Key ON ORDERS_Cluster (O_ORDERKEY) WITH (FILLFACTOR = 100)

SELECT *
FROM ORDERS_Cluster
WHERE O_ORDERKEY = 3000000

CREATE UNIQUE INDEX IX_Key ON ORDERS_Heap (O_ORDERKEY) WITH (FILLFACTOR = 100)

SELECT *
FROM ORDERS_Heap
WHERE O_ORDERKEY = 3000000

Table 'ORDERS_Heap'. Scan count 0, logical reads 3 , physical reads 0, read-ahead reads 0

Table 'ORDERS_Cluster'. Scan count 0, logical reads 4 , physical reads 0, read-ahead reads 0

Page 44: Master   tuning

Cluster on O_ORDERKEY heap + Index on O_ORDERKEY

But what if we do this a lot?

CREATE INDEX IX_Customer ON ORDERS_Cluster (O_CUSTKEY)

WITH (FILLFACTOR = 100)

CREATE INDEX IX_Customer ON ORDERS_Heap (O_CUSTKEY)

WITH (FILLFACTOR = 100)

SELECT * FROM ORDERS_Heap
WHERE O_CUSTKEY = 47480

SELECT * FROM ORDERS_Cluster
WHERE O_CUSTKEY = 47480

Table 'ORDERS_Cluster'. Scan count 1, logical reads 27, physical reads 0

Table 'ORDERS_Heap'. Scan count 1, logical reads 11, physical reads 0

Service Time + Wait Time

Page 45: Master   tuning

How many LOOP joins/sec/core?

7 Sec

Service Time + Wait Time

Page 46: Master   tuning

What did we just measure?

Xperf –on Base –stackwalk profile

About 40%...

Service Time + Wait Time

Page 47: Master   tuning

• The query language itself

• Why so many ExecuteStmt?

• …With so much CPU use?

What is sqllang.dll?

Service Time + Wait Time

Page 48: Master   tuning

A different way to Measure Loops

1 Sec

Service Time + Wait Time

Page 49: Master   tuning

VS.

What does THAT look like?

Takeaway:

The T-SQL language itself is expensive

Service Time + Wait Time

Page 50: Master   tuning

• Sample from LINEITEM

• Force loop join with index seeks

• Do 1.4M seeks

Test: Singleton Row Fetch

Page 51: Master   tuning

Singleton seeks – Cost of compression

Compression      Seek (1.4M seeks)   CPU Load
None - Memory    13 sec              100% one core
PAGE - Memory    24 sec              100% one core
None - I/O       21 sec              100% one core
PAGE - I/O       32 sec              100% one core

Function                                              % Weight
CDRecord::LocateColumnInternal                        0.82%
DataAccessWrapper::DecompressColumnValue              0.47%
SearchInfo::CompareCompressedColumn                   0.28%
PageComprMgr::DecompressColumn                        0.24%
AnchorRecordCache::LocateColumn                       0.18%
ScalarCompression::AddPadding                         0.04%
ScalarCompression::Compare                            0.11%
Additional Runtime of GetNextRowValuesInternal        0.14%
Total Compression                                     2.28%
Total CPU (single core)                               8.33%
Compression %                                         27.00%

xperf -on base -stackwalk profile

Page 52: Master   tuning

Modern CPU

[Figure: a CPU with a shared 4MB L3 cache. Each core has a 32KB instruction cache, a 32KB data cache and a 256K unified L2 cache. The CPU connects to the bus]

Service Time + Wait Time

Page 53: Master   tuning

The B+ Tree

Service Time + Wait Time

B+ Tree

Page 54: Master   tuning

Hekaton Style “Loop”

Lookup Table

(hash)

Service Time + Wait Time

Page 55: Master   tuning

Merge Join

m row result (sorted): 1, 1, 2, 3

n row result (sorted): 1, 2, 3, 4

Complexity: O(m + n)

Service Time + Wait Time
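A merge join over two inputs that are already sorted can be sketched as follows. This is an illustrative Python version, not the engine's code; the function name `merge_join` is invented here. The O(m + n) cost holds when at least one side has unique keys; with duplicates on both sides the inner rewind adds work proportional to the output size.

```python
def merge_join(left, right):
    """Join two inputs that are ALREADY sorted on the join key."""
    out = []
    i = j = 0
    while i < len(left) and j < len(right):
        if left[i] < right[j]:
            i += 1                      # advance the side with the smaller key
        elif left[i] > right[j]:
            j += 1
        else:
            # Emit every left row with this key against every matching right row
            key = left[i]
            j_start = j
            while i < len(left) and left[i] == key:
                j = j_start             # rewind the right side for duplicates
                while j < len(right) and right[j] == key:
                    out.append((left[i], right[j]))
                    j += 1
                i += 1
    return out

m_rows = [1, 1, 2, 3]   # sorted m row result from the slide
n_rows = [1, 2, 3, 4]   # sorted n row result from the slide
print(merge_join(m_rows, n_rows))
```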

Page 56: Master   tuning

Merge Join – What is Fastest?

Service Time + Wait Time

SELECT MAX(L_PARTKEY), MAX(O_ORDERDATE)
FROM LINEITEM
INNER MERGE JOIN ORDERS ON O_ORDERKEY = L_ORDERKEY

…or

SELECT MAX(L_PARTKEY), MAX(O_ORDERDATE)
FROM ORDERS
INNER MERGE JOIN LINEITEM ON O_ORDERKEY = L_ORDERKEY

Page 57: Master   tuning

Comparing the Query Plans

Service Time + Wait Time

Page 58: Master   tuning

Digging in Deeper

Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'ORDERS'. Scan count 1, logical reads 22162, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'LINEITEM'. Scan count 1, logical reads 104522, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
SQL Server Execution Times: CPU time = 3265 ms, elapsed time = 3357 ms.

Table 'LINEITEM'. Scan count 1, logical reads 104522, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'ORDERS'. Scan count 1, logical reads 22162, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
SQL Server Execution Times: CPU time = 2469 ms, elapsed time = 2607 ms.

Service Time + Wait Time

Page 59: Master   tuning

We can beat SQL Server at this game

SELECT MAX(O_ORDERDATE), MAX(MAX_P)
FROM (SELECT L_ORDERKEY, MAX(L_PARTKEY) AS MAX_P
      FROM LINEITEM
      GROUP BY L_ORDERKEY) b
INNER MERGE JOIN ORDERS
ON O_ORDERKEY = b.L_ORDERKEY

Service Time + Wait Time

Page 60: Master   tuning

Hash Join

m row result: 1, 4, 3, 1, 3

n row join table

Hash(1)

n row hash table

Find my match

Complexity: O(m + 2n)

Service Time + Wait Time
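The build and probe phases of a hash join can be sketched in Python. This is a simplified illustration with invented names (`hash_join`, `build_rows`, `probe_rows`); a Python dict stands in for the hash table, and memory grants and spills are ignored entirely.

```python
from collections import defaultdict

def hash_join(build_rows, probe_rows):
    """Build a hash table on one input, then probe it with the other."""
    table = defaultdict(list)
    for key in build_rows:              # build phase: one pass over the n rows
        table[key].append(key)
    out = []
    for key in probe_rows:              # probe phase: one pass over the m rows
        for match in table.get(key, ()):
            out.append((key, match))
    return out

n_rows = [1, 2, 3, 5, 7]     # hypothetical n row join table (build side)
m_rows = [1, 4, 3, 1, 3]     # m row result from the slide (probe side)
print(hash_join(n_rows, m_rows))
```

Each lookup is constant time on average, which is why the cost grows with m + n rather than m * log(n), as long as the hash table fits in memory.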

Page 61: Master   tuning

When Hash Joins hurt you

Service Time + Wait Time

[Chart: Runtime (seconds, 0 to 30) vs. Hash Memory (MB, 0 to 400), with a region marked “Spill Zone!”]

Page 62: Master   tuning

Hash Joins Don’t Scale in MSSQL

Page 63: Master   tuning

The Bottleneck Curve

Page 64: Master   tuning

ACCESS_METHODS_DATASET_PARENT:

“Used to synchronize child dataset access to the parent dataset during parallel operations.”

Books Online Story…

Image: FreeDigitalPhotos.net

Page 65: Master   tuning

Using XPERF to find documentation

xperf -on base+cswitch+dispatcher -stackwalk profile+readythread+cswitch

Page 66: Master   tuning

Let's dig in…

xperf -on base -stackwalk profile -f stackwalk.etl

Page 67: Master   tuning

What LATCH pattern do we see?

GetNextRangeForChildScan

Inside: TableScanNew

Page 68: Master   tuning

• Partition the table by a “random” value

• Modulo the Key for example

• Use SQL Server partition function/scheme

The Fix?…

[Figure: keys hashed into partitions 0 through 255]
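The "modulo the key" idea can be sketched outside the database. The Python below is only an illustration of the distribution, assuming integer keys and 256 partitions as on the slide; the names `PARTITIONS` and `partition_of` are invented, and the actual fix would be expressed with a SQL Server partition function and scheme.

```python
from collections import Counter

PARTITIONS = 256  # matches the 0..255 buckets on the slide

def partition_of(key: int) -> int:
    """Map an integer key to a partition number by taking it modulo 256."""
    return key % PARTITIONS

# Sequential keys spread evenly over all partitions, so parallel scanners
# are less likely to queue up behind one shared range.
counts = Counter(partition_of(k) for k in range(1_000_000))
print(len(counts), min(counts.values()), max(counts.values()))
```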

Page 69: Master   tuning

Closer…

Page 70: Master   tuning

…But no Cigar

Page 71: Master   tuning

What is the Problem here?

Page 72: Master   tuning

Anti Scale Patterns

Page 73: Master   tuning

CPU Caches

[Chart: series Random Pages, Sequential Pages and Single Page. X axis: Size of Accessed memory (MB). Y axis ticks from 0 to 1,000,000,000]

Service Time + Wait Time

Page 74: Master   tuning

Goals:
• Compressed
• Prefetch Friendly
• Cache Resident Code

Example, Column Stores

ID Value

1 Beer

2 Beer

3 Vodka

4 Whiskey

5 Whiskey

6 Vodka

7 Vodka

ID Customer

1 Thomas

2 Thomas

3 Thomas

4 Christian

5 Christian

6 Alexei

7 Alexei

Product Customer

ID Date

1 2011-11-25

2 2011-11-25

3 2011-11-25

4 2011-11-25

5 2011-11-25

6 2011-11-25

7 2011-11-25

Date

ID Sale

1 2 GBP

2 2 GBP

3 10 GBP

4 5 GBP

5 5 GBP

6 10 GBP

7 10 GBP

Sale

Service Time + Wait Time

Page 75: Master   tuning

Compression is Easy

ID Value

1-2 Beer

3 Vodka

4-5 Whiskey

6-7 Vodka

ID Customer

1-3 Thomas

4-5 Christian

6-7 Alexei

Product’ Customer’

ID Date

1-7 2011-11-25

Date’

ID Sale

1-2 2 GBP

3 10 GBP

4-5 5 GBP

6-7 10 GBP

Sale’

RL Value

2 Beer

1 Vodka

2 Whiskey

2 Vodka

RL Customer

3 Thomas

2 Christian

2 Alexei

Product’ Customer’

RL Date

7 2011-11-25

Date’

RL Sale

2 2 GBP

1 10 GBP

2 5 GBP

2 10 GBP

Sale’

Service Time + Wait Time
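The run-length (RL) encoding of the Product column can be reproduced in a few lines of Python. This is a sketch of the idea only, not of any column store's on-disk format; `rle_encode` and `rle_decode` are names invented for the example.

```python
from itertools import groupby

def rle_encode(column):
    """Collapse runs of equal adjacent values into (run_length, value) pairs."""
    return [(len(list(run)), value) for value, run in groupby(column)]

def rle_decode(pairs):
    """Expand (run_length, value) pairs back into the original column."""
    return [value for run_length, value in pairs for _ in range(run_length)]

product = ["Beer", "Beer", "Vodka", "Whiskey", "Whiskey", "Vodka", "Vodka"]
encoded = rle_encode(product)
print(encoded)
# [(2, 'Beer'), (1, 'Vodka'), (2, 'Whiskey'), (2, 'Vodka')]
```

Only adjacent repeats collapse, which is why Vodka appears twice in the encoded column and why sort order matters so much for compression.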

Page 76: Master   tuning

Squeezing it even more

RL Value

2 Beer

1 Vodka

2 Whiskey

2 Vodka

Product’

RL Value

2 1

1 2

2 3

2 2

Product’

Beer = 1
Vodka = 2
Whiskey = 3

ID Value

1-2 Beer

3-3 Vodka

4-5 Whiskey

6-7 Vodka

Product’

4+4+4+2 = 14B
+ 4+4+5+2 = 15B
+ 4+4+7+2 = 17B
+ 4+4+5+2 = 15B
= 61B

4+4+2 = 10B
+ 4+5+2 = 11B
+ 4+7+2 = 13B
+ 4+5+2 = 11B
= 45B

4+4 = 8B
+ 4+4 = 8B
+ 4+4 = 8B
+ 4+4 = 8B
= 32B

RL Value

2 0x01

1 0x10

2 0x11

2 0x10

Product’

4 = 4B
+ 4 = 4B
+ 4 = 4B
+ 4 = 4B
+ 4 x 2b = 2B
= 18B

Service Time + Wait Time
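Dictionary encoding, where strings are replaced by small integer codes, can be sketched like this. The code is illustrative Python with invented names (`dictionary_encode`, `product_rle`); it assigns codes in order of first appearance, which happens to reproduce Beer = 1, Vodka = 2, Whiskey = 3.

```python
def dictionary_encode(rle_pairs):
    """Replace each value in (run_length, value) pairs with an integer code."""
    dictionary = {}
    encoded = []
    for run_length, value in rle_pairs:
        if value not in dictionary:
            dictionary[value] = len(dictionary) + 1   # next unused code
        encoded.append((run_length, dictionary[value]))
    return dictionary, encoded

product_rle = [(2, "Beer"), (1, "Vodka"), (2, "Whiskey"), (2, "Vodka")]
dictionary, encoded = dictionary_encode(product_rle)

# Three distinct codes fit in 2 bits, which is the "2b" per value on the slide
bits_per_code = max(dictionary.values()).bit_length()
print(dictionary, encoded, bits_per_code)
```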

Page 77: Master   tuning

RL Value

2 Beer

1 Vodka

2 Whiskey

2 Vodka

RL Customer

3 Thomas

2 Christian

2 Alexei

Product’ Customer’

SELECT Product, Customer FROM Table

2 steps with Beer, 2 steps with Thomas:
Beer Thomas
Beer Thomas

1 step with Vodka, 1 step with Thomas:
Vodka Thomas

2 steps with Whiskey, 2 steps with Christian:
Whiskey Christian
Whiskey Christian

2 steps with Vodka (Note: Repeated value), 2 steps with Alexei:
Vodka Alexei
Vodka Alexei

Service Time + Wait Time

Page 78: Master   tuning

Hash Joining with Column Stores

RL Key

2 Beer

1 Vodka

2 Whiskey

2 Vodka

Table

Key Type

Beer Soft

Vodka Strong

Whiskey Strong

Vodka Strong

Dim Product

SELECT …
FROM Table
JOIN DimProduct ON Key
WHERE Type = 'Strong'

1. Compute bloom filter of Keys belonging to 'Strong'
2. Read RL = 2, Beer from Table
3. Compute bloom value of Beer
4. Equal to filter value from 1? No. Do nothing
5. Compute bloom value for Vodka
6. Equal to filter value from 1? Yes. Output one row (RL=1)
7. Compute bloom value for Whiskey
8. Equal to filter value from 1? Yes. Output two rows (RL=2)

Can prefetch data (needs RLE)

Can calculate match/no match using only local CPU cache

Won't work for OLTP!

Service Time + Wait Time
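A Bloom filter probe over a run-length encoded column can be sketched as below. Everything here is an assumption made for illustration: the `BloomFilter` class, its 64-bit size, the SHA-256 based hashing and the `filtered_rows` helper are invented and say nothing about how SQL Server builds its bitmaps. Because a Bloom filter can give false positives, the sketch re-checks each surviving key against the dimension.

```python
import hashlib

class BloomFilter:
    """Tiny Bloom filter: may report false positives, never false negatives."""

    def __init__(self, size_bits=64, hashes=3):
        self.size_bits = size_bits
        self.hashes = hashes
        self.bits = 0  # bit array packed into one integer

    def _positions(self, item):
        for seed in range(self.hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits |= 1 << pos

    def might_contain(self, item):
        return all((self.bits >> pos) & 1 for pos in self._positions(item))

dim_product = {"Beer": "Soft", "Vodka": "Strong", "Whiskey": "Strong"}
fact_rle = [(2, "Beer"), (1, "Vodka"), (2, "Whiskey"), (2, "Vodka")]

strong = BloomFilter()
for key, kind in dim_product.items():
    if kind == "Strong":
        strong.add(key)

def filtered_rows(rle_column, bloom, dim, wanted):
    rows = []
    for run_length, key in rle_column:
        if not bloom.might_contain(key):   # cheap test, one per run
            continue
        if dim[key] == wanted:             # exact check removes false positives
            rows.extend([key] * run_length)
    return rows

result = filtered_rows(fact_rle, strong, dim_product, "Strong")
print(result)
```

The filter is tested once per run rather than once per row, which is where RLE and the Bloom filter reinforce each other.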

Page 79: Master   tuning

Why is it so hard to get joins right?

n

m

Time

Loop Join

Merge Join

Hash Join

Service Time + Wait Time

Page 80: Master   tuning

Desired Join     Join Hint                                          Query Hint
LOOP             [INNER | LEFT | CROSS | FULL] LOOP JOIN            OPTION (LOOP JOIN)
MERGE            [INNER | LEFT | CROSS | FULL] MERGE JOIN           OPTION (MERGE JOIN)
HASH             [INNER | LEFT | CROSS | FULL] HASH JOIN            OPTION (HASH JOIN)
LOOP with Seek   WITH FORCESEEK, WITH ( INDEX (index = <name>) )    N/A

Controlling Joins

Note: Join hints force the order of the ENTIRE join tree!

Service Time + Wait Time

Page 81: Master   tuning

What Type of Workload?

[Figure: workload quadrants by Data Touched and Data Returned (Small to Big on each axis): OLTP, BI/DW, Simulation, ETL]

Service Time + Wait Time

Page 82: Master   tuning

How to Classify?

OLTP BI/DW

Simulation ETL

Full Scan/sec, Range Scans/sec, Probe Scans/sec

Index Search/sec, Range Scans/sec

Full Scan/sec, Range Scans/sec

Bulk Copy Rows/sec?

Page 83: Master   tuning

There should ALWAYS be a fully indexed path to the data.

OLTP System Basic Query Pattern

BigSmall

Small

Big

OLTP BI/DW

Simulation ETL

Service Time + Wait Time

Page 84: Master   tuning

1. Find worst CPU consuming query with sys.dm_exec_query_stats
2. Add OPTION (LOOP JOIN) to offending query
3. Check estimated query plan
4. If table spool found: add index to remedy and GOTO 3
5. Happy? If not, GOTO 1

The Super Quick OLTP Tuning Guide

Service Time + Wait Time

Page 85: Master   tuning

The query will not be (much) worse than a full scan of a fact partition

DW/BI System Basic Query Pattern

BigSmall

Small

Big

OLTP BI/DW

Simulation ETL

Service Time + Wait Time

Page 86: Master   tuning

1. Find offending query
2. Add OPTION (HASH JOIN) to query
3. Do dimension tables have an indexed path to build hash? If not, add index
4. Do you get a fact table scan and hash build of all dimensions? If not, check statistics (especially on facts and skewed)
5. Optimize Fact table scans
   1. Partition and partition elimination
   2. Column store if you have it
   3. Aggregate Views
   4. Bitmap index pushdown (statistics!)
   5. Composite indexes (last resort!)

The Super Quick DW tuning Guide

Service Time + Wait Time

Page 87: Master   tuning

The expected DW Query Plan

[Figure: expected plan operators: Fact CSI Scan, Partial Aggregate, Dim Scan, Dim Seek, Batch Build (x2), Hash Join (x2), Hash, Stream Aggregate]

Page 88: Master   tuning

• At least enough RAM to hold the hash tables of the largest dimension

• De-normalisation helps… a LOT
  • Especially for the large/large joins
• Likely: need to scan fast from disk if RAM is not big enough to hold the fact
  • Compression REALLY matters

Things that Follow from desired DW Plan

Service Time + Wait Time

Page 89: Master   tuning

Coffee Break

Page 90: Master   tuning

Response time = Service Time + Wait Time

Page 91: Master   tuning

Where EVERY Server wide diagnosis starts

SELECT * FROM sys.dm_os_wait_stats
WHERE wait_type NOT IN (SELECT wait_type FROM #ignorewaits)
AND waiting_tasks_count > 0
ORDER BY wait_time_ms DESC

Service Time + Wait Time

Page 92: Master   tuning

• Shows up as waits for PAGEIOLATCH
• You can dig into details with:

Common Problems - PAGEIO

Service Time + Wait Time

SELECT * FROM sys.dm_io_virtual_file_stats(DB_ID(), NULL)

• Can also Xevent your way to it per query

CREATE EVENT SESSION [TraceIO] ON SERVER ADD EVENT sqlserver.file_read_completed(ACTION (sqlserver.database_id,sqlserver.session_id))

Page 93: Master   tuning

• I/O, like memory, is a GLOBAL resource for the machine

• When does it make sense to partition a global resource?
  • When you deeply know the workload
  • When the workload is ALREADY partitioned
• When neither of those are true: DON’T partition

• If you have NAND/SSD – Why bother?

The general I/O Guidance

Service Time + Wait Time

Page 94: Master   tuning

A good way to Think of Spindle I/O

DRAM

Page 95: Master   tuning

[Figure: JBOD vs. SAME. JBOD: three LUNs, each serving its own sequential (Seq.) stream. SAME: a RAID system with one Large LUN, where three sequential streams arrive as RANDOM I/O]

Service Time + Wait Time

Page 96: Master   tuning

Stripe vs. Concatenation

RAID 10 RAID 10

Concatenated LUN

RAID 10 RAID 10

Striped LUN

Service Time + Wait Time

Page 97: Master   tuning

OLTP

• One big SAME setup
  • Data files
  • Tempdb
• Dedicate
  • Transaction log
• DRAM:
  • Enough to hold most of DB

Data Warehouse

• JBOD setup
  • Data Files
  • 1-2 per LUN
• SAME setup
  • Tempdb
• Dedicate
  • Transaction Log
• DRAM:
  • Enough to hold largest partition of largest table

Rules of Thumb – Spindle I/O and DRAM

Service Time + Wait Time

Page 98: Master   tuning

• Short Stroking

• Elevator Sort

• Sequential vs. Random

• Weaving

You can do a bit better… or worse

Service Time + Wait Time

Page 99: Master   tuning

• Intentionally use lower % of total space

• Tradeoff:
  • Space for Speed
• Test:
  • 15K rpm
  • SAS spindle
  • 300GB

Short Stroking Disks

[Chart: IOPS (150 to 400) vs. % Capacity Used (0% to 100%)]

Service Time + Wait Time

Page 100: Master   tuning

Full Stroked Short Stroked

Why does Short Stroking Work?

Disks are typically consumed “from the outside in”. If partitions don’t use the full disk size, the disk won’t use the full platter either. The result: less head movement

Service Time + Wait Time

Page 101: Master   tuning

Adding Elevator Sorting

[Chart: 8K random I/O. IOPS, Avg. Latency and Max Latency for Full Stroke Random, Outer Short, Inner Short, Elevator Sort and Elevator Short Stroked]

Bat powered disk!

Page 102: Master   tuning

Why Chase Sequential I/O?

[Chart: 8K Block Pattern. IOPS (log scale, 100 to 100,000), Avg Latency and Max Latency (ms, 0 to 80) for Sequential vs. Full Stroke Random]

Service Time + Wait Time

Page 103: Master   tuning

• One SATA disk

• Two partitions

• One file on each

• Sequential read on each file

But all is not well!

File1 File2

Service Time + Wait Time

Page 104: Master   tuning

I/O Weaving in action

[Chart: IOPS and Avg Latency (ms) for 64K Random vs. 64K Dual Sequential]

Source: Michael Anderson

Service Time + Wait Time

Page 105: Master   tuning

Storage Pool and Weaving

[Figure: three Data + Log pairs, each a mix of sequential (Seq) and random (Ran) I/O, placed on one pool labelled “Massive, then Provisioned Pool”. Combined result: RANDOM!]

Service Time + Wait Time

Page 106: Master   tuning

The SAN will properly handle Sharing!

Green: Checkpoint, Red: tx/sec, Black: Disk Latency

The “cache out curve”

Service Time + Wait Time

Page 107: Master   tuning

Numbers to Remember - Spindles

Characteristic            Typical Units
Throughput / Bandwidth    90-125MB/sec, but ONLY if sequential access!
Operations per Sec        10K RPM Spindle: 100-130 IOPS
                          15K RPM Spindle: 150-180 IOPS
                          Can get about 2x if short stroking (more later)
Latency                   3-5ms (compare DRAM: 100ns)
Capacity                  100s of GB to single digit TB

2012 numbers, will change in future

Service Time + Wait Time
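The IOPS figures for spindles can be sanity checked with simple arithmetic: a random I/O costs roughly one average seek plus half a rotation. The Python below is a back-of-envelope sketch; the function names are invented and the average seek times (4.5 ms for 10K RPM, 3.5 ms for 15K RPM) are assumed values for illustration, not measurements from this deck.

```python
def rotational_latency_ms(rpm):
    """Average rotational latency: half a revolution, in milliseconds."""
    return 0.5 * 60_000 / rpm

def estimated_iops(rpm, avg_seek_ms):
    """Random IOPS if every I/O pays one average seek plus half a rotation."""
    return 1000 / (avg_seek_ms + rotational_latency_ms(rpm))

# Seek times below are assumptions, chosen as plausible spec sheet values
for rpm, seek_ms in ((10_000, 4.5), (15_000, 3.5)):
    print(f"{rpm} RPM: ~{estimated_iops(rpm, seek_ms):.0f} random IOPS")
```

Shorter seeks raise the estimate, which is the mechanism behind short stroking.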

Page 108: Master   tuning

• Few hundreds of IOPS

• Faster if short stroked

• Trade latency for speed with elevator sort

• Sequential is hard to get right

Summary so far.. Single Disk

Service Time + Wait Time

Page 109: Master   tuning

• Wider Stripes neat
  • But scale not linear
• Very deep queues help
  • But add latency
• Shared Components

Why does a big RAID pile not solve this?

Service Time + Wait Time

Page 110: Master   tuning

RAID Scale?

Your Mileage WILL vary with the hardware

Page 111: Master   tuning

Before After

Getting rid of Sharing

[Figure: the I/O path Before and After: four HBAs, Switch, Storage Ports, Switch, LUNs, Cache, Disk and CPU. The After side is marked x2]

Page 112: Master   tuning

NAND Flash Basics

[Figure: a NAND cell with Control Gate, Floating Gate (electrons trapped), Oxide Layer and an N-P-N substrate. Labels for the larger structure: Pages (4K each), Blocks, NAND Die, Pack]

Page 113: Master   tuning

NAND Flash Problems

• Erase Cycles
  • Around 100K
  • Rebalancing and reclaim/trim
• Voltage measurement
  • Gets worse with density
  • Changes over time
  • Depends on how you program
• Bit Rot
  • Must refresh even on read
• SLC easier to manage than MLC
  • But much more expensive!

[Figure: Voltage levels mapped to the bit patterns 00, 01, 10 and 11]

Page 114: Master   tuning

Lessons Learned: Try to Avoid Sharing

BAD BETTER BEST

Service Time + Wait Time

Page 115: Master   tuning

The Network

Page 116: Master   tuning

• Only partially diagnosed as waits in sys.dm_os_wait_stats

• Task Manager gives a bit more information

• Need: transparency to the deep level latencies and packets!

Common Problems: ASYNC_NETWORK, OLEDB

Service Time + Wait Time

Page 117: Master   tuning

A common Wait Type

The database is really slow! The code takes forever to run!

Service Time + Wait Time

Page 118: Master   tuning

• We may not always have insight into what is going on at the client…

Xperf Diagnosing the Network

xperf -on latency+network

Summary Table

Service Time + Wait Time

Page 119: Master   tuning

Timeline of the network Traffic

Page 120: Master   tuning

ASYNC_NETWORK_IO, the typical issue

Service Time + Wait Time

Page 121: Master   tuning

Handling network is EXPENSIVE

xperf -on latency

?

Service Time + Wait Time

Page 122: Master   tuning

Short Story on DPC/ISR handling

[Figure: a CPU (cores, L1-L3 cache) connected over the PCI bus to the NIC]

NIC: Work Done, raises IRQ

Core: HALT execution, fire ISR routine

if (my interrupt) { <Mark Handled>; Queue DPC }

DPC: <Do work needed>, <Wake Application>

Core can run other stuff again

Service Time + Wait Time

Page 123: Master   tuning

It looks like this…

DPCISR

Service Time + Wait Time

Page 124: Master   tuning

• Option 1: Use the HW vendor's tool
• Option 2: Use Interrupt Affinity Policy Tool from MS

Setting Interrupt Affinity

Service Time + Wait Time

Page 125: Master   tuning

• Standard Payload Network (MTU):
  • 1500 B
• Jumbo Frames
  • 9014 B (MTU)

Jumbo Frame and SQL Packets

• Standard SQL payload
  • 4096 B
• Largest
  • 32767 B

SELECT session_id, net_packet_size
FROM sys.dm_exec_connections

Server=foo;Packet size=32767

Service Time + Wait Time
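How many network frames one SQL packet turns into can be estimated with a ceiling division. This Python sketch is a simplification: it assumes 40 bytes of TCP/IPv4 headers per frame with no options, takes the MTU values exactly as listed on the slide, and ignores everything else on the wire. The name `frames_per_packet` is invented for the example.

```python
import math

def frames_per_packet(sql_packet_bytes, mtu_bytes, header_bytes=40):
    """Frames needed to carry one SQL packet, given the MTU and header overhead."""
    payload_per_frame = mtu_bytes - header_bytes
    return math.ceil(sql_packet_bytes / payload_per_frame)

for mtu in (1500, 9014):
    for packet in (4096, 32767):
        print(f"MTU {mtu}, packet size {packet}: "
              f"{frames_per_packet(packet, mtu)} frame(s)")
```

Fewer frames per packet means fewer interrupts to service, which ties back to the DPC/ISR cost discussed earlier.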

Page 126: Master   tuning

Single Threaded

Page 127: Master   tuning

Core Evolution

Moore’s “Law”:

“The number of transistors per square inch on integrated circuits has doubled every two years since the integrated circuit was invented”

Page 128: Master   tuning

• Never faster than a single core
  • Smaller servers are faster than bigger ones
  • Large L2 caches and more clock speed help
• The algorithm dictates speed
  • Latency of Wait Time sets upper limit
• Examples from MSSQL land:
  • Formula Engine in MSAS
  • Transaction Log Writes
  • INSERT/UPDATE/DELETE (as we shall see)

Single Threaded

Page 129: Master   tuning

VLF files

• When switching to new VLF, it has to be “formatted” with 8K sync write
• While this happens, transactions are blocked
• Too many VLF = Too much blocking
• Lesson: Preallocate the database log file in big chunks
• Up to 128 Log Buffers per database
• Spawned on demand, will not be released once spawned
• Transactions will wait for LOGBUFFER if no buffer is available
• Think of this like a pipeline of commits waiting…

[Figure: VLF(1) through VLF(6), each formatted with an 8K write. Log buffers of <=60K, x 128]

Page 130: Master   tuning

Transaction Log Background

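[Figure: threads T0 … Tn take LOGCACHEACCESS to get a Buffer Offset (cache line), Alloc Slot in Buffer and MemCpy Slot Content into the log buffers (Slot 1 … Slot 127, wait: LOGBUFFER). Buffers pass through the Writer Queue (LOGFLUSHQ) to the Log Writer, which writes via an Async I/O Completion Port (wait: WRITELOG) and then signals the thread which issued the commit]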

Page 131: Master   tuning

• Speed is determined by Latency and Code Path

• Max Log Write Size: 60K

Zooming to the Log Writer

Log Writer

Async I/O Completion Port

Signal thread which issued commit

Latency

Writer Queue

Page 132: Master   tuning

Long Distance Replication…

[Figure: the Primary writes a Log Entry and sends the log over the Network; the Secondary writes the Log Entry and acks the log]

Executive Summary: The speed of light (c) is not fast enough!
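The speed of light limit can be put into numbers. The sketch below assumes light in fiber travels at roughly two thirds of c, that the cable runs in a straight line, and that each commit waits for one full send and ack round trip before the next one starts; all three are simplifying assumptions, and the function names are invented.

```python
SPEED_OF_LIGHT_KM_S = 299_792.458   # in vacuum
FIBER_FACTOR = 2 / 3                # assumed: light in glass is roughly 2/3 c

def min_round_trip_ms(distance_km):
    """Lower bound on send + ack time over fiber, ignoring all equipment."""
    one_way_s = distance_km / (SPEED_OF_LIGHT_KM_S * FIBER_FACTOR)
    return 2 * one_way_s * 1000

def max_serial_commits_per_sec(distance_km):
    """Upper bound if every commit waits for its own round trip."""
    return 1000 / min_round_trip_ms(distance_km)

for km in (100, 1000, 5000):
    print(f"{km} km: >= {min_round_trip_ms(km):.2f} ms round trip, "
          f"<= {max_serial_commits_per_sec(km):.0f} serial commits/sec")
```

Real links add switch, router and storage latency on top, so measured numbers will be worse than this bound.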

Page 133: Master   tuning

• Perfmon will only show millisec
• What if we want microseconds?

Getting to the Real Latency

xperf -on latency

Page 134: Master   tuning

It’s in Memory, so it must be fast?

VS.

Latency: 15-30us Latency: <5us

RAM DISK

1.5sec 1.5sec

Page 135: Master   tuning

No, Because…

This adds up to one core… it is doing all it can with the CPU it has

Page 136: Master   tuning

The Effect on UPDATE

Naïve: UPDATE MyBigTable SET c6 = 43

Parallel: UPDATE MyBigTable SET c6 = 43 WHERE key BETWEEN 10**9 * n AND 10**9 * (n+1) - 1

Runtime (smaller is faster)

Page 137: Master   tuning

Multi Threaded

Page 138: Master   tuning

What is Scalable?

[Chart: Throughput (0 to 3000) vs. Some Hardware Resource (0 to 24), with curves labelled Good, So so and Bad, and the annotation “We want to live here”]

Page 139: Master   tuning

Amdahl’s Law of gated speedup

[Chart: Speedup Factor (1 to 31) vs. Number of cores (0 to 64), with curves for P = 100%, P = 95%, P = 90% and P = 80%]

P = Part of program that can be made Parallel (Note that this may be 0... or 1)

N = Number of CPU cores available

$$\text{Speedup} = \frac{1}{(1 - P) + \frac{P}{N}}$$
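Amdahl's Law in its usual form, speedup = 1 / ((1 - P) + P / N), is easy to evaluate for the P values on the chart. The Python below is a small sketch; `amdahl_speedup` is a name chosen for this example.

```python
def amdahl_speedup(p, n):
    """Speedup on n cores when a fraction p of the work can run in parallel."""
    return 1.0 / ((1.0 - p) + p / n)

for p in (1.00, 0.95, 0.90, 0.80):
    print(f"P = {p:.0%}: {amdahl_speedup(p, 64):.1f}x on 64 cores")
```

Even with 95% of the work parallel, 64 cores give only about a 15x speedup, because the serial 5% dominates.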

Page 140: Master   tuning

Introducing Contention – Locks

Table A

Table B

Table C

INSERT TableA …

INSERT TableB …

INSERT TableC …

[Figure: each INSERT takes locks (LCK) on its table]

Wait Stat: LCK_<X>

Page 141: Master   tuning

But those rows have to be stored…

[Figure: Table A, Table B and Table C with their locks (LCK), all stored in one DataFile in one FileGroup]

Page 142: Master   tuning

It all Starts with Wait Stats

SELECT * FROM sys.dm_os_wait_stats
WHERE wait_type NOT IN (SELECT wait_type FROM #ignorewaits)
AND waiting_tasks_count > 0
ORDER BY wait_time_ms DESC

DBCC PAGE

Page 143: Master   tuning

PFS – Hidden Single Page Contention

[Figure: a Data File with GAM/SGAM pages and a PFS page for every 64MB. Each 8K PFS page tracks blocks (B) with bits such as 10010010. INSERT TableA … takes a PAGELATCH to change the Allocated bit]

Page 144: Master   tuning

More Files

[Figure: Table A, Table B and Table C with their locks (LCK), now spread over four DataFiles in one FileGroup]

• Round Robin between files
• More files, more structures
• No affinity

Page 145: Master   tuning

How many more Files?

[Chart: runtime (260 to 400) and PAGELATCH_UP waits (logarithmic scale, 100 to 10,000,000) against # data files (0 to 48).]

Page 146: Master   tuning

Concurrency: What we learned so far

• Shared, physical MEMORY structures can cause bottlenecks (ex: PFS)
  • SQL Server must sync too…

• Understanding where a structure resides leads to the tuning fix
  • Theory of engine!

Page 147: Master   tuning

CXPACKET

• Commonly misdiagnosed

• CXPACKET does NOT (always) mean that your DOP is “too high”

[Chart: CXPACKET waits / Throughput. CXPACKET waits (0 to 200,000,000) against throughput (MB/sec, 10.0 to 40.0).]

[Chart: Throughput / DOP. Throughput (MB/sec, 0.0 to 50.0) against DOP (1 to 46).]

Page 148: Master   tuning

CXPACKET = Issue may be elsewhere…

Page 149: Master   tuning

Diagnosing Latches

• What happens when you get things like:

LATCH_<x>
PAGELATCH_<x>

Step 1: Dig into:

SELECT * FROM sys.dm_os_latch_stats

Service Time + Wait Time

Page 150: Master   tuning

Digging into Latches Again…

Page 151: Master   tuning
Page 152: Master   tuning

Zooming into the Ready Thread

Page 153: Master   tuning

Post Fix Pattern

GetNextRangeForChildScan

GetNextRangeForChildScan

GetNextRangeForChildScan

Page 154: Master   tuning

Speedup with Hash Partition of Heap

• Before: 6GB/sec

• After: 20GB/sec

• This sometimes works on clustered indexes too…

…Whiteboard
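The idea behind hash partitioning can be sketched in a few lines of Python. The deck does not show which hash function was used, so the modulo hash, the eight partitions and the function names below are assumptions.

```python
from collections import Counter


def hash_partition(key, partitions):
    """Map a key to a partition number with a simple modulo hash."""
    return key % partitions


def spread(keys, partitions):
    """Count how many keys land in each partition."""
    return Counter(hash_partition(k, partitions) for k in keys)


# 1,000 ever-increasing keys, the kind a sequence or counter would produce
counts = spread(range(1, 1001), 8)
print(sorted(counts.items()))
```

Consecutive keys land in different partitions, so concurrent threads stop queueing on the same structure.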

Page 155: Master   tuning

UPDATE Hotspot

Page (8K)

ROW

ROW

ROW

LCK_U

LCK_U

PAGELATCH_EX

Page 156: Master   tuning

Before

ALTER TABLE HotUpdates ADD Padding CHAR(5000) NOT NULL DEFAULT ('X')

After

UPDATE Hack on Small Tables

Page (8K)

ROW

LCK_U

PAGELATCH_EX

CHAR(5000)

Page (8K)

ROW

ROW

ROW

LCK_U

LCK_U

PAGELATCH_EX
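Why the CHAR(5000) padding helps comes down to how many rows fit on one 8K page. A rough Python estimate, assuming a 96-byte page header and a 2-byte slot entry per row, and ignoring per-row overhead. The row sizes are hypothetical.

```python
PAGE_BYTES = 8192    # "Page (8K)" on the slide
HEADER_BYTES = 96    # assumed page header size
SLOT_BYTES = 2       # assumed slot array entry per row


def rows_per_page(row_bytes):
    """Rough upper bound on the rows that fit on one page."""
    usable = PAGE_BYTES - HEADER_BYTES
    return usable // (row_bytes + SLOT_BYTES)


small_row = 100                  # hypothetical narrow row
padded_row = small_row + 5000    # the same row plus the CHAR(5000) column

print(rows_per_page(small_row), "rows share a page before padding")
print(rows_per_page(padded_row), "row per page after padding")
```

With one row per page, two sessions updating different rows no longer need the same PAGELATCH_EX.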

Page 157: Master   tuning

Test: Updates of pages

Compression      Update 1.4M   CPU Load
None - Memory    13 sec        100% one core
PAGE - Memory    54 sec        100% one core
None - I/O       17 sec        100% one core
PAGE - I/O       59 sec        100% one core

L_QUANTITY is NOT NULL

i.e. in place UPDATE

Page 158: Master   tuning

UPDATE Compression burners

Function                                          CPU %
qsort                                             0.86
CDRecord::Resize                                  0.84
CDRecord::LocateColumnInternal                    0.36
perror                                            0.36
Page::CompactPage                                 0.36
ObjectMetadata::`scalar deleting destructor'      0.27
SearchInfo::CompareCompressedColumn               0.24
CDRecord::InitVariable                            0.19
CDRecord::LocateColumnWithCookie                  0.18
memcmp                                            0.16
PageDictionary::ValueToSymbol                     0.16
Record::DecompressRec                             0.14
PageComprMgr::DecompressColumn                    0.14
CDRecord::InitFixedFromOld                        0.1
SOS_MemoryManager::GetAddressInfo64               0.08
AnchorRecordCache::LocateColumn                   0.08
CDRecord::GetDataForAllColumns                    0.08
ScalarCompression::Compare                        0.07
PageComprMgr::CompressColumn                      0.07
Record::CreatePageCompressedRecNoCheck            0.06
memset                                            0.05
PageComprMgr::ExpandPrefix                        0.04
PageRef::ModifyColumnsInternal                    0.04
Page::ModifyColumns                               0.03
DataAccessWrapper::ProcessAndCompressBuffer       0.03
SingleColAccessor::LocateColumn                   0.03
CDRecord::BuildLongRegionBulk                     0.02
ChecksumSectors                                   0.02
Page::MCILinearRegress                            0.02
DataAccessWrapper::DecompressColumnValue          0.02
SOS_MemoryManager::GetAddressInfo                 0.02
CDRecord::FindDiff                                0.02
AnchorRecordCache::Init                           0.02
PageComprMgr::CombinePrefix                       0.01
Total                                             5.17

Out of 8.55 … Approx: 60%

Page 159: Master   tuning

Compression and Locks

Xevent Trace: Lock Acquire/Release

High Res Timer

Page 160: Master   tuning

How long are locks held?

[Chart: Lock Held Cycle Count for PAGE and NONE compression, showing Avg and StdDev in CPU KCycles (0 to 600,000).]

Page 161: Master   tuning

Summary Concurrency – So Far..

• Sharing is generally bad for scale (but may be good for performance)

• PAGELATCH and LATCH diagnosis starts in sys.dm_os_latch_stats

• CXPACKET
  • Only important if throughput drops when DOP goes up
  • If this happens, look for another wait/latch

• Table partitioning can be used to work around concurrency issues
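The CXPACKET rule of thumb above ("only important if throughput drops when DOP goes up") can be checked mechanically once throughput has been measured at several DOP settings. A small Python sketch follows. The function name and the measurements are made up for illustration.

```python
def dops_where_throughput_drops(throughput_by_dop):
    """Return the DOP values at which throughput fell below the value
    measured at the previous, lower DOP."""
    drops = []
    previous = None
    for dop in sorted(throughput_by_dop):
        current = throughput_by_dop[dop]
        if previous is not None and current < previous:
            drops.append(dop)
        previous = current
    return drops


# Hypothetical MB/sec measured at each DOP
measured = {1: 5.0, 6: 22.0, 11: 35.0, 16: 41.0, 21: 38.5, 26: 33.0}
suspect_dops = dops_where_throughput_drops(measured)
print(suspect_dops)
```

An empty result means CXPACKET can be left alone; otherwise the listed DOP values are where to look for another wait or latch.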

Page 162: Master   tuning

The Paul Randal INSERT test

160M rows, executing at concurrency

Commit every 1K:

EASY tuning?

Page 163: Master   tuning

All is as Expected?

Page 164: Master   tuning

But Page Splits are Bad, right?

= BAD!

= Better!...

Page 165: Master   tuning

WRITELOG gone? Faster?

?

?

NO!

sys.dm_os_wait_stats

Page 166: Master   tuning

And the Score Is…

[Chart: time in seconds (0 to 35,000) for newguid(), newsequentialid() and IDENTITY.]

Page 167: Master   tuning

What is going on here???

Min

Min

Min Min

Min

MinMin

Min Min

Min

HOBT_ROOT

Max

Page 168: Master   tuning

Tricks to Work Around this

0-1000: INSERT

1001-2000: INSERT

2001-3000: INSERT

3001-4000: INSERT

Page 169: Master   tuning

All Cores at 100%

[Chart: runtime in seconds (0 to 35,000) for newguid(), newsequen..., IDENTITY, IDENTITY + Unique, IDENTITY + Uniq..., IDENTITY..., IDENTITY... and SPID + Of.... Annotations: 600K Inserts/sec; 830K Inserts/sec; All Cores at ~100%.]

Page 170: Master   tuning

Takeaways, INSERT workload

• Don’t use Sequential Keys
  • Page Splitting isn’t so bad
  • Neither are GUIDs

• Generate keys wisely. Ideally in the app server

• For “transparent” speedup, consider our old hash trick
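One way to generate keys in the app server is to give every session its own key range, which is one possible reading of the "SPID + Offset" label used elsewhere in the deck. The Python sketch below assumes a fixed range width per session. The constant, the function name and the session numbers are hypothetical.

```python
import itertools

KEY_RANGE_PER_SESSION = 10**10  # hypothetical: keys reserved for each session


def session_keys(session_id):
    """Yield ascending keys from the range owned by one session (SPID).
    No overflow check in this sketch."""
    base = session_id * KEY_RANGE_PER_SESSION
    for offset in itertools.count():
        yield base + offset


spid_51 = session_keys(51)
spid_52 = session_keys(52)
first_51 = [next(spid_51) for _ in range(3)]
first_52 = [next(spid_52) for _ in range(3)]
print(first_51, first_52)
```

Each session inserts into its own part of the key space, so the sessions do not all pile onto the same last page of the index.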

Page 171: Master   tuning

BULK INSERT Workload

• Minimally Logged
• Single, large execution (thousands)
• Unsorted data
• Concurrent Loaders

Heap

Bulk Insert

Bulk Insert

Page 172: Master   tuning

When does BULK INSERT scale break?

Measure:

SELECT * FROM sys.dm_os_latch_stats

Observe waits on

ALLOC_FREESPACE_CACHE

Theory (just read BOL): “Used to synchronize the access to a cache of pages with available space for heaps and binary large objects (BLOBs). Contention on latches of this class can occur when multiple connections try to insert rows into a heap or BLOB at the same time. You can reduce this contention by partitioning the object.”

[Chart: MB/sec (0.0 to 250.0) against number of concurrent BULK INSERTs (0 to 30), with callouts 1, 2 and 3.]

Page 173: Master   tuning

What is Happening here?

Free Page information (PFS/GAM/SGAM)

HOBT Cache

FatChunks

Alloc new pages!

Bulk Insert

ALLOC_FREESPACE_CACHE

This is in DRAM and L2

Page 174: Master   tuning

Breaking Through the Bottleneck

• Break Up table by “some key”

• Optional: Switch out partitions

• Spin up multiple bulks

• Linear scale
  • 3GB/sec
  • 16M LINEITEM/sec

425

555

215

200

101

453

666

Area

Bulk Insert

Bulk Insert

Bulk Insert

Page 175: Master   tuning

BULK INSERT - Reloaded

• Thomas, you might have gotten 16M rows/sec at 3GB/sec insert speed
  • But this was on heaps, I have a clustered table

• Alright then, let us hit a clustered index

1-1000

Clustered and partitioned

1001-2000

2001-3000

3001-4000

X Lock

X Lock

X Lock

X Lock

Page 176: Master   tuning

Cluster Bulking – It seemed so plausible!

1

2

3

Page 177: Master   tuning

Cluster Bulking – Stage and Switch

1

2

3

Page 178: Master   tuning

Coffee Break

Page 179: Master   tuning

SPIN LOCKS

Page 180: Master   tuning

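What is a Spinlock?

• Context Switching is expensive
  • Typically 10K or more CPU cycles

• If you expect the resource to be held only shortly, why fall asleep?

/* Simplified: a real spinlock does the test and the set as one atomic instruction */
void spin_acquire(int* s)
{
    while (*s == 1)
        ;           /* spin until the holder releases */
    *s = 1;
}

void spin_release(int* s)
{
    *s = 0;
}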

Page 181: Master   tuning

What is a backoff?

• Acquire can be very expensive
  • SQL Server implements a backoff mechanism

void spin_acquire(int* s)
{
    int spins = 0;
    while (*s == 1)
    {
        spins++;
        if (spins > threshold)
        {
            <Sleep and WaitForResource>   /* Backoff */
        }
    }
    *s = 1;
}

SELECT * FROM sys.dm_os_spinlock_stats

DBCC SQLPERF(spinlockstats)
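The spin-then-back-off pattern can be tried out in Python. This is only a sketch of the idea, not how SQL Server implements it: a non-blocking `threading.Lock.acquire` stands in for the atomic test-and-set, and the threshold and sleep time are arbitrary assumptions.

```python
import threading
import time


class BackoffSpinLock:
    """Spin on a flag; after too many failed spins, sleep (back off)."""

    def __init__(self, threshold=1000, sleep_seconds=0.0001):
        self._flag = threading.Lock()      # stands in for the atomic int s
        self._threshold = threshold
        self._sleep_seconds = sleep_seconds

    def acquire(self):
        spins = 0
        # acquire(blocking=False) is an atomic test-and-set: it returns
        # True if we took the flag and False if someone else holds it.
        while not self._flag.acquire(blocking=False):
            spins += 1
            if spins > self._threshold:
                time.sleep(self._sleep_seconds)   # back off instead of burning CPU
                spins = 0

    def release(self):
        self._flag.release()


def worker(lock, counter, iterations):
    for _ in range(iterations):
        lock.acquire()
        try:
            counter[0] += 1
        finally:
            lock.release()


lock = BackoffSpinLock()
counter = [0]
threads = [threading.Thread(target=worker, args=(lock, counter, 2000))
           for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(counter[0])
```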

Page 182: Master   tuning

Life at 600K INSERT/sec

Page 183: Master   tuning

WRITELOG is I/O – right?

Should be the same as this… or?

No! Because:

Page 184: Master   tuning

Diagnosing a Spinlock the Cool way!

• Step 1: Copy sqlservr.pdb to the BINN directory

• Step 2: DBCC TRACEON (3656, -1)

• Step 3: Steal script from: http://www.microsoft.com/en-us/download/details.aspx?id=26666

Note for 2012, you additionally need:
• sqlmin.pdb, sqllang.pdb, sqldk.pdb

Page 185: Master   tuning

Spinlock Walkthrough – Extended Events Script

--Get the type value for any given spinlock type
select map_value, map_key, name from sys.dm_xe_map_values
where map_value IN ('SOS_CACHESTORE')

--create the event session that will capture the callstacks to a bucketizer
create event session spin_lock_backoff on server
add event sqlos.spinlock_backoff (action (package0.callstack)
    where type = 144 --SOS_CACHESTORE
)
add target package0.asynchronous_bucketizer (
    set filtering_event_name='sqlos.spinlock_backoff',
    source_type=1, source='package0.callstack')
with (MAX_MEMORY=50MB, MEMORY_PARTITION_MODE = PER_NODE)

--Run this section to measure the contention
alter event session spin_lock_backoff on server state=start

--wait to measure the number of backoffs over a 1 minute period
waitfor delay '00:01:00'

--To view the data
--1. Ensure the sqlservr.pdb is in the same directory as the sqlservr.exe
--2. Enable this trace flag to turn on symbol resolution
DBCC traceon (3656, -1)

--Get the callstacks from the bucketizer target
select event_session_address, target_name, execution_count, cast (target_data as XML)
from sys.dm_xe_session_targets xst
inner join sys.dm_xe_sessions xs on (xst.event_session_address = xs.address)
where xs.name = 'spin_lock_backoff'

--clean up the session
alter event session spin_lock_backoff on server state=stop
drop event session spin_lock_backoff on server

A complete walkthrough of the technique can be found here:

http://sqlcat.com/msdnmirror/archive/2010/05/11/resolving-dtc-related-waits-and-tuning-scalability-of-dtc.aspx

Page 186: Master   tuning

Of course, you can just use 2012…

Page 187: Master   tuning

How to improve a spinlock?

[Diagram: two CPUs, each with cores and an L1-L3 cache. Several cores call spin_acquire on int s, and each acquire is drawn with a "Transfer cache line" arrow between the CPUs.]

Page 188: Master   tuning

CoreInfo.Exe – where are my cores?

CoreInfo.exe

Page 189: Master   tuning

Revisiting the TLOG

Buffer Offset (cache line)

LOGCACHEACCESS

Alloc Slot in Buffer

MemCpy SlotContent

Log Writer

Writer Queue

Async I/O Completion Port

Slot1

LOGBUFFER

WRITELOG

LOGFLUSHQ

Signal thread which issued commit

T0

Tn

Slot 127

Slot126

Page 190: Master   tuning

I/O Affinity Mask!

[Chart: SPID + Offset compared with SPID + Affinity, scale 0 to 250.]

sp_configure 'affinity I/O mask'

Page 191: Master   tuning

Bulking at Concurrency

• What’s that spin?

xperf -on latency -stackwalk profile
xperf -d trace.etl
xperfview trace.etl

SELECT * FROM sys.dm_os_spinlock_stats
ORDER BY spins DESC

DBCC SQLPERF (spinlockstats)

?

Page 192: Master   tuning

SOS_OBJECT_STORE at high INSERT

• Observed: This Spin happens when inserting

• Need: Reduce locking overhead

• Fixes that work well here:

8x throughput

Bonus

Page 193: Master   tuning

Diagnosing another Spinlock

• Let's try something really silly:

CREATE PROCEDURE emptyProc
AS
RETURN

• Run lots of: EXEC emptyProc

• This should be infinitely scalable, right?

Page 194: Master   tuning

Initial Diagnosis

MUTEX ??? … what Mutex?

Page 195: Master   tuning

Using the Spinlock Script gives us

Some cache

Which one?

Page 196: Master   tuning

Validating the Theory

CREATE PROCEDURE emptyProc0
AS
RETURN
GO
CREATE PROCEDURE emptyProc1
AS
RETURN
GO

…

CREATE PROCEDURE emptyProc31
AS
RETURN

Page 197: Master   tuning

What is the SOS_OBJECT_STORE?

Security Check?

Page 198: Master   tuning

Validating the new “fix”…

Page 199: Master   tuning

Consider this Test harness code…

DECLARE @ParmDef NVARCHAR(500)
DECLARE @sql NVARCHAR(500)
SET @sql = N'INSERT INTO dbo_<t>.MyBigTable_<t> WITH (TABLOCK) (c1, c2, c3, c4, c5, c6)
VALUES (@p1, @p2, @p3, @p4, @p5, @p6)'
SET @sql = REPLACE(@sql, '<t>', dbo.ZeroPad(@table, 3))
SET @ParmDef = '@p1 BIGINT, @p2 DATETIME, @p3 CHAR(111), @p4 INT, @p5 INT, @p6 BIGINT'
DECLARE @constDate DATETIME = '1974-12-22'

DECLARE @i INT
WHILE (1=1)
BEGIN
    BEGIN TRAN
    SET @i = 1
    WHILE @i <= 1000
    BEGIN
        EXEC sys.sp_executesql @sql, @ParmDef, @p1 = 1, @p2 = @constDate, @p3 = 'x', @p4 = 42, @p5 = 7, @p6 = 13
        SET @i = @i + 1
    END
    COMMIT TRAN
END

Page 200: Master   tuning

Spinning on MUTEX

Diagnosing with the trace flag shows the stack of the offending spin:

CSecurityContext::GetUserTokenFromCache

This is REALLY expensive at scale:

WHILE @i <= 1000
BEGIN
    EXEC sys.sp_executesql @sql, …
    SET @i = @i + 1
END

It initializes a new execution context on every loop!

Page 201: Master   tuning

Fixing the MUTEX spin

• Instead of:

WHILE @i <= 1000
BEGIN
    EXEC sys.sp_executesql @sql, …
    SET @i = @i + 1
END

• Write:

-- @i is reset inside the batch; without SET @i = 1 it stays NULL and the inner loop never runs
SET @sql = N'DECLARE @i INT
WHILE (1=1)
BEGIN
    BEGIN TRAN
    SET @i = 1
    WHILE @i <= 1000
    BEGIN
        INSERT INTO dbo_<t>.MyBigTable_<t> WITH (TABLOCK) (c1, c2, c3, c4, c5, c6)
        VALUES (@p1, @p2, @p3, @p4, @p5, @p6)
        SET @i = @i + 1
    END
    COMMIT TRAN
END'

EXEC sys.sp_executesql @sql, @ParmDef, @p1 = 1, @p2 = @constDate, @p3 = 'x', @p4 = 42, @p5 = 7, @p6 = 13

4x throughput

Bonus

Page 202: Master   tuning

Concurrency, Spinlock Summary

• When all other bottlenecks are gone, sharing happens in the most unlikely places

• You can use spinlock Xevents inside SQL Server
  • Remember symbol files in BINN
  • Trace flag 3656

• This can also be done in XPERF for non SQL apps
  • Ex: Analysis Services

Page 203: Master   tuning

Xperf controlling buffers

• Control of buffers and NUMA for Xperf setting

• By default:
  • 4MB mem
  • Spool to disk at root of C-drive

• Can do buffer/file control:
  • -buffersize and -maxbuffers
  • -maxfile and -FileMode Circular

Page 204: Master   tuning

How SQL Server assigns threads

• Round robin between NUMA nodes
  • Inside the NUMA node: Pick the one that looks the least busy

• This is NOT a perfect system
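The two-level placement described above can be modelled in Python as follows. This is a toy model under stated assumptions: "least busy" is taken to mean the fewest tasks assigned so far, and the class name and the two-node, six-scheduler layout are hypothetical.

```python
class ThreadAssigner:
    """Round robin across NUMA nodes, least-loaded scheduler inside the node."""

    def __init__(self, schedulers_per_node):
        # load[node][scheduler] = tasks assigned so far (assumed busy metric)
        self.load = [[0] * count for count in schedulers_per_node]
        self._next_node = 0

    def assign(self):
        node = self._next_node
        self._next_node = (node + 1) % len(self.load)
        loads = self.load[node]
        scheduler = loads.index(min(loads))   # first of the least busy
        loads[scheduler] += 1
        return node, scheduler


assigner = ThreadAssigner([6, 6])   # two nodes with six schedulers each
placements = [assigner.assign() for _ in range(4)]
print(placements)
```

Nothing in the model looks at how much work a task brings, so a node that happens to receive the heavy tasks ends up overloaded, one way to read "NOT a perfect system".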

Page 205: Master   tuning

Super Xperf

xperf -on Latency+CSWITCH+DISPATCHER -stackWalk CSwitch+ReadyThread+ThreadCreate+Profile -BufferSize 1024 -MaxBuffers 1024 -MaxFile 1024 -FileMode Circular

REG ADD "HKLM\System\CurrentControlSet\Control\Session Manager\Memory Management" /v DisablePagingExecutive /d 0x1 /t REG_DWORD /f

Page 206: Master   tuning

But does the Data Model Work?

• All the tuning won't help you if your model is wrong

• Tuning gets you far, but to really scale, you need a good data model

• This is what my other courses are about

Page 207: Master   tuning

Q&A

Page 208: Master   tuning

Problem Statement

Queue Structure

Msg Msg Msg Msg Msg

Ordered

Push Pop

300B msg

Page 209: Master   tuning

The Naïve Approach

• Push
  • Seek First Row
  • INSERT Row

• Pop
  • Seek Last Row
  • DELETE/Output

Key

Max

Msg

Min Max

Msg

Min

Msg

Page 210: Master   tuning

Why this doesn’t Scale!

Min

Min

Min Min

Min

MinMin

Min Min

Min

HOBT_ROOT

Max

Page 211: Master   tuning

NextPrev

Virtual Root

SHLATCH

HOBT_VIRTUAL_ROOT

LCK

PAGELATCH

X

SH

SHPAGELATCH

PAGELATCH

EX

SH

SH

EX

SH

EX

EX

EXEX

B-Tree Root Pages

Page 212: Master   tuning

Summarising the Problem

• Hot stuff
  • Root
  • Min page
  • Max page
  • Intermediate pages

• Alloc/Dealloc

• BUT: We Must have order!

Page 213: Master   tuning

Cooling it down

Page 214: Master   tuning

What if…

• Push

• Seek first value page

• UPDATE Reference Count

• Pop

• Seek last value page

• UPDATE Reference Count

Min Max

Msg++

Min Max

Msg--

Page 215: Master   tuning

Dissipate the Heat

Min

Msg--

Max

Msg++

Min

Msg--

Max

Msg++

Min

Msg--

Max

Msg++

Last Digit = 0 Last Digit = 1 Last Digit = 2

Page 216: Master   tuning

Eliminating Thread Contention

Queue Structure

Ordered

PushSequence++

PopSequence++

8 7 6 5 4

VERY fast!

Page 217: Master   tuning

Ring Buffers

Queue Structure

Ordered

PushSequence++ Mod 100

PopSequence++ Mod 100

Slot: 8  Msg: 108
Slot: 7  Msg: 107
Slot: 6  Msg: 106
Slot: 5  Msg: 105
Slot: 4  Msg: 104
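The ring buffer idea (two ever-increasing sequences, slot = sequence mod the number of slots) can be sketched in Python. This is an in-memory illustration only. In the database version each slot would be a pre-created row that is updated in place, and the class and method names here are assumptions.

```python
class RingQueue:
    """FIFO queue over a fixed set of slots, addressed by sequence modulo size."""

    def __init__(self, slots=100):
        self.slots = [None] * slots
        self.push_sequence = 0
        self.pop_sequence = 0

    def push(self, msg):
        if self.push_sequence - self.pop_sequence >= len(self.slots):
            raise OverflowError("ring buffer is full")
        # Overwrite the slot in place (the UPDATE), never allocate a new one
        self.slots[self.push_sequence % len(self.slots)] = msg
        self.push_sequence += 1

    def pop(self):
        if self.pop_sequence == self.push_sequence:
            raise IndexError("ring buffer is empty")
        slot = self.pop_sequence % len(self.slots)
        msg = self.slots[slot]
        self.slots[slot] = None
        self.pop_sequence += 1
        return msg


queue = RingQueue(slots=100)
for n in range(104, 109):
    queue.push(f"Msg {n}")
print(queue.pop(), queue.pop())
```

Order is preserved by the sequences, while the slots themselves never move, so nothing has to be allocated or deallocated on push or pop.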

Page 218: Master   tuning

Summing Up Message Queue Hack

• UPDATE instead of INSERT/DELETE

• More partitions = More B-Trees

• Ring buffer using modulo

• Find Sweet spot concurrency