Master tuning

Thomas Kejser

[email protected]

http://blog.kejser.org

@thomaskejser

Super Scaling SQL Server

Diagnosing and Fixing Hard Problems


Thomas Kejser

• Formerly SQLCAT
• Tuning SQL Server since 6.5
• 15+ years of database experience
• http://blog.kejser.org
• CTO Fusion-io Europe


Image(s): FreeDigitalPhotos.net

VS. VS.

Best Practice

Performance vs. Scalability

Response Time

Resource Use

Adding more of a HW resource makes things faster

You can scale without having performance (ex: HADOOP)

You can perform without having scalability (ex: In Memory Engines)

Our Reasonably Priced Server

• 2 Socket Xeon E3645
• 2 x 6 Cores
• 2.4GHz
• NUMA enabled, HT off
• 12 GB RAM
• 1 ioDrive2 Duo
• 2.4TB Flash
• 4K formatted
• 64K AUS
• 1 Stripe
• Power Save Off
• Win 2008R2
• SQL 2012

Image Source: DeviantArt

Between disk and Memory

[Diagram: four cores, each with its own L1 and L2 cache, sharing one L3 cache. Access latency scale from cache to disk: 1ns, 10ns, 100ns, 10us, 100us, 10ms]

The “cache out curve”

Data Size

Throughput/thread

Cache Size

Service Time + Wait Time

NUMA Nodes

[Diagram: two CPUs (NUMA nodes), each with cores (C), L2 caches and a shared L3 cache. A core asking "Can I write?" triggers bus transfers between the two nodes]

There are several of these curves

Throughput

Touched Data Size

CPU Cache

TLB

NUMA Remote

Storage

Response time = Service Time + Wait Time

Algorithms and Data Structures

“Bottlenecks”

• DBA tasks
  • Installation of OS and SQL
  • Basic Memory Configuration
  • Basic Perfmon style monitoring
  • Backup/Restore and HA setup
• Basic reading of a Query Plan
• Basic understanding of database structures
• Adding Indexes to tables
• Running a Profiler trace

What you ALREADY know

Below the Surface

What we Need

• Free tools from MS

• Windows SDK
• In Win8: The “ADK”

• Need .NET 4 to install

Where Did the Time Go?


xperf -on Base -f Base.etl

SELECT TOP 100000 *
FROM LINEITEM
INNER JOIN ORDERS ON O_ORDERKEY = L_ORDERKEY

SQLCMD -E -S. -i "Select.sql"

?

xperf -stop

DEMOBASE profile with xperf


Right Click – Summary Table


What exactly is SQLNCLI?


Quantifying just how stupid XML is

SELECT TOP 1000000 *
FROM ORDERS
JOIN LINEITEM ON L_ORDERKEY = O_ORDERKEY
FOR XML RAW ('OUTPUT')

xperf -on Base -f Base.etl

With XML

“Native” Format

Which CPU cycles are Expensive?

“App” tier: Web Server Licensing, >3K USD Blades

Database Tier: Core Licensing, >10K USD

<XML> ?


• What about the time INSIDE the process?

• What if the EXE won’t tell us?

Diving even Deeper

What is a Debug Symbol?

Source code:

doStuff(10,20,3)
…
int doStuff(int a, int b, int c) { return (a + b) / c; }

Compiles to myProg.exe (header + machine code):

mov ax,10
mov bx,20
mov cx,3
push ax
push bx
push cx
call <address>

<address>
push bp
mov bp,sp
mov ax,[bp+8]
mov bx,[bp+6]
mov cx,[bp+4]
add ax,bx
div cx
mov dx,ax
ret

Symbol build produces myProg.pdb (symbol table):

<address> = doStuff
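To make the role of the symbol table concrete, here is a minimal Python sketch of what a symbol lookup does: it maps a raw code address back to the nearest function name. The addresses and the names `main` and `printResult` are invented for illustration; only `doStuff` comes from the slide, and a real PDB holds far more than this.

```python
import bisect

# Hypothetical symbol table: (start address, function name), sorted by address.
SYMBOLS = [
    (0x1000, "main"),
    (0x1040, "doStuff"),
    (0x10A0, "printResult"),
]
STARTS = [addr for addr, _ in SYMBOLS]

def resolve(address):
    """Return 'name+offset' for the symbol that starts at or below address."""
    i = bisect.bisect_right(STARTS, address) - 1
    if i < 0:
        return hex(address)  # below the first known symbol: leave unresolved
    start, name = SYMBOLS[i]
    return f"{name}+0x{address - start:x}"

print(resolve(0x1048))  # an address sampled inside doStuff
```

Without the table, a profiler can only report the raw address; with it, samples can be grouped by function.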


Where do you get PDB files?

_NT_SYMBOL_PATH=SRV*C:\Symbols*http://msdl.microsoft.com/download/symbols

_NT_SYMCACHE_PATH=C:\SymCache

• Public Symbol Server

• Configure Environment

• Dbghelp.dll


• Auto Generated by Visual Studio:

Your Own Debug Symbols


• Symbols are indexed. Have to add them

Adding and Checking Your Symbols

cd Bin/x64/Release/
symstore add /f *.pdb /s C:/Symbols /t "MyExe"

• Validate that the symbols can resolve

cd Bin/x64/Release/
symchk MyExe.exe /V

• Standard Xperf works fine for your own native code

• BUT: Before Windows 8, stack walking is broken for x64 .NET

• If you have .NET with 64-bit code, you must NGEN first:

Got .NET and x64?

Ngen install Bin/x64/Release/MyExe.exe
(ngen lives here: %Windir%\Microsoft.NET\framework64\<Version>\Ngen.exe)


• Free tool from MS:

.NET tracing is a pain, get a tool!

• Not to be confused with xperfview
• Same trace API and file format
• Helps set obscure .NET specific trace flags


And Finally, You can do Very Cool Things

Did I tell you about interlocked operations?...

Whiteboard time!


• Consider again our LINEITEM table

What is SQL Server REALLY doing?

• How expensive is it to read from that?
• Think ETL code and DW/BI queries

CREATE TABLE LINEITEM (
    [L_ORDERKEY] [int] NOT NULL,
    [L_PARTKEY] [int] NOT NULL,
    [L_SUPPKEY] [int] NOT NULL,
    [L_LINENUMBER] [int] NOT NULL,
    [L_QUANTITY] [decimal](15, 2) NOT NULL,
    [L_EXTENDEDPRICE] [decimal](15, 2) NOT NULL,
    [L_DISCOUNT] [decimal](15, 2) NOT NULL,
    [L_TAX] [decimal](15, 2) NOT NULL,
    [L_RETURNFLAG] [char](1) NOT NULL,
    [L_LINESTATUS] [char](1) NOT NULL,
    [L_SHIPDATE] [date] NOT NULL,
    [L_COMMITDATE] [date] NOT NULL,
    [L_RECEIPTDATE] [date] NOT NULL,
    [L_SHIPINSTRUCT] [char](25) NOT NULL,
    [L_SHIPMODE] [char](10) NOT NULL,
    [L_COMMENT] [varchar](44) NOT NULL
)

[Quadrant chart, both axes running Small to Big: OLTP, BI/DW, Simulation, ETL]


SQLCMD – Native code Test

SQLCMD.EXE

Where does the time go?


Standard Reading of Data

xperf -on base -stackwalk profile -f stackwalk.etl

SQLCMD -S. -dSlam -E -Q"SELECT * FROM LINEITEM_tpch"

55sec

xperf -stop

xperf -merge stackwalk.etl stackwalkmerge.etl

?


Details of the Time – Padding?


More Details – Conversion Work?

An Educated guess about improvements

Before:

CREATE TABLE [dbo].[LINEITEM](
    [L_ORDERKEY] [int] NOT NULL,
    [L_PARTKEY] [int] NOT NULL,
    [L_SUPPKEY] [int] NOT NULL,
    [L_LINENUMBER] [int] NOT NULL,
    [L_QUANTITY] [decimal](15, 2) NOT NULL,
    [L_EXTENDEDPRICE] [decimal](15, 2) NOT NULL,
    [L_DISCOUNT] [decimal](15, 2) NOT NULL,
    [L_TAX] [decimal](15, 2) NOT NULL,
    [L_RETURNFLAG] [char](1) NOT NULL,
    [L_LINESTATUS] [char](1) NOT NULL,
    [L_SHIPDATE] [date] NOT NULL,
    [L_COMMITDATE] [date] NOT NULL,
    [L_RECEIPTDATE] [date] NOT NULL,
    [L_SHIPINSTRUCT] [char](25) NOT NULL,
    [L_SHIPMODE] [char](10) NOT NULL,
    [L_COMMENT] [varchar](44) NOT NULL
)

After:

CREATE TABLE [dbo].[LINEITEM_native](
    [L_ORDERKEY] [int] NOT NULL,
    [L_PARTKEY] [int] NOT NULL,
    [L_SUPPKEY] [int] NOT NULL,
    [L_LINENUMBER] [int] NOT NULL,
    [L_QUANTITY] money NOT NULL,
    [L_EXTENDEDPRICE] money NOT NULL,
    [L_DISCOUNT] money NOT NULL,
    [L_TAX] money NOT NULL,
    [L_RETURNFLAG] int NOT NULL,
    [L_LINESTATUS] int NOT NULL,
    [L_SHIPDATE] int NOT NULL,
    [L_COMMITDATE] int NOT NULL,
    [L_RECEIPTDATE] int NOT NULL,
    [L_SHIPINSTRUCT] [char](25) NOT NULL,
    [L_SHIPMODE] int NOT NULL,
    [L_COMMENT] char(44) NOT NULL
)


Getting Rid of Useless Work

Additional parameters for SQLCMD:

-a32767 -W -s";" -f437

x1.5


Unicode – 10% overhead?


Let's try that with Native and Unicode …

x5


• SQLNCLI is one of these in disguise
  • ODBC
  • OLEDB
• Pick good data types
  • MONEY over NUMERIC
  • UNICODE if data arrives like this
• Native protocols vs. flexibility

Summary Moving Data

• Get
  • Windows 8 ADK
  • Windows 7 SDK
• Set up Symbol Paths
• xperf -on Base
  • Standard trace for time, narrow to process and DLL/EXE
• xperf -on Base -stackwalk Profile
  • Get to the call stack, find the offending function(s)
• Ease of use for .NET: perfview.exe

Summary – Xperf



Introducing TPC-H


Loop Join

[Diagram: each row of an m row result does "Find my match" against an n row B-tree, costing log(n) reads per lookup]

Complexity: O(m * log(n))
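The O(m * log(n)) cost of a loop join can be sketched in a few lines of Python. This is only a model: a sorted list with binary search stands in for the B-tree, and the key values are made up for illustration.

```python
import bisect

def loop_join(outer_keys, inner_sorted_keys):
    """For each of the m outer keys, binary-search the n sorted inner keys."""
    matches = []
    for key in outer_keys:                                  # m iterations
        i = bisect.bisect_left(inner_sorted_keys, key)      # about log2(n) comparisons
        if i < len(inner_sorted_keys) and inner_sorted_keys[i] == key:
            matches.append(key)
    return matches

# Hypothetical inputs: an unsorted outer result and a sorted "index".
print(loop_join([1, 4, 3, 13, 7], [1, 3, 5, 7, 9, 11]))
```

Every outer row pays the full search cost again, which is why the per-lookup cost of the tree matters so much for this join type.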

Linked List vs. Tree

[Figure: finding a value in a linked list (0, 1, 2, ... 8) takes up to n steps; finding it in a balanced tree (root 8, leaves 1 to 16) takes Log2(n) steps]

Cluster on O_ORDERKEY Index on O_ORDERKEY

Basic argument for Cluster Indexes


CREATE UNIQUE CLUSTERED INDEX CIX_Key ON ORDERS_Cluster (O_ORDERKEY)
WITH (FILLFACTOR = 100)

SELECT *
FROM ORDERS_Cluster
WHERE O_ORDERKEY = 3000000

CREATE UNIQUE INDEX IX_Key ON ORDERS_Heap (O_ORDERKEY)
WITH (FILLFACTOR = 100)

SELECT *
FROM ORDERS_Heap
WHERE O_ORDERKEY = 3000000

Table 'ORDERS_Heap'. Scan count 0, logical reads 3 , physical reads 0, read-ahead reads 0

Table 'ORDERS_Cluster'. Scan count 0, logical reads 4 , physical reads 0, read-ahead reads 0

Cluster on O_ORDERKEY heap + Index on O_ORDERKEY

But what if we do this a lot?

CREATE INDEX IX_Customer ON ORDERS_Cluster (O_CUSTKEY)
WITH (FILLFACTOR = 100)

CREATE INDEX IX_Customer ON ORDERS_Heap (O_CUSTKEY)
WITH (FILLFACTOR = 100)

SELECT * FROM ORDERS_Heap
WHERE O_CUSTKEY = 47480

SELECT * FROM ORDERS_Cluster
WHERE O_CUSTKEY = 47480

Table 'ORDERS_Cluster'. Scan count 1, logical reads 27, physical reads 0

Table 'ORDERS_Heap'. Scan count 1, logical reads 11, physical reads 0


How many LOOP joins/sec/core?

7 Sec


What did we just measure?

xperf -on Base -stackwalk profile

About 40%...


• The query language itself

• Why so many ExecuteStmt?

• …With so much CPU use?

What is sqllang.dll?


A different way to Measure Loops

1 Sec


VS.

What does THAT look like?

Takeaway:

The T-SQL language itself is expensive


• Sample from LINEITEM

• Force loop join with index seeks

• Do 1.4M seeks

Test: Singleton Row Fetch

Singleton seeks – Cost of compression

Compression      Seek (1.4M seeks)   CPU Load
None - Memory    13 sec              100% one core
PAGE - Memory    24 sec              100% one core
None - I/O       21 sec              100% one core
PAGE - I/O       32 sec              100% one core

Function                                                    % Weight
CDRecord::LocateColumnInternal                              0.82%
DataAccessWrapper::DecompressColumnValue                    0.47%
SearchInfo::CompareCompressedColumn                         0.28%
PageComprMgr::DecompressColumn                              0.24%
AnchorRecordCache::LocateColumn                             0.18%
ScalarCompression::AddPadding                               0.04%
ScalarCompression::Compare                                  0.11%
Additional Runtime of GetNextRowValuesInternal              0.14%
Total Compression                                           2.28%
Total CPU (single core)                                     8.33%
Compression %                                               27.00%

xperf -on base -stackwalk profile

Modern CPU

[Diagram: one CPU with two cores. Each core has a 32KB instruction cache, a 32KB data cache and a 256K unified L2 cache. Both cores share a 4MB L3 cache and the bus]


The B+ Tree


B+ Tree

Hekaton Style “Loop”

Lookup Table

(hash)


Merge Join

[Diagram: an m row result (1, 1, 2, 3) and an n row result (1, 2, 3, 4), both sorted, are read side by side and matching keys are joined]

Complexity: O(m + n)
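As a rough illustration of why a merge join is O(m + n), the Python sketch below walks two sorted key lists exactly once each. It assumes the right-hand keys are unique (as with an ORDERS key) while the left side may repeat; real merge joins also handle many-to-many input, which this sketch does not.

```python
def merge_join(left, right):
    """Join two sorted key lists in a single pass over each input."""
    out = []
    i = j = 0
    while i < len(left) and j < len(right):
        if left[i] < right[j]:
            i += 1                      # left key has no partner, move on
        elif left[i] > right[j]:
            j += 1                      # right key has no partner, move on
        else:
            out.append((left[i], right[j]))
            i += 1                      # keep j: the next left row may repeat the key
    return out

# The two sorted inputs from the diagram.
print(merge_join([1, 1, 2, 3], [1, 2, 3, 4]))
```

Neither cursor ever moves backwards, so the work is bounded by the combined length of the two inputs, provided they arrive sorted.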

Merge Join – What is Fastest?


SELECT MAX(L_PARTKEY), MAX(O_ORDERDATE)
FROM LINEITEM
INNER MERGE JOIN ORDERS ON O_ORDERKEY = L_ORDERKEY

…or

SELECT MAX(L_PARTKEY), MAX(O_ORDERDATE)
FROM ORDERS
INNER MERGE JOIN LINEITEM ON O_ORDERKEY = L_ORDERKEY

Comparing the Query Plans


Digging in Deeper

Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'ORDERS'. Scan count 1, logical reads 22162, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'LINEITEM'. Scan count 1, logical reads 104522, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
SQL Server Execution Times: CPU time = 3265 ms, elapsed time = 3357 ms.

Table 'LINEITEM'. Scan count 1, logical reads 104522, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'ORDERS'. Scan count 1, logical reads 22162, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
SQL Server Execution Times: CPU time = 2469 ms, elapsed time = 2607 ms.


We can beat SQL Server at this game

SELECT MAX(O_ORDERDATE), MAX(MAX_P)
FROM (SELECT L_ORDERKEY, MAX(L_PARTKEY) AS MAX_P
      FROM LINEITEM
      GROUP BY L_ORDERKEY) b
INNER MERGE JOIN ORDERS ON O_ORDERKEY = b.L_ORDERKEY


Hash Join

[Diagram: the n row join table is hashed, e.g. Hash(1), into an n row hash table; each row of the m row result then does "Find my match" against that hash table]

Complexity: O(m + 2n)
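The build and probe phases of a hash join can be modelled with a Python dictionary. This is a sketch under simplifying assumptions: the hash table fits in memory (no spill) and build keys are unique; the row values are invented for illustration.

```python
def hash_join(build_rows, probe_keys):
    """Build a hash table over the n build rows, then probe it with m keys."""
    table = {}
    for key, payload in build_rows:      # build phase: n inserts
        table[key] = payload
    out = []
    for key in probe_keys:               # probe phase: m lookups, O(1) each on average
        if key in table:
            out.append((key, table[key]))
    return out

# Hypothetical build side (key, payload) and probe side.
print(hash_join([(1, "a"), (3, "c"), (7, "g")], [1, 4, 3, 13, 7]))
```

The model also shows the catch: the whole build side must be held in the table before the first probe row can be answered, so memory for the hash table is the critical resource.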


When Hash Joins hurt you


[Chart: runtime (seconds, 0 to 30) vs. hash memory (MB, 0 to 400), with a "Spill Zone!" marked]

Hash Joins Don’t Scale in MSSQL

The Bottleneck Curve

ACCESS_METHODS_DATASET_PARENT:

“Used to synchronize child dataset access to the parent dataset during parallel operations.”

Books Online Story…

Image: FreeDigitalPhotos.net

Using XPERF to find documentation

xperf -on base+cswitch+dispatcher -stackwalk profile+readythread+cswitch

Let's dig in…

xperf -on base -stackwalk profile -f stackwalk.etl

What LATCH pattern do we see?

GetNextRangeForChildScan

Inside:TableScanNew

• Partition the table by a “random” value

• Modulo the Key for example

• Use SQL Server partition function/scheme

The Fix?…

[Diagram: keys hashed into partitions 0, 1, 2, 3, 4, 5, 6 ... 253, 254, 255]
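A quick Python sketch shows why taking the key modulo the partition count works as a "random" partitioning value. The 256 partitions match the 0 to 255 buckets in the diagram; the key range of one million sequential keys is an assumption made for illustration.

```python
from collections import Counter

PARTITIONS = 256  # matches the 0..255 buckets in the diagram

def partition_of(key):
    """Map a key to a partition number by taking it modulo the partition count."""
    return key % PARTITIONS

# Spread one million hypothetical sequential keys over the partitions.
counts = Counter(partition_of(k) for k in range(1, 1_000_001))
print(len(counts), min(counts.values()), max(counts.values()))
```

Neighbouring keys land in different partitions, so concurrent scans or inserts that would otherwise contend on one structure are spread over 256 of them.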

Closer…

…But no Cigar

What is the Problem here?

Anti Scale Patterns

CPU Caches

[Chart: throughput (0 to 1,000,000,000) vs. size of accessed memory (MB). Series: Random Pages, Sequential Pages, Single Page]

Goals:
• Compressed
• Prefetch Friendly
• Cache Resident Code

Example, Column Stores

Product        Customer         Date             Sale
ID  Value      ID  Customer     ID  Date         ID  Sale
1   Beer       1   Thomas       1   2011-11-25   1   2 GBP
2   Beer       2   Thomas       2   2011-11-25   2   2 GBP
3   Vodka      3   Thomas       3   2011-11-25   3   10 GBP
4   Whiskey    4   Christian    4   2011-11-25   4   5 GBP
5   Whiskey    5   Christian    5   2011-11-25   5   5 GBP
6   Vodka      6   Alexei       6   2011-11-25   6   10 GBP
7   Vodka      7   Alexei       7   2011-11-25   7   10 GBP


Compression is Easy

As ID ranges:

Product’        Customer’         Date’             Sale’
ID   Value      ID   Customer     ID   Date         ID   Sale
1-2  Beer       1-3  Thomas       1-7  2011-11-25   1-2  2 GBP
3    Vodka      4-5  Christian                      3    10 GBP
4-5  Whiskey    6-7  Alexei                         4-5  5 GBP
6-7  Vodka                                          6-7  10 GBP

As run lengths (RL):

Product’        Customer’         Date’             Sale’
RL  Value       RL  Customer      RL  Date          RL  Sale
2   Beer        3   Thomas        7   2011-11-25    2   2 GBP
1   Vodka       2   Christian                       1   10 GBP
2   Whiskey     2   Alexei                          2   5 GBP
2   Vodka                                           2   10 GBP
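The run-length step can be sketched in Python. This is only an illustration of the encoding idea on the Product column from the slides, not of how any particular column store lays out its segments.

```python
from itertools import groupby

def rle_encode(values):
    """Collapse consecutive equal values into (run length, value) pairs."""
    return [(len(list(group)), value) for value, group in groupby(values)]

def rle_decode(runs):
    """Expand (run length, value) pairs back into the original column."""
    return [value for length, value in runs for _ in range(length)]

product = ["Beer", "Beer", "Vodka", "Whiskey", "Whiskey", "Vodka", "Vodka"]
runs = rle_encode(product)
print(runs)
```

Note that Vodka appears in two separate runs: run-length encoding only merges neighbours, which is why sort order has such a large effect on the compression ratio.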


Squeezing it even more

Product’ as ID ranges:

ID   Value
1-2  Beer
3-3  Vodka
4-5  Whiskey
6-7  Vodka

(4+4+4+2 = 14B) + (4+4+5+2 = 15B) + (4+4+7+2 = 17B) + (4+4+5+2 = 15B) = 61B

Product’ as run lengths:

RL  Value
2   Beer
1   Vodka
2   Whiskey
2   Vodka

(4+4+2 = 10B) + (4+5+2 = 11B) + (4+7+2 = 13B) + (4+5+2 = 11B) = 45B

Product’ with a dictionary (Beer = 1, Vodka = 2, Whiskey = 3):

RL  Value
2   1
1   2
2   3
2   2

(4+4 = 8B) + (4+4 = 8B) + (4+4 = 8B) + (4+4 = 8B) = 32B

Product’ with bit-packed dictionary values:

RL  Value
2   0x01
1   0x10
2   0x11
2   0x10

4B + 4B + 4B + 4B + (4 x 2b = 2B) = 18B


Product’        Customer’
RL  Value       RL  Customer
2   Beer        3   Thomas
1   Vodka       2   Christian
2   Whiskey     2   Alexei
2   Vodka

SELECT Product, Customer FROM Table

2 steps with Beer, 2 steps with Thomas:
Beer     Thomas
Beer     Thomas

1 step with Vodka, 1 step with Thomas:
Vodka    Thomas

2 steps with Whiskey, 2 steps with Christian:
Whiskey  Christian
Whiskey  Christian

2 steps with Vodka (note: repeated value), 2 steps with Alexei:
Vodka    Alexei
Vodka    Alexei
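The stepping shown above can be expressed as a small Python sketch: each run-length encoded column is expanded lazily and the columns are zipped back into rows. The function names are hypothetical, and a real engine would work on batches of runs rather than single values.

```python
def expand(runs):
    """Yield each value as many times as its run length says."""
    for length, value in runs:
        for _ in range(length):
            yield value

def select_columns(*columns):
    """Rebuild rows by stepping through all encoded columns in lockstep."""
    return list(zip(*(expand(column) for column in columns)))

product = [(2, "Beer"), (1, "Vodka"), (2, "Whiskey"), (2, "Vodka")]
customer = [(3, "Thomas"), (2, "Christian"), (2, "Alexei")]

rows = select_columns(product, customer)
for row in rows:
    print(*row)
```

The run boundaries of the two columns do not line up (Thomas spans the Beer run and the first Vodka run), which is why each column needs its own cursor.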


Hash Joining with Column Stores

Table            Dim Product
RL  Key          Key      Type
2   Beer         Beer     Soft
1   Vodka        Vodka    Strong
2   Whiskey      Whiskey  Strong
2   Vodka        Vodka    Strong

SELECT …
FROM Table
JOIN DimProduct ON Key
WHERE Type = 'Strong'

1. Compute bloom filter of Keys belonging to 'Strong'
2. Read RL = 2, Beer from Table
3. Compute bloom value of Beer
4. Equal to filter value from 1? No. Do nothing
5. Compute bloom value for Vodka
6. Equal to filter value from 1? Yes. Output one row (RL=1)
7. Compute bloom value for Whiskey
8. Equal to filter value from 1? Yes. Output two rows (RL=2)

Can pre fetch data (news RLE)

Can calculate match/no match using only local CPU cache

Won't work for OLTP!
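The bloom filter probe can be sketched in Python as follows. This is a toy under stated assumptions: a 64-bit filter with two hash positions derived from SHA-256 (both choices are arbitrary), and a final exact check against the key set so that bloom false positives cannot leak into the result. It follows the DimProduct table as given: Beer is Soft, Vodka and Whiskey are Strong.

```python
import hashlib

BITS = 64  # assumption: a tiny filter, small enough to stay in CPU cache

def bloom_bits(key):
    """Derive two bit positions for a key and return them as a bit mask."""
    digest = hashlib.sha256(key.encode()).digest()
    return (1 << (digest[0] % BITS)) | (1 << (digest[1] % BITS))

def build_filter(keys):
    """OR together the bit masks of all qualifying dimension keys."""
    mask = 0
    for key in keys:
        mask |= bloom_bits(key)
    return mask

def probe(runs, dim, wanted_type):
    """Scan run-length encoded fact keys, testing each run once against the filter."""
    wanted = {key for key, kind in dim.items() if kind == wanted_type}
    mask = build_filter(wanted)
    out = []
    for length, key in runs:
        bits = bloom_bits(key)
        # The filter says "maybe"; the set lookup confirms the match.
        if bits & mask == bits and key in wanted:
            out.extend([key] * length)   # one test pays for the whole run
    return out

dim_product = {"Beer": "Soft", "Vodka": "Strong", "Whiskey": "Strong"}
table = [(2, "Beer"), (1, "Vodka"), (2, "Whiskey"), (2, "Vodka")]
print(probe(table, dim_product, "Strong"))
```

The filter test runs once per run rather than once per row, which is where run-length encoding and the join strategy reinforce each other.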

Why is it so hard to get joins right?

[Chart: time for Loop Join, Merge Join and Hash Join as a function of the input sizes n and m]


Desired Join     Join Hint                                      Query Hint
LOOP             [INNER | LEFT | CROSS | FULL] LOOP JOIN        OPTION (LOOP JOIN)
MERGE            [INNER | LEFT | CROSS | FULL] MERGE JOIN       OPTION (MERGE JOIN)
HASH             [INNER | LEFT | CROSS | FULL] HASH JOIN        OPTION (HASH JOIN)
LOOP with Seek   WITH (FORCESEEK), WITH (INDEX (<name>))        N/A

Controlling Joins

Note: Join hints force the order of the ENTIRE join tree!

What Type of Workload?

[Quadrant chart. X-axis: Data Touched (Small to Big). Y-axis: Data Returned (Small to Big). Quadrants: OLTP, BI/DW, Simulation, ETL]


How to Classify?

OLTP BI/DW

Simulation ETL

Full Scan/sec, Range Scans/sec, Probe Scans/sec

Index Search/sec, Range Scans/sec

Full Scan/sec, Range Scans/sec

Bulk Copy Rows/sec?

There should ALWAYS be a fully indexed path to the data.

OLTP System Basic Query Pattern



1. Find worst CPU consuming query with sys.dm_exec_query_stats
2. Add OPTION (LOOP JOIN) to offending query
3. Check estimated query plan
4. If table spool found: add index to remedy and GOTO 3
5. Happy? If not, GOTO 1

The Super Quick OLTP Tuning Guide


The query will not be (much) worse than a full scan of a fact partition

DW/BI System Basic Query Pattern



1. Find offending query
2. Add OPTION (HASH JOIN) to query
3. Do dimension tables have an indexed path to build the hash? If not, add index
4. Do you get a fact table scan and hash build of all dimensions? If not, check statistics (especially on facts and skewed)
5. Optimize Fact table scans
   1. Partition and partition elimination
   2. Column store if you have it
   3. Aggregate Views
   4. Bitmap index pushdown (statistics!)
   5. Composite indexes (last resort!)

The Super Quick DW tuning Guide


The expected DW Query Plan

[Query plan: a Fact CSI Scan is hash joined twice, each Hash Join fed by a Batch Build over a dimension (one Dim Scan, one Dim Seek), followed by a Partial Aggregate and a Hash/Stream Aggregate]

• At least enough RAM to hold the hash tables of the largest dimension
• De-normalisation helps… a LOT
  • Especially for the large/large joins
• Likely: need to scan fast from disk if RAM is not big enough to hold the fact
  • Compression REALLY matters

Things that Follow from desired DW Plan


Coffee Break


Where EVERY Server wide diagnosis starts

SELECT *
FROM sys.dm_os_wait_stats
WHERE wait_type NOT IN (SELECT wait_type FROM #ignorewaits)
  AND waiting_tasks_count > 0
ORDER BY wait_time_ms DESC


• Shows up as waits for PAGEIOLATCH
• You can dig into details with:

Common Problems - PAGEIO


SELECT * FROM sys.dm_io_virtual_file_stats(DB_ID(), NULL)

• Can also Xevent your way to it per query

CREATE EVENT SESSION [TraceIO] ON SERVER
ADD EVENT sqlserver.file_read_completed
    (ACTION (sqlserver.database_id, sqlserver.session_id))

• I/O, like memory, is a GLOBAL resource for the machine

• When does it make sense to partition a global resource?
  • When you deeply know the workload
  • When the workload is ALREADY partitioned

• When neither of those is true: DON’T partition

• If you have NAND/SSD – Why bother?

The general I/O Guidance


A good way to Think of Spindle I/O

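[Diagram: DRAM in front of two layouts. JBOD: separate LUNs, each serving one sequential (Seq.) stream. SAME: a RAID system with one large LUN where several sequential streams combine into RANDOM I/O]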


Stripe vs. Concatenation

RAID 10 RAID 10

Concatenated LUN

RAID 10 RAID 10

Striped LUN


OLTP

• One big SAME setup
  • Data files
  • Tempdb
• Dedicate
  • Transaction log
• DRAM:
  • Enough to hold most of DB

Data Warehouse

• JBOD setup
  • Data files
  • 1-2 per LUN
• SAME setup
  • Tempdb
• Dedicate
  • Transaction log
• DRAM:
  • Enough to hold largest partition of largest table

Rules of Thumb – Spindle I/O and DRAM


• Short Stroking

• Elevator Sort

• Sequential vs. Random

• Weaving

You can do a bit better… or worse


• Intentionally use lower % of total space
• Tradeoff:
  • Space for Speed
• Test:
  • 15K rpm
  • SAS spindle
  • 300GB

Short Stroking Disks

[Chart: IOPS (150 to 400) vs. % capacity used (0% to 100%)]


Full Stroked Short Stroked

Why does Short Stroking Work?

Disks are typically consumed “from the outside in”. If partitions don’t use the full disk size, the disk won't use the full platter either. The result: less head movement


Adding Elevator Sorting

[Chart: 8K random I/O. IOPS, Avg. Latency and Max Latency for Full Stroke Random, Outer Short, Inner Short, Elevator Sort and Elevator Short Stroked]

Bat powered disk!

Why Chase Sequential I/O?

[Chart: 8K block pattern, Sequential vs. Full Stroke Random. Log(IOPS) from 100 to 100000; Avg Latency and Max Latency (ms) from 0 to 80]


• One SATA disk

• Two partitions

• One file on each

• Sequential read on each file

But all is not well!

File1 File2


I/O Weaving in action

[Chart: 64K Random vs. 64K Dual Sequential. IOPS and Avg Latency (ms)]

Source: Michael Anderson

Storage Pool and Weaving

[Diagram: three Data/Log pairs, each issuing sequential (Seq) and random (Ran) I/O against one massive, then provisioned pool. The combined pattern at the pool: RANDOM!]

The SAN will properly handle Sharing!

Green: Checkpoint, Red: tx/sec, Black: Disk Latency

The “cache out curve”


Numbers to Remember - Spindles

Characteristic           Typical Units
Throughput / Bandwidth   90-125MB/sec, but ONLY if sequential access!
Operations per Sec       10K RPM Spindle: 100-130 IOPS
                         15K RPM Spindle: 150-180 IOPS
                         Can get about 2x if short stroking (more later)
Latency                  3-5ms (compare DRAM: 100ns)
Capacity                 100s of GB to single digit TB

2012 numbers, will change in future

• Few hundreds of IOPS

• Faster if short stroked

• Trade latency for speed with elevator sort

• Sequential is hard to get right

Summary so far.. Single Disk


• Wider Stripes neat
  • But scale not linear
• Very deep queues help
  • But add latency
• Shared Components

Why does a big RAID pile not solve this?


RAID Scale?

Your Mileage WILL vary with the hardware

Before After

Getting rid of Sharing

[Diagram, Before and After: the I/O path of HBAs, switches, storage ports, LUNs, cache, disk and CPU. After: the path is duplicated (x2) so components are no longer shared]

NAND Flash Basics

[Diagram: a NAND cell (control gate, floating gate with trapped electrons, oxide layer, N-P-N substrate). Cells form 4K pages, pages form blocks, blocks form a NAND die, dies form a pack]

NAND Flash Problems

• Erase Cycles
  • Around 100K
  • Rebalancing and reclaim/trim
• Voltage measurement
  • Gets worse with density
  • Changes over time
  • Depends on how you program
• Bit Rot
  • Must refresh even on read
• SLC easier to manage than MLC
  • But much more expensive!

[Diagram: voltage levels separating the four cell states 00, 01, 10, 11]

Lessons Learned: Try to Avoid Sharing

BAD BETTER BEST


The Network

• Only partially diagnosed as waits in sys.dm_os_wait_stats

• Task Manager gives a bit more information

• Need: transparency to the deep level latencies and packets!

Common Problems: ASYNC_NETWORK, OLEDB


A common Wait Type

The database is really slow! The code takes

forever to run!


• We may not always have insight into what is going on at the client…

Xperf Diagnosing the Network

xperf -on latency+network

SummaryTable


Timeline of the network Traffic

ASYNC_NETWORK_IO, the typical issue


Handling network is EXPENSIVE

xperf -on latency

?


Short Story on DPC/ISR handling

CPU

Core

Core

L1-L3 Cache

PCIBUS

IRQ

HALT execution
Fire ISR Routine

if (my interrupt) { <Mark Handled> Queue DPC }

NIC: Work Done

DPC: <Do work needed>

<Wake Application>

Core can run other stuff again


It looks like this…

DPCISR


• Option 1: Use the HW vendor's tool
• Option 2: Use Interrupt Affinity Policy Tool from MS

Setting Interrupt Affinity


• Standard Payload Network (MTU):
  • 1500 B
• Jumbo Frames
  • 9014 B (MTU)

Jumbo Frame and SQL Packets

• Standard SQL payload
  • 4096 B
• Largest
  • 32767 B

SELECT session_id, net_packet_size
FROM sys.dm_exec_connections

Server=foo;Packet size=32767
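A little arithmetic shows how these numbers interact. The Python sketch below counts how many network frames one SQL packet needs at each MTU. It is deliberately simplified: it treats the whole MTU as usable payload and ignores TCP/IP and TDS header bytes, so real counts can be slightly higher.

```python
import math

def frames_needed(sql_packet_bytes, mtu_bytes):
    """Frames required to carry one SQL packet, ignoring protocol headers."""
    return math.ceil(sql_packet_bytes / mtu_bytes)

for packet in (4096, 32767):       # standard and largest SQL packet size
    for mtu in (1500, 9014):       # standard and jumbo frame MTU
        print(f"packet {packet} B, MTU {mtu} B: {frames_needed(packet, mtu)} frame(s)")
```

Since every frame costs an interrupt and DPC work on the receiving side, fewer frames per packet translates directly into less CPU spent on network handling.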


Single Threaded

Core Evolution

Moore’s “Law”:

“The number of transistors per square inch on integrated circuits has doubled every two years since the integrated circuit was invented”

• Never faster than a single core
  • Smaller servers are faster than bigger ones
  • Large L2 caches and more clock speed help
• The algorithm dictates speed
  • Latency of Wait Time sets upper limit
• Examples from MSSQL land:
  • Formula Engine in MSAS
  • Transaction Log Writes
  • INSERT/UPDATE/DELETE (as we shall see)

Single Threaded

VLF files

• When switching to a new VLF, it has to be ”formatted” with an 8K sync write
  • While this happens, transactions are blocked
  • Too many VLFs = too much blocking
  • Lesson: Preallocate the database log file in big chunks
• Up to 128 Log Buffers per database
  • Spawned on demand, will not be released once spawned
  • Transactions will wait for LOGBUFFER if no buffer is available
• Think of this like a pipeline of commits waiting…

[Diagram: VLF(1) to VLF(6), each starting with an 8K block; log buffers of <=60K, x 128]

Transaction Log Background

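[Diagram: threads T0 to Tn take LOGCACHEACCESS to allocate a slot in the log buffer (buffer offset, cache line) and MemCpy the slot content into one of slots 1 to 127 (wait: LOGBUFFER). Buffers pass through the writer queue (LOGFLUSHQ) to the Log Writer, which writes via an async I/O completion port (wait: WRITELOG) and signals the thread which issued the commit]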

• Speed is determined by Latency and Code Path

• Max Log Write Size: 60K

Zooming to the Log Writer

Log Writer



Latency

Writer Queue

Long Distance Replication…

[Diagram: the Primary writes a log entry and sends the log over the network; the Secondary writes the log entry and acks the log]

Executive Summary: The speed of light (c) is not fast enough!
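To put a number on that summary, here is a back-of-the-envelope Python sketch of the best possible round trip for a synchronous log send and ack. The 1000 km distance is a made-up example, and the factor of 2/3 for light in fiber is an approximation; switches, routers and the remote log write only add to this floor.

```python
C_KM_PER_S = 299_792.458   # speed of light in vacuum
FIBER_FACTOR = 2 / 3       # assumption: light in fiber travels at roughly 2/3 c

def min_round_trip_ms(distance_km):
    """Lower bound for send + ack over a fiber link of the given length."""
    return 2 * distance_km / (C_KM_PER_S * FIBER_FACTOR) * 1000

def max_serial_commits_per_sec(distance_km):
    """Upper bound on commits/sec if each commit waits for its own round trip."""
    return 1000 / min_round_trip_ms(distance_km)

distance = 1000  # hypothetical distance between primary and secondary, in km
print(f"{min_round_trip_ms(distance):.1f} ms round trip")
print(f"{max_serial_commits_per_sec(distance):.0f} serial commits/sec at best")
```

Compare the result with the microsecond-scale local log write latencies discussed in this section: distance alone adds several orders of magnitude.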

• Perfmon will only show millisec
• What if we want microseconds?

Getting to the Real Latency

xperf -on latency

It’s in Memory, so it must be fast?

VS.

Latency: 15-30us Latency: <5us

RAM DISK

1.5sec 1.5sec

No, Because…

This adds up to one core… it is doing all it can with the CPU it has

The Effect on UPDATE

Naïve:

UPDATE MyBigTable SET c6 = 43

Parallel:

UPDATE MyBigTable SET c6 = 43
WHERE key BETWEEN 10**9 * n AND 10**9 * (n+1) - 1

[Chart: runtime of the two variants (smaller is faster)]
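The parallel variant amounts to generating one statement per key range and running each on its own connection. The Python sketch below only builds the statement text; the function name is hypothetical, the table and column names come from the slide, and actually executing the statements concurrently is left out.

```python
def range_updates(table, column, value, range_size, workers):
    """Build one UPDATE per worker, each covering a disjoint key range."""
    return [
        f"UPDATE {table} SET {column} = {value} "
        f"WHERE key BETWEEN {range_size * n} AND {range_size * (n + 1) - 1}"
        for n in range(workers)
    ]

statements = range_updates("MyBigTable", "c6", 43, 10**9, 4)
for statement in statements:
    print(statement)
```

Because the ranges are disjoint and contiguous, every row is updated exactly once and the workers do not compete for the same rows.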

Multi Threaded

What is Scalable?

[Chart: throughput (0 to 3000) vs. some hardware resource (0 to 24). Curves: Good, So so, Bad. Annotation: "We want to live here"]

Amdahl’s Law of gated speedup

[Chart: speedup factor (1 to 31) vs. number of cores (0 to 64) for P = 100%, P = 95%, P = 90% and P = 80%]

P = Part of program that can be made Parallel (Note that this may be 0... or 1)

N = Number of CPU cores available

$$\text{Speedup} = \frac{1}{(1 - P) + \frac{P}{N}}$$
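Amdahl's law, Speedup = 1 / ((1 - P) + P / N), is easy to evaluate for the curves in the chart. The Python sketch below uses the same P values and the 64 core end point of the x-axis; the function name is just for illustration.

```python
def amdahl_speedup(p, n):
    """Speedup on n cores when a fraction p of the program runs in parallel."""
    return 1.0 / ((1.0 - p) + p / n)

for p in (1.00, 0.95, 0.90, 0.80):
    print(f"P = {p:.0%}: {amdahl_speedup(p, 64):.1f}x on 64 cores")
```

Even a 5% serial portion caps 64 cores at roughly a 15x speedup, which is why the serial parts of a workload deserve the tuning attention.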

Introducing Contention – Locks

[Diagram: concurrent INSERT TableA …, INSERT TableB … and INSERT TableC … statements each take locks (LCK) on Table A, Table B and Table C]

Wait Stat: LCK_<X>

But those rows have to be stored…

[Diagram: Table A, Table B and Table C, with their locks (LCK), all stored in one DataFile in one FileGroup]

It all Starts with Wait Stats

SELECT *
FROM sys.dm_os_wait_stats
WHERE wait_type NOT IN (SELECT wait_type FROM #ignorewaits)
  AND waiting_tasks_count > 0
ORDER BY wait_time_ms DESC

DBCC PAGE

PFS – Hidden Single Page Contention

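[Diagram: a data file starts with GAM/SGAM and PFS pages, and another PFS page follows every 64MB. Each 8K PFS page holds one byte (B) per page, e.g. 10010010, including the allocated bit. INSERT TableA … takes a PAGELATCH on the PFS page]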

More Files

[Diagram: Table A, Table B and Table C, with their locks (LCK), spread over four DataFiles in one FileGroup]

• Round Robin between files
• More files, more structures
• No affinity

How many more Files?

[Chart: Runtime (260 to 400) and PAGELATCH_UP waits (log scale, 100 to 10000000) vs. # Data Files (0 to 48)]

• Shared, physical MEMORY structures can cause bottlenecks (ex: PFS)
  • SQL Server must sync too…
• Understanding where structure resides leads to tuning fix
  • Theory of engine!

Concurrency: What we learned so far

• Commonly misdiagnosed
• CXPACKET does NOT (always) mean that your DOP is “too high”

CXPACKET

[Chart: CXPACKET waits (0 to 200,000,000) vs. throughput (10 to 40 MB/sec)]

[Chart: throughput (0 to 50 MB/sec) vs. DOP (1 to 46)]

CXPACKET = Issue may be elsewhere…

• What happens when you get things like:

LATCH_<x>PAGELATCH_<x>

Step 1: Dig into:

Diagnosing Latches

SELECT * FROM sys.dm_os_latch_stats


Digging into Latches Again…

Zooming into the Ready Thread

Post Fix Pattern




• Before: 6GB/sec

• After: 20GB/sec

• This sometimes works on cluster indexes too…

…Whiteboard

Speedup with Hash Partition of Heap

UPDATE Hotspot

Page (8K)

ROW

ROW

ROW

LCK_U

LCK_U

PAGELATCH_EX

Before

ALTER TABLE HotUpdates
ADD Padding CHAR(5000) NOT NULL DEFAULT ('X')

After

UPDATE Hack on Small Tables

Page (8K)

ROW

LCK_U

PAGELATCH_EX

CHAR(5000)

Page (8K)

ROW

ROW

ROW

LCK_U

LCK_U

PAGELATCH_EX
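The reason CHAR(5000) works as padding is simple arithmetic, sketched here in Python. The sketch assumes roughly 8096 usable bytes per 8K page (8192 minus a 96 byte header) and ignores per-row overhead and the slot array, so it is an approximation; the 100 byte original row size is a made-up example.

```python
PAGE_DATA_BYTES = 8096   # assumption: 8K page minus the 96-byte page header

def rows_per_page(row_bytes):
    """Approximate number of rows that fit on one page (at least one)."""
    return max(1, PAGE_DATA_BYTES // row_bytes)

small_row = 100                      # hypothetical row size before padding
padded_row = small_row + 5000        # same row after adding CHAR(5000)

print(rows_per_page(small_row), "rows share a page before padding")
print(rows_per_page(padded_row), "row per page after padding")
```

Once a row is larger than half a page, no two rows can share a page, so two updates never compete for the same PAGELATCH_EX.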

Test: Updates of pages

Compression      Update 1.4M   CPU Load
None - Memory    13 sec        100% one core
PAGE - Memory    54 sec        100% one core
None - I/O       17 sec        100% one core
PAGE - I/O       59 sec        100% one core

L_QUANTITY is NOT NULL

i.e. in place UPDATE

Function                                             CPU %
qsort                                                0.86
CDRecord::Resize                                     0.84
CDRecord::LocateColumnInternal                       0.36
perror                                               0.36
Page::CompactPage                                    0.36
ObjectMetadata::`scalar deleting destructor'         0.27
SearchInfo::CompareCompressedColumn                  0.24
CDRecord::InitVariable                               0.19
CDRecord::LocateColumnWithCookie                     0.18
memcmp                                               0.16
PageDictionary::ValueToSymbol                        0.16
Record::DecompressRec                                0.14
PageComprMgr::DecompressColumn                       0.14
CDRecord::InitFixedFromOld                           0.1
SOS_MemoryManager::GetAddressInfo64                  0.08
AnchorRecordCache::LocateColumn                      0.08
CDRecord::GetDataForAllColumns                       0.08
ScalarCompression::Compare                           0.07
PageComprMgr::CompressColumn                         0.07
Record::CreatePageCompressedRecNoCheck               0.06
memset                                               0.05
PageComprMgr::ExpandPrefix                           0.04
PageRef::ModifyColumnsInternal                       0.04
Page::ModifyColumns                                  0.03
DataAccessWrapper::ProcessAndCompressBuffer          0.03
SingleColAccessor::LocateColumn                      0.03
CDRecord::BuildLongRegionBulk                        0.02
ChecksumSectors                                      0.02
Page::MCILinearRegress                               0.02
DataAccessWrapper::DecompressColumnValue             0.02
SOS_MemoryManager::GetAddressInfo                    0.02
CDRecord::FindDiff                                   0.02
AnchorRecordCache::Init                              0.02
PageComprMgr::CombinePrefix                          0.01
Total                                                5.17

UPDATE Compression burners

Out of 8.55 … Approx: 60%

Compression and Locks

Xevent TraceLock Acquire/Release

High Res Timer

How long are locks held?

[Chart: lock held cycle count (CPU KCycles, 0 to 600000), Avg and StdDev, for PAGE vs. NONE compression]

• Sharing is generally bad for scale (but may be good for performance)

• PAGELATCH and LATCH diagnosis starts in sys.dm_os_latch_stats

• CXPACKET
  • Only important if throughput drops when DOP goes up
  • If this happens, look for another wait/latch

• Table partitioning can be used to work around concurrency issues

Summary Concurrency – So Far..

The Paul Randal INSERT test

160M rows, executing at concurrency

Commit every 1K:

EASY tuning?

All is as Expected?

But Page Splits are Bad, right?

= BAD!

= Better!...

WRITELOG gone? Faster?

?

?

NO!

sys.dm_os_wait_stats

And the Score Is…

[Chart: time in seconds (0 to 35000) for newguid(), newsequentialid() and IDENTITY keys]

What is going on here???

[Diagram: B-tree below HOBT_ROOT; pages are marked Min except the single Max page at the end of the index]

Tricks to Work Around this

[Diagram: key ranges 0-1000, 1001-2000, 2001-3000 and 3001-4000, each receiving its own INSERT stream]

All Cores at 100%

[Chart: runtime in seconds (0 to 35000) for newguid(), newsequen..., IDENTITY, IDENTITY + Unique, IDENTITY + Uniq..., IDENTITY..., IDENTITY... and SPID + Of... Callouts: 600K Inserts/sec and 830K Inserts/sec]

All Cores at ~100%

• Don’t use Sequential Keys
  • Page Splitting isn’t so bad
  • Neither are GUID

• Generate keys wisely. Ideally in the app server

• For “transparent” speedup, consider our old hash trick

Takeaways, INSERT workload

• Minimally Logged
• Single, large execution (thousands)
• Unsorted data
• Concurrent Loaders

BULK INSERT Workload

Heap

Bulk Insert

Bulk Insert

Measure:

SELECT * FROM sys.dm_os_latch_stats

Observe waits on

ALLOC_FREESPACE_CACHE

Theory (just read BOL):“Used to synchronize the access to a cache of pages with available space for heaps and binary large objects (BLOBs). Contention on latches of this class can occur when multiple connections try to insert rows into a heap or BLOB at the same time. You can reduce this contention by partitioning the object.”

When does BULK INSERT scale break?

[Chart: MB/sec (0 to 250) vs. number of concurrent BULK INSERTs (0 to 30), with three marked phases (1, 2, 3)]

What is Happening here?

Free Page information (PFS/GAM/SGAM)

HOBT Cache

Fat Chunks

Alloc new pages!

Bulk Insert

ALLOC_FREESPACE_CACHE

This is in DRAM and L2

• Break Up table by “some key”

• Optional: Switch out partitions

• Spin up multiple bulks

• Linear scale
  • 3GB/sec
  • 16M LINEITEM/sec

Breaking Through the Bottleneck

[Diagram: table broken up by Area (425, 555, 215, 200, 101, 453, 666), loaded by three concurrent Bulk Inserts]

BULK INSERT - Reloaded

• Thomas, you might have gotten 16M rows/sec at 3GB/sec insert speed
• But this was on heaps, I have a clustered table

• Alright then, let us hit a cluster index

[Figure: clustered and partitioned table with key ranges 1-1000, 1001-2000, 2001-3000 and 3001-4000, each range holding an X Lock]

Cluster Bulking – It seemed so plausible!

[Chart callouts: 1, 2, 3]

Cluster Bulking – Stage and Switch

[Chart callouts: 1, 2, 3]

Coffee Break

SPIN LOCKS

• Context Switching is expensive
  • Typically 10K or more CPU cycles
• If you expect the resource to be held only briefly, why fall asleep?

What is a Spinlock?

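spin_acquire(int* s)
{
    while (*s == 1) { /* spin until the lock looks free */ }
    *s = 1;  /* take the lock; a real implementation does this test-and-set atomically */
}

spin_release(int* s)
{
    *s = 0;  /* hand the lock back */
}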

• Acquire can be very expensive
• SQL Server implements a backoff mechanism

What is a backoff?

spin_acquire(int* s)
{
    int spins = 0;
    while (*s == 1) {
        spins++;
        if (spins > threshold) {
            <Sleep and WaitForResource>
        }
    }
    *s = 1;
}
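The pseudocode above leaves out the atomic test-and-set that a real spinlock needs. The following is a minimal sketch of the same idea in C11, not SQL Server's implementation: `SPIN_THRESHOLD`, `struct demo_spinlock` and the use of POSIX `sched_yield()` as the "sleep" step are all assumptions made for illustration.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <sched.h>   /* sched_yield (POSIX), stand-in for <Sleep and WaitForResource> */

/* Hypothetical tuning knob: failed attempts to burn before giving up the CPU. */
#define SPIN_THRESHOLD 1000

struct demo_spinlock {
    atomic_flag   held;      /* set while some thread owns the lock */
    unsigned long backoffs;  /* total backoffs; only updated by the lock owner */
};

void spin_init(struct demo_spinlock *l)
{
    assert(l != NULL);
    atomic_flag_clear(&l->held);
    l->backoffs = 0;
}

void spin_acquire(struct demo_spinlock *l)
{
    int spins = 0;
    unsigned long my_backoffs = 0;

    assert(l != NULL);
    /* test_and_set returns the previous value: true means someone else holds it */
    while (atomic_flag_test_and_set_explicit(&l->held, memory_order_acquire)) {
        if (++spins > SPIN_THRESHOLD) {
            sched_yield();   /* back off instead of burning more cycles */
            spins = 0;
            my_backoffs++;
        }
    }
    l->backoffs += my_backoffs;  /* safe: we own the lock at this point */
}

void spin_release(struct demo_spinlock *l)
{
    assert(l != NULL);
    atomic_flag_clear_explicit(&l->held, memory_order_release);
}
```

The `backoffs` counter here plays the same role as the backoffs column that the spinlock DMV exposes: a high count says threads are giving up on spinning, which is the symptom the rest of this section chases.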

SELECT * FROM sys.dm_os_spinlock_stats

DBCC SQLPERF(spinlockstats)

Backoff

Life at 600K INSERT/sec

WRITELOG is I/O – right?

Should be the same as this… or?

No! Because:

• Step 1: Copy sqlservr.pdb to the BINN directory
• Step 2: DBCC TRACEON (3656, -1)
• Step 3: Steal script from:

http://www.microsoft.com/en-us/download/details.aspx?id=26666

Note for 2012, you additionally need:
• sqlmin.pdb, sqllang.pdb, sqldk.pdb

Diagnosing a Spinlock the Cool way!





Spinlock Walkthrough – Extended Events Script

--Get the type value for any given spinlock type
select map_value, map_key, name
from sys.dm_xe_map_values
where map_value IN ('SOS_CACHESTORE')

--Create the event session that will capture the callstacks to a bucketizer
create event session spin_lock_backoff on server
add event sqlos.spinlock_backoff
    (action (package0.callstack)
     where type = 144 --SOS_CACHESTORE
    )
add target package0.asynchronous_bucketizer
    (set filtering_event_name='sqlos.spinlock_backoff',
         source_type=1,
         source='package0.callstack')
with (MAX_MEMORY=50MB, MEMORY_PARTITION_MODE = PER_NODE)

--Run this section to measure the contention
alter event session spin_lock_backoff on server state=start

--Wait to measure the number of backoffs over a 1 minute period
waitfor delay '00:01:00'

--To view the data
--1. Ensure the sqlservr.pdb is in the same directory as the sqlservr.exe
--2. Enable this trace flag to turn on symbol resolution
DBCC traceon (3656, -1)

--Get the callstacks from the bucketizer target
select event_session_address, target_name, execution_count, cast (target_data as XML)
from sys.dm_xe_session_targets xst
inner join sys.dm_xe_sessions xs on (xst.event_session_address = xs.address)
where xs.name = 'spin_lock_backoff'

--Clean up the session
alter event session spin_lock_backoff on server state=stop
drop event session spin_lock_backoff on server

A complete walkthrough of the technique can be found here:

http://sqlcat.com/msdnmirror/archive/2010/05/11/resolving-dtc-related-waits-and-tuning-scalability-of-dtc.aspx

Of course, you can just use 2012…

How to improve a spinlock?

[Figure: two CPUs, each with cores and L1-L3 cache. Cores on both CPUs run spin_acquire on the same int s; arrows between the CPUs are labelled "Transfer cache line"]

CoreInfo.Exe – where are my cores?

CoreInfo.exe

Revisiting the TLOG

[Figure: transaction log write path. Labels: threads T0 to Tn; LOGCACHEACCESS; Buffer Offset (cache line); Alloc Slot in Buffer; MemCpy Slot Content; Slot 1 ... Slot 126, Slot 127; LOGBUFFER; WRITELOG; LOGFLUSHQ; Writer Queue; Log Writer]

I/O Affinity Mask!

[Chart: SPID + Offset compared with SPID + Affinity, axis 0 to 250]

sp_configure 'affinity I/O mask'

Bulking at Concurrency

• What’s that spin?

xperf -on latency -stackwalk profile
xperf -d trace.etl
xperfview trace.etl

SELECT * FROM sys.dm_os_spinlock_stats
ORDER BY spins DESC

DBCC SQLPERF (spinlockstats)

?

SOS_OBJECT_STORE at high INSERT

• Observed: This Spin happens when inserting

• Need: Reduce locking overhead
• Fixes that work well here:

8x throughput

Bonus

• Let's try something really silly:

• Run lots of: EXEC emptyProc

• This should be infinitely scalable, right?

Diagnosing another Spinlock

CREATE PROCEDURE emptyProc
AS
RETURN

Initial Diagnosis

MUTEX ??? … what Mutex?

Using the Spinlock Script gives us

Some cache. Which one?

Validating the Theory

CREATE PROCEDURE emptyProc0
AS
RETURN
GO
CREATE PROCEDURE emptyProc1
AS
RETURN
GO

…

CREATE PROCEDURE emptyProc31
AS
RETURN

What is the SOS_OBJECT_STORE?

Security Check?

Validating the new “fix”…

DECLARE @ParmDef NVARCHAR(500)
DECLARE @sql NVARCHAR(500)
SET @sql = N'INSERT INTO dbo_<t>.MyBigTable_<t> WITH (TABLOCK) (c1, c2, c3, c4, c5, c6) VALUES (@p1, @p2, @p3, @p4, @p5, @p6)'
SET @sql = REPLACE(@sql, '<t>', dbo.ZeroPad(@table, 3))
SET @ParmDef = '@p1 BIGINT, @p2 DATETIME, @p3 CHAR(111), @p4 INT, @p5 INT, @p6 BIGINT'
DECLARE @constDate DATETIME = '1974-12-22'

DECLARE @i INT
WHILE (1=1)
BEGIN
    BEGIN TRAN
    SET @i = 1
    WHILE @i <= 1000
    BEGIN
        EXEC sys.sp_executesql @sql, @ParmDef, @p1 = 1, @p2 = @constDate, @p3 = 'x', @p4 = 42, @p5 = 7, @p6 = 13
        SET @i = @i + 1
    END
    COMMIT TRAN
END

Consider this Test harness code…

Spinning on MUTEX

Diagnosing with the trace flag shows the offender in the spin stack:

CSecurityContext::GetUserTokenFromCache

This is REALLY expensive at scale:
WHILE @i <= 1000
BEGIN
    EXEC sys.sp_executesql @sql, ...
    SET @i = @i + 1
END

It initializes a new execution context on every loop iteration!

Fixing the MUTEX spin

• Instead of:
WHILE @i <= 1000
BEGIN
    EXEC sys.sp_executesql @sql, ...
    SET @i = @i + 1
END

• Write:
SET @sql = N'DECLARE @i INT
WHILE (1=1)
BEGIN
    BEGIN TRAN
    SET @i = 1 -- without this @i stays NULL and the inner loop never runs
    WHILE @i <= 1000
    BEGIN
        INSERT INTO dbo_<t>.MyBigTable_<t> WITH (TABLOCK) (c1, c2, c3, c4, c5, c6)
        VALUES (@p1, @p2, @p3, @p4, @p5, @p6)
        SET @i = @i + 1
    END
    COMMIT TRAN
END'

EXEC sys.sp_executesql @sql, @ParmDef, @p1 = 1, @p2 = @constDate, @p3 = 'x', @p4 = 42, @p5 = 7, @p6 = 13

4x throughput

Bonus

• When all other bottlenecks are gone, sharing happens in the most unlikely places

• You can use spinlock Xevents inside SQL Server
  • Remember symbol files in BINN
  • Trace flag 3656
• This can also be done in XPERF for non SQL apps
  • Ex: Analysis Services

Concurrency, Spinlock Summary

• Control of buffers and NUMA for Xperf setting

• By default:
  • 4MB mem
  • Spool to disk at root of C-drive
• Can do buffer/file control:
  • -BufferSize and -MaxBuffers
  • -MaxFile and -FileMode Circular

Xperf controlling buffers

• Round robin between NUMA nodes
• Inside the NUMA node: pick the one that looks the least busy

• This is NOT a perfect system

How SQL Server assigns threads
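The two-step rule on this slide can be sketched in a few lines of C. This is only an illustration of the policy as stated here, not SQL Server's scheduler code: the `load` counter standing in for "looks the least busy", and every type and function name below, are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_SCHEDULERS_PER_NODE 64

struct numa_node {
    size_t   scheduler_count;
    unsigned load[MAX_SCHEDULERS_PER_NODE]; /* hypothetical "busyness" per scheduler */
};

struct placer {
    struct numa_node *nodes;
    size_t node_count;
    size_t next_node;   /* round-robin cursor over NUMA nodes */
};

struct placement {
    size_t node;
    size_t scheduler;
};

/* Step 1: take the next NUMA node in round-robin order.
 * Step 2: inside that node, pick the scheduler with the lowest load
 *         (ties go to the lowest index) and count the new work against it. */
struct placement assign(struct placer *p)
{
    struct placement result;
    struct numa_node *node;
    size_t best = 0;
    size_t s;

    assert(p != NULL && p->node_count > 0);
    result.node = p->next_node;
    p->next_node = (p->next_node + 1) % p->node_count;

    node = &p->nodes[result.node];
    assert(node->scheduler_count > 0);
    for (s = 1; s < node->scheduler_count; s++) {
        if (node->load[s] < node->load[best]) {
            best = s;
        }
    }
    node->load[best]++;
    result.scheduler = best;
    return result;
}
```

Note what the sketch does not do: the node is chosen without looking at load at all, so an already busy node still receives every n-th request. That is one way to read "This is NOT a perfect system".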

Xperf -on Latency+CSWITCH+DISPATCHER -stackWalk CSwitch+ReadyThread+ThreadCreate+Profile -BufferSize 1024 -MaxBuffers 1024 -MaxFile 1024 -FileMode Circular

REG ADD "HKLM\System\CurrentControlSet\Control\Session Manager\Memory Management" -v DisablePagingExecutive -d 0x1 -t REG_DWORD -f

Super Xperf

• All the tuning won't help you if your model is wrong

• Tuning gets you far, but to really scale, you need a good data model

• This is what my other courses are about

But does the Data Model Work?

Q & A

Problem Statement

[Figure: Queue Structure. An ordered row of Msg entries (300B per msg) with Push at one end and Pop at the other]

The Naïve Approach

• Push
  • Seek First Row
  • INSERT Row
• Pop
  • Seek Last Row
  • DELETE/Output

[Figure: Msg rows ordered by Key between the Min and Max ends of the index]

Why this doesn’t Scale!

[Figure: B-tree from Virtual Root and HOBT_ROOT down to the Min and Max pages, linked by Next/Prev pointers. Labels: LATCH SH on HOBT_VIRTUAL_ROOT; PAGELATCH SH and PAGELATCH EX on the pages; LCK X]

B-Tree Root Pages

Summarising the Problem

• Hot stuff
  • Root
  • Min page
  • Max page
  • Intermediate pages
• Alloc/Dealloc
• BUT: We Must have order!

Cooling it down

What if…

• Push

• Seek first value page

• UPDATE Reference Count

• Pop

• Seek last value page

• UPDATE Reference Count

[Figure: between Min and Max, Push does Msg++ and Pop does Msg--]

Dissipate the Heat

[Figure: three partitions, labelled Last Digit = 0, Last Digit = 1 and Last Digit = 2. Each has its own Min page (Msg--) and Max page (Msg++)]
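Splitting by last digit is plain modulo arithmetic on the key. The sketch below shows that consecutive sequence numbers land in different partitions, so neighbouring pushes stop fighting over the same Min and Max pages. The function names and the choice of ten partitions (one per decimal digit) are assumptions for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: one partition per last decimal digit of the key. */
#define QUEUE_PARTITIONS 10

/* Partition (0..9) that a message with this sequence number lands in. */
unsigned partition_of(uint64_t sequence)
{
    return (unsigned)(sequence % QUEUE_PARTITIONS);
}

/* Counts how many of the sequence numbers first, first+1, ..., first+count-1
 * fall into each partition. hits must have QUEUE_PARTITIONS elements. */
void spread(uint64_t first, uint64_t count, uint64_t *hits)
{
    unsigned p;
    uint64_t i;

    assert(hits != NULL);
    for (p = 0; p < QUEUE_PARTITIONS; p++) {
        hits[p] = 0;
    }
    for (i = 0; i < count; i++) {
        hits[partition_of(first + i)]++;
    }
}
```

Because the sequence increases by one each time, any run of consecutive keys is spread evenly: each partition is a separate B-tree with its own hot pages.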

Eliminating Thread Contention

[Figure: Queue Structure, ordered, holding entries 8, 7, 6, 5, 4. Push does PushSequence++ and Pop does PopSequence++]

VERY fast!

Ring Buffers

[Figure: Queue Structure, ordered. Push uses PushSequence++ Mod 100 and Pop uses PopSequence++ Mod 100]

Slot: 8  Msg: 108
Slot: 7  Msg: 107
Slot: 6  Msg: 106
Slot: 5  Msg: 105
Slot: 4  Msg: 104
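The slot numbers above follow from taking each sequence number modulo 100 (message 108 goes to slot 8). Below is a minimal single-threaded sketch of that ring buffer in C, to show the arithmetic only. In the database version each slot would be a pre-created row that is UPDATEd in place; the struct, the function names and the full/empty checks are assumptions, not taken from the slides.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define RING_SLOTS 100   /* from the slide: sequence numbers are taken Mod 100 */
#define MSG_BYTES  300   /* from the problem statement: 300B messages */

struct ring_queue {
    uint64_t push_seq;   /* sequence number of the next message to write */
    uint64_t pop_seq;    /* sequence number of the next message to read  */
    char     slots[RING_SLOTS][MSG_BYTES];
};

void ring_init(struct ring_queue *q)
{
    assert(q != NULL);
    memset(q, 0, sizeof *q);
}

/* Overwrites the slot for the next push sequence number.
 * Returns the slot used, or -1 if all slots hold unread messages. */
int ring_push(struct ring_queue *q, const char *msg)
{
    int slot;

    assert(q != NULL && msg != NULL);
    if (q->push_seq - q->pop_seq == RING_SLOTS) {
        return -1;                                   /* full */
    }
    slot = (int)(q->push_seq % RING_SLOTS);
    strncpy(q->slots[slot], msg, MSG_BYTES - 1);     /* truncates long messages */
    q->slots[slot][MSG_BYTES - 1] = '\0';
    q->push_seq++;
    return slot;
}

/* Copies the oldest unread message into out (at least MSG_BYTES bytes).
 * Returns the slot read, or -1 if the queue is empty. */
int ring_pop(struct ring_queue *q, char *out)
{
    int slot;

    assert(q != NULL && out != NULL);
    if (q->pop_seq == q->push_seq) {
        return -1;                                   /* empty */
    }
    slot = (int)(q->pop_seq % RING_SLOTS);
    memcpy(out, q->slots[slot], MSG_BYTES);
    q->pop_seq++;
    return slot;
}
```

Nothing is ever allocated or freed after `ring_init`: push and pop only overwrite and read existing slots, which is the point of replacing INSERT/DELETE with UPDATE.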

Summing Up Message Queue Hack

• UPDATE
  • instead of INSERT/DELETE
• More partitions = More B-Trees
• Ring buffer using modulo
• Find Sweet spot concurrency

Technology
