Storage Performance on SQL Server
Joe Chang
Coverage

Emphasis is on Line of Business DB
Different priorities for less critical apps
Performance
Fault-tolerance covered elsewhere
Overview

IO Performance Objectives
The Complete Storage Environment
Direct-Attach and SAN
Storage Components
Disk Performance
SQL Server IO Characteristics
Configuration Examples
SSD
Old Rules

Meet transaction throughput
Disk Performance Criteria
Read from Data, Write to Logs
Separate Data and Log files?
Disk Queue Depth < 2 per disk
Prevalent use of SAN
LUNs with unknown number of disks
Latency (Avg Disk Sec/Read)
Storage Performance Criteria

SELECT (Read) Query
Data must be read into buffer cache if not already in cache – read from data
INSERT/UPDATE/DELETE (Write) Query
Data must be read into buffer cache
Transaction must be written to log
Buffer is marked as dirty; lazy writer handles the data write
Large Query (as necessary)
Write and Read to tempdb
Flashback: 1994 - 2009

1994: Pentium 100MHz
64MB, 4 x 16MB SIMM ($700+ each?)
OS + DB executable ~ 16-24MB
Net: 40MB buffer cache
Difficult to support transactions
Reports run on 1st of month
Today: 4 x Quad Core
128GB, 32 x 4GB, $4800
3000X increase in buffer cache
Requirements Then and Now

Old: Support transactions
No longer really an issue for most environments (after proper SQL tuning!)
Today: Minimize disruptions to transactions
Large query or table scan while supporting transactions
Checkpoint – write dirty buffers to data
Transaction Log backup
Backup & Restore
Cost versus Value/Requirements

Money is no object:
With a sufficient number of disks, IO channels, and proper configuration, it is possible to avoid most disruptions
Otherwise – manage IO disruptions
Establish tolerable disruptions: 5-30 seconds?
Large reports run off-hours
Configure sufficient performance to handle transient events
Complete Storage Environment: Direct Attach and SAN

Most Common Mistakes

Storage sized to capacity requirements only
2 HBA (or RAID Controllers)
Too few big-capacity disk drives
Instead:
Fill system PCI-E slots with controllers
Many small 15K drives (146GB 3.5in or 73GB 2.5in)
Direct Attach

System IO capability is distributed across multiple PCI-E slots.
A single controller does not have sufficient IO
A single (or even dual) SAS/FC port does not have sufficient IO
Distribute IO over multiple:
PCI-E channels
Controllers (SAS or FC)
Dual-port SAS or FC
Disk Array Enclosures (DAE)
Do not daisy chain (shared SAS/FC) until all channels are filled!
[Diagram: server system with four CPUs, two IO hubs, multiple PCI-E slots, dual-port SAS HBAs, and SAS links fanning out to disk enclosures]
SAN

A SAN is really computer system(s), typically connected by FC to host and storage
Can be fault-tolerant in all components and paths: HBA, cables, switches, SP, disks
No special performance enhancements
Slight degradation (excessive layers)
Write cache is mirrored between SPs – really important!
Distribute load over all front-end and back-end FC ports
[Diagram: server system with HBAs in multiple PCI-E slots, redundant FC paths to SAN service processors SP A and SP B, and FC back-end ports to the disks]
Direct Attach & SAN

Direct Attach
RAID Controller in server
Fault-tolerant disks, sometimes controller/path, 2-node clusters

SAN
Host Bus Adapter, (switches), Service Processor
Full component and path fault tolerance
Multi-node clusters
SAN Vendor View

One immensely powerful SAN serving the storage needs of all servers
Storage consolidation – centralize management and minimize unused space
Problem is: the SAN is not immensely powerful
What happens if a LUN for another server fails, and a restore from backup is initiated during busy hours?
[Diagram: a single SAN behind one switch, shared by DW/BI, Email, Web, QA DB, OLTP DB, and SharePoint servers]
Proper View

Nothing should disrupt the operation of a line-of-business server
Data Warehouse should not be mixed with transaction processing DB
Consider multiple storage systems for very large IOPS loads instead of a single SAN
[Diagram: DW/BI and OLTP each on dedicated storage, separate from the SAN shared by SharePoint and file servers]
Storage Systems

SAN Entry: HP MSA 2000 (Dell MD 3000)
SAN Mid-range: EMC CLARiiON, HP EVA, NetApp FAS3100
SAN Enterprise: EMC DMX, Hitachi, 3PAR, FAS6000
Direct Attach High Density: HP MSA 50, 70, Dell MD 1120
Direct Attach: HP MSA 60, Dell MD 1000
EMC CLARiiON
CX4 highlights:
High-performance Flash drives
Spin Down – low power SATA II drives
Adaptive Cooling
Virtual Provisioning
= Capacity optimization, energy efficiency
Multi-core processors, increased memory
64-bit FLARE, up to 960 drives
= Up to twice the performance and scale

[Diagram: two CLARiiON service processors linked by x8 CMI, each with SPS, power supply, a CPU module (multi-core processors, memory) and an IO complex of iSCSI and Fibre Channel modules; LCCs in the disk enclosures]
EMC DMX

Cache
If system memory is 128GB, what do you expect to find in the 16GB SAN cache that is not in the buffer cache?
Performance benchmarks
Most use direct-attach storage
With SAN: cache disabled
Alternative: tiny read cache, almost all allocated to write
Complete Environment Summary

Server System
Memory bandwidth
IO bandwidth, ports, PCI-E slots
Pipes/channels from server to storage
Storage System
RAID controller, etc.
Pipes to disk drives
Disk drives
If system memory is 128GB, what do you expect to find in the 16GB SAN cache that is not in the buffer cache?
Storage Components
Storage Components/Interfaces

System IO
HBA and RAID Controllers
SAS (3Gbit/s going to 6), FC (4Gbit/s to 8)
Storage Enclosures (DAE)
Disk Drives
SAN – Systems
SAN – Switches
Server Systems: PCI-E Gen 1

PCI-E Gen 1: 2.5Gbit/s per lane, bi-directional
Dell PowerEdge 2950 – 2 x8, 1 x4
Dell PowerEdge R900 – 4 x8, 3 x4 (shared)
HP ProLiant DL385G5p – 2 x8, 2 x4
HP ProLiant DL585G5 – 3 x8, 4 x4
HP ProLiant DL785G5 – 3 x16, 3 x8, 5 x4
Most PCI-E slots have dedicated bandwidth; some may be shared bandwidth (with an expander chip)
Server Systems: PCI-E Gen 2

PCI-E Gen 2: 5.0Gbit/s per lane
x4: 2 GB/sec in each direction
Dell PowerEdge R710 – 2 x8, 2 x4
Dell PowerEdge R910 (?)
HP ProLiant DL370G6 – 2 x16, 2 x8, 6 x4
Intel 5520 chipset: 36 PCI-E Gen 2 lanes, 1 ESI (x4)
ProLiant ML/DL 370G6 has 2 5520 IOH devices
Disk Drives

Rotational speed – 7200, 10K, 15K RPM
Average rotational latency: 4, 3, 2 milli-sec
Average seek time: 8.5, 4.7, 3.4ms (7200, 10K, 15K RPM)
2.5in 15K: 2.9ms avg seek
Average random access time:
Rotational + Seek + Transfer + Overhead
Native Command Queuing
Disk Interfaces

SATA – mostly 7200RPM
A SATA disk can be used in a SAS system
A SATA adapter cannot connect to a SAS disk
SAS – 15K
3.5in LFF, 2.5in SFF
Currently 3 Gbit/s, next gen: 6 Gbit/s
FC – typically in SAN
4 Gbit/s, next: 8 Gbit/s
Disk Drives (3.5in, LFF)

Platter diameters: 95mm, 84mm, 65mm
7200RPM, 1TB: Barracuda 12: 8.5ms, 125MB/s; Barracuda LP (5900RPM): 95MB/s
10,000RPM: 5ms – end of life?
15,000RPM: 3.4ms, 146/300/450GB, 167MB/sec
Lower RPM drives have higher bit density and larger platters, contributing to very low $/GB. Desktop drives are rated for 2 years @ 20% duty cycle, server drives for 5 years @ 100%.
Seagate Drives

Cheetah 3.5in LFF drives
15K.4 36/73/146GB, 3.5/4.0ms, 95? MB/sec
15K.5 73/146/300GB, 3.5/4.0ms, 125-73 MB/sec
15K.6 146/300/450GB, 3.4/3.9ms, 171-112 MB/sec
15K.7 300/450/600GB

Savvio 2.5in SFF drives
15K.1 36/72GB, 2.9/3.3ms, 112-79 MB/sec
15K.2 73/146GB, 2.9/3.3ms, 160-120 MB/s
Also: Barracuda ES, Savvio 10K.3
Dell PowerVault

Dell PowerVault MD 1000 – 15 x 3.5in; $7K for 15 x 146GB 15K drives
Dell PowerVault MD 1120 – 24 x 2.5in; $11K for 24 x 73GB 15K drives
HP MSA
MSA 60: 12 LFF drives
MSA 70: 25 SFF drives
Direct Attach Cluster Capable

Dell PowerVault MD 3000 – 15 x 3.5in, 2 internal dual-port RAID controllers; $11.5K for 15 x 146GB 15K drives
Listed as Direct Attach, but essentially an entry SAN
PCI-E SAS RAID Controllers

First Generation
PCI-E host interface, PCI-X SAS controller, PCI-E to PCI-X bridge
800MB/sec
Second Generation
Native PCI-E to SAS
1.6GB/sec in x8 PCI-E, 2 x4 SAS ports
FC HBA

QLogic QLE2562 – dual-port 8Gb/s FC, x8 PCI-E Gen 2
QLogic QLE2462 – dual-port 4Gb/s FC, x4 PCI-E Gen 1
QLogic QLE2464 – quad-port FC, x8 PCI-E Gen 1
Emulex LPe12002, LPe11002/11004
Disk Performance
Random IO Theory: Queue Depth 1

Drive    Rotational latency (ms)  Avg seek (ms)  8KB transfer (ms)  Total (ms)  IOPS
7200     4.17                     8.5            0.06               12.7        78.6
10K      3.0                      4.7            0.07               7.77        128.7
15K      2.0                      3.4            0.05               5.45        183.6
15K SFF  2.0                      2.9            0.05               4.95        202
IO rate based on data distributed over the entire disk, accessed at random, one IO command issued at a time; not accounting for other delays
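The queue-depth-1 figures above come straight from summing the latency components; a quick sketch (inputs taken from the table, small rounding differences aside):

```python
# Theoretical queue-depth-1 random IOPS: one IO in flight at a time,
# so IOPS = 1000 / (rotational latency + average seek + transfer),
# with all times in milliseconds.
drives = {
    "7200":    (4.17, 8.5, 0.06),
    "10K":     (3.0,  4.7, 0.07),
    "15K":     (2.0,  3.4, 0.05),
    "15K SFF": (2.0,  2.9, 0.05),
}

for name, (rotate, seek, transfer) in drives.items():
    total = rotate + seek + transfer      # ms per random 8KB IO
    iops = 1000.0 / total                 # serial issue: no overlap
    print(f"{name:8s} {total:5.2f} ms  {iops:6.1f} IOPS")
```

This is why a lone 15K drive tops out under ~200 random IOPS at queue depth 1 no matter how fast its interface is.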
Other Factors
Short Stroke:
Data is distributed over a fraction of the entire disk
Average seek time is lower (track-to-track minimum)
Command Queuing:
More than one IO issued at a time,
Disk can reorder individual IO accesses, lowering access time per IO
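As a rough illustration of why short-stroking helps, one simple model (an assumption for illustration, not from the slides) is that average seek time grows with the square root of the seek span:

```python
import math

# Hypothetical short-stroke model: seek = settle overhead +
# coefficient * sqrt(fraction of full stroke used). Calibrated so the
# full-stroke average seek is 3.4 ms with an assumed ~1 ms settle
# time (15K LFF class drive from the earlier table).
SETTLE_MS = 1.0
COEFF = 3.4 - SETTLE_MS          # so avg_seek_ms(1.0) == 3.4

def avg_seek_ms(used_fraction):
    return SETTLE_MS + COEFF * math.sqrt(used_fraction)

for frac in (1.0, 0.25, 0.1):
    seek = avg_seek_ms(frac)
    total = 2.0 + seek + 0.05    # + rotational latency + 8KB transfer
    print(f"{frac:4.0%} of disk: seek {seek:4.2f} ms, {1000/total:5.1f} IOPS")
```

The exact curve varies by drive; the point is that confining data to a fraction of the platter cuts the seek component, and command queuing then reduces it further by reordering accesses.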
8K Random IOPS vs Utilization

[Chart: 8K random IOPS (0-600) for space utilization from 88% down to 1.4%, at queue depths Q1 through Q64]
IOPS for range of queue depth and space utilization
Latency versus Queue Depth

[Chart: latency (0-180ms) versus queue depth Q1 through Q64, for space utilization from 88% down to 1.4%]
Latency versus queue depth for range of space utilization
Disk Summary

Frequently cited rules for random IO:
Apply to queue depth 1, data spread across the entire disk
Key factors:
Short-stroke
High queue depth
SAN
A complex SAN may hide short-stroke and high-queue behavior
SQL Server IO Patterns
SQL Server IO

Transactional queries: Read/Write
Reporting / DW queries
Checkpoints
T-Log backups
Differential/Full backups
Transactional Query

Few rows involved
SELECT xx FROM Table WHERE Col1 = yy
Execution plan has bookmark lookup or loop joins
IO for data not in buffer cache:
8KB, random, issued 1 at a time, serially (5ms min latency)
(up to around 24-26 rows)
Even if the LUN has many disks, IO depth is 1!
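The practical impact of queue-depth-1 IO: each key lookup that misses the buffer cache pays a full random-access latency before the next IO is even issued. A back-of-envelope sketch (the 5 ms floor and serial issue are from the slide; the cache-hit parameter is illustrative):

```python
# Serial (queue depth 1) key lookups: each cache miss costs one full
# random disk access before the next IO can be issued, no matter how
# many disks sit behind the LUN.
LATENCY_MS = 5.0          # ~minimum random read latency

def serial_lookup_ms(rows, cache_hit_ratio=0.0):
    misses = rows * (1.0 - cache_hit_ratio)
    return misses * LATENCY_MS

print(serial_lookup_ms(25))        # ~125 ms for 25 rows, all misses
print(serial_lookup_ms(25, 0.8))   # ~25 ms if 80% already cached
```

So a 25-row transactional query against a cold cache takes on the order of an eighth of a second in IO wait alone.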
Large Query

Plan has bookmark lookup or loop join
Uses scatter-gather IO for more than (approximately) 30 rows
Depending on Standard or Enterprise Edition
Multiple IOs issued with one call, generating high queue depth
A query for 100 rows can run faster than one for 20!
High row count non-clustered index seek: are key lookups really random? Build the index with care; only highly selective SARGs in the key.
Tempdb

A large query may spool intermediate results to tempdb
Sequence of events is:
Read from data
Write to tempdb
Read from tempdb (sometimes)
Repeat
Disk load is not temporally uniform!
Data and tempdb should share a common pool of Disks/LUNs
Checkpoint

Dirty data buffers written to disk
User does not wait on the data write
SQL Server should throttle checkpoint writes
But a high queue depth of writes may result in high-latency reads
Log Backup

Disrupts sequential log writes

Update

Problem in SQL Server 2000
UPDATE uses non-clustered index
Plan does not factor in key lookups
Execution – fetch one row at a time, ~5-10ms per key lookup
Storage Configuration Examples
General Strategy – Distribute IO

Distribute IO across multiple PCI-E slots
Distribute IO across multiple HBA/Controllers
Distribute IO across many disk drives
Daisy-chain DAE only after all channels are filled
High transaction (write) volume: dedicate HBA/controller, SAN SP, disk drives for logs?
LFF or SFF disks

LFF: 12-15 disks per enclosure
SFF: 24-25 disks per enclosure
15 disks on x4 SAS: total bandwidth 800MB/s, 53MB/s per disk
24 disks on x4 SAS: 33MB/s per disk
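The per-disk figures are just the x4 SAS channel bandwidth divided by the number of drives sharing it:

```python
# A x4 SAS (3Gbit/s per lane) channel delivers roughly 800MB/s usable;
# every drive on the channel shares that ceiling for sequential IO.
CHANNEL_MB_S = 800.0

def per_disk_mb_s(disks):
    return CHANNEL_MB_S / disks

print(f"{per_disk_mb_s(15):.0f} MB/s per disk (15 LFF disks)")
print(f"{per_disk_mb_s(24):.0f} MB/s per disk (24 SFF disks)")
```

This is the trade-off the slide implies: SFF enclosures pack more spindles (good for random IOPS) but dilute sequential bandwidth per disk.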
Minimum for Line-of-Business

2 x Xeon 5500 or 5400 series
64-72GB memory
4 SAS RAID Controllers in x4 or x8 PCI-E slots
$11-13K
4 x 15-disk enclosures, 60 x 146GB 15K drives
6TB capacity (3+1 RAID 5), 600GB database
3GB/sec sequential
30K IOPS short-stroke, peak
$28K
SQL Server Ent License $50K
12-15 disks per x4 SAS port: 800-1000MB/sec bandwidth
SAN option: 2 dual-port FC HBA, EMC CLARiiON CX4-240, 4 DAE
Intermediate

1 DAE per controller in x4 PCI-E slots
2 DAE per controller in x8 PCI-E slots, use both SAS ports, 1 DAE per x4 SAS
Daisy-chain DAE only for very high disk counts
SAN example: CLARiiON CX4-480, 3 dual-port HBA, 6 DAE
4 x Xeon 7400 series, 128GB memory, 4 SAS RAID Controllers, $25K
6 LFF (3.5in) disk enclosures, 90 x 73GB 15K drives
9TB capacity (3+1 RAID 5), 900GB database
3GB/sec+ sequential, 45K IOPS short-stroke peak
$42K
SQL Server Ent License $100K
SFF disks for Heavy Random IO

4 x Xeon 7400 series, 128GB memory, 4 SAS RAID Controllers (x4 and x8 PCI-E slots), $25K
6 SFF (2.5in) disk enclosures, 144 x 73GB 15K drives
7TB capacity (3+1 RAID 5), 700GB database
3GB/sec+ sequential, 70K IOPS short-stroke peak
$66K
SQL Server Ent License $100K
Really Serious DW

8 x Opteron 8400 series, 256GB memory, 8 SAS RAID Controllers, $80K
Or Unisys, NEC, IBM
14 SFF (2.5in) disk enclosures, 336 x 73GB 15K drives
16TB capacity (3+1 RAID 5)
7-9GB/sec+ sequential
1.6TB database: 160K IOPS peak; 3.2TB: 130K IOPS peak
$154K
SQL Server Ent License $200K
Needs lots of IO bandwidth and slots – more than a 4-way Xeon 7400 series with the 7300 chipset can handle
SAN – CLARiiON examples

Minimum (disks): CX4-240, 2 dual-port FC HBA, 4 DAE
Intermediate (120 disks): CX4-480, 4 dual-port FC HBA, 8 DAE
High-bandwidth DW (240 disks): CX4-960, 2 quad + 4 dual-port FC HBA, 16 DAE
Very high random IO (480 disks): CX4-960, 2 quad + 4 dual-port HBA, 32 DAE
Storage Performance Verification

What To Test

Sequential
Random: low queue, high queue
High row count UPDATE with non-clustered index
Checkpoint writes
Full-stroke and short-stroke

Cache Settings

Read: Read-Ahead, Adaptive Read-Ahead, or None
Write: Write-Back or Write-Through
Recommended: Read – none or very small (2MB/LUN); Write – Write-Back

SAN - HBA Settings

NumberOfRequests
Default – 32? Prevents multiple hosts from overloading the SAN
Match to number of disks to control queue depth?
MaxSGList
SSD
SSD Types

DRAM – fastest, most expensive
NVRAM:
SLC – more expensive per GB, higher write performance
MLC – low cost per GB
Interfaces:
SAS
PCI-E (Fusion-io, 1GB/sec, 120K IOPS+)
Complete SAN (Texas Memory Systems)
SSD

Intel X25-E, 32 & 64GB
Sequential Read 250MB/s, Write 170MB/s
Random Read: 35,000 IOPS @ 4KB
Random Write: 3,300 IOPS @ 4KB
Good but not spectacular
Latency: 75us Read, 85us Write
Really helpful for serial Queue Depth 1 accesses
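Why low latency matters so much for serial access: at queue depth 1 the rate is simply 1 / latency, since each IO waits for the previous one to finish:

```python
# Queue-depth-1 throughput is latency-bound: IOPS = 1000 / latency_ms.
def qd1_iops(latency_ms):
    return 1000.0 / latency_ms

print(f"15K disk (~5.45 ms): {qd1_iops(5.45):8.0f} IOPS")
print(f"SSD read (0.075 ms): {qd1_iops(0.075):8.0f} IOPS")
```

An SSD's ~75 microsecond read latency buys roughly a 70x improvement for the serial key-lookup pattern described earlier, even though its random-IOPS spec looks "good but not spectacular".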
SQL Server IO Cost Structure

Key Lookup, Loop Join:
4-5 micro-sec in-memory
15-25us for 8K read from disk + eviction
45us for 64K read due to cold cache
SSD and RAID

Does an SSD need to be in RAID?
A disk drive is fundamentally a single device
Motor or media failure results in loss of the drive
An SSD is not required to be a single device
Composed of an SoC interfacing SAS to NVRAM
Dual SoC plus ECC with chip-kill could make an SSD fault-tolerant
Additional Slides

Partition Alignment

http://blogs.msdn.com/jimmymay/default.aspx
Misaligned theory:
With 64K stripe, warm cache, 8KB IO: on average every 8 random IO accesses generate 10 actual IOs – 25% gain from alignment
With 64K stripe, cold cache, 64KB IO: every disk access generates 2 IOs – 100% gain from alignment
RAID Theory

Operation    RAID 0  RAID 1+0  RAID 5
Read         1       1         1
Small Write  1       1/2       1/4
Large Write  1       1/2       1 - 1/N

Theoretical performance per drive for N drives in a RAID group.
RAID 5 small write: 1 read data, 1 read parity, 1 write data, 1 write parity. The write penalty is reduced if an entire stripe can be written.
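The table above translates directly into effective random-write throughput for a group; a small sketch (the 8-drive group and 183 IOPS per drive are example inputs, the per-write IO costs follow the table):

```python
# Effective small random-write IOPS for a RAID group: each host write
# costs 2 disk IOs in RAID 1+0 (two mirrored writes) and 4 in RAID 5
# (read data, read parity, write data, write parity).
def raid_write_iops(disks, disk_iops, ios_per_write):
    return disks * disk_iops / ios_per_write

DISK_IOPS = 183    # ~15K drive at queue depth 1, from the earlier table
for name, cost in (("RAID 0", 1), ("RAID 1+0", 2), ("RAID 5", 4)):
    iops = raid_write_iops(8, DISK_IOPS, cost)
    print(f"{name:8s}: {iops:6.0f} small-write IOPS (8 drives)")
```

This is why write-heavy workloads (logs, tempdb) favor RAID 1+0 while read-mostly data can live on RAID 5.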
EMC CLARiiON CX4 Family

                     CX4-120        CX4-240        CX4-480        CX4-960
SP CPU               1x1.2GHz DC    1x1.6GHz DC    1x2.2GHz DC    2x2.3GHz QC
System memory        6GB            8GB            16GB           32GB
Memory per SP        3GB            4GB            8GB            16GB
Max cache            600MB          1.264GB        4.5GB          10.76GB
Max write cache      600MB          1.264GB        4.5GB          10.76GB
CMI                  x4             x4             x8
Front-end base       4 FC + 4 iSCSI 4 FC + 4 iSCSI 8 FC + 4 iSCSI 8 FC + 4 iSCSI
Back-end base        2 FC           4 FC           8 FC           8 FC
Max drives           120            240            480            480-960
Total IO slots       6              8              10             12
IO populated in base 4              4              6              6
Front-end FC ports   12             12             16             24
Back-end FC          2              4              8              16
Max iSCSI            8              12             12             16
NetApp

Write Anywhere File Layout (WAFL)
Very different characteristics
Overrides many standard database strategies
No need to defragment – see NetApp-specific documents
Index rebuild to clean up unused space may still be helpful
Enterprise SAN

Massive cross-bar
RAID groups: RAID 5 3+1 or 7+1, RAID 10 2+2 or 4+4
Hyper Volume: 16GB slices from a RAID group
LUNs created from Hyper Volumes
Theory: a massive number of disks, say 1000, can do 150K IOPS. Each server averages 10K IOPS steady, with surges to 50K. Many servers can share a large SAN.
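The consolidation arithmetic above looks fine on averages but breaks down under surges; a quick sketch using the slide's numbers:

```python
# Shared-SAN headroom: servers draw STEADY IOPS on average but surge
# to SURGE. How many servers fit if k of them surge at once?
SAN_IOPS = 150_000
STEADY, SURGE = 10_000, 50_000

def max_servers(concurrent_surges):
    # concurrent_surges servers at SURGE, the remainder at STEADY
    remaining = SAN_IOPS - concurrent_surges * SURGE
    return concurrent_surges + remaining // STEADY

for k in (0, 1, 2, 3):
    print(f"{k} simultaneous surges: up to {max_servers(k)} servers")
```

With no surges the pool supports 15 servers, but allowing even three simultaneous surges drops that to 3 - which is exactly the earlier point about a restore or large report on one server disrupting every other host on the SAN.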
Table Scan to Disk

[Chart: table scan throughput in MB/sec (0-1,600) for Default, RowLock, PagLock, TabLock, and NoLock hints – SQL 2000 vs SQL 2005, clustered index scan vs heap table scan]
Low Queue Writes

Read activity drops sharply during checkpoints (4 x 15K SCSI disks)

Updates – All data in memory

Checkpoint does not slow the SQL batch when no reads are required
HP Test System 2

[Diagram: rx8620 server, 16 Itanium 2 1.5GHz, 8 x 2Gb/s FC ports to 4 x HSV110 controllers, plus 6 SCSI disks]