Parallel Execution Plans
Joe Chang, www.qdpma.com

About Joe Chang

SQL Server Execution Plan Cost Model
  True cost structure by system architecture
  Decoding statblob (distribution statistics)
SQL Clone – statistics-only database
Tools
  ExecStats – cross-reference index use by SQL execution plan
  Performance Monitoring, Profiler/Trace aggregation

So you bought a 64+ core box

Learn all about Parallel Execution
  All guns (cores) blazing
  Negative scaling
  Super-scaling
  High degree of parallelism & small SQL
  Anomalies, execution plan changes, etc.

Compression
Partitioning

Now

No, I have not been smoking pot

Yes, this can happen. How will you know?

How much CPU do I pay for this?

Great management tool, what else?

Parallel Execution Plans
This should be a separate slide deck

Execution Plan Quickie

Cost is duration in seconds on some reference platform
IO cost for scan: 1 = 10,800KB/s, so 810 implies 8,748,000KB
IO in Nested Loops Join: 1 = 320/s, multiple of 0.003125

[Figure: Estimated Execution Plan (F4), I/O and CPU cost components]
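As a rough check on these numbers, the conversion between IO cost and data volume can be sketched in Python. This is a minimal sketch assuming only the linear rates quoted above (10,800 KB/s for scans, 320 random IOs/s) and an 8 KB page; the function names are illustrative, and any fixed per-operator terms in the real optimizer formula are ignored.

```python
# Rates quoted for the reference platform (not measurements of your hardware)
SCAN_KB_PER_SEC = 10_800    # 1 IO-cost unit = 10,800 KB of sequential scan
PAGE_KB = 8                 # SQL Server data page size in KB
RANDOM_IOS_PER_SEC = 320    # 1 IO-cost unit = 320 random IOs (loop join inner side, key lookup)


def scan_io_cost(pages):
    """IO cost of scanning `pages` 8 KB pages at the reference sequential rate."""
    return pages * PAGE_KB / SCAN_KB_PER_SEC


def scan_kb_from_cost(io_cost):
    """Inverse: the data volume in KB implied by a scan IO cost."""
    return io_cost * SCAN_KB_PER_SEC


def random_io_cost(ios):
    """IO cost of `ios` random page reads; each one costs 1/320 = 0.003125."""
    return ios / RANDOM_IOS_PER_SEC


if __name__ == "__main__":
    print(round(scan_io_cost(1_093_729), 2))  # 810.17
    print(scan_kb_from_cost(810))             # 8748000
    print(random_io_cost(1))
```

The first call reproduces the 810.17 scan cost of the 1,093,729-page test table used on the next slide.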

Index + Key Lookup - Scan

(926.67- 323655 * 0.0001581) / 0.003125 = 280160 (86.6%)

Actual CPU Time (Data in memory)
LU     1919   1919
Scan   8736   8727

1,093,729 pages/1350 = 810.17 (8,748MB)

True cross-over approx. 1,400,000 rows
1 row : page
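The arithmetic on this slide splits the Key Lookup cost into a per-row CPU part and a per-IO part, and then compares measured CPU times to find the true cross-over. Both calculations can be sketched as follows. The sketch assumes 0.0001581 is the per-row CPU cost and 0.003125 the per-IO cost used in the formula above. It also assumes the LU time of 1919 was measured at the same 323,655 rows and scales linearly with rows, while the scan time stays fixed. The helper names are hypothetical.

```python
KEY_LOOKUP_CPU_PER_ROW = 0.0001581  # per-row CPU cost used in the slide's formula
RANDOM_IO_COST = 0.003125           # cost of one random IO (1/320)


def implied_lookup_ios(lookup_cost, rows):
    """Random IOs the optimizer charged, after removing the per-row CPU cost."""
    return (lookup_cost - rows * KEY_LOOKUP_CPU_PER_ROW) / RANDOM_IO_COST


def measured_crossover_rows(lookup_time, lookup_rows, scan_time):
    """Row count at which lookups would take as long as the scan.

    Assumes lookup time grows linearly with rows and scan time is fixed
    (data in memory, as in the measurement above).
    """
    return scan_time * lookup_rows / lookup_time


if __name__ == "__main__":
    ios = implied_lookup_ios(926.67, 323_655)
    print(round(ios), f"{ios / 323_655:.1%}")
    print(round(measured_crossover_rows(1919, 323_655, 8736)))
```

The first line prints `280160 86.6%`, matching the slide. The measured cross-over comes out near 1.47 million rows, consistent with the approximate 1,400,000 quoted.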

Index + Key Lookup - Scan

8,748,000KB / 8 / 1350 = 810
(817 - 280326 * 0.0001581) / 0.003125 = 247259 (88%)

Actual CPU Time
LU     2138    321
Scan  18622    658
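Dividing CPU time by elapsed time gives a rough measure of how many cores a parallel query kept busy, which bears on the question of how much CPU the parallel plan costs. This is a sketch only. It assumes the second number in each row above is elapsed time in the same units as CPU time; the slide does not label the columns. The function name is hypothetical.

```python
def effective_parallelism(cpu_time, elapsed_time):
    """Average number of cores kept busy: total CPU time / wall-clock time."""
    return cpu_time / elapsed_time


# (CPU, assumed elapsed) pairs from the parallel measurement above
measurements = {"LU": (2138, 321), "Scan": (18622, 658)}

if __name__ == "__main__":
    for plan, (cpu, elapsed) in measurements.items():
        print(plan, round(effective_parallelism(cpu, elapsed), 1))
```

Under that assumption, the lookup plan averages about 6.7 busy cores and the scan about 28.3.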

Actual Execution Plan

Note: Actual Number of Rows, Rebinds, Rewinds

[Figure: actual vs. estimated execution plan properties]

Row Count and Executions

For the Loop Join inner source and Key Lookup, Actual Number of Rows = Number of Executions × Number of Rows (per execution)

[Figure: Loop Join, outer and inner source]
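Because the actual row count on the inner source is a total over all executions, it has to be divided by the number of executions before it is compared with a per-execution estimate. A minimal sketch, assuming the estimate being compared is per execution; the function names are hypothetical.

```python
def rows_per_execution(actual_rows, executions):
    """Actual Number of Rows on the inner side is summed over all executions."""
    return actual_rows / executions if executions else 0.0


def misestimate_factor(actual_rows, executions, estimated_rows_per_exec):
    """Ratio of actual to estimated rows, compared on a per-execution basis."""
    return rows_per_execution(actual_rows, executions) / estimated_rows_per_exec


if __name__ == "__main__":
    # Hypothetical inner source: 1,000 executions returning 5,000 rows in total
    print(rows_per_execution(5000, 1000))        # 5.0
    print(misestimate_factor(5000, 1000, 2.5))   # 2.0
```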

Parallel Plans

Parallelism Operations

Distribute Streams: non-parallel source, parallel destination

Repartition Streams: parallel source and destination

Gather Streams: destination is non-parallel
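The three operations can be read as a decision on whether the producer and the consumer of a row stream run in parallel. A sketch of that mapping follows. The `same_partitioning` flag is an added assumption covering the case where both sides are parallel and already partitioned compatibly, so no rows need to move. All names are hypothetical.

```python
from enum import Enum
from typing import Optional


class Exchange(Enum):
    DISTRIBUTE_STREAMS = "Distribute Streams"    # non-parallel source, parallel destination
    REPARTITION_STREAMS = "Repartition Streams"  # parallel source and destination
    GATHER_STREAMS = "Gather Streams"            # destination is non-parallel


def exchange_for(source_parallel, dest_parallel, same_partitioning=False) -> Optional[Exchange]:
    """Pick the Parallelism operation needed between a producer and a consumer."""
    if source_parallel and dest_parallel:
        # Both sides parallel: rows only need to move if the partitioning differs
        return None if same_partitioning else Exchange.REPARTITION_STREAMS
    if source_parallel:
        return Exchange.GATHER_STREAMS
    if dest_parallel:
        return Exchange.DISTRIBUTE_STREAMS
    return None  # serial to serial: no Parallelism operation


if __name__ == "__main__":
    print(exchange_for(True, False).value)  # Gather Streams
```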

Parallel Execution Plans

Note the gold circle with double arrows on parallel operators, and the Parallelism operations

Parallel Scan (and Index Seek)

[Figure: scan plan cost at DOP 1, 2, 4 and 8]

IO cost is the same
CPU cost is reduced by the degree of parallelism, except there is no further reduction at DOP 16

[Figure annotations: CPU cost reduced 2X, 4X, 8X]

IO contributes most of the cost!
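The costing behavior described here (IO cost unchanged, CPU cost divided by DOP) can be sketched as a one-line model. The cap of 8 on the CPU divisor is a hypothetical parameter chosen to mimic the observation that DOP 16 gave no further reduction on this test system; it is not a documented optimizer setting. The IO and CPU figures in the example are illustrative, not taken from the plans shown.

```python
def parallel_scan_cost(io_cost, cpu_cost, dop, cpu_divisor_cap=8):
    """Scan plan cost: IO cost stays the same, CPU cost is divided by DOP (capped)."""
    return io_cost + cpu_cost / min(dop, cpu_divisor_cap)


if __name__ == "__main__":
    # Hypothetical scan: IO cost 810, CPU cost 40
    for dop in (1, 2, 4, 8, 16):
        print(dop, parallel_scan_cost(810.0, 40.0, dop))
```

Because IO dominates, the total only drops from 850 to 815 in this example, even at the highest DOP.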

Parallel Scan 2

[Figure: scan plan cost at DOP 16]

Hash Match Aggregate

CPU cost is only reduced by 2X

Parallel Scan

IO cost is the same
CPU cost is reduced in proportion to the degree of parallelism, but the last 2X is excluded?

On a weak storage system, a single thread can saturate the IO channel; additional threads will not increase IO throughput (reduce IO duration). A very powerful storage system can deliver IO in proportion to the number of threads. It might be nice if this were an optimizer option.

The IO component can be a very large portion of the overall plan cost. Not reducing IO cost in a parallel plan may prevent a favorable parallel plan from being chosen, i.e., the CPU savings are not sufficient to offset the cost contributed by the Parallelism operations.
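The storage argument can be made concrete with a crude duration model, contrasting weak and strong storage. This is a sketch under stated assumptions. `storage_streams` is a hypothetical parameter for how many threads the storage can serve at full rate before it saturates. IO and CPU time are simply added, with no overlap, mirroring how plan cost adds the two components. The example numbers are illustrative.

```python
def scan_duration(io_seconds, cpu_seconds, dop, storage_streams):
    """Rough scan duration when storage saturates at `storage_streams` threads.

    IO time shrinks only up to the storage limit; CPU time shrinks with DOP.
    """
    io_speedup = min(dop, storage_streams)
    return io_seconds / io_speedup + cpu_seconds / dop


if __name__ == "__main__":
    # Hypothetical scan: 100 s of single-thread IO, 20 s of CPU, run at DOP 8
    print(scan_duration(100, 20, 8, storage_streams=1))  # weak storage
    print(scan_duration(100, 20, 8, storage_streams=8))  # strong storage
```

With weak storage the parallel scan barely improves (102.5 s), while storage that scales with threads brings it down to 15 s. A cost model that never reduces IO effectively assumes the first case.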

A parallel execution plan is more likely on larger systems (-P to fake it?)

Actual Execution Plan - Parallel

More Parallel Plan Details

Parallel Plan - Actual

Parallelism – Hash Joins

Hash Join Cost

[Figure: hash join plan cost at DOP 1, 2, 4 and 8]

Search: Understanding Hash Joins
For In-memory, Grace, Recursive

Hash Join Cost

CPU cost is linear with the number of rows in the outer and inner sources

See BOL on Hash Joins for In-Memory, Grace, Recursive
IO cost is zero for small intermediate data sizes; beyond a set point, which is proportional to server memory(?), IO is proportional to the excess data (beyond the in-memory limit)
Parallel plan: memory allocation is per thread!

Summary: hash join plan cost depends on memory if the IO component is not zero, in which case the cost is disproportionately lower with parallel plans. Does this reflect the real cost?
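The summary can be illustrated with a toy cost model built only from the statements above. CPU is linear in outer plus inner rows and is shared across threads. IO is zero until the build data exceeds the in-memory limit and then grows with the excess. Memory is granted per thread. Every coefficient, the per-thread memory figure, and the even split of build data across threads are hypothetical; this is not the optimizer's actual formula.

```python
def hash_join_cost(outer_rows, inner_rows, build_mb, dop,
                   mem_mb_per_thread, cpu_per_row, io_per_excess_mb):
    """Toy hash join cost; returns (cpu_cost, io_cost)."""
    # CPU: linear in outer + inner rows, divided across DOP threads
    cpu = cpu_per_row * (outer_rows + inner_rows) / dop
    # Each thread builds on 1/dop of the inner data with its own memory grant
    excess_per_thread = max(0.0, build_mb / dop - mem_mb_per_thread)
    # IO: proportional to the total data that does not fit in memory
    io = io_per_excess_mb * excess_per_thread * dop
    return cpu, io


if __name__ == "__main__":
    for dop in (1, 2, 4):
        print(dop, hash_join_cost(1000, 1000, 1000.0, dop, 400.0, 0.001, 0.01))
```

Going from DOP 1 to DOP 2 halves the CPU part but cuts the IO part by 3X (600 MB of excess down to 200 MB), and at DOP 4 the spill disappears. In this model, that is why parallel hash join plans look disproportionately cheaper.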

Parallelism Repartition Streams

[Figure: Repartition Streams cost at DOP 2, 4 and 8]

Bitmap

BOL: Optimizing Data Warehouse Query Performance Through Bitmap Filtering A bitmap filter uses a compact representation of a set of values from a table in one part of the operator tree to filter rows from a second table in another part of the tree. Essentially, the filter performs a semi-join reduction; that is, only the rows in the second table that qualify for the join to the first table are processed.

SQL Server uses the Bitmap operator to implement bitmap filtering in parallel query plans. Bitmap filtering speeds up query execution by eliminating rows with key values that cannot produce any join records before passing rows through another operator such as the Parallelism operator. A bitmap filter uses a compact representation of a set of values from a table in one part of the operator tree to filter rows from a second table in another part of the tree. By removing unnecessary rows early in the query, subsequent operators have fewer rows to work with, and the overall performance of the query improves. The optimizer determines when a bitmap is selective enough to be useful and in which operators to apply the filter. For more information, see Optimizing Data Warehouse Query Performance Through Bitmap Filtering.
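The semi-join reduction described in the BOL text can be sketched with a plain bit array. This is an illustration of the general technique, not SQL Server's internal bitmap format. It sets one bit per build-side key (hash modulo bitmap size) and drops probe rows whose bit is clear. False positives are possible, but no qualifying row is ever dropped. Function names and data are hypothetical.

```python
def build_bitmap(keys, size_bits):
    """Set one bit per build-side key (hash modulo bitmap size)."""
    bitmap = bytearray((size_bits + 7) // 8)
    for key in keys:
        bit = hash(key) % size_bits
        bitmap[bit // 8] |= 1 << (bit % 8)
    return bitmap


def might_match(bitmap, size_bits, key):
    """False: the key certainly has no join partner. True: it may have one."""
    bit = hash(key) % size_bits
    return bool(bitmap[bit // 8] & (1 << (bit % 8)))


def bitmap_filter(rows, key_of, bitmap, size_bits):
    """Semi-join reduction: drop probe rows that cannot join, before they move on."""
    return [row for row in rows if might_match(bitmap, size_bits, key_of(row))]


if __name__ == "__main__":
    dimension_keys = [3, 17, 42]                   # build side (hypothetical)
    fact_rows = [(k, k * 10) for k in range(100)]  # probe side: (key, amount)
    bits = build_bitmap(dimension_keys, 64)
    survivors = bitmap_filter(fact_rows, lambda r: r[0], bits, 64)
    print([k for k, _ in survivors])
```

With CPython's integer hashing, 5 of the 100 probe rows survive: keys 3, 17 and 42, plus 67 and 81 as false positives. The join that follows removes those two.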

What Should Scale?

Trivially parallelizable:
1) Split a large chunk of work among threads
2) Each thread works independently
3) Small amount of coordination to consolidate threads
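The three steps map directly onto a split / independent work / consolidate pattern. A minimal sketch using Python's standard thread pool follows. Because of the GIL, Python threads will not actually speed up this CPU-bound sum, so the sketch only shows the structure. `parallel_sum` is a hypothetical name.

```python
from concurrent.futures import ThreadPoolExecutor


def parallel_sum(values, dop):
    """Sum a sequence using `dop` workers."""
    chunks = [values[i::dop] for i in range(dop)]      # 1) split the work
    with ThreadPoolExecutor(max_workers=dop) as pool:
        partials = list(pool.map(sum, chunks))         # 2) each worker runs independently
    return sum(partials)                               # 3) small consolidation step


if __name__ == "__main__":
    print(parallel_sum(list(range(1000)), 4))  # 499500
```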

More Difficult

Parallelizable:
1) Split a large chunk of work among threads
2) Each thread works on the first stage
3) Large coordination effort between threads
4) More work… consolidate
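A grouped count is a typical example of the harder case. Each thread first aggregates its own slice. Then the partial results must be exchanged so that all rows for a key land on the same thread, which plays the role of a repartition. A second round of work follows before consolidation. The sketch below shows that shape with the standard thread pool; it is an illustration, not how SQL Server implements parallel aggregation. The hash-based routing and the name `two_stage_count` are assumptions.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def two_stage_count(keys, dop):
    """Count occurrences of each key in two parallel stages."""
    chunks = [keys[i::dop] for i in range(dop)]        # 1) split the work
    with ThreadPoolExecutor(max_workers=dop) as pool:
        partials = list(pool.map(Counter, chunks))     # 2) stage 1: local partial counts

    # 3) coordination: route each key's partial count to thread hash(key) % dop
    buckets = [[] for _ in range(dop)]
    for partial in partials:
        for key, count in partial.items():
            buckets[hash(key) % dop].append((key, count))

    def merge(bucket):
        total = Counter()
        for key, count in bucket:
            total[key] += count
        return total

    with ThreadPoolExecutor(max_workers=dop) as pool:
        finals = list(pool.map(merge, buckets))        # 4) more work: stage 2 merge

    result = {}
    for final in finals:                               # consolidate (key sets are disjoint)
        result.update(final)
    return result


if __name__ == "__main__":
    print(two_stage_count(["a", "b", "a", "c", "b", "a"], 3))
```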

Parallel Execution Plan Summary

Queries with high IO cost may show little plan cost reduction with parallel execution
Plans with a high portion of hash or sort cost show a large parallel plan cost reduction
Parallel plans may be inhibited by a high row count in Parallelism Repartition Streams
Watch out for (Parallel) Merge Joins!

Test Systems

2-way quad-core Xeon 5430 2.66GHz
Windows Server 2008 R2, SQL 2008 R2

8-way dual-core Opteron 2.8GHz
Windows Server 2008 SP1, SQL 2008 SP1

8-way quad-core Opteron 2.7GHz Barcelona
Windows Server 2008 R2, SQL 2008 SP1
Build 2789

8-way systems were configured for AD: not good!

Test Methodology

Boot with all processors
Run queries at MAXDOP 1, 2, 4, 8, etc.

Not the same as running on a 1-way, 2-way, or 4-way server
Interpret results with caution
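A small helper can generate the MAXDOP series and tag each test query with the `OPTION (MAXDOP n)` query hint, then turn measured elapsed times into speedups relative to MAXDOP 1. This is a sketch under stated assumptions. It assumes each test query is a single statement with no existing OPTION clause. The function names are hypothetical, and the timings in the example are made up for illustration.

```python
def maxdop_series(max_dop):
    """Powers of two from 1 up to max_dop: 1, 2, 4, 8, ..."""
    dops, d = [], 1
    while d <= max_dop:
        dops.append(d)
        d *= 2
    return dops


def with_maxdop(sql, dop):
    """Append a MAXDOP query hint to a single statement (no existing OPTION clause)."""
    return f"{sql.rstrip().rstrip(';')}\nOPTION (MAXDOP {dop});"


def speedup(elapsed_by_dop):
    """Speedup of each run relative to the MAXDOP 1 run."""
    base = elapsed_by_dop[1]
    return {dop: base / t for dop, t in elapsed_by_dop.items()}


if __name__ == "__main__":
    for dop in maxdop_series(16):
        print(with_maxdop("SELECT COUNT(*) FROM LINEITEM;", dop))
    print(speedup({1: 100.0, 2: 52.0, 4: 27.0}))  # hypothetical timings
```

Since all runs are on the same box booted with all processors, a speedup computed this way is exactly the kind of result the slide says to interpret with caution.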

TPC-H

Continuing Development

Suppose I need to ALTER TABLE ADD new columns?
Of course, then UPDATE to set the default

Write Operations

Insert, Update and Delete (IUD) component operations are not parallelizable.
The Select portion of the query may be parallelized.
Select parallelization may be inhibited if the row count is high.

Mass Update

Compressed Table

LINEITEM – real data may be more compressible
Uncompressed: 8,749,760KB, average bytes per row: 149
Compressed: 4,819,592KB, average bytes per row: 82
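Applying the scan rate from the cost model (1 IO-cost unit = 10,800 KB) to these sizes shows what compression does to the scan IO cost. A minimal sketch, assuming the IO cost scales linearly with table size as in the simple scan formula. It says nothing about the extra CPU spent decompressing rows. The helper names are hypothetical.

```python
SCAN_KB_PER_SEC = 10_800  # reference-platform scan rate: 1 IO-cost unit = 10,800 KB


def compression_ratio(uncompressed_kb, compressed_kb):
    """How many times smaller the compressed table is."""
    return uncompressed_kb / compressed_kb


def scan_io_cost_from_kb(size_kb):
    """Approximate scan IO cost for a table of size_kb."""
    return size_kb / SCAN_KB_PER_SEC


if __name__ == "__main__":
    print(round(compression_ratio(8_749_760, 4_819_592), 2))  # 1.82
    print(round(scan_io_cost_from_kb(8_749_760), 1))          # 810.2
    print(round(scan_io_cost_from_kb(4_819_592), 1))          # 446.3
```

The scan IO cost drops by the same 1.8X as the table size, which is where compression should help most in IO-dominated parallel scans.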
