About Joe Chang
SQL Server Execution Plan Cost Model
True cost structure by system architecture
Decoding statblob (distribution statistics)
SQL Clone – statistics-only database
Tools
ExecStats – cross-reference index use by SQL-execution plan
Performance Monitoring, Profiler/Trace aggregation
TPC-H
DSS – 22 queries, geometric mean
60X range plan cost, comparable actual range
Power – single stream
Tests ability to scale parallel execution plans
Throughput – multiple streams
Scale Factor 1 – Line item data is 1GB
875MB with DATE instead of DATETIME
Only single column indexes allowed, Ad-hoc
SF 10, test studies
Not valid for publication
Auto-Statistics enabled, Excludes compile time
Big Queries – Line Item Scan
Super Scaling – Mission Impossible
Small Queries & High Parallelism
Other queries, negative scaling
Did not apply T2301, or disallow page locks
Big Q: Plan Cost vs Actual
Plan Cost reduction from DOP 1 to 16/32: Q1 28%, Q9 44%, Q18 70%, Q21 20%
Plan Cost says scaling is poor except for Q18, memory affects Hash IO onset
Plan Cost is poor indicator of true parallelism scaling
Q18 & Q21 > 3X Q1, Q9
[Chart: Plan Cost @ 10GB; Q1, Q9, Q13, Q18, Q21; DOP 1, 2, 4, 8, 16]
[Chart: Actual query time in seconds; Q1, Q9, Q13, Q18, Q21; DOP 1, 2, 4, 8, 16, 24, 30, 32]
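To put the quoted plan cost reductions in perspective, a reduction of r corresponds to an implied speedup of 1/(1 - r), if plan cost were proportional to run time. That proportionality is an assumption made only for this illustration; the percentages are the ones quoted on the slide, and the function name is hypothetical.

```python
# Plan cost reduction from DOP 1 to DOP 16/32, as quoted on the slide.
plan_cost_reduction = {"Q1": 0.28, "Q9": 0.44, "Q18": 0.70, "Q21": 0.20}


def implied_speedup(reduction: float) -> float:
    """Speedup the plan cost would predict if cost tracked run time (assumption)."""
    return 1.0 / (1.0 - reduction)


implied = {q: implied_speedup(r) for q, r in plan_cost_reduction.items()}

for query, speedup in implied.items():
    print(f"{query}: plan cost implies about {speedup:.2f}X")
```

Read this way, even the largest reduction (Q18) implies a speedup of only a little over 3X, which is consistent with the slide's point that plan cost is a poor indicator of true parallelism scaling.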
Big Query: Speed Up and CPU
Q13 has slightly better than perfect scaling?
In general, excellent scaling to DOP 8-24, weak afterwards
Holy Grail
[Chart: CPU time in seconds; Q1, Q9, Q13, Q18, Q21; DOP 1, 2, 4, 8, 16, 24, 30, 32]
[Chart: Speed up relative to DOP 1; Q1, Q9, Q13, Q18, Q21; DOP 1, 2, 4, 8, 16, 24, 30, 32]
Super Scaling
Suppose at DOP 1, a query runs for 100 seconds, with one CPU fully pegged
CPU time = 100 sec, elapsed time = 100 sec
What is best case for DOP 2?
Assuming nearly zero Repartition Threads cost
CPU time = 100 sec, elapsed time = 50?
Super Scaling: CPU time decreases going from Non-Parallel to Parallel plan!
No, I have not started drinking, yet
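The arithmetic above can be sketched in a few lines. The `Run` type and helper names are hypothetical, and the third run uses made-up numbers purely to show what a super-scaling result would look like; only the 100-second DOP 1 case and the ideal DOP 2 case come from the slide.

```python
from dataclasses import dataclass


@dataclass
class Run:
    dop: int
    cpu_sec: float      # total CPU time summed over all threads
    elapsed_sec: float  # wall-clock query time


def speedup(base: Run, run: Run) -> float:
    """Elapsed-time speedup relative to the baseline run."""
    return base.elapsed_sec / run.elapsed_sec


def is_super_scaling(base: Run, run: Run) -> bool:
    """Super scaling: the parallel plan uses less total CPU than the DOP 1 plan."""
    return run.cpu_sec < base.cpu_sec


dop1 = Run(dop=1, cpu_sec=100.0, elapsed_sec=100.0)        # example from the slide
ideal_dop2 = Run(dop=2, cpu_sec=100.0, elapsed_sec=50.0)   # best case, zero repartition cost
hypothetical = Run(dop=2, cpu_sec=60.0, elapsed_sec=30.0)  # made-up super-scaling case

for run in (ideal_dop2, hypothetical):
    print(run.dop, round(speedup(dop1, run), 2), is_super_scaling(dop1, run))
```

With constant total CPU, the best possible speedup at DOP 2 is 2X; anything beyond that requires the parallel plan to do less total work, which is what the slide calls super scaling.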
Super Scaling
CPU-sec goes down from DOP 1 to 2 and higher (typically 8)
3.5X speedup from DOP 1 to 2 (Normalized to DOP 1)
[Chart: CPU normalized to DOP 1; Q7, Q8, Q11, Q21, Q22; DOP 1, 2, 4, 8, 16, 24, 30, 32]
[Chart: Speed up relative to DOP 1; Q7, Q8, Q11, Q21, Q22; DOP 1, 2, 4, 8, 16, 24, 30, 32]
CPU and Query time in seconds
[Chart: CPU time; Q7, Q8, Q11, Q21, Q22; DOP 1, 2, 4, 8, 16, 24, 30, 32]
[Chart: Query time; Q7, Q8, Q11, Q21, Q22; DOP 1, 2, 4, 8, 16, 24, 30, 32]
Super Scaling Summary
Most probable cause: Bitmap Operator in Parallel Plan
Bitmap Filters are great, Question for Microsoft:
Can I use Bitmap Filters in OLTP systems with non-parallel plans?
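For readers unfamiliar with the idea, here is a minimal sketch of why a bitmap filter can reduce total work: a small bitmap built from the build-side join keys discards most probe-side rows before they reach the join. This is a simplified illustration with hypothetical names and made-up data, not a description of how SQL Server implements its Bitmap operator.

```python
def build_bitmap(keys, size=1024):
    """Set one bit per build-side join key (single hash, so false positives are possible)."""
    bitmap = [False] * size
    for key in keys:
        bitmap[hash(key) % size] = True
    return bitmap


def bitmap_filter(rows, key_of, bitmap):
    """Keep only probe-side rows whose key might exist on the build side."""
    size = len(bitmap)
    return [row for row in rows if bitmap[hash(key_of(row)) % size]]


def hash_join(build_rows, probe_rows, build_key, probe_key):
    """Plain in-memory hash join: build a table, then probe it row by row."""
    table = {}
    for row in build_rows:
        table.setdefault(build_key(row), []).append(row)
    return [(b, p) for p in probe_rows for b in table.get(probe_key(p), [])]


# Hypothetical data: 3 qualifying orders, 10,000 line items.
orders = [(k,) for k in (7, 42, 99)]
line_items = [(i % 1000, i) for i in range(10_000)]

bitmap = build_bitmap(o[0] for o in orders)
survivors = bitmap_filter(line_items, lambda li: li[0], bitmap)
joined = hash_join(orders, survivors, lambda o: o[0], lambda li: li[0])
unfiltered = hash_join(orders, line_items, lambda o: o[0], lambda li: li[0])

print(len(line_items), len(survivors), len(joined))
```

The join result is the same with or without the filter; the difference is how many rows the join has to probe, which is the kind of CPU saving the slide attributes to the Bitmap operator.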
Small Queries – Plan Cost vs Actual
Query 3 and 16 have lower plan cost than Q17, but not included
Q4, Q6, Q17: great scaling to DOP 4, then weak
Negative scaling also occurs
[Chart: Plan Cost; Q2, Q4, Q6, Q15, Q17, Q20; DOP 1, 2, 4, 8, 16]
[Chart: Query time; Q2, Q4, Q6, Q15, Q17, Q20; DOP 1, 2, 4, 8, 16, 24, 30, 32]
Small Queries CPU & Speedup
What did I get for all that extra CPU?
Interpretation: sharp jump in CPU means poor scaling, disproportionate means negative scaling
Query 2 negative at DOP 2, Q4 is good, Q6 gets speedup, but at a CPU premium, Q17 and Q20 negative after DOP 8
[Chart: CPU time; Q2, Q4, Q6, Q15, Q17, Q20; DOP 1, 2, 4, 8, 16, 24, 30, 32]
[Chart: Speed up; Q2, Q4, Q6, Q15, Q17, Q20; DOP 1, 2, 4, 8, 16, 24, 30, 32]
High Parallelism – Small Queries
Why? Almost No value
TPC-H geometric mean scoring
Small queries have as much impact as large
A linear sum would weight large queries more heavily
OLTP with 32, 64+ cores
Parallelism good if super-scaling
Default max degree of parallelism 0
Seriously bad news, especially for small Q
Increase cost threshold for parallelism?
Sometimes you do get lucky
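The point about geometric mean scoring can be checked with a small calculation. The query times below are made up for illustration; the comparison shows that halving a small query moves a geometric mean exactly as much as halving a large one, while a linear sum barely notices the small query.

```python
import math


def geometric_mean(values):
    """Geometric mean computed via logs to avoid overflow on long lists."""
    return math.exp(sum(math.log(v) for v in values) / len(values))


# Hypothetical query times in seconds: one large query and one small one.
baseline = [60.0, 1.0]
large_halved = [30.0, 1.0]
small_halved = [60.0, 0.5]

for name, times in [("baseline", baseline),
                    ("large halved", large_halved),
                    ("small halved", small_halved)]:
    print(f"{name:13s} geometric mean = {geometric_mean(times):.3f}  "
          f"linear sum = {sum(times):.1f}")
```

This is why tuning small queries pays off in a TPC-H score even when it does little for total elapsed time.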
Q that go Negative
[Chart: Query time; Q17, Q19, Q20, Q22; DOP 1, 2, 4, 8, 16, 24, 30, 32]
[Chart: “Speedup”; Q17, Q19, Q20, Q22; DOP 1, 2, 4, 8, 16, 24, 30, 32]
CPU
[Chart: CPU; Q17, Q19, Q20, Q22; DOP 1, 2, 4, 8, 16, 24, 30, 32]
Other Queries – CPU & Speedup
Q3 has problems beyond DOP 2
[Chart: CPU time; Q3, Q5, Q10, Q12, Q14, Q16; DOP 1, 2, 4, 8, 16, 24, 30, 32]
[Chart: Speedup; Q3, Q5, Q10, Q12, Q14, Q16; DOP 1, 2, 4, 8, 16, 24, 30, 32]
Other - Query Time seconds
[Chart: Query time; Q3, Q5, Q10, Q12, Q14, Q16; DOP 1, 2, 4, 8, 16, 24, 30, 32]
Scaling Summary
Some queries show excellent scaling
Super-scaling, better than 2X
Sharp CPU jump on last DOP doubling
Need strategy to cap DOP
To limit negative scaling
Especially for some smaller queries?
Other anomalies
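One possible shape for a DOP-capping strategy, offered as a sketch rather than a recommendation: walk up the measured DOP values and stop at the last step that still delivered a worthwhile gain. The function name, the `min_gain` threshold, and both sets of timings are assumptions made up for illustration.

```python
def pick_dop_cap(query_time_by_dop, min_gain=1.2):
    """Return the highest DOP reached before a step up in DOP gains less than min_gain.

    query_time_by_dop maps DOP -> elapsed seconds. min_gain is a hypothetical
    threshold: the minimum speedup required from each step to a higher DOP.
    """
    dops = sorted(query_time_by_dop)
    cap = dops[0]
    for prev, nxt in zip(dops, dops[1:]):
        gain = query_time_by_dop[prev] / query_time_by_dop[nxt]
        if gain < min_gain:
            break  # negligible or negative scaling: stop raising DOP
        cap = nxt
    return cap


# Made-up timings for a small query that scales to DOP 4 and then goes negative.
small_query = {1: 2.0, 2: 1.1, 4: 0.6, 8: 0.58, 16: 0.7, 32: 0.9}
# Made-up timings for a big query that keeps scaling.
big_query = {1: 64.0, 2: 32.5, 4: 16.8, 8: 8.9, 16: 5.0, 32: 3.9}

print(pick_dop_cap(small_query))
print(pick_dop_cap(big_query))
```

A cap chosen this way could then be applied per query, for example with a `MAXDOP` query hint, instead of relying on a single server-wide setting.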
Compression
PAGE
Compression Overhead - Overall
40% overhead for compression at low DOP, 10% overhead at max DOP???
[Chart: Query time compressed relative to uncompressed; DOP 1, 2, 4, 8, 16, 24, 30, 32]
[Chart: CPU time compressed relative to uncompressed; DOP 1, 2, 4, 8, 16, 24, 30, 32]
[Chart: Query time compressed relative to uncompressed; Q1 to Q22; DOP 1, 2, 4, 8, 16, 24, 32]
[Chart: CPU time compressed relative to uncompressed; Q1 to Q22; DOP 1, 2, 4, 8, 16, 24, 32]
Compressed Table
LINEITEM – real data may be more compressible
Uncompressed: 8,749,760KB, Average Bytes per row: 149
Compressed: 4,819,592KB, Average Bytes per row: 82
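The LINEITEM figures above translate into a compression ratio as follows. The numbers are the ones quoted on the slide; the variable names are only for illustration.

```python
# LINEITEM sizes as quoted on the slide.
uncompressed_kb, uncompressed_bytes_per_row = 8_749_760, 149
compressed_kb, compressed_bytes_per_row = 4_819_592, 82

size_ratio = compressed_kb / uncompressed_kb
row_ratio = compressed_bytes_per_row / uncompressed_bytes_per_row
space_saved = 1.0 - size_ratio

print(f"compressed / uncompressed size:          {size_ratio:.3f}")
print(f"compressed / uncompressed bytes per row: {row_ratio:.3f}")
print(f"space saved:                             {space_saved:.1%}")
```

Both ratios agree that the compressed table is a little over half the size of the original, which is the space saving to weigh against the CPU overhead charted on the previous slides.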
Partitioning
Orders and Line Item on Order Key
Partitioning Impact - Overall
[Chart: Query time partitioned relative to not partitioned; DOP 1, 2, 4, 8, 16, 24, 30, 32]
[Chart: CPU time partitioned relative to not partitioned; DOP 1, 2, 4, 8, 16, 24, 30, 32]
[Chart: Query time partitioned relative to not partitioned; Q1 to Q22; DOP 1, 2, 4, 8, 16, 24, 32]
[Chart: CPU time partitioned relative to not partitioned; Q1 to Q22; DOP 1, 2, 4, 8, 16, 24, 32]
Plan for Partitioned Tables
Scaling DW Summary
Massive IO bandwidth
Parallel options for data load, updates etc
Investigate Parallel Execution Plans
Scaling from DOP 1, 2, 4, 8, 16, 32 etc
Scaling with and w/o HT
Strategy for limiting DOP with multiple users
Fixes from Microsoft Needed
Contention issues in parallel execution
Table scan, Nested Loops
Better plan cost model for scaling
Back-off on parallelism if gain is negligible
Fix throughput degradation with multiple users running big DW queries
Sybase and Oracle, Throughput is close to Power or better
Query Plans
Big Queries
Q1 Pricing Summary Report
Q1 Plan
Non-Parallel
Parallel
Parallel plan 28% lower than scalar, IO is 70%, no parallel plan cost reduction
Q9 Product Type Profit Measure
IO from 4 tables contribute 58% of plan cost, parallel plan is 39% lower
Non-Parallel Parallel
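One way to read the Q1 and Q9 figures quoted above: if the IO share of plan cost is not reduced in the parallel plan and only the remaining share is divided by DOP, simple arithmetic gives reductions close to the quoted 28% and 39%. This is an assumed simplification for illustration (it ignores the cost of the parallelism operators and assumes DOP 16), not the optimizer's actual formula, and the function name is hypothetical.

```python
def modeled_cost_reduction(io_fraction: float, dop: int) -> float:
    """Fractional plan cost reduction if only the non-IO share is divided by DOP (assumed model)."""
    cpu_fraction = 1.0 - io_fraction
    parallel_cost = io_fraction + cpu_fraction / dop
    return 1.0 - parallel_cost


# IO shares as quoted on the slides: 70% for Q1, 58% for Q9. DOP 16 is an assumption.
q1 = modeled_cost_reduction(io_fraction=0.70, dop=16)
q9 = modeled_cost_reduction(io_fraction=0.58, dop=16)

print(f"Q1 modeled reduction: {q1:.1%}")
print(f"Q9 modeled reduction: {q9:.1%}")
```

Under this model the reduction can never exceed the non-IO share of the plan, which would explain why plan cost understates scaling for scan-heavy queries.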
Q9 Non-Parallel Plan
Table/Index Scans comprise 64%, IO from 4 tables contribute 58% of plan cost
Join sequence: Supplier, (Part, PartSupp), Line Item, Orders
Q9 Parallel Plan
Non-Parallel: (Supplier), (Part, PartSupp), Line Item, Orders
Parallel: Nation, Supplier, (Part, Line Item), Orders, PartSupp
Q9 Non-Parallel Plan details
Table Scans comprise 64%, IO from 4 tables contribute 58% of plan cost
Q9 Parallel reg vs Partitioned
Q13
Why does Q13 have perfect scaling?
Q18 Large Volume Customer
Non-Parallel
Parallel
Q18 Graphical Plan
Non-Parallel Plan: 66% of cost in Hash Match, reduced to 5% in Parallel Plan
Q18 Plan Details
Non-Parallel
Parallel
Non-Parallel Plan Hash Match cost is 1245 IO, 494.6 CPU
DOP 16/32: size is below IO threshold, CPU reduced by >10X
Q21 Suppliers Who Kept Orders Waiting
Note 3 references to Line Item
Non-Parallel Parallel
Q21 Non-Parallel Plan
[Plan diagram with operators labeled H1, H2, H3]
Q21 Parallel
Q21
3 full Line Item clustered index scans
Plan cost is approx 3X Q1, single “scan”
Super Scaling
Q7 Volume Shipping
Non-Parallel Parallel
Q7 Non-Parallel Plan
Join sequence: Nation, Customer, Orders, Line Item
Q7 Parallel Plan
Join sequence: Nation, Customer, Orders, Line Item
Q8 National Market Share
Non-Parallel Parallel
Q8 Non-Parallel Plan
Join sequence: Part, Line Item, Orders, Customer
Q8 Parallel Plan
Join sequence: Part, Line Item, Orders, Customer
Q11 Important Stock Identification
Non-Parallel Parallel
Q11
Join sequence: A) Nation, Supplier, PartSupp, B) Nation, Supplier, PartSupp
Q11
Join sequence: A) Nation, Supplier, PartSupp, B) Nation, Supplier, PartSupp
Small Queries
Query 2 Minimum Cost Supplier
Wordy, but only touches the small tables, second lowest plan cost (Q15)
Q2
Clustered Index Scan on Part and PartSupp have highest cost (48%+42%)
Q2
PartSupp is now Index Scan + Key Lookup
Q6 Forecasting Revenue Change
Not sure why this blows CPU
Scalar values are pre-computed, pre-converted
Q20?
This query may get a poor execution plan
Date functions are usually written as
because Line Item date columns are “date” type
CAST helps DOP 1 plan, but gets a bad plan for parallel
Q20
Q20
Q20 alternate - parallel
Statistics estimation error here
Penalty for mistake applied here
Other Queries
Q3
Q3
Q12 Random IO?
Will this generate random IO?
Query 12 Plans
Non-Parallel
Parallel
Queries that go Negative
Q17 Small Quantity Order Revenue
Q17
Table Spool is concern
Q17
the usual suspects
Q19
Q19
Q22
Q22
[Chart: CPU relative to DOP 1; Q1 to Q22 and total; DOP 2, 4, 8, 16, 24, 32]
[Chart: Speedup from DOP 1 query time; Q1 to Q22 and total; DOP 2, 4, 8, 16, 24, 32]