TPC-H Studies

Joe Chang

http://www.qdpma.com/

About Joe Chang

SQL Server Execution Plan Cost Model

True cost structure by system architecture

Decoding statblob (distribution statistics)

SQL Clone – statistics-only database

Tools

ExecStats – cross-reference index use by SQL execution plan

Performance Monitoring, Profiler/Trace aggregation

TPC-H

DSS – 22 queries, geometric mean

60X range in plan cost, comparable actual range

Power – single stream

Tests ability to scale parallel execution plans

Throughput – multiple streams

Scale Factor 1 – Line item data is 1GB

875MB with DATE instead of DATETIME

Only single column indexes allowed, Ad-hoc

SF 10, test studies

Not valid for publication

Auto-Statistics enabled, Excludes compile time

Big Queries – Line Item Scan

Super Scaling – Mission Impossible

Small Queries & High Parallelism

Other queries, negative scaling

Did not apply trace flag T2301 or disallow page locks

Big Q: Plan Cost vs Actual

Plan Cost reduction from DOP 1 to 16/32: Q1 28%, Q9 44%, Q18 70%, Q21 20%

Plan Cost says scaling is poor except for Q18; memory affects Hash IO onset

[Figure: Plan Cost @ 10GB for Q1, Q9, Q13, Q18, Q21 at DOP 1, 2, 4, 8, 16]

[Figure: Actual query time in seconds for Q1, Q9, Q13, Q18, Q21 at DOP 1, 2, 4, 8, 16, 24, 30, 32]

Plan Cost is poor indicator of true parallelism scaling

Q18 & Q21 > 3X Q1, Q9
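As a rough worked example of why the plan cost reductions above look pessimistic: if plan cost were treated as a stand-in for elapsed time (an assumption made here only for illustration), a reduction of r would imply a speedup of 1 / (1 - r). The sketch below applies that to the four percentages quoted on the slide; the function name is hypothetical.

```python
def implied_speedup(cost_reduction):
    """Speedup implied by a fractional plan cost reduction,
    assuming (for illustration only) that plan cost tracks elapsed time."""
    return 1.0 / (1.0 - cost_reduction)

# Plan cost reductions from DOP 1 to 16/32, as quoted on the slide.
plan_cost_reduction = {"Q1": 0.28, "Q9": 0.44, "Q18": 0.70, "Q21": 0.20}

implied = {q: round(implied_speedup(r), 2) for q, r in plan_cost_reduction.items()}

for query, speedup in implied.items():
    print(f"{query}: plan cost implies about {speedup}X")
```

Under that assumption even Q18, the best case, would be predicted to gain only a little over 3X, far short of a DOP of 16 or 32.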

Big Query: Speed Up and CPU

Q13 has slightly better than perfect scaling?

In general, excellent scaling to DOP 8-24, weak afterwards

Holy Grail

[Figure: Speed up relative to DOP 1 for Q1, Q9, Q13, Q18, Q21]

[Figure: CPU time in seconds for Q1, Q9, Q13, Q18, Q21]

Super Scaling

Suppose at DOP 1, a query runs for 100 seconds, with one CPU fully pegged

CPU time = 100 sec, elapsed time = 100 sec

What is best case for DOP 2?

Assuming nearly zero Repartition Threads cost

CPU time = 100 sec, elapsed time = 50?

Super Scaling: CPU time decreases going from Non-Parallel to Parallel plan!

No, I have not started drinking, yet
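The arithmetic behind this slide can be sketched as follows. The helper names are hypothetical. The first case uses the slide's own numbers (100 CPU-sec and 100 sec elapsed at DOP 1, 100 CPU-sec and 50 sec elapsed at DOP 2); the second case uses made-up numbers purely to show what a super-scaling result would look like.

```python
def scaling_summary(cpu_dop1, elapsed_dop1, cpu_dopn, elapsed_dopn):
    """Return (speedup, cpu_ratio) of a parallel run relative to the DOP 1 run."""
    speedup = elapsed_dop1 / elapsed_dopn
    cpu_ratio = cpu_dopn / cpu_dop1
    return speedup, cpu_ratio

def is_super_scaling(cpu_ratio):
    """Super scaling here means total CPU time went down in the parallel plan."""
    return cpu_ratio < 1.0

# Expected best case from the slide: same total CPU, half the elapsed time.
best_case = scaling_summary(100.0, 100.0, 100.0, 50.0)

# Hypothetical super-scaling run (illustrative numbers, not measured data).
hypothetical = scaling_summary(100.0, 100.0, 80.0, 40.0)

print(best_case, is_super_scaling(best_case[1]))
print(hypothetical, is_super_scaling(hypothetical[1]))
```

A CPU ratio of 1.0 with a 2X speedup is the conventional ceiling at DOP 2; a ratio below 1.0 is what the slides call super scaling.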

[Figure: CPU normalized to DOP 1 for Q7, Q8, Q11, Q21, Q22 at DOP 1, 2, 4, 8, 16, 24, 30, 32]

CPU-sec goes down from DOP 1 to 2 and higher (typically 8)

[Figure: Speed up relative to DOP 1 for Q7, Q8, Q11, Q21, Q22]

3.5X speedup from DOP 1 to 2 (Normalized to DOP 1)

CPU and Query time in seconds

[Figure: CPU time in seconds for Q7, Q8, Q11, Q21, Q22 at DOP 1, 2, 4, 8, 16, 24, 30, 32]

[Figure: Query time in seconds for Q7, Q8, Q11, Q21, Q22 at DOP 1, 2, 4, 8, 16, 24, 30, 32]

Super Scaling Summary

Most probable cause

Bitmap Operator in Parallel Plan

Bitmap Filters are great, Question for Microsoft:

Can I use Bitmap Filters in OLTP systems with non-parallel plans?

Small Queries – Plan Cost vs Actual

Query 3 and 16 have lower plan cost than Q17, but not included

[Figure: Plan Cost for Q2, Q4, Q6, Q15, Q17, Q20 at DOP 1, 2, 4, 8, 16]

[Figure: Query time for Q2, Q4, Q6, Q15, Q17, Q20]

Q4, Q6, Q17: great scaling to DOP 4, then weak

Negative scaling also occurs

Small Queries CPU & Speedup

What did I get for all that extra CPU?

Interpretation: a sharp jump in CPU means poor scaling; a disproportionate jump means negative scaling

[Figure: CPU time for Q2, Q4, Q6, Q15, Q17, Q20]

[Figure: Speed up for Q2, Q4, Q6, Q15, Q17, Q20 at DOP 1, 2, 4, 8, 16, 24, 30, 32]

Q2 goes negative at DOP 2; Q4 is good; Q6 gets a speedup, but at a CPU premium; Q17 and Q20 go negative after DOP 8

High Parallelism – Small Queries

Why? Almost No value

TPC-H geometric mean scoring

Small queries have as much impact as large

A linear sum weights large queries more heavily
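The contrast between geometric mean scoring and a linear sum can be illustrated with a small sketch. The two query times below are hypothetical, chosen only to make the arithmetic obvious; they are not taken from the test results.

```python
import math

def geometric_mean(times):
    """Geometric mean of a list of positive query times."""
    return math.exp(sum(math.log(t) for t in times) / len(times))

# Hypothetical workload: one small query (1 sec) and one large query (100 sec).
baseline = [1.0, 100.0]
halve_small = [0.5, 100.0]   # small query made 2X faster
halve_large = [1.0, 50.0]    # large query made 2X faster

gm_small = geometric_mean(halve_small)
gm_large = geometric_mean(halve_large)

print("geometric mean:", geometric_mean(baseline), gm_small, gm_large)
print("linear sum:    ", sum(baseline), sum(halve_small), sum(halve_large))
```

Halving either query improves the geometric mean by the same factor, while the linear sum barely moves when only the small query gets faster. That is why shaving time off small queries with high parallelism pays in the benchmark score even when it has almost no value in total elapsed time.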

OLTP with 32, 64+ cores

Parallelism good if super-scaling

Default max degree of parallelism 0

Seriously bad news, especially for small Q

Increase cost threshold for parallelism?

Sometimes you do get lucky

Q that go Negative

[Figure: Query time for Q17, Q19, Q20, Q22 at DOP 1, 2, 4, 8, 16, 24, 30, 32]

[Figure: “Speedup” for Q17, Q19, Q20, Q22 at DOP 1, 2, 4, 8, 16, 24, 30, 32]

CPU

[Figure: CPU for Q17, Q19, Q20, Q22 at DOP 1, 2, 4, 8, 16, 24, 30, 32]

Other Queries – CPU & Speedup

[Figure: CPU time for Q3, Q5, Q10, Q12, Q14, Q16]

[Figure: Speedup for Q3, Q5, Q10, Q12, Q14, Q16 at DOP 1, 2, 4, 8, 16, 24, 30, 32]

Q3 has problems beyond DOP 2

Other - Query Time seconds

[Figure: Query time in seconds for Q3, Q5, Q10, Q12, Q14, Q16]

Scaling Summary

Some queries show excellent scaling

Super-scaling, better than 2X

Sharp CPU jump on last DOP doubling

Need strategy to cap DOP

To limit negative scaling

Especially for some smaller queries?

Other anomalies
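One possible shape for such a DOP-capping strategy, offered only as a sketch: measure elapsed time at each DOP and stop raising the cap at the first step that fails to deliver a minimum gain. The function name, the 1.2X threshold, and the timings are all hypothetical; none of them come from the test results.

```python
def pick_dop_cap(elapsed_by_dop, min_gain=1.2):
    """Return the highest DOP reached before a step up in DOP
    improves elapsed time by less than min_gain (e.g. 1.2 = 20% faster)."""
    dops = sorted(elapsed_by_dop)
    cap = dops[0]
    for prev, cur in zip(dops, dops[1:]):
        gain = elapsed_by_dop[prev] / elapsed_by_dop[cur]
        if gain < min_gain:
            break  # weak or negative scaling: stop here
        cap = cur
    return cap

# Hypothetical elapsed times in seconds for one small query.
measured = {1: 10.0, 2: 5.2, 4: 2.9, 8: 2.6, 16: 3.1}

cap = pick_dop_cap(measured)
print("suggested DOP cap:", cap)
```

The resulting value could then be applied per query or server-wide through the max degree of parallelism setting; how to do that across multiple concurrent users is the open question the slide raises.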

Compression

PAGE

Compression Overhead - Overall

40% overhead for compression at low DOP, 10% overhead at max DOP???

[Figure: Query time compressed relative to uncompressed at DOP 1, 2, 4, 8, 16, 24, 30, 32]

[Figure: CPU time compressed relative to uncompressed at DOP 1, 2, 4, 8, 16, 24, 30, 32]

[Figure: Query time compressed relative to uncompressed, by query (Q1 to Q22); legend shows DOP 16, 24, 32]

[Figure: CPU time compressed relative to uncompressed, by query (Q1 to Q22)]

Compressed Table

LINEITEM – real data may be more compressible

Uncompressed: 8,749,760KB, Average Bytes per row: 149

Compressed: 4,819,592KB, Average Bytes per row: 82
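As a quick check on the figures above, the compression ratio can be computed two ways, from the table sizes and from the average row sizes. The variable names are illustrative; the numbers are the ones quoted on the slide.

```python
# LINEITEM sizes as quoted on the slide.
uncompressed_kb = 8_749_760
compressed_kb = 4_819_592
uncompressed_bytes_per_row = 149
compressed_bytes_per_row = 82

# Compressed size as a fraction of uncompressed size.
size_ratio = compressed_kb / uncompressed_kb
row_ratio = compressed_bytes_per_row / uncompressed_bytes_per_row

print(f"size ratio: {size_ratio:.3f}")
print(f"row ratio:  {row_ratio:.3f}")
print(f"space saved: {(1 - size_ratio) * 100:.0f}%")
```

Both ratios come out at roughly 0.55, so the compressed table is a little over half the original size.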

Partitioning

Orders and Line Item on Order Key

Partitioning Impact - Overall

[Figure: Query time partitioned relative to not partitioned]

[Figure: CPU time partitioned relative to not partitioned]

[Figure: Query time partitioned relative to not partitioned at DOP 1, 2, 4, 8, 16, 24, 32]

[Figure: CPU time partitioned relative to not partitioned at DOP 1, 2, 4, 8, 16, 24, 32]

Plan for Partitioned Tables

Scaling DW Summary

Massive IO bandwidth

Parallel options for data load, updates etc

Investigate Parallel Execution Plans

Scaling from DOP 1, 2, 4, 8, 16, 32 etc

Scaling with and w/o HT

Strategy for limiting DOP with multiple users

Fixes from Microsoft Needed

Contention issues in parallel execution

Table scan, Nested Loops

Better plan cost model for scaling

Back-off on parallelism if gain is negligible

Fix throughput degradation with multiple users running big DW queries

Sybase and Oracle, Throughput is close to Power or better

Query Plans

Big Queries

Q1 Pricing Summary Report

Q1 Plan

Non-Parallel

Parallel

Parallel plan cost is 28% lower than non-parallel; IO is 70% of plan cost, with no reduction in the parallel plan

Q9 Product Type Profit Measure

IO from 4 tables contributes 58% of plan cost; parallel plan cost is 39% lower

Non-Parallel Parallel

Q9 Non-Parallel Plan

Table/Index Scans comprise 64%, IO from 4 tables contributes 58% of plan cost

Join sequence: Supplier, (Part, PartSupp), Line Item, Orders

Q9 Parallel Plan

Non-Parallel: (Supplier), (Part, PartSupp), Line Item, Orders

Parallel: Nation, Supplier, (Part, Line Item), Orders, PartSupp

Q9 Non-Parallel Plan details

Table Scans comprise 64%, IO from 4 tables contributes 58% of plan cost

Q9 Parallel reg vs Partitioned

Q13

Why does Q13 have perfect scaling?

Q18 Large Volume Customer

Non-Parallel

Parallel

Q18 Graphical Plan

Non-Parallel Plan: 66% of cost in Hash Match, reduced to 5% in Parallel Plan

Q18 Plan Details

Non-Parallel

Parallel

Non-Parallel Plan Hash Match cost is 1245 IO, 494.6 CPU

DOP 16/32: size is below IO threshold, CPU reduced by >10X

Q21 Suppliers Who Kept Orders Waiting

Note 3 references to Line Item



Q21 Parallel

Q21

3 full Line Item clustered index scans

Plan cost is approx 3X Q1, single “scan”


Q7 Volume Shipping



Join sequence: Nation, Customer, Orders, Line Item


Join sequence: Nation, Customer, Orders, Line Item

Q8 National Market Share



Join sequence: Part, Line Item, Orders, Customer


Join sequence: Part, Line Item, Orders, Customer

Q11 Important Stock Identification

Q11

Join sequence: A) Nation, Supplier, PartSupp, B) Nation, Supplier, PartSupp

Q11

Join sequence: A) Nation, Supplier, PartSupp, B) Nation, Supplier, PartSupp

Small Queries

Query 2 Minimum Cost Supplier

Wordy, but only touches the small tables, second lowest plan cost (Q15)

Q2

Clustered Index Scans on Part and PartSupp have the highest cost (48% + 42%)

Q2

PartSupp is now Index Scan + Key Lookup

Q6 Forecasting Revenue Change

Not sure why this blows CPU

Scalar values are pre-computed, pre-converted

Q20?

This query may get a poor execution plan

Date functions are usually written as

because Line Item date columns are “date” type

CAST helps the DOP 1 plan, but gets a bad plan for parallel

Q20

Q20

Q20 alternate - parallel

Statistics estimation error here

Penalty for mistake applied here

Other Queries

Q3

Q3

Q12 Random IO?

Will this generate random IO?

Query 12 Plans

Non-Parallel

Parallel

Queries that go Negative

Q17 Small Quantity Order Revenue

Q17

Table Spool is concern

Q17

the usual suspects

Q19

Q19

Q22

Q22

[Figure: CPU relative to DOP 1 for Q1 to Q22 and total at DOP 2, 4, 8]

[Figure: Speedup from DOP 1 query time for Q1 to Q22 and total at DOP 2, 4, 8, 16, 24, 32]
