Relational Query Optimizationweb.cs.wpi.edu/~cs542/s17/material/Ch15_Opt_whirlwind.pdfAND termk 8 Assumptions Uniform distribution of values in domain Independent distribution of values

1

Relational Query Optimization

Chapter 15

2

Query Blocks: Units of Optimization

❖ SQL query is parsed into a collection of query blocks :

▪ An SQL query with no nesting and exactly one SELECT, FROM, WHERE, GROUP BY, and HAVING clause.

▪ WHERE is in conjunctive normal form.

3

Query Blocks: Units of Optimization

❖Nested blocks are usually treated as calls to a subroutine, made once per outer tuple.

❖Optimization is one block at a time.

SELECT S.snameFROM Sailors SWHERE S.age IN

(SELECT MAX (S2.age)FROM Sailors S2GROUP BY S2.rating)

Nested block

Outer block

SELECT S.snameFROM Sailors SWHERE S.age IN Reference to nested block

SELECT MAX (S2.age)FROM Sailors S2GROUP BY S2.rating

4

Query Block

❖ For each block, the plans considered are:

– All available access methods

– for each relation in FROM clause.

– All left-deep join trees

– all ways to join the relations one-at-a-time, with inner relation in FROM clause, considering all relation permutations and join methods.

MAX(S2.age) (Group By S2.rating( S2.age(Sailors)))

5

Cost Estimation

❖ For each plan considered, estimate cost:

▪ Estimate cost of each operation in plan tree.

•Depends on input cardinalities.

•Depends on algorithm (scan, index, etc).

▪ Estimate size of result for each operation.

•Use information about input relations.

•Make assumptions about effect of predicates.

❖ Cost of plan = sum of cost of each operator in tree.

6

Size Estimation and Reduction Factors

❖Goal : Estimate result size

SELECT attribute listFROM relation listWHERE term-1 AND ... AND term-k

7

Size Estimation and Reduction Factors

❖Compute maximum # tuples in result :▪ product of cardinalities of relations in FROM clause.

❖Reduction factor (RF) associated with each term :▪ reflects impact of term in reducing result size.

❖Result size = ▪ product of cardinalities of involved relations (FROM)

* product of reduction factors (WHERE).

SELECT attribute listFROM relation list R1, R2WHERE term1 AND ... AND termk

8

Assumptions

▪ Uniform distribution of values in domain

▪ Independent distribution of values in different columns.

▪ For selections and joins, assume independence of predicates.

9

Reduction Factors▪ Column = value.

- Given index I on column, assume uniform distribution.

• 1/Nkeys(I).

• Otherwise, RF = 1/10

▪ Column1 = column2- assuming each key value in I1 (smaller one) has a

matching value in I2.

• 1/MAX(Nkeys(I1), Nkeys(I2)).

• otherwise, RF = 1/10

▪ Column > value- Given an index I on column, arithmetic type.

• High(I) – value / High(I) – Low(I).

• RF = 1/3

▪ Column IN (list of values)

10

More on Estimation

▪ Uniform distribution is not accurate since real data is not uniformly distributed.

▪ Histogram: a data structure maintained by a DBMS to approximate a data distribution.

11

Estimation❖ Equi-width: divide range of column values into subranges

(buckets). Assuming the distribution within the histogram bucket is uniform.

❖ Equi-depth: number of tuples within each bucket is equal

❖ Compressed equi-depth: maintain separate counts for a small number of very frequent values, and maintain equi-depth histogram to cover the remaining values.

2 3 3 1 2 1 3 8 4 2 0 1 2 4 9

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14

2.67

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14

Equiwidth

1.33

5.0

1.0

5.0

Bucket 1 2 3 4 5

Count 8 4 15 3 15

2.25

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14

Equidepth

2.5

5.0

1.75

9.0

Bucket 1 2 3 4 5

Count 9 10 10 7 9

12

Relational Algebra Equivalences

Allow us to choose different join orders

Allow us to `push’ selections and

projections ahead of joins.

13


❖ Selections:

(Cascade) c cn c cnR R1 1 ... . . .

c c c cR R1 2 2 1 (Commute)

14


❖ Projections: a a anR R1 1 . . . (Cascade)Where ai ai+1 for i =1…n-1.

15


❖ Joins: R (S T) (R S) T (Associative)

(R S) (S R) (Commute)

• When joining several relations, we are free to join

the relations in any order we choose.

16

Selects, Projects and Joins

❖ Case1: A projection commutes with a selection that only uses attributes retained by the projection.

a (c(R)) c (a(R))

❖ Case2: Combine a selection with a cross-product to form a join. R c S c (RS)

❖ Case3: A selection on just attributes of R commutes with Join. i.e., (R S) (R) S

17

Selects, Projects and Joins

❖ Selection and Joins :

c1c2 c3 (R S) c1(c2 (R) c3 (S))

❖ Project and Joinsa (RS) a1 (R) a2 (S) a (R c S) a1 (R) c a2 (S) (Where c appear in a)

18

Enumeration of Alternative Plans

There are two main cases:

• Single-relation plans

• Multiple-relation plans

19

Single Relation Plans

❖ Queries over a single relation with a combination of selects, projects, and aggregate operations:

▪ Main decision : Which access path for retrieving tuples.

• Most selective access path (file scan / index) if only single operator considered.

▪ Different operations essentially carried out together.

• e.g., if an index is used for a selection, projection is done for each retrieved tuple, and the resulting tuples are pipelined into the aggregate computation.

20

S.rating, COUNT(*)(HAVING COUNT DISTINCT(S.sname) > 2(GROUP BY S.rating(s.rating, S.sname ( S.rating>5 S.age=20 (Sailors)))))

Single Relation Plans without Index

SELECT S.rating, COUNT(*)

FROM Sailors S

WHERE S.rating > 5 AND S.age = 20

GROUP BY S.rating

HAVING COUNT DISTINCT (S.sname) > 2

Sailors = 500 pages

21


Single Relation Plans without Index

• File scan to retrieve tuples and apply selections and

projections on-the-fly.

• Writing out tuples after selections and projections.

• Sorting these tuples to implement GROUP BY clause.

• GROUP BY and HAVING are done on-the-fly.

22


Single Relation Plans without IndexSELECT S.rating, COUNT(*)

FROM Sailors S

WHERE S.rating > 5 AND S.age = 20

GROUP BY S.rating

HAVING COUNT DISTINCT (S.sname) > 2

e.g., Cost = Cost1(scan)

+ cost2 (writing pairs)

+ cost3 (sorting as per the GROUP BY clause).

cost 1 = 500 IOs

cost2 = 500* (ratio of tuple size) * RFs = 20 pages

ratio of tuple size = pair size / tuple size = 0.8

RF(S.rating>5) = 0.5 RF(S.age=20) = 0.1cost3 = 3*Npages = 60 IOs (assuming two passes)

Sailors = 500 pages

26

Queries Over Multiple Relations

▪ As the number of joins increases, the number of alternative plans grows rapidly

▪ We need to restrict search space !

27

Queries Over Multiple Relations

❖ Fundamental decision in System R (IBM):

▪ Only left-deep join trees are considered.

▪ Left-deep trees can generate all fully pipelined plans.

• Intermediate results not written to temporary files.

• Not all left-deep trees are fully pipelined (e.g., SM join).

BA

C

D

BA

C

D

C DBA

28

❖Left-deep plans differ in :

▪ the order of relations,

▪ the access method for each relation,

▪ the join method for each join.

Enumeration of Left-Deep Plans

29

❖ Enumerated using N passes (if N relations joined):

▪ Pass 1: Find best 1-relation plan for each relation.

▪ Pass 2: Find best way to join result of each 1-relation plan (as outer) to another relation. (All 2-relation plans.)

▪ Pass N: Find best way to join result of a (N-1)-relation plan (as outer) to the N’th relation. (All N-relation plans.)

❖ For each subset of relations,

retain :

▪ Cheapest plan overall, plus

▪ Cheapest plan for each interesting order of the tuples.

Enumeration of Left-Deep Plans

BA C DPass 1

Pass 2

Pass 3

33

Example:Pass 1

❖ Sailors:

▪ Choice1: B+ tree matches rating>5, probably cheapest.

▪ If selection is expected to retrieve a lot of tuples, and index is unclustered, file scan may be cheaper.

▪ Decision: B+ tree plan kept (because tuples are in rating order).

❖ Reserves:

▪ B+ tree on bid matches bid=500; cheapest.

Sailors:B+ tree on ratingHash on sid

Reserves:B+ tree on bid

Reserves Sailors

sid=sid

bid=100 rating > 5

sname

34

ExamplePass 2

❖ Pass1:

▪ Sailors:

• Choice1: B+ tree matches rating>5,

▪ Reserves:

• B+ tree on bid matches bid=100.

Sailors:B+ tree on ratingHash on sid

Reserves:B+ tree on bid

❖ Pass 2:

– each plan retained from Pass 1 as the outer.

– how to join it with the (only) other relation.

– e.g., Reserves as outer : Hash index can be used to get Sailors tuples that satisfy sid = outer tuple’s sid value.

– e.g., Sailors as outer : Could possibly use sort-merge join, etc.

Reserves Sailors

sid=sid

bid=100 rating > 5

sname

35

Nested Queries

❖ Nested block is optimized independently, with the outer tuple considered as providing a selection condition.

❖ Outer block is optimized with the cost of `calling’ nested block computation taken into account.

❖ Implicit ordering of these blocks means that some good strategies are not considered.

❖ The non-nested version of the query is typically optimized better.

SELECT S.snameFROM Sailors SWHERE EXISTS

(SELECT *FROM Reserves RWHERE R.bid=103 AND R.sid=S.sid)

Nested block to optimize:

SELECT *FROM Reserves RWHERE R.bid=103

AND S.sid= outer value

Equivalent non-nested query:

SELECT S.snameFROM Sailors S, Reserves RWHERE S.sid=R.sid

AND R.bid=103

36

Highlights of System R Optimizer

❖ Impact of R Optimizer:

▪ Most widely used currently; works well for < 10 joins.

❖ Cost estimation: “Approximate art”.

▪ Statistics, maintained in system catalog, used to estimate cost of operations and result sizes.

▪ Considers combination of CPU and I/O costs.

❖ Plan Space: Too large, must be pruned.

▪ Only the space of left-deep plans considered.

▪ Cartesian products avoided.

37

Other Approaches to Query Optimization

❖ Exclusive search is not suitable for large number of Joins.

❖ Rule-based optimizers: a set of rules to guide the generation of candidate plans.

❖ Randomized plan generation: probabilistic algorithms to explore a large space of plans.

❖ Estimating the size of intermediate relations accurately.• Parametric query optimization: find good plans for a given query for each

of several different conditions that might be encountered at run-time.

• Multiple-query optimization: take concurrent execution of several queries into account.

38

Summary : Query Optimization

▪ Consider a set of alternative plans.

• Must prune search space;

• Typically, left-deep plans only;

• Algorithms to explore alternate plans

▪ Must estimate cost of each plan that is considered.

• Estimate size of result of each operator

• Estimate cost for each plan node.

• Key issues: Statistics, indexes, operator implementations.

Documents

Relational Query Optimizationweb.cs.wpi.edu/~cs542/s17/material/Ch15_Opt_whirlwind.pdfAND termk 8 Assumptions Uniform distribution of values in domain Independent distribution of values