46
Optimizing Nested Queries with Parameter Sort Orders Appeared in the 31 st VLDB Conference 2005 Ravindra N. Guravannavar Ramanujam H.S. S. Sudarshan Indian Institute of Technology Bombay

Optimizing Nested Queries with Parameter Sort Orders

Embed Size (px)

DESCRIPTION

Optimizing Nested Queries with Parameter Sort Orders. Appeared in the 31 st VLDB Conference 2005 Ravindra N. Guravannavar Ramanujam H.S. S. Sudarshan Indian Institute of Technology Bombay. Nested Queries are Important. Commonly encountered in practice - PowerPoint PPT Presentation

Citation preview

Page 1: Optimizing Nested Queries with Parameter Sort Orders

Optimizing Nested Queries with Parameter Sort Orders

Appeared in the 31st VLDB Conference 2005

Ravindra N. Guravannavar Ramanujam H.S. S. Sudarshan

Indian Institute of Technology Bombay

Page 2: Optimizing Nested Queries with Parameter Sort Orders

2

Nested Queries are Important Commonly encountered in practice Queries having performance issues are often

complex nested queries In WHERE clause, SELECT clause, SQL

LATERAL clause Queries invoking User-Defined Functions

(UDFs)

Page 3: Optimizing Nested Queries with Parameter Sort Orders

3

Nested Queries – Few ExamplesExample 1:SELECT order_id, order_dateFROM ORDER O WHERE default_ship_to NOT IN (

SELECT ship_to FROM ORDERITEM OI WHERE OI.order_id = O.order_id );

Example 2:SELECT name, desgn FROM EMP E1WHERE E1.sal=(SELECT max(E2.sal)

FROM EMP E2 WHERE E2.dept=E1.dept);

Page 4: Optimizing Nested Queries with Parameter Sort Orders

4

Find the turn-around time for high priority orders

SELECT orderid, TurnaroundTime(orderid, totalprice, orderdate) FROM ORDERS WHERE order_priority=’HIGH’;

An Example: Query Invoking a UDF

DEFINE TurnaroundTime(@orderid, @totalprice, @orderdate) // Compute the order category with some procedural logic.

IF (@category = ‘A’)SELECT max(L.shipdate – @orderdate) FROM LINEITEM LWHERE L.orderid=@orderid;

ELSE SELECT MAX(L.commitdate – @orderdate) FROM LINEITEM L WHERE L.orderid=@orderid;

END;

Page 5: Optimizing Nested Queries with Parameter Sort Orders

5

Nested Iteration For each tuple t in the outer block

Bind parameter values from t Evaluate inner block – collect results in s Process t, s

Advantages Simple to implement Easy to ensure correctness Applicable to all types of nested queries

Page 6: Optimizing Nested Queries with Parameter Sort Orders

6

Nested IterationDrawbacks Performance can be very poor

Repeated work Random I/O

Cost = Cost(OuterBlock) + n*Cost(InnerBlock)

Where n=# tuples in the result of outer block

Improvements Proposed in System R Cache the inner subquery result for each distinct correlation

binding Sort the outer tuples so as to be able to cache a single

result at any given time

Page 7: Optimizing Nested Queries with Parameter Sort Orders

7

Decorrelation Techniques Rewrite nested query as an equivalent flat

query Allows the choice of set-oriented evaluation

plans such as hash and merge-join A range of techniques proposed and refined

over 2 decades

Page 8: Optimizing Nested Queries with Parameter Sort Orders

8

Decorrelation ExampleOriginal Query:SELECT O.order_id, O.order_dateFROM ORDER OWHERE default_ship_to IN ( SELECT ship_to

FROM ORDERITEM OI WHERE OI.order_id = O.order_id);

Decorrelated Query:SELECT O.order_id, O.order_date FROM ORDER O, ORDERITEM OI WHERE O.order_id=OI.order_id AND O.default_ship_to=OI.ship_to;

* Queries are not equivalent when duplicates are present

Page 9: Optimizing Nested Queries with Parameter Sort Orders

9

Decorrelation ExampleOriginal Query:SELECT c_name FROM CUSTOMER CWHERE 10 = ( SELECT count(order_id) FROM ORDER O

WHERE O.cust_id = C.cust_id);

Decorrelation [Kim 82]:Temp = SELECT cust_id, count(order_id) as order_count

FROM ORDER O GROUP BY cust_id;

SELECT c_name FROM CUSTOMER C, Temp T

WHERE C.cust_id=T.cust_id AND T.order_count=10;

* Goes wrong if one tries to find customers with no orders!

Page 10: Optimizing Nested Queries with Parameter Sort Orders

10

Problems with Decorrelation

Not always possible E.g., NOT IN predicate – requires anti-join

Many cases need duplicate elimination and an extra outer-join. Outer joins are not commutative and do not

associate with joins Duplicate elimination expensive

May not be applicable to UDFs unless their structure is very simple

Page 11: Optimizing Nested Queries with Parameter Sort Orders

11

Our Approach Optimize nested queries keeping their structure

intact Exploit properties of parameters (such as sort

order) to efficiently evaluate the inner sub-query More generic and can be applied to a wider class

of queries (e.g., Queries invoking complex UDFs)

Page 12: Optimizing Nested Queries with Parameter Sort Orders

12

Benefits of Sorting Outer Tuples

Sorting allows caching of a single inner result (System R)

Advantageous buffer effects (Graefe) A clustered index scan in the inner block will

access each block at most once irrespective of the buffer replacement policy

Allows state-retaining operators Re-startable table scan Incremental computation of aggregates

Page 13: Optimizing Nested Queries with Parameter Sort Orders

13

Restartable Table Scan Parameter bindings match sort order of

inner relation Retain state across function calls Similar to merge join – applicable for NI

Page 14: Optimizing Nested Queries with Parameter Sort Orders

14

Restartable Table Scan Parameters: match

sort order of inner relation

Retain state across function calls

2005-02-021200

2

1

2

1

lineitemid

2005-01-04140

2005-02-01200

2005-01-12100

2005-01-10100

shipdateorderid

Table LINEITEMParameter Bindings

orderid, totalprice, orderdate

{100, 20.5, 2005-01-02}

{140, 10.2, 2005-01-04}

{200, 30.8, 2005-02-01}

SELECT TurnaroundTime(orderid, … ) FROM ORDERS WHERE …

TurnaroundTime(@orderid, …) IF (…)

SELECT … FROM LINEITEM WHERE L.orderid=@orderid;ELSE

SELECT … FROM LINEITEM WHERE L.orderid=@orderid;

Page 15: Optimizing Nested Queries with Parameter Sort Orders

15

Incremental Computation of AggregatesSELECT day, sales

FROM DAILYSALES DS1WHERE sales > (SELECT MAX(sales)

FROM DAILYSALES DS2 WHERE DS2.day < DS1.day);

Applicable to:

Aggregates SUM, COUNT, MAX, MIN, AVG and

Predicates <, ≤,>, ≥

Param Sort Order Plan Cost

No order O(n*B) block transfers + seeks

DS1.day 2B block transfers + less seeks

Page 16: Optimizing Nested Queries with Parameter Sort Orders

16

Benefits of Sorting for a Clustered Index

Case-1Keys: 50, 500,400,80,600,200Potential data block fetches=6* Assume a single data block can be held in memoryRandom I/O

Case-2Keys: 50,80,200,400,500,600Data block fetches=3Sequential I/O

50 80 200 400 500 600

Data Block-1 Data Block-2 Data Block-3

400

Page 17: Optimizing Nested Queries with Parameter Sort Orders

17

Query Optimization with Nested Iteration

B1

B6 B7B5

B4B2B3

B8 B9

BIND variable setUSE variables set

A multi-level, multi-branch query

Plan cost for a block: A function of the order guaranteed on the IN variables and order required on the OUT variables

Not every possible sort order may be useful (only interesting orders)

Not every interesting order may be feasible/valid

Similar to interesting sort order of results but on parameters

Page 18: Optimizing Nested Queries with Parameter Sort Orders

18

Representing Nested Queries with Apply

A *

Bind Expression Use Expression

B:$a, $b U:$a, $b

A – The Apply Operator [Galindo-Legaria et.al. SIGMOD 2001]* – Operation between the outer tuple and result of the inner block

SELECT PO.order_idFROM PURCHASEORDER POWHERE default_ship_to NOT IN ( SELECT ship_to FROM ORDERITEM OI

WHERE OI.order_id = PO.order_id );

Page 19: Optimizing Nested Queries with Parameter Sort Orders

19

A UDF Represented with Apply

DEFINE fn(p1, p2, … pn) ASBEGIN

fnQ1 <p1, p2>; fnQ2 <p1, p2, p3>;

IF (condition)fnQ3<p2>;

ELSEfnQ4<p3>;

// Cursor loop binding v1, v2OPEN CURSOR ON fnQ5<p2, p3>; LOOP

fnQ6<p1, p2, v1, v2>;END LOOP

END

A

AQifnQ1 fnQ2 fnQ3 fnQ4

fnQ5 fnQ6

Page 20: Optimizing Nested Queries with Parameter Sort Orders

20

Optimizing with Parameter Sort Orders

Top-Down Exhaustive ApproachFor each possible sort order of the parameters, optimize the outer block and then the inner block.

A query block b at level l using n parameters will get optimized d(k)

l times where,

d(k)=kp0 +kp1 + … kpk

• Assuming an average of k=n/l parameters are bound at each block above b.

• And kpi = k!/(k-i)!

Page 21: Optimizing Nested Queries with Parameter Sort Orders

21

Optimizing with Parameter Sort OrdersOur proposal: Top-Down Multi-Pass

Approach Traverse the inner block(s) to find all valid,

interesting orders. For each valid, interesting order ord

Optimize the (outer) block with ord as the required output sort order (physical property).

Then optimize the inner block(s) with ord as the guaranteed parameter sort order.

Keep the combination, if it is cheaper than the cheapest plan found so far.

Page 22: Optimizing Nested Queries with Parameter Sort Orders

22

Feasible/Valid Parameter Sort Orders

Parameter sort order (a1, a2, … an) is valid iff

level(ai) <= level(aj) for all i, j s.t. i < j

B1

B2

B3

Binds a : sorted

Binds b : sorted

Is (a, b) valid/observable?

B1

B2

B3

Binds a, b : sorted

Uses a,b. Binds c : sorted

Cannot get (a, c) by dup elimination

Page 23: Optimizing Nested Queries with Parameter Sort Orders

23

A Stricter Notion of ValidityParameter sort order o=(a1, a2, … an) is valid(observable) at block bx iff

i. level(ai) <= level(aj) for all i, j s.t. i < j AND

ii. For each block bk s.t. level(bx) - level(bk) > 1,

corrattrs(bk, o) U bindattrs(bk, o) is a candidate key bk (key of schema of the expression in the FROM clause of bk)

Notation:level(bi): Level of the block bi

level(ai): Level of the block in which ai is bound

bindattrs(bk, o): Attributes in o that are bound at block bk

corrattrs(bk, o): Atttributes in bk that are correlated with attributes in o

with an equality predicate.

Page 24: Optimizing Nested Queries with Parameter Sort Orders

24

A Stricter Notion of Validity (Example)

B2

B3

B4

Binds b, has pred c=aKey: b, c

Now, (a, b) is valid at B4

B1 Binds aKey: a

Page 25: Optimizing Nested Queries with Parameter Sort Orders

25

Weaker Notion of Sort Orders (b11, b12,…)(b21, b22…)…

Sorted on seg-0 For a given value of seg-i, seg-i+1 can have

several sorted runs A parameter sort order p is said to weakly

subsume a sort order o if o is a subsequence of p ignoring parantheses

Operators need to have a method reset_state(segno) to reset the state for a specific segment

Cost of a state-retaining plan must be multipled by the number of expected runs

Page 26: Optimizing Nested Queries with Parameter Sort Orders

26

Plan Generation• Traverse the use inputs and

obtain valid interesting orders

• Extract orders relevant to the bind input

• Optimize the bind input making the order as a required output physical property

• Optimize the use input making the order as a guaranteed parameter sort order

Query Block-2 Query Block-3Binds $cUses $a,$b

Uses $a, $b, $c

Binds $a, $b

Query Block-1

Interesting Parameter Sort Orders

Required Result Sort Order

A

A

Page 27: Optimizing Nested Queries with Parameter Sort Orders

27

Plan Generation (Contd.) At a non-Apply logical operation node

Consider only those algorithms that require parameter sort order weaker than or equal to the guaranteed parameter sort order

E.g., An algorithm requiring parameter sort order (a, b) is not applicable when no order is guaranteed on the parameters.

Page 28: Optimizing Nested Queries with Parameter Sort Orders

28

Sort Order Propagation for a Multi-Level Multi-Branch Expression

σc1=a ^ c2=b (R2)R2 sorted on (c1,c2)

Page 29: Optimizing Nested Queries with Parameter Sort Orders

29

Experiments Evaluated the benefits of state retention

plans with PostgreSQL Scan and Aggregate operators were

modified for state retention Plans were hard coded as the Optimizer

extensions were not complete

Page 30: Optimizing Nested Queries with Parameter Sort Orders

30

Experiments (Contd.)A Nested Aggregate Query with Non-Equality Corrl. Predicate

SELECT day, sales FROM DAILYSALES DS1 WHERE sales > (SELECT MAX(sales) FROM DAILYSALES DS2 WHERE DS2.day < DS1.day);

NI – Nested IterationMAG – Magic Decorrelation [SPL96]NISR – NI with State Retention

Page 31: Optimizing Nested Queries with Parameter Sort Orders

31

Experiments (Contd.)TPC-H MIN COST Supplier Query

SELECT name, address … FROM PARTS, SUPPLIER, PARTSUPP WHERE nation=’FRANCE’ AND p_size=15 AND p_type=’BRASS’ AND <join_preds> AND ps_supplycost = ( SELECT min(PS1.supplycost) FROM …);

Page 32: Optimizing Nested Queries with Parameter Sort Orders

32

Experiments (Contd.)

SELECT orderid, TurnaroundTime(orderid, totalprice, orderdate)

FROM ORDERS WHERE order_priority=’H’;

DEFINE TurnaroundTime(@orderid, @totalprice, @orderdate)

… Compute the order category with some procedural logic …IF (@category = ‘A’)

SELECT max(L.shipdate – @orderdate) FROM LINEITEM L

WHERE L.orderid=@orderid;ELSE

SELECT MAX(L.commitdate – @orderdate)

FROM LINEITEM L WHERE L.orderid =@orderid;

END;

A query with UDF

Page 33: Optimizing Nested Queries with Parameter Sort Orders

33

Questions?

Page 34: Optimizing Nested Queries with Parameter Sort Orders

34

Extra Slides

Page 35: Optimizing Nested Queries with Parameter Sort Orders

35

Physical Plan Space Generation

PhysEqNode PhysDAGGen(LogEQNode e, PhyProp p, ParamSortOrder s)

If a physical equivalence node np exists for e, p, s

return np

Create an equivalence node np for e, p, s

For each logical operation node o below e If(o is an instance of ApplyOp)

ProcApplyNode(o, s, np)

elseProcLogOpNode(o, p, s, np)

For each enforcer f that generates property p Create an enforcer node of under np

Set the input of of = PhysDAGGen(e, null, s)

return np

End

Page 36: Optimizing Nested Queries with Parameter Sort Orders

36

Processing a Non-Apply Nodevoid ProcLogOpNode(LogOpNode o, PhysProp p,ParamSortOrder s,

PhysEqNode np)

For each algorithm a for o that guarantees p and requires no stronger sort order than s

Create an algorithm node oa under np

For each input i of oa

Let oi be the i th input of oa

Let pi be the physical property required

from input i by algorithm aSet input i of oa = PhysDAGGen(oi, pi, s)

End

Page 37: Optimizing Nested Queries with Parameter Sort Orders

37

Processing the Apply Nodevoid ProcApplyNode(LogOpNode o, ParamSortOrder s, PhysEqNode np)

Initialize i_ords to be an empty set or sort ordersFor each use expression u under o uOrds = GetInterestingOrders(u) i_ords = i_ords Union uOrds

l_ords = GetLocalOrders(i ords, o.bindInput)For each order ord in l_ords and empty order

leq = PhysDAGGen(lop.bindInput, ord, s)Let newOrd = concat(s, ord)applyOp = create new applyPhysOp(o.TYPE)applyOp.lchild = leqFor each use expression u of o ueq = PhysDAGGen(u, null, newOrd) Add ueq as a child node of applyOpnp.addChild(applyOp)

End

Page 38: Optimizing Nested Queries with Parameter Sort Orders

38

Generating Interesting Parameter OrdersSet<Order> GetInterestingOrders(LogEqNode e) if the set of interesting orders i_ords for e is already found return i_ords Create an empty set result of sort orders for each logical operation node o under e for each algorithm a for o Let sa be the sort order of interest to a on the unbound parameters in e if sa is a valid order and sa is not in result

Add sa to result

for each input logical equivalence node ei of a

childOrd = GetInterestingOrders(ei)

if (o is an Apply operator AND ei is a use input) childOrd = GetAncestorOrders(childOrd, o.bindInput) result = result Union childOrd return resultEnd

Page 39: Optimizing Nested Queries with Parameter Sort Orders

39

Extracting Ancestor OrdersSet<Order> GetAncestorOrders(Set<Order> i_ords, LogEqNode e)

Initialize a_ords to be an empty set of sort orders for each order ord in i_ords newOrd = Empty vector; for (i = 1; i <=length(ord); i = i + 1) if ord[i] is NOT bound by e append(ord[i], newOrd) else break;

add newOrd to a_ords return a_ordsEnd

Page 40: Optimizing Nested Queries with Parameter Sort Orders

40

Extracting Local OrdersSet<Order> GetLocalOrders(Set<Order> i_ords, LogEqNode e)

Initialize l_ords to be an empty set or sort ordersFor each ord in i_ords

newOrd = Empty vector;For (i =length(ord); i > 0; i = i – 1 )

If ord[i] is bound by eprepend(ord[i], newOrd)

Elsebreak;

add newOrd to l_ordsreturn l_ords

End

Page 41: Optimizing Nested Queries with Parameter Sort Orders

41

Extensions to the Volcano Optimizer

Contract of the original algorithm for optimization:Plan FindBestPlan(Expr e, PhysProp rpp, Cost cl)

Contract of the modified algorithm for optimization:Plan FindBestPlan(Expr e, PhysProp rpp, Cost cl, Order pso, int callCount)

Plans generated and cached for <e, rpp, pso, callCount> Not all possible orderings of the parameters are valid

Parameter Sort Order (a1, a2, … an) is valid iff level(ai) <= level(aj) for all i, j s.t. i < j.

Not all valid orders may be interesting (we consider only valid, interesting parameter sort orders)

Page 42: Optimizing Nested Queries with Parameter Sort Orders

42

A Typical Nested Iteration Plan

For ti {t1, t2, t3, … tn} do

innerResult = {Ø} For ui {u1, u2, u3, … um} do

if (pred(ti ,ui))

Add ui to innerResult;

done;process(ti ,innerResult);

done;

Page 43: Optimizing Nested Queries with Parameter Sort Orders

43

Benefits of Sorting for a Clustered Index

Case-1Keys: 50, 500,400,80,600,200Potential data block fetches=6* Assume a single data block can be held in memoryRandom I/O

Case-2Keys: 50,80,200,400,500,600Data block fetches=3Sequential I/O

50 80 200 400 500 600

Data Block-1 Data Block-2 Data Block-3

400

* We provide cost estimation for clustered index scan taking the buffer effects into account (full length paper)

Page 44: Optimizing Nested Queries with Parameter Sort Orders

44

Difference from Join Optimization

Block-1B:{R1.a, R1.b}

Block-2B:{R2.c}U:{R1.a}

Block-3U:{R1.b, R2.c}

R1 R3

R2

Sort on R1.a

Sort on R3.b

Not an option for Nested Iteration

Page 45: Optimizing Nested Queries with Parameter Sort Orders

45

Experiments (Contd.)A simple IN query with no outer predicates

SELECT o_orderkey FROM ORDERS WHERE o_orderdate IN (SELECT l_shipdate FROM LINEITEM WHERE l_orderkey = o_orderkey);

NI – Nested IterationMAG – Magic Decorrelation [SPL96]NISR – NI with State Retention

Note: MAG is just one form of decorrelation, and the comparison here is NOT with decorrelation techniques in general

Page 46: Optimizing Nested Queries with Parameter Sort Orders

46

Future Work Factoring execution probabilities of queries

inside function body for appropriate costing Analyze function body Exploit history of execution (when available)

Parameter properties other than sort orders that would be interesting to nested queries and functions

SQL/XML, XQuery