Query Optimizing

Distributed Query Optimization

Chapter 9

Query Processing and Optimization

• Query processing is the process of translating a query expressed in a high-level language such as SQL into low-level data manipulation operations.

• Query Optimization refers to the process by which the best execution strategy for a given query is found from a set of alternatives.

Query Optimization

• The input to the third step is an algebraic query on fragments.

• By permuting the order of operations within one fragment query, many equivalent query execution plans may be found.

• The optimizer decides which indices should be used and how to order the operations of a query (e.g., joins, selects, and projects).

• The goal of query optimization is to find an execution strategy for the query that is close to optimal.

Distributed Query Optimization

Components of the distributed query optimizer:

• Search space.
• Search strategy. The search strategy explores the search space and selects the best plan.
• Cost model.

Distributed Query Optimization Issues

• Linear query trees are not necessarily a good choice
• Bushy query trees are not necessarily a bad choice
• What and where to ship the relations
• How to ship relations (ship as a whole, ship as needed)
• When to use semi-joins instead of joins

Search Space

• The set of alternative query execution plans (QEP)
– Typically very large.
– The main issue is to optimize the joins.

Equivalent query trees (join trees) exist for the joins in the following query:

SELECT ENAME, RESP
FROM EMP, ASG, PROJ
WHERE EMP.ENO=ASG.ENO AND ASG.PNO=PROJ.PNO

Basic Concepts

Reduction of the search space

• Restrict by means of heuristics
– Perform unary operations before binary operations
• Restrict the shape of the join tree
– Consider the type of trees (linear trees vs. bushy ones)

Search Space

• There are two main strategies to scan the search space
– Deterministic
– Randomized


Deterministic scan of the search space
– Start from base relations and build plans by adding one relation at each step.
– Breadth-first strategy: build all possible plans before choosing the “best” plan (dynamic programming approach).
– Depth-first strategy: build only one plan (greedy approach).
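The two deterministic strategies can be illustrated with a minimal Python sketch. The cardinalities in CARD, the selectivities in SEL, and the cost measure (sum of estimated intermediate result sizes) are hypothetical assumptions made for illustration, not values from the slides. Only linear plans are built, adding one relation at each step.

```python
from fractions import Fraction
from itertools import combinations

# Hypothetical statistics (not from the slides): cardinalities and join selectivities.
CARD = {"EMP": 400, "ASG": 1000, "PROJ": 10}
SEL = {
    frozenset(("EMP", "ASG")): Fraction(1, 400),
    frozenset(("ASG", "PROJ")): Fraction(1, 50),
}

def extend(size, joined, relation):
    """Estimated size after joining the current result with one more relation.
    Pairs without a join predicate get selectivity 1 (Cartesian product)."""
    result = size * CARD[relation]
    for other in joined:
        result *= SEL.get(frozenset((other, relation)), 1)
    return result

def dynamic_programming(relations):
    """Breadth-first: keep the cheapest linear plan for every subset of relations."""
    best = {frozenset((r,)): (0, CARD[r], (r,)) for r in relations}
    for k in range(2, len(relations) + 1):
        for subset in map(frozenset, combinations(relations, k)):
            candidates = []
            for r in subset:
                cost, size, order = best[subset - {r}]
                new_size = extend(size, order, r)
                candidates.append((cost + new_size, new_size, order + (r,)))
            best[subset] = min(candidates)
    cost, _, order = best[frozenset(relations)]
    return cost, order

def greedy(relations):
    """Depth-first: build one plan, always adding the relation that is cheapest now."""
    remaining = sorted(relations)
    first = min(remaining, key=CARD.get)
    remaining.remove(first)
    cost, size, order = 0, CARD[first], (first,)
    while remaining:
        nxt = min(remaining, key=lambda r: extend(size, order, r))
        remaining.remove(nxt)
        size = extend(size, order, nxt)
        cost += size
        order += (nxt,)
    return cost, order
```

With these assumed statistics both strategies reach a plan of cost 400. In general the greedy plan may be worse, since it never revisits an earlier choice.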

Randomized scan of the search space
– Search for optimal solutions around a particular starting point, e.g., iterative improvement.
– Trades optimization time for execution time: does not guarantee that the best solution is obtained, but avoids the high cost of optimization.
– The strategy is better when more than 5-6 relations are involved.
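Iterative improvement can be sketched in Python as a local search over join orders. The relation sizes in SIZE, the toy cost function, and the choice of "swap two positions" as the neighbor move are all hypothetical assumptions for illustration; a real optimizer would plug in its own cost model and transformation rules.

```python
import random

# Hypothetical relation sizes, for illustration only.
SIZE = {"R1": 500, "R2": 20, "R3": 3000, "R4": 75, "R5": 1200, "R6": 8}

def toy_cost(order):
    """Assumed cost: a relation joined earlier is weighted more heavily."""
    n = len(order)
    return sum(SIZE[r] * (n - pos) for pos, r in enumerate(order))

def iterative_improvement(start, cost, rng, max_steps=500):
    """Start from one plan and repeatedly try a random neighbor (swap of two
    positions), accepting it only if it lowers the cost."""
    current, current_cost = list(start), cost(start)
    for _ in range(max_steps):
        i, j = rng.sample(range(len(current)), 2)
        neighbor = current[:]
        neighbor[i], neighbor[j] = neighbor[j], neighbor[i]
        neighbor_cost = cost(neighbor)
        if neighbor_cost < current_cost:
            current, current_cost = neighbor, neighbor_cost
    return current, current_cost
```

The number of steps bounds the optimization time, which is the trade-off named above: the result is never worse than the starting plan, but it is not guaranteed to be the best plan.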

Distributed Cost Model

• Two different types of cost functions can be used
– Reduce total time
– Reduce response time

Distributed Cost Model . . .

• Total time: sum of the time of all individual components.
• Local processing time: CPU time + I/O time
• Communication time: fixed time to initiate a message + time to transmit the data

The individual components of the total cost have different weights depending on the network:
– Wide area network
– Local area networks



• Response time: Elapsed time between the initiation and the completion of a query

Example

[Figure: Site 1 sends x units to Site 3; Site 2 sends y units to Site 3.]

Assume that only the communication cost is considered.

Total time = 2 × message initialization time + unit transmission time × (x + y)

Response time = max {time to send x from 1 to 3, time to send y from 2 to 3}

time to send x from 1 to 3 = message initialization time + unit transmission time × x

time to send y from 2 to 3 = message initialization time + unit transmission time × y
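The two cost functions of this example can be written as a short Python sketch. The function and parameter names are illustrative, and the numbers in the usage note are hypothetical.

```python
def send_time(units, init_time, unit_time):
    """Time to send `units` of data in one message: initialization plus transmission."""
    return init_time + unit_time * units

def total_time(x, y, init_time, unit_time):
    """Sum of both transfers: 2 * init_time + unit_time * (x + y)."""
    return send_time(x, init_time, unit_time) + send_time(y, init_time, unit_time)

def response_time(x, y, init_time, unit_time):
    """The two transfers run in parallel, so the slower one determines the elapsed time."""
    return max(send_time(x, init_time, unit_time), send_time(y, init_time, unit_time))
```

For instance, with x = 100, y = 50, an initialization time of 10, and a unit transmission time of 1, the total time is 170 while the response time is 110.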

Database Statistics

The primary cost factor is the size of intermediate relations:
• They must be transmitted over the network if a subsequent operation is located on a different site.
• It is costly to compute the size of the intermediate relations precisely.
• Instead, global statistics of relations and fragments are computed.

Let R(A1,A2, . . . ,Ak) be a relation fragmented into R1,R2, . . . ,Rr.

• Relation statistics
– min and max values of each attribute: min{Ai}, max{Ai}
– length of each attribute: length(Ai)
– number of distinct values in each fragment (cardinality): card(Ai), card(dom(Ai))

• Fragment statistics
– cardinality of the fragment: card(Ri)
– cardinality of each attribute of each fragment: card(π_Ai(Rj))


Query Optimization Process

[Figure: the Input Query enters Search Space Generation, which applies Transformation Rules to produce Equivalent QEPs; the Search Strategy then uses the Cost Model to select the Best QEP.]

Centralized Query Optimization

• INGRES
– dynamic
– interpretive

• System R
– static
– exhaustive search

INGRES Algorithm

• Decompose each multi-variable query into a sequence of mono-variable queries with a common variable
• Process each by a one-variable query processor
– Choose an initial execution plan (heuristics)
– Order the rest by considering intermediate relation sizes
• No statistical information is maintained


INGRES Language: QUEL

• QUEL Language - a tuple calculus language

Example:

range of e is EMP
range of g is ASG
range of j is PROJ
retrieve e.ENAME where e.ENO=g.ENO and j.PNO=g.PNO and j.PNAME="CAD/CAM"

Note: e, g, and j are called variables

INGRES Algorithm – Decomposition

• Replace an n-variable query q by a series of queries q1 → q2 → … → qn, where qi uses the result of qi-1.
• Detachment
– Query q is decomposed into q' → q", where q' and q" have a common variable which is the result of q'
• Tuple substitution
– Replace the value of each tuple with actual values and simplify the query
q(V1, V2, ..., Vn) → (q'(t1, V2, V3, ..., Vn), t1 ∈ R)

Detachment

q: SELECT V2.A2, V3.A3, …, Vn.An
   FROM R1 V1, …, Rn Vn
   WHERE P1(V1.A1') AND P2(V1.A1, V2.A2, …, Vn.An)

q': SELECT V1.A1 INTO R1'
    FROM R1 V1
    WHERE P1(V1.A1)

q": SELECT V2.A2, …, Vn.An
    FROM R1' V1, R2 V2, …, Rn Vn
    WHERE P2(V1.A1, V2.A2, …, Vn.An)

Detachment Example

Names of employees working on the CAD/CAM project

q1: SELECT EMP.ENAME
    FROM EMP, ASG, PROJ
    WHERE EMP.ENO=ASG.ENO AND ASG.PNO=PROJ.PNO AND PROJ.PNAME="CAD/CAM"

q11: SELECT PROJ.PNO INTO JVAR
     FROM PROJ
     WHERE PROJ.PNAME="CAD/CAM"

q': SELECT EMP.ENAME
    FROM EMP, ASG, JVAR
    WHERE EMP.ENO=ASG.ENO AND ASG.PNO=JVAR.PNO

Detachment Example (cont’d)

q': SELECT EMP.ENAME
    FROM EMP, ASG, JVAR
    WHERE EMP.ENO=ASG.ENO AND ASG.PNO=JVAR.PNO

q12: SELECT ASG.ENO INTO GVAR
     FROM ASG, JVAR
     WHERE ASG.PNO=JVAR.PNO

q13: SELECT EMP.ENAME
     FROM EMP, GVAR
     WHERE EMP.ENO=GVAR.ENO
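The detachment of q1 into q11, q12, and q13 can be checked with a small Python sketch using the standard sqlite3 module. The sample rows are hypothetical and exist only for illustration. SQLite has no SELECT ... INTO, so the sketch substitutes CREATE TEMP TABLE ... AS SELECT for it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE EMP (ENO TEXT, ENAME TEXT);
CREATE TABLE ASG (ENO TEXT, PNO TEXT);
CREATE TABLE PROJ (PNO TEXT, PNAME TEXT);
-- Hypothetical sample rows, for illustration only.
INSERT INTO EMP VALUES ('E1', 'J. Doe'), ('E2', 'M. Smith'), ('E3', 'A. Lee');
INSERT INTO ASG VALUES ('E1', 'P1'), ('E2', 'P3'), ('E3', 'P3');
INSERT INTO PROJ VALUES ('P1', 'Instrumentation'), ('P3', 'CAD/CAM');
""")

# q1: the original three-variable query.
q1 = conn.execute("""
    SELECT EMP.ENAME FROM EMP, ASG, PROJ
    WHERE EMP.ENO = ASG.ENO AND ASG.PNO = PROJ.PNO AND PROJ.PNAME = 'CAD/CAM'
    ORDER BY EMP.ENAME
""").fetchall()

# q11: mono-variable query on PROJ, result stored in JVAR.
conn.execute("CREATE TEMP TABLE JVAR AS SELECT PNO FROM PROJ WHERE PNAME = 'CAD/CAM'")

# q12: detached from q', result stored in GVAR.
conn.execute(
    "CREATE TEMP TABLE GVAR AS SELECT ASG.ENO FROM ASG, JVAR WHERE ASG.PNO = JVAR.PNO"
)

# q13: the remaining query over the reduced relation GVAR.
q13 = conn.execute("""
    SELECT EMP.ENAME FROM EMP, GVAR
    WHERE EMP.ENO = GVAR.ENO
    ORDER BY EMP.ENAME
""").fetchall()
```

On this sample data, q13 returns the same rows as q1, while each step touches at most two relations.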

Tuple Substitution

q11 is a mono-variable query.
q12 and q13 are subject to tuple substitution.
Assume GVAR has only two tuples: <E1> and <E2>.
Then q13 becomes

q131: SELECT EMP.ENAME
      FROM EMP
      WHERE EMP.ENO="E1"

q132: SELECT EMP.ENAME
      FROM EMP
      WHERE EMP.ENO="E2"
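Tuple substitution on q13 can be sketched in Python as generating one mono-variable query per tuple of GVAR. The helper name and the use of a string template are assumptions made for illustration; this is not how INGRES itself represents queries.

```python
def tuple_substitution(template, values):
    """Return one mono-variable query per substituted tuple value."""
    return [template.format(value=v) for v in values]

# q13 with the GVAR variable replaced by a placeholder for the actual ENO value.
Q13_TEMPLATE = 'SELECT EMP.ENAME FROM EMP WHERE EMP.ENO="{value}"'

# GVAR is assumed to contain exactly the two tuples <E1> and <E2>.
queries = tuple_substitution(Q13_TEMPLATE, ["E1", "E2"])
```

Here queries[0] corresponds to q131 and queries[1] to q132.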

Distributed INGRES Algorithm

Same as the centralized version except

• Movement of relations (and fragments) needs to be considered
• Optimization with respect to communication cost or response time is possible

Join Ordering in Fragment Queries

• Ordering joins
– Distributed INGRES
– System R*

Join Ordering

Consider two relations only
– Ship R to the site of S if size(R) < size(S)
– Ship S to the site of R if size(R) > size(S)

• Multiple relations are more difficult because there are too many alternatives.
– Compute the cost of all alternatives and select the best one. This requires computing the size of intermediate relations, which is difficult.
– Use heuristics
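The two-relation rule can be written as a small Python sketch. The function name and the returned strings are illustrative; when both sizes are equal the slides give no preference, so the sketch arbitrarily ships S.

```python
def choose_transfer(size_r, size_s):
    """Decide which operand of R join S to move when R and S are at different sites."""
    if size_r < size_s:
        return "ship R to the site of S"
    # size(R) > size(S); ties are resolved arbitrarily in favor of shipping S.
    return "ship S to the site of R"
```

The smaller relation is always the one transferred, which minimizes the communication cost of this single join.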

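Join Ordering – Example

Consider PROJ ⋈_PNO ASG ⋈_ENO EMP

[Figure: join graph in which PROJ, ASG, and EMP are stored at different sites (Site 1, Site 2, Site 3); ASG is linked to PROJ on PNO and to EMP on ENO.]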

System R Algorithm

1. Simple (i.e., mono-relation) queries are executed according to the best access path
2. Execute joins
2.1 Determine the possible ordering of joins
2.2 Determine the cost of each ordering
2.3 Choose the join ordering with minimal cost

System R Algorithm

For joins, two alternative algorithms:

• Nested loops

for each tuple of external relation (cardinality n1)
    for each tuple of internal relation (cardinality n2)
        join two tuples if the join predicate is true
    end
end

– Complexity: n1 × n2

• Merge join

sort relations
merge relations

– Complexity: n1 + n2 if relations are previously sorted and equijoin
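Both join algorithms can be sketched in Python over lists of tuples. The function names, the key-extractor parameters, and the sample rows in the usage note are assumptions for illustration. The merge join below expects both inputs to be already sorted on the join key and handles equijoins only.

```python
def nested_loop_join(outer, inner, predicate):
    """Compare every outer tuple with every inner tuple: n1 * n2 predicate tests."""
    result = []
    for o in outer:
        for i in inner:
            if predicate(o, i):
                result.append((o, i))
    return result

def merge_join(left, right, key_left, key_right):
    """Equijoin of two inputs that are already sorted on their join keys."""
    result = []
    i = j = 0
    while i < len(left) and j < len(right):
        kl, kr = key_left(left[i]), key_right(right[j])
        if kl < kr:
            i += 1
        elif kl > kr:
            j += 1
        else:
            # Find the run of equal keys on each side and pair them up.
            i_end = i
            while i_end < len(left) and key_left(left[i_end]) == kl:
                i_end += 1
            j_end = j
            while j_end < len(right) and key_right(right[j_end]) == kl:
                j_end += 1
            for l in left[i:i_end]:
                for r in right[j:j_end]:
                    result.append((l, r))
            i, j = i_end, j_end
    return result
```

For example, joining emp = [("E1", "J. Doe"), ("E2", "M. Smith")] with asg = [("E1", "P1"), ("E2", "P1"), ("E2", "P2")] on the first field yields the same three pairs with either function.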

System R Algorithm – Example

Names of employees working on the CAD/CAM project

Assume
– EMP has an index on ENO,
– ASG has an index on PNO,
– PROJ has an index on PNO and an index on PNAME

[Figure: join graph linking PROJ to ASG on PNO and ASG to EMP on ENO.]

System R Example (cont’d)

Choose the best access paths to each relation
– EMP: sequential scan (no selection on EMP)
– ASG: sequential scan (no selection on ASG)
– PROJ: index on PNAME (there is a selection on PROJ based on PNAME)

Determine the best join ordering
– EMP ⋈ ASG ⋈ PROJ
– ASG ⋈ PROJ ⋈ EMP
– PROJ ⋈ ASG ⋈ EMP
– ASG ⋈ EMP ⋈ PROJ
– EMP × PROJ ⋈ ASG
– PROJ × EMP ⋈ ASG
– Select the best ordering based on the join costs evaluated according to the two methods
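The six candidate orderings can be enumerated with a short Python sketch. The join predicates are taken from the example query (EMP.ENO=ASG.ENO and ASG.PNO=PROJ.PNO); the function name and the text rendering are illustrative. An ordering is marked with × wherever the next relation shares no join predicate with the relations already joined, which forces a Cartesian product.

```python
from itertools import permutations

# Pairs of relations connected by a join predicate in the example query.
JOIN_PREDICATES = {frozenset(("EMP", "ASG")), frozenset(("ASG", "PROJ"))}

def describe(order):
    """Render a linear join order, using × where no join predicate applies."""
    text, joined = order[0], [order[0]]
    for rel in order[1:]:
        connected = any(frozenset((rel, j)) in JOIN_PREDICATES for j in joined)
        text += (" ⋈ " if connected else " × ") + rel
        joined.append(rel)
    return text

orderings = [describe(p) for p in permutations(["EMP", "ASG", "PROJ"])]
```

Two of the six orderings start with a Cartesian product of EMP and PROJ, which is why they are poor candidates.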

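System R Algorithm

[Figure: tree of alternatives. From EMP: EMP ⋈ ASG (pruned), EMP × PROJ (pruned). From ASG: ASG ⋈ EMP, ASG ⋈ PROJ (pruned). From PROJ: PROJ ⋈ ASG, PROJ × EMP (pruned). The surviving plans extend to (ASG ⋈ EMP) ⋈ PROJ and (PROJ ⋈ ASG) ⋈ EMP.]

Best total join order is one of
((ASG ⋈ EMP) ⋈ PROJ)
((PROJ ⋈ ASG) ⋈ EMP)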

System R Algorithm

• ((PROJ ⋈ ASG) ⋈ EMP) has a useful index on the select attribute and direct access to the join attributes of ASG and EMP
• Therefore, choose it with the following access methods:
– select PROJ using index on PNAME
– then join with ASG using index on PNO
– then join with EMP using index on ENO

Distributed Query Optimization Problems

• Cost model
– multiple query optimization.
– heuristics to cut down on alternatives.
• Larger set of queries
– optimization only on select-project-join queries.
– also need to handle complex queries (e.g., unions, disjunctions, aggregations and sorting).
• Optimization cost vs. execution cost tradeoff
– heuristics to cut down on alternatives.
– controllable search strategies.
• Optimization/reoptimization interval
– extent of changes in database profile before reoptimization is necessary.

Summary

• Distributed query optimization is more complex than centralized query processing, since
– bushy query trees are not necessarily a bad choice
– one needs to decide what, where, and how to ship the relations between the sites
• Query optimization searches for the optimal query plan (tree)
• For N relations, there are O(N!) equivalent join trees. There are two main strategies in query optimization: randomized and deterministic.
• (Few) semi-joins can be used to implement a join. Semi-joins require more operations to perform; however, the amount of data transferred is reduced

• Distributed INGRES, System R*, Hill Climbing, and SDD-1 are distributed query optimization algorithms
