jessica-smith
DESCRIPTION
How to Optimize Query in Distributed Database Management System
Distributed Query Optimization
Chapter 9
Query Processing and Optimization
• Query processing is the process of translating a query expressed in a high-level language such as SQL into low-level data manipulation operations.
• Query Optimization refers to the process by which the best execution strategy for a given query is found from a set of alternatives.
Query Optimization
• The input to the third step is an algebraic query on fragments.
• By permuting the ordering of operations within one fragment query, many equivalent query execution plans may be found.
• The optimizer decides which indices should be used and how to order the operations of a query (e.g., joins, selects, and projects).
• The goal of query optimization is to find an execution strategy for the query that is close to optimal.
Distributed Query Optimization
Components of the distributed query optimizer:
• Search space.
• Search strategy. The search strategy explores the search space and selects the best plan.
• Cost model.
Distributed Query Optimization Issues
• Linear query trees are not necessarily a good choice
• Bushy query trees are not necessarily a bad choice
• What and where to ship the relations
• How to ship relations (ship as a whole, ship as needed)
• When to use semi-joins instead of joins
Search Space
• The set of alternative query execution plans (QEP)
– Typically very large.
– The main issue is to optimize the joins.
Equivalent query trees (join trees) of the joins in the following query
SELECT ENAME, RESP
FROM EMP, ASG, PROJ
WHERE EMP.ENO=ASG.ENO AND ASG.PNO=PROJ.PNO
Basic Concepts
Reduction of the search space
– Restrict by means of heuristics
  • Perform unary operations before binary operations
– Restrict the shape of the join tree
  • Consider the type of trees (linear trees vs. bushy ones)
Search Space
• There are two main strategies to scan the search space
– Deterministic
– Randomized
Deterministic scan of the search space
– Start from base relations and build plans by adding one relation at each step.
– Breadth-first strategy: build all possible plans before choosing the “best” plan (dynamic programming approach).
– Depth-first strategy: build only one plan (greedy approach).
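The two deterministic strategies can be sketched in Python as follows. This is a minimal illustration restricted to left-deep plans, not the algorithm of any particular system. The cardinalities and selectivities in `CARD` and `SEL` are hypothetical values chosen for the example, and the cost of a plan is assumed to be the sum of its estimated intermediate result sizes.

```python
from fractions import Fraction
from itertools import combinations

# Hypothetical statistics (not from the slides): base cardinalities and
# join selectivities between pairs of relations.
CARD = {"EMP": 400, "ASG": 1000, "PROJ": 10}
SEL = {
    frozenset(("EMP", "ASG")): Fraction(1, 400),
    frozenset(("ASG", "PROJ")): Fraction(1, 100),
}


def result_size(relations):
    """Estimated cardinality of joining the given relations."""
    size = Fraction(1)
    for r in relations:
        size *= CARD[r]
    for a, b in combinations(sorted(relations), 2):
        size *= SEL.get(frozenset((a, b)), 1)  # 1 means Cartesian product
    return size


def dp_best_plan(relations):
    """Breadth-first: keep the cheapest left-deep plan for every subset."""
    best = {frozenset([r]): (0, (r,)) for r in relations}
    for k in range(2, len(relations) + 1):
        for subset in map(frozenset, combinations(relations, k)):
            for r in sorted(subset):
                cost, order = best[subset - {r}]
                cand = (cost + result_size(subset), order + (r,))
                if subset not in best or cand[0] < best[subset][0]:
                    best[subset] = cand
    return best[frozenset(relations)]


def greedy_plan(relations):
    """Depth-first: build one plan, always adding the relation that
    gives the smallest next intermediate result."""
    remaining = set(relations)
    order = [min(sorted(remaining), key=lambda r: CARD[r])]
    remaining.remove(order[0])
    cost = 0
    while remaining:
        nxt = min(sorted(remaining), key=lambda r: result_size(order + [r]))
        order.append(nxt)
        remaining.remove(nxt)
        cost += result_size(order)
    return cost, tuple(order)
```

With these assumed statistics both functions return the order PROJ, ASG, EMP. In general only the dynamic programming version is guaranteed to find the cheapest plan under the cost function; the greedy version examines far fewer plans.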
Randomized scan of the search space
– Search for optimal solutions around a particular starting point, e.g., iterative improvement.
– Trades optimization time for execution time.
– Does not guarantee that the best solution is obtained, but avoids the high cost of optimization.
– The strategy is better when more than 5-6 relations are involved.
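A randomized scan can be sketched as a simple iterative improvement loop. The statistics in `CARD` and `SEL`, the cost function, and the "swap two relations" move are all assumptions made for this illustration; real optimizers use richer moves and cost models.

```python
import random
from fractions import Fraction
from itertools import combinations

# Hypothetical statistics, chosen only for this example.
CARD = {"EMP": 400, "ASG": 1000, "PROJ": 10}
SEL = {
    frozenset(("EMP", "ASG")): Fraction(1, 400),
    frozenset(("ASG", "PROJ")): Fraction(1, 100),
}


def plan_cost(order):
    """Sum of estimated intermediate result sizes of a left-deep plan."""
    cost = 0
    for k in range(2, len(order) + 1):
        prefix = order[:k]
        size = Fraction(1)
        for r in prefix:
            size *= CARD[r]
        for a, b in combinations(prefix, 2):
            size *= SEL.get(frozenset((a, b)), 1)
        cost += size
    return cost


def iterative_improvement(relations, rng, max_moves=100):
    """Start from a random join order and keep only improving swaps."""
    order = list(relations)
    rng.shuffle(order)
    best = plan_cost(order)
    for _ in range(max_moves):
        i, j = rng.sample(range(len(order)), 2)
        order[i], order[j] = order[j], order[i]
        cost = plan_cost(order)
        if cost < best:
            best = cost
        else:
            order[i], order[j] = order[j], order[i]  # undo the swap
    return best, tuple(order)
```

A call such as `iterative_improvement(["EMP", "ASG", "PROJ"], random.Random(0))` returns a locally good plan without enumerating the whole search space, which is the trade-off described above.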
Distributed Cost Model
• Two different types of cost functions can be used
– Reduce total time
– Reduce response time
• Total time: Sum of the time of all individual components.
• Local processing time: CPU time + I/O time
• Communication time: fixed time to initiate a message + time to transmit the data
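The individual components of the total cost have different weights:
– Wide area network: communication cost (message initiation and transmission) dominates local processing cost
– Local area network: communication and local processing costs are of comparable magnitude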
• Response time: Elapsed time between the initiation and the completion of a query
Example: Site 1 sends x units of data to Site 3, and Site 2 sends y units of data to Site 3.
Assume that only the communication cost is considered.
Total time = 2 × message initialization time + unit transmission time × (x + y)
Response time = max{time to send x from 1 to 3, time to send y from 2 to 3}
time to send x from 1 to 3 = message initialization time + unit transmission time × x
time to send y from 2 to 3 = message initialization time + unit transmission time × y
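The two cost functions of this example can be written out as a small Python sketch. The parameter names `t_init` and `t_unit` and the numbers in the usage note are assumptions for illustration only.

```python
def send_time(units, t_init, t_unit):
    """Time to send one message carrying `units` units of data."""
    return t_init + t_unit * units


def total_time(x, y, t_init, t_unit):
    """Sum of all communication work: two messages carrying x + y units."""
    return 2 * t_init + t_unit * (x + y)


def response_time(x, y, t_init, t_unit):
    """Elapsed time when both transfers to Site 3 run in parallel."""
    return max(send_time(x, t_init, t_unit), send_time(y, t_init, t_unit))
```

For instance, with an assumed `t_init = 10`, `t_unit = 1`, `x = 100`, and `y = 50`, the total time is 170 while the response time is 110, because the two transfers overlap.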
Database Statistics
The primary cost factor is the size of intermediate relations
• Intermediate relations must be transmitted over the network if a subsequent operation is located on a different site.
• It is costly to compute the size of the intermediate relations precisely.
• Instead, global statistics of relations and fragments are computed.
Let R(A1, A2, ..., Ak) be a relation fragmented into R1, R2, ..., Rr.
• Relation statistics
– min and max values of each attribute: min{Ai}, max{Ai}
– length of each attribute: length(Ai)
– number of distinct values in each fragment (cardinality): card(Ai), (card(dom(Ai)))
• Fragment statistics
– cardinality of the fragment: card(Ri)
– cardinality of each attribute of each fragment: card(πAi(Rj))
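As a rough illustration, some of these statistics can be computed over fragments held in memory. The fragments `ASG1` and `ASG2` and the function names are hypothetical; a real system would maintain such statistics in its catalog rather than scan the data.

```python
def relation_stats(fragments, attr):
    """min{Ai}, max{Ai} and number of distinct values of one attribute
    over all fragments of a relation."""
    values = [t[attr] for frag in fragments for t in frag]
    return {"min": min(values), "max": max(values), "card": len(set(values))}


def fragment_stats(fragment, attrs):
    """card(Ri) and card(pi_Ai(Ri)) for each listed attribute."""
    return {
        "card": len(fragment),
        "attr_card": {a: len({t[a] for t in fragment}) for a in attrs},
    }


# Two made-up horizontal fragments of ASG.
ASG1 = [{"ENO": "E1", "PNO": "P1"}, {"ENO": "E2", "PNO": "P1"}]
ASG2 = [
    {"ENO": "E2", "PNO": "P2"},
    {"ENO": "E3", "PNO": "P3"},
    {"ENO": "E3", "PNO": "P4"},
]
```

Here `fragment_stats(ASG2, ["ENO", "PNO"])` reports a fragment cardinality of 3 with 2 distinct ENO values and 3 distinct PNO values.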
Query Optimization Process
[Figure: the input query is expanded by search space generation, using transformation rules, into equivalent QEPs; the search strategy, using the cost model, then selects the best QEP.]
Centralized Query Optimization
• INGRES
– dynamic
– interpretive
• System R
– static
– exhaustive search
INGRES Algorithm
• Decompose each multi-variable query into a sequence of mono-variable queries with a common variable
• Process each by a one variable query processor
– Choose an initial execution plan (heuristics)
– Order the rest by considering intermediate relation sizes
• No statistical information is maintained
INGRES Language: QUEL
• QUEL Language - a tuple calculus language
Example:
range of e is EMP
range of g is ASG
range of j is PROJ
retrieve e.ENAME where e.ENO=g.ENO and j.PNO=g.PNO and j.PNAME="CAD/CAM"
Note: e, g, and j are called variables
INGRES Algorithm–Decomposition
• Replace an n variable query q by a series of queries
q1 → q2 → … → qn
where qi uses the result of qi-1.
• Detachment
– Query q decomposed into q' → q" where q' and q" have a common variable which is the result of q'
• Tuple substitution
– Replace the value of each tuple with actual values and simplify the query
q(V1, V2, ..., Vn) → (q'(t1, V2, V3, ..., Vn), t1 ∈ R)
Detachment
q: SELECT V2.A2, V3.A3, …, Vn.An
   FROM R1 V1, …, Rn Vn
   WHERE P1(V1.A1') AND P2(V1.A1, V2.A2, …, Vn.An)
q': SELECT V1.A1 INTO R1'
    FROM R1 V1
    WHERE P1(V1.A1)
q": SELECT V2.A2, …, Vn.An
    FROM R1' V1, R2 V2, …, Rn Vn
    WHERE P2(V1.A1, V2.A2, …, Vn.An)
Detachment Example
Names of employees working on CAD/CAM project
q1: SELECT EMP.ENAME
    FROM EMP, ASG, PROJ
    WHERE EMP.ENO=ASG.ENO AND ASG.PNO=PROJ.PNO
    AND PROJ.PNAME="CAD/CAM"
q11: SELECT PROJ.PNO INTO JVAR
     FROM PROJ
     WHERE PROJ.PNAME="CAD/CAM"
q': SELECT EMP.ENAME
    FROM EMP, ASG, JVAR
    WHERE EMP.ENO=ASG.ENO
    AND ASG.PNO=JVAR.PNO
Detachment Example (cont’d)
q': SELECT EMP.ENAME
    FROM EMP, ASG, JVAR
    WHERE EMP.ENO=ASG.ENO
    AND ASG.PNO=JVAR.PNO
q12: SELECT ASG.ENO INTO GVAR
     FROM ASG, JVAR
     WHERE ASG.PNO=JVAR.PNO
q13: SELECT EMP.ENAME
     FROM EMP, GVAR
     WHERE EMP.ENO=GVAR.ENO
Tuple Substitution
q11 is a mono-variable query.
q12 and q13 are subject to tuple substitution.
Assume GVAR has two tuples only: <E1> and <E2>.
Then q13 becomes
q131: SELECT EMP.ENAME
      FROM EMP
      WHERE EMP.ENO="E1"
q132: SELECT EMP.ENAME
      FROM EMP
      WHERE EMP.ENO="E2"
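The detachment and tuple substitution steps of this example can be mimicked in Python over small in-memory relations. The sample tuples and employee names below are made up for illustration, and the sketch only mirrors the order of evaluation (q11, then q12, then q13); it is not how INGRES is implemented.

```python
# Made-up sample data for the three relations.
EMP = [
    {"ENO": "E1", "ENAME": "J. Doe"},
    {"ENO": "E2", "ENAME": "M. Smith"},
    {"ENO": "E3", "ENAME": "A. Lee"},
]
ASG = [
    {"ENO": "E1", "PNO": "P3"},
    {"ENO": "E2", "PNO": "P3"},
    {"ENO": "E3", "PNO": "P1"},
]
PROJ = [
    {"PNO": "P1", "PNAME": "Instrumentation"},
    {"PNO": "P3", "PNAME": "CAD/CAM"},
]


def employees_on(pname):
    # q11: mono-variable query on PROJ, detached from the original query.
    jvar = {p["PNO"] for p in PROJ if p["PNAME"] == pname}

    # q12: tuple substitution over JVAR, giving one mono-variable
    # query on ASG per PNO value.
    gvar = set()
    for pno in jvar:
        gvar |= {a["ENO"] for a in ASG if a["PNO"] == pno}

    # q13: tuple substitution over GVAR (q131, q132, ...), giving one
    # mono-variable query on EMP per ENO value.
    names = []
    for eno in sorted(gvar):
        names += [e["ENAME"] for e in EMP if e["ENO"] == eno]
    return names
```

With this sample data GVAR holds E1 and E2, matching the assumption made in the slide, so the final loop runs the equivalents of q131 and q132.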
Distributed INGRES Algorithm
Same as the centralized version except
• Movement of relations (and fragments) needs to be considered
• Optimization with respect to communication cost or response time is possible
Join Ordering in Fragment Queries
• Ordering joins
– Distributed INGRES
– System R*
Join Ordering
• Consider two relations only: ship the smaller relation to the site of the larger one.
– Ship R to the site of S if size(R) < size(S)
– Ship S to the site of R if size(R) > size(S)
• Multiple relations are more difficult because there are too many alternatives.
– Compute the cost of all alternatives and select the best one.
  • Necessary to compute the size of intermediate relations, which is difficult.
– Use heuristics
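The two-relation rule amounts to a one-line decision. The sketch below assumes that sizes are given in the same transfer units and that a tie may be broken either way; the function name is hypothetical.

```python
def ship_decision(size_r, size_s):
    """Return (relation to ship, destination site, units transferred).

    The smaller relation is shipped to the site of the larger one.
    On a tie this sketch arbitrarily ships S.
    """
    if size_r < size_s:
        return ("R", "site of S", size_r)
    return ("S", "site of R", size_s)
```

For example, `ship_decision(100, 2000)` ships R, transferring 100 units instead of 2000.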
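Join Ordering – Example
Consider PROJ ⋈PNO ASG ⋈ENO EMP
[Figure: join graph over Sites 1 to 3 with nodes PROJ, ASG, and EMP; the PROJ–ASG edge is labeled PNO and the ASG–EMP edge is labeled ENO.]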
System R Algorithm
1. Simple (i.e., mono-relation) queries are executed according to the best access path
2. Execute joins
2.1 Determine the possible ordering of joins
2.2 Determine the cost of each ordering
2.3 Choose the join ordering with minimal cost
For joins, two alternative algorithms:
• Nested loops
for each tuple of external relation (cardinality n1)
    for each tuple of internal relation (cardinality n2)
        join two tuples if the join predicate is true
    end
end
– Complexity: n1 × n2
• Merge join
sort relations
merge relations
– Complexity: n1 + n2 if relations are previously sorted and equijoin
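Both join algorithms can be sketched in Python over lists of dictionaries. This is an illustrative equijoin on a single attribute named by `key`, an assumption of this sketch; the merge join sorts its inputs first, so the n1 + n2 bound applies only to the merge phase.

```python
def nested_loop_join(outer, inner, key):
    """Compare every outer tuple with every inner tuple: n1 * n2 steps."""
    result = []
    for o in outer:
        for i in inner:
            if o[key] == i[key]:
                result.append({**o, **i})
    return result


def merge_join(left, right, key):
    """Sort both inputs on the join attribute, then merge them."""
    left = sorted(left, key=lambda t: t[key])
    right = sorted(right, key=lambda t: t[key])
    result, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        lk, rk = left[i][key], right[j][key]
        if lk < rk:
            i += 1
        elif lk > rk:
            j += 1
        else:
            # Find the run of right tuples sharing this key value.
            j_end = j
            while j_end < len(right) and right[j_end][key] == lk:
                j_end += 1
            # Pair every left tuple with that key against the whole run.
            while i < len(left) and left[i][key] == lk:
                for r in right[j:j_end]:
                    result.append({**left[i], **r})
                i += 1
            j = j_end
    return result
```

Both functions return the same set of joined tuples; only the amount of work and the order of the output differ.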
System R Algorithm – Example
Names of employees working on the CAD/CAM project
Assume
– EMP has an index on ENO,
– ASG has an index on PNO,
– PROJ has an index on PNO and an index on PNAME
[Figure: join graph. PROJ and ASG are connected by PNO; ASG and EMP are connected by ENO.]
System R Example (cont’d)
Choose the best access paths to each relation
– EMP: sequential scan (no selection on EMP)
– ASG: sequential scan (no selection on ASG)
– PROJ: index on PNAME (there is a selection on PROJ based on PNAME)
Determine the best join ordering
– EMP ⋈ ASG ⋈ PROJ
– ASG ⋈ PROJ ⋈ EMP
– PROJ ⋈ ASG ⋈ EMP
– ASG ⋈ EMP ⋈ PROJ
– EMP × PROJ ⋈ ASG
– PROJ × EMP ⋈ ASG
– Select the best ordering based on the join costs evaluated according to the two methods
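The six orderings above are the 3! permutations of the three relations. A small Python sketch can enumerate them and discard those that would require a Cartesian product. The join graph in `JOIN_EDGES` is taken from the example query; pruning only on Cartesian products is a simplification assumed here, since a real optimizer also prunes on estimated cost.

```python
from itertools import permutations

# Join predicates of the example query: EMP-ASG on ENO, ASG-PROJ on PNO.
JOIN_EDGES = {frozenset(("EMP", "ASG")), frozenset(("ASG", "PROJ"))}


def has_cartesian_product(order):
    """True if some relation joins a prefix it shares no predicate with."""
    joined = {order[0]}
    for rel in order[1:]:
        if not any(frozenset((rel, j)) in JOIN_EDGES for j in joined):
            return True
        joined.add(rel)
    return False


def candidate_orders(relations):
    """All join orders that avoid Cartesian products."""
    return [o for o in permutations(relations) if not has_cartesian_product(o)]
```

For EMP, ASG, and PROJ this keeps four of the six orderings; the two that start by combining EMP with PROJ are dropped.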
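System R Algorithm
Alternatives (search tree of join orders):
– EMP ⋈ ASG: pruned
– EMP × PROJ: pruned
– ASG ⋈ EMP: kept, extended to (ASG ⋈ EMP) ⋈ PROJ
– ASG ⋈ PROJ: pruned
– PROJ ⋈ ASG: kept, extended to (PROJ ⋈ ASG) ⋈ EMP
– PROJ × EMP: pruned
Best total join order is one of
– ((ASG ⋈ EMP) ⋈ PROJ)
– ((PROJ ⋈ ASG) ⋈ EMP)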
• ((PROJ ⋈ ASG) ⋈ EMP) has a useful index on the select attribute and direct access to the join attributes of ASG and EMP
• Therefore, choose it with the following access methods:
– select PROJ using index on PNAME
– then join with ASG using index on PNO
– then join with EMP using index on ENO
Distributed Query Optimization Problems
• Cost model
– multiple query optimization.
– heuristics to cut down on alternatives.
• Larger set of queries
– optimization only on select-project-join queries.
– also need to handle complex queries (e.g., unions, disjunctions, aggregations and sorting).
• Optimization cost vs execution cost tradeoff
– heuristics to cut down on alternatives.
– controllable search strategies.
• Optimization/reoptimization interval
– extent of changes in database profile before reoptimization is necessary.
Summary
• Distributed query optimization is more complex than centralized query processing, since
– bushy query trees are not necessarily a bad choice
– one needs to decide what, where, and how to ship the relations between the sites
• Query optimization searches for the optimal query plan (tree)
• For N relations, there are O(N!) equivalent join trees. There are two main strategies in query optimization: randomized and deterministic.
• (Few) semi-joins can be used to implement a join. Semi-joins require more operations to perform; however, the amount of data transferred is reduced
• Distributed INGRES, System R*, Hill Climbing, and SDD-1 are distributed query optimization algorithms