19
Access Path Selection in a Relational Database Management System Selinger et al.

Access Path Selection in a Relational Database Management System

Embed Size (px)

DESCRIPTION

Access Path Selection in a Relational Database Management System. Selinger et al. Query Optimization. Declarative query language relieves programmer of the burden to choose an access plan. Difference between good and bad access plans can be several orders of magnitude. - PowerPoint PPT Presentation

Citation preview

Page 1: Access Path Selection in a Relational Database Management System

Access Path Selection in a Relational Database Management

System

Selinger et al.

Page 2: Access Path Selection in a Relational Database Management System

Query Optimization

• Declarative query language relieves programmer of the burden to choose an access plan.

• Difference between good and bad access plans can be several orders of magnitude.

• Problem: How do we find a good plan?

Page 3: Access Path Selection in a Relational Database Management System

How to choose a good plan?

Divide into three problems:

• What is the set of plans we consider?

• How do we compare (cost) plans?

• How do we choose a “good” plan from this set?

Page 4: Access Path Selection in a Relational Database Management System

Background: SQL

SELECT {DISTINCT} <list of columns>

FROM <list of tables>

{WHERE <list of "Boolean Factors“(predicates in CNF)>}

{GROUP BY <list of columns>

{HAVING <list of Boolean Factors>}}

{ORDER BY <list of columns>};

Page 5: Access Path Selection in a Relational Database Management System

Processing SQL Statements

• Four steps:– Parsing– Optimization– Code generation– Execution

Page 6: Access Path Selection in a Relational Database Management System

SQL Query Types

• Single block queries only– optimize nested sub-queries separately

• Correlated sub-queries are much harder and much more expensive than un-correlated sub-queries.– Rewrite to remove correlations where possible.

(Tricky, more about this later.)

• SPJ queries only in this paper

Page 7: Access Path Selection in a Relational Database Management System

Problem 1: Plan Space

• Fixed set of individual access methods• Sequential & index (clustered/unclustered)

scans • NL-join, (sort)-merge join, hash join • Sorting & hash-based grouping

• Plan flows in a non-blocking fashion with get-next iterators

Page 8: Access Path Selection in a Relational Database Management System

Plan Space (Contd.)

• Assumptions in System R:– Selections of “sargable” predicates are "pushed

down"– Projections are "pushed down" – Single query blocks– Only left-deep plan trees (There are n! plans

(not factoring in choice of join method))– Avoid Cartesian products

Page 9: Access Path Selection in a Relational Database Management System

Problem 2: How to Cost a Plan

• Estimation of input and intermediate cardinalities based on simple statistical models (e.g., uniform distribution assumption, attribute independence)

• Estimation of costs for each operator based on statistics about input relations– Cost is weighted function between I/O and CPU

(no distinction between random and sequential IO)

Page 10: Access Path Selection in a Relational Database Management System

Cost Estimation (Selinger)

• Maintenance of simple statistics:– # of tuples & pages – # of values per column (only for indexed

columns)

• Assumption of attribute independence• When no estimate available, use magic

number• These estimations are done periodically

Page 11: Access Path Selection in a Relational Database Management System

Cost Estimation (Today)

• Sampling: so far only concrete results for base relations

• Histograms: getting better. Common in industry, some interesting new research.

• Controlling "error propagation"

Page 12: Access Path Selection in a Relational Database Management System

Problem 3: Choosing a Plan

• Exhaustive search • Dynamic Programming (prunes suboptimal parts

of the search space): System R • Top-down, transformative version of DP:

Volcano, Cascades (used in MS SQL Server?) • Randomized search algorithms (e.g. Ioannidis &

Kang) • Techniques from Operations Research• Etc.

Page 13: Access Path Selection in a Relational Database Management System

System R Approach

• Recall: Only left-deep plan trees (n! different plans)

• Observation: Many of these plans share common prefixes, so do not recompute all of them: Dynamic Programming

Page 14: Access Path Selection in a Relational Database Management System

Dynamic Programming Approach

1. Find all 1-table plans for each base relation2. Try all ways of joining i-table plans saved so far

with 1-table plans. Save cheapest unordered (i+1)-table plans, and cheapest (i+1)-table plans for each interesting order

• Note: secondary join predicates are just like selections that can’t be pushed down

3. At the end, GROUP BY and ORDER BY– Use plan in interesting order, or add sort to cheapest

unordered plan.

Page 15: Access Path Selection in a Relational Database Management System

Evaluation

• Complexity of dynamic programming: about n2n-1, intermediate storage:  plans

• “No-cartesian-products” rule can make a big difference for some queries.

• DP only works up to 10-15 joins• Adding parameters to the search space

makes things worse (e.g. expensive predicates, distribution, parallelism, etc.)

Page 16: Access Path Selection in a Relational Database Management System

Nested Queries

• Subqueries optimized separately

• Uncorrelated vs. correlated subqueries– Uncorrelated subqueries are basically constants

to be computed once– Correlated subqueries are like function calls

Page 17: Access Path Selection in a Relational Database Management System

Query Rewrite in IBM DB2

Leung et al.

Page 18: Access Path Selection in a Relational Database Management System

Why Query Rewrite?

• Problem:– Very complex queries automatically generated

through tools with many levels of subqueries– Correlated subqueries– Selinger approach only works one block at a

time

• Main idea: Transform the query into a simpler, equivalent query

Page 19: Access Path Selection in a Relational Database Management System

Query Rewrite is Tricky

SELECT P.pnoFROM Parts PWHERE P.quantity =

(SELECT COUNT(*) FROM Supply S WHERE S.pno = P.pno AND S.shipdate < 1-1-2000)

INSERT INTO Temp (pnum, cnt)(SELECT P.pno, COUNT(*) FROM Supply S WHERE S.shipdate < 1-1-2000 GROUP BY P.pno)

SELECT P.pnoFROM Parts P, Temp TWHERE P.quantity = T.cnt

ANDP.pno = T.pno