Access Path Selection in a Relational Database Management System

Access Path Selection in a Relational Database Management

System

Selinger et al.

Query Optimization

• Declarative query language relieves programmer of the burden to choose an access plan.

• Difference between good and bad access plans can be several orders of magnitude.

• Problem: How do we find a good plan?

How to choose a good plan?

Divide into three problems:

• What is the set of plans we consider?

• How do we compare (cost) plans?

• How do we choose a “good” plan from this set?

Background: SQL

SELECT {DISTINCT} <list of columns>

FROM <list of tables>

{WHERE <list of "Boolean Factors“(predicates in CNF)>}

{GROUP BY <list of columns>

{HAVING <list of Boolean Factors>}}

{ORDER BY <list of columns>};

Processing SQL Statements

• Four steps:– Parsing– Optimization– Code generation– Execution

SQL Query Types

• Single block queries only– optimize nested sub-queries separately

• Correlated sub-queries are much harder and much more expensive than un-correlated sub-queries.– Rewrite to remove correlations where possible.

(Tricky, more about this later.)

• SPJ queries only in this paper

Problem 1: Plan Space

• Fixed set of individual access methods• Sequential & index (clustered/unclustered)

scans • NL-join, (sort)-merge join, hash join • Sorting & hash-based grouping

• Plan flows in a non-blocking fashion with get-next iterators

Plan Space (Contd.)

• Assumptions in System R:– Selections of “sargable” predicates are "pushed

down"– Projections are "pushed down" – Single query blocks– Only left-deep plan trees (There are n! plans

(not factoring in choice of join method))– Avoid Cartesian products

Problem 2: How to Cost a Plan

• Estimation of input and intermediate cardinalities based on simple statistical models (e.g., uniform distribution assumption, attribute independence)

• Estimation of costs for each operator based on statistics about input relations– Cost is weighted function between I/O and CPU

(no distinction between random and sequential IO)

Cost Estimation (Selinger)

• Maintenance of simple statistics:– # of tuples & pages – # of values per column (only for indexed

columns)

• Assumption of attribute independence• When no estimate available, use magic

number• These estimations are done periodically

Cost Estimation (Today)

• Sampling: so far only concrete results for base relations

• Histograms: getting better. Common in industry, some interesting new research.

• Controlling "error propagation"

Problem 3: Choosing a Plan

• Exhaustive search • Dynamic Programming (prunes suboptimal parts

of the search space): System R • Top-down, transformative version of DP:

Volcano, Cascades (used in MS SQL Server?) • Randomized search algorithms (e.g. Ioannidis &

Kang) • Techniques from Operations Research• Etc.

System R Approach

• Recall: Only left-deep plan trees (n! different plans)

• Observation: Many of these plans share common prefixes, so do not recompute all of them: Dynamic Programming

Dynamic Programming Approach

1. Find all 1-table plans for each base relation2. Try all ways of joining i-table plans saved so far

with 1-table plans. Save cheapest unordered (i+1)-table plans, and cheapest (i+1)-table plans for each interesting order

• Note: secondary join predicates are just like selections that can’t be pushed down

3. At the end, GROUP BY and ORDER BY– Use plan in interesting order, or add sort to cheapest

unordered plan.

Evaluation

• Complexity of dynamic programming: about n2n-1, intermediate storage: plans

• “No-cartesian-products” rule can make a big difference for some queries.

• DP only works up to 10-15 joins• Adding parameters to the search space

makes things worse (e.g. expensive predicates, distribution, parallelism, etc.)

Nested Queries

• Subqueries optimized separately

• Uncorrelated vs. correlated subqueries– Uncorrelated subqueries are basically constants

to be computed once– Correlated subqueries are like function calls

Query Rewrite in IBM DB2

Leung et al.

Why Query Rewrite?

• Problem:– Very complex queries automatically generated

through tools with many levels of subqueries– Correlated subqueries– Selinger approach only works one block at a

time

• Main idea: Transform the query into a simpler, equivalent query

Query Rewrite is Tricky

SELECT P.pnoFROM Parts PWHERE P.quantity =

(SELECT COUNT(*) FROM Supply S WHERE S.pno = P.pno AND S.shipdate < 1-1-2000)

INSERT INTO Temp (pnum, cnt)(SELECT P.pno, COUNT(*) FROM Supply S WHERE S.shipdate < 1-1-2000 GROUP BY P.pno)

SELECT P.pnoFROM Parts P, Temp TWHERE P.quantity = T.cnt

ANDP.pno = T.pno

Documents

Access Path Selection in a Relational Database Management System