barnaby-wilkerson
“Optimization”
More properly called access path selection
“Optimizer” selects a strategy for processing
Approaches:
◦ Cost-based: estimate total cost to process by different approaches, choose lowest estimate
◦ Heuristic: use rules to decide how to process
Cost-based optimization is typically used by database systems today

RUNSTATS
RUNSTATS is the name of a statistics-gathering utility first included in IBM’s DB2
It scans the database, gathers statistics used for estimating costs for access path selection
DBA determines how often and when to run the utility
What statistics do you think are gathered?

Quandary
The more that RUNSTATS collects, the better job the optimizer can do of selecting efficient processing methods
However, RUNSTATS uses a lot of resources, scanning every relation
Use of RUNSTATS must be balanced against its cost

The Optimizer
Selects which indexes to use
Chooses the order of using indexes
Chooses algorithms to use
Decides when to apply predicates

SQL Statement Parts of Interest
Simple query:
SELECT ENAME, JOB
FROM EMP
WHERE SAL > 20 OR JOB = 'VP' OR JOB LIKE 'PRES%';
We’re interested in the FROM clause (that tells us the table names) and the WHERE clause (that tells us the predicates)

Predicates
WHERE clause of SQL statement is made up of predicates
Each predicate is a condition
Each condition references a column
Conditions may be equality, inequality, range, LIKE
The first three conditions use an index if one exists, scan the table if no index exists

Example
SAL > 20 OR
JOB = 'VP' OR
JOB LIKE 'PRES%';
For each predicate, do we use an index to retrieve rows that make it true, then examine each row for the other predicates?

Predicate Selectivity
Selectivity: an estimate of the fraction of rows of a table that make a predicate true

Classes of Predicates
Predicate: condition in the WHERE clause
Predicates are combined using AND, OR to make WHERE clauses
Classes of predicates:
◦ Sargable: search arguments that can be processed close to the data
◦ Residual: not sargable, such as complex use of nesting

Access Paths
Five possible access paths:
◦ Table scan
◦ Non-selective index scan
◦ Selective index scan
◦ Index only access
◦ Fully qualified unique index
Each of these types of scans has different cost estimates for its use

Predicate Selectivity
Selectivity function f(p): fraction of rows retrieved on average by predicate p
Number of rows retrieved is strongly related to the cost to carry out the operation
n = number of rows in table
Form of p              f(p)
column = value         1/n
column != value        1 - 1/n (nearly 1)
column > value         (high value - search value)/(high value - low value)
column LIKE 'value'    n
p1 OR p2               f(p1) + f(p2)
p1 AND p2              f(p1) * f(p2)
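A minimal sketch of how the table's formulas could be turned into estimates. The function names, the sample row count, and the SAL range are assumptions made for illustration. The cap at 1 in `sel_or` is also an addition, since a plain sum of two fractions can exceed 1.

```python
def sel_equal(n):
    """column = value, using the 1/n estimate from the table."""
    return 1.0 / n


def sel_not_equal(n):
    """column != value: everything except the 1/n that match."""
    return 1.0 - 1.0 / n


def sel_greater(search, low, high):
    """column > value: linear interpolation between the low and high values."""
    if high == low:
        return 0.0
    fraction = (high - search) / (high - low)
    return min(1.0, max(0.0, fraction))  # keep the estimate inside [0, 1]


def sel_and(f1, f2):
    """p1 AND p2: product of the two selectivities."""
    return f1 * f2


def sel_or(f1, f2):
    """p1 OR p2: the table's simple sum, capped at 1 (the cap is an assumption)."""
    return min(1.0, f1 + f2)


# Hypothetical statistics: 1000 rows, SAL values between 10 and 110.
n = 1000
f_sal = sel_greater(20, low=10, high=110)  # SAL > 20
f_job = sel_equal(n)                       # JOB = 'VP'
estimated_rows = n * sel_or(f_sal, f_job)
print(f"estimated rows for SAL > 20 OR JOB = 'VP': {estimated_rows:.0f}")
```

If the two predicates are assumed independent, a common refinement for OR is f(p1) + f(p2) - f(p1) * f(p2), which never exceeds 1.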

Single-Table Queries
Find out which columns that are referenced in the WHERE clause have indexes
Find out selectivity of indexes
Estimate selectivity of each predicate
Use most selective index-predicate combination to retrieve rows that satisfy one predicate
Examine each row for other predicates
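The steps above can be sketched roughly as follows, assuming the predicates are combined with AND (retrieving by one predicate and filtering by the rest is only valid in that case). The `Predicate` class, the `choose_access_path` function, and all selectivity values are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Predicate:
    text: str           # the condition as written in the WHERE clause
    column: str         # the column it references
    selectivity: float  # estimated fraction of rows that satisfy it


def choose_access_path(predicates, indexed_columns):
    """Pick the most selective predicate whose column has an index.

    Returns (driver, residual): the predicate used to retrieve rows through
    its index, and the predicates checked row by row afterwards. A driver of
    None means no index applies, so the table is scanned.
    """
    candidates = [p for p in predicates if p.column in indexed_columns]
    if not candidates:
        return None, list(predicates)
    driver = min(candidates, key=lambda p: p.selectivity)
    residual = [p for p in predicates if p is not driver]
    return driver, residual


# Hypothetical AND-ed predicates on EMP with made-up selectivities.
preds = [
    Predicate("SAL > 20", "SAL", 0.9),
    Predicate("JOB = 'VP'", "JOB", 0.001),
    Predicate("DEPTNO = 10", "DEPTNO", 0.1),
]
driver, residual = choose_access_path(preds, indexed_columns={"SAL", "JOB"})
print("retrieve with index on:", driver.text)
print("then check:", [p.text for p in residual])
```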

Join
Result of a join is a subset of the Cartesian product of the tables being joined
Cartesian product of two tables with m and n rows is a new table of mn rows, where every row consists of one row of the first table and one row of the second table

Simple Join Processing Algorithm
1. Form the Cartesian product of all tables involved in the join
2. Scan rows of the Cartesian product, testing each against all of the predicates
3. Eliminate rows of the Cartesian product that don’t meet the predicates
What’s wrong with this picture?
Think about two tables of 1 million rows. Cartesian product would be 1 thousand billion rows!
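A small sketch of the three steps, to make the cost visible. The tables and the `cartesian_join` function are made up for illustration. The point is that the number of rows examined is always the product of the two table sizes, no matter how few rows survive.

```python
from itertools import product


def cartesian_join(r, s, predicate):
    """Join by materializing the full Cartesian product, then filtering."""
    all_pairs = list(product(r, s))          # step 1: form the product
    kept = []
    for r_row, s_row in all_pairs:           # step 2: scan and test every pair
        if predicate(r_row, s_row):
            kept.append((r_row, s_row))      # step 3: keep only matching pairs
    return kept, len(all_pairs)


# Hypothetical data: (ENAME, DEPTNO) and (DEPTNO, DNAME).
emp = [("SMITH", 10), ("JONES", 20), ("KING", 10)]
dept = [(10, "SALES"), (20, "RESEARCH"), (30, "OPS")]

rows, examined = cartesian_join(emp, dept, lambda e, d: e[1] == d[0])
print(f"examined {examined} pairs to produce {len(rows)} joined rows")
```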

Joining More than 2 Relations
A join of more than two relations is processed 2 relations at a time
Part of access path planning is to select that sequence
We will talk about algorithms for joining 2 tables and then choosing the order of processing a multi-table join

Joins
An equijoin is based on equality of an attribute of each of two relations
A full outer join includes all rows of both tables even if some rows do not have a matching value
A theta-join can be based on inequalities as the relationship

Join-Processing Algorithms
Nested loop join
◦ Each tuple of outer relation is compared to all rows of the inner relation
Sort-merge join
Hash-based join

Nested-Loop Join
The algorithm:
For efficiency, the relation with higher cardinality (R) is chosen as the inner relation
Number of operations: NR + NR * NS
What if there is an index?
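The slide's own statement of the algorithm is not reproduced here, so the following is only a minimal sketch of a nested-loop equijoin with no index. The function name, the dict-based rows, and the sample data are assumptions.

```python
def nested_loop_join(outer, inner, outer_key, inner_key):
    """Compare every outer row with every inner row on the join columns."""
    comparisons = 0
    result = []
    for o in outer:        # the outer relation is read once
        for i in inner:    # the inner relation is scanned once per outer row
            comparisons += 1
            if o[outer_key] == i[inner_key]:
                result.append({**o, **i})  # concatenate the matching rows
    return result, comparisons


# Hypothetical data; EMP has the higher cardinality, so it is the inner relation.
dept = [{"DEPTNO": 10, "DNAME": "SALES"}, {"DEPTNO": 20, "DNAME": "RESEARCH"}]
emp = [
    {"ENAME": "SMITH", "DEPTNO": 10},
    {"ENAME": "JONES", "DEPTNO": 20},
    {"ENAME": "KING", "DEPTNO": 10},
]

joined, comparisons = nested_loop_join(dept, emp, "DEPTNO", "DEPTNO")
print(f"{comparisons} comparisons, {len(joined)} joined rows")
```

With an index on the inner relation's join column, the inner scan could be replaced by an index lookup per outer row, so the number of comparisons would no longer grow with the product of the two sizes.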

Join Order
For JOIN queries, the “outer” table is accessed first, “inner” second
Order for joining tables must be selected
Most selective first
Least costly joins first

Merge Join
First, each relation is sorted on the join attribute
Then both relations are scanned in the order of the join attributes
Tuples that satisfy the join predicate are concatenated and placed in the output relation
Number of operations: NR + NS (after the sort!)
What if there is an index on R or S or both?
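One possible sketch of the merge join described above, for an equijoin on tuple rows. The function name and sample data are assumptions. Duplicate join values are handled by pairing each run of equal keys in R with the matching run in S.

```python
def merge_join(r, s, key_r, key_s):
    """Sort both inputs on the join attribute, then scan them together."""
    r_sorted = sorted(r, key=key_r)
    s_sorted = sorted(s, key=key_s)
    result = []
    i = j = 0
    while i < len(r_sorted) and j < len(s_sorted):
        kr, ks = key_r(r_sorted[i]), key_s(s_sorted[j])
        if kr < ks:
            i += 1                      # R is behind: advance R
        elif kr > ks:
            j += 1                      # S is behind: advance S
        else:
            # Find the end of the run of S rows that share this key.
            j_end = j
            while j_end < len(s_sorted) and key_s(s_sorted[j_end]) == kr:
                j_end += 1
            # Pair every R row with this key against that whole run.
            while i < len(r_sorted) and key_r(r_sorted[i]) == kr:
                for k in range(j, j_end):
                    result.append(r_sorted[i] + s_sorted[k])
                i += 1
            j = j_end
    return result


# Hypothetical data: (ENAME, DEPTNO) and (DEPTNO, DNAME).
emp = [("SMITH", 10), ("JONES", 20), ("KING", 10)]
dept = [(10, "SALES"), (20, "RESEARCH"), (30, "OPS")]

for row in merge_join(emp, dept, key_r=lambda e: e[1], key_s=lambda d: d[0]):
    print(row)
```

If an index already delivers a relation in join-attribute order, the corresponding sort could be skipped.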

Hash-Join
The joins we have looked at compare tuples in the first relation with tuples in the second relation that cannot possibly be part of the join
The goal of the hash join is to compare only those tuples that might be part of the join
Hashing is used to identify those tuples
There are many variations of hash-join
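As one illustration among the many variations, here is a minimal in-memory sketch of a hash join. The function name and sample data are assumptions, and Python's built-in dict stands in for the hash table. Each probe row is compared only with build rows in its own bucket.

```python
from collections import defaultdict


def hash_join(build, probe, build_key, probe_key):
    """Hash the build relation on its join attribute, then probe it."""
    buckets = defaultdict(list)
    for b in build:                       # build phase: one pass over build
        buckets[build_key(b)].append(b)

    result = []
    for p in probe:                       # probe phase: one pass over probe
        for b in buckets.get(probe_key(p), ()):
            result.append(b + p)          # only candidates with the same key
    return result


# Hypothetical data: (DEPTNO, DNAME) is the build side, (ENAME, DEPTNO) the probe side.
dept = [(10, "SALES"), (20, "RESEARCH"), (30, "OPS")]
emp = [("SMITH", 10), ("JONES", 20), ("KING", 10)]

for row in hash_join(dept, emp, build_key=lambda d: d[0], probe_key=lambda e: e[1]):
    print(row)
```

Building on the smaller relation keeps the hash table small, which matters when it has to fit in memory.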

Hash-Join Performance
Performance of hash-join can be superior to other join algorithms
Performance depends on the hashing algorithm (although note Lum’s research)
Perfect hashing algorithm could find match or non-match with a single probe
With hash table in RAM, processing would be very fast

Indexes
Impact of a b-tree index on performance of these algorithms is obvious
But the index itself must be maintained
When an attribute that’s indexed is changed in a relation, the value in the index must also be changed
And note that the changes must be synchronized (and locked together)

Order of Processing Joins
Typically, all combinations of order of processing are considered and a cost developed for each
Selectivity of predicates, selectivity of indexes, cardinality of relations all are factors in cost analysis
Goal is to minimize number of intermediate results produced during processing
Usually, predicates with low selectivity values are processed first (that is, the most selective ones)
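A rough sketch of considering every join order and costing each one. Everything here is an assumption made for illustration: the cost measure (the sum of estimated intermediate result sizes), the left-deep plan shape, the table cardinalities, and the join selectivities. A real optimizer's cost model is far richer.

```python
from itertools import permutations


def plan_cost(order, cardinality, join_selectivity):
    """Sum the estimated sizes of the intermediate results of a join order."""
    joined = [order[0]]
    rows = cardinality[order[0]]
    total = 0.0
    for table in order[1:]:
        selectivity = 1.0  # no join predicate means a Cartesian product
        for earlier in joined:
            selectivity *= join_selectivity.get(frozenset((earlier, table)), 1.0)
        rows = rows * cardinality[table] * selectivity
        total += rows
        joined.append(table)
    return total


def best_join_order(cardinality, join_selectivity):
    """Try every ordering of the tables and return the cheapest one."""
    return min(
        permutations(cardinality),
        key=lambda order: plan_cost(order, cardinality, join_selectivity),
    )


# Hypothetical statistics; DEPT and PROJ have no join predicate between them.
cardinality = {"EMP": 10000, "DEPT": 100, "PROJ": 1000}
join_selectivity = {
    frozenset(("EMP", "DEPT")): 0.01,
    frozenset(("EMP", "PROJ")): 0.0005,
}

best = best_join_order(cardinality, join_selectivity)
print(best, plan_cost(best, cardinality, join_selectivity))
```

The number of orderings grows factorially with the number of tables, so exhaustive enumeration like this is only practical for small joins.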

Summary
Single-table queries
Multi-table queries
◦ Nested loop
◦ Sort-merge
◦ Hash
Order of processing joins

But Note:
We have left out a LOT
Relations may be partitioned and joins processed by partition
Many other parts of the DBMS affect performance
If you are responsible for database performance, buy a book and dig in
Remember not to give up on normalization to get performance

How to Start
First, don’t even consider denormalization
You have many tools to get the performance you need without ruining the data model (and the applications)
Performance test the applications
Look for SQL operations that are taking a long time

EXPLAIN
IBM invented the EXPLAIN utility; it explains the processing strategy for each WHERE clause
Run it for operations that are taking too long
Look for table scans, Cartesian product joins
Provide indexes to speed things up
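To show the workflow in a form that runs anywhere, the sketch below uses SQLite's `EXPLAIN QUERY PLAN` through Python's `sqlite3` module as a stand-in. It is not IBM's EXPLAIN, and its output format differs. The table layout, index name, and `plan_details` helper are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE EMP (ENAME TEXT, JOB TEXT, SAL INTEGER)")

query = "SELECT ENAME, JOB FROM EMP WHERE JOB = 'VP'"


def plan_details(connection, sql):
    """Return the readable description of each step in SQLite's plan."""
    rows = connection.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return [row[-1] for row in rows]  # the last column holds the description


before = plan_details(conn, query)   # no index yet: expect a table scan
conn.execute("CREATE INDEX EMP_JOB_IX ON EMP (JOB)")
after = plan_details(conn, query)    # with the index: expect an index search

print("before index:", before)
print("after index: ", after)
```

The exact wording of the plan text varies between SQLite versions, so it is safer to look for keywords such as SCAN and INDEX than to match whole lines.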

EXPLAIN PLAN
Tells you the execution plan an Oracle database follows for a SQL statement
Inserts a row describing each step of the execution plan into a specified table
Determines total cost of execution

Beyond EXPLAIN
There are many indexing options, other options to control physical characteristics of the database
Learn about them, learn how to control them
But you will go very far with EXPLAIN and providing indexes