Upload
heather-morrison
View
256
Download
1
Embed Size (px)
Citation preview
Query Optimization and IndexesQuery Optimization and Indexes
IntroductionIntroduction
Relational DatabasesRelational Databases
DB2/400DB2/400
QUERY OPTIMIATION QUERY OPTIMIATION AND INDEXESAND INDEXES
IntroductionIntroduction
Overview
Research Problem
Literature
Next
Previous
OverviewOverviewINTRODUCTIONINTRODUCTION
IBM has a DBMS called DB2. Query optimization
– Can significantly improve database performance– If the right indexes are available, queries can generally
be implemented using better performing algorithms.
Next
Previous
INTRODUCTIONINTRODUCTION
Research ProblemResearch Problem
Due to the complexity of the query optimizer, customers are often baffled about indexes and query response time. Users need advise on what indexes would offer the best performance.
Management asks:
How can an intelligent index manager be created? Related question:
How can queries be evaluated to determine a set of indexes which may minimize the total cost of database transactions, thus optimizing performance?
Next
Previous
LiteratureLiterature
Index selection problem (ISP) for secondary indexes is a well-known optimization problem.
This problem known to be NP-Complete.
INTRODUCTIONINTRODUCTION
Query Optimization and IndexesQuery Optimization and Indexes
IntroductionIntroduction
Relational DatabasesRelational Databases
DB2/400DB2/400
QUERY OPTIMIATION QUERY OPTIMIATION AND INDEXESAND INDEXES
Relational DatabasesRelational Databases
Background Indexes
Table Access and JoinsQuery Optimization
Next
Previous
RELATIONAL DBsRELATIONAL DBs
BackgroundBackground
Relational model introduced by Codd Based on relations (tables), tuples (records), and
attributes (fields)
Queries can be represented using relational algebra, relational calculus, or as a graph.
SQL is a high level query language.
Next
Previous
Query OptimizationQuery Optimization
A query optimizer takes the query and creates a procedural sequence of implementation steps known as an access plan.– This plan is created after analyzing various alternatives to arrive
at the best choice.
– Table indexes are critical in creating efficient access plans.
Objectives of query optimization:– Minimize response time
– Minimize usage of system resources
RELATIONAL DBsRELATIONAL DBs
Next
Previous
RELATIONAL DBsRELATIONAL DBs
Query Optimization ObjectivesQuery Optimization Objectives
Hardware objectives include minimizing:– CPU costs– Communications costs (in a network)– I/O costs for accessing secondary storage– Cost of using main memory and secondary storage
Software objectives include minimizing:– Cost of query optimization
(should be small compared to execution cost)
Next
Previous
Query Optimization ProcessQuery Optimization ProcessRELATIONAL DBsRELATIONAL DBs
Rewriter
Algebraic Space Cost Model
Method StructureSpace
Size-DistributionEstimator
Planner
Next
Previous
Query Optimization ProcessQuery Optimization Process
Rewriter– creates an internal query representation.– applies transformations
to streamline query evaluation.
RELATIONAL DBsRELATIONAL DBs
Next
Previous
Query Optimization ProcessQuery Optimization Process
Planner
– maps the transformed query• into various sequences of operations
(algebraic space)
• which can be implemented
(method structure space)
• with a known cost
(cost model & size distribution),
creating candidate access plans.
– computes the cost of each candidate plan,
and chooses the cheapest one.
RELATIONAL DBsRELATIONAL DBs
RELATIONAL DBsRELATIONAL DBs
IndexesIndexes
B-Tree Indexes
Clustered/Non-clustered Indexes
Hash Indexes
Next
Previous
IntroductionIntroduction
Memory access on the order of nanoseconds– For example, 40 (.00000004 sec.)
Disk access on the order of milliseconds– For example, 25 (.025 sec.) i.e., 40 random I/O’s per sec.
• .016 sec. Seek time (Move disk arm to the proper cylinder)• .008 sec. Rotational latency (Rotate platter into position)• .001 sec. Transfer time (to read/write data)• .025 sec. Total
$ cost of disk cheap compared to RAM
RELATIONAL DBsRELATIONAL DBsIndexesIndexes
Next
Previous
Composition– Root level node (one at the first level)– Directory (or index) nodes (usually one or two levels)
– Leaf level nodes (the bottom level, pointing to records)
B-Tree IndexesB-Tree IndexesRELATIONAL DBsRELATIONAL DBs
IndexesIndexes
... 221 np 346 np 398 ...
… 377 rid 411 rid 449 rid …… 278 rid 305 rid 346 rid …
np = node pointer to disk page
rid = relative row ID to actual record
Next
Previous
B-Tree IndexesB-Tree Indexes
Significantly, due to the frequency of access, upper-levels of a B-tree index may remain in memory.
For example, to access a record in a million record database:assuming fanout of 256
– B-tree index has 3 levels CEIL(log256 1,000,000)
– B-tree index probed 3 times reading 3 pages into RAM
– Only 1 disk I/O may be required to read in the disk page holding the leaf node pointing to the desired record.
RELATIONAL DBsRELATIONAL DBsIndexesIndexes
Next
Previous
B-Tree IndexesB-Tree Indexes
B-tree index on a one million record database
has 34 nodes above the leaf nodes– Assume:
• key values 4 bytes long (an integer)
• node pointers & row IDs 4 bytes long
• fill factor 70%
• disk page header 48 bytes
• disk page size 2 KB
– so (2048-48) *.70/(4+4) 175 entries per disk page• CEIL(1,000,000/175) 5715 disk pages for leaf nodes
• CEIL(5715/175) 33 disk pages for directory nodes
• CEIL(33/175) 1 disk page for parent node(s), root
RELATIONAL DBsRELATIONAL DBsIndexesIndexes
Next
Previous
Clustered/Non-clustered IndexesClustered/Non-clustered Indexes
In a clustered index the records are stored in the same order as the key.
Provides superior performance, for example:– Database of 10 million customer in 200 cities– Customer records of 100 bytes
• so 20 data records per disk page (2048-48)/100• so 500,000 disk pages for database (10,000,000/20)
– Mailing for a specific city averages 50,000 customers– If clustered index on city, only 2500 disk pages of data (50,000/20) need to be
scanned, taking 1 min. (2500/40*60)– If non-clustered, we could assume scan 50,000 disk pages of data, taking 20 min.
(50,000/40*60)
RELATIONAL DBsRELATIONAL DBsIndexesIndexes
Next
Previous
Hash IndexesHash Indexes
No disk file of keys, only a single I/O required (ideally) to read a record from disk.
The record key is the input value to a some hash function, whose output becomes the key to a disk page or relative record position.
Collisions, well-known problem, where 2 records have the same hash value.– Ideally, rehash to same disk page to minimize I/O.
Generally, cannot increase size of hash table.– Extensible hashing allows hash table to grow, based on a
linear hashing algorithm.
RELATIONAL DBsRELATIONAL DBsIndexesIndexes
RELATIONAL DBsRELATIONAL DBs
Table Access and JoinsTable Access and Joins
No Indexes
Simple Indexes
Composite Indexes
Multiple Indexes
Nested Loop Joins
Sort-Merge Joins
Next
Previous
IntroductionIntroduction
When a database system receives a query it compiles the query.– This includes syntax checking and query optimization.
– The result of the compilation step is an access plan (a series of steps specific to the computer and database) which will execute the query at run time.
Query optimizer– Minimizes CPU time & number of I/O requests.
– CPU memory, although important, may be limited to certain established levels.
RELATIONAL DBsRELATIONAL DBsTable Access and JoinsTable Access and Joins
Next
Previous
No IndexesNo Indexes
A table scan is an access step where all the rows in a table are sequentially searched.– The data collected is restricted by the
WHERE clause of an SQL statement.
Access plan (DB2/MVS):– ACCESSTYPE column
• Letter R, for a table scan
– PREFETCH column• Blank, if random I/O• Letter S, if multi-block I/O (called sequential prefetch)
RELATIONAL DBsRELATIONAL DBsTable AccessTable Access and Joins and Joins
Next
Previous
Simple/Composite IndexesSimple/Composite Indexes
A matching index scan:– implemented in DB2/MVS using a B-tree index.– based on a single index– rows retrieved from a table
based on the condition(s) specified in the WHERE clause of the SELECT statement.
RELATIONAL DBsRELATIONAL DBsTable AccessTable Access and Joins and Joins
Next
Previous
Simple IndexesSimple Indexes
The access plan for a SELECT statement may use indexes to limit the number of rows searched in a database. For example:
If ZipIdx exists on ZipCode in the Customers table,
SELECT * FROM Customers WHERE ZipCode = 56001
implemented in DB2/MVS using a matching index scan.
The plan for this query takes one step and is designated by the following DB2/MVS access plan:
ACCESSTYPE = I for an index scan
ACCESSNAME = ZipIdx
MATCHCOLS = 1
RELATIONAL DBsRELATIONAL DBsTable AccessTable Access and Joins and Joins
Next
Previous
Composite IndexesComposite Indexes
A matching index scan retrieves records from a table when the column components of the index can be matched to the predicates of the WHERE clause. For example:
If MailIdx exists on ZipCode + IncomeLevel + MaritalStatusSELECT Name, Address FROM Customers
WHERE ZipCode = 56001 AND IncomeLevel = 10
implemented in DB2/MVS using a matching index scan.
The DB2 access plan will have:ACCESSTYPE = I for an index scan
ACCESSNAME = MailIdx
MATCHCOLS = 2
RELATIONAL DBsRELATIONAL DBsTable AccessTable Access and Joins and Joins
Next
Previous
Multiple IndexesMultiple Indexes
Multiple index access used when different indexes on the predicates in the WHERE clause. For example:SELECT Name, Address FROM Customers
WHERE ZipCode = 56001AND (IncomeLevel = 10 OR MaritalStatus = “M”)
DB2 Access Plan assuming ZipIdx, IncomeIdx, and MaritalIdx :TNAME MATCHCOLS PREFETCH
ACCESSTYPE ACCESSNAME MIXOPSEQCustomersM 0 L 0CustomersMX 1 IncomeIdx S 1CustomersMX 1 MaritalIdx S 2CustomersMU 0 3CustomersMX 1 ZipIdx S 4CustomersMI 0 5
RELATIONAL DBsRELATIONAL DBsTable AccessTable Access and Joins and Joins
Next
Previous
OverviewOverview
Common methods to join tables– Nested Loop– Sort Merge– Hash
In DB2, multi-step access plans for join processing make use of columns called:
– PLANNO (1, 2, …) to indicate which step– METHOD (1=Nested Loop, 2=Sort Merge, 4=Hybrid Join) to indicate the join technique
chosen by the query optimizer– TABNO (such as 1 or 2) to indicate whether we are extracting rows from the first or second
table of the join– SORTN_JOIN (Y or N) to indicate whether a sort is required (for example, for
the sort merge join)
RELATIONAL DBsRELATIONAL DBsTable Access and Table Access and JoinsJoins
Next
Previous
OverviewOverview
A join of two tables occurs in two steps– where one table becomes the outer table– and the other becomes the inner table.
During join processing– records from outer table are presented one-by-one
to the inner table– and the inner table searched for records
matching the one presented to it.
RELATIONAL DBsRELATIONAL DBsTable Access and Table Access and JoinsJoins
Next
Previous
Nested Loop JoinsNested Loop Joins
Process:– Records of the outer table are retrieved (or presented) using a table
scan (or indexing, if possible).• Only candidate records are retrieved.• Candidate records satisfy local predicates.
– For each retrieved row from the outer table:• the inner table is searched for qualifying records• & the results merged into a new record in an output table.
SELECT CustomerID, Name, Street, City, State, Zip, TotalSales
FROM Customers, Sales
WHERE State = “MN”
AND Customers.CustomerID = Sales.CustomerID
RELATIONAL DBsRELATIONAL DBsTable Access and Table Access and JoinsJoins
Next
Previous
Nested Loop JoinsNested Loop Joins
Appropriate:– When the outer table has only a few records (after applying predicates)– Where the inner table is small or has an index usable to
access qualifying records
Drawbacks when the inner and outer tables are not indexed or clustered on the same values:– The inner table (including indexes) may be scanned repetitively to find
matching records.– The outer table is processed inefficiently for records with the same
value in the join columns.
RELATIONAL DBsRELATIONAL DBsTable Access and Table Access and JoinsJoins
Next
Previous
Sort-Merge JoinsSort-Merge Joins
Algorithm where the 2 tables are scanned only once:– For each table, a temporary table is created of qualifying
candidate records.– Each temporary table is sorted on the same columns based on the
joining predicates (conditions).– Finally the two temporary tables are merged and joined (as an inner and
outer table) into a third table.
If either of the original two tables has an index on the selection and joining predicates, it may be possible to skip the creation of the corresponding temporary table.
RELATIONAL DBsRELATIONAL DBsTable Access and Table Access and JoinsJoins
Query Optimization and IndexesQuery Optimization and Indexes
IntroductionIntroduction
Relational DatabasesRelational Databases
DB2/400DB2/400
QUERY OPTIMIATION QUERY OPTIMIATION AND INDEXESAND INDEXES
DB/400DB/400
Data Management Methods
Summary
The Optimizer
The Database Monitor
Components
Proposed Scope
Proposed Solution
Research Problem/Objectives
Next
Previous
ComponentsComponents
Query component
-> Query optimizer• Cost based
Data management methods– Access paths– Access methods
DB/400DB/400
Next
Previous
Data Management MethodsData Management Methods
Access Paths– Sequential, also called arrival sequence
• accesses data in physical order
– Keyed sequential, uses indexes• accesses data in the order of the index
Access Methods– Dataspace scan - similar to table scan - parallel version too– Key selection - requires index - parallel version too– Key positioning - requires index - parallel version too– Index only - requires index - SMP required– Index-from-Index - requires index - SMP required– Hashing - SMP required
DB/400DB/400
Next
Previous
The OptimizerThe Optimizer
Query optimization is a tradeoff between– the time to determine an optimal implementation– the time to actually execute the query
The query optimizer– selects the most efficient access method at
query run-time• identifies alternatives• estimates current costs
– optimizes joins and grouping operations
DB/400DB/400
Next
Previous
The OptimizerThe Optimizer
The access cost is modeled for:– reading records without an index
(a dataspace scan)– reading records with an index
(key selection or key positioning)*– creating a temporary index on the relevant data– creating a temporary index on another index
(index-from-index)– using the hashing method or a query sort routine
*Each index is evaluated, unless a time limit is reached first. Indexes are examined in LIFO order.
DB/400DB/400
Next
Previous
The OptimizerThe Optimizer
The cost of a particular method is the sum of:– the start-up cost
– the cost associated with the optimization mode**
– the cost of creating any indexes
– the cost for the expected number of page faults to read the data and the cost to process the expected number of rows
**The optimization mode is given by a parameter, which defines the minimization goal as either the time to retrieve the first buffer of data or the time to retrieve all the rows that would be selected.
DB/400DB/400
Next
Previous
The Database MonitorThe Database Monitor
The database monitor collects statistics and data about query implementation and performance during query optimization and execution.
For example, SQL can extract data showing:– Queries implemented as table scans– Queries taking the most time– Queries stopped by the query governor– All queries executed and data on each query, such as
• Table names• Number of rows in a table and the number of rows selected• Estimated & actual execution time• Indexed advised & fields
DB/400DB/400
Next
Previous
Research Problem/ObjectivesResearch Problem/Objectives
RESEARCH PROBLEM
Improve query performance
by determining an appropriate set of indexes.
RESEARCH OBJECTIVES Determine the information required & available.
Determine the formulas/algorithms
needed to evaluate the data. Determine the user interface
(including capabilities)
of an user-friendly intelligent index advisor.
DB/400DB/400
Next
Previous
DB/400DB/400
Proposed ScopeProposed Scope
Create a prototype intelligent index manager.– Experiment with some basic queries.
Evaluate simple queries– no joins & no GROUP BY or ORDER BY clauses in SQL.
Next
Previous
SummarySummary
IBM’s DB2/400 has several components:– The query component, including the query optimizer which:
• chooses an minimal cost method to implement queries• validating or creating an access plan
– a control structure containing the implementation data.
– The data management methods:• algorithms for retrieving data through an access path using a
table access method.
The database monitor:– collects statistics and implementation details about queries.– includes recommendations on creating indexes:
• on a per query basis• based on a limited subset of the query optimizer rules.
DB/400DB/400
Next
Previous
SummarySummary
An intelligent index manager would:– need data on query implementation and execution from the
database monitor.• The database monitor would need to be enhanced to save
additional data from the query optimizer• Supplemental data may be needed such as
– How often the records change in queried tables– User specifications
– need algorithms, formulas & rules based on• The major ones used by the query optimizer.• Additional criteria, such as projected index maintenance costs & rules based
on the user specifications
DB/400DB/400
Next
Previous
SummarySummary
An intelligent index manager would:– evaluate & judge if new indexes are needed; and if so, what indexes would
provide the best overall system performance.• Decisions would be made on the makeup of the indexes: which tables and which
fields.• Decisions would be based on all the queries within its scope
– advise users recommending specific indexes
Further research is needed to resolve questions concerning exact implementation details:– specific data requirements, algorithms and formulas– an appropriate user-interface
DB/400DB/400