-------------------------------------------- By Yu, Fang ([email protected])
SQL Optimization Tips
Warning: The contents of this document mainly come from the book “Database Solution”
(http://baike.baidu.com/view/4620738.htm).
Contents
1. SQL Execution Explain Plan
   1.1. SQL and Optimizer
      1.1.1. SQL Statement Transformation
   1.2. Explain Plan
      1.2.1. Scans
      1.2.2. Table Join
      1.2.3. Other operations
2. Create Efficient Indexes
   2.1. Comparison between “Index Merge” and “Composite Index”
   2.2. The characteristics of the “Composite Index”
3. Partial Range Scan
   3.1. What’s Partial Range Scan
   3.2. The partial range scan usage rule
      3.2.1. The requirements for Partial Range Scan
      3.2.2. Partial Range Scan in different optimizer mode
   3.3. The principle to improve the execution speed of Partial Range Scan
      3.3.1. The principle of Partial Range Scan
   3.4. The ways to instruct the optimizer to choose Partial Range Scan
      3.4.1. Replace SORT operation by (index) access path
      3.4.2. Use index scan only for partial range scan
      3.4.3. MAX and MIN functions
      3.4.4. “Filter” partial range scan
      3.4.5. Take advantage of “ROWNUM” for partial range scan
      3.4.6. Take advantage of “Inline View/Scalar Sub Query” for partial range scan
      3.4.7. Take advantage of “Function” for partial range scan
4. Table Joins
   4.1. Join VS Loop Query
   4.2. The impact of Join Condition on Table Join
      4.2.1. Both sides of the Join Condition are valid
      4.2.2. One side of the join condition is invalid
      4.2.3. Neither side of the join condition is valid
   4.3. Different kinds of table join
      4.3.1. Nested Loop Join
      4.3.2. Sort Merge Join
      4.3.3. Nested Loop Join V.S. Sort Merge Join
      4.3.4. Hash Join
      4.3.5. Semi Join
      4.3.6. Star Join
      4.3.7. Star Transforming Join
      4.3.8. Bitmap Join Index
1. SQL Execution Explain Plan
1.1. SQL and Optimizer
1.1.1. SQL Statement Transformation
The SQL Optimizer consists of a “Query Transformer”, a “Cost Estimator” and an “Explain Plan Generator”.
Most SQL statements are transformed to some degree by the optimizer before the execution plan is
generated, with the goal of achieving the best performance.
Here are some examples of query transformation…
(1) sales_qty > 1200 / 12
(2) sales_qty > 100
(3) sales_qty * 12 > 1200
The predicate (1) will be transformed into (2), but (3) will not. This is because the SQL optimizer will not
“move” the expression across the comparison operator (>).
Another example: an IN list will be transformed using the “OR” operator.
(1) job IN ('MANAGER', 'CLERK')
(2) job = 'CLERK' OR job = 'MANAGER'
“BETWEEN” will be transformed using “>=” and “<=” operators.
(1) sales_qty BETWEEN 100 AND 200
(2) sales_qty >= 100 AND sales_qty <=200
“ANY” operator will be transformed using “OR” operators in some cases…
(1) sales_qty > ANY (100, 200)
(2) sales_qty > 100 OR sales_qty > 200
1.1.1.1. Transitivity principle (only available in CBO)
Comparison with constant
If the same table column is used in two different predicates (join conditions), the optimizer will generate
a new predicate (join condition) and create the best explain plan based on this new predicate.
WHERE column1 comparison_operator constant AND column1 = column2
The “comparison_operator” must be one of “=”, “<>”, “>”, “<”, “>=”, “<=”, and the “constant” can be a
constant expression, a constant literal, a bind variable or a SQL function.
Then under such circumstances, the optimizer will generate a new predicate like below,
column2 comparison_operator constant
However, if the SQL statement is like…
WHERE column1 comparison_operator column3
AND column1=column2
Then the optimizer cannot deduce the following predicate…
column2 comparison_operator column3
Transform “OR” to “UNION ALL”
Please note that the transformation only happens when the performance will be better.
If the transformed SQL statement can use index to boost the performance, the transformation
will be performed. The optimizer will generate the explain plan as “IN-LIST ITERATOR” or
“CONCATENATION” in this case.
If some query condition cannot use index or the “OR” query condition is used for data check
(filter) only, then the optimizer will not conduct the query transformation. However, we can
take advantage of the hint “USE_CONCAT” to instruct the optimizer to transform the query if we
are sure of better performance for the new query.
For example,
SELECT sal
FROM emp
WHERE job = 'CLERK'
OR deptno = 10;
If there are indexes created on the columns job and deptno, the optimizer will transform the SQL
statement above into the following one…
SELECT sal FROM emp WHERE job = 'CLERK' UNION ALL
SELECT sal FROM emp WHERE deptno = 10 AND job <> 'CLERK';
Transform sub query to table join
The optimizer will not transform every sub query into table join. If the transformation will help the
performance boost and the transformation is feasible, it will do it. Otherwise, it will generate the
best explain plan for both the main query and sub query instead of transforming it.
For example,
SELECT *
FROM emp WHERE deptno IN (SELECT deptno FROM dept WHERE loc='New York')
If there is a unique index on the column deptno in table dept, so that each row of the main query
matches at most one record in the sub query, then the SQL above can be transformed into the table join
as below,
SELECT *
FROM emp, dept
WHERE emp.deptno = dept.deptno AND dept.loc = 'New York';
However, for the query below,
SELECT *
FROM emp
WHERE sal > (SELECT AVG(sal) FROM emp WHERE deptno=20);
The optimizer cannot transform it into a table join. Instead, it will generate the best plan for the main
query and the sub query separately. If the sub query can use an index and its result can be used to probe
the data of the main query (the sub query is the data provider), the optimizer will generate a plan that
executes the sub query first. If the sub query acts as a data filter, the query will be executed as a
hash join (semi).
1.1.1.2. View Merging
In order to generate the best execution plan for the view (inline view), the optimizer may need to
transform the SQL query. There are two ways for the transformation:
View Merging: merge the view query and the query condition (predicates)
Predicate Pushing: If the view merge cannot be performed, push the predicates into the view query
Please note the “direction” of the two methods above is different. The former “rewrites” the outer
query using the view query (inner query), while the latter pushes the query conditions of the outer
query into the view query.
If the outer query includes the following operations, then the “View Merging” will not be applicable…
SET operations, like UNION, UNION ALL, INTERSECT, MINUS, etc
CONNECT BY
ROWNUM
Aggregation function in SELECT-list, like SUM, AVG, MAX, MIN, etc
GROUP BY ( can use hint MERGE to instruct the optimizer to choose view merging)
DISTINCT in SELECT-list ( can use hint MERGE to instruct the optimizer to choose view merging)
If the outer query has query conditions that can narrow the query range, and merging those conditions
into the view can reduce the data volume that needs to be processed, then view merging is preferable;
otherwise view merging is not necessary.
For example,
CREATE VIEW emp_10(e_no, e_name, job, manager, hire_date, salary)
AS SELECT empno, ename, job, mgr, hiredate, sal
FROM emp WHERE deptno = 10;
SELECT e_no, e_name, salary, hire_date
FROM emp_10 WHERE salary > 10000000;
Can be transformed using view merging as follows,
SELECT empno, ename, sal, hiredate FROM emp
WHERE deptno=10 AND sal > 10000000;
Another example,
CREATE VIEW emp_group_by_deptno
AS SELECT deptno, AVG(sal) avg_sal, min(sal) min_sal, max(sal) max_sal
FROM emp
GROUP BY deptno;
SELECT *
FROM emp_group_by_deptno WHERE deptno=10;
Can be transformed as follows…
SELECT deptno, AVG(sal) avg_sal, min(sal) min_sal, max(sal) max_sal
FROM emp WHERE deptno=10
GROUP BY deptno;
1.2. Explain Plan
1.2.1. Scans
1.2.1.1. Full Table Scans
A full table scan will scan all the data blocks below the HWM (high water mark), including empty
data blocks. In order to reduce physical I/O, the parameter DB_FILE_MULTIBLOCK_READ_COUNT
can be set to a higher value.
1.2.1.2. ROWID Scans
ROWID is composed of the data object id, data file id, data block id and the row slot within the data block.
The fastest way to retrieve one record from a table is to use its ROWID.
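As a small sketch (the ROWID literal below is a placeholder, not a real value), a ROWID fetched by one query can be fed back to retrieve the same row again with a single block access:

```sql
-- Step 1: capture the ROWID along with the data (emp is the sample table
-- used elsewhere in this document)
SELECT ROWID, ename FROM emp WHERE empno = 7369;

-- Step 2: reuse the captured value; 'AAAR3sAAEAAAACXAAA' stands in for
-- whatever ROWID the first query returned
SELECT ename, sal FROM emp WHERE ROWID = 'AAAR3sAAEAAAACXAAA';
```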
1.2.1.3. Index Scans
Index Unique Scan
Index Range Scan
Index Range Scans Descending
Index range scan descending is similar to index range scan, except that it accesses the table data in
descending order instead of ascending order. The optimizer will choose this kind of index scan
under two circumstances: one is that the query uses “ORDER BY…DESC”; the other is that the
query uses the “INDEX_DESC” hint.
Index Skip Scan
Index Skip Scan was introduced to resolve the issue that a composite index cannot be used when its
leading column does not appear in the predicates.
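A hedged sketch of the idea (index and column names are assumptions): when the leading column of a composite index has very few distinct values, the optimizer can “skip” through each leading value and range-scan the second column within it:

```sql
-- Composite index whose leading column (gender) has only a few values
CREATE INDEX emp_gender_idx ON emp (gender, empno);

-- The predicate omits the leading column; the INDEX_SS hint asks the
-- optimizer to probe each distinct gender value and range-scan empno
SELECT /*+ INDEX_SS(e emp_gender_idx) */ ename
FROM emp e
WHERE empno = 7369;
```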
Index Full Scan
Index Full Scan will be used when the following two conditions are met,
All the columns in the SELECT-list are included in the index.
There is at least one NOT NULL column in the index
Index Fast Full Scan
The difference between Index Fast Full Scan and Index Full Scan is that the Index Fast Full Scan will
read multiple index blocks rather than one block in each I/O operation.
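As an illustrative sketch (the index name is an assumption), a fast full scan can be requested with the INDEX_FFS hint when the index covers every column the query needs:

```sql
-- emp_dept_idx is assumed to contain both queried columns, with at least
-- one NOT NULL column; the whole index is read with multiblock I/O,
-- in no particular key order
SELECT /*+ INDEX_FFS(e emp_dept_idx) */ deptno, ename
FROM emp e;
```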
1.2.1.4. B-Tree Cluster Access
1.2.1.5. Hash Cluster Access
1.2.1.6. Sample Table Access
Sample table access is only available for Full Table Scans and Index Fast Full Scans. The basic syntax is as
follows,
SELECT …
FROM table_name SAMPLE [BLOCK] (sample_percent)
WHERE…
GROUP BY… HAVING…
ORDER BY…
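For instance, a rough sketch against the emp table (the counts are only estimates, since they are computed from the sampled subset):

```sql
-- Sample roughly 10 percent of the rows
SELECT COUNT(*) FROM emp SAMPLE (10);

-- Sample roughly 10 percent of the data blocks instead (cheaper I/O)
SELECT COUNT(*) FROM emp SAMPLE BLOCK (10);
```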
1.2.2. Table Join
Table joins will be detailed in Section 4.
1.2.2.1. Nested Loop Join
The most distinguishing characteristic of the Nested Loop Join is that the outer query (driving query)
determines the data volume that needs to be processed. The Nested Loop Join performs well when the
data volume is small and there are proper indexes on the join columns. Its most notable disadvantage
is that it may cause too many random table accesses.
1.2.2.2. Sort Merge Join
Compared with Nested Loop Join, the SORT MERGE JOIN will not introduce much random table accesses.
And there is no “driving table” in SORT MERGE JOIN.
If most of the join conditions are ‘LIKE’, ‘BETWEEN’, ‘>’, ‘>=’, ‘<’, ‘<=’ instead of ‘=’, the SORT MERGE JOIN
is better than the Nested Loop Join.
1.2.2.3. Hash Join
A hash join uses a hash function to perform the table join, and it can only be used when the join
operator is “=”.
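A minimal sketch using the emp/dept tables from the earlier examples; the USE_HASH hint suggests building an in-memory hash table from one input and probing it with rows from the other:

```sql
-- The equality predicate is what makes a hash join possible at all
SELECT /*+ USE_HASH(e d) */ e.ename, d.dname
FROM emp e, dept d
WHERE e.deptno = d.deptno;
```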
1.2.2.4. Semi Join
A semi join happens when there is a sub query in the SQL statement; the join between the “main query”
and the “sub query” is called a semi join. Since the sub query is subordinate to the main query, if the
relationship between the main query and the sub query is “M:1”, the join between them is the same as a
general table join; otherwise the sub query will be transformed into the “1” side to make sure the final
result is compatible with the main query.
The sub query can be executed earlier than the main query (acting as a data provider) or later than the
main query (acting as a data filter). In the first case, if the sub query is the “M” side, an operation named
“SORT (UNIQUE)” will be involved to transform the sub query into the “1” side. In the second case, the sub
query is aborted as soon as the first matching record is found.
Please note that the IN-list sub query will not necessarily be executed before the main query.
1.2.2.5. Cartesian Join
A Cartesian join means there is no join condition between the two tables. Generally, a Cartesian join is
executed as a “Sort Merge Join”.
A typical Cartesian join execution plan is as below…
MERGE JOIN (CARTESIAN)
  TABLE ACCESS (FULL) OF 'emp'
  BUFFER (SORT)
    TABLE ACCESS (FULL) OF 'copy_t'
1.2.2.6. Outer Join
Nested Loop Outer Join
Hash Outer Join
If the outer join query has (inline) view, the view merging will not be executed. Instead, the (inline)
view must be executed separately before the table outer joins.
If the inner table has some query condition, the outer join needs more caution.
For example,
SELECT last_name, nvl(sum(ord_amt), 0)
FROM customers c, orders o
WHERE c.cust_id = o.cust_id(+) AND c.credit_limit > 1000
AND o.ord_type IN ('01', '03') ------ query condition on the inner table
GROUP BY last_name;
Please note that the query condition “o.ord_type IN ('01', '03')” will be used as a data filter executed
after the outer join, which effectively turns the outer join into an inner join and leads to wrong results.
To resolve this issue, we need to turn to “inline view” for help as the inline view will be executed
first in the outer join.
SELECT last_name, nvl(sum(ord_amt), 0)
FROM customers c, (SELECT cust_id, ord_amt
FROM orders
WHERE ord_type IN ('01', '03')) o
WHERE c.cust_id = o.cust_id(+)
AND c.credit_limit > 1000
GROUP BY last_name;
Another better solution is to use ANSI SQL…
SELECT c.last_name, nvl(sum(o.ord_amt), 0)
FROM customers c LEFT OUTER JOIN orders o
ON (c.cust_id = o.cust_id AND o.ord_type IN ('01', '03'))
WHERE c.credit_limit > 1000
GROUP BY c.last_name;
Sort Merge Outer Join
Full Outer Join
1.2.2.7. Index Join
Index Join means that when the columns a query needs are covered by more than one index on the
table, a hash join can be used to join these indexes together to get the final result. This means there is
no need to access the table via the indexes; the data is retrieved by the index join operation alone.
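A hedged sketch (index names are assumptions): with single-column indexes on sal and ename, a query touching only those two columns can be answered by hash-joining the two indexes on ROWID, never visiting the table:

```sql
CREATE INDEX emp_sal_idx   ON emp (sal);
CREATE INDEX emp_ename_idx ON emp (ename);

-- Both referenced columns are covered by the two indexes above, so the
-- INDEX_JOIN hint lets the plan join index entries by ROWID
SELECT /*+ INDEX_JOIN(e emp_sal_idx emp_ename_idx) */ ename, sal
FROM emp e
WHERE sal > 2000;
```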
1.2.3. Other operations
1.2.3.1. IN list iterator explain plan
Please note the difference between “BETWEEN…AND” and “IN list”. The “BETWEEN…AND” means a
range while “IN list” means a list of separate values.
SELECT order_id, order_type, order_amount
FROM orders WHERE order_type IN (1, 2, 3);
Execution Plan
SELECT STATEMENT
  INLIST ITERATOR
    TABLE ACCESS (BY INDEX ROWID) OF 'orders'
      INDEX (RANGE SCAN) OF 'orders_idx1' (NON-UNIQUE)
1.2.3.2. Concatenation explain plan
A concatenation explain plan means the SQL statement uses the “OR” operator to combine multiple
query conditions on “different” columns. In this case, the SQL statement will be split into multiple
SELECT clauses, the best explain plan is chosen for each query portion, and at last the results of the
query portions are combined (concatenated).
Please note that only if the “OR” query condition is used as the driving condition will the concatenation
explain plan be chosen by the optimizer; otherwise the “OR” query condition will be used as the filter
only.
The execution order of the “query portions” starts from the last predicate (query condition) in
the “OR” list.
For example,
SELECT *
FROM table1
WHERE A = '10'
OR B = '123';
Execution plan
CONCATENATION
  TABLE ACCESS (BY INDEX ROWID) OF 'table1'
    INDEX (RANGE SCAN) OF 'b_idx' (NON-UNIQUE) ---- b is executed first
  TABLE ACCESS (BY INDEX ROWID) OF 'table1'
    INDEX (RANGE SCAN) OF 'a_idx' (NON-UNIQUE)
1.2.3.3. Sort explain plan
SORT (UNIQUE)
There are two possibilities for this explain plan: one is that there is a DISTINCT operation in the
SELECT-list; the other is that a sub query acts as the data provider for the main query.
SORT (AGGREGATE)
There is no GROUP BY clause, but an aggregation function is used in the SELECT-list.
SORT (GROUP BY)
There is GROUP BY clause in the SQL statement.
SORT (JOIN)
Sort Merge Join.
SORT (ORDER BY)
There is ORDER BY clause in the SQL statement.
1.2.3.4. SET operation explain plan
Union/Union-All explain plan
Intersection explain plan
Minus explain plan
1.2.3.5. COUNT (STOPKEY) explain plan
When the SQL statement has ROWNUM used, the explain plan will show “COUNT (STOPKEY)” operation.
SELECT *
FROM orders WHERE order_date = :b1
AND ROWNUM <= 20;
Execution Plan
SELECT STATEMENT
  COUNT (STOPKEY)
    TABLE ACCESS (BY INDEX ROWID) OF 'orders'
      INDEX (RANGE SCAN) OF 'order_idx2' (NON-UNIQUE)
2. Create Efficient Indexes
2.1. Comparison between “Index Merge” and “Composite Index”
The “Index Merge” works well when the indexes that will be merged have similar density, while the
“Composite Index” works well when the query conditions (predicates in the WHERE clause) use the “=”
operator.
When the query condition doesn’t use the first column in the composite index, the composite index will
generally perform badly.
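The two alternatives can be sketched as follows (table and index names are assumptions, not from the source):

```sql
-- Alternative 1: two single-column indexes of similar density that the
-- optimizer may merge at query time
CREATE INDEX ord_custno_idx ON orders (custno);
CREATE INDEX ord_status_idx ON orders (status);

-- Alternative 2: one composite index, best when both predicates use "="
CREATE INDEX ord_cust_status_idx ON orders (custno, status);

-- A query both alternatives target; note that if only status were given,
-- the composite index's leading column custno would go unused
SELECT * FROM orders WHERE custno = 'C100' AND status = 'OPEN';
```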
2.2. The characteristics of the “Composite Index”
When the leading column (the first column in the index) isn’t used in the query condition, the composite
index will most likely not be used. Even in the circumstances where an “index skip scan” can use the
composite index, the performance is not very good.
To create a composite index, two factors should be considered. One is which columns should be
included in the index, the other one is the order of the columns in the index. These two factors have
great influence on the performance of the index.
The relationship between the density and the order of the columns
If the indexed columns are only ever used with the “=” operator, the density of the columns has little
impact on the choice of column order.
The impact of “=” operation on the order of the columns
If the query condition doesn’t use “=” operator for the first column in the composite index, the
index will not perform well even if other columns in the index are used with “=” operator in the
query condition.
“=” operation is more important than the density of the column when deciding the order of columns
in the composite index. So to make the best use of the composite index, we need to take both the
density and the column usage (“=” or not) into consideration.
IN list iterator
Sometimes, if the leading column of the composite index is used in “BETWEEN...AND” or “LIKE”
operation, we can take advantage of “IN list” to rewrite the SQL to improve the performance.
For example, suppose there is one index idx_tab1 (col1, col2) on the table TAB1…
SELECT * FROM TAB1
WHERE col1 between 10 and 20 AND col2 = 'A';
If there are only a limited number of values that meet the predicate (col1 between 10 and 20), we can
rewrite the SQL as follows,
SELECT * FROM TAB1
WHERE col1 IN (10, 15, 20) AND col2 = 'A';
The SQL statement above is equivalent to…
SELECT * FROM TAB1 WHERE (col1=10 AND col2='A') OR
(col1=15 AND col2='A') OR (col1=20 AND col2='A')
This way, the composite index idx_tab1 can be used well because the SQL engine scans fewer
index entries.
Another example, suppose there is one index idx_tab1 (col1, col3, col2) on table TAB1,
SELECT * FROM TAB1 WHERE col1 = 'A' and col2='222';
This time, even though the leading column col1 is used with the “=” operator, the second column col3 is
not used in the WHERE clause. As a result, col2='222' can only be used as a “filter” to check the index
entries, which is not very efficient.
If we know that the column col3 has only a few values, like 1, 2, 3, then the SQL statement above
can be rewritten as follows…
SELECT * FROM TAB1
WHERE col1 = 'A' and col2='222' and col3 in (1, 2, 3);
It is equivalent to the following SQL statement…
SELECT * FROM TAB1
WHERE (col1='A' and col3=1 and col2='222') OR (col1='A' and col3=2 and col2='222')
OR (col1='A' and col3=3 and col2='222')
This way, the columns col2 and col3 can be used for index entry access, which is much more efficient
than being a data “filter”.
3. Partial Range Scan
3.1. What’s Partial Range Scan
Partial Range Scan doesn’t mean scanning all the data that meet the conditions (predicates) in the
WHERE clause; rather, it means that the SQL engine doesn’t need to scan all the data before
returning the first set of rows to the users. This is similar to what the optimizer mode “FIRST_ROWS”
indicates.
Partial Range Scan is very helpful for the OLTP operations, but this doesn’t mean the partial range scan
cannot be used in the batch process operations.
Partial Range Scan reduces the data volume that needs to be scanned, and the amount of data it scans is
not determined by the total data volume that meets the conditions (predicates) in the WHERE clause.
This is the most attractive characteristic of the Partial Range Scan.
3.2. The partial range scan usage rule
If we can change a Full Range Scan to a Partial Range Scan, the SQL execution performance
will most likely be improved greatly. However, not every Full Range Scan can be converted to a Partial
Range Scan.
If the SQL execution plan has “SORT” operations, like SORT(UNIQUE), SORT(JOIN), SORT(AGGREGATE),
SORT(ORDER BY), SORT(GROUP BY), etc., we can conclude that the optimizer has not chosen Partial
Range Scan; instead it has chosen a Full Range Scan operation.
Besides, if the SQL statements have set operations, like UNION, MINUS, INTERSECT, then the SQL
statements cannot be executed via Partial Range Scan as the set operation will sort all the data (SORT
(UNIQUE)) to remove the duplicated records. But UNION ALL can be executed via Partial Range Scan.
3.2.1. The requirements for Partial Range Scan
Generally, if the SQL statement has an ORDER BY clause, it cannot be executed via Partial
Range Scan. However, if the column in the ORDER BY clause is indexed and that index is used as the
driving index, then the SQL statement can be executed via Partial Range Scan.
SELECT ord_date, ordqty * 1000 FROM order
WHERE ord_date like '200512%' ORDER BY ord_date;
If the column ord_date is indexed, the optimizer can ignore the ORDER BY clause and then the SQL
statement can be executed via Partial Range Scan.
As a result, it is not true that every SQL statement with an ORDER BY clause is barred from Partial Range
Scan. Only when a SORT operation shows up in the SQL execution plan can the Partial Range Scan
not be applied.
3.2.2. Partial Range Scan in different optimizer mode
Generally, a SQL statement under FIRST_ROWS will be executed via Partial Range Scan, and under
ALL_ROWS via Full Range Scan. If we want to instruct the optimizer to choose Partial Range Scan, we
can use hints such as INDEX or FIRST_ROWS(n). In general, set the optimizer mode to “FIRST_ROWS”
in an OLTP system.
3.3. The principle to improve the execution speed of Partial Range Scan
Take a look at an example first,
SELECT * FROM order;
Generally, the SQL statement above will return its first set of rows quickly. But the SQL
statement below will return much more slowly.
SELECT * FROM order ORDER BY item;
The reason is not merely that there is a SORT operation in the second SQL statement. The more
important reason is that the SORT operation forces the SQL engine to perform a FULL RANGE
(table) SCAN before the first set of rows can be returned.
If there is one index of which the leading column is “item”, the SQL statement above can be rewritten as
follows,
SELECT * FROM order WHERE item > ' ';
This way, the optimizer will take advantage of the index to perform the data scan, and a partial range
scan becomes possible. We can also use the INDEX hint to force the use of the index, like…
SELECT /*+ index (order item_index) */ * FROM order WHERE item > ' ';
3.3.1. The principle of Partial Range Scan
If the data volume that meets the “driving” query condition is small, the execution cost will be
low. Likewise, if the data volume that meets the “filtering” query condition is big, the execution cost
will be low.
In order to make the query condition that will lead to small data volume be the “driving” condition, we
can take advantage of some hints (index, etc) or other methods.
For example, suppose there are indexes created on the columns “ordno” and “custno”, but the query
condition on the column “custno” leads to a smaller data volume, which makes it appropriate as the
driving condition. We can instruct the optimizer to follow our intent in either of these ways (applying
RTRIM to ordno suppresses the index on ordno)…
SELECT * FROM order WHERE RTRIM(ordno) between 1 and 1000 AND custno like 'DN%';
SELECT /*+ INDEX(order custno_index)*/ *
FROM order WHERE ordno BETWEEN 1 and 1000 AND custno like 'DN%';
Driving-condition data range   Filtering-condition data range   Execution speed   Solution
Small                          Small                            Fast
Small                          Big                              Fast
Big                            Small                            Slow              Swap the “driving” and “filtering” roles
Big                            Big                              Fast
3.4. The ways to instruct the optimizer to choose Partial Range Scan
3.4.1. Replace SORT operation by (index) access path
In order to eliminate the “SORT” operation from the SQL execution plan, we can add the columns used
in the ORDER BY clause into the index. This way, we can take advantage of this index to avoid the Full
Range Scan operation.
SELECT ord_dept, ordqty * 1000
FROM order
WHERE ord_date like '2005%' ORDER BY ord_dept DESC
In the SQL statement above, the condition used to filter (drive) the data set uses the column
“ord_date”, while the column used in the ORDER BY clause is “ord_dept”. If the data set
returned by applying the condition “ord_date like '2005%'” is large, the Full Range Scan will respond
slowly. However, if there is also an index on the column ord_dept, we can rewrite the SQL statement
as follows,
SELECT /*+ INDEX_DESC (a ord_dept_index)*/ *
FROM order a WHERE a.ord_date like ‘2005%’ AND ord_dept > ' ';
This way, we not only remove the ORDER BY clause (by using the hint INDEX_DESC), which makes the
Partial Range Scan possible, but also make ord_dept the driving column and the column ord_date
the “filter” column. According to the principle of the Partial Range Scan, if the filtering condition covers
a large data volume, the execution speed will be fast.
3.4.2. Use index scan only for partial range scan
If all the columns used by the SQL statement are included in the index, then the optimizer can get the
data by accessing only the index; there is no need to scan the table in this case.
This is very efficient, as the I/O will be reduced.
As a result, to instruct the optimizer to choose this index scan, we need to think carefully for those
candidate columns that can be included in the index.
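A minimal sketch of such a “covering” index (table and index names are assumptions):

```sql
-- Every column the query touches is in the index, so the plan can read
-- the index alone and never access the orders table
CREATE INDEX ord_cover_idx ON orders (custno, ord_date, ord_amt);

SELECT ord_date, ord_amt
FROM orders
WHERE custno = 'C100';
```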
3.4.3. MAX and MIN functions
Since MAX and MIN are aggregate functions, it would seem that if a SQL statement uses these functions
then a Partial Range Scan is impossible.
However, newer optimizers have a special processing operation for the MAX/MIN functions that
uses the Partial Range Scan, which gives MAX/MIN good response times.
For example, the index pk_order is based on the column (deptno, seq)…
SELECT MAX(seq) + 1 FROM order
WHERE deptno = '1234';
EXECUTION PLAN
SELECT STATEMENT
  SORT (AGGREGATE)
    FIRST ROW
      INDEX (RANGE SCAN (MIN/MAX)) OF 'pk_order' (UNIQUE)
Please note the “FIRST ROW” and “RANGE SCAN (MIN/MAX)” in the execution plan. They mean the SQL
engine doesn’t need to wait until all the deptno '1234' entries are scanned before returning the result.
The SQL statement above is executed by the optimizer almost as if it were the SQL statement below…
SELECT /*+ INDEX_DESC(order pk_order) */ NVL(MAX(seq), 0) + 1
FROM order
WHERE deptno = '1234'
  AND ROWNUM = 1;
Please note that the hint “INDEX_DESC” and the ROWNUM predicate are used in the SQL statement to explicitly tell the optimizer to choose the partial range scan.
3.4.4. “Filter” partial range scan
The “EXISTS” sub-query returns as soon as the first matching record is found. The sub-query is not executed fully, which means that not all of its records are joined with the main query; this behavior is valuable for the partial range scan.
SELECT 1 INTO :cnt FROM DUAL WHERE EXISTS
(SELECT NULL
FROM item_tab WHERE dept='101'
AND seq > 100);
Generally, the “EXISTS” sub-query is correlated with the main query. However, just like the example above, the “EXISTS” sub-query can also be non-correlated. This SQL statement checks whether there is at least one record matching the predicates (dept = '101' and seq > 100) in the table “item_tab”.
The execution plan is as follows,
Execution Plan
SELECT STATEMENT
  TABLE ACCESS (FULL) OF ‘dual’
  TABLE ACCESS (BY INDEX ROWID) OF ‘item_tab’
    INDEX (RANGE SCAN) OF ‘item_dept_idx’ (NOT UNIQUE)
3.4.5. Take advantage of “ROWNUM” for partial range scan
ROWNUM is a pseudo column which is usually used to limit the number of records returned by the query.
Please note that ROWNUM is not the sequence number of the record being processed, but the sequence number of the record being returned by the query. That is to say, even if the SQL query has a ROWNUM <= 10 predicate, the number of records actually processed by the query is most likely more than 10.
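As a hedged illustration (reusing the order table from earlier examples; which ten rows come back depends on the access path):

```sql
-- Return at most 10 rows; the fetch stops once 10 rows are produced,
-- which is what makes the partial range scan pay off.
SELECT *
FROM order
WHERE ord_date LIKE '2005%'
  AND ROWNUM <= 10;

-- Caution: ROWNUM is assigned as rows are returned, so a predicate
-- like "ROWNUM > 1" can never be true: the first candidate row would
-- get ROWNUM = 1 and be rejected, so no row ever qualifies.
```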
3.4.6. Take advantage of “Inline View/Scalar Sub Query” for partial range scan
Put the data that must be processed via “Full Range Scan” inside an inline view; this can make sure the other parts of the SQL are processed via “Partial Range Scan”. Otherwise, the whole SQL query would be processed via “Full Range Scan”.
SELECT a.dept_name, b.empno, b.emp_name, c.sal_ym, c.sal_tot
FROM department a, employee b, salary c
WHERE b.deptno = a.deptno
  AND c.empno = b.empno
  AND a.location = 'SEOUL'
  AND b.job = 'MANAGER'
  AND c.sal_ym = '200512'
ORDER BY a.dept_name, b.hire_date, c.sal_ym;
Since the SQL statement above has the ORDER BY clause, it seems it can only be executed via “Full Range Scan”. But considering that the data volumes in the tables department and employee are not very large, we can join these two tables first and then join the table salary. What’s more, in order not to sort by the column sal_ym of the table salary, we can create an index on the columns (empno+sal_ym). This way, the SQL statement above can be rewritten as follows,
SELECT /*+ ORDERED USE_NL(x y) */
       x.dept_name, x.empno, x.emp_name, y.sal_ym, y.sal_tot
FROM (SELECT a.dept_name, b.hire_date, b.empno, b.emp_name
      FROM department a, employee b
      WHERE b.deptno = a.deptno
        AND a.location = 'SEOUL'
        AND b.job = 'MANAGER'
      ORDER BY a.dept_name, b.hire_date) x, salary y
WHERE y.empno = x.empno
  AND y.sal_ym = '200512';
Another example,
SELECT a.product_cd, product_name, avg_stock
FROM product a,
     (SELECT product_cd, SUM(stock_qty) / (:b2 - :b1) avg_stock
      FROM prod_stock
      WHERE stock_date BETWEEN :b1 AND :b2
      GROUP BY product_cd) b
WHERE b.product_cd = a.product_cd
  AND a.category_cd = '20';
Can be rewritten as follows,
SELECT a.product_cd, product_name,
       (SELECT SUM(stock_qty) / (:b2 - :b1)
        FROM prod_stock b
        WHERE b.product_cd = a.product_cd
          AND b.stock_date BETWEEN :b1 AND :b2) avg_stock
FROM product a
WHERE category_cd = '20';
3.4.7. Take advantage of “Function” for partial range scan
Take a look at the following SQL statement…
SELECT y.cust_no, y.cust_name, x.bill_tot
FROM (SELECT a.cust_no, SUM(bill_amt) bill_tot
      FROM account a, charge b
      WHERE a.acct_no = b.acct_no
        AND b.bill_cd = 'FEE'
        AND b.bill_ym BETWEEN :b1 AND :b2
      GROUP BY a.cust_no
      HAVING SUM(b.bill_amt) > 1000000) x, customer y
WHERE y.cust_no = x.cust_no
  AND y.cust_status = 'ARR'
  AND ROWNUM <= 30;
Though the SQL statement only needs to query the customers whose status is 'ARR', the inline view still needs to group all the customers. Obviously, this is not very efficient, as the inline view performs much useless work. What’s more, the SQL statement cannot return the first set of data until the inline view is completely processed.
To resolve this issue, we can take advantage of function as follows…
CREATE OR REPLACE FUNCTION cust_arr_fee_func
  (v_cust_no IN varchar2, v_start_ym IN varchar2, v_end_ym IN varchar2)
RETURN number
AS
  ret_val number(14);
BEGIN
  SELECT SUM(bill_amt) INTO ret_val
  FROM account a, charge b
  WHERE a.acct_no = b.acct_no
    AND a.cust_no = v_cust_no
    AND b.bill_cd = 'FEE'
    AND b.bill_ym BETWEEN v_start_ym AND v_end_ym;
  RETURN ret_val;
END cust_arr_fee_func;
SELECT cust_no, cust_name, CUST_ARR_FEE_FUNC(cust_no, :b1, :b2)
FROM customer
WHERE cust_status = 'ARR'
  AND CUST_ARR_FEE_FUNC(cust_no, :b1, :b2) >= 1000000
  AND ROWNUM <= 30;
The SQL statement calls the function twice; it can be rewritten by using an inline view…
SELECT cust_no, cust_name, bill_tot
FROM (SELECT ROWNUM, cust_no, cust_name,
             CUST_ARR_FEE_FUNC(cust_no, :b1, :b2) bill_tot
      FROM customer
      WHERE cust_status = 'ARR')
WHERE bill_tot >= 1000000
  AND ROWNUM <= 30;
Please note that the inline view includes the pseudo column ROWNUM, which is used to prevent view merging.
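As an alternative sketch (NO_MERGE is a standard Oracle hint; this variant is not from the original text but should have the same effect of keeping the view intact):

```sql
-- The NO_MERGE hint asks the optimizer not to merge the inline view
-- into the outer query, so the function is still evaluated only once
-- per customer row before the bill_tot filter is applied:
SELECT cust_no, cust_name, bill_tot
FROM (SELECT /*+ NO_MERGE */ cust_no, cust_name,
             CUST_ARR_FEE_FUNC(cust_no, :b1, :b2) bill_tot
      FROM customer
      WHERE cust_status = 'ARR')
WHERE bill_tot >= 1000000
  AND ROWNUM <= 30;
```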
4. Table Joins
Table joins are set operations; they are not merely a means of retrieving data by using the FKs defined on the tables.
4.1. Join VS Loop Query
The “Loop Query” means using “procedural processing logic” to replace the table join operation. It first queries the data from one table and then uses the results (a list of constant values) to probe the final result from the other table in a loop.
For example, the SQL statement (using table join)
SELECT t1.col1, t2.col2 FROM tab1 t1, tab2 t2
WHERE t1.key# = t2.join_field;
…can be rewritten by using “Loop Query” like below…
FOR rec IN (SELECT key#, col1 FROM tab1)
LOOP
  SELECT col2 FROM tab2 WHERE join_field = rec.key#;
END LOOP;
If the SQL (table join) statement involves operations (like ORDER BY, GROUP BY, etc.) which prevent it from returning the first set of results before processing the whole data set, the “loop query” might sometimes perform better than the join. However, we can rewrite the general table join by taking advantage of some techniques (like “inline view”, “scalar sub-query”, etc.), which can make the table join perform well.
Example 1:
SELECT a.fld1, ……, b.col1, …..
FROM tab2 b, tab1 a
WHERE a.key1 = b.key2
  AND a.fld1 = '10'
ORDER BY a.fld2
Can be rewritten using inline view as follows,
SELECT x.fld1, …., x.fldn, y.col1….., y.coln
FROM (SELECT fld1, ….., fldn
      FROM tab1
      WHERE fld1 = '10'
      ORDER BY fld2) x, tab2 y
WHERE y.key2 = x.key1
Example 2:
SELECT b.dept#, b.dept_name, SUM(a.sale_money)
FROM tab1 a, tab2 b
WHERE a.dept# = b.dept#
  AND a.sale_date LIKE '200503%'
GROUP BY b.dept#, b.dept_name
Can be rewritten as follows…
SELECT x.dept#, y.dept_name, x.sale_money
FROM (SELECT dept#, SUM(sale_money) sale_money
      FROM tab1
      WHERE sale_date LIKE '200503%'
      GROUP BY dept#) x, tab2 y
WHERE y.dept# = x.dept#
Example 3:
SELECT a.*, decode(a.type, '1', b.client_name, '2', c.project_name) name
FROM tab a, clients b, projects c
WHERE a.issue_date LIKE '200503%'
  AND b.client_no(+) = decode(a.type, '1', a.type_code)
  AND c.project_no(+) = decode(a.type, '2', a.type_code)
Can be rewritten using scalar sub-query as follows…
SELECT a.*, decode(a.type,
         '1', (SELECT client_name FROM clients b WHERE b.client_no = a.type_code),
         '2', (SELECT project_name FROM projects c WHERE c.project_no = a.type_code)) name
FROM tab a
WHERE a.issue_date LIKE '200503%'
4.2. The impact of Join Condition on Table Join
The join condition here mainly means whether there is any valid or proper index on the join columns, which is very important for the optimizer to generate an efficient execution plan.
4.2.1. Both sides of the Join Condition are valid
Under this circumstance, proper or valid indexes are created on both sides of the join columns. In this case, either table can be the “driving” table, and this will not yield a bad execution plan under most circumstances.
However, bear in mind that to get the best performance, we need to filter as much data as possible before joining the two tables. That is to say, we need to choose the table that can filter more data to be the driving table.
If the optimizer chooses the wrong join order, we can instruct the optimizer to take the right join order by taking advantage of some hints (like ORDERED) or by rewriting the SQL statement.
For example, suppose there are indexes created on tab2(fld2+key2) and tab1(fld1+key1), and we know that making the table tab2 the driving table will be better; we can write either of the following SQL statements to make the optimizer follow our intent,
SELECT a.*, b.*
FROM tab2 b, tab1 a
WHERE a.key1 = b.key2
  AND b.fld2 LIKE 'ABC%'
  AND RTRIM(a.fld1) = '10';

SELECT /*+ ORDERED */ a.*, b.*
FROM tab2 b, tab1 a
WHERE a.key1 = b.key2
  AND b.fld2 LIKE 'ABC%'
  AND a.fld1 = '10';
(In the first statement, applying RTRIM to a.fld1 suppresses the index on tab1, so tab2 becomes the driving table.)
4.2.2. One side of the join condition is invalid
Under this circumstance, only one join column is indexed, so the join order is very important. Generally, the table that has its join column indexed should be the inner table if the NESTED LOOP join is used; otherwise, use the SORT MERGE JOIN or HASH JOIN, which do not rely on indexes.
4.2.3. Neither side of the join condition is valid
Under this circumstance, neither side of the join columns is indexed. As a result, the NESTED LOOP JOIN will not be a good performer, and the SORT MERGE JOIN or HASH JOIN will be a better choice.
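As a hedged sketch (using the generic tab1/tab2 tables from above; USE_HASH and USE_MERGE are standard Oracle hints), the join method can be requested explicitly when no join-column index exists:

```sql
-- With no index on key1 or key2, ask for a hash join instead of a
-- nested loop (or write USE_MERGE(a b) for a sort merge join):
SELECT /*+ USE_HASH(a b) */ a.fld1, b.col1
FROM tab1 a, tab2 b
WHERE a.key1 = b.key2;
```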
4.3. Different kinds of table join
4.3.1. Nested Loop Join
4.3.1.1. The characteristics of Nested Loop Join
The (data) sets are processed in order: the records in the driving table are processed in order, and the tables are joined in a specific order.
The data volume that needs to be processed in the driving table determines the overall data volume that needs to be processed, so it is better to choose the table with the small data volume (that needs to be processed) as the driving data set.
Not all the indexes on the columns in the predicates will be used in the table join.
The join condition (i.e. valid indexes) is very important for the Nested Loop join.
It permits the partial range scan.
4.3.1.2. The rules of applying Nested Loop Join
If the partial range scan is possible, it is better to choose the nested loop join.
If one of the tables cannot reduce the data volume that needs to be scanned by itself (i.e. it depends on the other table to reduce it), it is better to choose the nested loop join.
If the data volume that needs to be processed is not very large, it is better to choose the nested loop join.
If the query range of the driving table is large, or the random table access is too heavy when joining the tables, it is better not to choose the nested loop join.
4.3.2. Sort Merge Join
Sort Merge Join means sorting the two data sets on the join columns before joining the two tables.
4.3.2.1. The characteristics of Sort Merge Join
It can reduce the random table accesses to a great extent.
It is processed via full range scan; the table join cannot happen before the sort operation is finished.
The join order of the tables is irrelevant.
The join condition (i.e. the index on the join columns) is not important.
4.3.2.2. The rules of applying Sort Merge Join
If the data volume is large and the partial range scan is impossible, it is better to use the sort merge join.
It is better to create an efficient index that reduces the data volume that needs to be sorted, rather than an index on the join columns.
4.3.3. Nested Loop Join V.S. Sort Merge Join
4.3.3.1. If only one side of the joined tables has query conditions
Under such a circumstance, the Nested Loop join can work well, since the one side that has a query condition can reduce the data volume (random table access) that needs to be joined. This is even better when the partial range scan is possible, because having no query condition (filter) on the inner table makes the first set of data return more quickly.
However, this is bad for the Sort Merge Join: without a query condition filtering the data, much more data will be sorted before the table join.
4.3.3.2. If neither side of the joined tables has query conditions
Under such a circumstance, the Nested Loop Join will perform badly, as both tables will be scanned via Full Table Scan and many more random table accesses are introduced. The Sort Merge Join will be the better choice under such a circumstance.
4.3.4. Hash Join
The most distinguished advantage of the Hash Join is that it can get rid of lots of random table accesses and sort operations when processing a huge amount of data.
Please note that two hashings (hash functions) happen during the hash join: the first hashing determines the position of the “partition”; the second calculates the “hash value” on which the hash table is built. The hash table stores the hash values and the corresponding positions of the “clusters” (also called slots in the partition).
The following are some terms related to the Hash Join:
4.3.4.1. Hash Area
The Hash Area is the memory space allocated for the hash join to work normally. It consists of the “bitmap vector”, the “hash table”, the “partition table” and the space occupied by the “partitions”. The “bitmap vector” stores the unique values generated from the join values of the “build input”; it is used to filter the data from the “probe input” before the table join.
4.3.4.2. Partition
A partition is a bucket for the records from the “build input” whose hash values are the same. One partition can be further divided into multiple “clusters”, which are the unit of one I/O operation.
4.3.4.3. Cluster
The cluster is contained in one partition and is the unit of the I/O operation. The cluster is also called a slot. The cluster stores not only the join columns, but also the columns referenced in the SELECT-list of the “Build Input”.
4.3.4.4. Build Input and Probe Input
The Build Input is the data set used to build the hash table. The Probe Input is the data set that uses the hash table for the table join. Generally, the smaller data set is chosen as the build input.
4.3.4.5. In-memory hash join and Delayed hash join
If the build input can be fully contained in the Hash Area, the hash join is an in-memory hash join; otherwise it is called a delayed hash join.
4.3.4.6. Bitmap Vector
The Bitmap Vector is created while the partitions for the build input are being created. It is used to store, in the Hash Area, the unique (hash) values of the join columns of the build input. When the partitions for the Probe Input are built, the Bitmap Vector is used to filter its data set.
4.3.4.7. Hash Table
The hash table is created in memory (the Hash Area) and is used as the “index” by the “probe input” during the table join. The “probe input” uses the Hash Table to look up the “addresses” of the “clusters” which contain the join columns and the columns referenced in the SELECT-list of the “build input”, and then joins to the “cluster”.
4.3.4.8. Partition Table
The Partition Table is used to store the information (e.g. the address) of each “partition” when the “Build Input” cannot be contained fully in memory. The information contained in the “Partition Table” can be used to reload a “partition” from the temporary segments into memory.
4.3.5. Semi Join
The SQL optimizer will often choose a “semi join” when there is a “sub-query” clause in the SQL statement. The reason it is called a “semi join” is that a “sub-query” is quite different from a general table join.
4.3.5.1. What’s the semi join and what are its characteristics
A semi-join joins the “sub-query” and the “main query”; it is a table join in a broad sense. The sub-query is the child query while the main query is the parent query. Just like inheritance in OO, the sub-query can reference the fields (columns) of the main query, but not vice versa.
Though the table join and the sub-query are similar, they are quite different in essence: the sets (tables) in a table join are peers, but they are not in a sub-query.
4.3.5.2. The execution plans for semi-join
Nested Loop Semi-join
In a semi-join, the sub-query can be executed before or after the main query. In the former case, the records (literal values) returned by the sub-query are used to probe the final results from the main query; in the latter case, the results returned from the main query are further checked by the sub-query.
Suppose there are two tables, TAB1 and TAB2, and the relation between them is 1 to M, which means one record in TAB1 may have more than one matching record in TAB2.
The following SQL statement (pseudo snippet)…
SELECT col1, col2…
FROM TAB1 x
WHERE key1 IN (SELECT key2
               FROM TAB2 y
               WHERE y.col1…
                 AND y.col2…)
The corresponding SQL execution plan can be like…
NESTED LOOPS
  VIEW
    SORT (UNIQUE)
      TABLE ACCESS BY ROWID OF TAB2
        INDEX RANGE SCAN OF col1_idx
  TABLE ACCESS BY ROWID OF TAB1
    INDEX RANGE SCAN OF key1_idx
From the SQL execution plan, we can see the sub-query (TAB2) is executed first. The records returned by the sub-query are used to join (NESTED LOOP) with the main query (TAB1). Please note the operation SORT (UNIQUE) in the sub-query, which eliminates the duplicated values returned from the sub-query, as TAB2 is the child table of TAB1. This is the difference between the table join and the sub-query: if TAB1 and TAB2 were joined as two tables, there would be no “SORT (UNIQUE)” operation. However, if the column (key2) returned from the sub-query is the primary key, which is unique, there is no need to perform the “SORT (UNIQUE)” operation; under such a circumstance, the sub-query becomes the same as a table join.
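A hedged sketch of that last point (generic tables as above; not from the original text):

```sql
-- If key2 is the primary key of TAB2, no duplicate elimination is
-- needed, so the IN sub-query above is equivalent to this plain join:
SELECT x.col1, x.col2
FROM TAB1 x, TAB2 y
WHERE x.key1 = y.key2;
```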
The SQL statement above can be called a “non-correlated sub-query” because the sub-query doesn’t reference the columns of the main query. For such sub-queries, the optimizer can choose whether to execute the sub-query before the main query depending on the statistics. However, if the “sub-query” is “correlated”, which means it references the columns of the main query, the optimizer will choose to execute the main query first and use the sub-query to filter the results. The reason is that the sub-query depends on the main query: it cannot know the values of the join columns before the main query is executed. As programmers, we can steer the execution plan by taking advantage of this fact. If we rewrite the SQL statement above as follows,
SELECT col1, col2…
FROM TAB1 x
WHERE key1 IN (SELECT key2
               FROM TAB2 y
               WHERE y.col1…
                 AND y.col2…
                 AND x.key1 = y.key2)
the main query will be executed first. Please note that the join condition x.key1 = y.key2 does not need to be specified explicitly; the optimizer can deduce it from the query. But if we intend to make the SQL engine execute the main query first, we can make the sub-query “correlated” in this way.
Sort Merge Semi-Join
“Filter” Semi-Join
As mentioned in the section “Nested Loop Semi-Join”, the “sub-query” can be the provider (executed before the main query) or the filter (executed after the main query, i.e. a correlated sub-query). Generally, the SQL optimizer will choose the “Filter” type semi-join if the sub-query contains the “EXISTS” operator.
One typical “Filter” Semi-Join execution plan is as follows,
SELECT …
FROM order x
WHERE orddate LIKE '200506%'
  AND EXISTS (SELECT NULL
              FROM dept y
              WHERE y.deptno = x.saldept
                AND y.type1 = '1');

FILTER
  TABLE ACCESS (BY ROWID) OF ‘order’
    INDEX (RANGE SCAN) OF ‘orddate_index’ (NON-UNIQUE)
  TABLE ACCESS (BY ROWID) OF ‘dept’
    INDEX (UNIQUE SCAN) OF ‘dept_pk’ (UNIQUE)
Note the operation “FILTER” where “NESTED LOOPS” is usually seen. This is the most obvious difference between the Nested Loop join and the “Filter” semi-join in terms of the execution plan. In the “filter” semi-join, once a matching record is found in the sub-query (dept, in this case) the join is ended, while in nested loop joins, all the matching records between the two tables (dept and order, in this case) are joined together. Compared with “Nested Loops”, the “Filter Semi-Join” can reduce the number of random table accesses for the tables in the sub-query. So if the sub-query acts merely as a “filter”, the sub-query may outperform the table join. However, if the sub-query does not act merely as a “filter”, the optimizer will try to change the execution sequence between the main query and the sub-query, or try to transform the sub-query into a table join.
(For the SQL statement above, if there is an index on the columns order(orddate, saldept), the SQL statement will be more efficient, as this will take the best advantage of the cache to reduce the random table access of the table dept.)
Hash Semi-Join
Just like the Nested Loop join, the “Filter” semi-join introduces lots of random table accesses for the “sub-query” tables, which is not efficient for a large volume of data. When processing a huge amount of data, the sort merge join or hash join is usually a better choice.
SELECT …
FROM order x
WHERE orddate LIKE '200506%'
  AND EXISTS (SELECT /*+ HASH_SJ */ NULL
              FROM dept y
              WHERE y.deptno = x.saldept
                AND y.type1 = '1');

HASH JOIN (SEMI)
  TABLE ACCESS (BY ROWID) OF ‘order’
    INDEX (RANGE SCAN) OF ‘orddate_index’ (NON-UNIQUE)
  TABLE ACCESS (FULL) OF ‘dept’
Please note the hint “HASH_SJ” in the sub-query, which instructs the optimizer to choose the hash semi-join.
However, there are some restrictions on using the hash semi-join. For example, the sub-query cannot have more than one table; the join condition can only be ‘=’; and the sub-query cannot contain GROUP BY, CONNECT BY, ROWNUM, etc.
Anti Semi-Join
The ANTI semi-join will be chosen when a “NOT” operator is used between the main query and the sub-query. If the sub-query uses a “NOT” operator, no matter whether “NOT IN” or “NOT EXISTS”, the sub-query acts as a data “filter”.
Under most circumstances, the optimizer will choose the “filter” semi-join for the anti semi-join, which is good when the data volume is not large. When the data volume gets large, to reduce the number of random table accesses, the sort merge join and hash join will be better choices for the anti semi-join. We can use hints such as MERGE_AJ or HASH_AJ to instruct the optimizer to choose these join methods.
For example,
SELECT COUNT(*)
FROM tab1
WHERE col1 LIKE 'ABC%'
  AND col2 IS NOT NULL
  AND col2 NOT IN (SELECT /*+ MERGE_AJ */ fld2
                   FROM tab2
                   WHERE fld3 BETWEEN '20050101' AND '20050131'
                     AND fld2 IS NOT NULL)

MERGE JOIN (ANTI)
  SORT (JOIN)
    TABLE ACCESS (BY ROWID) OF ‘tab1’
      INDEX (RANGE SCAN) OF ‘col1_index’ (NON-UNIQUE)
  SORT (UNIQUE)
    VIEW
      TABLE ACCESS (BY ROWID) OF ‘tab2’
        INDEX (RANGE SCAN) OF ‘fld3_index’ (NON-UNIQUE)
Please note the filter condition in the sub-query (fld2 IS NOT NULL). This is needed because the main query uses NOT IN to exclude the values of fld2 returned by the sub-query: if the results returned by the sub-query include NULL values, the main query will return incorrect results (no rows at all), as a comparison with NULL never evaluates to true.
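A minimal sketch of this pitfall (generic tables as above; not from the original text):

```sql
-- If any tab2.fld2 is NULL, "col2 NOT IN (...)" is never true, because
-- "col2 <> NULL" evaluates to UNKNOWN; this query then returns 0 rows.
SELECT COUNT(*) FROM tab1
WHERE col2 NOT IN (SELECT fld2 FROM tab2);

-- Defensive form: exclude the NULLs inside the sub-query.
SELECT COUNT(*) FROM tab1
WHERE col2 NOT IN (SELECT fld2 FROM tab2 WHERE fld2 IS NOT NULL);
```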
4.3.6. Star Join
Star Join is not a brand-new join method; it still uses the normal join methods, like the nested loop join, sort merge join, hash join, etc. The special characteristic of the star join is that it uses a special join order to join the tables.
Though the star join mostly comes up in data warehouses (data marts), that is not to say a normal OLTP database cannot have such a join operation. The star join works well when several small tables join to one big table and those small tables do not join to each other directly. This looks like a star shape, which is why it is called a “star” join. If the big table joined to each small table one by one, this would be very inefficient, as it would lead to too much I/O overhead. (The small tables correspond to the dimension tables and the big table corresponds to the fact table in a data warehouse.)
To resolve this issue, the star join takes advantage of the “Cartesian join” to join the small tables first to get one data set, and then uses this data set to join the big table. Since each small table has a small data volume, the Cartesian product will not produce too much data. However, if the small tables are not small enough, the Cartesian join will produce too much data, which will make the star join perform badly.
The execution plan below shows what a typical star join is like…
SELECT STATEMENT Optimizer=ALL_ROWS
  HASH JOIN
    MERGE JOIN (CARTESIAN)
      TABLE ACCESS (FULL) OF ‘dept’
      BUFFER (SORT)
        TABLE ACCESS (FULL) OF ‘products’
    TABLE ACCESS (FULL) OF ‘sales’
The tables ‘dept’ and ‘products’ are dimension tables while the table ‘sales’ is the fact table. Please note that the star join is only available in the CBO and only when statistics have been gathered. There is also a hint (/*+ STAR */) that can be used to instruct the optimizer to choose the star join.
Under most circumstances, the Cartesian product is generated by a sort merge join. Since there is generally no proper composite index created on the big table (fact table), the join operation between the big table and the Cartesian product is usually a hash join.
4.3.7. Star Transforming Join
The star transforming join is introduced to make up for some drawbacks of the star join. It is not a replacement for the star join.
As we know, if the data volume of the Cartesian product in the star join is very large, the star join will perform badly. The star transforming join takes advantage of bitmap indexes to get rid of the Cartesian product and of the composite indexes created on the big (fact) table.
The “transforming” in the star transforming join means the optimizer will transform the SQL query into another form by applying the idea that a “sub-query” can be used as a data provider.
Let’s see an example,
SELECT d.dept_name, c.cust_city, p.product_name, SUM(s.amount) sales_amount
FROM sales s, products p, customers c, dept d
WHERE s.product_cd = p.product_cd
  AND s.cust_id = c.cust_id
  AND s.sales_dept = d.dept_no
  AND c.cust_grade BETWEEN '10' AND '15'
  AND d.location = 'SEOUL'
  AND p.product_name IN ('PA001', 'DR210')
GROUP BY d.dept_name, c.cust_city, p.product_name;
The SQL statement above can be transformed into the following one…
SELECT d.dept_name, c.cust_city, p.product_name, SUM(s.amount) sales_amount
FROM sales s, products p, customers c, dept d
WHERE s.product_cd = p.product_cd
  AND s.cust_id = c.cust_id
  AND s.sales_dept = d.dept_no
  AND c.cust_grade BETWEEN '10' AND '15'
  AND d.location = 'SEOUL'
  AND p.product_name IN ('PA001', 'DR210')
  AND s.product_cd IN (SELECT product_cd FROM products
                       WHERE product_name IN ('PA001', 'DR210'))
  AND s.cust_id IN (SELECT cust_id FROM customers
                    WHERE cust_grade BETWEEN '10' AND '15')
  AND s.sales_dept IN (SELECT dept_no FROM dept WHERE location = 'SEOUL')
GROUP BY d.dept_name, c.cust_city, p.product_name;
The execution plan is as follows,
SELECT STATEMENT Optimizer=ALL_ROWS
  HASH JOIN
    HASH JOIN
      HASH JOIN
        TABLE ACCESS (FULL) OF ‘dept’
        TABLE ACCESS (BY INDEX ROWID) OF ‘sales’
          BITMAP CONVERSION (TO ROWIDS)
            BITMAP AND
              BITMAP MERGE
                BITMAP KEY ITERATION
                  TABLE ACCESS (FULL) OF ‘products’
                  BITMAP INDEX (RANGE SCAN) OF ‘sales_product_bx’
              BITMAP MERGE
                BITMAP KEY ITERATION
                  TABLE ACCESS (FULL) OF ‘dept’
                  BITMAP INDEX (RANGE SCAN) OF ‘sales_dept_bx’
              BITMAP MERGE
                BITMAP KEY ITERATION
                  TABLE ACCESS (BY INDEX ROWID) OF ‘customers’
                    INDEX (RANGE SCAN) OF ‘cust_grade_idx’
                  BITMAP CONVERSION (FROM ROWIDS)
                    INDEX (RANGE SCAN) OF ‘sales_cust_idx’
      TABLE ACCESS (FULL) OF ‘products’
    TABLE ACCESS (BY INDEX ROWID) OF ‘customers’
      INDEX (RANGE SCAN) OF ‘cust_state_province_idx’
Some preconditions should be met before the star transforming join can be used by the optimizer:
There must be one fact table and at least two dimension tables.
There should be bitmap indexes created on the join columns of the fact table.
There should be statistics gathered on the fact table.
The parameter “star_transformation_enabled” should be set to TRUE or TEMP_DISABLE, or the hint (STAR_TRANSFORMATION) should be used in the SQL statement.
Please note that if the SQL statement uses bind variables, the star transforming join will not be used by the optimizer, as the optimizer needs to know the statistics of the fact table; bind variables leave the optimizer with no idea of the statistics.
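A hedged sketch of enabling it (star_transformation_enabled and STAR_TRANSFORMATION are the standard Oracle parameter and hint; the query shown is illustrative, not from the original text):

```sql
-- Enable star transformation for the current session:
ALTER SESSION SET star_transformation_enabled = TRUE;

-- Or request it per statement with the hint (literals, not binds):
SELECT /*+ STAR_TRANSFORMATION */ d.dept_name, SUM(s.amount)
FROM sales s, dept d
WHERE s.sales_dept = d.dept_no
  AND d.location = 'SEOUL'
GROUP BY d.dept_name;
```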
4.3.8. Bitmap Join Index
The bitmap join index was created to improve the performance of the star transforming join. With a bitmap join index at hand, the star transforming join can get rid of the “BITMAP MERGE” operation.
Suppose we create one bitmap join index…
CREATE BITMAP INDEX sales_cust_job_bjix
ON sales (customers.cust_job)
FROM sales, customers
WHERE sales.cust_id = customers.cust_id
LOCAL NOLOGGING COMPUTE STATISTICS;
And the SQL execution plan will be like below…
SELECT STATEMENT
  SORT (GROUP BY)
    HASH JOIN
      TABLE ACCESS (FULL) OF ‘channels’
      TABLE ACCESS (BY LOCAL INDEX ROWID) OF ‘sales’
        BITMAP CONVERSION (TO ROWIDS)
          BITMAP AND
            BITMAP INDEX (SINGLE VALUE) OF ‘sales_cust_job_bjix’
            BITMAP MERGE
              BITMAP KEY ITERATION
                TABLE ACCESS (FULL) OF ‘products’
                BITMAP INDEX (RANGE SCAN) OF ‘sales_prod_bix’
            BITMAP MERGE
              BITMAP KEY ITERATION
                TABLE ACCESS (FULL) OF ‘dept’
                BITMAP INDEX (RANGE SCAN) OF ‘sales_dept_bix’