Further understand column group statistics in DB2
Leverage the extended use of multi-column statistics in DB2 9.5 to improve cardinality estimates

Skill Level: Intermediate

Samir Kapoor ([email protected])
DB2 Advanced Support Analyst
IBM Canada Ltd.

Vincent Corvinelli ([email protected])
DB2 Optimizer Developer
IBM Canada Ltd.

04 Sep 2008

© Copyright IBM Corporation 1994, 2008. All rights reserved.

With multi-column statistics in IBM® DB2® for Linux®, UNIX®, and Windows® (DB2), the optimizer can determine a better query access plan and improve query performance when there is correlation between multiple predicates. In this article, learn how to use multi-column statistics to take advantage of the enhancements to the optimizer in DB2 9.5 that extend their use to a broader range of predicates.

Introduction

The article "Understand column group statistics in DB2" (developerWorks, December 2006) describes the importance of collecting column group statistics and how the DB2 SQL Optimizer (referred to as optimizer hereafter) makes use of these multi-column statistics to detect a statistical correlation between two or more local or join equality predicates. In DB2 9.5, the optimizer further extended the use of multi-column statistics to a broader range of predicates.

The optimizer depends on accurate cardinality estimates to properly compute the cost of each query access plan considered. Cardinality estimation is a process by which the optimizer uses statistics to determine the size of partial query results after predicates are applied or aggregation is performed. At each operator in the access plan, the optimizer estimates the cardinality output from the operator. The application of one or more predicates may reduce the output stream cardinality.

It is common practice to assume the predicates are independent of each other when computing their combined filtering effect on the cardinality estimate. However, the predicates can be statistically correlated. Treating multiple predicates independently typically results in the optimizer under-estimating the cardinality. Under-estimating the cardinality could lead the optimizer to choose a sub-optimal access plan.
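To make the independence assumption concrete, the following minimal Python sketch compares an estimate obtained by multiplying individual selectivities with the actual row count on a small table. The rows, column meanings, and values are invented for illustration and are unrelated to the SAMPLE database; this is not how DB2 is implemented, only the arithmetic behind the idea.

```python
# Toy table of (city, country) pairs. The two columns are strongly
# correlated by construction. All rows are made up for illustration.
rows = (
    [("Toronto", "Canada")] * 30
    + [("Ottawa", "Canada")] * 20
    + [("Boston", "USA")] * 50
)

n = len(rows)

# Single-column selectivities, as single-column statistics would give them.
sel_city = sum(1 for city, _ in rows if city == "Toronto") / n
sel_country = sum(1 for _, country in rows if country == "Canada") / n

# Independence assumption: multiply the individual selectivities.
independent_estimate = n * sel_city * sel_country

# What the data actually contains for city = 'Toronto' AND country = 'Canada'.
actual = sum(
    1 for city, country in rows if city == "Toronto" and country == "Canada"
)

print(f"independent estimate: {independent_estimate:.1f}")
print(f"actual rows:          {actual}")
```

Because every Toronto row is also a Canada row, the second predicate filters nothing further, yet the independent estimate halves the count (15 estimated against 30 actual).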

The optimizer considers using multi-column statistics to detect a statistical correlation and estimate more accurately the combined filtering effect of multiple predicates. This article describes how it does so for SQL statements that apply at least two local IN, OR, and equality predicates, and how it estimates the filtering effect of predicates for SQL statements that apply some classes of OR predicates. "Understand column group statistics in DB2" describes how the optimizer makes use of multi-column statistics to detect a correlation between two or more local equality predicates, and for the join of two or more tables that apply at least two equality join predicates between the pair of tables. The RUNSTATS command options, as described in that article, are used in the same manner, so those command options are not described in this article.

Statistical correlation of multiple local equality and local IN predicates

If the WHERE clause of an SQL statement applies multiple predicates, as follows:

C1=? AND C2 IN ( ?, ?, ? )

and multi-column statistics on (C1, C2) are collected, then the optimizer attempts to detect a statistical correlation between the predicates in order to improve the cardinality estimates. This does not apply to:

• Join predicates with IN or OR operators

• Local predicates with inequality, LIKE, or IS NULL operators

• Predicates with subqueries

The C1=? predicate is an example of a local equality predicate, which is an equality predicate applied to a single table and is described as follows:

COLUMN = literal

where the literal can be any one of these:

• A constant value

• A parameter marker or host variable

• A special register (for example, CURRENT DATE)

The C2 IN ( ?, ?, ? ) predicate is an example of a local IN predicate, which is a predicate applied to the same single table that the equality predicate is applied to, and is described as follows:

COLUMN IN ( <VALUE LIST> )

where the <VALUE LIST> is a comma separated list of one or more literals, as described for the local equality predicate.

An OR predicate that is equivalent to an IN predicate can be specified in the SQL statement instead of the IN predicate, and the optimizer will treat it in the same manner when accounting for statistical correlation; that is,

COL IN ( literal_1, literal_2, ..., literal_n )

is equivalent to

COL=literal_1 OR COL=literal_2 OR ... OR COL=literal_n
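The equivalence can be checked mechanically. The sketch below is a hedged illustration in Python rather than SQL: `matches_in` and `matches_or` are hypothetical helper names, the job titles are illustrative values, and SQL NULL semantics are deliberately ignored.

```python
def matches_in(value, literals):
    """Models COL IN (literal_1, ..., literal_n)."""
    return value in literals


def matches_or(value, literals):
    """Models COL=literal_1 OR COL=literal_2 OR ... OR COL=literal_n."""
    return any(value == literal for literal in literals)


# Illustrative column values and value list (not read from any database).
jobs = ["CLERK", "SALESREP", "MANAGER", "DESIGNER", "ANALYST"]
value_list = ("CLERK", "SALESREP")

selected_by_in = [job for job in jobs if matches_in(job, value_list)]
selected_by_or = [job for job in jobs if matches_or(job, value_list)]

print(selected_by_in)
print(selected_by_or)
```

Both forms select the same rows, which is why a single set of column group statistics can serve either spelling of the predicate.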

The following are some examples for which the optimizer tries to detect a correlation between local IN, OR, and equality predicates:

a) COL_1 IN ( <VALUE LIST> ) AND COL_2=literal AND COL_3=literal
b) (COL_1=literal_1 OR COL_1=literal_2 OR ... OR COL_1=literal_n) AND COL_2=literal AND ... AND COL_m=literal
c) COL_1 IN ( <VALUE LIST> ) AND COL_2 IN ( <VALUE LIST> ) AND ... AND COL_m IN ( <VALUE LIST> )
d) (COL_1=literal_1 OR COL_1=literal_2) AND (COL_2=literal_1 OR COL_2=literal_2) AND ... AND (COL_m=literal_1 OR COL_m=literal_2)
e) COL_1 IN ( <VALUE LIST> ) AND ... AND COL_m IN ( <VALUE LIST> ) AND COL_1_2=literal AND ... AND COL_1_k=literal
f) (COL_1=literal_1 OR COL_1=literal_2) AND COL_2=literal AND COL_3=literal
g) (COL_1=literal_1 OR COL_1=literal_2) AND (COL_2=literal_1 OR COL_2=literal_2) AND COL_3=literal

The following are some examples of predicates that are not considered for statistical correlation detection by the optimizer:

a) (COL_1=literal AND COL_2=literal) OR (COL_1=literal AND COL_2=literal AND COL_3=literal)
b) ((COL_1=literal AND COL_2=literal) OR (COL_1=literal AND COL_2=literal)) AND COL_3=literal
c) ( COL_1 IN ( <VALUE LIST> ) OR COL_2 IN ( <VALUE LIST> ) ) AND COL_3=literal

Example 1: C1 IN ( <VALUE LIST> ) AND C2 = literal

Note: Please replace SKAPOOR with your own schema in all the examples described in this article.

These examples were tested in the following environment, using the SAMPLE database, which can be created by executing db2sampl:

Listing 1. Testing environment for samples

DB21085I  Instance "skapoor" uses "64" bits and DB2 code release "SQL09051"
with level identifier "03020107".
Informational tokens are "DB2 v9.5.0.1", "s080328", "U814639", and Fix Pack
"1".
Product is installed at "/home2/skapoor/sqllib".

Configuration: (as displayed by the db2exfmt tool)

Database Context:
----------------
        Parallelism:            None
        CPU Speed:              4.000000e-05
        Comm Speed:             100
        Buffer Pool size:       1000
        Sort Heap size:         256
        Database Heap size:     1200
        Lock List size:         100
        Maximum Lock List:      10
        Average Applications:   1
        Locks Available:        640

Package Context:
---------------
        SQL Type:               Dynamic
        Optimization Level:     5
        Blocking:               Block All Cursors
        Isolation Level:        Cursor Stability

STMTHEAP: (Statement heap size)
        6402

Consider the following query on the EMPLOYEE table in the SAMPLE database:

Listing 2. Query on the EMPLOYEE table in the SAMPLE database

SELECT FIRSTNME, LASTNAME, JOB, WORKDEPT, SALARY
FROM EMPLOYEE
WHERE JOB IN ('CLERK', 'SALESREP') AND
      WORKDEPT = 'A00'
ORDER BY JOB, SALARY

It returns four records from the EMPLOYEE table:

Listing 3. Records returned from the EMPLOYEE table

FIRSTNME     LASTNAME        JOB      WORKDEPT SALARY
------------ --------------- -------- -------- -----------
GREG         ORLANDO         CLERK    A00         39250.00
SEAN         O'CONNELL       CLERK    A00         49250.00
DIAN         HEMMINGER       SALESREP A00         46500.00
VINCENZO     LUCCHESSI       SALESREP A00         66500.00

4 record(s) selected.

The EXPLAIN tool, which requires the existence of the EXPLAIN tables, can be used to view the query access plan chosen by the optimizer. To create the EXPLAIN tables, execute:

db2 -tvf $DB2PATH/misc/EXPLAIN.DDL

When the SAMPLE database is initially created, statistics are not collected on the tables. To collect statistics on the EMPLOYEE table, the RUNSTATS tool can be used. The following RUNSTATS command collects statistics on each column, including distribution statistics, and detailed statistics on all indexes defined in the EMPLOYEE table, if any:

RUNSTATS ON TABLE SKAPOOR.EMPLOYEE
    WITH DISTRIBUTION AND DETAILED INDEXES ALL

Once the EXPLAIN tables are created and the statistics are collected, the SET CURRENT EXPLAIN MODE statement can be used to insert the query access plan details for one or more statements into the EXPLAIN tables, as follows:

Listing 4. Insert the query access plan details into the EXPLAIN tables

SET CURRENT EXPLAIN MODE EXPLAIN;

SELECT FIRSTNME, LASTNAME, JOB, WORKDEPT, SALARY
FROM EMPLOYEE
WHERE JOB IN ('CLERK', 'SALESREP') AND
      WORKDEPT = 'A00'
ORDER BY JOB, SALARY;

SET CURRENT EXPLAIN MODE NO;

The db2exfmt tool reads the data in the EXPLAIN tables, and formats the query access plan in a text file:

db2exfmt -d SAMPLE -1 -g -o exfmt_example1.out

The file exfmt_example1.out contains a query access plan similar to the following, with an estimated cardinality of 1:

Listing 5. Query access plan

          Rows
         RETURN
         (   1)
          Cost
           I/O
           |
         1.19048
         TBSCAN
         (   2)
         10.7902
            1
           |
         1.19048
         SORT
         (   3)
         10.7387
            1
           |
         1.19048
         FETCH
         (   4)
         10.6299
            1
        /---+---\
       5         42
    IXSCAN   TABLE: SKAPOOR
    (   5)     EMPLOYEE
    2.27828
       0
       |
      42
 INDEX: SKAPOOR
     XEMP2

The cardinality estimate of 1 does not match the actual result of 4. The optimizer assumes the two predicates are independent because relevant index or column group statistics do not exist. The RUNSTATS tool can be used to collect column group statistics on the group (JOB,WORKDEPT) to provide the optimizer with the appropriate information to detect a statistical correlation, if any, between the two columns:

RUNSTATS ON TABLE SKAPOOR.EMPLOYEE ON ALL COLUMNS
    AND COLUMNS ((JOB,WORKDEPT)) WITH DISTRIBUTION
    AND DETAILED INDEXES ALL

After repeating the above steps to explain the query again to generate the query access plan, the optimizer computes a better cardinality estimate as a result of collecting column group statistics on the two columns:

Listing 6. Query access plan, with better cardinality estimate

          Rows
         RETURN
         (   1)
          Cost
           I/O
           |
           5
         TBSCAN
         (   2)
         10.8458
            1
           |
           5
         SORT
         (   3)
         10.7944
            1
           |
           5
         FETCH
         (   4)
         10.6299
            1
        /---+---\
       5         42
    IXSCAN   TABLE: SKAPOOR
    (   5)     EMPLOYEE
    2.27828
       0
       |
      42
 INDEX: SKAPOOR
     XEMP2

The cardinality estimate is slightly higher than the actual value of 4 since the column group statistic is a uniform distribution statistic. You may have noticed that the query access plan itself did not change with the increase in cardinality estimate. The examples described in this article are simple in order to illustrate how to improve the cardinality estimate. Statements involving larger tables and joins of two or more tables are more likely to exhibit a change in query access plan as a result of the improved cardinality estimate.
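As a rough check on the first plan (Listing 5), the independent estimate of 1.19048 can be reproduced with simple arithmetic. The sketch below is only a reconstruction: the table cardinality of 42 and the 5 rows for WORKDEPT = 'A00' appear in the plan output, but the selectivity of 10/42 for the IN predicate is an assumption inferred from the numbers, not a value stated in the article.

```python
from fractions import Fraction

TABLE_CARD = 42  # EMPLOYEE row count shown in the plans

# WORKDEPT = 'A00': the IXSCAN in Listing 5 outputs 5 of 42 rows.
sel_workdept = Fraction(5, 42)

# JOB IN ('CLERK', 'SALESREP'): ASSUMED to select 10 of 42 rows.
# This value is inferred because it reproduces the plan estimate.
sel_job_in = Fraction(10, 42)

# Independence assumption: multiply the two selectivities.
independent_estimate = TABLE_CARD * sel_job_in * sel_workdept

print(float(independent_estimate))
```

Under these assumptions the product is 50/42, which agrees with the 1.19048 reported at the FETCH operator before column group statistics were collected.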

Example 2: C1 IN ( <VALUE LIST> ) AND C2 IN ( <VALUE LIST> )

This example illustrates the effect of column group statistics on two IN predicates. Consider the following query that retrieves the bonus and salaries for managers and designers in certain departments:

Listing 7. Bonus and salaries query

SELECT FIRSTNME, LASTNAME, WORKDEPT, JOB, BONUS, SALARY
FROM EMPLOYEE
WHERE WORKDEPT IN ('D11','D21') AND
      JOB IN ('MANAGER','DESIGNER')
ORDER BY WORKDEPT, SALARY

This query returns 12 records from the EMPLOYEE table:

Listing 8. Records returned from the EMPLOYEE table

FIRSTNME     LASTNAME        WORKDEPT JOB      BONUS       SALARY
------------ --------------- -------- -------- ----------- -----------
MASATOSHI    YOSHIMURA       D11      DESIGNER      500.00    44680.00
JENNIFER     LUTZ            D11      DESIGNER      600.00    49840.00
JAMES        WALKER          D11      DESIGNER      400.00    50450.00
MARILYN      SCOUTTEN        D11      DESIGNER      500.00    51340.00
BRUCE        ADAMSON         D11      DESIGNER      500.00    55280.00
DAVID        BROWN           D11      DESIGNER      600.00    57740.00
ELIZABETH    PIANKA          D11      DESIGNER      400.00    62250.00
KIYOSHI      YAMAMOTO        D11      DESIGNER      500.00    64680.00
WILLIAM      JONES           D11      DESIGNER      400.00    68270.00
REBA         JOHN            D11      DESIGNER      600.00    69840.00
IRVING       STERN           D11      MANAGER       500.00    72250.00
EVA          PULASKI         D21      MANAGER       700.00    96170.00

12 record(s) selected.

First, examine the query access plan and cardinality estimates without the column group statistic on (JOB,WORKDEPT). This is accomplished by executing another RUNSTATS command on the EMPLOYEE table as follows:

RUNSTATS ON TABLE SKAPOOR.EMPLOYEE
    WITH DISTRIBUTION AND DETAILED INDEXES ALL

The previous statistics collected are cleared by the latest RUNSTATS command, so the column group statistics collected earlier are no longer kept. Generating the query access plan using EXPLAIN and the db2exfmt tool, as in Example 1, you can examine the estimated cardinality by the optimizer:

Listing 9. Insert the query access plan details into the EXPLAIN tables

SET CURRENT EXPLAIN MODE EXPLAIN;

SELECT FIRSTNME, LASTNAME, WORKDEPT, JOB, BONUS, SALARY
FROM EMPLOYEE
WHERE WORKDEPT IN ('D11','D21') AND
      JOB IN ('MANAGER','DESIGNER')
ORDER BY WORKDEPT, SALARY;

SET CURRENT EXPLAIN MODE NO;

db2exfmt -d SAMPLE -1 -g -o exfmt_example2.out

The file exfmt_example2.out should contain a query access plan similar to the following, with an estimated cardinality of 7:

Listing 10. Query access plan

                  Rows
                 RETURN
                 (   1)
                  Cost
                   I/O
                   |
                 7.28572
                 TBSCAN
                 (   2)
                 13.7066
                    1
                   |
                 7.28572
                 SORT
                 (   3)
                 13.5723
                    1
                   |
                 7.28572
                 NLJOIN
                 (   4)
                 13.1318
                    1
            /------+------\
           2              3.64286
        TBSCAN            FETCH
        (   5)            (   6)
         0.006            11.0934
           0                 1
           |             /---+---\
           2            9         42
    TABFNC: SYSIBM   IXSCAN   TABLE: SKAPOOR
        GENROW       (   7)     EMPLOYEE
                     2.55364
                        0
                        |
                       42
                  INDEX: SKAPOOR
                      XEMP2

In the query access plan shown in Listing 10, notice a join between the table EMPLOYEE and a table function, GENROW. When an IN predicate (or an equivalent OR predicate) is used, the optimizer considers an IN-to-JOIN transformation, converting the IN predicate to a join predicate. The GENROW table function produces the values listed in the <VALUE LIST> of the IN predicate. When the IN predicate is used in its join form, the optimizer still considers it for statistical correlation detection.
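The idea behind the IN-to-JOIN transformation can be illustrated outside the database. The following Python sketch uses made-up rows (not the SAMPLE data) and models only the logical equivalence, with the value list playing the role that GENROW plays in the plan; it says nothing about how DB2 executes the join, and it assumes the value list holds distinct values.

```python
# Hypothetical rows standing in for a table; names are placeholders.
employees = [
    {"name": "A", "workdept": "D11"},
    {"name": "B", "workdept": "D21"},
    {"name": "C", "workdept": "E01"},
    {"name": "D", "workdept": "D11"},
]
value_list = ["D11", "D21"]  # assumed to contain no duplicates

# Filter form: WORKDEPT IN ('D11', 'D21')
filtered = [e["name"] for e in employees if e["workdept"] in value_list]

# Join form: the value list is the outer input, and each value probes
# the employee rows with an equality predicate (a nested-loop join).
joined = [
    e["name"]
    for value in value_list
    for e in employees
    if e["workdept"] == value
]

print(sorted(filtered))
print(sorted(joined))
```

Both forms return the same set of rows; only the order in which they are produced differs, which is why the plan still ends with a SORT to satisfy the ORDER BY.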

The cardinality estimate of 7 does not match the actual result of 12. As in Example 1, collecting column group statistics on the columns (JOB,WORKDEPT) provides the necessary information for the optimizer to account for a statistical correlation when computing the combined filtering effect of the two IN predicates:

RUNSTATS ON TABLE SKAPOOR.EMPLOYEE
    ON ALL COLUMNS AND COLUMNS ((JOB,WORKDEPT))
    WITH DISTRIBUTION AND DETAILED INDEXES ALL

After repeating the above steps to explain the query again to generate the query access plan, the optimizer computes a better cardinality estimate that is very close to the actual result:

Listing 11. Query access plan with more accurate cardinality estimate

                  Rows
                 RETURN
                 (   1)
                  Cost
                   I/O
                   |
                  11.2
                 TBSCAN
                 (   2)
                 13.9768
                    1
                   |
                  11.2
                 SORT
                 (   3)
                 13.8033
                    1
                   |
                  11.2
                 NLJOIN
                 (   4)
                 13.1318
                    1
            /------+------\
           2                5.6
        TBSCAN            FETCH
        (   5)            (   6)
         0.006            11.0934
           0                 1
           |             /---+---\
           2            9         42
    TABFNC: SYSIBM   IXSCAN   TABLE: SKAPOOR
        GENROW       (   7)     EMPLOYEE
                     2.55364
                        0
                        |
                       42
                  INDEX: SKAPOOR
                      XEMP2
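The two estimates in Listings 10 and 11 can be reproduced, at least arithmetically, under a few assumptions. In the sketch below only the table cardinality of 42 and the 9 rows per probed WORKDEPT value come from the plan output. The selectivity of 17/42 for the JOB predicate and the figure of 15 distinct (JOB,WORKDEPT) combinations are assumptions chosen because they reproduce the listed numbers; the article does not state them.

```python
from fractions import Fraction

TABLE_CARD = 42  # EMPLOYEE row count shown in the plans

# WORKDEPT IN ('D11','D21'): the IXSCAN returns 9 rows for each of 2 values.
sel_workdept_in = Fraction(2 * 9, TABLE_CARD)

# JOB IN ('MANAGER','DESIGNER'): ASSUMED to select 17 of 42 rows.
sel_job_in = Fraction(17, TABLE_CARD)

# Without column group statistics: independence assumption.
independent_estimate = TABLE_CARD * sel_workdept_in * sel_job_in

# With column group statistics: uniform distribution over the distinct
# (JOB, WORKDEPT) combinations. The count of 15 is an ASSUMPTION.
distinct_groups = 15
combinations = 2 * 2  # two JOB values times two WORKDEPT values
group_estimate = combinations * Fraction(TABLE_CARD, distinct_groups)

print(float(independent_estimate))
print(float(group_estimate))
```

Read this way, each requested (JOB,WORKDEPT) combination is credited with 42/15 = 2.8 rows, and four combinations give the 11.2 shown at the NLJOIN operator.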

Example 3: C1 IN ( <VALUE LIST> ) AND C2 IN ( <VALUE LIST> ) AND C3=literal

In this example, you add a third predicate to the query in Example 2 to determine which employees received a bonus of $500:

Listing 12. Add a third predicate to find $500 bonus

SELECT FIRSTNME, LASTNAME, WORKDEPT, JOB, BONUS, SALARY
FROM EMPLOYEE
WHERE WORKDEPT IN ('D11','D21') AND
      JOB IN ('MANAGER','DESIGNER') AND
      BONUS = 500
ORDER BY WORKDEPT, SALARY

This query returns five records from the EMPLOYEE table:

Listing 13. Records returned from EMPLOYEE table

FIRSTNME     LASTNAME        WORKDEPT JOB      BONUS       SALARY
------------ --------------- -------- -------- ----------- -----------
MASATOSHI    YOSHIMURA       D11      DESIGNER      500.00    44680.00
MARILYN      SCOUTTEN        D11      DESIGNER      500.00    51340.00
BRUCE        ADAMSON         D11      DESIGNER      500.00    55280.00
KIYOSHI      YAMAMOTO        D11      DESIGNER      500.00    64680.00
IRVING       STERN           D11      MANAGER       500.00    72250.00

5 record(s) selected.

If you re-collect the statistics without the column group statistics using:

RUNSTATS ON TABLE SKAPOOR.EMPLOYEE
    WITH DISTRIBUTION AND DETAILED INDEXES ALL

the optimizer chooses a query access plan similar to the following, with a cardinality estimate of 2:

Listing 14. Query access plan

                  Rows
                 RETURN
                 (   1)
                  Cost
                   I/O
                   |
                 2.42857
                 TBSCAN
                 (   2)
                 13.8494
                    1
                   |
                 2.42857
                 SORT
                 (   3)
                 13.7636
                    1
                   |
                 2.42857
                 NLJOIN
                 (   4)
                 13.5765
                    1
            /------+------\
           2              1.21429
        TBSCAN            FETCH
        (   5)            (   6)
         0.006            11.3158
           0                 1
           |             /---+---\
           2            9         42
    TABFNC: SYSIBM   IXSCAN   TABLE: SKAPOOR
        GENROW       (   7)     EMPLOYEE
                     2.55364
                        0
                        |
                       42
                  INDEX: SKAPOOR
                      XEMP2

With three predicates applied in the WHERE clause, assuming they are independent results in the optimizer underestimating the cardinality. To illustrate how the optimizer can use index statistics, as well as column group statistics, to detect a statistical correlation, create an index covering the three columns (JOB,WORKDEPT,BONUS) that are referenced in the predicates, and collect statistics:

Listing 15. Create index and collect statistics

CREATE INDEX JOB_DEPT_BONUS ON EMPLOYEE (JOB,WORKDEPT,BONUS)

-- The RUNSTATS command provides the option to collect statistics on a set of
-- indexes only, without affecting the statistics previously collected.
RUNSTATS ON TABLE SKAPOOR.EMPLOYEE FOR DETAILED INDEXES SKAPOOR.JOB_DEPT_BONUS

With the new index created, and statistics collected on it, the optimizer corrects the cardinality estimate of the query access plan:

Listing 16. A corrected cardinality estimate from the query access plan

                  Rows
                 RETURN
                 (   1)
                  Cost
                   I/O
                   |
                  5.25
                 TBSCAN
                 (   2)
                 13.5227
                    1
                   |
                  5.25
                 SORT
                 (   3)
                 13.4087
                    1
                   |
                  5.25
                 NLJOIN
                 (   4)
                 13.0875
                    1
            /------+------\
           2               2.625
        TBSCAN            FETCH
        (   5)            (   6)
         0.006            11.0713
           0                 1
           |             /---+---\
           2          2.625       42
    TABFNC: SYSIBM   IXSCAN   TABLE: SKAPOOR
        GENROW       (   7)     EMPLOYEE
                     2.85933
                        0
                        |
                       42
                  INDEX: SKAPOOR
                  JOB_DEPT_BONUS
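The estimates in Listings 14 and 16 are also consistent with simple arithmetic, given some assumptions. In the sketch below the selectivities 18/42 and 17/42 are the values that reproduce the independent estimate of Example 2, the selectivity of 1/3 for BONUS = 500 and the count of 32 distinct (JOB,WORKDEPT,BONUS) index keys are assumptions inferred from the plan numbers, and none of these values is stated in the article.

```python
from fractions import Fraction

TABLE_CARD = 42  # EMPLOYEE row count shown in the plans

# ASSUMED single-predicate selectivities (inferred from plan output).
sel_workdept_in = Fraction(18, 42)  # WORKDEPT IN ('D11','D21')
sel_job_in = Fraction(17, 42)       # JOB IN ('MANAGER','DESIGNER')
sel_bonus = Fraction(1, 3)          # BONUS = 500

# Listing 14: all three predicates treated as independent.
independent_estimate = TABLE_CARD * sel_workdept_in * sel_job_in * sel_bonus

# Listing 16: uniform distribution over the distinct index keys.
# The count of 32 distinct keys is an ASSUMPTION.
distinct_index_keys = 32
combinations = 2 * 2 * 1  # two JOB values, two WORKDEPT values, one BONUS
index_estimate = combinations * Fraction(TABLE_CARD, distinct_index_keys)

print(float(independent_estimate))
print(float(index_estimate))
```

Under these assumptions the independent estimate is 17/7 (about 2.43) and the index-based estimate is 5.25, against an actual result of five rows.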

Example 4: (C1=literal OR C1=literal2) AND (C2=literal OR C2=literal2) AND C3=literal

This example is equivalent to Example 3, using equivalent OR predicates to replace the IN predicates:

Listing 17. Equivalent OR predicates to replace the IN predicates

SELECT FIRSTNME, LASTNAME, WORKDEPT, JOB, BONUS, SALARY
FROM EMPLOYEE
WHERE (WORKDEPT = 'D11' OR WORKDEPT = 'D21') AND
      (JOB = 'MANAGER' OR JOB = 'DESIGNER') AND
      BONUS = 500
ORDER BY WORKDEPT, SALARY

This query returns the same result set as in Example 3. This example illustrates the effect that partial statistics have on the ability of the optimizer to estimate the cardinality. Drop the index created in Example 3 and re-collect the statistics with column group statistics on the group ((JOB,WORKDEPT)) only:

DROP INDEX JOB_DEPT_BONUS

RUNSTATS ON TABLE SKAPOOR.EMPLOYEE
    ON ALL COLUMNS AND COLUMNS ((JOB,WORKDEPT))
    WITH DISTRIBUTION AND DETAILED INDEXES ALL

With column group statistics collected on a subset of the columns referenced by the eligible IN, OR, and equality predicates, the optimizer estimates a cardinality that is close to the actual result, but not as accurate as in Example 3, where statistics were collected on an index covering all three columns:

Listing 18. Query access plan

                  Rows
                 RETURN
                 (   1)
                  Cost
                   I/O
                   |
                 3.73333
                 TBSCAN
                 (   2)
                 13.9174
                    1
                   |
                 3.73333
                 SORT
                 (   3)
                 13.8186
                    1
                   |
                 3.73333
                 NLJOIN
                 (   4)
                 13.5765
                    1
            /------+------\
           2              1.86667
        TBSCAN            FETCH
        (   5)            (   6)
         0.006            11.3158
           0                 1
           |             /---+---\
           2            9         42
    TABFNC: SYSIBM   IXSCAN   TABLE: SKAPOOR
        GENROW       (   7)     EMPLOYEE
                     2.55364
                        0
                        |
                       42
                  INDEX: SKAPOOR
                      XEMP2

The optimizer used the column group statistic on (JOB,WORKDEPT) to account for a statistical correlation between the two OR predicates, but without including BONUS in the column group, it considered the BONUS=500 predicate as independent of the two OR predicates, resulting in the slightly underestimated final cardinality.
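The mix of correlated and independent treatment described above can be written out as arithmetic. In this sketch the 11.2 is the column group estimate shown in Listing 11 for the two-column predicates, while the selectivity of 1/3 for BONUS = 500 is an assumption inferred from the plan numbers rather than a value given in the article.

```python
from fractions import Fraction

# Estimate for the two OR predicates using the (JOB, WORKDEPT)
# column group statistic, as shown in Listing 11.
group_estimate = Fraction(112, 10)

# BONUS = 500 is not covered by the column group, so it is applied
# independently. The selectivity of 1/3 is an ASSUMPTION.
sel_bonus = Fraction(1, 3)

partial_estimate = group_estimate * sel_bonus

print(float(partial_estimate))
```

The result, 56/15 or roughly 3.73, matches the cardinality in Listing 18 and falls between the fully independent estimate of Example 3 and the actual result of five rows.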

Note: If you analyze the Optimized Statement section of the db2exfmt output for the above query, you may notice that the OR predicates were converted to their equivalent IN predicates:

Listing 19. OR predicates converted to their equivalent IN predicates

Optimized Statement:
-------------------
SELECT Q5.FIRSTNME AS "FIRSTNME", Q5.LASTNAME AS "LASTNAME", Q5.WORKDEPT AS
        "WORKDEPT", Q5.JOB AS "JOB", +0000500.00 AS "BONUS", Q5.SALARY AS
        "SALARY"
FROM SKAPOOR.EMPLOYEE AS Q5
WHERE (Q5.BONUS = +0000500.00) AND Q5.JOB IN ('MANAGER ', 'DESIGNER') AND
        Q5.WORKDEPT IN ('D11', 'D21')
ORDER BY Q5.WORKDEPT, Q5.SALARY

Collecting column group statistics on all three columns results in the same cardinality estimate as in Example 3. In this case, you still collect the column group statistics on the previous two columns (JOB,WORKDEPT) and include the full set of three columns (JOB,WORKDEPT,BONUS):

RUNSTATS ON TABLE SKAPOOR.EMPLOYEE
    ON ALL COLUMNS AND COLUMNS ((JOB,WORKDEPT), (JOB,WORKDEPT,BONUS))
    WITH DISTRIBUTION AND DETAILED INDEXES ALL

As described in "Understand column group statistics in DB2", you can gather one or more column group statistics between the same sets of columns. The query access plan produced after collecting these statistics is the same as the final plan in Example 3. It is left as an exercise for you to verify this is the case.

Example 5: Index oring

This example illustrates how collecting column group statistics can also improve the cardinality estimate of index oring access plans. Consider the following query on the EMPLOYEE table that retrieves all clerks and sales representatives that belong to department A00:

Listing 20. Query on the EMPLOYEE table that retrieves all clerks and sales representatives that belong to department A00

SELECT FIRSTNME, LASTNAME, JOB, WORKDEPT, SALARY
FROM EMPLOYEE
WHERE JOB IN ('CLERK', 'SALESREP') AND
      WORKDEPT='A00'
ORDER BY JOB, SALARY

This query returns four records from the EMPLOYEE table:

Listing 21. Query returns four records

FIRSTNME     LASTNAME        JOB      WORKDEPT SALARY
------------ --------------- -------- -------- -----------
GREG         ORLANDO         CLERK    A00         39250.00
SEAN         O'CONNELL       CLERK    A00         49250.00
DIAN         HEMMINGER       SALESREP A00         46500.00
VINCENZO     LUCCHESSI       SALESREP A00         66500.00

4 record(s) selected.

To better illustrate the improvement in cardinality estimation, drop all the existing indexes on the EMPLOYEE table except the primary key index:

DROP INDEX XEMP2

and create the following index that includes both columns referenced by predicates in the WHERE clause of the above query, separated by the SALARY column:

CREATE INDEX IND2 ON EMPLOYEE (JOB,SALARY,WORKDEPT)

Statistics are re-collected on the EMPLOYEE table and its new and remaining indexes:

RUNSTATS ON TABLE SKAPOOR.EMPLOYEE
    WITH DISTRIBUTION AND DETAILED INDEXES ALL

In this example, the optimizer is forced to choose an index oring table access operation by using the optimization profile feature. To do so, the optimization profile includes two optimizer guidelines:

1. A guideline to disable the transformation of the IN predicate to a join

2. A guideline to force the optimizer to choose the index oring operation to access the EMPLOYEE table

The first step in creating the optimization profile is to create an XML file, called example5.xml, that contains the following contents:

Listing 22. XML file contents

<?xml version="1.0" encoding="UTF-8"?>

<OPTPROFILE VERSION="9.5.1">
  <STMTPROFILE ID="Example 5 Index oring test">
    <STMTKEY>
      <![CDATA[SELECT FIRSTNME, LASTNAME, JOB, WORKDEPT, SALARY
FROM EMPLOYEE
WHERE JOB IN ('CLERK', 'SALESREP') AND
      WORKDEPT='A00'
ORDER BY JOB, SALARY]]>
    </STMTKEY>
    <OPTGUIDELINES>
      <INLIST2JOIN OPTION="DISABLE" TABLE="EMPLOYEE" COLUMN="JOB"/>
      <IXOR TABLE="EMPLOYEE" INDEX="IND2"/>
    </OPTGUIDELINES>
  </STMTPROFILE>
</OPTPROFILE>

The second step involves creating a del file, called example5.del, that contains the following contents:

"SKAPOOR","IXORPLAN","example5.xml"

where SKAPOOR is the schema for the profile, IXORPLAN is the name you associated to the profile, and example5.xml is the XML file created in the first step, which contains the contents describing the profile.

The third step requires placing both the example5.xml and the example5.del files in the same location and issuing the following commands:

Listing 23. Commands to use with example5.xml and example5.del

-- Create the OPT_PROFILE table, if it does not already exist
CREATE TABLE SYSTOOLS.OPT_PROFILE (
    SCHEMA  VARCHAR(128) NOT NULL,
    NAME    VARCHAR(128) NOT NULL,
    PROFILE BLOB (2M)    NOT NULL,
    PRIMARY KEY ( SCHEMA, NAME )
);

-- Add an entry to OPT_PROFILE table for our index-oring guideline
IMPORT FROM example5.del OF DEL
    MODIFIED BY LOBSINFILE
    INSERT INTO SYSTOOLS.OPT_PROFILE;

To view the query access plan using the optimization profile created, the SET CURRENT OPTIMIZATION PROFILE statement can be used in combination with the SET CURRENT EXPLAIN MODE statement, as follows:

Listing 24. View the query access plan using the optimization profile

-- use the IXORPLAN profile
SET CURRENT OPTIMIZATION PROFILE="IXORPLAN";

SET CURRENT EXPLAIN MODE EXPLAIN;

SELECT FIRSTNME, LASTNAME, JOB, WORKDEPT, SALARY
FROM EMPLOYEE
WHERE JOB IN ('CLERK', 'SALESREP') AND
      WORKDEPT='A00'
ORDER BY JOB, SALARY;

SET CURRENT EXPLAIN MODE NO;

A query access plan similar to the following is chosen by the optimizer:

Listing 25. Query access plan

                 Rows
                RETURN
                (   1)
                 Cost
                  I/O
                  |
                1.19048
                TBSCAN
                (   2)
                13.0404
               0.963719
                  |
                1.19048
                SORT
                (   3)
                12.967
               0.963719
                  |
                1.19048
                FETCH
                (   4)
                12.8278
               0.963719
               /---+---\
        1.19048         42
        RIDSCN    TABLE: SKAPOOR
        (   5)       EMPLOYEE
        4.89404
           0
     /-----+-----\
 0.952381       0.238095
   SORT           SORT
  (   6)         (   8)
  2.7352         2.21536
     0              0
     |              |
 0.952381       0.238095
  IXSCAN         IXSCAN
  (   7)         (   9)
  2.6272         2.10736
     0              0
     |              |
    42             42
INDEX: SKAPOOR INDEX: SKAPOOR
    IND2           IND2

If the generated query access plan is not the index oring plan shown above, then there is a problem with your optimization profile setup. In the db2exfmt output, the following is seen if the optimizer used the optimization profile:

Profile Information:
--------------------
OPT_PROF: (Optimization Profile Name)
        SKAPOOR.IXORPLAN
STMTPROF: (Statement Profile Name)
        Example 5 Index oring test

It is left as an exercise to the reader to determine the appropriate method to collect a column group statistic on the columns (JOB,WORKDEPT). Once the column group statistic is collected, the query access plan displays improved cardinality estimates:

Listing 26. Improved cardinality estimates in query access plan

                 Rows
                RETURN
                (   1)
                 Cost
                  I/O
                  |
                  5
                TBSCAN
                (   2)
                13.878
                   1
                  |
                  5
                SORT
                (   3)
                13.7665
                   1
                  |
                  5
                FETCH
                (   4)
                13.4746
                   1
               /---+---\
           5            42
        RIDSCN    TABLE: SKAPOOR
        (   5)       EMPLOYEE
        4.89404
           0
     /-----+-----\
     4              1
   SORT           SORT
  (   6)         (   8)
  2.7352         2.21536
     0              0
     |              |
     4              1
  IXSCAN         IXSCAN
  (   7)         (   9)
  2.6272         2.10736
     0              0
     |              |
    42             42
INDEX: SKAPOOR INDEX: SKAPOOR
    IND2           IND2

At each IXSCAN operator, the cardinality is corrected to account for the correlation between the predicates:

• JOB='CLERK' AND WORKDEPT='A00'

• JOB='SALESREP' AND WORKDEPT='A00'

The cardinality is also corrected at the RIDSCN and FETCH operators, which accounts for the statistical correlation between the IN and equality predicates.
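For readers who want to check their answer to the exercise, one statement that should collect the column group statistic is sketched below. It is an illustration rather than the authors' own command: it assumes the table lives in the SKAPOOR schema shown in the plans and that distribution and detailed index statistics are wanted alongside the column group.

```sql
-- Collect basic and distribution statistics on all columns, plus one
-- column group statistic on the pair (JOB, WORKDEPT), and detailed
-- statistics on all indexes.
RUNSTATS ON TABLE SKAPOOR.EMPLOYEE
  ON ALL COLUMNS AND COLUMNS ((JOB, WORKDEPT))
  WITH DISTRIBUTION AND DETAILED INDEXES ALL;
```

The double parentheses matter: the inner pair groups JOB and WORKDEPT into a single column group, whereas a single pair would list them as two separate columns.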

Statistical correlation of multiple local equality predicates within subterms of OR operators

If the WHERE clause of an SQL statement applies OR operators with multiple local predicates within each subterm, as follows:

(C1=literal_1 AND C2=literal_2) OR
(C1=literal_3 AND C2=literal_4) OR
(C1=literal_5 AND C2=literal_6)

and multi-column statistics on (C1,C2) are collected, then the optimizer will attempt to detect a statistical correlation between the predicates in order to improve the filtering effect of the OR predicate. In this article, the above OR operators are described as a single OR predicate with three subterms:

1. (C1=literal_1 AND C2=literal_2)

2. (C1=literal_3 AND C2=literal_4)

3. (C1=literal_5 AND C2=literal_6)

This does not apply if the OR predicate contains any of the following:

• Non-local equality predicates in any of the subterms

• Different sets of columns referenced in two or more subterms

The following is an example of a predicate for which the optimizer tries to detect a correlation between the local equality predicates in the subterms of an OR predicate:

a) (COL_1=literal_1 AND COL_2=literal_2) OR
   (COL_1=literal_3 AND COL_2=literal_4) OR
   ... OR
   (COL_1=literal_n AND COL_2=literal_m)

The following are some examples of predicates that are not considered for statistical correlation detection by the optimizer:

a) (COL_1=literal_1 AND COL_2=literal_2) OR
   (COL_1=literal_3 AND COL_2=literal_4 AND COL_3=literal_5)

b) (COL_1=literal_1 AND COL_2=literal_2) OR
   (COL_1=literal_3 AND COL_2=literal_4) OR
   (COL_1=literal_5 AND COL_2=literal_6 AND COL_3=literal_7)
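To make the patterns above concrete, here is a sketch of one qualifying and one non-qualifying query against the sample EMPLOYEE table. The columns EMPNO and EDLEVEL and the literal 16 are illustrative assumptions drawn from the standard SAMPLE database layout; they do not appear in the examples in this article.

```sql
-- Qualifies: every subterm applies local equality predicates on the
-- same two columns, WORKDEPT and JOB.
SELECT EMPNO
FROM SKAPOOR.EMPLOYEE
WHERE (WORKDEPT = 'E21' AND JOB = 'FIELDREP') OR
      (WORKDEPT = 'D21' AND JOB = 'MANAGER');

-- Does not qualify (same shape as pattern a): the second subterm also
-- references EDLEVEL, so the subterms use different sets of columns.
SELECT EMPNO
FROM SKAPOOR.EMPLOYEE
WHERE (WORKDEPT = 'E21' AND JOB = 'FIELDREP') OR
      (WORKDEPT = 'D21' AND JOB = 'MANAGER' AND EDLEVEL = 16);
```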

Example 6: (C1=LITERAL1 AND C2=LITERAL2) OR (C1=LITERAL3 AND C2=LITERAL4)

This example illustrates the effect of column group statistics on a qualifying OR predicate. Consider the following query on the EMPLOYEE table:

Listing 27. Query on the EMPLOYEE table

SELECT FIRSTNME, LASTNAME, WORKDEPT, JOB, BONUS, SALARY
FROM EMPLOYEE
WHERE ( WORKDEPT='E21' AND JOB='FIELDREP' ) OR
      ( WORKDEPT='D21' AND JOB='MANAGER' )
ORDER BY WORKDEPT, SALARY

This query returns six records from the EMPLOYEE table:

Listing 28. Query results from the EMPLOYEE table

FIRSTNME     LASTNAME        WORKDEPT JOB      BONUS       SALARY
------------ --------------- -------- -------- ----------- -----------
EVA          PULASKI         D21      MANAGER       700.00    96170.00
ROY          ALONZO          E21      FIELDREP      500.00    31840.00
HELENA       WONG            E21      FIELDREP      500.00    35370.00
RAMLAL       MEHTA           E21      FIELDREP      400.00    39950.00
JASON        GOUNOT          E21      FIELDREP      500.00    43840.00
WING         LEE             E21      FIELDREP      500.00    45370.00

6 record(s) selected.

If you re-collect the statistics without the column group statistics using:

RUNSTATS ON TABLE SKAPOOR.EMPLOYEE
WITH DISTRIBUTION AND DETAILED INDEXES ALL

a query access plan similar to the following is chosen by the optimizer, with a cardinality estimate under 2:

Listing 29. Query access plan similar to the one chosen by the optimizer

     Rows
    RETURN
    (   1)
     Cost
      I/O
      |
    1.88095
    TBSCAN
    (   2)
    16.1786
      1
      |
    1.88095
     SORT
    (   3)
    16.1272
      1
      |
    1.88095
    TBSCAN
    (   4)
    16.0113
      1
      |
      42
TABLE: SKAPOOR
   EMPLOYEE
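The estimate of 1.88095 in Listing 29 can be compared with the real row count, and it is consistent with an estimate built from single-column counts under an independence assumption. The queries below are a sketch for checking both. The schema SKAPOOR comes from the plan, and the per-column counts quoted afterwards are assumed values for the 42-row sample table rather than figures stated in this article.

```sql
-- Actual number of qualifying rows, for comparison with the estimate.
SELECT COUNT(*) AS ACTUAL_ROWS
FROM SKAPOOR.EMPLOYEE
WHERE (WORKDEPT = 'E21' AND JOB = 'FIELDREP') OR
      (WORKDEPT = 'D21' AND JOB = 'MANAGER');

-- Single-predicate row counts that feed an independence-based estimate.
SELECT
  SUM(CASE WHEN WORKDEPT = 'E21' THEN 1 ELSE 0 END) AS N_E21,
  SUM(CASE WHEN JOB = 'FIELDREP' THEN 1 ELSE 0 END) AS N_FIELDREP,
  SUM(CASE WHEN WORKDEPT = 'D21' THEN 1 ELSE 0 END) AS N_D21,
  SUM(CASE WHEN JOB = 'MANAGER'  THEN 1 ELSE 0 END) AS N_MANAGER,
  COUNT(*)                                          AS N_ROWS
FROM SKAPOOR.EMPLOYEE;
```

The first query should return 6, matching Listing 28. If the second returns 6, 5, 7, 7, and 42 (assumed), then the two subterms contribute 6 x 5 / 42 and 7 x 7 / 42 rows, and (30 + 49) / 42 = 1.88095, which is the figure in the plan.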

Collecting a column group statistic on the columns (JOB,WORKDEPT) allows the optimizer to better estimate the filtering effect of the OR predicate, since each subterm of the OR predicate applies a set of local equality predicates on the columns JOB and WORKDEPT. It is left as an exercise for you to determine the appropriate RUNSTATS statement to collect a column group statistic. Once collected, a query access plan similar to the following is chosen by the optimizer, with an improved cardinality estimate that is very close to the actual result of six rows:

Listing 30. Query access plan with more accurate cardinality estimate

     Rows
    RETURN
    (   1)
     Cost
      I/O
      |
     5.6
    TBSCAN
    (   2)
    16.2651
      1
      |
     5.6
     SORT
    (   3)
    16.2136
      1
      |
     5.6
    TBSCAN
    (   4)
    16.0113
      1
      |
      42
TABLE: SKAPOOR
   EMPLOYEE
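One way to see where an estimate such as 5.6 could come from is to look at the column group cardinality, that is, the number of distinct (JOB, WORKDEPT) combinations. The sketch below counts the combinations directly and then reads what RUNSTATS recorded in the catalog. The catalog view and column names (SYSCAT.COLGROUPS, SYSCAT.COLGROUPCOLS, COLGROUPCARD) are given from memory and should be checked against the catalog documentation for your release.

```sql
-- Number of distinct (JOB, WORKDEPT) combinations actually present.
SELECT COUNT(*) AS DISTINCT_PAIRS
FROM (SELECT DISTINCT JOB, WORKDEPT FROM SKAPOOR.EMPLOYEE) AS T;

-- Column group statistics recorded in the catalog for the table,
-- one row per column in each group.
SELECT G.COLGROUPID, G.COLGROUPCARD, C.ORDINAL, C.COLNAME
FROM SYSCAT.COLGROUPS G
JOIN SYSCAT.COLGROUPCOLS C
  ON C.COLGROUPID = G.COLGROUPID
WHERE C.TABSCHEMA = 'SKAPOOR'
  AND C.TABNAME = 'EMPLOYEE'
ORDER BY G.COLGROUPID, C.ORDINAL;
```

If both report 15 combinations (an assumed value for the sample data), spreading 42 rows uniformly over them gives 42 / 15 = 2.8 rows per subterm, and two subterms give 5.6. This is an inference from the numbers in the listing, not a statement of the optimizer's documented formula.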

Conclusion

The optimizer depends on accurate cardinality estimates to properly compute the cost of each query access plan considered. You can leverage the extended use of multi-column statistics in DB2 9.5 to give the optimizer more information, so that it can better estimate cardinality and choose an optimal query access plan.


Resources

Learn

• "Understand column group statistics in DB2" (developerWorks, December 2006): Learn all about how to use column group statistics.

• "Comparing real-time cardinality to the optimizer cardinality estimates" (developerWorks, December 2005): Get all the details to create count queries to evaluate real-time cardinalities at certain operators in an access plan.

• "Influence query optimization with optimization profiles and statistical views in DB2 9" (developerWorks, December 2006): Learn about enhancements in DB2 9 that enable you to influence the default query optimization behaviour.

• Anatomy of an optimization profile section of the "IBM DB2 Database for Linux, UNIX, and Windows Information Center": Get an introduction to the contents of an optimization profile.

• developerWorks DB2 for Linux, UNIX, and Windows page: Read articles and tutorials and connect to other resources to expand your DB2 skills.

• Learn about DB2 Express-C, the no-charge version of DB2 Express Edition for the community.

Get products and technologies

• Download a free trial version of DB2 Enterprise 9.

• Now you can use DB2 for free. Download DB2 Express-C, a no-charge version of DB2 Express Edition for the community that offers the same core data features as DB2 Express Edition and provides a solid base to build and deploy applications.

• Download IBM product evaluation versions and get your hands on application development tools and middleware products from IBM Information Management, Lotus®, Rational®, Tivoli®, and WebSphere®.

Discuss

• Participate in the discussion forum for this content.

• Check out developerWorks blogs and get involved in the developerWorks community.

About the authors

Samir Kapoor
Samir Kapoor is an IBM Certified Advanced Technical Expert for DB2. Samir currently works with the DB2 Advanced Support -- Down System Division (DSD) team and has in-depth knowledge in the engine area.

Vincent Corvinelli
Vincent Corvinelli is an advisory software developer in the DB2 Query Optimizer Development team at the IBM Toronto Lab.
