Informix Unleashed -- Ch 23 -- Tuning Your Informix Environment


- 23 - Tuning Your Informix Environment

Tuning Your Efforts
Taking the First Steps
Recognizing Your Application
Tuning Your Informix Instance
Optimizing Shared Memory
Optimizing Disk Usage
Optimizing Network Traffic
Tuning Your Informix Database
Indexing Mechanics
Indexing Guidelines
Logging
Locking
Isolation Levels
Data Types
Constraints
Denormalization
Tuning Your Informix Operations
Update Statistics
Parallel Data Query
Archiving
Bulk Loads
In-Place ALTER TABLE
Tuning Your Informix Application
The Cost-Based Optimizer
Optimizing SQL
Sacrificing a Goat (or Overriding the Optimizer)
Optimizing Application Code
Stored Procedures and Triggers
Summary

by Glenn Miller

Databases can always be made faster--at a cost. The goal in tuning an Informix environment effectively is to know which improvements will have the biggest effects and what trade-offs are required to implement them. This effort demands that you be comfortable with an unfinished task, because you will never be done. This chapter will help you decide where to start and when to stop.

The first requirement is to know your system. You can find information about ways to take your system's pulse in Chapter 21, "Monitoring Your Informix Environment," and Chapter 22, "Advanced Monitoring Tools." Monitoring must be done not only when troubles arise, but also when no performance issues are pressing. You need to recognize your system's baseline activity and fundamental limitations.

In an ideal world, the topic of tuning might arise when a system is being designed: How best should the disks be arranged, what data model is most efficient, what coding standards enhance performance? More commonly, however, tuning becomes necessary when an already operational system is unacceptably slow. This more frequent scenario is not a bad thing. Rather, with a live system you have much more data regarding system load, background processes, and disk activity. The problems are tangible, not theoretical. This concreteness helps you focus your effort where it is needed the most.

Tuning Your Efforts

Be sure you are examining the real problem. Not all slow processes indicate that Informix should be tuned. The elapsed time of an activity is a combination of network communication time, CPU time handling the user process, CPU time handling system activities such as paging, and I/O time. Use tools such as vmstat to find out which component is the limiting factor. Use ps -ef to see which other activities are running simultaneously. Discover whether other elements have changed recently. Was there an operating system upgrade? Did the network configuration change? Look around.

Narrow your efforts by examining what is out of the ordinary. At any given time, something is always slowest. Find that "hot spot," and address it individually. When disk I/O is the limiting factor, look for ways to reduce or balance I/O. If the CPU is pegged--fully utilized--but is never waiting for I/O, then tuning disk access is pointless. If the database engine is overburdened, use onstat -g ses to inspect the active SQL operations. Address slow SQL statements one at a time.

The principle is to maintain as close to a controlled environment as you can so that you can see the effect of individual changes. Tune one thing. Examine the results. Verify that you have not slowed anything else unduly. Verify that you have not broken anything. Repeat this process until the diminishing returns you achieve are no longer worth your effort.

Taking the First Steps

The relevance of most tips depends on your specific system configuration. These first two, however, do not.

TIP: Run UPDATE STATISTICS. The optimizer needs the information about your database's contents that only this statement can supply. Refer to the section "Tuning Your Informix Operations" for complete details about this crucial command.
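Even the simplest form of the statement, issued from DB-Access or a script, refreshes the basic statistics for every table in the current database:

    UPDATE STATISTICS;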

TIP: Read the release notes. Upon installation, Informix stores the release notes in the $INFORMIXDIR/release directory. You will find valuable information there on new features, known problems, workarounds, new optimization schemes, optimal Informix parameters, compatibility issues, operating system requirements, and much more. Because its products are constantly evolving, Informix uses the release notes as the most direct, and often the only, means of communicating essential system-specific and version-specific information.

Recognizing Your Application

The primary distinction here is between OLTP and DSS. Online Transaction Processing (OLTP) systems are characterized by multiple users with simultaneous access. Generally, they select few rows, and they perform inserts and updates. OLTP applications usually use indexed reads and have sub-second response times. Fast query speed is paramount. In contrast to OLTP environments, Decision Support Systems (DSS) are characterized by sequential scans of the data, with concomitantly slow response times. Data warehouses are prime examples of DSS applications. Maximizing throughput is especially critical for these very large databases. They usually find themselves disk-bound, so it is crucial that such environments employ the most effective means of scanning data rapidly.

If your environment is OLTP and an SQL process is slow, fix it with an index.

TIP: Add an index. In most OLTP environments, for most databases, for most applications, adding a well-considered index will provide the greatest performance improvement at the least cost. Hundredfold decreases in query execution time are not uncommon. Really. Look to indexes first. Refer to the section "Tuning Your Informix Database" later in this chapter for a thorough (perhaps excruciating) explanation of indexing mechanics and guidelines.

In a DSS environment, the primary concern is reading huge amounts of data, usually for aggregation, or to summarize for trends. Disk I/O and specifically disk reads are most important. In such environments, fragmenting tables and indexes intelligently will generally produce the most significant improvements. Fragmenting is partitioning data or indexes horizontally across separate disks for parallel access. For more information on distributing data in this way, refer to Chapter 20, "Data and Index Fragmentation."

TIP: Fragment your critical tables and indexes. Fragmentation allows parallel scans and, for query execution, elimination of those fragments that cannot satisfy the query. These two advantages can dramatically improve your overall performance. If invoking fragmentation means that you need to upgrade to Informix-DSA, you should consider doing so.
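As a rough sketch of the idea (the table, columns, and dbspace names here are hypothetical), an expression-based scheme supports fragment elimination, while round-robin simply spreads the I/O:

    -- Expression-based fragments let the optimizer skip fragments that a
    -- query's filter on store_no cannot match
    CREATE TABLE sales_history
    (
        store_no     INTEGER,
        trans_date   DATE,
        sales_total  MONEY(12,2)
    )
    FRAGMENT BY EXPRESSION
        store_no < 100                      IN dbsp1,
        store_no >= 100 AND store_no < 200  IN dbsp2,
        REMAINDER                           IN dbsp3;

    -- Round-robin spreads rows evenly; there is no fragment elimination,
    -- but sequential scans can proceed on all three disks in parallel
    CREATE TABLE sales_staging
    (
        store_no     INTEGER,
        trans_date   DATE,
        sales_total  MONEY(12,2)
    )
    FRAGMENT BY ROUND ROBIN IN dbsp1, dbsp2, dbsp3;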

Tuning Your Informix Instance

An instance is a single installation of an Informix database engine, such as OnLine or Standard Engine. For the most part, the topics in this section refer only to OnLine. Administration of SE is intentionally simple and mostly not tunable.

Optimizing Shared Memory

A crucial feature of OnLine is its management of shared memory segments, those reserved sections of RAM isolated for OnLine's private use. By adjusting the values in onconfig, you can tune the way in which OnLine allocates resources within its shared memory pool for greatest efficiency. Refer to Chapter 13, "Advanced Configurations," for more information about these important settings.

Installing more than one instance of OnLine on a system is possible. Having multiple database servers coexist on the same computer is called multiple residency. On occasion, such a configuration is created to segregate production environments from development environments or to test different server versions, but, for performance, it is a bad idea.

CAUTION: Avoid multiple residency. Informix is unable to manage the separate segments of shared memory efficiently.

In addition, maintaining separate environments is tricky and prone to error.

Optimizing Disk Usage

For most databases, disk I/O presents the chief bottleneck. Finding ways to avoid, balance, defer, minimize, or predict I/O should all be components of your disk tuning toolkit. It is also true that disks are the most likely component of a database environment to fail. If your disks are inaccessible because of a disk failure, and you do not have a proper archiving scheme, tuning cannot fix it. Before you implement any of these changes, ensure that your archiving procedure is sturdy and that your root dbspace is mirrored. For more information about developing a complete archiving strategy, refer to Chapter 18, "Managing Data Backups."

Increasing Cached Reads

Informix can process only data that is in memory, and it stores only whole pages there. First, these pages must be read from the disk, a process that is generally the slowest part of most applications. At any given time, the disk or its controller might be busy handling other requests. When the disk does become available, the access arm might have to spend up to hundreds of milliseconds seeking the proper sector. The latency, or rotational time until the page is under the access arm, could be a few milliseconds more. Disks are slow.

Conversely, reads from shared memory buffers take only microseconds. In these buffers, Informix caches pages it has read from disk, where they remain until more urgent pages replace them. For OLTP systems, you should allocate as many shared memory buffers as you can afford. When your system is operational, use onstat -p to examine the percentage of cached reads and writes. It is common for a tuned OLTP system to read from the buffer cache well over 99 percent of the time. Although no absolute tuning rule applies here, you should continue allocating buffers until the percentage of cached reads stops increasing.

Some DSS applications can invoke light scans, described later in this chapter. These types of reads place pages of data in the virtual segment of shared memory and bypass the buffer cache. In such cases, your cached read percentage could be extremely low, even zero. See the "Light Scans" section for ways to encourage this efficient behavior.

Balancing I/O

With a multidisk system, a primary way to ease an I/O bottleneck is to ensure that the distribution of work among the disks is well balanced. To do so, you need to recognize which disks are busiest and then attempt to reorganize their contents to alleviate the burden. On a production system, you can use any number of disk monitoring utilities, especially iostat and onstat -g iof, to recognize where activity is highest. For a development environment or for a system being designed, you have to rely instead on broad guidelines. The following general priorities indicate a reasonable starting point in identifying which areas ought to receive the highest disk priority--that is, which dbspaces you will place on the "prime real estate," the centers of each disk, and which items are good candidates for fragmentation.

For an OLTP system, in descending order of importance, try the following:

1. Logs
2. High-use tables
3. Low-use tables
4. DBSPACETEMP

For DSS applications, a reasonable order is as follows:

1. High-use tables
2. DBSPACETEMP
3. Low-use tables
4. Logs

Additionally, prudent disk management should also include the use of raw, rather than cooked, disks. Among the numerous reasons for using these disks, performance is foremost.
Not only do raw disks bypass UNIX buffering, but if the operating system allows, raw disk access might use kernel-asynchronous I/O (KAIO). KAIO threads make system calls directly to the operating system and are faster than the standard asynchronous I/O virtual processor threads. You cannot enable KAIO explicitly; it is used automatically if your platform supports it. Read the release notes to determine whether KAIO is available for your system.

Many hardware platforms also offer some version of striping, commonly via a logical volume manager. Employing this hardware feature as a means of distributing disk I/O for high-use areas is generally advantageous. However, if you're using Informix-specific fragmentation, you should avoid striping those dbspaces that contain table and index fragments.

CAUTION: Hardware striping and database-level fragmentation are generally not complementary.

Finally, set DBSPACETEMP to a series of temporary dbspaces that reside on separate disks. When OnLine must perform operations on temporary tables, such as the large ORDER BY and GROUP BY operations typically called for in DSS applications, it uses the dbspaces listed in DBSPACETEMP in a round-robin fashion.

Consolidating Scattered Tables

Disk access for a table is generally reduced when all the data for a table is contiguous. Sequential table scans will not incur additional seek time, because the disk head remains positioned correctly for the next access. One goal of physical database design is preventing pieces of a table from becoming scattered across a disk. To prevent this scattering, you can designate the size of the first and subsequent extents for each table when it is created. Unfortunately, if the extents are set too large, disk space can be wasted. For a review of table space allocation, refer to Chapter 20.

In practice, unanticipated growth often interleaves multiple tables and indexes across a dbspace. When this scattering becomes excessive--more than eight non-contiguous extents for the same table in one dbspace--"repacking" the data is often worthwhile. With the Informix utility oncheck -pe, you can examine the physical layout of each dbspace and recognize those tables that occupy too many extents.

You can employ several straightforward methods to reconsolidate the data. Informix lets you change the next-extent size with ALTER TABLE tablename MODIFY NEXT SIZE, but this command alone does not physically move data; it only changes the size of the next extent allocated when the table grows. If a table is small, with few constraints, rebuilding the table entirely is often easiest:

1. Generate a complete schema.

2. Unload the data.

3. Rebuild the table with larger extents.

4. Reload the data.

5. Rebuild any indexes.

6. Rebuild any constraints.

7. Rebuild any triggers.

8. Rebuild any views dependent on the data.

9. Rebuild any local synonyms.

Note that whenever a table is dropped, all indexes, constraints, triggers, views, and local synonyms dependent on it are also dropped. You can see that this process can become complicated and often impractical if many database interdependencies exist. The simplest alternative when many tables are scattered is to perform an onunload/onload of the entire database. This operation reorganizes data pages into new extents of the size currently specified for each table. Just before unloading the data, you can set the next extent size larger for those tables that have become excessively scattered. Upon the reload, the new value will be used for all extents beyond the first that are allocated.

An alternative for an individual table is to create a clustered index, described more fully later in this chapter. When a clustered index is built, the data rows are physically rewritten in newly allocated extents in index order. If you have just set the next extent size to accommodate the entire table, the rebuilt table will now be in, at most, two extents. When you use a clustered index for this purpose, any other benefits are merely a bonus.

Light Scans

Light scans are efficient methods of reading data that OnLine-DSA uses when it is able. These types of reads bypass the buffer pool in the resident portion of shared memory and use the virtual segment instead. Data read by light scans into the virtual buffer cache is private; therefore, no overhead is incurred for concurrency issues such as locking. When the goal is to read massive amounts of data from disk quickly, these scans are ideal. Unfortunately, DSA does not always choose to employ them. In general, the following conditions must be true for light scans to be invoked:

- PDQ is on.
- Data is fragmented.
- Data pages, not index pages, are being scanned.
- The optimizer determines that the amount of data to be scanned would swamp the resident buffer cache.
- The Cursor Stability isolation level is not being used.
- The selectivity of the filters is low, and the optimizer determines that at least 15 to 20 percent of the data pages will need to be scanned.
- UPDATE STATISTICS has been run, to provide an accurate value for systables.npused.

TIP: Design your DSS application to exploit light scans.
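For example, because light scans require PDQ to be active, a DSS session might bracket its heavy scans like this (the table is hypothetical, and the priority value is limited by the administrator's MAX_PDQPRIORITY setting):

    -- Request a share of PDQ resources for this session
    SET PDQPRIORITY 60;

    -- A low-selectivity, non-indexed scan of a fragmented table is a
    -- candidate for a light scan
    SELECT store_no, SUM(sales_total)
      FROM sales_history
     GROUP BY store_no;

    -- Release the resources when the heavy work is finished
    SET PDQPRIORITY 0;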

You can examine whether light scans are active with onstat -g lsc. You can employ a few tricks to encourage light scans if you think they would be beneficial for your application:

- Reduce the size of the buffer cache by reducing BUFFERS in onconfig.
- Increase the size of the virtual portion of shared memory by increasing SHMVIRTSIZE in onconfig.
- Drop secondary indexes on the table being scanned. These include foreign key constraints and all other non-primary key indexes.
- Manually increase the systables.npused value.

Enabling light scans is worth the effort. Performance increases can be in the 100 to 300 percent range.

LRU Queues

As a means of efficiently managing its resident shared memory buffers, OnLine organizes them into LRU (Least Recently Used) queues. As buffer pages become modified, they get out of synch with the disk images, or dirty. At some point, OnLine determines that dirty pages that have not been recently accessed should be written to disk. This disk write is performed by a page cleaner thread whenever an individual LRU queue reaches its maximum number of dirty pages, as dictated by LRU_MAX_DIRTY. After a page cleaner thread begins writing dirty pages to disk, it continues cleaning until it reaches the LRU_MIN_DIRTY threshold.

These writes can occur while other processes are active, and they have a minimal but persistent background cost. You can monitor these writes with onstat -f. Here is a sample output:

    Fg Writes     LRU Writes     Chunk Writes
    0             144537         62561

The LRU Writes column indicates writes by the page cleaners on behalf of dirty LRU queues. Earlier versions of OnLine included Idle Writes, which are now consolidated with LRU Writes. Foreground writes (Fg Writes) are those caused by the server when no clean pages can be found. They preempt other operations, suspend the database temporarily, and are generally a signal that the various page cleaner parameters need to be tuned to clean pages more frequently. Chunk writes are those performed by checkpoints, and they also suspend user activity. They are described in the "Checkpoints" section later in this chapter. You should consider tuning the LRU queue parameters or the number of page cleaners if the temporary suspensions of activity from checkpoints become troublesome.

Generally, LRU_MAX_DIRTY and LRU_MIN_DIRTY are the most significant tuning parameters. To increase the ratio of LRU writes, decrease these values and monitor the performance with onstat -f. Values as low as 5 for LRU_MAX_DIRTY and 3 for LRU_MIN_DIRTY might be reasonable for your system.

You can use onstat -R to monitor the percentage of dirty pages in your LRU queues. If the ratio of dirty pages consistently exceeds LRU_MAX_DIRTY, you have too few LRU queues or too few page cleaners. First, try increasing the LRUS parameter in onconfig to create more LRU queues. If that is insufficient, increase CLEANERS to add more page cleaners. For most applications, set CLEANERS to the number of disks, but not less than one per LRU queue so that a cleaner is always available when a queue reaches its threshold. If your system has more than 20 disks, try setting CLEANERS to 1 per 2 disks, but not less than 20.

Checkpoints

One of the background processes that can affect performance is the writing of checkpoints. Checkpoints are occasions during which the database server, in order to maintain internal consistency, synchronizes the pages on disk with the contents of the resident shared memory buffer pool. In the event of a database failure, physical recovery begins as of the last checkpoint.
Thus, frequent checkpoints are an aid to speedy recovery. However, user activity ceases during a checkpoint, and if the checkpoint takes an appreciable amount of time, user frustration can result. Furthermore, writing checkpoints too frequently incurs unnecessary overhead. The goal is to balance the concerns of recovery, user perceptions, and total throughput.

Checkpoints are initiated when any of the following occurs:

- The checkpoint interval is reached, and database modifications have occurred since the last checkpoint.
- The physical log becomes 75 percent full.
- The administrator forces a checkpoint.
- OnLine detects that the next logical log contains the last checkpoint.
- Certain dbspace administration activities occur.

Each time a checkpoint occurs, a record is written in the message log. With onstat -m, you can monitor this activity. You can adjust the checkpoint interval directly by setting the onconfig parameter CKPTINTVL. If quick recovery is not crucial, try increasing the 5-minute default to 10, 15, or 30 minutes. Additionally, consider driving initiation of checkpoints by decreasing the physical log size.

TIP: Set the checkpoint frequency by adjusting the size of the physical log. Using the trigger of having a checkpoint forced when the physical log is 75-percent full ensures that checkpoints are used only when needed.

There is one additional performance consideration for large batch processes. Note that page cleaner threads write pages from memory to disk both when LRU queues are dirty and when a checkpoint is performed. However, the write via a checkpoint is more efficient. It uses chunk writes, which are performed as sorted writes, the most efficient writes available to OnLine. Also, because other user activity is suspended, the page cleaner threads are not forced to switch contexts during checkpoint writes. Finally, checkpoint writes use OnLine's big buffers, 32-page buffers reserved for large contiguous reads and writes. These advantages make chunk writes preferable to LRU writes. Large batch processes can be made more efficient by increasing the ratio of chunk writes to LRU writes. To do so, increase the LRU_MAX_DIRTY and LRU_MIN_DIRTY values, perhaps as high as 98 percent and 95 percent, respectively. Then decrease the CKPTINTVL or physical log size until the bulk of the writes are chunk writes.

Read Aheads

When OnLine performs sequential table or index scans, it presupposes that adjacent pages on disk will be the next ones requested by the application. To minimize the time an application has to wait for a disk read, the server performs a read ahead while the current pages are being processed. It caches those pages in the shared memory buffer pool. When the number of those pages remaining to be read reaches the read-ahead threshold (RA_THRESHOLD), OnLine fetches another set of pages equal to the RA_PAGES parameter. In this way, it can stay slightly ahead of the user process.

Although the default parameters are usually adequate, you can adjust both RA_PAGES and RA_THRESHOLD. If you expect a large number of sequential scans of data or index pages, consider increasing RA_PAGES. You should keep RA_PAGES a multiple of 8, the size of the light scan buffers in virtual memory. For most OLTP systems, 32 pages is generous. Very large DSS applications could make effective use of RA_PAGES values as high as 128 or 256 pages. The danger in setting this value too high is that unnecessary page cleaning could occur to make room for pages that might never be used.

Optimizing Network Traffic

In a client/server environment, application programs communicate with the database server across a network. The traffic from this operation can be a bottleneck. When the server sends data to an application, it does not send all the requested data at once. Rather, it sends only the amount that fits into the fetch buffer, whose size is defined by the application program. The fetch buffer resides in the application process; in a client/server environment, this means the client side of the application.

When only one row is being returned, the default fetch buffer size is the size of a row. When more than one row is returned, the buffer size depends on the size of three rows: If they fit into a 1,024-byte buffer, the fetch buffer size is 1,024 bytes. If not, but they fit into a 2,048-byte buffer, the fetch buffer size is 2,048 bytes. Otherwise, the fetch buffer size is the size of a row.

If your application has very large rows or passes voluminous data from the server to the application, you might benefit from increasing the fetch buffer size. With a larger size, the application would not need to wait so often while the server fetches and supplies the next buffer-full of data. The FET_BUF_SIZE environment variable dictates the fetch buffer size, in bytes, for an ESQL/C application.
Its minimum is the default; its maximum is generally the largest value a SMALLINT can hold: 32,767 bytes. For example, with the following Korn shell command, you could set the fetch buffer size to 20,000 bytes for the duration of your current shell:

    export FET_BUF_SIZE=20000

You can also override FET_BUF_SIZE from within an ESQL/C application. The global variable FetBufSize, defined in sqlhdr.h, can be set from within your code. For example, the following C code excerpt sets FetBufSize to 20000:

    EXEC SQL include sqlhdr;
    ...
    FetBufSize = 20000;

Tuning Your Informix Database

Database tuning generally occurs in two distinct phases. The first, at initial design, includes the fundamental and broad issue of table design, incorporating normalization and the choices of data types. Extent sizing and referential constraints are often included here. A primary reason that these choices are made at this stage is that changing them later is difficult. For example, choosing to denormalize a table by storing a derived value is best done early in the application development cycle. Later justification for changing a schema must be very convincing. The second phase of tuning involves those structures that are more dynamic: indexes, views, fragmentation schemes, and isolation levels. A key feature of these structures is their capability to be generated on-the-fly.

Indexing Mechanics

Much of this chapter describes when to use indexes. As an efficient mechanism for pointing to data, indexes are invaluable. However, they have costs, not only in disk space, but also in maintenance overhead. For you to exercise effective judgment in their creation, you need a thorough understanding of Informix's indexing mechanics and the overhead involved in index maintenance. The remainder of this section describes how Informix builds and maintains indexes through the use of B+ tree data structures. B+ trees are hierarchical search mechanisms that have the trait of always being balanced--that is, of having the same number of levels between the root node and any leaf node.

B+ Tree Index Pages

Indexes comprise specially structured pages of three types: root nodes, branch nodes, and leaf nodes. Each node, including the singular root node, holds sets of associated sorted keys and pointers. The keys are the concatenated data values of the indexed columns. The pointers are addresses of data pages or, for root and branch nodes, addresses of index pages. Figure 23.1 shows a fully developed B+ tree index with three levels. In this diagram, finding the address of a data element from its key value requires reading three index pages. Given a key value, the root node determines which branch to examine. The branch node points to a specific leaf node. The leaf node reveals the address of the data page.

Leaf nodes also include an additional element, a delete flag for each key value. Furthermore, non-root nodes have lateral pointers to adjacent index nodes on that level. These pointers are used for horizontal index traversal, described later in this section.

Figure 23.1. Indexes use a B+ tree structure.

Index Node Splits

Indexes are not fully formed when they are created. Instead, they start out as a single page: a root node that functions also as a leaf node. They evolve over time. When enough index entries are added to fill a node, it splits. To split, it creates another node at its level and moves half its index entries to that page. It then elevates the middle key value, the one dividing the two nodes, to the parent node.
There, new index entries that point to these two nodes are created. If no parent node exists, one is created. When the root node splits, its new parent page becomes the root node. Figure 23.2 shows this process of splitting an index node to create a new level in the B+ tree index.

Figure 23.2. Index node split creates a new root node.

NOTE: If a table has multiple indexes, inserting a data row forces an index entry to be inserted into each one. The performance and space overhead can be significant.

Delete Flagging

For OnLine versions after 6.0, when a data row is deleted, its index entries are not immediately removed. Instead, each index entry is marked with a delete flag, indicating that it is available for deletion. Marking deleted index entries with a delete flag avoids some locking and concurrency problems that could surface with the adjacent key locking mechanism used in older Informix versions. The entries are actually deleted by a page cleaner thread. The page cleaner examines the pages in its list--whose entries were placed there by the delete process--every minute, or whenever it has more than 100 entries.

When the page cleaner thread deletes an entry, it checks to see whether two or fewer index entries remain on the page. If so, OnLine tries to merge the entries on the page with an adjacent node. If it can, it then frees the current page for other purposes. If no space is available on an adjacent node, OnLine instead shuffles data from the adjacent node into the current page to try to balance the index entries. The merging and shuffling caused by massive deletes not only invoke considerable processing overhead, but also can leave many semi-empty index pages.

Update Costs

When an indexed value is updated, Informix maintains the index by first deleting the old index entry and then inserting a new entry, thus invoking the overhead of both operations. You will find that bulk updates on indexed columns can consume considerable resources.

Optimal Structure

Index pages, like any pages, are read most efficiently when cached in the shared memory buffers. Because only whole pages are read from disk, only whole pages can be cached. Partially full index nodes therefore take more space to store, reducing the number of keys that can be cached at once. It is usual for the root node of an index to remain cached, and common for the first branch nodes as well. Subsequent levels are usually read from disk. Therefore, compact, balanced indexes are more efficient.

Checking the status of indexes occasionally, especially after numerous database operations, is therefore prudent. Oncheck is designed for this purpose. Here is a sample oncheck command, followed by the relevant part of its output:

    oncheck -pT retail:customers

                                Average     Average
    Level    Total              No. Keys    Free Bytes
    -----    --------           --------    ----------
        1           1                  2          4043
        2           2                246          1542
        3         492                535          1359
    -----    --------           --------    ----------
    Total         495                533          1365

Note that this B+ tree index has three levels and that the leaf nodes (level 3) average about one-third empty. An index with a high percentage of unused space, for which you do not anticipate many inserts soon, is a good candidate for rebuilding. When you rebuild an index, consider setting the FILLFACTOR variable, described in the next section.

The following output shows the same index rebuilt with a FILLFACTOR of 100. Notice how much less free space remains in each page and, therefore, how many more key values each leaf node contains. Additionally, an entire level was removed from the B+ tree.

                                Average     Average
    Level    Total              No. Keys    Free Bytes
    -----    --------           --------    ----------
        1           1                366           636
        2         366                719           428
    -----    --------           --------    ----------
    Total         367                718           429

FILLFACTOR

When OnLine builds an index, it leaves 10 percent of each index page free to allow for eventual insertions. The percentage filled is dictated by the onconfig parameter FILLFACTOR, which defaults to 90. For most indexes, this value is adequate. However, the more compact an index is, the more efficient it is.
When you're creating an index on a static read-only table, you should consider setting FILLFACTOR to 100. You can override the default with an explicit declaration, as in the following example:

    CREATE INDEX ix_hist ON order_history (cust_no) FILLFACTOR 100;

Likewise, when you know that an index will undergo extensive modifications soon, you can set the FILLFACTOR lower, perhaps to 50.

FILLFACTOR applies only to the initial index creation and is not maintained over time. In addition, it takes effect only when at least one of the following conditions is true:

- The table has over 5,000 rows and over 100 data pages.
- The table is fragmented.
- The index is fragmented, but the table is not.

Indexing Guidelines

Crafting proper indexes is part experience and part formula. In general, you should index columns that are frequently used for the following:

- Joins
- Filters that can usually discriminate less than 10 percent of the data values
- UNIQUE constraints, including PRIMARY KEY constraints
- FOREIGN KEY constraints
- GROUP BY operations
- ORDER BY clauses

In addition, try to avoid indexes on the following:

- Columns with few values
- Columns that already head a composite index
- VARCHARs, for which the entire maximum length of the column is stored for each index key value

Beyond these general guidelines, index what needs to be indexed. The optimizer can help you determine what should be indexed. Check the query plans, and let the optimizer guide you. In the section called "Tuning Your Informix Application" later in this chapter, you learn how to use the SET EXPLAIN ON directive to examine Informix's use of specific indexes.

Unique Indexes

Create a unique index on the primary key, at least. One is generated automatically for unique constraints, but the names of system-generated constraints start with a space. Naming your PRIMARY KEY indexes yourself is best. Explicit names are clearer and allow for easier modification later. For example, altering an index to cluster, or changing its fragmentation scheme, is simpler for a named index.

Cluster Indexes

Clustering physically reorders the data rows to match the index order. It is especially useful when groups of rows related by the index value are usually read together. For example, all rows of an invoice line item table that share an invoice number might usually be read together. If such rows are clustered, they will often be stored on the same or adjacent pages so that a single read will fetch every line item. You create a clustered index with a statement like this:

    CREATE CLUSTER INDEX ix_line_item ON invoice_lines (invoice_no);

When it creates the index, Informix first allocates new extents for the table and then copies the data to the new extents. In the process, room must be available for two complete copies of the table; otherwise, the operation will fail.

CAUTION: Before you create a cluster index, verify that two copies of the table can coexist in the available space.
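Before issuing the statement, a rough size check can tell you whether the dbspace has room (a sketch only: it assumes a 2KB page size and uses the npused column of systables, which counts the pages the table currently occupies):

    -- Approximate footprint of the table to be clustered, assuming 2KB pages
    SELECT tabname, npused, npused * 2 AS approx_kbytes
      FROM systables
     WHERE tabname = 'invoice_lines';

Roughly this much additional free space must be available while the new copy of the table is built.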

Because clustering allocates new extents, it can be used to consolidate the remnants of a scattered table.

The clustering on a table is not maintained over time. If frequent inserts or deletes occur, the benefits of clustering will diminish. However, you can recluster a table as needed, like this:

    ALTER INDEX ix_line_item TO CLUSTER;

This statement instructs the database engine to reorder the rows, regardless of whether the named index was previously a cluster index.

Composite Indexes

Composite indexes are those formed from more than one column, such as

    CREATE INDEX ix_cust_name ON customer (last_name, first_name, middle_name);

You can use this index to accomplish searching or ordering on all three values, on last_name and first_name, or on last_name alone. Because the index keys are created from concatenated key values, any subset of the columns, left to right, can be used to satisfy queries. Therefore, any independent index on a leading subset of these columns, such as one on last_name alone, would be redundant and a waste of space.

Because the column order in a composite index is so significant, you should put the most frequently used column first. Doing so will help ensure that the index has greater utility. That is, its component parts can also be used often to fulfill index criteria.

One other use for composite indexes is to store the data for key-only reads, described in the next section. By doing so, you can, in effect, create a subset of the table's data that can be used very effectively in certain queries. Of course, there is a cost. You must balance the overhead of index maintenance and extra disk usage against the benefits of quicker performance when the key-only reads occur.

TIP: Consider creating an artificial composite index whose only purpose is to allow a key-only read.
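For example (a hypothetical sketch), if a report repeatedly needs only a customer's number and credit limit for a given state, a composite index covering exactly those columns lets the query avoid the data pages entirely:

    -- Created purely so the query below can be resolved as a key-only read
    CREATE INDEX ix_cust_credit ON customer (state, cust_no, credit_limit);

    SELECT cust_no, credit_limit
      FROM customer
     WHERE state = 'CA';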

Key-Only Reads

Key-only reads are those that can satisfy the query entirely with values found in the index pages alone. Naturally, avoiding the I/O required to access the data pages affords a considerable performance improvement. You can generally predict a key-only read by examining the indexes available to the optimizer. For example, consider the following query:

    SELECT last_name, COUNT(*)
      FROM customer
     GROUP BY last_name
     ORDER BY last_name;

Its needs can be satisfied entirely with the values contained in an index on customer.last_name. The output of SET EXPLAIN confirms that only the index is needed to complete the query:

    Estimated Cost: 80
    Estimated # of Rows Returned: 10

    1) informix.customers: INDEX PATH

        (1) Index Keys: last_name   (Key-Only)

The single index suffices to supply the data and to enable the GROUP BY, the ORDER BY, and the COUNT operations.

Bi-Directional Indexes

Bi-directional indexes were introduced in OnLine-DSA version 7.2. With them, OnLine can traverse an index in either direction. Whether a single-column index is created in ascending or descending order is irrelevant. Indexes are still created ascending by default. Composite indexes can also be accessed from either direction but are reversed at the column level when they contain column-specific direction instructions. For example, consider the following index:

    CREATE INDEX ix_cust3 ON customers (last_name ASC, first_name ASC, cust_no DESC);

Access from the opposite direction acts as if the rows were sorted in the reverse order on every column. For the preceding index, reading it in the opposite direction is the same as reading the following index:

    CREATE INDEX ix_cust4 ON customers (last_name DESC, first_name DESC, cust_no ASC);

Horizontal Index Traversal

Informix index pages can be traversed in two ways. The one used for standard index-based lookups starts at the root node and follows the traditional root-to-branch-to-leaf pattern. But there is another way. In the page header of every index leaf node and branch node page are horizontal links to sibling index pages. They are pointers to the adjacent left and right pages.

When Informix does a sequential index scan, such as a non-discriminatory key-only select from a composite index, the index nodes are traversed in sequential index order, left to right, at the leaf node level only. Data pages are never read, nor are the root or branch nodes accessed. This efficient means of navigating indexes is not tunable but is just one of the ways Informix optimizes its index structure.

Logging

In all Informix versions prior to XPS, database logging is not required. However, without it, explicit transactions cannot be performed; that is, rollbacks are not available. For most operations, business constraints make this unacceptable. There are exceptions, such as turning off logging for bulk database loads, but, in general, logging overhead must be incurred. Often, the only choice is how to minimize the overhead.

Databases with logging can be created either buffered or unbuffered. Buffered logging routes transactions through a buffer pool and writes the buffer to disk only when the logical log buffer fills. Although unbuffered logging transactions also pass through the logical log buffer, with them the entire buffer is written after any transaction is completed. Because of the frequent writes, unbuffered logging provides greater data integrity in case of a system failure.

CAUTION: With buffered logging, transactions in memory can be lost if the system crashes.

Nonetheless, the I/O savings afforded by buffered logging are almost always worth the small risk of data loss. This is especially true in active OLTP systems in which logical log writes are often the greatest source of disk activity.
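As a small illustration (the database names are hypothetical), buffered logging is simply a property chosen when the database is created:

    -- Log buffers are flushed only when they fill
    CREATE DATABASE stores_demo WITH BUFFERED LOG;

    -- Compare: an unbuffered-log database flushes the logical log buffer
    -- at every transaction commit
    CREATE DATABASE orders_db WITH LOG;

An existing database can later be switched between buffered and unbuffered logging by the administrator (for example, while taking an archive with ontape), so the choice is not permanent.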

NOTE: All databases share logical logs and the logical log buffer. If one database is unbuffered, it will cause flushing of the entire log buffer whenever a transaction within it is committed. This action can negate the advantage of buffered logging for all other databases in the instance.

Using Non-Logging Tables

Non-logging databases are no longer supported in INFORMIX-OnLine XPS. Rather, within a database that has logging, new table types exist for specific operations that need not incur the overhead of logging. For example, when you load raw data from an external source into a table for initial scrubbing, logging is often superfluous, because you can usually perform the load again should the initial attempt fail. Table 23.1 summarizes the table types available with OnLine XPS.

Table 23.1. OnLine XPS table types.

    Table Type    Duration    Logged   Writes Allowed   Indexes Allowed   Restorable from Archive
    SCRATCH       temporary   no       yes              no                no
    TEMP          temporary   yes      yes              yes               no
    RAW           permanent   no       yes              no                no
    STATIC        permanent   no       no               yes               no
    OPERATIONAL   permanent   yes      yes              yes               no
    STANDARD      permanent   yes      yes              yes               yes

A common tactic is to use a RAW table to load data from an external source and then alter it to STATIC after the operation is finished. As an added bonus, whereas all temporary tables can be read with light scans because they are private, STATIC tables can always be read with light scans because they are read-only.

Locking

Locking in Informix is available at the following decreasing levels of granularity, or scope:

- Database
- Table
- Page
- Row

For a complete discussion of locking, refer to Chapter 15, "Managing Data with Locking." Generally, the demands of increased concurrency in a multi-user application force locking to be assigned at the smallest granularity. Unfortunately, this constraint also invokes the greatest overhead. Although the number of locks is tunable, they are finite resources.

TIP: Lock at the highest granularity possible. Generating, holding, and checking for a lock all take time. You should make every effort to reduce the number of locks by increasing their granularity.
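For example (the table and column names here are hypothetical), an off-hours batch job can take one exclusive table lock instead of acquiring thousands of row or page locks:

    BEGIN WORK;

    -- One lock covers the whole table for the duration of the transaction
    LOCK TABLE order_history IN EXCLUSIVE MODE;

    UPDATE order_history
       SET archived_flag = 'Y'
     WHERE order_date < TODAY - 365;

    COMMIT WORK;   -- the table lock is released here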

Certainly for bulk off-hour operations, you should consider the LOCK DATABASE databasename EXCLUSIVE command. Finally, be cautious about creating large tables with row-level locking; with it, mass inserts or deletes to a table can quickly exhaust the available locks. Even with tables that have page-level locking, using LOCK TABLE tablename IN EXCLUSIVE MODE whenever possible is best.

Isolation Levels

The isolation level dictates the degree to which any reads you perform affect and are affected by other concurrent users. The different levels place increasingly stringent requirements on what changes other processes can make to rows you are examining and to what degree you can read data currently being modified by other processes. Isolation levels are meaningful only for reads, not for data manipulation statements.

In decreasing order of permissiveness, the isolation levels available in OnLine are as follows:

- Dirty Read
- Committed Read
- Cursor Stability
- Repeatable Read

Dirty Read is the most efficient and simplest isolation level. Effectively, it does not honor any locks placed by other processes, nor does it place any. Regardless of whether data on disk is committed or uncommitted, a Dirty Read scan will copy the data. The danger is that a program using Dirty Read isolation might read a row that is later uncommitted. Therefore, be sure you account for this possibility, or read only from static tables when this isolation level is set.

TIP: For greatest efficiency, use the Dirty Read isolation level whenever possible.
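A session selects its isolation level with a single statement. For example (the reporting query is hypothetical):

    -- A report that tolerates uncommitted data can skip all lock checking
    SET ISOLATION TO DIRTY READ;

    SELECT store_no, COUNT(*)
      FROM sales_history
     GROUP BY store_no;

    -- Return to the default for logged, non-ANSI databases
    SET ISOLATION TO COMMITTED READ;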

The Committed Read isolation level ensures that only rows committed in the database are read. As it reads each row, OnLine checks for the presence of an update lock. If one exists, it ignores the row. Because OnLine places no locks, the Committed Read isolation level is almost as efficient as the Dirty Read isolation level.

Cursor Stability causes the database to place a lock on the current row as it reads the row. This lock ensures that the row will not change while the current process is using it. When the server reads and locks the next row, it releases the previous lock. The placement of locks suggests that processes with this isolation level will incur additional overhead as they read data.

With Repeatable Read, processes lock every row that has been read in the current transaction. This mode guarantees that reading the same rows later would find the same data. As a result, Repeatable Read processes can generate many locks and hold them for a long time.

CAUTION: Be careful when you use the Repeatable Read isolation level. The number of locks generated could exceed the maximum available.

Data Types

Two key performance principles apply when you're selecting a data type: minimize space and reduce conversions. Smaller data types save disk space, create tidier indexes, fit better into shared memory, and allow faster joins. For example, never use an INTEGER when a SMALLINT will do. Unless you need the added range (an INTEGER can store from -2,147,483,647 to 2,147,483,647, whereas the limits for a SMALLINT are -32,767 and 32,767), use the 2-byte SMALLINT rather than the 4-byte INTEGER. In a similar fashion, minimize the precision of DECIMAL, MONEY, and DATETIME data types. Their storage requirements are directly related to their precision.

In addition, use the data types most appropriate for the operations being performed on them. For example, do not store numeric values in a CHAR field if they are to be used for calculations. Such type mismatches cause the database to perform a conversion for every operation.

Small Join Keys

When Informix joins two tables, it performs the operation in memory as much as it is able. The smaller the keys, the less likely a join operation is to overflow to disk. Operations in memory are fast; disk operations are slow. Creating a small join key might mean replacing large composite keys with alternatives, often serial keys. It is common that the natural keys in a table offer no concise join candidate. For example, consider the following table:

    CREATE TABLE transactions
    (
        cust_no       INTEGER,
        trans_type    CHAR(6),
        trans_time    DATETIME YEAR TO FRACTION,
        trans_amount  MONEY(12,2),
        PRIMARY KEY (cust_no, trans_type, trans_time)
    );

Imagine that business rules demand that it often be joined to the following:

    CREATE TABLE trans_audits
    (
        cust_no       INTEGER,
        trans_type    CHAR(6),
        trans_time    DATETIME YEAR TO FRACTION,
        auditor_no    INTEGER,
        audit_date    DATE,
        FOREIGN KEY (cust_no, trans_type, trans_time) REFERENCES transactions
    );

If these tables are large, joins will be slow. The transactions table is an excellent candidate for an artificial key whose sole purpose is to make joins of this sort more efficient. Such a scheme would look like the following:

    CREATE TABLE transactions
    (
        trans_no      SERIAL PRIMARY KEY,
        cust_no       INTEGER,
        trans_type    CHAR(6),
        trans_time    DATETIME YEAR TO FRACTION,
        trans_amount  MONEY(12,2)
    );

    CREATE TABLE trans_audits
    (
        trans_no      INTEGER REFERENCES transactions,
        auditor_no    INTEGER,
        audit_date    DATE
    );

At the cost of making the transactions table a little larger and forcing the maintenance of an added key, the joins are more efficient, and the trans_audits table is considerably smaller.

Blobs

Blobs can be stored in a table's tblspace with the rest of its data or in a custom blobspace. Blobspaces comprise blobpages, which can be defined to be multiple pages.

TIP: Place large blobs in blobspaces, and define the blobpages large enough to store the average blob.
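For example (a sketch; it assumes the administrator has already created a blobspace named blobs1 with a blobpage size close to the average image size), the blob columns name the blobspace directly:

    -- Large objects go to the blobspace; the small row data stays in the tblspace
    CREATE TABLE catalog_items
    (
        item_no     SERIAL PRIMARY KEY,
        descr       VARCHAR(255),
        photo       BYTE IN blobs1,
        spec_sheet  TEXT IN blobs1
    );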

With blobpages large enough, most blobs will be stored contiguously. Blobs stored in blobspaces also bypass the logical log and buffer cache; blobs stored in tblspaces do not. Another hazard of storing blobs in tblspaces is that they could flood the cache buffers and force out other, more useful pages.

Constraints

One of the most insidious means of sapping performance is to allow bad data to infiltrate your database. Disk space is wasted. Application code becomes convoluted as it tries to accommodate data that should not be there. Special processes must be run to correct or cull the invalid data. These violations should be prevented, not repaired. Using Informix's constraints to enforce integrity upon your database is almost always worthwhile. Constraints are mostly enabled via indexes, and indexes can have high costs. But the existence of costs should not preclude implementing a good idea.

In the real world, performance is almost never considered in a vacuum. Usually, you are faced with trade-offs: Indexing to improve query speed uses extra disk space; adding more cache buffers increases the paging frequency; an efficient but tricky piece of application code requires a greater maintenance effort. The principle applies here as well: Using constraints to enforce integrity carries with it significant overhead. Do it anyway.

Table 23.2 shows how key elements critical to a sound data model can be enforced with constructs available in Informix.

Table 23.2. Using constraints to enforce integrity.

    Relational Object   Enforcement Mechanism
    Primary Key         PRIMARY KEY CONSTRAINT
                        UNIQUE CONSTRAINT
                        NOT NULL CONSTRAINT
    Domain              data types
                        CHECK CONSTRAINT
                        DEFAULT values
                        NOT NULL CONSTRAINT
    Foreign Key         FOREIGN KEY CONSTRAINT, including ON DELETE CASCADE
                        triggers and stored procedures
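As a concrete sketch of Table 23.2 (the table and column names are illustrative), a single child-table definition can combine all three enforcement mechanisms:

    CREATE TABLE order_details
    (
        order_no    INTEGER NOT NULL,              -- key to the orders table
        line_no     SMALLINT NOT NULL,
        item_no     INTEGER NOT NULL,
        quantity    SMALLINT DEFAULT 1,            -- domain: default value
        line_total  MONEY(12,2),
        PRIMARY KEY (order_no, line_no),           -- primary key constraint
        FOREIGN KEY (order_no) REFERENCES orders
            ON DELETE CASCADE,                     -- referential integrity
        CHECK (quantity > 0)                       -- domain: check constraint
    );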

For more information on enforcing primary key and foreign key constraints, refer to Chapter 17, "Managing Data Integrity with Constraints."

Denormalization

For every rule, you'll find exceptions. On occasion, conforming to absolute relational strictures imposes too great a performance cost. At such times, well-considered denormalization can provide a performance gain significant enough to justify the effort. A fully normalized model has no redundant data or derived data. The examples in this section suggest times when introducing redundancy or derived data might be of value.

Maintain Aggregate Tables

A stock-in-trade of data warehouse applications, aggregate tables often store intermediate levels of derived data. Perhaps a retail DSS application reports frequently on historical trends of sales for each product by store and by day. Yet the base data available in the normalized database is at the transaction level, where thousands of individual rows must be aggregated to reach what to the DSS application is an atomic value. Furthermore, although historical transaction data is static, queries often summarize transactions months or years old.

Such an environment calls for creating an aggregate table like the following:

    CREATE TABLE daily_trans
    (
        product_no   INTEGER,
        store_no     INTEGER,
        trans_date   DATE,
        sales_total  MONEY(16,2)
    );

New aggregates can be summed nightly from base transaction data and added to the aggregate table. In addition to creating efficient queries at the granularity of one daily_trans row, this table can be used as the starting point for other queries. From it, calculating sales by month or daily sales by product across all stores would be a simple matter.

Maintain Aggregate Columns

Storing a denormalized aggregate value within a table is often reasonable, especially if it is referenced often and requires a join or aggregate function (or both) to build. For example, an orders table might commonly store an order_total value, even though it could be calculated as follows:

    SELECT SUM(order_details.line_total)
      FROM orders, order_details
     WHERE orders.order_no = order_details.order_no;

Application code, or perhaps a trigger and a stored procedure, must be created to keep the order_total value current. In the same fashion, for the following example, customers.last_order_date might be worth maintaining rather than always recalculating:

    SELECT MAX(orders.order_date)
      FROM customers, orders
     WHERE customers.cust_no = orders.cust_no;

In these cases, you must monitor your application closely. You have to weigh whether the extra complexity and overhead are justified by any performance improvements.

Split Wide Tables

Wide tables are those with many columns, especially those with several large columns. Long character strings often contribute greatly to a table's width. Few of the long rows from a wide table can fit on any given page; consequently, disk I/O for such a table can be inefficient. One tactic to consider is to split the table into components that have a one-to-one relationship with each other. Perhaps all the attributes that are rarely selected can be segregated to a table of their own. Possibly a very few columns that are used for critical selects can be isolated in their own table. Large strings could be expelled to a companion table. Any number of methods could be considered for creating complementary tables; you have to consider individually whether the performance gain justifies the added complexity.
Tuning Your Informix Operations

You can improve the overall operation of your environment by balancing system resources effectively. For example, as much as possible, run resource-intensive processes only when the system is least heavily used, generally at night. Candidates for off-hour processing include calculating aggregates and running complex reports. Also during the off-hours, perform the background operations that keep your system healthy, such as archiving and updating statistics.

Update Statistics

To optimize SQL statements effectively, Informix relies on data it stores internally. It uses the sysindexes, systables, syscolumns, sysconstraints, sysfragments, and sysdistrib tables to store data about each table. It tracks such values as the number of rows, number of data pages, and depth of indexes. It stores high and low values for each column and, on demand, can generate actual data distributions as well. It recognizes which indexes exist and how selective they are. It knows where data is stored on disk and how it is apportioned. With this data, it can optimize your SQL statements to construct the most efficient query plan and reduce execution time.

Informix can perform these jobs well only when the internal statistics are up-to-date. But they often are not--these values are not maintained in real time. In fact, most of the critical values are updated only when you run the UPDATE STATISTICS statement. Therefore, you must do so on a regular basis.

Whenever you run UPDATE STATISTICS, you specify the objects on which it should act: specific columns, specific tables, all tables, all tables and procedures, specific procedures, or all procedures.

NOTE: If you execute UPDATE STATISTICS without specifying FOR TABLE, execution plans for stored procedures are also re-optimized.
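You can also target stored procedures explicitly when only their query plans need refreshing (the procedure name is hypothetical):

    -- Re-optimize the plans for every stored procedure
    UPDATE STATISTICS FOR PROCEDURE;

    -- Or re-optimize a single procedure
    UPDATE STATISTICS FOR PROCEDURE update_order_total;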

In addition, you can specify how much information is examined to generate the statistics. In LOW mode, UPDATE STATISTICS constructs table and index information. UPDATE STATISTICS MEDIUM and HIGH also construct this data but, by scanning data pages, add data distributions.

UPDATE STATISTICS LOW

With the default UPDATE STATISTICS mode (LOW), the minimum information about the specified object is gathered. This information includes table, row, and page counts along with index and column statistics for any columns specified. This data is sufficient for many purposes and takes little time to generate. The following statements show examples of these operations:

    UPDATE STATISTICS LOW FOR TABLE customers (cust_no);
    UPDATE STATISTICS LOW FOR TABLE customers;
    UPDATE STATISTICS LOW;

You can even use the UPDATE STATISTICS statement on a temporary table. Also, with the DROP DISTRIBUTIONS clause, you can drop previously generated data distribution statistics:

    UPDATE STATISTICS LOW FOR TABLE customers DROP DISTRIBUTIONS;

Distributions are values that have been generated by a previous execution of UPDATE STATISTICS MEDIUM or UPDATE STATISTICS HIGH. If you do not specify the DROP DISTRIBUTIONS clause, any data distribution information that already exists will remain intact.

UPDATE STATISTICS MEDIUM

The MEDIUM and HIGH modes of UPDATE STATISTICS duplicate the effort of UPDATE STATISTICS LOW, but they also create data distributions. With MEDIUM, data is only sampled; with HIGH, all data rows are read. These data distributions are stored in the sysdistrib table. Informix creates distributions by ordering the data it scans and allocating the values into bins of approximately equal size. By recording the extreme values in each bin, it can recognize the selectivity of filters that might later be applied against these columns. Thus, Informix can recognize when the data values are skewed or highly duplicated, for example. You can alter the sampling rate and the number of bins by adjusting the CONFIDENCE and RESOLUTION parameters. For example, the following statement generates 25 bins (100/RESOLUTION) and samples enough data to give the same results as UPDATE STATISTICS HIGH approximately 98 percent of the time:

    UPDATE STATISTICS MEDIUM FOR TABLE customers (cust_no) RESOLUTION 4 CONFIDENCE 0.98;

UPDATE STATISTICS HIGH

When you specify UPDATE STATISTICS HIGH, Informix reads every row of data to generate exact distributions. This process can take a long time. Normally, HIGH and MEDIUM gather index and table information, as well as distributions. If you have already gathered index information, you can avoid recalculating it by adding the DISTRIBUTIONS ONLY clause:

    UPDATE STATISTICS LOW FOR TABLE customers;
    UPDATE STATISTICS HIGH FOR TABLE customers (cust_no) DISTRIBUTIONS ONLY;

With the DISTRIBUTIONS ONLY clause, UPDATE STATISTICS MEDIUM and HIGH generate only table and distribution data.

Comprehensive UPDATE STATISTICS Plan

Your goal should be to balance the performance overhead of creating statistics inefficiently or too often against the need for regular recalculation of these values. The following plan strikes a good balance between execution speed and completeness:

1. Run the UPDATE STATISTICS MEDIUM command for the whole database. It will generate index, table, and distribution data for every table and will re-optimize all stored procedures.

2. Run the UPDATE STATISTICS HIGH command with DISTRIBUTIONS ONLY for all columns that head an index. This gives the optimizer the most accurate data about an index's selectivity.

3. Run the UPDATE STATISTICS LOW command for all remaining columns that are part of composite indexes.

If your database is moderately dynamic, consider activating such an UPDATE STATISTICS script periodically, even nightly, via cron, the UNIX automated job scheduler. Finally, remember to run UPDATE STATISTICS specifically whenever a table undergoes major alterations.

Parallel Data Query

OnLine-DSA offers the administrator methods of apportioning the limited shared memory resources among simultaneous DSS queries. Primary among these parameters is MAX_PDQPRIORITY, a number that represents the total fraction of PDQ resources available to any one DSS query. For a complete description of the PDQ management tools available to the administrator, refer to Chapter 19, "Parallel Database Query."

Archiving

If you use ON-Archive, you can exercise very specific control over the dbspaces archived. By carefully allocating like entities to similar dbspaces, you can create an efficient archive schedule. One tactic is to avoid archiving index-only dbspaces. Generally, indexes can be reconstructed as needed from the base data. In addition, arrange a schedule that archives active dbspaces more frequently than less dynamic ones. By giving some thought to the nature of individual dbspaces, you can design an archive strategy that balances a quick recovery with a minimal archiving time.

Bulk Loads

When you need to load large amounts of data into a table, consider ways to reduce the overhead. Any of the following procedures could improve performance or, at the least, minimize the use of limited system resources such as locks:

Drop indexes to save shuffling of the B+ tree index structure as it attempts to stay balanced.

Lock the table in exclusive mode to conserve locks.

Turn off logging for the database to avoid writing each insert to the logical logs and perhaps creating a dangerous long transaction.

Be sure to restore the database or table to its original state after the load is finished.

In-Place ALTER TABLE

Starting with OnLine version 7.2, ALTER TABLE statements no longer necessarily rebuild a table when executed. If a column is added to the end of the current column list, an in-place ALTER TABLE operation is performed. With this mechanism, the table is rewritten over time. Inserts of new rows are written with the updated format, but an existing row is rewritten only when it is updated. As a result, a small amount of additional overhead is required to perform this conversion. Although the in-place ALTER TABLE is generally efficient, you might find it useful to force the table to be rebuilt explicitly when you issue the ALTER TABLE statement. Including the BEFORE clause in the ALTER TABLE statement ensures that a rebuild occurs. By forcing an immediate rebuild, you can avoid the ongoing update overhead.

Tuning Your Informix Application

Application programs generally contain numerous components: procedural statements intermingled with various embedded SQL commands. Foremost in tuning an application is identifying the element that is slow. Often, users do this work for you. A query that previously was fast is suddenly slow, or a report takes too long to run. When you start trying to isolate the specific bottleneck, recognize that it is almost never anything other than a database operation.

TIP: If an Informix-based application program is slow, the culprit is an SQL statement.

When an application is generally slow, you need to peer inside it as it runs to identify the bottleneck. Two monitoring tools are especially useful to help you with this job. The first is onstat -g sql:

onstat -g sql sesid -r interval

With the preceding command, you can take a series of snapshots of the SQL statement currently being run for a given session. Generally, a single statement will emerge as the one that needs attention.

The second important tool is xtree. Normally, xtree is invoked as a component of the performance monitoring tool onperf. With xtree, you can examine the exact execution path of a query in progress and track its joins, sorts, and scans.

Given that most application performance tuning will address making queries more efficient, understanding how Informix analyzes and executes them is important.

The Cost-Based Optimizer

Informix employs a cost-based optimizer. This means that the database engine calculates all the paths--the query plans--that can fulfill a query. A query plan includes the following:

Table evaluation order
Join methods
Index usage
Temporary table creation
Parallel data access
Number of threads required

The engine then assigns a cost to each query plan and chooses the plan with the lowest cost. The cost assignment depends on several factors, enumerated in the next section, chief of which is accurate data distribution statistics. Statistics on data distributions are not maintained in real time; in fact, they are updated only when you execute UPDATE STATISTICS. It is critical that statistics be updated in a timely fashion, especially after major insert or delete operations.

Query Plan Selection

To calculate the cost of a query plan, the optimizer considers as much of the following data as is available (certain of these values are not stored for SE):

How many rows are in the table
The distribution of the values of the data
The number of data pages and index pages with values
The number of B+ tree levels in the index
The second-largest and second-smallest values for an indexed column
The presence of indexes, whether they are clustered, their order, and the fields that comprise them
Whether a column is forced via a constraint to be unique
Whether the data or indexes are fragmented across multiple disks
Any optimizer hints: the current optimization level and the value of OPTCOMPIND

Of these factors, the first five are updated only with the UPDATE STATISTICS statement. Based on the query expression, the optimizer anticipates the number of I/O requests mandated by each type of access, the processor work necessary to evaluate the filter expressions, and the effort required to aggregate or order the data.

Understanding Query Plans

The SQL statement SET EXPLAIN ON tells Informix to record the query plans it selects in a file named sqexplain.out. The directive stays in effect for the duration of the current session, or until you countermand it via SET EXPLAIN OFF. Because the sqexplain.out file continually grows as new query plans are appended to it, you should generally toggle SET EXPLAIN ON only long enough to tune a query and then turn it off again. Additionally, a small amount of overhead is required to record the query plans.

Some sample excerpts from sqexplain.out follow, with line-by-line explanations.

Estimated Cost: 80234

The cost is in arbitrary disk access units and is generally useful only to compare alternative plans for the same query.
A lower cost for different access methods for the same query is usually an accurate prediction that the actual query will be faster.

Estimated # of Rows Returned: 26123

When the data distributions are accurate, this estimated number can be very close to the actual number of rows that eventually satisfy the query.

Temporary Files Required For: Group By

Temporary files are not intrinsically bad, but if Informix must keep re-creating the same one to handle a common query, it could be a signal that you should create an index on the GROUP BY columns. Notice that not all GROUP BY operations can be handled with an index. For example, if a GROUP BY clause includes columns from more than one table or includes derived data, no index can be used.

1) informix.orders: SEQUENTIAL SCAN

2) informix.customers: INDEX PATH

    (1) Index Keys: cust_no
        Lower Index Filter: informix.customers.cust_no = informix.orders.cust_no

In the preceding example, the optimizer chooses to examine the orders table first via a sequential scan. Then it joins orders rows to customers rows using the index on customers.cust_no.

SET EXPLAIN can reveal myriad variations of query plans. You should examine the output from several queries to familiarize yourself with the various components of sqexplain.out. When you're tuning specific queries, spending your time examining query plans is critical. Look for sequential scans late in the process. If the table being scanned is large, a late sequential scan is probably a sign of trouble and might merit an index. Look for any failure to use indexes that should be used; look for data scans when key-only reads make sense; look for high relative costs; look for unreasonable index choices.

Experience here counts. Part of that experience must include understanding the join methods available to the database engine.

Join Methods

When Informix must join tables, it can choose any of three algorithms. All joins are two-table joins; multi-table joins are resolved by joining initial resultant sets to subsequent tables in turn. The optimizer chooses which join method to use based on costs, except when you override this decision by setting OPTCOMPIND.

Nested Loop Join: When the join columns on both tables are indexed, this method is usually the most efficient. The first table is scanned in any order. For each row, the matching rows in the second table are looked up via the index on the join column to form resultant rows. Occasionally, Informix will construct a dynamic index on the second table to enable this join. These joins are often the most efficient for OLTP applications.

Sort Merge Join: After filters are applied, the database engine scans both tables in the order of the join filter. Both tables might need to be sorted first. If an index exists on the join column, no sort is necessary. This method is usually chosen when either or both join columns do not have an index. After the tables are sorted, joining is a simple matter of merging the sorted values.

Hash Join: Available starting in version 7, the hash join first scans one table and puts its hashed key values in a hash table. The second table is then scanned once, and its join values are looked up in the hash table. Hash joins are often faster than sort merge joins because no sort is required. Even though creating the hash table requires some overhead, with most DSS applications in which the tables involved are very large, this method is usually preferred.

NOTE: The hash table is created in the virtual portion of shared memory. Any values that cannot fit will be written to disk. Be sure to set DBSPACETEMP to point to enough temporary space to accommodate any overflow.
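As a rough configuration sketch, assuming two temporary dbspaces named tmpdbs1 and tmpdbs2 have already been created (the names are hypothetical, and the separator accepted can vary by platform and version):

DBSPACETEMP tmpdbs1,tmpdbs2         # entry in the onconfig file
export DBSPACETEMP=tmpdbs1:tmpdbs2  # or as an environment variable in a Bourne-style shell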

Influencing the Optimizer

Much of how you can influence the optimizer depends on your constructing queries that are easily satisfied. Nonetheless, you can set two specific parameters to influence the OnLine optimizer directly.

For version 7, you can set the OPTCOMPIND (OPTimizer COMPare INDex methods) parameter to influence the join method OnLine chooses. You can override the onconfig default of 2 by setting it as an environment variable. OPTCOMPIND is used only when OnLine is considering the order of joining the two tables in a join pair to each other: Should it join table A to B, or should it join table B to A? And, when it makes the decision, is it free to consider a dynamic-index nested loop join as one of the options? The choices for OPTCOMPIND are as follows:

0--Only consider the index paths. Prefer nested loop joins to the other two methods. This setting forces the optimizer to behave as in earlier releases.

1--If the isolation level is Repeatable Read, act as if OPTCOMPIND were 0. Otherwise, act as if OPTCOMPIND were 2. The danger with the Repeatable Read isolation level is that table scans, such as those performed with sort merge and hash joins, could lock all records in the table.

2--Use costs to determine the join methods. Do not give preference to nested loop joins over table scans.

These options are admittedly obscure. If you choose to tune this parameter, first try the following tip.

TIP: For OLTP applications, set OPTCOMPIND to 0. For DSS applications, set OPTCOMPIND to 1.
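Either setting can be expressed directly; a minimal sketch of both forms follows (a Bourne-style shell is assumed for the environment variable):

OPTCOMPIND 0           # onconfig entry: the engine-wide default
export OPTCOMPIND=0    # environment variable: overrides the onconfig value for clients started from this environment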

OPTCOMPIND is not used with INFORMIX-XPS. XPS always chooses the join method based solely on cost.

You can explicitly set the optimization level with SET OPTIMIZATION LOW. The default, and only other choice for SET OPTIMIZATION, is HIGH. Normally, the cost-based optimizer examines every possible query path and applies a cost to each. With SET OPTIMIZATION LOW, OnLine eliminates some of the less likely paths early in the optimization process, and as a result saves some time in this step. Usually, the optimization time for a stand-alone query is insignificant, but on complex joins (five tables or more), it can be noticeable. Generally, the best result you can expect is that the optimizer will choose the same path it would have taken with SET OPTIMIZATION HIGH but will find it quicker.

Optimizing SQL

Identifying which process is slow is half the tuning battle. Understanding how Informix optimizes and performs the queries is the other half. With those facts in hand, tuning individual queries is generally a matter of persuading Informix to operate as efficiently as it can. The following suggestions offer some specific ways of doing that.

UPDATE STATISTICS

By now, this refrain should be familiar. If Informix seems to be constructing an unreasonable query plan, perhaps the internal statistics are out of date. Run the UPDATE STATISTICS command.

Eliminate Fragments

With OnLine-DSA, tables and indexes can be fragmented across multiple disks. One way to accomplish this horizontal partitioning is to create the table or index with a FRAGMENT BY EXPRESSION scheme. Consider this example:

CREATE TABLE orders
  (order_no SERIAL,
   order_total MONEY(8,2))
FRAGMENT BY EXPRESSION
  order_no >= 0 AND order_no < 5000 IN dbspace1,
  order_no >= 5000 AND order_no < 10000 IN dbspace2,
  order_no >= 10000 IN dbspace3;

A query such as

SELECT SUM(order_total) FROM orders WHERE order_no BETWEEN 6487 AND 7212;

can be satisfied wholly with the data in dbspace2. The optimizer recognizes this and spawns a scan thread only for that fragment. The savings in disk access when fragment elimination occurs can be considerable. Additionally, contention between users can be significantly reduced as they compete less for individual disks. For a complete explanation of this topic, refer to Chapter 20.

Change the Indexing

Be guided by the optimizer. If it suggests an auto-index, add a permanent one. If it continues to create a temporary table, try to construct an index to replace it. If a very wide table is scanned often for only a few values, consider creating an artificial index solely to enable key-only reads. If a sequential scan is occurring late in the query plan, look for ways that an index can alter it, perhaps by indexing a column on that table that is used for a filter or a join. Indexes allow you to experiment without a large investment. Take advantage of this fact and experiment.

Use Explicit Temp Tables

Sometimes a complex query takes a tortuous path to completion. By examining the query path, you might be able to recognize how a mandated intermediate step would be of value. You can often create a temporary table to guarantee that certain intermediate steps occur.

When you use explicit temporary tables in this way, create them using WITH NO LOG to avoid any possibility of logging. Indexing temporary tables and running UPDATE STATISTICS on them are also legal. Examine whether either of these operations might be worthwhile.

Select Minimal Data

Keep your communication traffic small.
Internal program stacks, fetch buffers, and cache buffers all operate more efficiently when less data is sent. Therefore, select only the data that you need. In particular, do not select an aggregate or add an ORDER BY clause when one is not needed.

Avoid Non-Initial Substring Searches

Indexes work left to right from the beginning of a character string. If the initial value is not supplied in a filter, an index cannot be used. For example, no index can be used for any of the following selection criteria:

WHERE last_name MATCHES "*WHITE"
WHERE last_name[2,5] = "MITH"
WHERE last_name LIKE "%SON%"

Rewrite Correlated Subqueries

A subquery is a query nested inside the WHERE clause of another query. A correlated subquery is one in which the evaluation of the inner query depends on a value in the outer query. Here is an example:

SELECT cust_no FROM customers
 WHERE cust_no IN
   (SELECT cust_no FROM orders
     WHERE order_date > customers.last_order_date);

This subquery is correlated because it depends on customers.last_order_date, a value from the outer query. Because it is correlated, the subquery must execute once for each unique value from the outer SELECT. This process can take a long time. Occasionally, correlated subqueries can be rewritten to use a join. For example, the preceding query is equivalent to this one:

SELECT c.cust_no FROM customers c, orders o
 WHERE c.cust_no = o.cust_no
   AND o.order_date > c.last_order_date;

Usually, the join is faster. In fact, INFORMIX-XPS can do this job for you on occasion. Part of its optimization includes restructuring subqueries to use joins when possible.

Sacrificing a Goat (or Overriding the Optimizer)

Wave the computer over your head three times in a clockwise direction. If that fails, and you are desperate, you might try these arcane and equally disreputable incantations. The Informix optimizer has continually improved over the years, but it is still not foolproof. Sometimes, when the query plan it has constructed is simply not the one you know it should be, you can try underhanded ways to influence it. Be aware, though, that some of the techniques in this section work only in older versions of Informix, and recent versions of the optimizer might even negate your trick (such as stripping out duplicate filters) before constructing a query plan.

CAUTION: Trying these techniques will get you laughed at. And they probably won't work.

Rearrange Table Order

Put the smallest tables first. The order of table evaluation in constructing a query plan is critical. Exponential differences in performance can result if the tables are scanned in the wrong order, and sometimes the optimizer is unable to differentiate between otherwise equal paths. As a last resort, the optimizer looks at the order in which the tables are listed in the FROM clause to determine the order of evaluation.

Complete a Commutative Expression

Completing a commutative expression means explicitly stating all permutations of equivalent expressions. Consider the following statement:

SELECT c.cust_no, o.order_status FROM customers c, orders o
 WHERE c.cust_no = o.cust_no
   AND o.cust_no < 100;

The optimizer might select an index on orders.cust_no and evaluate that table first. Perhaps you recognize that selecting the customers table first should result in a speedier query. You could include the following line with the preceding query to give the optimizer more choices:

AND c.cust_no < 100

The optimizer might then change its query plan. The same idea applies to join filters. Consider the following statement:

SELECT r.* FROM customers c, orders o, remarks r
 WHERE c.cust_no = o.cust_no
   AND o.cust_no = r.cust_no;

Older versions of the optimizer would not consider that all customer numbers are equal. By stating it explicitly, as follows, you offer the optimizer more ways to satisfy the query:

AND c.cust_no = r.cust_no

Duplicate an Important Filter

Without duplicating the filter in the following query, the optimizer first suggests a query plan with a sequential scan. Indexes exist on orders.cust_no, customers.cust_no, and customers.last_name. The output from sqexplain.out follows the query.

SELECT o.order_no FROM customers c, orders o
 WHERE c.cust_no = o.cust_no
   AND c.last_name MATCHES "JON*";

1) informix.o: SEQUENTIAL SCAN

2) informix.c: INDEX PATH

    Filters: informix.c.last_name MATCHES `JON*'

    (1) Index Keys: cust_no
        Lower Index Filter: informix.c.cust_no = informix.o.cust_no

One trick is to duplicate the filter on last_name to tell the optimizer how important it is. In this case, it responds by suggesting two indexed reads:

SELECT o.order_no FROM customers c, orders o
 WHERE c.cust_no = o.cust_no
   AND c.last_name MATCHES "JON*"
   AND c.last_name MATCHES "JON*";

1) informix.c: INDEX PATH

    Filters: informix.c.last_name MATCHES `JON*'

    (1) Index Keys: last_name
        Lower Index Filter: informix.c.last_name MATCHES `JON*'

2) informix.o: INDEX PATH

    (1) Index Keys: cust_no
        Lower Index Filter: informix.o.cust_no = informix.c.cust_no

You have no guarantee that the second method will actually execute faster, but at least you will have the opportunity to find out.

Add an Insignificant Filter

For the following query, Informix uses the index on cust_no instead of order_no and creates a temporary table for the sort:

SELECT * FROM orders
 WHERE cust_no > 12
 ORDER BY order_no;

In this instance, perhaps you decide that the index on cust_no is not very discriminatory and should be ignored so that the index on order_no can be used for a more efficient sort. Adding the following filter does not change the data returned because every order_no is greater than 0:

AND order_no > 0

However, adding this filter might force the optimizer to select the index you prefer.

Avoid Difficult Conjunctions

Some versions of the optimizer cannot use an index for certain expressions that combine conditions with OR. At such times, using a UNION clause, instead of OR, to combine results is more efficient.
For example, if you have an index on customers.cust_no and on customers.last_name, the following UNION-based expression can be faster than the OR-based one:

SELECT last_name, first_name FROM customers
 WHERE cust_no = 53 OR last_name = "JONES";

SELECT last_name, first_name FROM customers
 WHERE cust_no = 53
UNION
SELECT last_name, first_name FROM customers
 WHERE last_name = "JONES";

In the preceding examples, the optimizer might choose to use each index once for the UNION-based query but neither index for the OR-based expression.

Optimizing Application Code

Especially in OLTP systems, the performance of application code is crucial. DSS environments often run more "naked" queries and reports, where the specific queries are apparent. With languages such as ESQL/C and INFORMIX-4GL that can have embedded SQL statements, it is often unclear which statement is slow and, furthermore, how to make it faster. When you're examining a piece of slow code, assume first that an SQL statement is the bottleneck. The performance differences of non-SQL operations are generally overshadowed. Although a linked list might be microseconds slower than an array, for example, SQL operations take milliseconds, at least. Spend your time where it is most fruitful: examine the SQL.

Identify the Culprit

One way to study the query plans of embedded SQL commands is to include an option that invokes SET EXPLAIN ON as a runtime directive. Within the code, check for the existence of an environment variable that can be toggled by the user. For example, consider this INFORMIX-4GL code:

IF (fgl_getenv("EXPLAIN_MODE") = "ON") THEN
    SET EXPLAIN ON
END IF

By placing code such as this at the beginning of your 4GL MAIN routine, you can enable SET EXPLAIN exactly when you need it.

Extract Queries

Queries buried deep within complex application code can be difficult to optimize. It is often beneficial to extract the query and examine it in isolation. With DBaccess, you can give a troublesome query special treatment, using SET EXPLAIN ON to examine the query plan. Performing many iterations of modifying a query with DBaccess is much easier than it is when the statement is embedded within many layers of application code.

Prepare SQL Statements

When an SQL statement gets executed on-the-fly, as through DBaccess, the database engine does the following:

1. Checks the syntax

2. Validates the user's permissions

3. Optimizes the statement

4. Executes the statement

These actions require reading a number of system tables and incur considerable overhead when performed often. For very simple statements, steps 1 through 3 can take longer than step 4. Yet for an application, only step 4, executing the statement, is needed for each iteration. The PREPARE statement allows the database to parse, validate, and assemble a query plan for a given statement only once. After it does so, it creates an internal statement identifier that you can use as a handle to execute the statement repeatedly.
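A minimal INFORMIX-4GL sketch of the idea follows; the statement text, variables, and table names are hypothetical, and error handling is omitted:

PREPARE upd_bal FROM "UPDATE customers SET acct_balance = ? WHERE cust_no = ?"
FOR i = 1 TO num_custs
    EXECUTE upd_bal USING new_balance[i], cust_nos[i]  # prepared once, executed many times
END FOR
FREE upd_bal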

TIP: Use PREPARE to create efficient handles for commonly used SQL statements.

Often used to construct dynamic SQL statements at runtime, the PREPARE statement can significantly help performance as well. The first place to look for good candidates to use with PREPARE is inside loops. Consider the following theoretical fragment of INFORMIX-4GL code:

DECLARE good_cust_cursor CURSOR FOR
    SELECT cust_no FROM customers WHERE acct_balance