TM 5-1 Copyright © Addison Wesley Longman, Inc. & Dr. Chen, Business Database Systems Chapter 5: Physical Database Design and Performance Jason C. H. Chen,

TM 5-1Copyright © Addison Wesley Longman, Inc. & Dr. Chen, Business Database Systems

Chapter 5:Physical Database Design and

Performance

Jason C. H. Chen, Ph.D.

Professor of MIS

School of Business Administration

Gonzaga University

Spokane, WA 99258

[email protected]


Objectives• Definition of terms• Describe the physical database design process• Choose storage formats for attributes• Select appropriate file organizations• Describe three types of file organization• Describe indexes and their appropriate use• Translate a database model into efficient structures• Know when and how to use denormalization


Physical Database Design

• Purpose–translate the logical description of data into the technical specifications for storing and retrieving data

• Goal–create a design for storing data that will provide adequate performance and insure database integrity, security, and recoverability


Physical Design Process

Normalized relations

Volume estimates

Attribute definitions

Response time expectations

Data security needs

Backup/recovery needs

Integrity expectations

DBMS technology used

Inputs

Attribute data types

Physical record descriptions (doesn’t always

match logical design)

File organizations

Indexes and database architectures

Query optimization

Leads to

Decisions

TM 5-5Copyright © Addison Wesley Longman, Inc. & Dr. Chen, Business Database Systems5

Figure 5-1 Composite usage map (Pine Valley Furniture Company)


What will we learn fromUsage Map?

• Why should Manufacutred_Part and Purchased_Part tables be separated?

• Why should SUPPLIER and SUPPLIES tables be combined?


Figure 5-1 Composite usage map (Pine Valley Furniture Company) (cont.)

Database access frequencies are estimated from “transaction volumes.”

Data (transaction)volumes



Access Frequencies (per hour)



Usage analysis:14,000 purchased parts accessed per hour 8000 quotations accessed from these 140 purchased part accesses 7000 suppliers accessed from these 8000 quotation accesses



Usage analysis:7500 suppliers accessed per hour 4000 quotations accessed from these 7500 supplier accesses 4000 purchased parts accessed from these 4000 quotation accesses


Designing Fields

• Field: smallest unit of data in database

• Field design –Choosing data type–Coding, compression, encryption–Controlling data integrity


Choosing Data Types

• CHAR–fixed-length character• VARCHAR2–variable-length character (memo)• LONG–large number• NUMBER–positive/negative number• INEGER–positive/negative whole number• DATE–actual date• BLOB–binary large object (good for graphics,

sound clips, etc.)


Figure 5-2 Example code look-up table(Pine Valley Furniture Company)

Code saves space, but costs an additional lookup to obtain actual value


Field Data Integrity• Default value–assumed value if no explicit value

– e.g., s_state CHAR(2) DEFAULT ‘WI’;• Range control–allowable value limitations

(constraints or validation rules)• Null value control–allowing or prohibiting empty

fields• Referential integrity–range control (and null value

allowances) for foreign-key to primary-key match-ups

Sarbanes-Oxley Act (SOX) legislates importance of financial data integrity


Handling Missing Data

• Substitute an estimate of the missing value (e.g., using a formula)

• Construct a report listing missing values• In programs, ignore missing data unless the value

is significant (sensitivity testing)

Triggers can be used to perform these operations


Physical Records• Physical Record: A group of fields stored in

adjacent memory locations and retrieved together as a unit

• Page: The amount of data read or written in one I/O operation

• Blocking Factor: The number of physical records per page


Denormalization• Transforming normalized relations into unnormalized physical

record specifications• Benefits:

– Can improve performance (speed) by reducing number of table lookups (i.e. reduce number of necessary join queries)

• Costs (due to data duplication)– Wasted storage space– Data integrity/consistency threats

• Common denormalization opportunities– One-to-one relationship (Fig. 5-3)– Many-to-many relationship with attributes (Fig. 5-4)– Reference data (1:N relationship where 1-side has data not used in any

other relationship) (Fig. 5-5)


Figure 5-3 A possible denormalization situation: two entities with one-to-one relationship

mandatory optional


Figure 5-4 A possible denormalization situation: a many-to-many relationship with nonkey attributes

Extra table access required

Null description possible


Figure 5-5A possible denormalization situation:reference data

Extra table access required

Data duplication


Partitioning• Horizontal Partitioning: Distributing the rows of a table

into several separate files– Useful for situations where different users need access

to different rows– Three types: Key Range Partitioning, Hash

Partitioning, or Composite Partitioning• Vertical Partitioning: Distributing the columns of a table

into several separate relations– Useful for situations where different users need access

to different columns– The primary key must be repeated in each file

• Combinations of Horizontal and Vertical

Partitions often correspond with User Schemas (user views)


Partitioning (cont.)• Advantages of Partitioning:

– Efficiency: Records used together are grouped together– Local optimization: Each partition can be optimized for

performance– Security, recovery– Load balancing: Partitions stored on different disks,

reduces contention– Take advantage of parallel processing capability

• Disadvantages of Partitioning:– Inconsistent access speed: Slow retrievals across partitions– Complexity: Non-transparent partitioning– Extra space or update time: Duplicate data; access from

multiple partitions

TM 5-23Copyright © Addison Wesley Longman, Inc. & Dr. Chen, Business Database Systems4-23

Figure: Magnetic Disk Components

This figure shows how data are recorded on magnetic disks.


Oracle 11g Horizontal Partitioning Methods

• Range partitioning– Partitions defined by range of field values– Could result in unbalanced distribution of rows– Like-valued fields share partitions

• Hash partitioning– Partitions defined via hash functions– Will guarantee balanced distribution of rows– Partition could contain widely varying valued fields

• List partitioning– Based on predefined lists of values for the partitioning key

• Composite partitioning– Combination of the other approaches


Designing Physical Files

• Physical File: – A named portion of secondary memory allocated for

the purpose of storing physical records– Tablespace–named set of disk storage elements in

which physical files for database tables can be stored– Extent–contiguous section of disk space

• Constructs to link two pieces of data:– Sequential storage– Pointers : field of data that can be used to locate

related fields or records


Figure 5-6 DBMS terminology in an Oracle 11g environment


File Organizations• Technique for physically arranging records of a file on

secondary storage• Factors for selecting file organization:

– Fast data retrieval and throughput– Efficient storage space utilization– Protection from failure and data loss– Minimizing need for reorganization– Accommodating growth– Security from unauthorized use

• Types of file organizations– Sequential (Fig. 5-7a): the most efficient with storage space.– Indexed (Fig. 5-7b) : quick retrieval

• Join Index, Fig 5-8– Hashed (Fig. 5-7c) : easiest to update


Figure 5-7a Sequential file organization

If not sortedAverage time to find desired record = n/2

1

2

n

Records of the file are stored in sequence by the primary key field values

If sorted – every insert or delete requires resort


Indexed File Organizations• Indexed File Organization: the storage of records

either sequentially or nonsequentially with an index that allows software to locate individual records

• Index: a table or other data structure used to determine in a file the location of records that satisfy some condition

• Primary keys are automatically indexed• Other fields or combinations of fields can also be

indexed; these are called secondary keys (or nonunique keys)

• Oracle has a CREATE INDEX operation, and MS ACCESS allows indexes to be created for most field types


Figure 5-7b Indexed file organization

uses a tree searchAverage time to find desired record = depth of the tree


Figure 5-7cHashed file or index organization

Hash algorithmUsually uses division-remainder to determine record position. Records with same position are grouped in lists


Figure 5-8 Join Indexes–speeds up join operations

a) Join index for common non-key columns

b) Join index for matching foreign key (FK) and primary key (PK)


Understand their general concepts


Clustering Files

• In some relational DBMSs, related records from different tables can be stored together in the same disk area

• Useful for improving performance of join operations

• Primary key records of the main table are stored adjacent to associated foreign key records of the dependent table

• e.g. Oracle has a CREATE CLUSTER command


Cluster• Clustering is a method of storing tables that are

intimately related and often joined together into the same area on disk. For example, instead of the WORKER table being in one section of the disk and the WORKERSKILL table being somewhere else, their rows could be interleaved together in a single area, called a cluster.

• The cluster key is the field (or fields) by which the table are usually joined in a query (e.g., Worker_ID in the example). To cluster tables, you must own the tables you are going to cluster together.


Fig. 4-10: Mapping an entity with a multivalued attribute

(a) Employee entity type with multivalued attribute

(b) Mapping a multivalued attributeMultivalued attribute becomes a separate relation with foreign key

[Two relations created with one containing all of the attributes except the multi-valued attribute, and the second one contains the pk (on the first one) and the multi-valued attribute]

One–to–many relationship between original entity and new relation

WORKER

Worker_ID Name LodgingAge

Worker_ID Skill Ability

WORKERSKILL

WORKERWorker_IDNameAgeLoding{Skill, Ability}


The following is the basic syntax of the create cluster command:

CREATE CLUSTER cluster_name (field-1 datatype, Field-2 datatype …)

CREATE CLUSTER WORKERandSKILL (Judy VARCHAR2(25));

Cluster created.

This creates a cluster (a space is set aside, as it would be for a table) with nothing in it. The use of Worker_Cluster for the cluster key is irrelevant; you’ll never use it again.

Next, tables are created to be included in this cluster.


CREATE TABLE WORKER (Worker_ID NUMBER(2),Name VARCHAR2 (25) NOT NULL,Age NUMBER (2),Lodging VARCHAR2 (15))CLUSTER WORKERandSKILL (Worker_ID);

Prior to inserting rows into WORKER, you must create a cluster index:

CREATE INDEX WORKERandSKILL_NDXON CLUSTER WORKERandSKILL;Notice that nowhere does either statement say explicitly that the Worker_ID column goes into the July cluster key. The mentioned in their respective cluster statements. Multiple columns and cluster keys are matched first to first, second to second, third to third, and so on. Now a second table is added to the cluster.


CREATE TABLE WORKERSKILL (Worker_ID NUMBER(2),Skill VARCHAR2(25) NOT NULL,Ability VARCHAR2 (15) NOT NULL) CLUSTER WORKERandSKILL (Worker_ID);

We will investigate how the tables are stored in the cluster.Recall that WORKER table has four fields: Worker_ID, Name, Age, and Lodging. The WORKERSKILL table has three fields: Worker_ID, Skill, and Ability. When these two tables are clustered, each unique Worker_ID is actually stored only once, in the cluster key. To each Worker_ID are attached the columns from both of these tables.

The data from both of these tables is actually stored in a single location, almost as if the cluster where a big table containing data drawn from both of the tables that make it up.


AGE LODGING NAME WORKER_ID SKILL ABILITY

2329221816434155153327

21253335326726

25

PAPA KINGROSE HILLCRANMERROSE HILLMATTSWEITBROCHTROSE HILLPAPA KING

MATTSROSE HILL

ROSE HILLROSE HILLCRANMERWEITBROCHTMATTSMULLERS

CRANMER

ADAH TALBOTANDREW DYEBART SARJEANTDICK JONESDONALD ROLLOELBERT TALBOTGEORGE OSCARGERHARDT KENTGENHELEN BRANDTJED HOPKINSJOHN PEARSON

PALMER WALLBOMPAT LAVAYPETER LAWSONRICHARD KOCH ROLAND BRANDTVICTORIA LYNNWILFRED LOWELL

WILLIAM SWING

1234567891011

12131415161718

19

PHOTO

SMITHY

COMPUTER

COMBINE DRIVER

COMBINE DRIVRWOODCUTTERSMITHY

SMITHYPHOTOCOMPUTER

GOOD

EXCELLENT

SLOW

VERY FAST

GOODAVERAGEGOOD

PRECISEAVERAGEAVERAGE

Figure: How data is stored in clusters

from WORKER tablefrom WORKERSKILL table

CLUSTER key


AGE LODGING NAME WORKER_ID SKILL ABILITY

2329221816434155153327

21253335326726

25

PAPA KINGROSE HILLCRANMERROSE HILLMATTSWEITBROCHTROSE HILLPAPA KING

MATTSROSE HILL

ROSE HILLROSE HILLCRANMERWEITBROCHTMATTSMULLERS

CRANMER

ADAH TALBOTANDREW DYEBART SARJEANTDICK JONESDONALD ROLLOELBERT TALBOTGEORGE OSCARGERHARDT KENTGENHELEN BRANDTJED HOPKINSJOHN PEARSON

PALMER WALLBOMPAT LAVAYPETER LAWSONRICHARD KOCH ROLAND BRANDTVICTORIA LYNNWILFRED LOWELL

WILLIAM SWING

1234567891011

12131415161718

19

PHOTO

SMITHY

COMPUTER

COMBINE DRIVER

COMBINE DRIVRWOODCUTTERSMITHY

SMITHYPHOTOCOMPUTER

GOOD

EXCELLENT

SLOW

VERY FAST

GOODAVERAGEGOOD

PRECISEAVERAGEAVERAGE

Figure: How data is stored in clusters

Why we do not simply create a table like the one shown above?


Optimizing Query Processing -A Quick Way to Speed Access to Your data

• Indexes may be used to improve the efficiency of data searches to meet particular search criteria after the table has been in use for some time. Therefore, the ability to create indexes quickly and efficiently at any time is important.

• SQL indexes can be created on the basis of any selected attributes


Optimizing Query Processing (conti.)

For example:SELECT index_name FROM user_indexes;

SELECT *FROM inventoryWHERE inv_qoh>= 130;

CREATE INDEX qoh_indexON inventory (inv_qoh);

Note that this statement defines an index called qoh_index for the qoh column in the inventory table. This index ensures that in the next example SQL only needs to look at row in the database that satisfy the WHERE condition, and is, therefore, quicker to produce an answer.


SQL> SELECT * 2 FROM inventory 3 WHERE inv_qoh>= 130;

INV_ID ITEM_ID COLOR INV_SIZE INV_PRICE INV_QOH---------- ---------- -------------------- ---------- ---------- ---------- 3 3 Khaki S 29.95 150 4 3 Khaki M 29.95 147 6 3 Navy S 29.95 139 7 3 Navy M 29.95 137 9 4 Eggplant S 59.95 135 10 4 Eggplant M 59.95 168 11 4 Eggplant L 59.95 187 19 5 Bright Pink 10 15.99 148 20 5 Bright Pink 11 15.99 137 21 5 Bright Pink 12 15.99 134

10 rows selected.

CREATE INDEX qoh_index ON inventory (inv_qoh);

SQL> SELECT * 2 FROM inventory 3 WHERE inv_qoh>= 130;

INV_ID ITEM_ID COLOR INV_SIZE INV_PRICE INV_QOH---------- ---------- -------------------- ---------- ---------- ---------- 21 5 Bright Pink 12 15.99 134 9 4 Eggplant S 59.95 135 7 3 Navy M 29.95 137 20 5 Bright Pink 11 15.99 137 6 3 Navy S 29.95 139 4 3 Khaki M 29.95 147 19 5 Bright Pink 10 15.99 148 3 3 Khaki S 29.95 150 10 4 Eggplant M 59.95 168 11 4 Eggplant L 59.95 187

10 rows selected.


Optimizing Query Processing (conti.)

You may even create an index that prevents you from using a value that has been used before. Such a feature is especially useful when the index field (attribute) is a primary key whose values must not be duplicated:

CREATE UNIQUE INDEX <index_field>ON <tablename> (the key field);

SELECT index_name FROM user_indexes;DROP INDEX <index_name>;


Summary on Optimizing Query Processing (conti.)

hThe indexes are defined so as to optimize the processing of SELECT statements. ;

hAn index is never explicitly referenced in a SELECT statement; the syntax of SQL does not allow this;

hDuring the processing of a statement, SQL itself determines whether an existing index will be used;

hAn index may be created or deleted at any time;

hWhen updating, inserting or deleting rows, SQL also maintains the indexes on the tables concerned. This means that, on the one hand, the processing time for SELECT statements is reduced, while, on the other hand, the processing time for update statements (such as INSERT, UPDATE and DELETE) can increase.


Rules for Using Indexes

1. Use on larger tables2. Index the primary key of each table3. Index search fields (fields frequently in

WHERE clause)4. Fields in SQL ORDER BY and GROUP BY

commands5. When there are >100 values but not when

there are <30 values


Rules for Using Indexes (cont.)6. Avoid use of indexes for fields with long values;

perhaps compress values first7. If key to index is used to determine location of

record, use surrogate (like sequence nbr) to allow even spread in storage area

8. DBMS may have limit on number of indexes per table and number of bytes per indexed field(s)

9. Be careful of indexing attributes with null values; many DBMSs will not recognize null values in an index search


Query Optimization

• Parallel query processing–possible when working in multiprocessor systems

• Overriding automatic query optimization–allows for query writers to preempt the automated optimization

• Picking data block size–factors to consider include:– Block contention, random and sequential row

access speed, row size• Balancing I/O across disk controllers


Query Design Guidelines• 1. Understand how indexes are used• 2. Keep optimization statistics up-to-date• 3. Use compatible data types for fields and literals• 4. Write simple queries• 5. Break complex queries into multiple, simple parts• 6. Don’t use one query inside another• 7. Don’t combine a table with itself


Query Design Guidelines (cont.)

• 8. Create temporary tables for groups of queries• 9. Combine update operations• 10. Retrieve only the data you need• 11. Don’t have the DBMS sort without an index• 12. Learn!

– Learning to Learn and Learning to Change!• 13. Consider the total query processing time for ad

hoc queries


Dat

abas

e A

rch

itec

ture

s(M

odel

) (F

igu

re-E

xtra

)Legacy

Systems

Current

Technology

Data

Warehouses

multidimensional


Exercise and Homework• Exercises (p.234)

– #1 (index), 5 (storage), 8 (denormalization)

• HW (p.235) [usage map]– #19

Documents

TM 5-1 Copyright © Addison Wesley Longman, Inc. & Dr. Chen, Business Database Systems Chapter 5: Physical Database Design and Performance Jason C. H. Chen,