
Introduction to database-Normalisation


Page 1: Introduction to database-Normalisation

Introduction to Databases

Relational Database Design

Normalization

Ajit K Nayak, Ph.D.

Siksha O Anusandhan University


AKN/IDBII.2Introduction to databases

The Goal

The goal of relational database design is to generate a set of relation schemas that allows us to store information without unnecessary redundancy and to retrieve information easily and efficiently.

Redundancy: The Problem

Consider a relation schema

instDept (ID, name, salary, dept_name, building, budget)

Problems

For each instructor of the same department, the building and budget information is repeated.

If a new department is opened, the database cannot store its information until a new instructor is appointed to it.

Nothing in the schema assures that a department is housed in exactly one building and has exactly one budget.

Solution

The database design tries to avoid these problems using the concept of normalization.

Normalization is the technique of designing relation schemas in compliance with one of several normal forms.

Normal forms are well-defined rules for avoiding unnecessary redundancy and other anomalous conditions.

The normal forms, arranged by strictness from lowest to highest: 1NF, 2NF, 3NF, BCNF, 4NF, 5NF, 6NF.

Anomalies in Relational Database - I

A database that is not designed properly may exhibit the following anomalies.

Redundancy (repetition of information)

Repeated data wastes disk space.

studNum Address deptNum deptName Building

S21 Patna 5 CSIT C-Block

S22 Edinburgh 5 CSIT C-Block

S23 BBSR 4 MECH B-Block

S24 KolKata 4 MECH B-Block

S25 Manchester 1 PHY D-Block

Any change to a department's building information needs to be made in multiple records, which may lead to inconsistency if some records are missed.

Anomalies in Relational Database

Insertion Anomaly

If a new department is opened, its information cannot be inserted into the database until a student is admitted to the department.

Deletion Anomaly

If the last student of a department leaves the college and is deleted from the database, the department's information is also lost from the database.

All these problems arise from the faulty design of the database.

Therefore, a database should be designed using normalization techniques, which avoid redundancy and hence the anomalies it causes.
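The deletion anomaly can be made concrete with a small sketch. It reuses the student table from the redundancy slide; the tuple layout and the helper name `departments` are assumptions made only for this illustration.

```python
# Rows of the unnormalized table: (studNum, Address, deptNum, deptName, Building)
students = [
    ("S21", "Patna", 5, "CSIT", "C-Block"),
    ("S22", "Edinburgh", 5, "CSIT", "C-Block"),
    ("S23", "BBSR", 4, "MECH", "B-Block"),
    ("S24", "KolKata", 4, "MECH", "B-Block"),
    ("S25", "Manchester", 1, "PHY", "D-Block"),
]

def departments(rows):
    """Department facts that can still be read from the table."""
    return {(dept_num, dept_name, building) for _, _, dept_num, dept_name, building in rows}

before = departments(students)
# Delete the only PHY student: the PHY department disappears with it.
after = departments([row for row in students if row[0] != "S25"])
lost = before - after
print(sorted(lost))
```

The set `lost` holds the department facts that vanished as a side effect of deleting one student row.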

First Normal Form - I

A relation schema R is said to be in 1NF if the domains of all attributes in R are atomic.

A domain is atomic if its elements are indivisible units.

That is, according to 1NF there can be no sub-structure within a column, and the value of an attribute is never a set or a list of values.

Examples

Sub-structure: address (street, city, state, pin), regNo (SOAITERCSIT2016A101)

Set/list of values: multiple phone numbers, mail ids, names, etc.

First Normal Form - II

regNo (SOAITERCSIT2016A101): the department of a student can be found only by writing code that parses the value (extra programming!), i.e. the information is encoded in the program rather than in the data.

If this attribute is used as the primary key and the student changes department:

Code that interprets the old regNo of that student gives the wrong result.

The regNo needs to be changed everywhere it occurs, which is a difficult task.

However, in some domains entities have a complex structure, and forcing 1NF puts an extra burden on the programmer to write code that converts the data back and forth.

In fact, modern databases do support many non-atomic values.

Functional Dependency

Functional dependencies provide a formal methodology for evaluating whether a relation schema should be decomposed.

Notations used

Relation schema: r(R), where r is the relation name and R is the set of attributes. We write just R when the relation name is not important.

K: a superkey of r(R)

r used alone: an instance of relation r

Certain constraints exist on real-world data, for example:

Students and instructors are uniquely identified by their ID.

Each student and instructor has only one name.

Each instructor and student is (primarily) associated with only one department.

Super Key

An instance of a relation that satisfies all such real-world constraints is called a legal instance of the relation.

Superkey: a subset K of R is a superkey of r(R) if, for all pairs of tuples t1 and t2 in any legal instance of r, t1 ≠ t2 implies t1[K] ≠ t2[K].

That is, no two tuples in any legal instance of relation r(R) may have the same value on attribute set K.

A superkey uniquely identifies a tuple in r.

A functional dependency allows us to express constraints that uniquely identify the values of certain attributes.

Functional Dependency - I

Let x, y ⊆ R. An instance of r(R) satisfies the functional dependency x → y if, for all pairs of tuples t1 and t2 in the instance, t1[x] = t2[x] implies t1[y] = t2[y].

The functional dependency x → y holds on schema r(R) if every legal instance of r(R) satisfies it.

Functional dependency is a generalization of the key concept:

K is a superkey if, for every pair of tuples t1 and t2, t1[K] = t2[K] implies t1[R] = t2[R], i.e. t1 = t2.

That is, K ⊆ R is a superkey of r(R) if the functional dependency K → R holds on r(R), so K uniquely determines the tuples in r(R).

Example: FD

Consider the relation schema account(accNum, balance, brID).

It has functional dependencies such as:

accNum → balance, i.e. if t1[accNum] = t2[accNum], then t1[balance] = t2[balance]

accNum → brID

. . .

accNum → accNum, balance, brID

That is, accNum uniquely determines the tuples in the account relation.

Therefore accNum is a key of account.
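The definition of "an instance satisfies x → y" translates almost directly into code. The following is a minimal sketch: the function name `satisfies_fd` and the sample rows are invented for illustration. Note that a single instance can only refute an FD; whether an FD holds on the schema is a statement about all legal instances.

```python
def satisfies_fd(rows, lhs, rhs):
    """True if no two rows agree on `lhs` but differ on `rhs`."""
    seen = {}
    for row in rows:
        key = tuple(row[a] for a in lhs)
        value = tuple(row[a] for a in rhs)
        # setdefault stores the first value seen for this key and returns it
        if seen.setdefault(key, value) != value:
            return False
    return True

# Hypothetical instance of account(accNum, balance, brID)
account = [
    {"accNum": "A1", "balance": 500, "brID": "B1"},
    {"accNum": "A2", "balance": 500, "brID": "B1"},
    {"accNum": "A3", "balance": 900, "brID": "B2"},
]

print(satisfies_fd(account, ["accNum"], ["balance"]))   # True
print(satisfies_fd(account, ["balance"], ["accNum"]))   # False: A1 and A2 share balance 500
print(satisfies_fd(account, ["accNum"], ["accNum", "balance", "brID"]))  # True
```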

Example-II

Find the functional dependencies. Candidates to check:

A → B, A → C, A → D

B → A, C → A, D → A

A → A, B → B

AB → A, AB → B

The last two groups (A → A, B → B, AB → A, AB → B) are satisfied by all relations and are called trivial functional dependencies.

An FD of the form x → y in r(R) is said to be trivial if y ⊆ x, where x, y ⊆ R.

Closure of FD Set

A given set of FDs may logically imply further FDs.

For any FD set F, the set of all FDs that can be inferred from F is called the closure of F and is denoted by F+.

Example: Let r(A, B, C, D, E) and F = {A → D, D → B, B → C}.

Then F+ ⊇ {A → D, D → B, B → C, A → B, A → C, D → C}; the last three FDs are inferred by transitivity. (F+ also contains all trivial FDs and combinations such as A → BC.)

The rules used to find the closure of an FD set are called Armstrong's Axioms.

Rule 1: Reflexivity Rule

If y ⊆ x, then x → y holds.

Rule 2: Augmentation Rule

If x → y, then zx → zy holds.

Armstrong's rules contd.

Rule 3: Transitivity Rule

If x → y and y → z, then x → z holds.

Armstrong's rules are sound and complete, but to make finding the closure easier, some more rules are derived from these axioms.

Rule 4: Union Rule

If x → y and x → z, then x → yz holds.

Rule 5: Decomposition Rule

If x → yz, then x → y and x → z hold.

Rule 6: Pseudo-transitivity Rule

If x → y and yz → w, then xz → w holds.
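As a worked illustration (these derivations are standard, but are not spelled out on the slides), each of the three derived rules follows from Rules 1 to 3 in a few steps:

```latex
\begin{align*}
\textbf{Union:}\quad
  & x \to y \;\Rightarrow\; x \to xy      && \text{(augment with } x\text{)}\\
  & x \to z \;\Rightarrow\; xy \to yz     && \text{(augment with } y\text{)}\\
  & x \to xy,\ xy \to yz \;\Rightarrow\; x \to yz && \text{(transitivity)}\\[6pt]
\textbf{Decomposition:}\quad
  & yz \to y                              && \text{(reflexivity, } y \subseteq yz\text{)}\\
  & x \to yz,\ yz \to y \;\Rightarrow\; x \to y   && \text{(transitivity; likewise } x \to z\text{)}\\[6pt]
\textbf{Pseudo-transitivity:}\quad
  & x \to y \;\Rightarrow\; xz \to yz     && \text{(augment with } z\text{)}\\
  & xz \to yz,\ yz \to w \;\Rightarrow\; xz \to w && \text{(transitivity)}
\end{align*}
```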

Example: Finding F+

Let R = (A, B, C, G, H, I) and F = {A → B, A → C, CG → H, CG → I, B → H}. Find F+.

A → B and B → H give A → H (Transitivity)

CG → H and CG → I give CG → HI (Union)

A → C and CG → I give AG → I (Pseudo-transitivity)

Some members of F+:

F+ ⊇ {A → B, A → C, CG → H, CG → I, B → H, A → H, CG → HI, AG → I}

Attribute Closure

a → b: b is functionally determined by a.

How can we know whether a is a superkey? We must show that a functionally determines all other attributes.

One solution: compute F+, take all FDs with a as the LHS, and form the union of their RHSs. However, this is expensive because F+ can be large.

The attribute closure of x, written x+, is the set of all attributes of R that are functionally determined by x.

Attribute closure may be used to:

Find whether an attribute or a set of attributes is a superkey: if x+ = R, then x is a superkey of r(R) (and a candidate key if no proper subset of x has this property).

Determine whether the FD x → y holds: it holds if y ⊆ x+.

Ex: Attribute Closure

Example 1: R = (A, B, C, D, E), F = {A → CD, C → B, B → E}. Find the key.

Solution

A+ = {ABCDE}: A is a key

BC+ = {BCE}

B+ = {BE}

Example 2: For the above example, check whether A functionally determines E.

Solution

A+ = {ABCDE}, so A → E holds.
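The closures in this example can be reproduced with the usual fixed-point computation: start from x and keep adding the right side of every FD whose left side is already covered. This is a minimal sketch, assuming single-letter attribute names so that an attribute set can be written as a string; the function name `closure` is chosen only for this illustration.

```python
def closure(attrs, fds):
    """Attribute closure of `attrs` under the list of FDs `fds`.

    Each FD is a pair (lhs, rhs) of strings of single-letter attributes.
    """
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            # Apply lhs -> rhs once its left side is covered and it adds something new
            if set(lhs) <= result and not set(rhs) <= result:
                result |= set(rhs)
                changed = True
    return result

R = set("ABCDE")
F = [("A", "CD"), ("C", "B"), ("B", "E")]

print(sorted(closure("A", F)))    # ['A', 'B', 'C', 'D', 'E']
print(sorted(closure("BC", F)))   # ['B', 'C', 'E']
print(closure("A", F) == R)       # True: A is a superkey
print("E" in closure("A", F))     # True: A -> E holds
```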

Decomposition

Relational DB design requires a relation schema to be decomposed into more than one relation schema as part of DB normalization.

Any decomposition of a relation schema should satisfy the following properties:

Lossless decomposition

Dependency preservation

Lossless Decomposition

If R is decomposed into two relation schemas R1 and R2, the decomposition is said to be lossless if no information is lost in the process and all information can be recovered by joining the decomposed relations.

In other words, the decomposition is lossless if r1(R1) ⨝ r2(R2) = r(R), where ⨝ is the natural join operator and r1, r2 are the projections of r on R1 and R2.

The decomposition is lossless if at least one of the following FDs holds (is in F+):

R1 ∩ R2 → R1

R1 ∩ R2 → R2

That is, the common attributes must form a superkey of R1 or of R2. This is typically the case when the decomposed relations are linked by referential integrity, i.e. the primary key (PK) of one relation is a foreign key (FK) of the other.
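The test R1 ∩ R2 → R1 or R1 ∩ R2 → R2 can be checked mechanically with an attribute closure. The sketch below assumes single-letter attribute names and uses a small hypothetical schema R(A, B, C) with F = {B → C}; the function names are chosen for this illustration only.

```python
def closure(attrs, fds):
    """Attribute closure of `attrs` under FDs given as (lhs, rhs) strings."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if set(lhs) <= result and not set(rhs) <= result:
                result |= set(rhs)
                changed = True
    return result

def is_lossless(r1, r2, fds):
    """Binary decomposition test: do the common attributes determine R1 or R2?"""
    common_closure = closure(set(r1) & set(r2), fds)
    return set(r1) <= common_closure or set(r2) <= common_closure

F = [("B", "C")]                      # hypothetical FD set on R(A, B, C)
print(is_lossless("AB", "BC", F))     # True:  B -> BC, so B determines R2
print(is_lossless("AB", "AC", F))     # False: A determines neither AB nor AC
```

For decompositions into more than two schemas, the same test can be applied step by step to each binary split.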

Dependency Preservation

If R with FD set F is decomposed into two relation schemas R1 and R2 with FD sets F1 and F2 respectively, the decomposition is said to be dependency preserving if

(F1 ∪ F2)+ = F+

That is, no FD of the original relation schema is lost in the process of decomposition.

Example 1:

Let R = (A, B, C) with F = {A → B, B → C} be decomposed into R1 = (A, B) with F1 = {A → B} and R2 = (B, C) with F2 = {B → C}.

Here (F1 ∪ F2)+ = F+, therefore dependencies are preserved.

Example 2:

Let R = (A, B, C) with F = {A → B, B → C} be decomposed into R1 = (A, B) with F1 = {A → B} and R2 = (A, C) with F2 = {A → C}.

Here (F1 ∪ F2)+ ≠ F+ because B → C cannot be inferred from F1 ∪ F2, therefore dependencies are not preserved.
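Comparing (F1 ∪ F2)+ with F+ does not require enumerating either closure: two FD sets have the same closure exactly when each FD of one can be inferred from the other, which an attribute closure can decide. A minimal sketch, assuming single-letter attributes and taking F1 and F2 as given on the slide (the helper names are illustrative):

```python
def closure(attrs, fds):
    """Attribute closure of `attrs` under FDs given as (lhs, rhs) strings."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if set(lhs) <= result and not set(rhs) <= result:
                result |= set(rhs)
                changed = True
    return result

def implies(fds, lhs, rhs):
    """True if lhs -> rhs can be inferred from `fds`."""
    return set(rhs) <= closure(lhs, fds)

def same_closure(f, g):
    """True if f+ == g+, i.e. each set implies every FD of the other."""
    return (all(implies(g, lhs, rhs) for lhs, rhs in f)
            and all(implies(f, lhs, rhs) for lhs, rhs in g))

F = [("A", "B"), ("B", "C")]
print(same_closure(F, [("A", "B")] + [("B", "C")]))   # Example 1: True
print(same_closure(F, [("A", "B")] + [("A", "C")]))   # Example 2: False, B -> C is lost
```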

Second Normal Form

A relation schema is said to be in second normal form if it is in 1NF and does not exhibit any partial functional dependency.

If a relation schema has a composite primary key, there may exist an FD in which a part of the key functionally determines non-key attributes. Such FDs are referred to as partial functional dependencies.

Ex. R(A, B, C, D, E), F = {AB → C, B → D, D → E}

The key is AB, and R exhibits the partial FD B → D.

Hence it does not satisfy 2NF.
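The 2NF check in this example can be sketched in code: find the candidate keys by brute force, then look for a proper subset of a key that determines a non-prime attribute. The sketch assumes single-letter attributes; `candidate_keys` and `partial_dependencies` are names invented here, and the exhaustive search is only practical for small schemas.

```python
from itertools import combinations

def closure(attrs, fds):
    """Attribute closure of `attrs` under FDs given as (lhs, rhs) strings."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if set(lhs) <= result and not set(rhs) <= result:
                result |= set(rhs)
                changed = True
    return result

def candidate_keys(schema, fds):
    """Minimal attribute sets whose closure is the whole schema."""
    keys = []
    for size in range(1, len(schema) + 1):
        for combo in combinations(sorted(schema), size):
            if any(key <= set(combo) for key in keys):
                continue  # contains a smaller key, so it is not minimal
            if closure(combo, fds) == set(schema):
                keys.append(set(combo))
    return keys

def partial_dependencies(schema, fds):
    """Pairs (part of a key, non-prime attributes it determines)."""
    keys = candidate_keys(schema, fds)
    prime = set().union(*keys)
    found = []
    for key in keys:
        for size in range(1, len(key)):
            for part in combinations(sorted(key), size):
                determined = closure(part, fds) - set(part) - prime
                if determined:
                    found.append((set(part), determined))
    return found

F = [("AB", "C"), ("B", "D"), ("D", "E")]
print(candidate_keys("ABCDE", F))
print(partial_dependencies("ABCDE", F))
```

Here B determines both D and E (E through D), so the report for B lists both attributes; an empty result would mean the schema is in 2NF.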

Normalizing to 2NF

Divide R(A, B, C, D, E) into two relations:

R1(A, B, C), F1 = {AB → C}, key = {AB}

R2(B, D, E), F2 = {B → D, D → E}, key = {B}

Neither R1 nor R2 has a partial FD, so they are now normalized to 2NF.

R1 ∩ R2 = B and B → R2, so the decomposition is lossless.

F1 ∪ F2 = F, so it is dependency preserving.

Problem: Check whether the following relation is in 2NF; if not, normalize it.

order(orderNum, clientNum, itemNum, unitPrice, qty)

F = {orderNum → clientNum,
itemNum → unitPrice,
orderNum, itemNum → qty}

Key = {orderNum, itemNum}

Solution - I

order exhibits partial functional dependencies of the form

orderNum → clientNum

itemNum → unitPrice

hence it does not satisfy 2NF.

Normalization: divide the relation into the following:

orderItem(orderNum, itemNum, qty), F1 = {orderNum, itemNum → qty}, key1 = {orderNum, itemNum}

orderClient(orderNum, clientNum), F2 = {orderNum → clientNum}, key2 = {orderNum}

item(itemNum, unitPrice), F3 = {itemNum → unitPrice}, key3 = {itemNum}

Solution - II

Check for lossless decomposition

orderItem ∩ orderClient = orderNum and orderNum → orderClient

orderItem ∩ item = itemNum and itemNum → item, so the decomposition is lossless

Check for dependency preservation

F1 ∪ F2 ∪ F3 = F, so it is also dependency preserving

None of the three relation schemas has a partial FD; therefore they are in 2NF.

N.B.: A relation schema whose key is a single attribute (non-composite) is always in 2NF. (Why?) A partial FD needs a proper, non-empty subset of the key on its left side, and a single attribute has none.

Example

Check whether the following relation is in 2NF; if not, normalize it.

F = {Manufacturer → Manufacturer Country,
Manufacturer, Model → ModelFullName}

Key = {Manufacturer, Model}

The key is composite and Manufacturer → Manufacturer Country is a partial FD, hence the relation is not in 2NF.

Manufacturer    Model          ModelFullName              Manufacturer Country
Forte           X-Prime        Forte X-Prime              Italy
Forte           Ultraclean     Forte Ultraclean           Italy
Dent-o-Fresh    EZbrush        Dent-o-Fresh EZbrush       USA
Kobayashi       ST-60          Kobayashi ST-60            Japan
Hoch            Toothmaster    Hoch Toothmaster           Germany
Hoch            X-Prime        Hoch X-Prime               Germany

Solution

Break it into two tables as follows:

Table 1, Key1 = {Manufacturer}

Manufacturer    Manufacturer Country
Forte           Italy
Dent-o-Fresh    USA
Kobayashi       Japan
Hoch            Germany

Table 2, Key2 = {Manufacturer, Model}

Manufacturer    Model          Model Full Name
Forte           X-Prime        Forte X-Prime
Forte           Ultraclean     Forte Ultraclean
Dent-o-Fresh    EZbrush        Dent-o-Fresh EZbrush
Kobayashi       ST-60          Kobayashi ST-60
Hoch            Toothmaster    Hoch Toothmaster
Hoch            X-Prime        Hoch X-Prime

Lossless?

Dependency preserving?

Third Normal Form (3NF)

A relation r(R) with a given set of FDs F is said to be in 3NF if:

Defn 1: For every FD of the form X → Y in F+, at least one of the following three conditions is satisfied:

X → Y is a trivial FD

X is a superkey

Every attribute in Y - X is a prime attribute (an attribute of some candidate key)

Defn 2: For every non-trivial FD of the form X → Y in F+, at least one of the following two conditions is satisfied:

X is a superkey

Every attribute in Y - X is a prime attribute (an attribute of some candidate key)

Third Normal Form (3NF)

Defn 3: The schema is in 2NF and does not exhibit any transitive dependency of the form

key → non-key → non-key

That is, a schema is in 3NF if it does not exhibit any functional dependency from a non-key attribute to another non-key attribute.

Ex1. Consider the following relation instance and check for 3NF and 2NF.

studNum Address deptNum deptName Building

S21 Patna 5 CSIT C-Block

S22 Edinburgh 5 CSIT C-Block

S23 BBSR 4 MECH B-Block

S24 KolKata 4 MECH B-Block

S25 Manchester 1 PHY D-Block

Solution-I

Find the functional dependencies

F = {studNum → Address, deptNum, deptName, Building;
deptNum → deptName, Building}

Find the key

Key = {studNum}

Check for 3NF

studNum → deptNum → deptName, Building

i.e. key → non-key → non-key

Hence it is not in 3NF.

Decomposition

R1(studNum, Address, deptNum), R2(deptNum, deptName, Building)

F1 = {studNum → Address, deptNum}

F2 = {deptNum → deptName, Building}

Solution-II

Decomposition continued

Key1 = {studNum}, Key2 = {deptNum}

Hence R1 and R2 are now in 3NF, as they do not exhibit any transitive dependency.

Lossless decomposition

R1 ∩ R2 = deptNum and deptNum → R2, hence lossless

Dependency preservation

(F1 ∪ F2)+ = F+, hence dependency preserving

2NF

There is no partial FD (every key is a single attribute), therefore the original relation and both R1 and R2 are in 2NF.

Example-2

Consider the relation schema R(A, B, C, D, E) with FD set F = {AB → C, B → D, D → E}.

Which normal form is R in? Normalize the relation up to 3NF.

Solution:

Check for 2NF

Key = {AB}

B → D is a partial FD, hence R is not in 2NF (it is only in 1NF).

Decompose: R1(A, B, C), R2(B, D, E)

F1 = {AB → C}, F2 = {B → D, D → E}, key1 = {AB}, key2 = {B}

R1 and R2 are now in 2NF.

Example-2 contd.

Check for 3NF

R1 is in 3NF; R2 is not in 3NF (why?)

R2 has the transitive dependency B → D → E.

Decompose R2: R3(B, D), R4(D, E)

F3 = {B → D}, F4 = {D → E}

Now both are in 3NF.

Final schema: R1(A, B, C), R3(B, D), R4(D, E)

Check the decomposition for losslessness and dependency preservation.

Task

Consider the relation schema R(A, B, C, D, E) with FD set F = {AC → B, E → D, A → E}.

Which normal form is R in? Normalize the relation up to 3NF.

Boyce Codd Normal Form (BCNF)

Defn 1: r(R) is said to be in BCNF with respect to F if, for every FD of the form X → Y in F+, at least one of the following two conditions holds:

X → Y is a trivial FD

X is a superkey

Defn 2: r(R) is said to be in BCNF with respect to F if, for every non-trivial FD of the form X → Y in F+, X is a superkey.

Defn 3: BCNF allows only those non-trivial FDs whose left side is a superkey of the relation schema.

Note:

BCNF is the highest possible normal form for relation schemas that exhibit only FDs.

BCNF is stricter than 3NF.

Every relation in BCNF is also in 3NF; however, a relation in 3NF is not necessarily in BCNF.

Boyce Codd Normal Form (BCNF)

Example: check for 3NF and BCNF

R = {A, B, C}

F = {AB → C, C → B}

Candidate keys: AB and AC, so every attribute is prime.

3NF

Both FDs are non-trivial.

AB → C: AB is a superkey.

C → B: C is not a superkey, but B is a prime attribute (non-key → key).

Hence R is in 3NF.

BCNF

C → B is of the form non-key → key, and C is not a superkey. Hence R is not in BCNF.
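The difference between the two normal forms in this example can be checked with a short sketch. It tests only the FDs listed in F, which is enough to decide whether R itself is in 3NF or BCNF, and it assumes single-letter attributes; all function names are invented for this illustration.

```python
from itertools import combinations

def closure(attrs, fds):
    """Attribute closure of `attrs` under FDs given as (lhs, rhs) strings."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if set(lhs) <= result and not set(rhs) <= result:
                result |= set(rhs)
                changed = True
    return result

def candidate_keys(schema, fds):
    """Minimal attribute sets whose closure is the whole schema (brute force)."""
    keys = []
    for size in range(1, len(schema) + 1):
        for combo in combinations(sorted(schema), size):
            if any(key <= set(combo) for key in keys):
                continue  # contains a smaller key, so it is not minimal
            if closure(combo, fds) == set(schema):
                keys.append(set(combo))
    return keys

def violations(schema, fds, form):
    """FDs in `fds` that break `form`, which is either '3NF' or 'BCNF'."""
    prime = set().union(*candidate_keys(schema, fds))
    bad = []
    for lhs, rhs in fds:
        extra = set(rhs) - set(lhs)
        if not extra or closure(lhs, fds) == set(schema):
            continue  # trivial FD, or the left side is a superkey
        if form == "3NF" and extra <= prime:
            continue  # every attribute of Y - X is prime
        bad.append((lhs, rhs))
    return bad

F = [("AB", "C"), ("C", "B")]
print(candidate_keys("ABC", F))        # keys AB and AC
print(violations("ABC", F, "3NF"))     # []            -> R is in 3NF
print(violations("ABC", F, "BCNF"))    # [('C', 'B')]  -> R is not in BCNF
```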

Boyce Codd Normal Form (BCNF)

Every relation in BCNF is also in 3NF; however, a relation in 3NF is not necessarily in BCNF.

Example:

R(property_id, countryName, lot#, area, price, taxRate)

F = {property_id → countryName, lot#, area, price, taxRate
countryName, lot# → property_id, area, price, taxRate
countryName → taxRate
area → price
area → countryName}

Example - I

Normalize up to BCNF.

Partial functional dependency: countryName → taxRate (countryName is part of the candidate key {countryName, lot#}).

Hence R is not in 2NF.

Example - II

Normalize to 2NF.

After this step, area → price remains, which has the form non-key → non-key; hence the schema is not in 3NF.

Normalize to 3NF.

Example - III

area → countryName has the form non-key → key, hence the schema is not in BCNF.

Normalize to BCNF.

Summary of the successive decompositions:

1NF: LOTS
2NF: LOTS1, LOTS2
3NF: LOTS1A, LOTS1B, LOTS2
BCNF: LOTS1AX, LOTS1AY, LOTS1B, LOTS2

Limitations of BCNF

There may be multiple ways of decomposing a non-BCNF schema into BCNF schemas.

Although a lossless BCNF decomposition always exists, a BCNF decomposition may not guarantee dependency preservation.

If the DB designer cannot find a BCNF decomposition that guarantees dependency preservation, they may have to settle for the lower normal form, i.e. 3NF.

Functional Dependency Contd.

In some cases, constraints cannot be expressed as functional dependencies.

Ex. loan(custNum, loanNum, phoneNum)

One customer can have multiple loans and multiple phone numbers.

Is it in BCNF?

Key = {custNum, loanNum, phoneNum}

Only trivial functional dependencies hold, hence it is in BCNF.

But this schema still exhibits redundancy.

Example contd.

If a relation has two or more independent multi-valued attributes, every value of one attribute must be repeated with every value of the other to keep the relation consistent.

This type of constraint is specified by a multi-valued dependency.

Loan

custNum loanNum phoneNum

C1 L1 P1

C1 L1 P2

C1 L2 P1

C1 L2 P2

Multi-Valued Dependency

A multi-valued dependency (MVD) X →→ Y (where X, Y ⊆ R) specified on a relation r(R) imposes the following constraint on r: if two tuples t1 and t2 exist in r such that t1[X] = t2[X], then two tuples t3 and t4 must also exist in r with the following properties:

t3[X] = t4[X] = t1[X] = t2[X]

t3[Y] = t1[Y] and t4[Y] = t2[Y]

t3[R-XY] = t2[R-XY] and t4[R-XY] = t1[R-XY]
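The MVD condition can be tested on a concrete instance. In the sketch below, only the existence of t3 is checked explicitly; because the loops visit every ordered pair (t1, t2), the tuple t4 is covered when the roles of t1 and t2 are swapped. The function name `satisfies_mvd` is an assumption of this illustration, and the rows are those of the Loan table shown earlier.

```python
def satisfies_mvd(rows, attrs, x, y):
    """True if the instance `rows` over `attrs` satisfies the MVD x ->> y."""
    z = [a for a in attrs if a not in x and a not in y]   # Z = R - XY

    def project(row, cols):
        return tuple(row[c] for c in cols)

    present = {(project(r, x), project(r, y), project(r, z)) for r in rows}
    for t1 in rows:
        for t2 in rows:
            if project(t1, x) == project(t2, x):
                # t3 takes X and Y from t1 and the remaining attributes from t2
                t3 = (project(t1, x), project(t1, y), project(t2, z))
                if t3 not in present:
                    return False
    return True

attrs = ["custNum", "loanNum", "phoneNum"]
loan = [
    {"custNum": "C1", "loanNum": "L1", "phoneNum": "P1"},
    {"custNum": "C1", "loanNum": "L1", "phoneNum": "P2"},
    {"custNum": "C1", "loanNum": "L2", "phoneNum": "P1"},
    {"custNum": "C1", "loanNum": "L2", "phoneNum": "P2"},
]

print(satisfies_mvd(loan, attrs, ["custNum"], ["loanNum"]))       # True
print(satisfies_mvd(loan[:3], attrs, ["custNum"], ["loanNum"]))   # False: (C1, L2, P2) is missing
```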

Multi-Valued Dependency - I

Whenever X →→ Y holds, we say that X multi-determines Y.

Because of the symmetry in the definition, whenever X →→ Y holds in R, so does X →→ Z, where Z = R-XY.

Hence X →→ Y implies X →→ Z, and therefore it is sometimes written as X →→ Y | Z.

An MVD X →→ Y in R is called a trivial MVD if

Y is a subset of X, or

X ∪ Y = R

Fourth Normal Form (4NF) - I

A relation schema r(R) with a given set of dependencies D, where D includes FDs and MVDs, is said to be in 4NF if every MVD X →→ Y in D+ satisfies at least one of the following two conditions:

X →→ Y is a trivial MVD

X is a superkey

Example 1: test whether the relation schema is in 4NF

R(A, B, C, E) and

D = {A →→ E,
A → B,
A → C}

4NF Example contd.

It is not in 4NF because

A →→ E is not a trivial MVD

A is not a superkey

Decompose into R1(A, E) with D1 = {A →→ E} and R2(A, B, C) with F2 = {A → B, A → C}

In R1: A →→ E is a trivial MVD (A ∪ E = R1), thus R1 is in 4NF

In R2: A is the key, thus R2 is in 4NF

Example 2: R(custNum, loanNum, phoneNum)

D = {custNum →→ loanNum,
custNum →→ phoneNum}

Not in 4NF?

4NF Example contd.

Decompose into

R1(custNum, loanNum), D1 = {custNum →→ loanNum}

R2(custNum, phoneNum), D2 = {custNum →→ phoneNum}

R1

custNum loanNum

C1 L1

C1 L2

R2

custNum phoneNum

C1 P1

C1 P2
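That this decomposition loses no information can be confirmed by joining R1 and R2 on custNum and comparing the result with the original Loan table. A minimal sketch using plain Python sets of tuples (the variable names are chosen for this illustration):

```python
# The two decomposed relations, as sets of tuples
r1 = {("C1", "L1"), ("C1", "L2")}    # (custNum, loanNum)
r2 = {("C1", "P1"), ("C1", "P2")}    # (custNum, phoneNum)

# Natural join on the shared attribute custNum
joined = {
    (cust, loan_num, phone)
    for (cust, loan_num) in r1
    for (cust2, phone) in r2
    if cust == cust2
}

# The original Loan instance
loan = {
    ("C1", "L1", "P1"),
    ("C1", "L1", "P2"),
    ("C1", "L2", "P1"),
    ("C1", "L2", "P2"),
}

print(joined == loan)              # True: the join reproduces Loan exactly
print(len(r1) + len(r2), "rows stored instead of", len(loan))
```

With n loans and m phone numbers per customer, the decomposed schema stores n + m rows where the original needs n × m.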

Denormalization for Performance

Occasionally database designers choose a schema that has redundant information.

They use the redundancy to improve performance for specific applications.

The penalty paid for not using a normalized schema is the extra work (in terms of coding time and execution time) to keep redundant data consistent.

The process of taking a normalized schema and making it non-normalized is called denormalization.

Designers use it to tune the performance of systems to support time-critical operations.

A better alternative is to use the normalized schema and additionally store the join of its relations as a materialized view.


Thank You