23
CMPT 354, Simon Fraser University, Fall 2008, Martin Ester 1 Database Systems I Design Theory for Relational Databases

CMPT 354, Simon Fraser University, Fall 2008, Martin Ester 227 Database Systems I Design Theory for Relational Databases

  • View
    215

  • Download
    1

Embed Size (px)

Citation preview

Page 1: CMPT 354, Simon Fraser University, Fall 2008, Martin Ester 227 Database Systems I Design Theory for Relational Databases

CMPT 354, Simon Fraser University, Fall 2008, Martin Ester 1

Database Systems I

Design Theoryfor Relational Databases

Page 2: CMPT 354, Simon Fraser University, Fall 2008, Martin Ester 227 Database Systems I Design Theory for Relational Databases

CMPT 354, Simon Fraser University, Fall 2008, Martin Ester 2

IntroductionTypically, the first database design uses a high-level database model such as the ER model. This model is then translated into a relational schema.Sometimes a relational database schema is developed directly without going through the high-level design.Either way, the initial relational schema has room for improvement, in particular by eliminating redundancy.Redundancies lead to undesirable update and deletion anomalies.Relational database design theory introduces various normal forms of a schema that avoid various types of redundancies and algorithms to convert a relational schema into these normal forms.

Page 3: CMPT 354, Simon Fraser University, Fall 2008, Martin Ester 227 Database Systems I Design Theory for Relational Databases

CMPT 354, Simon Fraser University, Fall 2008, Martin Ester 3

Functional DependencyNormal forms are based on the concept of functional dependencies between sets of attributes.A functional dependency (FD) X Y is an assertion about a relation R that whenever two tuples of R agree on all the attributes of set X, then they must also agree on all attributes in set Y.We say “X Y holds in R.”Convention: …, X, Y, Z represent sets of attributes of relation R. A, B, C,… represent single attributes of R.Convention: no parentheses to denote sets of attributes, just ABC, rather than {A,B,C }.A FD X Y is called trivial if .

XY

Page 4: CMPT 354, Simon Fraser University, Fall 2008, Martin Ester 227 Database Systems I Design Theory for Relational Databases

CMPT 354, Simon Fraser University, Fall 2008, Martin Ester 4

Splitting / Combining Rule

XA1A2…An holds for R if and only if each of

XA1, XA2,…, XAn hold for R.

Example: The FD ABC is equivalent to the two FDs AB and AC.

This rule can be used to split a FD into multiple ones with singleton right sides or to combine multiple singleton right side FDs into one FD.

There is no splitting /combining rule for left sides.

We’ll generally express FDs with singleton right sides.

Page 5: CMPT 354, Simon Fraser University, Fall 2008, Martin Ester 227 Database Systems I Design Theory for Relational Databases

CMPT 354, Simon Fraser University, Fall 2008, Martin Ester 5

Functional Dependency

Consider the relationMovies1 (title, year, length, genre, studioName, starName).title year length genre studioName holds, assuming that there are not two movies with the same title in the same year.title year starName does not hold, since a movie can have more than one star acting.A FD makes an assertion about all possible instances of a relation, not only about one (such as the current) instance.

Page 6: CMPT 354, Simon Fraser University, Fall 2008, Martin Ester 227 Database Systems I Design Theory for Relational Databases

CMPT 354, Simon Fraser University, Fall 2008, Martin Ester 6

Keys

Given a relation R with attributes X = {A1, . . .,

An}.

is a superkey for relation R if K

functionally determines X, i.e. K X.

K is a key for R if K is a superkey, but no

proper subset of K is a superkey.

Keys are a special case of a FD.

Keys can be deduced systematically, if all FDs

for relation R are given.

XK

Page 7: CMPT 354, Simon Fraser University, Fall 2008, Martin Ester 227 Database Systems I Design Theory for Relational Databases

CMPT 354, Simon Fraser University, Fall 2008, Martin Ester 7

Keys

{title, year, starName} is a superkey of Movies1, since title title, year year, title year length,

title year genre,title year studioName,starName starName.

Remember that title year starName does not hold.{title, year}, {year, starName} and {title, starName} are not superkeys.Thus, {title, year, starName} is a key of Movies1.

Page 8: CMPT 354, Simon Fraser University, Fall 2008, Martin Ester 227 Database Systems I Design Theory for Relational Databases

CMPT 354, Simon Fraser University, Fall 2008, Martin Ester 8

Closure of Attributes

Given a set of attributes {A1, . . ., An} and a set S of FDs.The closure of {A1, . . ., An} under S is the set of attributes X such that every relation that satisfies all the FDs in S also satisfies {A1, . . ., An} X,i.e. {A1, . . ., An} X follows from the FDs in S.

The closure of set Y is denoted by Y+.

Exampleattribute set {A, B, C}FDs {ABD, DE, BCF, GH}{A,B,C}+ = {A, B, C, D, E, F}

Page 9: CMPT 354, Simon Fraser University, Fall 2008, Martin Ester 227 Database Systems I Design Theory for Relational Databases

CMPT 354, Simon Fraser University, Fall 2008, Martin Ester 9

Computing the Closure of Attributes

Given a set of attributes {A1, . . ., An} and a set

S of FDs.

If necessary, apply the splitting rule to the FDs in S.

Initialize X to {A1, . . ., An}.

Repeat search for some FD B1, . . ., Bm C in S

such that for all and add C to the set Xuntil no more attribute C can be added.

Now X = {A1, . . ., An}+, , return X.

XCXBi i and:

Page 10: CMPT 354, Simon Fraser University, Fall 2008, Martin Ester 227 Database Systems I Design Theory for Relational Databases

CMPT 354, Simon Fraser University, Fall 2008, Martin Ester 10

Computing the Closure of Attributes

Given a set of attributes {A, B, C, D, E, F}

and FDs {AB C, BC AD, D E, CFB}.

What is {A,B} +?

Apply the splitting rule: split BC AD into BC A and BC D.

Initialize X = {A,B} .

Iterationsapply AB C, X = {A,B,C}

apply BC D, X = {A,B,C,D}apply D E, X = {A,B,C,D,E}

Return {A,B} + = {A,B,C,D,E}.

Page 11: CMPT 354, Simon Fraser University, Fall 2008, Martin Ester 227 Database Systems I Design Theory for Relational Databases

CMPT 354, Simon Fraser University, Fall 2008, Martin Ester 11

Relational Schema DesignGoal of relational schema design is to avoid anomalies.Redundancies lead to certain forms of anomalies. and redundancy.Update anomalyone occurrence of a fact is changed, but not all occurrences.Deletion anomalyvalid fact is lost when a tuple is deleted.In the following example, consider the relationMovies1 (title, year, length, genre, studioName, starName).

Page 12: CMPT 354, Simon Fraser University, Fall 2008, Martin Ester 227 Database Systems I Design Theory for Relational Databases

CMPT 354, Simon Fraser University, Fall 2008, Martin Ester 12

Relational Schema Designtitle year length genre studioNam

estarName

Star Wars 1977 124 SciFi Fox Carrie Fisher

Star Wars 1977 124 SciFi Fox Mark Hamill

Star Wars 1977 124 SciFi Fox Harrison Ford

Gone with the Wind

1939 231 drama MGM Vivien Leigh

Wayne’s World

1992 95 comedy Paramount Dana Carvey

Wayne’s World

1992 95 comedy Paramount Mike MeyersUpdate anomaly: update the length to 125 (only) for the first Star Wars tuple.Deletion anomaly: delete Vivien Leigh (and the corresponding movie).

Page 13: CMPT 354, Simon Fraser University, Fall 2008, Martin Ester 227 Database Systems I Design Theory for Relational Databases

CMPT 354, Simon Fraser University, Fall 2008, Martin Ester 13

Decomposing Relations

How to eliminate these anomalies? Decompose the relation into two or more relations that together have the same attributes.Given relation R {A1, . . ., An}. A decomposition of R consists of two relations S {B1, . . ., Bm} and T {C1, . . ., Ck} such that

},...,{},...,{},...,{ 111 kmn CCBBAA

)(,...,1RS

mBB

)(,...,1RT

kCC

Page 14: CMPT 354, Simon Fraser University, Fall 2008, Martin Ester 227 Database Systems I Design Theory for Relational Databases

CMPT 354, Simon Fraser University, Fall 2008, Martin Ester 14

Decomposing RelationsDecompose Movies1 (title, year, length, genre, studioName, starName)into Movies2 (title, year, length, genre, studioName)and Movies3 (title, year, starName).

title year length

genre studioName

Star Wars

1977

124 SciFi Fox

Gone with the Wind

1939

231 drama MGM

Wayne’s World

1992

95 comedy

Paramount

title year starName

Star Wars

1977 Carrie Fisher

Star Wars

1977 Mark Hamill

Star Wars

1977 Harrison Ford

Gone with the Wind

1939 Vivien Leigh

Wayne’s World

1992 Dana Carvey

Wayne’s World

1992 Mike Meyers

Movies2

Movies3

The update and deletionanomalies are gone!

Page 15: CMPT 354, Simon Fraser University, Fall 2008, Martin Ester 227 Database Systems I Design Theory for Relational Databases

CMPT 354, Simon Fraser University, Fall 2008, Martin Ester 15

Boyce-Codd Normal Form

There are many ways of decomposing a relation.Which decompositions leads to an anomaly-free relation? Boyce-Codd normal form defines a condition under which the anomalies discussed so far cannot exist.A relation R is in Boyce-Codd normal form (BCNF), if and only if the following condition holds:for every non-trivial FD A1 . . . An B1 . . . Bm for R {A1, . . ., An } is a superkey for R.

I.e., the left side of a FD needs to contain a key.

Page 16: CMPT 354, Simon Fraser University, Fall 2008, Martin Ester 227 Database Systems I Design Theory for Relational Databases

CMPT 354, Simon Fraser University, Fall 2008, Martin Ester 16

Boyce-Codd Normal FormConsider Movies1 (title, year, length, genre, studioName, starName).{title, year, starName} is a key, and it is the only one.We have the FD title year length genre studioName.The left side of this FD is not a superkey, i.e. Movies1 is not in BCNF.Consider Movies2 (title, year, length, genre, studioName).{title, year } is its only key, since neither

title length genre studioName nor year length genre studioName hold.

All non-trivial FDs must have at least title and year on the left side. Thus, Movies2 is in BCNF.

Page 17: CMPT 354, Simon Fraser University, Fall 2008, Martin Ester 227 Database Systems I Design Theory for Relational Databases

CMPT 354, Simon Fraser University, Fall 2008, Martin Ester 17

Decomposition into BCNF

We want to find a decomposition R into R1, . . ., Rk such that

each of the resulting relations is in BNCF, andthe original relation R can be reconstructed from R1, . . ., Rk .

Look for non-trivial FD that violates BCNF, e.g. A1 . . . An B1 . . . Bm. {A1,. . ., An} is no superkey.Add to the right side all other attributes {C1,. . ., Ck} that functionally depend on {A1,. . ., An} .

Need to compute the closure of {A1,. . ., An} under the given FDs.This step is optional, but leads to a smaller number of component relations.

Page 18: CMPT 354, Simon Fraser University, Fall 2008, Martin Ester 227 Database Systems I Design Theory for Relational Databases

CMPT 354, Simon Fraser University, Fall 2008, Martin Ester 18

Decomposition into BCNF

Decompose R into R1 and R2: R1 contains {A1,. . ., An} and {B1,. . ., Bm} and {C1,. . ., Ck} R2 contains {A1,. . ., An} and all attributes not involved in the FD.Continue to decompose the resulting relations until there is no more FD for any of the Ri that violates BCNF.

Page 19: CMPT 354, Simon Fraser University, Fall 2008, Martin Ester 227 Database Systems I Design Theory for Relational Databases

CMPT 354, Simon Fraser University, Fall 2008, Martin Ester 19

Decomposition into BCNF

Consider Movies1 (title, year, length, genre, studioName, starName).title year length genre studioName violates BCNF.Decompose Movies1 into Movies2 (title, year, length, genre, studioName)

and Movies3 (title, year, starName).{title, year, starName} is key for Movies3.In general, more than one decomposition necessary to reach a schema in BCNF.

Page 20: CMPT 354, Simon Fraser University, Fall 2008, Martin Ester 227 Database Systems I Design Theory for Relational Databases

CMPT 354, Simon Fraser University, Fall 2008, Martin Ester 20

Decomposition into BCNFConsider relation schema {title, year, studioName, president, presAddr}.FDs

title year studioNamestudioName presidentpresident presAddr

{title, year} is the only key.The last two FDs violate BCNF. Assume that we first deal with studioName president.Decompose into

{title, year, studioName}, and {studioName, president, presAddr}.While the first of these relations is in BCNF, the second one is not:

president presAddr, but studioName is key.Decompose the second relation further into {studioName, president} and

{president, presAddr}.

Page 21: CMPT 354, Simon Fraser University, Fall 2008, Martin Ester 227 Database Systems I Design Theory for Relational Databases

CMPT 354, Simon Fraser University, Fall 2008, Martin Ester 21

Recoverability of InformationSo far, we know how to decompose R into R1, . . ., Rk such that each of the resulting relations is in BNCF.But can we reconstruct R from R1, . . ., Rk ?

More precisely: is ?Yes!To see why, consider a relation R(A,B,C) and a FD BC that violates BCNF.Decompose R into R1(A,B) and R2(B,C).Let t = (a,b,c) be an arbitrary tuple from R.

Then and consequently .

Thus, all tuples in R are in

kRRR

1

RcbRbaCBBA ,,

),(and),(

21),,( RRcba

.21RR

Page 22: CMPT 354, Simon Fraser University, Fall 2008, Martin Ester 227 Database Systems I Design Theory for Relational Databases

CMPT 354, Simon Fraser University, Fall 2008, Martin Ester 22

Recoverability of InformationLet t = (a, b, d) and v = (d, b, e) be arbitray tuples in R, which implies that

Then Is it also in R?BC and implies d = e.Since (a, b, d) in R, also (a, b, e) in R.Thus, all tuples in are in R. This means that the BCNF decomposition is lossless.

21RR

RebRdbRbdRba

CBCB

BABA

,,

,,

),( and),(),( and),(

.21

),,( RReba

RebRdbCBCB ,,

),( and),(

Page 23: CMPT 354, Simon Fraser University, Fall 2008, Martin Ester 227 Database Systems I Design Theory for Relational Databases

CMPT 354, Simon Fraser University, Fall 2008, Martin Ester 23

Summary

A functional dependency (FD) states that two tuples that agree on some set of attributes also agree on another attribute set.Keys are special cases of functional dependencies.Redundancies in a relational table lead to anomalies such as update and deletion anomalies. A relation is in Boyce-Codd normal form (BCNF), if the left sides of all non-trivial FDs contain a key.A schema in BCNF avoids the above anomalies.A given schema can be decomposed into subsets of attributes such that the resulting tables are all in BCNF and the join of these tables recovers the original table.