Database Requires Normalization


Page 1: Database Requires Normalization

Database Requires Normalization
• Related data in a database must be organized in a set of related tables following certain relational rules
  – How do we do this? Which fields should be in each table?
• Poorly designed databases lose data integrity over time, become slow, and lose the ability to support queries

Page 2: Database Requires Normalization

A well-designed table is one that:
• Minimizes redundant data
• Represents a single subject (e.g., Sample, River, Country)
• Has a primary key (PK)
• Does not have multi-part fields (e.g., ‘123 Nice Ave, Atlanta, Ga’)
  – A column should not hold different items
• Does not have duplicate fields (e.g., Analysis1, Analysis2, …)
  – The same thing should not appear in different columns
• Does not have fields that depend on fields (subkeys) other than the PK

Page 3: Database Requires Normalization

Normalization
• Is the gradual, sequential process of efficiently organizing unstructured data in a database so that it follows the rules listed in the previous slide
• Normalization commonly involves the following three forms, applied in order:
  – First, Second, and Third Normal Form (1NF, 2NF, 3NF)
  – This is commonly done during the early stages of modeling, on UML class diagrams
  – The next slide shows a database with one un-normalized table with many problems!

Page 4: Database Requires Normalization

Example of an un-normalized Student table

| Student | Exam | Format | Grade | Instructor | TA | Date |
|---|---|---|---|---|---|---|
| Dobb, Min | Structural Geology | Essay | A- | Babaie | Gabbri, Boris | 2012-08-02 |
| Dobrin, Garn | GIS | Lab Exercise | B+ | Dai | Mafique, Marie | 2014-02-15 |
| Petri, Tuff | Remote Sensing | Essay | B- | Kiage | Karsto, Travert | 2010-09-18 |
| Lac, Du | GIS | Lab Exercise | B | Dai | Mafique, Marie | 2014-02-15 |
| Dobb, Min | GIS | Lab Exercise | A- | Dai | Mafique, Marie | 2014-02-15 |
| Lac, Du | Petrology | Multiple choice | B+ | Hidalgo | Phenos, Meg | 2014-05-12 |
| Petri, Tuff | Petrology | Multiple Choice | B | Hidalgo | Phenos, Meg | 2014-05-12 |

Problems (callouts on the slide): mixed names; repeated types; repeated types; repeated; mixed & repeated

Page 5: Database Requires Normalization

Goal of Normalization
• Eliminate redundant (duplicated) data, which makes a database large, inefficient, and slow; this in turn prevents data manipulation (insert, delete, update) anomalies and loss of data integrity
  – If the same data are duplicated in different rows, changes made in different places may not agree (data-entry mistakes are easy to make)
  – We want a change to happen in one place (one row) and then propagate throughout
  – Redundancy reduces flexibility
  – Redundancy creates insert, delete, and update anomalies:
    – Update: if Mafique, Marie marries and becomes Basique, Marie, the name cannot be changed in just one row; every row that lists her must be changed
    – Insert: a new instructor cannot be inserted because there is no table for instructors
    – Delete: a row cannot be deleted without also deleting other information
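
The update anomaly can be sketched with Python's built-in sqlite3 module. This is only an illustration, not part of the original slides: the table name StudentExam and its reduced column set are assumptions, while the rows come from the un-normalized Student table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical un-normalized table: the TA name is stored in every exam row.
conn.execute("CREATE TABLE StudentExam (student TEXT, exam TEXT, ta TEXT)")
conn.executemany(
    "INSERT INTO StudentExam VALUES (?, ?, ?)",
    [
        ("Dobrin, Garn", "GIS", "Mafique, Marie"),
        ("Lac, Du", "GIS", "Mafique, Marie"),
        ("Dobb, Min", "GIS", "Mafique, Marie"),
    ],
)

# Update anomaly: renaming the TA in only one row leaves the other rows stale.
conn.execute(
    "UPDATE StudentExam SET ta = 'Basique, Marie' WHERE student = 'Lac, Du'"
)
ta_names = [
    row[0]
    for row in conn.execute("SELECT DISTINCT ta FROM StudentExam ORDER BY ta")
]
print(ta_names)  # the same person now appears under two names
```

If the TA name lived in a single row of its own table, one UPDATE would be enough and the two spellings could not drift apart.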

Page 6: Database Requires Normalization

• So, we have to create other tables, assign a PK to each one, and make sure that each piece of information shows up only once in the database
• The process eliminates redundant data (storing the same data in more than one table) and ensures that data dependencies are logical (only related data are stored in a table; not shoes and frogs)
  – Normalization reduces the amount of space a database consumes and ensures that data are logically stored

Page 7: Database Requires Normalization

Alumni Database: The First Attempt
• In this set of slides we will design and normalize the first version of a database called AlumniDB
  – NOTE: You are going to build the AlumniDB in the E3 exercise!
• The initial AlumniDB database may have just a few tables, like the single table in the next slide
• As you can see, this early version of the table has redundancies, is inefficient, and is therefore not useful!
• It must be changed through the three ordered normalization steps (1NF, 2NF, 3NF)

Page 8: Database Requires Normalization

Alumni Table First Version: Inefficient

| Alum | GradYear | CurrentJob | Donation (USD) | CurrentSchool | workPhone | CellPhone | Address |
|---|---|---|---|---|---|---|---|
| John Sedi | 2000 | IBM, Google | | | 111-222-3333 | 222-333-4444 | 123 2nd Ave, Los Angeles, CA 90014 |
| Joe Strat | 2010 | | 50 | Univ. of VA | 678-345-6666 | | 345 First Ave, Richmond, VA 23219 |
| Liz Hidro | 1998 | HydroPool, Chevron | | | 456-344-9988 | | 444 Kelly St, Frankfort, KY 40601 |
| Rocky Tuff | 2002 | | | Univ. of MA | 999-887-4447 | | 987 Red Rock St, Waltham, MA 02154 |
| Joe Strat | 2010 | | 100 | Univ. of VA | 678-345-6666 | | 345 First Ave, Richmond, VA 23219 |

(Where a row lists only one phone number, the source does not show which phone column holds it; it is placed under workPhone here.)

There are a few problems with this table (see the items in red font)! It first needs to go through 1NF (see next slide).

Page 9: Database Requires Normalization

First Normal Form (1NF)
• 1NF deals with duplicative data across multiple columns!
  – NOTE: The two phone columns hold the same type of data
• 1NF sets the very basic rules to make sure that:
  – Separate tables are created for each group of related data (e.g., Lake, IsotopicAge, Fold, Rock), i.e.,
  – Each table represents a distinct entity (or subject)
1NF ensures that:
• We do not have multiple values in a single column
• We do not have multiple columns of similar data

Page 10: Database Requires Normalization

1. Repeated columns are not allowed.
• Duplicative (repeating) columns that contain the same type of data are removed from the table
  – There should be no repeated groups of related data: Mineral1, Mineral2, Mineral3, or cellPhone, homePhone, workPhone
  – These should go to new Mineral and Phone tables!
2. No multi-valued attributes (columns) are allowed.
• All columns contain a single, indivisible value
  – All attributes must be atomic (e.g., XRF), not multi-valued (like the address in the Alumni table or Multiple Choice and Essay in the Student table)
  – Otherwise, we will have problems retrieving data by a specified value
  – In other words, each cell must have only one value, e.g., XRF, not ‘XRF, REE, Isotope’
3. There should be a set of one or more columns that uniquely identifies each row, i.e., there should be a primary key (PK)
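
The three rules can be sketched in plain Python on two rows of the Alumni table. This is a rough illustration under stated assumptions: the output table layouts (alum_rows, job_rows, phone_rows) and the Type column are hypothetical names, and which phone column holds Liz Hidro's single number is assumed.

```python
# Two rows of the un-normalized Alumni table (column names follow the slide).
alumni = [
    {"Alum": "John Sedi", "CurrentJob": "IBM, Google",
     "workPhone": "111-222-3333", "CellPhone": "222-333-4444"},
    # Assumption: the single number is treated as a work phone.
    {"Alum": "Liz Hidro", "CurrentJob": "HydroPool, Chevron",
     "workPhone": "456-344-9988", "CellPhone": None},
]

alum_rows, job_rows, phone_rows = [], [], []
for alum_id, row in enumerate(alumni, start=1):
    # Rule 3: give every alum a primary key (AlumID).
    alum_rows.append({"AlumID": alum_id, "Alum": row["Alum"]})
    # Rule 2: one job per cell, so each job becomes its own row.
    for job in row["CurrentJob"].split(","):
        job_rows.append({"AlumID": alum_id, "Job": job.strip()})
    # Rule 1: the repeated phone columns become rows in a Phone table.
    for kind in ("workPhone", "CellPhone"):
        if row[kind]:
            phone_rows.append(
                {"AlumID": alum_id, "Type": kind, "Number": row[kind]})

print(job_rows)
print(phone_rows)
```

Each job and each phone number now sits in its own row, linked back to the alum through AlumID, so a search for a single value such as Google no longer has to pick apart a combined cell.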

Page 11: Database Requires Normalization

The Alumni table is NOT in First Normal Form (1NF)

| Alum | GradYear | CurrentJob | Donation (USD) | CurrentSchool | workPhone | CellPhone | Address |
|---|---|---|---|---|---|---|---|
| John Sedi | 2000 | IBM, Google | | | 111-222-3333 | 222-333-4444 | 123 2nd Ave, Los Angeles, CA 90014 |
| Joe Strat | 2010 | | 50 | Univ. of VA | 678-345-6666 | | 345 First Ave, Richmond, VA 23219 |
| Liz Hidro | 1998 | HydroPool, Chevron | | | 456-344-9988 | | 444 Kelly St, Frankfort, KY 40601 |
| Rocky Tuff | 2002 | | | Univ. of MA | 999-887-4447 | | 987 Red Rock St, Waltham, MA 02154 |
| Joe Strat | 2010 | | 100 | Univ. of VA | 678-345-6666 | | 345 First Ave, Richmond, VA 23219 |

(Where a row lists only one phone number, the source does not show which phone column holds it; it is placed under workPhone here.)

Problems with respect to 1NF:
• Violates the rule “There should be no repeating columns”
  – We have repeating data types (workPhone and CellPhone)
• Violates the rule “Each column must have a single value”
  – Two current jobs are given for some people; the Address field is complex
• Violates the rule “There must be a primary key to uniquely identify rows”
  – There is none!

Page 12: Database Requires Normalization

Example of an un-normalized Student table

| Student | Exam | Format | Grade | Instructor | TA | Date |
|---|---|---|---|---|---|---|
| Dobb, Min | Structural Geology | Essay | A- | Babaie | Gabbri, Boris | 2012-08-02 |
| Dobrin, Garn | GIS | Lab Exercise | B+ | Dai | Mafique, Marie | 2014-02-15 |
| Petri, Tuff | Remote Sensing | Essay | B- | Kiage | Karsto, Travert | 2010-09-18 |
| Lac, Du | GIS | Lab Exercise | B | Dai | Mafique, Marie | 2014-02-15 |
| Dobb, Min | GIS | Lab Exercise | A- | Dai | Mafique, Marie | 2014-02-15 |
| Lac, Du | Petrology | Multiple choice | B+ | Hidalgo | Phenos, Meg | 2014-05-12 |
| Petri, Tuff | Petrology | Multiple Choice | B | Hidalgo | Phenos, Meg | 2014-05-12 |

Problems (callouts on the slide): mixed names; repeated types; repeated types; repeated; mixed & repeated

Page 13: Database Requires Normalization

Alumni Table: Modified; Satisfies 1NF

| AlumID | Alum | GradYear | CurrentSchool | Donation (USD) | Address |
|---|---|---|---|---|---|
| 1 | John Sedi | 2000 | | | 123 2nd Ave, Los Angeles, CA 90014 |
| 2 | Joe Strat | 2010 | Univ. of VA | 50 | 345 First Ave, Richmond, VA 23219 |
| 3 | Liz Hidro | 1998 | | | 444 Kelly St, Frankfort, KY 40601 |
| 4 | Rocky Tuff | 2002 | Univ. of MA | | 987 Red Rock St, Waltham, MA 02154 |
| 5 | Joe Strat | 2010 | Univ. of VA | 100 | 345 First Ave, Richmond, VA 23219 |

This table is in First Normal Form (1NF), but it is NOT in 2NF
• The Job, GradSchool, and phone columns are moved to their own tables because they are not dependent on the PK (AlumID)
• Records for Joe Strat and Univ. of VA are repeated!
• Move everything except the alum data (keep GradYear) into new tables
• Add first_name, last_name, etc. to the Alumni table

Page 14: Database Requires Normalization

Second Normal Form (2NF)
2NF deals with redundancy across multiple rows!
• 2NF helps to further remove duplicative data
• For a table to be in 2NF:
  – It should meet all the requirements of the first normal form
  – In addition, identify columns whose data repeat in different places, and move them to their own table
    • In the 1NF Alumni table, the data for Joe Strat are repeated. Solution: move the Alum column (with its address and school) into their own tables, called Alum and School
  – Every non-key attribute must depend on all parts of the primary key (PK)
    • If not, move it to a new table with its own PK and FK

Page 15: Database Requires Normalization

2NF: Eliminate partial dependencies
• Non-key columns must refer to the entire composite key (if one exists), not just part of it.
• For example, the PK in the Student table (copied in the next slide) is the composite (Student, Exam).
  – The ExamFormat column depends on (i.e., is an attribute of) the Exam, not on the Student.
  – This means that these data belong in another table
  – This is taken care of by 2NF
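
As a rough sketch of removing this partial dependency, the Python below projects ExamFormat out of the Student rows into its own table keyed by Exam. The variable names are hypothetical, and folding the two spellings of the Petrology format with title() is an assumption about how the data would be cleaned.

```python
# (Student, Exam, ExamFormat, Grade) rows from the Student table.
rows = [
    ("Dobb, Min", "Structural Geology", "Essay", "A-"),
    ("Dobrin, Garn", "GIS", "Lab Exercise", "B+"),
    ("Petri, Tuff", "Remote Sensing", "Essay", "B-"),
    ("Lac, Du", "GIS", "Lab Exercise", "B"),
    ("Dobb, Min", "GIS", "Lab Exercise", "A-"),
    ("Lac, Du", "Petrology", "Multiple choice", "B+"),
    ("Petri, Tuff", "Petrology", "Multiple Choice", "B"),
]

# ExamFormat depends only on Exam, so it moves to a table keyed by Exam.
formats_by_exam = {}
for _student, exam, fmt, _grade in rows:
    # title() folds the "Multiple choice" / "Multiple Choice" variants.
    formats_by_exam.setdefault(exam, set()).add(fmt.title())

# The dependency holds only if every exam maps to exactly one format.
assert all(len(formats) == 1 for formats in formats_by_exam.values())
exam_table = {exam: next(iter(f)) for exam, f in formats_by_exam.items()}

# What remains depends on the whole composite key (Student, Exam).
grade_table = {(student, exam): grade for student, exam, _fmt, grade in rows}

print(exam_table)
print(len(grade_table))
```

The format of each exam is now stored once instead of once per student, which is also why the inconsistent spelling could only have crept into the un-normalized table.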

Page 16: Database Requires Normalization

Example of an un-normalized Student table

| Student | Exam | ExamFormat | Grade | Instructor | TA | Date |
|---|---|---|---|---|---|---|
| Dobb, Min | Structural Geology | Essay | A- | Babaie | Gabbri, Boris | 2012-08-02 |
| Dobrin, Garn | GIS | Lab Exercise | B+ | Dai | Mafique, Marie | 2014-02-15 |
| Petri, Tuff | Remote Sensing | Essay | B- | Kiage | Karsto, Travert | 2010-09-18 |
| Lac, Du | GIS | Lab Exercise | B | Dai | Mafique, Marie | 2014-02-15 |
| Dobb, Min | GIS | Lab Exercise | A- | Dai | Mafique, Marie | 2014-02-15 |
| Lac, Du | Petrology | Multiple choice | B+ | Hidalgo | Phenos, Meg | 2014-05-12 |
| Petri, Tuff | Petrology | Multiple Choice | B | Hidalgo | Phenos, Meg | 2014-05-12 |

Problems (callouts on the slide): mixed names; repeated types; repeated types; repeated; mixed & repeated

Page 17: Database Requires Normalization

Third Normal Form (3NF)
• Third normal form is about dependency
• For a table to be in 3NF:
  – It must meet all the requirements of 2NF, and:
  – Every non-key attribute must be mutually independent
    • Changing one non-key column should not change the other columns. If it does, remove the interdependent attributes
  – There must be no transitive functional dependencies
    • Remove columns that do not depend on the primary key but depend on other columns
    • That is, remove columns whose values depend on columns other than the PK
    • This means we have to remove the subkeys, create new tables, and assign new primary keys and foreign keys after the changes

Page 18: Database Requires Normalization

3NF: Eliminate transitive dependencies
• If a non-key column refers not to the PK (i.e., is independent of it) but to another column, it should be moved to another table.
• For example, the TA column in the Student table does not depend on the PK (Student, Exam); it depends on the Instructor column.
• TA is moved to the new Instructor table

Page 19: Database Requires Normalization

3NF …
• There should be no transitive functional dependencies
• If x → y, i.e., x functionally determines y (y is functionally dependent on x), then given x we can find y.
  – Example: in the Address table, given the nine-digit zip code we can find the city and state, because they are functionally dependent on the zip code. The opposite is not true: given a city, we cannot find the zip code (some cities have several zip codes, and same-named cities can be in different states)
• By definition, a superkey (e.g., the primary key) functionally determines all other attributes in the table
• The zip code is a subkey (not a superkey) because it determines only the city and state parts of the Address table, not the other attributes
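
A functional dependency can be tested against sample rows with a few lines of Python. This is only a sketch: the helper name determines is hypothetical, five-digit zip codes from the Alumni table stand in for the nine-digit codes mentioned above, and the 23220 row is an illustrative addition that is not in the slides. A check like this can show that the data contradict a dependency, but it cannot prove that the dependency holds in general.

```python
def determines(rows, x, y):
    """Return True if, in these rows, the columns in x determine those in y."""
    seen = {}
    for row in rows:
        key = tuple(row[c] for c in x)
        value = tuple(row[c] for c in y)
        # A second, different y value for the same x breaks the dependency.
        if seen.setdefault(key, value) != value:
            return False
    return True


addresses = [
    {"zip": "90014", "city": "Los Angeles", "state": "CA"},
    {"zip": "23219", "city": "Richmond", "state": "VA"},
    {"zip": "23220", "city": "Richmond", "state": "VA"},  # illustrative row
    {"zip": "40601", "city": "Frankfort", "state": "KY"},
]

# The zip code determines city and state in this sample.
print(determines(addresses, ["zip"], ["city", "state"]))  # True
# The city does not determine the zip code: Richmond has two here.
print(determines(addresses, ["city"], ["zip"]))  # False
```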

Page 20: Database Requires Normalization

| Table | Columns |
|---|---|
| Student | StudentID, StudentFirst, StudentMiddle, StudentLast |
| Grade | StudentID, ExamID, Grade |
| Exam | ExamID, InstructorID, Exam, Date |
| Instructor | InstructorID, Instructor, TA |
| Format | ExamID, Format |

• All entities are broken into separate tables
• PKs are defined (shown in bold; some are composite; e.g., in Exam)
• Each table has unique information about one thing or subject
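
The five tables can be sketched as SQLite DDL through Python's sqlite3 module. The table and column names come from the slide; the column types, the foreign keys, and above all the choice of primary keys are assumptions, because the bold formatting that marked the PKs did not survive in the text.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when asked
conn.executescript("""
CREATE TABLE Student (
    StudentID     INTEGER PRIMARY KEY,
    StudentFirst  TEXT,
    StudentMiddle TEXT,
    StudentLast   TEXT
);
CREATE TABLE Instructor (
    InstructorID  INTEGER PRIMARY KEY,
    Instructor    TEXT,
    TA            TEXT
);
CREATE TABLE Exam (
    ExamID        INTEGER PRIMARY KEY,      -- assumed single-column key
    InstructorID  INTEGER REFERENCES Instructor(InstructorID),
    Exam          TEXT,
    Date          TEXT
);
CREATE TABLE Format (
    ExamID        INTEGER PRIMARY KEY REFERENCES Exam(ExamID),
    Format        TEXT
);
CREATE TABLE Grade (
    StudentID     INTEGER REFERENCES Student(StudentID),
    ExamID        INTEGER REFERENCES Exam(ExamID),
    Grade         TEXT,
    PRIMARY KEY (StudentID, ExamID)         -- composite key
);
""")

tables = [
    row[0]
    for row in conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name")
]
print(tables)
```

With the composite key on Grade, a student can have only one grade per exam, and the foreign keys keep a grade from pointing at a student or exam that does not exist.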

Page 21: Database Requires Normalization

Alumni Table, modified again: Satisfies 2NF

| AlumID | GradYear | Address |
|---|---|---|
| 1 | 2000 | 123 2nd Ave, Los Angeles, CA 90014 |
| 2 | 2010 | 345 First Ave, Richmond, VA 23219 |
| 3 | 1998 | 444 Kelly St, Frankfort, KY 40601 |
| 4 | 2002 | 987 Red Rock St, Waltham, MA 02154 |

This table is in Second Normal Form (2NF), but not in 3NF: there is a subkey (zip code) upon which the city and state depend. Zip code is not a PK.
• We remove the subkey and put it in a new table
• We break the Address data into the following tables: ZipCodes, Cities, and States, because these do not relate to any specific alum.
• However, these are directly related to each other (street address relies on city, city on state)

Page 22: Database Requires Normalization

• To take care of the transitive functional dependency issue, take 3 steps:
  – Remove all the attributes that depend on the subkey (e.g., zip code) from the table (e.g., city and state from the Address table)
  – Move them into a new table (e.g., call it ZipLocations, with zipCode, city, and state attributes)
  – Keep a copy of the subkey attribute (i.e., zipCode) in the original table as a foreign key
• The Address table now has firstname, lastname, street (these 3 make the composite PK), and zipCode (as FK to the other table).
• Summary: Subkeys always result in redundant data and must be removed!
• In other words, remove subsets of data that apply to multiple rows of a table and place them in separate tables
  – i.e., remove duplicative data
  – For example, break the address into its independent constituents that do not depend on each other
• Create relationships between these new tables and their predecessors through the use of foreign keys
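
The three steps can be sketched in plain Python. The attribute names follow the text above; the two sample rows are assembled from the Alumni table, and the variable names are hypothetical.

```python
# Address rows before the split (values taken from the Alumni table).
addresses = [
    {"firstname": "John", "lastname": "Sedi", "street": "123 2nd Ave",
     "city": "Los Angeles", "state": "CA", "zipCode": "90014"},
    {"firstname": "Joe", "lastname": "Strat", "street": "345 First Ave",
     "city": "Richmond", "state": "VA", "zipCode": "23219"},
]

zip_locations = {}  # new ZipLocations table: zipCode determines (city, state)
address_table = []  # original table without the dependent attributes
for row in addresses:
    # Steps 1 and 2: move city and state out, keyed by the subkey.
    zip_locations[row["zipCode"]] = (row["city"], row["state"])
    # Step 3: keep zipCode in the original table as a foreign key.
    address_table.append(
        {k: row[k] for k in ("firstname", "lastname", "street", "zipCode")})

print(address_table)
print(zip_locations)
```

City and state are now stored once per zip code, and each address row reaches them through its zipCode value.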

Page 23: Database Requires Normalization

Alumni Table, 4th attempt: Satisfies 3NF

3NF Alumni Table

| AlumID | GradYear | StreetNumber | StreetName | Zip |
|---|---|---|---|---|
| 1 | 2000 | 123 | 2nd Ave | 90014 |
| 2 | 2010 | 345 | 1st Ave | 23219 |
| 3 | 1998 | 444 | Kelly St | 40601 |
| 4 | 2002 | 987 | Red Rock St | 02154 |

3NF Zipcodes Table

| Zip | CityID |
|---|---|
| 90014 | 1234 |
| 23219 | 5678 |
| 40601 | 4321 |
| 02154 | 8765 |

3NF Cities Table

| CityID | Name | StateID |
|---|---|---|
| 1234 | Los Angeles | 5 |
| 5678 | Richmond | 46 |
| 4321 | Frankfort | 17 |
| 8765 | Waltham | 21 |

3NF States Table

| StateID | Name | Abbrev |
|---|---|---|
| 5 | California | CA |
| 46 | Virginia | VA |
| 17 | Kentucky | KY |
| 21 | Massachusetts | MA |

Plus other tables!
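
To show that nothing is lost by the split, the one-line address can be rebuilt by following the foreign keys. The sketch below loads two of the four alumni into an in-memory SQLite database; the table and column names come from the slide, while the column types and the choice to store Zip as text (so that the leading zero of 02154 survives) are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE States   (StateID INTEGER PRIMARY KEY, Name TEXT, Abbrev TEXT);
CREATE TABLE Cities   (CityID INTEGER PRIMARY KEY, Name TEXT,
                       StateID INTEGER REFERENCES States(StateID));
CREATE TABLE Zipcodes (Zip TEXT PRIMARY KEY,
                       CityID INTEGER REFERENCES Cities(CityID));
CREATE TABLE Alumni   (AlumID INTEGER PRIMARY KEY, GradYear INTEGER,
                       StreetNumber TEXT, StreetName TEXT,
                       Zip TEXT REFERENCES Zipcodes(Zip));
INSERT INTO States   VALUES (5, 'California', 'CA'), (21, 'Massachusetts', 'MA');
INSERT INTO Cities   VALUES (1234, 'Los Angeles', 5), (8765, 'Waltham', 21);
INSERT INTO Zipcodes VALUES ('90014', 1234), ('02154', 8765);
INSERT INTO Alumni   VALUES (1, 2000, '123', '2nd Ave', '90014'),
                            (4, 2002, '987', 'Red Rock St', '02154');
""")

# Rebuild the one-line address by joining along the foreign keys.
query = """
SELECT a.StreetNumber || ' ' || a.StreetName || ', ' || c.Name || ', '
       || s.Abbrev || ' ' || a.Zip
FROM Alumni a
JOIN Zipcodes z ON z.Zip = a.Zip
JOIN Cities   c ON c.CityID = z.CityID
JOIN States   s ON s.StateID = c.StateID
ORDER BY a.AlumID
"""
full_addresses = [row[0] for row in conn.execute(query)]
print(full_addresses)
```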

Page 24: Database Requires Normalization

Fourth Normal Form (4NF)

• Normalizing a database to the 3NF is usually sufficient

• The fourth normal form (4NF) has one additional requirement

• Meet all the requirements of the third normal form

• A relation is in 4NF if it has no multi-valued dependencies