45
CompSci 516 Database Systems Lecture 5 Relational Algebra And Normalization Guest Lecture: Junyang Gao 1 Duke CS, Fall 2019 CompSci 516: Database Systems

Lecture 5 Relational Algebra And Normalization...• Normalization Duke CS, Fall 2019 3 Acknowledgement: The following slides have been created adapting the instructor material of

  • Upload
    others

  • View
    3

  • Download
    0

Embed Size (px)

Citation preview

CompSci 516DatabaseSystems

Lecture5RelationalAlgebra

AndNormalization

GuestLecture:Junyang Gao

1DukeCS,Fall2019 CompSci516:DatabaseSystems

Announcements

• Lab-2onRAonThursday– Donotforgetyourlaptop!

• Homework-1(Part-1and2)havebeenpostedonsakai– FirstdeadlinenextTuesday:09/17– ParsingXMLwilltaketime!

• Nextweek:– RevisitRelationalCalculus!– Newtopic:Databaseinternalsandindexes!

DukeCS,Fall2019 2CompSci516:DatabaseSystems

Today’stopic

• RelationalAlgebra• Normalization

DukeCS,Fall2019 3

Acknowledgement:Thefollowingslideshavebeencreatedadaptingtheinstructormaterialofthe[RG]bookprovidedbytheauthorsDr.Ramakrishnan andDr.Gehrke.CompSci516:DatabaseSystems

AQuickRecap• The“Drinker-Beer-Bar”example

DukeCS,Fall2019 CompSci516:DatabaseSystems 4

RelationalAlgebra(RA)

RelationalAlgebra• Alanguageforqueryingrelationaldatabasedon“operators”

• Takesoneormorerelationsasinput,andproducesarelationasoutput– operator– operand– semantic– soanalgebra!

• Sinceeachoperationreturnsarelation,operationscanbecomposed– Algebrais“closed”

DukeCS,Fall2019 CompSci516:DatabaseSystems 6

RelationalAlgebra• Basicoperations:

– Selection(σ)Selectsasubsetofrowsfromrelation– Projection(π)Deletesunwantedcolumnsfromrelation.– Cross-product(x)Allowsustocombinetworelations.– Set-difference(-)Tuplesinreln.1,butnotinreln.2.– Union(∪)Tuplesinreln.1orinreln.2.

• Additionaloperations:– Intersection(∩)– join⨝– division(/)– renaming(ρ)– Notessential,but(very)useful,especiallyjoin!

DukeCS,Fall2019 CompSci516:DatabaseSystems 7

ExampleSchemaandInstances

sid sname rating age22 dustin 7 45.031 lubber 8 55.558 rusty 10 35.0

sid sname rating age28 yuppy 9 35.031 lubber 8 55.544 guppy 5 35.058 rusty 10 35.0

sid bid day22 101 10/10/9658 103 11/12/96

R1

S1 S2

Sailors(sid,sname,rating,age)Boats(bid,bname,color)Reserves(sid,bid,day)

Projection

sname ratingyuppy 9lubber 8guppy 5rusty 10p sname rating S, ( )2

page S( )2

• Deletesattributesthatarenotinprojectionlist.

• Schema ofresultcontainsexactlythefieldsintheprojectionlist,withthesamenamesthattheyhadinthe(single)inputrelation.

sid sname rating age28 yuppy 9 35.031 lubber 8 55.544 guppy 5 35.058 rusty 10 35.0

S2

10DukeCS,Fall2016 CompSci516:DataIntensiveComputingSystems

Projection

sname ratingyuppy 9lubber 8guppy 5rusty 10p sname rating S, ( )2

age35.055.5

page S( )2

• Deletesattributesthatarenotinprojectionlist.

• Schema ofresultcontainsexactlythefieldsintheprojectionlist,withthesamenamesthattheyhadinthe(single)inputrelation.

• Projectionoperatorhastoeliminateduplicates(Setsemantic!)– Note:realsystemstypicallydon’tdo

duplicateeliminationunlesstheuserexplicitlyasksforit(performance)

sid sname rating age28 yuppy 9 35.031 lubber 8 55.544 guppy 5 35.058 rusty 10 35.0

S2

10DukeCS,Fall2016 CompSci516:DataIntensiveComputingSystems

Selection

s rating S>8 2( )

sid sname rating age28 yuppy 9 35.058 rusty 10 35.0

• Selectsrowsthatsatisfyselectioncondition

• Noduplicatesinresult– Becauseinputisaset!

• Schemaofresultidenticaltoschemaof(single)inputrelation

sid sname rating age28 yuppy 9 35.031 lubber 8 55.544 guppy 5 35.058 rusty 10 35.0

S2

11DukeCS,Fall2016 CompSci516:DataIntensiveComputingSystems

CompositionofOperators

• Resultrelationcanbetheinputforanotherrelationalalgebraoperation– Operatorcomposition

p ssname rating rating S, ( ( ))>8 2

sid sname rating age28 yuppy 9 35.031 lubber 8 55.544 guppy 5 35.058 rusty 10 35.0

S2

Whatdowegethere?

CompositionofOperators

• Resultrelationcanbetheinputforanotherrelationalalgebraoperation– Operatorcomposition

sname ratingyuppy 9rusty 10

p ssname rating rating S, ( ( ))>8 2

S2 sid sname rating age28 yuppy 9 35.031 lubber 8 55.544 guppy 5 35.058 rusty 10 35.0

Whatdowegethere?

Union,Intersection,Set-Difference

• Alloftheseoperationstaketwoinputrelations,whichmustbeunion-compatible:– Samenumberoffields.– “Corresponding”fieldsmust

havethesametypeandsameschemaastheinputs

sid sname rating age22 dustin 7 45.031 lubber 8 55.558 rusty 10 35.044 guppy 5 35.028 yuppy 9 35.0

S S1 2È

sid sname rating age22 dustin 7 45.031 lubber 8 55.558 rusty 10 35.0

sid sname rating age28 yuppy 9 35.031 lubber 8 55.544 guppy 5 35.058 rusty 10 35.0

S1 S2

Duke%CS,%Spring%2016 CompSci 516:%Data%Intensive%Computing%Systems 12

Youwouldlosepointsifyourrelationsin⋃,⋂,➖ arenotunioncompatible!

Union,Intersection,Set-Difference

• Note:noduplicate– “Setsemantic”– SQL:UNION– ButSQLallows“bagsemantic”aswell:UNIONALL

S S1 2È

S1 S2

Duke%CS,%Spring%2016 CompSci 516:%Data%Intensive%Computing%Systems 12

sid sname rating age22 dustin 7 45.031 lubber 8 55.558 rusty 10 35.044 guppy 5 35.028 yuppy 9 35.0

sid sname rating age22 dustin 7 45.031 lubber 8 55.558 rusty 10 35.0

sid sname rating age28 yuppy 9 35.031 lubber 8 55.544 guppy 5 35.058 rusty 10 35.0

sid sname rating age31 lubber 8 55.558 rusty 10 35.0

S S1 2Ç

sid sname rating age22 dustin 7 45.0

S S1 2-

S1 S2

Duke%CS,%Spring%2016 CompSci 516:%Data%Intensive%Computing%Systems 13

sid sname rating age22 dustin 7 45.031 lubber 8 55.558 rusty 10 35.0

sid sname rating age28 yuppy 9 35.031 lubber 8 55.544 guppy 5 35.058 rusty 10 35.0

Union,Intersection,Set-Difference

Cross-Product• EachrowofS1ispairedwitheachrowofR.• ResultschemahasonefieldperfieldofS1andR,withfield

names`inherited’ifpossible.– Conflict:BothS1andRhaveafieldcalledsid.

(sid) sname rating age (sid) bid day22 dustin 7 45.0 22 101 10/10/9622 dustin 7 45.0 58 103 11/12/9631 lubber 8 55.5 22 101 10/10/9631 lubber 8 55.5 58 103 11/12/9658 rusty 10 35.0 22 101 10/10/9658 rusty 10 35.0 58 103 11/12/96

DukeCS,Fall2019 CompSci516:DatabaseSystems 17

sid sname rating age22 dustin 7 45.031 lubber 8 55.558 rusty 10 35.0

sid bid day22 101 10/10/9658 103 11/12/96

RenamingOperator⍴

(sid) sname rating age (sid) bid day22 dustin 7 45.0 22 101 10/10/9622 dustin 7 45.0 58 103 11/12/9631 lubber 8 55.5 22 101 10/10/9631 lubber 8 55.5 58 103 11/12/9658 rusty 10 35.0 22 101 10/10/9658 rusty 10 35.0 58 103 11/12/96

§Ingeneral,canuse⍴(<Temp>,<RA-expression>)

DukeCS,Fall2019 CompSci516:DatabaseSystems 18

(⍴sid →sid1S1)⨉ (⍴sid →sid2R1)or

⍴(C(1→sid1,5→sid2),S1⨉ R1)

Cisthenewrelationname

DifferentsyntaxesareusedYoucanuseanyofthese

sid1 sid2

Joins

• Resultschemasameasthatofcross-product.• Fewertuplesthancross-product,mightbeableto

computemoreefficiently– Wewilldojoinalgorithmslater

R c S c R S!" = ´s ( )

(sid) sname rating age (sid) bid day22 dustin 7 45.0 58 103 11/12/9631 lubber 8 55.5 58 103 11/12/96

S RS sid R sid1 11 1!" . .<

DukeCS,Fall2019 CompSci516:DatabaseSystems 19

Findnamesofsailorswho’vereservedboat#103

DukeCS,Fall2019 CompSci516:DatabaseSystems 20

Sailors(sid,sname,rating,age)Boats(bid,bname,color)Reserves(sid,bid,day)

Findnamesofsailorswho’vereservedboat#103

DukeCS,Fall2019 CompSci516:DatabaseSystems 21

Sailors(sid,sname,rating,age)Boats(bid,bname,color)Reserves(sid,bid,day)

Findnamesofsailorswho’vereservedboat#103

• Solution1:

• Solution2:

p ssname bid serves Sailors(( Re ) )=103 !"

p ssname bid serves Sailors( (Re ))=103 !"

DukeCS,Fall2019 CompSci516:DatabaseSystems 22

Sailors(sid,sname,rating,age)Boats(bid,bname,color)Reserves(sid,bid,day)

ExpressinganRAexpressionasaTree

DukeCS,Fall2019 CompSci516:DatabaseSystems 23

Sailors(sid,sname,rating,age)Boats(bid,bname,color)Reserves(sid,bid,day)

p ssname bid serves Sailors(( Re ) )=103 !"

Sailors Reserves

σbid=103

⨝sid =sid

πsname

Alsocalledalogicalqueryplan

YoucanuseeitherRAexpressionoralogicalqueryplaninexams

ButinthelabonThursdayyouwilluseRAexpressionsInRADBandRATest tools

Findsailorswho’vereservedaredoragreenboat

DukeCS,Fall2019 CompSci516:DatabaseSystems 24

Sailors(sid,sname,rating,age)Boats(bid,bname,color)Reserves(sid,bid,day)

Useofrenameoperation

Findsailorswho’vereservedaredoragreenboat

• Canidentifyallredorgreenboats,thenfindsailorswho’vereservedoneoftheseboats:

r s( , ( ' ' ' ' ))Tempboats color red color green Boats= Ú =

p sname Tempboats serves Sailors( Re )!" !"

Can also define Tempboats using union

Try the “AND” version yourselfDukeCS,Fall2019 CompSci516:DatabaseSystems 25

Sailors(sid,sname,rating,age)Boats(bid,bname,color)Reserves(sid,bid,day)

Useofrenameoperation

Whataboutaggregates?

• Supportedbyextendedrelationalalgebra• 𝝲age,avg(rating)→ avgr Sailors

• Alsoextendedto“bagsemantic”:allowduplicates– Takeintoaccountcardinality– RandShavetupletresp.mandntimes– R∪ Shastm+n times– R∩Shastmin(m,n)times– R– Shastmax(0,m-n)times– sorting(τ),duplicateremoval(ẟ)operators

DukeCS,Fall2019 CompSci516:DatabaseSystems 26

Sailors(sid,sname,rating,age)Boats(bid,bname,color)Reserves(sid,bid,day)

Example:AggregatesinRA

DukeCS,Fall2019 CompSci516:DatabaseSystems 27

Sailors(sid,sname,rating,age)Boats(bid,bname,color)Reserves(sid,bid,day)

Outputbnames thathavebeenreservedbyatleast5sailors

Example:AggregatesinRA

DukeCS,Fall2019 CompSci516:DatabaseSystems 28

Sailors(sid,sname,rating,age)Boats(bid,bname,color)Reserves(sid,bid,day)

Sailors Reserves

⨝sid =sid

πbname

Outputbnames thathavebeenreservedbyatleast5sailors𝝲bname,count(distinctsid)→ns

σns >=5

DatabaseNormalization

Whatwillwelearn?

• Whatgoeswrongifwehaveredundantinfoinadatabase?

• Whyandhowshouldyourefineaschema?• FunctionalDependencies– anewkindofintegrityconstraints(IC)

• NormalForms• Howtoobtainthosenormalforms

DukeCS,Fall2019 CompSci516:DatabaseSystems 30

Example

ssn (S) name(N) lot(L)

rating(R)

hourly-wage(W)

hours-worked(H)

111-11-1111 Attishoo 48 8 10 40222-22-2222 Smiley 22 8 10 30333-33-3333 Smethurst 35 5 7 30444-44-4444 Guldu 35 5 7 32555-55-5555 Madayan 35 8 10 40

DukeCS,Fall2019 CompSci516:DatabaseSystems 31

Thelistofhourlyemployeesinanorganization

• key=SSN• Supposeforagivenrating,thereisonlyonehourly_wage value• Redundancyinthetable• Whyisredundancybad?

ssn (S) name(N) lot(L)

rating(R)

hourly-wage(W)

hours-worked(H)

111-11-1111 Attishoo 48 8 10 40222-22-2222 Smiley 22 8 10 30333-33-3333 Smethurst 35 5 7 30444-44-4444 Guldu 35 5 7 32555-55-5555 Madayan 35 8 10 40

DukeCS,Fall2019 CompSci516:DatabaseSystems 32

Thelistofhourlyemployeesinanorganization

1. Redundantstorage:– Someinformationisstoredrepeatedly– Theratingvalue8correspondstohourly_wage 10,whichisstoredthreetimes

Whyisredundancybad?1/4

ssn (S) name(N) lot(L)

rating(R)

hourly-wage(W)

hours-worked(H)

111-11-1111 Attishoo 48 8 10→9 40222-22-2222 Smiley 22 8 10 30333-33-3333 Smethurst 35 5 7 30444-44-4444 Guldu 35 5 7 32555-55-5555 Madayan 35 8 10 40

DukeCS,Fall2019 CompSci516:DatabaseSystems 33

Thelistofhourlyemployeesinanorganization

2. Updateanomalies– Ifonecopyofdataisupdated,aninconsistencyiscreatedunlessallcopiesaresimilarly

updated– Supposeyouupdatethehourly_wage valueinthefirsttupleusingUPDATEstatementin

SQL-- inconsistency

Whyisredundancybad?2/4

ssn (S) name(N) lot(L)

rating(R)

hourly-wage(W)

hours-worked(H)

111-11-1111 Attishoo 48 8 10 40222-22-2222 Smiley 22 8 10 30333-33-3333 Smethurst 35 5 7 30444-44-4444 Guldu 35 5 7 32555-55-5555 Madayan 35 8 10 40

DukeCS,Fall2019 CompSci516:DatabaseSystems 34

Thelistofhourlyemployeesinanorganization

3. Insertionanomalies:– Itmaynotbepossibletostorecertaininformationunlesssomeother,unrelatedinfois

storedaswell– Wecannotinsertatupleforanemployeeunlessweknowthehourlywageforthe

employee’sratingvalue

Whyisredundancybad?3/4

ssn (S) name(N) lot(L)

rating(R)

hourly-wage(W)

hours-worked(H)

111-11-1111 Attishoo 48 8 10 40222-22-2222 Smiley 22 8 10 30333-33-3333 Smethurst 35 5 7 30444-44-4444 Guldu 35 5 7 32555-55-5555 Madayan 35 8 10 40

DukeCS,Fall2019 CompSci516:DatabaseSystems 35

Thelistofhourlyemployeesinanorganization

4. Deletionanomalies:– Itmaynotbepossibledeletecertaininformationwithoutlosingsomeotherinformation

aswell– Ifwedeletealltupleswithagivenratingvalue(Attishoo,Smiley,Madayan),welosethe

associationbetweenthatratingvalueanditshourly_wage value

Whyisredundancybad?4/4

Nullsmayormaynothelp

• Doesnothelpredundantstorageorupdateanomalies• Mayhelpinsertionanddeletionanomalies

– caninsertatuplewithnullvalueinthehourly_wage field– butcannotrecordhourly_wage foraratingunlessthereissuchan

employee(SSNcannotbenull)– samefordeletionDukeCS,Fall2019 CompSci516:DatabaseSystems 36

ssn (S) name(N) lot(L)

rating(R)

hourly-wage(W)

hours-worked(H)

111-11-1111 Attishoo 48 8 10 40222-22-2222 Smiley 22 8 10 30333-33-3333 Smethurst 35 5 7 30444-44-4444 Guldu 35 5 7 32555-55-5555 Madayan 35 8 10 40

Summary:Redundancy

• Redundancyariseswhentheschemaforcesanassociationbetweenattributesthatis“notnatural”

• Wewantschemasthatdonotpermitredundancy– atleastidentifyschemasthatallowredundancytomakeaninformed

decision(e.g.forperformancereasons)

• Nullvaluemayormaynothelp

• Solution?

–Decompositionofschema!

DukeCS,Fall2019 CompSci516:DatabaseSystems 37

ssn (S) name(N) lot(L)

rating(R)

hourly-wage(W)

hours-worked(H)

111-11-1111 Attishoo 48 8 10 40222-22-2222 Smiley 22 8 10 30333-33-3333 Smethurst 35 5 7 30444-44-4444 Guldu 35 5 7 32555-55-5555 Madayan 35 8 10 40

DukeCS,Fall2019 CompSci516:DatabaseSystems 38

Decomposition:Example-1

ssn (S) name(N) lot(L)

rating(R)

hourly-wage(W)

hours-worked(H)

111-11-1111 Attishoo 48 8 10 40222-22-2222 Smiley 22 8 10 30333-33-3333 Smethurst 35 5 7 30444-44-4444 Guldu 35 5 7 32555-55-5555 Madayan 35 8 10 40

DukeCS,Fall2019 CompSci516:DatabaseSystems 39

Decomposition:Example-1

ssn (S) name(N) lot(L)

rating(R)

hours-worked(H)

111-11-1111 Attishoo 48 8 40222-22-2222 Smiley 22 8 30333-33-3333 Smethurst 35 5 30444-44-4444 Guldu 35 5 32555-55-5555 Madayan 35 8 40

rating hourly_wage

8 10

5 7

Decomposition– Example-2

40

uid uname twitterid gid fromDate

142 Bart @BartJSimpson dps 1987-04-19

123 Milhouse @MilhouseVan_ gov 1989-12-17

857 Lisa @lisasimpson abc 1987-04-19

857 Lisa @lisasimpson gov 1988-09-01

456 Ralph @ralphwiggum abc 1991-04-25

456 Ralph @ralphwiggum gov 1992-09-01

… … … … …

uid uname twitterid

142 Bart @BartJSimpson

123 Milhouse @MilhouseVan_

857 Lisa @lisasimpson

456 Ralph @ralphwiggum

… … …

uid gid fromDate

142 dps 1987-04-19

123 gov 1989-12-17

857 abc 1987-04-19

857 gov 1988-09-01

456 abc 1991-04-25

456 gov 1992-09-01

… … …

(ontwitter)

• Userid• username• Twitterid• Groupid• JoiningDate(toa

group)• Bothuid and

twitterid arekeys

DukeCS,Fall2019 CompSci516:DatabaseSystems

uid twitterid

142 @BartJSimpson

123 @MilhouseVan_

857 @lisasimpson

456 @ralphwiggum

… …

uid uname

142 Bart

123 Milhouse

857 Lisa

456 Ralph

… …

Unnecessarydecomposition

• Stillcorrect: joinreturnstheoriginalrelation• Unnecessary:noredundancyisremoved;schemaismore

complicated(anduid isstoredtwice!)41

uid uname twitterid

142 Bart @BartJSimpson

123 Milhouse @MilhouseVan_

857 Lisa @lisasimpson

456 Ralph @ralphwiggum

… … …

DukeCS,Fall2019 CompSci516:DatabaseSystems

Why?

uid fromDate

142 1987-04-19

123 1989-12-17

857 1987-04-19

857 1988-09-01

456 1991-04-25

456 1992-09-01

… …

Baddecomposition

• Associationbetweengid andfromDate islost• Joinreturnsmorerowsthantheoriginalrelation

42

uid gid fromDate

142 dps 1987-04-19

123 gov 1989-12-17

857 abc 1987-04-19

857 gov 1988-09-01

456 abc 1991-04-25

456 gov 1992-09-01

… … …uid gid

142 dps

123 gov

857 abc

857 gov

456 abc

456 gov

… …

DukeCS,Fall2019 CompSci516:DatabaseSystems

Whybad?

Losslessjoindecomposition

• Decomposerelation𝑅 intorelations𝑆 and𝑇– 𝑎𝑡𝑡𝑟𝑠 𝑅 = 𝑎𝑡𝑡𝑟𝑠 𝑆 ∪ 𝑎𝑡𝑡𝑟𝑠 𝑇– 𝑆 = 𝜋,--./ 0 𝑅– 𝑇 = 𝜋,--./ 1 𝑅

• Thedecompositionisalosslessjoindecompositionif,givenknownconstraintssuchasFD’s,wecanguaranteethat𝑅 = 𝑆 ⋈ 𝑇

• 𝑅 ⊆ 𝑆 ⋈ 𝑇 or𝑅 ⊇ 𝑆 ⋈ 𝑇 ?

• Anydecompositiongives𝑅 ⊆ 𝑆 ⋈ 𝑇 (why?)– Alossy decompositionisonewith𝑅 ⊂ 𝑆 ⋈ 𝑇

43DukeCS,Fall2019 CompSci516:DatabaseSystems

uid gid fromDate

142 dps 1987-04-19

123 gov 1989-12-17

857 abc 1987-04-19

857 gov 1988-09-01

456 abc 1991-04-25

456 gov 1992-09-01

… … …

uid gid fromDate

142 dps 1987-04-19

123 gov 1989-12-17

857 abc 1988-09-01

857 gov 1987-04-19

456 abc 1991-04-25

456 gov 1992-09-01

… … …

Loss?ButIgotmorerows!

• “Loss”refersnottothelossoftuples,buttothelossofinformation– Or,theabilitytodistinguishdifferentoriginalrelations

44

Nowaytotellwhichistheoriginalrelation

uid fromDate

142 1987-04-19

123 1989-12-17

857 1987-04-19

857 1988-09-01

456 1991-04-25

456 1992-09-01

… …

uid gid

142 dps

123 gov

857 abc

857 gov

456 abc

456 gov

… …DukeCS,Fall2019 CompSci516:DatabaseSystems

Decompositionsshouldbeusedjudiciously

1. Doweneedtodecomposearelation?– Several“normalforms”existtoidentifypossibleredundancyat

differentgranularity– Ifarelationisnotinoneofthem,mayneedtodecomposefurther

2. Whataretheproblemswithdecomposition?– Baddecompositions:e.g.,Lossydecompositions– Performanceissues-- decompositionmayboth

• helpperformance(forupdates,somequeriesaccessingpartofdata),or• hurtperformance(newjoinsmaybeneededforsomequeries)

DukeCS,Fall2019 CompSci516:DatabaseSystems 45