CompSci 516: Database Systems
Lecture 5: Relational Algebra and Normalization
Guest Lecture: Junyang Gao
Duke CS, Fall 2019
Announcements
• Lab-2 on RA on Thursday
  – Do not forget your laptop!
• Homework-1 (Part-1 and 2) have been posted on Sakai
  – First deadline next Tuesday: 09/17
  – Parsing XML will take time!
• Next week:
  – Revisit Relational Calculus!
  – New topic: Database internals and indexes!
Today’s topic
• Relational Algebra
• Normalization
Acknowledgement: The following slides have been created adapting the instructor material of the [RG] book provided by the authors Dr. Ramakrishnan and Dr. Gehrke.
Relational Algebra
• A language for querying relational data based on “operators”
• Takes one or more relations as input, and produces a relation as output
  – operator
  – operand
  – semantics
  – so an algebra!
• Since each operation returns a relation, operations can be composed
  – Algebra is “closed”
Relational Algebra
• Basic operations:
  – Selection (σ): selects a subset of rows from a relation.
  – Projection (π): deletes unwanted columns from a relation.
  – Cross-product (×): allows us to combine two relations.
  – Set-difference (−): tuples in relation 1, but not in relation 2.
  – Union (∪): tuples in relation 1 or in relation 2.
• Additional operations:
  – Intersection (∩)
  – Join (⨝)
  – Division (/)
  – Renaming (ρ)
  – Not essential, but (very) useful, especially join!
Example Schema and Instances
Sailors(sid, sname, rating, age)
Boats(bid, bname, color)
Reserves(sid, bid, day)
S1
sid  sname   rating  age
22   dustin  7       45.0
31   lubber  8       55.5
58   rusty   10      35.0
S2
sid  sname   rating  age
28   yuppy   9       35.0
31   lubber  8       55.5
44   guppy   5       35.0
58   rusty   10      35.0
R1
sid  bid  day
22   101  10/10/96
58   103  11/12/96
Projection
π_{sname, rating}(S2)
sname   rating
yuppy   9
lubber  8
guppy   5
rusty   10
π_{age}(S2)
age
35.0
55.5
• Deletes attributes that are not in the projection list.
• Schema of result contains exactly the fields in the projection list, with the same names that they had in the (single) input relation.
• Projection operator has to eliminate duplicates (set semantics!)
  – Note: real systems typically don’t do duplicate elimination unless the user explicitly asks for it (performance)
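The duplicate-elimination point can be seen in a few lines of Python. This is only a sketch: representing a relation as a list of dicts and the helper name `project` are assumptions made for illustration, not part of the lecture.

```python
# S2 instance from the slides; one dict per tuple (representation is an assumption).
S2 = [
    {"sid": 28, "sname": "yuppy", "rating": 9, "age": 35.0},
    {"sid": 31, "sname": "lubber", "rating": 8, "age": 55.5},
    {"sid": 44, "sname": "guppy", "rating": 5, "age": 35.0},
    {"sid": 58, "sname": "rusty", "rating": 10, "age": 35.0},
]


def project(relation, attrs):
    """Set-semantics projection: keep only attrs and drop duplicate tuples."""
    return {tuple(row[a] for a in attrs) for row in relation}


names_ratings = project(S2, ["sname", "rating"])  # four distinct tuples
ages = project(S2, ["age"])                       # 35.0 occurs three times, kept once
```

Using a Python set for the result is what removes the duplicates; a list comprehension instead would mimic what real systems do by default.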
Selection
σ_{rating>8}(S2)
sid  sname  rating  age
28   yuppy  9       35.0
58   rusty  10      35.0
• Selects rows that satisfy the selection condition
• No duplicates in result
  – Because the input is a set!
• Schema of result is identical to the schema of the (single) input relation
Composition of Operators
• The result relation can be the input for another relational algebra operation
  – Operator composition
What do we get here?
π_{sname, rating}(σ_{rating>8}(S2))
sname  rating
yuppy  9
rusty  10
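Because every operator returns a relation, composition is just nested function calls. The sketch below assumes a list-of-dicts representation and two hypothetical helpers, `select` and `project`; it is one possible way to mirror the expression, not a prescribed implementation.

```python
# S2 instance from the slides (list-of-dicts representation is an assumption).
S2 = [
    {"sid": 28, "sname": "yuppy", "rating": 9, "age": 35.0},
    {"sid": 31, "sname": "lubber", "rating": 8, "age": 55.5},
    {"sid": 44, "sname": "guppy", "rating": 5, "age": 35.0},
    {"sid": 58, "sname": "rusty", "rating": 10, "age": 35.0},
]


def select(relation, predicate):
    """Selection: keep the rows that satisfy the predicate."""
    return [row for row in relation if predicate(row)]


def project(relation, attrs):
    """Projection with set semantics."""
    return {tuple(row[a] for a in attrs) for row in relation}


# The output of select is fed directly into project.
result = project(select(S2, lambda r: r["rating"] > 8), ["sname", "rating"])
```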
Union, Intersection, Set-Difference
• All of these operations take two input relations, which must be union-compatible:
  – Same number of fields.
  – “Corresponding” fields must have the same type.
• The result has the same schema as the inputs.
S1 ∪ S2
sid  sname   rating  age
22   dustin  7       45.0
31   lubber  8       55.5
58   rusty   10      35.0
44   guppy   5       35.0
28   yuppy   9       35.0
You would lose points if your relations in ∪, ∩, − are not union-compatible!
Union, Intersection, Set-Difference
• Note: no duplicates
  – “Set semantics”
  – SQL: UNION
  – But SQL allows “bag semantics” as well: UNION ALL
Union, Intersection, Set-Difference
S1 ∩ S2
sid  sname   rating  age
31   lubber  8       55.5
58   rusty   10      35.0
S1 − S2
sid  sname   rating  age
22   dustin  7       45.0
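Under set semantics the three operators map directly onto Python's built-in set operators. The sketch below stores each tuple of S1 and S2 as a plain Python tuple, which is an assumption made for illustration.

```python
# S1 and S2 from the slides; both have schema (sid, sname, rating, age),
# so they are union-compatible.
S1 = {
    (22, "dustin", 7, 45.0),
    (31, "lubber", 8, 55.5),
    (58, "rusty", 10, 35.0),
}
S2 = {
    (28, "yuppy", 9, 35.0),
    (31, "lubber", 8, 55.5),
    (44, "guppy", 5, 35.0),
    (58, "rusty", 10, 35.0),
}

union = S1 | S2          # tuples in S1 or in S2, no duplicates
intersection = S1 & S2   # tuples in both
difference = S1 - S2     # tuples in S1 but not in S2
```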
Cross-Product
• Each row of S1 is paired with each row of R1.
• Result schema has one field per field of S1 and R1, with field names “inherited” if possible.
  – Conflict: both S1 and R1 have a field called sid.
S1 × R1
(sid)  sname   rating  age   (sid)  bid  day
22     dustin  7       45.0  22     101  10/10/96
22     dustin  7       45.0  58     103  11/12/96
31     lubber  8       55.5  22     101  10/10/96
31     lubber  8       55.5  58     103  11/12/96
58     rusty   10      35.0  22     101  10/10/96
58     rusty   10      35.0  58     103  11/12/96
Renaming Operator ρ
(ρ_{sid→sid1} S1) × (ρ_{sid→sid2} R1)
or
ρ(C(1→sid1, 5→sid2), S1 × R1)
C is the new relation name
sid1  sname   rating  age   sid2  bid  day
22    dustin  7       45.0  22    101  10/10/96
22    dustin  7       45.0  58    103  11/12/96
31    lubber  8       55.5  22    101  10/10/96
31    lubber  8       55.5  58    103  11/12/96
58    rusty   10      35.0  22    101  10/10/96
58    rusty   10      35.0  58    103  11/12/96
• In general, can use ρ(<Temp>, <RA-expression>)
• Different syntaxes are used; you can use any of these
Joins
R ⨝_c S = σ_c(R × S)
• Result schema same as that of cross-product.
• Fewer tuples than cross-product, might be able to compute more efficiently
  – We will do join algorithms later
S1 ⨝_{S1.sid < R1.sid} R1
(sid)  sname   rating  age   (sid)  bid  day
22     dustin  7       45.0  58     103  11/12/96
31     lubber  8       55.5  58     103  11/12/96
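The definition of a join as a selection over a cross-product can be written almost literally in Python. This is a naive sketch, not an efficient join algorithm; the helper names `cross` and `theta_join` and the positional tuple layout are assumptions.

```python
from itertools import product

# S1 and R1 from the slides, as positional tuples.
S1 = [(22, "dustin", 7, 45.0), (31, "lubber", 8, 55.5), (58, "rusty", 10, 35.0)]
R1 = [(22, 101, "10/10/96"), (58, 103, "11/12/96")]


def cross(r, s):
    """Cross-product: concatenate every tuple of r with every tuple of s."""
    return [a + b for a, b in product(r, s)]


def theta_join(r, s, cond):
    """Join as selection over the cross-product."""
    return [t for t in cross(r, s) if cond(t)]


# In the concatenated tuple, position 0 is S1.sid and position 4 is R1.sid.
joined = theta_join(S1, R1, lambda t: t[0] < t[4])
```

Addressing the two sid fields by position sidesteps the naming conflict, much like the positional renaming ρ(C(1→sid1, 5→sid2), ...) does with 1-based positions.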
Find names of sailors who’ve reserved boat #103
• Solution 1: π_{sname}((σ_{bid=103} Reserves) ⨝ Sailors)
• Solution 2: π_{sname}(σ_{bid=103}(Reserves ⨝ Sailors))
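A quick way to convince yourself that the two solutions agree is to run both on a small instance. The sketch below reuses S1 as the Sailors instance and R1 as the Reserves instance; that pairing, and the helper `join_on_sid`, are assumptions for illustration.

```python
# Instances taken from S1 and R1 in the slides.
Sailors = [
    {"sid": 22, "sname": "dustin", "rating": 7, "age": 45.0},
    {"sid": 31, "sname": "lubber", "rating": 8, "age": 55.5},
    {"sid": 58, "sname": "rusty", "rating": 10, "age": 35.0},
]
Reserves = [
    {"sid": 22, "bid": 101, "day": "10/10/96"},
    {"sid": 58, "bid": 103, "day": "11/12/96"},
]


def join_on_sid(r, s):
    """Natural join on the shared attribute sid."""
    return [{**a, **b} for a in r for b in s if a["sid"] == b["sid"]]


# Solution 1: select first, then join.
selected = [r for r in Reserves if r["bid"] == 103]
sol1 = {t["sname"] for t in join_on_sid(selected, Sailors)}

# Solution 2: join first, then select.
sol2 = {t["sname"] for t in join_on_sid(Reserves, Sailors) if t["bid"] == 103}
```

Both produce the same answer, but Solution 1 joins fewer tuples, which is the kind of difference a query optimizer cares about.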
Expressing an RA expression as a Tree
π_{sname}((σ_{bid=103} Reserves) ⨝ Sailors)
π_{sname}
└─ ⨝_{sid=sid}
   ├─ Sailors
   └─ σ_{bid=103}
      └─ Reserves
• Also called a logical query plan
• You can use either an RA expression or a logical query plan in exams
• But in the lab on Thursday you will use RA expressions in the RADB and RATest tools
Find sailors who’ve reserved a red or a green boat
Use of rename operation
• Can identify all red or green boats, then find sailors who’ve reserved one of these boats:
  ρ(Tempboats, σ_{color='red' ∨ color='green'} Boats)
  π_{sname}(Tempboats ⨝ Reserves ⨝ Sailors)
• Can also define Tempboats using union
• Try the “AND” version yourself
What about aggregates?
• Supported by extended relational algebra
• γ_{age, avg(rating)→avgr} Sailors
• Also extended to “bag semantics”: allow duplicates
  – Take into account cardinality
  – Suppose tuple t appears m times in R and n times in S
  – R ∪ S has t m+n times
  – R ∩ S has t min(m, n) times
  – R − S has t max(0, m−n) times
  – Sorting (τ) and duplicate removal (δ) operators
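The bag-semantics counting rules happen to match the arithmetic of `collections.Counter` in the Python standard library, which makes them easy to check. The values m = 3 and n = 2 below are arbitrary, chosen only for illustration.

```python
from collections import Counter

# Tuple t appears m = 3 times in R and n = 2 times in S (illustrative values).
R = Counter({"t": 3})
S = Counter({"t": 2})

bag_union = R + S         # counts add: m + n
bag_intersection = R & S  # element-wise minimum: min(m, n)
bag_difference = R - S    # subtraction, keeping only positive counts: max(0, m - n)
reverse_difference = S - R  # n - m is negative, so t disappears
```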
Example: Aggregates in RA
Output bnames that have been reserved by at least 5 sailors
π_{bname}
└─ σ_{ns>=5}
   └─ γ_{bname, count(distinct sid)→ns}
      └─ ⨝_{bid=bid}
         ├─ Boats
         └─ Reserves
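The group-by-and-count step can be sketched in Python as follows. Since bname is an attribute of Boats in the schema, the sketch joins Boats with Reserves on bid; the boat names and reservation rows are made-up data, not taken from the lecture.

```python
from collections import defaultdict

# Hypothetical instances: Boats as (bid, bname), Reserves as (sid, bid).
Boats = [(101, "Interlake"), (103, "Clipper")]
Reserves = [(1, 101), (2, 101), (3, 101), (4, 101), (5, 101), (1, 101),
            (1, 103), (2, 103)]

bname_of = dict(Boats)  # bid -> bname, stands in for the join on bid

# Grouping by bname with count(distinct sid): collect sids into a set per group.
sailors_per_bname = defaultdict(set)
for sid, bid in Reserves:
    sailors_per_bname[bname_of[bid]].add(sid)

# Selection ns >= 5 followed by projection onto bname.
result = {bname for bname, sids in sailors_per_bname.items() if len(sids) >= 5}
```

Sailor 1 reserves boat 101 twice in this data; storing sids in a set is what makes the count distinct.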
What will we learn?
• What goes wrong if we have redundant info in a database?
• Why and how should you refine a schema?
• Functional Dependencies: a new kind of integrity constraint (IC)
• Normal Forms
• How to obtain those normal forms
Example
The list of hourly employees in an organization
ssn (S)      name (N)   lot (L)  rating (R)  hourly-wage (W)  hours-worked (H)
111-11-1111  Attishoo   48       8           10               40
222-22-2222  Smiley     22       8           10               30
333-33-3333  Smethurst  35       5           7                30
444-44-4444  Guldu      35       5           7                32
555-55-5555  Madayan    35       8           10               40
• key = SSN
• Suppose for a given rating, there is only one hourly_wage value
• Redundancy in the table
• Why is redundancy bad?
Why is redundancy bad? 1/4
1. Redundant storage:
  – Some information is stored repeatedly
  – The rating value 8 corresponds to hourly_wage 10, which is stored three times
Why is redundancy bad? 2/4
2. Update anomalies:
  – If one copy of data is updated, an inconsistency is created unless all copies are similarly updated
  – Suppose you update the hourly_wage value in the first tuple (Attishoo) from 10 to 9 using an UPDATE statement in SQL: rating 8 now maps to both 9 and 10, which is an inconsistency
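The update anomaly can be made concrete by checking the rule "one hourly_wage per rating" before and after the update. The helper `fd_holds` is a hypothetical name, and only the columns needed for the check are included.

```python
def fd_holds(rows, lhs, rhs):
    """Return True if every lhs value maps to a single rhs value in rows."""
    seen = {}
    for row in rows:
        key = row[lhs]
        if key in seen and seen[key] != row[rhs]:
            return False  # same lhs value, different rhs value
        seen[key] = row[rhs]
    return True


# Rows from the hourly-employees table (name, rating, hourly_wage only).
employees = [
    {"name": "Attishoo", "rating": 8, "hourly_wage": 10},
    {"name": "Smiley", "rating": 8, "hourly_wage": 10},
    {"name": "Smethurst", "rating": 5, "hourly_wage": 7},
    {"name": "Guldu", "rating": 5, "hourly_wage": 7},
    {"name": "Madayan", "rating": 8, "hourly_wage": 10},
]

before = fd_holds(employees, "rating", "hourly_wage")

# Update only the first tuple, as in the slide (10 -> 9).
employees[0]["hourly_wage"] = 9
after = fd_holds(employees, "rating", "hourly_wage")
```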
Why is redundancy bad? 3/4
3. Insertion anomalies:
  – It may not be possible to store certain information unless some other, unrelated info is stored as well
  – We cannot insert a tuple for an employee unless we know the hourly wage for the employee’s rating value
Why is redundancy bad? 4/4
4. Deletion anomalies:
  – It may not be possible to delete certain information without losing some other information as well
  – If we delete all tuples with a given rating value (Attishoo, Smiley, Madayan), we lose the association between that rating value and its hourly_wage value
Nulls may or may not help
• Do not help with redundant storage or update anomalies
• May help with insertion and deletion anomalies
  – can insert a tuple with a null value in the hourly_wage field
  – but cannot record hourly_wage for a rating unless there is such an employee (SSN cannot be null)
  – same for deletion
Summary: Redundancy
• Redundancy arises when the schema forces an association between attributes that is “not natural”
• We want schemas that do not permit redundancy
  – at least identify schemas that allow redundancy to make an informed decision (e.g. for performance reasons)
• Null values may or may not help
• Solution?
  – Decomposition of schema!
Decomposition: Example-1
ssn (S)      name (N)   lot (L)  rating (R)  hours-worked (H)
111-11-1111  Attishoo   48       8           40
222-22-2222  Smiley     22       8           30
333-33-3333  Smethurst  35       5           30
444-44-4444  Guldu      35       5           32
555-55-5555  Madayan    35       8           40
rating  hourly_wage
8       10
5       7
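As a sanity check, the two smaller tables can be derived by projection and joined back on rating to recover the original relation. The variable names `Hourly`, `Emp`, and `Wages` are assumptions introduced for this sketch.

```python
# Original relation: (ssn, name, lot, rating, hourly_wage, hours_worked).
Hourly = {
    ("111-11-1111", "Attishoo", 48, 8, 10, 40),
    ("222-22-2222", "Smiley", 22, 8, 10, 30),
    ("333-33-3333", "Smethurst", 35, 5, 7, 30),
    ("444-44-4444", "Guldu", 35, 5, 7, 32),
    ("555-55-5555", "Madayan", 35, 8, 10, 40),
}

# Decompose by projection (set semantics removes the repeated wage pairs).
Emp = {(s, n, l, r, h) for (s, n, l, r, w, h) in Hourly}
Wages = {(r, w) for (s, n, l, r, w, h) in Hourly}

# Natural join on rating puts the columns back in the original order.
rejoined = {
    (s, n, l, r, w, h)
    for (s, n, l, r, h) in Emp
    for (r2, w) in Wages
    if r == r2
}
```

Each rating-to-wage pair is now stored once, and the join loses nothing because rating determines hourly_wage in this instance.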
Decomposition: Example-2
uid uname twitterid gid fromDate
142 Bart @BartJSimpson dps 1987-04-19
123 Milhouse @MilhouseVan_ gov 1989-12-17
857 Lisa @lisasimpson abc 1987-04-19
857 Lisa @lisasimpson gov 1988-09-01
456 Ralph @ralphwiggum abc 1991-04-25
456 Ralph @ralphwiggum gov 1992-09-01
… … … … …
uid uname twitterid
142 Bart @BartJSimpson
123 Milhouse @MilhouseVan_
857 Lisa @lisasimpson
456 Ralph @ralphwiggum
… … …
uid gid fromDate
142 dps 1987-04-19
123 gov 1989-12-17
857 abc 1987-04-19
857 gov 1988-09-01
456 abc 1991-04-25
456 gov 1992-09-01
… … …
(on Twitter)
• User id
• User name
• Twitter id
• Group id
• Joining date (to a group)
• Both uid and twitterid are keys
uid twitterid
142 @BartJSimpson
123 @MilhouseVan_
857 @lisasimpson
456 @ralphwiggum
… …
uid uname
142 Bart
123 Milhouse
857 Lisa
456 Ralph
… …
Unnecessary decomposition
• Still correct: join returns the original relation
• Unnecessary: no redundancy is removed; schema is more complicated (and uid is stored twice!)
uid uname twitterid
142 Bart @BartJSimpson
123 Milhouse @MilhouseVan_
857 Lisa @lisasimpson
456 Ralph @ralphwiggum
… … …
Why?
uid fromDate
142 1987-04-19
123 1989-12-17
857 1987-04-19
857 1988-09-01
456 1991-04-25
456 1992-09-01
… …
Bad decomposition
• Association between gid and fromDate is lost
• Join returns more rows than the original relation
uid gid fromDate
142 dps 1987-04-19
123 gov 1989-12-17
857 abc 1987-04-19
857 gov 1988-09-01
456 abc 1991-04-25
456 gov 1992-09-01
… … …
uid gid
142 dps
123 gov
857 abc
857 gov
456 abc
456 gov
… …
Why bad?
Lossless join decomposition
• Decompose relation R into relations S and T
  – attrs(R) = attrs(S) ∪ attrs(T)
  – S = π_{attrs(S)}(R)
  – T = π_{attrs(T)}(R)
• The decomposition is a lossless join decomposition if, given known constraints such as FDs, we can guarantee that R = S ⋈ T
• R ⊆ S ⋈ T or R ⊇ S ⋈ T?
• Any decomposition gives R ⊆ S ⋈ T (why?)
  – A lossy decomposition is one with R ⊂ S ⋈ T
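The containment R ⊆ S ⋈ T, and the way a lossy decomposition fails to be an equality, can be checked on the (uid, gid, fromDate) rows listed in the slides. This sketch uses only the six explicit rows (the elided "…" rows are left out), and `R_alt` is the variant with the two dates for uid 857 swapped.

```python
# The six listed rows of the (uid, gid, fromDate) relation.
R = {
    (142, "dps", "1987-04-19"),
    (123, "gov", "1989-12-17"),
    (857, "abc", "1987-04-19"),
    (857, "gov", "1988-09-01"),
    (456, "abc", "1991-04-25"),
    (456, "gov", "1992-09-01"),
}


def decompose(rel):
    """Project onto (uid, gid) and (uid, fromDate)."""
    s = {(u, g) for (u, g, d) in rel}
    t = {(u, d) for (u, g, d) in rel}
    return s, t


def join_on_uid(s, t):
    """Natural join of the two projections on uid."""
    return {(u, g, d) for (u, g) in s for (u2, d) in t if u == u2}


S, T = decompose(R)
J = join_on_uid(S, T)

# A different original relation: the two dates for uid 857 are swapped.
R_alt = (R - {(857, "abc", "1987-04-19"), (857, "gov", "1988-09-01")}) | {
    (857, "abc", "1988-09-01"),
    (857, "gov", "1987-04-19"),
}
S_alt, T_alt = decompose(R_alt)
```

R and R_alt produce identical projections, so nothing computed from S and T can tell them apart; that is the information lost.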
uid gid fromDate
142 dps 1987-04-19
123 gov 1989-12-17
857 abc 1987-04-19
857 gov 1988-09-01
456 abc 1991-04-25
456 gov 1992-09-01
… … …
uid gid fromDate
142 dps 1987-04-19
123 gov 1989-12-17
857 abc 1988-09-01
857 gov 1987-04-19
456 abc 1991-04-25
456 gov 1992-09-01
… … …
Loss? But I got more rows!
• “Loss” refers not to the loss of tuples, but to the loss of information
  – Or, the ability to distinguish different original relations
No way to tell which is the original relation
uid fromDate
142 1987-04-19
123 1989-12-17
857 1987-04-19
857 1988-09-01
456 1991-04-25
456 1992-09-01
… …
uid gid
142 dps
123 gov
857 abc
857 gov
456 abc
456 gov
… …
Decompositions should be used judiciously
1. Do we need to decompose a relation?
  – Several “normal forms” exist to identify possible redundancy at different granularity
  – If a relation is not in one of them, may need to decompose further
2. What are the problems with decomposition?
  – Bad decompositions: e.g., lossy decompositions
  – Performance issues: decomposition may either
    • help performance (for updates, some queries accessing part of data), or
    • hurt performance (new joins may be needed for some queries)