Introduction to Databases CSC 343 Winter 2018 MICHAEL LIUT ( [email protected] ) DEPARTMENT OF MATHEMATICAL AND COMPUTATIONAL SCIENCES UNIVERSITY OF TORONTO MISSISSAUGA

  • Administration

    Instructor: Michael Liut

    Office: DH–3097B

    Course Website: https://www.michaelliut.ca/csc343.html

    Textbook: Database Management Systems (3rd Ed.), Ramakrishnan & Gehrke.

    Tutorial   Date         Time                 Location
    TUT01      Wednesday    11:00am – 12:00pm    DH-2020
    TUT02      Wednesday    12:00pm – 1:00pm     DH-2020

    Lecture    Date         Time                 Location
    LEC01      Monday       9:00am – 11:00am     KN-L1220

  • Teaching Assistants

    1. Mohammed Hossain ([email protected])

    2. Pankaj Agrawal ([email protected])

  • Evaluations

    Examinations          Due Date              Weight
    Midterm Exam          March 5th, 2018       20%
    Final Exam            TBA                   40%

    Group Assignments*    Due Date              Weight
    Assignment 1          February 5th, 2018    13.3%
    Assignment 2          March 12th, 2018      13.3%
    Assignment 3          April 2nd, 2018       13.3%

    * Best 2 of 3, under the requirement that a minimum grade of 50% is achieved on each.

  • Group Assignments

    • Assignments are to be completed in pairs (groups of 2).

    • Groups must stay together for the duration of the semester.

    • Please email me your pair selection before 9 AM on January 22nd.

    • Assignments are posted on the course website.

  • Group Assignments

    • Submissions must be completed on Blackboard.

    • Late Policy:
      ◦ 20% will be docked per day of lateness.
      ◦ After four (4) days, the assignment will no longer be accepted.

    • Re-Marking requests:
      ◦ Please contact the grading TA first.
      ◦ Please feel free to contact the Instructor if the TA cannot/does not resolve the issue.

  • Plagiarism

    You are encouraged to discuss course content with your peers; however, submitted work and solutions must be formulated based on your own ideas and conclusions.

    Plagiarism and cheating are serious academic offenses, and will be handled accordingly.

    When you submit a piece of assessment (e.g. an assignment or an examination) you are certifying that it is your work and you alone generated the solution.

  • Plagiarism Detection

    Turnitin - http://turnitin.com
    ◦ Automatically integrated into Blackboard.
    ◦ Checks against current/past submissions and all online resource databases.

    MOSS - Measure Of Software Similarity
    ◦ Developed at Stanford, utilizing “Document Fingerprinting”:
      http://theory.stanford.edu/~aiken/publications/papers/sigmod03.pdf
    ◦ Used for checking all programming submissions.

  • Questions / Concerns / Issues / Do you just want to talk?

    If something is unclear, please ask in class! I encourage feedback!

    If something is concerning you, please let me know! I don’t bite!

    Office Hours: Monday from 1 PM to 2 PM

    Office Location: DH–3097B

    [email protected]

    Open Door Policy! If my door is open, feel free to enter, even if it is just for a chat or quick question. If it is closed, please knock.

  • Course Syllabus

    READ IT! It is important!

    My Course Syllabus. UTM’s Official Course Syllabus.

    Found on the course website: https://www.michaelliut.ca/csc343.html

    Blackboard and the Course Website will be used interchangeably. Both are to be checked on a regular basis.

  • Topics

    • Relational Model

    • ER Model

    • SQL

    • Aggregation and Joins

    • Constraints and Triggers

    • Relational Algebra

    • Views and Indexes

    • Database Design

    • Transactions

    • Concurrency

    • Intro to NoSQL and MongoDB (time permitting)

    • Hadoop vs. GFS vs. Cassandra (time permitting)

  • Big Data AND Trends

    During a hurricane warning/prior to a hurricane occurring, Walmart found an increase in sales in:

    Strawberry Pop-Tarts

    7 times the “norm”

  • Data Science

    Empirical Science: collect and systemize facts.

    Theoretical Science: formulate theories and empirically test them.

    Computational Science: run automatic proofs, run simulations.

    Data Science: collect data and find patterns within the data. Think statistics meets mathematics meets computer science and programming.

  • How do databases run your life?

    • Cloud Storage (e.g. Dropbox, Google Drive, iCloud, etc.)
      ◦ Where is the data? How is it categorized and quickly accessible?

    • Online Streaming Applications (e.g. Netflix, YouTube, HBO Now, etc.)
      ◦ Generating lists of videos based on searches and tracking users’ preferences. “Recommended” videos.

    • Finances (e.g. Chequing/Savings Account, Stock Market, Credit Cards, etc.)
      ◦ VISA processes an average of 150 million transactions per day.

  • How do databases run your life?

    • Social Media (e.g. Instagram, Facebook, Twitter, etc.)
      ◦ Storing personal information and multimedia content.
      ◦ “Suggestions For You” or “People You May Know”.

    • E-commerce (e.g. Amazon, eBay, Alibaba, etc.)
      ◦ Online businesses that store and catalogue items.
      ◦ Organize their products’ details, pricing information, and sellers.
      ◦ Store users’ purchase history, payment information/preferences, and search history.

    Huge area of Data Analytics!

  • Data Analytics

    • The science of examining and interpreting raw data to find patterns and deduce conclusions.

    • Applying algorithms, mathematical techniques, and mechanical processes to form a conclusion about the information being analyzed.

    • By 2020, there will be over $200 Billion spent annually in the ‘Big Data and Business Analytics’ market.

  • What is a Database?

    Naively defined as…
    ◦ a collection of information that exists over a long period of time.

  • What is a Database?

    DATABASE

    A very large, integrated collection of data (i.e. records or files).

    Models a real-world enterprise
    ◦ Entities (e.g. teams, games)
    ◦ Relationships (e.g. Barack Obama received The Nobel Peace Prize)
    ◦ Constraints (e.g. at least one doctor on duty during off-hours)

    DATABASE MANAGEMENT SYSTEM (DBMS)

    A software system designed to store, manage, and facilitate access to data.

  • Is the WWW a DBMS?

    WWW = World Wide Web

    Fairly sophisticated searches available
    ◦ Web Crawlers index pages
    ◦ Keyword-based search for pages

    Currently data is unstructured and untyped

    Search ONLY
    ◦ Can’t modify the data
    ◦ Can’t get summaries or complex combinations of data

  • Is the WWW a DBMS?

    Few (zero) guarantees provided for:
    ◦ Freshness of data
    ◦ Consistency across data items
    ◦ Fault Tolerance

    Websites (e.g. e-commerce sites such as Amazon or eBay) typically have a DBMS in the background to provide these functions.

  • “Search” vs. “Query”

    What if you wanted to look up all of the countries that are part of the European Union (EU)?

    Try “countries in the eu” in a search engine (e.g. Google)

  • Search

    Based on keyword matching

    ◦ Our search matches countries that belong to the European Union (EU)

    ◦ Results are ranked based on:
      ◦ Popularity
      ◦ Reputation
      ◦ Paid Advertisements

    ◦ Web documents
    ◦ Limited structure

  • “Search” vs. “Query”

    “Search” returns a document as is.

  • Query

    A request for information from a Database.
    ◦ In a DBMS, a specialized language (Query Language) is used.

    The ease with which this information can be obtained from a database often determines its value to a user.

    The questions posed in a Query are generally designed for a more specific result than those in a search.

  • Query

    Think of a University Database; some questions asked may be:

    1. What is the name of the student with student ID #123456?

    2. How many students are enrolled in CSC343?

    3. What fraction of students in CSC343 received a grade better than B?
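
The three questions above can be sketched as SQL queries run through Python's built-in `sqlite3` module. This is only an illustration: the tables `Student` and `Enrolled`, their columns, the sample rows, and the choice to store grades as grade points with B = 3.0 are all assumptions made for the example, not part of the course material.

```python
import sqlite3

# Hypothetical schema and sample rows, invented purely for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Student (sid INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE Enrolled (
        sid    INTEGER REFERENCES Student(sid),
        course TEXT,
        grade  REAL,              -- grade points on a 4.0 scale (assumption)
        PRIMARY KEY (sid, course)
    );
    INSERT INTO Student VALUES (123456, 'Ada'), (234567, 'Grace'), (345678, 'Edgar');
    INSERT INTO Enrolled VALUES
        (123456, 'CSC343', 3.7),
        (234567, 'CSC343', 2.7),
        (345678, 'CSC343', 3.3),
        (345678, 'CSC369', 3.0);
""")

B = 3.0  # assumed grade-point value of a B

# 1. Name of the student with student ID 123456.
name = conn.execute(
    "SELECT name FROM Student WHERE sid = ?", (123456,)
).fetchone()[0]

# 2. Number of students enrolled in CSC343.
count = conn.execute(
    "SELECT COUNT(*) FROM Enrolled WHERE course = ?", ("CSC343",)
).fetchone()[0]

# 3. Fraction of CSC343 students with a grade better than B.
#    In SQLite a comparison evaluates to 0 or 1, so AVG gives the fraction.
fraction = conn.execute(
    "SELECT AVG(grade > ?) FROM Enrolled WHERE course = ?", (B, "CSC343")
).fetchone()[0]

print(name, count, fraction)
```

Each question maps to one short declarative statement; none of them says how the rows should be located on disk.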

  • Is a File a DBMS?

    Thought Experiment 1

    You and a friend are both editing a file at the same time.

    You and your friend both save the file at the exact same time.

    Whose changes survived?

    A) Yours  B) Your Friend’s  C) Both  D) Neither  E) Not A, B, C, or D
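
One possible outcome of this experiment can be simulated in a few lines of Python. The sketch below is an assumption-laden illustration, not the answer to the question: it runs the two "editors" one after the other in a single process, uses a hypothetical temporary file, and assumes each save rewrites the whole file. With truly simultaneous saves the result depends on the operating system, the editor, and timing.

```python
import os
import tempfile

# Hypothetical shared file; both editors start from the same contents.
path = os.path.join(tempfile.mkdtemp(), "shared.txt")
with open(path, "w") as f:
    f.write("original\n")


def load(p):
    with open(p) as f:
        return f.read()


def save(p, text):
    with open(p, "w") as f:  # mode "w" truncates: the whole file is replaced
        f.write(text)


# Both editors read the file before either one saves.
yours = load(path)
friends = load(path)

# Each edits a private in-memory copy.
yours += "your line\n"
friends += "friend's line\n"

# The saves land one after the other; here yours happens to go first.
save(path, yours)
save(path, friends)

final = load(path)
print(final)
```

In this particular ordering the second save silently overwrites the first, and nothing in the file system reports that an update was lost.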

  • Is a File a DBMS?

    Thought Experiment 2

    You and a friend are updating a file.

    The power goes out.

    Whose changes survived?

    A) All  B) None  C) All Since Last Save  D) Not A, B, or C

    Q: How do you write programs over a subsystem when it promises you No Options?
    A: VERY, VERY CAREFULLY!!

  • Why Use a DBMS?

    • Data independence and efficient access.

    • Reduce application development time.

    • Data integrity and security.

    • Concurrent access, recovery from crashes.

  • Why Study Databases?

    Shift from computation to information
    ◦ Always true for corporate computing
    ◦ Web made this point for personal computing
    ◦ More and more true for scientific computing

    Need for DBMS has exploded!
    ◦ Corporate: retail swipe/clickstreams, “customer relationship management”, “supply chain management”, “data warehouses”, Big Data, etc.
    ◦ Scientific: digital libraries, Human Genome Project, Sloan Digital Sky Survey, physical sensors, etc.

    DBMS encompasses much of CS in a practical discipline
    ◦ OS, languages, theory, machine learning, logic
    ◦ Yet traditional focus on real-world apps

  • What is Intellectual Content?

    Representing Information
    ◦ data modelling

    Languages and Systems for Querying Data
    ◦ complex queries with realistic semantics*
    ◦ over massive data sets

    Concurrency Control for Data Manipulation
    ◦ controlling concurrent access
    ◦ ensuring transactional semantics*

    Reliable Data Storage
    ◦ maintain data semantics* even if you pull the plug

    *semantics: the meaning or relationship of meaning of a sign or set of signs.

  • Describing Data: Data Models

    A data model is a collection of concepts for describing data.

    A schema is a description of a particular collection of data, using a given data model.

    The relational data model is the most widely used model today.
    ◦ Main Concept: relation, basically a table with rows and columns.
    ◦ Every relation has a schema, which describes the columns, or fields.
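
To make the distinction between a relation and its schema concrete, here is a minimal sketch using Python's built-in `sqlite3` module. The `Student` relation, its three fields, and the two sample rows are assumptions invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Schema: the relation's name plus the name and type of each column (field).
conn.execute("CREATE TABLE Student (sid INTEGER, name TEXT, gpa REAL)")

# Instance: the rows currently stored in the relation.
conn.executemany(
    "INSERT INTO Student VALUES (?, ?, ?)",
    [(123456, "Ada", 3.7), (234567, "Grace", 3.3)],
)

# PRAGMA table_info returns one row per column:
# (cid, name, type, notnull, dflt_value, pk).
schema = [(col[1], col[2]) for col in conn.execute("PRAGMA table_info(Student)")]
rows = conn.execute("SELECT * FROM Student ORDER BY sid").fetchall()

print(schema)  # the description of the columns
print(rows)    # the table contents
```

The schema stays fixed while rows are inserted and deleted; only the instance changes.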

  • Data Independence

    Applications insulated from how data is structured and stored.

    Logical data independence
    • Protection from changes in logical structure of data.
    • i.e. the ability to change the conceptual (logical) schema without changing the external schema (user view).
    • e.g. addition/removal of an entity or relationship.

    Physical data independence
    • Protection from changes in physical structure of data.
    • e.g. hardware-level considerations, system designs, etc.

    Q: Why is this particularly important for DBMS?
    A: Rate of change of DB applications is slow!

    More Generally: dapp/dt
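
Logical data independence can be sketched with a view acting as the external schema. In the hypothetical example below (all table, view, and column names are assumptions), the application function only ever queries `StudentView`. The logical schema is then reorganized from one table into two, the view is redefined, and the application code runs unchanged.

```python
import sqlite3

conn = sqlite3.connect(":memory:")


def student_names(c):
    # Application code: it only knows the external schema (the view).
    return [r[0] for r in c.execute("SELECT name FROM StudentView ORDER BY sid")]


# Version 1 of the logical schema: a single table.
conn.executescript("""
    CREATE TABLE Student (sid INTEGER PRIMARY KEY, name TEXT, email TEXT);
    INSERT INTO Student VALUES
        (1, 'Ada', 'ada@example.org'),
        (2, 'Grace', 'grace@example.org');
    CREATE VIEW StudentView AS SELECT sid, name, email FROM Student;
""")
before = student_names(conn)

# Version 2: contact details move to their own table; the view is redefined
# so that it still exposes the same columns.
conn.executescript("""
    DROP VIEW StudentView;
    CREATE TABLE Person (sid INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE Contact (sid INTEGER PRIMARY KEY, email TEXT);
    INSERT INTO Person SELECT sid, name FROM Student;
    INSERT INTO Contact SELECT sid, email FROM Student;
    DROP TABLE Student;
    CREATE VIEW StudentView AS
        SELECT p.sid, p.name, c.email
        FROM Person p JOIN Contact c ON c.sid = p.sid;
""")
after = student_names(conn)

print(before == after)
```

Only the view definition had to change; `student_names` was not touched.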

  • Concurrency Control

    Concurrent execution of user programs: key to good DBMS performance.
    ◦ Frequent disk accesses.
    ◦ Keep the CPU working on several programs concurrently.

    Interleaving actions of different programs: trouble!
    ◦ e.g., account-transfer and print statement at the same time.

    DBMS ensures such problems don’t arise.
    ◦ Users/programmers can pretend they are using a single-user system (“Isolation”).
    ◦ Thank goodness! You don’t have to program “very, very carefully”.
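
The account-transfer and print-statement example can be simulated without any DBMS to see why interleaving is trouble. The sketch below is a deterministic toy model: the account names, balances, and the three step names are assumptions, and the "schedule" is simply a list that fixes the order in which the steps of the two programs run.

```python
def run(schedule):
    """Run the steps of a transfer and a statement in the given order."""
    accounts = {"chequing": 500, "savings": 500}
    statement_total = None
    for step in schedule:
        if step == "debit":        # transfer, part 1
            accounts["chequing"] -= 100
        elif step == "credit":     # transfer, part 2
            accounts["savings"] += 100
        elif step == "print":      # statement reads both balances
            statement_total = accounts["chequing"] + accounts["savings"]
    return statement_total


# The statement runs between the two halves of the transfer.
interleaved = run(["debit", "print", "credit"])

# The statement runs only after the transfer has finished.
serial = run(["debit", "credit", "print"])

print(interleaved, serial)
```

In the interleaved schedule the statement reports a total that never existed in any consistent state of the accounts. Isolation means the DBMS only allows outcomes equivalent to some serial order.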

  • Database Structure

    Typically has a layered architecture.

    The figure doesn’t show:
    ◦ Concurrency Control
    ◦ Recovery Components

    Each system has its own variations.

    Figure (layers, top to bottom): Query Optimization and Execution; Relational Operators; Files and Access Methods; Buffer Management; Disk Space Management; DB

    These layers must consider concurrency control and recovery!

  • Why Don’t We Always Use a DBMS?

    1. Expensive/complicated to set up and maintain

    2. Cost and complexity must be offset by need

    3. General-purpose, not suited for special-purpose tasks (e.g. text search!)

  • The ACID Approach

    1. Atomicity: all changes take effect, or none do.

    2. Consistency: the database is transferred from one valid state to another valid state.

    3. Isolation: the results of a transaction are invisible to other transactions until the transaction is complete.

    4. Durability: once committed, the results of a transaction are permanent and survive future system and media failures.
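
Atomicity and consistency can be observed directly with Python's built-in `sqlite3` module. The sketch below is illustrative only: the `Account` table, the `CHECK (balance >= 0)` rule standing in for a "valid state", and the `transfer` helper are assumptions. It relies on the documented behaviour that using a connection as a context manager commits on success and rolls back when an exception escapes the block.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Account ("
    " name TEXT PRIMARY KEY,"
    " balance INTEGER CHECK (balance >= 0))"  # validity rule (assumption)
)
conn.executemany(
    "INSERT INTO Account VALUES (?, ?)", [("chequing", 100), ("savings", 0)]
)
conn.commit()


def transfer(c, src, dst, amount):
    """Move amount from src to dst as one transaction (hypothetical helper)."""
    try:
        with c:  # commits on success, rolls back if an exception escapes
            c.execute(
                "UPDATE Account SET balance = balance + ? WHERE name = ?",
                (amount, dst),
            )
            c.execute(
                "UPDATE Account SET balance = balance - ? WHERE name = ?",
                (amount, src),
            )
        return True
    except sqlite3.IntegrityError:
        return False


ok = transfer(conn, "chequing", "savings", 60)      # enough funds
failed = transfer(conn, "chequing", "savings", 60)  # would overdraw
balances = dict(conn.execute("SELECT name, balance FROM Account"))

print(ok, failed, balances)
```

In the second call the credit to `savings` has already been applied when the debit violates the constraint; the rollback undoes that credit, so either both updates take effect or neither does.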

  • Databases Make These Folks Happy…

    DBMS Vendors and Programmers
    ◦ Oracle, IBM, Microsoft, …

    End-Users in many fields
    ◦ Business, Education, Science, …

    Database Application Programmers
    ◦ Build enterprise applications on top of DBMSs
    ◦ Build web services that run off DBMSs

  • Databases Make These Folks Happy…

    Database Administrators (DBAs)
    ◦ Handle security and authorization
    ◦ Data availability and crash recovery
    ◦ Database tuning as needs evolve

    Data Scientists and Analysts

  • Summary

    DBMS used to maintain, query large datasets.
    ◦ Can manipulate data and exploit semantics

    Other benefits include:
    ◦ Data Independence
    ◦ Quick application development
    ◦ Data integrity and security
    ◦ Recovery from system crashes
    ◦ Concurrent access

  • Summary

    Levels of abstraction provide data independence
    ◦ Key when dapp/dt

  • Citations, Images and Resources

    Database Management Systems (3rd Ed.), Ramakrishnan & Gehrke

    http://www.vcloudnews.com/every-day-big-data-statistics-2-5-quintillion-bytes-of-data-created-daily/

    https://www-01.ibm.com/software/data/bigdata/what-is-big-data.html

    http://www.nytimes.com/2004/11/14/business/yourmoney/what-walmart-knows-about-customers-habits.html

    https://www.kaggle.com/c/walmart-recruiting-sales-in-stormy-weather/forums/t/13299/predicting-strawberry-pop-tarts

    http://truthaboutguns-zippykid.netdna-ssl.com/wp-content/uploads/2014/12/Strawberry_Pop_Tarts.jpg

    http://gurupk.com/wp-content/uploads/2016/03/GTSEO.png

    http://www.npr.org/sections/alltechconsidered/2016/06/24/480949383/britains-google-searches-for -what-is-the-eu-spike-after-brexit-vote

    https://www.gov.uk/eu-eea

    http://www.mjhaccountants.co.uk/wp-content/uploads/cartoon-filing-cabinet-l -e4b53be1891574f1.gif

    http://www.cs.toronto.edu/~ryanjohn/teaching/cscc43-s12/lectures/c43-intro-v03.pdf

    http://www.tenouk.com/ModuleV_files/image002.png

    http://www.quoteslike.com/images/1480/love-girl-lyrics-and-leave-a-suggestion-at-the-bottom-of-the-page -SiPB6f-quote.jpg

    https://www.simplilearn.com/data-science-vs-big-data-vs-data-analytics-article

    Data science process flowchart from "Doing Data Science", Cathy O'Neil and Rachel Schutt, 2013