CS639: Data Management for Data Science
Lecture 3: Principles of Data Management
Theodoros Rekatsinas

Lecture 3 PDM - GitHub Pages · Modeling the Course Management System •Logical Schema •Students(sid: string, name: string, gpa: float) •Courses(cid: string, cname: string, credits:



Announcements

• Mix-up with due dates. It should be fixed now.
• No changes to the midterm
• Updates and hints on the PA1 assignment on Piazza
• Questions?

Today’s Lecture

1. Data Management
2. Data Models
3. RDBMSs and the Relational Data Model

1. Data Management

Data Management

• Data represents the traces of real-world processes.
• Data is valuable but hard and costly to manage
  • Storage, representation complexity, collection
• Data management seeks to answer two questions:
  • What operations do we want to perform on this data?
  • What functionality do we need to manage this data?

Required Functionality

• Describe real-world entities in terms of stored data
• Create & persistently store large datasets
• Efficiently query & update
  • Must handle complex questions about the data
  • Must handle sophisticated updates
  • Performance matters
• Change structure (e.g., add attributes)
• Concurrency control: enable simultaneous queries, updates, etc.
• Crash recovery
• Access control, security, integrity

It is difficult and costly to implement all these features!

Systems providing data management features

• Relational database management systems
• HDFS-based systems (e.g., Hadoop)
• Stream management systems: Apache Kafka
• Others?


2. Data Models

What you will learn about in this section

1. Types of Data
2. Data Models

Data is highly heterogeneous

• Structured data
• Semi-structured data
• Unstructured data

Increasing amounts of data

Structured data

• Information with a high degree of organization
• All data conforms to a schema. Ex: business data
• Easy to query, search over, aggregate
• Example: tables in a database, tables in Excel, etc.

Semi-structured data

• Some structure in the data, but implicit and irregular
• It contains tags or other markers to separate semantic elements and enforce hierarchies of records and fields within the data
• Example: JSON, HTML, XML
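
To make "implicit and irregular" structure concrete, here is a minimal sketch that parses two JSON records whose fields differ. The records and field names are invented for illustration and are not part of the lecture.

```python
import json

# Two hypothetical records: the keys act as tags, but each record carries different fields.
raw = """
[
  {"name": "Bob", "courses": [{"cid": "564", "grade": "A"}]},
  {"name": "Mary", "email": "mary@example.org"}
]
"""

records = json.loads(raw)

# The structure is implicit: we only learn each record's fields by inspecting it.
fields_per_record = [sorted(r.keys()) for r in records]
print(fields_per_record)

# Hierarchies are allowed: Bob's record nests a list of course entries.
bob_courses = records[0].get("courses", [])
mary_courses = records[1].get("courses", [])  # field is absent, so fall back to []
print(len(bob_courses), len(mary_courses))
```

Because no schema is enforced, code that reads such data usually has to handle missing fields explicitly, as the `.get(..., [])` calls do here.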

Unstructured data

• Information that either does not have a pre-defined structure or is not organized in a pre-defined manner.
• Text, video, images, etc.
• Abundant and extremely valuable. Hard to query, aggregate, analyze, search.

Data Model

• A data model is a collection of concepts for describing data
• A schema is a description of a particular collection of data, using the given data model
• A data model enables users to define the data using high-level constructs without worrying about many low-level details of how data will be stored on disk.

Levels of abstraction

Data models

• Relational: most database management systems
• Key/Value: NoSQL
• Graph: NoSQL
• Document: NoSQL
• Column-family: NoSQL
• Array/Matrix: machine learning, scientific applications
• Hierarchical: obsolete/rare
• Network: obsolete/rare

3. RDBMSs and the Relational Data Model

What you will learn about in this section

1. Definition of DBMS
2. Data models & the relational data model
3. Schemas & data independence

What is a DBMS?

• A large, integrated collection of data
• Models a real-world enterprise
  • Entities (e.g., Students, Courses)
  • Relationships (e.g., Alice is enrolled in CS564)

A Database Management System (DBMS) is a piece of software designed to store and manage databases

A Motivating, Running Example

• Consider building a course management system (CMS):
  • Entities
    • Students
    • Courses
    • Professors
  • Relationships
    • Who takes what
    • Who teaches what

Data models

• A data model is a collection of concepts for describing data
  • The relational model of data is the most widely used model today
  • Main Concept: the relation - essentially, a table
• A schema is a description of a particular collection of data, using the given data model
  • E.g. every relation in a relational data model has a schema describing types, etc.

Modeling the Course Management System

• Logical Schema
  • Students(sid: string, name: string, gpa: float)
  • Courses(cid: string, cname: string, credits: int)
  • Enrolled(sid: string, cid: string, grade: string)

Relations

Students
sid  Name  Gpa
101  Bob   3.2
123  Mary  3.8

Courses
cid  cname  credits
564  564-2  4
308  417    2

Enrolled
sid  cid  Grade
123  564  A

Corresponding keys: sid in Enrolled matches sid in Students, and cid in Enrolled matches cid in Courses.
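
As a rough sketch, the logical schema above can be written down and queried with Python's built-in `sqlite3` module. The SQLite column types (`TEXT`, `REAL`, `INTEGER`) and the choice of `sid` and `cid` as primary keys are assumptions made for this example; the slide only gives attribute names and abstract types.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Assumed mapping: string -> TEXT, float -> REAL, int -> INTEGER.
conn.executescript("""
CREATE TABLE Students (sid TEXT PRIMARY KEY, name TEXT, gpa REAL);
CREATE TABLE Courses  (cid TEXT PRIMARY KEY, cname TEXT, credits INTEGER);
CREATE TABLE Enrolled (sid TEXT, cid TEXT, grade TEXT);
""")

# The rows shown on the slide.
conn.executemany("INSERT INTO Students VALUES (?, ?, ?)",
                 [("101", "Bob", 3.2), ("123", "Mary", 3.8)])
conn.executemany("INSERT INTO Courses VALUES (?, ?, ?)",
                 [("564", "564-2", 4), ("308", "417", 2)])
conn.execute("INSERT INTO Enrolled VALUES (?, ?, ?)", ("123", "564", "A"))

# Follow the corresponding keys: Enrolled.sid -> Students.sid, Enrolled.cid -> Courses.cid.
rows = conn.execute("""
    SELECT s.name, c.cname, e.grade
    FROM Enrolled e
    JOIN Students s ON s.sid = e.sid
    JOIN Courses  c ON c.cid = e.cid
""").fetchall()
print(rows)
```

The join answers "who takes what" by matching key values across the three relations rather than by storing pointers between rows.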

Other Schemata…

• Physical Schema: describes data layout
  • Relations as unordered files
  • Some data in sorted order (index)
• Logical Schema: Previous slide
• External Schema: (Views)
  • Course_info(cid: string, enrollment: integer)
  • Derived from other tables

Applications work against the external schema; administrators manage the physical schema.
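
The `Course_info` view can be sketched in SQLite as a query over `Enrolled`. The exact definition is an assumption (the slide only says the view is derived from other tables), and the sample rows are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Enrolled (sid TEXT, cid TEXT, grade TEXT);
-- Hypothetical sample rows.
INSERT INTO Enrolled VALUES ('123', '564', 'A'), ('101', '564', 'B'), ('123', '537', 'A+');

-- External schema: a view computed from the logical table; it stores no data itself.
CREATE VIEW Course_info AS
    SELECT cid, COUNT(*) AS enrollment
    FROM Enrolled
    GROUP BY cid;
""")

course_info = conn.execute(
    "SELECT cid, enrollment FROM Course_info ORDER BY cid"
).fetchall()
print(course_info)
```

An application that only reads `Course_info` never needs to know how enrollments are recorded underneath.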

Data independence

Concept: Applications do not need to worry about how the data is structured and stored

Logical data independence: protection from changes in the logical structure of the data
  I.e. should not need to ask: can we add a new entity or attribute without rewriting the application?

Physical data independence: protection from physical layout changes
  I.e. should not need to ask: which disks are the data stored on? Is the data indexed?

One of the most important reasons to use a DBMS
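
A small sketch of both kinds of independence, using SQLite as a stand-in DBMS. The table contents, the index name `idx_students_gpa`, and the added `email` attribute are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Students (sid TEXT, name TEXT, gpa REAL)")
conn.executemany("INSERT INTO Students VALUES (?, ?, ?)",
                 [("101", "Bob", 3.2), ("123", "Mary", 3.8)])

# The "application": one fixed query that never changes below.
query = "SELECT name FROM Students WHERE gpa > 3.5 ORDER BY name"
before = conn.execute(query).fetchall()

# Physical change: add an index on gpa. The query text is untouched.
conn.execute("CREATE INDEX idx_students_gpa ON Students (gpa)")
after_index = conn.execute(query).fetchall()

# Logical change: add a new attribute. The query text is still untouched.
conn.execute("ALTER TABLE Students ADD COLUMN email TEXT")
after_new_column = conn.execute(query).fetchall()

print(before, after_index, after_new_column)
```

The query returns the same answer in all three cases; only the DBMS needs to know whether an index exists or which extra columns the table carries.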

Relational Model

• Structure: The definition of relations and their contents.
• Integrity: Ensure the database’s contents satisfy constraints.
• Manipulation: How to access and modify a database’s contents.

Tables in the Relational Model

Product
PName        Price    Manufacturer
Gizmo        $19.99   GizmoWorks
Powergizmo   $29.99   GizmoWorks
SingleTouch  $149.99  Canon
MultiTouch   $203.99  Hitachi

A relation or table is a multiset of tuples having the attributes specified by the schema

Let’s break this definition down

A multiset is an unordered list (or: a set with multiple duplicate instances allowed)

List: [1, 1, 2, 3]
Set: {1, 2, 3}
Multiset: {1, 1, 2, 3}

i.e. no next(), etc. methods!
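
The list / set / multiset distinction can be tried out in Python, where `collections.Counter` is one common way to represent a multiset (an element-to-multiplicity map). This is only an illustration of the concept, not how a DBMS stores relations.

```python
from collections import Counter

as_list = [1, 1, 2, 3]
as_set = set(as_list)            # duplicates collapse
as_multiset = Counter(as_list)   # maps each element to its multiplicity

# Order does not matter for a multiset, but multiplicity does.
same_multiset = Counter([3, 1, 2, 1]) == as_multiset
same_list = [3, 1, 2, 1] == as_list
drops_duplicate = Counter([1, 2, 3]) == as_multiset

print(as_set, dict(as_multiset), same_multiset, same_list, drops_duplicate)
```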

An attribute (or column) is a typed data entry present in each tuple in the relation

Attributes must have an atomic type, i.e. not a list, set, etc.

A tuple or row is a single entry in the table having the attributes specified by the schema

Also referred to sometimes as a record

The number of tuples is the cardinality of the relation

The number of attributes is the arity of the relation

n-ary relation = table with n columns
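
For the Product table shown earlier, these two quantities can be computed directly. The sketch below stores the relation as a plain Python list of tuples, with prices as floats; that representation is chosen only for illustration.

```python
# The Product relation from the slides; the order of the rows carries no meaning.
attributes = ("PName", "Price", "Manufacturer")
product = [
    ("Gizmo", 19.99, "GizmoWorks"),
    ("Powergizmo", 29.99, "GizmoWorks"),
    ("SingleTouch", 149.99, "Canon"),
    ("MultiTouch", 203.99, "Hitachi"),
]

cardinality = len(product)   # number of tuples
arity = len(attributes)      # number of attributes

# Every tuple must supply exactly one value per attribute.
well_formed = all(len(t) == arity for t in product)

print(cardinality, arity, well_formed)
```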

Data Types in Relational Model

• Atomic types:
  • Characters: CHAR(20), VARCHAR(50)
  • Numbers: INT, BIGINT, SMALLINT, FLOAT
  • Others: MONEY, DATETIME, …
• Every attribute must have an atomic type
  • Hence tables are flat

Table Schemas

• The schema of a table is the table name, its attributes, and their types:

  Product(Pname: string, Price: float, Category: string, Manufacturer: string)

• A key is an attribute whose values are unique; we underline a key (here, Pname):

  Product(Pname: string, Price: float, Category: string, Manufacturer: string)

Key constraints

A key is a minimal subset of attributes that acts as a unique identifier for tuples in a relation

• A key is an implicit constraint on which tuples can be in the relation
  • i.e. if two tuples agree on the values of the key, then they must be the same tuple!

Students(sid: string, name: string, gpa: float)

1. Which would you select as a key?
2. Is a key always guaranteed to exist?
3. Can we have more than one key?
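
The definition (unique and minimal) can be checked mechanically against a concrete set of rows. The helper names `is_unique` and `is_key` and the third student row are hypothetical. Note that a key is a constraint on every legal instance of the relation, so a check like this can only show that an attribute set behaves like a key in one particular instance.

```python
from itertools import combinations

def is_unique(rows, attrs):
    """True if no two rows agree on all attributes in attrs."""
    seen = set()
    for row in rows:
        value = tuple(row[a] for a in attrs)
        if value in seen:
            return False
        seen.add(value)
    return True

def is_key(rows, attrs):
    """True if attrs is unique in rows and no proper subset of attrs is unique."""
    if not is_unique(rows, attrs):
        return False
    # Minimality: try every proper subset, including the empty one.
    return not any(
        is_unique(rows, subset)
        for k in range(len(attrs))
        for subset in combinations(attrs, k)
    )

students = [
    {"sid": "101", "name": "Bob", "gpa": 3.2},
    {"sid": "123", "name": "Mary", "gpa": 3.8},
    {"sid": "145", "name": "Bob", "gpa": 3.8},  # hypothetical extra row
]

sid_is_key = is_key(students, ("sid",))
name_is_key = is_key(students, ("name",))
sid_name_is_key = is_key(students, ("sid", "name"))   # unique but not minimal
name_gpa_is_key = is_key(students, ("name", "gpa"))   # unique and minimal in this instance

print(sid_is_key, name_is_key, sid_name_is_key, name_gpa_is_key)
```

In this instance both `sid` and the pair `(name, gpa)` pass the check, which hints at the third question above; whether `(name, gpa)` is a sensible key for all possible students is a modeling decision the data alone cannot settle.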

NULL and NOT NULL

• To say “don’t know the value” we use NULL
  • NULL has (sometimes painful) semantics, more details later

Students(sid: string, name: string, gpa: float)

sid  name  gpa
123  Bob   3.9
143  Jim   NULL

Say, Jim just enrolled in his first class.

We may constrain a column to be NOT NULL, e.g., “name” in this table
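
A brief sketch of both ideas in SQLite: Jim's unknown GPA is stored as NULL, and a `NOT NULL` constraint on `name` rejects a nameless row. The two filter queries and the rejected row with sid 150 are invented for illustration; they preview one of the "painful" aspects, namely that a comparison with NULL is neither true nor false.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Students (sid TEXT, name TEXT NOT NULL, gpa REAL)")
conn.executemany("INSERT INTO Students VALUES (?, ?, ?)",
                 [("123", "Bob", 3.9), ("143", "Jim", None)])  # None is stored as NULL

# Jim matches neither filter, because comparing NULL with a number yields "unknown".
high = conn.execute("SELECT name FROM Students WHERE gpa >= 3.0").fetchall()
low = conn.execute("SELECT name FROM Students WHERE gpa < 3.0").fetchall()

# NULL has to be tested for explicitly.
unknown = conn.execute("SELECT name FROM Students WHERE gpa IS NULL").fetchall()

# The NOT NULL constraint rejects a row without a name.
try:
    conn.execute("INSERT INTO Students VALUES (?, ?, ?)", ("150", None, 3.0))
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

print(high, low, unknown, rejected)
```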

Foreign Key constraints

• A foreign key specifies that an attribute from one relation has to map to a tuple in another relation.

• Suppose we have the following schema:

  Students(sid: string, name: string, gpa: float)
  Enrolled(student_id: string, cid: string, grade: string)

• And we want to impose the following constraint:
  • ‘Only real students may enroll in courses’ i.e. a student must appear in the Students table to enroll in a class

Students
sid  name  gpa
101  Bob   3.2
123  Mary  3.8

Enrolled
student_id  cid  grade
123         564  A
123         537  A+

student_id alone is not a key - what is?

We say that student_id is a foreign key that refers to Students
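
The "only real students may enroll" constraint can be sketched in SQLite with a `REFERENCES` clause. Two assumptions are made here: the pair `(student_id, cid)` is taken as the key of `Enrolled` (one plausible answer to the question above), and the rejected student id `999` is invented. SQLite only enforces foreign keys after `PRAGMA foreign_keys = ON`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # enforcement is off by default in SQLite

conn.executescript("""
CREATE TABLE Students (sid TEXT PRIMARY KEY, name TEXT, gpa REAL);
CREATE TABLE Enrolled (
    student_id TEXT REFERENCES Students(sid),  -- the foreign key
    cid TEXT,
    grade TEXT,
    PRIMARY KEY (student_id, cid)              -- assumed key of Enrolled
);
INSERT INTO Students VALUES ('101', 'Bob', 3.2), ('123', 'Mary', 3.8);
INSERT INTO Enrolled VALUES ('123', '564', 'A'), ('123', '537', 'A+');
""")

# Student 999 does not appear in Students, so this enrollment must be refused.
try:
    conn.execute("INSERT INTO Enrolled VALUES ('999', '564', 'B')")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

enrolled_count = conn.execute("SELECT COUNT(*) FROM Enrolled").fetchone()[0]
print(rejected, enrolled_count)
```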

Summary of Schema Information

• Schema and Constraints are how databases understand the semantics (meaning) of data
• They are also useful for optimization

DATA MANIPULATION LANGUAGES (DML)

• How to store and retrieve information from a database.
• Procedural: The query specifies the (high-level) strategy the DBMS should use to find the desired result.
• We will see SQL and Relational Algebra
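
As a preview, the sketch below asks the same question twice: once as explicit relational-algebra-style steps and once as a SQL query run through SQLite. The helper functions `select` and `project` are hypothetical stand-ins for the algebra operators, and this `project` keeps duplicates, in line with the multiset view of relations used in this lecture.

```python
import sqlite3

students = [
    {"sid": "101", "name": "Bob", "gpa": 3.2},
    {"sid": "123", "name": "Mary", "gpa": 3.8},
]

def select(rows, predicate):
    """Selection: keep the rows that satisfy the predicate."""
    return [r for r in rows if predicate(r)]

def project(rows, attrs):
    """Projection: keep only the named attributes (duplicates are kept)."""
    return [tuple(r[a] for a in attrs) for r in rows]

# Procedural style: the steps are spelled out (first select, then project).
algebra_result = project(select(students, lambda r: r["gpa"] > 3.5), ["name"])

# The same question in SQL.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Students (sid TEXT, name TEXT, gpa REAL)")
conn.executemany("INSERT INTO Students VALUES (:sid, :name, :gpa)", students)
sql_result = conn.execute("SELECT name FROM Students WHERE gpa > 3.5").fetchall()

print(algebra_result, sql_result)
```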