72
CS 5614: (Big) Data Management Systems B. Aditya Prakash Lecture #1: Introduc/on

CS5614:(Big)Data# Management#Systems# - People at VT ...people.cs.vt.edu/badityap/classes/cs5614-Fall14/lectures/lecture-1.pdf · CS’5614:’(Big)’DataManagementSystems’ 3 Data#contains#value#and#knowledge#

Embed Size (px)

Citation preview

Page 1: CS5614:(Big)Data# Management#Systems# - People at VT ...people.cs.vt.edu/badityap/classes/cs5614-Fall14/lectures/lecture-1.pdf · CS’5614:’(Big)’DataManagementSystems’ 3 Data#contains#value#and#knowledge#

CS  5614:  (Big)  Data  Management  Systems  

B.  Aditya  Prakash  Lecture  #1:  Introduc/on  

Page 2: CS5614:(Big)Data# Management#Systems# - People at VT ...people.cs.vt.edu/badityap/classes/cs5614-Fall14/lectures/lecture-1.pdf · CS’5614:’(Big)’DataManagementSystems’ 3 Data#contains#value#and#knowledge#

CS  5614:  (Big)  Data  Management  Systems   2  Prakash  2014  

Page 3: CS5614:(Big)Data# Management#Systems# - People at VT ...people.cs.vt.edu/badityap/classes/cs5614-Fall14/lectures/lecture-1.pdf · CS’5614:’(Big)’DataManagementSystems’ 3 Data#contains#value#and#knowledge#

3  CS  5614:  (Big)  Data  Management  Systems  Data  contains  value  and  knowledge  

Prakash  2014  

Page 4: CS5614:(Big)Data# Management#Systems# - People at VT ...people.cs.vt.edu/badityap/classes/cs5614-Fall14/lectures/lecture-1.pdf · CS’5614:’(Big)’DataManagementSystems’ 3 Data#contains#value#and#knowledge#

Data  and  Business  Data1and1business*

4*

+250%'clicksAvs.1editorial1oneDsizeDfitsDall*

+79%''clicksAvs.1randomly1selected*

+43%'clicksAvs.1editor1selected*

Recommended'linksA Personalized''News'InterestsA

Top'SearchesA

Prakash  2014   4  Source:  A.  Machhanavajjhala  CS  5614:  (Big)  Data  Management  Systems  

Page 5: CS5614:(Big)Data# Management#Systems# - People at VT ...people.cs.vt.edu/badityap/classes/cs5614-Fall14/lectures/lecture-1.pdf · CS’5614:’(Big)’DataManagementSystems’ 3 Data#contains#value#and#knowledge#

Data  and  Science  Data1and1science*5*

Detecting'influenza'epidemics'using'search'engine'query'data1

http://www.nature.com/nature/journal/v457/n7232/full/nature07634.html

Red:1official1numbers1from1Center1for1Disease1Control1and1Prevention;1weekly11Black:1based1on1Google1search1logs;1daily1(potentially1instantaneously)*

Prakash  2014   5  CS  5614:  (Big)  Data  Management  Systems  

Page 6: CS5614:(Big)Data# Management#Systems# - People at VT ...people.cs.vt.edu/badityap/classes/cs5614-Fall14/lectures/lecture-1.pdf · CS’5614:’(Big)’DataManagementSystems’ 3 Data#contains#value#and#knowledge#

Data  and  Government  Data1and1government*6*

http://www.washingtonpost.com/opinions/obama-the-big-data-president/2013/06/14/1d71fe2e-d391-11e2-b05f-3ea3f0e7bb5a_story.html

http://www.whitehouse.gov/blog/Democratizing-Data

http://www.washingtonpost.com/business/economy/democrats-push-to-redeploy-obamas-voter-database/2012/11/20/d14793a4-2e83-11e2-89d4-040c9330702a_story.html

http://www.theguardian.com/world/2013/jun/23/edward-snowden-nsa-files-timeline

Prakash  2014   6  CS  5614:  (Big)  Data  Management  Systems  Source:  A.  Machhanavajjhala  

Page 7: CS5614:(Big)Data# Management#Systems# - People at VT ...people.cs.vt.edu/badityap/classes/cs5614-Fall14/lectures/lecture-1.pdf · CS’5614:’(Big)’DataManagementSystems’ 3 Data#contains#value#and#knowledge#

Data  and  Culture  Data1and1culture*7*

•  Word1frequencies1in1EnglishDlanguage1books1in1Google’s1database11http://blogs.plos.org/everyone/2013/03/20/what-are-you-in-the-mood-for-emotional-trends-in-20th-century-books/

Prakash  2014   7  CS  5614:  (Big)  Data  Management  Systems  Source:  A.  Machhanavajjhala  

Page 8: CS5614:(Big)Data# Management#Systems# - People at VT ...people.cs.vt.edu/badityap/classes/cs5614-Fall14/lectures/lecture-1.pdf · CS’5614:’(Big)’DataManagementSystems’ 3 Data#contains#value#and#knowledge#

Data1and1____1☜ your1favorite1subject*

8*

Sports* Journalism*

Prakash  2014   8  CS  5614:  (Big)  Data  Management  Systems  

Page 9: CS5614:(Big)Data# Management#Systems# - People at VT ...people.cs.vt.edu/badityap/classes/cs5614-Fall14/lectures/lecture-1.pdf · CS’5614:’(Big)’DataManagementSystems’ 3 Data#contains#value#and#knowledge#

Good  news:  Demand  for  Data  Mining  

9  CS  5614:  (Big)  Data  Management  Systems  Prakash  2014  

Page 10: CS5614:(Big)Data# Management#Systems# - People at VT ...people.cs.vt.edu/badityap/classes/cs5614-Fall14/lectures/lecture-1.pdf · CS’5614:’(Big)’DataManagementSystems’ 3 Data#contains#value#and#knowledge#

How  to  extract  value  from  data?  

§ Manipulate  Data  – CS,  Domain  exper/se  

§  Analyze  Data  – Math,  CS,  Stat…  

§  Communicate  your  results  – CS,  Domain  Exper/se  

Prakash  2014   10  CS  5614:  (Big)  Data  Management  Systems  

Page 11: CS5614:(Big)Data# Management#Systems# - People at VT ...people.cs.vt.edu/badityap/classes/cs5614-Fall14/lectures/lecture-1.pdf · CS’5614:’(Big)’DataManagementSystems’ 3 Data#contains#value#and#knowledge#

CommunicaEon  is  important!  Communicating1results*

“The"British"government"spends""£13"billion"a"year"on"universities.”F– So?*– Try1instead1http://wheredoesmymoneygo.org/"bubbletree-map.html#/~/total/education/university

“On"average,"1"in"every"15"Europeans""is"totally"illiterate.”F– True*– But1about111in1every1141is1under171years1old!*http://datajournalismhandbook.org/1.0/en/understanding_data_0.html*

13*

Prakash  2014   11  CS  5614:  (Big)  Data  Management  Systems  

Page 12: CS5614:(Big)Data# Management#Systems# - People at VT ...people.cs.vt.edu/badityap/classes/cs5614-Fall14/lectures/lecture-1.pdf · CS’5614:’(Big)’DataManagementSystems’ 3 Data#contains#value#and#knowledge#

What  is  Data  Mining?  §  Given  lots  of  data  §  Discover  paJerns  and  models  that  are:  – Valid:    hold  on  new  data  with  some  certainty  – Useful:    should  be  possible  to  act  on  the  item    – Unexpected:    non-­‐obvious  to  the  system  – Understandable:  humans  should  be  able  to    interpret  the  paWern  

CS  5614:  (Big)  Data  Management  Systems   12  Prakash  2014  

Page 13: CS5614:(Big)Data# Management#Systems# - People at VT ...people.cs.vt.edu/badityap/classes/cs5614-Fall14/lectures/lecture-1.pdf · CS’5614:’(Big)’DataManagementSystems’ 3 Data#contains#value#and#knowledge#

Data  Mining  Tasks  

§  DescripEve  methods  – Find  human-­‐interpretable  paWerns  that    describe  the  data  •  Example:  Clustering  

§  PredicEve  methods  – Use  some  variables  to  predict  unknown    or  future  values  of  other  variables  •  Example:  Recommender  systems  

 CS  5614:  (Big)  Data  Management  Systems   13  Prakash  2014  

Page 14: CS5614:(Big)Data# Management#Systems# - People at VT ...people.cs.vt.edu/badityap/classes/cs5614-Fall14/lectures/lecture-1.pdf · CS’5614:’(Big)’DataManagementSystems’ 3 Data#contains#value#and#knowledge#

ML  &  Stats.  

Comp.  Systems  

Theory  &  Algo.  

Biology  

Econ.  

Social  Science  

Physics  

14  

Big  data  

CS  5614:  (Big)  Data  Management  Systems  Prakash  2014  

Page 15: CS5614:(Big)Data# Management#Systems# - People at VT ...people.cs.vt.edu/badityap/classes/cs5614-Fall14/lectures/lecture-1.pdf · CS’5614:’(Big)’DataManagementSystems’ 3 Data#contains#value#and#knowledge#

COURSE  LOGISTICS  

Prakash  2014   CS  5614:  (Big)  Data  Management  Systems   15  

Page 16: CS5614:(Big)Data# Management#Systems# - People at VT ...people.cs.vt.edu/badityap/classes/cs5614-Fall14/lectures/lecture-1.pdf · CS’5614:’(Big)’DataManagementSystems’ 3 Data#contains#value#and#knowledge#

Course  InformaEon  §  Instructor    B.  Aditya  Prakash,  Torgersen  Hall  3160  F,  [email protected]    –  Office  Hours:  TBD  –  Include  string  CS  5614  in  subject    

§  Teaching  Assistant    Vanessa  Cadeno,  [email protected]  –    Office  Hours:  TBD  

 §  Class  MeeEng  Time    Mondays,  Wednesdays,  2:30-­‐3:45pm,  McBryde  Hall  134  

 §  Syllabus:  RelaEonal  Database  Systems,  Big  data  Technologies  (MR  

and  new  soYware  stack),  Streams,  RecommendaEon  Systems,  Large  Scale  Machine  Learning,  and  Graph  Mining  

   Prakash  2014   CS  5614:  (Big)  Data  Management  Systems   16  

Page 17: CS5614:(Big)Data# Management#Systems# - People at VT ...people.cs.vt.edu/badityap/classes/cs5614-Fall14/lectures/lecture-1.pdf · CS’5614:’(Big)’DataManagementSystems’ 3 Data#contains#value#and#knowledge#

Course  InformaEon  §  Keeping  in  Touch    Course  web  site    

                               hWp://www.cs.vt.edu/~badityap/classes/cs5614-­‐Fall14/    updated  regularly  through  the  semester  – Piazza  link  on  the  website  

Prakash  2014   CS  5614:  (Big)  Data  Management  Systems   17  

Page 18: CS5614:(Big)Data# Management#Systems# - People at VT ...people.cs.vt.edu/badityap/classes/cs5614-Fall14/lectures/lecture-1.pdf · CS’5614:’(Big)’DataManagementSystems’ 3 Data#contains#value#and#knowledge#

Textbook  §  Required    Jure  Leskovec,  Anand  Rajaraman  and  Jeffery  Ullman:  Mining  Massive  Datasets  (2nd)  Cambridge  University  Press.  2010  

   Web  page  for  the  book  (with  FREE  PDF!)  

       www.mmds.org      

Prakash  2014   CS  5614:  (Big)  Data  Management  Systems   18  

Page 19: CS5614:(Big)Data# Management#Systems# - People at VT ...people.cs.vt.edu/badityap/classes/cs5614-Fall14/lectures/lecture-1.pdf · CS’5614:’(Big)’DataManagementSystems’ 3 Data#contains#value#and#knowledge#

Textbook  §  Recommended  (for  database  internals)            Raghu  Ramakrishnan  and  Johannes  Gehrke      Database  Management  Systems  (3rd  Ed.).  McGraw  Hill.    

Prakash  2014   CS  5614:  (Big)  Data  Management  Systems   19  

Page 20: CS5614:(Big)Data# Management#Systems# - People at VT ...people.cs.vt.edu/badityap/classes/cs5614-Fall14/lectures/lecture-1.pdf · CS’5614:’(Big)’DataManagementSystems’ 3 Data#contains#value#and#knowledge#

Pre-­‐reqs  (A) Should  enjoy  the  course    J  (B) Background  in    

1.  Algorithms    2.  Probability  and  Stats  3.  Undergraduate  level  Databases  4.  Linear  Algebra  (helps)  5.  Graph  theory  (helps)  

(C) Graduate-­‐level  Programming  Skills  (i.e.  ability  to  use  unfamiliar  soiware,  picking  up  new  languages,  comfortable  with  at  least  one  of  Python/C/C++/Ruby/Java  etc.  (Matlab/R  a  plus))  

Prakash  2014   CS  5614:  (Big)  Data  Management  Systems   20  

Page 21: CS5614:(Big)Data# Management#Systems# - People at VT ...people.cs.vt.edu/badityap/classes/cs5614-Fall14/lectures/lecture-1.pdf · CS’5614:’(Big)’DataManagementSystems’ 3 Data#contains#value#and#knowledge#

Course  Grading  §  Details  coming  soon  (next  lecture)  

§  Broadly  – Some  hws  – No  midterm  – Take-­‐home  Final  – Project  – Class  par/cipa/on  

Prakash  2014   CS  5614:  (Big)  Data  Management  Systems   21  

Page 22: CS5614:(Big)Data# Management#Systems# - People at VT ...people.cs.vt.edu/badityap/classes/cs5614-Fall14/lectures/lecture-1.pdf · CS’5614:’(Big)’DataManagementSystems’ 3 Data#contains#value#and#knowledge#

Course  Project  §  2,  or  3  (max)  persons  per  project.  §  Major  work  for  this  class.  §  Pick  your  own  topic  

–  You  have  to  jus/fy  why  the  topic  is  interes/ng,  and  relevant  to  the  course,  and  of  suitable  difficulty  

§  Harder  way:    –  Joint  projects  with  other  courses  are  also  nego/able.  In  that  case,  you  will  need  the  approval  of  the  instructor,  and  you  also  need  to  clarify  exactly  what  steps  will  be  done  for  our  course,  as  well  as  for  the  other  course.  

–  Projects  related  to  your  disserta/on/master-­‐project  are  also  possible,  as  long  as  there  is  no  'double-­‐dipping',  i.e.,  you  clearly  specify  what  the  project  will  do,  in  addi/on  to  what  you  were  planning  to  do  for  your  thesis  anyway.  

§  Ask  me  if  you  need  help  and  ideas  (I  may  release  a  list  of  suitable  topics  later)  

Prakash  2014   CS  5614:  (Big)  Data  Management  Systems   22  

Page 23: CS5614:(Big)Data# Management#Systems# - People at VT ...people.cs.vt.edu/badityap/classes/cs5614-Fall14/lectures/lecture-1.pdf · CS’5614:’(Big)’DataManagementSystems’ 3 Data#contains#value#and#knowledge#

Course  Project  §  Proposal  §  Milestone  §  Final  Report  §  Poster  Presenta/on  (or  in-­‐class  presenta/on  TBD)  

Prakash  2014   CS  5614:  (Big)  Data  Management  Systems   23  

Page 24: CS5614:(Big)Data# Management#Systems# - People at VT ...people.cs.vt.edu/badityap/classes/cs5614-Fall14/lectures/lecture-1.pdf · CS’5614:’(Big)’DataManagementSystems’ 3 Data#contains#value#and#knowledge#

WARM-­‐UP  AND  BASICS  

Prakash  2014   CS  5614:  (Big)  Data  Management  Systems   24  

Page 25: CS5614:(Big)Data# Management#Systems# - People at VT ...people.cs.vt.edu/badityap/classes/cs5614-Fall14/lectures/lecture-1.pdf · CS’5614:’(Big)’DataManagementSystems’ 3 Data#contains#value#and#knowledge#

RelaEonal  Databases:  What  we  will  cover  (next  1  month)  

§  Implementa/on  –  What  is  under-­‐the-­‐hood  of  a  DB  like  Oracal/MySQL?  

§  Design  –  How  do  you  model  your  data  and  structure  your  informa/on  in  a  database?    

§  Programming  –  How  do  you  use  the  capabili/es  of  a  DBMS?  

§  CS  4604  achieves  a  balance  between  –  a  firm  theore/cal  founda/on  to  designing  moderate-­‐sized  databases  

–  crea/ng,  querying,  and  implemen/ng  realis/c  databases  and  connec/ng  them  to  applica/ons  

Prakash  2014   CS  5614:  (Big)  Data  Management  Systems   25  

Page 26: CS5614:(Big)Data# Management#Systems# - People at VT ...people.cs.vt.edu/badityap/classes/cs5614-Fall14/lectures/lecture-1.pdf · CS’5614:’(Big)’DataManagementSystems’ 3 Data#contains#value#and#knowledge#

CS4604:  Course  Outline  §  Weeks  1–4:  Query/

Manipula/on  Languages  and  Data  Modeling  –  Rela/onal  Algebra  –  Data  defini/on  –  Programming  with  SQL  –  En/ty-­‐Rela/onship  (E/R)  approach  

–  Specifying  Constraints  –  Good  E/R  design  

§  Weeks  5–8:  Indexes,  Processing  and  Op/miza/on  –  Storing  –  Hashing/Sor/ng  –  Query  Op/miza/on  –  NoSQL  and  Hadoop  

§  Week  9-­‐10:  Rela/onal  Design  –  Func/onal  Dependencies  –  Normaliza/on  to  avoid  redundancy  

§  Week  11-­‐12:  Concurrency  Control  –  Transac/ons  –  Logging  and  Recovery  

§  Week  13–14:  Students’  choice  –  Prac/ce  Problems    –  XML  –  Data  mining  and  warehousing  

Prakash  2014   CS  5614:  (Big)  Data  Management  Systems   26  

We  will  go  over  all  of  this  quickly!  ☺  

Page 27: CS5614:(Big)Data# Management#Systems# - People at VT ...people.cs.vt.edu/badityap/classes/cs5614-Fall14/lectures/lecture-1.pdf · CS’5614:’(Big)’DataManagementSystems’ 3 Data#contains#value#and#knowledge#

What  is  the  goal  of  a  DBMS?  

§  Electronic  record-­‐keeping          Fast  and  convenient  access  to  informa/on    §  DBMS  ==  database  management  system    –  `Rela/onal’  in  this  class  – data  +  set  of  instruc/ons  to  access/manipulate  data  

Prakash  2014   CS  5614:  (Big)  Data  Management  Systems   27  

Page 28: CS5614:(Big)Data# Management#Systems# - People at VT ...people.cs.vt.edu/badityap/classes/cs5614-Fall14/lectures/lecture-1.pdf · CS’5614:’(Big)’DataManagementSystems’ 3 Data#contains#value#and#knowledge#

What  is  a  DBMS?  §  Features  of  a  DBMS  –  Support  massive  amounts  of  data  –  Persistent  storage  –  Efficient  and  convenient  access  –  Secure,  concurrent,  and  atomic  access  

§  Examples?  –  Search  engines,  banking  systems,  airline  reserva/ons,  corporate  records,  payrolls,  sales  inventories.  

–  New  applica/ons:  Wikis,  social/biological/mul/media/scien/fic/geographic  data,  heterogeneous  data.  

Prakash  2014   CS  5614:  (Big)  Data  Management  Systems   28  

Page 29: CS5614:(Big)Data# Management#Systems# - People at VT ...people.cs.vt.edu/badityap/classes/cs5614-Fall14/lectures/lecture-1.pdf · CS’5614:’(Big)’DataManagementSystems’ 3 Data#contains#value#and#knowledge#

Features  of  a  DBMS  •  Support  massive  amounts  of  data      

–  Giga/tera/petabytes  –  Far  too  big  for  main  memory    

•  Persistent  storage  –  Programs  update,  query,  manipulate  data.  –  Data  con/nues  to  live  long  aier  program  finishes.  

•  Efficient  and  convenient  access  –  Efficient:  do  not  search  en/re  database  to  answer  a  query.  –  Convenient:  allow  users  to  query  the  data  as  easily  as  possible.  

•  Secure,  concurrent,  and  atomic  access  –  Allow  mul/ple  users  to  access  database  simultaneously.  –  Allow  a  user  access  to  only  to  authorized  data.  –  Provide  some  guarantee  of  reliability  against  system  failures.  

Prakash  2014   CS  5614:  (Big)  Data  Management  Systems   29  

Page 30: CS5614:(Big)Data# Management#Systems# - People at VT ...people.cs.vt.edu/badityap/classes/cs5614-Fall14/lectures/lecture-1.pdf · CS’5614:’(Big)’DataManagementSystems’ 3 Data#contains#value#and#knowledge#

Example  Scenario  

§  Students,  taking  classes,  obtaining  grades  – Find  my  GPA  – <and  other  ad-­‐hoc  queries>  

Prakash  2014   CS  5614:  (Big)  Data  Management  Systems   30  

Page 31: CS5614:(Big)Data# Management#Systems# - People at VT ...people.cs.vt.edu/badityap/classes/cs5614-Fall14/lectures/lecture-1.pdf · CS’5614:’(Big)’DataManagementSystems’ 3 Data#contains#value#and#knowledge#

Obvious  soluEon  1:  Folders  

§  Advantages?  – Cheap;  Easy-­‐to-­‐use  

§  Disadvantages?  – No  ad-­‐hoc  queries  – No  sharing  – Large  Physical  foot-­‐print  

Prakash  2014   CS  5614:  (Big)  Data  Management  Systems   31  

Page 32: CS5614:(Big)Data# Management#Systems# - People at VT ...people.cs.vt.edu/badityap/classes/cs5614-Fall14/lectures/lecture-1.pdf · CS’5614:’(Big)’DataManagementSystems’ 3 Data#contains#value#and#knowledge#

Obvious  SoluEon++  

§  Flat  files  and  C  (C++,  Java…)  programs    – E.g.  one  (or  more)  UNIX/DOS  files,  with  student  records  and  their  courses  

Prakash  2014   CS  5614:  (Big)  Data  Management  Systems   32  

Page 33: CS5614:(Big)Data# Management#Systems# - People at VT ...people.cs.vt.edu/badityap/classes/cs5614-Fall14/lectures/lecture-1.pdf · CS’5614:’(Big)’DataManagementSystems’ 3 Data#contains#value#and#knowledge#

Obvious  SoluEon++  

§  Layout  for  student  records?  – CSV  (‘comma-­‐separated-­‐values’)   Hermione Grainger,123,Potions,A

Draco Malfoy,111,Potions,B

Harry Potter,234,Potions,A

Ron Weasley,345,Potions,C

 

Prakash  2014   CS  5614:  (Big)  Data  Management  Systems   33  

Page 34: CS5614:(Big)Data# Management#Systems# - People at VT ...people.cs.vt.edu/badityap/classes/cs5614-Fall14/lectures/lecture-1.pdf · CS’5614:’(Big)’DataManagementSystems’ 3 Data#contains#value#and#knowledge#

Obvious  SoluEon++  

§  Layout  for  student  records?  – Other  possibili/es  like  Hermione Grainger,123 123,Potions,A

Draco Malfoy,111 111,Potions,B

Harry Potter,234 234,Potions,A

Ron Weasley,345 345,Potions,C

   

Prakash  2014   CS  5614:  (Big)  Data  Management  Systems   34  

Page 35: CS5614:(Big)Data# Management#Systems# - People at VT ...people.cs.vt.edu/badityap/classes/cs5614-Fall14/lectures/lecture-1.pdf · CS’5614:’(Big)’DataManagementSystems’ 3 Data#contains#value#and#knowledge#

Problems?  

§  inconvenient  access  to  data  (need  ‘C++’  exper/ze,  plus  knowledge  of  file-­‐layout)  –  data  isola/on  

§  data  redundancy  (and  inconsistencies)  §  integrity  problems  §  atomicity  problems  §  concurrent-­‐access  problems  §  security  problems  §  …….  

Prakash  2014   CS  5614:  (Big)  Data  Management  Systems   35  

Page 36: CS5614:(Big)Data# Management#Systems# - People at VT ...people.cs.vt.edu/badityap/classes/cs5614-Fall14/lectures/lecture-1.pdf · CS’5614:’(Big)’DataManagementSystems’ 3 Data#contains#value#and#knowledge#

Problems-­‐Why?  

§  Two  main  reasons:  – file-­‐layout  descrip/on  is  buried  within  the  C  programs  and  

–  there  is  no  support  for  transac/ons  (concurrency  and  recovery)  

Prakash  2014   CS  5614:  (Big)  Data  Management  Systems   36  

DBMSs  handle  exactly  these  two  problems  

Page 37: CS5614:(Big)Data# Management#Systems# - People at VT ...people.cs.vt.edu/badityap/classes/cs5614-Fall14/lectures/lecture-1.pdf · CS’5614:’(Big)’DataManagementSystems’ 3 Data#contains#value#and#knowledge#

Example  Scenario  §  RDBMS  =  “Rela/onal”DBMS  §  The  rela/onal  model  uses  rela/ons  or  tables  to  structure  data  §  ClassList  rela/on:          

§  Rela/on  separates  the  logical  view  (externals)  from  the  physical  view  (internals)  

§  Simple  query  languages  (SQL)  for  accessing/modifying  data  –  Find  all  students  whose  grades  are  beWer  than  B.  –  SELECT  Student  FROM  ClassList  WHERE  Grade  >“B”  

Student   Course     Grade  

Hermione  Grainger     Po/ons     A  

Draco  Malfoy     Po/ons     B  

Harry  PoWer     Po/ons     A  

Ron  Weasley     Po/ons     C  

Prakash  2014   CS  5614:  (Big)  Data  Management  Systems   37  

Page 38: CS5614:(Big)Data# Management#Systems# - People at VT ...people.cs.vt.edu/badityap/classes/cs5614-Fall14/lectures/lecture-1.pdf · CS’5614:’(Big)’DataManagementSystems’ 3 Data#contains#value#and#knowledge#

DBMS  Architecture  

Prakash  2014   CS  5614:  (Big)  Data  Management  Systems   38  

Page 39: CS5614:(Big)Data# Management#Systems# - People at VT ...people.cs.vt.edu/badityap/classes/cs5614-Fall14/lectures/lecture-1.pdf · CS’5614:’(Big)’DataManagementSystems’ 3 Data#contains#value#and#knowledge#

TransacEon  Processing  §  One  or  more  database  opera/ons  are  grouped  into  a  “transac/on”  

§  Transac/ons    should  meet  the  “ACID  test”  – Atomicity:  All-­‐or-­‐nothing  execu/on  of  transac/ons.  

– Consistency:  Databases  have  consistency  rules  (e.g.  what  data  is  valid).  A  transac/on  should  NOT  violate  the  database’s  consistency.  If  it  does,  it  needs  to  be  rolled  back.  

– Isola/on:  Each  transac/on  must  appear  to  be  executed  as  if  no  other  transac/on  is  execu/ng  at  the  same  /me.  

– Durability:  Any  change  a  transac/on  makes  to  the  database  should  persist  and  not  be  lost.    

Prakash  2014   CS  5614:  (Big)  Data  Management  Systems   39  

Page 40: CS5614:(Big)Data# Management#Systems# - People at VT ...people.cs.vt.edu/badityap/classes/cs5614-Fall14/lectures/lecture-1.pdf · CS’5614:’(Big)’DataManagementSystems’ 3 Data#contains#value#and#knowledge#

Prakash  2014   CS  5614:  (Big)  Data  Management  Systems   40  

Disadvantages  over  (flat)  files?  

Page 41: CS5614:(Big)Data# Management#Systems# - People at VT ...people.cs.vt.edu/badityap/classes/cs5614-Fall14/lectures/lecture-1.pdf · CS’5614:’(Big)’DataManagementSystems’ 3 Data#contains#value#and#knowledge#

Prakash  2014   CS  5614:  (Big)  Data  Management  Systems   41  

Disadvantages  over  (flat)  files  

§  Price  §  addi/onal  exper/se  (SQL/DBA)    (hence:  over-­‐kill  for  small,  single-­‐user  data  sets  But:  mobile  phones  (eg.,  android)  use  sqlite)  

Page 42: CS5614:(Big)Data# Management#Systems# - People at VT ...people.cs.vt.edu/badityap/classes/cs5614-Fall14/lectures/lecture-1.pdf · CS’5614:’(Big)’DataManagementSystems’ 3 Data#contains#value#and#knowledge#

A  Brief  History  of  DBMS  §  The  earliest  databases  (1960s)  evolved  from  file  systems  

–  File  systems  •  Allow    storage  of  large  amounts  of  data  over  a  long  period  of  /me  •  File  systems  do  not  support:  

–  Efficient  access  of  data  items  whose  loca/on  in  a  par/cular  file  is  not  known  

–  Logical  structure  of  data  is  limited  to  crea/on  of  directory  structures  –  Concurrent  access:  Mul/ple  users  modifying  a  single  file  generate  non-­‐uniform  results  

•  Naviga/onal  and  hierarchical  •  User  programmed  the  queries  by  walking  from  node  to  node  in  the  DBMS.  

§  Rela/onal  DBMS  (1970s  to  now)  –  View  database  in  terms  of  rela/ons  or  tables  –  High-­‐level  query  and  defini/on  languages  such  as  SQL  –   Allow  user  to  specify  what  (s)he  wants,  not  how  to  get  what  (s)he  wants  

§  Object-­‐oriented  DBMS  (1980s)  –  Inspired  by  object-­‐oriented  languages  –  Object-­‐rela/onal  DBMS  

Prakash  2014   CS  5614:  (Big)  Data  Management  Systems   42  

Page 43: CS5614:(Big)Data# Management#Systems# - People at VT ...people.cs.vt.edu/badityap/classes/cs5614-Fall14/lectures/lecture-1.pdf · CS’5614:’(Big)’DataManagementSystems’ 3 Data#contains#value#and#knowledge#

The  DBMS  Industry    §  A  DBMS  is  a  soiware  system.  

§  Major  DBMS  vendors:  Oracle,  Microsoi,  IBM,  Sybase  

§  Free/Open-­‐source  DBMS:  MySQL,  PostgreSQL,  Firebird.  –  Used  by  companies  such  as  Google,  Yahoo,  Lycos,  BASF….  

§  All  are  “rela/onal”  (or  “object-­‐rela/onal”)  DBMS.  

§  A  mulE-­‐billion  dollar  industry  

Prakash  2014   CS  5614:  (Big)  Data  Management  Systems   43  

Page 44: CS5614:(Big)Data# Management#Systems# - People at VT ...people.cs.vt.edu/badityap/classes/cs5614-Fall14/lectures/lecture-1.pdf · CS’5614:’(Big)’DataManagementSystems’ 3 Data#contains#value#and#knowledge#

Prakash  2014   CS  5614:  (Big)  Data  Management  Systems   44  

Fundamental  concepts  

§  3-­‐level  architecture  §  logical  data  independence  §  physical  data  independence  

Page 45: CS5614:(Big)Data# Management#Systems# - People at VT ...people.cs.vt.edu/badityap/classes/cs5614-Fall14/lectures/lecture-1.pdf · CS’5614:’(Big)’DataManagementSystems’ 3 Data#contains#value#and#knowledge#

Prakash  2014   CS  5614:  (Big)  Data  Management  Systems   45  

3-­‐level  architecture  

§  view  level  §  logical  level  §  physical  level  

v1   v2  v3  

Page 46: CS5614:(Big)Data# Management#Systems# - People at VT ...people.cs.vt.edu/badityap/classes/cs5614-Fall14/lectures/lecture-1.pdf · CS’5614:’(Big)’DataManagementSystems’ 3 Data#contains#value#and#knowledge#

Prakash  2014   CS  5614:  (Big)  Data  Management  Systems   46  

3-­‐level  architecture  

§  view  level  §  logical  level:  eg.,  tables    – STUDENT(ssn,  name)  – TAKES  (ssn,  cid,  grade)  

§  physical  level:  – how  are  these  tables  stored,  how  many  bytes  /  aWribute  etc  

Page 47: CS5614:(Big)Data# Management#Systems# - People at VT ...people.cs.vt.edu/badityap/classes/cs5614-Fall14/lectures/lecture-1.pdf · CS’5614:’(Big)’DataManagementSystems’ 3 Data#contains#value#and#knowledge#

Prakash  2014   CS  5614:  (Big)  Data  Management  Systems   47  

3-­‐level  architecture  

§  view  level,  eg:  – v1:  select  ssn  from  student  – v2:  select  ssn,  c-­‐id  from  takes  

§  logical  level  §  physical  level  

Page 48: CS5614:(Big)Data# Management#Systems# - People at VT ...people.cs.vt.edu/badityap/classes/cs5614-Fall14/lectures/lecture-1.pdf · CS’5614:’(Big)’DataManagementSystems’ 3 Data#contains#value#and#knowledge#

Prakash  2014   CS  5614:  (Big)  Data  Management  Systems   48  

3-­‐level  architecture  

§  -­‐>  hence,  physical  and  logical  data  independence:  

§  logical  D.I.:  – ???  

§  physical  D.I.:  – ???  

Page 49: CS5614:(Big)Data# Management#Systems# - People at VT ...people.cs.vt.edu/badityap/classes/cs5614-Fall14/lectures/lecture-1.pdf · CS’5614:’(Big)’DataManagementSystems’ 3 Data#contains#value#and#knowledge#

Prakash  2014   CS  5614:  (Big)  Data  Management  Systems   49  

3-­‐level  architecture  

§  -­‐>  hence,  physical  and  logical  data  independence:  

§  logical  D.I.:  – can  add  (drop)  column;  add/drop  table  

§  physical  D.I.:  – can  add  index;  change  record  order  

Page 50: CS5614:(Big)Data# Management#Systems# - People at VT ...people.cs.vt.edu/badityap/classes/cs5614-Fall14/lectures/lecture-1.pdf · CS’5614:’(Big)’DataManagementSystems’ 3 Data#contains#value#and#knowledge#

Prakash  2014   CS  5614:  (Big)  Data  Management  Systems   50  

Database  users  

§  ‘naive’  users  §  casual  users  §  applica/on  programmers  §  [  DBA  (Data  base  administrator)]  

Page 51: CS5614:(Big)Data# Management#Systems# - People at VT ...people.cs.vt.edu/badityap/classes/cs5614-Fall14/lectures/lecture-1.pdf · CS’5614:’(Big)’DataManagementSystems’ 3 Data#contains#value#and#knowledge#

Prakash  2014   CS  5614:  (Big)  Data  Management  Systems   51  

Casual  users  

DBMS  

data  

and  meta-­‐data  =  catalog  

select  *    from  student  

Page 52: CS5614:(Big)Data# Management#Systems# - People at VT ...people.cs.vt.edu/badityap/classes/cs5614-Fall14/lectures/lecture-1.pdf · CS’5614:’(Big)’DataManagementSystems’ 3 Data#contains#value#and#knowledge#

Prakash  2014   CS  5614:  (Big)  Data  Management  Systems   52  

``Naive’’  users  

Pictorially:  

DBMS  

data  

and  meta-­‐data  =  catalog  

app.  (eg.,    report  generator)  

Page 53: CS5614:(Big)Data# Management#Systems# - People at VT ...people.cs.vt.edu/badityap/classes/cs5614-Fall14/lectures/lecture-1.pdf · CS’5614:’(Big)’DataManagementSystems’ 3 Data#contains#value#and#knowledge#

Prakash  2014   CS  5614:  (Big)  Data  Management  Systems   53  

App.  programmers  

§  those  who  write  the  applica/ons  (like  the  ‘report  generator’)  

Page 54: CS5614:(Big)Data# Management#Systems# - People at VT ...people.cs.vt.edu/badityap/classes/cs5614-Fall14/lectures/lecture-1.pdf · CS’5614:’(Big)’DataManagementSystems’ 3 Data#contains#value#and#knowledge#

Prakash  2014   CS  5614:  (Big)  Data  Management  Systems   54  

DB  Administrator  (DBA)  

§  Du/es?  

Page 55: CS5614:(Big)Data# Management#Systems# - People at VT ...people.cs.vt.edu/badityap/classes/cs5614-Fall14/lectures/lecture-1.pdf · CS’5614:’(Big)’DataManagementSystems’ 3 Data#contains#value#and#knowledge#

Prakash  2014   CS  5614:  (Big)  Data  Management  Systems   55  

DB  Administrator  (DBA)  

§  schema  defini/on  (‘logical’  level)  §  physical  schema  (storage  structure,  access  methods  

§  schemas  modifica/ons  §  gran/ng  authoriza/ons  §  integrity  constraint  specifica/on  

Page 56: CS5614:(Big)Data# Management#Systems# - People at VT ...people.cs.vt.edu/badityap/classes/cs5614-Fall14/lectures/lecture-1.pdf · CS’5614:’(Big)’DataManagementSystems’ 3 Data#contains#value#and#knowledge#

Prakash  2014   CS  5614:  (Big)  Data  Management  Systems   56  

Overall  system  architecture  

§  [Users]  §  DBMS  – query  processor  – storage  manager  –  transac/on          manager  

§  [Files]  

Page 57: CS5614:(Big)Data# Management#Systems# - People at VT ...people.cs.vt.edu/badityap/classes/cs5614-Fall14/lectures/lecture-1.pdf · CS’5614:’(Big)’DataManagementSystems’ 3 Data#contains#value#and#knowledge#

Prakash  2014   CS  5614:  (Big)  Data  Management  Systems   57  

DDL  int.  DML  proc.  

query  eval.  app.  pgm(o)  

trans.  mgr  

emb.  DML  

buff.  mgr   file  mgr  

data   meta-­‐data  

query  proc.  

storage  mgr.  

naive   app.  pgmr   casual   DBA   users  

Page 58: CS5614:(Big)Data# Management#Systems# - People at VT ...people.cs.vt.edu/badityap/classes/cs5614-Fall14/lectures/lecture-1.pdf · CS’5614:’(Big)’DataManagementSystems’ 3 Data#contains#value#and#knowledge#

Prakash  2014   CS  5614:  (Big)  Data  Management  Systems   58  

Overall  system  architecture  

§  query  processor  – DML  compiler  – embedded  DML  pre-­‐compiler  – DDL  interpreter  – Query  evalua/on  engine  

Page 59: CS5614:(Big)Data# Management#Systems# - People at VT ...people.cs.vt.edu/badityap/classes/cs5614-Fall14/lectures/lecture-1.pdf · CS’5614:’(Big)’DataManagementSystems’ 3 Data#contains#value#and#knowledge#

Prakash  2014   CS  5614:  (Big)  Data  Management  Systems   59  

Overall  system  architecture  (cont’d)  

§  storage  manager  – authoriza/on  and  integrity  manager  –  transac/on  manager  – buffer  manager  – file  manager  

Page 60: CS5614:(Big)Data# Management#Systems# - People at VT ...people.cs.vt.edu/badityap/classes/cs5614-Fall14/lectures/lecture-1.pdf · CS’5614:’(Big)’DataManagementSystems’ 3 Data#contains#value#and#knowledge#

Prakash  2014   CS  5614:  (Big)  Data  Management  Systems   60  

Overall  system  architecture  (cont’d)  

§  Files  – data  files  – data  dic/onary  =  catalog  (=  meta-­‐data)  –  indices  – sta/s/cal  data  

Page 61: CS5614:(Big)Data# Management#Systems# - People at VT ...people.cs.vt.edu/badityap/classes/cs5614-Fall14/lectures/lecture-1.pdf · CS’5614:’(Big)’DataManagementSystems’ 3 Data#contains#value#and#knowledge#

Prakash  2014   CS  5614:  (Big)  Data  Management  Systems   61  

Some  examples:  

§  DBA  doing  a  DDL  (data  defini/on  language)  opera/on,  eg.,    create  table  student  ...  

Page 62: CS5614:(Big)Data# Management#Systems# - People at VT ...people.cs.vt.edu/badityap/classes/cs5614-Fall14/lectures/lecture-1.pdf · CS’5614:’(Big)’DataManagementSystems’ 3 Data#contains#value#and#knowledge#

Prakash  2014   CS  5614:  (Big)  Data  Management  Systems   62  

DDL  int.  DML  proc.  

query  eval.  app.  pgm(o)  

trans.  mgr  

emb.  DML  

buff.  mgr   file  mgr  

data   meta-­‐data  

query  proc.  

storage  mgr.  

naive   app.  pgmr   casual   DBA   users  

Page 63: CS5614:(Big)Data# Management#Systems# - People at VT ...people.cs.vt.edu/badityap/classes/cs5614-Fall14/lectures/lecture-1.pdf · CS’5614:’(Big)’DataManagementSystems’ 3 Data#contains#value#and#knowledge#

Prakash  2014   CS  5614:  (Big)  Data  Management  Systems   63  

Some  examples:  

§  casual  user,  asking  for  an  update,  eg.:  update  student    set  name  to  ‘smith’  where  ssn  =  ‘345’  

Page 64: CS5614:(Big)Data# Management#Systems# - People at VT ...people.cs.vt.edu/badityap/classes/cs5614-Fall14/lectures/lecture-1.pdf · CS’5614:’(Big)’DataManagementSystems’ 3 Data#contains#value#and#knowledge#

Prakash  2014   CS  5614:  (Big)  Data  Management  Systems   64  

DDL  int.  DML  proc.  

query  eval.  app.  pgm(o)  

trans.  mgr  

emb.  DML  

buff.  mgr   file  mgr  

data   meta-­‐data  

query  proc.  

storage  mgr.  

naive   app.  pgmr   casual   DBA   users  

Page 65: CS5614:(Big)Data# Management#Systems# - People at VT ...people.cs.vt.edu/badityap/classes/cs5614-Fall14/lectures/lecture-1.pdf · CS’5614:’(Big)’DataManagementSystems’ 3 Data#contains#value#and#knowledge#

Prakash  2014   CS  5614:  (Big)  Data  Management  Systems   65  

DDL  int.  DML  proc.  

query  eval.  app.  pgm(o)  

trans.  mgr  

emb.  DML  

buff.  mgr   file  mgr  

data   meta-­‐data  

query  proc.  

storage  mgr.  

naive   app.  pgmr   casual   DBA   users  

Page 66: CS5614:(Big)Data# Management#Systems# - People at VT ...people.cs.vt.edu/badityap/classes/cs5614-Fall14/lectures/lecture-1.pdf · CS’5614:’(Big)’DataManagementSystems’ 3 Data#contains#value#and#knowledge#

Prakash  2014   CS  5614:  (Big)  Data  Management  Systems   66  

DDL  int.  DML  proc.  

query  eval.  app.  pgm(o)  

trans.  mgr  

emb.  DML  

buff.  mgr   file  mgr  

data   meta-­‐data  

query  proc.  

storage  mgr.  

naive   app.  pgmr   casual   DBA   users  

Page 67: CS5614:(Big)Data# Management#Systems# - People at VT ...people.cs.vt.edu/badityap/classes/cs5614-Fall14/lectures/lecture-1.pdf · CS’5614:’(Big)’DataManagementSystems’ 3 Data#contains#value#and#knowledge#

Prakash  2014   CS  5614:  (Big)  Data  Management  Systems   67  

Some  examples:  

§  app.  programmer,  crea/ng  a  report,  eg  main(){  ....  exec  sql  “select  *  from  student”  ...  }  

Page 68: CS5614:(Big)Data# Management#Systems# - People at VT ...people.cs.vt.edu/badityap/classes/cs5614-Fall14/lectures/lecture-1.pdf · CS’5614:’(Big)’DataManagementSystems’ 3 Data#contains#value#and#knowledge#

Prakash  2014   CS  5614:  (Big)  Data  Management  Systems   68  

DDL  int.  DML  proc.  

query  eval.  app.  pgm(o)  

trans.  mgr  

emb.  DML  

buff.  mgr   file  mgr  

data   meta-­‐data  

query  proc.  

storage  mgr.  

naive   app.  pgmr   casual   DBA   users  

pgm  (src)  

Page 69: CS5614:(Big)Data# Management#Systems# - People at VT ...people.cs.vt.edu/badityap/classes/cs5614-Fall14/lectures/lecture-1.pdf · CS’5614:’(Big)’DataManagementSystems’ 3 Data#contains#value#and#knowledge#

Prakash  2014   CS  5614:  (Big)  Data  Management  Systems   69  

Some  examples:  

§  ‘naive’  user,  running  the  previous  app.  

Page 70: CS5614:(Big)Data# Management#Systems# - People at VT ...people.cs.vt.edu/badityap/classes/cs5614-Fall14/lectures/lecture-1.pdf · CS’5614:’(Big)’DataManagementSystems’ 3 Data#contains#value#and#knowledge#

Prakash  2014   CS  5614:  (Big)  Data  Management  Systems   70  

DDL  int.  DML  proc.  

query  eval.  app.  pgm(o)  

trans.  mgr  

emb.  DML  

buff.  mgr   file  mgr  

data   meta-­‐data  

query  proc.  

storage  mgr.  

naive   app.  pgmr   casual   DBA   users  

pgm  (src)  

Page 71: CS5614:(Big)Data# Management#Systems# - People at VT ...people.cs.vt.edu/badityap/classes/cs5614-Fall14/lectures/lecture-1.pdf · CS’5614:’(Big)’DataManagementSystems’ 3 Data#contains#value#and#knowledge#

Prakash  2014   CS  5614:  (Big)  Data  Management  Systems   71  

Conclusions  

§  (rela/onal)  DBMSs:  electronic  record  keepers  §  customize  them  with  create  table  commands  §  ask  SQL  queries  to  retrieve  info  

Page 72: CS5614:(Big)Data# Management#Systems# - People at VT ...people.cs.vt.edu/badityap/classes/cs5614-Fall14/lectures/lecture-1.pdf · CS’5614:’(Big)’DataManagementSystems’ 3 Data#contains#value#and#knowledge#

Prakash  2014   CS  5614:  (Big)  Data  Management  Systems   72  

Conclusions  contd  

main  advantages  over  (flat)  files  &  scripts:  §  logical  +  physical  data  independence  (ie.,  flexibility  of  adding  new  aWributes,  new  tables  and  indices)  

§  concurrency  control  and  recovery