Upload
vuongcong
View
223
Download
0
Embed Size (px)
Citation preview
CS 5614: (Big) Data Management Systems
B. Aditya Prakash Lecture #1: Introduc/on
CS 5614: (Big) Data Management Systems 2 Prakash 2014
3 CS 5614: (Big) Data Management Systems Data contains value and knowledge
Prakash 2014
Data and Business Data1and1business*
4*
+250%'clicksAvs.1editorial1oneDsizeDfitsDall*
+79%''clicksAvs.1randomly1selected*
+43%'clicksAvs.1editor1selected*
Recommended'linksA Personalized''News'InterestsA
Top'SearchesA
Prakash 2014 4 Source: A. Machhanavajjhala CS 5614: (Big) Data Management Systems
Data and Science Data1and1science*5*
Detecting'influenza'epidemics'using'search'engine'query'data1
http://www.nature.com/nature/journal/v457/n7232/full/nature07634.html
Red:1official1numbers1from1Center1for1Disease1Control1and1Prevention;1weekly11Black:1based1on1Google1search1logs;1daily1(potentially1instantaneously)*
Prakash 2014 5 CS 5614: (Big) Data Management Systems
Data and Government Data1and1government*6*
http://www.washingtonpost.com/opinions/obama-the-big-data-president/2013/06/14/1d71fe2e-d391-11e2-b05f-3ea3f0e7bb5a_story.html
http://www.whitehouse.gov/blog/Democratizing-Data
http://www.washingtonpost.com/business/economy/democrats-push-to-redeploy-obamas-voter-database/2012/11/20/d14793a4-2e83-11e2-89d4-040c9330702a_story.html
http://www.theguardian.com/world/2013/jun/23/edward-snowden-nsa-files-timeline
Prakash 2014 6 CS 5614: (Big) Data Management Systems Source: A. Machhanavajjhala
Data and Culture Data1and1culture*7*
• Word1frequencies1in1EnglishDlanguage1books1in1Google’s1database11http://blogs.plos.org/everyone/2013/03/20/what-are-you-in-the-mood-for-emotional-trends-in-20th-century-books/
Prakash 2014 7 CS 5614: (Big) Data Management Systems Source: A. Machhanavajjhala
Data1and1____1☜ your1favorite1subject*
8*
Sports* Journalism*
Prakash 2014 8 CS 5614: (Big) Data Management Systems
Good news: Demand for Data Mining
9 CS 5614: (Big) Data Management Systems Prakash 2014
How to extract value from data?
§ Manipulate Data – CS, Domain exper/se
§ Analyze Data – Math, CS, Stat…
§ Communicate your results – CS, Domain Exper/se
Prakash 2014 10 CS 5614: (Big) Data Management Systems
CommunicaEon is important! Communicating1results*
“The"British"government"spends""£13"billion"a"year"on"universities.”F– So?*– Try1instead1http://wheredoesmymoneygo.org/"bubbletree-map.html#/~/total/education/university
“On"average,"1"in"every"15"Europeans""is"totally"illiterate.”F– True*– But1about111in1every1141is1under171years1old!*http://datajournalismhandbook.org/1.0/en/understanding_data_0.html*
13*
Prakash 2014 11 CS 5614: (Big) Data Management Systems
What is Data Mining? § Given lots of data § Discover paJerns and models that are: – Valid: hold on new data with some certainty – Useful: should be possible to act on the item – Unexpected: non-‐obvious to the system – Understandable: humans should be able to interpret the paWern
CS 5614: (Big) Data Management Systems 12 Prakash 2014
Data Mining Tasks
§ DescripEve methods – Find human-‐interpretable paWerns that describe the data • Example: Clustering
§ PredicEve methods – Use some variables to predict unknown or future values of other variables • Example: Recommender systems
CS 5614: (Big) Data Management Systems 13 Prakash 2014
ML & Stats.
Comp. Systems
Theory & Algo.
Biology
Econ.
Social Science
Physics
14
Big data
CS 5614: (Big) Data Management Systems Prakash 2014
COURSE LOGISTICS
Prakash 2014 CS 5614: (Big) Data Management Systems 15
Course InformaEon § Instructor B. Aditya Prakash, Torgersen Hall 3160 F, [email protected] – Office Hours: TBD – Include string CS 5614 in subject
§ Teaching Assistant Vanessa Cadeno, [email protected] – Office Hours: TBD
§ Class MeeEng Time Mondays, Wednesdays, 2:30-‐3:45pm, McBryde Hall 134
§ Syllabus: RelaEonal Database Systems, Big data Technologies (MR
and new soYware stack), Streams, RecommendaEon Systems, Large Scale Machine Learning, and Graph Mining
Prakash 2014 CS 5614: (Big) Data Management Systems 16
Course InformaEon § Keeping in Touch Course web site
hWp://www.cs.vt.edu/~badityap/classes/cs5614-‐Fall14/ updated regularly through the semester – Piazza link on the website
Prakash 2014 CS 5614: (Big) Data Management Systems 17
Textbook § Required Jure Leskovec, Anand Rajaraman and Jeffery Ullman: Mining Massive Datasets (2nd) Cambridge University Press. 2010
Web page for the book (with FREE PDF!)
www.mmds.org
Prakash 2014 CS 5614: (Big) Data Management Systems 18
Textbook § Recommended (for database internals) Raghu Ramakrishnan and Johannes Gehrke Database Management Systems (3rd Ed.). McGraw Hill.
Prakash 2014 CS 5614: (Big) Data Management Systems 19
Pre-‐reqs (A) Should enjoy the course J (B) Background in
1. Algorithms 2. Probability and Stats 3. Undergraduate level Databases 4. Linear Algebra (helps) 5. Graph theory (helps)
(C) Graduate-‐level Programming Skills (i.e. ability to use unfamiliar soiware, picking up new languages, comfortable with at least one of Python/C/C++/Ruby/Java etc. (Matlab/R a plus))
Prakash 2014 CS 5614: (Big) Data Management Systems 20
Course Grading § Details coming soon (next lecture)
§ Broadly – Some hws – No midterm – Take-‐home Final – Project – Class par/cipa/on
Prakash 2014 CS 5614: (Big) Data Management Systems 21
Course Project § 2, or 3 (max) persons per project. § Major work for this class. § Pick your own topic
– You have to jus/fy why the topic is interes/ng, and relevant to the course, and of suitable difficulty
§ Harder way: – Joint projects with other courses are also nego/able. In that case, you will need the approval of the instructor, and you also need to clarify exactly what steps will be done for our course, as well as for the other course.
– Projects related to your disserta/on/master-‐project are also possible, as long as there is no 'double-‐dipping', i.e., you clearly specify what the project will do, in addi/on to what you were planning to do for your thesis anyway.
§ Ask me if you need help and ideas (I may release a list of suitable topics later)
Prakash 2014 CS 5614: (Big) Data Management Systems 22
Course Project § Proposal § Milestone § Final Report § Poster Presenta/on (or in-‐class presenta/on TBD)
Prakash 2014 CS 5614: (Big) Data Management Systems 23
WARM-‐UP AND BASICS
Prakash 2014 CS 5614: (Big) Data Management Systems 24
RelaEonal Databases: What we will cover (next 1 month)
§ Implementa/on – What is under-‐the-‐hood of a DB like Oracal/MySQL?
§ Design – How do you model your data and structure your informa/on in a database?
§ Programming – How do you use the capabili/es of a DBMS?
§ CS 4604 achieves a balance between – a firm theore/cal founda/on to designing moderate-‐sized databases
– crea/ng, querying, and implemen/ng realis/c databases and connec/ng them to applica/ons
Prakash 2014 CS 5614: (Big) Data Management Systems 25
CS4604: Course Outline § Weeks 1–4: Query/
Manipula/on Languages and Data Modeling – Rela/onal Algebra – Data defini/on – Programming with SQL – En/ty-‐Rela/onship (E/R) approach
– Specifying Constraints – Good E/R design
§ Weeks 5–8: Indexes, Processing and Op/miza/on – Storing – Hashing/Sor/ng – Query Op/miza/on – NoSQL and Hadoop
§ Week 9-‐10: Rela/onal Design – Func/onal Dependencies – Normaliza/on to avoid redundancy
§ Week 11-‐12: Concurrency Control – Transac/ons – Logging and Recovery
§ Week 13–14: Students’ choice – Prac/ce Problems – XML – Data mining and warehousing
Prakash 2014 CS 5614: (Big) Data Management Systems 26
We will go over all of this quickly! ☺
What is the goal of a DBMS?
§ Electronic record-‐keeping Fast and convenient access to informa/on § DBMS == database management system – `Rela/onal’ in this class – data + set of instruc/ons to access/manipulate data
Prakash 2014 CS 5614: (Big) Data Management Systems 27
What is a DBMS? § Features of a DBMS – Support massive amounts of data – Persistent storage – Efficient and convenient access – Secure, concurrent, and atomic access
§ Examples? – Search engines, banking systems, airline reserva/ons, corporate records, payrolls, sales inventories.
– New applica/ons: Wikis, social/biological/mul/media/scien/fic/geographic data, heterogeneous data.
Prakash 2014 CS 5614: (Big) Data Management Systems 28
Features of a DBMS • Support massive amounts of data
– Giga/tera/petabytes – Far too big for main memory
• Persistent storage – Programs update, query, manipulate data. – Data con/nues to live long aier program finishes.
• Efficient and convenient access – Efficient: do not search en/re database to answer a query. – Convenient: allow users to query the data as easily as possible.
• Secure, concurrent, and atomic access – Allow mul/ple users to access database simultaneously. – Allow a user access to only to authorized data. – Provide some guarantee of reliability against system failures.
Prakash 2014 CS 5614: (Big) Data Management Systems 29
Example Scenario
§ Students, taking classes, obtaining grades – Find my GPA – <and other ad-‐hoc queries>
Prakash 2014 CS 5614: (Big) Data Management Systems 30
Obvious soluEon 1: Folders
§ Advantages? – Cheap; Easy-‐to-‐use
§ Disadvantages? – No ad-‐hoc queries – No sharing – Large Physical foot-‐print
Prakash 2014 CS 5614: (Big) Data Management Systems 31
Obvious SoluEon++
§ Flat files and C (C++, Java…) programs – E.g. one (or more) UNIX/DOS files, with student records and their courses
Prakash 2014 CS 5614: (Big) Data Management Systems 32
Obvious SoluEon++
§ Layout for student records? – CSV (‘comma-‐separated-‐values’) Hermione Grainger,123,Potions,A
Draco Malfoy,111,Potions,B
Harry Potter,234,Potions,A
Ron Weasley,345,Potions,C
Prakash 2014 CS 5614: (Big) Data Management Systems 33
Obvious SoluEon++
§ Layout for student records? – Other possibili/es like Hermione Grainger,123 123,Potions,A
Draco Malfoy,111 111,Potions,B
Harry Potter,234 234,Potions,A
Ron Weasley,345 345,Potions,C
Prakash 2014 CS 5614: (Big) Data Management Systems 34
Problems?
§ inconvenient access to data (need ‘C++’ exper/ze, plus knowledge of file-‐layout) – data isola/on
§ data redundancy (and inconsistencies) § integrity problems § atomicity problems § concurrent-‐access problems § security problems § …….
Prakash 2014 CS 5614: (Big) Data Management Systems 35
Problems-‐Why?
§ Two main reasons: – file-‐layout descrip/on is buried within the C programs and
– there is no support for transac/ons (concurrency and recovery)
Prakash 2014 CS 5614: (Big) Data Management Systems 36
DBMSs handle exactly these two problems
Example Scenario § RDBMS = “Rela/onal”DBMS § The rela/onal model uses rela/ons or tables to structure data § ClassList rela/on:
§ Rela/on separates the logical view (externals) from the physical view (internals)
§ Simple query languages (SQL) for accessing/modifying data – Find all students whose grades are beWer than B. – SELECT Student FROM ClassList WHERE Grade >“B”
Student Course Grade
Hermione Grainger Po/ons A
Draco Malfoy Po/ons B
Harry PoWer Po/ons A
Ron Weasley Po/ons C
Prakash 2014 CS 5614: (Big) Data Management Systems 37
DBMS Architecture
Prakash 2014 CS 5614: (Big) Data Management Systems 38
TransacEon Processing § One or more database opera/ons are grouped into a “transac/on”
§ Transac/ons should meet the “ACID test” – Atomicity: All-‐or-‐nothing execu/on of transac/ons.
– Consistency: Databases have consistency rules (e.g. what data is valid). A transac/on should NOT violate the database’s consistency. If it does, it needs to be rolled back.
– Isola/on: Each transac/on must appear to be executed as if no other transac/on is execu/ng at the same /me.
– Durability: Any change a transac/on makes to the database should persist and not be lost.
Prakash 2014 CS 5614: (Big) Data Management Systems 39
Prakash 2014 CS 5614: (Big) Data Management Systems 40
Disadvantages over (flat) files?
Prakash 2014 CS 5614: (Big) Data Management Systems 41
Disadvantages over (flat) files
§ Price § addi/onal exper/se (SQL/DBA) (hence: over-‐kill for small, single-‐user data sets But: mobile phones (eg., android) use sqlite)
A Brief History of DBMS § The earliest databases (1960s) evolved from file systems
– File systems • Allow storage of large amounts of data over a long period of /me • File systems do not support:
– Efficient access of data items whose loca/on in a par/cular file is not known
– Logical structure of data is limited to crea/on of directory structures – Concurrent access: Mul/ple users modifying a single file generate non-‐uniform results
• Naviga/onal and hierarchical • User programmed the queries by walking from node to node in the DBMS.
§ Rela/onal DBMS (1970s to now) – View database in terms of rela/ons or tables – High-‐level query and defini/on languages such as SQL – Allow user to specify what (s)he wants, not how to get what (s)he wants
§ Object-‐oriented DBMS (1980s) – Inspired by object-‐oriented languages – Object-‐rela/onal DBMS
Prakash 2014 CS 5614: (Big) Data Management Systems 42
The DBMS Industry § A DBMS is a soiware system.
§ Major DBMS vendors: Oracle, Microsoi, IBM, Sybase
§ Free/Open-‐source DBMS: MySQL, PostgreSQL, Firebird. – Used by companies such as Google, Yahoo, Lycos, BASF….
§ All are “rela/onal” (or “object-‐rela/onal”) DBMS.
§ A mulE-‐billion dollar industry
Prakash 2014 CS 5614: (Big) Data Management Systems 43
Prakash 2014 CS 5614: (Big) Data Management Systems 44
Fundamental concepts
§ 3-‐level architecture § logical data independence § physical data independence
Prakash 2014 CS 5614: (Big) Data Management Systems 45
3-‐level architecture
§ view level § logical level § physical level
v1 v2 v3
Prakash 2014 CS 5614: (Big) Data Management Systems 46
3-‐level architecture
§ view level § logical level: eg., tables – STUDENT(ssn, name) – TAKES (ssn, cid, grade)
§ physical level: – how are these tables stored, how many bytes / aWribute etc
Prakash 2014 CS 5614: (Big) Data Management Systems 47
3-‐level architecture
§ view level, eg: – v1: select ssn from student – v2: select ssn, c-‐id from takes
§ logical level § physical level
Prakash 2014 CS 5614: (Big) Data Management Systems 48
3-‐level architecture
§ -‐> hence, physical and logical data independence:
§ logical D.I.: – ???
§ physical D.I.: – ???
Prakash 2014 CS 5614: (Big) Data Management Systems 49
3-‐level architecture
§ -‐> hence, physical and logical data independence:
§ logical D.I.: – can add (drop) column; add/drop table
§ physical D.I.: – can add index; change record order
Prakash 2014 CS 5614: (Big) Data Management Systems 50
Database users
§ ‘naive’ users § casual users § applica/on programmers § [ DBA (Data base administrator)]
Prakash 2014 CS 5614: (Big) Data Management Systems 51
Casual users
DBMS
data
and meta-‐data = catalog
select * from student
Prakash 2014 CS 5614: (Big) Data Management Systems 52
``Naive’’ users
Pictorially:
DBMS
data
and meta-‐data = catalog
app. (eg., report generator)
Prakash 2014 CS 5614: (Big) Data Management Systems 53
App. programmers
§ those who write the applica/ons (like the ‘report generator’)
Prakash 2014 CS 5614: (Big) Data Management Systems 54
DB Administrator (DBA)
§ Du/es?
Prakash 2014 CS 5614: (Big) Data Management Systems 55
DB Administrator (DBA)
§ schema defini/on (‘logical’ level) § physical schema (storage structure, access methods
§ schemas modifica/ons § gran/ng authoriza/ons § integrity constraint specifica/on
Prakash 2014 CS 5614: (Big) Data Management Systems 56
Overall system architecture
§ [Users] § DBMS – query processor – storage manager – transac/on manager
§ [Files]
Prakash 2014 CS 5614: (Big) Data Management Systems 57
DDL int. DML proc.
query eval. app. pgm(o)
trans. mgr
emb. DML
buff. mgr file mgr
data meta-‐data
query proc.
storage mgr.
naive app. pgmr casual DBA users
Prakash 2014 CS 5614: (Big) Data Management Systems 58
Overall system architecture
§ query processor – DML compiler – embedded DML pre-‐compiler – DDL interpreter – Query evalua/on engine
Prakash 2014 CS 5614: (Big) Data Management Systems 59
Overall system architecture (cont’d)
§ storage manager – authoriza/on and integrity manager – transac/on manager – buffer manager – file manager
Prakash 2014 CS 5614: (Big) Data Management Systems 60
Overall system architecture (cont’d)
§ Files – data files – data dic/onary = catalog (= meta-‐data) – indices – sta/s/cal data
Prakash 2014 CS 5614: (Big) Data Management Systems 61
Some examples:
§ DBA doing a DDL (data defini/on language) opera/on, eg., create table student ...
Prakash 2014 CS 5614: (Big) Data Management Systems 62
DDL int. DML proc.
query eval. app. pgm(o)
trans. mgr
emb. DML
buff. mgr file mgr
data meta-‐data
query proc.
storage mgr.
naive app. pgmr casual DBA users
Prakash 2014 CS 5614: (Big) Data Management Systems 63
Some examples:
§ casual user, asking for an update, eg.: update student set name to ‘smith’ where ssn = ‘345’
Prakash 2014 CS 5614: (Big) Data Management Systems 64
DDL int. DML proc.
query eval. app. pgm(o)
trans. mgr
emb. DML
buff. mgr file mgr
data meta-‐data
query proc.
storage mgr.
naive app. pgmr casual DBA users
Prakash 2014 CS 5614: (Big) Data Management Systems 65
DDL int. DML proc.
query eval. app. pgm(o)
trans. mgr
emb. DML
buff. mgr file mgr
data meta-‐data
query proc.
storage mgr.
naive app. pgmr casual DBA users
Prakash 2014 CS 5614: (Big) Data Management Systems 66
DDL int. DML proc.
query eval. app. pgm(o)
trans. mgr
emb. DML
buff. mgr file mgr
data meta-‐data
query proc.
storage mgr.
naive app. pgmr casual DBA users
Prakash 2014 CS 5614: (Big) Data Management Systems 67
Some examples:
§ app. programmer, crea/ng a report, eg main(){ .... exec sql “select * from student” ... }
Prakash 2014 CS 5614: (Big) Data Management Systems 68
DDL int. DML proc.
query eval. app. pgm(o)
trans. mgr
emb. DML
buff. mgr file mgr
data meta-‐data
query proc.
storage mgr.
naive app. pgmr casual DBA users
pgm (src)
Prakash 2014 CS 5614: (Big) Data Management Systems 69
Some examples:
§ ‘naive’ user, running the previous app.
Prakash 2014 CS 5614: (Big) Data Management Systems 70
DDL int. DML proc.
query eval. app. pgm(o)
trans. mgr
emb. DML
buff. mgr file mgr
data meta-‐data
query proc.
storage mgr.
naive app. pgmr casual DBA users
pgm (src)
Prakash 2014 CS 5614: (Big) Data Management Systems 71
Conclusions
§ (rela/onal) DBMSs: electronic record keepers § customize them with create table commands § ask SQL queries to retrieve info
Prakash 2014 CS 5614: (Big) Data Management Systems 72
Conclusions contd
main advantages over (flat) files & scripts: § logical + physical data independence (ie., flexibility of adding new aWributes, new tables and indices)
§ concurrency control and recovery