Advanced databases –
Large-scale data storage and processing (1):
Map-Reduce
Bettina Berendt
Katholieke Universiteit Leuven, Department of Computer Science
http://www.cs.kuleuven.be/~berendt/teaching/
Last update: 28 December 2011
Agenda
Motivation
Map-Reduce
Comparing Map-Reduce and parallel DBMS: Is performance everything?
Recall: structure of the Bachelor course Gegevensbanken

Lesson  Who  Topic
 1      ED   intro, ER
 2      ED   EER
 3      ED   relational model
 4      ED   mapping EER to relational
 5      KV   relational algebra, relational calculus
 6      KV   SQL
 7      KV   SQL continued
 8      KV   demo: Access, QBE, JDBC
 9      KV   functional dependencies and normalisation
10      KV   functional dependencies and normalisation
11      BB   file structures and hashing
12      BB   indexing I
13      BB   indexing II and higher-dimensional structures
14      BB   query processing
15      BB   transactions
16      BB   query security
17      BB   data warehousing and mining
18      ED   XML, oodb, multimedia db

Course phases: conceptual model → relational model → physical model / queries → new topics / outlook
Structure, which is taken up again here:
1. From knowledge (in the head) to data: conceptual modelling at different levels of expressivity
2. Getting knowledge out of the data: SQL, deductive and inductive inferences
3. Making this fast(er):
   3.1 Optimising file and index structures, queries, ... [Bachelor]
   3.2 Parallelising things [today]
   3.3 Doing only what's needed [next lecture]
4. New topics (more on text processing, or databases and privacy) [last lecture]
As an example of the usefulness of parallelisation, consider: Bayes' formula and its use for classification
1. Joint probabilities and conditional probabilities: basics
P(A & B) = P(A|B) * P(B) = P(B|A) * P(A)
P(A|B) = ( P(B|A) * P(A) ) / P(B)   (Bayes' formula)
P(A): prior probability of A (a hypothesis, e.g. that an object belongs to a certain class)
P(A|B): posterior probability of A (given the evidence B)

2. Estimation:
Estimate P(A) by the frequency of A in the training set (i.e., the number of class-A instances divided by the total number of instances).
Estimate P(B|A) by the frequency of B within the class-A instances (i.e., the number of class-A instances that have B divided by the total number of class-A instances).

3. Decision rule for classifying an instance:
If there are two possible hypotheses/classes (A and ~A), choose the one that is more probable given the evidence (~A is "not A"):
If P(A|B) > P(~A|B), choose A.
The denominators are equal, so equivalently:
If ( P(B|A) * P(A) ) > ( P(B|~A) * P(~A) ), choose A.
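A minimal sketch (Python, with hypothetical toy counts) of steps 2 and 3: estimate the probabilities by frequencies, then apply the decision rule with the common denominator P(B) cancelled.

```python
# Hypothetical training-set counts (assumed for illustration only).
n_total = 100        # total number of training instances
n_A = 30             # instances of class A
n_notA = 70          # instances of class ~A
n_B_and_A = 24       # class-A instances showing evidence B
n_B_and_notA = 7     # class-~A instances showing evidence B

# Step 2: estimation by frequencies.
p_A = n_A / n_total                     # prior P(A)
p_notA = n_notA / n_total               # prior P(~A)
p_B_given_A = n_B_and_A / n_A           # likelihood P(B|A)
p_B_given_notA = n_B_and_notA / n_notA  # likelihood P(B|~A)

# Step 3: the common denominator P(B) cancels, so compare
# P(B|A) * P(A) against P(B|~A) * P(~A).
chosen = "A" if p_B_given_A * p_A > p_B_given_notA * p_notA else "~A"
print(chosen)  # "A" for these counts: 0.8 * 0.3 = 0.24 > 0.1 * 0.7 = 0.07
```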
Simplifications and Naive Bayes
4. Simplify by setting the priors equal (i.e., by using as many instances of class A as of class ~A)
If P(B|A) > P(B|~A), choose A
5. More than one kind of evidence
General formula:
P(A | B1 & B2 ) = P(A & B1 & B2 ) / P(B1 & B2) = P(B1 & B2 | A) * P(A) / P(B1 & B2) = P(B1 | B2 & A) * P(B2 | A) * P(A) / P(B1 & B2)
Enter the „naive“ assumption: B1 and B2 are independent given A
P(A | B1 & B2 ) = P(B1|A) * P(B2|A) * P(A) / P(B1 & B2)
By reasoning as in 3. and 4. above, the last two terms can be omitted
If (P(B1|A) * P(B2|A) ) > (P(B1|~A) * P(B2|~A) ), choose A
The generalization to n kinds of evidence is straightforward.
In machine learning, features are the evidence.
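A sketch of the resulting n-feature decision rule, with equal priors as in step 4. The likelihood tables below are hypothetical; log-probabilities are summed instead of multiplied to avoid numeric underflow for large n (the comparison itself is unchanged).

```python
from math import log

# Hypothetical per-feature likelihoods P(Bi|A) and P(Bi|~A).
likelihood_A    = {"B1": 0.30, "B2": 0.10, "B3": 0.45}
likelihood_notA = {"B1": 0.05, "B2": 0.20, "B3": 0.40}

def score(likelihoods, evidence):
    # Equivalent to the product of the P(Bi|class) terms,
    # computed in log space for numerical stability.
    return sum(log(likelihoods[b]) for b in evidence)

observed = ["B1", "B3"]  # the kinds of evidence present in the instance
chosen = "A" if score(likelihood_A, observed) > score(likelihood_notA, observed) else "~A"
print(chosen)  # "A": 0.30 * 0.45 > 0.05 * 0.40
```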
Example: Texts as bags of words
Common representations of texts
Set: can contain each element (word) at most once
Bag (aka multiset): can contain each word multiple times (most common representation used in text mining)
Hypotheses and evidence
A = The blog is a happy blog, the email is a spam email, etc.
~A = The blog is a sad blog, the email is a proper email, etc.
Bi refers to the ith word occurring in the whole corpus of texts
Estimation for the bag-of-words representation:
Example estimation of P(B1|A) :
number of occurrences of the first word in all happy blogs, divided by the total number of words in happy blogs (etc.)
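A minimal sketch of this estimate over a hypothetical two-document corpus of happy blogs:

```python
from collections import Counter

# Hypothetical "happy blog" documents.
happy_blogs = [
    "what a great sunny day",
    "great news and a great mood",
]

counts = Counter()
for text in happy_blogs:
    counts.update(text.split())   # bag of words: duplicates are kept

total_tokens = sum(counts.values())
# P("great" | happy) = occurrences of the word in all happy blogs,
# divided by the total number of word tokens in happy blogs.
p_great_given_happy = counts["great"] / total_tokens
print(p_great_given_happy)        # 3 / 11 for this toy corpus
```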
Where can parallelism be used?
The Naive Bayes decision rule from the previous slides:

If ( P(B1|A) * P(B2|A) ) > ( P(B1|~A) * P(B2|~A) ), choose A.

We need to estimate these conditional probabilities based on the word counts in every document!

Approach: "take the parallelism to the data".
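Each document's word counts can be computed independently of all the others, which is exactly where the parallelism enters. A minimal sketch of this idea in plain Python (not Hadoop; the documents are hypothetical):

```python
from collections import Counter
from multiprocessing import Pool

def count_words(document):
    # Runs independently per document: this is the parallelisable part.
    return Counter(document.split())

# Hypothetical per-class document collection.
documents = ["a happy happy blog", "another happy post", "more text here"]

if __name__ == "__main__":
    with Pool() as pool:
        partial_counts = pool.map(count_words, documents)
    # Merging the small per-document count tables is cheap; the
    # (potentially huge) documents themselves never move.
    total = sum(partial_counts, Counter())
    print(total["happy"])  # 3
```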
Agenda
Motivation
Map-Reduce
Comparing Map-Reduce and parallel DBMS: Is performance everything?
http://rakaposhi.eas.asu.edu/cse494/notes/s07-map-reduce.ppt
(Note: this is based on the classical Map-Reduce article from 2004; numbers have further increased since then)
Types
map: (k1, v1) → list(k2, v2)
reduce: (k2, list(v2)) → list(v2)   (for one k2)
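To make these types concrete, here is a minimal word-count sketch shaped to match the signatures above. The function names and toy input are illustrative, and the driver stands in for the framework's shuffle/grouping phase; this is not the Hadoop API.

```python
from itertools import groupby
from operator import itemgetter

def map_fn(doc_name, text):
    # (k1, v1) = (document name, text) -> list of (k2, v2) = (word, 1)
    return [(word, 1) for word in text.split()]

def reduce_fn(word, counts):
    # (k2, list(v2)) = (word, [1, 1, ...]) -> list(v2) = [total]
    return [sum(counts)]

# Driver: map, group by key k2 (the "shuffle"), then reduce per key.
pairs = sorted(map_fn("doc1", "to be or not to be"), key=itemgetter(0))
result = {key: reduce_fn(key, [v for _, v in group])
          for key, group in groupby(pairs, key=itemgetter(0))}
print(result)  # {'be': [2], 'not': [1], 'or': [1], 'to': [2]}
```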
Execution overview (figure from Dean & Ghemawat, 2004)
Agenda
Motivation
Map-Reduce
Comparing Map-Reduce and parallel DBMS: Is performance everything?
Recall: SQL query optimization – which plan is smarter?

SELECT empname, projectname
FROM emp, project
WHERE emp.SSN = project.leaderSSN
AND emp.income > 1000000

Plan 1 (naive: cross product, then select, then project):
π empname, projectname ( σ emp.income > 1000000 ( σ emp.SSN = project.leaderSSN ( emp × project ) ) )

Plan 2 (smarter: push the selection down and join):
π empname, projectname ( ( σ emp.income > 1000000 ( emp ) ) ⋈ emp.SSN = project.leaderSSN project )
Parallel database query execution plans
Performance comparison: Hadoop (Map-Reduce) vs. Parallel DBMS
Friends, not foes
“MapReduce complements DBMSs since databases are not designed for extract-transform-load tasks, a MapReduce specialty.”
(Stonebraker et al., 2010)
A general finding: MapReduce / Hadoop shows its superiority as data volumes get very large
Note: Some benchmarking studies appear interest-driven …
A MapReduce controversy (1)
D. J. DeWitt & M. Stonebraker (2008). MapReduce: A major step backwards. http://www.databasecolumn.com/2008/01/mapreduce-a-major-step-back.html (no longer available). Online 1/11/09 at http://www.yjanboo.cn/?p=237 (reproduced at http://craig-henderson.blogspot.com/2009/11/dewitt-and-stonebrakers-mapreduce-major.html)
"MapReduce may be a good idea for writing certain types of general-purpose computations, but to the database community, it is:
A giant step backward in the programming paradigm for large-scale data intensive applications
A sub-optimal implementation, in that it uses brute force instead of indexing
Not novel at all — it represents a specific implementation of well known techniques developed nearly 25 years ago
Missing most of the features that are routinely included in current DBMS
Incompatible with all of the tools DBMS users have come to depend on"
Two more publications by teams including these authors followed (2009; 2010, see the literature list).
A MapReduce controversy (2)
The original authors "address several misconceptions about MapReduce in these [publications]":
J. Dean & S. Ghemawat (2010). MapReduce: A flexible data processing tool. Communications of the ACM, 53(1), 72-77.
http://www.cs.princeton.edu/courses/archive/spr11/cos448/web/docs/week10_reading2.pdf
As an aside: Can text mining / information retrieval help to learn about (or even solve) this controversy? ;-)
HCIR 2011 Challenge
"1) The Great MapReduce Debate
In 2004, Google introduced MapReduce as a software framework to support distributed computing on large data sets on clusters of computers. In the "Map" step, the master node takes the input, partitions it up into smaller sub-problems, and distributes those to worker nodes. The worker node processes that smaller problem, and passes the answer back to its master node. In the "Reduce" step, the master node then takes the answers to all the sub-problems and combines them to obtain the final output.
In a blog post, David J. DeWitt and Michael Stonebraker asserted that MapReduce was not novel -- that the techniques employed by MapReduce are more than 20 years old. Use your [information retrieval] system to either support DeWitt and Stonebraker's case or to argue that a thorough search of the literature does not yield examples that support their case."
http://hcir.info/hcir-2011
Next lecture
Motivation
Map-Reduce
Comparing Map-Reduce and parallel DBMS: Is performance everything?
NoSQL
Literature
The original article:
Jeffrey Dean & Sanjay Ghemawat (2004). MapReduce: Simplified data processing on large clusters. OSDI 2004, 137-150. http://usenix.org/events/osdi04/tech/full_papers/dean/dean.pdf

Benchmarking and controversy:
Andrew Pavlo, Erik Paulson, Alexander Rasin, Daniel J. Abadi, David J. DeWitt, Samuel Madden, & Michael Stonebraker (2009). A comparison of approaches to large-scale data analysis. SIGMOD Conference 2009, 165-178. http://db.csail.mit.edu/pubs/benchmarks-sigmod09.pdf
Michael Stonebraker, Daniel J. Abadi, David J. DeWitt, Samuel Madden, Erik Paulson, Andrew Pavlo, & Alexander Rasin (2010). MapReduce and parallel DBMSs: Friends or foes? Communications of the ACM, 53(1), 64-71. http://database.cs.brown.edu/papers/stonebraker-cacm2010.pdf
Jeffrey Dean & Sanjay Ghemawat (2010). MapReduce: A flexible data processing tool. Communications of the ACM, 53(1), 72-77. http://www.cs.princeton.edu/courses/archive/spr11/cos448/web/docs/week10_reading2.pdf
See also:
http://people.cs.kuleuven.be/~bettina.berendt/teaching/2009-10-1stsemester/adb/Lecture/Session10/truemper.html