MapReduce: distributed computing on large commodity clusters

MapReduce: distributed computing on large commodity clusters. Given at the University College London MediaFutures reading group, http://mediafutures.cs.ucl.ac.uk/reading_group/

Page 1: MapReduce: distributed computing on large commodity clusters

MapReduce

Distributed computing on large commodity clusters

Dr. Spiros Denaxas, Epidemiology & Public Health, UCL, 18 Feb 2010

Page 2: MapReduce: distributed computing on large commodity clusters

Hello

• Introduction

• Who am I

• Structure of presentation

• Distributed computing

• MapReduce examples

• Amazon Web Services

• Live demo

Page 3: MapReduce: distributed computing on large commodity clusters

Data and some more data

• Google processes > 20 PB daily

• Facebook processes 15 TB daily, adding more than 4 TB of new data per day

• Archive.org holds 2 PB of content

• Baidu: 3 TB/week

• The CERN LHC will generate 20 GB/sec

Page 4: MapReduce: distributed computing on large commodity clusters

Data driven applications

• Fraud detection

• Web indexing

• Risk management

• Service personalization

• Spam detection

• Document clustering

Page 5: MapReduce: distributed computing on large commodity clusters

Fruits of some sort

• Consider a very simple example

• fruit_diary.txt: a text file with fruit names, one per line

• "cat fruit_diary.txt | sort | uniq -c"

• cat streams every line of your eating diary

• sort orders all the lines in memory

• uniq -c counts the unique occurrences

• What if the fruit diary were 500 GB? What if it were 500 TB? 500 PB? (A single-machine sketch of the same count follows.)
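At diary scale, the same count is a few lines of Perl. A minimal sketch, assuming fruit_diary.txt holds one fruit name per line (the hypothetical file from above):

my %count;
open my $fh, '<', 'fruit_diary.txt' or die "cannot open diary: $!";
while (my $fruit = <$fh>) {
    chomp $fruit;
    $count{$fruit}++;                      # tally each fruit in memory
}
print "$count{$_} $_\n" for sort keys %count;

Like the shell pipeline, this keeps every distinct key in memory on a single machine, which is exactly what stops working at 500 GB and beyond.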

Page 6: MapReduce: distributed computing on large commodity clusters

Big Data problem

• 1. Iterate over a large number of records

• 2. Extract something of interest

• 3. Shuffle and sort the intermediate results

• 4. Aggregate the intermediate results

• 5. Generate final output

Page 7: MapReduce: distributed computing on large commodity clusters

Big Data problem

• The majority of "big data" players rely heavily on data analysis, be it commercial or scientific.

• Ad hoc investigation, trends, patterns, reporting.

• Results are needed in a timely manner: questions must be answered in hours, not weeks.

Page 8: MapReduce: distributed computing on large commodity clusters

Scalability of single entities

• The single disk/memory model does not scale on a single processing entity.

• Now what? Let's add many disk/memory/processing entities!

• Parallel vs. distributed computing:

• Parallel: multiple CPUs in a single computer

• Distributed: multiple CPUs across multiple computers over the network

Page 9: MapReduce: distributed computing on large commodity clusters

What is wrong with this?

• Worker 1:

void work() {
    var2++;
    var1 = var2 + 5;
}

• Worker 2:

void work() {
    var1++;
    var2 = var1;
}

Both workers read and write the shared variables var1 and var2 with no synchronization, so the final values depend on how the two executions interleave: a data race.

Page 10: MapReduce: distributed computing on large commodity clusters

Parallel computing pitfalls

• "Parallel computing is a black art"

• Very hard to program, expensive, complicated; how does it scale?

• How do we know:

• when a worker has finished?

• when a worker has failed?

• how to synchronize?

Page 11: MapReduce: distributed computing on large commodity clusters

Now what?

• Data needs to be processed on a massive scale, in a distributed fashion, as it does not fit on a single node.

• The solution must be scalable

• The solution must be cheap: low-cost hardware with redundancy

• Don't worry about concurrency; focus on more serious problems.

Page 12: MapReduce: distributed computing on large commodity clusters

Distributed systems

• Fault tolerant

• Highly available

• Recoverable

• Consistent

• Scalable

Page 13: MapReduce: distributed computing on large commodity clusters

Data storage

• The Google File System (GFS) and the Hadoop Distributed File System (HDFS)

• Data must be available to all processing nodes

• Don't move data to workers; move workers to the data:

• Store data on the local disks of nodes

• Start workers on the nodes that hold the data locally

• Minimize metadata by using large blocks

Page 14: MapReduce: distributed computing on large commodity clusters

MapReduce

• A framework for processing data using a cluster of computer nodes

• Created by Google, implemented in C++

• Two steps: map and reduce

• Automatic parallelization, distribution, failover, synchronization, and more

• Clean abstraction layer for programmers

• Processes are isolated

Page 15: MapReduce: distributed computing on large commodity clusters

[Diagram: map tasks consume input pairs (k1, v1) … (k6, v6) and emit intermediate pairs such as (a, 1), (b, 2), (c, 3); shuffle and sort aggregates the values by key, e.g. a → [1, 5], b → [2, 7], c → [2, 3, 6, 8]; reduce tasks then emit the final pairs (r1, s1), (r2, s2), (r3, s3)]

Page 16: MapReduce: distributed computing on large commodity clusters

MapReduce map()

• map(in_key, in_value) → list of (out_key, intermediate_value)

• Data (lines from files, database rows, etc.) are read and emitted as key/value pairs, for example "coconut,1"

• map() produces one or more intermediate values, along with an output key, from the input data. map() calls run in parallel and independently.

Page 17: MapReduce: distributed computing on large commodity clusters

MapReduce map()

Page 18: MapReduce: distributed computing on large commodity clusters

MapReduce reduce()

• A reducer is given a key and all the values for that specific key.

• Once the map phase is over, the intermediate values for a given output key are collapsed into a list.

• reduce() combines the intermediate values into one or more final values for the same output key

• Bottleneck: the reduce() stage cannot start until the map() phase is done.

Page 19: MapReduce: distributed computing on large commodity clusters

MapReduce reduce()

Page 20: MapReduce: distributed computing on large commodity clusters

Big Data problem

• 1. Iterate over a large number of records

• 2. map()

• 3. Shuffle and sort the intermediate results

• 4. reduce()

• 5. Generate final output

Page 21: MapReduce: distributed computing on large commodity clusters

Term Frequency (TF) calculation

• The TF of a given term is the number of times it appears within a document collection.

• "The sea-reach of the Thames stretched before us like the beginning of an interminable waterway. In the offing the sea and the sky were welded together without a joint, and in the luminous space the tanned sails of the barges drifting up with the tide seemed to stand still in red clusters of canvas sharply peaked, with gleams of varnished sprits."

Page 22: MapReduce: distributed computing on large commodity clusters

Term Frequency (TF) calculation

• Stopword elimination (a minimal sketch follows the output below)

• sea reach Thames stretched before like beginning interminable waterway offing sea sky welded together joint luminous space tanned sails barges drifting tide seemed stand still red clusters canvas sharply peaked gleams varnished sprits
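A minimal Perl sketch of the elimination step, assuming a small hand-picked stopword list (real systems use much larger ones):

my $passage = "The sea-reach of the Thames stretched before us ...";
my %stop    = map { $_ => 1 } qw(the of a an and in us were to with up);
my @kept    = grep { length && !$stop{lc $_} } split /\W+/, $passage;
print "@kept\n";    # sea reach Thames stretched before ...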

Page 23: MapReduce: distributed computing on large commodity clusters

Term Frequency (TF) calculation

• A generic map() function

• Input: a single line

• Output: <word, frequency> pairs

# Emit a "<word>,1" pair for every word on the line
sub emit_words {
    my ($line) = @_;
    foreach my $word (split / /, $line) {
        print "$word,1\n";
    }
}

Page 24: MapReduce: distributed computing on large commodity clusters

Term Frequency (TF) calculation

• sea reach Thames stretched before like beginning

• Output would look like:

sea,1
reach,1
Thames,1
stretched,1
before,1
like,1
beginning,1

Page 25: MapReduce: distributed computing on large commodity clusters

Term Frequency (TF) calculation

• A generic reduce() function

• Sums up the values, which are the occurrences of each word.

• Input: a series of <word, frequency> pairs

• Output: a series of <word, sum> pairs

# Sum the frequency of each incoming "<word>,<frequency>" pair,
# then print one "<word>,<sum>" line per word
my %counts;
while (my $pair = <STDIN>) {
    chomp $pair;
    my ($word, $frequency) = split /,/, $pair;
    $counts{$word} += $frequency;    # add the count, don't just increment
}
print "$_,$counts{$_}\n" for sort keys %counts;

Page 26: MapReduce: distributed computing on large commodity clusters

Term Frequency (TF) calculation

• Output from the reduce stage would look like:

sea,2
reach,1
Thames,1
stretched,1
before,1
like,1
beginning,1
[...]
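On a single machine, the two stages chain together exactly like the earlier fruit pipeline. Assuming the map and reduce functions above are wrapped as the hypothetical scripts mapper.pl and reducer.pl, "cat input.txt | perl mapper.pl | sort | perl reducer.pl" reproduces the whole flow, with the local sort standing in for MapReduce's shuffle-and-sort phase; Hadoop Streaming can run the same pair of scripts across a cluster.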

Page 27: MapReduce: distributed computing on large commodity clusters

MapReduce indexing

• Map over all documents

• Emit the term as key, (docno, tf) as value

• Emit other information as necessary (e.g., term position)

• Sort/shuffle: group by term

• Reduce: gather and sort the postings (e.g., by docno or tf)

• Write to disk

• MapReduce does all the heavy lifting! (A sketch of the mapper follows.)
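A minimal sketch of the indexing mapper, assuming each input line carries one document as "docno<TAB>text" (a hypothetical input format):

while (my $line = <STDIN>) {
    chomp $line;
    my ($docno, $text) = split /\t/, $line, 2;
    my %tf;                                     # per-document term frequency
    $tf{lc $_}++ for grep { length } split /\W+/, $text;
    print "$_\t$docno,$tf{$_}\n" for sort keys %tf;
}

The shuffle then groups the (docno, tf) values by term, so the reducer only has to sort each group and write the postings list to disk.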

Page 28: MapReduce: distributed computing on large commodity clusters

Inverted index (Boolean)

[Diagram: Doc 1 "one fish, two fish", Doc 2 "red fish, blue fish", Doc 3 "cat in the hat", and Doc 4 "green eggs and ham" yield Boolean postings lists: blue → 2; cat → 3; egg → 4; fish → 1, 2; green → 4; ham → 4; hat → 3; one → 1; red → 2; two → 1]

Page 29: MapReduce: distributed computing on large commodity clusters

Inverted index (ranked)

[Diagram: the same four documents yield postings of (docno, tf) pairs: blue → (2, 1); cat → (3, 1); egg → (4, 1); fish → (1, 2), (2, 2); green → (4, 1); ham → (4, 1); hat → (3, 1); one → (1, 1); red → (2, 1); two → (1, 1)]

Page 30: MapReduce: distributed computing on large commodity clusters

[Diagram: index construction as a MapReduce job — map over Doc 1 "one fish, two fish", Doc 2 "red fish, blue fish", and Doc 3 "cat in the hat", emitting (term, (docno, tf)) pairs such as fish → (1, 2); shuffle and sort aggregates the values by key, e.g. fish → (1, 2), (2, 2); reduce writes out the postings lists]

Page 31: MapReduce: distributed computing on large commodity clusters

Inverted Index (positional)

[Diagram: the same documents with positional postings (docno, tf, [positions]): blue → (2, 1, [3]); cat → (3, 1, [1]); egg → (4, 1, [2]); fish → (1, 2, [2,4]), (2, 2, [2,4]); green → (4, 1, [1]); ham → (4, 1, [3]); hat → (3, 1, [2]); one → (1, 1, [1]); red → (2, 1, [1]); two → (1, 1, [3])]

Page 32: MapReduce: distributed computing on large commodity clusters

[Diagram: the same index construction with positions carried through — map emits (term, (docno, tf, [positions])) pairs, e.g. fish → (1, 2, [2,4]); shuffle and sort aggregates the values by key; reduce writes out the positional postings lists]

Page 33: MapReduce: distributed computing on large commodity clusters

PageRank

• Named after Larry Page at Google

• Essentially a link analysis algorithm

• Measures the relative importance of a web page

• Algorithmically assesses and quantifies that "importance"

Page 34: MapReduce: distributed computing on large commodity clusters

PageRank

• How can we define how important page X is?

• One solution: quantify the incoming links from other pages to that page

• Surely, more incoming links would mean a more authoritative status?

Page 35: MapReduce: distributed computing on large commodity clusters

PageRank

• Imagine your typical web surfer browsing page X

• Only two things can happen:

• A) A random link on X is clicked (probability 1 − α)

• B) The user teleports away (probability α)

Page 36: MapReduce: distributed computing on large commodity clusters

PageRank defined

Given page x with in-bound links t1 … tn, where:

• C(t) is the out-degree of t

• α is the probability of a random jump

• N is the total number of nodes in the graph

PR(x) = \alpha \frac{1}{N} + (1 - \alpha) \sum_{i=1}^{n} \frac{PR(t_i)}{C(t_i)}
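For example, with N = 4 nodes, α = 0.2, and a single in-bound link t with PR(t) = 0.25 and out-degree C(t) = 2 (hypothetical values): PR(x) = 0.2 × (1/4) + 0.8 × (0.25/2) = 0.05 + 0.10 = 0.15.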

Page 37: MapReduce: distributed computing on large commodity clusters

PageRank defined

[Diagram: pages t1, t2, …, tn each link in to page X]

Page 38: MapReduce: distributed computing on large commodity clusters

Computing PageRank

• Properties of PageRank:

• Can be computed iteratively

• Effects of each iteration are local

• Sketch of the algorithm:

• Start with seed PRi values

• Each page distributes its PRi "credit" to all the pages it links to

• Each target page adds up the "credit" from its multiple in-bound links to compute PRi+1

• Iterate until the values converge (100 times?)

Page 39: MapReduce: distributed computing on large commodity clusters

Map: distribute PageRank "credit" to link targets

Reduce: gather up PageRank "credit" from multiple sources to compute the new PageRank value

Iterate until convergence
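A minimal sketch of one iteration, assuming each record holds a node, its current rank, and its adjacency list; the α/N random-jump term and the re-emission of the graph structure are left out for brevity:

# map(): split a node's current rank evenly across its out-links
sub pr_map {
    my ($node, $rank, @out_links) = @_;
    my $share = $rank / @out_links;            # C(node) = out-degree
    return map { [ $_, $share ] } @out_links;  # (target, credit) pairs
}

# reduce(): sum the credit arriving at one target node
sub pr_reduce {
    my ($node, @credits) = @_;
    my $sum = 0;
    $sum += $_ for @credits;
    return [ $node, $sum ];                    # rank before the jump term
}

A real job would also emit each node's adjacency list from pr_map() so the graph survives the iteration, then fold in the α/N term when computing the final rank.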

Page 40: MapReduce: distributed computing on large commodity clusters

Graph algorithms in MapReduce

• General approach:

• Store graphs as adjacency lists

• Each map task receives a node and its adjacency list

• The map task computes some function of the link structure and emits a value, with the target as the key

• The reduce task collects values by key (target node) and aggregates

• Perform multiple MapReduce iterations until some termination condition is met

Page 41: MapReduce: distributed computing on large commodity clusters

Amazon Web Services (AWS)

• "A collection of remote computing services offered over the Internet by Amazon"

• Accessed over HTTP using Representational State Transfer (REST) or the Simple Object Access Protocol (SOAP)

• Cheap

• Scalable

• API implementations

Page 42: MapReduce: distributed computing on large commodity clusters

Amazon Simple Storage Service (S3)

• "An online persistent data storage service offered by Amazon Web Services"

• Charged by data stored and transferred

• Block-based filesystem

• Data is organized in buckets

• Buckets are accessed using HTTP REST

Page 43: MapReduce: distributed computing on large commodity clusters

Amazon Simple Storage Service (S3)

• http://<bucket>.s3.amazonaws.com/<key>

• Like HDFS, data is replicated across nodes, enabling the storage of very large files

• Several big players, such as Twitter and HP, use S3

Page 44: MapReduce: distributed computing on large commodity clusters

Amazon Elastic Compute Cloud (EC2)

• Scalable deployment of virtual servers for large-scale data processing.

• Billed by the hour of processing and the magnitude of resources needed.

• No persistent storage (that's what S3 is for!)

• Automatic scalability

Page 45: MapReduce: distributed computing on large commodity clusters

Amazon Elastic Compute Cloud (EC2)

• Amazon Machine Images (AMIs):

• Sun, Oracle, IBM

• Windows, Linux

• Several sizes for all tastes:

• From 1.5 to 65 GB of RAM

• From 1 core to 2 quad cores

Page 46: MapReduce: distributed computing on large commodity clusters

Amazon Elastic MapReduce

• Hadoop-ready virtual servers on EC2

• HDFS-esque input from S3

• Managed through the Amazon Web Services Management Console

• Hadoop in < 5 minutes!

Page 47: MapReduce: distributed computing on large commodity clusters

Live Demo

• Word counting over Project Gutenberg

• 1,500 books across many languages

• Approx. half a million lines of text

• 1,500 files stored on S3

• 8 EC2 instances deployed

• 14 minutes from start to finish

• Less than 2 USD

Page 48: MapReduce: distributed computing on large commodity clusters

Thank you