MapReduce: distributed computing on large commodity clusters

MapReduce: distributed computing on large commodity clusters. Given at the University College London MediaFutures reading group, http://mediafutures.cs.ucl.ac.uk/reading_group/

Page 1: MapReduce: distributed computing on large commodity clusters

MapReduce

Distributed computing on large commodity clusters

Dr. Spiros Denaxas, Epidemiology & Public Health, UCL, 18 Feb 2010

Page 2: MapReduce: distributed computing on large commodity clusters

Hello

• Introduction

• Who am I

• Structure of presentation

• Distributed computing

• MapReduce examples

• Amazon Web Services

• Live demo

Page 3: MapReduce: distributed computing on large commodity clusters

Data and some more data

• Google processes > 20 PB daily

• Facebook processes 15 TB daily, adding more than 4 TB of new data per day

• Archive.org holds 2 PB of content

• Baidu: 3 TB/week

• The CERN LHC will generate 20 GB/sec

Page 4: MapReduce: distributed computing on large commodity clusters

Data driven applications

• Fraud detection

• Web indexing

• Risk management

• Service personalization

• Spam detection

• Document clustering

Page 5: MapReduce: distributed computing on large commodity clusters

Fruits of some sort

• Consider a very simple example

• fruit_diary.txt: a text file with fruit names, one per line

• "cat fruit_diary.txt | sort | uniq -c"

• cat streams every line of your eating diary

• sort orders all the lines in memory

• uniq -c counts the unique occurrences

• What if the fruit diary were 500 GB? What if it were 500 TB? 500 PB? (A single-machine sketch of the same count follows.)
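At diary scale, the same count is a few lines of Perl. A minimal sketch, assuming fruit_diary.txt holds one fruit name per line (the hypothetical file from above):

my %count;
open my $fh, '<', 'fruit_diary.txt' or die "cannot open diary: $!";
while (my $fruit = <$fh>) {
    chomp $fruit;
    $count{$fruit}++;                      # tally each fruit in memory
}
print "$count{$_} $_\n" for sort keys %count;

Like the shell pipeline, this keeps every distinct key in memory on a single machine, which is exactly what stops working at 500 GB and beyond.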

Page 6: MapReduce: distributed computing on large commodity clusters

Big Data problem

• 1. Iterate over a large number of records

• 2. Extract something of interest

• 3. Shuffle and sort the intermediate results

• 4. Aggregate the intermediate results

• 5. Generate final output

Page 7: MapReduce: distributed computing on large commodity clusters

Big Data problem

• The majority of "big data" players rely heavily on data analysis, be it commercial or scientific.

• Ad hoc investigation, trends, patterns, reporting.

• Results are needed in a timely manner: questions must be answered in hours, not weeks.

Page 8: MapReduce: distributed computing on large commodity clusters

Scalability of single entities

• The single disk/memory model does not scale on a single processing entity.

• Now what? Let's add many disk/memory/processing entities!

• Parallel vs. distributed computing:

• Parallel: multiple CPUs in a single computer

• Distributed: multiple CPUs across multiple computers over the network

Page 9: MapReduce: distributed computing on large commodity clusters

What is wrong with this?

• Worker 1:

void work() {
    var2++;
    var1 = var2 + 5;
}

• Worker 2:

void work() {
    var1++;
    var2 = var1;
}

Both workers read and write the shared variables var1 and var2 with no synchronization, so the final values depend on how the two executions interleave: a data race.

Page 10: MapReduce: distributed computing on large commodity clusters

Parallel computing pitfalls

• "Parallel computing is a black art"

• Very hard to program, expensive, complicated; how does it scale?

• How do we know:

• when a worker has finished?

• when a worker has failed?

• how to synchronize?

Page 11: MapReduce: distributed computing on large commodity clusters

Now what?

• Data needs to be processed on a massive scale, in a distributed fashion, as it does not fit on a single node.

• The solution must be scalable

• The solution must be cheap: low-cost hardware with redundancy

• Don't worry about concurrency; focus on more serious problems.

Page 12: MapReduce: distributed computing on large commodity clusters

Distributed systems

• Fault tolerant

• Highly available

• Recoverable

• Consistent

• Scalable

Page 13: MapReduce: distributed computing on large commodity clusters

Data storage

• The Google File System (GFS) and the Hadoop Distributed File System (HDFS)

• Data must be available to all processing nodes

• Don't move data to workers; move workers to the data:

• Store data on the local disks of nodes

• Start workers on the nodes that hold the data locally

• Minimize metadata by using large blocks

Page 14: MapReduce: distributed computing on large commodity clusters

MapReduce

• A framework for processing data using a cluster of computer nodes

• Created by Google, implemented in C++

• Two steps: map and reduce

• Automatic parallelization, distribution, failover, synchronization, and more

• Clean abstraction layer for programmers

• Processes are isolated

Page 15: MapReduce: distributed computing on large commodity clusters

[Diagram: map tasks consume input pairs (k1, v1) … (k6, v6) and emit intermediate pairs such as (a, 1), (b, 2), (c, 3); shuffle and sort aggregates the values by key, e.g. a → [1, 5], b → [2, 7], c → [2, 3, 6, 8]; reduce tasks then emit the final pairs (r1, s1), (r2, s2), (r3, s3)]

Page 16: MapReduce: distributed computing on large commodity clusters

MapReduce map()

• map(in_key, in_value) → list of (out_key, intermediate_value)

• Data (lines from files, database rows, etc.) are read and emitted as key/value pairs, for example "coconut,1"

• map() produces one or more intermediate values, along with an output key, from the input data. map() calls run in parallel and independently.

Page 17: MapReduce: distributed computing on large commodity clusters

MapReduce map()

Page 18: MapReduce: distributed computing on large commodity clusters

MapReduce reduce()

• A reducer is given a key and all the values for that specific key.

• Once the map phase is over, the intermediate values for a given output key are collapsed into a list.

• reduce() combines the intermediate values into one or more final values for the same output key

• Bottleneck: the reduce() stage cannot start until the map() phase is done.

Page 19: MapReduce: distributed computing on large commodity clusters

MapReduce reduce()

Page 20: MapReduce: distributed computing on large commodity clusters

Big Data problem

• 1. Iterate over a large number of records

• 2. map()

• 3. Shuffle and sort the intermediate results

• 4. reduce()

• 5. Generate final output

Page 21: MapReduce: distributed computing on large commodity clusters

Term Frequency (TF) calculation

• The TF of a given term is the number of times it appears within a document collection.

• "The sea-reach of the Thames stretched before us like the beginning of an interminable waterway. In the offing the sea and the sky were welded together without a joint, and in the luminous space the tanned sails of the barges drifting up with the tide seemed to stand still in red clusters of canvas sharply peaked, with gleams of varnished sprits."

Page 22: MapReduce: distributed computing on large commodity clusters

Term Frequency (TF) calculation

• Stopword elimination (a minimal sketch follows the output below)

• sea reach Thames stretched before like beginning interminable waterway offing sea sky welded together joint luminous space tanned sails barges drifting tide seemed stand still red clusters canvas sharply peaked gleams varnished sprits
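A minimal Perl sketch of the elimination step, assuming a small hand-picked stopword list (real systems use much larger ones):

my $passage = "The sea-reach of the Thames stretched before us ...";
my %stop    = map { $_ => 1 } qw(the of a an and in us were to with up);
my @kept    = grep { length && !$stop{lc $_} } split /\W+/, $passage;
print "@kept\n";    # sea reach Thames stretched before ...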

Page 23: MapReduce: distributed computing on large commodity clusters

Term Frequency (TF) calculation

• A generic map() function

• Input: a single line

• Output: <word, frequency> pairs

# Emit a "<word>,1" pair for every word on the line
sub emit_words {
    my ($line) = @_;
    foreach my $word (split / /, $line) {
        print "$word,1\n";
    }
}

Page 24: MapReduce: distributed computing on large commodity clusters

Term Frequency (TF) calculation

• sea reach Thames stretched before like beginning

• Output would look like:

sea,1
reach,1
Thames,1
stretched,1
before,1
like,1
beginning,1

Page 25: MapReduce: distributed computing on large commodity clusters

Term Frequency (TF) calculation

• A generic reduce() function

• Sums up the values, which are the occurrences of each word.

• Input: a series of <word, frequency> pairs

• Output: a series of <word, sum> pairs

# Sum the frequency of each incoming "<word>,<frequency>" pair,
# then print one "<word>,<sum>" line per word
my %counts;
while (my $pair = <STDIN>) {
    chomp $pair;
    my ($word, $frequency) = split /,/, $pair;
    $counts{$word} += $frequency;    # add the count, don't just increment
}
print "$_,$counts{$_}\n" for sort keys %counts;

Page 26: MapReduce: distributed computing on large commodity clusters

Term Frequency (TF) calculation

• Output from the reduce stage would look like:

sea,2
reach,1
Thames,1
stretched,1
before,1
like,1
beginning,1
[...]
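On a single machine, the two stages chain together exactly like the earlier fruit pipeline. Assuming the map and reduce functions above are wrapped as the hypothetical scripts mapper.pl and reducer.pl, "cat input.txt | perl mapper.pl | sort | perl reducer.pl" reproduces the whole flow, with the local sort standing in for MapReduce's shuffle-and-sort phase; Hadoop Streaming can run the same pair of scripts across a cluster.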

Page 27: MapReduce: distributed computing on large commodity clusters

MapReduce indexing

• Map over all documents

• Emit the term as key, (docno, tf) as value

• Emit other information as necessary (e.g., term position)

• Sort/shuffle: group by term

• Reduce: gather and sort the postings (e.g., by docno or tf)

• Write to disk

• MapReduce does all the heavy lifting! (A sketch of the mapper follows.)
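A minimal sketch of the indexing mapper, assuming each input line carries one document as "docno<TAB>text" (a hypothetical input format):

while (my $line = <STDIN>) {
    chomp $line;
    my ($docno, $text) = split /\t/, $line, 2;
    my %tf;                                     # per-document term frequency
    $tf{lc $_}++ for grep { length } split /\W+/, $text;
    print "$_\t$docno,$tf{$_}\n" for sort keys %tf;
}

The shuffle then groups the (docno, tf) values by term, so the reducer only has to sort each group and write the postings list to disk.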

Page 28: MapReduce: distributed computing on large commodity clusters

Inverted index (Boolean)

[Diagram: Doc 1 "one fish, two fish", Doc 2 "red fish, blue fish", Doc 3 "cat in the hat", and Doc 4 "green eggs and ham" yield Boolean postings lists: blue → 2; cat → 3; egg → 4; fish → 1, 2; green → 4; ham → 4; hat → 3; one → 1; red → 2; two → 1]

Page 29: MapReduce: distributed computing on large commodity clusters

Inverted index (ranked)

[Diagram: the same four documents yield postings of (docno, tf) pairs: blue → (2, 1); cat → (3, 1); egg → (4, 1); fish → (1, 2), (2, 2); green → (4, 1); ham → (4, 1); hat → (3, 1); one → (1, 1); red → (2, 1); two → (1, 1)]

Page 30: MapReduce: distributed computing on large commodity clusters

[Diagram: index construction as a MapReduce job — map over Doc 1 "one fish, two fish", Doc 2 "red fish, blue fish", and Doc 3 "cat in the hat", emitting (term, (docno, tf)) pairs such as fish → (1, 2); shuffle and sort aggregates the values by key, e.g. fish → (1, 2), (2, 2); reduce writes out the postings lists]

Page 31: MapReduce: distributed computing on large commodity clusters

Inverted Index (positional)

[Diagram: the same documents with positional postings (docno, tf, [positions]): blue → (2, 1, [3]); cat → (3, 1, [1]); egg → (4, 1, [2]); fish → (1, 2, [2,4]), (2, 2, [2,4]); green → (4, 1, [1]); ham → (4, 1, [3]); hat → (3, 1, [2]); one → (1, 1, [1]); red → (2, 1, [1]); two → (1, 1, [3])]

Page 32: MapReduce: distributed computing on large commodity clusters

[Diagram: the same index construction with positions carried through — map emits (term, (docno, tf, [positions])) pairs, e.g. fish → (1, 2, [2,4]); shuffle and sort aggregates the values by key; reduce writes out the positional postings lists]

Page 33: MapReduce: distributed computing on large commodity clusters

PageRank

• Named after Larry Page at Google

• Essentially a link analysis algorithm

• Measures the relative importance of a web page

• Algorithmically assesses and quantifies that "importance"

Page 34: MapReduce: distributed computing on large commodity clusters

PageRank

• How can we define how important page X is?

• One solution: quantify the incoming links from other pages to that page

• Surely, more incoming links would mean a more authoritative status?

Page 35: MapReduce: distributed computing on large commodity clusters

PageRank

• Imagine your typical web surfer browsing page X

• Only two things can happen:

• A) A random link on X is clicked (probability 1 − α)

• B) The user teleports away (probability α)

Page 36: MapReduce: distributed computing on large commodity clusters

PageRank defined

Given page x with in-bound links t1 … tn, where:

• C(t) is the out-degree of t

• α is the probability of a random jump

• N is the total number of nodes in the graph

PR(x) = \alpha \frac{1}{N} + (1 - \alpha) \sum_{i=1}^{n} \frac{PR(t_i)}{C(t_i)}
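For example, with N = 4 nodes, α = 0.2, and a single in-bound link t with PR(t) = 0.25 and out-degree C(t) = 2 (hypothetical values): PR(x) = 0.2 × (1/4) + 0.8 × (0.25/2) = 0.05 + 0.10 = 0.15.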

Page 37: MapReduce: distributed computing on large commodity clusters

PageRank defined

[Diagram: pages t1, t2, …, tn each link in to page X]

Page 38: MapReduce: distributed computing on large commodity clusters

Computing PageRank

• Properties of PageRank:

• Can be computed iteratively

• Effects of each iteration are local

• Sketch of the algorithm:

• Start with seed PRi values

• Each page distributes its PRi "credit" to all the pages it links to

• Each target page adds up the "credit" from its multiple in-bound links to compute PRi+1

• Iterate until the values converge (100 times?)

Page 39: MapReduce: distributed computing on large commodity clusters

Map: distribute PageRank "credit" to link targets

Reduce: gather up PageRank "credit" from multiple sources to compute the new PageRank value

Iterate until convergence
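A minimal sketch of one iteration, assuming each record holds a node, its current rank, and its adjacency list; the α/N random-jump term and the re-emission of the graph structure are left out for brevity:

# map(): split a node's current rank evenly across its out-links
sub pr_map {
    my ($node, $rank, @out_links) = @_;
    my $share = $rank / @out_links;            # C(node) = out-degree
    return map { [ $_, $share ] } @out_links;  # (target, credit) pairs
}

# reduce(): sum the credit arriving at one target node
sub pr_reduce {
    my ($node, @credits) = @_;
    my $sum = 0;
    $sum += $_ for @credits;
    return [ $node, $sum ];                    # rank before the jump term
}

A real job would also emit each node's adjacency list from pr_map() so the graph survives the iteration, then fold in the α/N term when computing the final rank.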

Page 40: MapReduce: distributed computing on large commodity clusters

Graph algorithms in MapReduce

• General approach:

• Store graphs as adjacency lists

• Each map task receives a node and its adjacency list

• The map task computes some function of the link structure and emits a value, with the target as the key

• The reduce task collects values by key (target node) and aggregates

• Perform multiple MapReduce iterations until some termination condition is met

Page 41: MapReduce: distributed computing on large commodity clusters

Amazon Web Services (AWS)

• "A collection of remote computing services offered over the Internet by Amazon"

• Accessed over HTTP using Representational State Transfer (REST) or the Simple Object Access Protocol (SOAP)

• Cheap

• Scalable

• API implementations

Page 42: MapReduce: distributed computing on large commodity clusters

Amazon Simple Storage Service (S3)

• "An online persistent data storage service offered by Amazon Web Services"

• Charged by data stored and transferred

• Block-based filesystem

• Data is organized in buckets

• Buckets are accessed using HTTP REST

Page 43: MapReduce: distributed computing on large commodity clusters

Amazon Simple Storage Service (S3)

• http://<bucket>.s3.amazonaws.com/<key>

• Like HDFS, data is replicated across nodes, enabling the storage of very large files

• Several big players, such as Twitter and HP, use S3

Page 44: MapReduce: distributed computing on large commodity clusters

Amazon Elastic Compute Cloud (EC2)

• Scalable deployment of virtual servers for large-scale data processing.

• Billed by the hour of processing and the magnitude of resources needed.

• No persistent storage (that's what S3 is for!)

• Automatic scalability

Page 45: MapReduce: distributed computing on large commodity clusters

Amazon Elastic Compute Cloud (EC2)

• Amazon Machine Images (AMIs):

• Sun, Oracle, IBM

• Windows, Linux

• Several sizes for all tastes:

• From 1.5 to 65 GB of RAM

• From 1 core to 2 quad cores

Page 46: MapReduce: distributed computing on large commodity clusters

Amazon Elastic MapReduce

• Hadoop-ready virtual servers on EC2

• HDFS-esque input from S3

• Managed through the Amazon Web Services Management Console

• Hadoop in < 5 minutes!

Page 47: MapReduce: distributed computing on large commodity clusters

Live Demo

• Word counting over Project Gutenberg

• 1,500 books across many languages

• Approx. half a million lines of text

• 1,500 files stored on S3

• 8 EC2 instances deployed

• 14 minutes from start to finish

• Less than 2 USD

Page 48: MapReduce: distributed computing on large commodity clusters

Thank you