
Page 1

Large-scale file systems and Map-Reduce

Single-node architecture

[Diagram: a single node with CPU, memory, and disk]

Google example:

• 20+ billion web pages x 20KB = 400+ terabytes
• 1 computer reads 30-35 MB/sec from disk
• ~4 months just to read the web
• ~1,000 hard drives to store the web
• Takes even more to do something useful with the data
• New standard architecture is emerging:
  – Cluster of commodity Linux nodes
  – Gigabit ethernet interconnect

Slide based on www.mmds.com
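A quick back-of-envelope check of these figures (plain Python, using only the numbers on the slide):

```python
# Sanity-check the slide's arithmetic.
pages = 20e9          # 20+ billion web pages
page_size = 20e3      # ~20 KB per page
total_bytes = pages * page_size
print(total_bytes / 1e12)         # 400.0 terabytes

read_rate = 35e6                  # one disk reads ~30-35 MB/sec
days = total_bytes / read_rate / 86400
print(days)                       # ~132 days, i.e. roughly 4 months
```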

Page 2

Distributed File Systems

• Files are very large, read/append.
• They are divided into chunks.
  – Typically 64MB to a chunk.
• Chunks are replicated at several compute nodes.
• A master (possibly replicated) keeps track of the locations of all chunks.

Slide based on www.mmds.com

Page 3

Commodity clusters: compute nodes
• Organized into racks.
• Intra-rack connection typically gigabit speed.
• Inter-rack connection faster by a small factor.
• Recall that chunks are replicated.

Some implementations:
• GFS (Google File System, proprietary). In Aug 2006 Google had ~450,000 machines.
• HDFS (Hadoop Distributed File System, open source).
• CloudStore (Kosmos File System, open source).

Slide based on www.mmds.com

Page 4

• Problems with large-scale computing on commodity hardware
• Challenges:
  – How do you distribute computation?
  – How can we make it easy to write distributed programs?
  – Machines fail:
    • One server may stay up 3 years (1,000 days)
    • If you have 1,000 servers, expect to lose one per day
    • People estimated Google had ~1M machines in 2011
      – 1,000 machines fail every day!

Slide based on www.mmds.com

Page 5

Slide based on www.mmds.com

• Issue: Copying data over a network takes time
• Idea:
  – Bring computation close to the data
  – Store files multiple times for reliability
• Map-reduce addresses these problems
  – Google's computational/data manipulation model
  – Elegant way to work with big data
  – Storage infrastructure: a file system
    • Google: GFS. Hadoop: HDFS
  – Programming model: Map-Reduce

Page 6

Slide based on www.mmds.com

• Problem:
  – If nodes fail, how to store data persistently?
• Answer:
  – Distributed File System:
    • Provides a global file namespace
    • Google GFS; Hadoop HDFS
• Typical usage pattern:
  – Huge files (100s of GB to TB)
  – Data is rarely updated in place
  – Reads and appends are common

Page 7

[Diagram: racks of compute nodes; a file is divided into chunks, and the chunks are replicated across nodes.]

Slide based on www.mmds.com

Page 8

Replication: 3-way replication of files, with copies on different racks.

Slide based on www.mmds.com

Page 9

Map-Reduce
• You write two functions, Map and Reduce.
  – They each have a special form to be explained.
• The system (e.g., Hadoop) creates a large number of tasks for each function.
  – Work is divided among tasks in a precise way.

Slide based on www.mmds.com

Page 10

Map-Reduce Algorithms
• Map tasks convert inputs to key-value pairs.
  – "Keys" are not necessarily unique.
• Outputs of Map tasks are sorted by key, and each key is assigned to one Reduce task.
• Reduce tasks combine the values associated with each key.

Slide based on www.mmds.com

Page 11

Simple map-reduce example: Word Count
• We have a large file of words, one word to a line
• Count the number of times each distinct word appears in the file
• Sample application: analyze web server logs to find popular URLs
• Different scenarios:
  – Case 1: Entire file fits in main memory
  – Case 2: File too large for main memory, but all <word, count> pairs fit in main memory
  – Case 3: File on disk, too many distinct words to fit in memory

Slide based on www.mmds.com

Page 12

Word Count
• Map task: For each word, e.g. CAT, output (CAT, 1)
  – Total output: (w1,1), (w1,1), ..., (w1,1), (w2,1), (w2,1), ..., (w2,1), ...
  – Hash each (w,1) to bucket h(w) in [0, r-1] in a local intermediate file; r is the number of reducers
• Master: Group by key: (w1,[1,1,...,1]), (w2,[1,1,...,1]), ...; push group (w,[1,1,...,1]) to reducer h(w)
• Reduce task: Reducer h(w)
  – Read: (w,[1,1,...,1])
  – Aggregate: each (w,[1,1,...,1]) into (w,sum)
  – Output: (w,sum) into a common output file
• Since addition is commutative and associative, the map task could instead have sent partial sums: (w1,sum1), (w2,sum2), ...
  – The reduce task would then receive (wi,sumi,1), (wi,sumi,2), ..., (wj,sumj,1), (wj,sumj,2), ... and output (wi,sumi), (wj,sumj), ... (see the sketch below)

Slide based on www.mmds.com
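A minimal single-machine sketch of this dataflow in Python. The names map_task, shuffle, and reduce_task are illustrative, not Hadoop's API, and the hash bucketing stands in for the distributed shuffle:

```python
from collections import defaultdict

def map_task(lines):
    # For each word, e.g. CAT, output (CAT, 1).
    for line in lines:
        for word in line.split():
            yield (word, 1)

def shuffle(pairs, r):
    # Hash each (w, 1) to bucket h(w) in [0, r-1]; r = number of reducers.
    buckets = [defaultdict(list) for _ in range(r)]
    for w, one in pairs:
        buckets[hash(w) % r][w].append(one)
    return buckets

def reduce_task(bucket):
    # Aggregate each (w, [1, 1, ..., 1]) into (w, sum).
    for w, ones in bucket.items():
        yield (w, sum(ones))

for bucket in shuffle(map_task(["the cat sat", "the cat"]), r=2):
    print(sorted(reduce_task(bucket)))   # word counts, split across 2 reducers
```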

Page 13

Partition Function

• Inputs to map tasks are created by contiguous splits of the input file
• For reduce, we need to ensure that records with the same intermediate key end up at the same worker
• The system uses a default partition function, e.g., hash(key) mod R
• Sometimes useful to override it
  – E.g., hash(hostname(URL)) mod R ensures URLs from the same host end up in the same output file (see the sketch below)

Slide based on www.mmds.com
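A sketch of the default and overridden partition functions. urlparse comes from Python's standard library; R and the function names are assumptions for illustration, not a real Hadoop interface:

```python
from urllib.parse import urlparse

R = 8  # number of reduce tasks (assumed)

def default_partition(key):
    # System default: hash(key) mod R.
    return hash(key) % R

def host_partition(url):
    # Override: hash(hostname(URL)) mod R, so all URLs from one
    # host land at the same reducer, hence the same output file.
    return hash(urlparse(url).hostname) % R

print(host_partition("http://example.com/a") ==
      host_partition("http://example.com/b"))   # True
```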

Page 14

Coordination

• Master data structures
  – Task status: (idle, in-progress, completed)
  – Idle tasks get scheduled as workers become available
  – When a map task completes, it sends the master the location and sizes of its R intermediate files, one for each reducer
  – Master pushes this info to reducers
• Master pings workers periodically to detect failures

Slide based on www.mmds.com

Page 15

Data flow
• Input and final output are stored on a distributed file system
  – The scheduler tries to schedule map tasks "close" to the physical storage location of their input data
• Intermediate results are stored on the local FS of map and reduce workers
• Output is often the input to another map-reduce task


Slide based on www.mmds.com

Page 16

Failures

• Map worker failure
  – Map tasks completed or in-progress at the worker are reset to idle (the result sits locally at the worker)
  – Reduce workers are notified when a task is rescheduled on another worker
• Reduce worker failure
  – Only in-progress tasks are reset to idle
• Master failure
  – The map-reduce task is aborted and the client is notified

Slide based on www.mmds.com

Page 17

How many Map and Reduce jobs?

• M map tasks, R reduce tasks
• Rule of thumb:
  – Make M and R much larger than the number of nodes in the cluster
  – One DFS chunk per map task is common
  – Improves dynamic load balancing and speeds recovery from worker failure
• Usually R is smaller than M, because output is spread across R files

Slide based on www.mmds.com

Page 18

Relational operators with map-reduce

Selection σ_C(R)

Map task: If C(t) is true, output the pair (t,t)

Reduce task: With input (t,t), output t

Selection is not really suitable for map-reduce; everything could have been done in the map task.
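A sketch of this selection in Python, with the predicate C passed as a function (names are illustrative):

```python
# Selection sigma_C(R): map emits (t, t) for tuples satisfying C;
# the reduce step merely echoes the tuple, as the slide notes.
def selection_map(tuples, C):
    for t in tuples:
        if C(t):
            yield (t, t)            # output pair (t, t)

def selection_reduce(key, values):
    return key                      # with input (t, t), output t

R = [(1, 'a'), (2, 'b'), (3, 'c')]
print([selection_reduce(k, v) for k, v in selection_map(R, lambda t: t[0] >= 2)])
# [(2, 'b'), (3, 'c')]
```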

Page 19

Relational operators with map-reduce

Projection π_L(R)

Map task: Let t' be the projection of t. Output the pair (t',t')

Reduce task: With input (t',[t',t',...,t']), output t'

Here the duplicate elimination is done by the reduce task.
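A sketch of this projection in plain Python, grouping by key to mimic the shuffle (illustrative names; L is the list of column indices to keep):

```python
from itertools import groupby

def projection_map(tuples, L):
    for t in tuples:
        tp = tuple(t[i] for i in L)     # t' = projection of t onto columns L
        yield (tp, tp)

def projection_reduce(key, values):
    return key                          # (t', [t', ..., t']) -> one t'

R = [(1, 'a'), (1, 'b'), (2, 'a')]
pairs = sorted(projection_map(R, L=[1]))
print([projection_reduce(k, list(g)) for k, g in groupby(pairs, key=lambda p: p[0])])
# [('a',), ('b',)] -- duplicates eliminated in the reduce
```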

Page 20

Relational operators with map-reduce

• Union R ∪ S
  Map task: for each tuple t of the chunk of R or S, output (t,t)
  Reduce task: input is (t,[t]) or (t,[t,t]); output t
• Intersection R ∩ S
  Map task: for each tuple t of the chunk, output (t,t)
  Reduce task: if input is (t,[t,t]), output t; if input is (t,[t]), output nothing
• Difference R − S
  Map task: for each tuple t of R output (t,R); for each tuple t of S output (t,S)
  Reduce task: if input is (t,[R]), output t; if input is (t,[R,S]), output nothing
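The reduce-side logic of the three operations as plain functions over the grouped (key, value-list) pairs; a sketch only, with None standing for "output nothing" and the strings 'R'/'S' as the relation tags:

```python
def union_reduce(t, values):
    # (t, [t]) or (t, [t, t]) -> t
    return t

def intersect_reduce(t, values):
    # (t, [t, t]) -> t; (t, [t]) -> nothing
    return t if len(values) == 2 else None

def difference_reduce(t, values):
    # (t, ['R']) -> t; (t, ['R', 'S']) -> nothing
    return t if values == ['R'] else None

print(intersect_reduce((1, 2), [(1, 2), (1, 2)]))   # (1, 2)
print(difference_reduce((1, 2), ['R', 'S']))        # None
```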

Page 21

Joining by Map-Reduce
• Suppose we want to compute R(A,B) JOIN S(B,C), using k Reduce tasks.
  – I.e., find tuples with matching B-values.
• R and S are each stored in a chunked file.
• Use a hash function h from B-values to k buckets.
  – Bucket = Reduce task.
• The Map tasks take chunks from R and S, and send:
  – Tuple R(a,b) to Reduce task h(b). Key = b, value = R(a,b).
  – Tuple S(b,c) to Reduce task h(b). Key = b, value = S(b,c).
  (See the sketch below.)

Slide based on www.mmds.com
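A single-machine sketch of this hash-join: join_map partitions tuples into k buckets by h(b) = hash(b) mod k, and join_reduce pairs up R- and S-tuples that share a B-value (illustrative names, not a real MR API):

```python
from collections import defaultdict

def join_map(R, S, k):
    buckets = defaultdict(list)
    for a, b in R:
        buckets[hash(b) % k].append((b, ('R', a)))   # key = b, value = R(a,b)
    for b, c in S:
        buckets[hash(b) % k].append((b, ('S', c)))   # key = b, value = S(b,c)
    return buckets

def join_reduce(pairs):
    by_b = defaultdict(lambda: ([], []))
    for b, (rel, x) in pairs:
        by_b[b][0 if rel == 'R' else 1].append(x)
    for b, (a_vals, c_vals) in by_b.items():
        for a in a_vals:
            for c in c_vals:
                yield (a, b, c)          # joined tuple (a, b, c)

R = [(1, 2), (4, 2)]
S = [(2, 3), (5, 6)]
for i, pairs in join_map(R, S, k=2).items():
    print(i, list(join_reduce(pairs)))   # (1,2,3) and (4,2,3) appear at h(2)
```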

Page 22

[Diagram: Reduce task i receives R(a,b) from map tasks whenever h(b) = i, and S(b,c) whenever h(b) = i; it produces all (a,b,c) such that h(b) = i, (a,b) is in R, and (b,c) is in S.]

• Key point: If R(a,b) joins with S(b,c), then both tuples are sent to Reduce task h(b).

• Thus, their join (a,b,c) will be produced there and shipped to the output file.

Slide based on www.mmds.com

Page 23

Mapping tuples in joins

[Diagram: each mapper emits its tuple keyed by the B-value; each reducer collects all values for one key:]

Mapper for R(1,2) emits (2, (R,1))
Mapper for R(4,2) emits (2, (R,4))
Mapper for S(2,3) emits (2, (S,3))
Mapper for S(5,6) emits (5, (S,6))

Reducer for B = 2 receives (2, [(R,1), (R,4), (S,3)])
Reducer for B = 5 receives (5, [(S,6)])

Slide based on www.mmds.com

Page 24

Output of the Reducers

Reducer for B = 2: from (2, [(R,1), (R,4), (S,3)]) it produces (1,2,3) and (4,2,3)
Reducer for B = 5: from (5, [(S,6)]) it produces nothing, since there is no R-tuple with B = 5

Slide based on www.mmds.com

Page 25

Relational operators with map-reduce

Grouping and aggregation: γ_{A, agg(B)}(R(A,B,C))

Map task: for each tuple (a,b,c), output (a,[b])
Reduce task: if input is (a,[b1, b2, ..., bn]), output (a, agg(b1, b2, ..., bn)), for example (a, b1 + b2 + ... + bn)
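A sketch of this grouping and aggregation in plain Python (the function name gamma and the tuple layout are illustrative; agg defaults to sum as in the example above):

```python
from collections import defaultdict

def gamma(R, agg=sum):
    groups = defaultdict(list)
    for a, b, c in R:
        groups[a].append(b)        # map: (a,b,c) -> key a, value b
    return {a: agg(bs) for a, bs in groups.items()}   # reduce: (a, agg(...))

print(gamma([(1, 10, 'x'), (1, 5, 'y'), (2, 7, 'z')]))   # {1: 15, 2: 7}
```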

Page 26

Matrix-vector multiplication using map-reduce

Compute x = M v, where x_i = Σ_{j=1}^{n} m_ij · v_j

Page 27

If the vector doesn't fit in main memory

Divide the matrix and the vector into stripes:

Each map task gets a chunk of one stripe of the matrix and the entire matching stripe of the vector, and produces pairs (i, m_ij · v_j), keyed by the row index i.

Reduce task i gets all the pairs (i, m_ij · v_j) and produces the pair (i, x_i). (See the sketch below.)

Slide based on www.mmds.com
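A minimal sketch of the striped computation, assuming the matrix is stored as (i, j, m_ij) triples; the function names and stripe layout are illustrative:

```python
from collections import defaultdict

def mv_map(stripe_entries, v_stripe, j0):
    # v_stripe holds v[j0 : j0 + len(v_stripe)], the matching vector stripe.
    for i, j, m in stripe_entries:
        yield (i, m * v_stripe[j - j0])   # emit (i, m_ij * v_j)

def mv_reduce(pairs):
    x = defaultdict(float)
    for i, p in pairs:
        x[i] += p                          # sum partial products per row i
    return dict(x)

# A 2x4 matrix split into two vertical stripes of width 2, vector v:
M = [(0, 0, 1.0), (0, 1, 2.0), (1, 2, 3.0), (1, 3, 4.0)]
v = [1.0, 1.0, 2.0, 2.0]
stripes = [[e for e in M if e[1] < 2], [e for e in M if e[1] >= 2]]
pairs = list(mv_map(stripes[0], v[0:2], 0)) + list(mv_map(stripes[1], v[2:4], 2))
print(mv_reduce(pairs))   # {0: 3.0, 1: 14.0}
```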

Page 28

Example: [worked mapper/reducer figure lost in extraction; the mappers emit (i, m_ij · v_j) pairs and the reducers emit (i, x_i).]

Page 29

Examples:
• Hamming distance 1 between bit-strings
• Matrix multiplication in one MR-round
• Matrix multiplication in two MR-rounds
• Three-way joins in two rounds and in one round

Page 31

Relational operators with map-reduce: Three-Way Join

• We shall consider a simple join of three relations, the natural join R(A,B) ⋈ S(B,C) ⋈ T(C,D).
• One way: a cascade of two 2-way joins, each implemented by map-reduce.
• Fine, unless the 2-way joins produce large intermediate relations.

Slide based on www.mmds.com

Page 32

Another 3-Way Join

• Reduce processes use hash values of entire S(B,C) tuples as key.
• Choose a hash function h that maps B- and C-values to k buckets.
• There are k² Reduce processes, one for each (B-bucket, C-bucket) pair.

Slide based on www.mmds.com

Page 33

Job of the Reducers

• Each reducer gets, for certain B-values b and C-values c:
  1. All tuples from R with B = b,
  2. All tuples from T with C = c, and
  3. The tuple S(b,c), if it exists.
• Thus it can create every tuple of the form (a, b, c, d) in the join.

Slide based on www.mmds.com

Page 34

Mapping for 3-Way Join

We map each tuple S(b,c) to key (h(b), h(c)) with value (S, b, c).

We map each tuple R(a,b) to key (h(b), y) with value (R, a, b), for all y = 1, 2, ..., k.

We map each tuple T(c,d) to key (x, h(c)) with value (T, c, d), for all x = 1, 2, ..., k.

Aside: even normal map-reduce allows inputs to map to several key-value pairs. (See the sketch below.)

Slide based on www.mmds.com
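A single-machine sketch of this one-round 3-way join, with k² reducers keyed by (h(b), h(c)); function names are illustrative:

```python
from collections import defaultdict
from itertools import product

def three_way_map(R, S, T, k):
    h = lambda x: hash(x) % k
    out = defaultdict(list)
    for b, c in S:                        # S(b,c) goes to exactly one reducer
        out[(h(b), h(c))].append(('S', b, c))
    for a, b in R:                        # R(a,b) goes to all k reducers (h(b), y)
        for y in range(k):
            out[(h(b), y)].append(('R', a, b))
    for c, d in T:                        # T(c,d) goes to all k reducers (x, h(c))
        for x in range(k):
            out[(x, h(c))].append(('T', c, d))
    return out

def three_way_reduce(tuples):
    Rs = [(a, b) for tag, a, b in tuples if tag == 'R']
    Ss = [(b, c) for tag, b, c in tuples if tag == 'S']
    Ts = [(c, d) for tag, c, d in tuples if tag == 'T']
    for (a, b), (b2, c), (c2, d) in product(Rs, Ss, Ts):
        if b == b2 and c == c2:
            yield (a, b, c, d)            # joined tuple of the 3-way join

R = [(1, 1), (1, 2), (2, 1), (2, 2)]
S = [(1, 1), (1, 2), (2, 1), (2, 2)]
T = [(1, 1), (1, 2), (2, 1), (2, 2)]
for key, tuples in sorted(three_way_map(R, S, T, k=2).items()):
    print(key, sorted(three_way_reduce(tuples)))
```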

Page 35

Assigning Tuples to Reducers

[Diagram: a 4 × 4 grid of reducers, rows labeled h(b) = 0, 1, 2, 3 and columns labeled h(c) = 0, 1, 2, 3. A tuple S(b,c) with h(b) = 1, h(c) = 2 goes to the single reducer (1,2); a tuple R(a,b) with h(b) = 2 goes to every reducer in row h(b) = 2; a tuple T(c,d) with h(c) = 3 goes to every reducer in column h(c) = 3.]

Page 36

Example: k = 2, with h(1) = 1 and h(2) = 2, so there are k² = 4 reducers, one per key (h(b), h(c)).

DB = R(1,1), R(1,2), R(2,1), R(2,2); S(1,1), S(1,2), S(2,1), S(2,2); T(1,1), T(1,2), T(2,1), T(2,2); ..etc

Reducer (1,1) gets: R(1,1), R(2,1); S(1,1); T(1,1), T(1,2)
Reducer (1,2) gets: R(1,1), R(2,1); S(1,2); T(2,1), T(2,2)
Reducer (2,1) gets: R(1,2), R(2,2); S(2,1); T(1,1), T(1,2)
Reducer (2,2) gets: R(1,2), R(2,2); S(2,2); T(2,1), T(2,2)

Mapper for R(1,1) emits (1,1,(R,1)) and (1,2,(R,1)); mapper for S(1,2) emits only (1,2,(S,1,2)).

Reducer (1,1), for example, produces the joined tuples R(1,1) S(1,1) T(1,1), R(2,1) S(1,1) T(1,1), R(2,1) S(1,1) T(1,2), and R(1,1) S(1,1) T(1,2).