2/9/20
Google’s parallel data-processing programming paradigm
map(String key, String value):
    // key: candy batch name
    // value: batch contents
    for each piece of candy w in value:
        Emit(w.type, "1");

reduce(String key, Iterator values):
    // key: a single type
    // values: a list of counts
    int result = 0;
    for each v in values:
        result += ParseInt(v);
    Emit(key, result);
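As a sanity check, the pseudocode above can be transliterated into a minimal in-memory Python sketch. The batch names and candy types are made-up sample data, and the "shuffle" is simply a dictionary grouping intermediate pairs by key:

```python
from collections import defaultdict

def map_fn(key, value):
    # key: candy batch name, value: list of candy pieces in the batch
    for candy in value:
        yield (candy["type"], 1)

def reduce_fn(key, values):
    # key: a single type, values: a list of counts
    return (key, sum(values))

def run_mapreduce(batches):
    # Shuffle: group all intermediate pairs emitted by the mappers by key.
    intermediate = defaultdict(list)
    for name, contents in batches.items():
        for k, v in map_fn(name, contents):
            intermediate[k].append(v)
    # Reduce each key's list of counts to a single total.
    return dict(reduce_fn(k, vs) for k, vs in intermediate.items())

batches = {
    "batch1": [{"type": "mint"}, {"type": "toffee"}, {"type": "mint"}],
    "batch2": [{"type": "toffee"}],
}
print(run_mapreduce(batches))  # {'mint': 2, 'toffee': 2}
```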
§ The programmer needs to:
§ Internally, everything is managed by the master node.
§ One of the N nodes in the cluster is chosen to be the master node.
§ The rest of the nodes are workers.
§ The master takes care of:
§ Memory management
§ Scheduling tasks
§ Fault-tolerance
§ The input is in:
§ The intermediate data emitted by the mappers is in:
§ Shuffling is:
§ The final data emitted by the reducers is in:
§ Large-scale distributed Grep
§ Input: A directory of files
§ Output: All lines that match a pattern
§ Map: emit the line if it matches the encoded pattern
§ Reduce: Copy the intermediate data directly to the output file
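A single-machine sketch of distributed grep, assuming Python; the file names, their contents, and the pattern are illustrative. The map stage emits matching lines and the reduce stage is the identity:

```python
import re

def map_fn(filename, lines, pattern):
    # Emit every line that matches the pattern; the line itself is the key.
    for line in lines:
        if re.search(pattern, line):
            yield (line, None)

def reduce_fn(key, values):
    # Identity reduce: copy the matching line straight to the output.
    return key

files = {"log1.txt": ["error: disk full", "ok"],
         "log2.txt": ["error: timeout"]}
matches = [reduce_fn(k, None)
           for name, lines in files.items()
           for k, _ in map_fn(name, lines, r"^error")]
print(matches)  # ['error: disk full', 'error: timeout']
```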
§ Large-scale distributed Word Count
§ Input: A directory of files
§ Output: A histogram of frequency of words
§ Map: emit each word with a count 1 <word, 1>
§ Reduce: Sum up the counts of all words in a set from the intermediate data, and emit <word, total count>
§ Count word frequency
§ Input: A directory of files
§ Output: For each word, the % of total occurrences of that word
§ This will be done by 2 linked MapReduce jobs:
§ The first MapReduce job is just like WordCount in example 3.
§ Second MapReduce:
§ Input: Output of the first MapReduce job <word, count>
§ Map: emit <1, <word, count>>
§ Reduce:
§ One round to count all the 1’s -> which is counting what?
§ Second round to emit <word, wordcount/total number of words>
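The two chained jobs can be sketched in memory, assuming Python; the document names and text are made-up. Summing the counts produced by the first job plays the role of the single reducer that sees all <1, <word, count>> pairs and totals them (i.e., it counts the total number of word occurrences), after which the second round emits each word's share:

```python
from collections import Counter

def wordcount(docs):
    # First MapReduce job: plain word count over all documents.
    counts = Counter()
    for text in docs.values():
        counts.update(text.split())
    return dict(counts)

def frequency(counts):
    # Second MapReduce job: one reducer receives every <word, count> pair.
    # Round 1: total all counts -> the total number of word occurrences.
    total = sum(counts.values())
    # Round 2: emit <word, count / total>.
    return {w: c / total for w, c in counts.items()}

docs = {"a.txt": "to be or not to be"}
print(frequency(wordcount(docs)))
```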
The infrastructure that coordinates your MapReduce program
§ The data to be processed.
§ Typically very large files (GB to TB).
§ Usually stored by the programmer in a global distributed file system.
§ Example: Google File System (GFS) or Hadoop Distributed File System (HDFS)
§ Could have any format.
§ A worker with an assigned map task.
§ Reads the contents of its corresponding input split.
§ Parses the input according to the mapping function provided by the client.
§ Intermediate results are buffered in its local memory.
§ Intermediate results are partitioned by reducer key
§ Keys are obtained from the master node
§ Information about the locations of the local data is passed on to the master
§ A worker with an assigned reduce task.
§ Usually there are fewer reducers than mappers
§ A subset of the intermediate key space is assigned to each reducer.
§ Using hashing methods (hash(key) % num_reduces)
§ It contacts the master node to find the locations of the intermediate data to be reduced.
§ It directly contacts the nodes to read the locally buffered data.
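The hash-based key assignment above is a one-liner; a sketch assuming Python, with `num_reduces` and the sample keys illustrative. `zlib.crc32` stands in for the hash because Python's built-in string hash is salted per process, whereas the partition function must be deterministic so every mapper routes a given key to the same reducer:

```python
import zlib

def partition(key, num_reduces):
    # hash(key) % num_reduces: route each intermediate key to exactly
    # one of the R reducers, deterministically across all mappers.
    return zlib.crc32(key.encode()) % num_reduces

# Every occurrence of a key lands on the same reducer,
# and the result is always a valid reducer index in [0, num_reduces).
print(partition("mint", 4), partition("toffee", 4))
```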
§ After the shuffle is complete:
§ The reducer groups the intermediate data according to their keys.
§ Then, a sort is performed to reduce the complexity of the reduction.
§ Finally, the data is parsed according to the reducing function provided by the client.
§ The final output is stored in the global file system.
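The sort-then-group step on the reducer side can be sketched as follows, assuming Python; the fetched intermediate pairs are made-up. Sorting first means equal keys become adjacent, so a single linear pass groups them and each key is reduced exactly once:

```python
from itertools import groupby
from operator import itemgetter

def reduce_phase(pairs, reduce_fn):
    # Sort the fetched intermediate <key, value> pairs by key, then
    # group runs of equal keys and apply the client's reduce function.
    pairs.sort(key=itemgetter(0))
    return [reduce_fn(k, [v for _, v in group])
            for k, group in groupby(pairs, key=itemgetter(0))]

fetched = [("be", 1), ("to", 1), ("be", 1), ("to", 1), ("or", 1)]
print(reduce_phase(fetched, lambda k, vs: (k, sum(vs))))
# [('be', 2), ('or', 1), ('to', 2)]
```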
§ It assigns tasks to workers.
§ It coordinates the communication between nodes.
§ It monitors the status of each task and worker.
§ It’s in charge of managing faults.
§ Let’s think what can fail in this model…
§ The task
§ The worker
§ The master!!!
§ The master periodically checks whether each worker is alive.
§ If a worker doesn’t respond, it’s marked as failed.
§ What happens to:
§ The map/reduce tasks assigned to the worker?
§ The map/reduce tasks running on the worker?
§ The map/reduce tasks completed by the worker?
§ How does the master node decide which worker node gets the next map / reduce job?
§ How big should N, M, and R be?
§ What about stragglers?
§ Will this model of computation work on any distributed system?
§ The content of these slides is inspired from:
1. Dean, Jeffrey, and Sanjay Ghemawat. "MapReduce: Simplified data processing on large clusters." Communications of the ACM 51.1 (2008): 107-113.
2. Ghemawat, Sanjay, Howard Gobioff, and Shun-Tak Leung. "The Google file system." ACM SIGOPS Operating Systems Review. Vol. 37. No. 5. ACM, 2003.
3. The Yahoo! Hadoop tutorial (https://developer.yahoo.com/hadoop/tutorial/)
4. The Apache Hadoop tutorial (https://hadoop.apache.org/docs/r1.2.1/index.html)