2/9/20
Google’s parallel data-processing programming paradigm
map(String key, String value):
    // key: candy batch name
    // value: batch contents
    for each piece of candy w in value:
        Emit(w.type, "1");

reduce(String key, Iterator values):
    // key: a single type
    // values: a list of counts
    int result = 0;
    for each v in values:
        result += ParseInt(v);
    Emit(key, result);
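As a sanity check, the pseudocode above can be transliterated into a minimal in-memory Python sketch. The batch names and candy types are made-up sample data, and the "shuffle" is simply a dictionary grouping intermediate pairs by key:

```python
from collections import defaultdict

def map_fn(key, value):
    # key: candy batch name, value: list of candy pieces in the batch
    for candy in value:
        yield (candy["type"], 1)

def reduce_fn(key, values):
    # key: a single type, values: a list of counts
    return (key, sum(values))

def run_mapreduce(batches):
    # Shuffle: group all intermediate pairs emitted by the mappers by key.
    intermediate = defaultdict(list)
    for name, contents in batches.items():
        for k, v in map_fn(name, contents):
            intermediate[k].append(v)
    # Reduce each key's list of counts to a single total.
    return dict(reduce_fn(k, vs) for k, vs in intermediate.items())

batches = {
    "batch1": [{"type": "mint"}, {"type": "toffee"}, {"type": "mint"}],
    "batch2": [{"type": "toffee"}],
}
print(run_mapreduce(batches))  # {'mint': 2, 'toffee': 2}
```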
§ The programmer needs to:
§ Internally, everything is managed by the master node.
§ One of the N nodes in the cluster is chosen to be the master node.
§ The rest of the nodes are workers.
§ The master takes care of:
§ Memory management
§ Scheduling tasks
§ Fault-tolerance
§ The input is in:
§ The intermediate data emitted by the mappers is in:
§ Shuffling is:
§ The final data emitted by the reducers is in:
§ Large-scale distributed Grep
§ Input: A directory of files
§ Output: All lines that match a pattern
§ Map: emit the line if it matches the encoded pattern
§ Reduce: Copy the intermediate data directly to the output file
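A single-machine sketch of distributed grep, assuming Python; the file names, their contents, and the pattern are illustrative. The map stage emits matching lines and the reduce stage is the identity:

```python
import re

def map_fn(filename, lines, pattern):
    # Emit every line that matches the pattern; the line itself is the key.
    for line in lines:
        if re.search(pattern, line):
            yield (line, None)

def reduce_fn(key, values):
    # Identity reduce: copy the matching line straight to the output.
    return key

files = {"log1.txt": ["error: disk full", "ok"],
         "log2.txt": ["error: timeout"]}
matches = [reduce_fn(k, None)
           for name, lines in files.items()
           for k, _ in map_fn(name, lines, r"^error")]
print(matches)  # ['error: disk full', 'error: timeout']
```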
§ Large-scale distributed Word Count
§ Input: A directory of files
§ Output: A histogram of frequency of words
§ Map: emit each word with a count 1 <word, 1>
§ Reduce: Sum up the counts of all words in a set from the intermediate data, and emit <word, total count>
§ Count word frequency
§ Input: A directory of files
§ Output: For each word, the % of total occurrences of that word
§ This will be done by 2 linked MapReduce jobs:
§ The first MapReduce job is just like WordCount in example 3.
§ Second MapReduce:
§ Input: Output of the first MapReduce job <word, count>
§ Map: emit <1, <word, count>>
§ Reduce:
§ One round to count all the 1’s -> which is counting what?
§ Second round to emit <word, wordcount/total number of words>
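The two chained jobs can be sketched in memory, assuming Python; the document names and text are made-up. Summing the counts produced by the first job plays the role of the single reducer that sees all <1, <word, count>> pairs and totals them (i.e., it counts the total number of word occurrences), after which the second round emits each word's share:

```python
from collections import Counter

def wordcount(docs):
    # First MapReduce job: plain word count over all documents.
    counts = Counter()
    for text in docs.values():
        counts.update(text.split())
    return dict(counts)

def frequency(counts):
    # Second MapReduce job: one reducer receives every <word, count> pair.
    # Round 1: total all counts -> the total number of word occurrences.
    total = sum(counts.values())
    # Round 2: emit <word, count / total>.
    return {w: c / total for w, c in counts.items()}

docs = {"a.txt": "to be or not to be"}
print(frequency(wordcount(docs)))
```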
The infrastructure that coordinates your MapReduce program
§ The data to be processed.
§ Typically very large files (GB to TB).
§ Usually stored by the programmer in a global distributed file system.
§ Example: Google File System (GFS) or Hadoop Distributed File System (HDFS)
§ Could have any format.
§ A worker with an assigned map task.
§ Reads the contents of its corresponding input split.
§ Parses the input according to the mapping function provided by the client.
§ Intermediate results are buffered in its local memory.
§ Intermediate results are partitioned by reducer key
§ Keys are obtained from the master node
§ Information about the locations of the local data is passed on to the master
§ A worker with an assigned reduce task.
§ Usually there are fewer reducers than mappers
§ A subset of the intermediate key space is assigned to each reducer.
§ Using hashing methods (hash(key) % num_reduces)
§ It contacts the master node to find the locations of the intermediate data to be reduced.
§ It directly contacts the nodes to read the locally buffered data.
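The hash-based key assignment above is a one-liner; a sketch assuming Python, with `num_reduces` and the sample keys illustrative. `zlib.crc32` stands in for the hash because Python's built-in string hash is salted per process, whereas the partition function must be deterministic so every mapper routes a given key to the same reducer:

```python
import zlib

def partition(key, num_reduces):
    # hash(key) % num_reduces: route each intermediate key to exactly
    # one of the R reducers, deterministically across all mappers.
    return zlib.crc32(key.encode()) % num_reduces

# Every occurrence of a key lands on the same reducer,
# and the result is always a valid reducer index in [0, num_reduces).
print(partition("mint", 4), partition("toffee", 4))
```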
§ After the shuffle is complete:
§ The reducer groups the intermediate data according to their keys.
§ Then, a sort is performed to reduce the complexity of the reduction.
§ Finally, the data is parsed according to the reducing function provided by the client.
§ The final output is stored in the global file system.
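The sort-then-group step on the reducer side can be sketched as follows, assuming Python; the fetched intermediate pairs are made-up. Sorting first means equal keys become adjacent, so a single linear pass groups them and each key is reduced exactly once:

```python
from itertools import groupby
from operator import itemgetter

def reduce_phase(pairs, reduce_fn):
    # Sort the fetched intermediate <key, value> pairs by key, then
    # group runs of equal keys and apply the client's reduce function.
    pairs.sort(key=itemgetter(0))
    return [reduce_fn(k, [v for _, v in group])
            for k, group in groupby(pairs, key=itemgetter(0))]

fetched = [("be", 1), ("to", 1), ("be", 1), ("to", 1), ("or", 1)]
print(reduce_phase(fetched, lambda k, vs: (k, sum(vs))))
# [('be', 2), ('or', 1), ('to', 2)]
```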
§ It assigns tasks to workers.
§ It coordinates the communication between nodes.
§ It monitors the status of each task and worker.
§ It’s in charge of managing faults.
§ Let’s think what can fail in this model…
§ The task
§ The worker
§ The master!!!
§ The master periodically checks whether each worker is alive.
§ If a worker doesn’t respond, it’s marked as failed.
§ What happens to:
§ The map/reduce tasks assigned to the worker?
§ The map/reduce tasks running on the worker?
§ The map/reduce tasks completed by the worker?
§ How does the master node decide which worker node gets the next map / reduce job?
§ How big should N, M, and R be?
§ What about stragglers?
§ Will this model of computation work on any distributed system?
§ The content of these slides is inspired from:
1. Dean, Jeffrey, and Sanjay Ghemawat. "MapReduce: Simplified data processing on large clusters." Communications of the ACM 51.1 (2008): 107-113.
2. Ghemawat, Sanjay, Howard Gobioff, and Shun-Tak Leung. "The Google file system." ACM SIGOPS Operating Systems Review. Vol. 37. No. 5. ACM, 2003.
3. The Yahoo! Hadoop tutorial (https://developer.yahoo.com/hadoop/tutorial/)
4. The Apache Hadoop tutorial (https://hadoop.apache.org/docs/r1.2.1/index.html)