
Page 1: Distributed Computing & MapReduce

Distributed Computing & MapReduce

Presented by: Abdul Qadeer

Page 2: Distributed Computing & MapReduce

2

Today's Agenda

• Distributed Computing
  – What is it?
  – Why and when do we need it?
  – Comparison with centralized computing

• 'MapReduce' (MR) Framework
  – Theory and practice

• 'MapReduce' in Action
  – Using Hadoop
  – Lab exercises

Feel free to comment / ask questions anytime!

Page 3: Distributed Computing & MapReduce

3

Distributed Computing

Page 4: Distributed Computing & MapReduce

4

Distributed Computing

• A way of computing where:
  – The software system has multiple sub-components
  – Sub-components are loosely coupled over a network
  – The placement of, and interaction between, sub-components is designed to meet system goals
  – System goals can be:
    • Fault tolerance
    • Scalability in some design dimension (e.g. increasing system load)
    • Reliability
    • Performance (e.g. many computers working together)
    • Cost (better yet, performance per dollar)
    • Usability / accessibility / ease of use (e.g. a distributed file system)
    • Easy and cost-effective software maintenance (e.g. ChromeBook)
    • Etc.

Page 5: Distributed Computing & MapReduce

5

Distributed Computing - What

• Everything on a single computer
  – E.g. an old POS system in a small shop

• Use of multiple computers
  – Client–server model
  – Multiple tiers (2 .. N)
  – Logical / physical tiers

• Depends on complexity, load …

Page 6: Distributed Computing & MapReduce

6

Distributed Computing - Why

• A single computer has limitations
  – Compute (capability) limitation
  – Memory (storage) limitation
  – Bandwidth limitation

• But these limitations are relative
  – Today a single computer is darn powerful!
    • E.g. Ethane – taking control of the enterprise
  – Multiple cores (8-core systems are common)
  – 6 to 8 GB of memory (up to 3 TB hard disk)
  – 10 Gbps network interfaces

Page 7: Distributed Computing & MapReduce

7

Distributed Computing - Why

• Reliability
  – Failure model
  – Probability of failure
  – The concept of redundancy

• Scalability in different dimensions
  – Load balancing (distribution or partitioning)
  – MSN Messenger user example
  – Scalability always has dimensions

• Cost
  – One beefy server vs. many commodity machines
  – Economy of scale

Page 8: Distributed Computing & MapReduce

8

Distributed Computing

• Can use 'toy problems' to learn different concepts of distributed computing

• Real-world use should be based on a real need

  – There are many real-world problems where distributed computing is needed

• Large scale web applications

– Computers, cell phones etc.

• Data post processing

– Finding clues in data, weather forecast

• Many scientific and HPC applications

– Discussed previously in another lecture of this course

Page 9: Distributed Computing & MapReduce

9

Clusters

Page 10: Distributed Computing & MapReduce

10

How Clusters are Built?

• Two possible approaches

– Big, expensive, capable server-class machines
  • Usually only governments used to have that much money!

  • There are limits beyond which a single machine can't go!

  • Any failure of a big machine means disruption for a large population of clients

– Off-the-shelf, ordinary, cheap computers connected with ordinary Ethernet!

  • Cost effective in terms of operational cost

  • Failure of a machine only disrupts a small portion of the operations

• Most industry clusters are built like this!

Page 11: Distributed Computing & MapReduce

http://www.flickr.com/photos/drurydrama/;http://www.fotosearch.com/photos-images/ox.html

11

How Clusters are Built?

• "In pioneer days they used oxen for heavy pulling, and when one ox couldn't budge a log, they didn't try to grow a larger ox. We shouldn't be trying for bigger computers, but for more systems of computers."

—Grace Hopper

Page 12: Distributed Computing & MapReduce

Table generated at: http://top500.org/stats/list/37/archtype

12

How Clusters are Built?

Page 13: Distributed Computing & MapReduce

13

How Clusters are Built?

• Mostly 64-bit processors are used

Page 14: Distributed Computing & MapReduce

14

How Clusters are Built?

• Ethernet holds a major share

Page 15: Distributed Computing & MapReduce

15

How Clusters are Used?

• Different ways depending on the consumer

– Academic use
  • End user logs in to 'head nodes'

  • Head nodes are beefy systems used for non-compute-intensive tasks, or tasks that have a good mix of compute and I/O; e.g. compiling code

  • Cluster machines are usually not accessible directly; they are only accessible via a head node

  • Machine acquisition, job submission, job monitoring (e.g. using the qsub command on a PBS-based system)

Page 16: Distributed Computing & MapReduce

16

How Clusters are Used?

• Different ways depending on the consumer

– Commercial use
  • Infrastructure as a service
    – Raw machines
    – E.g. AWS selling machine instances

  • Platform as a service: machine + OS + other software stack
    – E.g. Google's App Engine

  • Application (software) as a service
    – Using Gmail, Google Search, etc. in a browser
    – Google Docs

  • Google programmers sharing cluster machines using proprietary software

Page 17: Distributed Computing & MapReduce

17

MapReduce

Page 18: Distributed Computing & MapReduce

18

The Problem at Hand!

• Many organizations are data centric
  – Google, Yahoo, Bing, etc. (search engines)
  – Facebook, Twitter, MySpace, etc. (social networking, blogs)
  – The New York Times
  – Stock exchanges

• Data is increasing at a high rate
  – New web pages are added each day
  – The New York Stock Exchange produces 1 TB of data every day!

Page 19: Distributed Computing & MapReduce

19

The Goal

• To "process" data:
  – In reasonable time
    • A single machine has capability limits!!!

  – Cost effectively
    • Costs coming from hardware resources
      – No supercomputers!!
      – Not even high-end server machines
    • Programmer hours

  – Applicability (a framework for similar problems)

  – Simplicity

  – No specialized / expensive programmer training

Page 20: Distributed Computing & MapReduce

20

Elaboration by Example

• Word count
• Input:
  – Data size = 20 TB

• Output:
  – A file as follows:
    <word1 12>
    <word2 30>
    <word3 34>
    …

• Solution?
  – Single-machine-based solutions
    • Don't scale well!!!
  – Use multiple machines (MPI?)

Page 21: Distributed Computing & MapReduce

21

Solution 1

• Pseudo code:
  (a) Make a dynamically extendable hash table with the word as the "key" and an integer count as its "value"
  (b) while (read a line == true)
  (c)   parse the line on spaces, tabs, and newlines
  (d)   for each word parsed in step (c)
  (e)     insert or update <word, oldval + 1> in the hash table
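
A minimal single-machine sketch of this pseudocode in Java; the class name and the use of a command-line argument for the input file path are assumptions, not part of the original slides:

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.util.HashMap;
    import java.util.Map;

    public class SingleMachineWordCount {
        public static void main(String[] args) throws Exception {
            Map<String, Integer> counts = new HashMap<>();          // (a) dynamically growing hash table
            try (BufferedReader in = new BufferedReader(new FileReader(args[0]))) {
                String line;
                while ((line = in.readLine()) != null) {            // (b) read line by line
                    for (String word : line.split("[ \t]+")) {      // (c) split on spaces and tabs
                        if (!word.isEmpty()) {
                            counts.merge(word, 1, Integer::sum);    // (d)(e) insert or update <word, oldval + 1>
                        }
                    }
                }
            }
            counts.forEach((w, c) -> System.out.println("<" + w + " " + c + ">"));
        }
    }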

Page 22: Distributed Computing & MapReduce

22

What is wrong with Solution 1?

(a) Make a dynamically extendable hash table with the word as the "key" and an integer count as its "value"
(b) while (read a line == true)
(c)   parse the line on spaces, tabs, and newlines
(d)   for each word parsed in step (c)
(e)     insert or update <word, oldval + 1> in the hash table

• A big hashtable which might not fit in memory

• Might be using swap memory. A typical pattern of access might cause thrashing!

• Reading from disk is very slow. (i.e. Slow I/O)

• Expect frequent cache misses and hence poor performance

Page 23: Distributed Computing & MapReduce

23

How to improve on Solution 1?

• Multithreaded application
  – Pros:
    • Can fully utilize the disk bandwidth (~64 MB/sec)

  – Cons:
    • Dividing the input file among threads
    • Locking on hot hash-table entries (e.g. the word "the")
      – One lock for the whole hash table (very poor performance!)
      – One lock per hash-table element (lots of lock state!)
      – Deadlock prevention (lock-free design)

More cons than pros!

No guarantee of speed up.
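
A hedged sketch of the multithreaded variant, assuming the input lines are already in memory; it sidesteps explicit table-wide or per-element locks by using java.util.concurrent.ConcurrentHashMap, whose merge() updates one key atomically. The class name and the round-robin split of lines across threads are assumptions for illustration:

    import java.util.List;
    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.ConcurrentMap;

    public class ThreadedWordCount {
        public static ConcurrentMap<String, Integer> count(List<String> lines, int nThreads)
                throws InterruptedException {
            ConcurrentMap<String, Integer> counts = new ConcurrentHashMap<>();
            Thread[] workers = new Thread[nThreads];
            for (int t = 0; t < nThreads; t++) {
                final int id = t;
                workers[t] = new Thread(() -> {
                    for (int i = id; i < lines.size(); i += nThreads) {         // naive round-robin split of lines
                        for (String w : lines.get(i).split("\\s+")) {
                            if (!w.isEmpty()) counts.merge(w, 1, Integer::sum); // atomic per-key update, no explicit lock
                        }
                    }
                });
                workers[t].start();
            }
            for (Thread w : workers) w.join();
            return counts;
        }
    }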

Page 24: Distributed Computing & MapReduce

24

Solution 2

• Pseudo code:
  (a) while (read line == true) {
  (b)   parse the line on white spaces, tabs, and newlines
  (c)   for each word in step (b) do
  (d)     emit the tuple <word 1>
      } // end of while
  (e) sort all the tuples emitted in step (d) on the basis of word
  (f) for the list of tuples produced in step (e) {
        sum up similar words and emit <word final count>
      }
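
A minimal single-machine sketch of this emit-sort-sum pipeline in Java; the class name and the command-line input path are assumptions:

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.util.ArrayList;
    import java.util.List;

    public class EmitSortSumWordCount {
        public static void main(String[] args) throws Exception {
            List<String> emitted = new ArrayList<>();                    // emitted <word 1> tuples (word only)
            try (BufferedReader in = new BufferedReader(new FileReader(args[0]))) {
                String line;
                while ((line = in.readLine()) != null) {                 // (a)
                    for (String word : line.split("\\s+")) {             // (b)
                        if (!word.isEmpty()) emitted.add(word);          // (c)(d) emit <word 1>
                    }
                }
            }
            emitted.sort(null);                                          // (e) sort tuples by word
            for (int i = 0; i < emitted.size(); ) {                      // (f) sum up runs of identical words
                int j = i;
                while (j < emitted.size() && emitted.get(j).equals(emitted.get(i))) j++;
                System.out.println("<" + emitted.get(i) + " " + (j - i) + ">");
                i = j;
            }
        }
    }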

Page 25: Distributed Computing & MapReduce

25

Solution 2 Elaboration

• Example text
  – "the quick brown fox jumps over the lazy dog"

• Step 1:
  – <0, the quick brown fox jumps over the lazy dog>
  – <the 1>, <quick 1>, <brown 1>, <fox 1>, <jumps 1>, <over 1>, <the 1>, <lazy 1>, <dog 1>

• Step 2:
  – <brown 1>, <dog 1>, <fox 1>, <jumps 1>, <lazy 1>, <over 1>, <quick 1>, <the 1>, <the 1>
  – Emit output <key, value> pairs
    • <brown 1>, <dog 1>, <fox 1>, <jumps 1>, <lazy 1>, <over 1>, <quick 1>, <the 2>

Page 26: Distributed Computing & MapReduce

26

What is wrong with Solution 2?

• Pros:
  – No locking
  – No concurrency to handle in the hash map

• Cons:
  – Reading from disk (slow I/O again!)
  – Intermediate keys might not fit in memory (data too big to fit!)
  – An external sort might need to be used
  – The sorting might become a bottleneck

Page 27: Distributed Computing & MapReduce

27

How to improve Solution 2

• Most of the problems are related to "scalability"
• There is only so much you can extract from one machine!
• Use multiple networked machines?
  – Distributed computing intricacies
    • How to divide work efficiently (who does what)
    • Scalability of the solution with increasing problem size
    • Reliability, fault tolerance
    • Network-related problems (link failures, delays, etc.)

So ideally we need a solution which takes care of the messy / complicated parallelism and distributed computing details

Page 28: Distributed Computing & MapReduce

28

What is MapReduce?

• So MapReduce is a framework to
  – Process enormous data (multi-terabyte scale)
  – Efficiently
  – Using a cluster of ordinary machines
  – With linear scalability
  – Simply, for end programmers
    • Only write two small pieces of code and that's it!

• Not a general parallel programming model!
  – Not every parallel problem is solvable by it
    • E.g. producer-consumer problems
  – MR is good where sub-tasks either do not communicate with each other, or any communication can be handled at the "map end / reduce start" stage.

Page 29: Distributed Computing & MapReduce

Fig. 1 taken from the OSDI 2004 paper: MapReduce: Simplified Data Processing on Large Clusters

29

Nuts and Bolts of MapReduce

The application programmer writes the mapper and reducer code. The rest is automatic! The messy details of parallelism, scalability, and fault tolerance are taken care of by the framework.

Page 30: Distributed Computing & MapReduce

30

Word Count Example

• Example text
  – "the quick brown fox jumps over the lazy dog"

• Step 1 – Split input:
  – Split usually on the basis of the size of the input data and the available machines (assume only one machine for this example!)

• Step 2 – Map phase:
  – <0, the quick brown fox jumps over the lazy dog>
  – <the 1>, <quick 1>, <brown 1>, <fox 1>, <jumps 1>, <over 1>, <the 1>, <lazy 1>, <dog 1>

• Step 3 – Distribute:
  – If there is only one reducer, all the intermediate key-value pairs are placed in a single intermediate file

• Step 4 – Reduce:
  – Copies the intermediate file locally and sorts it on the key
  – <brown 1>, <dog 1>, <fox 1>, <jumps 1>, <lazy 1>, <over 1>, <quick 1>, <the 1>, <the 1>
  – Emit output <key, value> pairs
    • <brown 1>, <dog 1>, <fox 1>, <jumps 1>, <lazy 1>, <over 1>, <quick 1>, <the 2>

Page 31: Distributed Computing & MapReduce

31

Word Count Example

• 2 Mappers and 2 Reducers
• Example text
  – "the quick brown fox jumps over the lazy dog"

• Step 1 – Split input:
  – the quick brown fox
  – jumps over the lazy dog

• Step 2 – Map phase:
  – Mapper 1:
    – <0, the quick brown fox>
    – <the 1>, <quick 1>, <brown 1>, <fox 1>
  – Mapper 2:
    – <0, jumps over the lazy dog>
    – <jumps 1>, <over 1>, <the 1>, <lazy 1>, <dog 1>

Page 32: Distributed Computing & MapReduce

32

Word Count Example

• Step 3 – Distribute:
  – Two intermediate files per mapper, because there are 2 reducers
  – A hash function is used to place intermediate key-value pairs into one of the "buckets"
  – Words starting with A to M (capital or small) go in bucket A, others in bucket B (a sketch of this rule follows this list)
  – For Mapper 1:
    • File A will have the pairs:
      – <brown 1>, <fox 1>
    • File B will have the pairs:
      – <the 1>, <quick 1>
  – For Mapper 2:
    • File A will have the pairs:
      – <jumps 1>, <lazy 1>, <dog 1>
    • File B will have the pairs:
      – <over 1>, <the 1>
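
A minimal sketch of the bucketing rule above, assuming exactly two reducers; the class and method names are hypothetical. Hadoop's actual default partitioner hashes the key and takes it modulo the number of reducers instead of looking at the first letter:

    public class AlphabeticPartition {
        // Returns 0 for bucket/file A (goes to reducer 1) and 1 for bucket/file B (goes to reducer 2).
        public static int bucketFor(String word) {
            char first = Character.toUpperCase(word.charAt(0));
            return (first >= 'A' && first <= 'M') ? 0 : 1;
        }
    }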

Page 33: Distributed Computing & MapReduce

33

Word Count Example

• Step 4 – Reduce
  – Reducer 1:
    • Fetches file A from both mapper 1 and mapper 2
    • Merges them and sorts on the key
      – <brown 1>, <dog 1>, <fox 1>, <jumps 1>, <lazy 1>
    • Emits final <key, value> pairs

  – Reducer 2:
    • Fetches file B from both mapper 1 and mapper 2
    • Merges them and sorts on the key
      – <over 1>, <quick 1>, <the 1>, <the 1>
    • Emits:
      – <over 1>, <quick 1>, <the 2>

  – So the final output has 2 files
    • Merge them or feed them to a next-stage mapper

Page 34: Distributed Computing & MapReduce

34

Word Count Example – A Quiz

• Assume:
  – I have a huge text file
  – 50,000 mappers, 2 reducers
  – The same hash function is used

• Questions:
  – The pair <the 1> will always end up with Mapper 2. True or false?
  – Each mapper will produce how many intermediate files?
    • (a) 50,000  (b) 2  (c) 100,000

Page 35: Distributed Computing & MapReduce

35

Why Split Data

• Exploiting locality
  – Data on the same local disk (best case)
  – I/O operations are too slow
  – Somewhere in the same rack
  – Other proximity metrics
  – In GFS, by default each chunk of data is 64 MB long, and usually each chunk is replicated at three different places

• Utilizing many machines
  – Each split is given to a different machine to work on
  – Increasing parallelism

• A fine-grained split is better
  – Faster machines can do more work

Page 36: Distributed Computing & MapReduce

36

The Power of Splitting

Page 37: Distributed Computing & MapReduce

37

Fault Tolerance

• Worker failures
  – Machines can fail, disks can crash
  – Tasks are re-scheduled on other machines
  – Idempotent operations make this simple

• Master failure
  – Might be a single point of failure
  – Simple mechanisms like writing checkpoints can solve the problem
  – Replicate it (the approach taken by Apache Hadoop)
  – Presents an important point in the design space
    • Introduce only as much complexity in the system as necessary

Page 38: Distributed Computing & MapReduce

38

Backup Tasks

• Straggler jobs
  – Slow jobs due to a failing hard disk
    • 30 MB/s dropping to 1 MB/s
  – Some other job running on the same machine
    • Jobs competing for CPU / disk I/O

• Run duplicate (backup) tasks
  – Whichever copy finishes first is taken
  – The other duplicate is killed

Page 39: Distributed Computing & MapReduce

Fig. 3 taken from the OSDI 2004 paper: MapReduce: Simplified Data Processing on Large Clusters

39

Performance due to Backup Tasks

Page 40: Distributed Computing & MapReduce

40

Combiners

• A way to condense the number of intermediate key-value pairs
  – If, for example, the pairs emitted by a mapper are
    • <the 1>, <the 1>, …
    they can be condensed on the map side into
    • <the n>
  – Reducing the needed bandwidth

• Only applicable if the operation is commutative and associative
  – Counting works (summing up is commutative & associative)
  – Mean is not!
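
For example, if one mapper's values for a key are 1 and 2 and another mapper's value is 3, a "mean combiner" would hand the reducer 1.5 and 3, giving mean(1.5, 3) = 2.25, whereas the true mean of 1, 2, 3 is 2. Sums compose correctly, (1 + 2) + 3 = 6, which is why the word-count reducer can safely double as a combiner.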

Page 41: Distributed Computing & MapReduce

41

Example Usage of MapReduce

• Search (grep)
• Sort
• Large-scale indexing
• Count of URL access frequency
• Reverse web-link graph
• Term vector per host
• Inverted index
• Etc.

Page 42: Distributed Computing & MapReduce

Fig. taken from the OSDI 2004 paper: MapReduce: Simplified Data Processing on Large Clusters

42

Web Search

• Google's web search indexing system was converted to use MapReduce (actually, it was re-written)

Page 43: Distributed Computing & MapReduce

43

Yahoo Now Using MapReduce

• Yahoo followed suit after 4 years

Page 44: Distributed Computing & MapReduce

44

Case Studies

Page 45: Distributed Computing & MapReduce

45

Case Study 1: New York Times

• Make the 1851 to 1980 newspaper articles available online
  – There are about 11 million articles
  – Articles were stored as scanned images
  – On getting a request, these images were glued together on the fly
    • Can be slow
    • Can be stupid! (may be doing redundant work)

• The new design
  – Glue up all the articles and make PDF documents ahead of time
  – EC2, S3, and Hadoop were used
    – Uploaded 4 TB to S3
    – 100 EC2 instances did the work in 24 hours
    – 1.5 TB of output data, again stored in S3
    – Data served to clients from S3

Page 46: Distributed Computing & MapReduce

46

Case Study 2: IPv4 Census

• John Heidemann et al. conducted an IPv4 census
  – About 4 billion IPv4 addresses
  – Addresses are about to be exhausted
  – Seeing usage trends using pings and their responses

• A Hilbert curve is used to map 32-bit IP responses into 2 dimensions

Page 47: Distributed Computing & MapReduce

47

Case Study 2: IPv4 Census

• Lab machines used to run Hadoop
  – Machines share resources with other processes
  – The Hadoop process has normal priority; it can be pre-empted by higher-priority processes
  – Machine sharing is the same as in Wisconsin's Condor project
  – The cluster has about 20 machines
  – File / data sharing using NFS and HDFS mounts
  – Improvised use of NFS to deploy the latest version of Hadoop

• Census data
  – Each census file is about 37 GB in size

Page 48: Distributed Computing & MapReduce

48

Hadoop

Page 49: Distributed Computing & MapReduce

49

MapReduce in Action!

• MapReduce's open-source implementation by Apache is called Hadoop

• Hadoop:
  – 1 TB sorting record in 209 seconds (a little less than 3 minutes!) using 910 machines

• Google's MapReduce implementation:
  – 1 TB sorted in 68 seconds using 1,000 machines

• Question:
  – Why was Google able to sort about 3 times faster than Hadoop?

Page 50: Distributed Computing & MapReduce

50

Hadoop and HPCNL Cluster

• Hadoop can be run on
  – Linux-like OSes
  – Windows with Cygwin

• Hadoop operating modes are:
  – Stand-alone
    • Good for development
  – Pseudo-distributed
    • Good for debugging / testing
  – Fully distributed

• Mapper / reducer code can be written in:
  – Anything executable on a Linux shell!!!
    • A C++ executable
    • Scripts like Python, shell scripts, etc.
    • A Java program

Let's have a visual tour of Hadoop!

Page 51: Distributed Computing & MapReduce

51

Cluster Summary

Page 52: Distributed Computing & MapReduce

52

Worker State

Page 53: Distributed Computing & MapReduce

53

Job Status

Page 54: Distributed Computing & MapReduce

54

Graphical Progress Report

Page 55: Distributed Computing & MapReduce

55

Speedy Machines do More Work

Page 56: Distributed Computing & MapReduce

56

Backup jobs

Page 57: Distributed Computing & MapReduce

57

Failures are Frequent

Page 58: Distributed Computing & MapReduce

58

Backwards Progress

Page 59: Distributed Computing & MapReduce

59

Backwards Progress

Page 60: Distributed Computing & MapReduce

60

Blacklisting

Page 61: Distributed Computing & MapReduce

61

Slow Machines

Page 62: Distributed Computing & MapReduce

62

Stragglers

Page 63: Distributed Computing & MapReduce

63

Multiple Backup Tasks

Page 64: Distributed Computing & MapReduce

64

Sequential WordCount

Page 65: Distributed Computing & MapReduce

65

Word Counting Mapper and Reducer

• Mapper code
• Reducer code
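
The mapper and reducer listings on this slide were screenshots in the original deck and are not reproduced here; below is a hedged reconstruction following the standard Hadoop (new API) word-count example, not necessarily the exact code used in the lab:

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {

        // Mapper: for every token in the input line, emit the pair <word, 1>.
        public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();

            public void map(Object key, Text value, Context context)
                    throws IOException, InterruptedException {
                StringTokenizer itr = new StringTokenizer(value.toString());
                while (itr.hasMoreTokens()) {
                    word.set(itr.nextToken());
                    context.write(word, ONE);
                }
            }
        }

        // Reducer: sum the 1s collected for each word and emit <word, total>.
        public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
            public void reduce(Text key, Iterable<IntWritable> values, Context context)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable v : values) sum += v.get();
                context.write(key, new IntWritable(sum));
            }
        }

        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "word count");
            job.setJarByClass(WordCount.class);
            job.setMapperClass(TokenizerMapper.class);
            job.setCombinerClass(IntSumReducer.class);   // summing is commutative and associative, so the reducer doubles as a combiner
            job.setReducerClass(IntSumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

Packaged as a jar, such a job would typically be launched with something like "hadoop jar wordcount.jar WordCount <input dir> <output dir>".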

Page 66: Distributed Computing & MapReduce

66

Concluding Remarks

• Master MapReduce / Hadoop because of its wide applicability
  – Today's lab exercises get you started
  – Utilize it in your term / final-year projects
  – Build something in your free time which uses MapReduce

• MapReduce adds another tool to your repertoire of parallel programming
  – Using the right tool at the right time remains your responsibility!
  – MapReduce is widely applicable, but you can't use it for every problem!
  – Misusing MapReduce for a problem where some other tool might be better will most probably result in degraded performance

Any more questions?