Distributed Computing & MapReduce
Presented by: Abdul Qadeer
2
Today's Agenda
• Distributed Computing
  – What is it?
  – Why & when do we need it?
  – Comparison with centralized computing
• 'MapReduce' (MR) Framework
  – Theory and practice
• 'MapReduce' in Action
  – Using Hadoop
  – Lab exercises
Feel free to comment / ask questions anytime!
3
Distributed Computing
4
Distributed Computing
• A way of computing where:
  – The software system has multiple sub-components
  – Sub-components are loosely coupled over a network
  – The placement of, and the algorithms of interaction between, sub-components are constructed to meet system goals
  – System goals can be:
    • Fault tolerance
    • Scalability in some design dimension (e.g. increasing system load)
    • Reliability
    • Performance (e.g. many computers working together)
    • Cost (better yet, performance / $)
    • Usability / accessibility / ease of use (e.g. a distributed file system)
    • Easy and cost-effective software maintenance (e.g. ChromeBook)
    • Etc.
5
Distributed Computing - What
• Everything on a single computer
  – E.g. an old POS system in a small shop
• Use of multiple computers
  – Client–server model
  – Multiple tiers (2 .. N)
  – Logical / physical tiers
    • Depends on complexity, load …
6
Distributed Computing - Why
• A single computer has limitations
  – Computability limitation
  – Memory (storage) limitation
  – Bandwidth limitation
• But these limitations are relative
  – Today a single computer is darn powerful!
    • E.g. Ethane – taking control of the enterprise
  – Multiple cores (8-core systems common)
  – 6 to 8 GB memory (up to 3 TB hard disk)
  – 10 Gbps network interfaces
7
Distributed Computing - Why
• Reliability
  – Failure model
  – Probability of failure
  – The concept of redundancy
• Scalability in different dimensions
  – Load balancing (distribution or partitioning)
  – MSN Messenger user example
  – Scalability always has dimensions
• Cost
  – 1 beefy server vs. many commodity machines
  – Economy of scale
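The redundancy argument can be made concrete with a little arithmetic: if each machine fails independently with probability p, the chance that at least one machine in an n-node cluster fails is 1 - (1 - p)^n. A minimal sketch (the 1% per-day failure rate is an illustrative assumption, not a figure from this lecture):

```python
def p_any_failure(p: float, n: int) -> float:
    # Probability that at least one of n machines fails, assuming
    # independent failures with per-machine probability p.
    return 1 - (1 - p) ** n

print(round(p_any_failure(0.01, 1), 4))    # a single machine: 1% risk
print(round(p_any_failure(0.01, 100), 4))  # a 100-machine cluster: ~63% risk
```

This is why large clusters must treat failure as the normal case and build redundancy in, rather than hoping no machine fails.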
8
Distributed Computing
• Can use 'toy problems' to learn different concepts of distributed computing
• Real-world use should be based on a real need
  – There are many real-world problems where distributed computing is needed
• Large-scale web applications
  – Computers, cell phones, etc.
• Data post-processing
  – Finding clues in data, weather forecasting
• Many scientific and HPC applications
  – Discussed previously in another lecture of this course
9
Clusters
10
How Clusters are Built?
• Two possible approaches
  – Big, expensive, capable server-class machines
    • Usually only governments had that kind of money!
    • There are limits beyond which a single machine can't go!
    • Any failure of a big machine means disruption to a large population of clients
  – Off-the-shelf, ordinary, cheap computers connected with ordinary Ethernet!
    • Cost effective in terms of operational cost
    • Failure of a machine only disrupts a small portion of the operations
    • Most industry clusters are built like this!
11
How Clusters are Built?
• "In pioneer days they used oxen for heavy pulling, and when one ox couldn't budge a log, they didn't try to grow a larger ox. We shouldn't be trying for bigger computers, but for more systems of computers."
  — Grace Hopper
Table generated at: http://top500.org/stats/list/37/archtype
12
How Clusters are Built?
13
How Clusters are Built?
• Mostly 64 bit processors used
14
How Clusters are Built?
• Ethernet holds a major share
15
How Clusters are Used?
• Different ways depending on the consumer
  – Academic use
    • The end user logs in to 'head nodes'
    • Head nodes are beefy systems used for non-compute-intensive tasks, or tasks which have a good mix of compute and I/O, e.g. compiling code
    • Cluster machines are usually not accessible directly; they are only accessible via a head node
    • Machine acquisition, job submission, job monitoring (e.g. using the qsub command in a PBS-based system)
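For concreteness, a PBS job submission might look like the following sketch. The job name, queue name, and resource requests are hypothetical examples, not values from the lecture or the HPCNL cluster:

```shell
#!/bin/sh
# wordcount.pbs -- submitted from a head node with: qsub wordcount.pbs
#PBS -N wordcount            # job name shown in the queue
#PBS -l nodes=4:ppn=2        # request 4 cluster machines, 2 processors each
#PBS -l walltime=01:00:00    # kill the job if it runs longer than one hour
#PBS -q batch                # hypothetical queue name

cd "$PBS_O_WORKDIR"          # PBS starts jobs in $HOME by default
./wordcount input.txt > output.txt
```

Progress can then be monitored from the head node with qstat, matching the acquire / submit / monitor workflow described above.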
16
How Clusters are Used?
• Different ways depending on the consumer
  – Commercial use
    • Infrastructure as a service
      – Raw machines
      – E.g. AWS selling machine instances
    • Platform as a service – machine + OS + other software stack
      – E.g. Google's App Engine
    • Application as a service
      – Using Gmail, Google Search, etc. via a browser
      – Google Docs
    • Google programmers sharing cluster machines using proprietary software
17
MapReduce
18
The Problem at Hand!
• Many organizations are data centric
  – Google, Yahoo, Bing, etc. (search engines)
  – Facebook, Twitter, MySpace, etc. (social networking, blogs)
  – NYT
  – Stock exchanges
• Data is increasing at a high rate
  – New web pages added each day
  – The NY stock exchange produces 1 TB of data every day!
19
The Goal
• To "process" data:
  – In reasonable time
    • A single machine has capability limits!!!
  – Cost effectively
    • Cost coming from hardware resources
      – No supercomputers!!
      – Not even high-end server machines
    • Cost coming from programmer hours
      – Applicability (a framework for similar problems)
      – Simplicity
      – No specific / expensive programmer training
20
Elaboration by Example
• Word count
• Input:
  – Data size = 20 TB
• Output:
  – A file as follows:
    <word1 12>
    <word2 30>
    <word3 34>
    …
• Solution?
  – Single-machine based solutions
    • Don't scale well!!!
  – Use multiple machines (MPI?)
21
Solution 1
• Pseudo code
  – (a) Make a dynamically extendable hash table such that a word will be the "key" and an integer count will be its "value"
  – (b) while (read a line == true)
    • (c) parse the line on the basis of space, tab, newline
    • (d) for each word parsed in step (c)
      » (e) insert or update <word, oldval + 1> in hashtable
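The pseudo code above maps directly onto a few lines of Python; a dict plays the role of the dynamically extendable hash table (the file path is a placeholder):

```python
from collections import defaultdict

def word_count(path):
    counts = defaultdict(int)          # (a) word -> count hash table
    with open(path) as f:
        for line in f:                 # (b) read line by line
            for word in line.split():  # (c)/(d) split on space, tab, newline
                counts[word] += 1      # (e) insert or update <word, oldval + 1>
    return dict(counts)
```

The next slide asks what goes wrong when this simple loop meets a 20 TB input.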
22
What is wrong with Solution 1?
(a) Make a dynamically extendable hash table such that a word will be the "key" and an integer count will be its "value"
(b) while (read a line == true)
  (c) parse the line on the basis of space, tab, newline
  (d) for each word parsed in step (c)
    (e) insert or update <word, oldval + 1> in hashtable

• A big hashtable which might not fit in memory
• Might be using swap memory; a typical pattern of access might cause thrashing!
• Reading from disk is very slow (i.e. slow I/O)
• Expect frequent cache misses and hence poor performance
23
How to improve on Solution 1?
• Multithreaded application
  – Pros:
    • Can fully utilize the disk bandwidth (~64 MB/sec)
  – Cons:
    • Dividing the input file among threads
    • Locking on some hash table blocks (e.g. "the")
      – One lock per hash table (very poor performance!)
      – One lock per hash table element (lots of lock state!)
      – Deadlock prevention (lock-free design)
More cons than pros!
No guarantee of speed-up.
24
Solution 2
• Pseudo code
  – (a) While (read line == true) {
    • (b) Parse the line on white spaces, tabs, new lines
    • (c) For each word in step (b) do
      – (d) Emit the tuple <word 1>
    } // end of while
  – (e) Sort all the tuples emitted in step (d) on the basis of word
  – (f) For the list of tuples produced in step (e) {
          sum up similar words and emit <word final count>
        }
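A direct Python transcription of Solution 2. Note that the aggregation in step (f) needs only a single pass, because the sort in step (e) puts all tuples for the same word next to each other:

```python
import itertools

def word_count_sorted(lines):
    tuples = []
    for line in lines:                        # (a) read line by line
        for word in line.split():             # (b)/(c) parse on whitespace
            tuples.append((word, 1))          # (d) emit the tuple <word 1>
    tuples.sort(key=lambda kv: kv[0])         # (e) sort on the word
    result = []
    for word, group in itertools.groupby(tuples, key=lambda kv: kv[0]):
        result.append((word, sum(c for _, c in group)))  # (f) sum similar words
    return result
```

This is exactly the structure MapReduce later automates: steps (a)-(d) are the map side, (e) is the shuffle/sort, and (f) is the reduce side.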
25
Solution 2 Elaboration
• Example text
  – "the quick brown fox jumps over the lazy dog"
• Step 1:
  – <0, the quick brown fox jumps over the lazy dog>
  – <the 1>, <quick 1>, <brown 1>, <fox 1>, <jumps 1>, <over 1>, <the 1>, <lazy 1>, <dog 1>
• Step 2:
  – <brown 1>, <dog 1>, <fox 1>, <jumps 1>, <lazy 1>, <over 1>, <quick 1>, <the 1>, <the 1>
  – Emit output <key, value> pairs
    • <brown 1>, <dog 1>, <fox 1>, <jumps 1>, <lazy 1>, <over 1>, <quick 1>, <the 2>
26
What is wrong with Solution 2?
• Pros:
  – No locking
  – No concurrency to handle in the hash map
• Cons:
  – Reading from disk (slow I/O again!)
  – Intermediate keys might not fit in memory (data too big to fit!)
  – An external sort might need to be used
  – The sorting might become a bottleneck
27
How to improve Solution 2
• Most of the problems are related to "scalability"
• There is only so much you can extract from one machine!
• Use multiple networked machines?
  – Distributed computing intricacies
    • How to divide work efficiently (who does what)
    • Scalability of the solution with increasing problem size
    • Reliability, fault tolerance
    • Network-related problems (link failures, delays, etc.)
So ideally we need a solution which takes care of the messy / complicated parallelism and distributed computing details
28
What is MapReduce
• MapReduce is a framework to process enormous data (~multi-terabyte):
  – Efficiently
  – Using a cluster of ordinary machines
  – With linear scalability
  – Simply, from the end programmer's point of view
    • Only write two small pieces of code and that's it!
• Not a general parallel programming model!
  – Not every parallel problem is solvable by it
    • E.g. producer–consumer problems
  – MR is good where sub-tasks either do not communicate with each other, or any communication can be handled at the "map end and reduce start" stage
Fig. 1 taken from the OSDI 2004 paper: MapReduce: Simplified Data Processing on Large Clusters
29
Nuts and Bolts of MapReduce
The application programmer writes mapper and reducer code. The rest is automatic! The messy details of parallelism, scalability, and fault tolerance are taken care of by the framework.
30
Word Count Example
• Example text
  – "the quick brown fox jumps over the lazy dog"
• Step 1 – Split input:
  – The split is usually based on the size of the input data and the available machines (assume only one machine for this example!)
• Step 2 – Map phase:
  – <0, the quick brown fox jumps over the lazy dog>
  – <the 1>, <quick 1>, <brown 1>, <fox 1>, <jumps 1>, <over 1>, <the 1>, <lazy 1>, <dog 1>
• Step 3 – Distribute:
  – If there is only one reducer, all the intermediate key-value pairs are placed in a single intermediate file
• Step 4 – Reduce:
  – Copies the intermediate file locally and sorts it on the key
  – <brown 1>, <dog 1>, <fox 1>, <jumps 1>, <lazy 1>, <over 1>, <quick 1>, <the 1>, <the 1>
  – Emit output <key, value> pairs
    • <brown 1>, <dog 1>, <fox 1>, <jumps 1>, <lazy 1>, <over 1>, <quick 1>, <the 2>
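The four steps above can be simulated in a few lines of Python, with the mapper and reducer written as the two small pieces of code a MapReduce programmer supplies (a single-machine, single-reducer sketch, not a real framework):

```python
from collections import defaultdict

def mapper(offset, line):
    # <offset, line>  ->  [<word, 1>, <word, 1>, ...]
    return [(word, 1) for word in line.split()]

def reducer(word, counts):
    # <word, [1, 1, ...]>  ->  <word, total>
    return (word, sum(counts))

def run_mapreduce(lines):
    intermediate = defaultdict(list)           # step 3: one reducer, one "file"
    for offset, line in enumerate(lines):      # step 2: map phase
        for key, value in mapper(offset, line):
            intermediate[key].append(value)
    # step 4: sort on key, then reduce each group
    return [reducer(k, intermediate[k]) for k in sorted(intermediate)]
```

Everything outside mapper and reducer is what the framework provides; the programmer only fills in those two functions.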
31
Word Count Example
• 2 Mappers and 2 Reducers
• Example text
  – "the quick brown fox jumps over the lazy dog"
• Step 1 – Split input:
  – the quick brown fox
  – jumps over the lazy dog
• Step 2 – Map phase:
  – Mapper 1:
    • <0, the quick brown fox>
    • <the 1>, <quick 1>, <brown 1>, <fox 1>
  – Mapper 2:
    • <0, jumps over the lazy dog>
    • <jumps 1>, <over 1>, <the 1>, <lazy 1>, <dog 1>
32
Word Count Example
• Step 3 – Distribute:
  – Two intermediate files per mapper, because there are 2 reducers
  – A hash function is used to place each intermediate key-value pair in one of the "buckets"
    • Words starting with A to M (capital or small) go in bucket A, the others in B
  – For Mapper 1:
    • File A will have the pairs: <brown 1>, <fox 1>
    • File B will have the pairs: <the 1>, <quick 1>
  – For Mapper 2:
    • File A will have the pairs: <jumps 1>, <lazy 1>, <dog 1>
    • File B will have the pairs: <over 1>, <the 1>
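The "A to M in bucket A" rule from this slide can be written as a small partition function. Real frameworks typically use hash(key) mod R for R reducers, but the mechanism is the same; here is Mapper 1's distribute step as a sketch:

```python
def partition(word):
    # Bucket A for words starting with a-m (capital or small), else bucket B.
    return "A" if word[:1].lower() <= "m" else "B"

# Mapper 1 writes its pairs into two local intermediate "files" (lists here),
# one per reducer:
pairs = [("the", 1), ("quick", 1), ("brown", 1), ("fox", 1)]
buckets = {"A": [], "B": []}
for word, count in pairs:
    buckets[partition(word)].append((word, count))
# buckets["A"] -> [("brown", 1), ("fox", 1)]      (fetched by Reducer 1)
# buckets["B"] -> [("the", 1), ("quick", 1)]      (fetched by Reducer 2)
```

Because every mapper uses the same partition function, all pairs for a given word end up at the same reducer no matter which mapper emitted them.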
33
Word Count Example
• Step 4 – Reduce
  – Reducer 1:
    • Fetches file A from both mapper 1 and mapper 2
    • Merges them and sorts on key
      – <brown 1>, <dog 1>, <fox 1>, <jumps 1>, <lazy 1>
    • Emits final <key, value> pairs
  – Reducer 2:
    • Fetches file B from both mapper 1 and mapper 2
    • Merges them and sorts on key
      – <over 1>, <quick 1>, <the 1>, <the 1>
    • Emits:
      – <over 1>, <quick 1>, <the 2>
  – So the final output has 2 files
    • Merge them or feed them to a next-stage mapper
34
Word Count Example – A Quiz
• Assume:
  – I have a huge text file
  – 50,000 mappers, 2 reducers
  – The same hash function is used
• Questions:
  – The pair <the 1> will always end up at Reducer 2: True or False?
  – Each mapper will produce how many intermediate files?
    • (a) 50,000  (b) 2  (c) 100,000
35
Why Split Data
• Exploiting locality
  – Data on the same local disk (best case)
  – I/O operations are too slow
  – Somewhere in the same rack
  – Other nearby metrics
  – In GFS, by default each chunk of data is 64 MB long, and usually each chunk is replicated at three different places
• Utilizing many machines
  – Each split is given to a different machine to work on
  – Increasing parallelism
• Fine-grained splits are better
  – Faster machines can do more work
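For a sense of scale: splitting the 20 TB input of the earlier word-count example into GFS-sized 64 MB chunks yields hundreds of thousands of independent work units, which is what lets faster machines pick up more work:

```python
def num_splits(data_bytes, chunk_bytes=64 * 2**20):
    # Ceiling division: a partial final chunk still needs its own split.
    return -(-data_bytes // chunk_bytes)

print(num_splits(20 * 2**40))  # 20 TB in 64 MB chunks -> 327680 splits
```

With far more splits than machines, the scheduler can hand out splits as workers finish, so a machine that is twice as fast simply processes twice as many splits.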
36
The Power of Splitting
37
Fault Tolerance
• Worker failures– Machines can fail, disks can crash– Tasks are re-scheduled on some other machines– Idempotent operations make it simple
• Master Failure– Might be a single point of failure– Simple mechanisms like writing checkpoints can solve the problem– Replicate it (Approach proposed by Apache Hadoop)– Presents an important point in the design space
• Introduce as much complexity in the system as necessary
38
Backup Tasks
• Straggler jobs
  – Slow jobs due to a failing hard disk
    • 30 MB/s dropping to 1 MB/s
  – Some other job running on the same machine
    • Jobs competing for CPU / disk I/O
• Run duplicate (backup) tasks
  – Whichever copy finishes first is taken
  – The other duplicate is killed
Fig. 3 taken from the OSDI 2004 paper: MapReduce: Simplified Data Processing on Large Clusters
39
Performance due to Backup Tasks
40
Combiners
• A way to condense the number of intermediate keys
  – If, e.g., the keys are as follows:
    • <the 1>, <the 1> …
    they are condensed locally to:
    • <the n>
  – Reducing the needed bandwidth
• Only applicable if the operation is commutative and associative
  – Counting qualifies (summing up is comm. & assoc.)
  – Mean does not!
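A combiner is essentially the reducer run locally on each mapper's output before it crosses the network. A quick sketch of why counting qualifies and mean does not:

```python
# Counting: combining partial sums gives the same answer as one global sum,
# because addition is commutative and associative.
partial = [sum([1, 1, 1]), sum([1, 1])]      # two mappers locally combine <the 1> pairs
assert sum(partial) == sum([1, 1, 1, 1, 1])  # same result as with no combiner

# Mean: averaging partial means is NOT the global mean in general.
def mean(xs):
    return sum(xs) / len(xs)

a, b = [1, 2, 3], [10]
print(mean([mean(a), mean(b)]))  # 6.0 -- average of the two partial means
print(mean(a + b))               # 4.0 -- the true global mean
```

(Mean can still use a combiner if it ships <sum, count> pairs instead of means, since both of those are summable, but the mean operation itself is not combinable.)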
41
Example Usage of MapReduce
• Search (grep)
• Sort
• Large-scale indexing
• Count of URL access frequency
• Reverse web-link graph
• Term-vector per host
• Inverted index
• Etc.
Fig. taken from the OSDI 2004 paper: MapReduce: Simplified Data Processing on Large Clusters
42
Web Search
• Google's web search indexing system was re-written on top of MapReduce
43
Yahoo Now Using MapReduce
• Yahoo followed suit after 4 years
44
Case Studies
45
Case Study 1: New York Times
• Make the 1851 to 1980 newspaper articles available online
  – There are about 11 million articles
  – Articles were stored as scanned images
  – On getting a request, these images were glued together on the fly
    • Can be slow
    • Can be stupid! (may be doing redundant work)
• The new design
  – Glue up all the articles and make PDF documents ahead of time
• EC2, S3, and Hadoop
  – Uploaded 4 TB to S3
  – 100 EC2 instances did the work in 24 hours
  – 1.5 TB of output data, again stored in S3
  – Data served to clients from S3
46
Case Study 2: IPv4 Census
• John Heidemann et al. conducted an IPv4 census
  – About 4 billion IPv4 addresses
  – Addresses about to be exhausted
  – Seeing usability trends using pings and their responses
• A Hilbert curve is used to present the 32-bit IP responses in 2 dimensions
47
Case Study 2: IPv4 Census
• Lab machines used to run Hadoop
  – Machines share resources with other processes
  – The Hadoop process has normal priority; it can be pre-empted by higher-priority processes
  – Machine sharing works as in Wisconsin's Condor project
  – The cluster has about 20 machines
  – File / data sharing using NFS and HDFS mounts
  – Improvised use of NFS to deploy the latest version of Hadoop
• Census data
  – Each census file is about 37 GB in size
48
Hadoop
49
MapReduce in Action!
• MapReduce's open source implementation by Apache is called Hadoop
• Hadoop:
  – 1 TB sorting record in 209 seconds (just under 3.5 minutes!) using 910 machines
• Google's MapReduce implementation:
  – 1 TB sorted in 68 seconds using 1000 machines
• Question:
  – Why was Google able to sort 3 times faster than Hadoop?
50
• Hadoop can be run on
  – Linux-like OSes
  – Windows with Cygwin
• Hadoop's operating modes are:
  – Stand-alone
    • Good for development
  – Pseudo-distributed
    • Good for debugging / testing
  – Fully distributed
• Mapper / reducer code can be written in:
  – Anything executable on a Linux shell!!!
    • C++ executables
    • Scripts like Python, shell scripts, etc.
    • Java programs
Hadoop and HPCNL Cluster
Let's have a visual tour of Hadoop!
51
Cluster Summary
52
Worker State
53
Job Status
54
Graphical Progress Report
55
Speedy Machines do More Work
56
Backup jobs
57
Failures are Frequent
58
Backwards Progress
59
Backwards Progress
60
Blacklisting
61
Slow Machines
62
Stragglers
63
Multiple Backup Tasks
64
Sequential WordCount
65
Word Counting Mapper and Reducer
• Mapper code
• Reducer code
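The slide shows the mapper and reducer as code screenshots, which did not survive the transcript. As a stand-in, here is what an equivalent pair looks like for Hadoop Streaming, where (as noted earlier) the mapper and reducer can be any shell-executable program. This is a sketch, not the code from the slide; under Streaming each function would be its own script reading lines from stdin and printing tab-separated pairs to stdout:

```python
def mapper(stream):
    # Emit one "<word>\t1" record for every word in the input.
    for line in stream:
        for word in line.split():
            yield f"{word}\t1"

def reducer(stream):
    # Hadoop sorts the mapper output by key, so equal words arrive
    # adjacent; a single pass with a running total is enough.
    current, total = None, 0
    for line in stream:
        word, count = line.rsplit("\t", 1)
        if word != current:
            if current is not None:
                yield f"{current}\t{total}"
            current, total = word, 0
        total += int(count)
    if current is not None:
        yield f"{current}\t{total}"
```

A Streaming job would then be launched with something like `hadoop jar hadoop-streaming.jar -input in -output out -mapper mapper.py -reducer reducer.py` (exact jar path depends on the Hadoop installation).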
66
Concluding Remarks
• Master MapReduce / Hadoop because of its wide applicability
  – Today's lab exercises get you started
  – Utilize it in your term / final-year projects
  – Build something in your free time which uses MapReduce
• MapReduce adds another tool to your repertoire of parallel programming
  – Using the right tool at the right time remains your responsibility!
  – MapReduce is widely applicable, but you can't use it for every problem!
  – Misusing MapReduce on a problem where some other tool might be better will most probably result in degraded performance
Any more questions?