Distributed Computing & MapReduce
Presented by: Abdul Qadeer
2
Today's Agenda
• Distributed Computing
  – What is it?
  – Why & when do we need it?
  – Comparison with centralized computing
• 'MapReduce' (MR) Framework
  – Theory and practice
• 'MapReduce' in Action
  – Using Hadoop
  – Lab exercises
Feel free to comment / ask questions anytime!
3
Distributed Computing
4
Distributed Computing
• A way of computing where:
  – The software system has multiple sub-components
  – Sub-components are loosely coupled over a network
  – The placement of, and the algorithms of interaction between, sub-components are constructed to meet system goals
  – System goals can be:
    • Fault tolerance
    • Scalability in some design dimension (e.g. increasing system load)
    • Reliability
    • Performance (e.g. many computers working together)
    • Cost (better yet, performance / $)
    • Usability / accessibility / ease of use (e.g. a distributed file system)
    • Easy and cost-effective software maintenance (e.g. ChromeBook)
    • Etc.
5
Distributed Computing - What
• Everything on a single computer
  – E.g. an old POS system in a small shop
• Use of multiple computers
  – Client–server model
  – Multiple tiers (2 .. N)
  – Logical / physical tiers
    • Depends on complexity, load …
6
Distributed Computing - Why
• A single computer has limitations
  – Computability limitation
  – Memory (storage) limitation
  – Bandwidth limitation
• But these limitations are relative
  – Today a single computer is darn powerful!
    • E.g. Ethane – taking control of the enterprise
  – Multiple cores (8-core systems common)
  – 6 to 8 GB memory (up to 3 TB hard disk)
  – 10 Gbps network interfaces
7
Distributed Computing - Why
• Reliability
  – Failure model
  – Probability of failure
  – The concept of redundancy
• Scalability in different dimensions
  – Load balancing (distribution or partitioning)
  – MSN Messenger user example
  – Scalability always has dimensions
• Cost
  – 1 beefy server vs. many commodity machines
  – Economy of scale
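The redundancy argument can be made concrete with a little arithmetic: if each machine fails independently with probability p, the chance that at least one machine in an n-node cluster fails is 1 - (1 - p)^n. A minimal sketch (the 1% per-day failure rate is an illustrative assumption, not a figure from this lecture):

```python
def p_any_failure(p: float, n: int) -> float:
    # Probability that at least one of n machines fails, assuming
    # independent failures with per-machine probability p.
    return 1 - (1 - p) ** n

print(round(p_any_failure(0.01, 1), 4))    # a single machine: 1% risk
print(round(p_any_failure(0.01, 100), 4))  # a 100-machine cluster: ~63% risk
```

This is why large clusters must treat failure as the normal case and build redundancy in, rather than hoping no machine fails.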
8
Distributed Computing
• Can use 'toy problems' to learn different concepts of distributed computing
• Real-world use should be based on a real need
  – There are many real-world problems where distributed computing is needed
• Large-scale web applications
  – Computers, cell phones, etc.
• Data post-processing
  – Finding clues in data, weather forecasting
• Many scientific and HPC applications
  – Discussed previously in another lecture of this course
9
Clusters
10
How Clusters are Built?
• Two possible approaches
  – Big, expensive, capable server-class machines
    • Usually only governments had that kind of money!
    • There are limits beyond which a single machine can't go!
    • Any failure of a big machine means disruption to a large population of clients
  – Off-the-shelf, ordinary, cheap computers connected with ordinary Ethernet!
    • Cost effective in terms of operational cost
    • Failure of a machine only disrupts a small portion of the operations
    • Most industry clusters are built like this!
11
How Clusters are Built?
• "In pioneer days they used oxen for heavy pulling, and when one ox couldn't budge a log, they didn't try to grow a larger ox. We shouldn't be trying for bigger computers, but for more systems of computers."
  — Grace Hopper
Table generated at: http://top500.org/stats/list/37/archtype
12
How Clusters are Built?
13
How Clusters are Built?
• Mostly 64 bit processors used
14
How Clusters are Built?
• Ethernet holds a major share
15
How Clusters are Used?
• Different ways depending on the consumer
  – Academic use
    • The end user logs in to 'head nodes'
    • Head nodes are beefy systems used for non-compute-intensive tasks, or tasks which have a good mix of compute and I/O, e.g. compiling code
    • Cluster machines are usually not accessible directly; they are only accessible via a head node
    • Machine acquisition, job submission, job monitoring (e.g. using the qsub command in a PBS-based system)
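For concreteness, a PBS job submission might look like the following sketch. The job name, queue name, and resource requests are hypothetical examples, not values from the lecture or the HPCNL cluster:

```shell
#!/bin/sh
# wordcount.pbs -- submitted from a head node with: qsub wordcount.pbs
#PBS -N wordcount            # job name shown in the queue
#PBS -l nodes=4:ppn=2        # request 4 cluster machines, 2 processors each
#PBS -l walltime=01:00:00    # kill the job if it runs longer than one hour
#PBS -q batch                # hypothetical queue name

cd "$PBS_O_WORKDIR"          # PBS starts jobs in $HOME by default
./wordcount input.txt > output.txt
```

Progress can then be monitored from the head node with qstat, matching the acquire / submit / monitor workflow described above.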
16
How Clusters are Used?
• Different ways depending on the consumer
  – Commercial use
    • Infrastructure as a service
      – Raw machines
      – E.g. AWS selling machine instances
    • Platform as a service – machine + OS + other software stack
      – E.g. Google's App Engine
    • Application as a service
      – Using Gmail, Google Search, etc. via a browser
      – Google Docs
    • Google programmers sharing cluster machines using proprietary software
17
MapReduce
18
The Problem at Hand!
• Many organizations are data centric
  – Google, Yahoo, Bing, etc. (search engines)
  – Facebook, Twitter, MySpace, etc. (social networking, blogs)
  – NYT
  – Stock exchanges
• Data is increasing at a high rate
  – New web pages added each day
  – The NY stock exchange produces 1 TB of data every day!
19
The Goal
• To "process" data:
  – In reasonable time
    • A single machine has capability limits!!!
  – Cost effectively
    • Cost coming from hardware resources
      – No supercomputers!!
      – Not even high-end server machines
    • Cost coming from programmer hours
      – Applicability (a framework for similar problems)
      – Simplicity
      – No specific / expensive programmer training
20
Elaboration by Example
• Word count
• Input:
  – Data size = 20 TB
• Output:
  – A file as follows:
    <word1 12>
    <word2 30>
    <word3 34>
    …
• Solution?
  – Single-machine based solutions
    • Don't scale well!!!
  – Use multiple machines (MPI?)
21
Solution 1
• Pseudo code
  – (a) Make a dynamically extendable hash table such that a word will be the "key" and an integer count will be its "value"
  – (b) while (read a line == true)
    • (c) parse the line on the basis of space, tab, newline
    • (d) for each word parsed in step (c)
      » (e) insert or update <word, oldval + 1> in hashtable
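The pseudo code above maps directly onto a few lines of Python; a dict plays the role of the dynamically extendable hash table (the file path is a placeholder):

```python
from collections import defaultdict

def word_count(path):
    counts = defaultdict(int)          # (a) word -> count hash table
    with open(path) as f:
        for line in f:                 # (b) read line by line
            for word in line.split():  # (c)/(d) split on space, tab, newline
                counts[word] += 1      # (e) insert or update <word, oldval + 1>
    return dict(counts)
```

The next slide asks what goes wrong when this simple loop meets a 20 TB input.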
22
What is wrong with Solution 1?
(a) Make a dynamically extendable hash table such that a word will be the "key" and an integer count will be its "value"
(b) while (read a line == true)
  (c) parse the line on the basis of space, tab, newline
  (d) for each word parsed in step (c)
    (e) insert or update <word, oldval + 1> in hashtable

• A big hashtable which might not fit in memory
• Might be using swap memory; a typical pattern of access might cause thrashing!
• Reading from disk is very slow (i.e. slow I/O)
• Expect frequent cache misses and hence poor performance
23
How to improve on Solution 1?
• Multithreaded application
  – Pros:
    • Can fully utilize the disk bandwidth (~64 MB/sec)
  – Cons:
    • Dividing the input file among threads
    • Locking on some hash table blocks (e.g. "the")
      – One lock per hash table (very poor performance!)
      – One lock per hash table element (lots of lock state!)
      – Deadlock prevention (lock-free design)
More cons than pros!
No guarantee of speed-up.
24
Solution 2
• Pseudo code
  – (a) While (read line == true) {
    • (b) Parse the line on white spaces, tabs, new lines
    • (c) For each word in step (b) do
      – (d) Emit the tuple <word 1>
    } // end of while
  – (e) Sort all the tuples emitted in step (d) on the basis of word
  – (f) For the list of tuples produced in step (e) {
          sum up similar words and emit <word final count>
        }
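A direct Python transcription of Solution 2. Note that the aggregation in step (f) needs only a single pass, because the sort in step (e) puts all tuples for the same word next to each other:

```python
import itertools

def word_count_sorted(lines):
    tuples = []
    for line in lines:                        # (a) read line by line
        for word in line.split():             # (b)/(c) parse on whitespace
            tuples.append((word, 1))          # (d) emit the tuple <word 1>
    tuples.sort(key=lambda kv: kv[0])         # (e) sort on the word
    result = []
    for word, group in itertools.groupby(tuples, key=lambda kv: kv[0]):
        result.append((word, sum(c for _, c in group)))  # (f) sum similar words
    return result
```

This is exactly the structure MapReduce later automates: steps (a)-(d) are the map side, (e) is the shuffle/sort, and (f) is the reduce side.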
25
Solution 2 Elaboration
• Example text
  – "the quick brown fox jumps over the lazy dog"
• Step 1:
  – <0, the quick brown fox jumps over the lazy dog>
  – <the 1>, <quick 1>, <brown 1>, <fox 1>, <jumps 1>, <over 1>, <the 1>, <lazy 1>, <dog 1>
• Step 2:
  – <brown 1>, <dog 1>, <fox 1>, <jumps 1>, <lazy 1>, <over 1>, <quick 1>, <the 1>, <the 1>
  – Emit output <key, value> pairs
    • <brown 1>, <dog 1>, <fox 1>, <jumps 1>, <lazy 1>, <over 1>, <quick 1>, <the 2>
26
What is wrong with Solution 2?
• Pros:
  – No locking
  – No concurrency to handle in the hash map
• Cons:
  – Reading from disk (slow I/O again!)
  – Intermediate keys might not fit in memory (data too big to fit!)
  – An external sort might need to be used
  – The sorting might become a bottleneck
27
How to improve Solution 2
• Most of the problems are related to "scalability"
• There is only so much you can extract from one machine!
• Use multiple networked machines?
  – Distributed computing intricacies
    • How to divide work efficiently (who does what)
    • Scalability of the solution with increasing problem size
    • Reliability, fault tolerance
    • Network-related problems (link failures, delays, etc.)
So ideally we need a solution which takes care of the messy / complicated parallelism and distributed computing details
28
What is MapReduce
• MapReduce is a framework to process enormous data (~multi-terabyte):
  – Efficiently
  – Using a cluster of ordinary machines
  – With linear scalability
  – Simply, from the end programmer's point of view
    • Only write two small pieces of code and that's it!
• Not a general parallel programming model!
  – Not every parallel problem is solvable by it
    • E.g. producer–consumer problems
  – MR is good where sub-tasks either do not communicate with each other, or any communication can be handled at the "map end and reduce start" stage
Fig. 1 taken from the OSDI 2004 paper: MapReduce: Simplified Data Processing on Large Clusters
29
Nuts and Bolts of MapReduce
The application programmer writes mapper and reducer code. The rest is automatic! The messy details of parallelism, scalability, and fault tolerance are taken care of by the framework.
30
Word Count Example
• Example text
  – "the quick brown fox jumps over the lazy dog"
• Step 1 – Split input:
  – The split is usually based on the size of the input data and the available machines (assume only one machine for this example!)
• Step 2 – Map phase:
  – <0, the quick brown fox jumps over the lazy dog>
  – <the 1>, <quick 1>, <brown 1>, <fox 1>, <jumps 1>, <over 1>, <the 1>, <lazy 1>, <dog 1>
• Step 3 – Distribute:
  – If there is only one reducer, all the intermediate key-value pairs are placed in a single intermediate file
• Step 4 – Reduce:
  – Copies the intermediate file locally and sorts it on the key
  – <brown 1>, <dog 1>, <fox 1>, <jumps 1>, <lazy 1>, <over 1>, <quick 1>, <the 1>, <the 1>
  – Emit output <key, value> pairs
    • <brown 1>, <dog 1>, <fox 1>, <jumps 1>, <lazy 1>, <over 1>, <quick 1>, <the 2>
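The four steps above can be simulated in a few lines of Python, with the mapper and reducer written as the two small pieces of code a MapReduce programmer supplies (a single-machine, single-reducer sketch, not a real framework):

```python
from collections import defaultdict

def mapper(offset, line):
    # <offset, line>  ->  [<word, 1>, <word, 1>, ...]
    return [(word, 1) for word in line.split()]

def reducer(word, counts):
    # <word, [1, 1, ...]>  ->  <word, total>
    return (word, sum(counts))

def run_mapreduce(lines):
    intermediate = defaultdict(list)           # step 3: one reducer, one "file"
    for offset, line in enumerate(lines):      # step 2: map phase
        for key, value in mapper(offset, line):
            intermediate[key].append(value)
    # step 4: sort on key, then reduce each group
    return [reducer(k, intermediate[k]) for k in sorted(intermediate)]
```

Everything outside mapper and reducer is what the framework provides; the programmer only fills in those two functions.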
31
Word Count Example
• 2 Mappers and 2 Reducers
• Example text
  – "the quick brown fox jumps over the lazy dog"
• Step 1 – Split input:
  – the quick brown fox
  – jumps over the lazy dog
• Step 2 – Map phase:
  – Mapper 1:
    • <0, the quick brown fox>
    • <the 1>, <quick 1>, <brown 1>, <fox 1>
  – Mapper 2:
    • <0, jumps over the lazy dog>
    • <jumps 1>, <over 1>, <the 1>, <lazy 1>, <dog 1>
32
Word Count Example
• Step 3 – Distribute:
  – Two intermediate files per mapper, because there are 2 reducers
  – A hash function is used to place each intermediate key-value pair in one of the "buckets"
    • Words starting with A to M (capital or small) go in bucket A, the others in B
  – For Mapper 1:
    • File A will have the pairs: <brown 1>, <fox 1>
    • File B will have the pairs: <the 1>, <quick 1>
  – For Mapper 2:
    • File A will have the pairs: <jumps 1>, <lazy 1>, <dog 1>
    • File B will have the pairs: <over 1>, <the 1>
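The "A to M in bucket A" rule from this slide can be written as a small partition function. Real frameworks typically use hash(key) mod R for R reducers, but the mechanism is the same; here is Mapper 1's distribute step as a sketch:

```python
def partition(word):
    # Bucket A for words starting with a-m (capital or small), else bucket B.
    return "A" if word[:1].lower() <= "m" else "B"

# Mapper 1 writes its pairs into two local intermediate "files" (lists here),
# one per reducer:
pairs = [("the", 1), ("quick", 1), ("brown", 1), ("fox", 1)]
buckets = {"A": [], "B": []}
for word, count in pairs:
    buckets[partition(word)].append((word, count))
# buckets["A"] -> [("brown", 1), ("fox", 1)]      (fetched by Reducer 1)
# buckets["B"] -> [("the", 1), ("quick", 1)]      (fetched by Reducer 2)
```

Because every mapper uses the same partition function, all pairs for a given word end up at the same reducer no matter which mapper emitted them.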
33
Word Count Example
• Step 4 – Reduce
  – Reducer 1:
    • Fetches file A from both mapper 1 and mapper 2
    • Merges them and sorts on key
      – <brown 1>, <dog 1>, <fox 1>, <jumps 1>, <lazy 1>
    • Emits final <key, value> pairs
  – Reducer 2:
    • Fetches file B from both mapper 1 and mapper 2
    • Merges them and sorts on key
      – <over 1>, <quick 1>, <the 1>, <the 1>
    • Emits:
      – <over 1>, <quick 1>, <the 2>
  – So the final output has 2 files
    • Merge them or feed them to a next-stage mapper
34
Word Count Example – A Quiz
• Assume:
  – I have a huge text file
  – 50,000 mappers, 2 reducers
  – The same hash function is used
• Questions:
  – The pair <the 1> will always end up at Reducer 2: True or False?
  – Each mapper will produce how many intermediate files?
    • (a) 50,000  (b) 2  (c) 100,000
35
Why Split Data
• Exploiting locality
  – Data on the same local disk (best case)
  – I/O operations are too slow
  – Somewhere in the same rack
  – Other nearby metrics
  – In GFS, by default each chunk of data is 64 MB long, and usually each chunk is replicated at three different places
• Utilizing many machines
  – Each split is given to a different machine to work on
  – Increasing parallelism
• Fine-grained splits are better
  – Faster machines can do more work
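For a sense of scale: splitting the 20 TB input of the earlier word-count example into GFS-sized 64 MB chunks yields hundreds of thousands of independent work units, which is what lets faster machines pick up more work:

```python
def num_splits(data_bytes, chunk_bytes=64 * 2**20):
    # Ceiling division: a partial final chunk still needs its own split.
    return -(-data_bytes // chunk_bytes)

print(num_splits(20 * 2**40))  # 20 TB in 64 MB chunks -> 327680 splits
```

With far more splits than machines, the scheduler can hand out splits as workers finish, so a machine that is twice as fast simply processes twice as many splits.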
36
The Power of Splitting
37
Fault Tolerance
• Worker failures– Machines can fail, disks can crash– Tasks are re-scheduled on some other machines– Idempotent operations make it simple
• Master Failure– Might be a single point of failure– Simple mechanisms like writing checkpoints can solve the problem– Replicate it (Approach proposed by Apache Hadoop)– Presents an important point in the design space
• Introduce as much complexity in the system as necessary
38
Backup Tasks
• Straggler jobs
  – Slow jobs due to a failing hard disk
    • 30 MB/s dropping to 1 MB/s
  – Some other job running on the same machine
    • Jobs competing for CPU / disk I/O
• Run duplicate (backup) tasks
  – Whichever copy finishes first is taken
  – The other duplicate is killed
Fig. 3 taken from the OSDI 2004 paper: MapReduce: Simplified Data Processing on Large Clusters
39
Performance due to Backup Tasks
40
Combiners
• A way to condense the number of intermediate keys
  – If, e.g., the keys are as follows:
    • <the 1>, <the 1> …
    they are condensed locally to:
    • <the n>
  – Reducing the needed bandwidth
• Only applicable if the operation is commutative and associative
  – Counting qualifies (summing up is comm. & assoc.)
  – Mean does not!
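A combiner is essentially the reducer run locally on each mapper's output before it crosses the network. A quick sketch of why counting qualifies and mean does not:

```python
# Counting: combining partial sums gives the same answer as one global sum,
# because addition is commutative and associative.
partial = [sum([1, 1, 1]), sum([1, 1])]      # two mappers locally combine <the 1> pairs
assert sum(partial) == sum([1, 1, 1, 1, 1])  # same result as with no combiner

# Mean: averaging partial means is NOT the global mean in general.
def mean(xs):
    return sum(xs) / len(xs)

a, b = [1, 2, 3], [10]
print(mean([mean(a), mean(b)]))  # 6.0 -- average of the two partial means
print(mean(a + b))               # 4.0 -- the true global mean
```

(Mean can still use a combiner if it ships <sum, count> pairs instead of means, since both of those are summable, but the mean operation itself is not combinable.)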
41
Example Usage of MapReduce
• Search (grep)
• Sort
• Large-scale indexing
• Count of URL access frequency
• Reverse web-link graph
• Term-vector per host
• Inverted index
• Etc.
Fig. taken from the OSDI 2004 paper: MapReduce: Simplified Data Processing on Large Clusters
42
Web Search
• Google's web search indexing system was re-written on top of MapReduce
43
Yahoo Now Using MapReduce
• Yahoo followed suit after 4 years
44
Case Studies
45
Case Study 1: New York Times
• Make the 1851 to 1980 newspaper articles available online
  – There are about 11 million articles
  – Articles were stored as scanned images
  – On getting a request, these images were glued together on the fly
    • Can be slow
    • Can be stupid! (may be doing redundant work)
• The new design
  – Glue up all the articles and make PDF documents ahead of time
• EC2, S3, and Hadoop
  – Uploaded 4 TB to S3
  – 100 EC2 instances did the work in 24 hours
  – 1.5 TB of output data, again stored in S3
  – Data served to clients from S3
46
Case Study 2: IPv4 Census
• John Heidemann et al. conducted an IPv4 census
  – About 4 billion IPv4 addresses
  – Addresses about to be exhausted
  – Seeing usability trends using pings and their responses
• A Hilbert curve is used to present the 32-bit IP responses in 2 dimensions
47
Case Study 2: IPv4 Census
• Lab machines used to run Hadoop
  – Machines share resources with other processes
  – The Hadoop process has normal priority; it can be pre-empted by higher-priority processes
  – Machine sharing works as in Wisconsin's Condor project
  – The cluster has about 20 machines
  – File / data sharing using NFS and HDFS mounts
  – Improvised use of NFS to deploy the latest version of Hadoop
• Census data
  – Each census file is about 37 GB in size
48
Hadoop
49
MapReduce in Action!
• MapReduce's open source implementation by Apache is called Hadoop
• Hadoop:
  – 1 TB sorting record in 209 seconds (just under 3.5 minutes!) using 910 machines
• Google's MapReduce implementation:
  – 1 TB sorted in 68 seconds using 1000 machines
• Question:
  – Why was Google able to sort 3 times faster than Hadoop?
50
• Hadoop can be run on
  – Linux-like OSes
  – Windows with Cygwin
• Hadoop's operating modes are:
  – Stand-alone
    • Good for development
  – Pseudo-distributed
    • Good for debugging / testing
  – Fully distributed
• Mapper / reducer code can be written in:
  – Anything executable on a Linux shell!!!
    • C++ executables
    • Scripts like Python, shell scripts, etc.
    • Java programs
Hadoop and HPCNL Cluster
Let's have a visual tour of Hadoop!
51
Cluster Summary
52
Worker State
53
Job Status
54
Graphical Progress Report
55
Speedy Machines do More Work
56
Backup jobs
57
Failures are Frequent
58
Backwards Progress
59
Backwards Progress
60
Blacklisting
61
Slow Machines
62
Stragglers
63
Multiple Backup Tasks
64
Sequential WordCount
65
Word Counting Mapper and Reducer
• Mapper code
• Reducer code
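The slide shows the mapper and reducer as code screenshots, which did not survive the transcript. As a stand-in, here is what an equivalent pair looks like for Hadoop Streaming, where (as noted earlier) the mapper and reducer can be any shell-executable program. This is a sketch, not the code from the slide; under Streaming each function would be its own script reading lines from stdin and printing tab-separated pairs to stdout:

```python
def mapper(stream):
    # Emit one "<word>\t1" record for every word in the input.
    for line in stream:
        for word in line.split():
            yield f"{word}\t1"

def reducer(stream):
    # Hadoop sorts the mapper output by key, so equal words arrive
    # adjacent; a single pass with a running total is enough.
    current, total = None, 0
    for line in stream:
        word, count = line.rsplit("\t", 1)
        if word != current:
            if current is not None:
                yield f"{current}\t{total}"
            current, total = word, 0
        total += int(count)
    if current is not None:
        yield f"{current}\t{total}"
```

A Streaming job would then be launched with something like `hadoop jar hadoop-streaming.jar -input in -output out -mapper mapper.py -reducer reducer.py` (exact jar path depends on the Hadoop installation).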
66
Concluding Remarks
• Master MapReduce / Hadoop because of its wide applicability
  – Today's lab exercises get you started
  – Utilize it in your term / final-year projects
  – Build something in your free time which uses MapReduce
• MapReduce adds another tool to your repertoire of parallel programming
  – Using the right tool at the right time remains your responsibility!
  – MapReduce is widely applicable, but you can't use it for every problem!
  – Misusing MapReduce on a problem where some other tool might be better will most probably result in degraded performance
Any more questions?