7/25/2019 Research Paper - Map Reduce -CSC3323
Processing Large Data Set with a Map Reduce Approach
Processing Large Data Set with a Map
Reduce Approach
Research Paper part of the Honors Component of CSC3323
Algorithm Analysis
Supervisor: Dr. Hamid HARROUD
Honor Student: Ali LABRIDI
Table of Contents:
Introduction
MapReduce Basic Approach
What is MapReduce?
Map Phase
Combining & Shuffling Phase
Reduce Phase
The Model
Map Reduce Implementation
Example of problems expressed as MapReduce computations
Word Counting
Minimum Spanning Tree
Page Ranking
Using Cloud Computing Services for Map Reduce
Introduction
Algorithm analysis is about reducing the cost of computational operations as far as
possible. However, for certain problems we quickly reach the limit of what optimization
alone can achieve, because of their complexity or their nature. Companies that
dominate the web have to process large sets of data every day, and in only a few
milliseconds. For instance, Facebook has to process millions of news feeds every
day because the users' interactions represent the backbone of its social concept.
Google also has to process billions of search queries. Indeed, Google is a pioneer in
storing tremendous amounts of data: it has to store and cache the web pages of
the World Wide Web going back to their creation. Every website indexed by Google
has to be parsed and analyzed to provide users with the most accurate and relevant
search results. To do so, the search engine relies on counting the number of
specific keywords on a web page to infer how relevant the page is for a
particular search query. It also relies on ranking website reputations (PageRank); in
other words, the more a website is referenced by other websites, the more it
is considered trustworthy, and therefore the higher it is placed in search results.
Both strategies involve heavy computations on large sets of data, since they
have to consider the whole World Wide Web. Unfortunately, no matter how efficient
the algorithm is, a single computer, regardless of its performance, cannot
handle the whole process of computing the PageRank of a website, simply
because it cannot retrieve and count all the incoming links to a
website across the whole World Wide Web without running out of memory. The
Map Reduce approach is specifically well suited to solving
large-scale problems such as this one.
MapReduce Basic Approach
To clarify the idea behind Map Reduce in a basic manner, we may consider the
following concrete example: assume that we need to count the citizens
of a country. Instead of relying on a single person to count the inhabitants one by one,
we send to each city a representative who counts the number of people in that
specific city (map phase). At the end, we gather all the representatives from
the cities and sum the counts they have made to get the total number of people in
the country (reduce phase).
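The analogy above can be sketched in a few lines of Python. This is only an illustration: the city names and resident lists are hypothetical, and the "representatives" are ordinary function calls rather than separate machines.

```python
# Hypothetical census data: each city holds its own list of residents.
cities = {
    "city_a": ["resident_1", "resident_2", "resident_3"],
    "city_b": ["resident_4"],
    "city_c": ["resident_5", "resident_6"],
}

def count_city(residents):
    """Map phase: one representative counts the people of one city."""
    return len(residents)

def total_population(city_counts):
    """Reduce phase: the per-city counts are summed into one total."""
    return sum(city_counts)

# Each city is counted independently of the others, so these calls
# could run at the same time on different machines.
city_counts = [count_city(residents) for residents in cities.values()]
national_total = total_population(city_counts)
print(national_total)
```

Because no city count depends on another, adding more cities only means adding more representatives; the final sum stays the same kind of operation.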
What is MapReduce?
MapReduce is a programming model that uses parallel processing on
multiple computers to process large data sets. Its power resides in two main
procedures, map() and reduce(), both of which are defined
by the user. We will see later on that an optional
combining & shuffling stage can also be added to optimize a Map Reduce job. The map stage
performs an independent record transformation: it takes a key/value pair (K1, V1) and
generates a list of zero or more intermediate key/value pairs (K2, V2). The reduce stage
then receives each intermediate key together with the list of its values and produces
output pairs (K3, V3). The types of the input keys and values
are generally different from those of the output keys and values. Moreover, the intermediate keys
and values generated by the map function are of the same types as the output keys and values.
When a MapReduce task is launched, the map work and its output are distributed
across a cluster of many computers; the reduce function then gathers this output
and produces the final result. The end result is a scale-free
programming model: MapReduce code written for 1 MB of data can also handle
terabytes of data and beyond. (See below the flowchart of a Map Reduce procedure for
counting words.)
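To make the roles of the two user-defined procedures concrete, here is a minimal sequential sketch in Python. The driver `run_mapreduce`, the functions `map_order` and `reduce_total`, and the order data are all hypothetical names chosen for illustration; a real framework would run the same two functions on many machines.

```python
from collections import defaultdict

def run_mapreduce(records, map_fn, reduce_fn):
    """Sequential stand-in for a cluster: map, group by key, then reduce."""
    intermediate = defaultdict(list)
    for k1, v1 in records:                 # input pairs (K1, V1)
        for k2, v2 in map_fn(k1, v1):      # zero or more pairs (K2, V2)
            intermediate[k2].append(v2)
    output = []
    for k2 in sorted(intermediate):        # one reduce call per distinct key
        output.extend(reduce_fn(k2, intermediate[k2]))
    return output

def map_order(order_id, order):
    """User-defined map: (order id, (category, amount)) -> (category, amount)."""
    category, amount = order
    if amount > 0:                         # empty orders emit nothing at all
        yield category, amount

def reduce_total(category, amounts):
    """User-defined reduce: (category, [amounts]) -> (category, total)."""
    yield category, sum(amounts)

orders = [(1, ("books", 12)), (2, ("games", 30)),
          (3, ("books", 8)), (4, ("games", 0))]
totals = run_mapreduce(orders, map_order, reduce_total)
print(totals)
```

Only `map_order` and `reduce_total` are specific to the problem; the driver does not change when the input grows, which is the sense in which the model is scale-free.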
Map Phase
The map phase gives the user an opportunity to operate on every record in the data
set individually. This phase is commonly used to delete unwanted fields, modify or
transform fields, or apply filters. Specific joins and grouping can also be done in the
map (e.g., joins where the data is already sorted, or hash-based aggregation). There
is no requirement that every input record produce exactly one output record:
maps can choose to remove records or group multiple records into a single one.
The output of the Map Phase is then sent either directly to the Reduce Phase or to the
combining & shuffling phase.
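As an illustration of the per-record operations described above, the following Python sketch filters and transforms one record at a time. The comma-separated "name, age, city" layout is an assumption made for the example, not a format taken from any particular data set.

```python
def map_record(offset, line):
    """Map one input line; emits zero or one (city, age) pair."""
    fields = line.split(",")
    if len(fields) != 3:
        return                      # filter: drop malformed records
    name, age, city = (field.strip() for field in fields)
    if not age.isdigit():
        return                      # filter: drop records with a bad age
    # Transform: normalise the city, convert the age, and drop the
    # name field, which later stages do not need.
    yield city.lower(), int(age)

lines = ["Amina, 31, Ifrane", "broken line", "Youssef, n/a, Rabat"]
mapped = [pair for offset, line in enumerate(lines)
          for pair in map_record(offset, line)]
print(mapped)
```

Each call looks at a single record only, which is what allows the records to be spread over many mappers.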
Combining & Shuffling Phase
The combine & shuffling phase takes advantage of the fact that while the
Map phase is running, its output is held in memory rather than on disk. It
runs a reduce-type function that combines the pairs sharing the same key into a single list,
then flushes the memory, once it is full, to leave space for newly produced data from the map phase.
The combiner class outputs data as if it came from the
map function; the only difference is that it speeds up processing, since some
key/value pairs have already been processed and only need to be aggregated by
the reducer function with the other lists produced by the combiner. For example, a
word count Map Reduce application whose map operation outputs (word, 1) pairs as
words are encountered in the input can use a combiner to speed up processing.
Once a certain number of pairs has been output, the combine operation is called once
per unique word, with the list of values available as an iterator. The combiner then emits
(word, count-in-this-part-of-the-input) pairs.
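The effect of a combiner on the word count example can be sketched as follows. This Python illustration is a simplification under stated assumptions: each input split is a short string, the in-memory buffer is never flushed, and the function names are hypothetical.

```python
from collections import Counter, defaultdict

def map_words(text):
    """Map: emit a (word, 1) pair for every word encountered."""
    for word in text.lower().split():
        yield word, 1

def combine(pairs):
    """Combiner: pre-aggregate the pairs of ONE mapper, still in memory."""
    partial = Counter()
    for word, count in pairs:
        partial[word] += count
    return sorted(partial.items())   # (word, count-in-this-part-of-the-input)

def reduce_counts(word, counts):
    """Reduce: add up the partial counts coming from all combiners."""
    return word, sum(counts)

splits = ["the cat saw the dog", "the dog ran"]
combined_per_split = [combine(map_words(split)) for split in splits]

# Shuffle: group the partial counts of every split by word.
grouped = defaultdict(list)
for part in combined_per_split:
    for word, count in part:
        grouped[word].append(count)

totals = dict(reduce_counts(word, counts) for word, counts in grouped.items())
print(totals)
```

The first split emits five (word, 1) pairs, but only four pairs leave its combiner because the two occurrences of "the" are merged; on real inputs with many repeated words the saving is much larger.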
Reduce Phase
The input to the reduce phase is each key from the Map function plus all of the
values associated with that key. Since all records with the same key are
now collected together, it is possible to join and aggregate them. The Map Reduce user
explicitly controls the parallelism of the reduce phase. Map Reduce jobs that do not
require a reduce phase can set the reduce count to zero. These are referred to as
map-only jobs, and their map output is given directly as the job output. The aggregation is what combines
all the results into a single one, which represents our final result.
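The following Python sketch illustrates how the reduce count controls parallelism and what a map-only job looks like. It is a sequential simulation under assumptions: the CRC32-based `partition` function and the averaging reducer are illustrative choices, not the behaviour of any specific framework.

```python
import zlib
from collections import defaultdict

def partition(key, num_reducers):
    """Deterministically assign every key to exactly one reducer."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers

def reduce_average(key, values):
    """Reduce: aggregate all values of one key into their average."""
    return key, sum(values) / len(values)

def run_reducers(pairs, num_reducers):
    if num_reducers == 0:
        # Map-only job: the map output is written out unchanged.
        return sorted(pairs)
    buckets = [defaultdict(list) for _ in range(num_reducers)]
    for key, value in pairs:
        buckets[partition(key, num_reducers)][key].append(value)
    output = []
    for bucket in buckets:           # each bucket could be a separate machine
        output.extend(reduce_average(k, vs) for k, vs in bucket.items())
    return sorted(output)

map_output = [("a", 2), ("b", 10), ("a", 4)]
print(run_reducers(map_output, 3))   # three reducers
print(run_reducers(map_output, 0))   # map-only job
```

Since a key always lands in the same bucket, the result does not depend on the number of reducers; only the amount of work done in parallel changes.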
The Model
A Map Reduce task is performed in the following steps:
1. Input data, such as a long text file, is split into key-value pairs. These
key-value pairs are then fed to the mapper. This job is done by the
master program.
2. The Map function processes each key-value pair individually and
outputs zero or more intermediate key-value pairs.
3. All intermediate key-value pairs are collected, sorted, and grouped by
key (done by the combining & shuffling phase).
4. For each unique key, the reduce function receives the key with a list of
all the values associated with it. The reducer aggregates these values
in some way (adding them up, taking averages, finding the maximum,
etc.) and outputs one or more output key-value pairs.
5. Output pairs are collected and stored in an output file (by the master
program).
Figure 1: Execution overview of a MapReduce Task
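The five steps can be traced end to end with a small Python sketch. Everything here is an assumption made for illustration: the "sensor reading" line format, the function names, and the fact that the whole job runs in one process instead of under a master program.

```python
from itertools import groupby
from operator import itemgetter

def split_input(text):
    """Step 1: split the input into (line number, line) pairs."""
    return list(enumerate(text.splitlines()))

def map_fn(line_no, line):
    """Step 2: turn one record into intermediate (sensor, reading) pairs."""
    sensor, reading = line.split()
    yield sensor, float(reading)

def shuffle(pairs):
    """Step 3: sort the intermediate pairs and group the values by key."""
    ordered = sorted(pairs, key=itemgetter(0))
    for key, group in groupby(ordered, key=itemgetter(0)):
        yield key, [value for _, value in group]

def reduce_fn(key, values):
    """Step 4: aggregate the values of one key (here: the maximum)."""
    yield key, max(values)

def run_job(text):
    records = split_input(text)
    intermediate = [pair for k, v in records for pair in map_fn(k, v)]
    grouped = shuffle(intermediate)
    # Step 5: collect the output pairs (a real job would write them to a file).
    return [out for key, values in grouped for out in reduce_fn(key, values)]

sample = "s1 20.5\ns2 18.0\ns1 22.0\ns2 17.5"
result = run_job(sample)
print(result)
```

Swapping `max` for `sum` or an average in `reduce_fn` changes the aggregation without touching any other step.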
MapReduce Implementations
The MapReduce paradigm is extremely powerful because it is implemented by many
frameworks, such as Apache Hadoop, Riak, and Infinispan, that take care of the details.
The user only has to write the map and reduce functions. Issuing tasks,
file I/O, networking between nodes, synchronization, and failure recovery are all
managed by these frameworks, which are written in different programming languages.
Examples of problems expressed as MapReduce computations
In this section, we will go through some concrete applications of Map Reduce
using Hadoop.
Word Counting:
We want to compute the number of occurrences, or frequency, of the words present
in a large document file. One way of solving the problem is to have a map
function that takes as input the name of the document and its content as a
key/value pair, and emits every word of the document as a key with an occurrence value of
1. Then, the reduce function takes, for each word, the word itself as key and an
iterator over the whole set of values emitted for it, and emits the total number of occurrences of that
word. (Find below the Java code using Hadoop for word
counting with all its dependencies.) By splitting the word counting task into small
tasks done by multiple nodes in our cluster, the output is computed in parallel,
and therefore the time needed to do the job is reduced.
Figure: Java Hadoop Code for word counting
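Independently of the Hadoop listing, the map and reduce logic of word counting can be sketched in plain Python. This is not Hadoop code: the thread pool only stands in for the nodes of a cluster, and the document names, the tokenizing regular expression, and the function names are assumptions made for the example.

```python
import re
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

def map_document(name, content):
    """Map: (document name, content) -> list of (word, 1) pairs."""
    return [(word, 1) for word in re.findall(r"[a-z']+", content.lower())]

def reduce_word(word, counts):
    """Reduce: (word, iterator over its 1s) -> (word, total occurrences)."""
    return word, sum(counts)

def word_count(documents, workers=2):
    # Each document is mapped by a separate worker, as a node would do.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        mapped = list(pool.map(lambda item: map_document(*item),
                               documents.items()))
    grouped = defaultdict(list)          # shuffle: group the 1s by word
    for pairs in mapped:
        for word, one in pairs:
            grouped[word].append(one)
    return dict(reduce_word(word, ones) for word, ones in grouped.items())

docs = {"a.txt": "Map reduce maps data", "b.txt": "Reduce the data"}
counts = word_count(docs)
print(counts)
```

The map calls share no state, so the number of workers can be raised without changing the result.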
Minimum Spanning Tree: Another problem that can be solved using
MapReduce, and one of the topics discussed in CSC3323 Algorithm Analysis, is
finding a minimum spanning tree (MST), which reaches all nodes at minimum cost, for very
large weighted graphs. One intuitive approach discussed in class that can easily be mapped
to MapReduce is Prim's algorithm. Prim's
approach to finding the minimum spanning tree of a graph is to repeatedly pick the
minimum-cost edge that links a set S of already reached nodes with the remaining set
V-S, where V is the set of all nodes of the graph. For the MapReduce approach,
we can partition our graph into multiple sets of nodes and send each set to a computer in
our cluster. Each computer finds the minimum spanning tree of its given set.
The reduce function then joins the sets whose MSTs have already been found
by the parallel tasks, by again finding the bridges, i.e. the crossing edges that cost the least.
Note that the final tree may need more than one edge between two sets, so the reduce
step should consider all crossing edges together with the local tree edges rather than
keep a single bridge per pair of sets. By then we have found a minimum spanning tree
of the whole large graph with parallel computations.
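A minimal Python sketch of this partition-then-join idea is given below. Two choices here are assumptions rather than part of the description above: Kruskal's algorithm with a union-find structure is used instead of Prim's because it is shorter to write, and the reduce step reruns the MST computation over the local tree edges plus all crossing edges, since keeping a single cheapest bridge does not always give a minimum tree. The graph itself is a hypothetical four-node example.

```python
def kruskal(nodes, edges):
    """Return the MST (or forest) edges of (weight, u, v) tuples."""
    parent = {node: node for node in nodes}

    def find(node):
        while parent[node] != node:
            parent[node] = parent[parent[node]]   # path halving
            node = parent[node]
        return node

    tree = []
    for weight, u, v in sorted(edges):
        root_u, root_v = find(u), find(v)
        if root_u != root_v:                      # edge joins two components
            parent[root_u] = root_v
            tree.append((weight, u, v))
    return tree

def map_partition(part_nodes, edges):
    """Map: MST of the subgraph induced by one partition of the nodes."""
    inside = [(w, u, v) for w, u, v in edges
              if u in part_nodes and v in part_nodes]
    return kruskal(part_nodes, inside)

def reduce_trees(nodes, partitions, local_trees, edges):
    """Reduce: MST over local tree edges plus every crossing edge."""
    owner = {n: i for i, part in enumerate(partitions) for n in part}
    crossing = [(w, u, v) for w, u, v in edges if owner[u] != owner[v]]
    candidates = [edge for tree in local_trees for edge in tree] + crossing
    return kruskal(nodes, candidates)

nodes = {"a", "b", "c", "d"}
edges = [(1, "a", "b"), (5, "c", "d"), (2, "a", "c"), (3, "b", "d")]
partitions = [{"a", "b"}, {"c", "d"}]

local_trees = [map_partition(part, edges) for part in partitions]
mst = reduce_trees(nodes, partitions, local_trees, edges)
print(sum(weight for weight, _, _ in mst))
```

In this example, keeping both local trees and adding only the cheapest bridge would cost 1 + 5 + 2 = 8, whereas the tree returned by the reduce step costs 6, the same as running the algorithm on the whole graph.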
Page Ranking: This algorithm was first developed by Larry Page, one of the
co-founders of Google. His strategy is to rely on the other websites to determine
whether a specific website is worthy or not. The idea is simple: count the
number of incoming links to a specific website, and use the following formula to get
the page ranking of that website.
PageRank of A = 0.15 + 0.85 * (PageRank(B) / outgoing links of B + PageRank(C) / outgoing links of C + ...)
where B, C, ... are the websites that link to A.
In the graph above, we may easily infer that website A has the highest PageRank
compared to its peers, and is therefore placed first in the search results if its
content matches the keywords entered by the user (using the previous word counting
strategy). However, the difficulty resides in the fact that this graph is large and growing.
Therefore, we need to use Map Reduce to traverse it and extract the incoming links
to website A. To do so, we will run 3 different Hadoop jobs, explained in the
following flowcharts:
1. Parsing: Traverse the whole web graph. In the mapping phase, get each website
and its outgoing links. In the reduce phase, get for each website the list of links to the
other pages.
2. Calculating: In this second Map Reduce job, we compute the PageRank of
each website. In the mapping phase, we map each outgoing link to the webpage
with its rank and total number of outgoing links. In the reduce phase, we calculate the page
rank of each webpage using the formula described earlier.
3. Sorting: Rank the websites by their computed page ranks.
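The three jobs (parsing, calculating, sorting) can be imitated in a few lines of Python. This is a sequential sketch under assumptions: it uses the constants 0.15 and 0.85 from the formula and the standard rule that a page passes its rank divided by its number of outgoing links to each page it links to; the four-page link list, the starting rank of 1.0, and the fixed 50 iterations are hypothetical choices.

```python
from collections import defaultdict

DAMPING = 0.85

def parse_job(link_pairs):
    """Job 1: turn (source, target) pairs into page -> outgoing links."""
    graph = defaultdict(list)
    for source, target in link_pairs:
        graph[source].append(target)
        graph.setdefault(target, [])      # keep pages without outgoing links
    return dict(graph)

def rank_map(page, rank, outlinks):
    """Job 2 map: give each linked page an equal share of this page's rank."""
    for target in outlinks:
        yield target, rank / len(outlinks)

def rank_reduce(page, shares):
    """Job 2 reduce: apply the formula to the shares received by a page."""
    return page, (1 - DAMPING) + DAMPING * sum(shares)

def calculate_job(graph, ranks):
    shares = defaultdict(list)
    for page, outlinks in graph.items():
        for target, share in rank_map(page, ranks[page], outlinks):
            shares[target].append(share)
    return dict(rank_reduce(page, shares[page]) for page in graph)

def sort_job(ranks):
    """Job 3: order the pages from highest to lowest rank."""
    return sorted(ranks.items(), key=lambda item: item[1], reverse=True)

links = [("B", "A"), ("C", "A"), ("D", "A"), ("A", "B")]
graph = parse_job(links)
ranks = {page: 1.0 for page in graph}
for _ in range(50):                       # repeat job 2 until ranks settle
    ranks = calculate_job(graph, ranks)
ranking = sort_job(ranks)
print(ranking[0][0])
```

Page A receives links from three pages and ends up first, while C and D, which nobody links to, keep the minimum rank of 0.15.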
Using Cloud Computing Services for Map Reduce
This section is presented as a complement to the paper. We have
seen that Map Reduce implementations spare us the tedious,
low-level details of running and processing large-scale programs. This
represents the software part. However, running Map Reduce also
requires multiple computers in a large cluster, with high connectivity between
them. This hardware part is not easy to build either. Therefore, processing large
data can also be done using the Cloud. One of the most used products is
Amazon Elastic MapReduce (Amazon EMR). It is a Platform as a Service (PaaS) that simplifies big
data processing not only by providing and managing the Hadoop framework, but
also by managing the hardware for you. Amazon EMR is presented as reliable, secure, and
low-cost: you pay only for what you use, and for how long you use it.
It also provides flexible resources: you may increase the number of CPUs or
cores, as well as the amount of RAM, as needed. Furthermore, you may combine your
virtual cluster with other Amazon products such as Amazon S3 (Amazon Simple
Storage Service) to store, for example, the input and output of your Map Reduce
tasks. Therefore, by using the Cloud, you avoid not only the low-level implementation of
scale-free programs, but also the setup of a large and complicated cluster to run
your programs.