Research Paper - Map Reduce -CSC3323

7/25/2019 Research Paper - Map Reduce -CSC3323


Processing Large Data Set with a Map Reduce Approach


Research Paper part of the Honors Component of CSC3323

Algorithm Analysis

Supervisor: Dr. Hamid HARROUD

Honor Student: Ali LA!R"D"



Table of Contents:

Introduction
MapReduce Basic Approach
What is MapReduce?
Map Phase
Combining & Shuffling Phase
Reduce Phase
The Model
MapReduce Implementations
Examples of problems expressed as MapReduce computations
Word Counting
Minimum Spanning Tree
Page Ranking
Using Cloud Computing Services for Map Reduce
References

Introduction

Algorithm analysis is about optimizing computational operations to their best and lowest
possible cost. However, we may easily reach the limit of what optimization can achieve for
a certain set of problems, due to their complexity or their nature. Companies that dominate
the web have to process large sets of data every day, and in only a few milliseconds. For
instance, Facebook has to process millions of news feeds every day because the users'
interactions represent the backbone of its social concept. Google also has to process
billions of search queries. Indeed, Google is the pioneer of storing tremendous amounts of
data. It has to store and cache all the web pages of the World Wide Web, and this since
their creation. Every website indexed by Google has to be parsed and analyzed to provide
users with the most accurate and relevant search results. To do so, the power of the search
engine focuses on counting the number of specific keywords on a web page to infer that it is
the most relevant for a particular search query. It also focuses on ranking website
reputations (PageRank); in other words, the more a website is referenced by other websites,
the more it is considered trustworthy, and therefore the higher it is placed in search
results. Both of these strategies involve heavy computations on large sets of data, since
they have to consider the whole World Wide Web. Unfortunately, no matter how efficient your
algorithm is, a single computer, regardless of its performance, cannot for example handle
the whole process of computing the PageRank of a website, simply because it cannot retrieve
and count all the incoming links to a website across the whole World Wide Web without
running out of memory. The Map Reduce approach is specifically well suited to solving large
scale problems such as this one.

MapReduce Basic Approach

To clarify the idea behind Map Reduce in a basic manner, we may consider the following
concrete example: assume that we need to count the number of citizens of a country. Instead
of relying on a single person to count the inhabitants one by one, we send to each city a
representative who will count the number of people in that specific city (map phase). At
the end, we reassemble all the representatives from the cities to sum the counts they have
made and get the total number of people in the country (reduce phase).
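The census analogy can be written down as a minimal Python sketch. The city names and resident lists are hypothetical data invented for illustration, and the two function names are assumptions, not part of any framework.

```python
# Hypothetical census data: city name -> list of residents (illustrative only).
census = {
    "CityA": ["p1", "p2", "p3"],
    "CityB": ["p4", "p5"],
    "CityC": ["p6", "p7", "p8", "p9"],
}

def count_city(residents):
    """Map phase: one representative counts the people of a single city."""
    return len(residents)

def total_population(city_counts):
    """Reduce phase: the representatives' counts are summed into one total."""
    return sum(city_counts)

# Each city is counted independently, so the counts could be made in parallel.
city_counts = [count_city(residents) for residents in census.values()]
population = total_population(city_counts)
print(population)
```

The point of the sketch is that count_city never needs to know about any other city, which is what makes the map phase easy to distribute.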



What is MapReduce?

The MapReduce paradigm is a programming model that uses parallel processing on multiple
computers to process large data sets. Its power resides in two main operations: the map()
and reduce() procedures. Both of these procedures are defined by the user. We will see later
on that we can also add another, optional, shuffling and combining stage to optimize our
Map Reduce job. The map stage performs an independent record transformation that takes a
key/value pair (K1, V1) and generates zero or more intermediate key/value pairs,
list(K2, V2). The reduce stage then takes each intermediate key together with the list of
its values, (K2, list(V2)), and produces output pairs (K3, V3). The input keys and values
may be of a different type than the output keys and values. Moreover, the intermediate keys
and values generated by the map function are of the same type as the output keys and values.
When a MapReduce task is launched, the output of the map function is distributed around a
cluster of many computers, and the reduce function then collects this output back to
produce the final result. The end result is a scale-free programming model: MapReduce code
written for 1 MB of data can also handle terabytes of data and beyond. (See the flowchart
below of a Map Reduce procedure for counting words.)
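As a rough illustration of how the key and value types move through the two procedures, the sketch below writes the signatures as Python type aliases. The alias names MapFn and ReduceFn and the two example functions are hypothetical, and the sketch assumes that reduce receives one intermediate key with the list of its values.

```python
from typing import Callable, Iterable, List, Tuple, TypeVar

K1 = TypeVar("K1")
V1 = TypeVar("V1")
K2 = TypeVar("K2")
V2 = TypeVar("V2")
K3 = TypeVar("K3")
V3 = TypeVar("V3")

# map:    (K1, V1)       -> zero or more (K2, V2)
MapFn = Callable[[K1, V1], Iterable[Tuple[K2, V2]]]
# reduce: (K2, list(V2)) -> zero or more (K3, V3)
ReduceFn = Callable[[K2, List[V2]], Iterable[Tuple[K3, V3]]]

def first_letter_map(line_number: int, line: str) -> Iterable[Tuple[str, int]]:
    """(int, str) goes in, (str, int) comes out: the types change in map."""
    for word in line.split():
        yield word[0], len(word)

def longest_reduce(letter: str, lengths: List[int]) -> Iterable[Tuple[str, int]]:
    """Receives one key with all of its values and emits a single pair."""
    yield letter, max(lengths)

pairs = list(first_letter_map(0, "map many records"))
longest = list(longest_reduce("m", [3, 4]))
```

Here the input pair is (line number, line text) while both the intermediate and the output pairs are (letter, length), which mirrors the remark that intermediate and output types match.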



Map Phase

The map phase gives the user an opportunity to operate on every record in the data set
individually. This phase is commonly used to delete unwanted fields, modify or transform
fields, or apply filters. Specific joins and grouping can also be done in the map (e.g.,
joins where the data is already sorted, or hash-based aggregation). There is no requirement
that for every input record there should be one output record. Maps can choose to remove
records or group multiple records into a single one. Then, the output of the Map Phase is
sent to the Reduce Phase directly, or to the combining & shuffling phase.
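A minimal sketch of such a map function follows, assuming hypothetical sensor records with the fields city, temp_c, and sensor. It filters out unusable records, drops an unwanted field, and transforms a value, so some inputs produce no output at all.

```python
def map_record(record):
    """Map one input record to zero or one (key, value) output pairs."""
    # Filter: drop records that have no usable temperature reading.
    if record.get("temp_c") is None:
        return
    # Transform: keep only the needed fields and convert Celsius to Fahrenheit.
    yield record["city"], round(record["temp_c"] * 9 / 5 + 32, 1)

# Hypothetical input records (illustrative only).
records = [
    {"city": "A", "temp_c": 10.0, "sensor": "a1"},
    {"city": "B", "temp_c": None, "sensor": "b7"},
    {"city": "C", "temp_c": 25.0, "sensor": "c3"},
]

# Every record is processed on its own; no record depends on another.
mapped = [pair for record in records for pair in map_record(record)]
print(mapped)
```

Three records go in and two pairs come out, which illustrates that there is no one-to-one requirement between input and output records.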

Combining & Shuffling Phase

The combining & shuffling phase takes advantage of the fact that while the Map phase is
running, its data is stored in memory instead of on disk. It runs a reduce-type function
that combines pairs with the same key into a single list, then flushes the memory when it
runs out, to leave space for newly produced data from the map phase. The combiner class
outputs the data as if it came from the map function; the only difference is that it speeds
up the processing, since some key/value pairs have already been processed and only need to
be aggregated by the reducer function with the other lists produced by the combiner. For
example, a word count Map Reduce application whose map operation outputs (word, 1) pairs as
words are encountered in the input can use a combiner to speed up processing. Once a certain
number of pairs has been output, the combine operation will be called once per unique word,
with the list available as an iterator. The combiner then emits
(word, count-in-this-part-of-the-input) pairs.
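The word count combiner described above can be sketched in Python as follows. This is an in-memory illustration under simplifying assumptions (each string is one input split, and the function names are hypothetical); it is not the Hadoop combiner API.

```python
from collections import Counter

def map_words(text):
    """Map: emit a (word, 1) pair for every word encountered."""
    for word in text.lower().split():
        yield word, 1

def combine(pairs):
    """Combiner: pre-aggregate one mapper's pairs into (word, local_count)."""
    local_counts = Counter()
    for word, count in pairs:
        local_counts[word] += count
    return list(local_counts.items())

def reduce_counts(combined_lists):
    """Reduce: sum the per-split counts produced by the combiners."""
    totals = Counter()
    for pairs in combined_lists:
        for word, count in pairs:
            totals[word] += count
    return dict(totals)

# Two hypothetical input splits, each handled by its own mapper and combiner.
splits = ["the cat and the dog", "the dog barks"]
combined = [combine(map_words(split)) for split in splits]
totals = reduce_counts(combined)
print(totals)
```

For the first split the mapper emits five pairs but the combiner forwards only four, so less data has to travel to the reducer.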

Reduce Phase

The input to the reduce phase is each key from the Map function plus all of the values
associated with that key. Since all records with the same key are now collected together,
it is possible to join and aggregate. The Map Reduce user explicitly controls parallelism
in the reduce phase. Map Reduce jobs that do not require a reduce phase can set the reduce
count to zero. These are referred to as map-only jobs, and the map output is automatically
given as the job output. The aggregation is what combines all the results into a single
one, which represents our final result.

The Model

A Map Reduce task is performed in the following steps:

1. Input data, such as a long text file, is split into key-value pairs. These key-value
   pairs are then fed to the mapper. This job is done by the master program.
2. The Map function processes each key-value pair individually and outputs zero or more
   intermediate key-value pairs.
3. All intermediate key-value pairs are collected, sorted, and grouped by key (done by the
   shuffling & combining phase).
4. For each unique key, the reduce function receives the key with a list of all the values
   associated with it. The reducer aggregates these values in some way (adding them up,
   taking averages, finding the maximum, etc.) and outputs one or more output key-value
   pairs.
5. Output pairs are collected and stored in an output file (by the master program).

Figure 1: Execution overview of a MapReduce task
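The five steps can be sketched as a small, single-machine Python function. This is only an illustration of the data flow: run_mapreduce and the example job are hypothetical names, everything runs sequentially in memory, and the output is returned as a list rather than written to a file.

```python
from collections import defaultdict

def run_mapreduce(records, map_fn, reduce_fn):
    """Run the steps sequentially on a list of (key, value) input records."""
    # Steps 1-2: feed each input pair to the mapper, collect intermediate pairs.
    intermediate = []
    for key, value in records:
        intermediate.extend(map_fn(key, value))
    # Step 3: shuffle, i.e. group intermediate values by key.
    groups = defaultdict(list)
    for key, value in intermediate:
        groups[key].append(value)
    # Steps 4-5: reduce each unique key (in sorted order) and collect the output.
    output = []
    for key in sorted(groups):
        output.extend(reduce_fn(key, groups[key]))
    return output

# Example job on hypothetical data: maximum reading per sensor.
def max_map(line_number, line):
    sensor, reading = line.split(",")
    yield sensor, float(reading)

def max_reduce(sensor, readings):
    yield sensor, max(readings)

lines = list(enumerate(["s1,3.5", "s2,1.0", "s1,7.25", "s2,4.0"]))
result = run_mapreduce(lines, max_map, max_reduce)
print(result)
```

Only max_map and max_reduce are specific to the problem; the surrounding function plays the role that a framework plays on a real cluster.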

MapReduce Implementations

The MapReduce paradigm is extremely powerful because it is implemented by many frameworks,
such as Apache Hadoop, Riak, and Infinispan, that take care of the details. The user only
has to care about writing the map and reduce functions. Issuing tasks, file I/O, networking
between nodes, synchronization, and failure recovery are all managed by these frameworks,
which are written in different programming languages.



Examples of problems expressed as MapReduce computations

In this section, we will go through some concrete applications of Map Reduce using Hadoop.

Word Counting:

We want to compute the occurrences, or frequency, of certain words present in a large
document file. One way of solving the problem is to have a map function that takes as input
the name of the document and its content as a key/value pair, and emits each word in the
document with an occurrence count of 1. Then the reduce function takes, for each word, the
word as key and an iterator over the whole set of its values, and emits the number of
occurrences of that specific word in the whole set of values. (Find below the Java code
using Hadoop for word counting with all its dependencies.) By splitting the word counting
task into small tasks done by multiple nodes in our cluster, the output is computed in
parallel, which decreases the time needed to do the job.



Figure: Java Hadoop code for word counting
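Independently of the Java listing, the same word counting logic can be sketched in a few lines of Python. This is not the Hadoop code: the function names are hypothetical, the documents are invented sample data, and threads merely stand in for the nodes of a cluster.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

def map_doc(item):
    """Map: take (document name, content) and emit (word, 1) per word."""
    name, content = item
    return [(word, 1) for word in content.lower().split()]

def reduce_word(word, counts):
    """Reduce: sum all the 1s emitted for a single word."""
    return word, sum(counts)

def word_count(documents, workers=2):
    # Map tasks run concurrently; threads stand in for cluster nodes here.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        mapped = list(pool.map(map_doc, documents.items()))
    # Shuffle: group the emitted counts by word.
    groups = defaultdict(list)
    for pairs in mapped:
        for word, one in pairs:
            groups[word].append(one)
    # Reduce: one call per unique word.
    return dict(reduce_word(word, counts) for word, counts in groups.items())

# Hypothetical documents (illustrative only).
docs = {"a.txt": "map reduce map", "b.txt": "reduce and shuffle"}
counts = word_count(docs)
print(counts)
```

Each document is mapped without reference to the others, which is why the map tasks can be handed to separate workers.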



Minimum Spanning Tree:

Another problem that can be solved using MapReduce, and one of the topics discussed in
CSC3323 Algorithm Analysis, is finding a minimum spanning tree, which reaches all nodes at
minimum cost, for very large weighted graphs. Indeed, one intuitive approach that can
easily be mapped to the MapReduce approach, and was discussed in class, is Prim's
Algorithm. Prim's approach to finding the minimum spanning tree in a graph is to find, each
time, the edge of minimum cost that links a set S of nodes with the remaining set V-S of
the graph, where V is the set of all nodes of the graph. For the MapReduce approach, we can
partition our graph into multiple sets of nodes to send to each computer in our cluster.
Each computer will find the minimum spanning tree on its given set; then the reduce
function will join the sets, of which the MST has already been found by the parallel tasks,
by finding the bridge between them, that is, the edge that costs the least. By then we have
found a spanning tree of the whole large graph with parallel computations. To guarantee
that this tree is minimum, the reduce step has to run the selection again over the local
MST edges together with all the edges that cross between sets, not only the single cheapest
bridge.
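One way to make the parallel idea concrete is the Python sketch below. Note the substitutions: it uses Kruskal's algorithm with a union-find structure instead of Prim's, and it partitions the edges rather than the nodes, because discarding an edge that closes a cycle inside one partition is always safe. The graph and the function names are hypothetical.

```python
def kruskal(nodes, edges):
    """Return a minimum spanning forest as a list of (weight, u, v) edges."""
    parent = {n: n for n in nodes}

    def find(n):
        while parent[n] != n:
            parent[n] = parent[parent[n]]  # path halving
            n = parent[n]
        return n

    forest = []
    for weight, u, v in sorted(edges):
        root_u, root_v = find(u), find(v)
        if root_u != root_v:      # the edge joins two components: keep it
            parent[root_u] = root_v
            forest.append((weight, u, v))
    return forest

def mapreduce_mst(nodes, edges, parts=2):
    # Map: each worker keeps only the forest edges of its share of the edges.
    chunks = [edges[i::parts] for i in range(parts)]
    survivors = [e for chunk in chunks for e in kruskal(nodes, chunk)]
    # Reduce: one final pass over the surviving edges gives the global MST.
    return kruskal(nodes, survivors)

# Hypothetical weighted graph as (weight, node, node) triples.
nodes = ["a", "b", "c", "d", "e"]
edges = [(4, "a", "b"), (1, "a", "c"), (3, "b", "c"), (2, "b", "d"),
         (7, "c", "d"), (5, "d", "e"), (6, "c", "e")]
mst = mapreduce_mst(nodes, edges)
total = sum(weight for weight, _, _ in mst)
print(mst, total)
```

On a graph this small the map step filters out very little; the benefit appears on dense graphs, where each worker discards most of its edges before the reduce step.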

Page Ranking:

This algorithm was first developed by Larry Page, one of the co-founders of Google. His
strategy is to rely on the other websites to determine whether a specific website is worthy
or not. The idea is simple. It needs to count the number of incoming links to a specific
website, and use the following formula to get the page ranking of a website.

PageRank of A = 0.15 + 0.85 * (PageRank of B / number of outgoing links of B + ...),
where the sum runs over every website B that links to A.


In the graph above, we may easily infer that website A has the highest PageRank compared to
its peers, and it is therefore put first in the search results if its content matches the
keywords entered by the user (using the previous word counting strategy). However, the
difficulty resides in the fact that this graph is a large and growing one. Therefore, we
need to use Map Reduce to traverse it and extract the incoming links to website A. To do
so, we will run 3 different Hadoop jobs, explained in the following flowcharts:

1. Parsing: Traverse the whole web graph. In the mapping phase, get each website and its
   outgoing links. In the reduce phase, get for each website the links to the other pages.
2. Calculating: In this second Map Reduce job, we will compute the PageRank for each
   website. In the mapping phase, we map each outgoing link to the webpage with its rank
   and total outgoing links. In the reduce phase, we calculate the page rank of each
   webpage using the formula described earlier.
3. Sorting: Rank the websites by their given page ranks.

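The three jobs (parsing, calculating, sorting) can be sketched in Python as plain functions. This is an in-memory illustration, not Hadoop code: the link list is a hypothetical four-page graph, the function names are assumptions, each page's rank is assumed to be split evenly over its outgoing links, and the calculating job is repeated a fixed number of times.

```python
from collections import defaultdict

DAMPING = 0.85  # matches the 0.15 + 0.85 * ... form of the formula

def parse(link_pairs):
    """Job 1: group raw (source, target) links into page -> outgoing links."""
    outgoing = defaultdict(list)
    for source, target in link_pairs:
        outgoing[source].append(target)
        outgoing.setdefault(target, [])  # keep pages with no outgoing links
    return dict(outgoing)

def calculate(outgoing, ranks):
    """Job 2: one map/reduce pass of the PageRank formula."""
    # Map: each page sends rank / (number of outgoing links) to its targets.
    contributions = defaultdict(list)
    for page, targets in outgoing.items():
        for target in targets:
            contributions[target].append(ranks[page] / len(targets))
    # Reduce: sum the shares received by each page and apply the formula.
    return {page: (1 - DAMPING) + DAMPING * sum(contributions[page])
            for page in outgoing}

def sort_by_rank(ranks):
    """Job 3: order pages from highest to lowest rank."""
    return sorted(ranks.items(), key=lambda item: item[1], reverse=True)

# Hypothetical web graph: B, C and D link to A, and A links to B.
links = [("B", "A"), ("C", "A"), ("D", "A"), ("A", "B")]
graph = parse(links)
ranks = {page: 1.0 for page in graph}
for _ in range(50):  # repeat the calculating job until the ranks settle
    ranks = calculate(graph, ranks)
ranking = sort_by_rank(ranks)
print(ranking)
```

In this small graph A receives the most incoming links and ends up first, while C and D, which nobody links to, stay at the minimum rank of 0.15.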


Using Cloud Computing Services for Map Reduce

This section of the paper is presented as an extra. We have seen that Map Reduce
implementations allow us to avoid dealing with tedious, low level details when running and
processing large scale programs. This represents the software part. However, running Map
Reduce would still require us to have multiple computers in a large cluster, with high
connectivity between them. This hardware part is also not easy to build. Therefore,
processing large data can also be done using the Cloud. One of the most used products is
Amazon Elastic MapReduce (Amazon EMR). It is a Platform as a Service (PaaS) that simplifies
big data processing not only by providing and managing the Hadoop framework, but also by
managing the hardware for you. Amazon EMR is reliable, secure, and low-cost. You pay only
for what you use, and for how much time you use it. Indeed, it provides you with flexible
resources: you may extend the number of CPUs or cores you need, and also the RAM needed.
Furthermore, you may also join your virtual cluster with other Amazon products such as
Amazon S3 (Amazon Simple Storage Service) to store, for example, the input and output of
your Map Reduce tasks. Therefore, by using the Cloud, you not only avoid having to deal
with the low level implementation of scale-free programs, but also avoid setting up a large
and complicated cluster to run your programs.



References:

http://aws.amazon.com/elasticmapreduce/
https://courses.cs.washington.edu/courses/cse490h/08au/lectures/algorithms.pdf
http://hadoop.apache.org/docs/r1.2.1/mapred_tutorial.html
http://www.slideshare.net/andreaiacono/mapreduce-34478449
http://hadooptutorial.wikispaces.com/Sorting+feature+of+MapReduce
http://chimera.labs.oreilly.com/books/1234000001811/apb.html#overview_mr_distributed_cache
http://webmapreduce.sourceforge.net/docs/User_Guide/sect-User_Guide-Introduction-What_is_Map_Reduce.html
http://blog.xebia.com/2011/09/27/wiki-pagerank-with-hadoop/
