Research Paper - Map Reduce -CSC3323

  • 7/25/2019 Research Paper - Map Reduce -CSC3323

    Processing Large Data Set with a Map Reduce Approach

    Research Paper part of the Honors Component of CSC3323

    Algorithm Analysis

    Supervisor: Dr. Hamid HARROUD

    Honor Student: Ali LA!R"D"

    Table of Contents:

    Introduction
    MapReduce Basic Approach
    What is MapReduce?
    Map Phase
    Combining & Shuffling Phase
    Reduce Phase
    The Model
    MapReduce Implementations
    Examples of problems expressed as MapReduce computations
    Word Counting
    Minimum Spanning Tree
    Page Ranking
    Using Cloud Computing Services for MapReduce
    References

    Introduction

    Algorithm analysis is about optimizing computational operations to the lowest
    possible cost. However, for a certain set of problems we may quickly reach the
    limit of what optimization can achieve, because of their complexity or their
    nature. Companies that dominate the web have to process large data sets every
    day, and in only a few milliseconds. For instance, Facebook has to process
    millions of news feeds every day because users' interactions are the backbone
    of its social concept. Google also has to process billions of search queries.
    Indeed, Google is the pioneer of storing tremendous amounts of data. It has to
    store and cache all the web pages of the World Wide Web, and has done so since
    their creation. Every website indexed by Google has to be parsed and analyzed
    to provide users with the most accurate and relevant search results. To do so,
    the search engine focuses on counting the number of specific keywords on a web
    page to infer that it is the most relevant for a particular search query. It
    also focuses on ranking website reputations (PageRank); in other words, the
    more a website is referenced by other websites, the more it is considered
    trustworthy, and therefore the higher it is placed in search results. Both
    strategies involve heavy computations on large data sets, since they have to
    consider the whole World Wide Web. Unfortunately, no matter how efficient your
    algorithm is, a single computer, regardless of its performance, cannot for
    example handle the whole process of computing the PageRank of a website,
    simply because it cannot retrieve and count all the incoming links to a
    website across the whole World Wide Web without running out of memory. The
    MapReduce approach is specifically well suited to solving large-scale problems
    such as this one.

    MapReduce Basic Approach

    To clarify the idea behind MapReduce in a basic manner, consider the following
    concrete example: suppose that we need to count the number of citizens of a
    country. Instead of relying on a single person to count the inhabitants one by
    one, we send to each city a representative who counts the number of people in
    that specific city (map phase). At the end, we gather all the representatives
    from the cities and sum the counts they have made to get the total number of
    people in the country (reduce phase).
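    The census analogy can be sketched in a few lines of Python. This is only an
    illustration running on one machine: the city names, the resident lists, and
    the function names count_city and add_counts are made up for the example.

```python
from functools import reduce

# Hypothetical census data: city name -> list of residents (illustrative only).
population_by_city = {
    "city_a": ["r1", "r2", "r3"],
    "city_b": ["r4", "r5"],
    "city_c": ["r6", "r7", "r8", "r9"],
}


def count_city(city, residents):
    """Map step: one representative counts the people of a single city."""
    return (city, len(residents))


def add_counts(total, city_count):
    """Reduce step: fold one city's count into the running national total."""
    _, count = city_count
    return total + count


# Each city is counted independently of the others (this is the parallel part).
city_counts = [count_city(city, residents)
               for city, residents in population_by_city.items()]

# The per-city counts are then summed into a single result.
national_total = reduce(add_counts, city_counts, 0)
print(national_total)
```

    Because count_city never looks at another city's data, the per-city counts
    could be produced on different machines without any coordination.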

    What is MapReduce?

    The MapReduce paradigm is a programming model that uses parallel processing on
    multiple computers to process large data sets. Its power resides in two main
    operations: the map() and reduce() procedures. Both of these procedures are
    defined by the user. We will see later on that we can also add an optional
    combining and shuffling stage to optimize our MapReduce job. The map stage
    performs an independent record transformation: it takes a key/value pair
    (K1, V1) and generates 0 or more intermediate key/value pairs, list(K2, V2).
    The reduce stage takes each intermediate key together with the list of its
    values, (K2, list(V2)), and generates the output pairs (K3, V3). The types of
    the input keys and values are different from those of the output keys and
    values. Moreover, the intermediate keys and values generated by the map
    function are of the same types as the output keys and values. When a MapReduce
    task is launched, the map work is distributed across a cluster of many
    computers; the reduce function then collects the map output and produces the
    final result. The end result is a scale-free programming model: MapReduce code
    written for 1 MB of data can also handle terabytes of data and beyond. (See
    below the flowchart of a MapReduce procedure for counting words.)
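    The signatures of the two user-defined procedures can be written down as a
    minimal Python sketch. The driver run_mapreduce, the type aliases, and the
    temperature readings are assumptions made for illustration; a real framework
    would spread the same work over many machines.

```python
from collections import defaultdict
from typing import Callable, Iterable, List, Tuple, TypeVar

K1, V1 = TypeVar("K1"), TypeVar("V1")  # input key/value types
K2, V2 = TypeVar("K2"), TypeVar("V2")  # intermediate key/value types
K3, V3 = TypeVar("K3"), TypeVar("V3")  # output key/value types

# map:    (K1, V1)       -> list of (K2, V2)
# reduce: (K2, list(V2)) -> list of (K3, V3)
Mapper = Callable[[K1, V1], Iterable[Tuple[K2, V2]]]
Reducer = Callable[[K2, List[V2]], Iterable[Tuple[K3, V3]]]


def run_mapreduce(records, mapper, reducer):
    """Single-process stand-in for a MapReduce job (hypothetical helper)."""
    groups = defaultdict(list)
    for key, value in records:
        for k2, v2 in mapper(key, value):      # map stage
            groups[k2].append(v2)              # group values by intermediate key
    output = []
    for k2 in sorted(groups):
        output.extend(reducer(k2, groups[k2]))  # reduce stage
    return output


def year_temperature(station_id, reading):
    """Map: (station, (year, temp)) -> (year, temp); the station id is dropped."""
    year, temperature = reading
    yield (year, temperature)


def max_temperature(year, temperatures):
    """Reduce: (year, [temps]) -> (year, highest temp)."""
    yield (year, max(temperatures))


# Hypothetical readings keyed by station id.
readings = [("s1", (2018, 25)), ("s2", (2018, 31)),
            ("s1", (2019, 35)), ("s2", (2019, 29))]
result = run_mapreduce(readings, year_temperature, max_temperature)
print(result)
```

    Note how the input key (a station id) has a different type from the
    intermediate and output key (a year), while the intermediate and output pairs
    share the same types.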

    Map Phase

    The map phase gives the user an opportunity to operate on every record in the
    data set individually. This phase is commonly used to delete unwanted fields,
    modify or transform fields, or apply filters. Specific joins and grouping can
    also be done in the map (e.g., joins where the data is already sorted, or
    hash-based aggregation). There is no requirement that for every input record
    there should be one output record. Maps can choose to remove records or group
    multiple records into a single one. The output of the map phase is then sent
    to the reduce phase directly, or to the combining & shuffling phase.
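    As a small illustration of a mapper that filters records and drops fields,
    consider the following Python sketch. The record layout (user, url, status)
    and the function name clean_log_mapper are hypothetical.

```python
def clean_log_mapper(line_number, line):
    """Map one raw record to zero or one cleaned (key, value) pair."""
    fields = [field.strip() for field in line.split(",")]
    if len(fields) != 3:
        return                      # malformed record: emit nothing
    user, url, status = fields
    if status != "200":
        return                      # filter: keep successful requests only
    yield (user.lower(), url)       # transform the key, drop the status field


# Hypothetical input records: (line number, raw line).
records = [
    (0, "Alice, /home, 200"),
    (1, "bob, /login, 500"),
    (2, "broken line"),
    (3, "ALICE, /cart, 200"),
]

mapped = [pair
          for line_number, line in records
          for pair in clean_log_mapper(line_number, line)]
print(mapped)
```

    Four input records produce only two output pairs, which shows that the number
    of map outputs does not have to match the number of inputs.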

    Combining & Shuffling Phase

    The combining & shuffling phase takes advantage of the fact that while the map
    phase is running, its output is held in memory instead of on disk. It runs a
    reduce-type function that combines pairs with the same key into a single list,
    then flushes the memory when it runs out of space, to leave room for new data
    produced by the map phase. The combiner class outputs its data as if it came
    from the map function; the only difference is that it speeds up processing,
    since some key/value pairs have already been processed and only need to be
    aggregated by the reducer function with the other lists produced by the
    combiner. For example, a word count MapReduce application whose map operation
    outputs (word, 1) pairs as words are encountered in the input can use a
    combiner to speed up processing. Once a certain number of pairs has been
    output, the combine operation is called once per unique word, with the list of
    values available as an iterator. The combiner then emits
    (word, count-in-this-part-of-the-input) pairs.
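    The effect of a combiner on word counting can be sketched as follows. This is
    a single-process illustration, not Hadoop's combiner API; the two input splits
    and all function names are made up for the example.

```python
from collections import Counter


def map_words(split_text):
    """Map: emit a (word, 1) pair for every word in one input split."""
    return [(word, 1) for word in split_text.split()]


def combine(pairs):
    """Combiner: pre-aggregate the pairs of ONE split into (word, local count)."""
    local = Counter()
    for word, count in pairs:
        local[word] += count
    return sorted(local.items())


def reduce_counts(pairs):
    """Reduce: sum the counts of every word over all splits."""
    totals = Counter()
    for word, count in pairs:
        totals[word] += count
    return dict(totals)


# Two hypothetical input splits, each handled by a different mapper.
splits = ["the cat saw the dog", "the dog saw the cat the end"]

raw = [map_words(split) for split in splits]
combined = [combine(pairs) for pairs in raw]

pairs_without_combiner = sum(len(pairs) for pairs in raw)
pairs_with_combiner = sum(len(pairs) for pairs in combined)

totals_without = reduce_counts(pair for pairs in raw for pair in pairs)
totals_with = reduce_counts(pair for pairs in combined for pair in pairs)
print(pairs_without_combiner, pairs_with_combiner, totals_with)
```

    The final counts are identical in both cases; the combiner only reduces the
    number of pairs that have to be handed over to the reducer.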

    Reduce Phase

    The input to the reduce phase is each key from the map function plus all of
    the values associated with that key. Since all records with the same key are
    now collected together, it is possible to join and aggregate. The MapReduce
    user explicitly controls parallelism in the reduce phase. MapReduce jobs that
    do not require a reduce phase can set the reduce count to zero. These are
    referred to as map-only jobs, and the map output is automatically given as the
    job output. The aggregation is what combines all the results into a single,
    final one.

    The Model

    A MapReduce task is performed in the following steps:

    1. Input data, such as a long text file, is split into key-value pairs. These
       key-value pairs are then fed to the mapper. This job is done by the master
       program.
    2. The map function processes each key-value pair individually and outputs
       zero or more intermediate key-value pairs.
    3. All intermediate key-value pairs are collected, sorted, and grouped by key
       (done by the shuffling & combining phase).
    4. For each unique key, the reduce function receives the key with a list of
       all the values associated with it. The reducer aggregates these values in
       some way (adding them up, taking averages, finding the maximum, etc.) and
       outputs one or more output key-value pairs.
    5. Output pairs are collected and stored in an output file (by the master
       program).
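    The five steps above can be traced in a short Python sketch. It runs in one
    process and returns output lines instead of writing a file; the input format
    ("name score" per line) and every function name are assumptions chosen for
    the example, with averaging as the aggregation.

```python
from itertools import groupby
from operator import itemgetter


def split_input(text):
    """Step 1: split the input into (line number, line) key-value pairs."""
    return list(enumerate(text.splitlines()))


def map_fn(line_number, line):
    """Step 2: turn one 'name score' line into a (name, score) pair."""
    name, score = line.split()
    yield (name, float(score))


def shuffle(pairs):
    """Step 3: sort the intermediate pairs and group their values by key."""
    ordered = sorted(pairs, key=itemgetter(0))
    return [(key, [value for _, value in group])
            for key, group in groupby(ordered, key=itemgetter(0))]


def reduce_fn(name, scores):
    """Step 4: aggregate all values of one key (here: take the average)."""
    yield (name, sum(scores) / len(scores))


def run_job(text):
    records = split_input(text)
    intermediate = [pair for key, value in records for pair in map_fn(key, value)]
    grouped = shuffle(intermediate)
    output = [pair for key, values in grouped for pair in reduce_fn(key, values)]
    # Step 5: collect the output pairs as tab-separated lines.
    return [f"{name}\t{average:.1f}" for name, average in output]


# Hypothetical input file content.
lines = run_job("ana 10\nben 14\nana 16\nben 12\ncy 9")
print(lines)
```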

    Figure 1: Execution overview of a MapReduce Task

    MapReduce Implementations

    The MapReduce paradigm is extremely powerful because it is implemented by many
    frameworks, such as Apache Hadoop, Riak, and Infinispan, that take care of the
    details. The user only has to care about writing the map and reduce functions.
    Issuing tasks, file I/O, networking between nodes, synchronization, and
    failure recovery are all managed by these frameworks, which are written in
    different programming languages.

    Examples of problems expressed as MapReduce computations

    In this section, we will go through some concrete applications of MapReduce
    using Hadoop.

    Word Counting:

    We want to compute the occurrences, or frequency, of certain words present in
    a large document file. One way of solving the problem is to have a map
    function that takes the name of the document and its content as a key/value
    pair, and emits each word of the document as a key with an occurrence count of
    1 as its value. Then, for each word, the reduce function takes the word and an
    iterator over the whole set of values emitted for it, and emits the total
    number of occurrences of that word. (Find below the Java code using Hadoop for
    word counting, with all its dependencies.) By splitting the word counting task
    into small tasks done by multiple nodes in our cluster, the output is computed
    in parallel, and therefore the time needed to do the job is reduced.
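    The map and reduce functions described above can also be sketched in plain
    Python, independently of the Java Hadoop listing. This is a single-machine
    illustration in which a thread pool stands in for the cluster nodes; the
    document names and contents are hypothetical.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def map_fn(document_name, content):
    """Map: (document name, content) -> a (word, 1) pair for every word."""
    return [(word.lower(), 1) for word in content.split()]


def reduce_fn(word, counts):
    """Reduce: (word, iterator over its 1s) -> (word, total occurrences)."""
    return (word, sum(counts))


def word_count(documents):
    # Each document is mapped independently; the workers play the role of nodes.
    with ThreadPoolExecutor() as pool:
        mapped = list(pool.map(lambda item: map_fn(*item), documents.items()))

    # Group the emitted values by word before reducing.
    groups = defaultdict(list)
    for pairs in mapped:
        for word, one in pairs:
            groups[word].append(one)

    return dict(reduce_fn(word, iter(ones)) for word, ones in groups.items())


# Hypothetical documents: name -> content.
documents = {"a.txt": "Map Reduce map", "b.txt": "reduce the map"}
result = word_count(documents)
print(result)
```

    The threads only illustrate the independence of the map tasks; in a real
    cluster each document (or split) would be processed on a separate node.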

    Figure: Java Hadoop code for word counting

    Minimum Spanning Tree: Another problem that can be solved using MapReduce, and
    one of the topics discussed in CSC3323 Algorithm Analysis, is finding a
    minimum spanning tree, which reaches all nodes at minimum cost, for very large
    weighted graphs. Indeed, one intuitive approach that can easily be mapped to
    MapReduce, and that was discussed in class, is Prim's algorithm. Prim's
    approach to finding the minimum spanning tree of a graph is to repeatedly pick
    the minimum-cost edge that links a set S of nodes with the remaining set V-S,
    where V is the set of all nodes of the graph. For the MapReduce approach, we
    can partition our graph into multiple sets of nodes and send each set to a
    computer in our cluster. Each computer finds the minimum spanning tree of its
    given set; the reduce function then joins the sets whose MSTs have already
    been found by the parallel tasks, by again finding the bridge, i.e. the edge
    that costs the least, between them. By then we have found a minimum spanning
    tree of the whole large graph, with parallel computations.
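    A minimal Python sketch of a MapReduce-style MST computation is given below.
    It deliberately uses a different variant from the node partitioning described
    above: the edges are partitioned, each map task keeps only the minimum
    spanning forest of its edges (using Kruskal's algorithm rather than Prim's),
    and the reduce step runs the same algorithm on the surviving edges. This
    variant is chosen because discarding the heaviest edge of a cycle is always
    safe, whereas joining node partitions by their cheapest bridges is not
    guaranteed to give the minimum cost in general. The graph and all names are
    hypothetical.

```python
def kruskal(num_nodes, edges):
    """Return the minimum spanning forest of (weight, u, v) edges."""
    parent = list(range(num_nodes))

    def find(node):
        # Union-find lookup with path halving.
        while parent[node] != node:
            parent[node] = parent[parent[node]]
            node = parent[node]
        return node

    forest = []
    for weight, u, v in sorted(edges):
        root_u, root_v = find(u), find(v)
        if root_u != root_v:          # the edge joins two components: keep it
            parent[root_u] = root_v
            forest.append((weight, u, v))
    return forest


def mapreduce_mst(num_nodes, edges, num_partitions):
    # Split the edge list into partitions, one per (simulated) worker.
    partitions = [edges[i::num_partitions] for i in range(num_partitions)]
    # Map: each worker discards the edges that close a cycle in its partition.
    local_forests = [kruskal(num_nodes, part) for part in partitions]
    # Reduce: compute the MST over the edges that survived the map phase.
    survivors = [edge for forest in local_forests for edge in forest]
    return kruskal(num_nodes, survivors)


# Hypothetical weighted graph with 5 nodes: (weight, u, v).
edges = [(1, 0, 1), (4, 0, 2), (3, 1, 2), (2, 1, 3),
         (5, 2, 3), (7, 3, 4), (6, 2, 4)]

distributed = mapreduce_mst(5, edges, num_partitions=2)
sequential = kruskal(5, edges)
print(distributed, sum(weight for weight, _, _ in distributed))
```

    On this graph the distributed result matches the tree computed by a single
    sequential run of Kruskal's algorithm over all edges.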

    Page Ranking: This algorithm was first developed by Larry Page, one of the
    co-founders of Google. His strategy is to rely on other websites to determine
    whether a specific website is worthy or not. The idea is simple: count the
    number of incoming links to a specific website, and use the following formula
    to get the page rank of the website.

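    \[ \mathrm{PageRank}(A) = 0.15 + 0.85 \sum_{B \to A} \frac{\mathrm{PageRank}(B)}{L(B)} \]

    where the sum runs over every page B that links to A, and L(B) is the total
    number of outgoing links of B.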

    In the graph above, we may easily infer that website A has the highest
    PageRank compared to its peers, and is therefore placed first in the search
    results if its content matches the keywords entered by the user (using the
    previous word counting strategy). However, the difficulty resides in the fact
    that this graph is a large and growing one. Therefore, we need to use
    MapReduce to traverse it and extract the incoming links to website A. To do
    so, we will run 3 different Hadoop jobs, explained in the following
    flowcharts:

    1. Parsing: Traverse the whole web graph. In the mapping phase, get each
       website and its outgoing links. In the reduce phase, get for each website
       the links to the other pages.
    2. Calculating: In this second MapReduce job, we compute the PageRank of each
       website. In the mapping phase, we map each outgoing link to the web page
       with its rank and total number of outgoing links. In the reduce phase, we
       calculate the page rank of each web page using the formula described
       earlier.
    3. Sorting: Rank the websites by their computed page ranks.
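    The three jobs can be sketched in Python as follows. This is a single-process
    illustration, not the Hadoop jobs themselves; it assumes the usual PageRank
    update with constants 0.15 and 0.85, in which each page passes its rank,
    divided by its number of outgoing links, to the pages it links to. The small
    link graph and the function names are hypothetical.

```python
from collections import defaultdict

BASE = 0.15      # constant term of the PageRank formula
DAMPING = 0.85   # weight given to the ranks received from incoming links


def parse_job(link_records):
    """Job 1: group (source, target) records into source -> outgoing links."""
    outgoing = defaultdict(list)
    for source, target in link_records:
        outgoing[source].append(target)
        outgoing.setdefault(target, [])   # pages without outgoing links
    return dict(outgoing)


def calculate_job(outgoing, ranks):
    """Job 2: one PageRank iteration."""
    contributions = defaultdict(list)
    # Map: every page sends rank / (number of outgoing links) to each target.
    for page, links in outgoing.items():
        for target in links:
            contributions[target].append(ranks[page] / len(links))
    # Reduce: every page sums what it received and applies the formula.
    return {page: BASE + DAMPING * sum(contributions[page]) for page in outgoing}


def sort_job(ranks):
    """Job 3: order the pages from highest to lowest rank."""
    return sorted(ranks.items(), key=lambda item: item[1], reverse=True)


# Hypothetical web graph: B, C and D link to A; C also links to B.
links = [("B", "A"), ("C", "A"), ("D", "A"), ("C", "B")]

outgoing = parse_job(links)
ranks = {page: 1.0 for page in outgoing}
for _ in range(5):                 # repeat job 2 until the ranks settle
    ranks = calculate_job(outgoing, ranks)
ranking = sort_job(ranks)
print(ranking)
```

    In this graph page A collects links from three pages and ends up ranked
    first, followed by B, which receives a single link.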

    Using Cloud Computing Services for MapReduce

    This section of the paper is presented as an extra. We have seen that
    MapReduce implementations spare us from dealing with tedious, low-level
    details when running and processing large-scale programs. This represents the
    software part. However, running MapReduce requires multiple computers in a
    large cluster, with high connectivity between them. This hardware part is also
    not easy to build. Therefore, processing large data sets can also be done
    using the cloud. One of the most used products is Amazon Elastic MapReduce
    (Amazon EMR). It is a Platform as a Service (PaaS) that simplifies big data
    processing not only by providing and managing the Hadoop framework, but also
    by managing the hardware for you. Amazon EMR is reliable, secure, and
    low-cost. You pay only for what you use, and for how long you use it. Indeed,
    it provides you with flexible resources: you may increase the number of CPUs
    or cores you need, and also the RAM. Furthermore, you may connect your virtual
    cluster to other Amazon products, such as Amazon S3 (Amazon Simple Storage
    Service), to store, for example, the input and output of your MapReduce tasks.
    Therefore, by using the cloud you not only avoid dealing with the low-level
    implementation of scale-free programs, but also avoid setting up a large and
    complicated cluster to run them.

    References:

    http://aws.amazon.com/elasticmapreduce/
    https://courses.cs.washington.edu/courses/cse490h/08au/lectures/algorithms.pdf
    http://hadoop.apache.org/docs/r1.2.1/mapred_tutorial.html
    http://www.slideshare.net/andreaiacono/mapreduce-34478449
    http://hadooptutorial.wikispaces.com/Sorting+feature+of+MapReduce
    http://chimera.labs.oreilly.com/books/1234000001811/apb.html#overview_mr_distributed_cache
    http://webmapreduce.sourceforge.net/docs/User_Guide/sect-User_Guide-Introduction-What_is_Map_Reduce.html
    http://blog.xebia.com/2011/09/27/wiki-pagerank-with-hadoop/
