MapReduce - unipi.itdidawiki.di.unipi.it/lib/exe/fetch.php/magistraleinformatica/pa/mapreduce.pdfHadoop • Other Hadoop-related projects at Apache include : – Ambari: A web-based

MapReduce&Pig&Spark

IoannaMiliouGiuseppeAttardi

AdvancedProgrammingUniversitadiPisa

Hadoop• TheApache™Hadoop®projectdevelopsopen-sourcesoftwarefor

reliable,scalable,distributedcomputing.

• Frameworkthatallowsforthedistributedprocessingoflargedatasetsacrossclustersofcomputersusingsimpleprogrammingmodels.

• Itisdesignedtoscaleupfromsingleserverstothousandsofmachines,eachofferinglocalcomputationandstorage.

• Itisdesignedtodetectandhandlefailuresattheapplicationlayer.

ThecoreofApacheHadoopconsistsofastoragepart,knownasHadoopDistributedFileSystem(HDFS),andaprocessingpartcalledMapReduce.

Hadoop• Theprojectincludesthesemodules:

– HadoopCommon:ThecommonutilitiesthatsupporttheotherHadoopmodules.

– HadoopDistributedFileSystem(HDFS):Adistributedfilesystemthatprovideshigh-throughputaccesstoapplicationdata.

– HadoopYARN:Aframeworkforjobschedulingandclusterresourcemanagement.

– HadoopMapReduce:AYARN-basedsystemforparallelprocessingoflargedatasets.

Hadoop• OtherHadoop-relatedprojectsatApacheinclude:

– Ambari:Aweb-basedtoolforprovisioning, managing, andmonitoring ApacheHadoop.

– Avro:Adataserializationsystem.– Cassandra:Ascalablemulti-masterdatabasewithnosinglepointsoffailure.– Chukwa:Adatacollectionsystemformanaging largedistributed systems.– HBase:Ascalable,distributeddatabasethatsupports structureddatastorage

forlargetables.– Hive:Adatawarehouseinfrastructurethatprovidesdatasummarizationand

adhocquerying.– Mahout:AScalablemachinelearninganddatamining library.– Tez:Ageneralizeddata-flowprogramming framework,builtonHadoopYARN,

whichprovidesapowerfulandflexibleengine toexecuteanarbitraryDAGoftaskstoprocessdataforbothbatchandinteractiveuse-cases.

– ZooKeeper:Ahigh-performance coordinationservicefordistributedapplications.

Hadoop– Pig:Ahigh-leveldata-flowlanguageandexecutionframeworkforparallelcomputation.

– Spark:AfastandgeneralcomputeengineforHadoopdata.Sparkprovidesasimpleandexpressiveprogrammingmodelthatsupportsawiderangeofapplications,includingETL,machinelearning,streamprocessing,andgraphcomputation.

HadoopStack

WhatisMapReduce?

• MapReduceistheheartofHadoop®

• ProgrammingparadigmthatallowsformassivescalabilityacrosshundredsorthousandsofserversinaHadoopcluster.

ProposedbyDeanandGhemawat atGoogle

Whatisit?

• ProcessingengineofHadoop• Usedforbigdatabatchprocessing• Parallelprocessingofhugedatavolumes• Faulttolerant• Scalable

Whyuseit?

• YourdatainTerabyte/Petabyterange• YouhavehugeI/O• Hadoopframeworktakescareof– Jobandtaskmanagement– Failures– Storage– Replication

YoujustwriteMapandReducejobs

BigUsers

• Users– Facebook– Yahoo– Amazon– Ebay

• Providers– Amazon– Cloudera– HortonWorks– MapR

Map&ReduceThetermMapReduceactuallyreferstotwoseparateanddistincttasksthatHadoopprogramsperform.

1. Themap job,whichtakesasetofdataandconvertsitintoanothersetofdata,whereindividualelementsarebrokendownintotuples(key/valuepairs).

map(k1,v1)→list(k2,v2)

1. Thereduce jobtakestheoutputfromamapasinputandcombinesthosedatatuplesintoasmallersetoftuples.

reduce(k2,list(v2))→list(v2)

AsthesequenceofthenameMapReduceimplies,thereducejobisalwaysperformedafterthemapjob.

TypicalProblemsolvedbyMapReduce

• Readalotofdata

• Map :extractsomethingyoucareaboutfromeachrecord

• ShuffleandSort

• Reduce:aggregate,summarize,filter,ortransform

• Writetheresults

InputData

Map Map Map Map

Shuffle

Reduce Reduce

Results

Example:WordCountinWebPages

Atypicalexerciseforanewengineerinhisorherfirstweek

• Inputisfileswithonedocumentperrecord• Specifyamap functionthattakesakey/valuepair

key=documentURLvalue=documentcontents

• Outputofmapfunctionis(potentiallymany)key/valuepairs.Inourcase,output(word,“1”)onceperwordinthedocument

“document1”, “AppleOrangeMangoOrangeGrapesPlum”

“Apple”,“1”“Orange”,“1”“Mango”,“1”

…

Examplecontinued:WordCountinWebPages

• MapReducelibrarygatherstogetherallpairswiththesamekey(shuffle/sort)

• ThereducefunctioncombinesthevaluesforakeyInourcase,computethesum

• Outputofreducepairedwithkeyandsaved

key=“Apple”values=“1”

key=“Mango”values=“1”

key=“Orange”values=“1”,“1”

key=“Plum”values=“1”

key=“Grapes”values=“1”

“1” “1”“1”“1”“2”

“Apple”,“1”“Orange”,“2”“Mango”,“1”“Grapes”,“1”“Plum”,“1”

ExamplePseudo-code

map() reduce()

MapReducewrappersWrappershavebeendevelopedinorderto:• provideabettercontrolovertheMapReducecode• aidinthesourcecodedevelopment

Somewell-knownexample:• Sawzall (Google)• Pig(originallyYahoo,nowApache)• Hive(Facebook)• DryadLINQ (Microsoft)

WidelyapplicableatGoogle• ImplementedasaC++librarylinkedtouserprograms• Canreadandwritemanydifferentdatatypes

Exampleuses:

Example:GeneratingLanguageModelStatistics• Usedinthestatisticalmachinetranslationsystem

o needtocount#oftimesevery5-wordsequenceoccursinlargecorpusofdocuments(andkeepallthosewherecount>=4)

• EasywithMapReduce:o map :extract5-wordsequences=>countfromdocumento reduce :combinecounts,andkeepifcountlargeenough

Example:JoiningwithOtherData

• Example:generateper-docsummary,butincludeper-hostinformation(e.g.#ofpagesonhost,importanttermsonhost)o per-hostinformationmightbeinper-processdatastructure,or

mightinvolveRPCtoasetofmachinescontainingdataforallsites

• EasywithMapReduce:o map :extracthostnamefromURL,lookupper-hostinfo,

combinewithper-docdataandemito reduce :identityfunction(justemitkey/valuedirectly)

MapReduce:Scheduling

• Onemaster,manyworkerso InputdatasplitintoMmaptasks(typically64MBinsize)o Reducephasepartitioned intoRreducetaskso Tasksareassignedtoworkersdynamicallyo Often:M=200000,R=4000,workers=2000

• Masterassignseachmaptasktoafreeworkero Considerslocalityofdatatoworkerwhenassigning tasko Worker readstaskinput (often fromlocaldisk)o WorkerproducesRlocalfilescontaining intermediatek/vpairs

• Masterassignseachreducetasktoafreeworkero Worker readsintermediatek/vpairsfrommapworkerso Worker sorts&appliesuser’sReduceop toproduce theoutput

TaskGranularityandPipeliningFinegranularitytasks:manymoremaptasksthanmachines• Minimizestimeforfaultrecovery• Canpipelineshufflingwithmapexecution• Betterdynamicloadbalancing

Oftenuse200,000map/5000reducetasksw/2000machines

Faulttolerance:Handledviare-execution

• Onworkerfailure:o Detectfailureviaperiodicheartbeatso Re-executecompletedandin-progressmap taskso Re-executeinprogressreduce taskso Taskcompletioncommittedthroughmaster

• Masterfailure:o Stateischeckpointed :newmasterrecovers&continues

Robust:OnceGooglelost1600of1800machines,butfinishedfine

Refinement:Backuptasks

• Slowworkerssignificantlylengthencompletiontimeo Otherjobsconsumingresourcesonmachineo Baddiskswithsofterrorstransferdataveryslowlyo Weirdthings:processorcachesdisabled(!!)

• Solution:Nearendofphase,spawnbackupcopiesoftaskso Whicheveronefinishesfirst"wins"

• Effect:Dramaticallyshortensjobcompletiontime

Refinement:LocalityOptimization

Masterschedulingpolicy:• Asksforlocationsofreplicasofinputfileblocks• Maptaskstypicallysplitinto64MB• Maptasksscheduledsoinputblockreplicaareonsame

machineorsamerack

Effect:Thousandsofmachinesreadinputatlocaldiskspeed• Withoutthis,rackswitcheslimitreadrate

Refinement:SkippingBadRecords

Map/Reducefunctionssometimesfailforparticularinputs• Bestsolutionistodebug&fix,butnotalwayspossible

Onseg fault:• SendUDPpackettomasterfromsignalhandler• Includesequencenumberofrecordbeingprocessed

IfmasterseesK failuresforsamerecord(typicallyK setto2or3):• Nextworkeristoldtoskiptherecord

Effect:Canworkaroundbugsinthird-partylibraries

OtherRefinements

• Optionalsecondarykeysforordering

• Compressionofintermediatedata

• Combiner:usefulforsavingnetworkbandwidth

• Localexecutionfordebugging/testing

• User-definedcounters

“Playaround”

• AmazonElasticMapReduce(AmazonEMR)• Hortonworks Sandbox• MapR SandboxforHadoop• Qubole• MicrosoftAzureHDInsight• Cloudera

MapReduceexamplesinJava

Serializable vsWritable• Serializable storestheclassnameandtheobjectrepresentationto

thestream;otherinstancesoftheclassarereferredtobyanhandletotheclassname:thisapproachisnotusablewithrandomaccess

• Forthesamereason,thesortingneededfortheshuffleandsortphasecannotbeusedwithSerializable

• Thedeserializationprocesscreatesannewinstanceoftheobject,whileHadoopneedstoreuseobjecttominimizecomputation

• HadoopintroducesthetwointerfacesWritable andWritableComparable thatsolvetheseproblem

Writablewrappers

ImplementingWritable:theSumCount class

Glossary

WordCount

• http://www.gutenberg.org/cache/epub/201/pg201.txt

• InputData:Thetextofthebook“Flatland”byEdwinAbbott

WordCountmapper

WordCount reducer

WordCountresults

TopN :Wewanttofindthetop-nusedwordsofatextfile

• http://www.gutenberg.org/cache/epub/201/pg201.txt

• InputData:Thetextofthebook“Flatland”byEdwinAbbott

TopNmapper

TopN reducer

TopN results

MEAN:Wewanttofindthemeanmaxtemperatureforeverymonth

• http://archivio-meteo.distile.it/tabelle-dati-archivio-meteo/

• InputData:TemperatureinMilan(DD/MM/YYYY,MIN,MAX)02012015,-2, 703012015,-1,804012015,1,16…29012015,0,530012015,0,931012015,-3,6

Meanmapper

Meanreducer

Meanresults

TODO:k-meansclusteringalgorithm

• Wewanttoaggregate2Dpointsinclustersusingk-meansalgorithm

• Inputdata:Arandomsetofpoints2.2705 0.91781.8600 2.10022.0915 1.3679-0.16120.8481…

k-meansalgorithmInput:datapointsD,numberofclusterk

1. initializekcentroidsrandomly2. associateeachdatapointinDwiththenearestcentroid.

Thiswilldividethedatapointsintokclusters.3. recalculatethepositionofcentroids.

Repeatsteps2and3untiltherearenomorechangesinthemembershipofthedatapoints.

Output:datapointswithclustermemberships

MapReduceexamplesinPython

WordCountusingmrjob

“a”,936“ab”,6“abbot”,3“abbott”,2“abbreviated”,1…

ProductRecommendations

• Goal:Foreachproductaclientbuys,generatea‘peoplewhoboughtthisalsoboughtthis’recommendation

• InputData:product_id_1,product_id_2

CoincidentPurchaseFrequency

TopRecommendations

But…Supposeyouhave:

• userdatainonefile,• websitedatainanother,

andyouneedtofind

• thetop5 mostvisitedpagesbyusersaged18-25.

InMapReduce

InPigLatin

WhatisApachePig?

Idea:aMapReduceprogramessentiallyperformsagroup-by-aggregationinparalleloveraclusterofmachines.

• Pig isahigh-levelplatformforcreatingMapReduceprogramsusedwithHadoop.

• ThelanguageforthisplatformiscalledPigLatin.Itcombineshigh-leveldeclarativequeryinginthespiritofSQL,andlow-level,proceduralprogrammingà laMapReduce.

DevelopedatYahoo

Pig• ApachePig isaplatformforanalyzinglargedatasetsthat

consistsofahigh-levellanguageforexpressingdataanalysisprograms,coupledwithinfrastructureforevaluatingtheseprograms.ThesalientpropertyofPigprogramsisthattheirstructureisamenabletosubstantialparallelization,whichinturnsenablesthemtohandleverylargedatasets.

• Atthepresenttime,Pig'sinfrastructurelayerconsistsofacompilerthatproducessequencesofMap-Reduceprograms,forwhichlarge-scaleparallelimplementationsalreadyexist(e.g.,theHadoopsubproject).

PigLatinPigLatinhasthefollowingkeyproperties:

• Easeofprogramming. Itistrivialtoachieveparallelexecutionofsimple,"embarrassinglyparallel"dataanalysistasks.Complextaskscomprisedofmultipleinterrelateddatatransformationsareexplicitlyencodedasdataflowsequences,makingthemeasytowrite,understand,andmaintain.

• Optimizationopportunities.Thewayinwhichtasksareencodedpermitsthesystemtooptimizetheirexecutionautomatically,allowingtheusertofocusonsemanticsratherthanefficiency.

• Extensibility. Userscancreatetheirownfunctionstodospecial-purposeprocessing.

Performance

PigHighlights

• Userdefinedfunctions(UDFs)canbewrittenforcolumntransformation(TOUPPER),oraggregation(SUM)

• UDFscanbewrittentotakeadvantageofthecombiner• Fourjoinimplementationsbuiltin:hash,fragment-replicate,merge,

skewed• Multi-query:Pigwillcombinecertaintypesofoperationstogetherina

singlepipelinetoreducethenumberoftimesdataisscanned• Orderbyprovidestotalorderingacrossreducersinabalancedway• WritingloadandstorefunctionsiseasyonceanInputFormat and

OutputFormat exist• Piggybank,acollectionofuserscontributedUDFs

WhousesPigforwhat?

• 70%ofproductionjobsatYahoo (10ksperday)

• AlsousedbyTwitter,LinkedIn,Ebay,AOL,…

• Usedto– Processweblogs– Builduserbehaviormodels– Processimages– Buildmapsoftheweb– Doresearchonrawdatasets

Components

Pigresidesonusermachine

Usermachine

HadoopCluster

Jobexecutesoncluster

NoneedtoinstallanythingextraonyourHadoopcluster.

So,whyPig?

• Fasterdevelopment– Fewerlinesofcode– Don’tre-inventthewheel

• Flexible– Metadataisoptional– Extensible– Proceduralprogramming

But…

• Doyouneedyourprogramtorunfaster?

• Doesyouranalyticjobrunsforhours?

LimitationsofMapReduce

OneofthemajordrawbacksofMapReduceisitsinefficiencyinrunningiterativealgorithms.

MapReduceisnotdesignedforiterativeprocesses:aftereachiteration,theresultshavetobewrittentothedisktopassthemontothenextiteration.

degradationofperformance

LimitationsofPig

Pigusesbatchorientedframeworks,whichmeansyouranalyticsjobswillrunformanyminutesorhours.

Spark isfaster!

WhatisApacheSpark?

• Afastandgeneralcomputeengineforlarge-scaledataprocessing.

• Themajorfeature:theabilitytoperformin-memorycomputation(thedatacanbecachedinmemory).

• Sparkprovidesasimpleandexpressiveprogrammingmodelthatsupportsawiderangeofapplications,includingETL,machinelearning,streamprocessing,andgraphcomputation.

DevelopedattheUniversityofCaliforniaatBerkeley

Spark

• It providesaninterfaceforprogrammingentireclusterswithimplicitdataparallelismandfault-tolerance.

• Forcertaintasks,itistestedtobeupto100xfaster(datainmemory)or10x(dataindisk)fasterthanHadoopMapReduce

• ItcanrunonHadoopYARNmanagerandcanreaddatafromHDFS.

Spark• Designedtobeusedwitharangeofprogramminglanguagesand

onavarietyofarchitectures.

• Increasinglypopularwithawiderangeofdevelopers,thankstospeed,simplicity,andbroadsupport forexistingdevelopmentenvironmentsandstoragesystems.

• Relativelyaccessible tothoselearningtoworkwithitforthefirsttime.

• OneofApache'slargestandmostvibrant,withover500contributorsfrommorethan200organizationsresponsibleforcodeinthesoftwarerelease.

Why?

• SparkisbasicallydevelopedtoovercomeMapReduce’sshortcomingthatitisnotoptimizedforiterativealgorithms andinteractivedataanalysiswhichperformscyclicoperationsonsamesetofdata.

• SparkovercomesthisproblembyprovidinganewstorageprimitivecalledResilientDistributedDatasets (RDDs).

ResilientDistributedDatasets(RDDs)

TheResilientDistributedDatasetisaconceptattheheartofSpark.Itisdesignedtosupportin-memorydatastorage,distributedacrossaclusterinamannerthatisdemonstrablybothfault-tolerantandefficient.• Fault-tolerance isachieved,inpart,bytrackingthelineageof

transformationsappliedtocoarse-grainedsetsofdata.• Efficiency isachievedthroughparallelizationofprocessingacrossmultiple

nodesinthecluster,andminimizationofdatareplicationbetweenthosenodes.

OncedataisloadedintoanRDD,twobasictypesofoperationcanbecarriedout:• Transformations,whichcreateanewRDDbychangingtheoriginal

throughprocessessuchasmapping,filtering,andmore;• Actions,suchascounts,whichmeasurebutdonotchangetheoriginal

data.

WordCountinSpark

Anotherexample:logisticregression

Acommonmachinelearningalgorithmforclassifyingobjectssuchas,say,spamvs.non-spamemails.

PigvsSpark

• Pig– Thisisthebestdataloading toolavailableinsidehadoop.– Itusesascripting languagecalledPigLatin,whichismoreworkflowdriven.– Don'tneedtobeanexpertJavaprogrammerbutneedafewcodingskills.– Isalsoanabstractionlayerontopofmap-reduce.– Simple towriteandcontrol.

• Spark– Prettymuchthesuccessortomap-reduceinHadoop,withanemphasisonin-

memorycomputing.– You'llneedtobeaprettygood Javaprogrammer tousethis.– Muchlowerlevel.

Howtochooseaplatform?

• Thedecisiontochooseaparticularplatformforacertainapplicationusuallydependsonthefollowingimportantfactors:– datasize– speedorthroughputoptimization– modeldevelopment(Training/Applyingamodel)

Example:k-meansclusteringalgorithm

Thek-meansalgorithmisusedforprovidingmoreinsightintotheanalyticsalgorithmsondifferentplatforms.

Characteristics:• popularandwidelyused• iterativenature• compute-intensivetask(calculatingthecentroids)• aggregationofthelocalresultstoobtainaglobalsolution

k-meansalgorithmInput:datapointsD,numberofclusterk

1. initializekcentroidsrandomly2. associateeachdatapointinDwiththenearestcentroid.

Thiswilldividethedatapointsintokclusters.3. recalculatethepositionofcentroids.

Repeatsteps2and3untiltherearenomorechangesinthemembershipofthedatapoints.

Output:datapointswithclustermemberships

k-meansonMapReduceInput:datapointsD,numberofclusterkandcentroids

1. foreachdatapointd∈ Ddo2. assigndtotheclosestcentroid

Output:centroidswithassociateddatapoints

Input:centroidswithassociateddatapoints1. computethenewcentroidsbycalculatingthe

averageofdatapointsincluster2. writetheglobalcentroidstothedisk

Output:newcentroids

Map

Reduce

k-meansonPigLatinREGISTERudf.jarDEFINEfind_centroid FindCentroid('$centroids');points=LOAD'points.txt'as(id:int,pos:double);centroided=FOREACHpointsGENERATEpos,find_centroid(pos)ascentroid;grouped=GROUPcentroided BYcentroid;result=FOREACHgroupedGENERATEgroup,AVG(centroided.pos);STOREresultINTO'output’;

k-meansonSpark

SimilartoMapReduce-based implementation

• Insteadofwritingtheglobalcentroidstothedisk,theyarewrittentomemorywhichspeedsuptheprocessingandreducesthediskI/Ooverhead.

• Thedatawillbeloadedintothesystemmemoryinordertoprovidefasteraccess.

References• Dean,J.andGhemawat,S.MapReduce:Simplifieddataprocessingonlarge

clusters.InProceedingsofOperatingSystemsDesignandImplementation (OSDI).SanFrancisco,CA.137-150.2004

• Hadoop:OpensourceimplementationofMapReduce.http://lucene.apache.org/hadoop/

• C.Olston,B.Reed,U.Srivastava,R.Kumar,A.Tomkins.PigLatin:ANot-so-foreignLanguageforDataProcessing.InProceedingsSIGMOD'08.2008

• D.SinghandC.K.Reddy.Asurveyonplatformsforbigdataanalytics, JournalofBigData.2014

• M.Zaharia,M.Chowdhury,T.Das,A.Dave,J.Ma,M.McCauley,M.Franklin,S.Shenker, andI.Stoica. ResilientDistributedDatasets:AFault-TolerantAbstractionforIn-MemoryClusterComputing. USENIXNSDI.2012

Documents

MapReduce - unipi.itdidawiki.di.unipi.it/lib/exe/fetch.php/magistraleinformatica/pa/mapreduce.pdfHadoop • Other Hadoop-related projects at Apache include : – Ambari: A web-based