83
MapReduce & Pig & Spark Ioanna Miliou Giuseppe Attardi Advanced Programming Università di Pisa

MapReduce - unipi.itdidawiki.di.unipi.it/lib/exe/fetch.php/magistraleinformatica/pa/mapreduce.pdfHadoop • Other Hadoop-related projects at Apache include : – Ambari: A web-based

  • Upload
    others

  • View
    5

  • Download
    0

Embed Size (px)

Citation preview

Page 1: MapReduce - unipi.itdidawiki.di.unipi.it/lib/exe/fetch.php/magistraleinformatica/pa/mapreduce.pdfHadoop • Other Hadoop-related projects at Apache include : – Ambari: A web-based

MapReduce&Pig&Spark

IoannaMiliouGiuseppeAttardi

AdvancedProgrammingUniversitadiPisa

Page 2: MapReduce - unipi.itdidawiki.di.unipi.it/lib/exe/fetch.php/magistraleinformatica/pa/mapreduce.pdfHadoop • Other Hadoop-related projects at Apache include : – Ambari: A web-based

Hadoop• TheApache™Hadoop®projectdevelopsopen-sourcesoftwarefor

reliable,scalable,distributedcomputing.

• Frameworkthatallowsforthedistributedprocessingoflargedatasetsacrossclustersofcomputersusingsimpleprogrammingmodels.

• Itisdesignedtoscaleupfromsingleserverstothousandsofmachines,eachofferinglocalcomputationandstorage.

• Itisdesignedtodetectandhandlefailuresattheapplicationlayer.

ThecoreofApacheHadoopconsistsofastoragepart,knownasHadoopDistributedFileSystem(HDFS),andaprocessingpartcalledMapReduce.

Page 3: MapReduce - unipi.itdidawiki.di.unipi.it/lib/exe/fetch.php/magistraleinformatica/pa/mapreduce.pdfHadoop • Other Hadoop-related projects at Apache include : – Ambari: A web-based

Hadoop• Theprojectincludesthesemodules:

– HadoopCommon:ThecommonutilitiesthatsupporttheotherHadoopmodules.

– HadoopDistributedFileSystem(HDFS):Adistributedfilesystemthatprovideshigh-throughputaccesstoapplicationdata.

– HadoopYARN:Aframeworkforjobschedulingandclusterresourcemanagement.

– HadoopMapReduce:AYARN-basedsystemforparallelprocessingoflargedatasets.

Page 4: MapReduce - unipi.itdidawiki.di.unipi.it/lib/exe/fetch.php/magistraleinformatica/pa/mapreduce.pdfHadoop • Other Hadoop-related projects at Apache include : – Ambari: A web-based

Hadoop• OtherHadoop-relatedprojectsatApacheinclude:

– Ambari:Aweb-basedtoolforprovisioning, managing, andmonitoring ApacheHadoop.

– Avro:Adataserializationsystem.– Cassandra:Ascalablemulti-masterdatabasewithnosinglepointsoffailure.– Chukwa:Adatacollectionsystemformanaging largedistributed systems.– HBase:Ascalable,distributeddatabasethatsupports structureddatastorage

forlargetables.– Hive:Adatawarehouseinfrastructurethatprovidesdatasummarizationand

adhocquerying.– Mahout:AScalablemachinelearninganddatamining library.– Tez:Ageneralizeddata-flowprogramming framework,builtonHadoopYARN,

whichprovidesapowerfulandflexibleengine toexecuteanarbitraryDAGoftaskstoprocessdataforbothbatchandinteractiveuse-cases.

– ZooKeeper:Ahigh-performance coordinationservicefordistributedapplications.

Page 5: MapReduce - unipi.itdidawiki.di.unipi.it/lib/exe/fetch.php/magistraleinformatica/pa/mapreduce.pdfHadoop • Other Hadoop-related projects at Apache include : – Ambari: A web-based

Hadoop– Pig:Ahigh-leveldata-flowlanguageandexecutionframeworkforparallelcomputation.

– Spark:AfastandgeneralcomputeengineforHadoopdata.Sparkprovidesasimpleandexpressiveprogrammingmodelthatsupportsawiderangeofapplications,includingETL,machinelearning,streamprocessing,andgraphcomputation.

Page 6: MapReduce - unipi.itdidawiki.di.unipi.it/lib/exe/fetch.php/magistraleinformatica/pa/mapreduce.pdfHadoop • Other Hadoop-related projects at Apache include : – Ambari: A web-based

HadoopStack

Page 7: MapReduce - unipi.itdidawiki.di.unipi.it/lib/exe/fetch.php/magistraleinformatica/pa/mapreduce.pdfHadoop • Other Hadoop-related projects at Apache include : – Ambari: A web-based

WhatisMapReduce?

Page 8: MapReduce - unipi.itdidawiki.di.unipi.it/lib/exe/fetch.php/magistraleinformatica/pa/mapreduce.pdfHadoop • Other Hadoop-related projects at Apache include : – Ambari: A web-based

• MapReduceistheheartofHadoop®

• ProgrammingparadigmthatallowsformassivescalabilityacrosshundredsorthousandsofserversinaHadoopcluster.

ProposedbyDeanandGhemawat atGoogle

Page 9: MapReduce - unipi.itdidawiki.di.unipi.it/lib/exe/fetch.php/magistraleinformatica/pa/mapreduce.pdfHadoop • Other Hadoop-related projects at Apache include : – Ambari: A web-based

Whatisit?

• ProcessingengineofHadoop• Usedforbigdatabatchprocessing• Parallelprocessingofhugedatavolumes• Faulttolerant• Scalable

Page 10: MapReduce - unipi.itdidawiki.di.unipi.it/lib/exe/fetch.php/magistraleinformatica/pa/mapreduce.pdfHadoop • Other Hadoop-related projects at Apache include : – Ambari: A web-based

Whyuseit?

• YourdatainTerabyte/Petabyterange• YouhavehugeI/O• Hadoopframeworktakescareof– Jobandtaskmanagement– Failures– Storage– Replication

YoujustwriteMapandReducejobs

Page 11: MapReduce - unipi.itdidawiki.di.unipi.it/lib/exe/fetch.php/magistraleinformatica/pa/mapreduce.pdfHadoop • Other Hadoop-related projects at Apache include : – Ambari: A web-based

BigUsers

• Users– Facebook– Yahoo– Amazon– Ebay

• Providers– Amazon– Cloudera– HortonWorks– MapR

Page 12: MapReduce - unipi.itdidawiki.di.unipi.it/lib/exe/fetch.php/magistraleinformatica/pa/mapreduce.pdfHadoop • Other Hadoop-related projects at Apache include : – Ambari: A web-based

Map&ReduceThetermMapReduceactuallyreferstotwoseparateanddistincttasksthatHadoopprogramsperform.

1. Themap job,whichtakesasetofdataandconvertsitintoanothersetofdata,whereindividualelementsarebrokendownintotuples(key/valuepairs).

map(k1,v1)→list(k2,v2)

1. Thereduce jobtakestheoutputfromamapasinputandcombinesthosedatatuplesintoasmallersetoftuples.

reduce(k2,list(v2))→list(v2)

AsthesequenceofthenameMapReduceimplies,thereducejobisalwaysperformedafterthemapjob.

Page 13: MapReduce - unipi.itdidawiki.di.unipi.it/lib/exe/fetch.php/magistraleinformatica/pa/mapreduce.pdfHadoop • Other Hadoop-related projects at Apache include : – Ambari: A web-based

TypicalProblemsolvedbyMapReduce

• Readalotofdata

• Map :extractsomethingyoucareaboutfromeachrecord

• ShuffleandSort

• Reduce:aggregate,summarize,filter,ortransform

• Writetheresults

InputData

Map Map Map Map

Shuffle

Reduce Reduce

Results

Page 14: MapReduce - unipi.itdidawiki.di.unipi.it/lib/exe/fetch.php/magistraleinformatica/pa/mapreduce.pdfHadoop • Other Hadoop-related projects at Apache include : – Ambari: A web-based

Example:WordCountinWebPages

Atypicalexerciseforanewengineerinhisorherfirstweek

• Inputisfileswithonedocumentperrecord• Specifyamap functionthattakesakey/valuepair

key=documentURLvalue=documentcontents

• Outputofmapfunctionis(potentiallymany)key/valuepairs.Inourcase,output(word,“1”)onceperwordinthedocument

“document1”, “AppleOrangeMangoOrangeGrapesPlum”

“Apple”,“1”“Orange”,“1”“Mango”,“1”

Page 15: MapReduce - unipi.itdidawiki.di.unipi.it/lib/exe/fetch.php/magistraleinformatica/pa/mapreduce.pdfHadoop • Other Hadoop-related projects at Apache include : – Ambari: A web-based

Examplecontinued:WordCountinWebPages

• MapReducelibrarygatherstogetherallpairswiththesamekey(shuffle/sort)

• ThereducefunctioncombinesthevaluesforakeyInourcase,computethesum

• Outputofreducepairedwithkeyandsaved

key=“Apple”values=“1”

key=“Mango”values=“1”

key=“Orange”values=“1”,“1”

key=“Plum”values=“1”

key=“Grapes”values=“1”

“1” “1”“1”“1”“2”

“Apple”,“1”“Orange”,“2”“Mango”,“1”“Grapes”,“1”“Plum”,“1”

Page 16: MapReduce - unipi.itdidawiki.di.unipi.it/lib/exe/fetch.php/magistraleinformatica/pa/mapreduce.pdfHadoop • Other Hadoop-related projects at Apache include : – Ambari: A web-based

ExamplePseudo-code

Page 17: MapReduce - unipi.itdidawiki.di.unipi.it/lib/exe/fetch.php/magistraleinformatica/pa/mapreduce.pdfHadoop • Other Hadoop-related projects at Apache include : – Ambari: A web-based

map() reduce()

Page 18: MapReduce - unipi.itdidawiki.di.unipi.it/lib/exe/fetch.php/magistraleinformatica/pa/mapreduce.pdfHadoop • Other Hadoop-related projects at Apache include : – Ambari: A web-based

MapReducewrappersWrappershavebeendevelopedinorderto:• provideabettercontrolovertheMapReducecode• aidinthesourcecodedevelopment

Somewell-knownexample:• Sawzall (Google)• Pig(originallyYahoo,nowApache)• Hive(Facebook)• DryadLINQ (Microsoft)

Page 19: MapReduce - unipi.itdidawiki.di.unipi.it/lib/exe/fetch.php/magistraleinformatica/pa/mapreduce.pdfHadoop • Other Hadoop-related projects at Apache include : – Ambari: A web-based

WidelyapplicableatGoogle• ImplementedasaC++librarylinkedtouserprograms• Canreadandwritemanydifferentdatatypes

Exampleuses:

Page 20: MapReduce - unipi.itdidawiki.di.unipi.it/lib/exe/fetch.php/magistraleinformatica/pa/mapreduce.pdfHadoop • Other Hadoop-related projects at Apache include : – Ambari: A web-based

Example:GeneratingLanguageModelStatistics• Usedinthestatisticalmachinetranslationsystem

o needtocount#oftimesevery5-wordsequenceoccursinlargecorpusofdocuments(andkeepallthosewherecount>=4)

• EasywithMapReduce:o map :extract5-wordsequences=>countfromdocumento reduce :combinecounts,andkeepifcountlargeenough

Page 21: MapReduce - unipi.itdidawiki.di.unipi.it/lib/exe/fetch.php/magistraleinformatica/pa/mapreduce.pdfHadoop • Other Hadoop-related projects at Apache include : – Ambari: A web-based

Example:JoiningwithOtherData

• Example:generateper-docsummary,butincludeper-hostinformation(e.g.#ofpagesonhost,importanttermsonhost)o per-hostinformationmightbeinper-processdatastructure,or

mightinvolveRPCtoasetofmachinescontainingdataforallsites

• EasywithMapReduce:o map :extracthostnamefromURL,lookupper-hostinfo,

combinewithper-docdataandemito reduce :identityfunction(justemitkey/valuedirectly)

Page 22: MapReduce - unipi.itdidawiki.di.unipi.it/lib/exe/fetch.php/magistraleinformatica/pa/mapreduce.pdfHadoop • Other Hadoop-related projects at Apache include : – Ambari: A web-based

MapReduce:Scheduling

• Onemaster,manyworkerso InputdatasplitintoMmaptasks(typically64MBinsize)o Reducephasepartitioned intoRreducetaskso Tasksareassignedtoworkersdynamicallyo Often:M=200000,R=4000,workers=2000

• Masterassignseachmaptasktoafreeworkero Considerslocalityofdatatoworkerwhenassigning tasko Worker readstaskinput (often fromlocaldisk)o WorkerproducesRlocalfilescontaining intermediatek/vpairs

• Masterassignseachreducetasktoafreeworkero Worker readsintermediatek/vpairsfrommapworkerso Worker sorts&appliesuser’sReduceop toproduce theoutput

Page 23: MapReduce - unipi.itdidawiki.di.unipi.it/lib/exe/fetch.php/magistraleinformatica/pa/mapreduce.pdfHadoop • Other Hadoop-related projects at Apache include : – Ambari: A web-based
Page 24: MapReduce - unipi.itdidawiki.di.unipi.it/lib/exe/fetch.php/magistraleinformatica/pa/mapreduce.pdfHadoop • Other Hadoop-related projects at Apache include : – Ambari: A web-based

TaskGranularityandPipeliningFinegranularitytasks:manymoremaptasksthanmachines• Minimizestimeforfaultrecovery• Canpipelineshufflingwithmapexecution• Betterdynamicloadbalancing

Oftenuse200,000map/5000reducetasksw/2000machines

Page 25: MapReduce - unipi.itdidawiki.di.unipi.it/lib/exe/fetch.php/magistraleinformatica/pa/mapreduce.pdfHadoop • Other Hadoop-related projects at Apache include : – Ambari: A web-based

Faulttolerance:Handledviare-execution

• Onworkerfailure:o Detectfailureviaperiodicheartbeatso Re-executecompletedandin-progressmap taskso Re-executeinprogressreduce taskso Taskcompletioncommittedthroughmaster

• Masterfailure:o Stateischeckpointed :newmasterrecovers&continues

Robust:OnceGooglelost1600of1800machines,butfinishedfine

Page 26: MapReduce - unipi.itdidawiki.di.unipi.it/lib/exe/fetch.php/magistraleinformatica/pa/mapreduce.pdfHadoop • Other Hadoop-related projects at Apache include : – Ambari: A web-based

Refinement:Backuptasks

• Slowworkerssignificantlylengthencompletiontimeo Otherjobsconsumingresourcesonmachineo Baddiskswithsofterrorstransferdataveryslowlyo Weirdthings:processorcachesdisabled(!!)

• Solution:Nearendofphase,spawnbackupcopiesoftaskso Whicheveronefinishesfirst"wins"

• Effect:Dramaticallyshortensjobcompletiontime

Page 27: MapReduce - unipi.itdidawiki.di.unipi.it/lib/exe/fetch.php/magistraleinformatica/pa/mapreduce.pdfHadoop • Other Hadoop-related projects at Apache include : – Ambari: A web-based

Refinement:LocalityOptimization

Masterschedulingpolicy:• Asksforlocationsofreplicasofinputfileblocks• Maptaskstypicallysplitinto64MB• Maptasksscheduledsoinputblockreplicaareonsame

machineorsamerack

Effect:Thousandsofmachinesreadinputatlocaldiskspeed• Withoutthis,rackswitcheslimitreadrate

Page 28: MapReduce - unipi.itdidawiki.di.unipi.it/lib/exe/fetch.php/magistraleinformatica/pa/mapreduce.pdfHadoop • Other Hadoop-related projects at Apache include : – Ambari: A web-based

Refinement:SkippingBadRecords

Map/Reducefunctionssometimesfailforparticularinputs• Bestsolutionistodebug&fix,butnotalwayspossible

Onseg fault:• SendUDPpackettomasterfromsignalhandler• Includesequencenumberofrecordbeingprocessed

IfmasterseesK failuresforsamerecord(typicallyK setto2or3):• Nextworkeristoldtoskiptherecord

Effect:Canworkaroundbugsinthird-partylibraries

Page 29: MapReduce - unipi.itdidawiki.di.unipi.it/lib/exe/fetch.php/magistraleinformatica/pa/mapreduce.pdfHadoop • Other Hadoop-related projects at Apache include : – Ambari: A web-based

OtherRefinements

• Optionalsecondarykeysforordering

• Compressionofintermediatedata

• Combiner:usefulforsavingnetworkbandwidth

• Localexecutionfordebugging/testing

• User-definedcounters

Page 30: MapReduce - unipi.itdidawiki.di.unipi.it/lib/exe/fetch.php/magistraleinformatica/pa/mapreduce.pdfHadoop • Other Hadoop-related projects at Apache include : – Ambari: A web-based

“Playaround”

• AmazonElasticMapReduce(AmazonEMR)• Hortonworks Sandbox• MapR SandboxforHadoop• Qubole• MicrosoftAzureHDInsight• Cloudera

Page 31: MapReduce - unipi.itdidawiki.di.unipi.it/lib/exe/fetch.php/magistraleinformatica/pa/mapreduce.pdfHadoop • Other Hadoop-related projects at Apache include : – Ambari: A web-based

MapReduceexamplesinJava

Page 32: MapReduce - unipi.itdidawiki.di.unipi.it/lib/exe/fetch.php/magistraleinformatica/pa/mapreduce.pdfHadoop • Other Hadoop-related projects at Apache include : – Ambari: A web-based

Serializable vsWritable• Serializable storestheclassnameandtheobjectrepresentationto

thestream;otherinstancesoftheclassarereferredtobyanhandletotheclassname:thisapproachisnotusablewithrandomaccess

• Forthesamereason,thesortingneededfortheshuffleandsortphasecannotbeusedwithSerializable

• Thedeserializationprocesscreatesannewinstanceoftheobject,whileHadoopneedstoreuseobjecttominimizecomputation

• HadoopintroducesthetwointerfacesWritable andWritableComparable thatsolvetheseproblem

Page 33: MapReduce - unipi.itdidawiki.di.unipi.it/lib/exe/fetch.php/magistraleinformatica/pa/mapreduce.pdfHadoop • Other Hadoop-related projects at Apache include : – Ambari: A web-based

Writablewrappers

Page 34: MapReduce - unipi.itdidawiki.di.unipi.it/lib/exe/fetch.php/magistraleinformatica/pa/mapreduce.pdfHadoop • Other Hadoop-related projects at Apache include : – Ambari: A web-based

ImplementingWritable:theSumCount class

Page 35: MapReduce - unipi.itdidawiki.di.unipi.it/lib/exe/fetch.php/magistraleinformatica/pa/mapreduce.pdfHadoop • Other Hadoop-related projects at Apache include : – Ambari: A web-based

Glossary

Page 36: MapReduce - unipi.itdidawiki.di.unipi.it/lib/exe/fetch.php/magistraleinformatica/pa/mapreduce.pdfHadoop • Other Hadoop-related projects at Apache include : – Ambari: A web-based

WordCount

• http://www.gutenberg.org/cache/epub/201/pg201.txt

• InputData:Thetextofthebook“Flatland”byEdwinAbbott

Page 37: MapReduce - unipi.itdidawiki.di.unipi.it/lib/exe/fetch.php/magistraleinformatica/pa/mapreduce.pdfHadoop • Other Hadoop-related projects at Apache include : – Ambari: A web-based

WordCountmapper

Page 38: MapReduce - unipi.itdidawiki.di.unipi.it/lib/exe/fetch.php/magistraleinformatica/pa/mapreduce.pdfHadoop • Other Hadoop-related projects at Apache include : – Ambari: A web-based

WordCount reducer

Page 39: MapReduce - unipi.itdidawiki.di.unipi.it/lib/exe/fetch.php/magistraleinformatica/pa/mapreduce.pdfHadoop • Other Hadoop-related projects at Apache include : – Ambari: A web-based

WordCountresults

Page 40: MapReduce - unipi.itdidawiki.di.unipi.it/lib/exe/fetch.php/magistraleinformatica/pa/mapreduce.pdfHadoop • Other Hadoop-related projects at Apache include : – Ambari: A web-based

TopN :Wewanttofindthetop-nusedwordsofatextfile

• http://www.gutenberg.org/cache/epub/201/pg201.txt

• InputData:Thetextofthebook“Flatland”byEdwinAbbott

Page 41: MapReduce - unipi.itdidawiki.di.unipi.it/lib/exe/fetch.php/magistraleinformatica/pa/mapreduce.pdfHadoop • Other Hadoop-related projects at Apache include : – Ambari: A web-based

TopNmapper

Page 42: MapReduce - unipi.itdidawiki.di.unipi.it/lib/exe/fetch.php/magistraleinformatica/pa/mapreduce.pdfHadoop • Other Hadoop-related projects at Apache include : – Ambari: A web-based

TopN reducer

Page 43: MapReduce - unipi.itdidawiki.di.unipi.it/lib/exe/fetch.php/magistraleinformatica/pa/mapreduce.pdfHadoop • Other Hadoop-related projects at Apache include : – Ambari: A web-based

TopN results

Page 44: MapReduce - unipi.itdidawiki.di.unipi.it/lib/exe/fetch.php/magistraleinformatica/pa/mapreduce.pdfHadoop • Other Hadoop-related projects at Apache include : – Ambari: A web-based

MEAN:Wewanttofindthemeanmaxtemperatureforeverymonth

• http://archivio-meteo.distile.it/tabelle-dati-archivio-meteo/

• InputData:TemperatureinMilan(DD/MM/YYYY,MIN,MAX)02012015,-2, 703012015,-1,804012015,1,16…29012015,0,530012015,0,931012015,-3,6

Page 45: MapReduce - unipi.itdidawiki.di.unipi.it/lib/exe/fetch.php/magistraleinformatica/pa/mapreduce.pdfHadoop • Other Hadoop-related projects at Apache include : – Ambari: A web-based

Meanmapper

Page 46: MapReduce - unipi.itdidawiki.di.unipi.it/lib/exe/fetch.php/magistraleinformatica/pa/mapreduce.pdfHadoop • Other Hadoop-related projects at Apache include : – Ambari: A web-based

Meanreducer

Page 47: MapReduce - unipi.itdidawiki.di.unipi.it/lib/exe/fetch.php/magistraleinformatica/pa/mapreduce.pdfHadoop • Other Hadoop-related projects at Apache include : – Ambari: A web-based

Meanresults

Page 48: MapReduce - unipi.itdidawiki.di.unipi.it/lib/exe/fetch.php/magistraleinformatica/pa/mapreduce.pdfHadoop • Other Hadoop-related projects at Apache include : – Ambari: A web-based

TODO:k-meansclusteringalgorithm

• Wewanttoaggregate2Dpointsinclustersusingk-meansalgorithm

• Inputdata:Arandomsetofpoints2.2705 0.91781.8600 2.10022.0915 1.3679-0.16120.8481…

Page 49: MapReduce - unipi.itdidawiki.di.unipi.it/lib/exe/fetch.php/magistraleinformatica/pa/mapreduce.pdfHadoop • Other Hadoop-related projects at Apache include : – Ambari: A web-based

k-meansalgorithmInput:datapointsD,numberofclusterk

1. initializekcentroidsrandomly2. associateeachdatapointinDwiththenearestcentroid.

Thiswilldividethedatapointsintokclusters.3. recalculatethepositionofcentroids.

Repeatsteps2and3untiltherearenomorechangesinthemembershipofthedatapoints.

Output:datapointswithclustermemberships

Page 50: MapReduce - unipi.itdidawiki.di.unipi.it/lib/exe/fetch.php/magistraleinformatica/pa/mapreduce.pdfHadoop • Other Hadoop-related projects at Apache include : – Ambari: A web-based

MapReduceexamplesinPython

Page 51: MapReduce - unipi.itdidawiki.di.unipi.it/lib/exe/fetch.php/magistraleinformatica/pa/mapreduce.pdfHadoop • Other Hadoop-related projects at Apache include : – Ambari: A web-based

WordCountusingmrjob

“a”,936“ab”,6“abbot”,3“abbott”,2“abbreviated”,1…

Page 52: MapReduce - unipi.itdidawiki.di.unipi.it/lib/exe/fetch.php/magistraleinformatica/pa/mapreduce.pdfHadoop • Other Hadoop-related projects at Apache include : – Ambari: A web-based

ProductRecommendations

• Goal:Foreachproductaclientbuys,generatea‘peoplewhoboughtthisalsoboughtthis’recommendation

• InputData:product_id_1,product_id_2

Page 53: MapReduce - unipi.itdidawiki.di.unipi.it/lib/exe/fetch.php/magistraleinformatica/pa/mapreduce.pdfHadoop • Other Hadoop-related projects at Apache include : – Ambari: A web-based

CoincidentPurchaseFrequency

Page 54: MapReduce - unipi.itdidawiki.di.unipi.it/lib/exe/fetch.php/magistraleinformatica/pa/mapreduce.pdfHadoop • Other Hadoop-related projects at Apache include : – Ambari: A web-based

TopRecommendations

Page 55: MapReduce - unipi.itdidawiki.di.unipi.it/lib/exe/fetch.php/magistraleinformatica/pa/mapreduce.pdfHadoop • Other Hadoop-related projects at Apache include : – Ambari: A web-based

But…Supposeyouhave:

• userdatainonefile,• websitedatainanother,

andyouneedtofind

• thetop5 mostvisitedpagesbyusersaged18-25.

Page 56: MapReduce - unipi.itdidawiki.di.unipi.it/lib/exe/fetch.php/magistraleinformatica/pa/mapreduce.pdfHadoop • Other Hadoop-related projects at Apache include : – Ambari: A web-based

InMapReduce

Page 57: MapReduce - unipi.itdidawiki.di.unipi.it/lib/exe/fetch.php/magistraleinformatica/pa/mapreduce.pdfHadoop • Other Hadoop-related projects at Apache include : – Ambari: A web-based

InPigLatin

Page 58: MapReduce - unipi.itdidawiki.di.unipi.it/lib/exe/fetch.php/magistraleinformatica/pa/mapreduce.pdfHadoop • Other Hadoop-related projects at Apache include : – Ambari: A web-based

WhatisApachePig?

Idea:aMapReduceprogramessentiallyperformsagroup-by-aggregationinparalleloveraclusterofmachines.

• Pig isahigh-levelplatformforcreatingMapReduceprogramsusedwithHadoop.

• ThelanguageforthisplatformiscalledPigLatin.Itcombineshigh-leveldeclarativequeryinginthespiritofSQL,andlow-level,proceduralprogrammingà laMapReduce.

DevelopedatYahoo

Page 59: MapReduce - unipi.itdidawiki.di.unipi.it/lib/exe/fetch.php/magistraleinformatica/pa/mapreduce.pdfHadoop • Other Hadoop-related projects at Apache include : – Ambari: A web-based

Pig• ApachePig isaplatformforanalyzinglargedatasetsthat

consistsofahigh-levellanguageforexpressingdataanalysisprograms,coupledwithinfrastructureforevaluatingtheseprograms.ThesalientpropertyofPigprogramsisthattheirstructureisamenabletosubstantialparallelization,whichinturnsenablesthemtohandleverylargedatasets.

• Atthepresenttime,Pig'sinfrastructurelayerconsistsofacompilerthatproducessequencesofMap-Reduceprograms,forwhichlarge-scaleparallelimplementationsalreadyexist(e.g.,theHadoopsubproject).

Page 60: MapReduce - unipi.itdidawiki.di.unipi.it/lib/exe/fetch.php/magistraleinformatica/pa/mapreduce.pdfHadoop • Other Hadoop-related projects at Apache include : – Ambari: A web-based

PigLatinPigLatinhasthefollowingkeyproperties:

• Easeofprogramming. Itistrivialtoachieveparallelexecutionofsimple,"embarrassinglyparallel"dataanalysistasks.Complextaskscomprisedofmultipleinterrelateddatatransformationsareexplicitlyencodedasdataflowsequences,makingthemeasytowrite,understand,andmaintain.

• Optimizationopportunities.Thewayinwhichtasksareencodedpermitsthesystemtooptimizetheirexecutionautomatically,allowingtheusertofocusonsemanticsratherthanefficiency.

• Extensibility. Userscancreatetheirownfunctionstodospecial-purposeprocessing.

Page 61: MapReduce - unipi.itdidawiki.di.unipi.it/lib/exe/fetch.php/magistraleinformatica/pa/mapreduce.pdfHadoop • Other Hadoop-related projects at Apache include : – Ambari: A web-based

Performance

Page 62: MapReduce - unipi.itdidawiki.di.unipi.it/lib/exe/fetch.php/magistraleinformatica/pa/mapreduce.pdfHadoop • Other Hadoop-related projects at Apache include : – Ambari: A web-based

PigHighlights

• Userdefinedfunctions(UDFs)canbewrittenforcolumntransformation(TOUPPER),oraggregation(SUM)

• UDFscanbewrittentotakeadvantageofthecombiner• Fourjoinimplementationsbuiltin:hash,fragment-replicate,merge,

skewed• Multi-query:Pigwillcombinecertaintypesofoperationstogetherina

singlepipelinetoreducethenumberoftimesdataisscanned• Orderbyprovidestotalorderingacrossreducersinabalancedway• WritingloadandstorefunctionsiseasyonceanInputFormat and

OutputFormat exist• Piggybank,acollectionofuserscontributedUDFs

Page 63: MapReduce - unipi.itdidawiki.di.unipi.it/lib/exe/fetch.php/magistraleinformatica/pa/mapreduce.pdfHadoop • Other Hadoop-related projects at Apache include : – Ambari: A web-based

WhousesPigforwhat?

• 70%ofproductionjobsatYahoo (10ksperday)

• AlsousedbyTwitter,LinkedIn,Ebay,AOL,…

• Usedto– Processweblogs– Builduserbehaviormodels– Processimages– Buildmapsoftheweb– Doresearchonrawdatasets

Page 64: MapReduce - unipi.itdidawiki.di.unipi.it/lib/exe/fetch.php/magistraleinformatica/pa/mapreduce.pdfHadoop • Other Hadoop-related projects at Apache include : – Ambari: A web-based

Components

Pigresidesonusermachine

Usermachine

HadoopCluster

Jobexecutesoncluster

NoneedtoinstallanythingextraonyourHadoopcluster.

Page 65: MapReduce - unipi.itdidawiki.di.unipi.it/lib/exe/fetch.php/magistraleinformatica/pa/mapreduce.pdfHadoop • Other Hadoop-related projects at Apache include : – Ambari: A web-based

So,whyPig?

• Fasterdevelopment– Fewerlinesofcode– Don’tre-inventthewheel

• Flexible– Metadataisoptional– Extensible– Proceduralprogramming

Page 66: MapReduce - unipi.itdidawiki.di.unipi.it/lib/exe/fetch.php/magistraleinformatica/pa/mapreduce.pdfHadoop • Other Hadoop-related projects at Apache include : – Ambari: A web-based

But…

• Doyouneedyourprogramtorunfaster?

• Doesyouranalyticjobrunsforhours?

Page 67: MapReduce - unipi.itdidawiki.di.unipi.it/lib/exe/fetch.php/magistraleinformatica/pa/mapreduce.pdfHadoop • Other Hadoop-related projects at Apache include : – Ambari: A web-based

LimitationsofMapReduce

OneofthemajordrawbacksofMapReduceisitsinefficiencyinrunningiterativealgorithms.

MapReduceisnotdesignedforiterativeprocesses:aftereachiteration,theresultshavetobewrittentothedisktopassthemontothenextiteration.

degradationofperformance

Page 68: MapReduce - unipi.itdidawiki.di.unipi.it/lib/exe/fetch.php/magistraleinformatica/pa/mapreduce.pdfHadoop • Other Hadoop-related projects at Apache include : – Ambari: A web-based

LimitationsofPig

Pigusesbatchorientedframeworks,whichmeansyouranalyticsjobswillrunformanyminutesorhours.

Spark isfaster!

Page 69: MapReduce - unipi.itdidawiki.di.unipi.it/lib/exe/fetch.php/magistraleinformatica/pa/mapreduce.pdfHadoop • Other Hadoop-related projects at Apache include : – Ambari: A web-based

WhatisApacheSpark?

• Afastandgeneralcomputeengineforlarge-scaledataprocessing.

• Themajorfeature:theabilitytoperformin-memorycomputation(thedatacanbecachedinmemory).

• Sparkprovidesasimpleandexpressiveprogrammingmodelthatsupportsawiderangeofapplications,includingETL,machinelearning,streamprocessing,andgraphcomputation.

DevelopedattheUniversityofCaliforniaatBerkeley

Page 70: MapReduce - unipi.itdidawiki.di.unipi.it/lib/exe/fetch.php/magistraleinformatica/pa/mapreduce.pdfHadoop • Other Hadoop-related projects at Apache include : – Ambari: A web-based

Spark

• It providesaninterfaceforprogrammingentireclusterswithimplicitdataparallelismandfault-tolerance.

• Forcertaintasks,itistestedtobeupto100xfaster(datainmemory)or10x(dataindisk)fasterthanHadoopMapReduce

• ItcanrunonHadoopYARNmanagerandcanreaddatafromHDFS.

Page 71: MapReduce - unipi.itdidawiki.di.unipi.it/lib/exe/fetch.php/magistraleinformatica/pa/mapreduce.pdfHadoop • Other Hadoop-related projects at Apache include : – Ambari: A web-based

Spark• Designedtobeusedwitharangeofprogramminglanguagesand

onavarietyofarchitectures.

• Increasinglypopularwithawiderangeofdevelopers,thankstospeed,simplicity,andbroadsupport forexistingdevelopmentenvironmentsandstoragesystems.

• Relativelyaccessible tothoselearningtoworkwithitforthefirsttime.

• OneofApache'slargestandmostvibrant,withover500contributorsfrommorethan200organizationsresponsibleforcodeinthesoftwarerelease.

Page 72: MapReduce - unipi.itdidawiki.di.unipi.it/lib/exe/fetch.php/magistraleinformatica/pa/mapreduce.pdfHadoop • Other Hadoop-related projects at Apache include : – Ambari: A web-based

Why?

• SparkisbasicallydevelopedtoovercomeMapReduce’sshortcomingthatitisnotoptimizedforiterativealgorithms andinteractivedataanalysiswhichperformscyclicoperationsonsamesetofdata.

• SparkovercomesthisproblembyprovidinganewstorageprimitivecalledResilientDistributedDatasets (RDDs).

Page 73: MapReduce - unipi.itdidawiki.di.unipi.it/lib/exe/fetch.php/magistraleinformatica/pa/mapreduce.pdfHadoop • Other Hadoop-related projects at Apache include : – Ambari: A web-based

ResilientDistributedDatasets(RDDs)

TheResilientDistributedDatasetisaconceptattheheartofSpark.Itisdesignedtosupportin-memorydatastorage,distributedacrossaclusterinamannerthatisdemonstrablybothfault-tolerantandefficient.• Fault-tolerance isachieved,inpart,bytrackingthelineageof

transformationsappliedtocoarse-grainedsetsofdata.• Efficiency isachievedthroughparallelizationofprocessingacrossmultiple

nodesinthecluster,andminimizationofdatareplicationbetweenthosenodes.

OncedataisloadedintoanRDD,twobasictypesofoperationcanbecarriedout:• Transformations,whichcreateanewRDDbychangingtheoriginal

throughprocessessuchasmapping,filtering,andmore;• Actions,suchascounts,whichmeasurebutdonotchangetheoriginal

data.

Page 74: MapReduce - unipi.itdidawiki.di.unipi.it/lib/exe/fetch.php/magistraleinformatica/pa/mapreduce.pdfHadoop • Other Hadoop-related projects at Apache include : – Ambari: A web-based

WordCountinSpark

Page 75: MapReduce - unipi.itdidawiki.di.unipi.it/lib/exe/fetch.php/magistraleinformatica/pa/mapreduce.pdfHadoop • Other Hadoop-related projects at Apache include : – Ambari: A web-based

Anotherexample:logisticregression

Acommonmachinelearningalgorithmforclassifyingobjectssuchas,say,spamvs.non-spamemails.

Page 76: MapReduce - unipi.itdidawiki.di.unipi.it/lib/exe/fetch.php/magistraleinformatica/pa/mapreduce.pdfHadoop • Other Hadoop-related projects at Apache include : – Ambari: A web-based

PigvsSpark

• Pig– Thisisthebestdataloading toolavailableinsidehadoop.– Itusesascripting languagecalledPigLatin,whichismoreworkflowdriven.– Don'tneedtobeanexpertJavaprogrammerbutneedafewcodingskills.– Isalsoanabstractionlayerontopofmap-reduce.– Simple towriteandcontrol.

• Spark– Prettymuchthesuccessortomap-reduceinHadoop,withanemphasisonin-

memorycomputing.– You'llneedtobeaprettygood Javaprogrammer tousethis.– Muchlowerlevel.

Page 77: MapReduce - unipi.itdidawiki.di.unipi.it/lib/exe/fetch.php/magistraleinformatica/pa/mapreduce.pdfHadoop • Other Hadoop-related projects at Apache include : – Ambari: A web-based

Howtochooseaplatform?

• Thedecisiontochooseaparticularplatformforacertainapplicationusuallydependsonthefollowingimportantfactors:– datasize– speedorthroughputoptimization– modeldevelopment(Training/Applyingamodel)

Page 78: MapReduce - unipi.itdidawiki.di.unipi.it/lib/exe/fetch.php/magistraleinformatica/pa/mapreduce.pdfHadoop • Other Hadoop-related projects at Apache include : – Ambari: A web-based

Example:k-meansclusteringalgorithm

Thek-meansalgorithmisusedforprovidingmoreinsightintotheanalyticsalgorithmsondifferentplatforms.

Characteristics:• popularandwidelyused• iterativenature• compute-intensivetask(calculatingthecentroids)• aggregationofthelocalresultstoobtainaglobalsolution

Page 79: MapReduce - unipi.itdidawiki.di.unipi.it/lib/exe/fetch.php/magistraleinformatica/pa/mapreduce.pdfHadoop • Other Hadoop-related projects at Apache include : – Ambari: A web-based

k-meansalgorithmInput:datapointsD,numberofclusterk

1. initializekcentroidsrandomly2. associateeachdatapointinDwiththenearestcentroid.

Thiswilldividethedatapointsintokclusters.3. recalculatethepositionofcentroids.

Repeatsteps2and3untiltherearenomorechangesinthemembershipofthedatapoints.

Output:datapointswithclustermemberships

Page 80: MapReduce - unipi.itdidawiki.di.unipi.it/lib/exe/fetch.php/magistraleinformatica/pa/mapreduce.pdfHadoop • Other Hadoop-related projects at Apache include : – Ambari: A web-based

k-meansonMapReduceInput:datapointsD,numberofclusterkandcentroids

1. foreachdatapointd∈ Ddo2. assigndtotheclosestcentroid

Output:centroidswithassociateddatapoints

Input:centroidswithassociateddatapoints1. computethenewcentroidsbycalculatingthe

averageofdatapointsincluster2. writetheglobalcentroidstothedisk

Output:newcentroids

Map

Reduce

Page 81: MapReduce - unipi.itdidawiki.di.unipi.it/lib/exe/fetch.php/magistraleinformatica/pa/mapreduce.pdfHadoop • Other Hadoop-related projects at Apache include : – Ambari: A web-based

k-meansonPigLatinREGISTERudf.jarDEFINEfind_centroid FindCentroid('$centroids');points=LOAD'points.txt'as(id:int,pos:double);centroided=FOREACHpointsGENERATEpos,find_centroid(pos)ascentroid;grouped=GROUPcentroided BYcentroid;result=FOREACHgroupedGENERATEgroup,AVG(centroided.pos);STOREresultINTO'output’;

Page 82: MapReduce - unipi.itdidawiki.di.unipi.it/lib/exe/fetch.php/magistraleinformatica/pa/mapreduce.pdfHadoop • Other Hadoop-related projects at Apache include : – Ambari: A web-based

k-meansonSpark

SimilartoMapReduce-based implementation

• Insteadofwritingtheglobalcentroidstothedisk,theyarewrittentomemorywhichspeedsuptheprocessingandreducesthediskI/Ooverhead.

• Thedatawillbeloadedintothesystemmemoryinordertoprovidefasteraccess.

Page 83: MapReduce - unipi.itdidawiki.di.unipi.it/lib/exe/fetch.php/magistraleinformatica/pa/mapreduce.pdfHadoop • Other Hadoop-related projects at Apache include : – Ambari: A web-based

References• Dean,J.andGhemawat,S.MapReduce:Simplifieddataprocessingonlarge

clusters.InProceedingsofOperatingSystemsDesignandImplementation (OSDI).SanFrancisco,CA.137-150.2004

• Hadoop:OpensourceimplementationofMapReduce.http://lucene.apache.org/hadoop/

• C.Olston,B.Reed,U.Srivastava,R.Kumar,A.Tomkins.PigLatin:ANot-so-foreignLanguageforDataProcessing.InProceedingsSIGMOD'08.2008

• D.SinghandC.K.Reddy.Asurveyonplatformsforbigdataanalytics, JournalofBigData.2014

• M.Zaharia,M.Chowdhury,T.Das,A.Dave,J.Ma,M.McCauley,M.Franklin,S.Shenker, andI.Stoica. ResilientDistributedDatasets:AFault-TolerantAbstractionforIn-MemoryClusterComputing. USENIXNSDI.2012