Big Data Learning in Prac - KU Leuven KULAK · Big Data: Technology and Chronology 2001-2010...

Big Data Learning in Practice

12th September 2016

Isaac Triguero
School of Computer Science

University of Nottingham, United Kingdom

Isaac.Triguero@nottingham.ac.uk
http://www.cs.nott.ac.uk/~pszit/benelearn.html

Outline

•  What is Big Data?
•  How to deal with data-intensive applications?
•  Big Data Analytics
•  A demo with MLlib
•  Conclusions

There is no standard definition!

“Big Data” involves data whose volume, diversity and complexity require new techniques, algorithms and analyses to extract valuable (hidden) knowledge.

What is Big Data?

Data-intensive applications

What is Big Data? The 5 V’s definition

Big data has many faces

Outline

•  Problem statement: scalability to big data sets.
•  Example:
   – Exploring 100 TB with 1 node @ 50 MB/sec = 23 days
   – Exploring it with a cluster of 1000 nodes = 33 minutes
•  Solution → Divide-and-Conquer

How to deal with data-intensive applications?

What happens if we have to manage 1000 or 10000 TB?
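The figures in the example can be checked with a few lines of arithmetic. This is a minimal sketch that assumes decimal units (1 TB = 10^6 MB) and perfect linear scaling across nodes; the function name `scan_time_seconds` is hypothetical.

```python
def scan_time_seconds(data_tb, nodes, mb_per_sec=50):
    """Time to read data_tb terabytes once, split evenly over `nodes` nodes.

    Assumes decimal units and ideal scaling (no coordination overhead).
    """
    total_mb = data_tb * 1_000_000  # 1 TB = 10^6 MB
    return total_mb / (mb_per_sec * nodes)


one_node_days = scan_time_seconds(100, nodes=1) / 86_400     # seconds -> days
cluster_minutes = scan_time_seconds(100, nodes=1000) / 60    # seconds -> minutes

print(f"1 node: {one_node_days:.1f} days")
print(f"1000 nodes: {cluster_minutes:.1f} minutes")
```

Under the same assumptions, calling `scan_time_seconds(1000, nodes=1000)` or `scan_time_seconds(10000, nodes=1000)` answers the question on the slide: the time grows linearly with the data unless nodes are added.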

MapReduce

•  Parallel programming model
•  Divide & conquer strategy
   – divide: partition the dataset into smaller, independent chunks to be processed in parallel (map)
   – conquer: combine, merge or otherwise aggregate the results from the previous step (reduce)
•  Based on simplicity and transparency to the programmers, and assumes data locality.
•  Became popular thanks to the open-source project Hadoop! (Used by Google, Facebook, Amazon, …)
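The divide and conquer steps can be sketched on a single machine. In this illustration a thread pool stands in for the cluster nodes, and the helper names (`partition`, `chunk_sum_of_squares`) and the sum-of-squares task are assumptions chosen only to keep the example small.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import reduce


def partition(data, num_chunks):
    """Divide: split data into at most num_chunks independent chunks."""
    size = -(-len(data) // num_chunks)  # ceiling division
    return [data[i:i + size] for i in range(0, len(data), size)]


def chunk_sum_of_squares(chunk):
    """Map step: process one chunk without looking at any other chunk."""
    return sum(x * x for x in chunk)


data = list(range(1, 1001))
chunks = partition(data, num_chunks=4)

# Each chunk is handed to a worker; workers do not communicate.
with ThreadPoolExecutor(max_workers=4) as pool:
    partials = list(pool.map(chunk_sum_of_squares, chunks))

# Conquer: aggregate the partial results (reduce step).
total = reduce(lambda a, b: a + b, partials, 0)
print(total)
```

Because the chunks are independent, the map step needs no coordination; only the small partial results travel to the reduce step.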

Traditional HPC way of doing things

[Figure: many worker nodes, each running its own OS, linked by a communication network (Infiniband) and reaching central storage through a separate network for I/O. The input data is relatively small: lots of computation, lots of communication, limited I/O.]

Source: Jan Fostier. Introduction to MapReduce and its Application to Post-Sequencing Analysis

Data-intensive jobs

[Figure: the same architecture with lots of input data (blocks a to j) held in central storage. Low compute intensity and limited communication over the fast (Infiniband) network, but lots of I/O over the network for I/O: this doesn’t scale.]

Data-intensive jobs

[Figure: the input data blocks (a to j) are spread over the local disks of the worker nodes, with each block stored on two nodes. Low compute intensity, limited communication.]

Solution: store data on the local disks of the nodes that perform computations on that data (“data locality”)
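Data locality can be illustrated with a toy scheduler that only ever assigns a block to a node already holding a copy of it. The block placement below is one reading of the figure (five nodes, blocks a to j, two copies each) and the node names are hypothetical; real schedulers such as Hadoop's are considerably more involved.

```python
from collections import defaultdict

# Assumed placement: which blocks sit on each node's local disk.
placement = {
    "node1": {"b", "c", "e", "j"},
    "node2": {"a", "c", "g", "j"},
    "node3": {"b", "e", "h", "i"},
    "node4": {"d", "f", "g", "i"},
    "node5": {"a", "d", "f", "h"},
}


def schedule(placement):
    """Assign every block to a node that stores it, preferring idle nodes."""
    holders = defaultdict(list)
    for node, blocks in placement.items():
        for block in blocks:
            holders[block].append(node)

    load = {node: 0 for node in placement}
    assignment = {}
    for block in sorted(holders):
        # Among the nodes holding this block, pick the least loaded one.
        node = min(holders[block], key=lambda n: (load[n], n))
        assignment[block] = node
        load[node] += 1
    return assignment


assignment = schedule(placement)
for block, node in assignment.items():
    print(block, "->", node)
```

Every task reads its block from a local disk, so no input data crosses the network.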

Hadoop

http://hadoop.apache.org/

•  Hadoop is:
   – An open-source framework written in Java
   – Distributed storage of very large data sets (Big Data)
   – Distributed processing of very large data sets
•  This framework consists of a number of modules:
   – Hadoop Common
   – Hadoop Distributed File System (HDFS)
   – Hadoop YARN – resource manager
   – Hadoop MapReduce – programming model

•  Automatic parallelization:
   – Depending on the size of the input data → there will be multiple MAP tasks!
   – Depending on the number of keys <k, value> → there will be multiple REDUCE tasks!
•  Scalability:
   – It may work over every data center or cluster of computers.
•  Transparent for the programmer:
   – Fault-tolerant mechanism.
   – Automatic communications among computers

Hadoop MapReduce: Main Characteristics
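A rough sketch of how the amount of parallel work follows from the data: the number of map tasks is estimated as one per input split, and keys are spread over reduce tasks by hashing. The 128 MB split size and the count of 4 reducers are assumptions for illustration, and both function names are hypothetical rather than Hadoop APIs.

```python
import math
import zlib


def num_map_tasks(input_bytes, split_bytes=128 * 1024**2):
    """One map task per input split (split size is an assumed 128 MB)."""
    return max(1, math.ceil(input_bytes / split_bytes))


def reducer_for(key, num_reducers):
    """Route a key to a reduce task by hashing it.

    crc32 is used instead of hash() so the result is stable across runs.
    """
    return zlib.crc32(key.encode("utf-8")) % num_reducers


print(num_map_tasks(1024**3))  # a 1 GiB input
for word in ["Hello", "World", "Bye", "Hadoop", "Goodbye"]:
    print(word, "-> reducer", reducer_for(word, num_reducers=4))
```

All pairs sharing a key land on the same reduce task, which is what allows each reducer to work independently.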

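Data Sharing in Hadoop MapReduce

[Figure: in iterative jobs (iter. 1, iter. 2, ...) every iteration reads its input from HDFS and writes its output back to HDFS; in interactive analysis, each query (query 1, 2, 3) starts with an HDFS read of the input to produce its result (result 1, 2, 3).]

Slow due to replication, serialization, and disk I/O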

Paradigms that do not fit with Hadoop MapReduce

•  Directed Acyclic Graph (DAG) model:
   – The DAG defines the dataflow of the application, and the vertices of the graph define the operations on the data.
•  Graph model:
   – More complex graph models that better represent the dataflow of the application.
   – Cyclic models -> Iterativity.
•  Iterative MapReduce model:
   – An extended programming model that supports iterative MapReduce computations efficiently.

GIRAPH (Apache project), iterative graph processing
http://giraph.apache.org/

GPS - A Graph Processing System (Stanford), Amazon's EC2
http://infolab.stanford.edu/gps/

Distributed GraphLab (Carnegie Mellon Univ.), Amazon's EC2
https://github.com/graphlab-code/graphlab

HaLoop (University of Washington), Amazon’s EC2
http://clue.cs.washington.edu/node/14
http://code.google.com/p/haloop/

Twister (Indiana University), private clusters
http://www.iterativemapreduce.org/

PrIter (University of Massachusetts Amherst, Northeastern University - China), private cluster and Amazon EC2 cloud
http://code.google.com/p/priter/

GPU-based platforms: Mars, Grex

Spark (UC Berkeley)
http://spark.incubator.apache.org/research.html

New platforms to overcome Hadoop’s limitations

Big data technologies

What is Spark?

Fast and Expressive Cluster Computing Engine Compatible with Apache Hadoop

Efficient
•  General execution graphs
•  In-memory storage
Up to 10× faster on disk, 100× in memory

Usable
•  Rich APIs in Java, Scala, Python
•  Interactive shell
2-5× less code

Spark Goal
•  Provide distributed memory abstractions for clusters to support apps with working sets
•  Retain the attractive properties of MapReduce:
   – Fault tolerance (for crashes & stragglers)
   – Data locality
   – Scalability

Initial solution: augment the dataflow model with “resilient distributed datasets” (RDDs)

RDDs in Detail

•  An RDD is a fault-tolerant collection of elements that can be operated on in parallel.
•  There are two ways to create RDDs:
   – Parallelizing an existing collection in your driver program
   – Referencing a dataset in an external storage system, such as a shared filesystem, HDFS or HBase.
•  Can be cached for future reuse

Operations with RDDs
•  Transformations (e.g. map, filter, groupBy, join)
   – Lazy operations to build RDDs from other RDDs
•  Actions (e.g. count, collect, save)
   – Return a result or write it to storage

Transformations (define a new RDD): map, filter, sample, union, groupByKey, reduceByKey, join, cache, …

Parallel operations (return a result to driver): reduce, collect, count, save, lookupKey, …
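The split between lazy transformations and actions can be imitated in plain Python. The `LazyDataset` class below is a hypothetical toy, not Spark's API: it only records transformations and runs them when an action is called, which is the behaviour the slide describes for RDDs.

```python
class LazyDataset:
    """Toy stand-in for an RDD: records transformations, runs them on demand."""

    def __init__(self, source, ops=()):
        self._source = source    # any re-iterable collection
        self._ops = tuple(ops)   # pending (kind, function) pairs

    # Transformations: build a new dataset, compute nothing yet.
    def map(self, fn):
        return LazyDataset(self._source, self._ops + (("map", fn),))

    def filter(self, pred):
        return LazyDataset(self._source, self._ops + (("filter", pred),))

    def _run(self):
        for item in self._source:
            keep = True
            for kind, fn in self._ops:
                if kind == "map":
                    item = fn(item)
                elif not fn(item):   # a filter rejected the item
                    keep = False
                    break
            if keep:
                yield item

    # Actions: trigger the computation and return a result.
    def collect(self):
        return list(self._run())

    def count(self):
        return sum(1 for _ in self._run())


calls = []


def square(x):
    calls.append(x)  # record every real evaluation
    return x * x


even_squares = LazyDataset(range(1, 6)).map(square).filter(lambda x: x % 2 == 0)
calls_before_action = len(calls)   # nothing has been computed so far
result = even_squares.collect()    # the action runs the whole chain
print(calls_before_action, result)
```

Deferring the work lets the engine see the whole chain before executing it; in this toy, each action recomputes the chain, which is the cost that caching is meant to avoid.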

Spark vs. Hadoop

[Figure: K-Means on 25, 50 and 100 machines, comparing Spark with Hadoop and HadoopBinMem. Axis: number of machines.] [Zaharia et al., NSDI’12]

Lines of code for K-Means
–  Spark: ~90 lines
–  Hadoop: ~4 files, > 300 lines

DataFrame (Spark 1.3+)
-  Equivalent to a table in a relational database (data frame in R/Python)
-  Avoids the Java serialization performed by RDDs.
-  API natural for developers who are familiar with building query plans (e.g. SQL expressions).

Datasets (Spark 1.6+)
-  Best of both DataFrame and RDDs.
-  Functional transformations (map, flatMap, filter, etc.)
-  Spark SQL’s optimised execution engine.

Apache Spark – new collections

https://flink.apache.org/

Big Data: Technology and Chronology

2001-2010

–  Big Data 3 V’s: Gartner (Doug Laney)
–  2004: MapReduce, Google (Jeffrey Dean)
–  2008: Hadoop, Yahoo! (Doug Cutting)

2010-2016

–  2010: Spark, UC Berkeley (Matei Zaharia); Apache Spark, Feb. 2014
–  2009-2013: Flink, TU Berlin (Volker); Apache Flink, Dec. 2014
–  2010-2016: Big Data Analytics: Mahout, MLlib, …; Hadoop ecosystem; applications; new technology

Outline

Clustering

Recommendation Systems

Classification

Association

Potential scenarios

Real Time Analytics / Big Data Streams

Social Media Mining, Social Big Data

Big Data Analytics

Big Data Analytics: a 3-generational view

Mahout (Samsara)

http://mahout.apache.org/

•  First ML library, initially based on Hadoop MapReduce.
•  Abandoned MapReduce implementations from version 0.9.
•  Nowadays it is focused on a new math environment called Samsara.
•  It is integrated with Spark, Flink and H2O.
•  Main algorithms:
   – Stochastic Singular Value Decomposition (ssvd, dssvd)
   – Stochastic Principal Component Analysis (spca, dspca)
   – Distributed Cholesky QR (thinQR)
   – Distributed regularized Alternating Least Squares (dals)
   – Collaborative Filtering: Item and Row Similarity
   – Naive Bayes Classification

https://spark.apache.org/mllib/

Spark Libraries

As of Spark 2.0

https://ci.apache.org/projects/flink/flink-docs-master/apis/batch/libs/ml/

FlinkML

Outline

•  In this demo I will show two ways of working with Apache Spark:
   – Interactive mode with Spark Notebook.
   – Standalone mode with Scala IDE.
•  All the code used in this presentation is available at:

http://www.cs.nott.ac.uk/~pszit/benelearn.html

DEMO with Spark Notebook in local mode

http://spark-notebook.io/

DEMO with Spark Notebook in local mode

Advantages:
✓  Interactive.
✓  Automatic plots.
✓  It allows connection with a cluster.
✓  Tab completion.

Disadvantages:
•  Built for specific Spark versions.
•  Difficult to integrate your own code.

DEMO with Scala IDE

http://scala-ide.org/

Example: An Imbalanced Big Data problem

•  Two main approaches to tackle this problem:
   – Data sampling:
      •  Undersampling
      •  Oversampling
      •  Hybrid approaches
   – Algorithmic modifications

I. Triguero et al., Evolutionary Undersampling for Extremely Imbalanced Big Data Classification under Apache Spark. IEEE Congress on Evolutionary Computation (CEC 2016), Vancouver (Canada), 640-647, July 24-29.
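To make the data sampling approach concrete, here is a minimal sketch of plain random undersampling (RUS) in pure Python. It is not the evolutionary undersampling method of the cited paper and it does not use Spark; the function name and the toy data with a 5:95 class ratio are assumptions for illustration.

```python
import random
from collections import defaultdict


def random_undersample(samples, labels, seed=0):
    """Randomly drop examples until every class is as small as the rarest one."""
    rng = random.Random(seed)

    by_class = defaultdict(list)
    for x, y in zip(samples, labels):
        by_class[y].append(x)

    target = min(len(xs) for xs in by_class.values())  # minority class size

    balanced = []
    for y, xs in by_class.items():
        for x in rng.sample(xs, target):  # sample without replacement
            balanced.append((x, y))
    rng.shuffle(balanced)
    return balanced


# Toy imbalanced data set: 5 positive and 95 negative examples.
samples = list(range(100))
labels = [1 if i < 5 else 0 for i in samples]

balanced = random_undersample(samples, labels)
print(len(balanced), "examples after undersampling")
```

Undersampling discards majority-class examples, so it trades information for balance; oversampling and hybrid approaches make the opposite trade.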

Imbalanced Big Data Classification with Spark

Run examples from Scala IDE

Run examples from terminal

$ mvn package -Dmaven.test.skip=true

$ /opt/spark/bin/spark-submit --master local[*] --class Undersampling.UndersamplingExample target/EUS-0.0.1-BETA.jar hdfs://localhost:9000/user/pszit/datasets/ECBDL14_25.header hdfs://localhost:9000/user/pszit/datasets/ECBDL14_25-5-1tra100000.data hdfs://localhost:9000/user/pszit/datasets/ECBDL14_25-5-1tst10000.data 4 4 RUS DecisionTree /Users/pszit/outputRUS-DecisionTree

Outline

Conclusions

•  We need new strategies to perform ML on big data sets
   – Choosing the right technology is like choosing the right data structure in a program.
•  The world of big data is rapidly changing. Being up-to-date is difficult but necessary.
•  Interactive notebooks are very useful for a quick start and standard experiments.

Acknowledgments

Thank you

Big Data Learning in Practice

12th September 2016

Isaac Triguero
School of Computer Science

University of Nottingham, United Kingdom

Isaac.Triguero@nottingham.ac.uk
http://www.cs.nott.ac.uk/~pszit/

Extra slides

Volume: data at rest

•  Vast amounts of data generated every second
•  Data sets are becoming too large to store using traditional database technology
•  Big data technology stores these data sets using distributed systems

Velocity: data in motion

•  Speed at which:
   – Data is generated
   – Data needs to be analyzed.
•  Continuous data streams are being captured (e.g. from sensors or mobile devices) and produced
•  Late decisions imply missed opportunities

Variety: data in many forms

•  One application may generate many different kinds of data
•  Several formats and structures:
   – Structured data:
      •  Tables, relational databases
   – Unstructured data:
      •  Text, images, audio, video.

Veracity: data in doubt

•  Uncertainty about the quality of the data.
   – E.g. natural language processing on social media: typos, abbreviations, colloquial speech.
•  Data may be missing, ambiguous, or even completely wrong.

Value: data in use

•  Most important motivation for big data
•  Big data may result in:
   – Better statistics/models
   – Novel insights
   – New opportunities for research and industry

Big Data: applications

•  Science and research:
   – E.g. physics, bioinformatics, astronomy.
•  Healthcare and public health:
   – Better personalized medicine
•  Business and e-commerce:
   – Personalized advertisement.
•  Financial services:
   – Insurance, banks.

MapReduce
•  Based on functional programming (e.g. Lisp)
   – Operates on <key, value> pairs
      •  Web-based example: key = URL; value = web page
      •  Graph-based example: key = node; value = adjacency list
   – Users specify two functions:
        map: (k1, v1) → list[k2, v2]
        reduce: (k2, list[v2]) → list[k3, v3]
   – Sorting of intermediate keys between the map and reduce phases

The dataflow in MapReduce is transparent to the programmers
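The two signatures above can be exercised with a tiny single-process engine. This sketch follows the graph-based example (key = node, value = adjacency list) and counts incoming edges per node; `run_mapreduce` and the three-node graph are hypothetical, and a real framework would distribute each phase across machines.

```python
from collections import defaultdict


def run_mapreduce(records, mapper, reducer):
    """Apply map: (k1, v1) -> list[(k2, v2)], then reduce: (k2, list[v2]) -> list[(k3, v3)]."""
    # Map phase: every input pair may emit any number of intermediate pairs.
    groups = defaultdict(list)
    for k1, v1 in records:
        for k2, v2 in mapper(k1, v1):
            groups[k2].append(v2)

    # Intermediate keys are sorted before the reduce phase.
    output = []
    for k2 in sorted(groups):
        output.extend(reducer(k2, groups[k2]))
    return output


# Input: key = node, value = adjacency list (outgoing edges).
graph = [("a", ["b", "c"]), ("b", ["c"]), ("c", ["a"])]


def emit_edges(node, neighbours):
    """map: emit (target, 1) for every outgoing edge."""
    return [(target, 1) for target in neighbours]


def sum_counts(node, ones):
    """reduce: add up the 1s collected for a node."""
    return [(node, sum(ones))]


in_degree = run_mapreduce(graph, emit_edges, sum_counts)
print(in_degree)
```

The user writes only `emit_edges` and `sum_counts`; grouping and sorting happen inside the engine, which is the sense in which the dataflow is transparent to the programmer.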

MapReduce

Word Count using MapReduce

Input file:
   Hello World Bye World
   Hello Hadoop Goodbye Hadoop

Splitting:
   Split 1: Hello World Bye World
   Split 2: Hello Hadoop Goodbye Hadoop

Map (key, value):
   Split 1: (Hello, 1) (World, 1) (Bye, 1) (World, 1)
   Split 2: (Hello, 1) (Hadoop, 1) (Goodbye, 1) (Hadoop, 1)

Sort and Shuffle:
   Bye, {1}
   Goodbye, {1}
   Hadoop, {1, 1}
   Hello, {1, 1}
   World, {1, 1}

Reduce (key, value pairs) and output:
   Bye, 1
   Goodbye, 1
   Hadoop, 2
   Hello, 2
   World, 2
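The same word count can be traced in a few lines of Python, using the two input lines from the example. This is a single-process sketch with hypothetical function names; in Hadoop each phase would run as distributed tasks.

```python
from collections import defaultdict

# Splitting: each line of the input file is one split.
lines = ["Hello World Bye World", "Hello Hadoop Goodbye Hadoop"]


def map_words(line):
    """Map: emit (word, 1) for every word in the split."""
    return [(word, 1) for word in line.split()]


def shuffle(pairs):
    """Sort and shuffle: group the values by key, with keys in sorted order."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(sorted(groups.items()))


def reduce_counts(word, ones):
    """Reduce: sum the 1s collected for one word."""
    return (word, sum(ones))


mapped = [pair for line in lines for pair in map_words(line)]
grouped = shuffle(mapped)
word_counts = [reduce_counts(word, ones) for word, ones in grouped.items()]

for word, count in word_counts:
    print(word, count)
```

Inspecting `grouped` shows the intermediate state after the shuffle, for example `[1, 1]` for "Hadoop" and `[1]` for "Goodbye".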

Machine learning for Big Data

•  Data mining techniques have proven to be very useful tools for extracting new, valuable knowledge from data.
•  The knowledge extraction process from big data has become a very difficult task for most of the classical and advanced data mining tools.
•  The main challenges are to deal with:
   – The increasing scale of data
      •  at the level of instances
      •  at the level of features
   – The complexity of the problem.
   – And many other points

MLlib: Spark Machine learning library

•  MLlib (2010): is a Spark implementation of some common machine learning functionality, as well as associated tests and data generators.
•  Includes:
   – Binary classification (SVMs and Logistic Regression)
   – Random Forest
   – Regression (Lasso, Ridge, etc.)
   – Clustering (K-Means)
   – Collaborative Filtering
   – Gradient Descent Optimization Primitive

https://spark.apache.org/docs/latest/mllib-guide.html
