Data Science Crash Course

Data Science Crash Course - DataWorks Summit - Munich 2017

Robert Hryniewicz, Developer Advocate

@[email protected]

© Hortonworks Inc. 2011–2016. All Rights Reserved

What is Data Science?

• Extracting knowledge/insights from data
  – Data: structured or unstructured

• Continuation of
  – statistics
  – machine learning
  – data mining
  – predictive analytics


What is Machine Learning?

Machine Learning

“science of how computers learn without being explicitly programmed”


“AI is the new electricity.”

“AI needs to be companywide strategic decision.”

Andrew Ng

Chief Data Scientist, Co-founder of Coursera, Prof. at Stanford


A Brief History of AI

• Antiquity: An Ancient Wish to Forge the Gods
• 1940 (digital computer; scientists discuss electronic brain)
• 1954–73 (Marvin Minsky et al. in Dartmouth College)
• 1973–80
• 1980–87 (Japanese gov.)
• 1987–93
• 1993–2000
• 2000 → Present


AI in Media & Pop Culture





What is AI?

• General or Pure AI
• Narrow or Pragmatic AI



“Big Data”

• Internet of Anything (IoT)
  – Wind Turbines, Oil Rigs
  – Beacons, Wearables
  – Smart Cars

• User Generated Content (Social, Web & Mobile)
  – Twitter, Facebook, Snapchat
  – Clickstream
  – Paypal, Venmo

44 ZB in 2020


Visualizing 44 ZB

100 pixels = 1M TB

100 px -> 1M TB assumes a 5-megapixel resolution screen




Key drivers behind AI Explosion

• Exponential data growth

• Faster distributed systems

• Smarter algorithms


Major Trends in AI Technologies

• Knowledge Engineering

• Machine Learning

• Deep Learning

• Image Analysis

• Natural Language Processing & Generation

• Robotics & Automation


Creating Value with AI

• Cognitive insights

• Cognitive engagement

• Cognitive automation


Machine Learning Use Cases

Healthcare
  – Predict diagnosis
  – Prioritize screenings
  – Reduce re-admittance rates

Financial services
  – Fraud detection/prevention
  – Predict underwriting risk
  – New account risk screens

Public Sector
  – Analyze public sentiment
  – Optimize resource allocation
  – Law enforcement & security

Retail
  – Product recommendation
  – Inventory management
  – Price optimization

Telco/mobile
  – Predict customer churn
  – Predict equipment failure
  – Customer behavior analysis

Oil & Gas
  – Predictive maintenance
  – Seismic data management
  – Predict well production levels


What Is Apache Spark?

• Apache open source project originally developed at AMPLab (University of California Berkeley)

• Unified data processing engine that operates across varied data workloads and platforms


Why Apache Spark?

• Elegant Developer APIs
  – Single environment for data munging, data wrangling, and Machine Learning (ML)

• Fast!
  – In-memory computation model
  – Effective for iterative computations

• Machine Learning
  – Implementation of distributed ML algorithms


Spark SQL: Structured Data

Spark Streaming: Near Real-time

Spark MLlib: Machine Learning

GraphX: Graph Analysis


More Flexible / Better Storage and Performance


Spark SQL Overview

• Spark module for structured data processing (e.g. DB tables, JSON files, CSV)

• Three ways to manipulate data:
  – DataFrames API
  – SQL queries
  – Datasets API


DataFrames

• Distributed collection of data organized into named columns

• Conceptually equivalent to a table in a relational DB or a data frame in R/Python

• API available in Scala, Java, Python, and R

[Figure labels: DataFrame (Col1, Col2, …, ColN), Column, Row]

Data is described as a DataFrame with rows, columns, and a schema


DataFrames

[Figure labels: CSV, Avro, HIVE, JSON, Spark SQL, DataFrame (Col1, Col2, …, ColN), Column, Row]


Visualizations

Source: commons.wikimedia.org/w/index.php?curid=17857442


Data Visualization: Twitter

Source: https://medium.com/@swainjo/us-presidential-election-2016-twitter-analysis-7596606853e5#.dozwu2bhd


Simple line chart


Horizontal plot of three line charts


Streaming data into a line chart


Plotting Iris data features in one plot


Comparing Iris data distributions




Algorithms


What is a ML Model?

• Mathematical formula with a number of parameters that need to be learned from the data. Fitting a model to the data is a process known as model training.

• E.g. linear regression
  – Goal: fit a line y = mx + c to data points
  – After model training: y = 2x + 5

Input: 1, 0, 7, 2, … → Model → Output: 7, 5, 19, 9, …
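To make "model training" concrete, the sketch below fits a line to the four input/output pairs shown on this slide using ordinary least squares. The function name `fit_line` is an assumption made for illustration; the deck itself does not say which fitting method was used.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = m*x + c; returns (m, c)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope = covariance(x, y) / variance(x)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    m = sxy / sxx
    c = mean_y - m * mean_x
    return m, c


# Input/output pairs taken from the slide
inputs = [1, 0, 7, 2]
outputs = [7, 5, 19, 9]

m, c = fit_line(inputs, outputs)
print(f"y = {m:g}x + {c:g}")
```

On these four points the learned parameters match the slide's trained model, y = 2x + 5.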


START

• Classification: Logistic Regression, Support Vector Machines (SVM), Random Forest (RF), Naïve Bayes

• Regression: Linear Regression

• Collaborative Filtering: Alternating Least Squares (ALS)

• Clustering: K-Means, LDA

• Dimensionality Reduction: Principal Component Analysis (PCA)


CLASSIFICATION

Identifying to which category an object belongs

Examples: spam detection, diabetes diagnosis, text labeling

Algorithms:

• Logistic Regression
  – Fast training, linear model
  – Classes expressed in probabilities

• Support Vector Machines (SVM)
  – “Best” supervised learning algorithm, effective
  – More robust to outliers than Logistic Regression
  – Handles non-linearity

• Random Forest
  – Fast training
  – Handles categorical features
  – Does not require feature scaling
  – Captures non-linearity and feature interaction

• Naïve Bayes
  – Good for text classification
  – Assumes independent variables


Visual Intro to Decision Trees

• http://www.r2d3.us/visual-intro-to-machine-learning-part-1

CLASSIFICATION


REGRESSION

Predicting a continuous-valued output

Example: predicting house prices based on number of bedrooms and square footage

Algorithms: Linear Regression


CLUSTERING

Automatic grouping of similar objects into sets (clusters)

Example: market segmentation (automatically group customers into different market segments)

Algorithms: K-means, LDA
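As a rough illustration of the market-segmentation example, here is a minimal one-dimensional k-means sketch in plain Python. The spend figures and starting centroids are invented for the example, and `kmeans_1d` is a hypothetical helper, not Spark MLlib's KMeans.

```python
def kmeans_1d(values, centroids, iterations=10):
    """Tiny k-means on scalar values; returns (centroids, clusters)."""
    centroids = list(centroids)
    clusters = []
    for _ in range(iterations):
        # Assignment step: put each value in the cluster of its nearest centroid
        clusters = [[] for _ in centroids]
        for v in values:
            nearest = min(range(len(centroids)), key=lambda i: abs(v - centroids[i]))
            clusters[nearest].append(v)
        # Update step: move each centroid to the mean of its cluster
        centroids = [
            sum(c) / len(c) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    return centroids, clusters


# Hypothetical monthly spend per customer
spend = [12, 15, 14, 90, 95, 88, 300, 310, 295]
centroids, clusters = kmeans_1d(spend, centroids=[0, 100, 400])
print(clusters)
```

Each resulting cluster can be read as one customer segment (low, medium, and high spenders in this made-up data).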


COLLABORATIVE FILTERING

Fill in the missing entries of a user-item association matrix

Applications: Product/movie recommendation

Algorithms: Alternating Least Squares (ALS)


DIMENSIONALITY REDUCTION

Reducing the number of redundant features/variables

Applications:

• Removing noise in images by selecting only “important” features

• Removing redundant features, e.g. MPH & KPH are linearly dependent

Algorithms: Principal Component Analysis (PCA)
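The MPH/KPH example can be checked numerically: two linearly dependent features have a Pearson correlation of magnitude 1, so one of them carries no extra information. The sketch below uses invented speed values and a hand-written `pearson` helper; it only detects the redundancy and is not an implementation of PCA.

```python
def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


# Hypothetical speed readings; KPH is derived from MPH (1 mile = 1.609344 km)
mph = [10.0, 30.0, 55.0, 70.0]
kph = [s * 1.609344 for s in mph]

correlation = pearson(mph, kph)
print(f"correlation(MPH, KPH) = {correlation:.6f}")
```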


START

• Classification: XGBoost (Extreme Gradient Boosting), Classification and regression trees (CART)

• Regression: Local Regression (LOESS)

• Deep Learning: Recurrent Neural Network (RNN), Convolutional Neural Network (CNN)

• Clustering: Yinyang K-Means

• Dimensionality Reduction: t-Distributed Stochastic Neighbor Embedding (t-SNE)

• Collaborative Filtering: Weighted Alternating Least Squares (WALS)



Hyperparameters

• Define higher-level model properties, e.g. complexity or learning rate

• Cannot be learned during training → need to be predefined

• Can be decided by
  – setting different values
  – training different models
  – choosing the values that test better

• Hyperparameter examples
  – Number of leaves or depth of a tree
  – Number of latent factors in a matrix factorization
  – Learning rate (in many models)
  – Number of hidden layers in a deep neural network
  – Number of clusters in a k-means clustering
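The "set different values, train different models, choose the values that test better" loop can be sketched as follows. The example tunes one hyperparameter, the learning rate of a gradient-descent line fit, on synthetic data generated from y = 2x + 5; the data, the candidate values, and the names `train_line` and `mse` are all assumptions made for illustration.

```python
def train_line(xs, ys, learning_rate, epochs=200):
    """Fit y = m*x + c by full-batch gradient descent on mean squared error."""
    m, c = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        errors = [(m * x + c) - y for x, y in zip(xs, ys)]
        grad_m = 2.0 * sum(e * x for e, x in zip(errors, xs)) / n
        grad_c = 2.0 * sum(errors) / n
        m -= learning_rate * grad_m
        c -= learning_rate * grad_c
    return m, c


def mse(xs, ys, m, c):
    """Mean squared error of the line (m, c) on the given points."""
    return sum(((m * x + c) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Synthetic data from y = 2x + 5, split into training and held-out points
xs = [i / 10 - 1 for i in range(21)]
ys = [2 * x + 5 for x in xs]
train_x, train_y = xs[0::2], ys[0::2]
val_x, val_y = xs[1::2], ys[1::2]

# Train one model per candidate value and score each on the held-out points
candidates = [0.001, 0.01, 0.1]
scores = {}
for lr in candidates:
    m, c = train_line(train_x, train_y, lr)
    scores[lr] = mse(val_x, val_y, m, c)

best_lr = min(scores, key=scores.get)
print("best learning rate:", best_lr)
```

The learning rate is never updated by the training loop itself, which is what makes it a hyperparameter rather than a learned parameter like m and c.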


Predictive Analytics Pre-requisites


Predictive Analytics Process and Tools


Asking Relevant Questions

• Specific (can you think of a clear answer?)

• Measurable (quantifiable? data driven?)

• Actionable (if you had an answer, could you do something with it?)

• Realistic (can you get an answer with data you have?)

• Timely (answer in reasonable timeframe?)


With that in mind…

• No simple formula for “good questions”, only general guidelines

• The right data is better than lots of data

• Understanding relationships matters


Data Preparation

1. Data analysis (audit for anomalies/errors)

2. Creating an intuitive workflow (formulate seq. of prep operations)

3. Validation (correctness evaluated against sample representative dataset)

4. Transformation (actual prep process takes place)

5. Backflow of cleaned data (replace original dirty data)

Approx. 80% of a Data Analyst’s job is Data Preparation!

Example of multiple values used for U.S. States → California, CA, Cal., Cal
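The state-name example above is a typical transformation step. A minimal sketch, assuming a hand-maintained alias table (the `STATE_ALIASES` mapping and `normalize_state` helper are hypothetical and cover only the values named on the slide):

```python
# Hypothetical alias table; a real one would cover every state and spelling
STATE_ALIASES = {
    "california": "CA",
    "ca": "CA",
    "cal": "CA",
}


def normalize_state(raw):
    """Map a free-form state value to a canonical code; pass unknowns through."""
    key = raw.strip().lower().rstrip(".")
    return STATE_ALIASES.get(key, raw.strip())


dirty = ["California", "CA", "Cal.", "Cal"]
clean = [normalize_state(value) for value in dirty]
print(clean)
```

Unknown values are returned unchanged rather than guessed, so they can be caught in the audit and validation steps.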


Detailed Research and Operational Workflows


[Figure labels: Training Set, Learning Algorithm, h (hypothesis/model), input, output]

Ingest/Enrich Data

Clean/Transform/Filter

Select/Create New Features

Evaluate Accuracy/Score


Building Spark ML pipelines

[Figure labels: Pipeline (Feature transform 1, Feature transform 2, Combine features, Linear Regression); Train: Input DataFrame, Pipeline Model; Predict: Input DataFrame, Output DataFrame; Export Model]


Spark ML Pipeline

• fit() is for training

• transform() is for prediction

[Figure labels: Train: Input DataFrame (TRAIN), Pipeline, fit(), Pipeline Model; Predict: Input DataFrame (TEST), Pipeline Model, transform(), Output DataFrame (PREDICTIONS)]


Sample Spark ML Pipeline

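from pyspark.ml import Pipeline
from pyspark.ml.classification import RandomForestClassifier

# Feature stages (definitions elided on the slide)
indexer = …
parser = …
hashingTF = …
vecAssembler = …

rf = RandomForestClassifier(numTrees=100)

pipe = Pipeline(stages=[indexer, parser, hashingTF, vecAssembler, rf])

model = pipe.fit(trainData)          # Train model

results = model.transform(testData)  # Test model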
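Because the slide elides the stage definitions, the following is a plain-Python stand-in that mimics only the fit()/transform() contract: fitting a pipeline fits each stage in order and returns a model that replays the fitted stages. Every class here (`Centerer`, `LineFitter`, `ToyPipeline`, and their models) is hypothetical and is not part of the Spark API.

```python
class Centerer:
    """Toy estimator: fit() learns the mean of column 'x'."""
    def fit(self, rows):
        return CentererModel(sum(r["x"] for r in rows) / len(rows))


class CentererModel:
    def __init__(self, mean):
        self.mean = mean

    def transform(self, rows):
        # Adds a new column instead of modifying the input rows
        return [dict(r, x_centered=r["x"] - self.mean) for r in rows]


class LineFitter:
    """Toy estimator: least-squares line of 'y' on 'x_centered'."""
    def fit(self, rows):
        xs = [r["x_centered"] for r in rows]
        ys = [r["y"] for r in rows]
        mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
        sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
        sxx = sum((x - mean_x) ** 2 for x in xs)
        slope = sxy / sxx
        return LineModel(slope, mean_y - slope * mean_x)


class LineModel:
    def __init__(self, slope, intercept):
        self.slope, self.intercept = slope, intercept

    def transform(self, rows):
        return [dict(r, prediction=self.slope * r["x_centered"] + self.intercept)
                for r in rows]


class ToyPipeline:
    def __init__(self, stages):
        self.stages = stages

    def fit(self, rows):
        models = []
        for stage in self.stages:
            model = stage.fit(rows)       # train this stage
            rows = model.transform(rows)  # feed its output to the next stage
            models.append(model)
        return ToyPipelineModel(models)


class ToyPipelineModel:
    def __init__(self, models):
        self.models = models

    def transform(self, rows):
        for model in self.models:
            rows = model.transform(rows)
        return rows


train_rows = [{"x": 1.0, "y": 7.0}, {"x": 0.0, "y": 5.0},
              {"x": 7.0, "y": 19.0}, {"x": 2.0, "y": 9.0}]
test_rows = [{"x": 3.0}]

toy_model = ToyPipeline([Centerer(), LineFitter()]).fit(train_rows)  # Train
toy_results = toy_model.transform(test_rows)                          # Predict
print(toy_results[0]["prediction"])
```

The point of the pattern is that statistics learned during fit() (here the mean and the line) are reused unchanged at prediction time, so test rows are never used for training.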


Exporting ML Models - PMML

• Predictive Model Markup Language (PMML)

• Supported models
  – K-Means
  – Linear Regression
  – Ridge Regression
  – Lasso
  – SVM
  – Binary


HDCloud


Hortonworks Cloud Solutions

Providers: Microsoft, AWS, Google

• Managed: Azure HDInsight

• Non-Managed/Marketplace: Hortonworks Data Cloud for AWS

• Cloud IaaS: Hortonworks Data Platform (via Ambari and via Cloudbreak)



Zeppelin

Ambari

Spark History Server

Files View


• Zeppelin → Interactive notebook

• Spark

• YARN → Resource Management

• HDFS → Distributed Storage Layer

[Figure labels: APIs (Scala, Java, Python, R); Spark SQL, Spark Streaming, MLlib, GraphX; Spark Core Engine; YARN; HDFS nodes 1 … N]


Spark and HDP


Labs/Tutorials


Scatter 2D Data Visualized

scatterData ← DataFrame

+-----+--------+
|label|features|
+-----+--------+
|-12.0|  [-4.9]|
| -6.0|  [-4.5]|
| -7.2|  [-4.1]|
| -5.0|  [-3.2]|
| -2.0|  [-3.0]|
| -3.1|  [-2.1]|
| -4.0|  [-1.5]|
| -2.2|  [-1.2]|
| -2.0|  [-0.7]|
|  1.0|  [-0.5]|
| -0.7|  [-0.2]|
...


Linear Regression Model Training (one feature)

Training Result

Coefficients: 2.81
Intercept: 3.05

y = 2.81x + 3.05


Linear Regression (two features)

Coefficients: [0.464, 0.464] Intercept: 0.0563


ML Lab

• Residuals
  – the residual of an observed value is the difference between the observed value and the estimated value

• R² (R Squared): Coefficient of Determination
  – indicates goodness of fit
  – R² of 1 means the regression line perfectly fits the data

• RMSE (Root Mean Square Error)
  – measure of the differences between values predicted by a model and the values actually observed
  – good measure of accuracy, but only to compare forecasting errors of different models (individual variables are scale-dependent)
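The three metrics can be computed directly from their definitions. The sketch below applies them to the eleven scatter-data rows visible earlier in the deck, scored against the one-feature model y = 2.81x + 3.05; since that model was trained on the full dataset, not just these rows, the printed numbers are only illustrative. The helper names are assumptions.

```python
def residuals(observed, predicted):
    """Residual = observed value minus estimated value."""
    return [o - p for o, p in zip(observed, predicted)]


def r_squared(observed, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum(r ** 2 for r in residuals(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot


def rmse(observed, predicted):
    """Root mean square error."""
    res = residuals(observed, predicted)
    return (sum(r ** 2 for r in res) / len(res)) ** 0.5


# Rows shown on the scatter-data slide: (label, feature)
labels = [-12.0, -6.0, -7.2, -5.0, -2.0, -3.1, -4.0, -2.2, -2.0, 1.0, -0.7]
features = [-4.9, -4.5, -4.1, -3.2, -3.0, -2.1, -1.5, -1.2, -0.7, -0.5, -0.2]

# Predictions from the trained one-feature model on the slide
predictions = [2.81 * x + 3.05 for x in features]

print("R2:  ", r_squared(labels, predictions))
print("RMSE:", rmse(labels, predictions))
```

RMSE is expressed in the units of the label, which is why it is only comparable between models predicting the same variable.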


Demo: Stock Portfolio Simulation using Monte Carlo method

Monte Carlo Simulation

1. Define a domain of possible inputs

2. Randomly generate inputs from a prob. distribution over the domain

3. Perform computation on the inputs

4. Aggregate the results

Approximating the value of π after placing 30K random points. Error < 0.07% of actual value.
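The four steps map onto the π approximation as follows: the domain is the unit square, inputs are uniform random points, the computation tests whether each point falls inside the quarter circle, and the aggregate is the fraction of hits times 4. This is a minimal sketch; the seed and the function name `estimate_pi` are arbitrary, and the error for any particular run will differ from the figure quoted on the slide.

```python
import random


def estimate_pi(n_points, seed=42):
    """Monte Carlo estimate of pi from n_points uniform points in the unit square."""
    rng = random.Random(seed)           # step 2: source of random inputs
    inside = 0
    for _ in range(n_points):
        x, y = rng.random(), rng.random()   # point in the domain [0, 1) x [0, 1)
        if x * x + y * y <= 1.0:            # step 3: inside the quarter circle?
            inside += 1
    # Step 4: quarter-circle area / square area = pi / 4
    return 4.0 * inside / n_points


pi_estimate = estimate_pi(30_000)
print(pi_estimate)
```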


Demo: Text Classification with Naïve Bayes


Diabetes Dataset: Decision Trees / Random Forest

Labeled set with 8 Features

-1 1:-0.294118 2:0.487437 3:0.180328 4:-0.292929 5:-1 6:0.00149028 7:-0.53117 8:-0.0333333
+1 1:-0.882353 2:-0.145729 3:0.0819672 4:-0.414141 5:-1 6:-0.207153 7:-0.766866 8:-0.666667
-1 1:-0.0588235 2:0.839196 3:0.0491803 4:-1 5:-1 6:-0.305514 7:-0.492741 8:-0.633333
+1 1:-0.882353 2:-0.105528 3:0.0819672 4:-0.535354 5:-0.777778 6:-0.162444 7:-0.923997 8:-1
-1 1:-1 2:0.376884 3:-0.344262 4:-0.292929 5:-0.602837 6:0.28465 7:0.887276 8:-0.6
+1 1:-0.411765 2:0.165829 3:0.213115 4:-1 5:-1 6:-0.23696 7:-0.894962 8:-0.7
-1 1:-0.647059 2:-0.21608 3:-0.180328 4:-0.353535 5:-0.791962 6:-0.0760059 7:-0.854825 8:-0.833333

...
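The rows above are in the LIBSVM text format: a label followed by `index:value` pairs with 1-based feature indices. A minimal parser sketch (the helper name `parse_libsvm_line` is an assumption; Spark ships its own loader for this format, which this does not replace):

```python
def parse_libsvm_line(line):
    """Parse 'label idx:val idx:val ...' into (label, {index: value})."""
    parts = line.split()
    label = float(parts[0])            # handles both "+1" and "-1"
    features = {}
    for item in parts[1:]:
        index, value = item.split(":")
        features[int(index)] = float(value)
    return label, features


# Second row of the sample shown above
row = ("+1 1:-0.882353 2:-0.145729 3:0.0819672 4:-0.414141 "
       "5:-1 6:-0.207153 7:-0.766866 8:-0.666667")
label, features = parse_libsvm_line(row)
print(label, len(features), features[1])
```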


TensorFlowOnSpark




Robert Hryniewicz

E: [email protected]

@robertH8z


Feature Selection


Feature Selection

• Also known as variable or attribute selection

• Why important?
  – simplification of models → easier to interpret by researchers/users
  – shorter training times
  – enhanced generalization by reducing overfitting

• Dimensionality reduction vs feature selection
  – Dimensionality reduction: create new combinations of attributes
  – Feature selection: include/exclude attributes in data without changing them

Q: Which features should you use to create a predictive model?


Feature Selection

• Methods
  – Filter
  – Wrapper
  – Embedded

Goal: Identify and remove unneeded, irrelevant and redundant features from data that do not contribute to, or may decrease, the accuracy of a predictive model.


Feature Selection Traps

• Feature selection is another key part of the applied machine learning process, like model selection. You cannot fire and forget.

• It is important to consider feature selection a part of the model selection process. If you do not, you may inadvertently introduce bias into your models which can result in overfitting.

• For example, you must include feature selection within the inner-loop when you are using accuracy estimation methods such as cross-validation. This means that feature selection is performed on the prepared fold right before the model is trained. A mistake would be to perform feature selection first to prepare your data, then perform model selection and training on the selected features.
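A sketch of what "feature selection within the inner loop" looks like in practice: inside each cross-validation fold, the feature is chosen using the training part only, and the held-out part is touched only for scoring. The data is synthetic (the label depends on feature 1 alone), the filter is a simple correlation ranking, and all helper names are assumptions.

```python
import random


def pearson(xs, ys):
    """Pearson correlation; returns 0.0 for a constant column."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / (sxx * syy) ** 0.5


def select_best_feature(rows, labels):
    """Filter method: index of the feature most correlated with the label."""
    n_features = len(rows[0])
    scores = [abs(pearson([r[j] for r in rows], labels)) for j in range(n_features)]
    return max(range(n_features), key=lambda j: scores[j])


def fit_line(xs, ys):
    """Least-squares fit of y = m*x + c."""
    mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    m = sxy / sxx
    return m, mean_y - m * mean_x


def cross_validate(rows, labels, k=5):
    """k-fold CV with feature selection repeated inside every fold."""
    indices = list(range(len(rows)))
    chosen, fold_rmse = [], []
    for fold in range(k):
        test_idx = indices[fold::k]
        held_out = set(test_idx)
        train_idx = [i for i in indices if i not in held_out]
        train_rows = [rows[i] for i in train_idx]
        train_y = [labels[i] for i in train_idx]
        # Selection sees the training fold only, never the held-out rows
        j = select_best_feature(train_rows, train_y)
        m, c = fit_line([r[j] for r in train_rows], train_y)
        sq_err = [(labels[i] - (m * rows[i][j] + c)) ** 2 for i in test_idx]
        fold_rmse.append((sum(sq_err) / len(sq_err)) ** 0.5)
        chosen.append(j)
    return chosen, fold_rmse


# Synthetic data: three features, label driven by feature 1 plus small noise
rng = random.Random(0)
rows = [[rng.uniform(-1, 1) for _ in range(3)] for _ in range(100)]
labels = [3.0 * r[1] + rng.gauss(0, 0.1) for r in rows]

chosen, fold_rmse = cross_validate(rows, labels, k=5)
print("feature chosen per fold:", chosen)
```

Running the selection once on the full dataset before splitting would let the held-out rows influence which feature is kept, which is the mistake the last bullet warns about.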


Feature Selection Checklist

1. Do you have domain knowledge? If yes, construct a better set of “ad hoc” features.

2. Are your features commensurate? If no, consider normalizing them.

3. Do you suspect interdependence of features? If yes, expand your feature set by constructing conjunctive features or products of features, as much as your computer resources allow you.

4. Do you need to prune the input variables (e.g. for cost, speed or data understanding reasons)? If no, construct disjunctive features or weighted sums of features.

5. Do you need to assess features individually (e.g. to understand their influence on the system or because their number is so large that you need to do a first filtering)? If yes, use a variable ranking method; else, do it anyway to get baseline results.

6. Do you need a predictor? If no, stop.

7. Do you suspect your data is “dirty” (has a few meaningless input patterns and/or noisy outputs or wrong class labels)? If yes, detect the outlier examples using the top ranking variables obtained in step 5 as representation; check and/or discard them.

8. Do you know what to try first? If no, use a linear predictor. Use a forward selection method with the “probe” method as a stopping criterion, or use the 0-norm embedded method. For comparison, following the ranking of step 5, construct a sequence of predictors of the same nature using increasing subsets of features. Can you match or improve performance with a smaller subset? If yes, try a non-linear predictor with that subset.

9. Do you have new ideas, time, computational resources, and enough examples? If yes, compare several feature selection methods, including your new idea, correlation coefficients, backward selection and embedded methods. Use linear and non-linear predictors. Select the best approach with model selection.

10. Do you want a stable solution (to improve performance and/or understanding)? If yes, subsample your data and redo your analysis for several “bootstraps”.




AI Investment Landscape


Only $100k investment needed to start with AI


Report from IDC Analyst firm

Spending on AI

• $12.5B in 2017

• $4.5B on apps for threat detection, fraud analysis, public safety, and pharmaceutical research

• $46B+ by 2020


Closing thoughts on AI


The Future of Cognitive Computing / MI

– Machine
  • Deep Learning
  • Discovery
  • Large-scale math
  • Fact checking

– Human
  • Compassion
  • Intuition
  • Design
  • Value judgements
  • Common Sense





What’s new in HDP 2.6: Spark & Zeppelin

Spark

• Spark 1.6.3 GA

• Spark 2.1 GA

• REST API (Livy) GA

• Spark Thrift Server doAs GA

• Spark SQL: Row/Column Security (GA)

• Spark Streaming + Kafka over SSL

• Multi Cluster HBase support for SHC

• Package support in PySpark & SparkR

Zeppelin 0.7.x

• Spark 2.x support

• Improved Livy integration

• No password in clear

• JDBC interpreter improvements

• SmartSense integration

• Knox proxy Zeppelin UI


Thanks!

Robert Hryniewicz

@[email protected]
