Data Science Crash Course - DataWorks Summit - Munich 2017 Robert Hryniewicz Developer Advocate @RobertH8z [email protected]

Data Science Crash Course


Page 1: Data Science Crash Course

Data Science Crash Course - DataWorks Summit - Munich 2017

Robert Hryniewicz, Developer Advocate

@RobertH8z [email protected]

Page 2: Data Science Crash Course

© Hortonworks Inc. 2011–2016. All Rights Reserved

What is Data Science?

→ Extracting knowledge/insights from data
  – Data: structured or unstructured

→ Continuation of
  – statistics
  – machine learning
  – data mining
  – predictive analytics

Page 3: Data Science Crash Course

What is Machine Learning?

Machine Learning: “science of how computers learn without being explicitly programmed”

Page 4: Data Science Crash Course

“AI is the new electricity.”

“AI needs to be company wide strategic decision.”

Andrew Ng

Chief Data Scientist; Co-founder of Coursera; Prof. at Stanford

Page 5: Data Science Crash Course

A Brief History of AI

Antiquity – An Ancient Wish to Forge the Gods
1940 (Digital Computer, scientists discuss electronic brain)
1954–73 (Marvin Minsky et al. in Dartmouth College)
1973–80
1980–87 (Japanese gov.)
1987–93
1993–2000
2000 → Present

Page 6: Data Science Crash Course

AI in Media & Pop Culture

Page 7: Data Science Crash Course


Page 8: Data Science Crash Course


Page 9: Data Science Crash Course


Page 10: Data Science Crash Course

What is AI?

→ General or Pure AI
→ Narrow or Pragmatic AI

Page 11: Data Science Crash Course


Page 12: Data Science Crash Course

“Big Data”

→ Internet of Anything (IoT)
  – Wind Turbines, Oil Rigs
  – Beacons, Wearables
  – Smart Cars

→ User Generated Content (Social, Web & Mobile)
  – Twitter, Facebook, Snapchat
  – Clickstream
  – Paypal, Venmo

44 ZB in 2020

Page 13: Data Science Crash Course

Visualizing 44 ZB

100 pixels = 1M TB

100 px -> 1M TB assumes 5 Mpixel resolution screen
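The scale on this slide can be checked with a few lines of arithmetic. This is a rough sketch that assumes decimal units (1 ZB = 10^9 TB) and reads "1M TB" as one million terabytes; the variable names are illustrative only.

```python
# Assumption: decimal (SI) units, so 1 ZB = 10**21 bytes = 10**9 TB.
TB_PER_ZB = 10**9

total_tb = 44 * TB_PER_ZB        # 44 ZB expressed in TB
tb_per_block = 10**6             # each block stands for "1M TB"
pixels_per_block = 100           # and is drawn with 100 pixels

pixels_needed = total_tb // tb_per_block * pixels_per_block
screen_pixels = 5_000_000        # the slide's assumed 5 Mpixel screen

print(pixels_needed)                   # 4400000
print(pixels_needed <= screen_pixels)  # True
```

Under these assumptions the whole 44 ZB picture needs about 4.4 million pixels, which is why it fits on a 5 Mpixel screen.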

Page 14: Data Science Crash Course


Page 15: Data Science Crash Course


Page 16: Data Science Crash Course

Key drivers behind AI Explosion

→ Exponential data growth

→ Faster distributed systems

→ Smarter algorithms

Page 17: Data Science Crash Course

Major Trends in AI Technologies

→ Knowledge Engineering

→ Machine Learning

→ Deep Learning

→ Image Analysis

→ Natural Language Processing & Generation

→ Robotics & Automation

Page 18: Data Science Crash Course

Creating Value with AI

→ Cognitive insights

→ Cognitive engagement

→ Cognitive automation

Page 19: Data Science Crash Course

Machine Learning Use Cases

Healthcare
  – Predict diagnosis
  – Prioritize screenings
  – Reduce re-admittance rates

Financial services
  – Fraud detection/prevention
  – Predict underwriting risk
  – New account risk screens

Public Sector
  – Analyze public sentiment
  – Optimize resource allocation
  – Law enforcement & security

Retail
  – Product recommendation
  – Inventory management
  – Price optimization

Telco/mobile
  – Predict customer churn
  – Predict equipment failure
  – Customer behavior analysis

Oil & Gas
  – Predictive maintenance
  – Seismic data management
  – Predict well production levels

Page 20: Data Science Crash Course

What Is Apache Spark?

→ Apache open source project originally developed at AMPLab (University of California Berkeley)

→ Unified data processing engine that operates across varied data workloads and platforms

Page 21: Data Science Crash Course

Why Apache Spark?

→ Elegant Developer APIs
  – Single environment for data munging, data wrangling, and Machine Learning (ML)

→ Fast!
  – In-memory computation model
  – Effective for iterative computations

→ Machine Learning
  – Implementation of distributed ML algorithms

Page 22: Data Science Crash Course

Spark SQL: Structured Data

Spark Streaming: Near Real-time

Spark MLlib: Machine Learning

GraphX: Graph Analysis

Page 23: Data Science Crash Course

More Flexible / Better Storage and Performance

Page 24: Data Science Crash Course

Spark SQL Overview

→ Spark module for structured data processing (e.g. DB tables, JSON files, CSV)

→ Three ways to manipulate data:
  – DataFrames API
  – SQL queries
  – Datasets API

Page 25: Data Science Crash Course

DataFrames

→ Distributed collection of data organized into named columns

→ Conceptually equivalent to a table in a relational DB or a data frame in R/Python

→ API available in Scala, Java, Python, and R

[Figure: a DataFrame with columns Col1, Col2, …, ColN; Column and Row are labeled]

Data is described as a DataFrame with rows, columns, and a schema

Page 26: Data Science Crash Course

DataFrames

[Figure: CSV, Avro, HIVE, JSON, Spark SQL, and a DataFrame with columns Col1, Col2, …, ColN; Column and Row are labeled]

Page 27: Data Science Crash Course


Visualizations

Page 28: Data Science Crash Course

Source: commons.wikimedia.org/w/index.php?curid=17857442

Page 29: Data Science Crash Course

Data Visualization: Twitter

Source: https://medium.com/@swainjo/us-presidential-election-2016-twitter-analysis-7596606853e5#.dozwu2bhd

Page 30: Data Science Crash Course

Simple line chart

Page 31: Data Science Crash Course

Horizontal plot of three line charts

Page 32: Data Science Crash Course

Streaming data into a line chart

Page 33: Data Science Crash Course

Plotting Iris data features in one plot

Page 34: Data Science Crash Course

Comparing Iris data distributions

Page 35: Data Science Crash Course

Spark SQL: Structured Data

Spark Streaming: Near Real-time

Spark MLlib: Machine Learning

GraphX: Graph Analysis

Page 36: Data Science Crash Course


Algorithms

Page 37: Data Science Crash Course

What is a ML Model?

→ Mathematical formula with a number of parameters that need to be learned from the data. Fitting a model to the data is a process known as model training.

→ E.g. linear regression
  – Goal: fit a line y = mx + c to data points
  – After model training: y = 2x + 5

Input → Model → Output
1, 0, 7, 2, … → 7, 5, 19, 9, …
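As a minimal sketch of what "model training" means here, the following fits y = mx + c to the four input/output pairs on the slide using ordinary least squares. The function name `fit_line` is illustrative only, and the closed-form fit is just one way to train such a model.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = m*x + c; returns (m, c)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope = covariance(x, y) / variance(x)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    m = sxy / sxx
    c = mean_y - m * mean_x
    return m, c

xs = [1, 0, 7, 2]    # inputs from the slide
ys = [7, 5, 19, 9]   # outputs from the slide
m, c = fit_line(xs, ys)
print(f"y = {m:.2f}x + {c:.2f}")  # y = 2.00x + 5.00
```

The learned parameters m and c are exactly the "parameters that need to be learned from the data".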

Page 38: Data Science Crash Course

START

Regression
  • Linear Regression

Classification
  • Logistic Regression
  • Support Vector Machines (SVM)
  • Random Forest (RF)
  • Naïve Bayes

Collaborative Filtering
  • Alternating Least Squares (ALS)

Clustering
  • K-Means, LDA

Dimensionality Reduction
  • Principal Component Analysis (PCA)

Page 39: Data Science Crash Course

CLASSIFICATION
Identifying to which category an object belongs

Examples: spam detection, diabetes diagnosis, text labeling

Algorithms:

→ Logistic Regression
  – Fast training, linear model
  – Classes expressed in probabilities

→ Support Vector Machines (SVM)
  – “Best” supervised learning algorithm, effective
  – More robust to outliers than Log Regression
  – Handles non-linearity

→ Random Forest
  – Fast training
  – Handles categorical features
  – Does not require feature scaling
  – Captures non-linearity and feature interaction

→ Naïve Bayes
  – Good for text classification
  – Assumes independent variables
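To make the Naïve Bayes bullet concrete, here is a small sketch of a multinomial Naïve Bayes text classifier in plain Python, not Spark MLlib. The tiny training set, the add-one smoothing, and the function names are assumptions made for illustration. The "independent variables" assumption shows up as a simple sum of per-word log-probabilities.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(docs):
    """docs: list of (text, label) pairs. Returns counts needed for prediction."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in docs:
        class_counts[label] += 1
        for word in text.lower().split():
            word_counts[label][word] += 1
            vocab.add(word)
    return class_counts, word_counts, vocab

def predict(model, text):
    """Return the label with the highest log-posterior for the text."""
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, n_docs in class_counts.items():
        score = math.log(n_docs / total_docs)           # log prior
        total_words = sum(word_counts[label].values())
        for word in text.lower().split():
            if word not in vocab:
                continue                                # ignore unseen words
            count = word_counts[label][word]
            # Add-one (Laplace) smoothing; words treated as independent
            score += math.log((count + 1) / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical toy data, for illustration only
training = [
    ("win free money now", "spam"),
    ("free prize claim now", "spam"),
    ("meeting agenda for monday", "ham"),
    ("lunch on monday with the team", "ham"),
]
model = train_naive_bayes(training)
print(predict(model, "claim your free money"))   # spam
print(predict(model, "team meeting on monday"))  # ham
```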

Page 40: Data Science Crash Course

CLASSIFICATION

Visual Intro to Decision Trees

→ http://www.r2d3.us/visual-intro-to-machine-learning-part-1

Page 41: Data Science Crash Course

REGRESSION
Predicting a continuous-valued output

Example: Predicting house prices based on number of bedrooms and square footage

Algorithms: Linear Regression

Page 42: Data Science Crash Course

CLUSTERING
Automatic grouping of similar objects into sets (clusters)

Example: market segmentation – auto group customers into different market segments

Algorithms: K-means, LDA
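A minimal one-dimensional K-means sketch shows the idea of automatic grouping. It is plain Python, not the Spark MLlib implementation; the customer spend values, the starting centroids, and the function name are made up for illustration.

```python
def kmeans_1d(values, centroids, iterations=10):
    """Alternate between assigning points to the nearest centroid
    and moving each centroid to the mean of its assigned points."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for v in values:
            nearest = min(range(len(centroids)), key=lambda i: abs(v - centroids[i]))
            clusters[nearest].append(v)
        # An empty cluster keeps its previous centroid
        centroids = [
            sum(c) / len(c) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    return centroids, clusters

# Hypothetical monthly spend per customer
spend = [12, 15, 14, 95, 102, 98]
centroids, clusters = kmeans_1d(spend, centroids=[0.0, 50.0])
print(clusters)  # [[12, 15, 14], [95, 102, 98]]
```

The two resulting groups correspond to a low-spend and a high-spend market segment; the number of clusters (two here) has to be chosen up front.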

Page 43: Data Science Crash Course

COLLABORATIVE FILTERING
Fill in the missing entries of a user-item association matrix

Applications: Product/movie recommendation

Algorithms: Alternating Least Squares (ALS)

Page 44: Data Science Crash Course

DIMENSIONALITY REDUCTION
Reducing the number of redundant features/variables

Applications:

→ Removing noise in images by selecting only “important” features

→ Removing redundant features, e.g. MPH & KPH are linearly dependent

Algorithms: Principal Component Analysis (PCA)
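The MPH/KPH example can be checked numerically. The sketch below is not PCA; it only computes the Pearson correlation between two columns to show that one is a linear function of the other and therefore carries no extra information. The speed values are invented for illustration.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sd_x * sd_y)

# Hypothetical speeds; KPH is derived from MPH by a constant factor
mph = [10.0, 25.0, 40.0, 55.0, 70.0]
kph = [v * 1.609344 for v in mph]

r = pearson(mph, kph)
print(round(r, 6))  # 1.0
```

A correlation of 1 means one of the two features can be dropped without losing information.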

Page 45: Data Science Crash Course

START

Regression
  • Local Regression (LOESS)

Classification
  • XGBoost (Extreme Gradient Boosting)
  • Classification and regression trees (CART)

Deep Learning
  • Recurrent Neural Network (RNN)
  • Convolutional Neural Network (CNN)

Clustering
  • Yinyang K-Means

Dimensionality Reduction
  • t-Distributed Stochastic Neighbor Embedding (t-SNE)

Collaborative Filtering
  • Weighted Alternating Least Squares (WALS)

Page 46: Data Science Crash Course


Page 47: Data Science Crash Course

Hyperparameters

→ Define higher-level model properties, e.g. complexity or learning rate

→ Cannot be learned during training, so they need to be predefined

→ Can be decided by
  – setting different values
  – training different models
  – choosing the values that test better

→ Hyperparameter examples
  – Number of leaves or depth of a tree
  – Number of latent factors in a matrix factorization
  – Learning rate (in many models)
  – Number of hidden layers in a deep neural network
  – Number of clusters in a k-means clustering
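The "set different values, train different models, choose the values that test better" recipe can be sketched as a small search over one hyperparameter. The example uses a toy k-nearest-neighbor regressor in plain Python, where k is the hyperparameter; the data, candidate values, and function names are assumptions for illustration.

```python
def knn_predict(train, x, k):
    """Average the targets of the k training points nearest to x."""
    nearest = sorted(train, key=lambda p: abs(p[0] - x))[:k]
    return sum(y for _, y in nearest) / k

def validation_error(train, valid, k):
    """Mean squared error on held-out validation points."""
    return sum((knn_predict(train, x, k) - y) ** 2 for x, y in valid) / len(valid)

# Hypothetical data following y = 2x + 5
train = [(0, 5), (1, 7), (2, 9), (3, 11), (4, 13), (5, 15), (6, 17), (7, 19)]
valid = [(1.5, 8), (4.5, 14), (6.5, 18)]

candidates = [1, 2, 4, 8]                       # values of k to try
errors = {k: validation_error(train, valid, k) for k in candidates}
best_k = min(errors, key=errors.get)            # value that tests better
print(best_k)  # 2
```

Note that k is never adjusted by the model itself; it is fixed before each training run and judged only by the held-out error.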

Page 48: Data Science Crash Course


Predictive Analytics Pre-requisites

Page 49: Data Science Crash Course


Predictive Analytics Process and Tools

Page 50: Data Science Crash Course

Asking Relevant Questions

→ Specific (can you think of a clear answer?)

→ Measurable (quantifiable? data driven?)

→ Actionable (if you had an answer, could you do something with it?)

→ Realistic (can you get an answer with data you have?)

→ Timely (answer in reasonable timeframe?)

Page 51: Data Science Crash Course

With that in mind…

→ No simple formula for “good questions”, only general guidelines

→ The right data is better than lots of data

→ Understanding relationships matters

Page 52: Data Science Crash Course

Data Preparation

1. Data analysis (audit for anomalies/errors)

2. Creating an intuitive workflow (formulate seq. of prep operations)

3. Validation (correctness evaluated against sample representative dataset)

4. Transformation (actual prep process takes place)

5. Backflow of cleaned data (replace original dirty data)

Approx. 80% of Data Analyst’s job is Data Preparation!

Example of multiple values used for U.S. States ⇒ California, CA, Cal., Cal
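The U.S. States example is a typical transformation step. A minimal sketch, assuming a hand-written alias table (the mapping and the function name are illustrative, not from the deck), could look like this:

```python
# Hypothetical alias table mapping observed spellings to one canonical code
STATE_ALIASES = {
    "california": "CA",
    "ca": "CA",
    "cal.": "CA",
    "cal": "CA",
}

def normalize_state(raw):
    """Map a raw state string to its canonical code; leave unknown values as-is."""
    key = raw.strip().lower()
    return STATE_ALIASES.get(key, raw.strip())

dirty = ["California", "CA", "Cal.", " Cal", "Nevada"]
clean = [normalize_state(value) for value in dirty]
print(clean)  # ['CA', 'CA', 'CA', 'CA', 'Nevada']
```

Values that are not in the table pass through unchanged, so they can be flagged during the audit step instead of being silently rewritten.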

Page 53: Data Science Crash Course

Detailed Research and Operational Workflows

Page 54: Data Science Crash Course

[Figure: Training Set → Learning Algorithm → h (hypothesis/model); input → h → output]

Ingest/Enrich Data

Clean/Transform/Filter

Select/Create New Features

Evaluate Accuracy/Score

Page 55: Data Science Crash Course

Building Spark ML pipelines

[Figure: a Pipeline made of Feature transform 1, Feature transform 2, Combine features, and Linear Regression. Train: Input DataFrame → Pipeline → Pipeline Model. Predict: Input DataFrame → Pipeline Model → Output DataFrame. Export Model.]

Page 56: Data Science Crash Course

Spark ML Pipeline

→ fit() is for training
→ transform() is for prediction

[Figure: Train: Input DataFrame (TRAIN) → Pipeline, fit() → Pipeline Model. Predict: Input DataFrame (TEST) → Pipeline Model, transform() → Output DataFrame (PREDICTIONS)]

Page 57: Data Science Crash Course

Sample Spark ML Pipeline

from pyspark.ml import Pipeline
from pyspark.ml.classification import RandomForestClassifier

# Feature stages (definitions elided on the slide)
indexer = …
parser = …
hashingTF = …
vecAssembler = …

rf = RandomForestClassifier(numTrees=100)
pipe = Pipeline(stages=[indexer, parser, hashingTF, vecAssembler, rf])

model = pipe.fit(trainData)          # Train model
results = model.transform(testData)  # Test model

Page 58: Data Science Crash Course

Exporting ML Models - PMML

→ Predictive Model Markup Language (PMML)
→ Supported models
  – K-Means
  – Linear Regression
  – Ridge Regression
  – Lasso
  – SVM
  – Binary

Page 59: Data Science Crash Course


HDCloud

Page 60: Data Science Crash Course

Hortonworks Cloud Solutions

Cloud providers: Microsoft, AWS, Google

Managed: Azure HDInsight

Non-Managed/Marketplace: Hortonworks Data Cloud for AWS

Cloud IaaS: Hortonworks Data Platform (via Ambari and via Cloudbreak)

Page 61: Data Science Crash Course


Page 62: Data Science Crash Course

Zeppelin

Ambari

Spark History Server

Files View

Page 63: Data Science Crash Course

→ Zeppelin ⇒ Interactive notebook

→ Spark

→ YARN ⇒ Resource Management

→ HDFS ⇒ Distributed Storage Layer

[Figure: stack diagram with APIs (Scala, Java, Python, R), Spark SQL, Spark Streaming, MLlib, and GraphX on the Spark Core Engine, over YARN, over HDFS nodes 1 … N]

Page 64: Data Science Crash Course


Spark and HDP

Page 65: Data Science Crash Course


Labs/Tutorials

Page 66: Data Science Crash Course

Scatter 2D Data Visualized

scatterData ← DataFrame

+-----+--------+
|label|features|
+-----+--------+
|-12.0|  [-4.9]|
| -6.0|  [-4.5]|
| -7.2|  [-4.1]|
| -5.0|  [-3.2]|
| -2.0|  [-3.0]|
| -3.1|  [-2.1]|
| -4.0|  [-1.5]|
| -2.2|  [-1.2]|
| -2.0|  [-0.7]|
|  1.0|  [-0.5]|
| -0.7|  [-0.2]|
...

Page 67: Data Science Crash Course

Linear Regression Model Training (one feature)

Training Result

Coefficients: 2.81
Intercept: 3.05

y = 2.81x + 3.05

Page 68: Data Science Crash Course


Linear Regression (two features)

Coefficients: [0.464, 0.464] Intercept: 0.0563

Page 69: Data Science Crash Course

ML Lab

• Residuals
  • The residual of an observed value is the difference between the observed value and the estimated value

• R2 (R Squared) – Coefficient of Determination
  • Indicates the goodness of fit
  • R2 of 1 means the regression line perfectly fits the data

• RMSE (Root Mean Square Error)
  • Measure of the differences between values predicted by a model and the values actually observed
  • Good measure of accuracy, but only to compare forecasting errors of different models for the same variable (RMSE is scale-dependent)
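The three metrics can be written out directly from their definitions. This is a plain-Python sketch rather than the Spark evaluator used in the lab; the observed and predicted values are made up for illustration.

```python
import math

def residuals(observed, predicted):
    """Observed value minus estimated value, per data point."""
    return [o - p for o, p in zip(observed, predicted)]

def rmse(observed, predicted):
    """Root of the mean squared residual."""
    res = residuals(observed, predicted)
    return math.sqrt(sum(r * r for r in res) / len(res))

def r_squared(observed, predicted):
    """1 - (residual sum of squares / total sum of squares)."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum(r * r for r in residuals(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1 - ss_res / ss_tot

observed = [7.0, 5.0, 19.0, 9.0]
perfect = [2 * x + 5 for x in [1, 0, 7, 2]]   # predictions of y = 2x + 5
off_by_one = [8.0, 4.0, 18.0, 10.0]           # every prediction misses by 1

print(r_squared(observed, perfect))    # 1.0
print(rmse(observed, off_by_one))      # 1.0
print(round(r_squared(observed, off_by_one), 4))  # 0.9655
```

Because RMSE is expressed in the units of the target, an RMSE of 1.0 here is only meaningful relative to other models predicting the same variable.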

Page 70: Data Science Crash Course

Demo: Stock Portfolio Simulation using Monte Carlo method

Monte Carlo Simulation

1. Define a domain of possible inputs
2. Randomly generate inputs from prob. distribution over domain
3. Perform computation on the inputs
4. Aggregate the results

Approximating the value of π after placing 30K random points. Error < 0.07% of actual value.
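The four Monte Carlo steps map directly onto the classic π estimate. The sketch below is an illustration in plain Python, not the demo notebook; the seed and function name are arbitrary choices, and the error of any single run will differ from the figure quoted on the slide.

```python
import math
import random

def estimate_pi(n_points, seed=0):
    """Estimate pi from the fraction of random points inside a quarter circle."""
    rng = random.Random(seed)
    inside = 0
    for _ in range(n_points):
        # Steps 1-2: domain is the unit square; sample uniformly from it
        x, y = rng.random(), rng.random()
        # Step 3: computation on the inputs (is the point inside the circle?)
        if x * x + y * y <= 1.0:
            inside += 1
    # Step 4: aggregate; the area ratio quarter-circle/square is pi/4
    return 4.0 * inside / n_points

pi_estimate = estimate_pi(30_000)
relative_error = abs(pi_estimate - math.pi) / math.pi
print(pi_estimate, relative_error)
```

With 30K points the standard error of the estimate is roughly 0.01, so individual runs typically land within a fraction of a percent of π.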

Page 71: Data Science Crash Course

Demo: Text Classification with Naïve Bayes

Page 72: Data Science Crash Course

Diabetes Dataset – Decision Trees / Random Forest

Labeled set with 8 Features

-1 1:-0.294118 2:0.487437 3:0.180328 4:-0.292929 5:-1 6:0.00149028 7:-0.53117 8:-0.0333333
+1 1:-0.882353 2:-0.145729 3:0.0819672 4:-0.414141 5:-1 6:-0.207153 7:-0.766866 8:-0.666667
-1 1:-0.0588235 2:0.839196 3:0.0491803 4:-1 5:-1 6:-0.305514 7:-0.492741 8:-0.633333
+1 1:-0.882353 2:-0.105528 3:0.0819672 4:-0.535354 5:-0.777778 6:-0.162444 7:-0.923997 8:-1
-1 1:-1 2:0.376884 3:-0.344262 4:-0.292929 5:-0.602837 6:0.28465 7:0.887276 8:-0.6
+1 1:-0.411765 2:0.165829 3:0.213115 4:-1 5:-1 6:-0.23696 7:-0.894962 8:-0.7
-1 1:-0.647059 2:-0.21608 3:-0.180328 4:-0.353535 5:-0.791962 6:-0.0760059 7:-0.854825 8:-0.833333

...
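Each record above appears to use the sparse `label index:value` text layout (commonly called LIBSVM format). A small parser sketch in plain Python shows how one line breaks down into a label and its features; the function name is illustrative, and in Spark the file would normally be loaded by a built-in reader instead.

```python
def parse_libsvm_line(line):
    """Split 'label idx:val idx:val ...' into (label, {idx: val})."""
    parts = line.split()
    label = int(parts[0])            # int() accepts both '+1' and '-1'
    features = {}
    for item in parts[1:]:
        index, value = item.split(":")
        features[int(index)] = float(value)
    return label, features

line = ("-1 1:-0.294118 2:0.487437 3:0.180328 4:-0.292929 "
        "5:-1 6:0.00149028 7:-0.53117 8:-0.0333333")
label, features = parse_libsvm_line(line)
print(label)          # -1
print(len(features))  # 8
print(features[2])    # 0.487437
```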

Page 73: Data Science Crash Course


TensorFlowOnSpark

Page 74: Data Science Crash Course


TensorFlowOnSpark

Page 75: Data Science Crash Course


TensorFlowOnSpark

Page 76: Data Science Crash Course

Robert Hryniewicz
E: [email protected]
@robertH8z

Page 77: Data Science Crash Course


FeatureSelection

Page 78: Data Science Crash Course

Feature Selection

→ Also known as variable or attribute selection

→ Why important?
  – simplification of models ⇒ easier to interpret by researchers/users
  – shorter training times
  – enhanced generalization by reducing overfitting

→ Dimensionality reduction vs feature selection
  – Dimensionality reduction: create new combinations of attributes
  – Feature selection: include/exclude attributes in data without changing them

Q: Which features should you use to create a predictive model?

Page 79: Data Science Crash Course

Feature Selection

→ Methods
  – Filter
  – Wrapper
  – Embedded

Goal: Identify and remove unneeded, irrelevant and redundant features from data that do not contribute to, or may decrease, the accuracy of a predictive model.

Page 80: Data Science Crash Course

Feature Selection Traps

→ Feature selection is another key part of the applied machine learning process, like model selection. You cannot fire and forget.

→ It is important to consider feature selection a part of the model selection process. If you do not, you may inadvertently introduce bias into your models, which can result in overfitting.

→ For example, you must include feature selection within the inner loop when you are using accuracy estimation methods such as cross-validation. This means that feature selection is performed on the prepared fold right before the model is trained. A mistake would be to perform feature selection first to prepare your data, then perform model selection and training on the selected features.
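The inner-loop rule can be sketched as follows. This is an illustrative plain-Python example, not the deck's lab code: it uses a simple correlation filter and a nearest-centroid classifier as stand-ins, with synthetic data in which only feature 0 is informative. The key point is that `select_features` is called inside each fold and sees only that fold's training rows.

```python
import random

def pearson_abs(xs, ys):
    """Absolute Pearson correlation; 0.0 if either input is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return abs(cov) / (vx * vy) ** 0.5

def select_features(rows, labels, top_n):
    """Filter method: rank features by |correlation with the label|."""
    n_features = len(rows[0])
    scores = [pearson_abs([r[j] for r in rows], labels) for j in range(n_features)]
    return sorted(range(n_features), key=lambda j: -scores[j])[:top_n]

def train_centroids(rows, labels, features):
    """Nearest-centroid model: one mean vector per class."""
    centroids = {}
    for cls in set(labels):
        members = [r for r, l in zip(rows, labels) if l == cls]
        centroids[cls] = [sum(r[j] for r in members) / len(members) for j in features]
    return centroids

def predict(centroids, row, features):
    def dist(c):
        return sum((row[j] - c[i]) ** 2 for i, j in enumerate(features))
    return min(centroids, key=lambda cls: dist(centroids[cls]))

def cross_validate(rows, labels, k=5, top_n=1):
    accuracies = []
    for fold in range(k):
        test_idx = set(range(fold, len(rows), k))
        train_rows = [r for i, r in enumerate(rows) if i not in test_idx]
        train_labels = [l for i, l in enumerate(labels) if i not in test_idx]
        # Feature selection happens INSIDE the fold, on training rows only
        features = select_features(train_rows, train_labels, top_n)
        centroids = train_centroids(train_rows, train_labels, features)
        hits = sum(predict(centroids, rows[i], features) == labels[i] for i in test_idx)
        accuracies.append(hits / len(test_idx))
    return sum(accuracies) / len(accuracies)

# Synthetic data: feature 0 depends on the label, features 1-2 are noise
rng = random.Random(42)
labels = [i % 2 for i in range(40)]
rows = [[label * 4.0 + rng.gauss(0, 0.5), rng.gauss(0, 1), rng.gauss(0, 1)]
        for label in labels]
accuracy = cross_validate(rows, labels, k=5, top_n=1)
print(accuracy)
```

Selecting features once on the full dataset before the loop would let information from the test folds leak into training, which is the mistake the slide warns about.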

Page 81: Data Science Crash Course

Feature Selection Checklist

1. Do you have domain knowledge? If yes, construct a better set of “ad hoc” features.

2. Are your features commensurate? If no, consider normalizing them.

3. Do you suspect interdependence of features? If yes, expand your feature set by constructing conjunctive features or products of features, as much as your computer resources allow you.

4. Do you need to prune the input variables (e.g. for cost, speed or data understanding reasons)? If no, construct disjunctive features or weighted sums of features.

5. Do you need to assess features individually (e.g. to understand their influence on the system or because their number is so large that you need to do a first filtering)? If yes, use a variable ranking method; else, do it anyway to get baseline results.

6. Do you need a predictor? If no, stop.

7. Do you suspect your data is “dirty” (has a few meaningless input patterns and/or noisy outputs or wrong class labels)? If yes, detect the outlier examples using the top ranking variables obtained in step 5 as representation; check and/or discard them.

8. Do you know what to try first? If no, use a linear predictor. Use a forward selection method with the “probe” method as a stopping criterion, or use the 0-norm embedded method. For comparison, following the ranking of step 5, construct a sequence of predictors of the same nature using increasing subsets of features. Can you match or improve performance with a smaller subset? If yes, try a non-linear predictor with that subset.

9. Do you have new ideas, time, computational resources, and enough examples? If yes, compare several feature selection methods, including your new idea, correlation coefficients, backward selection and embedded methods. Use linear and non-linear predictors. Select the best approach with model selection.

10. Do you want a stable solution (to improve performance and/or understanding)? If yes, subsample your data and redo your analysis for several “bootstraps”.

Page 82: Data Science Crash Course


Page 83: Data Science Crash Course

AI Investment Landscape

Page 84: Data Science Crash Course

Only $100k investment needed to start with AI

Page 85: Data Science Crash Course

Report from IDC Analyst firm

Spending on AI
• $12.5B in 2017
  • $4.5B on apps for threat detection, fraud analysis, public safety, and pharmaceutical research
• $46B+ by 2020

Page 86: Data Science Crash Course

Closing thoughts on AI

Page 87: Data Science Crash Course

The Future of Cognitive Computing / MI

– Machine
  • Deep Learning
  • Discovery
  • Large-scale math
  • Fact checking

– Human
  • Compassion
  • Intuition
  • Design
  • Value judgements
  • Common Sense

Page 88: Data Science Crash Course


Page 89: Data Science Crash Course


Page 90: Data Science Crash Course

What’s new in HDP 2.6 – Spark & Zeppelin

Spark
→ Spark 1.6.3 GA
→ Spark 2.1 GA
→ REST API (Livy) GA
→ Spark Thrift Server doAs GA
→ Spark SQL – Row/Column Security (GA)
→ Spark Streaming + Kafka over SSL
→ Multi Cluster HBase support for SHC
→ Package support in PySpark & SparkR

Zeppelin 0.7.x
→ Spark 2.x support
→ Improved Livy integration
→ No password in clear
→ JDBC interpreter improvements
→ SmartSense integration
→ Knox proxy Zeppelin UI

Page 91: Data Science Crash Course

Thanks!

Robert Hryniewicz
@RobertH8z [email protected]