Data Science Crash Course - DataWorks Summit - Munich 2017
Robert Hryniewicz, Developer Advocate
© Hortonworks Inc. 2011–2016. All Rights Reserved
What is Data Science?
→ Extracting knowledge/insights from data
  – Data: structured or unstructured
→ Continuation of
  – statistics
  – machine learning
  – data mining
  – predictive analytics
What is Machine Learning?
Machine Learning
“science of how computers learn without being explicitly programmed”
“AI is the new electricity.”
“AI needs to be company wide strategic decision.”
Andrew Ng
Chief Data Scientist, Co-founder of Coursera, Prof. at Stanford
A Brief History of AI
– Antiquity: An Ancient Wish to Forge the Gods
– 1940: Digital Computer, scientists discuss electronic brain
– 1954–73: Marvin Minsky et al. in Dartmouth College
– 1973–80
– 1980–87: Japanese gov.
– 1987–93
– 1993–2000
– 2000 → Present
AI in Media & Pop Culture
What is AI?
→ General or Pure AI
→ Narrow or Pragmatic AI
“Big Data”
→ Internet of Anything (IoT)
  – Wind Turbines, Oil Rigs
  – Beacons, Wearables
  – Smart Cars
→ User Generated Content (Social, Web & Mobile)
  – Twitter, Facebook, Snapchat
  – Clickstream
  – Paypal, Venmo
44 ZB in 2020
Visualizing 44 ZB
100 pixels = 1M TB (assumes a 5-megapixel resolution screen)
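The scale used on this slide can be checked with a few lines of arithmetic. This is a minimal sketch that assumes decimal units (1 ZB = 10^9 TB) and reads "1M TB" as one million terabytes; the variable names are illustrative only.

```python
# Decimal (SI) units are an assumption; the slide does not state a convention.
TB_PER_ZB = 10**9            # 1 zettabyte = 10^9 terabytes
total_tb = 44 * TB_PER_ZB    # 44 ZB expressed in terabytes

TB_PER_BLOCK = 10**6         # the slide's scale: 100 pixels stand for 1M TB
PIXELS_PER_BLOCK = 100

blocks = total_tb // TB_PER_BLOCK
pixels_needed = blocks * PIXELS_PER_BLOCK

print(f"{blocks:,} blocks -> {pixels_needed:,} pixels")
```

Under these assumptions the result is about 4.4 million pixels, which is in line with the 5-megapixel screen the slide mentions.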
Key drivers behind AI Explosion
→ Exponential data growth
→ Faster distributed systems
→ Smarter algorithms
Major Trends in AI Technologies
→ Knowledge Engineering
→ Machine Learning
→ Deep Learning
→ Image Analysis
→ Natural Language Processing & Generation
→ Robotics & Automation
Creating Value with AI
→ Cognitive insights
→ Cognitive engagement
→ Cognitive automation
Machine Learning Use Cases
Healthcare: Predict diagnosis; Prioritize screenings; Reduce re-admittance rates
Financial services: Fraud detection/prevention; Predict underwriting risk; New account risk screens
Public Sector: Analyze public sentiment; Optimize resource allocation; Law enforcement & security
Retail: Product recommendation; Inventory management; Price optimization
Telco/mobile: Predict customer churn; Predict equipment failure; Customer behavior analysis
Oil & Gas: Predictive maintenance; Seismic data management; Predict well production levels
What Is Apache Spark?
→ Apache open source project originally developed at AMPLab (University of California Berkeley)
→ Unified data processing engine that operates across varied data workloads and platforms
Why Apache Spark?
→ Elegant Developer APIs
  – Single environment for data munging, data wrangling, and Machine Learning (ML)
→ Fast!
  – In-memory computation model
  – Effective for iterative computations
→ Machine Learning
  – Implementation of distributed ML algorithms
Spark SQL: Structured Data
Spark Streaming: Near Real-time
Spark MLlib: Machine Learning
GraphX: Graph Analysis
More Flexible / Better Storage and Performance
Spark SQL Overview
→ Spark module for structured data processing (e.g. DB tables, JSON files, CSV)
→ Three ways to manipulate data:
  – DataFrames API
  – SQL queries
  – Datasets API
DataFrames
→ Distributed collection of data organized into named columns
→ Conceptually equivalent to a table in a relational DB or a data frame in R/Python
→ API available in Scala, Java, Python, and R
Figure: a DataFrame drawn as a table with columns Col1, Col2, …, ColN, with one row and one column highlighted
Data is described as a DataFrame with rows, columns, and a schema
DataFrames
Figure: CSV, Avro, Hive, Spark SQL, and JSON sources feeding a DataFrame (columns Col1, Col2, …, ColN)
Visualizations
Source: commons.wikimedia.org/w/index.php?curid=17857442
Data Visualization: Twitter
Source: https://medium.com/@swainjo/us-presidential-election-2016-twitter-analysis-7596606853e5#.dozwu2bhd
Simple line chart
Horizontal plot of three line charts
Streaming data into a line chart
Plotting Iris data features in one plot
Comparing Iris data distributions
Algorithms
What is a ML Model?
→ Mathematical formula with a number of parameters that need to be learned from the data. Fitting a model to the data is a process known as model training.
→ E.g. linear regression
  – Goal: fit a line y = mx + c to data points
  – After model training: y = 2x + 5
Input (1, 0, 7, 2, …) → Model → Output (7, 5, 19, 9, …)
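The linear regression example can be reproduced in a few lines. This is a minimal sketch using ordinary least squares in plain Python (not Spark MLlib); the function name `fit_line` is illustrative, and the data points are the four input/output pairs shown on the slide.

```python
def fit_line(xs, ys):
    """Ordinary least squares for one feature: returns (m, c) for y = m*x + c."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope is the covariance of x and y divided by the variance of x.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    m = sxy / sxx
    c = mean_y - m * mean_x
    return m, c

xs = [1, 0, 7, 2]
ys = [7, 5, 19, 9]
m, c = fit_line(xs, ys)
print(f"y = {m:.2f}x + {c:.2f}")
```

Here m and c are the parameters "learned from the data"; for these four points they come out as the slope and intercept of y = 2x + 5.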
START
Regression
  • Linear Regression
Classification
  • Logistic Regression
  • Support Vector Machines (SVM)
  • Random Forest (RF)
  • Naïve Bayes
Collaborative Filtering
  • Alternating Least Squares (ALS)
Clustering
  • K-Means, LDA
Dimensionality Reduction
  • Principal Component Analysis (PCA)
CLASSIFICATION
Identifying to which category an object belongs
Examples: spam detection, diabetes diagnosis, text labeling
Algorithms:
→ Logistic Regression
  – Fast training, linear model
  – Classes expressed in probabilities
→ Support Vector Machines (SVM)
  – “Best” supervised learning algorithm, effective
  – More robust to outliers than Log Regression
  – Handles non-linearity
→ Random Forest
  – Fast training
  – Handles categorical features
  – Does not require feature scaling
  – Captures non-linearity and feature interaction
→ Naïve Bayes
  – Good for text classification
  – Assumes independent variables
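To make the Naïve Bayes bullets concrete, here is a minimal sketch of a multinomial Naïve Bayes text classifier in plain Python (not the Spark MLlib implementation). The training sentences, the labels "spam" and "ham", and the function names are all hypothetical. The independence assumption shows up as a simple sum of per-word log-probabilities.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs):
    """docs: list of (text, label). Returns the counts needed for prediction."""
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in docs:
        label_counts[label] += 1
        for word in text.lower().split():
            word_counts[label][word] += 1
            vocab.add(word)
    return label_counts, word_counts, vocab

def predict_nb(model, text):
    label_counts, word_counts, vocab = model
    total_docs = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label, n_docs in label_counts.items():
        # log P(label) + sum of log P(word | label), with add-one smoothing
        score = math.log(n_docs / total_docs)
        total_words = sum(word_counts[label].values())
        for word in text.lower().split():
            if word not in vocab:
                continue  # ignore words never seen in training
            count = word_counts[label][word]
            score += math.log((count + 1) / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical toy training set
train = [
    ("win cash prize now", "spam"),
    ("cheap prize click now", "spam"),
    ("meeting agenda for monday", "ham"),
    ("lunch on monday with team", "ham"),
]
model = train_nb(train)
print(predict_nb(model, "claim your cash prize"))
print(predict_nb(model, "agenda for team meeting"))
```

The first message is scored as "spam" and the second as "ham", because each contains words that only appear under that label in the toy training set.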
CLASSIFICATION
Visual Intro to Decision Trees
→ http://www.r2d3.us/visual-intro-to-machine-learning-part-1
REGRESSION
Predicting a continuous-valued output
Example: Predicting house prices based on number of bedrooms and square footage
Algorithms: Linear Regression
CLUSTERING
Automatic grouping of similar objects into sets (clusters)
Example: market segmentation – auto group customers into different market segments
Algorithms: K-means, LDA
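The market segmentation example can be sketched with a bare-bones K-means (Lloyd's algorithm) in plain Python. This is an illustration under stated assumptions, not the Spark MLlib implementation: the customer data is hypothetical, and seeding the centroids with the first k points is an arbitrary simplification.

```python
def kmeans(points, k, iterations=10):
    """Plain Lloyd's algorithm; the first k points seed the centroids."""
    centroids = [points[i] for i in range(k)]
    assignments = []
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        assignments = [
            min(range(k), key=lambda j: sum((a - b) ** 2 for a, b in zip(p, centroids[j])))
            for p in points
        ]
        # Update step: move each centroid to the mean of its points.
        for j in range(k):
            members = [p for p, a in zip(points, assignments) if a == j]
            if members:
                centroids[j] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return centroids, assignments

# Hypothetical customers: (monthly spend, visits per month)
customers = [(20, 1), (220, 9), (25, 2), (200, 8), (30, 1), (210, 10)]
centroids, labels = kmeans(customers, k=2)
print(labels)
print(centroids)
```

The low-spend and high-spend customers end up in separate clusters without any labels being supplied, which is what "automatic grouping" means here.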
COLLABORATIVE FILTERING
Fill in the missing entries of a user-item association matrix
Applications: Product/movie recommendation
Algorithms: Alternating Least Squares (ALS)
DIMENSIONALITY REDUCTION
Reducing the number of redundant features/variables
Applications:
→ Removing noise in images by selecting only “important” features
→ Removing redundant features, e.g. MPH & KPH are linearly dependent
Algorithms: Principal Component Analysis (PCA)
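The MPH/KPH example can be checked numerically. The sketch below does not implement PCA; it only computes the Pearson correlation between the two columns, which is one simple way to see that the second feature carries no extra information. The speed values are hypothetical.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

mph = [10.0, 25.0, 40.0, 55.0, 70.0]       # hypothetical speeds
kph = [v * 1.609344 for v in mph]          # same speeds in km/h

r = pearson(mph, kph)
print(round(r, 6))
```

A correlation of 1 (up to floating-point rounding) means one of the two columns can be dropped, or the pair can be collapsed into a single component, without losing information.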
START
Regression
  • Local Regression (LOESS)
Classification
  • XGBoost (Extreme Gradient Boosting)
  • Classification and regression trees (CART)
Deep Learning
  • Recurrent Neural Network (RNN)
  • Convolutional Neural Network (CNN)
Clustering
  • Yinyang K-Means
Dimensionality Reduction
  • t-Distributed Stochastic Neighbor Embedding (t-SNE)
Collaborative Filtering
  • Weighted Alternating Least Squares (WALS)
Hyperparameters
→ Define higher-level model properties, e.g. complexity or learning rate
→ Cannot be learned during training → need to be predefined
→ Can be decided by
  – setting different values
  – training different models
  – choosing the values that test better
→ Hyperparameter examples
  – Number of leaves or depth of a tree
  – Number of latent factors in a matrix factorization
  – Learning rate (in many models)
  – Number of hidden layers in a deep neural network
  – Number of clusters in a k-means clustering
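The "set different values, train different models, choose the values that test better" loop can be sketched as follows. This is a minimal plain-Python illustration, assuming a one-feature linear model trained by gradient descent, with the learning rate as the hyperparameter; the data, the candidate values, and the function names are hypothetical.

```python
def train_line(xs, ys, learning_rate, steps=50):
    """Fit y = m*x + c by gradient descent on mean squared error."""
    m, c = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        grad_m = sum(2 * (m * x + c - y) * x for x, y in zip(xs, ys)) / n
        grad_c = sum(2 * (m * x + c - y) for x, y in zip(xs, ys)) / n
        m -= learning_rate * grad_m
        c -= learning_rate * grad_c
    return m, c

def mse(model, xs, ys):
    m, c = model
    return sum((m * x + c - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

# Hypothetical data drawn from y = 2x + 5, split into training and validation sets.
train_x, train_y = [0, 1, 2, 3], [5, 7, 9, 11]
valid_x, valid_y = [4, 5], [13, 15]

results = {}
for lr in [0.001, 0.01, 0.1, 0.5]:           # candidate hyperparameter values
    model = train_line(train_x, train_y, learning_rate=lr)
    results[lr] = mse(model, valid_x, valid_y)  # score on held-out data

best_lr = min(results, key=results.get)
print(best_lr, results[best_lr])
```

The learning rate is never updated by the training loop itself; it is fixed before training, and the only way to pick it is to compare the trained models on data they did not see. In this toy setup the smallest rates barely move within 50 steps and the largest one diverges.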
Predictive Analytics Pre-requisites
Predictive Analytics Process and Tools
Asking Relevant Questions
→ Specific (can you think of a clear answer?)
→ Measurable (quantifiable? data driven?)
→ Actionable (if you had an answer, could you do something with it?)
→ Realistic (can you get an answer with data you have?)
→ Timely (answer in reasonable timeframe?)
With that in mind…
→ No simple formula for “good questions”, only general guidelines
→ The right data is better than lots of data
→ Understanding relationships matters
Data Preparation
1. Data analysis (audit for anomalies/errors)
2. Creating an intuitive workflow (formulate seq. of prep operations)
3. Validation (correctness evaluated against sample representative dataset)
4. Transformation (actual prep process takes place)
5. Backflow of cleaned data (replace original dirty data)
Approx. 80% of a Data Analyst’s job is Data Preparation!
Example of multiple values used for U.S. States ⇒ California, CA, Cal., Cal
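The U.S. state example maps naturally onto a small transformation step. Below is a minimal sketch in plain Python; the lookup table `STATE_ALIASES` and the function `normalize_state` are hypothetical, and a real table would need to cover every state and every variant found during the audit step.

```python
# Hypothetical lookup table covering only the variants listed on the slide.
STATE_ALIASES = {"california": "CA", "ca": "CA", "cal.": "CA", "cal": "CA"}

def normalize_state(raw):
    """Map a free-form state value to a canonical code, or None if unknown."""
    key = raw.strip().lower()
    return STATE_ALIASES.get(key)

raw_values = ["California", "CA", "Cal.", "Cal", " california "]
cleaned = [normalize_state(v) for v in raw_values]
print(cleaned)
```

Returning None for unknown values, rather than guessing, keeps unmapped entries visible so they can be reviewed in the validation step.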
Detailed Research and Operational Workflows
Figure: Training Set → Learning Algorithm → h (hypothesis/model); input → h → output
Steps: Ingest/Enrich Data; Clean/Transform/Filter; Select/Create New Features; Evaluate Accuracy/Score
Building Spark ML pipelines
Figure: Train: Input DataFrame → Pipeline (Feature transform 1 → Feature transform 2 → Combine features → Linear Regression) → PipelineModel
Figure: Predict: Input DataFrame → PipelineModel → Output DataFrame
Export Model
Spark ML Pipeline
→ fit() is for training
→ transform() is for prediction
Figure: Train: Input DataFrame (TRAIN) → Pipeline.fit() → PipelineModel
Figure: Predict: Input DataFrame (TEST) → PipelineModel.transform() → Output DataFrame (PREDICTIONS)
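The fit()/transform() split can be illustrated without a Spark cluster. The sketch below is a toy stand-in written in plain Python; `MajorityClassifier` and `MajorityClassifierModel` are hypothetical classes, not part of Spark ML, and rows are plain dictionaries rather than DataFrame rows. It only mirrors the contract: fit() consumes training data and returns a fitted model, and transform() returns the input rows with a prediction column added.

```python
from collections import Counter

class MajorityClassifier:
    """Toy estimator (not a Spark class): fit() learns from labeled rows."""
    def fit(self, rows):
        most_common_label = Counter(r["label"] for r in rows).most_common(1)[0][0]
        return MajorityClassifierModel(most_common_label)

class MajorityClassifierModel:
    """Toy fitted model: transform() applies what fit() learned."""
    def __init__(self, label):
        self.label = label

    def transform(self, rows):
        # Return new rows with a prediction column added; inputs are not mutated.
        return [{**r, "prediction": self.label} for r in rows]

train = [{"text": "a", "label": 1}, {"text": "b", "label": 0}, {"text": "c", "label": 1}]
test = [{"text": "d"}, {"text": "e"}]

model = MajorityClassifier().fit(train)    # Train
predictions = model.transform(test)        # Predict
print(predictions)
```

Keeping the untrained estimator and the fitted model as separate objects is what lets the same fitted model be applied to many test sets, or exported, without retraining.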
Sample Spark ML Pipeline
from pyspark.ml import Pipeline
from pyspark.ml.classification import RandomForestClassifier

# Feature-preparation stages (definitions elided on the slide)
indexer = …
parser = …
hashingTF = …
vecAssembler = …
rf = RandomForestClassifier(numTrees=100)
pipe = Pipeline(stages=[indexer, parser, hashingTF, vecAssembler, rf])
model = pipe.fit(trainData)          # Train model
results = model.transform(testData)  # Test model
Exporting ML Models - PMML
→ Predictive Model Markup Language (PMML)
→ Supported models
  – K-Means
  – Linear Regression
  – Ridge Regression
  – Lasso
  – SVM
  – Binary
HDCloud
Hortonworks Cloud Solutions
Providers: Microsoft, AWS, Google
Managed: Azure HDInsight
Non-Managed/Marketplace: Hortonworks Data Cloud for AWS
Cloud IaaS: Hortonworks Data Platform (via Ambari and via Cloudbreak)
Zeppelin
Ambari
Spark History Server
Files View
→ Zeppelin ⇒ Interactive notebook
→ Spark
→ YARN ⇒ Resource Management
→ HDFS ⇒ Distributed Storage Layer
Figure: APIs (Scala, Java, Python, R) over Spark SQL, Spark Streaming, MLlib, and GraphX on the Spark Core Engine, running on YARN over HDFS nodes 1 … N
Spark and HDP
Labs/Tutorials
Scatter 2D Data Visualized
scatterData ← DataFrame
+-----+--------+
|label|features|
+-----+--------+
|-12.0| [-4.9]|
| -6.0| [-4.5]|
| -7.2| [-4.1]|
| -5.0| [-3.2]|
| -2.0| [-3.0]|
| -3.1| [-2.1]|
| -4.0| [-1.5]|
| -2.2| [-1.2]|
| -2.0| [-0.7]|
| 1.0| [-0.5]|
| -0.7|  [-0.2]|
...
Linear Regression Model Training (one feature)
Coefficients: 2.81
Intercept: 3.05
Training Result: y = 2.81x + 3.05
Linear Regression (two features)
Coefficients: [0.464, 0.464]
Intercept: 0.0563
ML Lab
• Residuals
  • The residual of an observed value is the difference between the observed value and the estimated value
• R2 (R Squared) – Coefficient of Determination
  • Indicates goodness of fit
  • R2 of 1 means the regression line perfectly fits the data
• RMSE (Root Mean Square Error)
  • Measure of the differences between values predicted by a model and values actually observed
  • Good measure of accuracy, but only to compare forecasting errors of different models on the same variable (RMSE is scale-dependent)
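The three quantities above can be computed directly from observed and predicted values. This is a minimal plain-Python sketch with hypothetical numbers; the function name `regression_metrics` is illustrative and is not a Spark API.

```python
import math

def regression_metrics(observed, predicted):
    """Return (residuals, R squared, RMSE) for paired observed/predicted values."""
    residuals = [o - p for o, p in zip(observed, predicted)]
    ss_res = sum(r ** 2 for r in residuals)            # residual sum of squares
    mean_obs = sum(observed) / len(observed)
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)  # total sum of squares
    r2 = 1 - ss_res / ss_tot
    rmse = math.sqrt(ss_res / len(observed))
    return residuals, r2, rmse

observed = [3.0, 5.0, 7.0, 9.0]       # hypothetical observed values
predicted = [2.5, 5.5, 7.0, 9.0]      # hypothetical model estimates

residuals, r2, rmse = regression_metrics(observed, predicted)
print(residuals, round(r2, 3), round(rmse, 3))
```

RMSE is expressed in the units of the target variable, which is why it is only meaningful for comparing models that predict the same quantity, whereas R2 is unitless.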
Demo: Stock Portfolio Simulation using Monte Carlo method
Monte Carlo Simulation
1. Define a domain of possible inputs
2. Randomly generate inputs from a prob. distribution over the domain
3. Perform computation on the inputs
4. Aggregate the results
Approximating the value of π after placing 30K random points. Error < 0.07% of actual value.
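The four steps map directly onto the π approximation mentioned on the slide. Below is a minimal plain-Python sketch; the seed and the function name `estimate_pi` are arbitrary choices, and the error obtained for a given run will differ from the figure quoted on the slide.

```python
import random

def estimate_pi(num_points, seed=42):
    """Estimate pi from the fraction of random points that fall inside a quarter circle."""
    rng = random.Random(seed)
    inside = 0
    for _ in range(num_points):
        # Steps 1 and 2: the domain is the unit square; sample uniformly from it.
        x, y = rng.random(), rng.random()
        # Step 3: test whether the point lies inside the quarter circle of radius 1.
        if x * x + y * y <= 1.0:
            inside += 1
    # Step 4: aggregate. The quarter circle covers pi/4 of the unit square.
    return 4.0 * inside / num_points

pi_estimate = estimate_pi(30_000)
print(pi_estimate)
```

The same structure carries over to the portfolio demo in spirit: sample random inputs, compute an outcome for each, and aggregate the outcomes.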
Demo: Text Classification with Naïve Bayes
Diabetes Dataset – Decision Trees / Random Forest
Labeled set with 8 Features
-1 1:-0.294118 2:0.487437 3:0.180328 4:-0.292929 5:-1 6:0.00149028 7:-0.53117 8:-0.0333333
+1 1:-0.882353 2:-0.145729 3:0.0819672 4:-0.414141 5:-1 6:-0.207153 7:-0.766866 8:-0.666667
-1 1:-0.0588235 2:0.839196 3:0.0491803 4:-1 5:-1 6:-0.305514 7:-0.492741 8:-0.633333
+1 1:-0.882353 2:-0.105528 3:0.0819672 4:-0.535354 5:-0.777778 6:-0.162444 7:-0.923997 8:-1
-1 1:-1 2:0.376884 3:-0.344262 4:-0.292929 5:-0.602837 6:0.28465 7:0.887276 8:-0.6
+1 1:-0.411765 2:0.165829 3:0.213115 4:-1 5:-1 6:-0.23696 7:-0.894962 8:-0.7
-1 1:-0.647059 2:-0.21608 3:-0.180328 4:-0.353535 5:-0.791962 6:-0.0760059 7:-0.854825 8:-0.833333
...
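Each row above follows the "label index:value" layout used by the LIBSVM text format. As a minimal sketch of how such a row breaks down, the plain-Python parser below (the name `parse_libsvm_line` is illustrative) turns one line into a label and a dictionary of features. It assumes well-formed input and does no error handling.

```python
def parse_libsvm_line(line):
    """Parse 'label index:value ...' into (label, {index: value})."""
    parts = line.split()
    label = float(parts[0])           # "-1" or "+1" in this dataset
    features = {}
    for item in parts[1:]:
        index, value = item.split(":")
        features[int(index)] = float(value)
    return label, features

# First row of the dataset shown on the slide
line = "-1 1:-0.294118 2:0.487437 3:0.180328 4:-0.292929 5:-1 6:0.00149028 7:-0.53117 8:-0.0333333"
label, features = parse_libsvm_line(line)
print(label, len(features), features[2])
```

The feature indices are 1-based and the values already lie in the range -1 to 1, which suggests the features were scaled before the file was written.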
TensorFlowOnSpark
Robert Hryniewicz
E: [email protected]
T: @robertH8z
Feature Selection
→ Also known as variable or attribute selection
→ Why important?
  – simplification of models ⇒ easier to interpret by researchers/users
  – shorter training times
  – enhanced generalization by reducing overfitting
→ Dimensionality reduction vs feature selection
  – Dimensionality red: create new combinations of attributes
  – Feature selection: include/exclude attributes in data without changing them
Q: Which features should you use to create a predictive model?
Feature Selection
→ Methods
  – Filter
  – Wrapper
  – Embedded
Goal: Identify and remove unneeded, irrelevant and redundant features from data that do not contribute to, or may decrease, the accuracy of a predictive model.
Feature Selection Traps
→ Feature selection is another key part of the applied machine learning process, like model selection. You cannot fire and forget.
→ It is important to consider feature selection a part of the model selection process. If you do not, you may inadvertently introduce bias into your models which can result in overfitting.
→ For example, you must include feature selection within the inner-loop when you are using accuracy estimation methods such as cross-validation. This means that feature selection is performed on the prepared fold right before the model is trained. A mistake would be to perform feature selection first to prepare your data, then perform model selection and training on the selected features.
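The inner-loop point can be made concrete with a small cross-validation sketch in plain Python. Everything here is hypothetical: the data, the filter-style selector (pick the single column most correlated with the target), and the one-feature least-squares model. The part that matters is where `select_feature` is called: inside the fold loop, on the training rows only.

```python
import math

def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy) if sx and sy else 0.0

def select_feature(rows, targets):
    """Filter-style selection: pick the column most correlated with the target."""
    n_features = len(rows[0])
    scores = [abs(pearson([r[j] for r in rows], targets)) for j in range(n_features)]
    return max(range(n_features), key=scores.__getitem__)

def fit_line(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    m = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    return m, my - m * mx

def cross_validate(rows, targets, k=3):
    fold_errors, chosen = [], []
    for fold in range(k):
        test_idx = set(range(fold, len(rows), k))
        train_rows = [r for i, r in enumerate(rows) if i not in test_idx]
        train_y = [t for i, t in enumerate(targets) if i not in test_idx]
        # Selection happens here, on the training part of the fold only,
        # so the held-out rows never influence which feature is chosen.
        j = select_feature(train_rows, train_y)
        m, c = fit_line([r[j] for r in train_rows], train_y)
        errors = [(m * rows[i][j] + c - targets[i]) ** 2 for i in sorted(test_idx)]
        fold_errors.append(sum(errors) / len(errors))
        chosen.append(j)
    return fold_errors, chosen

# Hypothetical data: feature 0 drives the target (y = 3*x0 + 1), feature 1 is noise.
rows = [(1, 5), (2, 3), (3, 8), (4, 1), (5, 9), (6, 2)]
targets = [3 * x0 + 1 for x0, _ in rows]
fold_errors, chosen = cross_validate(rows, targets)
print(chosen, fold_errors)
```

Calling `select_feature` once on the full dataset before the loop would be the mistake the slide describes: the held-out rows would already have shaped the feature choice, so the fold errors would look better than they should.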
Feature Selection Checklist
1. Do you have domain knowledge? If yes, construct a better set of “ad hoc” features.
2. Are your features commensurate? If no, consider normalizing them.
3. Do you suspect interdependence of features? If yes, expand your feature set by constructing conjunctive features or products of features, as much as your computer resources allow you.
4. Do you need to prune the input variables (e.g. for cost, speed or data understanding reasons)? If no, construct disjunctive features or weighted sums of features.
5. Do you need to assess features individually (e.g. to understand their influence on the system or because their number is so large that you need to do a first filtering)? If yes, use a variable ranking method; else, do it anyway to get baseline results.
6. Do you need a predictor? If no, stop.
7. Do you suspect your data is “dirty” (has a few meaningless input patterns and/or noisy outputs or wrong class labels)? If yes, detect the outlier examples using the top ranking variables obtained in step 5 as representation; check and/or discard them.
8. Do you know what to try first? If no, use a linear predictor. Use a forward selection method with the “probe” method as a stopping criterion, or use the 0-norm embedded method. For comparison, following the ranking of step 5, construct a sequence of predictors of the same nature using increasing subsets of features. Can you match or improve performance with a smaller subset? If yes, try a non-linear predictor with that subset.
9. Do you have new ideas, time, computational resources, and enough examples? If yes, compare several feature selection methods, including your new idea, correlation coefficients, backward selection and embedded methods. Use linear and non-linear predictors. Select the best approach with model selection.
10. Do you want a stable solution (to improve performance and/or understanding)? If yes, subsample your data and redo your analysis for several “bootstraps”.
AI Investment Landscape
Only $100k investment needed to start with AI
Report from IDC Analyst firm
Spending on AI
• $12.5B in 2017
  • $4.5B on apps for threat detection, fraud analysis, public safety, and pharmaceutical research
• $46B+ by 2020
Closing thoughts on AI
The Future of Cognitive Computing / MI
– Machine
  • Deep Learning
  • Discovery
  • Large-scale math
  • Fact checking
– Human
  • Compassion
  • Intuition
  • Design
  • Value judgements
  • Common Sense
What’s new in HDP 2.6 – Spark & Zeppelin
Spark
→ Spark 1.6.3 GA
→ Spark 2.1 GA
→ REST API (Livy) GA
→ Spark Thrift Server doAs GA
→ Spark SQL – Row/Column Security (GA)
→ Spark Streaming + Kafka over SSL
→ Multi Cluster HBase support for SHC
→ Package support in PySpark & SparkR
Zeppelin 0.7.x
→ Spark 2.x support
→ Improved Livy integration
→ No password in clear
→ JDBC interpreter improvements
→ SmartSense integration
→ Knox proxy Zeppelin UI
Thanks!
Robert Hryniewicz
@[email protected]