Comparative Analysis of Cloud-based Machine Learning Platforms
Amazon ML, Azure ML, Databricks Cloud
ThirdEye Consulting Services & Solutions LLC. thirdeyecss.com | [email protected] | @thirdeyecss | 408-462-5257
DATA ANSWERS
Disclaimer
• ThirdEye is a direct vendor to Microsoft, Amazon & Google.
• ThirdEye has implemented numerous Big Data projects for them over the last 3 years.
• ThirdEye is NOT a reseller of the cloud services of these companies.
• ThirdEye does NOT financially benefit from making any of the following recommendations.
• This work is purely meant for a technical evaluation of the ML platforms and should not be construed for any other purposes.
Comparison Approach: What do we look for in an online Machine Learning Platform?
Data Preparation
• Data Ingestion (out-of-the-box support of data sources) & Data Export
• Data Cleaning, Transformation, Visualization
Data Selection
• Feature selection/engineering
Algorithms
• Which algorithms are supported out of the box? Can you modify them or create new ones?
• Saving/comparing results
Optimize
• E.g. identify the optimal parameter settings for algorithms
Knowledge
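As an illustration of the Optimize step above, a minimal sketch of an exhaustive grid search over parameter settings follows. The scoring function `toy_score` and the parameter names `learning_rate` and `regularization` are hypothetical stand-ins for "train a model and return a validation score"; none of the platforms compared here is claimed to work this way internally.

```python
from itertools import product


def grid_search(train_and_score, param_grid):
    """Try every combination in param_grid; return (best_params, best_score)."""
    names = sorted(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = train_and_score(**params)  # higher is better
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


# Hypothetical scoring function: peaks at learning_rate=0.1, regularization=1.0.
def toy_score(learning_rate, regularization):
    return -((learning_rate - 0.1) ** 2) - (regularization - 1.0) ** 2


best_params, best_score = grid_search(
    toy_score,
    {"learning_rate": [0.01, 0.1, 1.0], "regularization": [0.1, 1.0, 10.0]},
)
print(best_params, best_score)
```

The cost grows with the product of the list lengths, which is one reason a platform's built-in tuning support matters when comparing offerings.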
Meet the Contestants
• Amazon ML
• Azure ML
• Databricks Cloud
Amazon ML
• A relatively limited entry in terms of capabilities/algorithms offered.
• Appears targeted at existing AWS customers who want to do some basic ML investigations without requiring significant expertise
Amazon ML
• Supported capabilities are described in Use Cases terminology, as opposed to names of algorithms:
– Fraud detection
– Content Personalization
– Document Classification
– Customer Churn Prediction
– Relevancy modeling for marketing
– Recommendations
Amazon ML
• Available algorithms:
– Binary and Multiclass Classification
– Regression
• Limited or no customizability: the algorithms are already implemented and chosen for you, e.g. Binary Classification is implemented via Logistic Regression
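To make the logistic regression remark concrete, here is a minimal sketch of binary classification by logistic regression, trained with plain batch gradient descent. It is an illustration only and not Amazon ML's implementation; the toy data, `learning_rate`, and `epochs` values are assumptions chosen for the example.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(weights, bias, x):
    """Estimated probability that x belongs to class 1."""
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)


def train_logistic_regression(rows, labels, learning_rate=0.5, epochs=1000):
    """Fit weights and bias by batch gradient descent on the log loss."""
    n = len(rows)
    weights = [0.0] * len(rows[0])
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * len(weights)
        grad_b = 0.0
        for x, y in zip(rows, labels):
            err = predict_proba(weights, bias, x) - y
            for j, xi in enumerate(x):
                grad_w[j] += err * xi
            grad_b += err
        weights = [w - learning_rate * g / n for w, g in zip(weights, grad_w)]
        bias -= learning_rate * grad_b / n
    return weights, bias


# Hypothetical, already-centered toy data: class 1 when the feature is positive.
rows = [[-2.5], [-1.5], [-0.5], [0.5], [1.5], [2.5]]
labels = [0, 0, 0, 1, 1, 1]

weights, bias = train_logistic_regression(rows, labels)
predictions = [int(predict_proba(weights, bias, x) >= 0.5) for x in rows]
print(predictions)
```

A hosted service hides choices such as the learning rate and the 0.5 decision threshold, which is the customizability trade-off the slide describes.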
Amazon ML
Available Performance Metrics
• Binary AUC: The binary MLModel uses the Area Under the Curve (AUC) technique to measure performance.
• Regression RMSE: The regression MLModel uses the Root Mean Square Error (RMSE) technique to measure performance. RMSE measures the difference between predicted and actual values for a single variable.
• Multiclass Avg F-Score: The multiclass MLModel uses the F1 score technique to measure performance.
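The three metrics above can be sketched in a few lines of plain Python. This is a from-scratch illustration, not the service's code: AUC is computed through its pairwise-ranking interpretation, and the multiclass average is assumed to be an unweighted (macro) mean of per-class F1 scores. All sample values are made up.

```python
import math


def rmse(actual, predicted):
    """Root mean square error between actual and predicted values."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))


def auc(labels, scores):
    """Chance that a random positive outscores a random negative (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def macro_f1(actual, predicted):
    """Unweighted mean of per-class F1 scores (assumed averaging scheme)."""
    pairs = list(zip(actual, predicted))
    f1_scores = []
    for c in sorted(set(actual) | set(predicted)):
        tp = sum(1 for a, p in pairs if a == c and p == c)
        fp = sum(1 for a, p in pairs if a != c and p == c)
        fn = sum(1 for a, p in pairs if a == c and p != c)
        denom = 2 * tp + fp + fn
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / len(f1_scores)


print(rmse([1.0, 2.0, 3.0, 4.0], [1.0, 2.0, 3.0, 6.0]))
print(auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))
print(macro_f1(["a", "a", "b", "b", "c", "c"], ["a", "b", "b", "b", "c", "a"]))
```

Lower is better for RMSE, while AUC and F1 both range from 0 to 1 with higher being better.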
Amazon ML
• Data Ingestion/integration
– This is their strongest use case: easy integration with AWS storage media
• S3, EBS
• Redshift
• RDS
Azure ML
• Introduced February this year
• But do not let its relative youthfulness be a distraction: this is a feature-rich offering
• Has a different approach: a more serious/rich set of algorithms and configurations is made available.
• Default/canned algorithms are still available for those newer to Machine Learning
Azure ML
• More comprehensive selection of representative algorithms
• Provides more selections for the algorithms as well as tuning knobs
Azure ML
• First-class usability:
– Tutorials
– Walkthroughs
– Videos
– Integrated Development Environment
• Azure ML Studio
– Documentation
Azure ML
• Process Tools
– Select the data processing, modeling, or prediction activity manually
Azure ML
• Or follow the suggested workflow:
Azure ML
• The wizards are field-name and data-type aware
Azure ML
• Data Preparation stages
Azure ML
• Data Preparation stages
Azure ML
• Workflow Visualization
Azure ML
• View Prediction Results
Azure ML
• Workflow entries allow viewing/setting detailed configuration/parameters
Azure ML
• Workflow entries allow ad-hoc operations
Azure ML
• Point-and-click access to useful/popular public datasets
Azure ML
• Support for the popular "Notebooks" structures
Azure ML: More choices
• Regression:
– Linear, Bayesian, Neural Network, Decision Forest, Boosted Decision Tree, Poisson
• Binary Classification:
– SVM, Perceptron, LR, Bayes, NNs, Decision Forest
• Multiclass:
– LR, NN, Decision Forest/Jungle, One vs All
• Anomaly Detection:
– SVM, PCA
• Clustering: K-means
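As one example from the algorithm list above, K-means clustering can be sketched with Lloyd's algorithm: alternate between assigning points to the nearest centroid and moving each centroid to the mean of its members. This is an illustrative sketch, not Azure ML's implementation; the starting centroids are passed in explicitly (real implementations usually choose them by a seeding strategy), and the points are made-up data.

```python
def kmeans(points, centroids, iterations=10):
    """Lloyd's algorithm; returns (final centroids, cluster index per point)."""
    centroids = [tuple(c) for c in centroids]
    assignments = []
    for _ in range(iterations):
        # Assignment step: index of the nearest centroid (squared Euclidean distance).
        assignments = [
            min(
                range(len(centroids)),
                key=lambda k: sum((pi - ci) ** 2 for pi, ci in zip(p, centroids[k])),
            )
            for p in points
        ]
        # Update step: move each centroid to the mean of its assigned points.
        new_centroids = []
        for k, old in enumerate(centroids):
            members = [p for p, a in zip(points, assignments) if a == k]
            if members:
                new_centroids.append(tuple(sum(dim) / len(members) for dim in zip(*members)))
            else:
                new_centroids.append(old)  # keep an empty cluster's centroid in place
        if new_centroids == centroids:
            break  # converged: no centroid moved
        centroids = new_centroids
    return centroids, assignments


# Hypothetical toy data: two well-separated groups of three points each.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
centroids, assignments = kmeans(points, [(0.0, 0.0), (10.0, 10.0)])
print(centroids, assignments)
```

The number of clusters and the initial centroids are exactly the kind of "tuning knobs" that a richer platform exposes.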
Azure ML: Available Algorithms
Databricks Cloud
• Spark has joined Hadoop as a de facto industry standard for distributed computing
• Rapidly approaching the popularity of Hadoop
– And supplanting it if/when organizations can make the switch
• Databricks is the spin-off of Berkeley AMPLab, the original creators of Spark
• Databricks staff include a large fraction of the Spark core committers
• And an even larger proportion of the key decision makers/"shepherds"
– Including the spark.ml/MLlib shepherds
• Cloud-based availability of Spark, including Spark SQL and spark.ml
• Access to the capabilities of Spark MLlib, Spark DataFrames/SQL, Streaming, and Resilient Distributed Datasets
• Notebooks approach: Scala, Python, Java, and R
Spark Ecosystem
Databricks Cloud
• The online offering was announced in July 2014 at Spark Summit
• Purpose statement: ease of workflow management for Data Scientists:
Databricks Cloud
• The Databricks Cloud approach: Notebooks
• R, Python, Java, Scala
Databricks Cloud: Notebooks
• Notebooks are Data Scientists' friends
• A standard/typically preferred approach to doing their work
– Experiment with data
– Perform ad-hoc visualizations
– Communicate/share results with colleagues
– Or even publish them
• Wide spectrum of sophistication levels available:
– Simply use existing libraries
– Develop new algorithms from scratch
Databricks Cloud: Notebooks
Wrap-up/Summary
Three general types of approaches (not mutually exclusive)
• Point and Click (as well as backend APIs): Amazon ML, Azure ML
• APIs only: Google Prediction API
• Notebooks: Azure ML, Databricks Cloud
Wrap-up/Summary
• Amazon ML may be sufficient for:
– Customers that already have data residing with the provider
– Cases where simpler/fewer options are acceptable
• Azure ML has a strong usability and workflow approach and provides a reasonable cross-section of algorithms for casual & intermediate users
• Databricks Cloud has the most comprehensive offering
– Variety, performance, configurability of algorithms
– Richness of the capabilities of the Notebooks
– Options/configurability of the hosting clusters/environment
THANK YOU!
Ask Your Questions Here: http://info.thirdeyecss.com/ask_your_question