Advanced Analytics in Hadoop

DESCRIPTION

On June 11, Thomas Dinsmore gave a clear outline of the tools and technologies available for analytics in Hadoop. It is a must-watch for anyone who wants to know what advanced analytics in Hadoop can deliver. Please find the video and slides below.

Synopsis

What is the state of play for advanced analytics in Hadoop? A year ago, options included "roll your own" and little else; today there are a number of serious open source and commercial options available, with new capabilities announced daily. In this presentation, we begin with a brief overview of use cases for advanced analytics and a discussion of what types of analytics must run in Hadoop. We continue with an overview of available architectures. The presentation concludes with a hype-free survey of available open source and commercial software for advanced analytics in Hadoop.

Bio

Thomas W. Dinsmore is Director of Product Management for Revolution Analytics, a company that provides commercial support and services for open source R. In this role, Mr. Dinsmore closely tracks the market for commercial and open source software on all platforms, including Hadoop. Prior to joining Revolution Analytics, Mr. Dinsmore served as an Analytics Solution Architect for IBM Big Data, and as a Principal Consultant for Razorfish and SAS. Mr. Dinsmore has hands-on experience with leading commercial and open source tools for advanced analytics, including SAS, SPSS, R and Oracle Data Mining, across a range of platforms, including Hadoop, Netezza, Teradata and Oracle. He is certified in SAS 9. In his career, Mr. Dinsmore has worked with more than 500 enterprises in the United States, Canada, Mexico, Venezuela, Chile, Brazil, the United Kingdom, Belgium, Italy, Turkey, Israel, Malaysia and Singapore.

Citation preview

  • 1. Advanced Analytics in Hadoop. Thomas W. Dinsmore
  • 2. Advanced Analytics in Hadoop. Use cases; Architectures; Current Options: Open Source, Commercial
  • 3. Analytics. Ad Hoc Queries; Reports; Data Access; Visualization; Data Manipulation; OLAP/ROLAP etc.; Advanced Discovery; Predictive Analytics; Optimization; Simulation; Text Analytics; Geospatial Analytics; Econometrics; Dashboards; Scorecards; Streaming Analytics; Computational Complexity
  • 4. Advanced Analytics. Advanced Discovery; Predictive Analytics; Optimization; Simulation; Text Analytics; Geospatial Analytics; Econometrics; Streaming Analytics; Computational Complexity
  • 5. Advanced Analytics. Advanced Discovery; Predictive Analytics; Optimization; Simulation; Text Analytics; Geospatial Analytics; Econometrics; Streaming Analytics; Feature Extraction; Dimension Reduction
  • 7. Analytics Platform
  • 8. For some use cases, you must use all of the data. Anomaly Detection; Affinity Analysis; Clustering; Social Network Analysis; Collaborative Filtering
  • 9. For others, using all of the data is worth it. Catastrophic Risk Modeling; Modeling with Fine-grained Behavioral Data
  • 10. Your Options (2013). 1. Apache Mahout. 2. Code it yourself. 3.
  • 11. Architecture
  • 12. Legacy Alongside. Diagram labels: HDFS, Data
  • 13. Legacy Pass-Through. Diagram labels: HDFS, MapReduce, Data
  • 14. MapReduce Push-Down. Diagram labels: HDFS, MapReduce. Advantages: Co-exists with other applications; Integrated workload management; Simplified administration. Disadvantages: MapReduce latency
  • 15. Co-Located In-Memory (Asymmetric). Diagram labels: YARN, HDFS, MapReduce. Advantages: Easy to adapt legacy apps; Isolates analytic workload. Disadvantages: Data moves within the cluster; Requires YARN
  • 16. Co-Located In-Memory (Symmetric). Diagram labels: YARN, HDFS, MapReduce. Advantages: Lowest latency. Disadvantages: Upgrade every node; Requires YARN
  • 17. Summary: Architecture. MapReduce Push-Down is current champion: Stable, Co-exists well with Hadoop ecosystem, MR 1.0 penalizes performance; Required: persistent in-memory processing; YARN enables co-location
  • 18. Open Source Projects
  • 19. Apache Mahout. Apache incubator project (2007); Machine learning library; Included in most distributions; Thin acceptance, few contributors; Diverse architecture: Single-node MapReduce, New algos run on Spark; Recently cleaned up
  • 20. Apache Giraph. Apache top-level project; Runs in MapReduce; Dedicated graph engine; Used by Facebook, few others; Dead in the water: No presence in leading distros, No significant commercial support, No releases in 13 months, No recent code commits on Git
  • 21. GraphLab. Carnegie Mellon project (2009); Distributed in-memory engine: Primarily graph analysis, Selected machine learning algos; Interface from Java, JavaScript, Python; GraphLab Inc provides commercial support (2013, $6.75MM); Independent distribution, or through Pivotal
  • 22. 0xdata H2O. Vendor-driven open source project; 0xdata sells support, customization; Distributed in-memory prediction engine; Multiple deployment options: Standalone (with HDFS), Over YARN, In MapReduce; Claims 2,000+ users; 4 public references; Used by a leading P&C insurer; Java, R, Python and Scala interfaces
  • 23. Apache Spark. Top-level Apache project (2/14); Release 1.0 (5/14); Distributed in-memory analytics: Machine learning, Graph analytics, Streaming analytics, Fast SQL; Compatible with Hadoop storage; Integrated with YARN; Scala, Python, Java interfaces (+SparkR); Growing ecosystem; Supported in leading Hadoop distributions
  • 24. Apache Spark: Hadoop Distributions.
      Spark Components: MLlib | GraphX | Spark Streaming | Spark SQL | Shark
      Cloudera: Yes, Yes, Yes, Yes, (Impala)
      Hortonworks: Yes, (Storm), (Stinger)
      MapR: Yes, Yes, Yes, Yes, Yes
      Pivotal: Yes, Yes, Yes, Yes, Yes
      IBM BigInsights: (no entries)
  • 25. Summary: Open Source Projects.
      Projects: 0xdata H2O 2.2 | Apache Giraph 1.1 | Apache Mahout 0.9 | Apache Spark 1.0 | GraphLab 2.2
      Status: Independent | Top-Level | Top-Level | Top-Level | Independent
      Architecture: Co-Located Memory-Centric | MapReduce | MapReduce | Co-Located Memory-Centric | Co-Located Memory-Centric
      Interfaces: Java, Python, R, Scala | Java | Java | Java, Python, Scala (SparkR) | Python
      Commercial Support: 0xdata; Databricks; GraphLab, Inc.
      Distribution: Independent | Independent | All Hadoop Distributions | Cloudera, Hortonworks, MapR, Pivotal | Independent
  • 26. Analytic Features.
      Projects: 0xdata H2O 2.2 | Apache Giraph 1.1 | Apache Mahout 0.9 | Apache Spark 1.0 | GraphLab 2.2
      Prediction: +++, +, +++
      Dimension Reduction: +, +++, +, +
      Clustering: +, +++, +, +++
      Collaborative Filtering: +++, +, +++
      Text Analytics: +++, +++
      Matrix Operations: +, +++, +
      Graph Analysis: +, +, +++
  • 27. Analytic Features: Prediction.
      Projects: Mahout 0.9 | Spark 1.0 | H2O 2.2
      Linear Regression: +
      Logistic Regression: +
      Generalized Linear Models: +
      Naive Bayes: +, +, +
      Decision Tree: +
      Gradient Boosted Trees: +
      Random Forests: +, +
      Linear Support Vector Machine: +
      Deep Learning (Backprop MLP): +
  • 28. Analytic Features: Dimension Reduction.
      Projects: Mahout 0.9 | Spark 1.0 | H2O 2.2 | GraphLab 2.2
      Singular Value Decomposition: +, +
      Lanczos Algorithm: +, +
      Stochastic SVD: +
      Principal Components Analysis: +, +, +
  • 29. Analytic Features: Clustering.
      Projects: Mahout 0.9 | Spark 1.0 | H2O 2.2 | GraphLab 2.2
      k-Means: +, +, +, +
      Fuzzy k-Means: +
      Streaming k-Means: +
      Spectral Clustering: +, +
  • 30. Analytic Features: Collaborative Filtering.
      Projects: Mahout 0.9 | Spark 1.0 | H2O 2.2 | GraphLab 2.2
      Item-Based: +
      Matrix Factorization with ALS: +, +, +
      Matrix Factorization with ALS, Implicit Feedback: +
      ALS with Parallel Coordinate Descent: +
      Weighted ALS: +
      Sparse ALS: +
  • 31. Analytic Features: Text Analytics.
      Projects: Mahout 0.9 | Spark 1.0 | H2O 2.2 | GraphLab 2.2
      Latent Dirichlet Allocation: +, +
      Frequent Pattern Mining: +
      Collocations: +
  • 32. Matrix Operations.
      Projects: Mahout 0.9 | Spark 1.0 | H2O 2.2 | GraphLab 2.2
      Stochastic Gradient Descent: +, +
      Limited-Memory BFGS: +
      RowSimilarityJob: +
      ConcatMatrices: +
  • 33. Summary: Open Source. Giraph is toast; Mahout may be recovering from roadkill status; GraphLab outperforms Spark GraphX today in graph analytics; 0xdata H2O outperforms Spark MLlib today in machine learning; Spark catching up fast: More resources and distribution, Integrated platform for ML and graph analysis
  • 34. Commercial Software
  • 35. Alpine. Business user interface; Collaboration environment; Broad library of techniques; Strong cloud offering; Leverages Hadoop (multiple distros), Hawq or Pivotal Greenplum; Push-down MapReduce; Certified on Spark; Small but growing customer base
  • 36. IBM SPSS Analytics Server. Introduced 2013; Serves as back end for SPSS Modeler; Uses push-down MR; Limited analytic feature set; IBM supports on multiple Hadoop distros; Customer acceptance unknown
  • 37. Revolution Analytics ScaleR. ScaleR library of distributed statistics, machine learning functions; Tools to distribute arbitrary R functions; Runs in Cloudera, Hortonworks, Teradata, LSF clusters, MS HPC; Hadoop edition uses MR push-down; Tools simplify installation in large clusters; R interface; Partnerships with Alteryx, Qlik, MicroStrategy, Tableau provide business interfaces
  • 38. Skytree Server. Georgia Tech's FastLab project, repurposed as commercial software; Distributed machine learning platform; Very opaque about technical details; User interface is an API; Co-located in Hadoop under YARN; Just certified by Hortonworks; Customer acceptance unknown: No new public references in a year, Used by leading credit card company
  • 39. SAS High-Performance Analytics. Distributed in-memory analytics; Designed to run in special-purpose appliances (2011); Repurposed to run in Hadoop (2013); Co-exists poorly: cannot run SAS and MapReduce at the same time; Reads entire dataset into memory; Uses MPI to communicate among nodes; Requires upgrades from standard Hadoop infrastructure; Customer acceptance unknown: No public references, Generic success stories missing from Strata presos
  • 40. SAS LASR Server. SAS's other distributed in-memory platform; Back end for several end-user products: SAS Visual Analytics (2012), SAS Visual Statistics (New), SAS In-Memory Statistics for Hadoop (New); Recently added statistics and machine learning; Does not read raw HDFS: data must be transformed to proprietary SASHDAT; Like HPA, reads entire dataset into memory: a 16-core, 256GB node can load a 75GB table; Runs DS2 programs, not legacy SAS programs; Fast, but with limited feature set; SAS claims 1,400 sites for Visual Analytics: Many of those are standalone boxes
  • 41. Summary: Commercial. Alpine's interface is compelling to business user; IBM Analytics Server is a good first release; RRE ScaleR appeals to R users, plays well in Hadoop sandbox; Skytree Server: strong in prediction; SAS: why two competing memory-centric architectures?
  • 42. Progress. Spark: blindingly fast maturity; Rapidly expanding library of analytic features; Growing developer community, ecosystem; Commercial: from zero to many
  • 43. Interesting Questions. Will Mahout get a second wind? Will Spark MLlib displace 0xdata? Will Spark GraphX catch up to GraphLab? Can Spark Streaming compete with Storm and commercial entrants? How quickly will customers adopt memory-centric architecture for analytics? What will Alpine and MicroStrategy do with Spark? Will IBM distribute Spark in BigInsights? When will SAS announce a reference customer for HPA/LASR in Hadoop?
  • 44. Advanced Analytics in Hadoop. Thomas W. Dinsmore
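
Illustration (not part of the deck): the MapReduce push-down architecture on slide 14 runs the analytic computation as map and reduce steps next to the data, so only small partial results travel between nodes. The sketch below shows that pattern in plain Python, assuming in-memory lists stand in for HDFS blocks. The names `map_partition`, `combine` and `mean_and_variance` are hypothetical and belong to no Hadoop API.

```python
from functools import reduce
from typing import Iterable, List, Tuple

# Partial result emitted per partition: (row count, sum, sum of squares).
Stats = Tuple[int, float, float]


def map_partition(rows: Iterable[float]) -> Stats:
    """Map step: summarize one partition where it is stored."""
    n, s, ss = 0, 0.0, 0.0
    for x in rows:
        n += 1
        s += x
        ss += x * x
    return n, s, ss


def combine(a: Stats, b: Stats) -> Stats:
    """Reduce step: merge two partial summaries element-wise."""
    return a[0] + b[0], a[1] + b[1], a[2] + b[2]


def mean_and_variance(partitions: List[List[float]]) -> Tuple[float, float]:
    """Compute the mean and population variance from per-partition summaries."""
    n, s, ss = reduce(combine, map(map_partition, partitions), (0, 0.0, 0.0))
    if n == 0:
        raise ValueError("no rows to summarize")
    mean = s / n
    variance = ss / n - mean * mean
    return mean, variance


# Assumption: three lists play the role of three HDFS blocks.
partitions = [[1.0, 2.0], [3.0], [4.0, 5.0, 6.0]]
mean, variance = mean_and_variance(partitions)
```

Each partition is reduced to three numbers before anything is merged, which is why this style coexists with other cluster workloads. The price in a real MapReduce job is the per-job start-up latency the slide lists as the disadvantage.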