DESCRIPTION
Data Science is no longer Rocket Science with H2O. H2O is the open source Math and Prediction Engine for Big Data. H2O makes Hadoop do math, and scales statistics, machine learning and math over Big Data. With H2O, everyone can get past tooling and scale issues to discover insights in the data. H2O is extensible: users can build blocks using simple math legos in the core. H2O keeps familiar interfaces like R, Excel & JSON so that big data enthusiasts & experts can explore, munge, model and score datasets using a range of simple to advanced algorithms. Data collection is easy. Decision making is hard. H2O makes it fast and easy to derive insights from your data through faster and better predictive modeling. H2O has a vision of online scoring and modeling in a single platform.
H2O the Prediction Engine
Better predictions
https://github.com/0xdata/h2o
H2O makes Hadoop do math
Hadoop = opportunity
Not enough Data Scientists
Analysts won’t code Java
H2O the Prediction Engine
Adhoc Exploration
Math Modeling
Real-time Scoring
Big Data Velocity Volume
Big Data: Messy
Math Modeling: Clustering, Classification, Regression, Ensembles
Real-time Scoring: 100’s nanos models
No New API
Approximate results each step
More Data beats Better Algorithms
More Data and Better Algorithms Scale & Parallelism
Apps: fraud detection, reco engine
H2O the Prediction Engine
Intellectual Legacy
Math needs to be free
Open Source
Support and Innovation
https://github.com/0xdata/h2o
Credits & Team
SriSatish Ambati, CEO & Co-founder: Director of Engineering, DataStax, Cassandra & Hadoop Customers & Platform Marketing, Azul
Cliff Click, CTO & Co-founder: Chief JVM Architect, Azul, Sun, HP, Motorola, JIT & Hotspot
Tomas Nykodym: PhD, Security, Intrusion Detection
Cyprien Noel: Founder ObjectFabric, TradeWeb, SmartTrade
Michal Malohlava: PhD, DSLs, Compilers
Jan Vitek: Full Professor, Purdue, On Sabbatical, Real-time VM, R/stats Compiler
Kevin Normoyle: AMD Fellow, Distinguished Engineer Sun, Consistency Models
Tom Kraljevic: VP of Engineering, founder Luminix, Azul, PMC-Sierra, Chromatic
Data Science & Advisors
Stephen Boyd: Professor of Mathematical Engineering, Stanford, Convex Opt
Trevor Hastie: Professor of Statistics, Stanford, Generalized Additive Models
Rob Tibshirani: Professor of Statistics, Stanford, GLMNet, Lasso
Doug Lea: malloc for C, fork-join, Java memory model, SUNY Oswego
Dhruba Borthakur: HDFS, Hive, Facebook
Niall Dalton: TimeSeries DB, KX, High-frequency Trading, Cantor-Fitz
Charles Zedlewski: VP Products, Cloudera
Math at Scale – Simple Legos
Distributed! Extensible, reconfigurable!
H2O legos: +, *, n, µ (mean), σ, cov, rand, shuffle, histogram
Algorithms: OLS, GLM, Logistic Regression, k-means, Random Decision Trees
Before H2O
Data: Volume (HDFS, HIVE/SQL); Velocity (Events)
Roles: Data Scientist, Engineer, Business Analyst
Exploration: munging, slice n dice, features
Modeling: classification, regression, clustering, optimal model
Offline Scoring
Online Scoring: ensemble models, low latency
Outputs: Applications, Predictions, Rule Engine
Product Road Map
1.15.2013 (In 4 pilots)
• algos: RandomForest, GLM, ADMM, GLMnet, k-means
• data: dense, categorical
• api: REST, JSON, R-like console
• Scale, Single-Execution, GridSearch
5.15.2013 (In production)
• algos: GroupBy, Grep, Unbalanced
• App: Fraud Detection
• data: sparse
• api: R, math, string
• Adhoc Analytics, Multi-Execution, Scoring Engine, Event Ingest
8.15.2013 (Big Adoption)
• algos: GBM, SVM, KNN, Optimization
• App: RecoEngine
• data: sparse
• api: Tableau Visualization
• Multi-tenant Library
secret sauce: move code, not data
Linear Regression
fork/join, data partitioning, fine-grained parallelism
phase 1: sums; phase 2: distance; phase 3: validate
arraylets: leaf computes, parent aggregates
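The three phases on this slide can be sketched in a few lines of Python. This is only an illustration, not H2O's code: the helper names (`partition`, `fork_join`, `fit_line`) are hypothetical, the chunks stand in for arraylets, and reading "distance" as squared distances from the means is an assumption. The leaves run sequentially here; in a real fork/join pool each leaf would run where its chunk lives.

```python
def partition(rows, chunk_size):
    """Split rows into fixed-size chunks (a stand-in for arraylets)."""
    return [rows[i:i + chunk_size] for i in range(0, len(rows), chunk_size)]

def fork_join(chunks, leaf, combine):
    """Leaf computes a partial result per chunk; parents aggregate pairwise."""
    if len(chunks) == 1:
        return leaf(chunks[0])
    mid = len(chunks) // 2
    return combine(fork_join(chunks[:mid], leaf, combine),
                   fork_join(chunks[mid:], leaf, combine))

def add(a, b):
    """Parent aggregation: element-wise sum of two partial-result tuples."""
    return tuple(x + y for x, y in zip(a, b))

def fit_line(rows, chunk_size=2):
    """Fit y = intercept + slope * x over partitioned (x, y) rows."""
    if len(rows) < 2:
        raise ValueError("need at least two (x, y) rows")
    chunks = partition(rows, chunk_size)

    # Phase 1: sums. Each leaf returns (count, sum of x, sum of y).
    def sums(chunk):
        return (len(chunk), sum(x for x, _ in chunk), sum(y for _, y in chunk))
    n, sx, sy = fork_join(chunks, sums, add)
    mean_x, mean_y = sx / n, sy / n

    # Phase 2: distance. Squared distances and cross products around the means.
    def distances(chunk):
        return (sum((x - mean_x) ** 2 for x, _ in chunk),
                sum((x - mean_x) * (y - mean_y) for x, y in chunk))
    sxx, sxy = fork_join(chunks, distances, add)
    if sxx == 0:
        raise ValueError("all x values are identical")
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x

    # Phase 3: validate. Sum of squared residuals of the fitted line.
    def residuals(chunk):
        return (sum((y - (intercept + slope * x)) ** 2 for x, y in chunk),)
    (sse,) = fork_join(chunks, residuals, add)
    return slope, intercept, sse
```

Only the small tuples of partial sums travel between leaves and parents, which is the "move code, not data" idea: the rows themselves never leave their chunk.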
company confidential. copyright 2012
Use Cases
Fraud Detection
• Scoring: Event stream on a ScoreCard Model
• Modeling: Random Forest for outlier detection
• Modeling: Event sequence patterns
Customer Behavior & Merchant Analytics
• Scoring: Purchase event stream scoring on Ensemble Models
• Modeling: Logistic Regression models for Customer Engagement
Failure Prediction from Sensor Data
• Model device failures and rank vendor graphs.
Upstream Oil Exploration
• Distance & Regression on 1TB big data
• MLS for Oil fields
Math & Hadoop users recommend us!
Data & Algorithms
SQL | HDFS | S3 | NoSQL
H2O – Real Time
REST
patterns sequences
Distributed Collections Execution
JSON R Excel
Java API
Hadoop Ecosystem
HDFS
H2O Map Reduce
Hive Pig
Impala Drill
Batch Interactive
H2O
Generalized Linear Modeling
• Alternating Direction Method of Multipliers (Boyd)
• Decomposition-coordination
• Small Local Sub-Problems and Global Coordination
• Broadcast & Gather
• Decomposability Dual Ascent + Convergence of Multipliers
• Block & Component Separability
• Generalized Gradients (Hastie, Tibshirani, et al)
• l1 norm regularization
https://github.com/0xdata/h2o/blob/master/src/main/java/hex/DLSM.java
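To make "small local sub-problems and global coordination" concrete, here is a minimal consensus-ADMM sketch for an l1-regularized least-squares fit. It is not the DLSM.java implementation linked above: the function names are hypothetical, and the model is reduced to a single scalar coefficient (b ≈ a·x) so that each local sub-problem has a closed form. Each partition solves its own sub-problem (gather), the coordinator averages and soft-thresholds, and the new consensus value is broadcast back.

```python
def soft_threshold(v, k):
    """Proximal operator of k * |x|: shrink v toward zero by k."""
    if v > k:
        return v - k
    if v < -k:
        return v + k
    return 0.0

def consensus_lasso(partitions, lam, rho=1.0, iterations=300):
    """Minimize sum_i 0.5 * ||a_i * x - b_i||^2 + lam * |x| over scalar x.

    partitions: list of (a_values, b_values) pairs, one per data partition.
    rho and iterations are illustrative defaults, not tuned values.
    """
    n = len(partitions)
    # Local sufficient statistics, computed once where the data lives.
    stats = [(sum(a * a for a in A), sum(a * b for a, b in zip(A, B)))
             for A, B in partitions]
    z = 0.0            # global consensus variable
    u = [0.0] * n      # scaled dual variable (multiplier) per partition
    for _ in range(iterations):
        # Local sub-problems: closed-form ridge-like update per partition.
        x = [(ab + rho * (z - ui)) / (aa + rho)
             for (aa, ab), ui in zip(stats, u)]
        # Global coordination: gather, average, soft-threshold, broadcast z.
        z = soft_threshold(sum(xi + ui for xi, ui in zip(x, u)) / n,
                           lam / (rho * n))
        # Multiplier update: accumulate each partition's disagreement with z.
        u = [ui + xi - z for ui, xi in zip(u, x)]
    return z
```

For the scalar case the exact answer is the soft-thresholded value of Σab with threshold lam, divided by Σa², so the iteration can be checked directly. With more than one coefficient, the local update becomes a small linear solve per partition while the coordination step keeps the same shape.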
Random Forest
• Textbook implementation from Breiman’s paper.
• Data is distributed upon ingest
• Splits on random selection of features
• Gini & Entropy
• Handle NAs (during training)
• Class-Weighting
• Stratified Sampling (local)
https://github.com/0xdata/h2o/tree/master/src/main/java/hex/rf
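The two split criteria named on this slide, and a split chosen over a random subset of features, can be sketched as follows. This is a plain-Python illustration under simple assumptions (numeric features, no NAs, no class weights); the function names are hypothetical and it is not the code in the hex/rf package linked above.

```python
import math
import random
from collections import Counter

def gini(labels):
    """Gini impurity: 1 - sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def entropy(labels):
    """Shannon entropy of the class distribution, in bits."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def split_impurity(rows, labels, feature, threshold, impurity=gini):
    """Size-weighted impurity of the two sides of a `<= threshold` split."""
    left = [y for r, y in zip(rows, labels) if r[feature] <= threshold]
    right = [y for r, y in zip(rows, labels) if r[feature] > threshold]
    n = len(labels)
    return sum(len(side) / n * impurity(side) for side in (left, right) if side)

def best_split(rows, labels, n_features_to_try, rng=None, impurity=gini):
    """Try a random subset of features; return (score, feature, threshold)."""
    rng = rng or random.Random()
    candidates = rng.sample(range(len(rows[0])), n_features_to_try)
    best = None
    for f in candidates:
        # Every distinct value except the largest is a usable threshold.
        for t in sorted({r[f] for r in rows})[:-1]:
            score = split_impurity(rows, labels, f, t, impurity)
            if best is None or score < best[0]:
                best = (score, f, t)
    return best  # None if no sampled feature has two distinct values
```

Passing `impurity=entropy` switches the criterion without touching the search, and sampling the candidate features per split is what decorrelates the trees in the forest.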
forest for the tree.. iris dataset
Models unlock value in data
• 1% increase in predictive power = $11m @ major online payment system
• Each fraud scored accurately = expected value of tens of thousands of dollars.
• Leads cost $10-100/lead – Predicting accurate conversion and quality of leads goes directly to the bottom line.
• Competitive advantage in predicting which assets to acquire.
Deployment - commodity / cloud
H2O
x86
H2O is pure Java and easy to install
H2O the Prediction Engine
Big Data Science
Modeling & Scoring Engine
Approximate results each step
No new API: Use R, Excel & SAS
Scale & Parallelism