Agenda
Data Science
Machine Learning
Trees and Power of Algorithmic Methods
Examples using H2O Scalable Machine Learning Engine
Who am I?
Hank Roark
Data Scientist & Hacker @ H2O.ai
Lecturer in Systems Thinking, UIUC
13 years at John Deere: Research, New Product Development, New High Tech Ventures
Previously at startups and consulting
Physics, Georgia Tech
Systems Design & Management, MIT
Data Science
Interdisciplinary
Data is an electronic commodity; must speak ‘hacker’
Extract insights from data
Discovery and building knowledge
http://drewconway.com/zia/2013/3/26/the-data-science-venn-diagram
Data Science
Jeff Hammerbacher (Facebook, Cloudera)
• Identify problem
• Instrument data sources
• Collect data
• Prepare data (integrate, transform, clean, impute, filter, aggregate)
• Build model
• Evaluate model
• Communicate results
Data Science
Ben Fry (data visualization expert)
• Acquire
• Parse
• Filter
• Mine
• Represent
• Refine
• Interact
Machine Learning
Field of study that gives computers the ability to learn without being explicitly programmed.
Arthur Samuel, 1959
A computer program is said to learn from experience E with respect to some task T and some performance measure P, if its performance on T, as measured by P, improves with experience E.
Tom Mitchell, 1998
Types of Learning
• Supervised Learning
− Inferring function from labeled data
− Classification
− Regression
• Unsupervised Learning
− Finding hidden structure in unlabeled data
− Clustering
− Anomaly detection
• Reinforcement Learning
− Learning from delayed feedback
Isn’t this just statistics repackaged?
x → nature → y
Shared goals of data analysis:
• Prediction
• Information extraction
L. Breiman
Statistical Analysis
x → [linear regression, logistic regression, Cox models] → y
Assume some process that creates observed data
Model validation: yes–no using goodness-of-fit tests and residual examination
L. Breiman
Algorithmic Analysis (aka ML)
x → [unknown] → y, approximated by decision trees, neural networks
Process that creates observed data is unknowable
Model validation: measured by predictive accuracy
L. Breiman
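Validation by predictive accuracy can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the 1-nearest-neighbour model, the function names, and the synthetic, deliberately easy data are all made up for the example and do not come from the talk or from H2O.

```python
import random


def nearest_neighbor_predict(train, x):
    """Return the label of the training point closest to x (1-nearest neighbour)."""
    closest = min(train, key=lambda row: abs(row[0] - x))
    return closest[1]


def holdout_accuracy(rows, test_fraction=0.3, seed=0):
    """Shuffle the rows, hold out a test set, and score the model on unseen rows only."""
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    test, train = shuffled[:n_test], shuffled[n_test:]
    hits = sum(nearest_neighbor_predict(train, x) == label for x, label in test)
    return hits / len(test)


# Synthetic data (an assumption for this sketch): two well-separated groups.
rows = [(float(x), 0) for x in range(0, 20)] + [(float(x), 1) for x in range(100, 120)]
accuracy = holdout_accuracy(rows)
print(f"holdout accuracy: {accuracy:.2f}")
```

No assumption about the data-generating process is checked here; the model is judged only by how well it predicts rows it never saw.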
Trees
Short exploration of one algorithmic method
Can be used for regression and classification
Segments the prediction space into a number of simple regions
Often referred to as decision trees
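To make "segments the prediction space into simple regions" concrete, here is a minimal sketch of the smallest possible regression tree, a single split (a "stump") on one predictor. The function names and the toy data are hypothetical; real tree learners apply this step recursively and over many predictors.

```python
def fit_stump(xs, ys):
    """Find the threshold on one predictor that minimises squared error.

    Returns (threshold, left_mean, right_mean): the prediction is left_mean
    when x <= threshold and right_mean otherwise.
    """
    pairs = sorted(zip(xs, ys))
    best = None
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # identical x values cannot be separated by a threshold
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((y - left_mean) ** 2 for y in left)
               + sum((y - right_mean) ** 2 for y in right))
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    if best is None:
        raise ValueError("need at least two distinct x values")
    return best[1:]


def predict_stump(stump, x):
    """Predict with the mean of the region that x falls into."""
    threshold, left_mean, right_mean = stump
    return left_mean if x <= threshold else right_mean


# Toy data (an assumption for this sketch): two obvious regions.
xs = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
ys = [5.0, 6.0, 7.0, 20.0, 21.0, 22.0]
stump = fit_stump(xs, ys)
print(stump)  # (6.5, 6.0, 21.0)
```

The stump splits the x axis into two regions at 6.5 and predicts the mean response of each region.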
Pros and Cons
Simple, thought to mirror human decision making
Not competitive with the best supervised learning approaches in terms of predictive accuracy
Combining a large number of trees results in dramatic improvements in predictive accuracy, with some loss of interpretability
Methods to Improve Predictive Performance of Trees
• Bagging
• Random Forest
• Boosting
Bagging is short for bootstrap aggregation.
Averaging a set of observations reduces variance.
Individual trees are built on samples, with replacement, of the data. (Bootstrap)
Many trees are built and the results ‘averaged’ (Aggregation)
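The two steps, bootstrap then aggregate, can be sketched as follows. This is a minimal illustration, assuming one-split regression trees (stumps) as the base learner and toy data; the function names are hypothetical and this is not how H2O implements it.

```python
import random


def fit_stump(xs, ys):
    """One-split regression tree: returns (threshold, left_mean, right_mean)."""
    pairs = sorted(zip(xs, ys))
    best = None
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        sse = sum((y - lm) ** 2 for y in left) + sum((y - rm) ** 2 for y in right)
        if best is None or sse < best[0]:
            best = (sse, (pairs[i - 1][0] + pairs[i][0]) / 2, lm, rm)
    return best[1:]


def predict_stump(stump, x):
    threshold, left_mean, right_mean = stump
    return left_mean if x <= threshold else right_mean


def bagged_stumps(xs, ys, n_trees=25, seed=0):
    """Fit each stump on a bootstrap sample (rows drawn with replacement)."""
    rng = random.Random(seed)
    n = len(xs)
    stumps = []
    while len(stumps) < n_trees:
        idx = [rng.randrange(n) for _ in range(n)]  # bootstrap sample
        sample_x = [xs[i] for i in idx]
        sample_y = [ys[i] for i in idx]
        if len(set(sample_x)) < 2:
            continue  # degenerate resample with nothing to split; draw again
        stumps.append(fit_stump(sample_x, sample_y))
    return stumps


def predict_bagged(stumps, x):
    """Aggregate: average the predictions of all trees."""
    return sum(predict_stump(s, x) for s in stumps) / len(stumps)


# Toy data (an assumption for this sketch): a step from 0 to 10 at x = 10.
xs = [float(x) for x in range(20)]
ys = [0.0 if x < 10 else 10.0 for x in xs]
stumps = bagged_stumps(xs, ys)
print(predict_bagged(stumps, 2.0), predict_bagged(stumps, 17.0))
```

For classification the aggregation step would typically be a majority vote rather than a mean.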
Random forest builds on bagging, by considering a random subset of the predictors at each tree split
This further decorrelates the trees, resulting in improved predictive performance.
Implemented in H2O as Random Forest.
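The one change random forest makes to the split search can be sketched like this: at each split, only a random subset of the predictors is considered. The names `best_split`, `random_forest_split`, and `mtry`, and the toy data, are assumptions for illustration; this is not H2O's implementation.

```python
import random


def _sse(values):
    """Sum of squared deviations from the mean."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def best_split(rows, targets, candidate_features):
    """Best (sse, feature, threshold), searching only the given features."""
    best = None
    for f in candidate_features:
        values = sorted(set(row[f] for row in rows))
        for lo, hi in zip(values, values[1:]):
            threshold = (lo + hi) / 2
            left = [t for row, t in zip(rows, targets) if row[f] <= threshold]
            right = [t for row, t in zip(rows, targets) if row[f] > threshold]
            sse = _sse(left) + _sse(right)
            if best is None or sse < best[0]:
                best = (sse, f, threshold)
    return best


def random_forest_split(rows, targets, mtry, rng):
    """Random-forest style split: search a random subset of mtry predictors."""
    candidates = rng.sample(range(len(rows[0])), mtry)
    return best_split(rows, targets, candidates), candidates


# Toy data (an assumption): predictor 0 is strong, predictors 1 and 2 are weak.
rows = [(0.0, 3.0, 1.0), (1.0, 1.0, 0.0), (2.0, 2.0, 1.0),
        (3.0, 3.0, 0.0), (4.0, 1.0, 1.0), (5.0, 2.0, 0.0)]
targets = [0.0, 0.0, 0.0, 10.0, 10.0, 10.0]

full = best_split(rows, targets, [0, 1, 2])      # plain tree: always picks predictor 0
split, candidates = random_forest_split(rows, targets, mtry=2, rng=random.Random(0))
print(full, split, candidates)
```

A plain bagged tree would pick the strong predictor at every top split, so the trees would look alike. Restricting the candidates forces some trees to split on other predictors, which is what decorrelates them.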
Boosting builds multiple models sequentially, using information from prior trees.
Each new model is fit slowly to the residuals of the prior models.
Is a general method, not limited to trees.
Implemented in H2O as GBM (Gradient Boosted Models); first ever parallel, distributed GBM.
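A minimal sketch of boosting for squared-error regression, assuming one-split trees (stumps) as the base learner: each round fits a stump to the current residuals and adds only a small fraction of it to the model. The function names, the `learning_rate` value, and the toy data are illustrative assumptions, not H2O's GBM.

```python
def fit_stump(xs, ys):
    """One-split regression tree: returns (threshold, left_mean, right_mean)."""
    pairs = sorted(zip(xs, ys))
    best = None
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        sse = sum((y - lm) ** 2 for y in left) + sum((y - rm) ** 2 for y in right)
        if best is None or sse < best[0]:
            best = (sse, (pairs[i - 1][0] + pairs[i][0]) / 2, lm, rm)
    return best[1:]


def predict_stump(stump, x):
    threshold, left_mean, right_mean = stump
    return left_mean if x <= threshold else right_mean


def boost(xs, ys, n_rounds=50, learning_rate=0.1):
    """Sequentially fit stumps to the residuals of the model built so far."""
    base = sum(ys) / len(ys)          # start from the mean response
    preds = [base] * len(ys)
    stumps = []
    for _ in range(n_rounds):
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        # Learn slowly: add only a fraction of the new stump's prediction.
        preds = [p + learning_rate * predict_stump(stump, x)
                 for x, p in zip(xs, preds)]
    return base, learning_rate, stumps


def predict_boosted(model, x):
    base, learning_rate, stumps = model
    return base + learning_rate * sum(predict_stump(s, x) for s in stumps)


def mse(model, xs, ys):
    return sum((y - predict_boosted(model, x)) ** 2 for x, y in zip(xs, ys)) / len(ys)


# Toy data (an assumption for this sketch): y = x squared.
xs = [float(x) for x in range(10)]
ys = [x * x for x in xs]
for rounds in (0, 1, 50):
    print(rounds, round(mse(boost(xs, ys, n_rounds=rounds), xs, ys), 2))
```

The training error shrinks as rounds are added. In practice the number of rounds and the learning rate are chosen on held-out data, since enough rounds will eventually overfit.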
Which Algorithm Is Best?
We have dubbed the associated results No Free Lunch theorems because they demonstrate that if an algorithm performs well on a certain class of problems then it necessarily pays for that with degraded performance on the set of all remaining problems. (Wolpert and Macready)
H2O.ai Overview
• Founded: 2011, venture-backed, debuted in 2012
• Product: H2O open source in-memory prediction engine
• Team: 37 - Distributed Systems Engineers doing ML
• HQ: Mountain View, CA
H2O.ai Machine Intelligence
What is H2O?
Math Platform: Open source in-memory prediction engine
• Parallelized and distributed algorithms making the most use out of multithreaded systems
• GLM, Random Forest, GBM, Deep Learning, etc.
API: Easy to use and adopt
• Written in Java – perfect for Java programmers
• REST API (JSON) – drives H2O from R, Python, Excel, Tableau
Big Data: More data? Or better models? BOTH
• Use all of your data – model without down sampling
• Run a simple GLM or a more complex GBM to find the best fit for the data
• More Data + Better Models = Better Predictions
Customer Use Cases
• Ad Optimization (200% CPA lift with H2O)
• P2B Model Factory (60k models, 15x faster with H2O than before)
• Fraud Detection (11% higher accuracy with H2O Deep Learning - saves millions)
• Real-time marketing (H2O is 10x faster than anything else)
…and many large insurance, financial services, and manufacturing companies!
Cisco Predictive Modeling Factories
Problem
• Need to predict whether a company will buy a certain product at a given time
• Spend a lot of time preparing models
• Less time for scoring and less time left for using the scores in the sales activities
Why H2O?
• P2B factory is 15x faster with H2O
• Newer buying patterns incorporated immediately into models
• Scores are published sooner
• More time for planning and executing activities
• R + H2O is a robust and powerful combination
Who uses it?
• Lou Carvalheira, advanced analytics manager
• Customer Intelligence data scientists
P2B factory is 15x faster with H2O
[Figure: Q1–Q2 timelines. "Before, without H2O": Data Refresh, P2B Training, Scoring models, then Prepare, execute Mktg & Sales activities. "Now, with H2O": Data Refresh, Train & score, then Prepare, execute Mktg & Sales activities, in each quarter.]
ShareThis AdTech Optimization
Problem
• ShareThis ONLY targets users within 24 hours to ensure ads reach them at the most relevant moment for maximum ROI
Why H2O?
• Maximized ROI by optimizing campaign performance and budget allocation
• Increased accuracy and better anomaly removal
• Reduced R&D time significantly
• Used all data and built models faster, & faster scoring
• Smooth model building pipeline with R and Spark API
Who uses it?
• Prasanta Behera, VP of Engineering
• Ad Products team
Real Time Messaging Reaches Users During Peak Interest
[Figure: interest vs. time curve for “Dan” (male, 25-45, tech enthusiast, HHI $75K+), showing the trigger, excitement, peak readiness for engagement, and fading interest, with the standard targeting threshold and the ShareThis messaging trigger marked.]
PayPal Fraud Prevention
Problem
• Flag fraudulent behavior upfront
• Monitor account activity and account-to-account transactions for suspicious behavior and changes
• Need to model new and complex attack patterns quickly
Why H2O?
• Fast, scalable, and accurate
• Flexible deployment
• Works seamlessly with Hadoop
• Simple interface
• 11% improvement in accuracy w/ Deep Learning
Who uses it?
• Fraud Prevention data science team
Fraud Prevention at PayPal
Experiment
• Dataset
− 160 million records
− 1500 features (150 categorical)
− 0.6TB compressed in HDFS
• Infrastructure
− 800 node Hadoop (CDH3) cluster
• Decision
− Fraud/not-fraud
Results
• Network architecture
− 6 layers with 600 neurons each performed the best
• Activation function
− RectifierWithDropout performed the best
• 11% accuracy improvement with limited feature set & a deep network
− With a third of the original feature set, 6 hidden layers, 600 neurons each
Fraud Prevention with Random Forest
Customer selects song to purchase → Payment information entered → Data collected → Comparison with past consumer behavior → Random Forest: determine fraud/not fraud → Take steps to stop fraud or prevent future fraud