H2O.ai Machine Intelligence
Fast, Scalable In-Memory Machine and Deep Learning For Smarter Applications
Python & Sparkling Water with H2O
Cliff Click Michal Malohlava
Who Am I?
Cliff Click CTO, Co-Founder H2O.ai [email protected]
40 yrs coding
35 yrs building compilers
30 yrs distributed computation
20 yrs OS, device drivers, HPC, HotSpot
10 yrs low-latency GC, custom Java hardware
NonBlockingHashMap
20 patents, dozens of papers
100s of public talks
PhD Computer Science 1995, Rice University
HotSpot JVM Server Compiler
“showed the world JITing is possible”
H2O Open Source In-Memory Machine Learning for Big Data
Distributed In-Memory Math Platform
GLM, GBM, RF, K-Means, PCA, Deep Learning
Easy to use SDK & API
Java, R (CRAN), Scala, Spark, Python, JSON, Browser GUI
Use ALL your data
Modeling without sampling
HDFS, S3, NFS, NoSQL
Big Data & Better Algorithms Better Predictions!
TBD. Customer Support
TBD Head of Sales
Distributed Systems Engineers Making ML Scale!
Practical Machine Learning
Value: Requirement
Fast & Interactive: In-Memory
Big Data (No Sampling): Distributed
Ownership: Open Source
Extensibility: API/SDK
Portability: Java, REST/JSON
Infrastructure: Cloud or On-Premise, Hadoop or Private Cluster
H2O Architecture
Prediction Engine
R & Exec Engine Web Interface
Spark Scala REPL
Nano-Fast Scoring Engine
Distributed In-Memory K/V Store
Column-Compressed Data
Map/Reduce
Memory Manager
Algorithms! GBM, Random Forest, GLM, PCA, K-Means, Deep Learning
HDFS, S3, NFS
Real-Time Data Flow
Python & Sparkling Water
● CitiBike of NYC
● Predict bikes-per-hour-per-station
  – From per-trip logs
● 10M rows of data
● Group-By, date/time feature-munging
Demo!
H2O: A Platform for Big Math
● Most any Java on big 2-D tables
  – Write it like single-threaded POJO code
  – Runs distributed & parallel by default
● Fast: a billion-row logistic regression takes 4 sec
● World's first parallel & distributed GBM
  – Plus Deep Learning / Neural Nets, RF, PCA, K-Means...
● R integration: use terabyte datasets from R
● Sparkling Water: direct Spark integration
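The "write it like single-threaded code, run distributed" idea boils down to a chunked map/reduce: the framework splits a column into chunks, runs the user's per-chunk logic, then merges partial results. A minimal stand-alone Scala sketch of that pattern (ChunkSum and chunkedSum are illustrative names for this sketch, not H2O API; the real mechanism is H2O's distributed MRTask over compressed chunks):

```scala
// Partial result produced by the per-chunk "map" phase.
final case class ChunkSum(sum: Double, count: Long) {
  // "reduce": combine partial results from two chunks
  def merge(other: ChunkSum): ChunkSum =
    ChunkSum(sum + other.sum, count + other.count)
}

// Split a column into chunks, map each chunk to a partial result,
// then reduce the partials. In H2O the chunks live on different
// nodes; here they are just slices of a local array.
def chunkedSum(col: Array[Double], chunkSize: Int): ChunkSum =
  col.grouped(chunkSize)                  // split the column into chunks
     .map(c => ChunkSum(c.sum, c.length)) // per-chunk partial result
     .reduce(_ merge _)                   // merge partials into one answer
```

The user only writes the per-chunk logic and the merge; distribution and parallelism come from the framework.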
H2O: A Platform for Big Math
● Easy launch: “java -jar h2o.jar”
  – No GC tuning: -Xmx as big as you like
● Production ready:
  – Private on-premise cluster OR in the Cloud
  – Hadoop, YARN, EC2, or standalone cluster
  – HDFS, S3, NFS, URI & other data sources
  – Open Source, Apache v2
Can I call H2O’s algorithms from my Spark workflow?
YES, you can!
Sparkling Water
Sparkling Water Provides
Transparent integration into Spark ecosystem
Pure H2ORDD encapsulating H2O DataFrame
Transparent use of H2O data structures and algorithms with Spark API
Excels in Spark workflows requiring advanced Machine Learning algorithms
Sparkling Water Design
[Diagram: the user's Sparkling App is submitted via spark-submit to the Spark Master JVM; each Spark Worker JVM hosts a Spark Executor JVM with an embedded H2O instance, and together the executors form the Sparkling Water Cluster.]
Data Distribution
[Diagram: data flows from a source (e.g. HDFS) into the Spark Executor JVMs of the Sparkling Water Cluster; a Spark RDD and an H2O RDD sit side by side in each executor, so RDDs and DataFrames share the same memory space.]
Demo time!
LAUNCH SPARKLING SHELL
> export SPARK_HOME="/path/to/spark/installation"
> bin/sparkling-shell
PREPARE AN ENVIRONMENT
val DIR_PREFIX = "/Users/michal/Devel/projects/h2o/repos/h2o2/bigdata/laptop/citibike-nyc/"

// Common imports
import org.apache.spark.h2o._
import org.apache.spark.examples.h2o._
import org.apache.spark.examples.h2o.DemoUtils._
import org.apache.spark.sql.SQLContext
import water.fvec._
import hex.tree.gbm.GBM
import hex.tree.gbm.GBMModel.GBMParameters

// Initialize Spark SQLContext
implicit val sqlContext = new SQLContext(sc)
import sqlContext._
LAUNCH H2O SERVICES
implicit val h2oContext = new H2OContext(sc).start()
import h2oContext._
LOAD CITIBIKE DATA USING H2O API

val dataFiles = Array[String](
  "2013-07.csv", "2013-08.csv", "2013-09.csv",
  "2013-10.csv", "2013-11.csv", "2013-12.csv").map(f => new java.io.File(DIR_PREFIX, f))

// Load and parse data
val bikesDF = new DataFrame(dataFiles:_*)

// Rename columns and remove all spaces in header
val colNames = bikesDF.names().map(n => n.replace(' ', '_'))
bikesDF._names = colNames
bikesDF.update(null)
USER-DEFINED COLUMN TRANSFORMATION
// Select column 'starttime'
val startTimeF = bikesDF('starttime)
// Invoke column transformation and append the created column
bikesDF.add(new TimeSplit().doIt(startTimeF))
// Do not forget to update the frame in the K/V store
bikesDF.update(null)
OPEN H2O FLOW UI
openFlow

AND EXPLORE DATA...
> getFrames
FROM H2O'S DATAFRAME TO RDD
val bikesRdd = asSchemaRDD(bikesDF)
USE SPARK SQL
// Register the RDD as a SQL table
sqlContext.registerRDDAsTable(bikesRdd, "bikesRdd")

// Perform SQL group operation
val bikesPerDayRdd = sql(
  """SELECT Days, start_station_id, count(*) bikes
    |FROM bikesRdd
    |GROUP BY Days, start_station_id""".stripMargin)
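The SQL group operation is a distributed group-and-count: one row per (day, start station) with the number of trips. As a mental model, here is an equivalent local Scala sketch (Trip is a hypothetical record type used only for this sketch; the real query runs over a Spark SchemaRDD):

```scala
// One per-trip log record, reduced to the two grouping keys.
final case class Trip(days: Int, startStationId: Int)

// Local analogue of:
//   SELECT Days, start_station_id, count(*) FROM bikesRdd
//   GROUP BY Days, start_station_id
def bikesPerDay(trips: Seq[Trip]): Map[(Int, Int), Int] =
  trips.groupBy(t => (t.days, t.startStationId)) // GROUP BY Days, start_station_id
       .view.mapValues(_.size)                   // count(*) per group
       .toMap
```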
FROM RDD TO H2O'S DATAFRAME
val bikesPerDayDF: DataFrame = bikesPerDayRdd

AND PERFORM ADDITIONAL COLUMN TRANSFORMATION
// Select "Days" column
val daysVec = bikesPerDayDF('Days)
// Refine column into "Month" and "DayOfWeek"
val finalBikeDF = bikesPerDayDF.add(new TimeTransform().doIt(daysVec))
TIME TO BUILD A MODEL!
GBM MODEL BUILDER
def buildModel(df: DataFrame, trees: Int = 200, depth: Int = 6): R2 = {
  // Split into train, test and hold-out parts
  val frs = splitFrame(df, Seq("train.hex", "test.hex", "hold.hex"), Seq(0.6, 0.3, 0.1))
  val (train, test, hold) = (frs(0), frs(1), frs(2))
  // Configure GBM parameters
  val gbmParams = new GBMParameters()
  gbmParams._train = train
  gbmParams._valid = test
  gbmParams._response_column = 'bikes
  gbmParams._ntrees = trees
  gbmParams._max_depth = depth
  // Build a model
  val gbmModel = new GBM(gbmParams).trainModel.get
  // Score datasets
  Seq(train, test, hold).foreach(gbmModel.score(_).delete)
  // Collect R2 metrics
  val result = R2("Model #1", r2(gbmModel, train), r2(gbmModel, test), r2(gbmModel, hold))
  // Perform clean-up
  Seq(train, test, hold).foreach(_.delete())
  result
}
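buildModel starts by splitting the frame 60/30/10 into train, test, and hold-out parts. A local sketch of that fractional split on a plain Seq, assuming contiguous slicing (splitSeq is a hypothetical stand-in; H2O's splitFrame works on distributed frames and samples rows, it does not simply slice):

```scala
// Split xs into consecutive parts whose sizes follow the given
// fractions, e.g. Seq(0.6, 0.3, 0.1) -> a 60/30/10 split.
def splitSeq[A](xs: Seq[A], fractions: Seq[Double]): Seq[Seq[A]] = {
  // Cumulative boundaries: 0.0, 0.6, 0.9, 1.0 -> element indices
  val bounds = fractions.scanLeft(0.0)(_ + _).map(f => (f * xs.length).round.toInt)
  // Slice between consecutive boundaries
  bounds.zip(bounds.tail).map { case (lo, hi) => xs.slice(lo, hi) }
}
```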
BUILD A GBM MODEL
val result1 = buildModel(finalBikeDF)
CAN WE IMPROVE THE MODEL BY USING INFORMATION ABOUT WEATHER?
LOAD WEATHER DATA USING SPARK API

// Load weather data for NY 2013
val weatherData = sc.textFile(DIR_PREFIX + "31081_New_York_City__Hourly_2013.csv")
// Parse data and filter it
val weatherRdd = weatherData.map(_.split(",")).
  map(row => NYWeatherParse(row)).
  filter(!_.isWrongRow()).
  filter(_.HourLocal == Some(12)).setName("weather").cache()
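The parse-and-filter chain (split each CSV line, parse it, drop malformed rows, keep only local noon) can be tried out without Spark. This local sketch mirrors the shape of that pipeline; WeatherRow, parseRow, and the column layout are hypothetical stand-ins for NYWeatherParse:

```scala
// Minimal parsed weather record; assumes column 0 = local hour,
// column 1 = temperature (an invented layout for this sketch only).
final case class WeatherRow(hourLocal: Option[Int], temperature: Option[Double]) {
  // Analogue of isWrongRow(): a row missing either field is invalid
  def isWrongRow: Boolean = hourLocal.isEmpty || temperature.isEmpty
}

def parseRow(line: String): WeatherRow = {
  val cols = line.split(",", -1)
  def asInt(s: String)    = scala.util.Try(s.trim.toInt).toOption
  def asDouble(s: String) = scala.util.Try(s.trim.toDouble).toOption
  WeatherRow(asInt(cols(0)), asDouble(cols.lift(1).getOrElse("")))
}

// map -> parse -> drop bad rows -> keep only local noon,
// mirroring the RDD chain above on a plain Seq of lines.
def noonRows(lines: Seq[String]): Seq[WeatherRow] =
  lines.map(parseRow)
       .filter(!_.isWrongRow)
       .filter(_.hourLocal.contains(12))
```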
CREATE A JOINED TABLE USING H2O'S DATAFRAME AND SPARK'S RDD

// Join with bike table
sqlContext.registerRDDAsTable(weatherRdd, "weatherRdd")
sqlContext.registerRDDAsTable(asSchemaRDD(finalBikeDF), "bikesRdd")

val bikesWeatherRdd = sql(
  """SELECT b.Days, b.start_station_id, b.bikes,
    |b.Month, b.DayOfWeek,
    |w.DewPoint, w.HumidityFraction, w.Prcp1Hour,
    |w.Temperature, w.WeatherCode1
    |FROM bikesRdd b
    |JOIN weatherRdd w
    |ON b.Days = w.Days""".stripMargin)
BUILD A NEW MODEL USING SPARK'S RDD IN H2O'S API
val result2 = buildModel(bikesWeatherRdd)
Check out H2O.ai Training Books
http://learn.h2o.ai/

Check out the H2O.ai Blog
http://h2o.ai/blog/

Check out the H2O.ai YouTube Channel
https://www.youtube.com/user/0xdata

Check out GitHub
https://github.com/h2oai
More info
Learn more about H2O at h2o.ai
Thank you!
Follow us at @h2oai