Copyright © 2014 Oracle and/or its affiliates. All rights reserved. |
Massive Predictive Modeling using Oracle R Technologies
Mark Hornick, Director, Oracle Advanced Analytics
Safe Harbor Statement
The following is intended to outline our general product direction. It is intended for information purposes only, and may not be incorporated into any contract. It is not a commitment to deliver any material, code, or functionality, and should not be relied upon in making purchasing decisions. The development, release, and timing of any features or functionality described for Oracle’s products remains at the sole discretion of Oracle.
Agenda
1. Massive Predictive Modeling
2. Use cases
3. Enabling technologies
Quick Survey: How many models have you built in your lifetime?
> 10
> 100
> 1000
> 10000
> 100000
> 1000000
Massive Predictive Modeling
[Figure: # Models (1 to 100s) plotted against Data Size (rows, millions to billions), with regions labeled “Specialized” and “Generalized”.]
[Figure: # Models (1 to 100s), Data Size (rows, millions to billions), and # Models per Entity (1 to 1000s), with regions labeled “Broad coverage” and “Targeted”.]
Massive Predictive Modeling - Goals
• Build one or more models per entity, e.g., customer
• Understand and/or predict entity behavior
• Aggregate results across entities, e.g., to assess future demand
[Figure: one model per customer; results are summed over customers (Σ, cust = 1..n) to give demand over time.]
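The goals above (one model per entity, then aggregation across entities) can be sketched in plain open-source R. This is only a local illustration under stated assumptions: the toy data, the column names HOUR, TEMP and CONSUMPTION, and the choice of lm are all hypothetical and not taken from the slides.

```r
# Hypothetical toy data: 3 customers, 48 hourly readings each.
set.seed(42)
usage <- data.frame(
  CUST_ID = rep(1:3, each = 48),
  HOUR    = rep(0:23, times = 6),
  TEMP    = round(runif(144, 10, 35), 1)
)
usage$CONSUMPTION <- 0.5 + 0.02 * usage$TEMP + 0.01 * usage$HOUR +
  rnorm(144, sd = 0.05)

# Build one model per entity: split the rows by customer, fit each part.
models <- lapply(split(usage, usage$CUST_ID), function(dat) {
  lm(CONSUMPTION ~ HOUR + TEMP, data = dat)
})

# Predict every customer's usage for the same (assumed) future hours.
future <- data.frame(HOUR = 0:23, TEMP = 25)
per_customer <- sapply(models, predict, newdata = future)  # 24 x 3 matrix

# Aggregate across entities: total demand per hour is the sum over customers.
total_demand <- rowSums(per_customer)
```

Here `total_demand` plays the role of the "demand over time" curve; at database scale the split-and-fit step is what gets parallelized.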
Massive Predictive Modeling - Challenges
• Effectively dealing with “Big Data” – Hardware, software, network, storage
• Algorithms that scale and perform with Big Data
• Building “many” models in parallel
• Production deployment
• Storing and managing models
• Backup, recovery, and security
Use Cases
Predicting Customer Electricity Usage
Motivation: Energy Theft
Detecting patterns of meter tampering
• SA country loses US$4 billion per year due to energy theft
• Storage of information about which meters have been tampered with
• Analysis and decision making
• Forecast future behavior
Motivation: Different customers, different demands
• Each customer has different demand and consumption patterns
• Storage of information about the consumption of each customer in different periods of the day
• Creation of a demand and consumption curve for each customer
• Analysis: in which period will the company have to deliver more energy?
• Price electricity in a given period
• Customer decides when to use energy to reduce cost
• Company redirects the energy to where it is most needed at the moment, saving on generation
Sensor Data Analysis
• Model each customer’s usage to understand behavior and predict individual usage and overall aggregate demand
• Consider 200K customers, each with a utility “smart meter”
• 1 reading / meter / hour
• 200K x 8760 hours / year → 1.752B readings
• 3 years' worth of data → 5.256B readings
• 26280 readings per customer
• 10 seconds to build each model → 555.6 hours (23.2 days)
– …with 128 DOP (degree of parallelism) → 4.3 hours
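The sizing figures above follow from simple arithmetic, reproduced here as a quick check. The variable names are illustrative, and the parallel estimate assumes ideal linear speedup across 128 workers, which a real system may not reach.

```r
customers      <- 200e3  # smart meters, one per customer
hours_per_year <- 8760   # 24 * 365
years          <- 3
sec_per_model  <- 10     # assumed build time per model, from the slide
dop            <- 128    # degree of parallelism

readings_per_year     <- customers * hours_per_year  # 1.752e9
readings_total        <- readings_per_year * years   # 5.256e9
readings_per_customer <- hours_per_year * years      # 26280

serial_hours   <- customers * sec_per_model / 3600   # about 555.6
serial_days    <- serial_hours / 24                  # about 23.15
parallel_hours <- serial_hours / dop                 # about 4.3 with ideal speedup
```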
Database-centric architecture: smart meter scenario (model build)
[Diagram: inside Oracle Database, the data is partitioned by customer (c1, c2, …, ci, …, cn). The R script f(dat,args,…) that builds a model comes from the R Script Repository and is invoked once per partition; the resulting models (Model c1, Model c2, …, Model ci, …, Model cn) go to the R Datastore.]
Database-centric architecture: smart meter scenario (scoring)
[Diagram: inside Oracle Database, the data is partitioned by customer (c1, c2, …, ci, …, cn). The R script f(dat,args,…) that scores data comes from the R Script Repository and is invoked once per partition, using the models held in the R Datastore; the output is one set of scores per customer (scores c1, scores c2, …, scores ci, …, scores cn).]
How many lines of code do you think it should take to implement this?
Build models and store in database, partition on CUST_ID
ore.groupApply(CUST_USAGE_DATA,
  CUST_USAGE_DATA$CUST_ID,
  function(dat, ds.name) {
    cust_id <- dat$CUST_ID[1]                  # constant within a partition
    mod <- lm(Consumption ~ . -CUST_ID, dat)   # one linear model per customer
    mod$effects <- mod$residuals <- mod$fitted.values <- NULL   # shrink the stored model
    name <- paste("mod", cust_id,sep="")
    assign(name, mod)
    ds.name1 <- paste(ds.name,".",cust_id,sep="")   # per-customer datastore name
    ore.save(list=paste("mod",cust_id,sep=""), name=ds.name1, overwrite=TRUE)
    TRUE
  },
  ds.name="myDatastore", ore.connect=TRUE, parallel=TRUE
)
14 lines
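The build function above sets effects, residuals and fitted.values to NULL before saving the model. A small local sketch, using made-up data and base R only, suggests why: the object gets smaller while predict() on new data still returns the same values. Note that an lm object keeps other per-row components (for example the QR decomposition and the model frame), so this trims the model rather than making it independent of the row count.

```r
# Hypothetical data, only to have something to fit.
set.seed(1)
dat <- data.frame(x = rnorm(5000))
dat$y <- 2 * dat$x + rnorm(5000)

full <- lm(y ~ x, data = dat)

# Same trimming as in the slide's build function.
slim <- full
slim$effects <- slim$residuals <- slim$fitted.values <- NULL

size_full <- as.numeric(object.size(full))  # bytes
size_slim <- as.numeric(object.size(slim))  # bytes, smaller than size_full

# Point predictions on new data do not depend on the removed components.
newdata <- data.frame(x = c(-1, 0, 1))
same_predictions <- isTRUE(all.equal(predict(full, newdata),
                                     predict(slim, newdata)))
```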
Score customers in database, partition on CUST_ID
ore.groupApply(CUST_USAGE_DATA_NEW,
  CUST_USAGE_DATA_NEW$CUST_ID,
  function(dat, ds.name) {
    cust_id <- dat$CUST_ID[1]
    ds.name1 <- paste(ds.name,".",cust_id,sep="")   # per-customer datastore name
    ore.load(ds.name1)                               # restores the saved model object
    name <- paste("mod", cust_id,sep="")
    mod <- get(name)
    prd <- predict(mod, newdata=dat)
    prd[as.integer(rownames(prd))] <- prd
    res <- cbind(CUST_ID=cust_id, PRED = prd)
    data.frame(res)
  },
  ds.name="myDatastore", ore.connect=TRUE, parallel=TRUE,
  FUN.VALUE=data.frame(CUST_ID=numeric(0), PRED=numeric(0))   # shape of the returned rows
)
16 lines
Execution Examples (with DOP=24)
• 1000 Models
– Data: 26,280,000 rows
– Total build time: 65.2 seconds
– Total scoring time: 25.7 seconds (all data)
• 10,000 Models
– Data: 262,800,000 rows
– Total build time: 516 seconds
– Total scoring time: 217 seconds (all data)
• 50,000 Models
– Data: 1,314,000,000 rows
– Total build time: 55.85 minutes
– Total scoring time: 18 minutes (all data)
[Chart: Build Time and Score Time, 1 model/customer. Y-axis: execution time (sec), log scale from 1 to 10000. X-axis: # rows (millions): 26.3, 262.8, 1314.]
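To compare the three runs reported above, the totals can be normalized per model and per row. The sketch below uses only the numbers from the slide; the data frame and column names are illustrative. The times are wall-clock totals at DOP=24, so the per-model figures are elapsed time divided by model count, not the time a single build takes on one worker.

```r
runs <- data.frame(
  models    = c(1000, 10000, 50000),
  rows      = c(26280000, 262800000, 1314000000),
  build_sec = c(65.2, 516, 55.85 * 60),  # 55.85 minutes converted to seconds
  score_sec = c(25.7, 217, 18 * 60)      # 18 minutes converted to seconds
)

runs$rows_per_model      <- runs$rows / runs$models       # readings per customer
runs$build_sec_per_model <- runs$build_sec / runs$models  # elapsed seconds per model
runs$score_rows_per_sec  <- runs$rows / runs$score_sec    # scoring throughput

print(runs)
```

Elapsed build time per model stays in the range of roughly 0.05 to 0.07 seconds across all three runs, which is consistent with near-linear scaling in the number of models.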
Simulation
Compute distribution of generated random normal values
simulation <- function(index, n) {
  set.seed(index)
  x <- rnorm(n)
  res <- data.frame(t(matrix(summary(x))))
  names(res) <- c("min","q1","median","mean","q3","max")
  res$id <- index
  res
}
(res <- simulation(1,1000))
Simulation with sample size 1000 over 10 trials
res <- ore.indexApply(10, simulation, n=1000, FUN.VALUE=res[1,], parallel=TRUE)
stats <- ore.pull(res)
library(reshape2)
melt.stats <- melt(stats, id.vars="id")
boxplot(value~variable, data=melt.stats, main="Distribution of Stats - sample 1000, 10 trials")
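For readers without a database connection, the index-based pattern above can be approximated locally. This is a sketch that swaps ore.indexApply for base R's lapply and do.call(rbind, ...); it runs serially on the client and is not how Oracle R Enterprise executes the function. The simulation function is repeated so the block is self-contained.

```r
simulation <- function(index, n) {
  set.seed(index)                 # one reproducible stream per trial index
  x <- rnorm(n)
  res <- data.frame(t(matrix(summary(x))))
  names(res) <- c("min", "q1", "median", "mean", "q3", "max")
  res$id <- index
  res
}

# Local stand-in for ore.indexApply(10, simulation, n = 1000, ...):
# call the function once per index, then stack the one-row results.
trials <- lapply(1:10, simulation, n = 1000)
stats  <- do.call(rbind, trials)   # 10 rows, 7 columns
```

The resulting `stats` data frame has the same columns as the pulled result, so the melt and boxplot calls can be applied to it unchanged.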
Simulation with sample sizes 10^(1:6) and 100 trials
num.trials <- 100
for(n in 10^(1:6)){
  t1 <- system.time(stats <- ore.pull(ore.indexApply(num.trials, simulation, n=n,
                                                     FUN.VALUE=res[1,], parallel=TRUE)))[3]
  cat("n=",n,", time=",t1,"\n")
  melt.stats <- melt(stats, id.vars="id")
  boxplot(value~variable, data=melt.stats,
          main=paste("Distribution of Stats - sample",n,",", num.trials, "trials"))
  gc()
}
Plot Results: sample sizes 10^(1:6) and 100 trials
Scalable Performance: varying number of trials 200..5000
[Chart: axis labeled (10^x)]
Enabling Technologies
Oracle R Enterprise
• Oracle Advanced Analytics Option to Oracle Database
• Eliminate memory constraint of client R engine
• Minimize or eliminate data movement latency
• Execute R scripts through database server machine for scalability and performance
• Achieve scalability and performance by leveraging Oracle Database as HPC environment
• Enable integration and management of R scripts through SQL
• Operationalize entire R scripts in production applications – eliminate porting R code
• Avoid reinventing code to integrate R results into existing applications
[Diagram: a Client R Engine with the ORE packages connects through the Transparency Layer to Oracle Database (user tables, in-db stats) on the Database Server Machine; SQL interfaces such as SQL*Plus and SQLDeveloper also connect to the database.]
Oracle’s R Technologies
• Oracle R Distribution
• ROracle
• Oracle R Enterprise
• Oracle R Advanced Analytics for Hadoop
Software available to R Community for free
Come to our booth to learn more…
Resources
• Oracle R Distribution
• ROracle
• Oracle R Enterprise
• Oracle R Advanced Analytics for Hadoop
• Book: Using R to Unlock the Value of Big Data
• Blog: https://blogs.oracle.com/R/
• Forum: https://forums.oracle.com/forums/forum.jspa?forumID=1397
http://oracle.com/goto/R
FastR
• New implementation of R in Java
– Uses the new Truffle interpreter framework and Graal optimizing compiler in conjunction with the HotSpot™ JVM for high performance, scalability and portability
– Dynamically compiles, adaptively optimizes and deoptimizes at run time
– Joint effort: Oracle Labs (Germany, USA, Austria), JKU Linz (Austria), Purdue University (USA), TU Dortmund (Germany)
• Open-source project (research prototype!)
– GPLv2
– https://bitbucket.org/allr/fastr
• More info at the poster session