Introduction to SparkR
Shivaram Venkataraman, Hossein Falaki
Big Data & R
Data + DataFrames, Visualization, Libraries
Big Data & R: Challenges
• Data access: HDFS, Hive
• Capacity: single machine memory
• Parallelism: single thread
Apache Spark
Engine for large-scale data processing
• Fast, easy to use
• Runs everywhere: EC2, clusters, laptop, etc.
Spark: speed, scalable, flexible
+ R: statistics, visualization, DataFrames
= SparkR
Big Data & R: Patterns
1. Big Data, Small Learning
2. Partition Aggregate
3. Large Scale Machine Learning
1. Big Data, Small Learning
Data → Cleaning, Filtering, Aggregation → Collect → Subset
In R: DataFrames, Visualization, Libraries
1. Big Data, Small Learning

songs <- read.df("songs.json", "json")
newSongs <- filter(songs, songs$year > 2000)
ggplot(collect(newSongs))
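A fuller sketch of the collect-then-plot pattern above; the column names (`year`, `duration`) and the aesthetic mapping are illustrative assumptions, not taken from the original dataset.

```r
library(SparkR)
library(ggplot2)

# Load a JSON dataset as a Spark DataFrame (distributed, not in R memory)
songs <- read.df("songs.json", "json")

# Filter on the cluster, then collect only the small result into R
newSongs <- filter(songs, songs$year > 2000)
localSongs <- collect(newSongs)   # now an ordinary R data.frame

# Visualize locally with ggplot2; the columns used here are assumed
ggplot(localSongs, aes(x = year, y = duration)) + geom_point()
```

The key design point is that cleaning, filtering, and aggregation run on the cluster, and only the reduced subset crosses into R's single-machine memory.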
2. Partition Aggregate
Data → Params → Best Model
Parameter Tuning
params <- c(1e-3, 1e-1, 1e2)
data <- read.csv("t.csv")
train <- function(prm) {
  lm.ridge(y ~ x + z, data = data, lambda = prm)
}
lapply(params, train)
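The same parameter sweep can be run locally with base R's `lapply`; the synthetic data below is a made-up stand-in for the tutorial's `t.csv` (assumed columns `y`, `x`, `z`) so the sketch is self-contained.

```r
library(MASS)  # provides lm.ridge

# Synthetic stand-in for t.csv (assumed columns y, x, z)
set.seed(42)
data <- data.frame(x = rnorm(100), z = rnorm(100))
data$y <- 2 * data$x - data$z + rnorm(100, sd = 0.1)

params <- c(1e-3, 1e-1, 1e2)   # ridge penalties to try

# One model per penalty; each call is independent, so this parallelizes
train <- function(prm) lm.ridge(y ~ x + z, data = data, lambda = prm)
models <- lapply(params, train)

# Pick the model with the smallest generalized cross-validation score
best <- models[[which.min(sapply(models, function(m) m$GCV))]]
```

Because each `train(prm)` call is independent, swapping `lapply` for a parallel map (e.g. SparkR's `spark.lapply`) distributes the sweep without changing `train`.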
3. Large Scale Machine Learning
Data → Featurize → Learning → Model
training <- read.csv("t.csv")
model <- glm(delay ~ Distance + Dest, family = "gaussian", data = training)
summary(model)
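In SparkR itself, the same generalized linear model can be fit on a distributed DataFrame rather than a local `data.frame`; the schema (`delay`, `Distance`, `Dest`) follows the slide, while the file name and CSV options are assumptions.

```r
library(SparkR)

# Read the training data as a Spark DataFrame instead of a local data.frame
training <- read.df("flights.csv", "csv",
                    header = "true", inferSchema = "true")

# SparkR overloads glm() so the model is fit on the cluster via MLlib
model <- glm(delay ~ Distance + Dest, family = "gaussian", data = training)

# Model statistics, mirroring base R's summary() output
summary(model)
```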
Big Data & R
1. Big Data, Small Learning
2. Partition Aggregate
3. Large Scale Machine Learning
SparkR: Unified approach
SparkR DataFrames

people <- read.df("people.json", "json")
avgAge <- select(people, avg(people$age))
head(avgAge)
• Number of data sources
• Column functions, SQL
• Support for R UDFs
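A minimal sketch of the column-function and SQL support listed above; the view name and the age threshold are illustrative.

```r
library(SparkR)

people <- read.df("people.json", "json")

# Column functions: aggregate without pulling data into R
head(select(people, avg(people$age)))

# SQL: register the DataFrame as a temporary view and query it
createOrReplaceTempView(people, "people")
adults <- sql("SELECT name, age FROM people WHERE age >= 18")
head(adults)
```

Both styles return Spark DataFrames, so they compose: a SQL result can be filtered further with `filter()` before a final `collect()`.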
Large Scale Machine Learning
Integration with MLlib

Key features:
• R-like formulas
• Model statistics

model <- glm(a ~ b + c, data = df)
summary(model)
Partition Aggregate
spark.lapply: simple, parallel API
Examples: parameter tuning, model averaging
Can include existing R packages
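The spark.lapply pattern above can be sketched as follows: like base `lapply`, it applies a function to each element, but each call runs as a Spark task on the cluster. The ridge-regression setup mirrors the earlier parameter-tuning slide; the data is assumed small enough to ship to each worker along with the closure.

```r
library(SparkR)

sparkR.session()   # start a Spark session if one is not already running

params <- c(1e-3, 1e-1, 1e2)
data <- read.csv("t.csv")      # small local data, shipped to each task

# Each lambda is trained in a separate Spark task; results come back as a list
models <- spark.lapply(params, function(prm) {
  library(MASS)                # packages must be loaded inside the closure
  lm.ridge(y ~ x + z, data = data, lambda = prm)
})
```

Note the `library(MASS)` call inside the function: each worker evaluates the closure in a fresh R process, so required packages must be attached there, not just in the driver.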
SparkR Status
Open source -- part of Apache Spark
> 60 committers from UC Berkeley, Databricks, IBM, Intel, Alteryx, etc.
Contributions welcome!
Tutorial Outline
Part 1: Data Exploration
• ETL: data loading, schema
• Exploration: filter, clean, aggregate, etc.
• Visualization: integration with ggplot
Part 2: Advanced Analytics (after the break)
Tutorial Setup
Databricks Community Edition: free online service for learning Apache Spark
• Each user gets a dedicated micro cluster
• Cluster is terminated after 1 hour of inactivity
Sign up at http://databricks.com/ce

Tutorial Setup: Databricks Notebooks
• Interactive workspace
• Markdown + R, Python, Scala, SQL
• Multiple users can collaborate on a notebook
• Notebooks can be exported/imported
• Examples and tutorials in R/Python/Scala
SparkR: Big data processing from R
• DataFrames for ETL, data exploration
• Support for advanced analytics
Tutorial Next Steps
• Sign up at http://databricks.com/ce
• Part 1: tiny.cc/sparkr-tutorial-part1
• Fill out our survey at tiny.cc/sparkr-user-survey