Introduction to SparkR
Shivaram Venkataraman, Hossein Falaki
Big Data & R
Data + DataFrames, Visualization, Libraries
Big Data & R: Challenges
• Data access: HDFS, Hive
• Capacity: single machine memory
• Parallelism: single thread
Apache Spark
Engine for large-scale data processing
• Fast, easy to use
• Runs everywhere: EC2, clusters, laptop, etc.
Spark: speed, scalable, flexible
+ R: statistics, visualization, DataFrames
= SparkR
Big Data & R: Patterns
1. Big Data, Small Learning
2. Partition Aggregate
3. Large Scale Machine Learning
1. Big Data, Small Learning
Data → Cleaning, Filtering, Aggregation → Collect → Subset
In R: DataFrames, Visualization, Libraries
1. Big Data, Small Learning

songs <- read.df("songs.json", "json")
newSongs <- filter(songs, songs$year > 2000)
ggplot(collect(newSongs))
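A fuller sketch of the collect-then-plot pattern above; the column names (`year`, `duration`) and the aesthetic mapping are illustrative assumptions, not taken from the original dataset.

```r
library(SparkR)
library(ggplot2)

# Load a JSON dataset as a Spark DataFrame (distributed, not in R memory)
songs <- read.df("songs.json", "json")

# Filter on the cluster, then collect only the small result into R
newSongs <- filter(songs, songs$year > 2000)
localSongs <- collect(newSongs)   # now an ordinary R data.frame

# Visualize locally with ggplot2; the columns used here are assumed
ggplot(localSongs, aes(x = year, y = duration)) + geom_point()
```

The key design point is that cleaning, filtering, and aggregation run on the cluster, and only the reduced subset crosses into R's single-machine memory.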
2. Partition Aggregate
Data → Params → Best Model
Parameter Tuning
params <- c(1e-3, 1e-1, 1e2)
data <- read.csv("t.csv")
train <- function(prm) {
  lm.ridge(y ~ x + z, data = data, lambda = prm)
}
lapply(params, train)
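The same parameter sweep can be run locally with base R's `lapply`; the synthetic data below is a made-up stand-in for the tutorial's `t.csv` (assumed columns `y`, `x`, `z`) so the sketch is self-contained.

```r
library(MASS)  # provides lm.ridge

# Synthetic stand-in for t.csv (assumed columns y, x, z)
set.seed(42)
data <- data.frame(x = rnorm(100), z = rnorm(100))
data$y <- 2 * data$x - data$z + rnorm(100, sd = 0.1)

params <- c(1e-3, 1e-1, 1e2)   # ridge penalties to try

# One model per penalty; each call is independent, so this parallelizes
train <- function(prm) lm.ridge(y ~ x + z, data = data, lambda = prm)
models <- lapply(params, train)

# Pick the model with the smallest generalized cross-validation score
best <- models[[which.min(sapply(models, function(m) m$GCV))]]
```

Because each `train(prm)` call is independent, swapping `lapply` for a parallel map (e.g. SparkR's `spark.lapply`) distributes the sweep without changing `train`.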
3. Large Scale Machine Learning
Data → Featurize → Learning → Model
training <- read.csv("t.csv")
model <- glm(delay ~ Distance + Dest, family = "gaussian", data = training)
summary(model)
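In SparkR itself, the same generalized linear model can be fit on a distributed DataFrame rather than a local `data.frame`; the schema (`delay`, `Distance`, `Dest`) follows the slide, while the file name and CSV options are assumptions.

```r
library(SparkR)

# Read the training data as a Spark DataFrame instead of a local data.frame
training <- read.df("flights.csv", "csv",
                    header = "true", inferSchema = "true")

# SparkR overloads glm() so the model is fit on the cluster via MLlib
model <- glm(delay ~ Distance + Dest, family = "gaussian", data = training)

# Model statistics, mirroring base R's summary() output
summary(model)
```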
Big Data & R
1. Big Data, Small Learning
2. Partition Aggregate
3. Large Scale Machine Learning
SparkR: Unified approach
SparkR DataFrames

people <- read.df("people.json", "json")
avgAge <- select(people, avg(people$age))
head(avgAge)
• Number of data sources
• Column functions, SQL
• Support for R UDFs
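A minimal sketch of the column-function and SQL support listed above; the view name and the age threshold are illustrative.

```r
library(SparkR)

people <- read.df("people.json", "json")

# Column functions: aggregate without pulling data into R
head(select(people, avg(people$age)))

# SQL: register the DataFrame as a temporary view and query it
createOrReplaceTempView(people, "people")
adults <- sql("SELECT name, age FROM people WHERE age >= 18")
head(adults)
```

Both styles return Spark DataFrames, so they compose: a SQL result can be filtered further with `filter()` before a final `collect()`.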
Large Scale Machine Learning
Integration with MLlib

Key features:
• R-like formulas
• Model statistics

model <- glm(a ~ b + c, data = df)
summary(model)
Partition Aggregate
spark.lapply: simple, parallel API
Examples: parameter tuning, model averaging
Can include existing R packages
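The spark.lapply pattern above can be sketched as follows: like base `lapply`, it applies a function to each element, but each call runs as a Spark task on the cluster. The ridge-regression setup mirrors the earlier parameter-tuning slide; the data is assumed small enough to ship to each worker along with the closure.

```r
library(SparkR)

sparkR.session()   # start a Spark session if one is not already running

params <- c(1e-3, 1e-1, 1e2)
data <- read.csv("t.csv")      # small local data, shipped to each task

# Each lambda is trained in a separate Spark task; results come back as a list
models <- spark.lapply(params, function(prm) {
  library(MASS)                # packages must be loaded inside the closure
  lm.ridge(y ~ x + z, data = data, lambda = prm)
})
```

Note the `library(MASS)` call inside the function: each worker evaluates the closure in a fresh R process, so required packages must be attached there, not just in the driver.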
SparkR Status
Open source -- part of Apache Spark
> 60 committers from UC Berkeley, Databricks, IBM, Intel, Alteryx, etc.
Contributions welcome!
Tutorial Outline
Part 1: Data Exploration
• ETL: data loading, schema
• Exploration: filter, clean, aggregate, etc.
• Visualization: integration with ggplot
Part 2: Advanced Analytics (after the break)
Tutorial Setup
Databricks Community Edition: free online service for learning Apache Spark
• Each user gets a dedicated micro cluster
• Cluster is terminated after 1 hour of inactivity
Sign up at http://databricks.com/ce

Tutorial Setup: Databricks Notebooks
• Interactive workspace
• Markdown + R, Python, Scala, SQL
• Multiple users can collaborate on a notebook
• Notebooks can be exported/imported
• Examples and tutorials in R/Python/Scala
SparkR: Big data processing from R
• DataFrames for ETL, data exploration
• Support for advanced analytics
Tutorial Next Steps
• Sign up at http://databricks.com/ce
• Part 1: tiny.cc/sparkr-tutorial-part1
• Fill out our survey at tiny.cc/sparkr-user-survey