Introduction to Quirrel & R OSCON, July 25 John A. De Goes @jdegoes

Quirrel & R for Dummies

DESCRIPTION

Quirrel is a statistically oriented language designed principally for data analysis. It combines a purely declarative, implicitly parallel design with features needed by data scientists. In this presentation, John A. De Goes (chairman of the Quirrel language committee) introduces Quirrel and shows how it can be used to solve problems across large data sets.

Over the past 5 years, R has enjoyed tremendous success in the data science community, and for good reason: it comes with batteries included and sports one of the best communities in the data science world. Although R is not an easy programming language to learn, the basics can be picked up rather quickly. In this talk, John A. De Goes walks through the core syntax and features of R, providing enough training for anyone to do simple analysis.

Citation preview

Page 1: Quirrel & R for Dummies

Introduction to Quirrel & R

OSCON, July 25

John A. De Goes, @jdegoes

Page 2: Quirrel & R for Dummies

overview

Quirrel

Quirrel is an open standard language designed for the analysis of large-scale, heterogeneous data sets.

R

R is an open source programming language and interactive environment for statistical computing and graphics.

Page 3: Quirrel & R for Dummies

quirrel versus r

Quirrel

CONS

● Young language, still evolving
● Nascent community
● Intentionally limited

PROS

● Simple, consistent core
● Fully parallel
● Purely functional
● Programmatic or interactive

R

PROS

● Mature language, "feature-complete"
● Robust community
● Turing-complete

CONS

● Complex core
● Mostly parallel
● Imperative
● Interactive

Page 4: Quirrel & R for Dummies

what's the right tool for the job?

[Flowchart. Decision nodes: "Small amount of data?", then "Simple analytics?" on each branch. Outcomes: Quirrel, Hive / Pig, SQL, R.]

Page 5: Quirrel & R for Dummies

sneak peek

Quirrel

pageViews := //pageViews
avg := mean(pageViews.duration)
bound := 1.5 * stdDev(pageViews.duration)
pageViews.userId where pageViews.duration > avg + bound

R

pageViews <- read.csv("pageViews.csv")
avg <- mean(pageViews$duration)
bound <- 1.5 * sd(pageViews$duration)
userIds <- subset(pageViews, duration > avg + bound, select=userId)
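The R version can be tried without a pageViews.csv file. The following is a minimal sketch that swaps the CSV for a small in-memory data frame; the column names userId and duration come from the slide, while the values are made up for illustration.

```r
# Toy stand-in for read.csv("pageViews.csv"); the values are invented.
pageViews <- data.frame(
  userId   = c("u1", "u2", "u3", "u4", "u5", "u6"),
  duration = c(10, 12, 11, 13, 9, 100)
)

avg   <- mean(pageViews$duration)
bound <- 1.5 * sd(pageViews$duration)

# Keep the users whose duration is more than 1.5 standard deviations above the mean.
userIds <- subset(pageViews, duration > avg + bound, select = userId)
print(userIds)
```

With this toy data only u6 passes the filter, because its duration of 100 sits well above the other five values.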

Page 6: Quirrel & R for Dummies

data models

Quirrel

Everything is a random variable.

true, false
1, 3.1415
null, undefined
"Mary Jane"
[1, 2, 3]
[[1, 2, 3], [4, 5, 6], [7, 8, 9]]
{"name": "John"}
1 || 2 || 3 || 4 || 5 || 6
[1, "foo", [1, false]]

R

Everything is an ordered sequence of values.*

TRUE, FALSE
1, 3.1415
NA, NaN, Inf
"Mary Jane"
c(1, 2, 3)
array(c(1,4,7,2,5,8,3,6,9), dim=c(3,3))
data.frame(name=c("John"))
c(1, 2, 3, 4, 5, 6)
list(1, "foo", list(1, FALSE))

*Except when it's not.
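A short sketch of what "ordered sequence of values" means in practice on the R side, and where the exceptions start. The variable names x, v, l and m are only for illustration.

```r
# A bare number is already a vector of length 1.
x <- 3.1415
length(x)        # 1

# c() flattens its arguments into one atomic vector of a single common type.
v <- c(1, "foo", FALSE)
class(v)         # "character"

# list() keeps mixed types and nesting, so it plays the role of [1, "foo", [1, false]].
l <- list(1, "foo", list(1, FALSE))
length(l)        # 3

# array() fills column by column, so the rows of m read 1 2 3 / 4 5 6 / 7 8 9.
m <- array(c(1, 4, 7, 2, 5, 8, 3, 6, 9), dim = c(3, 3))
m[1, ]           # 1 2 3
```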

Page 7: Quirrel & R for Dummies

comments

Quirrel

-- ignore me

(- ignore me too! -)

R

# ignore me

# ignore # me # too!

Page 8: Quirrel & R for Dummies

basic expressions

Quirrel

2 * 4

(1 + 2) * 3 / 9 > 23

3 > 2 & (1 != 2)

2 + 2 = 4

false & true | !false

undefined = undefined

R

2 * 4

(1 + 2) * 3 / 9 > 23

3 > 2 & (1 != 2)

2 + 2 == 4

FALSE & TRUE | !FALSE

NA == NA
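The last R expression is worth a closer look, since comparing with NA does not behave like an ordinary equality test. A minimal sketch using base R only:

```r
NA == NA              # NA, not TRUE: the result of the comparison is itself unknown
is.na(NA)             # TRUE: the usual way to test for a missing value

# NA propagates through most arithmetic and reductions ...
x <- c(1, NA, 3)
sum(x)                # NA
# ... unless the reduction is told to drop missing values.
sum(x, na.rm = TRUE)  # 4
```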

Page 9: Quirrel & R for Dummies

named expressions

Quirrel

x := 2

square := x * x

R

x <- 2

square <- x * x

Page 10: Quirrel & R for Dummies

loading data

Quirrel

//pageViews

load("/pageViews")

//daily_snapshots/*

R

read.csv("pageViews")

read.csv("pageViews")

lapply(Sys.glob("daily_snapshots/*"), read.csv)
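To see the wildcard load end to end in R, here is a sketch that first writes two tiny CSV files to a temporary folder and then reads them back. The file names day1.csv and day2.csv and their contents are made up; do.call(rbind, ...) is one common way to stack the results and is an addition to what the slide shows.

```r
# Write two tiny CSV files to a temporary folder (made-up data for illustration).
dir <- file.path(tempdir(), "daily_snapshots")
dir.create(dir, showWarnings = FALSE)
write.csv(data.frame(userId = c("u1", "u2"), duration = c(10, 20)),
          file.path(dir, "day1.csv"), row.names = FALSE)
write.csv(data.frame(userId = "u3", duration = 30),
          file.path(dir, "day2.csv"), row.names = FALSE)

# Sys.glob() expands the wildcard; lapply() reads each file into its own data frame.
snapshots <- lapply(Sys.glob(file.path(dir, "*.csv")), read.csv)

# Stack the per-file data frames into one, similar in spirit to //daily_snapshots/*.
allViews <- do.call(rbind, snapshots)
nrow(allViews)  # 3
```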

Page 11: Quirrel & R for Dummies

drilldown

Quirrel

pageViews := //pageViews

pageViews.userId

pageViews.keywords[2]

R

pageViews <- read.csv("pageViews")

pageViews$userId

vector[2]

list[[1]]

Page 12: Quirrel & R for Dummies

reductions

Quirrel

count(purchases)

sum(purchases.total)

mean(purchases.total)

stdDev(purchases.total)

R

nrow(purchases)

sum(purchases$total)

mean(purchases$total)

sd(purchases$total)
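One thing to watch for when counting in R: length() means different things for a data frame and for one of its columns. A small sketch with a made-up purchases table (the id column is hypothetical):

```r
# Made-up purchases table for illustration.
purchases <- data.frame(id = 1:4, total = c(10, 20, 30, 40))

nrow(purchases)          # 4 rows: the counterpart of count(purchases)
length(purchases)        # 2, because length() of a data frame counts columns
length(purchases$total)  # 4, because a single column is a plain vector

sum(purchases$total)     # 100
mean(purchases$total)    # 25
sd(purchases$total)      # sample standard deviation (divides by n - 1)
```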

Page 13: Quirrel & R for Dummies

filtering

Quirrel

views.userId where views.duration > 1000

R

subset(views, duration > 1000, select=userId)
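subset() is not the only way to filter in R. As a sketch on made-up data, the same filter can be written with logical indexing, which some people prefer inside functions because it avoids subset()'s non-standard evaluation:

```r
# Made-up page views for illustration.
views <- data.frame(userId = c("u1", "u2", "u3"), duration = c(500, 1500, 2500))

# subset() evaluates the condition inside the data frame.
a <- subset(views, duration > 1000, select = userId)

# The same filter with logical indexing; drop = FALSE keeps the result a data frame.
b <- views[views$duration > 1000, "userId", drop = FALSE]

identical(as.character(a$userId), as.character(b$userId))  # TRUE
```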

Page 14: Quirrel & R for Dummies

augmentation

Quirrel

clicks with {dow: dayOfWeek(clicks.ts)}

R

clicks$dow <- weekdays(clicks$ts)
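In R, weekdays() expects a Date or POSIXt value, so a timestamp column read from a CSV usually needs converting first. A minimal sketch, assuming the timestamps arrive as text in YYYY-MM-DD form; the dowNum column is a hypothetical extra, not something from the slide.

```r
# Made-up timestamps stored as text, as they would be after read.csv().
clicks <- data.frame(ts = c("2012-07-16", "2012-07-17"))

# Convert the text column to Date before asking for the weekday.
clicks$ts  <- as.Date(clicks$ts)
clicks$dow <- weekdays(clicks$ts)  # day names follow the current locale

# A locale-independent alternative: ISO weekday number, 1 = Monday ... 7 = Sunday.
clicks$dowNum <- as.integer(format(clicks$ts, "%u"))
```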

Page 15: Quirrel & R for Dummies

libraries

Quirrel

import std::stats::rank

pageViews := //pageViews

rank(pageViews.duration)

R

library(data.table)

pageViews <- read.csv("views.csv")

rank(pageViews$duration)

Page 16: Quirrel & R for Dummies

user-defined functions

Quirrel

ctr(day) :=
  count(clicks where clicks.day = day) /
  count(impressions where impressions.day = day)

ctr("Monday")

R

ctr <- function(d) {
  c1 <- subset(clicks, clicks$day == d)
  c2 <- subset(impressions, impressions$day == d)
  length(c1$day) / length(c2$day)
}

ctr("Monday")
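A runnable sketch of the R function on made-up data. It swaps the subset()/length() pair for sum() over a logical vector, which counts the matching rows directly; the clicks and impressions tables here are invented for illustration.

```r
# Made-up event tables: 2 Monday clicks, 1 Tuesday click, 4 impressions on each day.
clicks      <- data.frame(day = c("Monday", "Monday", "Tuesday"))
impressions <- data.frame(day = c(rep("Monday", 4), rep("Tuesday", 4)))

# Click-through rate for one day: matching clicks divided by matching impressions.
ctr <- function(d) {
  sum(clicks$day == d) / sum(impressions$day == d)
}

ctr("Monday")   # 0.5
ctr("Tuesday")  # 0.25
```

Calling sapply(c("Monday", "Tuesday"), ctr) applies the function to several days at once and returns a named vector.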

Page 17: Quirrel & R for Dummies

grouping - implicit constraints

Quirrel

solve 'day
  {day: 'day,
   ctr: count(clicks where clicks.day = 'day) /
        count(impressions where impressions.day = 'day)}

R

clicks$count1 <- 0
c1 <- aggregate(count1 ~ day, data = clicks, FUN=length)

impressions$count2 <- 0
c2 <- aggregate(count2 ~ day, data = impressions, FUN=length)

r <- merge(c1, c2)

ctr <- data.frame(day = r$day, ctr = r$count1 / r$count2)
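As a variation on the R side, per-day counts can also come from table() rather than from a dummy column plus aggregate(). This is a sketch on made-up data, not the approach the slide uses; the column names count1 and count2 are kept to match the slide.

```r
# Made-up event tables for illustration.
clicks      <- data.frame(day = c("Monday", "Monday", "Tuesday"))
impressions <- data.frame(day = c(rep("Monday", 4), rep("Tuesday", 4)))

# table() counts rows per distinct day; as.data.frame() turns the counts into columns.
c1 <- as.data.frame(table(day = clicks$day), responseName = "count1")
c2 <- as.data.frame(table(day = impressions$day), responseName = "count2")

# merge() joins the two count tables on their shared "day" column.
r <- merge(c1, c2)
ctr <- data.frame(day = r$day, ctr = r$count1 / r$count2)
print(ctr)
```

Days that appear in only one of the two tables are dropped by merge() by default, which mirrors the fact that a ratio needs both counts.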

Page 18: Quirrel & R for Dummies

grouping - explicit constraints

Quirrel

solve 'date = purchases.date
  {date: 'date,
   cummTotal: sum(purchases.total where purchases.date < 'date)}

R

purchases2 <- purchases[order(purchases$date), ]

data.frame(
  date = purchases2$date,
  cummTotal = cumsum(purchases2$total))
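A runnable sketch of the R side on made-up data, assuming one purchase row per date. It also shows a detail that is easy to miss: cumsum() includes the current row, whereas the Quirrel query sums only strictly earlier dates, so subtracting the row's own total gives the closer match. The beforeTotal column name is hypothetical.

```r
# Made-up purchases, deliberately out of date order.
purchases <- data.frame(
  date  = as.Date(c("2012-11-03", "2012-11-01", "2012-11-02")),
  total = c(30, 10, 20)
)

# Sort rows by date; the trailing comma selects rows rather than columns.
purchases2 <- purchases[order(purchases$date), ]

running <- cumsum(purchases2$total)
result <- data.frame(
  date        = purchases2$date,
  cummTotal   = running,                     # includes the current date: 10, 30, 60
  beforeTotal = running - purchases2$total   # strictly earlier dates: 0, 10, 30
)
print(result)
```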

Page 19: Quirrel & R for Dummies

Questions?

Nov - Dec 2012

Page 20: Quirrel & R for Dummies

Quirrel / R Challenge Problems

Page 21: Quirrel & R for Dummies

challenge problem #1

■ Using the /london_medals/summer_games data, find the youngest athlete to win a medal

Download dataset at http://labcoat.precog.com

Page 22: Quirrel & R for Dummies

challenge problem #2

■ Using the /london_medals/summer_games data, find the oldest athlete to win a medal

Page 23: Quirrel & R for Dummies

challenge problem #3

■ Using the /london_medals/summer_games data, find the average age at which athletes win medals

Page 24: Quirrel & R for Dummies

challenge problem #4

■ Using the /london_medals/summer_games data, find the most common age to win a medal
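As a possible starting point for the four challenge problems in R, here is a sketch on a toy table. The column names Name and Age are assumptions and may not match the real data set; the rows are invented.

```r
# Made-up rows; the real data set's column names may differ from Name and Age.
medals <- data.frame(
  Name = c("A", "B", "C", "D", "E"),
  Age  = c(17, 24, 24, 31, 39)
)

youngest <- medals[which.min(medals$Age), ]   # challenge #1
oldest   <- medals[which.max(medals$Age), ]   # challenge #2
avgAge   <- mean(medals$Age, na.rm = TRUE)    # challenge #3

# Challenge #4: tabulate the ages and take the most frequent one.
ageCounts <- table(medals$Age)
commonAge <- as.numeric(names(ageCounts)[which.max(ageCounts)])
```

which.min() and which.max() return only the first match, so ties for youngest or oldest would need a filter such as medals[medals$Age == min(medals$Age), ] instead.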

Page 25: Quirrel & R for Dummies

Thank you!

Follow me on Twitter: @jdegoes

Learn more about R: r-project.org

Download R: r-project.org/mirrors.html

Sign up for a free Precog account: precog.com

Learn more about Quirrel: quirrel-lang.org
