Quirrel is a statistically oriented language designed principally for data analysis. It combines a purely declarative, implicitly parallel design with features needed by data scientists. In this presentation, John A. De Goes (chairman of the Quirrel language committee) introduces Quirrel and shows how it can be used to solve problems across large data sets. Over the past 5 years, R has enjoyed tremendous success in the data science community, and for good reason: it comes with batteries included and has one of the best communities in the data science world. Although R is not an easy programming language to learn, the basics can be picked up rather quickly. In this talk, John A. De Goes walks through the core syntax and features of R, providing enough training to give anyone the ability to do simple analysis.
Introduction to Quirrel & R
OSCON, July 25
John A. De Goes
@jdegoes
overview
Quirrel
Quirrel is an open standard language designed for the analysis of large-scale, heterogeneous data sets.
R
R is an open source programming language and interactive environment for statistical computing and graphics.
quirrel versus r
Quirrel
CONS
● Young language, still evolving
● Nascent community
● Intentionally limited
PROS
● Simple, consistent core
● Fully parallel
● Purely functional
● Programmatic or interactive
R
PROS
● Mature language, "feature-complete"
● Robust community
● Turing-complete
CONS
● Complex core
● Mostly parallel
● Imperative
● Interactive
what's the right tool for the job?
[Flowchart: YES/NO decisions on "Small amount of data?" and "Simple analytics?" lead to one of four tools: Quirrel, Hive / Pig, SQL, or R.]
sneak peek
Quirrel
pageViews := //pageViews
avg := mean(pageViews.duration)
bound := 1.5 * stdDev(pageViews.duration)
pageViews.userId where pageViews.duration > avg + bound
R
pageViews <- read.csv("pageViews.csv")
avg <- mean(pageViews$duration)
bound <- 1.5 * sd(pageViews$duration)
userIds <- subset(pageViews, duration > avg + bound, select=userId)
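The R side of the sneak peek can be tried without a CSV file. The following is only a sketch: the data frame and its values are made up for illustration and stand in for pageViews.csv.

```r
# Hypothetical stand-in for pageViews.csv; the values are invented.
pageViews <- data.frame(
  userId   = c("u1", "u2", "u3", "u4", "u5", "u6"),
  duration = c(10, 12, 11, 9, 13, 60)
)

avg   <- mean(pageViews$duration)       # average duration
bound <- 1.5 * sd(pageViews$duration)   # 1.5 standard deviations

# Keep only the users whose duration exceeds avg + bound.
userIds <- subset(pageViews, duration > avg + bound, select = userId)
print(userIds)
```

With this sample only the 60-second view lies above the threshold, so a single user is returned.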
data models
Quirrel
Everything is a random variable.
true, false
1, 3.1415
null, undefined
"Mary Jane"
[1, 2, 3]
[[1, 2, 3], [4, 5, 6], [7, 8, 9]]
{"name": "John"}
1 || 2 || 3 || 4 || 5 || 6
[1, "foo", [1, false]]
R
Everything is an ordered sequence of values.*
TRUE, FALSE
1, 3.1415
NA, NaN, Inf
"Mary Jane"
c(1, 2, 3)
array(c(1,4,7,2,5,8,3,6,9), dim=c(3,3))
data.frame(name=c("John"))
c(1, 2, 3, 4, 5, 6)
list(1, "foo", list(1, FALSE))
*Except when it's not.
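One way to read the footnote on the R side, shown as a small sketch rather than anything taken from the slides: atomic vectors hold a single type and coerce mixed values, lists do not, and a matrix is a vector that is filled column by column.

```r
# An atomic vector holds one type, so mixed values are coerced to character.
v <- c(1, "foo", FALSE)
print(class(v))

# A list keeps each element's own type, like the heterogeneous Quirrel array.
l <- list(1, "foo", list(1, FALSE))
print(sapply(l, class))

# array() fills column by column, so the rows of the Quirrel matrix
# must be supplied interleaved.
m <- array(c(1, 4, 7, 2, 5, 8, 3, 6, 9), dim = c(3, 3))
print(m[1, ])   # first row: 1 2 3
```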
comments
Quirrel
-- ignore me
(- ignore me too! -)
R
# ignore me
# ignore # me # too!
basic expressions
Quirrel
2 * 4
(1 + 2) * 3 / 9 > 23
3 > 2 & (1 != 2)
2 + 2 = 4
false & true | !false
undefined = undefined
R
2 * 4
(1 + 2) * 3 / 9 > 23
3 > 2 & (1 != 2)
2 + 2 == 4
FALSE & TRUE | !FALSE
NA == NA
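The last two R expressions are worth evaluating by hand. This sketch uses only base R; the variable names are illustrative.

```r
# Comparing with NA propagates the missing value instead of returning TRUE.
cmp <- NA == NA
print(cmp)            # NA

# is.na() is the usual way to test for a missing value.
print(is.na(NA))      # TRUE

# Precedence: ! binds tighter than &, which binds tighter than |.
logic <- FALSE & TRUE | !FALSE
print(logic)          # TRUE
```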
named expressions
Quirrel
x := 2
square := x * x
R
x <- 2
square <- x * x
loading data
Quirrel
//pageViews
load("/pageViews")
//daily_snapshots/*
R
read.csv("pageViews")
read.csv("pageViews")
lapply(Sys.glob("daily_snapshots/*"), read.csv)
drilldown
Quirrel
pageViews := //pageViews
pageViews.userId
pageViews.keywords[2]
R
pageViews <- read.csv("pageViews")
pageViews$userId
vector[2]
list[[1]]
reductions
Quirrel
count(purchases)
sum(purchases.total)
mean(purchases.total)
stdDev(purchases.total)
R
nrow(purchases)
sum(purchases$total)
mean(purchases$total)
sd(purchases$total)
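A caveat when counting rows in R: on a data frame, length() returns the number of columns, while nrow() returns the number of rows. The sketch below assumes purchases is a data frame; its values and the item column are made up.

```r
# Hypothetical purchases table with three rows and two columns.
purchases <- data.frame(
  total = c(10, 20, 30),
  item  = c("a", "b", "c")
)

nCols  <- length(purchases)       # 2: length() counts columns
nRows  <- nrow(purchases)         # 3: the row count, like count(purchases)
total  <- sum(purchases$total)    # 60
avg    <- mean(purchases$total)   # 20
spread <- sd(purchases$total)     # 10 (sample standard deviation)
print(c(nCols, nRows, total, avg, spread))
```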
filtering
Quirrel
views.userId where views.duration > 1000
R
subset(views, duration > 1000, select=userId)
augmentation
Quirrel
clicks with {dow: dayOfWeek(clicks.ts)}
R
clicks$dow <- weekdays(clicks$ts)
libraries
Quirrel
import std::stats::rank
pageViews := //pageViews
rank(pageViews.duration)
R
library(data.table)
pageViews <- read.csv("views.csv")
rank(pageViews$duration)
user-defined functions
Quirrel
ctr(day) := count(clicks where clicks.day = day) / count(impressions where impressions.day = day)
ctr("Monday")
R
ctr <- function(d) {
  c1 <- subset(clicks, clicks$day == d)
  c2 <- subset(impressions, impressions$day == d)
  length(c1$day) / length(c2$day)
}
ctr("Monday")
grouping - implicit constraints
Quirrel
solve 'day
  {day: 'day, ctr: count(clicks where clicks.day = 'day) / count(impressions where impressions.day = 'day)}
R
clicks$count1 <- 0
c1 <- aggregate(count1 ~ day, data = clicks, FUN=length)
impressions$count2 <- 0
c2 <- aggregate(count2 ~ day, data = impressions, FUN=length)
r <- merge(c1, c2)
ctr <- data.frame(day = r$day, ctr = r$count1 / r$count2)
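To see what the R version produces, it can be run against a few made-up events. This is a sketch under the assumption that clicks and impressions are data frames with a day column; the sample rows are invented.

```r
# Invented events; only the `day` column matters here.
clicks      <- data.frame(day = c("Monday", "Tuesday"))
impressions <- data.frame(day = c("Monday", "Monday", "Monday", "Monday",
                                  "Tuesday", "Tuesday"))

# A dummy column gives aggregate() something to count per day.
clicks$count1 <- 0
c1 <- aggregate(count1 ~ day, data = clicks, FUN = length)
impressions$count2 <- 0
c2 <- aggregate(count2 ~ day, data = impressions, FUN = length)

# merge() joins the two per-day counts on the shared `day` column.
r   <- merge(c1, c2)
ctr <- data.frame(day = r$day, ctr = r$count1 / r$count2)
print(ctr)
```

Because merge() performs an inner join by default, a day that has impressions but no clicks does not appear in the result.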
grouping - explicit constraints
Quirrel
solve 'date = purchases.date
  {date: 'date, cummTotal: sum(purchases.total where purchases.date < 'date)}
R
purchases2 <- purchases[order(purchases$date), ]
data.frame(date = purchases2$date, cummTotal = cumsum(purchases2$total))
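The two sides are not quite equivalent: the Quirrel query sums purchases strictly before each date, while cumsum() includes the current row. The sketch below shows both variants on invented data and assumes one purchase per date; the names running and strictTotal are illustrative.

```r
# Invented purchases, deliberately out of order.
purchases <- data.frame(
  date  = as.Date(c("2012-11-03", "2012-11-01", "2012-11-02")),
  total = c(30, 10, 20)
)

# Sort rows by date; the trailing comma selects rows, not columns.
purchases2 <- purchases[order(purchases$date), ]

# Inclusive running total, as produced by cumsum().
running <- data.frame(date = purchases2$date,
                      cummTotal = cumsum(purchases2$total))

# Strict variant (dates before the current one only): drop the current row.
strictTotal <- running$cummTotal - purchases2$total
print(running)
print(strictTotal)
```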
Questions?
Nov - Dec 2012
Quirrel / R Challenge Problems
Nov - Dec 2012
challenge problem #1
■ Using the /london_medals/summer_games data, find the youngest athlete to win a medal
challenge problem #2
■ Using the /london_medals/summer_games data, find the oldest athlete to win a medal
challenge problem #3
■ Using the /london_medals/summer_games data, find the average age at which athletes win medals
challenge problem #4
■ Using the /london_medals/summer_games data, find the most common age to win a medal
Download dataset at http://labcoat.precog.com
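As a starting point for the R side, the four questions map onto which.min(), which.max(), mean(), and table(). The sketch below does not use the real dataset: the medals data frame, its name and age columns, and all values are hypothetical, so the column names will need adjusting to the actual file.

```r
# Hypothetical medal winners; the real column names may differ.
medals <- data.frame(
  name = c("A", "B", "C", "D"),
  age  = c(19, 31, 24, 24)
)

youngest <- medals[which.min(medals$age), ]   # problem 1
oldest   <- medals[which.max(medals$age), ]   # problem 2
avgAge   <- mean(medals$age)                  # problem 3

# Problem 4: table() counts each age; which.max() picks the most frequent.
ageCounts <- table(medals$age)
commonAge <- as.numeric(names(ageCounts)[which.max(ageCounts)])

print(youngest)
print(oldest)
print(c(avgAge, commonAge))
```

On ties, which.min() and which.max() return only the first match, so a full answer may need to handle several athletes of the same age.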
Thank you!
Follow me on Twitter: @jdegoes
Learn more about R: r-project.org
Download R: r-project.org/mirrors.html
Sign up for a free Precog account: precog.com
Learn more about Quirrel: quirrel-lang.org