R Workshop I
get to know NYC open data portal and start to use R
Vivian Zhang for NYC-open-data meetup http://www.meetup.com/NYC-Open-Data/
R Workshop I http://www.nycopendata.com/RworkshopI/index.html#1 1 of 27 6/13/14, 1:50 PM

R workshop i r basic (4th time)

DESCRIPTION

NYC Data Science Academy, NYC Open Data Meetup, Big Data, Data Science, NYC, Vivian Zhang, SupStat Inc, R programming, R workshop, ggplot2

Overview

· NYC open data portal
· RStudio
· R
· GitHub
· hack time

Advantage of using RStudio

· Ease of use
- install and load R packages
- keep track of R dev version
- download GitHub repositories
- debug faster

diamonds subsetting example

require(ggplot2)
head(diamonds)
##   carat       cut color clarity depth table price    x    y    z
## 1  0.23     Ideal     E     SI2  61.5    55   326 3.95 3.98 2.43
## 2  0.21   Premium     E     SI1  59.8    61   326 3.89 3.84 2.31
## 3  0.23      Good     E     VS1  56.9    65   327 4.05 4.07 2.31
## 4  0.29   Premium     I     VS2  62.4    58   334 4.20 4.23 2.63
## 5  0.31      Good     J     SI2  63.3    58   335 4.34 4.35 2.75
## 6  0.24 Very Good     J    VVS2  62.8    57   336 3.94 3.96 2.48

diamonds subsetting example

head(diamonds[-1, ])
##   carat       cut color clarity depth table price    x    y    z
## 2  0.21   Premium     E     SI1  59.8    61   326 3.89 3.84 2.31
## 3  0.23      Good     E     VS1  56.9    65   327 4.05 4.07 2.31
## 4  0.29   Premium     I     VS2  62.4    58   334 4.20 4.23 2.63
## 5  0.31      Good     J     SI2  63.3    58   335 4.34 4.35 2.75
## 6  0.24 Very Good     J    VVS2  62.8    57   336 3.94 3.96 2.48
## 7  0.24 Very Good     I    VVS1  62.3    57   336 3.95 3.98 2.47

diamonds subsetting example

head(diamonds[, -1])
##         cut color clarity depth table price    x    y    z
## 1     Ideal     E     SI2  61.5    55   326 3.95 3.98 2.43
## 2   Premium     E     SI1  59.8    61   326 3.89 3.84 2.31
## 3      Good     E     VS1  56.9    65   327 4.05 4.07 2.31
## 4   Premium     I     VS2  62.4    58   334 4.20 4.23 2.63
## 5      Good     J     SI2  63.3    58   335 4.34 4.35 2.75
## 6 Very Good     J    VVS2  62.8    57   336 3.94 3.96 2.48

diamonds subsetting example

head(diamonds[c(1, 2), ])
##   carat     cut color clarity depth table price    x    y    z
## 1  0.23   Ideal     E     SI2  61.5    55   326 3.95 3.98 2.43
## 2  0.21 Premium     E     SI1  59.8    61   326 3.89 3.84 2.31

diamonds subsetting example

names(diamonds)
## [1] "carat"   "cut"     "color"   "clarity" "depth"   "table"   "price"
## [8] "x"       "y"       "z"

head(diamonds[, c(T, T, F, F, F, F, T, F, F, F)])
##   carat       cut price
## 1  0.23     Ideal   326
## 2  0.21   Premium   326
## 3  0.23      Good   327
## 4  0.29   Premium   334
## 5  0.31      Good   335
## 6  0.24 Very Good   336

diamonds subsetting example

names(diamonds)
## [1] "carat"   "cut"     "color"   "clarity" "depth"   "table"   "price"
## [8] "x"       "y"       "z"

head(diamonds$carat)
## [1] 0.23 0.21 0.23 0.29 0.31 0.24

diamonds[diamonds$price == max(diamonds$price), ]
##       carat     cut color clarity depth table price   x    y    z
## 27750  2.29 Premium     I     VS2  60.8    60 18823 8.5 8.47 5.16

reading and subsetting data in R

· blank
- include all
· integer
- +: include; -: exclude
· logical
- include TRUEs
· character
- lookup by name

Source: Hadley Wickham
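The four index types listed above can be tried on a small data frame. This is only a minimal sketch: the data frame `df` and its values are made up for illustration and are not part of the workshop data.

```r
# Toy data frame for illustration only; the values are made up.
df <- data.frame(carat = c(0.23, 0.21, 0.29),
                 cut   = c("Ideal", "Premium", "Good"),
                 price = c(326, 326, 334),
                 stringsAsFactors = FALSE)

all_rows   <- df[, ]                      # blank: include everything
first_two  <- df[c(1, 2), ]               # positive integers: include rows 1 and 2
drop_first <- df[-1, ]                    # negative integer: exclude row 1
by_logical <- df[, c(TRUE, FALSE, TRUE)]  # logical: keep columns where TRUE
by_name    <- df[, c("carat", "price")]   # character: look up columns by name

print(by_name)
```

The logical and character forms select the same two columns here; looking up by name is usually easier to read and does not break if the column order changes.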

data structure in R

Source: Hadley Wickham

read in the open data

· read.table()
· read.csv()

rodent1year <- read.csv("C:\\Users\\zhangs\\Google Drive\\R code\\Rworkshop\\311_Service_Requests_from_2010_to_Present
header = TRUE, sep = ",")
dim(rodent1year)
summary(rodent1year)
table(rodent1year$Borough)

plot diamonds

with() is a generic function that evaluates expr in a local environment constructed from data.

In ggplot2, "aes" stands for "aesthetics", and geom_point() is the "geom" used to create scatterplots.

with(diamonds, plot(carat, price))
ggplot(diamonds, aes(x = carat, y = price)) + geom_point()
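To see what "evaluates expr in a local environment constructed from data" means in practice, here is a small sketch. It assumes the built-in `mtcars` data set as a stand-in so that it runs without ggplot2; the column names `mpg` and `wt` belong to `mtcars`, not to the workshop data.

```r
# Inside with(), column names can be used directly as variables.
a <- with(mtcars, mean(mpg / wt))

# The same computation, spelling out the data frame each time.
b <- mean(mtcars$mpg / mtcars$wt)

all.equal(a, b)
```

Both forms compute the same value; with() simply saves repeating the data frame name.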

plot diamonds

ggplot2 generates more sophisticated graphs than the traditional graphics package. Let us play with some color.

ggplot(diamonds, aes(x = carat, y = price, colour = cut)) + geom_point()

plot diamonds

Instead of fitting a linear relation, we try to fit a log-linear relation.

log(price) is quite linear in log(carat). Bingo!

ggplot(diamonds, aes(x = log(carat), y = log(price), colour = cut)) + geom_point()
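One way to check the visual impression numerically is to fit both models with lm(). This is a sketch that is not part of the original slides; the object names `fit_raw`, `fit_log`, `r2_raw` and `r2_log` are chosen here for illustration.

```r
library(ggplot2)  # provides the diamonds data set

# Straight-line fit on the original scale.
fit_raw <- lm(price ~ carat, data = diamonds)

# Straight-line fit after taking logs of both variables.
fit_log <- lm(log(price) ~ log(carat), data = diamonds)

r2_raw <- summary(fit_raw)$r.squared
r2_log <- summary(fit_log)$r.squared

coef(fit_log)  # intercept and slope on the log-log scale
```

The two R-squared values are measured on different response scales, so they are only a rough indication of which relation is closer to a straight line.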

plot diamonds

As the color letters go from D to J, the diamond becomes more and more yellow. The clarity grades "SI" (slightly included), "VS" (very slightly included) and "VVS" (very, very slightly included) describe the size of "internal imperfections" in the diamonds, and the number beside each grade ranks diamonds within it. "IF" is internally flawless.

ggplot(diamonds, aes(x = log(carat), y = log(price), colour = cut)) + geom_point() +
    facet_grid(clarity ~ color)

plot diamonds

Let us go back to the normal scale. The bottom left panel shows price vs carat for the whitest, internally flawless diamonds. The upper right panel shows price vs carat for the most yellow and most flawed (least pure) diamonds.

ggplot(diamonds, aes(x = carat, y = price, colour = cut)) + geom_point() +
    facet_grid(clarity ~ color)

plot diamonds

As we would expect, for diamonds at the same level of purity (compare along a row), the price per carat increases faster for white stones (bottom left) than for yellow stones (bottom right). And for diamonds at the same level of color (compare down a column), the price per carat increases faster for pure stones (bottom left) than for dirty stones (upper left).

plot diamonds

We facet the plot by one of these factor variables: clarity.

ggplot(diamonds, aes(x = carat, y = price, colour = cut)) + geom_point() +
    facet_grid(clarity ~ .)

good tip to generate plots

The same type of graph is used over and over again while each new component of ggplot2 is introduced and interpreted. This is a very effective way to display complex relationships in large, high-dimensional data. Remember, the key is to bring in only one change each time.

Source: http://gettinggeneticsdone.blogspot.com/2010/01/ggplot2-tutorial-scatterplots-in-series.html

plot diamonds

Last, we fit a line for the original data and for the log-transformed data. The relation is very nearly linear for the log-transformed data if we ignore the few points at the two ends of the distribution.

ggplot(diamonds, aes(x = carat, y = price)) + geom_point() + geom_smooth()
ggplot(diamonds, aes(x = log(carat), y = log(price))) + geom_point()

amazing NYTimes sample

http://timelyportfolio.github.io/rCharts_512paths/

Source: Timely Portfolio and NYTimes

why do we use R

Dirk's example about the elegance and efficiency of R

Source: Dirk Eddelbuettel

hack time

· download an open dataset using filter
· read it into RStudio
· check the dimensions of the dataset
· decide which columns you will use
· plot it!
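The steps above can be sketched end to end as follows. Everything specific here is an assumption for illustration: the temporary file stands in for a dataset downloaded from the portal, and the rows and the `Created.Date` and `Complaint.Type` columns are made up (only `Borough` appears in the workshop's own code).

```r
library(ggplot2)

# Hypothetical stand-in for a file downloaded from the portal; rows are made up.
csv_path <- tempfile(fileext = ".csv")
writeLines(c("Created.Date,Complaint.Type,Borough",
             "06/01/2014,Rodent,BROOKLYN",
             "06/02/2014,Rodent,MANHATTAN",
             "06/03/2014,Rodent,BROOKLYN",
             "06/04/2014,Rodent,QUEENS"), csv_path)

requests <- read.csv(csv_path, stringsAsFactors = FALSE)  # read it in
dim(requests)                                             # check the dimensions
used <- requests[, c("Complaint.Type", "Borough")]        # decide which columns to use

# plot it: number of requests per borough
p <- ggplot(used, aes(x = Borough)) + geom_bar()
print(p)
```

With a real download, replace `csv_path` with the location of your own file and pick columns that exist in it.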

Resources

R in a Nutshell - Joseph Adler
The Art of R Programming - Norman Matloff
ggplot2 - Elegant Graphics for Data Analysis - Hadley Wickham
