R Workshop I: R Basics (4th time)


DESCRIPTION

NYC Data Science Academy, NYC Open Data Meetup, Big Data, Data Science, NYC, Vivian Zhang, SupStat Inc, R programming, R workshop, ggplot2


R Workshop I

Get to know the NYC open data portal and start to use R

Vivian Zhang for the NYC Open Data meetup

http://www.meetup.com/NYC-Open-Data/

R Workshop I http://www.nycopendata.com/RworkshopI/index.html#1

1 of 27 6/13/14, 1:50 PM

Overview

· NYC open data portal
· RStudio
· R
· GitHub
· hack time


Advantages of using RStudio

· Ease of use
  - install and load R packages
  - keep track of the R development version
  - download GitHub repositories
  - debug faster


diamonds subsetting example

library(ggplot2)  # the diamonds dataset ships with ggplot2

head(diamonds)

## carat cut color clarity depth table price x y z

## 1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43

## 2 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31

## 3 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31

## 4 0.29 Premium I VS2 62.4 58 334 4.20 4.23 2.63

## 5 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75

## 6 0.24 Very Good J VVS2 62.8 57 336 3.94 3.96 2.48


diamonds subsetting example

# A negative row index drops that row (here, row 1)
head(diamonds[-1, ])

## carat cut color clarity depth table price x y z

## 2 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31

## 3 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31

## 4 0.29 Premium I VS2 62.4 58 334 4.20 4.23 2.63

## 5 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75

## 6 0.24 Very Good J VVS2 62.8 57 336 3.94 3.96 2.48

## 7 0.24 Very Good I VVS1 62.3 57 336 3.95 3.98 2.47


diamonds subsetting example

# A negative column index drops that column (here, carat)
head(diamonds[, -1])

## cut color clarity depth table price x y z

## 1 Ideal E SI2 61.5 55 326 3.95 3.98 2.43

## 2 Premium E SI1 59.8 61 326 3.89 3.84 2.31

## 3 Good E VS1 56.9 65 327 4.05 4.07 2.31

## 4 Premium I VS2 62.4 58 334 4.20 4.23 2.63

## 5 Good J SI2 63.3 58 335 4.34 4.35 2.75

## 6 Very Good J VVS2 62.8 57 336 3.94 3.96 2.48


diamonds subsetting example

# A vector of positive row indices keeps only those rows
head(diamonds[c(1, 2), ])

## carat cut color clarity depth table price x y z

## 1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43

## 2 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31


diamonds subsetting example

names(diamonds)

## [1] "carat" "cut" "color" "clarity" "depth" "table" "price"

## [8] "x" "y" "z"

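# A logical index keeps the columns marked TRUE (carat, cut, price).
# TRUE/FALSE are spelled out because T and F are ordinary variables that can be reassigned.
head(diamonds[, c(TRUE, TRUE, FALSE, FALSE, FALSE, FALSE, TRUE, FALSE, FALSE, FALSE)])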

## carat cut price

## 1 0.23 Ideal 326

## 2 0.21 Premium 326

## 3 0.23 Good 327

## 4 0.29 Premium 334

## 5 0.31 Good 335

## 6 0.24 Very Good 336


diamonds subsetting example

names(diamonds)

## [1] "carat" "cut" "color" "clarity" "depth" "table" "price"

## [8] "x" "y" "z"

head(diamonds$carat)

## [1] 0.23 0.21 0.23 0.29 0.31 0.24

diamonds[diamonds$price == max(diamonds$price), ]

## carat cut color clarity depth table price x y z

## 27750 2.29 Premium I VS2 60.8 60 18823 8.5 8.47 5.16
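The condition-based lookup above can be written in a few equivalent ways. The following is a minimal sketch, assuming ggplot2 is installed so that the diamonds dataset is available; the variable names and the carat threshold of 2 are illustrative and not from the workshop.

```r
library(ggplot2)  # only needed for the diamonds dataset

# which.max() returns the position of the first maximum,
# so this picks out one row with the highest price
most_expensive <- diamonds[which.max(diamonds$price), ]

# Logical conditions can be combined with & (and) or | (or);
# the second index keeps columns by name
big_ideal <- diamonds[diamonds$carat > 2 & diamonds$cut == "Ideal",
                      c("carat", "cut", "price")]

# subset() expresses the same filter without repeating diamonds$
same_rows <- subset(diamonds, carat > 2 & cut == "Ideal",
                    select = c(carat, cut, price))

most_expensive
nrow(big_ideal)
```

subset() is convenient at the console, while explicit indexing with [ ] tends to be the safer choice inside functions.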


reading and subsetting data in R

· blank: include all
· integer: positive values include; negative values exclude
· logical: include TRUEs
· character: lookup by name

Source: Hadley Wickham
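The four index types listed above can be tried on a small data frame. This is a minimal sketch using made-up data; the data frame df and its columns are hypothetical and not part of the workshop material.

```r
# Hypothetical toy data frame, used only for illustration
df <- data.frame(
  id    = 1:5,
  score = c(10, 20, 30, 40, 50),
  group = c("a", "b", "a", "b", "a"),
  stringsAsFactors = FALSE
)

# blank: an empty index keeps everything in that dimension
all_rows <- df[, ]

# integer: positive values include, negative values exclude
first_two  <- df[c(1, 2), ]
without_id <- df[, -1]

# logical: keep the rows where the condition is TRUE
high_scores <- df[df$score > 25, ]

# character: look columns up by name
named_cols <- df[, c("id", "group")]
```

The same four index types work on vectors, matrices and lists as well as on data frames.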


data structure in R

Source: Hadley Wickham


read in the open data

· read.table()
· read.csv()

rodent1year <- read.csv("C:\\Users\\zhangs\\Google Drive\\R code\\Rworkshop\\311_Service_Requests_from_2010_to_Present

header = TRUE, sep = ",")

dim(rodent1year)

summary(rodent1year)

table(rodent1year$Borough)
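Because the file path in the slide is specific to the presenter's machine, the same steps can be tried without any file. This is a minimal sketch that reads a few made-up rows from an inline string through the text argument of read.csv(); the column names and values are hypothetical stand-ins for a 311 extract.

```r
# Hypothetical stand-in for a downloaded 311 extract; all values are made up
csv_text <- "Unique.Key,Complaint.Type,Borough
1,Rodent,BROOKLYN
2,Rodent,MANHATTAN
3,Rodent,BROOKLYN
4,Rodent,QUEENS"

# read.csv() is read.table() with header = TRUE and sep = "," as defaults,
# so neither needs to be passed explicitly
requests <- read.csv(text = csv_text, stringsAsFactors = FALSE)

dim(requests)       # number of rows, number of columns
summary(requests)   # per-column summary

# Count the requests per borough
borough_counts <- table(requests$Borough)
borough_counts
```

With a real download, replace the text argument with the path to the CSV file.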


plot diamonds

with() is a generic function that evaluates an expression in a local environment constructed from data.

with(diamonds, plot(carat, price))

In ggplot2, "aes" stands for "aesthetics", and a "geom" such as geom_point() is used to create scatterplots.

ggplot(diamonds, aes(x = carat, y = price)) + geom_point()
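To see what with() does outside of plotting, the sketch below computes the same quantity twice: once with explicit diamonds$ prefixes and once inside with(). It assumes ggplot2 is installed for the diamonds dataset; the variable names are illustrative.

```r
library(ggplot2)  # only needed for the diamonds dataset

# Explicit column access: the data frame name is repeated for every column
mean_explicit <- mean(diamonds$price / diamonds$carat)

# with() evaluates the expression with the columns of diamonds in scope
mean_with <- with(diamonds, mean(price / carat))

all.equal(mean_explicit, mean_with)
```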


plot diamonds

ggplot2 generates more sophisticated graphs than the traditional graphics package. Let us play with some color.

ggplot(diamonds, aes(x = carat, y = price, colour = cut)) + geom_point()


plot diamonds

Instead of fitting a linear relation, we try to fit a log-linear relation (both axes on a log scale).

log(price) is quite linear in log(carat). Bingo!

ggplot(diamonds, aes(x = log(carat), y = log(price), colour = cut)) + geom_point()
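One way to check the visual impression numerically is to fit both models with lm() and compare how much variance each explains. This is a minimal sketch, assuming ggplot2 is installed for the diamonds dataset; the object names are illustrative, and R-squared values on different response scales are only a rough guide.

```r
library(ggplot2)  # only needed for the diamonds dataset

# Linear model on the original scale
fit_raw <- lm(price ~ carat, data = diamonds)

# Linear model on the log-log scale
fit_log <- lm(log(price) ~ log(carat), data = diamonds)

# Share of variance explained by each fit
r2_raw <- summary(fit_raw)$r.squared
r2_log <- summary(fit_log)$r.squared
c(raw = r2_raw, log = r2_log)

# Intercept and slope of the log-log fit
coef(fit_log)
```

A slope above 1 in the log-log fit would mean that price grows faster than proportionally with carat.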


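plot diamonds

As letters go from D to J, the diamond becomes more and more yellow. The clarity grades "SI" (slightly included), "VS" (very slightly included) and "VVS" (very, very slightly included) describe the size of "internal imperfections" in the diamonds; the number after the letters is a sub-grade, with 1 cleaner than 2. "I1" is included and "IF" is internally flawless.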

ggplot(diamonds, aes(x = log(carat), y = log(price), colour = cut)) + geom_point() +

facet_grid(clarity ~ color)
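The panel layout of facet_grid(clarity ~ color) follows the order of the factor levels: clarity levels run down the rows and color levels run across the columns. A minimal sketch for inspecting that order, assuming ggplot2 is installed for the diamonds dataset:

```r
library(ggplot2)  # only needed for the diamonds dataset

# Column order of the grid: D (whitest) on the left to J (most yellow) on the right
color_levels <- levels(diamonds$color)
color_levels

# Row order of the grid: I1 (most included) at the top to IF (flawless) at the bottom
clarity_levels <- levels(diamonds$clarity)
clarity_levels

# Number of diamonds in each panel (rows = clarity, columns = color)
panel_counts <- table(diamonds$clarity, diamonds$color)
panel_counts
```

Panels with few diamonds in this table deserve less weight when comparing trends across the grid.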


plot diamonds

Let us look back at the normal scale. The bottom left panel shows price vs carat for the whitest, internally flawless diamonds. The upper right panel shows price vs carat for the most impure (most yellow and most flawed) diamonds.

ggplot(diamonds, aes(x = carat, y = price, colour = cut)) + geom_point() +

facet_grid(clarity ~ color)


plot diamonds

As we would expect, for diamonds at the same level of purity (read along a row), the price rises faster with carat for white stones (bottom left) than for yellow stones (bottom right). And for diamonds of the same color (read down a column), the price rises faster with carat for pure stones (bottom left) than for dirty stones (upper left).


plot diamonds

We facet the plot by one of these factor variables: clarity.

ggplot(diamonds, aes(x = carat, y = price, colour = cut)) + geom_point() +

facet_grid(clarity ~ .)


good tip to generate plots

The same type of graph is used over and over again while each new component of ggplot2 is introduced and interpreted. It is a very effective way to display complex relationships in large, high-dimensional data. Remember, the key is to bring in only one change each time.

Source: http://gettinggeneticsdone.blogspot.com/2010/01/ggplot2-tutorial-scatterplots-in-series.html


plot diamonds

Last, we fit a line for the original data and for the log-transformed data. The linear relation is nearly perfect for the log-transformed data if we ignore the few points at the two ends of the distribution.

ggplot(diamonds, aes(x = carat, y = price)) + geom_point() + geom_smooth()

ggplot(diamonds, aes(x = log(carat), y = log(price))) + geom_point()
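The log-scale call on this slide draws only the points. The sketch below adds a straight-line fit to that plot, assuming geom_smooth(method = "lm") is an acceptable way to draw it; the alpha value and the object name p_log are illustrative choices, not from the slides.

```r
library(ggplot2)

# Scatterplot on the log-log scale with a least-squares line on top.
# alpha makes the points semi-transparent so the line stays visible.
p_log <- ggplot(diamonds, aes(x = log(carat), y = log(price))) +
  geom_point(alpha = 0.1) +
  geom_smooth(method = "lm", formula = y ~ x)

p_log
```

Without method = "lm", geom_smooth() picks a flexible smoother by default, which draws a curve rather than a straight line.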


amazing NYTimes sample

http://timelyportfolio.github.io/rCharts_512paths/

Source: Timely Portfolio and NYTimes


why do we use R

Dirk's example about the elegance and efficiency of R

Source: Dirk Eddelbuettel



hack time

· download an open dataset using a filter
· read it into RStudio
· check the dimensions of the dataset
· decide which columns you will use
· plot it!
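The hack-time steps can be sketched end to end as follows. To keep the example self-contained, a temporary file with made-up rows stands in for the dataset downloaded from the portal; the column names and values are hypothetical.

```r
library(ggplot2)

# Step 1 (stand-in): in the workshop the file comes from the open data portal.
# Here a temporary CSV with made-up rows plays that role.
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  "Complaint.Type,Borough,Created.Date",
  "Rodent,BROOKLYN,01/05/2014",
  "Rodent,BROOKLYN,01/07/2014",
  "Rodent,MANHATTAN,01/09/2014",
  "Rodent,QUEENS,02/11/2014",
  "Rodent,BRONX,02/15/2014"
), csv_path)

# Step 2: read it in
requests <- read.csv(csv_path, stringsAsFactors = FALSE)

# Step 3: check the dimensions (rows, columns)
dim(requests)

# Step 4: keep only the columns you plan to use
requests <- requests[, c("Complaint.Type", "Borough")]

# Step 5: plot it, here as a count of requests per borough
p <- ggplot(requests, aes(x = Borough)) + geom_bar()
p
```

With a real download, only csv_path and the column names need to change.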


Resources

R in a Nutshell - Joseph Adler

The Art of R Programming - Norman Matloff

ggplot2 - Elegant Graphics for Data Analysis - Hadley Wickham
