NYC Data Science Academy, NYC Open Data Meetup, Big Data, Data Science, NYC, Vivian Zhang, SupStat Inc, R programming, R workshop, ggplot2
R Workshop I
Get to know the NYC open data portal and start to use R
Vivian Zhang for the NYC Open Data meetup
http://www.meetup.com/NYC-Open-Data/
R Workshop I http://www.nycopendata.com/RworkshopI/index.html#1
1 of 27 6/13/14, 1:50 PM
Overview
· NYC open data portal
· RStudio
· R
· GitHub
· hack time
Advantages of using RStudio
· Ease of use
  - install and load R packages
  - keep track of the R dev version
  - download GitHub repositories
  - debug faster
diamonds subsetting example
library(ggplot2)  # ggplot2 ships the diamonds dataset
head(diamonds)
## carat cut color clarity depth table price x y z
## 1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43
## 2 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31
## 3 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31
## 4 0.29 Premium I VS2 62.4 58 334 4.20 4.23 2.63
## 5 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75
## 6 0.24 Very Good J VVS2 62.8 57 336 3.94 3.96 2.48
diamonds subsetting example
head(diamonds[-1, ])  # negative row index: drop the first row
## carat cut color clarity depth table price x y z
## 2 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31
## 3 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31
## 4 0.29 Premium I VS2 62.4 58 334 4.20 4.23 2.63
## 5 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75
## 6 0.24 Very Good J VVS2 62.8 57 336 3.94 3.96 2.48
## 7 0.24 Very Good I VVS1 62.3 57 336 3.95 3.98 2.47
diamonds subsetting example
head(diamonds[, -1])  # negative column index: drop the first column (carat)
## cut color clarity depth table price x y z
## 1 Ideal E SI2 61.5 55 326 3.95 3.98 2.43
## 2 Premium E SI1 59.8 61 326 3.89 3.84 2.31
## 3 Good E VS1 56.9 65 327 4.05 4.07 2.31
## 4 Premium I VS2 62.4 58 334 4.20 4.23 2.63
## 5 Good J SI2 63.3 58 335 4.34 4.35 2.75
## 6 Very Good J VVS2 62.8 57 336 3.94 3.96 2.48
diamonds subsetting example
head(diamonds[c(1, 2), ])  # positive row indices: keep rows 1 and 2 only
## carat cut color clarity depth table price x y z
## 1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43
## 2 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31
diamonds subsetting example
names(diamonds)
## [1] "carat" "cut" "color" "clarity" "depth" "table" "price"
## [8] "x" "y" "z"
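# logical column index: keep the columns marked TRUE (carat, cut, price)
head(diamonds[, c(TRUE, TRUE, FALSE, FALSE, FALSE, FALSE, TRUE, FALSE, FALSE, FALSE)])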
## carat cut price
## 1 0.23 Ideal 326
## 2 0.21 Premium 326
## 3 0.23 Good 327
## 4 0.29 Premium 334
## 5 0.31 Good 335
## 6 0.24 Very Good 336
diamonds subsetting example
names(diamonds)
## [1] "carat" "cut" "color" "clarity" "depth" "table" "price"
## [8] "x" "y" "z"
head(diamonds$carat)
## [1] 0.23 0.21 0.23 0.29 0.31 0.24
diamonds[diamonds$price == max(diamonds$price), ]
## carat cut color clarity depth table price x y z
## 27750 2.29 Premium I VS2 60.8 60 18823 8.5 8.47 5.16
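The last call filters rows with a logical condition. As a minimal sketch on a made-up four-row data frame (the ids and prices below are hypothetical, not taken from diamonds), the comparison against max() keeps every tied row, while which.max() keeps only the first one:

```r
# Hypothetical toy data: two rows share the highest price
d <- data.frame(
  id    = c("a", "b", "c", "d"),
  price = c(326, 18823, 334, 18823),
  stringsAsFactors = FALSE
)

# Logical index: TRUE for every row whose price equals the maximum
top_all <- d[d$price == max(d$price), ]

# which.max() returns the position of the first maximum only
top_first <- d[which.max(d$price), ]

print(top_all)
print(top_first)
```

Which form is appropriate depends on whether ties should be reported.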
reading and subsetting data in R
· blank: include all
· integer: positive (+) includes, negative (-) excludes
· logical: include TRUEs
· character: lookup by name
Source: Hadley Wickham
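The four index types listed above can be tried side by side. This is a small sketch on a hand-built data frame whose values mimic the first rows of diamonds; the variable names are illustrative only.

```r
# Toy data frame with a few diamonds-like rows
d <- data.frame(
  carat = c(0.23, 0.21, 0.23, 0.29),
  cut   = c("Ideal", "Premium", "Good", "Premium"),
  price = c(326, 326, 327, 334),
  stringsAsFactors = FALSE
)

all_rows   <- d[, ]                     # blank: include all rows and columns
first_two  <- d[c(1, 2), ]              # positive integers: include rows 1 and 2
drop_first <- d[-1, ]                   # negative integer: exclude row 1
cheap      <- d[d$price < 327, ]        # logical: include rows where the test is TRUE
by_name    <- d[, c("carat", "price")]  # character: look up columns by name

print(by_name)
```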
data structure in R
Source: Hadley Wickham
read in the open data
· read.table()
· read.csv()
rodent1year <- read.csv("C:\\Users\\zhangs\\Google Drive\\R code\\Rworkshop\\311_Service_Requests_from_2010_to_Present
header = TRUE, sep = ",")
dim(rodent1year)
summary(rodent1year)
table(rodent1year$Borough)
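The same read-and-inspect steps can be tried without the original file. The sketch below writes a tiny made-up CSV to a temporary file first. The column names and rows are assumptions standing in for a 311 export, not the real data.

```r
# Hypothetical stand-in for a downloaded 311 export
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  "Unique.Key,Complaint.Type,Borough",
  "1,Rodent,BROOKLYN",
  "2,Rodent,MANHATTAN",
  "3,Rodent,BROOKLYN"
), csv_path)

# read.csv() already defaults to header = TRUE and sep = ","
requests <- read.csv(csv_path, stringsAsFactors = FALSE)

dim(requests)      # number of rows, number of columns
summary(requests)  # per-column summary
borough_counts <- table(requests$Borough)  # requests per borough
print(borough_counts)
```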
plot diamonds
with() is a generic function that evaluates an expression in a local environment constructed from data.
In ggplot2, "aes" stands for "aesthetics", and geom_point() is the "geom" that creates scatterplots.
with(diamonds, plot(carat, price))
ggplot(diamonds, aes(x = carat, y = price)) + geom_point()
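To see what with() saves you, here is a small sketch on a three-row stand-in for diamonds (values copied from the head() output earlier). Both expressions compute the same number; with() simply lets the column names be used directly.

```r
# Three diamonds-like rows
d <- data.frame(carat = c(0.23, 0.21, 0.23), price = c(326, 326, 327))

# Without with(): each column needs the data frame prefix
explicit <- mean(d$price / d$carat)

# With with(): carat and price are looked up inside d
scoped <- with(d, mean(price / carat))

print(c(explicit = explicit, scoped = scoped))
```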
plot diamonds
ggplot2 generates more sophisticated graphs than the traditional graphics package. Let us play with some color.
ggplot(diamonds, aes(x = carat, y = price, colour = cut)) + geom_point()
plot diamonds
Instead of fitting a linear relation, we try to fit a log-linear relation.
log(price) is quite linear in log(carat). Bingo!
ggplot(diamonds, aes(x = log(carat), y = log(price), colour = cut)) + geom_point()
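Why the log-log plot straightens out: if price follows a power law, price = a * carat^b, then log(price) = log(a) + b * log(carat), which is a straight line with slope b. The sketch below checks this on synthetic data. The constants 4000 and 1.7 are arbitrary assumptions for illustration, not estimates from diamonds.

```r
set.seed(42)

# Synthetic power-law data with multiplicative noise (made-up parameters)
carat <- runif(500, min = 0.2, max = 3)
price <- 4000 * carat^1.7 * exp(rnorm(500, sd = 0.2))

# A linear model on the log scale should recover the exponent as the slope
fit <- lm(log(price) ~ log(carat))
slope <- coef(fit)[["log(carat)"]]
intercept <- coef(fit)[["(Intercept)"]]

print(c(slope = slope, intercept = intercept))
```

The fitted slope should land close to the exponent used to generate the data, and the intercept close to log(4000).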
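plot diamonds
As the color letters go from D to J, the diamond becomes more and more yellow. The clarity grades "SI" (slightly included), "VS" (very slightly included) and "VVS" (very, very slightly included) describe the size of internal imperfections in the diamond; the number after the grade ranks stones within it (1 is clearer than 2). "IF" is internally flawless.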
ggplot(diamonds, aes(x = log(carat), y = log(price), colour = cut)) + geom_point() +
facet_grid(clarity ~ color)
plot diamonds
Let us go back to the normal scale. The bottom left panel shows price vs carat for the whitest, internally flawless diamonds. The upper right panel shows price vs carat for the most impure (dirtiest) and most flawed diamonds.
ggplot(diamonds, aes(x = carat, y = price, colour = cut)) + geom_point() +
facet_grid(clarity ~ color)
plot diamonds
As we would expect, for diamonds at the same level of purity (compare within a row), the price rises faster with carat for white stones (bottom left) than for yellow stones (bottom right). And for diamonds of the same color (compare within a column), the price rises faster with carat for pure stones (bottom left) than for dirty stones (upper left).
plot diamonds
We facet the plot by one of these factor variables: clarity.
ggplot(diamonds, aes(x = carat, y = price, colour = cut)) + geom_point() +
facet_grid(clarity ~ .)
good tip to generate plots
The same type of graph is used over and over again while each new component of ggplot2 is introduced and interpreted. It is a very effective way to display complex relationships in large, high-dimensional data. Remember, the key is to make only one change at a time.
Source: http://gettinggeneticsdone.blogspot.com/2010/01/ggplot2-tutorial-scatterplots-in-series.html
plot diamonds
Last, we fit a line to the original data and to the log-transformed data. The log-transformed data are very close to linear if we ignore the few points at the two ends of the distribution.
ggplot(diamonds, aes(x = carat, y = price)) + geom_point() + geom_smooth()
ggplot(diamonds, aes(x = log(carat), y = log(price))) + geom_point()
amazing NYTimes sample
http://timelyportfolio.github.io/rCharts_512paths/
Source: Timely Portfolio and NYTimes
why do we use R
Dirk's example of the elegance and efficiency of R
Source: Dirk Eddelbuettel
hack time
· download an open dataset using a filter
· read it into RStudio
· check the dimensions of the dataset
· decide which columns you will use
· plot it!
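One possible shape for the steps above, as a rough sketch: the inline text stands in for a downloaded CSV, and the column names and rows are made up for illustration. The plot is written to a temporary PDF so the script also runs outside RStudio.

```r
# Steps 1-2: read the data (inline text replaces the downloaded file here)
requests <- read.csv(text = "Complaint.Type,Borough,Status
Rodent,BROOKLYN,Closed
Rodent,BRONX,Open
Rodent,BROOKLYN,Closed
Rodent,QUEENS,Closed", stringsAsFactors = FALSE)

# Step 3: check the dimensions
dim(requests)

# Step 4: keep only the columns of interest
keep <- requests[, c("Borough", "Status")]

# Step 5: plot the number of requests per borough
counts <- table(keep$Borough)
plot_file <- file.path(tempdir(), "requests_by_borough.pdf")
pdf(plot_file)
barplot(counts, main = "Requests by borough", ylab = "Count")
dev.off()
```

In an interactive RStudio session the pdf() and dev.off() calls can be dropped so the bar chart appears in the Plots pane.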
Resources
R in a Nutshell - Joseph Adler
The Art of R Programming - Norman Matloff
ggplot2 - Elegant Graphics for Data Analysis - Hadley Wickham