Graphing with R & ggplot2 Week 1: Introduction to R Chicago Data Visualization Group Workshop October 8, 2012

Ggplot2 v3

DESCRIPTION

A high-level introduction to the R statistical programming language, presented at the Chicago Data Visualization Group's Graphing in R and ggplot2 workshop on October 8, 2012.

Page 1: Ggplot2 v3

Graphing with R & ggplot2

Week 1: Introduction to R

Chicago Data Visualization Group Workshop

October 8, 2012

Page 2: Ggplot2 v3

Agenda

• Introductions
• Survey: http://j.mp/ggplot2-2012.
• Overview
• Workshop

Page 3: Ggplot2 v3

Workshop Overview

Introduction to R (Week 1)
We will familiarize ourselves with the R environment through a gentle introduction to its basic functions. After installing R, we will import and inspect data sets while becoming familiar with R terminology. By the end of the class, we will produce basic descriptions and plots of the data.

Introduction to ggplot2 (Week 2)
We will begin to use the ggplot2 package to create basic, but handsome, univariate, bivariate, and time-series graphs. We will introduce the functions and terminology used in ggplot2. We will also explain the fundamentals of proper data visualization techniques and how they relate to the ggplot2 defaults.

Grammar of Graphics (Week 3)
We will continue with more advanced features of ggplot2, including how it relates to Leland Wilkinson's Grammar of Graphics. We will show how to plot more than two variables in a single graph using colors, shapes, and sizes. We will also discuss how the human ability to perceive different shapes and colors should drive the choices we make in data visualization.

Plots for Publications (Week 4)
After learning how to make plots, we will learn how to customize graphs with custom colors, labels, and themes. We will emphasize how to create a customized look for publications, including adding labels to diagrams to help readers.

Page 4: Ggplot2 v3

Installing R

1. Go to http://cran.r-project.org/
2. Download the installation for your OS.
3. Follow instructions for installation.
4. Start R from menu.

Page 5: Ggplot2 v3

The R Console
• R can execute scripts or respond interactively at the console.
> 5+4                              # addition
[1] 9
> 2^10                             # exponentiation
[1] 1024
> a=2; b=10                        # two commands on the same line
> a^b
[1] 1024
> ls()                             # list existing objects
> x = rnorm(100); y = rnorm(100)   # draw 100 random values each from a normal distribution
> length(x)                        # get the length of an object
[1] 100
> mean(x)                          # calculate the mean (random, so your value will differ)
[1] 0.07163738
> sd(x)                            # calculate the std. dev.
[1] 1.086229
> plot(x, y)                       # create a simple plot

Page 6: Ggplot2 v3

The R Working Directory
• R works in the context of a directory. This is usually the directory from which R was started.
• Get and set your working directory.
> getwd()                    # determine your working directory
> setwd("your directory")    # set to your data directory location
> getwd()                    # check that you are in the correct directory
> dir()                      # list file names in the current directory

Page 7: Ggplot2 v3

Getting Help
• At the console
> help.start()          # general help page
> help(functionname)    # help on a function
> ?functionname         # shorthand for help(functionname)
> ??"search string"     # find all references to the search string
> example(topic)        # see an example of topic
> demo()                # see demos

Page 8: Ggplot2 v3

Data Types & Data Objects
• Data objects
– Vector: a set of elements of the same type.
– Matrix: a set of elements of the same type arranged in rows & columns.
– Data frame: rows & columns in which different columns can hold different types.
– Lists & Arrays*
• Data types (aka mode) of data objects
– Numeric: 3.14 and 3, 4, 5, …
– Character: "abc", "cat", "dog", …
– Logical: TRUE, FALSE, NA
– Complex and Raw*

* Out of scope for this presentation
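One way to see how data types and data objects fit together is to ask R directly. The sketch below is only illustrative: the variable names are made up, and is.vector(), is.matrix(), and mode() are base R functions that the slide does not cover.

```r
# Hypothetical example values, one per basic data type
num <- 3.14
chr <- "cat"
lgl <- TRUE

print(class(num))    # "numeric"
print(class(chr))    # "character"
print(class(lgl))    # "logical"

# A vector holds elements of one type; a matrix adds rows and columns
v <- c(3, 4, 5)
m <- matrix(v, nrow = 1)
print(is.vector(v))  # TRUE
print(is.matrix(m))  # TRUE
print(mode(m))       # "numeric": the type of the elements inside the matrix
```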

Page 9: Ggplot2 v3

Data Types: Numeric
• Decimal values are numeric in R.
> x = 10.5       # assign a decimal value
> x              # print the value of x
[1] 10.5
> class(x)       # print the class name of x
[1] "numeric"
• Whole-number values are also stored as numeric by default.
> k = 1
> k              # print the value of k
[1] 1
> class(k)       # print the class name of k
[1] "numeric"
• The fact that k is not an integer can be confirmed with is.integer().
> is.integer(k)  # is k an integer?
[1] FALSE

Page 10: Ggplot2 v3

Data Types: Integer
• Create an integer with as.integer().
> y = as.integer(3)
> y              # print the value of y
[1] 3
> class(y)       # print the class name of y
[1] "integer"
> is.integer(y)  # is y an integer?
[1] TRUE
• Coerce a numeric value into an integer with the same as.integer().
> as.integer(3.14)    # coerce a numeric value
[1] 3
• Parse a string for decimal values in much the same way.
> as.integer("5.27")  # coerce a decimal string
[1] 5
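A few edge cases of integer conversion are worth trying at the console. This is a small sketch using only base R; the L suffix, round(), and suppressWarnings() are not mentioned on the slide, and the values are arbitrary.

```r
k <- 1L                      # the L suffix creates an integer directly
print(class(k))              # "integer"

# as.integer() truncates toward zero; round() rounds but stays numeric
print(as.integer(3.99))      # 3
print(as.integer(-3.99))     # -3
print(round(3.99))           # 4
print(class(round(3.99)))    # "numeric"

# A string that is not a number becomes NA (R normally issues a warning too)
bad <- suppressWarnings(as.integer("abc"))
print(is.na(bad))            # TRUE
```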

Page 11: Ggplot2 v3

Data Types: Logical
• A logical value is often created via comparison between variables.
> x = 1; y = 2   # sample values
> z = x > y      # is x larger than y?
> z              # print the logical value
[1] FALSE
> class(z)       # print the class name of z
[1] "logical"
• Standard logical operations: "&", "|", and "!".
> u = TRUE; v = FALSE
> u & v          # u AND v
[1] FALSE
> u | v          # u OR v
[1] TRUE
> !u             # negation of u
[1] FALSE
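Comparisons and logical operators also work element by element on whole vectors, which is how they are most often used in practice. The sketch below uses made-up scores; sum(), any(), and all() are base R helpers not shown on the slide.

```r
scores <- c(57, 68, 44, 63)        # made-up values for illustration
passed <- scores > 50              # one TRUE/FALSE per element
print(passed)                      # TRUE TRUE FALSE TRUE
print(sum(passed))                 # 3, because TRUE counts as 1
print(any(scores > 65))            # TRUE
print(all(scores > 50))            # FALSE
print(scores > 50 & scores < 65)   # TRUE FALSE FALSE TRUE
```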

Page 12: Ggplot2 v3

Data Types: Character
• A character data type is used to represent string values in R.
> x = as.character(3.14)
> x              # print the character string
[1] "3.14"
> class(x)       # print the class name of x
[1] "character"
• Multiple character values can be concatenated with paste() or sprintf(). Both expressions produce the same result.
> name = "Joe"; amount = 100
> paste(name, "has", amount, "dollars")
[1] "Joe has 100 dollars"
> sprintf("%s has %d dollars", name, amount)
[1] "Joe has 100 dollars"
• To extract a substring, we apply substr().
> substr("Mary has a little lamb.", start=3, stop=12)
[1] "ry has a l"
• Replace strings with sub().
> sub("little", "big", "Mary has a little lamb.")
[1] "Mary has a big lamb."
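A few related string helpers round out the picture. This is a sketch only: the sep argument, paste0(), gsub(), nchar(), and toupper() are base R features that the slide does not mention.

```r
print(paste("Joe", "has", 100, "dollars", sep = "_"))  # "Joe_has_100_dollars"
print(paste0("week", 1:4))     # "week1" "week2" "week3" "week4"

lamb <- "Mary has a little lamb."
print(sub("a", "A", lamb))     # first match only: "MAry has a little lamb."
print(gsub("a", "A", lamb))    # every match:      "MAry hAs A little lAmb."
print(nchar(lamb))             # 23
print(toupper("cat"))          # "CAT"
```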

Page 13: Ggplot2 v3

Data Objects: Vectors
• A vector is a sequence of data elements of the same basic type.
• Here is a numeric vector created by the concatenation function c().
> c(2, 3, 5)
[1] 2 3 5
• Vectors can be combined via the function c().
> n = c(2, 3, 5)
> s = c("aa", "bb", "cc", "dd", "ee")
> c(n, s)
[1] "2"  "3"  "5"  "aa" "bb" "cc" "dd" "ee"
– Notice the numeric values are coerced into character strings.
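The coercion seen above follows a fixed order: logical values become numeric, and anything combined with a string becomes character. A short sketch with arbitrary values (as.numeric() is used here to convert back and is not on the slide):

```r
print(c(TRUE, FALSE, 2))        # 1 0 2: logicals become numbers
print(class(c(TRUE, 2)))        # "numeric"
print(c(TRUE, "aa"))            # "TRUE" "aa": everything becomes character

mixed <- c(2, 3, "5")
print(class(mixed))             # "character"
print(as.numeric(mixed) + 1)    # 3 4 6 after converting back to numbers
```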

Page 14: Ggplot2 v3

Vectors: Sequences & Patterns
• Sequences & patterns in vectors can be created with seq() & rep().
> seq(1,9,by=2)
[1] 1 3 5 7 9
> seq(8,20,length=6)
[1]  8.0 10.4 12.8 15.2 17.6 20.0
> rep(1:3,6)
 [1] 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3
> rep(1:3,rep(6,3))
 [1] 1 1 1 1 1 1 2 2 2 2 2 2 3 3 3 3 3 3
• Ranges can be created with the : operator.
> 1:5
[1] 1 2 3 4 5
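Some variations on these calls may be easier to read. In this sketch, length.out is the full name of the argument abbreviated as length above, and each and seq_len() are base R features the slide does not show.

```r
print(seq(8, 20, length.out = 6))   # 8.0 10.4 12.8 15.2 17.6 20.0
print(rep(1:3, each = 2))           # 1 1 2 2 3 3

# each = 6 gives the same pattern as rep(1:3, rep(6, 3))
print(identical(rep(1:3, each = 6), rep(1:3, rep(6, 3))))  # TRUE

print(seq_len(5))                   # 1 2 3 4 5
print(5:1)                          # 5 4 3 2 1: ranges can count down
```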

Page 15: Ggplot2 v3

Vectors: Accessing Elements
• Access vector values with the [i] operator.
> s = c("aa", "bb", "cc", "dd", "ee")
> s[3]
[1] "cc"
• Other methods of accessing vector values by index
> s[-3]             # negative index removes the value
[1] "aa" "bb" "dd" "ee"
> s[c(2, 3)]        # numeric index
[1] "bb" "cc"
> s[c(2, 3, 3)]     # duplicate indexes
[1] "bb" "cc" "cc"
> s[c(2, 1, 3)]     # out-of-order indexes
[1] "bb" "aa" "cc"
> s[2:4]            # range index
[1] "bb" "cc" "dd"
> s[c(FALSE, TRUE, FALSE, TRUE, FALSE)]   # logical index
[1] "bb" "dd"
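Logical indexes are rarely typed out by hand; they usually come from a comparison. The sketch below uses made-up numbers, and which() is a base R function not covered on the slide.

```r
x <- c(7.5, 8.2, 3.1, 5.6, 9.3)   # made-up measurements
print(x > 6)            # TRUE TRUE FALSE FALSE TRUE
big <- x[x > 6]         # keep only the elements where the test is TRUE
print(big)              # 7.5 8.2 9.3
print(which(x > 6))     # 1 2 5: the positions that passed the test
print(x[10])            # NA: indexing past the end gives a missing value
```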

Page 16: Ggplot2 v3

Vectors: Arithmetic
• Arithmetic operations of vectors are performed member-by-member.
• For example, suppose we have two vectors a and b.
> a = c(1, 3, 5, 7); b = c(1, 2, 4, 8)
• Examples of arithmetic operations.
> 5 * a    # multiplication of a by 5
[1]  5 15 25 35
> a + b    # addition of a & b vectors
[1]  2  5  9 15
> a * b    # multiplication of a & b
[1]  1  6 20 56
# Division and subtraction are also possible.
• If two vectors are of unequal length, the shorter one will be recycled in order to match the longer vector.
> u = c(10, 20, 30); v = c(1, 2, 3, 4, 5, 6, 7, 8, 9)
> u + v
[1] 11 22 33 14 25 36 17 28 39
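Recycling is silent when the longer length is a multiple of the shorter one, as above. When it is not, R still returns a result but raises a warning. The sketch below captures that warning with tryCatch(), a base R function outside the scope of the slides; the values are arbitrary.

```r
# Lengths 3 and 2: not a multiple, so R raises a warning
msg <- tryCatch(
  c(1, 2, 3) + c(10, 20),
  warning = function(w) conditionMessage(w)
)
print(msg)   # the text of the warning about object lengths

# Lengths 3 and 9: recycling is the same as repeating u three times
u <- c(10, 20, 30); v <- 1:9
print(length(v) %% length(u) == 0)       # TRUE
print(identical(u + v, rep(u, 3) + v))   # TRUE
```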

Page 17: Ggplot2 v3

Vectors: Subsetting and Summary
• Let's suppose we've collected some data from an experiment.
> x=c(7.5,8.2,3.1,5.6,8.2,9.3,6.5,7.0,9.3,1.2,14.5,6.2)
• Some simple summary statistics of these data can be produced:
> mean(x)
[1] 7.216667
> summary(x)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  1.200   6.050   7.250   7.217   8.475  14.500
• Suppose we want summaries of two extracts of this data.
> summary(x[1:6])
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  3.100   6.075   7.850   6.983   8.200   9.300
> summary(x[7:12])
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  1.200   6.275   6.750   7.450   8.725  14.500

Page 18: Ggplot2 v3

Exercises
1. Define
> x = c(4,2,6)
> y = c(1,0,-1)
Decide what the result will be of the following:
(a) length(x)
(b) sum(x)
(c) sum(x^2)
(d) x+y
(e) x*y
(f) x-2
(g) x^2

2. Determine these sequences.
(a) 7:11
(b) seq(2,9)
(c) seq(4,10,by = 2)
(d) seq(3,30,length = 10)
(e) seq(6,-4,by = -2)

3. Determine these patterns.
(a) rep(2,4)
(b) rep(c(1,2),4)
(c) rep(c(1,2),c(4,4))
(d) rep(1:4,4)
(e) rep(1:4,rep(3,4))

4. Use rep() to define the following vectors.
(a) 6,6,6,6,6,6
(b) 5,8,5,8,5,8,5,8
(c) 5,5,5,5,8,8,8,8

Page 19: Ggplot2 v3

Exercises
5. If x = c(5,9,2,3,4,6,7,0,8,12,2,9), determine the following.
(a) x[2]
(b) x[2:4]
(c) x[c(2,3,6)]
(d) x[c(1:5,10:12)]
(e) x[-(10:12)]

Exercises taken from R – A Self Learn Tutorial from the National Center for Ecological Analysis and Synthesis

Page 20: Ggplot2 v3

Data Objects: Matrices
• A matrix is a collection of data elements of the same type arranged in a two-dimensional rectangular layout.
• Matrices can be created in R in a variety of ways. Perhaps the simplest is to create the columns and then glue them together with the command cbind.
> x=c(5,7,9)
> y=c(6,3,4)
> z=cbind(x,y)
> z
     x y
[1,] 5 6
[2,] 7 3
[3,] 9 4
> dim(z)    # get the dimensions of the matrix
[1] 3 2
• Execute the expression rbind(x,y) and observe the result.
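For comparison after trying it yourself, here is a sketch of what rbind() does with the same two vectors: it stacks them as rows rather than columns, so the result is the transpose of the cbind() matrix. The variable name r is made up, and t() is the transpose function.

```r
x <- c(5, 7, 9)
y <- c(6, 3, 4)

r <- rbind(x, y)                  # x and y become the two rows
print(r)
print(dim(r))                     # 2 3
print(all(r == t(cbind(x, y))))   # TRUE: same values as the transposed cbind result
```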

Page 21: Ggplot2 v3

Matrices: Creating
• Using matrix() to create matrices.
> matrix(c(2, 4, 3, 1, 5, 7),nrow=2,ncol=3,byrow = TRUE)
     [,1] [,2] [,3]
[1,]    2    4    3
[2,]    1    5    7
• Removing, or changing, the values of nrow, ncol, & byrow will affect the shape of the matrix. Experiment with this to see how the shape changes.
• dim() can also be used to create a matrix from a vector.
> a = c(5,10,15,20,25,30)
> dim(a) = c(2,3)    # reshape "a" by assigning it dimensions
> a
     [,1] [,2] [,3]
[1,]    5   15   25
[2,]   10   20   30
• Transpose matrices with t().
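A small sketch of the default fill order and of t(), using an arbitrary 2 x 3 matrix. When byrow is left out, matrix() fills column by column.

```r
m <- matrix(1:6, nrow = 2)    # byrow defaults to FALSE: fill column by column
print(m[1, ])                 # 1 3 5

tm <- t(m)                    # rows become columns
print(dim(tm))                # 3 2
print(tm[, 1])                # 1 3 5: the first row of m is now the first column
print(identical(t(tm), m))    # TRUE: transposing twice restores the original
```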

Page 22: Ggplot2 v3

Matrices: Accessing Elements
• An element of our matrix "a" can be accessed with [].
> a[2,3]         # access the element in the second row, third column
[1] 30
> a[2,]          # access the entire second row
[1] 10 20 30
> a[,3]          # access the entire third column
[1] 25 30
> a[ ,c(1,3)]    # access the first and third columns
     [,1] [,2]
[1,]    5   25
[2,]   10   30
• Rows and columns can be named.
> dimnames(a) = list(c("row 1","row 2"),c("col 1","col 2","col 3"))
> a[ ,c("col 1","col 3")]
      col 1 col 3
row 1     5    25
row 2    10    30
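Note that selecting a single row or column returns a plain vector, not a matrix. The sketch below shows one way to keep the matrix shape with the drop argument of [], which the slide does not cover; it rebuilds the same matrix a so that it runs on its own.

```r
a <- matrix(c(5, 10, 15, 20, 25, 30), nrow = 2)

row2 <- a[2, ]                  # a single row is simplified to a vector
print(is.matrix(row2))          # FALSE

row2m <- a[2, , drop = FALSE]   # drop = FALSE keeps it as a 1 x 3 matrix
print(dim(row2m))               # 1 3

print(a[a > 15])                # 20 25 30: a logical test also works as an index
```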

Page 23: Ggplot2 v3

Matrices: Arithmetic
> ## The Matrices
> z = matrix(c(5,7,9,6,3,4),nrow=3,byrow=T)
> z    # a 3 x 2 matrix
     [,1] [,2]
[1,]    5    7
[2,]    9    6
[3,]    3    4
> y = matrix(c(1,3,0,9,5,-1),nrow=3,byrow=T)
> y    # a 3 x 2 matrix
     [,1] [,2]
[1,]    1    3
[2,]    0    9
[3,]    5   -1
> x = matrix(c(3,4,-2,6),nrow=2,byrow=T)
> x    # a 2 x 2 matrix
     [,1] [,2]
[1,]    3    4
[2,]   -2    6
> ## Matrix Math
> y + z      # element-wise addition
     [,1] [,2]
[1,]    6   10
[2,]    9   15
[3,]    8    3
> y * z      # element-wise multiplication
     [,1] [,2]
[1,]    5   21
[2,]    0   54
[3,]   15   -4
> y%*%x      # matrix multiplication
     [,1] [,2]
[1,]   -3   22
[2,]  -18   54
[3,]   17   14
• Link to matrix multiplication tutorial.
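Matrix multiplication with %*% only works when the number of columns on the left matches the number of rows on the right. The sketch below reuses the slide's x and y and catches the error with tryCatch(), a base R function not covered in the workshop; the replacement message "not conformable" is made up for illustration.

```r
y <- matrix(c(1, 3, 0, 9, 5, -1), nrow = 3, byrow = TRUE)   # 3 x 2
x <- matrix(c(3, 4, -2, 6), nrow = 2, byrow = TRUE)         # 2 x 2

print(dim(y %*% x))       # 3 2: (3 x 2) times (2 x 2)

# x has 2 columns but y has 3 rows, so x %*% y is an error
result <- tryCatch(x %*% y, error = function(e) "not conformable")
print(result)             # "not conformable"

print(dim(x %*% t(y)))    # 2 3: transposing y makes the shapes match
```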

Page 24: Ggplot2 v3

Exercises
1. Create in R the matrices x and y. (The matrix definitions appeared as an image on the original slide and are not reproduced in this transcript.)
Calculate the following and check your answers in R:
(a) 2*x
(b) x*x
(c) x%*%x
(d) x%*%y
(e) t(y)

2. With x and y as above, calculate the effect of the following subscript operations and check your answers in R.
(a) x[1,]
(b) x[2,]
(c) x[,2]
(d) y[1,2]
(e) y[,2:3]

Exercises taken from R – A Self Learn Tutorial from the National Center for Ecological Analysis and Synthesis

Page 25: Ggplot2 v3

Data Objects: Data Frames
• A data frame is the most common data object in R and is used for storing data tables. It is a list of vectors of equal length.
> n = c(2,3,5); s = c("aa","bb","cc"); b = c(TRUE,FALSE,TRUE)
> data.frame(n,s,b)    # output the data frame
  n  s     b
1 2 aa  TRUE
2 3 bb FALSE
3 5 cc  TRUE
• Data frames can contain columns of data of different types.
• Import data using read.table() and read.csv(). Both return data frames.
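To check that each column keeps its own type, the data frame can be inspected with str() and sapply(), two base R functions used here as an illustration. The sketch passes stringsAsFactors = FALSE explicitly because the default for character columns changed in R 4.0.0; with it, the result is the same in old and new versions.

```r
df <- data.frame(
  n = c(2, 3, 5),
  s = c("aa", "bb", "cc"),
  b = c(TRUE, FALSE, TRUE),
  stringsAsFactors = FALSE   # keep s as character in every R version
)

str(df)                      # compact overview of rows, columns, and types
print(sapply(df, class))     # n: "numeric", s: "character", b: "logical"
print(dim(df))               # 3 3
print(is.list(df))           # TRUE: a data frame is a list of columns
```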

Page 26: Ggplot2 v3

Data Frames: Importing data
• Get the following data file and save it to your working directory.
– http://www.ats.ucla.edu/stat/R/notes/hs0.csv
• Import with read.table()
> schdat = read.table("hs0.csv", header=T, sep=",")
> names(schdat)
[1] "gender" "id" "race" "ses" "schtyp" "prgtype" "read" "write" "math" "science" "socst"
> head(schdat)
  gender  id race ses schtyp  prgtype read write math science socst
1      0  70    4   1      1  general   57    52   41      47    57
2      1 121    4   2      1   vocati   68    59   53      63    61
3      0  86    4   3      1  general   44    33   54      58    31
4      0 141    4   3      1   vocati   63    44   47      53    56
5      0 172    4   2      1 academic   47    52   57      53    61
6      0 113    4   2      1 academic   44    52   51      63    61
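If the data file is not at hand, the import step can still be tried on a small stand-in. The sketch below writes three rows (copied from the head() output above, with only four of the columns) to a temporary file and reads it back; csv_path, dat, and dat2 are made-up names, and tempfile() and writeLines() are base R helpers not covered on the slide.

```r
# Write a tiny CSV to a temporary file (a stand-in for hs0.csv)
csv_path <- tempfile(fileext = ".csv")
writeLines(c("id,prgtype,read,write",
             "70,general,57,52",
             "121,vocati,68,59",
             "172,academic,47,52"), csv_path)

dat <- read.csv(csv_path)    # read.csv() assumes a header row and comma separator
print(names(dat))            # "id" "prgtype" "read" "write"
print(dim(dat))              # 3 4

# read.table() gives the same data once header and sep are spelled out
dat2 <- read.table(csv_path, header = TRUE, sep = ",")
print(isTRUE(all.equal(dat, dat2)))   # TRUE for this file
```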

Page 27: Ggplot2 v3

Data Frames: Subsetting
• Retrieving a column vector using [[]] or $ or [,]
– These return the same vector: schdat[[3]]; schdat$race; schdat[,"race"]
> str(schdat$race)       # get the structure of schdat$race
 int [1:200] 4 4 4 4 4 4 3 1 4 3 ...
• Retrieving a data frame column slice using []
– These return the same data frame: schdat[3]; schdat["race"]
> str(schdat["race"])    # get the structure of schdat["race"]
'data.frame': 200 obs. of 1 variable:
 $ race: int 4 4 4 4 4 4 3 1 4 3 ...
• Retrieving a data frame row slice using [,]
– A row index followed by a comma returns that row as a data frame.
> schdat[5,]
  gender  id race ses schtyp  prgtype read write math science socst
5      0 172    4   2      1 academic   47    52   57      53    61
– Run str(schdat[5,]) to see the structure of this data frame.
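The difference between a column vector and a one-column data frame can be checked directly. This sketch uses a small made-up data frame rather than schdat so that it runs on its own; it also shows that column names are case-sensitive.

```r
df <- data.frame(id = c(70, 121, 86), race = c(4, 4, 3))   # made-up rows

col_vec <- df[["race"]]    # double brackets: the column as a vector
col_df  <- df["race"]      # single brackets: a one-column data frame

print(is.data.frame(col_vec))              # FALSE
print(is.data.frame(col_df))               # TRUE
print(identical(df$race, df[, "race"]))    # TRUE: $ and [ , ] give the same vector
print(is.data.frame(df[2, ]))              # TRUE: a row slice stays a data frame
print(is.null(df$Race))                    # TRUE: "Race" is not the same as "race"
```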

Page 28: Ggplot2 v3

Data Frames: Subsetting
• We can use subset() to slice both columns and rows. Let's extract only the read, write, math, and science scores for students in the "academic" program type.
> schdat.academic = subset(schdat, prgtype=="academic",
+                          select=c("read","write","math","science"))
> head(schdat.academic)
   read write math science
5    47    52   57      53
6    44    52   51      63
8    34    46   45      39
10   57    55   52      50
12   57    65   51      63
13   73    60   71      61
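The same slice can be written with plain bracket indexing, which may help show what subset() is doing. This is a sketch on a four-row stand-in (values copied from the head() output earlier, three columns only); df, a, and b are made-up names.

```r
df <- data.frame(
  prgtype = c("general", "vocati", "academic", "academic"),
  read    = c(57, 68, 47, 44),
  write   = c(52, 59, 52, 52),
  stringsAsFactors = FALSE
)

a <- subset(df, prgtype == "academic", select = c("read", "write"))
b <- df[df$prgtype == "academic", c("read", "write")]   # rows first, then columns

print(a)
print(isTRUE(all.equal(a, b)))   # TRUE: both approaches select the same rows and columns
```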

Page 29: Ggplot2 v3

Data Frames: Exploring data
• Let's subset the read, write, math, and science scores for analysis.
> read.sci = schdat[ , c("read","write","math","science")]
> summary(read.sci)    # five-number summary plus the mean for each column
      read           write            math          science
 Min.   :28.00   Min.   :31.00   Min.   :33.00   Min.   :26.00
 1st Qu.:44.00   1st Qu.:45.75   1st Qu.:45.00   1st Qu.:44.00
 Median :50.00   Median :54.00   Median :52.00   Median :53.00
 Mean   :52.23   Mean   :52.77   Mean   :52.65   Mean   :51.66
 3rd Qu.:60.00   3rd Qu.:60.00   3rd Qu.:59.00   3rd Qu.:58.00
 Max.   :76.00   Max.   :67.00   Max.   :75.00   Max.   :74.00
                                                 NA's   :5
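The NA's entry in the summary reports missing values. Many functions return NA when any value is missing unless told otherwise. A minimal sketch with made-up scores; the na.rm argument and is.na() are base R features not shown on the slide.

```r
science <- c(47, 63, NA, 53, 58)      # made-up scores with one missing value

print(mean(science))                  # NA: a missing value makes the mean unknown
print(mean(science, na.rm = TRUE))    # 55.25: computed from the four known values
print(sum(is.na(science)))            # 1: number of missing values
print(summary(science))               # includes an NA's count, as in the output above
```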

Page 30: Ggplot2 v3

Data Frames: Further Analysis
• Let's look at additional statistics in the school data set.
> attach(schdat)                        # allow access to data by variable name only
> options(digits=2)                     # set significant digits
> m = tapply(write,prgtype,mean)        # tapply() applies a function to each group of prgtype
> v = tapply(write,prgtype,var)
> med = tapply(write,prgtype,median)
> n = tapply(write,prgtype,length)
> s = tapply(write,prgtype,sd)          # named s to avoid masking the sd() function
> cbind(mean=m,var=v,std.dev=s,median=med,n=n)
         mean var std.dev median   n
academic   56  63     7.9     59 105
general    51  88     9.4     54  45
vocati     47  87     9.3     46  50
> options(digits=7)                     # restore the default
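An alternative to attach() is with(), which makes the columns visible only for a single expression and so avoids name clashes later in the session. This is a sketch on a six-row stand-in (values copied from the head() output earlier); dat, m, and n are made-up names.

```r
dat <- data.frame(
  prgtype = c("general", "vocati", "general", "vocati", "academic", "academic"),
  write   = c(52, 59, 33, 44, 52, 52)
)

# One mean per program type, without attaching the data frame
m <- with(dat, tapply(write, prgtype, mean))
n <- with(dat, tapply(write, prgtype, length))
print(cbind(mean = m, n = n))
# academic: mean 52.0, n 2; general: mean 42.5, n 2; vocati: mean 51.5, n 2
```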

Page 31: Ggplot2 v3

Data Frames: Graphing
• Exploring data through graphs. Execute the following and examine the output.
> hist(write)                                         # histogram from the graphics package
> library(lattice)                                    # load trellis graphics
> histogram(~write, schdat, type="count")             # trellis histogram
> histogram(~write | gender, schdat, type="count")    # histogram by gender
> hist(write, breaks=15)                              # change the number of bins to 15
> boxplot(write)                                      # boxplot function in the graphics package
> bwplot(ses~ write| gender, schdat)                  # boxplot by gender
> barplot(table(ses, gender), legend=c("low", "medium", "high"))
> barplot(table(ses, gender), beside=T, legend=c("low", "medium",
+         "high"), ylim=c(0, 50))

Page 32: Ggplot2 v3

Data Frames: Frequency & Correlation
• Calculating Frequency Tables
> table(ses)                  # one way
> tab1=table(gender, ses)     # two-way crosstab
> prop.table(tab1,1)          # row proportions
> prop.table(tab1,2)          # column proportions
> rowSums(tab1)               # row frequencies
> colSums(tab1)               # column frequencies
• Correlations & scatter plot
> # correlation of a pair of variables
> cor(write, math)
> cor(write, science)         # NA, because science has missing values
> cor(write, science, use="complete.obs")
> # correlation matrix
> cor(read.sci, use="complete.obs")
> plot(math, write)           # scatter plot
> # scatter plot matrix
> plot(read.sci)
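These calls can be tried on a few made-up values to see what each one returns. In the sketch below, gender and ses are small invented vectors (not the school data), and a and b are arbitrary numbers chosen so the correlation is easy to verify.

```r
gender <- c(0, 1, 0, 0, 1, 1)     # made-up codes
ses    <- c(1, 2, 3, 3, 2, 1)

tab <- table(gender, ses)         # two-way crosstab of counts
print(tab)
print(rowSums(prop.table(tab, 1)))   # row proportions: each row sums to 1

a <- c(1, 2, 3, NA)
b <- c(2, 4, 6, 8)
print(cor(a, b))                         # NA: one value is missing
print(cor(a, b, use = "complete.obs"))   # 1: only complete pairs are used
```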

Page 33: Ggplot2 v3

Data Objects: Factors
• Factors are numeric or character vectors that have an associated set of levels: the finite set of values the categorical variable can take.
• Let's create a numeric vector of school types. 0 = private, 1 = public.
> sch.type = sample(0:1, 6, replace=T)
> sch.type
[1] 0 1 0 1 0 0
> is.factor(sch.type)
[1] FALSE
> factor(sch.type)    # return factor of sch.type
[1] 0 1 0 1 0 0
Levels: 0 1
> sch.type.f = factor(sch.type,labels=c("private","public"))
> sch.type.f
[1] private public  private public  private private
Levels: private public

Page 34: Ggplot2 v3

Data Objects: Factors
• Let's create a character vector of socioeconomic status.
> ses = c("low","high","high","middle","low","low")
> ses
[1] "low"    "high"   "high"   "middle" "low"    "low"
> is.factor(ses)
[1] FALSE
> factor(ses)
[1] low    high   high   middle low    low
Levels: high low middle
> ses.f = factor(ses, levels=c("low","middle","high"))
> ses.f
[1] low    high   high   middle low    low
Levels: low middle high
• levels() will also tell us the levels of a factor.
> levels(ses.f)
[1] "low"    "middle" "high"
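When the levels have a natural order, as they do here, the factor can also be declared ordered so that comparisons make sense. This is a sketch using the ordered argument of factor(), which the slide does not cover; ses.o is a made-up name.

```r
ses <- c("low", "high", "high", "middle", "low", "low")
ses.o <- factor(ses, levels = c("low", "middle", "high"), ordered = TRUE)

print(ses.o < "high")               # TRUE FALSE FALSE TRUE TRUE TRUE
print(as.integer(ses.o))            # 1 3 3 2 1 1: position of each value in levels
print(as.character(max(ses.o)))     # "high"
print(table(ses.o))                 # counts appear in level order: low, middle, high
```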

Page 35: Ggplot2 v3

Data Objects: Factors
• Comparing a table without factors to a table with factors.
> table(ses,sch.type)          # without factors
        sch.type
ses      0 1
  high   1 1
  low    3 0
  middle 0 1
> table(ses.f,sch.type.f)      # with factors
        sch.type.f
ses.f    private public
  low          3      0
  middle       0      1
  high         1      1

Page 36: Ggplot2 v3

Exercises
1. Use the data set mtcars included in R.
(a) Sort the data set by weight, heaviest first.
(b) Which car gets the best mileage (largest mpg)? Which gets the worst?
(c) The cars in rows c(1:3, 8:14, 18:21, 26:28, 30:32) were imported into the United States. Compare the variable mpg for imported and domestic cars using a boxplot. Is there a difference?
(d) Make a scatterplot of weight, wt, versus miles per gallon, mpg. Label the points according to the number of cylinders, cyl. Describe any trends.

2. The data set cfb (http://wiener.math.csi.cuny.edu/UsingR/Data/cfb.R) contains consumer finance data for 1,000 consumers. Create a data frame consisting of just those consumers with positive INCOME and negative NETWORTH. What is its size?

3. Use the data set ewr (http://wiener.math.csi.cuny.edu/UsingR/Data/ewr.R). We extract just the values for the times with df = ewr[ , 3:10]. The mean of each column is found with colMeans(df) (older versions of R also accepted mean(df)). How would you find the mean of each row? Why might this be interesting?

Exercises from Using R for Introductory Statistics by John Verzani.

Page 37: Ggplot2 v3

Resources

The following resources were used for the workshop materials. They are also excellent R references for your continued learning.
• UCLA's Resources for R
• R-Bloggers
• R-Tutor
• Quick-R
• Using R for Introductory Statistics by John Verzani