1 The command line 1.2 Assignment - CCACE · 2014-11-27

1 The command line

> The “prompt”

> 2 + 2

[1] 4

• The expression is “interpreted” (parsed and evaluated) when you hit the Enter key

• The [1] is a counter (not part of the data)

• A prompt like “+” means “continue”

• To interrupt hit the Esc key

1.1 Expressions

• Conventional operators and syntax: + - * / ^

> 2 / 3

> 2^3

• When in doubt use brackets

> 2 / 3 + 2

> 2 / (3 + 2)

> 8^1/3

> 8^(1/3)

• White space is ignored

• # starts a “comment”

• Line buffer: ← →

• Command history: ↑ ↓
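The bracket rule above can be checked directly. The following is a minimal sketch that relies only on R's standard operator precedence (^ before /, / before +); the variable names are chosen for illustration.

```r
# ^ binds tighter than /, which binds tighter than +
a  <- 8^1/3        # parsed as (8^1)/3
b  <- 8^(1/3)      # the cube root of 8
c1 <- 2 / 3 + 2    # parsed as (2/3) + 2
c2 <- 2 / (3 + 2)  # brackets force the addition first
print(c(a, b, c1, c2))
```

Writing the brackets even when they are not strictly needed costs nothing and makes the intended order obvious.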

1.2 Assignment

• Assignment means storing¹ an object by name.

• The assignment operator² is <-, but you can use =

> x <- 2 + 2 # The result of 2+2 is assigned to x

> x = 2 + 2 # Same thing (one less key-press)

• Rules for names: they can contain letters, numbers, ‘.’, and ‘_’, but they cannot begin with a number or ‘_’, and they are case-sensitive.

> my.variable = 2

> my.Variable = 3 # A different variable because of "V"

• Assignment stores an independent copy of the object. The object can be accessed by its name.

> x = 2 + 2

> x # x prints its current value

> y = x # y becomes an independent copy of x

> y = 0 # y becomes 0, x is not changed by this

> y = x + 1 # y becomes x+1, x is unchanged

> x = x + 1 # x becomes x+1, its previous value is lost

• Objects are more than just numbers.

> x = "hello" # Character strings are in quotes

> x

> x = x + 1 # Characters don't do arithmetic

¹ Assignment does not save to the computer’s file system. It saves objects in memory while the R session is running. You have to run specific commands to save objects to external files.

² The operator “=” does not represent “equality”. It means “give this object a name that we can use later to retrieve its value”. The equality operator is a double equals sign: “==”. It tests whether two objects are equal or not.
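The difference between assigning a copy and testing equality can be seen in a few lines. This is a small sketch using only the operators described above; the name same is introduced here for illustration.

```r
x <- 2 + 2
y <- x             # y holds its own copy of the value 4
y <- 0             # changing y leaves x alone
same <- (x == 4)   # == asks a question and returns TRUE or FALSE
print(x)
print(y)
print(same)
```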


Exercise 1. Guess the answer to this before you type it in without brackets: $7 + 7 \div 7 + 7 \times 7 - 7$?

> # Answer = 50.

> # Without brackets all multiplication and division is done first

> # before addition and subtraction.

Exercise 2. Show that $2^{10} = 1024$ and that the 10th root $1024^{1/10} = 2$.

> 2^10

> 1024^(1/10) # Brackets!

Exercise 3. Show that the value of $x^2 + x - 6$ is 0 when $x = -3$ and when $x = 2$. (Change the value assigned to x and re-run the expression using the arrow keys.)

> x = -3

> x^2 + x - 6

> x = 2

> x^2 + x - 6

2 Data

• The three main types of data are:

numeric    Integers and real numbers: 3.14 -2.72 0 12 1.5e-16
character  Strings of characters in quotes (single or double): "apple" "3.14" "x = 0"
logical    Truth values: TRUE FALSE

• Special values:

NA    Not Available (a missing value)
NaN   Not a Number (an impossible number)
Inf   Infinity
NULL  Empty value (nothing at all)

• Special characters³ within strings:

\n  Newline
\t  Tab
\"  Quote mark
\\  Backslash

³ The single backslash is the “escape” character that gives special meaning to the character that follows it.


• Logical values result from conditional operations

> 1 > 0 # TRUE

> 1 < 0 # FALSE

> 1 < NA # NA (undecidable)
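The special values and escape characters listed above can be probed with a few base R functions. This is a hedged sketch: is.na, is.nan, is.infinite, and nchar are standard functions that the text has not introduced yet, and the names s and n are illustrative.

```r
is.na(NA)         # TRUE: the value is missing
is.nan(0/0)       # TRUE: 0/0 is not a number
is.infinite(1/0)  # TRUE: 1/0 is Inf
length(NULL)      # 0: NULL contains nothing at all
is.na(1 < NA)     # TRUE: a comparison with NA cannot be decided

s <- "a\tb"       # a tab between two letters
n <- nchar(s)     # the escape \t counts as a single character
cat(s, "\n")      # cat prints the tab itself rather than the escape
```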

2.1 Combining data into vectors

• A vector is an ordered collection of cells⁴. Each cell contains a single value. All the cells must contain the same type of value. There are numeric, character, and logical vectors.

• c is the name of a function that makes vectors by combining.

> c(1,3,4.5) # Numeric vector

> c("low","med","high") # Character vector

> c(TRUE,TRUE,FALSE) # Logical vector

• Mixed data types in vectors are “coerced”: logical → numeric → character

> c(1,2,3,TRUE,FALSE) # logical -> numeric (TRUE->1, FALSE->0)

> c(1,2,3,TRUE,FALSE,".") # all -> character

• A vector is a data “object” and any object can be assigned.

> x = c(6,3,4,5,7,8,9) # x becomes the vector

> x # x prints its current value
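To see coercion at work, one can ask for the class of a combined vector. A minimal sketch, assuming the standard function class (not introduced in the text); v1 and v2 are illustrative names.

```r
v1 <- c(1, 2, 3, TRUE, FALSE)        # logical -> numeric
v2 <- c(1, 2, 3, TRUE, FALSE, ".")   # everything -> character
class(v1)   # "numeric"
class(v2)   # "character"
v1[4]       # 1: TRUE was converted to a number
v2[4]       # "TRUE": here TRUE was converted straight to a string
```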

Exercise 4. Suppose you made a character vector like this:

> x = c("something","completely")

How could you append another string such as "different" to vector x without re-writing all three strings?

> x = c(x, "different")

Exercise 5. Guess what the result of each of the following will be:

> c(2, 2+2)

> c(1, "1")

> c(0, NULL, NA)

> x = c(6,3,4,5,7,8,9)

> c(x, 2*x, 3*x)

⁴ A scalar item such as a single number or character string is seen as a single-cell vector. The cells can be seen either as a row or column.

3 Functions

• Think of functions as operators on data.

• Functions are organised into libraries or packages. Several libraries come with the R “base” distribution. The most useful are loaded into memory when you start R.

3.1 Finding functions in memory

• Every library has an index of its functions. The function named library can list the index:

> library(help="base") # The "base" library of functions

> library(help="stats") # The "stats" library of functions

> library(help="graphics") # The "graphics" library of functions

• Function apropos finds functions in memory by string matching.

> apropos("^read") # Functions with names beginning with "read"

> apropos("test$") # Functions with names ending with "test"

• Every function has a help page. Click the Index link at the foot of a library’s help:

> help(package="base")

> help(package="stats")

> help(package="graphics")


3.2 Finding functions in libraries on your hard disk

> library() # List the libraries on disk

> library(help="foreign") # List functions in the "foreign" library

• Load libraries from disk into memory when required.

> library("foreign") # Load library "foreign" into memory

> detach(package:foreign) # Unload library "foreign"

3.3 Finding functions in packages on the Internet

• Users contribute libraries of functions and data called packages. Over 3500 packages are available for download from “CRAN”.

• Install (download) a package to disk once (install.packages). Load the library into memory whenever required (library).

> install.packages("lme4") # Download package "lme4" to disk

> library("lme4") # Load library "lme4" into memory
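Because a package must be on disk before it can be loaded, it can be useful to check first. The sketch below is one possible pattern, not part of the original handout: it assumes the standard function requireNamespace, and the names pkg and available are illustrative.

```r
pkg <- "foreign"   # the package used in the examples above

# requireNamespace() returns TRUE or FALSE instead of stopping with an error
available <- requireNamespace(pkg, quietly = TRUE)

if (available) {
  library(pkg, character.only = TRUE)   # load it into memory
} else {
  message("Package '", pkg, "' is not installed; see install.packages")
}
```

The argument character.only=TRUE tells library that pkg is a variable holding the name rather than the name itself.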

Some useful web sites

http://www.r-project.org                  R web site
http://cran.r-project.org/                Comprehensive R Archive Network
http://cran.r-project.org/web/packages/   Contributed packages
http://cran.r-project.org/web/views/      Packages organised by task
http://cran.r-project.org/manuals.html    Manuals
http://cran.r-project.org/faqs.html       FAQ
http://www.r-project.org/mail.html        Mailing lists
http://www.r-project.org/search.html      Search mail archives

Exercise 6. Datasets included with R have help pages too. Look at the index for the datasets package and the help pages there.

Exercise 7. List the functions in the library named: "MASS".

Exercise 8. Run function sessionInfo(). This displays a list of the libraries that are currently loaded into memory. Their functions, datasets, and help pages are accessible to R. (If a function is not in memory it is not directly accessible.)

3.4 Function syntax

• Every function has a name (with the same rules as variables).

• To call or run a function, type its name followed by brackets (even if they are empty). Blank space is ignored.

• The brackets contain arguments separated by commas. The arguments are the data and also control behaviour.

• The function returns the result of its operation.

• Example: function sqrt returns the square root of its argument.

> sqrt(4) # The value returned is displayed

> x = sqrt(4) # The value is assigned

3.5 Using functions

• Arguments are values copied into the function. Calculations within a function do not change values outside it.

> y = 4

> sqrt(y) # y is not changed by this

> y = sqrt(y) # y becomes its square root

• A function can appear wherever its returned value can be used⁵. Think of the return as substituted directly at the place where the function is called.

> y = 9

> x = sqrt(y) - sqrt(y+16) # Equivalent to x = 3 - 5

• A function’s value can directly be used as an argument.

> x = -2

> sqrt(abs(x)) # abs returns an absolute (unsigned) value

⁵ Usually functions appear on the right-hand side of assignments, or as arguments to other functions. A few functions can also appear on the left-hand side of assignments. For example names and dimnames.
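The copy behaviour of arguments can be demonstrated with a tiny function. This is an illustrative sketch: add.one is a hypothetical function written for this example, and the last two lines show names used on the left-hand side of an assignment.

```r
add.one <- function(v) {   # add.one is a hypothetical name
  v <- v + 1               # changes only the function's own copy
  v
}

y <- 4
z <- add.one(y)            # z is 5, y is still 4

p <- c(1, 2)
names(p) <- c("a", "b")    # names on the left-hand side labels the cells
```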


3.6 Function options and their default values

• Arguments that control behaviour are called options. Options have default values: you can omit the option and get the default, or pass the option name = value and override the default.

• Example: function round has two arguments. The first argument is mandatory and provides the data to be rounded. The second argument is optional and controls the number of decimal places. Its name is digits and its default value is 0.

> round(pi) # Use the option default

> round(pi, digits=2) # Override the default
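An option can be passed by name or by position. A short sketch using round as described above; the names r1, r2, and r3 are illustrative.

```r
r1 <- round(pi)              # digits takes its default value, 0
r2 <- round(pi, 2)           # second argument matched by position
r3 <- round(pi, digits = 2)  # second argument matched by name
print(c(r1, r2, r3))
```

Naming the option is usually the clearer choice, because the reader does not need to remember the order of the arguments.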

3.7 How to write your own functions

• A function is defined like this⁶:

function.name = function(arguments) {

body of the function

}

• The result of the last line in the function body is returned⁷.

• For example a “wrapper” function for sqrt:

> my.sqrt = function(x, dp=3) {

+ round(sqrt(abs(x)), digits=dp)

+ }

• How to call (run) your function⁸:

> my.sqrt(5) # Using the default

> my.sqrt(5, dp=0) # Overriding the default

⁶ An object whose value is a function is assigned to a variable that becomes the function’s name. Typing the function’s name without the argument brackets shows the value of the variable.

⁷ Use the function return to return from any line within the function.

⁸ You may build a library of your own functions. Load these into memory when required using function source.
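The two ways of returning a value can be combined in one function. The following is a sketch only: safe.sqrt is a hypothetical variant of the wrapper above that returns NA for negative input instead of taking the absolute value.

```r
safe.sqrt <- function(x, dp = 3) {   # safe.sqrt is a hypothetical name
  if (x < 0) {
    return(NA)                       # leave the function early
  }
  round(sqrt(x), digits = dp)        # last line: its value is returned
}

safe.sqrt(2)     # the rounded square root
safe.sqrt(-2)    # NA
safe.sqrt        # without brackets: prints the function definition
```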

3.8 Function help pages

• Every function has a help page. To display the help:

> help(round) # The help page for function round

> ?round # ...same thing

> help(t.test) # The help page for function t.test

• Help pages usually have the following sections:

Description  What the function does
Usage        Synopsis of the arguments and option defaults
Arguments    Values you can pass as arguments
Details      Any relevant details
Value        Value(s) returned by the function
See Also     Links to related functions
Examples     How it might be used


Exercise 9. Write a function named, say, myfunction, that takes an argument x with default value 0 and returns the value of $x^2 + x - 6$. (Note: you might want to write your function in a text editor outside R and copy-and-paste it into R. You will re-use this function later so it would be a good idea to save it, and others you write, to a “library” file of functions.) Run your function without passing an argument (to test the default), and then with different values for x such as $x = -3$ and $x = 2$.

> myfunction = function(x=0) {

+ x^2 + x - 6

+ }

> myfunction()

> myfunction(-3)

> myfunction(2)

Exercise 10. Look at the help page for function seq. Use it to generate a vector that is a sequence of numeric values like this:

> seq(from=-4, to=4, by=0.1)

Try it without naming the arguments like this: seq(-4, 4, 0.1). What happens if you change the order of the arguments like this: seq(-4, 0.1, 4)?

Exercise 11. Look at the help page for the colon operator: help(":") (operators are functions too). Run the first three of its examples: 1:4, pi:6, and 6:pi. What is it doing? What will happen if you run 6:-6?

Exercise 12. Look at the help page for function t.test. Which arguments provide data to the function and which are options that control its behaviour? Make two numeric vectors as follows and use t.test to test the mean difference between these two small samples. Could they both have been drawn from the same population?

> x = 1:9

> y = 5:13

Exercise 13. Append an outlier observation 200 to sample y (above) and repeat the t test. Assign the results to a new variable (say, z = t.test(...)).

4 Data objects

• The main data objects⁹ are called: vector, matrix, array, factor, data.frame, and list

• vector is a 1-dimensional layout of cells¹⁰. matrix is a 2-dimensional layout of rows and columns¹¹. array is a multi-dimensional layout.

• Each cell contains a single value. All the cells must contain the same type of value: numeric, character, or logical.

• factor is a vector of code numbers¹² with labels called levels. Factors are used for categorical variables and grouping indicators.

• data.frame is a collection of columns¹³ with names. The columns can be vectors or factors. It is a general-purpose container for a dataset.

• list is a general-purpose collection of data objects. It is used to pass multiple values as function arguments and returns.

⁹ Data can be a single value or a collection of multiple values. These are called “objects”. They are designed to trade off flexibility of use and speed of access to multi-valued data.

¹⁰ A scalar item such as a single number or character string is seen as a single-cell vector. The cells can be seen either as a row or column.

¹¹ A matrix can have just a single row or column. A matrix may be square or rectangular.

¹² The codes are integers: 1, 2, .... The first level applies to code 1, the second to code 2, and so on.

¹³ The columns are usually seen as variables.


4.1 Making vectors

• Why make vectors that are not observed data? Vectors are used in programming, random variables for simulations, and dummy variables.

• “:” is an operator that makes a sequence of numbers

> 1:5 # Integer steps

> 5.5:-5

• seq makes a sequence with fractional steps

> seq(0,1, by=0.1) # Fractional steps

> seq(0,1, length=64) # From 0 to 1 so length=64

• rep repeats or replicates vectors

> x = c("low","med","high")

> rep(x, times=3) # Repeat whole vector

> rep(x, each=3) # Repeat cells (balanced)
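The length and order of the generated vectors are easy to verify. A brief sketch; length.out is the full name of the seq argument abbreviated as length above, and the names s, g, r1, and r2 are illustrative.

```r
s <- seq(0, 1, by = 0.1)         # 11 values: 0, 0.1, ..., 1
g <- seq(0, 1, length.out = 5)   # 5 equally spaced values from 0 to 1
length(s)
g

x  <- c("low", "med", "high")
r1 <- rep(x, times = 3)   # low med high low med high ...
r2 <- rep(x, each = 3)    # low low low med med med ...
```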

4.2 Making random vectors

• rnorm draws a random sample from a normal distribution¹⁴

> rnorm(50) # Standard normal: N(0,1)

> rnorm(50, mean=10, sd=2) # Mean and standard deviation

• sample draws a random sample from a vector

> sample(1:100) # Permute (shuffle)

> sample(1:100, size=10)

> sample(c("male","female"), size=100, replace=TRUE)
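Random vectors differ on every run, which can make results hard to reproduce. One common remedy, not covered in the text above, is the standard function set.seed. The sketch below assumes an arbitrary seed value of 42.

```r
set.seed(42)      # 42 is an arbitrary seed
a <- rnorm(5)
set.seed(42)      # resetting the seed replays the same draws
b <- rnorm(5)
identical(a, b)   # TRUE

g <- sample(c("male", "female"), size = 10, replace = TRUE)
table(g)          # counts of each value in the sample
```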

Exercise 14. Use rep with its times argument to make a character vector consisting of: 3 × "low", 5 × "med", and 2 × "high". (This might be used as a grouping indicator for an unbalanced experimental design.)

> rep(x, times=c(3,5,2)) # Repeat cells (unbalanced)

¹⁴ For other distributions see help(Distributions).

Exercise 15. Use “:” to make a vector of the sequence of integers: 1,...,6. Use function sample to draw a sample of size 100 from the vector. Note: see the replace argument on the function’s help page. (This might be used to simulate 100 throws of a fair 6-sided die.)

> x = 1:6

> sample(x, size=1) # A different sample on each run

> sample(x, size=100, replace=TRUE)

4.3 Making matrices and arrays

• matrix wraps a vector into a matrix. array wraps a vector into an array.

> x = 1:24

> matrix(x, nrow=6, ncol=4)

> matrix(x, nrow=6, ncol=4, byrow=TRUE)

> array(x, dim=c(3,4,2))

> array(x, dim=c(3,2,2,2))

• rbind and cbind bind vectors into a matrix

> x = c(1,3,4.5)

> y = c(7,8,9.5)

> rbind(x,y) # Row bind

> cbind(x,y) # Column bind

• diag makes a diagonal matrix

> diag(6)

> diag(1:6)
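How a vector is wrapped into rows, columns, and layers is easiest to see on a small example. A minimal sketch with illustrative names m1, m2, and a; it uses square-bracket indexing, which is described later in the handout.

```r
x  <- 1:6
m1 <- matrix(x, nrow = 2, ncol = 3)                # filled down the columns
m2 <- matrix(x, nrow = 2, ncol = 3, byrow = TRUE)  # filled along the rows
dim(m1)     # 2 3
m1[1, ]     # first row: 1 3 5
m2[1, ]     # first row: 1 2 3

a <- array(1:24, dim = c(3, 4, 2))   # two layers of 3 rows by 4 columns
a[1, 1, 2]  # 13: the first cell of the second layer
```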

Exercise 16. Make a vector x = 1:16 and convert it to a square matrix by columns (x1) and then by rows (x2). Use function t on both these matrices. It transposes (swaps rows and columns), and should make x1 look like x2 and vice versa.

> x = 1:16

> x1 = matrix(1:16,4,4)

> x2 = matrix(1:16,4,4, byrow=TRUE)

> t(x1)

> t(x2)


4.4 Making factors

• factor converts a vector to a factor.

> x = rep(c("male","female"), times=c(5,4)) # Character vector

> f = factor(x) # Convert character vector to factor

• Use it also to re-order and/or re-label factor levels. The order is alphabetical by default (hence female male).

> factor(f, levels=c("male","female")) # Re-order the levels

> factor(f, levels=c("female","male"), labels=c("F","M")) # Re-label

• gl makes a “balanced” factor.

> gl(2,5, labels=c("male","female"))

• cut makes a factor by cutting a numeric vector at intervals of its range. quantile calculates equal-frequency intervals.

> x = sample(100, size=50)

> cut(x, breaks=c(0,25,100)) # Specific intervals

> cut(x, breaks=4) # Equal-range intervals

> cut(x, breaks=quantile(x), include.lowest=TRUE) # Equal-frequency intervals (keep the minimum)
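The code numbers behind a factor can be inspected directly. This is a small sketch assuming the standard functions levels, as.integer, and table; f2 is an illustrative name.

```r
x <- rep(c("male", "female"), times = c(5, 4))
f <- factor(x)
levels(f)        # "female" "male" (alphabetical by default)
as.integer(f)    # the code numbers: 2 2 2 2 2 1 1 1 1
table(f)         # counts per level

f2 <- factor(f, levels = c("male", "female"))   # same data, new level order
as.integer(f2)   # 1 1 1 1 1 2 2 2 2
```

Re-ordering the levels changes the code numbers but not the labels attached to each observation.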

4.5 Making data frames and lists

• data.frame binds column vectors into a data frame.

> x = c(1,3,4.5)

> y = c("apple","apple","orange")

> dat = data.frame(x=x, y=y) # Column names x and y

• list collects data structures into a list

> z = list(x, y, dat)
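Giving list components names makes them easier to retrieve later. A hedged sketch: the component names num, chr, and df are hypothetical, and the $ operator used here is described later in the handout.

```r
x   <- c(1, 3, 4.5)
y   <- c("apple", "apple", "orange")
dat <- data.frame(x = x, y = y)
dim(dat)         # 3 rows, 2 columns

z <- list(num = x, chr = y, df = dat)   # hypothetical component names
names(z)         # "num" "chr" "df"
length(z)        # 3 components
z$df$x           # components can be nested: column x of the data frame
```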

Exercise 17. Make a random vector as follows. It could be a sample of responses to a 3-level item in a questionnaire. Convert the vector to a factor labelling the responses: no (for 1), maybe (for 2), and yes (for 3).

> x = sample(1:3, size=100, replace=TRUE)

> factor(x, levels=1:3, labels=c("no","maybe","yes"))

Exercise 18. Look at the sleep data frame and its help page (help(sleep)). This is a typical “long format” layout for data in R. One column contains the scores in both conditions, and other columns contain grouping factors to indicate which condition (group) and which subject (ID) each score belongs to. Suppose you have collected some experimental data from independent subjects randomly assigned to two conditions. Simulate the two samples of results like this:

> x = rnorm(10, mean=1)

> y = rnorm(10, mean=10)

Put these data into a data frame in long format with a grouping factor to indicate which condition each score belongs to. Name the two columns (eg): score and cond.

> data.frame(score=c(x,y), cond=gl(2,10))


5 Importing data

5.1 Where am I?

• R reads files of data from the current working directory.

> getwd() # What is the current working directory?

> dir() # List files and folders in the working directory

• Set the working directory¹⁵: File > Change dir...

5.2 Starting a new project

• Make a new folder for the project. Set the working directory to point to that folder. Start R from: start > All Programs

• Do something¹⁶

• Quit R using function q. Click the Yes button to save your session:

> q() # Quit R

• Objects are saved to a file¹⁷ named: .RData. Command history is saved to a file named: .Rhistory

• Keep different projects in different folders (each with its own .RData and .Rhistory)

• Resume an R session by double-clicking .RData. Objects and command history are restored.

¹⁵ You can set the working directory using function setwd. Its argument is the folder’s pathname, given inside quotes. The separator character should not be a Windows-style single backslash. It has to be either a forward-slash or a double back-slash.

¹⁶ At the very least create a variable, say x=0, so your session has something to save.

¹⁷ Windows may hide filenames that begin with a dot. To show them use the menu item: Tools > Folder Options... in any folder. (You may need to hit the Alt key to display a folder’s menu items.) On the View tab ensure the option: Show hidden files and folders is selected, and de-select the option: Hide extensions for known file types.
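The working directory can also be set from the command line with setwd. The sketch below is an illustration only: the folder name myproject is hypothetical and is created inside R's temporary directory, and file.path, tempdir, dir.create, and basename are standard functions not introduced in the text.

```r
old  <- getwd()                              # remember where we started
proj <- file.path(tempdir(), "myproject")    # hypothetical project folder
dir.create(proj, showWarnings = FALSE)       # make the folder

setwd(proj)                                  # point R at the project folder
in.proj <- basename(getwd()) == "myproject"  # TRUE while we are inside it
dir()                                        # list its files (none yet)

setwd(old)                                   # go back to the original folder
```

file.path joins the parts with forward-slashes, so the result is safe to use on Windows as well.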

5.3 Reading data

• read.table reads from a plain text file and returns a data frame.

> dat = read.table("myfile.txt", header=TRUE, sep="\t")

• The first argument is the filename¹⁸ and extension in quotes. The most useful options are:

header  Is the first line a header?
        header=FALSE (the default): no header.
        header=TRUE: the first line is column names.

sep     How are the columns separated?
        sep="" (the default): white space (one or more spaces or tabs).
        sep=",": comma.
        sep="\t": tab.

• The foreign library provides functions for proprietary formats. Function read.spss reads an SPSS¹⁹ data file.

> library(foreign)

> dat = read.spss("myfile.sav", to.data.frame=TRUE)

• You’ll want to check the number of rows and columns

> dim(dat) # Dimensions (rows,columns)

> nrow(dat) # Number of rows

> ncol(dat) # Number of columns

> names(dat) # Names of the variables (columns)

> summary(dat) # Summary of each column

> head(dat,10) # The first 10 rows

> tail(dat,10) # The last 10 rows

> View(dat) # A view of the data
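The effect of header and sep can be tried out without an existing data file. This is a self-contained sketch that first writes a small tab-separated file to a temporary location; the column names id and score and the use of tempfile, writeLines, and unlink are assumptions made for the example.

```r
tmp <- tempfile(fileext = ".txt")    # a throwaway file name
writeLines(c("id\tscore", "1\t3.5", "2\t4.0", "3\t2.5"), tmp)

dat <- read.table(tmp, header = TRUE, sep = "\t")
dim(dat)      # 3 rows, 2 columns
names(dat)    # "id" "score"
summary(dat)

unlink(tmp)   # remove the temporary file again
```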

Exercise 19. Download a file of plain-text data. First make a new folder that will be the “working directory” for this project. Then use a browser to download http://www.ats.ucla.edu/stat/data/hsb2.txt and save the file hsb2.txt to that folder. Open it (eg. in Notepad) to see its format. Does it have column headings? How are the columns separated?

¹⁸ If you give a pathname to the file you must use either forward-slashes or double back-slashes. Instead of a filename you can use the function file.choose, or the string "clipboard" to read from Windows clipboard, or a URL to data through the web.

¹⁹ A warning message like: “Unrecognized record type 7, subtype 18 encountered in system file” can be ignored. This is to do with SPSS compatibility with its own previous versions, and does not mean the data has not been recognised.

Exercise 20. Import the text file hsb2.txt into R as a data frame. First point the R working directory to the folder that contains the file (eg. use File > Change dir...). Then use function read.table to read the file. The file name is the first argument. Remember to give the name in quotes including the .txt extension: "hsb2.txt". The rest of the arguments control how the function interprets the file format. Remember to set argument header=TRUE if the file has column headings (which it does here), and to pass the column separator to argument sep (not needed in this example because the columns are separated by spaces, which are the default). read.table returns a data frame so you’ll want to assign it a name (such as dat = read.table(...)).

Exercise 21. The first thing to do after importing a file is to check the data size and column names. Use functions dim and names on the data frame. You want to see 200 rows and 11 columns of data with columns named: "id" "female" "race" "ses" "schtyp" "prog" "read" "write" "math" "science" "socst".

> # Plain text with column names and columns separated by spaces

> dat = read.table("hsb2.txt", header=TRUE)

> dim(dat)

> names(dat)

Exercise 22. Quit R and save your “workspace image”. Check that a .RData file has appeared in your working directory. Now restart R by clicking on the .RData file. Run function ls() to list the objects that have been restored. You should see the data frame again. Maybe also run history() to see the last few commands you ran before quitting.

> ls()

> history()

6 Vector operators

6.1 Arithmetic operators

• Arithmetic operators: + - * / ^ are vectorized. That means the operation applies element-wise. (In most languages this would need a programming loop.)

> x = c(0,2,4,6,8,10)

> x^2 # Return a vector: the square of each element

> x + 2 # What is (each element of) x plus 2?

• Operators between vectors apply to elements pair-wise

> x = c(0,2,4,6,8,10)

> y = c(7,9,11,1,3,5)

> x+y # Add corresponding pairs of elements

• If the vectors are different lengths the shorter vector is “recycled” to the length of the longer vector

> x = c(0,2,4,6,8,10)

> y = c(1,3,5)

> x+y

• A warning is displayed if the longer length is not an exact multiple of the shorter length

> x = c(0,2,4,6,8,10)

> y = c(1,3,5,7,9)

> x+y
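Recycling is equivalent to repeating the shorter vector with rep before the operation. A minimal sketch; the names r, same, and w are illustrative, and tryCatch is a standard function used here only to capture the warning.

```r
x <- c(0, 2, 4, 6, 8, 10)
y <- c(1, 3, 5)

r    <- x + y                                 # y is used twice: 1 3 5 1 3 5
same <- identical(r, x + rep(y, times = 2))   # recycling written out by hand

# With lengths 6 and 5 the sum is still computed, but R raises a warning
w <- tryCatch(x + c(1, 3, 5, 7, 9),
              warning = function(cond) "warning raised")
```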


Exercise 23. Calculate $x^2 + x - 6$ when x = c(-3,-0.5,2).

Exercise 24. Generate a sequence of values as follows and pass the vector as argument to the function you wrote earlier to calculate and return the value of $x^2 + x - 6$. What is returned?

> x = seq(from=-4, to=4, by=0.1)

To plot the function over the range of x save the value returned by your function as y and then pass x and y as arguments to function plot like this: plot(x,y).

> myfunction = function(x=0) {

+ x^2 + x - 6

+ }

> x = seq(-4,4,0.1)

> y = myfunction(x)

> plot(x,y)

6.2 Arithmetic functions

• Functions that are vectorized apply an operation element-wise and return a vector the same length as the input.

> x = c(0,2,4,6,8,10)

> x = sqrt(x) # Square root of each element

> x = round(x,1) # Round each element to 1dp

Some arithmetic functions

round             Round to given number of decimal places
trunc             Truncate toward zero to a whole number
abs               Absolute (unsigned) value
sqrt              Square root
exp               Exponential
log, log10, log2  Log to base e, 10, and 2
sin, cos, tan     Trigonometric functions
asin, acos, atan  Inverse (arc) trigonometric functions

6.3 Descriptive functions

• Functions that summarize a vector and return a single value.

> x = rnorm(100, mean=10, sd=2) # Data to summarize

> length(x) # Length (number of cells, sample size)

> sum(x) # Sum

> mean(x) # Mean

> sd(x) # Standard deviation

Some descriptive functions

length           Number of cells in a vector
sum              Sum of the values in a vector
min, max, range  Minimum, maximum, and range (min,max)
mean, median     Mean and median of the values in a vector
sd, var          Standard deviation and variance

Exercise 25. Write a function that takes a numeric argument and returns the sample size, mean, and variance in a list. Test it by passing a sample N=100 drawn from a standard normal.

> foo = function(x) {

+ list(length(x), mean(x), var(x))

+ }

> foo(rnorm(100))

Exercise 26. Use function rnorm to make a vector that is a random sample of 1000 draws from a normal distribution with mean 100 and standard deviation 15. Use function hist to plot a histogram of the sample. Centre the sample on 50 (subtract 50 from each number), and plot the histogram of the result. Standardize the sample by centering on the sample mean and scaling (dividing each number) by the sample standard deviation. Plot the result.

> x = rnorm(1000, mean=100, sd=15)

> hist(x)

> hist(x-50)

> hist((x-mean(x))/sd(x))

Exercise 27. Write a function named SS that takes a numeric vector argument and returns the sum of the squared deviations from the sample mean. Use your function to calculate the sum-of-squares of a random vector, say a standard normal sample N=100. Check that the sum-of-squares returned by your function divided by N-1 is the same as the sample variance calculated by function var.

> SS = function(x) {

+ sum((x - mean(x))^2)

+ }

> x = rnorm(100)

> SS(x) / 99

> var(x)


6.4 Conditional operators

• Conditional operators²⁰ ask questions.

x == y  x equal to y?
x != y  x not equal to y?
x < y   x less than y?
x <= y  x less than or equal to y?
x > y   x greater than y?
x >= y  x greater than or equal to y?

• The result of each question is either TRUE or FALSE. The operators are vectorized and return logical vectors.

> x = c(0,2,4,6,8,10)

> y = c(7,9,11,1,3,5)

> x == 4 # Is (each element of) x equal to 4?

> x > 4 # Is (each element of) x greater than 4?

> x > y # Compare corresponding pairs of elements

• Conditional operations on characters use string matching. Strings are compared alphabetically and case-sensitively.

> x = c("apple","orange","orange","orange","apple")

> x == "apple" # Is (each element of) x equal to "apple"?

> x != "apple" # Is (each element of) x NOT equal to "apple"?

6.5 Logical operators

• ! (NOT) inverts a logical vector. & (AND) and | (OR) combine logical vectors.

> x = c(0,2,4,6,8,10)

> !x == 6 # NOT x == 6

> x > 0 & x < 8 # > 0 AND < 8

> x < 4 | x > 6 # < 4 OR > 6

²⁰ Annoyance: a conditional expression like x<-1 is mis-interpreted as assignment. The solution is to include space around the operator: x < -1.
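The conditional and logical operators can be combined as follows. A small sketch with illustrative names a, b, d, and e; the vector includes a negative value so that the comparison with -1 has something to find.

```r
x <- c(-3, 0, 2, 4, 6, 8, 10)

a <- x < -1          # comparison: the spaces keep < and - apart
b <- !(x == 6)       # brackets make the scope of NOT explicit
d <- x > 0 & x < 8   # both conditions must hold
e <- x < 4 | x > 6   # either condition may hold

sum(d)               # how many values lie strictly between 0 and 8
```

Writing x<-1 without spaces would instead overwrite x with the value 1.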

6.6 Arithmetic with logical vectors

• Arithmetic is defined for logical vectors: TRUE → 1 and FALSE → 0

• sum and mean summarize conditions: sum counts TRUE, mean calculates the proportion TRUE

> x = c(0,2,4,6,8,10)

> y = c("apple","apple","orange","apple","orange","banana")

> sum(x > 4) # How many of x are > 4?

> mean(x > 4) # What proportion of x are > 4?

> sum(y=="apple") # How many y are "apple"?

> mean(y=="apple") # What proportion are "apple"?

> sum(y=="apple" | y=="banana") # How many "apple" or "banana"?

> sum(y=="apple" & x>0) # How many "apple" have x > 0?

Exercise 28. There is a variable named LETTERS that is a character vector: "A",...,"Z". Use sample with this vector to draw a random sample of 1000 letters. How many times does "A" occur in your sample?

> x = sample(LETTERS, size=1000, replace=TRUE)

> sum(x == "A")

Exercise 29. Draw a random sample of 1000 from a normal distribution with mean 10 and standard deviation 3. What proportion of the sample are above the sample mean? What proportion of the sample are more than 1.96 standard deviations above the sample mean?

> x = rnorm(1000, mean=10, sd=3)

> mean(x > mean(x))

> q = mean(x) + 1.96*sd(x)

> mean(x > q)

6.7 Arithmetic with missing values

• Missing values should be coded NA (“Not Available”)

• Arithmetic operations propagate NA. Descriptive functions have an argument na.rm (“NA remove”)

> x = c(3,5,1,NA,4)

> mean(x) # NA propagated

> mean(x, na.rm=TRUE) # Remove NA from the calculation


• Conditional operations propagate NA. Use function is.na to test for NA

> sum(x==NA) # Wrong!

> sum(is.na(x)) # How many NA?

> sum(!is.na(x)) # How many are NOT NA?
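Why x==NA is wrong becomes clear when the comparison is printed. A minimal sketch with illustrative names cmp, n.missing, and m.

```r
x <- c(3, 5, 1, NA, 4)

cmp <- (x == NA)             # NA NA NA NA NA: nothing can be decided
is.na(x)                     # FALSE FALSE FALSE TRUE FALSE

n.missing <- sum(is.na(x))   # count of missing values
m <- mean(x, na.rm = TRUE)   # mean of the four observed values
```

Because every element of cmp is NA, sum(x==NA) returns NA rather than a count.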

Exercise 30. Draw a random sample of size N=100 from the vector c(0,1,NA). Calculate the proportion of your sample that is missing. Pass your sample to the function you wrote earlier to return the sample size, mean, and variance as a list. Modify your function so it handles missing values.

> x = sample(c(0,1,NA), size=100, replace=TRUE)

> foo = function(x) {

+ # Note how to get number of non-missing values

+ list(sum(!is.na(x)), mean(x,na.rm=TRUE), var(x,na.rm=TRUE))

+ }

> foo(x)

7 Data manipulation

7.1 Get and set named data using “$”

• The $ operator is used to address columns of a data frame or components of a list.

• Get and set columns of a data frame

> names(sleep) # What are the column names?

> sleep$extra # Get column "extra" by name

> mean(sleep$extra) # Mean of "extra"

> sleep$my.col = 0 # Set a new column ("my.col")

> sleep$my.col = NULL # Drop a column

• Get and set components of a list²¹

> x = rnorm(20)

> y = rnorm(20)

> fit = t.test(x,y) # Function returns list-like object

> names(fit) # What are the component names?

> fit$p.value # Get a named component

• Using attach and detach

> attach(sleep)

> mean(extra) # No need for $

> detach(sleep)

• Using with

> with(sleep, mean(extra)) # No need for $
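The $ and with forms give the same answer, and with becomes more convenient as expressions grow. A brief sketch using the built-in sleep data frame; m1, m2, and m3 are illustrative names.

```r
m1 <- mean(sleep$extra)                         # address the column with $
m2 <- with(sleep, mean(extra))                  # evaluate inside the data frame
m3 <- with(sleep, mean(extra[group == "1"]))    # mean for the first group only

identical(m1, m2)   # TRUE: both forms read the same column
```

Unlike attach, with does not leave the data frame's columns visible after the expression has finished.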

Exercise 31. The data frame named trees contains data on the girth, height, and timber volume measurements of 31 trees. Use names to find the column names.
a) What is the sample mean height of the trees?
b) How many trees in the sample are over 80 feet tall?
c) How many trees over 80 feet tall have girth less than 15 inches?
d) Scatterplot girth against height by passing these columns as arguments to plot.

²¹ Many R objects are a “list-like” collection of components (see help(Extract)). Many functions return multiple values within a list-like object.


> mean(trees$Height) # a

> sum(trees$Height > 80) # b

> sum(trees$Height > 80 & trees$Girth < 15) # c

> plot(trees$Height, trees$Girth) # d

7.2 Indexing

• Indexing means addressing particular cells to get or set values.

• Indexing uses square brackets:
Vectors are indexed using: x[i]
Matrices and data frames are indexed using: x[i,j]
Arrays are indexed using: x[i,j,k,...]

• The index pair (x[i,j]) of a matrix or data frame: The first index (i) is the row index. The second index (j) is the column index.

• Each index (i, j, ...) is a vector or factor.

7.3 Using a numeric index

• Cells in a vector, matrix, array, or data frame are ordered and numbered. Cells of a vector x are numbered: 1,...,length(x). Rows of a matrix or data frame x are numbered: 1,...,nrow(x). Columns are numbered: 1,...,ncol(x).

> x = c(3,4,1,2,5,6) # A vector

> x[1] # Get first cell (there is no cell 0: x[0] returns an empty vector)

> x[1] = 0 # Set first cell

> sleep # A data frame with 20 rows and 3 columns

> sleep[1,1] # Get cell at first row and first column

• An index can address any type of vector

> y = c("red","blue","green") # Character vector

> y[1] # Get first cell

> y[1] = "purple" # Set first cell

• The index is itself a vector and can address multiple cells.

> i = 1:3 # Index vector for the first three cells

> x[i] # Get the first three cells

> x[i] = 0 # Set the first three cells (0 is recycled)

> x[i] = c(7,8,9) # Set the first three cells

> x[i] = x[i] + 1 # Update the first three cells

> i = 1:3 # Index vector for the first three rows

> j = c(1,3) # Index vector for the first and third columns

> sleep[i,j] # Get first 3 rows, first and third columns

• An “empty” index is shorthand for a complete index:
  x[i,] Index rows (over all columns)
  x[,j] Index columns (over all rows)

> sleep[2:3,] # Get rows 2:3 (over all columns)

> sleep[,2:3] # Get columns 2:3 (over all rows)

> sleep[,1] # Get column 1

• Special syntax for addressing single columns.

> sleep[1] # Get the first column as a data frame

> sleep[[1]] # Get the first column as a vector
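The ways of addressing a single column differ in the class of the result. A minimal sketch using the built-in sleep data; the drop=FALSE argument is not used elsewhere in these notes and is shown here only as a related option:

```r
# Addressing the first column of the built-in sleep data frame
a = sleep[1]                  # single bracket: a one-column data frame
b = sleep[[1]]                # double bracket: the column as a plain vector
d = sleep[, 1]                # empty row index: also a plain vector
e = sleep[, 1, drop = FALSE]  # drop=FALSE keeps the data frame structure

class(a)
class(b)
identical(b, d)
class(e)
```

The vector forms suit functions like mean; the data frame forms keep the column name attached.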

• Cells are addressed in the order of the index vector

> i = c(3,5,1) # Index vector

> x[i] # Get cells in index order

> x[i] = c(9,7,8) # Set cells in index order

> i = nrow(sleep) # Row index (last row)

> j = c(3,2,1) # Column index

> sleep[i,j] # Get last row in reverse order

• A numeric index can be negative.
  A negative index addresses all cells except those indexed.

> x[-1] # Get all except the first cell

> x[-(1:3)] # Get all except the first three cells

> sleep[-(2:3),] # Get all rows except 2 and 3

> sleep[,-ncol(sleep)] # Get all columns except the last

• Positive index elements can be repeated

> x = c("red","blue","green","orange","magenta")

> x[c(1,1,1,2,2,3)] # Index with replications

• A factor can be used as an index.
  The factor’s code numbers are a numeric index.

> as.numeric(sleep$group) # The factor's code numbers

> c("low","high")[sleep$group] # The factor as an index
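A factor index works through its integer codes, not its labels. A small sketch with a made-up factor (the names f, codes, and picked are illustrative only):

```r
# A factor indexes by its integer codes, not by its labels
f = factor(c("b", "a", "b"))      # levels are sorted: "a" is code 1, "b" is code 2
codes = as.integer(f)             # 2 1 2
picked = c("first", "second")[f]  # same as c("first", "second")[codes]
picked                            # "second" "first" "second"
```

Because the codes follow the order of the levels, the vector being indexed should list its values in level order.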


Exercise 32. Derive an index vector to extract the vowels ("A" "E" "I" "O" "U") from the LETTERS vector. Use it also to extract the consonants.

> i = c(1,5,9,15,21)

> LETTERS[i]

> LETTERS[-i]

Exercise 33. Drop the last row from the trees data frame. (Function nrow gets the number of rows in a data frame).

> trees[-nrow(trees),]

Exercise 34. Draw a random sample of 1000 from a standard normal distribution and make 10% of it missing completely at random. (Hint: use sample to derive a random index vector of 100 random draws from 1:1000). As a check calculate the proportion of the sample that is missing.

> x = rnorm(1000)

> i = sample(1000, size=100)

> x[i] = NA

> mean(is.na(x))

7.4 Using a character index

• Data with named elements can be indexed by name.

> sleep[c("ID","group")] # Get columns by name in that order

• One index can be numeric and the other character:

> sleep[2:3,"extra"] # Get rows 2:3 of column "extra"

• Index vectors can be derived by string matching for names.
  Function grep returns indices of a character vector that match patterns specified as regular expressions [22].

> # Match column names in the "swiss" data that begin with "E"

> grep("^E", names(swiss)) # Numeric

> grep("^E", names(swiss), value=TRUE) # Character
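The indices returned by grep can be passed straight back as a column index. A short sketch on the swiss data; grepl, the logical counterpart of grep, is not covered in the notes and is included only for comparison:

```r
# Use the indices from grep to extract the matching columns of swiss
i = grep("^E", names(swiss))   # numeric index of names beginning with "E"
e.cols = swiss[i]              # data frame of the matching columns
names(e.cols)

# grepl returns a logical vector instead, one element per name
l = grepl("^E", names(swiss))
identical(which(l), i)
```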

[22] “Regular expressions” are a tiny language used to describe patterns of characters in strings. See: help(regex).

7.5 Logical index vectors and conditional indexing

• A logical index is a pattern derived from a conditional expression.
  Logical indices address cells where the index is TRUE.

> x = c(3,4,1,2,7,5,0,6)

> x > 2 # Logical vector

> x[x > 2] # Using the logical vector as an index

> sleep[sleep$group == 1,] # Get rows in group 1

• subset is a convenience function for conditional indexing of data frame rows.

> subset(sleep, group==1) # Same as: sleep[sleep$group==1,]

• Conditions can be inverted using ! (NOT)

> i = x > 3 # Logical vector

> x[i] # Get where TRUE

> x[!i] # Get where NOT TRUE

• Composite conditions can be formed using & (AND) and | (OR)

> i1 = x > 1

> i2 = x < 5

> x[i1 & i2] # > 1 AND < 5

> i1 = x < 1

> i2 = x > 5

> x[i1 | i2] # < 1 OR > 5

• Multiple comparisons:
  x %in% y returns a logical vector the same length as x that is TRUE for each element of x that is anywhere in y.

> sleep[sleep$ID %in% c(2,4,6),] # Get rows where ID is in 2,4,6

• Turning a logical index into a numeric index:
  which(x) returns numeric indices where logical x is TRUE.

> which(sleep$extra < 0) # Which rows have extra<0?

Exercise 35. Extract rows from the sleep data frame where extra > 0 AND extra < 4


> i1 = sleep$extra > 0

> i2 = sleep$extra < 4

> sleep[i1 & i2,] # > 0 AND < 4

Exercise 36. In the trees data:
a) What is the average girth of the trees over 80 feet tall?
b) Extract girth and height data on trees below average volume.

> mean(trees$Girth[trees$Height > 80]) # a

> trees[trees$Volume < mean(trees$Volume), c("Girth","Height")] # b

Exercise 37. Create a factor with 2 levels labelled "short" and "tall" to indicate which rows of the trees data frame are trees over 80 feet tall. Use your factor to index the rows of the data frame to extract the data on tall trees.

> i = factor(trees$Height > 80, levels=c(TRUE,FALSE), labels=c("tall","short"))

> trees[i=="tall",]

Exercise 38. The airquality data frame contains data with missing values. Use summary on the data frame to see which columns have missing values. Extract rows of the data frame where Ozone < 20. Be careful about handling the missing values. (Continued).

> i = airquality$Ozone < 20

> airquality[i,]

Exercise 39. The column named Ozone has 37 missing values. If you derive a logical index for Ozone < 20 then the index itself will contain missing values, (NA). Indexing with NA always gets NA, (because the result would always be undecidable). The solution is to convert NA in the index either to TRUE or to FALSE, depending upon how you decide to handle it. Suppose you decide to keep rows that have missing Ozone values because the rest of the row may still be useful. How could you derive the index that is TRUE where Ozone < 20 OR where the value is missing? Use that to extract rows of the data frame where Ozone < 20.

> i = is.na(airquality$Ozone) | airquality$Ozone < 20

> airquality[i,]
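The opposite choice, treating a missing value as FALSE, can be sketched the same way. The names oz, keep.na, and drop.na below are illustrative; which is used because it ignores NA in a logical vector:

```r
# Two ways to handle NA in a logical index, using the airquality data
oz = airquality$Ozone
keep.na = is.na(oz) | oz < 20    # NA becomes TRUE: rows with missing Ozone are kept
drop.na = !is.na(oz) & oz < 20   # NA becomes FALSE: rows with missing Ozone are dropped

nrow(airquality[keep.na, ])
nrow(airquality[drop.na, ])

# which() addresses the same rows as drop.na because it ignores NA
identical(which(oz < 20), which(drop.na))
```

Neither index contains NA, so neither produces rows of NA in the result.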

8 Manipulating data frames

8.1 Sorting

• sort returns a vector sorted in ascending (or descending) order.
  Numerical vectors are sorted numerically.
  Character vectors are sorted alphabetically.

> i = sort(names(sleep), decreasing=TRUE) # Sorted names

> sleep[i] # Sort columns by name

• order derives an index vector:
  Get the index vector that “would” sort a vector.
  Use it to sort one vector by another.
  Use it to sort data frame rows by one or more columns.

> x = c(3,1,2,1,4)

> order(x) # Get the index for sorting

> x[order(x)] # Same as sort(x)

> sleep[order(sleep$extra),] # Sort rows by extra

> sleep[order(sleep$group,sleep$extra),] # extra within group
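Sorting one vector by another, and sorting in descending order, might look like the sketch below. The vector lab is made up for illustration:

```r
x = c(3, 1, 2, 1, 4)
i = order(x)                     # 2 4 3 1 5 (tied values keep their original order)
identical(x[i], sort(x))

# Sort one vector by another
lab = c("c", "a", "b", "a2", "d")
lab[i]                           # "a" "a2" "b" "c" "d"

# Descending order
x[order(x, decreasing = TRUE)]   # 4 3 2 1 1
```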

8.2 Merging

• merge joins two data frames by a common column

> # The "id" column is common to both data frames

> dat1 = data.frame(id=c("s1","s2","s3"), test1=rnorm(3))

> dat2 = data.frame(id=c("s3","s2","s1"), test2=rnorm(3))

> merge(dat1, dat2, by="id")
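By default merge keeps only the rows whose key appears in both data frames. The sketch below uses made-up data in which the ids only partly overlap, and the all and all.x arguments of merge to keep unmatched rows:

```r
# Two data frames that share only some "id" values (made-up data)
dat1 = data.frame(id = c("s1", "s2", "s3"), test1 = c(10, 20, 30))
dat2 = data.frame(id = c("s4", "s3", "s2"), test2 = c(4, 3, 2))

inner = merge(dat1, dat2, by = "id")                # ids present in both
outer = merge(dat1, dat2, by = "id", all = TRUE)    # all ids, NA where missing
left  = merge(dat1, dat2, by = "id", all.x = TRUE)  # all ids from dat1

inner
outer
left
```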


8.3 Reshaping

• reshape converts between “wide” and “long” format

• direction="wide" reshapes long to wide:
  v.names   Variable to unstack
  varying   List of names for the unstacked variables
  idvar     Subject factor
  timevar   Repetition factor

> sleep.wide = reshape(sleep, direction="wide", v.names="extra",

+ varying=list(c("drug1","drug2")),

+ idvar="ID", timevar="group")

• direction="long" reshapes wide to long:
  varying   Variables to stack
  v.names   Name for the stacked variable
  idvar     Name for the subject factor
  timevar   Name for the repetition factor

> sleep.long = reshape(sleep.wide, direction="long",

+ varying=2:3, v.names="extra",

+ idvar="ID", timevar="group")

Exercise 40. Extract two columns from the airquality data frame: Temp and Wind, sorted by Temp and within that by Wind.

> airquality[order(airquality$Temp, airquality$Wind), 4:3]

9 Tables

9.1 Summarizing numeric columns

• colSums and colMeans calculate column sums and means.

> colMeans(swiss)

• sapply applies a given function to each column.

> sapply(swiss, mean) # Apply "mean" to each column

> x = rbind( N = sapply(swiss, length),

+ Mean = sapply(swiss, mean),

+ SD = sapply(swiss, sd) )
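As a quick check, colMeans and sapply with mean should agree. The sketch also shows extra arguments being passed through sapply to the applied function, here na.rm=TRUE on the airquality data (names m1, m2, m3 are illustrative):

```r
# colMeans and sapply(..., mean) give the same column means
m1 = colMeans(swiss)
m2 = sapply(swiss, mean)
all.equal(m1, m2)

# Extra arguments to sapply are passed on to the function,
# e.g. na.rm=TRUE for columns with missing values
m3 = sapply(airquality, mean, na.rm = TRUE)
m3
```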

• describe calculates summary statistics of each column.

> library(psych) # "describe" is in package "psych"

> tab = data.frame(describe(swiss)) # Convert table to data frame

> tab[c("n","mean","sd")] # Extract columns by name

Exercise 41. Do you still have the hsb2 data frame of text data downloaded earlier? (Use ls to list all your objects. Use class to see the type of an object). Import it again if necessary. Load the library named psych to access the describe function. Use it to make a table of the read, write, math, and science variables. Extract a subset of the table columns: n, mean, sd, skew, and kurtosis, and round to 2 decimal places.

> library(psych)

> dat = read.table("http://www.ats.ucla.edu/stat/data/hsb2.txt", header=TRUE)

> tab = data.frame(describe(dat[7:10]))

> tab = round(tab[c(2,3,4,11,12)], 2)

9.2 Summarizing categorical columns

• table makes contingency tables of counts.

> data(survey, package="MASS") # Load dataset from library

> table(survey$Smoke) # Counts in levels

> table(survey$Sex, survey$Smoke) # Cross-tab


• prop.table converts counts to proportions.

> x = table(survey$Sex, survey$Smoke)

> prop.table(x, margin=1) * 100 # Percentages of row sums

> prop.table(x, margin=2) * 100 # Percentages of column sums
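The margin argument decides which totals the proportions are taken over. A sketch that checks this, using the built-in mtcars data rather than survey so that no extra package is needed (that substitution is an assumption of the example):

```r
# Cross-tab of cylinders by gears in the built-in mtcars data
x = table(mtcars$cyl, mtcars$gear)
p.row = prop.table(x, margin = 1)   # each row sums to 1
p.col = prop.table(x, margin = 2)   # each column sums to 1
p.all = prop.table(x)               # no margin: the whole table sums to 1

rowSums(p.row)
colSums(p.col)
sum(p.all)
```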

Exercise 42. Suppose part of the codebook for the hsb2 data downloaded earlier looks like this:

female   Gender (0=male, 1=female)
prog     Type of program (1=general 2=academic 3=vocational)

Use function factor to convert the female and prog variables to factors. Label the levels as given in the codebook. Remember to save the factors back into the data frame. Use function table on the female and prog variables to check you have assigned the labels the right way round. (You want to see 91 male, 109 female, 45 general, 105 academic, and 50 vocational).

> # Labels are applied to levels in their default order (so 0 gets "male")

> dat$female = factor(dat$female, labels=c("male","female"))

> dat$prog = factor(dat$prog, labels=c("general","academic","vocational"))

> table(dat$female)

> table(dat$prog)

Exercise 43. Use function table to make a two-way contingency table showing the counts, (the observed frequencies), of each gender in each of the three kinds of prog. Use prop.table to convert this table to a table showing the proportion of each gender in each of the three kinds of program. Scale and round the table to show the proportions as percentages to the nearest percentage.

> # Note for expected frequencies: chisq.test(tab)$expected

> tab = table(dat$prog, dat$female)

> round(prop.table(tab, 1) * 100)

9.3 Summarizing groups

• Tables of group means, variances, and so forth.

• tapply applies a given function to a vector grouped by factors.

> tapply(survey$Pulse, survey$Smoke, mean, na.rm=TRUE) # Cell means

> tapply(survey$Pulse, survey$Smoke, var, na.rm=TRUE) # Variances

• For 2-way tables pass two factors in a list.

> with(survey, tapply(Pulse, list(Sex,Smoke), mean, na.rm=TRUE))
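The same cell means can also be produced in “long” format, one row per cell. A sketch on the built-in warpbreaks data; aggregate is not covered in these notes and is shown only as a related alternative:

```r
# Cell means as a 2-way table (wool in rows, tension in columns)
m = with(warpbreaks, tapply(breaks, list(wool, tension), mean))
dim(m)                          # 2 rows by 3 columns

# aggregate gives the same means in "long" format, one row per cell
a = aggregate(breaks ~ wool + tension, data = warpbreaks, FUN = mean)
nrow(a)                         # 6 cells
```

The table form is convenient for display; the long form is itself a data frame and can be indexed, sorted, or merged like any other.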

Exercise 44. Make a table showing the average read score for each gender in each of the three kinds of program in the hsb2 data.

> tapply(dat$read, list(dat$prog, dat$female), mean)

9.4 Tidying up your table

• rownames and colnames get and set names.

> x = with(survey,tapply(Pulse,list(Sex,Smoke),mean,na.rm=TRUE))

> colnames(x) = c("Heavy","Never","Occasional","Regular")

> x = round(x, 2) # Round

> t(x) # Transpose (swap rows and columns)

9.5 Saving tables

• sink diverts output to a text file.

> sink("mytable.txt") # Turn on saving

> x # All output here is diverted

> sink() # Turn off saving

• write.table and write.csv write tables to files.

> write.table(x, "mytable.txt", quote=FALSE, sep="\t")

> write.csv(x, "mytable.csv") # Opens in Excel
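One way to check a saved table is to read it back. The sketch below writes a table of warpbreaks cell means to a temporary file (tempfile is used here only to avoid cluttering the working directory) and reads it back with the row names restored:

```r
# Table of cell means from the built-in warpbreaks data, rounded to 2 places
x = round(with(warpbreaks, tapply(breaks, list(wool, tension), mean)), 2)

f = tempfile(fileext = ".csv")   # temporary file name (an assumption of this sketch)
write.csv(x, f)                  # row names are written as the first column
y = read.csv(f, row.names = 1)   # read the first column back as row names

all.equal(as.matrix(y), x, check.attributes = FALSE)
```

Note that read.csv returns a data frame, so as.matrix is needed to compare it with the original table.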

Exercise 45. With the table of average read scores, (previous exercise), round the averages to 2 decimal places and use write.csv to save the table to a ".csv" file in your current working directory. Check the file opens in Microsoft Excel.

> tab = tapply(dat$read, list(dat$prog, dat$female), mean)

> write.csv(round(tab,2), "read-scores.csv")


10 Graphics

High level graphics functions

plot              Scatterplot
pairs             Scatterplot matrix
coplot            Conditioning plot
hist              Histogram
stem              Stem-and-leaf plot
boxplot           Box-and-whisker plot
qqnorm            Quantile-quantile plot
barplot           Bar plot
dotchart          Dot plot
interaction.plot  Profile plot of group means

• “High-level” functions create new graphs with axes.

10.1 Scatterplots

• Arguments provide data in several ways [23]:

1. Two arguments: x,y are bivariate coordinates.

2. One argument that is a data frame or matrix.
   The first column is the x-values, the second the y-values.

3. One argument that is a “formula”: y~x.
   The left side is the y-values, the right side the x-values.

> plot(cars$speed, cars$dist) # See: help(plot.default)

> plot(cars) # See: help(plot.data.frame)

> plot(dist~speed, data=cars) # See: help(plot.formula)

10.2 Histograms, boxplots, and barplots

> hist(cars$dist) # See: help(hist)

> boxplot(extra~group, sleep) # See: help(boxplot)

> height = tapply(sleep$extra, sleep$group, median)

> barplot(height) # See: help(barplot)

[23] See: help(xy.coords).

10.3 Graphical parameters

Graphical parameters

type  Plot type (points, lines, both, ...)
pch   Plot character (circles, dots, triangles, symbols, ...)
cex   Size (character expansion)
lty   Line type (solid, dots, dashes, ...)
lwd   Line width
col   Colour

...and many others, see: help(par)

• main, ylab, and xlab: main title and axis labels.

> plot(dist~speed, data=cars,

+ main="Speed and Stopping Distances of Cars", # Main title

+ ylab="Stopping distance (ft)", # Label y-axis

+ xlab="Speed (mph)") # Label x-axis

• ylim and xlim: axis ranges [24].

> plot(dist~speed, data=cars,

+ ylim=c(0,150), # Limits of y-axis

+ xlim=c(0,50)) # Limits of x-axis

• type: plot type [25]

> plot(cars$dist, type="p") # Points (default)

> plot(cars$dist, type="l") # Lines

> plot(cars$dist, type="b") # Both points and lines

> plot(cars$dist, type="n") # No plot

• pch: plot characters [26]

> plot(dist~speed, data=cars, pch="A") # Literal character

> plot(dist~speed, data=cars, pch=2) # Character code

> plot(1:25, pch=1:25) # The codes

[24] By default the range is calculated so the data fills the plot.

[25] See: help(plot).

[26] See: help(points).

• cex: character expansion (size)

> plot(1:20, cex=1:20) # Character expansion

• lty and lwd: line type and width

> plot(cars$dist, type="l", lty=2) # Line type

> plot(cars$dist, type="l", lwd=2) # Line width

• col: colour [27]

> plot(dist~speed, data=cars, col="blue") # Named colour

> colours() # Colour names

• family: font family [28].

> names(windowsFonts()) # Font families provided for windows

> plot(dist~speed, data=cars, family="serif") # Times Roman

• font: typeface (1=plain, 2=bold, 3=italic, 4=bold italic).
  Labels [29] can be given individual arguments in a list.

> plot(dist~speed, data=cars,

+ ylab=list("Distance",font=2,col="blue"),

+ xlab=list("Speed",font=3,cex=2,col="red"))

Exercise 46. Use function hist to plot a histogram of read in the hsb2 data. Give it a nice colour, (maybe "lightblue"), and tidy up the labelling, (try main="" and xlab="read").

> hist(dat$read, col="lightblue", main="", xlab="read")

Exercise 47. Use the pairs function to plot a matrix of scatterplots of the pairwise relationships between read, write, math, science, and socst in the hsb2 data. Then use function cor to calculate the correlation matrix for these variables. Do the correlations reflect the patterns in the plot?

> pairs(dat[7:11])

> cor(dat[7:11]) # All are strong positive correlations

[27] See: help(colours) and help(rgb).

[28] Serifs are small decorations on characters such as the little bars top and bottom of a capital “I”. Times Roman is a typical serif font. A sans serif font is literally without serifs. Arial is a typical sans serif font. A monospaced font is a fixed-width font like a traditional typewriter. Courier is a typical monospaced font.

[29] See also: help(expression) and help(plotmath).

Exercise 48. The data frame named iris contains measurements of 3 species of iris flowers. Scatter plot the measurements of Petal.Length against Petal.Width. Try changing the plot character to, say, pch=20. Suppose you want to indicate the Species of each iris by the colour of the plotted point. How many species of iris are there in this data set? Make a vector of colour names, one for each species. Index that vector using the Species factor. Pass the result of that as the value for the col argument to control the colour of each point in the plot.

> plot(Petal.Length~Petal.Width, iris, pch=20)

> nlevels(iris$Species)

> plot(Petal.Length~Petal.Width, iris, pch=20, col=c("red","green","blue")[iris$Species])

10.4 Low-level functions

• “Low-level” functions add graphics (points, lines, text, etc.) to a graph created by a high-level function.

> plot(dist~speed, data=cars) # High-level plot

> points(subset(cars,speed>20), col="red", pch=20) # Points

> lines(x=c(22,20,12),y=c(20,100,110), col="blue", lty=2) # Lines

> text(x=20,y=110, "High speed!") # Text

• abline adds lines and slopes.

> abline(h=mean(cars$dist), v=mean(cars$speed), col="grey")

> abline(lm(dist~speed,data=cars)) # Regression line

• legend adds a key.

> txt = c("Low speed","High speed")

> col = c("black", "red")

> pch = c(1,20)

> legend("topleft", inset=0.01, legend=txt, pch=pch, col=col)


Low level graphics functions

abline    Draw a line (intercept and slope, horizontal or vertical)
points    Plot points at given coordinates
lines     Draw lines between given coordinates
text      Draw text at given coordinates
mtext     Draw text in the margins of a plot
axis      Add an axis
arrows    Draw arrows
segments  Draw line segments
rect      Draw rectangles
polygon   Draw polygons
box       Draw a box around the plot
grid      Add a rectangular grid
legend    Add a legend (a key)
title     Add labels

Exercise 49. Load the MASS library to access the data frame named hills.
a) Scatter plot time against dist. Set axis ranges so that both include zero. Give the plot a title, (eg. Record Times in Scottish Hill Races), and give the axes labels, (eg. Time (minutes) and Distance (miles)).
b) Use function points to indicate in red data points for climbs greater than 2000 feet.
c) Use functions lm and abline to add a regression line to the plot.

> plot(time~dist, hills, xlim=c(0,max(dist)), ylim=c(0,max(time)),

+ main="Record Times in Scottish Hill Races",

+ xlab="Distance (miles)", ylab="Time (minutes)",

+ las=TRUE)

> points(time~dist, subset(hills,climb>2000), col="red")

> abline(lm(time~dist, hills), col="blue")

10.5 Multiple plot layout

• Multiple plots [30] in one window

> # Histograms of the iris data

> par(mfrow=c(2,2)) # 2 x 2 layout

> sapply(iris[1:4], hist) # Apply hist to each column

> # Using mapply to vectorize over the column names

> par(mfrow=c(2,2))

> mapply(hist, iris[1:4], xlab=names(iris[1:4]), main="")

[30] See also: layout and split.screen.

• Multiple windows

> windows() # Open a window (Windows only; dev.new() is portable)

> plot(dist~speed, data=cars, pch=20, col="grey")

> windows() # Open another window

> boxplot(cars, col="grey")

10.6 Saving graphs

• Copy and paste:
  Right-click on a graph and choose Copy as metafile.
  Paste into Word or PowerPoint.

• Printing:
  Right-click on a graph and choose Print...

• Save as a PDF [31]:

> pdf(file="myplot.pdf") # Open file

> plot(dist~speed, data=cars) # Plot

> dev.off() # Flush output and close the file
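A multiple plot layout can be saved to a PDF in the same way. A minimal sketch, assuming a temporary file name and a 7 by 7 inch page; the par settings are restored afterwards so that later plots are not affected:

```r
f = tempfile(fileext = ".pdf")         # temporary file name (an assumption of this sketch)
pdf(file = f, width = 7, height = 7)   # open the file; size is in inches
op = par(mfrow = c(2, 2))              # 2 x 2 layout; op holds the previous settings
for (v in names(iris)[1:4]) {
  hist(iris[[v]], main = "", xlab = v) # one histogram per numeric column
}
par(op)                                # restore the previous layout
invisible(dev.off())                   # flush output and close the file
file.exists(f)
```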

Exercise 50. With the hsb2 data plot four histograms in a 2x2 layout of read, write, math, and science. Give the histograms a nice colour and labelling, as before. Open a new Microsoft Word document and copy-and-paste the plot into it.

> par(mfrow=c(2,2))

> hist(dat$read, col="lightblue", main="", xlab="read")

> hist(dat$write, col="lightblue", main="", xlab="write")

> hist(dat$math, col="lightblue", main="", xlab="math")

> hist(dat$science, col="lightblue", main="", xlab="science")

> # Or..

> par(mfrow=c(2,2))

> mapply(hist, dat[7:10], xlab=names(dat[7:10]), col="lightblue", main="")

[31] See: help(Devices) for image files: jpeg, bmp, png, etc.

11 Hypothesis tests

• t.test performs one- and two-sample t tests.
  Two ways to specify the data for a two-sample test are:

1. Two arguments: x,y, that are two sample vectors.

> x = with(warpbreaks, breaks[wool=="A"])

> y = with(warpbreaks, breaks[wool=="B"])

> t.test(x,y, var.equal=TRUE)

2. A single formula argument: y~x, where y contains both samples and x is a grouping indicator.

> t.test(breaks~wool, data=warpbreaks, var.equal=TRUE)
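Both forms return the same list-like result, whose components can be extracted by name. A short sketch (the names t1 and t2 are illustrative):

```r
x = with(warpbreaks, breaks[wool == "A"])
y = with(warpbreaks, breaks[wool == "B"])
t1 = t.test(x, y, var.equal = TRUE)                              # two vectors
t2 = t.test(breaks ~ wool, data = warpbreaks, var.equal = TRUE)  # formula

names(t1)       # components of the result
t1$statistic    # t value
t1$parameter    # degrees of freedom
t1$p.value      # p-value
all.equal(t1$p.value, t2$p.value)
```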

Some hypothesis tests

t.test        t test of means
wilcox.test   Wilcoxon (non-parametric) test
var.test      F-test of variance
cor.test      Correlation (Pearson, Spearman, or Kendall)
binom.test    Test of proportion in a two-valued sample
prop.test     Test of proportions in several two-valued samples
chisq.test    Chi-squared test for count data
fisher.test   Fisher’s exact test for count data
ks.test       Kolmogorov-Smirnov goodness-of-fit test
shapiro.test  Shapiro-Wilk normality test

Exercise 51. Suppose you toss a coin 50 times and get 18 heads. Use binom.test to test the null that the probability of heads is 0.5, (that the coin is fair).

> binom.test(18, 50, p=0.5)

Exercise 52. With the warpbreaks data use var.test to test that the variance of breaks is equal at both levels of wool. Use boxplot to identify two outliers in the distributions. Repeat the var.test excluding the two outliers.

> var.test(breaks~wool, data=warpbreaks)

> boxplot(breaks~wool, data=warpbreaks)

> var.test(breaks~wool, data=subset(warpbreaks, breaks<=60))

Exercise 53. Use rnorm to make a random sample of 20 draws from a normal distribution with mean 10 and standard deviation 3. Use t.test to test the null that the population mean is 0 and then, using the mu argument, to test that it is 10.

> x = rnorm(20, mean=10, sd=3)

> t.test(x, mu=0) # Default mu

> t.test(x, mu=10)

Exercise 54. Use cor.test to test the Pearson correlation between speed and stopping distance in the cars data.

> cor.test(cars$speed, cars$dist)

12 Linear models

12.1 Simple regression

• lm fits a linear regression model by ordinary least squares.
  A single argument specifies the model as a formula: y ~ model
  y is the dependent variable (response or outcome).
  model specifies independent variables separated by “+” signs.
  The intercept is denoted by “1”.

> fit0 = lm(dist~1, cars) # Intercept only (null) model

> fit1 = lm(dist~1+speed, cars) # Intercept and slope

> fit1 = lm(dist~speed, cars) # Intercept (implied) and slope

• coef gets the model coefficients.

> coef(fit0) # Intercept-only is mean "dist"

> coef(fit1) # Intercept is "dist" at speed=0

• “I” allows arithmetic within model formulas.
  For example mean centering:

> mu = mean(cars$speed)

> fit2 = lm(dist~I(speed-mu), cars)

> coef(fit2) # Intercept is "dist" at mean "speed"

• Scatterplot with regression lines.
  abline can extract intercept and slope coefficients and add a regression line.

> plot(dist~I(speed-mu), cars)

> abline(fit0, lty=2, col="grey")

> abline(fit2)

> abline(v=0, lty=2, col="grey")
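Mean centering moves the intercept but leaves the slope alone, because the least-squares line passes through the point of means. A sketch that checks this numerically with the three fits used above:

```r
mu = mean(cars$speed)
fit0 = lm(dist ~ 1, cars)              # intercept only
fit1 = lm(dist ~ speed, cars)          # un-centered
fit2 = lm(dist ~ I(speed - mu), cars)  # speed centered on its mean

coef(fit1)
coef(fit2)

# Same slope in both fits
all.equal(unname(coef(fit1)[2]), unname(coef(fit2)[2]))

# Centered intercept equals mean dist, as in the intercept-only model
all.equal(unname(coef(fit2)[1]), mean(cars$dist))
```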


12.2 Testing the model coefficients and overall fit

• summary.lm tests coefficients and fit.
  confint provides confidence intervals.

> summary.lm(fit2)

> confint(fit2)

• Extracting parts of the summary.lm object.

> names(summary.lm(fit2)) # See: help(summary.lm)

> summary.lm(fit2)$coefficients # Table of coefficients

> summary.lm(fit2)$r.squared # R-squared

• residuals and rstandard get residuals.

> residuals(fit2)

> rstandard(fit2) # Standardized residuals

• fitted and predict.lm get fitted and predicted values.

> fitted(fit2)

> predict.lm(fit2) # Same as fitted unless newdata are provided

• Diagnostic plots.

> par(mfrow=c(2,2))

> plot(fit2)

• Standardized coefficients (beta weights).
  scale standardizes data (zero mean and unit SD).

> fit3 = lm(dist~speed, data.frame(scale(cars)))

> summary.lm(fit3) # Standardized coefficients

> cor(cars) # Pearson correlation coefficient
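In a simple regression the standardized slope is the Pearson correlation, which is why cor(cars) is shown alongside the fit. A quick numerical check (the names b and r are illustrative):

```r
fit3 = lm(dist ~ speed, data.frame(scale(cars)))  # both variables standardized
b = unname(coef(fit3)["speed"])                   # standardized slope
r = cor(cars$speed, cars$dist)                    # Pearson correlation
all.equal(b, r)

coef(fit3)["(Intercept)"]   # zero, up to rounding error
```

This equality holds only with a single predictor; with several predictors the beta weights are not simple correlations.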

Exercise 55. With the hsb2 data, scatter plot read (y-axis) against science (x-axis). Adjust the range of both axes so they include 0. Use lm to fit an intercept-only regression model of read. Add the regression line to the scatter plot, (pass the results of lm to abline). What does this represent?

> plot(read~science, dat, ylim=c(0,80),xlim=c(0,80))

> fit1 = lm(read~1, dat)

> abline(fit1)

Exercise 56. Fit a regression model of read (dependent variable) on science (independent variable), including the intercept. Add this regression line to the previous scatter plot. Use summary.lm to assess the model coefficients. What do they represent?

> fit2 = lm(read~science, dat)

> abline(fit2)

> # Intercept (where science==0) and slope in the regression model

> summary.lm(fit2)


12.3 Multiple regression

• Independent variables are included by “+”

> fit4 = lm(Fertility~Agriculture+Education, data=swiss)

> summary.lm(fit4)

• All variables left in the data frame are included by “.”
  Variables are excluded by “-”

> fit5 = lm(Fertility~., swiss)

> fit6 = lm(Fertility~.-(Examination+Agriculture), swiss)

• anova.lm performs model comparison.

> anova.lm(fit5,fit6) # Test a block of terms

• Product terms (interactions) are formed by “:”
  Main effects with interaction are included by “*”

> data(hills, package="MASS")

> fit7 = lm(time~dist+climb+dist:climb, data=hills)

> fit8 = lm(time~dist*climb, data=hills) # Shorthand

• Quadratic models are fitted by adding a squared term.

> fit9 = lm(dist~speed+I(speed^2), data=cars)

> summary.lm(fit9)

• How does the formula translate into terms in a model equation?

> model.matrix(~1, data=cars) # Intercept only

> model.matrix(~speed, data=cars) # Simple regression

> model.matrix(~speed+I(speed^2), data=cars) # Squared term

> model.matrix(~dist*climb, data=hills) # Interaction

> model.matrix(~tension*wool, data=warpbreaks) # Dummy variables
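Each column of the model matrix corresponds to one coefficient in the model equation. As a sketch of that link, the coefficients from lm can be reproduced by solving the least-squares problem on the model matrix directly; qr.solve is used here as one way to do this and is not part of the notes:

```r
# Model matrix for the quadratic model: one column per coefficient
X = model.matrix(~ speed + I(speed^2), data = cars)
colnames(X)
dim(X)                        # 50 rows (observations) by 3 columns (terms)

# Least-squares coefficients computed directly from the model matrix
b = qr.solve(X, cars$dist)

# The same coefficients from lm
fit = lm(dist ~ speed + I(speed^2), data = cars)
cbind(by.hand = b, lm = coef(fit))
```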

Exercise 57. Use lm to fit four linear regression models of time in the hills data:
1) time predicted by dist.
2) time predicted by dist and climb.
3) time predicted by dist, climb, and their interaction.
4) The same model with dist centered on 15 miles and climb centered on 3000 feet.
Use summary.lm to compare and interpret the estimated coefficients. Plot regression diagnostics for the final centered model. Use rstandard to extract the standardized residuals, and identify any that are more than 2 standard deviations from the regression line. Re-fit the final model omitting these outliers. Use predict.lm with this final model to predict the time for a run of 26 miles with a climb of 100 feet.

> fit1 = lm(time ~ dist, data=hills)

> fit2 = lm(time ~ dist+climb, data=hills)

> fit3 = lm(time ~ dist*climb, data=hills)

> fit4 = lm(time ~ I(dist-15)*I(climb-3000), data=hills)

> summary.lm(fit1)

> summary.lm(fit2)

> summary.lm(fit3)

> summary.lm(fit4)

> # Diagnostics

> par(mfrow=c(2,2))

> plot(fit4, which=1:4)

> i = which(abs(rstandard(fit4)) > 2) # More than 2 SDs from the line

> fit5 = lm(time ~ I(dist-15)*I(climb-3000), data=hills[-i,])

> summary.lm(fit5)

> predict.lm(fit5, newdata=data.frame(dist=26, climb=100))

> # As terms are added the multiple R-squared increases and the residual standard

> # error decreases. Both dist and climb are significant. Their interaction is

> # significant but including it makes climb non-significant. The intercept has

> # no useful interpretation in un-centered fits. Centering gives the intercept

> # meaning and also brings climb into a region of significance. Interpreting the

> # centered fit: The intercept tells the expected time for a run of 15 miles

> # climbing 3000 feet. The coefficient of dist says the run is expected to take

> # about 7.1 minutes longer for each extra mile of distance when the climb is

> # 3000 feet. The coefficient of climb says the run is expected to take about

> # 0.014 minutes longer for each extra foot of climb, (14 minutes for each 1000ft),

> # when the distance is 15 miles. The interaction is the rate-of-change of the

> # slope effects. The slope of 7.1 minutes/mile for climb of 3000ft becomes 8.1

> # minutes/mile if the climb is 4000ft, (because 7.1 min/mile for 3000ft becomes

> # 7.1+0.001 min/mile for 3001ft, and so 7.1+1 min/mile for 4000ft).

> # The slope of 14 minutes/1000ft for dist of 15 miles becomes 15 minutes/1000ft

> # if the dist is 16 miles, (because 0.014 min/ft for 15m becomes 0.014+0.001

> # min/ft for 16m, or 15 min/1000ft). Dropping the outlier improves the fit

> # (more R-squared, less residual standard error).

Exercise 58. Make a sequence of x values like this: seq(-4,4,0.1) and use it to make values $y = x^2 + x - 6$. Fit two regression models of this data: $y = \beta_0 + \beta_1 x + e$ and $y = \beta_0 + \beta_1 x + \beta_2 x^2 + e$, (where e is random error). Which has the better fit?

> x = seq(-4,4,0.1)

> y = x^2 + x - 6


> summary(lm(y~x))

> summary(lm(y~x+I(x^2))) # Perfect fit! Residual error is 0

Exercise 59. Use lm to fit linear and quadratic models of stopping distance predicted by speed in the cars data. Use predict.lm to calculate the stopping distances predicted by both these models for speeds ranging from 4 to 25mph. Scatterplot dist on speed and use function lines to add the linear and quadratic regression lines to the plot. Re-fit the quadratic model with speed centered on 15mph. Use summary.lm to compare the estimated coefficients for the centered and un-centered quadratic fits.

> fit1 = lm(dist ~ speed, data=cars)

> fit2 = lm(dist ~ speed + I(speed^2), data=cars)

> fit3 = lm(dist ~ I(speed-15) + I((speed-15)^2), data=cars)

> x = 4:25

> y1 = predict.lm(fit1, data.frame(speed=x))

> y2 = predict.lm(fit2, data.frame(speed=x))

> plot(dist ~ speed, cars)

> lines(x, y1)

> lines(x, y2, col="blue")

> summary.lm(fit2)

> summary.lm(fit3)

> # In the quadratic model the coefficient of speed is the rate-of-change of dist

> # where speed=0.

> # y = b0 + b1x + b2x^2

> # dy/dx = b1 + 2b2x Hence b1 is dy/dx (instantaneous slope) when x=0

> # In the uncentered model the instantaneous slope is close to 0 where speed=0,

> # hence non-significant. In the centered model the instantaneous slope is not 0

> # at the centered speed, (15mph), and is significant.

12.4 ANOVA

• Factors represent categorical variables.
  Factors in a formula become dummy numeric variables with values given by contrast coding [32].
  contrasts gets and sets a factor’s contrast coding scheme.
  model.matrix shows the dummy variables and contrast coding.

> contrasts(warpbreaks$tension) # Default coding

> model.matrix(~tension, data=warpbreaks) # Dummy variables

• aov fits an ANOVA model by ordinary least squares [33].
  summary.aov displays the ANOVA table [34].
  summary.lm tests the coefficients and overall fit.

> fit10 = aov(breaks~tension, data=warpbreaks) # 1-way ANOVA

> summary.aov(fit10) # ANOVA table

> summary.lm(fit10) # Coefficients
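Under the default treatment contrasts the coefficients can be read off the group means. A sketch that checks this for the 1-way model, assuming the contrast options have not been changed from their defaults:

```r
fit10 = aov(breaks ~ tension, data = warpbreaks)
means = with(warpbreaks, tapply(breaks, tension, mean))  # group means
b = coef(fit10)

b["(Intercept)"]            # mean of the reference level "L"
means["L"]

b["tensionM"]               # mean of "M" minus mean of "L"
means["M"] - means["L"]

b["tensionH"]               # mean of "H" minus mean of "L"
means["H"] - means["L"]
```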

• Main effects are added by “+”
  Interactions between terms are specified using “:”
  “*” is shorthand for main effects and interaction.

> fit11 = aov(breaks~tension+wool, data=warpbreaks) # 2-way ANOVA

> fit12 = aov(breaks~tension*wool, data=warpbreaks) # Interaction

> summary.aov(fit11)

> summary.aov(fit12)

> summary.lm(fit12)

> # Patterns of cell means

> with(warpbreaks, tapply(breaks,list(wool,tension),mean))

> with(warpbreaks, interaction.plot(tension,wool,breaks))

32The default contrast coding is 0,1 dummy coding, called “treatment contrasts” in R. The coefficientshave a simple interpretation: the intercept is the mean of the reference group, and other coefficientsare mean differences between a group and the reference group. Treatment contrasts are not orthogonal.Hence the message: “Estimated effects may be unbalanced”, (which can be ignored if the design isbalanced). Orthogonal contrasts are available, (see help(contr.helmert)).

33aov and lm are the same calculation with results displayed differently: lm shows model coefficients,aov shows sums-of-squares.

34summary.aov calculates a “sequential” ANOVA table using Type-I sums-of-squares. Terms are as-sessed in model order, except interaction terms are always assessed after main effects. The results donot depend upon the order of the terms if the design is balanced. See Anova in package car for Type-IIand Type-III sums-of-squares.


Exercise 60. With the warpbreaks data, use aov to carry out a 1-way ANOVA of breaks (dependent) grouped by tension (independent).
a) Use summary.aov to calculate the ANOVA table.
b) Use summary.lm to assess the model coefficients. What do they represent?

> fit1 = aov(breaks ~ tension, warpbreaks)

> summary.aov(fit1)

> summary.lm(fit1)

> with(warpbreaks, tapply(breaks, tension, mean))

Exercise 61. Use aov and summary.aov to carry out a 2-way ANOVA of the main effects of wool and tension in the warpbreaks data. Repeat with the wool:tension interaction included in the model.

> summary.aov(aov(breaks ~ wool + tension, warpbreaks))

> summary.aov(aov(breaks ~ wool * tension, warpbreaks))

12.5 ANCOVA

• Both numeric and factor variables can appear in a model.
  Factors (categorical variables) become dummy numeric variables.

> data(cabbages, package="MASS")

> model.matrix(~VitC*Cult, data=cabbages)

> fit13 = lm(HeadWt~VitC*Cult, cabbages)

> summary.lm(fit13)
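To see how the interaction enters the design matrix, here is a small sketch. It assumes the MASS package is installed and that default treatment contrasts are in effect; the columns are referred to by position.

```r
data(cabbages, package="MASS")

# Columns, in order: intercept, VitC, the Cult dummy, and the interaction
mm = model.matrix(~VitC*Cult, data=cabbages)
colnames(mm)

# The interaction column is the product of the VitC column and the Cult dummy
all(mm[, 4] == mm[, 2] * mm[, 3])
```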

Exercise 62. Load the MASS library to access the dataset named survey. (See its help page: help(survey)).
a) Scatterplot Wr.Hnd versus Height.
b) Use abline with lm to add two regression lines to the plot for "Female" and "Male" data.
c) Use lm and summary.lm to fit and test an ANCOVA model of Wr.Hnd predicted by Height, Sex, and the Height:Sex interaction.
d) Refit the model with Height centered on 170.

> data(survey, package="MASS")

> plot(Wr.Hnd ~ Height, survey)

> abline(lm(Wr.Hnd ~ Height, survey[survey$Sex=="Male",]), col="blue")

> abline(lm(Wr.Hnd ~ Height, survey[survey$Sex=="Female",]), col="red")

> summary(lm(Wr.Hnd ~ Height * Sex, survey))

> survey$cHeight = survey$Height - 170

> summary(lm(Wr.Hnd ~ cHeight * Sex, survey))

> # SexMale (the change in intercept from Female to Male) has a reasonable

> # interpretation when Height is centered on 170. It is negative when Height is

> # not centered since the regression lines cross-over before they extrapolate

> # back to zero. Height:SexMale (the change in slope from Female to Male) is

> # unaffected by centering, (shifting the whole sample up or down).

> # It is non-significant, indicating "homogeneity of regression" in the

> # male/female populations. That is, the relationship between span and height

> # is the same for male and female.
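The effect of centering described in these comments can be verified numerically. This is a sketch assuming the MASS package is installed; the object names (fit_raw, fit_ctr, b) are illustrative.

```r
data(survey, package="MASS")
fit_raw = lm(Wr.Hnd ~ Height * Sex, data=survey)
survey$cHeight = survey$Height - 170
fit_ctr = lm(Wr.Hnd ~ cHeight * Sex, data=survey)

# Slopes are unaffected by centering
all.equal(coef(fit_raw)[["Height"]], coef(fit_ctr)[["cHeight"]])
all.equal(coef(fit_raw)[["Height:SexMale"]], coef(fit_ctr)[["cHeight:SexMale"]])

# After centering, SexMale is the gap between the two regression lines at
# Height = 170: the uncentered SexMale plus 170 times the change in slope
b = coef(fit_raw)
all.equal(coef(fit_ctr)[["SexMale"]], b[["SexMale"]] + 170 * b[["Height:SexMale"]])
```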

12.6 Generalized linear models

• glm fits generalized linear regression by maximum likelihood.
  A single argument specifies the model as a formula.
  A single argument named family specifies the response distribution [35] and link function.

[35] The default response distribution is normal (gaussian), and the default link is the identity (do nothing) function. Results should be the same as lm.


> fit1 = lm(dist~speed, cars)

> fit14 = glm(dist~speed, cars, family=gaussian(link="identity"))

> summary.lm(fit1)

> summary.glm(fit14)
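Rather than comparing the two summaries by eye, the agreement can be checked with all.equal. A minimal sketch on the built-in cars data, with illustrative object names:

```r
fit_lm  = lm(dist ~ speed, data=cars)
fit_glm = glm(dist ~ speed, data=cars, family=gaussian(link="identity"))

# Same coefficients from lm and from the gaussian/identity glm
all.equal(coef(fit_lm), coef(fit_glm))

# For this family the glm deviance is the residual sum of squares
all.equal(sum(residuals(fit_lm)^2), deviance(fit_glm))
```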

Exercise 63. Load the car library to access the dataset named Cowles. (See its help page: help(Cowles)).
a) Use glm with family=binomial(link="logit") to fit a logistic regression model of volunteer predicted by sex.
b) Use coef to extract the model coefficients, and exp to anti-log them for odds units.
c) Update the model to control for extraversion and neuroticism.

> data(Cowles, package="carData") # Since car 3.0 the datasets live in carData; older versions: package="car"

> # Unadjusted odds ratio

> fit1 = glm(volunteer ~ sex, data=Cowles, family=binomial(link="logit"))

> coef(fit1) # In log odds units

> exp(coef(fit1)) # In odds units

> # Intercept (0.8097) is odds that a female will volunteer (reference level)

> # sexmale (0.779) is odds multiplier from female to male, (less than 1, so odds

> # of a male volunteering are lower than female)

> # Control for extraversion and neuroticism

> fit2 = glm(volunteer ~ sex + extraversion + neuroticism, data=Cowles, family=binomial(link="logit"))

> exp(coef(fit2))

> # extraversion (1.069) is odds multiplier for unit increase in extraversion.

> # Greater than 1 so more extravert means more likely to volunteer.

> # sexmale (0.790) still less than 1, but not as much lower as before.

> # Female still more likely to volunteer, but extraversion and neuroticism

> # explain some of the difference between female and male.
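The reading of exp(coef(...)) as an odds and an odds multiplier can be checked against a 2x2 table. The sketch below deliberately swaps in the built-in mtcars data (am predicted by vs) so that it runs without extra packages; it is not the Cowles analysis, and the object names are illustrative.

```r
fit = glm(am ~ vs, data=mtcars, family=binomial(link="logit"))

# Odds of am = 1 within each level of vs, straight from the counts
tab = table(vs=mtcars$vs, am=mtcars$am)
odds_vs0 = tab["0", "1"] / tab["0", "0"]
odds_vs1 = tab["1", "1"] / tab["1", "0"]

# exp(intercept) is the odds in the reference group (vs = 0)
all.equal(exp(coef(fit))[["(Intercept)"]], odds_vs0, tolerance=1e-6)

# exp(slope) is the odds multiplier from vs = 0 to vs = 1 (the odds ratio)
all.equal(exp(coef(fit))[["vs"]], odds_vs1 / odds_vs0, tolerance=1e-6)
```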

See help(family) and help(make.link) for the range of response distributions and link functions provided.
For logistic regression use: family=binomial(link="logit").
For Poisson regression use: family=poisson(link="log").
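As an illustration of the Poisson case, the sketch below treats breaks in the built-in warpbreaks data as a count response. The choice of dataset and the object name fit_pois are assumptions made for this example.

```r
fit_pois = glm(breaks ~ wool + tension, data=warpbreaks,
               family=poisson(link="log"))

# With a log link, anti-logged coefficients are multipliers of the expected count
exp(coef(fit_pois))

# Fitted means are the anti-log of the linear predictor
all.equal(fitted(fit_pois), exp(predict(fit_pois, type="link")))
```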
