Statistics Workshop Dave Zhao Statistics Workshop | Dave Zhao | 20151 PowerPoint by Casey Hanson

Statistics Workshop | Dave Zhao | 2015 1

Statistics Workshop

Dave Zhao

PowerPoint by Casey Hanson


Introduction

In this lab, we will do the following:

1. Learn about the R programming language and the RStudio Integrated Development Environment (IDE).

2. Examine and manipulate data in R.

3. Perform univariate regression.

4. Perform multivariate regression.


Step 0A: Local Files

For viewing and manipulating the files needed for this laboratory exercise, insert your flash drive.

Denote the path to the flash drive as the following:

[course_directory]

We will use the files found in:

[course_directory]/01_Statistics/data/


Step 0B: Data Files

data.RData – 20,000 gene expression vectors, CD4 levels, and case/control information for 100 individuals in wide format. This file cannot be opened by excel or notepad/wordpad. It is in an R specific format.

data.txt – Same as data.Rdata, but in text – long format.

data.xlsx – Same as the above, but in excel – long format.

lab.R – All of the R commands we will run in today's lab. The sdir variable needs to be changed to the directory of your R lab data for 01_Statistics.


R ProgrammingThe objectives of this section are to learn the user interface of the RStudio program as well as essential aspects of R programming, namely:

1. How to create variables.2. What the R Environment is.3. Typical math operations in R.4. Typical data objects in R.5. How to modify data using functions in R.


Step 1A: Opening RStudio

Go to the Desktop

Double click the RStudio Icon

A program like the this should open.


Step 1B: Configuring RStudio

Click Tools on the Menu Bar and select Global Options.


Step 1C: Configuring RStudio

The Options window should appear.

Click Pane Layout.

Ensure your configuration matches the image on the right. Both for the clicked boxes and their order.

Click OK.


Step 1D: RStudio Opened


Step 2A: RScript Panel

The Scripting Panel is located in the top left.

You can input R commands here and save the commands to a file.

You can run highlighted R commands by pressing.

Files saved can also be run in the R Console (bottom left).

We won’t be using this panel today.


Step 2B: R Console Panel

The R Console (bottom left) is where we will be doing most of our work.

The R Console takes commands as input and performs operations based on the commands.

The commands may contain references to data variables or objects in the R environment.

Commands may return new data variables based on inputs.


Step 2C: Plots and 4th Panel

The Plots Panel (upper right) contains any visualizations that you might produce using R.

The 4th Panel (bottom left) contains tabs for various views of the R configuration and environment.

Environment – Shows all variables and objects loaded or created. Initially empty.


Step 3A: Using the R Console

Commands typed into the R console, as below, …

Will be represented in this tutorial as the following:

> x <- 3;


Step 4A: Setting a Variable

R is a program that interprets commands as text input, stores data as variables or objects, and is able to modify data using functions.

Let’s start with variables.

Suppose we want to create a variable x that is set to the value 3.

Type the following into the command prompt.

At the end of each line, press the Enter key on the keyboard for R to execute the command.

> x <- 3;

> x


Step 4B: Setting a Variable

You should see the following result:

The first line does two things:1. It creates a variable called x in the current R environment.2. Using the R assignment operator <-, it sets the value of x to 3.

The second line (without ;), prints the value of x to the R console.

The [1] 3 indicates that x has 1 value (shown in brackets) that is 3.

[1] 3


Step 4C: Setting a Variable

RStudio enables the user to examine the R Environment as they type commands into the R console.

For instance, before the command was executed, the environment looked like the following:

After the command was executed, the environment shows the x variable with its corresponding value 3.


Step 5A: R Environment

So…what is the R Environment?

The R environment is the set of variables that the user creates while inputting commands into R.

For instance, given the current R environment, try printing the value of the variable y to the screen.

You should see something like below:

> y

Error: object 'y' not found


Step 5B: R Environment

We get an error because the variable y has yet to be created.

If we create y , say be setting it to 5, it will be added to the environment.

You can see that after creating the variable y with value 5, the environment updates to show the newly created value.

> y <- 5;

> y

> [1] 5


Step 6A: Data Types

Thus far, we have only worked with numeric data (real numbers).

R supports many other data types including :1. Characters (use quotes): 2. Factors: (categorical data with an explicit order)

R can also create a vector of data types, using the c() function.

For instance, to create a vector z that is [1,2,3] :

> x <- ‘c’

> z <- c(1,2,3);

> z

[1] 1 2 3


Step 6B: Data Types

Other data types are available at the following link:

http://www.statmethods.net/input/datatypes.html

The one we will use most today are the vector and data frame.

A data frame is like a matrix except that the columns don’t need to have the same type of data.

For instance, the 1st column can have characters, the 2nd numeric, and the 3rd column factors.

We won’t create data frames explicitly but rather load them from a file.




Step 6B: Data Types

Thus far, we have only worked with numeric data (real numbers).

R supports many other data types including :1. Characters (use quotes): 2. Factors: (categorical data with an explicit order)

R can also create a vector of data types, using the c() function.

For instance, to create a column vector z that is [1,2,3] :

> x <- 'c'

> z <- c(1,2,3);

> z

[1] 1 2 3


Step 7A: Functions and Operators

Functions accept variables as input, perform operations on those variables, and return the result as a new variable.

We have already seen what the function c() does : it takes a list of comma separated variables or values and constructs a column vector from them.

Lets create a variable a that is the column vector [3,5,1,2,3] using only the variables currently in the environment.> a <- c(x,y,z);

> a

[1] 3 5 1 2 3


Step 7B: Functions and Operators

The c() function took as input the variables x, y, and z and concatenated them all into a new column vector a.

So c() operated on the input variables and created a new output variable a.

Did it change x, y, or z though? NO.

In principle, an R function will not change theinput data


Step 7C: Functions and Operators

What is an operator then?

An operator is like a function except it’s form is different from the function form: function_name(arg1, arg2, arg3)

For instance, we can add two numeric variables using the operator.

In fact, we have already seen an operator before: (remember <-).

> z <- x + y

> z

[1] 8


Step 7D: Functions and Operators

Notice that we assigned the variable z to a new value 8.

What does this do? It destroys the old z which was [1,2,3].

It then creates a new z with the value of 8.

Look at the R Environment.


Step 7D: Functions and Operators

A list of common operators is given below:

a + b : Adds a and b and returns the result.

a * b : Multiplies a by b and returns the result.

a – b : Subtracts b from a and returns the result.

a / b : Divides a by b and returns the result.

a ** b : Raises a to the power of b and returns the result.

For more operators visit the following link:

http://www.statmethods.net/management/operators.html




Statistics LabThe objective of this section of the lab is to be able to perform the following:

1. Load .RData into the R Environment.

2. Perform a single univariate regression using glm.

3. Perform multivariate regression using glm.

4. Find p-values of univariate and multivariate regression using anova.


Step 1: Clear the current environment

For the sake of neatness, let us cleanse the current R environment of all variables.

What do we mean by cleanse? Remove all variables.

We can accomplish this with the rm() and ls() functions.

> rm(list=ls());

> ls()

character(0)


Step 2: Change directories

We want to change the working directory of R to make loading the data easier.

Make sure you substitute [course_directory] with the correct path to 01_Statistics.

For example, on my machine, the command would be :

> setwd("[course_directory]/01_Statistics/data");

> setwd("F:/Labs/01_Statistics/data");


Step 3: Load the .RData file

In the directory, there is a file called data.RData.

This file contains ALL of the variables relevant to this lab extracted from Dr. Zhao’s R Environment.

To load the variables into our R Environment, use the load() function. Notice the change in the R Environment panel.

There is only a single object with 100 obs. of 2003 variables.

>

load("data.Rdata")


Step 4A: Examine the Data

We loaded an object called data with 100 obs. of 2003 variables.

What kind of object is it?

How many rows and columns does it have? (Hint: 100 rows, 2003 cols)

> class(data)

data frame

> dim(data)

[1] 100 2003


Step 4B: Examine the Data

How can we get a peak of the data? Let’s view the first 5 rows and first 3 columns. Note that 1:3 evaluates to c(1,2,3).Note: Leave off the semicolon so that it prints to

the console.> data[1:5,1:3]

CD4 gene1 gene2

1 2.00772551 11.759267 13.818804

2 0.01781400 10.084232 13.377747

3 131.84355762 8.178157 13.173177

4 8.82627931 7.316058 6.338184

5 0.01406711 11.690831 8.429529


Step 4C: Examine the Data

It appears that this data frame contains the transposed table from the Excel file shown in lecture and data.xls. If so, why are there 20003 columns instead of 20001 (CD4 + 20000 genes)?

Let’s look a columns 20002 and 20003.> data[1:3, c(20002,20003)]

condition time

1 ctrl 30min

2 ctrl 60min

3 exp 60min


Step 4D: Examine the Data

So columns 20002 and 20003 contain information about whether the individual was a control or exp subject and what time the gene expression was taken (30 or 60 min).

What data types are these? Characters or factors?

So a factor can look like a character string, but it is stored differently and has a built in way of ordering.

This is important for the regression, because glm uses factors, not characters.

> class(data[["condition"]])

[1] "factor"


Step 5: Dividing the data

Create a vector called cd4 by selecting the CD4 column of data.

Create a data frame of gene expressions called genes.

Create two vectors corresponding to control and time.

> cd4 <- data[["CD4"]];

> genes <- data[,2:20001];

> condition <- data[["condition"]];

> time <- data[["time"]];


Step 6A: Univariate Regression

Let us regress the gene expression of gene 1 against response across individuals.

> fit <- glm(log(cd4) ~ genes[[1]], family="gaussian");

Call: glm(formula = log(cd4) ~ genes[[1]], family = "gaussian")

Coefficients:

(Intercept) genes[[1]]

0.4361 -0.3183

Degrees of Freedom: 99 Total (i.e. Null); 98 Residual

Null Deviance: 1128

Residual Deviance: 1042 AIC: 524.1


Step 6B: Univariate Regression

How do we get the correlation r and p-value p of the univariate regression? Is it significant at 0.05 threshold?

> sumfit <- summary(fit);

> sumfit$coefficients

Estimate Std. Error t value Pr(>|t|)

(Intercept) 0.4360894 1.1711375 0.3723639 0.710425941

genes[[1]] -0.3182643 0.1119869 -2.8419784 0.005455195

> r <- sumfit$coefficients[2,1];

> p <- sumfit$coefficients[2,4];


Step 7A: Does the drug effect CD4 level

As in Example 1 from lecture, this will require a multivariate regression

Specifically,

Where we test the null hypothesis:


Step 7B: Does the drug effect CD4 level

We first need to fit the alternative hypothesis:

We then need to fit the null hypothesis:

We then need to find the p-value p of the likelihood ratio statistic:

> fit_h1 <- glm(log(cd4) ~ condition *

time,family="gaussian");

> fit_h0 <- glm(log(cd4) ~ time, family="gaussian");

> p <- anova(fit_h0, fit_h1, test="LRT")[2,5];

> p

0.429236


Step 7C: Does the drug effect CD4 level

The p-value of the LRT statistic is 0.429236.

At a significance threshold of 0.05, we fail to reject .

Why? 0.429236 > 0.05

Therefore, we fail to detect that the drug effects CD4 level.


Step 8A: Which genes are affected by the drug

We will perform a very similar procedure to Step 7.

Specifically given a gene expression vector ,

Where we test the null hypothesis:

Except we will do this for all 20,000 genes.

Since we are performing 20,000 statistical tests, we will have to do some p-value adjustment.


Step 8B: Which genes are affected by the drugThe body of the code is similar to before.

All the sapply does is iterate over each column in the genes data frame and returns a vector of p-values.

This will take some time.

ps <- sapply(genes, function(g){

fit_h1 <-glm(g ~ condition * time);

fit_h0 <-glm(g ~ time);

p <- anova(fit_h0, fit_h1, test="LRT")[2,5];

return(p)}

);


Step 8C: Which genes are affected by the drug

The vector ps is a 20,000 dimensional vector containing p-values for each statistical test asking if the gene is affected by CD4.

However, since we do 20,000 different tests, by random chance, we will have many genes with a p-value <= 0.05.

Thus, we must correct the p-values in light of the number of tests conducted.

A popular method for doing so is False Discovery Rate (FDR) control.


Step 8D: Which genes are affected by the drug

To get adjusted p_values using FDR, we have to use the p.adjust function.

We can then reject at our normal value of 0.05.

A way of finding and counting the number of genes with an adjusted p-value <= 0.05 is given below:

> ps_fdr <- p.adjust(ps, method="fdr");

> drug_genes <- which(ps_fdr <= 0.05);

length(drug_genes)

[1] 21


Step 8E: Which genes are affected by the drug

So, which genes were affected by the drug ?

The genes 1-20 and gene11989.

How many genes were affected by the drug?

21


Summary

In this lab, we did the following:

• Learned about the R programming language and the RStudio Integrated Development Environment (IDE).

• Examined and manipulated data in R.

• Performed univariate regression of gene expression and log of CD4 levels.

• Performed multivariate regression to :• Discover if the drug effects CD4 levels.• Discover the identity and number of genes affected by the drug.


Extra MaterialMaterial not essential for the lab, but informative for curious students.


Step 2D: 4th Panel

History – Contains a list of the history of commands. We will not be using this pane in this lab.

Files – Contains a list of the files in the directory R is currently set to. Initial directory can be changed from Tools – Preferences. We will not be using this pane in this lab.


Step 2E: 4th Panel

Packages – Contains a list of Packages (where each Package contains commands that native R doesn’t support) that can be loaded. We will not be using this pane in this lab.

Help – Provides an input box to query for help regarding R. We will not be using this pane in this lab.

View – Provides a way to view local web content. We will not be using this pane in this lab.


Step 5C: R Environment

What does create a variable and assigning it to a value mean.

Initially, the R environment has a large amount of free memory to store data. Let’s conceptualize this as an empty rectangle.


Step 5D: R Environment

When we create a variable, we consume a portion of the free memory of R and dedicate it to a variable.

When created, the variable doesn’t have a value, so we must assign a value to fill its space.

x-

x3


Step 5E: R Environment

As we create more and more variables, the R Environment becomes increasingly crowded, with less room to give to new variables.

So far, we have created two variables in the R Environment, named x and y.

These are fairly tiny variables compared to the size of the R environment.

When we load data in from files, the size of the variables will become much larger.


Testing Differential Expression (D.E.) Over Time in Control and Experimental Conditions.The code for this exercise is available on the schedule page in the data directory link.

The file is called lab_final.R

The additional code from lab.R is found in Section 5.0


Step 9A: Differential Expression Over Time in Control/Experimental

Does differential expression (D.E.) over time differ between experimental and control individuals?

An individual (row) is either in a control or experimental condition. Thus, we should split time and gene data by the control type.

The code below will get which individuals (rows) are "control" and "experimental" respectively.

> indiv_ctrl <- which(condition=="ctrl");

> indiv_exp <- which(condition=="exp");


Step 9B: Differential Expression Over Time in Control/Experimental

Split genes and time by the different control and experimental rows.

For example, genes[indiv_ctrl,] will get the gene expression data for all individual (rows) in the control group for all genes (columns),

Similarly, time[time_ctrl] will get only those time points in the control group. We don't need the , because time is vector, not a matrix.

> genes_ctrl <- genes[indiv_ctrl,];

> genes_exp <- genes[indiv_exp,];

> time_ctrl <- time[indiv_ctrl];

> time_exp <- time[indiv_exp];


Step 9C: Differential Expression Over Time in Control/Experimental

For each condition, we want to regress the time of each gene expression on gene expression within each condition.

In the above code, we iterate over each column (gene) in genes_ctrl, regress its expression against time within the control group, and return the p-value.

> ps_ctrl <- sapply(genes_ctrl, function(g) {

## regress gene expression of gene g against time in the control

group

model <- lm(g ~ time_ctrl);

model_summary <- summary(model); ## summarize model

coef <- model_summary$coefficients; ## get coefficient matrix

p <- coef[2,4]; ## get p-value from coefficient matrix

return(p);

});


Step 9D: Differential Expression Over Time in Control/Experimental

For each condition, we want to regress the time of each gene expression on gene expression within each condition.

In the above code, we iterate over each column (gene) in genes_exp, regress its expression against time within the experimental group, and return the p-value.

> ps_exp <- sapply(genes_exp, function(g) {

## regress gene expression of gene g against time in the experimental

group

model <- lm(g ~ time_exp);

model_summary <- summary(model); ## summarize model

coef <- model_summary$coefficients; ## get coefficient matrix

p <- coef[2,4]; ## get p-value from coefficient matrix

return(p);

});


Step 9E: Differential Expression Over Time in Control/Experimental

ps_ctrl contains the p-values of the association of each gene's expression with its time of sampling within the control group.

What is the data type of ps_ctrl : vector or matrix (data frame)?

How many entries does ps_ctrl have? In other words, what is the length of ps_ctrl ?

ps_exp, likewise, contains the p-values of the association of each gene's expression with its time of sampling within the experimental group.


Step 9E: Differential Expression Over Time in Control/Experimental

ps_ctrl contains the p-values of the association of each gene's expression with its time of sampling within the control group.

What is the data type of ps_ctrl : vector

How many entries does ps_ctrl have? In other words, what is the length of ps_ctrl ?

20,000

ps_exp, likewise, contains the p-values of the association of each gene's expression with its time of sampling within the experimental group.


Step 9F: Differential Expression Over Time in Control/Experimental

We should correct the p-values for multiple testing, using FDR.

The ps_ctrl_fdr vector contains the multiple hypothesis adjusted p-values of the control data using FDR.

The ps_exp_fdr vector, likewise, contains the adjusted p-values of the experimental data using FDR.

> ps_ctrl_fdr <- p.adjust(ps_ctrl,"fdr");

> ps_exp_fdr <- p.adjust(ps_exp, "fdr");


Step 9G: Differential Expression Over Time in Control/Experimental

Now, let's find out which genes are D.E. at an FDR of 5% (0.05) within each group.

What is the data type of de_ctrl: vector or matrix?

What are the types of entries in de_ctrl?

What does de_ctlr[i]represent for any between 1 and 20,000?

> de_ctrl <- ps_ctrl_fdr <= 0.05;

> de_exp <- ps_exp_fdr <= 0.05;


Step 9G: Differential Expression Over Time in Control/Experimental

Now, let's find out which genes are D.E. at an FDR of 5% (0.05) within each group.

What is the data type of de_ctrl: vector or matrix? vector

What are the types of entries in de_ctrl? True/False or 1/0

What does de_ctlr[i]represent for any between 1 and 20,000?

Whether gene is differentially expressed over time in the control group.

> de_ctrl <- ps_ctrl_fdr <= 0.05;

> de_exp <- ps_exp_fdr <= 0.05;


Step 9H: Differential Expression Over Time in Control/Experimental

Let's group genes together into the four different D.E. pairwise outcomes:

For a given gene :

de_ctrl[i] de_exp[i] Meaning

FALSE or 0 FALSE or 0 Gene is not D.E. in either condition.

TRUE or 1 FALSE or 0 Gene is D.E. in the control, but not experimental condition.

FALSE or 0 TRUE or 1 Gene is D.E. in the experimental, but not control condition.

TRUE or 1 TRUE or 1 Gene is D.E. in both conditions.


Step 9I: Differential Expression Over Time in Control/Experimental

Then count the number of genes in each D.E. group.

For two binary vectors with columns in the same relative order, we can group and count the pairwise groups in one command.

Do you see an independence between genes D.E in control and genes D.E. in experimental?

> de_counts <- table(de_ctrl,de_exp);

> de_counts ## Print out table


Step 9J: Differential Expression Over Time in Control/Experimental

Let's test the independence of D.E. genes between control and experimental groups using the Fisher exact test.

The code below get's the p-value of the Fisher exact test.

What value do you getWhat is the NULL hypothesis?

Can we reject the NULL hypothesis at a 0.05 threshold?

> de_p <- fisher.test(de_counts)$p;


Step 9J: Differential Expression Over Time in Control/Experimental

Let's test the independence of D.E. genes between control and experimental groups using the Fisher exact test.

The code below get's the p-value of the Fisher exact test.

What value do you get? (it should be 3.069468e-12)

What is the NULL hypothesis? (D.E. genes in the control are independent of D.E. genes in the experimental group)

Can we reject the NULL hypothesis at a 0.05 threshold? (Yes)

> de_p <- fisher.test(de_counts)$p;

Documents

Statistics Workshop Dave Zhao Statistics Workshop | Dave Zhao | 20151 PowerPoint by Casey Hanson