Welcome to lecture 6: An introduction to data analysis with R IGERT – Sponsored Bioinformatics...

Preview:

Citation preview

Welcome to lecture 6:An introduction to data analysis with R

IGERT – Sponsored Bioinformatics Workshop SeriesMichael Janis and Max Kopelevich, Ph.D.

Dept. of Chemistry & Biochemistry, UCLA

We're moving ahead a bit...

• The majority of the class does some type of microarray analysis– Microarray analysis utilizes the same programmatic

concepts we've been exploring

• We’ve covered variables, control structures, data structures, functions in perl– Now we’ll introduce a new language called R

• particularly suited for numerical analysis

– We’ll learn about explorative data analysis

– We’ll write our own functionality in this new language

R(A data-structure focused introduction)

– Every R intro I’ve read tends towards statistical tools first and data structures / programmatic concepts as an afterward

I don’t think this is the best way to learn a language– After all, we’ll be doing complex data analysis– Going beyond one-off biostatistics learning

• I’m going to introduce R the same way I introduced PERL

• I hope to show you that R is just as friendly…

What is R?From the R-project webpage (www.r-project.org):

R is a language and environment for statistical computing and graphics. It is a GNU project which is similar to the S language and environment which was developed at Bell Laboratories (formerly AT&T, now Lucent Technologies) by John Chambers and colleagues. R can be considered as a different implementation of S. There are some important differences, but much code written for S runs unaltered under R.

R provides a wide variety of statistical (linear and nonlinear modelling, classical statistical tests, time-series analysis, classification, clustering, ...) and graphical techniques, and is highly extensible. The S language is often the vehicle of choice for research in statistical methodology, and R provides an Open Source route to participation in that activity.

Where to get R?Available for a wide variety of platforms ...

(handled well in windows too!)

http://www.r-project.org

Libraries available via CRAN (like CPAN we used before)

The “bioperl” version of R is “bioconductor”

- used extensively for routine and experimental

microarray analysis

Let’s begin

The default mode for R is interactive CLI (with emacs keybindings!)A query – response line mode

– An overgrown calculator• R evaluates commands through function calls

• > 2+2

• [1] 4

– A programming language• Complex data structures

• Control blocks

• Objects!

Symbolic variables

• Variable assignment is handled via arrow notation <-• Variables can be examined by simply calling the variable• The index of the first element of the variable is given in

brackets on each line [1]• Scalar elements can be numerical, character, or boolean

> x<-2> x[1] 2> x + x[1] 4> x<-”ACTCGATCGACT”> x[1] “ACTCGATGCACT”> x<-T> x[1] TRUE

Vectors• R handles vectors as single objects• R defines three types of vectors:

– numerical vectors

– character vectors

– logical vectors• Vectors are created (and treated) as concatenation of

scalar elements:

> x<-c(1,2,3,4,5)> x[1] 1 2 3 4 5> x<-c(“ACT”,”TCA”,”GGA”,”CCG”)> x[1] “ACT” “TCA” “GGA” “CCG”> x<-c(T,T,F,T)> x[1] TRUE TRUE FALSE TRUE

Vector element access• Very similar to Perl array element access• Access by index

– The index itself can be a vector, or any type of data element

– Can be an expression

– Negative indeces denote exclusion

> x<-seq(1,20,2)> x[5][1] 9> x[c(1,3,4)][1] 1 5 7> x[x>10][1] 11 13 15 17 19> x[c(-1,-2,-3,-4,-5)][1] 11 13 15 17 19

Vector functions seq (“sequence”)

Creates a range of values in a vector> x<-seq(1,10,1)> x

[1] 1 2 3 4 5 6 7 8 9 10

> x<-4:12

> x

[1] 4 5 6 7 8 9 10 11 12

> x<-LETTERS(1:3)

> x

[1] A B C

Vector functions rep (“replicate”)

Generates repeated values

- Can be used to generate complex patterns

- Can be used to generate data grouping codes

> x<-c(10,100,1000)

> rep(x,3)

[1] 10 100 1000 10 100 1000 10 100 1000

> rep(x,1:3)

[1] 10 100 100 1000 1000 1000

> rep(1:2,c(5,10))

[1] 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2

Vector functions sort

Sorts an array in-place

> x<-c(10000,10,1000)

> sort(x)

[1] 10 1000 10000

Vector functions factor

Grouping for categorical data

> x<-c(0,1,2,1,2)

> fx<-factor(x,levels=0:2)

> levels(fx)<-c(“low”,”middle”,”grande”)

> fx

[1] low middle grande middle grande

Matrices• Simply n-dimensional arrays

– in R, most everything is an array• Extends elements of any type• Can dynamically set and change dimensions

– default matrix dim is by columns

> x<-seq(1,12)> dim(x)<-c(3,4)> x

[,1] [,2] [,3] [,4][1,] 1 4 7 10[2,] 2 5 8 11[3,] 3 6 9 12> matrix(1:12,nrow=3,byrow=T)

[,1] [,2] [,3] [,4][1,] 1 2 3 4[2,] 5 6 7 8[3,] 9 10 11 12

Matrix functions t (“transpose”)

Changes rows and columns

> matrix(1:12,nrow=3,byrow=T)[,1] [,2] [,3] [,4]

[1,] 1 2 3 4[2,] 5 6 7 8[3,] 9 10 11 12> t(x)

[,1] [,2] [,3][1,] 1 5 9[2,] 2 6 10[3,] 3 7 11[4,] 4 8 12

Matrix functions rownames

Assigns scalars to the row indeces (like a hash)

> x<-matrix(1:12,nrow=3,byrow=T)> x

[,1] [,2] [,3] [,4][1,] 1 2 3 4[2,] 5 6 7 8[3,] 9 10 11 12> rownames(x)<-c(“one”,”two”,”three”)> x

[,1] [,2] [,3] [,4]one 1 2 3 4two 5 6 7 8three 9 10 11 12

Matrix functions colnames

Assigns scalars to the column indeces (like a hash)

> x<-matrix(1:12,nrow=3,byrow=T)> x

[,1] [,2] [,3] [,4][1,] 1 2 3 4[2,] 5 6 7 8[3,] 9 10 11 12> colnames(x)<-c(“one”,”two”,”three”,”four”)> x

one two three four[1,] 1 2 3 4[2,] 5 6 7 8[3,] 9 10 11 12

Matrix functions cbind

Adds (in the agglomerative sense) cols together like XL

> x<-matrix(1:12,nrow=3,byrow=T)> x

[,1] [,2] [,3] [,4][1,] 1 2 3 4[2,] 5 6 7 8[3,] 9 10 11 12> cbind(x,c(111,222,333))

[,1] [,2] [,3] [,4] [,5][1,] 1 2 3 4 111[2,] 5 6 7 8 222[3,] 9 10 11 12 333

Matrix functions rbind

Adds (in the agglomerative sense) rows together like XL

> x<-matrix(1:12,nrow=3,byrow=T)> x

[,1] [,2] [,3] [,4][1,] 1 2 3 4[2,] 5 6 7 8[3,] 9 10 11 12> cbind(x,c(111,222,333,444))

[,1] [,2] [,3] [,4][1,] 1 2 3 4[2,] 5 6 7 8[3,] 9 10 11 12[4,] 111 222 333 444

Object functions list

Combines collections into composite objects

- Objects are treated as vectors in R, plus methods

- Matrices are collections of vectors> x<-matrix(1:12,nrow=3,byrow=T)> x

[,1] [,2] [,3] [,4][1,] 1 2 3 4[2,] 5 6 7 8[3,] 9 10 11 12> y<-c(1:5)> z<-list(matrix=x,vector=y)> z$matrix

[,1] [,2] [,3] [,4] ...$vector[1] 1 2 3 4 5

list (object) functions indexing

Since it's a vector, we can obtain the elements

> z$vector[1] 1 2 3 4 5> z$vector[5][1] 5> z$matrix[1,3][1] 3> z$vector[z$vector>3][1] 4 5

list (object) functions data.frame

If the vectors are the same length, we can agglomerate them in a special data matrix

- the data is paired, and has unique row names> x<-c(1:5)> y<-c(6:10)> z<-data.frame(x,y)> z

x y1 1 62 2 73 3 84 4 95 5 10

Reading data from files data.frame / read.table / read.csv / read.delim

> myData<-read.table(“example.txt”, header=T)> myData

field_one field_two1 10.3 1.082 11.2 0.973 … …

– The data frame is ideal for handling delimited files• Assumes a header is present

– (takes the header to have n-1 entries)

• Can handle a wide variety of interfaces with outputs– Tab, comma delimited txt files– SPSS, SAS, Stata, Minitab, S-PLUS v3 files– Works well with DB interface calls as well

Persistence save.image() / .RData / ls()

– The workspace is dynamic• Variables and functions are created or loaded

– objects() or ls() shows availability of both

– Can be saved to a local .RData file using save.image()

– .RData loaded by default upon startup

– Can specify the .RData (or whatever you name it) workspace using load() (may have to specify pathname!)

data frame (object) functions subset

Allows extraction of a portion of a data frame

> x<-c(1:5)> y<-c(6:10)> z<-data.frame(x,y)> subset(z,x>2)

x y3 3 84 4 95 5 19

data frame (object) functions transform

Allows extension of a data frame

> x<-c(1:5)> y<-c(6:10)> z<-data.frame(x,y)> transform(z,x.log=log(x))

x x.log1 1 0.00000002 2 0.69314723 3 1.09861234 4 1.38629445 5 1.6094379

data frame (object) functions split

Lists vectors according to group> x<-c(1:5)> y<-c(6:10)> z<-data.frame(x,y)> h<-split(z$x,z$y)> h$”1”[1] 6$”2”[1] 7$”3”[1] 8$”4” [1] 9$”5”[1] 10

data frame (object) functions lapply

Implicit looping over group members> x<-c(1:5)> y<-c(6:10)> z<-data.frame(x,y)> lapply(z, mean)$x[1] 3

$y[1] 8

Functions in R very similar to what we've seen in perl!

Blocks are the same

- Takes arguments

- Uses control structures (for, if, while loops, ...)

> x<-c(1:5)

> my.function<-function(x)

{

u<-mean(x)

}

> y<-my.function(x)

> y

[1] 3

Control structures for loop

Loops over a set range> myfunction<-function(x){

for (i in 1:10) {

do something here}

}

The variable i will take values of the sequence in turn

The range is specified by the sequence

A stupid function exampleJust to illustrate passing args back and forth…

> myfun<-function(x)+ {+ X<-x+ for (i in 1:10)+ {+ X<-c(X,i)+ }+ X+ }> myfun(0) [1] 0 1 2 3 4 5 6 7 8 9 10

A better function exampleA function to calculate the two sample t-statistic, showing

“all the steps”. (From http://cran.r-project.org/doc/manuals/R-intro.html)

> twosam <- function(y1, y2) {

n1 <- length(y1); n2 <- length(y2) yb1 <- mean(y1); yb2 <- mean(y2) s1 <- var(y1); s2 <- var(y2) s <- ((n1-1)*s1 + (n2-1)*s2)/(n1+n2-2) tst <- (yb1 - yb2)/sqrt(s*(1/n1 + 1/n2)) tst

}

With this function defined, you could perform two sample t-tests using a call such as:

> tstat <- twosam(data$male, data$female); tstat

Control structures while loop

Loops while an evaluation returns boolean TRUE> myfunction<-function(x){

while (x>10) {

do something here}

}

The evaluation is tested at the beginning of the loop;Note that in this case, the block may never be executed

Control structures repeat loop

Loops until told to stop by break> myfunction<-function(x){

repeat {

do something hereif (x>10) break

}}

Uses a conditional if statement;The break is called whenever the boolean evaluation is true and the block is exited

Descriptive statistics summary()

Summary statistics related to a numeric variable> x<-rnorm(100)> summary(x) Min. 1st Qu. Median Mean 3rd Qu. Max. -3.08800 -0.70850 -0.11480 -0.04413 0.76510 2.89100 >

Descriptive statistics plot()

Simple x vs. y gram (scatterplot)> x<-rnorm(100)> y<-rnorm(100)> plot(x,y)> plot(rnorm(500))> lines(rnorm(500))

Descriptive statistics heatmap generation (image)

Scatterplot grid color weighted by intensities…

- very useful for microarray analysis (we’ll see next time…)

- can be used with dendrogram generation

IQR

Statistics of populations

The equations so far are for sample statistics–a statistic is a single number estimated from a sample

We use the sample to make inferences about the population.

•a parameter is a single number that summarizes some quality of a variable in a population.

•the term for the population mean is (mu), and Ybar is a sample estimator of .

•the term for the population standard deviation is (sigma), and s is a sample estimator of .

Note that and are both elements of the normal probability curve.

Source: http://www.bsos.umd.edu/socy/smartin/601/

IQR

Measuring probabilities under the normal curve

•We can make transformations by scaling everything with respect to the mean and standard deviation.

• Let z = the number of standard deviations above or below the population mean.

–z = 0y = –z = 1y = +/- (p=0.68)–z = 2y = +/- 2 (p=0.95)–z = 3y = +/- 3 (p=0.997)

Plotting using hist() and curve()

> y<-hist(h,plot=F)> ylim<-range(0,y$density,dnorm(0))> hist(x,freq=F,ylim=ylim> curve(dnorm(x),add=T)

Difficult to integrate… But probabilities have beenMapped out to this curve. Transformations from other Curves possible…

Plotting using qqnorm()

> qqnorm(x)

Box plots (box and whiskers plots, Tukey, 1977)

Outliers

Fence / whiskers

IQR

Q3

Q1

Median

Fence / whiskers

min((Q3+1.5(IQR)),largest X)

max((Q1+1.5(IQR)),smallest X)

Plotting using boxplot()

> boxplot(x)> boxplot(log(x))

My advice

First learn to program in R. Then use the R libraries.Everything in R can be built up piecewise

– The data is made of component parts• It’s extremely useful to know how to handle the objects

– The graphics are made of component parts• This allows extreme fine-tuning of your visualization!

• Go beyond scatterplots and barplots to describe complex data well and visualize hidden trends

• A good reference is Data Visualization by Edward Tufte.

Homework

A simple problem, but one we may use frequently

Use lapply (or sapply) to simulate the result of taking the mean of 100 random numbers from the normal distribution for 10 independent samples.

Recommended