Statistics for Differential Expression Naomi Altman Oct. 06


Page 1: Statistics for Differential Expression Naomi Altman Oct. 06

Statistics for Differential Expression

Naomi Altman

Oct. 06

Page 2: Statistics for Differential Expression Naomi Altman Oct. 06

Some things to consider before we start

Model

Replication

Correlation / Independence

Treatments (conditions, varieties ...)

Page 3: Statistics for Differential Expression Naomi Altman Oct. 06

Some things to consider before we start

Model

Using a statistical model sheds light on the analysis by quantifying features such as condition effects and sources of biological and experimental variation.

Models can be written down before the data are collected, which clarifies how the data should be collected and analyzed.

When an estimate of variability is available, the model can be used to determine appropriate sample size.
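As a rough illustration of the sample-size point, base R's `power.t.test` can be used for a two-condition comparison. The planning values below (a 2-fold difference, a within-condition SD of 0.5 on the log2 scale, a 0.001 significance level and 90% power) are assumptions made up for this sketch, not values from the slides.

```r
# Hypothetical planning values (not from the slides):
# a true difference of 1 on the log2 scale (2-fold) and a
# within-condition SD of 0.5, estimated e.g. from a pilot study.
plan <- power.t.test(delta = 1, sd = 0.5,
                     sig.level = 0.001, power = 0.9,
                     type = "two.sample", alternative = "two.sided")

# Number of independent biological replicates needed per condition.
n_per_group <- ceiling(plan$n)
print(n_per_group)
```

A stringent significance level is used here only as a crude stand-in for the fact that many genes are tested at once.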

Replication

Correlation / Independence

Treatments

Page 4: Statistics for Differential Expression Naomi Altman Oct. 06

Some things to consider before we start

Model

Replication

Statistical methods compare the differences among condition means to the variation within condition.

The within-condition variation can only be estimated by replicating the condition.

Technical replication (multiple probes in a probeset, or multiple hybridizations of the same sample) is often treated as if it had biological meaning, but it is not true replication.

Correlation / Independence

Treatments

Page 5: Statistics for Differential Expression Naomi Altman Oct. 06

Some things to consider before we start

Model

Replication

Correlation / Independence

Observations are correlated because:

they are taken on the same individual

they are measured on the same array

they are processed in the same replicate

Most simple analysis methods assume independence and hence must be modified to handle correlated data.

Treatments

Page 6: Statistics for Differential Expression Naomi Altman Oct. 06

Some things to consider before we start

Model

Replication

Correlation / Independence

Treatments:

what is interesting?

what is the "action"?

how many can we really handle?

Page 7: Statistics for Differential Expression Naomi Altman Oct. 06

2 treatments

We have already considered the simple case of 2 treatments using t-tests (or the permutation, bootstrap or Wilcoxon versions of the tests).

Which tests do we use and when are they appropriate?

Page 8: Statistics for Differential Expression Naomi Altman Oct. 06

Tests for 2 treatments

Two-sample "t-tests" (and similar tests) require independent samples within and between the 2 treatments, i.e.

1. All RNA samples are biologically independent.

2. Each sample is hybridized to a different array:

• single channel arrays such as Affy, Nimblegen, CodeLink

• 2 channel arrays with a reference sample in the same channel on each array (use M as the data)

Page 9: Statistics for Differential Expression Naomi Altman Oct. 06

Tests for 2 treatments

The paired "t-test" (and similar tests) applies when:

1. Each array includes both treatments.

2. Different arrays come from different biological samples.

3. Either there is no dye effect, or technical dye-swaps have been done and the technical replicates have been averaged.
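The two situations can be sketched in R as follows. All numbers are simulated for illustration only; the variable names (`trt1`, `trt2`, `M`) are hypothetical.

```r
set.seed(1)

# Case 1: independent samples, each hybridized to its own array
# (e.g. a single-channel platform). Simulated log2 intensities for one gene.
trt1 <- rnorm(6, mean = 8.0, sd = 0.4)
trt2 <- rnorm(6, mean = 8.8, sd = 0.4)
two_sample <- t.test(trt1, trt2)       # Welch two-sample t-test (R's default)
rank_based <- wilcox.test(trt1, trt2)  # Wilcoxon rank-sum version

# Case 2: both treatments on every 2-channel array, and each array comes
# from different biological samples. The data are one log ratio per array,
# M = log2(Red) - log2(Green).
M <- rnorm(6, mean = 0.8, sd = 0.4)
paired <- t.test(M, mu = 0)            # paired t-test = one-sample t-test on M

print(two_sample$p.value)
print(paired$p.value)
```

Applying the two-sample test to the two channels of the paired design would ignore the correlation within each array.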

Page 10: Statistics for Differential Expression Naomi Altman Oct. 06

Tests for 3 or more treatments with independent samples

Requires independent samples.

(We cannot extend the paired sample idea, because we do not have 3 or more channels on the array.)

H0: all the population means are equal

HA: At least one of the means differs

Page 11: Statistics for Differential Expression Naomi Altman Oct. 06

Tests for 3 or more treatments with independent samples

examples:

Cancers: several cancer types with 1 sample per patient, several patients with each cancer

Genotypes: several genotypes of mice with 1 sample per mouse, several mice per genotype

Drug: different doses applied to different individuals with 1 sample per individual, several individuals per dose

Page 12: Statistics for Differential Expression Naomi Altman Oct. 06

Tests for 3 or more treatments with independent samples

The normal-theory test (the F-test of one-way ANOVA) assumes that the spreads are all approximately equal and that the populations are approximately normally distributed. The other versions of the test (permutation, bootstrap, rank-based) do not require normality.

The test statistic is the ratio of the variance among the sample means to the variance within the samples.

Page 13: Statistics for Differential Expression Naomi Altman Oct. 06

Tests for 3 or more treatments with independent samples

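If there are T treatments, with $n_i$ observations from the ith treatment, let

$$N = n_1 + \dots + n_T$$

$$MSE = \sum_{i=1}^{T} \sum_{j=1}^{n_i} (y_{ij} - \bar{y}_{i\cdot})^2 / (N - T)$$

$$MSTr = \sum_{i=1}^{T} n_i (\bar{y}_{i\cdot} - \bar{y}_{\cdot\cdot})^2 / (T - 1)$$

$$F^* = MSTr / MSE$$

$F^*$ has an F-distribution with $T-1$ and $N-T$ degrees of freedom when the null is true.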

One-Way ANOVA
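The mean squares can be computed directly, which may help to see what the F statistic measures. This sketch uses R's built-in `iris` data (sepal length by species); the variable names are my own.

```r
y <- iris$Sepal.Length
g <- iris$Species

N     <- length(y)         # total number of observations
T_trt <- nlevels(g)        # number of treatments (here: species)
n_i    <- tapply(y, g, length)
ybar_i <- tapply(y, g, mean)
ybar   <- mean(y)

# Within-treatment variation: squared deviations from each treatment mean.
MSE  <- sum((y - ave(y, g))^2) / (N - T_trt)
# Among-treatment variation: squared deviations of treatment means
# from the overall mean, weighted by the sample sizes.
MSTr <- sum(n_i * (ybar_i - ybar)^2) / (T_trt - 1)

F_star  <- MSTr / MSE
p_value <- pf(F_star, T_trt - 1, N - T_trt, lower.tail = FALSE)
print(c(MSTr = MSTr, MSE = MSE, F = F_star))
```

These values should agree with the "Mean Sq" and "F value" columns that `summary(aov(...))` reports for the same data.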

Page 14: Statistics for Differential Expression Naomi Altman Oct. 06

One-way ANOVA

summary(aov(iris$Sepal.Length ~ iris$Species))

              Df Sum Sq Mean Sq F value    Pr(>F)
iris$Species   2 63.212  31.606  119.26 < 2.2e-16 ***
Residuals    147 38.956   0.265
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Permutation, bootstrap and rank tests (the Kruskal-Wallis test) are readily extended to this situation.
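A minimal sketch of the rank-based and permutation versions for the same `iris` example. The helper `f_stat` and the choice of 999 permutations are assumptions for illustration, not part of the slides.

```r
set.seed(1)
y <- iris$Sepal.Length
g <- iris$Species

# Rank-based version: Kruskal-Wallis test.
kw <- kruskal.test(y, g)

# Permutation version: recompute the F statistic after shuffling the
# treatment labels, which is valid when the samples are independent.
f_stat <- function(y, g) anova(lm(y ~ g))[1, "F value"]
f_obs  <- f_stat(y, g)
f_perm <- replicate(999, f_stat(y, sample(g)))

# Proportion of permuted statistics at least as large as the observed one
# (the observed labelling counts as one of the permutations).
p_perm <- (1 + sum(f_perm >= f_obs)) / (1 + length(f_perm))
print(c(kruskal = kw$p.value, permutation = p_perm))
```

With 999 permutations the smallest attainable p-value is 0.001, so more permutations are needed when very small p-values matter.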

Page 15: Statistics for Differential Expression Naomi Altman Oct. 06

More complex situations

Many microarray experiments do not fall into this simple situation because of correlation in the data due to:

biological correlation (same cell-line, individual ...)

using 2-channel microarrays

having multiple probes for the same gene

Also, we may have multifactor studies:

e.g. 2 genotypes, control and exposed, time course

For this we use Linear Mixed Models

Page 16: Statistics for Differential Expression Naomi Altman Oct. 06

Linear Models

It is useful to consider a model for the observed data (on a single probe or probeset):

$$Y = \log_2(\text{intensity}) = \mu + \alpha + \beta + \gamma + \dots + \text{error}$$

$\mu$ is the mean over all the conditions and arrays

error is the random error that is a mixture of measurement error and biological variability

the other terms are systematic deviations from the mean, due to the treatments, array effects, lab effects, etc.

Page 17: Statistics for Differential Expression Naomi Altman Oct. 06

Linear Models

e.g. Comparison of liver and kidney tissue in male and female mice on 2-channel arrays with 3 replicate spots per gene

5 males and 5 females

Y is the log2(intensity) in one channel for one spot.

We need to remember that dye might have an effect.

Page 18: Statistics for Differential Expression Naomi Altman Oct. 06

Linear Models

Fixed effects are the conditions of interest in the experiment:

Random effects are conditions which explain some of the noise in the model:

Page 19: Statistics for Differential Expression Naomi Altman Oct. 06

How does the model help us?

Generally, differential expression analysis is looking for differences between treatments that are larger than expected by chance.

The model helps us to understand the meaning of "by chance".

The model also allows us to design our experiment to minimize the probability of chance observation of large differences.

Page 20: Statistics for Differential Expression Naomi Altman Oct. 06

How Does the Model Help Us?

Mean log2 intensity:

          Male   Female
Liver      5.6      6.3
Kidney     9.3     10.7

• difference between male and female in liver

• difference between liver and kidney in males

What is larger than expected by chance?

Suppose the arrays are:

• 5 arrays - male and female liver

• 5 arrays - male and female kidney

Alternatively, suppose the arrays are:

• 5 arrays - male liver and kidney

• 5 arrays - female liver and kidney

Page 21: Statistics for Differential Expression Naomi Altman Oct. 06

The simplest model

2 treatments on 2 channel arrays with independent biological samples, no dye effect and no dye-swap. All of the data are independent.

$$M = \log_2(\text{Red}) - \log_2(\text{Green})$$

$$M_i = \mu + \text{error}_i$$

No differential expression implies

$$H_0: \mu = 0$$

The F-test for this model is just $t^2$ from the paired t-test.
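The relation between the F-test and the paired t-test can be checked numerically. The log ratios below are simulated, and the F statistic is computed by hand from the residual sums of squares of the null model (mean fixed at 0) and the full model (mean estimated).

```r
set.seed(2)
M <- rnorm(8, mean = 0.6, sd = 0.5)   # made-up log ratios, one per array
n <- length(M)

rss_null <- sum(M^2)                  # residual SS under H0: mu = 0
rss_full <- sum((M - mean(M))^2)      # residual SS with mu estimated by mean(M)

# F statistic with 1 and n - 1 degrees of freedom.
F_star <- (rss_null - rss_full) / (rss_full / (n - 1))

# Paired (one-sample) t statistic on the same data.
t_star <- unname(t.test(M, mu = 0)$statistic)

print(all.equal(F_star, t_star^2))
```

The two tests also give the same p-value, since the square of a t variable with n - 1 degrees of freedom has an F distribution with 1 and n - 1 degrees of freedom.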

Page 22: Statistics for Differential Expression Naomi Altman Oct. 06

One-Way "ANOVA"

$$Y_{ij} = \mu + \alpha_i + \text{error}_{ij}$$

$\mu$ is the mean expression for the gene over the entire experiment.

$\alpha_i$ is the deviation of the mean of the ith condition from the overall mean, $\sum_i \alpha_i = 0$.

The error variance should not depend on the condition.

Page 23: Statistics for Differential Expression Naomi Altman Oct. 06

More Complicated Models with Fixed Effects Only

$$Y_{ijk} = \mu + \alpha_i + \beta_j + (\alpha\beta)_{ij} + \text{error}_{ijk}$$

We may have 2 or more factors, e.g.

• genotype and drug dose

• genotype and time point

• treatment and dye

$\mu$ is the mean expression for the gene over the entire experiment.

$\alpha_i$ is the deviation of the mean of the ith level of factor A from the overall mean, $\sum_i \alpha_i = 0$.

$\beta_j$ is the deviation of the mean of the jth level of factor B from the overall mean, $\sum_j \beta_j = 0$.

$(\alpha\beta)_{ij}$ is the deviation of the mean of the ijth combination of levels from $\mu + \alpha_i + \beta_j$, $\sum_i (\alpha\beta)_{ij} = \sum_j (\alpha\beta)_{ij} = 0$.

The error variance should not depend on the condition.

Page 24: Statistics for Differential Expression Naomi Altman Oct. 06

More Complicated Models with Fixed Effects Only

[Figure: two pairs of plots of the means against the levels of factor A and of factor B. One pair is titled "Interaction among factors", the other "No interaction among factors".]

Page 25: Statistics for Differential Expression Naomi Altman Oct. 06

More Complicated Models with Fixed Effects Only

$$Y_{ijk} = \mu + \alpha_i + \beta_j + (\alpha\beta)_{ij} + \text{error}_{ijk}$$

Normal Theory ANOVA is readily extended to this situation and more factors can be added.

Permutation and bootstrap methods begin to get complicated, but can still be applied.

Rank-based methods are available for 2 factors, but get complicated
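A two-factor model with interaction can be fitted with `aov`. The design below (2 genotypes by 3 doses, 4 independent samples per combination) and all effect sizes are invented for this sketch.

```r
set.seed(3)

# Made-up balanced design: every genotype x dose combination has
# 4 independent biological samples.
d <- expand.grid(rep = 1:4,
                 genotype = c("wt", "mut"),
                 dose = c("0", "1", "2"))

# Simulated log2 expression for one gene: main effects of genotype and
# dose, plus an interaction (extra effect of the highest dose in "mut").
cell_mean <- 8 + 0.5 * (d$genotype == "mut") + 0.3 * as.integer(d$dose) +
  0.8 * (d$genotype == "mut" & d$dose == "2")
d$y <- cell_mean + rnorm(nrow(d), sd = 0.3)

# genotype * dose expands to genotype + dose + genotype:dose.
fit <- aov(y ~ genotype * dose, data = d)
tab <- summary(fit)[[1]]
print(tab)
```

The `genotype:dose` row tests whether the dose effect differs between genotypes; this test is only valid here because every observation comes from a different sample.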

Page 26: Statistics for Differential Expression Naomi Altman Oct. 06

Replicates that are not Independent

We often have replicates that are NOT independent:

multiple spots for the same gene on an array

multiple arrays from the same RNA

multiple RNAs from the same tissue

multiple samples from the same individual

multiple labs

multiple "batches"

Page 27: Statistics for Differential Expression Naomi Altman Oct. 06

Replicates that are not Independent

e.g. A dye-swap experiment in which the dye-swaps are technical replicates (1 dye-swap pair per sample) and there are 2 spots per gene on the array, with 2 or more treatments

$$Y_{ijkst} = \mu + \alpha_i + \beta_j + \gamma_k + \delta_s + \tau_t + \text{error}_{ijkst}$$

$\mu$ is the mean expression for the gene over the entire experiment.

$\alpha_i$ is the deviation of the mean of the ith treatment from the overall mean, $\sum_i \alpha_i = 0$.

$\beta_j$ is the deviation of the mean of the jth level of dye from the overall mean, $\beta_r + \beta_g = 0$.

$\gamma_k$ is the array effect, which induces a correlation between the 2 spots on the same array, $\gamma_k \sim N(0, \sigma_\gamma^2)$.

$\delta_s$ is the spot effect, which induces a correlation between the 2 channels at the same spot, $\delta_s \sim N(0, \sigma_\delta^2)$.

$\tau_t$ is the biological sample effect, which induces a correlation between the 2 arrays in the dye-swap pair, $\tau_t \sim N(0, \sigma_\tau^2)$.

Page 28: Statistics for Differential Expression Naomi Altman Oct. 06

Replicates that are not Independent

The lack of independence can be modeled as a random effect.

This is handled in a straightforward manner by ANOVA modeling but ...

all the other methods get MUCH more complicated.

Much of the available software does not handle this very well.

Page 29: Statistics for Differential Expression Naomi Altman Oct. 06

Replicates that are not Independent

In some cases, we can return to fixed effects models by averaging (but this loses power).

e.g. technical replicates can be averaged and the averages can be used as if they were the primary data

This is much better than discarding technical replicates, but not as good as modeling them.
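Averaging technical replicates before testing might look like the sketch below. The design (2 treatments, 4 biological samples each, 3 technical replicates per sample) and the variance values are assumptions chosen for illustration.

```r
set.seed(4)

# 8 biological samples, 4 per treatment.
samples <- data.frame(sample = factor(1:8),
                      trt = rep(c("ctl", "trt"), each = 4))
bio <- rnorm(8, sd = 0.5)             # biological sample effects

# 3 technical replicates per sample: they share the sample effect,
# so they are not independent of each other.
d <- samples[rep(1:8, each = 3), ]
d$y <- 8 + 0.8 * (d$trt == "trt") + bio[as.integer(d$sample)] +
  rnorm(24, sd = 0.2)

# Average the technical replicates, then treat the 8 averages as the data.
avg <- aggregate(y ~ sample + trt, data = d, FUN = mean)
res <- t.test(y ~ trt, data = avg, var.equal = TRUE)
print(res)
```

The test has 6 degrees of freedom (8 biological samples minus 2 treatment means), not the 22 that would be claimed if all 24 measurements were treated as independent.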

Page 30: Statistics for Differential Expression Naomi Altman Oct. 06

Replicates that are not Independent Example

2 conditions on a 2-channel array with replicate spots for each gene, and a dye-swap technical replicate.

e.g. 2 genotypes of mouse

3 mice per genotype

1 mouse from each genotype on each array

2 arrays from each pair of mice

4 replicate spots per array

We will simplify by modeling M, rather than each channel.

[Diagram: the three mouse pairs A1-B1, A2-B2 and A3-B3.]

Page 31: Statistics for Differential Expression Naomi Altman Oct. 06

Replicates that are not Independent Example

effects:

• mouse pair

• dye (or equivalently genotype)

• array

pair and array are random

dye is fixed

We need to keep track of whether M is R-G or A-B (genotype difference).

We do not need to include spot as we are using M


Page 32: Statistics for Differential Expression Naomi Altman Oct. 06

Replicates that are not Independent Example

data for 1 mouse pair (m):

2 arrays, with 4 spots per array

$M_{mdas}$

m is the mouse pair identifier (1,2,3)

d is the dye for genotype A (r,g)

a is the array (1-6 or 1,2 within m)

s is the spot (1-4 within array)


Page 33: Statistics for Differential Expression Naomi Altman Oct. 06

Replicates that are not Independent Example

$$M_{ijkt} = \mu + m_i + d_j + a_k + \text{error}_{ijkt}$$

effects:

m mouse pair (random)

d dye (fixed; equivalently, which genotype is in the red channel)

a array (random)

t=1,2,3,4 for the spots

The hypothesis of no genotype effect is $\mu = 0$.

Notice that we have to be careful about the sign of M. If we code the effects in the way it is usually done for ANOVA, M=A-B, not R-G.


Page 34: Statistics for Differential Expression Naomi Altman Oct. 06

Replicates that are not Independent Example

$$M_{ijkt} = \mu + m_i + d_j + a_k + \text{error}_{ijkt}$$

Our estimate of $\mu$ is just the sample mean of M over all the spots.

But the SE of ave(M) is not the usual sample SD divided by the square root of the number of spots, because of the other effects.


Page 35: Statistics for Differential Expression Naomi Altman Oct. 06

Replicates that are not Independent Example

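$$M_{ijkt} = \mu + m_i + d_j + a_k + \text{error}_{ijkt}$$

3 mouse pairs, 6 arrays, 24 observations/gene

$$\bar{M} = \mu + \bar{m} + \bar{a} + \overline{\text{error}}$$

(the dye effects average to zero over each dye-swap pair)

$$Var(\bar{M}) = Var(\bar{m}) + Var(\bar{a}) + Var(\overline{\text{error}})$$

$$Var(\bar{m}) = Var(m)/3$$

$$Var(\bar{a}) = Var(a)/6$$

$$Var(\overline{\text{error}}) = Var(\text{error})/24$$

$$t^* = \frac{\bar{M}}{\sqrt{s_m^2/3 + s_a^2/6 + s_{error}^2/24}}$$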
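One way to see where the divisors 3, 6 and 24 come from, and how the denominator could be estimated in this balanced design, is sketched below. It assumes (as in the example) 3 mouse pairs, 2 arrays per pair, 4 spots per array, and dye effects that cancel within each dye-swap pair. Here $\sigma_m^2$, $\sigma_a^2$ and $\sigma_{error}^2$ stand for Var(m), Var(a) and Var(error), $\bar{M}_i$ is the mean of the 8 values of M for mouse pair i, and $MS_{pair}$ is a label introduced here for the between-pair mean square.

```latex
\begin{align*}
\bar{M}_{i} &= \mu + m_i + \tfrac{1}{2}(a_{i1} + a_{i2}) + \overline{\text{error}}_{i} \\
\mathrm{Var}(\bar{M}_{i}) &= \sigma_m^2 + \frac{\sigma_a^2}{2} + \frac{\sigma_{error}^2}{8} \\
\mathrm{Var}(\bar{M}) &= \frac{1}{3}\,\mathrm{Var}(\bar{M}_{i})
  = \frac{\sigma_m^2}{3} + \frac{\sigma_a^2}{6} + \frac{\sigma_{error}^2}{24} \\
MS_{pair} &= \frac{8}{3-1} \sum_{i=1}^{3} (\bar{M}_{i} - \bar{M})^2,
  \qquad E(MS_{pair}) = \sigma_{error}^2 + 4\sigma_a^2 + 8\sigma_m^2 \\
E\!\left(\frac{MS_{pair}}{24}\right)
  &= \frac{\sigma_m^2}{3} + \frac{\sigma_a^2}{6} + \frac{\sigma_{error}^2}{24}
  = \mathrm{Var}(\bar{M})
\end{align*}
```

So in this balanced case a one-sample t-test on the 3 mouse-pair means, with 2 degrees of freedom, uses a denominator with the right expectation.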

Page 36: Statistics for Differential Expression Naomi Altman Oct. 06

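What if we ignore the Dependence?

We would use:

$$t^* = \frac{\bar{M}}{\sqrt{s_M^2/24}}$$

where $s_M^2$ is the sample variance of the 24 values of M, and

$$E(s_M^2) = \sigma_{error}^2 + \frac{20}{23}\sigma_a^2 + \frac{16}{23}\sigma_m^2$$

Compare with

$$t^* = \frac{\bar{M}}{\sqrt{s_m^2/3 + s_a^2/6 + s_{error}^2/24}}$$

The denominator of the ordinary t-test is much too small.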
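A small simulation can illustrate the consequence. It generates data with no genotype effect (the null is true) for the layout of the example (3 mouse pairs, 2 arrays per pair, 4 spots per array) and compares the ordinary t-test on all 24 values with a t-test on the 3 mouse-pair means. The standard deviations, the absence of a dye effect and the function name `sim_once` are assumptions of this sketch.

```r
set.seed(5)

sim_once <- function(sd_m = 0.5, sd_a = 0.3, sd_e = 0.2) {
  pair <- rep(1:3, each = 8)   # mouse pair for each of the 24 spots
  arr  <- rep(1:6, each = 4)   # array for each spot (2 arrays per pair)
  # mu = 0, no dye effect: random pair, array and spot-level errors only.
  M <- rnorm(3, sd = sd_m)[pair] + rnorm(6, sd = sd_a)[arr] +
    rnorm(24, sd = sd_e)

  # Ordinary t-test, treating the 24 values as independent.
  t_naive <- mean(M) / sqrt(var(M) / 24)

  # t-test on the 3 pair means, which respects the dependence.
  pair_means <- as.vector(tapply(M, pair, mean))
  t_pairs <- mean(pair_means) / sqrt(var(pair_means) / 3)

  c(naive = abs(t_naive) > qt(0.975, df = 23),
    pairs = abs(t_pairs) > qt(0.975, df = 2))
}

# Proportion of false positives at the nominal 5% level.
reject <- rowMeans(replicate(2000, sim_once()))
print(reject)
```

Under these assumptions the ordinary t-test rejects far more often than the nominal 5%, while the test based on the pair means stays close to it.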