Statistical Analysis of Microarray Data

Statistical Analysis of

Microarray Data

ByH. Bjørn Nielsen & Hanne Jarmer

Sample PreparationHybridization

Array designProbe design

QuestionExperimental Design

Buy Chip/Array

Statistical AnalysisFit to Model (time series)

Expression IndexCalculation

Advanced Data AnalysisClustering PCA Classification Promoter AnalysisMeta analysis Survival analysis Regulatory Network

ComparableGene Expression Data

Normalization

Image analysis

The DNA Array Analysis Pipeline

What's the question?

Typically we want to identify differentially expressed genes

Example: alcohol dehydrogenase is expressed at a higher level when alcohol is added to the media

without alcohol with alcohol

alcohol dehydrogenase

He’s going to say it

However, the measurements contain stochastic noise

There is no way around it

Statistics

Noisymeasurements p-value

statistics

You can choose to think of statistics as a black box

But, you still need to understand how to interpret the results

P-valueThe chance of rejecting the null hypothesis by coincidence----------------------------For gene expression analysis we can say: the chance that a gene is categorized as differentially expressed by coincidence

The output of the statistics

The statistics gives us a p-value for each gene

We can rank the genes according to the p-value

But, we can’t really trust the p-value in a strict statistical way!

Why not!

For two reasons:

1. We are rarely fulfilling all the assumptions of the statistical test

2. We have to take multi-testing into account

The t-test Assumptions

1. The observations in the two categories must be independent

2. The observations should be normally distributed

3. The sample size must be ‘large’(>30 replicates)

Multi-testing?

In a typical microarray analysis we test thousands of genes

If we use a significance level of 0.05and we test 1000 genes. We expect 50 genes to be significant by chance

1000 x 0.05 = 50

Correction for multiple testing

Benjamini-Hochberg:

P ≤ iN 0.01

Bonferroni:

P ≤ 0.01N

N = number of genesi = number of accepted genes

Confidence level of 99%

But really, those methods are too strict

What we can trust is the ability of the statistical test to rank the genes according to their reliability

If we permute the samples we can get an estimate of the False Discovery Rate (FDR) in this set

The number of genes that are needed or can be handled in downstream processes can be used to set the cutoff

log2 fold change (M)

P-value

Volcano Plot

What's inside the black box ‘statistics’

t-test or ANOVA

The t-test

Calculate T

Lookup T in a table

The t-test II

The t-test tests for difference in means ()

Intensityof gene x

Density wt wt mut mutant

The t statistic is based on the sample mean and variance

The t-test III

ANalysis Of Variance

Very similar to the t-test, but can test multiple categories

Ex: is gene x differentially expressed between wt, mutant 1 and mutant 2

Advantage: it has more ‘power’ than the t-test

ANOVA II

Intensity

Density

Variance between groups

Variance within groups

Blocks and paired tests

Some undesired factors may influence the experiments, the effect of such can be greatly reduced if they are blocked out or if the experiment is paired.

Some possible blocks: - Dye - Patient - Technician - Batch - Day of experiment - Array

Paired t-test

Hypothesis: There is no difference between the mean BLUE and RED

Block 1

Block 3Block 2

An example: (2-way ANOVA with blocks)

Experimental Design: A187 CreA

Glucose Ethanol Glucose Ethanol

Everything was done in batches to capture the systematic noise

Example: Batch to batch variation

Within batch variation is lower than the between batch variation

Example: Data analysis - Blocking

• We can capture the batch variation by blockingTwo-way ANOVA

Glucose

Ethanol

A187 CreA

Effect 1

Effect 2

Batch A B C

Ex: Result of the 2-way ANOVA(3 p-values)

Two-way ANOVA

Ethanol

Glucose

A187 CreA

Media effect

Genotypeeffect

Batch A B C

A genotype p-valueA growth media p-valueAn interaction p-value

Conclusion

• Array data contains stochastic noise– Therefore statistics is needed to conclude on

differential expression

• We can’t really trust the p-value• But the statistics can rank genes• The capacity/needs of downstream

processes can be used to set cutoff• FDR can be estimated• t-test is used for two category tests• ANOVA is used for multiple categories

Statistical Analysis of Microarray Data

Documents

Statistical Issues in the Analysis of Microarray Data · Statistical Analysis of Microarray Data Richard Simon Lisa M. McShane Michael D. Radmacher Biometric Research Branch National

Statistical Applications in Genetics and Molecular Biology · provides a uniﬁed statistical model for systems analysis of microarray experiments with complex experimental designs

Statistical Methods and Software for the Analysis of DNA ... · Statistical Methods and Software for the Analysis of DNA Microarray Experiments Sandrine Dudoit ... MASSIVE PCR PCR

GS01 0163 Analysis of Microarray Data · MD, Wright GW, Zhao Y. Design and Analysis of DNA Microarray Investigations. Springer-Verlag, New York, 2003. Optional: Speed T (ed). Statistical

BioVLAB-Microarray: Microarray Data Analysis in Virtual Environment

Statistical analysis of human microarray data shows that ......Statistical analysis of human microarray data shows that dietary intervention with n-3 fatty acids, ﬂavonoids and resveratrol

Statistical Analysis of cDNA Microarray Data: Challenges and Solutions

Statistical analysis on microarray data: selection of …1 Statistical analysis on microarray data: selection of gene prognosis signatures Kim-Anh Lê Cao1 and Geoffrey J. McLachlan2

Statistical Analysis of Microarray Data · Clustering Statistical Analysis of Microarray Data Jacques van Helden Jacques.van-Helden@univ-amu.fr Aix-Marseille Université, France Technological

BioVLAB-Microarray: Microarray Data Analysis in …web.ornl.gov/~choij/papers/BioVLAB.pdf · BioVLAB-Microarray: Microarray Data Analysis in Virtual Environment Youngik Yang, Jong

The Bioconductor Project: Open-source Statistical Software ... · Open-source Statistical Software for the Analysis of ... • A very common task in microarray data analysis is gene-by

Bringing A Statistical Package To The Biologist’s Fingertips With Applications to Microarray Analysis

Statistical Methods for Analyzing Tissue Microarray Datavnminin.github.io/papers/TissueMicroarray.pdf · Statistical Methods for Analyzing Tissue Microarray Data Xueli Liu1; 2, Vladimir

Microarray Analysis - The Basicstgirke/HTML_Presentations/Manuals/Microarray/... · Microarray Analysis The Basics Thomas Girke December 9, 2011 Microarray Analysis Slide 1/42

Statistical Techniques for Temporal Microarray Data Analysis

Statistical Methods in Microarray Data Analysis Mark Reimers, Genomics and Bioinformatics, Karolinska Institute

Statistical Microarray Data Analysis - cvut.czmicroarray data – origin, aims of analysis Hypothesis induction –traditional statistics vs learning patterns Find significantly differentially

:: Microarray analysis ::

Improved statistical inference from DNA microarray data ...pfbaldi/publications/journals/M0_10192.pdfImproved statistical inference from DNA microarray data using analysis of variance

A probabilistic framework for microarray data analysis ... · Standard statistical analysis involves estimating (and drawing statistical inference about) p parameters using a data