Supervised microarray data analysis

Department of Mathematics and Computer Science

Mark van de Wiel


Quality control

• Protocols
• Perform a small-scale, well-controlled experiment to assess the influence of experimental factors (microarrays from different batches, printing tips, dyes, linearity of the scanner, etc.).
• Continuous factors (temperature, humidity, spot size over time, intensity of a control spot over time) can be monitored with standard control chart techniques.
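As an illustration of the control chart idea, here is a minimal sketch of a Shewhart-style chart for one continuous factor. The intensity values, the function names and the 3-sigma rule are assumptions made for the example; a production chart for individual readings would often estimate the spread from moving ranges instead.

```python
from statistics import mean, stdev

def shewhart_limits(baseline, k=3.0):
    """Return (lower, centre, upper) limits from in-control baseline readings."""
    centre = mean(baseline)
    spread = stdev(baseline)  # sample standard deviation of the baseline
    return centre - k * spread, centre, centre + k * spread

def out_of_control(readings, limits):
    """Indices of readings that fall outside the control limits."""
    lower, _, upper = limits
    return [i for i, x in enumerate(readings) if x < lower or x > upper]

# Hypothetical control-spot intensities (arbitrary units), invented for illustration.
baseline = [100.2, 99.8, 100.5, 99.9, 100.1, 100.4, 99.7, 100.0]
limits = shewhart_limits(baseline)

new_readings = [100.3, 99.6, 103.0, 100.1]
flagged = out_of_control(new_readings, limits)
print(flagged)  # indices of readings that deserve a closer look
```

A flagged reading does not prove that something went wrong; it is a prompt to check the protocol for that hybridisation or scan.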


Design of the experiment

• Think very carefully about what the biological goals are.
• What software do you have at your disposal to analyse the data?
• Is a reference sample needed or not?
• ‘Biological design’: which tissues to combine on an array (cDNA)? With more than one biological factor: factorial design.
• Dye bias: use a dye-swap.
• Design on the array: negative/positive controls, repeats, how many genes? Run a pilot study first, distributing the repeats over the experimental factors (spatial position, printing tips, etc.).
• Save some space on the (cDNA) microarray for assessing variability due to experimental factors (e.g. print the same control gene with several printing tips).
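To show why a dye-swap helps against dye bias, the following sketch separates a treatment effect from a dye effect for one gene. It assumes that the dye bias is additive on the log scale and identical on both arrays; the function name and the numbers are hypothetical.

```python
def dye_swap_estimate(m_forward, m_swapped):
    """Split two dye-swapped log-ratios into a treatment effect and a dye bias.

    m_forward: log2(Cy5/Cy3) with the treated sample labelled Cy5
    m_swapped: log2(Cy5/Cy3) with the treated sample labelled Cy3
    Assumes m_forward = effect + bias and m_swapped = -effect + bias.
    """
    effect = (m_forward - m_swapped) / 2.0    # the bias cancels in the difference
    dye_bias = (m_forward + m_swapped) / 2.0  # the effect cancels in the sum
    return effect, dye_bias

# Hypothetical log-ratios for one gene on a dye-swapped pair of arrays.
effect, dye_bias = dye_swap_estimate(1.5, -0.5)
print(effect, dye_bias)
```

If the dye bias depends on the gene or on the intensity, this simple average only removes the part that is shared by the two arrays.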


Analysis: Multiple testing (after normalization)

Objective: control the number of falsely selected genes

FWE: Family-wise error rate, the probability of falsely selecting at least one gene

• Weak FWE control: P(falsely select at least one gene i, i = 1, ..., 20,000 | no gene truly expressed)
• Strong FWE control: P(falsely select at least one gene i, i = 1, ..., 20,000 | some genes expressed, some genes not expressed)

FDR: False Discovery Rate

F: number of false rejections (genes selected although not expressed), T: total number of rejections

FDR control: control the expected proportion E(F/T), with F/T taken as 0 when T = 0
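The two criteria can be contrasted on a small list of p-values. The sketch below uses the Bonferroni correction as a simple procedure with strong FWE control and the Benjamini-Hochberg step-up procedure as an FDR procedure (valid under independence or positive dependence of the tests). Neither procedure is named in the slides, and the p-values are invented.

```python
def bonferroni(pvalues, alpha=0.05):
    """Indices selected under Bonferroni (strong FWE control at level alpha)."""
    m = len(pvalues)
    return [i for i, p in enumerate(pvalues) if p <= alpha / m]

def benjamini_hochberg(pvalues, alpha=0.05):
    """Indices selected by the Benjamini-Hochberg step-up procedure (FDR level alpha)."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    k = 0  # largest rank whose ordered p-value lies below its threshold
    for rank, i in enumerate(order, start=1):
        if pvalues[i] <= rank * alpha / m:
            k = rank
    return sorted(order[:k])

# Hypothetical p-values for ten genes.
pvals = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
fwe_selected = bonferroni(pvals)
fdr_selected = benjamini_hochberg(pvals)
print(fwe_selected, fdr_selected)
```

Every gene selected by Bonferroni is also selected by Benjamini-Hochberg at the same level, so the FDR list is never shorter.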


Multiple testing: FWE vs FDR

• Control of the FDR implies weak control of the FWE
• Advantage of strong control of the FWE: the significance level is controlled under all situations
• Disadvantage: less power than FDR control
• FWE-based procedures tend to select fewer genes than FDR-based procedures

Software:

• Bioconductor: step-down Westfall-Young (Dudoit et al.), control of FDR and FWE.
• SAM (permutation-based ‘control’ of FDR)
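To give an idea of how permutation-based FWE adjustment works, here is a sketch of the single-step maxT adjustment. This is a simpler relative of the step-down Westfall-Young procedure, not the Bioconductor implementation. It assumes two groups of equal size, no ties, and a data set small enough to enumerate all relabellings; the data and function names are hypothetical.

```python
from itertools import combinations
from math import sqrt
from statistics import mean, variance

def t_stat(a, b):
    """Two-sample t-statistic with pooled variance (equal group sizes assumed)."""
    pooled = (variance(a) + variance(b)) / 2.0
    return (mean(a) - mean(b)) / sqrt(pooled * (1.0 / len(a) + 1.0 / len(b)))

def max_t_adjusted(data, n_first):
    """Single-step maxT adjusted p-values over all relabellings of the arrays.

    data: one list of expression values per gene, arrays in the same order;
    the first n_first arrays form group 1 in the observed labelling.
    """
    n = len(data[0])
    observed = [abs(t_stat(row[:n_first], row[n_first:])) for row in data]
    exceed = [0] * len(data)
    n_perm = 0
    for group1 in combinations(range(n), n_first):
        chosen = set(group1)
        group2 = [j for j in range(n) if j not in chosen]
        # Largest |t| over all genes for this relabelling.
        max_t = max(
            abs(t_stat([row[j] for j in group1], [row[j] for j in group2]))
            for row in data
        )
        n_perm += 1
        for g, t_obs in enumerate(observed):
            if max_t >= t_obs:
                exceed[g] += 1
    return [count / n_perm for count in exceed]

# Hypothetical log-expression values: 3 genes, 3 arrays per group.
expr = [
    [8.1, 8.4, 8.2, 5.0, 5.3, 4.9],  # clear group difference
    [6.0, 6.3, 5.8, 6.1, 5.9, 6.2],  # no obvious difference
    [7.2, 6.9, 7.5, 7.0, 7.4, 7.1],  # no obvious difference
]
adjusted = max_t_adjusted(expr, 3)
print(adjusted)
```

With only 20 possible relabellings the adjusted p-values cannot fall below 1/20, which shows why very small experiments leave little room for permutation-based selection.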

[Figure: SAM plot of observed against expected scores. Significant: 249; median # false significant: 12.66389; Delta: 1.03419]


SAM

• Developed at Stanford by Tibshirani et al. (paper: Tusher et al., PNAS 98, 5116-5121). The claim is FDR control.

Plus:

1. Ease of use, add-in to Excel
2. Allows asymmetric cut-offs

Minus:

1. The distribution under the null hypotheses (‘no expression’) needs to be the same for all genes to guarantee FDR control
2. Combination with the k-fold rule: the FDR is no longer controlled

Solutions:

• Use (normal) rank scores and a simple rank statistic
• Explicitly test on k-fold expression; combine with the FDR criterion
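The point of rank scores is that, without ties, the null distribution of a rank statistic depends only on the group sizes, so it is the same for every gene. A minimal sketch using normal scores of the form quantile(rank / (n + 1)) follows. The choice of this particular score and of the sum statistic is an assumption, and the data are invented.

```python
from statistics import NormalDist

def normal_scores(values):
    """Replace each value by the normal quantile of rank / (n + 1) (no ties assumed)."""
    n = len(values)
    order = sorted(range(n), key=lambda j: values[j])
    scores = [0.0] * n
    for rank, j in enumerate(order, start=1):
        scores[j] = NormalDist().inv_cdf(rank / (n + 1))
    return scores

def rank_score_statistic(values, n_first):
    """Sum of the normal scores of the first n_first arrays (group 1)."""
    return sum(normal_scores(values)[:n_first])

# Two hypothetical genes on very different scales but with the same ordering.
gene_a = [8.1, 8.4, 8.2, 5.0, 5.3, 4.9]
gene_b = [81.0, 840.0, 82.0, 0.50, 0.53, 0.49]
stat_a = rank_score_statistic(gene_a, 3)
stat_b = rank_score_statistic(gene_b, 3)
print(stat_a, stat_b)
```

Both genes receive the same statistic because only the ordering of the arrays enters, which is what makes a common null distribution across genes plausible.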


Modelling vs Normalisation + Testing

• Modelling forces you to state what the assumptions are (linearity, normality, independence, etc.)
• Normalisation steps may not be commutative
• Non-linearities can be dealt with by normalisation methods
• Advanced modelling requires the help of a statistician/bio-informatician

Standard approach to modelling: ANOVA. The model has two levels:

1. Normalisation level, which includes linear corrections for dye and microarray effects
2. Gene expression level, which includes effects on gene level, including interactions (the interaction of interest is usually gene*variety)
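One common way to write such a two-level ANOVA model for log-intensities is sketched below. The symbols are assumptions chosen for the illustration: y is the log-intensity of gene g on array i with dye j, A, D, G and V denote array, dye, gene and variety effects, and k(i,j) is the variety hybridised on array i with dye j.

```latex
% Level 1 (normalisation): array, dye and their interaction, shared by all genes
y_{gij} = \mu + A_i + D_j + (AD)_{ij} + r_{gij}

% Level 2 (gene expression): fitted to the level-1 residuals r_{gij}
r_{gij} = G_g + (GA)_{gi} + (GD)_{gj} + (GV)_{g\,k(i,j)} + \varepsilon_{gij}
```

In this notation the gene*variety interaction (GV) carries the differences in expression between varieties for gene g.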


Software

• Freeware: SAM, Bioconductor
• Specialized commercial software: Spotfire, Genespring, Genesight, Rosetta
  Most contain: normalisation, variance stabilizing transformations, ANOVA, testing (most do not yet include the advanced multiple testing criteria)
• Statistical software: SAS, S-Plus, SPSS
  Much more debugged, long history, better documentation (it is often very unclear what the specialized packages really do)

Advantages of specialized software: user-friendly, visualisation (nice pictures), link with databases, annotation

Try several!


Bayesian models

+ Natural translation to networks (pathways)
+ Complex models (linearity is not necessary, interactions)
+ Prior biological knowledge can be included
+ Nesting of the models (image analysis + normalisation + gene expression)
+ Inference for complex functions of gene expression data is relatively easy
- No ‘easy’ software
- Computational methods may take time to find reliable estimates

[Figure: Example network]


Validation

• Cross-validation: leave some data out and see how well the left-out data values are predicted by the model. (Note that for normalisation procedures it may be harder to predict the data from the normalized data.)
• Biological validation (spikes: known concentrations). Very useful for validating the normalisation procedure or the model:

1. Pretend that spikes with equal concentrations that are used under different conditions (different dyes, microarray batches) are different quantities.
2. Estimate the ratio of the two estimates after normalisation or modelling.
3. The ratio should be approximately equal to 1.
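The three steps of the spike check can be sketched as follows, assuming that normalised spike intensities are available on the log2 scale. The function name, the values and the tolerance of 0.1 are hypothetical.

```python
from statistics import mean

def spike_ratio(log2_cond_a, log2_cond_b):
    """Ratio of estimated spike abundance under condition A versus condition B.

    Both arguments hold normalised log2 intensities of the same spike,
    which was added at equal concentration under both conditions.
    """
    return 2.0 ** (mean(log2_cond_a) - mean(log2_cond_b))

# Hypothetical normalised log2 intensities of one spike under two dyes.
cy3 = [10.02, 9.97, 10.05, 9.99]
cy5 = [10.00, 10.04, 9.96, 10.01]

ratio = spike_ratio(cy3, cy5)
acceptable = abs(ratio - 1.0) < 0.1  # tolerance chosen for illustration only
print(ratio, acceptable)
```

A ratio clearly different from 1 suggests that the normalisation or the model leaves a dye or batch effect in the data.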


Comparison and meta analysis

• Objective comparisons between methods are very much needed!
• Simulations may help (because we then know the truth). Setting up realistic simulations may be hard!
• Competition between several methods (CAMDA ’03: Lung cancer)
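A very simple simulation with known truth is sketched below. It draws one z-score per gene, selects genes with the Benjamini-Hochberg procedure, and reports the realised false discovery proportion and the power. The independent normal z-scores, the effect size and the seed are assumptions, and such a setup is far less realistic than real microarray data.

```python
import random
from statistics import NormalDist

def simulate_z_scores(n_genes, n_true, effect, rng):
    """Z-scores for n_genes genes; the first n_true have mean `effect`, the rest mean 0."""
    return [rng.gauss(effect if g < n_true else 0.0, 1.0) for g in range(n_genes)]

def select_bh(pvalues, alpha):
    """Indices selected by the Benjamini-Hochberg step-up procedure."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    k = 0
    for rank, i in enumerate(order, start=1):
        if pvalues[i] <= rank * alpha / m:
            k = rank
    return sorted(order[:k])

def evaluate(selected, n_true):
    """Realised false discovery proportion and power, using the known truth."""
    false_pos = sum(1 for g in selected if g >= n_true)
    true_pos = len(selected) - false_pos
    fdp = false_pos / len(selected) if selected else 0.0
    return fdp, true_pos / n_true

rng = random.Random(2003)  # fixed seed so that the run is reproducible
z = simulate_z_scores(1000, 50, 4.0, rng)
pvals = [2.0 * (1.0 - NormalDist().cdf(abs(x))) for x in z]  # two-sided p-values
selected = select_bh(pvals, 0.05)
fdp, power = evaluate(selected, 50)
print(len(selected), fdp, power)
```

Repeating the run over many seeds and averaging the false discovery proportion gives an estimate of the FDR of the method under this particular simulation model.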

Future goals:

• Methods that allow for combining data from several experiments.
• From relative quantities to absolute quantities.
• Absolute quantities allow for direct comparison between labs (otherwise, this is only possible if labs have used the same reference material etc.).


Useful overview papers, books

Design: Churchill, G.A. (2002) Fundamentals of experimental design for cDNA microarrays. Nature Genet. 32, 490-495.
Analysis: Slonim, D.K. (2002) From patterns to pathways: gene expression data analysis comes of age. Nature Genet. 32, 502-508.
Normalisation: Quackenbush, J. (2002) Microarray data normalization and transformation. Nature Genet. 32, 496-501.
Pitfalls: Simon, R. et al. (2003) Pitfalls in the use of DNA microarray data for diagnostic and prognostic classification. J. Natl Cancer Inst. 95, 14-18.
Books: Baldi & Hatfield (2002) DNA Microarrays and Gene Expression. Cambridge University Press.
Speed, T. (2003) Statistical Analysis of Gene Expression Microarray Data. Chapman & Hall.

Acknowledgement: Nicola Armstrong (EURANDOM)
