Microarray preprocessing and quality assessment Wolfgang Huber European Bioinformatics Institute H. Sueltmann DKFZ/MGA

Microarray preprocessing and quality assessmentWolfgang HuberEuropean Bioinformatics Institute

H. Sueltmann DKFZ/MGA

Which genes are differentially transcribed?

same-same tumor-normal

log-ratio

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16

05

10

15

log

feat

ure

inte

nsity

(a.

u.)

arrays, colour channels

Scatterplot, colored by PCR-plateTwo RZPD Unigene II filters (cDNA nylon membranes)

PCR plates

PCR plates

PCR plates: boxplots

print-tip effects

-0.8 -0.6 -0.4 -0.2 0.0 0.2

0.0

0.2

0.4

0.6

0.8

1.0

41 (a42-u07639vene.txt) by spotting pin

log(fg.green/fg.red)

F

1:11:21:31:42:12:22:32:43:13:23:33:44:14:24:34:4

q (log-ratio)

F(q

)

spotting pin quality decline

after delivery of 3x105 spots

after delivery of 5x105 spots

H. Sueltmann DKFZ/MGA

spatial effectsspatial effects

R Rb R-Rbcolor scale by rank

spotted cDNA arrays, Stanford-type

another array:

print-tip

color scale ~ log(G)

color scale ~ rank(G)

10 20 30 40 50 60

1020

3040

5060

1:nrhyb

1:nr

hyb

1 2 3 4 5 6 7 8 910111213141516171823242526272829303132333435363738737475767778798081828384858687888990919293949596979899100

0.6

0.8

1.0

1.2

1.4

1.6

1.8

Batches: array to array differences dij = mediank |hik -hjk|

arrays i=1…63; roughly sorted by time

A complex measurement process lies between mRNA concentrations and intensities

o tissue contamination

o clone identification and mapping

o image segmentation

o RNA degradation

o PCR yield, contamination

o signal quantification

o amplification efficiency

o spotting efficiency

o ‘background’ correction

o reverse transcription efficiency

o DNA-support binding

o hybridization efficiency and specificity

o other array manufacturing-related issues

The problem is less that these steps are ‘not perfect’; it is that they vary from array to array, experiment to experiment.

Statistics 101:

bias accuracy

p

reci

sio

n

v

aria

nce

Basic dogma of data analysis

Can always increasesensitivity on the cost of specificity,

or vice versa,

the art is to - optimize both- then find the best trade-off.

X

X

X

X

X

X

X

X

X

ratios and fold changes

Fold changes are useful to describe continuous changes in expression

1000

1500

3000

x3

x1.5

A B C

0

200

3000

?

?

A B C

But what if the gene is “off” (below detection limit) in one condition?

ratios and fold changes

The idea of the log-ratio (base 2)0: no change

+1: up by factor of 21 = 2 +2: up by factor of 22 = 4 -1: down by factor of 2-1 = 1/2 -2: down by factor of 2-2 = ¼

What about a change from 0 to 500?- conceptually- noise, measurement precision

A unit for measuring changes in expression: assumes that a change from 1000 to 2000 units has a similar biological meaning to one from 5000 to 10000.

ratio compression

Yue et al., (Incyte

Genomics) NAR (2001)

29 e41

How to compare microarray intensities with each other?

How to address measurement uncertainty (“variance”)?

How to calibrate (“normalize”) for biases between samples?

Sources of variationamount of RNA in the biopsy efficiencies of-RNA extraction-reverse transcription -labeling-fluorescent detection

probe purity and length distributionspotting efficiency, spot sizecross-/unspecific hybridizationstray signal

Calibration Error model

Systematic o similar effect on many measurementso corrections can be estimated from data

Stochastic

o too random to be ex-plicitely accounted for o remain as “noise”

iik ika a

ai per-sample offset

ik additive noise

bi per-sample normalization factor

bk sequence-wise probe efficiency

ik multiplicative noise

exp( )iik k ikb b b

ik ik ik ky a b x

The two component model

measured intensity = offset + gain true abundance

The two-component model

raw scale log scale

“additive” noise

“multiplicative” noise

B. Durbin, D. Rocke, JCB 2001

Parameterization

(1 )

y a b

y a

x

xb e

two practically equivalent forms

(<<1)

a systematic background

same for all probes (per array x color)

per array x color x print-tip group

random background

iid in whole experiment

iid per array

b systematic gain factor

per array x color per array x color x print-tip group

random gain fluctuations

iid in whole experiment

iid per array

Important issues for model fitting

Parameterization (model complexity)variance vs bias

"Heteroskedasticity" (unequal variances) weighted regression or variance stabilizing

transformation

Outliers use a robust method

AlgorithmIf likelihood is not quadratic, need non-linear

optimization. Local minima / concavity of likelihood?

Models are never correct, but some are useful

True relationship:

1 2 22

(0, 0.15 )y x x N

Model: linear dependence Model: quadratic dependence

variance stabilizing transformations

Xu a family of random variables with EXu=u,

VarXu=v(u). Define

var f(Xu ) independent of u

1( )

v( )

x

f x duu

derivation: linear approximation

0 20000 40000 60000

8.0

8.5

9.0

9.5

10

.01

1.0

raw scale

tra

nsf

orm

ed

sca

le


f(x)

x


1( )

v( )

x

f x duu

1.) constant variance (‘additive’)

2( ) sv u f u

2.) constant CV (‘multiplicative’) 2( ) logv u u f u

4.) additive and multiplicative

2 2 00( ) ( ) arsinh

u uv u u u s f

s

3.) offset2

0 0( ) ( ) log( )v u u u f u u

the “glog” transformation

intensity-200 0 200 400 600 800 1000

- - - f(x) = log(x)

——— hs(x) = asinh(x/s)

2arsinh( ) log 1

arsinh log log2 0limx

x x x

x x

P. Munson, 2001

D. Rocke & B. Durbin, ISMB 2002

W. Huber et al., ISMB 2002

raw scale log glog

difference

log-ratio

generalized

log-ratio

constant partvariance:

proportional part

glog

ii k i k i ka a L a i p e r - s a m p l e o ff s e t

L i k l o c a l b a c k g r o u n d p r o v i d e d b y i m a g e a n a l y s i s

i k ~ N ( 0 , b i2 s 1

2 )

“ a d d i t i v e n o i s e ”

b i p e r - s a m p l en o r m a l i z a t i o n f a c t o r

b k s e q u e n c e - w i s el a b e l i n g e ffi c i e n c y

i k ~ N ( 0 , s 22 )

“ m u l t i p l i c a t i v e n o i s e ”

e x p ( )ii k k i kb b b

i k i k i k i ky a b x

m e a s u r e d i n t e n s i t y = o ff s e t + g a i n * t r u e a b u n d a n c e

parameter estimation (vsn package)

2Yarsinh , (0, )iki

k ki kii

aN c

b

:

o maximum likelihood estimator: straightforward – but sensitive to outliers

o model is for genes that are unchanged; differentially transcribed genes act as outliers.

o robust variant of ML estimator, à la Least Trimmed Sum of Squares regression.

o works well as long many genes are not differentially transcribed (<50% throughout the intensity range)

“usual” log-ratio

'glog' (generalized log-ratio)

1

2

2 21 1 1

2 22 2 2

log

log

xx

x x c

x x c

c1, c2 are experiment specific parameters (~level of background noise)

Variance Bias Trade-Off

Est

imat

ed l

og

-fo

ld-c

han

ge

Signal intensity

logglog

Variance-bias trade-off and shrinkage estimators

Shrinkage estimators:a general technology in statistics:pay a small price in bias for a large decrease of variance, so overall the mean-squared-error (MSE) is reduced.

Particularly useful if you have few replicates.

Generalized log-ratio is a shrinkage estimator for fold change

“Single color normalization”

n red-green arrays (R1, G1, R2, G2,… Rn, Gn)

within/between slidesfor each slide i=1…n

calculate Mi= log(Ri/Gi), Ai= ½ log(Ri*Gi)normalize Mi vs Ai

Then normalize M1…Mn

all at oncenormalize the combined matrix (R, G)

then calculate log-ratios or any other contrast you like

What about non-linear effects?

o Good data operate in the linear regime, where fluorescence intensity increases proportionally to target abundance (see e.g. Affymetrix dilution series)

Two reasons for non-linearity:

o At the high intensity end: saturation/quenching. This can and should be avoided experimentally - loss of data!

o At the low intensity end: background offsets

0 20 40 60 80 100

02

04

06

08

01

00

original scale

x

y

1 2 5 10 20 50 1002

05

01

00

log/log scale

x

y

10 1y x

Non-linear or affine linear?

Response curveLockhart et. al. Nature Biotechnology 14 (1996)

Gene expression matters

Probe set summaries for Affymetrix expression analysis

genechips

Probe set summarization - data and notation

PMijg , MMijg = Intensities for perfect match and mismatch probe j for gene g in chip i

i = 1,…, n one to hundreds of chips

j = 1,…, J usually 11 or 16 probe pairs

g = 1,…, G 6…30,000 probe sets.

Tasks: calibrate (normalize) the measurements from different

chips (samples)summarize for each probe set the probe level data, i.e.,

16 PM and MM pairs, into a single expression measure.

compare between chips (samples) for detecting differential expression.

Expression measures: MAS 4.0

Affymetrix GeneChip MAS 4.0 software used AvDiff, a trimmed mean:

o sort dj = PMj -MMj o exclude highest and lowest valueo J := those pairs within 3 standard

deviations of the average

1( )

# j jj J

AvDiff PM MMJ

Expression measures MAS 5.0

Instead of MM, use "repaired" version CTCT= MM if MM<PM

= PM / "typical log-ratio" if MM>=PM

"Signal" =

Tukey.Biweight (log(PM-CT))

(… median)

Tukey Biweight: B(x) = (1 – (x/c)^2)^2 if |x|<c, 0 otherwise

Expression measures: Li & Wong

dChip fits a model for each gene

where

– i: expression index for gene i

– j: probe sensitivity

Maximum likelihood estimate of MBEI is used as expression measure of the gene in chip i.

Need at least 10 or 20 chips.

Current version works with PMs only.

2, (0, )ij ij i j ij ijPM MM N

Expression measures RMA: Irizarry et al. (2002)

o Estimate one global background value b=mode(MM). No probe-specific background!

o Assume: PM = strue + b

Estimate s0 from PM and b as a conditional expectation E[strue|PM, b].

o Use log2(s).

o Nonparametric nonlinear calibration ('quantile normalization') across a set of chips.

RMA

Additive model for probe effects Pj and expression value Ei

log2Sij = Ei + Pj + εij

Estimate Ei using robust procedure

0 200 400 600

0.00

00.

010

X=B+S

X

Den

sity

B

SX

-2 -1 0 1 2 3

-4-2

0

x

f(x)

log2xglog2x

0 50 100 200

-50

050

150

RMA background correction

xS

0 50 100 200

-50

050

150

VSN background correction

x

Svs

n0 50 100 200

-2-1

01

2

log2

x

log 2

Srm

a

0 50 100 200

-2-1

01

2

glog2

x

glog

2Svs

n

bioc/Courses/bioc_R_intro/vsn_vs_bgcorrect.R

ArrayQualityMetrics

R package by Audrey Kauffmann (EBI)

Collaboration with Alvis Brazma, Misha Kapushesky (ArrayExpress)

EU Project "Emerald" http://www.microarray-quality.org

A probe effect normalisation for tiling arrays

Huber et al., Bioinformatics 2006

Genechip S. cerevisiae Tiling Array

4 bp tiling path over complete genome(12 M basepairs, 16 chromosomes)

Sense and Antisense strands6.5 Mio oligonucleotides 5 m feature size

manufactured by Affymetrixdesigned by Lars Steinmetz (EMBL & Stanford Genome Center)

RNA Hybridization

Before normalization

Probe specific response normali-zation

2log ii

i

yq

s

2

( )glog i i

ii

y b sq

s

2log iy

2log is

remove ‘dead’ probes

2glog

i ii

i

PM MMq

s

S/N

3.22

3.47

4.04

4.58

4.36

Probe-specific response normalization

si probe specific response factor. Estimate taken from DNA hybridization data

bi =b(si ) probe specific background term. Estimation: for strata of probes with similar si, estimate b through location estimator of distribution of intergenic probes, then interpolate to obtain continuous b(s)

2

( )glog i i

ii

y b sq

s

Estimation of b: joint distribution of (DNA, RNA) values of intergenic PM probes

log2 DNA intensity

log

2 R

NA

in

ten

sity unannotated

transcripts

background

b(s)

After normalization

Cross-Hybridization

One or several probes from a probeset hybridize with intended transcripts.

Reasons:

sequence similarity (gene families)

errors in gene annotation (e.g. mis-assignment of exons)

pure chance

Casneuf et al., Bioinformatics 2006

Affymetrix CDF

Custom CDF

without sequence-similar genes (E<10-10 )

all genes

How to disentangle correlation

because of functional relatedness and

because of cross-hybridization?

Meta-correlation

for probe set pair X and Y, the Spearman correlation between

alignment scores of X’s reporters to Y's target transcript

and

the signal correlation coefficient of X's reporters to Y's probe set expression value.

Significant positive metacorrelation in cases of off-target sequence similarity

Metacorrelation examplesProbe set

summariesX' reporters off-target alignment vs

signal correlationX: extensin-like family protein

Y: belongs to a zinc finger family.

0.63

0.89

0.70X, Y: functions unknown

0.62

X: protein kinase

Y: ATP citrate lyase

neighbours

0.3

0.55

Caveat RMA

Simulated periodic probeset

White noise

9 of B,2 of A

RMA (median polish) summariesAB=-0.07AC=0.73

Histogram of x

x

Fre

qu

en

cy

-2 0 2 4 6

0.0

1.0

2.0

3.0

Histogram of x

x

Fre

qu

en

cy

-2 0 2 4 6

0.0

1.0

2.0

3.0

set.seed(0xdada)a=c(rnorm(6), rnorm(5, mean=4))b=a; b[6]=b[7]myhist=function(x){ hist(x, col="orange", breaks=seq(-3,6,by=0.8),ylim=c(0,3)) abline(v=median(x), col="black",lwd=3)}par(mfrow=c(2,1))myhist(a)myhist(b)

Median is robust, but can be sensitive to small wiggles in the data

References

Bioinformatics and computational biology solutions using R and Bioconductor, R. Gentleman, V. Carey, W. Huber, R. Irizarry, S. Dudoit, Springer (2005).

Variance stabilization applied to microarray data calibration and to the quantification of differential expression. W. Huber, A. von Heydebreck, H. Sültmann, A. Poustka, M. Vingron. Bioinformatics 18 suppl. 1 (2002), S96-S104.

Exploration, Normalization, and Summaries of High Density Oligonucleotide Array Probe Level Data. R. Irizarry, B. Hobbs, F. Collins, …, T. Speed. Biostatistics 4 (2003) 249-264.

Error models for microarray intensities. W. Huber, A. von Heydebreck, and M. Vingron. Encyclopedia of Genomics, Proteomics and Bioinformatics. John Wiley & sons (2005).

Differential Expression with the Bioconductor Project. A. von Heydebreck, W. Huber, and R. Gentleman. Encyclopedia of Genomics, Proteomics and Bioinformatics. John Wiley & sons (2005).

Acknowledgements

Anja von Heydebreck (Darmstadt)Robert Gentleman (Seattle)Günther Sawitzki (Heidelberg)Martin Vingron (Berlin)Annemarie Poustka, Holger Sültmann, Andreas

Buness, Markus Ruschhaupt (Heidelberg)Rafael Irizarry (Baltimore)Judith Boer (Leiden) Anke Schroth (Heidelberg)Friederike Wilmer (Hilden)Lars Steinmetz (Heidelberg)Tineke Casneuf (Gent)

This new EMBL initiative promotes cross-disciplinary research. EIPODs are supported by at least two labs at the five EMBL sites in Heidelberg and Hamburg (Germany), Grenoble (France), Hinxton (UK) and Monterotondo (Italy). EIPOD projects connect scientific fields that are usually separate, or transfer techniques to a novel context.

For a list of possible projects and further information please visit: www.embl.org/eipod

You are also encouraged to propose your own interdisciplinary project.

Online application until 31st August 2007

2007 Call for Applications

EMBL

EMBL Interdisciplinary Postdocs - EIPOD

Documents

Microarray preprocessing and quality assessment Wolfgang Huber European Bioinformatics Institute H. Sueltmann DKFZ/MGA