Wolfgang Huber European Bioinformatics Institutemarray.economia.unimi.it/2005/material/L4.pdf · x xxc xx c c 1, c 2 are experiment specific parameters (~level of background noise)

Wolfgang HuberEuropean Bioinformatics Institute

Quality control and normalization

Acknowledgements

Anja von Heydebreck (Darmstadt)Robert Gentleman (Seattle)Günther Sawitzki (Heidelberg)Martin Vingron (Berlin)Annemarie Poustka, Holger Sültmann, Andreas

Buness, Markus Ruschhaupt (Heidelberg)Rafael Irizarry (Baltimore)Judith Boer (Leiden) Anke Schroth (Heidelberg)Friederike Wilmer (Hilden)

Which genes are differentially transcribed?

same-same tumor-normal

log-ratio

Statistics 101:

←bias accuracy→

←pr

ecis

ion

varia

nce→

Basic dogma of data analysis:

Can always increase sensitivity on the cost of specificity,

or vice versa,

the art is to find the best trade-off.

X

X

X

X

X

X

X

X

X

(It can also be possible to increase both by better choice of method / model)

ratios and fold changes

Fold changes are useful to describe continuous changes in expression

10001500

3000x3

x1.5

A B C

0200

3000?

?

A B C

But what if the gene is “off” (below detection limit) in one condition?

ratios and fold changesThe idea of the log-ratio (base 2)

0: no change+1: up by factor of 21 = 2+2: up by factor of 22 = 4-1: down by factor of 2-1 = 1/2-2: down by factor of 2-2 = ¼



A unit for measuring changes in expression: assumes that a change from 1000 to 2000 units has a similar biological meaning to one from 5000 to 10000.



A unit for measuring changes in expression: assumes that a change from 1000 to 2000 units has a similar biological meaning to one from 5000 to 10000.

What about a change from 0 to 500?- conceptually- noise, measurement precision

A complex measurement process lies between mRNA concentrations and intensities

o other array manufacturing-related issues

o hybridization efficiency and specificity

o DNA-support binding

o reverse transcription efficiency

o ‘background’ correction

o spotting efficiency

o amplification efficiency

o signal quantification

o PCR yield, contamination

o RNA degradation

o image segmentation

o clone identification and mapping

o tissue contamination

A complex measurement process lies between mRNA concentrations and intensities

o other array manufacturing-related issues

o hybridization efficiency and specificity

o DNA-support binding

o reverse transcription efficiency

o ‘background’ correction

o spotting efficiency

o amplification efficiency

o signal quantification

o PCR yield, contamination

o RNA degradation

o image segmentation

o clone identification and mapping

o tissue contamination

The problem is less that these steps are ‘not perfect’; it is that they vary from array to array, experiment to experiment.

♦ How to compare microarrayintensities with each other?

♦ How to address measurement uncertainty (“variance”)?

♦ How to calibrate (“normalize”) for biases between samples?

Questions

Sources of variationamount of RNA in the biopsy efficiencies of-RNA extraction-reverse transcription -labeling-fluorescent detection

probe purity and length distribution

spotting efficiency, spot sizecross-/unspecific hybridizationstray signal




Systematico similar effect on many measurementso corrections can be estimated from data





Calibration





Calibration

Stochastico too random to be ex-plicitely accounted for o remain as “noise”





Calibration

Stochastico too random to be ex-plicitely accounted for o remain as “noise”

Error model

Error models

describe the possible outcomes of a set of measurements

Outcomes depend on:-true value of the measured quantity (abundances of specific molecules in biological sample)

-measurement apparatus (cascade of biochemical reactions, optical detection system with laser scanner or CCD camera)

Error models

Purpose:

1. Data compression: summary statistic instead of full empirical distribution

2. Quality control

3. Statistical inference: appropriate parametric methods have better power than non-parametric (this has practical, financial, and ethical aspects)

ε= +iik ika aai per-sample offset

εik ~ N(0, bi2s1

2)“additive noise”

bi per-samplenormalization factor

bk sequence-wiseprobe efficiency

ηik ~ N(0,s22)

“multiplicative noise”

exp( )iik k ikb b b η=

ik ik ik ky a b x= +

The two component model

measured intensity = offset + gain × true abundance

The two-component model

raw scale log scaleB. Durbin, D. Rocke, JCB 2001

The two-component model

raw scale log scaleB. Durbin, D. Rocke, JCB 2001

“additive” noise

“multiplicative” noise

Parameterization

(1 )y a b xy a b x eη

ε η

ε

= + + ⋅ ⋅ +

= + + ⋅ ⋅

two practically equivalent forms

(η<<1)

iid per arrayiid in whole experiment

η random gain fluctuations

per array x color x print-tip group

per array x colorb systematic gain factor

iid per arrayiid in whole experiment

ε random background

per array x color x print-tip group

same for all probes (per array x color)

a systematic background

variance stabilizing transformations

Xu a family of random variables with EXu=u, VarXu=v(u). Define

⇒ var f(Xu ) ≈ independent of u

1( )v( )

x

f x duu

= ∫

derivation: linear approximation

0 20000 40000 60000

8.0

8.5

9.0

9.5

10.0

11.0

raw scale

trans

form

ed s

cale

variance stabilizing transformations

f(x)

x

variance stabilizing transformations1( )

v ( )

x

f x d uu

= ∫1.) constant variance (‘additive’) 2( ) sv u f u= ⇒ ∝

2.) constant CV (‘multiplicative’) 2( ) logv u u f u∝ ⇒ ∝

4.) additive and multiplicative

2 2 00( ) ( ) arsinh u uv u u u s f

s+

∝ + + ⇒ ∝

3.) offset 20 0( ) ( ) log( )v u u u f u u∝ + ⇒ ∝ +

the “glog” transformation

intensity-200 0 200 400 600 800 1000

- - - f(x) = log(x)

——— hs(x) = asinh(x/s)

( )( )

2arsinh( ) log 1

arsinh log log 2 0limx

x x x

x x→∞

= + +

− − =

P. Munson, 2001

D. Rocke & B. Durbin, ISMB 2002

W. Huber et al., ISMB 2002

raw scale log glog

difference

log-ratio

generalized

log-ratio

glog

raw scale log glog

difference

log-ratio

generalized

log-ratio

glog

constant partvariance:

proportional part

the transformed model

2

Yarsinh

(0, )

sikik ki

si

ki

ab

N c

µ ε

ε

−= +

∼

i: arrays k: probess: probe strata (e.g. print-tip, region)

“usual” log-ratio

'glog' (generalized log-ratio)

+ +

+ +

1

2

2 21 1 1

2 22 2 2

log

log

xx

x x cx x c

c1, c2 are experiment specific parameters (~level of background noise)

Variance Bias Trade-Off

Estimat

ed log

-fold-

chan

ge

Signal intensity

logglog

Variance-bias trade-off and shrinkage estimators

Shrinkage estimators:pay a small price in bias for a large decrease of variance, so overall the mean-squared-error (MSE) is reduced.

Particularly useful if you have few replicates.

Generalized log-ratio: = a shrinkage estimator for fold change

There are many possible choices, we chose “variance-stabilization”:+ interpretable even in cases where genes are off in some conditions+ can subsequently use standard statistical methods (hypothesis testing, ANOVA, clustering, classification…) without the worries about low-level variability that are often warranted on the log-scale

evaluation: effects of different data transformationsdiff

eren

ce r

ed-g

reen

rank(average)

Normality: QQ-plot

“Single color normalization”

n red-green arrays (R1, G1, R2, G2,… Rn, Gn)

within/between slidesfor (i=1:n)

calculate Mi= log(Ri/Gi), Ai= ½ log(Ri*Gi)normalize Mi vs Ai

normalize M1…Mn

all at oncenormalize the matrix of (R, G)then calculate log-ratios or any other

contrast you like

What about non-linear effectso Microarrays can be operated in a linear regime, where fluorescence intensity increases proportionally to target abundance (see e.g. Affymetrix dilution series)

Two reasons for non-linearity:

o At the high intensity end: saturation/quenching. This can and should be avoided experimentally - loss of data!

o At the low intensity end: background offsets, instead of y=k·x we have y=k·x+x0, and in the log-log plot this can look curvilinear. But this is an affine-linear effect and can be correct by affine normalization. Non-parametric methods (e.g. loess) risk overfitting and loss of power.

Non-linear or affine linear?

Definitions

linear affine linear genuinely non-linear

How to compare and assess different ‘normalization’ methods?

Normalization :=1. correction for systematic experimental biases2. provision of expression values that can subsequently be used for testing, clustering, classification, modelling…3. provision of a measure of measurement uncertainty

Quality trade-off: the better the measurements, the less need for normalization. Need for “too much” normalization relates to a quality problem.

Variance-Bias trade-off: how do you weigh measurements that have low signal-noise ratio?- just use anyway- ignore- shrink


Aesthetic criteriaLogarithm is more beautiful than arsinh

Practical criteraIt takes forever to run method XX. Referees will only accept my paper if it uses the original MAS5.

Silly criteriaThe best method is that that makes all my scatterplotslook like straight, slim cigars

Physical criteriaNormalization calculations should be based on physical/chemical model

Economical/political criteriaLife would be so much easier if everybody were just using the same method, who cares which one


Comparison against a ground truthBut you have millions of numbers – need to choose the metric that measures deviation from truth.FN/FP: do you find all the differentially expressed genes, and do you not find non-d.e. genes?qualitative/quantitative: how well do you estimate abundance, fold-change?

Spike-In and Dilution series… great, but how representative are they of other data?

Implicitely, from resampling / cross-validating with the actual experiment of interest… but isn’t that too much like Münchhausen’s bootstrap?

evaluation: a benchmark for Affymetrixgenechip expression measures

o Data:Spike-in series: from Affymetrix 59 x HGU95A, 16 genes, 14 concentrations, complex backgroundDilution series: from GeneLogic 60 x HGU95Av2,liver & CNS cRNA in different proportions and amounts

o Benchmark:15 quality measures regarding-reproducibility-sensitivity-specificityPut together by Rafael Irizarry (Johns Hopkins)http://affycomp.biostat.jhsph.edu

Quality assessment and control:

an overview over some diagnostic plots and

common artifacts

Scatterplot, colored by PCR-plateTwo RZPD Unigene II filters (cDNA nylon membranes)

PCR plates

PCR platesPCR plates

PCR plates: boxplotsPCR plates: boxplots

array batchesarray batches

print-tip effectsprint-tip effects

-0.8 -0.6 -0.4 -0.2 0.0 0.2

0.0

0.2

0.4

0.6

0.8

1.0

41 (a42-u07639vene.txt) by spotting pin

log(fg.green/fg.red)

F̂

1:11:21:31:42:12:22:32:43:13:23:33:44:14:24:34:4

q (log-ratio)

F(q)

spotting pin quality declinespotting pin quality decline

after delivery of 3x105 spots

after delivery of 5x105 spots

H. Sueltmann DKFZ/MGA

spatial effectsspatial effects

R Rb R-Rbcolor scale by rank

spotted cDNA arrays, Stanford-type

another array:

print-tip

color scale ~ log(G)

color scale ~ rank(G)

10 20 30 40 50 60

1020

3040

5060

1:nrhyb

1:nr

hyb

1 2 3 4 5 6 7 8 910111213141516171823242526272829303132333435363738737475767778798081828384858687888990919293949596979899100

0.6

0.8

1.0

1.2

1.4

1.6

1.8

Batches: array to array differences dij = madk(hik -hjk)

arrays i=1…63; roughly sorted by time

Density representation of the scatterplot(76,000 clones, RZPD Unigene-II filters)

See: package hexbin; also, smoothscatter in package prada

Oligonucleotide chips

Affymetrix files

Main software from Affymetrix: MAS - MicroArray Suite.

DAT file: Image file, ~108 pixels, ~200 MB.CEL file: probe intensities, ~106 numbersCDF file: Chip Description File. Describes

which probes go in which probe sets (genes, gene fragments, ESTs).

1LQ file: Probe sequences and intended targets in the transcriptome

Image analysis

DAT image files CEL filesEach probe cell: 10x10 pixels.Gridding: estimate location of probe cell centers.Signal:

– Remove outer 36 pixels 8x8 pixels.– The probe cell signal, PM or MM, is the 75th

percentile of the 8x8 pixel values.Background: Average of the lowest 2% probe cells

is taken as the background value and subtracted.

Compute also quality values.

Data and notationPMijg , MMijg = Intensities for perfect match and

mismatch probe j for gene g in chip ii = 1,…, n one to hundreds of chipsj = 1,…, J usually 11 or 16 probe pairsg = 1,…, G 6…30,000 probe sets.

Tasks:calibrate (normalize) the measurements from different

chips (samples)summarize for each probe set the probe level data, i.e.,

11 PM and MM pairs, into a single expression measure.

compare between chips (samples) for detecting differential expression.

expression measures: MAS 4.0

expression measures: MAS 4.0

Affymetrix GeneChip MAS 4.0 software uses AvDiff, a trimmed mean:

o sort dj = PMj -MMjo exclude highest and lowest valueo J := those pairs within 3 standard

deviations of the average

1 ( )# j j

j JAvDiff PM MM

J ∈

= −∑

Expression measures MAS 5.0

Expression measures MAS 5.0

Instead of MM, use "repaired" version CTCT= MM if MM<PM

= PM / "typical log-ratio" if MM>=PM

"Signal" =Tukey.Biweight (log(PM-CT))

(… ≈median)

Tukey Biweight: B(x) = (1 – (x/c)^2)^2 if |x|<c, 0 otherwise

Expression measures: Li & Wong

Expression measures: Li & Wong

dChip fits a model for each gene

where– θi: expression index for gene i– φj: probe sensitivity

Maximum likelihood estimate of MBEI is used as expression measure of the gene in chip i.

Need at least 10 or 20 chips.

Current version works with PMs only.

2, (0, )ij ij i j ij ijPM MM Nθ φ ε ε σ− = + ∝

Expression measures RMA: Irizarry et al. (2002)


o Estimate one global background value b=mode(MM). No probe-specific background!

o Assume: PM = strue + bEstimate s≥0 from PM and b as a conditional expectation E[strue|PM, b].

o Use log2(s).o Nonparametric nonlinear calibration

('quantile normalization') across a set of chips.

AvDiff-like

with A a set of “suitable” pairs.

Li-Wong-like: additive model

Estimate RMA = ai for chip i using robust method median polish (successively remove row and column medians, accumulate terms, until convergence). Works with d>=2

21RMA log ( )j j

j APM BG

∈

= −Α ∑



2log ( )ij i j ijPM BG a b ε− = + +

Affymetrix: IPM = IMM + Ispecific ?

log(PM/MM)0From: R. Irizarry et al.,

Biostatistics 2002

Sequence-dependent preprocessing

i

25

1log log ( )i i

iY x w s ε

=

= + +∑

wi

position- and sequence-specific effects wi(s):Naef et al., Phys Rev E 68 (2003)

Software for pre-processing Affymetrix data

• Bioconductor R package affy.• Background estimation.• Probe-level normalization.• Expression measures• Two main functions: ReadAffy, expresso.• See also: gcrma, tilingArray, vsn.

References

Bioinformatics and computational biology solutions using R and Bioconductor, R. Gentleman, V. Carey, W. Huber, R. Irizarry, S. Dudoit, Springer (2005).

Variance stabilization applied to microarray data calibration and to thequantification of differential expression. W. Huber, A. von Heydebreck, H. Sültmann, A. Poustka, M. Vingron. Bioinformatics 18 suppl. 1 (2002), S96-S104.

Exploration, Normalization, and Summaries of High Density OligonucleotideArray Probe Level Data. R. Irizarry, B. Hobbs, F. Collins, …, T. Speed. Biostatistics 4 (2003) 249-264.

Error models for microarray intensities. W. Huber, A. von Heydebreck, and M. Vingron. Encyclopedia of Genomics, Proteomics and Bioinformatics. John Wiley & sons (2005).

Differential Expression with the Bioconductor Project. A. von Heydebreck, W. Huber, and R. Gentleman. Encyclopedia of Genomics, Proteomics and Bioinformatics. John Wiley & sons (2005).

Documents

Wolfgang Huber European Bioinformatics Institutemarray.economia.unimi.it/2005/material/L4.pdf · x xxc xx c c 1, c 2 are experiment specific parameters (~level of background noise)