37
Normalization of microarray data Anja von Heydebreck Dept. Computational Molecular Biology, MPI for Molecular Genetics, Berlin

Normalization of microarray data Anja von Heydebreck Dept. Computational Molecular Biology, MPI for Molecular Genetics, Berlin

  • View
    217

  • Download
    2

Embed Size (px)

Citation preview

Page 1: Normalization of microarray data Anja von Heydebreck Dept. Computational Molecular Biology, MPI for Molecular Genetics, Berlin

Normalization of microarray data

Anja von Heydebreck

Dept. Computational Molecular Biology, MPI for Molecular

Genetics, Berlin

Page 2: Normalization of microarray data Anja von Heydebreck Dept. Computational Molecular Biology, MPI for Molecular Genetics, Berlin

Systematic differences between arrays

The boxplots show distributions of log-ratios from 4 red-green 8448-clone cDNA arrays hybridised with zebrafish samples.

Some are not centered at 0, and they are different from each other.

Page 3: Normalization of microarray data Anja von Heydebreck Dept. Computational Molecular Biology, MPI for Molecular Genetics, Berlin

Experimental variation

Experimental variation

amount of RNA in the sample efficiencies of-RNA extraction-reverse transcription -labeling-photodetection

Normalization

Systematic o similar effect on many measurementso corrections can be estimated from data

Normalization:

Correction of systematic effectsarising from variations in the experimental process

Page 4: Normalization of microarray data Anja von Heydebreck Dept. Computational Molecular Biology, MPI for Molecular Genetics, Berlin

Ad-hoc normalization procedures

• 2-color cDNA-arrays: multiply all intensities of one channel with a constant such that the median of log-ratios is 0 (equivalent: shift log-ratios). Underlying assumption: equally many up- and downregulated genes.

• One-color arrays (Affy, radioactive): multiply intensities from each array k with a constant ck, such that some measure of location of the intensity distributions is the same for all arrays (e.g. the trimmed mean (Affy global scaling)).

Page 5: Normalization of microarray data Anja von Heydebreck Dept. Computational Molecular Biology, MPI for Molecular Genetics, Berlin

log-log plot of intensities from the two channels of a microarray

comparison of kidney cancer with normal kidney tissue,cDNA microarray with 8704 spots

• red line: mediannormalization• blue lines: two-fold change

Page 6: Normalization of microarray data Anja von Heydebreck Dept. Computational Molecular Biology, MPI for Molecular Genetics, Berlin

Assumptions for normalization

• When we normalize based on the observed data, we assume that the majority of genes are unchanged, or that there is symmetry between up- and downregulation.

• In some cases, this may not be true. Alternative: use (spiked) controls and base normalization on them.

Page 7: Normalization of microarray data Anja von Heydebreck Dept. Computational Molecular Biology, MPI for Molecular Genetics, Berlin

1. Loess normalization

• M-A plot (minus vs. add): log(R) – log(G) =log(R/G)

vs. log (R) + log(G)=log(RG)

• With 2-color-cDNA arrays, often “banana-shaped” scatterplots on the log-scale are observed.

Page 8: Normalization of microarray data Anja von Heydebreck Dept. Computational Molecular Biology, MPI for Molecular Genetics, Berlin

Loess normalization

• Intensity-dependent trends are modeled bya regression curve,M = f(A) + .

• The normalizedlog-ratios are computed as the residuals of the loess regression.

zebrafish data

Page 9: Normalization of microarray data Anja von Heydebreck Dept. Computational Molecular Biology, MPI for Molecular Genetics, Berlin

Loess regression• Locally weighted regression. • For each value xi of X, a linear or polynomial

regression function fi for Y is fitted based on the data points close to xi. They are weighted according to their distance to xi.

• Local model: Y = fi(X) + • Fit: Minimize the weighted sum of squares

wj (xj)(yj - fi(xj))2

• Then, compute the overall regression as:

Y = f(X) + , where f(xi) = fi(xi).

Page 10: Normalization of microarray data Anja von Heydebreck Dept. Computational Molecular Biology, MPI for Molecular Genetics, Berlin

Loess regression

x0tricubic weight function

regression lines for each data point

The user-defined width c of the weight function determines the degree of smoothing.

3 300

| |( ) [1 ( ) ] ,| |

x xw x x x c

c

Page 11: Normalization of microarray data Anja von Heydebreck Dept. Computational Molecular Biology, MPI for Molecular Genetics, Berlin

Print-tip normalization

• With spotted arrays, distributions of intensities or log-ratios may be different for spots spotted with different pins, or from different PCR plates.

• Normalize the data from each (e.g. print-tip) group separately.

Page 12: Normalization of microarray data Anja von Heydebreck Dept. Computational Molecular Biology, MPI for Molecular Genetics, Berlin

4x4 or 8x4 sectors

17...38 rows and columns per sector

ca. 4600…46000probes/array

sector: corresponds to one print-tip

Print-tips correspond to localization of spotsSlide: 25x75 mmSpot-to-spot: ca. 150-350

m

Page 13: Normalization of microarray data Anja von Heydebreck Dept. Computational Molecular Biology, MPI for Molecular Genetics, Berlin

Print-tip loess normalization

Page 14: Normalization of microarray data Anja von Heydebreck Dept. Computational Molecular Biology, MPI for Molecular Genetics, Berlin

2. Error models, variance stabilization and robust

normalization

Page 15: Normalization of microarray data Anja von Heydebreck Dept. Computational Molecular Biology, MPI for Molecular Genetics, Berlin

Sources of variationSources of variationamount of RNA in the sample efficiencies of-RNA extraction-reverse transcription -labeling-photodetection

PCR yieldDNA qualityspotting efficiency, spot sizecross-/unspecific hybridizationstray signal

Normalization Error model

Systematic o similar effect on many measurementso corrections can be estimated from data

Stochastico too random to be ex-plicitely accounted for o “noise”

Page 16: Normalization of microarray data Anja von Heydebreck Dept. Computational Molecular Biology, MPI for Molecular Genetics, Berlin

A model for measurement error

k kY e

Rocke and Durbin (J. Comput. Biol. 2001):

Yk: measured intensity of gene kk: true expression level of gene k: offset:multiplicative/additive error terms,independent normal

For large expression level k, the multiplicative error is dominant. For k near zero, the additive error is dominant.

Page 17: Normalization of microarray data Anja von Heydebreck Dept. Computational Molecular Biology, MPI for Molecular Genetics, Berlin

A parametric form for the variance-mean dependence

The model of Durbin and Rocke yields:

2 2 2

E( )

Var( ) ,

k k i k

k k k

u Y m

v Y s s

Thus we obtain a quadratic dependence

2 21 2 3( ) ( ) .k k kv v u c u c c

2

2

, : mean/variance of ,

: variance of

m s e

s

Page 18: Normalization of microarray data Anja von Heydebreck Dept. Computational Molecular Biology, MPI for Molecular Genetics, Berlin

data (cDNA slide)

Quadratic variance-vs-mean dependence

For each spot k, the variance (Rk – Gk)² is plotted againstthe mean (Rk + Gk)/2.

( ) =( + ) + .2 21 2 3v u c u c c

Page 19: Normalization of microarray data Anja von Heydebreck Dept. Computational Molecular Biology, MPI for Molecular Genetics, Berlin

The two-component model

raw scale log scale

Page 20: Normalization of microarray data Anja von Heydebreck Dept. Computational Molecular Biology, MPI for Molecular Genetics, Berlin

The two-component model

raw scale log scale

“additive” noise

“multiplicative” noise

Page 21: Normalization of microarray data Anja von Heydebreck Dept. Computational Molecular Biology, MPI for Molecular Genetics, Berlin

Variance stabilizing transformations

Let Xu be a family of random variables

with EXu=u, VarXu=v(u). Define a

transformation

Var h(Xu ) independent of u

1( )

v( )

x

h x duu

Page 22: Normalization of microarray data Anja von Heydebreck Dept. Computational Molecular Biology, MPI for Molecular Genetics, Berlin

Derivation of the variance-stabilizing transformation

Let Xu be a family of random variables with EXu=u, VarXu= v(u), and h a transformation applied to Xu. Then, by linear approximation of h,

Thus, if h’(u)2 = v(u)-1 ,Var(h(Xu)) is approx. independent of u.

.

u u

2 2u u u

h(X ) ~ h(u) + h'(u)(X - u)

Var(h(X )) ~ h'(u) Var(X - u) = h'(u) Var(X )

Page 23: Normalization of microarray data Anja von Heydebreck Dept. Computational Molecular Biology, MPI for Molecular Genetics, Berlin

0 20000 40000 60000

8.0

8.5

9.0

9.5

10

.01

1.0

raw scale

tra

nsf

orm

ed

sca

le

Variance stabilizing transformations

f(x)

x

Page 24: Normalization of microarray data Anja von Heydebreck Dept. Computational Molecular Biology, MPI for Molecular Genetics, Berlin

Variance stabilizing transformations

1( )

v( )

x

f x duu

1.) constant CV (‘multiplicative’)

2( ) logv u u f u

3.) additive and multiplicative

2 2 00( ) ( ) arsinh

u uv u u u s f

s

2.) offset2

0 0( ) ( ) log( )v u u u f u u

Page 25: Normalization of microarray data Anja von Heydebreck Dept. Computational Molecular Biology, MPI for Molecular Genetics, Berlin

The “generalized log” transformation

intensity-200 0 200 400 600 800 1000

- - - f(x) = log(x)

——— hs(x) = arsinh(x/s)

2arsinh( ) log 1x x x

W. Huber et al., ISMB 2002

D. Rocke & B. Durbin, ISMB 2002

Page 26: Normalization of microarray data Anja von Heydebreck Dept. Computational Molecular Biology, MPI for Molecular Genetics, Berlin

A model for measurement error

i iki kiY e

Yki:measured intensity of gene k in array/color channel iki: true expression level of gene k i, i : additive/multiplicative effects of array/color channel i: multiplicative/additive error terms, independent normal with mean 0

Now we consider data from different arrays or color channels i. We assume they are related through an affine-linear transformation on the raw scale:

Page 27: Normalization of microarray data Anja von Heydebreck Dept. Computational Molecular Biology, MPI for Molecular Genetics, Berlin

2Yarsinh , (0, )iki

k ki kii

aN c

b

:

• Assume an affine-linear transformation for normalization between arrays, and, after that, common parameters for the variance stabilizing transformation. The composite transformation for array/color channel i is given by ai and bi.• The model is assumed to hold for genes that are unchanged; differentially expressed genes act as outliers.

A statistical model

Page 28: Normalization of microarray data Anja von Heydebreck Dept. Computational Molecular Biology, MPI for Molecular Genetics, Berlin

2Yarsinh , (0, )iki

k ki kii

aN c

b

:

• Assume that the majority of genes is not differentially expressed.• Use robust variant of maximum likelihood estimation:• Alternate between maximum likelihood estimation (= least squares fit) for a fixed set K of genes and selection of K as the subset of (e.g. 50%) genes with smallest residuals.

Robust parameter estimation

Page 29: Normalization of microarray data Anja von Heydebreck Dept. Computational Molecular Biology, MPI for Molecular Genetics, Berlin

Robust normalization

(generalized) log-ratio

location estimators:• mean• median• least trimmed sum of squares

assumption:majority of genesunchanged

Page 30: Normalization of microarray data Anja von Heydebreck Dept. Computational Molecular Biology, MPI for Molecular Genetics, Berlin

Normalized & transformed data

log scale generalized log scale

Page 31: Normalization of microarray data Anja von Heydebreck Dept. Computational Molecular Biology, MPI for Molecular Genetics, Berlin

Validation: standard deviation versus rank-mean plots

diff

ere

nce r

ed

-g

reen

rank(average)

Page 32: Normalization of microarray data Anja von Heydebreck Dept. Computational Molecular Biology, MPI for Molecular Genetics, Berlin

Which normalization method should one use?

• How can one assess the performance of different methods?

• Diagnostic plots (e.g. scatterplots)• Performance measures:• The variance between replicate

measurements should be low.• Low bias: Changes in expression

should be accurately measured. How to assess this (in most cases, the truth is unknown)?

Page 33: Normalization of microarray data Anja von Heydebreck Dept. Computational Molecular Biology, MPI for Molecular Genetics, Berlin

o Data: paired tumor/normal tissue from 19 kidney cancers, hybridized in duplicate on 38 cDNA slides à 4000 genes.

o Apply 6 different strategies for normalization and quantification of differential expression

o Apply permutation test to each gene

o Compare numbers of genes detected as differentially expressed, at a certain significance level, between the different normalization methods

Evaluation: sensitivity / specificity in quantifying differential expression

Page 34: Normalization of microarray data Anja von Heydebreck Dept. Computational Molecular Biology, MPI for Molecular Genetics, Berlin

Number of significant genes vs. significance level of permutation test

Comparison of methods

Page 35: Normalization of microarray data Anja von Heydebreck Dept. Computational Molecular Biology, MPI for Molecular Genetics, Berlin

Parametric vs. non-parametric normalization

• Loess is non-parametric: it makes no assumptions which sort of transformation is appropriate. Disadvantage: Degree of smoothing is chosen in an arbitrary way.

• vsn uses a parametric model: affine-linear normalization. Disadvantage: the model assumptions may not always hold. Advantage: If the model assumptions do hold (at least approximately), the method should perform better.

Page 36: Normalization of microarray data Anja von Heydebreck Dept. Computational Molecular Biology, MPI for Molecular Genetics, Berlin

vsn may also correct “banana shape”

M-A plot of vsn-normalized zebrafish data, loess fit

Different additiveoffsets may leadto non-linearscatter plotson the log scale.

Page 37: Normalization of microarray data Anja von Heydebreck Dept. Computational Molecular Biology, MPI for Molecular Genetics, Berlin

References

• Software: R package modreg (loess), Bioconductor packages marrayNorm (loess normalization), vsn (variance stabilization)

• W.Huber, A.v.Heydebreck, H.Sültmann, A.Poustka, M.Vingron (2002). Variance stabilization applied to microarray data calibration and to the quantification of differential expression. Bioinformatics 18(S1), 96-104.

• Y.H.Yang, S.Dudoit at al. (2002). Normalization for cDNA microarray data: a robust composite method addressing single and multiple slide systematic variation. Nucleic Acids Research 30(4):e15.