Upload
others
View
7
Download
0
Embed Size (px)
Citation preview
Wolfgang HuberEuropean Bioinformatics Institute
Quality control and normalization
Acknowledgements
Anja von Heydebreck (Darmstadt)Robert Gentleman (Seattle)Günther Sawitzki (Heidelberg)Martin Vingron (Berlin)Annemarie Poustka, Holger Sültmann, Andreas
Buness, Markus Ruschhaupt (Heidelberg)Rafael Irizarry (Baltimore)Judith Boer (Leiden) Anke Schroth (Heidelberg)Friederike Wilmer (Hilden)
Which genes are differentially transcribed?
same-same tumor-normal
log-ratio
Statistics 101:
←bias accuracy→
←pr
ecis
ion
varia
nce→
Basic dogma of data analysis:
Can always increase sensitivity on the cost of specificity,
or vice versa,
the art is to find the best trade-off.
X
X
X
X
X
X
X
X
X
(It can also be possible to increase both by better choice of method / model)
ratios and fold changes
Fold changes are useful to describe continuous changes in expression
10001500
3000x3
x1.5
A B C
0200
3000?
?
A B C
But what if the gene is “off” (below detection limit) in one condition?
ratios and fold changesThe idea of the log-ratio (base 2)
0: no change+1: up by factor of 21 = 2+2: up by factor of 22 = 4-1: down by factor of 2-1 = 1/2-2: down by factor of 2-2 = ¼
ratios and fold changesThe idea of the log-ratio (base 2)
0: no change+1: up by factor of 21 = 2+2: up by factor of 22 = 4-1: down by factor of 2-1 = 1/2-2: down by factor of 2-2 = ¼
A unit for measuring changes in expression: assumes that a change from 1000 to 2000 units has a similar biological meaning to one from 5000 to 10000.
ratios and fold changesThe idea of the log-ratio (base 2)
0: no change+1: up by factor of 21 = 2+2: up by factor of 22 = 4-1: down by factor of 2-1 = 1/2-2: down by factor of 2-2 = ¼
A unit for measuring changes in expression: assumes that a change from 1000 to 2000 units has a similar biological meaning to one from 5000 to 10000.
What about a change from 0 to 500?- conceptually- noise, measurement precision
A complex measurement process lies between mRNA concentrations and intensities
o other array manufacturing-related issues
o hybridization efficiency and specificity
o DNA-support binding
o reverse transcription efficiency
o ‘background’ correction
o spotting efficiency
o amplification efficiency
o signal quantification
o PCR yield, contamination
o RNA degradation
o image segmentation
o clone identification and mapping
o tissue contamination
A complex measurement process lies between mRNA concentrations and intensities
o other array manufacturing-related issues
o hybridization efficiency and specificity
o DNA-support binding
o reverse transcription efficiency
o ‘background’ correction
o spotting efficiency
o amplification efficiency
o signal quantification
o PCR yield, contamination
o RNA degradation
o image segmentation
o clone identification and mapping
o tissue contamination
The problem is less that these steps are ‘not perfect’; it is that they vary from array to array, experiment to experiment.
♦ How to compare microarrayintensities with each other?
♦ How to address measurement uncertainty (“variance”)?
♦ How to calibrate (“normalize”) for biases between samples?
Questions
Sources of variationamount of RNA in the biopsy efficiencies of-RNA extraction-reverse transcription -labeling-fluorescent detection
probe purity and length distribution
spotting efficiency, spot sizecross-/unspecific hybridizationstray signal
Sources of variationamount of RNA in the biopsy efficiencies of-RNA extraction-reverse transcription -labeling-fluorescent detection
probe purity and length distribution
spotting efficiency, spot sizecross-/unspecific hybridizationstray signal
Systematico similar effect on many measurementso corrections can be estimated from data
Sources of variationamount of RNA in the biopsy efficiencies of-RNA extraction-reverse transcription -labeling-fluorescent detection
probe purity and length distribution
spotting efficiency, spot sizecross-/unspecific hybridizationstray signal
Systematico similar effect on many measurementso corrections can be estimated from data
Calibration
Sources of variationamount of RNA in the biopsy efficiencies of-RNA extraction-reverse transcription -labeling-fluorescent detection
probe purity and length distribution
spotting efficiency, spot sizecross-/unspecific hybridizationstray signal
Systematico similar effect on many measurementso corrections can be estimated from data
Calibration
Stochastico too random to be ex-plicitely accounted for o remain as “noise”
Sources of variationamount of RNA in the biopsy efficiencies of-RNA extraction-reverse transcription -labeling-fluorescent detection
probe purity and length distribution
spotting efficiency, spot sizecross-/unspecific hybridizationstray signal
Systematico similar effect on many measurementso corrections can be estimated from data
Calibration
Stochastico too random to be ex-plicitely accounted for o remain as “noise”
Error model
Error models
describe the possible outcomes of a set of measurements
Outcomes depend on:-true value of the measured quantity (abundances of specific molecules in biological sample)
-measurement apparatus (cascade of biochemical reactions, optical detection system with laser scanner or CCD camera)
Error models
Purpose:
1. Data compression: summary statistic instead of full empirical distribution
2. Quality control
3. Statistical inference: appropriate parametric methods have better power than non-parametric (this has practical, financial, and ethical aspects)
ε= +iik ika aai per-sample offset
εik ~ N(0, bi2s1
2)“additive noise”
bi per-samplenormalization factor
bk sequence-wiseprobe efficiency
ηik ~ N(0,s22)
“multiplicative noise”
exp( )iik k ikb b b η=
ik ik ik ky a b x= +
The two component model
measured intensity = offset + gain × true abundance
The two-component model
raw scale log scaleB. Durbin, D. Rocke, JCB 2001
The two-component model
raw scale log scaleB. Durbin, D. Rocke, JCB 2001
“additive” noise
“multiplicative” noise
Parameterization
(1 )y a b xy a b x eη
ε η
ε
= + + ⋅ ⋅ +
= + + ⋅ ⋅
two practically equivalent forms
(η<<1)
iid per arrayiid in whole experiment
η random gain fluctuations
per array x color x print-tip group
per array x colorb systematic gain factor
iid per arrayiid in whole experiment
ε random background
per array x color x print-tip group
same for all probes (per array x color)
a systematic background
variance stabilizing transformations
Xu a family of random variables with EXu=u, VarXu=v(u). Define
⇒ var f(Xu ) ≈ independent of u
1( )v( )
x
f x duu
= ∫
derivation: linear approximation
0 20000 40000 60000
8.0
8.5
9.0
9.5
10.0
11.0
raw scale
trans
form
ed s
cale
variance stabilizing transformations
f(x)
x
variance stabilizing transformations1( )
v ( )
x
f x d uu
= ∫1.) constant variance (‘additive’) 2( ) sv u f u= ⇒ ∝
2.) constant CV (‘multiplicative’) 2( ) logv u u f u∝ ⇒ ∝
4.) additive and multiplicative
2 2 00( ) ( ) arsinh u uv u u u s f
s+
∝ + + ⇒ ∝
3.) offset 20 0( ) ( ) log( )v u u u f u u∝ + ⇒ ∝ +
the “glog” transformation
intensity-200 0 200 400 600 800 1000
- - - f(x) = log(x)
——— hs(x) = asinh(x/s)
( )( )
2arsinh( ) log 1
arsinh log log 2 0limx
x x x
x x→∞
= + +
− − =
P. Munson, 2001
D. Rocke & B. Durbin, ISMB 2002
W. Huber et al., ISMB 2002
raw scale log glog
difference
log-ratio
generalized
log-ratio
glog
raw scale log glog
difference
log-ratio
generalized
log-ratio
glog
constant partvariance:
proportional part
the transformed model
2
Yarsinh
(0, )
sikik ki
si
ki
ab
N c
µ ε
ε
−= +
∼
i: arrays k: probess: probe strata (e.g. print-tip, region)
“usual” log-ratio
'glog' (generalized log-ratio)
+ +
+ +
1
2
2 21 1 1
2 22 2 2
log
log
xx
x x cx x c
c1, c2 are experiment specific parameters (~level of background noise)
Variance Bias Trade-Off
Estimat
ed log
-fold-
chan
ge
Signal intensity
logglog
Variance-bias trade-off and shrinkage estimators
Shrinkage estimators:pay a small price in bias for a large decrease of variance, so overall the mean-squared-error (MSE) is reduced.
Particularly useful if you have few replicates.
Generalized log-ratio: = a shrinkage estimator for fold change
There are many possible choices, we chose “variance-stabilization”:+ interpretable even in cases where genes are off in some conditions+ can subsequently use standard statistical methods (hypothesis testing, ANOVA, clustering, classification…) without the worries about low-level variability that are often warranted on the log-scale
evaluation: effects of different data transformationsdiff
eren
ce r
ed-g
reen
rank(average)
Normality: QQ-plot
“Single color normalization”
n red-green arrays (R1, G1, R2, G2,… Rn, Gn)
within/between slidesfor (i=1:n)
calculate Mi= log(Ri/Gi), Ai= ½ log(Ri*Gi)normalize Mi vs Ai
normalize M1…Mn
all at oncenormalize the matrix of (R, G)then calculate log-ratios or any other
contrast you like
What about non-linear effectso Microarrays can be operated in a linear regime, where fluorescence intensity increases proportionally to target abundance (see e.g. Affymetrix dilution series)
Two reasons for non-linearity:
o At the high intensity end: saturation/quenching. This can and should be avoided experimentally - loss of data!
o At the low intensity end: background offsets, instead of y=k·x we have y=k·x+x0, and in the log-log plot this can look curvilinear. But this is an affine-linear effect and can be correct by affine normalization. Non-parametric methods (e.g. loess) risk overfitting and loss of power.
Non-linear or affine linear?
Definitions
linear affine linear genuinely non-linear
How to compare and assess different ‘normalization’ methods?
Normalization :=1. correction for systematic experimental biases2. provision of expression values that can subsequently be used for testing, clustering, classification, modelling…3. provision of a measure of measurement uncertainty
Quality trade-off: the better the measurements, the less need for normalization. Need for “too much” normalization relates to a quality problem.
Variance-Bias trade-off: how do you weigh measurements that have low signal-noise ratio?- just use anyway- ignore- shrink
How to compare and assess different ‘normalization’ methods?
Aesthetic criteriaLogarithm is more beautiful than arsinh
Practical criteraIt takes forever to run method XX. Referees will only accept my paper if it uses the original MAS5.
Silly criteriaThe best method is that that makes all my scatterplotslook like straight, slim cigars
Physical criteriaNormalization calculations should be based on physical/chemical model
Economical/political criteriaLife would be so much easier if everybody were just using the same method, who cares which one
How to compare and assess different ‘normalization’ methods?
Comparison against a ground truthBut you have millions of numbers – need to choose the metric that measures deviation from truth.FN/FP: do you find all the differentially expressed genes, and do you not find non-d.e. genes?qualitative/quantitative: how well do you estimate abundance, fold-change?
Spike-In and Dilution series… great, but how representative are they of other data?
Implicitely, from resampling / cross-validating with the actual experiment of interest… but isn’t that too much like Münchhausen’s bootstrap?
evaluation: a benchmark for Affymetrixgenechip expression measures
o Data:Spike-in series: from Affymetrix 59 x HGU95A, 16 genes, 14 concentrations, complex backgroundDilution series: from GeneLogic 60 x HGU95Av2,liver & CNS cRNA in different proportions and amounts
o Benchmark:15 quality measures regarding-reproducibility-sensitivity-specificityPut together by Rafael Irizarry (Johns Hopkins)http://affycomp.biostat.jhsph.edu
Quality assessment and control:
an overview over some diagnostic plots and
common artifacts
Scatterplot, colored by PCR-plateTwo RZPD Unigene II filters (cDNA nylon membranes)
PCR plates
PCR platesPCR plates
PCR plates: boxplotsPCR plates: boxplots
array batchesarray batches
print-tip effectsprint-tip effects
-0.8 -0.6 -0.4 -0.2 0.0 0.2
0.0
0.2
0.4
0.6
0.8
1.0
41 (a42-u07639vene.txt) by spotting pin
log(fg.green/fg.red)
F̂
1:11:21:31:42:12:22:32:43:13:23:33:44:14:24:34:4
q (log-ratio)
F(q)
spotting pin quality declinespotting pin quality decline
after delivery of 3x105 spots
after delivery of 5x105 spots
H. Sueltmann DKFZ/MGA
spatial effectsspatial effects
R Rb R-Rbcolor scale by rank
spotted cDNA arrays, Stanford-type
another array:
print-tip
color scale ~ log(G)
color scale ~ rank(G)
10 20 30 40 50 60
1020
3040
5060
1:nrhyb
1:nr
hyb
1 2 3 4 5 6 7 8 910111213141516171823242526272829303132333435363738737475767778798081828384858687888990919293949596979899100
0.6
0.8
1.0
1.2
1.4
1.6
1.8
Batches: array to array differences dij = madk(hik -hjk)
arrays i=1…63; roughly sorted by time
Density representation of the scatterplot(76,000 clones, RZPD Unigene-II filters)
See: package hexbin; also, smoothscatter in package prada
Oligonucleotide chips
Affymetrix files
Main software from Affymetrix: MAS - MicroArray Suite.
DAT file: Image file, ~108 pixels, ~200 MB.CEL file: probe intensities, ~106 numbersCDF file: Chip Description File. Describes
which probes go in which probe sets (genes, gene fragments, ESTs).
1LQ file: Probe sequences and intended targets in the transcriptome
Image analysis
DAT image files CEL filesEach probe cell: 10x10 pixels.Gridding: estimate location of probe cell centers.Signal:
– Remove outer 36 pixels 8x8 pixels.– The probe cell signal, PM or MM, is the 75th
percentile of the 8x8 pixel values.Background: Average of the lowest 2% probe cells
is taken as the background value and subtracted.
Compute also quality values.
Data and notationPMijg , MMijg = Intensities for perfect match and
mismatch probe j for gene g in chip ii = 1,…, n one to hundreds of chipsj = 1,…, J usually 11 or 16 probe pairsg = 1,…, G 6…30,000 probe sets.
Tasks:calibrate (normalize) the measurements from different
chips (samples)summarize for each probe set the probe level data, i.e.,
11 PM and MM pairs, into a single expression measure.
compare between chips (samples) for detecting differential expression.
expression measures: MAS 4.0
expression measures: MAS 4.0
Affymetrix GeneChip MAS 4.0 software uses AvDiff, a trimmed mean:
o sort dj = PMj -MMjo exclude highest and lowest valueo J := those pairs within 3 standard
deviations of the average
1 ( )# j j
j JAvDiff PM MM
J ∈
= −∑
Expression measures MAS 5.0
Expression measures MAS 5.0
Instead of MM, use "repaired" version CTCT= MM if MM<PM
= PM / "typical log-ratio" if MM>=PM
"Signal" =Tukey.Biweight (log(PM-CT))
(… ≈median)
Tukey Biweight: B(x) = (1 – (x/c)^2)^2 if |x|<c, 0 otherwise
Expression measures: Li & Wong
Expression measures: Li & Wong
dChip fits a model for each gene
where– θi: expression index for gene i– φj: probe sensitivity
Maximum likelihood estimate of MBEI is used as expression measure of the gene in chip i.
Need at least 10 or 20 chips.
Current version works with PMs only.
2, (0, )ij ij i j ij ijPM MM Nθ φ ε ε σ− = + ∝
Expression measures RMA: Irizarry et al. (2002)
Expression measures RMA: Irizarry et al. (2002)
o Estimate one global background value b=mode(MM). No probe-specific background!
o Assume: PM = strue + bEstimate s≥0 from PM and b as a conditional expectation E[strue|PM, b].
o Use log2(s).o Nonparametric nonlinear calibration
('quantile normalization') across a set of chips.
AvDiff-like
with A a set of “suitable” pairs.
Li-Wong-like: additive model
Estimate RMA = ai for chip i using robust method median polish (successively remove row and column medians, accumulate terms, until convergence). Works with d>=2
21RMA log ( )j j
j APM BG
∈
= −Α ∑
Expression measures RMA: Irizarry et al. (2002)
Expression measures RMA: Irizarry et al. (2002)
2log ( )ij i j ijPM BG a b ε− = + +
Affymetrix: IPM = IMM + Ispecific ?
log(PM/MM)0From: R. Irizarry et al.,
Biostatistics 2002
Sequence-dependent preprocessing
i
25
1log log ( )i i
iY x w s ε
=
= + +∑
wi
position- and sequence-specific effects wi(s):Naef et al., Phys Rev E 68 (2003)
Software for pre-processing Affymetrix data
• Bioconductor R package affy.• Background estimation.• Probe-level normalization.• Expression measures• Two main functions: ReadAffy, expresso.• See also: gcrma, tilingArray, vsn.
References
Bioinformatics and computational biology solutions using R and Bioconductor, R. Gentleman, V. Carey, W. Huber, R. Irizarry, S. Dudoit, Springer (2005).
Variance stabilization applied to microarray data calibration and to thequantification of differential expression. W. Huber, A. von Heydebreck, H. Sültmann, A. Poustka, M. Vingron. Bioinformatics 18 suppl. 1 (2002), S96-S104.
Exploration, Normalization, and Summaries of High Density OligonucleotideArray Probe Level Data. R. Irizarry, B. Hobbs, F. Collins, …, T. Speed. Biostatistics 4 (2003) 249-264.
Error models for microarray intensities. W. Huber, A. von Heydebreck, and M. Vingron. Encyclopedia of Genomics, Proteomics and Bioinformatics. John Wiley & sons (2005).
Differential Expression with the Bioconductor Project. A. von Heydebreck, W. Huber, and R. Gentleman. Encyclopedia of Genomics, Proteomics and Bioinformatics. John Wiley & sons (2005).