DATA TRANSFORMATION and NORMALIZATION Lecture Topic 4

Page 1: DATA TRANSFORMATION and NORMALIZATION Lecture Topic 4

DATA TRANSFORMATION and NORMALIZATION

Lecture Topic 4

Page 2: DATA TRANSFORMATION and NORMALIZATION Lecture Topic 4

DATA PRE-PROCESSING

• TRANSFORMATION

• NORMALIZATION

• SCALING

Page 3: DATA TRANSFORMATION and NORMALIZATION Lecture Topic 4

DATA TRANSFORMATION

• The difference between raw fluorescence intensities is a meaningless number.

• Data are therefore transformed:

– Ratio: allows immediate visualization of relative change

– Log: puts the ratios on a symmetric, additive scale

Page 4: DATA TRANSFORMATION and NORMALIZATION Lecture Topic 4

Why Log 2?

• Differences in expression intensity exist on a multiplicative scale; the log transformation brings them onto an additive scale, where a linear model may apply.

• Ex. 4-fold repression = 0.25 (log2 = -2)

• Ex. 4-fold induction = 4 (log2 = 2)

• Ex. 16-fold induction = 16 (log2 = 4)

• Ex. 16-fold repression = 0.0625 (log2 = -4)

• Evens out highly skewed distributions

• Makes variation of intensities independent of absolute magnitude
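A quick numerical check of the fold-change examples above; a minimal numpy sketch (the values are the ones on this slide):

import numpy as np

# Fold changes from the slide: 4-fold repression, 4-fold induction,
# 16-fold induction, 16-fold repression
fold_changes = np.array([0.25, 4.0, 16.0, 0.0625])
print(np.log2(fold_changes))  # -> [-2.  2.  4. -4.]
# Note the symmetry: k-fold induction and k-fold repression differ only
# in sign on the log2 scale, unlike the raw ratios (4 vs. 0.25).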

Page 5: DATA TRANSFORMATION and NORMALIZATION Lecture Topic 4

Log Transformation: Makes the distribution less skewed

[Figure: two histograms. Left: "Histogram of F635 Median" (frequency vs. raw intensity), highly right-skewed. Right: "Histogram of log2G", the same data after log2 transformation, much less skewed.]
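The skew reduction can be reproduced with a pair of histograms; a minimal matplotlib sketch, with synthetic lognormal intensities standing in for a raw scanner column such as F635 Median:

import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
# Synthetic stand-in for raw intensities: lognormal data mimics the
# heavy right skew of scanner measurements.
raw = rng.lognormal(mean=9, sigma=1.2, size=5000)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
ax1.hist(raw, bins=40)
ax1.set(title="Histogram of raw intensity", xlabel="intensity", ylabel="frequency")
ax2.hist(np.log2(raw), bins=40)
ax2.set(title="Histogram of log2 intensity", xlabel="log2 intensity", ylabel="frequency")
plt.show()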

Page 6: DATA TRANSFORMATION and NORMALIZATION Lecture Topic 4

Example 2

[Figure: two histograms. Left: "Histogram of F532 Median" (frequency vs. raw intensity), highly right-skewed. Right: "Histogram of log2R", the same data after log2 transformation, much less skewed.]

Page 7: DATA TRANSFORMATION and NORMALIZATION Lecture Topic 4

Generalized Log Transform

• Idea (Sapir and Churchill, 2000; Rocke and Durbin, 2001): model the measured intensity as $X = \alpha + \mu e^{\eta} + \varepsilon$, with $\varepsilon \sim N(0, \sigma_\varepsilon^2)$ and $\eta \sim N(0, \sigma_\eta^2)$.

• At very low expression levels $\mu e^{\eta}$ is close to 0, hence $X$ is approximately equal to $\alpha + \varepsilon$.

– $X$ is normally distributed with mean $\alpha$ and variance $\sigma_\varepsilon^2$.

• At very high expression levels $X \approx \mu e^{\eta}$.

– $X$ is lognormally distributed: $\log(X) \approx \log(\mu) + \eta$, with variance $\sigma_\eta^2$ on the log scale.

• At moderate intensities the distribution is a mixture of the two.

• Hence it is better to use a transformation of the form $\log(X + c)$.

Page 8: DATA TRANSFORMATION and NORMALIZATION Lecture Topic 4

The Transform

$\operatorname{glog}(X) = \log\left( (X - \alpha) + \sqrt{(X - \alpha)^2 + S^2} \right)$
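A minimal numpy sketch of this transform; in practice alpha and S are estimated from the data (e.g. from low-intensity spots), so the values below are placeholders:

import numpy as np

def glog(x, alpha=0.0, s=1.0):
    # Generalized log: behaves like log(x) for large x, but stays
    # finite and roughly linear near zero intensity.
    z = x - alpha
    return np.log(z + np.sqrt(z ** 2 + s ** 2))

x = np.array([0.0, 0.5, 5.0, 500.0])
print(glog(x, alpha=0.0, s=1.0))  # no -inf at zero, ~log(2x) for large x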

Page 9: DATA TRANSFORMATION and NORMALIZATION Lecture Topic 4

Non-parametric Regression: the Loess Method

• LOWESS (LOESS) is an acronym for LOcally WEighted Scatterplot Smoothing (Cleveland).

• For i = 1 to n, the ith measurement $y_i$ of the response y and the corresponding measurement $x_i$ of the vector x of p predictors are related by

• $y_i = g(x_i) + \varepsilon_i$

• where g is the regression function and $\varepsilon_i$ is a random error.

• Idea: g(x) can be locally approximated by a parametric function.

• Obtained by fitting a regression surface to the data points within a chosen neighborhood of the point x.

Page 10: DATA TRANSFORMATION and NORMALIZATION Lecture Topic 4

LOESS contd…

• In the LOESS (LOWESS) method, weighted least squares is used to fit linear or quadratic functions of the predictors at the centers of neighborhoods.

• The radius of each neighborhood is chosen so that the neighborhood contains a specified percentage of the data points. This fraction of the data in each local neighborhood, called the smoothing parameter, controls the smoothness of the estimated surface.

• Data points in a given local neighborhood are weighted by a smooth decreasing function of their distance from the center of the neighborhood.

Page 11: DATA TRANSFORMATION and NORMALIZATION Lecture Topic 4

Distance metrics used

• Finding the distance between the ith and hth points with 2 predictors:

• Distance between $(X_{i1}, X_{i2})$ and $(X_{h1}, X_{h2})$: generally the Euclidean distance is used,

$d_i = \sqrt{(X_{i1} - X_{h1})^2 + (X_{i2} - X_{h2})^2}$

• and weights are defined by a tri-cube function, where $d_q$ is the radius of the neighborhood containing the fraction q of the data (see the sketch below):

$w_i = \begin{cases} \left[ 1 - (d_i / d_q)^3 \right]^3 & \text{if } d_i < d_q \\ 0 & \text{otherwise} \end{cases}$

• The choice of q is between 0 and 1, often between 0.4 and 0.6.

• Large q: smoother, but maybe too smooth

• Small q: too rough
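To make the procedure concrete, here is a minimal sketch of one LOESS fit at a single point x0 with one predictor, using the tricube weights above (synthetic data; a production analysis would use a library implementation such as statsmodels):

import numpy as np

def loess_at(x0, x, y, q=0.5):
    # Locally weighted linear fit evaluated at x0; q is the smoothing
    # parameter (fraction of points in the neighborhood).
    d = np.abs(x - x0)                                   # distances to x0
    d_q = np.sort(d)[int(np.ceil(q * len(x))) - 1]       # neighborhood radius
    w = np.where(d < d_q, (1 - (d / d_q) ** 3) ** 3, 0)  # tricube weights
    X = np.column_stack([np.ones_like(x), x])            # local line y = b0 + b1*x
    W = np.diag(w)
    b = np.linalg.solve(X.T @ W @ X, X.T @ W @ y)        # weighted least squares
    return b[0] + b[1] * x0

rng = np.random.default_rng(1)
x = np.linspace(0, 10, 200)
y = np.sin(x) + rng.normal(scale=0.3, size=x.size)
smooth = np.array([loess_at(x0, x, y) for x0 in x])      # the fitted curve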

Page 12: DATA TRANSFORMATION and NORMALIZATION Lecture Topic 4
Page 13: DATA TRANSFORMATION and NORMALIZATION Lecture Topic 4
Page 14: DATA TRANSFORMATION and NORMALIZATION Lecture Topic 4

Comments on LOESS

• Fitting is done at each point at which the regression surface is to be estimated.

• A faster computational procedure is to perform such local fitting at a selected sample of points and then blend the local polynomials to obtain a regression surface.

• The LOESS procedure can be used to perform statistical inference provided the errors are i.i.d. normal random variables with mean 0.

• Using iterative reweighting, LOESS can also provide statistical inference when the error distribution is symmetric but not necessarily normal.

• By iterative reweighting, the LOESS procedure can also perform robust fitting in the presence of outliers in the data.

Page 15: DATA TRANSFORMATION and NORMALIZATION Lecture Topic 4

Data “Normalization”

• To biologists, data normalization means "eliminating systematic noise" from the data.

• Noise is systematic variation: experimental variation, human error, variation in scanner technology, etc.

• It is variation in which we are NOT interested.

• We are interested in measuring true biological variation of genes across experiments, throughout time, etc.

• Normalization plays an important role in the earlier stages of microarray data analysis.

• Subsequent analyses are highly dependent on normalization.

• NORMALIZATION: adjusts for any bias which arises from the microarray technology rather than from biology.

Page 16: DATA TRANSFORMATION and NORMALIZATION Lecture Topic 4

Normalization: An Age-Old Statistical Idea

• Stands for removing bias as a result of experimental artifacts from the data.

• Stems back to Fisher's (1923) idea in setting up ANOVA.

• There is a thrust to use ANOVA for normalization, but for the most part it is still a stage-wise approach instead of a model taking out all sources of variation at once.

• We will need to look at:

– Spatial correction

– Background correction

– Dye-effect correction

– Within replicate rescaling

– Across replicate rescaling

• Within slide normalization

• Paired slide normalization for dye swap

• Multiple slide normalization

Page 17: DATA TRANSFORMATION and NORMALIZATION Lecture Topic 4

Systematic Experimental Variations

• Scanner properties
• Spatial effects (variations in printing, slides)
• Print-tip effects (spotting variations due to pin geometry)
• Labeling fluctuations of the dyes
• Diverse protocols
• Inconsistent washings
• Arrays produced at different times
• Microarray production errors
• Sample-to-sample fluctuations of the mRNA preparation
• Varying reverse transcription to cDNA
• Varying PCR amplification
• Filter inhomogeneities
• Cross-hybridization within gene families
• Background noise
• Image analysis saturation, spot shape variations
• Artifacts such as fingerprints, scratches, technician's deodorant particles

Page 18: DATA TRANSFORMATION and NORMALIZATION Lecture Topic 4

Spatial Normalization

Need arises from:

1. Washing chip unevenly

2. Inserting chip at an angle

3. Scanner issues

4. Edge effects from evaporation

5. Uneven wear of print-tip after hours of printing

Reference designs with the use of ratio data have been popular, since ratios help with spatial normalization.

Page 19: DATA TRANSFORMATION and NORMALIZATION Lecture Topic 4

Model for spatial bias

• Sd(x,y) = gd(x,y) * C(x,y)

• Scy3(x,y)/Scy5(x,y) = gcy3(x,y)/gcy5(x,y)

True signal

Complicated function of bias

Page 20: DATA TRANSFORMATION and NORMALIZATION Lecture Topic 4

Array 1: pre and post norm

Page 21: DATA TRANSFORMATION and NORMALIZATION Lecture Topic 4

Array 2

Page 22: DATA TRANSFORMATION and NORMALIZATION Lecture Topic 4

Array 3

Page 23: DATA TRANSFORMATION and NORMALIZATION Lecture Topic 4

Array 4

Page 24: DATA TRANSFORMATION and NORMALIZATION Lecture Topic 4

Comments:

• Print-tip normalization is generally a good proxy for spatial effects

• Instead of LOESS, one can use a spline to estimate the trend to subtract from the raw data.

Page 25: DATA TRANSFORMATION and NORMALIZATION Lecture Topic 4

BACKGROUND CORRECTION

• Idea:

• Signal = True Signal + Background

• So an attractive idea is to subtract the BACKGROUND from the signal to get at the "TRUE" signal.

• The problem is that the actual BACKGROUND in a spot cannot be measured; what is measured is really an "estimate" of the background at places NEAR the spot.

• Criticism: the assumption in these models is that the background is additive.

– OFTEN WE SEE HIGH CORRELATION BETWEEN FOREGROUND AND BACKGROUND.

– GENERAL CONSENSUS THESE DAYS: NOT TO SUBTRACT LOCAL BACKGROUND, BUT POSSIBLY SUBTRACT A GLOBAL BACKGROUND (FROM EMPTY SPOTS OR BUFFERS).

Page 26: DATA TRANSFORMATION and NORMALIZATION Lecture Topic 4

Background Correction: more thoughts

• McClure and Wit (2004) suggest calculating the mean or median of the empty spots and estimating the signal as:

– Signal = max(observed signal - center(empty spots), 0)

• This guarantees that "corrected signals" are never negative.
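A minimal numpy sketch of this rule (the intensity values are illustrative):

import numpy as np

observed = np.array([500.0, 120.0, 80.0, 2000.0])  # spot foreground intensities
empty = np.array([90.0, 110.0, 100.0])             # intensities of empty spots

center = np.median(empty)                          # or np.mean(empty)
corrected = np.maximum(observed - center, 0.0)     # floor at zero
print(corrected)                                   # -> [ 400.   20.    0. 1900.]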

Page 27: DATA TRANSFORMATION and NORMALIZATION Lecture Topic 4

Background Correction: Probabilistic Idea

• Irizarry et al. (2003)

• Looks at finding the conditional expectation of the TRUE signal given the observed signal (which is assumed to be the true signal plus noise):

• $E(s_i \mid s_i + b_i)$

• Here, $s_i$ is assumed to follow an Exponential distribution with parameter $\alpha$.

• $b_i$ is assumed to follow $N(\mu_e, \sigma_e^2)$.

• Estimate $\mu_e$ and $\sigma_e$ as the mean and standard deviation of the empty spots.

Page 28: DATA TRANSFORMATION and NORMALIZATION Lecture Topic 4

Irizarry Approach contd…

• This allows the conditional expectation to be approximated by the following, where $\Phi$ and $\phi$ are the CDF and pdf of the standard normal distribution and $y_i = s_i + b_i$ is the observed intensity:

$\hat{s}_i = a + \sigma_e \, \dfrac{\phi(a/\sigma_e) - \phi\big((y_i - a)/\sigma_e\big)}{\Phi(a/\sigma_e) + \Phi\big((y_i - a)/\sigma_e\big) - 1}, \quad \text{with } a = y_i - \mu_e - \sigma_e^2 \alpha$
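A minimal numpy/scipy sketch of this correction (sometimes called "normexp"). Estimating the exponential rate alpha from the background-subtracted signal mean is a crude placeholder, not the authors' exact estimator:

import numpy as np
from scipy.stats import norm

def normexp_correct(y, mu_e, sigma_e, alpha):
    # E(s | y) for y = s + b, with s ~ Exp(rate=alpha), b ~ N(mu_e, sigma_e^2)
    a = y - mu_e - sigma_e ** 2 * alpha
    num = norm.pdf(a / sigma_e) - norm.pdf((y - a) / sigma_e)
    den = norm.cdf(a / sigma_e) + norm.cdf((y - a) / sigma_e) - 1
    return a + sigma_e * num / den

empty = np.array([90.0, 110.0, 100.0, 95.0])    # empty-spot intensities
y = np.array([120.0, 500.0, 2000.0])            # observed spot intensities

mu_e, sigma_e = empty.mean(), empty.std(ddof=1)
alpha = 1.0 / (y.mean() - mu_e)                 # crude rate estimate (assumption)
print(normexp_correct(y, mu_e, sigma_e, alpha)) # corrected signals, always > 0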

Page 29: DATA TRANSFORMATION and NORMALIZATION Lecture Topic 4

Normalization Approaches

• GLOBAL normalization (G): global (ARRAY) mean or median.

– NOT USED VERY OFTEN ANYMORE

• Intensity-dependent linear normalization (L): by least-squares estimation.

– AGAIN, NOT USED AS MUCH

• Intensity-dependent non-linear normalization (N): lowess curve (robust scatterplot smoother).

– Under ideal experimental conditions: M = 0 for the selected genes used for normalization.

– THE MOST COMMONLY USED IDEA THESE DAYS.
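A minimal sketch of the lowess idea on an MA plot, using statsmodels on synthetic two-channel data (R and G stand for red and green intensities; the bias term is made up for illustration):

import numpy as np
from statsmodels.nonparametric.smoothers_lowess import lowess

rng = np.random.default_rng(2)
G = rng.lognormal(mean=9, sigma=1, size=2000)        # green channel (synthetic)
R = G * np.exp(0.3 - 0.02 * np.log2(G))              # red channel, intensity-dependent bias
R = R * rng.lognormal(mean=0, sigma=0.1, size=2000)  # spot-level noise

M = np.log2(R) - np.log2(G)                          # log ratio
A = 0.5 * (np.log2(R) + np.log2(G))                  # average log intensity

# Fit the intensity-dependent trend and subtract it, so that M is
# centered at 0 at every intensity level A.
trend = lowess(M, A, frac=0.4, return_sorted=False)
M_norm = M - trend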

Page 30: DATA TRANSFORMATION and NORMALIZATION Lecture Topic 4

Normalization: Historical Approaches

• Global normalization

– Sum method: normalization coefficient

$k_j = \dfrac{\sum_{i=1}^{n} (I_{1i} - B_1)}{\sum_{i=1}^{n} (I_{2i} - B_2)}$

where $I_{mi}$ = intensity of gene i on Array m, $B_m$ = background intensity on Array m (m = 1, 2), and n = number of genes on the array.

– Problem: validity of the assumption; stronger signals dominate the summation.

– Median method (robust with respect to outliers): normalization coefficient

$k_j = \dfrac{\mathrm{median}_i (I_{1i} - B_1)}{\mathrm{median}_i (I_{2i} - B_2)}$
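A minimal numpy sketch of the two coefficients (intensities and the per-array backgrounds B1, B2 are illustrative):

import numpy as np

I1 = np.array([1200.0, 300.0, 950.0, 700.0])    # gene intensities on Array 1
I2 = np.array([1000.0, 260.0, 800.0, 650.0])    # gene intensities on Array 2
B1, B2 = 100.0, 90.0                            # per-array background estimates

k_sum = np.sum(I1 - B1) / np.sum(I2 - B2)       # sum method
k_med = np.median(I1 - B1) / np.median(I2 - B2) # median method (robust)
print(k_sum, k_med)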

Page 31: DATA TRANSFORMATION and NORMALIZATION Lecture Topic 4

Normalization continued

• Housekeeping gene normalization

– Housekeeping genes are a set of genes whose expression levels are not affected by the treatment.

– The normalization coefficient is the ratio $m_C / m_T$, where $m_C$ and $m_T$ are the means of the selected housekeeping genes for control and treatment respectively.

– Problem: housekeeping genes sometimes change their expression level, so the assumption doesn't hold.

• Trimmed mean normalization (adjusted global method)

– Trim off the 5% highest and lowest extreme values, then globally normalize the data. The normalization coefficient is

$k_i = \dfrac{m_{C_i}}{m_{T_i}}$

where $m_{C_i}$ and $m_{T_i}$ are the trimmed means for the ith control and treatment respectively.
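A minimal sketch of the trimmed-mean coefficient using scipy (synthetic control/treatment intensities; 5% trimmed at each end, per the slide):

import numpy as np
from scipy.stats import trim_mean

rng = np.random.default_rng(3)
control = rng.lognormal(mean=7, sigma=1, size=200)      # control-channel intensities
treat = 0.8 * rng.lognormal(mean=7, sigma=1, size=200)  # treatment, globally dimmer

# trim_mean drops the given proportion from each tail before averaging
k = trim_mean(control, 0.05) / trim_mean(treat, 0.05)
print(k)  # normalization coefficient m_C / m_T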

Page 32: DATA TRANSFORMATION and NORMALIZATION Lecture Topic 4

Ideal Control Spots that should be on an array

As we saw in the previous slide, there can be many special probes spotted onto an array during its manufacture, collectively called control probes.

These include

• Blanks: places where water or nothing is spotted.

• Buffer: where the buffer solution without DNA is spotted.

• Negative: here there are DNA probes, but they shouldn’t be complementary to any target cDNA.

• Calibration: probes corresponding to DNA put in the hyb mix which should have equal signals in the two channels.

• Ratio: probes corresponding to DNA put in the hyb mix which should have known ratios between the two channels (e.g. 3:1,1:3, 10:1, 1:10).

Page 33: DATA TRANSFORMATION and NORMALIZATION Lecture Topic 4

Normalization Within and Across Conditions

• The Normalization WITHIN conditions is more common

• Idea: we want all the arrays that represent the SAME condition to be comparable.

• Take out the array effect, in other words.

• Many models for this:

– Factorial model (Kerr et al., Wolfinger et al.)

– Location Scale Model (Yang et al)

– Scaling (Affymetrix)

Consider the data to be:

xijk: ith spot, jth color, kth array

Page 34: DATA TRANSFORMATION and NORMALIZATION Lecture Topic 4

Quantile Normalization

Idea

• Ideally “replicate” microarrays should be similar

• In real life they are often NOT identically distributed

• Quantile normalization FORCES the same distribution on all the arrays for the same condition

Page 35: DATA TRANSFORMATION and NORMALIZATION Lecture Topic 4
Page 36: DATA TRANSFORMATION and NORMALIZATION Lecture Topic 4

Mathematical details: Quantile Normalization

• Let {x} represent the matrix of all p spot intensities on the n replicate arrays.

• Here, $x_{ik}$ is the spot intensity of the ith spot (i = 1,…,p; k = 1,…,n).

• Let $x_{(j)}$ = the vector of the jth-smallest spot intensities across the arrays (the jth row after sorting each array separately).

• Let $\bar{x}_{(j)}$ be the mean (or median) of the entries of $x_{(j)}$.

• The vector $(\bar{x}_{(1)}, \ldots, \bar{x}_{(p)})$ represents the compromise distribution. Let {r} be the matrix of row ranks associated with matrix {x}.

• Then the quantile-normalized values are $x^{norm}_{ij} = \bar{x}_{(r_{ij})}$.

Page 37: DATA TRANSFORMATION and NORMALIZATION Lecture Topic 4

Numerical Example

• Consider a situation with 5 spots per array and two replicate arrays (numbers in parentheses are the within-array ranks):

• Spot:    1      2     3     4      5
• Array 1: 16(5)  0(1)  9(3)  11(4)  7(2)
• Array 2: 13(4)  3(1)  5(2)  14(5)  8(3)

• Order each array: Array 1: 0 7 9 11 16; Array 2: 3 5 8 13 14

• Average the ordered values: 1.5 6 8.5 12 15

• Replace each rank by the corresponding average. The normalized arrays are:

• Array 1: 15  1.5  8.5  12  6.0
• Array 2: 12  1.5  6.0  15  8.5
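A minimal numpy sketch that reproduces this example (argsort-based ranking breaks ties arbitrarily; tied ranks would need extra handling):

import numpy as np

def quantile_normalize(x):
    # x: spots in rows, arrays in columns
    ranks = np.argsort(np.argsort(x, axis=0), axis=0)  # 0-based rank of each spot within its array
    compromise = np.sort(x, axis=0).mean(axis=1)       # mean of the jth-smallest values across arrays
    return compromise[ranks]                           # replace each value by its compromise quantile

x = np.array([[16.0, 13.0],
              [0.0, 3.0],
              [9.0, 5.0],
              [11.0, 14.0],
              [7.0, 8.0]])   # columns: Array 1, Array 2

print(quantile_normalize(x))
# Array 1 -> 15, 1.5, 8.5, 12, 6   and   Array 2 -> 12, 1.5, 6, 15, 8.5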

Page 38: DATA TRANSFORMATION and NORMALIZATION Lecture Topic 4

Conclusion

• No unique normalization method for the same data. It depends on what kind of experiment you have and what the data look like.

• No absolute criteria for normalization. Basically, the normalized log ratio should be centered around 0.

• Nowadays the focus is on using nonparametric regression methods to remove trends or spatial artifacts from the data.

• Quantile normalization (though not liked by BIOLOGISTS) is catching on as well.