Quantitation of Gene Expression for High-Density Oligonucleotide Arrays: A SAFER Approach

Daniel Holder, Bill Pikounis, Richard Raubertas, Vladimir Svetnik, and Keith Soper
Biometrics Research, Merck Research Laboratories


Scale Matters

Additive Fits (probes and chips)

Experimental-Unit Variability

Robustness and Resistance

Goals of Data Analysis

• Which genes have we detected?
• Which genes have changed?
  – Which genes change together?
• Prerequisites
  – Quantify transcript abundance (“gene expression index”)
  – Quantify precision
  – Assess quality

Our Data Analysis Method

• Normalize chips for overall fluorescence (based on MM)*
• Transform data (linear-log hybrid scale)
• Fit probe-specific model using all chips (highly resistant to outliers)*
• Normalize for chip bias (scatterplot smooth)*
• Assess differences (include between-EU variability, e.g., ANOVA)*

* offers opportunities for QC

Fig 1: Hybrid Transformation (knot at c=20)
[Plot of f(x) against x, for x from 0 to 100. Curves: f(x)=x; f(x)=c*ln(x/c)+c; f(x)=hybrid(0,c).]

Linear-log Hybrid Scale

$$
f(x) = \begin{cases}
a & \text{if } x < a \\
x & \text{if } a \le x < c \\
c \ln(x/c) + c & \text{if } x \ge c
\end{cases}
$$

• Typically choose a=0
• Value of c chosen for additivity
• Improved homogeneity of variance
• For low-expression genes, compare differences, not ratios
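The piecewise definition above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code; the function name `hybrid` and its defaults (a=0, c=20, taken from the slide and Fig 1) are choices made here. Note that the log branch matches the linear branch in both value (c) and slope (1) at the knot x=c, so the scale is smooth there.

```python
import math

def hybrid(x, c=20.0, a=0.0):
    """Linear-log hybrid scale: floor at a, identity on [a, c), log above c."""
    if c <= 0 or a > c:
        raise ValueError("need c > 0 and a <= c")
    if x < a:
        return a                      # values below the floor are set to a
    if x < c:
        return x                      # linear region: differences are compared
    return c * math.log(x / c) + c    # log region: ratios are compared

# A negative PM - MM difference is floored, small values pass through,
# and large values are compressed logarithmically.
examples = [hybrid(v) for v in (-5.0, 10.0, 20.0, 200.0)]
```

On this scale a doubling from 200 to 400 shifts the value by c*ln(2), whereas a change from 5 to 10 shifts it by 5, which is the "differences, not ratios" behavior for low-expression genes.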

Probe Specific Effects

• “Probe specific biases…are highly reproducible and predictable, and their adverse effect can be reduced by proper modeling and analysis methods” (Li and Wong, PNAS 2000)
• Multiplicative model for PM - MM, for each probeset (ith chip, jth probe):

$$
y_{ij} = \underbrace{\theta_i}_{\text{chip}} \, \underbrace{\phi_j}_{\text{probe}} + \underbrace{\varepsilon_{ij}}_{\text{error}}
$$

  – Resistance achieved by iteratively omitting extreme points (or chips) and refitting using least squares

Probe Specific Effects (Our Approach)

• For each probeset, resistant, additive fit to PM - MM:

$$
\log^{*}(y_{ij}) = \underbrace{\theta_i}_{\text{chip}} + \underbrace{\phi_j}_{\text{probe}} + \underbrace{\varepsilon_{ij}}_{\text{error}}
$$

  – Use a fitting procedure that is highly resistant to extreme values (median polish)

*Since logs are undefined for non-positive values and unstable for small values, we use a linear-log hybrid scale.
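A minimal sketch of median polish for one probeset follows, assuming rows are chips and columns are probes. The function name and the sweep order (rows first, then columns) are choices made here, not details given in the slides. The input is the 3-chip by 5-probe intensity table from the "Example Median Polish" backup slide; in the SAFER fit the values would first be put on the hybrid scale.

```python
from statistics import median

def median_polish(y, max_iter=20):
    """Additive fit y[i][j] ~ grand + chip[i] + probe[j] via Tukey's median polish."""
    resid = [[float(v) for v in line] for line in y]
    n_chip, n_probe = len(resid), len(resid[0])
    grand, chip, probe = 0.0, [0.0] * n_chip, [0.0] * n_probe
    for _ in range(max_iter):
        # Sweep row (chip) medians out of the residuals.
        row_med = [median(line) for line in resid]
        for i in range(n_chip):
            resid[i] = [v - row_med[i] for v in resid[i]]
            chip[i] += row_med[i]
        # Re-center the probe effects, moving their median into the grand term.
        shift = median(probe)
        probe = [p - shift for p in probe]
        grand += shift
        # Sweep column (probe) medians out of the residuals.
        col_med = [median(resid[i][j] for i in range(n_chip)) for j in range(n_probe)]
        for i in range(n_chip):
            resid[i] = [resid[i][j] - col_med[j] for j in range(n_probe)]
        probe = [p + m for p, m in zip(probe, col_med)]
        # Re-center the chip effects the same way.
        shift = median(chip)
        chip = [c - shift for c in chip]
        grand += shift
        if all(m == 0 for m in row_med) and all(m == 0 for m in col_med):
            break  # a full sweep changed nothing
    return grand, chip, probe, resid

intensities = [
    [0, 26, 29, 92, 111],    # chip 1
    [0, 0, 36, 93, 109],     # chip 2
    [31, 43, 51, 106, 121],  # chip 3
]
grand, chip, probe, resid = median_polish(intensities)
```

The single aberrant value (chip 2, probe 2) ends up as one large residual rather than distorting the chip or probe effects, which is the resistance property the slide relies on.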

Adjusting for Chip Bias

• Initial centering of chips
• Chip bias may depend on gene expression level
• Plot chip effects vs. overall expression level (grand median) for each probeset
• Omit probesets that appear to change
  – Between-group |dev| / within-group |dev|
  – Omit probesets in top 25%
• Fit a resistant scatterplot smoother (loess)
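The screening and smoothing steps above might look roughly like the sketch below. Several things here are assumptions rather than the authors' method: the slide does not define "|dev|", so `change_score` uses median absolute deviations as one plausible reading; all function names are hypothetical; and a running median is swapped in as a simple resistant stand-in for loess, which the slides actually use.

```python
from statistics import median

def change_score(values, groups):
    """Between-group |dev| / within-group |dev| for one probeset (one possible reading)."""
    labels = sorted(set(groups))
    centers = {g: median(v for v, h in zip(values, groups) if h == g) for g in labels}
    overall = median(centers.values())
    between = median(abs(centers[g] - overall) for g in labels)
    within = median(abs(v - centers[g]) for v, g in zip(values, groups))
    if within == 0:
        return float("inf") if between > 0 else 0.0
    return between / within

def keep_stable(scores, drop_fraction=0.25):
    """Indices of probesets that are NOT in the top drop_fraction of change scores."""
    order = sorted(range(len(scores)), key=lambda k: scores[k])
    n_keep = len(scores) - int(round(drop_fraction * len(scores)))
    return sorted(order[:n_keep])

def running_median_bias(level, effect, span=5):
    """Resistant smooth of chip effect vs. expression level (stand-in for loess)."""
    pts = sorted(zip(level, effect))
    half = span // 2
    smooth = []
    for k in range(len(pts)):
        lo, hi = max(0, k - half), min(len(pts), k + half + 1)
        smooth.append((pts[k][0], median(e for _, e in pts[lo:hi])))
    return smooth

score = change_score([10, 12, 20, 22], groups=[0, 0, 1, 1])
kept = keep_stable([0.1, 5.0, 0.2, 0.3])
bias = running_median_bias([1, 2, 3, 4, 5], [0, 0, 10, 0, 0], span=3)
```

The estimated bias curve for a chip would then be subtracted from that chip's effects at the matching expression level; only the probesets returned by `keep_stable` would feed the smoother.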

Fig 4: Typical Chip Normalization Plot
[Plot of chip effects* (hybrid scale) against grand median (0 to 150), one curve per chip for chips 1 to 10. 5 groups, 2 chips/group, 7.1K probesets.]

Terry Speed questions

3. How do you tell that one approach to quantifying expression at the probe set level (e.g., SAFER) is better than another (e.g., dChip)?

• Compare on data for which we ‘know’ the answer
  – Spiking experiments (limited # genes)
  – Validation (e.g., TaqMan)
  – Create POS and NEG groups as best we can
• How to compare (depends on downstream usage)
  – Repeatability
  – E.g., signal to noise → t-statistic → p-value
  – Fold changes

Fibroblast/Adipocyte Mixing Expt

• Mixture %’s (100/0, 75/25, 50/50, 25/75, 0/100)
• 3 chips/mix (15 chips total, Mg74A)
• 3 methods (SAFER, SAFER(log), dChip)
• Create groups of probesets using 100/0 vs. 0/100
  – POS (max p < 0.01, correct oligos, n=1049)
  – NEG (incorrect oligos, n=2611)
  – p-value from t-test (pooled variance, hybrid scale)
• We will change the POS, NEG and p-value definitions on some of the later slides
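As a rough illustration of the pooled-variance t-test named above, the sketch below compares two groups of three chips each, so the test has 3 + 3 - 2 = 4 degrees of freedom. The function names are hypothetical, and the p-value helper uses a closed form of the Student t CDF that is valid only for exactly 4 degrees of freedom; for other designs a general t distribution routine would be needed. The inputs are assumed to be expression indices already on the hybrid scale.

```python
import math

def pooled_t(x, y):
    """Two-sample t statistic with pooled variance; returns (t, degrees of freedom)."""
    nx, ny = len(x), len(y)
    mx, my = sum(x) / nx, sum(y) / ny
    ss_x = sum((v - mx) ** 2 for v in x)
    ss_y = sum((v - my) ** 2 for v in y)
    df = nx + ny - 2
    pooled_var = (ss_x + ss_y) / df   # assumed > 0; identical replicates are not handled
    t = (mx - my) / math.sqrt(pooled_var * (1.0 / nx + 1.0 / ny))
    return t, df

def two_sided_p_df4(t):
    """Two-sided p-value for Student's t with exactly 4 df (closed-form CDF)."""
    u = abs(t) / math.sqrt(1.0 + t * t / 4.0)
    cdf = 0.5 + 0.375 * (u - u ** 3 / 12.0)
    return 2.0 * (1.0 - cdf)

# Hypothetical expression indices: 100/0 chips vs. 0/100 chips for one probeset.
t_stat, df = pooled_t([1.0, 2.0, 3.0], [4.0, 5.0, 6.0])
p_value = two_sided_p_df4(t_stat)
```

The sign of `t_stat` is worth keeping alongside the p-value, since the later POS definition requires the 75/25 vs 25/75 change to have the same sign as the 100/0 vs 0/100 change.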

Fibroblast/Adipocyte Mixing Expt (2)

• Performance based on 75/25 vs 25/75
  – p-values from t-test (pooled variance, hybrid)
  – for POS require same sign as 100/0 vs 0/100
  – pos rate, false pos rate (FPR), pos rate vs FPR
• Linearity?
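The performance measures listed above can be computed as in the following sketch. The function names and the toy p-values are made up for illustration; the only rule taken from the slide is that a POS probeset counts as a hit only when its 75/25 vs 25/75 change has the same sign as its 100/0 vs 0/100 change.

```python
def rates(pos_p, pos_same_sign, neg_p, alpha):
    """Positive rate among POS and 'false' positive rate among NEG at cutoff alpha."""
    pos_hits = sum(1 for p, ok in zip(pos_p, pos_same_sign) if p < alpha and ok)
    neg_hits = sum(1 for p in neg_p if p < alpha)
    return pos_hits / len(pos_p), neg_hits / len(neg_p)

def rate_curve(pos_p, pos_same_sign, neg_p, cutoffs):
    """(false pos rate, pos rate) pairs over a grid of p-value cutoffs."""
    curve = []
    for alpha in cutoffs:
        pos_rate, fpr = rates(pos_p, pos_same_sign, neg_p, alpha)
        curve.append((fpr, pos_rate))
    return curve

# Toy 75/25 vs 25/75 p-values for four POS and four NEG probesets.
pos_p = [0.001, 0.02, 0.2, 0.004]
pos_same_sign = [True, True, True, False]   # last one changed in the wrong direction
neg_p = [0.5, 0.03, 0.9, 0.7]

pos_rate, fpr = rates(pos_p, pos_same_sign, neg_p, alpha=0.05)
curve = rate_curve(pos_p, pos_same_sign, neg_p, cutoffs=[0.01, 0.05, 0.5])
```

Plotting `curve` for each method on common axes gives the kind of positive rate vs false positive rate comparison used to rank the methods.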

Fig 5: CDF for 0% vs 100% (all probesets)
[Plot of cdf (0.0 to 1.0) against nominal p-value (axis ticks from 0.0001 to 0.9). Curves: dChip, SAFER log, SAFER. n = 12,654.]

Fig 6: CDFs for POS and NEG probesets
[Four panels of cdf (0.0 to 1.0) against nominal p-value: 0% vs 100% POS; 25% vs 75% POS; 0% vs 100% NEG; 25% vs 75% NEG. Curves: SAFER, SAFER log, dChip, and the uniform distribution. POS: maxp < 0.01 (n = 1049). NEG: wrong sequence (n = 2611).]

Fig 7: Positive Rate vs ‘False’ Positive Rate, 25% vs 75%
[Plot of pos rate (0.0 to 1.0) against false pos rate (0.0 to 1.0). Curves: SAFER, dChip, SAFER log. POS: maxp < 0.01 (n = 1049). NEG: wrong seq. (n = 2611).]

Fig 8: Positive Rate vs ‘False’ Positive Rate (log scale), 25% vs 75%
[Plot of pos rate (0.0 to 1.0) against false pos rate (log scale, axis ticks from 0.0001 to 0.9). Curves: SAFER, dChip, SAFER log. POS: maxp < 0.01 (n = 1049). NEG: wrong seq. (n = 2611).]

Fig 9: Positive Rate vs ‘False’ Positive Rate (log scale), 25% vs 75%, dChip p-values used for dChip
[Plot of pos rate (0.0 to 1.0) against false pos rate (log scale, axis ticks from 0.0001 to 0.9). Curves: SAFER, SAFER log, dChip. POS: maxp < 0.01 (n = 1038). NEG: wrong seq. (n = 2611).]

Fig 10: Positive Rate vs ‘False’ Positive Rate (log scale), 25% vs 75%
[Plot of pos rate (0.0 to 1.0) against false pos rate (log scale, axis ticks from 0.01 to 0.9). Curves: SAFER, dChip, SAFER log. POS: rank(dChip(p)) < 1000. NEG: wrong seq. & rank(dChip(p)) > 2611-1000.]

Fig 11: Boxplot of R² values for POS probesets
[Boxplots of R² (axis from 0.2 to 1.0) for SAFER, SAFER(log), and dChip. POS: maxp < 0.01 (n = 1049).]

Fig 12: Boxplot of R² values for POS probesets, excluding the 100/0 and 0/100 groups
[Boxplots of R² (axis from 0.0 to 1.0) for SAFER, SAFER(log), and dChip. POS: maxp < 0.01 (n = 1049).]
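The slides do not say how the R² values were computed. One plausible reading, assumed here, is a straight-line regression of each probeset's expression index on the mixture percentage, with the option of excluding the pure 100/0 and 0/100 groups as in Fig 12. The function name and the toy data are hypothetical.

```python
def r_squared(mix_percent, expression, exclude=()):
    """R^2 of a straight-line fit of expression on mixture percentage.

    Chips whose mixture percentage is listed in `exclude` are dropped first.
    """
    pairs = [(x, y) for x, y in zip(mix_percent, expression) if x not in exclude]
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    syy = sum((y - mean_y) ** 2 for _, y in pairs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    # For simple linear regression with an intercept, R^2 equals the squared correlation.
    return sxy ** 2 / (sxx * syy)

mix = [0, 25, 50, 75, 100]
r2_linear = r_squared(mix, [1, 2, 3, 4, 5])       # perfectly linear response
r2_bent = r_squared(mix, [1, 2, 3, 4, 10])        # saturating or jumping end point
r2_inner = r_squared(mix, [1, 2, 3, 4, 10], exclude=(0, 100))
```

Dropping the end groups, which were also used to define POS, gives a check on linearity that does not reuse the chips behind the selection.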

Terry Speed questions

1. Do you lose anything not being able to down-weight non-performing probe pairs in the way Li & Wong can with their phi's (i.e., probe effects)?

Response: We don’t know.

Li & Wong: $y_{ij} = \theta_i \phi_j + \varepsilon_{ij}$
SAFER: $\log^{*}(y_{ij}) = \theta_i + \phi_j + \varepsilon_{ij}$
(chip effect $\theta_i$, probe effect $\phi_j$, error $\varepsilon_{ij}$)

• Down-weighting non-performing probes seems like a good idea.
• Is up-weighting ‘bright’ probes good? (variability, saturation)
• Possible to incorporate weighting in polishing step.

Terry Speed questions

2. Is SAFER QC as thorough as Li & Wong's (in detecting aberrant chips, probe-sets, probe pairs)?

Response: QC is not as thorough, but:

• Primary goal is to quantitate mRNA detection (and error). Explicit QC methods aimed at avoiding the effects of aberrant arrays, probes, and individual observations are less important when resistant methods are used.
• SAFER provides the same raw materials (fitted values and residuals) for QC as Li and Wong. QC summaries can easily be made available.


Conclusions

• For these data, it appears that the SAFER method performs better than dChip.

+ Better sensitivity (ROC Curve)

+ Slightly Better Linearity

• Caveat: This is one analysis of one dataset.

Acknowledgments

• Biometrics Research
  – Bert Gunter
• Other
  – David Gerhold (Pharmacology)
  – John Thompson (Immunology)
  – Eric Muise (Immunology)
  – Karen Richards (Drug Metabolism)
  – Jian Xu (Pharmacology)
  – Yuhong Wang (Bioinformatics)


Backups

Example Median Polish

Intensities (rows: chips; columns: probes)

chip    probe 1   probe 2   probe 3   probe 4   probe 5
1             0        26        29        92       111
2             0         0        36        93       109
3            31        43        51       106       121

Fit: grand median = 36; probe effects = -34, -8, 0, 57, 73; chip effects = -2, 0, 15

Residuals (intensity = grand median + chip effect + probe effect + residual)

chip    probe 1   probe 2   probe 3   probe 4   probe 5
1             0         0        -5         1         4
2            -2       -28         0         0         0
3            14         0         0        -2        -3

Fig 2: Choose c using P-values from Tukey Non-additivity Test
[Plot of P-value (0.0 to 1.0) by scale: Hybrid(0,1), Hybrid(0,20), Hybrid(0,40), Raw. 5 groups, 2 chips/group, 7.1K probesets.]
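To make the idea behind Fig 2 concrete, here is a sketch of Tukey's one-degree-of-freedom test statistic for a chips-by-probes table, evaluated on a raw scale and on a hybrid scale. It returns only the F statistic; turning it into a p-value needs an F distribution with 1 and (r-1)(c-1)-1 degrees of freedom, which is not included. The function names, the synthetic multiplicative data, and the small deterministic perturbation are all assumptions made for illustration.

```python
import math

def hybrid(x, c, a=0.0):
    """Linear-log hybrid scale with floor a and knot c."""
    if x < a:
        return a
    return x if x < c else c * math.log(x / c) + c

def tukey_f(y):
    """F statistic of Tukey's one-degree-of-freedom test for non-additivity."""
    r, c = len(y), len(y[0])
    grand = sum(sum(line) for line in y) / (r * c)
    row = [sum(y[i]) / c - grand for i in range(r)]
    col = [sum(y[i][j] for i in range(r)) / r - grand for j in range(c)]
    # Residual sum of squares from the additive (mean-based) fit.
    ss_resid = sum((y[i][j] - grand - row[i] - col[j]) ** 2
                   for i in range(r) for j in range(c))
    # Sum of squares explained by the row-effect x column-effect product term.
    cross = sum(row[i] * col[j] * y[i][j] for i in range(r) for j in range(c))
    ss_nonadd = cross ** 2 / (sum(a * a for a in row) * sum(b * b for b in col))
    df_rem = (r - 1) * (c - 1) - 1
    return ss_nonadd / ((ss_resid - ss_nonadd) / df_rem)

# Synthetic probeset: chip effect times probe effect, with a small perturbation.
chip = [1.0, 2.0, 4.0]
probe = [30.0, 90.0, 270.0, 810.0]
raw = [[chip[i] * probe[j] * (1 + 0.01 * ((4 * i + j) % 3 - 1)) for j in range(4)]
       for i in range(3)]

f_raw = tukey_f(raw)
f_hybrid = tukey_f([[hybrid(v, 20.0) for v in line] for line in raw])
```

For data like these, the raw scale shows strong non-additivity while the hybrid scale does not; repeating the comparison over candidate values of c across many probesets is the kind of evidence Fig 2 summarizes.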

Fig 3: Within Group SD, Hybrid Scale
[Plot of within-group SD (axis ticks from 0.03125 to 32) against grand effect (0 to 150). 5 groups, 2 chips/group, 7.1K probesets.]

Fig 9: Between EU variability as a percentage of Total variability
[Two panels: all probesets (grand median from 0 to 150) and probesets with mean > 50 on the hybrid scale (grand median from 60 to 160). Vertical axis: 100*Var Between/(Var Between + Var Within), from 0 to 100. P = known expressed; line = loess smooth. 15 human livers, 2 chips/liver, 1.5K probesets.]
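The quantity on the vertical axis, 100*Var Between/(Var Between + Var Within), can be estimated for one probeset as sketched below. The slides do not state the estimator; this sketch assumes a balanced one-way layout (the same number of chips per experimental unit) with method-of-moments variance components, and truncating a negative between-EU estimate at zero is a choice made here. The function name and the toy data are hypothetical.

```python
def between_eu_percent(groups):
    """100 * Var_between / (Var_between + Var_within) for a balanced one-way layout.

    `groups` holds one list of replicate chip values per experimental unit (EU).
    """
    k = len(groups)       # number of experimental units, e.g. livers
    n = len(groups[0])    # chips per unit (assumed equal for all units)
    means = [sum(g) / n for g in groups]
    grand = sum(means) / k
    ms_between = n * sum((m - grand) ** 2 for m in means) / (k - 1)
    ms_within = sum((v - m) ** 2 for g, m in zip(groups, means) for v in g) / (k * (n - 1))
    # E[MS_between] = Var_within + n * Var_between, so solve for Var_between.
    var_between = max(0.0, (ms_between - ms_within) / n)
    total = var_between + ms_within
    return 100.0 * var_between / total if total > 0 else 0.0

# Toy example: three units with two chips each.
pct = between_eu_percent([[1.0, 3.0], [5.0, 7.0], [9.0, 11.0]])
pct_none = between_eu_percent([[1.0, 3.0], [1.0, 3.0]])
```

A large percentage means replicate chips from the same unit say little about unit-to-unit differences, which is why the method's final step includes between-EU variability when assessing differences.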

dChip vs SAFER differences
[Four scatterplot panels: 0% vs 100% (all probesets); 0% vs 100% (POS probesets); 25% vs 75% (all probesets); 25% vs 75% (POS probesets). Axes: dChip Diff (hybrid) against SAFER Diff (hybrid), and dChip Abs[Diff] (hybrid) against SAFER Abs[Diff] (hybrid). POS: maxp < 0.01 (n = 1049).]

Positive Rate vs ‘False’ Positive Rate (log scale), 25% vs 75%
[Plot of pos rate (0.0 to 1.0) against false pos rate (log scale, axis ticks from 0.0001 to 0.9). Curves: SAFER, dChip, SAFER log. POS: maxp < 0.01 (n = 1049). NEG: wrong seq. & minp > 0.5 (n = 270).]