Gene Expression Arrays

EPP 245 Statistical Analysis of Laboratory Data

November 9, 2006

Basic Design of Expression Arrays

• For each gene that is a target for the array, we have a known DNA sequence.

• mRNA is reverse transcribed to DNA, and if a complementary sequence is on the chip, the DNA will be more likely to stick

• The DNA is labeled with a dye that will fluoresce and generate a signal that is monotonic in the amount in the sample


[Figure: double-stranded DNA sequence with exon and intron regions labeled]

Probe Sequence

• cDNA arrays use variable-length probes derived from expressed sequence tags
  – Spotted and almost always used with two-color methods
  – Can be used in species with an unsequenced genome

• Long oligo arrays use 60-70mers
  – Agilent two-color arrays
  – Spotted arrays from UC Davis or elsewhere
  – Usually use computationally derived probes but can use probes from sequenced ESTs


• Affymetrix GeneChips use multiple 25-mers
  – For each gene, one or more sets of 8-20 distinct probes
  – May overlap
  – May cover more than one exon

• Affymetrix chips also use mismatch (MM) probes that have the same sequence as perfect match probes except for the middle base, which is changed to inhibit binding.

• The MM probes are supposed to act as controls, but they often bind to another mRNA species instead, so many analysts do not use them



Probe Design

• A good probe sequence should match the chosen gene or exon from a gene and should not match any other gene in the genome.

• Melting temperature depends on the GC content and should be similar on all probes on an array since the hybridization must be conducted at a single temperature.
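As a rough illustration of the GC-content point, the Python sketch below computes the GC fraction of a probe and a crude melting-temperature estimate. The 25-mer is an arbitrary example sequence, and the formula Tm = 64.9 + 41 (G + C - 16.4) / N is a commonly quoted rule of thumb that ignores salt concentration and nearest-neighbor effects. Both are assumptions for illustration, not the design procedure of any array manufacturer.

```python
def gc_fraction(seq):
    """Fraction of bases in seq that are G or C."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)


def rough_tm(seq):
    """Crude melting temperature in degrees C from GC count and length.

    Rule-of-thumb formula (an assumption for illustration only).
    """
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    return 64.9 + 41.0 * (gc - 16.4) / len(seq)


probe = "TAAATCGATACGCATTAGTTCGACC"  # arbitrary 25-mer for illustration
print(len(probe), gc_fraction(probe), round(rough_tm(probe), 1))
```

Probes with very different GC fractions would get very different estimates, which is why probe selection tries to keep GC content similar across the array.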



• The affinity of a given piece of DNA for the probe sequence can depend on many things, including secondary and tertiary structure as well as GC content.

• This means that the relationship between the concentration of the RNA species in the original sample and the brightness of the spot on the array can be very different for different probes for the same gene.

• Thus only comparisons of intensity within the same probe across arrays make sense.



Affymetrix GeneChips

• For each probe set, there are 8-20 perfect match (PM) probes which may overlap or not and which target the same gene

• There are also mismatch (MM) probes which are supposed to serve as a control, but do so rather badly

• Most of us ignore the MM probes



Expression Indices

• A key issue with Affymetrix chips is how to summarize the multiple data values on a chip for each probe set (aka gene).

• There have been a large number of suggested methods.

• Generally, the worst ones are those from Affy, by a long way; worse means less able to detect real differences



Usable Methods

• Li and Wong’s dCHIP and follow-on work are demonstrably better than MAS 4.0 and MAS 5.0, but not as good as RMA and GLA

• ArrayAssist can use dCHIP, RMA, gcRMA, and others.

• The GLA method (Durbin, Rocke, Zhou) can be imported into ArrayAssist.



Steps in Expression Index Construction

• Background correction is the process of adjusting the signals so that the zero point is similar on all parts of all arrays.

• We like to manage this so that zero signal after background correction corresponds approximately to zero amount of the mRNA species that is the target of the probe set.



• Data transformation is the process of changing the scale of the data so that it is more comparable from high to low.

• Common transformations are the logarithm and generalized logarithm

• Normalization is the process of adjusting for systematic differences from one array to another.

• Normalization may be done before or after transformation, and before or after probe set summarization.



• One may use only the perfect match (PM) probes, or may subtract or otherwise use the mismatch (MM) probes

• There are many ways to summarize 20 PM probes and 20 MM probes on 10 arrays (total of 400 numbers, 200 of them PM) into 10 expression index numbers



The RMA Method

• Background correction that does not make 0 signal correspond to 0 amount

• Quantile normalization makes the overall distribution of intensity values across probes the same on each array

• Log2 transform

• Median polish summary of PM probes
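The quantile normalization step can be sketched in a few lines of Python. This is a minimal illustration under simplifying assumptions (every array has the same number of probes, and ties are broken arbitrarily), not the RMA implementation itself; the function name and the toy data are hypothetical.

```python
def quantile_normalize(arrays):
    """Give every array the same intensity distribution.

    arrays: list of equal-length lists, one list of probe intensities per array.
    Returns new lists in which the k-th smallest value on each array is
    replaced by the mean of the k-th smallest values over all arrays.
    """
    n = len(arrays[0])
    sorted_arrays = [sorted(a) for a in arrays]
    # Reference distribution: mean of the order statistics across arrays
    reference = [sum(s[k] for s in sorted_arrays) / len(arrays) for k in range(n)]
    result = []
    for a in arrays:
        order = sorted(range(n), key=lambda i: a[i])  # probe indices by rank
        normalized = [0.0] * n
        for rank, idx in enumerate(order):
            normalized[idx] = reference[rank]
        result.append(normalized)
    return result


toy = [[2, 4, 6], [10, 30, 20]]  # two arrays, three probes (made-up values)
print(quantile_normalize(toy))
```

After this step each array contains exactly the same set of values, only assigned to different probes according to their ranks on that array.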


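Analysis by means

• Remove row means

• Remove column means

• Rows and columns have mean 0

• Influence of an outlier spreads

Original data (row means in the last column, column means in the last row):

 4.00    6.00    5.00  |   5.00
 8.00    9.00    7.00  |   8.00
12.00   24.00   12.00  |  16.00
-----------------------
 8.00   13.00    8.00

After removing row means:

-1.00    1.00    0.00  |   0.00
 0.00    1.00   -1.00  |   0.00
-4.00    8.00   -4.00  |   0.00
-----------------------
-1.67    3.33   -1.67

After removing column means:

 0.67   -2.33    1.67  |   0.00
 1.67   -2.33    0.67  |   0.00
-2.33    4.67   -2.33  |   0.00
-----------------------
 0.00    0.00    0.00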
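The sweep by means shown in this example can be reproduced with a short Python sketch. The function name is hypothetical; the 3 x 3 table is the one from the slide.

```python
from statistics import mean


def sweep_means(table):
    """Subtract row means, then column means, and return the residuals."""
    row_means = [mean(row) for row in table]
    resid = [[x - m for x in row] for row, m in zip(table, row_means)]
    col_means = [mean(col) for col in zip(*resid)]
    return [[x - m for x, m in zip(row, col_means)] for row in resid]


table = [[4, 6, 5],
         [8, 9, 7],
         [12, 24, 12]]

for row in sweep_means(table):
    print([round(x, 2) for x in row])
```

The single large value 24 leaves a residual of only 4.67 in its own cell and pushes the other residuals in its row and column away from zero, which is the spreading of outlier influence noted above.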


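Median Polish

• Remove row medians

• Remove column medians

• Rows and columns may not have median 0

• Outliers contained

• May have to be iterated

Original data (row medians in the last column, column medians in the last row):

 4.00    6.00    5.00  |   5.00
 8.00    9.00    7.00  |   8.00
12.00   24.00   12.00  |  12.00
-----------------------
 8.00    9.00    7.00

After removing row medians:

-1.00    1.00    0.00  |   0.00
 0.00    1.00   -1.00  |   0.00
 0.00   12.00    0.00  |   0.00
-----------------------
 0.00    1.00    0.00

After removing column medians:

-1.00    0.00    0.00  |   0.00
 0.00    0.00   -1.00  |   0.00
 0.00   11.00    0.00  |   0.00
-----------------------
 0.00    0.00    0.00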
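A minimal Python sketch of median polish on the same 3 x 3 table follows. It returns only the residuals, as on the slide; the function name and the `max_iter` cap are hypothetical choices, and a full expression summary would also keep the row and column effects that are swept out.

```python
from statistics import median


def median_polish(table, max_iter=10):
    """Alternately remove row and column medians until both are all zero."""
    resid = [[float(x) for x in row] for row in table]
    for _ in range(max_iter):
        row_med = [median(row) for row in resid]
        resid = [[x - m for x in row] for row, m in zip(resid, row_med)]
        col_med = [median(col) for col in zip(*resid)]
        resid = [[x - m for x, m in zip(row, col_med)] for row in resid]
        # Stop once a full pass changed nothing
        if all(m == 0 for m in row_med) and all(m == 0 for m in col_med):
            break
    return resid


table = [[4, 6, 5],
         [8, 9, 7],
         [12, 24, 12]]

for row in median_polish(table):
    print(row)
```

Here the outlier stays confined to a single residual of 11, and the other cells are left at 0 or -1.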


Example Probe Set

• Using the Affy HG U133 Plus 2.0 GeneChip with 54675 probe sets, from 604258 PM probes.

• Four chips derived from human IR exposed skin at 0, 1, 10, and 100 cGy

• Probe set number 10067/54675 has Affy ID 200618_at

• Gene is LASP1, LIM and SH3 protein 1, LIM protein subfamily, Src homology, actin binding.


Mean Summarization

Columns are doses in cGy. The last column is the probe mean and the last row is the array mean.

Probe             0       1      10     100     Mean
200618_at1      360     216     158     198    233.0
200618_at2      313     402     106     103    231.0
200618_at3      130     182      79      91    120.5
200618_at4      351     370     195     136    263.0
200618_at5      164     130      98     107    124.8
200618_at6      223     219     164     196    200.5
200618_at7      437     529     195     158    329.8
200618_at8      509     554     274     128    366.3
200618_at9      522     720     285     198    431.3
200618_at10     668     715     247     260    472.5
200618_at11     306     286     144     159    223.8
Mean          362.1   393.0   176.8   157.6


Mean Summarization of the Logs

Values are base-10 logs of the intensities. Columns are doses in cGy. The last column is the probe mean and the last row is the array mean.

Probe             0       1      10     100     Mean
200618_at1     2.56    2.33    2.20    2.30     2.35
200618_at2     2.50    2.60    2.03    2.01     2.28
200618_at3     2.11    2.26    1.90    1.96     2.06
200618_at4     2.55    2.57    2.29    2.13     2.38
200618_at5     2.21    2.11    1.99    2.03     2.09
200618_at6     2.35    2.34    2.21    2.29     2.30
200618_at7     2.64    2.72    2.29    2.20     2.46
200618_at8     2.71    2.74    2.44    2.11     2.50
200618_at9     2.72    2.86    2.45    2.30     2.58
200618_at10    2.82    2.85    2.39    2.41     2.62
200618_at11    2.49    2.46    2.16    2.20     2.33
Mean           2.51    2.53    2.21    2.18
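The two summaries above can be recomputed from the raw PM intensities with a short Python sketch. The variable names are hypothetical; the numbers are the 11 PM probes of 200618_at on the four arrays.

```python
from math import log10
from statistics import mean

# Rows are PM probes 1-11 of 200618_at; columns are the 0, 1, 10, 100 cGy arrays
pm = [
    [360, 216, 158, 198],
    [313, 402, 106, 103],
    [130, 182, 79, 91],
    [351, 370, 195, 136],
    [164, 130, 98, 107],
    [223, 219, 164, 196],
    [437, 529, 195, 158],
    [509, 554, 274, 128],
    [522, 720, 285, 198],
    [668, 715, 247, 260],
    [306, 286, 144, 159],
]

# One expression index per array: mean of raw values, and mean of log10 values
mean_summary = [mean(col) for col in zip(*pm)]
log_mean_summary = [mean(log10(x) for x in col) for col in zip(*pm)]

print([round(m, 1) for m in mean_summary])
print([round(m, 2) for m in log_mean_summary])
```

On the raw scale the brightest probes dominate the average; on the log scale every probe contributes more evenly.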



The GLA Method

• The Glog Average (GLA) method is simpler than the RMA method, though it can require estimation of a parameter

• Background correction is intended to make a measured value of zero correspond to a zero quantity in the sample

• Transformation uses the glog, which is approximately ln for large values

• Normalization via lowess

• Summary is a simple average of PM probes
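To make the glog concrete, here is a minimal Python sketch of one common form of the generalized logarithm, glog(y) = ln((y + sqrt(y^2 + lambda)) / 2). The division by 2 is a scaling choice that makes it agree with ln for large y, and the value of `lam` below is a made-up placeholder: in practice this is the parameter that has to be estimated from the data.

```python
from math import log, sqrt


def glog(y, lam):
    """Generalized log: defined for zero and negative y, close to ln(y) for large y."""
    return log((y + sqrt(y * y + lam)) / 2.0)


lam = 100.0  # hypothetical value; normally estimated from the data
for y in (-20.0, 0.0, 10.0, 1000.0, 1e6):
    print(y, round(glog(y, lam), 4))
```

Unlike the ordinary logarithm, this function stays finite for background-corrected values that are zero or slightly negative.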



Probe Sets not Genes

• It is unavoidable to refer to a probe set as measuring a “gene”, but nevertheless it can be deceptive

• The annotation of a probe set may be based on homology with a gene of possibly known function in a different organism

• Only relatively few probe sets correspond to genes with known function and known structure in the organism being studied



Two-Color Arrays

• Two-color arrays are designed to account for variability in slides and spots by using two samples on each slide, each labeled with a different dye.

• If a spot is too large, for example, both signals will be too big, and the difference or ratio will eliminate that source of variability



Dyes

• The most common dye sets are Cy3 (green) and Cy5 (red), which absorb most strongly at approximately 550 nm and 649 nm respectively and fluoresce at slightly longer wavelengths, roughly 570 nm and 670 nm (red light ~ 700 nm, green light ~ 550 nm)

• The dyes are excited with lasers at 532 nm (Cy3 green) and 635 nm (Cy5 red)

• The emissions are read via filters using a CCD device



File Format

• A slide scanned with Axon GenePix produces a file with extension .gpr that contains the results: http://www.axon.com/gn_GenePix_File_Formats.html

• This contains 29 rows of headers followed by 43 columns of data (in our example files)

• For full analysis one may also need a .gal file that describes the layout of the arrays
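A hedged Python sketch of reading such a file is shown below. It assumes the file is tab-delimited text with quoted column names and that the caller knows how many header rows precede the column-name row; the function name, the `n_header_rows` parameter, and the tiny in-memory demo file are all hypothetical. Check these assumptions against the format documentation before using it on real .gpr files.

```python
import csv
import io


def read_gpr_table(handle, n_header_rows):
    """Skip the header rows, then read the column names and the data rows."""
    for _ in range(n_header_rows):
        next(handle)
    reader = csv.reader(handle, delimiter="\t", quotechar='"')
    columns = next(reader)
    rows = [dict(zip(columns, fields)) for fields in reader if fields]
    return columns, rows


# Made-up miniature file: two header rows, a column-name row, one data row
demo = io.StringIO(
    'header line 1\n'
    'header line 2\n'
    '"Block"\t"Column"\t"Row"\t"F635 Median"\t"B635 Median"\n'
    '1\t1\t1\t48\t34\n'
)
columns, rows = read_gpr_table(demo, n_header_rows=2)
print(columns)
print(rows[0]["F635 Median"])
```

All values come back as strings, so intensities need to be converted with `float` before analysis.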



The 43 data columns are:

"Block" "Column" "Row" "Name" "ID" "X" "Y" "Dia." "F635 Median" "F635 Mean" "F635 SD" "B635 Median" "B635 Mean" "B635 SD" "% > B635+1SD" "% > B635+2SD" "F635 % Sat."

"F532 Median" "F532 Mean" "F532 SD" "B532 Median" "B532 Mean" "B532 SD" "% > B532+1SD" "% > B532+2SD" "F532 % Sat."

"Ratio of Medians (635/532)" "Ratio of Means (635/532)" "Median of Ratios (635/532)" "Mean of Ratios (635/532)" "Ratios SD (635/532)" "Rgn Ratio (635/532)" "Rgn R² (635/532)" "F Pixels" "B Pixels" "Sum of Medians" "Sum of Means" "Log Ratio (635/532)" "F635 Median - B635" "F532 Median - B532" "F635 Mean - B635" "F532 Mean - B532" "Flags"



Analysis Choices

• Mean or median foreground intensity

• Background corrected or not

• Log transform (base 2, e, or 10) or glog transform

• Log is compatible only with no background correction, because background-corrected intensities can be zero or negative

• Glog is best with background correction



Block 1

Column 1

Row 1

Name NM_006182

ID discoidin domain receptor family, member

X 2575

Y 2565

Dia. 85

DDR1



F635 Median 48

F635 Mean 54

F635 SD 23

B635 Median 34

B635 Mean 36

B635 SD 11

% > B635+1SD 52

% > B635+2SD 36

F635 % Sat. 0

F532 Median 109

F532 Mean 113

F532 SD 26

B532 Median 35

B532 Mean 36

B532 SD 7

% > B532+1SD 100

% > B532+2SD 100

F532 % Sat. 0
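For this example spot, the analysis choices listed earlier lead to quite different numbers. The Python sketch below uses the median foreground and background values from the spot record and compares the log2 red/green ratio with and without background subtraction. The variable names are hypothetical, and base 2 is just one of the bases mentioned above.

```python
from math import log2

# Median foreground (F) and background (B) values for the example spot
f635, b635 = 48, 34    # red channel (635 nm)
f532, b532 = 109, 35   # green channel (532 nm)

raw_log_ratio = log2(f635 / f532)

red_bc = f635 - b635     # background-corrected red
green_bc = f532 - b532   # background-corrected green
bc_log_ratio = log2(red_bc / green_bc)

print(red_bc, green_bc)
print(round(raw_log_ratio, 2), round(bc_log_ratio, 2))
```

The red signal is only a little above its background here, so the corrected ratio is far more extreme than the uncorrected one; on a spot where the foreground fell below the background the corrected log ratio would not exist at all.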



Issues with Two-Color Arrays

• Chips have different overall intensities, so normalization across chips is needed.

• The overall intensity on the red channel may be greater or less than on the green channel, so normalization across dyes is needed.

• The red/green difference can be different at different intensity levels



Array normalization

• Array normalization is meant to increase the precision of comparisons by adjusting for variations that cover entire arrays

• Without normalization, the analysis would be valid, but possibly less sensitive

• However, a poor normalization method will be worse than none at all.



Possible normalization methods

• We can equalize the mean or median intensity by adding or multiplying a correction term

• We can use different normalizations at different intensity levels (intensity-based normalization), for example by lowess or quantiles

• We can normalize for other things such as print tips


Example for Normalization

              Group 1             Group 2
          Array 1   Array 2   Array 3   Array 4
Gene 1       1100       900       425       550
Gene 2        110        95        85       110
Gene 3         80        65        55        80


. list Array Group Gene Expression

     +---------------------------------+
     | Array   Group   Gene   Expres~n |
     |---------------------------------|
  1. |     1       1      1       1100 |
  2. |     2       1      1        900 |
  3. |     3       2      1        425 |
  4. |     4       2      1        550 |
  5. |     1       1      2        110 |
     |---------------------------------|
  6. |     2       1      2         95 |
  7. |     3       2      2         85 |
  8. |     4       2      2        110 |
  9. |     1       1      3         80 |
 10. |     2       1      3         65 |
     |---------------------------------|
 11. |     3       2      3         55 |
 12. |     4       2      3         80 |
     +---------------------------------+


. sort Gene

. by Gene: anova Expression Group

-> Gene = 1

                  Number of obs =       4     R-squared     =  0.9042
                  Root MSE      = 117.925     Adj R-squared =  0.8564

       Source |  Partial SS    df       MS           F     Prob > F
   -----------+----------------------------------------------------
        Model |   262656.25     1   262656.25      18.89     0.0491
              |
        Group |   262656.25     1   262656.25      18.89     0.0491
              |
     Residual |     27812.5     2    13906.25
   -----------+----------------------------------------------------
        Total |   290468.75     3  96822.9167


-> Gene = 2

                  Number of obs =       4     R-squared     =  0.0556
                  Root MSE      = 14.5774     Adj R-squared = -0.4167

       Source |  Partial SS    df       MS           F     Prob > F
   -----------+----------------------------------------------------
        Model |          25     1          25       0.12     0.7643
              |
        Group |          25     1          25       0.12     0.7643
              |
     Residual |         425     2       212.5
   -----------+----------------------------------------------------
        Total |         450     3         150

-> Gene = 3

                  Number of obs =       4     R-squared     =  0.0556
                  Root MSE      = 14.5774     Adj R-squared = -0.4167

       Source |  Partial SS    df       MS           F     Prob > F
   -----------+----------------------------------------------------
        Model |          25     1          25       0.12     0.7643
              |
        Group |          25     1          25       0.12     0.7643
              |
     Residual |         425     2       212.5
   -----------+----------------------------------------------------
        Total |         450     3         150
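The F statistics in these ANOVA tables can be checked by hand. The Python sketch below computes the one-way ANOVA sums of squares and F for Gene 1 and Gene 2 on the unnormalized data. The function names are hypothetical, and the p-value helper uses a closed form that holds only for 1 numerator and 2 denominator degrees of freedom, which happens to be the case here (two groups of two arrays). It is not a general F-distribution routine.

```python
from math import sqrt
from statistics import mean


def one_way_anova(groups):
    """Return (F, between-group SS, within-group SS) for a list of groups."""
    values = [x for g in groups for x in g]
    grand = mean(values)
    ss_between = sum(len(g) * (mean(g) - grand) ** 2 for g in groups)
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    df_between = len(groups) - 1
    df_within = len(values) - len(groups)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, ss_between, ss_within


def p_value_df_1_2(f_stat):
    """P(F > f_stat) for 1 and 2 degrees of freedom only (special case)."""
    t = sqrt(f_stat)
    return 1.0 - t / sqrt(2.0 + t * t)


gene1 = [[1100, 900], [425, 550]]  # Group 1 arrays, Group 2 arrays
gene2 = [[110, 95], [85, 110]]

for name, data in (("Gene 1", gene1), ("Gene 2", gene2)):
    f_stat, ssb, ssw = one_way_anova(data)
    print(name, ssb, ssw, round(f_stat, 2), round(p_value_df_1_2(f_stat), 4))
```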


Additive Normalization by Means

              Group 1             Group 2
          Array 1   Array 2   Array 3   Array 4
Gene 1        975       851       541       608
Gene 2        -15        46       201       168
Gene 3        -45        16       171       138


. mean Expression

. ereturn list

scalars:
               e(df_r) =  11
             e(N_over) =  1
                  e(N) =  12
               e(k_eq) =  1
            e(k_eform) =  0

macros:
                e(cmd) : "mean"
              e(title) : "Mean estimation"
          e(estat_cmd) : "estat_vce_only"
            e(varlist) : "Expression"
            e(predict) : "_no_predict"
         e(properties) : "b V"

matrices:
                  e(b) :  1 x 1
                  e(V) :  1 x 1
                 e(_N) :  1 x 1
              e(error) :  1 x 1

functions:
             e(sample)

. matrix ExpMeanMat = e(b)

. matlist ExpMeanMat

             | Express~n
-------------+-----------
          y1 |  304.5833

. scalar ExpMean = ExpMeanMat[1,1]

. display ExpMean
304.58333

. anova Expression Array

. predict ArrayMean

. generate NormExp1 = Expression - ArrayMean + ExpMean


. list Array Group Gene Expression ArrayMean NormExp1

     +--------------------------------------------------------+
     | Array   Group   Gene   Expres~n   ArrayM~n    NormExp1 |
     |--------------------------------------------------------|
  1. |     1       1      1       1100        430    974.5833 |
  2. |     2       1      1        900   353.3333      851.25 |
  3. |     3       2      1        425   188.3333      541.25 |
  4. |     4       2      1        550   246.6667    607.9167 |
  5. |     1       1      2        110        430   -15.41667 |
     |--------------------------------------------------------|
  6. |     2       1      2         95   353.3333    46.24999 |
  7. |     3       2      2         85   188.3333      201.25 |
  8. |     4       2      2        110   246.6667    167.9167 |
  9. |     1       1      3         80        430   -45.41667 |
 10. |     2       1      3         65   353.3333    16.24999 |
     |--------------------------------------------------------|
 11. |     3       2      3         55   188.3333      171.25 |
 12. |     4       2      3         80   246.6667    137.9167 |
     +--------------------------------------------------------+
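The same additive normalization can be written as a short Python sketch, which may make the arithmetic behind NormExp1 easier to follow. The variable names are hypothetical; the data are the three genes on four arrays from the example.

```python
from statistics import mean

# Rows are Gene 1-3; columns are Array 1-4
expr = [[1100, 900, 425, 550],
        [110, 95, 85, 110],
        [80, 65, 55, 80]]

array_means = [mean(col) for col in zip(*expr)]
grand_mean = mean(x for row in expr for x in row)

# Additive normalization: shift each array so its mean equals the grand mean
norm_add = [[x - m + grand_mean for x, m in zip(row, array_means)]
            for row in expr]

for row in norm_add:
    print([round(x, 1) for x in row])
```

Shifting by a constant produces negative values for the low-expression genes on the brightest arrays, which is one sign that an additive correction fits these data poorly.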


. by Gene: anova NormExp1 Group

-> Gene = 1

-> Gene = 2

-> Gene = 3

Multiplicative Normalization by Means

              Group 1             Group 2
          Array 1   Array 2   Array 3   Array 4
Gene 1        779       776       687       679
Gene 2         78        82       137       136
Gene 3         57        56        89        99


. generate NormExp2 = Expression*ExpMean/ArrayMean

. list Array Group Gene Expression ArrayMean NormExp2

     +-------------------------------------------------------+
     | Array   Group   Gene   Expres~n   ArrayM~n   NormExp2 |
     |-------------------------------------------------------|
  1. |     1       1      1       1100        430   779.1667 |
  2. |     2       1      1        900   353.3333   775.8254 |
  3. |     3       2      1        425   188.3333   687.3341 |
  4. |     4       2      1        550   246.6667   679.1385 |
  5. |     1       1      2        110        430   77.91666 |
     |-------------------------------------------------------|
  6. |     2       1      2         95   353.3333   81.89268 |
  7. |     3       2      2         85   188.3333   137.4668 |
  8. |     4       2      2        110   246.6667   135.8277 |
  9. |     1       1      3         80        430   56.66667 |
 10. |     2       1      3         65   353.3333   56.03184 |
     |-------------------------------------------------------|
 11. |     3       2      3         55   188.3333   88.94912 |
 12. |     4       2      3         80   246.6667   98.78378 |
     +-------------------------------------------------------+
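For comparison, here is the multiplicative version as a minimal Python sketch, with hypothetical variable names and the same example data. Each array is rescaled so that its mean equals the grand mean.

```python
from statistics import mean

# Rows are Gene 1-3; columns are Array 1-4
expr = [[1100, 900, 425, 550],
        [110, 95, 85, 110],
        [80, 65, 55, 80]]

array_means = [mean(col) for col in zip(*expr)]
grand_mean = mean(x for row in expr for x in row)

# Multiplicative normalization: scale each array by grand mean / array mean
norm_mult = [[x * grand_mean / m for x, m in zip(row, array_means)]
             for row in expr]

for row in norm_mult:
    print([round(x, 1) for x in row])
```

Because Gene 1 dominates every array mean, scaling by the mean pulls Gene 1 together across arrays and pushes Genes 2 and 3 apart between the two groups.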


. by Gene: anova NormExp2 Group

-> Gene = 1

       Source |  Partial SS    df       MS           F     Prob > F
   -----------+----------------------------------------------------
        Model |  8884.90342     1  8884.90342     453.70     0.0022

-> Gene = 2

-> Gene = 3



Multiplicative Normalization by Medians

              Group 1             Group 2
          Array 1   Array 2   Array 3   Array 4
Gene 1       1025       971       512       512
Gene 2        102       102       102       102
Gene 3         75        70        66        75


. sort Array

. table Array, contents(p50 Expression)

-------------------------
    Array | med(Expres~n)
----------+--------------
        1 |           110
        2 |            95
        3 |            85
        4 |           110
-------------------------

. input ArrayMed

      ArrayMed
  1. 110
  2. 110
  3. 110
  4. 95
  5. 95
  6. 95
  7. 85
  8. 85
  9. 85
 10. 110
 11. 110
 12. 110


. summarize Expression, detail

                         Expression
-------------------------------------------------------------
      Percentiles      Smallest
 1%           55             55
 5%           55             65
10%           65             80       Obs                  12
25%           80             80       Sum of Wgt.          12

50%        102.5                      Mean           304.5833
                        Largest       Std. Dev.      363.1144
75%        487.5            425
90%          900            550       Variance       131852.1
95%         1100            900       Skewness       1.277954
99%         1100           1100       Kurtosis       3.132949

. generate NormExp3 = Expression*102.5/ArrayMed
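The median-based scaling can be sketched in Python as well. The variable names are hypothetical, and the overall median of 102.5 is computed from the data rather than typed in.

```python
from statistics import median

# Rows are Gene 1-3; columns are Array 1-4
expr = [[1100, 900, 425, 550],
        [110, 95, 85, 110],
        [80, 65, 55, 80]]

array_medians = [median(col) for col in zip(*expr)]
overall_median = median(x for row in expr for x in row)

# Scale each array so that its median equals the overall median
norm_med = [[x * overall_median / m for x, m in zip(row, array_medians)]
            for row in expr]

for row in norm_med:
    print([round(x, 1) for x in row])
```

With only three genes the median of each array is always Gene 2, so Gene 2 becomes exactly constant after normalization. With thousands of genes the median is a far more stable reference than the mean when a few genes are very bright.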


-> Gene = 1

-> Gene = 2

       Source |  Partial SS    df       MS           F     Prob > F
   -----------+----------------------------------------------------
        Model |           0     1           0

-> Gene = 3




Intensity-based normalization

• Normalize by means, medians, etc., but do so only in groups of genes with similar expression levels.

• lowess is a procedure that produces a running estimate of the middle, like a robustified mean

• If we subtract each array's lowess fit and add the average of the lowess fits across arrays, we get the lowess normalization
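A full lowess fit is too long to show here, so the Python sketch below illustrates the first bullet instead: normalizing by medians within groups of genes that have similar intensity. It is a simplified stand-in, not lowess. The function name, the binning rule (equal-sized bins ordered by the mean intensity of each gene across arrays), and the toy data are all assumptions made for illustration.

```python
from statistics import mean, median


def binned_median_normalize(arrays, n_bins):
    """Median-normalize arrays separately within bins of similar intensity.

    arrays: list of equal-length lists of (log-scale) intensities, one per array.
    """
    n_genes = len(arrays[0])
    # Reference intensity of each gene: its mean over all arrays
    reference = [mean(a[g] for a in arrays) for g in range(n_genes)]
    order = sorted(range(n_genes), key=lambda g: reference[g])
    bin_size = -(-n_genes // n_bins)  # ceiling division
    result = [list(a) for a in arrays]
    for start in range(0, n_genes, bin_size):
        genes = order[start:start + bin_size]
        bin_medians = [median(a[g] for g in genes) for a in arrays]
        target = mean(bin_medians)
        for k, med in enumerate(bin_medians):
            for g in genes:
                result[k][g] = arrays[k][g] - med + target
    return result


# Toy data: array b is offset from array a by 1 at low and by 3 at high intensity
a = [1, 2, 3, 4, 10, 11, 12, 13]
b = [2, 3, 4, 5, 13, 14, 15, 16]
print(binned_median_normalize([a, b], n_bins=2))
```

A single global shift could not remove an offset that changes with intensity, while the per-bin correction does. Lowess achieves the same effect smoothly instead of in steps.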



Fitting a model to genes

• We can fit a model to the data of each gene after the whole arrays have been background corrected, transformed, and normalized

• Each gene is then tested for whether there is differential expression



Multiplicity Adjustments

• If we test thousands of genes and pick all the ones which are significant at the 5% level, we will get hundreds of false positives.

• Multiplicity adjustments winnow this down so that the number of false positives is smaller



Types of Multiplicity Adjustments

• The Bonferroni correction aims to detect no significant genes at all if there are truly none, and guarantees that the chance that any will be detected is less than .05 under these conditions

• Generally, this is too conservative

• Less conservative versions include methods due to Holm, Hochberg, and Benjamini and Hochberg (FDR)
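The adjustments named above can be sketched as adjusted p-values in Python. This is a minimal illustration with made-up p-values and hypothetical function names; it covers Bonferroni, Holm, and the Benjamini and Hochberg FDR procedure, and omits Hochberg's step-up method.

```python
def bonferroni(pvals):
    """Multiply every p-value by the number of tests (capped at 1)."""
    m = len(pvals)
    return [min(1.0, p * m) for p in pvals]


def holm(pvals):
    """Step-down Holm adjustment: multipliers m, m-1, ..., 1 on sorted p-values."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running = 0.0
    for rank, i in enumerate(order):
        running = max(running, (m - rank) * pvals[i])
        adjusted[i] = min(1.0, running)
    return adjusted


def benjamini_hochberg(pvals):
    """Step-up FDR adjustment: p * m / rank, made monotone from the largest down."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running = 1.0
    for rank in range(m - 1, -1, -1):
        i = order[rank]
        running = min(running, pvals[i] * m / (rank + 1))
        adjusted[i] = running
    return adjusted


pvals = [0.01, 0.04, 0.03, 0.005]  # made-up p-values for four genes
print(bonferroni(pvals))
print(holm(pvals))
print(benjamini_hochberg(pvals))
```

A gene is declared significant when its adjusted p-value falls below the chosen level. For these made-up values Bonferroni keeps two genes at the 0.05 level, Holm keeps the same two with smaller adjusted values, and the FDR procedure keeps all four.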

