Upload
vinaya
View
38
Download
1
Embed Size (px)
DESCRIPTION
Graphical Exploration of Gene Expression Data by Spectral Map Analysis. L. Wouters, H. Göhlmann, L. Bijnens, P. Lewi. Outline. Short introduction to microarray technology Exploratory multivariate data analysis Multivariate projection methods Principal component analysis - PowerPoint PPT Presentation
Citation preview
September 2002 Center for Statistics, transnational University Limburg,
Hasselt, Belgium and J&J PRD, Janssen Pharmaceutica, Beerse, Belgium
1
Graphical Exploration of Gene Expression Data by Spectral Map
Analysis
L. Wouters, H. Göhlmann, L. Bijnens, P. Lewi
September 2002 Center for Statistics, transnational University Limburg,
Hasselt, Belgium and J&J PRD, Janssen Pharmaceutica, Beerse, Belgium
2
Outline
• Short introduction to microarray technology
• Exploratory multivariate data analysis
• Multivariate projection methods
• Principal component analysis
• Correspondence factor analysis
• Spectral map analysis
• Case data: Leukemia data (Golub, 1999)
• Comparison of multivariate projection methods
September 2002 Center for Statistics, transnational University Limburg,
Hasselt, Belgium and J&J PRD, Janssen Pharmaceutica, Beerse, Belgium
3
Introduction to DNA Microarray Experiments
Hybridization
LabellingmRNACell
Chip Gene Probe
18µm
Scan
September 2002 Center for Statistics, transnational University Limburg,
Hasselt, Belgium and J&J PRD, Janssen Pharmaceutica, Beerse, Belgium
4
Structure of Microarray Data
n genes
p samples
Y = • p n
(e.g. p = 38, n = 5327)
September 2002 Center for Statistics, transnational University Limburg,
Hasselt, Belgium and J&J PRD, Janssen Pharmaceutica, Beerse, Belgium
5
Exploratory Multivariate Analysis&
Gene Expression Data
• Supervised learning: Which genes discriminate known groups of subjects ?
• Unsupervised learning:
1. Do gene expression profiles reveal groups of patients (class discovery) ?
2. Which genes correlate with these groups ?
• clustering methods
• projection methods
September 2002 Center for Statistics, transnational University Limburg,
Hasselt, Belgium and J&J PRD, Janssen Pharmaceutica, Beerse, Belgium
6
Multivariate Projection Methods &
Gene Expression Data
• Principal Component AnalysisLefkovits, I., et al. (1988)Hilsenbeck S.G., et al. (1999)
• Correspondence Factor AnalysisFellenberg, K., et al. (2001)
• Spectral Map Analysis
September 2002 Center for Statistics, transnational University Limburg,
Hasselt, Belgium and J&J PRD, Janssen Pharmaceutica, Beerse, Belgium
7
Key Steps in Multivariate Projection MethodsLewi, P.J. (1993)
• Weighting by marginal means or other
• Re-expressionreplace each table element by its logarithm
• Closure (row, column, or double)divide each table element by marginal totals
• Centering (row, column, or double)subtract from each table element marginal mean
• Normalization (row, column, or global)divide each table element by standard deviation
• Factorizationgeneralized SVD
• Projection of scores and loadings with proper scalingbiplot
September 2002 Center for Statistics, transnational University Limburg,
Hasselt, Belgium and J&J PRD, Janssen Pharmaceutica, Beerse, Belgium
8
Principal Component Analysis
• Pearson (1901), Hotelling (1933)
• Optional log-transformation
• Constant weighting of row and columns
• Column centering
• Column normalization
• Centering and normalization handle tables asymmetrically:R-mode - Q-mode analysis
September 2002 Center for Statistics, transnational University Limburg,
Hasselt, Belgium and J&J PRD, Janssen Pharmaceutica, Beerse, Belgium
9
Correspondence Factor Analysis
• Benzécri (1973), Greenacre (1984)
• Weighting of row and columns by marginal row and column totals
• Double closure (rank reduction)
• Double centering
• Global normalization
• Distances are chi-square related
September 2002 Center for Statistics, transnational University Limburg,
Hasselt, Belgium and J&J PRD, Janssen Pharmaceutica, Beerse, Belgium
10
Spectral Map Analysis
• Lewi (1976)
• Constant weighting of row and columns, or weighting by marginal row and column totals
• Logarithmic re-expression
• Double centering (rank reduction)
• Global normalization
• Distances are contrasts (log-ratio)
• Areas of symbols are proportional to marginal row-and column-totals or to a selected column
September 2002 Center for Statistics, transnational University Limburg,
Hasselt, Belgium and J&J PRD, Janssen Pharmaceutica, Beerse, Belgium
11
Spectral Map AnalysisDouble Centering
• On log-basis related to double closure
• Treats row- and column-space equivalently
• Geometrically results in a projection of the data on a hyperplane orthogonal to the line of identiy
• Number of dimensions is reduced by one
• Effectively removes a “size” component from the data
September 2002 Center for Statistics, transnational University Limburg,
Hasselt, Belgium and J&J PRD, Janssen Pharmaceutica, Beerse, Belgium
12
Properties of Gene Expression Data
• Presence of extreme high values:
logarithmic re-expression
• Importance of level of gene expression, differences at lower levels are less reliable:
weighting for marginal totals
• Extreme rectangular shape of matrices, e.g. 7000 rows, 20 columns:
• assymetric factor scaling (downplay variability among genes)
• label only most extreme genes
September 2002 Center for Statistics, transnational University Limburg,
Hasselt, Belgium and J&J PRD, Janssen Pharmaceutica, Beerse, Belgium
13
Case StudyMIT Leukemia Data (Golub, 1999)
• Training sample:
• 38 bone marrow samples
• 27 acute lymphoblastic leukemia ALL (T-lineage, B-lineage)
• 11 acute myeloid leukemia AML
• 6817 genes (5327 used)
• arbitrary choice of 50 informative genes to discriminate between ALL and AML
• Validation sample:
• 34 bone marrow samples
• 20 ALL (T-lineage, B-lineage), 14 AML
September 2002 Center for Statistics, transnational University Limburg,
Hasselt, Belgium and J&J PRD, Janssen Pharmaceutica, Beerse, Belgium
14
Class Discovery & Gene Detection
• Use case study to evaluate & validate three multivariate projection methods:
• Ability to discover (known) classes
• Identify genes related to these classes
• Quantify relationships
September 2002 Center for Statistics, transnational University Limburg,
Hasselt, Belgium and J&J PRD, Janssen Pharmaceutica, Beerse, Belgium
15
Principal Component Analysis
PC1 71 %
PC
2 3
%
1
4
5
7
8
12
1315
16
1718
19
20
21
22
24
25
26
272
3
6
9
10
11
14
23
3435
36
37
38
28
29
30
31
32
33
AML
ALL-T
ALL-B
September 2002 Center for Statistics, transnational University Limburg,
Hasselt, Belgium and J&J PRD, Janssen Pharmaceutica, Beerse, Belgium
16
Principal Component AnalysisConclusions
• Results of PCA obscured by presence of size component, i.e. overall level of expression of genes
• Poor performance in class discovery
• Useless for detecting genes related to classes
September 2002 Center for Statistics, transnational University Limburg,
Hasselt, Belgium and J&J PRD, Janssen Pharmaceutica, Beerse, Belgium
17
Correspondence Factor Analysis
AML
ALL-T
ALL-B
PC1 17 %
PC
2 1
0 %
1
4
5
7
8
12
13
15
1617 18
19
20
21
22
24
25
26
27
2
3
6
910
11
14
23
34
35
3637
38
28
29
30
31
32
33
September 2002 Center for Statistics, transnational University Limburg,
Hasselt, Belgium and J&J PRD, Janssen Pharmaceutica, Beerse, Belgium
18
Correspondence Factor AnalysisConclusions
• Excellent performance in class discovery
• Difficult to detect genes related to classes
• Difficult to interpret distances between genes, samples (chi-square values)
September 2002 Center for Statistics, transnational University Limburg,
Hasselt, Belgium and J&J PRD, Janssen Pharmaceutica, Beerse, Belgium
19
Spectral Map Analysis Marginal Means Weighted
AML
ALL-T
ALL-B
PC1 14 %
PC
2 1
2 %
D83920
HG3576-HT3779
K01911
L08895
L33930
M12886
M27891
M28130
M28826
M38690 M57466
M57710M57731
M63438
M83667M84526
M87789
M89957
U36922
U93049X00437
X04145 X05908
X62744
X76223
X82240
Z49194
14
5
7
8
12
13
1516
17
18
19
2021
22
24
25
26
27
2
36
9
10
11
14
23
34
35 36 37
38
2829
30
31
32
33
September 2002 Center for Statistics, transnational University Limburg,
Hasselt, Belgium and J&J PRD, Janssen Pharmaceutica, Beerse, Belgium
20
Spectral Map AnalysisConclusions
• Excellent performance in class discovery
• Allows to detect genes related to classes
• Distances between objects are contrasts (ratios)
• Suggests data reduction
September 2002 Center for Statistics, transnational University Limburg,
Hasselt, Belgium and J&J PRD, Janssen Pharmaceutica, Beerse, Belgium
21
Weighted Spectral Map Analysis0.5 % Most Extreme Genes
AML
ALL-T
ALL-B
PC1 43 %
PC
2 3
2 %
D83920
HG3576-HT3779
K01911
L08895
L33930
M12886
M27891
M28130
M28826
M38690
M57466
M57710M57731
M63438
M83667
M84526
M87789
M89957
U36922
U93049
X00437
X04145
X05908
X62744 X76223
X82240
Z49194
14
5
7
8
12
13
15
16
17
18
1920
21
22
24
25
26
27
2
3
69
10
11
14
23
34
35
3637
38
28
29
30
31
32
33
September 2002 Center for Statistics, transnational University Limburg,
Hasselt, Belgium and J&J PRD, Janssen Pharmaceutica, Beerse, Belgium
22
Positioning of Validation Set
AML
ALL-T
ALL-B
PC1 43 %
PC
2 3
2 %
D83920
HG3576-HT3779
K01911
L08895
L33930
M12886
M27891
M28130
M28826
M38690
M57466
M57710M57731
M63438
M83667
M84526
M87789
M89957
U36922
U93049
X00437
X04145
X05908
X62744 X76223
X82240
Z49194
39
40
42
47
48
49
41
4344
45
46
7071
72
68
69
55
56
59
6752
53
51
50
54
57
5860
61
65
66
63
64
62
September 2002 Center for Statistics, transnational University Limburg,
Hasselt, Belgium and J&J PRD, Janssen Pharmaceutica, Beerse, Belgium
23
Golub Data Set Conclusion
• Weighted SMA allows to reduce data from 5327 genes to 27 genes without loss of important information
• Positioning validation data indicates 27 genes are also predictive for new cases
• Further data reduction possible ?
• Quantify samples for gene specificity ?
September 2002 Center for Statistics, transnational University Limburg,
Hasselt, Belgium and J&J PRD, Janssen Pharmaceutica, Beerse, Belgium
24
Spectral Mapping as a Tool forQuantitative Differential Gene Expression
AML
ALL-T
ALL-B
September 2002 Center for Statistics, transnational University Limburg,
Hasselt, Belgium and J&J PRD, Janssen Pharmaceutica, Beerse, Belgium
25
Conclusions
• Weighted SMA outperforms the other projection methods in:
• Visualizing data
• Class detection of biological samples
• Identifying important genes, confirmed by literature search
• Reducing the data
• Identifying & interpreting relations between genes and subjects
• Quantifying differential gene expression
September 2002 Center for Statistics, transnational University Limburg,
Hasselt, Belgium and J&J PRD, Janssen Pharmaceutica, Beerse, Belgium
26
SPM Library
• Principal component analysis
PCA<-spectralmap(DATA,center="column",normal="column")plot(PCA,scale="uvc",label.tol=c(1,1),col.group=GRP)
• Correspondence analysis
CFA<spectralmap(DATA,logtrans=False,row.weight="mean”, col.weight="mean",closure="double") plot(CFA,scale="uvc",label.tol=c(1,1),col.group=GRP)
• Spectral map analysis, weighted by marginal totals
SPM<-spectralmap(DATA,row.weight="mean“,col.weight=“mean”) plot(SPM,scale="uvc",col.group=GRP)
September 2002 Center for Statistics, transnational University Limburg,
Hasselt, Belgium and J&J PRD, Janssen Pharmaceutica, Beerse, Belgium
27
References
• Wouters, L., Göhlmann, H.W., Bijnens, L., Kass, S.U., Molenberghs, G., Lewi, P.J. (2002). Graphical Exploration of Gene Expression Data: A Comparative Study of Three Multivariate Methods. submitted Biometrics.
• SPM-library availability:email: [email protected] http://alpha.luc.ac.be/~lucp1456/