27
September 2002 Center for Statistics, transnational University Limburg, Hasselt, Belgium and J&J PRD, Janssen Pharmaceutica, Beerse, Belgium 1 Graphical Exploration of Gene Expression Data by Spectral Map Analysis L. Wouters, H. Göhlmann, L. Bijnens, P. Lewi

Graphical Exploration of Gene Expression Data by Spectral Map Analysis

  • Upload
    vinaya

  • View
    38

  • Download
    1

Embed Size (px)

DESCRIPTION

Graphical Exploration of Gene Expression Data by Spectral Map Analysis. L. Wouters, H. Göhlmann, L. Bijnens, P. Lewi. Outline. Short introduction to microarray technology Exploratory multivariate data analysis Multivariate projection methods Principal component analysis - PowerPoint PPT Presentation

Citation preview

Page 1: Graphical Exploration of Gene Expression Data by Spectral Map Analysis

September 2002 Center for Statistics, transnational University Limburg,

Hasselt, Belgium and J&J PRD, Janssen Pharmaceutica, Beerse, Belgium

1

Graphical Exploration of Gene Expression Data by Spectral Map

Analysis

L. Wouters, H. Göhlmann, L. Bijnens, P. Lewi

Page 2: Graphical Exploration of Gene Expression Data by Spectral Map Analysis

September 2002 Center for Statistics, transnational University Limburg,

Hasselt, Belgium and J&J PRD, Janssen Pharmaceutica, Beerse, Belgium

2

Outline

• Short introduction to microarray technology

• Exploratory multivariate data analysis

• Multivariate projection methods

• Principal component analysis

• Correspondence factor analysis

• Spectral map analysis

• Case data: Leukemia data (Golub, 1999)

• Comparison of multivariate projection methods

Page 3: Graphical Exploration of Gene Expression Data by Spectral Map Analysis

September 2002 Center for Statistics, transnational University Limburg,

Hasselt, Belgium and J&J PRD, Janssen Pharmaceutica, Beerse, Belgium

3

Introduction to DNA Microarray Experiments

Hybridization

LabellingmRNACell

Chip Gene Probe

18µm

Scan

Page 4: Graphical Exploration of Gene Expression Data by Spectral Map Analysis

September 2002 Center for Statistics, transnational University Limburg,

Hasselt, Belgium and J&J PRD, Janssen Pharmaceutica, Beerse, Belgium

4

Structure of Microarray Data

n genes

p samples

Y = • p n

(e.g. p = 38, n = 5327)

Page 5: Graphical Exploration of Gene Expression Data by Spectral Map Analysis

September 2002 Center for Statistics, transnational University Limburg,

Hasselt, Belgium and J&J PRD, Janssen Pharmaceutica, Beerse, Belgium

5

Exploratory Multivariate Analysis&

Gene Expression Data

• Supervised learning: Which genes discriminate known groups of subjects ?

• Unsupervised learning:

1. Do gene expression profiles reveal groups of patients (class discovery) ?

2. Which genes correlate with these groups ?

• clustering methods

• projection methods

Page 6: Graphical Exploration of Gene Expression Data by Spectral Map Analysis

September 2002 Center for Statistics, transnational University Limburg,

Hasselt, Belgium and J&J PRD, Janssen Pharmaceutica, Beerse, Belgium

6

Multivariate Projection Methods &

Gene Expression Data

• Principal Component AnalysisLefkovits, I., et al. (1988)Hilsenbeck S.G., et al. (1999)

• Correspondence Factor AnalysisFellenberg, K., et al. (2001)

• Spectral Map Analysis

Page 7: Graphical Exploration of Gene Expression Data by Spectral Map Analysis

September 2002 Center for Statistics, transnational University Limburg,

Hasselt, Belgium and J&J PRD, Janssen Pharmaceutica, Beerse, Belgium

7

Key Steps in Multivariate Projection MethodsLewi, P.J. (1993)

• Weighting by marginal means or other

• Re-expressionreplace each table element by its logarithm

• Closure (row, column, or double)divide each table element by marginal totals

• Centering (row, column, or double)subtract from each table element marginal mean

• Normalization (row, column, or global)divide each table element by standard deviation

• Factorizationgeneralized SVD

• Projection of scores and loadings with proper scalingbiplot

Page 8: Graphical Exploration of Gene Expression Data by Spectral Map Analysis

September 2002 Center for Statistics, transnational University Limburg,

Hasselt, Belgium and J&J PRD, Janssen Pharmaceutica, Beerse, Belgium

8

Principal Component Analysis

• Pearson (1901), Hotelling (1933)

• Optional log-transformation

• Constant weighting of row and columns

• Column centering

• Column normalization

• Centering and normalization handle tables asymmetrically:R-mode - Q-mode analysis

Page 9: Graphical Exploration of Gene Expression Data by Spectral Map Analysis

September 2002 Center for Statistics, transnational University Limburg,

Hasselt, Belgium and J&J PRD, Janssen Pharmaceutica, Beerse, Belgium

9

Correspondence Factor Analysis

• Benzécri (1973), Greenacre (1984)

• Weighting of row and columns by marginal row and column totals

• Double closure (rank reduction)

• Double centering

• Global normalization

• Distances are chi-square related

Page 10: Graphical Exploration of Gene Expression Data by Spectral Map Analysis

September 2002 Center for Statistics, transnational University Limburg,

Hasselt, Belgium and J&J PRD, Janssen Pharmaceutica, Beerse, Belgium

10

Spectral Map Analysis

• Lewi (1976)

• Constant weighting of row and columns, or weighting by marginal row and column totals

• Logarithmic re-expression

• Double centering (rank reduction)

• Global normalization

• Distances are contrasts (log-ratio)

• Areas of symbols are proportional to marginal row-and column-totals or to a selected column

Page 11: Graphical Exploration of Gene Expression Data by Spectral Map Analysis

September 2002 Center for Statistics, transnational University Limburg,

Hasselt, Belgium and J&J PRD, Janssen Pharmaceutica, Beerse, Belgium

11

Spectral Map AnalysisDouble Centering

• On log-basis related to double closure

• Treats row- and column-space equivalently

• Geometrically results in a projection of the data on a hyperplane orthogonal to the line of identiy

• Number of dimensions is reduced by one

• Effectively removes a “size” component from the data

Page 12: Graphical Exploration of Gene Expression Data by Spectral Map Analysis

September 2002 Center for Statistics, transnational University Limburg,

Hasselt, Belgium and J&J PRD, Janssen Pharmaceutica, Beerse, Belgium

12

Properties of Gene Expression Data

• Presence of extreme high values:

logarithmic re-expression

• Importance of level of gene expression, differences at lower levels are less reliable:

weighting for marginal totals

• Extreme rectangular shape of matrices, e.g. 7000 rows, 20 columns:

• assymetric factor scaling (downplay variability among genes)

• label only most extreme genes

Page 13: Graphical Exploration of Gene Expression Data by Spectral Map Analysis

September 2002 Center for Statistics, transnational University Limburg,

Hasselt, Belgium and J&J PRD, Janssen Pharmaceutica, Beerse, Belgium

13

Case StudyMIT Leukemia Data (Golub, 1999)

• Training sample:

• 38 bone marrow samples

• 27 acute lymphoblastic leukemia ALL (T-lineage, B-lineage)

• 11 acute myeloid leukemia AML

• 6817 genes (5327 used)

• arbitrary choice of 50 informative genes to discriminate between ALL and AML

• Validation sample:

• 34 bone marrow samples

• 20 ALL (T-lineage, B-lineage), 14 AML

Page 14: Graphical Exploration of Gene Expression Data by Spectral Map Analysis

September 2002 Center for Statistics, transnational University Limburg,

Hasselt, Belgium and J&J PRD, Janssen Pharmaceutica, Beerse, Belgium

14

Class Discovery & Gene Detection

• Use case study to evaluate & validate three multivariate projection methods:

• Ability to discover (known) classes

• Identify genes related to these classes

• Quantify relationships

Page 15: Graphical Exploration of Gene Expression Data by Spectral Map Analysis

September 2002 Center for Statistics, transnational University Limburg,

Hasselt, Belgium and J&J PRD, Janssen Pharmaceutica, Beerse, Belgium

15

Principal Component Analysis

PC1 71 %

PC

2 3

%

1

4

5

7

8

12

1315

16

1718

19

20

21

22

24

25

26

272

3

6

9

10

11

14

23

3435

36

37

38

28

29

30

31

32

33

AML

ALL-T

ALL-B

Page 16: Graphical Exploration of Gene Expression Data by Spectral Map Analysis

September 2002 Center for Statistics, transnational University Limburg,

Hasselt, Belgium and J&J PRD, Janssen Pharmaceutica, Beerse, Belgium

16

Principal Component AnalysisConclusions

• Results of PCA obscured by presence of size component, i.e. overall level of expression of genes

• Poor performance in class discovery

• Useless for detecting genes related to classes

Page 17: Graphical Exploration of Gene Expression Data by Spectral Map Analysis

September 2002 Center for Statistics, transnational University Limburg,

Hasselt, Belgium and J&J PRD, Janssen Pharmaceutica, Beerse, Belgium

17

Correspondence Factor Analysis

AML

ALL-T

ALL-B

PC1 17 %

PC

2 1

0 %

1

4

5

7

8

12

13

15

1617 18

19

20

21

22

24

25

26

27

2

3

6

910

11

14

23

34

35

3637

38

28

29

30

31

32

33

Page 18: Graphical Exploration of Gene Expression Data by Spectral Map Analysis

September 2002 Center for Statistics, transnational University Limburg,

Hasselt, Belgium and J&J PRD, Janssen Pharmaceutica, Beerse, Belgium

18

Correspondence Factor AnalysisConclusions

• Excellent performance in class discovery

• Difficult to detect genes related to classes

• Difficult to interpret distances between genes, samples (chi-square values)

Page 19: Graphical Exploration of Gene Expression Data by Spectral Map Analysis

September 2002 Center for Statistics, transnational University Limburg,

Hasselt, Belgium and J&J PRD, Janssen Pharmaceutica, Beerse, Belgium

19

Spectral Map Analysis Marginal Means Weighted

AML

ALL-T

ALL-B

PC1 14 %

PC

2 1

2 %

D83920

HG3576-HT3779

K01911

L08895

L33930

M12886

M27891

M28130

M28826

M38690 M57466

M57710M57731

M63438

M83667M84526

M87789

M89957

U36922

U93049X00437

X04145 X05908

X62744

X76223

X82240

Z49194

14

5

7

8

12

13

1516

17

18

19

2021

22

24

25

26

27

2

36

9

10

11

14

23

34

35 36 37

38

2829

30

31

32

33

Page 20: Graphical Exploration of Gene Expression Data by Spectral Map Analysis

September 2002 Center for Statistics, transnational University Limburg,

Hasselt, Belgium and J&J PRD, Janssen Pharmaceutica, Beerse, Belgium

20

Spectral Map AnalysisConclusions

• Excellent performance in class discovery

• Allows to detect genes related to classes

• Distances between objects are contrasts (ratios)

• Suggests data reduction

Page 21: Graphical Exploration of Gene Expression Data by Spectral Map Analysis

September 2002 Center for Statistics, transnational University Limburg,

Hasselt, Belgium and J&J PRD, Janssen Pharmaceutica, Beerse, Belgium

21

Weighted Spectral Map Analysis0.5 % Most Extreme Genes

AML

ALL-T

ALL-B

PC1 43 %

PC

2 3

2 %

D83920

HG3576-HT3779

K01911

L08895

L33930

M12886

M27891

M28130

M28826

M38690

M57466

M57710M57731

M63438

M83667

M84526

M87789

M89957

U36922

U93049

X00437

X04145

X05908

X62744 X76223

X82240

Z49194

14

5

7

8

12

13

15

16

17

18

1920

21

22

24

25

26

27

2

3

69

10

11

14

23

34

35

3637

38

28

29

30

31

32

33

Page 22: Graphical Exploration of Gene Expression Data by Spectral Map Analysis

September 2002 Center for Statistics, transnational University Limburg,

Hasselt, Belgium and J&J PRD, Janssen Pharmaceutica, Beerse, Belgium

22

Positioning of Validation Set

AML

ALL-T

ALL-B

PC1 43 %

PC

2 3

2 %

D83920

HG3576-HT3779

K01911

L08895

L33930

M12886

M27891

M28130

M28826

M38690

M57466

M57710M57731

M63438

M83667

M84526

M87789

M89957

U36922

U93049

X00437

X04145

X05908

X62744 X76223

X82240

Z49194

39

40

42

47

48

49

41

4344

45

46

7071

72

68

69

55

56

59

6752

53

51

50

54

57

5860

61

65

66

63

64

62

Page 23: Graphical Exploration of Gene Expression Data by Spectral Map Analysis

September 2002 Center for Statistics, transnational University Limburg,

Hasselt, Belgium and J&J PRD, Janssen Pharmaceutica, Beerse, Belgium

23

Golub Data Set Conclusion

• Weighted SMA allows to reduce data from 5327 genes to 27 genes without loss of important information

• Positioning validation data indicates 27 genes are also predictive for new cases

• Further data reduction possible ?

• Quantify samples for gene specificity ?

Page 24: Graphical Exploration of Gene Expression Data by Spectral Map Analysis

September 2002 Center for Statistics, transnational University Limburg,

Hasselt, Belgium and J&J PRD, Janssen Pharmaceutica, Beerse, Belgium

24

Spectral Mapping as a Tool forQuantitative Differential Gene Expression

AML

ALL-T

ALL-B

Page 25: Graphical Exploration of Gene Expression Data by Spectral Map Analysis

September 2002 Center for Statistics, transnational University Limburg,

Hasselt, Belgium and J&J PRD, Janssen Pharmaceutica, Beerse, Belgium

25

Conclusions

• Weighted SMA outperforms the other projection methods in:

• Visualizing data

• Class detection of biological samples

• Identifying important genes, confirmed by literature search

• Reducing the data

• Identifying & interpreting relations between genes and subjects

• Quantifying differential gene expression

Page 26: Graphical Exploration of Gene Expression Data by Spectral Map Analysis

September 2002 Center for Statistics, transnational University Limburg,

Hasselt, Belgium and J&J PRD, Janssen Pharmaceutica, Beerse, Belgium

26

SPM Library

• Principal component analysis

PCA<-spectralmap(DATA,center="column",normal="column")plot(PCA,scale="uvc",label.tol=c(1,1),col.group=GRP)

• Correspondence analysis

CFA<spectralmap(DATA,logtrans=False,row.weight="mean”, col.weight="mean",closure="double") plot(CFA,scale="uvc",label.tol=c(1,1),col.group=GRP)

• Spectral map analysis, weighted by marginal totals

SPM<-spectralmap(DATA,row.weight="mean“,col.weight=“mean”) plot(SPM,scale="uvc",col.group=GRP)

Page 27: Graphical Exploration of Gene Expression Data by Spectral Map Analysis

September 2002 Center for Statistics, transnational University Limburg,

Hasselt, Belgium and J&J PRD, Janssen Pharmaceutica, Beerse, Belgium

27

References

• Wouters, L., Göhlmann, H.W., Bijnens, L., Kass, S.U., Molenberghs, G., Lewi, P.J. (2002). Graphical Exploration of Gene Expression Data: A Comparative Study of Three Multivariate Methods. submitted Biometrics.

• SPM-library availability:email: [email protected] http://alpha.luc.ac.be/~lucp1456/