Conquering the Curse of Dimensionality in Gene Expression Cancer Diagnosis: Tough Problem, Simple Models

Minca Mramor 1, Gregor Leban 1, Janez Demšar 1 and Blaž Zupan 1,2

1 Faculty of Computer and Information Science, University of Ljubljana, Slovenia
2 Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, USA



Cancer

• epidemiology
  – second leading cause of death in the developed world
  – increasing number of patients

• carcinogenesis
  – a multifactorial and heterogeneous disease
  – non-lethal injury to the DNA of one cell
  – a multistep process


Use of gene expression microarrays in cancer research

• uncovering the genetic mechanisms (loss of cell cycle control)
• identification of specific genes
• classification of different tumor types
• insight into carcinogenesis
• improvement and individualization of treatment
• development of targeted therapeutics
• identification of biomarkers

Final Goals


SRBCT Example: 6567 genes, 83 patients, 4 classes

1. Initial cuts (image analysis failed – 2308 genes left)
2. 10 dominant components obtained with PCA
3. 3750 feed-forward neural networks
4. Rank genes with the ANN models, select best 96
5. Clear separation of classes using MDS

Khan et al. (Nature Medicine, 2001)


Open Questions & Goals

• Can graphs with clear class separation be found directly from data?
• Can they include only original attributes?
• How many of them are needed for good class separation?
• How are these attributes (genes) related to cancer?
• How useful are prevailing feature selection methods?


Data Sets

Dataset    Samples   Genes   Classes
Leukemia        73    7074         2
Prostate       102   12533         2
DLBCL           77    7070         2
MLL             72   12533         3
SRBCT           83    2308         4


Methods: VizRank

• Visualization techniques

• Visualization scoring and ranking

• Projection search


Methods (VizRank)

• Visualization techniques

• Visualization scoring and ranking

• Projection search

[Figure: two example projections, one with score = 0.76 and one with score = 0.98]
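The scoring and search steps listed above can be sketched in a few lines of Python. This is a minimal illustration, not the VizRank implementation: it assumes the score of a scatterplot is the average probability that a leave-one-out k-NN classifier, working only in the projection plane, assigns to the correct class, and it enumerates every gene pair exhaustively. The function names, the value of k, the unweighted vote, and the toy expression values are all hypothetical.

```python
import math
from itertools import combinations


def knn_score(points, labels, k=2):
    """Mean leave-one-out probability that k-NN assigns to the true class."""
    total = 0.0
    for i, (point, label) in enumerate(zip(points, labels)):
        # Distance from sample i to every other sample in the projection plane.
        others = sorted(
            (math.dist(point, other), labels[j])
            for j, other in enumerate(points)
            if j != i
        )
        nearest = others[:k]
        # Fraction of the k nearest neighbours that share the true class.
        total += sum(1 for _, lab in nearest if lab == label) / len(nearest)
    return total / len(points)


def rank_scatterplots(expression, labels, k=2):
    """Score every gene pair; return (score, (gene_x, gene_y)), best first."""
    ranking = []
    for gene_x, gene_y in combinations(expression, 2):
        points = list(zip(expression[gene_x], expression[gene_y]))
        ranking.append((knn_score(points, labels, k), (gene_x, gene_y)))
    return sorted(ranking, reverse=True)


# Hypothetical toy data: three genes measured on six samples.
toy_labels = ["ALL", "ALL", "ALL", "AML", "AML", "AML"]
toy_expression = {
    "g1": [0.10, 0.20, 0.15, 0.90, 0.80, 0.85],  # separates the classes
    "g2": [0.20, 0.10, 0.25, 0.80, 0.90, 0.75],  # separates the classes
    "g3": [0.50, 0.10, 0.90, 0.40, 0.80, 0.20],  # noise
}

ranking = rank_scatterplots(toy_expression, toy_labels)
for score, genes in ranking:
    print(f"{genes}: {score:.2f}")
```

Exhaustive enumeration is only practical for toy data like this; with thousands of genes the number of pairs runs into the millions, which is why a search strategy is needed.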


A snapshot of the Orange data mining suite, showing the VizRank ranking of the best visualizations and the corresponding best-ranked scatterplot for the leukemia data set


Results

[Figure panels: LEUKEMIA, SRBCT, DLBCL, PROSTATE]

For all investigated data sets VizRank found visualizations with a small number of genes (2-6) with clear separation of diagnostic classes.


Results

[Figure: MIXED LINEAGE LEUKEMIA]


Results

Scores for top-ranked visualizations (probability of correct classification for a k-NN classifier in the projection plane)

Data set   Scatterplot   Radviz
Leukemia         98.0%    99.6%
Prostate         91.8%    98.3%
DLBCL            96.8%    99.7%
MLL              94.8%    99.8%
SRBCT            87.7%    99.7%


Results: biological relevance of the genes in the best visual projections

Genes were annotated as cancer or cancer-related according to the Atlas of Genetics and Cytogenetics in Oncology and Haematology.

The best Radviz visualization of the prostate tumor data set: all six genes are cancer-related

PROSTATE TUMOR
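For readers unfamiliar with Radviz, the projection behind a plot like the one above can be sketched as follows. Each gene gets an anchor on the unit circle, each expression value is scaled to [0, 1], and a sample is drawn at the average of the anchors weighted by its scaled values. This is a minimal sketch under those standard assumptions; the function name, the even anchor spacing, the min-max scaling, and the toy values are hypothetical and not taken from the slides.

```python
import math


def radviz_project(rows):
    """Map each row of m attribute values to a 2-D Radviz position."""
    m = len(rows[0])
    lows = [min(row[j] for row in rows) for j in range(m)]
    highs = [max(row[j] for row in rows) for j in range(m)]
    # One anchor per attribute, evenly spaced on the unit circle.
    anchors = [
        (math.cos(2 * math.pi * j / m), math.sin(2 * math.pi * j / m))
        for j in range(m)
    ]
    positions = []
    for row in rows:
        # Min-max scale each value to [0, 1]; constant attributes get weight 0.
        weights = [
            (row[j] - lows[j]) / (highs[j] - lows[j]) if highs[j] > lows[j] else 0.0
            for j in range(m)
        ]
        total = sum(weights)
        if total == 0:
            positions.append((0.0, 0.0))  # no pull from any anchor
            continue
        x = sum(w * ax for w, (ax, _) in zip(weights, anchors)) / total
        y = sum(w * ay for w, (_, ay) in zip(weights, anchors)) / total
        positions.append((x, y))
    return positions


# Hypothetical toy data: four samples, three genes.
toy_rows = [
    [10.0, 0.0, 0.0],  # only gene 0 expressed: lands on anchor 0
    [0.0, 5.0, 0.0],   # only gene 1 expressed: lands on anchor 1
    [0.0, 0.0, 2.0],   # only gene 2 expressed: lands on anchor 2
    [10.0, 5.0, 2.0],  # all genes at their maximum: lands in the centre
]
positions = radviz_project(toy_rows)
```

A sample dominated by one gene is pulled toward that gene's anchor, while a sample with balanced expression stays near the centre, which is why well-chosen gene sets can separate classes in a single plot.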


Results: biological relevance of the genes in the best visual projections

DNTT (terminal deoxynucleotidyl transferase) – a unique DNA polymerase expressed in the lymphoid precursors and their malignant counterparts and an important marker of lymphoblastic leukemias

MME (membrane metalloendopeptidase) or common acute lymphocytic leukemia antigen (CALLA) – an important cell surface marker in the diagnosis of human acute lymphocytic leukemia (ALL)

MIXED LINEAGE LEUKEMIA


Results: biological relevance of the genes and an explanation of the outliers

SMCL and COID classes express high levels of neuroendocrine tumor genes (ISL1)

For SQ lung carcinomas, diagnostic criteria include evidence of squamous differentiation (KRT5)

Histological diversity of the adenocarcinoma (AD) class in the lung cancer data set:
• 12 AD were extrapulmonary metastases
• seven adenocarcinomas display histological evidence of squamous features

LUNG CANCER


How many “good” projections are there?

Only a few [among several millions of possible projections].
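To give a sense of the size of the search space, the number of candidate scatterplots can be counted directly. The sketch below assumes that a scatterplot projection is an unordered pair of genes and uses the gene counts reported for the five data sets; the variable names are illustrative.

```python
from math import comb

# Gene counts as reported for the five data sets in this presentation.
genes_per_dataset = {
    "Leukemia": 7074,
    "Prostate": 12533,
    "DLBCL": 7070,
    "MLL": 12533,
    "SRBCT": 2308,
}

# Number of distinct gene pairs, i.e. candidate two-gene scatterplots.
scatterplot_counts = {name: comb(n, 2) for name, n in genes_per_dataset.items()}

for name, count in scatterplot_counts.items():
    print(f"{name}: {count:,} possible scatterplots")
```

Under this assumption the counts range from millions to tens of millions per data set, and projections that use more than two genes, such as Radviz, enlarge the space further.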


Gene ranking methods

• Signal-to-noise (S2N) (Golub et al., Science 1999) – univariate gene scoring statistic derived from the standard parametric t-test

  S2N = (µ0 - µ1) / (σ0 + σ1), where µ = mean and σ = standard deviation of the gene's expression in class 0 or class 1

• ReliefF (Kononenko, 1994) – attribute scoring function sensitive to feature interactions
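The S2N statistic above is simple enough to compute by hand. The following sketch applies the formula to one gene at a time and ranks genes by the absolute value of the score. It assumes the sample standard deviation (n - 1 in the denominator), which the slide does not specify; the gene names and expression values are hypothetical.

```python
from statistics import mean, stdev


def s2n(class0_values, class1_values):
    """Signal-to-noise score: (mu0 - mu1) / (sigma0 + sigma1)."""
    # Assumption: sample standard deviation; the slide does not say which.
    return (mean(class0_values) - mean(class1_values)) / (
        stdev(class0_values) + stdev(class1_values)
    )


# Hypothetical toy expression values: gene -> (class 0 samples, class 1 samples).
toy_genes = {
    "geneA": ([8.0, 9.0, 10.0], [2.0, 3.0, 4.0]),  # strongly discriminating
    "geneB": ([5.0, 6.0, 7.0], [5.0, 7.0, 6.0]),   # same mean in both classes
    "geneC": ([4.0, 5.0, 6.0], [6.0, 7.0, 8.0]),   # higher in class 1
}

s2n_scores = {gene: s2n(c0, c1) for gene, (c0, c1) in toy_genes.items()}

# Rank by magnitude: a large negative score is as informative as a large positive one.
s2n_ranking = sorted(s2n_scores, key=lambda gene: abs(s2n_scores[gene]), reverse=True)
```

Because each gene is scored on its own, S2N cannot detect genes that are informative only in combination, which is the motivation for also considering ReliefF.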


Results: all data sets include a subset of about 100 highly discriminating genes

[Figure: histograms of ReliefF scores, four panels each. Histogram for actual attribute values: horizontal axis from -0.05 to 0.2, vertical axis from 0 to 1000. Histogram for permuted data: horizontal axis from -0.03 to 0.03, vertical axis from 0 to 1000.]

For all data sets, histograms of ReliefF scores are skewed to the right, with a group of 50 – 100 most discriminating genes in the right tail.

A permutation test was used to verify whether these highly discriminating genes were assigned high scores by chance.
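The idea of the permutation test mentioned on this slide can be sketched as follows: shuffle the class labels many times, rescore the gene on each shuffled copy, and check how often chance alone produces a score at least as high as the observed one. This is a simplified illustration, not the authors' procedure. It uses the absolute difference of class means as a stand-in for ReliefF, and the function names, number of permutations, seed, and toy values are hypothetical.

```python
import random


def mean_difference(values, labels):
    """Stand-in gene score: absolute difference between the two class means."""
    group0 = [v for v, lab in zip(values, labels) if lab == 0]
    group1 = [v for v, lab in zip(values, labels) if lab == 1]
    return abs(sum(group0) / len(group0) - sum(group1) / len(group1))


def permutation_p_value(values, labels, score=mean_difference,
                        n_permutations=1000, seed=0):
    """Share of label permutations scoring at least as high as the real labels."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    observed = score(values, labels)
    shuffled = list(labels)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)  # break the link between expression and class
        if score(values, shuffled) >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (hits + 1) / (n_permutations + 1)


# Hypothetical toy data: ten samples, five per class.
toy_labels = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
informative = [1.0, 1.2, 0.9, 1.1, 1.05, 5.0, 5.2, 4.9, 5.1, 4.95]
uninformative = [3.0, 1.0, 4.0, 2.0, 5.0, 2.5, 4.5, 1.5, 3.5, 3.0]

p_informative = permutation_p_value(informative, toy_labels)
p_uninformative = permutation_p_value(uninformative, toy_labels)
```

A gene whose score survives this comparison is unlikely to have been ranked highly by chance, which is the point of contrasting the two sets of histograms.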


Results: S2N and ReliefF yield different gene rankings

Spearman correlation coefficient between the two rankings ranges from 0.24 for the DLBCL data set to 0.89 for the MLL data set
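As a reminder of what the coefficient measures, the sketch below computes the Spearman correlation between two gene rankings with the textbook formula rho = 1 - 6 * sum(d^2) / (n * (n^2 - 1)), where d is the difference between a gene's two ranks. It assumes there are no tied scores; the function names and the toy scores are hypothetical and unrelated to the values reported above.

```python
def ranks(scores):
    """Map each gene to its rank; rank 1 is the highest score (no ties assumed)."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    return {gene: position for position, gene in enumerate(ordered, start=1)}


def spearman(scores_a, scores_b):
    """Spearman rank correlation between two scorings of the same genes."""
    ranks_a, ranks_b = ranks(scores_a), ranks(scores_b)
    n = len(ranks_a)
    squared_diffs = sum((ranks_a[g] - ranks_b[g]) ** 2 for g in ranks_a)
    return 1 - 6 * squared_diffs / (n * (n * n - 1))


# Hypothetical toy scores for five genes from two ranking methods.
toy_s2n = {"g1": 0.9, "g2": 0.7, "g3": 0.5, "g4": 0.3, "g5": 0.1}
toy_relieff = {"g1": 0.15, "g2": 0.20, "g3": 0.10, "g4": 0.01, "g5": 0.05}

rho = spearman(toy_s2n, toy_relieff)
```

A value near 1 means the two methods order the genes almost identically, while a value near 0 means the rankings are essentially unrelated.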

20 best genes from the scatterplot visualizations for the leukemia data set

ReliefF performs relatively poorly, similarly to S2N (is the context too large because of the high number of attributes in the data sets?)


Conclusion

• Cancer diagnostic classes can be clearly separated using the expression data of only a few genes

• Visualizations can
  – find small sets of the most relevant genes
  – uncover interesting gene interactions
  – point to outliers

• Our “visual” models are
  – simple
  – understandable
  – significantly less sophisticated classification models than the prevailing techniques in current cancer gene expression analysis


THANK YOU!

Small round blue cell tumors, data by Khan et al. (2001)

Minca Mramor, Gregor Leban, Janez Demšar and Blaž Zupan