Supervised Mining of High Throughput Biomedical … · Supervised Mining of High Throughput Biomedical Data Carsten Peterson Complex Systems Division ... 9795 370 490102 920406 920406

Complex Systems Lund University

Supervised Mining of High Throughput Biomedical Data

Carsten Peterson

Complex Systems DivisionDepartment of Theoretical Physics

Lund University, Swedenwww.thep.lu.se/complex

Advanced Microarray Data Analysis: Class Discovery and Class PredictionElsinore, May 2004

● Proper management of data at all stages

● Supervised and unsupervised approaches for making sense of data Artificial neural network algorithms (PCA preprocessing)

● Data integration; mapping onto ontology/pathway data bases

● Do gene expression data outperform ''good old'' clinical markers?

● Application examples; diagnostic predition of cancers: small round blue tuours of childhood, sporadic breast cancer

© 2004 Carsten Peterson


Microarray data analysis workflow

Data mining supervised unsupervised ....

Prior knowledge ontologies pathways integration of data

Results

Important!

Data management preprocessing






Results


Ab initio models not yet mature



Backend Production

+

Biological / Sample Data

+

Data Analysis

Experimental Data

1 7549 174 260926 891010 891010 63 1 11 10524 203 281103 930308 930308 64 950324 950324 1 23.93 1 1 11 11078 217 290504 931110 931110 65 1 11 7398 267 330528 890720 890720 56 1

7246 349 440118 890524 890524 45 11 6516 359 450901 880609 880609 43 1 1

9795 370 490102 920406 920406 43 1 11 11424 166 260227 940330 940330 68 1 1

11570 211 290128 940530 940630 65 940919 1 3.37 1 1 17632 259 320804 891106 891106 57 1

1 9077 355 450206 910610 910610 46 1 11 8860 242 310329 910318 910318 60 1 1

11620 86 210625 940615 940614 73 11 9126 38 180428 910702 910702 73 1

9732 265 330318 920319 920319 59 971111 1 68 1 1 18147 13 150709 900523 900523 75 1 1

1 7422 36 180305 890822 890822 71 900517 1 8.7 1 1 11 11968 83 210501 941011 941109 73 1 11 7106 92 211218 890320 890320 67 11 7171 95 220119 890418 890418 67 960606 1 86.2 1 1 1 11 7638 124 231004 891108 891108 66 1 11 7366 138 240323 890704 890704 65 1 1

8703 159 250724 910121 910121 65 980109 1 84 1 1 110580 202 281019 930405 930405 64 1

1 7589 222 290729 891030 891030 60 931001 931001 1 47.97 1 1 11 9999 232 300730 920701 920701 62 11 6366 240 310115 880415 880414 57 920901 1 53.47 1 1 1 11 7168 249 311008 890417 890417 58 #NAME? 1

7921 272 340305 900226 900226 56 11 7105 275 340703 890321 890321 55 1 11 6318 338 420405 880314 880314 46 1

7722 345 430424 891204 891204 47 18174 353 441007 900530 900529 46 1

1 6773 356 450710 881024 881025 43 1 11 8318 358 450831 900816 900816 45 910815 1 11.97 1 1 1

7811 262 330116 900122 900122 57 111829 162 251102 940908 940910 69 980122 980122 1 40 1 1 111442 223 290801 940412 940412 65 1 18736 70 201117 910204 910204 70 1

1 6464 23 161103 880531 880531 72 17845 136 240317 900208 900208 66 19016 238 301218 910523 910523 60 970825 1 74.4 1 1 1 1

1 7244 252 320108 890530 890530 57 920616 1 36.47 1 1 112501 24 161107 870304 950606 70 1

1 7104 35 180227 890330 890323 71 1 11 6594 299 371230 880719 880719 51 1 1

7408 147 250104 890802 890802 65 1 111460 233 301013 940418 940420 64 1

1 6262 116 230402 880215 880210 65 1 110335 266 330524 921214 921214 60 1 1

1 7132 191 270921 890329 890413 62 1 11 7020 152 250308 890206 890210 64 11 6707 215 290417 880928 880921 59 1 11 6014 258 320708 871021 871021 55 1 11 6232 250 311210 880202 880209 56 880802 1 6.87 1 1 11 6438 84 210617 880519 880519 67 11 6557 189 270623 880718 880718 61 1 11 7041 331 410524 890228 890228 48 1 11 7149 8 141020 890418 890418 74 11 7170 246 310813 890420 890420 58 940921 940921 1 64.63 1 1 2 1 11 7267 327 410113 890606 890606 48 940824 940824 1 62 1 1 11 7364 17 160731 890720 890706 73 1 11 7454 335 411110 890905 890905 48 940517 1 56.27 1 1 11 7490 82 210408 890922 890922 68 950504 1 68.13 1 1 11 7553 315 400219 891012 891012 50 950719 1 68.97 1 1 1

7807 51 190801 900123 900123 70 1

BASE: BioArray Software Environmenthttp://base.thep.lu.se

• Free Open Source Software; GNU General Public License Store – Manage – Analyze: Array production LIMS Experimental Data Biomaterials Web based UI



Data management

Results

Data mining Prior knowledgePlugin structure for your own favourites

In principle BASE takes you all the way

Complies with MIAME and MAGEML

Proteomics counterparts is on its way for 2D gels: e.g. PROTEIOS

Platform specific alternatives: e.g. Affymetrix GSOC Publish Database

Most time is spent on preprocessing including missing values etc.






Results




Preprocessing: Cuts on intensities, spot areas etc. More elaborate error models



Data Mining


MDSProject highdimensional data down to 2 or 3 dimensions preserving the distances between the data points

Dimensional reduction: Principal Component Analysis (PCA) Multidimensional Scaling (MDS)

G1G2

G1

G2

+

+++++

+ +++

2dimensional example of PCA:



Data Mining


MDSProject highdimensional data down to 2 or 3 dimensions preserving the distances between the data points

Supervised methods: Signaltonoise based disciminators Machine learning: Multilayered Perceptrons (MLP) Support Vector Machines (SVM) ....

Dimensional reduction: Principal Component Analysis (PCA) Multidimensional Scaling (MDS) ....

G2

G1

G2

G1+

+++++

+ +++


Unsupervised methods: Hierarchical clustering Kmeans clustering ....



Supervised methods

● Toolboxes:

Machine learning that allows for general dependencies Multilayered Perceptrons (MLP) Support Vector Machines (SVM) ....

Correlationbased classifiers; signaltonoise ratios

Data highdimensional and noisy Learn from examples rather than using known rules

Representative training sets and good validation procedures crucial Preprocessing to reduce number of variables sometimes called for

● Generics:



Signaltonoise methods

Simple approach:

Find discriminating genes between labeled classes bycomputing signaltonoise ratio weights for each gene

Weight [gene]

=Sum of standard deviations

Difference in averages

NominatorDenominator

Type 1 Type 2

=> list of ranked genes

g1

w1

g2

w2

. .

. .g

Nw

N



Are the topranked genes significant?

Permute sample lables (type 1 and type 2) Calculate weights from random labels

Random permutation test:

Compute Pvalue for each gene

Keep those genes in the list with low Pvalues

g1

P1

g2

P2

. .

. .

. .

Cut



Machine learning

Find planes that separate the classes (adjust parameters) MLP (Multilayered Perceptron SVM (Support Vector Machine)

Each adjusted parameter set represents a model of the data

X 1

X 2

In 2 dimensions

Advantage: Handle general dependencies

More later ....



Proper validation and test procedure

Calibration

Validation

Test

Calibrate multiple models using different partitions

Use ensemble of models for committee votes

Extract ranked genes from model parameters

More later ....




Data mining supervised unsupervised .....

Prior knowledge ontologies pathways other experiments

Results




Map results onto prior knowledge using appropriate tools

Ontology databases Pathway databases Data integration

The Gene Ontology (GO) consortium provides a vocabulary in terms of three structured networks of gene product attributes

● Map candidate genes of interest onto GO network

● Assign occupation numbers on nodes towards the root

● Estimate Pvalue for each node from genes present on cDNA by comparing with random selection of genes

GoMiner: discover.nci.nih.gov/gominer/

Ontoexpress: www.bioinformatics.wayne.edu

Ontology browsers:



From GoMiner

Score Pvalue



Map results onto prior knowledge using appropriate tools

Ontology databases Pathway databases Data integration

Pathway databases: metabolic, signal transduction, transcriptional, ... different levels of resolution; phosphorylation, methylation, ...

KEGG Transpath stke BioCarta .....

Computer readable format desired

Estrogen receptor pathway ( from stke)



Map candidate genes onto pathway database

Estimate Pvalue for pathway from genes present on cDNA by comparing with random selection of genes

The richness of proteomics can be exploited;complexes, phosphorylation, methylation, ...

Less mature field than mappings onto gene ontologies



Applications

● Small round blue cell tumors (SRBT) of childhood

● Estrogen receptor status of breast cancer tumors

● Breast cancer metastases prediction

● Breast cancer cell line analysis; comparisons with tumors



Small round blue cell tumors (SRBT) of childhood

SRBCT exists in four distinct categories

Can be difficult to diagnose using currently available molecular methods

Accurate diagnosis critical for proper treatment

Artificial neural networks (MLP) used for classification

Burkitt's Lymphoma (BL) 8 Ewing's sarcoma (ES) 13+10 Neuroblastoma (NB) 12 Rhabdomyosarcoma (RMS) 10+10 Total 63

Microarray profiling using 6600 cDNA clones

Blind Tests 25



● After filtering 2300 genes remain

● Multilayered Perceptron (MLP) used for modeling with 10 inputs and 4 outputs [(1000), (0100), (0010), (0001)]

Reduce dimensionality by performing PCA Retain 10 largest components: 2300 > 10 (65% of variance left)

● Thousands of genes - huge number of parameters: "overtraining"

G1G2

Gene 1

Gene 2

+

++

+++

++++ G1=Gene1+Gene2




Ranked Genes

0

1

Gene Expression

ML

P o

utp

ut

High Sensitivity

High Rank

0

1

Gene Expression

High Sensitivity

High Rank

0

1

Gene Expression

ML

P O

utp

ut

Low Sensitivity

Low Rank

Sensitivity measurement

Compute derivative of output with respect to expression levels



Training & Validation

3fold validation procedure:

ClassificationCommittee vote:

● Additional independent test samples are classified by a committee of all the models

● The committee for a sample consists only of models for which the sample was not in the training groups; Also yields ranked gene list

● Split samples in 3 groups – training (2) and validation (1) One model is calibrated

● Redo with the other groups reserved for validation => 3 models. Each sample predicted by one model

● Randomly reselect 3 groups and redo N times => 3*N models

[SRBT: N = 1250]



Gene Minimization

Remove genes from bottom of ranking list

Redo calibration

Diagnostic Classification

Sensitivity (%)

9310010096

Cancer

EWSBLNBRMS

Specificity (%)

100100100100

All 63+25 samples



Diagnostic Classification

What do we expect basedon the validation?

Ideal vote would bee.g. EWS: (1,0,0,0)

Compute distance to ideal vote



Hierarchical Clustering

Based on top96 ANN genes



Permutation tests

Can MLPs learn to classify anything?

Randomly relabel data



Microarray profiling of SRBT works very well using MLP and SVM

Minisummary:

Approach evaluated under realistic conditions with 25 blind test samples provided after completed calibrations 5 of these were nonSRBCTs algorithm able to reject these

Out of these, 41 have not been previously reported in this context

A set of 96 genes are identified as key factors for classification

J. Khan et al., Nature Medicine, 673 (2001)

Mini Summary:



Estrogens are important regulators in development and progression of breast cancer regulate gene expression via the estrogen receptor (ER)

Estrogen Receptor

Investigate microarray image profiles of nodenegative sporadic breast cancer tumors with respect to ER status (+ or ) using 6800 cDNA clones

● 3fold cross validation 200 times;

● As in the SRBCT case; ANN models

● After filtering 3400 genes remain

● Sensitivities computed genes ranked

In total committees of 600 models

ER+ 23ER 24

Total: 47

Blind tests: 11

ER status of blind test samples predicted with 100% accuracy (even when ER and GATA3 genes excluded)

S. Gruvberger et al., Cancer Research 61, 5979 (2001)



ER-

Genes 1 100

ER status committee votes

ER+

T

V

T = blind test setV = validation set



How many of the top genes on the gene ranking list can be excluded with preserved predictive power (blind test set)?

1 100 11 100% 51 150 9 100% 101 200 11 100% 151 250 9 100% 201 300 11 100% 251 350 9 93% 301 400 8 97% Random 5.5 53%

Genes correct ROC area

Randomly picked 100 genes are shown for comparison

The ER pathway is ''deep''!



ER-

Genes 1 100

ER status committee votes

ER+

Genes 301 400

ER+

ER-

T

V

T = blind test setV = validation set



More on ER pathway depths ......

● All remaining genes included after removal

Prediction of ER value

Relative error

Number of genes removed from top

● Top100 genes included after removal

S. Gruvberger, P. Eden et al., Mol. Cancer Ther. 3, 161 (2004)

Continuous MLP output



Determining the ER cutoff from gene expression data alone

Classify into ER+ and ER for every possible partition

For each partition: Measure of how well the partition corresponds to molecularly distinct classes with Fisher's linear discriminant Gauge performance using leaveoneout crossvalidation Compute area under ROC curve

S. Gruvberger, P. Eden et al., Molecular Cancer Therapeutics 3, 161 (2004)



● The ER pathway is deep!

● Nonclinical determination of ER cutoff feasible

Minisummary:



Predicting Breast Cancer MetastasesL.J. Van't Veer et al., Nature 415, 530536 (2002) [available]M.J. van de Vijver et al., NEJM 347, 19992009 (2002) [not available]

78 tumors; 34 of which developed distant metastases within 5 yearsCan one predict the latter?

CDNA microarray profiling with 25000 genes After filtering 5000 remainsBlind test set: 19 (12 met/7 nonmet)

Unsupervised hierarchical clustering:

Patients: Roughly divides data according to ER status Genes: ER pathway genes and lymphocytic genes coregulated



From: Nature 415, 530536 (2002)



Predicting Breast Cancer MetastasesL.J. Van't Veer et al., Nature 415, 530536 (2002) [available]M.J. van de Vijver et al., NEJM 347, 19992009 (2002) [not available]

78 tumors; 34 of which developed distant metastases within 5 yearsCan one predict the latter?

CDNA microarray profiling with 25000 genes After filtering 5000 remainsBlind test set: 19 (12 met/7 nonmet)

Unsupervised hierarchical clustering:

Patients: Roughly divides data according to ER status Genes: ER pathway genes and lymphocytic genes coregulated

Classifier:

Compute correlation [C] between genes and metastases outcome |C| > 0.3 yields 231 important genes, which are rankordered

Construct classifier from C by adding genes from list optimizing "leaveoneout" cross validation performance (70 genes)



Many of the classifying genes are ER related!

=> failures: 3 met/15 nonmet

Few met failures desired; optimize sensitivity [decision threshold]

83% validation performance [failures: 5 met/12 nonmet]

Test set performance – failures: 1 met/1 nonmet (19)

In followup study [van de Vijver et al.]:

285 consequtive patients

Gene expression profiler superior to conventional prognostic index:

NIHcriteria (>1cm) St. Gallen – criteria (>=2 cm, grade 23, <35 years, ER)



-- Agendia closes double digit million $ series A financing round --

''The land slide discovery of a predictive breast cancer profile and its validation in an independent study has caught much attention as it is superior to classical criteria St. Gallen consensus or NIH criteria to predict outcome of disease. ''

From company home page: www.agendia.com

Not the complete story ...



Are ''good old'' clinical markers outperformed by microarray gene expression profilers?

Age Tumor size Axillary node status Histological grade ER PGR Angioinvasion Lymphocytic invasion

● With a combined crossvalidation and leaveoneout crosstesting procedure a clinical marker profiler is constructed using ANN models

Conventional clinical markers:

● Using the well established Nottingham prognostic index: fixed rule based upon size, histological grade and axillary node status

P. Eden, C. Ritz et al., Eur. J. Cancer, in press (2004)



Analysis:

● ROC (Receiver Operating Characteristics) areas● Kaplan Meier plots

Method ER+ ER+

Gene expression 0.76 0.79Clinical variables 0.78 0.85NPI 0.77 0.81NIH criteria 0.67 0.66



F = 0.13 Age + 0.26 Tumor size (cm) + 1.0 Histological grade 0.011 ER 0.00 PGR + 1.3 Angioinvasion 0.28 Lymphocytic invasion + 2.8 (threshold: 10% false positive rate)

Clinical variables classifier can be approximated with a linear expression

F > 0 poor prognosis



Good prognosis group

Bad prognosis group

Years

Pro

ba

bil

ity

of

rem

ain

ing

met

ast

ase

s fr

ee

● Clin. Marker profiler● Nottingham index

● Microarray profiler

All samples (97)



Good prognosis group

Bad prognosis group

Years

Pro

ba

bil

ity

of

rem

ain

ing

met

ast

ase

s fr

ee



ER+ samples only (68)




● Microarray profiler● Clin. Marker profiler● Nottingham index


All samples ER+ samples only

Good prog. M+/M OR 95% CL

Gene expression 4/31 15.7 4.770 Conventional markers 4/35 9.8 2.943 NPI 4/21 7.2 2.132 NIH criteria 4/6 1.4 0.37.2 St Gallen criteria 0/7 inf 1.4inf

Limited statistics!



● ''Good old'' clinical markers are by no means outperformed by microarray gene expression profilers as diagnostic predictors

● The conventional markers are relatively inexpensive to employ 50% of all new breast cancers are diagnosed in the third world Keep refining these prognostic indices

● However: Microarray data also carries information about the underlying genes

The microarray technology is young

Minisummary:

● For both gene expression and clinical marker profilers results improve when separately classifying the ER+ cohorts

Hybrid methods ....



How about other data sets?

● Lund data (in preparation):

CMF treated sporadic breast cancer (node+, node)

88 (26 M+, 62 M)

ROC areas:

Gene expression: 0.76

Clinical variables: 0.76

NPI 0.76

(with E. Nimeus, A. Johnsson and others in Lund)

● Population based 99 patient cohort

Treated and untreated (node+, node)

C. Sotiriou et al, PNAS 100, 10393 (2003)



Clinical variables predictor

Age, size, grade, # of nodes, ER

Hierarchical clustering with overlapping van't Veer genes



Pathways in breast cancer

● Comprehensive analysis of expression patterns for multiple treatments and cell lines (e.g. ER+ and ER) to identify pathways (and crosstalk)

E2, tamoxifen, fulvestrant, progestin, antiprogestin, retinoic acid; Epidermal growth factor, a Mek1/2 specific inhibitor and TPA

● Utilize other genomewide data How are these pathways related to disease progression? Expression data from tumors

Two key aspects:

H.E. Cunliffe et al., Cancer Research 63, 7158 (2203)



14 conditions (9 treatments for 3 celllines).84 hybridizations (3 time points in duplicate).14k spotted cDNA arrays.

M CF7

E2

M CF7

ICI

M CF7

4OHT

M CF7

R5020

T47D

R5020

M CF7

RU486

T47D

RU486

M CF7

atRA

436

EGF

M CF7

U0126

436

U0126

M CF7

TPA

436

TPA

M CF7

EGF

1/3 31

1023 genes reshuffled according to similarity (clustering)



# conditions w ith responsive genes

Distribution of 1023responsive genes

0

50

100

150

200

250

300

350

1 2 3 4 5 6 7 8 9 1011121314

Top 47 genes Includes many associated with oncogenesis (cmyc, cerbB2, cyclin D1 , ...)

Characterizing Pathways (1)



i

iv

v

1023 genes

M CF7

E2

M CF7

ICI

M CF7

4OHT

M CF7

R5020

T47D

R5020

M CF7

RU486

T47D

RU486

M CF7

atRA

436

EGF

M CF7

U0126

436

U0126

M CF7

TPA

436

TPA

M CF7

EGF

ii

iii

vi

vii

viii

Separate 'Proliferation cluster'60% of genes (GO)

E2 'crosstalk' to MAPK pathway?

1/3 31

Important to study pathwaysin context of others!

Characterizing Pathways



Correlating pathways with tumor data

Are genes that discriminate between clinical categoriesfor tumor samples randomly distributed among the clusters?If not, which clusters are significant?

Use the van't Veer tumor expression data

231 prognosis genes from supervised (not clustering) analysispresent on cell line arrays; 41 among the responsive genes

(32 higher in poor prognosis and 9 in good prognosis samples)



i

V iii

iv

v

1/3 311023 genes

M CF7

E2

M CF7

ICI

M CF7

4OHT

M CF7

R5020

T47D

R5020

M CF7

RU486

T47D

RU486

M CF7

atRA

436

EGF

M CF7

U0126

436

U0126

M CF7

TPA

436

TPA

M CF7

EGF

-0.4

0

0.4n=89

0.30.20.1

-0.2-0.1

-0.3L

og

Rat

io

< 42 experiments ordered as in mosaic >

ii

iii

V ii

Cluster

iv

v

vi

viii

n

23

14

89

24

-

8

0

26

1

+

0

4

0

8

-

0

0

19

1

+

0

0

0

0

P-value

1.2E-03

4.7E-03

9.0E-08

1.4E-04

P-value

NS

NS

1.5E-13

NS

ER statusa good prognosisb

vi

Investigate clusters for common mechanisms responsible for their coregulation in tumors



In the future – be more systematic

Map gene lists from array analysis onto pathways (+ downstream targets)

Includes pathways from Transpath, KEGG, …

Allows users to modify pathways and to create their own

Pathway information and topology stored in machine readable form

Overabundance analysis as in GoMiner

Pathway coexpression scores

Information

Analysis



Summary

● Proper management of data at all stages

● Supervised and unsupervised approaches for making sense of data

● Application examples; diagnostic predition of cancers

Looks good!

Do gene expression data outperform ''good old'' clinical markers?

● Mapping onto ontologies and pathway data bases

Care should be exercised in sample selection with heterogeneous diseases

Interpret data directly in terms of interacting pathways 50000 genes > 100 pathways (?)



Acknowledements

NIH

NHGRI:Yidong ChenPaul Meltzer

NCI:Jun WeiJaved Khan

Lund University

Complex Systems:Carl TroeinCecilia RitzPatrik EdénMarkus Ringnér

Oncology:Sofia GruvbergerMårten FernöÅke BorgCarsten Rose


Documents

Supervised Mining of High Throughput Biomedical … · Supervised Mining of High Throughput Biomedical Data Carsten Peterson Complex Systems Division ... 9795 370 490102 920406 920406