Upload
hanhu
View
216
Download
0
Embed Size (px)
Citation preview
Complex Systems Lund University
Supervised Mining of High Throughput Biomedical Data
Carsten Peterson
Complex Systems DivisionDepartment of Theoretical Physics
Lund University, Swedenwww.thep.lu.se/complex
Advanced Microarray Data Analysis: Class Discovery and Class PredictionElsinore, May 2004
● Proper management of data at all stages
● Supervised and unsupervised approaches for making sense of data Artificial neural network algorithms (PCA preprocessing)
● Data integration; mapping onto ontology/pathway data bases
● Do gene expression data outperform ''good old'' clinical markers?
● Application examples; diagnostic predition of cancers: small round blue tuours of childhood, sporadic breast cancer
© 2004 Carsten Peterson
Complex Systems Lund University
Microarray data analysis workflow
Data mining supervised unsupervised ....
Prior knowledge ontologies pathways integration of data
Results
Important!
Data management preprocessing
© 2004 Carsten Peterson
Complex Systems Lund University
Microarray data analysis workflow
Data mining supervised unsupervised ....
Prior knowledge ontologies pathways integration of data
Results
Data management preprocessing
Ab initio models not yet mature
© 2004 Carsten Peterson
Complex Systems Lund University
Backend Production
+
Biological / Sample Data
+
Data Analysis
Experimental Data
1 7549 174 260926 891010 891010 63 1 11 10524 203 281103 930308 930308 64 950324 950324 1 23.93 1 1 11 11078 217 290504 931110 931110 65 1 11 7398 267 330528 890720 890720 56 1
7246 349 440118 890524 890524 45 11 6516 359 450901 880609 880609 43 1 1
9795 370 490102 920406 920406 43 1 11 11424 166 260227 940330 940330 68 1 1
11570 211 290128 940530 940630 65 940919 1 3.37 1 1 17632 259 320804 891106 891106 57 1
1 9077 355 450206 910610 910610 46 1 11 8860 242 310329 910318 910318 60 1 1
11620 86 210625 940615 940614 73 11 9126 38 180428 910702 910702 73 1
9732 265 330318 920319 920319 59 971111 1 68 1 1 18147 13 150709 900523 900523 75 1 1
1 7422 36 180305 890822 890822 71 900517 1 8.7 1 1 11 11968 83 210501 941011 941109 73 1 11 7106 92 211218 890320 890320 67 11 7171 95 220119 890418 890418 67 960606 1 86.2 1 1 1 11 7638 124 231004 891108 891108 66 1 11 7366 138 240323 890704 890704 65 1 1
8703 159 250724 910121 910121 65 980109 1 84 1 1 110580 202 281019 930405 930405 64 1
1 7589 222 290729 891030 891030 60 931001 931001 1 47.97 1 1 11 9999 232 300730 920701 920701 62 11 6366 240 310115 880415 880414 57 920901 1 53.47 1 1 1 11 7168 249 311008 890417 890417 58 #NAME? 1
7921 272 340305 900226 900226 56 11 7105 275 340703 890321 890321 55 1 11 6318 338 420405 880314 880314 46 1
7722 345 430424 891204 891204 47 18174 353 441007 900530 900529 46 1
1 6773 356 450710 881024 881025 43 1 11 8318 358 450831 900816 900816 45 910815 1 11.97 1 1 1
7811 262 330116 900122 900122 57 111829 162 251102 940908 940910 69 980122 980122 1 40 1 1 111442 223 290801 940412 940412 65 1 18736 70 201117 910204 910204 70 1
1 6464 23 161103 880531 880531 72 17845 136 240317 900208 900208 66 19016 238 301218 910523 910523 60 970825 1 74.4 1 1 1 1
1 7244 252 320108 890530 890530 57 920616 1 36.47 1 1 112501 24 161107 870304 950606 70 1
1 7104 35 180227 890330 890323 71 1 11 6594 299 371230 880719 880719 51 1 1
7408 147 250104 890802 890802 65 1 111460 233 301013 940418 940420 64 1
1 6262 116 230402 880215 880210 65 1 110335 266 330524 921214 921214 60 1 1
1 7132 191 270921 890329 890413 62 1 11 7020 152 250308 890206 890210 64 11 6707 215 290417 880928 880921 59 1 11 6014 258 320708 871021 871021 55 1 11 6232 250 311210 880202 880209 56 880802 1 6.87 1 1 11 6438 84 210617 880519 880519 67 11 6557 189 270623 880718 880718 61 1 11 7041 331 410524 890228 890228 48 1 11 7149 8 141020 890418 890418 74 11 7170 246 310813 890420 890420 58 940921 940921 1 64.63 1 1 2 1 11 7267 327 410113 890606 890606 48 940824 940824 1 62 1 1 11 7364 17 160731 890720 890706 73 1 11 7454 335 411110 890905 890905 48 940517 1 56.27 1 1 11 7490 82 210408 890922 890922 68 950504 1 68.13 1 1 11 7553 315 400219 891012 891012 50 950719 1 68.97 1 1 1
7807 51 190801 900123 900123 70 1
BASE: BioArray Software Environmenthttp://base.thep.lu.se
• Free Open Source Software; GNU General Public License Store – Manage – Analyze: Array production LIMS Experimental Data Biomaterials Web based UI
© 2004 Carsten Peterson
Complex Systems Lund University
Data management
Results
Data mining Prior knowledgePlugin structure for your own favourites
In principle BASE takes you all the way
Complies with MIAME and MAGEML
Proteomics counterparts is on its way for 2D gels: e.g. PROTEIOS
Platform specific alternatives: e.g. Affymetrix GSOC Publish Database
Most time is spent on preprocessing including missing values etc.
© 2004 Carsten Peterson
Complex Systems Lund University
Microarray data analysis workflow
Data mining supervised unsupervised ....
Prior knowledge ontologies pathways integration of data
Results
Data management preprocessing
© 2004 Carsten Peterson
Complex Systems Lund University
Preprocessing: Cuts on intensities, spot areas etc. More elaborate error models
© 2004 Carsten Peterson
Complex Systems Lund University
Data Mining
Preprocessing: Cuts on intensities, spot areas etc. More elaborate error models
MDSProject highdimensional data down to 2 or 3 dimensions preserving the distances between the data points
Dimensional reduction: Principal Component Analysis (PCA) Multidimensional Scaling (MDS)
G1G2
G1
G2
+
+++++
+ +++
2dimensional example of PCA:
© 2004 Carsten Peterson
Complex Systems Lund University
Data Mining
Preprocessing: Cuts on intensities, spot areas etc. More elaborate error models
MDSProject highdimensional data down to 2 or 3 dimensions preserving the distances between the data points
Supervised methods: Signaltonoise based disciminators Machine learning: Multilayered Perceptrons (MLP) Support Vector Machines (SVM) ....
Dimensional reduction: Principal Component Analysis (PCA) Multidimensional Scaling (MDS) ....
G2
G1
G2
G1+
+++++
+ +++
2dimensional example of PCA:
Unsupervised methods: Hierarchical clustering Kmeans clustering ....
© 2004 Carsten Peterson
Complex Systems Lund University
Supervised methods
● Toolboxes:
Machine learning that allows for general dependencies Multilayered Perceptrons (MLP) Support Vector Machines (SVM) ....
Correlationbased classifiers; signaltonoise ratios
Data highdimensional and noisy Learn from examples rather than using known rules
Representative training sets and good validation procedures crucial Preprocessing to reduce number of variables sometimes called for
● Generics:
© 2004 Carsten Peterson
Complex Systems Lund University
Signaltonoise methods
Simple approach:
Find discriminating genes between labeled classes bycomputing signaltonoise ratio weights for each gene
Weight [gene]
=Sum of standard deviations
Difference in averages
NominatorDenominator
Type 1 Type 2
=> list of ranked genes
g1
w1
g2
w2
. .
. .g
Nw
N
© 2004 Carsten Peterson
Complex Systems Lund University
Are the topranked genes significant?
Permute sample lables (type 1 and type 2) Calculate weights from random labels
Random permutation test:
Compute Pvalue for each gene
Keep those genes in the list with low Pvalues
g1
P1
g2
P2
. .
. .
. .
Cut
© 2004 Carsten Peterson
Complex Systems Lund University
Machine learning
Find planes that separate the classes (adjust parameters) MLP (Multilayered Perceptron SVM (Support Vector Machine)
Each adjusted parameter set represents a model of the data
X 1
X 2
In 2 dimensions
Advantage: Handle general dependencies
More later ....
© 2004 Carsten Peterson
Complex Systems Lund University
Proper validation and test procedure
Calibration
Validation
Test
Calibrate multiple models using different partitions
Use ensemble of models for committee votes
Extract ranked genes from model parameters
More later ....
© 2004 Carsten Peterson
Complex Systems Lund University
Microarray data analysis workflow
Data mining supervised unsupervised .....
Prior knowledge ontologies pathways other experiments
Results
Data management preprocessing
© 2004 Carsten Peterson
Complex Systems Lund University
Map results onto prior knowledge using appropriate tools
Ontology databases Pathway databases Data integration
The Gene Ontology (GO) consortium provides a vocabulary in terms of three structured networks of gene product attributes
● Map candidate genes of interest onto GO network
● Assign occupation numbers on nodes towards the root
● Estimate Pvalue for each node from genes present on cDNA by comparing with random selection of genes
GoMiner: discover.nci.nih.gov/gominer/
Ontoexpress: www.bioinformatics.wayne.edu
Ontology browsers:
© 2004 Carsten Peterson
Complex Systems Lund University
Map results onto prior knowledge using appropriate tools
Ontology databases Pathway databases Data integration
Pathway databases: metabolic, signal transduction, transcriptional, ... different levels of resolution; phosphorylation, methylation, ...
KEGG Transpath stke BioCarta .....
Computer readable format desired
Estrogen receptor pathway ( from stke)
© 2004 Carsten Peterson
Complex Systems Lund University
Map candidate genes onto pathway database
Estimate Pvalue for pathway from genes present on cDNA by comparing with random selection of genes
The richness of proteomics can be exploited;complexes, phosphorylation, methylation, ...
Less mature field than mappings onto gene ontologies
© 2004 Carsten Peterson
Complex Systems Lund University
Applications
● Small round blue cell tumors (SRBT) of childhood
● Estrogen receptor status of breast cancer tumors
● Breast cancer metastases prediction
● Breast cancer cell line analysis; comparisons with tumors
© 2004 Carsten Peterson
Complex Systems Lund University
Small round blue cell tumors (SRBT) of childhood
SRBCT exists in four distinct categories
Can be difficult to diagnose using currently available molecular methods
Accurate diagnosis critical for proper treatment
Artificial neural networks (MLP) used for classification
Burkitt's Lymphoma (BL) 8 Ewing's sarcoma (ES) 13+10 Neuroblastoma (NB) 12 Rhabdomyosarcoma (RMS) 10+10 Total 63
Microarray profiling using 6600 cDNA clones
Blind Tests 25
© 2004 Carsten Peterson
Complex Systems Lund University
● After filtering 2300 genes remain
● Multilayered Perceptron (MLP) used for modeling with 10 inputs and 4 outputs [(1000), (0100), (0010), (0001)]
Reduce dimensionality by performing PCA Retain 10 largest components: 2300 > 10 (65% of variance left)
● Thousands of genes - huge number of parameters: "overtraining"
G1G2
Gene 1
Gene 2
+
++
+++
++++ G1=Gene1+Gene2
2dimensional example of PCA:
© 2004 Carsten Peterson
Complex Systems Lund University
Ranked Genes
0
1
Gene Expression
ML
P o
utp
ut
High Sensitivity
High Rank
0
1
Gene Expression
High Sensitivity
High Rank
0
1
Gene Expression
ML
P O
utp
ut
Low Sensitivity
Low Rank
Sensitivity measurement
Compute derivative of output with respect to expression levels
© 2004 Carsten Peterson
Complex Systems Lund University
Training & Validation
3fold validation procedure:
ClassificationCommittee vote:
● Additional independent test samples are classified by a committee of all the models
● The committee for a sample consists only of models for which the sample was not in the training groups; Also yields ranked gene list
● Split samples in 3 groups – training (2) and validation (1) One model is calibrated
● Redo with the other groups reserved for validation => 3 models. Each sample predicted by one model
● Randomly reselect 3 groups and redo N times => 3*N models
[SRBT: N = 1250]
© 2004 Carsten Peterson
Complex Systems Lund University
Gene Minimization
Remove genes from bottom of ranking list
Redo calibration
Diagnostic Classification
Sensitivity (%)
9310010096
Cancer
EWSBLNBRMS
Specificity (%)
100100100100
All 63+25 samples
© 2004 Carsten Peterson
Complex Systems Lund University
Diagnostic Classification
What do we expect basedon the validation?
Ideal vote would bee.g. EWS: (1,0,0,0)
Compute distance to ideal vote
© 2004 Carsten Peterson
Complex Systems Lund University
Hierarchical Clustering
Based on top96 ANN genes
© 2004 Carsten Peterson
Complex Systems Lund University
Permutation tests
Can MLPs learn to classify anything?
Randomly relabel data
© 2004 Carsten Peterson
Complex Systems Lund University
Microarray profiling of SRBT works very well using MLP and SVM
Minisummary:
Approach evaluated under realistic conditions with 25 blind test samples provided after completed calibrations 5 of these were nonSRBCTs algorithm able to reject these
Out of these, 41 have not been previously reported in this context
A set of 96 genes are identified as key factors for classification
J. Khan et al., Nature Medicine, 673 (2001)
Mini Summary:
© 2004 Carsten Peterson
Complex Systems Lund University
Estrogens are important regulators in development and progression of breast cancer regulate gene expression via the estrogen receptor (ER)
Estrogen Receptor
Investigate microarray image profiles of nodenegative sporadic breast cancer tumors with respect to ER status (+ or ) using 6800 cDNA clones
● 3fold cross validation 200 times;
● As in the SRBCT case; ANN models
● After filtering 3400 genes remain
● Sensitivities computed genes ranked
In total committees of 600 models
ER+ 23ER 24
Total: 47
Blind tests: 11
ER status of blind test samples predicted with 100% accuracy (even when ER and GATA3 genes excluded)
S. Gruvberger et al., Cancer Research 61, 5979 (2001)
© 2004 Carsten Peterson
Complex Systems Lund University
ER-
Genes 1 100
ER status committee votes
ER+
T
V
T = blind test setV = validation set
© 2004 Carsten Peterson
Complex Systems Lund University
How many of the top genes on the gene ranking list can be excluded with preserved predictive power (blind test set)?
1 100 11 100% 51 150 9 100% 101 200 11 100% 151 250 9 100% 201 300 11 100% 251 350 9 93% 301 400 8 97% Random 5.5 53%
Genes correct ROC area
Randomly picked 100 genes are shown for comparison
The ER pathway is ''deep''!
© 2004 Carsten Peterson
Complex Systems Lund University
ER-
Genes 1 100
ER status committee votes
ER+
Genes 301 400
ER+
ER-
T
V
T = blind test setV = validation set
© 2004 Carsten Peterson
Complex Systems Lund University
More on ER pathway depths ......
● All remaining genes included after removal
Prediction of ER value
Relative error
Number of genes removed from top
● Top100 genes included after removal
S. Gruvberger, P. Eden et al., Mol. Cancer Ther. 3, 161 (2004)
Continuous MLP output
© 2004 Carsten Peterson
Complex Systems Lund University
Determining the ER cutoff from gene expression data alone
Classify into ER+ and ER for every possible partition
For each partition: Measure of how well the partition corresponds to molecularly distinct classes with Fisher's linear discriminant Gauge performance using leaveoneout crossvalidation Compute area under ROC curve
S. Gruvberger, P. Eden et al., Molecular Cancer Therapeutics 3, 161 (2004)
© 2004 Carsten Peterson
Complex Systems Lund University
● The ER pathway is deep!
● Nonclinical determination of ER cutoff feasible
Minisummary:
© 2004 Carsten Peterson
Complex Systems Lund University
Predicting Breast Cancer MetastasesL.J. Van't Veer et al., Nature 415, 530536 (2002) [available]M.J. van de Vijver et al., NEJM 347, 19992009 (2002) [not available]
78 tumors; 34 of which developed distant metastases within 5 yearsCan one predict the latter?
CDNA microarray profiling with 25000 genes After filtering 5000 remainsBlind test set: 19 (12 met/7 nonmet)
Unsupervised hierarchical clustering:
Patients: Roughly divides data according to ER status Genes: ER pathway genes and lymphocytic genes coregulated
© 2004 Carsten Peterson
Complex Systems Lund University
Predicting Breast Cancer MetastasesL.J. Van't Veer et al., Nature 415, 530536 (2002) [available]M.J. van de Vijver et al., NEJM 347, 19992009 (2002) [not available]
78 tumors; 34 of which developed distant metastases within 5 yearsCan one predict the latter?
CDNA microarray profiling with 25000 genes After filtering 5000 remainsBlind test set: 19 (12 met/7 nonmet)
Unsupervised hierarchical clustering:
Patients: Roughly divides data according to ER status Genes: ER pathway genes and lymphocytic genes coregulated
Classifier:
Compute correlation [C] between genes and metastases outcome |C| > 0.3 yields 231 important genes, which are rankordered
Construct classifier from C by adding genes from list optimizing "leaveoneout" cross validation performance (70 genes)
© 2004 Carsten Peterson
Complex Systems Lund University
Many of the classifying genes are ER related!
=> failures: 3 met/15 nonmet
Few met failures desired; optimize sensitivity [decision threshold]
83% validation performance [failures: 5 met/12 nonmet]
Test set performance – failures: 1 met/1 nonmet (19)
In followup study [van de Vijver et al.]:
285 consequtive patients
Gene expression profiler superior to conventional prognostic index:
NIHcriteria (>1cm) St. Gallen – criteria (>=2 cm, grade 23, <35 years, ER)
© 2004 Carsten Peterson
Complex Systems Lund University
-- Agendia closes double digit million $ series A financing round --
''The land slide discovery of a predictive breast cancer profile and its validation in an independent study has caught much attention as it is superior to classical criteria St. Gallen consensus or NIH criteria to predict outcome of disease. ''
From company home page: www.agendia.com
Not the complete story ...
© 2004 Carsten Peterson
Complex Systems Lund University
Are ''good old'' clinical markers outperformed by microarray gene expression profilers?
Age Tumor size Axillary node status Histological grade ER PGR Angioinvasion Lymphocytic invasion
● With a combined crossvalidation and leaveoneout crosstesting procedure a clinical marker profiler is constructed using ANN models
Conventional clinical markers:
● Using the well established Nottingham prognostic index: fixed rule based upon size, histological grade and axillary node status
P. Eden, C. Ritz et al., Eur. J. Cancer, in press (2004)
© 2004 Carsten Peterson
Complex Systems Lund University
Analysis:
● ROC (Receiver Operating Characteristics) areas● Kaplan Meier plots
Method ER+ ER+
Gene expression 0.76 0.79Clinical variables 0.78 0.85NPI 0.77 0.81NIH criteria 0.67 0.66
© 2004 Carsten Peterson
Complex Systems Lund University
F = 0.13 Age + 0.26 Tumor size (cm) + 1.0 Histological grade 0.011 ER 0.00 PGR + 1.3 Angioinvasion 0.28 Lymphocytic invasion + 2.8 (threshold: 10% false positive rate)
Clinical variables classifier can be approximated with a linear expression
F > 0 poor prognosis
© 2004 Carsten Peterson
Complex Systems Lund University
Good prognosis group
Bad prognosis group
Years
Pro
ba
bil
ity
of
rem
ain
ing
met
ast
ase
s fr
ee
● Clin. Marker profiler● Nottingham index
● Microarray profiler
All samples (97)
© 2004 Carsten Peterson
Complex Systems Lund University
Good prognosis group
Bad prognosis group
Years
Pro
ba
bil
ity
of
rem
ain
ing
met
ast
ase
s fr
ee
● Clin. Marker profiler● Nottingham index
● Microarray profiler
ER+ samples only (68)
© 2004 Carsten Peterson
Complex Systems Lund University
● Clin. Marker profiler● Nottingham index
● Microarray profiler● Clin. Marker profiler● Nottingham index
● Microarray profiler
All samples ER+ samples only
Good prog. M+/M OR 95% CL
Gene expression 4/31 15.7 4.770 Conventional markers 4/35 9.8 2.943 NPI 4/21 7.2 2.132 NIH criteria 4/6 1.4 0.37.2 St Gallen criteria 0/7 inf 1.4inf
Limited statistics!
© 2004 Carsten Peterson
Complex Systems Lund University
● ''Good old'' clinical markers are by no means outperformed by microarray gene expression profilers as diagnostic predictors
● The conventional markers are relatively inexpensive to employ 50% of all new breast cancers are diagnosed in the third world Keep refining these prognostic indices
● However: Microarray data also carries information about the underlying genes
The microarray technology is young
Minisummary:
● For both gene expression and clinical marker profilers results improve when separately classifying the ER+ cohorts
Hybrid methods ....
© 2004 Carsten Peterson
Complex Systems Lund University
How about other data sets?
● Lund data (in preparation):
CMF treated sporadic breast cancer (node+, node)
88 (26 M+, 62 M)
ROC areas:
Gene expression: 0.76
Clinical variables: 0.76
NPI 0.76
(with E. Nimeus, A. Johnsson and others in Lund)
● Population based 99 patient cohort
Treated and untreated (node+, node)
C. Sotiriou et al, PNAS 100, 10393 (2003)
© 2004 Carsten Peterson
Complex Systems Lund University
Clinical variables predictor
Age, size, grade, # of nodes, ER
Hierarchical clustering with overlapping van't Veer genes
© 2004 Carsten Peterson
Complex Systems Lund University
Pathways in breast cancer
● Comprehensive analysis of expression patterns for multiple treatments and cell lines (e.g. ER+ and ER) to identify pathways (and crosstalk)
E2, tamoxifen, fulvestrant, progestin, antiprogestin, retinoic acid; Epidermal growth factor, a Mek1/2 specific inhibitor and TPA
● Utilize other genomewide data How are these pathways related to disease progression? Expression data from tumors
Two key aspects:
H.E. Cunliffe et al., Cancer Research 63, 7158 (2203)
© 2004 Carsten Peterson
Complex Systems Lund University
14 conditions (9 treatments for 3 celllines).84 hybridizations (3 time points in duplicate).14k spotted cDNA arrays.
M CF7
E2
M CF7
ICI
M CF7
4OHT
M CF7
R5020
T47D
R5020
M CF7
RU486
T47D
RU486
M CF7
atRA
436
EGF
M CF7
U0126
436
U0126
M CF7
TPA
436
TPA
M CF7
EGF
1/3 31
1023 genes reshuffled according to similarity (clustering)
© 2004 Carsten Peterson
Complex Systems Lund University
# conditions w ith responsive genes
Distribution of 1023responsive genes
0
50
100
150
200
250
300
350
1 2 3 4 5 6 7 8 9 1011121314
Top 47 genes Includes many associated with oncogenesis (cmyc, cerbB2, cyclin D1 , ...)
Characterizing Pathways (1)
© 2004 Carsten Peterson
Complex Systems Lund University
i
iv
v
1023 genes
M CF7
E2
M CF7
ICI
M CF7
4OHT
M CF7
R5020
T47D
R5020
M CF7
RU486
T47D
RU486
M CF7
atRA
436
EGF
M CF7
U0126
436
U0126
M CF7
TPA
436
TPA
M CF7
EGF
ii
iii
vi
vii
viii
Separate 'Proliferation cluster'60% of genes (GO)
E2 'crosstalk' to MAPK pathway?
1/3 31
Important to study pathwaysin context of others!
Characterizing Pathways
© 2004 Carsten Peterson
Complex Systems Lund University
Correlating pathways with tumor data
Are genes that discriminate between clinical categoriesfor tumor samples randomly distributed among the clusters?If not, which clusters are significant?
Use the van't Veer tumor expression data
231 prognosis genes from supervised (not clustering) analysispresent on cell line arrays; 41 among the responsive genes
(32 higher in poor prognosis and 9 in good prognosis samples)
© 2004 Carsten Peterson
Complex Systems Lund University
i
V iii
iv
v
1/3 311023 genes
M CF7
E2
M CF7
ICI
M CF7
4OHT
M CF7
R5020
T47D
R5020
M CF7
RU486
T47D
RU486
M CF7
atRA
436
EGF
M CF7
U0126
436
U0126
M CF7
TPA
436
TPA
M CF7
EGF
-0.4
0
0.4n=89
0.30.20.1
-0.2-0.1
-0.3L
og
Rat
io
< 42 experiments ordered as in mosaic >
ii
iii
V ii
Cluster
iv
v
vi
viii
n
23
14
89
24
-
8
0
26
1
+
0
4
0
8
-
0
0
19
1
+
0
0
0
0
P-value
1.2E-03
4.7E-03
9.0E-08
1.4E-04
P-value
NS
NS
1.5E-13
NS
ER statusa good prognosisb
vi
Investigate clusters for common mechanisms responsible for their coregulation in tumors
© 2004 Carsten Peterson
Complex Systems Lund University
In the future – be more systematic
Map gene lists from array analysis onto pathways (+ downstream targets)
Includes pathways from Transpath, KEGG, …
Allows users to modify pathways and to create their own
Pathway information and topology stored in machine readable form
Overabundance analysis as in GoMiner
Pathway coexpression scores
Information
Analysis
© 2004 Carsten Peterson
Complex Systems Lund University
Summary
● Proper management of data at all stages
● Supervised and unsupervised approaches for making sense of data
● Application examples; diagnostic predition of cancers
Looks good!
Do gene expression data outperform ''good old'' clinical markers?
● Mapping onto ontologies and pathway data bases
Care should be exercised in sample selection with heterogeneous diseases
Interpret data directly in terms of interacting pathways 50000 genes > 100 pathways (?)
© 2004 Carsten Peterson