Microarrays: algorithms for knowledge discovery in oncology and molecular biology
Frank De Smet
Katholieke Universiteit Leuven
Faculteit Toegepaste Wetenschappen
Departement Elektrotechniek (ESAT)
Promotor: Prof. dr. ir. B. De Moor
PhD defense Frank De Smet, May 28, 2004
Overview
• Introduction: basic concepts of microarray data
• Feature extraction:
– Univariate analysis
– Multivariate analysis: PCA
• Classification
• Clustering
• Conclusions and future research
Transcription - Translation
Microarrays
[Figure: cDNA-microarray experiment in six numbered steps. Labels: DNA-clones, cDNA-microarray, mRNA reference, mRNA test (tumour), Red, Green, DB.]
Importance
• Clinical (oncology)
– Clinical management of cancer is in many cases empirical, and not all clinically relevant information can be extracted from the data that physicians have access to
– Fundamental mechanisms behind carcinogenesis are not always taken into account
But:
– Expression patterns measured with microarrays in malignant cells reflect the phenotype of the tumour
• Molecular biology
– Studying the expression behaviour of genes can help to determine their biological role or function
Data-mining framework
[Figure: data-mining framework. Labels: Genes, Patients (classes 1 and 2), Features, Classifier (unlabelled samples "?" assigned to class <1> or <2>), Cluster algorithm (Cluster 1, Cluster 2).]
Expression matrix
[Figure: expression matrix. One dimension holds the microarray experiments (Condition 1 and Condition 2, or time points); the other holds the gene expression profiles.]
Univariate analysis in microarray data
• Expression patterns measured under two different conditions
• Selection of the individual genes with the highest differential expression: p-values
• Rejection level α
– p ≤ α: gene is declared differentially expressed
– p > α: gene is declared not differentially expressed
[Figure: genes compared between Condition 1 and Condition 2 and ranked by p-value, pi (i = 1,...,n; p1 < p2 < ... < pn), then split into Positive and Negative.]
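The per-gene selection described on this slide can be sketched as follows. The slides do not say which statistical test produces the p-values, so the exact permutation test on the difference in means, the function names and the toy expression values below are all assumptions made for illustration.

```python
from itertools import combinations


def permutation_p_value(cond1, cond2):
    """Exact two-sided permutation p-value for a difference in group means.

    Assumption: the source does not name the test; this is one simple choice.
    """
    pooled = list(cond1) + list(cond2)
    size1, size2 = len(cond1), len(cond2)
    total = sum(pooled)

    def abs_mean_diff(sum1):
        return abs(sum1 / size1 - (total - sum1) / size2)

    observed = abs_mean_diff(sum(cond1))
    extreme = 0
    splits = 0
    # Enumerate every way of relabelling the samples into the two conditions.
    for idx in combinations(range(len(pooled)), size1):
        splits += 1
        if abs_mean_diff(sum(pooled[i] for i in idx)) >= observed - 1e-12:
            extreme += 1
    return extreme / splits


def declare_differential(cond1_rows, cond2_rows, alpha):
    """Return (p-values, decisions); a gene is declared when p <= alpha."""
    p_values = [permutation_p_value(a, b)
                for a, b in zip(cond1_rows, cond2_rows)]
    return p_values, [p <= alpha for p in p_values]


# Hypothetical toy data: two genes (rows), four samples per condition.
cond1 = [[1, 2, 3, 4], [5, 6, 7, 8]]
cond2 = [[11, 12, 13, 14], [5, 6, 7, 8]]
p_values, declared = declare_differential(cond1, cond2, alpha=0.05)
```

With these toy values only the first gene falls below the rejection level. Enumerating all relabellings is practical only for small sample sizes; a larger study would sample permutations or use a parametric test instead.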
Multiple testing
• Overlap of the p-values of the genes with and without actual differential expression: Type I and II errors
• In literature: control of the Type I error: too conservative for microarray data
• Here: balance of Type I and II error
Declared differentially expressed? | Actually differentially expressed: YES | Actually differentially expressed: NO | Total
YES (p ≤ α) | TP | FP (Type I error) | Pos
NO (p > α) | FN (Type II error) | TN | Neg
Total | n1 | n0 |
Estimation of Type I and II error
[Figure: distributions of p-values. No real differential expression (randomised data set): uniform distribution. Non-accidental differential expression: superposition of two distributions. The rejection level separates TP and FP from FN and TN.]
Calculations
1. Estimation of n1 and n0

\[ V_i = \frac{i - n \cdot p_i}{1 - p_i} \]

2. Estimation of TPi, TNi, FPi and FNi

Declared differentially expressed? | Actually differentially expressed: YES | Actually differentially expressed: NO | Total
YES (p ≤ pi) | TPi ≈ i - pi.n0 | FPi ≈ pi.n0 | Posi = i
NO (p > pi) | FNi ≈ n1 - i + pi.n0 | TNi ≈ (1-pi).n0 | Negi = n-i
Total | n1 | n0 |

3. Estimation of sensitivity and specificity

\[ SENS_i = \frac{TP_i}{TP_i + FN_i} = \frac{TP_i}{n_1} \qquad SPEC_i = \frac{TN_i}{TN_i + FP_i} = \frac{TN_i}{n_0} \]

4. ROC curve
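The estimates of step 2 and the ratios of step 3 can be checked with a few lines of arithmetic. The sketch below takes n0 and n1 as given and uses the values this presentation reports for the Golub et al. data (n0 = 3876, n1 = 3253, rank 3429, rejection level 0.18); the function names are hypothetical.

```python
def confusion_estimates(i, p_i, n0, n1):
    """Estimated TP, FP, FN, TN when the i genes with the smallest p-values
    are declared differentially expressed (rejection level p_i)."""
    fp = p_i * n0          # unaffected genes expected below p_i (uniform p-values)
    tp = i - fp            # the rest of the declared genes
    fn = n1 - tp           # affected genes that were not declared
    tn = (1 - p_i) * n0    # unaffected genes correctly not declared
    return tp, fp, fn, tn


def sensitivity_specificity(i, p_i, n0, n1):
    """SENS_i = TP_i / (TP_i + FN_i) and SPEC_i = TN_i / (TN_i + FP_i)."""
    tp, fp, fn, tn = confusion_estimates(i, p_i, n0, n1)
    return tp / (tp + fn), tn / (tn + fp)


# Values reported in this presentation for the Golub et al. ALL-AML data.
sens, spec = sensitivity_specificity(i=3429, p_i=0.18, n0=3876, n1=3253)
```

The result, roughly 84.0% sensitivity and 82% specificity, is close to the reported 84.03% and 82.06%; the small gap is what one would expect from using the rounded rejection level 0.18.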
ROC curve
• Optimal balance between Type I and II errors
• Area under the curve
– Quantifies how well the genes whose expression is affected by the difference between conditions can be discriminated, using their p-values, from the genes whose expression is not
– Quality measure for microarray data
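A minimal sketch of how such a ROC curve and its area could be computed from the sorted p-values, taking n0 and n1 as given. Picking the rank that maximises SENS + SPEC as the "optimal balance" is an assumption (suggested by the SENSopt + SPECopt row in the leukemia example); the function name, the clipping of the estimates and the toy p-values are hypothetical as well.

```python
def roc_from_p_values(p_sorted, n0, n1):
    """ROC points (1 - SPEC_i, SENS_i), AUC, and the rank maximising SENS + SPEC.

    p_sorted must be in ascending order; n0 and n1 are taken as given.
    """
    points = [(0.0, 0.0)]
    best_rank, best_sum = None, float("-inf")
    for i, p in enumerate(p_sorted, start=1):
        # SENS_i = TP_i / n1 with TP_i = i - p_i * n0; clip noisy estimates.
        sens = min(max((i - p * n0) / n1, 0.0), 1.0)
        spec = 1.0 - p  # SPEC_i = TN_i / n0 = 1 - p_i
        points.append((p, sens))  # x-coordinate is 1 - SPEC_i = p_i
        if sens + spec > best_sum:
            best_rank, best_sum = i, sens + spec
    points.append((1.0, 1.0))
    # Trapezoidal rule over consecutive ROC points.
    auc = sum((x2 - x1) * (y1 + y2) / 2
              for (x1, y1), (x2, y2) in zip(points, points[1:]))
    return points, auc, best_rank


# Hypothetical toy input: four sorted p-values, two affected and two unaffected genes.
p_sorted = [0.01, 0.02, 0.5, 1.0]
points, auc, best_rank = roc_from_p_values(p_sorted, n0=2, n1=2)
```

In this toy case the second rank gives the best balance, and the area under the curve is close to 1 because the two smallest p-values are well separated from the rest.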
Example: Acute leukemia
 | Golub et al. ALL-AML | Armstrong et al. ALL-AML
n | 7129 | 12582
n0 | 3876 | 3084
n1 | 3253 | 9498
AUC (%) | 91.39 | 95.13
αopt | 0.18 (= p3429) | 0.11 (= p8633)
SENSopt (%) | 84.03 | 87.26
SPECopt (%) | 82.06 | 88.56
SENSopt + SPECopt (%) | 166.09 | 175.82
Multivariate analysis in microarray data: Principal Component Analysis
[Figure: unsupervised PCA, samples plotted on PC1 and PC2. Panels: acute leukemia (ALL - AML); breast cancer (degree of differentiation).]
Classification
[Figure: unsupervised versus supervised analysis. Panels: acute leukemia (ALL - AML); breast cancer (degree of differentiation).]
Clustering: gene expression profiles
• Importance
– Identification of groups of coexpressed genes
– Coexpressed genes have a higher probability of having similar biological functions, e.g., they might interact with the same transcription factors (coregulation)
• First-generation algorithms: disadvantages
– Parameter fine-tuning
– Assign every profile to a cluster
– Computational complexity
Quality-based clustering (Heyer et al.)
Algorithm produces clusters with
– a quality guarantee (fixed, user-defined threshold for the diameter D)
– a maximum number of profiles
[Figure: candidate clusters of diameter D, e.g. candidate cluster 1: 3 profiles; candidate cluster 5: 6 profiles; candidate cluster 17: 2 profiles.]
Still some disadvantages!
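The quality-based idea can be sketched as a greedy procedure: grow one candidate cluster around every remaining profile without exceeding the diameter threshold, keep the largest candidate, remove its profiles and repeat. The sketch below is an illustration under assumptions, not a faithful reimplementation of QT_Clust: it uses Euclidean distance, and the function name, parameters and toy data are hypothetical.

```python
import math


def quality_based_clusters(profiles, max_diameter, min_size):
    """Greedy diameter-bounded clustering; returns lists of profile indices."""
    remaining = set(range(len(profiles)))
    clusters = []
    while remaining:
        best = []
        for seed in sorted(remaining):
            candidate = [seed]
            # far[j]: distance from outsider j to its farthest candidate member.
            far = {j: math.dist(profiles[seed], profiles[j])
                   for j in sorted(remaining) if j != seed}
            while far:
                j = min(far, key=far.get)  # addition that keeps the diameter smallest
                if far[j] > max_diameter:
                    break                  # any further addition breaks the guarantee
                candidate.append(j)
                del far[j]
                for k in far:
                    far[k] = max(far[k], math.dist(profiles[j], profiles[k]))
            if len(candidate) > len(best):
                best = candidate
        if len(best) < min_size:
            break                          # leftover profiles stay unclustered
        clusters.append(sorted(best))
        remaining -= set(best)
    return clusters


# Hypothetical one-dimensional toy profiles.
profiles = [(0.0,), (0.1,), (0.2,), (5.0,), (5.1,), (9.0,)]
clusters = quality_based_clusters(profiles, max_diameter=0.5, min_size=2)
```

Here the isolated profile at 9.0 ends up in no cluster. Growing a candidate around every remaining profile in every round is what makes this family of algorithms expensive on large data sets.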
Adaptive quality-based clustering (AQBC)
• A heuristic iterative two-step approach
– Step 1: Quality-based approach:
Find a cluster center in an area of the data set where the density of expression profiles, within a sphere with preliminary radius, is locally maximal
– Step 2: Adaptive approach:
Re-estimation of the radius
Step 1: Localization of a cluster center
[Figure: localization of a cluster center within a sphere of radius R.]
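One simple way to find a point where the density of profiles inside a sphere is locally maximal is to move the center repeatedly to the mean of the profiles that fall within the sphere. The sketch below assumes exactly that, with a fixed radius and Euclidean distance; the slides do not give the actual procedure, so the function name, parameters and toy data are hypothetical.

```python
import math


def locate_center(profiles, start, radius, max_iter=100, tol=1e-9):
    """Shift the center to the mean of the profiles inside the sphere
    until it stops moving (assumed fixed-radius iteration)."""
    center = list(start)
    for _ in range(max_iter):
        inside = [p for p in profiles if math.dist(p, center) <= radius]
        if not inside:
            break  # empty sphere: keep the current center
        new_center = [sum(coords) / len(inside) for coords in zip(*inside)]
        moved = math.dist(new_center, center)
        center = new_center
        if moved < tol:
            break
    return center


# Hypothetical toy data: a tight group near the origin and one distant profile.
profiles = [(0.0, 0.0), (0.2, 0.0), (0.0, 0.2), (0.2, 0.2), (5.0, 5.0)]
center = locate_center(profiles, start=(0.5, 0.5), radius=1.0)
```

The center settles at about (0.1, 0.1), the middle of the dense group, and the distant profile never influences it.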
Step 2: Re-calculation of the radius
\[ p(r) = P_C \, p(r \mid C) + P_B \, p(r \mid B) \]

\[ P(C \mid R_{new}) = \frac{P_C \, p(R_{new} \mid C)}{P_C \, p(R_{new} \mid C) + P_B \, p(R_{new} \mid B)} = S \]
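The formula on this slide appears to fix the new radius at the distance where the posterior probability of belonging to the cluster equals the significance level S. A minimal numerical sketch of that idea follows. The two densities are hypothetical stand-ins, not the distributions used by AQBC, and the bisection assumes the posterior decreases monotonically with r.

```python
import math


def posterior_cluster(r, p_c, p_b, dens_c, dens_b):
    """P(C | r) by Bayes' rule for a two-component mixture."""
    numerator = p_c * dens_c(r)
    return numerator / (numerator + p_b * dens_b(r))


def solve_radius(significance, p_c, p_b, dens_c, dens_b, lo, hi, tol=1e-10):
    """Bisection for the radius where P(C | r) equals the significance level.

    Assumes P(C | r) decreases monotonically on [lo, hi].
    """
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if posterior_cluster(mid, p_c, p_b, dens_c, dens_b) > significance:
            lo = mid   # posterior still above the level: move outwards
        else:
            hi = mid
    return (lo + hi) / 2


def dens_cluster(r):
    return math.exp(-r)   # hypothetical cluster density, NOT the AQBC model


def dens_background(r):
    return 0.1            # hypothetical flat background density


radius = solve_radius(0.5, p_c=0.5, p_b=0.5,
                      dens_c=dens_cluster, dens_b=dens_background,
                      lo=0.0, hi=10.0)
```

For these stand-in densities the posterior equals 0.5 where exp(-r) = 0.1, so the solver returns a radius of about 2.30.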
Comparison
 | AQBC | QT_Clust (Heyer et al.)
User-defined parameters | 1. Data set; 2. Significance level S; 3. Minimum number of genes | 1. Data set; 2. Radius R or diameter D; 3. Minimum number of genes
Quality measure | Significance level S: statistical parameter | Radius or diameter: arbitrary parameter
Cluster radius R | Automatically calculated for each cluster separately - not constant | Constant and user-defined
Computational complexity | ~ O(n × e × VC) | ~ O(n² × e × VC)
Number of clusters | Not predefined | Not predefined
Inclusion of all genes in clusters | No | No
Validation
Cluster number (AQBC) | Cluster number (K-means) | MIPS functional category | P-value, -log10 (AQBC) | P-value, -log10 (K-means)
1 | 1 | ribosomal proteins | 80 | 54
 | | organisation of cytoplasm | 77 | 39
 | | protein synthesis | 74 | NR
 | | cellular organisation | 34 | NR
 | | translation | 9 | NR
 | | organisation of chromosome structure | 1 | 4
2 | 4 | mitochondrial organization | 18 | 10
 | | energy | 8 | NR
 | | proteolysis | 7 | NR
 | | respiration | 6 | 5
 | | ribosomal proteins | 4 | NR
 | | protein synthesis | 4 | NR
 | | protein destination | 4 | NR
5 | 2 | DNA synthesis and replication | 18 | 16
 | | cell growth, cell division, DNA synthesis | 17 | NR
 | | recombination and DNA repair | 8 | 5
 | | nuclear organization | 8 | 4
 | | cell-cycle control and mitosis | 7 | 8
Availability
[Figure: number of hits over time, Jan-01 to Apr-04 (vertical axis: 0 to 350).]
Himanen et al. (2004) Transcript profiling of early lateral root initiation. Proc Natl Acad Sci, 101, 5146-5151.
Conclusions
Data-mining framework for microarray data
• Feature extraction
– Univariate analysis
  • Estimation of n1 and n0
  • ROC curves: optimal balance between Type I and II error + quality measure
– Multivariate analysis: PCA
• Classification: FDA and LS-SVM
• Clustering
– Microarray experiments
– Gene expression profiles: AQBC
Clinical data
Selected publications
• De Smet, F., Marchal, K., Timmerman, D., Vergote, I., De Moor, B. and Moreau, Y. (2001) Gebruik van microroosters in de klinische oncologie [Use of microarrays in clinical oncology]. Tijdschr voor Geneeskunde, 57, 1225-1236.
• De Smet, F., Mathys, J., Marchal, K., Thijs, G., De Moor, B. and Moreau, Y. (2002) Adaptive quality-based clustering of gene expression profiles. Bioinformatics, 18, 735-746.
• Moreau, Y., De Smet, F., Thijs, G., Marchal, K. and De Moor, B. (2002) Functional bioinformatics of microarray data: from expression to regulation. Proceedings of the IEEE, 90, 1722-1743.
• De Smet, F., Moreau, Y., Timmerman, D., Vergote, I. and De Moor, B. (2004) Balancing false positives and false negatives for the detection of differential expression in malignancies. Br J Cancer, submitted.
• Epstein, E., Skoog, L., Isberg, P.E., De Smet, F., De Moor, B., Olofsson, P.A., Gudmundsson, S. and Valentin, L. (2002) An algorithm including results of gray-scale and power Doppler ultrasound examination to predict endometrial malignancy in women with postmenopausal bleeding. Ultrasound Obstet Gynecol, 20, 370-376.
Future research
• Specific
– Ovarian cancer: transcriptomics
  • Prediction of chemosensitivity in stage III
  • Prediction of recurrence in stage I
– Endometriosis: proteomics and transcriptomics
  • Detection of endometriosis
  • Prediction of relapse after surgery
• General
– Microarrays: number of patients - validation - standardization
– Proteomics
– Combination and comparison of microarray, proteomic and clinical data