Microarrays: algorithms for knowledge discovery in oncology and molecular biology Frank De Smet Katholieke Universiteit Leuven Faculteit Toegepaste Wetenschappen

Microarrays: algorithms for knowledge discovery in oncology and molecular biology

Frank De Smet

Katholieke Universiteit Leuven

Faculteit Toegepaste Wetenschappen

Departement Elektrotechniek (ESAT)

Promotor: Prof. dr. ir. B. De Moor

PhD defense Frank De Smet, May 28, 2004

Overview

• Introduction: basic concepts of microarray data

• Feature extraction

– Univariate analysis

– Multivariate analysis: PCA

• Classification

• Clustering

• Conclusions and future research

Introduction

Feature extraction

Classification

Clustering

Conclusions


Transcription - Translation



Microarrays

[Figure: cDNA-microarray experiment in six numbered steps. Labels: DNA clones, cDNA microarray, mRNA reference, mRNA test (tumour), red, green, DB.]


Importance

• Clinical (oncology)
– Clinical management of cancer is in many cases empirical, and not all clinically relevant information can be extracted from the data that physicians have access to
– Fundamental mechanisms behind carcinogenesis are not always taken into account

But:
– Expression patterns measured with microarrays in malignant cells reflect the phenotype of the tumour

• Molecular biology
– Studying the expression behaviour of genes can help to determine their biological role or function


Data-mining framework

[Figure: data-mining framework. Labels: Genes, Patients (classes 1 and 2), Features, Classifier (unknown samples "?" assigned to <1> or <2>), Cluster algorithm (Cluster 1, Cluster 2).]


Expression matrix

[Figure: expression matrix. Labels: microarray experiments (Condition 1 and Condition 2, or time), gene expression profiles.]


Univariate analysis in microarray data

• Expression patterns measured under two different conditions
• Selection of the individual genes with the highest differential expression: p-values
• Rejection level α
– p ≤ α: gene is declared differentially expressed
– p > α: gene is declared not differentially expressed

[Figure: Condition 1 vs. Condition 2. The sorted p-values p_i (i = 1, ..., n; p_1 < p_2 < ... < p_n) are split by the rejection level into Positive and Negative.]


Multiple testing

• Overlap of the p-values of the genes with and without actual differential expression: Type I and II errors
• In literature: control of the Type I error, which is too conservative for microarray data
• Here: balance of Type I and II error

                                        Actually differentially expressed?
Declared differentially expressed?      YES                     NO                      Total
YES (p ≤ α)                             TP                      FP (Type I error)       Pos
NO (p > α)                              FN (Type II error)      TN                      Neg
Total                                   n1                      n0


Estimation of Type I and II error

• No real differential expression (randomised data set): uniform distribution
• Non-accidental differential expression: superposition of two distributions

[Figure: p-value distributions for both cases, with the regions TP, FP, FN and TN marked relative to the rejection level.]


Calculations

1. Estimation of n1 and n0

$V_i = \frac{i - n \cdot p_i}{1 - p_i}$

2. Estimation of TPi, TNi, FPi and FNi

                                        Actually differentially expressed?
Declared differentially expressed?      YES                          NO                       Total
YES (p ≤ p_i)                           TP_i = i - p_i·n0            FP_i = p_i·n0            Pos_i = i
NO (p > p_i)                            FN_i = n1 - i + p_i·n0       TN_i = (1 - p_i)·n0      Neg_i = n - i
Total                                   n1                           n0

3. Estimation of sensitivity and specificity

$SENS_i = \frac{TP_i}{TP_i + FN_i} = \frac{TP_i}{n_1}$

$SPEC_i = \frac{TN_i}{TN_i + FP_i} = \frac{TN_i}{n_0}$

4. ROC curve
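The estimates on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the thesis implementation: the function names `v_values` and `confusion_estimates` are hypothetical, the p-values are assumed to be sorted in ascending order, and n0 is passed in by the caller because the slide does not state how the V_i values are turned into a single estimate of n1.

```python
def v_values(p_sorted):
    """V_i = (i - n*p_i) / (1 - p_i) for sorted p-values (i is 1-based).

    Entries with p_i == 1 are skipped to avoid division by zero.
    """
    n = len(p_sorted)
    return [(i - n * p) / (1.0 - p)
            for i, p in enumerate(p_sorted, start=1) if p < 1.0]


def confusion_estimates(p_sorted, n0):
    """Estimated TP, FP, FN, TN, SENS and SPEC when rejecting at each p_i."""
    n = len(p_sorted)
    n1 = n - n0
    rows = []
    for i, p in enumerate(p_sorted, start=1):
        fp = p * n0          # FP_i = p_i * n0
        tp = i - fp          # TP_i = i - p_i * n0
        fn = n1 - tp         # FN_i = n1 - i + p_i * n0
        tn = n0 - fp         # TN_i = (1 - p_i) * n0
        rows.append({"p": p, "TP": tp, "FP": fp, "FN": fn, "TN": tn,
                     "SENS": tp / n1, "SPEC": tn / n0})
    return rows
```

By construction TP_i + FP_i = i and FN_i + TN_i = n - i, matching the row totals Pos_i and Neg_i in the table.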


ROC curve

• Optimal balance between Type I and II errors
• Area under the curve
– Quantifies how well the genes whose expression is and is not affected by the difference between conditions can be discriminated using their p-values
– Quality measure for microarray data
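A minimal Python sketch of the ROC construction, assuming sorted p-values and a known n0. Two points are assumptions rather than statements from the slides: the optimum is taken as the rejection level that maximises SENS + SPEC (suggested by the "SENSopt + SPECopt" row of the example table), and estimated sensitivities are clipped to [0, 1]. The function names are hypothetical.

```python
def roc_points(p_sorted, n0):
    """(1 - SPEC, SENS) pairs, one per sorted p-value, plus the end points."""
    n1 = len(p_sorted) - n0
    points = [(0.0, 0.0)]
    for i, p in enumerate(p_sorted, start=1):
        sens = (i - p * n0) / n1            # TP_i / n1
        sens = min(1.0, max(0.0, sens))     # clip noisy estimates
        points.append((p, sens))            # 1 - SPEC_i equals p_i
    points.append((1.0, 1.0))
    return points


def auc(points):
    """Area under the curve by the trapezoidal rule."""
    return sum((x2 - x1) * (y1 + y2) / 2.0
               for (x1, y1), (x2, y2) in zip(points, points[1:]))


def optimal_level(p_sorted, n0):
    """1-based index i and p_i that maximise SENS_i + SPEC_i (assumed criterion)."""
    pts = roc_points(p_sorted, n0)[1:-1]
    best = max(range(len(pts)), key=lambda k: pts[k][1] + (1.0 - pts[k][0]))
    return best + 1, p_sorted[best]
```

For example, `optimal_level(p_sorted, n0)` returns the index i and the rejection level p_i, the same form in which the optimal levels are reported in the acute leukemia example.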


Example: Acute leukemia

                              Golub et al. (ALL-AML)    Armstrong et al. (ALL-AML)
n                             7129                      12582
n0                            3876                      3084
n1                            3253                      9498
AUC (%)                       91.39                     95.13
α_opt                         0.18 (= p_3429)           0.11 (= p_8633)
SENS_opt (%)                  84.03                     87.26
SPEC_opt (%)                  82.06                     88.56
SENS_opt + SPEC_opt (%)       166.09                    175.82


Multivariate analysis in microarray data: Principal Component Analysis

[Figure: unsupervised PCA plots (PC1 vs. PC2) for two data sets: acute leukemia (ALL - AML) and breast cancer (degree of differentiation).]


Classification

[Figure: unsupervised and supervised analysis of two data sets: acute leukemia (ALL - AML) and breast cancer (degree of differentiation).]


Clustering: gene expression profiles

• Importance
– Identification of groups of coexpressed genes
– Coexpressed genes have a higher probability of having similar biological functions, e.g., they might interact with the same transcription factors (coregulation)

• First-generation algorithms: disadvantages
– Parameter fine-tuning
– Assign every profile to a cluster
– Computational complexity


Quality-based clustering (Heyer et al.)

The algorithm produces clusters with
– a quality guarantee (fixed, user-defined threshold for the diameter D)
– a maximum number of profiles

[Figure: candidate clusters of diameter D, e.g. candidate cluster 1: 3 profiles; candidate cluster 5: 6 profiles; candidate cluster 17: 2 profiles.]

Still some disadvantages!
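The candidate-cluster idea can be sketched as follows. This is a simplified illustration in the spirit of QT_Clust, not Heyer et al.'s implementation: it assumes Euclidean distance between profiles (the original work uses a correlation-based measure), and the names `grow_candidate` and `qt_clust` are hypothetical.

```python
import math


def grow_candidate(seed, available, profiles, max_diameter):
    """Grow a candidate cluster from `seed`, always adding the profile that
    keeps the cluster diameter smallest, until the threshold would be exceeded."""
    cluster = [seed]
    pool = [j for j in available if j != seed]

    def farthest(j):
        # Largest distance from profile j to any current cluster member.
        return max(math.dist(profiles[j], profiles[k]) for k in cluster)

    while pool:
        best = min(pool, key=farthest)
        if farthest(best) > max_diameter:
            break
        cluster.append(best)
        pool.remove(best)
    return cluster


def qt_clust(profiles, max_diameter, min_size=2):
    """Repeatedly keep the largest candidate cluster and remove its members.

    Returns (clusters, unassigned); profiles are referred to by index."""
    remaining = list(range(len(profiles)))
    clusters = []
    while remaining:
        candidates = [grow_candidate(s, remaining, profiles, max_diameter)
                      for s in remaining]
        best = max(candidates, key=len)
        if len(best) < min_size:
            break  # the remaining profiles stay unclustered
        clusters.append(sorted(best))
        remaining = [i for i in remaining if i not in best]
    return clusters, remaining
```

Every profile seeds its own candidate cluster in each round, which is where the quadratic cost in the number of profiles comes from.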


Adaptive quality-based clustering (AQBC)

• A heuristic iterative two-step approach

– Step 1: Quality-based approach:

Find a cluster center in an area of the data set where the density of expression profiles, within a sphere with preliminary radius, is locally maximal

– Step 2: Adaptive approach:

Re-estimation of the radius



Step 1: Localization of a cluster center

[Figure: localization of a cluster center within a sphere of radius R.]
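One way to picture Step 1 is a fixed-point iteration that moves the center to the mean of the profiles inside the current sphere until it stops moving. The sketch below is an assumption-laden simplification (Euclidean distance, fixed radius, caller-supplied starting point), not the AQBC procedure itself; `locate_center` is a hypothetical name.

```python
import math


def locate_center(profiles, start, radius, max_iter=100, tol=1e-9):
    """Shift the center to the mean of the profiles within `radius`
    until it moves by less than `tol` (or `max_iter` is reached)."""
    center = tuple(start)
    for _ in range(max_iter):
        inside = [p for p in profiles if math.dist(p, center) <= radius]
        if not inside:
            break  # empty sphere: keep the current center
        new_center = tuple(sum(c) / len(inside) for c in zip(*inside))
        moved = math.dist(new_center, center)
        center = new_center
        if moved < tol:
            break
    return center
```

Starting near a dense group of profiles, the center settles on the mean of that group while distant profiles are ignored.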


Step 2: Re-calculation of the radius

$p(r) = P_C \cdot p(r \mid C) + P_B \cdot p(r \mid B)$

$P(C \mid R_{new}) = \frac{P_C \cdot p(R_{new} \mid C)}{P_C \cdot p(R_{new} \mid C) + P_B \cdot p(R_{new} \mid B)} = S$
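Reading C as the cluster component and B as the background component of the mixture (an interpretation, since the slide does not define them), the new radius is the distance at which the posterior probability of cluster membership equals the significance level S. The sketch below solves that equation by bisection. The Gaussian densities in the usage example are toy stand-ins, not the distributions used in AQBC, and the function names are hypothetical.

```python
import math


def normal_pdf(r, mean, sd):
    """Toy density used only for the example below."""
    return math.exp(-0.5 * ((r - mean) / sd) ** 2) / (sd * math.sqrt(2 * math.pi))


def posterior_cluster(r, p_c, dens_c, dens_b):
    """P(C | r) for the two-component mixture, with P_B = 1 - P_C."""
    num = p_c * dens_c(r)
    return num / (num + (1.0 - p_c) * dens_b(r))


def solve_radius(s, p_c, dens_c, dens_b, lo, hi, tol=1e-10):
    """Find R in [lo, hi] with P(C | R) = s by bisection.

    Assumes P(C | r) decreases with r on [lo, hi]."""
    if not (posterior_cluster(lo, p_c, dens_c, dens_b) >= s
            >= posterior_cluster(hi, p_c, dens_c, dens_b)):
        raise ValueError("S is not bracketed on [lo, hi]")
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if posterior_cluster(mid, p_c, dens_c, dens_b) >= s:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


# Usage with toy densities: cluster distances around 1, background around 3.
r_new = solve_radius(0.95, 0.5,
                     lambda r: normal_pdf(r, 1.0, 0.5),
                     lambda r: normal_pdf(r, 3.0, 0.5),
                     lo=1.0, hi=3.0)
```

A larger S gives a smaller radius, so S acts as the statistical quality setting that replaces a user-chosen radius.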


Comparison

                                      AQBC                                    QT_Clust (Heyer et al.)
User-defined parameters               1. Data set                             1. Data set
                                      2. Significance level S                 2. Radius R or diameter D
                                      3. Minimum number of genes              3. Minimum number of genes
Quality measure                       Significance level S:                   Radius or diameter:
                                      statistical parameter                   arbitrary parameter
Cluster radius R                      Automatically calculated for each       Constant and user-defined
                                      cluster separately; not constant
Computational complexity              ~ O(n e VC)                             ~ O(n^2 e VC)
Number of clusters                    Not predefined                          Not predefined
Inclusion of all genes in clusters    No                                      No


Validation

Cluster number        MIPS functional category                        P-value (-log10)
AQBC    K-means                                                       AQBC    K-means
1       1             ribosomal proteins                              80      54
                      organisation of cytoplasm                       77      39
                      protein synthesis                               74      NR
                      cellular organisation                           34      NR
                      translation                                     9       NR
                      organisation of chromosome structure            1       4
2       4             mitochondrial organization                      18      10
                      energy                                          8       NR
                      proteolysis                                     7       NR
                      respiration                                     6       5
                      ribosomal proteins                              4       NR
                      protein synthesis                               4       NR
                      protein destination                             4       NR
5       2             DNA synthesis and replication                   18      16
                      cell growth, cell division, DNA synthesis       17      NR
                      recombination and DNA repair                    8       5
                      nuclear organization                            8       4
                      cell-cycle control and mitosis                  7       8


Availability

[Figure: number of hits over time, Jan-01 to Apr-04; vertical axis from 0 to 350.]

Himanen et al. (2004) Transcript profiling of early lateral root initiation. Proc Natl Acad Sci, 101, 5146-5151.


Conclusions

Data-mining framework for microarray data
• Feature extraction
– Univariate analysis
  • Estimation of n1 and n0
  • ROC curves: optimal balance between Type I and II error + quality measure
– Multivariate analysis: PCA
• Classification: FDA and LS-SVM
• Clustering
– Microarray experiments
– Gene expression profiles: AQBC

[Figure: PCA plots (PC1 vs. PC2); label: Clinical data.]


Selected publications

• De Smet, F., Marchal, K., Timmerman, D., Vergote, I., De Moor, B. and Moreau, Y. (2001) Gebruik van microroosters in de klinische oncologie, Tijdschr voor Geneeskunde, 57, 1225-1236.

• De Smet, F., Mathys, J., Marchal, K., Thijs, G., De Moor, B. and Moreau, Y. (2002) Adaptive quality-based clustering of gene expression profiles. Bioinformatics, 18, 735-746.

• Moreau, Y., De Smet, F., Thijs, G., Marchal, K. and De Moor, B. (2002) Functional bioinformatics of microarray data: from expression to regulation. Proceedings of the IEEE, 90, 1722-1743.

• De Smet, F., Moreau, Y., Timmerman, D., Vergote, I. and De Moor, B. (2004) Balancing false positives and false negatives for the detection of differential expression in malignancies. Br J Cancer, submitted.

• Epstein, E., Skoog, L., Isberg, P.E., De Smet, F., De Moor, B., Olofsson, P.A., Gudmundsson, S. and Valentin, L. (2002) An algorithm including results of gray-scale and power Doppler ultrasound examination to predict endometrial malignancy in women with postmenopausal bleeding. Ultrasound Obstet Gynecol, 20, 370-376.



Future research

• Specific
– Ovarian cancer: transcriptomics
  • Prediction of chemosensitivity in stage III
  • Prediction of recurrence in stage I
– Endometriosis: proteomics and transcriptomics
  • Detection of endometriosis
  • Prediction of relapse after surgery

• General
– Microarrays: number of patients - validation - standardization
– Proteomics
– Combination and comparison of microarray, proteomic and clinical data
