
Revealing the internal structures of gene expression data sets

Matthias E. Futschik

Institute for Theoretical Biology

Humboldt-University, Berlin, Germany

Hvar Summer School, 2004

Overview

• It is a two-way road: Top-down vs bottom-up approaches
• Good to see: Visualisation
  – PCA and multi-dimensional scaling
• Guilt by association: Clustering
  – Hard clustering vs soft clustering
• Gattaca becomes alive: Classification

Approaches to modelling in molecular biology

[Diagram: the bottom-up approach proceeds from a set of measurements of single components to a network of interactions of single components; the top-down approach proceeds from system-wide measurements to the underlying molecular mechanism.]

Visualisation

• Visualisation of results remains an important tool for the detection of patterns.

• Examples are MA-plots, dendrograms, Venn diagrams or projections derived from multi-dimensional scaling.

• Do not underestimate the ability of the human eye!

Principal component analysis

PCA:
• linear projection of the data onto the major principal components, defined by the eigenvectors of the covariance matrix.
• PCA is also used for reducing the dimensionality of the data.
• Criterion to be minimised: the squared distance between the original and the projected data. This is fulfilled by the Karhunen-Loève transformation.

$$\tilde{\mathbf{x}} = P\mathbf{x}, \qquad C = \frac{1}{n}\sum_{i=1}^{n} (\mathbf{x}_i - \bar{\mathbf{x}})(\mathbf{x}_i - \bar{\mathbf{x}})^t$$

P is composed of the eigenvectors of the covariance matrix C.
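A minimal NumPy sketch of this procedure (centre the data, eigendecompose the covariance matrix, project onto the leading eigenvectors); all names and the toy data below are illustrative, not from any study cited on these slides:

```python
import numpy as np

def pca_project(X, n_components=2):
    """Project the rows of X onto the leading principal components."""
    X_centred = X - X.mean(axis=0)           # centre the data
    C = np.cov(X_centred, rowvar=False)      # covariance matrix
    eigvals, eigvecs = np.linalg.eigh(C)     # eigh: C is symmetric
    order = np.argsort(eigvals)[::-1]        # sort by decreasing variance
    P = eigvecs[:, order[:n_components]].T   # rows of P = leading eigenvectors
    return X_centred @ P.T, P

# Toy example: 100 samples measured on 6 genes (random data)
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 6))
X_proj, P = pca_project(X)
print(X_proj.shape)   # (100, 2)
```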

Example: Leukemia data sets by Golub et al.: Classification of ALL and AML

Multi-dimensional scaling

Sammon's mapping:
• Non-linear multi-dimensional scaling methods such as Sammon's mapping aim to optimally conserve the distances of the higher-dimensional space in the 2/3-dimensional space.
• Mathematically: minimisation of the error function E by the steepest descent method:

$$E = \frac{1}{\sum_{i<j}^{N} D_{ij}} \sum_{i<j}^{N} \frac{(D_{ij} - d_{ij})^2}{D_{ij}}$$

where D_ij are the distances in the original space and d_ij the distances in the projected space.

Example: DLBCL prognosis, cured vs fatal cases
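A minimal sketch of this minimisation by plain gradient descent on E, assuming Euclidean distances and a random 2D initialisation; the learning rate and iteration count are arbitrary illustrative choices:

```python
import numpy as np

def sammon(X, n_iter=500, lr=0.1, seed=0):
    """Sketch of Sammon's mapping: steepest descent on the error E above."""
    rng = np.random.default_rng(seed)
    N = X.shape[0]
    D = np.linalg.norm(X[:, None] - X[None, :], axis=-1)   # original distances
    scale = 1.0 / D[np.triu_indices(N, 1)].sum()           # 1 / sum of D_ij
    np.fill_diagonal(D, 1.0)            # avoid 0/0 on the (unused) diagonal
    Y = rng.normal(scale=1e-2, size=(N, 2))                # random 2D start
    for _ in range(n_iter):
        diff = Y[:, None] - Y[None, :]
        d = np.linalg.norm(diff, axis=-1)                  # projected distances
        np.fill_diagonal(d, 1.0)
        # dE/dY_i = 2*scale * sum_j (d_ij - D_ij)/(D_ij d_ij) (Y_i - Y_j)
        grad = 2 * scale * (((d - D) / (d * D))[:, :, None] * diff).sum(axis=1)
        Y -= lr * grad
    return Y
```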

Clustering: Birds of a feather flock together

• Clustering of genes
  – Co-expression indicates co-regulation: functional annotation
  – Clustering of time series

• Clustering of arrays
  – Finding new subclasses in sample space

• Two-way clustering
  – Parallel clustering of samples and genes

Clustering methods

Unsupervised classification of genes and/or samples. Motivation: co-expression indicates co-regulation.

General division into hierarchical and partitional clustering

Hierarchical clustering
• can be divisive or agglomerative, producing nested clusters.
• Results are usually visualised by tree structures (dendrograms).
• Clustering depends on the linkage procedure used: single, complete, average, Ward, ...
• A related family of methods is based on a graph-theoretical approach (e.g. CLICK).
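For illustration, agglomerative clustering of a toy expression matrix with SciPy; the average linkage and correlation distance used here are common but by no means mandatory choices:

```python
import numpy as np
from scipy.cluster.hierarchy import dendrogram, fcluster, linkage

# Toy expression matrix: 20 genes x 8 conditions (random, for illustration)
rng = np.random.default_rng(1)
expr = rng.normal(size=(20, 8))

# Agglomerative clustering; 'single', 'complete' or 'ward' could be
# substituted for 'average'
Z = linkage(expr, method='average', metric='correlation')

labels = fcluster(Z, t=4, criterion='maxclust')   # cut the tree into 4 clusters
dendrogram(Z)   # tree visualisation (needs matplotlib to actually display)
```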

Profiling of breast cancer by Perou et al.

Example for hierarchical clustering

A. Alizadeh et al., Nature, 2000: Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling

Clustering methods II

Partitional clustering
• divides data into a (pre-)chosen number of classes.
• Examples: k-means, SOMs, fuzzy c-means, simulated annealing, model-based clustering, HMMs, ...
• Setting the number of clusters is problematic.

Cluster validity:
• Most cluster algorithms always detect clusters, even in random data.
• Cluster validation approaches address the number of clusters actually present in the data.
• Approaches are based on objective functions, figures of merit, resampling, adding noise, ...

Hard clustering vs. soft clustering

Hard clustering:

• Based on classical set theory

• Assigns a gene to exactly one cluster

• No differentiation of how well a gene is represented by the cluster centroid

• Examples: hierarchical clustering, k-means, SOMs, ...

Soft clustering:

• Can assign a gene to several clusters

• Differentiates the grade of representation (cluster membership)

• Examples: fuzzy c-means, HMMs, ...

K-means clustering

Partitional clustering is frequently based on the optimisation of a given objective function. If the data is given as a set of N data vectors x_i, a common objective function is the square error function

$$E = \sum_{j=1}^{k} \sum_{\mathbf{x}_i \in C_j} d(\mathbf{x}_i, \mathbf{c}_j)^2$$

where d is the distance metric and c_j is the centre of cluster C_j.

• Partitional clustering splits the data into k partitions for a given integer k.
• A partition can be represented by a partition matrix U that contains the membership values μ_ij of each object i for each cluster j.
• For clustering methods based on classical set theory, clusters are mutually exclusive. This leads to the so-called hard partitioning of the data.

Hard partitions are defined as

$$M_{hc} = \left\{ U \in \mathbb{R}^{N \times k} \;\middle|\; \mu_{ij} \in \{0,1\} \;\forall i,j;\; \sum_{j=1}^{k} \mu_{ij} = 1 \;\forall i;\; 0 < \sum_{i=1}^{N} \mu_{ij} < N \;\forall j \right\}$$

where k is the number of clusters and N is the number of data objects.

K-means algorithm

1. Initialisation: choose k random vectors as cluster centres c_j;
2. Partitioning: assign x_i to cluster C_j if d(x_i, c_j) ≤ d(x_i, c_k) for all k ≠ j;
3. Calculation of cluster centres c_j based on the partition derived in step 2: the cluster centre c_j is defined as the mean value of all vectors within the cluster;
4. Calculation of the square error function E;
5. If the chosen stop criterion is met, stop; otherwise continue with step 2.

For the distance metric d, the Euclidean distance is generally chosen.
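A minimal NumPy sketch of these steps, assuming Euclidean distance and convergence of the centres as the stop criterion:

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Minimal k-means following steps 1-5 above (Euclidean distance).
    Note: this sketch does not guard against clusters becoming empty."""
    rng = np.random.default_rng(seed)
    centres = X[rng.choice(len(X), size=k, replace=False)]   # step 1
    for _ in range(n_iter):
        # step 2: assign each x_i to the nearest cluster centre
        dists = np.linalg.norm(X[:, None] - centres[None, :], axis=-1)
        assign = dists.argmin(axis=1)
        # step 3: recompute each centre as the mean of its cluster
        new_centres = np.array([X[assign == j].mean(axis=0) for j in range(k)])
        # steps 4/5: stop when the centres no longer move
        if np.allclose(new_centres, centres):
            break
        centres = new_centres
    E = (dists.min(axis=1) ** 2).sum()   # square error function E
    return assign, centres, E
```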

Hard clustering is sensitive to noise

Example data set:

Yeast cell cycle data by Cho et al.

The standard procedure is pre-filtering of genes based on variation, due to the noise sensitivity of hard clustering. However, no obvious threshold exists! (Heyer et al.: ca. 4000 genes; Tavazoie et al.: 3000 genes; Tamayo et al.: 823 genes)

=> Risk of losing essential information

=> Need for a noise-robust clustering method

[Figure: distribution of the standard deviation of gene expression]

Soft clustering is more noise robust

Hard clustering always detects clusters, even in random data

Soft clustering differentiates cluster strength and thus can avoid the detection of 'random' clusters.

Genes with high membership values cluster together in spite of added noise.

Differentiation in cluster membership allows profiling of cluster cores
• A gene can be assigned to several clusters
• Each gene is assigned to a cluster with a membership value between 0 and 1
• The membership values of a gene add up to one
• Genes with lower membership values are not well represented by the cluster centroid
• Expression of genes with high membership values is close to the cluster centroid

=> Clusters have internal structures

[Figure: hard clustering vs. cluster cores at membership values > 0.5 and > 0.7]
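For illustration, the standard fuzzy c-means membership update, sketched under the assumption of Euclidean distances and given cluster centres; the fuzzification parameter m is discussed on the next slide:

```python
import numpy as np

def fcm_memberships(X, centres, m=1.25):
    """Standard fuzzy c-means membership update for given cluster centres.
    Returns mu with mu[i, j] = membership of gene i in cluster j; each row
    sums to one."""
    d = np.linalg.norm(X[:, None] - centres[None, :], axis=-1)  # gene-centre distances
    d = np.fmax(d, 1e-12)                    # guard against zero distances
    # mu_ij = 1 / sum_k (d_ij / d_ik)^(2/(m-1))
    ratio = (d[:, :, None] / d[:, None, :]) ** (2.0 / (m - 1.0))
    return 1.0 / ratio.sum(axis=2)

# Cluster cores: e.g. genes with mu[:, j] > 0.7 for cluster j
```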

Variation of cluster parameters reveals cluster stability

Variation of the fuzzification parameter m determines the 'hardness' of the clustering:
• m → 1: fuzzy c-means clustering becomes equivalent to k-means
• m → ∞: all genes are equally assigned to all clusters

By variation of m clusters can be distinguished by their stability.

Weak clusters lose their cores.

Strong clusters maintain their cores for increasing m.

[Figure: cluster cores for m = 1.1 and m = 1.3]

Periodic and aperiodic clusters

Periodic clusters of yeast cell cycle:

Aperiodic clusters:

=> Aperiodic clusters were generally weaker than periodic clusters

Global clustering structure

c-means clustering allows the definition of the overlap of clusters, i.e. how many genes are shared by two clusters. This makes it possible to define a similarity measure between clusters. Global clustering structures can be visualised by graphs, with edges representing the overlap.
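The slides do not spell out the overlap formula, so the following is one plausible, assumed way to quantify it from the membership matrix, not the authors' exact definition:

```python
import numpy as np

def cluster_overlap(mu):
    """One plausible (assumed) overlap measure between fuzzy clusters:
    sum over genes of the product of their membership values, giving a
    k x k similarity matrix whose entries grow with shared genes."""
    return mu.T @ mu   # mu: (N_genes, k) membership matrix
```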

Increasing number of clusters

Non-linear 2D-projection by Sammon's Mapping

=> Sub-clustering reveals sub-structures

M. Futschik and B. Carlisle, Noise robust, soft clustering of gene expression data (in preparation)

Classification of microarray data

• Many diseases involve (unknown) complex interactions of multiple genes; thus the "single gene approach" is limited. Genome-wide approaches may reveal these interactions.

• To detect these patterns, supervised learning techniques from pattern recognition, statistics and artificial intelligence can be applied.

The medical applications of these “arrays of hope” are various and include the identification of markers for classification, diagnosis, disease outcome prediction, therapeutic responsiveness and target identification.

Gattaca – the Art of Classification

In contrast to clustering approaches, algorithms for supervised classification are based on labelled data.

Labels assign data objects to a predefined set of classes.

Frequently the class distributions are not known, so the learning of the classifiers is inductive.

The task for classification methods is the correct assignment of new examples, based on a set of examples of known classes.

Generalisation

[Figure: generalisation; unlabelled examples ('?') must be assigned to Class 1 or Class 2]

Challenges in classification of microarray data

• Microarray data exhibit large experimental and biological variances
  – experimental bias + tissue heterogeneity
  – cross-hybridisation
  – 'bad design': confounding effects

• Microarray data are sparse
  – high dimensionality of the gene (feature) space
  – low number of samples/arrays
  – curse of dimensionality

• Microarray data are highly redundant
  – many genes are co-expressed, thus their expression is strongly correlated.

Classification I: Models

• K-nearest neighbour – simple and quick method

• Decision trees – easy to follow the classification process

• Bayesian classifiers – inclusion of prior knowledge possible

• Neural networks – no model assumed

• Support vector machines – based on statistical learning theory; today's state of the art

Criteria for classification

Accuracy: how close the results are to the true values

Precision: how variable the results are among repeated measurements

Sensitivity: how many true positives are detected

Specificity: how many of the selected genes are true positives

[Figure: ROC curve, specificity plotted against 1 - sensitivity]
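A small sketch of how these quantities trace out a ROC curve when the decision threshold is varied, assuming binary labels and real-valued classifier scores:

```python
import numpy as np

def roc_points(scores, labels):
    """Sensitivity and specificity for every decision threshold.
    scores: classifier outputs (higher = more likely positive);
    labels: 1 for the positive class, 0 for the negative class."""
    labels = np.asarray(labels)[np.argsort(scores)[::-1]]
    tp = np.cumsum(labels)          # true positives above each threshold
    fp = np.cumsum(1 - labels)      # false positives above each threshold
    sensitivity = tp / labels.sum()
    specificity = 1 - fp / (1 - labels).sum()
    return sensitivity, specificity   # plot as on the slide
```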

Getting a good team: feature extraction (gene selection)

Selection of single genes based on:
• Parametric tests, e.g. t-test
• Non-parametric tests, e.g. Wilcoxon rank test

But the 11 best players do not necessarily form the best team!

Selection of groups of genes which act well together:
• Sensitivity measures
• Genetic algorithms
• Decision trees
• SVD

Bayesian Classifiers

 

The most fundamental classifier in statistical pattern recognition is the Bayes classifier, which is directly derived from Bayes' theorem.

Suppose a vector x belongs to one of k classes. The probability P(C, x) of observing x belonging to class C is

$$P(C, \mathbf{x}) = P(\mathbf{x} \mid C)\, P(C) \qquad (1)$$

P(x|C): conditional probability for x given that class C is observed
P(C): prior probability of class C

Similarly, the joint probability P(C, x) can be expressed by

$$P(C, \mathbf{x}) = P(C \mid \mathbf{x})\, P(\mathbf{x}) \qquad (2)$$

P(C|x): conditional probability for C given that object x is observed
P(x): prior probability of observing x

Bayesian Classifiers

Since equations (1) and (2) describe the same probability P(C, x), we can derive

$$P(\mathbf{x} \mid C)\, P(C) = P(C \mid \mathbf{x})\, P(\mathbf{x})$$

and thus

$$P(C \mid \mathbf{x}) = \frac{P(\mathbf{x} \mid C)\, P(C)}{P(\mathbf{x})}$$

This is the famous Bayes theorem, which can be applied for classification as follows: we assign x to class C_j if

$$P(C_j \mid \mathbf{x}) > P(C_k \mid \mathbf{x})$$

for all classes C_k with k ≠ j. This rule constitutes the Bayes classifier.
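A hedged sketch of this decision rule, assuming Gaussian class-conditional densities P(x|C); the rule itself is independent of this particular density model:

```python
import numpy as np
from scipy.stats import multivariate_normal

def bayes_classify(x, means, covs, priors):
    """Assign x to the class C_j with the highest posterior P(C_j | x).
    P(x) is a common factor and can be dropped when only the argmax matters."""
    posteriors = [multivariate_normal.pdf(x, mean=m, cov=c) * p   # P(x|C) P(C)
                  for m, c, p in zip(means, covs, priors)]
    return int(np.argmax(posteriors))
```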

Decision trees

• Stepwise classification into two classes
• Comprehensible (in contrast to black-box approaches)
• Overfitting can occur easily
• Complex (non-linear) interactions of genes may not be reflected in the tree structure
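For illustration, a small scikit-learn decision tree on random toy data; the shallow depth is an arbitrary choice to curb the overfitting mentioned above:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

rng = np.random.default_rng(3)
X = rng.normal(size=(40, 5))        # 40 samples, 5 'genes' (random toy data)
y = (X[:, 0] > 0).astype(int)       # class depends only on gene 0 here

# A shallow tree: stepwise binary splits at each node
tree = DecisionTreeClassifier(max_depth=2).fit(X, y)
print(export_text(tree))            # the comprehensible rule structure
```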

Artificial neural networks

• ANNs were originally inspired by the functioning of biological neurons.

• Two major components: neurons and the connections between them.

• A neuron receives inputs x_i and determines the output y based on an activation function. An example of a non-linear activation function is the sigmoid function

$$y = \frac{1}{1 + e^{-\left(\sum_i w_i x_i + b\right)}}$$

• Multi-layer perceptrons are hierarchically structured and trained by backpropagation.
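A minimal sketch of a single sigmoid neuron; the inputs and weights below are made up purely for illustration:

```python
import numpy as np

def neuron_output(x, w, b):
    """Single neuron, sigmoid activation: y = 1 / (1 + e^-(sum_i w_i x_i + b))."""
    return 1.0 / (1.0 + np.exp(-(np.dot(w, x) + b)))

# Illustrative, made-up inputs and weights
x = np.array([0.2, -1.0, 0.5])
w = np.array([0.8, 0.1, -0.4])
print(neuron_output(x, w, b=0.1))   # a value between 0 and 1
```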

Support vector machines

[Figure: non-linear mapping φ(.) from input space to feature space]

Example of a kernel function, the polynomial kernel:

$$K(\mathbf{x}, \mathbf{y}) = (\mathbf{x} \cdot \mathbf{y})^d$$

•SVMs are based on statistical learning theory and belong to the class of kernel based methods.

• The basic concept of SVMs is the transformation of input vectors into a high-dimensional feature space where a linear separation may be possible between the positive and negative class members.
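For illustration, a sketch using scikit-learn's SVC with the polynomial kernel above on random toy data; gamma=1 and coef0=0 recover the plain dot-product form, and none of this reproduces the setup of any study cited here:

```python
import numpy as np
from sklearn.svm import SVC

# Random two-class toy data: 60 samples, 50 'genes'
rng = np.random.default_rng(2)
X = np.vstack([rng.normal(0, 1, (30, 50)), rng.normal(1, 1, (30, 50))])
y = np.array([0] * 30 + [1] * 30)

# Polynomial kernel K(x, y) = (x . y)^d as above
clf = SVC(kernel='poly', degree=3, gamma=1.0, coef0=0.0)
clf.fit(X, y)
print(clf.score(X, y))   # training accuracy
```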

Example study: Tumour/Normal classification

Motivation: Colon cancer should be detected as early as possible to avoid invasive treatment.

Data: Study by Alon et al. based on expression profiling of 60 samples with Affymetrix GeneChips containing over 6000 genes.

Method: Adaptive neural networks

M. Futschik et al., Artificial Intelligence in Medicine, 2003

Network structure can be translated into linguistic rules.