19
Gonzalo Gómez [email protected] Madrid, October 8th, 2008. ::: Normalization methods and data preprocessing Course on Microarray Gene Expression Analysis Bioinformatics Unit CNIO

Course on Microarray Gene Expression Analysis ...bioinfo.cnio.es/files/training/Microarray_Course/5... · Unsupervised clustering should be validated: Silhouette index (GEPAS) Highlighting

  • Upload
    others

  • View
    5

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Course on Microarray Gene Expression Analysis ...bioinfo.cnio.es/files/training/Microarray_Course/5... · Unsupervised clustering should be validated: Silhouette index (GEPAS) Highlighting

Gonzalo Gómez [email protected]

Madrid, October 8th, 2008.

::: Normalization methods and data preprocessing

Course on Microarray Gene Expression Analysis

Bioinformatics Unit CNIO

Page 2: Course on Microarray Gene Expression Analysis ...bioinfo.cnio.es/files/training/Microarray_Course/5... · Unsupervised clustering should be validated: Silhouette index (GEPAS) Highlighting

::: Introduction.

•  Involve a set of statistical techniques. •  Applied for the classification of objects into different groups or the partitioning of a data set into subsets (clusters). •  A cluster is a group of relatively homogeneous cases or observations. •  Makes no distinction between dependent and independent variables. •  If the rows (genes) and columns (samples) of a given dataset are clustered simultaneously (biclustering).

Clustering Analysis

Page 3: Course on Microarray Gene Expression Analysis ...bioinfo.cnio.es/files/training/Microarray_Course/5... · Unsupervised clustering should be validated: Silhouette index (GEPAS) Highlighting

::: Clustering methods.

•  Hierarchical methods:

Agglomerative:

UPGMA (Sneath and Sokal, 1973) Divisive:

SOTA (Herrero et al. 2001) DIvisive ANAlysis clustering (Kaufman & Rousseeuw, 1990) Gene Shaving (Hastie et al, 2000)

•  Non-hierarchical methods:

kmeans (Hartigan and Wong 1979) kmedians (Hartigan and Wong 1979) SOM (Kohonen 1979, Tamayo et al 1999) Fuzzy c-means (Dougherty et al. 2002) Probabilistic clustering (Bhattacharjee et al. 2001)

Clustering Analysis

Page 4: Course on Microarray Gene Expression Analysis ...bioinfo.cnio.es/files/training/Microarray_Course/5... · Unsupervised clustering should be validated: Silhouette index (GEPAS) Highlighting

UPGMA

1) B and C are the closests elements (i.e. Pearson)

2) A new matrix is recalculated with the elements B and C as a new pseudo-element. The distance of this element is obtained as the average distance (average linkage method)

3) The genes A and D are found to be closest in the distance matrix now and then are merged and the matrix rebuilt again.

4) The process ends up when all the elements have been linked.

Unweighted Pair Group Method with Arithmetic mean

From GEPAS tutorial

::: Clustering methods.

Clustering Analysis

Page 5: Course on Microarray Gene Expression Analysis ...bioinfo.cnio.es/files/training/Microarray_Course/5... · Unsupervised clustering should be validated: Silhouette index (GEPAS) Highlighting

Sample 1 Sample 2 Sample 3

Gene 1

0.7 0.3 7.3

Gene2

1.2 1.9 6.5

Gene3

1.1 0.9 8.9

0.88

-0.19 -0.62

Gene 1

Gene 2

Gene 3

Pearson Correlation Coefficient

Gene 1

Gene 2

Gene 3

Example:

Euclidean distance Pearson correlation coefficient Spearman ρ correlation coefficient …

Metrics Distances between genes??

:::Distances and correlations.

Clustering Analysis

Page 6: Course on Microarray Gene Expression Analysis ...bioinfo.cnio.es/files/training/Microarray_Course/5... · Unsupervised clustering should be validated: Silhouette index (GEPAS) Highlighting

B C A

B A C

A

B

C

EUCLIDEAN

CITY BLOCK (Manhattan or taxi cab)

OTHERS (Canberra, Bray-Curtis…)

PEARSON

SPEARMAN

Exp

ress

ion

Lev

els

t

Clustering Analysis

Page 7: Course on Microarray Gene Expression Analysis ...bioinfo.cnio.es/files/training/Microarray_Course/5... · Unsupervised clustering should be validated: Silhouette index (GEPAS) Highlighting

A B C D E

F G

CLUSTER 1

CLUSTER 2

CLUSTER 3

Linkage methods

(Weighted pair-group average, Within-groups clustering Ward´s method…)

Others…

A B C D E F G Average-linkage method

Single-linkage method

Complete-linkage method

E D F G A B C

G F E D C B A

::: Distances and correlations.

Clustering Analysis

Page 8: Course on Microarray Gene Expression Analysis ...bioinfo.cnio.es/files/training/Microarray_Course/5... · Unsupervised clustering should be validated: Silhouette index (GEPAS) Highlighting

SOTA Self-Organizing Tree Algorithm

-  Artificial Neural Network

-  Growth from the root of the tree, toward the leaves (from lower to higher resolution )

-Threshold of resource value should be fixed - Generation of a hierarchical cluster structure

at the desired level of resolution

Herrero et al. Bioinformatics. 17: 126-136. 2001.

::: Clustering methods.

Clustering Analysis

From GEPAS tutorial

Page 9: Course on Microarray Gene Expression Analysis ...bioinfo.cnio.es/files/training/Microarray_Course/5... · Unsupervised clustering should be validated: Silhouette index (GEPAS) Highlighting

23

2

15

21

6

12

33

20

Average profiles Number of genes

SOTA Self-Organizing Tree Algorithm

::: Clustering methods.

Page 10: Course on Microarray Gene Expression Analysis ...bioinfo.cnio.es/files/training/Microarray_Course/5... · Unsupervised clustering should be validated: Silhouette index (GEPAS) Highlighting

K-means clustering

::: Clustering methods.

Clustering Analysis

K = number of a priori defined clusters

Page 11: Course on Microarray Gene Expression Analysis ...bioinfo.cnio.es/files/training/Microarray_Course/5... · Unsupervised clustering should be validated: Silhouette index (GEPAS) Highlighting

K-means clustering

1 2 3

4 5 6

Gene expression

Time

Gene expression

Time

Gene expression

Time

::: Clustering methods.

Clustering Analysis

Page 12: Course on Microarray Gene Expression Analysis ...bioinfo.cnio.es/files/training/Microarray_Course/5... · Unsupervised clustering should be validated: Silhouette index (GEPAS) Highlighting

Example

www.isrec.isb-sib.ch.

K-means clustering

::: Clustering methods.

Clustering Analysis

Page 13: Course on Microarray Gene Expression Analysis ...bioinfo.cnio.es/files/training/Microarray_Course/5... · Unsupervised clustering should be validated: Silhouette index (GEPAS) Highlighting

Kohonen T. Proc IEEE 78(9):1464-1480,1990 Tamayo P. et al. PNAS, 96: 2907-2912; 1999.

•  Neural network •  Geometric dependence among nodes, i.e. 4 x 3

Self-Organized Maps (SOMs)

::: Clustering methods.

Clustering Analysis

1.

2.

3.

4.

A priori definition of K number and grid topology

Assing an element to a grid node.

Recalculate node weigth. New node weight is influenced by neighbors.

Assing a new element to a grid node.

Page 14: Course on Microarray Gene Expression Analysis ...bioinfo.cnio.es/files/training/Microarray_Course/5... · Unsupervised clustering should be validated: Silhouette index (GEPAS) Highlighting

EXAMPLE

http://biodiver.bio.ub.es

Self-Organized Maps (SOMs)

::: Clustering methods.

GWP (Gross World Product)

Clustering Analysis

Page 15: Course on Microarray Gene Expression Analysis ...bioinfo.cnio.es/files/training/Microarray_Course/5... · Unsupervised clustering should be validated: Silhouette index (GEPAS) Highlighting

Allison et al. Nat Rev Genet. 2006 Jan;7(1):55-65.

Clustering analyses always display a result. Unsupervised analysis is overused Clusterings based on number of replicates <50 are not reproducible Does not focused on differentially expresed genes

Unsupervised clustering should be validated: Silhouette index (GEPAS) Highlighting biological themes in cluster genes-EASE (DAVID) Resampling (robust clustering methods): Consensus clustering (Machine Learning, 52, 91–118, 2003) Consensus ensemble clustering (Cancer Res 2007; 67: (22). 2007)

::: Clustering validation.

Clustering Analysis Clustering Analysis

Page 16: Course on Microarray Gene Expression Analysis ...bioinfo.cnio.es/files/training/Microarray_Course/5... · Unsupervised clustering should be validated: Silhouette index (GEPAS) Highlighting

::: Clustering Validation.

Clustering Analysis

Silhouette index

•  Silhouette index is employed to asses the quality of each partition. The method compares the expression levels of genes in a given cluster and the ones in the sister partition.

•  Both if the expression levels between sister partitions are very similar, or if the expression levels within the same cluster are very divergent, a bad Silhouette index will be given.

•  Silhouette index (S) is a normalized value between -1 and 1:

-1 Represents no separation.

0 Represents a random separation.

1 Represents a perfect separation. All genes within the same cluster show the same expression levels and, at the same time, are completely different from the ones in the sister partition.

From GEPAS tutorial

Available in:

Page 17: Course on Microarray Gene Expression Analysis ...bioinfo.cnio.es/files/training/Microarray_Course/5... · Unsupervised clustering should be validated: Silhouette index (GEPAS) Highlighting

::: Clustering Validation.

Clustering Analysis

Consensus Clustering Monti, S. et al. Machine Learning, 2003

•  A method to represent the consensus across multiple runs of a clustering algorithm and to assess the stability of the discovered clusters.

•  Standard clustering (UPGMA, k-means, etc.) in conjuction with resampling (perturbation) techniques.

• The underlying assumption is that the more stable the results are with respect to the simulated perturbations, the more these results are to be trusted.

•  Address 2 fundamental issues: (i) how to determine the number of clusters. (ii) how to assign confidence to the selected number of clusters, as well as to the induced cluster assigments.

Available in:

Page 18: Course on Microarray Gene Expression Analysis ...bioinfo.cnio.es/files/training/Microarray_Course/5... · Unsupervised clustering should be validated: Silhouette index (GEPAS) Highlighting

::: Clustering Validation.

Clustering Analysis

Consensus Clustering Monti, S. et al. Machine Learning, 2003 Available in:

Raw data Raw data Consensus matrices Consensus matrices

Page 19: Course on Microarray Gene Expression Analysis ...bioinfo.cnio.es/files/training/Microarray_Course/5... · Unsupervised clustering should be validated: Silhouette index (GEPAS) Highlighting

Thanks!

Gonzalo Gómez [email protected]

Visit us: http://ubio.bioinfo.cnio.es/ubio/