SocalBSI 2008: Clustering Microarray Datasets Sagar Damle, Ph.D. Candidate, Caltech

Page 1: SocalBSI 2008: Clustering Microarray Datasets

Sagar Damle, Ph.D. Candidate, Caltech

Distance Metrics: Measuring similarity using the Euclidean and Correlation distance metrics

Principal Components Analysis: Reducing the dimensionality of microarray data

Clustering Algorithms:
• K-means
• Self-Organizing Maps (SOM)
• Hierarchical Clustering

Page 2:

MATRIX(genes, conditions) = Expression dataset

the first gene vector = (x11, x12, x13, x14 … x1n)

the leftmost condition vector = (x11, x21, x31 … xm1)

Rows (genes), columns (conditions [timepoints or tissues]):

x11, x12, x13, … x1n
x21, …
x31, …
…
xm1, … xmn

Page 3: Similarity measures

Clustering identifies groups of genes with “similar” expression profiles.

How is similarity measured?
• Euclidean distance
• Correlation coefficient
• Others: Manhattan, Chebyshev, squared Euclidean

Page 4:

In an experiment with 10 conditions, the gene expression profiles for two genes X and Y would have this form:

X = (x1, x2, x3, …, x10)

Y = (y1, y2, y3, …, y10)

Page 5: Similarity measure - Euclidean distance

For two genes measured in two experiments, Ga: (y1, y2) and Gb: (x1, x2):

d(Ga, Gb) = sqrt( (x1 - y1)^2 + (x2 - y2)^2 )

In general, if there are M experiments:

X = (x1, x2, x3, …, xm)

Y = (y1, y2, y3, …, ym)

d(X, Y) = sqrt( (x1 - y1)^2 + (x2 - y2)^2 + … + (xm - ym)^2 )
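The Euclidean distance between two expression profiles can be sketched in a few lines of Python. The function name and the two example genes are illustrative assumptions, not part of the original slides.

```python
import math

def euclidean_distance(x, y):
    """Euclidean distance between two expression profiles of equal length."""
    if len(x) != len(y):
        raise ValueError("profiles must cover the same conditions")
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

# Two hypothetical genes measured in two experiments
gene_a = (0.0, 0.0)
gene_b = (3.0, 4.0)
print(euclidean_distance(gene_a, gene_b))  # 5.0
```

The same function works unchanged for M experiments, because the sum runs over however many conditions the two profiles share.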

Page 6: Similarity measure – Pearson Correlation Coefficient

X = (x1, x2, x3, …, xm), Y = (y1, y2, y3, …, ym)

D = 1 - r

r = Z(X) · Z(Y) (dot product of the z-scores of vectors X and Y, with each z-score vector scaled to unit length)

r = |Z(X)| |Z(Y)| cos(θ), where θ is the angle between the two z-score vectors
• When two unit vectors are completely correlated, r = 1 and D = 0
• When two unit vectors are uncorrelated, r = 0 and D = 1
• When two unit vectors are completely anti-correlated, r = -1 and D = 2

Dot product review: http://mathworld.wolfram.com/DotProduct.html
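A minimal Python sketch of the correlation distance D = 1 - r follows. It assumes each profile is centered on its mean and scaled to unit length before the dot product is taken; the helper names `unit_z` and `correlation_distance` are made up for illustration.

```python
import math

def unit_z(profile):
    """Center a profile on its mean and scale it to unit length."""
    mean = sum(profile) / len(profile)
    centered = [v - mean for v in profile]
    norm = math.sqrt(sum(c * c for c in centered))
    if norm == 0:
        raise ValueError("a constant profile has no defined correlation")
    return [c / norm for c in centered]

def correlation_distance(x, y):
    """D = 1 - r, where r is the dot product of the unit-length z-scores."""
    r = sum(a * b for a, b in zip(unit_z(x), unit_z(y)))
    return 1.0 - r

# Hypothetical profiles: same trend gives D near 0, opposite trend gives D near 2
print(correlation_distance([1, 2, 3], [2, 4, 6]))
print(correlation_distance([1, 2, 3], [3, 2, 1]))
```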

Page 7: Euclidean vs Pearson Correlation

• Euclidean distance: takes into account the magnitude of the expression.
• Correlation distance: insensitive to the amplitude of expression; takes into account the trends of the change.
• Common trends are considered biologically relevant; the magnitude is considered less important.

[Figure: expression profiles of Gene X and Gene Y]
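The contrast between the two metrics can be checked on a toy case. In the sketch below, `gene_y` is a hypothetical profile with the same trend as `gene_x` at ten times the amplitude; both profiles and function names are invented for illustration.

```python
import math

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def pearson_r(x, y):
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

gene_x = [1.0, 2.0, 1.5, 3.0, 2.5]   # hypothetical profile
gene_y = [10 * v for v in gene_x]    # same trend, 10x the amplitude

print(euclidean(gene_x, gene_y))       # large: the magnitudes differ
print(1 - pearson_r(gene_x, gene_y))   # close to 0: the trends match
```

Under the Euclidean metric these two genes look far apart, while the correlation distance treats them as near-identical.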

Page 8:

[Figure panels: what correlation distance sees; what Euclidean distance sees]

Page 9: Principal Components Analysis (PCA)

A method for projecting microarray data onto a reduced (2- or 3-dimensional), easily visualized space.

Definition: Principal Components - A set of variables that define a projection that encapsulates the maximum amount of variation in a dataset; each component is orthogonal to (and therefore uncorrelated with) the previous principal components of the same dataset.

Example dataset: thousands of genes probed in 5 conditions.
• The expression profile of each gene is represented by the vector of its expression levels: X = (X1, X2, X3, X4, X5)
• Imagine each gene X as a point in a 5-dimensional space.
• Each direction/axis corresponds to a specific condition.
• Genes with similar profiles are close to each other in this space.
• PCA: project this dataset to 2 dimensions, preserving as much information as possible.
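One way to make the projection concrete is a small standard-library sketch. It finds the first two principal components by power iteration on the covariance matrix, with deflation for the second component; this is one assumed approach chosen for readability, and real analyses would more likely use an SVD routine from a numerical library. The tiny `genes` dataset is hypothetical.

```python
import math
import random

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def center(data):
    """Subtract each condition's mean so every column has mean zero."""
    means = [sum(col) / len(data) for col in zip(*data)]
    return [[v - m for v, m in zip(row, means)] for row in data]

def covariance(centered):
    """Condition-by-condition sample covariance matrix."""
    cols = list(zip(*centered))
    return [[dot(ci, cj) / (len(centered) - 1) for cj in cols] for ci in cols]

def leading_eigenvector(matrix, iterations=200, seed=0):
    """Power iteration: repeatedly multiply and renormalize a random vector."""
    rng = random.Random(seed)
    v = [rng.gauss(0, 1) for _ in matrix]
    norm = math.sqrt(dot(v, v))
    v = [x / norm for x in v]
    for _ in range(iterations):
        w = [dot(row, v) for row in matrix]
        norm = math.sqrt(dot(w, w))
        if norm == 0:
            break
        v = [x / norm for x in w]
    return v, dot(v, [dot(row, v) for row in matrix])

def pca_2d(data):
    centered = center(data)
    cov = covariance(centered)
    pc1, ev1 = leading_eigenvector(cov)
    # Deflation: remove PC1's share of the variance, then find the next axis
    size = range(len(cov))
    deflated = [[cov[i][j] - ev1 * pc1[i] * pc1[j] for j in size] for i in size]
    pc2, _ = leading_eigenvector(deflated, seed=1)
    coords = [(dot(row, pc1), dot(row, pc2)) for row in centered]
    return coords, pc1, pc2

# Hypothetical dataset: 4 genes (rows) in 3 conditions (columns)
genes = [[2.0, 0.0, 0.1], [-2.0, 0.0, -0.1], [0.0, 1.0, 0.0], [0.0, -1.0, 0.0]]
coords, pc1, pc2 = pca_2d(genes)
for point in coords:
    print(point)
```

Each gene ends up as a 2-D point that can be plotted directly. The sign of each component is arbitrary, so a plot may appear mirrored between runs with different seeds.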

Page 11: PCA transformation of a microarray dataset

Visual estimation of the number of clusters in the data

Page 12:

1-page tutorial on singular value decomposition (PCA):
http://web.mit.edu/be.400/www/SVD/Singular_Value_Decomposition.htm

Page 13: Cluster analysis Function

• Places genes with similar expression patterns in groups.
• Sometimes genes of unknown function will be grouped with genes of known function.
• The functions that are known allow the investigator to hypothesize regarding the functions of genes not yet characterized.

Examples:
• Identify genes important in cell cycle regulation
• Identify genes that participate in a biosynthetic pathway
• Identify genes involved in a drug response
• Identify genes involved in a disease response

Page 14: Clustering yeast cell cycle dataset vs gene tree ordering

Page 15: How to choose the number of clusters needed to informatively partition the data

• Trial and error: try clustering with a different number of clusters, and compare your results.
• Criteria for comparison: Homogeneity vs Separation
• Use PCA (Principal Component Analysis) to visually determine how well the algorithm grouped genes.
• Calculate the mean distance between all genes within a cluster (it should be small) and compare that to the distance between clusters (which should be large).

Page 16: Mathematical evaluation of clustering solution

Merits of a ‘good’ clustering solution:

Homogeneity:
• Genes inside a cluster are highly similar to each other.
• Measured as the average similarity between a gene and the center (average profile) of its cluster.

Separation:
• Genes from different clusters have low similarity to each other.
• Measured as the weighted average similarity between centers of clusters.

These are conflicting features: increasing the number of clusters tends to improve within-cluster Homogeneity at the expense of between-cluster Separation.
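Both scores can be sketched in Python under two stated assumptions: similarity is taken to be the Pearson correlation, and the separation average weights each pair of cluster centers by the product of the two cluster sizes. The slides specify neither choice, and the example clusters are hypothetical.

```python
import math

def pearson(x, y):
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def center_of(cluster):
    """Average profile of a cluster (a list of gene profiles)."""
    return [sum(col) / len(cluster) for col in zip(*cluster)]

def homogeneity(clusters):
    """Average similarity between each gene and its own cluster center."""
    total, count = 0.0, 0
    for cluster in clusters:
        center = center_of(cluster)
        total += sum(pearson(gene, center) for gene in cluster)
        count += len(cluster)
    return total / count

def separation(clusters):
    """Size-weighted average similarity between pairs of cluster centers."""
    centers = [center_of(c) for c in clusters]
    weighted, weights = 0.0, 0.0
    for i in range(len(clusters)):
        for j in range(i + 1, len(clusters)):
            w = len(clusters[i]) * len(clusters[j])  # assumed weighting
            weighted += w * pearson(centers[i], centers[j])
            weights += w
    return weighted / weights

# Hypothetical solution: one rising cluster and one falling cluster
clusters = [[[1, 2, 3], [2, 4, 6]], [[3, 2, 1], [6, 4, 2]]]
print(homogeneity(clusters), separation(clusters))
```

With a similarity measure, a good solution has homogeneity near its maximum and a low separation score, since the separation value here is a similarity between centers.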

Page 17: Performance on Yeast Cell Cycle Data

[Figure: Homogeneity and Separation scores for “True”, CAST*, GeneCluster, K-means, and CLICK]

*Ben-Dor, Shamir, Yakhini 1999

698 genes, 72 conditions (Spellman et al. 1998). Each algorithm was run by its authors in a “blind” test.

Page 18: Clustering Algorithms

• K-means
• SOMs
• Hierarchical clustering

Page 19: K-MEANS

1. The user sets the number of clusters, k.
2. Initialization: each gene is randomly assigned to one of the k clusters.
3. Average expression vector is calculated for each cluster (cluster’s profile).
4. Iterate over the genes:
   • For each gene, compute its similarity to the cluster profiles.
   • Move the gene to the cluster it is most similar to.
   • Recalculate cluster profiles.
5. Score current partition: sum of distances between genes and the profile of the cluster they are assigned to (homogeneity of the solution).
6. Stop criteria: further shuffling of genes results in minor improvement in the clustering score.
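The six steps above can be sketched in Python as follows. This is a simplified batch variant: cluster profiles are recomputed once per pass rather than after every single move, Euclidean distance stands in for the similarity measure, and an empty cluster is re-seeded with a random gene. All three choices, along with the function names and the toy data, are assumptions for illustration.

```python
import math
import random

def distance(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def mean_profile(members):
    return [sum(col) / len(members) for col in zip(*members)]

def kmeans(genes, k, tolerance=1e-6, max_passes=100, seed=0):
    rng = random.Random(seed)
    # Step 2: random initial assignment of each gene to one of k clusters
    labels = [rng.randrange(k) for _ in genes]
    previous_score = float("inf")
    profiles, score = [], previous_score
    for _ in range(max_passes):
        # Step 3: average expression vector (profile) of each cluster
        profiles = []
        for c in range(k):
            members = [g for g, lab in zip(genes, labels) if lab == c]
            # Assumption: an empty cluster is re-seeded with a random gene
            profiles.append(mean_profile(members) if members
                            else list(rng.choice(genes)))
        # Step 4: move each gene to the cluster whose profile is closest
        labels = [min(range(k), key=lambda c: distance(g, profiles[c]))
                  for g in genes]
        # Step 5: score the partition
        score = sum(distance(g, profiles[lab]) for g, lab in zip(genes, labels))
        # Step 6: stop when the improvement is minor
        if previous_score - score < tolerance:
            break
        previous_score = score
    return labels, profiles, score

# Hypothetical data: two well-separated groups of three genes each
genes = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
labels, profiles, score = kmeans(genes, k=2)
print(labels)
```

Because the starting assignment is random, different seeds can give different partitions on less clearly separated data, which is one reason to repeat the run and compare scores.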

Page 21:

Rows: genes. Columns: experiments.

          0hrs   1hr    2hr    3hr    4hr
gene A    0.12   1.68   0.99   1.05   1.44
gene B    0.47   1.37   1.06   0.91   1.96
gene C    1.97   0.87   1.84   0.30   1.17
gene D    1.21   1.22   1.71   1.45   1.68
gene E    0.25   0.70   0.66   0.83   1.38
gene F    0.81   0.34   1.18   1.85   1.18
gene G    1.64   0.08   1.03   0.36   1.64
gene H    1.78   1.64   1.71   1.49   0.97
gene I    0.14   0.68   0.88   1.54   0.49
gene J    1.01   0.84   0.06   1.87   1.11
gene K    0.91   1.57   1.49   0.81   1.32
gene L    1.71   1.33   0.27   1.59   0.87
gene M    1.46   0.12   1.60   0.44   0.73
gene N    0.88   1.21   1.44   1.46   1.90
..        1.15   1.30   1.16   1.07   0.23

Page 22: K-MEANS example: 4 clusters (too many?)

[Figure: mean profile of each cluster, with the standard deviation in each condition]

Page 23: Evaluating K-means

[Figure labels: Cluster 1, Cluster 2, Cluster 3, Cluster 4, Mis-classified]

Page 24: K-means example: 3 clusters (looks right)

Page 25: K-means clustering: K=2 (too few)

Page 26: SOMs (Self-Organizing Maps): less clustering and more data organizing

• User sets the number of clusters in the form of a rectangular grid (e.g., 3x2): ‘map nodes’
• Imagine genes as points in (M-dimensional) space
• Initialization: map nodes are randomly placed in the data space

Page 27:

[Figure legend: genes are data points; clusters are map nodes]

Page 28: SOM - Scheme

• Randomly choose a data point (gene).
• Find its closest map node.
• Move this map node towards the data point.
• Move the neighbor map nodes towards this point, but to a lesser extent (thinner arrows show weaker shift).
• Iterate over data points.

Page 29:

• Each successive gene profile (black dot) has less of an influence on the displacement of the nodes.

• Iterate through all profiles several times (10-100)

• When positions of the cluster nodes have stabilized, assign each gene to its closest map node (cluster)
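The SOM scheme can be sketched in Python as below. The slides do not say how the neighbor shift or the shrinking influence are computed, so this sketch assumes a Gaussian neighborhood on the grid and a learning rate and radius that decay linearly over the run. Function names, parameter defaults, and the example genes are hypothetical.

```python
import math
import random

def distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def closest_node(nodes, gene):
    """Grid position of the map node nearest to a gene in data space."""
    return min(nodes, key=lambda pos: distance(nodes[pos], gene))

def train_som(genes, rows, cols, passes=20, rate0=0.5, seed=0):
    rng = random.Random(seed)
    dims = range(len(genes[0]))
    lo = [min(g[d] for g in genes) for d in dims]
    hi = [max(g[d] for g in genes) for d in dims]
    # Initialization: map nodes are placed at random inside the data range
    nodes = {(r, c): [rng.uniform(lo[d], hi[d]) for d in dims]
             for r in range(rows) for c in range(cols)}
    radius0 = max(rows, cols) / 2.0
    total, step = passes * len(genes), 0
    for _ in range(passes):
        for gene in rng.sample(genes, len(genes)):  # random visiting order
            remaining = 1.0 - step / total
            rate = rate0 * remaining             # later genes move nodes less
            radius = radius0 * remaining + 0.01  # neighborhood shrinks too
            winner = closest_node(nodes, gene)
            for pos, weights in nodes.items():
                grid_dist = math.hypot(pos[0] - winner[0], pos[1] - winner[1])
                # Assumed Gaussian falloff: neighbors shift less than the winner
                pull = rate * math.exp(-grid_dist ** 2 / (2 * radius ** 2))
                for d in dims:
                    weights[d] += pull * (gene[d] - weights[d])
            step += 1
    return nodes

# Hypothetical genes as points in 2-dimensional space, mapped on a 3x2 grid
genes = [[0.1, 0.2], [0.0, 0.3], [1.0, 1.1], [0.9, 1.0], [2.0, 0.1], [2.1, 0.0]]
nodes = train_som(genes, rows=3, cols=2)
clusters = [closest_node(nodes, g) for g in genes]
print(clusters)
```

After training, each gene is assigned to its closest map node, so the grid position returned by `closest_node` serves as the cluster label.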

Page 32: Hierarchical Clustering

Goal #1: Organize the genes in a structure of a hierarchical tree
1) Initial step: each gene is regarded as a cluster with one item
2) Find the 2 most similar clusters and merge them into a common node (red dot)
3) Merge successive nodes until all genes are contained in a single cluster

Goal #2: Collapse branches to group genes into distinct clusters

[Figure: tree over genes g1 g2 g3 g4 g5, with nodes {1,2}, {4,5}, {1,2,3}, {1,2,3,4,5}]
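Both goals can be sketched with a short agglomerative routine in Python. The slides do not state how the similarity between two multi-gene clusters is computed, so this sketch assumes average linkage with Euclidean distance. The five one-dimensional example genes are invented so that the merge order mirrors the tree in the figure (gene indices are 0-based in the code).

```python
import math

def distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def average_linkage(a, b, genes):
    """Mean pairwise distance between the genes of clusters a and b."""
    return sum(distance(genes[i], genes[j]) for i in a for j in b) / (len(a) * len(b))

def hierarchical(genes, n_clusters=1):
    """Merge the two closest clusters until n_clusters remain.

    Returns the remaining clusters and the list of merges in order.
    """
    clusters = [(i,) for i in range(len(genes))]  # step 1: one gene per cluster
    merges = []
    while len(clusters) > n_clusters:
        # Step 2: find the most similar (closest) pair of clusters
        _, i, j = min((average_linkage(clusters[i], clusters[j], genes), i, j)
                      for i in range(len(clusters))
                      for j in range(i + 1, len(clusters)))
        merged = tuple(sorted(clusters[i] + clusters[j]))
        merges.append(merged)
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return clusters, merges

# Hypothetical 1-D profiles for g1..g5 (indices 0..4)
genes = [[0.0], [1.0], [3.0], [10.0], [11.2]]

_, merges = hierarchical(genes)                   # Goal 1: the full tree
print(merges)
groups, _ = hierarchical(genes, n_clusters=2)     # Goal 2: cut into 2 clusters
print(sorted(groups))
```

Stopping the merging early is one simple way to collapse branches into distinct clusters; cutting the finished tree at a chosen distance is an equally common alternative.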

Page 33: Which genes to cluster?

• Apply filtering prior to clustering: focus the analysis on the ‘responding genes’.
• The application of controlled statistical tests to identify ‘responding genes’ usually ends up with too few genes, which does not allow for a global characterization of the response.
• Variance: filter out genes that do not vary greatly among the conditions of the experiment.
  Non-varying genes skew clustering results, especially when using a correlation coefficient.
• Fold change: choose genes that change by at least M-fold in at least L conditions.
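The two filters can be sketched in Python as follows. The fold-change filter assumes each value is already an expression ratio relative to a reference sample, so a value of 2.0 or 0.5 both count as a 2-fold change; the slides do not specify the reference. Function names, thresholds, and the three example genes are hypothetical.

```python
import statistics

def variance_filter(genes, min_variance):
    """Keep genes whose sample variance across conditions reaches a threshold."""
    return {name: profile for name, profile in genes.items()
            if statistics.variance(profile) >= min_variance}

def fold_change_filter(genes, m_fold, l_conditions):
    """Keep genes that change at least M-fold in at least L conditions.

    Assumes positive ratios relative to a reference, so both
    ratio >= M (induced) and ratio <= 1/M (repressed) count as a change.
    """
    kept = {}
    for name, ratios in genes.items():
        hits = sum(1 for r in ratios if r >= m_fold or r <= 1.0 / m_fold)
        if hits >= l_conditions:
            kept[name] = ratios
    return kept

# Hypothetical expression ratios in four conditions
genes = {
    "flat": [1.0, 1.1, 0.9, 1.0],
    "up":   [1.0, 2.5, 3.0, 1.2],
    "down": [1.0, 0.4, 0.3, 0.9],
}
print(sorted(variance_filter(genes, min_variance=0.05)))
print(sorted(fold_change_filter(genes, m_fold=2.0, l_conditions=2)))
```

In both cases the non-varying gene is dropped before clustering, which addresses the point above about flat profiles skewing correlation-based results.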

Page 34: Clustering – Tools

• Cluster (Eisen) – hierarchical clustering
  http://rana.lbl.gov/EisenSoftware.htm
• GeneCluster (Tamayo) – SOM
  http://bioinfo.cnio.es/wwwsomtree/
• TIGR MeV – K-Means, SOM, hierarchical, QTC, CAST
  http://www.tm4.org/mev.html
• Expander – CLICK, SOM, K-means, hierarchical
  http://www.cs.tau.ac.il/~rshamir/expander/expander.html
• Many others (e.g. GeneSpring)
  http://www.agilent.com/chem/genespring

Page 35: Analysis Strategy

1) Transform dataset using PCA
2) Cluster
   • Parameters to test:
     • Distance metric
     • Number of clusters
     • Separation & Homogeneity
3) Assign biological meaning to clusters

Page 36:

Original presentation created by Rani Elkon and posted at:
http://www.tau.ac.il/lifesci/bioinfo/teaching/2002-2003/DNA_microarray_winter_2003.html