Page 1: More About Clustering

More About Clustering

Naomi Altman

Nov '06

Page 2: More About Clustering

Assessing Clusters

Some things we might like to do:

1. Understand the within cluster similarity and between cluster differences in terms of distances and expression profiles.

2. Decide how many clusters "are" in the data.

3. Display the clusters graphically.

4. Determine cluster stability.

5. Determine stability of genes in a cluster.

Page 3: More About Clustering

Assessing Clusters

1. Understand the within cluster similarity and between cluster differences in terms of distances and expression profiles.

2. Decide how many clusters "are" in the data.

Distance similarity: genes in the same cluster should be closer together than genes in different clusters.

Expression similarity: genes in the same cluster should have expression profiles that are more similar (on features of interest) than genes in different clusters. (And if this is not true, maybe we need a different similarity measure.)

• Display the clusters graphically.
• Determine cluster stability.
• Determine stability of genes in a cluster.

Page 4: More About Clustering

Distance similarity: Genes in the same cluster should be closer

together than genes in different clusters.

We need to be careful about how we define "closer".

Page 5: More About Clustering

Some criteria

Minimize the total within-cluster distances,

$$\sum_{C} \sum_{x_i \in C} d(x_i, \bar{x}_C)^p.$$

(K-means clustering finds a local minimum of this function for p = 2. We might also consider p = 1.) Clearly, the larger K is, the smaller this criterion. We could consider a scree plot.

The silhouette is a measure of how much closer points in the cluster are to one another compared to points outside the cluster. Observations with large, positive silhouette widths (near 1) are well clustered, those with widths around 0 lie between clusters, and those with negative widths are placed in the "wrong" cluster.
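For reference, the quantity computed by silhouette() in the cluster package is, for observation i with a(i) the average distance to the other members of its own cluster and b(i) the smallest average distance to the members of any other cluster,

$$s(i) = \frac{b(i) - a(i)}{\max\{a(i),\, b(i)\}}.$$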

Page 6: More About Clustering

# M.yeast is the data with 294 genes and 24 times (arrays)
library(cluster)
kmeans6.out = kmeans(M.yeast, 6)
sil6 = silhouette(kmeans6.out$cluster, dist(yeast.M[, 2:25], method = "euclidean"))
plot(sil6)
summary(sil6)

Page 7: More About Clustering

Some diagnostics for the Cell Cycle Data

Silhouette of 294 units in 3 clusters
Cluster sizes and average silhouette widths:
       114        145         35
 0.1754309  0.1837610  0.1968444
Individual silhouette widths:
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max.
-0.01413  0.10110  0.17530  0.18210  0.26640  0.38640

Page 8: More About Clustering

Using Height

In hierarchical clustering, we can use the cluster height to help determine where to define clusters.

complete.out=hclust(dist(M.yeast),method="complete")

plot(1:293,complete.out$height,ylab="Distance at Join",main="Complete Linkage")

Page 9: More About Clustering

Using Height in Hierarchical Clustering

Gaps might indicate heights at which to cut the tree to form clusters
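A minimal sketch of using such a gap, assuming complete.out from Page 8; cutting in the middle of the largest gap is one simple choice, not the only one:

h = complete.out$height                  # join heights, in increasing order
gap = diff(h)                            # gaps between successive joins
cut.h = h[which.max(gap)] + max(gap)/2   # cut height in the middle of the largest gap
clust.gap = cutree(complete.out, h = cut.h)
table(clust.gap)                         # resulting cluster sizes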

Page 10: More About Clustering

Assessing Clusters

• Understand the within cluster similarity and between cluster differences in terms of distances and expression profiles.

• Decide how many clusters "are" in the data.

Display the clusters graphically.

How do the clusters relate to the treatments?

What are the "typical" expression profiles for genes in the clusters?

• Determine cluster stability.

• Determine stability of genes in a cluster.

Page 11: More About Clustering

Plotting Cluster Centers

# All of the clustering routines provide a vector in the same order

# as the genes, with the cluster number.

# For hierarchical methods, you need to cut the tree to form the

# clusters. You can do it by height or the desired number of

# clusters.

clust6=cutree(complete.out,k=6)

par(mfrow=c(2,3))
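# time is assumed here to be the vector of 24 sampling times for the arrays, defined elsewhere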

for (i in 1:6){

M=M.yeast[clust6==i,]

meansi=apply(M,2,mean)

plot(time,meansi,xlab="Time",ylab="Mean Expression",main=paste("Cluster",i),type="l")}

Page 12: More About Clustering

Complete Linkage
cluster    1    2    3    4    5    6
n         75   17  100   62   37    3

Single Linkage
cluster    1    2    3    4    5    6
n        287    1    1    3    1    1

Page 13: More About Clustering

Complete Linkage
cluster    1    2    3    4    5    6
n         75   17  100   62   37    3

K-means
cluster    1    2    3    4    5    6
n         35   31   61   45   91   31
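A quick way to compare two partitions is to cross-tabulate their labels; a minimal sketch, assuming clust6 (complete linkage, from the cutree call on Page 11) and kmeans6.out (from Page 6). Labels are arbitrary, so look for rows and columns that line up rather than for matching numbers:

table(clust6, kmeans6.out$cluster)   # rows: complete linkage clusters, columns: K-means clusters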

Page 14: More About Clustering

The genes do not look much like the cluster means - wrong metric!

We are interested in the cyclic pattern - we should use correlation distance.

Page 15: More About Clustering

Correlation Distance by Z-scores

zscore = function(v) (v - mean(v))/sqrt(var(v))
yeast.Z = apply(yeast.M[, 2:25], 1, zscore)

# apply saves the results in the columns - we want genes in rows
yeast.Z = t(yeast.Z)
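Correlation distance can also be used directly; a minimal sketch, assuming M.yeast is the 294 x 24 expression matrix:

cor.dist = as.dist(1 - cor(t(M.yeast)))           # 1 - correlation between gene profiles
cor.out = hclust(cor.dist, method = "complete")

Since each z-scored profile has mean 0 and variance 1, the squared Euclidean distance between two z-scored genes equals 2(n-1)(1 - correlation), so Euclidean clustering of the z-scores and clustering on correlation distance rank pairs of genes identically.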

Page 16: More About Clustering
Page 17: More About Clustering

data (left) vs. z-scores (right)

Page 18: More About Clustering

Assessing Clusters

• Understand the within cluster similarity and between cluster differences in terms of distances and expression profiles.

• Decide how many clusters "are" in the data.
• Display the clusters graphically.

Determine cluster stability.

We already know that different clustering methods and metrics cluster differently.

How stable are the clusters if we:

add noise

leave out a treatment, array, gene.

Removing noise stabilizes real clusters.

• Determine stability of genes in a cluster.

Page 19: More About Clustering

The Effect of Noise on Clustering

Each data value could be anywhere inside the circle. The red gene might be clustered with the blue or green genes.

The correlation between the black and red genes is 1.0. Correlation distance will blow up patterns that are just due to noise.

Page 20: More About Clustering

Assessing "Noise"When we have replicates for the treatments, we have

several ways to assess noise.

1) Bootstrap arrays - i.e. resample with replacement within a treatment.

2) Subsample arrays - i.e. take subsets of arrays within a treatment

3) Take residuals from a fitted model

a) Bootstrap the residuals

b) Use the residuals to estimate an error distribution and resample from the distribution

Page 21: More About Clustering

Resampling from Arrays

We have treatments 1 ... T and for each treatment there are n_t arrays.

Bootstrap: sample n_t arrays with replacement within each treatment.

Subsample: sample k_t arrays without replacement within each treatment.

Recluster. Look at changes in clusters or cluster summaries for each reclustering. (Note that cluster labels will change.)
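A minimal R sketch of the bootstrap version, assuming trt is a (hypothetical) factor of length 24 giving each array's treatment; K = 6 is illustrative:

boot.cols = unlist(lapply(levels(trt), function(t) {
  cols = which(trt == t)
  sample(cols, length(cols), replace = TRUE)      # resample arrays within this treatment
}))
kmeans.boot = kmeans(M.yeast[, boot.cols], 6)     # recluster the resampled arrays
table(kmeans.boot$cluster, kmeans6.out$cluster)   # compare with the original clustering (labels will differ)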

Page 22: More About Clustering

Using Residuals

e.g. We have treatments 1 ... T and for each treatment there are n_t arrays.

For each gene i and treatment t there is an expression mean m_it.

For gene i, treatment t, array a, there is an expression value y_ita.

The estimated noise is r_ita = y_ita - m_it.

Obtain bootstrap samples by (for each gene, but ignoring t and a) sampling with replacement from the r_ita, giving residuals r*_ita.

The bootstrap samples are y*_ita = m_it + r*_ita.

(If you are worried about correlation among the genes on the same array, you can sample the entire vector of errors.)

Alternatively, use the histogram of the r_ita to estimate a distribution for each gene (e.g. N(0, σ_i²)) and sample from that distribution (but it is harder to deal with correlation among genes).

Recluster. Look at changes in clusters or cluster summaries for each reclustering. (Note that cluster labels will change.)
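A minimal R sketch of the residual bootstrap, assuming trt is a (hypothetical) factor of length 24 giving each array's treatment; residuals are resampled within each gene, ignoring t and a, as described above:

m = t(apply(M.yeast, 1, function(g) tapply(g, trt, mean)))   # treatment means m_it (genes x treatments)
m.fit = m[, as.integer(trt)]                                 # expand back to one column per array
r = M.yeast - m.fit                                          # residuals r_ita
r.star = t(apply(r, 1, sample, replace = TRUE))              # resample residuals within each gene
y.star = m.fit + r.star                                      # bootstrap data set y*_ita
boot.out = hclust(dist(y.star), method = "complete")         # recluster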

Page 23: More About Clustering

Removing genes breaks up clusters

Removing arrays changes similarity

Page 24: More About Clustering

Other Perturbations

Recluster: Random subsamples of genes

Random subsamples of treatments

Page 25: More About Clustering

Using the Simulated Samples

If the clusters are well-separated, we expect the changes due to the perturbations to be small.

If the clusters are just "zip-coding" i.e. dividing a big cluster into nearby but not separated regions, we expect the changes due to the perturbations to be large.

If the clusters are due to noise, we expect changes due to the perturbations to be large.

Page 26: More About Clustering

The original data and a slightly perturbed version lead to perturbed dendrograms and cluster centers

Page 27: More About Clustering

Reducing Noise to Find Structure

As for differential expression analysis, we hope to find replicable structure by clustering.

Reducing noise improves our ability to find biologically meaningful structure - particularly when we are using scale invariant similarity measures like correlation.

1: Replicate, replicate, replicate

We can cluster arrays just like we cluster genes. If an array does not cluster with its biological replicates then either there is no differential expression or it is a bad array.

For each gene, the mean over the replicates is less noisy by a factor of √(#reps): 2 reps give about 30% less noise, 4 reps give 50% less.

Instead of clustering the profile over arrays, cluster over treatment means.
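Clustering over treatment means can reuse the per-gene treatment-mean matrix m from the residual-bootstrap sketch above (genes in rows, one column per treatment):

mean.out = hclust(dist(m), method = "complete")   # cluster the 294 genes on their T treatment means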

Page 28: More About Clustering

Reducing Noise to Find Structure

As for differential expression analysis, we hope to find replicable structure by clustering.

Reducing noise improves our ability to find biologically meaningful structure - particularly when we are using scale-invariant similarity measures like correlation.

2. Filter genes

Remove non-differentially expressing genes from the data matrix before clustering - especially when using scale-free distances.

Best: If possible, do a differential expression analysis and use only the significant genes.

The correlation between the black and red genes is 1.0. Correlation distance will blow up patterns that are just due to noise.

Page 29: More About Clustering

Reducing Noise to Find Structure

As for differential expression analysis, we hope to find replicable structure by clustering.

Reducing noise improves our ability to find biologically meaningful structure - particularly when we are using scale-invariant similarity measures like correlation.

3. Reduce dimension

The initial dimension for clustering genes is the number of arrays.

Using treatment means reduces this to the number of treatments.

If there are no replicates, dimension reduction can be used, such as regressing each gene's profile on the most significant eigengenes or on a pattern of interest such as sine and cosine waves. The regression coefficients are then used as the input to the clustering routine.

Page 30: More About Clustering
Page 31: More About Clustering

library(limma)
svd.z = svd(yeast.Z)                           # eigengenes: right singular vectors (columns of svd.z$v)
design.reg = model.matrix(~svd.z$v[, 1:4])     # regress each gene on the first 4 eigengenes
fit.reg = lmFit(yeast.Z, design.reg)
pc.out = hclust(dist(fit.reg$coef))            # cluster genes on the regression coefficients
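The resulting tree pc.out can then be cut in the usual way, e.g. cutree(pc.out, k = 6).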

Page 32: More About Clustering
Page 33: More About Clustering

Assessing Clusters

• Understand the within cluster similarity and between cluster differences in terms of distances and expression profiles.

• Decide how many clusters "are" in the data.
• Display the clusters graphically.
• Determine cluster stability.

Determine stability of genes in a cluster.

If the clusters stay distinguishable during the simulation, count the number of times each gene is in each cluster.

If the clusters are less stable, count the number of times the pair gene i and gene j are in the same cluster.

Page 34: More About Clustering

Summarizing the Perturbed Data

Dendrograms: Compute the consensus tree

There are various methods based on the number of times each branch of the dendrogram appears in the perturbed data. (I use phylogenetics software for this.)

Clusters: If clusters are stable, put each gene into the cluster it joins most often.

If clusters are less stable, use the pairwise counts as a new distance measure and recluster based on this measure.
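A minimal sketch of the second option, assuming coclust is a (hypothetical) 294 x 294 matrix counting how many of B reclusterings put each pair of genes in the same cluster:

pair.dist = as.dist(1 - coclust/B)                      # pairs that always cluster together get distance 0
consensus.out = hclust(pair.dist, method = "average")
consensus6 = cutree(consensus.out, k = 6)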

Page 35: More About Clustering

Other Partitioning Methods

1. Partitioning around medoids (PAM): instead of averages, use multidimensional medians as centroids (cluster "prototypes"). Dudoit and Fridlyand (2002).

2. Self-organizing maps (SOM): add an underlying "topology" (neighboring structure on a lattice) that relates cluster centroids to one another. Kohonen (1997), Tamayo et al. (1999).

3. Fuzzy k-means: allow for a "gradation" of points between clusters; soft partitions. Gasch and Eisen (2002).

4. Mixture-based clustering: implemented through an EM (Expectation-Maximization) algorithm. This provides soft partitioning, and allows for modeling of cluster centroids and shapes. Yeung et al. (2001), McLachlan et al. (2002).

F. Chiaromonte, Spring '06

Page 36: More About Clustering

Assessing the Clusters Computationally

The bottom line is that the clustering is "good" if it is biologically meaningful (but this is hard to assess).

Computationally we can:

1) Use a goodness-of-cluster measure, such as the within-cluster distances compared to the between-cluster distances.

2) Perturb the data and assess cluster changes:

a) add noise (maybe residuals after ANOVA)

b) resample (genes, arrays)