Karsten Borgwardt: Data Mining in Bioinformatics, Page 1

Data Mining in Bioinformatics
Day 8: Clustering in Bioinformatics
Clustering Gene Expression Data

Chloé-Agathe Azencott & Karsten Borgwardt

February 10 to February 21, 2014

Machine Learning & Computational Biology Research Group
Max Planck Institutes Tübingen and Eberhard Karls Universität Tübingen

Gene expression data

Microarray technology
• High density arrays
• Probes (or “reporters”, “oligos”)
• Detect probe-target hybridization
  • Fluorescence, chemiluminescence
  • E.g. cyanine dyes: Cy3 (green) / Cy5 (red)

Gene expression data

Data
• X: n × m matrix
• n genes
• m experiments:
  • conditions
  • time points
  • tissues
  • patients
  • cell lines

Clustering gene expression data

Group samples
• Group together tissues that are similarly affected by a disease
• Group together patients that are similarly affected by a disease

Group genes
• Group together functionally related genes
• Group together genes that are similarly affected by a disease
• Group together genes that respond similarly to an experimental condition

Clustering gene expression data

Applications
• Build regulatory networks
• Discover subtypes of a disease
• Infer unknown gene function
• Reduce dimensionality

Popularity
• PubMed hits: 33 548 for “microarray AND clustering”, 79 201 for “"gene expression" AND clustering”
• Toolboxes: MatArray, Cluster3, GeneCluster, Bioconductor, GEO tools, ...

Pre-processing

Pre-filtering
• Eliminate poorly expressed genes
• Eliminate genes whose expression remains constant

Missing values
• Ignore
• Replace with random numbers
• Impute
  • Continuity of time series
  • Values for similar genes

Pre-processing

Normalization
• log2(ratio)
  • particularly for time series
  • log2(Cy5/Cy3) → induction and repression have opposite signs
• variance normalization
  • differential expression
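
A minimal sketch of the two normalizations listed above, in plain Python. The function names `log2_ratio` and `variance_normalize` are illustrative and not taken from any toolbox, and variance normalization is assumed here to mean centering each gene profile and scaling it to unit variance.

```python
import math


def log2_ratio(cy5, cy3):
    """Log-ratio of red (Cy5) over green (Cy3) intensity."""
    return math.log2(cy5 / cy3)


def variance_normalize(profile):
    """Center a profile and scale it to unit variance (assumed definition)."""
    n = len(profile)
    mean = sum(profile) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in profile) / n)
    # A constant profile has std == 0; such genes are assumed to have been
    # removed by the pre-filtering step.
    return [(v - mean) / std for v in profile]


# Four-fold induction and four-fold repression differ only in sign.
induced = log2_ratio(400.0, 100.0)
repressed = log2_ratio(100.0, 400.0)
```

With raw ratios, induction spans (1, ∞) while repression is squeezed into (0, 1); the log transform makes the two symmetric around zero.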

Distances

Euclidean distance
Distance between genes x and y, given n samples (or distance between samples x and y, given n genes)

$$d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$$

Emphasis: magnitude

Pearson’s correlation
Correlation between genes x and y, given n samples (or correlation between samples x and y, given n genes)

$$\rho(x, y) = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2 \sum_{i=1}^{n} (y_i - \bar{y})^2}}$$

Emphasis: shape
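
To make the contrast concrete, the sketch below computes both measures for two made-up profiles that have the same shape but differ by a constant offset. The helper names `euclidean` and `pearson` and the example values are assumptions for illustration only.

```python
import math


def euclidean(x, y):
    """Euclidean distance between two profiles of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def pearson(x, y):
    """Pearson correlation between two profiles of equal length."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


# Same shape, shifted by a constant offset of 10 (made-up values).
gene_a = [1.0, 2.0, 3.0, 4.0]
gene_b = [11.0, 12.0, 13.0, 14.0]
d = euclidean(gene_a, gene_b)   # large: the magnitudes differ
rho = pearson(gene_a, gene_b)   # maximal: the shapes are identical
```

The Euclidean distance is sensitive to the offset, whereas the correlation ignores it and only compares how the two profiles vary across samples.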

Distances

[Figure: two example pairs of expression profiles, annotated d = 8.25, ρ = 0.33 and d = 13.27, ρ = 0.79]

Clustering evaluation

Cluster shape

Cluster tightness (homogeneity)

$$\sum_{i=1}^{k} \underbrace{\frac{1}{|C_i|} \sum_{x \in C_i} d(x, \mu_i)}_{T_i}$$

Cluster separation

$$\sum_{i=1}^{k} \sum_{j=i+1}^{k} \underbrace{d(\mu_i, \mu_j)}_{S_{i,j}}$$

Davies-Bouldin index

$$D_i := \max_{j : j \neq i} \frac{T_i + T_j}{S_{i,j}} \qquad DB := \frac{1}{k} \sum_{i=1}^{k} D_i$$
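
The Davies-Bouldin index defined above can be sketched as follows. This assumes that µ_i is the centroid of cluster C_i and that d is the Euclidean distance; the function names and the toy points are illustrative.

```python
import math


def centroid(points):
    """Coordinate-wise mean of a list of points."""
    return [sum(coords) / len(points) for coords in zip(*points)]


def davies_bouldin(clusters):
    """Davies-Bouldin index of a clustering given as a list of point lists."""
    mus = [centroid(c) for c in clusters]
    # T_i: mean distance of the members of cluster i to its centroid.
    tight = [sum(math.dist(x, mu) for x in c) / len(c)
             for c, mu in zip(clusters, mus)]
    k = len(clusters)
    total = 0.0
    for i in range(k):
        # D_i: worst ratio of combined tightness to separation S_ij.
        total += max((tight[i] + tight[j]) / math.dist(mus[i], mus[j])
                     for j in range(k) if j != i)
    return total / k


# Two well-separated groups versus a partition that cuts across them.
good = [[(0, 0), (0, 2)], [(10, 0), (10, 2)]]
bad = [[(0, 0), (10, 0)], [(0, 2), (10, 2)]]
db_good = davies_bouldin(good)
db_bad = davies_bouldin(bad)
```

Lower values indicate tighter, better separated clusters, so `db_good` is smaller than `db_bad` here.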

Clustering evaluation

Cluster stability

image from [von Luxburg, 2009]

Does the solution change if we perturb the data?
• Bootstrap
• Add noise

Quality of clustering

The Gene Ontology
“The GO project has developed three structured controlled vocabularies (ontologies) that describe gene products in terms of their associated biological processes, cellular components and molecular functions in a species-independent manner”

• Cellular Component: where in the cell a gene acts
• Molecular Function: function(s) carried out by a gene product
• Biological Process: biological phenomena the gene is involved in (e.g. cell cycle, DNA replication, limb formation)
• Hierarchical organization (“is a”, “is part of”)

Quality of clustering

GO enrichment analysis: TANGO [Tanay, 2003]
• Are there more genes from a given GO class G in a given cluster C than expected by chance?
• Assume the overlap |C ∩ G| follows a hypergeometric distribution, with n genes in total

$$\Pr(|C \cap G| \geq t) = 1 - \sum_{i=0}^{t-1} \frac{\binom{|G|}{i} \binom{n - |G|}{|C| - i}}{\binom{n}{|C|}}$$

• Correct for multiple hypothesis testing
  • Bonferroni too conservative (dependencies between GO groups)
  • Empirical computation of the null distribution
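
A small sketch of the hypergeometric tail probability Pr(|C ∩ G| ≥ t) using only the standard library. The function name and argument names are illustrative; this is not the TANGO implementation, and it does not include the multiple-testing correction.

```python
from math import comb


def enrichment_p_value(n, class_size, cluster_size, t):
    """P(at least t of the cluster's genes belong to the GO class).

    n            total number of genes
    class_size   |G|, genes annotated with the GO class
    cluster_size |C|, genes in the cluster
    t            observed overlap |C ∩ G|
    """
    total = comb(n, cluster_size)
    # Probability mass of overlaps 0 .. t-1 (capped at the cluster size).
    below = sum(comb(class_size, i) * comb(n - class_size, cluster_size - i)
                for i in range(min(t, cluster_size + 1)))
    return 1.0 - below / total


# Toy numbers: 4 genes, 2 in the class, a cluster of 2 that contains both.
p = enrichment_p_value(n=4, class_size=2, cluster_size=2, t=2)
```

In the toy case only one of the six possible clusters of size 2 contains both annotated genes, so the p-value is 1/6.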

Quality of clustering

Gene Set Enrichment Analysis (GSEA) [Subramanian et al., 2005]
• Use correlation to a phenotype y
• Rank genes according to the correlation ρ_i of their expression to y → L = {g_1, g_2, ..., g_n}

$$P_{\text{hit}}(C, i) = \sum_{j : j \leq i,\ g_j \in C} \frac{|\rho_j|}{\sum_{g_l \in C} |\rho_l|} \qquad P_{\text{miss}}(C, i) = \sum_{j : j \leq i,\ g_j \notin C} \frac{1}{n - |C|}$$

• Enrichment score: $ES(C) = \max_i |P_{\text{hit}}(C, i) - P_{\text{miss}}(C, i)|$
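
The running sums P_hit and P_miss can be computed in a single pass over the ranked list. The sketch below follows the formulas as written on this slide; the function name, the gene identifiers and the correlation values are made up, and the permutation-based significance test of the full GSEA method is not shown.

```python
def enrichment_score(ranked_genes, rho, gene_set):
    """Maximum absolute gap between the running P_hit and P_miss.

    ranked_genes  genes sorted by correlation with the phenotype
    rho           dict mapping gene -> correlation with the phenotype
    gene_set      the cluster C, assumed to be a non-empty proper subset
    """
    hits = [g for g in ranked_genes if g in gene_set]
    hit_norm = sum(abs(rho[g]) for g in hits)
    miss_step = 1.0 / (len(ranked_genes) - len(hits))
    p_hit = p_miss = 0.0
    best = 0.0
    for g in ranked_genes:
        if g in gene_set:
            p_hit += abs(rho[g]) / hit_norm   # weighted step for a hit
        else:
            p_miss += miss_step               # uniform step for a miss
        best = max(best, abs(p_hit - p_miss))
    return best


# Hypothetical ranked list and correlations.
ranked = ["g1", "g2", "g3", "g4"]
rho = {"g1": 0.9, "g2": 0.8, "g3": -0.1, "g4": -0.7}
es_top = enrichment_score(ranked, rho, {"g1", "g2"})
es_mixed = enrichment_score(ranked, rho, {"g1", "g4"})
```

A set concentrated at the top of the ranking (`es_top`) scores higher than one spread over both ends (`es_mixed`).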

Hierarchical clustering

Linkage
• single linkage: $d(A, B) = \min_{x \in A, y \in B} d(x, y)$
• complete linkage: $d(A, B) = \max_{x \in A, y \in B} d(x, y)$
• average (arithmetic) linkage: $d(A, B) = \sum_{x \in A, y \in B} d(x, y) / (|A||B|)$
  also called UPGMA (Unweighted Pair Group Method with Arithmetic Mean)
• average (centroid) linkage: $d(A, B) = d\left(\sum_{x \in A} x / |A|,\ \sum_{y \in B} y / |B|\right)$
  also called UPGMC (Unweighted Pair-Group Method using Centroids)
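
The four linkage criteria translate almost directly into code. The sketch below assumes Euclidean distance between points (the slides leave d generic); the function names and the two toy clusters are illustrative.

```python
import math
from itertools import product


def single_linkage(A, B):
    """Distance between the closest pair across the two clusters."""
    return min(math.dist(x, y) for x, y in product(A, B))


def complete_linkage(A, B):
    """Distance between the farthest pair across the two clusters."""
    return max(math.dist(x, y) for x, y in product(A, B))


def average_linkage(A, B):
    """UPGMA: mean of all pairwise distances across the two clusters."""
    return sum(math.dist(x, y) for x, y in product(A, B)) / (len(A) * len(B))


def centroid_linkage(A, B):
    """UPGMC: distance between the two cluster centroids."""
    def centroid(S):
        return [sum(coords) / len(S) for coords in zip(*S)]
    return math.dist(centroid(A), centroid(B))


A = [(0, 0), (0, 1)]
B = [(3, 0), (3, 1)]
distances = {
    "single": single_linkage(A, B),
    "complete": complete_linkage(A, B),
    "average": average_linkage(A, B),
    "centroid": centroid_linkage(A, B),
}
```

For any pair of clusters, the single-linkage distance is the smallest and the complete-linkage distance the largest of the pairwise-based criteria.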

Hierarchical clustering

Construction
• Agglomerative approach (bottom-up): start with every element in its own cluster, then iteratively join nearby clusters
• Divisive approach (top-down): start with a single cluster containing all elements, then recursively divide it into smaller clusters
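
The agglomerative approach can be sketched in a few lines. This is a deliberately naive version that rescans all cluster pairs at every step, so it is slower than optimized implementations; the function name `agglomerate` and the choice of Euclidean distance are assumptions.

```python
import math


def agglomerate(points, linkage=min):
    """Naive bottom-up clustering; returns the list of merges.

    linkage=min gives single linkage, linkage=max complete linkage.
    Each merge is recorded as (cluster_a, cluster_b, distance), where
    clusters are lists of point indices.
    """
    clusters = [[i] for i in range(len(points))]
    merges = []
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = linkage(math.dist(points[i], points[j])
                            for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((clusters[a], clusters[b], d))
        merged = clusters[a] + clusters[b]
        clusters = [c for idx, c in enumerate(clusters) if idx not in (a, b)]
        clusters.append(merged)
    return merges


# One-dimensional toy data: two tight pairs and one outlier.
points = [(0,), (1,), (5,), (6,), (20,)]
merges = agglomerate(points)
```

The recorded merge distances are the heights at which branches join in the dendrogram; cutting the tree at a given height yields a flat clustering.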

Hierarchical clustering

Advantages
• Does not require setting the number of clusters
• Good interpretability

Drawbacks
• Computationally intensive: $O(n^2 \log n^2)$
• Hard to decide at which level of the hierarchy to stop
• Lack of robustness
• Risk of locking in accidental features (local decisions)

Hierarchical clustering

Dendrograms

[Figure: dendrogram over six elements a to f; b and c merge into bc, d and e into de, de and f into def, bc and def into bcdef, and a and bcdef into abcdef]

In biology
• Phylogenetic trees
• Sequence analysis: infer the evolutionary history of sequences being compared

Hierarchical clustering

[Eisen et al., 1998]

Motivation
• Arrange genes according to similarity in pattern of gene expression
• Graphical display of output
• Efficient grouping of genes of similar functions

Hierarchical clustering

[Eisen et al., 1998]

Data
• Saccharomyces cerevisiae:
  • DNA microarrays containing all ORFs
  • Diauxic shift; mitotic cell division cycle; sporulation; temperature and reducing shocks
• Human
  • 9 800 cDNAs representing ∼8 600 transcripts
  • fibroblasts stimulated with serum following serum starvation

Data pre-processing
• Cy5 (red) and Cy3 (green) fluorescences → log2(Cy5/Cy3)

Hierarchical clustering

[Eisen et al., 1998]

Methods
• Distance: Pearson’s correlation
• Pairwise average-linkage cluster analysis
• Ordering of elements:
  • Ideally: such that adjacent elements have maximal similarity (impractical)
  • In practice: rank genes by average gene expression, chromosomal position

Hierarchical clustering

[Bar-Joseph et al., 2001]

Fast optimal leaf ordering for hierarchical clustering
• n leaves → $2^{n-1}$ possible orderings
• Goal: maximize the sum of similarities of adjacent leaves in the ordering
• Recursively find, for a node v, the cost $C(v, u_l, u_r)$ of the optimal ordering rooted at v with left-most leaf $u_l$ and right-most leaf $u_r$
• Work bottom up: $C(v, u, w) = \max_{m, k} \left[ C(v_l, u, m) + C(v_r, k, w) + \sigma(m, k) \right]$, where $v_l$ and $v_r$ are the children of v, m is the right-most leaf of the ordering of $v_l$, k is the left-most leaf of the ordering of $v_r$, and $\sigma(m, k)$ is the similarity between m and k
• $O(n^4)$ time, $O(n^2)$ space
• Early termination → $O(n^3)$
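
To illustrate the search space, the sketch below enumerates every ordering obtainable by flipping internal nodes of a small tree and keeps the best one. This is a brute-force illustration of the objective only, not the dynamic program of [Bar-Joseph et al., 2001]; the nested-tuple tree encoding, the function names and the toy similarity are assumptions.

```python
def orderings(tree):
    """Yield every leaf order reachable by flipping internal nodes.

    A tree is either a leaf label or a pair (left_subtree, right_subtree).
    A binary tree with n leaves has n - 1 internal nodes, hence
    2 ** (n - 1) orderings.
    """
    if not isinstance(tree, tuple):
        yield [tree]
        return
    left, right = tree
    for l in orderings(left):
        for r in orderings(right):
            yield l + r   # children in their original order
            yield r + l   # children flipped


def adjacency_score(order, sim):
    """Sum of similarities between neighboring leaves."""
    return sum(sim(a, b) for a, b in zip(order, order[1:]))


def best_ordering(tree, sim):
    """Exhaustive search; only feasible for very small trees."""
    return max(orderings(tree), key=lambda o: adjacency_score(o, sim))


# Toy example: similarity is the negated difference of made-up values.
value = {"a": 1, "b": 0, "c": 5, "d": 2}
tree = (("a", "b"), ("c", "d"))
sim = lambda x, y: -abs(value[x] - value[y])
best = best_ordering(tree, sim)
```

The exponential number of orderings is what makes the exhaustive approach impractical and motivates the polynomial-time recursion described above.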

Hierarchical clustering

[Eisen et al., 1998]

Genes “represent” more than a mere cluster together

Genes of similar function cluster together
• cluster A: cholesterol biosynthesis
• cluster B: cell cycle
• cluster C: immediate-early response
• cluster D: signaling and angiogenesis
• cluster E: tissue remodeling and wound healing

Hierarchical clustering

[Eisen et al., 1998]

• cluster E: genes encoding glycolytic enzymes share a function but are not members of large protein complexes
• cluster J: mini-chromosome maintenance DNA replication complex
• cluster I: 126 genes strongly down-regulated in response to stress
  • 112 of those encode ribosomal proteins
  • Yeast responds to favorable growth conditions by increasing the production of ribosomes, through transcriptional regulation of genes encoding ribosomal proteins

Hierarchical clustering

[Eisen et al., 1998]

Validation
• Randomized data does not cluster

Hierarchical clustering

[Eisen et al., 1998]

Conclusions
• Hierarchical clustering of gene expression data groups together genes that are known to have similar functions
• Gene expression clusters reflect biological processes
• Coexpression data can be used to infer the function of new / poorly characterized genes

Hierarchical clustering

[Bar-Joseph et al., 2001]

K-means clustering

source: scikit-learn.org

K-means clustering

Advantages
• Relatively efficient: O(ntk), with n objects, k clusters, t iterations
• Easily implementable

Drawbacks
• Need to specify k ahead of time
• Sensitive to noise and outliers
• Clusters are forced to have convex shapes (kernel k-means can be a solution)
• Results depend on the initial, random partition (k-means++ can be a solution)
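
The O(ntk) cost is visible in a direct implementation: each of the t iterations compares all n points with all k centroids. The sketch below is the standard Lloyd iteration with Euclidean distance and random initial centroids drawn from the data (not k-means++); the function name, the defaults and the toy points are assumptions.

```python
import math
import random


def kmeans(points, k, iterations=100, seed=0):
    """Plain Lloyd iteration; returns (labels, centroids)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)   # random initialization
    labels = None
    for _ in range(iterations):
        # Assignment step: n * k distance computations.
        new_labels = [min(range(k), key=lambda c: math.dist(p, centroids[c]))
                      for p in points]
        if new_labels == labels:        # converged
            break
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:                 # keep the old centroid if empty
                centroids[c] = tuple(sum(v) / len(members)
                                     for v in zip(*members))
    return labels, centroids


# Two well-separated toy groups.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels, centroids = kmeans(points, k=2)
```

On less cleanly separated data, different seeds can give different partitions, which is the initialization sensitivity listed among the drawbacks.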

K-means clustering

[Tavazoie et al., 1999]

Motivation
• Use whole-genome mRNA data to identify transcriptional regulatory sub-networks in yeast
• Systematic approach, minimally biased to previous knowledge
• An upstream DNA sequence pattern common to all mRNAs in a cluster is a candidate cis-regulatory element

K-means clustering

[Tavazoie et al., 1999]

Data
• Oligonucleotide microarrays, 6 220 mRNA species
• 15 time points across two cell cycles

Data pre-processing
• variance-normalization
• keep the most variable 3 000 ORFs

K-means clustering

[Tavazoie et al., 1999]

Methods
• k-means, k = 30 → 49–186 ORFs per cluster
• cluster labeling:
  • map the genes to 199 functional categories (MIPS database; MIPS: Martinsried Institute of Protein Science)
  • compute p-values of observing frequencies of genes in particular functional classes
  • cumulative hypergeometric probability distribution for finding at least k ORFs (g total) from a single functional category (size f) in a cluster of size n (here k is the number of ORFs, not the number of clusters)

$$P = 1 - \sum_{i=0}^{k-1} \frac{\binom{f}{i} \binom{g - f}{n - i}}{\binom{g}{n}}$$

  • correct for 199 tests

K-means clustering

[Tavazoie et al., 1999]

K-means clustering

[Tavazoie et al., 1999]

[Figure: expression profiles of a periodic cluster and an aperiodic cluster]

K-means clustering

[Tavazoie et al., 1999]

Conclusions
• Clusters with significant functional enrichment tend to be tighter (mean Euclidean distance)
• Tighter clusters tend to have significant upstream motifs
• Discovered new regulons

Self-organizing maps

a.k.a. Kohonen networks
• Impose partial structure on the clusters
• Start from a geometry of nodes {N_1, N_2, ..., N_k}, e.g. grids, rings, lines
• At each iteration, randomly select a data point P, and move the nodes towards P.
• The nodes closest to P move the most, and the nodes furthest from P move the least.

$$f^{(t+1)}(N) = f^{(t)}(N) + \tau(t, d(N, N_P)) \, (P - f^{(t)}(N))$$

  where $f^{(t)}(N)$ is the position of node N at iteration t and $N_P$ is the node closest to P
• The learning rate τ decreases with t and with the distance from $N_P$ to N
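
A minimal sketch of the update rule above for nodes arranged on a line. The particular learning-rate schedule (linear decay in t, Gaussian decay in the grid distance) is an assumption chosen for illustration; the slides only require that τ decreases with both. Function and variable names are illustrative.

```python
import math
import random


def train_som(data, n_nodes, iterations=1000, seed=0):
    """Train a one-dimensional SOM; returns one weight vector per node."""
    rng = random.Random(seed)
    # f(N): node positions in data space, initialized on random data points.
    weights = [list(rng.choice(data)) for _ in range(n_nodes)]
    for t in range(iterations):
        p = rng.choice(data)
        # N_P: the node whose position is closest to P.
        winner = min(range(n_nodes), key=lambda n: math.dist(weights[n], p))
        progress = t / iterations
        alpha = 0.5 * (1.0 - progress)                       # decays with t
        radius = max(0.5, 0.5 * n_nodes * (1.0 - progress))  # shrinks with t
        for n in range(n_nodes):
            grid_dist = abs(n - winner)       # d(N, N_P) along the line
            tau = alpha * math.exp(-grid_dist ** 2 / (2.0 * radius ** 2))
            # f(N) <- f(N) + tau * (P - f(N))
            weights[n] = [w + tau * (x - w) for w, x in zip(weights[n], p)]
    return weights


# Made-up two-dimensional data in the unit square.
data = [(0.0, 0.0), (0.1, 0.2), (0.9, 1.0), (1.0, 0.8)]
nodes = train_som(data, n_nodes=3, iterations=200)
```

Because neighboring nodes on the line are pulled together, adjacent nodes end up representing similar patterns, which is what makes the map useful for visualization.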

Self-organizing maps

Source: Wikimedia Commons – Mcld

Self-organizing maps

Advantages
• Can impose partial structure
• Visualization

Drawbacks
• Multiple parameters to set
• Need to set an initial geometry

Self-organizing maps

[Tamayo et al., 1999]

Motivation
• Extract fundamental patterns of gene expression
• Organize the genes into biologically relevant clusters
• Suggest novel hypotheses

Self-organizing maps

[Tamayo et al., 1999]

Data
• Yeast
  • 6 218 ORFs
  • 2 cell cycles, every 10 minutes
  • SOM: 6 × 5 grid
• Human
  • Macrophage differentiation in HL-60 cells (myeloid leukemia cell line)
  • 5 223 genes
  • cells harvested at 0, 0.5, 4 and 24 hours after PMA stimulation
  • SOM: 4 × 3 grid

Self-organizing maps

[Tamayo et al., 1999]

Results: Yeast
• Periodic behavior
• Adjacent clusters have similar behavior

Self-organizing maps

[Tamayo et al., 1999]

Results: HL-60
Cluster 11:
• gradual induction as cells lose proliferative capacity and acquire hallmarks of the macrophage lineage
• 8/32 genes not expected given current knowledge of hematopoietic differentiation
• 4 of those suggest role of immunophilin-mediated pathway in macrophage differentiation

Self-organizing maps

[Tamayo et al., 1999]

Conclusions
• Extracted the k most prominent patterns to provide an “executive summary”
• Small data, but illustrative:
  • Cell cycle periodicity recovered
  • Genes known to be involved in hematopoietic differentiation recovered
  • New hypotheses generated
• SOMs scale well to larger datasets

Biclustering

Biclustering, co-clustering, two-way clustering
• Find subsets of rows that exhibit similar behaviors across subsets of columns
• Bicluster: subset of genes that show similar expression patterns across a subset of conditions/tissues/samples

source: [Yang and Oja, 2012]

Biclustering

[Cheng and Church, 2000]

Motivation
• Simultaneous clustering of genes and conditions
• Overlapped grouping: more appropriate for genes with multiple functions or regulated by multiple factors

Biclustering

[Cheng and Church, 2000]

Algorithm
• Goal: minimize intra-cluster variance
• Mean Squared Residue:

$$\mathrm{MSR}(I, J) = \frac{1}{|I||J|} \sum_{i \in I, j \in J} (x_{ij} - x_{iJ} - x_{Ij} + x_{IJ})^2$$

  $x_{iJ}$, $x_{Ij}$, $x_{IJ}$: mean expression values in row i, column j, and over the whole cluster
• δ: maximum acceptable MSR
• Single Node Deletion: remove the rows/columns of X with the largest variance, e.g. for row i

$$\frac{1}{|J|} \sum_{j \in J} (x_{ij} - x_{iJ} - x_{Ij} + x_{IJ})^2$$

  until MSR < δ
• Node Addition: some rows/columns may be added back without increasing MSR
• Masking Discovered Biclusters: replace the corresponding entries by random numbers
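
The mean squared residue and the single node deletion step can be sketched as below. Only the deletion phase is shown (no node addition or masking), the matrix is a plain list of rows, and the function names and the toy matrix are assumptions rather than the authors' code.

```python
def squared_residues(X, rows, cols):
    """Squared residue of every entry of the submatrix (rows, cols)."""
    row_mean = {i: sum(X[i][j] for j in cols) / len(cols) for i in rows}
    col_mean = {j: sum(X[i][j] for i in rows) / len(rows) for j in cols}
    all_mean = sum(row_mean.values()) / len(rows)
    return {(i, j): (X[i][j] - row_mean[i] - col_mean[j] + all_mean) ** 2
            for i in rows for j in cols}


def msr(X, rows, cols):
    """Mean squared residue of the submatrix (rows, cols)."""
    res = squared_residues(X, rows, cols)
    return sum(res.values()) / (len(rows) * len(cols))


def single_node_deletion(X, delta):
    """Drop the worst row or column until the MSR falls below delta."""
    rows = list(range(len(X)))
    cols = list(range(len(X[0])))
    # The size guard prevents emptying the submatrix through rounding error.
    while msr(X, rows, cols) >= delta and len(rows) > 1 and len(cols) > 1:
        res = squared_residues(X, rows, cols)
        row_score = {i: sum(res[i, j] for j in cols) / len(cols) for i in rows}
        col_score = {j: sum(res[i, j] for i in rows) / len(rows) for j in cols}
        worst_row = max(rows, key=row_score.get)
        worst_col = max(cols, key=col_score.get)
        if row_score[worst_row] >= col_score[worst_col]:
            rows.remove(worst_row)
        else:
            cols.remove(worst_col)
    return rows, cols


# Three rows that differ only by an additive offset, plus one noisy row.
X = [[1, 2, 3],
     [2, 3, 4],
     [4, 5, 6],
     [9, 0, 9]]
rows, cols = single_node_deletion(X, delta=0.01)
```

A submatrix whose rows differ only by additive offsets has an MSR of zero, which is why the three coherent rows survive while the noisy row is dropped.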

Biclustering

[Cheng and Church, 2000]

Results: Yeast
• Biclusters 17, 67, 71, 80, 90 contain genes in clusters 4, 8, 12 of [Tavazoie et al., 1999]
• Biclusters 57, 63, 77, 84, 94 represent cluster 7 of [Tavazoie et al., 1999]

Biclustering

[Cheng and Church, 2000]

Results: Human B-cells
Data: 4 026 genes, 96 samples of normal and malignant lymphocytes

Cluster   Genes   Conditions
12        4       96
19        103     25
22        10      57
39        9       51
44        10      29
45        127     13
49        2       96
52        3       96
53        11      25
54        13      21
75        25      12
83        2       96

Biclustering

[Cheng and Church, 2000]

Conclusion
• Biclustering algorithm that does not require computing pairwise similarities between all entries of the expression matrix
• Global fitting
• Automatically drops noisy genes/conditions
• Rows and columns can be included in multiple biclusters

References and further reading

[Bar-Joseph et al., 2001] Bar-Joseph, Z., Gifford, D. K. and Jaakkola, T. S. (2001). Fast optimal leaf ordering for hierarchical clustering. Bioinformatics 17, S22–S29.

[Cheng and Church, 2000] Cheng, Y. and Church, G. M. (2000). Biclustering of expression data. In Proceedings of the eighth international conference on intelligent systems for molecular biology vol. 8, pp. 93–103.

[Eisen et al., 1998] Eisen, M. B., Spellman, P. T., Brown, P. O. and Botstein, D. (1998). Cluster analysis and display of genome-wide expression patterns. Proceedings of the National Academy of Sciences 95, 14863–14868.

[Eren et al., 2012] Eren, K., Deveci, M., Küçüktunç, O. and Çatalyürek, U. V. (2012). A comparative analysis of biclustering algorithms for gene expression data. Briefings in Bioinformatics.

[Subramanian et al., 2005] Subramanian, A., Tamayo, P., Mootha, V. K., Mukherjee, S., Ebert, B. L., Gillette, M. A., Paulovich, A., Pomeroy, S. L., Golub, T. R., Lander, E. S. and Mesirov, J. P. (2005). Gene set enrichment analysis: A knowledge-based approach for interpreting genome-wide expression profiles. Proceedings of the National Academy of Sciences of the United States of America 102, 15545–15550.

[Tamayo et al., 1999] Tamayo, P., Slonim, D., Mesirov, J., Zhu, Q., Kitareewan, S., Dmitrovsky, E., Lander, E. S. and Golub, T. R. (1999). Interpreting patterns of gene expression with self-organizing maps: Methods and application to hematopoietic differentiation. Proceedings of the National Academy of Sciences 96, 2907–2912.

[Tanay, 2003] Tanay, A. (2003). The TANGO program technical note. http://acgt.cs.tau.ac.il/papers/TANGO_manual.txt.

[Tavazoie et al., 1999] Tavazoie, S., Hughes, J. D., Campbell, M. J., Cho, R. J., Church, G. M. et al. (1999). Systematic determination of genetic network architecture. Nature Genetics 22, 281–285.

[von Luxburg, 2009] von Luxburg, U. (2009). Clustering stability: an overview. Foundations and Trends in Machine Learning 2, 235–274.

[Yang and Oja, 2012] Yang, Z. and Oja, E. (2012). Quadratic nonnegative matrix factorization. Pattern Recognition 45, 1500–1510.