Iclust: information based clustering


Noam Slonim, The Lewis-Sigler Institute for Integrative Genomics, Princeton University

Joint work with Gurinder Atwal, Gasper Tkacik, and Bill Bialek


Running example

Gene expression data: K genes × N conditions. Each entry is the (log) ratio of the mRNA expression level of a gene in a specific condition.

   2   12   -1   -1    6   -3    8   ??    7   -5    3   -4
  12   ??   -5   11   -2    6   11   11   -8   12   ??   -2
  ??   12    5   12    4   -1    8   -2   ??    5   14   ??
   8    1   12    1   14   -8   ??   -2    5   14   -8   -7
   5   -5   11   17   -2   15    5   14   -8    5   16    2
   1   11   -8    0    5   -5    5   14   18   ??    2    1
  -6   12    4   12    4    7   -1    3   -7    3    7   -5
  21   ??   ??    3    2    4  -11   -3    3   -3   ??    9

Relations between genes? Relations between experimental conditions?


Information as a correlation/similarity measure

Some nice features of the information measure:

Model independent
Responsive to any type of dependency
Captures more than just pairwise relations
Suitable for both continuous and discrete data
Independent of the measurement scale
Axiomatic


Mutual information - definition

We have some “uncertainty” about the state of gene-A, but now someone tells us the state of gene-B.

How much can we learn from the state of gene-B about the state of gene-A (and vice versa)?

The resulting reduction in the uncertainty about the state of gene-A is called the mutual information between these two variables:

$$ I[p(a,b)] \equiv H[p(a)] - H[p(a \mid b)] = \sum_{a \in A,\, b \in B} p(a,b) \log \frac{p(a,b)}{p(a)\, p(b)} $$
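As a minimal sketch of the definition above, the following Python snippet evaluates the mutual information of two discrete variables from a joint probability table. The function name `mutual_information` and the two toy tables are assumptions made for illustration; they are not part of the talk.

```python
import numpy as np

def mutual_information(joint):
    """Mutual information in bits: sum p(a,b) log2[p(a,b) / (p(a) p(b))]."""
    joint = np.asarray(joint, dtype=float)
    joint = joint / joint.sum()               # normalize to a probability table
    p_a = joint.sum(axis=1, keepdims=True)    # marginal p(a), column vector
    p_b = joint.sum(axis=0, keepdims=True)    # marginal p(b), row vector
    outer = p_a @ p_b                         # product of marginals p(a) p(b)
    nz = joint > 0                            # 0 * log 0 is taken as 0
    return float(np.sum(joint[nz] * np.log2(joint[nz] / outer[nz])))

# Hypothetical case 1: the state of gene-B always equals the state of gene-A.
perfect = [[0.5, 0.0], [0.0, 0.5]]
# Hypothetical case 2: the state of gene-B is independent of gene-A.
independent = [[0.25, 0.25], [0.25, 0.25]]

mi_perfect = mutual_information(perfect)
mi_independent = mutual_information(independent)
print(mi_perfect, mi_independent)
```

In the first table, knowing gene-B removes all uncertainty about a two-state gene-A, so the information is one bit; in the second it is zero.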


Model independence & responsiveness to “complicated” relations

[Figure: four scatter plots of gene-B expression level against gene-A expression level. Panel annotations: MI ~ 1 bit, Corr. ~ 0.9; MI ~ 2 bits, Corr. ~ 0.6; MI ~ 0 bits, Corr. ~ 0; MI ~ 1.3 bits, Corr. ~ 0.]
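The contrast between correlation and information can be reproduced on synthetic data. The sketch below is an illustration under stated assumptions: the profiles are hypothetical (gene-B is the square of gene-A), and the information is estimated with a crude plug-in estimate on an 8 × 8 equal-width binning, which is not necessarily the estimator used for the results in the talk and is known to be biased for small samples.

```python
import numpy as np

def mutual_information(counts):
    """Plug-in mutual information (bits) from a 2-D table of joint counts."""
    joint = np.asarray(counts, dtype=float)
    joint = joint / joint.sum()
    outer = joint.sum(axis=1, keepdims=True) @ joint.sum(axis=0, keepdims=True)
    nz = joint > 0
    return float(np.sum(joint[nz] * np.log2(joint[nz] / outer[nz])))

# Hypothetical profiles: gene-B depends on gene-A through a symmetric,
# non-monotonic relation, so the linear correlation vanishes.
gene_a = np.linspace(-1.0, 1.0, 1000)
gene_b = gene_a ** 2

corr = np.corrcoef(gene_a, gene_b)[0, 1]
counts, _, _ = np.histogram2d(gene_a, gene_b, bins=8)  # crude 8 x 8 binning
mi = mutual_information(counts)
print(f"Pearson correlation: {corr:.3f}, binned MI: {mi:.2f} bits")
```

The correlation is essentially zero while the binned information estimate is clearly positive, which is the situation of the last panel: a strong dependency that a linear measure misses.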



Capturing more than just pairwise relations

[Figure: two panels plotted against experiment index. First panel: gene-A and gene-B expression (MI ~ 0 bits; Corr. ~ 0). Second panel: gene-A, gene-B, and gene-C expression (triplet information ~ 1.0 bits).]

Using a model-dependent correlation measure might result in missing significant dependencies in our data.
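A standard toy case shows how a triplet can be informative while every pair is not. The sketch below assumes that the triplet information is the multi-information (the sum of the marginal entropies minus the joint entropy) and uses a hypothetical construction in which gene-C is the XOR of two independent binary genes; neither choice is taken from the slides.

```python
import itertools
import numpy as np

def entropy(p):
    """Shannon entropy in bits of a probability table of any shape."""
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))

def pairwise_mi(joint2d):
    """I(X;Y) = H(X) + H(Y) - H(X,Y) for a 2-D joint table."""
    return entropy(joint2d.sum(axis=1)) + entropy(joint2d.sum(axis=0)) - entropy(joint2d)

# Joint distribution p(a, b, c): a and b are fair coin flips, c = a XOR b.
joint = np.zeros((2, 2, 2))
for a, b in itertools.product((0, 1), repeat=2):
    joint[a, b, a ^ b] = 0.25

mi_ab = pairwise_mi(joint.sum(axis=2))
mi_ac = pairwise_mi(joint.sum(axis=1))
mi_bc = pairwise_mi(joint.sum(axis=0))

# Multi-information: H(A) + H(B) + H(C) - H(A, B, C).
triplet = (entropy(joint.sum(axis=(1, 2))) + entropy(joint.sum(axis=(0, 2)))
           + entropy(joint.sum(axis=(0, 1))) - entropy(joint))
print(mi_ab, mi_ac, mi_bc, triplet)
```

Every pair carries zero bits, yet the three variables together carry one bit, so a purely pairwise measure would report no structure at all.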


Mutual-information vs. Pearson-Correlation results in bacteria gene-expression data

[Figure: scatter plot of mutual information (vertical axis) against Pearson correlation (horizontal axis); Mycobacterium tuberculosis, 81 experiments.]


Information relations between gene expression profiles

Given the expression of gene-A, how much information do we have about the expression of gene-B (when averaging over all conditions)?

(Sample size: the number of conditions, 173 in the Gasch data.)

Once we find these information relations, we often want to apply cluster analysis.

Numerous clustering methods are available – but typically they assume a particular model.

For example, K-means corresponds to the modeling assumption that each cluster can be described by a spherical Gaussian.

Back to square one…?


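Iclust – information based clustering

What is a “good” cluster?

A simple proposal: given a cluster, we pick two items at random, and we want them to be as “similar” to each other as possible.

Formally, we wish to maximize

$$ \langle S(c) \rangle = \sum_c q(c) \sum_{i_1, i_2} q(i_1 \mid c)\, q(i_2 \mid c)\, s(i_1, i_2) $$

Or, for relations among $r$ items,

$$ \langle S(c) \rangle = \sum_c q(c) \sum_{i_1, \ldots, i_r} q(i_1 \mid c)\, q(i_2 \mid c) \cdots q(i_r \mid c)\, s(i_1, i_2, \ldots, i_r) $$

Or, with information as the similarity measure,

$$ \langle S(c) \rangle \to \langle I(c) \rangle = \sum_c q(c) \sum_{i_1, \ldots, i_r} q(i_1 \mid c)\, q(i_2 \mid c) \cdots q(i_r \mid c)\, I(i_1, i_2, \ldots, i_r) $$

Namely, we wish to maximize the average information relations in our clusters, or to find clusters such that in each cluster all items are highly informative about each other.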
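For the pairwise case, the quantity to be maximized can be evaluated directly from the assignment probabilities q(c|i) and a similarity matrix s(i1, i2). The sketch below is illustrative only: the function name `average_similarity`, the uniform prior p(i), and the toy block-structured similarity matrix are assumptions, and every cluster is assumed to be non-empty.

```python
import numpy as np

def average_similarity(q_c_given_i, s, p_i=None):
    """sum_c q(c) sum_{i1,i2} q(i1|c) q(i2|c) s(i1,i2) for the pairwise case."""
    q_c_given_i = np.asarray(q_c_given_i, dtype=float)  # shape (N, Nc), rows sum to 1
    s = np.asarray(s, dtype=float)                      # shape (N, N)
    n = q_c_given_i.shape[0]
    p_i = np.full(n, 1.0 / n) if p_i is None else np.asarray(p_i, dtype=float)
    q_c = p_i @ q_c_given_i                             # q(c) = sum_i p(i) q(c|i)
    q_i_given_c = q_c_given_i * p_i[:, None] / q_c      # Bayes' rule, shape (N, Nc)
    s_c = np.einsum("ic,ij,jc->c", q_i_given_c, s, q_i_given_c)  # per-cluster average
    return float(q_c @ s_c)

# Hypothetical similarity matrix: items {0, 1} and {2, 3} form two tight groups.
s = np.array([[1, 1, 0, 0],
              [1, 1, 0, 0],
              [0, 0, 1, 1],
              [0, 0, 1, 1]], dtype=float)

matched = np.array([[1, 0], [1, 0], [0, 1], [0, 1]], dtype=float)   # clusters follow the groups
mixed = np.array([[1, 0], [0, 1], [1, 0], [0, 1]], dtype=float)     # clusters cut across the groups

score_matched = average_similarity(matched, s)
score_mixed = average_similarity(mixed, s)
print(score_matched, score_mixed)
```

The clustering that follows the groups scores higher, because two items drawn at random from one of its clusters are always similar. Note that the double sum includes the case where the same item is drawn twice.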


Iclust – information based clustering (cont.)

A penalty term that we wish to minimize, as in rate-distortion theory:

$$ I(C;i) = \sum_{i,c} p(i)\, q(c \mid i) \log \frac{q(c \mid i)}{q(c)} $$

S(c) is maximized, but the penalty term is maximized as well (no compression).

The penalty term is minimized (maximal compression), but S(c) is minimized as well.

Intermediate, interesting cases: a small penalty with a high S(c).
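The two extremes of the penalty term can be checked numerically. In this sketch the function name `cluster_information`, the uniform prior p(i), and the choice of four items are assumptions made for illustration.

```python
import numpy as np

def cluster_information(q_c_given_i, p_i=None):
    """I(C;i) in bits: sum_{i,c} p(i) q(c|i) log2[q(c|i) / q(c)]."""
    q = np.asarray(q_c_given_i, dtype=float)       # shape (N, Nc), rows sum to 1
    n = q.shape[0]
    p_i = np.full(n, 1.0 / n) if p_i is None else np.asarray(p_i, dtype=float)
    q_c = p_i @ q                                   # cluster marginal q(c)
    joint = p_i[:, None] * q                        # joint p(i, c)
    indep = p_i[:, None] * q_c[None, :]             # product p(i) q(c)
    nz = joint > 0                                  # 0 * log 0 is taken as 0
    return float(np.sum(joint[nz] * np.log2(joint[nz] / indep[nz])))

n_items = 4
no_compression = np.eye(n_items)                    # every item in its own cluster
max_compression = np.zeros((n_items, 2))
max_compression[:, 0] = 1.0                         # every item in the same cluster

info_no_compression = cluster_information(no_compression)
info_max_compression = cluster_information(max_compression)
print(info_no_compression, info_max_compression)
```

With one cluster per item the penalty reaches log2(4) = 2 bits, and with a single cluster it drops to zero; useful clusterings lie between these two cases.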


Iclust – information based clustering (cont.)

The intuitive clustering problem can be turned into a general mathematical optimization problem:

$$ F[q(c \mid i)] = \langle I(c) \rangle - T\, I(C;i) $$

Here $q(c \mid i)$ are the clustering parameters, $\langle I(c) \rangle$ is the expected information relations among data items, $I(C;i)$ is the information between data items and clusters, and $T$ is the tradeoff parameter.

Clustering is formulated as trading bits of similarity against bits of descriptive power, without any further assumptions.


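Relations with classical rate distortion

Iclust:

$$ F[q(c \mid i)] = I(C;i) + \frac{1}{T} \langle D(c) \rangle, \qquad \langle D(c) \rangle = \sum_c q(c) \sum_{i_1, \ldots, i_r} q(i_1 \mid c) \cdots q(i_r \mid c)\, d(\Phi^{(i_1)}, \ldots, \Phi^{(i_r)}) $$

Classical rate distortion:

$$ F[q(c \mid i)] = I(C;i) + \frac{1}{T} \langle \tilde{D}(c) \rangle, \qquad \langle \tilde{D}(c) \rangle = \sum_c q(c) \sum_{i_1} q(i_1 \mid c)\, d(\Phi^{(i_1)}, \tilde{\Phi}^{(c)}) $$

For the special case of pairwise relations:

$$ \langle D(c) \rangle = \sum_c q(c) \sum_{i_1} \sum_{i_2} q(i_1 \mid c)\, q(i_2 \mid c)\, d(\Phi^{(i_1)}, \Phi^{(i_2)}) $$

$$ \langle \tilde{D}(c) \rangle = \sum_c q(c) \sum_{i_1} q(i_1 \mid c)\, d\Big(\Phi^{(i_1)}, \sum_{i_2} q(i_2 \mid c)\, \Phi^{(i_2)}\Big) $$

The difference is whether the sum over $i_2$ is taken before or after $d$ is computed.

If the distortion/similarity matrix is a kernel matrix, the two formulations are equivalent.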


And yet – some important differences

Iclust is applicable when the raw data is given directly as pairwise relations

Iclust does not require a definition of a “prototype” (or “centroid”)

The two formulations induce different decoding schemes

A sender observes a pattern $\Phi^{(i)}$, but is allowed to send only the cluster index, $c$

In classical rate distortion the receiver is assumed to decode by

$$ \Phi^{(i)} \to \tilde{\Phi}^{(c)} = \sum_{i'} q(i' \mid c)\, \Phi^{(i')} $$

Deterministic decoding with vocabulary size $N_c$

In Iclust the receiver is assumed to decode by

$$ \Phi^{(i)} \to \tilde{\Phi}^{(i)} \sim q(i \mid c) $$

Stochastic decoding with vocabulary size $N$

Iclust can handle more than just pairwise correlations
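The two decoding schemes can be contrasted on a toy example. This is a sketch under assumptions: the patterns are hypothetical scalar gray levels, the clustering is hard with a uniform prior over items, and stochastic decoding is read as drawing a member pattern of the received cluster according to q(i|c).

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical patterns (e.g. gray levels) and a hard assignment to 2 clusters.
patterns = np.array([10.0, 20.0, 30.0, 200.0, 210.0, 220.0])
labels = np.array([0, 0, 0, 1, 1, 1])
n_clusters = 2

# q(i|c) for a hard clustering with uniform p(i): uniform over cluster members.
q_i_given_c = np.stack(
    [(labels == c) / np.sum(labels == c) for c in range(n_clusters)], axis=1
)  # shape (N, Nc)

# Deterministic decoding: every item is replaced by its cluster centroid,
# so only Nc distinct values can ever be produced.
centroids = patterns @ q_i_given_c
deterministic = centroids[labels]

# Stochastic decoding: every item is replaced by a pattern drawn from q(i|c),
# so any of the N original values can be produced.
stochastic = np.array(
    [rng.choice(patterns, p=q_i_given_c[:, c]) for c in labels]
)
print(deterministic)
print(stochastic)
```

The deterministic decoder needs a centroid, which in turn requires that patterns can be averaged; the stochastic decoder only ever returns observed patterns, which is why it fits data given purely as relations between items.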


Iclust vs. classical rate-distortion decoding

[Figure: three images. Original figure: 220 gray levels. Iclust (stochastic) decoding: 2 clusters. RD (deterministic) decoding: 2 clusters.]


Iclust algorithm - freely available Web implementation

Responsive to any type of dependency among the data

Invariant to changes in the data representation

Allows clustering based on more than pairwise relations

For more details: Slonim, Atwal, Tkacik, and Bialek (2005) Information based clustering, PNAS, in press.

See www.princeton.edu/~nslonim

$$ q(c \mid i) \propto q(c) \exp\left\{ \frac{1}{T} \left[ r\, S(c;i) - (r-1)\, S(c) \right] \right\} $$

where $S(c)$ is the average “similarity” among the members of $c$, and $S(c;i)$ is the average “similarity” of $i$ to the members of $c$.
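One way to turn the self-consistent equation for q(c|i) into a procedure is to iterate it from a random soft assignment. The sketch below is not the authors' Web implementation: it assumes the pairwise case (r = 2), a uniform prior p(i), a fixed T, a fixed number of iterations, and a hypothetical function name `iclust_pairwise`. It only reaches a fixed point of the update, with no guarantee of a global optimum.

```python
import numpy as np

def iclust_pairwise(s, n_clusters, T, n_iter=200, seed=0):
    """Iterate q(c|i) ~ q(c) exp{[2 S(c;i) - S(c)] / T} for pairwise similarities s."""
    s = np.asarray(s, dtype=float)
    rng = np.random.default_rng(seed)
    n = s.shape[0]
    p_i = np.full(n, 1.0 / n)                           # assumed uniform prior over items
    q = rng.dirichlet(np.ones(n_clusters), size=n)      # random soft start, shape (N, Nc)
    for _ in range(n_iter):
        q_c = np.clip(p_i @ q, 1e-12, None)             # q(c), guarded against empty clusters
        q_i_c = q * p_i[:, None] / q_c                  # q(i|c) by Bayes' rule
        s_ci = s @ q_i_c                                # S(c;i): similarity of i to members of c
        s_c = np.sum(q_i_c * s_ci, axis=0)              # S(c): similarity among members of c
        logits = np.log(q_c) + (2.0 * s_ci - s_c) / T   # r = 2 in the update rule
        logits -= logits.max(axis=1, keepdims=True)     # stabilize the exponentials
        q = np.exp(logits)
        q /= q.sum(axis=1, keepdims=True)               # normalize over clusters
    return q

# Hypothetical data: six items forming two groups of three mutually similar items.
s = np.kron(np.eye(2), np.ones((3, 3)))
q = iclust_pairwise(s, n_clusters=2, T=0.1)
labels = q.argmax(axis=1)
print(labels)
```

On this toy matrix the two groups end up in different clusters; which group receives which cluster index depends on the random start. Lower values of T favor sharper assignments, and larger values favor more compression.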

Iclust – cluster examples

Clusters of genes

C18: RPS10A, RPS10B, RPS11A, RPS11B, RPS12, … (proteins of the small ribosomal subunit)
C15: FRS1, KRS1, SES1, TYS1, VAS1, … (enzymes that attach amino acids to tRNA)
C4: PGM2, UGP1, TSL1, TPS1, TPS2, … (enzymes involved in the trehalose anabolism pathway)

Clusters of stocks

Data: Dynamics of stock prices

Given the price of stock-A, how much information do we have about the price of stock-B (when averaging over many days)?

C17: Wal-Mart, Target, Home Depot, Best Buy, Staples, …
C12: Microsoft, Apple Comp., Dell, HP, Motorola, …
C2: NY Times, Tribune Co., Meredith Corp., Dow Jones & Co., Knight-Ridder Inc., …

Clusters of movies

Data: Rating by viewers

Given the rating of movie-A, how much information do we have about the rating of movie-B (when averaging over many viewers)?

C12: Snow White, Cinderella, Dumbo, Pinocchio, Aladdin, …
C1: Psycho, Apocalypse Now, The Godfather, Taxi Driver, Pulp Fiction, …
C7: Star Wars, Return of the Jedi, The Terminator, Alien, Apollo 13, …

Coherence results – comparison to alternative algorithms

[Figure: coherence (axis from 0 to 100) on the ESR, S&P 500, and EachMovie data sets, with K-means, K-medians, and hierarchical clustering shown as the alternative algorithms for each data set.]

Quick Summary

Information as the core measure of data analysis with many appealing features

Iclust - a novel information-theoretic formulation of clustering, with some intriguing relations with classical rate distortion clustering.

Validations: finding coherent gene clusters based on information relations in gene-expression data

… and finding coherent stock clusters, coherent movie clusters …

… and genotype-phenotype association in bacteria, based on phylogenetic data - Slonim, Elemento & Tavazoie (2005), Mol. Systems Biol., in press.

… and more?
