Page 1: Iclust: information based clustering

Iclust: information based clustering

Noam Slonim
The Lewis-Sigler Institute for Integrative Genomics
Princeton University

Joint work with Gurinder Atwal Gasper Tkacik Bill Bialek

Page 2: Iclust: information based clustering

Running example

Gene expression data: K genes (rows) by N conditions (columns)

   2  12  -1  -1   6  -3   8  ??   7  -5   3  -4
  12  ??  -5  11  -2   6  11  11  -8  12  ??  -2
  ??  12   5  12   4  -1   8  -2  ??   5  14  ??
   8   1  12   1  14  -8  ??  -2   5  14  -8  -7
   5  -5  11  17  -2  15   5  14  -8   5  16   2
   1  11  -8   0   5  -5   5  14  18  ??   2   1
  -6  12   4  12   4   7  -1   3  -7   3   7  -5
  21  ??  ??   3   2   4 -11  -3   3  -3  ??   9

Each entry: (log) ratio of the mRNA expression level of a gene in a specific condition

Relations between genes?
Relations between experimental conditions?

Page 3: Iclust: information based clustering

Information as a correlation/similarity measure

Some nice features of the information measure:

Model independent
Responsive to any type of dependency
Captures more than just pairwise relations
Suitable for both continuous and discrete data
Independent of the measurement scale
Axiomatic

Page 4: Iclust: information based clustering


Mutual information - definition

We have some “uncertainty” about the state of gene-A; but now someone told us the state of gene-B…

How much can we learn from the state of gene-B about the state of gene-A (and vice versa)?

The resulting reduction in the uncertainty about the state of gene-A is called the mutual information between these two variables:

$$ I[p(a,b)] = H[p(a)] - H[p(a|b)] = \sum_{a \in A,\, b \in B} p(a,b) \log \frac{p(a,b)}{p(a)\, p(b)} $$
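As a rough illustration of this definition, the sketch below evaluates the sum directly for a small joint probability table. The function name `mutual_information`, the base-2 logarithm (so the result is in bits), and the two toy tables are assumptions made for the example, not part of the talk.

```python
import math

def mutual_information(joint):
    """Mutual information in bits of a joint probability table.

    joint[a][b] holds p(a, b); the entries are assumed to sum to 1.
    """
    p_a = [sum(row) for row in joint]          # marginal p(a)
    p_b = [sum(col) for col in zip(*joint)]    # marginal p(b)
    mi = 0.0
    for a, row in enumerate(joint):
        for b, p_ab in enumerate(row):
            if p_ab > 0.0:                     # 0 * log 0 is taken as 0
                mi += p_ab * math.log2(p_ab / (p_a[a] * p_b[b]))
    return mi

# Hypothetical two-state genes: gene-B always copies gene-A.
copied = [[0.5, 0.0], [0.0, 0.5]]
# Hypothetical two-state genes: gene-B is independent of gene-A.
independent = [[0.25, 0.25], [0.25, 0.25]]

mi_copied = mutual_information(copied)
mi_independent = mutual_information(independent)
```

Knowing gene-B removes all uncertainty about gene-A in the first table and none in the second, which is what the two values reflect.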

Page 5: Iclust: information based clustering

Model independence & responsiveness to “complicated” relations

[Figure: four panels, each plotting gene-B expression level (vertical axis) against gene-A expression level (horizontal axis). Panel titles:
MI~1 bit; Corr.~0.9
MI~2 bits; Corr.~0.6
MI~0 bits; Corr.~0
MI~1.3 bits; Corr.~0]
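A case like the last panel (correlation near zero, information well above zero) can be reproduced with a toy example. The sketch below is not the data from the slide: it assumes four equally likely values of x and sets y = x², a purely nonlinear dependency, and computes the information from entropies as H(X) + H(Y) - H(X,Y).

```python
import math
from collections import Counter

def entropy(values):
    """Shannon entropy in bits of the empirical distribution of `values`."""
    n = len(values)
    return -sum((k / n) * math.log2(k / n) for k in Counter(values).values())

def mutual_information(xs, ys):
    """I(X;Y) = H(X) + H(Y) - H(X,Y) for paired discrete observations."""
    return entropy(xs) + entropy(ys) - entropy(list(zip(xs, ys)))

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Toy data (an assumption for illustration): y depends on x only through x**2.
xs = [-2, -1, 1, 2]
ys = [x * x for x in xs]

corr = pearson(xs, ys)
mi = mutual_information(xs, ys)
```

Here the linear correlation vanishes by symmetry while y is still fully determined by x, so the information equals the entropy of y.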

Page 6: Iclust: information based clustering

Capturing more than just pairwise relations

[Figure: two panels plotted against experiment index. Gene-A/gene-B expression: MI~0 bits; Corr.~0. Gene-A/gene-B/gene-C expression: triplet-information ~ 1.0 bits.]

Using a model-dependent correlation measure might result in missing significant dependencies in our data.
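A standard toy case with this signature is an XOR relation. The sketch below is an illustration only, not the genes from the slide; it assumes the triplet information is the multi-information H(A) + H(B) + H(C) - H(A,B,C), and uses two fair binary variables with C = A XOR B.

```python
import math
from collections import Counter
from itertools import product

def entropy(outcomes):
    """Shannon entropy in bits of the empirical distribution of `outcomes`."""
    n = len(outcomes)
    return -sum((k / n) * math.log2(k / n) for k in Counter(outcomes).values())

def pairwise_mi(x, y):
    """I(X;Y) = H(X) + H(Y) - H(X,Y)."""
    return entropy(x) + entropy(y) - entropy(list(zip(x, y)))

# The four equally likely (A, B) pairs, with C = A XOR B (assumed toy model).
samples = [(a, b, a ^ b) for a, b in product([0, 1], repeat=2)]
A, B, C = zip(*samples)

mi_ab = pairwise_mi(A, B)
mi_ac = pairwise_mi(A, C)
mi_bc = pairwise_mi(B, C)

# Multi-information of the triplet: sum of marginal entropies minus joint entropy.
triplet_information = entropy(A) + entropy(B) + entropy(C) - entropy(samples)
```

Every pair looks independent, yet any two variables determine the third, so a pairwise measure would miss the dependency entirely.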

Page 7: Iclust: information based clustering

Mutual-information vs. Pearson-Correlation results in bacteria gene-expression data

[Figure: mutual information (vertical axis) against Pearson correlation (horizontal axis). Mycobacterium tuberculosis, 81 experiments.]

Page 8: Iclust: information based clustering


Information relations between gene expression profiles

Given the expression of gene-A, how much information do we have about the expression of gene-B? (when averaging over all conditions)

(Sample size: the number of conditions, 173 in the Gasch data.)

Once we find these information relations, we often want to apply cluster analysis.

Numerous clustering methods are available – but typically they assume a particular model.

For example, K-means corresponds to the modeling assumption that each cluster can be described by a spherical Gaussian.

Back to square one…?

Page 9: Iclust: information based clustering

Iclust – information based clustering

What is a “good” cluster?

A simple proposal – given a cluster, we pick two items at random, and we want them to be as “similar” to each other as possible.

Formally, we wish to maximize

$$ \langle S(c) \rangle = \sum_c q(c) \sum_{i_1, i_2} q(i_1|c)\, q(i_2|c)\, s(i_1, i_2) $$

Or …

$$ \langle S(c) \rangle = \sum_c q(c) \sum_{i_1, \ldots, i_r} q(i_1|c)\, q(i_2|c) \cdots q(i_r|c)\, s(i_1, i_2, \ldots, i_r) $$

Or …

$$ \langle S(c) \rangle = \langle I(c) \rangle = \sum_c q(c) \sum_{i_1, \ldots, i_r} q(i_1|c)\, q(i_2|c) \cdots q(i_r|c)\, I(i_1, i_2, \ldots, i_r) $$

Namely, we wish to maximize the average information relations in our clusters, or to find clusters s.t. in each cluster all items are highly informative about each other.
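For the pairwise case, the average can be computed directly from a similarity matrix and a soft assignment q(c|i). The sketch below is a minimal illustration: the function name, the default uniform prior p(i) = 1/N, and the use of Bayes' rule q(i|c) = q(c|i) p(i) / q(c) to obtain the within-cluster distributions are assumptions of the example.

```python
def average_similarity(s, q_c_given_i, p_i=None):
    """Average within-cluster pairwise similarity.

    Computes sum_c q(c) * sum_{i1,i2} q(i1|c) q(i2|c) s[i1][i2].
    s            : N x N similarity matrix (list of lists)
    q_c_given_i  : N x Nc soft assignments, each row summing to 1
    p_i          : prior over items; uniform if omitted (an assumption)
    """
    n = len(s)
    n_clusters = len(q_c_given_i[0])
    if p_i is None:
        p_i = [1.0 / n] * n
    total = 0.0
    for c in range(n_clusters):
        q_c = sum(p_i[i] * q_c_given_i[i][c] for i in range(n))
        if q_c == 0.0:
            continue                      # an empty cluster contributes nothing
        # Bayes' rule: distribution over items inside cluster c.
        q_i_given_c = [p_i[i] * q_c_given_i[i][c] / q_c for i in range(n)]
        s_c = sum(q_i_given_c[i1] * q_i_given_c[i2] * s[i1][i2]
                  for i1 in range(n) for i2 in range(n))
        total += q_c * s_c
    return total

# Hypothetical similarities: items {0, 1} resemble each other, as do {2, 3}.
s_block = [[1, 1, 0, 0],
           [1, 1, 0, 0],
           [0, 0, 1, 1],
           [0, 0, 1, 1]]
aligned = [[1, 0], [1, 0], [0, 1], [0, 1]]   # clusters follow the blocks
mixed = [[1, 0], [0, 1], [1, 0], [0, 1]]     # clusters cut across the blocks

score_aligned = average_similarity(s_block, aligned)
score_mixed = average_similarity(s_block, mixed)
```

The clustering that follows the block structure scores higher, matching the intuition that two items drawn from a good cluster should be similar.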

Page 10: Iclust: information based clustering

Iclust – information based clustering (cont.)

A penalty term that we wish to minimize, as in rate-distortion theory:

$$ I(C;i) = \sum_{i,c} p(i)\, q(c|i) \log \frac{q(c|i)}{q(c)} $$

S(c) is maximized, but the penalty term is maximized as well (no compression).
Penalty term is minimized (maximal compression), but S(c) is minimized as well.
Intermediate interesting cases – small penalty with high S(c).
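The two extremes of the penalty term I(C;i) can be checked numerically. This is a sketch under stated assumptions: a uniform prior p(i) unless one is passed, base-2 logarithms, and hypothetical names throughout.

```python
import math

def cluster_information(q_c_given_i, p_i=None):
    """I(C;i) in bits for soft assignments q(c|i) and item prior p(i)."""
    n = len(q_c_given_i)
    n_clusters = len(q_c_given_i[0])
    if p_i is None:
        p_i = [1.0 / n] * n               # uniform prior (an assumption)
    q_c = [sum(p_i[i] * q_c_given_i[i][c] for i in range(n))
           for c in range(n_clusters)]
    info = 0.0
    for i in range(n):
        for c in range(n_clusters):
            q = q_c_given_i[i][c]
            if q > 0.0:                   # 0 * log 0 is taken as 0
                info += p_i[i] * q * math.log2(q / q_c[c])
    return info

# Four items, each in its own cluster: no compression.
no_compression = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]]
# Four items, all in one cluster: maximal compression.
max_compression = [[1], [1], [1], [1]]
# Four items split into two clusters of two: an intermediate case.
two_clusters = [[1, 0], [1, 0], [0, 1], [0, 1]]

bits_none = cluster_information(no_compression)
bits_max = cluster_information(max_compression)
bits_two = cluster_information(two_clusters)
```

With four equally likely items, the cost ranges from 0 bits (one cluster) to log2(4) = 2 bits (every item its own cluster), with the two-cluster split in between.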

Page 11: Iclust: information based clustering

Iclust – information based clustering (cont.)

The intuitive clustering problem can be turned into a general mathematical optimization problem:

$$ F[q(c|i)] = \langle I(c) \rangle - T\, I(C;i) $$

q(c|i): clustering parameters
⟨I(c)⟩: expected information relations among data items
I(C;i): information between data items and clusters
T: tradeoff parameter

Clustering is formulated as trading bits of similarity against bits of descriptive power, without any further assumptions.

Page 12: Iclust: information based clustering

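Relations with classical rate distortion

Iclust

$$ F[q(c|i)] = I(C;i) + \frac{1}{T} \langle D(c) \rangle, \qquad \langle D(c) \rangle = \sum_c q(c) \sum_{i_1, \ldots, i_r} q(i_1|c) \cdots q(i_r|c)\, d(\Phi_{i_1}, \ldots, \Phi_{i_r}) $$

Classical rate distortion

$$ F[q(c|i)] = I(C;i) + \frac{1}{T} \langle \tilde{D}(c) \rangle, \qquad \langle \tilde{D}(c) \rangle = \sum_c q(c) \sum_{i_1} q(i_1|c)\, d(\Phi_{i_1}, \Phi_c) $$

For the special case of pairwise relations

$$ \langle \tilde{D}(c) \rangle = \sum_c q(c) \sum_{i_1} q(i_1|c)\, d\Big( \Phi_{i_1}, \sum_{i_2} q(i_2|c)\, \Phi_{i_2} \Big) $$

$$ \langle D(c) \rangle = \sum_c q(c) \sum_{i_1} \sum_{i_2} q(i_1|c)\, q(i_2|c)\, d(\Phi_{i_1}, \Phi_{i_2}) $$

The difference is whether the sum over i_2 is taken before or after d is computed.

If the distortion/similarity matrix is a kernel matrix, the formulations are equivalent.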
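One way to see the kernel-matrix equivalence in the pairwise case is sketched below. It assumes (this reading is not spelled out on the slide) that "kernel matrix" means the measure is an inner product of feature vectors, $d(\Phi_{i_1}, \Phi_{i_2}) = \langle \Phi_{i_1}, \Phi_{i_2} \rangle$, so that d is linear in its second argument and the sum over $i_2$ can be moved outside:

```latex
\begin{align*}
\sum_{i_1} q(i_1|c)\, d\Big(\Phi_{i_1}, \sum_{i_2} q(i_2|c)\,\Phi_{i_2}\Big)
  &= \sum_{i_1} q(i_1|c)\, \Big\langle \Phi_{i_1}, \sum_{i_2} q(i_2|c)\,\Phi_{i_2} \Big\rangle \\
  &= \sum_{i_1} \sum_{i_2} q(i_1|c)\, q(i_2|c)\, \langle \Phi_{i_1}, \Phi_{i_2} \rangle \\
  &= \sum_{i_1} \sum_{i_2} q(i_1|c)\, q(i_2|c)\, d(\Phi_{i_1}, \Phi_{i_2})
\end{align*}
```

For a measure that is not linear in this sense, the two sides generally differ, which is where the formulations part ways.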

Page 13: Iclust: information based clustering


And yet – some important differences

Iclust is applicable when the raw data is given directly as pairwise relations

Iclust does not require a definition of a “prototype” (or “centroid”)

The two formulations induce different decoding schemes

A sender observes a pattern Φi, but is allowed to send only the cluster index, c

In classical rate distortion the receiver is assumed to decode by

$$ \tilde{\Phi}_i = \Phi_c = \sum_{i'} q(i'|c)\, \Phi_{i'} $$

Deterministic decoding with vocabulary size N_c

In Iclust the receiver is assumed to decode by

$$ \tilde{\Phi}_i = \Phi_{i'}, \qquad i' \sim q(i'|c) $$

Stochastic decoding with vocabulary size N

Iclust can handle more than just pairwise correlations
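The two decoding schemes can be contrasted in a few lines. The sketch below is a simplified illustration with hypothetical names and made-up patterns: the deterministic decoder returns the q(i|c)-weighted centroid of a cluster, while the stochastic decoder returns one of the original patterns, drawn with probability q(i|c).

```python
import random

def rd_decode(patterns, q_i_given_c):
    """Deterministic decoding: the q(i|c)-weighted centroid of the patterns."""
    dim = len(patterns[0])
    return [sum(w * p[k] for w, p in zip(q_i_given_c, patterns))
            for k in range(dim)]

def iclust_decode(patterns, q_i_given_c, rng):
    """Stochastic decoding: one stored pattern drawn with probability q(i|c)."""
    return rng.choices(patterns, weights=q_i_given_c, k=1)[0]

# Two made-up patterns that belong to the same cluster with equal weight.
patterns = [[0.0, 0.0], [10.0, 2.0]]
q_i_given_c = [0.5, 0.5]
rng = random.Random(0)                    # seeded only for reproducibility

centroid = rd_decode(patterns, q_i_given_c)
sampled = iclust_decode(patterns, q_i_given_c, rng)
```

The centroid need not coincide with any observed pattern, so the deterministic receiver can produce only one output per cluster, whereas the stochastic receiver always outputs one of the N original patterns.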

Page 14: Iclust: information based clustering

Iclust vs. classical rate-distortion decoding

[Figure: three images.
Original figure: 220 gray levels
Iclust (stochastic) decoding: 2 clusters
RD (deterministic) decoding: 2 clusters]

Page 15: Iclust: information based clustering


Iclust algorithm - freely available Web implementation

Responsive to any type of dependency among the data

Invariant to changes in the data representation

Allows clustering based on more than pairwise relations

For more details: Slonim, Atwal, Tkacik, and Bialek (2005), Information based clustering, PNAS, in press.

See www.princeton.edu/~nslonim

$$ q(c|i) \propto q(c) \exp\left\{ \frac{1}{T} \left[ r\, S(c;i) - (r-1)\, S(c) \right] \right\} $$

S(c): average “similarity” among c members
S(c;i): average “similarity” of i to c members
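A minimal sketch of how this update might be iterated for pairwise similarities (r = 2) follows. It is not the authors' Web implementation: the uniform prior p(i) = 1/N, the synchronous sweeps, the fixed iteration count, the caller-supplied starting point, and all names are assumptions made for illustration.

```python
import math

def iclust_pairwise(s, n_clusters, T, q_init, n_iter=50):
    """Iterate q(c|i) ∝ q(c) exp{(1/T) [2 s(c;i) - s(c)]} for pairwise data.

    s       : N x N similarity matrix (list of lists)
    q_init  : N x n_clusters starting assignments q(c|i), rows summing to 1
    Assumes uniform p(i) = 1/N and a fixed number of synchronous sweeps.
    """
    n = len(s)
    q = [row[:] for row in q_init]
    for _ in range(n_iter):
        # Cluster weights q(c) and within-cluster distributions q(i|c).
        q_c = [sum(q[i][c] for i in range(n)) / n for c in range(n_clusters)]
        q_i_c = [[q[i][c] / (n * q_c[c]) if q_c[c] > 0 else 0.0
                  for i in range(n)] for c in range(n_clusters)]
        # s(c;i): average similarity of item i to the members of c.
        s_ci = [[sum(q_i_c[c][j] * s[i][j] for j in range(n))
                 for i in range(n)] for c in range(n_clusters)]
        # s(c): average similarity among the members of c.
        s_c = [sum(q_i_c[c][i] * s_ci[c][i] for i in range(n))
               for c in range(n_clusters)]
        new_q = []
        for i in range(n):
            logits = [(math.log(q_c[c]) if q_c[c] > 0 else -math.inf)
                      + (2.0 * s_ci[c][i] - s_c[c]) / T
                      for c in range(n_clusters)]
            top = max(logits)             # subtract the max for stability
            weights = [math.exp(l - top) for l in logits]
            z = sum(weights)
            new_q.append([w / z for w in weights])
        q = new_q
    return q

# Hypothetical data: items 0-2 are mutually similar, as are items 3-5.
s = [[1.0 if (i < 3) == (j < 3) else 0.0 for j in range(6)] for i in range(6)]
# A slightly biased starting point, chosen by hand for the example.
q0 = [[0.6, 0.4], [0.55, 0.45], [0.6, 0.4],
      [0.4, 0.6], [0.45, 0.55], [0.4, 0.6]]
q_final = iclust_pairwise(s, n_clusters=2, T=0.1, q_init=q0)
```

At this low T the small initial bias is amplified on each sweep until the two blocks of similar items occupy separate clusters; a larger T would favor compression and leave the assignments softer.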

Page 16: Iclust: information based clustering

Iclust – clusters examples

Clusters of genes

C18: RPS10A, RPS10B, RPS11A, RPS11B, RPS12, … (proteins of the small ribosomal subunit)
C15: FRS1, KRS1, SES1, TYS1, VAS1, … (enzymes that attach amino acids to tRNA)
C4: PGM2, UGP1, TSL1, TPS1, TPS2, … (enzymes involved in the trehalose anabolism pathway)

Clusters of stocks

Data: Dynamics of stock prices

Given the price of stock-A, how much information do we have about the price of stock-B? (when averaging over many days)

C17: Wal-Mart, Target, Home Depot, Best Buy, Staples, …
C12: Microsoft, Apple Comp., Dell, HP, Motorola, …
C2: NY Times, Tribune Co., Meredith Corp., Dow Jones & Co., Knight-Ridder Inc., …

Clusters of movies

Data: Rating by viewers

Given the rating of movie-A, how much information do we have about the rating of movie-B? (when averaging over many viewers)

C12: Snow White, Cinderella, Dumbo, Pinocchio, Aladdin, …
C1: Psycho, Apocalypse Now, The Godfather, Taxi Driver, Pulp Fiction, …
C7: Star Wars, Return of the Jedi, The Terminator, Alien, Apollo 13, …

Page 17: Iclust: information based clustering

Coherence results – comparison to alternative algorithms

[Figure: coherence (vertical axis, 0 to 100) for K-means, K-medians, and Hierarchical clustering on three data sets: ESR, S&P 500, EachMovie.]

Page 18: Iclust: information based clustering

Quick Summary

Information as the core measure of data analysis with many appealing features

Iclust - a novel information-theoretic formulation of clustering, with some intriguing relations with classical rate distortion clustering.

Validations: finding coherent gene clusters based on information relations in gene-expression data

… and finding coherent stock clusters, coherent movie clusters …

… and genotype-phenotype association in bacteria, based on phylogenetic data: Slonim, Elemento & Tavazoie (2005), Mol. Systems Biol., in press.

… and more?