Iclust: information based clustering
Noam Slonim
The Lewis-Sigler Institute for Integrative Genomics
Princeton University
Joint work with Gurinder Atwal, Gasper Tkacik, and Bill Bialek
Running example: gene expression data
K genes × N conditions
   2  12  -1  -1   6  -3   8  ??   7  -5   3  -4
  12  ??  -5  11  -2   6  11  11  -8  12  ??  -2
  ??  12   5  12   4  -1   8  -2  ??   5  14  ??
   8   1  12   1  14  -8  ??  -2   5  14  -8  -7
   5  -5  11  17  -2  15   5  14  -8   5  16   2
   1  11  -8   0   5  -5   5  14  18  ??   2   1
  -6  12   4  12   4   7  -1   3  -7   3   7  -5
  21  ??  ??   3   2   4 -11  -3   3  -3  ??   9
Each entry is the (log) ratio of the mRNA expression level of a gene in a specific condition.
Relations between genes?
Relations between experimental conditions?
Information as a correlation/similarity measure
Some nice features of the information measure:
- Model independent
- Responsive to any type of dependency
- Captures more than just pairwise relations
- Suitable for both continuous and discrete data
- Independent of the measurement scale
- Axiomatic
Mutual information - definition
We have some “uncertainty” about the state of gene-A; but now someone told us the state of gene-B…
How much can we learn from the state of gene-B about the state of gene-A (and vice versa)?
The resulting reduction in the uncertainty about the state of gene-A is called the mutual information between these two variables:
\[
I[p(a,b)] \equiv H[p(a)] - H[p(a|b)] = \sum_{a \in A,\, b \in B} p(a,b) \log \frac{p(a,b)}{p(a)\,p(b)}
\]
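As a minimal sketch of the definition, the sum can be evaluated directly for a small discrete joint distribution. The function name, the base-2 logarithm (so the result is in bits), and the two toy distributions are assumptions made for illustration; they are not taken from the slides.

```python
import math

def mutual_information(joint):
    """Mutual information in bits of a joint distribution p(a, b).

    `joint` is a list of rows with joint[a][b] = p(a, b); entries sum to 1.
    """
    p_a = [sum(row) for row in joint]          # marginal p(a)
    p_b = [sum(col) for col in zip(*joint)]    # marginal p(b)
    info = 0.0
    for a, row in enumerate(joint):
        for b, p_ab in enumerate(row):
            if p_ab > 0:                       # 0 * log 0 is taken as 0
                info += p_ab * math.log2(p_ab / (p_a[a] * p_b[b]))
    return info

# Two binary "genes" that always agree: knowing B removes the full
# 1 bit of uncertainty about A.
copies = [[0.5, 0.0], [0.0, 0.5]]
# Two independent binary "genes": knowing B says nothing about A.
independent = [[0.25, 0.25], [0.25, 0.25]]

print(mutual_information(copies))       # 1.0
print(mutual_information(independent))  # 0.0
```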
Model independence & responsiveness to “complicated” relations
[Figure: four scatter plots of gene-B expression level (vertical axis) against gene-A expression level (horizontal axis). Panels: MI~1 bit, Corr.~0.9; MI~2 bits, Corr.~0.6; MI~0 bits, Corr.~0; MI~1.3 bits, Corr.~0]
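The case of high information with zero correlation can be reproduced with a toy example. The sketch below assumes five equally likely states in which B is the square of A; this data set is hypothetical and is not the data behind the figure. The Pearson correlation vanishes because the relation is symmetric, while the mutual information stays well above zero because B is fully determined by A.

```python
import math
from collections import Counter

def pearson(pairs):
    """Pearson correlation of equally likely (a, b) pairs."""
    n = len(pairs)
    mean_a = sum(a for a, _ in pairs) / n
    mean_b = sum(b for _, b in pairs) / n
    cov = sum((a - mean_a) * (b - mean_b) for a, b in pairs) / n
    var_a = sum((a - mean_a) ** 2 for a, _ in pairs) / n
    var_b = sum((b - mean_b) ** 2 for _, b in pairs) / n
    return cov / math.sqrt(var_a * var_b)

def mutual_information(pairs):
    """Mutual information in bits of equally likely (a, b) pairs."""
    n = len(pairs)
    joint = Counter(pairs)
    count_a = Counter(a for a, _ in pairs)
    count_b = Counter(b for _, b in pairs)
    # p(a,b) / (p(a) p(b)) simplifies to k * n / (count_a * count_b)
    return sum((k / n) * math.log2(k * n / (count_a[a] * count_b[b]))
               for (a, b), k in joint.items())

# Hypothetical data: B = A squared, A symmetric around zero.
pairs = [(a, a * a) for a in (-2, -1, 0, 1, 2)]

print(pearson(pairs))             # 0.0
print(mutual_information(pairs))  # about 1.52 bits
```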
Capturing more than just pairwise relations
[Figure: two panels plotted against experiment index. Left: gene-A/gene-B expression; MI~0 bits, Corr.~0. Right: gene-A/gene-B/gene-C expression; Triplet-information ~ 1.0 bits]
Using a model-dependent correlation measure might result in missing significant dependencies in our data.
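A standard toy case shows how a dependency can be invisible to every pairwise measure. The sketch below assumes three binary variables with C = A XOR B, and it assumes that "triplet information" means the multi-information, the sum of the single-variable entropies minus the joint entropy. Both choices are illustrative assumptions, not details confirmed by the slides.

```python
import math
from collections import Counter
from itertools import product

def entropy(outcomes):
    """Entropy in bits of a list of equally likely outcomes."""
    n = len(outcomes)
    return -sum((k / n) * math.log2(k / n)
                for k in Counter(outcomes).values())

# Four equally likely states (a, b, c) with c = a XOR b.
states = [(a, b, a ^ b) for a, b in product((0, 1), repeat=2)]

def pairwise_information(x, y):
    """Mutual information between variables number x and y."""
    xs = [s[x] for s in states]
    ys = [s[y] for s in states]
    return entropy(xs) + entropy(ys) - entropy(list(zip(xs, ys)))

# Multi-information of the triplet: sum of marginal entropies
# minus the joint entropy.
triplet_information = (sum(entropy([s[k] for s in states]) for k in range(3))
                       - entropy(states))

print([pairwise_information(x, y) for x, y in ((0, 1), (0, 2), (1, 2))])
# [0.0, 0.0, 0.0]
print(triplet_information)  # 1.0
```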
Mutual-information vs. Pearson-Correlation results in bacteria gene-expression data
[Figure: scatter plot of mutual information (vertical axis) against Pearson correlation (horizontal axis). Mycobacterium tuberculosis, 81 experiments]
Information relations between gene expression profiles
Given the expression of gene-A, how much information do we have about the expression of gene-B (when averaging over all conditions)?
(Sample size: the number of conditions, 173 in the Gasch data.)
Once we find these information relations, we often want to apply cluster analysis.
Numerous clustering methods are available – but typically they assume a particular model.
For example, K-means corresponds to the modeling assumption that each cluster can be described by a spherical Gaussian.
Back to square one …?
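Iclust – information based clustering
What is a “good” cluster?
A simple proposal: given a cluster, we pick two items at random, and we want them to be as “similar” to each other as possible.
Formally, we wish to maximize
\[
\langle S(c) \rangle = \sum_{c} q(c) \sum_{i_1, i_2} q(i_1|c)\, q(i_2|c)\, s(i_1, i_2)
\]
Or, more generally, for relations among r items:
\[
\langle S(c) \rangle = \sum_{c} q(c) \sum_{i_1, \ldots, i_r} q(i_1|c)\, q(i_2|c) \cdots q(i_r|c)\, s(i_1, i_2, \ldots, i_r)
\]
Or, with information as the similarity measure:
\[
\langle S(c) \rangle \rightarrow \langle I(c) \rangle = \sum_{c} q(c) \sum_{i_1, \ldots, i_r} q(i_1|c)\, q(i_2|c) \cdots q(i_r|c)\, I(i_1, i_2, \ldots, i_r)
\]
Namely, we wish to maximize the average information relations in our clusters, or to find clusters such that in each cluster all items are highly informative about each other.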
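For the pairwise case, the quantity being maximized can be computed from a soft assignment q(c|i) and a similarity matrix. The sketch below is illustrative only: it assumes a uniform prior p(i) unless one is supplied, obtains q(c) and q(i|c) from q(c|i) by Bayes' rule, and uses a hypothetical 4-item similarity matrix with two obvious groups.

```python
def cluster_marginals(q_c_given_i, p_i):
    """Return q(c) and q(i|c) computed from q(c|i) and p(i)."""
    n_items, n_clusters = len(q_c_given_i), len(q_c_given_i[0])
    q_c = [sum(p_i[i] * q_c_given_i[i][c] for i in range(n_items))
           for c in range(n_clusters)]
    q_i_given_c = [[p_i[i] * q_c_given_i[i][c] / q_c[c] if q_c[c] > 0 else 0.0
                    for i in range(n_items)]
                   for c in range(n_clusters)]
    return q_c, q_i_given_c

def average_similarity(q_c_given_i, s, p_i=None):
    """Average pairwise similarity: sum_c q(c) sum_{i1,i2} q(i1|c) q(i2|c) s(i1,i2)."""
    n_items = len(s)
    if p_i is None:
        p_i = [1.0 / n_items] * n_items     # assumption: uniform prior over items
    q_c, q_i_given_c = cluster_marginals(q_c_given_i, p_i)
    total = 0.0
    for c, w in enumerate(q_i_given_c):
        s_c = sum(w[i1] * w[i2] * s[i1][i2]
                  for i1 in range(n_items) for i2 in range(n_items))
        total += q_c[c] * s_c
    return total

# Hypothetical similarities: items {0, 1} and {2, 3} form two groups.
s = [[1, 1, 0, 0],
     [1, 1, 0, 0],
     [0, 0, 1, 1],
     [0, 0, 1, 1]]
matched = [[1, 0], [1, 0], [0, 1], [0, 1]]   # clusters follow the groups
mixed = [[1, 0], [0, 1], [1, 0], [0, 1]]     # clusters cut across the groups

print(average_similarity(matched, s))  # 1.0
print(average_similarity(mixed, s))    # 0.5
```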
Iclust – information based clustering (cont.)
A penalty term that we wish to minimize, as in rate-distortion theory:
\[
I(C;i) = \sum_{i,c} p(i)\, q(c|i) \log \frac{q(c|i)}{q(c)}
\]
S(c) is maximized, but the penalty term is maximized as well (no compression)
Penalty term is minimized (maximal compression), but S(c) is minimized as well
Intermediate interesting cases: small penalty with high S(c)
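The two extremes of the penalty term can be checked numerically. This sketch assumes a uniform p(i) when none is given and measures information in bits; the three assignments are hypothetical examples for four items.

```python
import math

def compression_information(q_c_given_i, p_i=None):
    """I(C;i) in bits for a soft assignment q(c|i)."""
    n_items, n_clusters = len(q_c_given_i), len(q_c_given_i[0])
    if p_i is None:
        p_i = [1.0 / n_items] * n_items     # assumption: uniform prior over items
    q_c = [sum(p_i[i] * q_c_given_i[i][c] for i in range(n_items))
           for c in range(n_clusters)]
    info = 0.0
    for i in range(n_items):
        for c in range(n_clusters):
            q = q_c_given_i[i][c]
            if q > 0:                       # 0 * log 0 is taken as 0
                info += p_i[i] * q * math.log2(q / q_c[c])
    return info

# Every item in its own cluster: no compression.
one_cluster_per_item = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]]
# All items in a single cluster: maximal compression.
single_cluster = [[1], [1], [1], [1]]
# An intermediate case: two clusters of two items each.
two_clusters = [[1, 0], [1, 0], [0, 1], [0, 1]]

print(compression_information(one_cluster_per_item))  # 2.0
print(compression_information(single_cluster))        # 0.0
print(compression_information(two_clusters))          # 1.0
```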
Iclust – information based clustering (cont.)
The intuitive clustering problem can be turned into a general mathematical optimization problem:
\[
F[q(c|i)] = \langle I(c) \rangle - T\, I(C;i)
\]
where $q(c|i)$ are the clustering parameters, $\langle I(c) \rangle$ is the expected information relations among data items, $I(C;i)$ is the information between data items and clusters, and $T$ is the tradeoff parameter.
Clustering is formulated as trading bits of similarity against bits of descriptive power, without any further assumptions.
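Relations with classical rate distortion
Iclust:
\[
F[q(c|i)] = I(C;i) + \frac{1}{T} \langle D(c) \rangle, \qquad
\langle D(c) \rangle = \sum_{c} q(c) \sum_{i_1, \ldots, i_r} q(i_1|c) \cdots q(i_r|c)\, d(\Phi^{(i_1)}, \ldots, \Phi^{(i_r)})
\]
Classical rate distortion:
\[
F[q(c|i)] = I(C;i) + \frac{1}{T} \langle \tilde{D}(c) \rangle, \qquad
\langle \tilde{D}(c) \rangle = \sum_{c} q(c) \sum_{i_1} q(i_1|c)\, d(\Phi^{(i_1)}, \Phi^{(c)})
\]
For the special case of pairwise relations:
\[
\langle \tilde{D}(c) \rangle = \sum_{c} q(c) \sum_{i_1} q(i_1|c)\, d\Big(\Phi^{(i_1)}, \sum_{i_2} q(i_2|c)\, \Phi^{(i_2)}\Big)
\]
\[
\langle D(c) \rangle = \sum_{c} q(c) \sum_{i_1} \sum_{i_2} q(i_1|c)\, q(i_2|c)\, d(\Phi^{(i_1)}, \Phi^{(i_2)})
\]
The difference is whether the sum over i2 is taken before or after d is computed.
If the distortion/similarity matrix is a kernel matrix, the formulations are equivalent.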
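The equivalence for kernels can be checked on a single cluster. The sketch below uses hypothetical two-dimensional patterns and weights q(i|c). With the inner product as d (a linear kernel), summing over i2 before or after applying d gives the same value by bilinearity. With the squared Euclidean distance, which is not an inner product, the two values differ (here by a factor of two).

```python
def dot(x, y):
    return sum(a * b for a, b in zip(x, y))

def squared_distance(x, y):
    return sum((a - b) ** 2 for a, b in zip(x, y))

def centroid(points, weights):
    """Weighted mean pattern: sum_i q(i|c) * point_i."""
    dim = len(points[0])
    return tuple(sum(w * p[k] for w, p in zip(weights, points))
                 for k in range(dim))

def iclust_term(points, weights, d):
    """Sum over i2 AFTER d: sum_{i1,i2} q(i1|c) q(i2|c) d(x_i1, x_i2)."""
    return sum(w1 * w2 * d(x1, x2)
               for w1, x1 in zip(weights, points)
               for w2, x2 in zip(weights, points))

def rate_distortion_term(points, weights, d):
    """Sum over i2 BEFORE d: sum_{i1} q(i1|c) d(x_i1, centroid)."""
    m = centroid(points, weights)
    return sum(w1 * d(x1, m) for w1, x1 in zip(weights, points))

# Hypothetical patterns in one cluster and their weights q(i|c).
points = [(0.0, 0.0), (2.0, 0.0), (0.0, 2.0)]
weights = [0.5, 0.25, 0.25]

print(iclust_term(points, weights, dot),
      rate_distortion_term(points, weights, dot))               # 0.5 0.5
print(iclust_term(points, weights, squared_distance),
      rate_distortion_term(points, weights, squared_distance))  # 3.0 1.5
```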
And yet – some important differences
Iclust is applicable when the raw data is given directly as pairwise relations
Iclust does not require a definition of a “prototype” (or “centroid”)
The two formulations induce different decoding schemes
A sender observes a pattern $\Phi^{(i)}$, but is allowed to send only the cluster index, c
In classical rate distortion the receiver is assumed to decode by
\[
\tilde{\Phi}^{(i)} = \Phi^{(c)} = \sum_{i'} q(i'|c)\, \Phi^{(i')}
\]
Deterministic decoding with vocabulary size $N_c$
In Iclust the receiver is assumed to decode by
\[
\tilde{\Phi}^{(i)} = \Phi^{(i')}, \qquad i' \sim q(i'|c)
\]
Stochastic decoding with vocabulary size N
Iclust can handle more than just pairwise correlations
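The two decoding schemes can be contrasted on scalar patterns such as gray levels. This is a sketch under assumptions: the gray levels, the weights q(i|c), and the fixed random seed are hypothetical. The deterministic decoder returns the weighted centroid of the cluster, which need not be one of the original patterns; the stochastic decoder always returns one of the original patterns, drawn from q(i|c).

```python
import random

def decode_deterministic(patterns, q_i_given_c):
    """Classical rate distortion: return the weighted centroid of cluster c."""
    return sum(w * x for w, x in zip(q_i_given_c, patterns))

def decode_stochastic(patterns, q_i_given_c, rng):
    """Iclust: return one of the original patterns, drawn from q(i|c)."""
    return rng.choices(patterns, weights=q_i_given_c, k=1)[0]

# Hypothetical gray levels and a "dark" cluster holding the first two.
gray_levels = [10, 20, 200, 220]
q_dark = [0.5, 0.5, 0.0, 0.0]

rng = random.Random(0)   # fixed seed so the run is repeatable
centroid_value = decode_deterministic(gray_levels, q_dark)
samples = [decode_stochastic(gray_levels, q_dark, rng) for _ in range(5)]

print(centroid_value)  # 15.0, a value that is not among the original gray levels
print(samples)         # five values, each either 10 or 20
```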
Iclust vs. classical rate-distortion decoding
[Figure: three images. Original figure: 220 gray levels. Iclust (stochastic) decoding, 2 clusters. RD (deterministic) decoding, 2 clusters]
Iclust algorithm - freely available Web implementation
\[
q(c|i) \propto q(c) \exp\left\{ \frac{1}{T} \left[ r\, S(c;i) - (r-1)\, S(c) \right] \right\}
\]
S(c): average “similarity” among c members
S(c;i): average “similarity” of i to c members
Responsive to any type of dependency among the data
Invariant to changes in the data representation
Allows clustering based on more than pairwise relations
For more details: Slonim, Atwal, Tkacik, and Bialek (2005) Information based clustering, PNAS, in press.
See www.princeton.edu/~nslonim
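The update rule on this slide can be iterated as a self-consistent loop. The following is a minimal sketch, not the authors' Web implementation: it assumes pairwise relations (r = 2), a uniform prior p(i), synchronous updates, a fixed number of iterations, and a caller-supplied starting assignment. The function name `iclust_pairwise` and the toy similarity matrix are hypothetical.

```python
import math

def iclust_pairwise(s, q_c_given_i, temperature, n_iter=50):
    """Iterate q(c|i) ∝ q(c) exp{[2 S(c;i) - S(c)] / T} for pairwise similarities."""
    n, n_clusters = len(s), len(q_c_given_i[0])
    p_i = 1.0 / n                                   # assumption: uniform p(i)
    q = [row[:] for row in q_c_given_i]
    for _ in range(n_iter):
        q_c = [sum(p_i * q[i][c] for i in range(n)) for c in range(n_clusters)]
        expo = [[None] * n_clusters for _ in range(n)]
        for c in range(n_clusters):
            if q_c[c] == 0.0:
                continue                            # an empty cluster stays empty
            q_i_c = [p_i * q[i][c] / q_c[c] for i in range(n)]      # q(i|c)
            # S(c;i): average similarity of item i to the members of c
            s_ci = [sum(q_i_c[j] * s[i][j] for j in range(n)) for i in range(n)]
            # S(c): average similarity among the members of c
            s_c = sum(q_i_c[i] * s_ci[i] for i in range(n))
            for i in range(n):
                expo[i][c] = (2.0 * s_ci[i] - s_c) / temperature
        for i in range(n):
            top = max(e for e in expo[i] if e is not None)  # for numerical stability
            w = [q_c[c] * math.exp(expo[i][c] - top) if expo[i][c] is not None else 0.0
                 for c in range(n_clusters)]
            z = sum(w)
            q[i] = [x / z for x in w]
    return q

# Hypothetical similarities: items {0, 1} and {2, 3} form two groups.
similarity = [[1, 1, 0, 0],
              [1, 1, 0, 0],
              [0, 0, 1, 1],
              [0, 0, 1, 1]]
start = [[0.6, 0.4], [0.6, 0.4], [0.4, 0.6], [0.4, 0.6]]   # slightly biased start

result = iclust_pairwise(similarity, start, temperature=0.2)
labels = [row.index(max(row)) for row in result]
print(labels)  # [0, 0, 1, 1]
```

A perfectly uniform starting assignment is a fixed point of this update, so the sketch starts from a slightly biased one; in practice a random initialization would play the same role.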
Iclust – clusters examples
Clusters of genes
C18: RPS10A, RPS10B, RPS11A, RPS11B, RPS12, … (proteins of the small ribosomal subunit)
C15: FRS1, KRS1, SES1, TYS1, VAS1, … (enzymes that attach amino acids to tRNA)
C4: PGM2, UGP1, TSL1, TPS1, TPS2, … (enzymes involved in the trehalose anabolism pathway)
Clusters of stocks
Data: Dynamics of stock prices
Given the price of stock-A, how much information do we have about the price of stock-B (when averaging over many days)?
C17: Wal-Mart, Target, Home Depot, Best Buy, Staples, …
C12: Microsoft, Apple Comp., Dell, HP, Motorola, …
C2: NY Times, Tribune Co., Meredith Corp., Dow Jones & Co., Knight-Ridder Inc., …
Clusters of movies
Data: Rating by viewers
Given the rating of movie-A, how much information do we have about the rating of movie-B (when averaging over many viewers)?
C12: Snow White, Cinderella, Dumbo, Pinocchio, Aladdin, …
C1: Psycho, Apocalypse Now, The Godfather, Taxi Driver, Pulp Fiction, …
C7: Star Wars, Return of the Jedi, The Terminator, Alien, Apollo 13, …
Coherence results – comparison to alternative algorithms
[Figure: bar chart of coherence (vertical axis, 0 to 100) for K-means, K-medians, and Hierarchical clustering on three data sets: ESR, S&P 500, and EachMovie]
Quick Summary
Information as the core measure of data analysis, with many appealing features
Iclust - a novel information-theoretic formulation of clustering, with some intriguing relations with classical rate distortion clustering
Validations: finding coherent gene clusters based on information relations in gene-expression data
… and finding coherent stock clusters, coherent movie clusters …
… and genotype-phenotype association in bacteria, based on phylogenetic data - Slonim, Elemento & Tavazoie (2005), Mol. Systems Biol., in press.
… and more?