Phase Transitions in the Information Distortion NIPS 2003 workshop on Information Theory and Learning: The Bottleneck and Distortion Approach December 13, 2003 Albert E. Parker Department of Mathematical Sciences Center for Computational Biology Montana State University Collaborators: Tomas Gedeon, Alex Dimitrov, John Miller, and Zane Aldworth


Page 1

Phase Transitions in the Information Distortion

NIPS 2003 workshop on Information Theory and Learning:

The Bottleneck and Distortion Approach

December 13, 2003

Albert E. Parker

Department of Mathematical Sciences Center for Computational Biology

Montana State University

Collaborators: Tomas Gedeon, Alex Dimitrov, John Miller, and Zane Aldworth

Page 2

The Goal: To determine the phase transitions, or the bifurcation structure, of solutions to clustering problems of the form

$$\max_{q \in \Delta} G(q) \quad \text{constrained by} \quad D(q) \ge I_0$$

where $\Delta$ is the set of valid conditional probabilities in $\mathbb{R}^{NK}$.

• G and D are sufficiently smooth in $\Delta$.
• G and D have symmetry: they are invariant to relabelling of the classes of Z.
• The Hessians $\nabla_q^2 G$ and $\nabla_q^2 D$ are block diagonal.

[Diagram: the clustering q(Z|X) maps X (K objects) to Z (N clusters).]

Page 3

A similar formulation: Using the method of Lagrange multipliers, the goal of determining the bifurcation structure of solutions of the optimization problem can be rephrased as finding the bifurcation structure of stationary points of the problem

$$\max_{q \in \Delta} \big(G(q) + \beta D(q)\big)$$

where $\beta \in [0,\infty)$ and $\Delta$ is the set of valid conditional probabilities in $\mathbb{R}^{NK}$.

• G and D are sufficiently smooth in $\Delta$.
• G and D have symmetry: they are invariant to relabelling of the classes of Z.
• The Hessian $\nabla_q^2 (G + \beta D)$ is block diagonal, and satisfies a set of regularity conditions at bifurcation (e.g. the kernel of each block is one dimensional).

[Diagram: the clustering q(Z|X) maps X (K objects) to Z (N clusters).]

Page 4

How: Use the Symmetries

By capitalizing on the symmetries of the cost functions, we have described the bifurcation structure of stationary points to problems of the form

$$\max_{q \in \Delta} G(q) \quad \text{constrained by} \quad D(q) \ge I_0$$

or

$$\max_{q \in \Delta} \big(G(q) + \beta D(q)\big)$$

where $\beta \in [0,\infty)$ and $\Delta$ is the set of valid conditional probabilities in $\mathbb{R}^{NK}$.

• G and D are sufficiently smooth in $\Delta$.
• G and D have symmetry: they are invariant to relabelling of the classes of Z.
• The Hessian $\nabla_q^2 (G + \beta D)$ is block diagonal, and satisfies a set of regularity conditions at bifurcation (e.g. the kernel is one dimensional).

Page 5

Examples: optimizing at a distortion level $D(X,Z) \le D_0$

• Rate Distortion Theory (Shannon, 1950s): Minimal Informative Compression

$$\min_{q \in \Delta} I(X;Z) \quad \text{constrained by} \quad D(X,Z) \le D_0$$

• Deterministic Annealing (Rose, 1990s): A Clustering Algorithm

$$\max_{q \in \Delta} H(Z|X) \quad \text{constrained by} \quad D(X,Z) \le D_0$$

Page 6

Examples: optimizing at a distortion level $D(X,Z) \le D_0$

• Rate Distortion Theory (Shannon, 1950s): Minimal Informative Compression

$$\max_{q \in \Delta} -I(X;Z) \quad \text{constrained by} \quad D(X,Z) \le D_0$$

• Deterministic Annealing (Rose 1998): A Clustering Algorithm

$$\max_{q \in \Delta} H(Z|X) \quad \text{constrained by} \quad D(X,Z) \le D_0$$

The two cost functions are related by

$$I(X;Z) = H(Z) - H(Z|X)$$
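The identity I(X;Z) = H(Z) − H(Z|X) can be checked numerically. The following is a minimal sketch, assuming a randomly generated marginal p(x) and clustering q(z|x); the array names and sizes are illustrative and do not come from the talk.

```python
import numpy as np

# Hypothetical data: K = 6 objects, N = 3 classes (sizes chosen for illustration).
rng = np.random.default_rng(0)
p_x = rng.random(6)
p_x /= p_x.sum()                      # marginal p(x)
q = rng.random((3, 6))
q /= q.sum(axis=0)                    # q[z, x] = q(z|x); each column sums to 1

p_z = q @ p_x                         # p(z) = sum_x q(z|x) p(x)
h_z = -np.sum(p_z * np.log2(p_z))     # H(Z) in bits
h_z_given_x = -np.sum(p_x * np.sum(q * np.log2(q), axis=0))   # H(Z|X) in bits

p_zx = q * p_x                        # joint p(z, x) = q(z|x) p(x)
i_xz = np.sum(p_zx * np.log2(p_zx / np.outer(p_z, p_x)))      # I(X;Z) in bits

print(i_xz, h_z - h_z_given_x)        # the two numbers should agree
```

Since H(Z|X) enters with a minus sign, maximizing −I(X;Z) and maximizing H(Z|X) differ only by the H(Z) term.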

Page 7

Inputs and Outputs and Clustered Outputs

[Diagram: Y (inputs, L objects {yi}) and X (outputs, K objects {xi}) are related by p(X,Y); the clustering q(Z|X) maps X to Z (clustered outputs, N objects {zi}).]



Page 11

Two methods which use an information distortion function to cluster

• Information Bottleneck Method (Tishby, Pereira, Bialek 1999)

$$\min_{q \in \Delta} I(X;Z) \quad \text{constrained by} \quad D_I(X,Z) \le D_0$$

$$\max_{q \in \Delta} \big({-I(X;Z)} + \beta I(Y;Z)\big)$$

The Hessian is always singular ($-I(X;Z)$ is not strictly concave). The theory which follows does not apply.

• Information Distortion Method (Dimitrov and Miller 2001)

$$\max_{q \in \Delta} H(Z|X) \quad \text{constrained by} \quad D_I(X,Z) \le D_0$$

$$\max_{q \in \Delta} \big(H(Z|X) + \beta I(Y;Z)\big)$$

$H(Z|X)$ is strictly concave. The theory which follows does apply.
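A short worked calculation suggests why the two cost functions behave differently. It is a sketch under stated assumptions: natural logarithms, an interior point of the constraint set (every q(z|x) > 0 and p(x) > 0), and index conventions chosen here for illustration rather than taken from the slides.

```latex
\begin{align*}
H(Z|X) &= -\sum_{x,z} p(x)\, q(z|x) \ln q(z|x), \\
\frac{\partial^2 H(Z|X)}{\partial q(z|x)\,\partial q(z'|x')}
  &= -\frac{p(x)}{q(z|x)}\,\delta_{zz'}\,\delta_{xx'} , \\[4pt]
I(X;Z) &= \sum_{x,z} p(x)\, q(z|x) \ln \frac{q(z|x)}{p(z)},
  \qquad p(z) = \sum_{x} q(z|x)\, p(x), \\
\frac{\partial^2 I(X;Z)}{\partial q(z|x)\,\partial q(z'|x')}
  &= \delta_{zz'} \left( \frac{p(x)}{q(z|x)}\,\delta_{xx'} - \frac{p(x)\,p(x')}{p(z)} \right), \\
\sum_{x'} \left( \frac{p(x)}{q(z|x)}\,\delta_{xx'} - \frac{p(x)\,p(x')}{p(z)} \right) q(z|x')
  &= p(x) - p(x)\,\frac{\sum_{x'} p(x')\, q(z|x')}{p(z)} = 0 .
\end{align*}
```

The Hessian of H(Z|X) is diagonal with strictly negative entries, so H(Z|X) is strictly concave. Each class block of the Hessian of I(X;Z) annihilates the vector with entries q(z|x'), so that Hessian is singular at every q. The same vector is annihilated by the corresponding block of the Hessian of I(Y;Z), which is consistent with the Information Bottleneck Hessian being singular for every β.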

Page 12

A basic annealing algorithm to solve

$$\max_{q \in \Delta} \big(G(q) + \beta D(q)\big)$$

Let $q_0$ be the maximizer of $\max_{q \in \Delta} G(q)$, and let $\beta_0 = 0$. For $k \ge 0$, let $(q_k, \beta_k)$ be a solution to $\max_{q \in \Delta} (G(q) + \beta_k D(q))$. Iterate the following steps until $\beta_K = \beta_{\max}$ for some K.

1. Perform β-step: Let $\beta_{k+1} = \beta_k + d_k$ where $d_k > 0$.

2. The initial guess for $q_{k+1}$ at $\beta_{k+1}$ is $q_{k+1}^{(0)} = q_k + \eta$ for some small perturbation $\eta$.

3. Optimization: solve $\max_{q \in \Delta} (G(q) + \beta_{k+1} D(q))$ to get the maximizer $q_{k+1}$, using initial guess $q_{k+1}^{(0)}$.
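The three steps above can be sketched for the Information Distortion cost H(Z|X) + βI(Y;Z). This is an illustration under stated assumptions, not the implementation used in the talk: the optimization in step 3 is replaced by a simple self-consistent (fixed-point) iteration, the joint distribution is a hypothetical two-blob toy problem, and all function names, step sizes, and iteration counts are choices made here.

```python
import numpy as np


def two_blob_joint(k=8, strength=0.9):
    """Toy joint p(x, y) with two diagonal blocks (hypothetical test data)."""
    half = k // 2
    p = np.full((k, k), (1.0 - strength) / half)
    p[:half, :half] = strength / half
    p[half:, half:] = strength / half
    return p / p.sum()


def info_yz_bits(p_xy, q):
    """I(Y;Z) in bits, with p(z, y) = sum_x q(z|x) p(x, y) and q[z, x] = q(z|x)."""
    p_zy = q @ p_xy
    p_z = p_zy.sum(axis=1, keepdims=True)
    p_y = p_zy.sum(axis=0, keepdims=True)
    return float(np.sum(p_zy * np.log2(p_zy / (p_z * p_y))))


def fixed_point_step(p_xy, q, beta):
    """One update q(z|x) proportional to exp(beta * sum_y p(y|x) ln(p(y|z)/p(y)))."""
    p_x = p_xy.sum(axis=1)
    p_y = p_xy.sum(axis=0)
    p_y_given_x = p_xy / p_x[:, None]
    p_zy = q @ p_xy
    p_y_given_z = p_zy / p_zy.sum(axis=1, keepdims=True)
    logits = beta * (np.log(p_y_given_z / p_y[None, :]) @ p_y_given_x.T)
    logits -= logits.max(axis=0, keepdims=True)      # numerical stability only
    q_new = np.exp(logits)
    return q_new / q_new.sum(axis=0, keepdims=True)


def anneal(p_xy, n_classes, beta_max, d_beta=0.25, inner_iters=300, eps=1e-3, seed=0):
    """Basic annealing: beta-step, perturb the previous solution, re-optimize."""
    rng = np.random.default_rng(seed)
    k = p_xy.shape[0]
    q = np.full((n_classes, k), 1.0 / n_classes)     # q_0 maximizes H(Z|X)
    beta = 0.0
    path = [(beta, q.copy())]
    while beta < beta_max:
        beta = min(beta + d_beta, beta_max)          # step 1: beta-step
        guess = q + eps * rng.uniform(-1.0, 1.0, q.shape)   # step 2: perturb
        guess = np.clip(guess, 1e-12, None)
        q = guess / guess.sum(axis=0, keepdims=True)
        for _ in range(inner_iters):                 # step 3: re-optimize
            q = fixed_point_step(p_xy, q, beta)
        path.append((beta, q.copy()))
    return path


p_xy = two_blob_joint()
path = anneal(p_xy, n_classes=2, beta_max=4.0)
for beta, q in path[::4]:
    print(f"beta = {beta:4.2f}   I(Y;Z) = {info_yz_bits(p_xy, q):.4f} bits")
```

The perturbation in step 2 matters: without it the iteration started at the uniform solution stays there for every β, because the uniform solution remains a stationary point after it stops being a maximizer.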

Page 13

Application of the annealing method to the Information Distortion problem

$$\max_{q \in \Delta} \big(H(Z|X) + \beta I(Y;Z)\big)$$

when p(X,Y) is defined by four Gaussian blobs

[Figure: the joint distribution p(X,Y); axes Y (inputs, L = 52) and X (outputs, K = 52).]

[Figure: the clustering q(Z|X); axes X (outputs, K = 52) and Z (clustered outputs, N = 4).]

Page 14

Evolution of the optimal clustering: Observed Bifurcations for the Four Blob problem:

We just saw the optimal clusterings $q^*$ at some $\beta^* = \beta_{\max}$. What do the clusterings look like for $\beta < \beta_{\max}$?

[Figure: axis label I(Y;Z) in bits.]

Page 15

??????

Why are there only 3 bifurcations observed? In general, are there only N-1 bifurcations?

What kinds of bifurcations do we expect: pitchfork-like, transcritical, saddle-node, or some other type?

How many bifurcating branches are there?

What do the bifurcating branches look like? Are they 1st order phase transitions (subcritical) or 2nd order phase transitions (supercritical)?

What is the stability of the bifurcating branches? Is there always a bifurcating branch which contains solutions of the optimization problem?

Are there bifurcations after all of the classes have resolved?

[Figures: "Conceptual Bifurcation Structure" (axis label q*) and "Observed Bifurcations for the 4 Blob Problem" (axis label I(Y;Z) in bits).]

Page 16

Recall the Symmetries:

To better understand the bifurcation structure, we capitalize on the symmetries of the function $G(q) + \beta D(q)$

[Diagram: a clustering q(Z|X) from X (K objects {xi}) to Z (N objects {zi}), with class 1 and class 3 labelled.]

Page 17

Recall the Symmetries:

To better understand the bifurcation structure, we capitalize on the symmetries of the function $G(q) + \beta D(q)$

[Diagram: the same clustering q(Z|X) from X (K objects {xi}) to Z (N objects {zi}), with the labels class 1 and class 3 exchanged.]

Page 18

The symmetry group of all permutations on N symbols is $S_N$.
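The invariance under relabelling can be checked directly for the Information Distortion cost. The sketch below assumes a random joint distribution and a random clustering (both hypothetical) and evaluates H(Z|X) + βI(Y;Z) for all N! relabellings of the classes; the function name `cost` and the value β = 2 are illustrative choices.

```python
import itertools

import numpy as np


def cost(p_xy, q, beta):
    """F(q) = H(Z|X) + beta * I(Y;Z) in nats, with q[z, x] = q(z|x)."""
    p_x = p_xy.sum(axis=1)
    h_z_given_x = -np.sum(p_x * np.sum(q * np.log(q), axis=0))
    p_zy = q @ p_xy                                  # p(z, y)
    p_z = p_zy.sum(axis=1, keepdims=True)
    p_y = p_zy.sum(axis=0, keepdims=True)
    i_yz = np.sum(p_zy * np.log(p_zy / (p_z * p_y)))
    return h_z_given_x + beta * i_yz


rng = np.random.default_rng(0)
p_xy = rng.random((5, 6))
p_xy /= p_xy.sum()                                   # hypothetical p(x, y)
q = rng.random((3, 5))
q /= q.sum(axis=0, keepdims=True)                    # hypothetical q(z|x), N = 3

base = cost(p_xy, q, beta=2.0)
# Relabelling the classes of Z permutes the rows of q.
values = [cost(p_xy, q[list(perm), :], beta=2.0)
          for perm in itertools.permutations(range(3))]
print(base, values)
```

Every element of S_3 leaves the cost unchanged, which is the symmetry the bifurcation analysis relies on.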

Page 19

[Figure: subgroup lattice with levels S4, S3, S2, and the trivial group 1.]

A partial subgroup lattice for $S_N$ when N = 4.

Page 20

[Figure: S4 above the three subgroups <(12),(34)>, <(13),(24)>, and <(14),(23)>.]

A partial lattice of the maximal subgroups $S_2 \times S_2$ of $S_4$

Page 21

This Group Structure determines the Bifurcation Structure

Page 22

Define a Gradient Flow

Goal: To determine the bifurcation structure of stationary points of

$$\max_{q \in \Delta} \big(G(q) + \beta D(q)\big)$$

Method: Study the equilibria of the flow

$$\begin{pmatrix} \dot{q} \\ \dot{\lambda} \end{pmatrix} = \nabla_{q,\lambda} \mathcal{L}(q,\lambda,\beta), \qquad \mathcal{L}(q,\lambda,\beta) := G(q) + \beta D(q) + \sum_{x \in X} \lambda_x \Big( \sum_{z} q(z|x) - 1 \Big)$$

• Equilibria of this system (in $\mathbb{R}^{NK+K}$) are possible solutions of the optimization problem.

• The Jacobian $\nabla_{q,\lambda}^2 \mathcal{L}(q^*,\lambda^*,\beta^*)$ is symmetric, so its eigenvalues are real and only bifurcations of equilibria can occur.

• The first equilibrium is $q^*(\beta_0 = 0) \equiv 1/N$.

• If $w^T \nabla_q^2 F(q^*,\beta)\, w < 0$ for every $w \in \ker J$, where $F = G + \beta D$ and J is the Jacobian of the constraints, then $q^*(\beta)$ is a maximizer of the problem above.
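Setting the right-hand side of the flow to zero gives the equilibrium conditions explicitly. The following is a sketch, assuming the Lagrangian with one multiplier per object x for the normalization constraint, natural logarithms, and the Information Distortion choice G = H(Z|X), D = I(Y;Z).

```latex
\begin{align*}
\mathcal{L}(q,\lambda,\beta) &= G(q) + \beta D(q)
   + \sum_{x} \lambda_x \Big( \sum_{z} q(z|x) - 1 \Big), \\
0 = \frac{\partial \mathcal{L}}{\partial q(z|x)}
  &= \frac{\partial G}{\partial q(z|x)} + \beta \frac{\partial D}{\partial q(z|x)} + \lambda_x ,
  \qquad
0 = \frac{\partial \mathcal{L}}{\partial \lambda_x} = \sum_{z} q(z|x) - 1 . \\[4pt]
\text{For } G = H(Z|X),\ D = I(Y;Z):\quad
0 &= -p(x)\big(\ln q(z|x) + 1\big)
   + \beta \sum_{y} p(x,y) \ln \frac{p(y|z)}{p(y)} + \lambda_x , \\
q(z|x) &\propto \exp\!\Big( \beta \sum_{y} p(y|x) \ln \frac{p(y|z)}{p(y)} \Big).
\end{align*}
```

The last line is an implicit equation, because p(y|z) itself depends on q. At β = 0 the exponent vanishes and the uniform clustering q = 1/N is recovered.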


Page 25

Symmetry Breaking Bifurcations

• $q_{1/N}$ is fixed by $S_N = S_4$.
• $q^*$ is fixed by $S_{N-1} = S_3$.
• $q^*$ is fixed by $S_{N-2} = S_2$.

[Figure: bifurcation diagram (axis label q*) beside the partial subgroup lattice of S4 with levels S4, S3, S2, and 1.]


Page 28

Symmetry Breaking Bifurcations

• $q^*$ is fixed by $S_2 \times S_2 = \langle (12),(34) \rangle$.

[Figure: bifurcation diagram (axis label q*) beside the lattice of S4 above <(12),(34)>, <(13),(24)>, and <(14),(23)>.]

Page 29

Existence Theorems for Bifurcating Branches

Given a bifurcation at a point $q^*$ fixed by $S_N$:

• Equivariant Branching Lemma (Vanderbauwhede and Cicogna 1980-1): There are N bifurcating branches, each of which has symmetry $S_{N-1}$.

• The Smoller-Wasserman Theorem (Smoller and Wasserman 1985-6): There are $N!/(2\,m!\,n!)$ bifurcating branches which have symmetry $S_m \times S_n$ if N = m + n.
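As a worked instance of the second count, take N = 4 and m = n = 2. The sketch below evaluates the formula as stated on the slide and, separately, enumerates the unordered ways of splitting four class labels into two pairs; the function names are hypothetical.

```python
from itertools import combinations
from math import factorial


def branch_count(m, n):
    """N!/(2 m! n!) with N = m + n, as stated on the slide."""
    return factorial(m + n) // (2 * factorial(m) * factorial(n))


def balanced_splits(n_classes):
    """Unordered splits of the labels 1..N into two halves (assumes N is even)."""
    labels = tuple(range(1, n_classes + 1))
    half = n_classes // 2
    splits = set()
    for left in combinations(labels, half):
        right = tuple(label for label in labels if label not in left)
        splits.add(frozenset((left, right)))
    return sorted(tuple(sorted(split)) for split in splits)


print(branch_count(2, 2))      # 4!/(2 * 2! * 2!) = 3
print(balanced_splits(4))      # the three ways to pair up the labels 1..4
```

The three splits {1,2}|{3,4}, {1,3}|{2,4}, and {1,4}|{2,3} correspond to the three subgroups <(12),(34)>, <(13),(24)>, and <(14),(23)> of S4 that appear in the lattice.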

Page 30

Existence Theorems for Bifurcating Branches

Given a bifurcation at a point $q^*$ fixed by $S_{N-1}$:

• Equivariant Branching Lemma (Vanderbauwhede and Cicogna 1980-1): There are N-1 bifurcating branches, each of which has symmetry $S_{N-2}$.

• The Smoller-Wasserman Theorem (Smoller and Wasserman 1985-6): There are $(N-1)!/(2\,m!\,n!)$ bifurcating branches which have symmetry $S_m \times S_n$ if N-1 = m + n.


Page 32

[Figure: "Group Structure" (partial subgroup lattice of S4 with levels S4, S3, S2, and 1) beside "Observed Bifurcation Structure" (axis label q*).]

The Equivariant Branching Lemma shows that the bifurcation structure contains the branches …

Page 33

[Figure: "Group Structure" (S4 above <(12),(34)>, <(13),(24)>, and <(14),(23)>) beside "Observed Bifurcation Structure" (axis label q*).]

The subgroups $\{S_2 \times S_2\}$ give additional structure …


Page 35

Theorem: There are exactly K bifurcations on the branch $(q_{1/N}, \beta)$ whenever $\nabla^2 G(q_{1/N})$ is nonsingular.

There are K = 52 bifurcations on the first branch.

[Figure: "Observed Bifurcation Structure" (axis label q*).]

Page 36

[Figure: partial subgroup lattice of S4 with levels S4, S3, S2, and 1; each subgroup is annotated with its bifurcating direction, a block vector whose entries are 0 or multiples of a vector v.]

A partial subgroup lattice for $S_4$ and the corresponding bifurcating directions given by the Equivariant Branching Lemma

Page 37

[Figure: S4 above the subgroups <(12),(34)>, <(13),(24)>, and <(14),(23)>; each subgroup is annotated with its bifurcating direction, a block vector built from a vector v.]

A partial subgroup lattice for $S_4$ and the bifurcating directions corresponding to subgroups isomorphic to $S_2 \times S_2$.

Page 38

This theory enables us to answer the questions previously posed …


Page 40

Why are there only 3 bifurcations observed? In general, are there only N-1 bifurcations? There are N-1 symmetry breaking bifurcations from $S_M$ to $S_{M-1}$ for $M \le N$.

What kinds of bifurcations do we expect: pitchfork-like, transcritical, saddle-node, or some other type?

How many bifurcating solutions are there? There are at least N from the first bifurcation, at least N-1 from the next one, etc.

What do the bifurcating branches look like? They are subcritical or supercritical depending on the sign of the bifurcation discriminator $\zeta(q^*, \beta^*, u_k)$.

What is the stability of the bifurcating branches? Is there always a bifurcating branch which contains solutions of the optimization problem? No.

Are there bifurcations after all of the classes have resolved? Generically, no.

[Figure: "Conceptual Bifurcation Structure" (axis label q*) beside the partial subgroup lattice of S4 with levels S4, S3, S2, and 1.]

Page 41

Continuation techniques numerically illustrate the theory using the Information Distortion

Page 42

[Figure: bifurcation diagram; axis labels q* and I(Y;Z) in bits.]

Page 43

Bifurcating branches with symmetry $S_2 \times S_2 = \langle (12),(34) \rangle$

[Figure: bifurcation diagram; axis labels q* and I(Y;Z) in bits.]

Page 44

Additional structure!!

[Figure: axis label I(Y;Z) in bits.]

Page 45

[Figures: two panels, each with axis label I(Y;Z) in bits.]

Page 46

A closer look …

[Figure: bifurcation diagram; axis labels q* and I(Y;Z) in bits.]

Page 47

Bifurcation from $S_4$ to $S_3$ …

[Figure: bifurcation diagram; axis labels q* and I(Y;Z) in bits.]

Page 48

The bifurcation from $S_4$ to $S_3$ is subcritical …

(the theory predicted this since the bifurcation discriminator $\zeta(q_{1/4}, \beta^*, u) < 0$)

[Figure: axis label I(Y;Z) in bits.]

Page 49

What does this mean regarding solutions of the original problems?

(4) $R_H(I_0) = \max_{q \in \Delta} H(Z|X)$ constrained by $I(Y;Z) \ge I_0$

(7) $\max_{q \in \Delta} \big(H(Z|X) + \beta I(Y;Z)\big)$

[Figure: axis label I(Y;Z) in bits.]

Page 50

Theorem:
• $dR_H/dI_0 = -\beta(I_0)$
• $d^2R_H/dI_0^2 = -d\beta(I_0)/dI_0$

(4) $R_H(I_0) = \max_{q \in \Delta} H(Z|X)$ constrained by $I(Y;Z) \ge I_0$

(7) $\max_{q \in \Delta} \big(H(Z|X) + \beta I(Y;Z)\big)$
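The theorem can be read as an envelope-theorem statement. The following is a sketch under stated assumptions: the constraint is active at the optimum, and the maximizer and its multiplier vary differentiably with I_0 along a solution branch. The notation q(I_0) for the maximizer is introduced here for illustration.

```latex
\begin{align*}
\mathcal{L}(q,\beta;I_0) &= H(Z|X) + \beta\,\big(I(Y;Z) - I_0\big), \\
R_H(I_0) &= \mathcal{L}\big(q(I_0),\beta(I_0);I_0\big)
  \quad \text{since } I(Y;Z) = I_0 \text{ at the optimum}, \\
\frac{dR_H}{dI_0} &= \frac{\partial \mathcal{L}}{\partial I_0}\bigg|_{q(I_0),\,\beta(I_0)}
  = -\beta(I_0), \\
\frac{d^2R_H}{dI_0^2} &= -\frac{d\beta(I_0)}{dI_0}.
\end{align*}
```

Under these assumptions, R_H is locally concave where β increases with I_0 along a branch and locally convex where β decreases with I_0, which is how the direction of a branch at a bifurcation shows up in the shape of R_H.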

Page 51

Theorem:
• $dR_H/dI_0 = -\beta(I_0)$
• $d^2R_H/dI_0^2 = -d\beta(I_0)/dI_0$

$R_H$ as a function of $I_0$

$R_H(I_0) = \max_{q \in \Delta} H(Z|X)$ constrained by $I(Y;Z) \ge I_0$

• is not convex and not concave
• is a monotonically decreasing, continuous function

[Figure: plot of R_H.]

Page 52

Consequences??

• Analogue for the Information Distortion
$R_H(I_0) = \max_{q \in \Delta} H(Z|X)$ constrained by $I(Y;Z) \ge I_0$
is neither concave nor convex, since subcritical bifurcations and saddle nodes exist.

• Rate Distortion Function (from Information Theory)
$R(D_0) = \min_{q \in \Delta} I(X;Z)$ constrained by $D(X,Z) \le D_0$
is convex if $D(X,Z)$ is linear in q (Rose, 1994; Cover and Thomas; Gray).

• Relevance Compression Function (for Information Bottleneck)
$R_I(I_0) = \min_{q \in \Delta} I(X;Z)$ constrained by $I(Y;Z) \ge I_0$
is convex if N > K+1 (Witsenhausen and Wyner 1975, Bachrach et al 2003).

Page 53

So What??

Analogue for the Information Distortion
$R_H(I_0) = \max_{q \in \Delta} H(Z|X)$ constrained by $I(Y;Z) \ge I_0$
is neither concave nor convex, since subcritical bifurcations and saddle nodes exist.

Relevance Compression Function (for Information Bottleneck)
$R_I(I_0) = \min_{q \in \Delta} I(X;Z)$ constrained by $I(Y;Z) \ge I_0$
is convex if N > K+1 (Bachrach et al 2003).

• $R_I(I_0)$ and $R_H(I_0)$ are related by $I(X;Z) = H(Z) - H(Z|X)$.

• The Information Bottleneck cannot have a subcritical bifurcation when N > K+1. Are there subcritical bifurcations when N < K+1?

• Is $R_H(I_0)$ convex when N > K+1? That would mean that the subcritical bifurcations go away when considering the gradient flow in $\mathbb{R}^{(K+2)K}$ instead of $\mathbb{R}^{NK}$.

Page 55

Application to cricket sensory data

[Figure: the optimal clustering of spike patterns, shown with E(Y|Z), the stimulus means conditioned on each of the classes.]

Page 56

Conclusions …

We have a complete theoretical picture of how the clusterings evolve for a class of annealing problems of the form

$$\max_{q \in \Delta} \big(G(q) + \beta D(q)\big)$$

subject to the assumptions stated earlier.

o When clustering to N classes, there are N-1 bifurcations.
o In general, there are only pitchfork and saddle-node bifurcations.
o We can determine whether pitchfork bifurcations are subcritical or supercritical (1st or 2nd order phase transitions).
o We know the explicit bifurcating directions.

SO WHAT??

o There are theoretical consequences …
o This suggests an algorithm for solving the annealing problem … (NIPS 2002)