Information Bottleneck


Information Bottleneck

presented by

Boris Epshtein & Lena GorelickAdvanced Topics in Computer and Human Vision

Spring 2004

Agenda
• Motivation
• Information Theory - Basic Definitions
• Rate Distortion Theory
  – Blahut-Arimoto algorithm
• Information Bottleneck Principle
• IB algorithms
  – iIB
  – dIB
  – aIB
• Application

Motivation

Clustering Problem


• “Hard” Clustering – partitioning of the input data into several exhaustive and mutually exclusive clusters

• Each cluster is represented by a centroid


• “Good” clustering – should group similar data points together and keep dissimilar points apart

• Quality of partition – average distortion between the data points and corresponding representatives (cluster centroids)

• “Soft” Clustering – each data point is assigned to all clusters with some normalized probability

• Goal – minimize expected distortion between the data points and cluster centroids

Complexity-Precision Trade-off

• Too simple model → poor precision
• Higher precision requires a more complex model
• Too complex model → overfitting

• Too complex model:
  – can lead to overfitting → poor generalization
  – is hard to learn
• Too simple model:
  – cannot capture the real structure of the data
• Examples of approaches:
  – SRM: Structural Risk Minimization
  – MDL: Minimum Description Length
  – Rate Distortion Theory


Entropy

• The measure of uncertainty about the random variable X:

  H(X) = -\sum_x p(x) \log_2 p(x)

Entropy - Example

– Fair Coin (p = 0.5): H(X) = -(0.5 \log_2 0.5 + 0.5 \log_2 0.5) = 1 bit

– Unfair Coin (e.g., p = 0.9): H(X) = -(0.9 \log_2 0.9 + 0.1 \log_2 0.1) \approx 0.47 bits

Entropy - Illustration

[Figure: H(X) for a binary variable as a function of p; highest at p = 0.5, lowest (zero) at p = 0 and p = 1.]
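A quick numeric check of these values; a minimal sketch, with the coin probabilities below chosen for illustration:

    import numpy as np

    def entropy(p):
        """Shannon entropy H(X) = -sum_x p(x) log2 p(x), in bits."""
        p = np.asarray(p, dtype=float)
        p = p[p > 0]                    # convention: 0 * log 0 = 0
        return -np.sum(p * np.log2(p))

    print(entropy([0.5, 0.5]))   # fair coin      -> 1.0 bit (highest)
    print(entropy([0.9, 0.1]))   # unfair coin    -> ~0.469 bits
    print(entropy([1.0, 0.0]))   # deterministic  -> 0.0 bits (lowest)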

Conditional Entropy

• The measure of uncertainty about the random variable X given the value of the variable Y:

  H(X|Y) = -\sum_{x,y} p(x,y) \log_2 p(x|y)


Mutual Information

• The reduction in uncertainty of X due to the knowledge of Y:

  I(X;Y) = H(X) - H(X|Y) = \sum_{x,y} p(x,y) \log_2 \frac{p(x,y)}{p(x)p(y)}

  – Nonnegative: I(X;Y) ≥ 0
  – Symmetric: I(X;Y) = I(Y;X)
  – Convex w.r.t. p(y|x) for a fixed p(x)


Kullback-Leibler Distance

• A distance between distributions p and q over the same alphabet:

  D_{KL}[p \| q] = \sum_x p(x) \log_2 \frac{p(x)}{q(x)}

  – Nonnegative: D_{KL}[p \| q] ≥ 0, with equality iff p = q
  – Asymmetric: in general D_{KL}[p \| q] ≠ D_{KL}[q \| p]
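Both definitions in a few lines of Python; the joint distribution below is made up to show that I(X;Y) > 0 and that the KL distance is asymmetric:

    import numpy as np

    def kl(p, q):
        """D_KL[p || q] = sum_x p(x) log2(p(x)/q(x)), over the same alphabet."""
        p, q = np.asarray(p, float), np.asarray(q, float)
        m = p > 0
        return np.sum(p[m] * np.log2(p[m] / q[m]))

    def mutual_information(p_xy):
        """I(X;Y) = D_KL[p(x,y) || p(x)p(y)] for a joint matrix p_xy."""
        p_xy = np.asarray(p_xy, float)
        p_x = p_xy.sum(axis=1, keepdims=True)
        p_y = p_xy.sum(axis=0, keepdims=True)
        return kl(p_xy.ravel(), (p_x * p_y).ravel())

    p_xy = np.array([[0.4, 0.1],
                     [0.1, 0.4]])              # illustrative joint p(x,y)
    print(mutual_information(p_xy))            # ~0.278 bits > 0
    print(kl([0.5, 0.5], [0.9, 0.1]),          # ~0.737 ...
          kl([0.9, 0.1], [0.5, 0.5]))          # ... vs ~0.531: asymmetric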


Rate Distortion Theory - Introduction

• Goal: obtain compact clustering of the data with minimal expected distortion

• Distortion measure is a part of the problem setup

• The clustering and its quality depend on the choice of the distortion measure

Rate Distortion Theory

• Obtain a compact clustering of the data with minimal expected distortion, given a fixed set of representatives

[Figure: data points X mapped to representatives T; which mapping? (Cover & Thomas)]

• T = X: zero distortion, but not compact
• |T| = 1: very compact, but high distortion

Rate Distortion Theory - Intuition

• The quality of clustering is determined by:
  – Complexity, measured by the mutual information I(X;T) (a.k.a. the Rate)
  – Distortion, measured by the expected distortion E[d(X,T)]

Rate Distortion Plane

[Figure: the (E[d(X,T)], I(X;T)) plane; maximal compression lies at one extreme, minimal distortion at the other; D marks the distortion constraint.]

• Higher values of D mean a more relaxed distortion constraint, so stronger compression levels are attainable

Rate Distortion Function

• Given the distortion constraint, find the most compact model (with smallest complexity I(X;T))

• Let D be an upper bound constraint on the expected distortion E[d(X,T)]

Rate Distortion Function

• Given:
  – A set of points X with prior p(x)
  – A set of representatives T
  – A distortion measure d(x,t)
• Find:
  – The most compact soft clustering p(t|x) of the points of X that satisfies the distortion constraint D
• Rate Distortion Function:

  R(D) = \min_{p(t|x):\; E[d(X,T)] \le D} I(X;T)

Rate Distortion Function

• Introduce a Lagrange multiplier β and minimize the functional

  F[p(t|x)] = I(X;T) + \beta\, E[d(X,T)]

  (complexity term + β × distortion term)

Rate Distortion Curve

[Figure: the curve R(D) in the (E[d(X,T)], I(X;T)) plane, running from minimal distortion to maximal compression. Minimizing F, subject to the normalization of p(t|x), attains its minimum on this curve, at the point where the slope of the curve equals -β.]

Rate Distortion Function

• Minimize F over p(t|x), adding Lagrange terms for normalization (p(x) and d(x,t) are known):

  F[p(t|x)] = I(X;T) + \beta\, E[d(X,T)] + \sum_x \lambda(x) \Big[ \sum_t p(t|x) - 1 \Big]
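Setting the variation of F to zero yields the exponential form of the solution on the next slide; a sketch of this standard step, treating p(t) as fixed for the differentiation and absorbing constants into the multiplier λ(x):

  \frac{\delta F}{\delta p(t|x)}
    = p(x)\Big[\log\frac{p(t|x)}{p(t)} + 1 + \beta\, d(x,t)\Big] + \lambda(x) = 0
  \quad\Rightarrow\quad
  p(t|x) = \frac{p(t)\, e^{-\beta\, d(x,t)}}{Z(x,\beta)},
  \qquad
  Z(x,\beta) = \sum_t p(t)\, e^{-\beta\, d(x,t)}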

Solution - Analysis

• Solution:

  p(t|x) = \frac{p(t)\, e^{-\beta\, d(x,t)}}{Z(x,\beta)}, \qquad Z(x,\beta) = \sum_t p(t)\, e^{-\beta\, d(x,t)}

• The solution is implicit: p(t) = \sum_x p(x)\, p(t|x) itself depends on p(t|x)

• When x is similar to t, d(x,t) is small, so closer points are attached to t with higher probability

Solution - Analysis

• As β → 0: for every fixed pair (x,t), e^{-\beta d(x,t)} → 1, which removes the influence of the distortion; p(t|x) → p(t) does not depend on x; this + maximal compression → a single cluster

• As β → ∞: for each x, most of the conditional probability p(t|x) goes to the single t with the smallest distortion d(x,t) → hard clustering

[Figure: p(t|x) for a fixed t and a fixed x as β varies.]

• Intermediate β: soft clustering, intermediate complexity
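A tiny numeric illustration of this β behaviour; the single data point, representatives, and squared-distance distortion below are made up:

    import numpy as np

    x = 0.0                                # one data point
    t = np.array([0.1, 1.0, 2.0])          # three representatives
    p_t = np.array([1/3, 1/3, 1/3])        # marginal over clusters
    d = (x - t) ** 2                       # distortion d(x, t)

    for beta in [0.0, 1.0, 100.0]:
        w = p_t * np.exp(-beta * d)        # p(t) exp(-beta d(x,t))
        print(beta, np.round(w / w.sum(), 3))
    # beta = 0.0  -> [0.333 0.333 0.333]: p(t|x) = p(t), single effective cluster
    # beta = 1.0  -> soft assignment, intermediate complexity
    # beta = 100  -> [1. 0. 0.]: hard assignment to the closest representative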


Blahut-Arimoto Algorithm

• Input: prior p(x), distortion measure d(x,t), tradeoff parameter β
• Randomly initialize p(t)
• Iterate until convergence:

  p(t|x) = \frac{p(t)\, e^{-\beta\, d(x,t)}}{Z(x,\beta)}, \qquad p(t) = \sum_x p(x)\, p(t|x)

• Each step optimizes a convex function over a convex set → the minimum is global
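A minimal Python sketch of the iteration just described; the toy points, representatives, and β value are illustrative:

    import numpy as np

    def blahut_arimoto(p_x, d, beta, n_iter=200, seed=0):
        """Alternating minimization for rate distortion (a sketch).
        p_x: (n,) prior; d: (n, m) distortion matrix for fixed representatives.
        Returns p(t|x) and the cluster marginal p(t)."""
        rng = np.random.default_rng(seed)
        p_t = rng.dirichlet(np.ones(d.shape[1]))       # random init of p(t)
        for _ in range(n_iter):
            w = p_t * np.exp(-beta * d)                # p(t) exp(-beta d(x,t))
            p_tx = w / w.sum(axis=1, keepdims=True)    # normalize over t
            p_t = p_x @ p_tx                           # p(t) = sum_x p(x) p(t|x)
        return p_tx, p_t

    # toy run: 5 points on a line, 2 representatives at 0 and 1
    x = np.linspace(0, 1, 5)
    t = np.array([0.0, 1.0])
    d = (x[:, None] - t[None, :]) ** 2
    p_tx, p_t = blahut_arimoto(np.full(5, 0.2), d, beta=20.0)
    print(np.round(p_tx, 3))    # near-hard assignments at this large beta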

Blahut-Arimoto Algorithm

Advantages:
• Obtains a compact clustering of the data with minimal expected distortion
• Optimal clustering given a fixed set of representatives

Blahut-Arimoto Algorithm

Drawbacks:
• The distortion measure is a part of the problem setup
  – Hard to obtain for some problems
  – Equivalent to determining the relevant features
• Fixed set of representatives
• Slow convergence

Rate Distortion Theory – Additional Insights

– Another problem would be to find the optimal representatives given the clustering
– Joint optimization of the clustering and the representatives does not have a unique solution (as in EM or K-means)


Information Bottleneck

• Copes with the drawbacks of the Rate Distortion approach

• Compress the data while preserving “important” (relevant) information

• It is often easier to define what information is important than to define a distortion measure.

• Replace the distortion upper bound constraint by a lower bound constraint over the relevant information

Tishby, Pereira & Bialek, 1999

Information Bottleneck - Example

• Given: documents (words) and topics, with their joint prior

• Obtain: a partitioning of the words that preserves the information about the topics

[Diagram: words → clusters → topics, with the quantities I(Word;Cluster), I(Word;Topic), and I(Cluster;Topic).]

Information Bottleneck - Example

• Extreme case 1: I(Word;Cluster) = 0 and I(Cluster;Topic) = 0 → very compact, but not informative

Information Bottleneck - Example

• Extreme case 2: I(Word;Cluster) = max and I(Cluster;Topic) = max → very informative, but not compact

• Goal: minimize I(Word;Cluster) and maximize I(Cluster;Topic)

Information Bottleneck

[Diagram: the bottleneck — words are compressed into clusters (compactness, I(Word;Cluster)) that preserve the relevant information about topics (I(Cluster;Topic)).]

Relevance Compression Curve

[Figure: the (I(T;Y), I(X;T)) plane, between maximal compression and maximal relevant information; D marks the relevance constraint.]

• Let D be the minimal allowed value of the relevant information I(T;Y)
• Smaller D means a more relaxed relevant-information constraint, so stronger compression levels are attainable

Relevance Compression Function

• Given the relevant-information constraint D, find the most compact model (with smallest I(X;T)):

  R(D) = \min_{p(t|x):\; I(T;Y) \ge D} I(X;T)

• Introduce a Lagrange multiplier β and minimize the functional

  L[p(t|x)] = I(X;T) - \beta\, I(T;Y)

  (compression term − β × relevance term)

Relevance Compression Curve

[Figure: the relevance-compression curve between maximal compression and maximal relevant information. Minimizing L, subject to the normalization of p(t|x), attains its minimum on this curve, at the point where the slope of the curve equals β.]

Relevance Compression Function

• Minimize L over p(t|x), adding Lagrange terms for normalization:

  L[p(t|x)] = I(X;T) - \beta\, I(T;Y) + \sum_x \lambda(x) \Big[ \sum_t p(t|x) - 1 \Big]

Solution - Analysis

• Solution:

  p(t|x) = \frac{p(t)\, e^{-\beta\, D_{KL}[p(y|x)\,\|\,p(y|t)]}}{Z(x,\beta)}

• The solution is implicit: p(t) and p(y|t) are themselves determined by p(t|x); the joint p(x,y) is known

Solution - Analysis

• The KL distance emerges as the effective distortion measure from the IB principle
• The optimization is also over the cluster representatives p(y|t)
• When p(y|x) is similar to p(y|t), the KL distance is small → such points x are attached to t with higher probability
• As β → 0: e^{-\beta\, KL} → 1, which removes the influence of the KL term; p(t|x) → p(t) does not depend on x; this + maximal compression → a single cluster

Solution - Analysis

• As β → ∞: for each x, most of the conditional probability p(t|x) goes to the single t with the smallest KL distance (hard mapping)

[Figure: p(t|x) for a fixed t and a fixed x as β varies.]

Relevance Compression Curve

[Figure: the relevance-compression curve between maximal compression and maximal relevant information; hard mappings correspond to the β → ∞ end of the curve.]


Iterative Optimization Algorithm (iIB)

• Input: joint prior p(x,y), tradeoff parameter β, cardinality of T
• Randomly initialize p(t|x)

Pereira, Tishby, Lee, 1993; Tishby, Pereira, Bialek, 2001

Iterative Optimization Algorithm (iIB)

• Iterate the self-consistent equations until convergence:

  p(cluster | word):  p(t|x) = \frac{p(t)\, e^{-\beta\, D_{KL}[p(y|x)\,\|\,p(y|t)]}}{Z(x,\beta)}

  p(cluster):  p(t) = \sum_x p(x)\, p(t|x)

  p(topic | cluster):  p(y|t) = \frac{1}{p(t)} \sum_x p(t|x)\, p(x)\, p(y|x)
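A minimal Python sketch of these three updates; the toy joint prior, β, and cluster count below are illustrative, and like iIB itself the sketch may land in a suboptimal local minimum:

    import numpy as np

    def iib(p_xy, beta, n_clusters, n_iter=300, seed=0, eps=1e-12):
        """Iterative IB (a sketch): alternate the three self-consistent
        equations for p(t|x), p(t), p(y|t)."""
        rng = np.random.default_rng(seed)
        p_xy = np.asarray(p_xy, float)
        p_x = p_xy.sum(axis=1)                     # prior over words
        p_y_x = p_xy / (p_x[:, None] + eps)        # p(y|x)
        p_tx = rng.dirichlet(np.ones(n_clusters), size=len(p_x))  # init p(t|x)
        for _ in range(n_iter):
            p_t = p_x @ p_tx                       # p(t) = sum_x p(x) p(t|x)
            # p(y|t) = (1/p(t)) sum_x p(t|x) p(x) p(y|x)
            p_y_t = (p_tx * p_x[:, None]).T @ p_y_x / (p_t[:, None] + eps)
            # D_KL[p(y|x) || p(y|t)] for every (x, t) pair
            kl = np.sum(p_y_x[:, None, :] * (np.log(p_y_x[:, None, :] + eps)
                                             - np.log(p_y_t[None, :, :] + eps)),
                        axis=2)
            w = p_t[None, :] * np.exp(-beta * kl)  # p(t) exp(-beta KL)
            p_tx = w / w.sum(axis=1, keepdims=True)
        return p_tx, p_t, p_y_t

    # toy run: 6 "words", 2 "topics"; words 0-2 and 3-5 have similar p(y|x)
    p_xy = np.array([[.12, .02], [.10, .04], [.11, .03],
                     [.02, .12], [.04, .10], [.03, .27]])
    p_tx, p_t, p_y_t = iib(p_xy / p_xy.sum(), beta=10.0, n_clusters=2)
    print(np.round(p_tx, 2))   # rows with similar p(y|x) share a cluster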

iIB simulation

• Given:
  – 300 instances of X with prior p(x)
  – A binary relevant variable Y
  – The joint prior p(x,y)
  – β
• Obtain:
  – The optimal clustering p(t|x) (with minimal I(X;T))

[Figure: the X points and their priors.]

iIB simulation…

[Figure sequence: p(y|x) is given by the color of each point on the map. Across the frames the solution evolves from a single cluster (maximal compression) to hard clustering (maximal relevant information).]

Iterative Optimization Algorithm (iIB)

• Analogous to K-means or EM
• Optimizes a non-convex functional over 3 convex sets → the minimum is only local
• “Semantic change” in the clustering solution

Advantages:

• Defining a relevant variable is often easier and more intuitive than defining a distortion measure

• Finds a local minimum

Iterative Optimization Algorithm (iIB)

Drawbacks:

• Finds only a local minimum (suboptimal solutions)

• Need to specify the parameters

• Slow convergence

• A large data sample is required


Deterministic Annealing-like Algorithm (dIB)

• Iteratively increase the parameter β, adapting the solution from the previous value of β to the new one

• Track the changes in the solution as the system shifts its preference from compression to relevance

• Tries to reconstruct the relevance-compression curve

Slonim, Friedman, Tishby, 2002

• Take the solution from the previous step: p(t|x), p(t), p(y|t)

• Duplicate each cluster and apply a small perturbation to the copies

• Apply iIB at the new β, using the duplicated cluster set as initialization

• If the two copies of a cluster are different, keep the split; else use the old cluster

Deterministic Annealing-like Algorithm (dIB) - Illustration

[Figure: which clusters split at which values of β.]
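A compact Python sketch of this annealing loop; the perturbation size, split threshold, inner iteration count, and β schedule are illustrative placeholders that would need tuning in practice:

    import numpy as np

    def dib(p_xy, betas, n_inner=100, perturb=1e-2, split_thresh=1e-2,
            seed=0, eps=1e-12):
        """Annealing-like IB (a sketch): at each beta, duplicate and perturb
        the clusters, re-converge with the iIB updates, then keep only the
        splits whose two copies drifted apart."""
        rng = np.random.default_rng(seed)
        p_xy = np.asarray(p_xy, float)
        p_x = p_xy.sum(axis=1)
        p_y_x = p_xy / (p_x[:, None] + eps)
        p_tx = np.ones((len(p_x), 1))              # start with a single cluster
        for beta in betas:                         # increasing schedule
            # duplicate each cluster and apply a small perturbation
            p_tx = np.repeat(p_tx, 2, axis=1) / 2
            p_tx = np.abs(p_tx + perturb * rng.standard_normal(p_tx.shape))
            p_tx /= p_tx.sum(axis=1, keepdims=True)
            for _ in range(n_inner):               # iIB warm-started at this beta
                p_t = p_x @ p_tx
                p_y_t = (p_tx * p_x[:, None]).T @ p_y_x / (p_t[:, None] + eps)
                kl = np.sum(p_y_x[:, None, :] * (np.log(p_y_x[:, None, :] + eps)
                                                 - np.log(p_y_t[None, :, :] + eps)),
                            axis=2)
                w = p_t[None, :] * np.exp(-beta * kl)
                p_tx = w / w.sum(axis=1, keepdims=True)
            cols = []                              # test every duplicated pair
            for i in range(0, p_tx.shape[1], 2):
                if np.abs(p_y_t[i] - p_y_t[i + 1]).max() > split_thresh:
                    cols += [p_tx[:, i], p_tx[:, i + 1]]   # the split "took"
                else:
                    cols += [p_tx[:, i] + p_tx[:, i + 1]]  # collapse it back
            p_tx = np.column_stack(cols)
        return p_tx

    # usage sketch: p_tx = dib(p_xy, betas=np.geomspace(0.1, 20.0, 15))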

Deterministic Annealing-like Algorithm (dIB)

Advantages:

• Finds a local minimum (suboptimal solutions)

• Speeds up convergence by adapting the previous solution

Drawbacks:

• Need to specify and tune several parameters:
  – the perturbation size
  – the step for β (splits might be “skipped”)
  – the similarity threshold for splitting
  – the parameters may need to vary during the process

• Finds only a local minimum (suboptimal solutions)

• A large data sample is required


Agglomerative Algorithm (aIB)

• Finds a hierarchical clustering tree in a greedy bottom-up fashion

• Results in a different tree for each β

• Each tree spans a range of clustering solutions at different resolutions

[Figure: the same β; different resolutions along the tree.]

Slonim & Tishby, 1999

Agglomerative Algorithm (aIB)

• Fix β; start with T = X (every point is its own cluster)

• For each pair of clusters t_i, t_j: compute the cost of merging them (the increase in the functional caused by the merge)

• Merge the pair t_i and t_j that produces the smallest cost

• Continue merging until a single cluster is left (a sketch of the procedure follows below)
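A Python sketch of the greedy merging; following Slonim & Tishby, the merge cost used here is the β → ∞ form, (p(t_i) + p(t_j)) times a weighted Jensen-Shannon divergence, which is one instantiation of “the smallest increase in the functional” (priors are assumed strictly positive):

    import numpy as np

    def merge_cost(p_i, p_j, py_i, py_j, eps=1e-12):
        """Delta I(T;Y) for merging clusters i, j:
        (p_i + p_j) * weighted Jensen-Shannon divergence (beta -> inf form)."""
        pm = p_i + p_j
        py_m = (p_i * py_i + p_j * py_j) / pm          # p(y | t_merged)
        def kl(p, q):
            m = p > 0
            return np.sum(p[m] * np.log2(p[m] / (q[m] + eps)))
        return p_i * kl(py_i, py_m) + p_j * kl(py_j, py_m)

    def aib(p_xy):
        """Agglomerative IB (a sketch): greedily merge the cheapest pair of
        clusters until one cluster is left; records the merge order."""
        p_xy = np.asarray(p_xy, float)
        p_t = list(p_xy.sum(axis=1))                   # start: each x is a cluster
        py_t = list(p_xy / np.array(p_t)[:, None])     # rows of p(y|t)
        members = [[i] for i in range(len(p_t))]
        merges = []                                    # bottom-up hierarchy
        while len(p_t) > 1:
            i, j = min(((a, b) for a in range(len(p_t))
                        for b in range(a + 1, len(p_t))),
                       key=lambda ab: merge_cost(p_t[ab[0]], p_t[ab[1]],
                                                 py_t[ab[0]], py_t[ab[1]]))
            pm = p_t[i] + p_t[j]                       # merge j into i
            py_t[i] = (p_t[i] * py_t[i] + p_t[j] * py_t[j]) / pm
            p_t[i] = pm
            members[i] += members[j]
            merges.append((list(members[i]), pm))
            for lst in (p_t, py_t, members):
                del lst[j]
        return merges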


Agglomerative Algorithm (aIB)

Advantages:

• Non-parametric

• Full hierarchy of clusters for each β

• Simple

Agglomerative Algorithm (aIB)

Drawbacks:

• Greedy – is not guaranteed to extract even locally minimal solutions along the tree

• Large data sample is required


Unsupervised Clustering of Images

• Modeling assumption: for a fixed image, the colors and their spatial distribution are generated by a mixture of Gaussians in a 5-dimensional color-position space

Shiri Gordon et al., 2003

Unsupervised Clustering of Images

• Mixture of Gaussians model:

  f(v) = \sum_j \alpha_j\, \mathcal{N}(v;\, \mu_j, \Sigma_j)

• Apply the EM procedure to estimate the mixture parameters
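A sketch of this per-image modeling step using scikit-learn's EM-based GaussianMixture; the exact feature definition (3 color channels plus 2 normalized pixel coordinates) and the component count are assumptions for illustration:

    import numpy as np
    from sklearn.mixture import GaussianMixture

    def image_gmm(image, n_components=4):
        """Fit the per-image model (a sketch): each pixel becomes a 5-dim
        sample (color + position), modeled by a Gaussian mixture whose
        parameters are estimated with EM."""
        h, w, _ = image.shape
        ys, xs = np.mgrid[0:h, 0:w]
        feats = np.column_stack([image.reshape(-1, 3),        # color (3-dim)
                                 ys.ravel() / h,              # position (2-dim)
                                 xs.ravel() / w])
        return GaussianMixture(n_components=n_components).fit(feats)  # EM inside

    # toy usage with a random "image"; real input would be e.g. CIE-Lab pixels
    gmm = image_gmm(np.random.rand(32, 32, 3))
    print(gmm.means_.shape)   # (4, 5): component means in the 5-dim space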

Unsupervised Clustering of Images

• Assume a uniform prior p(x) over the images

• Calculate the conditional p(y|x)

• Apply the aIB algorithm

Unsupervised Clustering of Images

[Figures: example image clusters produced by the aIB algorithm.]

Shiri Gordon et al., 2003

Summary

• Rate Distortion Theory
  – Blahut-Arimoto algorithm
• Information Bottleneck Principle
• IB algorithms
  – iIB
  – dIB
  – aIB
• Application

Thank you

Blahut-Arimoto Algorithm

• View the algorithm as alternating minimization between two convex sets of distributions, A and B: at each step, find the member of one set with minimum distance to the current member of the other

• When does it converge to the global minimum?
  – When A and B are convex and the distance measure satisfies some requirements (Csiszar & Tusnady, 1984)

• The rate distortion problem can be reformulated as such an alternating minimization, using the KL distance

Rate Distortion Theory - Intuition

• T = X: zero distortion, but not compact
• |T| = 1: high distortion, but very compact

Information Bottleneck - cont’d

• Assume the Markov relation T ↔ X ↔ Y:

  – T is a compressed representation of X, thus independent of Y if X is given: p(t|x,y) = p(t|x)

  – Data processing inequality: I(T;Y) ≤ I(X;Y)
