
Page 1: Information Bottleneck

Information Bottleneck

presented by

Boris Epshtein & Lena Gorelick
Advanced Topics in Computer and Human Vision

Spring 2004

Page 2: Information Bottleneck

Agenda

• Motivation
• Information Theory - Basic Definitions
• Rate Distortion Theory
  – Blahut-Arimoto algorithm
• Information Bottleneck Principle
• IB algorithms
  – iIB
  – dIB
  – aIB
• Application

Page 3: Information Bottleneck

Motivation

Clustering Problem

Page 4: Information Bottleneck

Motivation

• “Hard” Clustering – partitioning of the input data into several exhaustive and mutually exclusive clusters

• Each cluster is represented by a centroid

Page 5: Information Bottleneck

Motivation

• “Good” clustering – should group similar data points together and keep dissimilar points apart

• Quality of partition – average distortion between the data points and corresponding representatives (cluster centroids)

Page 6: Information Bottleneck

• “Soft” Clustering – each data point is assigned to all clusters with some normalized probability

• Goal – minimize expected distortion between the data points and cluster centroids

Motivation

Page 7: Information Bottleneck

Complexity-Precision Trade-off

• Too simple model → poor precision
• Higher precision requires a more complex model

Motivation…

Page 8: Information Bottleneck

Complexity-Precision Trade-off

• Too simple model → poor precision
• Higher precision requires a more complex model
• Too complex model → overfitting

Motivation…

Page 9: Information Bottleneck

Complexity-Precision Trade-off

• Too complex model
  – can lead to overfitting (poor generalization)
  – is hard to learn

• Too simple model
  – cannot capture the real structure of the data

• Examples of approaches:
  – SRM: Structural Risk Minimization
  – MDL: Minimum Description Length
  – Rate Distortion Theory

Motivation…

Page 10: Information Bottleneck

Agenda

• Motivation
• Information Theory - Basic Definitions
• Rate Distortion Theory
  – Blahut-Arimoto algorithm
• Information Bottleneck Principle
• IB algorithms
  – iIB
  – dIB
  – aIB
• Application

Page 11: Information Bottleneck

Entropy

• The measure of uncertainty about the random variable $X$:

  $H(X) = -\sum_x p(x)\,\log p(x)$

Definitions…

Page 12: Information Bottleneck

Entropy - Example

– Fair coin: $p(H) = p(T) = \tfrac{1}{2}$

  $H(X) = -\tfrac{1}{2}\log_2\tfrac{1}{2} - \tfrac{1}{2}\log_2\tfrac{1}{2} = 1$ bit

– Unfair coin (e.g. $p(H) = 0.9$, $p(T) = 0.1$):

  $H(X) = -0.9\log_2 0.9 - 0.1\log_2 0.1 \approx 0.47$ bit

Definitions…
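To verify these numbers, a minimal Python sketch (the 0.9/0.1 bias for the unfair coin is an illustrative choice, not taken from the slides):

```python
import math

def entropy(p):
    """Shannon entropy H(X) = -sum_x p(x) log2 p(x), in bits."""
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

print(entropy([0.5, 0.5]))  # fair coin: 1.0 bit
print(entropy([0.9, 0.1]))  # unfair coin: ~0.469 bit
```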

Page 13: Information Bottleneck

Entropy - Illustration
Definitions…

[Figure: binary entropy as a function of $p$ – highest for the uniform distribution ($p = 0.5$), lowest for a deterministic one ($p = 0$ or $1$)]

Page 14: Information Bottleneck

Conditional Entropy

• The measure of uncertainty about the random variable $X$ given the value of the variable $Y$:

  $H(X|Y) = -\sum_y p(y) \sum_x p(x|y)\,\log p(x|y)$

Definitions…

Page 15: Information Bottleneck

Conditional Entropy – Example

Definitions…

Page 16: Information Bottleneck

Mutual Information

• The reduction in uncertainty of $X$ due to the knowledge of $Y$:

  $I(X;Y) = H(X) - H(X|Y) = \sum_{x,y} p(x,y)\,\log\frac{p(x,y)}{p(x)\,p(y)}$

  – Nonnegative: $I(X;Y) \ge 0$
  – Symmetric: $I(X;Y) = I(Y;X)$
  – Convex w.r.t. $p(y|x)$ for a fixed $p(x)$

Definitions…

Page 17: Information Bottleneck

Mutual Information – Example
Definitions…

Page 18: Information Bottleneck

Kullback-Leibler Distance
Definitions…

• A distance between distributions $p$ and $q$ over the same alphabet:

  $D_{KL}[p\,\|\,q] = \sum_x p(x)\,\log\frac{p(x)}{q(x)}$

  – Nonnegative: $D_{KL}[p\,\|\,q] \ge 0$, with equality iff $p = q$
  – Asymmetric: in general $D_{KL}[p\,\|\,q] \ne D_{KL}[q\,\|\,p]$
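The two definitions are closely linked: $I(X;Y) = D_{KL}[\,p(x,y)\,\|\,p(x)p(y)\,]$. A small NumPy sketch of both quantities (the joint distribution below is an illustrative choice; the KL helper assumes $q > 0$ wherever $p > 0$):

```python
import numpy as np

def kl(p, q):
    """Kullback-Leibler distance D(p||q) in bits; assumes q > 0 wherever p > 0."""
    m = p > 0
    return float(np.sum(p[m] * np.log2(p[m] / q[m])))

def mutual_information(pxy):
    """I(X;Y) = D( p(x,y) || p(x)p(y) )."""
    px = pxy.sum(axis=1, keepdims=True)   # marginal p(x)
    py = pxy.sum(axis=0, keepdims=True)   # marginal p(y)
    return kl(pxy.ravel(), (px * py).ravel())

pxy = np.array([[0.4, 0.1],
                [0.1, 0.4]])
print(mutual_information(pxy))  # ~0.278 bits

p, q = np.array([0.5, 0.5]), np.array([0.9, 0.1])
print(kl(p, q), kl(q, p))       # the two values differ: KL is asymmetric
```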

Page 19: Information Bottleneck

Agenda

• Motivation
• Information Theory - Basic Definitions
• Rate Distortion Theory
  – Blahut-Arimoto algorithm
• Information Bottleneck Principle
• IB algorithms
  – iIB
  – dIB
  – aIB
• Application

Page 20: Information Bottleneck

Rate Distortion Theory – Introduction

• Goal: obtain compact clustering of the data with minimal expected distortion

• Distortion measure is a part of the problem setup

• The clustering and its quality depend on the choice of the distortion measure

Page 21: Information Bottleneck

Rate Distortion Theory

• Obtain compact clustering of the data with minimal expected distortion, given a fixed set of representatives

[Diagram: data $X$ mapped by a soft clustering $p(t|x)$ to the representatives $T$]

Cover & Thomas

Page 22: Information Bottleneck

Rate Distortion Theory - Intuition

• $T = X$ – zero distortion – not compact

• $|T| = 1$ (a single cluster) – high distortion – very compact

Page 23: Information Bottleneck

Rate Distortion Theory – Cont.

• The quality of clustering $p(t|x)$ is determined by:

  – Complexity, measured by $I(X;T)$ (a.k.a. Rate)

  – Distortion, measured by $E[d(x,t)] = \sum_{x,t} p(x)\,p(t|x)\,d(x,t)$

Page 24: Information Bottleneck

Rate Distortion Plane

[Figure: the plane of rate $I(X;T)$ vs. expected distortion $E\,d(X,T)$, spanning maximal compression at one end and minimal distortion at the other; $D$ is the distortion constraint]

Page 25: Information Bottleneck

Rate Distortion Function

• Let $D$ be an upper bound constraint on the expected distortion: $E[d(x,t)] \le D$

• Given the distortion constraint, find the most compact model (with smallest complexity $I(X;T)$)

• Higher values of $D$ mean a more relaxed distortion constraint, so stronger compression levels are attainable

Page 26: Information Bottleneck

Rate Distortion Function

• Given:
  – Set of points $X$ with prior $p(x)$
  – Set of representatives $T$
  – Distortion measure $d(x,t)$

• Find:
  – The most compact soft clustering $p(t|x)$ of points of $X$ that satisfies the distortion constraint

• Rate Distortion Function:

  $R(D) = \min_{\{p(t|x)\,:\;E[d(x,t)] \le D\}} I(X;T)$

Page 27: Information Bottleneck

Rate Distortion Function

Minimize the Lagrangian

  $\mathcal{F}[p(t|x)] = I(X;T) + \beta\, E[d(x,t)]$

– $I(X;T)$ is the complexity term, $E[d(x,t)]$ the distortion term, and $\beta$ the Lagrange multiplier. Minimize!

Page 28: Information Bottleneck

Rate Distortion Curve

[Figure: the rate-distortion curve $R(D)$ in the same plane, between maximal compression and minimal distortion]

Page 29: Information Bottleneck

Rate Distortion Function

Minimize $\mathcal{F}[p(t|x)] = I(X;T) + \beta\, E[d(x,t)]$ subject to the normalization constraint $\sum_t p(t|x) = 1$ for every $x$.

The minimum is attained when

  $p(t|x) = \frac{p(t)}{Z(x,\beta)}\, e^{-\beta\, d(x,t)}$, where $Z(x,\beta) = \sum_t p(t)\, e^{-\beta\, d(x,t)}$ is a normalization factor

Page 30: Information Bottleneck

Solution - Analysis

Solution:

  $p(t|x) = \frac{p(t)}{Z(x,\beta)}\, e^{-\beta\, d(x,t)}$, with $p(t) = \sum_x p(x)\, p(t|x)$

$d(x,t)$, $p(x)$ and $\beta$ are known; the solution is implicit, because $p(t)$ on the right-hand side itself depends on $p(t|x)$.

Page 31: Information Bottleneck

Solution - Analysis

Solution: $p(t|x) \propto p(t)\, e^{-\beta\, d(x,t)}$

For a fixed $t$: when $x$ is similar to $t$, $d(x,t)$ is small – closer points are attached to $t$ with higher probability.

Page 32: Information Bottleneck

Solution - Analysis

Solution: $p(t|x) \propto p(t)\, e^{-\beta\, d(x,t)}$

• $\beta \to 0$: reduces the influence of distortion – $p(t|x)$ does not depend on $x$; this + maximal compression ⇒ single cluster

• $\beta \to \infty$: for each fixed $x$, most of the conditional probability goes to some $t$ with the smallest distortion ⇒ hard clustering

Page 33: Information Bottleneck

Solution - Analysis

Varying $0 < \beta < \infty$ between these extremes gives intermediate soft clustering with intermediate complexity.

Page 34: Information Bottleneck

Agenda

• Motivation
• Information Theory - Basic Definitions
• Rate Distortion Theory
  – Blahut-Arimoto algorithm
• Information Bottleneck Principle
• IB algorithms
  – iIB
  – dIB
  – aIB
• Application

Page 35: Information Bottleneck

Blahut-Arimoto Algorithm

Input: prior $p(x)$, distortion measure $d(x,t)$, parameter $\beta$

• Randomly initialize $p(t)$

• Iterate until convergence:

  $p(t|x) = \frac{p(t)}{Z(x,\beta)}\, e^{-\beta\, d(x,t)}$

  $p(t) = \sum_x p(x)\, p(t|x)$

Optimizing a convex function over a convex set ⇒ the minimum is global.
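A minimal NumPy sketch of the two alternating updates (variable names and the uniform initialization of $p(t)$ are illustrative choices):

```python
import numpy as np

def blahut_arimoto(px, d, beta, n_iter=200):
    """Alternate the two Blahut-Arimoto updates until (approximate) convergence.
    px: prior over X, shape (nx,); d: distortion matrix d(x,t), shape (nx, nt)."""
    nt = d.shape[1]
    pt = np.full(nt, 1.0 / nt)                     # init p(t)
    for _ in range(n_iter):
        w = pt[None, :] * np.exp(-beta * d)        # p(t) exp(-beta d(x,t))
        pt_x = w / w.sum(axis=1, keepdims=True)    # p(t|x): normalize over t
        pt = px @ pt_x                             # p(t) = sum_x p(x) p(t|x)
    return pt_x, pt
```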

Page 36: Information Bottleneck

Blahut-Arimoto Algorithm

Advantages:

• Obtains compact clustering of the data with minimal expected distortion

• Optimal clustering given a fixed set of representatives

Page 37: Information Bottleneck

Blahut-Arimoto Algorithm

Drawbacks:

• Distortion measure is a part of the problem setup
  – Hard to obtain for some problems
  – Equivalent to determining the relevant features

• Fixed set of representatives

• Slow convergence

Page 38: Information Bottleneck

Rate Distortion Theory – Additional Insights

– Another problem would be to find the optimal representatives given the clustering

– Joint optimization of clustering and representatives doesn't have a unique solution (like EM or K-means)

Page 39: Information Bottleneck

Agenda

• Motivation
• Information Theory - Basic Definitions
• Rate Distortion Theory
  – Blahut-Arimoto algorithm
• Information Bottleneck Principle
• IB algorithms
  – iIB
  – dIB
  – aIB
• Application

Page 40: Information Bottleneck

Information Bottleneck

• Copes with the drawbacks of Rate Distortion approach

• Compress the data while preserving “important” (relevant) information

• It is often easier to define what information is important than to define a distortion measure.

• Replace the distortion upper bound constraint by a lower bound constraint over the relevant information

Tishby, Pereira & Bialek, 1999

Page 41: Information Bottleneck

Information Bottleneck – Example

Given: words $X$ occurring in documents, topics $Y$, and their joint prior $p(x,y)$

Page 42: Information Bottleneck

Information Bottleneck – Example

Obtain: a partitioning of words into clusters that is compact yet informative about topics

[Diagram: Words → Partitioning → Topics, annotated with I(Word;Cluster), I(Cluster;Topic) and I(Word;Topic)]

Page 43: Information Bottleneck

Information Bottleneck – Example

Extreme case 1: all words in a single cluster

  I(Word;Cluster) = 0 – very compact

  I(Cluster;Topic) = 0 – not informative

Page 44: Information Bottleneck

Information Bottleneck – Example

Extreme case 2: every word is its own cluster

  I(Word;Cluster) = max – not compact

  I(Cluster;Topic) = max – very informative

Goal: minimize I(Word;Cluster) & maximize I(Cluster;Topic)

Page 45: Information Bottleneck

Information Bottleneck

[Diagram: words $X$ → clusters $T$ → topics $Y$; compactness is measured by $I(X;T)$, relevant information by $I(T;Y)$]

Page 46: Information Bottleneck

Relevance Compression Curve

[Figure: the relevance-compression plane, compression $I(X;T)$ vs. relevant information $I(T;Y)$, between maximal compression and maximal relevant information; $D$ is the relevance constraint]

Page 47: Information Bottleneck

Relevance Compression Function

• Let $D$ be the minimal allowed value of the relevant information $I(T;Y)$

• Given the relevant information constraint $I(T;Y) \ge D$, find the most compact model (with smallest $I(X;T)$)

• Smaller $D$ – more relaxed relevant information constraint – stronger compression levels are attainable

Page 48: Information Bottleneck

Relevance Compression Function

Minimize the Lagrangian

  $\mathcal{L}[p(t|x)] = I(X;T) - \beta\, I(T;Y)$

– $I(X;T)$ is the compression term, $I(T;Y)$ the relevance term, and $\beta$ the Lagrange multiplier. Minimize!

Page 49: Information Bottleneck

Relevance Compression Curve

[Figure: the relevance-compression curve in the same plane, between maximal compression and maximal relevant information]

Page 50: Information Bottleneck

Relevance Compression Function

Minimize $\mathcal{L}[p(t|x)] = I(X;T) - \beta\, I(T;Y)$ subject to the normalization constraint $\sum_t p(t|x) = 1$ for every $x$.

The minimum is attained when

  $p(t|x) = \frac{p(t)}{Z(x,\beta)}\, e^{-\beta\, D_{KL}[p(y|x)\,\|\,p(y|t)]}$

  $p(t) = \sum_x p(x)\, p(t|x)$

  $p(y|t) = \frac{1}{p(t)} \sum_x p(x,y)\, p(t|x)$

Page 51: Information Bottleneck

Solution - Analysis

Solution: the three self-consistent equations above, with $p(x,y)$ and $\beta$ known.

The solution is implicit – each of the three equations depends on the other two.

Page 52: Information Bottleneck

Solution - Analysis

Solution: $p(t|x) \propto p(t)\, e^{-\beta\, D_{KL}[p(y|x)\,\|\,p(y|t)]}$

• The KL distance emerges as the effective distortion measure from the IB principle

• The optimization is also over the cluster representatives $p(y|t)$

• For a fixed $t$: when $p(y|x)$ is similar to $p(y|t)$, the KL is small – such points are attached to $t$ with higher probability

Page 53: Information Bottleneck

Solution - Analysis

Solution: $p(t|x) \propto p(t)\, e^{-\beta\, D_{KL}[p(y|x)\,\|\,p(y|t)]}$

• $\beta \to 0$: reduces the influence of the KL term – $p(t|x)$ does not depend on $x$; this + maximal compression ⇒ single cluster

• $\beta \to \infty$: for each fixed $x$, most of the conditional probability goes to some $t$ with the smallest KL (hard mapping)

Page 54: Information Bottleneck

Relevance Compression Curve

[Figure: the relevance-compression curve between maximal compression and maximal relevant information; as $\beta \to \infty$ the solutions approach a hard mapping]

Page 55: Information Bottleneck

Agenda

• Motivation
• Information Theory - Basic Definitions
• Rate Distortion Theory
  – Blahut-Arimoto algorithm
• Information Bottleneck Principle
• IB algorithms
  – iIB
  – dIB
  – aIB
• Application

Page 56: Information Bottleneck

Iterative Optimization Algorithm (iIB)

• Input: joint prior $p(x,y)$, number of clusters $|T|$, parameter $\beta$

• Randomly initialize $p(t|x)$

Pereira, Tishby & Lee, 1993; Tishby, Pereira & Bialek, 2001

Page 57: Information Bottleneck

Iterative Optimization Algorithm (iIB)

Iterate until convergence:

  $p(t|x) = \frac{p(t)}{Z(x,\beta)}\, e^{-\beta\, D_{KL}[p(y|x)\,\|\,p(y|t)]}$   (p(cluster | word))

  $p(t) = \sum_x p(x)\, p(t|x)$   (p(cluster))

  $p(y|t) = \frac{1}{p(t)} \sum_x p(x,y)\, p(t|x)$   (p(topic | cluster))

Pereira, Tishby & Lee, 1993
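A compact NumPy sketch of these three updates (illustrative, not the authors' code; pass either `nt` or an explicit initialization, and note the sketch assumes no cluster's probability collapses to zero during the iterations):

```python
import numpy as np

def kl_rows(a, b):
    """All pairwise KL divergences D(a_i || b_j), as an (na, nb) matrix."""
    a3, b3 = a[:, None, :], b[None, :, :]
    with np.errstate(divide='ignore', invalid='ignore'):
        t = np.where(a3 > 0, a3 * np.log(a3 / b3), 0.0)
    return t.sum(axis=2)

def iib(pxy, beta, nt=None, init=None, n_iter=200, seed=0):
    """Iterative IB: alternate the three self-consistent equations.
    pxy: joint prior p(x,y), shape (nx, ny). Returns p(t|x), p(t), p(y|t)."""
    nx, ny = pxy.shape
    px = pxy.sum(axis=1)                           # p(x)
    py_x = pxy / px[:, None]                       # p(y|x)
    rng = np.random.default_rng(seed)
    pt_x = rng.dirichlet(np.ones(nt), size=nx) if init is None else init
    for _ in range(n_iter):
        pt = px @ pt_x                                         # p(t)
        py_t = (pt_x * px[:, None]).T @ py_x / pt[:, None]     # p(y|t)
        w = pt[None, :] * np.exp(-beta * kl_rows(py_x, py_t))  # new p(t|x)
        pt_x = w / w.sum(axis=1, keepdims=True)
    return pt_x, pt, py_t
```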

Page 58: Information Bottleneck

iIB simulation

• Given:
  – 300 instances of $X$ with prior $p(x)$
  – Binary relevant variable $Y$
  – Joint prior $p(x,y)$
  – Parameter $\beta$

• Obtain:
  – Optimal clustering $p(t|x)$ (with minimal $\mathcal{L}$)

Page 59: Information Bottleneck

X points and their priors

iIB simulation…

Page 60: Information Bottleneck

Given $x$, $p(y|x)$ is given by the color of the point on the map

iIB simulation…

Page 61: Information Bottleneck

iIB simulation…

Single Cluster – Maximal Compression

Pages 62–73: Information Bottleneck

iIB simulation…

[Figure sequence: as $\beta$ increases, the clusters split one after another, passing through intermediate soft-clustering solutions]

Page 74: Information Bottleneck

iIB simulation…

Hard Clustering – Maximal Relevant Information

Page 75: Information Bottleneck

Iterative Optimization Algorithm (iIB)

• Analogous to K-means or EM

• Optimizes a non-convex functional over 3 convex sets ⇒ the minimum is local

Page 76: Information Bottleneck

“Semantic change” in the clustering solution

Page 77: Information Bottleneck

Advantages:

• Defining a relevant variable is often easier and more intuitive than defining a distortion measure

• Finds a local minimum

Iterative Optimization Algorithm (iIB)

Page 78: Information Bottleneck

Drawbacks:

• Finds local minimum (suboptimal solutions)

• Need to specify the parameters $|T|$ and $\beta$

• Slow convergence

• Large data sample is required

Iterative Optimization Algorithm (iIB)

Page 79: Information Bottleneck

Agenda

• Motivation
• Information Theory - Basic Definitions
• Rate Distortion Theory
  – Blahut-Arimoto algorithm
• Information Bottleneck Principle
• IB algorithms
  – iIB
  – dIB
  – aIB
• Application

Page 80: Information Bottleneck

• Iteratively increase the parameter $\beta$, then adapt the solution from the previous value of $\beta$ to the new one

• Track the changes in the solution as the system shifts its preference from compression to relevance

• Tries to reconstruct the relevance-compression curve

Deterministic Annealing-like algorithm (dIB)

Slonim, Friedman, Tishby, 2002

Page 81: Information Bottleneck

Solution from previous step:

Deterministic Annealing-like algorithm (dIB)

Page 82: Information Bottleneck

Deterministic Annealing-like algorithm (dIB)

Page 83: Information Bottleneck

Small Perturbation

Deterministic Annealing-like algorithm (dIB)

Page 84: Information Bottleneck

Apply iIB using the duplicated cluster set as initialization

Deterministic Annealing-like algorithm (dIB)

Page 85: Information Bottleneck

• If the two copies of the duplicated cluster are different – leave the split

• Else – use the old cluster

Deterministic Annealing-like algorithm (dIB)
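A schematic of the whole annealing loop, reusing the `iib` sketch from the iIB section; the perturbation size, the $\beta$ schedule, and the split test (comparing duplicated columns of $p(t|x)$ rather than the representatives $p(y|t)$) are all simplifying assumptions:

```python
import numpy as np

def dib(pxy, betas, eps=1e-2, tol=1e-2):
    """Deterministic-annealing-like IB (schematic). betas: increasing schedule."""
    nx = pxy.shape[0]
    pt_x = np.ones((nx, 1))                        # start from a single cluster
    for beta in betas:
        # duplicate every cluster and apply a small perturbation
        init = np.repeat(pt_x / 2.0, 2, axis=1)
        init += eps * np.random.rand(*init.shape)
        init /= init.sum(axis=1, keepdims=True)
        pt_x = iib(pxy, beta, init=init)[0]        # iIB from the duplicated set
        # keep only splits that survived: merge near-identical columns
        cols = []
        for j in range(pt_x.shape[1]):
            for c in cols:
                if np.abs(c - pt_x[:, j]).max() < tol:
                    c += pt_x[:, j]                # split collapsed: use old cluster
                    break
            else:
                cols.append(pt_x[:, j].copy())
        pt_x = np.stack(cols, axis=1)
    return pt_x
```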

Page 86: Information Bottleneck

Deterministic Annealing-like algorithm (dIB)Illustration

Which clusters split, and at which values of $\beta$

Page 87: Information Bottleneck

Advantages:

• Finds a local minimum (suboptimal solutions)

• Speeds up convergence by adapting the previous solution

Deterministic Annealing-like algorithm (dIB)

Page 88: Information Bottleneck

Drawbacks:

• Need to specify and tune several parameters:

  – perturbation size

  – step size for $\beta$ (splits might be “skipped”)

  – similarity threshold for splitting

  – may need to vary the parameters during the process

• Finds local minimum (suboptimal solutions)

• Large data sample is required

Deterministic Annealing-like algorithm (dIB)

Page 89: Information Bottleneck

Agenda

• Motivation
• Information Theory - Basic Definitions
• Rate Distortion Theory
  – Blahut-Arimoto algorithm
• Information Bottleneck Principle
• IB algorithms
  – iIB
  – dIB
  – aIB
• Application

Page 90: Information Bottleneck

Agglomerative Algorithm (aIB)

• Finds a hierarchical clustering tree in a greedy bottom-up fashion

• Results in a different tree for each $\beta$

• Each tree covers a range of clustering solutions at different resolutions (same $\beta$, different resolutions)

Slonim & Tishby, 1999

Page 91: Information Bottleneck

Agglomerative Algorithm (aIB)

Fix $\beta$

Start with $T = X$ – every data point in its own singleton cluster

Page 92: Information Bottleneck

Agglomerative Algorithm (aIB)

For each pair of clusters $t_i, t_j$:

  Compute the new value of $\mathcal{L}$ after a tentative merge of $t_i$ and $t_j$

Merge the $t_i$ and $t_j$ that produce the smallest $\mathcal{L}$

Page 93: Information Bottleneck

Agglomerative Algorithm (aIB)

Repeat: for each remaining pair, compute the new $\mathcal{L}$ and merge the pair that produces the smallest $\mathcal{L}$

Page 94: Information Bottleneck

Agglomerative Algorithm (aIB)

Continue merging until a single cluster is left
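A greedy sketch of the procedure in its commonly used $\beta \to \infty$ form, where the cost of merging $t_i$ and $t_j$ is the loss of relevant information $(p(t_i)+p(t_j))\,\mathrm{JS}_\pi[p(y|t_i), p(y|t_j)]$ (Slonim & Tishby, 1999). This is an $O(n^3)$ toy version, not an efficient implementation:

```python
import numpy as np

def H(p):
    """Entropy in bits."""
    p = p[p > 0]
    return -float(np.sum(p * np.log2(p)))

def merge_cost(pi, pyi, pj, pyj):
    """Relevant information lost by merging clusters i and j:
    (p(ti)+p(tj)) * JS_pi( p(y|ti), p(y|tj) )."""
    p = pi + pj
    wi, wj = pi / p, pj / p
    return p * (H(wi * pyi + wj * pyj) - wi * H(pyi) - wj * H(pyj))

def aib(pxy):
    """Greedy agglomerative IB: start from singleton clusters and always
    merge the cheapest pair; returns the sequence of merges (the tree)."""
    px = pxy.sum(axis=1)
    clusters = [(px[i], pxy[i] / px[i]) for i in range(len(px))]  # (p(t), p(y|t))
    merges = []
    while len(clusters) > 1:
        cost, i, j = min((merge_cost(*clusters[i], *clusters[j]), i, j)
                         for i in range(len(clusters))
                         for j in range(i + 1, len(clusters)))
        (pi, pyi), (pj, pyj) = clusters[i], clusters[j]
        merged = (pi + pj, (pi * pyi + pj * pyj) / (pi + pj))
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
        merges.append((i, j, float(cost)))
    return merges
```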

Page 95: Information Bottleneck

Agglomerative Algorithm (aIB)

Page 96: Information Bottleneck

Agglomerative Algorithm (aIB)

Advantages:

• Non-parametric

• Full hierarchy of clusters for each $\beta$

• Simple

Page 97: Information Bottleneck

Agglomerative Algorithm (aIB)

Drawbacks:

• Greedy – is not guaranteed to extract even locally minimal solutions along the tree

• Large data sample is required

Page 98: Information Bottleneck

Agenda

• Motivation
• Information Theory - Basic Definitions
• Rate Distortion Theory
  – Blahut-Arimoto algorithm
• Information Bottleneck Principle
• IB algorithms
  – iIB
  – dIB
  – aIB
• Application

Page 99: Information Bottleneck

Unsupervised Clustering of Images

Modeling assumption:

For a fixed image, colors and their spatial distribution are generated by a mixture of Gaussians in a 5-dimensional color-position space

Applications…

Shiri Gordon et al., 2003

Page 100: Information Bottleneck

Unsupervised Clustering of Images

Mixture of Gaussians model:

  $p(v \mid x) = \sum_j \alpha_j\, \mathcal{N}(v;\, \mu_j, \Sigma_j)$

Apply the EM procedure to estimate the mixture parameters

Applications…

Shiri Gordon et al., 2003

Page 101: Information Bottleneck

Unsupervised Clustering of Images

Applications…

Shiri Gordon et al., 2003

• Assume a uniform prior $p(x)$ over images

• Calculate the conditional $p(y|x)$

• Apply the aIB algorithm (a schematic sketch follows)
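Schematically, the pipeline maps onto the `aib` sketch above (the sizes and the random $p(y|x)$ are hypothetical placeholders; the paper derives $p(y|x)$ from each image's Gaussian mixture):

```python
import numpy as np

n_images, n_bins = 20, 64                    # hypothetical sizes
pyx = np.random.rand(n_images, n_bins)       # placeholder for the MoG-based p(y|x)
pyx /= pyx.sum(axis=1, keepdims=True)
px = np.full(n_images, 1.0 / n_images)       # uniform prior over images
pxy = px[:, None] * pyx                      # joint p(x,y)
merges = aib(pxy)                            # hierarchy of image clusters
```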

Page 102: Information Bottleneck

Unsupervised Clustering of Images
Applications…

Shiri Gordon et al., 2003

Page 103: Information Bottleneck

Unsupervised Clustering of Images
Applications…

Shiri Gordon et al., 2003

Page 104: Information Bottleneck

Summary

• Rate Distortion Theory
  – Blahut-Arimoto algorithm

• Information Bottleneck Principle

• IB algorithms
  – iIB
  – dIB
  – aIB

• Application

Page 105: Information Bottleneck

Thank you

Page 106: Information Bottleneck

Blahut-Arimoto algorithm

[Diagram: two convex sets of distributions, A and B, with the algorithm alternating between them in search of the minimum distance]

When does it converge to the global minimum?

• When A and B are convex, plus some requirements on the distance measure

Csiszar & Tusnady, 1984

Page 107: Information Bottleneck

Blahut-Arimoto algorithm

Reformulate the problem as minimization of a distance between the two convex sets A and B

Page 108: Information Bottleneck

Blahut-Arimoto algorithm

[Diagram: alternating minimization between the sets A and B]

Page 109: Information Bottleneck

Rate Distortion Theory - Intuition

• $T = X$ – zero distortion – not compact

• $|T| = 1$ (a single cluster) – high distortion – very compact

Page 114: Information Bottleneck

Information Bottleneck - cont’d

• Assume the Markov relation $T \leftrightarrow X \leftrightarrow Y$:

  – $T$ is a compressed representation of $X$, thus independent of $Y$ if $X$ is given: $p(t \mid x, y) = p(t \mid x)$

  – Information processing inequality: $I(T;Y) \le I(X;Y)$
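The information processing inequality follows from the chain rule for mutual information under this Markov assumption; a short derivation:

```latex
I(T;Y) \le I(T;Y) + I(X;Y \mid T) = I(X,T;Y) = I(X;Y) + I(T;Y \mid X) = I(X;Y)
```

where $I(T;Y \mid X) = 0$ because $T$ and $Y$ are independent given $X$.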