MINE: Mutual Information Neural Estimation ICML 2018


Page 1:

MINE: Mutual Information Neural Estimation

ICML 2018

Page 2:

Motivation

• Mutual information is a powerful tool in statistical models:
  – Feature selection, information bottleneck, causality
• MI quantifies the dependence of two random variables

Page 3:

Motivation

• MI is tractable only for discrete random variables or known probability distributions
• Common approaches do not scale well with sample size or dimension:
  – Non-parametric approaches
  – Approximate Gaussianity
• MINE instead:
  – Uses the KL-divergence formulation of MI
  – Uses a dual formulation for estimating the f-divergence
  – Sets up an adversarial game between neural nets

Page 4:

MI

• KL-Divergence definition:

• MI:
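For reference, the standard definitions used in the paper are (with $P_{XZ}$ the joint distribution and $P_X \otimes P_Z$ the product of the marginals):

$$D_{KL}(P \,\|\, Q) = \mathbb{E}_{P}\!\left[\log \frac{dP}{dQ}\right]$$

$$I(X;Z) = D_{KL}\big(P_{XZ} \,\|\, P_X \otimes P_Z\big)$$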

Page 5:

MI Estimator

• Donsker-Varadhan representation:

• So:
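Written out, the Donsker-Varadhan representation is

$$D_{KL}(P \,\|\, Q) = \sup_{T:\,\Omega \to \mathbb{R}} \; \mathbb{E}_P[T] - \log\big(\mathbb{E}_Q[e^{T}]\big)$$

so, taking $P = P_{XZ}$ and $Q = P_X \otimes P_Z$, any function $T$ gives a lower bound on the mutual information:

$$I(X;Z) \;\ge\; \mathbb{E}_{P_{XZ}}[T] - \log\big(\mathbb{E}_{P_X \otimes P_Z}[e^{T}]\big)$$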

Page 6:

MI Estimator

• f-divergence representation:

• Both bounds are tight for some optimal choice of T, but the Donsker-Varadhan representation is stronger: for any fixed T its value is at least as large as the f-divergence bound (see the inequality below)
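The f-divergence (Nguyen-Wainwright-Jordan) bound reads

$$D_{KL}(P \,\|\, Q) \;\ge\; \sup_{T} \; \mathbb{E}_P[T] - \mathbb{E}_Q\big[e^{T-1}\big]$$

and, since $\log x \le x/e$ for all $x > 0$, the Donsker-Varadhan value dominates it for every fixed $T$:

$$\mathbb{E}_P[T] - \log \mathbb{E}_Q\big[e^{T}\big] \;\ge\; \mathbb{E}_P[T] - \mathbb{E}_Q\big[e^{T-1}\big]$$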

Page 7:

Method

• Parameterize the function T with a neural network T_θ ("statistics network")
• So the MI bound becomes a maximization over the network parameters θ
• Estimate the expectations using empirical samples (written out below)
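Concretely, with $T_\theta$ a neural network with parameters $\theta \in \Theta$:

$$I_\Theta(X;Z) = \sup_{\theta \in \Theta} \; \mathbb{E}_{P_{XZ}}[T_\theta] - \log\big(\mathbb{E}_{P_X \otimes P_Z}[e^{T_\theta}]\big)$$

and the MINE estimator replaces the expectations by empirical averages over $n$ samples (joint samples from the data, marginal samples obtained by pairing $x$ with independently drawn or shuffled $z$):

$$\widehat{I(X;Z)}_n = \sup_{\theta \in \Theta} \; \mathbb{E}_{P_{XZ}^{(n)}}[T_\theta] - \log\big(\mathbb{E}_{\hat P_X^{(n)} \otimes \hat P_Z^{(n)}}[e^{T_\theta}]\big)$$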

Page 8:

Algorithm
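A minimal PyTorch-style sketch of the training loop; the statistics-network architecture, optimizer, batch size, and EMA rate are illustrative assumptions (not the paper's exact settings), and `StatNet` / `mine_step` are hypothetical helper names:

```python
import torch
import torch.nn as nn


class StatNet(nn.Module):
    """Statistics network T_theta(x, z): a small MLP on the concatenated pair."""

    def __init__(self, x_dim, z_dim, hidden=100):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(x_dim + z_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, x, z):
        return self.net(torch.cat([x, z], dim=1))


def mine_step(T, opt, x, z, ema, ema_rate=0.01):
    """One MINE update on a joint minibatch (x, z).

    Marginal samples are obtained by shuffling z within the batch; `ema`
    tracks E[e^T] over minibatches and is used to debias the gradient of
    the log term. Returns the DV lower-bound estimate and the updated EMA.
    """
    z_marg = z[torch.randperm(z.size(0))]           # samples from P_X (x) P_Z
    t_joint = T(x, z).mean()                        # E_{P_XZ}[T]
    e_t_marg = torch.exp(T(x, z_marg)).mean()       # E_{P_X x P_Z}[e^T]

    ema = (1 - ema_rate) * ema + ema_rate * e_t_marg.detach()

    # Minimizing this surrogate yields the bias-corrected gradient of the
    # negative DV bound: the log-term denominator is the EMA, not the batch.
    loss = -(t_joint - e_t_marg / ema)
    opt.zero_grad()
    loss.backward()
    opt.step()

    mi_lb = (t_joint - torch.log(e_t_marg)).item()  # plain DV estimate
    return mi_lb, ema


if __name__ == "__main__":
    torch.manual_seed(0)
    rho, dim, batch = 0.9, 1, 256
    T = StatNet(x_dim=dim, z_dim=dim)
    opt = torch.optim.Adam(T.parameters(), lr=1e-3)
    ema = torch.tensor(1.0)
    for _ in range(2000):
        x = torch.randn(batch, dim)
        z = rho * x + (1 - rho ** 2) ** 0.5 * torch.randn(batch, dim)
        mi_lb, ema = mine_step(T, opt, x, z, ema)
    # True MI for this toy pair is -0.5 * log(1 - rho^2), about 0.83 nats.
    print("estimated MI lower bound:", mi_lb)
```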

Page 9:

Caveats

• The mini-batch computation of the gradient is biased:
  – Use a moving average of the denominator term over mini-batches to estimate it over the full dataset (see the expression below)
• MI is not bounded and can become arbitrarily large, so as an auxiliary objective it can mask the cross-entropy loss
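The biased quantity and its fix: on a minibatch $B$, the gradient of the DV bound is

$$\hat G_B = \mathbb{E}_B[\nabla_\theta T_\theta] - \frac{\mathbb{E}_B\big[\nabla_\theta T_\theta \, e^{T_\theta}\big]}{\mathbb{E}_B\big[e^{T_\theta}\big]}$$

whose denominator is itself a noisy minibatch estimate of $\mathbb{E}_{P_X \otimes P_Z}[e^{T_\theta}]$; replacing it with an exponential moving average over minibatches reduces the bias (this is the `ema` term in the sketch above).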

Page 10:

Properties

• Strong consistency:

  – Lemma 1 (approximation by the network family)
  – Lemma 2 (convergence of the empirical estimate)

• Sample Complexity:
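Paraphrasing the paper's statements (constants and regularity conditions omitted; this is a summary, not the exact theorem text):

– Lemma 1: for every $\varepsilon > 0$ there is a statistics network $T_\theta$ such that $|I(X;Z) - I_\Theta(X;Z)| \le \varepsilon$.
– Lemma 2: for a suitably bounded family $\Theta$, the empirical estimate converges, i.e. $|\widehat{I(X;Z)}_n - I_\Theta(X;Z)| \le \varepsilon$ almost surely for $n$ large enough.
– Together these give strong consistency, $\widehat{I(X;Z)}_n \to I(X;Z)$ almost surely; the sample-complexity result bounds how many samples are needed for an $\varepsilon$-accurate estimate with probability $1-\delta$, growing polynomially in the dimension and in $1/\varepsilon^2$.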

Page 11:

Comparing to non-parametric estimator

• Two random variables with a multivariate Gaussian joint distribution
• Estimators compared: a k-NN based estimator vs. MINE and MINE-f
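For this setup the ground truth is available in closed form: assuming (as in the standard benchmark) that the two $d$-dimensional Gaussians have componentwise correlation $\rho$,

$$I(X;Z) = -\frac{d}{2}\,\log\big(1-\rho^2\big)$$

so the k-NN estimator, MINE, and MINE-f can all be compared against the exact value as $\rho$ varies.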

Page 12:

Capturing Non-Linear Dependency

• MI is a good measure for capturing non-linear dependence that correlation-based measures miss

Page 13:

Improving GAN

• GAN objective:

• Mode collapse:
  – All generated samples are similar

• Maximize the MI between generated samples and code:
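Written out (with the noise split into $[z, c]$, $c$ the code, and $\beta$ the weight on the MI regularizer; notation assumed to match the paper's setup): the standard GAN objective is

$$\min_G \max_D \;\; \mathbb{E}_{x \sim P_{\text{data}}}[\log D(x)] + \mathbb{E}_{z}[\log(1 - D(G(z)))]$$

and the regularized generator additionally maximizes a MINE estimate of the sample-code mutual information:

$$\max_G \;\; \mathbb{E}\big[\log D(G([z,c]))\big] + \beta\, I\big(G([z,c]);\, c\big)$$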

Page 14:

Improving GAN - Result

Page 15:

Bi-Directional Adversarial Model

• Encode the input and reconstruct it from its encoding:
  – Encoder
  – Decoder

• Reconstruction error:

• Objectives:
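In notation (following the ALI/BiGAN-style setup the slides build on; the distributions below are assumptions about notation): with encoder distribution $q(z \mid x)$ and decoder distribution $p(x \mid z)$, the reconstruction error is

$$\mathcal{R} = \mathbb{E}_{(x,z)\sim q}\big[-\log p(x \mid z)\big]$$

and it satisfies the bound

$$\mathcal{R} \;\le\; D_{KL}\big(q(x,z)\,\|\,p(x,z)\big) + H_q(x) - I_q(x;z)$$

so the training objective augments the usual adversarial loss with a MINE estimate of $I_q(x;z)$: maximizing it, while the adversarial term matches $q(x,z)$ and $p(x,z)$, tightens the bound on the reconstruction error.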

Page 16:

Bi-Directional Adversarial Models – Results

Page 17:

Information Bottleneck

• Find a representation Z of X that carries enough information to predict Y while discarding the information in X that is irrelevant to Y
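One common Lagrangian form of this objective (with $\beta$ trading off compression against prediction) is

$$\max \; I(Z;Y) - \beta\, I(X;Z)$$

where MINE can provide the estimate of the $I(X;Z)$ term that variational information-bottleneck methods otherwise bound.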

Page 18:

Questions? Thanks!