Information Theory Primer: Entropy, KL Divergence, Mutual Information, Jensen's Inequality
Seungjin Choi
Department of Computer Science and Engineering
Pohang University of Science and Technology
77 Cheongam-ro, Nam-gu, Pohang 37673, Korea
[email protected]
http://mlg.postech.ac.kr/~seungjin
Outline
- Entropy (Shannon entropy, differential entropy, and conditional entropy)
- Kullback-Leibler (KL) divergence
- Mutual information
- Jensen's inequality and Gibbs' inequality
Information Theory
- Information theory answers two fundamental questions in communication theory:
  - What is the ultimate data compression? → entropy H.
  - What is the ultimate transmission rate of communication? → channel capacity C.
- In the early 1940s, it was thought that increasing the transmission rate of information over a communication channel increased the probability of error. Shannon surprised the communication theory community by proving that this is not true as long as the communication rate stays below the channel capacity.
- Although information theory was developed for communications, it is also important in explaining the ecological theory of sensory processing. Information theory plays a key role in elucidating the goal of unsupervised learning.
Shannon Entropy
Information and Entropy
- Information can be thought of as surprise, uncertainty, or unexpectedness. Mathematically, it is defined by

  I = -\log p_i,

  where p_i is the probability that the event labelled i occurs. A rare event gives large information, while a frequent event produces small information.
- Entropy is the average information, i.e.,

  H = E[I] = -\sum_{i=1}^{N} p_i \log p_i.
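For concreteness, here is a minimal NumPy sketch of this definition (the function name and the 0 log 0 = 0 convention are our own choices):

```python
import numpy as np

def entropy(p, base=2):
    """Shannon entropy H = -sum_i p_i log p_i, with 0 log 0 taken as 0."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                    # drop zero-probability events
    return -np.sum(p * np.log(p)) / np.log(base)

print(entropy([0.5, 0.5]))          # fair coin: 1.0 bit
print(entropy([0.9, 0.1]))          # biased coin: ~0.469 bits
print(entropy([0.25] * 4))          # uniform over 4 outcomes: 2.0 bits
```

As expected, among distributions on a fixed number of outcomes, the uniform one gives the largest value.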
Example: Horse Race
Suppose we have a horse race with eight horses taking part. Assume that the probabilities of winning for the eight horses are
\left( \frac{1}{2}, \frac{1}{4}, \frac{1}{8}, \frac{1}{16}, \frac{1}{64}, \frac{1}{64}, \frac{1}{64}, \frac{1}{64} \right).
Suppose that we wish to send a message to another person indicating which horse won the race.
How many bits are required to describe this for each of the horses?
3 bits for any of the horses?
No! The win probabilities are not uniform. It makes sense to use shorter descriptions for the more probable horses and longer descriptions for the less probable ones, so that we achieve a lower average description length. For example, we can use the following strings to represent the eight horses:
0, 10, 110, 1110, 111100, 111101, 111110, 111111.
The average description length in this case is 2 bits, as opposed to 3 bits for the uniform code. We calculate the entropy:
H = -\frac{1}{2}\log_2\frac{1}{2} - \frac{1}{4}\log_2\frac{1}{4} - \frac{1}{8}\log_2\frac{1}{8} - \frac{1}{16}\log_2\frac{1}{16} - 4 \cdot \frac{1}{64}\log_2\frac{1}{64} = 2 \text{ bits}.
The entropy of a random variable is a lower bound on the average number of bits required to represent the random variable, and also on the average number of questions needed to identify the variable in a game of "twenty questions".
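The claim is easy to verify numerically; a small sketch (NumPy assumed) comparing the average description length of the code above with the entropy:

```python
import numpy as np

p = np.array([1/2, 1/4, 1/8, 1/16, 1/64, 1/64, 1/64, 1/64])
codes = ["0", "10", "110", "1110", "111100", "111101", "111110", "111111"]

avg_len = sum(pi * len(c) for pi, c in zip(p, codes))
H = -np.sum(p * np.log2(p))
print(avg_len, H)   # 2.0 2.0 -- this code meets the entropy lower bound
```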
Shannon Entropy
Given a discrete random variable X with image \mathcal{X},

- (Shannon) entropy is the average information (a measure of uncertainty), defined by

  H(X) = -\sum_{x \in \mathcal{X}} p(x) \log p(x) = E_p[-\log p(x)].
- Properties
  - H(X) ≥ 0 (each term in the summation is nonnegative).
  - H(X) = 0 if and only if P[X = x] = 1 for some x ∈ \mathcal{X}.
  - The entropy is maximal if all the outcomes are equally likely.
Differential Entropy
Given a continuous random variable X,
- Differential entropy is defined as

  H(p) = -\int p(x) \log p(x) \, dx = -E_p[\log p(x)].
- Properties
  - It can be negative.
  - Given a fixed variance, the Gaussian distribution achieves the maximal differential entropy.
  - For x \sim \mathcal{N}(\mu, \sigma^2), H(x) = \frac{1}{2} \log(2\pi e \sigma^2) (checked numerically below).
  - For x \sim \mathcal{N}(\mu, \Sigma), H(x) = \frac{1}{2} \log \det(2\pi e \Sigma).
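A quick numerical check of the univariate Gaussian formula (a sketch; the grid range and σ = 1.5 are arbitrary choices):

```python
import numpy as np

mu, sigma = 0.0, 1.5
analytic = 0.5 * np.log(2 * np.pi * np.e * sigma**2)

# H(p) = -∫ p(x) log p(x) dx, approximated by a Riemann sum on a fine grid
x = np.linspace(mu - 10 * sigma, mu + 10 * sigma, 200001)
dx = x[1] - x[0]
p = np.exp(-(x - mu)**2 / (2 * sigma**2)) / np.sqrt(2 * np.pi * sigma**2)
numeric = -np.sum(p * np.log(p)) * dx
print(analytic, numeric)   # both ≈ 1.824 nats
```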
Conditional Entropy
Given two discrete random variables X and Y with images \mathcal{X} and \mathcal{Y}, respectively, we expand the joint entropy
\begin{aligned}
H(X, Y) &= \sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} p(x, y) \log \frac{1}{p(x, y)} \\
&= \sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} p(x, y) \log \frac{1}{p(x) p(y|x)} \\
&= \sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} p(x, y) \log \frac{1}{p(x)} + \sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} p(x, y) \log \frac{1}{p(y|x)} \\
&= \sum_{x \in \mathcal{X}} p(x) \log \frac{1}{p(x)} + \sum_{x \in \mathcal{X}} p(x) \sum_{y \in \mathcal{Y}} p(y|x) \log \frac{1}{p(y|x)} \\
&= H(X) + \underbrace{\sum_{x \in \mathcal{X}} p(x) H(Y|X = x)}_{H(Y|X)}.
\end{aligned}

Chain rule: H(X, Y) = H(X) + H(Y|X).
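The chain rule is easy to confirm on a toy joint distribution (a minimal sketch; the entries of pxy are made up):

```python
import numpy as np

pxy = np.array([[0.2, 0.1, 0.1],      # joint p(x, y), rows indexed by x
                [0.3, 0.2, 0.1]])

def H(p):
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

px = pxy.sum(axis=1)                  # marginal p(x)
H_Y_given_X = sum(px[i] * H(pxy[i] / px[i]) for i in range(len(px)))
print(H(pxy.ravel()), H(px) + H_Y_given_X)   # equal: H(X,Y) = H(X) + H(Y|X)
```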
- H(X, Y) ≤ H(X) + H(Y)
- H(Y|X) ≤ H(Y)
Try to prove these by yourself!
KL Divergence
Relative Entropy (Kullback-Leibler Divergence)
- Introduced by Solomon Kullback and Richard Leibler in 1951.
- A measure of how one probability distribution q diverges from another, p.
- Written D_KL[p‖q], the KL divergence of q(x) from p(x).
- For discrete probability distributions p and q:

  D_{\mathrm{KL}}[p \| q] = \sum_x p(x) \log \frac{p(x)}{q(x)}.

- For probability distributions p and q of continuous random variables:

  D_{\mathrm{KL}}[p \| q] = \int p(x) \log \frac{p(x)}{q(x)} \, dx.
- Properties of KL divergence
  - Divergence is not symmetric: D_KL[p‖q] ≠ D_KL[q‖p].
  - Divergence is always nonnegative: D_KL[p‖q] ≥ 0 (Gibbs' inequality).
  - Divergence is a convex function on the domain of probability distributions.
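The first two properties are easy to see numerically; a small sketch (the distributions p and q are made up):

```python
import numpy as np

def kl(p, q):
    """Discrete D_KL[p||q] in bits; assumes q > 0 wherever p > 0."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    mask = p > 0
    return np.sum(p[mask] * np.log2(p[mask] / q[mask]))

p = [0.5, 0.25, 0.25]
q = [0.4, 0.4, 0.2]
print(kl(p, q), kl(q, p))   # different values: not symmetric
print(kl(p, p))             # 0.0: vanishes when the distributions match
```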
Theorem (Convexity of divergence)
Let p1, q1 and p2, q2 be probability distributions over a random variable X, and ∀λ ∈ (0, 1) define
p = λp1 + (1− λ)p2,
q = λq1 + (1− λ)q2.
Then,
DKL [p‖q] ≤ λDKL [p1‖q1] + (1− λ)DKL [p2‖q2] .
Proof. It is deferred to the end of this lecture.
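Meanwhile, a numerical spot-check of the statement, which illustrates but of course does not prove it (a sketch; the random Dirichlet draws and λ = 0.3 are arbitrary choices):

```python
import numpy as np
rng = np.random.default_rng(0)

def kl(p, q):
    return np.sum(p * np.log(p / q))

p1, q1, p2, q2 = (rng.dirichlet(np.ones(5)) for _ in range(4))
lam = 0.3
p = lam * p1 + (1 - lam) * p2
q = lam * q1 + (1 - lam) * q2
print(kl(p, q) <= lam * kl(p1, q1) + (1 - lam) * kl(p2, q2))   # True
```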
Entropy and Divergence
The entropy of a random variable X with a probability distribution p(x) is related to how much p(x) diverges from the uniform distribution on the support of X.
\begin{aligned}
H(X) &= \sum_{x \in \mathcal{X}} p(x) \log \frac{1}{p(x)} = \sum_{x \in \mathcal{X}} p(x) \log \frac{|\mathcal{X}|}{p(x) |\mathcal{X}|} \\
&= \log |\mathcal{X}| - \sum_{x \in \mathcal{X}} p(x) \log \frac{p(x)}{1/|\mathcal{X}|} = \log |\mathcal{X}| - D_{\mathrm{KL}}[p \| \mathrm{unif}].
\end{aligned}
The more p(x) diverges from the uniform distribution, the smaller its entropy, and vice versa.
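A direct numerical check of this identity (a sketch; p is made up, with |X| = 4):

```python
import numpy as np

p = np.array([0.5, 0.2, 0.2, 0.1])
u = np.full(4, 1/4)                       # uniform over |X| = 4 outcomes
H = -np.sum(p * np.log2(p))
kl_to_unif = np.sum(p * np.log2(p / u))
print(H, np.log2(4) - kl_to_unif)         # identical, as derived above
```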
Recall
D_{\mathrm{KL}}[p \| q] = \int p(x) \log \frac{p(x)}{q(x)} \, dx.
Characterizing KL divergence
- If p and q are both high, we are happy.
- If p is high but q isn't, we pay a price.
- If p is low, we do not care.
- If D_KL = 0, then the two distributions are equal.
KL Divergence of Two Gaussians
- Two univariate Gaussians (x ∈ ℝ):
  p(x) = \mathcal{N}(\mu_1, \sigma_1^2) and q(x) = \mathcal{N}(\mu_2, \sigma_2^2)
- Calculated as

\begin{aligned}
D_{\mathrm{KL}}[p \| q] &= \int p(x) \log \frac{p(x)}{q(x)} \, dx \\
&= \int p(x) \log p(x) \, dx - \int p(x) \log q(x) \, dx \\
&= \frac{1}{2} \frac{\sigma_1^2}{\sigma_2^2} + \frac{1}{2} \frac{(\mu_2 - \mu_1)^2}{\sigma_2^2} + \log \frac{\sigma_2}{\sigma_1} - \frac{1}{2}.
\end{aligned}
- Two multivariate Gaussians (x ∈ ℝ^D):
  p(x) = \mathcal{N}(\mu_1, \Sigma_1) and q(x) = \mathcal{N}(\mu_2, \Sigma_2)
- Calculated as

  D_{\mathrm{KL}}[p \| q] = \frac{1}{2} \left[ \mathrm{tr}\left(\Sigma_2^{-1} \Sigma_1\right) + (\mu_2 - \mu_1)^{\top} \Sigma_2^{-1} (\mu_2 - \mu_1) - D + \log \frac{|\Sigma_2|}{|\Sigma_1|} \right].
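The univariate closed form can be checked against a brute-force numerical integral (a sketch; the parameter values are arbitrary):

```python
import numpy as np

mu1, s1, mu2, s2 = 0.0, 1.0, 1.0, 2.0
closed = (0.5 * s1**2 / s2**2 + 0.5 * (mu2 - mu1)**2 / s2**2
          + np.log(s2 / s1) - 0.5)

# D_KL[p||q] = ∫ p log(p/q) dx, approximated on a fine grid
x = np.linspace(-12, 12, 200001)
dx = x[1] - x[0]
p = np.exp(-(x - mu1)**2 / (2 * s1**2)) / np.sqrt(2 * np.pi * s1**2)
q = np.exp(-(x - mu2)**2 / (2 * s2**2)) / np.sqrt(2 * np.pi * s2**2)
numeric = np.sum(p * np.log(p / q)) * dx
print(closed, numeric)   # both ≈ 0.443 nats
```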
Mutual Information
Mutual Information
- Mutual information is the relative entropy between the joint distribution and the product of the marginal distributions,
\begin{aligned}
I(x, y) &= \sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} p(x, y) \log \left[ \frac{p(x, y)}{p(x) p(y)} \right] = D_{\mathrm{KL}}[p(x, y) \| p(x) p(y)] \\
&= E_{p(x,y)} \left\{ \log \left[ \frac{p(x, y)}{p(x) p(y)} \right] \right\}.
\end{aligned}
- Mutual information can be interpreted as the reduction in the uncertainty of x due to the knowledge of y, i.e.,

  I(x, y) = H(x) - H(x|y),

  where H(x|y) = -E_{p(x,y)}[\log p(x|y)] is the conditional entropy.
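Both characterizations give the same number on a toy joint distribution (a minimal sketch; the entries of pxy are made up):

```python
import numpy as np

pxy = np.array([[0.3, 0.1],
                [0.1, 0.5]])              # joint p(x, y)
px = pxy.sum(axis=1, keepdims=True)       # marginal p(x)
py = pxy.sum(axis=0, keepdims=True)       # marginal p(y)

# View 1: I(x; y) = D_KL[p(x, y) || p(x) p(y)]
I = np.sum(pxy * np.log2(pxy / (px * py)))

# View 2: I(x; y) = H(x) - H(x|y), with H(x|y) = -E[log p(x|y)]
Hx = -np.sum(px * np.log2(px))
Hx_given_y = -np.sum(pxy * np.log2(pxy / py))
print(I, Hx - Hx_given_y)                 # equal
```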
Convexity, Jensen's Inequality,
and Gibbs' Inequality
Convex Sets and Functions
Definition (Convex Set)
Let C be a subset of ℝ^m. C is called a convex set if

  αx + (1 − α)y ∈ C, ∀x, y ∈ C, ∀α ∈ [0, 1].

Definition (Convex Function)
Let C be a convex subset of ℝ^m. A function f : C → ℝ is called a convex function if

  f(αx + (1 − α)y) ≤ αf(x) + (1 − α)f(y), ∀x, y ∈ C, ∀α ∈ [0, 1].
Jensen’s Inequality
Theorem (Jensen's Inequality)
If f(x) is a convex function and x is a random vector, then

  E[f(x)] ≥ f(E[x]).
Note: Jensen’s inequality can also be rewritten for a concave function,with the direction of the inequality reversed.
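A Monte Carlo illustration with the convex function f(x) = e^x (a sketch; the sample size and the standard normal distribution are arbitrary choices):

```python
import numpy as np
rng = np.random.default_rng(0)

x = rng.normal(size=100_000)    # standard normal samples, E[x] = 0
f = np.exp                      # exp is convex
print(f(x).mean())              # E[f(x)] ≈ e^{1/2} ≈ 1.649
print(f(x.mean()))              # f(E[x]) ≈ e^0 = 1.0, so E[f(x)] ≥ f(E[x])
```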
Proof of Jensen’s Inequality
Need to show that \sum_{i=1}^{N} p_i f(x_i) \geq f\left(\sum_{i=1}^{N} p_i x_i\right). The proof is based on recursion, working from the right-hand side of this inequality.

\begin{aligned}
f\left(\sum_{i=1}^{N} p_i x_i\right) &= f\left(p_1 x_1 + \sum_{i=2}^{N} p_i x_i\right) \\
&\leq p_1 f(x_1) + \left[\sum_{i=2}^{N} p_i\right] f\left(\frac{\sum_{i=2}^{N} p_i x_i}{\sum_{i=2}^{N} p_i}\right) && \left(\text{choose } \alpha = \frac{p_1}{\sum_{i=1}^{N} p_i}\right) \\
&\leq p_1 f(x_1) + \left[\sum_{i=2}^{N} p_i\right] \left\{ \alpha f(x_2) + (1 - \alpha) f\left(\frac{\sum_{i=3}^{N} p_i x_i}{\sum_{i=3}^{N} p_i}\right) \right\} && \left(\text{choose } \alpha = \frac{p_2}{\sum_{i=2}^{N} p_i}\right) \\
&= p_1 f(x_1) + p_2 f(x_2) + \sum_{i=3}^{N} p_i f\left(\frac{\sum_{i=3}^{N} p_i x_i}{\sum_{i=3}^{N} p_i}\right),
\end{aligned}

and so forth.
Gibbs' Inequality

Theorem
D_KL[p‖q] ≥ 0, with equality iff p = q.
Proof: Consider the Kullback-Leibler divergence for discrete distributions:
\begin{aligned}
D_{\mathrm{KL}}[p \| q] &= \sum_i p_i \log \frac{p_i}{q_i} = -\sum_i p_i \log \frac{q_i}{p_i} \\
&\geq -\log \left[ \sum_i p_i \frac{q_i}{p_i} \right] && \text{(by Jensen's inequality)} \\
&= -\log \left[ \sum_i q_i \right] = 0.
\end{aligned}
More on Gibbs' Inequality

In order to find the distribution p which minimizes D_KL[p‖q], we consider the Lagrangian
E = D_{\mathrm{KL}}[p \| q] + \lambda \left(1 - \sum_i p_i\right) = \sum_i p_i \log \frac{p_i}{q_i} + \lambda \left(1 - \sum_i p_i\right).
Compute the partial derivative \partial E / \partial p_k and set it to zero:

\frac{\partial E}{\partial p_k} = \log p_k - \log q_k + 1 - \lambda = 0,

which leads to p_k = q_k e^{\lambda - 1}. It follows from \sum_i p_i = 1 that \sum_i q_i e^{\lambda - 1} = 1, which gives \lambda = 1. Therefore p_i = q_i.
The Hessian, \partial^2 E / \partial p_i^2 = 1/p_i and \partial^2 E / \partial p_i \partial p_j = 0 for i \neq j, is positive definite, which shows that p_i = q_i is a genuine minimum.
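The same conclusion can be reached by numerical optimization over the probability simplex (a sketch assuming SciPy is available; the distribution q is made up):

```python
import numpy as np
from scipy.optimize import minimize

q = np.array([0.5, 0.3, 0.2])

def kl(p):
    return np.sum(p * np.log(p / q))

# Minimize D_KL[p||q] subject to p >= 0 and sum_i p_i = 1.
res = minimize(kl, x0=np.full(3, 1/3),
               bounds=[(1e-9, 1.0)] * 3,
               constraints={"type": "eq", "fun": lambda p: p.sum() - 1.0})
print(res.x)   # ≈ [0.5, 0.3, 0.2]: the minimizer recovers q, as derived
```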