Information Theory Primer: Entropy, KL Divergence, Mutual Information, Jensen's Inequality
Seungjin Choi
Department of Computer Science and Engineering
Pohang University of Science and Technology
77 Cheongam-ro, Nam-gu, Pohang 37673, Korea
[email protected]
http://mlg.postech.ac.kr/~seungjin
Outline
- Entropy (Shannon entropy, differential entropy, and conditional entropy)
- Kullback-Leibler (KL) divergence
- Mutual information
- Jensen's inequality and Gibbs' inequality
Information Theory
- Information theory answers two fundamental questions in communication theory:
  - What is the ultimate data compression? → entropy H.
  - What is the ultimate transmission rate of communication? → channel capacity C.
- In the early 1940s, it was thought that increasing the transmission rate of information over a communication channel increased the probability of error. Shannon surprised the communication theory community by proving that this is not true as long as the communication rate stays below the channel capacity.
- Although information theory was developed for communications, it is also important in explaining the ecological theory of sensory processing. Information theory plays a key role in elucidating the goal of unsupervised learning.
Shannon Entropy
Information and Entropy
- Information can be thought of as surprise, uncertainty, or unexpectedness. Mathematically, it is defined by

  I = -\log p_i,

  where p_i is the probability that the event labelled i occurs. A rare event gives large information, while a frequent event produces small information.
- Entropy is the average information, i.e.,

  H = E[I] = -\sum_{i=1}^{N} p_i \log p_i.
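For concreteness, here is a minimal NumPy sketch of this definition (the function name and the 0 log 0 = 0 convention are our own choices):

```python
import numpy as np

def entropy(p, base=2):
    """Shannon entropy H = -sum_i p_i log p_i, with 0 log 0 taken as 0."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                    # drop zero-probability events
    return -np.sum(p * np.log(p)) / np.log(base)

print(entropy([0.5, 0.5]))          # fair coin: 1.0 bit
print(entropy([0.9, 0.1]))          # biased coin: ~0.469 bits
print(entropy([0.25] * 4))          # uniform over 4 outcomes: 2.0 bits
```

As expected, among distributions on a fixed number of outcomes, the uniform one gives the largest value.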
Example: Horse Race
Suppose we have a horse race with eight horses taking part. Assume that the probabilities of winning for the eight horses are
\left( \frac{1}{2}, \frac{1}{4}, \frac{1}{8}, \frac{1}{16}, \frac{1}{64}, \frac{1}{64}, \frac{1}{64}, \frac{1}{64} \right).
Suppose that we wish to send a message to another person indicating which horse won the race.
How many bits are required to describe this for each of the horses?
3 bits for any of the horses?
No! The win probabilities are not uniform. It makes sense to use shorter descriptions for the more probable horses and longer descriptions for the less probable ones, so that we achieve a lower average description length. For example, we can use the following strings to represent the eight horses:
0, 10, 110, 1110, 111100, 111101, 111110, 111111.
The average description length in this case is 2 bits, as opposed to 3 bits for the uniform code. We calculate the entropy:
H = -\frac{1}{2}\log_2\frac{1}{2} - \frac{1}{4}\log_2\frac{1}{4} - \frac{1}{8}\log_2\frac{1}{8} - \frac{1}{16}\log_2\frac{1}{16} - 4 \cdot \frac{1}{64}\log_2\frac{1}{64} = 2 \text{ bits}.
The entropy of a random variable is a lower bound on the average number of bits required to represent the random variable, and also on the average number of questions needed to identify the variable in a game of "twenty questions".
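The claim is easy to verify numerically; a small sketch (NumPy assumed) comparing the average description length of the code above with the entropy:

```python
import numpy as np

p = np.array([1/2, 1/4, 1/8, 1/16, 1/64, 1/64, 1/64, 1/64])
codes = ["0", "10", "110", "1110", "111100", "111101", "111110", "111111"]

avg_len = sum(pi * len(c) for pi, c in zip(p, codes))
H = -np.sum(p * np.log2(p))
print(avg_len, H)   # 2.0 2.0 -- this code meets the entropy lower bound
```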
Shannon Entropy
Given a discrete random variable X with image \mathcal{X},

- (Shannon) entropy is the average information (a measure of uncertainty), defined by

  H(X) = -\sum_{x \in \mathcal{X}} p(x) \log p(x) = E_p[-\log p(x)].
- Properties
  - H(X) ≥ 0 (each term in the summation is nonnegative).
  - H(X) = 0 if and only if P[X = x] = 1 for some x ∈ \mathcal{X}.
  - The entropy is maximal if all the outcomes are equally likely.
Differential Entropy
Given a continuous random variable X,
- Differential entropy is defined as

  H(p) = -\int p(x) \log p(x) \, dx = -E_p[\log p(x)].
- Properties
  - It can be negative.
  - Given a fixed variance, the Gaussian distribution achieves the maximal differential entropy.
  - For x \sim \mathcal{N}(\mu, \sigma^2), H(x) = \frac{1}{2} \log(2\pi e \sigma^2) (checked numerically below).
  - For x \sim \mathcal{N}(\mu, \Sigma), H(x) = \frac{1}{2} \log \det(2\pi e \Sigma).
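A quick numerical check of the univariate Gaussian formula (a sketch; the grid range and σ = 1.5 are arbitrary choices):

```python
import numpy as np

mu, sigma = 0.0, 1.5
analytic = 0.5 * np.log(2 * np.pi * np.e * sigma**2)

# H(p) = -∫ p(x) log p(x) dx, approximated by a Riemann sum on a fine grid
x = np.linspace(mu - 10 * sigma, mu + 10 * sigma, 200001)
dx = x[1] - x[0]
p = np.exp(-(x - mu)**2 / (2 * sigma**2)) / np.sqrt(2 * np.pi * sigma**2)
numeric = -np.sum(p * np.log(p)) * dx
print(analytic, numeric)   # both ≈ 1.824 nats
```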
Conditional Entropy
Given two discrete random variables X and Y with images \mathcal{X} and \mathcal{Y}, respectively, we expand the joint entropy
\begin{aligned}
H(X, Y) &= \sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} p(x, y) \log \frac{1}{p(x, y)} \\
&= \sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} p(x, y) \log \frac{1}{p(x) p(y|x)} \\
&= \sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} p(x, y) \log \frac{1}{p(x)} + \sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} p(x, y) \log \frac{1}{p(y|x)} \\
&= \sum_{x \in \mathcal{X}} p(x) \log \frac{1}{p(x)} + \sum_{x \in \mathcal{X}} p(x) \sum_{y \in \mathcal{Y}} p(y|x) \log \frac{1}{p(y|x)} \\
&= H(X) + \underbrace{\sum_{x \in \mathcal{X}} p(x) H(Y|X = x)}_{H(Y|X)}.
\end{aligned}

Chain rule: H(X, Y) = H(X) + H(Y|X).
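The chain rule is easy to confirm on a toy joint distribution (a minimal sketch; the entries of pxy are made up):

```python
import numpy as np

pxy = np.array([[0.2, 0.1, 0.1],      # joint p(x, y), rows indexed by x
                [0.3, 0.2, 0.1]])

def H(p):
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

px = pxy.sum(axis=1)                  # marginal p(x)
H_Y_given_X = sum(px[i] * H(pxy[i] / px[i]) for i in range(len(px)))
print(H(pxy.ravel()), H(px) + H_Y_given_X)   # equal: H(X,Y) = H(X) + H(Y|X)
```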
- H(X, Y) ≤ H(X) + H(Y)
- H(Y|X) ≤ H(Y)
Try to prove these by yourself!
KL Divergence
Relative Entropy (Kullback-Leibler Divergence)
- Introduced by Solomon Kullback and Richard Leibler in 1951.
- A measure of how one probability distribution q diverges from another, p.
- Written D_KL[p‖q], the KL divergence of q(x) from p(x).
- For discrete probability distributions p and q:

  D_{\mathrm{KL}}[p \| q] = \sum_x p(x) \log \frac{p(x)}{q(x)}.

- For probability distributions p and q of continuous random variables:

  D_{\mathrm{KL}}[p \| q] = \int p(x) \log \frac{p(x)}{q(x)} \, dx.
- Properties of KL divergence
  - Divergence is not symmetric: D_KL[p‖q] ≠ D_KL[q‖p].
  - Divergence is always nonnegative: D_KL[p‖q] ≥ 0 (Gibbs' inequality).
  - Divergence is a convex function on the domain of probability distributions.
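The first two properties are easy to see numerically; a small sketch (the distributions p and q are made up):

```python
import numpy as np

def kl(p, q):
    """Discrete D_KL[p||q] in bits; assumes q > 0 wherever p > 0."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    mask = p > 0
    return np.sum(p[mask] * np.log2(p[mask] / q[mask]))

p = [0.5, 0.25, 0.25]
q = [0.4, 0.4, 0.2]
print(kl(p, q), kl(q, p))   # different values: not symmetric
print(kl(p, p))             # 0.0: vanishes when the distributions match
```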
Theorem (Convexity of divergence)
Let p1, q1 and p2, q2 be probability distributions over a random variable X, and ∀λ ∈ (0, 1) define
p = λp1 + (1− λ)p2,
q = λq1 + (1− λ)q2.
Then,
DKL [p‖q] ≤ λDKL [p1‖q1] + (1− λ)DKL [p2‖q2] .
Proof. It is deferred to the end of this lecture.
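Meanwhile, a numerical spot-check of the statement, which illustrates but of course does not prove it (a sketch; the random Dirichlet draws and λ = 0.3 are arbitrary choices):

```python
import numpy as np
rng = np.random.default_rng(0)

def kl(p, q):
    return np.sum(p * np.log(p / q))

p1, q1, p2, q2 = (rng.dirichlet(np.ones(5)) for _ in range(4))
lam = 0.3
p = lam * p1 + (1 - lam) * p2
q = lam * q1 + (1 - lam) * q2
print(kl(p, q) <= lam * kl(p1, q1) + (1 - lam) * kl(p2, q2))   # True
```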
Entropy and Divergence
The entropy of a random variable X with a probability distribution p(x) is related to how much p(x) diverges from the uniform distribution on the support of X.
\begin{aligned}
H(X) &= \sum_{x \in \mathcal{X}} p(x) \log \frac{1}{p(x)} = \sum_{x \in \mathcal{X}} p(x) \log \frac{|\mathcal{X}|}{p(x) |\mathcal{X}|} \\
&= \log |\mathcal{X}| - \sum_{x \in \mathcal{X}} p(x) \log \frac{p(x)}{1/|\mathcal{X}|} = \log |\mathcal{X}| - D_{\mathrm{KL}}[p \| \mathrm{unif}].
\end{aligned}
The more p(x) diverges from the uniform distribution, the smaller its entropy, and vice versa.
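A direct numerical check of this identity (a sketch; p is made up, with |X| = 4):

```python
import numpy as np

p = np.array([0.5, 0.2, 0.2, 0.1])
u = np.full(4, 1/4)                       # uniform over |X| = 4 outcomes
H = -np.sum(p * np.log2(p))
kl_to_unif = np.sum(p * np.log2(p / u))
print(H, np.log2(4) - kl_to_unif)         # identical, as derived above
```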
Recall
D_{\mathrm{KL}}[p \| q] = \int p(x) \log \frac{p(x)}{q(x)} \, dx.
Characterizing KL divergence
- If p and q are both high, we are happy.
- If p is high but q isn't, we pay a price.
- If p is low, we do not care.
- If D_KL = 0, then the two distributions are equal.
KL Divergence of Two Gaussians
- Two univariate Gaussians (x ∈ ℝ):
  p(x) = \mathcal{N}(\mu_1, \sigma_1^2) and q(x) = \mathcal{N}(\mu_2, \sigma_2^2)
- Calculated as

\begin{aligned}
D_{\mathrm{KL}}[p \| q] &= \int p(x) \log \frac{p(x)}{q(x)} \, dx \\
&= \int p(x) \log p(x) \, dx - \int p(x) \log q(x) \, dx \\
&= \frac{1}{2} \frac{\sigma_1^2}{\sigma_2^2} + \frac{1}{2} \frac{(\mu_2 - \mu_1)^2}{\sigma_2^2} + \log \frac{\sigma_2}{\sigma_1} - \frac{1}{2}.
\end{aligned}
- Two multivariate Gaussians (x ∈ ℝ^D):
  p(x) = \mathcal{N}(\mu_1, \Sigma_1) and q(x) = \mathcal{N}(\mu_2, \Sigma_2)
- Calculated as

  D_{\mathrm{KL}}[p \| q] = \frac{1}{2} \left[ \mathrm{tr}\left(\Sigma_2^{-1} \Sigma_1\right) + (\mu_2 - \mu_1)^{\top} \Sigma_2^{-1} (\mu_2 - \mu_1) - D + \log \frac{|\Sigma_2|}{|\Sigma_1|} \right].
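The univariate closed form can be checked against a brute-force numerical integral (a sketch; the parameter values are arbitrary):

```python
import numpy as np

mu1, s1, mu2, s2 = 0.0, 1.0, 1.0, 2.0
closed = (0.5 * s1**2 / s2**2 + 0.5 * (mu2 - mu1)**2 / s2**2
          + np.log(s2 / s1) - 0.5)

# D_KL[p||q] = ∫ p log(p/q) dx, approximated on a fine grid
x = np.linspace(-12, 12, 200001)
dx = x[1] - x[0]
p = np.exp(-(x - mu1)**2 / (2 * s1**2)) / np.sqrt(2 * np.pi * s1**2)
q = np.exp(-(x - mu2)**2 / (2 * s2**2)) / np.sqrt(2 * np.pi * s2**2)
numeric = np.sum(p * np.log(p / q)) * dx
print(closed, numeric)   # both ≈ 0.443 nats
```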
Mutual Information
Mutual Information
- Mutual information is the relative entropy between the joint distribution and the product of the marginal distributions,
\begin{aligned}
I(x, y) &= \sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} p(x, y) \log \left[ \frac{p(x, y)}{p(x) p(y)} \right] = D_{\mathrm{KL}}[p(x, y) \| p(x) p(y)] \\
&= E_{p(x,y)} \left\{ \log \left[ \frac{p(x, y)}{p(x) p(y)} \right] \right\}.
\end{aligned}
- Mutual information can be interpreted as the reduction in the uncertainty of x due to the knowledge of y, i.e.,

  I(x, y) = H(x) - H(x|y),

  where H(x|y) = -E_{p(x,y)}[\log p(x|y)] is the conditional entropy.
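Both characterizations give the same number on a toy joint distribution (a minimal sketch; the entries of pxy are made up):

```python
import numpy as np

pxy = np.array([[0.3, 0.1],
                [0.1, 0.5]])              # joint p(x, y)
px = pxy.sum(axis=1, keepdims=True)       # marginal p(x)
py = pxy.sum(axis=0, keepdims=True)       # marginal p(y)

# View 1: I(x; y) = D_KL[p(x, y) || p(x) p(y)]
I = np.sum(pxy * np.log2(pxy / (px * py)))

# View 2: I(x; y) = H(x) - H(x|y), with H(x|y) = -E[log p(x|y)]
Hx = -np.sum(px * np.log2(px))
Hx_given_y = -np.sum(pxy * np.log2(pxy / py))
print(I, Hx - Hx_given_y)                 # equal
```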
Convexity, Jensen's Inequality,
and Gibbs' Inequality
Convex Sets and Functions
Definition (Convex Set)
Let C be a subset of ℝ^m. C is called a convex set if

  αx + (1 − α)y ∈ C, ∀x, y ∈ C, ∀α ∈ [0, 1].

Definition (Convex Function)
Let C be a convex subset of ℝ^m. A function f : C → ℝ is called a convex function if

  f(αx + (1 − α)y) ≤ αf(x) + (1 − α)f(y), ∀x, y ∈ C, ∀α ∈ [0, 1].
Jensen’s Inequality
Theorem (Jensen's Inequality)
If f(x) is a convex function and x is a random vector, then

  E[f(x)] ≥ f(E[x]).
Note: Jensen’s inequality can also be rewritten for a concave function,with the direction of the inequality reversed.
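A Monte Carlo illustration with the convex function f(x) = e^x (a sketch; the sample size and the standard normal distribution are arbitrary choices):

```python
import numpy as np
rng = np.random.default_rng(0)

x = rng.normal(size=100_000)    # standard normal samples, E[x] = 0
f = np.exp                      # exp is convex
print(f(x).mean())              # E[f(x)] ≈ e^{1/2} ≈ 1.649
print(f(x.mean()))              # f(E[x]) ≈ e^0 = 1.0, so E[f(x)] ≥ f(E[x])
```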
Proof of Jensen’s Inequality
Need to show that \sum_{i=1}^{N} p_i f(x_i) \geq f\left(\sum_{i=1}^{N} p_i x_i\right). The proof is based on recursion, working from the right-hand side of this inequality.

\begin{aligned}
f\left(\sum_{i=1}^{N} p_i x_i\right) &= f\left(p_1 x_1 + \sum_{i=2}^{N} p_i x_i\right) \\
&\leq p_1 f(x_1) + \left[\sum_{i=2}^{N} p_i\right] f\left(\frac{\sum_{i=2}^{N} p_i x_i}{\sum_{i=2}^{N} p_i}\right) && \left(\text{choose } \alpha = \frac{p_1}{\sum_{i=1}^{N} p_i}\right) \\
&\leq p_1 f(x_1) + \left[\sum_{i=2}^{N} p_i\right] \left\{ \alpha f(x_2) + (1 - \alpha) f\left(\frac{\sum_{i=3}^{N} p_i x_i}{\sum_{i=3}^{N} p_i}\right) \right\} && \left(\text{choose } \alpha = \frac{p_2}{\sum_{i=2}^{N} p_i}\right) \\
&= p_1 f(x_1) + p_2 f(x_2) + \sum_{i=3}^{N} p_i f\left(\frac{\sum_{i=3}^{N} p_i x_i}{\sum_{i=3}^{N} p_i}\right),
\end{aligned}

and so forth.
Gibbs' Inequality

Theorem
D_KL[p‖q] ≥ 0, with equality iff p = q.
Proof: Consider the Kullback-Leibler divergence for discrete distributions:
\begin{aligned}
D_{\mathrm{KL}}[p \| q] &= \sum_i p_i \log \frac{p_i}{q_i} = -\sum_i p_i \log \frac{q_i}{p_i} \\
&\geq -\log \left[ \sum_i p_i \frac{q_i}{p_i} \right] && \text{(by Jensen's inequality)} \\
&= -\log \left[ \sum_i q_i \right] = 0.
\end{aligned}
More on Gibbs' Inequality

In order to find the distribution p which minimizes D_KL[p‖q], we consider the Lagrangian
E = D_{\mathrm{KL}}[p \| q] + \lambda \left(1 - \sum_i p_i\right) = \sum_i p_i \log \frac{p_i}{q_i} + \lambda \left(1 - \sum_i p_i\right).
Compute the partial derivative \partial E / \partial p_k and set it to zero:

\frac{\partial E}{\partial p_k} = \log p_k - \log q_k + 1 - \lambda = 0,

which leads to p_k = q_k e^{\lambda - 1}. It follows from \sum_i p_i = 1 that \sum_i q_i e^{\lambda - 1} = 1, which gives \lambda = 1. Therefore p_i = q_i.
The Hessian, \partial^2 E / \partial p_i^2 = 1/p_i and \partial^2 E / \partial p_i \partial p_j = 0 for i \neq j, is positive definite, which shows that p_i = q_i is a genuine minimum.
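The same conclusion can be reached by numerical optimization over the probability simplex (a sketch assuming SciPy is available; the distribution q is made up):

```python
import numpy as np
from scipy.optimize import minimize

q = np.array([0.5, 0.3, 0.2])

def kl(p):
    return np.sum(p * np.log(p / q))

# Minimize D_KL[p||q] subject to p >= 0 and sum_i p_i = 1.
res = minimize(kl, x0=np.full(3, 1/3),
               bounds=[(1e-9, 1.0)] * 3,
               constraints={"type": "eq", "fun": lambda p: p.sum() - 1.0})
print(res.x)   # ≈ [0.5, 0.3, 0.2]: the minimizer recovers q, as derived
```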