
Entropy - Linköping University · 2013-04-10


Entropy

Let X be a discrete random variable with alphabet X and probability mass function p(x)

p(x) = Pr{X = x}, x ∈ X

The entropy of the variable X is defined by

H(X) = −∑_{x∈X} p(x) log p(x)

The logarithm can be in any base, but normally base 2 is used. The unit of the entropy is then called bits. If we use base b in the logarithm, we denote the entropy by H_b(X). We can easily convert between entropies in different bases

H_b(X) = log_b a · H_a(X)

By convention 0 log 0 = 0, since y log y → 0 as y → 0.

The entropy is a measure of the information content of a random variable.
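
As a quick illustration, here is a minimal sketch in Python of the definition and the base conversion; the pmf values and the function name entropy are our own choices, not from the slides.

import math

def entropy(pmf, base=2):
    # H(X) = -sum p(x) log p(x), with the convention 0 log 0 = 0
    return -sum(p * math.log(p, base) for p in pmf if p > 0)

# A fair coin has one bit of entropy ...
print(entropy([0.5, 0.5]))               # 1.0 bit
# ... and changing the base only rescales the value by log_b(a)
print(entropy([0.5, 0.5], base=math.e))  # ln 2 ≈ 0.693 nats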


Entropy, cont.

The entropy is always non-negative

H(X) ≥ 0

Proof: Since 0 ≤ p(x) ≤ 1, we have −log p(x) ≥ 0, so every term −p(x) log p(x) in the sum is non-negative.

The entropy is always less than the logarithm of the alphabet size

H(X) ≤ log |X|

with equality if and only if X has a uniform distribution.

Proof: see later slide
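
The bounds are easy to check numerically; the following sketch (with example distributions of our own over a four-symbol alphabet, so log |X| = 2 bits) is one such check.

import math

def entropy(pmf):
    return -sum(p * math.log2(p) for p in pmf if p > 0)

# Four equiprobable symbols reach the upper bound log|X| = 2 bits ...
print(entropy([0.25, 0.25, 0.25, 0.25]))   # 2.0
# ... a skewed distribution over the same alphabet stays below it ...
print(entropy([0.7, 0.1, 0.1, 0.1]))       # ≈ 1.357
# ... and a deterministic outcome gives the lower bound 0
print(entropy([1.0, 0.0, 0.0, 0.0]))       # 0.0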


Binary entropy

[Figure: plot of the binary entropy function H(p) versus p for 0 ≤ p ≤ 1.]

Entropy for a binary variable with symbol probabilities p and 1 − p:

H(p) = −p · log p − (1 − p) · log(1 − p)
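
A short Python sketch of the binary entropy function (the sample points are arbitrary):

import math

def binary_entropy(p):
    # H(p) = -p log p - (1-p) log(1-p), with the convention 0 log 0 = 0
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

# H(p) is symmetric around p = 0.5, where it reaches its maximum of 1 bit
for p in (0.0, 0.1, 0.5, 0.9, 1.0):
    print(p, round(binary_entropy(p), 3))   # 0.0, 0.469, 1.0, 0.469, 0.0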


Joint and conditional entropy

The joint entropy of a pair of discrete random variables (X, Y) with joint probability mass function p(x, y) is defined by

H(X,Y) = −∑_{x∈X} ∑_{y∈Y} p(x,y) log p(x,y)

The conditional entropy of Y given X is defined as

H(Y|X) = ∑_{x∈X} p(x) H(Y|X = x)

       = −∑_{x∈X} p(x) ∑_{y∈Y} p(y|x) log p(y|x)

       = −∑_{x∈X} ∑_{y∈Y} p(x,y) log p(y|x)
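
These definitions translate directly into code. The sketch below assumes a joint pmf stored as a dict keyed by (x, y) pairs; the numbers are a made-up example and also illustrate the chain rule from the next slide.

import math

def H(probs):
    # entropy of a collection of probabilities, with 0 log 0 = 0
    return -sum(p * math.log2(p) for p in probs if p > 0)

# A made-up joint pmf p(x, y) over X = {0, 1}, Y = {0, 1}
p_xy = {(0, 0): 0.5, (0, 1): 0.25, (1, 0): 0.125, (1, 1): 0.125}

H_XY = H(p_xy.values())                       # joint entropy H(X,Y)
p_x = {x: sum(p for (xx, _), p in p_xy.items() if xx == x) for x in (0, 1)}
H_X = H(p_x.values())                         # marginal entropy H(X)
# H(Y|X) = -sum_{x,y} p(x,y) log p(y|x), with p(y|x) = p(x,y)/p(x)
H_Y_given_X = -sum(p * math.log2(p / p_x[x]) for (x, _), p in p_xy.items() if p > 0)

print(H_XY, H_X + H_Y_given_X)   # equal: the chain rule H(X,Y) = H(X) + H(Y|X)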


Chain rule

The joint entropy can be written as the sum

H(X,Y) = H(X) + H(Y|X)

Proof:

H(X,Y) = −∑_{x∈X} ∑_{y∈Y} p(x,y) log p(x,y)

       = −∑_{x∈X} ∑_{y∈Y} p(x,y) log p(x)p(y|x)

       = −∑_{x∈X} ∑_{y∈Y} p(x,y) log p(x) − ∑_{x∈X} ∑_{y∈Y} p(x,y) log p(y|x)

       = −∑_{x∈X} p(x) log p(x) − ∑_{x∈X} ∑_{y∈Y} p(x,y) log p(y|x)

       = H(X) + H(Y|X)


Chain rule, cont.

We of course also have

H(X,Y) = H(Y) + H(X|Y)

The chain rule can easily be generalized to larger collections of random variables. Let X_1, X_2, ..., X_n be drawn according to p(x_1, x_2, ..., x_n). Then we have that

H(X_1, X_2, ..., X_n) = ∑_{i=1}^{n} H(X_i | X_{i−1}, ..., X_1)


Relative entropy

The relative entropy or Kullback-Leibler distance between two probability mass functions p(x) and q(x) is defined by

D(p||q) = ∑_{x∈X} p(x) log (p(x)/q(x))

We use the conventions 0 log (0/0) = 0, 0 log (0/q) = 0 and p log (p/0) = ∞.

The relative entropy is always non-negative and zero if and only if p = q.

Note that the relative entropy is not a true metric, since it is not symmetric and does not satisfy the triangle inequality.
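
A minimal Python sketch of the definition, including the conventions above; the two distributions are arbitrary examples and also show the lack of symmetry.

import math

def kl_divergence(p, q):
    # D(p||q) = sum p(x) log(p(x)/q(x)); 0 log(0/q) = 0 and p log(p/0) = infinity
    d = 0.0
    for px, qx in zip(p, q):
        if px == 0:
            continue
        if qx == 0:
            return math.inf
        d += px * math.log2(px / qx)
    return d

p = [0.5, 0.25, 0.25]
q = [1 / 3, 1 / 3, 1 / 3]
print(kl_divergence(p, q), kl_divergence(q, p))  # different values: not symmetric
print(kl_divergence(p, p))                       # 0.0: zero if and only if p = q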


Mutual information

The mutual information between random variables X and Y with joint probability mass function p(x, y) and marginal probability mass functions p(x) and p(y) is defined as

I(X;Y) = ∑_{x∈X} ∑_{y∈Y} p(x,y) log (p(x,y) / (p(x)p(y)))

       = D(p(x,y) || p(x)p(y))

The mutual information is a measure of the amount of information thatone random variable contains about another random variable.
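
The formula above is straightforward to evaluate for a small joint pmf; the sketch below uses a made-up distribution stored as a dict keyed by (x, y).

import math

# Mutual information computed directly as D(p(x,y) || p(x)p(y))
p_xy = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}

p_x = {x: sum(p for (xx, _), p in p_xy.items() if xx == x) for x in (0, 1)}
p_y = {y: sum(p for (_, yy), p in p_xy.items() if yy == y) for y in (0, 1)}

I = sum(p * math.log2(p / (p_x[x] * p_y[y]))
        for (x, y), p in p_xy.items() if p > 0)
print(I)   # ≈ 0.278 bits of shared information; 0 would mean independence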


Mutual information, cont.

The mutual information can be rewritten as

I(X;Y) = ∑_{x∈X} ∑_{y∈Y} p(x,y) log (p(x,y) / (p(x)p(y)))

       = ∑_{x∈X} ∑_{y∈Y} p(x,y) log (p(x|y) / p(x))

       = −∑_{x∈X} ∑_{y∈Y} p(x,y) log p(x) + ∑_{x∈X} ∑_{y∈Y} p(x,y) log p(x|y)

       = H(X) − H(X|Y)

or symmetrically as

I(X;Y) = H(Y) − H(Y|X)

Thus X says as much about Y as Y says about X.


Mutual information, cont.

Note that

I(X;Y) = I(Y;X)

and

I(X;X) = H(X) − H(X|X) = H(X)

Using the chain rule, we also note that

I(X;Y) = H(X) + H(Y) − H(X,Y)


Mutual information, cont.

The conditional mutual information of random variables X and Y given Z is defined by

I(X;Y|Z) = H(X|Z) − H(X|Y,Z)

Chain rule for mutual information

I(X_1, X_2, ..., X_n; Y) = ∑_{i=1}^{n} I(X_i; Y | X_{i−1}, X_{i−2}, ..., X_1)


Jensen’s inequality

A function f is said to be convex over an interval (a, b) if for every x_1, x_2 ∈ (a, b) and 0 ≤ λ ≤ 1

f(λx_1 + (1 − λ)x_2) ≤ λf(x_1) + (1 − λ)f(x_2)

The function is said to be strictly convex if equality holds only when λ = 0 or λ = 1. A function f is (strictly) concave if −f is (strictly) convex.

Jensen’s inequality: If f is a convex function and X is a random variable, then

Ef(X) ≥ f(EX)

If f is strictly convex, then equality implies that X = EX with probability 1, i.e. X is a constant.
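
A quick numerical illustration of Jensen’s inequality (our own example, with the convex function f(x) = x² and X uniform on (0, 1)):

import random

# For f(x) = x^2, Jensen's inequality says E[X^2] >= (E[X])^2,
# and the gap is exactly the variance of X.
random.seed(0)
samples = [random.uniform(0, 1) for _ in range(100000)]

mean = sum(samples) / len(samples)
mean_of_squares = sum(x * x for x in samples) / len(samples)

print(mean_of_squares, mean ** 2)   # ≈ 1/3 vs ≈ 1/4: E[X^2] >= (E[X])^2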


Information inequality

Let p(x), q(x), x ∈ X be two probability mass functions. Then

D(p||q) ≥ 0

with equality if and only if p(x) = q(x) for all x.

Proof: Let A = {x : p(x) > 0}. Then

−D(p||q) = −∑_{x∈A} p(x) log (p(x)/q(x)) = ∑_{x∈A} p(x) log (q(x)/p(x))

         ≤ log ∑_{x∈A} p(x) · (q(x)/p(x)) = log ∑_{x∈A} q(x)

         ≤ log ∑_{x∈X} q(x) = log 1 = 0

The first inequality follows from Jensen’s inequality. Since log t is a strictly concave function of t, we have equality if and only if q(x)/p(x) is constant everywhere, i.e. p(x) = q(x).


Mutual information, entropy

For any two random variables X and Y we have

I(X;Y) ≥ 0

with equality if and only if X and Y are independent.

Proof: I(X;Y) = D(p(x,y) || p(x)p(y)) ≥ 0, with equality if and only if p(x,y) = p(x)p(y), i.e. X and Y are independent.

The entropy is bounded by the logarithm of the alphabet size

H(X) ≤ log |X|

Proof: Let u(x) = 1/|X| be the uniform probability mass function over X, and let p(x) be the probability mass function for X. Then

D(p||u) = ∑_x p(x) log (p(x)/u(x)) = log |X| − H(X)

Since D(p||u) ≥ 0, we get the inequality.


Conditional entropy

Conditioning reduces entropy

H(X|Y) ≤ H(X)

with equality if and only if X and Y are independent.

Proof: 0 ≤ I(X;Y) = H(X) − H(X|Y)

Knowing another random variable Y reduces (on average) the uncertainty of the variable X.


Markov chains

Random variables X, Y, Z are said to form a Markov chain in that order (denoted X → Y → Z) if the conditional distribution of Z depends only on Y and is conditionally independent of X, i.e. if the joint probability mass function can be written as

p(x, y, z) = p(x) p(y|x) p(z|y)

X → Y → Z if and only if X and Z are conditionally independent given Y. Markovity gives

p(x, z|y) = p(x, y, z)/p(y) = p(x, y) p(z|y)/p(y) = p(x|y) p(z|y)

X → Y → Z implies that Z → Y → X. If Z = f(Y), then X → Y → Z.


Data processing inequality

If X → Y → Z, then I(X;Y) ≥ I(X;Z).

Proof: By the chain rule, we can expand the mutual information in two ways

I(X;Y,Z) = I(X;Z) + I(X;Y|Z)

         = I(X;Y) + I(X;Z|Y)

Since X and Z are conditionally independent given Y, we have I(X;Z|Y) = 0. Since I(X;Y|Z) ≥ 0, we get the inequality. We have equality if and only if I(X;Y|Z) = 0, i.e. X → Z → Y forms a Markov chain.

This means that no processing of Y, deterministic or random, can increase the amount of information that Y contains about X.
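
The inequality is easy to check numerically for a deterministic processing step Z = f(Y); the joint pmf and the function f below are made-up examples.

import math

def mutual_information(p_ab):
    # I(A;B) from a joint pmf given as {(a, b): probability}
    p_a, p_b = {}, {}
    for (a, b), p in p_ab.items():
        p_a[a] = p_a.get(a, 0) + p
        p_b[b] = p_b.get(b, 0) + p
    return sum(p * math.log2(p / (p_a[a] * p_b[b]))
               for (a, b), p in p_ab.items() if p > 0)

# Made-up joint pmf p(x, y) and a deterministic processing step z = f(y)
p_xy = {(0, 0): 0.4, (0, 1): 0.1, (0, 2): 0.0,
        (1, 0): 0.1, (1, 1): 0.2, (1, 2): 0.2}
f = {0: 0, 1: 0, 2: 1}          # Z = f(Y), so X -> Y -> Z is a Markov chain

p_xz = {}
for (x, y), p in p_xy.items():
    p_xz[(x, f[y])] = p_xz.get((x, f[y]), 0) + p

print(mutual_information(p_xy) >= mutual_information(p_xz))   # True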


Law of large numbers

The law of large numbers states that for independent, identically distributed (i.i.d.) random variables, (1/n) ∑_{i=1}^{n} X_i is close to the expected value EX for large n.

Definition: Given a sequence of random variables X_1, X_2, ..., we say that the sequence converges to a random variable X

1. In probability if for every ε > 0, Pr{|X_n − X| > ε} → 0

2. In mean square if E(X_n − X)² → 0

3. With probability 1 (or almost surely) if Pr{lim_{n→∞} X_n = X} = 1


Asymptotic equipartition property

If the random variables X_1, X_2, ... are i.i.d. ∼ p(x), then

−(1/n) log p(X_1, X_2, ..., X_n) → H(X) in probability

Proof: Functions of independent random variables are also independent random variables. Since the X_i are i.i.d., so are log p(X_i). Thus, by the law of large numbers

−(1/n) log p(X_1, X_2, ..., X_n) = −(1/n) ∑_i log p(X_i)

                                 → −E log p(X) in probability

                                 = H(X)
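
The convergence can be observed empirically by drawing i.i.d. sequences and computing −(1/n) log p(X_1, ..., X_n); the source pmf below is an arbitrary example with H(X) = 1.5 bits.

import math
import random

pmf = {'a': 0.5, 'b': 0.25, 'c': 0.25}
H = -sum(p * math.log2(p) for p in pmf.values())   # 1.5 bits

random.seed(0)
symbols, probs = zip(*pmf.items())
for n in (10, 100, 10000):
    sequence = random.choices(symbols, weights=probs, k=n)
    # -(1/n) log p(X_1, ..., X_n) = -(1/n) sum_i log p(X_i)
    sample_rate = -sum(math.log2(pmf[s]) for s in sequence) / n
    print(n, round(sample_rate, 3), 'vs H(X) =', H)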


Typical sets

The typical set A_ε^(n) with respect to p(x) is the set of sequences (x_1, x_2, ..., x_n) with the property

2^{−n(H(X)+ε)} ≤ p(x_1, x_2, ..., x_n) ≤ 2^{−n(H(X)−ε)}

The set A_ε^(n) has the following properties

1. If (x_1, x_2, ..., x_n) ∈ A_ε^(n), then H(X) − ε ≤ −(1/n) log p(x_1, x_2, ..., x_n) ≤ H(X) + ε

2. Pr{A_ε^(n)} > 1 − ε for sufficiently large n

3. |A_ε^(n)| ≤ 2^{n(H(X)+ε)}

4. |A_ε^(n)| ≥ (1 − ε) 2^{n(H(X)−ε)} for sufficiently large n

The typical set has probability near 1, all elements of the typical set are nearly equiprobable, and the number of elements in the set is nearly 2^{nH}.
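
For a small alphabet and moderate n the typical set can be enumerated by brute force; the source pmf, n and ε below are arbitrary choices for illustration.

import math
from itertools import product

pmf = {'a': 0.7, 'b': 0.3}
H = -sum(p * math.log2(p) for p in pmf.values())   # ≈ 0.881 bits

n, eps = 14, 0.2
typical = []
total_prob = 0.0
for seq in product(pmf, repeat=n):                 # all |X|^n sequences
    p = math.prod(pmf[s] for s in seq)
    if 2 ** (-n * (H + eps)) <= p <= 2 ** (-n * (H - eps)):
        typical.append(seq)
        total_prob += p

print(len(typical), '<=', 2 ** (n * (H + eps)))    # property 3: size bound
print(total_prob)                                  # close to 1 for large enough n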


Typical sets, cont.

Property 1 follows directly from the definition of the typical set.

Property 2 follows from the AEP proof.

Property 3 follows from

1 = ∑_{x∈X^n} p(x) ≥ ∑_{x∈A_ε^(n)} p(x)

  ≥ ∑_{x∈A_ε^(n)} 2^{−n(H(X)+ε)} = 2^{−n(H(X)+ε)} |A_ε^(n)|

and thus |A_ε^(n)| ≤ 2^{n(H(X)+ε)}.


Typical sets, cont.

Finally, for sufficiently large n we have Pr{A_ε^(n)} > 1 − ε. Thus property 4 follows from

1 − ε < Pr{A_ε^(n)} ≤ ∑_{x∈A_ε^(n)} 2^{−n(H(X)−ε)} = 2^{−n(H(X)−ε)} |A_ε^(n)|

and thus |A_ε^(n)| ≥ (1 − ε) 2^{n(H(X)−ε)}.


Uses of entropy and mutual information

Lossless source coding

Assume that we want to compress a sequence of i.i.d. random variables X_i. The entropy of the variables limits the achievable rate R (in bits/symbol)

R ≥ H(X_i)

Channel coding

Assume that we want to transmit data over a discrete noisy channel, described by a conditional distribution p(y|x). The capacity C of the channel (i.e. the maximum number of bits we can transmit over the channel per channel use and still be able to correct any errors) is given by maximizing the mutual information between the channel input and the channel output over all input distributions p(x).

C = max_{p(x)} I(X;Y)
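
As a concrete illustration of the capacity formula (a sketch using the binary symmetric channel, which is not covered on these slides; the crossover probability 0.1 is arbitrary):

import math

def h(p):
    # binary entropy function
    return 0.0 if p in (0.0, 1.0) else -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def bsc_mutual_information(p1, eps):
    # I(X;Y) for input distribution Pr{X=1} = p1 on a binary symmetric channel
    # with crossover probability eps: I(X;Y) = H(Y) - H(Y|X) = H(Y) - h(eps)
    py1 = p1 * (1 - eps) + (1 - p1) * eps    # Pr{Y = 1}
    return h(py1) - h(eps)

eps = 0.1
# Maximize I(X;Y) over the input distribution by a simple grid search
C = max(bsc_mutual_information(p1 / 1000, eps) for p1 in range(1001))
print(C, 1 - h(eps))   # both ≈ 0.531 bits: the maximum is at the uniform input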