
Information Theory Basics

crandall@cs.unm.edu

What is information theory?

A way to quantify information

A lot of the theory comes from two worlds: channel coding and compression

Useful for lots of other things

Claude Shannon, mid- to late 1940s

Requirements

“This data will compress to at most N bits”

“This channel will allow us to transmit N bits per second”

“This plaintext will require at least N bans of ciphertext”

N is a number for the amount of information/uncertainty/entropy of a random variable X, that is, H(X) = N

???

What are the requirements for such a measure?

E.g., Continuity: changing the probabilities a small amount should change the measure by only a small amount.

Maximum

Which distribution should have the maximum entropy?

For equiprobable events, what should happen if we increase the number of outcomes?

Symmetry

The measure should be unchanged if the outcomes are re-ordered

Additivity

Amount of entropy should be independent of how we divide the process into parts.

Entropy of Discrete RVs

Expected value of the amount of information for an event
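
As a minimal sketch (in Python, not from the slides), the definition for a discrete random variable, H(X) = -Σ p(x) lg p(x), can be computed directly from a list of probabilities; the printed values match the coin examples below.

    import math

    def entropy(probs):
        """Shannon entropy in bits: -sum of p * lg p (terms with p = 0 contribute 0)."""
        return -sum(p * math.log2(p) for p in probs if p > 0)

    print(entropy([0.5, 0.5]))       # fair coin: 1.0 bit
    print(entropy([0.125] * 8))      # three fair coins jointly: 3.0 bits
    print(entropy([0.6, 0.4]))       # biased coin A: ~0.9710 bits
    print(entropy([0.95, 0.05]))     # biased coin B: ~0.2864 bits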

Flip a fair coin

(-0.5 lg 0.5) + (-0.5 lg 0.5) = 1.0

Flip three fair coins?

Flip three fair coins

Per coin: 3 * [(-0.5 lg 0.5) + (-0.5 lg 0.5)] = 3.0

Over the eight equally likely joint outcomes: 8 * (-0.125 lg 0.125) = 3.0

Flip biased coin A

60% heads

Biased coin A

(-0.6 lg 0.6) + (-0.4 lg 0.4) = 0.970950594

Biased coin B

95% heads

(-0.95 lg 0.95) + (-0.05 lg 0.05) = 0.286396957

Why is there less information in biased coins?

Information = uncertainty = entropy

Flip A, then flip B

A: (-0.6 lg 0.6) + (-0.4 lg 0.4) = 0.970950594

B: (-0.95 lg 0.95) + (-0.05 lg 0.05) = 0.286396957

A then B (independent flips): ((-0.6 lg 0.6) + (-0.4 lg 0.4)) + ((-0.95 lg 0.95) + (-0.05 lg 0.05)) = 0.970950594 + 0.286396957 = 1.25734755

Over the four joint outcomes: (-(0.6*0.95) lg(0.6*0.95)) + (-(0.6*0.05) lg(0.6*0.05)) + (-(0.4*0.95) lg(0.4*0.95)) + (-(0.4*0.05) lg(0.4*0.05)) = 1.25734755
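
A quick numerical check of this additivity for independent flips (a Python sketch, using the coin A and coin B probabilities above and the same entropy helper as in the earlier sketch):

    import math

    def entropy(probs):                       # same helper as in the earlier sketch
        return -sum(p * math.log2(p) for p in probs if p > 0)

    A = [0.6, 0.4]                            # biased coin A
    B = [0.95, 0.05]                          # biased coin B
    joint = [a * b for a in A for b in B]     # independence: p(a, b) = p(a) * p(b)
    print(entropy(A) + entropy(B))            # ~1.25734755
    print(entropy(joint))                     # ~1.25734755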

Entropy (summary)

Continuity, maximum, symmetry, additivity

Example: Maximum Entropy

Wikipedia: “Maximum-likelihood estimators can lack asymptotic normality and can be inconsistent if there is a failure of one (or more) of the below regularity conditions... Estimate on boundary, Data boundary parameter-dependent, Nuisance parameters, Increasing information...”

“Subject to known constraints, the probability distribution which best represents the current state of knowledge is the one with largest entropy.” What distribution maximizes entropy?
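
As a small illustration of the principle (a hypothetical comparison in Python, not from the slides): with no constraint beyond having four outcomes, the uniform distribution attains the largest entropy, lg 4 = 2 bits.

    import math

    def entropy(probs):
        return -sum(p * math.log2(p) for p in probs if p > 0)

    candidates = {
        "uniform":       [0.25, 0.25, 0.25, 0.25],
        "mildly skewed": [0.4, 0.3, 0.2, 0.1],
        "very skewed":   [0.85, 0.05, 0.05, 0.05],
    }
    for name, p in candidates.items():
        print(f"{name:13s} H = {entropy(p):.4f} bits")   # uniform gives the max, lg 4 = 2 bits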

Beyond Entropy

Flip a fair coin for X; if heads, flip coin A for Y; if tails, flip coin B for Y

H(X) = 1.0

H(Y) = (-(0.5*0.6 + 0.5*0.95) lg(0.5*0.6 + 0.5*0.95)) + (-(0.5*0.4 + 0.5*0.05) lg(0.5*0.4 + 0.5*0.05)) = 0.769192829

Joint entropy H(X,Y) = (-(0.5*0.6) lg(0.5*0.6)) + (-(0.5*0.95) lg(0.5*0.95)) + (-(0.5*0.4) lg(0.5*0.4)) + (-(0.5*0.05) lg(0.5*0.05)) = 1.62867378

H(X) + H(Y) = 1.769192829, so where are the other 1.769192829 – 1.62867378 = 0.140519049 bits of information?
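
A sketch of the same calculation from the joint distribution p(x, y) (Python; the outcome labels are illustrative):

    import math

    def entropy(probs):
        return -sum(p * math.log2(p) for p in probs if p > 0)

    # Joint distribution p(x, y): X is the fair coin, Y is the follow-up biased flip.
    joint = {('H', 'h'): 0.5 * 0.60, ('H', 't'): 0.5 * 0.40,
             ('T', 'h'): 0.5 * 0.95, ('T', 't'): 0.5 * 0.05}

    p_x, p_y = {}, {}
    for (x, y), p in joint.items():           # marginals
        p_x[x] = p_x.get(x, 0) + p
        p_y[y] = p_y.get(y, 0) + p

    H_X  = entropy(p_x.values())              # 1.0
    H_Y  = entropy(p_y.values())              # ~0.769192829
    H_XY = entropy(joint.values())            # ~1.62867378
    print(H_X + H_Y - H_XY)                   # ~0.140519049, the "missing" bits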

Mutual Information

I(X;Y) = H(X) + H(Y) – H(X,Y) = 0.140519049

What are H(X|Y) and H(Y|X)?
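
One way to answer, using the chain rule H(X,Y) = H(Y) + H(X|Y): H(X|Y) = 1.62867378 – 0.769192829 = 0.859480951 and H(Y|X) = 1.62867378 – 1.0 = 0.62867378; in both cases the marginal entropy minus the conditional entropy is I(X;Y) = 0.140519049.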

Example: sufficient statistics

Students asked to flip a coin 100 times and record the result

How to detect the cheaters?

Example: sufficient statistics

f(x) is a family of probability mass functions indexed by θ; X is a sample from a distribution in this family.

T(X) is a statistic: a function of the sample, like the sample mean, sample variance, …

I(θ;T(X)) ≤ I(θ;X)

Equality only if no information is lost
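
A numerical illustration of this data-processing inequality (a Python sketch with an assumed two-point prior on θ, not from the slides): for two Bernoulli(θ) flips, the number of heads is a sufficient statistic and preserves all of I(θ;X), while keeping only the first flip loses information.

    import math
    from itertools import product

    def mutual_information(joint):
        """I(A;B) in bits from a dict {(a, b): probability}."""
        pa, pb = {}, {}
        for (a, b), p in joint.items():
            pa[a] = pa.get(a, 0) + p
            pb[b] = pb.get(b, 0) + p
        return sum(p * math.log2(p / (pa[a] * pb[b]))
                   for (a, b), p in joint.items() if p > 0)

    prior = {0.25: 0.5, 0.75: 0.5}            # assumed two-point prior on the bias θ

    joint_x, joint_t, joint_u = {}, {}, {}    # (θ, sample), (θ, #heads), (θ, first flip)
    for th, p_th in prior.items():
        for x in product([1, 0], repeat=2):   # 1 = heads, two independent flips
            p = p_th * math.prod(th if b else 1 - th for b in x)
            joint_x[(th, x)]      = joint_x.get((th, x), 0) + p
            joint_t[(th, sum(x))] = joint_t.get((th, sum(x)), 0) + p
            joint_u[(th, x[0])]   = joint_u.get((th, x[0]), 0) + p

    print(mutual_information(joint_x))   # I(θ; X)
    print(mutual_information(joint_t))   # equal: the number of heads is sufficient
    print(mutual_information(joint_u))   # smaller: the first flip alone loses information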

Kullback-Leibler divergence (a.k.a. relative entropy)

Process 1: Flip unbiased coin, if heads flip biased coin A (60% heads), if tails flip biased coin B (95% heads)

Process 2: Roll a fair die. 1, 2, or 3 = (tails, heads). 4 = (heads, heads). 5 = (heads, tails). 6 = (tails, tails).

Process 3: Flip two fair coins, just record the results.

Which, out of 2 and 3, is a better approximate model of 1?

Kullback-Leibler divergence (a.k.a. relative entropy)

P is the true distribution, Q is the model

Dkl(P1||P2) = 0.48080, Dkl(P1||P3) = 0.37133

Note that Dkl is not symmetric:

Dkl(P2||P1) = 0.52898, Dkl(P3||P1) = 0.61371
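
A minimal sketch of Dkl(P||Q) = Σ p lg(p/q) in Python, checking the process-1-versus-process-3 values above (outcome order: first flip, then second flip):

    import math

    def kl_divergence(p, q):
        """D(P||Q) in bits; assumes q[i] > 0 wherever p[i] > 0."""
        return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

    # Outcomes: (H, h), (H, t), (T, h), (T, t)
    P1 = [0.5 * 0.60, 0.5 * 0.40, 0.5 * 0.95, 0.5 * 0.05]   # process 1
    P3 = [0.25, 0.25, 0.25, 0.25]                           # process 3: two fair coins

    print(kl_divergence(P1, P3))   # ~0.37133
    print(kl_divergence(P3, P1))   # ~0.61371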

Conditional mutual information

I(X;Y|Z) is the expected value of the mutual information between X and Y conditioned on Z
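
Equivalently, I(X;Y|Z) = H(X|Z) – H(X|Y,Z), the average over z of I(X;Y | Z = z).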

Interaction information

I(X;Y;Z) is the information bound up in a set of variables beyond that which is present in any subset

I(X;Y;Z) = I(X;Y|Z) – I(X;Y) = I(X;Z|Y) – I(X;Z) = I(Y;Z|X) - I(Y;Z)

Negative interaction information: X is rain, Y is dark, Z is clouds

Positive interaction information: X is fuel pump blocked, Y is battery dead, Z is car starts
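
A small check of the sign convention above (a Python sketch using the standard XOR example, not from the slides): if X and Y are independent fair bits and Z = X xor Y, then I(X;Y) = 0 but I(X;Y|Z) = 1 bit, so I(X;Y;Z) = +1, the same "explaining away" effect as the car example.

    import math
    from itertools import product

    def H(probs):
        return -sum(p * math.log2(p) for p in probs if p > 0)

    # X, Y independent fair bits; Z = X xor Y is determined by them.
    joint = {(x, y, x ^ y): 0.25 for x, y in product([0, 1], repeat=2)}

    def marginal(keep):
        """Marginal over the coordinates listed in keep, e.g. [0, 2] for (X, Z)."""
        out = {}
        for xyz, p in joint.items():
            key = tuple(xyz[i] for i in keep)
            out[key] = out.get(key, 0) + p
        return out

    H_Z   = H(marginal([2]).values())
    H_XY  = H(marginal([0, 1]).values())
    H_XZ  = H(marginal([0, 2]).values())
    H_YZ  = H(marginal([1, 2]).values())
    H_XYZ = H(joint.values())

    I_XY   = H(marginal([0]).values()) + H(marginal([1]).values()) - H_XY  # 0.0
    I_XY_Z = H_XZ + H_YZ - H_Z - H_XYZ                                     # I(X;Y|Z) = 1.0
    print(I_XY, I_XY_Z, I_XY_Z - I_XY)     # interaction information I(X;Y;Z) = +1 bit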

Other fun things you should look into if you're interested...

Writing on dirty paper

Wire-tap channels

Algorithmic complexity

Chaitin's constant (Goldbach's conjecture, Riemann hypothesis)

Portfolio theory