Introduction to Natural Language Processing (600.465)
Essential Information Theory I

Jan Hajic, JHU CS
AI-lab, 2003.09.24
The Notion of Entropy

• Entropy ~ "chaos", fuzziness, the opposite of order... you know it:
  – it is much easier to create a "mess" than to tidy things up...
• Comes from physics:
  – entropy does not go down unless energy is used
• Measure of uncertainty:
  – if low: low uncertainty; the higher the entropy, the higher the uncertainty, but also the higher the "surprise" (information) we can get out of an experiment
The Formula

• Let pX(x) be a distribution of the random variable X.
• Basic outcomes (alphabet): Ω

  H(X) = -Σ_{x∈Ω} p(x) log2 p(x)

• Unit: bits (with the natural log: nats).
• Notation: H(X) = Hp(X) = H(p) = HX(p) = H(pX)
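To make the definition concrete, here is a minimal Python sketch (not part of the original slides; the helper name entropy and the dict representation of the distribution are illustrative assumptions):

    # Minimal sketch: H(p) in bits for a distribution given as outcome -> probability.
    # Outcomes with p(x) = 0 are skipped, following the convention 0 * log 0 = 0.
    from math import log2

    def entropy(p):
        return -sum(px * log2(px) for px in p.values() if px > 0)

    print(entropy({"head": 0.5, "tail": 0.5}))  # 1.0 bit for a fair coin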
Using the Formula: Example

• Toss a fair coin: Ω = {head, tail}
  – p(head) = 0.5, p(tail) = 0.5
  – H(p) = -0.5 log2(0.5) + (-0.5 log2(0.5)) = 2 × ((-0.5) × (-1)) = 2 × 0.5 = 1
• Take a fair, 32-sided die: p(x) = 1/32 for every side x
  – H(p) = -Σ_{i=1..32} p(xi) log2 p(xi) = -32 × (p(x1) × log2 p(x1)) (since for all i, p(xi) = p(x1) = 1/32) = -32 × ((1/32) × (-5)) = 5 (now you see why it's called bits)
• Unfair coin:
  – p(head) = 0.2 → H(p) = 0.722; p(head) = 0.01 → H(p) = 0.081
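The three numbers on this slide can be checked with a few lines of Python (a sketch using the same formula; the list representation of the distributions is an assumption for illustration):

    from math import log2

    def entropy(probs):
        # H(p) in bits; zero-probability outcomes contribute nothing
        return -sum(p * log2(p) for p in probs if p > 0)

    print(entropy([0.5, 0.5]))              # fair coin: 1.0
    print(entropy([1/32] * 32))             # fair 32-sided die: 5.0
    print(round(entropy([0.2, 0.8]), 3))    # unfair coin, p(head) = 0.2: 0.722
    print(round(entropy([0.01, 0.99]), 3))  # p(head) = 0.01: 0.081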
Example: Book Availability

[Figure: the entropy H(p) plotted against p(Book Available); the curve runs from 0 at p = 0 up to its maximum of 1 at p = 0.5 and back to 0 at p = 1, with the points labeled "bad bookstore" and "good bookstore" near the two extremes.]
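The curve in the graph is the binary entropy function; the values below (a small sketch, not from the slides) show how it rises from 0 at the certain extremes to 1 bit at p = 0.5:

    from math import log2

    def H2(p):
        # binary entropy: H(p) = -p log2 p - (1-p) log2 (1-p)
        return 0.0 if p in (0.0, 1.0) else -p * log2(p) - (1 - p) * log2(1 - p)

    for p in (0.0, 0.1, 0.3, 0.5, 0.7, 0.9, 1.0):
        print(f"p(Book Available) = {p:.1f}   H(p) = {H2(p):.3f}")
    # near 0 at both ends (availability almost certain either way), maximal at p = 0.5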
The Limits

• When is H(p) = 0?
  – if the result of the experiment is known ahead of time:
  – necessarily: ∃x ∈ Ω: p(x) = 1, and ∀y ∈ Ω, y ≠ x: p(y) = 0
• Upper bound? None in general.
  – for |Ω| = n: H(p) ≤ log2 n
  – nothing can be more uncertain than the uniform distribution
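A quick numerical illustration of the upper bound (a sketch; the particular skewed distribution is made up for the example):

    from math import log2

    def entropy(probs):
        return -sum(p * log2(p) for p in probs if p > 0)

    n = 8
    print(log2(n))                        # 3.0: the bound log2 n for |Ω| = 8
    print(entropy([1/n] * n))             # 3.0: the uniform distribution reaches it
    print(entropy([0.65] + [0.05] * 7))   # < 3.0 for this non-uniform distribution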
Entropy and Expectation

• Recall:
  – E(X) = Σ_{x∈X(Ω)} pX(x) · x
• Then:
  E(log2(1/pX(x))) = Σ_{x∈X(Ω)} pX(x) log2(1/pX(x)) =
  = -Σ_{x∈X(Ω)} pX(x) log2 pX(x) =
  = H(pX) (written H(p) in our notation)
Perplexity: Motivation

• Recall:
  – 2 equiprobable outcomes: H(p) = 1 bit
  – 32 equiprobable outcomes: H(p) = 5 bits
  – 4.3 billion equiprobable outcomes: H(p) ≈ 32 bits
• What if the outcomes are not equiprobable?
  – 32 outcomes, 2 equiprobable at 0.5, the rest impossible:
    • H(p) = 1 bit
  – Is there a measure for comparing the entropy (i.e. uncertainty/difficulty of prediction) (also) across random variables with different numbers of outcomes?
Perplexity

• Perplexity:
  – G(p) = 2^H(p)
• ... so we are back at 32 (for 32 equiprobable outcomes), at 2 for a fair coin, etc.
• it is easier to imagine:
  – NLP example: the size of a vocabulary with a uniform distribution that is equally hard to predict
• the "wilder" (more biased) the distribution, the better:
  – lower entropy, lower perplexity
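A small sketch of G(p) = 2^H(p) (the function names and example distributions are illustrative assumptions):

    from math import log2

    def entropy(probs):
        return -sum(p * log2(p) for p in probs if p > 0)

    def perplexity(probs):
        # G(p) = 2 ** H(p): the size of a uniform distribution that is equally hard to predict
        return 2 ** entropy(probs)

    print(perplexity([1/32] * 32))               # 32.0 for 32 equiprobable outcomes
    print(perplexity([0.5, 0.5]))                # 2.0 for a fair coin
    print(perplexity([0.5, 0.5] + [0.0] * 30))   # 2.0: impossible outcomes do not count
    print(perplexity([0.9, 0.1]))                # < 2: a biased coin is easier to predict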
Joint Entropy and Conditional Entropy

• Two random variables: X (over space Ω), Y (over space Ψ)
• Joint entropy:
  – no big deal: (X,Y) is considered a single event:

  H(X,Y) = -Σ_{x∈Ω} Σ_{y∈Ψ} p(x,y) log2 p(x,y)

• Conditional entropy:

  H(Y|X) = -Σ_{x∈Ω} Σ_{y∈Ψ} p(x,y) log2 p(y|x)

  (recall that H(X) = E(log2(1/pX(x))); this is a weighted "average", and the weights are not conditional)
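Both quantities can be computed directly from a joint table; the toy distribution below is an assumption made up for illustration, not one from the lecture:

    from math import log2

    # toy joint distribution p(x, y) with X in {0, 1} and Y in {'a', 'b'}
    p_xy = {(0, 'a'): 0.4, (0, 'b'): 0.1, (1, 'a'): 0.25, (1, 'b'): 0.25}

    def joint_entropy(p_xy):
        # H(X,Y) = -sum over (x,y) of p(x,y) log2 p(x,y)
        return -sum(p * log2(p) for p in p_xy.values() if p > 0)

    def conditional_entropy(p_xy):
        # H(Y|X) = -sum over (x,y) of p(x,y) log2 p(y|x), with p(y|x) = p(x,y) / p(x)
        p_x = {}
        for (x, _), p in p_xy.items():
            p_x[x] = p_x.get(x, 0.0) + p
        return -sum(p * log2(p / p_x[x]) for (x, _), p in p_xy.items() if p > 0)

    print(joint_entropy(p_xy))        # ~1.861 bits
    print(conditional_entropy(p_xy))  # ~0.861 bits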
Conditional Entropy (Using the Calculus)

• Other definition:

  H(Y|X) = Σ_{x∈Ω} p(x) H(Y|X=x) =
    (for H(Y|X=x) we can use the single-variable definition, treating x as a constant)
  = Σ_{x∈Ω} p(x) ( -Σ_{y∈Ψ} p(y|x) log2 p(y|x) ) =
  = -Σ_{x∈Ω} Σ_{y∈Ψ} p(y|x) p(x) log2 p(y|x) =
  = -Σ_{x∈Ω} Σ_{y∈Ψ} p(x,y) log2 p(y|x)
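The same toy joint table as above can be used to check that this second definition agrees with the direct one (a sketch with made-up numbers):

    from math import log2

    p_xy = {(0, 'a'): 0.4, (0, 'b'): 0.1, (1, 'a'): 0.25, (1, 'b'): 0.25}

    # marginal p(x)
    p_x = {}
    for (x, _), p in p_xy.items():
        p_x[x] = p_x.get(x, 0.0) + p

    # direct form: -sum over (x,y) of p(x,y) log2 p(y|x)
    direct = -sum(p * log2(p / p_x[x]) for (x, _), p in p_xy.items() if p > 0)

    # other definition: sum over x of p(x) * H(Y | X=x)
    def H_given(x):
        qs = [p / p_x[x] for (xx, _), p in p_xy.items() if xx == x and p > 0]
        return -sum(q * log2(q) for q in qs)

    weighted = sum(p_x[x] * H_given(x) for x in p_x)

    print(direct, weighted)  # both ~0.861: the two definitions coincide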
Properties of Entropy I

• Entropy is non-negative:
  – H(X) ≥ 0
  – proof (recall: H(X) = -Σ_{x∈Ω} p(x) log2 p(x)):
    • log(p(x)) is negative or zero, since p(x) ≤ 1;
    • p(x) is non-negative, so the product p(x) log(p(x)) is negative or zero;
    • a sum of non-positive numbers is non-positive;
    • and -f is non-negative for non-positive f.
• Chain rule:
  – H(X,Y) = H(Y|X) + H(X), as well as
  – H(X,Y) = H(X|Y) + H(Y) (since H(Y,X) = H(X,Y))
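The chain rule can be checked numerically on the same toy joint distribution (a sketch with made-up numbers):

    from math import log2

    p_xy = {(0, 'a'): 0.4, (0, 'b'): 0.1, (1, 'a'): 0.25, (1, 'b'): 0.25}

    p_x = {}
    for (x, _), p in p_xy.items():
        p_x[x] = p_x.get(x, 0.0) + p

    H_xy = -sum(p * log2(p) for p in p_xy.values() if p > 0)    # H(X,Y)
    H_x = -sum(p * log2(p) for p in p_x.values() if p > 0)      # H(X)
    H_y_given_x = -sum(p * log2(p / p_x[x])                     # H(Y|X)
                       for (x, _), p in p_xy.items() if p > 0)

    print(H_xy)               # ~1.861 ...
    print(H_y_given_x + H_x)  # ... equals H(Y|X) + H(X), as the chain rule says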
Properties of Entropy II

• Conditional entropy is better (than unconditional):
  – H(Y|X) ≤ H(Y) (proof on Monday)
• H(X,Y) ≤ H(X) + H(Y) (follows from the previous (in)equalities)
  – equality iff X, Y are independent
  – [recall: X, Y independent iff p(X,Y) = p(X)p(Y)]
• H(p) is concave (remember the book availability graph?)
  – a concave function f over an interval (a,b):
    ∀x,y ∈ (a,b), ∀λ ∈ [0,1]:
    f(λx + (1-λ)y) ≥ λf(x) + (1-λ)f(y)
  – a function f is convex if -f is concave
  – [for proofs and generalizations, see Cover & Thomas]

[Figure: a concave function f on the interval between x and y; the curve stays above the chord λf(x) + (1-λ)f(y).]
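Concavity of the (binary) entropy function can be spot-checked at a few points (a sketch; x, y and the λ values are arbitrary choices):

    from math import log2

    def H2(p):
        # binary entropy function
        return 0.0 if p in (0.0, 1.0) else -p * log2(p) - (1 - p) * log2(1 - p)

    x, y = 0.1, 0.7
    for lam in (0.0, 0.25, 0.5, 0.75, 1.0):
        lhs = H2(lam * x + (1 - lam) * y)        # f(λx + (1-λ)y)
        rhs = lam * H2(x) + (1 - lam) * H2(y)    # λf(x) + (1-λ)f(y)
        print(lam, lhs >= rhs)                   # True: the curve lies above the chord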
"Coding" Interpretation of Entropy

• The least (average) number of bits needed to encode a message (string, sequence, series, ...), each element being the result of a random process with some distribution p, is H(p).
• Remember the various compression algorithms?
  – they do well on data with repeating (= easily predictable = low-entropy) patterns
  – their results, though, have high entropy: compressing compressed data does nothing
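This is easy to observe with any off-the-shelf compressor; the sketch below uses Python's zlib on an artificially repetitive byte string (the exact sizes will vary):

    import zlib

    predictable = b"abab" * 1000        # repeating, low-entropy pattern (4000 bytes)
    once = zlib.compress(predictable)   # compresses very well
    twice = zlib.compress(once)         # compressing the compressed output

    print(len(predictable), len(once), len(twice))
    # the first pass shrinks the data dramatically; the second pass gains nothing
    # (it may even grow slightly), since the output already has high entropy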
Coding: Example

• How many bits do we need for ISO Latin 1?
  – the trivial answer: 8
• Experience: some characters are more common, some (very) rare:
  – so what if we use more bits for the rare and fewer bits for the frequent? [be careful: we want to decode (easily)!]
  – suppose: p('a') = 0.3, p('b') = 0.3, p('c') = 0.3, the rest: p(x) ≈ 0.0004
• code: 'a' ~ 00, 'b' ~ 01, 'c' ~ 10, rest: 11b1b2b3b4b5b6b7b8 (the prefix 11 followed by the character's original 8 bits)
• code acbbécbaac: 0010010111000011111001000010
  (a c b b é c b a a c)
• number of bits used: 28 (vs. 80 using the "naive" fixed-length coding)
• code length ~ 1/probability (frequent characters get short codes); conditional probabilities OK!
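The average code length of this scheme can be compared with the entropy of the distribution (a sketch; lumping the 253 rare characters together at equal probability is an assumption of the example):

    from math import log2

    # p('a') = p('b') = p('c') = 0.3; the remaining 253 Latin-1 characters share 0.1
    rare = 0.1 / 253

    # code lengths: 2 bits for a/b/c, 2 + 8 = 10 bits for everything else
    avg_len = 3 * (0.3 * 2) + 0.1 * 10
    H = 3 * (-0.3 * log2(0.3)) + 253 * (-rare * log2(rare))

    print(round(avg_len, 2))  # 2.8 bits/char on average (vs. 8 for the naive code)
    print(round(H, 2))        # ~2.69 bits: the code is close to the entropy lower bound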
Entropy of a Language

• Imagine that we produce the next letter using

  p(l_{n+1} | l_1, ..., l_n),

  where l_1, ..., l_n is the sequence of all the letters which have been uttered so far (i.e. n is really big!); let's call l_1, ..., l_n the history.
• Then compute its entropy:
  – -Σ_h Σ_l p(l,h) log2 p(l|h)
• Not very practical, is it?