
Introduction to Natural Language Processing (600.465)
Essential Information Theory I

Jan Hajic, JHU CS 600.465
AI-lab, 2003.09.24

The Notion of Entropy

• Entropy ~ "chaos", fuzziness, opposite of order
  – you know it: it is much easier to create a "mess" than to tidy things up
• Comes from physics:
  – entropy does not go down unless energy is used
• Measure of uncertainty:
  – if low: low uncertainty; the higher the entropy, the higher the uncertainty, but also the higher the "surprise" (information) we can get out of an experiment

The Formula

• Let p_X(x) be a distribution of the random variable X
• Basic outcomes (alphabet): Ω

      H(X) = -Σ_{x ∈ Ω} p(x) log2 p(x)

• Unit: bits (with natural log: nats)
• Notation: H(X) = H_p(X) = H(p) = H_X(p) = H(p_X)
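
The formula translates directly into code. A minimal sketch (Python, standard library only; the function name and the dict representation of p are illustrative choices, not from the slides):

```python
import math

def entropy(p):
    """H(p) = -sum_x p(x) * log2 p(x), in bits.

    p: dict mapping each basic outcome x in the alphabet Omega to its
    probability p(x). Outcomes with p(x) = 0 contribute nothing
    (0 * log2 0 is taken to be 0, by the usual convention).
    """
    return -sum(px * math.log2(px) for px in p.values() if px > 0)
```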

Using the Formula: Example

• Toss a fair coin: Ω = {head, tail}
  – p(head) = .5, p(tail) = .5
  – H(p) = -0.5 log2(0.5) + (-0.5 log2(0.5)) = 2 × ((-0.5) × (-1)) = 2 × 0.5 = 1
• Take a fair 32-sided die: p(x) = 1/32 for every side x
  – H(p) = -Σ_{i=1..32} p(x_i) log2 p(x_i) = -32 (p(x_1) log2 p(x_1))   (since for all i, p(x_i) = p(x_1) = 1/32)
         = -32 × ((1/32) × (-5)) = 5   (now you see why it's called bits)
• Unfair coin:
  – p(head) = .2 → H(p) = .722;  p(head) = .01 → H(p) = .081
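
These values can be checked with the entropy sketch above (repeated here so the snippet runs on its own; .722 and .081 on the slide are rounded to three decimal places):

```python
import math

def entropy(p):  # as above: H(p) = -sum_x p(x) log2 p(x)
    return -sum(px * math.log2(px) for px in p.values() if px > 0)

print(entropy({"head": 0.5, "tail": 0.5}))      # 1.0 bit   (fair coin)
print(entropy({i: 1 / 32 for i in range(32)}))  # 5.0 bits  (fair 32-sided die)
print(entropy({"head": 0.2, "tail": 0.8}))      # ~0.722 bits
print(entropy({"head": 0.01, "tail": 0.99}))    # ~0.081 bits
```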

Example: Book Availability

[Figure: the binary entropy curve H(p) plotted against p(Book Available); H(p) = 0 at p = 0 and at p = 1 (a very bad or a very good bookstore), rising to its maximum H(p) = 1 at p = 0.5.]

The Limits

• When is H(p) = 0?
  – if the result of an experiment is known ahead of time:
  – necessarily: ∃x ∈ Ω: p(x) = 1; ∀y ∈ Ω, y ≠ x: p(y) = 0
• Upper bound?
  – none in general
  – for |Ω| = n: H(p) ≤ log2 n
    • nothing can be more uncertain than the uniform distribution

Entropy and Expectation

• Recall:
  – E(X) = Σ_{x ∈ Ω} p_X(x) · x
• Then:
  E(log2(1/p_X(x))) = Σ_{x ∈ Ω} p_X(x) log2(1/p_X(x))
                    = -Σ_{x ∈ Ω} p_X(x) log2 p_X(x)
                    = H(p_X)  =_notation  H(p)
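
The identity on this slide, H(p) as the expectation of the "surprise" log2(1/p_X(x)), can be checked numerically; a tiny sketch with an arbitrary example distribution:

```python
import math

p = {"head": 0.2, "tail": 0.8}   # any distribution will do

# E(log2(1/p_X(x))): the weighted average "surprise" of the outcomes
expected_surprise = sum(px * math.log2(1 / px) for px in p.values())

# H(p) from the defining formula
h = -sum(px * math.log2(px) for px in p.values())

print(expected_surprise, h)      # both ~0.722: E(log2(1/p_X(x))) = H(p)
```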

Perplexity: Motivation

• Recall:
  – 2 equiprobable outcomes: H(p) = 1 bit
  – 32 equiprobable outcomes: H(p) = 5 bits
  – 4.3 billion equiprobable outcomes: H(p) ≈ 32 bits
• What if the outcomes are not equiprobable?
  – 32 outcomes, 2 of them equiprobable at p = .5, the rest impossible:
    • H(p) = 1 bit
  – any measure for comparing the entropy (i.e. uncertainty/difficulty of prediction) (also) for random variables with different numbers of outcomes?

Perplexity

• Perplexity:
  – G(p) = 2^H(p)
• so we are back at 32 (for 32 equiprobable outcomes), 2 for fair coins, etc.
• it is easier to imagine:
  – NLP example: the size of a vocabulary with uniform distribution which is equally hard to predict
• the "wilder" (more biased) the distribution, the better:
  – lower entropy, lower perplexity
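
A short illustrative sketch of G(p) = 2^H(p), reusing the entropy function from earlier (redefined here so the snippet runs on its own):

```python
import math

def entropy(p):
    return -sum(px * math.log2(px) for px in p.values() if px > 0)

def perplexity(p):
    # G(p) = 2 ** H(p): the size of a uniformly distributed "vocabulary"
    # that would be equally hard to predict
    return 2 ** entropy(p)

print(perplexity({i: 1 / 32 for i in range(32)}))  # 32.0  (32 equiprobable outcomes)
print(perplexity({"head": 0.5, "tail": 0.5}))      # 2.0   (fair coin)
print(perplexity({"head": 0.2, "tail": 0.8}))      # ~1.65 (biased: lower perplexity)
```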

Joint Entropy and Conditional Entropy

• Two random variables: X (sample space Ω), Y (sample space Ψ)
• Joint entropy:
  – no big deal ((X,Y) considered a single event):

      H(X,Y) = -Σ_{x ∈ Ω} Σ_{y ∈ Ψ} p(x,y) log2 p(x,y)

• Conditional entropy:

      H(Y|X) = -Σ_{x ∈ Ω} Σ_{y ∈ Ψ} p(x,y) log2 p(y|x)

  recall that H(X) = E(log2(1/p_X(x)))
  (a weighted "average", and the weights are not conditional)

Conditional Entropy (Using the Calculus)

• The other definition:

      H(Y|X) = Σ_{x ∈ Ω} p(x) H(Y|X=x) =

  for H(Y|X=x), we can use the single-variable definition (x ~ constant):

      = Σ_{x ∈ Ω} p(x) ( -Σ_{y ∈ Ψ} p(y|x) log2 p(y|x) )
      = -Σ_{x ∈ Ω} Σ_{y ∈ Ψ} p(y|x) p(x) log2 p(y|x)
      = -Σ_{x ∈ Ω} Σ_{y ∈ Ψ} p(x,y) log2 p(y|x)

Properties of Entropy I

• Entropy is non-negative:
  – H(X) ≥ 0
  – proof: (recall: H(X) = -Σ_{x ∈ Ω} p(x) log2 p(x))
    • log(p(x)) is negative or zero for p(x) ≤ 1
    • p(x) is non-negative; their product p(x) log(p(x)) is thus negative or zero
    • a sum of non-positive numbers is non-positive
    • and -f is non-negative for non-positive f
• Chain rule:
  – H(X,Y) = H(Y|X) + H(X), as well as
  – H(X,Y) = H(X|Y) + H(Y)   (since H(Y,X) = H(X,Y))
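
Both definitions of H(Y|X) (directly from p(x,y), and via Σ_x p(x) H(Y|X=x)), together with the chain rule just stated, can be verified on a small joint distribution. A sketch in the same illustrative style as the earlier snippets; the particular joint distribution is made up for the example:

```python
import math

# A made-up joint distribution p(x,y) over Omega = {0, 1}, Psi = {'a', 'b'}
p_xy = {(0, "a"): 0.4, (0, "b"): 0.1, (1, "a"): 0.2, (1, "b"): 0.3}

# marginal p(x)
p_x = {}
for (x, y), p in p_xy.items():
    p_x[x] = p_x.get(x, 0.0) + p

def H(dist):
    """Entropy of a distribution given as a dict of probabilities."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

# H(Y|X) directly from the joint:  -sum_{x,y} p(x,y) log2 p(y|x)
h_cond = -sum(p * math.log2(p / p_x[x]) for (x, y), p in p_xy.items())

# H(Y|X) via the other definition:  sum_x p(x) * H(Y | X = x)
h_cond2 = sum(
    p_x[x] * H({y: p / p_x[x] for (x2, y), p in p_xy.items() if x2 == x})
    for x in p_x
)

print(h_cond, h_cond2)              # both ~0.846
print(H(p_xy), h_cond + H(p_x))     # chain rule: H(X,Y) = H(Y|X) + H(X), ~1.846
```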

Properties of Entropy II

• Conditional entropy is "better" (than unconditional):
  – H(Y|X) ≤ H(Y)   (proof on Monday)
• H(X,Y) ≤ H(X) + H(Y)   (follows from the previous (in)equalities)
  • equality iff X, Y independent
  • [recall: X, Y independent iff p(X,Y) = p(X)p(Y)]
• H(p) is concave (remember the book availability graph):
  – a concave function f over an interval (a,b):
    ∀x,y ∈ (a,b), ∀λ ∈ [0,1]:
    f(λx + (1-λ)y) ≥ λf(x) + (1-λ)f(y)
  • a function f is convex if -f is concave
  • [for proofs and generalizations, see Cover & Thomas]

[Figure: a concave function f between points x and y; the chord value λf(x) + (1-λ)f(y) lies below the function value f(λx + (1-λ)y).]
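
The concavity of H(p) can be seen numerically in the binary ("book availability") case. A small illustrative check; the particular p, q and λ are arbitrary choices:

```python
import math

def binary_entropy(p):
    """H of a Bernoulli(p) distribution, in bits (the 'book availability' curve)."""
    if p in (0.0, 1.0):
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

p, q, lam = 0.1, 0.7, 0.3

entropy_of_mixture = binary_entropy(lam * p + (1 - lam) * q)
mixture_of_entropies = lam * binary_entropy(p) + (1 - lam) * binary_entropy(q)

# Concavity: H(lam*p + (1-lam)*q) >= lam*H(p) + (1-lam)*H(q)
print(entropy_of_mixture, mixture_of_entropies)   # ~0.999 >= ~0.758
```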

"Coding" Interpretation of Entropy

• The least (average) number of bits needed to encode a message (string, sequence, series, ...), each element being the result of a random process with some distribution p, is H(p).
• Remember various compression algorithms:
  – they do well on data with repeating (= easily predictable = low-entropy) patterns
  – their results, though, have high entropy: compressing compressed data does nothing

Coding Example

• How many bits do we need for ISO Latin 1?
  – the trivial answer: 8
• Experience: some chars are more common, some (very) rare:
  • so what if we use more bits for the rare and fewer bits for the frequent? [be careful: we want to decode (easily)]
  • suppose: p('a') = 0.3, p('b') = 0.3, p('c') = 0.3, the rest: p(x) ≈ .0004
• code: 'a' ~ 00, 'b' ~ 01, 'c' ~ 10, the rest: 11b1b2b3b4b5b6b7b8
• code acbbécbaac: 00 10 01 01 1100001111 10 01 00 00 10 = 0010010111000011111001000010
  (a, c, b, b, é, c, b, a, a, c)
• number of bits used: 28 (vs. 80 using the "naive" 8-bit coding)
• code length ~ 1/probability (ideally log2(1/p(x)) bits per symbol); conditional prob. OK!
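
The 2.8 bits-per-character average (28 bits for the 10-character message) and its relation to the entropy of the distribution can be computed directly. A sketch assuming, as an illustration, that the remaining probability mass 0.1 is spread evenly over the other 253 ISO Latin 1 characters:

```python
import math

# distribution from the slide: 'a', 'b', 'c' at 0.3 each; the remaining 0.1
# assumed spread evenly over the other 253 ISO Latin 1 characters
p_common = 0.3
p_rare = 0.1 / 253            # ~0.0004 per rare character

# code lengths: 2 bits for 'a', 'b', 'c'; 2 + 8 = 10 bits for everything else
avg_code_length = 3 * p_common * 2 + 253 * p_rare * 10

entropy = -(3 * p_common * math.log2(p_common) + 253 * p_rare * math.log2(p_rare))

print(avg_code_length)   # 2.8 bits/char on average -> 28 bits for a 10-char message
print(entropy)           # ~2.69 bits/char: the lower bound for any such code
```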

Entropy of a Language

• Imagine that we produce the next letter using

      p(l_{n+1} | l_1, ..., l_n),

  where l_1, ..., l_n is the sequence of all the letters which had been uttered so far (i.e., n is really big!); let's call l_1, ..., l_n the history h.
• Then compute its entropy:
  – H = -Σ_h Σ_l p(l,h) log2 p(l|h)
• Not very practical, is it?
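
In practice the infinite history is approximated by a short fixed-length one, with probabilities estimated from a corpus. A rough, purely illustrative sketch of that idea (the toy corpus string and the history length k = 2 are made-up choices, not part of the slides):

```python
import math
from collections import Counter

corpus = "this is just a tiny made-up corpus for illustration only"
k = 2                                  # truncate the history to the last k letters

history_counts = Counter()             # counts of histories h
pair_counts = Counter()                # counts of (history, next letter) pairs
for i in range(k, len(corpus)):
    h, l = corpus[i - k:i], corpus[i]
    history_counts[h] += 1
    pair_counts[(h, l)] += 1

n = sum(pair_counts.values())
# H ~ -sum_{h,l} p(l,h) log2 p(l|h), with probabilities estimated from counts
H = -sum((c / n) * math.log2(c / history_counts[h])
         for (h, l), c in pair_counts.items())

print(H)   # estimated bits per letter for this (tiny, unrepresentative) sample
```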
