
2. Mathematical Foundations

2001. 7. 10.

Artificial Intelligence Laboratory, 성경희

Foundations of Statistical Natural Language Processing

2

Contents – Part 1

1. Elementary Probability Theory

– Conditional probability

– Bayes’ theorem

– Random variable

– Joint and conditional distributions

– Standard distributions

3

Conditional probability (1/2)

P(A) : the probability of the event A

Ex1> A coin is tossed 3 times.

Ω = {HHH, HHT, HTH, HTT, THH, THT, TTH, TTT}

A = {HHT, HTH, THH} : exactly 2 heads, P(A) = 3/8

B = {HHH, HHT, HTH, HTT} : first toss is a head, P(B) = 1/2

Conditional probability:

P(A \mid B) = \frac{P(A \cap B)}{P(B)}
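As a sanity check (not part of the original slides), a small Python sketch that enumerates this sample space and recovers P(A), P(B), and P(A | B):

```python
from itertools import product
from fractions import Fraction

# Sample space for three coin tosses: all sequences of H/T.
omega = list(product("HT", repeat=3))

A = {w for w in omega if w.count("H") == 2}   # exactly two heads
B = {w for w in omega if w[0] == "H"}         # first toss is a head

P = lambda event: Fraction(len(event), len(omega))  # uniform probability

print(P(A), P(B))          # 3/8, 1/2
print(P(A & B) / P(B))     # P(A|B) = (1/4)/(1/2) = 1/2
```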

4

Conditional probability (2/2)

Multiplication rule:

P(A \cap B) = P(B)\,P(A \mid B) = P(A)\,P(B \mid A)

Chain rule:

P(A_1 \cap \cdots \cap A_n) = P(A_1)\,P(A_2 \mid A_1)\,P(A_3 \mid A_1 \cap A_2)\cdots P(A_n \mid \bigcap_{i=1}^{n-1} A_i)

Two events A, B are independent if

P(A \cap B) = P(A)\,P(B)

(equivalently, if P(B) \neq 0, P(A \mid B) = P(A))
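A similar sketch (added for illustration, not from the slides) that checks the chain rule and the independence criterion on the three-toss space; the events A1, A2, A3 (toss i comes up heads) are chosen just for this example:

```python
from itertools import product
from fractions import Fraction

# Sample space for three coin tosses, uniform probability.
omega = list(product("HT", repeat=3))
P = lambda ev: Fraction(len(ev), len(omega))

A1 = {w for w in omega if w[0] == "H"}   # first toss heads
A2 = {w for w in omega if w[1] == "H"}   # second toss heads
A3 = {w for w in omega if w[2] == "H"}   # third toss heads

# Chain rule: P(A1 ∩ A2 ∩ A3) = P(A1) P(A2|A1) P(A3|A1 ∩ A2)
lhs = P(A1 & A2 & A3)
rhs = P(A1) * (P(A1 & A2) / P(A1)) * (P(A1 & A2 & A3) / P(A1 & A2))
print(lhs, rhs)                          # both 1/8

# Independence: P(A1 ∩ A2) equals P(A1) P(A2), so A1 and A2 are independent.
print(P(A1 & A2) == P(A1) * P(A2))       # True
```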

5

Bayes’ theorem (1/2)

P(B \mid A) = \frac{P(A \cap B)}{P(A)} = \frac{P(A \mid B)\,P(B)}{P(A)}

where P(A) = P(A \cap B) + P(A \cap \bar{B}) = P(A \mid B)\,P(B) + P(A \mid \bar{B})\,P(\bar{B})

Generally, if A \subseteq \bigcup_{i=1}^{n} B_i and the B_i are disjoint (B_i \cap B_j = \emptyset for i \neq j):

P(A) = \sum_i P(A \mid B_i)\,P(B_i)

Bayes' theorem:

P(B_j \mid A) = \frac{P(A \mid B_j)\,P(B_j)}{P(A)} = \frac{P(A \mid B_j)\,P(B_j)}{\sum_{i=1}^{n} P(A \mid B_i)\,P(B_i)}

6

Bayes’ theorem (2/2)

Ex2> G : the event of the sentence having a parasitic gap

T : the event of the test being positive

P(G \mid T) = \frac{P(T \mid G)\,P(G)}{P(T \mid G)\,P(G) + P(T \mid \bar{G})\,P(\bar{G})} = \frac{0.95 \times 0.00001}{0.95 \times 0.00001 + 0.005 \times 0.99999} \approx 0.002

This poor result comes about because the prior probability of a sentence containing a parasitic gap is so low.
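A short Python check of this calculation (illustrative only; the variable names are ours, the numbers are from the example above):

```python
# Bayes' theorem for the parasitic-gap test (Ex2).
p_G = 0.00001           # prior: a sentence has a parasitic gap
p_T_given_G = 0.95      # test fires when a gap is present
p_T_given_notG = 0.005  # false-positive rate

p_T = p_T_given_G * p_G + p_T_given_notG * (1 - p_G)
p_G_given_T = p_T_given_G * p_G / p_T
print(round(p_G_given_T, 4))   # ~0.0019, i.e. about 0.002
```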

7

Random variable

Ex3> Random variable X for the sum of two dice.

             Second die
First die   1   2   3   4   5   6
    6       7   8   9  10  11  12
    5       6   7   8   9  10  11
    4       5   6   7   8   9  10
    3       4   5   6   7   8   9
    2       3   4   5   6   7   8
    1       2   3   4   5   6   7

S = {2, …, 12}

x        2     3     4     5    6     7    8     9    10    11    12
P(X=x)  1/36  1/18  1/12  1/9  5/36  1/6  5/36  1/9  1/12  1/18  1/36

Probability mass function (pmf): p(x) = P(X = x), X ~ p(x), with \sum_i p(x_i) = 1

If X: Ω → {0, 1}, then X is called an indicator random variable or a Bernoulli trial.

Expectation: E(X) = \sum_x x\,p(x)

Variance: Var(X) = E\big((X - E(X))^2\big) = E(X^2) - E^2(X)
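A brief Python sketch (not from the slides) that builds this pmf and confirms E(X) = 7 and Var(X) = 35/6:

```python
from fractions import Fraction
from itertools import product

# pmf of X = sum of two fair dice, plus its expectation and variance.
pmf = {}
for d1, d2 in product(range(1, 7), repeat=2):
    pmf[d1 + d2] = pmf.get(d1 + d2, 0) + Fraction(1, 36)

E = sum(x * p for x, p in pmf.items())       # E(X) = 7
E2 = sum(x * x * p for x, p in pmf.items())
Var = E2 - E ** 2                            # Var(X) = E(X^2) - E^2(X)
print(pmf[7], E, Var)                        # 1/6, 7, 35/6
```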

8

Joint and conditional distributions

The joint pmf for two discrete random variables X, Y:

p(x, y) = P(X = x, Y = y)

Marginal pmfs, which total up the probability mass for the values of each variable separately:

p_X(x) = \sum_y p(x, y), \qquad p_Y(y) = \sum_x p(x, y)

Conditional pmf:

p_{X \mid Y}(x \mid y) = \frac{p(x, y)}{p_Y(y)} \quad for y such that p_Y(y) > 0
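For illustration (not in the original slides), a toy joint pmf with its marginals and a conditional pmf computed exactly as defined above; the joint table here is made up:

```python
from collections import defaultdict

# A toy joint pmf p(x, y).
joint = {("a", 0): 0.2, ("a", 1): 0.3, ("b", 0): 0.4, ("b", 1): 0.1}

p_X, p_Y = defaultdict(float), defaultdict(float)
for (x, y), p in joint.items():
    p_X[x] += p          # p_X(x) = sum_y p(x, y)
    p_Y[y] += p          # p_Y(y) = sum_x p(x, y)

def p_x_given_y(x, y):
    return joint.get((x, y), 0.0) / p_Y[y]   # defined when p_Y(y) > 0

print(dict(p_X), dict(p_Y), p_x_given_y("a", 0))   # p(a)=0.5, p(0)=0.6, p(a|0) ≈ 1/3
```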

9

Standard distributions (1/3)

Discrete distributions: The binomial distribution

– When one has a series of trials with only two outcomes, each trial being independent of all the others.

– The number r of successes out of n trials, given that the probability of success in any trial is p:

b(r; n, p) = P(R = r) = \binom{n}{r} p^r (1 - p)^{n - r}, \quad where \binom{n}{r} = \frac{n!}{(n - r)!\,r!} and 0 \le r \le n

– Expectation: np, variance: np(1 - p)
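A minimal Python sketch (ours, not the slides') of the binomial pmf, confirming that the mean and variance come out as np and np(1 − p):

```python
from math import comb

def binom_pmf(r, n, p):
    """b(r; n, p) = C(n, r) p^r (1-p)^(n-r)."""
    return comb(n, r) * p**r * (1 - p)**(n - r)

n, p = 10, 0.7
mean = sum(r * binom_pmf(r, n, p) for r in range(n + 1))
var = sum((r - mean)**2 * binom_pmf(r, n, p) for r in range(n + 1))
print(round(mean, 6), round(var, 6))   # 7.0 and 2.1, i.e. np and np(1-p)
```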

10

Standard distributions (2/3)

Discrete distributions: The binomial distribution

[Figure: binomial distributions b(r; n, 0.5) and b(r; n, 0.7) for n = 10, 20, 40, plotted as probability against count r]

11

Standard distributions (3/3)

Continuous distributions: The normal distribution

– For mean \mu and standard deviation \sigma, the probability density function (pdf) is

n(x; \mu, \sigma) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{(x - \mu)^2}{2\sigma^2}}

[Figure: pdfs of N(0, 1), N(0, 0.7), and N(1.5, 2), plotted as density against value]
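An illustrative Python implementation of this density (not part of the slides); the evaluation points are arbitrary:

```python
from math import exp, pi, sqrt

def normal_pdf(x, mu=0.0, sigma=1.0):
    """n(x; mu, sigma) = 1/(sqrt(2*pi)*sigma) * exp(-(x-mu)^2 / (2*sigma^2))."""
    return exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sqrt(2 * pi) * sigma)

print(round(normal_pdf(0.0), 4))            # ~0.3989 at the peak of N(0, 1)
print(round(normal_pdf(1.5, 1.5, 2.0), 4))  # the peak of N(1.5, 2) is lower: ~0.1995
```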

12

Contents – Part 2

2. Essential Information Theory

– Entropy

– Joint entropy and conditional entropy

– Mutual information

– The noisy channel model

– Relative entropy or Kullback-Leibler divergence

13

Shannon’s Information Theory

Maximizing the amount of information that one can transmit over an imperfect communication channel such as a noisy phone line.

Theoretical maxima for data compression – Entropy H

Theoretical maxima for the transmission rate – Channel Capacity

14

Entropy (1/4)

The entropy H (or self-information) is the average uncertainty of a single random variable X.

Entropy is a measure of uncertainty.

– The more we know about something, the lower the entropy will be.

– We can use entropy as a measure of the quality of our models.

Entropy measures the amount of information in a random variable (measured in bits).

H(X) = H(p) = -\sum_{x \in X} p(x) \log_2 p(x), \quad where p(x) is the pmf of X
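A small Python sketch (added for illustration) of this definition:

```python
from math import log2

def entropy(pmf):
    """H(X) = -sum_x p(x) * log2 p(x), ignoring zero-probability outcomes."""
    return -sum(p * log2(p) for p in pmf if p > 0)

print(entropy([0.5, 0.5]))    # 1.0 bit for a fair coin
print(entropy([1/8] * 8))     # 3.0 bits for a uniform 8-sided die (Ex7)
```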

15

Entropy (2/4)

The entropy of a weighted coin. The horizontal axis shows the probability of a weighted coin to come up heads. The vertical axis shows the entropy of tossing the corresponding coin once.

H(p) = -p \log_2 p - (1 - p) \log_2 (1 - p)

(cf. slide 23)
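For illustration (not in the slides), the same H(p) evaluated at a few biases; it is 0 for a deterministic coin and peaks at 1 bit for a fair one:

```python
from math import log2

def H(p):
    """Entropy of a coin with P(heads) = p."""
    return 0.0 if p in (0.0, 1.0) else -p * log2(p) - (1 - p) * log2(1 - p)

for p in (0.1, 0.3, 0.5, 0.7, 0.9):
    print(p, round(H(p), 3))    # symmetric around p = 0.5, where H = 1 bit
```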

16

Entropy (3/4)

Ex7> The result of rolling an 8-sided die. (uniform distribution)

– Entropy : The average length of the message needed to transmit an outcome of that variable.

H(X) = -\sum_{i=1}^{8} p(i) \log_2 p(i) = -\sum_{i=1}^{8} \frac{1}{8} \log_2 \frac{1}{8} = \log_2 8 = 3 bits

outcome   1    2    3    4    5    6    7    8
code     001  010  011  100  101  110  111  000

In terms of the expectation E:

H(X) = E\left( \log_2 \frac{1}{p(X)} \right)

17

Entropy (4/4)

Ex8> Simplified Polynesian, with letter probabilities:

p    t    k    a    i    u
1/8  1/4  1/8  1/4  1/8  1/8

H(P) = -\sum_{i \in \{p,t,k,a,i,u\}} P(i) \log_2 P(i) = 2\tfrac{1}{2} bits

– We can design a code that on average takes 2½ bits to transmit a letter:

p    t   k    a   i    u
100  00  101  01  110  111

– Entropy can be interpreted as a measure of the size of the ‘search space’ consisting of the possible values of a random variable.
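A quick Python check (ours) that the entropy of this distribution and the average length of the code above are both 2.5 bits:

```python
from math import log2

# Letter distribution for simplified Polynesian (Ex8) and the code from the slide.
probs = {"p": 1/8, "t": 1/4, "k": 1/8, "a": 1/4, "i": 1/8, "u": 1/8}
code  = {"p": "100", "t": "00", "k": "101", "a": "01", "i": "110", "u": "111"}

H = -sum(P * log2(P) for P in probs.values())
avg_len = sum(probs[c] * len(code[c]) for c in probs)
print(H, avg_len)   # both 2.5 bits
```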

18

Joint entropy and conditional entropy (1/3)

The joint entropy of a pair of discrete random variables X, Y ~ p(x, y):

H(X, Y) = -\sum_{x \in X} \sum_{y \in Y} p(x, y) \log_2 p(x, y)

The conditional entropy:

H(Y \mid X) = \sum_{x \in X} p(x)\,H(Y \mid X = x) = -\sum_{x \in X} \sum_{y \in Y} p(x, y) \log_2 p(y \mid x)

The chain rule for entropy:

H(X, Y) = H(X) + H(Y \mid X)

H(X_1, \ldots, X_n) = H(X_1) + H(X_2 \mid X_1) + \cdots + H(X_n \mid X_1, \ldots, X_{n-1})
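A small Python sketch (not from the slides) of these definitions, with a made-up joint pmf used to check the chain rule numerically:

```python
from math import log2

def entropy(pmf):
    return -sum(p * log2(p) for p in pmf if p > 0)

def marginal(p_xy, axis):
    m = {}
    for key, p in p_xy.items():
        m[key[axis]] = m.get(key[axis], 0.0) + p
    return m

def cond_entropy(p_xy):
    """H(Y|X) = -sum_{x,y} p(x,y) log2 p(y|x)."""
    p_x = marginal(p_xy, 0)
    return -sum(p * log2(p / p_x[x]) for (x, y), p in p_xy.items() if p > 0)

# Chain rule check on a toy joint pmf: H(X, Y) = H(X) + H(Y|X)
p_xy = {("a", 0): 0.2, ("a", 1): 0.3, ("b", 0): 0.4, ("b", 1): 0.1}
print(entropy(p_xy.values()),
      entropy(marginal(p_xy, 0).values()) + cond_entropy(p_xy))
```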

19

Joint entropy and conditional entropy (2/3)

Ex9> Simplified Polynesian revisited

– All words consist of sequences of CV (consonant-vowel) syllables.

Per-syllable joint distribution p(C, V), with marginal probabilities in the margins:

        p      t      k
 a    1/16    3/8    1/16  |  1/2
 i    1/16   3/16     0    |  1/4
 u     0     3/16    1/16  |  1/4
      1/8     3/4    1/8

Per-letter basis probabilities (each per-syllable marginal halved, since every syllable contributes two letters):

 p     t    k     a    i    u
1/16  3/8  1/16  1/4  1/8  1/8

(cf. slide 8)

20

Joint entropy and conditional entropy (3/3)

Using the joint distribution p(C, V) above:

H(C) = -2 \cdot \frac{1}{8} \log_2 \frac{1}{8} - \frac{3}{4} \log_2 \frac{3}{4} = \frac{9}{4} - \frac{3}{4} \log_2 3 \approx 1.061 bits

H(V \mid C) = \sum_{c \in \{p,t,k\}} p(C = c)\,H(V \mid C = c) = \frac{1}{8} H\left(\tfrac{1}{2}, \tfrac{1}{2}, 0\right) + \frac{3}{4} H\left(\tfrac{1}{2}, \tfrac{1}{4}, \tfrac{1}{4}\right) + \frac{1}{8} H\left(\tfrac{1}{2}, 0, \tfrac{1}{2}\right) = \frac{11}{8} = 1.375 bits

H(C, V) = H(C) + H(V \mid C) = \frac{29}{8} - \frac{3}{4} \log_2 3 \approx 2.44 bits
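A Python check of these three numbers from the joint distribution on the previous slide (the code itself is ours):

```python
from math import log2

# Per-syllable joint distribution p(C, V) from the table above.
p_cv = {("p","a"): 1/16, ("t","a"): 3/8,  ("k","a"): 1/16,
        ("p","i"): 1/16, ("t","i"): 3/16, ("k","i"): 0.0,
        ("p","u"): 0.0,  ("t","u"): 3/16, ("k","u"): 1/16}

p_c = {}
for (c, v), p in p_cv.items():
    p_c[c] = p_c.get(c, 0.0) + p

H_C  = -sum(p * log2(p) for p in p_c.values())
H_VC = -sum(p * log2(p / p_c[c]) for (c, v), p in p_cv.items() if p > 0)
H_CV = -sum(p * log2(p) for p in p_cv.values() if p > 0)

print(round(H_C, 3), round(H_VC, 3), round(H_CV, 3))   # 1.061, 1.375, 2.436
```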

21

Mutual information (1/2)

By the chain rule for entropy (see the identities below), the quantity I(X; Y) = H(X) - H(X \mid Y) = H(Y) - H(Y \mid X) is called the mutual information.

Mutual information between X and Y

– The amount of information one random variable contains about another. (symmetric, non-negative)

– It is 0 only when two variables are independent.

– It grows not only with the degree of dependence, but also according to the entropy of the variables.

– It is actually better to think of it as a measure of independence.

H(X, Y) = H(X) + H(Y \mid X) = H(Y) + H(X \mid Y)

I(X; Y) = H(X) - H(X \mid Y) = H(Y) - H(Y \mid X)

22

Mutual information (2/2)

– Since H(X) = \sum_x p(x) \log_2 \frac{1}{p(x)}, H(Y) = \sum_y p(y) \log_2 \frac{1}{p(y)}, and H(X, Y) = -\sum_{x,y} p(x, y) \log_2 p(x, y):

I(X; Y) = H(X) - H(X \mid Y) = H(X) + H(Y) - H(X, Y) = \sum_{x,y} p(x, y) \log_2 \frac{p(x, y)}{p(x)\,p(y)}

– H(X) = H(X) - H(X \mid X) = I(X; X), since H(X \mid X) = 0. (This is why entropy is called self-information.)

[Figure: diagram relating H(X), H(Y), H(X \mid Y), H(Y \mid X), I(X; Y), and H(X, Y)]

– Conditional MI and a chain rule:

I(X; Y \mid Z) = I((X; Y) \mid Z) = H(X \mid Z) - H(X \mid Y, Z)

I(X_{1n}; Y) = I(X_1; Y) + \cdots + I(X_n; Y \mid X_1, \ldots, X_{n-1}) = \sum_{i=1}^{n} I(X_i; Y \mid X_1, \ldots, X_{i-1})

– Pointwise MI: I(x, y) = \log_2 \frac{p(x, y)}{p(x)\,p(y)}
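An illustrative Python implementation (ours) of I(X; Y) from a joint pmf; the two test distributions, one independent and one perfectly correlated, are made up for the example:

```python
from math import log2

def mutual_information(p_xy):
    """I(X;Y) = sum_{x,y} p(x,y) log2( p(x,y) / (p(x) p(y)) )."""
    p_x, p_y = {}, {}
    for (x, y), p in p_xy.items():
        p_x[x] = p_x.get(x, 0.0) + p
        p_y[y] = p_y.get(y, 0.0) + p
    return sum(p * log2(p / (p_x[x] * p_y[y]))
               for (x, y), p in p_xy.items() if p > 0)

# Independent variables give I = 0; a perfectly correlated pair gives I = H(X).
print(round(mutual_information({(0,0): .25, (0,1): .25, (1,0): .25, (1,1): .25}), 6))  # 0.0
print(round(mutual_information({(0,0): .5, (1,1): .5}), 6))                            # 1.0
```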

23

Noisy channel model

Channel capacity: the rate at which one can transmit information through the channel (optimally):

C = \max_{p(X)} I(X; Y)

[Figure: the noisy channel. W (message from a finite alphabet) → Encoder → X (input to channel) → Channel p(y \mid x) → Y (output from channel) → Decoder → Ŵ (attempt to reconstruct the message based on the output)]

Binary symmetric channel

[Figure: 0 → 0 and 1 → 1 with probability 1 − p; each bit is flipped with probability p]

I(X; Y) = H(Y) - H(Y \mid X) = H(Y) - H(p)

– Since entropy is non-negative, C = \max_{p(X)} I(X; Y) = 1 - H(p) \le 1

(cf. slide 15)
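A small Python sketch (not in the slides) of the binary symmetric channel capacity 1 − H(p):

```python
from math import log2

def H(p):
    """Binary entropy in bits."""
    return 0.0 if p in (0.0, 1.0) else -p * log2(p) - (1 - p) * log2(1 - p)

def bsc_capacity(p):
    """Capacity of a binary symmetric channel with crossover probability p: 1 - H(p)."""
    return 1.0 - H(p)

for p in (0.0, 0.1, 0.5):
    print(p, round(bsc_capacity(p), 3))   # 1.0, 0.531, 0.0 (useless channel at p = 0.5)
```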

24

Relative entropy or Kullback-Leibler divergence

Relative entropy for two pmfs p(x), q(x):

D(p \| q) = \sum_{x \in X} p(x) \log_2 \frac{p(x)}{q(x)} = E_p\left( \log_2 \frac{p(X)}{q(X)} \right)

– A measure of how close two pmfs are.

– Non-negative, and D(p \| q) = 0 iff p = q.

– Note: I(X; Y) = D\big( p(x, y) \,\|\, p(x)\,p(y) \big)

– Conditional relative entropy and chain rule:

D\big( p(y \mid x) \,\|\, q(y \mid x) \big) = \sum_x p(x) \sum_y p(y \mid x) \log_2 \frac{p(y \mid x)}{q(y \mid x)}

D\big( p(x, y) \,\|\, q(x, y) \big) = D\big( p(x) \,\|\, q(x) \big) + D\big( p(y \mid x) \,\|\, q(y \mid x) \big)
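An illustrative Python implementation (ours) of D(p ‖ q); the two example pmfs are made up:

```python
from math import log2

def kl_divergence(p, q):
    """D(p || q) = sum_x p(x) log2( p(x) / q(x) ), in bits.

    Assumes q(x) > 0 wherever p(x) > 0; terms with p(x) = 0 contribute nothing.
    """
    return sum(px * log2(px / q[x]) for x, px in p.items() if px > 0)

p = {"a": 0.5, "b": 0.25, "c": 0.25}
q = {"a": 1/3, "b": 1/3, "c": 1/3}
print(round(kl_divergence(p, q), 3), kl_divergence(p, p))   # 0.085, 0.0
```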