
2. Mathematical Foundations

2001. 7. 10.

Artificial Intelligence Laboratory, 성경희

Foundations of Statistical Natural Language Processing

2

Contents – Part 1

1. Elementary Probability Theory

– Conditional probability

– Bayes’ theorem

– Random variable

– Joint and conditional distributions

– Standard distributions

3

Conditional probability (1/2)

P(A) : the probability of the event A

Ex1> A coin is tossed 3 times.

Ω = {HHH, HHT, HTH, HTT, THH, THT, TTH, TTT}

A = {HHT, HTH, THH} : exactly 2 heads, P(A) = 3/8

B = {HHH, HHT, HTH, HTT} : first toss is a head, P(B) = 1/2

Conditional probability:

P(A \mid B) = \frac{P(A \cap B)}{P(B)}
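As a sanity check (not part of the original slides), a small Python sketch that enumerates this sample space and recovers P(A), P(B), and P(A | B):

```python
from itertools import product
from fractions import Fraction

# Sample space for three coin tosses: all sequences of H/T.
omega = list(product("HT", repeat=3))

A = {w for w in omega if w.count("H") == 2}   # exactly two heads
B = {w for w in omega if w[0] == "H"}         # first toss is a head

P = lambda event: Fraction(len(event), len(omega))  # uniform probability

print(P(A), P(B))          # 3/8, 1/2
print(P(A & B) / P(B))     # P(A|B) = (1/4)/(1/2) = 1/2
```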

4

Conditional probability (2/2)

Multiplication rule:

P(A \cap B) = P(B)\,P(A \mid B) = P(A)\,P(B \mid A)

Chain rule:

P(A_1 \cap \cdots \cap A_n) = P(A_1)\,P(A_2 \mid A_1)\,P(A_3 \mid A_1 \cap A_2)\cdots P(A_n \mid \bigcap_{i=1}^{n-1} A_i)

Two events A, B are independent if

P(A \cap B) = P(A)\,P(B)

(equivalently, if P(B) \neq 0, P(A \mid B) = P(A))
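A similar sketch (added for illustration, not from the slides) that checks the chain rule and the independence criterion on the three-toss space; the events A1, A2, A3 (toss i comes up heads) are chosen just for this example:

```python
from itertools import product
from fractions import Fraction

# Sample space for three coin tosses, uniform probability.
omega = list(product("HT", repeat=3))
P = lambda ev: Fraction(len(ev), len(omega))

A1 = {w for w in omega if w[0] == "H"}   # first toss heads
A2 = {w for w in omega if w[1] == "H"}   # second toss heads
A3 = {w for w in omega if w[2] == "H"}   # third toss heads

# Chain rule: P(A1 ∩ A2 ∩ A3) = P(A1) P(A2|A1) P(A3|A1 ∩ A2)
lhs = P(A1 & A2 & A3)
rhs = P(A1) * (P(A1 & A2) / P(A1)) * (P(A1 & A2 & A3) / P(A1 & A2))
print(lhs, rhs)                          # both 1/8

# Independence: P(A1 ∩ A2) equals P(A1) P(A2), so A1 and A2 are independent.
print(P(A1 & A2) == P(A1) * P(A2))       # True
```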

5

Bayes’ theorem (1/2)

P(B \mid A) = \frac{P(A \cap B)}{P(A)} = \frac{P(A \mid B)\,P(B)}{P(A)}

where P(A) = P(A \cap B) + P(A \cap \bar{B}) = P(A \mid B)\,P(B) + P(A \mid \bar{B})\,P(\bar{B})

Generally, if A \subseteq \bigcup_{i=1}^{n} B_i and the B_i are disjoint (B_i \cap B_j = \emptyset for i \neq j):

P(A) = \sum_i P(A \mid B_i)\,P(B_i)

Bayes' theorem:

P(B_j \mid A) = \frac{P(A \mid B_j)\,P(B_j)}{P(A)} = \frac{P(A \mid B_j)\,P(B_j)}{\sum_{i=1}^{n} P(A \mid B_i)\,P(B_i)}

6

Bayes’ theorem (2/2)

Ex2> G : the event of the sentence having a parasitic gap

T : the event of the test being positive

P(G \mid T) = \frac{P(T \mid G)\,P(G)}{P(T \mid G)\,P(G) + P(T \mid \bar{G})\,P(\bar{G})} = \frac{0.95 \times 0.00001}{0.95 \times 0.00001 + 0.005 \times 0.99999} \approx 0.002

This poor result comes about because the prior probability of a sentence containing a parasitic gap is so low.
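A short Python check of this calculation (illustrative only; the variable names are ours, the numbers are from the example above):

```python
# Bayes' theorem for the parasitic-gap test (Ex2).
p_G = 0.00001           # prior: a sentence has a parasitic gap
p_T_given_G = 0.95      # test fires when a gap is present
p_T_given_notG = 0.005  # false-positive rate

p_T = p_T_given_G * p_G + p_T_given_notG * (1 - p_G)
p_G_given_T = p_T_given_G * p_G / p_T
print(round(p_G_given_T, 4))   # ~0.0019, i.e. about 0.002
```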

7

Random variable

Ex3> Random variable X for the sum of two dice.

             Second die
First die   1   2   3   4   5   6
    6       7   8   9  10  11  12
    5       6   7   8   9  10  11
    4       5   6   7   8   9  10
    3       4   5   6   7   8   9
    2       3   4   5   6   7   8
    1       2   3   4   5   6   7

S = {2, …, 12}

x        2     3     4     5    6     7    8     9    10    11    12
P(X=x)  1/36  1/18  1/12  1/9  5/36  1/6  5/36  1/9  1/12  1/18  1/36

Probability mass function (pmf): p(x) = P(X = x), X ~ p(x), with \sum_i p(x_i) = 1

If X: Ω → {0, 1}, then X is called an indicator random variable or a Bernoulli trial.

Expectation: E(X) = \sum_x x\,p(x)

Variance: Var(X) = E\big((X - E(X))^2\big) = E(X^2) - E^2(X)
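A brief Python sketch (not from the slides) that builds this pmf and confirms E(X) = 7 and Var(X) = 35/6:

```python
from fractions import Fraction
from itertools import product

# pmf of X = sum of two fair dice, plus its expectation and variance.
pmf = {}
for d1, d2 in product(range(1, 7), repeat=2):
    pmf[d1 + d2] = pmf.get(d1 + d2, 0) + Fraction(1, 36)

E = sum(x * p for x, p in pmf.items())       # E(X) = 7
E2 = sum(x * x * p for x, p in pmf.items())
Var = E2 - E ** 2                            # Var(X) = E(X^2) - E^2(X)
print(pmf[7], E, Var)                        # 1/6, 7, 35/6
```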

8

Joint and conditional distributions

The joint pmf for two discrete random variables X, Y:

p(x, y) = P(X = x, Y = y)

Marginal pmfs, which total up the probability mass for the values of each variable separately:

p_X(x) = \sum_y p(x, y), \qquad p_Y(y) = \sum_x p(x, y)

Conditional pmf:

p_{X \mid Y}(x \mid y) = \frac{p(x, y)}{p_Y(y)} \quad for y such that p_Y(y) > 0
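For illustration (not in the original slides), a toy joint pmf with its marginals and a conditional pmf computed exactly as defined above; the joint table here is made up:

```python
from collections import defaultdict

# A toy joint pmf p(x, y).
joint = {("a", 0): 0.2, ("a", 1): 0.3, ("b", 0): 0.4, ("b", 1): 0.1}

p_X, p_Y = defaultdict(float), defaultdict(float)
for (x, y), p in joint.items():
    p_X[x] += p          # p_X(x) = sum_y p(x, y)
    p_Y[y] += p          # p_Y(y) = sum_x p(x, y)

def p_x_given_y(x, y):
    return joint.get((x, y), 0.0) / p_Y[y]   # defined when p_Y(y) > 0

print(dict(p_X), dict(p_Y), p_x_given_y("a", 0))   # p(a)=0.5, p(0)=0.6, p(a|0) ≈ 1/3
```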

9

Standard distributions (1/3)

Discrete distributions: The binomial distribution

– When one has a series of trials with only two outcomes, each trial being independent of all the others.

– The number r of successes out of n trials, given that the probability of success in any trial is p:

b(r; n, p) = P(R = r) = \binom{n}{r} p^r (1 - p)^{n - r}, \quad where \binom{n}{r} = \frac{n!}{(n - r)!\,r!} and 0 \le r \le n

– Expectation: np, variance: np(1 - p)
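A minimal Python sketch (ours, not the slides') of the binomial pmf, confirming that the mean and variance come out as np and np(1 − p):

```python
from math import comb

def binom_pmf(r, n, p):
    """b(r; n, p) = C(n, r) p^r (1-p)^(n-r)."""
    return comb(n, r) * p**r * (1 - p)**(n - r)

n, p = 10, 0.7
mean = sum(r * binom_pmf(r, n, p) for r in range(n + 1))
var = sum((r - mean)**2 * binom_pmf(r, n, p) for r in range(n + 1))
print(round(mean, 6), round(var, 6))   # 7.0 and 2.1, i.e. np and np(1-p)
```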

10

Standard distributions (2/3)

Discrete distributions: The binomial distribution

[Figure: binomial distributions b(r; n, 0.5) and b(r; n, 0.7) for n = 10, 20, 40, plotted as probability against count r]

11

Standard distributions (3/3)

Continuous distributions: The normal distribution

– For mean \mu and standard deviation \sigma, the probability density function (pdf) is

n(x; \mu, \sigma) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{(x - \mu)^2}{2\sigma^2}}

[Figure: pdfs of N(0, 1), N(0, 0.7), and N(1.5, 2), plotted as density against value]
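An illustrative Python implementation of this density (not part of the slides); the evaluation points are arbitrary:

```python
from math import exp, pi, sqrt

def normal_pdf(x, mu=0.0, sigma=1.0):
    """n(x; mu, sigma) = 1/(sqrt(2*pi)*sigma) * exp(-(x-mu)^2 / (2*sigma^2))."""
    return exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sqrt(2 * pi) * sigma)

print(round(normal_pdf(0.0), 4))            # ~0.3989 at the peak of N(0, 1)
print(round(normal_pdf(1.5, 1.5, 2.0), 4))  # the peak of N(1.5, 2) is lower: ~0.1995
```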

12

Contents – Part 2

2. Essential Information Theory

– Entropy

– Joint entropy and conditional entropy

– Mutual information

– The noisy channel model

– Relative entropy or Kullback-Leibler divergence

13

Shannon’s Information Theory

Maximizing the amount of information that one can transmit over an imperfect communication channel such as a noisy phone line.

Theoretical maxima for data compression – Entropy H

Theoretical maxima for the transmission rate – Channel Capacity

14

Entropy (1/4)

The entropy H (or self-information) is the average uncertainty of a single random variable X.

Entropy is a measure of uncertainty.

– The more we know about something, the lower the entropy will be.

– We can use entropy as a measure of the quality of our models.

Entropy measures the amount of information in a random variable (measured in bits).

H(X) = H(p) = -\sum_{x \in X} p(x) \log_2 p(x), \quad where p(x) is the pmf of X
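A small Python sketch (added for illustration) of this definition:

```python
from math import log2

def entropy(pmf):
    """H(X) = -sum_x p(x) * log2 p(x), ignoring zero-probability outcomes."""
    return -sum(p * log2(p) for p in pmf if p > 0)

print(entropy([0.5, 0.5]))    # 1.0 bit for a fair coin
print(entropy([1/8] * 8))     # 3.0 bits for a uniform 8-sided die (Ex7)
```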

15

Entropy (2/4)

The entropy of a weighted coin. The horizontal axis shows the probability of a weighted coin to come up heads. The vertical axis shows the entropy of tossing the corresponding coin once.

H(p) = -p \log_2 p - (1 - p) \log_2 (1 - p)

(cf. slide 23)
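For illustration (not in the slides), the same H(p) evaluated at a few biases; it is 0 for a deterministic coin and peaks at 1 bit for a fair one:

```python
from math import log2

def H(p):
    """Entropy of a coin with P(heads) = p."""
    return 0.0 if p in (0.0, 1.0) else -p * log2(p) - (1 - p) * log2(1 - p)

for p in (0.1, 0.3, 0.5, 0.7, 0.9):
    print(p, round(H(p), 3))    # symmetric around p = 0.5, where H = 1 bit
```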

16

Entropy (3/4)

Ex7> The result of rolling an 8-sided die. (uniform distribution)

– Entropy : The average length of the message needed to transmit an outcome of that variable.

H(X) = -\sum_{i=1}^{8} p(i) \log_2 p(i) = -\sum_{i=1}^{8} \frac{1}{8} \log_2 \frac{1}{8} = \log_2 8 = 3 bits

outcome   1    2    3    4    5    6    7    8
code     001  010  011  100  101  110  111  000

In terms of the expectation E:

H(X) = E\left( \log_2 \frac{1}{p(X)} \right)

17

Entropy (4/4)

Ex8> Simplified Polynesian, with letter probabilities:

p    t    k    a    i    u
1/8  1/4  1/8  1/4  1/8  1/8

H(P) = -\sum_{i \in \{p,t,k,a,i,u\}} P(i) \log_2 P(i) = 2\tfrac{1}{2} bits

– We can design a code that on average takes 2½ bits to transmit a letter:

p    t   k    a   i    u
100  00  101  01  110  111

– Entropy can be interpreted as a measure of the size of the ‘search space’ consisting of the possible values of a random variable.
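A quick Python check (ours) that the entropy of this distribution and the average length of the code above are both 2.5 bits:

```python
from math import log2

# Letter distribution for simplified Polynesian (Ex8) and the code from the slide.
probs = {"p": 1/8, "t": 1/4, "k": 1/8, "a": 1/4, "i": 1/8, "u": 1/8}
code  = {"p": "100", "t": "00", "k": "101", "a": "01", "i": "110", "u": "111"}

H = -sum(P * log2(P) for P in probs.values())
avg_len = sum(probs[c] * len(code[c]) for c in probs)
print(H, avg_len)   # both 2.5 bits
```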

18

Joint entropy and conditional entropy (1/3)

The joint entropy of a pair of discrete random variables X, Y ~ p(x, y):

H(X, Y) = -\sum_{x \in X} \sum_{y \in Y} p(x, y) \log_2 p(x, y)

The conditional entropy:

H(Y \mid X) = \sum_{x \in X} p(x)\,H(Y \mid X = x) = -\sum_{x \in X} \sum_{y \in Y} p(x, y) \log_2 p(y \mid x)

The chain rule for entropy:

H(X, Y) = H(X) + H(Y \mid X)

H(X_1, \ldots, X_n) = H(X_1) + H(X_2 \mid X_1) + \cdots + H(X_n \mid X_1, \ldots, X_{n-1})
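A small Python sketch (not from the slides) of these definitions, with a made-up joint pmf used to check the chain rule numerically:

```python
from math import log2

def entropy(pmf):
    return -sum(p * log2(p) for p in pmf if p > 0)

def marginal(p_xy, axis):
    m = {}
    for key, p in p_xy.items():
        m[key[axis]] = m.get(key[axis], 0.0) + p
    return m

def cond_entropy(p_xy):
    """H(Y|X) = -sum_{x,y} p(x,y) log2 p(y|x)."""
    p_x = marginal(p_xy, 0)
    return -sum(p * log2(p / p_x[x]) for (x, y), p in p_xy.items() if p > 0)

# Chain rule check on a toy joint pmf: H(X, Y) = H(X) + H(Y|X)
p_xy = {("a", 0): 0.2, ("a", 1): 0.3, ("b", 0): 0.4, ("b", 1): 0.1}
print(entropy(p_xy.values()),
      entropy(marginal(p_xy, 0).values()) + cond_entropy(p_xy))
```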

19

Joint entropy and conditional entropy (2/3)

Ex9> Simplified Polynesian revisited

– All words consist of sequences of CV (consonant-vowel) syllables.

Per-syllable joint distribution p(C, V), with marginal probabilities in the margins:

        p      t      k
 a    1/16    3/8    1/16  |  1/2
 i    1/16   3/16     0    |  1/4
 u     0     3/16    1/16  |  1/4
      1/8     3/4    1/8

Per-letter basis probabilities (each per-syllable marginal halved, since every syllable contributes two letters):

 p     t    k     a    i    u
1/16  3/8  1/16  1/4  1/8  1/8

(cf. slide 8)

20

Joint entropy and conditional entropy (3/3)

Using the joint distribution p(C, V) above:

H(C) = -2 \cdot \frac{1}{8} \log_2 \frac{1}{8} - \frac{3}{4} \log_2 \frac{3}{4} = \frac{9}{4} - \frac{3}{4} \log_2 3 \approx 1.061 bits

H(V \mid C) = \sum_{c \in \{p,t,k\}} p(C = c)\,H(V \mid C = c) = \frac{1}{8} H\left(\tfrac{1}{2}, \tfrac{1}{2}, 0\right) + \frac{3}{4} H\left(\tfrac{1}{2}, \tfrac{1}{4}, \tfrac{1}{4}\right) + \frac{1}{8} H\left(\tfrac{1}{2}, 0, \tfrac{1}{2}\right) = \frac{11}{8} = 1.375 bits

H(C, V) = H(C) + H(V \mid C) = \frac{29}{8} - \frac{3}{4} \log_2 3 \approx 2.44 bits
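A Python check of these three numbers from the joint distribution on the previous slide (the code itself is ours):

```python
from math import log2

# Per-syllable joint distribution p(C, V) from the table above.
p_cv = {("p","a"): 1/16, ("t","a"): 3/8,  ("k","a"): 1/16,
        ("p","i"): 1/16, ("t","i"): 3/16, ("k","i"): 0.0,
        ("p","u"): 0.0,  ("t","u"): 3/16, ("k","u"): 1/16}

p_c = {}
for (c, v), p in p_cv.items():
    p_c[c] = p_c.get(c, 0.0) + p

H_C  = -sum(p * log2(p) for p in p_c.values())
H_VC = -sum(p * log2(p / p_c[c]) for (c, v), p in p_cv.items() if p > 0)
H_CV = -sum(p * log2(p) for p in p_cv.values() if p > 0)

print(round(H_C, 3), round(H_VC, 3), round(H_CV, 3))   # 1.061, 1.375, 2.436
```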

21

Mutual information (1/2)

By the chain rule for entropy (see the identities below), the quantity I(X; Y) = H(X) - H(X \mid Y) = H(Y) - H(Y \mid X) is called the mutual information.

Mutual information between X and Y

– The amount of information one random variable contains about another. (symmetric, non-negative)

– It is 0 only when two variables are independent.

– It grows not only with the degree of dependence, but also according to the entropy of the variables.

– It is actually better to think of it as a measure of independence.

H(X, Y) = H(X) + H(Y \mid X) = H(Y) + H(X \mid Y)

I(X; Y) = H(X) - H(X \mid Y) = H(Y) - H(Y \mid X)

22

Mutual information (2/2)

– Since H(X) = \sum_x p(x) \log_2 \frac{1}{p(x)}, H(Y) = \sum_y p(y) \log_2 \frac{1}{p(y)}, and H(X, Y) = -\sum_{x,y} p(x, y) \log_2 p(x, y):

I(X; Y) = H(X) - H(X \mid Y) = H(X) + H(Y) - H(X, Y) = \sum_{x,y} p(x, y) \log_2 \frac{p(x, y)}{p(x)\,p(y)}

– H(X) = H(X) - H(X \mid X) = I(X; X), since H(X \mid X) = 0. (This is why entropy is called self-information.)

[Figure: diagram relating H(X), H(Y), H(X \mid Y), H(Y \mid X), I(X; Y), and H(X, Y)]

– Conditional MI and a chain rule:

I(X; Y \mid Z) = I((X; Y) \mid Z) = H(X \mid Z) - H(X \mid Y, Z)

I(X_{1n}; Y) = I(X_1; Y) + \cdots + I(X_n; Y \mid X_1, \ldots, X_{n-1}) = \sum_{i=1}^{n} I(X_i; Y \mid X_1, \ldots, X_{i-1})

– Pointwise MI: I(x, y) = \log_2 \frac{p(x, y)}{p(x)\,p(y)}
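An illustrative Python implementation (ours) of I(X; Y) from a joint pmf; the two test distributions, one independent and one perfectly correlated, are made up for the example:

```python
from math import log2

def mutual_information(p_xy):
    """I(X;Y) = sum_{x,y} p(x,y) log2( p(x,y) / (p(x) p(y)) )."""
    p_x, p_y = {}, {}
    for (x, y), p in p_xy.items():
        p_x[x] = p_x.get(x, 0.0) + p
        p_y[y] = p_y.get(y, 0.0) + p
    return sum(p * log2(p / (p_x[x] * p_y[y]))
               for (x, y), p in p_xy.items() if p > 0)

# Independent variables give I = 0; a perfectly correlated pair gives I = H(X).
print(round(mutual_information({(0,0): .25, (0,1): .25, (1,0): .25, (1,1): .25}), 6))  # 0.0
print(round(mutual_information({(0,0): .5, (1,1): .5}), 6))                            # 1.0
```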

23

Noisy channel model

Channel capacity: the rate at which one can transmit information through the channel (optimally):

C = \max_{p(X)} I(X; Y)

[Figure: the noisy channel. W (message from a finite alphabet) → Encoder → X (input to channel) → Channel p(y \mid x) → Y (output from channel) → Decoder → Ŵ (attempt to reconstruct the message based on the output)]

Binary symmetric channel

[Figure: 0 → 0 and 1 → 1 with probability 1 − p; each bit is flipped with probability p]

I(X; Y) = H(Y) - H(Y \mid X) = H(Y) - H(p)

– Since entropy is non-negative, C = \max_{p(X)} I(X; Y) = 1 - H(p) \le 1

(cf. slide 15)
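A small Python sketch (not in the slides) of the binary symmetric channel capacity 1 − H(p):

```python
from math import log2

def H(p):
    """Binary entropy in bits."""
    return 0.0 if p in (0.0, 1.0) else -p * log2(p) - (1 - p) * log2(1 - p)

def bsc_capacity(p):
    """Capacity of a binary symmetric channel with crossover probability p: 1 - H(p)."""
    return 1.0 - H(p)

for p in (0.0, 0.1, 0.5):
    print(p, round(bsc_capacity(p), 3))   # 1.0, 0.531, 0.0 (useless channel at p = 0.5)
```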

24

Relative entropy or Kullback-Leibler divergence

Relative entropy for two pmfs p(x), q(x):

D(p \| q) = \sum_{x \in X} p(x) \log_2 \frac{p(x)}{q(x)} = E_p\left( \log_2 \frac{p(X)}{q(X)} \right)

– A measure of how close two pmfs are.

– Non-negative, and D(p \| q) = 0 iff p = q.

– Note: I(X; Y) = D\big( p(x, y) \,\|\, p(x)\,p(y) \big)

– Conditional relative entropy and chain rule:

D\big( p(y \mid x) \,\|\, q(y \mid x) \big) = \sum_x p(x) \sum_y p(y \mid x) \log_2 \frac{p(y \mid x)}{q(y \mid x)}

D\big( p(x, y) \,\|\, q(x, y) \big) = D\big( p(x) \,\|\, q(x) \big) + D\big( p(y \mid x) \,\|\, q(y \mid x) \big)
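An illustrative Python implementation (ours) of D(p ‖ q); the two example pmfs are made up:

```python
from math import log2

def kl_divergence(p, q):
    """D(p || q) = sum_x p(x) log2( p(x) / q(x) ), in bits.

    Assumes q(x) > 0 wherever p(x) > 0; terms with p(x) = 0 contribute nothing.
    """
    return sum(px * log2(px / q[x]) for x, px in p.items() if px > 0)

p = {"a": 0.5, "b": 0.25, "c": 0.25}
q = {"a": 1/3, "b": 1/3, "c": 1/3}
print(round(kl_divergence(p, q), 3), kl_divergence(p, p))   # 0.085, 0.0
```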