Parameter Estimation
PR, ANN, & ML
Notational Convention
Probabilities:
- Mass (discrete) function: capital letters
- Density (continuous) function: small letters
Vector vs. scalar:
- Scalar: plain
- Vector: bold
- 2D: small letters
- Higher dimension: capital letters
Notes remain in a continuous state of fluctuation until a topic is finished (many updates).
Parameter Estimation
The optimal classifier maximizes the posterior

$$P(\omega_i \mid \mathbf{x}) = \frac{p(\mathbf{x} \mid \omega_i)\, P(\omega_i)}{p(\mathbf{x})}$$

where $P(\omega_i)$ is the a priori probability and $p(\mathbf{x} \mid \omega_i)$ is the class-conditional density.
Assumptions: no correlation between samples; time-independent statistics.
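A minimal sketch (not from the slides) of the rule above, assuming the priors and class-conditional densities are already known; the two-class numbers below are made up for illustration:

```python
import numpy as np
from scipy.stats import norm

priors = np.array([0.6, 0.4])                   # P(w_i), assumed known
class_cond = [norm(0.0, 1.0), norm(2.0, 1.5)]   # p(x|w_i), assumed known 1-D Gaussians

def posterior(x):
    """Return P(w_i|x) for every class via Bayes rule."""
    likelihoods = np.array([d.pdf(x) for d in class_cond])
    joint = likelihoods * priors                # p(x|w_i) P(w_i)
    return joint / joint.sum()                  # divide by the evidence p(x)

x = 1.2
print(posterior(x), "-> predict class", np.argmax(posterior(x)))
```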
Popular Approaches
- Parametric: assume a certain parametric form for $p(\mathbf{x}|\omega_i)$ and estimate its parameters
- Nonparametric: do not assume a parametric form for $p(\mathbf{x}|\omega_i)$; estimate the density profile directly
- Boundary: estimate the separating hyperplane (hypersurface) between $p(\mathbf{x}|\omega_i)$ and $p(\mathbf{x}|\omega_j)$
A Priori Probability
Given the numbers of occurrences of each class in the training set,

$$(n_1, \omega_1),\ (n_2, \omega_2),\ \ldots,\ (n_k, \omega_k), \qquad M = \sum_{i=1}^{k} n_i$$

$$P(\omega_i) = \frac{n_i}{M}, \qquad i = 1, \ldots, k$$

This works if the number of samples is large enough and the selection process is not biased.
Caveat: sampling may be biased.
Class-Conditional Density
More complicated (not a single number, but a distribution):
- assume a certain form
- estimate its parameters
What form should we assume? Many are possible, but in this course we use almost exclusively the Gaussian.
Gaussian (or Normal)
Scalar case:

$$p(x|\omega_i) = N(\mu_i, \sigma_i^2) = \frac{1}{\sqrt{2\pi}\,\sigma_i}\exp\!\left[-\frac{(x-\mu_i)^2}{2\sigma_i^2}\right]$$

Vector case ($d$ dimensions):

$$p(\mathbf{x}|\omega_i) = \frac{1}{(2\pi)^{d/2}\,|\boldsymbol{\Sigma}_i|^{1/2}}\exp\!\left[-\frac{1}{2}(\mathbf{x}-\boldsymbol{\mu}_i)^T \boldsymbol{\Sigma}_i^{-1}(\mathbf{x}-\boldsymbol{\mu}_i)\right]$$

Unknowns: the class mean and variance ($\mu_i, \sigma_i^2$), or the mean vector and covariance matrix ($\boldsymbol{\mu}_i, \boldsymbol{\Sigma}_i$) in the vector case.
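A minimal NumPy sketch (not from the slides; the example values are made up) evaluating both forms of the density above:

```python
import numpy as np

def gaussian_1d(x, mu, var):
    """Scalar Gaussian density N(mu, var) evaluated at x."""
    return np.exp(-0.5 * (x - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)

def gaussian_nd(x, mu, Sigma):
    """d-dimensional Gaussian density N(mu, Sigma) evaluated at vector x."""
    d = len(mu)
    diff = x - mu
    norm_const = 1.0 / np.sqrt((2 * np.pi) ** d * np.linalg.det(Sigma))
    exponent = -0.5 * diff @ np.linalg.solve(Sigma, diff)   # (x-mu)^T Sigma^{-1} (x-mu)
    return norm_const * np.exp(exponent)

print(gaussian_1d(0.5, mu=0.0, var=1.0))
print(gaussian_nd(np.array([0.5, -0.2]),
                  mu=np.array([0.0, 0.0]),
                  Sigma=np.array([[1.0, 0.3], [0.3, 2.0]])))
```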
Gaussian Distribution
[Figure: the scalar Gaussian density $\frac{1}{\sqrt{2\pi}\,\sigma}\exp\!\left[-\frac{(x-\mu)^2}{2\sigma^2}\right]$ plotted as population versus feature value, with width governed by $\sigma$.]
Why Gaussian (Normal)?
- The central limit theorem predicts a normal distribution for sums of IID experiments, and many quantities measured in reality are indeed approximately normal (see the sketch after this list)
- There are only two numbers to estimate in the scalar case (mean and variance), or $d + d(d+1)/2$ in $d$ dimensions
- Nice mathematical properties: e.g., the Fourier transform of a Gaussian is a Gaussian; products and sums of Gaussians remain Gaussian; any linear transform of a Gaussian is a Gaussian
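A small sketch (not from the slides) illustrating the central limit theorem by summing IID uniform variables; the moments match the CLT prediction:

```python
import numpy as np

# Sums of IID uniform(0,1) variables approach a Gaussian as the number of
# summed terms grows -- the central limit theorem in action.
rng = np.random.default_rng(0)
n_terms, n_trials = 30, 100_000
sums = rng.uniform(0, 1, size=(n_trials, n_terms)).sum(axis=1)

# Compare the empirical mean/variance with the predicted n/2 and n/12.
print(sums.mean(), n_terms / 2)    # ~15.0
print(sums.var(), n_terms / 12)    # ~2.5
```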
Projection and Transformation
A projection or linear transformation of a Gaussian is still Gaussian. In particular, a whitening transform can diagonalize the covariance matrix.
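A minimal sketch (not from the slides; sample data are made up) of a whitening transform built from the eigendecomposition of the sample covariance:

```python
import numpy as np

rng = np.random.default_rng(1)
# Correlated 2-D Gaussian samples (rows are observations).
X = rng.multivariate_normal([0, 0], [[2.0, 1.2], [1.2, 1.0]], size=5000)

Sigma = np.cov(X, rowvar=False)
eigvals, Phi = np.linalg.eigh(Sigma)       # Sigma = Phi diag(eigvals) Phi^T
A = Phi @ np.diag(eigvals ** -0.5)         # whitening transform A = Phi Lambda^{-1/2}
Y = X @ A                                  # y = A^T x, applied row-wise

print(np.cov(Y, rowvar=False).round(2))    # ~ identity covariance
```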
Parameter Estimation
- Maximum likelihood estimator: parameters have fixed but unknown values
- Bayesian estimator: parameters are random variables with known a priori distributions; the Bayesian estimator allows us to change the a priori distribution by incorporating measurements, sharpening the profile
Graphically: MLE vs. Bayesian
[Figure: likelihood plotted as a function of the parameters; MLE picks the single peak, while the Bayesian estimator keeps the whole distribution over the parameters.]
Maximum Likelihood Estimator
Given:
- $n$ labeled samples (observations) $X = \{x_1, x_2, \ldots, x_n\}$
- an assumed distribution with $e$ parameters $\boldsymbol{\theta} = \{\theta_1, \theta_2, \ldots, \theta_e\}$
- samples drawn independently from $p(X|\omega_j) \equiv p(X|\omega_j, \boldsymbol{\theta}_j)$
Find:
- the parameter vector that best explains the observations
MLE Formulation
Maximize the likelihood

$$p(X|\boldsymbol{\theta}) = \prod_{j=1}^{n} p(\mathbf{x}_j|\boldsymbol{\theta})$$

or, equivalently, the log likelihood

$$l(\boldsymbol{\theta}) = \log p(X|\boldsymbol{\theta}) = \sum_{j=1}^{n} \log p(\mathbf{x}_j|\boldsymbol{\theta})$$

The maximum is found by setting the gradient to zero:

$$\nabla_{\boldsymbol{\theta}}\, p(X|\boldsymbol{\theta}) = 0 \qquad \text{or} \qquad \nabla_{\boldsymbol{\theta}}\, l(\boldsymbol{\theta}) = \sum_{j=1}^{n} \nabla_{\boldsymbol{\theta}} \log p(\mathbf{x}_j|\boldsymbol{\theta}) = 0$$
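A minimal numerical sketch (not from the slides; data and starting point are made up) of maximizing the Gaussian log likelihood directly with a generic optimizer, rather than the closed form derived next:

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(2)
x = rng.normal(loc=3.0, scale=2.0, size=200)      # observed samples

def neg_log_likelihood(theta):
    """-l(theta) = -sum_j log p(x_j | mu, sigma)."""
    mu, log_sigma = theta                          # sigma parameterized by its log to stay positive
    return -np.sum(norm.logpdf(x, loc=mu, scale=np.exp(log_sigma)))

res = minimize(neg_log_likelihood, x0=[0.0, 0.0])
mu_hat, sigma_hat = res.x[0], np.exp(res.x[1])
print(mu_hat, sigma_hat)                           # close to the sample mean and std
```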
An Example
Scalar Gaussian with unknown mean and variance, $\boldsymbol{\theta} = (\theta_1, \theta_2) = (\mu, \sigma^2)$:

$$p(x_j|\boldsymbol{\theta}) = \frac{1}{\sqrt{2\pi\theta_2}}\exp\!\left[-\frac{(x_j-\theta_1)^2}{2\theta_2}\right]$$

$$\log p(x_j|\boldsymbol{\theta}) = -\frac{1}{2}\log(2\pi\theta_2) - \frac{(x_j-\theta_1)^2}{2\theta_2}$$

$$\nabla_{\boldsymbol{\theta}} \log p(x_j|\boldsymbol{\theta}) =
\begin{bmatrix}
\dfrac{x_j-\theta_1}{\theta_2} \\[2ex]
-\dfrac{1}{2\theta_2} + \dfrac{(x_j-\theta_1)^2}{2\theta_2^2}
\end{bmatrix}$$
An Example (cont.)
Setting the summed gradient to zero,

$$\sum_{j=1}^{n} \frac{x_j-\hat{\mu}}{\hat{\sigma}^2} = 0, \qquad
-\sum_{j=1}^{n} \frac{1}{2\hat{\sigma}^2} + \sum_{j=1}^{n} \frac{(x_j-\hat{\mu})^2}{2\hat{\sigma}^4} = 0$$

gives

$$\hat{\mu} = \frac{1}{n}\sum_{j=1}^{n} x_j, \qquad
\hat{\sigma}^2 = \frac{1}{n}\sum_{j=1}^{n} (x_j-\hat{\mu})^2$$

Class mean as sample mean, class variance as sample variance. The resulting discriminant is

$$g_i(x) = p(x|\omega_i)P(\omega_i) = N(\hat{\mu}_i, \hat{\sigma}_i^2)\,P(\omega_i)
= \frac{1}{\sqrt{2\pi\hat{\sigma}_i^2}}\exp\!\left[-\frac{(x-\hat{\mu}_i)^2}{2\hat{\sigma}_i^2}\right] P(\omega_i)$$

Note that the variance estimate divides by $n$ rather than $n-1$: the MLE is biased!
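A small sketch (not from the slides; simulated data) showing the bias of the $1/n$ variance estimate against the $1/(n-1)$ correction:

```python
import numpy as np

rng = np.random.default_rng(3)
n, trials = 10, 20_000
samples = rng.normal(0.0, 1.0, size=(trials, n))       # true variance = 1

mu_hat = samples.mean(axis=1, keepdims=True)
var_mle = ((samples - mu_hat) ** 2).mean(axis=1)        # divide by n   (MLE)
var_unbiased = ((samples - mu_hat) ** 2).sum(axis=1) / (n - 1)

print(var_mle.mean())       # ~0.9 = (n-1)/n, biased low
print(var_unbiased.mean())  # ~1.0, unbiased
```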
[Figure: samples of a measured quantity (coin weight) plotted against population, with two candidate Gaussian fits of different widths $\sigma_1$ and $\sigma_2$; the likelihood $\prod_j \frac{1}{\sqrt{2\pi}\,\sigma}\exp\!\left[-\frac{(x_j-\mu)^2}{2\sigma^2}\right]$ depends strongly on the chosen width.]
[Figure: the same samples with a fitted Gaussian $\frac{1}{\sqrt{2\pi}\,\hat{\sigma}}\exp\!\left[-\frac{(x-\hat{\mu})^2}{2\hat{\sigma}^2}\right]$ whose width $\hat{\sigma}$ is either too narrow or too wide.]
- If $\hat{\sigma}$ is too narrow, many sample points fall outside the $2\sigma$ width and have low likelihood of occurrence
- If $\hat{\sigma}$ is too wide, the $1/\sigma$ factor becomes too small and reduces the likelihood of occurrence
A Quick Word on MAP
- The MAP (maximum a posteriori) estimator is similar to MLE, with one additional twist
- Maximize the (log) likelihood $l(\boldsymbol{\theta})$ plus $\log p(\boldsymbol{\theta})$, the prior probability of the parameter values (if you know it), e.g., the mean is more likely to be near $\mu_0$, with a normal distribution around it (a sketch follows below)
- MLE corresponds to a uniform prior; MAP does not necessarily
- The added term is a case of "regularization"
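A minimal sketch (not from the slides) of MAP estimation of a Gaussian mean with known variance and a normal prior on the mean; the closed form below comes from setting the derivative of log likelihood plus log prior to zero:

```python
import numpy as np

def map_mean(x, sigma2, mu0, sigma0_2):
    """MAP estimate of the mean of N(mu, sigma2) under a N(mu0, sigma0_2) prior:
    d/dmu [log likelihood + log prior] = 0 gives the closed form below."""
    n = len(x)
    return (n * sigma0_2 * x.mean() + sigma2 * mu0) / (n * sigma0_2 + sigma2)

rng = np.random.default_rng(4)
x = rng.normal(5.0, 1.0, size=8)
print(x.mean())                                          # MLE: sample mean (uniform prior)
print(map_mean(x, sigma2=1.0, mu0=0.0, sigma0_2=0.5))    # pulled toward mu0 = 0 by the prior
```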
Bayesian Estimator
Note that MLE is a batch estimator:
- all data have to be kept
- difficult to update the estimate
- difficult to incorporate other evidence
- insists on a single estimate
The Bayesian estimator:
- allows the parameters themselves to be random variables
- allows multiple sources of evidence
- allows iterative updates
Bayesian Estimator
Based on Bayes rule:

$$P(\omega_i|\mathbf{x}) = \frac{p(\mathbf{x}|\omega_i)\,P(\omega_i)}{\sum_j p(\mathbf{x}|\omega_j)\,P(\omega_j)}$$

With the sample set $X$ at our disposal:

$$P(\omega_i|\mathbf{x}, X) = \frac{p(\mathbf{x}|\omega_i, X)\,P(\omega_i|X)}{\sum_j p(\mathbf{x}|\omega_j, X)\,P(\omega_j|X)}$$
Bayes Rule Formulation
Assume:
- each sample set $X_i$ comes from only one class
- the prior $P(\omega_i)$ is independent of $X$

$$P(\omega_i|\mathbf{x}, X) = \frac{p(\mathbf{x}|\omega_i, X_i)\,P(\omega_i)}{\sum_j p(\mathbf{x}|\omega_j, X_j)\,P(\omega_j)}$$
How Can X Be Used?
The form of the distribution is known (e.g., normal); the parameters are unknown. Use $X$ to estimate the class parameters via $p(\boldsymbol{\theta}|X)$; the parameters then constrain $\mathbf{x}$ through $p(\mathbf{x}|\boldsymbol{\theta})$; putting it all together:

$$p(\mathbf{x}|X) = \int p(\mathbf{x}|\boldsymbol{\theta})\, p(\boldsymbol{\theta}|X)\, d\boldsymbol{\theta}$$
Bayes Rule Formulation (cont.)
Ideally,

$$p(\mathbf{x}|X) = \int p(\mathbf{x}|\boldsymbol{\theta})\, p(\boldsymbol{\theta}|X)\, d\boldsymbol{\theta}$$

If the posterior is a spike at a single value,

$$p(\boldsymbol{\theta}|X) = \begin{cases} 1 & \boldsymbol{\theta} = \hat{\boldsymbol{\theta}} \\ 0 & \text{otherwise} \end{cases}
\quad\Rightarrow\quad
p(\mathbf{x}|X) = p(\mathbf{x}|\hat{\boldsymbol{\theta}})$$

This is MLE! Otherwise, all possible $\boldsymbol{\theta}$'s are used.
Graphic Interpretation
[Figure: the sample set $X = \{x_1, x_2, \ldots, x_n\}$ and candidate parameter vectors $\boldsymbol{\theta} = \{\theta_1, \ldots, \theta_e\}$, $\boldsymbol{\theta}' = \{\theta_1', \ldots, \theta_e'\}$, $\boldsymbol{\theta}'' = \{\theta_1'', \ldots, \theta_e''\}$, each weighted by $p(\mathbf{x}|\boldsymbol{\theta})P(\boldsymbol{\theta})$.]
An Example
Estimating the mean of a normal distribution; the variance $\sigma^2$ is known; $n$ samples are used.

First step:

$$p(\mu|X) = \frac{p(X|\mu)\,p(\mu)}{p(X)} = \alpha \prod_{k=1}^{n} p(x_k|\mu)\; p(\mu)$$

with

$$p(x_k|\mu) = \frac{1}{\sqrt{2\pi}\,\sigma}\exp\!\left[-\frac{(x_k-\mu)^2}{2\sigma^2}\right], \qquad
p(\mu) = N(\mu_0, \sigma_0^2) = \frac{1}{\sqrt{2\pi}\,\sigma_0}\exp\!\left[-\frac{(\mu-\mu_0)^2}{2\sigma_0^2}\right]$$

and, once $p(\mu|X)$ is available,

$$p(x|X) = \int p(x|\mu)\, p(\mu|X)\, d\mu$$

Here $p(X|\mu)$ carries the current evidence and $p(\mu)$ the previous and other evidence. Key to Bayesian estimation: both current and prior evidence can be used.
Then

$$p(\mu|X) = \alpha \prod_{k=1}^{n} \frac{1}{\sqrt{2\pi}\,\sigma}\exp\!\left[-\frac{(x_k-\mu)^2}{2\sigma^2}\right]
\frac{1}{\sqrt{2\pi}\,\sigma_0}\exp\!\left[-\frac{(\mu-\mu_0)^2}{2\sigma_0^2}\right]
= \alpha' \exp\!\left[-\frac{1}{2}\left\{\left(\frac{n}{\sigma^2}+\frac{1}{\sigma_0^2}\right)\mu^2
- 2\left(\frac{1}{\sigma^2}\sum_{k=1}^{n} x_k + \frac{\mu_0}{\sigma_0^2}\right)\mu\right\}\right]
= \frac{1}{\sqrt{2\pi}\,\sigma_n}\exp\!\left[-\frac{(\mu-\mu_n)^2}{2\sigma_n^2}\right]$$

so the posterior is again Gaussian, $p(\mu|X) = N(\mu_n, \sigma_n^2)$, with

$$\frac{1}{\sigma_n^2} = \frac{n}{\sigma^2} + \frac{1}{\sigma_0^2}, \qquad
\frac{\mu_n}{\sigma_n^2} = \frac{n}{\sigma^2}\,\bar{x}_n + \frac{\mu_0}{\sigma_0^2}, \qquad
\bar{x}_n = \frac{1}{n}\sum_{k=1}^{n} x_k$$

i.e.,

$$\mu_n = \frac{n\sigma_0^2}{n\sigma_0^2+\sigma^2}\,\bar{x}_n + \frac{\sigma^2}{n\sigma_0^2+\sigma^2}\,\mu_0, \qquad
\sigma_n^2 = \frac{\sigma_0^2\,\sigma^2}{n\sigma_0^2+\sigma^2}$$

and $\mu_n \to \bar{x}_n$ as $n \to \infty$. A small sketch of this update follows.
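A minimal sketch (not from the slides; the sample data and prior values are made up) that implements the $\mu_n$ and $\sigma_n^2$ formulas above:

```python
import numpy as np

def bayes_update_mean(x, sigma2, mu0, sigma0_2):
    """Posterior p(mu|X) = N(mu_n, sigma_n^2) for a Gaussian likelihood with
    known variance sigma2 and a N(mu0, sigma0_2) prior on the mean."""
    n, xbar = len(x), np.mean(x)
    mu_n = (n * sigma0_2 * xbar + sigma2 * mu0) / (n * sigma0_2 + sigma2)
    sigma_n2 = (sigma0_2 * sigma2) / (n * sigma0_2 + sigma2)
    return mu_n, sigma_n2

rng = np.random.default_rng(5)
x = rng.normal(2.0, 1.0, size=20)
mu_n, sigma_n2 = bayes_update_mean(x, sigma2=1.0, mu0=0.0, sigma0_2=4.0)
print(mu_n, sigma_n2)   # mean pulled toward the data, uncertainty shrinks with n
```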
X Helps In
- Defining the mean
- Reducing the uncertainty in the mean
Trust the new data ($\mu_n \approx \bar{x}_n$) if:
- the class variance $\sigma^2$ is small
- the number of samples $n$ is large
- the prior is uncertain ($\sigma_0^2$ is large)
i.e., whenever $\sigma^2/n \ll \sigma_0^2$.
An Example (cont.)
Second step: with $p(x|\mu) = \frac{1}{\sqrt{2\pi}\,\sigma}\exp\!\left[-\frac{(x-\mu)^2}{2\sigma^2}\right]$,

$$g(x) = p(x|X) = \int p(x|\mu)\, p(\mu|X)\, d\mu
= \int \frac{1}{\sqrt{2\pi}\,\sigma}\exp\!\left[-\frac{(x-\mu)^2}{2\sigma^2}\right]
\frac{1}{\sqrt{2\pi}\,\sigma_n}\exp\!\left[-\frac{(\mu-\mu_n)^2}{2\sigma_n^2}\right] d\mu
= \frac{1}{2\pi\sigma\sigma_n}\exp\!\left[-\frac{(x-\mu_n)^2}{2(\sigma^2+\sigma_n^2)}\right] f(\sigma, \sigma_n)$$

where

$$f(\sigma,\sigma_n) = \int \exp\!\left[-\frac{1}{2}\,\frac{\sigma^2+\sigma_n^2}{\sigma^2\sigma_n^2}
\left(\mu - \frac{\sigma_n^2 x + \sigma^2\mu_n}{\sigma^2+\sigma_n^2}\right)^{\!2}\right] d\mu$$

so $p(x|X) \sim N(\mu_n,\ \sigma^2+\sigma_n^2)$.

Third step: put it all together, using $p(x|X_i)$ in place of the true class-conditional density in the classifier.
Graphical Interpretation: MLE
[Figure: the prior $p(\mu)$ and the likelihood $p(X|\mu) = \prod_{k=1}^{n}\frac{1}{\sqrt{2\pi}\,\sigma}\exp\!\left[-\frac{(x_k-\mu)^2}{2\sigma^2}\right]$; MLE picks the peak of the likelihood alone.]
Graphical Interpretation: Bayesian
[Figure: the prior $p(\mu)$, the likelihood $p(X|\mu) = \prod_{k=1}^{n}\frac{1}{\sqrt{2\pi}\,\sigma}\exp\!\left[-\frac{(x_k-\mu)^2}{2\sigma^2}\right]$, and their product, the posterior]

$$p(\mu|X) \propto p(X|\mu)\,p(\mu) = \prod_{k=1}^{n}\frac{1}{\sqrt{2\pi}\,\sigma}\exp\!\left[-\frac{(x_k-\mu)^2}{2\sigma^2}\right]\; p(\mu)$$
Results of Iterative Process
- Start with a prior distribution
- Incorporate the current batch of data
- Generate a new prior
- Goodness of the new prior = goodness of the old prior x goodness of the interpretation
Usually:
- the prior distribution sharpens (Bayesian learning)
- the uncertainty drops
MLE vs. Bayes

  MLE                           Bayes
  Faster (differentiation)      Slower (integration)
  Single model                  Multiple, weighted models
  Known model p(x|θ)            Unknown model is fine
  Less information              More information (nonuniform prior)
Does It Really Make a Difference?
Yes. The Bayesian classifier and MAP will in general give different results when used to classify new samples, because MAP (like MLE) keeps only one hypothesis while the Bayesian classifier keeps multiple, weighted hypotheses.
Example
MLE/MAP: classify $\mathbf{x}$ using $p(\mathbf{x}|\boldsymbol{\theta}')$, where $\boldsymbol{\theta}' = \arg\max_{\boldsymbol{\theta}} P(\boldsymbol{\theta}|X)$.
Bayesian: classify using $p(\mathbf{x}|X) = \int p(\mathbf{x}|\boldsymbol{\theta})\, p(\boldsymbol{\theta}|X)\, d\boldsymbol{\theta}$, i.e., pick the outcome with the largest weighted vote.

Suppose three hypotheses with posteriors:

$$P(\theta_1|X) = 0.4, \quad P(+|\theta_1) = 1, \quad P(-|\theta_1) = 0$$
$$P(\theta_2|X) = 0.3, \quad P(+|\theta_2) = 0, \quad P(-|\theta_2) = 1$$
$$P(\theta_3|X) = 0.3, \quad P(+|\theta_3) = 0, \quad P(-|\theta_3) = 1$$

With MAP only one hypothesis ($\theta_1$) is kept, so the prediction is $+$. The Bayesian classifier weights all of them:

$$P(+|X) = 0.4 \times 1 + 0.3 \times 0 + 0.3 \times 0 = 0.4$$
$$P(-|X) = 0.4 \times 0 + 0.3 \times 1 + 0.3 \times 1 = 0.6$$

so it predicts $-$.
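A tiny sketch (not from the slides) reproducing the numbers in the example above:

```python
import numpy as np

# Posterior weights of the three hypotheses and their votes for the label '+'.
posterior = np.array([0.4, 0.3, 0.3])      # P(theta_i | X)
p_plus    = np.array([1.0, 0.0, 0.0])      # P(+ | theta_i)

map_idx = np.argmax(posterior)
print("MAP keeps theta_%d -> predicts +" % (map_idx + 1))

p_plus_bayes = np.sum(posterior * p_plus)  # 0.4
print("Bayes-optimal: P(+|X)=%.1f, P(-|X)=%.1f -> predicts -"
      % (p_plus_bayes, 1 - p_plus_bayes))
```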
Gibbs Sampler
- The Bayesian classifier is optimal, but can be very expensive, especially when a large number of hypotheses are kept and evaluated
- Gibbs: randomly pick one hypothesis according to the current posterior distribution $p(\boldsymbol{\theta}|X)$
- Can be shown (later) to be related to the k-nearest-neighbor classifier; the expected error is at most twice that of the Bayesian classifier
An Example: Naïve Bayesian
- Features are a conjunction of attributes $\langle a_1, a_2, \ldots, a_n \rangle$
- Bayes theorem says we should maximize the a posteriori probability of the class
- The naïve Bayesian classifier assumes independence of the attributes:

$$c = \arg\max_{c_j} P(c_j|a_1, a_2, \ldots, a_n)
= \arg\max_{c_j} \frac{P(a_1, a_2, \ldots, a_n|c_j)\,P(c_j)}{P(a_1, a_2, \ldots, a_n)}
= \arg\max_{c_j} P(c_j) \prod_i P(a_i|c_j)$$
Example

  Day  Outlook   Temperature  Humidity  Wind    Play tennis
  D1   Sunny     Hot          High      Weak    No
  D2   Sunny     Hot          High      Strong  No
  D3   Overcast  Hot          High      Weak    Yes
  D4   Rain      Mild         High      Weak    Yes
  D5   Rain      Cool         Normal    Weak    Yes
  D6   Rain      Cool         Normal    Strong  No
  D7   Overcast  Cool         Normal    Strong  Yes
  D8   Sunny     Mild         High      Weak    No
  D9   Sunny     Cool         Normal    Weak    Yes
  D10  Rain      Mild         Normal    Weak    Yes
  D11  Sunny     Mild         Normal    Strong  Yes
  D12  Overcast  Mild         High      Strong  Yes
  D13  Overcast  Hot          Normal    Weak    Yes
  D14  Rain      Mild         High      Strong  No
Example (cont.)
Query: <Outlook=Sunny, Temperature=Cool, Humidity=High, Wind=Strong>. PlayTennis = yes or no?

$$c_{NB} = \arg\max_{c_j \in \{yes,\,no\}} P(c_j)\, P(\text{sunny}|c_j)\, P(\text{cool}|c_j)\, P(\text{high}|c_j)\, P(\text{strong}|c_j)$$

From the table:

$$P(\text{PlayTennis}=\text{yes}) = 9/14 = 0.64, \qquad P(\text{PlayTennis}=\text{no}) = 5/14 = 0.36$$
$$P(\text{Wind}=\text{strong}|\text{yes}) = 3/9 = 0.33, \qquad P(\text{Wind}=\text{strong}|\text{no}) = 3/5 = 0.6$$

and similarly for the other attributes, so

$$P(\text{yes})\,P(\text{sunny}|\text{yes})\,P(\text{cool}|\text{yes})\,P(\text{high}|\text{yes})\,P(\text{strong}|\text{yes}) = 0.0053$$
$$P(\text{no})\,P(\text{sunny}|\text{no})\,P(\text{cool}|\text{no})\,P(\text{high}|\text{no})\,P(\text{strong}|\text{no}) = 0.0206$$

so the prediction is PlayTennis = no.
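A minimal sketch (not from the slides) that recomputes the two scores above directly from the table:

```python
# The 14 training days from the table: (Outlook, Temperature, Humidity, Wind, PlayTennis)
data = [
    ("Sunny","Hot","High","Weak","No"),    ("Sunny","Hot","High","Strong","No"),
    ("Overcast","Hot","High","Weak","Yes"),("Rain","Mild","High","Weak","Yes"),
    ("Rain","Cool","Normal","Weak","Yes"), ("Rain","Cool","Normal","Strong","No"),
    ("Overcast","Cool","Normal","Strong","Yes"),("Sunny","Mild","High","Weak","No"),
    ("Sunny","Cool","Normal","Weak","Yes"),("Rain","Mild","Normal","Weak","Yes"),
    ("Sunny","Mild","Normal","Strong","Yes"),("Overcast","Mild","High","Strong","Yes"),
    ("Overcast","Hot","Normal","Weak","Yes"),("Rain","Mild","High","Strong","No"),
]
query = ("Sunny", "Cool", "High", "Strong")

for c in ("Yes", "No"):
    rows = [row for row in data if row[-1] == c]
    score = len(rows) / len(data)                     # P(c)
    for i, value in enumerate(query):                 # times product of P(a_i | c)
        score *= sum(row[i] == value for row in rows) / len(rows)
    print(c, round(score, 4))   # Yes ~0.0053, No ~0.0206 -> predict No
```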
Caveat
- Guard against zero probabilities $P(a_i|c_j)$, especially for small sample sizes and large sets of attribute values
- Use the m-estimate instead; if attribute $a_i$ can take $k$ values, then $p = 1/k$

$$P(a_i|c_j) = \frac{n_c + m\,p}{n + m}$$

where
- $n$: number of samples in $c_j$
- $n_c$: number of samples in $c_j$ with attribute value $a_i$
- $m$: equivalent sample size (add $m$ more samples)
- $p$: prior estimate
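A tiny sketch (not from the slides; the numbers reuse the tennis example) of the m-estimate above:

```python
def m_estimate(n_c, n, m, p):
    """P(a_i|c_j) with the m-estimate: (n_c + m*p) / (n + m).
    n_c: samples in c_j with attribute value a_i, n: samples in c_j,
    m: equivalent sample size, p: prior estimate (1/k for k attribute values)."""
    return (n_c + m * p) / (n + m)

# Example: Wind=strong given PlayTennis=no (3 of 5 days), k=2 wind values, m=2.
print(m_estimate(n_c=3, n=5, m=2, p=1/2))   # ~0.571 instead of the raw 0.6
print(m_estimate(n_c=0, n=5, m=2, p=1/2))   # ~0.143 instead of a hard zero
```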
More Examples
- Web page classification / newsgroup classification
- Like/dislike for web pages
- Science/sports/entertainment categories for web pages and newsgroups
More Examples (cont.)
- Select commonly occurring words as features (appearing at least k times in the documents)
- Eliminate stop words (the, it, etc.) and punctuation
- Apply word stemming (like, liked, etc.)
- Assume P(word_k|class_j) is independent of the word's position in the document
- Achieves 89% accuracy in classifying documents from 20 newsgroups