Parameter Estimation
PR, ANN, & ML
Notational Convention
Probabilities:
- Mass (discrete) function: capital letters
- Density (continuous) function: small letters
Vector vs. scalar:
- Scalar: plain
- Vector: bold
- 2D: small letters
- Higher dimension: capital letters
Notes remain in a continuous state of fluctuation until a topic is finished (many updates).
Parameter Estimation
The optimal classifier maximizes the posterior

$$P(\omega_i \mid \mathbf{x}) = \frac{p(\mathbf{x} \mid \omega_i)\, P(\omega_i)}{p(\mathbf{x})}$$

where $P(\omega_i)$ is the a priori probability and $p(\mathbf{x} \mid \omega_i)$ is the class-conditional density.
Assumptions: no correlation between samples; time-independent statistics.
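A minimal sketch (not from the slides) of the rule above, assuming the priors and class-conditional densities are already known; the two-class numbers below are made up for illustration:

```python
import numpy as np
from scipy.stats import norm

priors = np.array([0.6, 0.4])                   # P(w_i), assumed known
class_cond = [norm(0.0, 1.0), norm(2.0, 1.5)]   # p(x|w_i), assumed known 1-D Gaussians

def posterior(x):
    """Return P(w_i|x) for every class via Bayes rule."""
    likelihoods = np.array([d.pdf(x) for d in class_cond])
    joint = likelihoods * priors                # p(x|w_i) P(w_i)
    return joint / joint.sum()                  # divide by the evidence p(x)

x = 1.2
print(posterior(x), "-> predict class", np.argmax(posterior(x)))
```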
Popular Approaches
- Parametric: assume a certain parametric form for $p(\mathbf{x}|\omega_i)$ and estimate its parameters
- Nonparametric: do not assume a parametric form for $p(\mathbf{x}|\omega_i)$; estimate the density profile directly
- Boundary: estimate the separating hyperplane (hypersurface) between $p(\mathbf{x}|\omega_i)$ and $p(\mathbf{x}|\omega_j)$
A Priori Probability
Given the numbers of occurrences of each class in the training set,

$$(n_1, \omega_1),\ (n_2, \omega_2),\ \ldots,\ (n_k, \omega_k), \qquad M = \sum_{i=1}^{k} n_i$$

$$P(\omega_i) = \frac{n_i}{M}, \qquad i = 1, \ldots, k$$

This works if the number of samples is large enough and the selection process is not biased.
Caveat: sampling may be biased.
Class-Conditional Density
More complicated (not a single number, but a distribution):
- assume a certain form
- estimate its parameters
What form should we assume? Many are possible, but in this course we use almost exclusively the Gaussian.
Gaussian (or Normal)
Scalar case:

$$p(x|\omega_i) = N(\mu_i, \sigma_i^2) = \frac{1}{\sqrt{2\pi}\,\sigma_i}\exp\!\left[-\frac{(x-\mu_i)^2}{2\sigma_i^2}\right]$$

Vector case ($d$ dimensions):

$$p(\mathbf{x}|\omega_i) = \frac{1}{(2\pi)^{d/2}\,|\boldsymbol{\Sigma}_i|^{1/2}}\exp\!\left[-\frac{1}{2}(\mathbf{x}-\boldsymbol{\mu}_i)^T \boldsymbol{\Sigma}_i^{-1}(\mathbf{x}-\boldsymbol{\mu}_i)\right]$$

Unknowns: the class mean and variance ($\mu_i, \sigma_i^2$), or the mean vector and covariance matrix ($\boldsymbol{\mu}_i, \boldsymbol{\Sigma}_i$) in the vector case.
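A minimal NumPy sketch (not from the slides; the example values are made up) evaluating both forms of the density above:

```python
import numpy as np

def gaussian_1d(x, mu, var):
    """Scalar Gaussian density N(mu, var) evaluated at x."""
    return np.exp(-0.5 * (x - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)

def gaussian_nd(x, mu, Sigma):
    """d-dimensional Gaussian density N(mu, Sigma) evaluated at vector x."""
    d = len(mu)
    diff = x - mu
    norm_const = 1.0 / np.sqrt((2 * np.pi) ** d * np.linalg.det(Sigma))
    exponent = -0.5 * diff @ np.linalg.solve(Sigma, diff)   # (x-mu)^T Sigma^{-1} (x-mu)
    return norm_const * np.exp(exponent)

print(gaussian_1d(0.5, mu=0.0, var=1.0))
print(gaussian_nd(np.array([0.5, -0.2]),
                  mu=np.array([0.0, 0.0]),
                  Sigma=np.array([[1.0, 0.3], [0.3, 2.0]])))
```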
Gaussian Distribution
[Figure: the scalar Gaussian density $\frac{1}{\sqrt{2\pi}\,\sigma}\exp\!\left[-\frac{(x-\mu)^2}{2\sigma^2}\right]$ plotted as population versus feature value, with width governed by $\sigma$.]
Why Gaussian (Normal)?
- The central limit theorem predicts a normal distribution for sums of IID experiments, and many quantities measured in reality are indeed approximately normal (see the sketch after this list)
- There are only two numbers to estimate in the scalar case (mean and variance), or $d + d(d+1)/2$ in $d$ dimensions
- Nice mathematical properties: e.g., the Fourier transform of a Gaussian is a Gaussian; products and sums of Gaussians remain Gaussian; any linear transform of a Gaussian is a Gaussian
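A small sketch (not from the slides) illustrating the central limit theorem by summing IID uniform variables; the moments match the CLT prediction:

```python
import numpy as np

# Sums of IID uniform(0,1) variables approach a Gaussian as the number of
# summed terms grows -- the central limit theorem in action.
rng = np.random.default_rng(0)
n_terms, n_trials = 30, 100_000
sums = rng.uniform(0, 1, size=(n_trials, n_terms)).sum(axis=1)

# Compare the empirical mean/variance with the predicted n/2 and n/12.
print(sums.mean(), n_terms / 2)    # ~15.0
print(sums.var(), n_terms / 12)    # ~2.5
```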
Projection and Transformation
A projection or linear transformation of a Gaussian is still Gaussian. In particular, a whitening transform can diagonalize the covariance matrix.
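A minimal sketch (not from the slides; sample data are made up) of a whitening transform built from the eigendecomposition of the sample covariance:

```python
import numpy as np

rng = np.random.default_rng(1)
# Correlated 2-D Gaussian samples (rows are observations).
X = rng.multivariate_normal([0, 0], [[2.0, 1.2], [1.2, 1.0]], size=5000)

Sigma = np.cov(X, rowvar=False)
eigvals, Phi = np.linalg.eigh(Sigma)       # Sigma = Phi diag(eigvals) Phi^T
A = Phi @ np.diag(eigvals ** -0.5)         # whitening transform A = Phi Lambda^{-1/2}
Y = X @ A                                  # y = A^T x, applied row-wise

print(np.cov(Y, rowvar=False).round(2))    # ~ identity covariance
```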
Parameter Estimation
- Maximum likelihood estimator: parameters have fixed but unknown values
- Bayesian estimator: parameters are random variables with known a priori distributions; the Bayesian estimator allows us to change the a priori distribution by incorporating measurements, sharpening the profile
Graphically: MLE vs. Bayesian
[Figure: likelihood plotted as a function of the parameters; MLE picks the single peak, while the Bayesian estimator keeps the whole distribution over the parameters.]
Maximum Likelihood Estimator
Given:
- $n$ labeled samples (observations) $X = \{x_1, x_2, \ldots, x_n\}$
- an assumed distribution with $e$ parameters $\boldsymbol{\theta} = \{\theta_1, \theta_2, \ldots, \theta_e\}$
- samples drawn independently from $p(X|\omega_j) \equiv p(X|\omega_j, \boldsymbol{\theta}_j)$
Find:
- the parameter vector that best explains the observations
MLE Formulation
Maximize the likelihood

$$p(X|\boldsymbol{\theta}) = \prod_{j=1}^{n} p(\mathbf{x}_j|\boldsymbol{\theta})$$

or, equivalently, the log likelihood

$$l(\boldsymbol{\theta}) = \log p(X|\boldsymbol{\theta}) = \sum_{j=1}^{n} \log p(\mathbf{x}_j|\boldsymbol{\theta})$$

The maximum is found by setting the gradient to zero:

$$\nabla_{\boldsymbol{\theta}}\, p(X|\boldsymbol{\theta}) = 0 \qquad \text{or} \qquad \nabla_{\boldsymbol{\theta}}\, l(\boldsymbol{\theta}) = \sum_{j=1}^{n} \nabla_{\boldsymbol{\theta}} \log p(\mathbf{x}_j|\boldsymbol{\theta}) = 0$$
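A minimal numerical sketch (not from the slides; data and starting point are made up) of maximizing the Gaussian log likelihood directly with a generic optimizer, rather than the closed form derived next:

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(2)
x = rng.normal(loc=3.0, scale=2.0, size=200)      # observed samples

def neg_log_likelihood(theta):
    """-l(theta) = -sum_j log p(x_j | mu, sigma)."""
    mu, log_sigma = theta                          # sigma parameterized by its log to stay positive
    return -np.sum(norm.logpdf(x, loc=mu, scale=np.exp(log_sigma)))

res = minimize(neg_log_likelihood, x0=[0.0, 0.0])
mu_hat, sigma_hat = res.x[0], np.exp(res.x[1])
print(mu_hat, sigma_hat)                           # close to the sample mean and std
```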
An Example
Scalar Gaussian with unknown mean and variance, $\boldsymbol{\theta} = (\theta_1, \theta_2) = (\mu, \sigma^2)$:

$$p(x_j|\boldsymbol{\theta}) = \frac{1}{\sqrt{2\pi\theta_2}}\exp\!\left[-\frac{(x_j-\theta_1)^2}{2\theta_2}\right]$$

$$\log p(x_j|\boldsymbol{\theta}) = -\frac{1}{2}\log(2\pi\theta_2) - \frac{(x_j-\theta_1)^2}{2\theta_2}$$

$$\nabla_{\boldsymbol{\theta}} \log p(x_j|\boldsymbol{\theta}) =
\begin{bmatrix}
\dfrac{x_j-\theta_1}{\theta_2} \\[2ex]
-\dfrac{1}{2\theta_2} + \dfrac{(x_j-\theta_1)^2}{2\theta_2^2}
\end{bmatrix}$$
An Example (cont.)
Setting the summed gradient to zero,

$$\sum_{j=1}^{n} \frac{x_j-\hat{\mu}}{\hat{\sigma}^2} = 0, \qquad
-\sum_{j=1}^{n} \frac{1}{2\hat{\sigma}^2} + \sum_{j=1}^{n} \frac{(x_j-\hat{\mu})^2}{2\hat{\sigma}^4} = 0$$

gives

$$\hat{\mu} = \frac{1}{n}\sum_{j=1}^{n} x_j, \qquad
\hat{\sigma}^2 = \frac{1}{n}\sum_{j=1}^{n} (x_j-\hat{\mu})^2$$

Class mean as sample mean, class variance as sample variance. The resulting discriminant is

$$g_i(x) = p(x|\omega_i)P(\omega_i) = N(\hat{\mu}_i, \hat{\sigma}_i^2)\,P(\omega_i)
= \frac{1}{\sqrt{2\pi\hat{\sigma}_i^2}}\exp\!\left[-\frac{(x-\hat{\mu}_i)^2}{2\hat{\sigma}_i^2}\right] P(\omega_i)$$

Note that the variance estimate divides by $n$ rather than $n-1$: the MLE is biased!
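A small sketch (not from the slides; simulated data) showing the bias of the $1/n$ variance estimate against the $1/(n-1)$ correction:

```python
import numpy as np

rng = np.random.default_rng(3)
n, trials = 10, 20_000
samples = rng.normal(0.0, 1.0, size=(trials, n))       # true variance = 1

mu_hat = samples.mean(axis=1, keepdims=True)
var_mle = ((samples - mu_hat) ** 2).mean(axis=1)        # divide by n   (MLE)
var_unbiased = ((samples - mu_hat) ** 2).sum(axis=1) / (n - 1)

print(var_mle.mean())       # ~0.9 = (n-1)/n, biased low
print(var_unbiased.mean())  # ~1.0, unbiased
```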
[Figure: samples of a measured quantity (coin weight) plotted against population, with two candidate Gaussian fits of different widths $\sigma_1$ and $\sigma_2$; the likelihood $\prod_j \frac{1}{\sqrt{2\pi}\,\sigma}\exp\!\left[-\frac{(x_j-\mu)^2}{2\sigma^2}\right]$ depends strongly on the chosen width.]
[Figure: the same samples with a fitted Gaussian $\frac{1}{\sqrt{2\pi}\,\hat{\sigma}}\exp\!\left[-\frac{(x-\hat{\mu})^2}{2\hat{\sigma}^2}\right]$ whose width $\hat{\sigma}$ is either too narrow or too wide.]
- If $\hat{\sigma}$ is too narrow, many sample points fall outside the $2\sigma$ width and have low likelihood of occurrence
- If $\hat{\sigma}$ is too wide, the $1/\sigma$ factor becomes too small and reduces the likelihood of occurrence
A Quick Word on MAP
- The MAP (maximum a posteriori) estimator is similar to MLE, with one additional twist
- Maximize the (log) likelihood $l(\boldsymbol{\theta})$ plus $\log p(\boldsymbol{\theta})$, the prior probability of the parameter values (if you know it), e.g., the mean is more likely to be near $\mu_0$, with a normal distribution around it (a sketch follows below)
- MLE corresponds to a uniform prior; MAP does not necessarily
- The added term is a case of "regularization"
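A minimal sketch (not from the slides) of MAP estimation of a Gaussian mean with known variance and a normal prior on the mean; the closed form below comes from setting the derivative of log likelihood plus log prior to zero:

```python
import numpy as np

def map_mean(x, sigma2, mu0, sigma0_2):
    """MAP estimate of the mean of N(mu, sigma2) under a N(mu0, sigma0_2) prior:
    d/dmu [log likelihood + log prior] = 0 gives the closed form below."""
    n = len(x)
    return (n * sigma0_2 * x.mean() + sigma2 * mu0) / (n * sigma0_2 + sigma2)

rng = np.random.default_rng(4)
x = rng.normal(5.0, 1.0, size=8)
print(x.mean())                                          # MLE: sample mean (uniform prior)
print(map_mean(x, sigma2=1.0, mu0=0.0, sigma0_2=0.5))    # pulled toward mu0 = 0 by the prior
```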
Bayesian Estimator
Note that MLE is a batch estimator:
- all data have to be kept
- difficult to update the estimate
- difficult to incorporate other evidence
- insists on a single estimate
The Bayesian estimator:
- allows the parameters themselves to be random variables
- allows multiple sources of evidence
- allows iterative updates
Bayesian Estimator
Based on Bayes rule:

$$P(\omega_i|\mathbf{x}) = \frac{p(\mathbf{x}|\omega_i)\,P(\omega_i)}{\sum_j p(\mathbf{x}|\omega_j)\,P(\omega_j)}$$

With the sample set $X$ at our disposal:

$$P(\omega_i|\mathbf{x}, X) = \frac{p(\mathbf{x}|\omega_i, X)\,P(\omega_i|X)}{\sum_j p(\mathbf{x}|\omega_j, X)\,P(\omega_j|X)}$$
Bayes Rule Formulation
Assume:
- each sample set $X_i$ comes from only one class
- the prior $P(\omega_i)$ is independent of $X$

$$P(\omega_i|\mathbf{x}, X) = \frac{p(\mathbf{x}|\omega_i, X_i)\,P(\omega_i)}{\sum_j p(\mathbf{x}|\omega_j, X_j)\,P(\omega_j)}$$
How Can X Be Used?
The form of the distribution is known (e.g., normal); the parameters are unknown. Use $X$ to estimate the class parameters via $p(\boldsymbol{\theta}|X)$; the parameters then constrain $\mathbf{x}$ through $p(\mathbf{x}|\boldsymbol{\theta})$; putting it all together:

$$p(\mathbf{x}|X) = \int p(\mathbf{x}|\boldsymbol{\theta})\, p(\boldsymbol{\theta}|X)\, d\boldsymbol{\theta}$$
Bayes Rule Formulation (cont.)
Ideally,

$$p(\mathbf{x}|X) = \int p(\mathbf{x}|\boldsymbol{\theta})\, p(\boldsymbol{\theta}|X)\, d\boldsymbol{\theta}$$

If the posterior is a spike at a single value,

$$p(\boldsymbol{\theta}|X) = \begin{cases} 1 & \boldsymbol{\theta} = \hat{\boldsymbol{\theta}} \\ 0 & \text{otherwise} \end{cases}
\quad\Rightarrow\quad
p(\mathbf{x}|X) = p(\mathbf{x}|\hat{\boldsymbol{\theta}})$$

This is MLE! Otherwise, all possible $\boldsymbol{\theta}$'s are used.
Graphic Interpretation
[Figure: the sample set $X = \{x_1, x_2, \ldots, x_n\}$ and candidate parameter vectors $\boldsymbol{\theta} = \{\theta_1, \ldots, \theta_e\}$, $\boldsymbol{\theta}' = \{\theta_1', \ldots, \theta_e'\}$, $\boldsymbol{\theta}'' = \{\theta_1'', \ldots, \theta_e''\}$, each weighted by $p(\mathbf{x}|\boldsymbol{\theta})P(\boldsymbol{\theta})$.]
An Example
Estimating the mean of a normal distribution; the variance $\sigma^2$ is known; $n$ samples are used.

First step:

$$p(\mu|X) = \frac{p(X|\mu)\,p(\mu)}{p(X)} = \alpha \prod_{k=1}^{n} p(x_k|\mu)\; p(\mu)$$

with

$$p(x_k|\mu) = \frac{1}{\sqrt{2\pi}\,\sigma}\exp\!\left[-\frac{(x_k-\mu)^2}{2\sigma^2}\right], \qquad
p(\mu) = N(\mu_0, \sigma_0^2) = \frac{1}{\sqrt{2\pi}\,\sigma_0}\exp\!\left[-\frac{(\mu-\mu_0)^2}{2\sigma_0^2}\right]$$

and, once $p(\mu|X)$ is available,

$$p(x|X) = \int p(x|\mu)\, p(\mu|X)\, d\mu$$

Here $p(X|\mu)$ carries the current evidence and $p(\mu)$ the previous and other evidence. Key to Bayesian estimation: both current and prior evidence can be used.
Then

$$p(\mu|X) = \alpha \prod_{k=1}^{n} \frac{1}{\sqrt{2\pi}\,\sigma}\exp\!\left[-\frac{(x_k-\mu)^2}{2\sigma^2}\right]
\frac{1}{\sqrt{2\pi}\,\sigma_0}\exp\!\left[-\frac{(\mu-\mu_0)^2}{2\sigma_0^2}\right]
= \alpha' \exp\!\left[-\frac{1}{2}\left\{\left(\frac{n}{\sigma^2}+\frac{1}{\sigma_0^2}\right)\mu^2
- 2\left(\frac{1}{\sigma^2}\sum_{k=1}^{n} x_k + \frac{\mu_0}{\sigma_0^2}\right)\mu\right\}\right]
= \frac{1}{\sqrt{2\pi}\,\sigma_n}\exp\!\left[-\frac{(\mu-\mu_n)^2}{2\sigma_n^2}\right]$$

so the posterior is again Gaussian, $p(\mu|X) = N(\mu_n, \sigma_n^2)$, with

$$\frac{1}{\sigma_n^2} = \frac{n}{\sigma^2} + \frac{1}{\sigma_0^2}, \qquad
\frac{\mu_n}{\sigma_n^2} = \frac{n}{\sigma^2}\,\bar{x}_n + \frac{\mu_0}{\sigma_0^2}, \qquad
\bar{x}_n = \frac{1}{n}\sum_{k=1}^{n} x_k$$

i.e.,

$$\mu_n = \frac{n\sigma_0^2}{n\sigma_0^2+\sigma^2}\,\bar{x}_n + \frac{\sigma^2}{n\sigma_0^2+\sigma^2}\,\mu_0, \qquad
\sigma_n^2 = \frac{\sigma_0^2\,\sigma^2}{n\sigma_0^2+\sigma^2}$$

and $\mu_n \to \bar{x}_n$ as $n \to \infty$. A small sketch of this update follows.
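A minimal sketch (not from the slides; the sample data and prior values are made up) that implements the $\mu_n$ and $\sigma_n^2$ formulas above:

```python
import numpy as np

def bayes_update_mean(x, sigma2, mu0, sigma0_2):
    """Posterior p(mu|X) = N(mu_n, sigma_n^2) for a Gaussian likelihood with
    known variance sigma2 and a N(mu0, sigma0_2) prior on the mean."""
    n, xbar = len(x), np.mean(x)
    mu_n = (n * sigma0_2 * xbar + sigma2 * mu0) / (n * sigma0_2 + sigma2)
    sigma_n2 = (sigma0_2 * sigma2) / (n * sigma0_2 + sigma2)
    return mu_n, sigma_n2

rng = np.random.default_rng(5)
x = rng.normal(2.0, 1.0, size=20)
mu_n, sigma_n2 = bayes_update_mean(x, sigma2=1.0, mu0=0.0, sigma0_2=4.0)
print(mu_n, sigma_n2)   # mean pulled toward the data, uncertainty shrinks with n
```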
X Helps In
- Defining the mean
- Reducing the uncertainty in the mean
Trust the new data ($\mu_n \approx \bar{x}_n$) if:
- the class variance $\sigma^2$ is small
- the number of samples $n$ is large
- the prior is uncertain ($\sigma_0^2$ is large)
i.e., whenever $\sigma^2/n \ll \sigma_0^2$.
An Example (cont.)
Second step: with $p(x|\mu) = \frac{1}{\sqrt{2\pi}\,\sigma}\exp\!\left[-\frac{(x-\mu)^2}{2\sigma^2}\right]$,

$$g(x) = p(x|X) = \int p(x|\mu)\, p(\mu|X)\, d\mu
= \int \frac{1}{\sqrt{2\pi}\,\sigma}\exp\!\left[-\frac{(x-\mu)^2}{2\sigma^2}\right]
\frac{1}{\sqrt{2\pi}\,\sigma_n}\exp\!\left[-\frac{(\mu-\mu_n)^2}{2\sigma_n^2}\right] d\mu
= \frac{1}{2\pi\sigma\sigma_n}\exp\!\left[-\frac{(x-\mu_n)^2}{2(\sigma^2+\sigma_n^2)}\right] f(\sigma, \sigma_n)$$

where

$$f(\sigma,\sigma_n) = \int \exp\!\left[-\frac{1}{2}\,\frac{\sigma^2+\sigma_n^2}{\sigma^2\sigma_n^2}
\left(\mu - \frac{\sigma_n^2 x + \sigma^2\mu_n}{\sigma^2+\sigma_n^2}\right)^{\!2}\right] d\mu$$

so $p(x|X) \sim N(\mu_n,\ \sigma^2+\sigma_n^2)$.

Third step: put it all together, using $p(x|X_i)$ in place of the true class-conditional density in the classifier.
Graphical Interpretation: MLE
[Figure: the prior $p(\mu)$ and the likelihood $p(X|\mu) = \prod_{k=1}^{n}\frac{1}{\sqrt{2\pi}\,\sigma}\exp\!\left[-\frac{(x_k-\mu)^2}{2\sigma^2}\right]$; MLE picks the peak of the likelihood alone.]
Graphical Interpretation: Bayesian
[Figure: the prior $p(\mu)$, the likelihood $p(X|\mu) = \prod_{k=1}^{n}\frac{1}{\sqrt{2\pi}\,\sigma}\exp\!\left[-\frac{(x_k-\mu)^2}{2\sigma^2}\right]$, and their product, the posterior]

$$p(\mu|X) \propto p(X|\mu)\,p(\mu) = \prod_{k=1}^{n}\frac{1}{\sqrt{2\pi}\,\sigma}\exp\!\left[-\frac{(x_k-\mu)^2}{2\sigma^2}\right]\; p(\mu)$$
Results of Iterative Process
- Start with a prior distribution
- Incorporate the current batch of data
- Generate a new prior
- Goodness of the new prior = goodness of the old prior x goodness of the interpretation
Usually:
- the prior distribution sharpens (Bayesian learning)
- the uncertainty drops
MLE vs. Bayes

  MLE                           Bayes
  Faster (differentiation)      Slower (integration)
  Single model                  Multiple, weighted models
  Known model p(x|θ)            Unknown model is fine
  Less information              More information (nonuniform prior)
Does It Really Make a Difference?
Yes. The Bayesian classifier and MAP will in general give different results when used to classify new samples, because MAP (like MLE) keeps only one hypothesis while the Bayesian classifier keeps multiple, weighted hypotheses.
Example
MLE/MAP: classify $\mathbf{x}$ using $p(\mathbf{x}|\boldsymbol{\theta}')$, where $\boldsymbol{\theta}' = \arg\max_{\boldsymbol{\theta}} P(\boldsymbol{\theta}|X)$.
Bayesian: classify using $p(\mathbf{x}|X) = \int p(\mathbf{x}|\boldsymbol{\theta})\, p(\boldsymbol{\theta}|X)\, d\boldsymbol{\theta}$, i.e., pick the outcome with the largest weighted vote.

Suppose three hypotheses with posteriors:

$$P(\theta_1|X) = 0.4, \quad P(+|\theta_1) = 1, \quad P(-|\theta_1) = 0$$
$$P(\theta_2|X) = 0.3, \quad P(+|\theta_2) = 0, \quad P(-|\theta_2) = 1$$
$$P(\theta_3|X) = 0.3, \quad P(+|\theta_3) = 0, \quad P(-|\theta_3) = 1$$

With MAP only one hypothesis ($\theta_1$) is kept, so the prediction is $+$. The Bayesian classifier weights all of them:

$$P(+|X) = 0.4 \times 1 + 0.3 \times 0 + 0.3 \times 0 = 0.4$$
$$P(-|X) = 0.4 \times 0 + 0.3 \times 1 + 0.3 \times 1 = 0.6$$

so it predicts $-$.
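A tiny sketch (not from the slides) reproducing the numbers in the example above:

```python
import numpy as np

# Posterior weights of the three hypotheses and their votes for the label '+'.
posterior = np.array([0.4, 0.3, 0.3])      # P(theta_i | X)
p_plus    = np.array([1.0, 0.0, 0.0])      # P(+ | theta_i)

map_idx = np.argmax(posterior)
print("MAP keeps theta_%d -> predicts +" % (map_idx + 1))

p_plus_bayes = np.sum(posterior * p_plus)  # 0.4
print("Bayes-optimal: P(+|X)=%.1f, P(-|X)=%.1f -> predicts -"
      % (p_plus_bayes, 1 - p_plus_bayes))
```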
Gibbs Sampler
- The Bayesian classifier is optimal, but can be very expensive, especially when a large number of hypotheses are kept and evaluated
- Gibbs: randomly pick one hypothesis according to the current posterior distribution $p(\boldsymbol{\theta}|X)$
- Can be shown (later) to be related to the k-nearest-neighbor classifier; the expected error is at most twice that of the Bayesian classifier
An Example: Naïve Bayesian
- Features are a conjunction of attributes $\langle a_1, a_2, \ldots, a_n \rangle$
- Bayes theorem says we should maximize the a posteriori probability of the class
- The naïve Bayesian classifier assumes independence of the attributes:

$$c = \arg\max_{c_j} P(c_j|a_1, a_2, \ldots, a_n)
= \arg\max_{c_j} \frac{P(a_1, a_2, \ldots, a_n|c_j)\,P(c_j)}{P(a_1, a_2, \ldots, a_n)}
= \arg\max_{c_j} P(c_j) \prod_i P(a_i|c_j)$$
Example

  Day  Outlook   Temperature  Humidity  Wind    Play tennis
  D1   Sunny     Hot          High      Weak    No
  D2   Sunny     Hot          High      Strong  No
  D3   Overcast  Hot          High      Weak    Yes
  D4   Rain      Mild         High      Weak    Yes
  D5   Rain      Cool         Normal    Weak    Yes
  D6   Rain      Cool         Normal    Strong  No
  D7   Overcast  Cool         Normal    Strong  Yes
  D8   Sunny     Mild         High      Weak    No
  D9   Sunny     Cool         Normal    Weak    Yes
  D10  Rain      Mild         Normal    Weak    Yes
  D11  Sunny     Mild         Normal    Strong  Yes
  D12  Overcast  Mild         High      Strong  Yes
  D13  Overcast  Hot          Normal    Weak    Yes
  D14  Rain      Mild         High      Strong  No
Example (cont.)
Query: <Outlook=Sunny, Temperature=Cool, Humidity=High, Wind=Strong>. PlayTennis = yes or no?

$$c_{NB} = \arg\max_{c_j \in \{yes,\,no\}} P(c_j)\, P(\text{sunny}|c_j)\, P(\text{cool}|c_j)\, P(\text{high}|c_j)\, P(\text{strong}|c_j)$$

From the table:

$$P(\text{PlayTennis}=\text{yes}) = 9/14 = 0.64, \qquad P(\text{PlayTennis}=\text{no}) = 5/14 = 0.36$$
$$P(\text{Wind}=\text{strong}|\text{yes}) = 3/9 = 0.33, \qquad P(\text{Wind}=\text{strong}|\text{no}) = 3/5 = 0.6$$

and similarly for the other attributes, so

$$P(\text{yes})\,P(\text{sunny}|\text{yes})\,P(\text{cool}|\text{yes})\,P(\text{high}|\text{yes})\,P(\text{strong}|\text{yes}) = 0.0053$$
$$P(\text{no})\,P(\text{sunny}|\text{no})\,P(\text{cool}|\text{no})\,P(\text{high}|\text{no})\,P(\text{strong}|\text{no}) = 0.0206$$

so the prediction is PlayTennis = no.
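A minimal sketch (not from the slides) that recomputes the two scores above directly from the table:

```python
# The 14 training days from the table: (Outlook, Temperature, Humidity, Wind, PlayTennis)
data = [
    ("Sunny","Hot","High","Weak","No"),    ("Sunny","Hot","High","Strong","No"),
    ("Overcast","Hot","High","Weak","Yes"),("Rain","Mild","High","Weak","Yes"),
    ("Rain","Cool","Normal","Weak","Yes"), ("Rain","Cool","Normal","Strong","No"),
    ("Overcast","Cool","Normal","Strong","Yes"),("Sunny","Mild","High","Weak","No"),
    ("Sunny","Cool","Normal","Weak","Yes"),("Rain","Mild","Normal","Weak","Yes"),
    ("Sunny","Mild","Normal","Strong","Yes"),("Overcast","Mild","High","Strong","Yes"),
    ("Overcast","Hot","Normal","Weak","Yes"),("Rain","Mild","High","Strong","No"),
]
query = ("Sunny", "Cool", "High", "Strong")

for c in ("Yes", "No"):
    rows = [row for row in data if row[-1] == c]
    score = len(rows) / len(data)                     # P(c)
    for i, value in enumerate(query):                 # times product of P(a_i | c)
        score *= sum(row[i] == value for row in rows) / len(rows)
    print(c, round(score, 4))   # Yes ~0.0053, No ~0.0206 -> predict No
```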
Caveat
- Guard against zero probabilities $P(a_i|c_j)$, especially for small sample sizes and large sets of attribute values
- Use the m-estimate instead; if attribute $a_i$ can take $k$ values, then $p = 1/k$

$$P(a_i|c_j) = \frac{n_c + m\,p}{n + m}$$

where
- $n$: number of samples in $c_j$
- $n_c$: number of samples in $c_j$ with attribute value $a_i$
- $m$: equivalent sample size (add $m$ more samples)
- $p$: prior estimate
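A tiny sketch (not from the slides; the numbers reuse the tennis example) of the m-estimate above:

```python
def m_estimate(n_c, n, m, p):
    """P(a_i|c_j) with the m-estimate: (n_c + m*p) / (n + m).
    n_c: samples in c_j with attribute value a_i, n: samples in c_j,
    m: equivalent sample size, p: prior estimate (1/k for k attribute values)."""
    return (n_c + m * p) / (n + m)

# Example: Wind=strong given PlayTennis=no (3 of 5 days), k=2 wind values, m=2.
print(m_estimate(n_c=3, n=5, m=2, p=1/2))   # ~0.571 instead of the raw 0.6
print(m_estimate(n_c=0, n=5, m=2, p=1/2))   # ~0.143 instead of a hard zero
```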
More Examples
- Web page classification / newsgroup classification
- Like/dislike for web pages
- Science/sports/entertainment categories for web pages and newsgroups
More Examples (cont.)
- Select commonly occurring words as features (appearing at least k times in the documents)
- Eliminate stop words (the, it, etc.) and punctuation
- Apply word stemming (like, liked, etc.)
- Assume P(word_k|class_j) is independent of the word's position in the document
- Achieves 89% accuracy in classifying documents from 20 newsgroups