Expectation-Maximization Algorithm
CAP5610: Machine Learning. Instructor: Guo-Jun QI
What is EM?
An algorithm that alternates between an assignment step and an estimation step.
A parameter estimation method: it falls into the general framework of maximum-likelihood estimation (MLE).
The general form was given by Dempster, Laird, and Rubin (1977), although the essence of the algorithm had appeared previously in various forms.
Outline
- Maximum Likelihood Estimation (MLE)
- EM: basic concepts
- EM for Gaussian Mixture Model (GMM)
What is MLE?
Given:
- A sample X = X_1, …, X_n
- A vector of parameters θ

We define:
- The likelihood of the data: P(X | θ)
- The log-likelihood of the data: L(θ) = log P(X | θ)
Given X, find

$$\theta_{ML} = \arg\max_{\theta \in \Omega} L(\theta)$$
MLE (cont)
Often we assume that the X_i's are independently and identically distributed (i.i.d.).
Depending on the form of p(x | θ), solving this optimization problem can be easy or hard.
$$
\begin{aligned}
\theta_{ML} &= \arg\max_{\theta \in \Omega} L(\theta) \\
&= \arg\max_{\theta \in \Omega} \log P(X_1, \ldots, X_n \mid \theta) \\
&= \arg\max_{\theta \in \Omega} \log \prod_{i=1}^{n} P(X_i \mid \theta) \\
&= \arg\max_{\theta \in \Omega} \sum_{i=1}^{n} \log P(X_i \mid \theta)
\end{aligned}
$$
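As a concrete illustration (a minimal sketch with synthetic data, not part of the original slides): because the i.i.d. assumption turns the log-likelihood into a sum, the MLE for a 1-D Gaussian has a simple closed form.

```python
# A minimal sketch: MLE for a 1-D Gaussian. The i.i.d. log-likelihood
# is a sum, and maximizing it in closed form gives the sample mean
# and the (biased) sample variance.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(loc=2.0, scale=1.5, size=1000)  # observed sample

mu_ml = X.mean()                     # argmax over mu of sum_i log N(x_i | mu, var)
var_ml = ((X - mu_ml) ** 2).mean()   # MLE of the variance (divides by n)

print(f"mu_ML = {mu_ml:.3f}, var_ML = {var_ml:.3f}")
```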
Example 1: Fit a line to observed data
[Figure: scatter plot of observed (x, y) data points]
Maximum likelihood estimation for the slope of a single line

Data: $(X_n, Y_n),\ n = 1, \ldots, N$
Model: $Y = aX + w$, where $w \sim N(0, \sigma^2)$

Data likelihood for point n: $P(Y_n \mid X_n, a) = N(Y_n \mid aX_n, \sigma^2)$

Maximum likelihood estimate: $\hat{a} = \arg\max_a \sum_{n=1}^{N} \log P(Y_n \mid X_n, a)$, which gives the least-squares regression formula $\hat{a} = \frac{\sum_n X_n Y_n}{\sum_n X_n^2}$.
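To make this concrete, here is a minimal sketch (synthetic data, illustrative names; not from the slides) showing that the Gaussian-noise MLE for the slope is exactly the least-squares formula.

```python
# A minimal sketch: MLE for the slope of a no-intercept line
# Y = a*X + w with Gaussian noise w. Maximizing the Gaussian
# log-likelihood is equivalent to least squares.
import numpy as np

rng = np.random.default_rng(1)
X = rng.uniform(0, 10, size=200)
Y = 0.7 * X + rng.normal(scale=0.5, size=200)  # true slope a = 0.7

a_ml = (X * Y).sum() / (X ** 2).sum()  # closed-form least-squares slope
print(f"a_ML = {a_ml:.3f}")
```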
Example 2: Fitting two lines to observed data
[Figure: scatter plot of (x, y) data points drawn from two different lines]
Fitting two lines: on the one hand…
If we knew which points went with which lines, we'd be back at the single line-fitting problem, twice.
[Figure: the data with points assigned to Line 1 and Line 2]
Fitting two lines, on the other hand…
We could figure out the probability that any point came from either line if we just knew the equations of the two lines.
[Figure: the data with per-point membership probabilities for the two lines]
Expectation Maximization (EM): a solution to chicken-and-egg problems
EM example:
- Step 1: Randomly guess two lines as the initialization.
- Step 2: Assign points to the closest line.
- Step 3: Update the current fit of each line to its assigned points.
- Step 4: Re-assign points to the closest line.
- Step 5: Re-fit the lines to the assigned points. Converged!

[Figure sequence: the two fitted lines and point assignments after each step]
EM Algorithm

EM with hard assignment:
- Randomly guess the initial models, e.g., the cluster means in K-means or the lines in the multi-line fitting problem.
- Repeat until convergence:
  - Hard assignment: assign each data point to a cluster based on fitness. Fitness is the distance to the cluster mean in K-means, and the distance to the line in multi-line fitting.
  - Re-fit the model for each cluster: in K-means, update the cluster mean; in multi-line fitting, update the line. Both re-fitting rules can be formulated as maximizing the likelihood of generating the assigned data points.

A code sketch of this procedure for the two-line problem follows.
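Below is a minimal sketch (synthetic data, illustrative names; not the course's code) of hard-assignment EM for the two-line fitting example, alternating the assignment and re-fitting steps above.

```python
# A minimal sketch: hard-assignment EM for fitting two no-intercept
# lines y = a*x to 2-D points.
import numpy as np

rng = np.random.default_rng(2)
x = rng.uniform(-5, 5, size=300)
true_a = np.where(rng.random(300) < 0.5, 1.5, -0.5)  # hidden line membership
y = true_a * x + rng.normal(scale=0.3, size=300)

a = np.array([1.0, -1.0])          # Step 1: random initial slopes
for _ in range(20):
    # Hard assignment: each point goes to the line with smaller residual.
    resid = np.abs(y[:, None] - np.outer(x, a))      # shape (n_points, 2)
    z = resid.argmin(axis=1)
    # Re-fit: least-squares slope on each cluster (the MLE from before).
    for k in range(2):
        m = z == k
        if m.any():
            a[k] = (x[m] * y[m]).sum() / (x[m] ** 2).sum()

print("estimated slopes:", a)
```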
Outline
- Maximum Likelihood Estimation (MLE)
- EM: basic concepts (soft assignment)
- EM for Gaussian Mixture Model (GMM)
Basic setting in EM

- X is a set of data points: the observed data.
- Y is a cluster index denoting the assignment of a data point to a cluster. Y is not a deterministic variable as in hard assignment; it is now a random variable, for soft assignment.

EM is a method to find θ_ML, where

$$\theta_{ML} = \arg\max_{\theta} L(\theta) = \arg\max_{\theta} \log P(X \mid \theta)$$

Calculating P(X | θ) directly is hard, because we have to marginalize out Y. Calculating P(X, Y | θ) is much simpler, where Y is the "hidden" (or "missing") data. A small numeric illustration follows.
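Below is a minimal sketch (illustrative numbers, assumed notation) of this point: the complete-data log-probability log P(X, Y | θ) needs no sum, while the marginal log P(X | θ) requires summing over every value of Y inside the log, computed stably here with logsumexp.

```python
# Why P(X | theta) is harder than P(X, Y | theta): with Y hidden, the
# log-likelihood has a sum over all assignments inside the log.
import numpy as np
from scipy.special import logsumexp
from scipy.stats import norm

x = np.array([0.2, 1.9, 2.1])                           # observed points
means, p = np.array([0.0, 2.0]), np.array([0.5, 0.5])   # two components

# log P(x_i, Y=k | theta) = log p_k + log N(x_i | mu_k, 1): easy, no sum.
log_joint = np.log(p) + norm.logpdf(x[:, None], loc=means)  # shape (3, 2)

# log P(x_i | theta) = log sum_k P(x_i, Y=k | theta): marginalize out Y.
log_marginal = logsumexp(log_joint, axis=1)
print(log_marginal.sum())            # total log-likelihood of the data
```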
The EM terminology
$$P(X, Y = k \mid \theta) = P(X \mid Y = k, \theta)\, P(Y = k \mid \theta) = P(X \mid \theta_k)\, P_k$$

Here θ_k denotes the parameters of the k-th cluster's distribution, and P_k the prior proportion of cluster k.
Gaussian Distribution

$$f(x \mid \mu, \Sigma) = \frac{1}{\sqrt{(2\pi)^d \det(\Sigma)}} \exp\left( -\frac{(x - \mu)^T \Sigma^{-1} (x - \mu)}{2} \right)$$

- d = feature dimension
- x = random data point
- μ = mean vector
- Σ = covariance matrix
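As a minimal sketch (not from the slides), the density above translates directly into NumPy:

```python
# The multivariate Gaussian density f(x | mu, Sigma) in NumPy.
import numpy as np

def gaussian_pdf(x, mu, sigma):
    """Multivariate normal density evaluated at a single point x."""
    d = mu.shape[0]
    diff = x - mu
    norm_const = np.sqrt((2 * np.pi) ** d * np.linalg.det(sigma))
    mahal = diff @ np.linalg.solve(sigma, diff)  # (x-mu)^T Sigma^-1 (x-mu)
    return np.exp(-0.5 * mahal) / norm_const

mu = np.array([0.0, 0.0])
sigma = np.array([[1.0, 0.3], [0.3, 2.0]])
print(gaussian_pdf(np.array([0.5, -0.5]), mu, sigma))
```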
GMMs – Gaussian Mixture Models
Suppose we have 1000 data points in 2-D space (w, h).
[Figure: scatter plot of the 1000 points, with W and H axes]
Assume each data point is normally distributed. From the plot, there are clearly 5 sets of underlying Gaussians.
[Figure: the same scatter plot with the five Gaussian clusters indicated]
GMM
Model the data distribution by a combination of Gaussian functions
Given a set of sample points, how do we estimate their cluster assignments and the parameters of the Gaussian distribution for each cluster?
EM with GMM
In the context of GMM X: data points Y: which Gaussian creates which data points Distribution of complete data
Gaussian distribution for each cluster
Proportion of each cluster
where Pk’s must sum up to 1.
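Below is a minimal sketch (assumed parameters, illustrative names) of sampling from the complete-data distribution: first draw the hidden cluster Y with probabilities P_k, then draw X from that cluster's Gaussian.

```python
# Sampling from a GMM's complete-data distribution (Y, then X | Y).
import numpy as np

rng = np.random.default_rng(3)
P = np.array([0.3, 0.5, 0.2])                 # cluster proportions, sum to 1
means = np.array([[0, 0], [4, 4], [-3, 5]], dtype=float)
covs = np.stack([np.eye(2)] * 3)

Y = rng.choice(3, size=1000, p=P)             # hidden assignments
X = np.array([rng.multivariate_normal(means[k], covs[k]) for k in Y])
print(X.shape, np.bincount(Y) / 1000)         # empirical proportions ~ P
```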
Soft assignment

E-Step: assign each data point $x_i$ to cluster k with probability

$$p_{ik}^t = P(y_i = k \mid x_i, \theta^t) = \frac{p_k^t\, p(x_i \mid \theta_k^t)}{\sum_{k'} p_{k'}^t\, p(x_i \mid \theta_{k'}^t)}$$

M-Step: maximize the Q-function, which yields the updates

$$\mu_k^{t+1} = \frac{\sum_{i=1}^{n} p_{ik}^t\, x_i}{\sum_{i=1}^{n} p_{ik}^t}, \qquad \Sigma_k^{t+1} = \frac{\sum_{i=1}^{n} p_{ik}^t\, (x_i - \mu_k^{t+1})(x_i - \mu_k^{t+1})^T}{\sum_{i=1}^{n} p_{ik}^t}, \qquad p_k^{t+1} = \frac{1}{n} \sum_{i=1}^{n} p_{ik}^t$$
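Here is a minimal sketch (not the course code; function and variable names are illustrative) of one EM iteration implementing the soft E-step and the M-step updates above.

```python
# One EM iteration for a GMM: soft responsibilities, then weighted
# re-estimates of the means, covariances, and mixing proportions.
import numpy as np
from scipy.stats import multivariate_normal

def em_step(X, p, means, covs):
    n, K = X.shape[0], p.shape[0]
    # E-step: responsibilities p_ik proportional to p_k * N(x_i | mu_k, Sigma_k)
    r = np.column_stack([p[k] * multivariate_normal.pdf(X, means[k], covs[k])
                         for k in range(K)])
    r /= r.sum(axis=1, keepdims=True)
    # M-step: weighted mean, covariance, and mixing proportion per cluster
    nk = r.sum(axis=0)                         # effective cluster sizes
    means = (r.T @ X) / nk[:, None]
    covs = np.stack([(r[:, k, None] * (X - means[k])).T @ (X - means[k]) / nk[k]
                     for k in range(K)])
    p = nk / n
    return p, means, covs
```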
EM for GMM (summary)

Objective: given N data points, find the maximum likelihood estimate of Θ:

$$\Theta_{ML} = \arg\max_{\Theta} f(x_1, \ldots, x_N \mid \Theta)$$

Algorithm:
1. Guess an initial Θ.
2. Perform the E step (expectation): based on Θ, softly assign each data point to the Gaussian components.
3. Perform the M step (maximization): based on the data-point clustering, maximize the Q function.
4. Repeat steps 2-3 until convergence (typically tens of iterations).
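A hypothetical driver loop for this algorithm might look as follows (a minimal sketch reusing the em_step function from the earlier sketch; the log-likelihood convergence test is an assumed choice, not specified in the slides).

```python
# Iterate E and M steps until the log-likelihood stops improving.
import numpy as np
from scipy.stats import multivariate_normal
from scipy.special import logsumexp

def log_likelihood(X, p, means, covs):
    log_pk = np.column_stack(
        [np.log(p[k]) + multivariate_normal.logpdf(X, means[k], covs[k])
         for k in range(len(p))])
    return logsumexp(log_pk, axis=1).sum()

def fit_gmm(X, p, means, covs, tol=1e-6, max_iter=200):
    ll = -np.inf
    for _ in range(max_iter):
        p, means, covs = em_step(X, p, means, covs)  # E + M step
        new_ll = log_likelihood(X, p, means, covs)
        if new_ll - ll < tol:                        # converged
            break
        ll = new_ll
    return p, means, covs
```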
EM Example

[Figure sequence: successive EM iterations on example data; each panel shows Gaussian j overlaid on the data points t, with blue shading encoding P(Y_t = j | X_t)]
EM is more than clustering
Y need not be a variable of cluster membership; it may be any variable that is unknown to us, for example:
- Bad pixels in an image
- Lost labels for an example

There can be many possible choices of Y, and EM is still applicable.
Summary
EM is composed of two basic steps:
- E-Step: estimate the posterior distribution of the unobserved variables, based on the current estimate of the model parameters.
- M-Step: update the model parameters by maximizing the expected log-likelihood weighted by the posterior distribution obtained in the E-Step.

We also saw the application of EM to the Gaussian Mixture Model.
The steps of the EM algorithm

E-step: calculate

$$Q(\theta, \theta^t) = E_{p(y \mid x, \theta^t)}\big[\log p(x, y \mid \theta)\big] = \int \log p(x, y \mid \theta)\, p(y \mid x, \theta^t)\, dy$$

M-step: find

$$\theta^{(t+1)} = \arg\max_{\theta} Q(\theta, \theta^t)$$

See the online notes for more derivations and analysis!
How to write the Q function under the GMM setting

The likelihood of a data set is the product of the individual sample likelihoods, so

$$Q(\theta, \theta^t) = \sum_{i=1}^{n} \sum_{k=1}^{K} p(y_i = k \mid x_i, \theta^t) \log p(x_i, y_i = k \mid \theta)$$

where

$$p(y_i = k \mid x_i, \theta^t) = \frac{p_k^t\, p(x_i \mid \theta_k^t)}{\sum_{k'} p_{k'}^t\, p(x_i \mid \theta_{k'}^t)}$$

and

$$p(x_i, y_i = k \mid \theta) = p(x_i \mid y_i = k, \theta)\, p(y_i = k \mid \theta) = p_k\, p(x_i \mid \theta_k)$$
The Q function specific to the GMM is

$$Q(\theta, \theta^t) = \sum_{i=1}^{n} \sum_{k=1}^{K} \frac{p_k^t\, p(x_i \mid \theta_k^t)}{\sum_{k'} p_{k'}^t\, p(x_i \mid \theta_{k'}^t)} \log\big(p_k\, p(x_i \mid \theta_k)\big)$$

Plugging in the definition of $p(x \mid \Theta_k)$ and setting the derivative with respect to each parameter to zero, we obtain the iteration procedure.

E step:

$$p_{ik}^t = p(y_i = k \mid x_i, \theta^t) = \frac{p_k^t\, p(x_i \mid \theta_k^t)}{\sum_{k'} p_{k'}^t\, p(x_i \mid \theta_{k'}^t)}$$

M step:

$$p_k^{t+1} = \frac{1}{n} \sum_{i=1}^{n} p_{ik}^t, \qquad \mu_k^{t+1} = \frac{\sum_{i=1}^{n} p_{ik}^t\, x_i}{\sum_{i=1}^{n} p_{ik}^t}, \qquad \Sigma_k^{t+1} = \frac{\sum_{i=1}^{n} p_{ik}^t\, (x_i - \mu_k^{t+1})(x_i - \mu_k^{t+1})^T}{\sum_{i=1}^{n} p_{ik}^t}$$
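As a sanity check on these update equations, one can compare a hand-rolled implementation against a library. The following minimal sketch (assumed usage, synthetic data; not part of the slides) fits the same model with scikit-learn's GaussianMixture, whose fitted weights and means correspond to the $p_k$ and $\mu_k$ above.

```python
# Sanity-check the GMM updates against scikit-learn on synthetic data.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(4)
X = np.vstack([rng.normal([0, 0], 1.0, size=(500, 2)),
               rng.normal([5, 5], 0.5, size=(500, 2))])

gmm = GaussianMixture(n_components=2, random_state=0).fit(X)
print("weights:", gmm.weights_)   # estimates of p_k
print("means:\n", gmm.means_)     # estimates of mu_k
```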