CS9840a Learning and Computer Vision Prof. Olga Veksler Lecture 11 Unsupervised Learning EM


Page 1:

CS9840a Learning and Computer Vision
Prof. Olga Veksler

Lecture 11: Unsupervised Learning, EM

Page 2:

Today

• New Topic: Unsupervised Learning
  • supervised vs. unsupervised learning
  • nonparametric unsupervised learning = clustering
    • Proximity Measures
    • Criterion Functions
    • k-means
  • brief intro to Bayesian decision theory
    • need this for parametric supervised learning
  • parametric unsupervised learning
    • Expectation Maximization (EM)

Page 3:

Supervised vs. Unsupervised Learning

• Up to now we considered the supervised learning scenario, where we have
  1. samples x1,…, xn
  2. a class label yi for each sample xi
  • this is also called learning with a teacher
• In this lecture we consider the unsupervised learning scenario, where we are only given
  1. samples x1,…, xn
  • this is also called learning without a teacher
  • do not split the data into training and test sets
  • harder to say how good the results are, compared to the supervised case

Page 4:

Unsupervised Learning

• Data is not labeled
• Parametric Approach
  • assume a parametric distribution of the data
  • estimate the parameters of this distribution
• Non-parametric Approach
  • group the data into clusters
  • samples inside each cluster are more similar than samples across clusters
  • each cluster (hopefully) says something about the categories (classes) present in the data

Page 5:

Why Unsupervised Learning?

• Unsupervised learning is harder
  • How do we know if the results are meaningful? No answer labels are available.
    • Let the expert look at the results (external evaluation)
    • Define an objective function on the clustering (internal evaluation)
• We nevertheless need it because
  1. Labeling large datasets is very costly (speech recognition)
    • sometimes we can label only a few examples by hand
  2. May have no idea what / how many classes there are (data mining)
  3. May want to use clustering to gain some insight into the structure of the data before designing a classifier
    • Clustering as data description

Page 6:

Clustering

• Seek "natural" clusters in the data
• What is a good clustering?
  • internal (within-cluster) distances should be small
  • external (between-cluster) distances should be large
• Clustering is a way to discover new categories (classes)

Page 7:

What We Need for Clustering

1. Proximity measure, either
  • a similarity measure s(xi, xk): large if xi and xk are similar
  • a dissimilarity (or distance) measure d(xi, xk): small if xi and xk are similar
  (similar pair: large s, small d)
2. Criterion function to evaluate a clustering (to tell a good clustering from a bad one)
3. Algorithm to compute the clustering
  • usually by optimizing the criterion function

Page 8:

How Many Clusters?

(figure: the same data set interpreted as 3 clusters or as 2 clusters)

• Possible approaches
  • fix the number of clusters to k
  • find the best clustering according to the criterion function (the number of clusters may vary)

Page 9:

Proximity Measures

• a good proximity measure is application dependent
  • clusters should be invariant under transformations "natural" to the problem
• for object recognition, should have invariance to rotation (an object and its rotated copy: small distance)
• for character recognition, no invariance to rotation (a "9" and a rotated "6": large distance)

Page 10:

Distance (Dissimilarity) Measures

• Euclidean distance
  $d(x_i, x_j) = \left( \sum_{k=1}^{d} (x_{ik} - x_{jk})^2 \right)^{1/2}$
  • translation invariant
• Manhattan (city block) distance
  $d(x_i, x_j) = \sum_{k=1}^{d} |x_{ik} - x_{jk}|$
  • translation invariant
  • approximation to Euclidean distance, cheaper to compute
• Chebyshev distance
  $d(x_i, x_j) = \max_{1 \le k \le d} |x_{ik} - x_{jk}|$
  • approximation to Euclidean distance, cheapest to compute
(a small code sketch of these distances follows below)
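A minimal NumPy sketch of the three distances above; the two vectors are made-up example values, not data from the lecture.

    import numpy as np

    # Hypothetical feature vectors, just to illustrate the formulas.
    x_i = np.array([1.0, 4.0, 2.0])
    x_j = np.array([3.0, 0.0, 2.0])

    euclidean = np.sqrt(np.sum((x_i - x_j) ** 2))   # sqrt of sum of squared differences
    manhattan = np.sum(np.abs(x_i - x_j))           # sum of absolute differences
    chebyshev = np.max(np.abs(x_i - x_j))           # largest absolute difference

    print(euclidean, manhattan, chebyshev)          # ~4.47, 6.0, 4.0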

Page 11:

Similarity Measures

• Cosine similarity:
  $s(x_i, x_j) = \frac{x_i^T x_j}{\|x_i\| \, \|x_j\|}$
  • smaller angle gives larger similarity
  • scale invariant measure
  • popular in text retrieval
• Correlation coefficient
  • popular in image processing
  $s(x_i, x_j) = \frac{\sum_{k=1}^{d} (x_{ik} - \bar{x}_i)(x_{jk} - \bar{x}_j)}{\left[ \sum_{k=1}^{d} (x_{ik} - \bar{x}_i)^2 \; \sum_{k=1}^{d} (x_{jk} - \bar{x}_j)^2 \right]^{1/2}}$
(a small code sketch of both measures follows below)
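A short NumPy sketch of the two similarity measures above, reusing hypothetical example vectors:

    import numpy as np

    x_i = np.array([1.0, 4.0, 2.0])   # made-up feature vectors
    x_j = np.array([3.0, 0.0, 2.0])

    # Cosine similarity: inner product divided by the product of the norms.
    cos_sim = x_i @ x_j / (np.linalg.norm(x_i) * np.linalg.norm(x_j))

    # Correlation coefficient: cosine similarity of the mean-subtracted vectors.
    xc_i, xc_j = x_i - x_i.mean(), x_j - x_j.mean()
    corr = xc_i @ xc_j / np.sqrt(np.sum(xc_i ** 2) * np.sum(xc_j ** 2))

    print(cos_sim, corr)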

Page 12:

Feature Scale

• old problem: how to choose an appropriate relative scale for the features?
  • [length (in meters or cm?), weight (in grams or kg?)]
• in supervised learning, we can normalize to zero mean, unit variance with no problems
• in clustering this is more problematic: if the variance in the data is due to cluster presence, then normalizing the features is not a good thing

(figure: clusters before normalization vs. after normalization)

Page 13:

Criterion Functions for Clustering

• Have samples x1,…, xn
• Suppose we partitioned the samples into c subsets D1,…, Dc (figure: three clusters D1, D2, D3)
• There are approximately c^n / c! distinct partitions
• Can define a criterion function J(D1,…, Dc) which measures the quality of a partitioning D1,…, Dc
• Then the clustering problem is well defined
  • the optimal clustering is the partition which optimizes the criterion function

Page 14:

Iterative Optimization Algorithms

• Now we have both a proximity measure and a criterion function; we need an algorithm to find the optimal clustering
• Exhaustive search is impossible, since there are approximately c^n / c! possible partitions
• Usually some iterative algorithm is used
  1. find a reasonable initial partition
  2. repeat: move samples from one group to another s.t. the objective function J is improved

(figure: moving samples to improve J, e.g. from J = 777,777 down to J = 666,666)

Page 15:

K-means Clustering

• Probably the most famous clustering algorithm
• $J_{SSE}$ objective function:
  $J_{SSE} = \sum_{i=1}^{k} \sum_{x \in D_i} \|x - \mu_i\|^2$
  where $\mu_i$ is the mean of the samples in cluster $D_i$ (a small code sketch of this criterion follows below)
• can work for some other objective functions, but not proven to work as well
• Fix the number of clusters to k (c = k)
• Iterative clustering algorithm
• Moves many samples from one cluster to another in a smart way
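A tiny sketch of the J_SSE criterion above, assuming X is an (n, d) NumPy array, labels gives each sample's cluster index, and means holds the k cluster means (the example numbers are made up):

    import numpy as np

    def sse_criterion(X, labels, means):
        # Sum over clusters of squared distances of samples to their cluster mean.
        return sum(np.sum((X[labels == i] - means[i]) ** 2) for i in range(len(means)))

    X = np.array([[1.0], [1.2], [5.0], [5.4]])
    labels = np.array([0, 0, 1, 1])
    means = np.array([[1.1], [5.2]])
    print(sse_criterion(X, labels, means))   # 0.02 + 0.08 ≈ 0.1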

Page 16:

K-means Clustering

(figure: k = 3)

1. Initialize
  • pick k cluster centers arbitrarily
  • assign each example to the closest center
2. compute sample means for each cluster
3. reassign all samples to the closest mean
4. if clusters changed at step 3, go to step 2
(a code sketch of these steps follows below)
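A minimal NumPy sketch of steps 1-4 above; the function and variable names are mine, and for simplicity it assumes no cluster ever becomes empty:

    import numpy as np

    def kmeans(X, k, max_iters=100, seed=0):
        """Plain k-means on an (n, d) array X, following steps 1-4 of the slide."""
        rng = np.random.default_rng(seed)
        means = X[rng.choice(len(X), size=k, replace=False)]      # 1. arbitrary centers
        labels = None
        for _ in range(max_iters):
            dists = np.linalg.norm(X[:, None, :] - means[None, :, :], axis=2)
            new_labels = dists.argmin(axis=1)                     # 3. closest mean
            if labels is not None and np.array_equal(new_labels, labels):
                break                                             # 4. clusters unchanged, stop
            labels = new_labels
            # 2. recompute the sample mean of each cluster
            means = np.array([X[labels == i].mean(axis=0) for i in range(k)])
        return means, labels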

Page 17:

K-means Clustering

• Consider steps 2 and 3 of the algorithm
  2. compute sample means for each cluster
  3. reassign all samples to the closest mean
• the objective $J_{SSE} = \sum_{i=1}^{k} \sum_{x \in D_i} \|x - \mu_i\|^2$ is the sum of the within-cluster squared errors
• If we represent the clusters by their old means, the error has gotten smaller after the reassignment in step 3

Page 18:

K-means Clustering

3. reassign all samples to the closest mean
• If we represent the clusters by their old means, the error has gotten smaller
• However, we represent the clusters by their new means, and the mean is always the smallest-error representation of a cluster:
  $\frac{\partial}{\partial z} \sum_{x \in D_i} \|x - z\|^2 = \frac{\partial}{\partial z} \sum_{x \in D_i} \left( \|x\|^2 - 2 x^T z + \|z\|^2 \right) = \sum_{x \in D_i} \left( -2x + 2z \right) = 0 \;\;\Rightarrow\;\; z = \frac{1}{n_i} \sum_{x \in D_i} x$
• so recomputing the means in step 2 can only decrease the error further

Page 19:

K-means Clustering

• We just proved that by doing steps 2 and 3, the objective function goes down
  • we found a "smart" move which decreases the objective function
• The algorithm converges after a finite number of iterations
• However, it is not guaranteed to find a global minimum

(figure: with initial centers x1 and x2, 2-means gets stuck in a local minimum instead of the global minimum of J_SSE)

Page 20:

K-means Clustering

• Finding the optimum of J_SSE is NP-hard
• In practice, k-means usually performs well
• Very efficient
• Its solution can be used as a starting point for other clustering algorithms
• Many variants and improvements of k-means clustering exist

Page 21:

Bayesian Decision Theory

• Know the probability distribution of the categories
  • almost never the case in real life!
  • nevertheless useful, since other cases can be reduced to this one after some work
• Do not even need training data
• Can design an optimal classifier

Page 22:

Example: Fish Sorting

• A respected fish expert says that
  • salmon length has distribution N(5, 1)
  • sea bass length has distribution N(10, 4)
• Recall that N(μ, σ²) has density
  $p(l) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left( -\frac{(l-\mu)^2}{2\sigma^2} \right)$
• Thus the class conditional densities are
  $p(l \mid \text{salmon}) = \frac{1}{\sqrt{2\pi}} \exp\left( -\frac{(l-5)^2}{2} \right)$
  $p(l \mid \text{bass}) = \frac{1}{2\sqrt{2\pi}} \exp\left( -\frac{(l-10)^2}{2 \cdot 4} \right)$

Page 23:

Likelihood Function

• The class conditional densities are
  $p(l \mid \text{salmon}) = \frac{1}{\sqrt{2\pi}} e^{-\frac{(l-5)^2}{2}}$,  $p(l \mid \text{bass}) = \frac{1}{2\sqrt{2\pi}} e^{-\frac{(l-10)^2}{8}}$
• Fix the length l and let the fish class vary
  • this gives the likelihood function $p(l \mid \text{class})$ (with l fixed)
  • it is not a density and not a probability mass
  $p(l \mid \text{class}) = \frac{1}{\sqrt{2\pi}} e^{-\frac{(l-5)^2}{2}}$ if class = salmon,  $= \frac{1}{2\sqrt{2\pi}} e^{-\frac{(l-10)^2}{8}}$ if class = bass

Page 24:

Likelihood vs. Class Conditional Density

(figure: p(l | class) for both classes plotted against length, with length 7 marked)

Suppose a fish has length 7. How do we classify it?

Page 25:

ML (Maximum Likelihood) Classifier

• Intuitively, want to choose salmon if
  Pr[length = 7 | salmon] > Pr[length = 7 | bass]
• However, since length is a continuous r.v.,
  Pr[length = 7 | salmon] = Pr[length = 7 | bass] = 0
• Instead, we choose the class which maximizes the likelihood
  $p(l \mid \text{salmon}) = \frac{1}{\sqrt{2\pi}} e^{-\frac{(l-5)^2}{2}}$,  $p(l \mid \text{bass}) = \frac{1}{2\sqrt{2\pi}} e^{-\frac{(l-10)^2}{2 \cdot 4}}$
• ML classifier: compare p(l | salmon) with p(l | bass)
  • if p(l | salmon) > p(l | bass), classify as salmon, else classify as bass (a small numeric check follows below)
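A quick numeric check of the ML rule for the two densities above, classifying a fish of length 7 (the helper names are mine):

    import numpy as np

    def p_salmon(l):   # N(5, 1)
        return np.exp(-(l - 5) ** 2 / 2) / np.sqrt(2 * np.pi)

    def p_bass(l):     # N(10, 4), i.e. sigma = 2
        return np.exp(-(l - 10) ** 2 / (2 * 4)) / (2 * np.sqrt(2 * np.pi))

    l = 7.0
    print(p_salmon(l), p_bass(l))                          # the two likelihood values
    print("salmon" if p_salmon(l) > p_bass(l) else "bass") # ML decision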

Page 26:

Interval Justification

(figure: the two densities at length 7, with heights p(7 | salmon) and p(7 | bass))

• For a small interval B of width 2ε around 7,
  Pr[l ∈ B | salmon] ≈ 2ε · p(7 | salmon),  Pr[l ∈ B | bass] ≈ 2ε · p(7 | bass)
• Thus we choose the class (here, bass) which is more likely to have generated the observation

Page 27:

Decision Boundary

(figure: classify as salmon to the left of the boundary, as sea bass to the right; the boundary is at length 6.70)

Page 28:

Priors

• The prior comes from prior knowledge; no data has been seen yet
• Suppose a fish expert says: in the fall, there are twice as many salmon as sea bass
• Prior for our fish sorting problem
  • P(salmon) = 2/3
  • P(bass) = 1/3
• With the addition of the prior to our model, how should we classify a fish of length 7?

Page 29:

How Prior Changes Decision Boundary?

• Without priors: salmon to the left of 6.70, sea bass to the right
• How should this change with the prior?
  • P(salmon) = 2/3
  • P(bass) = 1/3
(figure: which way does the boundary at 6.70 move?)

Page 30:

Bayes Decision Rule

1. Have likelihood functions
  • p(length | salmon)
  • p(length | bass)
2. Have priors P(salmon) and P(bass)
• Question: Having observed a fish of a certain length, do we classify it as salmon or bass?
• Natural Idea:
  • salmon if P(salmon | length) > P(bass | length)
  • bass if P(bass | length) > P(salmon | length)

Page 31:

Posterior

• P(salmon | length) and P(bass | length) are called posterior distributions, because the data (length) was revealed (post data)
• How to compute posteriors? Bayes rule:
  $P(\text{salmon} \mid \text{length}) = \frac{p(\text{length}, \text{salmon})}{p(\text{length})} = \frac{p(\text{length} \mid \text{salmon})\, P(\text{salmon})}{p(\text{length})}$
• Similarly:
  $P(\text{bass} \mid \text{length}) = \frac{p(\text{length} \mid \text{bass})\, P(\text{bass})}{p(\text{length})}$

Page 32:

MAP (Maximum A Posteriori) Classifier

• Compare P(salmon | length) with P(bass | length) and pick the class with the larger posterior
• By Bayes rule, this is the same as comparing
  $\frac{p(\text{length} \mid \text{salmon})\, P(\text{salmon})}{p(\text{length})}$  with  $\frac{p(\text{length} \mid \text{bass})\, P(\text{bass})}{p(\text{length})}$
• Since p(length) appears on both sides, it is enough to compare
  $p(\text{length} \mid \text{salmon})\, P(\text{salmon})$  with  $p(\text{length} \mid \text{bass})\, P(\text{bass})$

Page 33:

Back to Fish Sorting

• likelihoods:
  $p(l \mid \text{salmon}) = \frac{1}{\sqrt{2\pi}} e^{-\frac{(l-5)^2}{2}}$,  $p(l \mid \text{bass}) = \frac{1}{2\sqrt{2\pi}} e^{-\frac{(l-10)^2}{8}}$
• Priors: P(salmon) = 2/3, P(bass) = 1/3
• Solve the inequality
  $\frac{2}{3} \cdot \frac{1}{\sqrt{2\pi}} e^{-\frac{(l-5)^2}{2}} \;>\; \frac{1}{3} \cdot \frac{1}{2\sqrt{2\pi}} e^{-\frac{(l-10)^2}{8}}$
(figure: salmon to the left, sea bass to the right; the new decision boundary is at length 7.18, to the right of the old boundary at 6.70)
• The new boundary makes sense: we expect to see more salmon (a numeric check follows below)
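A small numerical check of the prior-weighted comparison above, assuming the two densities and the priors from this slide; brentq simply finds the length where the two sides of the inequality cross (the helper names are mine):

    import numpy as np
    from scipy.optimize import brentq

    def p_salmon(l):
        return np.exp(-(l - 5) ** 2 / 2) / np.sqrt(2 * np.pi)

    def p_bass(l):
        return np.exp(-(l - 10) ** 2 / 8) / (2 * np.sqrt(2 * np.pi))

    P_salmon, P_bass = 2.0 / 3.0, 1.0 / 3.0

    # Decision boundary: the length where the two prior-weighted scores are equal.
    g = lambda l: p_salmon(l) * P_salmon - p_bass(l) * P_bass
    boundary = brentq(g, 5.0, 10.0)   # search between the two class means
    print(boundary)                   # close to the slide's 7.18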

Page 34:

More on Posterior

  $P(c \mid l) = \frac{p(l \mid c)\, P(c)}{p(l)}$
  (posterior density, our goal) = (likelihood, given) × (prior, given) / (normalizing factor)

• p(l) is a normalizing factor; we often do not even need it for classification, since p(l) does not depend on the class c. If we do need it, from the law of total probability:
  $p(l) = p(l \mid \text{salmon})\, P(\text{salmon}) + p(l \mid \text{bass})\, P(\text{bass})$
• Notice this formula consists of likelihoods and priors, which are given

Page 35:

More on Priors

• The prior comes from prior knowledge; no data has been seen yet
• If there is a reliable source of prior knowledge, it should be used
• Some problems cannot even be solved reliably without a good prior
• However, the prior alone is not enough; we still need the likelihood
  • P(salmon) = 2/3, P(sea bass) = 1/3
  • If I don't let you see the data, but ask you to guess, will you choose salmon or sea bass?

Page 36:

More on MAP Classifier

  $P(c \mid l) = \frac{p(l \mid c)\, P(c)}{p(l)}$   (posterior ∝ likelihood × prior)

• Do not care about p(l) when maximizing P(c | l):
  $P(c \mid l) \propto p(l \mid c)\, P(c)$
• If P(salmon) = P(bass) (uniform prior), the MAP classifier becomes the ML classifier:
  $P(c \mid l) \propto p(l \mid c)$
• If for some observation l, p(l | salmon) = p(l | bass), then this observation is uninformative and the decision is based solely on the prior:
  $P(c \mid l) \propto P(c)$

Page 37:

Justification for MAP Classifier

• The MAP rule: decide salmon if P(salmon | l) > P(bass | l), and bass otherwise
• For any particular l, the probability of error is
  Pr[error | l] = P(bass | l) if we decide salmon, and P(salmon | l) if we decide bass
• Thus the MAP classifier is optimal for each individual l

Page 38:

Justification for MAP Classifier

• We are interested in minimizing the error not just for one l; we really want to minimize the average error over all l:
  $\Pr[\text{error}] = \int p(\text{error}, l)\, dl = \int \Pr[\text{error} \mid l]\, p(l)\, dl$
• If Pr[error | l] is as small as possible for every l, the integral is as small as possible
• But the Bayes (MAP) rule makes Pr[error | l] as small as possible
• Thus the MAP classifier minimizes the probability of error

Page 39:

Parametric Supervised Learning

• Supervised parametric learning
  • have m classes
  • have samples x1,…, xn, each of class 1, 2,…, m
  • suppose Di holds the samples from class i
  • the probability distribution for class i is pi(x | θi)
(figure: two class densities, p1(x | θ1) and p2(x | θ2))

Page 40:

Parametric Supervised Learning

• Use the ML method to estimate the parameters θi
  • find θi which maximizes the likelihood function F(θi):
    $p(D_i \mid \theta_i) = \prod_{x \in D_i} p(x \mid \theta_i) = F(\theta_i)$
  • or, equivalently, find θi which maximizes the log likelihood l(θi):
    $l(\theta_i) = \ln p(D_i \mid \theta_i) = \sum_{x \in D_i} \ln p(x \mid \theta_i)$
  $\hat{\theta}_1 = \arg\max_{\theta_1} \ln p(D_1 \mid \theta_1)$,  $\hat{\theta}_2 = \arg\max_{\theta_2} \ln p(D_2 \mid \theta_2)$, …

Page 41:

Parametric Supervised Learning

• now the distributions are fully specified
• can classify an unknown sample using the MAP (Maximum A Posteriori) classifier
(figure: the fitted densities p1(x | θ̂1) and p2(x | θ̂2); a small code sketch follows below)
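As a concrete (hypothetical) instance of the ML step above: if each class-conditional density is assumed Gaussian, the log likelihood is maximized by the per-class sample mean and variance. The numbers below are made up.

    import numpy as np

    # Hypothetical labeled 1-D samples for two classes.
    D1 = np.array([4.2, 5.1, 4.8, 5.5, 5.0])
    D2 = np.array([9.0, 10.5, 11.2, 9.8])

    # ML estimates for a Gaussian class model: sample mean and sample variance.
    mu1_hat, var1_hat = D1.mean(), D1.var()
    mu2_hat, var2_hat = D2.mean(), D2.var()
    print(mu1_hat, var1_hat, mu2_hat, var2_hat)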

Page 42:

Parametric Unsupervised Learning

• Expectation Maximization (EM)
  • one of the most useful statistical methods
  • oldest version in 1958 (Hartley)
  • seminal paper in 1977 (Dempster et al.)
  • can also be used when some samples are missing features

Page 43:

Parametric Unsupervised Learning

• Assume the data was generated by a model p(x | θ) with known shape but unknown parameters
• Advantages of having a model
  • Gives a meaningful way to cluster data
    • adjust the parameters of the model to maximize the probability that the model produced the observed data
  • Can sensibly measure if a clustering is good
    • compute the likelihood of the data induced by the clustering
  • Can compare 2 clustering algorithms
    • which one gives the higher likelihood of the observed data?

Page 44:

Parametric Unsupervised Learning

• In unsupervised learning, no one tells us the true classes for the samples. We still know that we
  • have m classes
  • have samples x1,…, xn, each of unknown class
  • the probability distribution for class i is pi(x | θi)
• Can we determine the classes and the parameters simultaneously?

Page 45:

Example: MRI Brain Segmentation

(figure: MRI brain image and its segmentation; picture from M. Leventon)

• In an MRI brain image, different brain tissues have different intensities
• Know that the brain has 6 major types of tissues
• Each type of tissue can be modeled reasonably well by a Gaussian N(μi, σi²); the parameters μi, σi² are unknown
• Segmenting (classifying) the brain image into the different tissue classes is very useful
  • don't know which image pixel corresponds to which tissue (class)
  • don't know the parameters of each N(μi, σi²)

Page 46:

Mixture Density Model

• Model the data with a mixture density
  $p(x \mid \theta) = \sum_{j=1}^{m} p(x \mid c_j, \theta_j)\, P(c_j)$
  • the p(x | cj, θj) are the component densities, and the P(cj) are the mixing parameters
  • where θ = (θ1,…, θm) and P(c1) + P(c2) + … + P(cm) = 1
• To generate a sample from the distribution p(x | θ)
  • first select a class j with probability P(cj)
  • then generate x according to the probability law p(x | cj, θj)
(figure: three components p(x | c1, θ1), p(x | c2, θ2), p(x | c3, θ3); a sampling sketch follows below)
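A minimal sketch of the two-step sampling procedure above for a 1-D Gaussian mixture; the weights, means, and sigmas are made-up values, not the example on the next slide.

    import numpy as np

    rng = np.random.default_rng(0)

    weights = np.array([0.2, 0.3, 0.5])   # mixing parameters P(c_j)
    mus     = np.array([0.0, 6.0, 7.0])   # per-component means
    sigmas  = np.array([1.0, 2.0, 2.5])   # per-component standard deviations

    def sample_from_mixture(n):
        comp = rng.choice(len(weights), size=n, p=weights)   # step 1: pick component j ~ P(c_j)
        return rng.normal(mus[comp], sigmas[comp])           # step 2: draw x ~ p(x | c_j, theta_j)

    X = sample_from_mixture(1000)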

Page 47:

Example: Gaussian Mixture Density

• Mixture of 3 two-dimensional Gaussians:
  $p_1(x) = N\!\left( \begin{bmatrix} 0 \\ 0 \end{bmatrix}, \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix} \right)$,  $p_2(x) = N\!\left( \begin{bmatrix} 6 \\ 6 \end{bmatrix}, \begin{bmatrix} 4 & 0 \\ 0 & 4 \end{bmatrix} \right)$,  $p_3(x) = N\!\left( \begin{bmatrix} 7 \\ 7 \end{bmatrix}, \begin{bmatrix} 6 & 0 \\ 0 & 6 \end{bmatrix} \right)$
  $p(x) = 0.2\, p_1(x) + 0.3\, p_2(x) + 0.5\, p_3(x)$
(figure: the three component densities p1(x), p2(x), p3(x))

Page 48:

Mixture Density

  $p(x \mid \theta) = \sum_{j=1}^{m} p(x \mid c_j, \theta_j)\, P(c_j)$

• P(c1),…, P(cm) can be known or unknown
• Suppose we know how to estimate θ1,…, θm and P(c1),…, P(cm)
• Then we can "break apart" the mixture p(x | θ) for classification
• To classify a sample x, use MAP estimation, that is, choose the class i which maximizes
  $P(c_i \mid x, \theta_i) \propto p(x \mid c_i, \theta_i)\, P(c_i)$
  (the probability that component i generated x, proportional to the likelihood of x under component i times the probability of component i)

Page 49:

ML Estimation for Mixture Density

  $p(x \mid \theta, \alpha) = \sum_{j=1}^{m} p(x \mid c_j, \theta_j)\, \alpha_j$

• Use Maximum Likelihood estimation for the mixture density
• Need to estimate
  • θ1,…, θm
  • the mixing parameters α1 = P(c1),…, αm = P(cm); write θ = {θ1,…, θm} and α = {α1,…, αm}
• As in the supervised case, form the log likelihood function
  $l(\theta, \alpha) = \ln p(D \mid \theta, \alpha) = \sum_{k=1}^{n} \ln p(x_k \mid \theta, \alpha) = \sum_{k=1}^{n} \ln \sum_{j=1}^{m} p(x_k \mid c_j, \theta_j)\, \alpha_j$

Page 50:

ML Estimation for Mixture Density

  $l(\theta, \alpha) = \sum_{k=1}^{n} \ln \sum_{j=1}^{m} p(x_k \mid c_j, \theta_j)\, \alpha_j$

• need to maximize l(θ, α) with respect to θ and α
• l(θ, α) is hard to maximize directly
  • partial derivatives with respect to θ, α usually give a "coupled" nonlinear system of equations
  • usually a closed form solution cannot be found
• We could use the gradient ascent method
  • in general, it is not the greatest method to use; it should only be used as a last resort
• There is a better algorithm, called EM

Page 51:

Mixture Density

• Before EM, let's look at the mixture density again:
  $p(x \mid \theta, \alpha) = \sum_{j=1}^{m} p(x \mid c_j, \theta_j)\, \alpha_j$
• Say we know how to estimate θ1,…, θm and α1,…, αm
  • then estimating the class of x is easy with MAP: maximize $p(x \mid c_i, \theta_i)\, P(c_i) \propto P(c_i \mid x, \theta_i)$
• Suppose instead we know the class of the samples x1,…, xn
  • just as in the supervised case, estimating θ1,…, θm and α1,…, αm is easy:
    $\hat{\theta}_i = \arg\max_{\theta_i} \ln p(D_i \mid \theta_i)$,  $\hat{\alpha}_i = \frac{|D_i|}{n}$
• This is an example of a chicken-and-egg problem
• The EM algorithm's solution is to add "hidden" variables

Page 52:

Expectation Maximization Algorithm

• EM is an algorithm for ML parameter estimation when the data has missing values. It is used when
  1. the data is incomplete (has missing values)
    • some features are missing for some samples due to data corruption, partial survey responses, etc.
  2. the data X is complete, but p(X | θ) is hard to optimize, while introducing certain hidden variables U (whose values are missing) makes the "complete" likelihood function p(X, U | θ) easier to optimize. Then EM is useful.
    • this case covers mixture density estimation and is the subject of our lecture today
• Notice that after we introduce artificial (hidden) variables U with missing values, case 2 is completely equivalent to case 1

Page 53:

EM: Hidden Variables for Mixture Density

  $p(x \mid \theta) = \sum_{j=1}^{m} p(x \mid c_j, \theta_j)\, \alpha_j$

• For simplicity, assume the component densities are Gaussian with a known, shared variance:
  $p(x \mid c_j, \mu_j) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left( -\frac{(x - \mu_j)^2}{2\sigma^2} \right)$
  • assume for now that the variance σ² is known
  • need to estimate θ = {μ1,…, μm}
• If we knew which sample came from which component, the ML parameter estimation would be easy
• To get this easier problem, introduce hidden variables which indicate which component each sample belongs to

Page 54:

EM: Hidden Variables for Mixture Density

• For 1 ≤ i ≤ n, 1 ≤ k ≤ m, define hidden variables zi(k):
  zi(k) = 1 if sample xi was generated by component k, and 0 otherwise
• the zi(k) are indicator random variables; they indicate which Gaussian component generated sample xi
• Let zi = {zi(1),…, zi(m)} be the indicator r.v.'s corresponding to xi
• Conditioned on zi, the distribution of xi is Gaussian:
  $p(x_i \mid z_i) \sim N(\mu_k, \sigma^2)$,  where k is s.t. zi(k) = 1

Page 55:

EM: Joint Likelihood

• Let zi = {zi(1),…, zi(m)}, and Z = {z1,…, zn}
• The complete likelihood is
  $p(X, Z \mid \theta) = p(x_1,…, x_n, z_1,…, z_n \mid \theta) = \prod_{i=1}^{n} p(x_i, z_i \mid \theta) = \prod_{i=1}^{n} p(x_i \mid z_i, \theta)\, p(z_i \mid \theta)$
  (the first factor is the Gaussian part, the second the mixing part)
• If we observed Z, the log likelihood ln p(X, Z | θ) would be trivial to maximize with respect to the means and the mixing parameters
• The problem is that the values of Z are missing, since we made them up (that is, Z is hidden)

Page 56:

EM Derivation

• Instead of maximizing ln p(X, Z | θ), the idea behind EM is to maximize some function of ln p(X, Z | θ), usually its expected value conditioned on X:
  $E_{Z \mid X}\left[ \ln p(X, Z \mid \theta) \right]$
  • common trick: integrate out the unknown variables
  • the expectation is with respect to the missing data Z
    • that is, with respect to the density p(Z | X, θ)
  • however, θ is our ultimate goal; we don't know θ!

Page 57:

EM Algorithm

• The EM solution is to iterate:
  1. start with initial parameters θ(0)
  2. iterate the following two steps until convergence
    E. compute the expectation of the log likelihood with respect to the current estimate θ(t) and X; call it Q(θ | θ(t)):
      $Q(\theta \mid \theta^{(t)}) = E_{Z \mid X, \theta^{(t)}}\left[ \ln p(X, Z \mid \theta) \right]$
    M. maximize Q(θ | θ(t)):
      $\theta^{(t+1)} = \arg\max_{\theta} Q(\theta \mid \theta^{(t)})$

Page 58:

EM Algorithm

• It can be proven that the EM algorithm converges to a local maximum of the log-likelihood ln p(X | θ)
• Why is it better than gradient ascent?
  • Convergence of EM is usually significantly faster: in the beginning, very large steps are made (that is, the likelihood function increases rapidly), as opposed to gradient ascent, which usually takes tiny steps
  • gradient ascent is not even guaranteed to converge
    • recall the difficulties of choosing an appropriate learning rate

Page 59:

EM: Lower Bound Maximization

• It can be shown that at time step t the EM algorithm
  • constructs a function l(θ | θ(t)) which is bounded above by ln p(X | θ) (i.e. it is a lower bound) and touches ln p(X | θ) at θ = θ(t)
  • finds θ(t+1) that maximizes l(θ | θ(t))
• Therefore, the log likelihood ln p(X | θ) can only go up
(figure: ln p(X | θ) and the lower bound l(θ | θ(t)), with θ(t) and θ(t+1) marked)

Page 60:

EM for Mixture of Gaussians: E Step

• Let's come back to our example:
  $p(x \mid \theta) = \sum_{j=1}^{m} p(x \mid c_j, \theta_j)\, \alpha_j$,  $p(x \mid c_j, \mu_j) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left( -\frac{(x - \mu_j)^2}{2\sigma^2} \right)$
  • need to estimate θ = {μ1,…, μm} and α1,…, αm
• For 1 ≤ i ≤ n, 1 ≤ k ≤ m, define zi(k) = 1 if sample i was generated by component k, and 0 otherwise
  • as before, zi = {zi(1),…, zi(m)}, and Z = {z1,…, zn}
• We need the log-likelihood of the observed X and hidden Z:
  $\ln p(X, Z \mid \theta) = \ln \prod_{i=1}^{n} p(x_i, z_i \mid \theta) = \sum_{i=1}^{n} \ln\left[ p(x_i \mid z_i, \theta)\, P(z_i) \right]$

Page 61:

EM for Mixture of Gaussians: E Step

• We need the log-likelihood of the observed X and hidden Z:
  $\ln p(X, Z \mid \theta) = \sum_{i=1}^{n} \ln\left[ p(x_i \mid z_i, \theta)\, P(z_i) \right]$
• First, let's rewrite $p(x_i \mid z_i, \theta)\, P(z_i)$. It equals $p(x_i \mid z_i^{(k)} = 1, \theta)\, P(z_i^{(k)} = 1)$ for the single k with $z_i^{(k)} = 1$, which can be written compactly as
  $p(x_i \mid z_i, \theta)\, P(z_i) = \prod_{k=1}^{m} \left[ p(x_i \mid z_i^{(k)} = 1, \theta)\, P(z_i^{(k)} = 1) \right]^{z_i^{(k)}} = \prod_{k=1}^{m} \left[ \frac{1}{\sqrt{2\pi}\,\sigma} \exp\!\left( -\frac{(x_i - \mu_k)^2}{2\sigma^2} \right) P(z_i^{(k)} = 1) \right]^{z_i^{(k)}}$

Page 62:

EM for Mixture of Gaussians: E Step

• The log-likelihood of the observed X and hidden Z is
  $\ln p(X, Z \mid \theta) = \sum_{i=1}^{n} \ln\left[ p(x_i \mid z_i, \theta)\, P(z_i) \right] = \sum_{i=1}^{n} \ln \prod_{k=1}^{m} \left[ \frac{1}{\sqrt{2\pi}\,\sigma} \exp\!\left( -\frac{(x_i - \mu_k)^2}{2\sigma^2} \right) P(z_i^{(k)} = 1) \right]^{z_i^{(k)}}$
  $= \sum_{i=1}^{n} \sum_{k=1}^{m} z_i^{(k)} \left[ \ln \frac{1}{\sqrt{2\pi}\,\sigma} - \frac{(x_i - \mu_k)^2}{2\sigma^2} + \ln \alpha_k \right]$
  where $\alpha_k = P(z_i^{(k)} = 1) = P(c_k)$ is the probability that a sample comes from class k

Page 63:

EM for Mixture of Gaussians: E Step

• The log-likelihood of the observed X and hidden Z is
  $\ln p(X, Z \mid \theta) = \sum_{i=1}^{n} \sum_{k=1}^{m} z_i^{(k)} \left[ \ln \frac{1}{\sqrt{2\pi}\,\sigma} - \frac{(x_i - \mu_k)^2}{2\sigma^2} + \ln \alpha_k \right]$
• For the E step, we must compute
  $Q(\theta \mid \theta^{(t)}) = Q(\mu_1,…, \mu_m, \alpha_1,…, \alpha_m \mid \mu_1^{(t)},…, \mu_m^{(t)}, \alpha_1^{(t)},…, \alpha_m^{(t)}) = E_{Z \mid X, \theta^{(t)}}\left[ \ln p(X, Z \mid \theta) \right]$
• By linearity of expectation ($E_X[a x_i + b] = a\,E_X[x_i] + b$),
  $Q(\theta \mid \theta^{(t)}) = \sum_{i=1}^{n} \sum_{k=1}^{m} E_Z\!\left[ z_i^{(k)} \right] \left[ \ln \frac{1}{\sqrt{2\pi}\,\sigma} - \frac{(x_i - \mu_k)^2}{2\sigma^2} + \ln \alpha_k \right]$

Page 64:

EM for Mixture of Gaussians: E Step

  $Q(\theta \mid \theta^{(t)}) = \sum_{i=1}^{n} \sum_{k=1}^{m} E_Z\!\left[ z_i^{(k)} \right] \left[ \ln \frac{1}{\sqrt{2\pi}\,\sigma} - \frac{(x_i - \mu_k)^2}{2\sigma^2} + \ln \alpha_k \right]$

• We need to compute E_Z[zi(k)] in the expression above:
  $E_Z\!\left[ z_i^{(k)} \right] = 0 \cdot P(z_i^{(k)} = 0 \mid x_i, \theta^{(t)}) + 1 \cdot P(z_i^{(k)} = 1 \mid x_i, \theta^{(t)}) = P(z_i^{(k)} = 1 \mid x_i, \theta^{(t)})$
  $= \frac{p(x_i \mid z_i^{(k)} = 1, \theta^{(t)})\, P(z_i^{(k)} = 1 \mid \theta^{(t)})}{p(x_i \mid \theta^{(t)})} = \frac{\exp\!\left( -\frac{1}{2\sigma^2} (x_i - \mu_k^{(t)})^2 \right) \alpha_k^{(t)}}{\sum_{j=1}^{m} \exp\!\left( -\frac{1}{2\sigma^2} (x_i - \mu_j^{(t)})^2 \right) \alpha_j^{(t)}}$
• Done with the E step
  • for implementation, we only need to compute the E_Z[zi(k)]'s; we don't need Q itself

Page 65:

EM for Mixture of Gaussians: M Step

  $Q(\theta \mid \theta^{(t)}) = \sum_{i=1}^{n} \sum_{k=1}^{m} E_Z\!\left[ z_i^{(k)} \right] \left[ \ln \frac{1}{\sqrt{2\pi}\,\sigma} - \frac{(x_i - \mu_k)^2}{2\sigma^2} + \ln \alpha_k \right]$

• Need to maximize Q with respect to all parameters
• First, differentiate with respect to μk:
  $\frac{\partial}{\partial \mu_k} Q(\theta \mid \theta^{(t)}) = \sum_{i=1}^{n} E_Z\!\left[ z_i^{(k)} \right] \frac{x_i - \mu_k}{\sigma^2} = 0 \;\;\Rightarrow\;\; \mu_k^{new} = \frac{\sum_{i=1}^{n} E_Z\!\left[ z_i^{(k)} \right] x_i}{\sum_{i=1}^{n} E_Z\!\left[ z_i^{(k)} \right]}$
• the new mean for class k is a weighted average of all samples, and the weight of each sample is proportional to the current estimate of the probability that the sample belongs to class k

Page 66:

EM for Mixture of Gaussians: M Step

  $Q(\theta \mid \theta^{(t)}) = \sum_{i=1}^{n} \sum_{k=1}^{m} E_Z\!\left[ z_i^{(k)} \right] \left[ \ln \frac{1}{\sqrt{2\pi}\,\sigma} - \frac{(x_i - \mu_k)^2}{2\sigma^2} + \ln \alpha_k \right]$

• For the αk, use Lagrange multipliers to preserve the constraint $\sum_{j=1}^{m} \alpha_j = 1$
• Need to differentiate
  $F(\theta, \lambda) = Q(\theta \mid \theta^{(t)}) + \lambda \left( 1 - \sum_{j=1}^{m} \alpha_j \right)$
  $\frac{\partial F}{\partial \alpha_k} = \sum_{i=1}^{n} \frac{1}{\alpha_k} E_Z\!\left[ z_i^{(k)} \right] - \lambda = 0 \;\;\Rightarrow\;\; \lambda\, \alpha_k = \sum_{i=1}^{n} E_Z\!\left[ z_i^{(k)} \right]$
• Summing up over all components:
  $\lambda \sum_{k=1}^{m} \alpha_k = \sum_{k=1}^{m} \sum_{i=1}^{n} E_Z\!\left[ z_i^{(k)} \right]$
• Since $\sum_{k=1}^{m} \alpha_k = 1$ and $\sum_{k=1}^{m} \sum_{i=1}^{n} E_Z\!\left[ z_i^{(k)} \right] = n$, we get $\lambda = n$, and therefore
  $\alpha_k^{new} = \frac{1}{n} \sum_{i=1}^{n} E_Z\!\left[ z_i^{(k)} \right]$

Page 67:

EM Algorithm

This algorithm applies to a univariate Gaussian mixture with known variances. (A code sketch follows below.)

1. Randomly initialize μ1,…, μm, α1,…, αm (with the constraint Σi αi = 1)
iterate until there is no change in μ1,…, μm, α1,…, αm:
E. for all i, k, compute
  $E_Z\!\left[ z_i^{(k)} \right] = \frac{\alpha_k \exp\!\left( -\frac{1}{2\sigma^2} (x_i - \mu_k)^2 \right)}{\sum_{j=1}^{m} \alpha_j \exp\!\left( -\frac{1}{2\sigma^2} (x_i - \mu_j)^2 \right)}$
M. for all k, do the parameter update
  $\mu_k = \frac{\sum_{i=1}^{n} E_Z\!\left[ z_i^{(k)} \right] x_i}{\sum_{i=1}^{n} E_Z\!\left[ z_i^{(k)} \right]}$,  $\alpha_k = \frac{1}{n} \sum_{i=1}^{n} E_Z\!\left[ z_i^{(k)} \right]$
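A minimal NumPy sketch of this E/M loop, assuming a 1-D data array x, m components, and a known shared variance; the function name, the fixed iteration count, and the example data are mine:

    import numpy as np

    def em_known_variance(x, m, sigma=1.0, n_iters=50, seed=0):
        """EM for a 1-D Gaussian mixture with m components and known, shared variance."""
        rng = np.random.default_rng(seed)
        mu = rng.choice(x, size=m, replace=False)          # random initial means
        alpha = np.full(m, 1.0 / m)                        # mixing weights, sum to 1
        for _ in range(n_iters):
            # E step: responsibilities E[z_i^(k)] for every sample i and component k
            r = alpha * np.exp(-(x[:, None] - mu[None, :]) ** 2 / (2 * sigma ** 2))
            r /= r.sum(axis=1, keepdims=True)
            # M step: weighted means and updated mixing weights
            mu = (r * x[:, None]).sum(axis=0) / r.sum(axis=0)
            alpha = r.mean(axis=0)
        return mu, alpha

    # Example on made-up data with two unit-variance components:
    rng = np.random.default_rng(1)
    data = np.concatenate([rng.normal(0, 1, 300), rng.normal(5, 1, 200)])
    print(em_known_variance(data, m=2))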

Page 68:

EM Algorithm

• For the more general case of multivariate Gaussians with unknown means and covariances,
  $p(x \mid \theta) = \sum_{j=1}^{m} \alpha_j\, p(x \mid \mu_j, \Sigma_j)$,  where  $p(x \mid \mu_k, \Sigma_k) = \frac{1}{(2\pi)^{d/2} |\Sigma_k|^{1/2}} \exp\!\left( -\frac{1}{2} (x - \mu_k)^T \Sigma_k^{-1} (x - \mu_k) \right)$
• E step:
  $E_Z\!\left[ z_i^{(k)} \right] = \frac{\alpha_k^{(t)}\, p(x_i \mid \mu_k^{(t)}, \Sigma_k^{(t)})}{\sum_{j=1}^{m} \alpha_j^{(t)}\, p(x_i \mid \mu_j^{(t)}, \Sigma_j^{(t)})}$
• M step:
  $\alpha_k = \frac{1}{n} \sum_{i=1}^{n} E_Z\!\left[ z_i^{(k)} \right]$,  $\mu_k = \frac{\sum_{i=1}^{n} E_Z\!\left[ z_i^{(k)} \right] x_i}{\sum_{i=1}^{n} E_Z\!\left[ z_i^{(k)} \right]}$,  $\Sigma_k = \frac{\sum_{i=1}^{n} E_Z\!\left[ z_i^{(k)} \right] (x_i - \mu_k)(x_i - \mu_k)^T}{\sum_{i=1}^{n} E_Z\!\left[ z_i^{(k)} \right]}$
(a library-based sketch follows below)

Page 69:

EM Algorithm and K-means

• k-means can be derived from the EM algorithm
• Setting the mixing parameters equal for all classes, the E step gives
  $E_Z\!\left[ z_i^{(k)} \right] = \frac{\exp\!\left( -\frac{1}{2\sigma^2} \|x_i - \mu_k\|^2 \right)}{\sum_{j=1}^{m} \exp\!\left( -\frac{1}{2\sigma^2} \|x_i - \mu_j\|^2 \right)}$
• If we let $\sigma \to 0$, then
  $E_Z\!\left[ z_i^{(k)} \right] \to 1$ if $\|x_i - \mu_k\| \le \|x_i - \mu_j\|$ for all j, and $\to 0$ otherwise
• E step: for each current mean, find all points closest to it and form new clusters
• M step: compute the new means inside the current clusters,
  $\mu_k = \frac{\sum_{i=1}^{n} E_Z\!\left[ z_i^{(k)} \right] x_i}{\sum_{i=1}^{n} E_Z\!\left[ z_i^{(k)} \right]}$, i.e. the average of the samples assigned to cluster k

Page 70:

EM Gaussian Mixture Example

Page 71:

EM Gaussian Mixture Example
After first iteration

Page 72:

EM Gaussian Mixture Example
After second iteration

Page 73:

EM Gaussian Mixture Example
After third iteration

Page 74:

EM Gaussian Mixture Example

After 20th iteration

Page 75:

EM Example

• Example from R. Gutierrez-Osuna
• Training set of 900 examples forming an annulus
• A mixture model with m = 30 Gaussian components of unknown mean and variance is used
• Training:
  • Initialization:
    • means set to 30 random examples
    • covariance matrices initialized to be diagonal, with large variances on the diagonal compared to the training data variance
  • During EM training, components with small mixing coefficients were trimmed
    • This is a trick to get a more compact model, with fewer than 30 Gaussian components

Page 76:

EM Example

from R. Gutierrez-Osuna

Page 77:

EM Texture Segmentation Example

Figure from "Color and Texture Based Image Segmentation Using EM and Its Application to Content Based Image Retrieval", S.J. Belongie et al., ICCV 1998

Page 78:

EM Motion Segmentation Example

Three frames from the MPEG “flower garden” sequence

Figure from "Representing Images with Layers", by J. Wang and E.H. Adelson, IEEE Transactions on Image Processing, 1994, © 1994 IEEE

Page 79:

EM Algorithm Summary

• Advantages
  • Guaranteed to converge (to a local max)
  • If the assumed data distribution is correct, the algorithm works well
• Disadvantages
  • If the assumed data distribution is wrong, results can be quite bad
  • In particular, bad results if we use an incorrect number of classes (i.e. the number of mixture components)