CS9840a Learning and Computer Vision Prof. Olga Veksler Lecture 11 Unsupervised Learning EM


Page 1:

CS9840a Learning and Computer Vision
Prof. Olga Veksler

Lecture 11: Unsupervised Learning, EM

Page 2:

Today

• New Topic: Unsupervised Learning
  • supervised vs. unsupervised learning
  • nonparametric unsupervised learning = clustering
    • Proximity Measures
    • Criterion Functions
    • k-means
  • brief intro to Bayesian decision theory
    • need this for parametric supervised learning
  • parametric unsupervised learning
    • Expectation Maximization (EM)

Page 3:

Supervised vs. Unsupervised Learning

• Up to now we considered the supervised learning scenario, where we have
  1. samples x1,…, xn
  2. a class label yi for each sample xi
  • this is also called learning with a teacher
• In this lecture we consider the unsupervised learning scenario, where we are only given
  1. samples x1,…, xn
  • this is also called learning without a teacher
  • do not split the data into training and test sets
  • harder to say how good the results are, compared to the supervised case

Page 4:

Unsupervised Learning

• Data is not labeled
• Parametric Approach
  • assume a parametric distribution of the data
  • estimate the parameters of this distribution
• Non-parametric Approach
  • group the data into clusters
  • samples inside each cluster are more similar than samples across clusters
  • each cluster (hopefully) says something about the categories (classes) present in the data

Page 5:

Why Unsupervised Learning?

• Unsupervised learning is harder
  • How do we know if the results are meaningful? No answer labels are available.
    • Let the expert look at the results (external evaluation)
    • Define an objective function on the clustering (internal evaluation)
• We nevertheless need it because
  1. Labeling large datasets is very costly (speech recognition)
    • sometimes we can label only a few examples by hand
  2. May have no idea what / how many classes there are (data mining)
  3. May want to use clustering to gain some insight into the structure of the data before designing a classifier
    • Clustering as data description

Page 6:

Clustering

• Seek "natural" clusters in the data
• What is a good clustering?
  • internal (within-cluster) distances should be small
  • external (between-cluster) distances should be large
• Clustering is a way to discover new categories (classes)

Page 7:

What We Need for Clustering

1. Proximity measure, either
  • a similarity measure s(xi, xk): large if xi and xk are similar
  • a dissimilarity (or distance) measure d(xi, xk): small if xi and xk are similar
  (similar pair: large s, small d)
2. Criterion function to evaluate a clustering (to tell a good clustering from a bad one)
3. Algorithm to compute the clustering
  • usually by optimizing the criterion function

Page 8:

How Many Clusters?

(figure: the same data set interpreted as 3 clusters or as 2 clusters)

• Possible approaches
  • fix the number of clusters to k
  • find the best clustering according to the criterion function (the number of clusters may vary)

Page 9:

Proximity Measures

• a good proximity measure is application dependent
  • clusters should be invariant under transformations "natural" to the problem
• for object recognition, should have invariance to rotation (an object and its rotated copy: small distance)
• for character recognition, no invariance to rotation (a "9" and a rotated "6": large distance)

Page 10:

Distance (Dissimilarity) Measures

• Euclidean distance
  $d(x_i, x_j) = \left( \sum_{k=1}^{d} (x_{ik} - x_{jk})^2 \right)^{1/2}$
  • translation invariant
• Manhattan (city block) distance
  $d(x_i, x_j) = \sum_{k=1}^{d} |x_{ik} - x_{jk}|$
  • translation invariant
  • approximation to Euclidean distance, cheaper to compute
• Chebyshev distance
  $d(x_i, x_j) = \max_{1 \le k \le d} |x_{ik} - x_{jk}|$
  • approximation to Euclidean distance, cheapest to compute
(a small code sketch of these distances follows below)
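A minimal NumPy sketch of the three distances above; the two vectors are made-up example values, not data from the lecture.

    import numpy as np

    # Hypothetical feature vectors, just to illustrate the formulas.
    x_i = np.array([1.0, 4.0, 2.0])
    x_j = np.array([3.0, 0.0, 2.0])

    euclidean = np.sqrt(np.sum((x_i - x_j) ** 2))   # sqrt of sum of squared differences
    manhattan = np.sum(np.abs(x_i - x_j))           # sum of absolute differences
    chebyshev = np.max(np.abs(x_i - x_j))           # largest absolute difference

    print(euclidean, manhattan, chebyshev)          # ~4.47, 6.0, 4.0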

Page 11:

Similarity Measures

• Cosine similarity:
  $s(x_i, x_j) = \frac{x_i^T x_j}{\|x_i\| \, \|x_j\|}$
  • smaller angle gives larger similarity
  • scale invariant measure
  • popular in text retrieval
• Correlation coefficient
  • popular in image processing
  $s(x_i, x_j) = \frac{\sum_{k=1}^{d} (x_{ik} - \bar{x}_i)(x_{jk} - \bar{x}_j)}{\left[ \sum_{k=1}^{d} (x_{ik} - \bar{x}_i)^2 \; \sum_{k=1}^{d} (x_{jk} - \bar{x}_j)^2 \right]^{1/2}}$
(a small code sketch of both measures follows below)
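A short NumPy sketch of the two similarity measures above, reusing hypothetical example vectors:

    import numpy as np

    x_i = np.array([1.0, 4.0, 2.0])   # made-up feature vectors
    x_j = np.array([3.0, 0.0, 2.0])

    # Cosine similarity: inner product divided by the product of the norms.
    cos_sim = x_i @ x_j / (np.linalg.norm(x_i) * np.linalg.norm(x_j))

    # Correlation coefficient: cosine similarity of the mean-subtracted vectors.
    xc_i, xc_j = x_i - x_i.mean(), x_j - x_j.mean()
    corr = xc_i @ xc_j / np.sqrt(np.sum(xc_i ** 2) * np.sum(xc_j ** 2))

    print(cos_sim, corr)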

Page 12:

Feature Scale

• old problem: how to choose an appropriate relative scale for the features?
  • [length (in meters or cm?), weight (in grams or kg?)]
• in supervised learning, we can normalize to zero mean, unit variance with no problems
• in clustering this is more problematic: if the variance in the data is due to cluster presence, then normalizing the features is not a good thing

(figure: clusters before normalization vs. after normalization)

Page 13:

Criterion Functions for Clustering

• Have samples x1,…, xn
• Suppose we partitioned the samples into c subsets D1,…, Dc (figure: three clusters D1, D2, D3)
• There are approximately c^n / c! distinct partitions
• Can define a criterion function J(D1,…, Dc) which measures the quality of a partitioning D1,…, Dc
• Then the clustering problem is well defined
  • the optimal clustering is the partition which optimizes the criterion function

Page 14:

Iterative Optimization Algorithms

• Now we have both a proximity measure and a criterion function; we need an algorithm to find the optimal clustering
• Exhaustive search is impossible, since there are approximately c^n / c! possible partitions
• Usually some iterative algorithm is used
  1. find a reasonable initial partition
  2. repeat: move samples from one group to another s.t. the objective function J is improved

(figure: moving samples to improve J, e.g. from J = 777,777 down to J = 666,666)

Page 15:

K-means Clustering

• Probably the most famous clustering algorithm
• $J_{SSE}$ objective function:
  $J_{SSE} = \sum_{i=1}^{k} \sum_{x \in D_i} \|x - \mu_i\|^2$
  where $\mu_i$ is the mean of the samples in cluster $D_i$ (a small code sketch of this criterion follows below)
• can work for some other objective functions, but not proven to work as well
• Fix the number of clusters to k (c = k)
• Iterative clustering algorithm
• Moves many samples from one cluster to another in a smart way
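A tiny sketch of the J_SSE criterion above, assuming X is an (n, d) NumPy array, labels gives each sample's cluster index, and means holds the k cluster means (the example numbers are made up):

    import numpy as np

    def sse_criterion(X, labels, means):
        # Sum over clusters of squared distances of samples to their cluster mean.
        return sum(np.sum((X[labels == i] - means[i]) ** 2) for i in range(len(means)))

    X = np.array([[1.0], [1.2], [5.0], [5.4]])
    labels = np.array([0, 0, 1, 1])
    means = np.array([[1.1], [5.2]])
    print(sse_criterion(X, labels, means))   # 0.02 + 0.08 ≈ 0.1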

Page 16:

K-means Clustering

(figure: k = 3)

1. Initialize
  • pick k cluster centers arbitrarily
  • assign each example to the closest center
2. compute sample means for each cluster
3. reassign all samples to the closest mean
4. if clusters changed at step 3, go to step 2
(a code sketch of these steps follows below)
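A minimal NumPy sketch of steps 1-4 above; the function and variable names are mine, and for simplicity it assumes no cluster ever becomes empty:

    import numpy as np

    def kmeans(X, k, max_iters=100, seed=0):
        """Plain k-means on an (n, d) array X, following steps 1-4 of the slide."""
        rng = np.random.default_rng(seed)
        means = X[rng.choice(len(X), size=k, replace=False)]      # 1. arbitrary centers
        labels = None
        for _ in range(max_iters):
            dists = np.linalg.norm(X[:, None, :] - means[None, :, :], axis=2)
            new_labels = dists.argmin(axis=1)                     # 3. closest mean
            if labels is not None and np.array_equal(new_labels, labels):
                break                                             # 4. clusters unchanged, stop
            labels = new_labels
            # 2. recompute the sample mean of each cluster
            means = np.array([X[labels == i].mean(axis=0) for i in range(k)])
        return means, labels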

Page 17:

K-means Clustering

• Consider steps 2 and 3 of the algorithm
  2. compute sample means for each cluster
  3. reassign all samples to the closest mean
• the objective $J_{SSE} = \sum_{i=1}^{k} \sum_{x \in D_i} \|x - \mu_i\|^2$ is the sum of the within-cluster squared errors
• If we represent the clusters by their old means, the error has gotten smaller after the reassignment in step 3

Page 18:

K-means Clustering

3. reassign all samples to the closest mean
• If we represent the clusters by their old means, the error has gotten smaller
• However, we represent the clusters by their new means, and the mean is always the smallest-error representation of a cluster:
  $\frac{\partial}{\partial z} \sum_{x \in D_i} \|x - z\|^2 = \frac{\partial}{\partial z} \sum_{x \in D_i} \left( \|x\|^2 - 2 x^T z + \|z\|^2 \right) = \sum_{x \in D_i} \left( -2x + 2z \right) = 0 \;\;\Rightarrow\;\; z = \frac{1}{n_i} \sum_{x \in D_i} x$
• so recomputing the means in step 2 can only decrease the error further

Page 19:

K-means Clustering

• We just proved that by doing steps 2 and 3, the objective function goes down
  • we found a "smart" move which decreases the objective function
• The algorithm converges after a finite number of iterations
• However, it is not guaranteed to find a global minimum

(figure: with initial centers x1 and x2, 2-means gets stuck in a local minimum instead of the global minimum of J_SSE)

Page 20:

K-means Clustering

• Finding the optimum of J_SSE is NP-hard
• In practice, k-means usually performs well
• Very efficient
• Its solution can be used as a starting point for other clustering algorithms
• Many variants and improvements of k-means clustering exist

Page 21:

Bayesian Decision Theory

• Know the probability distribution of the categories
  • almost never the case in real life!
  • nevertheless useful, since other cases can be reduced to this one after some work
• Do not even need training data
• Can design an optimal classifier

Page 22:

Example: Fish Sorting

• A respected fish expert says that
  • salmon length has distribution N(5, 1)
  • sea bass length has distribution N(10, 4)
• Recall that N(μ, σ²) has density
  $p(l) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left( -\frac{(l-\mu)^2}{2\sigma^2} \right)$
• Thus the class conditional densities are
  $p(l \mid \text{salmon}) = \frac{1}{\sqrt{2\pi}} \exp\left( -\frac{(l-5)^2}{2} \right)$
  $p(l \mid \text{bass}) = \frac{1}{2\sqrt{2\pi}} \exp\left( -\frac{(l-10)^2}{2 \cdot 4} \right)$

Page 23:

Likelihood Function

• The class conditional densities are
  $p(l \mid \text{salmon}) = \frac{1}{\sqrt{2\pi}} e^{-\frac{(l-5)^2}{2}}$,  $p(l \mid \text{bass}) = \frac{1}{2\sqrt{2\pi}} e^{-\frac{(l-10)^2}{8}}$
• Fix the length l and let the fish class vary
  • this gives the likelihood function $p(l \mid \text{class})$ (with l fixed)
  • it is not a density and not a probability mass
  $p(l \mid \text{class}) = \frac{1}{\sqrt{2\pi}} e^{-\frac{(l-5)^2}{2}}$ if class = salmon,  $= \frac{1}{2\sqrt{2\pi}} e^{-\frac{(l-10)^2}{8}}$ if class = bass

Page 24:

Likelihood vs. Class Conditional Density

(figure: p(l | class) for both classes plotted against length, with length 7 marked)

Suppose a fish has length 7. How do we classify it?

Page 25:

ML (Maximum Likelihood) Classifier

• Intuitively, want to choose salmon if
  Pr[length = 7 | salmon] > Pr[length = 7 | bass]
• However, since length is a continuous r.v.,
  Pr[length = 7 | salmon] = Pr[length = 7 | bass] = 0
• Instead, we choose the class which maximizes the likelihood
  $p(l \mid \text{salmon}) = \frac{1}{\sqrt{2\pi}} e^{-\frac{(l-5)^2}{2}}$,  $p(l \mid \text{bass}) = \frac{1}{2\sqrt{2\pi}} e^{-\frac{(l-10)^2}{2 \cdot 4}}$
• ML classifier: compare p(l | salmon) with p(l | bass)
  • if p(l | salmon) > p(l | bass), classify as salmon, else classify as bass (a small numeric check follows below)
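A quick numeric check of the ML rule for the two densities above, classifying a fish of length 7 (the helper names are mine):

    import numpy as np

    def p_salmon(l):   # N(5, 1)
        return np.exp(-(l - 5) ** 2 / 2) / np.sqrt(2 * np.pi)

    def p_bass(l):     # N(10, 4), i.e. sigma = 2
        return np.exp(-(l - 10) ** 2 / (2 * 4)) / (2 * np.sqrt(2 * np.pi))

    l = 7.0
    print(p_salmon(l), p_bass(l))                          # the two likelihood values
    print("salmon" if p_salmon(l) > p_bass(l) else "bass") # ML decision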

Page 26:

Interval Justification

(figure: the two densities at length 7, with heights p(7 | salmon) and p(7 | bass))

• For a small interval B of width 2ε around 7,
  Pr[l ∈ B | salmon] ≈ 2ε · p(7 | salmon),  Pr[l ∈ B | bass] ≈ 2ε · p(7 | bass)
• Thus we choose the class (here, bass) which is more likely to have generated the observation

Page 27:

Decision Boundary

(figure: classify as salmon to the left of the boundary, as sea bass to the right; the boundary is at length 6.70)

Page 28:

Priors

• The prior comes from prior knowledge; no data has been seen yet
• Suppose a fish expert says: in the fall, there are twice as many salmon as sea bass
• Prior for our fish sorting problem
  • P(salmon) = 2/3
  • P(bass) = 1/3
• With the addition of the prior to our model, how should we classify a fish of length 7?

Page 29:

How Prior Changes Decision Boundary?

• Without priors: salmon to the left of 6.70, sea bass to the right
• How should this change with the prior?
  • P(salmon) = 2/3
  • P(bass) = 1/3
(figure: which way does the boundary at 6.70 move?)

Page 30:

Bayes Decision Rule

1. Have likelihood functions
  • p(length | salmon)
  • p(length | bass)
2. Have priors P(salmon) and P(bass)
• Question: Having observed a fish of a certain length, do we classify it as salmon or bass?
• Natural Idea:
  • salmon if P(salmon | length) > P(bass | length)
  • bass if P(bass | length) > P(salmon | length)

Page 31:

Posterior

• P(salmon | length) and P(bass | length) are called posterior distributions, because the data (length) was revealed (post data)
• How to compute posteriors? Bayes rule:
  $P(\text{salmon} \mid \text{length}) = \frac{p(\text{length}, \text{salmon})}{p(\text{length})} = \frac{p(\text{length} \mid \text{salmon})\, P(\text{salmon})}{p(\text{length})}$
• Similarly:
  $P(\text{bass} \mid \text{length}) = \frac{p(\text{length} \mid \text{bass})\, P(\text{bass})}{p(\text{length})}$

Page 32:

MAP (Maximum A Posteriori) Classifier

• Compare P(salmon | length) with P(bass | length) and pick the class with the larger posterior
• By Bayes rule, this is the same as comparing
  $\frac{p(\text{length} \mid \text{salmon})\, P(\text{salmon})}{p(\text{length})}$  with  $\frac{p(\text{length} \mid \text{bass})\, P(\text{bass})}{p(\text{length})}$
• Since p(length) appears on both sides, it is enough to compare
  $p(\text{length} \mid \text{salmon})\, P(\text{salmon})$  with  $p(\text{length} \mid \text{bass})\, P(\text{bass})$

Page 33:

Back to Fish Sorting

• likelihoods:
  $p(l \mid \text{salmon}) = \frac{1}{\sqrt{2\pi}} e^{-\frac{(l-5)^2}{2}}$,  $p(l \mid \text{bass}) = \frac{1}{2\sqrt{2\pi}} e^{-\frac{(l-10)^2}{8}}$
• Priors: P(salmon) = 2/3, P(bass) = 1/3
• Solve the inequality
  $\frac{2}{3} \cdot \frac{1}{\sqrt{2\pi}} e^{-\frac{(l-5)^2}{2}} \;>\; \frac{1}{3} \cdot \frac{1}{2\sqrt{2\pi}} e^{-\frac{(l-10)^2}{8}}$
(figure: salmon to the left, sea bass to the right; the new decision boundary is at length 7.18, to the right of the old boundary at 6.70)
• The new boundary makes sense: we expect to see more salmon (a numeric check follows below)
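A small numerical check of the prior-weighted comparison above, assuming the two densities and the priors from this slide; brentq simply finds the length where the two sides of the inequality cross (the helper names are mine):

    import numpy as np
    from scipy.optimize import brentq

    def p_salmon(l):
        return np.exp(-(l - 5) ** 2 / 2) / np.sqrt(2 * np.pi)

    def p_bass(l):
        return np.exp(-(l - 10) ** 2 / 8) / (2 * np.sqrt(2 * np.pi))

    P_salmon, P_bass = 2.0 / 3.0, 1.0 / 3.0

    # Decision boundary: the length where the two prior-weighted scores are equal.
    g = lambda l: p_salmon(l) * P_salmon - p_bass(l) * P_bass
    boundary = brentq(g, 5.0, 10.0)   # search between the two class means
    print(boundary)                   # close to the slide's 7.18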

Page 34:

More on Posterior

  $P(c \mid l) = \frac{p(l \mid c)\, P(c)}{p(l)}$
  (posterior density, our goal) = (likelihood, given) × (prior, given) / (normalizing factor)

• p(l) is a normalizing factor; we often do not even need it for classification, since p(l) does not depend on the class c. If we do need it, from the law of total probability:
  $p(l) = p(l \mid \text{salmon})\, P(\text{salmon}) + p(l \mid \text{bass})\, P(\text{bass})$
• Notice this formula consists of likelihoods and priors, which are given

Page 35:

More on Priors

• The prior comes from prior knowledge; no data has been seen yet
• If there is a reliable source of prior knowledge, it should be used
• Some problems cannot even be solved reliably without a good prior
• However, the prior alone is not enough; we still need the likelihood
  • P(salmon) = 2/3, P(sea bass) = 1/3
  • If I don't let you see the data, but ask you to guess, will you choose salmon or sea bass?

Page 36:

More on MAP Classifier

  $P(c \mid l) = \frac{p(l \mid c)\, P(c)}{p(l)}$   (posterior ∝ likelihood × prior)

• Do not care about p(l) when maximizing P(c | l):
  $P(c \mid l) \propto p(l \mid c)\, P(c)$
• If P(salmon) = P(bass) (uniform prior), the MAP classifier becomes the ML classifier:
  $P(c \mid l) \propto p(l \mid c)$
• If for some observation l, p(l | salmon) = p(l | bass), then this observation is uninformative and the decision is based solely on the prior:
  $P(c \mid l) \propto P(c)$

Page 37:

Justification for MAP Classifier

• The MAP rule: decide salmon if P(salmon | l) > P(bass | l), and bass otherwise
• For any particular l, the probability of error is
  Pr[error | l] = P(bass | l) if we decide salmon, and P(salmon | l) if we decide bass
• Thus the MAP classifier is optimal for each individual l

Page 38:

Justification for MAP Classifier

• We are interested in minimizing the error not just for one l; we really want to minimize the average error over all l:
  $\Pr[\text{error}] = \int p(\text{error}, l)\, dl = \int \Pr[\text{error} \mid l]\, p(l)\, dl$
• If Pr[error | l] is as small as possible for every l, the integral is as small as possible
• But the Bayes (MAP) rule makes Pr[error | l] as small as possible
• Thus the MAP classifier minimizes the probability of error

Page 39:

Parametric Supervised Learning

• Supervised parametric learning
  • have m classes
  • have samples x1,…, xn, each of class 1, 2,…, m
  • suppose Di holds the samples from class i
  • the probability distribution for class i is pi(x | θi)
(figure: two class densities, p1(x | θ1) and p2(x | θ2))

Page 40:

Parametric Supervised Learning

• Use the ML method to estimate the parameters θi
  • find θi which maximizes the likelihood function F(θi):
    $p(D_i \mid \theta_i) = \prod_{x \in D_i} p(x \mid \theta_i) = F(\theta_i)$
  • or, equivalently, find θi which maximizes the log likelihood l(θi):
    $l(\theta_i) = \ln p(D_i \mid \theta_i) = \sum_{x \in D_i} \ln p(x \mid \theta_i)$
  $\hat{\theta}_1 = \arg\max_{\theta_1} \ln p(D_1 \mid \theta_1)$,  $\hat{\theta}_2 = \arg\max_{\theta_2} \ln p(D_2 \mid \theta_2)$, …

Page 41:

Parametric Supervised Learning

• now the distributions are fully specified
• can classify an unknown sample using the MAP (Maximum A Posteriori) classifier
(figure: the fitted densities p1(x | θ̂1) and p2(x | θ̂2); a small code sketch follows below)
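As a concrete (hypothetical) instance of the ML step above: if each class-conditional density is assumed Gaussian, the log likelihood is maximized by the per-class sample mean and variance. The numbers below are made up.

    import numpy as np

    # Hypothetical labeled 1-D samples for two classes.
    D1 = np.array([4.2, 5.1, 4.8, 5.5, 5.0])
    D2 = np.array([9.0, 10.5, 11.2, 9.8])

    # ML estimates for a Gaussian class model: sample mean and sample variance.
    mu1_hat, var1_hat = D1.mean(), D1.var()
    mu2_hat, var2_hat = D2.mean(), D2.var()
    print(mu1_hat, var1_hat, mu2_hat, var2_hat)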

Page 42:

Parametric Unsupervised Learning

• Expectation Maximization (EM)
  • one of the most useful statistical methods
  • oldest version in 1958 (Hartley)
  • seminal paper in 1977 (Dempster et al.)
  • can also be used when some samples are missing features

Page 43:

Parametric Unsupervised Learning

• Assume the data was generated by a model p(x | θ) with known shape but unknown parameters
• Advantages of having a model
  • Gives a meaningful way to cluster data
    • adjust the parameters of the model to maximize the probability that the model produced the observed data
  • Can sensibly measure if a clustering is good
    • compute the likelihood of the data induced by the clustering
  • Can compare 2 clustering algorithms
    • which one gives the higher likelihood of the observed data?

Page 44:

Parametric Unsupervised Learning

• In unsupervised learning, no one tells us the true classes for the samples. We still know that we
  • have m classes
  • have samples x1,…, xn, each of unknown class
  • the probability distribution for class i is pi(x | θi)
• Can we determine the classes and the parameters simultaneously?

Page 45:

Example: MRI Brain Segmentation

(figure: MRI brain image and its segmentation; picture from M. Leventon)

• In an MRI brain image, different brain tissues have different intensities
• Know that the brain has 6 major types of tissues
• Each type of tissue can be modeled reasonably well by a Gaussian N(μi, σi²); the parameters μi, σi² are unknown
• Segmenting (classifying) the brain image into the different tissue classes is very useful
  • don't know which image pixel corresponds to which tissue (class)
  • don't know the parameters of each N(μi, σi²)

Page 46:

Mixture Density Model

• Model the data with a mixture density
  $p(x \mid \theta) = \sum_{j=1}^{m} p(x \mid c_j, \theta_j)\, P(c_j)$
  • the p(x | cj, θj) are the component densities, and the P(cj) are the mixing parameters
  • where θ = (θ1,…, θm) and P(c1) + P(c2) + … + P(cm) = 1
• To generate a sample from the distribution p(x | θ)
  • first select a class j with probability P(cj)
  • then generate x according to the probability law p(x | cj, θj)
(figure: three components p(x | c1, θ1), p(x | c2, θ2), p(x | c3, θ3); a sampling sketch follows below)
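A minimal sketch of the two-step sampling procedure above for a 1-D Gaussian mixture; the weights, means, and sigmas are made-up values, not the example on the next slide.

    import numpy as np

    rng = np.random.default_rng(0)

    weights = np.array([0.2, 0.3, 0.5])   # mixing parameters P(c_j)
    mus     = np.array([0.0, 6.0, 7.0])   # per-component means
    sigmas  = np.array([1.0, 2.0, 2.5])   # per-component standard deviations

    def sample_from_mixture(n):
        comp = rng.choice(len(weights), size=n, p=weights)   # step 1: pick component j ~ P(c_j)
        return rng.normal(mus[comp], sigmas[comp])           # step 2: draw x ~ p(x | c_j, theta_j)

    X = sample_from_mixture(1000)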

Page 47:

Example: Gaussian Mixture Density

• Mixture of 3 two-dimensional Gaussians:
  $p_1(x) = N\!\left( \begin{bmatrix} 0 \\ 0 \end{bmatrix}, \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix} \right)$,  $p_2(x) = N\!\left( \begin{bmatrix} 6 \\ 6 \end{bmatrix}, \begin{bmatrix} 4 & 0 \\ 0 & 4 \end{bmatrix} \right)$,  $p_3(x) = N\!\left( \begin{bmatrix} 7 \\ 7 \end{bmatrix}, \begin{bmatrix} 6 & 0 \\ 0 & 6 \end{bmatrix} \right)$
  $p(x) = 0.2\, p_1(x) + 0.3\, p_2(x) + 0.5\, p_3(x)$
(figure: the three component densities p1(x), p2(x), p3(x))

Page 48:

Mixture Density

  $p(x \mid \theta) = \sum_{j=1}^{m} p(x \mid c_j, \theta_j)\, P(c_j)$

• P(c1),…, P(cm) can be known or unknown
• Suppose we know how to estimate θ1,…, θm and P(c1),…, P(cm)
• Then we can "break apart" the mixture p(x | θ) for classification
• To classify a sample x, use MAP estimation, that is, choose the class i which maximizes
  $P(c_i \mid x, \theta_i) \propto p(x \mid c_i, \theta_i)\, P(c_i)$
  (the probability that component i generated x, proportional to the likelihood of x under component i times the probability of component i)

Page 49:

ML Estimation for Mixture Density

  $p(x \mid \theta, \alpha) = \sum_{j=1}^{m} p(x \mid c_j, \theta_j)\, \alpha_j$

• Use Maximum Likelihood estimation for the mixture density
• Need to estimate
  • θ1,…, θm
  • the mixing parameters α1 = P(c1),…, αm = P(cm); write θ = {θ1,…, θm} and α = {α1,…, αm}
• As in the supervised case, form the log likelihood function
  $l(\theta, \alpha) = \ln p(D \mid \theta, \alpha) = \sum_{k=1}^{n} \ln p(x_k \mid \theta, \alpha) = \sum_{k=1}^{n} \ln \sum_{j=1}^{m} p(x_k \mid c_j, \theta_j)\, \alpha_j$

Page 50:

ML Estimation for Mixture Density

  $l(\theta, \alpha) = \sum_{k=1}^{n} \ln \sum_{j=1}^{m} p(x_k \mid c_j, \theta_j)\, \alpha_j$

• need to maximize l(θ, α) with respect to θ and α
• l(θ, α) is hard to maximize directly
  • partial derivatives with respect to θ, α usually give a "coupled" nonlinear system of equations
  • usually a closed form solution cannot be found
• We could use the gradient ascent method
  • in general, it is not the greatest method to use; it should only be used as a last resort
• There is a better algorithm, called EM

Page 51:

Mixture Density

• Before EM, let's look at the mixture density again:
  $p(x \mid \theta, \alpha) = \sum_{j=1}^{m} p(x \mid c_j, \theta_j)\, \alpha_j$
• Say we know how to estimate θ1,…, θm and α1,…, αm
  • then estimating the class of x is easy with MAP: maximize $p(x \mid c_i, \theta_i)\, P(c_i) \propto P(c_i \mid x, \theta_i)$
• Suppose instead we know the class of the samples x1,…, xn
  • just as in the supervised case, estimating θ1,…, θm and α1,…, αm is easy:
    $\hat{\theta}_i = \arg\max_{\theta_i} \ln p(D_i \mid \theta_i)$,  $\hat{\alpha}_i = \frac{|D_i|}{n}$
• This is an example of a chicken-and-egg problem
• The EM algorithm's solution is to add "hidden" variables

Page 52:

Expectation Maximization Algorithm

• EM is an algorithm for ML parameter estimation when the data has missing values. It is used when
  1. the data is incomplete (has missing values)
    • some features are missing for some samples due to data corruption, partial survey responses, etc.
  2. the data X is complete, but p(X | θ) is hard to optimize, while introducing certain hidden variables U (whose values are missing) makes the "complete" likelihood function p(X, U | θ) easier to optimize. Then EM is useful.
    • this case covers mixture density estimation and is the subject of our lecture today
• Notice that after we introduce artificial (hidden) variables U with missing values, case 2 is completely equivalent to case 1

Page 53:

EM: Hidden Variables for Mixture Density

  $p(x \mid \theta) = \sum_{j=1}^{m} p(x \mid c_j, \theta_j)\, \alpha_j$

• For simplicity, assume the component densities are Gaussian with a known, shared variance:
  $p(x \mid c_j, \mu_j) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left( -\frac{(x - \mu_j)^2}{2\sigma^2} \right)$
  • assume for now that the variance σ² is known
  • need to estimate θ = {μ1,…, μm}
• If we knew which sample came from which component, the ML parameter estimation would be easy
• To get this easier problem, introduce hidden variables which indicate which component each sample belongs to

Page 54:

EM: Hidden Variables for Mixture Density

• For 1 ≤ i ≤ n, 1 ≤ k ≤ m, define hidden variables zi(k):
  zi(k) = 1 if sample xi was generated by component k, and 0 otherwise
• the zi(k) are indicator random variables; they indicate which Gaussian component generated sample xi
• Let zi = {zi(1),…, zi(m)} be the indicator r.v.'s corresponding to xi
• Conditioned on zi, the distribution of xi is Gaussian:
  $p(x_i \mid z_i) \sim N(\mu_k, \sigma^2)$,  where k is s.t. zi(k) = 1

Page 55:

EM: Joint Likelihood

• Let zi = {zi(1),…, zi(m)}, and Z = {z1,…, zn}
• The complete likelihood is
  $p(X, Z \mid \theta) = p(x_1,…, x_n, z_1,…, z_n \mid \theta) = \prod_{i=1}^{n} p(x_i, z_i \mid \theta) = \prod_{i=1}^{n} p(x_i \mid z_i, \theta)\, p(z_i \mid \theta)$
  (the first factor is the Gaussian part, the second the mixing part)
• If we observed Z, the log likelihood ln p(X, Z | θ) would be trivial to maximize with respect to the means and the mixing parameters
• The problem is that the values of Z are missing, since we made them up (that is, Z is hidden)

Page 56:

EM Derivation

• Instead of maximizing ln p(X, Z | θ), the idea behind EM is to maximize some function of ln p(X, Z | θ), usually its expected value conditioned on X:
  $E_{Z \mid X}\left[ \ln p(X, Z \mid \theta) \right]$
  • common trick: integrate out the unknown variables
  • the expectation is with respect to the missing data Z
    • that is, with respect to the density p(Z | X, θ)
  • however, θ is our ultimate goal; we don't know θ!

Page 57:

EM Algorithm

• The EM solution is to iterate:
  1. start with initial parameters θ(0)
  2. iterate the following two steps until convergence
    E. compute the expectation of the log likelihood with respect to the current estimate θ(t) and X; call it Q(θ | θ(t)):
      $Q(\theta \mid \theta^{(t)}) = E_{Z \mid X, \theta^{(t)}}\left[ \ln p(X, Z \mid \theta) \right]$
    M. maximize Q(θ | θ(t)):
      $\theta^{(t+1)} = \arg\max_{\theta} Q(\theta \mid \theta^{(t)})$

Page 58:

EM Algorithm

• It can be proven that the EM algorithm converges to a local maximum of the log-likelihood ln p(X | θ)
• Why is it better than gradient ascent?
  • Convergence of EM is usually significantly faster: in the beginning, very large steps are made (that is, the likelihood function increases rapidly), as opposed to gradient ascent, which usually takes tiny steps
  • gradient ascent is not even guaranteed to converge
    • recall the difficulties of choosing an appropriate learning rate

Page 59:

EM: Lower Bound Maximization

• It can be shown that at time step t the EM algorithm
  • constructs a function l(θ | θ(t)) which is bounded above by ln p(X | θ) (i.e. it is a lower bound) and touches ln p(X | θ) at θ = θ(t)
  • finds θ(t+1) that maximizes l(θ | θ(t))
• Therefore, the log likelihood ln p(X | θ) can only go up
(figure: ln p(X | θ) and the lower bound l(θ | θ(t)), with θ(t) and θ(t+1) marked)

Page 60:

EM for Mixture of Gaussians: E Step

• Let's come back to our example:
  $p(x \mid \theta) = \sum_{j=1}^{m} p(x \mid c_j, \theta_j)\, \alpha_j$,  $p(x \mid c_j, \mu_j) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left( -\frac{(x - \mu_j)^2}{2\sigma^2} \right)$
  • need to estimate θ = {μ1,…, μm} and α1,…, αm
• For 1 ≤ i ≤ n, 1 ≤ k ≤ m, define zi(k) = 1 if sample i was generated by component k, and 0 otherwise
  • as before, zi = {zi(1),…, zi(m)}, and Z = {z1,…, zn}
• We need the log-likelihood of the observed X and hidden Z:
  $\ln p(X, Z \mid \theta) = \ln \prod_{i=1}^{n} p(x_i, z_i \mid \theta) = \sum_{i=1}^{n} \ln\left[ p(x_i \mid z_i, \theta)\, P(z_i) \right]$

Page 61:

EM for Mixture of Gaussians: E Step

• We need the log-likelihood of the observed X and hidden Z:
  $\ln p(X, Z \mid \theta) = \sum_{i=1}^{n} \ln\left[ p(x_i \mid z_i, \theta)\, P(z_i) \right]$
• First, let's rewrite $p(x_i \mid z_i, \theta)\, P(z_i)$. It equals $p(x_i \mid z_i^{(k)} = 1, \theta)\, P(z_i^{(k)} = 1)$ for the single k with $z_i^{(k)} = 1$, which can be written compactly as
  $p(x_i \mid z_i, \theta)\, P(z_i) = \prod_{k=1}^{m} \left[ p(x_i \mid z_i^{(k)} = 1, \theta)\, P(z_i^{(k)} = 1) \right]^{z_i^{(k)}} = \prod_{k=1}^{m} \left[ \frac{1}{\sqrt{2\pi}\,\sigma} \exp\!\left( -\frac{(x_i - \mu_k)^2}{2\sigma^2} \right) P(z_i^{(k)} = 1) \right]^{z_i^{(k)}}$

Page 62:

EM for Mixture of Gaussians: E Step

• The log-likelihood of the observed X and hidden Z is
  $\ln p(X, Z \mid \theta) = \sum_{i=1}^{n} \ln\left[ p(x_i \mid z_i, \theta)\, P(z_i) \right] = \sum_{i=1}^{n} \ln \prod_{k=1}^{m} \left[ \frac{1}{\sqrt{2\pi}\,\sigma} \exp\!\left( -\frac{(x_i - \mu_k)^2}{2\sigma^2} \right) P(z_i^{(k)} = 1) \right]^{z_i^{(k)}}$
  $= \sum_{i=1}^{n} \sum_{k=1}^{m} z_i^{(k)} \left[ \ln \frac{1}{\sqrt{2\pi}\,\sigma} - \frac{(x_i - \mu_k)^2}{2\sigma^2} + \ln \alpha_k \right]$
  where $\alpha_k = P(z_i^{(k)} = 1) = P(c_k)$ is the probability that a sample comes from class k

Page 63:

EM for Mixture of Gaussians: E Step

• The log-likelihood of the observed X and hidden Z is
  $\ln p(X, Z \mid \theta) = \sum_{i=1}^{n} \sum_{k=1}^{m} z_i^{(k)} \left[ \ln \frac{1}{\sqrt{2\pi}\,\sigma} - \frac{(x_i - \mu_k)^2}{2\sigma^2} + \ln \alpha_k \right]$
• For the E step, we must compute
  $Q(\theta \mid \theta^{(t)}) = Q(\mu_1,…, \mu_m, \alpha_1,…, \alpha_m \mid \mu_1^{(t)},…, \mu_m^{(t)}, \alpha_1^{(t)},…, \alpha_m^{(t)}) = E_{Z \mid X, \theta^{(t)}}\left[ \ln p(X, Z \mid \theta) \right]$
• By linearity of expectation ($E_X[a x_i + b] = a\,E_X[x_i] + b$),
  $Q(\theta \mid \theta^{(t)}) = \sum_{i=1}^{n} \sum_{k=1}^{m} E_Z\!\left[ z_i^{(k)} \right] \left[ \ln \frac{1}{\sqrt{2\pi}\,\sigma} - \frac{(x_i - \mu_k)^2}{2\sigma^2} + \ln \alpha_k \right]$

Page 64:

EM for Mixture of Gaussians: E Step

  $Q(\theta \mid \theta^{(t)}) = \sum_{i=1}^{n} \sum_{k=1}^{m} E_Z\!\left[ z_i^{(k)} \right] \left[ \ln \frac{1}{\sqrt{2\pi}\,\sigma} - \frac{(x_i - \mu_k)^2}{2\sigma^2} + \ln \alpha_k \right]$

• We need to compute E_Z[zi(k)] in the expression above:
  $E_Z\!\left[ z_i^{(k)} \right] = 0 \cdot P(z_i^{(k)} = 0 \mid x_i, \theta^{(t)}) + 1 \cdot P(z_i^{(k)} = 1 \mid x_i, \theta^{(t)}) = P(z_i^{(k)} = 1 \mid x_i, \theta^{(t)})$
  $= \frac{p(x_i \mid z_i^{(k)} = 1, \theta^{(t)})\, P(z_i^{(k)} = 1 \mid \theta^{(t)})}{p(x_i \mid \theta^{(t)})} = \frac{\exp\!\left( -\frac{1}{2\sigma^2} (x_i - \mu_k^{(t)})^2 \right) \alpha_k^{(t)}}{\sum_{j=1}^{m} \exp\!\left( -\frac{1}{2\sigma^2} (x_i - \mu_j^{(t)})^2 \right) \alpha_j^{(t)}}$
• Done with the E step
  • for implementation, we only need to compute the E_Z[zi(k)]'s; we don't need Q itself

Page 65:

EM for Mixture of Gaussians: M Step

  $Q(\theta \mid \theta^{(t)}) = \sum_{i=1}^{n} \sum_{k=1}^{m} E_Z\!\left[ z_i^{(k)} \right] \left[ \ln \frac{1}{\sqrt{2\pi}\,\sigma} - \frac{(x_i - \mu_k)^2}{2\sigma^2} + \ln \alpha_k \right]$

• Need to maximize Q with respect to all parameters
• First, differentiate with respect to μk:
  $\frac{\partial}{\partial \mu_k} Q(\theta \mid \theta^{(t)}) = \sum_{i=1}^{n} E_Z\!\left[ z_i^{(k)} \right] \frac{x_i - \mu_k}{\sigma^2} = 0 \;\;\Rightarrow\;\; \mu_k^{new} = \frac{\sum_{i=1}^{n} E_Z\!\left[ z_i^{(k)} \right] x_i}{\sum_{i=1}^{n} E_Z\!\left[ z_i^{(k)} \right]}$
• the new mean for class k is a weighted average of all samples, and the weight of each sample is proportional to the current estimate of the probability that the sample belongs to class k

Page 66:

EM for Mixture of Gaussians: M Step

  $Q(\theta \mid \theta^{(t)}) = \sum_{i=1}^{n} \sum_{k=1}^{m} E_Z\!\left[ z_i^{(k)} \right] \left[ \ln \frac{1}{\sqrt{2\pi}\,\sigma} - \frac{(x_i - \mu_k)^2}{2\sigma^2} + \ln \alpha_k \right]$

• For the αk, use Lagrange multipliers to preserve the constraint $\sum_{j=1}^{m} \alpha_j = 1$
• Need to differentiate
  $F(\theta, \lambda) = Q(\theta \mid \theta^{(t)}) + \lambda \left( 1 - \sum_{j=1}^{m} \alpha_j \right)$
  $\frac{\partial F}{\partial \alpha_k} = \sum_{i=1}^{n} \frac{1}{\alpha_k} E_Z\!\left[ z_i^{(k)} \right] - \lambda = 0 \;\;\Rightarrow\;\; \lambda\, \alpha_k = \sum_{i=1}^{n} E_Z\!\left[ z_i^{(k)} \right]$
• Summing up over all components:
  $\lambda \sum_{k=1}^{m} \alpha_k = \sum_{k=1}^{m} \sum_{i=1}^{n} E_Z\!\left[ z_i^{(k)} \right]$
• Since $\sum_{k=1}^{m} \alpha_k = 1$ and $\sum_{k=1}^{m} \sum_{i=1}^{n} E_Z\!\left[ z_i^{(k)} \right] = n$, we get $\lambda = n$, and therefore
  $\alpha_k^{new} = \frac{1}{n} \sum_{i=1}^{n} E_Z\!\left[ z_i^{(k)} \right]$

Page 67:

EM Algorithm

This algorithm applies to a univariate Gaussian mixture with known variances. (A code sketch follows below.)

1. Randomly initialize μ1,…, μm, α1,…, αm (with the constraint Σi αi = 1)
iterate until there is no change in μ1,…, μm, α1,…, αm:
E. for all i, k, compute
  $E_Z\!\left[ z_i^{(k)} \right] = \frac{\alpha_k \exp\!\left( -\frac{1}{2\sigma^2} (x_i - \mu_k)^2 \right)}{\sum_{j=1}^{m} \alpha_j \exp\!\left( -\frac{1}{2\sigma^2} (x_i - \mu_j)^2 \right)}$
M. for all k, do the parameter update
  $\mu_k = \frac{\sum_{i=1}^{n} E_Z\!\left[ z_i^{(k)} \right] x_i}{\sum_{i=1}^{n} E_Z\!\left[ z_i^{(k)} \right]}$,  $\alpha_k = \frac{1}{n} \sum_{i=1}^{n} E_Z\!\left[ z_i^{(k)} \right]$
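A minimal NumPy sketch of this E/M loop, assuming a 1-D data array x, m components, and a known shared variance; the function name, the fixed iteration count, and the example data are mine:

    import numpy as np

    def em_known_variance(x, m, sigma=1.0, n_iters=50, seed=0):
        """EM for a 1-D Gaussian mixture with m components and known, shared variance."""
        rng = np.random.default_rng(seed)
        mu = rng.choice(x, size=m, replace=False)          # random initial means
        alpha = np.full(m, 1.0 / m)                        # mixing weights, sum to 1
        for _ in range(n_iters):
            # E step: responsibilities E[z_i^(k)] for every sample i and component k
            r = alpha * np.exp(-(x[:, None] - mu[None, :]) ** 2 / (2 * sigma ** 2))
            r /= r.sum(axis=1, keepdims=True)
            # M step: weighted means and updated mixing weights
            mu = (r * x[:, None]).sum(axis=0) / r.sum(axis=0)
            alpha = r.mean(axis=0)
        return mu, alpha

    # Example on made-up data with two unit-variance components:
    rng = np.random.default_rng(1)
    data = np.concatenate([rng.normal(0, 1, 300), rng.normal(5, 1, 200)])
    print(em_known_variance(data, m=2))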

Page 68:

EM Algorithm

• For the more general case of multivariate Gaussians with unknown means and covariances,
  $p(x \mid \theta) = \sum_{j=1}^{m} \alpha_j\, p(x \mid \mu_j, \Sigma_j)$,  where  $p(x \mid \mu_k, \Sigma_k) = \frac{1}{(2\pi)^{d/2} |\Sigma_k|^{1/2}} \exp\!\left( -\frac{1}{2} (x - \mu_k)^T \Sigma_k^{-1} (x - \mu_k) \right)$
• E step:
  $E_Z\!\left[ z_i^{(k)} \right] = \frac{\alpha_k^{(t)}\, p(x_i \mid \mu_k^{(t)}, \Sigma_k^{(t)})}{\sum_{j=1}^{m} \alpha_j^{(t)}\, p(x_i \mid \mu_j^{(t)}, \Sigma_j^{(t)})}$
• M step:
  $\alpha_k = \frac{1}{n} \sum_{i=1}^{n} E_Z\!\left[ z_i^{(k)} \right]$,  $\mu_k = \frac{\sum_{i=1}^{n} E_Z\!\left[ z_i^{(k)} \right] x_i}{\sum_{i=1}^{n} E_Z\!\left[ z_i^{(k)} \right]}$,  $\Sigma_k = \frac{\sum_{i=1}^{n} E_Z\!\left[ z_i^{(k)} \right] (x_i - \mu_k)(x_i - \mu_k)^T}{\sum_{i=1}^{n} E_Z\!\left[ z_i^{(k)} \right]}$
(a library-based sketch follows below)

Page 69:

EM Algorithm and K-means

• k-means can be derived from the EM algorithm
• Setting the mixing parameters equal for all classes, the E step gives
  $E_Z\!\left[ z_i^{(k)} \right] = \frac{\exp\!\left( -\frac{1}{2\sigma^2} \|x_i - \mu_k\|^2 \right)}{\sum_{j=1}^{m} \exp\!\left( -\frac{1}{2\sigma^2} \|x_i - \mu_j\|^2 \right)}$
• If we let $\sigma \to 0$, then
  $E_Z\!\left[ z_i^{(k)} \right] \to 1$ if $\|x_i - \mu_k\| \le \|x_i - \mu_j\|$ for all j, and $\to 0$ otherwise
• E step: for each current mean, find all points closest to it and form new clusters
• M step: compute the new means inside the current clusters,
  $\mu_k = \frac{\sum_{i=1}^{n} E_Z\!\left[ z_i^{(k)} \right] x_i}{\sum_{i=1}^{n} E_Z\!\left[ z_i^{(k)} \right]}$, i.e. the average of the samples assigned to cluster k

Page 70:

EM Gaussian Mixture Example

Page 71:

EM Gaussian Mixture Example
After first iteration

Page 72:

EM Gaussian Mixture Example
After second iteration

Page 73:

EM Gaussian Mixture Example
After third iteration

Page 74:

EM Gaussian Mixture Example

After 20th iteration

Page 75:

EM Example

• Example from R. Gutierrez-Osuna
• Training set of 900 examples forming an annulus
• A mixture model with m = 30 Gaussian components of unknown mean and variance is used
• Training:
  • Initialization:
    • means set to 30 random examples
    • covariance matrices initialized to be diagonal, with large variances on the diagonal compared to the training data variance
  • During EM training, components with small mixing coefficients were trimmed
    • This is a trick to get a more compact model, with fewer than 30 Gaussian components

Page 76:

EM Example

from R. Gutierrez-Osuna

Page 77:

EM Texture Segmentation Example

Figure from "Color and Texture Based Image Segmentation Using EM and Its Application to Content Based Image Retrieval", S.J. Belongie et al., ICCV 1998

Page 78:

EM Motion Segmentation Example

Three frames from the MPEG “flower garden” sequence

Figure from "Representing Images with Layers", by J. Wang and E.H. Adelson, IEEE Transactions on Image Processing, 1994, © 1994 IEEE

Page 79:

EM Algorithm Summary

• Advantages
  • Guaranteed to converge (to a local max)
  • If the assumed data distribution is correct, the algorithm works well
• Disadvantages
  • If the assumed data distribution is wrong, results can be quite bad
  • In particular, bad results if we use an incorrect number of classes (i.e. the number of mixture components)