
Clustering

Professor Ameet Talwalkar

CS260 Machine Learning Algorithms, March 8, 2017

Outline

1 Administration

2 Review of last lecture

3 Clustering


Schedule and Final

Upcoming Schedule

Today: Last day of class

Next Monday (3/13): No class – I will hold office hours in my office (BH 4531F)

Next Wednesday (3/15): Final Exam, HW6 due

Final Exam

Cumulative but with more emphasis on new material

8 short questions (recall that the midterm had 6 short and 3 long questions)

Should be much shorter than midterm, but of equal difficulty

Focus on major concepts (e.g., MLE, primal and dual formulations of SVM, gradient descent, etc.)



Neural Networks – Basic Idea

Learning nonlinear basis functions and classifiers

Hidden layers are nonlinear mappings from input features to a new representation

Output layers use the new representation for classification and regression

Learning parameters

Backpropagation = efficient algorithm for (stochastic) gradient descent

Can write down explicit updates via chain rule of calculus


Single Node

Sigmoid (logistic) activation function: g(z) = 1 / (1 + e^{-z})

h_θ(x) = g(θ^T x) = 1 / (1 + e^{-θ^T x})

where x = [x_0, x_1, x_2, x_3]^T, θ = [θ_0, θ_1, θ_2, θ_3]^T, and x_0 = 1 is the “bias unit”.

Based on slide by Andrew Ng
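To make the single node concrete, here is a minimal NumPy sketch (the example vectors below are made up for illustration and are not from the slides):

import numpy as np

def sigmoid(z):
    # logistic activation g(z) = 1 / (1 + exp(-z))
    return 1.0 / (1.0 + np.exp(-z))

def h_theta(theta, x):
    # single-node hypothesis h_theta(x) = g(theta^T x); x[0] = 1 is the bias unit
    return sigmoid(theta @ x)

x = np.array([1.0, 0.5, -1.2, 2.0])      # [x_0, x_1, x_2, x_3], with x_0 = 1
theta = np.array([0.1, -0.4, 0.3, 0.8])  # [theta_0, ..., theta_3]
print(h_theta(theta, x))                 # a value in (0, 1)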

Neural Network

Layer 1 (Input Layer), Layer 2 (Hidden Layer), Layer 3 (Output Layer)

x_0 = 1 and a^(2)_0 are bias units

Slide by Andrew Ng

Feed-Forward Steps (Vectorization)

a^(2)_1 = g(Θ^(1)_10 x_0 + Θ^(1)_11 x_1 + Θ^(1)_12 x_2 + Θ^(1)_13 x_3) = g(z^(2)_1)

a^(2)_2 = g(Θ^(1)_20 x_0 + Θ^(1)_21 x_1 + Θ^(1)_22 x_2 + Θ^(1)_23 x_3) = g(z^(2)_2)

a^(2)_3 = g(Θ^(1)_30 x_0 + Θ^(1)_31 x_1 + Θ^(1)_32 x_2 + Θ^(1)_33 x_3) = g(z^(2)_3)

h_Θ(x) = g(Θ^(2)_10 a^(2)_0 + Θ^(2)_11 a^(2)_1 + Θ^(2)_12 a^(2)_2 + Θ^(2)_13 a^(2)_3) = g(z^(3)_1)

Vectorized:

z^(2) = Θ^(1) x
a^(2) = g(z^(2)); add a^(2)_0 = 1
z^(3) = Θ^(2) a^(2)
h_Θ(x) = a^(3) = g(z^(3))

Based on slide by Andrew Ng
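As an aside (not part of the original slides), the vectorized feed-forward pass can be sketched in NumPy roughly as follows; the layer sizes and random weights are illustrative assumptions:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, Theta1, Theta2):
    # z(2) = Theta(1) x, a(2) = g(z(2)) with a prepended bias unit,
    # z(3) = Theta(2) a(2), h_Theta(x) = a(3) = g(z(3))
    a1 = np.concatenate(([1.0], x))            # add bias unit x_0 = 1
    z2 = Theta1 @ a1
    a2 = np.concatenate(([1.0], sigmoid(z2)))  # add bias unit a(2)_0 = 1
    z3 = Theta2 @ a2
    return sigmoid(z3)

rng = np.random.default_rng(0)
Theta1 = rng.normal(size=(3, 4))   # 3 inputs (+ bias) -> 3 hidden units
Theta2 = rng.normal(size=(1, 4))   # 3 hidden units (+ bias) -> 1 output
print(forward(np.array([0.2, -0.7, 1.5]), Theta1, Theta2))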

Learning in NN: Backpropagation

Similar to the perceptron learning algorithm, we cycle through our examples:

If the output of the network is correct, no changes are made

If there is an error, weights are adjusted to reduce the error

We are just performing (stochastic) gradient descent!

Based on slide by T. Finin, M. desJardins, L. Getoor, R. Par

Optimizing the Neural Network

J(Θ) = −(1/n) [ ∑_{i=1}^{n} ∑_{k=1}^{K} y_ik log(h_Θ(x_i))_k + (1 − y_ik) log(1 − (h_Θ(x_i))_k) ] + (λ/2n) ∑_{l=1}^{L−1} ∑_{i=1}^{s_{l−1}} ∑_{j=1}^{s_l} (Θ^(l)_ji)^2

Solve via gradient descent, using

∂J(Θ) / ∂Θ^(l)_ij = a^(l)_j δ^(l+1)_i   (ignoring λ; i.e., if λ = 0)

Unlike before, J(Θ) is not convex, so GD on a neural net yields a local optimum

Based on slide by Andrew Ng

Backpropagation: Forward Propagation

Given one labeled training instance (x, y), forward propagation computes:

a^(1) = x
z^(2) = Θ^(1) a^(1)
a^(2) = g(z^(2))   [add a^(2)_0]
z^(3) = Θ^(2) a^(2)
a^(3) = g(z^(3))   [add a^(3)_0]
z^(4) = Θ^(3) a^(3)
a^(4) = h_Θ(x) = g(z^(4))

Based on slide by Andrew Ng

Backpropagation: Gradient Computation

Let δ^(l)_j = “error” of node j in layer l (number of layers L = 4)

δ^(4) = a^(4) − y
δ^(3) = (Θ^(3))^T δ^(4) .* g'(z^(3))
δ^(2) = (Θ^(2))^T δ^(3) .* g'(z^(2))
(No δ^(1))

where g'(z^(3)) = a^(3) .* (1 − a^(3)), g'(z^(2)) = a^(2) .* (1 − a^(2)), and .* denotes the element-wise product

Based on slide by Andrew Ng

Training a Neural Network via Gradient Descent with Backprop

Given: training set {(x_1, y_1), ..., (x_n, y_n)}
Initialize all Θ^(l) randomly (NOT to 0!)
Loop   // each iteration is called an epoch
    Set Δ^(l)_ij = 0 for all l, i, j   (used to accumulate the gradient)
    For each training instance (x_i, y_i):
        Set a^(1) = x_i
        Compute {a^(2), ..., a^(L)} via forward propagation
        Compute δ^(L) = a^(L) − y_i
        Compute errors {δ^(L−1), ..., δ^(2)}
        Compute gradients Δ^(l)_ij = Δ^(l)_ij + a^(l)_j δ^(l+1)_i
    Compute average regularized gradient
        D^(l)_ij = (1/n) Δ^(l)_ij + λ Θ^(l)_ij   if j ≠ 0
        D^(l)_ij = (1/n) Δ^(l)_ij                otherwise
    Update weights via gradient step: Θ^(l)_ij = Θ^(l)_ij − α D^(l)_ij
Until weights converge or max #epochs is reached

Based on slide by Andrew Ng
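A possible NumPy implementation of this training loop for a 3-layer sigmoid network is sketched below; the function name, layer sizes, and hyperparameter defaults are illustrative assumptions rather than anything prescribed by the slides:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_nn(X, Y, n_hidden=3, alpha=0.1, lam=0.0, max_epochs=500, seed=0):
    # X: (n, d) inputs without a bias column; Y: (n, K) 0/1 targets
    n, d = X.shape
    K = Y.shape[1]
    rng = np.random.default_rng(seed)
    Theta1 = rng.normal(scale=0.1, size=(n_hidden, d + 1))  # initialize randomly, NOT to 0
    Theta2 = rng.normal(scale=0.1, size=(K, n_hidden + 1))
    for _ in range(max_epochs):
        Delta1 = np.zeros_like(Theta1)   # gradient accumulators
        Delta2 = np.zeros_like(Theta2)
        for x, y in zip(X, Y):
            # forward propagation
            a1 = np.concatenate(([1.0], x))
            a2 = np.concatenate(([1.0], sigmoid(Theta1 @ a1)))
            a3 = sigmoid(Theta2 @ a2)
            # backpropagation of errors
            delta3 = a3 - y                                             # delta(L) = a(L) - y
            delta2 = (Theta2.T @ delta3)[1:] * a2[1:] * (1.0 - a2[1:])  # drop bias entry
            # accumulate gradients: Delta_ij += a_j * delta_i
            Delta2 += np.outer(delta3, a2)
            Delta1 += np.outer(delta2, a1)
        # average regularized gradient (bias column j = 0 is not regularized)
        D1, D2 = Delta1 / n, Delta2 / n
        D1[:, 1:] += lam * Theta1[:, 1:]
        D2[:, 1:] += lam * Theta2[:, 1:]
        # gradient step
        Theta1 -= alpha * D1
        Theta2 -= alpha * D2
    return Theta1, Theta2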

Summary of the course so far

Supervised learning has been our focus

Setup: given a training dataset {x_n, y_n}_{n=1}^{N}, we learn a function h(x) to predict x’s true value y (i.e., regression or classification)

Linear vs. nonlinear features
  1 Linear: h(x) depends on w^T x
  2 Nonlinear: h(x) depends on w^T φ(x), where φ is either explicit or depends on a kernel function k(x_m, x_n) = φ(x_m)^T φ(x_n)

Loss function
  1 Squared loss: least squares for regression (minimizing residual sum of errors)
  2 Logistic loss: logistic regression
  3 Exponential loss: AdaBoost
  4 Margin-based loss: support vector machines

Principles of estimation
  1 Point estimate: maximum likelihood, regularized likelihood


cont’d

Optimization
  1 Methods: gradient descent, Newton's method
  2 Convex optimization: global optimum vs. local optimum
  3 Lagrange duality: primal and dual formulations

Learning theory
  1 Difference between training error and generalization error
  2 Overfitting, bias-variance tradeoff
  3 Regularization: various regularized models


Supervised versus Unsupervised Learning

Supervised Learning from labeled observations

Labels ‘teach’ algorithm to learn mapping from observations to labels

Classification, Regression

Unsupervised Learning from unlabeled observations

Learning algorithm must find latent structure from features alone

Can be goal in itself (discover hidden patterns, exploratory analysis)

Can be means to an end (preprocessing for supervised task)

Clustering (Today)



Outline

1 Administration

2 Review of last lecture

3 Clustering
  K-means
  Gaussian mixture models


Clustering

Setup: Given D = {x_n}_{n=1}^{N} and K, we want to output:

{µ_k}_{k=1}^{K}: prototypes of clusters

A(x_n) ∈ {1, 2, . . . , K}: the cluster membership

Toy Example: Cluster data into two clusters.

[Figure: panels (a) and (i) show the toy data before and after clustering; both axes range from −2 to 2.]

Applications

Identify communities within social networks

Find topics in news stories

Group similar sequences into gene families


K-means example

[Figure: nine panels, (a) through (i), showing successive steps of K-means on the toy data; both axes range from −2 to 2.]



K-means clustering

Intuition: Data points assigned to cluster k should be near prototype µ_k

Distortion measure (clustering objective function, cost function):

J = ∑_{n=1}^{N} ∑_{k=1}^{K} r_nk ‖x_n − µ_k‖_2^2

where r_nk ∈ {0, 1} is an indicator variable: r_nk = 1 if and only if A(x_n) = k
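For concreteness, the distortion J can be computed in a couple of lines of NumPy (a sketch, with cluster memberships encoded as an integer array rather than the indicator r_nk):

import numpy as np

def distortion(X, mu, assign):
    # J = sum_n || x_n - mu_{A(x_n)} ||^2, with assign[n] = A(x_n) in {0, ..., K-1}
    return np.sum((X - mu[assign]) ** 2)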



Algorithm

Minimize distortion via alternating optimization between {r_nk} and {µ_k}

Step 0: Initialize {µ_k} to some values

Step 1: Fix {µ_k} and minimize over {r_nk} to get this assignment:

r_nk = 1 if k = argmin_j ‖x_n − µ_j‖_2^2, and 0 otherwise

Step 2: Fix {r_nk} and minimize over {µ_k} to get this update:

µ_k = (∑_n r_nk x_n) / (∑_n r_nk)

Step 3: Return to Step 1 unless stopping criterion is met
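A minimal NumPy sketch of this alternating procedure is given below; initializing the prototypes to K randomly chosen data points and stopping when the assignments no longer change are implementation assumptions (the slides only say to initialize to some values and to stop when a criterion is met):

import numpy as np

def kmeans(X, K, max_iters=100, seed=0):
    # X: (N, d) data; returns prototypes mu (K, d) and assignments (N,)
    rng = np.random.default_rng(seed)
    mu = X[rng.choice(len(X), size=K, replace=False)].astype(float)   # Step 0
    assign = None
    for _ in range(max_iters):
        # Step 1: assign each point to its nearest prototype
        dists = np.linalg.norm(X[:, None, :] - mu[None, :, :], axis=2)
        new_assign = dists.argmin(axis=1)
        if assign is not None and np.array_equal(new_assign, assign):
            break                                   # assignments stopped changing
        assign = new_assign
        # Step 2: recompute each prototype as the mean of its assigned points
        for k in range(K):
            if np.any(assign == k):
                mu[k] = X[assign == k].mean(axis=0)
    return mu, assign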



Remarks

Prototype µ_k is the mean of points assigned to cluster k, hence “K-means”

The procedure reduces J in both Step 1 and Step 2 and thus makes improvements on each iteration

No guarantee we find the global solution

Quality of local optimum depends on initial values at Step 0

k-means++ is a principled approximation algorithm


Probabilistic interpretation of clustering?

How can we model p(x) to reflect our intuition that points stay close to their cluster centers?

[Figure (b): unlabeled data points in the unit square; both axes range from 0 to 1.]

Points seem to form 3 clusters

We cannot model p(x) with simple and known distributions

E.g., the data is not a Gaussian because we have 3 distinct concentrated regions



Gaussian mixture models: intuition

[Figure (a): the toy data in the unit square; both axes range from 0 to 1.]

Model each region with a distinct distribution

Can use Gaussians — Gaussian mixture models (GMMs)

We don’t know cluster assignments (labels), parameters of Gaussians, or mixture components!

Must learn from unlabeled data D = {x_n}_{n=1}^{N}



Gaussian mixture models: formal definition

GMM has the following density function for x

p(x) = ∑_{k=1}^{K} ω_k N(x | µ_k, Σ_k)

K: number of Gaussians — they are called mixture components

µ_k and Σ_k: mean and covariance matrix of k-th component

ω_k: mixture weights (or priors) represent how much each component contributes to the final distribution. They satisfy 2 properties:

∀ k, ω_k > 0, and ∑_k ω_k = 1

These properties ensure p(x) is in fact a probability density function
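As a quick illustration (using SciPy for the Gaussian density; all parameter values below are made up), p(x) can be evaluated directly from this definition:

import numpy as np
from scipy.stats import multivariate_normal

def gmm_density(x, omega, mus, Sigmas):
    # p(x) = sum_k omega_k * N(x | mu_k, Sigma_k)
    return sum(w * multivariate_normal.pdf(x, mean=m, cov=S)
               for w, m, S in zip(omega, mus, Sigmas))

omega = [0.3, 0.7]                          # positive weights that sum to 1
mus = [np.zeros(2), np.array([2.0, 2.0])]
Sigmas = [np.eye(2), 0.5 * np.eye(2)]
print(gmm_density(np.array([1.0, 1.0]), omega, mus, Sigmas))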



GMM as the marginal distribution of a joint distribution

Consider the following joint distribution

p(x, z) = p(z)p(x|z)

where z is a discrete random variable taking values between 1 and K.

Denote ω_k = p(z = k)

Now, assume the conditional distributions are Gaussian distributions

p(x|z = k) = N(x|µk,Σk)

Then, the marginal distribution of x is

p(x) = ∑_{k=1}^{K} ω_k N(x | µ_k, Σ_k)

Namely, the Gaussian mixture model
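One way to see this generative view in code (a sketch with illustrative parameters): sample z from its prior, then x from the corresponding Gaussian; marginally, the resulting x’s are draws from the mixture p(x).

import numpy as np

def sample_gmm(n, omega, mus, Sigmas, seed=0):
    rng = np.random.default_rng(seed)
    z = rng.choice(len(omega), size=n, p=omega)   # z ~ p(z), with p(z = k) = omega_k
    x = np.array([rng.multivariate_normal(mus[k], Sigmas[k]) for k in z])  # x | z = k
    return x, z

X, z = sample_gmm(500, [0.3, 0.7],
                  [np.zeros(2), np.array([2.0, 2.0])],
                  [np.eye(2), 0.5 * np.eye(2)])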



GMMs: example

[Figure (a): the data points colored red, blue, and green by their component z; both axes range from 0 to 1.]

The conditional distributions of x given z (z representing color) are

p(x | z = red) = N(x | µ_1, Σ_1)
p(x | z = blue) = N(x | µ_2, Σ_2)
p(x | z = green) = N(x | µ_3, Σ_3)

[Figure (b): the same points without color labels; both axes range from 0 to 1.]

The marginal distribution is thus

p(x) = p(red) N(x | µ_1, Σ_1) + p(blue) N(x | µ_2, Σ_2) + p(green) N(x | µ_3, Σ_3)



Parameter estimation for Gaussian mixture models

The parameters in GMMs are

θ = {ω_k, µ_k, Σ_k}_{k=1}^{K}

Let’s first consider the simple/unrealistic case where we have labels z

Define D′ = {x_n, z_n}_{n=1}^{N}

D′ is the complete data

D is the incomplete data

How can we learn our parameters?

Given D′, the maximum likelihood estimate of θ is given by

θ = argmax_θ log p(D′) = argmax_θ ∑_n log p(x_n, z_n)



Parameter estimation for GMMs: complete data

The complete likelihood is decomposable

∑_n log p(x_n, z_n) = ∑_n log p(z_n) p(x_n | z_n) = ∑_k ∑_{n: z_n = k} log p(z_n) p(x_n | z_n)

where we have grouped the data by its values z_n.

Let γ_nk ∈ {0, 1} be a binary variable that indicates whether z_n = k:

∑_n log p(x_n, z_n) = ∑_k ∑_n γ_nk log p(z = k) p(x_n | z = k)

                    = ∑_k ∑_n γ_nk [log ω_k + log N(x_n | µ_k, Σ_k)]



Parameter estimation for GMMs: complete data

From our previous discussion, we have

∑_n log p(x_n, z_n) = ∑_k ∑_n γ_nk [log ω_k + log N(x_n | µ_k, Σ_k)]

Regrouping, we have

∑_n log p(x_n, z_n) = ∑_k ∑_n γ_nk log ω_k + ∑_k { ∑_n γ_nk log N(x_n | µ_k, Σ_k) }

The term inside the braces depends on the k-th component’s parameters. It is now easy to show that (left as an exercise) the MLE is:

ω_k = (∑_n γ_nk) / (∑_k ∑_n γ_nk),   µ_k = (1 / ∑_n γ_nk) ∑_n γ_nk x_n

Σ_k = (1 / ∑_n γ_nk) ∑_n γ_nk (x_n − µ_k)(x_n − µ_k)^T
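These closed-form estimates translate directly into code; a hedged NumPy sketch, assuming the labels z_n are given as integers in {0, ..., K−1}:

import numpy as np

def mle_complete_data(X, z, K):
    # X: (N, d) data, z: (N,) observed component labels
    N, d = X.shape
    gamma = np.eye(K)[z]                      # gamma[n, k] = 1 iff z_n = k
    Nk = gamma.sum(axis=0)                    # sum_n gamma_nk for each k
    omega = Nk / N                            # note sum_k sum_n gamma_nk = N
    mus = (gamma.T @ X) / Nk[:, None]
    Sigmas = np.empty((K, d, d))
    for k in range(K):
        diff = X - mus[k]
        Sigmas[k] = (gamma[:, k, None] * diff).T @ diff / Nk[k]
    return omega, mus, Sigmas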

What’s the intuition?



Intuition

Since γnk is binary, the previous solution is nothing but

ω_k: fraction of total data points whose z_n is k (note that ∑_k ∑_n γ_nk = N)

µ_k: mean of all data points whose z_n is k

Σ_k: covariance of all data points whose z_n is k

This intuition will help us develop an algorithm for estimating θ when we do not know z_n (incomplete data)



Parameter estimation for GMMs: incomplete data

When zn is not given, we can guess it via the posterior probability

p(z_n = k | x_n) = p(x_n | z_n = k) p(z_n = k) / p(x_n)
                 = p(x_n | z_n = k) p(z_n = k) / ∑_{k'=1}^{K} p(x_n | z_n = k') p(z_n = k')

To compute the posterior probability, we need to know the parameters θ!

Let’s pretend we know the value of the parameters so we can compute the posterior probability.
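Sketching that computation in code (using SciPy’s Gaussian density; the function name is ours, not from the slides):

import numpy as np
from scipy.stats import multivariate_normal

def responsibilities(X, omega, mus, Sigmas):
    # gamma[n, k] = p(z_n = k | x_n), computed from the (assumed known) parameters theta
    N, K = len(X), len(omega)
    gamma = np.empty((N, K))
    for k in range(K):
        gamma[:, k] = omega[k] * multivariate_normal.pdf(X, mean=mus[k], cov=Sigmas[k])
    return gamma / gamma.sum(axis=1, keepdims=True)   # normalize over k (Bayes' rule)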

How is that going to help us?



Estimation with soft γnk

We define γ_nk = p(z_n = k | x_n)

Recall that γ_nk was previously binary

Now it’s a “soft” assignment of x_n to the k-th component

Each x_n is assigned to a component fractionally according to p(z_n = k | x_n)

We now get the same expression for the MLE as before!

ω_k = (∑_n γ_nk) / (∑_k ∑_n γ_nk),   µ_k = (1 / ∑_n γ_nk) ∑_n γ_nk x_n

Σ_k = (1 / ∑_n γ_nk) ∑_n γ_nk (x_n − µ_k)(x_n − µ_k)^T

But remember, we’re ‘cheating’ by using θ to compute γnk!



Iterative procedure

Alternate between estimating γnk and computing parameters

Step 0: initialize θ with some values (random or otherwise)

Step 1: compute γnk using the current θ

Step 2: update θ using the just computed γnk

Step 3: go back to Step 1

This is an example of the EM algorithm — a powerful procedure for model estimation with hidden/latent variables
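A hedged NumPy/SciPy sketch of this loop; the initialization scheme, fixed iteration count, and the small ridge added to each covariance are implementation choices, not part of the slides:

import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(X, K, max_iters=100, seed=0):
    N, d = X.shape
    rng = np.random.default_rng(seed)
    # Step 0: initialize theta (uniform weights, random means from the data, identity covariances)
    omega = np.full(K, 1.0 / K)
    mus = X[rng.choice(N, size=K, replace=False)].astype(float)
    Sigmas = np.stack([np.eye(d) for _ in range(K)])
    for _ in range(max_iters):
        # Step 1: gamma[n, k] = p(z_n = k | x_n) under the current theta
        gamma = np.column_stack([
            omega[k] * multivariate_normal.pdf(X, mean=mus[k], cov=Sigmas[k])
            for k in range(K)])
        gamma /= gamma.sum(axis=1, keepdims=True)
        # Step 2: update theta using the just-computed gamma (same formulas as the
        # complete-data MLE, with soft gamma in place of the binary indicator)
        Nk = gamma.sum(axis=0)
        omega = Nk / N
        mus = (gamma.T @ X) / Nk[:, None]
        for k in range(K):
            diff = X - mus[k]
            Sigmas[k] = (gamma[:, k, None] * diff).T @ diff / Nk[k] + 1e-6 * np.eye(d)
    return omega, mus, Sigmas, gamma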

Connection with K-means?

GMMs provide a probabilistic interpretation for K-means

K-means is a “hard” GMM, and a GMM is a “soft” K-means

Posterior γnk provides a probabilistic assignment for xn to cluster k


