
Clustering

Professor Ameet Talwalkar

CS260 Machine Learning Algorithms, March 8, 2017

Outline

1 Administration

2 Review of last lecture

3 Clustering


Schedule and Final

Upcoming Schedule

Today: Last day of class

Next Monday (3/13): No class – I will hold office hours in my office (BH 4531F)

Next Wednesday (3/15): Final Exam, HW6 due

Final Exam

Cumulative but with more emphasis on new material

8 short questions (recall that the midterm had 6 short and 3 long questions)

Should be much shorter than midterm, but of equal difficulty

Focus on major concepts (e.g., MLE, primal and dual formulations of SVM, gradient descent, etc.)



Neural Networks – Basic Idea

Learning nonlinear basis functions and classifiers

Hidden layers are nonlinear mappings from input features to a new representation

Output layers use the new representation for classification and regression

Learning parameters

Backpropagation = efficient algorithm for (stochastic) gradient descent

Can write down explicit updates via chain rule of calculus


Single Node

Sigmoid (logistic) activation function: g(z) = 1 / (1 + e^{-z})

h_θ(x) = g(θ^T x) = 1 / (1 + e^{-θ^T x})

where x = [x_0, x_1, x_2, x_3]^T, θ = [θ_0, θ_1, θ_2, θ_3]^T, and x_0 = 1 is the “bias unit”.

Based on slide by Andrew Ng
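To make the single node concrete, here is a minimal NumPy sketch (the example vectors below are made up for illustration and are not from the slides):

import numpy as np

def sigmoid(z):
    # logistic activation g(z) = 1 / (1 + exp(-z))
    return 1.0 / (1.0 + np.exp(-z))

def h_theta(theta, x):
    # single-node hypothesis h_theta(x) = g(theta^T x); x[0] = 1 is the bias unit
    return sigmoid(theta @ x)

x = np.array([1.0, 0.5, -1.2, 2.0])      # [x_0, x_1, x_2, x_3], with x_0 = 1
theta = np.array([0.1, -0.4, 0.3, 0.8])  # [theta_0, ..., theta_3]
print(h_theta(theta, x))                 # a value in (0, 1)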

Neural Network

Layer 1 (Input Layer), Layer 2 (Hidden Layer), Layer 3 (Output Layer)

x_0 = 1 and a^(2)_0 are bias units

Slide by Andrew Ng

Feed-Forward Steps (Vectorization)

a^(2)_1 = g(Θ^(1)_10 x_0 + Θ^(1)_11 x_1 + Θ^(1)_12 x_2 + Θ^(1)_13 x_3) = g(z^(2)_1)

a^(2)_2 = g(Θ^(1)_20 x_0 + Θ^(1)_21 x_1 + Θ^(1)_22 x_2 + Θ^(1)_23 x_3) = g(z^(2)_2)

a^(2)_3 = g(Θ^(1)_30 x_0 + Θ^(1)_31 x_1 + Θ^(1)_32 x_2 + Θ^(1)_33 x_3) = g(z^(2)_3)

h_Θ(x) = g(Θ^(2)_10 a^(2)_0 + Θ^(2)_11 a^(2)_1 + Θ^(2)_12 a^(2)_2 + Θ^(2)_13 a^(2)_3) = g(z^(3)_1)

Vectorized:

z^(2) = Θ^(1) x
a^(2) = g(z^(2)); add a^(2)_0 = 1
z^(3) = Θ^(2) a^(2)
h_Θ(x) = a^(3) = g(z^(3))

Based on slide by Andrew Ng
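As an aside (not part of the original slides), the vectorized feed-forward pass can be sketched in NumPy roughly as follows; the layer sizes and random weights are illustrative assumptions:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, Theta1, Theta2):
    # z(2) = Theta(1) x, a(2) = g(z(2)) with a prepended bias unit,
    # z(3) = Theta(2) a(2), h_Theta(x) = a(3) = g(z(3))
    a1 = np.concatenate(([1.0], x))            # add bias unit x_0 = 1
    z2 = Theta1 @ a1
    a2 = np.concatenate(([1.0], sigmoid(z2)))  # add bias unit a(2)_0 = 1
    z3 = Theta2 @ a2
    return sigmoid(z3)

rng = np.random.default_rng(0)
Theta1 = rng.normal(size=(3, 4))   # 3 inputs (+ bias) -> 3 hidden units
Theta2 = rng.normal(size=(1, 4))   # 3 hidden units (+ bias) -> 1 output
print(forward(np.array([0.2, -0.7, 1.5]), Theta1, Theta2))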

Learning in NN: Backpropagation

Similar to the perceptron learning algorithm, we cycle through our examples:

If the output of the network is correct, no changes are made

If there is an error, weights are adjusted to reduce the error

We are just performing (stochastic) gradient descent!

Based on slide by T. Finin, M. desJardins, L. Getoor, R. Par

Optimizing the Neural Network

J(Θ) = −(1/n) [ ∑_{i=1}^{n} ∑_{k=1}^{K} y_ik log(h_Θ(x_i))_k + (1 − y_ik) log(1 − (h_Θ(x_i))_k) ] + (λ/2n) ∑_{l=1}^{L−1} ∑_{i=1}^{s_{l−1}} ∑_{j=1}^{s_l} (Θ^(l)_ji)^2

Solve via gradient descent, using

∂J(Θ) / ∂Θ^(l)_ij = a^(l)_j δ^(l+1)_i   (ignoring λ; i.e., if λ = 0)

Unlike before, J(Θ) is not convex, so GD on a neural net yields a local optimum

Based on slide by Andrew Ng

Backpropagation: Forward Propagation

Given one labeled training instance (x, y), forward propagation computes:

a^(1) = x
z^(2) = Θ^(1) a^(1)
a^(2) = g(z^(2))   [add a^(2)_0]
z^(3) = Θ^(2) a^(2)
a^(3) = g(z^(3))   [add a^(3)_0]
z^(4) = Θ^(3) a^(3)
a^(4) = h_Θ(x) = g(z^(4))

Based on slide by Andrew Ng

Backpropagation: Gradient Computation

Let δ^(l)_j = “error” of node j in layer l (number of layers L = 4)

δ^(4) = a^(4) − y
δ^(3) = (Θ^(3))^T δ^(4) .* g'(z^(3))
δ^(2) = (Θ^(2))^T δ^(3) .* g'(z^(2))
(No δ^(1))

where g'(z^(3)) = a^(3) .* (1 − a^(3)), g'(z^(2)) = a^(2) .* (1 − a^(2)), and .* denotes the element-wise product

Based on slide by Andrew Ng

Training a Neural Network via Gradient Descent with Backprop

Given: training set {(x_1, y_1), ..., (x_n, y_n)}
Initialize all Θ^(l) randomly (NOT to 0!)
Loop   // each iteration is called an epoch
    Set Δ^(l)_ij = 0 for all l, i, j   (used to accumulate the gradient)
    For each training instance (x_i, y_i):
        Set a^(1) = x_i
        Compute {a^(2), ..., a^(L)} via forward propagation
        Compute δ^(L) = a^(L) − y_i
        Compute errors {δ^(L−1), ..., δ^(2)}
        Compute gradients Δ^(l)_ij = Δ^(l)_ij + a^(l)_j δ^(l+1)_i
    Compute average regularized gradient
        D^(l)_ij = (1/n) Δ^(l)_ij + λ Θ^(l)_ij   if j ≠ 0
        D^(l)_ij = (1/n) Δ^(l)_ij                otherwise
    Update weights via gradient step: Θ^(l)_ij = Θ^(l)_ij − α D^(l)_ij
Until weights converge or max #epochs is reached

Based on slide by Andrew Ng
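A possible NumPy implementation of this training loop for a 3-layer sigmoid network is sketched below; the function name, layer sizes, and hyperparameter defaults are illustrative assumptions rather than anything prescribed by the slides:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_nn(X, Y, n_hidden=3, alpha=0.1, lam=0.0, max_epochs=500, seed=0):
    # X: (n, d) inputs without a bias column; Y: (n, K) 0/1 targets
    n, d = X.shape
    K = Y.shape[1]
    rng = np.random.default_rng(seed)
    Theta1 = rng.normal(scale=0.1, size=(n_hidden, d + 1))  # initialize randomly, NOT to 0
    Theta2 = rng.normal(scale=0.1, size=(K, n_hidden + 1))
    for _ in range(max_epochs):
        Delta1 = np.zeros_like(Theta1)   # gradient accumulators
        Delta2 = np.zeros_like(Theta2)
        for x, y in zip(X, Y):
            # forward propagation
            a1 = np.concatenate(([1.0], x))
            a2 = np.concatenate(([1.0], sigmoid(Theta1 @ a1)))
            a3 = sigmoid(Theta2 @ a2)
            # backpropagation of errors
            delta3 = a3 - y                                             # delta(L) = a(L) - y
            delta2 = (Theta2.T @ delta3)[1:] * a2[1:] * (1.0 - a2[1:])  # drop bias entry
            # accumulate gradients: Delta_ij += a_j * delta_i
            Delta2 += np.outer(delta3, a2)
            Delta1 += np.outer(delta2, a1)
        # average regularized gradient (bias column j = 0 is not regularized)
        D1, D2 = Delta1 / n, Delta2 / n
        D1[:, 1:] += lam * Theta1[:, 1:]
        D2[:, 1:] += lam * Theta2[:, 1:]
        # gradient step
        Theta1 -= alpha * D1
        Theta2 -= alpha * D2
    return Theta1, Theta2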

Summary of the course so far

Supervised learning has been our focus

Setup: given a training dataset {x_n, y_n}_{n=1}^{N}, we learn a function h(x) to predict x’s true value y (i.e., regression or classification)

Linear vs. nonlinear features
  1 Linear: h(x) depends on w^T x
  2 Nonlinear: h(x) depends on w^T φ(x), where φ is either explicit or depends on a kernel function k(x_m, x_n) = φ(x_m)^T φ(x_n)

Loss function
  1 Squared loss: least squares for regression (minimizing residual sum of errors)
  2 Logistic loss: logistic regression
  3 Exponential loss: AdaBoost
  4 Margin-based loss: support vector machines

Principles of estimation
  1 Point estimate: maximum likelihood, regularized likelihood


cont’d

Optimization
  1 Methods: gradient descent, Newton's method
  2 Convex optimization: global optimum vs. local optimum
  3 Lagrange duality: primal and dual formulations

Learning theory
  1 Difference between training error and generalization error
  2 Overfitting, bias-variance tradeoff
  3 Regularization: various regularized models


Supervised versus Unsupervised Learning

Supervised Learning from labeled observations

Labels ‘teach’ algorithm to learn mapping from observations to labels

Classification, Regression

Unsupervised Learning from unlabeled observations

Learning algorithm must find latent structure from features alone

Can be goal in itself (discover hidden patterns, exploratory analysis)

Can be means to an end (preprocessing for supervised task)

Clustering (Today)



Outline

1 Administration

2 Review of last lecture

3 Clustering
  K-means
  Gaussian mixture models


Clustering

Setup: Given D = {x_n}_{n=1}^{N} and K, we want to output:

{µ_k}_{k=1}^{K}: prototypes of clusters

A(x_n) ∈ {1, 2, . . . , K}: the cluster membership

Toy Example: Cluster data into two clusters.

[Figure: panels (a) and (i) show the toy data before and after clustering; both axes range from −2 to 2.]

Applications

Identify communities within social networks

Find topics in news stories

Group similar sequences into gene families


K-means example

[Figure: nine panels, (a) through (i), showing successive steps of K-means on the toy data; both axes range from −2 to 2.]



K-means clustering

Intuition: Data points assigned to cluster k should be near prototype µ_k

Distortion measure (clustering objective function, cost function):

J = ∑_{n=1}^{N} ∑_{k=1}^{K} r_nk ‖x_n − µ_k‖_2^2

where r_nk ∈ {0, 1} is an indicator variable: r_nk = 1 if and only if A(x_n) = k
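For concreteness, the distortion J can be computed in a couple of lines of NumPy (a sketch, with cluster memberships encoded as an integer array rather than the indicator r_nk):

import numpy as np

def distortion(X, mu, assign):
    # J = sum_n || x_n - mu_{A(x_n)} ||^2, with assign[n] = A(x_n) in {0, ..., K-1}
    return np.sum((X - mu[assign]) ** 2)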



Algorithm

Minimize distortion via alternating optimization between {r_nk} and {µ_k}

Step 0: Initialize {µ_k} to some values

Step 1: Fix {µ_k} and minimize over {r_nk} to get this assignment:

r_nk = 1 if k = argmin_j ‖x_n − µ_j‖_2^2, and 0 otherwise

Step 2: Fix {r_nk} and minimize over {µ_k} to get this update:

µ_k = (∑_n r_nk x_n) / (∑_n r_nk)

Step 3: Return to Step 1 unless stopping criterion is met
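A minimal NumPy sketch of this alternating procedure is given below; initializing the prototypes to K randomly chosen data points and stopping when the assignments no longer change are implementation assumptions (the slides only say to initialize to some values and to stop when a criterion is met):

import numpy as np

def kmeans(X, K, max_iters=100, seed=0):
    # X: (N, d) data; returns prototypes mu (K, d) and assignments (N,)
    rng = np.random.default_rng(seed)
    mu = X[rng.choice(len(X), size=K, replace=False)].astype(float)   # Step 0
    assign = None
    for _ in range(max_iters):
        # Step 1: assign each point to its nearest prototype
        dists = np.linalg.norm(X[:, None, :] - mu[None, :, :], axis=2)
        new_assign = dists.argmin(axis=1)
        if assign is not None and np.array_equal(new_assign, assign):
            break                                   # assignments stopped changing
        assign = new_assign
        # Step 2: recompute each prototype as the mean of its assigned points
        for k in range(K):
            if np.any(assign == k):
                mu[k] = X[assign == k].mean(axis=0)
    return mu, assign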



Remarks

Prototype µ_k is the mean of points assigned to cluster k, hence “K-means”

The procedure reduces J in both Step 1 and Step 2 and thus makes improvements on each iteration

No guarantee we find the global solution

Quality of local optimum depends on initial values at Step 0

k-means++ is a principled approximation algorithm


Probabilistic interpretation of clustering?

How can we model p(x) to reflect our intuition that points stay close to their cluster centers?

[Figure (b): unlabeled data points in the unit square; both axes range from 0 to 1.]

Points seem to form 3 clusters

We cannot model p(x) with simple and known distributions

E.g., the data is not a Gaussian because we have 3 distinct concentrated regions



Gaussian mixture models: intuition

[Figure (a): the toy data in the unit square; both axes range from 0 to 1.]

Model each region with a distinct distribution

Can use Gaussians — Gaussian mixture models (GMMs)

We don’t know cluster assignments (labels), parameters of Gaussians, or mixture components!

Must learn from unlabeled data D = {x_n}_{n=1}^{N}



Gaussian mixture models: formal definition

GMM has the following density function for x

p(x) = ∑_{k=1}^{K} ω_k N(x | µ_k, Σ_k)

K: number of Gaussians — they are called mixture components

µ_k and Σ_k: mean and covariance matrix of k-th component

ω_k: mixture weights (or priors) represent how much each component contributes to the final distribution. They satisfy 2 properties:

∀ k, ω_k > 0, and ∑_k ω_k = 1

These properties ensure p(x) is in fact a probability density function
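As a quick illustration (using SciPy for the Gaussian density; all parameter values below are made up), p(x) can be evaluated directly from this definition:

import numpy as np
from scipy.stats import multivariate_normal

def gmm_density(x, omega, mus, Sigmas):
    # p(x) = sum_k omega_k * N(x | mu_k, Sigma_k)
    return sum(w * multivariate_normal.pdf(x, mean=m, cov=S)
               for w, m, S in zip(omega, mus, Sigmas))

omega = [0.3, 0.7]                          # positive weights that sum to 1
mus = [np.zeros(2), np.array([2.0, 2.0])]
Sigmas = [np.eye(2), 0.5 * np.eye(2)]
print(gmm_density(np.array([1.0, 1.0]), omega, mus, Sigmas))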



GMM as the marginal distribution of a joint distribution

Consider the following joint distribution

p(x, z) = p(z)p(x|z)

where z is a discrete random variable taking values between 1 and K.

Denote ω_k = p(z = k)

Now, assume the conditional distributions are Gaussian distributions

p(x|z = k) = N(x|µk,Σk)

Then, the marginal distribution of x is

p(x) = ∑_{k=1}^{K} ω_k N(x | µ_k, Σ_k)

Namely, the Gaussian mixture model
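One way to see this generative view in code (a sketch with illustrative parameters): sample z from its prior, then x from the corresponding Gaussian; marginally, the resulting x’s are draws from the mixture p(x).

import numpy as np

def sample_gmm(n, omega, mus, Sigmas, seed=0):
    rng = np.random.default_rng(seed)
    z = rng.choice(len(omega), size=n, p=omega)   # z ~ p(z), with p(z = k) = omega_k
    x = np.array([rng.multivariate_normal(mus[k], Sigmas[k]) for k in z])  # x | z = k
    return x, z

X, z = sample_gmm(500, [0.3, 0.7],
                  [np.zeros(2), np.array([2.0, 2.0])],
                  [np.eye(2), 0.5 * np.eye(2)])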



GMMs: example

[Figure (a): the data points colored red, blue, and green by their component z; both axes range from 0 to 1.]

The conditional distributions of x given z (z representing color) are

p(x | z = red) = N(x | µ_1, Σ_1)
p(x | z = blue) = N(x | µ_2, Σ_2)
p(x | z = green) = N(x | µ_3, Σ_3)

[Figure (b): the same points without color labels; both axes range from 0 to 1.]

The marginal distribution is thus

p(x) = p(red) N(x | µ_1, Σ_1) + p(blue) N(x | µ_2, Σ_2) + p(green) N(x | µ_3, Σ_3)



Parameter estimation for Gaussian mixture models

The parameters in GMMs are

θ = {ω_k, µ_k, Σ_k}_{k=1}^{K}

Let’s first consider the simple/unrealistic case where we have labels z

Define D′ = {x_n, z_n}_{n=1}^{N}

D′ is the complete data

D is the incomplete data

How can we learn our parameters?

Given D′, the maximum likelihood estimate of θ is given by

θ = argmax_θ log p(D′) = argmax_θ ∑_n log p(x_n, z_n)



Parameter estimation for GMMs: complete data

The complete likelihood is decomposable

∑_n log p(x_n, z_n) = ∑_n log p(z_n) p(x_n | z_n) = ∑_k ∑_{n: z_n = k} log p(z_n) p(x_n | z_n)

where we have grouped the data by its values z_n.

Let γ_nk ∈ {0, 1} be a binary variable that indicates whether z_n = k:

∑_n log p(x_n, z_n) = ∑_k ∑_n γ_nk log p(z = k) p(x_n | z = k)

                    = ∑_k ∑_n γ_nk [log ω_k + log N(x_n | µ_k, Σ_k)]



Parameter estimation for GMMs: complete data

From our previous discussion, we have

∑_n log p(x_n, z_n) = ∑_k ∑_n γ_nk [log ω_k + log N(x_n | µ_k, Σ_k)]

Regrouping, we have

∑_n log p(x_n, z_n) = ∑_k ∑_n γ_nk log ω_k + ∑_k { ∑_n γ_nk log N(x_n | µ_k, Σ_k) }

The term inside the braces depends on the k-th component’s parameters. It is now easy to show that (left as an exercise) the MLE is:

ω_k = (∑_n γ_nk) / (∑_k ∑_n γ_nk),   µ_k = (1 / ∑_n γ_nk) ∑_n γ_nk x_n

Σ_k = (1 / ∑_n γ_nk) ∑_n γ_nk (x_n − µ_k)(x_n − µ_k)^T
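These closed-form estimates translate directly into code; a hedged NumPy sketch, assuming the labels z_n are given as integers in {0, ..., K−1}:

import numpy as np

def mle_complete_data(X, z, K):
    # X: (N, d) data, z: (N,) observed component labels
    N, d = X.shape
    gamma = np.eye(K)[z]                      # gamma[n, k] = 1 iff z_n = k
    Nk = gamma.sum(axis=0)                    # sum_n gamma_nk for each k
    omega = Nk / N                            # note sum_k sum_n gamma_nk = N
    mus = (gamma.T @ X) / Nk[:, None]
    Sigmas = np.empty((K, d, d))
    for k in range(K):
        diff = X - mus[k]
        Sigmas[k] = (gamma[:, k, None] * diff).T @ diff / Nk[k]
    return omega, mus, Sigmas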

What’s the intuition?



Intuition

Since γnk is binary, the previous solution is nothing but

ω_k: fraction of total data points whose z_n is k (note that ∑_k ∑_n γ_nk = N)

µ_k: mean of all data points whose z_n is k

Σ_k: covariance of all data points whose z_n is k

This intuition will help us develop an algorithm for estimating θ when we do not know z_n (incomplete data)



Parameter estimation for GMMs: incomplete data

When zn is not given, we can guess it via the posterior probability

p(z_n = k | x_n) = p(x_n | z_n = k) p(z_n = k) / p(x_n)
                 = p(x_n | z_n = k) p(z_n = k) / ∑_{k'=1}^{K} p(x_n | z_n = k') p(z_n = k')

To compute the posterior probability, we need to know the parameters θ!

Let’s pretend we know the value of the parameters so we can compute the posterior probability.
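Sketching that computation in code (using SciPy’s Gaussian density; the function name is ours, not from the slides):

import numpy as np
from scipy.stats import multivariate_normal

def responsibilities(X, omega, mus, Sigmas):
    # gamma[n, k] = p(z_n = k | x_n), computed from the (assumed known) parameters theta
    N, K = len(X), len(omega)
    gamma = np.empty((N, K))
    for k in range(K):
        gamma[:, k] = omega[k] * multivariate_normal.pdf(X, mean=mus[k], cov=Sigmas[k])
    return gamma / gamma.sum(axis=1, keepdims=True)   # normalize over k (Bayes' rule)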

How is that going to help us?



Estimation with soft γnk

We define γ_nk = p(z_n = k | x_n)

Recall that γ_nk was previously binary

Now it’s a “soft” assignment of x_n to the k-th component

Each x_n is assigned to a component fractionally according to p(z_n = k | x_n)

We now get the same expression for the MLE as before!

ω_k = (∑_n γ_nk) / (∑_k ∑_n γ_nk),   µ_k = (1 / ∑_n γ_nk) ∑_n γ_nk x_n

Σ_k = (1 / ∑_n γ_nk) ∑_n γ_nk (x_n − µ_k)(x_n − µ_k)^T

But remember, we’re ‘cheating’ by using θ to compute γnk!



Iterative procedure

Alternate between estimating γnk and computing parameters

Step 0: initialize θ with some values (random or otherwise)

Step 1: compute γnk using the current θ

Step 2: update θ using the just computed γnk

Step 3: go back to Step 1

This is an example of the EM algorithm — a powerful procedure for model estimation with hidden/latent variables
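A hedged NumPy/SciPy sketch of this loop; the initialization scheme, fixed iteration count, and the small ridge added to each covariance are implementation choices, not part of the slides:

import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(X, K, max_iters=100, seed=0):
    N, d = X.shape
    rng = np.random.default_rng(seed)
    # Step 0: initialize theta (uniform weights, random means from the data, identity covariances)
    omega = np.full(K, 1.0 / K)
    mus = X[rng.choice(N, size=K, replace=False)].astype(float)
    Sigmas = np.stack([np.eye(d) for _ in range(K)])
    for _ in range(max_iters):
        # Step 1: gamma[n, k] = p(z_n = k | x_n) under the current theta
        gamma = np.column_stack([
            omega[k] * multivariate_normal.pdf(X, mean=mus[k], cov=Sigmas[k])
            for k in range(K)])
        gamma /= gamma.sum(axis=1, keepdims=True)
        # Step 2: update theta using the just-computed gamma (same formulas as the
        # complete-data MLE, with soft gamma in place of the binary indicator)
        Nk = gamma.sum(axis=0)
        omega = Nk / N
        mus = (gamma.T @ X) / Nk[:, None]
        for k in range(K):
            diff = X - mus[k]
            Sigmas[k] = (gamma[:, k, None] * diff).T @ diff / Nk[k] + 1e-6 * np.eye(d)
    return omega, mus, Sigmas, gamma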

Connection with K-means?

GMMs provide a probabilistic interpretation for K-means

K-means is a “hard” GMM, and a GMM is a “soft” K-means

Posterior γnk provides a probabilistic assignment for xn to cluster k


