CS480 Introduction to Machine Learning: Semi-Supervised Learning

Edith Law



Basic Idea

• Labeled data can be expensive / difficult to collect.
  – Human annotation is slow and boring!
  – Labels can require experts, or special devices, to acquire.
  – We would prefer to get better performance for free: unlabeled data!

• Goal of semi-supervised learning is to exploit both labeled and unlabeled examples.


Example of Hard-to-Get Labels

• Task: speech analysis
  – Switchboard dataset
  – telephone conversation transcription
  – 400 hours of annotation time for each hour of speech


Example of Hard-to-Get Labels

• Task: natural language parsing
  – Penn Chinese Treebank
  – 2 years for 4000 sentences

“The National Track and Field Championship has finished.”


Example of Not-So-Hard-to-Get Labels

For some tasks, it may not be too difficult to label 1000+ instances.

Task: image categorization of “eclipse”


Example of Not-So-Hard-to-Get Labels

There are ways, like the ESP game, to encourage “human computation” for more labels.


Example of Not-So-Hard-to-Get Labels

For some tasks, it may not be too difficult to label 1000+ instances.

Nonetheless, we may be able to learn better/faster if we make use of both labeled and unlabeled examples.


Can Unlabeled Data Help?

If we assume that each class is a coherent group (e.g., Gaussian), then unlabeled data can help identify the boundary more accurately.

Semi-supervised learning is possible because we can make assumptions about the relationship between the distribution of unlabeled data P(x) and the target label.


Notation

Given:

• Labeled data: (Xl, Yl) = {x1:l, y1:l} available during training

• Unlabeled data: XU = {xl+1:n} available during training

• Test data: Xtest = {xn+1:N}, NOT available during training

Usually l ≪ n, i.e., there is much more unlabeled data than labeled data.


Notation

supervised learning (classification/regression): {(x1:n, y1:n)}

(inductive) semi-supervised learning (classification/regression): {(x1:l, y1:l), xl+1:n, xtest}

(transductive) semi-supervised learning (classification/regression): {(x1:l, y1:l), xl+1:n}

semi-supervised clustering: {x1:n, must-links, cannot-links}

unsupervised learning (e.g., clustering): {x1:n}
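To make the notation concrete, here is a minimal sketch of how these splits might be set up in code (the sizes and the random data are made up purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes: l labeled points, u unlabeled points, and a held-out test set.
l, u, n_test, d = 20, 480, 100, 2
n = l + u

X = rng.normal(size=(n, d))            # x_{1:n}, available during training
y_l = rng.integers(0, 2, size=l)       # y_{1:l}, labels only for the first l points
X_l, X_u = X[:l], X[l:]                # (Xl, Yl) and Xu
X_test = rng.normal(size=(n_test, d))  # x_{n+1:N}, NOT available during training
```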


Overview

• Self-Training • Co-Training • Gaussian Mixture Model (GMM) • Other: Graph-based methods, Semi-Supervised SVM



Self-Training Algorithm

1. Pick your favourite classification method.
2. Train a classifier f using the labeled data (Xl, Yl).
3. Use f to classify all unlabeled items x in Xu.
4. Pick the x* with the highest confidence.
5. Add (x*, f(x*)) to the labeled data.
6. Repeat.


Self-Training Algorithm (Pseudocode)
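A minimal sketch of this loop in Python, assuming a scikit-learn-style base classifier with a predict_proba method (the choice of Naive Bayes, the iteration count, and the one-instance-per-iteration schedule are illustrative, not from the slides):

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

def self_train(X_l, y_l, X_u, base=GaussianNB, n_iter=50, per_iter=1):
    """Repeatedly train on the labeled pool and absorb the most confident unlabeled points."""
    X_l, y_l, X_u = X_l.copy(), y_l.copy(), X_u.copy()
    for _ in range(n_iter):
        if len(X_u) == 0:
            break
        clf = base().fit(X_l, y_l)            # 2. train f on (Xl, Yl)
        proba = clf.predict_proba(X_u)        # 3. classify all unlabeled items
        conf = proba.max(axis=1)
        idx = np.argsort(conf)[-per_iter:]    # 4. pick x* with the highest confidence
        X_l = np.vstack([X_l, X_u[idx]])      # 5. add (x*, f(x*)) to the labeled data
        y_l = np.concatenate([y_l, clf.predict(X_u[idx])])
        X_u = np.delete(X_u, idx, axis=0)     # 6. repeat
    return base().fit(X_l, y_l)
```

Adding one instance per iteration matches step 4 above; the variations on the next slide correspond to changing how many pseudo-labeled points are absorbed, and whether they are weighted by confidence.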


Self-Training Algorithm

• Assumption: one’s own high-confidence predictions are correct.
  – This is likely to be the case when the classes form well-separated clusters.

• Variations:
  – Add the few most confident (x, f(x)) to the labeled data.
  – Add all (x, f(x)) to the labeled data.
  – Add all (x, f(x)) to the labeled data, weighing each by confidence.


Self-Training Algorithm: A Concrete Example


Self Training with 1NN: Well Separated Case


Self Training with 1NN: Outlier Case


Self-Training Example: Image Categorization

• Train a Naive Bayes classifier on two initial images


Self-Training Example: Image Categorization

Each image is divided into small patches (a 10 × 10 grid, with random patch sizes of 10–20).

Represent each patch:
  – normalize the patches
  – define a dictionary of 200 “visual words”
  – represent each patch by the index of its closest visual word


Self-Training Example: Image Categorization

• Train a Naive Bayes classifier on two initial images

• Classify unlabeled data, sort by confidence log Pr(y=astronomy | x)


Self-Training Example: Image Categorization

• Add the most confident images and their predicted labels to the labeled data

• Retrain classifier and repeat


Self-Training Algorithm: Advantages

• The simplest semi-supervised learning method.

• A wrapper method, i.e., the technique can be applied to any existing learning algorithm.

• Often used in real tasks like natural language processing.


Self-Training Algorithm: Disadvantages

• Early mistakes could reinforce themselves.
  – Heuristic solutions exist, e.g., “un-label” an instance if its confidence falls below a threshold.

• Cannot say too much in terms of convergence, but …
  – There are special cases in which self-training is equivalent to the Expectation-Maximization (EM) algorithm.
  – There are also special cases (e.g., linear functions) in which the closed-form solution is known.



Leveraging Multiple Views

Suppose we are classifying a collection of university webpages into two categories: Student vs Faculty.

The webpage can be classified using:
• the text on the webpage itself (view 1), e.g., words such as “professor”, “grant”, “graduated”
• the words in the hyperlinks that point to that webpage (view 2), e.g., “my advisor”, “my Ph.D. advisor”, “my mentor”


Leveraging Multiple Views

In named entity classification, we classify a named entity (i.e., a proper name such as “Washington State” or “Mr. Washington”) with a class label (e.g., Person or Location).

Each instance is represented by two views: • the set of words that make up the named entity itself • the words in the context of the named entity


Co-Training

But there are so many ways to express Person or Location!

So, how can we learn using (almost) no labeled data?


Co-Training

Inst 1: the context “headquartered in” seems to indicate y = Location.
Inst 3: “Kazakhstan” must be a Location, since it appears with the context “headquartered in”.
Inst 4: “flew to” must be about Location, because it appears in the context of “Kazakhstan”.

“Location”

Even though we have never seen the word “China”!


Co-Training

Inst 2: “Mr.” seems to indicate y = Person.
Inst 5: “partner at” must be about a Person, because it appears in the context of “Mr.”.

“Person”

Even though we have never seen the words “Robert Jordan”!


Co-Training

• Appropriate in settings where the features are redundant.
• The idea is to train two classifiers on disjoint feature sets, and have the two classifiers take turns teaching each other.
• Co-training differs from self-training in that we use two classifiers in turn, operating on different views of the instances.


Co-Training Algorithm

1. Train two classifiers: f(1) from (Xl(1), Yl) and f(2) from (Xl(2), Yl).
2. Classify Xu with f(1) and f(2) separately.
3. Add f(1)’s k most confident (x, f(1)(x)) to f(2)’s labeled data.
4. Add f(2)’s k most confident (x, f(2)(x)) to f(1)’s labeled data.
5. Repeat.


Co-Training (Pseudocode) (Blum and Mitchell, 1998)
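A rough Python sketch of the loop on the previous slide, assuming two scikit-learn-style classifiers, a pre-defined feature split into views X1/X2, and illustrative values for k and the number of rounds (none of these choices come from Blum and Mitchell's paper):

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

def co_train(X1_l, X2_l, y_l, X1_u, X2_u, k=1, n_iter=30, base=GaussianNB):
    """Two views teach each other: each classifier labels points for the other's pool."""
    y1, y2 = y_l.copy(), y_l.copy()
    X1a, X2a = X1_l.copy(), X2_l.copy()        # each view keeps its own labeled pool
    for _ in range(n_iter):
        if len(X1_u) == 0:
            break
        f1, f2 = base().fit(X1a, y1), base().fit(X2a, y2)   # 1. train f(1), f(2)
        # 2. classify the unlabeled pool with each view separately
        c1 = f1.predict_proba(X1_u).max(axis=1)
        c2 = f2.predict_proba(X2_u).max(axis=1)
        i1 = np.argsort(c1)[-k:]               # f(1)'s k most confident points
        i2 = np.argsort(c2)[-k:]               # f(2)'s k most confident points
        # 3./4. each classifier's confident labels go to the *other* view's pool
        X2a = np.vstack([X2a, X2_u[i1]]); y2 = np.r_[y2, f1.predict(X1_u[i1])]
        X1a = np.vstack([X1a, X1_u[i2]]); y1 = np.r_[y1, f2.predict(X2_u[i2])]
        drop = np.union1d(i1, i2)
        X1_u, X2_u = np.delete(X1_u, drop, 0), np.delete(X2_u, drop, 0)
    return base().fit(X1a, y1), base().fit(X2a, y2)
```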


Co-Training (Blum and Mitchell, 1998)

• begin with 12 labeled web pages (academic courses)
• provide 1000 additional unlabeled web pages
• average error learning from the labeled data alone: 11.1%
• average error with co-training: 5%


Co-Training

Pros:
• wrapper method: it does not matter what algorithms are used for the two classifiers f(1) and f(2)
• only requires that the classifiers can assign a confidence score to their predictions, which is used to select which unlabeled instances to turn into additional training data for the other view
• less sensitive to mistakes than self-training

Cons:
• a natural feature split may not exist
• a classifier that uses both feature sets may perform better


Co-Training: Assumptions

• existence of two views: x = [x(1), x(2)]
• assumption #1: each view alone is sufficient to make good classifications, given enough labeled data
• assumption #2: the two views are conditionally independent given the class label (written out below)
  – knowing one view doesn’t affect what we will observe for the other view
  – e.g., the context “headquartered in” does not favour any particular location
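In symbols, a standard way of stating assumption #2 (not verbatim from the slides) is:

p(x(1), x(2) | y) = p(x(1) | y) p(x(2) | y)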


Co-Training

Many real-world problems of this type

• Semantic lexicon generation [Riloff, Jones 99], [Collins, Singer 99]

• Web page classification [Blum, Mitchell 98]

• Word sense disambiguation [Yarowsky 95]

• Speech recognition [de Sa, Ballard 98]

• Visual classification of cars [Levin, Viola, Freund 03]



Modeling Data using Distributions


In mixture models, we assume that we have a set of hidden mixture components {1,…,K} which generated the data.


Generative Story of Gaussian Mixture Model

Idea: fit the data with a combination of Gaussian distributions.

Specifically, we assume that the data we observed have been generated by repeating the following two steps:

• pick a mixture component: y ∼ p(y) = Multinomial(π), where π is the vector of mixing proportions (probabilities that sum to 1)

• generate x based on the selected mixture component: x ∼ p(x | y = k) = 𝒩(x; μk, σk)

This gives the joint distribution: p(x, y) = p(y) p(x | y)
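To make the generative story concrete, a small sketch of sampling from such a mixture (the number of components and all parameter values are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

pi = np.array([0.3, 0.7])     # mixing proportions, sum to 1
mu = np.array([-2.0, 3.0])    # component means
sigma = np.array([0.5, 1.0])  # component standard deviations

def sample(n):
    y = rng.choice(len(pi), size=n, p=pi)   # pick a mixture component: y ~ Multinomial(pi)
    x = rng.normal(mu[y], sigma[y])         # generate x from the selected component
    return x, y

x, y = sample(1000)
```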


Gaussian Mixture Model

In general, we can compute the probability density function p(x) over x by summing out y:

p(x) = ∑y p(y) p(x | y) = ∑k=1…K p(y = k) p(x | y = k) = ∑k=1…K πk 𝒩(x; μk, σk²)

where πk = p(y = k) is the probability of the kth mixture component, and p(x | y = k) = 𝒩(x; μk, σk²) is the probability of x under the kth mixture component.

Goal: estimate the parameters πk, μk, σk².


Gaussian Mixture Model

• Think of each mixture component as a class.

• The unlabeled data tells us how instances from all classes, mixed together, are distributed.

• If we know how the instances from each class are distributed, then we can decompose the mixture into individual classes.

Why do we care to learn the parameters πk, μk, σk²?


GMM: Supervised vs Semi-Supervised

• Given labeled data, assume each class has a Gaussian distribution …


Model parameters: θ = {π1, μ1, Σ1, π2, μ2, Σ2}

The GMM: p(x, y | θ) = p(y | θ) p(x | y, θ) = πy 𝒩(x; μy, Σy)

Classification: p(y | x, θ) = p(x, y | θ) / ∑y′ p(x, y′ | θ)



• The most likely model and its decision boundary is …

• Add unlabeled data …

• With unlabeled data, the most likely model and decision boundary changed.


Decision boundaries are different because they maximize different quantities.


Finding the Parameters

If we have labeled data, then it’s just a matter of finding the maximum likelihood estimate of the parameters.

θ̂ = argmaxθ log p(D | θ) = argmaxθ log ∏i=1…l p(xi, yi | θ)

For a K-class GMM, we solve a constrained optimization problem:

argmaxθ log ∏i=1…l p(xi, yi | θ)   subject to   ∑k=1…K p(y = k | θ) = 1

by setting up the Lagrangian, taking the partial derivatives of the Lagrangian with respect to the parameters, setting them to 0, and solving.


Finding the Parameters: Labeled vs Unlabeled Data

With labeled data only:

log p(D | θ) = log ∏i=1…l p(xi, yi | θ) = log ∏i=1…l p(yi | θ) p(xi | yi, θ)

With labeled and unlabeled data:

log p(D | θ) = log [ ∏i=1…l p(xi, yi | θ) · ∏i=l+1…l+u p(xi | θ) ]
             = log [ ∏i=1…l p(yi | θ) p(xi | yi, θ) · ∏i=l+1…l+u ∑k=1…K p(yi = k | θ) p(xi | yi = k, θ) ]

With unlabeled data, we are left with a marginal probability in the data likelihood, i.e., the probability of generating xi from any of the classes (because we do not know which class those instances belong to).

Challenge: find the MLE of the parameters when there are unobservable (hidden) variables.


Expectation Maximization

Iterative method for learning the maximum likelihood estimate of a probabilistic model, when the model contains unobservable variables.

Main idea:

• If we had sufficient statistics for the data (e.g. counts of possible values), we could easily maximize the likelihood.

• With missing data, we “fantasize” how the data should look based on the current parameter setting, i.e., compute expected sufficient statistics.

• Then we maximize the parameter setting based on these statistics.


Expectation Maximization

• Start with some initial parameter setting.

• Repeat (as long as desired):
  – Expectation (E) step: complete the data by assigning “values” to the missing items.
  – Maximization (M) step: compute the maximum likelihood parameter setting based on the completed data.


A Simple Algorithm: K-means clustering

• Objective: cluster n instances into K distinct classes.

• Preliminaries:
  – Step 1: Pick the desired number of clusters, K.
  – Step 2: Assume a parametric distribution for each class (e.g., Normal).
  – Step 3: Randomly estimate the parameters of the K distributions.

• Iterate, until convergence:
  – Step 4: Assign instances to the most likely classes based on the current parametric distributions. (expectation step)
  – Step 5: Estimate the parametric distribution of each class based on the latest assignment. (maximization step)
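A compact sketch of these steps for the usual Euclidean, hard-assignment case (the random initialization and the stopping test are illustrative choices):

```python
import numpy as np

def kmeans(X, K, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), K, replace=False)]   # Step 3: random initial parameters
    for _ in range(n_iter):
        # Step 4 ("expectation"): assign each instance to its closest center
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        z = dists.argmin(axis=1)
        # Step 5 ("maximization"): re-estimate each center from its assigned instances
        new_centers = np.array([X[z == k].mean(axis=0) if np.any(z == k) else centers[k]
                                for k in range(K)])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, z
```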


EM for 2-Class GMM

(with unlabeled data)

• Start with some initial parameter setting: μ1, μ2, σ1², σ2², π1, π2

• Repeat (as long as desired):

  – Expectation (E) step: for every instance i, compute the responsibilities (a “soft” assignment of classes), e.g., class k is responsible for xi with probability γi,k:

    γi,k ← P(yi = k | xi) = πk 𝒩(xi; μk, σk) / ∑k′=1…2 πk′ 𝒩(xi; μk′, σk′)

  – Maximization (M) step: compute the weighted mixing probabilities, means, and variances for each component k:

    πk ← (1/N) ∑i=1…N γi,k

    μk ← (∑i=1…N γi,k xi) / (∑i=1…N γi,k)

    σk² ← (∑i=1…N γi,k (xi − μk)²) / (∑i=1…N γi,k)
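A minimal sketch of these updates for a one-dimensional, two-component GMM fit to unlabeled data (the initialization and iteration count are illustrative; a careful implementation would work in log space for numerical stability):

```python
import numpy as np
from scipy.stats import norm

def em_gmm(x, n_iter=100):
    # illustrative initial parameter setting
    mu = np.array([x.min(), x.max()])
    sigma = np.array([x.std(), x.std()])
    pi = np.array([0.5, 0.5])
    for _ in range(n_iter):
        # E step: responsibilities gamma[i, k] = P(y_i = k | x_i)
        dens = pi * norm.pdf(x[:, None], mu, sigma)       # shape (N, 2)
        gamma = dens / dens.sum(axis=1, keepdims=True)
        # M step: weighted mixing proportions, means, and standard deviations
        Nk = gamma.sum(axis=0)
        pi = Nk / len(x)
        mu = (gamma * x[:, None]).sum(axis=0) / Nk
        sigma = np.sqrt((gamma * (x[:, None] - mu) ** 2).sum(axis=0) / Nk)
    return pi, mu, sigma
```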


(with labeled and unlabeled data: the same updates apply, except that for labeled instances the responsibility is clamped to the observed class, i.e., γi,k = 1 if yi = k and 0 otherwise)


EM Algorithm: In General

• Setup:
  – Observed data: D = (Xl, Yl, Xu)
  – Hidden data: H = Yu
  – p(D | 𝛳) = ∑H p(D, H | 𝛳)

• Goal: find 𝛳 to maximize p(D | 𝛳)

• Algorithm:
  – Start with some arbitrary 𝛳0.
  – E-step: estimate p(H | D, 𝛳t)
  – M-step: 𝛳t+1 = argmax𝛳 ∑H p(H | D, 𝛳t) log p(D, H | 𝛳)

• Comments: EM iteratively improves p(D | 𝛳), and converges to a local optimum of the likelihood.


EM Algorithm: Example (Ipeirotis et al., 2010)


Expectation Maximization: Properties

• Likelihood function is guaranteed to improve (or stay the same) with each iteration.

• Iterations can stop when no more improvements are achieved.

• Convergence to a local optimum of the likelihood function.

• Re-starts with different initial parameters are often necessary.

• Time complexity (per iteration) depends on model structure.

• EM is very useful in practice!


Assumptions of GMM

The assumption is that the data come from the mixture model, and that the number of components, the prior p(y), and the class-conditional distribution p(x|y) are all correct.


Alternative Method: Cluster then Label

Instead of running EM with the probabilistic generative model using the labeled data:

1. Run the clustering algorithm assuming all data is unlabeled.

2. Train a supervised learner using the labeled instances that fall into each cluster.

3. Use the predictor to label unlabeled instances in that cluster.

Pros:
• another simple wrapper method
• does not have to assume a probabilistic mixture model

Cons:
• can be difficult to analyze; labels within a cluster may disagree


Cluster then Label (Pseudocode)
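A rough sketch of the recipe above, using K-means for step 1 and a majority vote over the labeled points in each cluster as a deliberately simple per-cluster predictor (the algorithm choices are illustrative, and integer class labels are assumed):

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_then_label(X_l, y_l, X_u, n_clusters=2, seed=0):
    X = np.vstack([X_l, X_u])
    # 1. cluster all data, ignoring the labels
    clusters = KMeans(n_clusters=n_clusters, random_state=seed, n_init=10).fit_predict(X)
    c_l, c_u = clusters[:len(X_l)], clusters[len(X_l):]
    y_u = np.empty(len(X_u), dtype=y_l.dtype)
    for c in range(n_clusters):
        labels_in_c = y_l[c_l == c]
        # 2./3. "train" a predictor from the labeled points in this cluster (here: majority
        #        vote, falling back to the global majority if the cluster has no labels),
        #        and use it to label the cluster's unlabeled points
        vote = np.bincount(labels_in_c).argmax() if len(labels_in_c) else np.bincount(y_l).argmax()
        y_u[c_u == c] = vote
    return y_u
```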


Cluster then Label: Single vs Complete Linkage



Graph-Based Methods


• assumption: similar unlabeled data have similar labels.

(Jerry Zhu’s Ph.D. thesis)

(Figure: two images of ‘2’ with a large Euclidean distance between them, connected by a path in a Euclidean-distance kNN graph.)


Graph-Based Methods (Jerry Zhu’s Ph.D. thesis)

• idea: construct a similarity graph to model local neighbourhood relations between data points
  – nodes: labeled and unlabeled examples
  – edges: similarity weights computed from the features

• different types of graphs:
  – k-nearest-neighbour (kNN) graph
  – fully connected graph, with edge weights decaying with distance
  – εNN graph: two nodes are connected if their distance is ≤ ε


Graph-Based Methods: Assumptions

key assumption: labels are “smooth” with respect to the graph, such that they vary slowly on the graph. That is, if two instances are connected by a strong edge, their labels tend to be the same.
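scikit-learn ships graph-based semi-supervised learners built on this smoothness assumption; a small usage sketch (the data, kernel, and seed labels here are illustrative; by the library's convention, unlabeled points are marked with -1):

```python
import numpy as np
from sklearn.semi_supervised import LabelPropagation

X = np.random.default_rng(0).normal(size=(200, 2))
y = np.full(200, -1)             # -1 marks unlabeled points
y[:5] = [0, 0, 1, 1, 1]          # a handful of labeled seeds (illustrative)

model = LabelPropagation(kernel="knn", n_neighbors=7).fit(X, y)
y_pred = model.transduction_     # labels propagated over the kNN graph
```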


Semi-Supervised SVM

• Standard SVM: maximize the margin over the labeled data.

• Semi-supervised SVM (S3VM): also maximize the margin over the unlabeled data. Conceptually:
  – enumerate all 2^u possible labelings of Xu
  – build one standard SVM for each labeling
  – pick the SVM with the largest margin


Semi-Supervised SVM

(Figure: the SVM objective uses the hinge loss on labeled points; the S3VM objective uses the hinge loss on labeled points plus the hat loss on unlabeled points.)


Semi-Supervised SVM

Hinge loss (labeled instances): prefers y f(x) ≥ 1, i.e., instances on the correct side of, and away from, the decision boundary.

Hat loss (unlabeled instances): prefers f(x) ≤ −1 or f(x) ≥ 1, i.e., unlabeled instances away from the decision boundary f(x) = 0.


Semi-Supervised SVM

The third term prefers unlabeled points outside the margin.

In other words, the decision boundary f = 0 wants to be placed where there are few unlabeled points nearby.
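Putting the pieces together, the S3VM objective has roughly the following form (λ1 and λ2 are regularization weights; the third term is the hat loss on the unlabeled points):

min over f of   ∑i=1…l max(1 − yi f(xi), 0) + λ1 ‖w‖² + λ2 ∑j=l+1…n max(1 − |f(xj)|, 0)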


Semi-Supervised SVM

Assumption: the classes are well separated, such that the decision boundary falls into a low-density region of the feature space and does not cut through dense unlabeled data.

Caveat: avoiding dense unlabeled regions may not be the right assumption!



Can Unlabeled Data Help?

Caveats: model assumptions can make up for the lack of labeled data, but wrong assumptions can also hurt the quality of the predictor.


Active vs Semi-supervised Learning

Both try to attack the same problem: make the most of unlabeled data.


Slides based on Jerry Zhu’s book and tutorials

https://www.morganclaypool.com/doi/abs/10.2200/s00196ed1v01y200906aim006
http://pages.cs.wisc.edu/~jerryzhu/pub/sslicml07.pdf
http://pages.cs.wisc.edu/~jerryzhu/pub/AERFAI06ssl.pdf


What you should know

• Problem definition for semi-supervised learning
• Self-training method, pros/cons
• Basic idea of co-training
• Gaussian mixture models as a generative method
• What the “E” and “M” steps of the EM algorithm mean
• General idea of graph-based semi-supervised learning
• General idea of semi-supervised SVM