
Page 1: Lecture 17: Supervised Learning Recap

Lecture 17: Supervised Learning Recap

Machine Learning, April 6, 2010

Page 2: Lecture 17: Supervised Learning Recap

Last Time

• Support Vector Machines
• Kernel Methods

Page 3: Lecture 17: Supervised Learning Recap

Today

• Short recap of Kernel Methods
• Review of Supervised Learning
• Unsupervised Learning
  – (Soft) K-means clustering
  – Expectation Maximization
  – Spectral Clustering
  – Principal Components Analysis
  – Latent Semantic Analysis

Page 4: Lecture 17: Supervised Learning Recap

Kernel Methods

• Feature extraction into higher-dimensional spaces.

• Kernels describe the relationship between vectors (points) rather than the new feature space directly.

Page 5: Lecture 17: Supervised Learning Recap

When can we use kernels?

• Any time training and evaluation are both based on the dot product between two points.

• SVMs
• Perceptron
• k-nearest neighbors
• k-means
• etc.

Page 6: Lecture 17: Supervised Learning Recap

Kernels in SVMs

• Optimize the αi's and bias w.r.t. the kernel
• Decision function:
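The equations from this slide did not survive extraction. As a reconstruction in standard notation (not necessarily the slide's exact form), the dual variables are found by maximizing

\[ \sum_i \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j K(x_i, x_j) \quad \text{s.t. } 0 \le \alpha_i \le C, \;\; \sum_i \alpha_i y_i = 0 \]

and the decision function is a kernel expansion over the training points:

\[ f(x) = \operatorname{sign}\Big( \sum_i \alpha_i y_i K(x_i, x) + b \Big) \]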

Page 7: Lecture 17: Supervised Learning Recap

Kernels in Perceptrons

• Training

• Decision function
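The training and decision equations are likewise missing from the extraction. In the standard kernel perceptron, each training point carries a mistake count αi: training sweeps the data and increments αi whenever point i is misclassified, and the decision function is

\[ f(x) = \operatorname{sign}\Big( \sum_i \alpha_i y_i K(x_i, x) \Big) \]

(with an optional bias term, depending on the formulation).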

Page 8: Lecture 17: Supervised Learning Recap

Good and Valid Kernels

• Good: Computing K(xi, xj) is cheaper than computing ϕ(xi) explicitly
• Valid:
  – Symmetric: K(xi, xj) = K(xj, xi)
  – Decomposable into ϕ(xi)Tϕ(xj)
  – Positive semi-definite Gram matrix

• Popular Kernels
  – Linear, Polynomial
  – Radial Basis Function
  – String (technically infinite dimensions)
  – Graph
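For reference, the popular kernels listed above are commonly written as follows (standard forms, not taken from the slide):

\[ K_{\text{linear}}(x_i, x_j) = x_i^T x_j, \qquad K_{\text{poly}}(x_i, x_j) = (x_i^T x_j + c)^d, \qquad K_{\text{RBF}}(x_i, x_j) = \exp\!\left( -\frac{\lVert x_i - x_j \rVert^2}{2\sigma^2} \right) \]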

Page 9: Lecture 17: Supervised Learning Recap

Supervised Learning

• Linear Regression
• Logistic Regression
• Graphical Models
  – Hidden Markov Models
• Neural Networks
• Support Vector Machines
  – Kernel Methods

Page 10: Lecture 17: Supervised Learning Recap

Major concepts

• Gaussian, Multinomial, Bernoulli Distributions
• Joint vs. Conditional Distributions
• Marginalization
• Maximum Likelihood
• Risk Minimization
• Gradient Descent
• Feature Extraction, Kernel Methods

Page 11: Lecture 17: Supervised Learning Recap

Some favorite distributions

• Bernoulli

• Multinomial

• Gaussian
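The density formulas for these distributions were lost in extraction; standard forms (one common parameterization each) are:

\[ \text{Bernoulli: } p(x \mid \mu) = \mu^{x} (1-\mu)^{1-x}, \quad x \in \{0, 1\} \]
\[ \text{Multinomial: } p(x_1, \dots, x_K \mid \mu, n) = \frac{n!}{\prod_k x_k!} \prod_k \mu_k^{x_k} \]
\[ \text{Gaussian: } p(x \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left( -\frac{(x-\mu)^2}{2\sigma^2} \right) \]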

Page 12: Lecture 17: Supervised Learning Recap

Maximum Likelihood

• Identify the parameter values that yield the maximum likelihood of generating the observed data.

• Take the partial derivative of the likelihood function
• Set to zero
• Solve

• NB: maximum likelihood parameters are the same as maximum log likelihood parameters
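A quick worked example (not on the slide): for N Bernoulli observations x1, …, xN the log likelihood and its maximizer are

\[ \ell(\mu) = \sum_n \big[ x_n \log \mu + (1 - x_n) \log(1-\mu) \big], \qquad \frac{\partial \ell}{\partial \mu} = \frac{\sum_n x_n}{\mu} - \frac{N - \sum_n x_n}{1-\mu} = 0 \;\Rightarrow\; \hat{\mu} = \frac{1}{N} \sum_n x_n \]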

Page 13: Lecture 17: Supervised Learning Recap

Maximum Log Likelihood

• Why do we like the log function?
• It turns products (difficult to differentiate) into sums (easy to differentiate)

• log(xy) = log(x) + log(y)
• log(x^c) = c log(x)

Page 14: Lecture 17: Supervised Learning Recap

Risk Minimization

• Pick a loss function
  – Squared loss
  – Linear loss
  – Perceptron (classification) loss

• Identify the parameters that minimize the loss function.
  – Take the partial derivative of the loss function
  – Set to zero
  – Solve
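As a concrete instance (a standard derivation, not reproduced from the slide), applying this recipe to squared loss for linear regression gives the normal equations:

\[ R(w) = \lVert Xw - t \rVert^2, \qquad \nabla_w R = 2 X^T (Xw - t) = 0 \;\Rightarrow\; w = (X^T X)^{-1} X^T t \]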

Page 15: Lecture 17: Supervised Learning Recap

Frequentists v. Bayesians

• Point estimates vs. Posteriors
• Risk Minimization vs. Maximum Likelihood
• L2-Regularization
  – Frequentists: Add a constraint on the size of the weight vector
  – Bayesians: Introduce a zero-mean prior on the weight vector
  – Result is the same!

Page 16: Lecture 17: Supervised Learning Recap

L2-Regularization

• Frequentists:
  – Introduce a cost on the size of the weights

• Bayesians:
  – Introduce a prior on the weights
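The equations are missing from the extraction; the usual statement of the equivalence, in standard notation, is: the frequentist adds a penalty to the squared-error objective,

\[ \min_w \sum_n (w^T x_n - t_n)^2 + \lambda \lVert w \rVert^2 \]

while the Bayesian places a zero-mean Gaussian prior w ~ N(0, τ²I) on the weights and maximizes the log posterior,

\[ -\frac{1}{2\sigma^2} \sum_n (w^T x_n - t_n)^2 - \frac{1}{2\tau^2} \lVert w \rVert^2 + \text{const} \]

which has the same optimum with λ = σ²/τ².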

Page 17: Lecture 17: Supervised Learning Recap

Types of Classifiers

• Generative Models
  – Highest resource requirements
  – Need to approximate the joint probability

• Discriminative Models
  – Moderate resource requirements
  – Typically fewer parameters to approximate than generative models

• Discriminant Functions
  – Can be trained probabilistically, but the output does not include confidence information

Page 18: Lecture 17: Supervised Learning Recap

Linear Regression

• Fit a line to a set of points

Page 19: Lecture 17: Supervised Learning Recap

Linear Regression

• Extension to higher dimensions
  – Polynomial fitting
  – Arbitrary function fitting
    • Wavelets
    • Radial basis functions
    • Classifier output
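A minimal sketch of this idea (assuming NumPy and a made-up 1-D example; the basis functions and their parameters here are illustrative, not from the lecture):

```python
import numpy as np

def design_matrix(x, centers, width=0.5, degree=3):
    """Stack polynomial and radial-basis features for 1-D inputs x."""
    poly = np.vander(x, degree + 1, increasing=True)           # 1, x, x^2, x^3
    rbf = np.exp(-(x[:, None] - centers[None, :]) ** 2 / (2 * width ** 2))
    return np.hstack([poly, rbf])

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 50)
t = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(50)      # noisy targets

centers = np.linspace(0, 1, 5)                                  # assumed RBF centers
Phi = design_matrix(x, centers)
w, *_ = np.linalg.lstsq(Phi, t, rcond=None)                     # least-squares fit of the weights
```

Once the inputs are mapped through the basis functions, the fit itself is still ordinary linear least squares in the weights.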

Page 20: Lecture 17: Supervised Learning Recap

Logistic Regression

• Fit Gaussians to the data for each class
• The decision boundary is where the PDFs cross

• No “closed form” solution when setting the gradient to zero
• Gradient Descent
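Since the gradient equations have no closed-form solution, the weights are typically fit by gradient descent on the negative log likelihood. A minimal sketch (assuming a NumPy array X of shape (N, d) with a bias column already appended, and binary labels y in {0, 1}):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(X, y, lr=0.1, iters=1000):
    """Batch gradient descent on the negative log likelihood."""
    w = np.zeros(X.shape[1])
    for _ in range(iters):
        p = sigmoid(X @ w)                 # predicted P(y = 1 | x)
        grad = X.T @ (p - y) / len(y)      # gradient of the average negative log likelihood
        w -= lr * grad
    return w
```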

Page 21: Lecture 17: Supervised Learning Recap

Graphical Models

• General way to describe the dependence relationships between variables.

• Junction Tree Algorithm allows us to efficiently calculate marginals over any variable.

Page 22: Lecture 17: Supervised Learning Recap

Junction Tree Algorithm

• Moralization
  – “Marry the parents”
  – Make undirected

• Triangulation
  – Add edges so that no chordless cycle of length 4 or more remains

• Junction Tree Construction
  – Identify separators such that the running intersection property holds

• Introduction of Evidence
  – Pass slices around the junction tree to generate marginals

Page 23: Lecture 17: Supervised Learning Recap

Hidden Markov Models

• Sequential Modeling
  – Generative Model

• Relationship between observations and state (class) sequences

Page 24: Lecture 17: Supervised Learning Recap

Perceptron

• Step function used for squashing.
• Classifier as Neuron metaphor.

Page 25: Lecture 17: Supervised Learning Recap

Perceptron Loss

• Classification Error vs. Sigmoid Error
  – Loss is only calculated on mistakes
  – Perceptrons use strictly classification error

Page 26: Lecture 17: Supervised Learning Recap

Neural Networks

• Interconnected Layers of Perceptrons or Logistic Regression “neurons”

Page 27: Lecture 17: Supervised Learning Recap

Neural Networks

• There are many possible configurations of neural networks
  – Vary the number of layers
  – Vary the size of each layer

Page 28: Lecture 17: Supervised Learning Recap

Support Vector Machines

• Maximum Margin Classification

[Figure: separating hyperplanes with a small margin vs. a large margin]

Page 29: Lecture 17: Supervised Learning Recap

Support Vector Machines

• Optimization Function

• Decision Function
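The formulas are missing from the extraction; in the standard hard-margin primal form they are (a reconstruction, not the slide's exact notation):

\[ \min_{w, b} \; \frac{1}{2} \lVert w \rVert^2 \quad \text{s.t. } y_i (w^T x_i + b) \ge 1 \;\; \forall i, \qquad f(x) = \operatorname{sign}(w^T x + b) \]

(the soft-margin version adds slack variables with a penalty C).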

Page 30: Lecture 17: Supervised Learning Recap


Visualization of Support Vectors

Page 31: Lecture 17: Supervised Learning Recap

Questions?

• Now would be a good time to ask questions about Supervised Techniques.

Page 32: Lecture 17: Supervised Learning Recap

Clustering

• Identify discrete groups of similar data points
• Data points are unlabeled

Page 33: Lecture 17: Supervised Learning Recap

Recall K-Means

• Algorithm
  – Select K, the desired number of clusters
  – Initialize K cluster centroids
  – For each point in the data set, assign it to the cluster with the closest centroid
  – Update the centroid based on the points assigned to each cluster
  – If any data point has changed clusters, repeat
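A minimal sketch of the algorithm above (assuming a NumPy array X of shape (N, d); variable names are illustrative):

```python
import numpy as np

def kmeans(X, k, iters=100, seed=0):
    """Plain k-means: assign points to the nearest centroid, then recompute centroids."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    assign = np.full(len(X), -1)
    for _ in range(iters):
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_assign = dists.argmin(axis=1)               # index of the closest centroid
        if np.array_equal(new_assign, assign):
            break                                       # no point changed clusters: converged
        assign = new_assign
        for j in range(k):
            if np.any(assign == j):                     # guard against empty clusters
                centroids[j] = X[assign == j].mean(axis=0)
    return centroids, assign
```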

Page 34: Lecture 17: Supervised Learning Recap

k-means output

Page 35: Lecture 17: Supervised Learning Recap

Soft K-means

• In k-means, we force every data point to exist in exactly one cluster.

• This constraint can be relaxed.

Minimizes the entropy of cluster assignment

Page 36: Lecture 17: Supervised Learning Recap

Soft k-means example

Page 37: Lecture 17: Supervised Learning Recap

Soft k-means

• We still define a cluster by a centroid, but we calculate the centroid as the weighted mean of all the data points

• Convergence is based on a stopping threshold rather than changed assignments
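The update equations are missing from the extraction. In a common soft k-means formulation (e.g., MacKay's, with a stiffness parameter β; the slide's exact notation may differ), the responsibilities and the weighted-mean centroid update are

\[ r_{nk} = \frac{\exp(-\beta \lVert x_n - \mu_k \rVert^2)}{\sum_{k'} \exp(-\beta \lVert x_n - \mu_{k'} \rVert^2)}, \qquad \mu_k = \frac{\sum_n r_{nk} \, x_n}{\sum_n r_{nk}} \]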

Page 38: Lecture 17: Supervised Learning Recap

Gaussian Mixture Models

• Rather than identifying clusters by “nearest” centroids

• Fit a Set of k Gaussians to the data.

Page 39: Lecture 17: Supervised Learning Recap

GMM example

Page 40: Lecture 17: Supervised Learning Recap

Gaussian Mixture Models

• Formally, a mixture model is the weighted sum of a number of pdfs, where the weights are determined by a distribution:
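Written out in standard notation (reconstructed, since the slide's equation is missing):

\[ p(x) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x \mid \mu_k, \Sigma_k), \qquad \pi_k \ge 0, \;\; \sum_k \pi_k = 1 \]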

Page 41: Lecture 17: Supervised Learning Recap

Graphical Models with unobserved variables

• What if you have variables in a graphical model that are never observed?
  – Latent Variables

• Training latent variable models is an unsupervised learning application

[Figure: example graphical model with nodes labeled “laughing”, “amused”, “sweating”, and “uncomfortable”]

Page 42: Lecture 17: Supervised Learning Recap

Latent Variable HMMs

• We can cluster sequences using an HMM with unobserved state variables

• We will train the latent variable models using Expectation Maximization

Page 43: Lecture 17: Supervised Learning Recap

Expectation Maximization

• Both the training of GMMs and Gaussian models with latent variables are accomplished using Expectation Maximization
  – Step 1: Expectation (E-step)
    • Evaluate the “responsibilities” of each cluster with the current parameters
  – Step 2: Maximization (M-step)
    • Re-estimate parameters using the existing “responsibilities”

• Related to k-means
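For a GMM, the two steps take the following standard form (a reconstruction, not the slide's notation):

\[ \text{E-step: } \gamma_{nk} = \frac{\pi_k \, \mathcal{N}(x_n \mid \mu_k, \Sigma_k)}{\sum_j \pi_j \, \mathcal{N}(x_n \mid \mu_j, \Sigma_j)} \]

\[ \text{M-step: } N_k = \sum_n \gamma_{nk}, \quad \mu_k = \frac{1}{N_k} \sum_n \gamma_{nk} x_n, \quad \Sigma_k = \frac{1}{N_k} \sum_n \gamma_{nk} (x_n - \mu_k)(x_n - \mu_k)^T, \quad \pi_k = \frac{N_k}{N} \]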

Page 44: Lecture 17: Supervised Learning Recap

Questions

• One more time for questions on supervised learning…

Page 45: Lecture 17: Supervised Learning Recap

Next Time

• Gaussian Mixture Models (GMMs)
• Expectation Maximization