
Page 1:

Introduction to Machine Learning

Brown University CSCI 1950-F, Spring 2012 Prof. Erik Sudderth

Lecture 3: Bayesian Learning, MAP & ML Estimation

Classification: Naïve Bayes

Many figures courtesy Kevin Murphy’s textbook, Machine Learning: A Probabilistic Perspective

Page 2:

Bayes Rule (Bayes Theorem)

• θ: unknown parameters (many possible models)
• D: observed data available for learning
• p(θ): prior distribution (domain knowledge)
• p(D | θ): likelihood function (measurement model)
• p(θ | D): posterior distribution (learned information)

p(θ, D) = p(θ) p(D | θ) = p(D) p(θ | D)

p(θ | D) = p(θ, D) / p(D) = p(D | θ) p(θ) / ∑_{θ′∈Θ} p(D | θ′) p(θ′) ∝ p(D | θ) p(θ)
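To make the normalization concrete, here is a minimal sketch (not from the slides) of Bayes' rule over a small discrete parameter grid; the coin-flip likelihood, the grid of candidate θ values, and the uniform prior are all illustrative assumptions.

```python
import numpy as np

# Minimal sketch: Bayes' rule over a discrete grid of candidate parameters.
# Hypothetical setup: infer a coin's heads-probability theta from a few flips.
thetas = np.linspace(0.05, 0.95, 10)          # candidate values of theta (the set Θ)
prior = np.ones_like(thetas) / len(thetas)    # p(theta): uniform prior
data = [1, 1, 0, 1]                           # D: observed flips (1 = heads)

heads = sum(data)
tails = len(data) - heads
likelihood = thetas**heads * (1 - thetas)**tails   # p(D | theta) for i.i.d. flips

# p(theta | D) = p(D | theta) p(theta) / sum over theta' of p(D | theta') p(theta')
posterior = likelihood * prior
posterior /= posterior.sum()
print(posterior.round(3))
```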

Page 3:

The Number Game

• I am thinking of some arithmetical concept, such as:
  • Prime numbers
  • Numbers between 1 and 10
  • Even numbers
  • …
• I give you a series of randomly chosen positive examples from the chosen class
• Question: Are other test digits also in the class?

Tenenbaum 1999

Page 4:

Predictions of 20 Humans

[Figure: human generalization judgments after observing D = {16}, D = {16, 8, 2, 64}, and D = {16, 23, 19, 20}]

Page 5:

A Bayesian Model

Likelihood:
• Assume examples are sampled uniformly at random from all numbers that are consistent with the hypothesis, so for N examples p(D | h) = (1 / |h|)^N if every example lies in h, and 0 otherwise
• Size principle: favors the smallest consistent hypotheses

Prior:
• Based on prior experience, some hypotheses are more probable ("natural") than others:
  • Powers of 2
  • Powers of 2 except 32, plus 37
• Subjectivity: may depend on the observer's experience

D = {5, 34, 2, 89, 1, 13}?
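A short sketch of the size principle in code, using hypothesis sets and data invented for illustration (not the course's actual hypothesis space):

```python
# Sketch of the size principle: with examples drawn uniformly from the numbers
# consistent with a hypothesis h, p(D | h) = (1/|h|)^N, so smaller consistent
# hypotheses receive higher likelihood. The hypothesis sets are illustrative.
even_numbers = set(range(2, 101, 2))          # |h| = 50
powers_of_2 = {2, 4, 8, 16, 32, 64}           # |h| = 6

def likelihood(data, h):
    """p(D | h) under the size principle (0 if any example is inconsistent)."""
    return (1.0 / len(h)) ** len(data) if all(x in h for x in data) else 0.0

data = [16, 8, 2, 64]
print(likelihood(data, even_numbers))  # ~1.6e-07
print(likelihood(data, powers_of_2))   # ~7.7e-04: the smaller hypothesis wins
```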

Page 6:

Posterior Distributions

[Figure: posterior over hypotheses p(h | D) for D = {16} and for D = {16, 8, 2, 64}]

Page 7:

Posterior Estimation

• As the amount of data becomes large, weak conditions on the hypothesis space and measurement process imply that the posterior concentrates on a single hypothesis:
    p(h | D) → δ_ĥ(h)
• This is a maximum a posteriori (MAP) estimate:
    ĥ_MAP = argmax_h p(h | D) = argmax_h p(D | h) p(h)
• With a large amount of data, and/or an (almost) uniform prior, we approach the maximum likelihood (ML) estimate:
    ĥ_ML = argmax_h p(D | h)
• More theory to come later!
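As a small illustration (with made-up hypotheses, prior, and likelihood values, not from the slides), both estimates are simply argmaxes over the hypothesis space:

```python
# Sketch: MAP vs. ML estimation over a discrete hypothesis space.
# The hypotheses, prior, and likelihood values are illustrative assumptions.
prior = {"even numbers": 0.6, "powers of 2": 0.4}             # p(h)
likelihood = {"even numbers": 1e-7, "powers of 2": 2e-5}      # p(D | h) for some fixed D

h_ml = max(likelihood, key=likelihood.get)                    # argmax_h p(D | h)
h_map = max(prior, key=lambda h: likelihood[h] * prior[h])    # argmax_h p(D | h) p(h)
print(h_ml, h_map)   # both "powers of 2" here; they can differ under a strong prior
```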

Page 8:

Posterior Predictions

• Suppose we want to predict the next number x ∈ {1, 2, 3, …} that will be revealed to us. One option is to use the MAP estimate ĥ:
    p(x | D) ≈ p(x | ĥ)
• But if we correctly apply Bayes rule to the specified model, we instead obtain the posterior predictive distribution:
    p(x | D) = ∑_{h∈H} p(x | h) p(h | D)
• This is sometimes called Bayesian model averaging.
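A sketch contrasting the MAP plug-in prediction with the posterior predictive (Bayesian model averaging), again using a toy hypothesis space, prior, and data invented for illustration:

```python
# Sketch: MAP plug-in prediction vs. posterior predictive (model averaging)
# for a toy number game; the hypotheses, prior, and data are assumptions.
hypotheses = {
    "even numbers": set(range(2, 101, 2)),
    "powers of 2":  {2, 4, 8, 16, 32, 64},
}
prior = {"even numbers": 0.5, "powers of 2": 0.5}
data = [16, 8, 2, 64]

# Posterior p(h | D) under the size principle.
post = {}
for name, h in hypotheses.items():
    lik = (1.0 / len(h)) ** len(data) if all(x in h for x in data) else 0.0
    post[name] = lik * prior[name]
z = sum(post.values())
post = {name: p / z for name, p in post.items()}
h_map = max(post, key=post.get)

def predictive(x):
    """p(x | D) = sum_h p(x | h) p(h | D), with p(x | h) uniform over h."""
    return sum((1.0 / len(h) if x in h else 0.0) * post[name]
               for name, h in hypotheses.items())

def map_plugin(x):
    """p(x | D) ≈ p(x | h_MAP): ignores remaining uncertainty about h."""
    h = hypotheses[h_map]
    return 1.0 / len(h) if x in h else 0.0

print(map_plugin(6), predictive(6))   # MAP plug-in says 0; the model average does not
```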

Page 9:

Posterior Predictive Distribution

Page 10:

Experiment: Human Judgments

Page 11:

Experiment: Bayesian Predictions

Page 12:

Machine Learning Problems

             Supervised Learning                 Unsupervised Learning
Discrete     classification or categorization    clustering
Continuous   regression                          dimensionality reduction

Page 13:

Generative Classifiers

• y ∈ {1, …, C}: class label, observed in training
• x ∈ X: observed features to be used for classification
• θ: parameters indexing a family of models

p(y, x | θ) = p(y | θ) p(x | y, θ)
where p(y | θ) is the prior distribution and p(x | y, θ) is the likelihood function.

• Compute the class posterior distribution via Bayes rule:
    p(y = c | x, θ) = p(y = c | θ) p(x | y = c, θ) / ∑_{c′=1}^C p(y = c′ | θ) p(x | y = c′, θ)
• Inference: Find the label distribution for some input example
• Classification: Make a decision based on the inferred distribution
• Learning: Estimate the parameters θ from labeled training data
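As a numeric sketch (with made-up prior and likelihood values for one particular input x, and C = 2 classes), the class posterior is just a normalized product:

```python
import numpy as np

# Sketch of the class posterior p(y = c | x, θ) via Bayes rule.
# The class prior and the class-conditional likelihood values evaluated at one
# observed x are made-up numbers for illustration.
class_prior = np.array([0.7, 0.3])          # p(y = c | θ), for c = 1, 2
likelihood_at_x = np.array([0.02, 0.10])    # p(x | y = c, θ) at the observed x

joint = class_prior * likelihood_at_x       # numerator of Bayes rule
class_posterior = joint / joint.sum()       # normalize over c' = 1, ..., C
print(class_posterior)                      # ≈ [0.318, 0.682]
```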

Page 14:

Specifying a Generative Model

p(y, x | θ) = p(y | θ) p(x | y, θ)
where p(y | θ) is the prior distribution and p(x | y, θ) is the likelihood function.

• For generative classification, we take the prior to be some categorical (multinoulli) distribution:
    p(y | θ) = Cat(y | θ)
• The likelihood must be matched to the domain of the data
• Suppose x is a vector of D different features
• The simplest generative model assumes that these features are conditionally independent, given the class label:
    p(x | y = c, θ) = ∏_{j=1}^D p(x_j | y = c, θ_jc)
• This is a so-called naïve Bayes model for classification
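For binary features, each factor p(x_j | y = c, θ_jc) can be a Bernoulli distribution. Here is a sketch of the resulting factorized likelihood and class posterior, with all parameter values invented for illustration:

```python
import numpy as np

# Sketch of a Bernoulli naive Bayes classifier with D = 3 binary features and
# C = 2 classes; theta[c, j] = p(x_j = 1 | y = c). All values are made up.
theta = np.array([[0.8, 0.1, 0.5],
                  [0.2, 0.7, 0.5]])
class_prior = np.array([0.6, 0.4])          # p(y = c)

def class_posterior(x):
    """p(y = c | x) ∝ p(y = c) ∏_j p(x_j | y = c, θ_jc)."""
    x = np.asarray(x)
    lik = np.prod(theta**x * (1 - theta)**(1 - x), axis=1)  # factorized p(x | y = c)
    post = class_prior * lik
    return post / post.sum()

print(class_posterior([1, 0, 1]))   # class 0 is favored for this feature vector
```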

Page 15:

Learning a Generative Model

• Assume we have N training examples independently sampled from an unknown naïve Bayes model, where y_i is the observed class label for training example i and x_ij is the value of feature j for training example i:

p(θ | y, x) ∝ p(θ, y, x) = p(θ) p(y | θ) p(x | y, θ)    (model, class labels, features)

p(θ | y, x) ∝ p(θ) ∏_{i=1}^N p(y_i | θ) p(x_i | y_i, θ)
            ∝ p(θ) ∏_{i=1}^N p(y_i | θ) ∏_{j=1}^D p(x_ij | y_i, θ)

• Learning: ML estimate, MAP estimate, or posterior prediction

Page 16:

Naïve Bayes: ML Estimation

• Even if we are doing discrete categorization based on discrete features, this is a continuous optimization problem!
• Bayesian reasoning about models also requires continuous probability distributions, even in this simple case
• Next week we will show that the ML estimates of the class probabilities are:
    π̂_c = N_c / N, where N_c is the number of training examples of class c
• For binary features, we have:
    θ̂_jc = N_jc / N_c, where N_jc is the number of class-c training examples with feature x_j = 1
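A sketch of these counting-based ML estimates on a tiny binary dataset made up for illustration (the derivation itself is deferred to next week's lecture):

```python
import numpy as np

# Sketch: ML estimation for a Bernoulli naive Bayes model on a tiny toy
# dataset (the data below are made up for illustration).
X = np.array([[1, 0, 1],
              [1, 1, 1],
              [0, 0, 1],
              [0, 1, 0],
              [0, 1, 1]])          # x_ij: N = 5 examples, D = 3 binary features
y = np.array([0, 0, 0, 1, 1])      # y_i: class labels

classes = np.unique(y)

# pi_hat[c] = N_c / N: ML estimate of the class probabilities
pi_hat = np.array([(y == c).mean() for c in classes])

# theta_hat[c, j] = N_jc / N_c: fraction of class-c examples with feature j on
theta_hat = np.array([X[y == c].mean(axis=0) for c in classes])

print(pi_hat)     # [0.6 0.4]
print(theta_hat)  # row c gives the estimates of p(x_j = 1 | y = c)
```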