Maxent Models (II), CMSC 473/673, UMBC, September 20th, 2017


Page 1: Maxent Models (II)

Maxent Models (II)

CMSC 473/673

UMBC

September 20th, 2017

Page 2: Maxent Models (II)

Announcements: Assignment 1

Due 11:59 PM, Saturday 9/23

~3.5 days

Use the submit utility with: class id cs473_ferraro, assignment id a1

We must be able to run it on GL!
Common pitfall #1: forgetting files
Common pitfall #2: incorrect paths to files
Common pitfall #3: 3rd-party libraries

Page 3: Maxent Models (II)

Announcements: Question 6

Recursive interpolation:

$$p^{(n)}(x_i \mid x_{i-n+1:i-1}) = \lambda_n f^{(n)}(x_{i-n+1:i}) + (1 - \lambda_n)\, p^{(n-1)}(x_i \mid x_{i-n+2:i-1})$$

Fully interpolated (non-recursive) form:

$$p^{(n)}(x_i \mid x_{i-n+1:i-1}) = \lambda_{n,n} f^{(n)}(x_{i-n+1:i}) + \lambda_{n,n-1} f^{(n-1)}(x_{i-n+2:i}) + \cdots + \lambda_{n,0} f^{(0)}(\cdot)$$

where $\lambda_{n,0} = 1 - \sum_{m=0}^{n-1} \lambda_{n,n-m}$.
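To make the recursion concrete, here is a minimal Python sketch of the first (recursive) interpolation scheme. The helper names are illustrative, not from the slides: rel_freq stands in for the relative-frequency estimates f^(n), lambdas holds one interpolation weight per order, and the base case interpolates the unigram estimate with a uniform distribution over the vocabulary.

```python
# Minimal sketch of recursive linear interpolation (assumed helper names).
def interp_prob(word, history, lambdas, rel_freq, vocab_size):
    """p^(n)(word | history) = lambda_n * f^(n) + (1 - lambda_n) * p^(n-1)."""
    n = len(history) + 1
    if n == 1:
        # base case: interpolate the unigram with the uniform distribution
        lam = lambdas[1]
        return lam * rel_freq(word, ()) + (1.0 - lam) / vocab_size
    lam = lambdas[n]
    return (lam * rel_freq(word, history)
            + (1.0 - lam) * interp_prob(word, history[1:], lambdas, rel_freq, vocab_size))
```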

Page 4: Maxent Models (II)

Announcements: Course Project

Official handout will be out Friday 9/22

Until then, focus on assignment 1

Teams of 1-3

Mixed undergrad/grad is encouraged but not required

Some novel aspect is needed

Ex 1: reimplement existing technique and apply to new domain

Ex 2: reimplement existing technique and apply to new (human) language

Ex 3: explore novel technique on existing problem

Page 5: Maxent Models (II)

Recap from last time…

Page 6: Maxent Models (II)

Classify or Decode with Bayes Rule

how well does text Y represent label X?

how likely is label X overall?

For “simple” or “flat” labels:
* iterate through labels
* evaluate the score for each label, keeping only the best (or n best)
* return the best (or n best) label and score

(A minimal sketch of this loop follows.)
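A minimal sketch of classify-with-Bayes-rule over flat labels, assuming likelihood(text, x) and prior(x) are callables that return p(Y | X) and p(X) (these names are illustrative, not from the slides):

```python
# Score every label with p(Y | X) * p(X) and keep the best one.
def classify(text, labels, likelihood, prior):
    best_label, best_score = None, float("-inf")
    for x in labels:
        score = likelihood(text, x) * prior(x)   # p(Y | X) * p(X)
        if score > best_score:
            best_label, best_score = x, score
    return best_label, best_score
```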

Page 7: Maxent Models (II)

Classification Evaluation: the 2-by-2 contingency table

                          Actually Correct      Actually Incorrect
Selected/Guessed          True Positive (TP)    False Positive (FP)
Not selected/not guessed  False Negative (FN)   True Negative (TN)

(rows: what we selected/guessed; columns: the correct classes/choices)

Page 8: Maxent Models (II)

Classification Evaluation: Accuracy, Precision, and Recall

Accuracy: % of items correct

Precision: % of selected items that are correct

Recall: % of correct items that are selected

Actually Correct Actually Incorrect

Selected/Guessed True Positive (TP) False Positive (FP)

Not selected/not guessed False Negative (FN) True Negative (TN)
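These three metrics follow directly from the four counts in the table; a small sketch:

```python
# Evaluation metrics from the 2-by-2 contingency table (counts assumed given).
def accuracy(tp, fp, fn, tn):
    return (tp + tn) / (tp + fp + fn + tn)   # % of items correct

def precision(tp, fp):
    return tp / (tp + fp)                    # % of selected items that are correct

def recall(tp, fn):
    return tp / (tp + fn)                    # % of correct items that are selected
```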

Page 9: Maxent Models (II)

Language Modeling as Naïve Bayes Classifier

Adopt a naïve bag-of-words representation Y_i

Assume position doesn’t matter

Assume the feature probabilities are independent given the class X
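As a sketch of the resulting scoring rule under the bag-of-words assumption: log p(X) plus a sum of per-word log-likelihoods. The table names (log_prior, log_likelihood) are illustrative inputs, not from the slides.

```python
# Naive Bayes log score of a document's words under one label.
def nb_log_score(words, label, log_prior, log_likelihood):
    score = log_prior[label]                 # log p(X)
    for w in words:                          # position doesn't matter
        # unseen words would need smoothing in practice
        score += log_likelihood[label].get(w, float("-inf"))
    return score
```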

Page 10: Maxent Models (II)

Naïve Bayes Summary

Potential Advantages

Very Fast, low storage requirements

Robust to Irrelevant Features

Very good in domains with many equally important features

Optimal if the independence assumptions hold

Dependable baseline for text classification (but often not the best)

Potential Issues

Model the posterior in one go?

Are the features really uncorrelated?

Are plain counts always appropriate?

Are there “better” ways of handling missing/noisy data?

(automated, more principled)

Page 11: Maxent Models (II)

Naïve Bayes Summary

Potential Advantages

Very Fast, low storage requirements

Robust to Irrelevant Features

Very good in domains with many equally important features

Optimal if the independence assumptions hold

Dependable baseline for text classification (but often not the best)

Potential Issues

Model the posterior in one go?

Are the features really uncorrelated?

Are plain counts always appropriate?

Are there “better” ways of handling missing/noisy data?

(automated, more principled)


Relevant for classification…

Page 12: Maxent Models (II)

Naïve Bayes Summary

Potential Advantages

Very Fast, low storage requirements

Robust to Irrelevant Features

Very good in domains with many equally important features

Optimal if the independence assumptions hold

Dependable baseline for text classification (but often not the best)

Potential Issues

Model the posterior in one go?

Are the features really uncorrelated?

Are plain counts always appropriate?

Are there “better” ways of handling missing/noisy data?

(automated, more principled)


Relevant for classification… and language modeling

Page 13: Maxent Models (II)

Maximum Entropy Models

a more general language model

argmax_X p(Y | X) * p(X)

Page 14: Maxent Models (II)

Maximum Entropy Models

a more general language model

classify in one go

argmax_X p(Y | X) * p(X)

argmax_X p(X | Y)

Page 15: Maxent Models (II)

ATTACK: Three people have been fatally shot, and five people, including a mayor, were seriously wounded as a result of a Shining Path attack today against a community in Junin department, central Peruvian mountain region.

We need to score the different combinations.

Document Classification

Page 16: Maxent Models (II)

Score and Combine Our Possibilities

score1(fatally shot, ATTACK)

score2(seriously wounded, ATTACK)

score3(Shining Path, ATTACK)

COMBINE → posterior probability of ATTACK

are all of these uncorrelated?

…scorek(department, ATTACK)

Page 17: Maxent Models (II)

Score and Combine Our Possibilities

score1(fatally shot, ATTACK)

score2(seriously wounded, ATTACK)

score3(Shining Path, ATTACK)

COMBINE → posterior probability of ATTACK

Q: What are the score and combine functions for Naïve Bayes?

Page 18: Maxent Models (II)

Scoring Our Possibilities

Three people have been fatally shot, and five people, including a mayor, were seriously wounded as a result of a Shining Path attack today against a community in Junin department, central Peruvian mountain region .

score(doc, ATTACK) =

score1(fatally shot, ATTACK)

score2(seriously wounded, ATTACK)

score3(Shining Path, ATTACK)

Learn these scores… but how?

What do we optimize?

Page 19: Maxent Models (II)

Maxent Modeling

Three people have been fatally shot, and five people, including a mayor, were seriously wounded as a result of a Shining Path attack today against a community in Junin department, central Peruvian mountain region.

p(ATTACK | doc) ∝ SNAP(score(doc, ATTACK))

Page 20: Maxent Models (II)

Maxent Modeling

Three people have been fatally shot, and five people, including a mayor, were seriously wounded as a result of a Shining Path attack today against a community in Junin department, central Peruvian mountain region.

p(ATTACK | doc) ∝ exp(score(doc, ATTACK))

f(x) = exp(x)

exp gives a positive, unnormalized probability

Page 21: Maxent Models (II)

Maxent Modeling

Learn the scores (but we’ll declare what combinations should be looked at)

Three people have been fatally shot, and five people, including a mayor, were seriously wounded as a result of a Shining Path attack today against a community in Junin department, central Peruvian mountain region.

p(ATTACK | doc) ∝ exp( score1(fatally shot, ATTACK) +
                       score2(seriously wounded, ATTACK) +
                       score3(Shining Path, ATTACK) + … )

Page 22: Maxent Models (II)

Maxent Modeling

Three people have been fatally shot, and five people, including a mayor, were seriously wounded as a result of a Shining Path attack today against a community in Junin department, central Peruvian mountain region.

p(ATTACK | doc) ∝ exp( weight1 * applies1(fatally shot, ATTACK) +
                       weight2 * applies2(seriously wounded, ATTACK) +
                       weight3 * applies3(Shining Path, ATTACK) + … )

Page 23: Maxent Models (II)

Q: What if none of our features apply?

Page 24: Maxent Models (II)

Guiding Principle for Log-Linear Models

“[The log-linear estimate] is the least biased estimate possible on the given information; i.e., it is maximally noncommittal with regard to missing information.” (Edwin T. Jaynes, 1957)

Page 25: Maxent Models (II)

Guiding Principle for Log-Linear Models

“[The log-linear estimate] is the least biased estimate possible on the given information; i.e., it is maximally noncommittal with regard to missing information.” (Edwin T. Jaynes, 1957)

exp(θ · f): if no features apply then f = 0, so exp(θ · 0) = 1 and the model stays noncommittal.

Page 26: Maxent Models (II)

exp( θ1 * f1(fatally shot, ATTACK) +
     θ2 * f2(seriously wounded, ATTACK) +
     θ3 * f3(Shining Path, ATTACK) + … )

Easier-to-write form

Page 27: Maxent Models (II)

exp( θ1 * f1(fatally shot, ATTACK) +
     θ2 * f2(seriously wounded, ATTACK) +
     θ3 * f3(Shining Path, ATTACK) + … )

Easier-to-write form

K weights

K features

Page 28: Maxent Models (II)

exp(θ · f(doc, ATTACK))

Easier-to-write form

K-dimensional weight vector

K-dimensional feature vector

dot product
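In code, this unnormalized score is just the exponentiated dot product. A sketch using sparse dictionaries for θ and f (the function and variable names are illustrative):

```python
import math

# Unnormalized maxent score exp(theta · f(doc, label)), sparse form.
def unnormalized_score(theta, feature_vector):
    dot = sum(theta.get(k, 0.0) * v for k, v in feature_vector.items())
    return math.exp(dot)
```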

Page 29: Maxent Models (II)

Log-Linear Models

Page 30: Maxent Models (II)

Log-Linear Models

Page 31: Maxent Models (II)

Log-Linear Models

Feature function(s) / Sufficient statistics

“Strength” function(s)

Page 32: Maxent Models (II)

Log-Linear Models

Feature Weights / Natural parameters

Distribution Parameters

Page 33: Maxent Models (II)

Log-Linear Models

How do we normalize?

Page 34: Maxent Models (II)

Normalization for Classification

Z = Σ_{label x} exp( weight1 * applies1(fatally shot, x) +
                     weight2 * applies2(seriously wounded, x) +
                     weight3 * applies3(Shining Path, x) + … )

p(x | y) ∝ exp(θ · f(x, y))   (classify doc y with label x in one go)
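A sketch of this normalization, assuming features(doc, x) returns the sparse feature vector f(x, y) for document y = doc and label x (these names are illustrative):

```python
import math

# p(x | y) = exp(theta · f(x, y)) / Z(y), with Z(y) summed over all labels.
def label_posterior(doc, labels, theta, features):
    scores = {x: math.exp(sum(theta.get(k, 0.0) * v
                              for k, v in features(doc, x).items()))
              for x in labels}
    Z = sum(scores.values())
    return {x: s / Z for x, s in scores.items()}
```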

Page 35: Maxent Models (II)

Normalization for Language Model

general class-based (X) language model of doc y

Page 36: Maxent Models (II)

Normalization for Language Model

Can be significantly harder in the general case

general class-based (X) language model of doc y

Page 37: Maxent Models (II)

Normalization for Language Model

Can be significantly harder in the general case

Simplifying assumption: maxent n-grams!

general class-based (X) language model of doc y
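Under the maxent n-gram assumption, Z becomes a sum over the vocabulary for each history rather than over whole documents, which is what makes it tractable. A sketch under that assumption, with a hypothetical features(context, word) helper:

```python
import math

# Maxent n-gram: normalize over the vocabulary for one context.
def ngram_prob(word, context, vocab, theta, features):
    def score(w):
        return math.exp(sum(theta.get(k, 0.0) * v
                            for k, v in features(context, w).items()))
    Z = sum(score(w) for w in vocab)     # sum over words, not documents
    return score(word) / Z
```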

Page 38: Maxent Models (II)

https://www.csee.umbc.edu/courses/undergraduate/473/f17/loglin-tutorial/

https://goo.gl/B23Rxo

Lesson 1

Page 39: Maxent Models (II)

Connections to Other Techniques

Log-Linear Models

Page 40: Maxent Models (II)

Connections to Other Techniques

Log-Linear Models

* as statistical regression: (multinomial) logistic regression, softmax regression

Page 41: Maxent Models (II)

Connections to Other Techniques

Log-Linear Models

* as statistical regression: (multinomial) logistic regression, softmax regression
* based in information theory: Maximum Entropy models (MaxEnt)

Page 42: Maxent Models (II)

Connections to Other Techniques

Log-Linear Models

* as statistical regression: (multinomial) logistic regression, softmax regression
* based in information theory: Maximum Entropy models (MaxEnt)
* a form of: Generalized Linear Models

Page 43: Maxent Models (II)

Connections to Other Techniques

Log-Linear Models

* as statistical regression: (multinomial) logistic regression, softmax regression
* based in information theory: Maximum Entropy models (MaxEnt)
* a form of: Generalized Linear Models
* viewed as: Discriminative Naïve Bayes

Page 44: Maxent Models (II)

Connections to Other Techniques

Log-Linear Models

* as statistical regression: (multinomial) logistic regression, softmax regression
* based in information theory: Maximum Entropy models (MaxEnt)
* a form of: Generalized Linear Models
* viewed as: Discriminative Naïve Bayes
* to be cool today :) : very shallow (sigmoidal) neural nets

Page 45: Maxent Models (II)

Learning θ

Page 46: Maxent Models (II)

p_θ(y | x): probabilistic model

objective (given observations)

Page 47: Maxent Models (II)

How will we optimize F(θ)?

Calculus

Page 48: Maxent Models (II)

[Plot: F(θ) as a function of θ]

Page 49: Maxent Models (II)

[Plot: F(θ) as a function of θ, with maximizer θ*]

Page 50: Maxent Models (II)

[Plot: F(θ) and its derivative F’(θ) with respect to θ; the maximizer θ* is where F’(θ) = 0]

Page 51: Maxent Models (II)

Example

F(x) = -(x - 2)²

Differentiate: F’(x) = -2x + 4

Solve F’(x) = 0: x = 2
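A quick numerical sanity check of this example (the centered finite difference approximates F’):

```python
# F(x) = -(x - 2)**2 has F'(x) = -2x + 4, which vanishes at x = 2.
def F(x):
    return -(x - 2) ** 2

def numeric_derivative(f, x, h=1e-6):
    return (f(x + h) - f(x - h)) / (2 * h)

print(numeric_derivative(F, 2.0))   # ≈ 0: x = 2 is the critical point
print(F(2.0), F(1.0), F(3.0))       # F(2) = 0 is the maximum
```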

Page 52: Maxent Models (II)

Common Derivative Rules

Page 53: Maxent Models (II)

[Plot: F(θ) and its derivative F’(θ) with respect to θ, with maximizer θ*]

What if you can’t find the roots? Follow the derivative

Page 54: Maxent Models (II)

[Plot: F(θ) and F’(θ) versus θ, with maximizer θ*; starting point θ_0 with value y_0]

What if you can’t find the roots? Follow the derivative

Set t = 0
Pick a starting value θ_t
Until converged:
1. Get value y_t = F(θ_t)

Page 55: Maxent Models (II)

[Plot: F(θ) and F’(θ) versus θ, with maximizer θ*; starting point θ_0 with value y_0 and derivative g_0]

What if you can’t find the roots? Follow the derivative

Set t = 0
Pick a starting value θ_t
Until converged:
1. Get value y_t = F(θ_t)
2. Get derivative g_t = F’(θ_t)

Page 56: Maxent Models (II)

[Plot: F(θ) and F’(θ) versus θ, with maximizer θ*; first update from θ_0 to θ_1 along derivative g_0]

What if you can’t find the roots? Follow the derivative

Set t = 0
Pick a starting value θ_t
Until converged:
1. Get value y_t = F(θ_t)
2. Get derivative g_t = F’(θ_t)
3. Get scaling factor ρ_t
4. Set θ_{t+1} = θ_t + ρ_t * g_t
5. Set t += 1

Page 57: Maxent Models (II)

[Plot: successive iterates θ_0, θ_1, θ_2 climbing toward θ*]

What if you can’t find the roots? Follow the derivative

Set t = 0
Pick a starting value θ_t
Until converged:
1. Get value y_t = F(θ_t)
2. Get derivative g_t = F’(θ_t)
3. Get scaling factor ρ_t
4. Set θ_{t+1} = θ_t + ρ_t * g_t
5. Set t += 1

Page 58: Maxent Models (II)

[Plot: successive iterates θ_0 through θ_3 climbing toward θ*]

What if you can’t find the roots? Follow the derivative

Set t = 0
Pick a starting value θ_t
Until converged:
1. Get value y_t = F(θ_t)
2. Get derivative g_t = F’(θ_t)
3. Get scaling factor ρ_t
4. Set θ_{t+1} = θ_t + ρ_t * g_t
5. Set t += 1

Page 59: Maxent Models (II)

[Plot: successive iterates θ_0 through θ_3 climbing toward θ*]

What if you can’t find the roots? Follow the derivative

Set t = 0
Pick a starting value θ_t
Until converged:
1. Get value y_t = F(θ_t)
2. Get derivative g_t = F’(θ_t)
3. Get scaling factor ρ_t
4. Set θ_{t+1} = θ_t + ρ_t * g_t
5. Set t += 1
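Putting the five steps together, here is a minimal gradient-ascent sketch run on the earlier example F(x) = -(x - 2)², with a fixed scaling factor ρ (in practice ρ_t may change each step); the function names are illustrative:

```python
# Minimal gradient ascent following the slide's steps 1-5.
def gradient_ascent(F, F_prime, theta0, rho=0.1, tol=1e-8, max_iters=1000):
    theta = theta0                       # pick a starting value
    for t in range(max_iters):           # until converged:
        y = F(theta)                     # 1. get value
        g = F_prime(theta)               # 2. get derivative
        theta_next = theta + rho * g     # 3-4. scale and step uphill
        if abs(theta_next - theta) < tol:
            break
        theta = theta_next               # 5. t += 1
    return theta

theta_star = gradient_ascent(lambda x: -(x - 2) ** 2,
                             lambda x: -2 * x + 4,
                             theta0=0.0)
print(theta_star)   # ≈ 2
```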

Page 60: Maxent Models (II)

Gradient = Multi-variable derivative

K-dimensional input

K-dimensional output

Page 61: Maxent Models (II)

Gradient Ascent

Page 62: Maxent Models (II)

Gradient Ascent

Page 63: Maxent Models (II)

Gradient Ascent

Page 64: Maxent Models (II)

Gradient Ascent

Page 65: Maxent Models (II)

Gradient Ascent

Page 66: Maxent Models (II)

Gradient Ascent