Midterm Review
Instructor: Wei Xu


Page 1:

Midterm Review
Instructor: Wei Xu

Page 2:
Page 3:

Naïve Bayes, Logistic Regression (Log-linear Models), Perceptron, Gradient Descent

Classification

Page 4:

Text Classification

[Figure: hand-labeled training documents grouped by class: Planning ("...planning temporal reasoning plan language..."), Machine Learning ("learning training algorithm shrinkage network..."), NLP ("parser tag training translation language..."), Garbage Collection ("garbage collection memory optimization region..."), and GUI; plus a test document ("parser language label translation…") whose class (?) is to be predicted.]

Page 5:

Classification Methods: Supervised Machine Learning

• Input:
– a document d
– a fixed set of classes C = {c1, c2, …, cJ}
– a training set of m hand-labeled documents (d1, c1), …, (dm, cm)

• Output:
– a learned classifier γ: d → c

Page 6:

Naïve Bayes Classifier

cMAP = argmax_{c∈C} P(c | d)                        (MAP is "maximum a posteriori" = most likely class)

     = argmax_{c∈C} P(d | c) P(c) / P(d)            (Bayes rule)

     = argmax_{c∈C} P(d | c) P(c)                   (dropping the denominator)

     = argmax_{c∈C} P(x1, x2, …, xn | c) P(c)       (bag of words)

cNB = argmax_{c∈C} P(c) ∏_{x∈X} P(x | c)            (conditional independence assumption)
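In practice the product of many small probabilities underflows, so the argmax is normally taken in log space; this is a standard reformulation rather than something stated on the slide:

c_{NB} = \operatorname*{argmax}_{c \in C} \Big[ \log P(c) + \sum_{x \in X} \log P(x \mid c) \Big]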

Page 7:

Multinomial Naïve Bayes: Learning

• Maximum likelihood estimates: simply use the frequencies in the data

P̂(cj) = doccount(C = cj) / Ndoc

P̂(wi | cj) = count(wi, cj) / Σ_{w∈V} count(w, cj)

• Laplace (add-1) smoothing to avoid zero probabilities:

P̂(wi | c) = ( count(wi, c) + 1 ) / ( Σ_{w∈V} count(w, c) + |V| )

• Unknown words (which do not occur in the training set at all) can simply be ignored at test time.
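A minimal sketch of multinomial Naïve Bayes training with add-1 smoothing and log-space prediction, assuming documents are pre-tokenized lists of words (the function and variable names are illustrative, not from the slides):

import math
from collections import Counter, defaultdict

def train_nb(docs, labels):
    """Estimate log P(c) and add-1-smoothed log P(w|c) from (doc, label) pairs."""
    vocab = {w for doc in docs for w in doc}
    class_docs = Counter(labels)                       # doccount(C = c)
    word_counts = defaultdict(Counter)                 # count(w, c)
    for doc, c in zip(docs, labels):
        word_counts[c].update(doc)
    log_prior = {c: math.log(n / len(docs)) for c, n in class_docs.items()}
    log_lik = {}
    for c in class_docs:
        total = sum(word_counts[c].values()) + len(vocab)   # denominator with add-1
        log_lik[c] = {w: math.log((word_counts[c][w] + 1) / total) for w in vocab}
    return log_prior, log_lik, vocab

def predict_nb(doc, log_prior, log_lik, vocab):
    """cNB = argmax_c [log P(c) + sum of log P(w|c)]; unknown words are ignored."""
    def score(c):
        return log_prior[c] + sum(log_lik[c][w] for w in doc if w in vocab)
    return max(log_prior, key=score)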

Page 8:

Weaknesses of Naïve Bayes

• Assumes conditional independence of features

• Correlated features → double-counting of evidence
– parameters are estimated independently

• This can hurt classifier accuracy and calibration

Page 9:

Logistic Regression

• Doesn’t assume conditional independence of features
– better-calibrated probabilities
– can handle highly correlated, overlapping features

Page 10:

Logistic vs. Softmax Regression

Generalization to Multi-class Problems

• Logistic regression: one estimated probability; good for binary classification
• Softmax regression: one estimated probability for each class; predict the class with the highest probability

Page 11:

Maximum Entropy Models (MaxEnt)

• a.k.a. logistic regression, multinomial logistic regression, multiclass logistic regression, or softmax regression

• belongs to the family of classifiers known as log-linear classifiers

• Math proof of "LR = MaxEnt":
– [Klein and Manning 2003]
– [Mount 2011] http://www.win-vector.com/dfiles/LogisticRegressionMaxEnt.pdf

Page 12:

softmax function

converts scores into probabilities in [0, 1]
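The slide's formula did not survive extraction; the standard softmax over class scores z_1, …, z_K is:

\mathrm{softmax}(z)_k = \frac{e^{z_k}}{\sum_{j=1}^{K} e^{z_j}}, \qquad k = 1, \dots, K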

Page 13:

The log-likelihood is a concave function! (so gradient ascent finds a global optimum, not just a local one)

Page 14:

Gradient Ascent

Page 15:

Gradient Ascent

Loop while not converged:
    For all features j, compute and add the derivative ∂L(w)/∂wj

L(w): training-set log-likelihood (objective function)

∂L(w)/∂w: gradient vector
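A minimal batch gradient-ascent sketch for binary logistic regression, filling in the update loop the slide describes (the learning rate eta and convergence test are standard choices, not taken verbatim from the slides):

import numpy as np

def gradient_ascent(X, y, eta=0.1, n_iters=1000, tol=1e-6):
    """Maximize the LR log-likelihood; X is (m, d), y holds labels in {0, 1}."""
    w = np.zeros(X.shape[1])
    for _ in range(n_iters):
        p = 1.0 / (1.0 + np.exp(-X @ w))   # sigma(w . x_i) for every example
        grad = X.T @ (y - p)               # dL/dw_j = sum_i (y_i - p_i) x_ij
        w += eta * grad                    # step uphill on the log-likelihood
        if np.linalg.norm(grad) < tol:     # "while not converged"
            break
    return w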

Page 16:
Page 17:
Page 18:

LR Gradient

logistic function
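The equations on these slides were lost in extraction; for binary LR with labels y_i ∈ {0, 1}, the standard logistic function and log-likelihood gradient are:

\sigma(z) = \frac{1}{1 + e^{-z}}, \qquad
\frac{\partial L(w)}{\partial w_j} = \sum_{i} \bigl( y_i - \sigma(w \cdot x_i) \bigr)\, x_{ij}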

Page 19:

Perceptron vs. Logistic Regression

[Figure: a linear unit combining inputs (features) with weights (parameters); perceptron and logistic regression differ in their choices of activation function.]

Page 20:

Perceptron Algorithm

• Very similar to logistic regression
• Not exactly computing the gradient
• Error-driven!

Initialize weight vector w = 0
Loop for K iterations:
    For all training examples x_i:
        if sign(w · x_i) != y_i:
            w += y_i * x_i
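The same algorithm as a runnable sketch, assuming NumPy arrays and labels y_i ∈ {−1, +1} (a common convention; the slides do not state it):

import numpy as np

def perceptron(X, y, K=10):
    """Binary perceptron; X is (m, d), y holds labels in {-1, +1}."""
    w = np.zeros(X.shape[1])
    for _ in range(K):
        for x_i, y_i in zip(X, y):
            if np.sign(w @ x_i) != y_i:   # mistake made: error-driven update
                w += y_i * x_i
    return w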

Page 21:

MultiClass Perceptron Algorithm

Initialize weight vectors w_y = 0 for every class y
Loop for K iterations:
    For all training examples x_i:
        y_pred = argmax_y w_y * x_i
        if y_pred != y_i:
            w_y_gold += x_i    (increase score for the right answer)
            w_y_pred -= x_i    (decrease score for the wrong answer)
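A runnable version of the multiclass update, keeping one weight vector per class (array shapes and names are illustrative):

import numpy as np

def multiclass_perceptron(X, y, n_classes, K=10):
    """X is (m, d); y holds integer class ids in [0, n_classes)."""
    W = np.zeros((n_classes, X.shape[1]))   # one weight vector w_y per class
    for _ in range(K):
        for x_i, y_i in zip(X, y):
            y_pred = int(np.argmax(W @ x_i))   # argmax_y w_y . x_i
            if y_pred != y_i:
                W[y_i] += x_i      # increase score for the right answer
                W[y_pred] -= x_i   # decrease score for the wrong answer
    return W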

Page 22:

Perceptron vs. Logistic Regression

• The only hyperparameter of the perceptron is the maximum number of iterations (LR also needs a learning rate)

• The perceptron is guaranteed to converge if the data is linearly separable (LR always converges)

Page 23:

Perceptron vs. Logistic Regression

• The Perceptron is an online learning algorithm.
• Logistic Regression is not:

this update is effectively the same as “w += y_i * x_i”

Page 24:

Multinomial LR Gradient

∂L(w)/∂w_j = (empirical feature count) − (expected feature count)
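Written out, with f_j a feature function and the expectation taken under the model's distribution (the standard log-linear form, reconstructed since the slide's equation did not extract):

\frac{\partial L(w)}{\partial w_j}
  = \underbrace{\sum_{i} f_j(x_i, y_i)}_{\text{empirical feature count}}
  - \underbrace{\sum_{i} \sum_{y} P(y \mid x_i)\, f_j(x_i, y)}_{\text{expected feature count}}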

Page 25:

MAP-based learning (Perceptron)

≈ Maximum A Posteriori: approximate the expectation using maximization, i.e. replace the expected feature count with the feature count of the single most likely (argmax) output.

Page 26:

Markov Assumption, Perplexity, Interpolation, Backoff, and Kneser-Ney Smoothing

Language Modeling

Page 27:

Probabilistic Language Modeling

• Goal: compute the probability of a sentence or sequence of words:
P(W) = P(w1, w2, w3, w4, w5, …, wn)

• Related task: probability of an upcoming word:
P(w5 | w1, w2, w3, w4)

• A model that computes either of these, P(W) or P(wn | w1, w2, …, wn−1), is called a language model (LM)

Page 28:

The Chain Rule applied to compute the joint probability of the words in a sentence

Markov Assumption
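The slide's formulas did not survive extraction; the standard chain rule and the k-th-order Markov approximation are:

P(w_1 w_2 \dots w_n) = \prod_{i=1}^{n} P(w_i \mid w_1 \dots w_{i-1})
\;\approx\; \prod_{i=1}^{n} P(w_i \mid w_{i-k} \dots w_{i-1})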

Page 29:

Bigram model

• Condition on the previous word:
P(wi | w1, w2, …, wi−1) ≈ P(wi | wi−1)

Page 30:

Add-one estimation

• Also called Laplace smoothing
• Pretend we saw each word one more time than we did
• Just add one to all the counts!

• MLE estimate:
PMLE(wi | wi−1) = c(wi−1, wi) / c(wi−1)

• Add-1 estimate:
PAdd−1(wi | wi−1) = ( c(wi−1, wi) + 1 ) / ( c(wi−1) + V )
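A minimal sketch of both estimators over a token list, with V the vocabulary size (the function and variable names are illustrative):

from collections import Counter

def bigram_probs(tokens):
    """Return MLE and add-1 conditional probability functions P(w | prev)."""
    unigrams = Counter(tokens[:-1])               # c(w_{i-1}), as a bigram start
    bigrams = Counter(zip(tokens, tokens[1:]))    # c(w_{i-1}, w_i)
    V = len(set(tokens))                          # vocabulary size
    def p_mle(w, prev):
        return bigrams[(prev, w)] / unigrams[prev] if unigrams[prev] else 0.0
    def p_add1(w, prev):
        return (bigrams[(prev, w)] + 1) / (unigrams[prev] + V)
    return p_mle, p_add1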

Page 31:

Add-1 estimation is a blunt instrument

• So add-1 isn’t used for N-grams
– we’ll see better methods

• But add-1 is used to smooth other NLP models
– for text classification
– in domains where the number of zeros isn’t so huge

Page 32:

Better Language Models

• Linear interpolation
• Backoff
• Absolute Discounting
• Kneser-Ney Smoothing

Page 33:

Perplexity

The best language model is one that best predicts an unseen test set
• gives the highest P(sentence)

Perplexity is the inverse probability of the test set, "normalized" by the number of words; it can be expanded with the chain rule, or with bigram probabilities for a bigram model (see the formulas below).

Minimizing perplexity is the same as maximizing probability
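The formulas referenced on the slide, reconstructed in their standard form:

PP(W) = P(w_1 w_2 \dots w_N)^{-\frac{1}{N}}
      = \sqrt[N]{\prod_{i=1}^{N} \frac{1}{P(w_i \mid w_1 \dots w_{i-1})}} \quad \text{(chain rule)}
      \;\approx\; \sqrt[N]{\prod_{i=1}^{N} \frac{1}{P(w_i \mid w_{i-1})}} \quad \text{(bigram)}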

Page 34:

Hidden Markov Models, Maximum Entropy Markov Models (Log-linear Models for Tagging), and the Viterbi Algorithm

Tagging

Page 35:

Tagging (Sequence Labeling)

• Given a sequence (in NLP, words), assign appropriate labels to each word.
• Many NLP problems can be viewed as sequence labeling:
– POS Tagging
– Chunking
– Named Entity Tagging
• Labels of tokens are dependent on the labels of other tokens in the sequence, particularly their neighbors

Plays/VBZ well/RB with/IN others/NNS .

Page 36:
Page 37:
Page 38:

Emission parameters e(x | s)

Trigram transition parameters q(si | si−2, si−1)
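For reference, the standard trigram-HMM factorization these parameters plug into (reconstructed, since the slide's figure did not extract; s_{-1} = s_0 = * are start symbols):

p(x_1 \dots x_n,\; s_1 \dots s_n) = \prod_{i=1}^{n} q(s_i \mid s_{i-2}, s_{i-1})\, e(x_i \mid s_i)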

Page 39:
Page 40:
Page 41:
Page 42:

Andrew Viterbi, 1967

Page 43:
Page 44:
Page 45:

running time for trigram HMM is O(n|S|³)

running time for bigram HMM is O(n|S|²)

Page 46:
Page 47:
Page 48:
Page 49:

Decoding

§ Linear Perceptron:

s* = argmax_s w · Φ(x, s)

§ Features must be local: for x = x1…xm and s = s1…sm,

Φ(x, s) = Σ_{j=1…m} φ(x, j, s_{j−1}, s_j)

§ Define π(i, si) to be the max score of a sequence of length i ending in tag si

§ Viterbi algorithm (HMMs):

π(i, si) = max_{si−1} e(xi | si) q(si | si−1) π(i−1, si−1)

§ Viterbi algorithm (MaxEnt):

π(i, si) = max_{si−1} p(si | si−1, x1…xm) π(i−1, si−1)

§ Viterbi algorithm (linear perceptron):

π(i, si) = max_{si−1} [ w · φ(x, i, si−1, si) + π(i−1, si−1) ]
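A runnable log-space sketch of the bigram-HMM recurrence above, with backpointers to recover the best tag sequence (the dictionary-based parameter encoding and the start symbol are illustrative choices, not from the slides):

import math

def viterbi(words, states, log_q, log_e, start="<s>"):
    """Bigram-HMM Viterbi: pi[i][s] = max score of a sequence ending in tag s.

    log_q[(prev, s)] = log q(s | prev); log_e[(word, s)] = log e(word | s).
    Runs in O(n |S|^2) time for n words and |S| states."""
    pi = [{s: log_q.get((start, s), -math.inf) + log_e.get((words[0], s), -math.inf)
           for s in states}]
    back = [{}]
    for i in range(1, len(words)):
        pi.append({})
        back.append({})
        for s in states:
            # the recurrence on the slide: max over the previous tag s_{i-1}
            prev = max(states, key=lambda t: pi[i - 1][t] + log_q.get((t, s), -math.inf))
            pi[i][s] = (pi[i - 1][prev] + log_q.get((prev, s), -math.inf)
                        + log_e.get((words[i], s), -math.inf))
            back[i][s] = prev
    best = max(states, key=lambda s: pi[-1][s])   # best final tag
    tags = [best]
    for i in range(len(words) - 1, 0, -1):        # follow backpointers
        tags.append(back[i][tags[-1]])
    return list(reversed(tags))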

Page 50: