Midterm Review
Instructor: Wei Xu


Page 1:

Midterm Review
Instructor: Wei Xu

Page 2:
Page 3:

Naïve Bayes, Logistic Regression (Log-linear Models), Perceptron, Gradient Descent

Classification

Page 4:

Text Classification

[Figure: hand-labeled training documents grouped by class: Planning ("...planning temporal reasoning plan language..."), Machine Learning ("learning training algorithm shrinkage network..."), NLP ("parser tag training translation language..."), Garbage Collection ("garbage collection memory optimization region..."), and GUI; plus a test document ("parser language label translation…") whose class (?) is to be predicted.]

Page 5:

Classification Methods: Supervised Machine Learning

• Input:
– a document d
– a fixed set of classes C = {c1, c2, …, cJ}
– a training set of m hand-labeled documents (d1, c1), …, (dm, cm)

• Output:
– a learned classifier γ: d → c

Page 6:

Naïve Bayes Classifier

cMAP = argmax_{c∈C} P(c | d)                        (MAP is "maximum a posteriori" = most likely class)

     = argmax_{c∈C} P(d | c) P(c) / P(d)            (Bayes rule)

     = argmax_{c∈C} P(d | c) P(c)                   (dropping the denominator)

     = argmax_{c∈C} P(x1, x2, …, xn | c) P(c)       (bag of words)

cNB = argmax_{c∈C} P(c) ∏_{x∈X} P(x | c)            (conditional independence assumption)
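In practice the product of many small probabilities underflows, so the argmax is normally taken in log space; this is a standard reformulation rather than something stated on the slide:

c_{NB} = \operatorname*{argmax}_{c \in C} \Big[ \log P(c) + \sum_{x \in X} \log P(x \mid c) \Big]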

Page 7:

Multinomial Naïve Bayes: Learning

• Maximum likelihood estimates: simply use the frequencies in the data

P̂(cj) = doccount(C = cj) / Ndoc

P̂(wi | cj) = count(wi, cj) / Σ_{w∈V} count(w, cj)

• Laplace (add-1) smoothing to avoid zero probabilities:

P̂(wi | c) = ( count(wi, c) + 1 ) / ( Σ_{w∈V} count(w, c) + |V| )

• Unknown words (which do not occur in the training set at all) can simply be ignored at test time.
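A minimal sketch of multinomial Naïve Bayes training with add-1 smoothing and log-space prediction, assuming documents are pre-tokenized lists of words (the function and variable names are illustrative, not from the slides):

import math
from collections import Counter, defaultdict

def train_nb(docs, labels):
    """Estimate log P(c) and add-1-smoothed log P(w|c) from (doc, label) pairs."""
    vocab = {w for doc in docs for w in doc}
    class_docs = Counter(labels)                       # doccount(C = c)
    word_counts = defaultdict(Counter)                 # count(w, c)
    for doc, c in zip(docs, labels):
        word_counts[c].update(doc)
    log_prior = {c: math.log(n / len(docs)) for c, n in class_docs.items()}
    log_lik = {}
    for c in class_docs:
        total = sum(word_counts[c].values()) + len(vocab)   # denominator with add-1
        log_lik[c] = {w: math.log((word_counts[c][w] + 1) / total) for w in vocab}
    return log_prior, log_lik, vocab

def predict_nb(doc, log_prior, log_lik, vocab):
    """cNB = argmax_c [log P(c) + sum of log P(w|c)]; unknown words are ignored."""
    def score(c):
        return log_prior[c] + sum(log_lik[c][w] for w in doc if w in vocab)
    return max(log_prior, key=score)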

Page 8:

Weaknesses of Naïve Bayes

• Assumes conditional independence of features

• Correlated features → double-counting of evidence
– parameters are estimated independently

• This can hurt classifier accuracy and calibration

Page 9:

Logistic Regression

• Doesn’t assume conditional independence of features
– better-calibrated probabilities
– can handle highly correlated, overlapping features

Page 10:

Logistic vs. Softmax Regression

Generalization to Multi-class Problems

• Logistic regression: one estimated probability; good for binary classification
• Softmax regression: one estimated probability for each class; predict the class with the highest probability

Page 11:

Maximum Entropy Models (MaxEnt)

• a.k.a. logistic regression, multinomial logistic regression, multiclass logistic regression, or softmax regression

• belongs to the family of classifiers known as log-linear classifiers

• Math proof of "LR = MaxEnt":
– [Klein and Manning 2003]
– [Mount 2011] http://www.win-vector.com/dfiles/LogisticRegressionMaxEnt.pdf

Page 12:

softmax function

converts scores into probabilities in [0, 1]
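The slide's formula did not survive extraction; the standard softmax over class scores z_1, …, z_K is:

\mathrm{softmax}(z)_k = \frac{e^{z_k}}{\sum_{j=1}^{K} e^{z_j}}, \qquad k = 1, \dots, K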

Page 13:

The log-likelihood is a concave function! (so gradient ascent finds a global optimum, not just a local one)

Page 14:

Gradient Ascent

Page 15:

Gradient Ascent

Loop while not converged:
    For all features j, compute and add the derivative ∂L(w)/∂wj

L(w): training-set log-likelihood (objective function)

∂L(w)/∂w: gradient vector
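A minimal batch gradient-ascent sketch for binary logistic regression, filling in the update loop the slide describes (the learning rate eta and convergence test are standard choices, not taken verbatim from the slides):

import numpy as np

def gradient_ascent(X, y, eta=0.1, n_iters=1000, tol=1e-6):
    """Maximize the LR log-likelihood; X is (m, d), y holds labels in {0, 1}."""
    w = np.zeros(X.shape[1])
    for _ in range(n_iters):
        p = 1.0 / (1.0 + np.exp(-X @ w))   # sigma(w . x_i) for every example
        grad = X.T @ (y - p)               # dL/dw_j = sum_i (y_i - p_i) x_ij
        w += eta * grad                    # step uphill on the log-likelihood
        if np.linalg.norm(grad) < tol:     # "while not converged"
            break
    return w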

Page 16:
Page 17:
Page 18:

LR Gradient

logistic function
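The equations on these slides were lost in extraction; for binary LR with labels y_i ∈ {0, 1}, the standard logistic function and log-likelihood gradient are:

\sigma(z) = \frac{1}{1 + e^{-z}}, \qquad
\frac{\partial L(w)}{\partial w_j} = \sum_{i} \bigl( y_i - \sigma(w \cdot x_i) \bigr)\, x_{ij}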

Page 19:

Perceptron vs. Logistic Regression

[Figure: a linear unit combining inputs (features) with weights (parameters); perceptron and logistic regression differ in their choices of activation function.]

Page 20:

Perceptron Algorithm

• Very similar to logistic regression
• Not exactly computing the gradient
• Error-driven!

Initialize weight vector w = 0
Loop for K iterations:
    For all training examples x_i:
        if sign(w · x_i) != y_i:
            w += y_i * x_i
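The same algorithm as a runnable sketch, assuming NumPy arrays and labels y_i ∈ {−1, +1} (a common convention; the slides do not state it):

import numpy as np

def perceptron(X, y, K=10):
    """Binary perceptron; X is (m, d), y holds labels in {-1, +1}."""
    w = np.zeros(X.shape[1])
    for _ in range(K):
        for x_i, y_i in zip(X, y):
            if np.sign(w @ x_i) != y_i:   # mistake made: error-driven update
                w += y_i * x_i
    return w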

Page 21:

MultiClass Perceptron Algorithm

Initialize weight vectors w_y = 0 for every class y
Loop for K iterations:
    For all training examples x_i:
        y_pred = argmax_y w_y * x_i
        if y_pred != y_i:
            w_y_gold += x_i    (increase score for the right answer)
            w_y_pred -= x_i    (decrease score for the wrong answer)
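A runnable version of the multiclass update, keeping one weight vector per class (array shapes and names are illustrative):

import numpy as np

def multiclass_perceptron(X, y, n_classes, K=10):
    """X is (m, d); y holds integer class ids in [0, n_classes)."""
    W = np.zeros((n_classes, X.shape[1]))   # one weight vector w_y per class
    for _ in range(K):
        for x_i, y_i in zip(X, y):
            y_pred = int(np.argmax(W @ x_i))   # argmax_y w_y . x_i
            if y_pred != y_i:
                W[y_i] += x_i      # increase score for the right answer
                W[y_pred] -= x_i   # decrease score for the wrong answer
    return W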

Page 22:

Perceptron vs. Logistic Regression

• The only hyperparameter of the perceptron is the maximum number of iterations (LR also needs a learning rate)

• The perceptron is guaranteed to converge if the data is linearly separable (LR always converges)

Page 23:

Perceptron vs. Logistic Regression

• The Perceptron is an online learning algorithm.
• Logistic Regression is not:

this update is effectively the same as “w += y_i * x_i”

Page 24:

Multinomial LR Gradient

∂L(w)/∂w_j = (empirical feature count) − (expected feature count)
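Written out, with f_j a feature function and the expectation taken under the model's distribution (the standard log-linear form, reconstructed since the slide's equation did not extract):

\frac{\partial L(w)}{\partial w_j}
  = \underbrace{\sum_{i} f_j(x_i, y_i)}_{\text{empirical feature count}}
  - \underbrace{\sum_{i} \sum_{y} P(y \mid x_i)\, f_j(x_i, y)}_{\text{expected feature count}}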

Page 25:

MAP-based learning (Perceptron)

≈ Maximum A Posteriori: approximate the expectation using maximization, i.e. replace the expected feature count with the feature count of the single most likely (argmax) output.

Page 26:

Markov Assumption, Perplexity, Interpolation, Backoff, and Kneser-Ney Smoothing

Language Modeling

Page 27:

Probabilistic Language Modeling

• Goal: compute the probability of a sentence or sequence of words:
P(W) = P(w1, w2, w3, w4, w5, …, wn)

• Related task: probability of an upcoming word:
P(w5 | w1, w2, w3, w4)

• A model that computes either of these, P(W) or P(wn | w1, w2, …, wn−1), is called a language model (LM)

Page 28:

The Chain Rule applied to compute the joint probability of the words in a sentence

Markov Assumption
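The slide's formulas did not survive extraction; the standard chain rule and the k-th-order Markov approximation are:

P(w_1 w_2 \dots w_n) = \prod_{i=1}^{n} P(w_i \mid w_1 \dots w_{i-1})
\;\approx\; \prod_{i=1}^{n} P(w_i \mid w_{i-k} \dots w_{i-1})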

Page 29:

Bigram model

• Condition on the previous word:
P(wi | w1, w2, …, wi−1) ≈ P(wi | wi−1)

Page 30:

Add-one estimation

• Also called Laplace smoothing
• Pretend we saw each word one more time than we did
• Just add one to all the counts!

• MLE estimate:
PMLE(wi | wi−1) = c(wi−1, wi) / c(wi−1)

• Add-1 estimate:
PAdd−1(wi | wi−1) = ( c(wi−1, wi) + 1 ) / ( c(wi−1) + V )
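A minimal sketch of both estimators over a token list, with V the vocabulary size (the function and variable names are illustrative):

from collections import Counter

def bigram_probs(tokens):
    """Return MLE and add-1 conditional probability functions P(w | prev)."""
    unigrams = Counter(tokens[:-1])               # c(w_{i-1}), as a bigram start
    bigrams = Counter(zip(tokens, tokens[1:]))    # c(w_{i-1}, w_i)
    V = len(set(tokens))                          # vocabulary size
    def p_mle(w, prev):
        return bigrams[(prev, w)] / unigrams[prev] if unigrams[prev] else 0.0
    def p_add1(w, prev):
        return (bigrams[(prev, w)] + 1) / (unigrams[prev] + V)
    return p_mle, p_add1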

Page 31:

Add-1 estimation is a blunt instrument

• So add-1 isn’t used for N-grams
– we’ll see better methods

• But add-1 is used to smooth other NLP models
– for text classification
– in domains where the number of zeros isn’t so huge

Page 32:

Better Language Models

• Linear interpolation
• Backoff
• Absolute Discounting
• Kneser-Ney Smoothing

Page 33:

Perplexity

The best language model is one that best predicts an unseen test set
• gives the highest P(sentence)

Perplexity is the inverse probability of the test set, "normalized" by the number of words; it can be expanded with the chain rule, or with bigram probabilities for a bigram model (see the formulas below).

Minimizing perplexity is the same as maximizing probability
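The formulas referenced on the slide, reconstructed in their standard form:

PP(W) = P(w_1 w_2 \dots w_N)^{-\frac{1}{N}}
      = \sqrt[N]{\prod_{i=1}^{N} \frac{1}{P(w_i \mid w_1 \dots w_{i-1})}} \quad \text{(chain rule)}
      \;\approx\; \sqrt[N]{\prod_{i=1}^{N} \frac{1}{P(w_i \mid w_{i-1})}} \quad \text{(bigram)}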

Page 34:

Hidden Markov Models, Maximum Entropy Markov Models (Log-linear Models for Tagging), and the Viterbi Algorithm

Tagging

Page 35:

Tagging (Sequence Labeling)

• Given a sequence (in NLP, words), assign appropriate labels to each word.
• Many NLP problems can be viewed as sequence labeling:
– POS Tagging
– Chunking
– Named Entity Tagging
• Labels of tokens are dependent on the labels of other tokens in the sequence, particularly their neighbors

Plays/VBZ well/RB with/IN others/NNS .

Page 36:
Page 37:
Page 38:

Emission parameters e(x | s)

Trigram transition parameters q(si | si−2, si−1)
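For reference, the standard trigram-HMM factorization these parameters plug into (reconstructed, since the slide's figure did not extract; s_{-1} = s_0 = * are start symbols):

p(x_1 \dots x_n,\; s_1 \dots s_n) = \prod_{i=1}^{n} q(s_i \mid s_{i-2}, s_{i-1})\, e(x_i \mid s_i)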

Page 39:
Page 40:
Page 41:
Page 42:

Andrew Viterbi, 1967

Page 43:
Page 44:
Page 45:

running time for trigram HMM is O(n|S|³)

running time for bigram HMM is O(n|S|²)

Page 46:
Page 47:
Page 48:
Page 49:

Decoding

§ Linear Perceptron:

s* = argmax_s w · Φ(x, s)

§ Features must be local: for x = x1…xm and s = s1…sm,

Φ(x, s) = Σ_{j=1…m} φ(x, j, s_{j−1}, s_j)

§ Define π(i, si) to be the max score of a sequence of length i ending in tag si

§ Viterbi algorithm (HMMs):

π(i, si) = max_{si−1} e(xi | si) q(si | si−1) π(i−1, si−1)

§ Viterbi algorithm (MaxEnt):

π(i, si) = max_{si−1} p(si | si−1, x1…xm) π(i−1, si−1)

§ Viterbi algorithm (linear perceptron):

π(i, si) = max_{si−1} [ w · φ(x, i, si−1, si) + π(i−1, si−1) ]
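A runnable log-space sketch of the bigram-HMM recurrence above, with backpointers to recover the best tag sequence (the dictionary-based parameter encoding and the start symbol are illustrative choices, not from the slides):

import math

def viterbi(words, states, log_q, log_e, start="<s>"):
    """Bigram-HMM Viterbi: pi[i][s] = max score of a sequence ending in tag s.

    log_q[(prev, s)] = log q(s | prev); log_e[(word, s)] = log e(word | s).
    Runs in O(n |S|^2) time for n words and |S| states."""
    pi = [{s: log_q.get((start, s), -math.inf) + log_e.get((words[0], s), -math.inf)
           for s in states}]
    back = [{}]
    for i in range(1, len(words)):
        pi.append({})
        back.append({})
        for s in states:
            # the recurrence on the slide: max over the previous tag s_{i-1}
            prev = max(states, key=lambda t: pi[i - 1][t] + log_q.get((t, s), -math.inf))
            pi[i][s] = (pi[i - 1][prev] + log_q.get((prev, s), -math.inf)
                        + log_e.get((words[i], s), -math.inf))
            back[i][s] = prev
    best = max(states, key=lambda s: pi[-1][s])   # best final tag
    tags = [best]
    for i in range(len(words) - 1, 0, -1):        # follow backpointers
        tags.append(back[i][tags[-1]])
    return list(reversed(tags))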

Page 50: