Supervised and semi-supervised learning for NLP


Page 1: Supervised and semi-supervised learning for NLP

John Blitzer

Natural Language Computing Group (自然语言计算组) http://research.microsoft.com/asia/group/nlc/

Page 2: Supervised and semi-supervised learning for NLP

This is an NLP summer school. Why should I care about machine learning?

ACL 2008: 50 of 96 full papers mention learning or statistics in their titles. 4 of 4 outstanding papers propose new learning or statistical inference methods.

Page 3: Supervised and semi-supervised learning for NLP

Input: Product review

Running with Scissors: A Memoir
Title: Horrible book, horrible.
"This book was horrible. I read half of it, suffering from a headache the entire time, and eventually i lit it on fire. One less copy in the world... don't waste your money. I wish i had the time spent reading this book back so i could use it for better purposes. This book wasted my life."

Output: Label (Positive or Negative)

Page 4: Supervised and semi-supervised learning for NLP

From the MSRA Machine Learning Group (机器学习组) http://research.microsoft.com/research/china/DCCUE/ml.aspx

Page 5: Supervised and semi-supervised learning for NLP

[Figure: mapping an un-ranked list of documents to a ranked list.]

Page 6: Supervised and semi-supervised learning for NLP

Input (English sentence): The national track & field championships concluded

Output (Chinese sentence): 全国田径冠军赛结束

Page 7: Supervised and semi-supervised learning for NLP

1) Supervised Learning [2.5 hrs]

2) Semi-supervised learning [3 hrs]

3) Learning bounds for domain adaptation [30 mins]

Page 8: Supervised and semi-supervised learning for NLP

1) Notation and Definitions [5 mins]

2) Generative Models [25 mins]

3) Discriminative Models [55 mins]

4) Machine Learning Examples [15 mins]

Page 9: Supervised and semi-supervised learning for NLP

Training data: labeled pairs $(x_1, y_1), \ldots, (x_n, y_n)$

Use the training data to learn a function $f: x \mapsto y$

Use this function to label unlabeled testing data $x_{n+1}, x_{n+2}, \ldots$, whose labels are unknown

Page 10: Supervised and semi-supervised learning for NLP

[Figure: each review becomes a sparse bag-of-words count vector: the negative review has nonzero counts for features like horrible (3), read_half, and waste; the positive review for features like excellent (2) and loved_it; all other coordinates are 0.]
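Not from the slides: a minimal Python sketch of building such a count vector, with a made-up five-word vocabulary standing in for a real one:

```python
from collections import Counter

# Hypothetical vocabulary; each word is one coordinate of the vector.
VOCAB = ["horrible", "read_half", "waste", "excellent", "loved_it"]

def bag_of_words(tokens):
    """Map a tokenized review to its vector of word counts over VOCAB."""
    counts = Counter(tokens)
    return [counts[w] for w in VOCAB]

print(bag_of_words(["horrible", "horrible", "horrible", "read_half", "waste"]))
# [3, 1, 1, 0, 0]
```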

Page 11: Supervised and semi-supervised learning for NLP
Page 12: Supervised and semi-supervised learning for NLP

Graphical models encode a multivariate probability distribution

Nodes indicate random variables. Edges indicate conditional dependencies.

Page 13: Supervised and semi-supervised learning for NLP

Naïve Bayes generative model: the label generates each word independently

$p(y = -1)\ \ p(\text{horrible} \mid -1)\ \ p(\text{read\_half} \mid -1)\ \ p(\text{waste} \mid -1)$

[Figure: directed graphical model with the label $y$ as the parent of the word nodes horrible, read_half, waste.]

Page 14: Supervised and semi-supervised learning for NLP

Given an unlabeled instance, how can we find its label?

Just choose the most probable label: $\hat{y} = \arg\max_y\ p(y) \prod_j p(x_j \mid y)$

Page 15: Supervised and semi-supervised learning for NLP

Back to labeled training data: $(x_1, y_1), \ldots, (x_n, y_n)$

Set the parameters from counts: $\hat{p}(y) = \frac{\mathrm{count}(y)}{n}$ and $\hat{p}(w \mid y) = \frac{\mathrm{count}(w,\, y)}{\sum_{w'} \mathrm{count}(w',\, y)}$
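As an illustration (my sketch, not code from the lecture), training really is just counting, and prediction is the argmax from the previous slide; the add-one smoothing is an extra assumption to avoid zero probabilities:

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(pairs):
    """Estimate p(y) and p(word | y) from counts; return a predictor."""
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    for words, y in pairs:
        label_counts[y] += 1
        word_counts[y].update(words)
    vocab = {w for c in word_counts.values() for w in c}

    def log_score(words, y):
        # log p(y) + sum_j log p(word_j | y), with add-one smoothing
        lp = math.log(label_counts[y] / sum(label_counts.values()))
        denom = sum(word_counts[y].values()) + len(vocab)
        return lp + sum(math.log((word_counts[y][w] + 1) / denom)
                        for w in words)

    return lambda words: max(label_counts, key=lambda y: log_score(words, y))

predict = train_naive_bayes([(["horrible", "waste"], -1),
                             (["excellent", "loved_it"], +1)])
print(predict(["horrible"]))  # -1
```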

Page 16: Supervised and semi-supervised learning for NLP

Input query: "自然语言处理" ("natural language processing")

Query classification into: Travel, Technology, News, Entertainment, . . .

Training and testing same as in the binary case

Page 17: Supervised and semi-supervised learning for NLP

Why set parameters to counts?
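The standard answer, sketched here (my addition, not from the slides): the count estimates are exactly the maximizers of the training-data log-likelihood. For the class prior, with the constraint $\sum_y \theta_y = 1$:

```latex
\max_{\theta}\ \sum_{y} \mathrm{count}(y) \log \theta_y
\quad\Longrightarrow\quad
\frac{\mathrm{count}(y)}{\theta_y} = \lambda
\quad\Longrightarrow\quad
\hat{\theta}_y = \frac{\mathrm{count}(y)}{n}
```

(differentiating the Lagrangian and using the constraint to solve for the multiplier $\lambda = n$); the same argument gives the count estimates for each $p(w \mid y)$.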

Page 18: Supervised and semi-supervised learning for NLP
Page 19: Supervised and semi-supervised learning for NLP

Predicting broken traffic lights

Lights are broken: both lights are always red. Lights are working: one is red and one is green.

Page 20: Supervised and semi-supervised learning for NLP

Now, suppose both lights are red. What will our model predict?

The model scores each light independently, so with a small prior on "broken" it predicts "working" (see the worked example below). We got the wrong answer. Is there a better model?

The MLE generative model is not the best model!!
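A worked version with an assumed prior (the numbers are mine, not from the slides): say $p(\text{broken}) = 0.1$, and when the lights work each light is equally likely to be the red one. Observing two red lights, the independence assumption gives

```latex
\begin{aligned}
p(\text{working})\, p(r \mid \text{working})\, p(r \mid \text{working})
  &= 0.9 \times 0.5 \times 0.5 = 0.225,\\
p(\text{broken})\, p(r \mid \text{broken})\, p(r \mid \text{broken})
  &= 0.1 \times 1 \times 1 = 0.1,
\end{aligned}
```

so the model predicts "working", even though two red lights are impossible when the lights work.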

Page 21: Supervised and semi-supervised learning for NLP

We can introduce more dependencies, but this can explode the parameter space.

Discriminative models minimize error directly (next).

Further reading:
K. Toutanova. Competitive generative models with structure learning for NLP classification tasks. EMNLP 2006.
A. Ng and M. Jordan. On Discriminative vs. Generative Classifiers: A comparison of logistic regression and naïve Bayes. NIPS 2002.

Page 22: Supervised and semi-supervised learning for NLP

We will focus on linear models: predict with $\hat{y} = \mathrm{sign}(w \cdot x)$

Model training error: $\frac{1}{n} \sum_{i} \mathbf{1}\bigl[\mathrm{sign}(w \cdot x_i) \ne y_i\bigr]$

Page 23: Supervised and semi-supervised learning for NLP

[Plot: 0-1, exp, and hinge losses as functions of the margin score $y(w \cdot x)$.]

0-1 loss (error): NP-hard to minimize over all data points

Exp loss, $\exp(-\text{score})$: minimized by AdaBoost

Hinge loss, $\max(0,\ 1 - \text{score})$: minimized by support vector machines
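A small sketch (mine) of the three losses as functions of the margin score $m = y(w \cdot x)$:

```python
import math

def zero_one_loss(m):   # 1 exactly when the example is misclassified
    return 1.0 if m <= 0 else 0.0

def exp_loss(m):        # the loss AdaBoost minimizes
    return math.exp(-m)

def hinge_loss(m):      # the loss SVMs minimize
    return max(0.0, 1.0 - m)

for m in (-1.0, 0.0, 0.5, 2.0):
    print(m, zero_one_loss(m), exp_loss(m), hinge_loss(m))
```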

Page 24: Supervised and semi-supervised learning for NLP

In NLP, a feature can be a weak learner

Sentiment example: a weak hypothesis might predict positive if the review contains "excellent" and negative otherwise

Page 25: Supervised and semi-supervised learning for NLP

Input: training sample $(x_1, y_1), \ldots, (x_n, y_n)$

(1) Initialize weights $D_1(i) = 1/n$

(2) For $t = 1 \ldots T$:

Train a weak hypothesis $h_t$ to minimize the weighted error $\epsilon_t$ on $D_t$

Set $\alpha_t$ [later]

Update $D_{t+1}(i) \propto D_t(i)\, e^{-\alpha_t y_i h_t(x_i)}$

(3) Output model $f(x) = \mathrm{sign}\bigl(\sum_t \alpha_t h_t(x)\bigr)$
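A compact runnable sketch of these steps (my illustration, not the lecture's code; the weak learners are single-feature decision stumps, in the spirit of "a feature can be a weak learner"):

```python
import math

def adaboost(X, y, T):
    """X: list of sets of active features; y: labels in {-1, +1}."""
    n = len(X)
    D = [1.0 / n] * n                            # (1) uniform initial weights
    features = {f for x in X for f in x}
    model = []                                   # list of (alpha, feature, sign)
    for _ in range(T):                           # (2) for t = 1 ... T
        # Weak hypothesis: predict `sign` if the feature is present, else -sign.
        def weighted_error(f, s):
            return sum(D[i] for i in range(n)
                       if (s if f in X[i] else -s) != y[i])
        eps, f, s = min((weighted_error(f, s), f, s)
                        for f in features for s in (+1, -1))
        eps = min(max(eps, 1e-10), 1 - 1e-10)    # guard against log(0)
        alpha = 0.5 * math.log((1 - eps) / eps)  # the alpha_t set "later"
        model.append((alpha, f, s))
        # Up-weight mistakes, down-weight correct examples, renormalize.
        D = [D[i] * math.exp(-alpha * y[i] * (s if f in X[i] else -s))
             for i in range(n)]
        Z = sum(D)
        D = [d / Z for d in D]

    def predict(x):                              # (3) output: sign of the vote
        vote = sum(a * (s if f in x else -s) for a, f, s in model)
        return 1 if vote >= 0 else -1
    return predict

clf = adaboost([{"excellent"}, {"excellent", "riveting"},
                {"terrible"}, {"awful"}], [+1, +1, -1, -1], T=3)
print(clf({"excellent", "riveting"}))  # 1
```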

Page 26: Supervised and semi-supervised learning for NLP

[Figure: AdaBoost reweighting on four examples: "Excellent book. The_plot was riveting" (+), "Excellent read" (+), "Terrible: The_plot was boring and opaque" (-), "Awful book. Couldn't follow the_plot." (-). Misclassified examples get larger weights on the next round.]

Page 27: Supervised and semi-supervised learning for NLP

Bound on training error [Freund & Schapire 1995]: the training error of the final model is at most $\prod_t Z_t$

We greedily minimize error by minimizing each round's $Z_t$

Page 28: Supervised and semi-supervised learning for NLP

For proofs and a more complete discussionRobert Schapire and Yoram Singer. Improved Boosting Algorithms Using Confidence-rated Predictions. Machine Learning Journal 1998.

Page 29: Supervised and semi-supervised learning for NLP

We chose $\alpha_t$ to minimize $Z_t$. Was that the right choice?

Setting $\partial Z_t / \partial \alpha_t = 0$ gives $\alpha_t = \frac{1}{2} \ln\bigl(\frac{1 - \epsilon_t}{\epsilon_t}\bigr)$

Plugging in our solution for $\alpha_t$, we have $Z_t = 2\sqrt{\epsilon_t (1 - \epsilon_t)}$
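The algebra behind both steps, reconstructed (standard AdaBoost analysis for $\pm 1$-valued weak hypotheses; my addition):

```latex
Z_t = \sum_i D_t(i)\, e^{-\alpha_t y_i h_t(x_i)}
    = (1 - \epsilon_t)\, e^{-\alpha_t} + \epsilon_t\, e^{\alpha_t},
```

since correctly classified examples carry total weight $1 - \epsilon_t$ and mistakes carry weight $\epsilon_t$. Setting the derivative in $\alpha_t$ to zero yields the $\alpha_t$ above, and substituting it back gives $Z_t = 2\sqrt{\epsilon_t(1 - \epsilon_t)}$, which is below 1 whenever $\epsilon_t < 1/2$, so the bound on training error shrinks every round.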

Page 30: Supervised and semi-supervised learning for NLP

[Plot: exp loss vs. hinge loss as functions of the margin score.]

What happens when an example is mis-labeled or an outlier?

Exp loss exponentially penalizes incorrect scores.

Hinge loss linearly penalizes incorrect scores.

Page 31: Supervised and semi-supervised learning for NLP

[Figure: two datasets of + and - points: one linearly separable, one non-separable.]

Page 32: Supervised and semi-supervised learning for NLP

Lots of separating hyperplanes. Which should we choose?

[Figure: several hyperplanes separating the + points from the - points.]

Choose the hyperplane with the largest margin

Page 33: Supervised and semi-supervised learning for NLP

Maximize the margin: the score of the correct label must be greater than the margin $\gamma$, i.e. $y_i (w \cdot x_i) \ge \gamma$ for all $i$, with $\|w\| \le 1$

Why do we fix the norm of $w$ to be less than 1?

Scaling the weight vector doesn't change the optimal hyperplane

Page 34: Supervised and semi-supervised learning for NLP

Equivalent formulation: minimize the norm of the weight vector, with a fixed margin for each example: $\min \frac{1}{2} \|w\|^2$ subject to $y_i (w \cdot x_i) \ge 1$ for all $i$

Page 35: Supervised and semi-supervised learning for NLP

We can't satisfy the margin constraints, but some hyperplanes are better than others

[Figure: non-separable + and - points.]

Page 36: Supervised and semi-supervised learning for NLP

Add slack variables $\xi_i$ to the optimization: allow margin constraints to be violated, but minimize the violation as much as possible
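For reference (a standard identity, my addition): at the optimum each slack variable equals the hinge loss of its example, so the slack formulation is regularized hinge-loss minimization:

```latex
\min_{w,\,\xi}\ \tfrac{1}{2}\|w\|^2 + C \sum_i \xi_i
\ \ \text{s.t.}\ \ y_i (w \cdot x_i) \ge 1 - \xi_i,\ \xi_i \ge 0
\quad\Longleftrightarrow\quad
\min_{w}\ \tfrac{1}{2}\|w\|^2 + C \sum_i \max\bigl(0,\ 1 - y_i (w \cdot x_i)\bigr)
```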

Page 37: Supervised and semi-supervised learning for NLP
Page 38: Supervised and semi-supervised learning for NLP

[Plot: the hinge loss $\max(0,\ 1 - y(w \cdot x))$ has a kink at $y(w \cdot x) = 1$.]

Max creates a non-differentiable point, but there is a subgradient

Subgradient: $-y\,x$ when $y(w \cdot x) < 1$, and $0$ when $y(w \cdot x) > 1$ (anything in between at the kink)

Page 39: Supervised and semi-supervised learning for NLP

Subgradient descent is like gradient descent: also guaranteed to converge, but slow

Pegasos [Shalev-Shwartz and Singer 2007]: sub-gradient descent for a randomly selected subset of examples. Convergence bound: the gap between the objective after $T$ iterations and the best objective value shrinks linearly in $1/T$.
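A minimal Pegasos-style sketch (mine, simplified to single-example updates with the paper's $1/(\lambda t)$ step size; dense lists stand in for the sparse vectors you would use in NLP):

```python
import random

def pegasos(X, y, lam=0.1, T=1000, seed=0):
    """Stochastic sub-gradient descent on the regularized hinge loss."""
    rng = random.Random(seed)
    w = [0.0] * len(X[0])
    for t in range(1, T + 1):
        i = rng.randrange(len(X))
        eta = 1.0 / (lam * t)                      # decaying step size
        margin = y[i] * sum(wj * xj for wj, xj in zip(w, X[i]))
        for j in range(len(w)):
            w[j] *= 1 - eta * lam                  # shrink: regularizer part
            if margin < 1:                         # hinge part is active
                w[j] += eta * y[i] * X[i][j]
    return w

X = [[2.0, 0.0], [1.5, 0.5], [0.0, 2.0], [0.3, 1.5]]
y = [+1, +1, -1, -1]
w = pegasos(X, y)
print([1 if sum(wj * xj for wj, xj in zip(w, x)) >= 0 else -1 for x in X])
# typically [1, 1, -1, -1] on this toy data
```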

Page 40: Supervised and semi-supervised learning for NLP

We've been looking at binary classification, but most NLP problems aren't binary: multiclass linear models give piece-wise linear decision boundaries

We showed 2-dimensional examples, but NLP is typically very high dimensional. Joachims [2000] discusses linear models in high-dimensional spaces

Page 41: Supervised and semi-supervised learning for NLP

Kernels let us efficiently map training data into a high-dimensional feature space, then learn a model which is linear in the new space but non-linear in our original space.

But for NLP, we already have a high-dimensional representation!

Optimization with non-linear kernels is often super-linear in the number of examples.

Page 42: Supervised and semi-supervised learning for NLP

John Shawe-Taylor and Nello Cristianini. Kernel Methods for Pattern Analysis. Cambridge University Press 2004.

Dan Klein and Ben Taskar. Max Margin Methods for NLP: Estimation, Structure, and Applications. ACL 2005 Tutorial.

Ryan McDonald. Generalized Linear Classifiers in NLP. Tutorial at the Swedish Graduate School in Language Technology. 2007.

Page 43: Supervised and semi-supervised learning for NLP

SVMs with slack are noise tolerant. AdaBoost has no explicit regularization and must resort to early stopping.

AdaBoost easily extends to non-linear models. Non-linear optimization for SVMs is super-linear in the number of examples, which can be important for examples with hundreds or thousands of features.

Page 44: Supervised and semi-supervised learning for NLP

Logistic regression: also known as Maximum Entropy

A probabilistic discriminative model which directly models $p(y \mid x)$

A good general machine learning book, on discriminative learning and more: Chris Bishop. Pattern Recognition and Machine Learning. Springer 2006.
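A sketch (mine, not from the slides) of the model and a few likelihood-gradient steps, using labels in $\{-1, +1\}$ so that $p(y \mid x) = \sigma(y\, w \cdot x)$:

```python
import math

def p_y_given_x(w, x, y):
    """Logistic model: p(y | x) = 1 / (1 + exp(-y * w.x))."""
    return 1.0 / (1.0 + math.exp(-y * sum(wj * xj for wj, xj in zip(w, x))))

def gradient_step(w, x, y, lr=0.1):
    """Ascend the log-likelihood: the gradient is (1 - p(y | x)) * y * x."""
    g = (1.0 - p_y_given_x(w, x, y)) * y
    return [wj + lr * g * xj for wj, xj in zip(w, x)]

w = [0.0, 0.0]
for _ in range(100):
    w = gradient_step(w, [3.0, 1.0], +1)
    w = gradient_step(w, [1.0, 3.0], -1)
print(round(p_y_given_x(w, [3.0, 1.0], +1), 3))  # close to 1 after training
```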

Page 45: Supervised and semi-supervised learning for NLP


Page 46: Supervised and semi-supervised learning for NLP

Good features for this model?

(1) How many words are shared between the query and the web page?

(2) What is the PageRank of the web page?

(3) Other ideas?

Page 47: Supervised and semi-supervised learning for NLP

Loss for a query and a pair of documents: the scores of documents with different ranks must be separated by a margin

MSRA Web Search and Mining Group (互联网搜索与挖掘组) http://research.microsoft.com/asia/group/wsm/
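A sketch (mine) of one such pairwise loss: if document $a$ should be ranked above document $b$ for a query, their scores must differ by a margin of at least 1:

```python
def pairwise_ranking_loss(score_a, score_b):
    """Hinge loss on a pair where a should rank above b: max(0, 1 - (s_a - s_b))."""
    return max(0.0, 1.0 - (score_a - score_b))

# Scores would come from a linear model w.x; these numbers are made up.
print(pairwise_ranking_loss(2.3, 0.4))  # 0.0  (separated by more than the margin)
print(pairwise_ranking_loss(1.0, 0.8))  # 0.8  (inside the margin: penalized)
```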

Page 48: Supervised and semi-supervised learning for NLP

http://www.msra.cn/recruitment/