
Page 1

CS 562: Empirical Methods in Natural Language Processing

Nov 2010

Liang Huang ([email protected])

Unit 3: Structured Learning (part 2)
Discriminative Models: Perceptron & CRF

Thursday, November 18, 2010

Page 2


A Quiz on Forest

Generative models

$$\hat{t} = \arg\max_t P(t \mid w) = \arg\max_t \frac{P(w, t)}{P(w)} = \arg\max_t P(w, t)$$

• more work: you have to learn to generate w
• but can handle incomplete data: $P(w) = \sum_t P(w, t)$


Page 3

The generative story

source-channel (HMM):
$$P(w, t) = P(t) \times P(w \mid t) \approx P(t_1) \prod_{i=2}^{n} P(t_i \mid t_{i-1}) \, P(\text{stop} \mid t_n) \times \prod_{i=1}^{n} P(w_i \mid t_i)$$

direct per-word model:
$$P(t \mid w) \approx \prod_{i=1}^{n} P(t_i \mid w_i)$$

DEFICIENT combination (you generated $t_i$ twice!):
$$P(t \mid w) \approx \prod_{i=1}^{n} P(t_i \mid w_i) \times P(t_1) \prod_{i=2}^{n} P(t_i \mid t_{i-1}) \, P(\text{stop} \mid t_n)$$



Page 5


Perceptron is ...

• an extremely simple algorithm
• almost universally applicable
• and works very well in practice

vanilla perceptron (Rosenblatt, 1958)
structured perceptron (Collins, 2002)

the man bit the dog
DT   NN   VBD  DT   NN


Page 6

Generic Perceptron

• online learning: one example at a time
• learning by doing:
  • find the best output under the current weights
  • update weights at mistakes

[diagram: x_i → inference → z_i; compare z_i with gold y_i to update weights w, which feed back into inference]


Page 7

Example: POS Tagging

• gold-standard:  DT NN VBD DT NN
                  (the man bit the dog)
• current output: DT NN NN DT NN
                  (the man bit the dog)
• assume only two feature classes:
  • tag bigrams t_{i-1} t_i
  • word/tag pairs (w_i, t_i)
• weights ++: (NN, VBD) (VBD, DT) (VBD → bit)
• weights --: (NN, NN) (NN, DT) (NN → bit)

[figure: feature vectors Φ(x, y) for the gold tagging y and Φ(x, z) for the current output z; the update is w ← w + Φ(x, y) − Φ(x, z)]
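To make the update concrete, here is a small sketch (my illustration, not from the slides) that recovers exactly the ++/-- features above by differencing gold and guess feature counts:

```python
from collections import Counter

def features(words, tags):
    """Count tag-bigram and word/tag features of a tagged sentence."""
    feats = Counter()
    for i, (word, tag) in enumerate(zip(words, tags)):
        feats[("bigram", tags[i - 1] if i > 0 else "<s>", tag)] += 1
        feats[("word-tag", word, tag)] += 1
    return feats

words = "the man bit the dog".split()
gold  = ["DT", "NN", "VBD", "DT", "NN"]
guess = ["DT", "NN", "NN", "DT", "NN"]

update = features(words, gold)
update.subtract(features(words, guess))   # w <- w + Phi(x, y) - Phi(x, z)
print({f: c for f, c in update.items() if c != 0})
# +1: (NN,VBD), (VBD,DT), bit/VBD    -1: (NN,NN), (NN,DT), bit/NN
```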



Page 12

Structured Perceptron

[diagram: the same loop, now over structured outputs: x_i → inference → z_i; if z_i ≠ y_i, update weights w]
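A minimal sketch of this loop (Collins, 2002); `decode` (the argmax inference) and `features` (returning a feature-count dict) are assumed helpers, not code from the lecture:

```python
from collections import defaultdict

def structured_perceptron(data, decode, features, epochs=5):
    """data: (x, y) pairs; decode(x, w) returns argmax_z w . features(x, z)."""
    w = defaultdict(float)
    for _ in range(epochs):
        for x, y in data:
            z = decode(x, w)            # inference under the current weights
            if z != y:                  # mistake: move toward gold, away from guess
                for f, v in features(x, y).items():
                    w[f] += v
                for f, v in features(x, z).items():
                    w[f] -= v
    return w
```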


Page 13


Efficiency vs. Expressiveness

• the inference (argmax) must be efficient

• either the search space GEN(x) is small, or factored

• features must be local to y (but can be global to x)

• e.g. bigram tagger, but look at all input words (cf. CRFs)

[diagram: inference computes $z_i = \arg\max_{y \in \text{GEN}(x_i)} w \cdot \Phi(x_i, y)$; weights are updated against the gold $y_i$]
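Because the score factors over tag bigrams, this argmax over GEN(x) is exact Viterbi in O(n·|T|²). A rough sketch under the same weight/feature conventions as above (mine, not the lecture's code):

```python
def viterbi(words, tagset, w):
    """Exact argmax of sum_i [ w[(t_{i-1}, t_i)] + w[(words[i], t_i)] ].
    Features are local to y (tag bigrams) but may look at any part of x."""
    chart = [{t: (w.get(("<s>", t), 0.0) + w.get((words[0], t), 0.0), None)
              for t in tagset}]                       # chart[i][t] = (best score, back-pointer)
    for i in range(1, len(words)):
        chart.append({})
        for t in tagset:
            score, prev = max((chart[i - 1][p][0] + w.get((p, t), 0.0)
                               + w.get((words[i], t), 0.0), p) for p in tagset)
            chart[i][t] = (score, prev)
    tag = max(tagset, key=lambda t: chart[-1][t][0])  # best final tag
    tags = [tag]
    for i in range(len(words) - 1, 0, -1):            # follow back-pointers
        tag = chart[i][tag][1]
        tags.append(tag)
    return tags[::-1]
```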



Page 18


Averaged Perceptron

• more stable and accurate results

• approximation of voted perceptron (Freund & Schapire, 1999)

$$\bar{W} = \frac{1}{j+1} \sum_{j'=0}^{j} W_{j'}$$

(average over all the intermediate weight vectors $W_0, \ldots, W_j$)


Page 19

Averaging Tricks

• Daumé (2006, PhD thesis)
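One common trick in this spirit (this exact bookkeeping is my sketch, not Daumé's code): keep a second, update-time-weighted accumulator, so averaging costs O(1) extra per update instead of summing all W_j at every step:

```python
from collections import defaultdict

class AveragedWeights:
    """w is the current weight vector; u accumulates c * delta, so the
    running average is recovered at the end as w - u / c."""
    def __init__(self):
        self.w = defaultdict(float)
        self.u = defaultdict(float)
        self.c = 1                       # example counter

    def update(self, feat, delta):       # called only on mistakes
        self.w[feat] += delta
        self.u[feat] += self.c * delta

    def tick(self):                      # called once per example
        self.c += 1

    def average(self):
        return {f: v - self.u[f] / self.c for f, v in self.w.items()}
```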


Page 20

Do we need smoothing?

• smoothing is much easier in discriminative models
• just make sure that for each feature template, its subset templates are also included
• e.g., to include (t0 w0 w-1) you must also include
  • (t0 w0) (t0 w-1) (w0 w-1)
  • and maybe also (t0 t-1), because t is less sparse than w
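In code, expanding one conjoined template into its back-offs might look like this (template names are hypothetical, mine):

```python
def template_features(t0, t_1, w0, w_1):
    """Fire the full conjunction plus its subset templates, so rare
    conjunctions back off to better-estimated sub-features."""
    return [
        ("t0,w0,w-1", t0, w0, w_1),   # the full template
        ("t0,w0",     t0, w0),        # its subset templates
        ("t0,w-1",    t0, w_1),
        ("w0,w-1",    w0, w_1),
        ("t0,t-1",    t0, t_1),       # tag-tag: tags are less sparse than words
    ]
```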


Page 21

Comparison with Other Models


Page 22

from HMM to MEMM

HMM: joint distribution. MEMM: locally normalized (per-state conditional).


Page 23

Label Bias Problem

• bias towards states with fewer outgoing transitions
• a problem with all locally normalized models: each state's outgoing probabilities must sum to 1, so a state with few outgoing transitions passes its probability mass along almost regardless of the observation


Page 24

Conditional Random Fields

• globally normalized (no label bias problem)
• but training requires expected feature counts
  • (related to the fractional counts in EM)
• need to use the Inside-Outside algorithm (sum)
• Perceptron just needs Viterbi (max)


Page 25

Experiments


Page 26

Experiments: Tagging

• (almost) identical features from (Ratnaparkhi, 1996)
• trigram tagger: current tag t_i, previous tags t_{i-1}, t_{i-2}
• current word w_i and its spelling features
• surrounding words w_{i-1}, w_{i+1}, w_{i-2}, w_{i+2}, ...
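A sketch of this feature set for one position (my rendering; Ratnaparkhi's actual templates also restrict spelling features to rare words):

```python
def tagging_features(words, i, t, t1, t2):
    """Trigram-tagger features: current tag t, previous tags t1 = t_{i-1}
    and t2 = t_{i-2}, plus word context and spelling features."""
    w = words[i]
    ctx = lambda j: words[j] if 0 <= j < len(words) else "<pad>"
    feats = [("t", t), ("t,t-1", t, t1), ("t,t-1,t-2", t, t1, t2),
             ("t,w0", t, w),
             ("t,w-1", t, ctx(i - 1)), ("t,w+1", t, ctx(i + 1)),
             ("t,w-2", t, ctx(i - 2)), ("t,w+2", t, ctx(i + 2))]
    feats += [("t,prefix", t, w[:3]), ("t,suffix", t, w[-3:]),   # spelling
              ("t,has-digit", t, any(c.isdigit() for c in w)),
              ("t,has-hyphen", t, "-" in w),
              ("t,capitalized", t, w[:1].isupper())]
    return feats
```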


Page 27

Experiments: NP Chunking

• B-I-O scheme
• features:
  • unigram model
  • surrounding words and POS tags

Example: Rockwell/B International/I Corp./I 's/B Tulsa/I unit/I said/O it/B signed/O a/B tentative/I agreement/I ...
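The B-I-O encoding turns chunking into tagging: B opens an NP, I continues it, O is outside. A small sketch (mine) that recovers the NP spans from the tags in the example above:

```python
def bio_to_chunks(words, tags):
    """B starts a chunk, I extends the open one, O closes it."""
    chunks, current = [], []
    for word, tag in zip(words, tags):
        if tag == "B":
            if current:
                chunks.append(current)
            current = [word]
        elif tag == "I" and current:
            current.append(word)
        else:                      # O (or a stray I with no open chunk)
            if current:
                chunks.append(current)
            current = []
    if current:
        chunks.append(current)
    return [" ".join(c) for c in chunks]

words = "Rockwell International Corp. 's Tulsa unit said it signed a tentative agreement".split()
tags = ["B", "I", "I", "B", "I", "I", "O", "B", "O", "B", "I", "I"]
print(bio_to_chunks(words, tags))
# ['Rockwell International Corp.', "'s Tulsa unit", 'it', 'a tentative agreement']
```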


Page 28

Experiments: NP Chunking

• results, (Sha and Pereira, 2003) trigram tagger:
  • voted perceptron: 94.09% vs. CRF: 94.38%


Page 29


Other NLP Applications

• dependency parsing (McDonald et al., 2005)

• parse reranking (Collins)

• phrase-based translation (Liang et al., 2006)

• word segmentation

• ... and many many more ...



Page 30

Theory


Page 31

Vanilla Perceptron

[three slides of figures illustrating the vanilla perceptron; the graphics did not survive extraction]

Page 34

Convergence Theorem

• Data is separable if and only if the perceptron converges
• the number of updates is bounded by $(R/\gamma)^2$
  • $\gamma$ is the margin; $R = \max_i \|x_i\|$
• This result generalizes to the structured perceptron, with $R = \max_i \|\Phi(x_i, y_i) - \Phi(x_i, z_i)\|$
• Also in Collins' paper: theorems for non-separable cases and generalization bounds


Page 35


Perceptron Conclusion

• a very simple framework that can work with many structured problems and that works very well

• all you need is (fast) 1-best inference

• much simpler than CRFs and SVMs

• can be applied to parsing, translation, etc.

• generalization bounds depend on separability

• not the (exponential) size of the search space

• limitation 1: features local on y (non-local on x)

• limitation 2: no probabilistic interpretation (vs. CRF)



Page 36


From HMM to CRF

(Sutton and McCallum, 2007) -- recommended reading


Page 37

Maximum likelihood

$$
\begin{aligned}
L &= \log \prod_{j=1}^{N} P(t^j \mid w^j) \\
  &= \log \prod_{j=1}^{N} \frac{1}{Z(w^j)} \exp \sum_{i=1}^{n} \lambda_i f_i(w^j, t^j) \\
  &= \log \prod_{j=1}^{N} \frac{\exp\!\big(\sum_{i=1}^{n} \lambda_i f_i(w^j, t^j)\big)}{\sum_{t} \exp\!\big(\sum_{i=1}^{n} \lambda_i f_i(w^j, t)\big)} \\
  &= \sum_{j=1}^{N} \Big[ \sum_{i=1}^{n} \lambda_i f_i(w^j, t^j) - \log \sum_{t} \exp \sum_{i=1}^{n} \lambda_i f_i(w^j, t) \Big]
\end{aligned}
$$



Page 39

Optimizing likelihood

$$\frac{\partial L}{\partial \lambda_i} = \sum_{j=1}^{N} \Big[ f_i(w^j, t^j) - E\big[f_i(w^j, t)\big] \Big]$$

Gradient ascent:
$$\lambda_i \leftarrow \lambda_i + \alpha \sum_{j=1}^{N} \Big[ f_i(w^j, t^j) - E\big[f_i(w^j, t)\big] \Big]$$

Stochastic gradient ascent: for each $(w^j, t^j)$:
$$\lambda_i \leftarrow \lambda_i + \alpha \Big[ f_i(w^j, t^j) - E\big[f_i(w^j, t)\big] \Big]$$
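One stochastic step in code, as a rough sketch; `gold_counts` and `expected_counts` are assumed helpers (the latter built from the forward-backward quantities on the next slide), not code from the lecture:

```python
def sgd_step(lambdas, sentence, gold_tags, gold_counts, expected_counts, alpha=0.1):
    """One stochastic gradient-ascent step on the CRF log-likelihood:
    lambda_i += alpha * (observed count of f_i - expected count of f_i)."""
    observed = gold_counts(sentence, gold_tags)     # f_i(w^j, t^j)
    expected = expected_counts(sentence, lambdas)   # E[f_i(w^j, t)] under current lambdas
    for f in set(observed) | set(expected):
        lambdas[f] = (lambdas.get(f, 0.0)
                      + alpha * (observed.get(f, 0.0) - expected.get(f, 0.0)))
    return lambdas
```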


Page 40

Expected feature counts

$$E[f_{V,\text{loves}}] = \sum_{\text{edges } F \to G \text{ labeled } (V,\,\text{loves})} \frac{1}{Z} \, \alpha[F] \; \mu(V, \text{loves}) \; \beta[G], \qquad Z = \alpha[\text{END}]$$

[trellis figure: forward score α[F] into state F, edge score μ(V, loves), backward score β[G] out of state G]
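A sketch of how α, μ, and β combine, assuming `mu(i, p, t)` returns the exponentiated local score of tagging word i with t after tag p (my rendering of the standard forward-backward recipe, not the lecture's code):

```python
from collections import defaultdict

def expected_counts(words, tagset, mu):
    """E[f on edge (p -> t) at position i] = alpha[p] * mu * beta[t] / Z,
    mirroring the slide's alpha[F] mu(V, loves) beta[G] / Z."""
    n = len(words)
    alpha = [defaultdict(float) for _ in range(n)]  # forward: total weight into each state
    beta = [defaultdict(float) for _ in range(n)]   # backward: total weight out of each state
    for t in tagset:
        alpha[0][t] = mu(0, "<s>", t)
        beta[n - 1][t] = 1.0
    for i in range(1, n):
        for t in tagset:
            alpha[i][t] = sum(alpha[i - 1][p] * mu(i, p, t) for p in tagset)
    for i in range(n - 2, -1, -1):
        for t in tagset:
            beta[i][t] = sum(mu(i + 1, t, q) * beta[i + 1][q] for q in tagset)
    Z = sum(alpha[n - 1][t] for t in tagset)        # the slide's Z = alpha[END]
    E = defaultdict(float)
    for i in range(1, n):
        for p in tagset:
            for t in tagset:
                E[(p, t, words[i])] += alpha[i - 1][p] * mu(i, p, t) * beta[i][t] / Z
    return E
```

In practice these sums are done in log space (or with per-position scaling) to avoid underflow.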


Page 41

CRF : SGD : Perceptron

• CRF = conditional random fields = structured maxent
• SGD = stochastic gradient descent = online CRF
• CRF => (online) => SGD => (Viterbi) => Perceptron
• update rules (very similar):
  • CRF: batch, expectation E[...]
  • SGD: one example at a time, expectation E[...]
  • Perceptron: one example at a time, and Viterbi (1-best instead of E[...])
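The contrast in code, as a rough sketch reusing assumed helpers from the earlier sketches (`decode` for Viterbi, `feature_counts` for feature extraction; names are mine):

```python
# CRF:        lambda_i += alpha * sum over all examples of (observed - E[...])
# SGD:        lambda_i += alpha * (observed - E[...]),  one example at a time
# Perceptron: lambda_i += observed - counts(Viterbi 1-best), one example at a time
def perceptron_step(lambdas, sentence, gold_tags, decode, feature_counts):
    z = decode(sentence, lambdas)                  # Viterbi (max), not a sum
    observed = feature_counts(sentence, gold_tags)
    predicted = feature_counts(sentence, z)
    for f in set(observed) | set(predicted):
        lambdas[f] = (lambdas.get(f, 0.0)
                      + observed.get(f, 0.0) - predicted.get(f, 0.0))
    return lambdas
```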
