
Page 1

CS 562: Empirical Methods in Natural Language Processing

Nov 2010

Liang Huang ([email protected])

Unit 3: Structured Learning (part 2)
Discriminative Models: Perceptron & CRF

Thursday, November 18, 2010

Page 2


A Quiz on Forest

Generative models

$$\hat{t} = \arg\max_t P(t \mid w) = \arg\max_t \frac{P(w, t)}{P(w)} = \arg\max_t P(w, t)$$

• more work: you have to learn to generate w
• but can handle incomplete data: $P(w) = \sum_t P(w, t)$


Page 3

The generative story

source-channel (HMM):
$$P(w, t) = P(t) \times P(w \mid t) \approx P(t_1) \prod_{i=2}^{n} P(t_i \mid t_{i-1}) \, P(\text{stop} \mid t_n) \times \prod_{i=1}^{n} P(w_i \mid t_i)$$

direct per-word model:
$$P(t \mid w) \approx \prod_{i=1}^{n} P(t_i \mid w_i)$$

DEFICIENT combination (you generated $t_i$ twice!):
$$P(t \mid w) \approx \prod_{i=1}^{n} P(t_i \mid w_i) \times P(t_1) \prod_{i=2}^{n} P(t_i \mid t_{i-1}) \, P(\text{stop} \mid t_n)$$



Page 5


Perceptron is ...

• an extremely simple algorithm
• almost universally applicable
• and works very well in practice

vanilla perceptron (Rosenblatt, 1958)
structured perceptron (Collins, 2002)

the man bit the dog
DT   NN   VBD  DT   NN


Page 6

Generic Perceptron

• online learning: one example at a time
• learning by doing:
  • find the best output under the current weights
  • update weights at mistakes

[diagram: x_i → inference → z_i; compare z_i with gold y_i to update weights w, which feed back into inference]


Page 7

Example: POS Tagging

• gold-standard:  DT NN VBD DT NN
                  (the man bit the dog)
• current output: DT NN NN DT NN
                  (the man bit the dog)
• assume only two feature classes:
  • tag bigrams t_{i-1} t_i
  • word/tag pairs (w_i, t_i)
• weights ++: (NN, VBD) (VBD, DT) (VBD → bit)
• weights --: (NN, NN) (NN, DT) (NN → bit)

[figure: feature vectors Φ(x, y) for the gold tagging y and Φ(x, z) for the current output z; the update is w ← w + Φ(x, y) − Φ(x, z)]
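To make the update concrete, here is a small sketch (my illustration, not from the slides) that recovers exactly the ++/-- features above by differencing gold and guess feature counts:

```python
from collections import Counter

def features(words, tags):
    """Count tag-bigram and word/tag features of a tagged sentence."""
    feats = Counter()
    for i, (word, tag) in enumerate(zip(words, tags)):
        feats[("bigram", tags[i - 1] if i > 0 else "<s>", tag)] += 1
        feats[("word-tag", word, tag)] += 1
    return feats

words = "the man bit the dog".split()
gold  = ["DT", "NN", "VBD", "DT", "NN"]
guess = ["DT", "NN", "NN", "DT", "NN"]

update = features(words, gold)
update.subtract(features(words, guess))   # w <- w + Phi(x, y) - Phi(x, z)
print({f: c for f, c in update.items() if c != 0})
# +1: (NN,VBD), (VBD,DT), bit/VBD    -1: (NN,NN), (NN,DT), bit/NN
```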



Page 12

Structured Perceptron

[diagram: the same loop, now over structured outputs: x_i → inference → z_i; if z_i ≠ y_i, update weights w]
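A minimal sketch of this loop (Collins, 2002); `decode` (the argmax inference) and `features` (returning a feature-count dict) are assumed helpers, not code from the lecture:

```python
from collections import defaultdict

def structured_perceptron(data, decode, features, epochs=5):
    """data: (x, y) pairs; decode(x, w) returns argmax_z w . features(x, z)."""
    w = defaultdict(float)
    for _ in range(epochs):
        for x, y in data:
            z = decode(x, w)            # inference under the current weights
            if z != y:                  # mistake: move toward gold, away from guess
                for f, v in features(x, y).items():
                    w[f] += v
                for f, v in features(x, z).items():
                    w[f] -= v
    return w
```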


Page 13


Efficiency vs. Expressiveness

• the inference (argmax) must be efficient

• either the search space GEN(x) is small, or factored

• features must be local to y (but can be global to x)

• e.g. bigram tagger, but look at all input words (cf. CRFs)

[diagram: inference computes $z_i = \arg\max_{y \in \text{GEN}(x_i)} w \cdot \Phi(x_i, y)$; weights are updated against the gold $y_i$]
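Because the score factors over tag bigrams, this argmax over GEN(x) is exact Viterbi in O(n·|T|²). A rough sketch under the same weight/feature conventions as above (mine, not the lecture's code):

```python
def viterbi(words, tagset, w):
    """Exact argmax of sum_i [ w[(t_{i-1}, t_i)] + w[(words[i], t_i)] ].
    Features are local to y (tag bigrams) but may look at any part of x."""
    chart = [{t: (w.get(("<s>", t), 0.0) + w.get((words[0], t), 0.0), None)
              for t in tagset}]                       # chart[i][t] = (best score, back-pointer)
    for i in range(1, len(words)):
        chart.append({})
        for t in tagset:
            score, prev = max((chart[i - 1][p][0] + w.get((p, t), 0.0)
                               + w.get((words[i], t), 0.0), p) for p in tagset)
            chart[i][t] = (score, prev)
    tag = max(tagset, key=lambda t: chart[-1][t][0])  # best final tag
    tags = [tag]
    for i in range(len(words) - 1, 0, -1):            # follow back-pointers
        tag = chart[i][tag][1]
        tags.append(tag)
    return tags[::-1]
```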



Page 18


Averaged Perceptron

• more stable and accurate results

• approximation of voted perceptron (Freund & Schapire, 1999)

$$\bar{W} = \frac{1}{j+1} \sum_{j'=0}^{j} W_{j'}$$

(average over all the intermediate weight vectors $W_0, \ldots, W_j$)


Page 19

Averaging Tricks

• Daumé (2006, PhD thesis)
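One common trick in this spirit (this exact bookkeeping is my sketch, not Daumé's code): keep a second, update-time-weighted accumulator, so averaging costs O(1) extra per update instead of summing all W_j at every step:

```python
from collections import defaultdict

class AveragedWeights:
    """w is the current weight vector; u accumulates c * delta, so the
    running average is recovered at the end as w - u / c."""
    def __init__(self):
        self.w = defaultdict(float)
        self.u = defaultdict(float)
        self.c = 1                       # example counter

    def update(self, feat, delta):       # called only on mistakes
        self.w[feat] += delta
        self.u[feat] += self.c * delta

    def tick(self):                      # called once per example
        self.c += 1

    def average(self):
        return {f: v - self.u[f] / self.c for f, v in self.w.items()}
```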


Page 20

Do we need smoothing?

• smoothing is much easier in discriminative models
• just make sure that for each feature template, its subset templates are also included
• e.g., to include (t0 w0 w-1) you must also include
  • (t0 w0) (t0 w-1) (w0 w-1)
  • and maybe also (t0 t-1), because t is less sparse than w
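In code, expanding one conjoined template into its back-offs might look like this (template names are hypothetical, mine):

```python
def template_features(t0, t_1, w0, w_1):
    """Fire the full conjunction plus its subset templates, so rare
    conjunctions back off to better-estimated sub-features."""
    return [
        ("t0,w0,w-1", t0, w0, w_1),   # the full template
        ("t0,w0",     t0, w0),        # its subset templates
        ("t0,w-1",    t0, w_1),
        ("w0,w-1",    w0, w_1),
        ("t0,t-1",    t0, t_1),       # tag-tag: tags are less sparse than words
    ]
```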


Page 21

Comparison with Other Models


Page 22

from HMM to MEMM

HMM: joint distribution. MEMM: locally normalized (per-state conditional).


Page 23

Label Bias Problem

• bias towards states with fewer outgoing transitions
• a problem with all locally normalized models: each state's outgoing probabilities must sum to 1, so a state with few outgoing transitions passes its probability mass along almost regardless of the observation


Page 24

Conditional Random Fields

• globally normalized (no label bias problem)
• but training requires expected feature counts
  • (related to the fractional counts in EM)
• need to use the Inside-Outside algorithm (sum)
• Perceptron just needs Viterbi (max)


Page 25

Experiments


Page 26

Experiments: Tagging

• (almost) identical features from (Ratnaparkhi, 1996)
• trigram tagger: current tag t_i, previous tags t_{i-1}, t_{i-2}
• current word w_i and its spelling features
• surrounding words w_{i-1}, w_{i+1}, w_{i-2}, w_{i+2}, ...
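A sketch of this feature set for one position (my rendering; Ratnaparkhi's actual templates also restrict spelling features to rare words):

```python
def tagging_features(words, i, t, t1, t2):
    """Trigram-tagger features: current tag t, previous tags t1 = t_{i-1}
    and t2 = t_{i-2}, plus word context and spelling features."""
    w = words[i]
    ctx = lambda j: words[j] if 0 <= j < len(words) else "<pad>"
    feats = [("t", t), ("t,t-1", t, t1), ("t,t-1,t-2", t, t1, t2),
             ("t,w0", t, w),
             ("t,w-1", t, ctx(i - 1)), ("t,w+1", t, ctx(i + 1)),
             ("t,w-2", t, ctx(i - 2)), ("t,w+2", t, ctx(i + 2))]
    feats += [("t,prefix", t, w[:3]), ("t,suffix", t, w[-3:]),   # spelling
              ("t,has-digit", t, any(c.isdigit() for c in w)),
              ("t,has-hyphen", t, "-" in w),
              ("t,capitalized", t, w[:1].isupper())]
    return feats
```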


Page 27

Experiments: NP Chunking

• B-I-O scheme
• features:
  • unigram model
  • surrounding words and POS tags

Example: Rockwell/B International/I Corp./I 's/B Tulsa/I unit/I said/O it/B signed/O a/B tentative/I agreement/I ...
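The B-I-O encoding turns chunking into tagging: B opens an NP, I continues it, O is outside. A small sketch (mine) that recovers the NP spans from the tags in the example above:

```python
def bio_to_chunks(words, tags):
    """B starts a chunk, I extends the open one, O closes it."""
    chunks, current = [], []
    for word, tag in zip(words, tags):
        if tag == "B":
            if current:
                chunks.append(current)
            current = [word]
        elif tag == "I" and current:
            current.append(word)
        else:                      # O (or a stray I with no open chunk)
            if current:
                chunks.append(current)
            current = []
    if current:
        chunks.append(current)
    return [" ".join(c) for c in chunks]

words = "Rockwell International Corp. 's Tulsa unit said it signed a tentative agreement".split()
tags = ["B", "I", "I", "B", "I", "I", "O", "B", "O", "B", "I", "I"]
print(bio_to_chunks(words, tags))
# ['Rockwell International Corp.', "'s Tulsa unit", 'it', 'a tentative agreement']
```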


Page 28

Experiments: NP Chunking

• results, (Sha and Pereira, 2003) trigram tagger:
  • voted perceptron: 94.09% vs. CRF: 94.38%


Page 29


Other NLP Applications

• dependency parsing (McDonald et al., 2005)

• parse reranking (Collins)

• phrase-based translation (Liang et al., 2006)

• word segmentation

• ... and many many more ...



Page 30

Theory


Page 31

Vanilla Perceptron

[three slides of figures illustrating the vanilla perceptron; the graphics did not survive extraction]

Page 34

Convergence Theorem

• Data is separable if and only if the perceptron converges
• the number of updates is bounded by $(R/\gamma)^2$
  • $\gamma$ is the margin; $R = \max_i \|x_i\|$
• This result generalizes to the structured perceptron, with $R = \max_i \|\Phi(x_i, y_i) - \Phi(x_i, z_i)\|$
• Also in Collins' paper: theorems for non-separable cases and generalization bounds


Page 35


Perceptron Conclusion

• a very simple framework that can work with many structured problems and that works very well

• all you need is (fast) 1-best inference

• much simpler than CRFs and SVMs

• can be applied to parsing, translation, etc.

• generalization bounds depend on separability

• not the (exponential) size of the search space

• limitation 1: features local on y (non-local on x)

• limitation 2: no probabilistic interpretation (vs. CRF)



Page 36


From HMM to CRF

(Sutton and McCallum, 2007) -- recommended reading


Page 37

Maximum likelihood

$$
\begin{aligned}
L &= \log \prod_{j=1}^{N} P(t^j \mid w^j) \\
  &= \log \prod_{j=1}^{N} \frac{1}{Z(w^j)} \exp \sum_{i=1}^{n} \lambda_i f_i(w^j, t^j) \\
  &= \log \prod_{j=1}^{N} \frac{\exp\!\big(\sum_{i=1}^{n} \lambda_i f_i(w^j, t^j)\big)}{\sum_{t} \exp\!\big(\sum_{i=1}^{n} \lambda_i f_i(w^j, t)\big)} \\
  &= \sum_{j=1}^{N} \Big[ \sum_{i=1}^{n} \lambda_i f_i(w^j, t^j) - \log \sum_{t} \exp \sum_{i=1}^{n} \lambda_i f_i(w^j, t) \Big]
\end{aligned}
$$



Page 39

Optimizing likelihood

$$\frac{\partial L}{\partial \lambda_i} = \sum_{j=1}^{N} \Big[ f_i(w^j, t^j) - E\big[f_i(w^j, t)\big] \Big]$$

Gradient ascent:
$$\lambda_i \leftarrow \lambda_i + \alpha \sum_{j=1}^{N} \Big[ f_i(w^j, t^j) - E\big[f_i(w^j, t)\big] \Big]$$

Stochastic gradient ascent: for each $(w^j, t^j)$:
$$\lambda_i \leftarrow \lambda_i + \alpha \Big[ f_i(w^j, t^j) - E\big[f_i(w^j, t)\big] \Big]$$
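One stochastic step in code, as a rough sketch; `gold_counts` and `expected_counts` are assumed helpers (the latter built from the forward-backward quantities on the next slide), not code from the lecture:

```python
def sgd_step(lambdas, sentence, gold_tags, gold_counts, expected_counts, alpha=0.1):
    """One stochastic gradient-ascent step on the CRF log-likelihood:
    lambda_i += alpha * (observed count of f_i - expected count of f_i)."""
    observed = gold_counts(sentence, gold_tags)     # f_i(w^j, t^j)
    expected = expected_counts(sentence, lambdas)   # E[f_i(w^j, t)] under current lambdas
    for f in set(observed) | set(expected):
        lambdas[f] = (lambdas.get(f, 0.0)
                      + alpha * (observed.get(f, 0.0) - expected.get(f, 0.0)))
    return lambdas
```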


Page 40

Expected feature counts

$$E[f_{V,\text{loves}}] = \sum_{\text{edges } F \to G \text{ labeled } (V,\,\text{loves})} \frac{1}{Z} \, \alpha[F] \; \mu(V, \text{loves}) \; \beta[G], \qquad Z = \alpha[\text{END}]$$

[trellis figure: forward score α[F] into state F, edge score μ(V, loves), backward score β[G] out of state G]
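A sketch of how α, μ, and β combine, assuming `mu(i, p, t)` returns the exponentiated local score of tagging word i with t after tag p (my rendering of the standard forward-backward recipe, not the lecture's code):

```python
from collections import defaultdict

def expected_counts(words, tagset, mu):
    """E[f on edge (p -> t) at position i] = alpha[p] * mu * beta[t] / Z,
    mirroring the slide's alpha[F] mu(V, loves) beta[G] / Z."""
    n = len(words)
    alpha = [defaultdict(float) for _ in range(n)]  # forward: total weight into each state
    beta = [defaultdict(float) for _ in range(n)]   # backward: total weight out of each state
    for t in tagset:
        alpha[0][t] = mu(0, "<s>", t)
        beta[n - 1][t] = 1.0
    for i in range(1, n):
        for t in tagset:
            alpha[i][t] = sum(alpha[i - 1][p] * mu(i, p, t) for p in tagset)
    for i in range(n - 2, -1, -1):
        for t in tagset:
            beta[i][t] = sum(mu(i + 1, t, q) * beta[i + 1][q] for q in tagset)
    Z = sum(alpha[n - 1][t] for t in tagset)        # the slide's Z = alpha[END]
    E = defaultdict(float)
    for i in range(1, n):
        for p in tagset:
            for t in tagset:
                E[(p, t, words[i])] += alpha[i - 1][p] * mu(i, p, t) * beta[i][t] / Z
    return E
```

In practice these sums are done in log space (or with per-position scaling) to avoid underflow.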


Page 41

CRF : SGD : Perceptron

• CRF = conditional random fields = structured maxent
• SGD = stochastic gradient descent = online CRF
• CRF => (online) => SGD => (Viterbi) => Perceptron
• update rules (very similar):
  • CRF: batch, expectation E[...]
  • SGD: one example at a time, expectation E[...]
  • Perceptron: one example at a time, and Viterbi (1-best instead of E[...])
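The contrast in code, as a rough sketch reusing assumed helpers from the earlier sketches (`decode` for Viterbi, `feature_counts` for feature extraction; names are mine):

```python
# CRF:        lambda_i += alpha * sum over all examples of (observed - E[...])
# SGD:        lambda_i += alpha * (observed - E[...]),  one example at a time
# Perceptron: lambda_i += observed - counts(Viterbi 1-best), one example at a time
def perceptron_step(lambdas, sentence, gold_tags, decode, feature_counts):
    z = decode(sentence, lambdas)                  # Viterbi (max), not a sum
    observed = feature_counts(sentence, gold_tags)
    predicted = feature_counts(sentence, z)
    for f in set(observed) | set(predicted):
        lambdas[f] = (lambdas.get(f, 0.0)
                      + observed.get(f, 0.0) - predicted.get(f, 0.0))
    return lambdas
```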
