LING 696B: Maximum-Entropy and Random Fields


Page 1: LING 696B: Maximum-Entropy and Random Fields


LING 696B: Maximum-Entropy and Random Fields

Page 2: LING 696B: Maximum-Entropy and Random Fields

Review: two worlds
Statistical models and OT seem to ask different questions about learning UG.
OT: what is possible/impossible? Hard-coded generalizations; combinatorial optimization (sorting).
Statistical: among the things that are possible, what is likely/unlikely? Soft-coded generalizations; numerical optimization.
Marriage of the two?

Page 3: LING 696B: Maximum-Entropy and Random Fields

Review: two worlds
OT: relate possible/impossible patterns in different languages through constraint reranking.
Stochastic OT: consider a distribution over all possible grammars to generate variation.
Today: model the frequency of input/output pairs (among the possible ones) directly, using a powerful model.

Page 4: LING 696B: Maximum-Entropy and Random Fields

Maximum entropy and OT
Imaginary data:
Stochastic OT: let *[+voice] >> Ident(voice) and Ident(voice) >> *[+voice] each hold 50% of the time.
Maximum-Entropy (using positive weights), sketched in code after the tableau:
p([bab]|/bap/) = (1/Z) exp{-(2*w1)}
p([pap]|/bap/) = (1/Z) exp{-(w2)}

/bap/  | P(.) | *[+voice] | Ident(#voi)
[bab]  | .5   | 2         |
[pap]  | .5   |           | 1
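To make the two formulas above concrete, here is a minimal Python sketch (not the course demo code) that turns the violation vectors in the tableau into candidate probabilities; the weight values w1 = 1.0 and w2 = 0.5 are arbitrary placeholders, not learned values.

```python
import math

# Hypothetical positive weights for *[+voice] (w1) and Ident (w2).
w = [1.0, 0.5]

# Violation vectors for the candidates of /bap/, as in the tableau above.
candidates = {
    "bab": [2, 0],   # two *[+voice] violations
    "pap": [0, 1],   # one Ident violation
}

# Unnormalized scores exp{-(w1*x1 + w2*x2)} for each candidate.
scores = {cand: math.exp(-sum(wk * xk for wk, xk in zip(w, x)))
          for cand, x in candidates.items()}

Z = sum(scores.values())                 # the normalization constant
probs = {cand: s / Z for cand, s in scores.items()}
print(probs)
```

With these made-up weights, [pap] comes out more probable than [bab], because two *[+voice] violations cost more than one Ident violation; learning adjusts w1 and w2 until the predicted split matches the observed .5/.5.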

Page 5: LING 696B: Maximum-Entropy and Random Fields

Maximum entropy
Why have Z?
We need a conditional distribution: p([bab]|/bap/) + p([pap]|/bap/) = 1
So Z = exp{-(2*w1)} + exp{-(w2)} (the same for all candidates) -- called the normalization constant
Z can quickly become difficult to compute when the number of candidates is large
Very similar proposal in Smolensky (1986)
How to get w1, w2?
Learned from data by calculating gradients (see the sketch below)
Need: frequency counts and violation vectors (same as stochastic OT)
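Here is one way the “calculating gradients” step could look for the toy /bap/ data, assuming plain gradient ascent on the conditional log-likelihood (the learner behind the actual demo may use a different optimizer); the starting weights and step size are arbitrary.

```python
import math

def candidate_probs(w, violations):
    """p(y|x) proportional to exp{-(w . violations[y])} for each candidate y."""
    scores = {y: math.exp(-sum(wk * xk for wk, xk in zip(w, x)))
              for y, x in violations.items()}
    Z = sum(scores.values())
    return {y: s / Z for y, s in scores.items()}

# Candidates of /bap/ with violation vectors (*[+voice], Ident) and the
# observed relative frequencies from the imaginary data.
violations = {"bab": [2, 0], "pap": [0, 1]}
observed = {"bab": 0.5, "pap": 0.5}

w = [1.0, 0.2]      # arbitrary starting weights
eta = 0.5           # arbitrary step size

for _ in range(2000):
    p = candidate_probs(w, violations)
    for k in range(len(w)):
        model_avg = sum(p[y] * violations[y][k] for y in violations)
        data_avg = sum(observed[y] * violations[y][k] for y in violations)
        # Gradient of the log-likelihood: expected violations under the model
        # minus expected violations in the data.
        w[k] += eta * (model_avg - data_avg)

print(w)                                # ends up with 2*w1 = w2
print(candidate_probs(w, violations))   # matches the observed .5/.5 split
```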

Page 6: LING 696B: Maximum-Entropy and Random Fields

Maximum entropy
Why use exp{.}?
It is like taking a maximum, but “soft” -- easy to differentiate and optimize (illustrated below)
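A tiny illustration (not from the slides) of the “soft maximum” point: as the penalties grow, exp{-penalty} concentrates all probability on the best candidate, but unlike a hard argmax it remains smooth in the weights.

```python
import math

def maxent_probs(costs):
    """Soft version of 'pick the candidate with the lowest cost'."""
    scores = [math.exp(-c) for c in costs]
    Z = sum(scores)
    return [s / Z for s in scores]

costs = [2.0, 1.0]   # penalty scores of two candidates
for scale in (1, 2, 5, 20):
    print(scale, maxent_probs([scale * c for c in costs]))
# As the costs are scaled up, the distribution converges to a hard choice of
# the lowest-cost candidate, yet the function stays differentiable throughout.
```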

Page 7: LING 696B: Maximum-Entropy and Random Fields

Maximum entropy and OT
Inputs are violation vectors, e.g. x = (2, 0) and x = (0, 1)
Outputs are one of K winners -- essentially a classification problem
Violating a constraint works against the candidate: prob ~ exp{-(x1*w1 + x2*w2)}
Crucial difference: candidates are ordered by a single score, not by lexicographic order

/bap/  | P(.) | *[+voice] | Ident(voice)
[bab]  | .5   | 2         |
[pap]  | .5   |           | 1

Page 8: LING 696B: Maximum-Entropy and Random Fields

Maximum entropy
Ordering discrete outputs from input vectors is a common problem: also called logistic regression (recall Nearey)
Explaining the name: let P = p([bab]|/bap/); then log[P/(1-P)] = w2 - 2*w1
The left-hand side is the logistic transform; the right-hand side is a linear regression in the weights
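A quick numerical check of the log-odds identity, using made-up weight values:

```python
import math

w1, w2 = 0.8, 0.3   # arbitrary weights

# P = p([bab]|/bap/) under the Max-Ent model from the earlier slides.
P = math.exp(-2 * w1) / (math.exp(-2 * w1) + math.exp(-w2))
print(math.log(P / (1 - P)))   # approximately -1.3
print(w2 - 2 * w1)             # -1.3: the log-odds are linear in the weights
```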

Page 9: LING 696B: Maximum-Entropy and Random Fields

The power of Maximum Entropy
Max-Ent/logistic regression is widely used in many areas with interacting, correlated inputs
Recall Nearey: phones, diphones, …
NLP: tagging, labeling, parsing, … (anything with a discrete output)
Easy to learn: there is only a global maximum, so optimization is efficient
Isn’t this the greatest thing in the world?
Need to understand the story behind the exp{} (in a few minutes)

Page 10: LING 696B: Maximum-Entropy and Random Fields

Demo: Spanish diminutives
Data from Arbisi-Kelm
Constraints: ALIGN(TE, Word, R), MAX-OO(V), DEP-IO, and BaseTooLittle

Page 11: LING 696B: Maximum-Entropy and Random Fields

Stochastic OT and Max-Ent
Is a better fit always a good thing?

Page 12: LING 696B: Maximum-Entropy and Random Fields

Stochastic OT and Max-Ent
Is a better fit always a good thing?
Should model-fitting become a new fashion in phonology?

Page 13: LING 696B: Maximum-Entropy and Random Fields

The crucial difference
What are the possible distributions of p(.|/bap/) in this case?

/bap/  | P(.) | *[+voice] | Ident(voice)
[bab]  |      | 2         |
[pap]  |      |           | 1
[bap]  |      | 1         |
[pab]  |      | 1         | 1

Page 14: LING 696B: Maximum-Entropy and Random Fields

The crucial difference
What are the possible distributions of p(.|/bap/) in this case?
Max-Ent considers a much wider range of distributions

/bap/  | P(.) | *[+voice] | Ident(voice)
[bab]  |      | 2         |
[pap]  |      |           | 1
[bap]  |      | 1         |
[pab]  |      | 1         | 1

Page 15: LING 696B: Maximum-Entropy and Random Fields

What is Maximum Entropy anyway?
Jaynes (1957): the most ignorant state corresponds to the distribution with the most entropy
Given a die, which distribution has the largest entropy?

Page 16: LING 696B: Maximum-Entropy and Random Fields

What is Maximum Entropy anyway?
Jaynes (1957): the most ignorant state corresponds to the distribution with the most entropy
Given a die, which distribution has the largest entropy?
Add constraints to the distribution: the average of some feature functions is assumed to be fixed:
Σ_x p(x)*fk(x) = observed value of fk (see the sketch below)
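To illustrate the die question together with the fixed-average constraint, here is a small sketch (not from the slides): the maximum-entropy distribution over the six faces with a constrained mean already has the exponential form shown on the next slide. The target average of 4.5 is an arbitrary choice; a target of 3.5 would give back the uniform die.

```python
import math

faces = range(1, 7)
target_mean = 4.5   # assumed "observed value" of the feature f(x) = x

def mean_under(lam):
    """Mean of the distribution p(x) ~ exp{lam * x} over the die faces."""
    weights = [math.exp(lam * x) for x in faces]
    Z = sum(weights)
    return sum(x * wt for x, wt in zip(faces, weights)) / Z

# mean_under is increasing in lam, so solve mean_under(lam) = target_mean
# by bisection.
lo, hi = -10.0, 10.0
for _ in range(100):
    mid = (lo + hi) / 2
    if mean_under(mid) < target_mean:
        lo = mid
    else:
        hi = mid

lam = (lo + hi) / 2
Z = sum(math.exp(lam * x) for x in faces)
print(lam, {x: math.exp(lam * x) / Z for x in faces})
```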

Page 17: LING 696B: Maximum-Entropy and Random Fields

What is Maximum Entropy anyway?
Examples of features: violations, word counts, N-grams, co-occurrences, …
The constraints change the shape of the maximum-entropy distribution
Solve a constrained optimization problem
This leads to p(x) ~ exp{Σ_k wk*fk(x)}
Very general (see later); many possible choices of fk

Page 18: LING 696B: Maximum-Entropy and Random Fields

The basic intuition
Begin as “ignorant” as possible (with maximum entropy), as long as the chosen distribution matches certain “descriptions” of the empirical data (the statistics of the fk(x))
Approximation property: any distribution can be approximated by a max-ent distribution with a sufficient number of features (Cramér and Wold); common practice in NLP
This is better seen as a “descriptive” model

Page 19: LING 696B: Maximum-Entropy and Random Fields

Going towards Markov random fields
Maximum entropy applied to conditional or joint distributions: p(y|x) or p(x,y) ~ exp{Σ_k wk*fk(x,y)}
There are many creative ways of extracting features fk(x,y)
One way is to let a graph structure guide the calculation of features, e.g. neighborhoods/cliques
Known as a Markov network / Markov random field

Page 20: LING 696B: Maximum-Entropy and Random Fields

Conditional random field
Impose a chain-structured graph and assign features to its edges
Still a max-ent model; the same calculation applies (sketched below)
Figure: a chain with node features f(xi, yi) and transition features m(yi, yi+1)
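A brute-force sketch of the chain picture, with made-up labels, features, and weights (a real CRF would use dynamic programming rather than enumerating every label sequence):

```python
import math
from itertools import product

LABELS = ["+voi", "-voi"]   # hypothetical binary labels for each position

def node_score(x_i, y_i):
    # f(x_i, y_i): reward keeping the input value at this position.
    return 1.5 if x_i == y_i else 0.0

def edge_score(y_i, y_next):
    # m(y_i, y_{i+1}): reward adjacent outputs that agree.
    return 0.7 if y_i == y_next else 0.0

def total_score(x, y):
    s = sum(node_score(xi, yi) for xi, yi in zip(x, y))
    s += sum(edge_score(y[i], y[i + 1]) for i in range(len(y) - 1))
    return s

def crf_prob(x, y):
    # Still a max-ent model: exponentiate the summed features and normalize.
    Z = sum(math.exp(total_score(x, y2)) for y2 in product(LABELS, repeat=len(x)))
    return math.exp(total_score(x, y)) / Z

x = ["+voi", "-voi", "+voi"]                   # a three-position input sequence
print(crf_prob(x, ("+voi", "-voi", "+voi")))   # fully faithful output
print(crf_prob(x, ("+voi", "+voi", "+voi")))   # fully agreeing output
```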

Page 21: LING 696B: Maximum-Entropy and Random Fields

Wilson’s idea
Isn’t this a familiar picture in phonology?
Figure: the same chain, with m(yi, yi+1) read as Markedness, f(xi, yi) as Faithfulness, the yi as the surface form, and the xi as the underlying form

Page 22: LING 696B: Maximum-Entropy and Random Fields

The story of smoothing
In Max-Ent models, the weights can get very large and “over-fit” the data (see demo)
It is common to penalize (smooth) this with a new objective function (sketched below):
new objective = old objective + parameter * magnitude of the weights
Wilson’s claim: this smoothing parameter has to do with substantive bias in phonological learning
Constraints that force less similarity --> a higher penalty for them to change value
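A sketch of the smoothed objective for the toy /bap/ example, assuming a simple L2-style penalty with a single smoothing parameter; the slide only specifies “parameter * magnitude of weights”, and a per-constraint penalty would be the natural place to encode the substantive bias just mentioned.

```python
import math

violations = {"bab": [2, 0], "pap": [0, 1]}
observed = {"bab": 0.5, "pap": 0.5}

def log_likelihood(w):
    """The old objective: conditional log-likelihood of the observed data."""
    scores = {y: math.exp(-sum(wk * xk for wk, xk in zip(w, x)))
              for y, x in violations.items()}
    Z = sum(scores.values())
    return sum(observed[y] * math.log(scores[y] / Z) for y in violations)

def smoothed_objective(w, smoothing=1.0):
    # New objective = old objective minus a penalty on the weight magnitudes,
    # which keeps the weights from growing without bound ("over-fitting").
    penalty = smoothing * sum(wk * wk for wk in w)
    return log_likelihood(w) - penalty

print(smoothed_objective([0.28, 0.56]))   # arbitrary example weights
```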

Page 23: LING 696B: Maximum-Entropy and Random Fields

Wilson’s model fit to the velar palatalization data