Ling 570 Day #3 Stemming, Probabilistic Automata, Markov Chains/Model

Page 1: Ling 570 Day #3 Stemming, Probabilistic Automata, Markov Chains/Model

Ling 570

Day #3

Stemming, Probabilistic Automata, Markov Chains/Model

Page 2: Ling 570 Day #3 Stemming, Probabilistic Automata, Markov Chains/Model

2

MORPHOLOGY AND FSTS

Page 3: Ling 570 Day #3 Stemming, Probabilistic Automata, Markov Chains/Model

3

FST as Translator

FR: ce bill met de le baume sur une blessure

EN: this bill puts balm on a sore wound

Last Class

Page 4: Ling 570 Day #3 Stemming, Probabilistic Automata, Markov Chains/Model

4

FST Application Examples

• Case folding:– He said → he said

• Tokenization:– “He ran.” → “ He ran . ”

• POS tagging:– They can fish → PRO VERB NOUN
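Each of these applications maps an input string to an output string. As a rough illustration of the first two mappings (plain string operations standing in for actual finite-state transducers):

```python
import re

def case_fold(text):
    # Case folding: "He said" -> "he said"
    return text.lower()

def tokenize(text):
    # Tokenization: split punctuation off into separate tokens,
    # so '"He ran."' -> '" He ran . "'
    return " ".join(re.findall(r"\w+|[^\w\s]", text))

print(case_fold("He said"))
print(tokenize('"He ran."'))
```

A real FST composes such mappings state by state; these functions only mimic the input/output behavior.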

Page 5: Ling 570 Day #3 Stemming, Probabilistic Automata, Markov Chains/Model

5

FST Application Examples

• Pronunciation:– B AH T EH R → B AH DX EH R

• Morphological generation:– Fox + s → Foxes

• Morphological analysis:– cats → cat + s

Page 6: Ling 570 Day #3 Stemming, Probabilistic Automata, Markov Chains/Model

6

Roadmap

• Motivation:

– Representing words

• A little (mostly English) Morphology

• Stemming

Page 7: Ling 570 Day #3 Stemming, Probabilistic Automata, Markov Chains/Model

7

The Lexicon

• Goal: Represent all the words in a language

• Approach?

Page 8: Ling 570 Day #3 Stemming, Probabilistic Automata, Markov Chains/Model

8

The Lexicon

• Goal: Represent all the words in a language

• Approach?

– Enumerate all words?

Page 9: Ling 570 Day #3 Stemming, Probabilistic Automata, Markov Chains/Model

9

The Lexicon

• Goal: Represent all the words in a language

• Approach?

– Enumerate all words?

• Doable for English

– Typical for ASR (Automatic Speech Recognition)

– English is morphologically relatively impoverished

Page 10: Ling 570 Day #3 Stemming, Probabilistic Automata, Markov Chains/Model

10

The Lexicon

• Goal: Represent all the words in a language

• Approach?

– Enumerate all words?

• Doable for English

– Typical for ASR (Automatic Speech Recognition)

– English is morphologically relatively impoverished

• Other languages?

Page 11: Ling 570 Day #3 Stemming, Probabilistic Automata, Markov Chains/Model

11

The Lexicon

• Goal: Represent all the words in a language

• Approach?

– Enumerate all words?

• Doable for English

– Typical for ASR (Automatic Speech Recognition)

– English is morphologically relatively impoverished

• Other languages?

– Wildly impractical

» Turkish: 40,000 forms/verb;

uygarlaştıramadıklarımızdanmışsınızcasına

“(behaving) as if you are among those whom we could not civilize”

Page 12: Ling 570 Day #3 Stemming, Probabilistic Automata, Markov Chains/Model

12

Morphological Parsing

• Goal: Take a surface word form and generate a linguistic structure of component morphemes

Page 13: Ling 570 Day #3 Stemming, Probabilistic Automata, Markov Chains/Model

13

Morphological Parsing

• Goal: Take a surface word form and generate a linguistic structure of component morphemes

• A morpheme is the minimal meaning-bearing unit in a language.

Page 14: Ling 570 Day #3 Stemming, Probabilistic Automata, Markov Chains/Model

14

Morphological Parsing

• Goal: Take a surface word form and generate a linguistic structure of component morphemes

• A morpheme is the minimal meaning-bearing unit in a language.

– Stem: the morpheme that forms the central meaning unit in a word

– Affix: prefix, suffix, infix, circumfix

Page 15: Ling 570 Day #3 Stemming, Probabilistic Automata, Markov Chains/Model

15

Morphological Parsing

• Goal: Take a surface word form and generate a linguistic structure of component morphemes

• A morpheme is the minimal meaning-bearing unit in a language.

– Stem: the morpheme that forms the central meaning unit in a word

– Affix: prefix, suffix, infix, circumfix

• Prefix: e.g., possible → impossible

Page 16: Ling 570 Day #3 Stemming, Probabilistic Automata, Markov Chains/Model

16

Morphological Parsing

• Goal: Take a surface word form and generate a linguistic structure of component morphemes

• A morpheme is the minimal meaning-bearing unit in a language.

– Stem: the morpheme that forms the central meaning unit in a word

– Affix: prefix, suffix, infix, circumfix

• Prefix: e.g., possible → impossible

• Suffix: e.g., walk → walking

Page 17: Ling 570 Day #3 Stemming, Probabilistic Automata, Markov Chains/Model

17

Morphological Parsing

• Goal: Take a surface word form and generate a linguistic structure of component morphemes

• A morpheme is the minimal meaning-bearing unit in a language.

– Stem: the morpheme that forms the central meaning unit in a word

– Affix: prefix, suffix, infix, circumfix

• Prefix: e.g., possible → impossible

• Suffix: e.g., walk → walking

• Infix: e.g., hingi → humingi (Tagalog)

Page 18: Ling 570 Day #3 Stemming, Probabilistic Automata, Markov Chains/Model

18

Morphological Parsing

• Goal: Take a surface word form and generate a linguistic structure of component morphemes

• A morpheme is the minimal meaning-bearing unit in a language.

– Stem: the morpheme that forms the central meaning unit in a word

– Affix: prefix, suffix, infix, circumfix

• Prefix: e.g., possible → impossible

• Suffix: e.g., walk → walking

• Infix: e.g., hingi → humingi (Tagalog)

• Circumfix: e.g., sagen → gesagt (German)

Page 19: Ling 570 Day #3 Stemming, Probabilistic Automata, Markov Chains/Model

19

Surface Variation & Morphology

• Searching (a la Bing) for documents about:

– Televised sports

Page 20: Ling 570 Day #3 Stemming, Probabilistic Automata, Markov Chains/Model

20

Surface Variation & Morphology

• Searching (a la Bing) for documents about:

– Televised sports

• Many possible surface forms:

– Televised, television, televise, …

– Sports, sport, sporting, …

Page 21: Ling 570 Day #3 Stemming, Probabilistic Automata, Markov Chains/Model

21

Surface Variation & Morphology

• Searching (a la Bing) for documents about:

– Televised sports

• Many possible surface forms:

– Televised, television, televise, …

– Sports, sport, sporting, …

• How can we match?

Page 22: Ling 570 Day #3 Stemming, Probabilistic Automata, Markov Chains/Model

22

Surface Variation & Morphology

• Searching (a la Bing) for documents about:

– Televised sports

• Many possible surface forms:

– Televised, television, televise, …

– Sports, sport, sporting, …

• How can we match?

– Convert surface forms to common base form

• Stemming or morphological analysis

Page 23: Ling 570 Day #3 Stemming, Probabilistic Automata, Markov Chains/Model

23

Two Perspectives

• Stemming:– writing

Page 24: Ling 570 Day #3 Stemming, Probabilistic Automata, Markov Chains/Model

24

Two Perspectives

• Stemming:

– writing → write (or writ)

– Beijing

Page 25: Ling 570 Day #3 Stemming, Probabilistic Automata, Markov Chains/Model

25

Two Perspectives

• Stemming:

– writing → write (or writ)

– Beijing → Beije

• Morphological Analysis:

Page 26: Ling 570 Day #3 Stemming, Probabilistic Automata, Markov Chains/Model

26

Two Perspectives

• Stemming:

– writing → write (or writ)

– Beijing → Beije

• Morphological Analysis:

– writing → write+V+prog

Page 27: Ling 570 Day #3 Stemming, Probabilistic Automata, Markov Chains/Model

27

Two Perspectives

• Stemming:

– writing → write (or writ)

– Beijing → Beije

• Morphological Analysis:

– writing → write+V+prog

– cats → cat + N + pl

– writes → write+V+3rdpers+Sg

Page 28: Ling 570 Day #3 Stemming, Probabilistic Automata, Markov Chains/Model

Stemming

• Simple type of morphological analysis

• Supports matching using base form

• e.g. Television, televised, televising → televise

Page 29: Ling 570 Day #3 Stemming, Probabilistic Automata, Markov Chains/Model

Stemming

• Simple type of morphological analysis

• Supports matching using base form

• e.g. Television, televised, televising → televise

• Most popular: Porter stemmer

Page 30: Ling 570 Day #3 Stemming, Probabilistic Automata, Markov Chains/Model

Stemming

• Simple type of morphological analysis

• Supports matching using base form

• e.g. Television, televised, televising → televise

• Most popular: Porter stemmer

• Task: Given surface form, produce base form

– Typically, removes suffixes

Page 31: Ling 570 Day #3 Stemming, Probabilistic Automata, Markov Chains/Model

Stemming

• Simple type of morphological analysis

• Supports matching using base form

• e.g. Television, televised, televising → televise

• Most popular: Porter stemmer

• Task: Given surface form, produce base form

– Typically, removes suffixes

• Model:

– Rule cascade

– No lexicon!

Page 32: Ling 570 Day #3 Stemming, Probabilistic Automata, Markov Chains/Model

32

Stemming

• Used in many NLP/IR applications

• For building equivalence classes

Connect
Connected
Connecting
Connection
Connections

Porter Stemmer: simple and efficient

Website: http://www.tartarus.org/~martin/PorterStemmer

On patas: ~/dropbox/12-13/570/porter

Same class; suffixes irrelevant

Page 33: Ling 570 Day #3 Stemming, Probabilistic Automata, Markov Chains/Model

Porter Stemmer

• Rule cascade:– Rule form:

• (condition) PATT1 → PATT2

Page 34: Ling 570 Day #3 Stemming, Probabilistic Automata, Markov Chains/Model

Porter Stemmer

• Rule cascade:– Rule form:

• (condition) PATT1 → PATT2

• E.g. stem contains vowel, ING → ε

Page 35: Ling 570 Day #3 Stemming, Probabilistic Automata, Markov Chains/Model

Porter Stemmer

• Rule cascade:– Rule form:

• (condition) PATT1 → PATT2

• E.g. stem contains vowel, ING → ε

• ATIONAL → ATE

Page 36: Ling 570 Day #3 Stemming, Probabilistic Automata, Markov Chains/Model

Porter Stemmer

• Rule cascade:– Rule form:

• (condition) PATT1 → PATT2

• E.g. stem contains vowel, ING → ε

• ATIONAL → ATE

– Rule partial order:

• Step 1a: -s

• Step 1b: -ed, -ing

Page 37: Ling 570 Day #3 Stemming, Probabilistic Automata, Markov Chains/Model

Porter Stemmer

• Rule cascade:– Rule form:

• (condition) PATT1 → PATT2

• E.g. stem contains vowel, ING → ε

• ATIONAL → ATE

– Rule partial order:

• Step 1a: -s

• Step 1b: -ed, -ing

• Steps 2–4: derivational suffixes

Page 38: Ling 570 Day #3 Stemming, Probabilistic Automata, Markov Chains/Model

Porter Stemmer

• Rule cascade:– Rule form:

• (condition) PATT1 → PATT2

• E.g. stem contains vowel, ING → ε

• ATIONAL → ATE

– Rule partial order:

• Step 1a: -s

• Step 1b: -ed, -ing

• Steps 2–4: derivational suffixes

• Step 5: cleanup

• Pros:

Page 39: Ling 570 Day #3 Stemming, Probabilistic Automata, Markov Chains/Model

Porter Stemmer

• Rule cascade:– Rule form:

• (condition) PATT1 → PATT2

• E.g. stem contains vowel, ING → ε

• ATIONAL → ATE

– Rule partial order:

• Step 1a: -s

• Step 1b: -ed, -ing

• Steps 2–4: derivational suffixes

• Step 5: cleanup

• Pros: Simple, fast, buildable for a variety of languages

• Cons:

Page 40: Ling 570 Day #3 Stemming, Probabilistic Automata, Markov Chains/Model

Porter Stemmer

• Rule cascade:– Rule form:

• (condition) PATT1 → PATT2

• E.g. stem contains vowel, ING → ε

• ATIONAL → ATE

– Rule partial order:

• Step 1a: -s

• Step 1b: -ed, -ing

• Steps 2–4: derivational suffixes

• Step 5: cleanup

• Pros: Simple, fast, buildable for a variety of languages

• Cons: Overaggressive and underaggressive
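To make the cascade idea concrete, here is a toy suffix-stripping cascade in the spirit of the rules above (a hand-picked illustrative subset, not the actual five-step Porter algorithm):

```python
import re

VOWELS = set("aeiou")

# (condition, pattern, replacement) rules applied as a cascade:
# the first rule whose pattern matches (and whose condition holds
# for the remaining stem) fires, and the cascade stops.
RULES = [
    (None, r"sses$", "ss"),      # caresses -> caress
    (None, r"ies$", "i"),        # ponies   -> poni
    (None, r"ss$", "ss"),        # caress   -> caress (protect -ss)
    (None, r"s$", ""),           # cats     -> cat
    ("has_vowel", r"ing$", ""),  # walking  -> walk (stem must contain a vowel)
    ("has_vowel", r"ed$", ""),   # plastered -> plaster
    (None, r"ational$", "ate"),  # relational -> relate
]

def stem(word):
    for cond, patt, repl in RULES:
        new, n = re.subn(patt, repl, word)
        if n:
            stem_part = re.sub(patt, "", word)  # what precedes the suffix
            if cond == "has_vowel" and not (set(stem_part) & VOWELS):
                continue                         # condition failed; try next rule
            return new
    return word

print(stem("cats"))     # cat
print(stem("walking"))  # walk
print(stem("sing"))     # sing (no vowel before -ing, left alone)
```

Note that, like the real Porter stemmer, this uses no lexicon: it happily produces non-words such as "poni", which is fine for building equivalence classes.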

Page 41: Ling 570 Day #3 Stemming, Probabilistic Automata, Markov Chains/Model

41

STEMMING & EVAL

Page 42: Ling 570 Day #3 Stemming, Probabilistic Automata, Markov Chains/Model

42

Evaluating Performance

• Measures of stemming performance rely on the metrics used in IR:

– Precision: the proportion of selected items the system got right

• precision = tp / (tp + fp)

• # of correct answers / # of answers given

– Recall: the proportion of the target items the system selected

• recall = tp / (tp + fn)

• # of correct answers / # of possible correct answers

– Rule of thumb: as precision increases, recall drops, and vice versa

• These metrics are widely adopted in statistical NLP

Page 43: Ling 570 Day #3 Stemming, Probabilistic Automata, Markov Chains/Model

43

Precision and Recall

• Take a given stemming task

– Suppose there are 100 words that could be stemmed

– A stemmer gets 52 of these right (tp)

– But it inadvertently stems 10 others (fp)

Precision = 52 / (52 + 10) ≈ .84

Recall = 52 / (52 + 48) = .52
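The arithmetic is a direct application of the definitions; a quick check with the numbers from the worked example (the F-measure, the harmonic mean of the two, appears in a later slide's table):

```python
tp, fp, fn = 52, 10, 48  # 52 correct, 10 spurious stems, 48 missed

precision = tp / (tp + fp)  # correct answers / answers given
recall    = tp / (tp + fn)  # correct answers / possible correct answers
f_measure = 2 * precision * recall / (precision + recall)  # harmonic mean

print(round(precision, 2))  # 0.84
print(round(recall, 2))     # 0.52
```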

Page 44: Ling 570 Day #3 Stemming, Probabilistic Automata, Markov Chains/Model

44

Precision and Recall

• Take a given stemming task

– Suppose there are 100 words that could be stemmed

– A stemmer gets 52 of these right (tp)

– But it inadvertently stems 10 others (fp)

Precision = 52 / (52 + 10) ≈ .84

Recall = 52 / (52 + 48) = .52

Note: easy to get precision of 1.0. Why?

Page 45: Ling 570 Day #3 Stemming, Probabilistic Automata, Markov Chains/Model

45

Example: a baseline tokenization and four tokenizer outputs for the sentence beginning “After coming close to a partial settlement a year ago, shareholders who filed civil suits against Ivan F. Boesky & Co. L.P. …” (the tokenizers disagree on abbreviations such as “F.”, “Co.”, “L.P.” and on the clitic “’s”). Scoring each tokenizer against the baseline:

Tokenizer     Precision   Recall      F-Measure
Tokenizer 1   0.827586    0.888889    0.858238
Tokenizer 2   0.961538    0.925926    0.943732
Tokenizer 3   0.928571    0.962963    0.945767
Tokenizer 4   1           1           1

Page 46: Ling 570 Day #3 Stemming, Probabilistic Automata, Markov Chains/Model

WEIGHTED AUTOMATA & MARKOV CHAINS

Page 47: Ling 570 Day #3 Stemming, Probabilistic Automata, Markov Chains/Model

PFA Definition

• A Probabilistic Finite-State Automaton is a 6-tuple:

– A set of states Q

– An alphabet Σ

– A set of transitions: δ ⊆ Q × Σ × Q

– Initial state probabilities: I: Q → R+

– Transition probabilities: P: δ → R+

– Final state probabilities: F: Q → R+

Page 48: Ling 570 Day #3 Stemming, Probabilistic Automata, Markov Chains/Model

PFA Recap

• Subject to constraints:

– Initial probabilities sum to one: Σq I(q) = 1

– At each state, stopping or leaving exhausts the probability mass: F(q) + Σa,q′ P(q, a, q′) = 1

• Computing sequence probabilities:

– P(x) = sum over accepting paths for x of I(q0) · ∏i P(qi−1, xi, qi) · F(qn)

Page 49: Ling 570 Day #3 Stemming, Probabilistic Automata, Markov Chains/Model

PFA Example

• Example

– I(q0)=1

– I(q1)=0

– F(q0)=0

– F(q1)=0.2

– P(q0,a,q1)=1; P(q1,b,q1) =0.8

– P(abⁿ) = I(q0) · P(q0,a,q1) · P(q1,b,q1)ⁿ · F(q1)

– = 0.8ⁿ · 0.2
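Plugging the example's parameters into the sequence-probability formula, as a minimal sketch:

```python
# Two-state PFA from the example:
# I(q0)=1, P(q0,a,q1)=1, P(q1,b,q1)=0.8, F(q1)=0.2
def prob_ab_n(n):
    """Probability of the string a b^n (an 'a' followed by n 'b's)."""
    return 1.0 * 1.0 * 0.8 ** n * 0.2

for n in range(4):
    print(n, round(prob_ab_n(n), 4))  # 0.2, 0.16, 0.128, 0.1024
```

Note the constraint check: out of q1, the mass splits as P(q1,b,q1) + F(q1) = 0.8 + 0.2 = 1.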

Page 50: Ling 570 Day #3 Stemming, Probabilistic Automata, Markov Chains/Model

Markov Chain

• A Markov Chain is a special case of a PFA in which the sequence uniquely determines which states the automaton will go through.

• Markov Chains cannot represent inherently ambiguous problems

– Can assign probability to unambiguous sequences

Page 51: Ling 570 Day #3 Stemming, Probabilistic Automata, Markov Chains/Model

Markov Chain for Words

Page 52: Ling 570 Day #3 Stemming, Probabilistic Automata, Markov Chains/Model

Markov Chain for Pronunciation

• Observations: 0/1

Page 53: Ling 570 Day #3 Stemming, Probabilistic Automata, Markov Chains/Model

Markov Chain for Walking through Groningen

Page 54: Ling 570 Day #3 Stemming, Probabilistic Automata, Markov Chains/Model

Markov Chain: “First-order observable Markov Model”

• A set of states – Q = q1, q2…qN; the state at time t is qt

Page 55: Ling 570 Day #3 Stemming, Probabilistic Automata, Markov Chains/Model

Markov Chain: “First-order observable Markov Model”

• A set of states – Q = q1, q2…qN; the state at time t is qt

• Transition probabilities:

– a set of probabilities A = a01, a02, …, an1, …, ann

– Each aij represents the probability of transitioning from state i to state j

– The set of these is the transition probability matrix A

Page 56: Ling 570 Day #3 Stemming, Probabilistic Automata, Markov Chains/Model

Markov Chain: “First-order observable Markov Model”

• A set of states – Q = q1, q2…qN; the state at time t is qt

• Transition probabilities:

– a set of probabilities A = a01, a02, …, an1, …, ann

– Each aij represents the probability of transitioning from state i to state j

– The set of these is the transition probability matrix A

• Distinguished start and final states

– q0,qF

Page 57: Ling 570 Day #3 Stemming, Probabilistic Automata, Markov Chains/Model

Markov Chain: “First-order observable Markov Model”

• A set of states – Q = q1, q2…qN; the state at time t is qt

• Transition probabilities:

– a set of probabilities A = a01, a02, …, an1, …, ann

– Each aij represents the probability of transitioning from state i to state j

– The set of these is the transition probability matrix A

• Distinguished start and final states

– q0,qF

• Current state only depends on previous state

Page 58: Ling 570 Day #3 Stemming, Probabilistic Automata, Markov Chains/Model

Markov Models

• The parameters of a MM can be arranged in matrices

• The A-matrix for the set of transition probabilities:

A = [ p11  p12  …  p1j
      p21  p22  …  p2j
      …                ]

Page 59: Ling 570 Day #3 Stemming, Probabilistic Automata, Markov Chains/Model

Markov Models

• The parameters of a MM can be arranged in matrices

• The A-matrix for the set of transition probabilities:

A = [ p11  p12  …  p1j
      p21  p22  …  p2j
      …                ]

• What’s missing?

Page 60: Ling 570 Day #3 Stemming, Probabilistic Automata, Markov Chains/Model

Markov Models

• The parameters of a MM can be arranged in matrices

• The A-matrix for the set of transition probabilities:

A = [ p11  p12  …  p1j
      p21  p22  …  p2j
      …                ]

• What’s missing? Starting probabilities.
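As a concrete instance, here is a hypothetical 2-state A-matrix together with the "missing" starting-probability vector (the numbers are made up for illustration; each row of A, and the starting vector itself, must sum to one):

```python
# A hypothetical 2-state transition matrix A and starting vector pi.
A = [
    [0.7, 0.3],   # p11, p12
    [0.4, 0.6],   # p21, p22
]
pi = [0.9, 0.1]   # starting probabilities, the "missing" parameters

# Sanity checks: probability mass constraints.
assert all(abs(sum(row) - 1.0) < 1e-9 for row in A)
assert abs(sum(pi) - 1.0) < 1e-9

# Probability of the state sequence 1 -> 2 -> 2 (1-indexed):
p = pi[0] * A[0][1] * A[1][1]
print(round(p, 3))  # 0.162
```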

Page 61: Ling 570 Day #3 Stemming, Probabilistic Automata, Markov Chains/Model

Markov Models

• Exercise

– Build the transition probability matrix over this set of data:

The duck died.

The car killed the duck.

The duck died under her car.

We duck under the car.

We retrieve the poor duck.

– Build the starting probability matrix
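One way to work the exercise in code (a sketch; it assumes case-folded tokens and treats the final period as a token):

```python
from collections import Counter, defaultdict

# The exercise corpus, one tokenized sentence per list.
corpus = [
    "the duck died .".split(),
    "the car killed the duck .".split(),
    "the duck died under her car .".split(),
    "we duck under the car .".split(),
    "we retrieve the poor duck .".split(),
]

starts = Counter(s[0] for s in corpus)       # counts of sentence-initial words
bigrams = defaultdict(Counter)               # bigrams[w1][w2] = count of "w1 w2"
for sent in corpus:
    for w1, w2 in zip(sent, sent[1:]):
        bigrams[w1][w2] += 1

def p_start(w):
    # Starting probability: fraction of sentences beginning with w.
    return starts[w] / len(corpus)

def p_trans(w1, w2):
    # MLE transition probability a_{w1,w2} = count(w1 w2) / count(w1 _).
    return bigrams[w1][w2] / sum(bigrams[w1].values())

print(p_start("the"))          # 0.6
print(p_trans("the", "duck"))  # 0.5
```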

Page 62: Ling 570 Day #3 Stemming, Probabilistic Automata, Markov Chains/Model

Markov Models

• Exercise

– Given your model, what’s the probability for each of the following sentences?

The duck died under her car.

We duck under the car.

The duck under the car.

We retrieve killed the duck.

We the poor duck died.

We retrieve the poor duck under the car.

– For a given start state (The, We), what’s the most likely string (of the above)?
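Chaining the starting probability with the transition probabilities gives each sentence's probability under the model; any sentence containing an unseen bigram (e.g., "retrieve killed") gets probability zero. A sketch under the same tokenization assumptions as the exercise:

```python
from collections import Counter, defaultdict

corpus = [
    "the duck died .".split(),
    "the car killed the duck .".split(),
    "the duck died under her car .".split(),
    "we duck under the car .".split(),
    "we retrieve the poor duck .".split(),
]

starts = Counter(s[0] for s in corpus)
bigrams = defaultdict(Counter)
for sent in corpus:
    for w1, w2 in zip(sent, sent[1:]):
        bigrams[w1][w2] += 1

def sentence_prob(sentence):
    words = sentence.lower().split()
    p = starts[words[0]] / len(corpus)          # starting probability
    for w1, w2 in zip(words, words[1:]):
        total = sum(bigrams[w1].values())
        p *= bigrams[w1][w2] / total if total else 0.0
    return p

print(round(sentence_prob("We duck under the car ."), 5))  # 0.00444
print(sentence_prob("We retrieve killed the duck ."))      # 0.0
```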

Page 63: Ling 570 Day #3 Stemming, Probabilistic Automata, Markov Chains/Model
Page 64: Ling 570 Day #3 Stemming, Probabilistic Automata, Markov Chains/Model

HMMs

• Next class