
Finite-State Technology in Natural Language Processing

Andreas Maletti
Institute for Natural Language Processing
Universität Stuttgart, Germany
[email protected]

Umeå — August 18, 2015

Roadmap

1 Linguistic Basics and Weighted Automata
2 Part-of-Speech Tagging
3 Parsing
4 Machine Translation

Always ask questions right away!

Linguistic Basics

Units
Sentence (syntactic unit expressing a complete thought)
  - Alla sätt är bra utom de dåliga. (All ways are good except the bad ones.)
  - "Are you serious?" she asked.
Clause (grammatically complete syntactic unit)
  - Vännens örfil är ärligt menad, fiendens kyssar vill bedra.
    (The friend's slap is honestly meant; the enemy's kisses want to deceive.)
    (2 main clauses)
  - People who live in glass houses should not throw stones.
    (main clause + relative clause)
Phrases (smaller syntactic units)
  - the green car (noun phrase = noun and its modifiers)
  - killed the snake (verb phrase = verb and its objects)
Token (smallest unit = "word"; often derived from a lexicon entry)
  - house, car, lived, smallest, 45th, STACS, Knuth
  - but tricky: Knuth's vs. Knuth 's; well-known vs. well - known

Linguistic Basics

Tokenization
Splitting text into sentences and tokens
  - relatively simple for English, Swedish, German, etc.
    (actually often via a complicated regular expression)
  - hard for other languages:
    - Chinese: 小洞不补，大洞吃苦。
      (A small hole not plugged will make you suffer a big hole.)
    - Turkish: Çekoslovakyalılastıramadıklarımızdanmıssınız
      (You are said to be one of those that we couldn't manage to convert to a Czechoslovak)
    - Hungarian: legeslegmegszentségteleníttethetetlenebbjeitekként
      (like the most of most undesecratable ones of you)

Example (English sentence-ending full stop)
The RE  . 〈WhiteSpace〉+ [A–Z]  covers most cases (in English).
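
A minimal sketch of this heuristic in Python (the regular expression and the test sentence are illustrative, not the exact pattern used by any particular tokenizer):

import re

# Heuristic sentence boundary: a full stop, whitespace, then an upper-case letter.
# This mirrors the pattern ". <WhiteSpace>+ [A-Z]" from the slide; real tokenizers
# add many exceptions (abbreviations, ordinals, quotes, etc.).
BOUNDARY = re.compile(r'(?<=\.)\s+(?=[A-Z])')

def split_sentences(text):
    return BOUNDARY.split(text)

print(split_sentences("Mr. Smith arrived. He sat down. The talk began."))
# ['Mr.', 'Smith arrived.', 'He sat down.', 'The talk began.']
# Note the wrong split after "Mr." -- exactly the ambiguity discussed on the next slide.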

Linguistic Basics

Example (English)
tokens usually separated by whitespace
sentence end marker "." highly ambiguous:
  - common abbreviations
  - dates, ordinals, and phone numbers

STANFORD tokenizer
  - implemented in JAVA (based on JFlex)
  - compiles RegEx into DFA and runs DFA
  - can process 1,000,000 tokens per second

Weighted Automata

Definition
A weighted automaton is a system (Q, Σ, I, ∆, F, wt) with
  - finite set Q of states
  - input alphabet Σ
  - initial states I ⊆ Q
  - transitions ∆ ⊆ Q × Σ × Q
  - final states F ⊆ Q
  - transition weights wt : ∆ → [0,1]

Example
(figure) The sentence-boundary automaton with states q0, q1, q2, q3 and transitions
  q0 →. q1 (weight 0.5),  q1 →〈WhiteSpace〉 q2 (0.8),
  q2 →〈WhiteSpace〉 q2 (1),  q2 →[A–Z] q3 (0.3)

Weighted Automata

Example
(figure) The sentence-boundary automaton from the previous slide, with
weights 0.5 (.), 0.8 (〈WhiteSpace〉), 1 (〈WhiteSpace〉 loop), and 0.3 ([A–Z]).

Definition (Semantics)
Weight of a run: product of the transition weights
(the run (q0, ., q1)(q1, _, q2)(q2, F, q3) has weight 0.5 · 0.8 · 0.3 = 0.12)
Weight of an input: sum of the weights of all successful runs
(the input "._F" has weight 0.12)
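
A small Python sketch of this semantics, assuming the dictionary encoding of the example automaton shown below (not a library API): it accumulates, state by state, the sum over all runs of the product of transition weights, i.e. the forward-style evaluation mentioned later.

# Weighted automaton as: initial states, final states, and
# weighted transitions (state, symbol) -> list of (next_state, weight).
INITIAL = {"q0"}
FINAL = {"q3"}
TRANS = {
    ("q0", "."): [("q1", 0.5)],
    ("q1", " "): [("q2", 0.8)],
    ("q2", " "): [("q2", 1.0)],
    ("q2", "F"): [("q3", 0.3)],
}

def weight(word):
    # forward algorithm: current[q] = total weight of all runs
    # reading the prefix so far and ending in state q
    current = {q: 1.0 for q in INITIAL}
    for symbol in word:
        nxt = {}
        for state, w in current.items():
            for target, tw in TRANS.get((state, symbol), []):
                nxt[target] = nxt.get(target, 0.0) + w * tw
        current = nxt
    return sum(w for q, w in current.items() if q in FINAL)

print(weight(". F"))  # 0.5 * 0.8 * 0.3 = 0.12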

Part-of-Speech Tagging

Part-of-Speech Tagging

Motivation
Indexing (GOOGLE): which (meaning-carrying) tokens to index in
  "Alla sätt är bra utom de dåliga." (All ways are good except the bad ones.)
Usually nouns, adjectives, and verbs are meaning-carrying.
Part-of-speech tagging
  = the task of assigning a grammatical function to each token
  = the task of determining the word class for each token (of a sentence)

Example
  Alla   sätt   är     bra    utom            de     dåliga
  det.   noun   verb   adj.   subord. conj.   det.   adj.

Part-of-Speech Tagging

Tags (from the PENN tree bank — English)
  DT  = determiner
  NN  = noun (singular or mass)
  JJ  = adjective
  MD  = modal
  VB  = verb (base form)
  VBD = verb (past tense)
  VBG = verb (gerund or present participle)

Example (Tagging exercise)
  show       →  VB
  the show   →  DT NN

Part-of-Speech Tagging

History
1960s
  - manually tagged BROWN corpus (1,000,000 words)
  - tag lists with frequency for each token (e.g., {VB, MD, NN} for can)
  - excluding linguistically implausible sequences (e.g. DT VB)
  - "most common tag" yields 90% accuracy [CHARNIAK, '97]
1980s
  - hidden MARKOV models (HMM)
  - dynamic programming and VITERBI algorithms (wA algorithms)
2000s
  - British National Corpus (100,000,000 words)
  - parsers are better taggers (wTA algorithms)

Markov Model

Statistical approach
Given a sequence w = w1 · · · wk of tokens,
determine the most likely sequence t1 · · · tk of part-of-speech tags
(ti is the tag of wi):

  (t1, . . . , tk) = argmax_(t1,...,tk) p(t1, . . . , tk | w)
                   = argmax_(t1,...,tk) p(t1, . . . , tk, w) / p(w)
                   = argmax_(t1,...,tk) p(t1, . . . , tk, w1, . . . , wk)
                   = argmax_(t1,...,tk) p(t1, w1) · ∏_{i=2}^{k} p(ti, wi | t1, . . . , ti−1, w1, . . . , wi−1)

Markov Model

Modelling as a stochastic process
introduce the event Ei = wi ∩ ti = (wi, ti):

  p(t1, w1) · ∏_{i=2}^{k} p(ti, wi | t1, . . . , ti−1, w1, . . . , wi−1)
    = p(E1) · ∏_{i=2}^{k} p(Ei | E1, . . . , Ei−1)

assume the MARKOV property:

  p(Ei | E1, . . . , Ei−1) = p(Ei | Ei−1) = p(E2 | E1)

Markov Model

Summary
  - initial weights p(E) (not indicated below)
  - transition weights p(E | E′)

(figure) Transition graph over the events (the,DT), (fun,NN), (a,DT), (car,NN)
with the transition probabilities p(E | E′) (values between 0 and 0.4) on the edges.

Markov Model

Maximum likelihood estimation (MLE)
Assume that likelihood = relative frequency in the corpus
  - initial weights p(E): how often does E start a tagged sentence?
  - transition weights p(E | E′): how often does E follow E′?

Problems
  - Vocabulary: ≈ 350,000 English tokens,
    but only 50,000 tokens (14%) in the BROWN corpus
  - Sparsity: (car, NN) (fun, NN) not attested in the corpus, but plausible
    (frequency estimates might be wrong)
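
A minimal sketch of this MLE counting in Python; the toy tagged corpus and the dictionary layout are invented for illustration:

from collections import Counter

# toy tagged corpus: one tagged sentence per list, events are (token, tag) pairs
corpus = [
    [("the", "DT"), ("fun", "NN")],
    [("the", "DT"), ("car", "NN")],
    [("a", "DT"), ("car", "NN")],
]

starts = Counter(sent[0] for sent in corpus)
pairs = Counter((e1, e2) for sent in corpus for e1, e2 in zip(sent, sent[1:]))
prev = Counter(e1 for sent in corpus for e1 in sent[:-1])

def p_initial(e):                 # p(E) = relative frequency as sentence start
    return starts[e] / len(corpus)

def p_transition(e, e_prev):      # p(E | E') = count(E' E) / count(E' followed by anything)
    return pairs[(e_prev, e)] / prev[e_prev] if prev[e_prev] else 0.0

print(p_initial(("the", "DT")))                  # 2/3
print(p_transition(("car", "NN"), ("a", "DT")))  # 1.0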

Transformation into Weighted Automaton

(figures) The MARKOV model over the events (the,DT), (fun,NN), (a,DT), (car,NN)
is unfolded into a weighted automaton with states q0, . . . , q4 whose transitions
are labelled with (token, tag) pairs and their probabilities, e.g. (fun,NN) 0.2,
(car,NN) 0.4, (the,DT) 0.1, and (a,DT) 0.1; the final figure adds four further
transitions, e.g. (the,DT) 0.5 and (a,DT) 0.3, presumably for the initial weights.

Part-of-Speech Tagging

Typical questions
Decoding (or language model evaluation):
Given a model M and a sentence w, determine the probability M1(w)
  - project the labels to their first components
  - evaluate w in the obtained wA M1
  - efficient: initial-algebra semantics (forward algorithm)

Tagging:
Given a model M and a sentence w, determine the best tag sequence t1 · · · tk
  - intersect M with the DFA for w and any tag sequence
  - determine the best run in the obtained wA
  - efficient: VITERBI algorithm
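
A minimal VITERBI sketch in Python for the tagging question. The HMM-style parameter dictionaries (start_p, trans_p, emit_p) and the toy numbers are assumptions made for the example; the slide's formulation via intersection with a DFA is equivalent in spirit but not implemented here.

def viterbi(words, tags, start_p, trans_p, emit_p):
    # best[i][t] = weight of the best tag sequence for words[:i+1] ending in tag t
    best = [{t: start_p.get(t, 0.0) * emit_p.get((words[0], t), 0.0) for t in tags}]
    back = [{}]
    for i in range(1, len(words)):
        best.append({})
        back.append({})
        for t in tags:
            score, prev = max(
                ((best[i - 1][p] * trans_p.get((p, t), 0.0) * emit_p.get((words[i], t), 0.0), p)
                 for p in tags),
                key=lambda x: x[0])
            best[i][t], back[i][t] = score, prev
    # follow the back-pointers from the best final tag
    last = max(best[-1], key=best[-1].get)
    seq = [last]
    for i in range(len(words) - 1, 0, -1):
        seq.append(back[i][seq[-1]])
    return list(reversed(seq))

tags = ["DT", "NN"]
start_p = {"DT": 0.9, "NN": 0.1}
trans_p = {("DT", "NN"): 0.8, ("DT", "DT"): 0.2, ("NN", "NN"): 0.5, ("NN", "DT"): 0.5}
emit_p = {("the", "DT"): 1.0, ("show", "NN"): 0.6}
print(viterbi(["the", "show"], tags, start_p, trans_p, emit_p))  # ['DT', 'NN']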

Part-of-Speech Tagging

Typical questions
(Weight) Induction (or MLE training):
Given an NFA (Q, Σ, I, ∆, F) and a sequence w1, . . . , wk of tagged sentences wi ∈ Σ∗,
determine transition weights wt : ∆ → [0,1] such that ∏_{i=1}^{k} Mwt(wi) is maximal,
where Mwt = (Q, Σ, I, ∆, F, wt)
  - no closed-form solution (in general), but many approximations
  - efficient: hill-climbing methods (EM, simulated annealing, etc.)

Learning (or HMM induction):
Given an NFA (Q, Σ, I, ∆, F) and a sequence w1, . . . , wk of untagged sentences wi,
determine transition weights wt : ∆ → [0,1] such that ∏_{i=1}^{k} (Mwt)1(wi) is maximal,
where Mwt = (Q, Σ, I, ∆, F, wt)
  - no exact solution (in general), but many approximations
  - efficient: hill-climbing methods (EM, simulated annealing, etc.)

Part-of-Speech Tagging

Issues
wA too big (in comparison to the training data)
  - cannot reliably estimate that many probabilities p(E | E′)
  - simplify the model, e.g., assume the transition probability only depends on the tags:
      p((w, t) | (w′, t′)) = p(t | t′)

unknown words
  - no statistics on words that do not occur in the corpus
  - allow only assignment of open tags
    (open tag = potentially unbounded number of elements, e.g. NNP)
    (closed tag = fixed finite number of elements, e.g. DT or PRP)
  - use morphological clues (capitalization, affixes, etc.)
  - use context to disambiguate
  - use "global" statistics

Part-of-Speech Tagging

TCS contributions
  - efficient evaluation and complexity considerations
    (initial-algebra semantics, best runs, best strings, etc.)
  - model simplifications (trimming, determinization, minimization, etc.)
  - model transformations (projection, intersection, RegEx-to-DFA, etc.)
  - model induction (grammar induction, weight training, etc.)

Parsing

Parsing

Motivation
(syntactic) parsing = determining the syntactic structure of a sentence
important in several applications:
  - co-reference resolution
    (determining which noun phrases refer to the same object/concept)
  - comprehension (determining the meaning)
  - speech repair and sentence-like unit detection in speech
    (speech offers no punctuation; it needs to be predicted)

Parsing

(figure) Parse tree for "We must bear in mind the Community as a whole":
(S (NP (PRP We))
   (VP (MD must)
       (VP (VB bear)
           (PP (IN in) (NP (NN mind)))
           (NP (NP (DT the) (NN Community))
               (PP (IN as) (NP (DT a) (NN whole)))))))

Trees

Finite sets Σ and W

Definition
The set TΣ(W) of Σ-trees indexed by W is the smallest set T such that
  - w ∈ T for all w ∈ W
  - σ(t1, . . . , tk) ∈ T for all k ∈ N, σ ∈ Σ, and t1, . . . , tk ∈ T

Notes
obvious recursion & induction principle
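
A tiny sketch of this tree type and its recursion principle in Python (the encoding is illustrative): a tree is either a leaf from W or a pair of a symbol from Σ and a list of subtrees, and a fold realizes the recursion principle (in the spirit of the initial-algebra semantics used later).

# A Sigma-tree indexed by W: either a leaf w in W, or (sigma, [subtrees]).
def leaf(w):
    return w

def node(sigma, children):
    return (sigma, children)

def fold(tree, on_leaf, on_node):
    # structural recursion over T_Sigma(W)
    if isinstance(tree, tuple):
        sigma, children = tree
        return on_node(sigma, [fold(c, on_leaf, on_node) for c in children])
    return on_leaf(tree)

t = node("S", [node("NP", [leaf("We")]), node("VP", [leaf("sleep")])])
yield_ = fold(t, lambda w: [w], lambda s, kids: [w for k in kids for w in k])
height = fold(t, lambda w: 0, lambda s, kids: 1 + max(kids, default=0))
print(yield_, height)  # ['We', 'sleep'] 2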

Parsing

Problem
  - assume a hidden g : W∗ → TΣ(W)                 (reference parser)
  - given a finite set T ⊆ TΣ(W) generated by g    (training set)
  - develop a system representing f : W∗ → TΣ(W)   (parser)
    approximating g

Clarification
  - T generated by g ⟺ T = g(L) for some finite L ⊆ W∗
  - for the approximation quality we could use |{w ∈ W∗ | f(w) = g(w)}|

Parsing

Short history
before 1990
  - hand-crafted rules based on POS tags (unlexicalized parsing)
  - corrections and selection by human annotators
1990s
  - PENN tree bank (1,000,000 words)
  - weighted local tree grammars (weighted CFG) as parsers (often still unlexicalized)
  - WALL STREET JOURNAL tree bank (30,000,000 words)
since 2000
  - weighted tree automata (weighted CFG with latent variables)
  - lexicalized parsers

Weighted Local Tree Grammars

(figure) Two example trees:
(S (NP (PRP$ My) (NN dog)) (VP (VBZ sleeps)))   and
(S (NP (PRP I)) (VP (VBD scored) (ADVP (RB well))))

LTG production extraction
simply read off the CFG productions:

  S    → NP VP        NP   → PRP$ NN
  PRP$ → My           NN   → dog
  VP   → VBZ          VBZ  → sleeps
  NP   → PRP          PRP  → I
  VP   → VBD ADVP     VBD  → scored
  ADVP → RB           RB   → well

Weighted Local Tree Grammars

Observations
  - LTG offer a unique explanation on the tree level
    (rules observable in the training data; as for POS tagging)
  - but ambiguity on the string level
    (i.e., on unannotated data; as for POS tagging)
  → weighted productions

Illustration
(figure) Two parses of "We saw her duck":
(S (NP (PRP We)) (VP (VBD saw) (NP (PRP$ her) (NN duck))))   vs.
(S (NP (PRP We)) (VP (VBD saw) (S-BAR (S (NP (PRP her)) (VP (VBP duck))))))

Weighted Local Tree Grammars

Definition
A weighted local tree grammar (wLTG) is a weighted CFG G = (N, W, S, P, wt) with
  - finite set N (nonterminals)
  - finite set W (terminals)
  - S ⊆ N (start nonterminals)
  - finite set P ⊆ N × (N ∪ W)∗ (productions)
  - mapping wt : P → [0,1] (weight assignment)

It computes the weighted derivation trees of the wCFG.

Weighted Local Tree Grammars

(figure) The same two example trees as before (My dog sleeps / I scored well)

wLTG production extraction
simply read off the CFG productions and keep counts:

  S    → NP VP      (2)    NP   → PRP$ NN   (1)
  PRP$ → My         (1)    NN   → dog       (1)
  VP   → VBZ        (1)    VBZ  → sleeps    (1)
  NP   → PRP        (1)    PRP  → I         (1)
  VP   → VBD ADVP   (1)    VBD  → scored    (1)
  ADVP → RB         (1)    RB   → well      (1)

Weighted Local Tree Grammars

wLTG production extraction
normalize the counts (here by left-hand side):

  S    —1→   NP VP     (2)
  NP   —0.5→ PRP$ NN   (1)    NP   —0.5→ PRP        (1)
  PRP$ —1→   My        (1)    NN   —1→   dog        (1)
  VP   —0.5→ VBZ       (1)    VP   —0.5→ VBD ADVP   (1)
  VBZ  —1→   sleeps    (1)    PRP  —1→   I          (1)
  VBD  —1→   scored    (1)    ADVP —1→   RB         (1)
  RB   —1→   well      (1)
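
A compact sketch of this extraction and normalization in Python; the (label, children) tuple encoding of trees follows the earlier sketch, and the resulting weights match the table above:

from collections import Counter, defaultdict

t1 = ("S", [("NP", [("PRP$", ["My"]), ("NN", ["dog"])]), ("VP", [("VBZ", ["sleeps"])])])
t2 = ("S", [("NP", [("PRP", ["I"])]), ("VP", [("VBD", ["scored"]), ("ADVP", [("RB", ["well"])])])])

def productions(tree, counts):
    label, children = tree
    rhs = tuple(c if isinstance(c, str) else c[0] for c in children)
    counts[(label, rhs)] += 1
    for c in children:
        if not isinstance(c, str):
            productions(c, counts)

counts = Counter()
for t in (t1, t2):
    productions(t, counts)

# normalize by left-hand side: weight(A -> rhs) = count(A -> rhs) / count(A -> anything)
lhs_total = defaultdict(int)
for (lhs, rhs), n in counts.items():
    lhs_total[lhs] += n
weights = {rule: n / lhs_total[rule[0]] for rule, n in counts.items()}

print(weights[("S", ("NP", "VP"))])      # 1.0
print(weights[("NP", ("PRP$", "NN"))])   # 0.5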

Weighted Local Tree Grammars

Weighted parses
(figure) The two example trees (My dog sleeps / I scored well) each receive weight 0.25.

Weighted LTG productions (only the productions with weight ≠ 1):

  NP —0.5→ PRP$ NN     NP —0.5→ PRP
  VP —0.5→ VBZ         VP —0.5→ VBD ADVP

Parser Evaluation

(figure) BERKELEY parser output (used as the reference) and CHARNIAK-JOHNSON parser
output for "We must bear in mind the Community as a whole"; the two parses differ only
in the noun phrase "the Community": the reference contains the inner constituent
(NP (DT the) (NN Community)), while the CHARNIAK-JOHNSON output tags Community as NNP
and omits that inner NP.

Parser Evaluation

Definition (ParseEval measure)
  - precision = number of correct constituents
    (heading the same phrase as in the reference)
    divided by the number of all constituents in the parse
  - recall = number of correct constituents
    divided by the number of all constituents in the reference
  - (weighted) harmonic mean

      Fα = (1 + α²) · precision · recall / (α² · precision + recall)

Parser Evaluation

Reference vs. parser output
(figure) The reference parse and the CHARNIAK-JOHNSON output for
"We must bear in mind the Community as a whole" from the previous slide;
the parser output misses the inner constituent (NP (DT the) (NN Community)).

  precision = 9/9  = 100%
  recall    = 9/10 = 90%
  F1 = 2 · (1 · 0.9)/(1 + 0.9) ≈ 95%
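
A small sketch of this computation in Python. Constituents are encoded as (label, start, end) spans; the two sets below are a hand-made encoding of the example, just detailed enough to reproduce the numbers:

def parseval(reference, output):
    correct = len(reference & output)
    precision = correct / len(output)
    recall = correct / len(reference)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# constituents as (label, start, end) spans over
# "We must bear in mind the Community as a whole"
reference = {("S", 0, 10), ("NP", 0, 1), ("VP", 1, 10), ("VP", 2, 10),
             ("PP", 3, 5), ("NP", 4, 5), ("NP", 5, 10), ("NP", 5, 7),
             ("PP", 7, 10), ("NP", 8, 10)}
output = reference - {("NP", 5, 7)}   # the parser output misses one constituent

print(parseval(reference, output))    # (1.0, 0.9, 0.947...) -- the 95% on the slide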

Parser Evaluation

Standardized setup
  - training data: PENN treebank Sections 2–21 (articles from the WALL STREET JOURNAL)
  - development test data: PENN treebank Section 22
  - evaluation data: PENN treebank Section 23

Experiment [POST, GILDEA, '09]

  grammar model   precision   recall   F1
  wLTG            75.37       70.05    72.61

These are bad compared to the state of the art!

Parser Evaluation

State-of-the-art models
  - context-free grammars with latent variables (CFGlv)
    [COLLINS, '99], [KLEIN, MANNING, '03], [PETROV, KLEIN, '07]
  - tree substitution grammars with latent variables (TSGlv)
    [SHINDO et al., '12]
    (both as expressive as weighted tree automata)
  - other models

Experiment [SHINDO et al., '12]

  grammar model                    F1
  wLTG = wCFG                      72.6
  wTSG   [COHN et al., 2010]       84.7
  wCFGlv [PETROV, 2010]            91.8
  wTSGlv [SHINDO et al., 2012]     92.4

Grammars with Latent Variables

Definition
A grammar with latent variables (grammar with relabeling) consists of
  - a grammar G generating L(G) ⊆ TΣ(W)
  - a (total) mapping ρ : Σ → ∆ (functional relabeling)

Definition (Semantics)
  L(G, ρ) = ρ(L(G)) = {ρ(t) | t ∈ L(G)}

Language class: REL(L) for a language class L

Weighted Tree Automata

Definition
A weighted tree automaton (wTA) is a system G = (Q, N, W, S, P, wt) with
  - finite set Q (states)
  - finite set N (nonterminals)
  - finite set W (terminals)
  - S ⊆ Q (start states)
  - finite set P ⊆ (Q × N × (Q ∪ W)+) ∪ (Q × W) (productions)
  - mapping wt : P → [0,1] (weight assignment)

A production (q, n, w1, . . . , wk) is often written q → n(w1, . . . , wk).

Grammars with Latent Variables

Theorem
REL(wLTL) = REL(wTSL) = wRTL

(figure) Hierarchy: wRTL [wTA] on top; below it REL(wTSL) [wTSGlv];
below that wTSL [wTSG] and REL(wCFL) [wCFGlv]; at the bottom wCFL [wCFG].

here: latent variables ≈ finite-state

Parsing

Typical questions
Decoding (or language model evaluation):
Given a model M and a sentence w, determine the probability M(w)
  - intersect M with the DTA for w and any parse
  - evaluate w in the obtained wTA
  - efficient: initial-algebra semantics (forward algorithm)

Parsing:
Given a model M and a sentence w, determine the best parse t for w
  - intersect M with the DTA for w and any parse
  - determine the best tree in the obtained wTA
  - efficient: none (NP-hard even for wLTG)

Parsing

Statistical parsing approach
Given a wLTG M and a sentence w, return the highest-scoring parse for w.

Consequence
The first parse should be preferred ("duck" is more frequently a noun, etc.)

(figure) The two parses of "We saw her duck" from before:
(S (NP (PRP We)) (VP (VBD saw) (NP (PRP$ her) (NN duck))))   vs.
(S (NP (PRP We)) (VP (VBD saw) (S-BAR (S (NP (PRP her)) (VP (VBP duck))))))

Parsing

TCS contributions
  - efficient evaluation and complexity considerations
    (initial-algebra semantics, best runs, best trees, etc.)
  - model simplifications (trimming, determinization, minimization, etc.)
  - model transformations (intersection, normalization, lexicalization, etc.)
  - model induction (grammar induction, weight training, spectral learning, etc.)

Parsing

NLP contribution to TCS
  - good source of (relevant) problems
  - good source of practical techniques (e.g., fine-to-coarse decoding)
  - good source of (relevant) large wTA:

      language   states   non-lexical productions
      English    1,132    1,842,218
      Chinese    994      1,109,500
      German     981      616,776

Machine Translation

Machine Translation

Applications
  - Technical manuals
  - US military

Example (An mp3 player)
The synchronous manifestation of lyrics is a procedure for can broadcasting the
music, waiting the mp3 file at the same time showing the lyrics. With the this
kind method that the equipments that synchronous function of support up broadcast
to make use of document create setup, you can pass the LCD window way the check
at the document contents that broadcast. That procedure returns offerings to have
to modify, and delete, and stick top, keep etc. edit function.

Machine Translation

Applications
  - Technical manuals
  - US military

Example (Speech-to-text [JONES et al., '09])
  E: Okay, what is your name?
  A: Abdul.
  E: And your last name?
  A: Al Farran.
  E: Okay, what's your name?
  A: milk a mechanic and I am here
     I mean yes
  E: What is your last name?
  A: every two weeks
     my son's name is ismail

Machine Translation

(figure) VAUQUOIS triangle with the levels phrase, syntax, and semantics
between the foreign language and German.

Translation model: string-to-tree, then tree-to-tree.

Machine Translation

Training data
  - parallel corpus
  - word alignments
  - parse trees for the target sentences

Parallel corpus
linguistic resource containing example translations (at the sentence level)

Machine Translation

(figure) Parallel corpus entry with word alignments and a parse tree:
English "I would like your advice about Rule 143 concerning inadmissibility"
aligned to German "Könnten Sie mir eine Auskunft zu Artikel 143 im Zusammenhang
mit der Unzulässigkeit geben", with German POS tags (KOUS, PPER, ART, NN, APPR,
CD, VV, . . . ) and the phrases PP, NP, and S on the German side.

  - word alignments via GIZA++ [OCH, NEY, '03]
  - parse trees via the BERKELEY parser [PETROV et al., '06]

Extended Tree Transducer

Extended top-down tree transducer (STSG)
  - variant of [M., GRAEHL, HOPKINS, KNIGHT, '09]
  - rules of the form NT → (r, r1) for a nonterminal NT
    - right-hand side r of a context-free grammar rule
    - right-hand side r1 of a regular tree grammar rule
  - (bijective) synchronization of the nonterminals

Example rule (figure)
A rule S → (r, r1) for the running example: the string side r is roughly
"PPER would like PPER advice PP", the tree side r1 is the German subtree
(S (KOUS Konnten) PPER PPER (NP (ART eine) (NN Auskunft)) PP (VV geben)),
with the remaining nonterminals PPER, PPER, and PP synchronized across the two sides.

Extended Tree Transducer

Rule application
  1 Selection of synchronous nonterminals
  2 Selection of a suitable rule
  3 Replacement on both sides

(figures) Starting from the example rule above, the synchronized pair of KOUS
nonterminals is rewritten with the rule KOUS → (would like, (KOUS Konnten)),
and the synchronized pair of PP nonterminals with the rule
PP → (APPR NN CD PP, (PP APPR NN CD PP)); each application replaces the chosen
nonterminal on the string side and on the tree side simultaneously.

Rule extraction

following [GALLEY, HOPKINS, KNIGHT, MARCU, '04]

(figure) The aligned sentence pair
"I would like your advice about Rule 143 concerning inadmissibility" /
"Könnten Sie mir eine Auskunft zu Artikel 143 im Zusammenhang mit der
Unzulässigkeit geben" with its German parse tree; the extractable rules
are marked in red.

Rule extraction

Removal of an extractable rule:
(figure) After removing an extractable rule, the remaining aligned pair is
"PPER would like your advice about Rule 143 PP" /
"Könnten Sie eine Auskunft zu Artikel 143 geben" with its reduced parse tree.

Rule extraction

Repeated rule extraction:
(figure) Rule extraction is repeated on the reduced pair
"PPER would like your advice about Rule 143 PP" /
"Könnten Sie eine Auskunft zu Artikel 143 geben";
the extractable rules are again marked in red.

Extended Tree Transducer

Advantages
  - very simple
  - implemented in MOSES [KOEHN et al., '07]
  - "context-free"

Disadvantages
  - problems with discontinuities
  - composition and binarization not possible [M. et al., '09] and [ZHANG et al., '06]
  - "context-free"

Extended Tree Transducer

Remarks
  - synchronization breaks almost all existing constructions
    (e.g., the normalization construction)
    → the basic grammar model is very important
  - tree-to-tree models use trees on both sides

Extended Tree Transducer

Major (tree-to-tree) models
  1 linear top-down tree transducer (with look-ahead)
    - input side: tree automaton
    - output side: regular tree grammar
    - synchronization: mapping output NT to input NT
  2 linear extended top-down tree transducer (with look-ahead)
    - input side: regular tree grammar
    - output side: regular tree grammar
    - synchronization: mapping output NT to input NT

Extended Tree Transducer

Synchronous grammar rule (figure):
  at state q:  VP(q1, q2, q3)  paired with  VP(q2, VP(q1, q3))

"Classical" top-down tree transducer rule (figure):
  q(VP(x1, x2, x3)) → VP(q2(x2), VP(q1(x1), q3(x3)))

Extended Tree Transducer

Syntactic restrictions
  - nondeleting if the synchronization is bijective (in all rules)
  - strict if r1 is not a nonterminal (for all rules q → (r, r1))
  - ε-free if r is not a nonterminal (for all rules q → (r, r1))

Composition (COMP)
executing the transformations τ ⊆ TΣ × T∆ and τ′ ⊆ T∆ × TΓ one after the other:

  τ ; τ′ = {(s, u) | ∃t ∈ T∆ : (s, t) ∈ τ, (t, u) ∈ τ′}
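
A one-line sketch of this composition on finite relations in Python (small finite sets of pairs stand in for the tree transformations):

def compose(tau, tau_prime):
    # tau ; tau' = {(s, u) | exists t with (s, t) in tau and (t, u) in tau'}
    return {(s, u) for (s, t) in tau for (t2, u) in tau_prime if t == t2}

tau = {("s1", "t1"), ("s2", "t2")}
tau_prime = {("t1", "u1"), ("t1", "u2")}
print(compose(tau, tau_prime))  # {('s1', 'u1'), ('s1', 'u2')}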

Top-down Tree Transducer

(figure) Composition hierarchy of top-down tree transducer classes
(composition closure indicated in the subscript):
TOPR∞ and TOP∞ at the top; below them l-TOPR1, l-TOP2, ls-TOPR1, ls-TOP2,
ln-TOP1, and lns-TOP1.

Extended Tree Transducer

(figure) Composition hierarchy of extended top-down tree transducer classes
(composition closure indicated in the subscript):
XTOP∞, XTOPR, l-XTOP∞, l-XTOPR∞, ln-XTOP∞, ε-XTOP∞, TOPR∞, lns-XTOP∞,
lε-XTOP4, lε-XTOPR3, lnε-XTOP∞, lsε-XTOPR2, lnsε-XTOP2, lsε-XTOP2, l-TOPF2,
l-TOPR1, TOP∞, ls-TOP2, l-TOP2, ln-TOP1, lns-TOP1.

Machine Translation

TCS contributions
  - efficient evaluation and complexity considerations
    (exact decoding, best runs, best translations, etc.)
  - evaluation of expressive power
    (which linguistic phenomena can be captured? relationship to other models)
  - model transformations
    (intersection, language model integration, parse forest decoding, etc.)
  - very little on model induction so far
    (mostly local models so far; the power of finite-state is not yet explored)

Evaluation

  Task                System         BLEU
  English → German    STSG           15.22
                      MBOT           15.90
                      phrase-based   16.73
                      hierarchical   16.95
                      GHKM           17.10
  English → Arabic    STSG           48.32
                      MBOT           49.10
                      phrase-based   50.27
                      hierarchical   51.71
                      GHKM           46.66
  English → Chinese   STSG           17.69
                      MBOT           18.35
                      phrase-based   18.09
                      hierarchical   18.49
                      GHKM           18.12

from [SEEMANN et al., '15]

Selected Literature

  ENGELFRIET: Bottom-up and Top-down Tree Transformations — A Comparison.
    Math. Systems Theory 9, 1975
  KLEIN, MANNING: Accurate Unlexicalized Parsing. Proc. ACL 2003
  KOEHN: Statistical Machine Translation. Cambridge University Press, 2010
  MANNING, SCHÜTZE: Foundations of Statistical Natural Language Processing.
    MIT Press, 1999
  PETROV, BARRETT, THIBAUX, KLEIN: Learning Accurate, Compact, and Interpretable
    Tree Annotation. Proc. ACL 2006
  SHINDO, MIYAO, FUJINO, NAGATA: Bayesian Symbol-Refined Tree Substitution
    Grammars for Syntactic Parsing. Proc. ACL 2012