Upload
buiduong
View
215
Download
0
Embed Size (px)
Citation preview
Lecture 16: MT, FSA,
Formal Language Theory
LING 1330/2330: Introduction to Computational Linguistics
Na-Rae Han
Overview
4/18/2017 2
Machine translation Language and Computers, Ch.7 Machine translation systems Speech and Language Processing, Jurafsky and Martin (2008) Ch. 25 Machine Translation, 5.4 Noisy channel model
Finite-State Automata Language and Computers, 4.4 Searching semi-structured data with regular
expressions, pp. 112-115 Under the Hood 7 “Finite-state Automata”
Formal language theory Regular, context-free, context-sensitive languages The Chomsky Hierarchy
Mathematical Methods in Linguistics by B. Partee, A. ter Meulen and R. Wall Part D. English as a Formal Language Excerpt posted on CourseWeb
Interlingua approaches: advantage
4/18/2017 3
ADVANTAGE: A single interlingua intermediates between all language pairs. For n languages, only 2*n translation models need to be built:
English INTERLINGUA French
English INTERLINGUA Spanish
English INTERLINGUA Japanese
Japanese INTERLINGUA English
Each language only needs to be translated to and from INTERLINGUA:
Interlingua
English
German
Chinese
Swahili Korean
Swedish Hindi
Interlingua: not as straightforward
4/18/2017 4
"Feeling hungry" is a physical sensation that's universal to human. Yet…
I am hungry
A state attributed to a person
Tengo hambre (Spanish)
"I have hunger" Possession relation
お腹が空いた (Japanese)
"(my) stomach is empty" A state attributed to a location/physical entity
Would you say these sentences have different meanings?
How do we construct a single universal meaning representation (= interlingua) for this?
Interlingua is expensive
4/18/2017 5
Someone has to build the formal semantic representation for every meaning component
May be feasible for limited subdomains (e.g., air travel, hotel reservation system, etc.)
In its pure form, interlingua requires full disambiguation, which sometimes results in unnecessary efforts
YOUNGER- vs. OLDER-BROTHER distinction
Mandatory in Korean and Japanese
Necessary for KoreanJapanese or EnglishKorean translation
Not needed for KoreanEnglish or GermanEnglish translation
Statistical MT
4/18/2017 6
The three classic architectures focus on the appropriate representations to use symbolic approach
In statistical MT, however, the focus is more on the result JAPANESE: fukaku hansei shite orimasu.
English1: I sincerely apologize.
English2: I am deeply regretting.
E1 is more natural English; E2 is more faithful to the original meaning
There are two considerations in translation: (1) Faithfulness to the original message
(2) Fluency (naturalness) of the target language text
Successful translation can be schematized as:
A translation T that maximizes the product of the two best-translation ˆT = argmaxT faithfulness(T, S) * fluency(T)
The noisy channel model
4/18/2017 7
The noisy channel model
How to estimate the original clean signal from the degraded signal?
ŷ: system's best guess what the original message was
P(y): the probability of the original message
P(x|y): the observed output of the channel, given y
ŷ is the message that maximizes P(y) * P(x|y)
Noisy channel
Clean signal (y)
Degraded signal (x)
y)|P(y)P(xargmax y y
Spell correction as noisy channel model
4/18/2017 8
Spelling error correction
Estimating the intended word based on the mis-spelled word
ŷ: system's best guess what the intended word was, e.g., bright?, birth? broth?
P(y): the probability of the intended word, e.g., P(birth) vs. P(broth)
P(x|y): the probability of the misspelled word, given the intended word, e.g., P(briht|birth), P(briht|broth), ...
ŷ is the word that maximizes P(y) * P(x|y)
Sloppy human
Intended word (y)
Miss-spelled word (x)
y)|P(y)P(xargmax y y
Translation as noisy channel model
4/18/2017 9
Translation model, French (F) to English (E):
Ê: system's best guess at best English translation
P(E): the probability of the English translation text
P(F|E): the observed distribution of French text, given the English text
We need two components: The English language model P(E) and the translation model P(F|E) Based on the two, a decoder picks , i.e., determines the best English
translation
Noisy channel
English text (E)
French text (F)
)|( )(argmax )|(argmax ˆ EFPEPFEPE EE
Language model
Translation model
E
Note the DIRECTION!
P(E) vs. P(F|E)
4/18/2017 10
Two components:
The English language model: P(E)
Essentially a model of fluency/naturalness ("I sincerely apologize" more likely English sentence than "I am deeply regretting")
We can build this independently based on a large English corpus
(Bigram model, trigram model, HMM, ...)
The translation model: P(F|E)
Essentially a model of translation faithfulness ("I am deeply regretting" more faithful to original Japanese than "I sincerely apologize")
How should we obtain this?
Also from corpus data: parallel corpus
As with any other statistical language models, this model is REDUCTIVE: the stats come from word-level co-occurrence patterns
P(F|E) and word alignment
4/18/2017 11
Two components:
The English language model: P(E)
Essentially a model of fluency/naturalness ("I sincerely apologize" more likely English sentence than "I am deeply regretting")
We can build this independently based on a large English corpus
(Bigram model, trigram model, HMM, ...)
The translation model: P(F|E)
Essentially a model of translation faithfulness ("I am deeply regretting" more faithful to original Japanese than "I sincerely apologize")
How should we obtain this?
Also from corpus data: parallel corpus
As with any other statistical language models, this model is REDUCTIVE: the stats come from word-level co-occurrence patterns
Need word alignment: A mapping between the source words and the target words in a set of parallel sentence.
Word alignment
4/18/2017 12
The translation model P(F|E) based on word alignment statistics Sentences are reduced to a bag-of-words
Given an English sentence as words (we1... wen), how likely is it to see a French sentence (wf1... wfn) with a set of alignment (A) between words?
Where to get word-level alignments (A)?
It can be automatically extracted from a large parallel corpus, where sentences are pre-aligned.
There are many methods for extracting word alignment. (IBM Model 1, 2, 3, 4; HMM, ...)
She had picked up the letter the day before yesterday
Elle avait ramassé la lettre avant-hier
A
EAFPEFP )|,()|(
Parallel corpora
4/18/2017 13
European Parliament Proceedings Parallel Corpus 1996-2011
Europarl: http://www.statmt.org/europarl/
Extracted from the proceedings of the European Parliament.
Includes versions in 21 European languages. (EU currently has 23 official languages)
Each language section is aligned on sentence level
EU and MT
Public official documents of the European Union (EU) must be translated to every official languages
EU has been a big sponsor for MT research efforts.
Europarl Corpus in NLTK
4/18/2017 14
>>> from nltk.corpus.europarl_raw import english, german, french, spanish >>> english.sents()[2] ['You', 'have', 'requested', 'a', 'debate', 'on', 'this', 'subject', 'in', 'the', 'course', 'of', 'the', 'next', 'few', 'days', ',', 'during', 'this', 'part-session', '.'] >>> german.sents()[2] ['Doch', 'sind', 'Bürger', 'einiger', 'unserer', 'Mitgliedstaaten', 'Opfer', 'von', 'schrecklichen', 'Naturkatastrophen', 'geworden', '.'] >>> french.sents()[2] ['En', 'revanche', ',', 'les', 'citoyens', "d'", 'un', 'certain', 'nombre', 'de', 'nos', 'pays', 'ont', 'été', 'victimes', 'de', 'catastrophes', 'naturelles', 'qui', 'ont', 'vraiment', 'été', 'terribles', '.'] >>> spanish.sents()[2] ['En', 'cambio', ',', 'los', 'ciudadanos', 'de', 'varios', 'de', 'nuestros', 'países', 'han', 'sido', 'víctimas', 'de', 'catástrofes', 'naturales', 'verdaderamente', 'terribles', '.']
Also available: Danish, Dutch, Finnish, Greek, Italian, Portuguese, Spanish, Swedish
Parallel corpus with syntactic annotation
15
;;05:127: 저는 그 일을 할 수 있는 한 빨리 하겠습니다 .
(S (NP-SBJ 저/NPN+는/PAU)
(VP (NP-OBJ-LV 그/DAN
일/NNC+을/PCA)
(VP (NP-ADV (S (NP-SBJ (S (NP-SBJ *pro*)
(VP 하/VV+ㄹ/EAN))
(NP 수/NNX))
(ADJP 있/VJ+는/EAN))
(NP 한/NNX))
(ADVP 빨리/ADV)
(VP (LV 하/VV+겠/EPF+습니다/EFN))))
./SFN)
;;05:127: I will do it as fast as possible.
(S (NP-SBJ (PRP I))
(VP (MD will)
(VP (VP (VB do)
(NP-OBJ (PRP it)))
(ADVP-MNR (ADVP (RB as)
(RB fast))
(PP (IN as)
(ADJP (JJ possible))))))
(. .))
Korean English Treebank Annotations: https://catalog.ldc.upenn.edu/LDC2002T26
MT: state-of-the-art
4/18/2017 16
Systran – one of the largest and oldest MT companies http://www.systransoft.com/
Babel Fish – Yahoo’s MT engine https://www.babelfish.com/
Bing Translator https://www.bing.com/translator
Google Translate– Google’s MT engine http://translate.google.com/
CMU Link Grammar English-German Translator http://www.link.cs.cmu.edu/link/submit-to-translator.html
Safaba http://www.safaba.com/
Started as CMU off-shoot; now an Amazon company in Amazon Pittsburgh office
Current state of NLP and AI
4/18/2017 17
The Great A.I. Awakening
Dec 2016, NYTimes Magazine
What did you all think?
Last Words: Computational Linguistics and Deep Learning
By Christopher Manning; Original publication Dec 2015
"The deep learning tsunami", "Why computational linguists need not worry"
The Dark Secret at the Heart of AI
Apr 2017, MIT Technology Review
"No one really knows how the most advanced algorithms do what they do. That could be a problem."
Google: Our Assistant will Trigger the Next Era of A.I.
Oct 2016, backchannel.com
"How can a machine truly understand the phrases that human beings peck and blab into its search fields and microphone? "
Finite-State Automata
Formal Language Theory
Finite-State Automata
4/18/2017 19
A finite-state automaton (FSA, also called a finite-state machine) is a mathematical model of computation
It consists of:
A set of states. One state is initial; each state is either final (=accepting) or non-final.
A set of transition arcs between states with a label.
The machine starts at the initial state, and then transits to a next state through an arc, reading the label
When the input string is exhausted, if the machine is at a final state (ab), then the string is accepted/generated; if not (aba), it is rejected.
Input string is also rejected when it cannot be completely processed. (b, aaa)
2 START a
b
1 a
3
Example of FSA
4/18/2017 20
Which string(s) does this FSA accept? Answer: ab, ac
What is its RE equivalent? Answer: a(b|c)
3 START a
a b
1 2 b
3 START a
1 2
Which string(s) does this FSA accept? Answer: ab, aab, abb,
aaab, aabbb, aaaaabbbbbb, ...
What is its RE equivalent? Answer: a+b+
b
c
Regular expressions vs. automata
4/18/2017 21
Regular expression
A compact representation of a set of strings
/(have|has|had)( n?ever)? been/
represents a set of 9 strings.
/(have|has|had)( \w+)* been/
represents a set of infinite number of strings.
Regular expressions as a formalism have a different incarnation in the form of finite-state automata:
1 2 START b
a b
a*b* equivalent
Regular expressions vs. FSA
4/18/2017 22
Regular expressions and FSA are equivalent.
For any regular expression you compose, there is a corresponding FSA.
Any FSA can be converted to a corresponding regular expression.
How do you define equivalence?
A regular expression represents a set of strings.
A FSA accepts/generates a set of strings.
If the two sets are identical, the regular expression and the FSA are equivalent.
3 START a
b
1 a
2 ab*a aa, aba,
abba, abbba, abbbba, ...
4.12 (Language and Computers)
4/18/2017 23
Accepted by this FSA? '' 'a' 'aa' 'b'
Equivalent regular expression?
/a/
2 START a
1
4.13
4/18/2017 24
Accepted by this FSA? '' 'a' 'b' 'c' 'ab' 'bc' 'aa'
Equivalent regular expression?
/a|b|c/
2 START b
1
a
c
4.14
4/18/2017 25
Accepted by this FSA? '' 'a' 'aa' 'aaa' 'b' 'bc' 'abc' 'ba' 'aba' 'ab' 'aac'
Equivalent regular expression?
/a*(a|b|c)/
2 START b
1
a
c
Also: 'c', 'ac', 'aaac', …
a
Non-deterministic: Multiple choices on reading a single arc
label.
4.15
4/18/2017 26
Accepted by this FSA? '' 'a' 'b' 'ab' 'ba' 'aa' 'bca'
Equivalent regular expression?
/ab/
3 START a
1 b
2
4.17
4/18/2017 27
Accepted by this FSA? '' 'a' 'aa' 'aaa' 'aaaa' 'b' 'ab' 'baaa'
Equivalent regular expression?
/a*/
1 START
Also: 'aaaaa', 'aaaaaa', …
a
4/18/2017 28
Accepted by this FSA? '' 'a' 'aa' 'aaa' 'aaaa' 'b' 'ab' 'baaa'
Equivalent regular expression?
/(a|b)*/
1 START
a
b
Basically any string made of a and b!
If a and b are the only alphabet,
equivalent to
/.*/
4.18
4/18/2017 29
Accepted by this FSA? '' 'a' 'aa' 'aaa' 'b' 'ab' 'aabb' 'baa' 'bab'
Equivalent regular expression?
/a*b*/
2 START b
Also: 'bb', 'abb', 'aaabbbbb', …
a
1
b
Wrong FSA shown in the textbook. This is the
correct FSA that matches the regex.
Any number of a's followed by any number of b's
4.18 (as in the textbook p.115)
4/18/2017 30
Accepted by this FSA? '' 'a' 'b' 'ab' 'ba' 'aa' 'aab' 'aaabb'
'aaaaaa' 'bbbbbb' 'bbbaa' 'aaaab' 'abbbbbb'
Equivalent regular expression?
/a*(ab*)?/
3 START a b
a b
2 1 This part is
in fact redundant.
If b is going to occur at all, it must be preceded by at
least one a
RE/FSA and natural language grammar
4/18/2017 31
RE/FSA can be thought of as a grammar. RE/FSA is a mechanism that rules if a string is accepted
("grammatical") or not ("ungrammatical")
An RE/FSA represents a set of accepted strings (= "grammatical strings") to the exclusion of the rest ("ungrammatical strings")
A language is a set of all grammatical strings generated over its vocabulary (arc labels).
L(ab*a) = a language generated by regular expression ab*a
= {aa, aba, abba, abbba, abbbba, abbbbba, ... }
English morpho-syntax as FSA
4/18/2017 32
A set of English morphemes:
{thank, joy, taste, thought, ful, less, ly}
You can concatenate the morphemes to make English words.
Grammatical: joy, joyful, joyless, joyfully, …
Ungrammatical: ful, fuljoy, lessful, joythank, …
English morpho-syntax dictates which sequences are grammatical English words and which are not.
How to formalize the rules?
English morpho-syntax as FSA
4/18/2017 33
Arc labels (=vocabulary): English morphemes
Set of accepted strings: legitimate English words
Which words are in this language?
Which are not?
What's the corresponding regular expression?
(thank|joy|taste|thought)((ful|less)(ly)?)?
2 1 3 4 START
thank
joy
taste
thought
ful
less
ly
English phonotactics as FSA
4/18/2017 34
A set of English phonemes:
{g, k, b, p, d, t, s, z, v, f, θ, ð, ʃ, ʒ, tʃ, dʒ, ɹ, l, h, m, n, ŋ, j, w, a, æ, ɛ, i, ɪ, ɔ, ʌ, ɑ, u, ʊ, eɪ, oɪ, aʊ, ou, aɪ}
You can concatenate the phonemes to make English syllables.
Grammatical: pɔ, deɪt, æbz, stɹid, …
Ungrammatical: mk, gki, iu, mɑnpk, ɹstid, …
English phonotactics dictates which sequences are grammatical English syllables and which are not.
How to formalize the rules?
English phonotactics as FSA
4/18/2017 35
Arc labels (=vocabulary): English phonemes
Set of accepted strings: valid English syllables
Which syllables are in this language?
Which are not?
What's the corresponding regular expression?
'': empty label
START
s
1
2 3
4 5 6
t
p
d
k
w
h
ʌ
æ
p
t
k i
''
''
ɹ
eɪ
ɪ
ŋ
m
Wrapping up
4/18/2017 36
Next class: Formal language theory Course wrap-up
PyLing this Wednesday 6pm, 2818 CL A send-off to graduating seniors. Learn about where they are going and how they
prepared for their next big step!
Do your OMET survey!
A guide to your final grade Participation & attendance grades almost ready CourseWeb "running total" ready A chance to make up for missed homework next slide
Final exam Next slide
Late homework forgiveness
4/18/2017 37
Homework 10 and 11 are still open for late submission, with standard penalty.
http://www.pitt.edu/~naraehan/ling1330/policies.html
Additionally, you may make up 1 missed assignment (homework or exercise) of your choice.
20% late penalty applies.
Deadline: 4/27 (Thu) midnight
Upload on CourseWeb and email me.
If a solution has been published, feel free to look it up. It's fine as long as you don't blindly copy it. There's already late penalty, and I'd rather you learn.
Final exam
4/18/2017 38
Wednesday 4/26, noon – 1:50pm
At LMC (G17 CL)
150 total points
All pen-and-pencil based
1 letter-sized, front-and-back, cheat-sheet allowed You can word-process & print, but do NOT print out whole-sale copies of
PPT slides.
Lecture topics & NLTK
Some questions on topics that span pre-midterm and post-midterm.
BRING YOUR CALCULATOR