Lecture 16: FSA, formal language theorynaraehan/ling1330/Lecture16.pdf · A finite-state automaton (FSA, also called a finite-state machine) is a mathematical model of computation

Lecture 16: MT, FSA,

Formal Language Theory

LING 1330/2330: Introduction to Computational Linguistics

Na-Rae Han

Overview

4/18/2017 2

Machine translation Language and Computers, Ch.7 Machine translation systems Speech and Language Processing, Jurafsky and Martin (2008) Ch. 25 Machine Translation, 5.4 Noisy channel model

Finite-State Automata Language and Computers, 4.4 Searching semi-structured data with regular

expressions, pp. 112-115 Under the Hood 7 “Finite-state Automata”

Formal language theory Regular, context-free, context-sensitive languages The Chomsky Hierarchy

Mathematical Methods in Linguistics by B. Partee, A. ter Meulen and R. Wall Part D. English as a Formal Language Excerpt posted on CourseWeb

http://www.cs.colorado.edu/~martin/slp.html

http://www.cs.colorado.edu/~martin/slp.html

http://www.amazon.com/Mathematical-Methods-Linguistics-Studies-Philosophy/dp/9027722455



Interlingua approaches: advantage

4/18/2017 3

ADVANTAGE: A single interlingua intermediates between all language pairs. For n languages, only 2*n translation models need to be built:

English INTERLINGUA French

English INTERLINGUA Spanish

English INTERLINGUA Japanese

Japanese INTERLINGUA English

Each language only needs to be translated to and from INTERLINGUA:

Interlingua

English

German

Chinese

Swahili Korean

Swedish Hindi

Interlingua: not as straightforward

4/18/2017 4

"Feeling hungry" is a physical sensation that's universal to human. Yet…

I am hungry

A state attributed to a person

Tengo hambre (Spanish)

"I have hunger" Possession relation

お腹が空いた (Japanese)

"(my) stomach is empty" A state attributed to a location/physical entity

Would you say these sentences have different meanings?

How do we construct a single universal meaning representation (= interlingua) for this?

Interlingua is expensive

4/18/2017 5

Someone has to build the formal semantic representation for every meaning component

May be feasible for limited subdomains (e.g., air travel, hotel reservation system, etc.)

In its pure form, interlingua requires full disambiguation, which sometimes results in unnecessary efforts

YOUNGER- vs. OLDER-BROTHER distinction

Mandatory in Korean and Japanese

Necessary for KoreanJapanese or EnglishKorean translation

Not needed for KoreanEnglish or GermanEnglish translation

Statistical MT

4/18/2017 6

The three classic architectures focus on the appropriate representations to use symbolic approach

In statistical MT, however, the focus is more on the result JAPANESE: fukaku hansei shite orimasu.

English1: I sincerely apologize.

English2: I am deeply regretting.

E1 is more natural English; E2 is more faithful to the original meaning

There are two considerations in translation: (1) Faithfulness to the original message

(2) Fluency (naturalness) of the target language text

Successful translation can be schematized as:

A translation T that maximizes the product of the two best-translation ˆT = argmaxT faithfulness(T, S) * fluency(T)

The noisy channel model

4/18/2017 7

The noisy channel model

How to estimate the original clean signal from the degraded signal?

ŷ: system's best guess what the original message was

P(y): the probability of the original message

P(x|y): the observed output of the channel, given y

ŷ is the message that maximizes P(y) * P(x|y)

Noisy channel

Clean signal (y)

Degraded signal (x)

y)|P(y)P(xargmax y y

Spell correction as noisy channel model

4/18/2017 8

Spelling error correction

Estimating the intended word based on the mis-spelled word

ŷ: system's best guess what the intended word was, e.g., bright?, birth? broth?

P(y): the probability of the intended word, e.g., P(birth) vs. P(broth)

P(x|y): the probability of the misspelled word, given the intended word, e.g., P(briht|birth), P(briht|broth), ...

ŷ is the word that maximizes P(y) * P(x|y)

Sloppy human

Intended word (y)

Miss-spelled word (x)

y)|P(y)P(xargmax y y

Translation as noisy channel model

4/18/2017 9

Translation model, French (F) to English (E):

Ê: system's best guess at best English translation

P(E): the probability of the English translation text

P(F|E): the observed distribution of French text, given the English text

We need two components: The English language model P(E) and the translation model P(F|E) Based on the two, a decoder picks , i.e., determines the best English

translation

Noisy channel

English text (E)

French text (F)

)|( )(argmax )|(argmax ˆ EFPEPFEPE EE

Language model

Translation model

E

Note the DIRECTION!

P(E) vs. P(F|E)

4/18/2017 10

Two components:

The English language model: P(E)

Essentially a model of fluency/naturalness ("I sincerely apologize" more likely English sentence than "I am deeply regretting")

We can build this independently based on a large English corpus

(Bigram model, trigram model, HMM, ...)

The translation model: P(F|E)

Essentially a model of translation faithfulness ("I am deeply regretting" more faithful to original Japanese than "I sincerely apologize")

How should we obtain this?

Also from corpus data: parallel corpus

As with any other statistical language models, this model is REDUCTIVE: the stats come from word-level co-occurrence patterns

P(F|E) and word alignment

4/18/2017 11

Two components:

The English language model: P(E)

Essentially a model of fluency/naturalness ("I sincerely apologize" more likely English sentence than "I am deeply regretting")

We can build this independently based on a large English corpus

(Bigram model, trigram model, HMM, ...)

The translation model: P(F|E)

Essentially a model of translation faithfulness ("I am deeply regretting" more faithful to original Japanese than "I sincerely apologize")

How should we obtain this?

Also from corpus data: parallel corpus

As with any other statistical language models, this model is REDUCTIVE: the stats come from word-level co-occurrence patterns

Need word alignment: A mapping between the source words and the target words in a set of parallel sentence.

Word alignment

4/18/2017 12

The translation model P(F|E) based on word alignment statistics Sentences are reduced to a bag-of-words

Given an English sentence as words (we1... wen), how likely is it to see a French sentence (wf1... wfn) with a set of alignment (A) between words?

Where to get word-level alignments (A)?

It can be automatically extracted from a large parallel corpus, where sentences are pre-aligned.

There are many methods for extracting word alignment. (IBM Model 1, 2, 3, 4; HMM, ...)

She had picked up the letter the day before yesterday

Elle avait ramassé la lettre avant-hier

A

EAFPEFP )|,()|(

Parallel corpora

4/18/2017 13

European Parliament Proceedings Parallel Corpus 1996-2011

Europarl: http://www.statmt.org/europarl/

Extracted from the proceedings of the European Parliament.

Includes versions in 21 European languages. (EU currently has 23 official languages)

Each language section is aligned on sentence level

EU and MT

Public official documents of the European Union (EU) must be translated to every official languages

EU has been a big sponsor for MT research efforts.

http://www.statmt.org/europarl/

Europarl Corpus in NLTK

4/18/2017 14

>>> from nltk.corpus.europarl_raw import english, german, french, spanish >>> english.sents()[2] ['You', 'have', 'requested', 'a', 'debate', 'on', 'this', 'subject', 'in', 'the', 'course', 'of', 'the', 'next', 'few', 'days', ',', 'during', 'this', 'part-session', '.'] >>> german.sents()[2] ['Doch', 'sind', 'Bürger', 'einiger', 'unserer', 'Mitgliedstaaten', 'Opfer', 'von', 'schrecklichen', 'Naturkatastrophen', 'geworden', '.'] >>> french.sents()[2] ['En', 'revanche', ',', 'les', 'citoyens', "d'", 'un', 'certain', 'nombre', 'de', 'nos', 'pays', 'ont', 'été', 'victimes', 'de', 'catastrophes', 'naturelles', 'qui', 'ont', 'vraiment', 'été', 'terribles', '.'] >>> spanish.sents()[2] ['En', 'cambio', ',', 'los', 'ciudadanos', 'de', 'varios', 'de', 'nuestros', 'países', 'han', 'sido', 'víctimas', 'de', 'catástrofes', 'naturales', 'verdaderamente', 'terribles', '.']

Also available: Danish, Dutch, Finnish, Greek, Italian, Portuguese, Spanish, Swedish

Parallel corpus with syntactic annotation

15

;;05:127: 저는 그 일을 할 수 있는 한 빨리 하겠습니다 .

(S (NP-SBJ 저/NPN+는/PAU)

(VP (NP-OBJ-LV 그/DAN

일/NNC+을/PCA)

(VP (NP-ADV (S (NP-SBJ (S (NP-SBJ *pro*)

(VP 하/VV+ㄹ/EAN))

(NP 수/NNX))

(ADJP 있/VJ+는/EAN))

(NP 한/NNX))

(ADVP 빨리/ADV)

(VP (LV 하/VV+겠/EPF+습니다/EFN))))

./SFN)

;;05:127: I will do it as fast as possible.

(S (NP-SBJ (PRP I))

(VP (MD will)

(VP (VP (VB do)

(NP-OBJ (PRP it)))

(ADVP-MNR (ADVP (RB as)

(RB fast))

(PP (IN as)

(ADJP (JJ possible))))))

(. .))

Korean English Treebank Annotations: https://catalog.ldc.upenn.edu/LDC2002T26

https://catalog.ldc.upenn.edu/LDC2002T26




MT: state-of-the-art

4/18/2017 16

Systran – one of the largest and oldest MT companies http://www.systransoft.com/

Babel Fish – Yahoo’s MT engine https://www.babelfish.com/

Bing Translator https://www.bing.com/translator

Google Translate– Google’s MT engine http://translate.google.com/

CMU Link Grammar English-German Translator http://www.link.cs.cmu.edu/link/submit-to-translator.html

Safaba http://www.safaba.com/

Started as CMU off-shoot; now an Amazon company in Amazon Pittsburgh office

http://www.systran-software.co.kr/

http://www.systransoft.com/

http://www.systransoft.com/

https://www.babelfish.com/

https://www.babelfish.com/

https://www.bing.com/translator

https://www.bing.com/translator

http://translate.google.com/

http://www.link.cs.cmu.edu/link/submit-to-translator.html





http://www.safaba.com/

http://www.safaba.com/

http://www.post-gazette.com/business/tech-news/2017/01/17/Amazon-Pittsburgh-office-opens-in-South-Side-Works/stories/201701170116?pgpageversion=pgevoke

http://www.post-gazette.com/business/tech-news/2017/01/17/Amazon-Pittsburgh-office-opens-in-South-Side-Works/stories/201701170116?pgpageversion=pgevoke

Current state of NLP and AI

4/18/2017 17

The Great A.I. Awakening

Dec 2016, NYTimes Magazine

What did you all think?

Last Words: Computational Linguistics and Deep Learning

By Christopher Manning; Original publication Dec 2015

"The deep learning tsunami", "Why computational linguists need not worry"

The Dark Secret at the Heart of AI

Apr 2017, MIT Technology Review

"No one really knows how the most advanced algorithms do what they do. That could be a problem."

Google: Our Assistant will Trigger the Next Era of A.I.

Oct 2016, backchannel.com

"How can a machine truly understand the phrases that human beings peck and blab into its search fields and microphone? "

https://www.nytimes.com/2016/12/14/magazine/the-great-ai-awakening.html?_r=0

http://mitp.nautil.us/article/170/last-words-computational-linguistics-and-deep-learning

http://mitp.nautil.us/article/170/last-words-computational-linguistics-and-deep-learning

https://www.technologyreview.com/s/604087/the-dark-secret-at-the-heart-of-ai/

https://backchannel.com/google-our-assistant-will-trigger-the-next-era-of-ai-3c72a4d7bc75

Finite-State Automata

Formal Language Theory

Finite-State Automata

4/18/2017 19

A finite-state automaton (FSA, also called a finite-state machine) is a mathematical model of computation

It consists of:

A set of states. One state is initial; each state is either final (=accepting) or non-final.

A set of transition arcs between states with a label.

The machine starts at the initial state, and then transits to a next state through an arc, reading the label

When the input string is exhausted, if the machine is at a final state (ab), then the string is accepted/generated; if not (aba), it is rejected.

Input string is also rejected when it cannot be completely processed. (b, aaa)

2 START a

b

1 a

3

Example of FSA

4/18/2017 20

Which string(s) does this FSA accept? Answer: ab, ac

What is its RE equivalent? Answer: a(b|c)

3 START a

a b

1 2 b

3 START a

1 2

Which string(s) does this FSA accept? Answer: ab, aab, abb,

aaab, aabbb, aaaaabbbbbb, ...

What is its RE equivalent? Answer: a+b+

b

c

Regular expressions vs. automata

4/18/2017 21

Regular expression

A compact representation of a set of strings

/(have|has|had)( n?ever)? been/

represents a set of 9 strings.

/(have|has|had)( \w+)* been/

represents a set of infinite number of strings.

Regular expressions as a formalism have a different incarnation in the form of finite-state automata:

1 2 START b

a b

a*b* equivalent

Regular expressions vs. FSA

4/18/2017 22

Regular expressions and FSA are equivalent.

For any regular expression you compose, there is a corresponding FSA.

Any FSA can be converted to a corresponding regular expression.

How do you define equivalence?

A regular expression represents a set of strings.

A FSA accepts/generates a set of strings.

If the two sets are identical, the regular expression and the FSA are equivalent.

3 START a

b

1 a

2 ab*a aa, aba,

abba, abbba, abbbba, ...

4.12 (Language and Computers)

4/18/2017 23

Accepted by this FSA? '' 'a' 'aa' 'b'

Equivalent regular expression?

/a/

2 START a

1

4.13

4/18/2017 24

Accepted by this FSA? '' 'a' 'b' 'c' 'ab' 'bc' 'aa'


/a|b|c/

2 START b

1

a

c

4.14

4/18/2017 25

Accepted by this FSA? '' 'a' 'aa' 'aaa' 'b' 'bc' 'abc' 'ba' 'aba' 'ab' 'aac'


/a*(a|b|c)/

2 START b

1

a

c

Also: 'c', 'ac', 'aaac', …

a

Non-deterministic: Multiple choices on reading a single arc

label.

4.15

4/18/2017 26

Accepted by this FSA? '' 'a' 'b' 'ab' 'ba' 'aa' 'bca'


/ab/

3 START a

1 b

2

4.17

4/18/2017 27

Accepted by this FSA? '' 'a' 'aa' 'aaa' 'aaaa' 'b' 'ab' 'baaa'


/a*/

1 START

Also: 'aaaaa', 'aaaaaa', …

a

4/18/2017 28

Accepted by this FSA? '' 'a' 'aa' 'aaa' 'aaaa' 'b' 'ab' 'baaa'


/(a|b)*/

1 START

a

b

Basically any string made of a and b!

If a and b are the only alphabet,

equivalent to

/.*/

4.18

4/18/2017 29

Accepted by this FSA? '' 'a' 'aa' 'aaa' 'b' 'ab' 'aabb' 'baa' 'bab'


/a*b*/

2 START b

Also: 'bb', 'abb', 'aaabbbbb', …

a

1

b

Wrong FSA shown in the textbook. This is the

correct FSA that matches the regex.

Any number of a's followed by any number of b's

4.18 (as in the textbook p.115)

4/18/2017 30

Accepted by this FSA? '' 'a' 'b' 'ab' 'ba' 'aa' 'aab' 'aaabb'

'aaaaaa' 'bbbbbb' 'bbbaa' 'aaaab' 'abbbbbb'


/a*(ab*)?/

3 START a b

a b

2 1 This part is

in fact redundant.

If b is going to occur at all, it must be preceded by at

least one a

RE/FSA and natural language grammar

4/18/2017 31

RE/FSA can be thought of as a grammar. RE/FSA is a mechanism that rules if a string is accepted

("grammatical") or not ("ungrammatical")

An RE/FSA represents a set of accepted strings (= "grammatical strings") to the exclusion of the rest ("ungrammatical strings")

A language is a set of all grammatical strings generated over its vocabulary (arc labels).

L(ab*a) = a language generated by regular expression ab*a

= {aa, aba, abba, abbba, abbbba, abbbbba, ... }

English morpho-syntax as FSA

4/18/2017 32

A set of English morphemes:

{thank, joy, taste, thought, ful, less, ly}

You can concatenate the morphemes to make English words.

Grammatical: joy, joyful, joyless, joyfully, …

Ungrammatical: ful, fuljoy, lessful, joythank, …

English morpho-syntax dictates which sequences are grammatical English words and which are not.

How to formalize the rules?

English morpho-syntax as FSA

4/18/2017 33

Arc labels (=vocabulary): English morphemes

Set of accepted strings: legitimate English words

Which words are in this language?

Which are not?

What's the corresponding regular expression?

(thank|joy|taste|thought)((ful|less)(ly)?)?

2 1 3 4 START

thank

joy

taste

thought

ful

less

ly

English phonotactics as FSA

4/18/2017 34

A set of English phonemes:

{g, k, b, p, d, t, s, z, v, f, θ, ð, ʃ, ʒ, tʃ, dʒ, ɹ, l, h, m, n, ŋ, j, w, a, æ, ɛ, i, ɪ, ɔ, ʌ, ɑ, u, ʊ, eɪ, oɪ, aʊ, ou, aɪ}

You can concatenate the phonemes to make English syllables.

Grammatical: pɔ, deɪt, æbz, stɹid, …

Ungrammatical: mk, gki, iu, mɑnpk, ɹstid, …

English phonotactics dictates which sequences are grammatical English syllables and which are not.

How to formalize the rules?

English phonotactics as FSA

4/18/2017 35

Arc labels (=vocabulary): English phonemes

Set of accepted strings: valid English syllables

Which syllables are in this language?

Which are not?

What's the corresponding regular expression?

'': empty label

START

s

1

2 3

4 5 6

t

p

d

k

w

h

ʌ

æ

p

t

k i

''

''

ɹ

eɪ

ɪ

ŋ

m

Wrapping up

4/18/2017 36

Next class: Formal language theory Course wrap-up

PyLing this Wednesday 6pm, 2818 CL A send-off to graduating seniors. Learn about where they are going and how they

prepared for their next big step!

Do your OMET survey!

A guide to your final grade Participation & attendance grades almost ready CourseWeb "running total" ready A chance to make up for missed homework next slide

Final exam Next slide

Late homework forgiveness

4/18/2017 37

Homework 10 and 11 are still open for late submission, with standard penalty.

http://www.pitt.edu/~naraehan/ling1330/policies.html

Additionally, you may make up 1 missed assignment (homework or exercise) of your choice.

20% late penalty applies.

Deadline: 4/27 (Thu) midnight

Upload on CourseWeb and email me.

If a solution has been published, feel free to look it up. It's fine as long as you don't blindly copy it. There's already late penalty, and I'd rather you learn.





Final exam

4/18/2017 38

Wednesday 4/26, noon – 1:50pm

At LMC (G17 CL)

150 total points

All pen-and-pencil based

1 letter-sized, front-and-back, cheat-sheet allowed You can word-process & print, but do NOT print out whole-sale copies of

PPT slides.

Lecture topics & NLTK

Some questions on topics that span pre-midterm and post-midterm.

BRING YOUR CALCULATOR

Documents

Lecture 16: FSA, formal language theorynaraehan/ling1330/Lecture16.pdf · A finite-state automaton (FSA, also called a finite-state machine) is a mathematical model of computation