Lecture 18 Natural Language Processing Marco Chiarandini Department of Mathematics & Computer Science University of Southern Denmark Slides by Dan Klein at Berkeley

Lecture 18 Natural Language Processing - SDUmarco/DM828/Slides/dm828-lec18.pdf · 2011. 12. 18. · CourseOverview Machine Translation 4 Introduction 4 ArtificialIntelligence 4 IntelligentAgents


Lecture 18
Natural Language Processing

Marco Chiarandini

Department of Mathematics & Computer Science
University of Southern Denmark

Slides by Dan Klein at Berkeley

Course Overview

✓ Introduction
    ✓ Artificial Intelligence
    ✓ Intelligent Agents
✓ Search
    ✓ Uninformed Search
    ✓ Heuristic Search
✓ Uncertain knowledge and Reasoning
    ✓ Probability and Bayesian approach
    ✓ Bayesian Networks
    ✓ Hidden Markov Chains
    ✓ Kalman Filters
✓ Learning
    ✓ Supervised: Decision Trees, Neural Networks, Learning Bayesian Networks
    ✓ Unsupervised: EM Algorithm
    ✓ Reinforcement Learning
• Games and Adversarial Search
    • Minimax search and Alpha-beta pruning
    • Multiagent search
• Knowledge representation and Reasoning
    • Propositional logic
    • First order logic
    • Inference
    • Planning

Outline

1. Recap
2. Speech Recognition
3. Machine Translation
     Statistical MT
     Rule-based MT

Recap: Sequential data

Recap: Filtering

Recap: State Trellis

• State trellis: graph of states and transitions over time
• Each arc represents some transition $x_{t-1} \to x_t$
• Each arc has weight $\Pr(x_t \mid x_{t-1})\,\Pr(e_t \mid x_t)$
• Each path is a sequence of states
• The product of weights on a path is the sequence's probability
• Can think of the Forward (and now Viterbi) algorithms as computing sums of all paths (best paths) in this graph
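The trellis view can be sketched in a few lines of Python. This is a minimal illustration, not the code used in the course: the two-state weather model (`STATES`, `START_P`, `TRANS_P`, `EMIT_P`) and its numbers are hypothetical. `forward` sums the weight of all paths, `viterbi` keeps only the best path into each state.

```python
# Hypothetical toy HMM (all numbers are assumptions for illustration only).
STATES = ("rain", "sun")
START_P = {"rain": 0.5, "sun": 0.5}
TRANS_P = {"rain": {"rain": 0.7, "sun": 0.3},
           "sun": {"rain": 0.3, "sun": 0.7}}
EMIT_P = {"rain": {"umbrella": 0.9, "none": 0.1},
          "sun": {"umbrella": 0.2, "none": 0.8}}


def forward(states, start_p, trans_p, emit_p, evidence):
    """Sum of the weights of all trellis paths, i.e. Pr(e_1:T)."""
    alpha = {x: start_p[x] * emit_p[x][evidence[0]] for x in states}
    for e in evidence[1:]:
        alpha = {x: emit_p[x][e] * sum(alpha[xp] * trans_p[xp][x] for xp in states)
                 for x in states}
    return sum(alpha.values())


def viterbi(states, start_p, trans_p, emit_p, evidence):
    """Best trellis path and its weight (max instead of sum)."""
    # best[t][x]: weight of the best path that ends in state x at time t
    best = [{x: start_p[x] * emit_p[x][evidence[0]] for x in states}]
    back = [{}]
    for t in range(1, len(evidence)):
        best.append({})
        back.append({})
        for x in states:
            # arc weight: Pr(x_t | x_{t-1}) * Pr(e_t | x_t)
            prev, score = max(
                ((xp, best[t - 1][xp] * trans_p[xp][x] * emit_p[x][evidence[t]])
                 for xp in states),
                key=lambda pair: pair[1],
            )
            best[t][x] = score
            back[t][x] = prev
    # Follow the back-pointers from the best final state.
    last = max(states, key=lambda x: best[-1][x])
    path = [last]
    for t in range(len(evidence) - 1, 0, -1):
        path.append(back[t][path[-1]])
    path.reverse()
    return path, best[-1][last]


if __name__ == "__main__":
    obs = ["umbrella", "umbrella", "none"]
    print(viterbi(STATES, START_P, TRANS_P, EMIT_P, obs))
    print(forward(STATES, START_P, TRANS_P, EMIT_P, obs))
```

The only difference between the two functions is the aggregation over incoming arcs: a sum for Forward, a max (plus back-pointers) for Viterbi.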

Recap: Forward/Viterbi

Recap: Particle Filtering

Particles: track samples of states rather than an explicit distribution
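One update step of a particle filter might look as follows. This is a sketch under assumptions: the two-state model and its probabilities are hypothetical, and the propagate / weight / resample scheme is the standard textbook variant, not necessarily the one shown on the original slide.

```python
import random

# Hypothetical toy model (numbers are assumptions for illustration only).
TRANS_P = {"rain": {"rain": 0.7, "sun": 0.3},
           "sun": {"rain": 0.3, "sun": 0.7}}
EMIT_P = {"rain": {"umbrella": 0.9, "none": 0.1},
          "sun": {"umbrella": 0.2, "none": 0.8}}


def particle_filter_step(particles, evidence, trans_p, emit_p, rng):
    """Advance a list of state samples by one time step."""
    # 1. Propagate: sample a successor for every particle from Pr(x_t | x_{t-1}).
    moved = []
    for x in particles:
        successors = list(trans_p[x])
        weights = [trans_p[x][y] for y in successors]
        moved.append(rng.choices(successors, weights=weights)[0])
    # 2. Weight: score each particle by the evidence likelihood Pr(e_t | x_t).
    likelihoods = [emit_p[x][evidence] for x in moved]
    # 3. Resample: draw the same number of particles in proportion to the weights.
    return rng.choices(moved, weights=likelihoods, k=len(moved))


if __name__ == "__main__":
    rng = random.Random(0)
    particles = ["rain"] * 500 + ["sun"] * 500
    particles = particle_filter_step(particles, "umbrella", TRANS_P, EMIT_P, rng)
    print(particles.count("rain") / len(particles))
```

The fraction of particles in a state approximates the filtered belief in that state; more particles give a better approximation at a higher cost.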

Natural Language

• 100,000 years ago humans started to speak
• 7,000 years ago humans started to write

Machines process natural language to:

• acquire information
• communicate with humans

Natural Language Processing

• Speech technologies
    • Automatic speech recognition (ASR)
    • Text-to-speech synthesis (TTS)
    • Dialog systems
• Language processing technologies
    • Machine translation
    • Information extraction
    • Web search, question answering
    • Text classification, spam filtering, etc.


Digitalizing Speech

Speech input is an acoustic waveform

Spectral Analysis

Acoustic Feature Sequence

State Space

• $\Pr(E \mid X)$ encodes which acoustic vectors are appropriate for each phoneme (each kind of sound)
• $\Pr(X \mid X')$ encodes how sounds can be strung together
• We will have one state for each sound in each word
• From some state $x$, can only:
    • Stay in the same state (e.g. speaking slowly)
    • Move to the next position in the word
    • At the end of the word, move to the start of the next word
• We build a little state graph for each word and chain them together to form our state space $X$
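The construction of the state space can be sketched as follows. The lexicon and its phoneme transcriptions are hypothetical, and the sketch only enumerates which transitions are allowed; the transition probabilities themselves would still have to be estimated.

```python
# Hypothetical two-word lexicon; the phoneme symbols are illustrative only.
LEXICON = {"hi": ["hh", "ay"], "there": ["dh", "eh", "r"]}


def build_state_space(lexicon):
    """Return {state: [allowed successor states]}, with state = (word, position)."""
    word_starts = [(word, 0) for word in lexicon]
    transitions = {}
    for word, phones in lexicon.items():
        for i in range(len(phones)):
            state = (word, i)
            successors = [state]                    # stay in the same sound
            if i + 1 < len(phones):
                successors.append((word, i + 1))    # next position in the word
            else:
                # End of the word: jump to the start of any word
                # (skipping `state` itself avoids a duplicate for one-sound words).
                successors.extend(s for s in word_starts if s != state)
            transitions[state] = successors
    return transitions


if __name__ == "__main__":
    for state, successors in build_state_space(LEXICON).items():
        print(state, "->", successors)
```

Each word contributes a small left-to-right chain, and the end-of-word arcs are what chain the word graphs together into one state space.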

HMM for speech

Transition with Bigrams

Decoding

• While there are some practical issues, finding the words given the acoustics is an HMM inference problem
• We want to know which state sequence $x_{1:T}$ is most likely given the evidence $e_{1:T}$:

$$x^*_{1:T} = \arg\max_{x_{1:T}} \Pr(x_{1:T} \mid e_{1:T})$$

• From the sequence $x$, we can simply read off the words


Machine Translation

• Fundamental goal: analyze and process human language, broadly, robustly, accurately...
• End systems that we want to build:
    Ambitious: speech recognition, machine translation, information extraction, dialog interfaces, question answering...
    Modest: spelling correction, text categorization, language recognition, genre classification.

Language Models

• A language is defined by a set of strings and by rules called grammars.
• Formal languages also need semantics that define meaning.
• Natural languages are:
    1. not definitive: there is disagreement about grammar rules
         "Not to be invited is sad"
         "To be not invited is sad"
    2. ambiguous:
         "Entire store 25% off"
         "I will bring my bike tomorrow if it looks nice in the morning."
    3. large and constantly changing

• n-gram: a sequence of n characters, or a sequence of n words or syllables
• n-gram models: define probability distributions over these sequences
• An n-gram model is defined as a Markov chain of order n − 1.

For a trigram:

$$p(c_i \mid c_{1:i-1}) = p(c_i \mid c_{i-2:i-1})$$

$$p(c_{1:N}) = \prod_{i=1}^{N} p(c_i \mid c_{1:i-1}) = \prod_{i=1}^{N} p(c_i \mid c_{i-2:i-1})$$

• With 100 characters, a trigram model already has $100^3$ (a million) entries
• With words it is even worse
• Corpus: a body of text
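A character trigram model estimated from a corpus can be sketched as below. This is a minimal illustration under assumptions: probabilities are plain relative frequencies without smoothing, and the two-space padding used as context for the first characters is a convenience of this sketch, not something stated on the slide.

```python
from collections import Counter

PAD = "  "  # two padding characters give the first characters a context


def train_trigrams(corpus):
    """Count trigrams and their two-character contexts in a corpus string."""
    padded = PAD + corpus
    trigrams = Counter(padded[i:i + 3] for i in range(len(padded) - 2))
    contexts = Counter(padded[i:i + 2] for i in range(len(padded) - 2))
    return trigrams, contexts


def sequence_probability(text, trigrams, contexts):
    """p(c_1:N) as the product of p(c_i | c_{i-2:i-1}), estimated by counts."""
    padded = PAD + text
    prob = 1.0
    for i in range(len(padded) - 2):
        context, gram = padded[i:i + 2], padded[i:i + 3]
        if contexts[context] == 0:
            return 0.0  # unseen context: the unsmoothed estimate is zero
        prob *= trigrams[gram] / contexts[context]
    return prob


if __name__ == "__main__":
    tri, ctx = train_trigrams("aaab")
    print(sequence_probability("aab", tri, ctx))
    print(100 ** 3)  # number of possible trigrams over 100 characters
```

Any sequence containing an unseen trigram gets probability zero here, which is why practical models add smoothing.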

Language identification

Learned from corpus:

$$p(c_i \mid c_{i-2:i-1}, l)$$

Most probable language:

$$\begin{aligned}
l^* &= \arg\max_l \; p(l \mid c_{1:N}) \\
    &= \arg\max_l \; p(l)\, p(c_{1:N} \mid l) && \text{(Bayes)} \\
    &= \arg\max_l \; p(l) \prod_{i=1}^{N} p(c_i \mid c_{i-2:i-1}, l) && \text{(Markov property)}
\end{aligned}$$

Computers can reach 99% accuracy
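The argmax above can be sketched with one character trigram model per language. Everything specific here is an assumption for illustration: the two tiny corpora, the uniform priors, the add-one smoothing and the `alphabet_size` constant. Log probabilities are used to avoid numerical underflow on long texts.

```python
import math
from collections import Counter

# Hypothetical toy corpora; real systems train on much larger text collections.
CORPORA = {
    "en": "the quick brown fox jumps over the lazy dog and the cat sleeps in the house",
    "da": "den hurtige brune ræv hopper over den dovne hund og katten sover i huset",
}
PRIORS = {"en": 0.5, "da": 0.5}  # assumed uniform prior p(l)


def train(corpus):
    """Trigram and context counts for one language."""
    padded = "  " + corpus.lower()
    trigrams = Counter(padded[i:i + 3] for i in range(len(padded) - 2))
    contexts = Counter(padded[i:i + 2] for i in range(len(padded) - 2))
    return trigrams, contexts


def log_likelihood(text, model, alphabet_size=256):
    """log p(c_1:N | l) with add-one smoothing (alphabet_size is an assumption)."""
    trigrams, contexts = model
    padded = "  " + text.lower()
    total = 0.0
    for i in range(len(padded) - 2):
        gram, context = padded[i:i + 3], padded[i:i + 2]
        total += math.log((trigrams[gram] + 1) / (contexts[context] + alphabet_size))
    return total


def identify(text, models, priors):
    """l* = argmax_l  log p(l) + sum_i log p(c_i | c_{i-2:i-1}, l)."""
    return max(models, key=lambda l: math.log(priors[l]) + log_likelihood(text, models[l]))


MODELS = {lang: train(corpus) for lang, corpus in CORPORA.items()}

if __name__ == "__main__":
    print(identify("the dog sleeps", MODELS, PRIORS))
    print(identify("hunden sover i huset", MODELS, PRIORS))
```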

Machine Translation

Rough translation: gives the main point but contains errors

Pre-edited translation: the original text is written in a constrained language that is easier to translate automatically

Restricted-source translation: fully automatic, but only on technical content such as weather forecasts

Machine Translation Systems

Very simplified, there are three types of machine translation:

Statistical machine translation (SMT): learns relational dependencies of features such as n-grams, lemmas, etc.
    • Requires large data sets
    • Example: Google Translate
    • Relatively easy to implement

Rule-based machine translation (RBMT): uses grammatical rules and language constructions to analyze syntax and semantics
    • Uses moderate-size data sets
    • Requires long development time and expertise

Hybrid machine translation:
    • Either construct a translation with RBMT and use SMT to post-process and optimize the result
    • Or use grammatical rules to derive further features that are then fed into the statistical learning machine
    • New direction of research

Brief History

• Interlingual model: the source language, i.e., the text to be translated, is transformed into an interlingua, i.e., an abstract language-independent representation. The target language is then generated from the interlingua.
• Transfer model: the source language is transformed into an abstract, less language-specific representation. Linguistic rules that are specific to the language pair then transform the source language representation into an abstract target language representation, and from this the target sentence is generated.
• Direct model: words are translated directly without passing through an additional representation.

Levels of Transfer

Vauquois pyramid (from top to bottom):

Interlingua semantics: Attraction(NamedJohn, NamedMary, High)
English semantics: Loves(John, Mary)                  French semantics: Aime(Jean, Marie)
English syntax: S(NP(John), VP(loves, NP(Mary)))      French syntax: S(NP(Jean), VP(aime, NP(Marie)))
English words: John loves Mary                        French words: Jean aime Marie

Levels of Transfer

The problem with dictionary look ups

Statistical machine translation

Data driven MT

• $e$: sequence of strings in English
• $f$: sequence of strings in French

$$f^* = \arg\max_f \Pr(f \mid e) = \arg\max_f \Pr(e \mid f)\,\Pr(f)$$

The second equality follows from Bayes' rule, since $\Pr(e)$ does not depend on $f$.

• $\Pr(e \mid f)$ is learned from a bilingual (parallel) corpus made of phrases seen before

English: There is a smelly wumpus sleeping in 2 2
French: Il y a un wumpus malodorant qui dort à 2 2

English phrases: e1, e2, e3, e4, e5
French phrases (in sentence order): f1, f3, f2, f4, f5
Distortions: d1 = 0, d3 = −2, d2 = +1, d4 = +1, d5 = 0

Given an English sentence $e$, find the French sentence $f^*$:

1. Break the English sentence $e$ into phrases $e_1, \dots, e_n$
2. For each $e_i$, choose a French phrase $f_i$ with probability $\Pr(f_i \mid e_i)$
3. Choose a permutation of the phrases $f_1, \dots, f_n$: for each $f_i$, choose a distortion $d_i$, the number of words that phrase $f_i$ has moved with respect to $f_{i-1}$

$$\Pr(f, d \mid e) = \prod_{i=1}^{n} \Pr(f_i \mid e_i)\,\Pr(d_i)$$

With 100 candidate French phrases for each of 5 English phrases, there are $100^5$ different 5-phrase translations and $5!$ reorderings of each.
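The scoring formula and the size of the search space can be sketched as follows. Several things are assumptions: the word spans in `FRENCH_SPANS` are one segmentation chosen so that the computed distortions reproduce the values listed for the example, the formula $d_i = \mathrm{START}(f_i) - \mathrm{END}(f_{i-1}) - 1$ is the textbook (Russell and Norvig) definition of distortion, and the probabilities in the usage lines are made up.

```python
import math

# Assumed (start, end) word positions of f_1..f_5 in the French sentence,
# 1-based and listed in English phrase order; chosen to match the slide's d_i.
FRENCH_SPANS = [(1, 4), (6, 6), (5, 5), (7, 8), (9, 11)]


def distortions(spans):
    """d_i = START(f_i) - END(f_{i-1}) - 1, with END(f_0) = 0."""
    result, prev_end = [], 0
    for start, end in spans:
        result.append(start - prev_end - 1)
        prev_end = end
    return result


def translation_probability(phrase_probs, distortion_probs):
    """Pr(f, d | e) = prod_i Pr(f_i | e_i) * Pr(d_i)."""
    return math.prod(pf * pd for pf, pd in zip(phrase_probs, distortion_probs))


def search_space_size(candidates_per_phrase, n_phrases):
    """Number of phrase choices times number of reorderings."""
    return candidates_per_phrase ** n_phrases * math.factorial(n_phrases)


if __name__ == "__main__":
    print(distortions(FRENCH_SPANS))
    # Hypothetical probabilities, for illustration only.
    print(translation_probability([0.5, 0.5], [0.5, 1.0]))
    print(search_space_size(100, 5))
```

The size of this space is why exhaustive enumeration is not an option and decoding relies on heuristic search.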

Learn probabilities

1. Parallel corpus: parliamentary debates, web pages
2. Segment into sentences. Periods are good indicators, with some care.
3. Align sentences. Sentence length is one indicator, landmarks are another.
4. Align phrases within a sentence: an iterative process that aggregates evidence (e.g., no other pair appears so frequently in the corpus). This yields $\Pr(f_i \mid e_i)$.
5. Extract distortions: count how often each distortion appears in the corpus after phrase alignment (with smoothing).
6. Improve the estimates of $\Pr(f \mid e)$ and $\Pr(d)$ with EM.
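Steps 4 and 5 end in simple counting, which can be sketched as below. This assumes the phrase alignment has already been done: the aligned pairs and the list of observed distortions are hypothetical inputs, add-one smoothing over a clipped range is just one possible smoothing choice, and the EM refinement of step 6 is not shown.

```python
from collections import Counter


def estimate_phrase_table(aligned_pairs):
    """Relative-frequency estimate Pr(f | e) = count(e, f) / count(e)."""
    pair_counts = Counter(aligned_pairs)
    english_counts = Counter(e for e, _ in aligned_pairs)
    return {(e, f): c / english_counts[e] for (e, f), c in pair_counts.items()}


def estimate_distortion(observed, max_abs=3):
    """Add-one smoothed Pr(d) over the range -max_abs..max_abs."""
    # Clip rare long-distance moves into the supported range.
    clipped = [max(-max_abs, min(max_abs, d)) for d in observed]
    counts = Counter(clipped)
    support = range(-max_abs, max_abs + 1)
    total = len(clipped) + len(support)
    return {d: (counts[d] + 1) / total for d in support}


if __name__ == "__main__":
    # Hypothetical aligned phrase pairs (English, French).
    pairs = [("wumpus", "wumpus"), ("smelly", "malodorant"),
             ("smelly", "puant"), ("smelly", "malodorant")]
    print(estimate_phrase_table(pairs))
    print(estimate_distortion([0, 1, -2, 1, 0]))
```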

Learning to translate

An HMM model

Machine translation systems

Grammars

Grammars: a set of rules (applied from left to right) that describe how to form strings from the language's alphabet that are valid according to the language's syntax (a language generator).

Parsing is the process of recognizing a string in a natural language by breaking it down into a set of symbols and analyzing each one against the grammar of the language, i.e., determining whether the string belongs to the language or is grammatically incorrect. The result is a parse tree.

• context free grammars (see http://en.wikipedia.org/wiki/Chomsky_hierarchy)
• probabilistic context free grammars
• lexicalized probabilistic context free grammars
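Recognition with a context free grammar can be sketched with the CKY algorithm. The choice of CKY and the tiny grammar in Chomsky normal form (`LEXICAL`, `BINARY`) are assumptions for illustration; the sketch only decides membership in the language and does not build the parse tree.

```python
# Hypothetical grammar in Chomsky normal form:
#   S -> NP VP,  VP -> V NP,  NP -> "John" | "Mary",  V -> "loves"
LEXICAL = {"John": {"NP"}, "Mary": {"NP"}, "loves": {"V"}}
BINARY = {("NP", "VP"): {"S"}, ("V", "NP"): {"VP"}}


def cky_recognize(words, lexical, binary, start="S"):
    """Return True if the word sequence can be derived from `start`."""
    n = len(words)
    # table[i][j]: set of nonterminals that derive words[i:j]
    table = [[set() for _ in range(n + 1)] for _ in range(n + 1)]
    for i, word in enumerate(words):
        table[i][i + 1] = set(lexical.get(word, ()))
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span
            for k in range(i + 1, j):           # split point
                for left in table[i][k]:
                    for right in table[k][j]:
                        table[i][j] |= binary.get((left, right), set())
    return n > 0 and start in table[0][n]


if __name__ == "__main__":
    print(cky_recognize("John loves Mary".split(), LEXICAL, BINARY))
    print(cky_recognize("loves John Mary".split(), LEXICAL, BINARY))
```

Storing back-pointers in each table cell would recover the parse tree, and attaching a probability to each rule turns the same table into a parser for probabilistic context free grammars.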

Parsing as search

Probabilistic Context Free Grammars

Hybrid Systems

The translated sentence can be checked against a monolingual corpus.

Machine Translation

• Translate text from one language to another
• Recombines fragments of example translations
• Challenges:
    • What fragments? [learning to translate]
    • How to make efficient? [fast translation search]

Machine Translation

• After a first bubble, the sector is now moving at full speed
• In spite of the economic crisis, 7% growth worldwide
• Commercial and technological focus
• Danish is a marginal language, and existing systems cannot be applied reliably
• www.eicom.dk and www.oversaetterhuset.dk seek development in collaboration with research institutions (SDU, CBS, ASB)

Announcement

Need for human resources; possibilities for thesis and individual study activities together with:

• Visual Interactive Syntax Learning project at the Institute for Language and Communication of SDU
  http://beta.visl.sdu.dk/constraint_grammar.html
• Eckhard Bick, project leader
  http://en.wikipedia.org/wiki/Eckhard_Bick

If interested, contact me.
