
MT For Low-Density Languages Ryan Georgi Ling 575 – MT Seminar Winter 2007




What is “Low Density”?

In NLP, languages are usually chosen for:
- Economic value
- Ease of development
- Funding (NSA, anyone?)

What is “Low Density”?

As a result, NLP work until recently has focused on a rather small set of languages.

- e.g. English, German, French, Japanese, Chinese

What is “Low Density”?

“Density” refers to the availability of resources (primarily digital) for a given language:
- Parallel text
- Treebanks
- Dictionaries
- Chunked, semantically tagged, or other annotation

What is “Low Density”?

“Density” is not necessarily linked to speaker population.
- Our favorite example: Inuktitut

So, why study LDL?

Preserving endangered languages

Spreading benefits of NLP to other populations
- (Tegic has T9 for Azerbaijani now)

Benefits of wide typological coverage for cross-linguistic research (?)

Problem of LDL?

“The fundamental problem for annotation of lower-density languages is that they are lower density” – Maxwell & Hughes

The easiest (and often best) NLP development is done with statistical methods:
- Training requires lots of resources
- Resources require lots of money
- Cost/benefit chicken-and-egg problem

What are our options?

Create corpora by hand
- Very time-consuming (= expensive)
- Requires trained native speakers

Digitize printed resources
- Time-consuming
- May require trained native speakers
  - e.g. an orthography without Unicode entries

What are our options?

Traditional requirements are going to be difficult to satisfy, no matter how we slice it.

We need to, then:
- Maximize information extracted from resources we can get
- Reduce requirements for building a system

Maximizing Information with IGT

Interlinear Glossed Text
- Traditional form of transcription for linguistic field researchers and grammarians

Example:

Rhoddodd yr athro lyfr i’r bachgen ddoe
gave-3sg the teacher book to-the boy yesterday
“The teacher gave a book to the boy yesterday”

Benefits of IGT

As IGT is frequently used in fieldwork, it is often available for low-density languages.

IGT provides information about syntax, morphology, …

The translation line is usually in a high-density language that we can use as a pivot language.

Drawbacks of IGT

Data can be ‘abnormal’ in a number of ways:
- Usually quite short
- May be used by a grammarian to illustrate fringe usages
- Often purposely limited vocabularies

Still, in working with LDL it might be all we’ve got.

Utilizing IGT

First, a big nod to Fei (this is her paper!)

As we saw in HW#2, word alignment is hard.

IGT, however, often gets us halfway there!

Utilizing IGT

Take the previous example:

Rhoddodd yr athro lyfr i’r bachgen ddoe
gave-3sg the teacher book to-the boy yesterday
“The teacher gave a book to the boy yesterday”


Utilizing IGT

Take the previous example:

Rhoddodd yr athro lyfr i’r bachgen ddoe
gave-3sg the teacher book to-the boy yesterday
“The teacher gave a book to the boy yesterday”

- The interlinear already aligns the source with the gloss
- Often, the gloss uses words found in the translation already
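These two observations can be turned into a first-pass aligner. A minimal sketch (mine, not from the paper; the function name is illustrative): the source and gloss lines are position-aligned by construction, so any gloss piece that also appears in the translation links the corresponding source word to a translation word.

```python
def align_igt(source, gloss, translation):
    """Align source words to translation words via the gloss line."""
    src_words = source.split()
    trans_tokens = translation.split()
    trans_words = [w.strip('".,').lower() for w in trans_tokens]

    alignments = []
    for i, g in enumerate(gloss.split()):
        # A gloss entry like "gave-3sg" bundles a stem with grammatical
        # markers; check each hyphen-separated piece against the translation.
        for piece in g.lower().split('-'):
            if piece in trans_words:
                j = trans_words.index(piece)
                alignments.append((src_words[i], trans_tokens[j]))
                break
    return alignments

igt_source = "Rhoddodd yr athro lyfr i'r bachgen ddoe"
igt_gloss = "gave-3sg the teacher book to-the boy yesterday"
igt_trans = "The teacher gave a book to the boy yesterday"

print(align_igt(igt_source, igt_gloss, igt_trans))
```

On the Welsh example, every source word finds a partner, because the gloss happens to reuse the translation's vocabulary exactly.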

Utilizing IGT

Alignment isn’t always this easy…

xaraju mina lgurfati wa nah.nu nadxulu
xaraj-u: mina ?al-gurfat-i wa nah.nu na-dxulu
exited-3MPL from DEF-room-GEN and we 1PL-enter
'They left the room as we were entering it'

(Source: Modern Arabic: Structures, Functions, and Varieties; Clive Holes)


We can get a little more by stemming…


…but we’re going to need more.
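The stemming step can be sketched as follows. This is a toy illustration, with a deliberately naive suffix-stripper standing in for a real stemmer (e.g. Porter's):

```python
def crude_stem(word):
    # Very naive stemmer: strip a few common English suffixes.
    # (Illustrative only; a real system would use e.g. a Porter stemmer.)
    for suffix in ('ing', 'ed', 's'):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[:-len(suffix)]
    return word

def gloss_matches(gloss_piece, trans_word):
    # Match a gloss piece against a translation word by shared stem.
    return crude_stem(gloss_piece.lower()) == crude_stem(trans_word.lower())

# "enter" in the gloss line now matches "entering" in the translation:
print(gloss_matches('enter', 'entering'))  # True
```

Even so, string matching will never link “exited” in the gloss to “left” in the translation, which is exactly why we’re going to need more.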

Utilizing IGT

Thankfully, with an English translation, we already have tools to get phrase and dependency structures that we can project:

(Source: Will & Fei’s NAACL 2007 paper!)


Utilizing IGT

What can we get from this?
- Automatically generated CFGs
- Can infer word order from these CFGs
- Can infer possible constituents
- …suggestions?
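Reading production rules off a projected tree is mechanical. A minimal sketch, assuming trees encoded as nested tuples (the encoding, and the particular tree for the Welsh example, are my illustrations, not the paper's):

```python
def extract_rules(tree, rules=None):
    """Read CFG productions off a tree given as (label, child, child, ...),
    where leaf children are plain strings."""
    if rules is None:
        rules = []
    label, children = tree[0], tree[1:]
    rhs = tuple(c if isinstance(c, str) else c[0] for c in children)
    rules.append((label, rhs))
    for c in children:
        if not isinstance(c, str):
            extract_rules(c, rules)
    return rules

# A projected tree for the Welsh example (structure is illustrative):
tree = ('S',
        ('VB', 'Rhoddodd'),
        ('NP', ('DT', 'yr'), ('NN', 'athro')),
        ('NP', ('NN', 'lyfr')),
        ('PP', ('IN', "i'r"), ('NN', 'bachgen')),
        ('RB', 'ddoe'))

for lhs, rhs in extract_rules(tree):
    print(lhs, '->', ' '.join(rhs))
```

The very first rule, S -> VB NP NP PP RB, already reflects the verb-initial word order of the source sentence, which is the kind of inference the bullets above describe.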

From a small amount of data, this is a lot of information, but what about…

Reducing Data Requirements with Prototyping

Grammar Induction

So, we have a way to get production rules from a small amount of data.

Is this enough?
- Probably not.
- CFGs aren’t known for their robustness

How about using what we have as a bootstrap?

Grammar Induction

Given unannotated text, we can derive PCFGs.
- Without annotation, though, we just have unlabelled trees:

  (ROOT
    (C2 (X0 the) (X1 dog)
        (Y2 (Z3 fell) (N4 asleep))))

  with probabilities (p=0.02, p=0.09, p=0.003, p=5.3e-2, p=0.45e-4) attached to the productions

- Such an unlabelled parse doesn’t give us S -> NP VP, though.

Grammar Induction

Can we get labeled trees without annotated text?

Haghighi & Klein (2006)
- Propose a way in which production rules can be passed to a PCFG induction algorithm as “prototypical” constituents
- Think of these prototypes as a rubric that could be given to a human annotator
  - e.g. for English, NP -> DT NN

Grammar Induction

Let’s take the possible constituent DT NN.
- We could tell our PCFG algorithm to apply this as a constituent everywhere it occurs
- But what about DT NN NN (“the train station”)?
- We would like to catch this as well
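Exact pattern matching shows the problem concretely. A small sketch (my illustration, not H&K’s code): find every span of a POS-tag sequence that exactly matches the prototype.

```python
def find_pattern(tags, pattern):
    """Return spans (i, j) where the POS sequence `pattern` occurs exactly."""
    n, m = len(tags), len(pattern)
    return [(i, i + m) for i in range(n - m + 1) if tags[i:i + m] == pattern]

# "the dog entered the train station"
tags = ['DT', 'NN', 'VBD', 'DT', 'NN', 'NN']
print(find_pattern(tags, ['DT', 'NN']))  # [(0, 2), (3, 5)]
```

The exact matcher finds “the dog” and “the train”, but the full NP “the train station” is span (3, 6), which it misses entirely; that is the gap a similarity measure needs to fill.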

Grammar Induction

H&K’s solution? Distributional clustering:
- “a similarity measure between two items on the basis of their immediate left and right contexts”
- …to be honest, I lose them in the math here.

Importantly, however, weighting the probability of a constituent with the right measure improves on the baseline unsupervised f-measure, from 35.3 to 62.2.
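The quoted idea can at least be sketched: collect left/right context counts for each item and compare items by cosine similarity. This is a toy illustration of distributional similarity, not Haghighi & Klein’s actual model:

```python
from collections import Counter
from math import sqrt

def context_vector(sequences, target):
    """Count (left, right) context pairs around each occurrence of `target`."""
    ctx = Counter()
    for seq in sequences:
        for i, item in enumerate(seq):
            if item == target:
                left = seq[i - 1] if i > 0 else '<s>'
                right = seq[i + 1] if i < len(seq) - 1 else '</s>'
                ctx[(left, right)] += 1
    return ctx

def cosine(a, b):
    # Cosine similarity over sparse count vectors (Counter gives 0 for
    # missing keys, so the union of keys is safe to iterate).
    dot = sum(a[k] * b[k] for k in set(a) | set(b))
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# "dog" and "cat" occur in identical contexts, so similarity is maximal:
seqs = [['the', 'dog', 'barked'], ['the', 'cat', 'barked']]
print(cosine(context_vector(seqs, 'dog'), context_vector(seqs, 'cat')))  # 1.0
```

Items (or candidate spans) whose contexts distribute alike can then be clustered together, which is how DT NN NN can be grouped with the DT NN prototype.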

So… what now?

Next Steps

By extracting production rules from a very small amount of data using IGT, and using Haghighi & Klein’s unsupervised methods, it may be possible to bootstrap an effective language model from very little data!

Next Steps

Possible applications:
- Automatic generation of language resources
  (While a system with the same goals would only compound error, automatically annotated data could be easier for a human to correct than to generate by hand)
- Assist linguists in the field
  (Better model performance could imply better grammar coverage)

…you tell me!