CS460/626: Natural Language Processing/Speech, NLP and the Web
(Lecture 15: Language Divergence)
Pushpak Bhattacharyya
CSE Dept., IIT Bombay
8th Feb, 2011


Key difference between Statistical/ML-based NLP and Knowledge-based/linguistics-based NLP
  • Stat NLP: speed and robustness are the main concerns
  • KB NLP: phenomena based
  • Example: boys, toys, toes
      To get the root, remove "s"
      How about foxes, boxes, ladies?
  • Understand phenomena: go deeper
  • Slower processing

Perspective on Statistical MT
  • What is a good translation?
      Faithful to source
      Fluent in target
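The plural example on the first slide (boys, toys, toes versus foxes, boxes, ladies) can be made concrete with a small sketch. It contrasts the "remove s" shortcut with a slightly more phenomenon-aware rule set. Both function names and the three rules are illustrative assumptions, not part of the lecture, and the rules are far from a complete English morphological analyser.

```python
import re


def naive_root(word):
    """The shortcut from the slide: just strip a trailing 's'."""
    return word[:-1] if word.endswith("s") else word


def rule_based_root(word):
    """A few hand-written plural rules (illustrative only, not exhaustive)."""
    if word.endswith("ies") and len(word) > 3:
        return word[:-3] + "y"            # ladies -> lady
    if re.search(r"(x|s|z|ch|sh)es$", word):
        return word[:-2]                  # foxes -> fox, boxes -> box
    if word.endswith("s") and not word.endswith("ss"):
        return word[:-1]                  # boys -> boy, toes -> toe
    return word


for w in ["boys", "toys", "toes", "foxes", "boxes", "ladies"]:
    print(w, naive_root(w), rule_based_root(w))
```

The naive rule is fast but yields "foxe", "boxe" and "ladie"; the rule-based version handles these cases at the cost of extra checks, which mirrors the robustness versus depth trade-off the slide describes.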

(Figure for "What is a good translation?": fluency vs. faithfulness)

Word-alignment example
  English:       (1) Ram  (2) has  (3) an  (4) apple
  Hindi (gloss): (1) Ram  (2) of  (3) near  (4) an  (5) apple  (6) is
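The numbered positions in the word-alignment example can be stored as a mapping from source positions to target positions. The sketch below uses the English words and the Hindi gloss from the slide. The specific links are an assumption, since the alignment lines of the original figure are not preserved here, and the names `alignment` and `aligned_words` are hypothetical.

```python
english = ["Ram", "has", "an", "apple"]
hindi_gloss = ["Ram", "of", "near", "an", "apple", "is"]

# Assumed links, 1-based as on the slide: English position -> Hindi positions.
# "has" is taken to correspond to the discontiguous "of near ... is".
alignment = {1: [1], 2: [2, 3, 6], 3: [4], 4: [5]}


def aligned_words(alignment, source, target):
    """Turn position links into a word-to-words view for inspection."""
    return {
        source[s - 1]: [target[t - 1] for t in targets]
        for s, targets in alignment.items()
    }


print(aligned_words(alignment, english, hindi_gloss))
```

A one-to-many mapping is used because a single English word may correspond to several, possibly non-adjacent, target words.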

Kinds of MT Systems (point of entry from source to the target text)

Why is MT difficult? Classical NLP problems
  • Ambiguity
      Lexical: Went to the bank to withdraw money
      Structural: Saw the boy with a telescope
  • Ellipsis: I wanted a book and John a pen
  • Co-reference
      Anaphora: John said he likes music
      Hypernymic: John's house is a robust structure

Why is MT difficult? Language divergence
  • Lexico-semantic divergence
  • Structural divergence

Language Divergence (English to Hindi: noun to adjective)
  E: The demands on sportsmen today can lead to burnout at an early age.
     (burnout, noun: the state of being extremely tired or ill, either physically or mentally, because you have worked too hard)
  Hindi gloss: Sportsmen-from, which today demands exist, that (correlative) them early age in inactive do can (aspectual) V-AUX.

Language Divergence (English to Hindi: noun to verb)
  E: Every concert they gave us was a sell-out.
     (sell-out: an event for which all the tickets have been sold)
  Hindi gloss: Their every concert-of all ticket sell-past-passive-plural (were sold out).

Language Divergence (English to Hindi: adjective to adverb)
  E: The children were watching in wide-eyed amazement.
     (wide-eyed: with eyes fully open because of fear, great surprise, etc.)
  Hindi gloss: Children amazement-with eyes opening widely seeing were.

Language Divergence (English to Hindi: adjective to verb)
  E: He was in a bad mood at breakfast and wasn't very communicative.
     (communicative: able and willing to talk and give information to other people)
  Hindi gloss: Breakfast-of time he bad mood-in was and much conversation not do-past-progressive-sing (was doing).

Language Divergence (English to Hindi: preposition to adverb)
  E: It gets cooler toward evening.
     (toward: near a point in time)
  Hindi gloss: Evening happening-happening (reduplication; typical Indian language phenomenon) cold increase-goes (verb compound; polar vector).

Language Divergence (English to Hindi: idiomatic usage)
  E: Given her interest in children, teaching seems the right job for her.
     (given: when you consider something)
  Hindi gloss: Children-towards her interest having seen, teaching for her appropriate seems.

Language Divergence is ubiquitous (Marathi-Hindi-English: case marking and postpositions transfer: works!)
  Not only for languages from distant families, but also within close cousins
  • (simple present) He goes.
  • (universal truth) The earth revolves round the sun.

Language Divergence (Marathi-Hindi-English: case marking and postpositions: works again!)
  • (historical truth) ... Krushna says to Arjuna
  • (quoting) ... Damle says, ...

Language Divergence (Marathi-Hindi-English: case marking and postpositions: does not work!)
  • (immediate past) When did you come? Just now (I came).
  • (certainty in future) He is in for a thrashing.
  • (assurance) I will see you tomorrow.

Language Divergence Theory: Lexico-Semantic Divergences (ref: Dave, Parikh, Bhattacharyya, Journal of MT, 2002)
  • Conflational divergence
      E: stab; H: churaa se maaranaa (knife-with hit)
      S: Utrymningsplan; E: escape plan
  • Structural divergence
      E: SVO; H: SOV
  • Categorial divergence
      Change is in POS category (many examples discussed)
  • Head swapping divergence
      E: Prime Minister of India; H: bhaarat ke pradhaan mantrii (India-of Prime Minister)
  • Lexical divergence
      E: advise; H: paraamarsh denaa (advice give): noun incorporation, a very common Indian language phenomenon

Language Divergence Theory: Syntactic Divergences
  • Constituent order divergence
      E: Singh, the PM of India, will address the nation today; H: bhaarat ke pradhaan mantrii, singh, (India-of PM, Singh)
  • Adjunction divergence
      E: She will visit here in the summer; H: vah yahaa garmii meM aayegii (she here summer-in will come)
  • Preposition-stranding divergence
      E: Who do you want to go with?; H: kisake saath aap jaanaa chaahate ho? (who with)
  • Null subject divergence
      E: I will go; H: jaauMgaa (subject dropped)
  • Pleonastic divergence
      E: It is raining; H: baarish ho rahii hai (rain happening is: no translation of "it")

Entropy considerations
  Work of Chirag and Venkatesh, ongoing

Language Typology

Parallel Corpora (English, Hindi, Marathi)
  E: Jaipur , popularly known as the Pink City , is the capital of Rajasthan state , India .
  E: Until the war of 1982 , the rainy , windswept Falkland Islands were a forgotten remnant of the old British Empire .
  E: Spanish rule was administered from a distance , leaving the various regions to develop separately from the capital , Caracas , which was founded by Diego de Losada in 1567 .

Phrase Table Entries

Hindi-English phrase table entries (source phrase ||| translation ||| probability)
  ||| a ||| 0.1
  ||| afford ||| 0.1
  ||| offer ||| 0.5
  ||| offers ||| 0.3
  Contribution to entropy = 0.507

Hindi-Marathi phrase table entries (source phrase ||| translation ||| probability)
  ||| ||| 0.05
  ||| ||| 0.2
  ||| ||| 0.05
  ||| ||| 0.6
  ||| ||| 0.1
  Contribution to entropy = 0.503
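The two "Contribution to entropy" figures can be checked with a short sketch. It computes the Shannon entropy of the translation probabilities listed for one source phrase. The base-10 logarithm is an assumption: the slides do not state the base, but base 10 is the one that reproduces 0.507 and 0.503. The function name `entropy_contribution` is hypothetical.

```python
import math


def entropy_contribution(probs):
    """Entropy -sum(p * log10(p)) of one source phrase's translation distribution.

    Base 10 is an assumption chosen because it matches the slide's numbers.
    """
    return -sum(p * math.log10(p) for p in probs if p > 0)


hindi_english = [0.1, 0.1, 0.5, 0.3]         # a, afford, offer, offers
hindi_marathi = [0.05, 0.2, 0.05, 0.6, 0.1]  # target phrases not preserved

print(round(entropy_contribution(hindi_english), 3))
print(round(entropy_contribution(hindi_marathi), 3))
```

The Hindi-Marathi entry has five candidate translations yet a slightly lower entropy than the four-way Hindi-English entry, because most of its probability mass (0.6) sits on a single translation.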

Entropy Evaluation
  • The phrase table gives a probability distribution over the possible translations for each source phrase.
  • We use the probability of the source phrase itself to get a distribution for the entire phrase table.
  • Entropy is evaluated as per the standard formula.

  Hindi-Marathi phrase table entropy: 9.671
  Hindi-English phrase table entropy: 9.770
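One possible reading of the evaluation described above is to weight each translation probability by the probability of its source phrase and take the entropy of the resulting joint distribution. The sketch below follows that reading on a made-up two-phrase table. The table contents, the function name `phrase_table_entropy`, and the default base 10 are all assumptions; the slides do not give the exact procedure or the data behind 9.671 and 9.770.

```python
import math


def phrase_table_entropy(table, base=10):
    """Entropy of the joint distribution p(s, t) = p(s) * p(t | s).

    `table` maps a source phrase to (p(s), {translation: p(t | s)}).
    """
    total = 0.0
    for p_source, translations in table.values():
        for p_target_given_source in translations.values():
            p_joint = p_source * p_target_given_source
            if p_joint > 0:
                total -= p_joint * math.log(p_joint) / math.log(base)
    return total


# Hypothetical toy phrase table, for illustration only.
toy_table = {
    "s1": (0.5, {"t1": 0.5, "t2": 0.5}),
    "s2": (0.5, {"t3": 1.0}),
}

print(phrase_table_entropy(toy_table))          # base 10
print(phrase_table_entropy(toy_table, base=2))  # in bits
```

Under this reading, a language pair whose phrase table spreads probability over more competing translations gets a higher entropy, which is consistent with the Hindi-English figure being larger than the Hindi-Marathi one.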

Handling Divergence through Indicative Translation
(Microsoft Techvista Award, Ananthakrishnan 2007)

Indicative Translation: what and why?
  • Native speaker acceptable translation not possible
      especially considering English-Hindi (Indian languages) divergence
  • Compromises
      human-aided translation (post-editing)
      narrow domain (weather reports)
      rough translation (Indicative MT)

Indicative MT
  • Goal: understandable rather than perfect output
  • Purpose: assimilation rather than dissemination (translation on the web)

Divergence between English and Hindi
  • Divergence: differences in lexical and syntactic choices that languages make in expressing ideas
  • MaTra:
      Structural transfer
        SVO to SOV
        post-modifiers to pre-modifiers
      Lexical transfer: WSD + lexicon lookup
        inflections
        case-markers

Divergence between Natural and Indicative Hindi: some examples
(E: English, H: natural Hindi, I: indicative Hindi)
  E: We eat the rotten canteen food every night.
  H:
  I:

  E: The batsman who had been scoring heavily against them has to be removed early.
  H:
  I:

Categorial divergence
  E: I am feeling hungry
  H:
  I:
  n-gram matches: unigrams: 0/6; bigrams: 0/5; trigrams: 0/4; 4-grams: 0/3

Relation between words in noun-noun compounds
  E: The ten best Aamir Khan performances
  H:
  I:
  n-gram matches: unigrams: 5/5; bigrams: 2/4; trigrams: 0/3; 4-grams: 0/2

Lexical divergence
  E: Food, clothing and shelter are a man's basic needs.
  H:
  I:
  n-gram matches: unigrams: 8/10; bigrams: 6/9; trigrams: 4/8; 4-grams: 3/7

Pleonastic Divergence
  E: It is raining
  H:
  I:
  n-gram matches: unigrams: 4/5; bigrams: 3/4; trigrams: 2/3; 4-grams: 1/2

  E: There was a great king
  H:
  I:

Stylistic differences
  E: The Lok Sabha has 545 members.
  H:
  I:
  n-gram matches: unigrams: 5/7; bigrams: 3/6; trigrams: 1/5; 4-grams: 0/4
  Other differences: word order, sentence length

Transliteration and WSD errors
  E: I purchased a bat.
  H:
  I:
  n-gram matches: unigrams: 3/4; bigrams: 1/3; trigrams: 0/2; 4-grams: 0/1

  Divergence/problem     Average BLEU precision   Translation acceptable?
  Categorial             0                        Yes
  Noun-noun compounds    0.38                     Yes
  Lexical                0.6                      Yes
  Transliteration        0.27                     Yes
  Pleonastic             0.68                     No
  Stylistic              0.35                     No
  WSD error              0.27                     No

Advantages of a hybrid Rule-based + SMT system
  • What SMT brings to the table
      If data is available, then no need for linguistic resources
      Quick adaptation to
        new domains (tourism, health)
        new language pairs (English-Gujarati/Marathi)
      See improvements by adding data
  • What rule-based systems bring to the table
      Capture a small set of systematic differences well
        SVO to SOV (do we need to learn this?)
      Better handle on correcting specific cases
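The "Average BLEU precision" column in the divergence table appears to be the plain arithmetic mean of the unigram to 4-gram match ratios given with each example. The sketch below recomputes the column from those ratios. Note that this is an observation about the numbers, not a statement from the lecture, and that it differs from the full BLEU score, which combines the precisions geometrically and applies a brevity penalty. The names `ngram_matches` and `average_precision` are hypothetical.

```python
from fractions import Fraction

# (matches, total) for unigrams, bigrams, trigrams and 4-grams, copied from the slides.
ngram_matches = {
    "Categorial": [(0, 6), (0, 5), (0, 4), (0, 3)],
    "Noun-noun compounds": [(5, 5), (2, 4), (0, 3), (0, 2)],
    "Lexical": [(8, 10), (6, 9), (4, 8), (3, 7)],
    "Pleonastic": [(4, 5), (3, 4), (2, 3), (1, 2)],
    "Stylistic": [(5, 7), (3, 6), (1, 5), (0, 4)],
    "Transliteration / WSD error": [(3, 4), (1, 3), (0, 2), (0, 1)],
}


def average_precision(matches):
    """Arithmetic mean of the n-gram precisions (not the full BLEU score)."""
    precisions = [Fraction(m, total) for m, total in matches]
    return float(sum(precisions) / len(precisions))


for name, matches in ngram_matches.items():
    print(f"{name}: {average_precision(matches):.3f}")
```

The recomputed values agree with the table to two decimal places, which also shows why the measure is a poor proxy here: the categorial example scores 0 yet is judged acceptable, while the pleonastic example scores highest and is not.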

Preprocessing rules + SMT for English-Indian language MT
  • Lack of linguistic resources for Indian languages
  • Lots of resources available for English
  • Morphology is rich for Indian languages
  • Wider systematic syntactic differences between English and Indian languages

Placed within the Vauquois Triangle

Previous work on factored MT

Previous work
  • Niessen and Ney {ney:04} show that the use of morpho-syntactic information drastically reduces the need for bilingual training data
  • Popovic and Ney {ney:06} report the use of morphological and syntactic restructuring information for Spanish-English and Serbian-English translation

Previous work (contd)
  • Koehn and Hoang {koehn:07} propose factored translation models that combine feature functions to handle syntactic, morphological, and other linguistic information in a log-linear model
  • Experiments in translating from English to German, Spanish, and Czech, including the use of morphological factors

Previous work (contd)
  • Avramidis and Koehn {koehn:08} report work on translating from poor to rich morphology, namely, English to Greek and Czech translation
  • Factored models with case and verb conjugation related factors determined by heuristics on parse trees
  • Used only on the source side, and not on the target side

Previous work (contd)
  • Melamed {melamed:04} proposes methods based on tree-to-tree mappings
  • Imamura et al. {imamura:05} present a similar method that achieves significant improvements over a phrase-based baseline model for Japanese-English translation

Previous work (contd)
  • Target language does not have parsing/clause-detecting tools
  • Niessen and Ney {ney:04}: reorder the source language data prior to the SMT training and decoding cycles (German-English SMT)
  • Popovic and Ney {ney:06}: simple local transformation rules for Spanish-English and Serbian-English translation
  • Collins et al. {collins:05}: German clause restructuring to improve German-English SMT
  • Wang et al. {wang:07}: similar work for Chinese-English SMT
  • Ananthakrishnan and Bhattacharyya {anand:08}: syntactic reordering and morphological suffix separation for English-Hindi SMT
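The idea of reordering the source side before SMT training and decoding can be illustrated with a toy sketch of the SVO to SOV rule mentioned earlier in the lecture. It assumes the clause has already been split into labelled subject, verb and object constituents, which a real system would obtain from a parser. The function name `svo_to_sov` and the dictionary layout are hypothetical and do not reproduce any of the cited systems.

```python
def svo_to_sov(clause):
    """Reorder a clause given as {'S': ..., 'V': ..., 'O': ...} into S-O-V order.

    Missing constituents are skipped, so intransitive clauses pass through.
    """
    return " ".join(clause[role] for role in ("S", "O", "V") if clause.get(role))


# English sentence from the word-alignment slide, pre-segmented by hand.
clause = {"S": "Ram", "V": "has", "O": "an apple"}
print(svo_to_sov(clause))
```

After such a step the English words follow the Hindi constituent order, so the statistical model needs to learn less long-distance reordering from data.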