Trends in Machine Translation M.V. Padmavati

Machine Translation System: Chhattisgarhi to Hindi


Page 1: Machine Translation System: Chhattisgarhi to Hindi

Trends in Machine Translation

M.V. Padmavati

Page 2: Machine Translation System: Chhattisgarhi to Hindi

A case study

• Maya wants to pursue a Ph.D. from a foreign university

• She is not interested in USA

• She gets an offer from France

• She went to France

Page 3: Machine Translation System: Chhattisgarhi to Hindi

Maya in France

• The official language is French. Few people know English.

• She has now received a written agreement from the institute by mail.

• The document was in French.

• Possible Solutions?

Page 4: Machine Translation System: Chhattisgarhi to Hindi

Options Maya Has

• Search for a person who knows both languages

– The person should have time

– The person may charge for the work

• Search for machine translator

– Use it any time

– No charge for the work

Ex. Google Translate

Page 5: Machine Translation System: Chhattisgarhi to Hindi

Overview

• What is Machine Translation (MT)?

– Automated system

– Analyzes text from Source Language (SL)

– Produces “equivalent” text in Target Language (TL)

– Ideally without human intervention

Source Language → Target Language

Page 6: Machine Translation System: Chhattisgarhi to Hindi

Machine Translation

Machine Translation (MT) is a sub-field of computational linguistics that investigates the use of software to translate text or speech from one language to another.

Good Morning शुभ प्रभात

Page 7: Machine Translation System: Chhattisgarhi to Hindi

• Automatic translation of all kinds of documents at a quality equaling that of the best human translators.

• In any translation, the meaning of the statement must be preserved.

• Right words and order.

Page 8: Machine Translation System: Chhattisgarhi to Hindi

Problems during Machine Translation

Page 9: Machine Translation System: Chhattisgarhi to Hindi

Word Order

• Hindi is sometimes called an “SOV” language.

<subject> <object> <verb>

• But typical word order of English sentences, is SVO.

<subject> <verb> <object>

Maya likes mangoes

माया को आम पसंद है

Page 10: Machine Translation System: Chhattisgarhi to Hindi

Word Sense

A word may have more than one sense. Choosing the appropriate target word according to context is very important.

Page 11: Machine Translation System: Chhattisgarhi to Hindi

Pronoun Resolution

• Their - उनका / उनकी /उनके

Page 12: Machine Translation System: Chhattisgarhi to Hindi

Idioms

नौ दो ग्यारह होना

To escape

Nine two becomes eleven

Page 13: Machine Translation System: Chhattisgarhi to Hindi

John went to office

जॉन चला गया के लिए कार्यालय

Is Dictionary Sufficient?

Page 14: Machine Translation System: Chhattisgarhi to Hindi

Approaches in MT

Approaches in MT can be classified into five categories:

• Direct MT

• Rule-based MT

– Transfer based MT

– Interlingua based MT

• Corpus-based MT

– Statistical MT

– Example based MT

• Knowledge-based MT

• Neural MT

Page 15: Machine Translation System: Chhattisgarhi to Hindi

A brief history of MT

• (1966-1980): A virtual end to MT research

• The 1980s: Rule based and example based MT

• The 1990s: Statistical MT

• The 2000s: Hybrid MT

• 2015: Neural MT

Page 16: Machine Translation System: Chhattisgarhi to Hindi

Direct Machine Translation

• Input (English Sentence) - Maya slept in the garden.

• Words translation – माया सो गई में बाग |

• Syntactic rearrangement - माया बाग में सो गई |

Besides simple word translation and reordering, suffix handling and preposition handling are needed to make the translation acceptable. This step is called idiomatization.
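The two steps in this example (word translation, then syntactic rearrangement) can be sketched in a few lines of Python. This is only a toy illustration: the lexicon, the stop-word list and the single reordering rule are assumptions made up for this one sentence pattern, not part of any real direct MT system.

```python
# Toy direct MT: word-by-word lookup followed by one reordering rule.
# LEXICON, STOPWORDS and the reordering pattern are illustrative assumptions.
LEXICON = {
    "maya": "माया",
    "slept": "सो गई",
    "in": "में",
    "garden": "बाग",
}
STOPWORDS = {"the"}  # the article is simply dropped in this sketch


def tokenize(sentence):
    """Lower-case, strip the final period and drop stop words."""
    words = sentence.lower().rstrip(".").split()
    return [w for w in words if w not in STOPWORDS]


def translate_words(tokens):
    """Step 1: replace each source word with its dictionary entry."""
    return [LEXICON.get(tok, tok) for tok in tokens]


def reorder_svo_to_sov(tokens):
    """Step 2: 'subject verb preposition noun' -> 'subject noun postposition verb'."""
    subject, verb, prep, noun = tokens
    return [subject, noun, prep, verb]


def direct_translate(sentence):
    """Return (word-by-word output, output after syntactic rearrangement)."""
    tokens = tokenize(sentence)
    word_by_word = " ".join(translate_words(tokens))
    rearranged = " ".join(translate_words(reorder_svo_to_sov(tokens)))
    return word_by_word, rearranged


if __name__ == "__main__":
    for line in direct_translate("Maya slept in the garden."):
        print(line)
```

The rule only handles four-token sentences of this exact shape, which is the adaptability problem of direct MT in miniature: every new pattern or language pair needs new hand-written rules.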

Page 17: Machine Translation System: Chhattisgarhi to Hindi

Direct Machine Translation

• It carries out translation word by word using a bilingual dictionary, usually followed by some syntactic rearrangement.

Page 18: Machine Translation System: Chhattisgarhi to Hindi

Direct Machine Translation

• Direct MT is very difficult if the SL and TL do not share similar syntactic and morphological phenomena.

• For a Hindi to English or English to Hindi translation system, such word-by-word replacement and idiomatization will not produce understandable MT output.

Page 19: Machine Translation System: Chhattisgarhi to Hindi

Limitations of direct MT

• Does not consider the structure of the sentence or the relationships between words.

• Makes no attempt to disambiguate word senses.

• No adaptability: a system developed for a particular language pair will not be suitable for another language pair.

Page 20: Machine Translation System: Chhattisgarhi to Hindi

Morphology

• How a verb inflects for gender, tense and case

कर (Root word)- करता करती करते किये करना

Identifying the root word and how it changes with gender, tense and case is called morphological analysis.
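A minimal sketch of such an analysis for the forms of कर listed above, assuming a tiny hand-written suffix table and one irregular entry. The feature labels are illustrative, and a real morphological analyzer would need far richer paradigms.

```python
# Toy morphological analyzer for the forms of the root "कर".
# The suffix table and feature labels are illustrative assumptions.
SUFFIXES = {
    "ता": "imperfective, masculine singular",
    "ती": "imperfective, feminine",
    "ते": "imperfective, masculine plural",
    "ना": "infinitive",
}

# Forms that cannot be derived by stripping a suffix are listed explicitly.
IRREGULAR = {"किये": ("कर", "perfective, masculine plural")}


def analyze(word):
    """Return (root, features) for a surface form."""
    if word in IRREGULAR:
        return IRREGULAR[word]
    for suffix, features in SUFFIXES.items():
        if word.endswith(suffix) and len(word) > len(suffix):
            return word[: -len(suffix)], features
    return word, "root or unknown form"


if __name__ == "__main__":
    for form in ["करता", "करती", "करते", "किये", "करना"]:
        print(form, analyze(form))
```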

Page 21: Machine Translation System: Chhattisgarhi to Hindi

Rule-Based Machine Translation

Based on the specification of rules for morphology, syntax, lexical selection, semantic analysis, transfer and the generation process.

The AnglaHindi MT system, developed by IIT Kanpur in 2003, is based on the rule-based MT approach.

Page 22: Machine Translation System: Chhattisgarhi to Hindi

Interlingua Based MT

• Some systems make use of a so-called “interlingua” or intermediate language

– The transfer stage is divided into two steps, one translating a source sentence into the interlingua and the other translating the result of this into an abstract representation in the target language

Page 23: Machine Translation System: Chhattisgarhi to Hindi

UNL Based MT: the scenario

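[Diagram: UNL in the centre, connected to English, Hindi, French and Russian. Enconversion maps a natural language into UNL; deconversion maps UNL into a natural language.]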

Page 24: Machine Translation System: Chhattisgarhi to Hindi

Machine Translation Work at IIT Kanpur

ANGLABHARTI represents a machine-aided translation methodology specifically designed for translating English to Indian languages.

Anglabharti uses a pseudo-interlingua approach.

It analyses English only once and creates an intermediate structure with most of the disambiguation performed

The intermediate structure is then converted to each Indian language through a process of text-generation.

Analyzing the English sentences accounts for about 70% of the effort and text-generation for the remaining 30%.

Thus, with only an additional 30% effort, a new English to Indian language translator can be built.

Page 25: Machine Translation System: Chhattisgarhi to Hindi

Rule Based Machine Translation

Page 26: Machine Translation System: Chhattisgarhi to Hindi

[Figure: rule-based MT pipeline. Target-language-independent part: English sentence → English POS tagger, morph analyzer and chunker, sentence parser → parsed output → Word Sense Disambiguation (WSD) → word senses marked. Target-language-dependent part: transfer grammar rule application and bilingual work (E-I dictionary lookup; tense, aspect and modality lookup) → Indian language generator → IL sentence.]

Page 27: Machine Translation System: Chhattisgarhi to Hindi


Function of Various Modules in Architecture

PoS Tagger - Given an input source sentence, the PoS tagger tags each word with a part of speech.

Parser - It generates a parse tree in which each word is a node carrying its part-of-speech tag.

Reordering Module - The reordering module has a “Transfer Link Rule File”, which specifies how the source structure is transformed into the target structure.

Lexicalization Module - The target equivalents are looked up in the root word lexicon along with their part-of-speech category.

Synthesization Module - The final and most important stage of the proposed MT system synthesizes the target lexical items into the target sentence.

Page 28: Machine Translation System: Chhattisgarhi to Hindi

Part of Speech Tagging

Part of Speech tagging is the process of identifying the part of speech corresponding to each word in the text, based on both its definition and its context (i.e. its relationship with adjacent and related words in a phrase or sentence).

• E.g. for the sentence ‘The white dog ate the biscuits’ we have the following tags:

• The [DT] white [JJ] dog [NN] ate [VBD] the [DT] biscuits [NNS]
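A minimal sketch of tagging by definition plus context, assuming a tiny hand-written lexicon and a single context rule. It uses Penn Treebank tag names, in which a plural noun such as "biscuits" is NNS. The lexicon entries and the rule are illustrative only; practical taggers are trained on annotated corpora.

```python
# Toy lexicon: each word lists its possible tags in arbitrary order.
# Entries and the single context rule below are illustrative assumptions.
LEXICON = {
    "the": ["DT"],
    "white": ["JJ", "NN"],
    "dog": ["VB", "NN"],  # "dog" can be a verb ("to dog someone") or a noun
    "ate": ["VBD"],
    "biscuits": ["NNS"],
}


def tag(sentence):
    """Tag each word using the lexicon (definition) and the previous tag (context)."""
    tagged = []
    for word in sentence.split():
        candidates = LEXICON.get(word.lower(), ["NN"])  # unknown words default to noun
        previous = tagged[-1][1] if tagged else None
        # Context rule: right after a determiner or adjective, drop verb readings.
        if previous in {"DT", "JJ"}:
            non_verbs = [t for t in candidates if not t.startswith("VB")]
            candidates = non_verbs or candidates
        tagged.append((word, candidates[0]))
    return tagged


if __name__ == "__main__":
    print(" ".join(f"{w} [{t}]" for w, t in tag("The white dog ate the biscuits")))
```

Without the context rule, "dog" would receive its first listed tag (VB); the preceding adjective is what selects the noun reading.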

Page 29: Machine Translation System: Chhattisgarhi to Hindi

Structural Transfer-RBMT

Page 30: Machine Translation System: Chhattisgarhi to Hindi


Issues in Chhattisgarhi to Hindi Machine Translation

The following are some of the issues to consider in the design of a Chhattisgarhi to Hindi machine translator:

Lexical differences: Sometimes a word used in one language has no single-word equivalent in another language, which results in lexical differences between the languages.

Example 1: The following Chhattisgarhi word has several different meanings in Hindi.

अँइठ - सं. 1. ऐंठने की क्रिया या भाव 2. अकड़ 3. घमंड 4. गर्व

Gender resolution: Hindi has two genders, masculine and feminine, but in Chhattisgarhi interrogative sentences it is difficult to identify the gender.

Page 31: Machine Translation System: Chhattisgarhi to Hindi


Issues in Chhattisgarhi to Hindi Machine Translation (contd.)

Example 2: The following interrogative sentence in Chhattisgarhi can be written in two different ways in Hindi depending on the gender.

ते हा जा थस का ?

1. क्या तुम जा रही हो?

2. क्या तुम जा रहे हो?

Increase in words: During translation from Chhattisgarhi to Hindi, the number of words sometimes increases in the target language.

Example 3:

मैदान म पाहट खड़े है। => मैदान में भैंसों का समूह खड़ा है।

Page 32: Machine Translation System: Chhattisgarhi to Hindi


Issues in Chhattisgarhi to Hindi Machine Translation (contd.)

Decrease in words: During translation from Chhattisgarhi to Hindi, the number of words sometimes decreases in the target language.

Example 4:

मे ह एक ठन आमा खाये हुँ। => मैं एक आम खाया हूँ।

Page 33: Machine Translation System: Chhattisgarhi to Hindi


Proposed Methodology

The conversion of a Chhattisgarhi sentence to Hindi can be illustrated with the following example:

वो हा घर जाथे। => वह घर जाता है।

The stages of translation are as follows:

1st stage: getting basic part-of-speech information for each source word:

वो = सर्वनाम (pronoun) ; हा = विभक्ति (case marker) ; घर = संज्ञा (noun) ; जाथे = क्रिया (verb)

2nd stage: getting syntactic information about the verb “जाथे”:

जाथे – Present Simple, 3rd Person Singular, Active Voice

3rd stage: parsing the source sentence:

(सर्वनाम) (विभक्ति) (संज्ञा) (क्रिया)

Page 34: Machine Translation System: Chhattisgarhi to Hindi


Proposed Methodology (contd.)

4th stage: translating Chhattisgarhi words into Hindi:

वो (category = सर्वनाम) => वह (category = सर्वनाम)

हा (category = विभक्ति) => जाता (category = क्रिया)

घर (category = संज्ञा) => घर (category = संज्ञा)

जाथे (category = क्रिया) => है (category = स.क्रिया, auxiliary verb)

5th stage: mapping dictionary entries into appropriate forms (Synthesization or Target Sentence Generation):

वो हा घर जाथे। => वह घर जाता है।

Page 35: Machine Translation System: Chhattisgarhi to Hindi


Proposed Methodology (contd.)

Rulebase for conversion

(सर्वनाम) (विभक्ति) (संज्ञा) (क्रिया) => (सर्वनाम) (संज्ञा) (क्रिया) (स.क्रिया)

1 2 3 4 => 1 2 3 4

(Source Rulebase) => (Target Rulebase)

Reordering: 1:1 || 2:3 || 3:2 || 4:4 -> Transfer Link Rule File
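The reordering rule `1:1 || 2:3 || 3:2 || 4:4` can be read as "source position : target position". The sketch below applies it to the example sentence, with the words written in Unicode Devanagari. The function names and the in-memory lexicon are hypothetical; the slides do not show the actual file format or code of the system.

```python
# Word-level lexicon taken from the 4th-stage example (source word -> target word).
LEXICON = {"वो": "वह", "हा": "जाता", "घर": "घर", "जाथे": "है"}


def parse_link_rule(rule):
    """Parse '1:1 || 2:3 || 3:2 || 4:4' into {source_position: target_position}."""
    links = {}
    for pair in rule.split("||"):
        source_pos, target_pos = pair.strip().split(":")
        links[int(source_pos)] = int(target_pos)
    return links


def reorder(items, rule):
    """Place the item at each source position into its linked target position."""
    links = parse_link_rule(rule)
    output = [None] * len(items)
    for source_pos, target_pos in links.items():
        output[target_pos - 1] = items[source_pos - 1]  # positions are 1-based
    return output


def translate(sentence, rule):
    """Lexical transfer followed by reordering."""
    source_words = sentence.split()
    target_words = [LEXICON[word] for word in source_words]
    return " ".join(reorder(target_words, rule))


if __name__ == "__main__":
    print(translate("वो हा घर जाथे", "1:1 || 2:3 || 3:2 || 4:4"))
```

Keeping the rule as data rather than code means a new sentence pattern only needs a new line in the rule file, which is presumably the purpose of the Transfer Link Rule File.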

Page 36: Machine Translation System: Chhattisgarhi to Hindi

Snapshots of Chhattisgarhi to Hindi Dictionary

Page 37: Machine Translation System: Chhattisgarhi to Hindi

Complete Chhattisgarhi Dictionary

Page 38: Machine Translation System: Chhattisgarhi to Hindi

Snapshots of Chhattisgarhi POS Tagger

Page 39: Machine Translation System: Chhattisgarhi to Hindi

Snapshots of Chhattisgarhi Morphological Analyzer

Page 40: Machine Translation System: Chhattisgarhi to Hindi

Corpus-based MT

• Corpus based MT systems require sentence-aligned parallel text for each language pair.

• The corpus based approach is further classified into

1. Statistical Machine Translation

2. Example Based Machine Translation

Page 41: Machine Translation System: Chhattisgarhi to Hindi

What is a corpus and how is it collected?

• A corpus is a collection of structured text used to study linguistic properties

• The plural of corpus is corpora

• Collection of corpora of the different languages

• Collection of translation corpora (English to Hindi dictionaries, translations, etc.)

• Use n-grams: an n-gram is a contiguous sequence of n items from a given sequence of text or speech.
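As a small illustration of the n-gram definition, the sketch below extracts word n-grams from a sentence and counts them over a toy corpus. The corpus sentences are made up for the example.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-item sequences from a list of tokens."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def count_ngrams(corpus, n):
    """Count word n-grams over a list of sentences."""
    counts = Counter()
    for sentence in corpus:
        counts.update(ngrams(sentence.split(), n))
    return counts


if __name__ == "__main__":
    print(ngrams("the white dog ate the biscuits".split(), 2))
    toy_corpus = ["maya likes mangoes", "ravi likes mangoes"]  # made-up sentences
    print(count_ngrams(toy_corpus, 2).most_common(1))
```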

Page 42: Machine Translation System: Chhattisgarhi to Hindi

Collection of Translated Corpora

[Figure: Harry Potter in English + Harry Potter in Hindi → Machine Learning Magic → Probabilistic Model]

Page 43: Machine Translation System: Chhattisgarhi to Hindi

Basic statistics- SMT

• 0 <= P(A) <=1

• P(A)

– Probability that word A is present in the text

• P(A,B)

– Probability that words A and B are both present in the text

• P(A|B)

– Probability that word A is present in the text given that B is already present in the text

Page 44: Machine Translation System: Chhattisgarhi to Hindi

Basic statistics

• Conditional probability

$$P(A \mid B) = \frac{P(A, B)}{P(B)}$$

Page 45: Machine Translation System: Chhattisgarhi to Hindi

Basic Statistics

• Use definition of conditional probability to derive the chain rule

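$$P(A \mid B) = \frac{P(A, B)}{P(B)}$$

$$P(A, B) = P(B)\,P(A \mid B) = P(A)\,P(B \mid A)$$

$$P(A_1, A_2, \ldots, A_n) = P(A_n \mid A_{n-1}, \ldots, A_1)\,P(A_{n-1}, \ldots, A_1) = \cdots = P(A_1)\,P(A_2 \mid A_1)\,P(A_3 \mid A_1, A_2) \cdots P(A_n \mid A_1, \ldots, A_{n-1})$$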
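The chain rule is what lets a language model score a whole sentence one word at a time. The sketch below adds one simplifying assumption that is not on the slide: each word depends only on the previous word (a bigram model), so P(A_n | A_1, ..., A_{n-1}) is approximated by P(A_n | A_{n-1}). The three-sentence corpus is made up, and probabilities are plain relative frequencies with no smoothing.

```python
from collections import Counter
from fractions import Fraction

# Made-up training corpus for illustration only.
CORPUS = ["maya likes mangoes", "maya likes france", "ravi likes mangoes"]


def train_bigram(corpus):
    """Count how often each word occurs as a context and each (context, word) pair."""
    context_counts, bigram_counts = Counter(), Counter()
    for sentence in corpus:
        tokens = ["<s>"] + sentence.split()  # <s> marks the sentence start
        for previous, word in zip(tokens, tokens[1:]):
            context_counts[previous] += 1
            bigram_counts[(previous, word)] += 1
    return context_counts, bigram_counts


def sentence_probability(sentence, context_counts, bigram_counts):
    """Chain rule with the bigram approximation: product of P(word | previous)."""
    tokens = ["<s>"] + sentence.split()
    probability = Fraction(1)
    for previous, word in zip(tokens, tokens[1:]):
        if context_counts[previous] == 0:
            return Fraction(0)  # unseen context, no smoothing in this sketch
        probability *= Fraction(bigram_counts[(previous, word)], context_counts[previous])
    return probability


if __name__ == "__main__":
    contexts, bigrams = train_bigram(CORPUS)
    for s in ["maya likes mangoes", "ravi likes france", "france likes maya"]:
        print(s, sentence_probability(s, contexts, bigrams))
```

For "maya likes mangoes" the product is P(maya | start) × P(likes | maya) × P(mangoes | likes) = 2/3 × 1 × 2/3 = 4/9, which can be checked by hand against the corpus counts.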

Page 46: Machine Translation System: Chhattisgarhi to Hindi

Goal- SMT

• Translate.

• I’ll use English (E) to Hindi (H) as the running example.

Page 47: Machine Translation System: Chhattisgarhi to Hindi

Approach: Statistics

• We are trying to model P(H|E)

– I give you an English sentence

– You give me back Hindi

• How are we going to model this?

– We could use Bayes rule:

$$P(H \mid E) = \frac{P(E \mid H)\,P(H)}{P(E)} \propto P(E \mid H)\,P(H)$$

Page 48: Machine Translation System: Chhattisgarhi to Hindi

Why Bayes rule at all?

• Why not model P(H|E) directly?

• P(E|H)P(H) decomposition allows us to be sloppy

– P(H) worries about good Hindi

– P(E|H) worries about English that matches Hindi text

– The two can be trained independently

Page 49: Machine Translation System: Chhattisgarhi to Hindi

Where will we get P(E|H)?

Books inHindi

Same books,in English

MachineLearning

Magic

P(E|H) model

We call collections stored in two languages parallel corpora or parallel texts

Want to update your system? Just add more text!

Page 50: Machine Translation System: Chhattisgarhi to Hindi

English to Hindi SMT

• Let's consider the example of an English to Hindi SMT system. Every Hindi sentence h is a possible translation of an English sentence e.

• The probability that 'गाय खास खाता है।' is the translation of 'Murthy eats apple' is low compared to the probability of 'रवि खाना खाता है' being the translation of that sentence. Every pair of sentences (E, H) is assigned a probability P(H|E), which is the probability that a translator, when presented with an English sentence E, will produce H as its Hindi translation.

• We can assume that when a native speaker of Hindi produces an English sentence, he has a Hindi sentence in mind and translates it into English mentally.

• The goal of SMT is to find the sentence H that the native speaker had in mind when he produced E.

Page 51: Machine Translation System: Chhattisgarhi to Hindi

English to Hindi SMT

$$P(H \mid E) = \frac{P(E \mid H)\,P(H)}{P(E)} \propto P(E \mid H)\,P(H)$$

• The two components in SMT are the Language Model (LM) and the Translation Model (TM).

• A language model gives the probability of a sentence, P(H). These probabilities are calculated with n-gram techniques.

• The translation model helps to compute the conditional probability P(E|H). It is trained from a parallel corpus of Hindi/English pairs.

Page 52: Machine Translation System: Chhattisgarhi to Hindi

Working of SMT

Page 53: Machine Translation System: Chhattisgarhi to Hindi

Statistical Machine Translation (SMT)

• The general idea in an SMT system is to output the most probable translation of the source sentence.

• The system consists of three components. The Language Model (LM) computes the probability of the target language sentence ‘T’ as P(T).

• The Translation Model (TM) helps to compute the conditional probability of the source sentence given the target sentence, P(S|T).

• The Decoder searches for the target sentence that maximizes the product of the LM and TM probabilities.
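The decoder's job can be sketched as an argmax over candidate target sentences of the product of the LM and TM scores. In the sketch below every number is invented for illustration, and the candidate list is given by hand; a real decoder has to search a very large space of candidates and uses models trained on corpora.

```python
# Source sentence (English): "Maya slept in the garden"
# Two candidate Hindi outputs with hypothetical model scores (all numbers made up).
LANGUAGE_MODEL = {          # P(T): how fluent is the Hindi sentence?
    "माया बाग में सो गई": 0.02,
    "माया सो गई में बाग": 0.0001,
}
TRANSLATION_MODEL = {       # P(S|T): how well does the source match this target?
    "माया बाग में सो गई": 0.3,
    "माया सो गई में बाग": 0.3,
}


def score(target):
    """Noisy-channel score: P(T) * P(S|T)."""
    return LANGUAGE_MODEL.get(target, 0.0) * TRANSLATION_MODEL.get(target, 0.0)


def decode(candidates):
    """Return the candidate with the highest LM x TM score."""
    return max(candidates, key=score)


if __name__ == "__main__":
    print(decode(list(LANGUAGE_MODEL)))
```

Both candidates contain the same words, so the translation model cannot separate them; the language model prefers the fluent word order, which is why the two models are kept separate.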

Page 54: Machine Translation System: Chhattisgarhi to Hindi

RBMT Vs SMT

• RBMT can achieve good results, but the training and development costs are very high for a good quality system.

• In terms of investment, the customization cycle needed to reach the quality threshold can be long and costly.

• RBMT systems are built with much less data than SMT systems.

• Language is constantly changing, which means rules must be managed and updated where necessary in RBMT systems.

• SMT systems can be built in much less time and do not require linguistic experts to apply language rules to the system.

• SMT models require state-of-the-art computer processing power and storage capacity to build and manage large translation models.

• SMT systems can mimic the style of the training data to generate output based on the frequency of patterns, allowing them to produce more fluent output.

Page 55: Machine Translation System: Chhattisgarhi to Hindi

Example Based Machine Translation

• It uses previous translation examples to generate translations for a given input.

• When an input sentence is presented to the system, it retrieves a similar source sentence from the example-base along with its translation.

• The system then adapts the example translation to generate the translation of the input sentence.
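The retrieve-then-adapt loop can be sketched with the standard-library `difflib.SequenceMatcher` as the similarity measure. The example base, the word lexicon and the one-word substitution used for adaptation are assumptions chosen for illustration; real EBMT systems use much richer matching and recombination.

```python
from difflib import SequenceMatcher

# Hypothetical example base of (source, translation) pairs.
EXAMPLES = [
    ("Maya likes mangoes", "माया को आम पसंद है"),
    ("John went to office", "जॉन कार्यालय गया"),
]

# Hypothetical word lexicon used only for the adaptation step.
WORDS = {"mangoes": "आम", "france": "फ्रांस", "office": "कार्यालय"}


def retrieve(sentence):
    """Return the stored example whose source side is most similar to the input."""
    tokens = sentence.lower().split()
    return max(
        EXAMPLES,
        key=lambda ex: SequenceMatcher(None, tokens, ex[0].lower().split()).ratio(),
    )


def translate(sentence):
    """Retrieve the closest example, then swap in translations of differing words."""
    source, target = retrieve(sentence)
    new_tokens, old_tokens = sentence.lower().split(), source.lower().split()
    if len(new_tokens) != len(old_tokens):
        return target  # this sketch only adapts same-length sentences
    for new, old in zip(new_tokens, old_tokens):
        if new != old and new in WORDS and old in WORDS:
            target = target.replace(WORDS[old], WORDS[new])
    return target


if __name__ == "__main__":
    print(translate("Maya likes France"))
```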

Page 56: Machine Translation System: Chhattisgarhi to Hindi

Knowledge based MT

• Early MT systems are characterized by the syntax.

• Semantic features are attached using AI techniques.

Page 57: Machine Translation System: Chhattisgarhi to Hindi

Neural Machine Translation

• It uses a large neural network for deep learning

• Google and Microsoft translation services have used NMT since late 2016.

• It typically needs more parallel data than SMT to reach comparable quality

Page 58: Machine Translation System: Chhattisgarhi to Hindi

Deep Learning

Deep learning is essentially a set of techniques that help you to parameterize deep neural network structures, that is, neural networks with many layers and parameters.

Page 59: Machine Translation System: Chhattisgarhi to Hindi

Recurrent Neural Network

A recurrent neural network (RNN) is a class of artificial neural network where connections between units form a directed cycle. This creates an internal state of the network which allows it to exhibit dynamic temporal behaviour.
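The internal state can be made concrete with the simplest recurrent update, h_t = tanh(W_hh · h_{t-1} + W_xh · x_t). The slide does not name a specific cell, so this Elman-style update is an assumption, and the weights below are fixed by hand rather than learned. The sketch uses plain Python lists so it runs without any numerical library.

```python
import math


def rnn_step(hidden, x, w_hh, w_xh):
    """One recurrent update: h_t = tanh(W_hh @ h_{t-1} + W_xh @ x_t)."""
    new_hidden = []
    for i in range(len(hidden)):
        recurrent = sum(w * h for w, h in zip(w_hh[i], hidden))
        from_input = sum(w * xi for w, xi in zip(w_xh[i], x))
        new_hidden.append(math.tanh(recurrent + from_input))
    return new_hidden


def encode(sequence, w_hh, w_xh, hidden_size):
    """Run the recurrence over a sequence and return the final hidden state."""
    hidden = [0.0] * hidden_size
    for x in sequence:
        hidden = rnn_step(hidden, x, w_hh, w_xh)
    return hidden


# Hand-picked weights for illustration; a trained network would learn these.
W_HH = [[0.5, -0.3], [0.2, 0.1]]
W_XH = [[1.0, 0.0], [0.0, 1.0]]

if __name__ == "__main__":
    a, b = [1.0, 0.0], [0.0, 1.0]  # two one-hot "words"
    print(encode([a, b], W_HH, W_XH, hidden_size=2))
    print(encode([b, a], W_HH, W_XH, hidden_size=2))
```

Feeding the same two inputs in a different order gives a different final state, which is the "temporal behaviour" the definition refers to: the state summarizes the sequence, not just the set of words.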

Page 60: Machine Translation System: Chhattisgarhi to Hindi

Encoder-Decoder

• Encode the source sentence x : It analyzes the source sentence and the result of the analysis is a mysterious sequence of vectors

• Decode that to target sentence y

Neural Machine Translation

Page 61: Machine Translation System: Chhattisgarhi to Hindi

Neural Machine Translation

मैं एक छात्र हूँ

OpenNMT - Open-Source Neural Machine Translation

Page 62: Machine Translation System: Chhattisgarhi to Hindi

Automatic Evaluation of Machine Translator

• BLEU: BLEU was one of the first metrics to report high correlation with human judgments of quality. The metric is currently one of the most popular in the field.

• NIST: The NIST metric is based on the BLEU metric, but with some alterations. Where BLEU simply calculates n-gram precision, adding equal weight to each one, NIST also calculates how informative a particular n-gram is. That is to say, when a correct n-gram is found, the rarer that n-gram is, the more weight it is given.

• For example, if the bigram "on the" correctly matches, it receives lower weight than the correct matching of the bigram "interesting calculations," as this is less likely to occur. NIST also differs from BLEU in its calculation of the brevity penalty, insofar as small variations in translation length do not impact the overall score as much.
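The core of BLEU, clipped n-gram precision combined with a brevity penalty, can be sketched as follows. This is a simplified sentence-level version under stated assumptions: a single reference, n-grams up to length 2 by default (standard BLEU uses up to 4 and is computed over a whole test set), and no smoothing. It is meant to show the mechanics, not to replace an established evaluation tool.

```python
import math
from collections import Counter


def ngram_counts(tokens, n):
    """Count the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def modified_precision(candidate, reference, n):
    """Clipped precision: each candidate n-gram counts at most as often as in the reference."""
    cand, ref = ngram_counts(candidate, n), ngram_counts(reference, n)
    total = sum(cand.values())
    if total == 0:
        return 0.0
    clipped = sum(min(count, ref[gram]) for gram, count in cand.items())
    return clipped / total


def brevity_penalty(candidate_len, reference_len):
    """Penalize candidates that are shorter than the reference."""
    if candidate_len == 0:
        return 0.0
    if candidate_len > reference_len:
        return 1.0
    return math.exp(1 - reference_len / candidate_len)


def bleu(candidate, reference, max_n=2):
    """Geometric mean of modified precisions times the brevity penalty."""
    precisions = [modified_precision(candidate, reference, n) for n in range(1, max_n + 1)]
    if min(precisions) == 0:
        return 0.0
    log_mean = sum(math.log(p) for p in precisions) / max_n
    return brevity_penalty(len(candidate), len(reference)) * math.exp(log_mean)


if __name__ == "__main__":
    reference = "the cat is on the mat".split()
    print(bleu("the cat is on the mat".split(), reference))
    print(modified_precision("the the the the the the the".split(), reference, 1))
```

The second call shows why clipping matters: a candidate that repeats "the" seven times matches a reference word every time, but only two of those matches are credited because "the" occurs twice in the reference.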

Page 63: Machine Translation System: Chhattisgarhi to Hindi
Page 64: Machine Translation System: Chhattisgarhi to Hindi

Where are we now?

• Huge potential/need due to the internet, globalization and international politics.

• Quick development time due to SMT, the availability of parallel data and computers.

• Translation is reasonable for language pairs with a large amount of resources.

• Starting to include more “minor” languages.

Page 65: Machine Translation System: Chhattisgarhi to Hindi

Indian Institutes with Major work in MT

• IIIT Hyderabad - Anusaaraka - Prof. Rajeev Sangal

• Centre for Development of Advanced Computing (CDAC), Pune - Mantra machine translation system

• IIT Bombay - Prof. Pushpak Bhattacharyya, working on machine translation systems from English to Marathi and Bengali using the UNL (Universal Networking Language, an interlingua) formalism

• Government of India, through its Technology Development for Indian Languages (TDIL) Project

• IIT Kanpur - AnglaBharti (English to Indian Languages)

Page 66: Machine Translation System: Chhattisgarhi to Hindi

Machine Translation: India

Problem #1

Too many language pairs!

Implication: Language Barrier will continue to be a problem.

Problem #2

Fragmentation of efforts

No consolidated effort at solving MT problems

Problem #3

Lack of NLP tools

Lack of Corpora

Lack of standardized methods of evaluation, encoding, etc.

Highly Specialized

Poor quality systems, no reusable components, no real learning from each other’s work

Solutions!

Problem #1: Statistical Machine Translation

Problem #2: Collaborative work (2-3 teams)

Problem #3: Common Tools Framework plus Standards

Page 67: Machine Translation System: Chhattisgarhi to Hindi

Thank You