Machine Translation across Indian Languages Dipti Misra Sharma
LTRC, IIIT Hyderabad Patiala 15-11-2013
Slide 2
Outline Introduction Information Dynamics in language Machine
Translation (MT) Approaches to MT Practical MT systems Challenges
in MT Ambiguities Syntactic differences in L1 and L2 MT efforts in
India Sampark : IL to IL MT systems Objective Design Issues
Conclusions
Slide 3
Introduction Natural Language Processing (NLP) involves
Processing information contained in natural languages Natural as
opposed to formal/artificial Formal languages : Programming
languages, logic, mathematics etc Artificial : Esperanto
Slide 4
Natural Language Processing (NLP) Helps in Communication
between Man-machine Question answering systems, eg interactive
railway reservation Man-man Machine translation
Slide 5
Communication Transfer of information from one to the other
Language is a means of communication Therefore, one can say It
encodes what is communicated We apply the processes of Analysis
(decoding) for understanding Synthesis (encoding) for expression
(speaking)
Slide 6
What do we communicate ? Information Spain delivered a football
masterclass at Euro 2012 Intention Emphasis/focus Euro 2012
bagged/won by Spain Spain bags Euro 2012 Introduces variation
Slide 7
How do we communicate ? We use linguistic elements such as
Words (country, park, the, is, Bandipur, of, as, and, considered,
National, a, spot, beautiful, tourist, life, in, best, wild,
sanctuaries, the, one) Arrangement of the words (Sentences) Words
are related to each-other to provide the composite meaning
(Bandipur National park is a beautiful tourist spot and considered
as one of the best wild life sanctuaries in the country)
Slide 8
How do we communicate ? Contd.. Arrangement of sentences
(Discourse) Sentences or parts of sentences are related to each
other to provide a cohesive meaning *Considered as one of the best
wild life sanctuaries in the country. It is a national park
covering an area of about 874 sq km. Bandipur National park is a
beautiful tourist spot. Bandipur National park is a beautiful
tourist spot and considered as one of the best wild life
sanctuaries in the country. It is a national park covering an area
of about 874 sq km. Languages differ in the way they organise
information in these entities All of these interact in the
organisation of information
Slide 9
Information Dynamics in Language (1/4) Languages encode
information Hindi: cuuhe maarate haiM kutte 'rat-pl' 'kill-hab'
'pres-pl' 'dog-pl' rats kill dogs The Hindi sentence is ambiguous
Possible interpretations Dogs kill rats Rats kill dogs However,
the English sentence is not ambiguous
Slide 10
Information Dynamics in Language (2/4) Ambiguity in Hindi is
resolved if, cuuhe maarate haiM kuttoM ko rats kill-hab pres-pl
dogs-obl acc Hindi encodes information in morphemes English encodes
information in positions Languages encode information
differently
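The contrast between morphological and positional encoding can be sketched in a few lines of Python. This is a toy illustration, not a parser: it assumes the Hindi object is marked by the postposition `ko`, that the English input is a bare subject-verb-object triple, and that the verb group is known in advance. The function names and the tiny verb list are hypothetical.

```python
HINDI_VERB_WORDS = {"maarate", "haiM"}  # toy verb group for this one example


def roles_hindi(tokens):
    """Assign roles from case marking: the noun followed by 'ko' is the object."""
    unmarked = []
    obj = None
    for i, tok in enumerate(tokens):
        if tok == "ko" or tok in HINDI_VERB_WORDS:
            continue
        if i + 1 < len(tokens) and tokens[i + 1] == "ko":
            obj = tok
        else:
            unmarked.append(tok)
    if obj is None:
        # No case marker present: either noun could be the agent.
        return None
    return {"subject": unmarked[0], "object": obj}


def roles_english(tokens):
    """Assign roles from position: before the verb is subject, after is object."""
    subject, _verb, obj = tokens
    return {"subject": subject, "object": obj}


print(roles_hindi("cuuhe maarate haiM kuttoM ko".split()))
print(roles_hindi("kuttoM ko maarate haiM cuuhe".split()))
print(roles_hindi("cuuhe maarate haiM kutte".split()))
print(roles_english("rats kill dogs".split()))
```

In this sketch, moving the marked noun around in the Hindi sentence leaves the roles unchanged, while swapping the two nouns in the English triple swaps the roles.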
Slide 11
English does not explicitly mark accusative case (except in
pronouns) no morpheme No lexical item/morpheme for yes no questions
(Eng: Is he coming ? Hindi : kyaa vah aa rahaa hai?) Position plays
an important role in encoding information in English Subject is
sacrosanct Hindi encodes information morphologically
Slide 12
Information Dynamics in Language (3/4) Another example, This
chair has been sat on The chair has been used for sitting Someone
sat on this chair, and it is known The sentence does not mention
someone Languages encode information partially
Slide 13
Information Dynamics in Language (4/4) English pronouns he,
she, it Hindi pronoun vaha He is going to Delhi ==> vaha dillii
jaa rahaa hai She is going to Delhi ==> vaha dillii jaa rahii
hai It broke ==> vaha TuuTa ?? Information does not always map
fully from one language into another Conceptual worlds may be
different Gender Information
Slide 14
Information in Language Languages encode information
differently Languages code information only partially Tension
between BREVITY and PRECISION
Slide 15
Human beings use World knowledge Context (both linguistic and
extra-linguistic) Cultural knowledge and Language conventions to
resolve ambiguities Can all this knowledge be provided to the
machine ?
Slide 16
Languages differ Script (For written language) Vocabulary
Grammar These differences can be considered as a measure of
language distance
Slide 17
Language Distance
Script: Urdu -> Hindi, Telugu -> Hindi, English -> Hindi
Vocabulary: Telugu -> Hindi, English -> Hindi
Grammar: English -> Hindi
Slide 18
Machine Translation Machine translation aims at automatic
translation of a text in source language to a text in the target
language. Mohan gave Hari a book -> Mohan ne Hari ko kitAba
dI
Slide 19
Machine Translation Let us view MT as a problem of Language
encoding (source) - analysis Decoding (target) - synthesis
Slide 20
English to Hindi : An Example SL (Eng) sentence : I met a boy
who plays cricket with you everyday Mapped to TL(Hin) : I a boy met
who everyday with you cricket plays TL synthesis : mEM eka laDake
se milA jo roza tumhAre sAtha kriketa khelatA hE OR mEM roza
tumhAre sAtha kriketa khelanevAle eka laDake se milA OR mEM eka Ese
laDake se milA jo roza tumhAre sAtha kriketa khelatA hE
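The "Mapped to TL" step can be illustrated with a small reordering sketch. It assumes an earlier analysis step has already split each clause into labelled chunks (S, V, O, ADJ for adjuncts); the labels and the function name are hypothetical. Reversing the adjuncts is a simplification that happens to reproduce the order shown in this example and is not a general rule.

```python
def reorder_svo_to_sov(chunks):
    """Reorder one clause from English order to a verb-final order.

    chunks: list of (label, text) pairs in English order.
    """
    subject = [text for label, text in chunks if label == "S"]
    verb = [text for label, text in chunks if label == "V"]
    obj = [text for label, text in chunks if label == "O"]
    adjuncts = [text for label, text in chunks if label == "ADJ"]
    # Subject first, adjuncts in reverse order, then object, verb last.
    return subject + adjuncts[::-1] + obj + verb


main_clause = [("S", "I"), ("V", "met"), ("O", "a boy")]
relative_clause = [
    ("S", "who"),
    ("V", "plays"),
    ("O", "cricket"),
    ("ADJ", "with you"),
    ("ADJ", "everyday"),
]

mapped = reorder_svo_to_sov(main_clause) + reorder_svo_to_sov(relative_clause)
print(" ".join(mapped))
```

Lexical substitution and morphological generation would then turn the reordered English words into one of the Hindi sentences listed under TL synthesis.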
Slide 21
Machine Translation : Challenges Languages encode information
differently Language codes information only partially Tension
between BREVITY and PRECISION Brevity wins leading to inherent
ambiguity at different levels
Slide 22
Linguistic Issues in MT (1/2) Look at the word 'plot' in the
following examples (a) The plot having rocks and boulders is not
good. (b) The plot having twists and turns is interesting. 'plot'
in (a) means 'a piece of land' and in (b) 'an outline of the events
in a story'
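One simple way to pick between the two senses of 'plot' is to count overlapping context words. The sketch below is only an illustration: the cue-word lists are hand-picked assumptions, and the sense labels reuse the Hindi equivalents 'zamiin' and 'kathanak' that the deck gives for 'plot' elsewhere.

```python
# Hypothetical cue words for each sense; a real system would learn or curate these.
SENSE_CUES = {
    "zamiin": {"land", "rocks", "boulders", "fence", "build"},
    "kathanak": {"story", "twists", "turns", "novel", "characters"},
}


def disambiguate_plot(sentence):
    """Return the sense whose cue words overlap most with the sentence, or None."""
    words = set(sentence.lower().replace(".", " ").split())
    scores = {sense: len(words & cues) for sense, cues in SENSE_CUES.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


print(disambiguate_plot("The plot having rocks and boulders is not good."))
print(disambiguate_plot("The plot having twists and turns is interesting."))
```

When no cue word is present the function returns None, which mirrors the point that the word alone does not carry enough information.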
Slide 23
Linguistic Issues in MT (2/2) Ambiguity in Language Lexical
level Sentence level Structural differences between SL and TL
Slide 24
Lexical ambiguity Lexical ambiguity can be both for Content
words nouns, verbs etc Function words prepositions, TAMs etc
Content words ambiguity is of two types Homonymy Polysemy
Slide 25
Homonymy A word has two or more unrelated senses Example : I
was walking on the bank (river-bank) I deposited the money in the
bank (money-bank)
Slide 26
Polysemy 'Act', an English noun 1. It was a kind act to help
the blind man across the road (kArya) 2. The hero died in Act
four, scene three (aMka) 3. Don't take her seriously, it's all an
act (aBinaya) 4. The parliament has passed an Act (dhArA)
Slide 27
Function words can also pose problems (1/5) Prepositions
English prepositions in the target language Tense Aspect Modality
(TAM) Lexical correspondence of TAM
Slide 28
Function words can also pose problems (2/5) Function words can
also be ambiguous For example English preposition 'in' (a) I met
him in the garden mEM usase bagIce meM milA (b) I met him in the
morning mEM usase subaha 0 milA 'Ambiguity' here refers to the
'appropriate correspondence' in the target language.
Slide 29
Function words can also pose problems(3/5) He bought a shirt
with tiny collars. usane chote kOlaroM vAlI kamIza kharIdI he tiny
collars with shirt bought 'with' gets translated as vAlI in Hindi He
washed a shirt with soap. usane sAbuna se kamIza dhoI he soap with
shirt washed 'with' gets translated as se.
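The 'in' and 'with' examples suggest that the Hindi equivalent depends on the kind of noun the preposition governs. A minimal sketch of that idea follows. The semantic classes, both lookup tables, and the function name are assumptions made for illustration; only the four English-Hindi correspondences come from the examples.

```python
# Hypothetical semantic classes for the nouns used in the examples.
SEMANTIC_CLASS = {
    "garden": "place",
    "morning": "time",
    "soap": "instrument",
    "collars": "part",
}

# (preposition, noun class) -> Hindi function word; "" means no overt marker.
HINDI_EQUIVALENT = {
    ("in", "place"): "meM",
    ("in", "time"): "",
    ("with", "instrument"): "se",
    ("with", "part"): "vAlI",
}


def hindi_function_word(preposition, noun):
    """Look up the Hindi equivalent; None means the tables have no answer."""
    noun_class = SEMANTIC_CLASS.get(noun)
    return HINDI_EQUIVALENT.get((preposition, noun_class))


print(repr(hindi_function_word("in", "garden")))
print(repr(hindi_function_word("in", "morning")))
print(repr(hindi_function_word("with", "soap")))
print(repr(hindi_function_word("with", "collars")))
```

A table like this only covers what has been listed; unseen nouns return None and would need either a larger lexicon or a statistical fallback.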
Slide 30
Function words can also pose problems (4/5) TAM Markers mark
tense, aspect and modality Consist of inflections and/or auxiliary
verbs in Hindi An important source of information Narrow down the
meaning of a verb (eg. lied, lay)
Slide 32
Function words can also pose problems (5/5) English Simple Past
vs Habitual 1a. He stayed in the guest house during his visit to
our University in Jan (rahA) 1b. He stayed in the guest house
whenever he visited us (rahatA thA) 2a. He went to the school just
now (gayA) 2b. He went to the school everyday (jAtA thA)
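The choice between the simple past and the habitual reading can be approximated by looking for adverbial cues in the sentence. The sketch below assumes a small hand-made cue list ('whenever' and 'everyday' come from the examples, 'always' and 'usually' are added assumptions) and a two-verb table built from the Hindi forms shown above. It is a heuristic illustration only.

```python
HABITUAL_CUES = {"whenever", "everyday", "always", "usually"}

# Hindi forms taken from examples 1a/1b and 2a/2b.
PAST_FORMS = {
    "stayed": {"simple": "rahA", "habitual": "rahatA thA"},
    "went": {"simple": "gayA", "habitual": "jAtA thA"},
}


def hindi_past(verb, sentence):
    """Pick the habitual form if a cue word occurs, else the simple past form."""
    words = set(sentence.lower().split())
    reading = "habitual" if words & HABITUAL_CUES else "simple"
    return PAST_FORMS[verb][reading]


print(hindi_past("stayed", "He stayed in the guest house whenever he visited us"))
print(hindi_past("went", "He went to the school just now"))
```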
Slide 33
Sentence level ambiguity o I met the girl in the store +
Possible readings a) I met the girl who works in the store b) I met
the girl while I was in the store o Time flies like an arrow. +
Possible parses: a) Time flies like an arrow (N V Prep Det N) b)
Time flies like an arrow (N N V Det N) c) Time flies like an arrow
(V N Prep Det N) (flies are like an arrow) d) Time flies like an
arrow (V N Prep Det N) (manner of timing)
Slide 34
Differences in SL and TL Lexical level (a) One word may
translate into different words in different contexts (WSD) English
'plot' zamiin, kathanak (b) A SL word may not have a corresponding
word in the TL (Gaps) English 'reads' in 'This book reads very
well' (c) Pronouns across Indian languages Hindi 'vaha' Telugu
'adi', 'atanu', 'aame'
Slide 35
Differences in SL and TL Structural differences (a) word order
(English Hindi) (b) nominal modification (Hindi Tamil, Telugu etc)
(i) relative clause vs relative participles Telugu 'nenu tinnina
camcaa' Hindi : *meraa khaayaa cammaca Maine jis cammaca se khaayaa
hai vah cammac (ii) missing copula (Hindi Telugu, Bengali, Tamil
etc) Telugu : raamudu mancivaadu Hindi : Ram acchaa ladakaa
hai
Slide 36
Human beings use World Knowledge Context Cultural knowledge and
Language conventions To resolve ambiguities and interpret
meaning
Slide 37
What to do for the machine ? Challenging problem!!! Providing
all the knowledge may: - take too much time and effort - be
difficult/become complex - not be possible (world knowledge
acquired from experience) Therefore, Break the problem into smaller
problems Choose the solution as per the nature of problem Build
language resources to the extent possible and continue to add to it
Engineer knowledge efficiently
Slide 38
Approaches to MT (1/2) Rule-based or Transfer based Uses
linguistic rules to map SL and TL, such as Maps grammatical
structures Disambiguation rules Knowledge-based Extensive knowledge
of the domain Concepts in the language Ability to reason
Slide 39
Approaches to MT (2/2) Example-based Mapping is based on stored
example translations Translation memory based Uses phrases/words
from earlier translation as examples Statistical Does not formulate
explicit linguistic knowledge Develops rules based on probabilities
Hybrid Mixes two or more techniques
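The translation-memory idea can be sketched with Python's standard `difflib` module: store earlier translations and return the one whose source sentence is closest to the new input. The stored pairs below are example sentences from this deck; the function name and the 0.8 similarity cutoff are arbitrary assumptions, and a practical system would match at the phrase level rather than on whole strings.

```python
import difflib

# Earlier translations used as examples (sentence pairs from this deck).
TRANSLATION_MEMORY = {
    "Mohan gave Hari a book": "Mohan ne Hari ko kitAba dI",
    "I met him in the garden": "mEM usase bagIce meM milA",
    "He washed a shirt with soap": "usane sAbuna se kamIza dhoI",
    "He bought a shirt with tiny collars": "usane chote kOlaroM vAlI kamIza kharIdI",
}


def lookup(sentence, cutoff=0.8):
    """Return (matched source, stored translation) for the closest entry, or None."""
    matches = difflib.get_close_matches(
        sentence, list(TRANSLATION_MEMORY), n=1, cutoff=cutoff
    )
    if not matches:
        return None
    return matches[0], TRANSLATION_MEMORY[matches[0]]


print(lookup("I met him in the garden."))
print(lookup("xyz"))
```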
Slide 40
A Glance at MT Efforts in India (1/4) Domain Specific Mantra
system (C-DAC, Pune) Translation of govt. appointment letters Uses
Tree Adjoining Grammar Public health campaign documents Angla
Bharati approach (C-DAC Noida & IIT Kanpur)
Slide 41
A Glance at MT Efforts in India (2/4) Application Specific
Matra (Human aided MT) (NCST,now C-DAC, Mumbai) General Purpose
(not yet in use) Angla Bharati approach (IIT Kanpur ) UNL based MT
(IIT Bombay) Shiva: EBMT (IIIT Hyderabad/IISc Bangalore) Shakti:
English-Hindi MT system (IIIT Hyderabad)
Slide 42
MT Efforts in India (3/4) Major Government funded MT projects
in consortium mode Indian Language to Indian Language Machine
Translation (ILMT) (Lead Institute - IIIT, Hyderabad) English to
Indian Language Machine Translation Mantra, Shakti etc (Lead inst -
C-DAC, Pune) Anglabharati (Lead inst IIT, Kanpur) Sanskrit to Hindi
MT System (Lead Inst University of Hyderabad)
Slide 43
MT Efforts in India (4/4) Anusaaraka : Language Accessor cum MT
System (IIIT, Hyderabad, Chinmaya Shodh Sansthan)
Slide 44
Our Focus Sampark : Indian Language to Indian Language MT
systems
Slide 45
Sampark : Indian Language to Indian Language MT Systems
Consortium mode project Funded by DeitY 11 Participating Institutes
Nine language pairs 18 Systems
Slide 46
Participating institutions IIIT, Hyderabad (Lead institute)
University of Hyderabad IIT, Bombay IIT, Kharagpur AUKBC, Chennai
Jadavpur University, Kolkata Tamil University, Thanjavur IIIT,
Trivandrum IIIT, Allahabad IISc, Bangalore CDAC, Noida
Slide 47
Objectives Develop general purpose MT systems from one IL to
another for 9 language pairs Bidirectional Deliver domain specific
versions of the MT systems. Domains are: Tourism and pilgrimage One
additional domain (health/agriculture, box office reviews,
electronic gadgets instruction manuals, recipes, cricket reports)
By-products basic tools and lexical resources for Indian languages:
POS taggers, chunkers, morph analysers, shallow parsers, NERs,
parsers etc. Bidirectional bilingual dictionaries, annotated
corpora, etc.
User Scenario Web based system for tourism/ pilgrimage domain.
A common traveler/tourist/pilgrim to access info in his language.
Access to selected Government portals in agriculture/health
Automatic MT in domain General purpose web based translation
Potential to attach to major search engines such as Google, Yahoo,
Microsoft, Web-duniya
Slide 50
Design and Approach Largely transfer based Analysis, Transfer,
Generate Modular Pipeline architecture Hybrid: some
modules statistical, some rule based Analysis : Shallow parser No
deep parsing in the first phase
Slide 51
Approach Largely transfer based Analysis, Transfer, Generate
Modular Modules could be statistical or rule based depending on the
nature of problem (Hybrid) Pipeline architecture Analysis : Shallow
parsing followed by a simple parser
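A modular Analysis, Transfer, Generate pipeline can be sketched as a list of functions that each read and extend a shared state. The example below is a toy under stated assumptions: it handles only the Telugu sentence 'raamudu mancivaadu' from the missing-copula example, the lexicon entries and POS labels are hypothetical, and a plain dictionary stands in for whatever shared representation a real system passes between modules.

```python
# Hypothetical two-entry lexicon covering the missing-copula example.
LEXICON = {
    "raamudu": {"pos": "N", "hindi": "Ram"},
    "mancivaadu": {"pos": "N", "hindi": "acchaa ladakaa"},
}


def analyse(state):
    """Analysis: tokenise and attach a POS label to each source word."""
    tokens = state["source"].split()
    state["tagged"] = [(tok, LEXICON[tok]["pos"]) for tok in tokens]
    return state


def transfer(state):
    """Transfer: substitute target words and note structural differences."""
    state["target_words"] = [LEXICON[tok]["hindi"] for tok, _ in state["tagged"]]
    # A clause with no verb needs an explicit copula on the Hindi side.
    state["needs_copula"] = all(pos != "V" for _, pos in state["tagged"])
    return state


def generate(state):
    """Generation: produce the target string, inserting the copula if required."""
    words = list(state["target_words"])
    if state["needs_copula"]:
        words.append("hai")
    state["target"] = " ".join(words)
    return state


PIPELINE = [analyse, transfer, generate]


def translate(sentence):
    state = {"source": sentence}
    for module in PIPELINE:
        state = module(state)
    return state["target"]


print(translate("raamudu mancivaadu"))
```

Because every module has the same interface, a rule-based module can be swapped for a statistical one without touching the rest of the pipeline.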
Slide 52
Design o Design decisions based on - the commonality in Indian
languages - easy to extend to other languages o Phase the
development - Phase 1 o Analysis at sentence level o Shallow parser
o Simple parser o Transfer : map lexicon, structures, script o
Generate the target
Slide 53
Design Contd Phase 2 Extend the analysis to discourse level
Anaphora resolution Relations between clauses (discourse
connectives) Word Sense Disambiguation (WSD) Named Entity
Recognition (NER) Multi Word Expressions (MWE) Explore SMT for
transfer rules
Slide 54
Transfer based MT Source Sentence -> (Analysis) -> Source Analysis
-> (Transfer) -> Analysis in Target Language -> (Generation) ->
Target Sentence
Slide 55
Form (Input sentence/text) Meaning Analysis Form Generation L1
Various types of linguistic information help in arriving from form
to meaning It is complex. Modularization helps in simplifying
it.
Slide 56
Modularize Word: Structure (Morph Analyser); In context: Syntactic,
what it functions as (POS tagger); Semantic, what it means (WSD)
Relations between words: Local (local word grouping / chunking);
Non-local (subject, object / karaka)
Slide 57
Form (Input sentence/text) Meaning Analysis Form Generation
Semantic analysis POS Chunking parsing Morph Analysis Formal
semantics All this information is implicit in language. How to make
it explicit? Build resources Dictionaries, Verb frames,
Treebanks
Slide 58
Sampark Architecture
Slide 59
Details Standards Annotation standards POS and Chunk Input
output of each module Representation - SSF Data format Dictionaries
Emphasis on proper software engineering Development environment
Dashboard Blackboard architecture CVS for version control etc.
Slide 60
Machine Learning: Separating engines from language data Module
for Task (T) Sentence in Language (L) Training data (lang. L)
Engine for task T Out Manual Correction
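Separating the engine from the language data means the same code can serve any language once it is given that language's training data. The sketch below uses a deliberately simple engine, a most-frequent-tag POS tagger, to show the split. The class name, the tag labels, and the tiny training sets are assumptions for illustration and do not reflect the project's actual tagset or engines.

```python
from collections import Counter, defaultdict


class MostFrequentTagEngine:
    """Language-independent engine: it learns everything from training data."""

    def __init__(self):
        self.best_tag = {}

    def train(self, tagged_sentences):
        counts = defaultdict(Counter)
        for sentence in tagged_sentences:
            for word, tag in sentence:
                counts[word][tag] += 1
        # Keep the most frequent tag seen for each word.
        self.best_tag = {w: c.most_common(1)[0][0] for w, c in counts.items()}

    def tag(self, words, default="UNK"):
        return [(w, self.best_tag.get(w, default)) for w in words]


# Language data is supplied separately for each language (toy examples).
HINDI_DATA = [[("cuuhe", "N"), ("maarate", "V"), ("haiM", "AUX"), ("kutte", "N")]]
ENGLISH_DATA = [[("rats", "N"), ("kill", "V"), ("dogs", "N")]]

hindi_tagger = MostFrequentTagEngine()
hindi_tagger.train(HINDI_DATA)
english_tagger = MostFrequentTagEngine()
english_tagger.train(ENGLISH_DATA)

print(hindi_tagger.tag(["kutte", "maarate", "haiM"]))
print(english_tagger.tag(["dogs", "kill", "cats"]))
```

Manually corrected output can be appended to the training data and the engine retrained, which corresponds to the manual-correction loop in the diagram.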
Vertical Tasks for Each Language V1 POS tagger & chunker V2
Morph analyzer V3 Generator V4 Named entity recognizer V5 Bilingual
dictionary bidirectional V6 Transfer grammar V7 Annotated corpus V8
Evaluation V9 Co-ordination
Slide 64
An Example : Hindi to Panjabi System [example sentences in Hindi and Panjabi script not preserved]
Slide 65
Panjabi to Hindi [example sentences in Panjabi and Hindi script not preserved]
Slide 66
Panjabi to Hindi [example sentences not preserved] Labels on the
example: (NER) (WSD) (Agreement) (word generation) (function word
substitution)
Slide 67
Evaluation Testing, system integration, and evaluation team
Involvement of industry Regular In-house subjective evaluation
Third party evaluation on system submission
Slide 68
Achievements of ILMT Project Phase I 18 MT systems built among
Indian languages Shallow parser for all 9 Indian languages Lexical
resources for all 9 languages Largely built from scratch Developed
standards for all stages Developed open architecture
Slide 69
Achievements -Deployment Deployed and running over web 8
systems (sampark.org.in ) Others deployed over ILMT test site 4
more ready to go to Sampark soon Rest are being evaluated and
tested internally (require a few more months to go to Sampark site
after reaching quality levels) Constant quality improvement going
on for various existing modules New modules are under testing and
would be soon integrated
Slide 70
Future Tasks Enhance the quality of MT output Enhancing
dictionaries Increasing coverage of grammar Adding new technology
to ILMT systems Full sentence parsing Discourse processing -
anaphora Target some users
Slide 71
Some Possibilities Possible tie up with search engine
companies Possible tie up with content companies such as - Dainik
Jagran, Web duniya, Rediff, Yahoo Identify translation bureaus and
agencies Build MT workbench for their use, their domains, etc.
Poised for major public impact with a unique technology.
Slide 72
Future Systems Add language pairs Gujarati-Hindi Kashmiri-Hindi
Manipuri-Hindi Oriya-Hindi Etc
Slide 74
CONCLUSION Developing MT systems, though a challenging task, is
a useful effort particularly in the multilingual context of
India