Enhancing Translation Systems with Bilingual Concordancing Functionalities
V. ANTONOPOULOS, C. MALAVAZOS, I. TRIANTAFYLLOU, S. PIPERIDIS
Presentation: V. Antonopoulos
[email protected]@ilsp.gr
Institute for Language and Speech Processing
Workshop on Balkan Language Resources & Tools
Current Framework
Increasing demand for multilinguality and for translation
Current translation systems still fail to fully meet translation needs
Language transfer is still the prevailing problem
Need for further development of existing systems:
1. Integration of technologies (TM & MT)
2. Intelligent tools
Proposed Method
Expands the transfer selection capabilities
Utilizes sub-sentential information
Performs well with a limited amount of parallel data (Translation Memories)
Feasible for run-time applications
Statistically overcomes the translation unit (TU) identification barrier
Method Basics
Extracts sub-sentential bilingual correspondences
Statistical approach
Only prerequisite: a parallel corpus
Automatic translation unit identification
Two-level iterative method:
Incrementally constructed translation
Continuously extended source segments
Employs target language correspondence information
Core Engine Description
[Figure: core engine diagram. Recoverable labels: Parallel Text Database; SSent-1 ... SSent-N; TSent-1 ... TSent-N; TW-1 ... TW-k; FW Filtering; irrelevant words; CTW; SS; TS]
1st-Level Iterations
Incremental translation construction:
Employs the DICE coefficient as similarity measure
Adds one word from the CTW set in every new iteration
Stores translations scoring above the threshold during each iteration
Terminates when no new translation is added
Selects the best translation based on similarity score and length
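The first-level loop above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes DICE is computed from sentence-pair co-occurrence counts, treats a candidate translation as an unordered set of words, uses a fixed threshold (the slides mention a variable one), and ranks stored translations by length first and score second. The names `dice`, `first_level`, `TOY_CORPUS`, and `TOY_CTW` are hypothetical, and the toy corpus is invented for the example.

```python
from typing import Dict, FrozenSet, List, Optional, Set, Tuple

# One aligned sentence pair: (set of source words, set of target words).
Pair = Tuple[Set[str], Set[str]]


def dice(source: FrozenSet[str], target: FrozenSet[str], corpus: List[Pair]) -> float:
    """Dice coefficient: 2 * joint occurrences / (source occurrences + target occurrences)."""
    n_src = sum(1 for s, _ in corpus if source <= s)
    n_tgt = sum(1 for _, t in corpus if target <= t)
    n_both = sum(1 for s, t in corpus if source <= s and target <= t)
    total = n_src + n_tgt
    return 2.0 * n_both / total if total else 0.0


def first_level(
    source_words: Set[str],
    ctw: List[str],
    corpus: List[Pair],
    threshold: float = 0.7,  # assumption: fixed cut-off, chosen only for this toy example
) -> Optional[Tuple[FrozenSet[str], float]]:
    """Grow candidate translations one CTW word per iteration; return the best one."""
    source = frozenset(source_words)
    stored: Dict[FrozenSet[str], float] = {}
    frontier: List[FrozenSet[str]] = [frozenset()]
    while frontier:  # stops when an iteration adds no new translation
        next_frontier: List[FrozenSet[str]] = []
        for candidate in frontier:
            for word in ctw:
                extended = candidate | {word}
                if word in candidate or extended in stored:
                    continue
                score = dice(source, extended, corpus)
                if score >= threshold:  # keep only translations above the threshold
                    stored[extended] = score
                    next_frontier.append(extended)
        frontier = next_frontier
    if not stored:
        return None
    # Assumption: prefer the longest stored translation, then the highest score.
    return max(stored.items(), key=lambda item: (len(item[0]), item[1]))


# Invented toy data, loosely modelled on the automotive example in the slides.
TOY_CORPUS: List[Pair] = [
    ({"ηλεκτρονική", "αυτόματη", "μετάδοση"}, {"refer", "electronic", "automatic", "transmission"}),
    ({"ηλεκτρονική", "αυτόματη", "μετάδοση"}, {"electronic", "automatic", "transmission", "EAT"}),
    ({"μετάδοση"}, {"transmission", "fluid"}),
    ({"ECU"}, {"ECU", "EAT"}),
]
TOY_CTW = ["electronic", "automatic", "transmission", "EAT", "refer", "ECU"]
```

Calling `first_level({"ηλεκτρονική", "αυτόματη", "μετάδοση"}, TOY_CTW, TOY_CORPUS)` on this toy data grows the candidate from single words to the three-word set, then stops because no four-word extension passes the threshold.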
1st-Level Iterations Example
Source phrase: ηλεκτρονική αυτόματη μετάδοση
Iteration 1: electronic, automatic, transmission, ECU, refer, EAT
Iteration 2: electronic automatic, automatic transmission, Transmission EAT, EAT ECU
Iteration 3: electronic automatic transmission, refer electronic automatic, automatic transmission EAT
Translation Synthesis Example
Source phrase: ηλεκτρονική αυτόματη μετάδοση
Iteration 1: electronic, automatic, transmission, ECU, refer, EAT
Iteration 2: electronic automatic, automatic transmission, EAT ECU
Iteration 3: electronic automatic transmission, automatic transmission EAT
Selected translation: electronic automatic transmission
Selection criteria: a) length, b) score
2nd-Level Iterations
Aims of this 2nd-level process:
Improve accuracy of the translation outcome
Automatic translation unit identification
Efficient integration in a Translation Memory framework
2nd-Level Iterations
Employ the "Sequence Window Variety" technique:
Try to determine the best "cover" of an input text by examining the translation outcome of length-varying source segments
Initiate the procedure from the smallest segments (1-word segments)
Continuously extend the input source segments
Shift the observation window from left to right over the source segments
Store acceptable translations along with their scores during every iteration
Combinatorial process for computing the optimal set of candidate source units that provides the best "cover"
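The window procedure and the final "cover" computation might look like the sketch below. The slides do not define what makes a cover "best", so the objective used here is an assumption: cover as many words as possible, then use as few units as possible, then maximize the summed score. `window_segments`, `best_cover`, and the `translate` callback are hypothetical names; `translate` stands in for the first-level process and returns `None` when no acceptable translation exists.

```python
from typing import Callable, Dict, Iterator, List, Optional, Sequence, Tuple

Span = Tuple[int, int]  # half-open word index range [start, end)
Translation = Tuple[str, float]  # (target text, similarity score)


def window_segments(n_words: int, max_len: int) -> Iterator[Span]:
    """Yield spans from the shortest length upward, sliding left to right."""
    for length in range(1, max_len + 1):
        for start in range(0, n_words - length + 1):
            yield (start, start + length)


def best_cover(
    words: Sequence[str],
    translate: Callable[[Sequence[str]], Optional[Translation]],
    max_len: int = 4,
) -> List[Tuple[Span, str]]:
    """Pick non-overlapping translated segments that best cover the input."""
    accepted: Dict[Span, Translation] = {}
    for start, end in window_segments(len(words), max_len):
        result = translate(words[start:end])
        if result is not None:  # store acceptable translations with their score
            accepted[(start, end)] = result

    # best[i] = (covered words, -units used, summed score, chosen spans) for words[:i]
    best: List[Tuple[int, int, float, List[Span]]] = [(0, 0, 0.0, [])]
    for i in range(1, len(words) + 1):
        choice = best[i - 1]  # option: leave word i-1 uncovered
        for (start, end), (_, score) in accepted.items():
            if end != i:
                continue
            covered, neg_units, total, spans = best[start]
            option = (covered + end - start, neg_units - 1, total + score, spans + [(start, end)])
            if option[:3] > choice[:3]:
                choice = option
        best.append(choice)
    return [(span, accepted[span][0]) for span in best[-1][3]]
```

With a toy lookup that knows "ηλεκτρονική" (electronic), "μετάδοση" (transmission), and "αυτόματη μετάδοση" (automatic transmission), the sketch prefers the two-unit cover that spans all three words over the cover that leaves the middle word untranslated.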
2nd-Level Iterations Example
Iteration        Source Sentence    Input Phrase
Iteration 0      A B C D E F G H
Iteration 0-a    A B C D E F G H    D
Iteration 0-b    A B C D E F G H    E
Iteration 0-c    A B C D E F G H    D E
Iteration 1-a    A B C D E F G H    C D E
Iteration 1-b    A B C D E F G H    D E F
Iteration 2-a    A B C D E F G H    B C D E
Iteration 2-b    A B C D E F G H    C D E F
Iteration 2-c    A B C D E F G H    D E F G
Translation Synthesis Example (1)
Source phrase: ηλεκτρονική αυτόματη μετάδοση
Candidate translations: electronic, permission, traction, ETC, force, EAT, electronic automatic, automatic transmission, EAT ECU
Selected translation units: electronic & automatic transmission
Selection criteria: a) length, b) score
Translation Synthesis Example (2)
Source phrase: ασφαλειοθήκη χώρου επιβατών
Candidate translations: fuse, box, switch, passenger, compartment, relay, ignition, fuse box, fuse passenger, passenger compartment, compartment fuse box
Selected translation units: fuse box & passenger compartment
Selection criteria: a) length, b) score
Significant Technical Aspects
N-gram based conflation method for enhancing the existing statistical evidence (overcomes limitations that morphologically rich languages introduce)
Variable cut-off threshold (avoids rejecting translation parts at an early stage of the algorithm)
Specific word order not taken into account (enhances statistical evidence in small bilingual corpora)
Contiguity requirement (ensures translation accuracy)
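One way to combine the last two aspects, order-insensitive matching and the contiguity requirement, is sketched below. It is an illustrative reading rather than the method from the slides: a candidate counts as present in a target sentence only if some contiguous window of the sentence contains exactly the candidate's words, in any order. The function name `occurs_contiguously` is hypothetical.

```python
from collections import Counter
from typing import Sequence


def occurs_contiguously(candidate: Sequence[str], sentence: Sequence[str]) -> bool:
    """True if the candidate's words fill one contiguous window, in any order."""
    k = len(candidate)
    if k == 0 or k > len(sentence):
        return False
    wanted = Counter(candidate)
    # Slide a window of the candidate's length over the sentence, left to right.
    return any(Counter(sentence[i:i + k]) == wanted for i in range(len(sentence) - k + 1))
```

For the tokens of "refer to the automatic transmission section", the candidate `["transmission", "automatic"]` matches despite the reversed order, while `["automatic", "section"]` does not because the two words are not adjacent.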
Evaluation
Evaluation set:
350 input text fragments (80% noun phrases, 20% verb phrases) manually extracted from an automotive bilingual parallel corpus (3,100 EN words, 4,300 EL words)
                Static Window    Flexible Window
Correct         75%              83%
Second Match    8%               6%
Errors          17%              11%
Future Work
Apply to comparable bilingual corpora
Exploit linguistic information when available
Explore ways of integrating into a Machine Translation & Translation Memory framework
Integration in MT & TM Framework
[Figure: integration diagram. Recoverable labels: TM; Statistical Processing; Machine Translation; source sentence A B C D E F G H; Part 1, Part 2 (D E F), Part 3; Target Sentence]
Why DICE
Although the constituent words may have multiple senses, the identified TUs appear to have a unique translation:
"current": a) present, existing; b) electricity (alternating ~)
"current flows across": a) ρεύμα περνά (1 meaning)
Better measure of similarity than MI and specific MI (log-likelihood ratio): 1-1 and 1-0 matches are significant, 0-0 matches are not
Good measures of independence are not necessarily good measures of similarity
In practice, DICE works better!
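The point about 0-0 matches can be illustrated with a 2x2 contingency table over sentence pairs. In the sketch below, `a` counts pairs where both expressions occur (1-1), `b` and `c` count pairs where only one occurs (1-0), and `d` counts pairs where neither occurs (0-0). The comparison uses pointwise mutual information as a stand-in for MI; that choice, the function names, and the counts are assumptions made for the illustration.

```python
import math


def dice_from_counts(a: int, b: int, c: int, d: int = 0) -> float:
    """Dice coefficient from contingency counts; d (0-0 matches) is never used."""
    return 2 * a / (2 * a + b + c)


def pmi_from_counts(a: int, b: int, c: int, d: int) -> float:
    """Pointwise mutual information in bits; depends on d through the total n."""
    n = a + b + c + d
    return math.log2(a * n / ((a + b) * (a + c)))


# Same 1-1 and 1-0 evidence, but a corpus with many more unrelated sentence pairs.
small = (10, 5, 5, 80)
large = (10, 5, 5, 8000)
```

Here `dice_from_counts(*small)` and `dice_from_counts(*large)` are identical, whereas `pmi_from_counts` rises as unrelated sentence pairs are added, even though the evidence linking the two expressions has not changed.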
Corpus Size
Automotive industry bilingual corpus (EN-EL)
6,200 sentences in each language
3,100 EN words, 4,300 EL words
Champollion Approach
Tested on 2 different parts of the Hansard corpus (Canadian Parliament): 3.5 million and 8.5 million words
65%-75% accuracy was reported for the 3 evaluation sets
Increasing the database corpus was proposed for better results
Conflation Method
N-gram method
Soft clustering of words
>98% accuracy (evaluated using the first 1000 entries of the ILSP morphological lexicon)
Works well even with short words
The most significant factor was performance, so emphasis was placed on recall
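A character n-gram conflation of the kind named above could be sketched as follows. The slides give no details, so everything here is an assumption: character bigrams with boundary padding, a Dice-style overlap score, a threshold of 0.6, and overlapping ("soft") sets in which a word may belong to several groups. The names `char_ngrams`, `ngram_similarity`, and `conflation_sets` are hypothetical.

```python
from typing import Dict, List, Set


def char_ngrams(word: str, n: int = 2) -> Set[str]:
    """Set of character n-grams of the lowercased word, padded with '_' at both ends."""
    padded = f"_{word.lower()}_"
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def ngram_similarity(w1: str, w2: str, n: int = 2) -> float:
    """Dice-style overlap between the n-gram sets of two word forms."""
    g1, g2 = char_ngrams(w1, n), char_ngrams(w2, n)
    return 2 * len(g1 & g2) / (len(g1) + len(g2))


def conflation_sets(words: List[str], threshold: float = 0.6, n: int = 2) -> Dict[str, Set[str]]:
    """Map each word to every word similar enough to it; sets may overlap."""
    return {
        w: {v for v in words if ngram_similarity(w, v, n) >= threshold}
        for w in words
    }
```

Inflected forms such as "ηλεκτρονική" and "ηλεκτρονικής" share almost all of their bigrams and so fall into the same set, which lets their corpus counts be pooled. A lower threshold favors recall at the cost of more false groupings.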