Presentation at memoQfest Americas 2014 by David Landan of Welocalize. Machine translation (MT) is here to stay. Better MT means less post-editing effort, resulting in higher throughput for less money. This technical presentation details how automated source and post-edit analysis produce better translations. MT quality depends on training data quantity, quality, and relevance. Also reviews WeScore and scoring.
Better translations through automated source and post-edit analysis
David Landan, Welocalize
Background
• MT is here to stay
  – Better MT = less post-editing (PE) effort = higher throughput for less money
• MT quality depends on training data quantity, quality, and relevance
  – Selecting in-domain data increases BLEU scores by 10-20 points over generic engines
• LSPs have less control over quantity, so we need to focus on quality & relevance
A data-driven approach
• Analytics at each step: Training, MT Production, Post-Editing
• Training
  – Perplexity Evaluator
  – Candidate Scorer
  – StyleScorer
  – Source Content Profiler (joint project w/CNGL)
• MT Production
  – UGC Normalization
  – StyleScorer
  – Number checking
• Post-Editing
  – WeScore
  – StyleScorer
Candidate Scorer
• Uses corpus of known “difficult” text
• Compares part of speech (POS) n-grams
  – Generates per-sentence scores
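The slide does not say how the POS n-gram comparison is scored. As a minimal sketch, assuming the score is simply the share of a sentence's POS trigrams that also occur in the “difficult” corpus, it could look like the following. The function names, the trigram order, and the hand-tagged toy data are all hypothetical; a real pipeline would obtain the tags from a POS tagger.

```python
from collections import Counter


def pos_ngrams(tags, n=3):
    """Return the POS n-grams (as tuples) of one tagged sentence."""
    return [tuple(tags[i:i + n]) for i in range(len(tags) - n + 1)]


def build_profile(tagged_corpus, n=3):
    """Count POS n-grams over sentences known to be difficult for MT."""
    counts = Counter()
    for tags in tagged_corpus:
        counts.update(pos_ngrams(tags, n))
    return counts


def candidate_score(tags, profile, n=3):
    """Share of the sentence's POS n-grams that also occur in the profile.

    Assumed scoring rule, not the one used by the actual Candidate Scorer.
    """
    grams = pos_ngrams(tags, n)
    if not grams:
        return 0.0
    hits = sum(1 for g in grams if g in profile)
    return hits / len(grams)


# Hypothetical hand-tagged "difficult" sentences (long noun stacks).
difficult = [
    ["NOUN", "NOUN", "NOUN", "VERB"],
    ["VERB", "NOUN", "NOUN", "NOUN"],
]
profile = build_profile(difficult)

score_hard = candidate_score(["DET", "NOUN", "NOUN", "NOUN", "VERB"], profile)
score_easy = candidate_score(["DET", "NOUN", "VERB", "DET", "NOUN"], profile)
```

Under this assumed rule, a higher per-sentence score means the sentence looks more like the known difficult text.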
Perplexity (PPL) Evaluator
• Build language models (LMs) from multiple corpora
  – Known “good” sentences for MT
  – Known “bad” sentences for MT
  – Client-specific in-domain data
• Each document gets a PPL score against each LM
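To make the idea of scoring one document against several LMs concrete, here is a small sketch. It assumes an add-one smoothed unigram model as a stand-in for whatever n-gram toolkit is actually used, and the class name, corpus names, and toy sentences are hypothetical.

```python
import math
from collections import Counter


class UnigramLM:
    """Add-one smoothed unigram LM (a stand-in for a real n-gram model)."""

    def __init__(self, sentences):
        self.counts = Counter(
            tok for s in sentences for tok in s.lower().split()
        )
        self.total = sum(self.counts.values())
        # One extra vocabulary slot reserves probability mass for unseen tokens.
        self.vocab = len(self.counts) + 1

    def prob(self, token):
        return (self.counts[token] + 1) / (self.total + self.vocab)

    def perplexity(self, document):
        tokens = document.lower().split()
        if not tokens:
            raise ValueError("empty document")
        log_prob = sum(math.log(self.prob(t)) for t in tokens)
        return math.exp(-log_prob / len(tokens))


# Hypothetical toy corpora, one LM per corpus.
lms = {
    "client_in_domain": UnigramLM(
        ["click the save button", "click the print button"]
    ),
    "known_bad": UnigramLM(
        ["lol cant wait 4 this", "omg teh best thing evar"]
    ),
}

# Each document gets one PPL score per LM; lower means a closer match.
document = "click the button"
scores = {name: lm.perplexity(document) for name, lm in lms.items()}
```

In this toy setup the document scores a lower perplexity against the in-domain LM than against the “bad” LM, which is the kind of signal the evaluator relies on.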
StyleScorer
• Combines PPL ratio, dissimilarity score, and classification score
  – Each document receives a score from 0-4
  – Higher score indicates better match to style established by client’s documents
  – Does not require parallel data
• Source scored for training/tuning suitability
Source Content Profiler
• CNGL project (beta)
  – Classification of docs into profiles
  – Features based on:
    • Word & sentence length
    • Readability score
    • Syntactic structure
    • Terminology
    • Tag ratios
    • Do Not Translate lists
    • Glossary matches
Does it work?
Engine             en-US → nl-NL    en-US → pl-PL    en-US → hu-HU
Plain vanilla      21.26            16.88            17.31
Domain match       36.39            37.07            38.36
Plain + target     44.07            34.61            30.43
Domain + target    64.40            54.55            49.53
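The gains in the table are easier to read as differences from the plain vanilla engine. The sketch below only does that arithmetic on the figures from the slide. The slide does not name the metric, so the code treats the values as scores on one common scale; the variable and function names are hypothetical.

```python
# Figures copied from the table above; one list entry per language pair.
PAIRS = ["en-US to nl-NL", "en-US to pl-PL", "en-US to hu-HU"]
SCORES = {
    "Plain vanilla": [21.26, 16.88, 17.31],
    "Domain match": [36.39, 37.07, 38.36],
    "Plain + target": [44.07, 34.61, 30.43],
    "Domain + target": [64.40, 54.55, 49.53],
}


def gains(engine, baseline="Plain vanilla"):
    """Per-pair improvement of `engine` over `baseline`, rounded to 2 places."""
    return [
        round(e - b, 2) for e, b in zip(SCORES[engine], SCORES[baseline])
    ]


domain_gain = gains("Domain match")     # [15.13, 20.19, 21.05]
full_gain = gains("Domain + target")    # [43.14, 37.67, 32.22]
```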
UGC normalization
• Make substitutions in source for known MT pain points before translating
  – Frequent misspellings: “teh”, “mroe”, etc.
  – Abbreviations: “imho”, “tyvm”, etc.
  – Missing punctuation: “cant”, “theyll”, etc.
  – Emoticons
  – Spelling variants/slang: “cuz”, “usu”, etc.
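A lookup-table substitution pass over the source is one simple way to picture this step. The sketch below assumes whole-word, case-insensitive replacement; the table entries mirror the examples on the slide, while the table name, the function name, and the chosen expansions are hypothetical. Emoticon handling is left out.

```python
import re

# Hypothetical substitution table built from the examples on the slide.
SUBSTITUTIONS = {
    "teh": "the",
    "mroe": "more",
    "imho": "in my humble opinion",
    "tyvm": "thank you very much",
    "cant": "can't",
    "theyll": "they'll",
    "cuz": "because",
    "usu": "usually",
}

# Match table keys only as whole words, so "usu" does not fire inside "usual".
_PATTERN = re.compile(
    r"\b(" + "|".join(map(re.escape, SUBSTITUTIONS)) + r")\b",
    re.IGNORECASE,
)


def normalize(text):
    """Replace known pain points before the text is sent to MT.

    The replacement is always lowercase, so original capitalization is lost.
    """
    return _PATTERN.sub(lambda m: SUBSTITUTIONS[m.group(0).lower()], text)


cleaned = normalize("teh printer cant print cuz its jammed")
```

A context-free table like this will also rewrite legitimate uses of a word such as “cant”, so a production system would presumably need more context than this sketch uses.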
Number checking
• Verify that numeric MT output is localized correctly
  – Currency: “$1B” vs “1 млрд. $”
  – Dates: “2/28/2014” vs “28/2/2014”
  – Time: “2pm” vs “14h00”
  – Separator & radix: “1,234.5” vs “1 234,5”
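For the separator and radix case, one possible check is to parse the numbers on each side with that side's conventions and compare the values. The sketch below assumes a small hand-written locale table and a plain space as the grouping separator (real text often uses a non-breaking space). The table, the function names, and the test sentences are hypothetical, and currency, date, and time checks are not covered.

```python
import re

# Hypothetical per-locale conventions: (grouping separator, decimal mark).
LOCALE_FORMATS = {
    "en-US": (",", "."),
    "ru-RU": (" ", ","),
    "de-DE": (".", ","),
}


def extract_numbers(text, locale):
    """Parse the numbers in `text` using the separators of `locale`."""
    group, decimal = LOCALE_FORMATS[locale]
    g, d = re.escape(group), re.escape(decimal)
    pattern = re.compile(
        r"\d{1,3}(?:" + g + r"\d{3})+(?:" + d + r"\d+)?"  # grouped, e.g. 1,234.5
        r"|\d+(?:" + d + r"\d+)?"                         # ungrouped, e.g. 1234.5
    )
    values = []
    for match in pattern.findall(text):
        # Drop grouping first, then convert the decimal mark to a dot.
        canonical = match.replace(group, "").replace(decimal, ".")
        values.append(float(canonical))
    return values


def numbers_match(source, target, src_locale, tgt_locale):
    """True if both sides contain the same numeric values."""
    return sorted(extract_numbers(source, src_locale)) == sorted(
        extract_numbers(target, tgt_locale)
    )


ok = numbers_match(
    "Total: 1,234.5 units", "Итого: 1 234,5 единиц", "en-US", "ru-RU"
)
not_localized = numbers_match(
    "Total: 1,234.5 units", "Итого: 1,234.5 единиц", "en-US", "ru-RU"
)
```

Here `ok` is true because both sides parse to 1234.5, while `not_localized` is false because the unlocalized target parses differently under the ru-RU conventions.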
StyleScorer revisited
• MT output is compared to client’s historical (in-domain) PE data
  – Treat each target segment as a document
  – Lower scores indicate segments likely to require greater PE effort
WeScore
• Dashboard for viewing MT metrics
  – Tokenizes input from variety of formats & runs several scoring algorithms in parallel
  – Exports detailed analysis to spreadsheet for sentence-by-sentence review
StyleScorer III
• PE output is compared to client’s historical (in-domain) data
  – Treat each PE segment as a document
  – Lower score indicates possible deviation from established style
Feedback loop
• Data collected and lessons learned
  – Update client-specific data for future engine training
  – Mine data for generalizable patterns in problem areas
  – Work with post-editors to understand how to make a better system & how to improve PE experience and throughput
Q&A
Thank you!