Better translations through automated source and post-edit analysis


Presentation given at memoQfest Americas 2014 by David Landan of Welocalize. Machine translation (MT) is here to stay, and better MT means less post-editing effort, resulting in higher throughput for less money. This technical presentation details how automated source and post-edit analysis produces better translations. MT quality depends on training data quantity, quality, and relevance. Includes a review of WeScore scoring.


Better translations through automated source and post-edit analysis

David Landan, Welocalize

Background

• MT is here to stay
  – Better MT = less PE effort = higher throughput for less money
• MT quality depends on training data quantity, quality, and relevance
  – Selecting in-domain data increases BLEU scores by 10-20 points over generic engines
• LSPs have less control over quantity, so we need to focus on quality & relevance

A data-driven approach

• Analytics at each step: Training MT → Production → Post-Editing

Training MT:
• Perplexity Evaluator
• Candidate Scorer
• StyleScorer
• Source Content Profiler (joint project w/CNGL)

Production:
• UGC Normalization
• StyleScorer
• Number checking

Post-Editing:
• WeScore
• StyleScorer

Candidate Scorer

• Uses corpus of known “difficult” text
• Compares part-of-speech (POS) n-grams
  – Generates per-sentence scores
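
The slide names the ingredients (POS n-grams compared against a corpus of known difficult text) but not the implementation. A minimal, hypothetical sketch in Python using NLTK, with a made-up "difficult" corpus and an invented overlap score, might look like:

```python
# Illustrative only: a toy version of POS n-gram "difficulty" scoring.
# Requires the NLTK data packages 'punkt' and 'averaged_perceptron_tagger'.
from nltk import word_tokenize, pos_tag
from nltk.util import ngrams

def pos_ngrams(sentence, n=3):
    """Set of part-of-speech n-grams for one sentence."""
    tags = [tag for _, tag in pos_tag(word_tokenize(sentence))]
    return set(ngrams(tags, n))

# Hypothetical corpus of sentences previously found hard for the MT engine.
difficult_corpus = [
    "Having been being built, the house that the man who left saw fell.",
    "Time flies like an arrow; fruit flies like a banana.",
]
difficult_grams = set().union(*(pos_ngrams(s) for s in difficult_corpus))

def difficulty_score(sentence, n=3):
    """Share of the sentence's POS n-grams also seen in the difficult corpus."""
    grams = pos_ngrams(sentence, n)
    return len(grams & difficult_grams) / len(grams) if grams else 0.0

print(difficulty_score("The report that the team wrote yesterday was rejected."))
```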

Perplexity (PPL) Evaluator

• Build language models (LMs) from multiple corpora
  – Known “good” sentences for MT
  – Known “bad” sentences for MT
  – Client-specific in-domain data

• Each document gets a PPL score against each LM
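
As a rough illustration of per-document PPL scoring against several LMs (not the actual tooling), assuming KenLM models have already been trained on each corpus at the made-up paths below:

```python
# Hypothetical sketch only; model paths and the averaging scheme are invented.
import kenlm

lms = {
    "good_for_mt": kenlm.Model("good_for_mt.arpa"),
    "bad_for_mt": kenlm.Model("bad_for_mt.arpa"),
    "client_in_domain": kenlm.Model("client_in_domain.arpa"),
}

def document_ppl(sentences, model):
    """Average per-sentence perplexity of a document under one LM."""
    return sum(model.perplexity(s) for s in sentences) / len(sentences)

doc = ["this is a sample sentence .", "here is another one ."]
scores = {name: document_ppl(doc, lm) for name, lm in lms.items()}
print(scores)  # lower PPL = the document looks more like that corpus
```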

StyleScorer

• Combines PPL ratio, dissimilarity score, and classification score
  – Each document receives a score from 0-4
  – Higher score indicates better match to style established by client’s documents
  – Does not require parallel data

• Source scored for training/tuning suitability
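
The slide lists the components but not how they are combined or weighted. A purely illustrative sketch of mapping such components onto the 0-4 scale (the weights and transforms below are invented):

```python
def style_score(ppl_ratio, dissimilarity, classifier_prob,
                weights=(0.4, 0.3, 0.3)):
    """Combine components into a 0-4 score; higher = closer to client style.

    ppl_ratio:       client-style PPL / generic PPL (lower is better)
    dissimilarity:   0..1, lower means more similar to client documents
    classifier_prob: 0..1 probability of the "client style" class
    """
    # Turn each component into a 0..1 "goodness" value, then blend.
    ppl_component = max(0.0, 1.0 - min(ppl_ratio, 2.0) / 2.0)
    similarity_component = 1.0 - dissimilarity
    w1, w2, w3 = weights
    combined = w1 * ppl_component + w2 * similarity_component + w3 * classifier_prob
    return round(4 * combined, 2)  # scale to the 0-4 range from the slide

print(style_score(ppl_ratio=0.8, dissimilarity=0.25, classifier_prob=0.9))
```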

Source Content Profiler

• CNGL project (beta)
  – Classification of docs into profiles
  – Features based on:
    • Word & sentence length
    • Readability score
    • Syntactic structure
    • Terminology
    • Tag ratios
    • Do Not Translate lists
    • Glossary matches
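
To make the feature list concrete, here is a hypothetical and deliberately partial feature extractor; the actual CNGL feature set, readability formula, and classifier are only named on the slide:

```python
import re

def profile_features(doc_text, glossary=frozenset(), dnt_list=frozenset()):
    """Toy document features; readability and syntax features are omitted."""
    text = re.sub(r"<[^>]+>", " ", doc_text)              # drop markup tags
    sentences = [s for s in re.split(r"[.!?]+\s*", text) if s.strip()]
    words = re.findall(r"[\w'-]+", text)
    tags = re.findall(r"<[^>]+>", doc_text)
    return {
        "avg_word_len": sum(map(len, words)) / max(len(words), 1),
        "avg_sent_len": len(words) / max(len(sentences), 1),
        "tag_ratio": len(tags) / max(len(words), 1),
        "glossary_hits": sum(w.lower() in glossary for w in words),
        "dnt_hits": sum(w in dnt_list for w in words),
    }

# Feature vectors like these could feed any off-the-shelf classifier
# (e.g. scikit-learn) that assigns documents to content profiles.
print(profile_features("Click <b>Save</b> to store the file.", glossary={"save"}))
```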

Does it work?

Engine            en-US→nl-NL   en-US→pl-PL   en-US→hu-HU
Plain vanilla         21.26         16.88         17.31
Domain match          36.39         37.07         38.36
Plain + target        44.07         34.61         30.43
Domain + target       64.40         54.55         49.53

A data-driven approach

• Analytics at each step: Training MT → Production → Post-Editing

Training MT:
• Perplexity Evaluator
• Candidate Scorer
• StyleScorer
• Source Content Profiler (joint project w/CNGL)

Production:
• UGC Normalization
• StyleScorer
• Number checking

Post-Editing:
• WeScore
• StyleScorer

UGC normalization

• Make substitutions in source for known MT pain points before translating
  – Frequent misspellings – “teh”, “mroe”, etc.
  – Abbreviations – “imho”, “tyvm”, etc.
  – Missing punctuation – “cant”, “theyll”, etc.
  – Emoticons
  – Spelling variants/slang – “cuz”, “usu”, etc.
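
A minimal sketch of this kind of source-side normalization; the substitution table below is a tiny, invented example drawn from the listed categories:

```python
import re

# Hypothetical substitution table; real lists are client- and domain-specific.
SUBSTITUTIONS = {
    r"\bteh\b": "the",
    r"\bmroe\b": "more",
    r"\bimho\b": "in my humble opinion",
    r"\btyvm\b": "thank you very much",
    r"\bcant\b": "can't",
    r"\btheyll\b": "they'll",
    r"\bcuz\b": "because",
    r"\busu\b": "usually",
    r"[:;]-?[)(DP]": "",          # strip simple emoticons
}

def normalize_ugc(text):
    """Apply source-side fixes for known MT pain points before translation."""
    for pattern, replacement in SUBSTITUTIONS.items():
        text = re.sub(pattern, replacement, text, flags=re.IGNORECASE)
    return text

print(normalize_ugc("teh results are mroe accurate now, tyvm :)"))
```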

Number checking

• Verify that numeric MT output is localized correctly
  – Currency – “$1B” vs “1 млрд. $”
  – Dates – “2/28/2014” vs “28/2/2014”
  – Time – “2pm” vs “14h00”
  – Separator & radix – “1,234.5” vs “1 234,5”
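
A simplified, hypothetical version of such a check for separators and radix; currency, date, and time checks would need locale rules beyond this sketch:

```python
import re

def extract_numbers(text, decimal_sep=".", group_sep=","):
    """Pull numbers out of text and normalize them to floats."""
    values = []
    for match in re.findall(r"\d[\d.,\s]*", text):
        cleaned = match.strip().replace(group_sep, "").replace(" ", "")
        cleaned = cleaned.replace(decimal_sep, ".")
        try:
            values.append(float(cleaned))
        except ValueError:
            pass
    return sorted(values)

def numbers_match(source, target, target_decimal=",", target_group=" "):
    """True if source and MT output contain the same numeric values."""
    return extract_numbers(source) == extract_numbers(
        target, decimal_sep=target_decimal, group_sep=target_group)

print(numbers_match("Total: 1,234.5 EUR", "Gesamt: 1 234,5 EUR"))  # True
print(numbers_match("Total: 1,234.5 EUR", "Gesamt: 1 243,5 EUR"))  # False
```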

StyleScorer revisited

• MT output is compared to client’s historical (in-domain) PE data
  – Treat each target segment as a document
  – Lower scores indicate segments likely to require greater PE effort
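
In practice this allows MT output to be triaged before post-editing begins. A tiny illustrative helper, where the scorer passed in is a stand-in for whatever segment-level scorer is available:

```python
def triage_segments(mt_segments, score_segment, threshold=2.0):
    """Return segments scoring below threshold (0-4 scale), worst first."""
    scored = [(score_segment(seg), seg) for seg in mt_segments]
    return sorted(pair for pair in scored if pair[0] < threshold)

# Hypothetical usage with a dummy scorer standing in for StyleScorer.
flagged = triage_segments(
    ["Click Save to store the file.", "File the store to click save."],
    score_segment=lambda seg: 3.5 if seg.startswith("Click") else 1.2)
print(flagged)  # [(1.2, 'File the store to click save.')]
```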

A data-driven approach

• Analytics at each step: Training MT → Production → Post-Editing

Training MT:
• Perplexity Evaluator
• Candidate Scorer
• StyleScorer
• Source Content Profiler (joint project w/CNGL)

Production:
• UGC Normalization
• StyleScorer
• Number checking

Post-Editing:
• WeScore
• StyleScorer

WeScore

• Dashboard for viewing MT metrics
  – Tokenizes input from a variety of formats & runs several scoring algorithms in parallel
  – Exports detailed analysis to spreadsheet for sentence-by-sentence review
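
A rough stand-in for the kind of report WeScore automates: score each MT segment against its post-edited reference with several metrics and export a spreadsheet-friendly CSV. This sketch uses sacrebleu's sentence-level helpers; WeScore's own metric set, format handling, and parallel execution are not detailed on the slide, and the file name below is made up:

```python
import csv
import sacrebleu

def score_to_csv(mt_segments, pe_segments, path="mt_quality_report.csv"):
    """Write per-segment BLEU/chrF/TER scores for sentence-by-sentence review."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["segment", "mt", "pe", "BLEU", "chrF", "TER"])
        for i, (mt, pe) in enumerate(zip(mt_segments, pe_segments), start=1):
            writer.writerow([
                i, mt, pe,
                round(sacrebleu.sentence_bleu(mt, [pe]).score, 2),
                round(sacrebleu.sentence_chrf(mt, [pe]).score, 2),
                round(sacrebleu.sentence_ter(mt, [pe]).score, 2),
            ])

score_to_csv(["The file was save."], ["The file was saved."])
```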


StyleScorer III

• PE output is compared to client’s historical (in-domain) data
  – Treat each PE segment as a document
  – Lower score indicates possible deviation from established style

Feedback loop

• Data collected and lessons learned
  – Update client-specific data for future engine training
  – Mine data for generalizable patterns in problem areas
  – Work with post-editors to understand how to make a better system & how to improve PE experience and throughput

Q&A

Thank you!