Better translations through automated source and post-edit analysis

David Landan, Welocalize

Background

• MT is here to stay
  – Better MT = less PE effort = higher throughput for less money

• MT quality depends on training data quantity, quality, and relevance
  – Selecting in-domain data increases BLEU scores by 10-20 points over generic engines

• LSPs have less control over quantity, so we need to focus on quality & relevance

A data-driven approach

• Analytics at each step
  – Training
    • Perplexity Evaluator
    • Candidate Scorer
    • StyleScorer
    • Source Content Profiler (joint project w/CNGL)
  – MT Production
    • UGC Normalization
    • StyleScorer
    • Number checking
  – Post-Editing
    • WeScore
    • StyleScorer

Candidate Scorer

• Uses corpus of known “difficult” text
• Compares part-of-speech (POS) n-grams
  – Generates per-sentence scores
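
The slide does not say how the POS n-gram comparison is computed. As a minimal sketch, assume sentences arrive already POS-tagged (no tagger is named in the slides) and that a sentence's score is the fraction of its POS trigrams that also occur in the “difficult” corpus. The function names and tag sequences below are hypothetical.

```python
from typing import Iterable, List, Set, Tuple

Ngram = Tuple[str, ...]

def pos_ngrams(tags: List[str], n: int = 3) -> List[Ngram]:
    """Return the POS n-grams of one tagged sentence."""
    return [tuple(tags[i:i + n]) for i in range(len(tags) - n + 1)]

def build_difficult_set(corpus: Iterable[List[str]], n: int = 3) -> Set[Ngram]:
    """Collect every POS n-gram seen in the known-difficult corpus."""
    difficult: Set[Ngram] = set()
    for tags in corpus:
        difficult.update(pos_ngrams(tags, n))
    return difficult

def difficulty_score(tags: List[str], difficult: Set[Ngram], n: int = 3) -> float:
    """Fraction of the sentence's n-grams found in the difficult set (assumed metric)."""
    grams = pos_ngrams(tags, n)
    if not grams:
        return 0.0  # sentence shorter than n tokens
    return sum(g in difficult for g in grams) / len(grams)

# Hypothetical tag sequences standing in for a tagged "difficult" corpus.
difficult_corpus = [
    ["DET", "NOUN", "NOUN", "NOUN", "VERB"],
    ["VERB", "ADP", "ADP", "NOUN"],
]
difficult = build_difficult_set(difficult_corpus)

easy = ["DET", "NOUN", "VERB", "DET", "NOUN"]
hard = ["DET", "NOUN", "NOUN", "NOUN", "VERB"]
print(difficulty_score(easy, difficult))  # 0.0
print(difficulty_score(hard, difficult))  # 1.0
```

A production scorer would likely weight n-grams by frequency rather than use set membership; that choice is not specified in the source.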

Perplexity (PPL) Evaluator

• Build language models (LMs) from multiple corpora
  – Known “good” sentences for MT
  – Known “bad” sentences for MT
  – Client-specific in-domain data

• Each document gets a PPL score against each LM
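
To illustrate scoring one document against several LMs, here is a minimal sketch that swaps in an add-one-smoothed unigram model for whatever n-gram toolkit is actually used (the slides do not name one). The class, function, and corpus contents are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List

class UnigramLM:
    """Add-one-smoothed unigram LM; a stand-in for a real n-gram model."""

    def __init__(self, sentences: List[str]) -> None:
        self.counts = Counter(tok for s in sentences for tok in s.lower().split())
        self.total = sum(self.counts.values())
        self.vocab_size = len(self.counts) + 1  # +1 reserves mass for unseen tokens

    def log_prob(self, token: str) -> float:
        return math.log((self.counts[token] + 1) / (self.total + self.vocab_size))

    def perplexity(self, document: str) -> float:
        tokens = document.lower().split()
        if not tokens:
            return float("inf")
        avg_log_prob = sum(self.log_prob(t) for t in tokens) / len(tokens)
        return math.exp(-avg_log_prob)

def score_document(document: str, lms: Dict[str, UnigramLM]) -> Dict[str, float]:
    """Return one PPL score per language model."""
    return {name: lm.perplexity(document) for name, lm in lms.items()}

# Toy corpora (invented for illustration only).
lms = {
    "good": UnigramLM(["click the save button", "open the file menu"]),
    "bad": UnigramLM(["lol cant wait 2 c u", "omg teh best"]),
}
scores = score_document("click the file button", lms)
for name, ppl in scores.items():
    print(f"{name}: {ppl:.2f}")
```

Lower perplexity means the document looks more like that LM's corpus, so here the document scores lower against the “good” model than against the “bad” one.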

StyleScorer

• Combines PPL ratio, dissimilarity score, and classification score
  – Each document receives a score from 0-4
  – Higher score indicates better match to style established by client’s documents
  – Does not require parallel data

• Source scored for training/tuning suitability

Source Content Profiler

• CNGL project (beta)
  – Classification of docs into profiles
  – Features based on:
    • Word & sentence length
    • Readability score
    • Syntactic structure
    • Terminology
    • Tag ratios
    • Do Not Translate lists
    • Glossary matches
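
A few of the listed features are simple to compute. The sketch below covers only word and sentence length, tag ratio, and glossary matches; the regular expressions, the feature names, and the definition of tag ratio as tags per word are assumptions, and the classification step itself is not shown.

```python
import re
from typing import Dict, Set

TAG_RE = re.compile(r"<[^>]+>")              # inline markup such as <b>...</b>
SENT_SPLIT_RE = re.compile(r"(?<=[.!?])\s+")  # naive sentence boundary
WORD_RE = re.compile(r"[A-Za-z0-9']+")

def profile_features(text: str, glossary: Set[str]) -> Dict[str, float]:
    """Extract a small, hypothetical subset of profiling features."""
    tags = TAG_RE.findall(text)
    plain = TAG_RE.sub(" ", text)
    sentences = [s for s in SENT_SPLIT_RE.split(plain.strip()) if s]
    words = WORD_RE.findall(plain)
    n_words = len(words)
    return {
        "avg_word_length": sum(map(len, words)) / n_words if n_words else 0.0,
        "avg_sentence_length": n_words / len(sentences) if sentences else 0.0,
        "tag_ratio": len(tags) / n_words if n_words else 0.0,
        "glossary_matches": float(sum(w.lower() in glossary for w in words)),
    }

features = profile_features(
    "Click <b>Save</b> now. Open the file.",
    glossary={"save", "file"},
)
print(features)
```

The resulting feature dictionary could feed any standard classifier; which one the profiler uses is not stated in the slides.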

Does it work?

Engine           en-US→nl-NL   en-US→pl-PL   en-US→hu-HU
Plain vanilla    21.26         16.88         17.31
Domain match     36.39         37.07         38.36
Plain + target   44.07         34.61         30.43
Domain + target  64.40         54.55         49.53




UGC normalization

• Make substitutions in source for known MT pain points before translating
  – Frequent misspellings: “teh”, “mroe”, etc.
  – Abbreviations: “imho”, “tyvm”, etc.
  – Missing punctuation: “cant”, “theyll”, etc.
  – Emoticons
  – Spelling variants/slang: “cuz”, “usu”, etc.
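
The substitution step can be sketched as a whole-word lookup table. The source forms come from the slide; the replacement strings are assumed expansions, and emoticon handling is left out.

```python
import re
from typing import Dict

# Source forms are from the slide; the replacements are assumed expansions.
SUBSTITUTIONS: Dict[str, str] = {
    "teh": "the",
    "mroe": "more",
    "imho": "in my humble opinion",
    "tyvm": "thank you very much",
    "cant": "can't",
    "theyll": "they'll",
    "cuz": "because",
    "usu": "usually",
}

# \b anchors keep "cant" from matching inside words like "cantaloupe".
_PATTERN = re.compile(
    r"\b(" + "|".join(map(re.escape, SUBSTITUTIONS)) + r")\b",
    re.IGNORECASE,
)

def normalize_ugc(text: str) -> str:
    """Replace known pain-point tokens before the text is sent to MT."""
    return _PATTERN.sub(lambda m: SUBSTITUTIONS[m.group(0).lower()], text)

print(normalize_ugc("imho teh update cant be mroe stable"))
# in my humble opinion the update can't be more stable
```

This sketch drops the original capitalization of replaced tokens, and a blind table will occasionally rewrite a legitimate word (“cant” is also an English noun), so a real list needs per-domain review.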

Number checking

• Verify that numeric MT output is localized correctly
  – Currency: “$1B” vs “1 млрд. $”
  – Dates: “2/28/2014” vs “28/2/2014”
  – Time: “2pm” vs “14h00”
  – Separator & radix: “1,234.5” vs “1 234,5”
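
The separator and radix case lends itself to a small check. This sketch assumes an en-US source and a target locale that groups thousands with a plain space and uses a decimal comma; function names are hypothetical, and currency, date, and time checks are not covered.

```python
import re
from typing import List

# en-US numbers: optional comma-grouped thousands, optional decimal part.
EN_NUMBER_RE = re.compile(r"\d{1,3}(?:,\d{3})+(?:\.\d+)?|\d+(?:\.\d+)?")

def localize_number(en_number: str, group_sep: str, decimal_sep: str) -> str:
    """Convert an en-US formatted number to the target separators."""
    integer, _, fraction = en_number.partition(".")
    integer = integer.replace(",", group_sep)
    return integer + (decimal_sep + fraction if fraction else "")

def find_unlocalized_numbers(
    source: str, target: str, group_sep: str = " ", decimal_sep: str = ","
) -> List[str]:
    """Return expected target-side numbers that are missing from the MT output."""
    missing = []
    for match in EN_NUMBER_RE.finditer(source):
        expected = localize_number(match.group(0), group_sep, decimal_sep)
        if expected not in target:  # simple substring test, see note below
            missing.append(expected)
    return missing

source = "The file is 1,234.5 MB."
print(find_unlocalized_numbers(source, "Plik ma 1 234,5 MB."))  # []
print(find_unlocalized_numbers(source, "Plik ma 1,234.5 MB."))  # ['1 234,5']
```

The substring test can pass by accident on short numbers, and many locales use a non-breaking space as the group separator, so a real checker would tokenize the target and normalize whitespace first.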

StyleScorer revisited

• MT output is compared to client’s historical (in-domain) PE data
  – Treat each target segment as a document
  – Lower scores indicate segments likely to require greater PE effort




WeScore

• Dashboard for viewing MT metrics
  – Tokenizes input from a variety of formats & runs several scoring algorithms in parallel
  – Exports detailed analysis to spreadsheet for sentence-by-sentence review
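
The tokenize, score in parallel, and export flow might look roughly like the sketch below. The two scorers are toy placeholders (the slides do not list WeScore's metrics), the pairing of MT output with a reference segment is an assumption, and CSV stands in for the spreadsheet export.

```python
import csv
import io
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List, TextIO, Tuple

Scorer = Callable[[List[str], List[str]], float]

def token_overlap(mt: List[str], ref: List[str]) -> float:
    """Share of MT tokens that also appear in the reference (placeholder metric)."""
    if not mt:
        return 0.0
    ref_set = set(ref)
    return sum(t in ref_set for t in mt) / len(mt)

def length_ratio(mt: List[str], ref: List[str]) -> float:
    """MT length divided by reference length (placeholder metric)."""
    return len(mt) / len(ref) if ref else 0.0

SCORERS: Dict[str, Scorer] = {
    "token_overlap": token_overlap,
    "length_ratio": length_ratio,
}

def score_segments(pairs: List[Tuple[str, str]]) -> List[Dict[str, object]]:
    """Tokenize each (MT, reference) pair and run all scorers concurrently."""
    rows: List[Dict[str, object]] = []
    with ThreadPoolExecutor() as pool:
        for mt, ref in pairs:
            mt_tok, ref_tok = mt.lower().split(), ref.lower().split()
            futures = {
                name: pool.submit(fn, mt_tok, ref_tok)
                for name, fn in SCORERS.items()
            }
            row: Dict[str, object] = {"mt": mt, "reference": ref}
            row.update({name: round(f.result(), 3) for name, f in futures.items()})
            rows.append(row)
    return rows

def export_csv(rows: List[Dict[str, object]], out: TextIO) -> None:
    """Write one row per segment for sentence-by-sentence review."""
    writer = csv.DictWriter(out, fieldnames=["mt", "reference", *SCORERS])
    writer.writeheader()
    writer.writerows(rows)

rows = score_segments([("the cat sat", "the cat sat down")])
buffer = io.StringIO()
export_csv(rows, buffer)
print(buffer.getvalue())
```

Threads are used here only to show the shape of the fan-out; CPU-heavy metrics would more likely run in separate processes.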


StyleScorer III

• PE output is compared to client’s historical (in-domain) data
  – Treat each PE segment as a document
  – Lower score indicates possible deviation from established style

Feedback loop

• Data collected and lessons learned
  – Update client-specific data for future engine training
  – Mine data for generalizable patterns in problem areas
  – Work with post-editors to understand how to make a better system & how to improve PE experience and throughput

Q&A

Thank you!
