Better translations through automated source and post-edit analysis
David Landan, Welocalize


DESCRIPTION

Presentation by David Landan of Welocalize at memoQfest Americas 2014. Machine translation (MT) is here to stay; better MT means less post-editing effort, resulting in higher throughput for less money. This technical presentation details how automated source and post-edit analysis produces better translations. MT quality depends on training data quantity, quality, and relevance. Includes a review of WeScore and scoring.


Page 1: Better translations through automated source and post-edit analysis

Better translations through automated source and post-edit analysis

David Landan, Welocalize

Page 2

Background

• MT is here to stay
  – Better MT = less PE effort = higher throughput for less money

• MT quality depends on training data quantity, quality, and relevance
  – Selecting in-domain data increases BLEU scores by 10-20 points over generic engines

• LSPs have less control over quantity, so we need to focus on quality & relevance

Page 3

A data-driven approach

• Analytics at each step

  Training:       Perplexity Evaluator; Candidate Scorer; StyleScorer;
                  Source Content Profiler (joint project w/CNGL)
  MT Production:  UGC Normalization; StyleScorer; Number checking
  Post-Editing:   WeScore; StyleScorer

Page 4

Candidate Scorer

• Uses corpus of known “difficult” text
• Compares part-of-speech (POS) n-grams
  – Generates per-sentence scores
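The deck doesn't show an implementation, but the idea on this slide can be sketched as follows: tag a sentence, extract POS n-grams, and score the sentence by how many of its n-grams appear in a corpus of known-difficult text. The tag sequences, function names, and the trigram window are illustrative assumptions, not Welocalize's actual method (a real pipeline would use a POS tagger rather than hand-written tags).

```python
from collections import Counter

def pos_ngrams(tags, n=3):
    """Extract POS n-grams (as tuples) from a tag sequence."""
    return Counter(tuple(tags[i:i + n]) for i in range(len(tags) - n + 1))

def difficulty_score(sent_tags, difficult_ngrams, n=3):
    """Fraction of the sentence's POS n-grams seen in the 'difficult' corpus.
    Higher score = sentence looks more like known-difficult text."""
    grams = pos_ngrams(sent_tags, n)
    if not grams:
        return 0.0
    hits = sum(c for g, c in grams.items() if g in difficult_ngrams)
    return hits / sum(grams.values())

# POS tags would normally come from a tagger; these are hand-written examples.
difficult = pos_ngrams(["NN", "NN", "NN", "VBZ", "NN", "NN"])  # noun-pile pattern
easy_sent = ["DT", "NN", "VBZ", "DT", "NN"]
hard_sent = ["NN", "NN", "NN", "VBZ", "NN"]

print(difficulty_score(hard_sent, difficult))  # 1.0 — all trigrams are "difficult"
print(difficulty_score(easy_sent, difficult))  # 0.0 — none are
```

Per-sentence scores like these make it cheap to flag segments worth human attention before translation.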

Page 5
Page 6

Perplexity (PPL) Evaluator

• Build language models (LMs) from multiple corpora
  – Known “good” sentences for MT
  – Known “bad” sentences for MT
  – Client-specific in-domain data

• Each document gets a PPL score against each LM
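As a minimal sketch of the idea (not the actual Welocalize tooling, which would use proper n-gram LMs over large corpora): build a smoothed unigram LM from each corpus, then compute a document's perplexity against each. Lower perplexity against the "good" LM suggests the document is a better fit for MT.

```python
import math
from collections import Counter

def train_unigram_lm(corpus, vocab):
    """Build an add-one-smoothed unigram LM from a token list."""
    counts = Counter(corpus)
    total = len(corpus)
    return {w: (counts[w] + 1) / (total + len(vocab)) for w in vocab}

def perplexity(tokens, lm):
    """PPL = exp(-(1/N) * sum(log p(w))); lower means a better fit."""
    log_prob = sum(math.log(lm[w]) for w in tokens)
    return math.exp(-log_prob / len(tokens))

# Tiny toy corpora standing in for the "good"/"bad" training sets.
good = "the system translates the sentence".split()
bad = "lol gr8 thx m8 rly".split()
vocab = set(good) | set(bad)

lm_good = train_unigram_lm(good, vocab)
lm_bad = train_unigram_lm(bad, vocab)

doc = "the sentence".split()
# A clean document should score lower (better) against the "good" LM.
print(perplexity(doc, lm_good) < perplexity(doc, lm_bad))  # True
```

Scoring each document against every LM yields a small vector of PPL values that downstream components can threshold or combine.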

Page 7
Page 8

StyleScorer

• Combines PPL ratio, dissimilarity score, and classification score
  – Each document receives a score from 0-4
  – Higher score indicates better match to style established by client’s documents
  – Does not require parallel data

• Source scored for training/tuning suitability
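The slide names the three ingredients but not the combination formula, so the sketch below is a hypothetical way to fold them into the stated 0-4 range: map each component to [0, 1] with 1 meaning a better style match, average, and scale by 4. The component semantics in the comments are assumptions.

```python
def style_score(ppl_ratio, dissimilarity, classifier_prob):
    """Hypothetical combination of the three StyleScorer components:
    - ppl_ratio: PPL vs. client LM / PPL vs. generic LM (< 1.0 is good)
    - dissimilarity: 0 (identical style) .. 1 (completely different)
    - classifier_prob: P(document matches client style)
    Returns a score in [0, 4]; higher = better match to client style."""
    ppl_component = max(0.0, min(1.0, 1.0 - (ppl_ratio - 1.0)))  # clipped to [0, 1]
    sim_component = 1.0 - dissimilarity
    combined = (ppl_component + sim_component + classifier_prob) / 3
    return round(4 * combined, 2)

# A document that fits the client style well on all three signals:
print(style_score(ppl_ratio=0.9, dissimilarity=0.1, classifier_prob=0.95))  # 3.8
```

Note that none of the inputs require parallel data, consistent with the slide: each is computed from monolingual text against the client's document collection.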

Page 9

Source Content Profiler

• CNGL project (beta)
  – Classification of docs into profiles
  – Features based on:
    • Word & sentence length
    • Readability score
    • Syntactic structure
    • Terminology
    • Tag ratios
    • Do Not Translate lists
    • Glossary matches
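A few of the features listed above are simple surface statistics. As an illustrative sketch (the feature names and formulas here are assumptions, and the sentence-length figure is only a crude readability proxy, not a full Flesch-style score):

```python
import re

def profile_features(doc):
    """Extract a few surface features of the kind the profiler uses:
    average word length, average sentence length, and the ratio of
    HTML/XML-like tags to words."""
    tags = re.findall(r"<[^>]+>", doc)
    text = re.sub(r"<[^>]+>", " ", doc)          # drop markup before counting words
    words = re.findall(r"[A-Za-z']+", text)
    sents = [s for s in re.split(r"[.!?]+\s*", text) if s.strip()]
    return {
        "avg_word_len": sum(map(len, words)) / len(words),
        "avg_sent_len": len(words) / len(sents),
        "tag_ratio": len(tags) / max(1, len(words)),
    }

feats = profile_features("Click <b>Save</b>. The file is stored locally.")
print(feats)  # avg_sent_len == 3.5, tag_ratio == 2/7
```

Feature vectors like this one would then feed a classifier that assigns each document to a profile.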

Page 10

Does it work?

Engine             en-US→nl-NL   en-US→pl-PL   en-US→hu-HU
Plain vanilla         21.26         16.88         17.31
Domain match          36.39         37.07         38.36
Plain + target        44.07         34.61         30.43
Domain + target       64.40         54.55         49.53

Page 11

A data-driven approach

• Analytics at each step

  Training:       Perplexity Evaluator; Candidate Scorer; StyleScorer;
                  Source Content Profiler (joint project w/CNGL)
  MT Production:  UGC Normalization; StyleScorer; Number checking
  Post-Editing:   WeScore; StyleScorer

Page 12

UGC normalization

• Make substitutions in source for known MT pain points before translating
  – Frequent misspellings – “teh”, “mroe”, etc.
  – Abbreviations – “imho”, “tyvm”, etc.
  – Missing punctuation – “cant”, “theyll”, etc.
  – Emoticons
  – Spelling variants/slang – “cuz”, “usu”, etc.
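A normalization pass like this is essentially a whole-word substitution table applied before MT. A minimal sketch, using the slide's own examples (a production list would be far larger and handle emoticons and context-sensitive cases):

```python
import re

# Illustrative substitution table built from the examples on the slide.
SUBS = {
    "teh": "the", "mroe": "more",              # frequent misspellings
    "imho": "in my humble opinion",            # abbreviations
    "tyvm": "thank you very much",
    "cant": "can't", "theyll": "they'll",      # missing punctuation
    "cuz": "because", "usu": "usually",        # spelling variants / slang
}

def normalize_ugc(text):
    """Replace known MT pain points before translation (whole words only,
    case-insensitive, so 'cantaloupe' is left alone)."""
    pattern = re.compile(r"\b(" + "|".join(SUBS) + r")\b", re.IGNORECASE)
    return pattern.sub(lambda m: SUBS[m.group(0).lower()], text)

print(normalize_ugc("teh docs cant help cuz tyvm"))
# → "the docs can't help because thank you very much"
```

The `\b` word boundaries matter: without them, substrings inside longer words would be rewritten and the source would be corrupted rather than cleaned.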

Page 13

Number checking

• Verify that numeric MT output is localized correctly
  – Currency – “$1B” vs “1 млрд. $”
  – Dates – “2/28/2014” vs “28/2/2014”
  – Time – “2pm” vs “14h00”
  – Separator & radix – “1,234.5” vs “1 234,5”
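The separator-and-radix case can be sketched as a simple check (an assumption about how such a checker might work, not the actual tool; real locale rules come from CLDR data and cover currency, date, and time formats as well):

```python
import re

def check_decimal_style(segment, decimal_sep=",", group_sep=" "):
    """Rough check that numbers in MT output use the target locale's
    separators (e.g. '1 234,5' for many European locales, not '1,234.5').
    Returns the offending numbers that need post-edit attention."""
    numbers = re.findall(r"\d[\d.,\s]*\d", segment)
    return [n for n in numbers
            if ("." in n and decimal_sep != ".")
            or ("," in n and decimal_sep != "," and group_sep != ",")]

print(check_decimal_style("Prijs: 1,234.5 EUR"))  # flags ['1,234.5']
print(check_decimal_style("Prijs: 1 234,5 EUR"))  # [] — correctly localized
```

Flagging (rather than auto-fixing) keeps the decision with the post-editor, since `1,234` could legitimately mean either a thousand-something or one-and-a-fraction depending on locale.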

Page 14

StyleScorer revisited

• MT output is compared to client’s historical (in-domain) PE data
  – Treat each target segment as a document
  – Lower scores indicate segments likely to require greater PE effort

Page 15

A data-driven approach

• Analytics at each step

  Training:       Perplexity Evaluator; Candidate Scorer; StyleScorer;
                  Source Content Profiler (joint project w/CNGL)
  MT Production:  UGC Normalization; StyleScorer; Number checking
  Post-Editing:   WeScore; StyleScorer

Page 16

WeScore

• Dashboard for viewing MT metrics
  – Tokenizes input from a variety of formats & runs several scoring algorithms in parallel
  – Exports detailed analysis to spreadsheet for sentence-by-sentence review
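The core pattern — run every metric over every segment pair and export rows for review — can be sketched as below. The two toy metrics are stand-ins; real WeScore would run established metrics such as BLEU and TER, and the function names here are assumptions.

```python
import csv
import io

# Hypothetical per-segment metrics standing in for real MT metrics.
def length_ratio(mt, ref):
    """MT length / reference length; far from 1.0 hints at trouble."""
    return len(mt.split()) / max(1, len(ref.split()))

def exact_match(mt, ref):
    """1.0 if the post-editor changed nothing."""
    return 1.0 if mt == ref else 0.0

METRICS = {"length_ratio": length_ratio, "exact_match": exact_match}

def score_segments(pairs):
    """Score each (MT, reference) pair with every metric and export a
    spreadsheet-style CSV for sentence-by-sentence review."""
    out = io.StringIO()
    writer = csv.writer(out)
    writer.writerow(["mt", "reference"] + list(METRICS))
    for mt, ref in pairs:
        writer.writerow([mt, ref] + [round(f(mt, ref), 3) for f in METRICS.values()])
    return out.getvalue()

report = score_segments([("the cat sat", "the cat sat"),
                         ("cat sit", "the cat sat")])
print(report)
```

Exporting one row per segment, with one column per metric, is what makes the later sentence-by-sentence review practical in an ordinary spreadsheet.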

Page 17

WeScore

Page 18

WeScore

Page 19

StyleScorer III

• PE output is compared to client’s historical (in-domain) data
  – Treat each PE segment as a document
  – Lower score indicates possible deviation from established style

Page 20

Feedback loop

• Data collected and lessons learned
  – Update client-specific data for future engine training
  – Mine data for generalizable patterns in problem areas
  – Work with post-editors to understand how to make a better system & how to improve PE experience and throughput

Page 21

Q&A

Thank you!