Classification-Based Contextual Correction of Mistranslations: A Machine Learning Approach

Human Language Technologies (HLT) Workshop 2006, Rabat, Morocco

William H. Hsu
Joint work with: Waleed Al-Jandal, Martin S. R. Paradesi, Tejaswi Pydimarri, Chris Meyer
Thursday, 01 June 2006

Laboratory for Knowledge Discovery in Databases
Computing & Information Sciences, Kansas State University
http://www.kddresearch.org/KSU/CIS/HLT-Specialized-20060601.ppt


A Technical Survey of Statistical MT: Phrase-Based Methods and Metrics

Outline

Background: Statistical Approaches to MT

State of the Field: Metrics

Open Problems

New Approaches, Applications and Software Tools

Current and Future Research Prospectus

Machine Translation: Generic System Architecture

• Input: source-language sentence s
• Output: target-language sentence t
• Language model P(t): estimated from target-language data with a language modeling toolkit
• Translation model P(s|t): estimated from bilingual parallel corpora with a training program (e.g., GIZA)
• Global search (decoder algorithm): $\hat{t} = \arg\max_{t} P(t) \cdot P(s \mid t)$
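The decoder's role in this architecture can be illustrated with a minimal Python sketch. It is not the GIZA or decoder implementation: the candidate sentences and both probability tables (`LANGUAGE_MODEL`, `TRANSLATION_MODEL`) are invented toy values, and a real decoder searches a very large hypothesis space rather than enumerating a short list.

```python
import math

# Hypothetical toy models; all probabilities are illustrative assumptions.
# LANGUAGE_MODEL[t] plays the role of P(t).
# TRANSLATION_MODEL[(s, t)] plays the role of P(s|t).
LANGUAGE_MODEL = {
    "the house is small": 0.020,
    "small the house is": 0.001,
    "the home is little": 0.004,
}
TRANSLATION_MODEL = {
    ("das haus ist klein", "the house is small"): 0.30,
    ("das haus ist klein", "small the house is"): 0.30,
    ("das haus ist klein", "the home is little"): 0.10,
}


def decode(source, candidates):
    """Return argmax_t P(t) * P(s|t) over an explicit candidate list."""

    def log_score(target):
        p_t = LANGUAGE_MODEL.get(target, 0.0)
        p_s_given_t = TRANSLATION_MODEL.get((source, target), 0.0)
        if p_t == 0.0 or p_s_given_t == 0.0:
            return float("-inf")  # unseen events rule the candidate out
        return math.log(p_t) + math.log(p_s_given_t)

    return max(candidates, key=log_score)


best = decode("das haus ist klein", list(LANGUAGE_MODEL))
print(best)
```

The translation model alone cannot separate the first two candidates; the language model breaks the tie in favor of the fluent word order.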

Background [1]: Phrase Translation Model

• Based on noisy channel model
  – Source: foreign sentence f
  – Target: English sentence e
• Bayesian inference: Maximum A Posteriori (MAP)
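The equation for this slide did not survive extraction. Under the noisy channel model it is presumably the standard Bayes-rule decomposition, written here with the slide's symbols f and e (the symbol $\hat{e}$ for the chosen translation is an assumption of this sketch):

```latex
\begin{align*}
\hat{e} &= \arg\max_{e} \; p(e \mid f) \\
        &= \arg\max_{e} \; \frac{p(f \mid e)\, p(e)}{p(f)} \\
        &= \arg\max_{e} \; p(f \mid e)\, p(e)
\end{align*}
```

The last step holds because $p(f)$ does not depend on $e$, so it cannot change which $e$ attains the maximum.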

Background [2]: Modeling Steps

• Preliminaries: segmentation of foreign input
  – Result:
  – Use: lexical analysis tools – string tokenizer, etc.
• Goal: decoding
  – Segmented input:
  – Output:
• Distributions
  – Prediction:
  – Distortion: $a_i$ = start of $f_i$, $b_{i-1}$ = end of $f_{i-1}$
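Most formulas on this slide were lost in extraction. The surviving labels match the phrase-based model of Koehn, Och & Marcu (2003), which this talk cites, so one plausible reconstruction is given below. The notation $\bar{f}_i$, $\bar{e}_i$, $\phi$ and $d$ is taken from that paper, not from the slide, and should be read as an assumption.

```latex
\begin{align*}
% Segmentation of the foreign input into I phrases
f &\;\rightarrow\; \bar{f}_1^I = \bar{f}_1, \ldots, \bar{f}_I \\
% Decoding: each foreign phrase is translated into an English phrase
\bar{f}_i &\;\rightarrow\; \bar{e}_i, \qquad i = 1, \ldots, I \\
% Distributions: phrase translation and distortion
&\phi(\bar{f}_i \mid \bar{e}_i), \qquad d(a_i - b_{i-1}) \\
% Resulting translation model
p(\bar{f}_1^I \mid \bar{e}_1^I) &= \prod_{i=1}^{I} \phi(\bar{f}_i \mid \bar{e}_i)\, d(a_i - b_{i-1})
\end{align*}
```

Here $a_i$ is the start position of the foreign phrase translated into the $i$-th English phrase and $b_{i-1}$ is the end position of the foreign phrase translated into the $(i-1)$-th English phrase, in line with the distortion definition on the slide.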

Background [3]: Probabilistic Formulation

• Length normalization factor:
• Language Model ($p_{LM}$): Trigram [Seymour and Rosenfeld, 1997]
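As a rough illustration of how these pieces combine, the sketch below scores one segmented hypothesis as log p(f|e) + log p_LM(e) + length(e) · log ω. Several parts are assumptions not found on the slide: the objective follows Koehn, Och & Marcu (2003), the distortion form `alpha ** abs(start - prev_end - 1)` and the values of `omega` and `alpha` are illustrative, and the floor probability for unseen trigrams stands in for real language-model smoothing.

```python
import math


def trigram_logprob(words, trigram_probs, floor=1e-4):
    """Log p_LM(e) under a trigram model; unseen trigrams get a floor probability."""
    padded = ["<s>", "<s>"] + list(words) + ["</s>"]
    total = 0.0
    for i in range(2, len(padded)):
        trigram = tuple(padded[i - 2 : i + 1])
        total += math.log(trigram_probs.get(trigram, floor))
    return total


def hypothesis_logscore(phrase_pairs, trigram_probs, omega=1.1, alpha=0.5):
    """Log of p(f|e) * p_LM(e) * omega**length(e) for one segmented hypothesis.

    phrase_pairs lists (english_phrase, phi, start, end) in English order, where
    phi stands for the phrase translation probability and start/end are the
    1-based foreign positions a_i and b_i covered by the phrase.
    """
    log_tm = 0.0
    prev_end = 0  # b_0 = 0
    english = []
    for phrase, phi, start, end in phrase_pairs:
        distortion = alpha ** abs(start - prev_end - 1)  # assumed form of d(a_i - b_{i-1})
        log_tm += math.log(phi) + math.log(distortion)
        prev_end = end
        english.extend(phrase.split())
    log_lm = trigram_logprob(english, trigram_probs)
    log_length = len(english) * math.log(omega)
    return log_tm + log_lm + log_length


# Hypothetical phrase pairs for a four-word foreign sentence.
monotone = hypothesis_logscore(
    [("the house", 0.5, 1, 2), ("is small", 0.4, 3, 4)], trigram_probs={}
)
reordered = hypothesis_logscore(
    [("is small", 0.4, 3, 4), ("the house", 0.5, 1, 2)], trigram_probs={}
)
print(monotone, reordered)
```

With an empty trigram table both hypotheses receive the same language-model score, so the difference between them comes only from the distortion penalty paid by the reordered segmentation.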

Methods for Learning in MT: Survey

Transformation-Based Learning (TBL)

Example-Based Machine Translation (EBMT)

Symbolic AI: Frames, Conceptual Grammars, Analogy, CBR

Statistical

0. classical / naïve (cf. Weaver’s correspondence with Wiener)

1. phrase alignments from word-aligned model [Och & Ney, 2000]

2. linguistically motivated models [Yamada & Knight, 2001]

3. joint phrase model [Marcu & Wong, 2002]

4. generative phrase alignment [Koehn, Och & Marcu, 2003]

5. hierarchical models [Chiang, 2005; Taskar, 2005]

6. new approaches


Bilingual Evaluation Understudy (BLEU) [1]: Papineni et al. (ACL, 2002)

• N-gram precision (score is between 0 and 1)
  – What percentage of machine n-grams can be found in the reference translation?
  – n-gram: sequence of n units (words)
  – Not allowed to use the same portion of the reference translation twice (can’t cheat by repetition)
• Brevity penalty
  – Can’t just type out the single word “the”
• Score: $\mathrm{BLEU} = \mathrm{BP} \cdot \exp\left(\sum_{n=1}^{N} w_n \log p_n\right)$, with $\mathrm{BP} = 1$ if $c > r$ and $\mathrm{BP} = e^{1 - r/c}$ otherwise
  – $p_n$: n-gram precision; $w_n$: positive weights; $r$: words in reference; $c$: words in machine output
• Hard to “game” the system (i.e., change machine output so that BLEU goes up, but quality doesn’t)

Reference (human) translation: The U.S. island of Guam is maintaining a high state of alert after the Guam airport and its offices both received an e-mail from someone calling himself the Saudi Arabian Osama bin Laden and threatening a biological/chemical attack against public places such as the airport .

Machine translation: The American [?] international airport and its the office all receives one calls self the sand Arab rich business [?] and so on electronic mail , which sends out ; The threat will be able after public place and so on the airport to start the biochemistry attack , [?] highly alerts after the maintenance.

Adapted from Knight (2003)
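The two ingredients named on this slide, clipped n-gram precision and the brevity penalty, fit in a short Python sketch. It is a simplified sentence-level version with one reference, uniform weights w_n = 1/N and no smoothing, whereas Papineni et al. aggregate counts over a whole test corpus. The function names are illustrative, and the example sentence is the reference translation from the "Metrics in Action" slide of this talk.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) in a token list."""
    return [tuple(tokens[i : i + n]) for i in range(len(tokens) - n + 1)]


def modified_precision(candidate, reference, n):
    """Clipped n-gram precision p_n: a reference n-gram cannot be matched twice."""
    cand_counts = Counter(ngrams(candidate, n))
    ref_counts = Counter(ngrams(reference, n))
    total = sum(cand_counts.values())
    if total == 0:
        return 0.0
    clipped = sum(min(count, ref_counts[gram]) for gram, count in cand_counts.items())
    return clipped / total


def bleu(candidate, reference, max_n=4):
    """Sentence-level BLEU with uniform weights and no smoothing."""
    c, r = len(candidate), len(reference)
    if c == 0:
        return 0.0
    precisions = [modified_precision(candidate, reference, n) for n in range(1, max_n + 1)]
    if min(precisions) == 0.0:
        return 0.0  # log(0) is undefined; the unsmoothed score is 0 here
    brevity_penalty = 1.0 if c > r else math.exp(1.0 - r / c)
    return brevity_penalty * math.exp(sum(math.log(p) / max_n for p in precisions))


reference = "the gunman was shot to death by the police .".split()
perfect = bleu(reference, reference)
repeated = modified_precision("the the the the".split(), reference, 1)
too_short = bleu("the".split(), reference, max_n=1)
print(perfect, repeated, too_short)
```

Repeating "the" four times earns a unigram precision of only 0.5, because the reference contains "the" twice. Typing the single word "the" gets perfect unigram precision but is cut down to exp(-9) by the brevity penalty.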

Bilingual Evaluation Understudy (BLEU) [2]: Multiple Reference Translations

Reference translation 1: The U.S. island of Guam is maintaining a high state of alert after the Guam airport and its offices both received an e-mail from someone calling himself the Saudi Arabian Osama bin Laden and threatening a biological/chemical attack against public places such as the airport .

Reference translation 2: Guam International Airport and its offices are maintaining a high state of alert after receiving an e-mail that was from a person claiming to be the wealthy Saudi Arabian businessman Bin Laden and that threatened to launch a biological and chemical attack on the airport and other public places .

Reference translation 3: The US International Airport of Guam and its office has received an email from a self-claimed Arabian millionaire named Laden , which threatens to launch a biochemical attack on such public places as airport . Guam authority has been on alert .

Reference translation 4: US Guam International Airport and its office received an email from Mr. Bin Laden and other rich businessman from Saudi Arabia . They said there would be biochemistry air raid to Guam Airport and other public places . Guam needs to be in high precaution about this matter .

Machine translation: The American [?] international airport and its the office all receives one calls self the sand Arab rich business [?] and so on electronic mail , which sends out ; The threat will be able after public place and so on the airport to start the biochemistry attack , [?] highly alerts after the maintenance.

© 2003 Knight, K.
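With several references, a candidate n-gram is usually credited if it appears in any reference, with its count clipped to the largest count seen in a single reference. The sketch below illustrates that rule on short, lowercased fragments of reference translations 2 and 4 and of the machine translation. Cutting the sentences down to fragments is a simplification made here for readability, and picking the reference length closest to the candidate length is one common convention for the brevity penalty, not something stated on the slide.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) in a token list."""
    return [tuple(tokens[i : i + n]) for i in range(len(tokens) - n + 1)]


def multi_reference_precision(candidate, references, n):
    """Clipped n-gram precision against several references.

    Each candidate n-gram count is clipped to the largest count of that
    n-gram observed in any single reference.
    """
    cand_counts = Counter(ngrams(candidate, n))
    max_ref_counts = Counter()
    for reference in references:
        for gram, count in Counter(ngrams(reference, n)).items():
            max_ref_counts[gram] = max(max_ref_counts[gram], count)
    total = sum(cand_counts.values())
    if total == 0:
        return 0.0
    clipped = sum(min(count, max_ref_counts[gram]) for gram, count in cand_counts.items())
    return clipped / total


def closest_reference_length(candidate, references):
    """Reference length closest to the candidate length (ties go to the shorter)."""
    c = len(candidate)
    return min((len(ref) for ref in references), key=lambda r: (abs(r - c), r))


# Lowercased fragments of the slide's sentences (shortened for illustration).
candidate = "international airport and its the office".split()
ref2 = "guam international airport and its offices".split()
ref4 = "us guam international airport and its office".split()

p1_single = multi_reference_precision(candidate, [ref2], 1)
p1_multi = multi_reference_precision(candidate, [ref2, ref4], 1)
p2_multi = multi_reference_precision(candidate, [ref2, ref4], 2)
r = closest_reference_length(candidate, [ref2, ref4])
print(p1_single, p1_multi, p2_multi, r)
```

Adding the second reference raises unigram precision from 4/6 to 5/6, because "office" appears only in that reference. The stray "the" stays unmatched in both cases.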

Bilingual Evaluation Understudy (BLEU) [3]: Tracking Human Judgment

[Figure: scatter plot of NIST Score (variant of BLEU) against Human Judgments, both axes running from -2.5 to 2.5, with series Adequacy and Fluency and linear fits Linear(Adequacy) and Linear(Fluency); the fits are labeled R² = 88.0% and R² = 90.2%.]

Courtesy G. Doddington (NIST)

Bilingual Evaluation Understudy (BLEU) [4]: Metrics in Action

Foreign original: 枪手被警方击毙。
Reference translation: the gunman was shot to death by the police .

Machine translations:
#1 the gunman was police kill .
#2 wounded police jaya of
#3 the gunman was shot dead by the police .
#4 the gunman arrested by police kill .
#5 the gunmen were killed .
#6 the gunman was shot to death by the police .
#7 gunmen were killed by police ?SUB>0 ?SUB>0
#8 al by the police .
#9 the ringer is killed by the police .
#10 police killed the gunman .

green = 4-gram match (good!)
red = word not matched (bad!)

© 2003 Kevin Knight
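The color coding of the original slide is lost in plain text, but the same information can be recomputed. The sketch below is an illustrative assumption about how such a highlight could be produced, not the tool used for the slide. For each of the ten machine translations it counts the 4-grams that also occur in the reference and lists the words that do not occur in the reference at all.

```python
def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) in a token list."""
    return [tuple(tokens[i : i + n]) for i in range(len(tokens) - n + 1)]


reference = "the gunman was shot to death by the police .".split()
candidates = [
    "the gunman was police kill .",
    "wounded police jaya of",
    "the gunman was shot dead by the police .",
    "the gunman arrested by police kill .",
    "the gunmen were killed .",
    "the gunman was shot to death by the police .",
    "gunmen were killed by police ?SUB>0 ?SUB>0",
    "al by the police .",
    "the ringer is killed by the police .",
    "police killed the gunman .",
]

ref_4grams = set(ngrams(reference, 4))
ref_words = set(reference)


def analyze(sentence):
    """Return (number of matching 4-grams, words absent from the reference)."""
    tokens = sentence.split()
    matches = sum(1 for gram in ngrams(tokens, 4) if gram in ref_4grams)
    unmatched = [word for word in tokens if word not in ref_words]
    return matches, unmatched


report = {i: analyze(c) for i, c in enumerate(candidates, start=1)}
for i, (matches, unmatched) in report.items():
    print(f"#{i}: {matches} matching 4-grams, unmatched words: {unmatched}")
```

Candidate #10 is an adequate translation, yet it shares no 4-gram with the single reference, which is one reason BLEU is normally computed against several references and over a whole test set.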

New Technologies and Transfer Plan


Context-Driven NLP: MT Applications

• Classical Natural Language Processing (NLP)
  – (Noun and verb) phrase extraction
  – Detection of named entity phrases
  – Word sense disambiguation
  – Spelling correction
• Interlingual Challenges
  – Making use of mixed resources: bilingual & monolingual
  – Semi-supervised learning
• Applications
  – Mixed-mode (semi-interactive) MT – assistive technology
  – Correcting mistranslations

Translating Named Entity Phrases [1]: Arabic-English Application

The frequency of named-entity phrases in news text reflects the significance of the events they are associated with, so the same news is likely to be reported in many languages.

For example: the Arabic newspaper article is about negotiations between the US and North Korean authorities regarding the search for the remains of US soldiers who died during the Korean War.

[Knight & Al-Onaizan, 2001]

Translating Named Entity Phrases [2]: Two-Phase Approach

• Generate ranked list of translation candidates
  – Bilingual resources: parallel corpus
  – Monolingual resources
• Re-score list of candidates using different monolingual clues

[Knight & Al-Onaizan, 2001]
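The generate-then-rescore idea can be sketched as two small functions. Everything concrete here is hypothetical: the placeholder source key `NE_PHRASE`, the candidate spellings, their scores, the monolingual counts, and the linear interpolation with `weight`. The cited work uses its own models and clues; this sketch only shows the shape of a two-phase pipeline.

```python
def generate_candidates(phrase, translation_table, top_k=5):
    """Phase 1: ranked list of (candidate, score) from a bilingual resource."""
    scored = translation_table.get(phrase, {})
    return sorted(scored.items(), key=lambda item: item[1], reverse=True)[:top_k]


def rescore(candidates, monolingual_counts, weight=0.5):
    """Phase 2: re-rank candidates by mixing in one monolingual clue.

    The clue used here is the relative frequency of each candidate in
    target-language text (an assumed, simplified stand-in for real clues).
    """
    total = sum(monolingual_counts.get(c, 0) for c, _ in candidates) or 1
    rescored = [
        (c, (1 - weight) * score + weight * monolingual_counts.get(c, 0) / total)
        for c, score in candidates
    ]
    return sorted(rescored, key=lambda item: item[1], reverse=True)


# Hypothetical toy data, for illustration only.
translation_table = {
    "NE_PHRASE": {"Bil Klinton": 0.5, "Bill Clinton": 0.4, "Pill Clinton": 0.1}
}
monolingual_counts = {"Bill Clinton": 900, "Bil Klinton": 1}

phase1 = generate_candidates("NE_PHRASE", translation_table)
phase2 = rescore(phase1, monolingual_counts)
print(phase1[0][0], "->", phase2[0][0])
```

In this toy run the bilingual score alone prefers an implausible spelling, and the monolingual frequency clue moves the attested name to the top of the list.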

Correcting Faulty Translations

• Human-Assistive Technology
• Semi-Supervised: Two Training Corpora
  – Labeled: “bad translations” and “near misses”
  – Unlabeled: candidate translations
• Interactive Aspect
  – “Which of these translations is right?”
  – “Why is this candidate incorrect?”
• Application: Boosting Accuracy of SMT

Boosting the Accuracy of SMT

• Parsing [Koehn et al., 2003]
  – Pro: Found to slow growth of translation tables
  – Con: Limited effect on BLEU
• Context-Specificity
  – Supported by computational linguistic theory
  – Some positive results in NLP prediction tasks
  – [Elman, 1994] Very effective in sequence learning
  – [Barash & Friedman, 2001] Important for Relational and First-Order Representations
• New Work: Semi-Supervised Approaches


Popular SMT Tools

• Translation model generator: GIZA++
• Search decoders: PHARAOH, ISI ReWrite Decoder
• Language model generators: SRILM, CMU-Cambridge Statistical Language Modeling Toolkit
• EGYPT: a toolkit for SMT that consists of GIZA/GIZA++ and word alignment tools
• Evaluation packages: MTEVAL, GMT
• Metrics: BLEU, NIST, n-grams, WER, PER and SSER
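Two of the listed metrics are simple enough to sketch directly. Word error rate (WER) is the word-level edit distance divided by the reference length; position-independent error rate (PER) ignores word order and compares bags of words. The formulations below are common textbook versions chosen for illustration and may differ in detail from what a given evaluation package computes. The example sentences come from the "Metrics in Action" slide of this talk.

```python
from collections import Counter


def edit_distance(hyp, ref):
    """Word-level Levenshtein distance (substitutions, insertions, deletions)."""
    previous = list(range(len(ref) + 1))
    for i, h in enumerate(hyp, start=1):
        current = [i]
        for j, r in enumerate(ref, start=1):
            current.append(min(
                previous[j] + 1,             # delete the hypothesis word
                current[j - 1] + 1,          # insert the reference word
                previous[j - 1] + (h != r),  # substitute, or match at no cost
            ))
        previous = current
    return previous[-1]


def wer(hyp, ref):
    """Word error rate: edit distance normalized by reference length."""
    return edit_distance(hyp, ref) / len(ref)


def per(hyp, ref):
    """Position-independent error rate based on bag-of-words overlap."""
    matches = sum((Counter(hyp) & Counter(ref)).values())
    return (max(len(hyp), len(ref)) - matches) / len(ref)


ref = "the gunman was shot to death by the police .".split()
hyp3 = "the gunman was shot dead by the police .".split()
hyp10 = "police killed the gunman .".split()

for name, hyp in [("#3", hyp3), ("#10", hyp10)]:
    print(name, "WER =", wer(hyp, ref), "PER =", per(hyp, ref))
```

PER never exceeds WER for the same pair, since dropping the word-order constraint can only remove errors. The gap between the two gives a rough indication of how much of the error is due to reordering.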

Software Tools for Graphical Models: BNJ v3

[Figure: ALARM Network]

© 2005 KSU Bayesian Network tools in Java (BNJ) Development Team


Current Work

• Development: End-to-End SMT System for NIST 2006 Evaluation
  – Arabic-English
  – Chinese-English
• Assemblage of Parallel Corpora
• Software Library Development: SMT Modules
  – Aligners
  – Parsers
  – Phrase-Based Learning
  – Transformation-Based Learning (TBL)
• Development of Graphical Models Toolkit
  – BNJ v4 under development: http://bnj.sourceforge.net
  – Integration with KSU SMT library
• Applications: Relational Link Mining in Social Networks

© 2005 Walker Blogs

Future Research Directions

[Figure: MT Strategies (1954-2006). Slide courtesy of Laurie Gerber.]

• Axis: Knowledge Representation Strategy, from Shallow/Simple to Deep/Complex
• Axis: Knowledge Acquisition Strategy, from All manual to Fully automated
• Labels in the chart: Word-based only; Phrase tables; Syntactic Constituent Structure; Semantic analysis; Interlingua; Hand-built by experts; Hand-built by non-experts; Electronic dictionaries; Learn from annotated data; Learn from un-annotated data; Original direct approach; Typical transfer system; Classic interlingual system; Example-based MT; Original statistical MT
• New Research: Context-Specificity

Questions and Discussion