Computing & Information Sciences, Kansas State University
Human Language Technologies Workshop, Rabat, Morocco
Human Language Technologies (HLT) Workshop 2006
Classification-Based Contextual Correction of Mistranslations:
A Machine Learning Approach
William H. Hsu
Joint work with: Waleed Al-Jandal, Martin S. R. Paradesi, Tejaswi Pydimarri, Chris Meyer
Thursday, 01 June 2006
Laboratory for Knowledge Discovery in Databases, Kansas State University
http://www.kddresearch.org/KSU/CIS/HLT-Specialized-20060601.ppt
A Technical Survey of Statistical MT: Phrase-Based Methods and Metrics
Outline
Background: Statistical Approaches to MT
State of the Field: Metrics
Open Problems
New Approaches, Applications and Software Tools
Current and Future Research Prospectus
Input: Source Language s
Global Search Decoder Algorithm: $\hat{t} = \arg\max_t P(t) \cdot P(s \mid t)$
• Language Model P(t), built by a Language Modeling toolkit from Target Language text
• Translation Model P(s|t), built by a Training Program (e.g., GIZA) from Bilingual Parallel Corpora
Output: Target Language t
Machine Translation: Generic System Architecture
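The decoder's search rule can be illustrated on a tiny candidate set. This is only a sketch: the sentences and probabilities are hypothetical values chosen for illustration, and a real decoder searches a far larger hypothesis space instead of enumerating it.

```python
import math

# Hypothetical toy models; the numbers are illustrative, not estimated from data.
LANGUAGE_MODEL = {            # P(t): how fluent is the target sentence?
    "the house is small": 0.020,
    "small the house is": 0.001,
    "the home is little": 0.005,
}
TRANSLATION_MODEL = {         # P(s | t) for one fixed source sentence s
    "the house is small": 0.30,
    "small the house is": 0.35,
    "the home is little": 0.20,
}

def decode(candidates):
    """Return argmax_t P(t) * P(s|t), scored in log space to avoid underflow."""
    def score(t):
        return math.log(LANGUAGE_MODEL[t]) + math.log(TRANSLATION_MODEL[t])
    return max(candidates, key=score)

best = decode(list(LANGUAGE_MODEL))
print(best)
```

The word-salad candidate has the highest translation-model score here, but the language model outweighs it, which is the division of labor the architecture relies on.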
Based on noisy channel model
Source: foreign sentence f
Target: English sentence e
Bayesian inference: Maximum A Posteriori (MAP)
Background [1]: Phrase Translation Model
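The MAP step can be written out with Bayes' rule. The equation is not in the extracted slide text, so the following is the standard noisy-channel form, assuming $p(e)$ denotes the prior over English sentences and $p(f \mid e)$ the channel (translation) model:

```latex
\hat{e} \;=\; \arg\max_{e}\, p(e \mid f)
        \;=\; \arg\max_{e}\, \frac{p(f \mid e)\, p(e)}{p(f)}
        \;=\; \arg\max_{e}\, p(f \mid e)\, p(e)
```

The denominator $p(f)$ is dropped because it does not depend on $e$.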
Preliminaries: segmentation of foreign input
Result:
Use: lexical analysis tools – string tokenizer, etc.
Goal: decoding
Segmented input:
Output:
Distributions
Prediction:
Distortion: $a_i$ = start of $\bar{f}_i$, $b_{i-1}$ = end of $\bar{f}_{i-1}$
Background [2]: Modeling Steps
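A minimal sketch of how the two distributions combine, assuming the usual phrase-based form in which each phrase pair contributes a translation probability φ and a distortion term d(a_i − b_{i−1}) = α^|a_i − b_{i−1} − 1|. The phrase table, the value of `ALPHA`, and the example segmentation are hypothetical.

```python
import math

ALPHA = 0.5  # hypothetical distortion parameter, 0 < ALPHA <= 1

# Hypothetical phrase table: phi(foreign phrase | English phrase)
PHI = {
    ("das haus", "the house"): 0.8,
    ("ist klein", "is small"): 0.6,
}

def distortion(a_i, b_prev, alpha=ALPHA):
    """d(a_i - b_{i-1}) = alpha ** |a_i - b_{i-1} - 1|; monotone order costs nothing."""
    return alpha ** abs(a_i - b_prev - 1)

def translation_log_prob(segments):
    """segments: (f_phrase, e_phrase, start, end) tuples in English output order;
    start/end are 1-based word positions of the foreign phrase in the input."""
    log_p, b_prev = 0.0, 0
    for f_phrase, e_phrase, start, end in segments:
        log_p += math.log(PHI[(f_phrase, e_phrase)])
        log_p += math.log(distortion(start, b_prev))
        b_prev = end
    return log_p

monotone = [("das haus", "the house", 1, 2), ("ist klein", "is small", 3, 4)]
swapped = [("ist klein", "is small", 3, 4), ("das haus", "the house", 1, 2)]
print(translation_log_prob(monotone), translation_log_prob(swapped))
```

Reordering the phrases leaves the φ terms unchanged and only lowers the distortion terms, so the swapped segmentation scores below the monotone one.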
Length normalization factor:
Language Model ($p_{LM}$): Trigram
[Seymour and Rosenfeld, 1997]
Background [3]: Probabilistic Formulation
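The formulas for this slide did not survive extraction. One common way to combine the pieces it names is sketched below, assuming a length factor $\omega$ applied per output word and a trigram approximation for $p_{LM}$; the symbols $\omega$ and $e_j$ are assumptions, not taken from the slide text.

```latex
\hat{e} \;=\; \arg\max_{e}\; p(f \mid e)\; p_{LM}(e)\; \omega^{\mathrm{length}(e)},
\qquad
p_{LM}(e) \;\approx\; \prod_{j=1}^{\mathrm{length}(e)} p\!\left(e_j \mid e_{j-2},\, e_{j-1}\right)
```

With $\omega > 1$ the factor rewards longer outputs, offsetting the language model's preference for short sentences.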
Methods for Learning in MT: Survey
Transformation-Based Learning (TBL)
Example-Based Machine Translation (EBMT)
Symbolic AI: Frames, Conceptual Grammars, Analogy, CBR
Statistical
0. classical / naïve (cf. Weaver’s correspondence with Wiener)
1. phrase alignments from word-aligned model [Och & Ney, 2000]
2. linguistically motivated models [Yamada & Knight, 2001]
3. joint phrase model [Marcu & Wong, 2002]
4. generative phrase alignment [Koehn, Och & Marcu, 2003]
5. hierarchical models [Chiang, 2005; Taskar, 2005]
6. new approaches
Outline
Background: Statistical Approaches to MT
State of the Field: Metrics
Open Problems
New Approaches, Applications and Software Tools
Current and Future Research Prospectus
• N-gram precision (score is between 0 and 1)
– What percentage of machine n-grams can be found in the reference translation?
– n-gram: sequence of n units (words)
– Not allowed to use the same portion of the reference translation twice (can’t cheat by repetition)
• Brevity penalty
– Can’t just type out the single word “the”
• Notation: $p_n$ = n-gram precision, $w_n$ = positive weights, $r$ = words in reference, $c$ = words in machine output
• Hard to “game” the system (i.e., change machine output so that BLEU goes up, but quality doesn’t)
Reference (human) translation: The U.S. island of Guam is maintaining a high state of alert after the Guam airport and its offices both received an e-mail from someone calling himself the Saudi Arabian Osama bin Laden and threatening a biological/chemical attack against public places such as the airport .
Machine translation: The American [?] international airport and its the office all receives one calls self the sand Arab rich business [?] and so on electronic mail , which sends out ; The threat will be able after public place and so on the airport to start the biochemistry attack , [?] highly alerts after the maintenance.
Adapted from Knight (2003)
Bilingual Evaluation Understudy (BLEU) [1]: Papineni et al. (ACL, 2002)
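A minimal single-reference sketch of the score described on this slide, assuming the usual form BLEU = BP · exp(Σ w_n log p_n) with uniform weights w_n = 1/N and brevity penalty BP = 1 if c > r, else exp(1 − r/c). The function names are illustrative, and this is not the reference implementation used in evaluations.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def modified_precision(candidate, reference, n):
    """Clipped n-gram precision: each reference n-gram can be matched only once."""
    cand = Counter(ngrams(candidate, n))
    ref = Counter(ngrams(reference, n))
    total = sum(cand.values())
    if total == 0:
        return 0.0
    clipped = sum(min(count, ref[gram]) for gram, count in cand.items())
    return clipped / total

def bleu(candidate, reference, max_n=4):
    c, r = len(candidate), len(reference)
    if c == 0:
        return 0.0
    precisions = [modified_precision(candidate, reference, n)
                  for n in range(1, max_n + 1)]
    if min(precisions) == 0.0:
        return 0.0  # geometric mean is zero if any p_n is zero
    bp = 1.0 if c > r else math.exp(1.0 - r / c)  # brevity penalty
    w = 1.0 / max_n                               # uniform weights w_n
    return bp * math.exp(sum(w * math.log(p) for p in precisions))
```

Clipping is what blocks cheating by repetition: `modified_precision("the the the the".split(), "the cat".split(), 1)` is 0.25, not 1.0.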
Bilingual Evaluation Understudy (BLEU) [2]: Multiple Reference Translations
Reference translation 1: The U.S. island of Guam is maintaining a high state of alert after the Guam airport and its offices both received an e-mail from someone calling himself the Saudi Arabian Osama bin Laden and threatening a biological/chemical attack against public places such as the airport .
Reference translation 2: Guam International Airport and its offices are maintaining a high state of alert after receiving an e-mail that was from a person claiming to be the wealthy Saudi Arabian businessman Bin Laden and that threatened to launch a biological and chemical attack on the airport and other public places .
Reference translation 3: The US International Airport of Guam and its office has received an email from a self-claimed Arabian millionaire named Laden , which threatens to launch a biochemical attack on such public places as airport . Guam authority has been on alert .
Reference translation 4: US Guam International Airport and its office received an email from Mr. Bin Laden and other rich businessman from Saudi Arabia . They said there would be biochemistry air raid to Guam Airport and other public places . Guam needs to be in high precaution about this matter .
Machine translation: The American [?] international airport and its the office all receives one calls self the sand Arab rich business [?] and so on electronic mail , which sends out ; The threat will be able after public place and so on the airport to start the biochemistry attack , [?] highly alerts after the maintenance.
© 2003 Knight, K.
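With several references, an n-gram is typically clipped at the largest count it has in any single reference, and the brevity penalty uses the reference length closest to the candidate length. A sketch under those assumptions; the helper names and the second example reference are hypothetical.

```python
from collections import Counter

def ngram_counts(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def multi_ref_precision(candidate, references, n):
    """Clip each candidate n-gram at its maximum count in any one reference."""
    cand = ngram_counts(candidate, n)
    total = sum(cand.values())
    if total == 0:
        return 0.0
    max_ref = Counter()
    for ref in references:
        for gram, count in ngram_counts(ref, n).items():
            max_ref[gram] = max(max_ref[gram], count)
    clipped = sum(min(count, max_ref[gram]) for gram, count in cand.items())
    return clipped / total

def closest_ref_length(candidate, references):
    """Reference length nearest the candidate length (ties go to the shorter)."""
    c = len(candidate)
    return min((len(ref) for ref in references), key=lambda r: (abs(r - c), r))

candidate = "police killed the gunman".split()
ref_a = "the gunman was shot to death by the police".split()
ref_b = "police killed the gunman".split()  # hypothetical second reference
print(multi_ref_precision(candidate, [ref_a], 2),
      multi_ref_precision(candidate, [ref_a, ref_b], 2))
```

Adding a reference can only raise the clipped counts, which is why acceptable paraphrases are penalized less when more references are available.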
Bilingual Evaluation Understudy (BLEU) [3]: Tracking Human Judgment
[Figure: scatter plot of NIST Score (variant of BLEU) against Human Judgments, both axes from -2.5 to 2.5. Series: Adequacy, Fluency, each with a linear fit. Reported fits: R² = 88.0% and R² = 90.2%.]
Courtesy G. Doddington (NIST)
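The R² figures summarize how well a straight line through the metric scores predicts the human judgments. A minimal sketch of that computation for one series; the sample points are hypothetical and are not the data behind the plot.

```python
def r_squared(x, y):
    """Coefficient of determination of the least-squares line y ~ a + b*x.
    For simple linear regression this equals the squared Pearson correlation."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    return (sxy * sxy) / (sxx * syy)

# Hypothetical normalized scores for five systems.
human = [-2.0, -1.0, 0.0, 1.0, 2.0]
metric = [-1.8, -1.1, 0.2, 0.9, 2.1]
print(round(r_squared(human, metric), 3))
```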
Bilingual Evaluation Understudy (BLEU) [4]: Metrics in Action
枪手被警方击毙。 (Foreign Original)
the gunman was shot to death by the police . (Reference Translation)
#1 the gunman was police kill .
#2 wounded police jaya of
#3 the gunman was shot dead by the police .
#4 the gunman arrested by police kill .
#5 the gunmen were killed .
#6 the gunman was shot to death by the police .
#7 gunmen were killed by police ?SUB>0 ?SUB>0
#8 al by the police .
#9 the ringer is killed by the police .
#10 police killed the gunman .
green = 4-gram match (good!)
red = word not matched (bad!)
© 2003 Kevin Knight
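The color coding was lost in extraction, but it can be recomputed. The sketch below labels each hypothesis token as part of a matching 4-gram ("match"), absent from the reference ("unmatched"), or neither ("other"), using hypothesis #3 from the list; the label names are illustrative.

```python
def ngram_set(tokens, n):
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def label_tokens(hypothesis, reference, n=4):
    ref_ngrams = ngram_set(reference, n)
    ref_words = set(reference)
    labels = ["other"] * len(hypothesis)
    # Words that never occur in the reference (red on the slide).
    for i, word in enumerate(hypothesis):
        if word not in ref_words:
            labels[i] = "unmatched"
    # Words covered by at least one n-gram shared with the reference (green).
    for i in range(len(hypothesis) - n + 1):
        if tuple(hypothesis[i:i + n]) in ref_ngrams:
            for j in range(i, i + n):
                labels[j] = "match"
    return labels

reference = "the gunman was shot to death by the police .".split()
hypothesis = "the gunman was shot dead by the police .".split()
labels = label_tokens(hypothesis, reference)
print(list(zip(hypothesis, labels)))
```

Here only "dead" is unmatched; the words on either side of it fall inside the 4-grams "the gunman was shot" and "by the police .".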
New Technologies and Transfer Plan
Outline
Background: Statistical Approaches to MT
State of the Field: Metrics
Open Problems
New Approaches, Applications and Software Tools
Current and Future Research Prospectus
Context-Driven NLP: MT Applications
• Classical Natural Language Processing (NLP)
– (Noun and verb) phrase extraction
– Detection of named entity phrases
– Word sense disambiguation
– Spelling correction
• Interlingual Challenges
– Making use of mixed resources: bilingual & monolingual
– Semi-supervised learning
• Applications
– Mixed-mode (semi-interactive) MT – assistive technology
– Correcting mistranslations
The frequency of named-entity phrases in news text reflects the significance of the events they are associated with, so the same news is likely to be reported in many languages.
For example:
Translating Named Entity Phrases [1]: Arabic-English Application
The Arabic newspaper article is about negotiations between the US and North Korean authorities regarding the search for the remains of US soldiers who died during the Korean war.
[Knight & Al-Onaizan, 2001]
• Generate ranked list of translation candidates
– Bilingual resources: parallel corpus
– Monolingual resources
• Re-score list of candidates using different monolingual clues
Translating Named Entity Phrases [2]: Two-Phase Approach
[Knight & Al-Onaizan, 2001]
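The two phases can be sketched as a generate-then-rescore pipeline. Everything specific here is an assumption made for illustration: the candidate scores, the monolingual counts, the placeholder key `<source phrase>`, and the linear interpolation used for re-scoring, which stands in for whatever clues the cited work combines.

```python
def generate_candidates(phrase, bilingual_scores, top_k=3):
    """Phase 1: rank candidates by a bilingual (e.g. parallel-corpus) score."""
    scored = bilingual_scores.get(phrase, {})
    return sorted(scored.items(), key=lambda kv: kv[1], reverse=True)[:top_k]

def rescore(candidates, monolingual_counts, weight=0.5):
    """Phase 2: interpolate each score with a relative-frequency clue
    taken from monolingual target-language text."""
    total = sum(monolingual_counts.get(c, 0) for c, _ in candidates) or 1
    rescored = [
        (c, (1 - weight) * s + weight * monolingual_counts.get(c, 0) / total)
        for c, s in candidates
    ]
    return sorted(rescored, key=lambda kv: kv[1], reverse=True)

# Hypothetical resources for a single source-language named-entity phrase.
bilingual = {"<source phrase>": {"Northern Korea": 0.45,
                                 "North Korea": 0.40,
                                 "Korea North": 0.15}}
counts = {"North Korea": 900, "Northern Korea": 90, "Korea North": 10}

phase1 = generate_candidates("<source phrase>", bilingual)
phase2 = rescore(phase1, counts)
print(phase1[0][0], "->", phase2[0][0])
```

In this toy run the monolingual evidence promotes the more common English form over the candidate that ranked first on bilingual evidence alone.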
Correcting Faulty Translations
• Human-Assistive Technology
• Semi-Supervised: Two Training Corpora
– Labeled: “bad translations” and “near misses”
– Unlabeled: candidate translations
• Interactive Aspect
– “Which of these translations is right?”
– “Why is this candidate incorrect?”
• Application: Boosting Accuracy of SMT
Boosting the Accuracy of SMT
• Parsing [Koehn et al., 2003]
– Pro: Found to slow growth of translation tables
– Con: Limited effect on BLEU
• Context-Specificity
– Supported by computational linguistic theory
– Some positive results in NLP prediction tasks
– [Elman, 1994] Very effective in sequence learning
– [Barash & Friedman, 2001] Important for Relational and First-Order Representations
• New Work: Semi-Supervised Approaches
Outline
Background: Statistical Approaches to MT
State of the Field: Metrics
Open Problems
New Approaches, Applications and Software Tools
Current and Future Research Prospectus
Popular SMT Tools
Translation Model Generator: GIZA++
Search Decoder: PHARAOH, ISI ReWrite Decoder
Language Model Generator: SRILM, CMU-Cambridge Statistical Language Modeling Toolkit
EGYPT: a toolkit for SMT that consists of GIZA/GIZA++ and word alignment tools
Evaluation packages: MTEVAL, GMT
Metrics: BLEU, NIST, n-grams, WER, PER and SSER
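Two of the listed error rates are simple enough to sketch directly. WER is taken here as word-level edit distance divided by reference length. PER is taken as the same idea with word order ignored, computed as (max(|hyp|, |ref|) − matching words) / |ref|; published definitions of PER vary slightly, so that formula is an assumption.

```python
from collections import Counter

def edit_distance(hyp, ref):
    """Word-level Levenshtein distance (insertions, deletions, substitutions)."""
    prev = list(range(len(ref) + 1))
    for i, h in enumerate(hyp, start=1):
        curr = [i]
        for j, r in enumerate(ref, start=1):
            curr.append(min(prev[j] + 1,              # hyp word has no counterpart
                            curr[j - 1] + 1,          # ref word has no counterpart
                            prev[j - 1] + (h != r)))  # match or substitution
        prev = curr
    return prev[-1]

def wer(hyp, ref):
    """Word error rate: edit distance normalized by reference length."""
    return edit_distance(hyp, ref) / len(ref)

def per(hyp, ref):
    """Position-independent error rate: bag-of-words mismatch over reference length."""
    matches = sum((Counter(hyp) & Counter(ref)).values())
    return (max(len(hyp), len(ref)) - matches) / len(ref)

ref = "the gunman was shot to death by the police .".split()
hyp = "the gunman was shot dead by the police .".split()
print(wer(hyp, ref), per(hyp, ref))
```

Unlike BLEU, both are error rates, so lower is better, and both can exceed 1 when the hypothesis is much longer than the reference.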
Software Tools for Graphical Models: BNJ v3
© 2005 KSU Bayesian Network tools in Java (BNJ) Development Team
ALARM Network
Outline
Background: Statistical Approaches to MT
State of the Field: Metrics
Open Problems
New Approaches, Applications and Software Tools
Current and Future Research Prospectus
Current Work
• Development: End-to-End SMT System for NIST 2006 Evaluation
– Arabic-English
– Chinese-English
– Assemblage of Parallel Corpora
• Software Library Development: SMT Modules
– Aligners
– Parsers
– Phrase-Based Learning
– Transformation-Based Learning (TBL)
• Development of Graphical Models Toolkit
– BNJ v4 under development: http://bnj.sourceforge.net
– Integration with KSU SMT library
• Applications: Relational Link Mining in Social Networks
© 2005 Walker Blogs
[Figure: MT Strategies (1954-2006). One axis: Knowledge Representation Strategy, from Shallow/Simple to Deep/Complex (labels: word-based only, phrase tables, syntactic constituent structure, semantic analysis, interlingua). Other axis: Knowledge Acquisition Strategy, from all manual to fully automated (labels: hand-built by experts, hand-built by non-experts, electronic dictionaries, learn from annotated data, learn from un-annotated data). Plotted approaches: original direct approach, typical transfer system, classic interlingual system, example-based MT, original statistical MT. Highlighted region: New Research: Context-Specificity.]
Slide courtesy of Laurie Gerber
Future Research Directions