Kyoshiro SUGIYAMA, AHC-Lab., NAIST

An Investigation of Machine Translation Evaluation Metrics in Cross-lingual Question Answering

Kyoshiro Sugiyama, Masahiro Mizukami, Graham Neubig, Koichiro Yoshino, Sakriani Sakti, Tomoki Toda, Satoshi Nakamura
NAIST, Japan
Question answering (QA)
One of the techniques for information retrieval
Input: question → Output: answer
(Diagram: the question "Where is the capital of Japan?" is sent to an information source; retrieval returns the result "Tokyo.")
QA using knowledge bases
Convert the question sentence into a query
Advantage: low ambiguity
Drawback: the knowledge base is restricted to one language → cross-lingual QA is necessary
(Diagram: "Where is the capital of Japan?" is converted into a query built from Type.Location and Country.Japan.CapitalCity; the knowledge base returns the response Location.City.Tokyo, giving the answer "Tokyo.")
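The query-then-lookup flow above can be sketched minimally. This is a toy illustration assuming a hypothetical in-memory knowledge base; the predicate names only mimic the Freebase-style fragments shown on the slide, they are not the real schema.

```python
# Toy sketch of KB-based QA: a question is mapped to a logical-form query,
# which is executed against a tiny in-memory knowledge base.
# KB and predicate names are illustrative, not the actual Freebase schema.

KB = {
    ("Country.Japan", "CapitalCity"): "Location.City.Tokyo",
}

def answer(logical_form):
    """Execute a (subject, predicate) query against the toy KB."""
    return KB.get(logical_form)

# Query derived from "Where is the capital of Japan?"
query = ("Country.Japan", "CapitalCity")
print(answer(query))  # → Location.City.Tokyo
```

Because the query side is unambiguous, QA accuracy hinges entirely on mapping the natural-language question to the right logical form.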
Kyoshiro SUGIYAMA , AHC-Lab. , NAIST
Cross-lingual QA (CLQA)
(Diagram: the question sentence and the information source are in different languages. The Japanese question "日本の首都はどこ?" ("Where is the capital of Japan?") must be converted to a query built from Type.Location and Country.Japan.CapitalCity; the knowledge base returns Location.City.Tokyo, giving the answer "東京" ("Tokyo").)
Creating such a mapping directly is high-cost, and the mapping is not reusable in other languages
CLQA using machine translation
Machine translation (MT) can be used to perform CLQA
Easy, low cost, and usable in many languages
QA accuracy depends on MT quality
(Diagram: "日本の首都はどこ?" → machine translation → "Where is the capital of Japan?" → existing QA system → "Tokyo" → machine translation → "東京".)
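The MT-wrapped pipeline above is easy to sketch. The translation and QA functions below are placeholders standing in for a real MT engine and the existing monolingual QA system; the tiny lexicon is purely illustrative.

```python
# Sketch of CLQA via machine translation: translate the question into the
# QA system's language, run the existing QA system, translate the answer back.
# `translate` and `monolingual_qa` are stand-ins for real systems.

def translate(text, src, tgt):
    # Placeholder: a real system would call an MT engine here.
    lexicon = {
        ("ja", "en"): {"日本の首都はどこ?": "Where is the capital of Japan?"},
        ("en", "ja"): {"Tokyo": "東京"},
    }
    return lexicon[(src, tgt)].get(text, text)

def monolingual_qa(question_en):
    # Placeholder for the existing English QA system.
    return "Tokyo" if "capital of Japan" in question_en else None

def cross_lingual_qa(question, lang="ja"):
    q_en = translate(question, lang, "en")
    ans_en = monolingual_qa(q_en)
    if ans_en is None:
        return None
    return translate(ans_en, "en", lang)

print(cross_lingual_qa("日本の首都はどこ?"))  # → 東京
```

Any error the MT step introduces propagates to the QA step, which is exactly why the choice of MT evaluation metric matters here.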
Purpose of our work
To clarify how translation affects QA accuracy:
Which MT metrics are suitable for the CLQA task?
→ Create QA data sets using various translation systems
→ Evaluate translation quality and QA accuracy
What kinds of translation results influence QA accuracy?
→ Case study (manual analysis of the QA results)
QA system
SEMPRE framework [Berant et al., 13]
Three steps of query generation:
Alignment: convert entities in the question sentence into "logical forms"
Bridging: generate predicates compatible with neighboring predicates
Scoring: evaluate candidates using a scoring function
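The three steps above can be caricatured in a few lines. This is a heavily simplified sketch, not SEMPRE's actual implementation: the lexicon, the compatibility table, and the word-overlap scorer are toy stand-ins for SEMPRE's learned alignment, bridging, and feature-based scoring.

```python
# Highly simplified sketch of SEMPRE-style query generation:
# align phrases to logical-form fragments, bridge in compatible predicates,
# and score the candidate queries. All tables here are illustrative toys.

LEXICON = {"japan": "Country.Japan", "capital": "CapitalCity"}
COMPATIBLE = {"Country.Japan": ["CapitalCity", "Population"]}

def align(question):
    """Step 1 (alignment): map question words to logical-form fragments."""
    words = question.lower().rstrip("?").split()
    return [LEXICON[w] for w in words if w in LEXICON]

def bridge(fragments):
    """Step 2 (bridging): pair entities with compatible predicates."""
    return [(f, p) for f in fragments for p in COMPATIBLE.get(f, [])]

def score(candidate, question):
    """Step 3 (scoring): crude word-overlap with the predicate name."""
    _, pred = candidate
    return sum(w in pred.lower() for w in question.lower().split())

def best_query(question):
    cands = bridge(align(question))
    return max(cands, key=lambda c: score(c, question)) if cands else None

print(best_query("Where is the capital of Japan?"))
# → ('Country.Japan', 'CapitalCity')
```

In the real system the scorer is a trained log-linear model; the point here is only the shape of the align → bridge → score pipeline.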
Data set creation
(Diagram: Free917 is split into Training (512 pairs), Dev. (129 pairs), and Test (276 pairs); the original English test questions form the OR set. Manual translation into Japanese produces the JA set, which is then translated into English to produce the HT, GT, YT, Mo, and Tra sets.)
Translation method
Manual Translation (“HT” set): Professional humans
Commercial MT systems: Google Translate ("GT" set), Yahoo! Translate ("YT" set)
Moses ("Mo" set): Phrase-based MT system
Travatar ("Tra" set): Tree-to-string MT system
Experiments
Evaluation of the translation quality of the created data sets (reference: the questions in the OR set)
QA accuracy evaluation using the created data sets (with the same model throughout)
Investigation of the correlation between translation quality and QA accuracy
Metrics for evaluation of translation quality
BLEU+1: Evaluates local n-grams
1-WER: Evaluates whole word order strictly
RIBES: Evaluates rank correlation of word order
NIST: Evaluates local word order and correctness of infrequent words
Acceptability: Human evaluation
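Of the automatic metrics above, 1-WER is the simplest to state precisely: WER is the word-level edit distance between hypothesis and reference, normalized by reference length. A minimal sketch (the example sentences are illustrative, not from the data set):

```python
# Sketch of the 1-WER metric: word error rate is the Levenshtein distance
# over words, normalized by reference length, so 1-WER strictly rewards
# matching the reference's exact word order.

def wer(hyp, ref):
    h, r = hyp.split(), ref.split()
    # Standard dynamic-programming edit distance over words.
    d = [[0] * (len(r) + 1) for _ in range(len(h) + 1)]
    for i in range(len(h) + 1):
        d[i][0] = i
    for j in range(len(r) + 1):
        d[0][j] = j
    for i in range(1, len(h) + 1):
        for j in range(1, len(r) + 1):
            cost = 0 if h[i - 1] == r[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,      # deletion
                          d[i][j - 1] + 1,      # insertion
                          d[i - 1][j - 1] + cost)  # substitution/match
    return d[len(h)][len(r)] / len(r)

ref = "where is the capital of japan"
hyp = "where is japan 's capital"
print(1 - wer(hyp, ref))
```

BLEU+1, RIBES, and NIST follow the same hypothesis-vs-reference pattern but score n-gram overlap, rank correlation of word order, and information-weighted n-grams respectively.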
Translation quality
QA accuracy
Translation quality and QA accuracy
Sentence-level analysis
47% of the OR-set questions are not answered correctly; these questions might be difficult to answer even with a correct translation result
Dividing the questions into two groups:
Correct group (141 × 5 = 705 questions): translated from the 141 questions answered correctly in the OR set
Incorrect group (123 × 5 = 615 questions): translated from the remaining 123 questions in the OR set
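The group-wise analysis boils down to computing a correlation between per-set metric scores and QA accuracy. A minimal Pearson-correlation sketch; the score lists below are hypothetical placeholders, not the paper's data:

```python
# Sketch of the correlation analysis: Pearson correlation between
# translation-quality scores and QA accuracy across translation systems.
import math

def pearson(xs, ys):
    """Plain Pearson correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical per-system values (one entry per translated data set):
metric_scores = [0.95, 0.71, 0.60, 0.55, 0.50]
qa_accuracy   = [0.62, 0.55, 0.48, 0.47, 0.44]
print(round(pearson(metric_scores, qa_accuracy), 3))
```

Running this separately on the correct and incorrect groups is what produces the two correlation columns reported on the next slide.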
Sentence-level correlation
Metric          Correlation (correct group)   Correlation (incorrect group)
BLEU+1          0.900                         0.007
1-WER           0.690                         0.092
RIBES           0.418                         0.311
NIST            0.942                         0.210
Acceptability   0.890                         0.547
Incorrect group shows very little correlation: if the reference cannot be answered correctly, the sentences are not suitable, even as negative samples
NIST has the highest correlation → importance of content words
Sample 1
Sample 2
Lack of the question-type word
Sample 3
All questions were answered correctly even though they are grammatically incorrect.
Conclusion
NIST score has the highest correlation with QA accuracy: NIST is sensitive to changes in content words
If the reference cannot be answered correctly, there is very little correlation between translation quality and QA accuracy → answerable references should be used
Three factors cause changes in QA results: content words, question types, and syntax