Evaluating MT Systems with Second Language Proficiency Tests Takuya Matsuzaki, Akira Fujita, Naoya Todo, Noriko H. Arai ACL 2015 2015/09/24 AHCLab M1 Makoto Morishita

[Paper Introduction] Evaluating MT Systems with Second Language Proficiency Tests


Page 1: [Paper Introduction] Evaluating MT Systems with Second Language Proficiency Tests

Evaluating MT Systems with Second Language Proficiency Tests

Takuya Matsuzaki, Akira Fujita, Naoya Todo, Noriko H. Arai

ACL 2015

2015/09/24 AHCLab M1 Makoto Morishita

Page 2: [Paper Introduction] Evaluating MT Systems with Second Language Proficiency Tests

Abstract

• BLEU has several weaknesses when used to evaluate MT systems in real-world settings.

• This paper evaluates MT systems using second-language proficiency tests (TOEIC, etc.).

• The evaluation revealed that the context-unawareness of current MT systems severely damages human performance when solving the test problems.


Page 3: [Paper Introduction] Evaluating MT Systems with Second Language Proficiency Tests

Weak Points of BLEU

1. Unreliability in evaluating short translations

2. Non-interpretability of the scores beyond numerical comparison

3. Bias towards SMT systems


Page 4: [Paper Introduction] Evaluating MT Systems with Second Language Proficiency Tests

Weak Points of Manual Evaluation

1. It is costly.

2. It is not easy to analyze the characteristics of MT systems based solely on the evaluation results.


Page 5: [Paper Introduction] Evaluating MT Systems with Second Language Proficiency Tests

Solution

• Task-based evaluation of MT systems

- Measures human performance on a task

• Human subjects perform a task, such as information extraction, on machine-translated text.


Page 6: [Paper Introduction] Evaluating MT Systems with Second Language Proficiency Tests

Weak Points of Task-Based Evaluation

• It is costly.

- Test materials have to be prepared, and appropriate human subjects have to be gathered.

• This paper uses second-language proficiency tests (SLPTs), such as TOEIC, as the source of test materials.

• Human subjects solve the translated problems, and the MT system is evaluated by their test scores.


Page 7: [Paper Introduction] Evaluating MT Systems with Second Language Proficiency Tests

Second-Language Proficiency Tests (SLPT)

• There are a lot of SLPTs in many languages.

• They are carefully designed to evaluate various aspects of language ability.

• SLPTs are designed to assess language ability, not general intelligence.

- They can therefore be robust against the heterogeneity (diversity) of the subjects.

Page 8: [Paper Introduction] Evaluating MT Systems with Second Language Proficiency Tests

Materials

• We chose 40 problems at random from the National Center Test for University Admissions (センター試験).

• All of the problems consisted of a short conversation between two people.


Page 9: [Paper Introduction] Evaluating MT Systems with Second Language Proficiency Tests

Materials

• In this paper, we use multiple-choice dialogue completion problems.


Page 10: [Paper Introduction] Evaluating MT Systems with Second Language Proficiency Tests

Experiment

• The original problems were in English and were translated into Japanese.

• The human subjects solved the translated problems.

• The translation quality was evaluated based on the rate of correct answers given by the human subjects.


Page 11: [Paper Introduction] Evaluating MT Systems with Second Language Proficiency Tests

Experiment

• Evaluated 4 systems.

- G: Google Translate

- Y: Yahoo Translate

- Hs: Human translation that does not consider context

- Ho: Human translation that considers context


Page 12: [Paper Introduction] Evaluating MT Systems with Second Language Proficiency Tests

Participants

• 320 Japanese junior high school students

- School A: 1st: 80, 2nd: 80, 3rd: 78

- School B: 1st: 82

Page 13: [Paper Introduction] Evaluating MT Systems with Second Language Proficiency Tests

Extrinsic Evaluation Metric

• CAR: Correct Answer Rate

$$\mathrm{CAR}_M(p) = \frac{\#\text{ of subjects who correctly answered } M(p)}{\#\text{ of subjects who solved } M(p)}$$

$$\mathrm{Avg\text{-}CAR}_M = \frac{1}{|P|} \sum_{p \in P} \mathrm{CAR}_M(p)$$
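
A minimal sketch of how CAR and its average over problems could be computed: CAR for one problem is the fraction of subjects who answered the translated problem correctly, and Avg-CAR is the mean of that value over all problems translated by the same system. The data layout (a dictionary from problem ID to the list of choices picked by subjects), the function names `car` and `avg_car`, and the toy numbers are assumptions made for illustration; they are not taken from the paper.

```python
from statistics import mean


def car(answers, correct_choice):
    """Correct answer rate for one translated problem M(p).

    answers: choices picked by the subjects who solved the problem
    correct_choice: the correct choice for that problem
    """
    if not answers:
        raise ValueError("no subject solved this problem")
    return sum(1 for a in answers if a == correct_choice) / len(answers)


def avg_car(responses, answer_key):
    """Average of CAR over all problems p in P for one system M."""
    return mean(car(responses[p], answer_key[p]) for p in responses)


# Hypothetical toy data: two problems, four subjects each.
responses_g = {"p1": [1, 2, 1, 1], "p2": [3, 3, 4, 2]}
answer_key = {"p1": 1, "p2": 3}

print(car(responses_g["p1"], answer_key["p1"]))  # 3 of 4 correct
print(avg_car(responses_g, answer_key))          # mean of the two per-problem CARs
```

Keeping the per-problem rate separate from the average makes it possible to inspect individual problems, which the problem-level agreement analysis later in the slides relies on.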

Page 14: [Paper Introduction] Evaluating MT Systems with Second Language Proficiency Tests

Robustness against the Heterogeneity of the Human Subjects

• School A, 1st (80) vs. 2nd (80) vs. 3rd (78): No difference

• School A, 1st (80) vs. School B, 1st (82): No difference

→ The participants' heterogeneity did not affect the test results.

Page 15: [Paper Introduction] Evaluating MT Systems with Second Language Proficiency Tests

System-level Evaluation

• No significant difference was found between Y and Hs.

Page 16: [Paper Introduction] Evaluating MT Systems with Second Language Proficiency Tests

System-level Evaluation


Page 17: [Paper Introduction] Evaluating MT Systems with Second Language Proficiency Tests

System-level Evaluation

(Figure annotations: "Better", "Better")

Page 18: [Paper Introduction] Evaluating MT Systems with Second Language Proficiency Tests

System-level Evaluation

(Figure annotations: "Same", "Better")

Page 19: [Paper Introduction] Evaluating MT Systems with Second Language Proficiency Tests

System-level Evaluation

• Refo: Does not consider context

• Refs: Considers context

(Figure annotation: "Better")

Page 20: [Paper Introduction] Evaluating MT Systems with Second Language Proficiency Tests

Agreement

• For two systems A and B: if an intrinsic measure M scores A's translation higher than B's, and the CAR of A's translation is also higher than that of B's, then M and CAR agree.

• The agreement rate is computed over the individual problems.
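
The agreement check can be sketched as follows: for each problem, compare the sign of the intrinsic-score difference between two systems with the sign of the CAR difference, and count the problems where both point the same way. The function name `agreement_rate`, the nested-dictionary layout, the toy scores, and the choice to skip ties are assumptions for illustration only; the slides do not say how ties are handled.

```python
def agreement_rate(metric_scores, car_scores, system_a, system_b):
    """Fraction of problems on which an intrinsic metric and CAR
    prefer the same system out of (system_a, system_b).

    metric_scores[p][s]: intrinsic score of system s on problem p
    car_scores[p][s]:    correct answer rate of system s on problem p
    """
    agree = 0
    compared = 0
    for p in car_scores:
        m_diff = metric_scores[p][system_a] - metric_scores[p][system_b]
        c_diff = car_scores[p][system_a] - car_scores[p][system_b]
        if m_diff == 0 or c_diff == 0:
            continue  # assumption: problems with a tie are skipped
        compared += 1
        if (m_diff > 0) == (c_diff > 0):
            agree += 1
    return agree / compared if compared else float("nan")


# Hypothetical toy scores for three problems and two systems.
metric_scores = {
    "p1": {"G": 0.4, "Y": 0.3},
    "p2": {"G": 0.2, "Y": 0.5},
    "p3": {"G": 0.6, "Y": 0.1},
}
car_scores = {
    "p1": {"G": 0.8, "Y": 0.6},
    "p2": {"G": 0.7, "Y": 0.4},
    "p3": {"G": 0.9, "Y": 0.5},
}

# The metric and CAR agree on p1 and p3 and disagree on p2.
rate = agreement_rate(metric_scores, car_scores, "G", "Y")
print(rate)
```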


Page 21: [Paper Introduction] Evaluating MT Systems with Second Language Proficiency Tests

Agreement Rate

• Agreement Rates between Automatic Evaluation Metrics and Human Evaluation


Page 22: [Paper Introduction] Evaluating MT Systems with Second Language Proficiency Tests

Agreement Rate

• Agreement Rates between Intrinsic Evaluation Metrics and Correct Answer Rate


Page 23: [Paper Introduction] Evaluating MT Systems with Second Language Proficiency Tests

Agreement Rate

• The human evaluation agrees with the CAR slightly better than the automatic metrics.

• However, the agreement rate is still below 0.7.

• CAR can be critically damaged by a subtle mistake.


Page 24: [Paper Introduction] Evaluating MT Systems with Second Language Proficiency Tests

Conclusion

• The comparison of the 4 systems shows that considering the context of individual sentences is important when translating dialogues.

• SLPT can evaluate a different dimension of translation quality.

• SLPT can be robust against the heterogeneity of human subjects.


Page 25: [Paper Introduction] Evaluating MT Systems with Second Language Proficiency Tests

Questions & Comments