[Paper Introduction] Evaluating MT Systems with Second Language Proficiency Tests

Evaluating MT Systems with Second Language Proficiency Tests

Takuya Matsuzaki, Akira Fujita, Naoya Todo, Noriko H. Arai

ACL 2015

2015/09/24 AHCLab M1 Makoto Morishita

Abstract

• BLEU has several weaknesses when used to evaluate MT systems in real-world settings.

• This paper evaluates MT systems using second-language proficiency tests (TOEIC, etc.).

• The experiment revealed that the context-unawareness of current MT systems severely damages human performance when solving the test problems.


Weak Points of BLEU

1. Unreliability in evaluating short translations

2. Non-interpretability of the scores beyond numerical comparison

3. Bias towards SMT systems


Weak Points of Manual Evaluation

1. It is costly.

2. It is not easy to analyze the characteristics of MT systems based solely on the evaluation results.


Solution

• Task-based evaluation of MT systems

- Measures human performance in a task.

• Human subjects perform a task, such as information extraction, using machine-translated text.


Weak Points of Task-Based Evaluation

• It is costly.

- Test materials have to be created, and appropriate human subjects have to be recruited.

• This paper uses second-language proficiency tests (SLPTs), such as TOEIC, as the source of test materials.

• Human subjects solve the translated problems, and the MT system is evaluated by their test scores.


Second-Language Proficiency Tests (SLPTs)

• There are a lot of SLPTs in many languages.

• They are carefully designed to evaluate various aspects of language ability.

• SLPTs are designed to assess language ability, not general intelligence.

- They can therefore be robust against the heterogeneity (diversity) of the subjects.

Materials

• We chose 40 problems at random from the National Center Test for University Admissions (センター試験).

• Every problem consisted of a short conversation between two people.


Materials

• In this paper, we use multiple-choice dialogue completion problems.


Experiment

• The original problems were in English, and they were translated into Japanese.

• The human subjects solved the translated problems.

• The translation quality was evaluated based on the rate of correct answers given by the human subjects.


Experiment

• Four systems were evaluated:

- G: Google Translate

- Y: Yahoo Translate

- Hs: Human translation that does not consider context

- Ho: Human translation that considers context


Participants

• 320 Japanese junior high school students

School A: 1st: 80, 2nd: 80, 3rd: 78

School B: 1st: 82

Extrinsic Evaluation Metric

• CAR: Correct Answer Rate

$$\mathrm{CAR}_M(p) = \frac{\#\text{ of subjects who correctly answered } M(p)}{\#\text{ of subjects who solved } M(p)}$$

$$\text{Avg-CAR}_M = \frac{1}{|P|} \sum_{p \in P} \mathrm{CAR}_M(p)$$
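The CAR definition above can be sketched in a few lines of Python. This is only an illustration, assuming that M(p) denotes problem p as translated by system M and that P is the set of problems; the function names and the toy responses are hypothetical and are not data from the paper.

```python
from collections import defaultdict


def car_per_problem(responses):
    """Compute CAR_M(p) for each problem p.

    `responses` is an iterable of (problem_id, is_correct) pairs, one per
    subject answer, all for translations produced by one system M.
    """
    solved = defaultdict(int)   # number of subjects who solved M(p)
    correct = defaultdict(int)  # number of subjects who answered M(p) correctly
    for problem_id, is_correct in responses:
        solved[problem_id] += 1
        if is_correct:
            correct[problem_id] += 1
    return {p: correct[p] / solved[p] for p in solved}


def avg_car(car_by_problem):
    """Average CAR_M(p) over all problems p in P."""
    return sum(car_by_problem.values()) / len(car_by_problem)


# Toy, made-up responses for a single system (not results from the paper).
toy_responses = [
    ("p1", True), ("p1", True), ("p1", False), ("p1", True),
    ("p2", False), ("p2", True),
]
car = car_per_problem(toy_responses)
print(car)           # {'p1': 0.75, 'p2': 0.5}
print(avg_car(car))  # 0.625
```

Note that Avg-CAR weights every problem equally, regardless of how many subjects solved it.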

Robustness against the Heterogeneity of the Human Subjects

• School A, 1st (80) vs. 2nd (80) vs. 3rd (78): no difference

• School A 1st (80) vs. School B 1st (82): no difference

→ The heterogeneity of the participants did not affect the test results.
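The slides do not say which statistical test was used to compare the groups. As one hypothetical way to check whether two groups of subjects differ in their correct answer rates, the sketch below runs a two-sided two-proportion z-test. The choice of test, the function name, and the counts of correct answers are all assumptions made for illustration; only the group sizes (80 and 82) are taken from the slides.

```python
import math


def two_proportion_z_test(correct_a, n_a, correct_b, n_b):
    """Two-sided z-test for the difference between two proportions.

    Returns (z, p_value). This is an assumed stand-in for whatever
    test the paper actually used.
    """
    p_a = correct_a / n_a
    p_b = correct_b / n_b
    pooled = (correct_a + correct_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        # Both groups are all-correct or all-wrong: no detectable difference.
        return 0.0, 1.0
    z = (p_a - p_b) / se
    # Two-sided p-value under the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Invented counts of correct answers (NOT results from the paper);
# the group sizes 80 and 82 are the ones shown on the slide.
z, p_value = two_proportion_z_test(correct_a=48, n_a=80, correct_b=50, n_b=82)
print(f"z = {z:.3f}, p = {p_value:.3f}")
```

A large p-value here would be read as "no significant difference" between the two groups, which is the kind of outcome the slide reports.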

System-level Evaluation

• No significant difference was found between Y and Hs.

• Refo: does not consider context

• Refs: considers context

Agreement

• For a pair of systems A and B, an intrinsic measure M agrees with CAR if M scores A's translation higher than B's and the CAR of A's translation is also higher than that of B's.

• Agreement is checked problem by problem to obtain the agreement rate.
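The agreement check described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the data layout, the function name, and the decision to skip ties are assumptions, and the scores are made-up toy values.

```python
from itertools import combinations


def agreement_rate(metric_scores, car_scores):
    """Fraction of (problem, system pair) comparisons where the intrinsic
    measure and CAR order the two systems the same way.

    Both arguments map problem_id -> {system_name: score}.
    Ties in either score are skipped (an assumption; the slides do not
    say how ties are handled).
    """
    agree = 0
    total = 0
    for problem_id, m in metric_scores.items():
        c = car_scores[problem_id]
        for a, b in combinations(sorted(m), 2):
            m_diff = m[a] - m[b]
            c_diff = c[a] - c[b]
            if m_diff == 0 or c_diff == 0:
                continue
            total += 1
            if (m_diff > 0) == (c_diff > 0):
                agree += 1
    return agree / total if total else float("nan")


# Toy, made-up scores for two systems on two problems.
toy_metric = {"p1": {"G": 0.30, "Y": 0.20}, "p2": {"G": 0.10, "Y": 0.40}}
toy_car = {"p1": {"G": 0.80, "Y": 0.60}, "p2": {"G": 0.70, "Y": 0.50}}
rate = agreement_rate(toy_metric, toy_car)
print(rate)  # 0.5
```

In the toy data, both scores prefer G on p1 (agreement), while on p2 the intrinsic measure prefers Y and CAR prefers G (disagreement), so the rate is 0.5.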


Agreement Rate

• Agreement Rates between Automatic Evaluation Metrics and Human Evaluation


Agreement Rate

• Agreement Rates between Intrinsic Evaluation Metrics and Correct Answer Rate


Agreement Rate

• The human evaluation agrees with the CAR slightly better than the automatic metrics.

• However, the agreement rate is still below 0.7.

• CAR can be critically damaged by a subtle mistake.


Conclusion

• The comparison of the four systems shows that it is important to consider the context of individual sentences when translating dialogues.

• SLPT can evaluate a different dimension of translation quality.

• SLPT can be robust against the heterogeneity of human subjects.


Questions & Comments
