
Mining Revision Log of Language Learning SNS for Automated Japanese Error Correction of Second Language Learners @IJCNLP2011


Page 1

Tomoya Mizumoto†, Mamoru Komachi†, Masaaki Nagata‡, Yuji Matsumoto†

† Nara Institute of Science and Technology ‡ NTT Communication Science Laboratories 

Mining Revision Log of Language Learning SNS for Automated Japanese Error Correction of Second Language Learners

2011.11.09, IJCNLP

Page 2

Background

• The number of Japanese language learners has increased
  • 3.65 million people in 133 countries and regions
  • Only 50,000 Japanese language teachers overseas
• High demand for good instructors for writers of Japanese as a Second Language (JSL)

Page 3

Recent error correction for language learners

• NLP research has begun to pay attention to second language learning
• Most previous research deals with restricted types of learners' errors
  • E.g., research for JSL learners mainly focuses on Japanese case particles
• Real JSL learners' writing contains various errors
  • Spelling errors
  • Collocation errors

Page 4

Error correction using SMT [Brockett et al., 2006]

• Proposed to correct ESL learners' errors using statistical machine translation
• Advantage: it doesn't require expert knowledge
  • Learns a correction model from learners' and corrected corpora
  • Not easy to acquire large-scale learners' corpora
• Japanese sentences are not segmented into words
  • JSL learners' sentences are hard to tokenize

Page 5

Purpose of our study

1. Solve the knowledge acquisition bottleneck
   • Create a large-scale learners' corpus from the error revision logs of a language learning SNS
2. Solve the problem of word segmentation errors caused by erroneous input, using SMT techniques with the extracted learners' corpus

Page 6

SNS sites that help language learners

• smart.fm
  • Helps learners practice language learning
• Livemocha
  • Offers grammar instruction courses, reading comprehension exercises, and practice
• Lang-8
  • Multilingual language learning and language exchange SNS
  • Soon after a learner writes a passage in the language being learned, native speakers of that language correct the errors in it

Page 7

Lang-8 data

• Sentences of JSL learners: 925,588
• Corrected sentences: 1,288,934
• Example of a corrected sentence from Lang-8 (a pair of a learner's sentence and its corrected sentence):

Learner: ビデオゲームをやました。
Correct: ビデオゲームをやりました。 (I played video games.)

Language    Number of sentences
English     1,069,549
Japanese    925,588
Mandarin    136,203
Korean      93,955

Page 8

Types of correction

• Correction by insertion, deletion, and substitution (a sketch of detecting these follows below)

Learner: ビデオゲームをやました。
Correct: ビデオゲームをやりました。

• Correction with a comment

Learner: 銭湯に行った。
Correct: 銭湯に行った。 いつ行ったかがあるほうがいい ← comment ("It would be better to mention when you went.")

• There are also "corrected" sentences to which only the word "GOOD" is appended at the end

Learner: 銭湯に行った。
Correct: 銭湯に行った。 GOOD

• Removing comments: number of sentence pairs 1,288,934 → 849,894
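To make these correction types concrete, here is a minimal Python sketch (our own illustration, not the authors' code) that classifies character-level edits between a learner's sentence and its correction using the standard library's difflib:

```python
import difflib

def classify_edits(learner: str, corrected: str):
    """Label character-level insertions, deletions, and substitutions."""
    edits = []
    for op, i1, i2, j1, j2 in difflib.SequenceMatcher(
            None, learner, corrected).get_opcodes():
        if op == "insert":
            edits.append(("insertion", corrected[j1:j2]))
        elif op == "delete":
            edits.append(("deletion", learner[i1:i2]))
        elif op == "replace":
            edits.append(("substitution", learner[i1:i2], corrected[j1:j2]))
    return edits

print(classify_edits("ビデオゲームをやました。", "ビデオゲームをやりました。"))
# Expected: [('insertion', 'り')]
```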

Page 9

Comparison of Japanese learners' corpora

Corpus                                   Data size
Our Lang-8 corpus                        849,894 sentences (448 MB)
Teramura error data (1990)               4,601 sentences (420 KB)
Ohso database (1998)                     756 files (15 MB)
JSL learners parallel database (2000)    1,500 JSL learners' writings

Page 10

Error correction using SMT

ê = argmax_e P(e|f) = argmax_e P(e) P(f|e)

SMT
e: target sentences
f: source sentences
P(e): probability of the language model
P(f|e): probability of the translation model

Page 11

Error correction using SMT

ê = argmax_e P(e|f) = argmax_e P(e) P(f|e)

SMT
e: target sentences
f: source sentences
P(e): probability of the language model
P(f|e): probability of the translation model

Error correction
e: corrected sentences
f: Japanese learners' sentences
P(e): probability of the language model (can be learned from a monolingual corpus of the language to be learned)
P(f|e): probability of the translation model (can be learned from the sentence-aligned learners' corpus)

A sketch of this decision rule follows below.
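As an illustration of the noisy-channel decision rule above, here is a minimal Python sketch (our own, not the authors' implementation); lm_logprob and tm_logprob are hypothetical stand-ins for models trained on the monolingual corpus and the learners' corpus:

```python
def correct(f, candidates, lm_logprob, tm_logprob):
    """Return the candidate e maximizing log P(e) + log P(f|e).

    f          : the learner's sentence (source)
    candidates : candidate corrected sentences e
    lm_logprob : hypothetical scorer, e -> log P(e)        (language model)
    tm_logprob : hypothetical scorer, (f, e) -> log P(f|e) (translation model)
    """
    return max(candidates, key=lambda e: lm_logprob(e) + tm_logprob(f, e))

# Toy usage with made-up log-probabilities:
lm = {"私は英語が好き": -5.0, "私わ英語が好き": -12.0}
tm = {("私わ英語が好き", "私は英語が好き"): -1.0,
      ("私わ英語が好き", "私わ英語が好き"): -0.5}
best = correct("私わ英語が好き", list(lm),
               lm_logprob=lm.__getitem__,
               tm_logprob=lambda f, e: tm[(f, e)])
print(best)  # 私は英語が好き: the language model prefers the correct particle は
```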

Page 12

Difficulty of handling JSL learners' sentences

• Word segmentation is usually performed as a pre-processing step
• JSL learners' sentences contain many errors and much hiragana (phonetic characters)
  • Hard to tokenize with a traditional morphological analyzer

Page 13

Difficulty of handling JSL learners' sentences

• E.g.:

Learner: でもじょずじゃりません
Correct: でもじょうずじゃありません (But I am not good at it.)

↓ tokenize

Learner: でも じ ょずじゃりません
Correct: でも じょうず じゃ ありません

Page 14

Character-wise model

• Character-wise segmentation, e.g.:
  でもじょずじゃりません → で も じ ょ ず じ ゃ り ま せ ん
• Not affected by word segmentation errors
  • Expected to be more robust (see the sketch below)

Learner: で も じ ょ  ず じ ゃ  り ま せ ん
Correct: で も じ ょ う ず じ ゃ あ り ま せ ん
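The character-wise model only requires splitting every sentence into single characters before training and decoding. A minimal Python sketch of this preprocessing step (our own illustration):

```python
def char_segment(sentence: str) -> str:
    """Insert spaces between all characters so that an off-the-shelf SMT
    toolkit treats each character as a 'word'."""
    chars = sentence.replace(" ", "").replace("\u3000", "")  # drop existing spaces
    return " ".join(chars)

print(char_segment("でもじょずじゃりません"))
# で も じ ょ ず じ ゃ り ま せ ん
```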

Page 15

Experiment

• Carried out experiments to see:
  1. the effect of corpus size
  2. the effect of granularity of tokenization

Page 16

Experimental setting

• Methods
  • Baseline: word-wise model
  • Proposed method: character-wise model
  • Language models: 3-gram and 5-gram
• Data
  • Extracted from the revision logs of Lang-8: 849,894 sentences
  • Test: 500 sentences, re-annotated to make a gold standard

Page 17

Evaluation metrics

• BLEU
  • [Park and Levy, 2011] adapted BLEU for automatic assessment of ESL errors
  • We followed their use of BLEU in the error correction task for JSL learners
  • Since JSL learners' sentences are hard to tokenize with a morphological analyzer, we use character-based BLEU (a sketch follows below)
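Character-based BLEU can be computed by simply treating every character as a token. A minimal sketch with NLTK (our own illustration, not the authors' evaluation script):

```python
from nltk.translate.bleu_score import corpus_bleu

def char_bleu(references, hypotheses):
    """Corpus-level BLEU over character tokens.

    references, hypotheses: parallel lists of untokenized sentences."""
    refs = [[list(r)] for r in references]  # one reference per hypothesis
    hyps = [list(h) for h in hypotheses]
    return corpus_bleu(refs, hyps)          # default 4-gram BLEU

print(char_bleu(["でもじょうずじゃありません"], ["でもじょうずじゃありません"]))  # 1.0
```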

Page 18

The larger the corpus, the higher the BLEU

• Character-wise model: character 5-gram

[Chart: BLEU (roughly 81.0 to 81.9) against the learning data size of the TM: 0.1M, 0.15M, 0.3M, 0.85M sentences]

The difference is not statistically significant.

Page 19

Character-wise models are better than the word-wise model

• TM training corpus: 0.3M sentences (achieves the best result)

Model   Word 3-gram   Character 3-gram   Character 5-gram
BLEU    80.72         81.63              81.81

Page 20

Both the 0.1M and the 0.3M model corrected these:

Learner: またど もう ありがとう (Thanks, "matadomou": OOV)
Correct: またど うも ありがとう (Thank you again)

Learner: TRUTH わ 美しいです (TRUTH wa beautiful)
Correct: TRUTH は 美しいです (TRUTH is beautiful)

Page 21

Only the 0.3M model corrected this:

Learner: 学生な るたら 学校に行ける (the learner made an error in the conjugated form)
Correct: 学生な ったら 学校に行ける (Becoming a student, I can go to school)
0.1M: 学生な るため 学校に行ける (I can go to school to be a student)
0.3M: 学生な ったら 学校に行ける (Becoming a student, I can go to school)

Page 22

Conclusions

• Made use of a large-scale corpus built from the revision logs of a language learning SNS
• Adopted SMT approaches to alleviate the problem of erroneous input from learners
• The character-wise model outperforms the word-wise model
• Future work: apply the SMT-based method with the extracted learners' corpus to error correction for learners of English as a second language

Page 23

Handling the comments

• We conduct the following three pre-processing steps (a sketch follows below):
  1. If the corrected sentence contains only "GOOD" or "OK", we don't include it in the corpus
  2. If the edit distance between the learner's sentence and the corrected sentence is larger than 5, we simply drop the pair from the corpus
  3. If the corrected sentence ends with "OK" or "GOOD", we remove the comment and retain the sentence pair
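A minimal Python sketch of these three filtering rules (our own reading of the steps; the threshold of 5 comes from the slide, the Levenshtein implementation is generic):

```python
def edit_distance(a: str, b: str) -> int:
    """Plain character-level Levenshtein distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

def clean_pair(learner: str, corrected: str):
    """Apply the three pre-processing steps in the slide's order.
    Returns the cleaned (learner, corrected) pair, or None to drop it."""
    c = corrected.strip()
    if c in ("GOOD", "OK"):                 # step 1: comment-only "correction"
        return None
    if edit_distance(learner, c) > 5:       # step 2: too heavily edited
        return None
    for tag in ("GOOD", "OK"):              # step 3: strip a trailing comment
        if c.endswith(tag):
            c = c[: -len(tag)].strip()
    return learner, c
```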

Page 24

Future work

• Apply the SMT-based method with the extracted learners' corpus to error correction of English as a second language
• Apply factored language and translation models incorporating the POS information of the words on the target side, while learners' input is processed by a character-wise model

Page 25

Approaches for correcting unrestricted errors

• EM-based unsupervised approach to whole-sentence grammar correction [Park and Levy, 2011]
  • Types of errors must be pre-determined
  • Requires expert knowledge of L2 teaching
• Error correction using SMT [Brockett et al., 2006]
  • Advantage: it doesn't require expert knowledge
  • Learns a correction model from learners' corpora
  • Not easy to acquire large-scale learners' corpora

Page 26

Statistical machine translation

[Diagram: an English sentence is fed to the Translation Model and the Language Model, producing candidate Japanese sentences.
The TM is learned from a sentence-aligned parallel corpus (e.g., "I like English" – 私は英語が好き).
The LM is learned from a Japanese monolingual corpus.]

Page 27

Japanese error correction

[Diagram: a learner's sentence is fed to the Translation Model and the Language Model, producing candidate corrected sentences.
The TM is learned from the sentence-aligned learners' corpus (e.g., learner 私わ英語が好き – correct 私は英語が好き).
The LM is learned from a Japanese monolingual corpus.]

Page 28

Evaluation metrics

• Character-based BLEU [Park and Levy, 2011]
• Recall and precision based on the LCS (longest common subsequence); a sketch follows below
  • F-measure: harmonic mean of R and P

R = N_LCS / N_SYSTEM,  P = N_LCS / N_CORRECT

N_CORRECT: number of characters contained in the corrected answer
N_SYSTEM: number of characters contained in the system result
N_LCS: number of characters contained in the LCS of the corrected answer and the system result

E.g.  correct: 私 は 学 生 で す   system: 私 は 学 生 だ
N_CORRECT = 6, N_SYSTEM = 5, N_LCS = 4
R = 4/5, P = 4/6
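A minimal Python sketch of these LCS-based metrics (our own illustration, using the slide's definitions of R and P):

```python
def lcs_len(a: str, b: str) -> int:
    """Length of the longest common subsequence of a and b (dynamic programming)."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, ca in enumerate(a, 1):
        for j, cb in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if ca == cb else max(dp[i - 1][j], dp[i][j - 1])
    return dp[-1][-1]

def lcs_prf(correct: str, system: str):
    """Recall, precision, and F-measure as defined on the slide."""
    n_lcs = lcs_len(correct, system)
    r = n_lcs / len(system)    # R = N_LCS / N_SYSTEM
    p = n_lcs / len(correct)   # P = N_LCS / N_CORRECT
    f = 2 * p * r / (p + r) if p + r else 0.0
    return r, p, f

print(lcs_prf("私は学生です", "私は学生だ"))
# (0.8, 0.666..., 0.727...)  i.e. R = 4/5, P = 4/6 as in the example above
```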

Page 29

Experimental results - granularity of tokenization -

• Training corpus: L1 = ALL
• Test corpus: L1 = English
• TM size: 0.3M sentences

Metric      W       C3      C5
Recall      90.43   90.89   90.83
Precision   91.75   92.34   92.43
F-measure   91.09   91.61   91.62
BLEU        80.72   81.63   81.81

Page 30

Purpose of our study

1. Solve the knowledge acquisition bottleneck
   • Create a large-scale learners' corpus from the error revision logs of a language learning SNS
2. Propose a method using SMT techniques with the extracted learners' corpus
3. Solve the problem of word segmentation errors caused by erroneous input

Page 31

Experiment

• Carried out experiments to see:
  1. the effect of granularity of tokenization
  2. the effect of corpus size
  3. the effect of the first language (L1)

Page 32

Experimental data

• Training data
  • Extracted from the revision logs of Lang-8
  • Prepared three L1 models:
    • L1 = ALL: 849,894 sentences
    • L1 = English: 320,655 sentences
    • L1 = Mandarin: 186,807 sentences
• Test data
  • Extracted 500 sentences each for L1 = English and L1 = Mandarin
  • Re-annotated the 500 sentences to make a gold standard

Page 33

Experimental results - granularity of tokenization -

• Training corpus: L1 = ALL
• Test corpus: L1 = English

BLEU:  W 80.72   C3 81.63   C5 81.81

Page 34

Experimental results - corpus size -

• Training corpus: L1 = ALL model
• Test corpus: L1 = English model

[Chart: BLEU (roughly 81.0 to 81.9) against the learning data size of the TM: 0.1M, 0.15M, 0.3M, 0.85M sentences]

Page 35

Experimental results - L1 model -

• TM size: 0.18M sentences

BLEU by L1 of training data (rows) and L1 of test data (columns):

Training \ Test   English   Mandarin
English           81.48     85.73
Mandarin          80.83     85.89
all               81.21     85.53

Page 36

Difficulty of handling JSL learners' sentences

• Word segmentation is usually performed as a pre-processing step
• JSL learners' sentences contain many errors and much hiragana (phonetic characters)
  • Hard to tokenize with a traditional morphological analyzer
• E.g.:

Learner: でもじょずじゃりません
Correct: でもじょうずじゃありません

↓ tokenize

Learner: でも じ ょずじゃりません
Correct: でも じょうず じゃ ありません

Page 37

Statistical machine translation

[Diagram: the English sentence "I went to America." is fed to the Translation Model and the Language Model and translated into the Japanese sentence 私はアメリカに行った.
The TM is learned from a sentence-aligned parallel corpus (e.g., "I like English" – 私は英語が好き).
The LM is learned from a Japanese monolingual corpus (e.g., 私は英語が好き ...).]

Page 38

Statistical machine translation

[Diagram: the learner's sentence 私わアメリカに行った is fed to the Translation Model and the Language Model and corrected to 私はアメリカに行った.
The TM is learned from the sentence-aligned learner corpus (e.g., learner 私わ英語が好き – correct 私は英語が好き).
The LM is learned from a Japanese monolingual corpus (e.g., 私は英語が好き ...).]