1 Web-based Acquisition of Japanese Katakana Variants Hiroshi Nakagawa (University of Tokyo, Japan) Takeshi Masuyama (University of Tokyo , Japan) (Now Yahoo! Japan)


Page 1:

Web-based Acquisition of Japanese Katakana Variants

Hiroshi Nakagawa (University of Tokyo, Japan)

Takeshi Masuyama (University of Tokyo, Japan)†

†(Now Yahoo! Japan)

Page 2:

• We apologize for the Katakana font printing problem in the proceedings. We could not check the final printing.

• Please read the English transliterations of the Katakana parts, which are printed like %-%c…..

Page 3:

In cooperation with

Satoshi Sekine, Computer Science, New York University

and Language Craft Co.

Page 4:

mew, mew

ニャア, ニャア (nyaa, nyaa)

ニャアー, ニャアー (nyah, nyah)

The mapping from sound to spelling differs from language to language.

Katakana word variants

Page 5:

History of Katakana

Every country and every language has its own history of meanings, codes and fonts.

Phonogram vs. Ideogram

Page 6:

Kanji (Hanji) characters (= ideograms) were imported to Japan 1300 years ago.

漢字 (Hanji)

Page 7:

Almost 1000 years ago, women writers worked out phonograms (Hiragana and Katakana) from Kanji (ideograms) to express the Japanese people's mentality.

世 (Kanji): ideogram

せ (Hiragana), セ (Katakana): phonograms

Murasaki Shikibu (紫式部)

Page 8:

Modern history of Katakana

• Japanese Katakana and Hiragana have a one-to-one mapping.
• After the Meiji revolution (1868), Japanese people used Katakana to express function words and Hiragana to express words imported from Western countries.
• After World War II (1945), the two roles were exchanged:
  - Hiragana came to be used for function words such as case markers.
  - Katakana came to be used for words imported from Western countries.
• Thus the majority of Katakana words are transliterations of English words.

Page 9:

• However, Japanese Katakana has only five vowels (a, i, u, e, o) and 19 consonants (k, g, s, z, j, t, d, n, h, b, m, y, r, w, c, sh, ch, ny, my).
• Pronunciations are always C+V or V; there is no C+C.
• There is no distinction between (b, v), (h, f), (l, r), ...
• The Katakana character set has no orthographic way to express such English sounds.
• Thus the Japanese language has accepted several Katakana spellings for one English word: Katakana variants.

Page 11:

An example of search results: hits for "spaghetti" with Google

The + and - search options were used to avoid overlap between the hit counts of distinct Katakana variants.

Katakana variant | Hits of Google search | (%)
スパゲッティ (supagettuthi) | 187,000 | 32.7%
スパゲッティー (supagettuthii) | 57,600 | 10.1%
スパゲッテイ (supagettutei) | 6,850 | 1.2%
スパゲティ (supagethi) | 240,000 | 41.9%
スパゲティー (supagethii) | 77,400 | 13.5%
スパゲテイ (supagetei) | 3,800 | 0.7%
total | 572,650 | 100%
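As a quick check of the arithmetic in the table, the shares can be recomputed from the hit counts. This is only a sketch: the dictionary name `hits` is illustrative, and the counts are copied from the slide.

```python
# Hit counts copied from the table on this slide.
hits = {
    "スパゲッティ": 187_000,
    "スパゲッティー": 57_600,
    "スパゲッテイ": 6_850,
    "スパゲティ": 240_000,
    "スパゲティー": 77_400,
    "スパゲテイ": 3_800,
}

total = sum(hits.values())

# Share of each variant in percent, rounded to one decimal as on the slide.
shares = {word: round(100 * n / total, 1) for word, n in hits.items()}

for word, share in shares.items():
    print(f"{word}: {hits[word]:,} hits ({share}%)")
print(f"total: {total:,}")
```

No single spelling reaches half of the hits, which is why a search on one spelling alone misses most of the relevant pages.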

Page 12:

A Katakana variants extraction system is needed to enhance the cross-language ability of:

• Information Retrieval
• Search engines
• Machine translation
• Information Extraction
• Summarization
• Question Answering

Page 13:

Previous research 1:

Manually constructed rewriting rules to generate and/or extract Katakana variants from a given Katakana word (Shishibori et al., 1993, 1994; Kubota, 1994)

Samples of rewrite rules:
ベ (Be) ⇔ ヴェ (Ve)
チ (chi) ⇔ ツィ (thi)

Input: ベネチア (Benechia)
Output: ベネツィア (Benethia), ヴェネチア (Venechia), ヴェネツィア (Venethia)
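The rule-based generation above can be sketched as follows. This is a minimal illustration, not the cited authors' implementation: the function name `generate_variants` and the greedy left-to-right matching are assumptions, and only the two sample rules from the slide are included.

```python
from itertools import product

# The two sample rewrite rules from the slide (each rule works in both directions).
REWRITE_RULES = [("ベ", "ヴェ"), ("チ", "ツィ")]


def generate_variants(word, rules=REWRITE_RULES):
    """Generate all spellings reachable by applying the rules at every matching position."""
    partner = {}
    for a, b in rules:
        partner[a] = b
        partner[b] = a
    # Try longer rule sides first so that a two-character side wins over a one-character side.
    sides = sorted(partner, key=len, reverse=True)

    segments = []  # each segment is a tuple of interchangeable strings
    i = 0
    while i < len(word):
        for side in sides:
            if word.startswith(side, i):
                segments.append((side, partner[side]))
                i += len(side)
                break
        else:
            segments.append((word[i],))
            i += 1

    variants = {"".join(choice) for choice in product(*segments)}
    variants.discard(word)  # the input itself is not reported as a variant
    return sorted(variants)


print(generate_variants("ベネチア"))
```

With the slide's input ベネチア, the two rule positions combine into the three output spellings listed above.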

Page 14:

Previous research 2:

Extract Katakana variants with weighted edit distance (Magari et al., 2004), (Ohtake et al., 2004)

Edit distance is defined as the number of operations needed to transform one Katakana word into another Katakana word.

Operations: insert, delete, replace
Ex.: レポート (Repooto) → リポート (Ripooto): edit dist. = 1

Weighted edit distance: the weight of each operation is given manually.
Ex.: weighted edit dist. (レポート → リポート) = 0.8
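A minimal sketch of the two distances follows. It uses the standard dynamic-programming edit distance; the `replace_cost` hook and the rule that insertions and deletions always cost 1 are assumptions for illustration. Only the 0.8 weight for レ → リ comes from the slide.

```python
def edit_distance(a, b, replace_cost=lambda x, y: 1.0):
    """Edit distance with insert/delete cost 1 and a pluggable replace cost."""
    prev = [float(j) for j in range(len(b) + 1)]
    for i, ca in enumerate(a, start=1):
        cur = [float(i)]
        for j, cb in enumerate(b, start=1):
            substitute = prev[j - 1] + (0.0 if ca == cb else replace_cost(ca, cb))
            cur.append(min(prev[j] + 1.0,      # delete ca
                           cur[j - 1] + 1.0,   # insert cb
                           substitute))        # replace ca with cb (or keep it)
        prev = cur
    return prev[-1]


# Hypothetical weight table: only the レ/リ entry (0.8) is taken from the slide.
MANUAL_WEIGHTS = {frozenset(("レ", "リ")): 0.8}


def weighted_cost(x, y):
    return MANUAL_WEIGHTS.get(frozenset((x, y)), 1.0)


print(edit_distance("レポート", "リポート"))                 # plain edit distance
print(edit_distance("レポート", "リポート", weighted_cost))  # weighted edit distance
```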

Page 15:

Previous research 3: a more direct way

String penalty to extract Katakana variants (Masuyama et al., 2004)

String penalty (SP):
• Based on weighted edit distance, but extended to treat strings of two or three characters.
• Weights are given manually to combinations of edit operations (= string replacing operations).
• Ex.: SP(ボイス (Boisu), ヴォイス (Voisu)) = 4 … replace and insert

Page 16:

Previous research 4: Combination method

(Masuyama, Nakagawa, Sekine 2004 COLING)

Combination of string penalty and context

• String penalty (SP): the SP value is given by an expert.
• Similarity of the contexts in which each Katakana variant appears: vector space model (automatically calculated). If the words around two Katakana words are similar, then the Katakana words are variants of each other.

Page 17:

Problems of previous research

• Less coverage
• Needs intensive human intellectual work for:
  - working out rewrite rules
  - determining the weights of weighted edit distance
  - determining the string penalty values of each Katakana string pair
• Depends on the specific corpus used to calculate:
  - the weights of weighted edit distance
  - the string penalty

Page 18:

Purpose of this work

The problems of a manually given string penalty:
• Labor intensive (even in the combination of SP and context)
• Low coverage

Goals: determine the string penalty mechanically, and automatically build Katakana variants for each Katakana word.

Page 19:

Calculating string penalties mechanically

For this, we need an accurate, high-quality Katakana variants database!

Page 20:

Process (diagram)

1. English words and their Katakana variants: (idea, アイデア), (report, レポート)
2. WWW pages: … レポート … report … リポート …
3. Pairs of variant candidates: (レポート 'repooto', リポート 'ripooto'), (レポート 'repooto', サポート 'sapooto'), …
4. Pairs of variants: (レポート 'repooto', リポート 'ripooto'), (レファレンス 'refarensu', リファレンス 'rifarensu'), (アーキテクト 'aakitekuto', アーキテクツ 'aakitekutu')
5. String penalty: レ 're' ⇔ リ 'ri': 1; ト 'to' ⇔ ッ 'ttu': 3

Page 21:

Process (diagram, Web search step): (English word, Katakana variant) pairs such as (idea, アイデア), (report, レポート) → Web search → pairs of variant candidates, e.g. (レポート 'repooto', リポート 'ripooto'), (レポート 'repooto', サポート 'sapooto') → pairs of variants → string penalty

Page 22:

How to find candidates of Katakana variant pairs (1/3)

1. To collect English words and their Katakana variants, e.g. (vodka, ウォッカ), we used four Web sites that list a number of English words and their Japanese translations:

http://homepage2.nifty.com/katakanaEnglish/
http://www.hoshi.cis.ibaraki.ac.jp/usefull/usefull15.html
http://ke.ics.saitama-u.ac.jp/jsgs/keywords.html
http://smalltown.ne.jp/~uasa/pub/distfiles/skk-extra-200307/SKK-JISYO.edit

This yielded 14,958 distinct pairs of English words and their Katakana translations.

Page 23:

How to find candidates of Katakana variant pairs (2/3)

1. Extract many pairs of an English word and its Katakana variant: 14,958 English-Katakana pairs.

2. To collect more Katakana variants for each English word, use Google search to get pages that include the English word and the Katakana word of its translation. Two queries are used:
  - "English word" with language = Japanese
  - "English word + 「英和」" (「英和」 means "English to Japanese"), in order to search English-Japanese dictionary sites

3. Gather Katakana words from the search results.
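Step 3 can be sketched with a regular expression over the text of a result page. The slides do not say how Katakana words were delimited, so the character range below (ァ through ー, which also covers the middle dot ・) and the function name `gather_katakana_words` are assumptions.

```python
import re

# Assumed definition of a Katakana word: a maximal run of characters in
# U+30A1 (ァ) .. U+30FC (ー). This range includes the middle dot ・ (U+30FB),
# so spellings such as ラスト・シーン are kept as one word.
KATAKANA_WORD = re.compile(r"[\u30A1-\u30FC]+")


def gather_katakana_words(text):
    """Return every Katakana run found in the text, in order of appearance."""
    return KATAKANA_WORD.findall(text)


page_text = "ロシアのウォッカ (vodka) はウオッカとも書く"
print(gather_katakana_words(page_text))
```

A real page would also need markup removal and filtering of stray ・ or ー runs, which this sketch leaves out.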

Page 24:

Google search for the English word "vodka" among pages written in Japanese

Query: vodka

Page 25:

Add the query term 「英和」 (English-Japanese) and search with Google

Query: 英和 'e-j' vodka

Page 26:

Process (diagram, candidate extraction step): (English word, Katakana variant) pairs such as (report, レポート) → Web search with "英和 (e-j) report" → pairs of variant candidates with edit dist. = 1, e.g. (レポート 'repooto', リポート 'ripooto'), (レポート 'repooto', サポート 'sapooto') → pairs of variants → string penalty

Page 27:

How to find candidates of Katakana variant pairs (3/3)

4. Extract Katakana word pairs whose edit distance = 1 as promising candidates of Katakana variants.

Ex. (vodka, ウォッカ):
(ウォッカ 'Uottuka', ウォトカ 'Uotoka')
(ウォッカ 'Uottuka', ウオッカ 'UOttuka')
(ウォッカ 'Uottuka', ヴォッカ 'Vuottuka')
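Step 4 can be sketched as a pairwise filter over the Katakana words gathered for one English word. The function names and the word list are illustrative; ロシア is added only as a distractor and is not from the slides.

```python
from itertools import combinations


def one_edit_apart(a, b):
    """True if a single insert, delete, or replace turns a into b."""
    if a == b or abs(len(a) - len(b)) > 1:
        return False
    if len(a) > len(b):
        a, b = b, a  # make a the shorter (or equal-length) word
    i = 0
    while i < len(a) and a[i] == b[i]:
        i += 1
    if len(a) == len(b):
        return a[i + 1:] == b[i + 1:]  # one replaced character
    return a[i:] == b[i + 1:]          # one inserted character in b


def candidate_pairs(words):
    """All unordered pairs of distinct words whose edit distance is 1."""
    return [(a, b) for a, b in combinations(words, 2) if one_edit_apart(a, b)]


gathered = ["ウォッカ", "ウォトカ", "ウオッカ", "ヴォッカ", "ロシア"]
for pair in candidate_pairs(gathered):
    print(pair)
```

Pairs such as (ウォトカ, ウオッカ) differ in two positions and are therefore not proposed directly; they are linked only through ウォッカ.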

Page 28:

Process (diagram, context filtering step): (English word, Katakana variant) pairs such as (report, レポート) → Web search by "英和 (e-j) report" → pairs of variant candidates with edit dist. = 1, e.g. (レポート 'repooto', リポート 'ripooto'), (レポート 'repooto', サポート 'sapooto') → pairs of variant candidates by context, cosine sim > 0.00006 → pairs of variants → string penalty

Page 29:

How to extract the documents on which context similarity is calculated

Run a Google search with each candidate Katakana variant as the query.

Extract the context of the Katakana variant from the search result pages.

Page 30:

Search "Vodka" with Google

Query: +ウォッカ ('+vodka') retrieves all pages including ウォッカ

[Screenshot: search result pages, with the contexts of ウオッカ marked.]

Page 31:

1. Calculate the context similarity of a candidate Katakana variant pair

Context A: "drink vodka (Vuottuka) with a main dish and plate of caviar in the restaurants"

Context B: "eat some main dish plate after vodka (Uotoka) in that restaurants"

The cosine similarity is computed between the two contexts. The 50 words around a candidate Katakana variant are used as its context.

2. Identify and extract the pair as Katakana variants if the cosine similarity is greater than the threshold of 0.00006.
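The two steps above can be sketched as follows. The word weight log(freq(t)+1) is the one these slides give for the context similarity calculation. Splitting the 50-word window evenly before and after the target word, whitespace tokenization, and all function names are assumptions for illustration.

```python
import math
from collections import Counter


def context_window(tokens, index, size=50):
    """Words around tokens[index]; assumes size/2 words on each side."""
    half = size // 2
    return tokens[max(0, index - half):index] + tokens[index + 1:index + 1 + half]


def context_vector(words):
    """Weight each word t by log(freq(t) + 1)."""
    return {t: math.log(f + 1) for t, f in Counter(words).items()}


def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


# The two example contexts from the slide, with the Katakana word itself removed.
context_a = "drink with a main dish and plate of caviar in the restaurants".split()
context_b = "eat some main dish plate after in that restaurants".split()

similarity = cosine(context_vector(context_a), context_vector(context_b))
THRESHOLD = 0.00006  # threshold stated on the slide
print(similarity, similarity > THRESHOLD)
```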

Page 32:

Detail of context similarity calculation

• Context = 50 words around the Katakana word
• Weight of word t in the context: log(freq(t)+1)
• Context similarity = cosine
• Selection from candidates by cosine similarity ≧ 0.00006 (threshold)
• Threshold optimization: the threshold is the argmax of the F-value on positive pairs (347 pairs) and negative pairs (111 pairs)
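The threshold optimization in the last bullet can be sketched as a search over candidate thresholds. The slides state only that the threshold maximizes the F-value on labeled pairs; trying each observed score as a threshold, and the toy scores below, are assumptions.

```python
def f_value(tp, fp, fn):
    """F-value (harmonic mean of precision and recall) from raw counts."""
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0


def best_threshold(positive_scores, negative_scores):
    """Pick the threshold t (accept if score >= t) that maximizes the F-value."""
    best_t, best_f = None, -1.0
    for t in sorted(set(positive_scores) | set(negative_scores)):
        tp = sum(s >= t for s in positive_scores)
        fn = len(positive_scores) - tp
        fp = sum(s >= t for s in negative_scores)
        f = f_value(tp, fp, fn)
        if f > best_f:
            best_t, best_f = t, f
    return best_t, best_f


# Hypothetical cosine similarities for a handful of labeled pairs.
positives = [0.9, 0.5, 0.2]   # pairs known to be variants
negatives = [0.3, 0.1]        # pairs known not to be variants
print(best_threshold(positives, negatives))
```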

Page 33:

Results of context similarity vs. cosine threshold

[Figure: F-value (%) plotted against the threshold of cosine similarity; the F-value axis runs from 80 to 90.]

Page 34:

Process (diagram, string penalty step): (English word, Katakana variant) pairs such as (report, レポート) → Web search by "英和 (e-j) report" → pairs of variant candidates with edit dist. = 1 → cosine sim > 0.00006 → pairs of variants, e.g. (レポート 'repooto', リポート 'ripooto'), (レファレンス 'refarensu', リファレンス 'rifarensu'), (アーキテクト 'aakitekuto', アーキテクツ 'aakitekutu') → レ 're' ⇔ リ 'ri': SP=1; ト 'to' ⇔ ッ 'ttu': SP=3

The next step is to calculate SP based on statistics.

Page 35:

2nd stage: Calculation of string penalty (SP)

String penalty of operation x → y (x is replaced with y)

We focus on the high correlation between replaced strings and their character context, which is composed of several characters around the target string.

Example: (ウインブルドン, ウィンブルドン), (ウインドウズ, ウィンドウズ), (ウインク, ウィンク)

Replacing イ 'I' with ィ 'i' co-occurs with the preceding ウ 'U' and the following ン 'n'.

Page 36:

Character-level contexts CLC1..CLC5 used to calculate SP

x: target character
α, β, γ, δ: characters around x

CLC | String | Context around x
CLC1 | αβx | two characters preceding x
CLC2 | βx | one character preceding x
CLC3 | xγ | one character succeeding x
CLC4 | xγδ | two characters succeeding x
CLC5 | βxγ | the characters preceding and succeeding x
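The table can be turned into a small helper that reads the five contexts off a word. This is a sketch: the function name is illustrative, and omitting a context when the word is too short on that side is an assumption, since the slides do not say how word boundaries are handled.

```python
def character_level_contexts(word, i):
    """Return the available CLC1..CLC5 strings around the target character word[i]."""
    x = word[i]
    before = word[max(0, i - 2):i]   # up to two preceding characters (αβ)
    after = word[i + 1:i + 3]        # up to two succeeding characters (γδ)

    contexts = {}
    if len(before) == 2:
        contexts["CLC1"] = before + x                 # αβx
    if len(before) >= 1:
        contexts["CLC2"] = before[-1] + x             # βx
    if len(after) >= 1:
        contexts["CLC3"] = x + after[0]               # xγ
    if len(after) == 2:
        contexts["CLC4"] = x + after                  # xγδ
    if before and after:
        contexts["CLC5"] = before[-1] + x + after[0]  # βxγ
    return contexts


# Target character イ (index 1) in two words from the deck's イ/ィ example.
print(character_level_contexts("ウインク", 1))
print(character_level_contexts("ウインブルドン", 1))
```

Both words share the context ウイン (CLC5), which is the kind of regularity the SP calculation relies on.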

Page 37:

Calculation of string penalty: SP

$$P(x \to y \mid CLC_i) = \frac{f(CLC_i,\, x \to y) + 1}{f(CLC_i) + 2}, \qquad i = 1, 2, 3, 4, 5$$

f(CLC_i) = freq. of pairs in which CLC_i occurs

f(CLC_i, x → y) = freq. of pairs in which both CLC_i and x → y occur

Page 38:

Calculation of string penalty: SP

Identify the character context CLC_i which most probably co-occurs with the operation x → y:

$$\widehat{CLC} = \underset{CLC_i\,(i = 1, \dots, 5)}{\arg\max}\; P(x \to y \mid CLC_i)$$

Then

$$SP(x \to y) = \frac{1}{P(x \to y \mid \widehat{CLC})}$$

Zipf's law: rank of occurrence ≈ C × (prob. of occurrence)^{-1}
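Combining the smoothed probability P(x → y | CLC_i) = (f(CLC_i, x → y) + 1) / (f(CLC_i) + 2) with the argmax step gives the following sketch. The counts are hypothetical and the function name is illustrative. The slides list integer SP values from 1 to 12 but do not say how the reciprocal is mapped to an integer, so the raw value is returned here.

```python
def string_penalty(counts):
    """counts maps a CLC name to (f(CLC_i), f(CLC_i, x->y)).

    Returns the best context and the raw penalty 1 / P(x->y | best context).
    """
    # Smoothed conditional probability of the operation given each context.
    probs = {
        name: (f_joint + 1) / (f_context + 2)
        for name, (f_context, f_joint) in counts.items()
    }
    best = max(probs, key=probs.get)   # context most likely to co-occur with x->y
    return best, 1.0 / probs[best]


# Hypothetical frequencies for one operation x->y in two contexts.
example_counts = {
    "CLC2": (10, 4),   # context seen in 10 pairs, 4 of them with x->y
    "CLC5": (3, 3),    # context seen in 3 pairs, all of them with x->y
}
print(string_penalty(example_counts))
```

An operation that almost always occurs in some context gets a penalty close to 1, while a rarely observed operation gets a large penalty.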

Page 39:

Examples of string penalties

Operation | SP | Example
Insertion and deletion of '・' | 1 | ラストシーン, ラスト・シーン
Insertion and deletion of macron 'ー' | 1 | エネルギー, エネルギ
Replace オ 'O' and ォ 'o' | 1 | ウオッカ, ウォッカ
Replace グ 'gu' and ク 'ku' | 2 | バック, バッグ
Replace ヴ 'vu' and ブ 'bu' | 2 | ジュネーヴ, ジュネーブ
Replace ヴ 'vu' and ウ 'U' | 3 | ヴォッカ, ウォッカ

Page 40:

Comparison of SP by hand and SP by the proposed method

• SP by hand, proposed by Masuyama et al. (2004): an expert worked out the SPs by hand.
• Gold standard Katakana variants: 682 pairs of Katakana variant candidates extracted from a newspaper corpus, whose string penalties are between 1 and 12.
• We found no correct variants whose SPs are between 10 and 12. Thus, the above gold standard probably covers all correct variants.

Page 41:

Comparison of SPs

SP | SP by hand | SP by proposed mechanical method
1 | 216/221 (97.7%) | 262/286 (91.6%)
2 | 162/207 (78.3%) | 133/148 (89.9%)
3 | 70/99 (70.7%) | 51/90 (56.7%)
4 | 2/14 (14.3%) | 2/26 (7.7%)
5 | 0/29 (0.0%) | 0/16 (0.0%)
6 | 0/13 (0.0%) | 2/34 (5.9%)
7 | 1/20 (5.0%) | 1/39 (2.6%)
8 | 0/13 (0.0%) | 1/15 (6.7%)
9 | 1/12 (8.3%) | 0/8 (0.0%)
10 | 0/16 (0.0%) | 0/5 (0.0%)
11 | 0/17 (0.0%) | 0/12 (0.0%)
12 | 0/21 (0.0%) | 0/3 (0.0%)

Page 42:

Comparison of SPs: correlation

Rows: SP by hand. Columns: SP by proposed mechanical method.

SP | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | Total
1 | 207 | 7 | 3 | 2 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 221
2 | 20 | 123 | 59 | 2 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 207
3 | 59 | 11 | 20 | 3 | 2 | 3 | 1 | 0 | 0 | 0 | 0 | 0 | 99
4 | 0 | 2 | 3 | 2 | 2 | 0 | 4 | 0 | 0 | 1 | 0 | 0 | 14
5 | 0 | 2 | 2 | 6 | 3 | 4 | 5 | 3 | 1 | 0 | 2 | 1 | 29
6 | 0 | 0 | 0 | 3 | 1 | 1 | 2 | 0 | 3 | 1 | 2 | 0 | 13
7 | 0 | 0 | 1 | 3 | 2 | 2 | 2 | 4 | 1 | 1 | 3 | 1 | 20
8 | 0 | 1 | 0 | 0 | 0 | 4 | 6 | 1 | 0 | 0 | 1 | 0 | 13
9 | 0 | 1 | 0 | 0 | 0 | 0 | 2 | 3 | 1 | 1 | 4 | 0 | 12
10 | 0 | 1 | 1 | 5 | 0 | 0 | 4 | 2 | 1 | 1 | 0 | 1 | 16
11 | 0 | 0 | 1 | 0 | 2 | 13 | 0 | 0 | 1 | 0 | 0 | 0 | 17
12 | 0 | 0 | 0 | 0 | 3 | 5 | 11 | 2 | 0 | 0 | 0 | 0 | 21
Total | 286 | 148 | 90 | 26 | 16 | 34 | 39 | 15 | 8 | 5 | 12 | 3 | 682

correlation: 0.76

Page 43:

Building Katakana variants DB automatically

Page 44:

Summary of comparison and next?

• SP by hand (COLING 2004) → context similarity → extracted variants: accurate
• SP by mechanical method (SIGIR 2005) → context similarity → extracted variants: accurate?
• Correlation between the two SPs: 0.76

Page 45:

Building the variants DB (diagram)

Newspaper corpus: … レポート … ラポート … リポート … サポート …
→ Candidates of Katakana variants: (レポート, ラポート), (レポート, リポート), (レポート, サポート)
→ Candidates of Katakana variants: (レポート, ラポート), (レポート, リポート), …
→ Katakana variants DB: (レポート, リポート), …

Page 46:

Building the variants DB (diagram, first step labeled)

Newspaper corpus: … レポート … ラポート … リポート … サポート …
→ [Extract Katakana words] Candidates of Katakana variants: (レポート, ラポート), (レポート, リポート), (レポート, サポート)
→ Candidates of Katakana variants: (レポート, ラポート), (レポート, リポート), …
→ Katakana variants DB: (レポート, リポート), …

Page 47:

Building the variants DB (diagram, first two steps labeled)

Newspaper corpus: … レポート … ラポート … リポート … サポート …
→ [Extract Katakana words] Candidates of Katakana variants: (レポート, ラポート), (レポート, リポート), (レポート, サポート)
→ [SP ≤ 3] Candidates of Katakana variants: (レポート, ラポート), (レポート, リポート), …
→ Katakana variants DB: (レポート, リポート), …

Page 48:

Building the variants DB (diagram, all steps labeled)

Newspaper corpus: … レポート … ラポート … リポート … サポート …
→ [Extract Katakana words] Candidates of Katakana variants: (レポート, ラポート), (レポート, リポート), (レポート, サポート)
→ [SP ≤ 3] Candidates of Katakana variants: (レポート, ラポート), (レポート, リポート), …
→ [Context similarity ≥ 0.005, optimized threshold] Katakana variants DB: (レポート, リポート), …
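The two filters in the diagram can be sketched as a small pipeline. The scoring functions are passed in as parameters because their implementations are described elsewhere in the deck. The scores below are hypothetical values chosen to reproduce the diagram's example; they are not measured data.

```python
def build_variants_db(candidates, sp, context_similarity, sp_max, sim_min):
    """Keep candidate pairs that pass the SP filter and then the context filter."""
    db = []
    for a, b in candidates:
        if sp(a, b) > sp_max:
            continue                      # dropped by the string penalty filter
        if context_similarity(a, b) < sim_min:
            continue                      # dropped by the context similarity filter
        db.append((a, b))
    return db


candidates = [("レポート", "ラポート"), ("レポート", "リポート"), ("レポート", "サポート")]

# Hypothetical scores for illustration only.
HYPOTHETICAL_SP = {"ラポート": 2, "リポート": 1, "サポート": 5}
HYPOTHETICAL_SIM = {"ラポート": 0.001, "リポート": 0.3, "サポート": 0.002}

db = build_variants_db(
    candidates,
    sp=lambda a, b: HYPOTHETICAL_SP[b],
    context_similarity=lambda a, b: HYPOTHETICAL_SIM[b],
    sp_max=3,        # SP ≤ 3, as in the diagram
    sim_min=0.005,   # context similarity threshold shown in the diagram
)
print(db)
```

Running the cheap string filter first means the context similarity, which needs corpus lookups, is computed only for pairs that are already close in spelling.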

Page 49:

Comparison of variants DB

SP ≦ 3, context similarity ≧ 0.05

Measure | SP by hand (expert) | SP by the proposed mechanical method
recall | 417/420 (99.3%) | 415/420 (98.8%)
precision | 417/480 (86.9%) | 415/480 (86.5%)
F-value | 92.7% | 92.2%

cf. The whole DB contains 3 million Katakana variants for 1 million distinct Katakana words.
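The F-values in the table follow from the recall and precision counts. The sketch below recomputes them, assuming the F-value is the usual harmonic mean of precision and recall; the function and parameter names are illustrative.

```python
def precision_recall_f(correct, extracted, gold):
    """Precision, recall and F-value (harmonic mean) from raw counts."""
    precision = correct / extracted
    recall = correct / gold
    f = 2 * precision * recall / (precision + recall)
    return precision, recall, f


# Counts from the table: 420 gold-standard variants, 480 extracted pairs.
for label, correct in [("SP by hand", 417), ("SP by mechanical method", 415)]:
    p, r, f = precision_recall_f(correct, extracted=480, gold=420)
    print(f"{label}: recall {100 * r:.1f}%, precision {100 * p:.1f}%, F-value {100 * f:.1f}%")
```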

Page 50:

Conclusions

• Mechanical method of calculating SP
  - Using a Web search engine to extract variant candidates
  - SP by character context
  - Almost the same accuracy as SP by hand by an expert
• Katakana variants DB with SP by the mechanical method
  - recall: 98.8%, precision: 86.5%, F-value: 92.2%

Page 51:

Future of our research

• Other languages like German: Arbeit -- アルバイト
• Application of our methodology (Web resources + statistical string penalty) to other language pairs: Londre -- London, München -- Munich
• Our hope is a cross-language automatic spelling variants generator for any language pair, based on the proposed method.

Page 52:

Thank you!

サンキュー (sankyuh)

サンキュウ (sankyuu)

Questions or comments are welcome.

Page 53:

Error analysis

• grizzly bear: グリーズリーベア (gurihzurihbea) vs. グリーズリー・ベア (gurihzurih・bea) are not regarded as variants. Contexts: animal vs. Norman Shwarzkov (totally different contexts!)
• sign pole vs. sign ball: サインポール (sainpohru) vs. サインボール (sainbohru) are regarded as variants. Contexts: barber shop vs. baseball, both with customer, shop, sales (very similar contexts)

Page 54:

The threshold of SP vs. F-value

[Figure: F-value (%), on an axis from 0 to 100, plotted against the threshold of SP from 1 to 12.]

Page 55:

cosine similarity vs. F-value

[Figure: F-value (%), on an axis from 25 to 85, plotted against thresholds from 0 to 0.4 in steps of 0.05; the horizontal axis is labeled "Threshold of SP".]

Page 56:

If you search for some Katakana variant with Google, …

In the case of spaghetti:

Katakana variant | Found or not
スパゲッティ (supagettuthi) | ○
スパゲッティー (supagettuthii) | ○
スパゲッテイ (supagettutei) | ×
スパゲティ (supagethi) | ○
スパゲティー (supagethii) | ○
スパゲテイ (supagetei) | ×

Page 57:

How to find candidates of Katakana

How to extract the documents on which context similarity is calculated

Run a Google search with each candidate Katakana variant as the query.

Extract the context of the Katakana variant from the search result pages and calculate context similarity to identify Katakana variants.

Page 58:

Example of similarity calculation

(ウォッカ 'Uottuka', ウォトカ 'Uotoka')

ウォッカ: liquor: 1.1, strong: 1.4, alcohol: 1.6, western liquor: 0.7, …
ウォトカ: liquor: 0.7, strong: 0.7, alcohol: 3.4, western liquor: 1.4, …

$$\cos(\mathrm{Uottuka}, \mathrm{Uotoka}) = \frac{1.1 \times 0.7 + 1.4 \times 0.7 + \cdots}{\sqrt{1.1^2 + 1.4^2 + 1.6^2 + \cdots}\;\sqrt{0.7^2 + 0.7^2 + 3.4^2 + \cdots}} = 0.75100$$