29
Tomaž Erjavec 1 , Adam Kilgarriff 2 , Irena Srdanović Erjavec 3 1 Jožef Stefan Institute, Slovenia 2 Lexical Computing Ltd. and University of Leeds, UK 3 Tokyo Institute of Technology, Japan A large public-access Japanese corpus and its query tool - JapWaC and Sketch Engine -

Tomaž Erjavec 1, Adam Kilgarriff 2, Irena Srdanović Erjavec 3 1 Jožef Stefan Institute, Slovenia 2 Lexical Computing Ltd. and University of Leeds, UK 3

Embed Size (px)

Citation preview

Page 1: Tomaž Erjavec 1, Adam Kilgarriff 2, Irena Srdanović Erjavec 3 1 Jožef Stefan Institute, Slovenia 2 Lexical Computing Ltd. and University of Leeds, UK 3

Tomaž Erjavec1, Adam Kilgarriff2, Irena Srdanović Erjavec3

1Jožef Stefan Institute, Slovenia 2Lexical Computing Ltd. and University of Leeds, UK 3Tokyo Institute of Technology, Japan

A large public-access Japanese corpus and its query tool- JapWaC and Sketch Engine -

Page 2: Tomaž Erjavec 1, Adam Kilgarriff 2, Irena Srdanović Erjavec 3 1 Jožef Stefan Institute, Slovenia 2 Lexical Computing Ltd. and University of Leeds, UK 3

2 COJAS, March 2007

Overview

1. The case for corpora

2. The case for web corpora

3. How JapWaC was created

4. Sketch Engine (SkE)

5. Demo-ing JapWaC & SkE

6. Future work

7. Access to JapWaC & SkE

Page 3: Tomaž Erjavec 1, Adam Kilgarriff 2, Irena Srdanović Erjavec 3 1 Jožef Stefan Institute, Slovenia 2 Lexical Computing Ltd. and University of Leeds, UK 3

3 COJAS, March 2007

Corpora A sample of a language Useful for studying the language Language is diverse

Big samples needed, to catch everything Good tools needed, for large amounts of data

Last 15 years Big samples are easier to gather Tools are better Rapid growth in corpus methods

Page 4: Tomaž Erjavec 1, Adam Kilgarriff 2, Irena Srdanović Erjavec 3 1 Jožef Stefan Institute, Slovenia 2 Lexical Computing Ltd. and University of Leeds, UK 3

4 COJAS, March 2007

Web corpora Web is huge, free, easily accessible (Non-)linguists use it for lang. check/research Skewed?

Keller and Lapata 03: • web results match human judgements well• the large amount of data outweighs the “noise” problem

Web importance as a resource is growing David Crystal “Language and the Internet” 06:

• “new linguistic medium that we cannot ignore” Web-corpus expertise is growing (WaCky etc.)

Page 5: Tomaž Erjavec 1, Adam Kilgarriff 2, Irena Srdanović Erjavec 3 1 Jožef Stefan Institute, Slovenia 2 Lexical Computing Ltd. and University of Leeds, UK 3

5 COJAS, March 2007

Steps to compile web corpora(Sharoff, Baroni)

1. Get URL list for required language ~500 most frequent word forms

not function words; for general-purpose corpora, words that do not belong to a spec. domain

5000-6000 queries, 4 words, top 10 URLs2. Download HTML pages3. Normalize encoding (to UTF-8)4. HTML clean-up

boilerplate removal: HTML tags, Java code, navigation frames,…

5. Extract meta-data (URL, title, date,…)6. Linguistic annotation

Page 6: Tomaž Erjavec 1, Adam Kilgarriff 2, Irena Srdanović Erjavec 3 1 Jožef Stefan Institute, Slovenia 2 Lexical Computing Ltd. and University of Leeds, UK 3

6 COJAS, March 2007

Steps for JapWaC

URL list of pages in Japanese provided by Serge Sharoff

• word, lemma and PoS frequency lists for Japanese, c.f. http://corpus.leeds.ac.uk/list.html

Files downloaded and cleaned with BootCat • by Marco Baroni and others from the WaCky

project, c.f. http://wacky.sslmit.unibo.it/ Segmented, tokenised, tagged with Chasen Translated Chasen tags to English Converted to Sketch Engine format and

loaded

Page 7: Tomaž Erjavec 1, Adam Kilgarriff 2, Irena Srdanović Erjavec 3 1 Jožef Stefan Institute, Slovenia 2 Lexical Computing Ltd. and University of Leeds, UK 3

7 COJAS, March 2007

Example fileThe file size is 7038669 kB, showing first 1 kB.

<doc id="http://www.0start-hp.com/voice/index.php"><s>月々 月々 N.Adv2 2 N.Num6 6 N.Num3 3 N.Num円 円 N.Suff.msrで だ Aux、 、 Sym.cあなた あなた N.Pron.gも も P.bindブログデビュー ブログデビュー Unknownし する V.freeて て P.Conjみ みる V.bndませ ます Auxん ん Auxか か P.advcoordfin? ? Sym.g</s>

Page 8: Tomaž Erjavec 1, Adam Kilgarriff 2, Irena Srdanović Erjavec 3 1 Jožef Stefan Institute, Slovenia 2 Lexical Computing Ltd. and University of Leeds, UK 3

8 COJAS, March 2007

Basic corpus statistics

49,554 URLs (i.e. HTML files) 16,072 sites (2 domains) 12,759,201 sentences (Chasen) 409,384,411 tokens (Chasen) 7.3 GB filesize tokens/file: 8,263 Average 5,001 Median 3 Min 170,693 Max

Page 9: Tomaž Erjavec 1, Adam Kilgarriff 2, Irena Srdanović Erjavec 3 1 Jožef Stefan Institute, Slovenia 2 Lexical Computing Ltd. and University of Leeds, UK 3

9 COJAS, March 2007

URL statistics(top ranking domains, sites and keywords)

Page 10: Tomaž Erjavec 1, Adam Kilgarriff 2, Irena Srdanović Erjavec 3 1 Jožef Stefan Institute, Slovenia 2 Lexical Computing Ltd. and University of Leeds, UK 3

10 COJAS, March 2007

Chasen POS statistics

Page 11: Tomaž Erjavec 1, Adam Kilgarriff 2, Irena Srdanović Erjavec 3 1 Jožef Stefan Institute, Slovenia 2 Lexical Computing Ltd. and University of Leeds, UK 3

11 COJAS, March 2007

The Sketch Engine Leading corpus query system Any corpus, any language Web-based

No software to install

Concordance Word sketches

“one-page, corpus based account of a word’s grammatical and collocational behaviour”

Thesaurus Word Sketch Difference

Page 12: Tomaž Erjavec 1, Adam Kilgarriff 2, Irena Srdanović Erjavec 3 1 Jožef Stefan Institute, Slovenia 2 Lexical Computing Ltd. and University of Leeds, UK 3

12 COJAS, March 2007

Use of Sketch Engine

Macmillan English DictionaryFor Advanced Learners

Ed: Rundell, 2002 Lexicography

Language learning

Linguistic research

.

.

.

Page 13: Tomaž Erjavec 1, Adam Kilgarriff 2, Irena Srdanović Erjavec 3 1 Jožef Stefan Institute, Slovenia 2 Lexical Computing Ltd. and University of Leeds, UK 3

13 COJAS, March 2007

Page 14: Tomaž Erjavec 1, Adam Kilgarriff 2, Irena Srdanović Erjavec 3 1 Jožef Stefan Institute, Slovenia 2 Lexical Computing Ltd. and University of Leeds, UK 3

14 COJAS, March 2007

Creating SkE for Japanese

1. Load JapWaC into SkE

2. Write gram relations for Japanese Chasen POS (as used for jaSlo)

3. Compile word sketches

4. Recompute scores in WS

5. Compile thesaurus

Page 15: Tomaž Erjavec 1, Adam Kilgarriff 2, Irena Srdanović Erjavec 3 1 Jožef Stefan Institute, Slovenia 2 Lexical Computing Ltd. and University of Leeds, UK 3

15 COJAS, March 2007

Page 16: Tomaž Erjavec 1, Adam Kilgarriff 2, Irena Srdanović Erjavec 3 1 Jožef Stefan Institute, Slovenia 2 Lexical Computing Ltd. and University of Leeds, UK 3

16 COJAS, March 2007

Page 17: Tomaž Erjavec 1, Adam Kilgarriff 2, Irena Srdanović Erjavec 3 1 Jožef Stefan Institute, Slovenia 2 Lexical Computing Ltd. and University of Leeds, UK 3

17 COJAS, March 2007

Page 18: Tomaž Erjavec 1, Adam Kilgarriff 2, Irena Srdanović Erjavec 3 1 Jožef Stefan Institute, Slovenia 2 Lexical Computing Ltd. and University of Leeds, UK 3

18 COJAS, March 2007

Page 19: Tomaž Erjavec 1, Adam Kilgarriff 2, Irena Srdanović Erjavec 3 1 Jožef Stefan Institute, Slovenia 2 Lexical Computing Ltd. and University of Leeds, UK 3

19 COJAS, March 2007

Page 20: Tomaž Erjavec 1, Adam Kilgarriff 2, Irena Srdanović Erjavec 3 1 Jožef Stefan Institute, Slovenia 2 Lexical Computing Ltd. and University of Leeds, UK 3

20 COJAS, March 2007

Page 21: Tomaž Erjavec 1, Adam Kilgarriff 2, Irena Srdanović Erjavec 3 1 Jožef Stefan Institute, Slovenia 2 Lexical Computing Ltd. and University of Leeds, UK 3

21 COJAS, March 2007

Page 22: Tomaž Erjavec 1, Adam Kilgarriff 2, Irena Srdanović Erjavec 3 1 Jožef Stefan Institute, Slovenia 2 Lexical Computing Ltd. and University of Leeds, UK 3

22 COJAS, March 2007

Word Sketch examples

WS for 女の子 (noun )

WS for 冷たい (adjective)

WS for 書く (verb)

Page 23: Tomaž Erjavec 1, Adam Kilgarriff 2, Irena Srdanović Erjavec 3 1 Jožef Stefan Institute, Slovenia 2 Lexical Computing Ltd. and University of Leeds, UK 3

23 COJAS, March 2007

Page 24: Tomaž Erjavec 1, Adam Kilgarriff 2, Irena Srdanović Erjavec 3 1 Jožef Stefan Institute, Slovenia 2 Lexical Computing Ltd. and University of Leeds, UK 3

24 COJAS, March 2007

Thesaurus, WS Diff example(1)

WS Diff for 女の子 and 男の子

Page 25: Tomaž Erjavec 1, Adam Kilgarriff 2, Irena Srdanović Erjavec 3 1 Jožef Stefan Institute, Slovenia 2 Lexical Computing Ltd. and University of Leeds, UK 3

25 COJAS, March 2007

Thesaurus, WS Diff example (2)

WS Diff for 寒い and 冷たい

Page 26: Tomaž Erjavec 1, Adam Kilgarriff 2, Irena Srdanović Erjavec 3 1 Jožef Stefan Institute, Slovenia 2 Lexical Computing Ltd. and University of Leeds, UK 3

26 COJAS, March 2007

「温泉」 example

WS for 温泉

Page 27: Tomaž Erjavec 1, Adam Kilgarriff 2, Irena Srdanović Erjavec 3 1 Jožef Stefan Institute, Slovenia 2 Lexical Computing Ltd. and University of Leeds, UK 3

27 COJAS, March 2007

More metadata in the corpus:• date, title, author; text typology

More data cleaning Japanese corpus for HLT research:

• sampling only 10 consecutive sentences, 100M• would be available for download with Creative Commons

license For native speakers’ and learners’ use:

• original Chasen tags, Chasen kana• Ruby romaji, furigana in examples

Connecting to jaSlo, Natsume system More advanced relations (MWU etc.), Cabocha? Load other corpora into SkE (Kotonoha, AB)

Future work

Page 28: Tomaž Erjavec 1, Adam Kilgarriff 2, Irena Srdanović Erjavec 3 1 Jožef Stefan Institute, Slovenia 2 Lexical Computing Ltd. and University of Leeds, UK 3

28 COJAS, March 2007

Access to JapWaC & SkE

http://www.sketchengine.co.uk Free 30-day trial Self-registration Japanese, Chinese, English, French,

German, Italian, Spanish, Portuguese, Slovene

Also gives access to WebBootCaT “instant web corpora”

Page 29: Tomaž Erjavec 1, Adam Kilgarriff 2, Irena Srdanović Erjavec 3 1 Jožef Stefan Institute, Slovenia 2 Lexical Computing Ltd. and University of Leeds, UK 3

29 COJAS, March 2007

Thank you for your attention!