Comparable Corpora BootCaT (CCBC) Adam Kilgarriff, Avinesh PVS, Jan Pomikalek Lexical Computing Ltd.


Comparable Corpora BootCaT (CCBC)

Adam Kilgarriff, Avinesh PVS, Jan Pomikalek

Lexical Computing Ltd.


Just-in-time corpora

Krista Varantola

Translators, terminologists

In-domain terminology: Domain dictionaries

• Don’t exist

• Out of date

• Not accessible

Collect in-domain web pages

Instant corpus


BootCaT (Bootstrapping Corpora and Terms)

Baroni and Bernardini 2004

User: input ‘seed terms’

Send 3-at-a-time to a search engine

• Returns search hits page

Retrieve those pages

A corpus!

• Cleaning, deduplicating, linguistic processing

Extract terms

• Can use extracted terms as seeds, iterate
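
The query-building step can be sketched as follows. This is a minimal illustration, not the BootCaT implementation: the function name `build_queries`, its parameters, and the choice to shuffle the seed tuples are assumptions, and the search and download steps are left out because they depend on whichever search engine API is available.

```python
import random
from itertools import combinations


def build_queries(seeds, tuple_size=3, max_queries=10, rng_seed=0):
    """Combine seed terms into search queries of `tuple_size` seeds each.

    Hypothetical helper: the slides only say seeds are sent
    "3-at-a-time" to a search engine.
    """
    tuples = list(combinations(seeds, tuple_size))
    # Shuffle so a capped run still draws on the whole seed list.
    random.Random(rng_seed).shuffle(tuples)
    queries = []
    for combo in tuples[:max_queries]:
        # Quote multiword seeds so they are searched as phrases.
        parts = [f'"{s}"' if " " in s else s for s in combo]
        queries.append(" ".join(parts))
    return queries


seeds = ["volcanology", "tephra", "magma", "volcanic ash", "rhyolitic"]
queries = build_queries(seeds, max_queries=4)
for q in queries:
    print(q)
```

Each query would then be sent to the search engine, and the pages behind the hits downloaded, cleaned and deduplicated to form the corpus.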


Works well

Widely used

More implementations

SkE has WebBootCaT, web front end

Secret: piggybacks on search engines

They do the donkey-work

• on-domain, text-rich pages, no spam, …


Also in use for

General language corpus

Long list of general seed words

• Pioneer: Serge Sharoff

• LCL: Corpus Factory

‘Varieties of Learner English’

General English, same queries except

• Region=UK, US, Canada, Aus, China, Japan, Korea

Validation under way


The Sketch Engine

Corpus query tool, since 2003

Widely used by lexicographers

• Commercial

• OUP, CUP, Collins, Macmillan, Le Robert, Cornelsen, Shogakukan

• National dictionary projects

• Bulgaria, Czech Republic, Estonia, Netherlands, Slovakia, Slovenia

Universities

• Linguistics, language research, NLP, language teaching, teaching translation


55 languages and counting

Large corpora ready-to-use for

Arabic Bengali Bulgarian Chinese Czech Croatian Danish Dutch English Estonian Finnish French German Greek Gujarati Hebrew Hindi Indonesian Irish Italian Japanese Korean Latin Malay Malayalam Norwegian Persian Polish Portuguese Romanian Russian Serbian Setswana Slovak Slovene Spanish Swahili Swedish Tamil Telugu Thai Turkish Urdu Vietnamese


Handles large corpora

Largest to date: 8 billion words

Fast

Web-based: no software to install

Build ‘instant corpora’ from the web

Load your own corpus

Quota of space on SkE server

Word sketches

One-page, automatic accounts of a word’s grammatical and collocational behaviour

Free 30-day trial: sketchengine.co.uk


WebBootCaT

BootCaT integrated in SkE

BootCaT a corpus

Clean, de-dupe, POS-tag, then

Load into Sketch Engine


How big a corpus do we get?


Observation

Specialist domain, L1

Specialist domain, L2

Matching terminology


Going multilingual

Translate seeds

English: volcanology volcanologist "volcanic eruption" seismographs Eyjafjallajokull geodic "deformation monitoring" tephra magma stratigraphic tephrochronology geochronological "volcanic ash" ablation rhyolitic

French: vulcanologue volcanologie "éruption volcanique" sismographes Eyjafjallajokull "surveillance de la déformation" géodiques tephra magma téphrochronologie stratigraphique géochronologiques "de cendres volcaniques" ablation rhyolitiques

BootCaT for French


CCBC

Input: L1, L1 seeds, L2

Choose dictionary

Google as default

• Google dictionary (25 lg pairs, limited API)

• Google translate (1225 lg pairs, only 1 transl)

Option: edit translations

BootCaT 2 corpora

Bilingual word sketches
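
The seed-translation step, including the option to edit translations, might look like the sketch below. Everything here is illustrative: `translate_seeds` is a hypothetical name, the dictionary is a hand-written toy built from the volcanology seeds rather than a call to any Google service, and taking the first listed translation is an assumption.

```python
def translate_seeds(seeds, dictionary, edits=None):
    """Translate L1 seeds into L2, letting user edits override the dictionary.

    `dictionary` maps an L1 seed to a list of L2 translations;
    `edits` maps an L1 seed to the translation the user prefers.
    Returns (translated seeds, seeds with no translation found).
    """
    edits = edits or {}
    translated, missing = [], []
    for seed in seeds:
        if seed in edits:
            translated.append(edits[seed])          # user correction wins
        elif seed in dictionary:
            translated.append(dictionary[seed][0])  # first translation only
        else:
            missing.append(seed)                    # left for the user to supply
    return translated, missing


# Toy English-French data, hand-written for illustration.
dictionary = {
    "volcanology": ["volcanologie"],
    "volcanic eruption": ["éruption volcanique"],
    "volcanic ash": ["de cendres volcaniques", "cendres volcaniques"],
}
seeds = ["volcanology", "volcanic eruption", "volcanic ash", "tephrochronology"]
edits = {"volcanic ash": "cendres volcaniques"}

l2_seeds, untranslated = translate_seeds(seeds, dictionary, edits)
print(l2_seeds)
print(untranslated)
```

The L1 seeds and the resulting L2 seeds would then each drive a separate BootCaT run, giving the two comparable corpora.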


Bilingual word sketches (very first pass)

For L1 nodeword n

For each of its translations n1, n2, …

• For each collocate c in the word sketch for n

• For each of its translations c1, c2, …

• Does ci occur as collocate in word sketch for ni?

• If yes: output <c; ni, ci>

• Add L1 and L2 example sentences
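
The loop above can be sketched in a few lines. This is an illustration under assumptions, not the Sketch Engine code: word sketches are reduced to plain collocate lists and sets, the translation table is a toy dictionary, all names are hypothetical, and the step that attaches example sentences is omitted.

```python
def bilingual_sketch(node, sketch_l1, sketch_l2, translations):
    """Return (c, n_i, c_i) triples confirmed by both word sketches.

    sketch_l1:    L1 word -> list of its collocates
    sketch_l2:    L2 word -> set of its collocates
    translations: L1 word -> list of L2 translations
    """
    matches = []
    for n_i in translations.get(node, []):          # translations of the node word
        l2_collocates = sketch_l2.get(n_i, set())
        for c in sketch_l1.get(node, []):           # L1 collocates of the node word
            for c_i in translations.get(c, []):     # translations of the collocate
                if c_i in l2_collocates:            # confirmed in the L2 sketch
                    matches.append((c, n_i, c_i))
    return matches


# Toy data, invented for illustration.
sketch_l1 = {"eruption": ["volcanic", "violent", "predict"]}
sketch_l2 = {"éruption": {"volcanique", "violent", "prévoir"}}
translations = {
    "eruption": ["éruption"],
    "volcanic": ["volcanique"],
    "violent": ["violent", "brutal"],
    "predict": ["prédire"],
}

pairs = bilingual_sketch("eruption", sketch_l1, sketch_l2, translations)
print(pairs)
# [('volcanic', 'éruption', 'volcanique'), ('violent', 'éruption', 'violent')]
```

In the toy data "predict" is lost because the dictionary offers only "prédire" while the L2 sketch has "prévoir", which shows how much the result depends on dictionary coverage.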


Matching seeds – how?

User translates

• Yes but limited

Bilingual dictionary

• Yes but finding them??

Google dictionary

Machine translations

Wikipedia

• Matching articles


Evaluation

Extract terms for L1, L2

Ask expert

1. Are they terms?

2. Do the L1, L2 lists contain translations of each other?


3 lg-pairs

• En-Fr, En-De, En-Cz

• One expert for each pair

3 domains

• Volcanoes

• Stradivarius

• Pancreatic cancer

• Wikipedia: En and De only


Results

In brief

• Words good

• Multiwords bad


Unithood and termhood

To find terms

For multiwords only

• Does it hang together?

• Unithood

Is it distinctive?

• Keywords

• Termhood

We didn’t use termhood for multiwords but need to
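
The two tests can each be given a score. The slides do not name the statistics used, so the sketch below rests on assumptions: it uses logDice as the unithood measure and a smoothed ratio of per-million frequencies in the domain corpus versus a reference corpus as the termhood (keyword) measure. The function names and the example counts are invented.

```python
import math


def log_dice(f_xy, f_x, f_y):
    """Unithood: how strongly two words hang together (logDice).

    f_xy is the co-occurrence count; f_x and f_y are the word counts.
    """
    if f_xy == 0:
        return float("-inf")  # never seen together
    return 14 + math.log2(2 * f_xy / (f_x + f_y))


def keyness(freq_focus, size_focus, freq_ref, size_ref, n=1.0):
    """Termhood: how distinctive an item is for the domain corpus.

    Compares frequencies per million words; `n` is a smoothing
    constant that also avoids division by zero.
    """
    fpm_focus = freq_focus * 1_000_000 / size_focus
    fpm_ref = freq_ref * 1_000_000 / size_ref
    return (fpm_focus + n) / (fpm_ref + n)


# Invented counts for a candidate multiword term such as "volcanic ash".
unithood = log_dice(f_xy=10, f_x=10, f_y=10)
termhood = keyness(freq_focus=50, size_focus=1_000_000,
                   freq_ref=5, size_ref=10_000_000)
print(unithood, termhood)
```

Scoring multiword candidates on both measures, rather than on unithood alone, is one way to act on the point that termhood is still needed for multiwords.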


Next steps

Termhood for multiwords

WebBootCaT from Wikipedia

From collocations to terms

More-than-2-word collocations

• … deadline next Tuesday


Thank you

http://www.sketchengine.co.uk
