Using Corpora in Linguistics and Lexicography
Adam Kilgarriff
Lexical Computing Ltd
Universities of Leeds, Sussex, UK


IDS Mannheim 2010 Kilgarriff 2

Outline

Precision and recall
History of corpus lexicography
The Sketch Engine
– Demo
Web corpora
Corpus and dictionary


Find me all the fat cats

a request for information


High recall

Lots of responses
Maybe not all good


High precision

Fewer hits
Higher confidence


Information-seeking

            Recall   Precision
Computers   good     bad
People      bad      good
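
The two measures can be made concrete with a small sketch. The cat names and the helper `precision_recall` below are invented for illustration; they are not part of the talk.

```python
def precision_recall(returned, relevant):
    """Return (precision, recall) for a set of returned items."""
    returned, relevant = set(returned), set(relevant)
    hits = returned & relevant
    precision = len(hits) / len(returned) if returned else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical data: the cats that really are fat.
fat_cats = {"Tom", "Garfield", "Felix", "Misty"}

# A high-recall answer: finds every fat cat, plus some thin ones.
broad = {"Tom", "Garfield", "Felix", "Misty", "Slim", "Twiggy", "Bones", "Reed"}
# A high-precision answer: few hits, all of them correct.
narrow = {"Tom", "Garfield"}

print(precision_recall(broad, fat_cats))   # (0.5, 1.0)
print(precision_recall(narrow, fat_cats))  # (1.0, 0.5)
```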


Lexicography: finding facts about words

– collocations
– grammatical patterns
– idioms
– synonyms
– meanings
– translations


Four ages of corpus lexicography


Age 1: Pre-computer

Oxford English Dictionary:
• 5 million index cards


Age 2: KWIC Concordances

From 1980
Computerised
Overhauled lexicography
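
A keyword-in-context (KWIC) concordance lines up every occurrence of a node word with its left and right context. The function below is a minimal sketch; the name `kwic`, the context width and the sample sentence are assumptions made for the example.

```python
def kwic(tokens, node, width=4):
    """Return keyword-in-context lines for every occurrence of `node`."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok.lower() == node:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            # Right-align the left context so the node words line up.
            lines.append(f"{left:>30}  {tok}  {right}")
    return lines

text = "We must save money now so that we can save the forests later"
for line in kwic(text.split(), "save"):
    print(line)
```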


Age 2: limitations

As corpora get bigger: too much data
• 50 lines for a word: read all
• 500 lines: could read all, but it takes a long time
• 5000 lines: no


Age 3: Collocation statistics

Problem: too much data - how to summarise?

Solution: list of words occurring in neighbourhood of headword, with frequencies

Sorted by salience
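
As a sketch of the idea, the function below counts the words found in a window to the right of the node word and sorts them by a salience score. The slides do not say which statistic was used; pointwise mutual information is used here only as a stand-in, and the function name and toy sentence are assumptions.

```python
import math
from collections import Counter

def collocates(tokens, node, window=5, min_freq=1):
    """Count words within `window` tokens to the right of `node`, then score them.

    The score is pointwise mutual information, a stand-in salience measure.
    """
    word_freq = Counter(tokens)
    n = len(tokens)
    pair_freq = Counter()
    for i, tok in enumerate(tokens):
        if tok == node:
            pair_freq.update(tokens[i + 1:i + 1 + window])
    scored = []
    for word, f_pair in pair_freq.items():
        if f_pair < min_freq:
            continue
        pmi = math.log2(f_pair * n / (word_freq[node] * word_freq[word]))
        scored.append((word, f_pair, round(pmi, 2)))
    # Most salient first; ties broken by raw frequency.
    return sorted(scored, key=lambda row: (row[2], row[1]), reverse=True)

tokens = "save money and save lives and save money".split()
for word, freq, score in collocates(tokens, "save", window=1):
    print(word, freq, score)
```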


Age-3 collocation statistics: limitations

Lists contain junk
Unsorted for type
– mixes together adverbs, subjects, objects, prepositions

What we really want:
– noise-free lists
– one list for each grammatical relation


Collocation listing

For collocates of save (>5 hits), window 1-5 words to right of nodeword

word        word
forests     life
$1.2        dollars
lives       costs
enormous    thousands
annually    face
jobs        estimated
money       your


Age 4: The word sketch

Large well-balanced corpus
Parse to find
– subjects, objects, heads, modifiers etc
One list for each grammatical relation
Statistics to sort each list, as before
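
The step from one mixed list to one list per grammatical relation can be sketched as follows. The triples are hypothetical parser output, `word_sketch` is an invented name, and raw frequency stands in for the salience statistic that would normally rank each list.

```python
from collections import Counter, defaultdict

def word_sketch(triples, headword, top=3):
    """Group a headword's collocates into one ranked list per grammatical relation.

    `triples` are (relation, head, collocate) tuples, assumed to come from a
    parser. Lists are ranked by raw frequency here, as a simplification.
    """
    by_relation = defaultdict(Counter)
    for relation, head, collocate in triples:
        if head == headword:
            by_relation[relation][collocate] += 1
    return {rel: counts.most_common(top) for rel, counts in by_relation.items()}

# Hypothetical parser output for the verb "save".
triples = [
    ("object", "save", "money"), ("object", "save", "money"),
    ("object", "save", "life"), ("subject", "save", "doctor"),
    ("modifier", "save", "annually"), ("object", "spend", "money"),
]
sketch = word_sketch(triples, "save")
for relation, collocate_list in sketch.items():
    print(relation, collocate_list)
```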


Macmillan English Dictionary for Advanced Learners

Ed: Rundell, 2002


Working practice

Lexicographers mainly used sketches, not concordances
– Missed less, more consistent
– Faster


Euralex 2002


Euralex 2002

“Can I have them for my language, please?”


The Sketch Engine

Input:
– any corpus, any language (lemmatised, part-of-speech tagged)
– specification of grammatical relations
Word sketches integrated with corpus query system
– Supports complex searching, sorting etc
Credit: Pavel Rychly, Masaryk Univ


Customers

Dictionary publishers
– Oxford University Press
– Cambridge University Press
– Collins
– Macmillan
– FrameNet Project (Berkeley, US)
– National dictionary projects in Czech Republic, Estonia, Ireland, Netherlands, Slovakia, Slovenia
Universities
– Teaching and research
– Languages, linguistics, language technology
– UK, Germany, US, Greece, Taiwan, Japan, China, Slovenia,…
Other
– Language teaching, textbook writing
– Information management, web search companies
– Automatic translation


Web corpora

Replaceable or replacable?
– http://googlefight.com
– http://looglefight.com


The web is
– Very very large
– Most languages
– Most language types
– Up-to-date
– Free
– Instant access


Web corpus types

Large, general corpora
Small, specialised corpora
– Specially for translators


Basic steps

Gather pages
– CSE hits
– Select and gather whole sites
– General crawl
Filter
De-duplicate
Linguistic processing
Load into corpus tool
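
The middle steps of this pipeline can be sketched on already-gathered pages. Everything here is a simplification: the word-count threshold, exact-match hashing (real pipelines typically also need near-duplicate detection) and the regex tokeniser standing in for linguistic processing are all assumptions.

```python
import hashlib
import re

def filter_pages(pages, min_words=5):
    """Keep pages that have at least `min_words` words (a crude quality filter)."""
    return [p for p in pages if len(p.split()) >= min_words]

def deduplicate(pages):
    """Drop exact duplicates after normalising case and whitespace."""
    seen, unique = set(), []
    for page in pages:
        normalised = re.sub(r"\s+", " ", page.strip().lower())
        digest = hashlib.sha1(normalised.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(page)
    return unique

def tokenise(page):
    """Stand-in for linguistic processing: split into lower-cased word tokens."""
    return re.findall(r"\w+", page.lower())

# Hypothetical gathered pages.
pages = [
    "Click here",
    "The cat sat on the mat and purred.",
    "the cat  sat on the mat and purred.",
    "Corpora are collections of text used in linguistics.",
]
corpus = [tokenise(p) for p in deduplicate(filter_pages(pages))]
print(len(corpus), "documents kept")
```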


WaC family corpora

100m – 2b word corpora
2-month project each
All major world languages available in Sketch Engine
– Currently 30 languages
– Growing monthly
Pioneers: Marco Baroni, Serge Sharoff
Corpus Factory
Seeds:
– mid-frequency words from ‘core vocab’ lists and corpora
Google on seed words, then crawl
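
Seed selection might look roughly like the sketch below. The rank band, the number of words per query and both function names are assumptions; combining several seeds into each query follows the BootCaT idea rather than anything stated on this slide.

```python
import random

def pick_seeds(freq_list, lo_rank=1000, hi_rank=6000):
    """Return words whose frequency rank falls in the band [lo_rank, hi_rank)."""
    ranked = sorted(freq_list, key=freq_list.get, reverse=True)
    return ranked[lo_rank:hi_rank]

def make_queries(seeds, n_queries=10, words_per_query=3, rng=None):
    """Build search queries, each a random combination of seed words."""
    rng = rng or random.Random(0)
    return [" ".join(rng.sample(seeds, words_per_query)) for _ in range(n_queries)]

# Hypothetical, tiny frequency list; a real one would have thousands of entries.
freq = {"the": 100, "of": 90, "house": 50, "garden": 40,
        "walk": 30, "blue": 20, "zygote": 1}
seeds = pick_seeds(freq, lo_rank=2, hi_rank=6)
queries = make_queries(seeds, n_queries=3, words_per_query=2)
print(seeds)
```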


Corpora

Arabic      174   Hindi         31   Russian      188
Chinese     456   Indonesian   102   Slovak       536
Czech       800   Irish         34   Slovene      738
Dutch       128   Italian     1910   Spanish      117
English    5508   Japanese     409   Swedish      114
French      126   Norwegian     95   Telugu         5
German     1627   Persian        6   Thai         108
Greek       149   Portuguese    66   Vietnamese   174
Estonian     11   Romanian      53   Welsh         63
Korean       77   Polish       156   Malay        230


How good are they?

How to assess?
– Hard question, open research topic
Good coverage
– Newspapers: news, politics bias
– Web corpora: also cover personal, kitchen vocab
Web corpus / BNC / journalism corpus
– First two are close


Evaluating word sketches

11 years
– 1999-2010
Feedback
– Good but anecdotal
Formal evaluation
Method also lets us evaluate corpora


Goal

Collocations dictionary
– Model: Oxford Collocations Dictionary
– Publication-quality
Ask a lexicographer
– For 42 headwords
– For 20 best collocates per headword
– “Should we include this collocation in a published dictionary?”


Sample of headwords

Nouns, verbs, adjectives; random

High (top 3000)
– N: space solution opinion mass corporation leader
– V: serve incorporate mix desire
– Adj: high detailed open academic
Mid (3000-9999)
– N: cattle repayment fundraising elder biologist sanitation
– V: grieve classify ascertain implant
– Adj: adjacent eldest prolific ill
Low (10,000-30,000)
– N: predicament adulterer bake bombshell candy shellfish
– V: slap outgrow plow traipse
– Adj: neoclassical votive adulterous expandable


Precision and recall

We test precision
Recall is harder
– How do we find all the collocations that the system should have found?
Current work
• 200 collocates per headword
• Selected from
  • All the corpora we have
  • Various parameter settings
• Plus just-in-time evaluation for 'new' collocates


Four languages, three families

Dutch
– ANW, 102m-word lexicographic corpus
English
– UKWaC, 1.5b web corpus
Japanese
– JpWaC, 400m web corpus
Slovene
– FidaPlus, 620m lexicographic corpus


User evaluation

Evaluate whole system
– Will it help with my task?
– E.g. preparing a collocations dictionary
Contrast: developer evaluation
– Can I make the system better?
– Evaluate each module separately
– Current work


Components

Corpus
NLP tools
– Segmenter, lemmatiser, POS-tagger
Sketch grammar
Statistics


Practicalities

Interface
– Good, Good-but: merge to good
– Maybe, Maybe-specialised, Bad: merge to bad
For each language
– Two/three linguists/lexicographers
– If they disagree: don't use for computing performance
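
The scoring procedure described here can be sketched as below: labels are merged to good or bad, collocates on which the judges disagree are left out, and precision is the share of the remainder judged good. The judgements are invented and the function name is an assumption.

```python
MERGE = {"Good": "good", "Good-but": "good",
         "Maybe": "bad", "Maybe-specialised": "bad", "Bad": "bad"}

def precision(judgements):
    """Share of collocates judged good, ignoring items the judges disagree on.

    `judgements` maps each collocate to the list of labels given by the judges.
    """
    good = counted = 0
    for labels in judgements.values():
        merged = {MERGE[label] for label in labels}
        if len(merged) != 1:  # judges disagree: leave the item out
            continue
        counted += 1
        good += merged == {"good"}
    return good / counted if counted else 0.0

# Hypothetical judgements from two judges for collocates of one headword.
judgements = {
    "save money": ["Good", "Good-but"],
    "save life": ["Good", "Good"],
    "save your": ["Bad", "Maybe"],
    "save face": ["Good", "Maybe-specialised"],  # disagreement, excluded
}
print(round(precision(judgements), 2))
```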


Results

Dutch      66%
English    71%
Japanese   87%
Slovene    71%


Two thirds of a collocations dictionary can be gathered automatically


Small specialised corpora

Terminologists
Translators needing target-language domain-specific vocab
Specialist dictionaries
– Don’t exist
– Expensive/inaccessible
– Out of date
Instant small web corpora
– BootCaT: Baroni and Bernardini 2004
– WebBootCaT demo


Cyborgs

A creature that is partly human and partly machine
– Macmillan English Dictionary


Cyborgs and the Information Society

The dictionary-making agent is part human (for precision), part computer (for recall).


Treat your computer with respect. You and it can do great things together.


Thank you

http://www.sketchengine.co.uk


Corpus and dictionary

Established model:
– Lexicographers use corpora, users use dictionaries
But
– Users like collocations, examples
– But are not corpus linguists
Explore the space between corpus and dictionary


Collocationality

Which words are most ‘collocational’?
Dictionary publishers
– Where to put ‘collocation boxes’
Language learners


Verb      Freq   MLE Prob x log = entropy
Take      2084   -.469
Gain       131   -.169
Offer      117   -.157
See        110   -.150
Enjoy       67   -.104
…            …   …
Clarify      1   -0.0031
…            …   …
Total     3730   -3.909

Calculation of entropy for advantage (object relation)
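
Each row of the table can be reproduced by multiplying the maximum-likelihood probability (frequency divided by the total of 3730) by its logarithm. The base-2 logarithm is an inference: it reproduces the -.469 shown for take. Entropy is conventionally reported as the negative of the column sum. The function names below are assumptions, and only verbs listed in the table are used because the full distribution is not given.

```python
import math

def plogp(freq, total):
    """One row of the table: MLE probability times its base-2 log."""
    p = freq / total
    return p * math.log2(p)

def entropy(freqs):
    """Entropy in bits of the distribution estimated from `freqs`."""
    total = sum(freqs)
    return -sum(plogp(f, total) for f in freqs if f > 0)

TOTAL = 3730  # total frequency from the table's last row
for verb, freq in [("take", 2084), ("gain", 131), ("offer", 117)]:
    print(verb, round(plogp(freq, TOTAL), 3))

# Sanity check on a toy distribution: two equally likely verbs give one bit.
print(entropy([2, 2]))  # 1.0
```

A noun whose object-relation verbs are dominated by one or two items, as advantage is by take, has low entropy, which is what makes entropy usable as a measure of how collocational a word is.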


place (17881), attention (8476), door (8426), care (4884), step (4277), advantage (3730), rise (3334), attempt (2825), impression (2596), notice (2462), chapter (2318), mistake (2205), breath (2140), hold (1949), birth (1016), living (953), indication (812), tribute (720), debut (714), button (661), eyebrow (649), anniversary (637), mention (615), glimpse (531), suicide (486), toll (472), refuge (470), spokesman (453), sigh (436), birthday (429), wicket (412), appendix (410), pardon (399), precaution (396), temptation (374), goodbye (372), fuss (366), resemblance (350), goodness (288), precedence (285), havoc (270), tennis (266), comeback (260), farewell (228), prominence (228), go-ahead (202), sip (198),


What is there on the web?

Web1T
– Present from Google
– All 1-, 2-, 3-, 4-, 5-grams with f>40 in one trillion (10^12) words of English: 1,000,000,000,000
Compare with BNC
– Take top 50,000 items of each
– 105 Web1T words not in BNC top50k
– 50 words with highest Web1T:BNC ratio
– 50 words with lowest ratio
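
The comparison can be sketched for any two word-frequency lists. The function name, the toy counts and the choice to normalise by each list's total size before taking ratios are assumptions; the slide does not say how the ratios were computed.

```python
def compare(freq_a, freq_b, top_n=50000, k=50):
    """Compare two word-frequency dicts.

    Returns (words in A's top_n but not in B's, k highest-ratio words,
    k lowest-ratio words). Ratios use relative frequencies and cover
    only words found in both top lists.
    """
    def top(freq):
        return sorted(freq, key=freq.get, reverse=True)[:top_n]

    top_a, top_b = top(freq_a), set(top(freq_b))
    only_a = [w for w in top_a if w not in top_b]
    size_a, size_b = sum(freq_a.values()), sum(freq_b.values())
    ratio = {w: (freq_a[w] / size_a) / (freq_b[w] / size_b)
             for w in top_a if w in top_b}
    ranked = sorted(ratio, key=ratio.get, reverse=True)
    return only_a, ranked[:k], ranked[-k:]

# Hypothetical miniature frequency lists standing in for Web1T and the BNC.
web = {"the": 1000, "url": 300, "she": 50, "spyware": 40}
bnc = {"the": 1000, "she": 400, "url": 10, "council": 30}
only_web, high, low = compare(web, bnc, top_n=3, k=1)
print(only_web, high, low)
```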


Web-high (155 terms)

61 web and computing
– config browser spyware url www forum
38 porn
22 US English
18 business/products common on web
– poker viagra lingerie ringtone dvd casino rental collectible tiffany
– NB: BNC is old
4 legal
– trademarks pursuant accordance herein


Web-low

Exclude British English, transcription/tokenisation anomalies

– herself stood seemed she looked yesterday sat considerable had council felt perhaps walked round her towards claimed knew obviously remained himself he him


Observations

Pronouns and past tense verbs
– Fiction
Masc vs fem
Yesterday
– Probably daily newspapers
Constancy of ratios:
– He/him/himself
– She/her/herself