
Evaluating word sketches

Adam Kilgarriff
Lexical Computing Ltd
Lexicography MasterClass Ltd
Universities of Leeds and Sussex


Adam Kilgarriff, Geneva, April 2010

Overview

Research programme
Examples:
• Case study: word sketching
• Evaluating word sketches


What is language?

• In our heads
• In texts and sound signals
• Both


Methodology

Study language in our heads
• Competence
• Chomsky
• “rationalist” (Descartes, Leibniz)
• Odd method for objective science
• Practical problems: coverage, arbitrariness


Methodology

Study text
• “empiricist” (Locke, Hume)

• Physics: forces, matter
• Chemistry: chemicals, bonds
• Language: text, speech signals


It goes against the grain

What is important about a sentence?
• its meaning

Corpus methodology:
• Throw away individual sentence meaning
• Find patterns


Twenty years of rapid ascent

Computer power
Corpora
• bigger and bigger data sets
Language technology tools
• lemmatizers, POS-taggers, parsers
Machine learning, pattern-finding


A virtuous circle

[Diagram: a cycle linking Corpus, Linguistic processing (part-of-speech tagging, parsing, lemmatizing), Pattern finding and Lexicon. More data → gets richer each time round]


Case study: corpus lexicography

- four ages


Age 1: Pre-computer

Oxford English Dictionary:
• 20 million index cards


Age 2: KWIC Concordance

 1 arity, which will be used to take a party of under-privileged children
 2 from outside. You are invited to a party and after a couple of drinks
 3 tion, we believe politicians of all parties will listen to our views.
 4 ould be reaching agreement with all parties concerned, as to which
 5 lack people. I have certainly been party to one or two discussions
 6 . These should be discussed by both parties before entering into the
 7 presents They had hosted a cocktail party at Kensington palace, for
 8 akes. By midnight the end-of-course party is in full swing, but most
 9 e should be a right for the injured party to terminate the contract. A
10 by the Safran Peoples ' Liberation Party. This presents the powerful
11 s. Ahead I could see the rest of my party plodding towards the final
12 cial ethic. The two main political parties - the Tories and the
13 ritish successes in Perth The small party of British players competing
14 to help control. One member of the party went to summon the rescue
15 rket society fashion magazine. The party was held at his flat which
16 security and secrecy than any Tory Party Conference : it seems that
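A concordance of this kind can be produced with a few lines of code. The following is only a minimal sketch, not the software used for the slides: the function name `kwic`, the 36-character context width and the sample text are all assumptions made for illustration.

```python
import re

def kwic(text, pattern, width=36):
    """Return keyword-in-context rows: left context, keyword, right context."""
    rows = []
    for match in re.finditer(pattern, text, flags=re.IGNORECASE):
        # Cut a fixed-width slice on each side; words may be truncated, as in a real KWIC display.
        left = text[max(0, match.start() - width):match.start()].rstrip()
        right = text[match.end():match.end() + width].lstrip()
        # Right-align the left context so every keyword starts in the same column.
        rows.append(f"{left:>{width}}  {match.group(0)}  {right}")
    return rows

# Hypothetical sample text, loosely modelled on the concordance lines.
sample = (
    "You are invited to a party and after a couple of drinks you relax. "
    "We believe politicians of all parties will listen to our views. "
    "The party was held at his flat."
)

for number, row in enumerate(kwic(sample, r"\bpart(?:y|ies)\b"), start=1):
    print(f"{number:>2} {row}")
```

Aligning the keyword in one column is what lets a lexicographer scan the left and right contexts quickly.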


Age 2: KWIC Concordances

From 1980
Computerised
COBUILD project was innovator
• the coloured-pens method


 1 arity, which will be used to take a party of under-privileged children to D
 2 from outside. You are invited to a party and after a couple of drinks you d
 3 tion, we believe politicians of all parties will listen to our views.
 4 ould be reaching agreement with all parties concerned , as to which events,
 5 lack people. I have certainly been party to one or two discussions amongst
 6 . These should be discussed by both parties before entering into the relatio
 7 presents They had hosted a cocktail party at Kensington palace, for example
 8 akes. By midnight the end-of-course party is in full swing, but most cadet
 9 e should be a right for the injured party to terminate the contract. A mana
10 by the Safran Peoples ' Liberation Party . This presents the powerful neigh
11 s. Ahead I could see the rest of my party plodding towards the final slope t
12 cial ethic. The two main political parties - the Tories and the Liberals -
13 ritish successes in Perth The small party of British players competing in th
14 to help control. One member of the party went to summon the rescue team and
15 rket society fashion magazine. The party was held at his flat which was a l
16 security and secrecy than any Tory Party Conference : it seems that bootleg

The coloured pens method


Age 2: limitations

as corpora get bigger: too much data
• 50 lines for a word: read all
• 500 lines: could read all, takes a long time
• 5000 lines: no


Age 3: Collocation statistics

Problem: too much data - how to summarise?

Solution: list of words occurring in neighbourhood of headword, with frequencies

Sorted by salience
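As a rough illustration of the idea, the sketch below counts the words that follow a headword within a small window and sorts them by a salience score. The slides do not name the statistic, so pointwise mutual information is used here purely as a stand-in; the function name `collocates`, the parameters `window` and `min_hits`, and the toy token list are likewise assumptions.

```python
import math
from collections import Counter

def collocates(tokens, node, window=5, min_hits=2):
    """Rank words found up to `window` tokens to the right of `node`."""
    word_freq = Counter(tokens)
    pair_freq = Counter()
    for i, token in enumerate(tokens):
        if token == node:
            for neighbour in tokens[i + 1:i + 1 + window]:
                pair_freq[neighbour] += 1
    total = len(tokens)
    rows = []
    for word, hits in pair_freq.items():
        if hits < min_hits:
            continue  # ignore rare co-occurrences
        # PMI-style salience (assumed statistic): observed co-occurrence vs. chance.
        salience = math.log2(hits * total / (word_freq[node] * word_freq[word]))
        rows.append((word, hits, round(salience, 2)))
    return sorted(rows, key=lambda row: row[2], reverse=True)

# Toy corpus: "the" follows "save" more often than "money" does,
# but "the" is common everywhere, so its salience is lower.
tokens = (
    "save the money . save the lives . save the money now . "
    "the cat sat on the mat . the dog saw the cat"
).split()

for word, hits, salience in collocates(tokens, "save", window=2):
    print(word, hits, salience)
```

Sorting by salience rather than raw frequency is what pushes function words down the list and content-bearing collocates up.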


Collocation listing

For collocates of save (>5 hits), window 1-5 words to right of nodeword

word        word
your        money
estimated   jobs
face        annually
thousands   enormous
costs       lives
dollars     $1.2
life        forests


Age 4: The word sketch

A corpus-derived one-page summary of a word’s grammatical and collocational behaviour


Large corpus
Parse to find
• subjects, objects, heads, modifiers etc
One list for each grammatical relation
Statistics to sort each list, as before
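The steps above can be sketched as follows, assuming the parser output is already available as (headword, relation, collocate) triples. The relation names, the toy triples and the function name `word_sketch` are hypothetical, and logDice is used as the sorting statistic only as an assumption, since the slides do not say which statistic is applied.

```python
import math
from collections import Counter, defaultdict

def word_sketch(triples, headword, top_n=3):
    """Group a headword's collocates by grammatical relation and rank each list."""
    triple_freq = Counter(triples)
    head_rel_freq = Counter((head, rel) for head, rel, _ in triples)
    collocate_freq = Counter(coll for _, _, coll in triples)

    sketch = defaultdict(list)
    for (head, rel, coll), freq in triple_freq.items():
        if head != headword:
            continue
        # logDice (assumed statistic): 14 + log2 of the Dice coefficient.
        dice = 2 * freq / (head_rel_freq[(head, rel)] + collocate_freq[coll])
        sketch[rel].append((coll, freq, round(14 + math.log2(dice), 2)))
    # One list per grammatical relation, each sorted by score.
    return {
        rel: sorted(rows, key=lambda row: row[2], reverse=True)[:top_n]
        for rel, rows in sketch.items()
    }

# Hypothetical parser output.
triples = [
    ("party", "modifier", "political"), ("party", "modifier", "political"),
    ("party", "modifier", "cocktail"),
    ("party", "object_of", "host"), ("party", "object_of", "host"),
    ("party", "object_of", "attend"),
    ("meeting", "object_of", "attend"), ("meeting", "object_of", "attend"),
    ("meeting", "modifier", "political"),
]

for relation, rows in word_sketch(triples, "party").items():
    print(relation, rows)
```

Keeping a separate list per relation is what distinguishes a word sketch from a plain window-based collocate list.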


Macmillan English Dictionary for Advanced Learners

Ed: Rundell, 2002


Euralex 2002

“Can I have them for my language, please?”


The Sketch Engine

Input: any corpus, any language
• Lemmatised, part-of-speech tagged
• specification of grammatical relations
Word sketches integrated with corpus query system

Developer: Pavel Rychly, Brno


Users:

Dictionary publishers
• Oxford UP, Collins, Chambers, Macmillan, Cambridge UP
Universities
• Teaching, research
• Framenet
Language teaching

http://www.sketchengine.co.uk/
• Self-registration for free trial account


Lexical Computing Ltd

Since 2003
Directors
• Adam Kilgarriff (UK), Pavel Rychly (Cz)
• Diana McCarthy (UK, since Oct 2009)
Main activities
• Sketch engine service
• Corpus development

Research-led


(demo)


Evaluating word sketches

10 years 1999-2009

Feedback
• Good but anecdotal

Formal evaluation


Goal

Collocations dictionary
• Model: Oxford Collocations Dictionary
• Publication-quality
Ask a lexicographer
• For 42 headwords
• For 20 best collocates per headword
• “should we include this collocation in a published dictionary?”


Sample of headwords

Nouns, verbs, adjectives, random

High (Top 3000)
  N    space solution opinion mass corporation leader
  V    serve incorporate mix desire
  Adj  high detailed open academic
Mid (3000-9999)
  N    cattle repayment fundraising elder biologist sanitation
  V    grieve classify ascertain implant
  Adj  adjacent eldest prolific ill
Low (10,000-30,000)
  N    predicament adulterer bake bombshell candy shellfish
  V    slap outgrow plow traipse
  Adj  neoclassical votive adulterous expandable


Precision and recall

We test precision
Recall is harder
• How do we find all the collocations that the system should have found?
Current work
• 200 collocates per headword
• Selected from
  • All the corpora we have
  • Various parameter settings
• Plus just-in-time evaluation for 'new' collocates
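One possible reading of this pooling approach is sketched below. The slides do not say how the pool is assembled, so the round-robin merge in `build_pool`, the function names and the toy data are all assumptions; the point is only that recall can be estimated against the judged pool rather than against an unknowable complete list.

```python
from itertools import zip_longest

def build_pool(ranked_lists, size=200):
    """Merge ranked candidate lists round-robin, skipping duplicates, up to `size`."""
    pool = []
    for rank_slice in zip_longest(*ranked_lists):
        for candidate in rank_slice:
            if candidate is not None and candidate not in pool:
                pool.append(candidate)
                if len(pool) == size:
                    return pool
    return pool

def pooled_recall(system_output, pool_judgements):
    """Share of the pool's good collocates that one system run found."""
    good = {c for c, is_good in pool_judgements.items() if is_good}
    if not good:
        return None  # recall is undefined when the pool holds no good collocates
    return len(good & set(system_output)) / len(good)

# Hypothetical candidate lists from two corpora or parameter settings.
run_a = ["money", "life", "the"]
run_b = ["lives", "money", "jobs"]
pool = build_pool([run_a, run_b], size=5)

# Hypothetical lexicographer judgements on the pooled candidates.
judgements = {"money": True, "lives": True, "life": True, "the": False, "jobs": True}
print(pool, pooled_recall(run_a, judgements))
```

A collocate that a later run produces but the pool lacks would need a fresh judgement, which is where just-in-time evaluation comes in.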


Four languages, three families

Dutch ANW, 102m-word lexicographic corpus

English UKWaC, 1.5b web corpus

Japanese JpWaC, 400m web corpus

Slovene FidaPlus, 620m lexicographic corpus


User evaluation

Evaluate whole system
Will it help with my task
• Eg preparing a collocations dictionary
Contrast: developer evaluation
Can I make the system better?
• Evaluate each module separately
• Current work


Components

Corpus
NLP tools
• Segmenter, lemmatiser, POS-tagger
Sketch grammar
Statistics


Practicalities

Interface
Good, Good-but
• Merge to good
Maybe, Maybe-specialised, Bad
• Merge to bad
For each language
Two/three linguists/lexicographers
If they disagree
• Don't use for computing performance
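The scoring procedure described here can be sketched as follows: merge the five interface labels into good and bad, drop any collocation on which the judges disagree after merging, and compute precision over the rest. The function name, the data layout and the example judgements are hypothetical.

```python
# Merge the five interface labels into two classes, as on the slide.
MERGE = {
    "good": "good", "good-but": "good",
    "maybe": "bad", "maybe-specialised": "bad", "bad": "bad",
}

def precision(judgements):
    """judgements: {collocation: [label from each judge]}.

    Returns (precision, number of items used), after merging labels and
    leaving out items on which the judges disagree.
    """
    used = good = 0
    for labels in judgements.values():
        merged = {MERGE[label.lower()] for label in labels}
        if len(merged) != 1:
            continue  # judges disagree: not used for computing performance
        used += 1
        if merged == {"good"}:
            good += 1
    return (good / used if used else None), used

# Hypothetical judgements from two lexicographers.
example = {
    "save money": ["Good", "Good-but"],          # agree after merging: good
    "save life": ["Good", "Good"],               # agree: good
    "save your": ["Bad", "Maybe"],               # agree after merging: bad
    "save face": ["Good", "Maybe-specialised"],  # disagree: dropped
}
print(precision(example))
```

Note that disagreement is checked after merging, so a Good / Good-but pair counts as agreement.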


Results

Dutch     66%
English   71%
Japanese  87%
Slovene   71%


Corpus evaluation

Collocation-finding
• Typical corpus task
• Recall
Hold all else constant
• Statistic, NLP tools, grammar
Best results: best corpus
• (for collocation-finding)
Pomikalek: de-duplication


Other topics

Dante
• a new lexical database for English
Corpus building (mostly from the web)
• Instant corpora with WebBootCaT
• Bigger and better (English): BiWeC and New Model Corpus
• Corpus Factory (many languages)
Corpus comparison, similarity, evaluation
Statistics: collocations, keyword lists
• Word frequency lists
Word senses and lexicography
• SADD: semi-automatic dictionary drafting


Thank you

http://www.sketchengine.co.uk


Words and word senses

automatic thesauruses
• words
manual thesauruses
• simple hierarchy is appealing
• homonyms
• “aha! objects must be word senses”


Problems

• Theoretical
• Practical


Theoretical


Wittgenstein

Don’t ask for the meaning, ask for the use


Practical


Problems: Practical

a thesaurus is a tool
if the tool organises word senses
• you must do WSD (word sense disambiguation) before you can use it
WSD: state of the art, optimal conditions: 80%


Problems

“To use this tool, first replace one fifth of your input with junk”


Avoid word senses

This word has three meanings/senses
This word has three kinds of use
• well founded
• empirical
• we can build on it