54
1 What computers can and cannot do for lexicography or Us precision, them recall Adam Kilgarriff Lexicography Masterclass Ltd and University of Brighton, UK

1 What computers can and cannot do for lexicography or Us precision, them recall Adam Kilgarriff Lexicography Masterclass Ltd and University of Brighton,

Embed Size (px)

Citation preview

Page 1: 1 What computers can and cannot do for lexicography or Us precision, them recall Adam Kilgarriff Lexicography Masterclass Ltd and University of Brighton,

1

What computers can and cannot do for lexicography

or

Us precision, them recall

Adam Kilgarriff

Lexicography Masterclass Ltd

and

University of Brighton, UK

Page 2: 1 What computers can and cannot do for lexicography or Us precision, them recall Adam Kilgarriff Lexicography Masterclass Ltd and University of Brighton,

27-29 Aug 2003 Adam Kilgarriff: Us precision them recall 2

Outline Precision and recall History of corpus lexicography Natural Language Processing Cyborgs

Page 3: 1 What computers can and cannot do for lexicography or Us precision, them recall Adam Kilgarriff Lexicography Masterclass Ltd and University of Brighton,

27-29 Aug 2003 Adam Kilgarriff: Us precision them recall 3

Find me all the fat cats

a request for information

Page 4: 1 What computers can and cannot do for lexicography or Us precision, them recall Adam Kilgarriff Lexicography Masterclass Ltd and University of Brighton,

27-29 Aug 2003 Adam Kilgarriff: Us precision them recall 4

High recall

Lots of responses Maybe not all good

Page 5: 1 What computers can and cannot do for lexicography or Us precision, them recall Adam Kilgarriff Lexicography Masterclass Ltd and University of Brighton,

27-29 Aug 2003 Adam Kilgarriff: Us precision them recall 5

High precision

Fewer hits Higher confidence

Page 6: 1 What computers can and cannot do for lexicography or Us precision, them recall Adam Kilgarriff Lexicography Masterclass Ltd and University of Brighton,

27-29 Aug 2003 Adam Kilgarriff: Us precision them recall 6

Us precision, them recall

Recall Precision

Computers good bad

People bad good

Page 7: 1 What computers can and cannot do for lexicography or Us precision, them recall Adam Kilgarriff Lexicography Masterclass Ltd and University of Brighton,

27-29 Aug 2003 Adam Kilgarriff: Us precision them recall 7

Us precision, them recall

True in many areas– web searching, google– finding an image to illustrate a talk

Nowhere more so than

lexicography

Page 8: 1 What computers can and cannot do for lexicography or Us precision, them recall Adam Kilgarriff Lexicography Masterclass Ltd and University of Brighton,

27-29 Aug 2003 Adam Kilgarriff: Us precision them recall 8

Lexicography: finding facts about words

collocations grammatical patterns idioms synonyms antonyms meanings translations

Page 9: 1 What computers can and cannot do for lexicography or Us precision, them recall Adam Kilgarriff Lexicography Masterclass Ltd and University of Brighton,

27-29 Aug 2003 Adam Kilgarriff: Us precision them recall 9

Outline Precision and recall History of corpus lexicography Natural Language Processing Cyborgs

Page 10: 1 What computers can and cannot do for lexicography or Us precision, them recall Adam Kilgarriff Lexicography Masterclass Ltd and University of Brighton,

27-29 Aug 2003 Adam Kilgarriff: Us precision them recall 10

Four ages of corpus lexicography

Page 11: 1 What computers can and cannot do for lexicography or Us precision, them recall Adam Kilgarriff Lexicography Masterclass Ltd and University of Brighton,

27-29 Aug 2003 Adam Kilgarriff: Us precision them recall 11

Age 1:Precomputer

Oxford English Dictionary:• 5 million index cards

Page 12: 1 What computers can and cannot do for lexicography or Us precision, them recall Adam Kilgarriff Lexicography Masterclass Ltd and University of Brighton,

27-29 Aug 2003 Adam Kilgarriff: Us precision them recall 12

Age 2: KWIC Concordances From 1980 Computerised COBUILD project was innovator asian-kwic.html the coloured-pens method

Page 13: 1 What computers can and cannot do for lexicography or Us precision, them recall Adam Kilgarriff Lexicography Masterclass Ltd and University of Brighton,

27-29 Aug 2003 Adam Kilgarriff: Us precision them recall 13

Age 2: limitations

as corpora get bigger:too much data

• 50 lines for a word: :read all • 500 lines: could read all, takes a long time,

slow • 5000 lines: no

Page 14: 1 What computers can and cannot do for lexicography or Us precision, them recall Adam Kilgarriff Lexicography Masterclass Ltd and University of Brighton,

27-29 Aug 2003 Adam Kilgarriff: Us precision them recall 14

Age 3: Collocation statistics

Problem:too much data - how to summarise?

Solution:list of words occurring in neighbourhood of headword, with frequencies

Sorted by salience

Page 15: 1 What computers can and cannot do for lexicography or Us precision, them recall Adam Kilgarriff Lexicography Masterclass Ltd and University of Brighton,

27-29 Aug 2003 Adam Kilgarriff: Us precision them recall 15

Collocation listingFor right collocates of save (>5 hits)

word fr(x+y) fr(y) word fr(x+y) fr(y)

forests 6 170 life 36 4875

$1.2 6 180 dollars 8 1668

lives 37 1697 costs 7 1719

enormous 6 301 thousands 6 1481

annually 7 447 face 9 2590

jobs 20 2001 estimated 6 2387

money 64 6776 your 7 3141

Page 16: 1 What computers can and cannot do for lexicography or Us precision, them recall Adam Kilgarriff Lexicography Masterclass Ltd and University of Brighton,

27-29 Aug 2003 Adam Kilgarriff: Us precision them recall 16

Collocation statistics Which words?

– next word – last word – window, +1 to +5; window, -5 to -1

How sorted? most common collocates --but for most

nouns it's the most salient collocates --how to

measure salience?

Page 17: 1 What computers can and cannot do for lexicography or Us precision, them recall Adam Kilgarriff Lexicography Masterclass Ltd and University of Brighton,

27-29 Aug 2003 Adam Kilgarriff: Us precision them recall 17

Mutual Information

Church and Hanks 1989 How much more often does a word pair

occur, than one might expect by chance “Chance” of x and y occurring together:

p(x) * p(y) Probabilities approximated by

frequencies

p(x) =(approx) f(x)/N

Page 18: 1 What computers can and cannot do for lexicography or Us precision, them recall Adam Kilgarriff Lexicography Masterclass Ltd and University of Brighton,

27-29 Aug 2003 Adam Kilgarriff: Us precision them recall 18

Mutual Information

X fr eat fr X fr eat X

MI* rank

it 1000 400,000

400 1/

1M

3

meat 1000 6000 120 20/

1M

2

sushi 1000 100 8 80/

1M

1

* numbers are log-proportional to MI

Page 19: 1 What computers can and cannot do for lexicography or Us precision, them recall Adam Kilgarriff Lexicography Masterclass Ltd and University of Brighton,

27-29 Aug 2003 Adam Kilgarriff: Us precision them recall 19

Problem mathematical salience = lexicographic

salience? no! higher-frequency items are

lexicographically more salient Solution multiply MI by raw frequency

Page 20: 1 What computers can and cannot do for lexicography or Us precision, them recall Adam Kilgarriff Lexicography Masterclass Ltd and University of Brighton,

27-29 Aug 2003 Adam Kilgarriff: Us precision them recall 20

Mutual Information

X fr eat fr X fr eat X

MI rank

it 1000 400,000

400 1/M 3

meat 1000 6000 120 20/M 2

sushi 1000 100 8 80/M 1

MI x fr

new rank

400/M

3

2400/M

1

640/M

2

Page 21: 1 What computers can and cannot do for lexicography or Us precision, them recall Adam Kilgarriff Lexicography Masterclass Ltd and University of Brighton,

27-29 Aug 2003 Adam Kilgarriff: Us precision them recall 21

Collocation listingFor right collocates of save (>5 hits)

word fr(x+y) fr(y) word fr(x+y) fr(y)

forests 6 170 life 36 4875

$1.2 6 180 dollars 8 1668

lives 37 1697 costs 7 1719

enormous 6 301 thousands 6 1481

annually 7 447 face 9 2590

jobs 20 2001 estimated 6 2387

money 64 6776 your 7 3141

Page 22: 1 What computers can and cannot do for lexicography or Us precision, them recall Adam Kilgarriff Lexicography Masterclass Ltd and University of Brighton,

27-29 Aug 2003 Adam Kilgarriff: Us precision them recall 22

Age-3 collocation statistics: limitations

Lists contain junk unsorted for type --MI lists mix adverbs,

subjects, objects, prepositions

What we really want: noise-free lists one list for each grammatical relation

Page 23: 1 What computers can and cannot do for lexicography or Us precision, them recall Adam Kilgarriff Lexicography Masterclass Ltd and University of Brighton,

27-29 Aug 2003 Adam Kilgarriff: Us precision them recall 23

Age 4: The word sketch Large well-balanced corpus Parse to find

– subjects, objects, heads, modifiers etc

One list for each grammatical relation Statistics to sort each list, as before

Page 24: 1 What computers can and cannot do for lexicography or Us precision, them recall Adam Kilgarriff Lexicography Masterclass Ltd and University of Brighton,

27-29 Aug 2003 Adam Kilgarriff: Us precision them recall 24

Can we do it? high-accuracy parsing is hard lots of NLP work, many parsing frameworks

exist if any parser can handle large corpus, it's

probably good enough

--- sorting, statistics, make us error-tolerant

Page 25: 1 What computers can and cannot do for lexicography or Us precision, them recall Adam Kilgarriff Lexicography Masterclass Ltd and University of Brighton,

27-29 Aug 2003 Adam Kilgarriff: Us precision them recall 25

Can we do it? high-accuracy parsing is hard lots of NLP work, many parsing frameworks

exist if any parser can handle large corpus, it's

probably good enough--- sorting, statistics, make us error-tolerant

Poor man’s parsing:– object (of active verb) = last noun in any sequence

of nouns, adjectives, determiners, numbers and adverbs following the verb

Page 26: 1 What computers can and cannot do for lexicography or Us precision, them recall Adam Kilgarriff Lexicography Masterclass Ltd and University of Brighton,

27-29 Aug 2003 Adam Kilgarriff: Us precision them recall 26

Can we do it? high-accuracy parsing is hard lots of NLP work, many parsing frameworks

exist if any parser can handle large corpus, it's

probably good enough--- sorting, statistics, make us error-tolerant

Poor man’s parsing:– object (of active verb) = last noun in any sequence

of nouns, adjectives, determiners, numbers and adverbs following the verb

Page 27: 1 What computers can and cannot do for lexicography or Us precision, them recall Adam Kilgarriff Lexicography Masterclass Ltd and University of Brighton,

27-29 Aug 2003 Adam Kilgarriff: Us precision them recall 27

The word sketch British National Corpus (BNC)

– 100 M words, already POS-tagged lemmatized using John Carroll's lemmatizer poor man’s parsing database of 70 million triples

<object, sip, coffee> <subject, arrive, coffee>

<and-or, tea, coffee> <modifier, coffee, instant>

coffee_n.html

Page 28: 1 What computers can and cannot do for lexicography or Us precision, them recall Adam Kilgarriff Lexicography Masterclass Ltd and University of Brighton,

27-29 Aug 2003 Adam Kilgarriff: Us precision them recall 28

Macmillan Dictionary of English for Advanced Leaners, 2002

Editor: Rundell.

Work done 1999. 6000 word sketches

– most common nouns, verbs, adjectives of English HTML files with hyperlinked corpus examples lexicographers used them extensively

– main use of corpus positive feedback

Page 29: 1 What computers can and cannot do for lexicography or Us precision, them recall Adam Kilgarriff Lexicography Masterclass Ltd and University of Brighton,

27-29 Aug 2003 Adam Kilgarriff: Us precision them recall 29

The WASPbench with David Tugwell, UK EPSRC, grant M54971

A lexicographer's workbench runtime creation of word sketches integration with Word Sense Disambiguation

technology output is "disambiguating dictionary" - analysis

of word's meaning into senses, plus computer program for disambiguating contextualised instances of the word

First release now available. http://wasps.itri.brighton.ac.uk/

Page 30: 1 What computers can and cannot do for lexicography or Us precision, them recall Adam Kilgarriff Lexicography Masterclass Ltd and University of Brighton,

27-29 Aug 2003 Adam Kilgarriff: Us precision them recall 30

The Sketch Engine Input:

– any corpus, any language Lemmatised, part-of-speech tagged

– specification of grammatical relations Word sketches integrated with Corpus query system

– Supports complex searching, sorting etc First release early 2004

Page 31: 1 What computers can and cannot do for lexicography or Us precision, them recall Adam Kilgarriff Lexicography Masterclass Ltd and University of Brighton,

27-29 Aug 2003 Adam Kilgarriff: Us precision them recall 31

Outline Precision and recall History of corpus lexicography Natural Language Processing Cyborgs

Page 32: 1 What computers can and cannot do for lexicography or Us precision, them recall Adam Kilgarriff Lexicography Masterclass Ltd and University of Brighton,

27-29 Aug 2003 Adam Kilgarriff: Us precision them recall 32

Natural Language Processing

The academic discipline which provides the tools– Also known as Computational Linguistics,

Human Language Technology (HLT), Language Engineering

Good at evaluation of its tools Good news for lexicography:

– identify the best tools, apply them to our corpora

Page 33: 1 What computers can and cannot do for lexicography or Us precision, them recall Adam Kilgarriff Lexicography Masterclass Ltd and University of Brighton,

27-29 Aug 2003 Adam Kilgarriff: Us precision them recall 33

An Anglophone Apology Technology, tools, resources most often

available for English This talk centres on English Other languages often present new

problems– Finding word delimiters for Chinese is hard– Finding bunsetsu for Japanese is hard

Fewer resources available, less work done Recommendation:

– find the local experts for your language

Page 34: 1 What computers can and cannot do for lexicography or Us precision, them recall Adam Kilgarriff Lexicography Masterclass Ltd and University of Brighton,

27-29 Aug 2003 Adam Kilgarriff: Us precision them recall 34

Recap: Lexicography: finding facts about words

collocations grammatical patterns idioms synonyms antonyms meanings translations

Page 35: 1 What computers can and cannot do for lexicography or Us precision, them recall Adam Kilgarriff Lexicography Masterclass Ltd and University of Brighton,

27-29 Aug 2003 Adam Kilgarriff: Us precision them recall 35

Recap: Lexicography: finding facts about words

collocations - sketches grammatical patterns - sketches idioms synonyms antonyms meanings translations

Page 36: 1 What computers can and cannot do for lexicography or Us precision, them recall Adam Kilgarriff Lexicography Masterclass Ltd and University of Brighton,

27-29 Aug 2003 Adam Kilgarriff: Us precision them recall 36

Idioms Extreme case of collocation/multi word

expressions Sequence of workshops on

collocations, MWE Technical terms (of great interest to

technologists, technical): TERMIGHT

Page 37: 1 What computers can and cannot do for lexicography or Us precision, them recall Adam Kilgarriff Lexicography Masterclass Ltd and University of Brighton,

27-29 Aug 2003 Adam Kilgarriff: Us precision them recall 37

Antonyms Essential semantic relation

Page 38: 1 What computers can and cannot do for lexicography or Us precision, them recall Adam Kilgarriff Lexicography Masterclass Ltd and University of Brighton,

27-29 Aug 2003 Adam Kilgarriff: Us precision them recall 38

Antonyms Essential semantic relation

but Justeson and Katz 1995: distributional

evidence for typical antonym pairs– rich men and poor men– the big ones and the small ones– black and white issues

Perhaps antonyms are ‘really’ distributional

Page 39: 1 What computers can and cannot do for lexicography or Us precision, them recall Adam Kilgarriff Lexicography Masterclass Ltd and University of Brighton,

27-29 Aug 2003 Adam Kilgarriff: Us precision them recall 39

Thesauruses Also near-synonyms

– are there any true synonyms? Distributional: which words share same

distributions– if corpus contains

<object, drink, wine>, <object, drink, beer>

– 1 pt similarity between wine and beer– gather all points; find nearest neighbours

Sparck Jones, Lin, Grefenstette

Page 40: 1 What computers can and cannot do for lexicography or Us precision, them recall Adam Kilgarriff Lexicography Masterclass Ltd and University of Brighton,

27-29 Aug 2003 Adam Kilgarriff: Us precision them recall 40

Nearest neighbours In WASPbench Will be generated in Sketch Engine

NOUNS

zebra: giraffe buffalo hippopotamus rhinoceros gazelle antelope cheetah hippo leopard kangaroo crocodile deer rhino herbivore tortoise primate hyena camel scorpion macaque elephant mammoth alligator carnivore squirrel tiger newt chimpanzee monkey

Page 41: 1 What computers can and cannot do for lexicography or Us precision, them recall Adam Kilgarriff Lexicography Masterclass Ltd and University of Brighton,

27-29 Aug 2003 Adam Kilgarriff: Us precision them recall 41

exception: exemption limitation exclusion instance modification restriction recognition extension contrast addition refusal example clause indication definition error restraint reference objection consideration concession distinction variation occurrence anomaly offence jurisdiction implication analogy

pot: bowl pan jar container dish jug mug tin tub tray bag saucepan bottle basket bucket vase plate kettle teapot glass spoon soup box can cake tea packet pipe cup

Page 42: 1 What computers can and cannot do for lexicography or Us precision, them recall Adam Kilgarriff Lexicography Masterclass Ltd and University of Brighton,

27-29 Aug 2003 Adam Kilgarriff: Us precision them recall 42

VERBS

measure

determine assess calculate decrease monitor increase evaluate reduce detect estimate indicate analyse exceed vary test observe define record reflect affect obtain generate predict enhance alter examine quantify relate adjust

boil

simmer heat cook fry bubble cool stir warm steam sizzle bake flavour spill soak roast taste pour dry wash chop melt freeze scald consume burn mix ferment scorch soften

Page 43: 1 What computers can and cannot do for lexicography or Us precision, them recall Adam Kilgarriff Lexicography Masterclass Ltd and University of Brighton,

27-29 Aug 2003 Adam Kilgarriff: Us precision them recall 43

ADJECTIVES

hypnotic

haunting piercing expressionless dreamy monotonous seductive meditative emotive comforting expressive mournful healing indistinct unforgettable unreadable harmonic prophetic steely sensuous soothing malevolent irresistible restful insidious expectant demonic incessant inhuman spooky

pink

purple yellow red blue white pale brown green grey coloured bright scarlet orange cream black crimson thick soft dark striped thin golden faded matching embroidered silver warm mauve damp

Page 44: 1 What computers can and cannot do for lexicography or Us precision, them recall Adam Kilgarriff Lexicography Masterclass Ltd and University of Brighton,

27-29 Aug 2003 Adam Kilgarriff: Us precision them recall 44

Translation Parallel corpora

– Texts and their translations or Comparable corpora

– Matched for source and target (genre and subject matter), not translations

Which L1 words occur in equivalent L1 settings to L2 words in L2 settings?– They are candidate translation pairs

Very hard problem Lots of high quality research

Page 45: 1 What computers can and cannot do for lexicography or Us precision, them recall Adam Kilgarriff Lexicography Masterclass Ltd and University of Brighton,

27-29 Aug 2003 Adam Kilgarriff: Us precision them recall 45

Outline Precision and recall History of corpus lexicography Natural Language Processing Cyborgs

Page 46: 1 What computers can and cannot do for lexicography or Us precision, them recall Adam Kilgarriff Lexicography Masterclass Ltd and University of Brighton,

27-29 Aug 2003 Adam Kilgarriff: Us precision them recall 46

Cyborgs Robots: will they take over? Rod Brooks’s answer:

– Wrong question: greatest advances are in what the human+computer ensemble can do

Page 47: 1 What computers can and cannot do for lexicography or Us precision, them recall Adam Kilgarriff Lexicography Masterclass Ltd and University of Brighton,

27-29 Aug 2003 Adam Kilgarriff: Us precision them recall 47

Cyborgs A creature that is partly human and

partly machine – Macmillan English Dictionary

Page 48: 1 What computers can and cannot do for lexicography or Us precision, them recall Adam Kilgarriff Lexicography Masterclass Ltd and University of Brighton,

27-29 Aug 2003 Adam Kilgarriff: Us precision them recall 48

Page 49: 1 What computers can and cannot do for lexicography or Us precision, them recall Adam Kilgarriff Lexicography Masterclass Ltd and University of Brighton,

27-29 Aug 2003 Adam Kilgarriff: Us precision them recall 49

Page 50: 1 What computers can and cannot do for lexicography or Us precision, them recall Adam Kilgarriff Lexicography Masterclass Ltd and University of Brighton,

27-29 Aug 2003 Adam Kilgarriff: Us precision them recall 50

Page 51: 1 What computers can and cannot do for lexicography or Us precision, them recall Adam Kilgarriff Lexicography Masterclass Ltd and University of Brighton,

27-29 Aug 2003 Adam Kilgarriff: Us precision them recall 51

Page 52: 1 What computers can and cannot do for lexicography or Us precision, them recall Adam Kilgarriff Lexicography Masterclass Ltd and University of Brighton,

27-29 Aug 2003 Adam Kilgarriff: Us precision them recall 52

Cyborgs and the Information Society

The dictionary-making agent is part human (for precision), part computer (for recall).

Page 53: 1 What computers can and cannot do for lexicography or Us precision, them recall Adam Kilgarriff Lexicography Masterclass Ltd and University of Brighton,

27-29 Aug 2003 Adam Kilgarriff: Us precision them recall 53

Treat your computer with respect. You and it can do

great things together.

Page 54: 1 What computers can and cannot do for lexicography or Us precision, them recall Adam Kilgarriff Lexicography Masterclass Ltd and University of Brighton,

27-29 Aug 2003 Adam Kilgarriff: Us precision them recall 54

Lexicographers of the future?