Corpora, Dictionaries, and points in between in the age of the web

Adam Kilgarriff
Lexical Computing Ltd
Lexicography MasterClass Ltd
Universities of Leeds & Sussex, UK


October 2009 Kilgarriff: FLTRP 2

Outline
• Precision and recall
• History of corpus lexicography
• Sketch Engine
  – demo
• Automatic Collocations Dictionary
  – demo
• Electronic dictionaries


Find me all the fat cats

a request for information


High recall

• Lots of responses
• Maybe not all good


High precision

• Fewer hits
• Higher confidence
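The trade-off on the last two slides can be made concrete with a small sketch. The cat identifiers and the two searches below are hypothetical toy data, not from the talk; the function only applies the standard definitions (precision = correct hits / all hits, recall = correct hits / all relevant items).

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for retrieved items against the truly relevant ones."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical toy data: ids of the cats that really are fat
fat_cats = {"c1", "c2", "c3", "c4"}

# A broad search: finds every fat cat, plus several that are not fat
broad = {"c1", "c2", "c3", "c4", "c5", "c6", "c7", "c8"}
# A cautious search: few answers, all of them correct
cautious = {"c1", "c2"}

broad_pr = precision_recall(broad, fat_cats)        # high recall, lower precision
cautious_pr = precision_recall(cautious, fat_cats)  # high precision, lower recall
print(broad_pr, cautious_pr)
```

The broad search answers "find me all the fat cats" completely but half its responses are wrong; the cautious search is always right but misses half the cats.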


Us precision, them recall

           Recall  Precision
Computers  good    bad
People     bad     good


Us precision, them recall

• True in many areas
  – web searching, Google
  – finding an image to illustrate a talk
• Nowhere more so than lexicography


Lexicography: finding facts about words

• collocations
• grammatical patterns
• idioms
• synonyms
• antonyms
• meanings
• translations


Outline
• Precision and recall
• History of corpus lexicography
• Natural Language Processing
• Cyborgs


Four ages of corpus lexicography


Age 1: Pre-computer

Oxford English Dictionary:
• 5 million index cards


Age 2: KWIC Concordances
• From 1980
• Computerised
• COBUILD project was innovator
• asian-kwic.html
• the coloured-pens method
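A KWIC (keyword in context) concordance lists every occurrence of a word with a few words of context on each side, aligned on the keyword. The following is a minimal sketch of the idea only; the function name `kwic`, the `width` parameter and the sample sentence are illustrative assumptions, not the COBUILD tooling.

```python
def kwic(tokens, keyword, width=3):
    """Return (left context, keyword, right context) for each occurrence of keyword."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok.lower() == keyword.lower():
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, tok, right))
    return lines


# Hypothetical one-sentence "corpus"
text = "we must save money now so that we can save lives later"
concordance = kwic(text.split(), "save")

# Right-align the left context so the keyword forms a column
for left, kw, right in concordance:
    print(f"{left:>20}  {kw}  {right}")
```

With one line per occurrence, the lexicographer reads down the keyword column and marks up patterns by hand, which is where the coloured pens come in.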


Age 2: limitations

as corpora get bigger: too much data
• 50 lines for a word: read all
• 500 lines: could read all, takes a long time, slow
• 5000 lines: no


Age 3: Collocation statistics

Problem: too much data - how to summarise?

Solution: list of words occurring in neighbourhood of headword, with frequencies

Sorted by salience


Collocation listing
For right collocates of save (>5 hits)

word      fr(x+y)  fr(y)    word       fr(x+y)  fr(y)
forests   6        170      life       36       4875
$1.2      6        180      dollars    8        1668
lives     37       1697     costs      7        1719
enormous  6        301      thousands  6        1481
annually  7        447      face       9        2590
jobs      20       2001     estimated  6        2387
money     64       6776     your       7        3141
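The slide does not say which salience measure produced this ordering. As a hedged illustration, the sketch below ranks the same rows by the ratio fr(x+y) / fr(y). For a fixed headword this ratio gives the same ranking as mutual information, because MI = log2(N · fr(x+y) / (fr(x) · fr(y))) and both N and fr(x) are constant across the list. Read down the left column and then the right, the table's order is consistent with that ranking.

```python
# (collocate, fr(x+y), fr(y)) rows copied from the table, left column first
rows = [
    ("forests", 6, 170), ("$1.2", 6, 180), ("lives", 37, 1697),
    ("enormous", 6, 301), ("annually", 7, 447), ("jobs", 20, 2001),
    ("money", 64, 6776), ("life", 36, 4875), ("dollars", 8, 1668),
    ("costs", 7, 1719), ("thousands", 6, 1481), ("face", 9, 2590),
    ("estimated", 6, 2387), ("your", 7, 3141),
]


def salience(row):
    """Share of the collocate's occurrences that fall next to the headword."""
    _, joint, freq_y = row
    return joint / freq_y


# Raw co-occurrence frequency favours words that are simply common
by_frequency = [word for word, _, _ in sorted(rows, key=lambda r: r[1], reverse=True)]
# The ratio favours words that are strongly tied to the headword
by_salience = [word for word, _, _ in sorted(rows, key=salience, reverse=True)]

print(by_frequency[:3])
print(by_salience[:3])
```

Sorting by raw frequency puts money, lives and life first; sorting by the ratio promotes rarer but more distinctive collocates such as forests.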


Collocation statistics

Which words?
– next word
– last word
– window, +1 to +5; window, -5 to -1

How sorted?
– most common collocates
  but for most nouns it's "the"
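The window choices listed above can be sketched as a single counting function. The name `window_collocates`, its offset parameters and the sample sentence are assumptions made for illustration. Sorting the result by raw frequency also shows the problem the slide points to: the top collocate of a noun is just "the".

```python
from collections import Counter


def window_collocates(tokens, headword, lo, hi):
    """Count tokens at offsets lo..hi (inclusive) from each occurrence of headword."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok != headword:
            continue
        for offset in range(lo, hi + 1):
            j = i + offset
            # Skip the headword itself and positions outside the text
            if offset != 0 and 0 <= j < len(tokens):
                counts[tokens[j]] += 1
    return counts


# Hypothetical toy corpus
tokens = "the cat sat on the mat and the cat ate the fish near the mat".split()

next_word = window_collocates(tokens, "cat", 1, 1)   # next word only
right = window_collocates(tokens, "cat", 1, 5)       # window, +1 to +5
left = window_collocates(tokens, "cat", -5, -1)      # window, -5 to -1

print(right.most_common(3))
```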


Collocation listing
For right collocates of save (>5 hits)

word      fr(x+y)  fr(y)    word       fr(x+y)  fr(y)
forests   6        170      life       36       4875
$1.2      6        180      dollars    8        1668
lives     37       1697     costs      7        1719
enormous  6        301      thousands  6        1481
annually  7        447      face       9        2590
jobs      20       2001     estimated  6        2387
money     64       6776     your       7        3141


Age-3 collocation statistics: limitations

• Lists contain junk
• Unsorted for type
  – MI lists mix adverbs, subjects, objects, prepositions

What we really want:
• noise-free lists
• one list for each grammatical relation


Age 4: The word sketch

Automatic one-page summary of a word’s grammatical and collocational behaviour
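The core idea, one collocate list per grammatical relation, can be sketched as follows. This is not the Sketch Engine's implementation: the triples are hypothetical hand-written data standing in for the output of a parser, and the lists are ranked by raw count rather than by a salience score.

```python
from collections import Counter, defaultdict


def word_sketch(triples, headword):
    """Group a headword's collocates by grammatical relation, most frequent first."""
    by_relation = defaultdict(Counter)
    for relation, head, collocate in triples:
        if head == headword:
            by_relation[relation][collocate] += 1
    return {rel: counts.most_common() for rel, counts in by_relation.items()}


# Hypothetical (relation, headword, collocate) triples
triples = [
    ("object", "save", "money"),
    ("object", "save", "money"),
    ("object", "save", "lives"),
    ("subject", "save", "government"),
    ("modifier", "save", "annually"),
    ("object", "drink", "wine"),
]

sketch = word_sketch(triples, "save")
for relation, collocates in sketch.items():
    print(relation, collocates)
```

Because objects, subjects and modifiers land in separate lists, the mixing of types that troubled the Age 3 collocation lists does not arise.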


The Sketch Engine
• Input:
  – any corpus, any language
    Lemmatised, part-of-speech tagged
  – specification of grammatical relations
• Word sketches integrated with corpus query system
  – Supports complex searching, sorting etc
• First release early 2004


Recap: Lexicography: finding facts about words

• collocations
• grammatical patterns
• idioms
• synonyms
• meanings
• translations


Thesaurus
• Also near-synonyms
  – are there any true synonyms?
• Distributional: which words share same distributions
  – if corpus contains <object, drink, wine>, <object, drink, beer>
  – 1 pt similarity between wine and beer
  – gather all points; find nearest neighbours
• Sparck Jones, Lin, Grefenstette
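The point-counting scheme on this slide can be sketched directly: two words earn one point for every (relation, headword) context they share, and a word's nearest neighbours are the words with the most points. The triples beyond the slide's wine/beer pair are hypothetical, and the sketch leaves out the frequency weighting a realistic similarity measure would need.

```python
from collections import defaultdict


def contexts(triples):
    """Map each word to the set of (relation, headword) contexts it occurs in."""
    ctx = defaultdict(set)
    for relation, head, word in triples:
        ctx[word].add((relation, head))
    return ctx


def nearest_neighbours(word, ctx):
    """Rank other words by number of shared contexts (1 point per shared context)."""
    own = ctx.get(word, set())
    scored = [(other, len(own & seen)) for other, seen in ctx.items() if other != word]
    # Highest score first; ties broken alphabetically
    return sorted([(w, s) for w, s in scored if s > 0], key=lambda ws: (-ws[1], ws[0]))


triples = [
    ("object", "drink", "wine"),   # from the slide
    ("object", "drink", "beer"),   # from the slide
    ("object", "pour", "wine"),    # hypothetical
    ("object", "pour", "beer"),    # hypothetical
    ("object", "drink", "tea"),    # hypothetical
    ("object", "read", "book"),    # hypothetical
]

ctx = contexts(triples)
wine_neighbours = nearest_neighbours("wine", ctx)
print(wine_neighbours)
```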


Electronic dictionaries
• Conference on them last week
• Rundell quotation
• On
  – PC
  – Handheld
  – Cellphone
  – Web


On PCs
• CD-ROMs as added extra
  – No income model
  – Large extra publishing cost
  – No extra income


Handhelds
• Students like them, teachers don’t
  – Subversive!
  – Fast to use: used even for conversation
• Many dictionaries on one device
  – Users usually do not know which
  – For publishers
    Complex distribution channels
    Dictionary publishers have little control


Cellphones


Web dictionaries
• Traditional publishers vs new players
• Business models
  – Free + premium
  – Advertising
• How many hits/month?
  – Macmillan 2.5m
  – Cambridge UP 30m
  – Leo 100m