Between Corpus and Dictionary Adam Kilgarriff Lexical Computing Ltd Lexicography MasterClass Ltd Universities of Leeds, Sussex

Between Corpus and Dictionary

Adam Kilgarriff
Lexical Computing Ltd
Lexicography MasterClass Ltd
Universities of Leeds, Sussex

Szeged, Jan 2008 Kilgarriff, Global WordNet 2

What is a word sense?

Preliminaries

- What is language?
- What is meaning?

What is language?

- In our heads
- In texts and sound signals
- Both

Methodology

- Study language in our heads
  - Introspection
  - Semantic analysis
  - Experiments with human subjects
- “rationalist” (Leibniz, Chomsky)
- Problems: coverage, arbitrariness

Methodology

- Study text
- “empiricist” (Locke, Hume)
  - Physics: forces, matter
  - Chemistry: chemicals, bonds
  - Language: text, speech signals

It goes against the grain

- What is important about a sentence?
  - its meaning
- Corpus methodology:
  - Throw away individual sentence meaning
  - Find patterns

Empiricist linguistics

- A new way to find out about language
- 15 years of rapid ascent
  - Computers
  - Corpora
    - bigger and bigger data sets available
  - Language technology tools
    - lemmatizers, POS-taggers, parsers, machine learning for pattern finding

Rationalists vs empiricists in the age of the web

semantic web vs Google?

What are you?

- Temperament
- Complementary/alternatives
  - Barbu and Poesio, Keller and Lapata: comparisons, evaluations
  - (AK: current research project)

What is meaning?

- Fregean
- Gricean

Gottlob Frege (1848-1925)

- Founder of modern logic
- Truth values
  - The sentence “grass is green” is true if and only if grass is green (Tarski)
- Meanings of words, phrases are such that:
  - Put them together in a sentence
  - State basic facts
  - Sentence computes to ‘true’ if sentence is true, ‘false’ if it is false
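The idea that word meanings combine so that a sentence "computes to" a truth value can be illustrated with a toy model. This is only a sketch: the fact set, the function names, and the treatment of adjectives as sets of entities are assumptions made for the illustration, not part of the talk.

```python
# Toy model of the world: the "basic facts" (invented for illustration).
FACTS = {("grass", "green"), ("snow", "white")}


def noun(name):
    """A noun denotes an entity (here simply its name)."""
    return name


def adjective(prop):
    """An adjective denotes the set of entities it is true of."""
    return {entity for entity, p in FACTS if p == prop}


def is_sentence(subject, predicate):
    """Meaning of 'SUBJECT is PREDICATE': true iff the entity is in the set."""
    return subject in predicate


grass_is_green = is_sentence(noun("grass"), adjective("green"))
grass_is_white = is_sentence(noun("grass"), adjective("white"))
print(grass_is_green, grass_is_white)
```

The sketch only works because each word already has a well-defined meaning; it says nothing about how those meanings are identified in the first place.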

Gottlob Frege (1848-1925)

- Formal semantics
  - Sparkling analyses for quantifiers, connectives
  - Montague semantics
- Foundations for maths, databases, ontologies …

H. P. Grice (1913-1988)

An agent means something by an utterance if and only if they intended the utterance to produce some effect in an audience by means of the recognition of this intention.

Dictionary of Philosophy of Mind, http://philosophy.uwaterloo.ca

Meaning is something you do

- Basis of meaning is
  - Meaning event
  - Speaker’s intention
  - Speaker’s expectation of interpretation of hearer
- (messy, hard)

Strawson commentary (1970s)

For the sake of a label, we might call it the conflict between the theorists of communication-intention and the theorists of formal semantics. […] A struggle on what seems to be such a central issue in philosophy should have something of a Homeric quality; and a Homeric struggle calls for gods and heroes. I can at least, though tentatively, name some living captains and benevolent shades: on the one side, say, Grice, Austin, and the later Wittgenstein; on the other, Chomsky, Frege, and the earlier Wittgenstein.

Battle of the two Adams?

Relevance to word senses

- Fregean
  - Supports reasoning
  - Builds on well-defined word-meanings
  - Identifying word meanings: can’t help
- Fall back on Grice

Fauconnier and Turner

“linguistic expressions prompt for meanings rather than express meanings”

(AK chapter, Agirre and Edmonds WSD book)

Preliminaries over

What is a word sense?

The lexicographers

- They create them
- Methods
  - Introspection
  - Other dictionaries
  - Corpus
- Atkins, Hanks, Krishnamurthy

What is a word sense (1)

- SFIP: Sufficiently frequent, insufficiently predictable
- (a glass of) whisky x (a glass of) tequila

What is a word sense (2)

homonymy

analogy polysemy rules

collocation

What is a word sense (3)

- A cluster
  - Of instances of use
  - Operationalised as: corpus lines
  - Clustered by lexicographers

What is a word sense (3)

- A cluster
  - Of instances of use
  - Operationalised as: corpus lines
  - Clustered by lexicographers
- Makes sense of
  - Overlapping senses
  - Different dictionaries, different senses
  - Lumping and splitting

I don’t believe in word senses

- Believe in: resurrection, ghost, witch, vampire, god, miracle, fairy
- Philosophy:
  - Ontological commitment (same meaning, different register)
  - “good entities to build belief systems on”

But I’m an NLP person

- Automatic clustering?
- Inspiration:
  - Hindle 1991, Schütze 1993, Grefenstette 1993, Lin 1999
  - You can get semantic sense from corpora+stats
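A minimal sketch of how "corpora+stats" can yield something like semantic similarity: words are compared by the contexts they occur in. The corpus, the window size, and the use of raw counts with cosine similarity are assumptions for illustration; the cited work typically relies on grammatical relations and weighted association scores rather than a plain word window.

```python
from collections import Counter
from math import sqrt

# Toy corpus, invented for illustration.
CORPUS = [
    "she drank a glass of whisky",
    "he drank a glass of tequila",
    "they drank a bottle of whisky",
    "she read a chapter of the book",
    "he read a page of the book",
]


def context_vector(target, sentences, window=2):
    """Count the words occurring within `window` tokens of `target`."""
    counts = Counter()
    for sentence in sentences:
        tokens = sentence.split()
        for i, token in enumerate(tokens):
            if token == target:
                lo, hi = max(0, i - window), i + window + 1
                counts.update(
                    t for j, t in enumerate(tokens[lo:hi], lo) if j != i
                )
    return counts


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


whisky, tequila, book = (
    context_vector(w, CORPUS) for w in ("whisky", "tequila", "book")
)
print(cosine(whisky, tequila), cosine(whisky, book))
```

On this toy data "whisky" comes out closer to "tequila" than to "book", but the function word "of" carries much of the similarity, which is one reason real systems weight or filter contexts.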

First attempt

- Longman 1994
- Abject failure
  - No grammar
  - Corpus too small and noisy
  - Naïve clustering
  - Useless programmer

Collocations

- Easy
  - Most words don’t go with most other words
- Then build on what we can do well
- (metaphor, analogy, homonymy, rules: all much harder)
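One way to see why collocations are comparatively easy to find is to score word pairs with an association measure. The sketch below uses logDice, defined as 14 + log2(2·f(x,y) / (f(x) + f(y))), purely as one well-known score; the slides do not say which statistic is meant, and the pair list is invented.

```python
from collections import Counter
from math import log2

# Toy list of (adjective, noun) pairs, invented for illustration.
PAIRS = [
    ("strong", "tea"), ("strong", "tea"), ("strong", "tea"),
    ("strong", "wind"), ("powerful", "computer"), ("powerful", "computer"),
    ("powerful", "engine"), ("strong", "engine"),
]

pair_freq = Counter(PAIRS)
left_freq = Counter(adj for adj, _ in PAIRS)
right_freq = Counter(noun for _, noun in PAIRS)


def log_dice(left, right):
    """Return the logDice score, or None if the pair never co-occurs."""
    joint = pair_freq[(left, right)]
    if joint == 0:
        return None
    return 14 + log2(2 * joint / (left_freq[left] + right_freq[right]))


observed = len(pair_freq)
possible = len(left_freq) * len(right_freq)
print(f"{observed} of {possible} possible pairs are attested")
for pair in sorted(pair_freq, key=lambda p: log_dice(*p), reverse=True):
    print(pair, round(log_dice(*pair), 2))
```

Even in this tiny sample some combinations never occur; in a real corpus the gap between attested and possible pairs is far larger, which is what makes the attested, high-scoring pairs stand out.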

The Sketch Engine

- 2003: programmer problem solved
- Corpora
  - More available
  - Build big clean ones from web
- Grammar
  - POS-taggers/lemmatisers available
  - Shallow regexp grammars if no full parser
- Stats: progress (Lin, Curran, Evert …)
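A shallow regexp grammar can be pictured as a set of regular expressions over POS-tagged text, each naming a grammatical relation. The sketch below is a simplified illustration in plain Python: the word/TAG format, the tagset, the relation names, and the patterns are all assumptions, and this is not the Sketch Engine's own grammar formalism.

```python
import re
from collections import defaultdict

# Tagged text in word/TAG form; sentence and tagset invented for illustration.
TAGGED = (
    "the/DT strong/JJ tea/NN was/VBD served/VBN in/IN a/DT "
    "small/JJ cup/NN and/CC she/PRP drank/VBD the/DT tea/NN"
)

# Hypothetical shallow patterns: each names a grammatical relation and
# captures a pair of words from adjacent tokens.
PATTERNS = {
    "modifier": re.compile(r"(\w+)/JJ (\w+)/NN"),
    "object_of": re.compile(r"(\w+)/VB\w? (?:\w+/DT )?(\w+)/NN"),
}


def extract_relations(tagged):
    """Return {relation: [(word1, word2), ...]} for every pattern match."""
    found = defaultdict(list)
    for relation, pattern in PATTERNS.items():
        found[relation].extend(pattern.findall(tagged))
    return dict(found)


relations = extract_relations(TAGGED)
print(relations)
```

Patterns like these miss anything that is not adjacent in the expected order (the passive "tea was served" is not captured as an object relation), which is the usual trade-off against a full parser.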

demo

Clustering

- Word sketch
  - Collocates organised by grammar
- Dictionary
  - Collocates (and other things) organised by meaning
- How to re-organise
  - Three phases

Semi-automatic dictionary drafting (SADD)

- Automatic clustering of collocates
  - Propose senses
- Iterate:
  - Lexicographer input
    - Confirm/reject/edit sense inventory
    - Assigns collocates / corpus lines to senses
  - WSD
    - Uses seeds to build full WSD for word
    - Find more collocates for each sense
- XML dictionary entry
  - Load into dictionary-editing tool
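The iterate step above (seeds from the lexicographer, WSD over corpus lines, more collocates per sense) can be sketched as a simple bootstrapping loop in the spirit of Yarowsky-style "one sense per collocation" methods. This is an assumption about how such a loop might look, not the SADD implementation; the corpus lines, sense labels, seeds, and stopword list are invented.

```python
from collections import Counter, defaultdict

# Corpus lines for the headword "bank"; all data here is invented.
LINES = [
    "deposit money in a bank account",
    "the bank raised its interest rate",
    "fishing from the river bank",
    "the grassy bank of the river",
    "open a bank account and earn interest",
    "mud along the grassy bank",
]

# Seeds a lexicographer might confirm: sense label -> collocates.
SEEDS = {"finance": {"money"}, "river": {"river"}}
STOPWORDS = {"a", "the", "in", "its", "from", "of", "and", "bank"}


def tag_lines(lines, collocates):
    """Assign a line to a sense if it matches collocates of exactly one sense."""
    tagged = {}
    for line in lines:
        tokens = set(line.split())
        senses = [s for s, words in collocates.items() if tokens & words]
        if len(senses) == 1:
            tagged[line] = senses[0]
    return tagged


def grow_collocates(tagged, collocates):
    """Add words that occur only in lines tagged with a single sense."""
    seen = defaultdict(Counter)
    for line, sense in tagged.items():
        seen[sense].update(t for t in line.split() if t not in STOPWORDS)
    grown = {s: set(words) for s, words in collocates.items()}
    for sense, counts in seen.items():
        others = set().union(*(c for s, c in seen.items() if s != sense))
        grown[sense] |= {w for w in counts if w not in others}
    return grown


collocates = SEEDS
for _ in range(2):  # two rounds: tag lines, then find more collocates
    tagged = tag_lines(LINES, collocates)
    collocates = grow_collocates(tagged, collocates)
tagged = tag_lines(LINES, collocates)

for line, sense in tagged.items():
    print(f"{sense:8} {line}")
```

In a drafting workflow the lexicographer would review the grown collocate lists between rounds; the loop here accepts every new collocate automatically, which a real system would not want to do unchecked.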

Atkins method for bilingual lexicography

- Analyse source language
  - From corpus
  - List all expressions that might possibly have a non-predictable translation
  - Very fine grained
  - Lots of collocations
  - target-language-neutral; re-usable
- Translate
- Edit to finalise dictionary

New English-Irish Dictionary

- Irish:
  - Gaelic language, some native speakers, culturally important for Ireland
- Project
  - To replace dictionary from 1950s
  - Government-funded project
  - Lexicography MasterClass (Atkins, Rundell, Kilgarriff) designed project in 2003

English analysis for NEID

- New project, 1st Feb 2008 to late 2010
- Contractor: Lexicography MasterClass
- 12 lexicographers
- Plan
  - Test SADD
  - If viable, use it on industrial scale

demo2

http://corpora.fi.muni.cz/sadd/

Thank you

- Sketch Engine: http://www.sketchengine.co.uk
- Lexicom workshop
  - Pre-Euralex, 10-15 July, Barcelona
    - http://www.iula.upf.edu/agenda/lexicom
  - Pre-CICLING, Mexico, Feb 2009