Transcript
Page 1: Adam Kilgarriff Lexical Computing Ltd sketchengine.co.uk

A cascade of corpora: The Cambridge Learner Corpus, English Profile, the Sketch Engine, HOO, DANTE and the Kelly Project

Adam Kilgarriff
Lexical Computing Ltd

http://www.sketchengine.co.uk

Page 2: Adam Kilgarriff Lexical Computing Ltd sketchengine.co.uk

English Profile

• From 2006
• Cambridge Univ, Univ Press, ESOL (+ others)
• Goal
– for each CEFR level, find characteristic lexis and grammar
• CEFR: Common European Framework of Reference
– A1, A2: Beginner
– B1, B2: Intermediate
– C1, C2: Advanced
– Main resource: CLC

NTNU Nov 2011 Kilgarriff 2

Page 3: Adam Kilgarriff Lexical Computing Ltd sketchengine.co.uk

Cambridge Learner Corpus (CLC)

• Since 1993
• Leading resource
• CUP and Cambridge Assessment
– For better dictionaries, ELT courses, tests
– Material: all from exams (levels A1-C2)
• 45m words; 22m error-tagged
• 200,000 scripts, 138 L1s, 203 nationalities

Page 4: Adam Kilgarriff Lexical Computing Ltd sketchengine.co.uk

Sketch Engine

• Leading corpus tool
• Word sketches
– One-page summaries of a word’s grammatical and collocational behaviour
• In use at OUP, CUP, Collins, Macmillan, INL …
• 55 languages
– 175 corpora
– Since May including CHILDES: demo
– Since last year including CLC

Page 5: Adam Kilgarriff Lexical Computing Ltd sketchengine.co.uk

Macmillan English Dictionary for Advanced Learners

Ed: Rundell, 2002

Page 6: Adam Kilgarriff Lexical Computing Ltd sketchengine.co.uk

Error-coded corpus

• Challenge
– Intuitive to search for x
• anywhere
• only where it is part of an error
• only where it is part of a correction

where x can be a word, phrase, grammar pattern …

Requirement for CLC in Sketch Engine
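
The three search scopes can be illustrated with a minimal sketch. It assumes a hypothetical token representation in which every token is labelled as plain text, part of an error, or part of a correction; the `Token` class, the `layer` labels and the `search` function are invented for illustration and are not the CLC markup or the Sketch Engine interface.

```python
from dataclasses import dataclass


@dataclass
class Token:
    word: str
    layer: str  # hypothetical labels: "plain", "error" or "correction"


def search(tokens, x, scope="anywhere"):
    """Return the positions of word x, restricted to the chosen scope."""
    allowed = {
        "anywhere": {"plain", "error", "correction"},
        "error": {"error"},
        "correction": {"correction"},
    }[scope]
    return [i for i, t in enumerate(tokens) if t.word == x and t.layer in allowed]


# Toy example: the learner wrote "depends of", corrected to "depends on".
sentence = [
    Token("it", "plain"), Token("depends", "plain"),
    Token("of", "error"), Token("on", "correction"),
    Token("the", "plain"), Token("weather", "plain"),
]

print(search(sentence, "of", "error"))
print(search(sentence, "on", "correction"))
print(search(sentence, "the"))
```

Keeping the error and the correction as parallel layers on the same token stream is one way to make all three scopes a simple filter; other encodings are possible.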

Page 7: Adam Kilgarriff Lexical Computing Ltd sketchengine.co.uk

Error-coded corpora in SkE

• demo

Page 8: Adam Kilgarriff Lexical Computing Ltd sketchengine.co.uk

HOO / HOO+

• Helping Our Own
• HOO: English-NNS NLP researchers
– Developer = user: motivation
– Shared task/competitive evaluation
• Organisers define task and prepare ‘gold standard’
• Teams participate by running their software over test data
• Six teams (incl Tübingen), workshop end Sept

Page 9: Adam Kilgarriff Lexical Computing Ltd sketchengine.co.uk

HOO+ (2012)

• Probably
– English: learner data from CLC
– Other languages?
– Tasks
• Essay scoring
• Determiner, preposition errors
• ?
• http://www.clt.mq.edu.au/research/projects/hoo/

Page 10: Adam Kilgarriff Lexical Computing Ltd sketchengine.co.uk

DANTE

Highlights of English lexicography

Page 11: Adam Kilgarriff Lexical Computing Ltd sketchengine.co.uk

DANTE

Page 12: Adam Kilgarriff Lexical Computing Ltd sketchengine.co.uk

DANTE

Page 13: Adam Kilgarriff Lexical Computing Ltd sketchengine.co.uk

DANTE

Page 14: Adam Kilgarriff Lexical Computing Ltd sketchengine.co.uk

DANTE

http://webdante.com

Page 15: Adam Kilgarriff Lexical Computing Ltd sketchengine.co.uk

The KELLY Project

• EU Lifelong Learning Project
• Word cards
– 9 languages
• Arabic Chinese English Greek Italian Norwegian Polish Russian Swedish
– All 36 pairs
– Words the learner should know (at A1 … C2)
• Partners
• Stockholm Univ, Gothenburg Univ, Adam Mickiewicz Univ, ILSP Athens, CNR Pisa, Oslo Univ, Leeds Univ, Keewords A/S, Lexical Computing Ltd

Page 16: Adam Kilgarriff Lexical Computing Ltd sketchengine.co.uk

Interesting question

• How close to purely corpus-based can a pedagogic list be?

Page 17: Adam Kilgarriff Lexical Computing Ltd sketchengine.co.uk

Method

• Take a general corpus
• Count
• Review, add, delete using other lists and corpora
• Translate (72 directed language pairs)
• Words not in source list which occur in translations:
– Review source list
• http://kelly.sketchengine.co.uk

Page 18: Adam Kilgarriff Lexical Computing Ltd sketchengine.co.uk

• Symmetrical pairs: <x,y> and <y,x>
• Cliques:
– For x, y, z, … all pairs are symmetrical
– 9-language cliques (English members)
• hospital library music sun theory
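
The pair counts and the clique test can be sketched as follows. The language codes, the `trans` mapping and the toy translations are assumptions made up for illustration; they are not the KELLY data or its file format. A pair is symmetrical when each word is listed as a translation of the other, and a clique is a set of words in which every pair is symmetrical.

```python
from itertools import combinations, permutations

LANGS = ["ar", "zh", "en", "el", "it", "no", "pl", "ru", "sv"]

# Nine languages give 36 unordered pairs and 72 directed pairs.
n_pairs = len(list(combinations(LANGS, 2)))
n_directed = len(list(permutations(LANGS, 2)))


def is_symmetric(trans, x, y):
    """x and y are (language, word) tuples; trans maps each to its translations."""
    return y in trans.get(x, set()) and x in trans.get(y, set())


def is_clique(trans, items):
    """True if every pair of items is a symmetrical pair."""
    return all(is_symmetric(trans, a, b) for a, b in combinations(items, 2))


# Toy three-language example (invented for illustration).
trans = {
    ("en", "sun"): {("sv", "sol"), ("it", "sole")},
    ("sv", "sol"): {("en", "sun"), ("it", "sole")},
    ("it", "sole"): {("en", "sun"), ("sv", "sol")},
    ("en", "music"): {("sv", "musik")},
    ("sv", "musik"): set(),  # back-translation missing, so not symmetrical
}

print(n_pairs, n_directed)
print(is_clique(trans, [("en", "sun"), ("sv", "sol"), ("it", "sole")]))
print(is_symmetric(trans, ("en", "music"), ("sv", "musik")))
```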

Page 19: Adam Kilgarriff Lexical Computing Ltd sketchengine.co.uk

Web corpora

• Replaceable or replacable?
– http://googlefight.com
– http://looglefight.com

Page 20: Adam Kilgarriff Lexical Computing Ltd sketchengine.co.uk

• The web is
– Very very large
– Most languages
– Most language types
– Up-to-date
– Free
– Instant access

Page 21: Adam Kilgarriff Lexical Computing Ltd sketchengine.co.uk

Web corpus types

• Large, general corpora
• Small, specialised corpora
– Specially for translators

Page 22: Adam Kilgarriff Lexical Computing Ltd sketchengine.co.uk

Basic steps
• Gather pages
– CSE hits
– Select and gather whole sites
– General crawl
• Filter
• De-duplicate
• Linguistic processing
• Load into corpus tool
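
The filter and de-duplicate steps can be sketched as below. This is a simplified illustration, not the pipeline used for the corpora described here: the word-count threshold is an assumed value, and only exact duplicates (after whitespace and case normalisation) are removed, whereas a production pipeline would normally also handle near-duplicates.

```python
import hashlib
import re


def keep(text, min_words=5):
    """Filter: keep pages with enough running text (threshold is an assumption)."""
    return len(text.split()) >= min_words


def normalise(text):
    """Collapse whitespace and lowercase so trivial variants compare equal."""
    return re.sub(r"\s+", " ", text).strip().lower()


def deduplicate(pages):
    """Drop pages whose normalised text has already been seen."""
    seen, unique = set(), []
    for text in pages:
        digest = hashlib.sha1(normalise(text).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(text)
    return unique


def build(pages):
    """Filter first, then de-duplicate what is left."""
    return deduplicate([p for p in pages if keep(p)])


pages = [
    "Home | Login",
    "The library opens at nine on weekdays.",
    "the  library opens at nine on weekdays.",
    "Fresh shellfish is sold at the harbour market.",
]
corpus = build(pages)
print(len(corpus))
```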

Page 23: Adam Kilgarriff Lexical Computing Ltd sketchengine.co.uk

WaC family corpora
• 100m – 2b word corpora
• 2-month project each
• All major world languages available in Sketch Engine
– Currently 42 languages
– Growing monthly
• Pioneers: Marco Baroni, Serge Sharoff
• Corpus Factory
• Seeds:
– mid-frequency words from ‘core vocab’ lists and corpora
• Google on seed words, then crawl
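
Seed selection can be sketched as follows, assuming a frequency-ranked word list as input. The rank band, the tuple size and the function names are illustrative assumptions, not the settings used for the WaC or Corpus Factory corpora. Each tuple of seed words would then be sent to a search engine as one query.

```python
import random


def mid_frequency(ranked_words, lo, hi):
    """ranked_words is sorted most frequent first; return ranks lo..hi (1-based)."""
    return ranked_words[lo - 1:hi]


def make_queries(seeds, n_queries, size=3, rng=None):
    """Build n_queries tuples of distinct seed words, sampled at random."""
    rng = rng or random.Random(0)
    return [tuple(rng.sample(seeds, size)) for _ in range(n_queries)]


# Toy ranked list (invented): the top ranks are function words, the tail is rare.
ranked = ["the", "of", "and", "house", "music", "library",
          "theory", "hospital", "sun", "votive"]
seeds = mid_frequency(ranked, 4, 9)
queries = make_queries(seeds, n_queries=4)
print(seeds)
print(queries)
```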

Page 24: Adam Kilgarriff Lexical Computing Ltd sketchengine.co.uk

How good are they?
• How to assess?
– Hard question, open research topic
• Good coverage
– Newspapers: news, politics bias
– Web corpora: also cover personal, kitchen vocab
• Web corpus / BNC / journalism corpus
– First two are close

Page 25: Adam Kilgarriff Lexical Computing Ltd sketchengine.co.uk

Evaluating word sketches

• 11 years – 1999-2011

• Feedback
– Good but anecdotal
• Formal evaluation
• Method also lets us evaluate corpora

Page 26: Adam Kilgarriff Lexical Computing Ltd sketchengine.co.uk

Goal

• Collocations dictionary
– Model: Oxford Collocations Dictionary
– Publication-quality
• Ask a lexicographer
– For 42 headwords
• For 20 best collocates per headword
– “should we include this collocation in a published dictionary?”

Page 27: Adam Kilgarriff Lexical Computing Ltd sketchengine.co.uk

Sample of headwords
• Nouns, verbs, adjectives, random
• High (Top 3000)
– N space solution opinion mass corporation leader
– V serve incorporate mix desire
– Adj high detailed open academic
• Mid (3000-9999)
– N cattle repayment fundraising elder biologist sanitation
– V grieve classify ascertain implant
– Adj adjacent eldest prolific ill
• Low (10,000-30,000)
– N predicament adulterer bake bombshell candy shellfish
– V slap outgrow plow traipse
– Adj neoclassical votive adulterous expandable

Page 28: Adam Kilgarriff Lexical Computing Ltd sketchengine.co.uk

Precision and recall

• a request for information
– Find me all the fat cats

Page 29: Adam Kilgarriff Lexical Computing Ltd sketchengine.co.uk

High recall

• Lots of responses
• Maybe not all good

Page 30: Adam Kilgarriff Lexical Computing Ltd sketchengine.co.uk

High precision

• Fewer hits
• Higher confidence
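
The trade-off can be made concrete with the standard definitions: precision is the share of returned items that are correct, and recall is the share of correct items that were returned. The cat names and the two example result lists below are invented for illustration.

```python
def precision_recall(returned, relevant):
    """Standard set-based precision and recall."""
    returned, relevant = set(returned), set(relevant)
    hits = returned & relevant
    precision = len(hits) / len(returned) if returned else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# "Find me all the fat cats": the true answer set (toy data).
fat_cats = {"tom", "felix", "garfield", "salem"}

# A cautious system returns few hits, all of them right.
cautious = ["tom", "garfield"]
# A greedy system returns every fat cat, plus a lot of noise.
greedy = ["tom", "felix", "garfield", "salem", "rex", "fido", "polly", "nemo"]

print(precision_recall(cautious, fat_cats))
print(precision_recall(greedy, fat_cats))
```

The cautious list scores high on precision and low on recall; the greedy list does the reverse.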

Page 31: Adam Kilgarriff Lexical Computing Ltd sketchengine.co.uk

Precision and recall

We test precision

Recall is harder

How do we find all the collocations that the system should have found?

Current work
• 200 collocates per headword
• Selected from
• All the corpora we have
• Various parameter settings
• Plus just-in-time evaluation for 'new' collocates

Page 32: Adam Kilgarriff Lexical Computing Ltd sketchengine.co.uk

Four languages, three families

• Dutch
– ANW, 102m-word lexicographic corpus
• English
– UKWaC, 1.5b web corpus
• Japanese
– JpWaC, 400m web corpus
• Slovene
– FidaPlus, 620m lexicographic corpus

Page 33: Adam Kilgarriff Lexical Computing Ltd sketchengine.co.uk

User evaluation

• Evaluate whole system
– Will it help with my task
• Eg preparing a collocations dictionary
• Contrast: developer evaluation
– Can I make the system better?
• Evaluate each module separately
• Current work

Page 34: Adam Kilgarriff Lexical Computing Ltd sketchengine.co.uk

Components

• Corpus
• NLP tools
– Segmenter, lemmatiser, POS-tagger
• Sketch grammar
• Statistics

Page 35: Adam Kilgarriff Lexical Computing Ltd sketchengine.co.uk

Practicalities
• Interface
– Good, Good-but
• Merge to good
– Maybe, Maybe-specialised, Bad
• Merge to bad
• For each language
– Two/three linguists/lexicographers
– If they disagree
• Don't use for computing performance
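
The scoring procedure can be sketched as below: the five interface labels are merged to good or bad, items on which the judges disagree are left out, and precision is the share of good items among the rest. Checking disagreement on the merged labels rather than the raw labels is an assumption, and the judgements shown are invented.

```python
# Five interface labels merged into two classes, as on the slide.
MERGE = {
    "Good": "good", "Good-but": "good",
    "Maybe": "bad", "Maybe-specialised": "bad", "Bad": "bad",
}


def precision_from_judgements(judgements):
    """judgements maps each collocation to the list of labels given by the judges."""
    good = counted = 0
    for labels in judgements.values():
        merged = {MERGE[label] for label in labels}
        if len(merged) != 1:
            continue  # judges disagree: not used for computing performance
        counted += 1
        if merged == {"good"}:
            good += 1
    return good / counted if counted else 0.0


# Toy judgements from two judges (invented for illustration).
judgements = {
    ("rain", "heavy"): ["Good", "Good-but"],
    ("rain", "torrential"): ["Good", "Good"],
    ("rain", "fall"): ["Maybe", "Bad"],
    ("rain", "dog"): ["Good", "Bad"],  # disagreement, skipped
}
print(precision_from_judgements(judgements))
```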

Page 36: Adam Kilgarriff Lexical Computing Ltd sketchengine.co.uk

Results

• Dutch 66%
• English 71%
• Japanese 87%
• Slovene 71%

Page 37: Adam Kilgarriff Lexical Computing Ltd sketchengine.co.uk

Two thirds of a collocations dictionary can be gathered automatically

Page 38: Adam Kilgarriff Lexical Computing Ltd sketchengine.co.uk

Thank you

http://www.sketchengine.co.uk

Page 39: Adam Kilgarriff Lexical Computing Ltd sketchengine.co.uk

Page 40: Adam Kilgarriff Lexical Computing Ltd sketchengine.co.uk

Lexicography: finding facts about words

• collocations
• grammatical patterns
• idioms
• synonyms
• meanings
• translations

Page 41: Adam Kilgarriff Lexical Computing Ltd sketchengine.co.uk

Four ages of corpus lexicography

Page 42: Adam Kilgarriff Lexical Computing Ltd sketchengine.co.uk

Age 1: Precomputer

Oxford English Dictionary:
• 5 million index cards

Page 43: Adam Kilgarriff Lexical Computing Ltd sketchengine.co.uk

Age 2: KWIC Concordances

• From 1980
• Computerised
• Overhauled lexicography
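
A KWIC (keyword in context) concordance lists every occurrence of a word with a few words of context on either side, aligned on the keyword. A minimal sketch, with an invented example sentence and an assumed context width of three tokens:

```python
def kwic(tokens, keyword, width=3):
    """Return (left context, keyword, right context) for each occurrence."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok.lower() == keyword:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, tok, right))
    return lines


text = "the sun rose early and the morning sun warmed the old library".split()
for left, node, right in kwic(text, "sun"):
    # Right-align the left context so the keyword lines up in one column.
    print(f"{left:>20} | {node} | {right}")
```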

Page 44: Adam Kilgarriff Lexical Computing Ltd sketchengine.co.uk

Age 2: limitations

as corpora get bigger: too much data

• 50 lines for a word: read all
• 500 lines: could read all, takes a long time, slow
• 5000 lines: no

Page 45: Adam Kilgarriff Lexical Computing Ltd sketchengine.co.uk

Age 3: Collocation statistics

• Problem: too much data - how to summarise?

• Solution: list of words occurring in neighbourhood of headword, with frequencies

• Sorted by salience
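
A minimal sketch of such a list: count the words found within a fixed window of the headword, then sort them by a salience score. The slide does not name the statistic; logDice, computed as 14 + log2(2·f_xy / (f_x + f_y)), is used here as one possible choice, and the window size and example text are assumptions.

```python
import math
from collections import Counter


def collocates(tokens, headword, window=2):
    """Return (word, co-occurrence frequency, logDice) sorted by logDice."""
    freq = Counter(tokens)
    # Collect each neighbouring position once, even if two windows overlap.
    positions = set()
    for i, tok in enumerate(tokens):
        if tok == headword:
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            positions.update(j for j in range(lo, hi) if tokens[j] != headword)
    co = Counter(tokens[j] for j in positions)
    rows = []
    for word, f_xy in co.items():
        log_dice = 14 + math.log2(2 * f_xy / (freq[headword] + freq[word]))
        rows.append((word, f_xy, log_dice))
    return sorted(rows, key=lambda r: r[2], reverse=True)


text = ("strong tea and strong coffee but powerful computer "
        "and strong tea").split()
for word, f_xy, score in collocates(text, "tea", window=1):
    print(f"{word:<10} {f_xy:>3} {score:6.2f}")
```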

Page 46: Adam Kilgarriff Lexical Computing Ltd sketchengine.co.uk

Age-3 collocation statistics: limitations

Lists contain
• junk
• unsorted for type – mixes together adverbs, subjects, objects, prepositions

What we really want:
• noise-free lists
• one list for each grammatical relation

Page 47: Adam Kilgarriff Lexical Computing Ltd sketchengine.co.uk

Age 4: The word sketch
• Large well-balanced corpus
• Parse to find
– subjects, objects, heads, modifiers etc
• One list for each grammatical relation
• Statistics to sort each list, as before

Page 48: Adam Kilgarriff Lexical Computing Ltd sketchengine.co.uk

Working practice

• Lexicographers mainly used sketches not concordances
– missed less, more consistent
– Faster

Page 49: Adam Kilgarriff Lexical Computing Ltd sketchengine.co.uk

Euralex 2002

Page 50: Adam Kilgarriff Lexical Computing Ltd sketchengine.co.uk

Euralex 2002

• Can I have them for my language please

Page 51: Adam Kilgarriff Lexical Computing Ltd sketchengine.co.uk

The Sketch Engine

• Input:
– any corpus, any language
• Lemmatised, part-of-speech tagged
– specification of grammatical relations
• Word sketches integrated with
• Corpus query system
– Supports complex searching, sorting etc
• Credit: Pavel Rychly, Masaryk Univ

Page 52: Adam Kilgarriff Lexical Computing Ltd sketchengine.co.uk

Customers
• Dictionary publishers
– Oxford University Press
– Cambridge University Press
– Collins
– National dictionary projects in
• Czech Republic, Estonia, Ireland, Netherlands, Slovakia, Slovenia
• Universities
– Teaching and research
– Languages, linguistics, language technology
– UK, Germany, US, Greece, Taiwan, Japan, China, …
• Other
– Language teaching, textbook writing
– Information management, web search

Page 53: Adam Kilgarriff Lexical Computing Ltd sketchengine.co.uk

• Demo
– http://sketchengine.co.uk
– Free trial

Page 54: Adam Kilgarriff Lexical Computing Ltd sketchengine.co.uk

What is there on the web?
• Web1T
– Present from Google
– All 1-, 2-, 3-, 4-, 5-grams with f>40 in one trillion (10^12) words of English
• 1,000,000,000,000
• Compare with BNC
– Take top 50,000 items of each
– 105 Web1T words not in BNC top50k
– 50 words with highest Web1T:BNC ratio
– 50 words with lowest ratio
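
The comparison can be sketched as below: take the top items of each frequency list, report the items found only in the first, then rank the shared items by the ratio of their frequencies. The function name and the toy frequencies are invented, and the sketch assumes the frequencies are already normalised (for example per million words) so that corpora of different sizes can be compared.

```python
def compare(freq_a, freq_b, top_n, k):
    """Return (items only in A's top list, k highest-ratio items, k lowest-ratio items)."""
    top_a = set(sorted(freq_a, key=freq_a.get, reverse=True)[:top_n])
    top_b = set(sorted(freq_b, key=freq_b.get, reverse=True)[:top_n])
    only_a = top_a - top_b
    shared = top_a & top_b
    ranked = sorted(shared, key=lambda w: freq_a[w] / freq_b[w], reverse=True)
    return only_a, ranked[:k], ranked[-k:]


# Toy per-million frequencies (invented, not real Web1T or BNC figures).
web = {"the": 50000, "she": 900, "url": 300, "browser": 200, "forum": 150}
bnc = {"the": 60000, "she": 3500, "council": 400, "stood": 250, "browser": 2}

only_web, web_high, web_low = compare(web, bnc, top_n=4, k=1)
print(sorted(only_web), web_high, web_low)
```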

Page 55: Adam Kilgarriff Lexical Computing Ltd sketchengine.co.uk

Web-high (155 terms)

• 61 web and computing
– config browser spyware url www forum
• 38 porn
• 22 US English
• 18 business/products common on web
– poker viagra lingerie ringtone dvd casino rental collectible tiffany
– NB: BNC is old
• 4 legal
– trademarks pursuant accordance herein

Page 56: Adam Kilgarriff Lexical Computing Ltd sketchengine.co.uk

Web-low

• Exclude British English, transcription/tokenisation anomalies

– herself stood seemed she looked yesterday sat considerable had council felt perhaps walked round her towards claimed knew obviously remained himself he him

Page 57: Adam Kilgarriff Lexical Computing Ltd sketchengine.co.uk

Observations
• Pronouns and past tense verbs
– Fiction
• Masc vs fem
• Yesterday
– Probably daily newspapers
• Constancy of ratios:
– He/him/himself
– She/her/herself

