A cascade of corpora: The Cambridge Learner Corpus, English Profile, the Sketch Engine, HOO, DANTE and the Kelly Project
Adam Kilgarriff
Lexical Computing Ltd
http://www.sketchengine.co.uk

English Profile

• From 2006
• Cambridge Univ, Univ Press, ESOL (+ others)
• Goal
  – for each CEFR level, find characteristic lexis and grammar
• CEFR: Common European Framework of Reference
  – A1, A2: Beginner
  – B1, B2: Intermediate
  – C1, C2: Advanced
  – Main resource: CLC

NTNU Nov 2011 Kilgarriff

Cambridge Learner Corpus (CLC)

• Since 1993
• Leading resource
• CUP and Cambridge Assessment
  – For better dictionaries, ELT courses, tests
  – Material: all from exams (levels A1-C2)
• 45m words; 22m error-tagged
• 200,000 scripts, 138 L1s, 203 nationalities

Sketch Engine

• Leading corpus tool
• Word sketches
  – One-page summaries of a word’s grammatical and collocational behaviour
• In use at OUP, CUP, Collins, Macmillan, INL …
• 55 languages
  – 175 corpora
  – Since May including CHILDES: demo
  – Since last year including CLC

Macmillan English Dictionary for Advanced Learners

Ed: Rundell, 2002

Error-coded corpus

• Challenge
  – Intuitive to search for x
    • anywhere
    • only where it is part of an error
    • only where it is part of a correction
    where x can be a word, phrase, grammar pattern …

Requirement for CLC in Sketch Engine
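
A minimal sketch of the three search scopes listed above, assuming a hypothetical token representation in which every token carries a zone label ("plain", "error" or "correction"). The `Token` class, the zone names and the `search` function are illustrative assumptions, not the CLC markup or the Sketch Engine query syntax.

```python
from dataclasses import dataclass


@dataclass
class Token:
    text: str
    zone: str  # hypothetical labels: "plain", "error" or "correction"


def search(tokens, word, scope="anywhere"):
    """Return positions of `word`, optionally restricted to one zone."""
    hits = []
    for i, tok in enumerate(tokens):
        if tok.text.lower() != word.lower():
            continue
        if scope == "anywhere" or tok.zone == scope:
            hits.append(i)
    return hits


# Learner wrote "I am agree with you"; the annotator corrected "am agree" to "agree".
tokens = [
    Token("I", "plain"),
    Token("am", "error"),
    Token("agree", "error"),
    Token("agree", "correction"),
    Token("with", "plain"),
    Token("you", "plain"),
]

print(search(tokens, "agree"))                # [2, 3]
print(search(tokens, "agree", "error"))       # [2]
print(search(tokens, "agree", "correction"))  # [3]
```

The same scope switch would apply if x were a phrase or a grammar pattern rather than a single word; only the matching test would change.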

Error-coded corpora in SkE

• demo

HOO / HOO+

• Helping Our Own
• HOO: English-NNS NLP researchers
  – Developer = user: motivation
  – Shared task/competitive evaluation
    • Organisers define task and prepare ‘gold standard’
    • Teams participate by running their software over test data
    • Six teams (incl Tübingen), workshop end Sept

HOO+ (2012)

• Probably
  – English: learner data from CLC
  – Other languages?
  – Tasks
    • Essay scoring
    • Determiner, preposition errors
    • ?
• http://www.clt.mq.edu.au/research/projects/hoo/

DANTE

Highlights of English lexicography

http://webdante.com

The KELLY Project

• EU Lifelong Learning Project
• Word cards
  – 9 languages
    • Arabic Chinese English Greek Italian Norwegian Polish Russian Swedish
  – All 36 pairs
  – Words the learner should know (at A1 … C2)
• Partners
  – Stockholm Univ, Gothenburg Univ, Adam Mickiewicz Univ, ILSP Athens, CNR Pisa, Oslo Univ, Leeds Univ, Keewords A/S, Lexical Computing Ltd

Interesting question

• How close to purely corpus-based can a pedagogic list be?

Method

• Take a general corpus
• Count
• Review, add, delete using other lists and corpora
• Translate (72 directed language pairs)
• Words not in source list which occur in translations:
  – Review source list
• http://kelly.sketchengine.co.uk

• Symmetrical pairs: <x,y> and <y,x>
• Cliques:
  – For x, y, z, … all pairs are symmetrical
  – 9-language cliques (English members)
    • hospital library music sun theory
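
The definitions of symmetrical pairs and cliques above can be sketched as follows. This is a toy illustration under stated assumptions: the `translations` table, the language codes and the word choices are invented for the example and are not Kelly project data.

```python
from itertools import combinations

# Hypothetical directed translations, keyed by (language, word).
translations = {
    ("en", "sun"): {("sv", "sol"), ("it", "sole")},
    ("sv", "sol"): {("en", "sun"), ("it", "sole")},
    ("it", "sole"): {("en", "sun"), ("sv", "sol")},
    ("en", "music"): {("sv", "musik"), ("it", "musica")},
    ("sv", "musik"): {("en", "music")},  # no Swedish-to-Italian link here
    ("it", "musica"): {("en", "music"), ("sv", "musik")},
}


def is_symmetric(x, y):
    """<x,y> is symmetrical if x translates to y and y translates back to x."""
    return y in translations.get(x, set()) and x in translations.get(y, set())


def is_clique(words):
    """A clique is a set of words in which every pair is symmetrical."""
    return all(is_symmetric(x, y) for x, y in combinations(words, 2))


print(is_clique([("en", "sun"), ("sv", "sol"), ("it", "sole")]))          # True
print(is_clique([("en", "music"), ("sv", "musik"), ("it", "musica")]))    # False
```

With nine languages a clique needs all 36 undirected pairs to be symmetrical, which is why only a few words qualify.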

Web corpora

• Replaceable or replacable?
  – http://googlefight.com
  – http://looglefight.com

• The web is
  – Very very large
  – Most languages
  – Most language types
  – Up-to-date
  – Free
  – Instant access

Web corpus types

• Large, general corpora
• Small, specialised corpora
  – Specially for translators

Basic steps

• Gather pages
  – CSE hits
  – Select and gather whole sites
  – General crawl
• Filter
• De-duplicate
• Linguistic processing
• Load into corpus tool
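
A rough sketch of the filter and de-duplicate steps only, assuming the pages have already been gathered as plain-text strings. The minimum-length threshold and the exact-match hashing are assumptions made for illustration; production pipelines typically use more elaborate boilerplate removal and near-duplicate detection.

```python
import hashlib


def filter_pages(pages, min_words=5):
    """Drop pages too short to contain running text (threshold is an assumption)."""
    return [p for p in pages if len(p.split()) >= min_words]


def deduplicate(pages):
    """Keep the first copy of each page, comparing case- and whitespace-normalised text."""
    seen, unique = set(), []
    for page in pages:
        normalised = " ".join(page.lower().split())
        digest = hashlib.sha1(normalised.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(page)
    return unique


pages = [
    "Buy now",
    "The cat sat on the mat today",
    "the cat  sat on the mat today",
    "Corpora are built from many web pages",
]
clean = deduplicate(filter_pages(pages))
print(clean)
```

The output of this stage would then go to linguistic processing (tokenising, lemmatising, tagging) before loading into the corpus tool.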

WaC family corpora

• 100m – 2b word corpora
• 2-month project each
• All major world languages available in Sketch Engine
  – Currently 42 languages
  – Growing monthly
• Pioneers: Marco Baroni, Serge Sharoff
• Corpus Factory
• Seeds:
  – mid-frequency words from ‘core vocab’ lists and corpora
• Google on seed words, then crawl

How good are they?

• How to assess?
  – Hard question, open research topic
• Good coverage
  – Newspapers: news, politics bias
  – Web corpora: also cover personal, kitchen vocab
• Web corpus / BNC / journalism corpus
  – First two are close

Evaluating word sketches

• 11 years – 1999-2011

• Feedback
  – Good but anecdotal
• Formal evaluation
• Method also lets us evaluate corpora

Goal

• Collocations dictionary
  – Model: Oxford Collocations Dictionary
  – Publication-quality
• Ask a lexicographer
  – For 42 headwords
    • For 20 best collocates per headword
      – “should we include this collocation in a published dictionary?”

Sample of headwords

• Nouns verbs adjectives, random
• High (Top 3000)
  – N space solution opinion mass corporation leader
  – V serve incorporate mix desire
  – Adj high detailed open academic
• Mid (3000-9999)
  – N cattle repayment fundraising elder biologist sanitation
  – V grieve classify ascertain implant
  – Adj adjacent eldest prolific ill
• Low (10,000-30,000)
  – N predicament adulterer bake bombshell candy shellfish
  – V slap outgrow plow traipse
  – Adj neoclassical votive adulterous expandable

Precision and recall

• a request for information
  – Find me all the fat cats

High recall

• Lots of responses
• Maybe not all good

High precision

• Fewer hits
• Higher confidence

Precision and recall

• We test precision
• Recall is harder
  – How do we find all the collocations that the system should have found?

Current work

• 200 collocates per headword
• Selected from
  – All the corpora we have
  – Various parameter settings
• Plus just-in-time evaluation for 'new' collocates
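
For reference, the two measures can be sketched as below, using the "find me all the fat cats" request as a toy example. The cat names and both sets are invented for illustration. The sketch also shows why recall is harder: it needs the full set of relevant items, which for collocations is not known in advance.

```python
def precision(retrieved, relevant):
    """Share of the returned items that are actually relevant."""
    retrieved = set(retrieved)
    if not retrieved:
        return 0.0
    return len(retrieved & set(relevant)) / len(retrieved)


def recall(retrieved, relevant):
    """Share of all relevant items that were returned."""
    relevant = set(relevant)
    if not relevant:
        return 0.0
    return len(set(retrieved) & relevant) / len(relevant)


# Hypothetical data: what the system returned vs. which cats really are fat.
returned = {"Tom", "Felix", "Rex"}
fat_cats = {"Tom", "Felix", "Garfield", "Sylvester"}

print(precision(returned, fat_cats))  # 2 of 3 returned items are fat cats
print(recall(returned, fat_cats))     # 2 of 4 fat cats were found: 0.5
```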

Four languages, three families

• Dutch
  – ANW, 102m-word lexicographic corpus
• English
  – UKWaC, 1.5b web corpus
• Japanese
  – JpWaC, 400m web corpus
• Slovene
  – FidaPlus, 620m lexicographic corpus

User evaluation

• Evaluate whole system
  – Will it help with my task
    • Eg preparing a collocations dictionary
• Contrast: developer evaluation
  – Can I make the system better?
    • Evaluate each module separately
    • Current work

Components

• Corpus
• NLP tools
  – Segmenter, lemmatiser, POS-tagger
• Sketch grammar
• Statistics

Practicalities

• Interface
  – Good, Good-but
    • Merge to good
  – Maybe, Maybe-specialised, Bad
    • Merge to bad
• For each language
  – Two/three linguists/lexicographers
  – If they disagree
    • Don't use for computing performance
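
The scoring procedure described above might look like this in outline. The label spellings, the data layout and the function name are assumptions for illustration; only the merging rule and the "drop disagreements" rule come from the slide.

```python
# Five interface labels collapsed to two, as on the slide.
MERGE = {
    "good": "good",
    "good-but": "good",
    "maybe": "bad",
    "maybe-specialised": "bad",
    "bad": "bad",
}


def precision_from_judgments(judgments):
    """Share of 'good' collocations among those on which all judges agree."""
    agreed = []
    for labels in judgments.values():
        merged = {MERGE[label] for label in labels}
        if len(merged) == 1:  # judges agree after merging
            agreed.append(merged.pop())
        # otherwise the item is left out of the performance figure
    if not agreed:
        return None
    return agreed.count("good") / len(agreed)


# Hypothetical judgments from two judges per collocation.
judgments = {
    ("space", "outer"): ["good", "good-but"],
    ("space", "the"): ["bad", "maybe"],
    ("space", "open"): ["good", "bad"],  # disagreement: excluded
    ("mix", "thoroughly"): ["good", "good"],
}
print(precision_from_judgments(judgments))  # 2 good out of 3 agreed items
```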

Results

• Dutch 66%
• English 71%
• Japanese 87%
• Slovene 71%

Two thirds of a collocations dictionary can be gathered automatically

Thank you

http://www.sketchengine.co.uk

Lexicography: finding facts about words

• collocations
• grammatical patterns
• idioms
• synonyms
• meanings
• translations

Four ages of corpus lexicography

Age 1: Precomputer

Oxford English Dictionary:
• 5 million index cards

Age 2: KWIC Concordances

• From 1980
• Computerised
• Overhauled lexicography
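
As a reminder of what a KWIC (keyword in context) concordance is, here is a minimal sketch: each hit is shown with a fixed number of words to its left and right. The function name, the `width` parameter and the sample sentence are assumptions for illustration.

```python
def kwic(tokens, keyword, width=3):
    """Return (left context, keyword, right context) for every hit."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok.lower() == keyword.lower():
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, tok, right))
    return lines


tokens = "the fat cat sat on the mat".split()
for left, key, right in kwic(tokens, "cat", width=2):
    # Right-align the left context so the keywords line up in a column.
    print(f"{left:>12}  {key}  {right}")
```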

Age 2: limitations

as corpora get bigger: too much data

• 50 lines for a word: read all
• 500 lines: could read all, takes a long time, slow
• 5000 lines: no

Age 3: Collocation statistics

• Problem: too much data - how to summarise?
• Solution: list of words occurring in neighbourhood of headword, with frequencies
• Sorted by salience
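
A small sketch of such a list: count the words within a window around the headword, then sort by an association score. The slide does not name the salience statistic, so logDice is used here purely as an assumption; the window size and the sample text are also invented for the example.

```python
import math
from collections import Counter


def collocates_by_salience(tokens, headword, window=2):
    """Return (collocate, co-occurrence count, logDice) sorted by logDice."""
    freq = Counter(tokens)
    co = Counter()
    for i, tok in enumerate(tokens):
        if tok != headword:
            continue
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                co[tokens[j]] += 1
    scored = []
    for word, f_xy in co.items():
        # logDice = 14 + log2(2 * f_xy / (f_x + f_y)); assumed choice of score
        log_dice = 14 + math.log2(2 * f_xy / (freq[headword] + freq[word]))
        scored.append((word, f_xy, log_dice))
    return sorted(scored, key=lambda item: item[2], reverse=True)


tokens = "the fat cat sat on the mat and the fat cat ate the fish".split()
for word, count, score in collocates_by_salience(tokens, "cat", window=1):
    print(word, count, round(score, 2))
```

Note that the list still mixes word classes and grammatical roles.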

Age-3 collocation statistics: limitations

Lists contain
• junk
• unsorted for type – mixes together adverbs, subjects, objects, prepositions

What we really want:
• noise-free lists
• one list for each grammatical relation

Age 4: The word sketch

• Large well-balanced corpus
• Parse to find
  – subjects, objects, heads, modifiers etc
• One list for each grammatical relation
• Statistics to sort each list, as before
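
The grouping step can be sketched as follows, assuming parsing has already produced (relation, headword, collocate) triples. The triple format and relation names are hypothetical, and raw frequency stands in for the salience statistic to keep the example short.

```python
from collections import Counter, defaultdict


def word_sketch(triples, headword):
    """Build one collocate list per grammatical relation for `headword`."""
    by_relation = defaultdict(Counter)
    for relation, head, collocate in triples:
        if head == headword:
            by_relation[relation][collocate] += 1
    # Sort each list separately; frequency is a stand-in for salience here.
    return {rel: counts.most_common() for rel, counts in by_relation.items()}


# Hypothetical parser output.
triples = [
    ("object", "drink", "tea"),
    ("object", "drink", "tea"),
    ("object", "drink", "water"),
    ("subject", "drink", "child"),
    ("object", "eat", "cake"),
]
sketch = word_sketch(triples, "drink")
for relation, collocates in sketch.items():
    print(relation, collocates)
```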

Working practice

• Lexicographers mainly used sketches not concordances
  – missed less, more consistent
  – Faster

Euralex 2002

• Can I have them for my language please

The Sketch Engine

• Input:
  – any corpus, any language
    • Lemmatised, part-of-speech tagged
  – specification of grammatical relations
• Word sketches integrated with
• Corpus query system
  – Supports complex searching, sorting etc
• Credit: Pavel Rychly, Masaryk Univ

Customers

• Dictionary publishers
  – Oxford University Press
  – Cambridge University Press
  – Collins
  – National dictionary projects in
    • Czech Republic, Estonia, Ireland, Netherlands, Slovakia, Slovenia
• Universities
  – Teaching and research
  – Languages, linguistics, language technology
  – UK, Germany, US, Greece, Taiwan, Japan, China, …
• Other
  – Language teaching, textbook writing
  – Information management, web search

• Demo
  – http://sketchengine.co.uk
  – Free trial

What is there on the web?

• Web1T
  – Present from Google
  – All 1-, 2-, 3-, 4-, 5-grams with f>40 in one trillion (10^12) words of English
    • 1,000,000,000,000
• Compare with BNC
  – Take top 50,000 items of each
  – 105 Web1T words not in BNC top50k
  – 50 words with highest Web1T:BNC ratio
  – 50 words with lowest ratio
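
The comparison above can be sketched as follows. The function name, the toy frequencies and the corpus sizes are invented for illustration; dividing by corpus size before taking the ratio is an assumption about how the two lists were made comparable.

```python
def compare_lists(freq_a, freq_b, size_a, size_b, n=50):
    """Compare two frequency lists: words only in A, plus ratio extremes."""
    only_in_a = sorted(set(freq_a) - set(freq_b))
    shared = set(freq_a) & set(freq_b)
    # Relative frequency in A divided by relative frequency in B.
    ratios = {w: (freq_a[w] / size_a) / (freq_b[w] / size_b) for w in shared}
    ranked = sorted(ratios, key=ratios.get, reverse=True)
    highest = ranked[:n]
    lowest = ranked[-n:][::-1]  # lowest ratio first
    return only_in_a, highest, lowest


# Hypothetical top-of-list counts for a web corpus and a reference corpus.
web = {"browser": 900, "she": 100, "the": 5000, "url": 400}
bnc = {"browser": 1, "she": 400, "the": 500}

only_web, web_high, web_low = compare_lists(web, bnc, 10_000, 1_000, n=1)
print(only_web)  # ['url']
print(web_high)  # ['browser']
print(web_low)   # ['she']
```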

Web-high (155 terms)

• 61 web and computing
  – config browser spyware url www forum
• 38 porn
• 22 US English
• 18 business/products common on web
  – poker viagra lingerie ringtone dvd casino rental collectible tiffany
  – NB: BNC is old
• 4 legal
  – trademarks pursuant accordance herein

Web-low

• Exclude British English, transcription/tokenisation anomalies

– herself stood seemed she looked yesterday sat considerable had council felt perhaps walked round her towards claimed knew obviously remained himself he him

Observations

• Pronouns and past tense verbs
  – Fiction
• Masc vs fem
• Yesterday
  – Probably daily newspapers
• Constancy of ratios:
  – He/him/himself
  – She/her/herself