Adam Kilgarriff Lexical Computing Ltd sketchengine.co.uk


A cascade of corpora: The Cambridge Learner Corpus, English Profile, the Sketch Engine, HOO, DANTE and the Kelly Project

Adam Kilgarriff
Lexical Computing Ltd

http://www.sketchengine.co.uk

English Profile

• From 2006
• Cambridge Univ, Univ Press, ESOL (+ others)
• Goal
  – for each CEFR level, find characteristic lexis and grammar
• CEFR: Common European Framework of Reference
  – A1, A2: Beginner
  – B1, B2: Intermediate
  – C1, C2: Advanced
• Main resource: CLC

NTNU Nov 2011 Kilgarriff

Cambridge Learner Corpus (CLC)

• Since 1993
• Leading resource
• CUP and Cambridge Assessment
  – For better dictionaries, ELT courses, tests
  – Material: all from exams (levels A1-C2)
• 45m words; 22m error-tagged
• 200,000 scripts, 138 L1s, 203 nationalities

Sketch Engine

• Leading corpus tool
• Word sketches
  – One-page summaries of a word’s grammatical and collocational behaviour
• In use at OUP, CUP, Collins, Macmillan, INL …
• 55 languages
  – 175 corpora
  – Since May including CHILDES: demo
  – Since last year including CLC

Macmillan English Dictionary for Advanced Learners

Ed: Rundell, 2002

Error-coded corpus

• Challenge
  – Intuitive to search for x
    • anywhere
    • only where it is part of an error
    • only where it is part of a correction
  where x can be a word, phrase, grammar pattern …

Requirement for CLC in Sketch Engine
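
The three search scopes can be sketched in a few lines. This is only an illustration, not how the CLC or the Sketch Engine store error annotation: the `Token` class, the `layer` labels and the example sentence are all invented for this sketch.

```python
from dataclasses import dataclass

# Hypothetical token model. 'layer' records where the token sits:
#   "plain"      - outside any error annotation
#   "error"      - part of the learner's original, erroneous text
#   "correction" - part of the annotator's correction
@dataclass
class Token:
    word: str
    layer: str

def search(tokens, word, scope="anywhere"):
    """Return positions of `word`; scope is 'anywhere', 'error' or 'correction'."""
    hits = []
    for i, tok in enumerate(tokens):
        if tok.word.lower() != word.lower():
            continue
        if scope == "anywhere" or tok.layer == scope:
            hits.append(i)
    return hits

# Invented example: "She depends of her parents on Sundays", with "of" corrected to "on".
sentence = [
    Token("She", "plain"), Token("depends", "plain"),
    Token("of", "error"), Token("on", "correction"),
    Token("her", "plain"), Token("parents", "plain"),
    Token("on", "plain"), Token("Sundays", "plain"),
]

on_anywhere = search(sentence, "on")                  # both occurrences of "on"
on_in_correction = search(sentence, "on", "correction")
of_in_error = search(sentence, "of", "error")
```

A real query language would let x be a phrase or grammar pattern as well as a single word; the point here is only that the scope is a filter on the annotation layer.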


Error-coded corpora in SkE

• demo


HOO / HOO+

• Helping Our Own
• HOO: English-NNS NLP researchers
  – Developer = user: motivation
  – Shared task/competitive evaluation
    • Organisers define task and prepare ‘gold standard’
    • Teams participate by running their software over test data
• Six teams (incl Tübingen), workshop end Sept

HOO+ (2012)

• Probably
  – English: learner data from CLC
  – Other languages?
  – Tasks
    • Essay scoring
    • Determiner, preposition errors
    • ?
• http://www.clt.mq.edu.au/research/projects/hoo/

DANTE

Highlights of English lexicography


http://webdante.com


The KELLY Project

• EU Lifelong Learning Project
• Word cards
  – 9 languages
    • Arabic Chinese English Greek Italian Norwegian Polish Russian Swedish
  – All 36 pairs
  – Words the learner should know (at A1 … C2)
• Partners
  – Stockholm Univ, Gothenburg Univ, Adam Mickiewicz Univ, ILSP Athens, CNR Pisa, Oslo Univ, Leeds Univ, Keewords A/S, Lexical Computing Ltd

Interesting question

• How close to purely corpus-based can a pedagogic list be?


Method

• Take a general corpus
• Count
• Review, add, delete using other lists and corpora
• Translate (72 directed language pairs)
• Words not in source list which occur in translations:
  – Review source list
• http://kelly.sketchengine.co.uk

• Symmetrical pairs: <x,y> and <y,x>
• Cliques:
  – For x, y, z, … all pairs are symmetrical
  – 9-language cliques (English members)
    • hospital library music sun theory
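
A minimal sketch of the two definitions above, assuming the translations are held as a set of directed pairs. The function names and the toy three-language data are invented for illustration and are not taken from the Kelly lists.

```python
from itertools import combinations

def symmetric_pairs(translations):
    """Keep unordered pairs {x, y} for which both <x,y> and <y,x> are present."""
    return {frozenset((x, y)) for (x, y) in translations if (y, x) in translations}

def is_clique(items, translations):
    """True if every pair drawn from `items` is a symmetrical pair."""
    sym = symmetric_pairs(translations)
    return all(frozenset(pair) in sym for pair in combinations(items, 2))

# Toy directed translation pairs (invented). Each item is "language:word".
translations = {
    ("en:sun", "sv:sol"), ("sv:sol", "en:sun"),
    ("en:sun", "it:sole"), ("it:sole", "en:sun"),
    ("sv:sol", "it:sole"), ("it:sole", "sv:sol"),
    ("en:music", "sv:musik"), ("sv:musik", "en:music"),
    ("en:music", "it:musica"),  # reverse direction deliberately missing
    ("sv:musik", "it:musica"), ("it:musica", "sv:musik"),
}

sun_is_clique = is_clique(["en:sun", "sv:sol", "it:sole"], translations)
music_is_clique = is_clique(["en:music", "sv:musik", "it:musica"], translations)
```

In the toy data the "sun" set is a clique and the "music" set is not, because one direction of one pair is missing. A 9-language clique needs all 36 unordered pairs to be symmetrical.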


Web corpora

• Replaceable or replacable?
  – http://googlefight.com
  – http://looglefight.com

• The web is
  – Very very large
  – Most languages
  – Most language types
  – Up-to-date
  – Free
  – Instant access

Web corpus types

• Large, general corpora
• Small, specialised corpora
  – Specially for translators

Basic steps

• Gather pages
  – CSE hits
  – Select and gather whole sites
  – General crawl
• Filter
• De-duplicate
• Linguistic processing
• Load into corpus tool
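
The middle steps can be sketched on pages that have already been gathered. This is an assumption-laden toy, not the pipeline used for the corpora in this talk: the word-count filter, the exact-match hashing and the regex tokeniser are simple stand-ins chosen for the example, and all function names are hypothetical.

```python
import hashlib
import re

def filter_pages(pages, min_words=5):
    """Filter: drop pages too short to be running text (threshold is arbitrary)."""
    return [p for p in pages if len(p.split()) >= min_words]

def deduplicate(pages):
    """De-duplicate: keep the first copy of each page, ignoring case and spacing."""
    seen, unique = set(), []
    for page in pages:
        normalised = " ".join(page.lower().split())
        key = hashlib.sha1(normalised.encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append(page)
    return unique

def tokenise(page):
    """Linguistic processing: a bare tokeniser standing in for lemmatising and tagging."""
    return re.findall(r"\w+", page.lower())

def build_corpus(pages):
    return [tokenise(p) for p in deduplicate(filter_pages(pages))]

pages = [
    "Home | Login",
    "The cat sat on the mat today.",
    "The  cat sat on the MAT today.",
    "Corpora are collections of texts for linguistic study.",
]
corpus = build_corpus(pages)
```

On the four toy pages, the navigation stub is filtered out and the second copy of the cat sentence is dropped, leaving two documents. Real web data also needs near-duplicate detection, which exact hashing does not provide.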


WaC family corpora

• 100m – 2b word corpora
• 2-month project each
• All major world languages available in Sketch Engine
  – Currently 42 languages
  – Growing monthly
• Pioneers: Marco Baroni, Serge Sharoff
• Corpus Factory
• Seeds:
  – mid-frequency words from ‘core vocab’ lists and corpora
• Google on seed words, then crawl

How good are they?

• How to assess?
  – Hard question, open research topic
• Good coverage
  – Newspapers: news, politics bias
  – Web corpora: also cover personal, kitchen vocab
• Web corpus / BNC / journalism corpus
  – First two are close

Evaluating word sketches

• 11 years – 1999-2011

• Feedback
  – Good but anecdotal
• Formal evaluation
• Method also lets us evaluate corpora

Goal

• Collocations dictionary
  – Model: Oxford Collocations Dictionary
  – Publication-quality
• Ask a lexicographer
  – For 42 headwords
• For 20 best collocates per headword
  – “should we include this collocation in a published dictionary?”

Sample of headwords

• Nouns, verbs, adjectives, random
• High (Top 3000)
  – N: space solution opinion mass corporation leader
  – V: serve incorporate mix desire
  – Adj: high detailed open academic
• Mid (3000-9999)
  – N: cattle repayment fundraising elder biologist sanitation
  – V: grieve classify ascertain implant
  – Adj: adjacent eldest prolific ill
• Low (10,000-30,000)
  – N: predicament adulterer bake bombshell candy shellfish
  – V: slap outgrow plow traipse
  – Adj: neoclassical votive adulterous expandable

Precision and recall

• a request for information
  – Find me all the fat cats

High recall

• Lots of responses
• Maybe not all good

High precision

• Fewer hits
• Higher confidence

Precision and recall

• We test precision
• Recall is harder
  – How do we find all the collocations that the system should have found?

Current work

• 200 collocates per headword
• Selected from
  – All the corpora we have
  – Various parameter settings
• Plus just-in-time evaluation for 'new' collocates
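
For reference, the two measures can be written out as follows. The function names and the "fat cats" data are invented for illustration. The sketch also shows why recall is the harder one: it needs the full set of good items, which for collocations nobody has.

```python
def precision(returned, good):
    """Share of the returned items that are good."""
    returned = set(returned)
    return len(returned & set(good)) / len(returned) if returned else 0.0

def recall(returned, good):
    """Share of all good items that were returned (needs the complete 'good' set)."""
    good = set(good)
    return len(set(returned) & good) / len(good) if good else 0.0

# Toy request: "find me all the fat cats" (cat names are invented).
returned = ["Tom", "Felix", "Garfield", "Sylvester"]   # what the system found
fat_cats = {"Garfield", "Tom", "Bagpuss"}              # the full set we would need to know

p = precision(returned, fat_cats)   # 2 of the 4 hits are fat cats
r = recall(returned, fat_cats)      # 2 of the 3 fat cats were found
```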


Four languages, three families

• Dutch
  – ANW, 102m-word lexicographic corpus
• English
  – UKWaC, 1.5b web corpus
• Japanese
  – JpWaC, 400m web corpus
• Slovene
  – FidaPlus, 620m lexicographic corpus

User evaluation

• Evaluate whole system
  – Will it help with my task?
    • E.g. preparing a collocations dictionary
• Contrast: developer evaluation
  – Can I make the system better?
    • Evaluate each module separately
    • Current work

Components

• Corpus
• NLP tools
  – Segmenter, lemmatiser, POS-tagger
• Sketch grammar
• Statistics

Practicalities

• Interface
  – Good, Good-but
    • Merge to good
  – Maybe, Maybe-specialised, Bad
    • Merge to bad
• For each language
  – Two/three linguists/lexicographers
  – If they disagree
    • Don't use for computing performance
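
The scoring procedure on this slide can be sketched as below. The five labels and the merge rule come from the slide; the data structure, function names and example judgements are assumptions made for the illustration.

```python
GOOD_LABELS = {"Good", "Good-but"}
BAD_LABELS = {"Maybe", "Maybe-specialised", "Bad"}

def merge(label):
    """Collapse the five interface labels into good / bad."""
    if label in GOOD_LABELS:
        return "good"
    if label in BAD_LABELS:
        return "bad"
    raise ValueError(f"unknown label: {label}")

def precision_from_judgements(judgements):
    """judgements maps each collocation to the labels given by each judge.

    Collocations on which the judges disagree (after merging) are left out.
    Returns None if no collocation has an agreed judgement.
    """
    agreed = []
    for labels in judgements.values():
        merged = {merge(label) for label in labels}
        if len(merged) == 1:
            agreed.append(merged.pop())
    return agreed.count("good") / len(agreed) if agreed else None

# Invented judgements from two judges for four candidate collocations.
judgements = {
    "strong opinion": ["Good", "Good-but"],     # agreed: good
    "opinion thing": ["Bad", "Maybe"],          # agreed: bad
    "open opinion": ["Good", "Bad"],            # disagreement: excluded
    "public opinion": ["Good", "Good"],         # agreed: good
}
score = precision_from_judgements(judgements)   # 2 good out of 3 agreed items
```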


Results

• Dutch 66%
• English 71%
• Japanese 87%
• Slovene 71%

Two thirds of a collocations dictionary can be gathered automatically

Thank you

http://www.sketchengine.co.uk


Lexicography: finding facts about words

• collocations
• grammatical patterns
• idioms
• synonyms
• meanings
• translations

Four ages of corpus lexicography


Age 1: Precomputer

Oxford English Dictionary:
• 5 million index cards

Age 2: KWIC Concordances

• From 1980
• Computerised
• Overhauled lexicography
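
A KWIC (keyword in context) concordance lists every occurrence of a word with a few words of context on either side. A minimal sketch, with an invented function name and toy text:

```python
def kwic(tokens, keyword, width=3):
    """Return (left context, keyword, right context) for each hit."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok.lower() == keyword.lower():
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, tok, right))
    return lines

tokens = "the fat cat sat on the mat while another cat slept".split()
concordance = kwic(tokens, "cat", width=2)

# Print with the keyword aligned in a column, as concordancers do.
for left, key, right in concordance:
    print(f"{left:>15}  {key}  {right}")
```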


Age 2: limitations

As corpora get bigger: too much data

• 50 lines for a word: read all
• 500 lines: could read all, takes a long time, slow
• 5000 lines: no

Age 3: Collocation statistics

• Problem: too much data - how to summarise?

• Solution: list of words occurring in neighbourhood of headword, with frequencies

• Sorted by salience
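
A sketch of such a list, assuming a fixed window around the headword. The slide does not say which salience statistic is meant; logDice, 14 + log2(2·f(x,y) / (f(x) + f(y))), is used here purely as one example of a score. The function name and toy text are invented.

```python
import math
from collections import Counter

def collocates(tokens, headword, window=2):
    """List (word, co-occurrence frequency, salience) sorted by salience."""
    freq = Counter(tokens)
    co = Counter()
    for i, tok in enumerate(tokens):
        if tok != headword:
            continue
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                co[tokens[j]] += 1
    scored = []
    for word, f_xy in co.items():
        # logDice, chosen for illustration only.
        score = 14 + math.log2(2 * f_xy / (freq[headword] + freq[word]))
        scored.append((word, f_xy, round(score, 2)))
    return sorted(scored, key=lambda item: item[2], reverse=True)

tokens = "the fat cat sat on the mat and the fat cat ate the fish".split()
cat_collocates = collocates(tokens, "cat", window=1)
```

With a window of one word, "fat" comes out on top for "cat" in the toy text because it is next to "cat" every time it occurs. The list still mixes modifiers and verbs together, which is the limitation the next slide describes.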


Age-3 collocation statistics: limitations

Lists contain
• junk
• unsorted for type – mixes together adverbs, subjects, objects, prepositions

What we really want:
• noise-free lists
• one list for each grammatical relation

Age 4: The word sketch

• Large well-balanced corpus
• Parse to find
  – subjects, objects, heads, modifiers etc
• One list for each grammatical relation
• Statistics to sort each list, as before
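
The data structure behind a word sketch can be sketched as follows, assuming the parsing step has already produced (headword, relation, collocate) triples. The triples, relation names and function name are invented, and raw frequency stands in for the salience statistic.

```python
from collections import Counter, defaultdict

def word_sketch(triples, headword):
    """Group a headword's collocates into one sorted list per grammatical relation."""
    by_relation = defaultdict(Counter)
    for head, relation, collocate in triples:
        if head == headword:
            by_relation[relation][collocate] += 1
    # Sorted here by raw frequency; a real sketch would sort by a salience score.
    return {rel: counts.most_common() for rel, counts in by_relation.items()}

# Invented output of a parser run over a tiny corpus.
triples = [
    ("drink", "object", "tea"),
    ("drink", "object", "tea"),
    ("drink", "object", "coffee"),
    ("drink", "subject", "man"),
    ("drink", "modifier", "heavily"),
    ("eat", "object", "cake"),
]
sketch = word_sketch(triples, "drink")
```

Each key of `sketch` is one column of the one-page summary: objects of "drink", its subjects, its modifiers.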


Working practice

• Lexicographers mainly used sketches not concordances
  – missed less, more consistent
  – Faster

Euralex 2002


• Can I have them for my language please


The Sketch Engine

• Input:
  – any corpus, any language
    • Lemmatised, part-of-speech tagged
  – specification of grammatical relations
• Word sketches integrated with
• Corpus query system
  – Supports complex searching, sorting etc
• Credit: Pavel Rychly, Masaryk Univ

Customers

• Dictionary publishers
  – Oxford University Press
  – Cambridge University Press
  – Collins
  – National dictionary projects in
    • Czech Republic, Estonia, Ireland, Netherlands, Slovakia, Slovenia
• Universities
  – Teaching and research
  – Languages, linguistics, language technology
  – UK, Germany, US, Greece, Taiwan, Japan, China, …
• Other
  – Language teaching, textbook writing
  – Information management, web search

• Demo
  – http://sketchengine.co.uk
  – Free trial

What is there on the web?

• Web1T
  – Present from Google
  – All 1-, 2-, 3-, 4-, 5-grams with f>40 in one trillion (10^12) words of English
    • 1,000,000,000,000
• Compare with BNC
  – Take top 50,000 items of each
  – 105 Web1T words not in BNC top50k
  – 50 words with highest Web1T:BNC ratio
  – 50 words with lowest ratio
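
The comparison can be sketched on two word-frequency lists. The slide does not say how the ratio was computed; dividing each count by its list total before taking the ratio is an assumption made here, and the counts are invented toy numbers, not Web1T or BNC figures.

```python
def top_n(freqs, n):
    """The n most frequent words of a frequency list."""
    ranked = sorted(freqs.items(), key=lambda kv: kv[1], reverse=True)
    return [word for word, _ in ranked[:n]]

def compare(web, bnc, n, k):
    """Return (web-only words, k highest-ratio words, k lowest-ratio words)."""
    web_top, bnc_top = top_n(web, n), set(top_n(bnc, n))
    web_only = [w for w in web_top if w not in bnc_top]
    shared = [w for w in web_top if w in bnc_top]
    web_total, bnc_total = sum(web.values()), sum(bnc.values())
    # Assumption: ratio of relative frequencies, so corpus size does not dominate.
    ratio = {w: (web[w] / web_total) / (bnc[w] / bnc_total) for w in shared}
    ranked = sorted(shared, key=ratio.get, reverse=True)
    return web_only, ranked[:k], ranked[-k:]

# Invented toy counts.
web = {"the": 1000, "url": 300, "browser": 200, "forum": 150, "he": 100, "she": 50}
bnc = {"the": 1000, "he": 350, "she": 300, "council": 200, "browser": 5, "url": 1}

web_only, web_high, web_low = compare(web, bnc, n=5, k=1)
```

In the talk's setting n is 50,000 and k is 50, so the web-high list of the next slide is the 105 web-only words plus the 50 highest-ratio words.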


Web-high (155 terms)

• 61 web and computing
  – config browser spyware url www forum
• 38 porn
• 22 US English
• 18 business/products common on web
  – poker viagra lingerie ringtone dvd casino rental collectible tiffany
  – NB: BNC is old
• 4 legal
  – trademarks pursuant accordance herein

Web-low

• Exclude British English, transcription/tokenisation anomalies

– herself stood seemed she looked yesterday sat considerable had council felt perhaps walked round her towards claimed knew obviously remained himself he him


Observations

• Pronouns and past tense verbs
  – Fiction
• Masc vs fem
• Yesterday
  – Probably daily newspapers
• Constancy of ratios:
  – He/him/himself
  – She/her/herself
