Using Corpora in Linguistics and Lexicography
Adam Kilgarriff
Lexical Computing Ltd
Universities of Leeds, Sussex, UK
IDS Mannheim 2010 Kilgarriff 2
Outline
• Precision and recall
• History of corpus lexicography
• The Sketch Engine
– Demo
• Web corpora
• Corpus and dictionary
Find me all the fat cats
a request for information
High recall
• Lots of responses
• Maybe not all good
High precision
• Fewer hits
• Higher confidence
Information-seeking
            Recall   Precision
Computers   good     bad
People      bad      good
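As a rough illustration of the two measures (not part of the original slides): precision is the share of returned hits that are relevant, and recall is the share of relevant items that were returned. The function name and the toy "fat cats" data below are hypothetical.

```python
def precision_recall(returned, relevant):
    """Return (precision, recall) for a set of returned hits
    measured against the set of truly relevant items."""
    returned, relevant = set(returned), set(relevant)
    true_positives = len(returned & relevant)
    precision = true_positives / len(returned) if returned else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical data: four fat cats exist; two searches try to find them.
fat_cats = {"tom", "garfield", "felix", "salem"}
broad_search = {"tom", "garfield", "felix", "salem", "rex", "fido", "polly", "nemo"}
narrow_search = {"tom", "garfield"}

high_recall = precision_recall(broad_search, fat_cats)      # finds all, with noise
high_precision = precision_recall(narrow_search, fat_cats)  # clean, but misses some
```

The broad search returns every fat cat along with four irrelevant hits; the narrow search returns only fat cats but misses half of them.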
Lexicography: finding facts about words
• collocations
• grammatical patterns
• idioms
• synonyms
• meanings
• translations
Four ages of corpus lexicography
Age 1: Pre-computer
Oxford English Dictionary:
• 5 million index cards
Age 2: KWIC concordances
• From 1980
• Computerised
• Overhauled lexicography
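For readers who have not seen one, a KWIC (keyword in context) concordance can be sketched in a few lines. This is a minimal illustration, assuming a pre-tokenised text; the function names and the sample sentence are hypothetical and not taken from any particular concordancer.

```python
def kwic(tokens, node, width=4):
    """Return one (left context, node, right context) triple per
    occurrence of `node`, with up to `width` tokens on each side."""
    lines = []
    for i, token in enumerate(tokens):
        if token.lower() == node.lower():
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, token, right))
    return lines


def format_kwic(lines, left_width=30):
    """Right-align the left context so the node words line up in a column."""
    return "\n".join(
        f"{left:>{left_width}}  {node}  {right}" for left, node, right in lines
    )


text = "we must save the forests and save money to save lives"
concordance = kwic(text.split(), "save", width=3)
print(format_kwic(concordance))
```

Each occurrence becomes one line, which is why the number of lines to read grows directly with corpus size.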
Age 2: limitations
As corpora get bigger: too much data
• 50 lines for a word: read all
• 500 lines: could read all, but slow, takes a long time
• 5000 lines: no
Age 3: Collocation statistics
• Problem: too much data - how to summarise?
• Solution: list of words occurring in the neighbourhood of the headword, with frequencies
• Sorted by salience
Age-3 collocation statistics: limitations
• Lists contain junk
• Unsorted for type
– mixes together adverbs, subjects, objects, prepositions
• What we really want:
– noise-free lists
– one list for each grammatical relation
Collocation listing
For collocates of save (>5 hits), window 1-5 words to right of nodeword
word       word
forests    life
$1.2       dollars
lives      costs
enormous   thousands
annually   face
jobs       estimated
money      your
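A listing like the one above can be approximated as follows. This is a sketch under stated assumptions: the slides do not say which salience statistic was used, so pointwise mutual information (PMI) stands in for it here, and the function name, parameters and toy data are all hypothetical. The slide's ">5 hits" threshold would correspond to min_hits=6.

```python
import math
from collections import Counter


def right_window_collocates(tokens, node, window=5, min_hits=1):
    """Count words occurring 1..window tokens to the right of `node`
    and score each with PMI (an assumed salience measure)."""
    word_freq = Counter(tokens)
    pair_freq = Counter()
    for i, token in enumerate(tokens):
        if token == node:
            pair_freq.update(tokens[i + 1:i + 1 + window])
    n = len(tokens)
    scored = []
    for word, hits in pair_freq.items():
        if hits < min_hits:
            continue
        # PMI = log2( P(node, word) / (P(node) * P(word)) )
        pmi = math.log2(hits * n / (word_freq[node] * word_freq[word]))
        scored.append((word, hits, pmi))
    # Most salient first; ties broken by raw co-occurrence count.
    return sorted(scored, key=lambda row: (row[2], row[1]), reverse=True)


tokens = ["save", "lives", "the", "save", "lives", "the", "the", "the"]
rows = right_window_collocates(tokens, "save", window=2)
```

In the toy data "lives" and "the" co-occur with "save" equally often, but "the" is frequent everywhere, so it scores lower. That is the point of sorting by salience rather than raw frequency.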
Age 4: The word sketch
• Large well-balanced corpus
• Parse to find
– subjects, objects, heads, modifiers etc
• One list for each grammatical relation
• Statistics to sort each list, as before
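The step from a single mixed list to one list per grammatical relation can be sketched like this. It assumes a parser has already produced (relation, head, collocate) triples; the triples, relation names and function name below are hypothetical, and raw frequency is used for ranking where a real system would apply a salience statistic.

```python
from collections import Counter, defaultdict


def word_sketch(triples, headword, top_n=5):
    """Build one ranked collocate list per grammatical relation.

    `triples` is an iterable of (relation, head, collocate) tuples,
    assumed to come from a parser. Lists are ranked by raw frequency
    here; a salience statistic could replace it.
    """
    by_relation = defaultdict(Counter)
    for relation, head, collocate in triples:
        if head == headword:
            by_relation[relation][collocate] += 1
    return {rel: counts.most_common(top_n) for rel, counts in by_relation.items()}


# Hypothetical parser output.
triples = [
    ("object", "save", "money"),
    ("object", "save", "life"),
    ("object", "save", "money"),
    ("subject", "save", "doctor"),
    ("modifier", "save", "annually"),
    ("object", "spend", "money"),
]
sketch = word_sketch(triples, "save")
```

Objects, subjects and modifiers now sit in separate lists instead of being mixed together in one window-based listing.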
Macmillan English Dictionary for Advanced Learners
Ed: Rundell, 2002
Working practice
• Lexicographers mainly used sketches, not concordances
– Missed less, more consistent
– Faster
Euralex 2002
Euralex 2002
Can I have them for my language, please?
The Sketch Engine
• Input:
– any corpus, any language (lemmatised, part-of-speech tagged)
– specification of grammatical relations
• Word sketches integrated with corpus query system
– Supports complex searching, sorting etc
• Credit: Pavel Rychly, Masaryk Univ
Customers
• Dictionary publishers
– Oxford University Press
– Cambridge University Press
– Collins
– Macmillan
– FrameNet Project (Berkeley, US)
– National dictionary projects in Czech Republic, Estonia, Ireland, Netherlands, Slovakia, Slovenia
• Universities
– Teaching and research
– Languages, linguistics, language technology
– UK, Germany, US, Greece, Taiwan, Japan, China, Slovenia,…
• Other
– Language teaching, textbook writing
– Information management, web search companies
– Automatic translation
Web corpora
• Replaceable or replacable?
– http://googlefight.com
– http://looglefight.com
The web is
– Very very large
– Most languages
– Most language types
– Up-to-date
– Free
– Instant access
Web corpus types
• Large, general corpora
• Small, specialised corpora
– Specially for translators
Basic steps
• Gather pages
– CSE hits
– Select and gather whole sites
– General crawl
• Filter
• De-duplicate
• Linguistic processing
• Load into corpus tool
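The filter and de-duplicate steps above can be sketched as follows. The slides do not describe how either step is implemented, so this is only an illustration under assumptions: the length threshold is an arbitrary filter, and de-duplication here catches only copies that are identical after lowercasing and whitespace normalisation (real pipelines usually also handle near-duplicates). All names are hypothetical.

```python
import hashlib
import re


def keep_page(text, min_words=50):
    """Crude filter (an assumption): keep pages with enough running text."""
    return len(text.split()) >= min_words


def fingerprint(text):
    """Hash the text after lowercasing and collapsing whitespace,
    so trivially different copies map to the same key."""
    normalised = re.sub(r"\s+", " ", text.lower()).strip()
    return hashlib.sha1(normalised.encode("utf-8")).hexdigest()


def build_corpus(pages, min_words=50):
    """Filter pages, then drop duplicates (after normalisation)."""
    seen = set()
    corpus = []
    for text in pages:
        if not keep_page(text, min_words):
            continue
        key = fingerprint(text)
        if key in seen:
            continue
        seen.add(key)
        corpus.append(text)
    return corpus


pages = [
    "The cat sat on the mat",
    "the  cat sat on the MAT",   # duplicate once normalised
    "too short",                 # filtered out
    "A dog barked at the postman today",
]
corpus = build_corpus(pages, min_words=5)
```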
WaC family corpora
• 100m – 2b word corpora
• 2-month project each
• All major world languages available in Sketch Engine
– Currently 30 languages
– Growing monthly
• Pioneers: Marco Baroni, Serge Sharoff
• Corpus Factory
• Seeds:
– mid-frequency words from ‘core vocab’ lists and corpora
• Google on seed words, then crawl
Corpora
Arabic      174   Hindi         31   Russian      188
Chinese     456   Indonesian   102   Slovak       536
Czech       800   Irish         34   Slovene      738
Dutch       128   Italian     1910   Spanish      117
English    5508   Japanese     409   Swedish      114
French      126   Norwegian     95   Telugu         5
German     1627   Persian        6   Thai         108
Greek       149   Portuguese    66   Vietnamese   174
Estonian     11   Romanian      53   Welsh         63
Korean       77   Polish       156   Malay        230
How good are they?
• How to assess?
– Hard question, open research topic
• Good coverage
– Newspapers: news, politics bias
– Web corpora: also cover personal, kitchen vocab
• Web corpus / BNC / journalism corpus
– First two are close
Evaluating word sketches
• 11 years
– 1999-2010
• Feedback
– Good but anecdotal
• Formal evaluation
• Method also lets us evaluate corpora
Goal
• Collocations dictionary
– Model: Oxford Collocations Dictionary
– Publication-quality
• Ask a lexicographer
– For 42 headwords
– For 20 best collocates per headword
– “should we include this collocation in a published dictionary?”
Sample of headwords
Nouns, verbs, adjectives, random
High (Top 3000)
  N: space solution opinion mass corporation leader
  V: serve incorporate mix desire
  Adj: high detailed open academic
Mid (3000-9999)
  N: cattle repayment fundraising elder biologist sanitation
  V: grieve classify ascertain implant
  Adj: adjacent eldest prolific ill
Low (10,000-30,000)
  N: predicament adulterer bake bombshell candy shellfish
  V: slap outgrow plow traipse
  Adj: neoclassical votive adulterous expandable
Precision and recall
• We test precision
• Recall is harder
– How do we find all the collocations that the system should have found?
• Current work
– 200 collocates per headword
– Selected from all the corpora we have, with various parameter settings
– Plus just-in-time evaluation for 'new' collocates
Four languages, three families
• Dutch
– ANW, 102m-word lexicographic corpus
• English
– UKWaC, 1.5b web corpus
• Japanese
– JpWaC, 400m web corpus
• Slovene
– FidaPlus, 620m lexicographic corpus
User evaluation
• Evaluate whole system
– Will it help with my task?
– Eg preparing a collocations dictionary
• Contrast: developer evaluation
– Can I make the system better?
– Evaluate each module separately
– Current work
Components
• Corpus
• NLP tools
– Segmenter, lemmatiser, POS-tagger
• Sketch grammar
• Statistics
Practicalities
• Interface
– Good, Good-but: merge to good
– Maybe, Maybe-specialised, Bad: merge to bad
• For each language
– Two/three linguists/lexicographers
– If they disagree: don't use for computing performance
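The scoring procedure on this slide can be sketched as below. It is an illustration only: the five labels and the two merge rules come from the slide, but the data are hypothetical, and the sketch assumes that disagreement is judged after the labels have been merged to good/bad, which the slide does not state explicitly.

```python
# Merge rules from the slide: five interface labels collapse to two.
MERGE = {
    "Good": "good",
    "Good-but": "good",
    "Maybe": "bad",
    "Maybe-specialised": "bad",
    "Bad": "bad",
}


def precision_from_judgements(judgements):
    """`judgements` maps each collocation to the labels given by the judges.
    Items on which the judges disagree (after merging) are excluded."""
    good = agreed = 0
    for labels in judgements.values():
        merged = {MERGE[label] for label in labels}
        if len(merged) != 1:
            continue  # judges disagree: don't use for computing performance
        agreed += 1
        if merged == {"good"}:
            good += 1
    return good / agreed if agreed else 0.0


# Hypothetical judgements from two judges.
judgements = {
    ("save", "money"): ["Good", "Good-but"],          # agreed: good
    ("save", "life"): ["Good", "Good"],               # agreed: good
    ("save", "your"): ["Bad", "Maybe"],               # agreed: bad
    ("save", "face"): ["Good", "Maybe-specialised"],  # disagreement: excluded
}
precision = precision_from_judgements(judgements)
```

With one item excluded for disagreement, two of the three remaining collocations are good, so the precision is two thirds.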
Results
Dutch      66%
English    71%
Japanese   87%
Slovene    71%
Two thirds of a collocations dictionary can be gathered automatically
Small specialised corpora
• Terminologists
• Translators needing target-language domain-specific vocab
• Specialist dictionaries
– Don’t exist
– Expensive/inaccessible
– Out of date
• Instant small web corpora
– BootCaT: Baroni and Bernardini 2004
– WebBootCaT demo
Cyborgs
A creature that is partly human and partly machine
– Macmillan English Dictionary
Cyborgs and the Information Society
The dictionary-making agent is part human (for precision), part computer (for recall).
Treat your computer with respect. You and it can do great things together.
Thank you
http://www.sketchengine.co.uk
Corpus and dictionary
• Established model:
– Lexicographers use corpora, users use dictionaries
• But
– Users like collocations, examples
– But are not corpus linguists
• Explore the space between corpus and dictionary
Collocationality
• Which words are most ‘collocational’?
• Dictionary publishers
– Where to put ‘collocation boxes’
• Language learners
Verb      Freq   MLE prob × log prob (entropy term)
Take      2084   -0.469
Gain       131   -0.169
Offer      117   -0.157
See        110   -0.150
Enjoy       67   -0.104
…            …   …
Clarify      1   -0.0031
…            …   …
Total     3730   -3.909
Calculation of entropy for advantage (object relation)
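The per-verb figures in the table can be reproduced with a short calculation. The sketch below assumes base-2 logarithms, which the slide does not state but which match the listed values (for example, take: p = 2084/3730, and p × log2 p is about -0.469). The function names are hypothetical.

```python
import math


def plogp(freq, total):
    """One entropy term: p * log2(p), where p = freq / total is the
    maximum-likelihood estimate of the probability."""
    p = freq / total
    return p * math.log2(p)


def entropy(freqs):
    """Shannon entropy in bits of a full frequency distribution:
    the negated sum of the p * log2(p) terms."""
    total = sum(freqs)
    return -sum(plogp(f, total) for f in freqs if f > 0)


# Terms for verbs listed on the slide (total of 3730 taken from the table).
take_term = plogp(2084, 3730)
clarify_term = plogp(1, 3730)
```

Because the table elides most of its rows, the -3.909 total cannot be recomputed from the slide alone; entropy() applies to a complete list of frequencies.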
place (17881), attention (8476), door (8426), care (4884), step (4277), advantage (3730), rise (3334), attempt (2825), impression (2596), notice (2462), chapter (2318), mistake (2205), breath (2140), hold (1949), birth (1016), living (953), indication (812), tribute (720), debut (714), button (661), eyebrow (649), anniversary (637), mention (615), glimpse (531), suicide (486), toll (472), refuge (470), spokesman (453), sigh (436), birthday (429), wicket (412), appendix (410), pardon (399), precaution (396), temptation (374), goodbye (372), fuss (366), resemblance (350), goodness (288), precedence (285), havoc (270), tennis (266), comeback (260), farewell (228), prominence (228), go-ahead (202), sip (198),
What is there on the web?
• Web1T
– Present from Google
– All 1-, 2-, 3-, 4-, 5-grams with f>40 in one trillion (10^12) words of English
– 1,000,000,000,000
• Compare with BNC
– Take top 50,000 items of each
– 105 Web1T words not in BNC top 50k
– 50 words with highest Web1T:BNC ratio
– 50 words with lowest ratio
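The comparison described above can be sketched as follows. This is a hedged illustration, not the procedure actually used: it assumes two word-to-frequency dictionaries, and it normalises each count by its corpus total before taking the ratio, which the slide does not specify. The function name and toy data are hypothetical.

```python
def compare_frequency_lists(web, bnc, top_n=50000, extremes=50):
    """Take the `top_n` most frequent words of each list, then return
    (a) words only in the web top list, (b) shared words with the
    highest web:BNC ratio, (c) shared words with the lowest ratio."""
    def top(freqs):
        ranked = sorted(freqs, key=freqs.get, reverse=True)[:top_n]
        return {w: freqs[w] for w in ranked}

    web_top, bnc_top = top(web), top(bnc)
    # Assumption: compare relative frequencies, since the corpora differ in size.
    web_total, bnc_total = sum(web.values()), sum(bnc.values())
    web_only = sorted(set(web_top) - set(bnc_top))
    shared = set(web_top) & set(bnc_top)
    ratio = {w: (web_top[w] / web_total) / (bnc_top[w] / bnc_total) for w in shared}
    ranked = sorted(ratio, key=ratio.get, reverse=True)
    return web_only, ranked[:extremes], ranked[-extremes:]


# Toy frequency lists (hypothetical counts).
web = {"the": 1000, "url": 50, "browser": 40, "she": 10}
bnc = {"the": 100, "she": 10, "council": 5, "url": 1}
web_only, web_high, web_low = compare_frequency_lists(web, bnc, top_n=4, extremes=1)
```

In the toy data "browser" is absent from the BNC list, "url" has the highest ratio and "she" the lowest.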
Web-high (155 terms)
• 61 web and computing
– config browser spyware url www forum
• 38 porn
• 22 US English
• 18 business/products common on web
– poker viagra lingerie ringtone dvd casino rental collectible tiffany
– NB: BNC is old
• 4 legal
– trademarks pursuant accordance herein
Web-low
• Exclude British English, transcription/tokenisation anomalies
– herself stood seemed she looked yesterday sat considerable had council felt perhaps walked round her towards claimed knew obviously remained himself he him
Observations
• Pronouns and past tense verbs
– Fiction
• Masc vs fem
• Yesterday
– Probably daily newspapers
• Constancy of ratios:
– He/him/himself
– She/her/herself