Why We Need Corpora and the Sketch Engine

Adam Kilgarriff
Lexical Computing Ltd, UK
Universities of Leeds and Sussex

Madrid April 2010 Kilgarriff: Why corpora and how 2

Corpora show us the facts of the language

Exercise

Think about the word planet
• What could you say about it if you were writing a dictionary entry?
• Write down three (or more) things

The Sketch Engine: demo

http://www.sketchengine.co.uk

Dictionaries

How to decide what to say about the word?
• What the native speaker knows (introspection)
• What other dictionaries say
• The corpus

Four ages of corpus lexicography

Age 1: Pre-computer

Oxford English Dictionary:
• 20 million index cards

Age 2: KWIC Concordances

• From 1980
• Computerised
• Overhauled lexicography
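
A KWIC (keyword in context) concordance lists every occurrence of a word, with a fixed amount of text on either side. The following is a minimal sketch, assuming whitespace tokenisation; the `kwic` helper and the example sentence are hypothetical and are not taken from the Sketch Engine.

```python
def kwic(tokens, node, width=3):
    """Return one (left context, node, right context) triple per hit."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok.lower() == node:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, tok, right))
    return lines


# Hypothetical example sentence, for illustration only.
text = "we must save the forests to save lives and save money"
lines = kwic(text.split(), "save")
for left, node, right in lines:
    # Right-align the left context so the node words line up in a column.
    print(f"{left:>20}  {node}  {right}")
```

Each hit becomes one line, which is why the number of lines to read grows with the size of the corpus.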

Age 2: limitations

As corpora get bigger: too much data
• 50 lines for a word: read all
• 500 lines: could read all, but it takes a long time
• 5000 lines: no

Age 3: Collocation statistics

Problem: too much data. How to summarise?

Solution: list of words occurring in the neighbourhood of the headword, with frequencies
• Sorted by salience
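
The idea can be sketched in a few lines of Python. The slide does not say which salience statistic is meant, so this example assumes pointwise mutual information (PMI), one widely used choice. The function name `right_collocates`, the window size, and the toy sentence are all hypothetical.

```python
import math
from collections import Counter


def right_collocates(tokens, node, window=2, min_hits=1):
    """Score words found up to `window` tokens to the right of `node`.

    Returns (word, co-occurrence count, PMI) tuples, highest PMI first.
    """
    word_freq = Counter(tokens)
    pair_freq = Counter()
    for i, tok in enumerate(tokens):
        if tok == node:
            pair_freq.update(tokens[i + 1:i + 1 + window])

    n = len(tokens)
    scored = []
    for word, f_xy in pair_freq.items():
        if f_xy < min_hits:
            continue
        # PMI = log2( P(x, y) / (P(x) * P(y)) ), estimated from raw counts.
        pmi = math.log2(f_xy * n / (word_freq[node] * word_freq[word]))
        scored.append((word, f_xy, pmi))
    return sorted(scored, key=lambda item: item[2], reverse=True)


# Hypothetical toy text, for illustration only.
tokens = "save money and save lives and save the forests and the money".split()
result = right_collocates(tokens, "save")
for word, hits, pmi in result:
    print(f"{word:10} {hits:3} {pmi:6.2f}")
```

PMI tends to favour rare words, so in practice a minimum hit count (the `min_hits` parameter here) is usually applied before sorting.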

Collocation listing

Collocates of save (>5 hits), to the right of the node word:

word        word
forests     life
$1.2        dollars
lives       costs
enormous    thousands
annually    face
jobs        estimated
money       your

Age-3 collocation statistics: limitations

• Lists contain junk
• Unsorted for type: mixes together adverbs, subjects, objects, prepositions

What we really want:
• Noise-free lists
• One list for each grammatical relation

Age 4: The word sketch

• Large, well-balanced corpus
• Parse to find subjects, objects, heads, modifiers etc.
• One list for each grammatical relation
• Statistics to sort each list, as before
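
A rough sketch of the last two steps, assuming parsing has already produced (relation, head, collocate) triples. The triples below are invented for illustration, `word_sketch` is a hypothetical helper, and logDice is used as one possible salience statistic, since the slide does not name one.

```python
import math
from collections import Counter, defaultdict


def word_sketch(triples, headword):
    """Group the collocates of `headword` by grammatical relation.

    `triples` holds (relation, head, collocate) tuples, assumed to come
    from a parser. Each list is sorted by logDice, highest first.
    """
    triples = list(triples)
    pair_freq = Counter(triples)
    head_freq = Counter((rel, head) for rel, head, _ in triples)
    coll_freq = Counter((rel, coll) for rel, _, coll in triples)

    sketch = defaultdict(list)
    for (rel, head, coll), f_xy in pair_freq.items():
        if head != headword:
            continue
        # logDice = 14 + log2( 2 * f(x, y) / (f(x) + f(y)) )
        dice = 2 * f_xy / (head_freq[rel, head] + coll_freq[rel, coll])
        sketch[rel].append((coll, f_xy, round(14 + math.log2(dice), 2)))
    for rel in sketch:
        sketch[rel].sort(key=lambda item: item[2], reverse=True)
    return dict(sketch)


# Hypothetical parser output, for illustration only.
triples = [
    ("object", "save", "money"),
    ("object", "save", "money"),
    ("object", "save", "life"),
    ("object", "spend", "money"),
    ("subject", "save", "government"),
    ("modifier", "save", "annually"),
]
sketch = word_sketch(triples, "save")
for relation, collocates in sketch.items():
    print(relation, collocates)
```

Because the counts are kept separately for each relation, objects, subjects and modifiers never compete in the same list.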

Macmillan English Dictionary for Advanced Learners

Ed: Rundell, 2002, 2007

Demo part 2

Fruit task

• Choose a fruit
• Concordance
  • Lemma, noun, lower case
• Frequency: node forms
• Write down
  • Plural freq (pl)
  • Singular freq (sing)
• Compute proportion: pl/(pl+sing)
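
The final step is simple arithmetic, sketched here in Python. The counts are made up for illustration and do not come from any corpus; `plural_proportion` is a hypothetical helper name.

```python
def plural_proportion(pl, sing):
    """Share of a noun's occurrences that are plural: pl / (pl + sing)."""
    total = pl + sing
    if total == 0:
        raise ValueError("the noun does not occur at all")
    return pl / total


# Hypothetical (plural, singular) counts, for illustration only.
counts = {"apple": (400, 600), "grape": (750, 250)}
for fruit, (pl, sing) in counts.items():
    print(f"{fruit}: {plural_proportion(pl, sing):.2f}")
```

A proportion near 1 means the fruit is mostly talked about in the plural; a proportion near 0 means mostly in the singular.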

What is a corpus?

A collection of texts (as used for linguistic study)

Which texts? How many?

Which texts?

• Written
• Spoken

Written

• Books
  • Fiction
  • Non-fiction
  • Textbooks
• Newspapers
• Letters, unpublished
• Web pages
• Academic journals
• Student essays
• …

Spoken

• Must be transcribed, for text corpora
• Conversation
  • Who? Region, class, age-group, situation…
• Lectures
• TV and radio
• Film transcripts
• Meetings, seminars
• …

Which texts?

• Different purposes, different text types
• Making dictionaries:
  • Cover the whole language
  • Some of everything

How much?

• Most words are rare
  • Zipf’s Law
• To get enough data for most words, we need very big corpora

Zipf’s Law

Word (pos)       r        f          r × f

the (det)        1        6187267    6187267
to (prep)        10       917579     9175790
as (adv)         100      91583      9158300
playing (vb)     1000     9738       9738000
paint (vb)       2000     4539       9078000
amateur (adj)    10000    741        7410000
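
The pattern in these figures is that rank times frequency stays roughly constant while frequency itself falls steeply. The short check below uses only the rank and frequency values given on this slide; the variable names are illustrative.

```python
# (word, rank r, frequency f), copied from the slide.
rows = [
    ("the", 1, 6187267),
    ("to", 10, 917579),
    ("as", 100, 91583),
    ("playing", 1000, 9738),
    ("paint", 2000, 4539),
    ("amateur", 10000, 741),
]

products = {word: r * f for word, r, f in rows}
for word, r, f in rows:
    print(f"{word:10} r={r:<6} f={f:<8} r*f={products[word]}")

# From rank 10 to rank 10000, frequency drops by a factor of over 1000,
# while r * f varies by far less.
tail = [products[word] for word, r, _ in rows if r >= 10]
freq_drop = 917579 / 741
product_spread = max(tail) / min(tail)
print(f"frequency drop: {freq_drop:.0f}x, spread of r*f: {product_spread:.2f}x")
```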

Zipf’s Law

• the: 6%
• 100 most frequent: 45%
• 7500 most frequent: 90%
• all others: rare
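
Coverage figures of this kind can be computed from any tokenised text by adding up the counts of the most frequent word types. A minimal sketch follows, with a hypothetical `coverage` helper and a toy sentence standing in for a real corpus.

```python
from collections import Counter


def coverage(tokens, top_n):
    """Fraction of all tokens accounted for by the `top_n` most frequent types."""
    if not tokens:
        raise ValueError("empty token list")
    freq = Counter(tokens)
    covered = sum(count for _, count in freq.most_common(top_n))
    return covered / len(tokens)


# Hypothetical toy text, for illustration only.
tokens = "the cat sat on the mat and the dog sat on the log".split()
for n in (1, 3, 100):
    print(f"top {n:3}: {coverage(tokens, n):.0%}")
```

On a real corpus the same function would be called with values such as 100 or 7500 for `top_n`.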

Zipf’s Law

[Bar chart: % of all texts (scale 0 to 100) accounted for by 'the', the 100 most frequent words, the 3500 most frequent words, and the 7500 most frequent words]

Leading English Corpora: Size

[Chart: size of corpora (in words), on a scale from 10^6 to 10^9, by decade from the 1960s to the 2000s; corpora shown: Brown/LOB, COBUILD, BNC, OEC]

Good news

The web

Thank you

http://www.sketchengine.co.uk