Transcript
Page 1: Using Corpora and how to build them Adam Kilgarriff Lexical Computing Ltd

Using Corpora and how to build them

Adam Kilgarriff

Lexical Computing Ltd

Page 2: Using Corpora and how to build them Adam Kilgarriff Lexical Computing Ltd

Madrid 2010 Kilgarriff: Web Corpora

• Corpora show us the facts of the language

Page 3: Using Corpora and how to build them Adam Kilgarriff Lexical Computing Ltd

Madrid 2010 Kilgarriff: Web Corpora 3

What is a corpus?

• a corpus is a collection of texts – when viewed as an object of language

research

Page 4: Using Corpora and how to build them Adam Kilgarriff Lexical Computing Ltd

Madrid 2010 Kilgarriff: Web Corpora

Which texts?

• Written

• Spoken

Page 5: Using Corpora and how to build them Adam Kilgarriff Lexical Computing Ltd

Madrid 2010 Kilgarriff: Web Corpora

Written

• Books– Fiction– Non-fiction– Textbooks

• Newspapers• Letters, unpublished• Web pages• Academic journals• Student essays• …

Page 6: Using Corpora and how to build them Adam Kilgarriff Lexical Computing Ltd

Madrid 2010 Kilgarriff: Web Corpora

Spoken

Must be transcribed, for text corpora

• Conversation– Who? Region, class, age-group, situation…

• Lectures

• TV and Radio

• Film transcripts

• Meetings, seminars

• …

Page 7: Using Corpora and how to build them Adam Kilgarriff Lexical Computing Ltd

Madrid 2010 Kilgarriff: Web Corpora

Which texts?

• Different purposes, different text types

• Making dictionaries:– Cover the whole language– Some of everything

Page 8: Using Corpora and how to build them Adam Kilgarriff Lexical Computing Ltd

Madrid 2010 Kilgarriff: Web Corpora

How much?

• Most words are rare

• Zipf’s Law

• To get enough data for most words, we need very big corpora

Page 9: Using Corpora and how to build them Adam Kilgarriff Lexical Computing Ltd

Madrid 2010 Kilgarriff: Web Corpora

Zipf’s Law

Word (pos) r f r x f

the (det) 1 6187267 6187267 to (prep) 10 917579 9175790as (adv) 100 91583 9158300playing (vb) 1000 9738 9738000paint (vb) 2000 4539 9078000amateur (adj) 10,000 741 7410000

Page 10: Using Corpora and how to build them Adam Kilgarriff Lexical Computing Ltd

Madrid 2010 Kilgarriff: Web Corpora

Zipf’s Law

• the: 6%

• 100 most frequent: 45%

• 7500 most frequent: 90%

• all others: rare

Page 11: Using Corpora and how to build them Adam Kilgarriff Lexical Computing Ltd

Madrid 2010 Kilgarriff: Web Corpora

Zipf’s Law

0102030405060708090

100

'the' 100 mostfrequent

3500most

frequent

7500most

frequent

% of all texts

Page 12: Using Corpora and how to build them Adam Kilgarriff Lexical Computing Ltd

Madrid 2010 Kilgarriff: Web Corpora

Leading English Corpora: Size

109

108

107

106

Size of

Corpora

(in words)

1960s 1970s 1980s 1990s 2000s

Brown/LOB COBUILD BNC OEC

Page 13: Using Corpora and how to build them Adam Kilgarriff Lexical Computing Ltd

Madrid 2010 Kilgarriff: Web Corpora

Good news

• The web

Page 14: Using Corpora and how to build them Adam Kilgarriff Lexical Computing Ltd

Madrid 2010 Kilgarriff: Web Corpora 14

You can’t help noticing

• Replaceable or replacable?– http://googlefight.com

Page 15: Using Corpora and how to build them Adam Kilgarriff Lexical Computing Ltd

Madrid 2010 Kilgarriff: Web Corpora 15

• Very very large– 2006 estimates for duplicate free, linguistic, Google-

indexed web• German: 44 billion words• Italian: 25 billion words• English: 1,000 billion -10,000 billion words

• Most languages• Most language types• Up-to-date• Free• Instant access

Page 16: Using Corpora and how to build them Adam Kilgarriff Lexical Computing Ltd

Madrid 2010 Kilgarriff: Web Corpora 16

What is a corpus?

• a corpus is a collection of texts – when viewed as an object of language

research

Page 17: Using Corpora and how to build them Adam Kilgarriff Lexical Computing Ltd

Madrid 2010 Kilgarriff: Web Corpora 17

Is the web a corpus?

Yes

Page 18: Using Corpora and how to build them Adam Kilgarriff Lexical Computing Ltd

Madrid 2010 Kilgarriff: Web Corpora 18

but it’s not representative

Page 19: Using Corpora and how to build them Adam Kilgarriff Lexical Computing Ltd

Madrid 2010 Kilgarriff: Web Corpora 19

sublanguage• Language = core + sublanguages• Options for corpus construction

– none– some– all

• None– impoverished view of language

• Some: BNC– cake recipes and gastro-uterine disease– not car repair manuals or astronomy or …

• All: until recently, not viable

Page 20: Using Corpora and how to build them Adam Kilgarriff Lexical Computing Ltd

Madrid 2010 Kilgarriff: Web Corpora 20

Representativeness• The web is not representative• but nor is anything else• Text type variation

– under-researched, lacking in theory• Atkins Clear Ostler 1993 on design brief for BNC;

Biber 1988, Kilgarriff 2001

• Text type is an issue across NLP– Web: issue is acute because, as against BNC or

WSJ, we simply don’t know what is there

Page 21: Using Corpora and how to build them Adam Kilgarriff Lexical Computing Ltd

Madrid 2010 Kilgarriff: Web Corpora 21

What is out there?

• What text types are there on the web?– some are new: chatroom

– proportions

• is it overwhelmed by porn? How much?

• Hard question

Page 22: Using Corpora and how to build them Adam Kilgarriff Lexical Computing Ltd

Madrid 2010 Kilgarriff: Web Corpora 22

Comparing frequency lists

• Web1T– Present from google– All 1-, 2-, 3-, 4, 5-grams with f>40 in one

trillion (1012) words of English• that’s 1,000,000,000,000

• Compare with BNC– 100 words with highest Web1T:BNC ratio– 100 words with lowest ratio

Page 23: Using Corpora and how to build them Adam Kilgarriff Lexical Computing Ltd

Madrid 2010 Kilgarriff: Web Corpora 23

Web-high (155 terms)

• 61 web and computing– config browser spyware url www forum

• 38 porn• 22 US English (incl Spanish influence –los)• 18 business/products common on web

– poker viagra lingerie ringtone dvd casino rental collectible tiffany

– NB: BNC is old

• 4 legal– trademarks pursuant accordance herein

Page 24: Using Corpora and how to build them Adam Kilgarriff Lexical Computing Ltd

Madrid 2010 Kilgarriff: Web Corpora 24

Web-low

• Exclude British English, transcription/tokenisation anomalies

– herself stood seemed she looked yesterday sat considerable had council felt perhaps walked round her towards claimed knew obviously remained himself he him

Page 25: Using Corpora and how to build them Adam Kilgarriff Lexical Computing Ltd

Madrid 2010 Kilgarriff: Web Corpora 25

Observations

• Pronouns and past tense verbs– Fiction

• Masc vs fem

• Yesterday– Probably daily newspapers

• Constancy of ratios:– He/him/himself– She/her/herself

Page 26: Using Corpora and how to build them Adam Kilgarriff Lexical Computing Ltd

Madrid 2010 Kilgarriff: Web Corpora 26

• The web– a social, cultural, political phenomenon– new, little understood– a legitimate object of science– mostly language

• we are well placed– a lot of people will be interested

• Let’s– study the web– source of language data– apply our tools for web use (dictionaries, MT)– use the web as infrastructure

Page 27: Using Corpora and how to build them Adam Kilgarriff Lexical Computing Ltd

Madrid 2010 Kilgarriff: Web Corpora 27

Using Search Engines

No setup costsStart querying today

Methods• Hit counts• ‘snippets’

– Metasearch engines, WebCorp

• Find pages and download

Page 28: Using Corpora and how to build them Adam Kilgarriff Lexical Computing Ltd

Madrid 2010 Kilgarriff: Web Corpora 28

Googleology

• Google hit counts for language modelling

– Example: (Keller & Lapata 2003) – 36 queries to estimate freq(fulfil, obligation) to

each of Google and Altavista

• Very interesting work

• Great interest in query syntax

Page 29: Using Corpora and how to build them Adam Kilgarriff Lexical Computing Ltd

Madrid 2010 Kilgarriff: Web Corpora 29

The Trouble with Google• not enough instances

– max 1000• not enough queries

– max 1000 per day with API• not enough context

– 10-word snippet around search term• sort order

– search term in titles and headings • untrustworthy hit counts• limited search options• linguistically dumb, eg not lemmatised

• aime/aimer/aimes/aimons/aimez/aiment …

Page 30: Using Corpora and how to build them Adam Kilgarriff Lexical Computing Ltd

Madrid 2010 Kilgarriff: Web Corpora 30

• Appeal– Zero-cost entry, just start googling

• Reality– High-quality work: high-cost methodology

Page 31: Using Corpora and how to build them Adam Kilgarriff Lexical Computing Ltd

Madrid 2010 Kilgarriff: Web Corpora 31

Also:

• No replicability

• Methods, stats not published

• At mercy of commercial corporation

Page 32: Using Corpora and how to build them Adam Kilgarriff Lexical Computing Ltd

Madrid 2010 Kilgarriff: Web Corpora 32

Also:

• No replicability

• Methods, stats not published

• At mercy of commercial corporation

• Googleology is bad science

• So…

Page 33: Using Corpora and how to build them Adam Kilgarriff Lexical Computing Ltd

Madrid 2010 Kilgarriff: Web Corpora 33

Basic steps

• Gather pages– Google hits– Select and gather whole sites– General crawl

• Filter

• De-duplicate

• Linguistic processing

• Load into corpus tool

Page 34: Using Corpora and how to build them Adam Kilgarriff Lexical Computing Ltd

Madrid 2010 Kilgarriff: Web Corpora 34

Oxford English Corpus

• Whole domains chosen and harvested– control over text type

• 2 billion words (Mar 08)

Page 35: Using Corpora and how to build them Adam Kilgarriff Lexical Computing Ltd

Madrid 2010 Kilgarriff: Web Corpora 35

Oxford English Corpus

Page 36: Using Corpora and how to build them Adam Kilgarriff Lexical Computing Ltd

Madrid 2010 Kilgarriff: Web Corpora 36

DeWaC, ItWaC, UKWaC

• 1.5 B words each

• Marco Baroni, Adriano Ferraresi

• Seeds: – mid-frequency words from ‘core vocab’ lists

and corpora

• Google on seed words, then crawl

Page 37: Using Corpora and how to build them Adam Kilgarriff Lexical Computing Ltd

Madrid 2010 Kilgarriff: Web Corpora 37

Filtering

• Non-text (sound, image etc) files• Boilerplate (within file)

– Copyright notices, navigation bars– “high markup” heuristic

• Not “text in sentences”– Look for function words– Lists?? Sports results?? Crossword puzzles??

• Spam, pornography– Tough

• De-duplication (also tough)

Page 38: Using Corpora and how to build them Adam Kilgarriff Lexical Computing Ltd

Madrid 2010 Kilgarriff: Web Corpora 38

Small, specialised corpora

• Terminologists

• Translators needing target-language domain-specific vocab

• Specialist dictionaries– Don’t exist– Expensive/inaccessible– Out of date

Page 39: Using Corpora and how to build them Adam Kilgarriff Lexical Computing Ltd

Madrid 2010 Kilgarriff: Web Corpora 39

BootCat (Bootstrapping Corpora and Terms)

• Put in seed terms• Google/Yahoo search• Retrieve Google/Yahoo hits

– Remove duplicates, boilerplate

• Small instant corpora• Baroni and Bernardini, LREC 2004• Web version

– WebBootCaT– At Sketch Engine site

Page 40: Using Corpora and how to build them Adam Kilgarriff Lexical Computing Ltd

Madrid 2010 Kilgarriff: Web Corpora

Task

• Choose area of specialist interest– English or Spanish

• Select at least 5 seed terms– Specialist: good

• Build corpus– At least 100,000 words– Iterate if necessary

• Find at least six words/phrases/meanings you did not know before

• Write up


Recommended