Corpora by Web Services

Adam Kilgarriff, Lexical Computing Ltd, Lexicography MasterClass Ltd, Universities of Leeds and Sussex

Leeds, April 2010 Kilgarriff: Corpora by Web Services 2

Starting a PhD in NLP

Then Prolog Type in a few

grammar rules Lexical entries Example sentences

We’re off!

Now Corpus

Which? Budget/schedule How much can we afford? Hard disk space

Access software Build

Big job, making it fast is hard – or Research, acquire, install, maintain …

Research question Morphology, syntax, discourse structure,

semantics, anaphor First six months at least

Acquiring data, software Complications

If you’re not super-geeky

Did I do it properly? Dumbing down

Let’s choose an easier question Looking over shoulder

Disappointment

Making it easy

Like picking up a hire car

Corpora by web services

Possible? Already available

Sketch Engine

Corpus querying Fast Handles large corpora In use for lexicography at

OUP, CUP, Macmillan, Collins, Le Robert Word sketches

Data-driven summary of a word’s grammatical and collocational behaviour

Corpora

Arabic       174    Hindi         31    Russian      188
Chinese      456    Indonesian   102    Slovak       536
Czech        800    Irish         34    Slovene      738
Dutch        128    Italian     1910    Spanish      117
English     5508    Japanese     409    Swedish      114
French       126    Norwegian     95    Telugu         5
German      1627    Persian        6    Thai         108
Greek        149    Portuguese    66    Vietnamese   174
                    Romanian      53    Welsh         63

Big, High Quality corpora

Big Performance

Banko and Brill 2004 There’s no data like more data

Ample data for rare phenomena Big subcorpora

5b Medical: 30m

Quality Bad data

Spam Navigation-bars Duplicates Lists Bungled formatting Wrong language …

Less discussed Maybe a footnote I wonder why

Quick fixes and run

The Google/Yahoo/Bing option

Appeal No setup costs Start googling today

Very interesting work Keller and Lapata

Validity of SE counts vs BNC counts vs psycholinguistic validity of collocations

36 queries per collocation “fulfil obligation” “fulfil ? obligation” “fulfilling obligations” ...

Nakov, Nakov and Hearst Great interest in query syntax

but

Limited hits-per-query Limited hits-per-day Sort order

Not documented 'unsorted' not possible

Snippets too short for research No (documented) morphology Limited query syntax

and

At mercy of commercial company Might change at any time Not replicable

So

Appeal No setup costs

Serious research Many difficult practical issues Not a tool designed for linguists

Conclusion If only SE indexes are big enough

Yes Else no

Strategy

More languages Corpus Factory, as Sharoff

Bigger Big Web Corpus (BiWeC) Currently 5.5b fully processed Target 20b

Better

New Model Corpus

BNC is past its sell-by Early 1990s Pre web Still dominant model

New model needed

Model

Small: model train Model train

Design: software model NMC

1:100 for BiWeC-scale 100m

Update of BNC as design model Data from web but Text type available

Open-source/collaboration

We distribute You annotate

Pos-tags, parses, anaphor, discourse moves, semantics, multiwords, entity-types ...

Domain, register, region ... Send us annotations We integrate

And give access in SkE

Divide and rule

Bigger (BiWeC) Better (NMC) Take best annotations

Accuracy Speed Usefulness Good collaboration

from NMC, apply to BiWeC

TEDDCLOG

Taiwan English Data-Driven CLOze Generation

with Simon Smith and colleagues, Taipei API case study

Cloze

'fill-the-gap' Several metals _____ violently with cold water

A: behave B: react C: realise D: respond
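As a rough illustration of what a cloze item is mechanically, the sketch below blanks out the key word in a sentence and lists the key among the distractors. The function name `make_cloze` and the alphabetical ordering of options are assumptions made for this example only, not something the slides specify.

```python
import re


def make_cloze(sentence, key, distractors):
    """Blank out `key` in `sentence`; return the item text and the answer label."""
    pattern = r"\b" + re.escape(key) + r"\b"
    gapped, n = re.subn(pattern, "_____", sentence, count=1)
    if n == 0:
        raise ValueError(f"key {key!r} not found in sentence")
    # Assumption: options are shown in alphabetical order, as in the slide example.
    options = sorted(set(distractors) | {key})
    labels = "ABCDEFGH"
    lines = [gapped] + [f"{label}: {opt}" for label, opt in zip(labels, options)]
    answer = labels[options.index(key)]
    return "\n".join(lines), answer


item, answer = make_cloze(
    "Several metals react violently with cold water.",
    "react",
    ["behave", "realise", "respond"],
)
print(item)
print("Answer:", answer)
```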

Popular with students, teachers, testers Unpopular with theorists :-(

One objection

Test item writers make them up Not naturally-occurring language

The Sinclair-Johns critique

Also: expensive

TEDDCLOG Uses corpus sentences and distractors

[Diagram: TEDDCLOG modules and an example item]

Key: react

Modules: Thesaurus module, Diffs module, Concordance module, Text processing module

Carrier sentence: Several metals react violently with cold water.

Words shown: behave, interact, respond; behave, realise, respond

Collocation checks: metals behave x, metals respond x, metals realise x, metals react √

Output item: Several metals ___ violently with cold water. (a) behave (b) react (c) realise (d) respond

API calls

Find distractors Thesaurus

Find key-only collocate Sketch diffs

Needs optimising

Find carrier sentence Concordance with GDEX module

Good Dictionary Example Finder
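The three steps above can be sketched as follows. This is only a toy illustration: the dictionaries and list below are hypothetical in-memory stand-ins for the thesaurus, sketch-diff and concordance services, the function names are invented for this example, and the length filter is a crude placeholder for GDEX rather than its actual scoring.

```python
# Hypothetical toy data standing in for the three web services.
THESAURUS = {"react": ["behave", "realise", "respond"]}
COLLOCATES = {
    "react": {"metals", "people"},
    "behave": {"people", "children"},
    "realise": {"people"},
    "respond": {"people", "patients"},
}
CONCORDANCE = [
    "metals react",
    "Several metals react violently with cold water.",
    "People react differently.",
]


def find_distractors(key, n=3):
    """Step 1: take up to n thesaurus neighbours of the key as distractors."""
    return THESAURUS.get(key, [])[:n]


def key_only_collocates(key, distractors):
    """Step 2: collocates attested with the key but with none of the distractors."""
    shared = set()
    for d in distractors:
        shared |= COLLOCATES.get(d, set())
    return sorted(COLLOCATES.get(key, set()) - shared)


def find_carrier(key, collocate, min_words=5, max_words=25):
    """Step 3: first concordance line with both words and a readable length."""
    for line in CONCORDANCE:
        words = [w.strip(".,").lower() for w in line.split()]
        if key in words and collocate in words and min_words <= len(words) <= max_words:
            return line
    return None


key = "react"
distractors = find_distractors(key)
collocates = key_only_collocates(key, distractors)
carrier = find_carrier(key, collocates[0]) if collocates else None
print(distractors, collocates, carrier)
```

Choosing a collocate that occurs with the key only is what makes the distractors wrong in context: in the toy data, "metals" goes with "react" but with none of the other three verbs.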

Current status

TEDDCLOG Next phase: producing decent results

Corpora by Web Services Upping server capacity Looking for users (currently with UKWaC)

New Model Corpus Nervous over copyright but Available in SkE, for download

Another announcement: DANTE

Lexical database for English Detailed Accurate Extensive of English Highly corpus-driven 3 yr project 18 expert lexicographers Led by Sue Atkins

BNC, FrameNet, Euralex, COBUILD...

English side, New English-Irish dictionary Available for NLP research imminently

