Lou Burnard
Humanities Computing Unit
Oxford University Computing Services
http://info.ox.ac.uk/bnc/

The British National Corpus: where did we go wrong?

What is the BNC?

• 100 million words of modern British English
• produced by a consortium of dictionary publishers and academic researchers
  • OUP, Longman, Chambers
  • Oxford, Lancaster, British Library
• funded as a pre-competitive resource by DTI/SERC under JFIT, 1990-1994

Where did we go wrong?

(if we did)
or, The Benefit of Hindsight
or, If I'd known then what I know now...
or, Wisdom After the Event
And, Where Do We Go From Here?

Production of the BNC

• took three years (at least)
• cost GBP 1.6 million (at least)
• came about through an unusual coincidence of interests amongst:
  • lexicographical publishers
  • Government (DTI)
  • Science and Engineering Research Council

The Neotenous Nineties

• WinWord or WP5? The choice is yours
• On your desk … a 386 with 50 Mb disk space (just about enough to run Windows 3)
• In your lab … a VAX or a Sparc for serious work
• On the WWW (maybe) … Mosaic for X

Intellectual currents

• corpus linguistics
  • the LOB school
  • the Birmingham school
  • the LDC view
• text encoding theory
• language engineering
• the JFIT mentality, or Reconciling Town and Gown

Stated Project Goals

A synchronic (1990-4) corpus of samples both spoken and written from the full range of British English language production

of non-opportunistic design, for generic applicability

with word class annotation and contextual information

Actual (?) project goals

• Better ELT dictionaries
  • authoritative
  • both speech and writing
• A model for European corpus work
  • design, and encoding
  • industrial-academic co-operation
• A REALLY BIG corpus

Consequences

• industrial-scale text production system
• compromises in design and execution
• IPR and profitability

The BNC looks back to Brown and LOB in its design and markup, and forward to the Web in its scope and indeterminacy

The BNC “sausage machine”

[Diagram: the production pipeline. Boxes, in the order shown: OUP; Written (OUP/Chambers); Spoken (Longman); Initial CDIF Conversion and Validation (OUCS); Word Class Annotation (UCREL); Header generation and final validation (OUCS). Stage labels: Selection, clearance, and capture; Enrichment and encoding; Documentation, distribution, maintenance.]

Task groups

• permissions
• selection, design criteria
• encoding and markup
• enrichment and annotation
• retrieval software

Through-put (million words/quarter)

[Chart: y-axis 0 to 35 (million words); x-axis 6 to 14; series: Received, Validated, Annotated]

Tensions

• desire to test annotation scheme
• requirement to meet deliverables
• slipping goal posts
• quantity above quality

… an interesting learning experience for both sides!

BNC Selection Criteria

Written selection criteria
• predefined proportions of
  • different media (books, newspapers, unpublished…)
  • different domains (informative, entertaining…)
• maximum sample size 45000 words
• all texts incomplete

Spoken selection criteria
• context-governed
• demographically-sampled

Word tagging

<s n=00011> <w AT0>The <w NP0>Queen<w POS>'s <w AJ0>real <w NN1>annus horribilis <w VVD>began <w PRP> <w NN0>Sunday<c PUN>.</s>

• word-pos pair
• white space problems
• validation problems
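The word-pos pairing above can be illustrated with a few lines of Python. This is only a sketch, not the project's own tooling: the regular expression `TOKEN` and the helper `word_pos_pairs` are hypothetical names, and the pattern assumes that each `<w TAG>` or `<c TAG>` is followed by text running up to the next `<`.

```python
import re

# Assumed token shape: "<w TAG>" or "<c TAG>", then text up to the next "<".
# This pattern is an illustration only, not part of any BNC software.
TOKEN = re.compile(r"<(w|c) ([A-Z0-9-]+)>([^<]*)")

def word_pos_pairs(sentence):
    """Return (text, tag) pairs for every <w> and <c> element in one <s> unit."""
    return [(text.strip(), tag) for _kind, tag, text in TOKEN.findall(sentence)]

sample = ("<s n=00011> <w AT0>The <w NP0>Queen<w POS>'s <w AJ0>real "
          "<w NN1>annus horribilis <w VVD>began <w PRP> <w NN0>Sunday<c PUN>.</s>")

pairs = word_pos_pairs(sample)
for text, tag in pairs:
    # An empty text (as for PRP here) or a multi-word text (as for NN1)
    # shows up immediately in this listing.
    print(f"{tag:4} {text!r}")
```

Run on the slide's sentence, the listing shows one `<w>` element holding two orthographic words ("annus horribilis") and one holding no text at all (the `PRP` element), which is the kind of white space and validation problem the slide refers to.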

Sample written text

<text complete=Y decls='CN000 HN001 QN000 SN000'>
<div1 complete=Y org=SEQ>
<head type=MAIN>
<s n=001><w NP0>CAMRA <w NN1>FACT <w NN1>SHEET <w AT0>No <w CRD>1 </head>
<head r=it type=SUB>
<s n=002><w AVQ>How <w NN1>beer <w VBZ>is <w AJ0-VVN>brewed </head>
<p><s n=003><w NN1>Beer <w VVZ>seems <w DT0>such <w AT0>a <w AJ0>simple <w NN1>drink <w CJT>that <w PNP>we <w VVB>tend <w TO0>to <w VVI>take <w PNP>it <w CJS-PRP>for <w VVD-VVN>granted<c PUN>.
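Some tags in the sample are hyphenated pairs such as `AJ0-VVN`, where two word classes are recorded for a single word. A minimal sketch of how such cases might be pulled out for inspection follows; `TAGGED_WORD` and `ambiguous_words` are hypothetical names, and the assumption that a hyphen always separates exactly two candidate tags is drawn only from the examples on the slide.

```python
import re

# Assumption: a tag is either a single code (NN1) or two codes joined by a
# hyphen (AJ0-VVN), as in the slide's sample. Illustration only.
TAGGED_WORD = re.compile(r"<w ([A-Z0-9]+(?:-[A-Z0-9]+)?)>([^<]*)")

def ambiguous_words(markup):
    """Return (word, (tag_a, tag_b)) for every word carrying a two-part tag."""
    found = []
    for tag, text in TAGGED_WORD.findall(markup):
        if "-" in tag:
            found.append((text.strip(), tuple(tag.split("-"))))
    return found

sample = ("<s n=002><w AVQ>How <w NN1>beer <w VBZ>is <w AJ0-VVN>brewed "
          "<s n=003><w PNP>it <w CJS-PRP>for <w VVD-VVN>granted<c PUN>.")

for word, candidates in ambiguous_words(sample):
    print(word, candidates)
```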

Transcription practice

• Regionalised typists
• Markup makes explicit
  • changes of speaker and overlap
  • words as perceived by transcriber
  • plus indications of false starts, truncation, uncertainty
  • some performance features e.g. pausing, stage directions etc.
  • speaker details where available (always for respondents, sometimes for others)

Sample spoken text

<u who=PS04Y><s n=01296><w ITJ>Mm <pause> <w ITJ>yes <pause dur=7><w PNP>I <w VVD>told <w NP0>Paul <pause> <w CJT>that <w PNP>he <w VM0>can <w VVI>bring <w AT0>a <w NN1>lady <w AVP>up <pause> <w PRP>at <w NN1>Christmas-time<c PUN>.</u>
<u who=PS04U><s n=01297><w VBZ>Is <w PNP>he <w XX0>not <w VVG>going <w AV0>home <w AV0>then<c PUN>?</u>
<u who=PS04Y><s n=01298><w ITJ>No <pause dur=8> <w CJC>and <w UNC>erm <pause dur=7> <w PNP>I<w VBB>'m <w VVG>leaving <w AT0>a <w NN1>turkey <w PRP>in <w AT0>the <w NN1>freezer<c PUN>
<s n=01299><w NP0>Paul <w VBZ>is <w AV0>quite <w AJ0>good <w PRP>at <w NN1-VVG>cooking <pause> <w AJ0>standard <w NN1>cooking<c PUN>.</u>
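The speaker changes and pauses that the markup makes explicit can be read back out mechanically. The following is a rough sketch under stated assumptions: the patterns and the `summarise` helper are hypothetical, the `who` value is treated as an opaque speaker code, and no unit is assumed for the `dur` attribute.

```python
import re

# Assumed shapes, taken only from the sample on the slide:
#   <u who=CODE> ... </u>   one utterance by one speaker
#   <w TAG>text             one tagged word
#   <pause> or <pause dur=N>
UTTERANCE = re.compile(r"<u who=(\w+)>(.*?)</u>", re.S)
WORD = re.compile(r"<w [A-Z0-9-]+>([^<]*)")
PAUSE = re.compile(r"<pause(?: dur=(\d+))?>")

def summarise(transcript):
    """Return one summary dict per utterance, in document order."""
    rows = []
    for speaker, body in UTTERANCE.findall(transcript):
        durations = PAUSE.findall(body)  # '' for a pause without dur
        rows.append({
            "who": speaker,
            "words": [w.strip() for w in WORD.findall(body)],
            "pauses": len(durations),
            "timed_total": sum(int(d) for d in durations if d),
        })
    return rows

sample = ("<u who=PS04Y><s n=01296><w ITJ>Mm <pause> <w ITJ>yes <pause dur=7>"
          "<w PNP>I <w VVD>told <w NP0>Paul <pause> <w CJT>that</u>"
          "<u who=PS04U><s n=01297><w VBZ>Is <w PNP>he <w XX0>not "
          "<w VVG>going<c PUN>?</u>")

rows = summarise(sample)
for row in rows:
    print(row["who"], len(row["words"]), row["pauses"], row["timed_total"])
```

Because the speaker code is kept separate from the words, the same rows could be joined to speaker details (age, sex, class) wherever the header supplies them.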

Metadata

• each text has a TEI header
  • identification and classification
  • specific details (e.g. speakers)
  • housekeeping information
• all common data in the corpus header
• classification(s) in header pointed to by individual texts

Text classifications

• spoken texts
  • age, sex, class (of respondent)
  • domain, region, type
• written texts
  • author age, sex, type
  • audience, circulation, status
  • medium, domain
• Intention was to improve coverage, not accessibility

In retrospect…

• Some classifications were poorly defined and only partially populated
  • Domain or text-type?
  • Dating
    • date of copy? first publication?
  • Author age
    • when?
  • Author ethnic origin, domicile

That famous BNC balance

[Pie chart, BNC-1]
• Spoken Demographic: 4214819
• Spoken Context Governed: 6143048
• Books and Periodicals: 81089443
• Other written: 8712764

That famous BNC balance

[Pie chart, BNC-W. Segments: Spoken Demographic, Spoken Context Governed, Books and Periodicals, Other written. Values as shown: 78731276, 5997489, 8021274, 8743604.]

Written Domains

[Pie chart, BNC-1. Segments: Imaginative, Scientific, Social Science, Applied Science, World Affairs, Commerce, Arts, Belief, Leisure. Values as shown: 16479306, 7106818, 7259346, 3064222, 11340657, 19695650, 3754756, 13707349, 7394103.]

Written Domains

[Pie chart, BNC-2. Segments: Imaginative, Scientific, Social Science, Applied Science, World Affairs, Commerce, Arts, Belief, Leisure. Values as shown: 16781393, 7327671, 7242024, 3093407, 11630083, 16612770, 3798318, 13496137, 7493077.]

Written Domains

[Bar chart, in thousands; y-axis 0 to 25000; series label: WB. Categories: Imaginative, Scientific, Social Science, Applied Science, World Affairs, Commerce, Arts, Belief, Leisure. Values as shown: 7493.077, 13496.137, 3798.318, 16612.77, 11630.083, 3093.407, 7242.024, 7327.671, 16781.393.]

Spoken domains

[Pie chart]
• Demographic: 4214819
• Educational: 1639159
• Business: 1285938
• Institutional: 1652246
• Leisure: 1565705
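The spoken figures can be cross-checked with simple arithmetic. A minimal sketch, assuming that 4214819 is the Demographic figure and that the remaining four values pair with Educational, Business, Institutional and Leisure in the order listed (this pairing is inferred from the chart legend, not stated on the slide):

```python
# Figures copied from the slide; the label pairing is an assumption.
spoken = {
    "Demographic": 4_214_819,
    "Educational": 1_639_159,
    "Business": 1_285_938,
    "Institutional": 1_652_246,
    "Leisure": 1_565_705,
}

# Everything except the demographic sample is treated as context-governed.
context_governed = sum(n for name, n in spoken.items() if name != "Demographic")
total = context_governed + spoken["Demographic"]

print("context-governed:", context_governed)
print("all spoken:", total)
for name, n in spoken.items():
    print(f"{name:14} {100 * n / total:5.1f}% of spoken")
```

The four non-demographic figures sum to 6143048, which is the same value that appears on the BNC-1 balance slide alongside the label Spoken Context Governed.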

Availability

• BNC end-user licence
  • commercial exploitation of the corpus is forbidden
  • commercial exploitation of derived works is permitted
• OUCS is sole agent for licensing, reporting to Consortium
• Original restriction to EU has been lifted

Distribution methods

• 100 million words is (still) a lot of data
• IPR agreements imply not-for-profit distribution (which has its downsides too)
• The options are...
  • install it yourself
  • online access
  • the sampler

Install it yourself (version 1)

• You need...
  • £220 for a licence and 3 CDs
  • £2000 for a Unix box with min 6 Gb disk
  • some Unix expertise
• You get...
  • access to the whole corpus
  • using the tools of your choice
  • configurable for a local network
• Version 2 will be delivered to run “standalone” on a suitably configured PC

BNC Online service

• You need...
  • access to the Internet
• You get...
  • free (but limited) access using any web browser
  • free (temporary) access using SARA (PC only)
  • for an annual fee, SARA plus documentation

http://sara.natcorp.ox.ac.uk

Accesses per month

[Chart: y-axis 0 to 30000; series: all sessions, web sessions, SARA]

The BNC Sampler

• You need...
  • $50 for a CD
  • a PC with a CD drive and (preferably) 90 Mb disk space
• You get...
  • 2% sample, half written, half spoken
  • four different search engines
  • documentation
• Available at this conference, at a special price!!

The BNC World Edition (aka BNC2)

• has IPR clearance for world usage (we lose about 50 texts)
• extensive set of revisions and corrections
• catching up with the standards
• accompanied by a new, enhanced version of SARA
• … and it’s nearly ready (honest)

Error correction issues

• Nothing can be added
• Catching up with the standards
  • CDIF … TEI … EAGLES … CES …
  • headers are now in TEI-conformant XML
• Indeterminacy of any transcription
  • On the scale of the BNC, especially
  • If seven maids with seven mops…

Error Corrections in BNC2

• POS correction
  • Systematic
    • uses improved rules derived from BNC Sampler
    • significantly reduced error rate and indeterminacy
• Major production errors fixed
  • Semi-systematic
    • duplicate texts
    • wrongly labelled texts
    • participant details
    • classification errors and lacunae
• Typos remain... and will do so!

The BNC as an Open Corpus

We chose SGML to encourage development of other tools

This is coming more slowly than we expected, e.g. the Sampler

But people still think the BNC and SARA are the same thing

New features in SARA

• POS code searches
• Collocation searches
• Subcorpora
• Lemmatization rules
• Usable with any TEI-conformant corpus

What lessons have we learned?

• know your audience
• technological blind spots
• missed opportunities

Know your audience

• Everyone knows you should research the market first...
  • small, specialist research community, lexicographers
• The actual market is immense:
  • language learners
  • applied linguists
  • cultural historians
• and technically unsophisticated
  • hence often misled or disappointed

Technological blind spots

• we didn't expect the XML revolution!
  • so we wasted time in format conversion and compromises
• we didn't foresee PCs with 8 Gb disks and sound cards!
  • so we didn't try to get rights to the audio
  • and we focussed efforts on developing a client/server application

Missed opportunities: the R-word

• Original design talks of Representativeness
• This shifted to the idea of the BNC as a "fonds": a source of specialist corpora
• This implies
  • a clearer and agreed taxonomy of text types
  • better access facilities for subcorpora

Missed opportunities: watching the river flow

The BNC as a monitor corpus

Diachronic sampling

But this implies a constant ability to fund and integrate

How long will we want to study the language of the nineties?

Will the web provide?