
Guy Aston, Ylva Berglund Prytz, & Lou Burnard,

http://www.natcorp.oucs.ox.ac.uk

Exploring BNC-XML with Xaira

What is the BNC?

a snapshot of British English, taken at the end of the 20th century

100 million words in approx 4000 different text samples, both spoken (10%) and written (90%)

synchronic (1990-4), sampled, general purpose corpus

available under licence; latest edition is BNC-XML (13 March 2007)

Distinctive features of the BNC

non-opportunistic design
standardized markup system
  structural annotation
  word class annotation
  contextual information
general availability

...in these respects, the BNC remains distinctive, twenty years on!

What's new in BNC-XML?

No systematic proofing, re-editing, or re-parsing...
Same as BNC World:
  texts (minus duplicates)
  POS tagging (but extended)
Additions
  simpler pos codes
  lemmata
Improvements
  Duplications, categorizations, segmentations...
  Coded descriptions

BNC-XML regroups texts using additional classification criteria

[Figure: ...sentences and ...words by text category: Academic, Literary, Press, Nonfiction, Unpublished, Conversation, OtherSpoken]

<wtext type="NONAC">
<div level="1" n="1" type="leaflet">
  <head type="MAIN">
    <s n="1">
      <w c5="NN1" hw="factsheet" pos="SUBST">FACTSHEET</w>
      <w c5="DTQ" hw="what" pos="PRON">WHAT</w>
      <w c5="VBZ" hw="be" pos="VERB">IS</w>
      <w c5="NN1" hw="aids" pos="SUBST">AIDS</w><c c5="PUN">?</c>
    </s>
  </head>
  <p>
    <s n="2">
      <hi rend="bo">
        <w c5="NN1" hw="aids" pos="SUBST">AIDS</w>
        <c c5="PUL">(</c><w c5="VVN-AJ0" hw="acquire" pos="VERB">Acquired</w>
        <w c5="AJ0" hw="immune" pos="ADJ">Immune</w>
        <w c5="NN1" hw="deficiency" pos="SUBST">Deficiency</w>
        <w c5="NN1" hw="syndrome" pos="SUBST">Syndrome</w><c c5="PUR">)</c>
      </hi>
      <w c5="VBZ" hw="be" pos="VERB">is</w>
      <w c5="AT0" hw="a" pos="ART">a</w>
      <w c5="NN1" hw="condition" pos="SUBST">condition</w>
      <w c5="VVN" hw="cause" pos="VERB">caused</w>
      <w c5="PRP" hw="by" pos="PREP">by</w>
      <w c5="AT0" hw="a" pos="ART">a</w>
      <w c5="NN1" hw="virus" pos="SUBST">virus</w>
      <w c5="VVN" hw="call" pos="VERB">called</w>
      <w c5="NP0" hw="hiv" pos="SUBST">HIV</w>
      <c c5="PUL">(</c>
      <w c5="AJ0-NN1" hw="human" pos="ADJ">Human</w>
      <w c5="NN1" hw="immuno" pos="SUBST">Immuno</w>
      <w c5="NN1" hw="deficiency" pos="SUBST">Deficiency</w>
      <w c5="NN1" hw="virus" pos="SUBST">Virus</w><c c5="PUR">)</c><c c5="PUN">.</c>
    </s>
    …
  </p>
  …
</div>
</wtext>

What is the markup for?

It makes it possible for you to
  distinguish aids=SUBST from aids=VERB
  distinguish occurrences in writing from ones in speech
  distinguish occurrences in headings from ones in paragraphs
  identify contextual units like sentences and paragraphs

FACTSHEET WHAT IS AIDS? AIDS (Acquired Immune Deficiency Syndrome) is a condition caused by a virus called HIV (Human Immuno Deficiency Virus).
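
As a rough illustration of those distinctions, the sketch below reads an abridged copy of the sample with Python's standard xml.etree.ElementTree module rather than with Xaira. The helper name occurrences is made up for this example, and treating any root element other than wtext as "not written" is an assumption, not something the slides state.

```python
import xml.etree.ElementTree as ET

# Abridged from the sample text; the real file contains many more elements.
SAMPLE = """<wtext type="NONAC"><div level="1" n="1" type="leaflet">
<head type="MAIN"><s n="1"><w c5="NN1" hw="factsheet" pos="SUBST">FACTSHEET</w>
<w c5="DTQ" hw="what" pos="PRON">WHAT</w> <w c5="VBZ" hw="be" pos="VERB">IS</w>
<w c5="NN1" hw="aids" pos="SUBST">AIDS</w><c c5="PUN">?</c></s></head>
<p><s n="2"><w c5="NN1" hw="aids" pos="SUBST">AIDS</w>
<w c5="VBZ" hw="be" pos="VERB">is</w> <w c5="AT0" hw="a" pos="ART">a</w>
<w c5="NN1" hw="condition" pos="SUBST">condition</w><c c5="PUN">.</c></s></p>
</div></wtext>"""


def occurrences(root, headword, pos):
    """Return (container, sentence number) for every <w> with this headword and pos.

    The container is "head" for headings and "p" for paragraphs.
    """
    hits = []
    for container in ("head", "p"):
        for block in root.iter(container):
            for s in block.iter("s"):
                for w in s.iter("w"):
                    if w.get("hw") == headword and w.get("pos") == pos:
                        hits.append((container, s.get("n")))
    return hits


root = ET.fromstring(SAMPLE)
# Assumption: only the written/not-written split is checked here.
medium = "written" if root.tag == "wtext" else "not written"
noun_hits = occurrences(root, "aids", "SUBST")
verb_hits = occurrences(root, "aids", "VERB")
print(medium, noun_hits, verb_hits)
```

In this fragment aids occurs only as a noun, once in the heading and once in a paragraph; the same query with pos="VERB" finds nothing.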

Has English moved on since the BNC?

types of text
  e-mail
  web pages / blogs
  SMS
  personal letters

topics
  globalization
  internet
  Elvis
  Word Perfect

how comparable is the Web?

Out of date?

The composition (and date) of any corpus affects inferences drawn from it

There aren't many alternatives
  Web-as-corpus: 85% of written texts aren't on the web - and spoken texts?
  Results from monitor corpora non-replicable
  Copyright permissions unrepeatable

Quantitative and qualitative comparative evaluations of BNC coverage are needed but “it's surprising how much is there”

What can you do with it?

The BNC is a problematizing resource...
  complements (and corrects) intuition
  increases learner autonomy
  critiques the myth of the native speaker
... for teacher and learner alike
XML makes it more accessible by non-specialist software (e.g. A0S in web browser)

You can use XAIRA to ...

find sample sentences
  cloze tests
check what the textbook says
  grammar vs usage
(dis)confirm intuitions
find sample specialist texts
make serendipitous discoveries

Finding sample sentences

some phrases that take the gerund
  there's no point ....
  how / what about ...
generatable phrases
  [comparative] and [comparative]
sentence structures
  [s-initial interjection]

(Dis)confirming intuition

about choices
  have a problem + infinitive or gerund?
  do you make or take decisions?
about vocabulary
  which nouns collocate with hard?
about grammar
  I would be grateful if you [modal]?
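
The vocabulary question can be approximated outside Xaira with a few lines of Python. This is only a sketch: the four sentences are invented for illustration (they are not BNC data), the function name noun_collocates is hypothetical, and "collocate" is narrowed here to "noun immediately following the adjective", which is much cruder than a proper collocation query with a span and a significance score.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Invented sentences marked up in the style of BNC-XML, for illustration only.
SAMPLE = """<p>
<s n="1"><w c5="PNP" hw="it" pos="PRON">It </w><w c5="VBD" hw="be" pos="VERB">was </w>
<w c5="AJ0" hw="hard" pos="ADJ">hard </w><w c5="NN1" hw="work" pos="SUBST">work</w></s>
<s n="2"><w c5="PNP" hw="she" pos="PRON">She </w><w c5="VHD" hw="have" pos="VERB">had </w>
<w c5="AT0" hw="a" pos="ART">a </w><w c5="AJ0" hw="hard" pos="ADJ">hard </w>
<w c5="NN1" hw="time" pos="SUBST">time</w></s>
<s n="3"><w c5="AJ0" hw="hard" pos="ADJ">Hard </w><w c5="NN1" hw="work" pos="SUBST">work </w>
<w c5="VVZ" hw="pay" pos="VERB">pays</w></s>
<s n="4"><w c5="PNP" hw="he" pos="PRON">He </w><w c5="VVZ" hw="work" pos="VERB">works </w>
<w c5="AV0" hw="hard" pos="ADV">hard</w></s>
</p>"""


def noun_collocates(root, adjective):
    """Count headwords of nouns that directly follow the given adjective."""
    counts = Counter()
    for s in root.iter("s"):
        words = list(s.iter("w"))
        # Pair every word with its right-hand neighbour inside the sentence.
        for current, following in zip(words, words[1:]):
            if (current.get("hw") == adjective
                    and current.get("pos") == "ADJ"
                    and following.get("pos") == "SUBST"):
                counts[following.get("hw")] += 1
    return counts


counts = noun_collocates(ET.fromstring(SAMPLE), "hard")
print(counts.most_common())
```

Sentence 4 is ignored because hard is tagged there as an adverb, which is the kind of distinction the word class annotation makes possible.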

Finding specialised texts

The BNC has an extraordinary range
  travel agent brochures, weather reports, formal invitations, advertising, children's talk, academic discourse, doctor's consultations, marketing meetings, oral history, jokes and anecdotes, high literature, best-sellers, leaflets, personal diaries...
The problem is finding it
  use WLD principle

For learners...

The same as teachers
Pointers to follow in the quest for idiomaticity
  collocations
  colligations
  semantic preferences
  semantic prosodies/pragmatic associations
  associations with particular genres/domains

Can learners use the BNC “autonomously”?

The ins and outs of autonomous use

Learners may need warning to...
  focus on patterns which recur, without necessarily trying to explain all the data
  avoid overgeneralisation
... and encouragement to
  be curious
  browse the context
  investigate exceptions

What are ins and outs?

(and are they the same as ups and downs)?
50 occurrences, sort left 2
colligation: (all) the ins and outs of
semantic preference: know/learn/understand/keep up with/get to grips with/get down to/forget; explain/teach/guide through/give/look at
semantic prosody: difficulty(?)
analysis - mainly spoken conversation, but numbers too small for reliable inference
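
Sorting a concordance on its left context is what brings patterns such as (all) the ins and outs of to the surface. The sketch below imitates that step in plain Python, not in Xaira. The three sentences are invented, kwic is a hypothetical helper, and reading "sort left 2" as "sort on the two words to the left, nearest word first" is an assumption.

```python
def kwic(sentences, phrase, span=2):
    """Return (left, keyword, right) lines sorted on the left context.

    Lines are ordered by the word nearest the keyword, then by the next
    word out, up to `span` words.
    """
    target = phrase.lower().split()
    lines = []
    for sentence in sentences:
        tokens = sentence.split()
        lowered = [t.lower().strip(".,?!") for t in tokens]
        for i in range(len(tokens) - len(target) + 1):
            if lowered[i:i + len(target)] == target:
                end = i + len(target)
                # Reverse the left window so the nearest word is compared first.
                sort_key = lowered[max(0, i - span):i][::-1]
                lines.append((sort_key,
                              " ".join(tokens[:i]),
                              " ".join(tokens[i:end]),
                              " ".join(tokens[end:])))
    lines.sort(key=lambda line: line[0])
    return [(left, key, right) for _, left, key, right in lines]


# Invented examples, not BNC data.
SENTENCES = [
    "You must learn the ins and outs of the job.",
    "She knows all the ins and outs of the tax system.",
    "He explained the ins and outs of cricket.",
]

concordance = kwic(SENTENCES, "ins and outs")
for left, key, right in concordance:
    print(f"{left:>20} | {key} | {right}")
```

With the lines aligned this way, the verbs just left of the ins and outs (know, learn, explain) line up in a column, which is how the semantic preference noted above becomes visible.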

Exploring idioms

make a point        the point is        point out
have a point        high point          point to
in point of fact    starting point      no point in
point of view       at X point          what's the point
to the point        see/get/grasp the point

Example: idioms with point

Exploring features of speech

PS6NR >: [laugh] he's not a millionaire yet.
PS6NM >: No so perhaps not, mm. Oh perhaps, perhaps he, perhaps he has the knowledge but has difficulty in er navigating his way to the betting shop to to do anything about it.
PS6NR >: [laugh]
PS6NM >: Anyway erm
PS6NR >: Right I've ... results see this is
PS6NM >: Mm.
PS6NR >: this is really what I'm [ ... ]
PS6NM >: Yeah.
PS6NR >: comparison of subjects within groups and between groups I thought that's
PS6NM >: Yeah, mm.
PS6NR >: like a typical [ ... ]

Examples: spoken discourse markers and back channels

Exploring productivity of affixes

How many adjectives can you think of ending in -ish?
  babyish, bearish, .... wankish, whorish, yobbish

How many nouns starting with anti-?
How about verbs?
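
Questions like these combine a word class with a pattern on the word form. A minimal sketch in Python follows, assuming the hw and pos attributes shown in the sample markup; the sentence is invented and headwords_with_affix is a hypothetical helper, not part of Xaira.

```python
import xml.etree.ElementTree as ET

# One invented sentence in BNC-XML style; not taken from the corpus.
SAMPLE = """<s n="1">
<w c5="AT0" hw="a" pos="ART">a </w>
<w c5="AJ0" hw="babyish" pos="ADJ">babyish </w>
<w c5="NN1" hw="wish" pos="SUBST">wish </w>
<w c5="CJC" hw="and" pos="CONJ">and </w>
<w c5="AT0" hw="a" pos="ART">a </w>
<w c5="AJ0" hw="yobbish" pos="ADJ">yobbish </w>
<w c5="NN1" hw="anti-hero" pos="SUBST">anti-hero </w>
<w c5="VVD" hw="finish" pos="VERB">finished</w>
</s>"""


def headwords_with_affix(root, pos, prefix="", suffix=""):
    """Return the sorted distinct headwords of the given word class
    that start with `prefix` and end with `suffix`."""
    found = set()
    for w in root.iter("w"):
        hw = w.get("hw", "")
        if w.get("pos") == pos and hw.startswith(prefix) and hw.endswith(suffix):
            found.add(hw)
    return sorted(found)


root = ET.fromstring(SAMPLE)
ish_adjectives = headwords_with_affix(root, "ADJ", suffix="ish")
anti_nouns = headwords_with_affix(root, "SUBST", prefix="anti-")
anti_verbs = headwords_with_affix(root, "VERB", prefix="anti-")
print(ish_adjectives, anti_nouns, anti_verbs)
```

Restricting by word class keeps out wish (a noun) and finish (a verb), but a string match alone cannot tell a productive suffix from an accidental ending, so adjectives such as British or English would still need weeding out by hand.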

Creative writing

Paul Auster: City of Glass

It was the wrong number that started it, the telephone ringing three times in the dead of the night, and the voice on the other end asking for someone he was not.

Examples: story beginnings

Ian McEwan: Saturday

Everyone agrees, airliners look different these days, predatory and doomed.

Where can I get one?

BNC XML: http://www.natcorp.ox.ac.uk
  now available on DVD
  standalone single user licence or institutional licence
  discounted price till end June

XAIRA
  Delivered free with the BNC (and also available free from http://xaira.sf.net)
  Usable with any XML corpus
  Usable/ish on any platform