Online corpus: Literacy teachers' best friend


DESCRIPTION

Presentation delivered at Dyslexia Guild Summer Conference 2011 in Oxford. (Slideshow updated based on feedback from the session).


training.dyslexiaaction.org.uk

Online Corpus: Literacy Teachers’ Best Friend

Dominik Lukeš
http://dominiklukes.net

Dyslexia Guild Summer Conference 2011

Outline


http://www.flickr.com/photos/adactio/3563832656

What is a corpus

Answering questions with a corpus

The language of corpus searches

The corpus and the classroom

Practice

Corpus / Corpora


????


knowledge of / about language

http://www.flickr.com/photos/missturner/3029700617/

Prescriptivism

… how language should be used

v

Descriptivism

… how language is used


“Most of the prescriptive rules of the language mavens make no sense on any level. They are bits of folklore that originated for screwball reasons several hundred years ago… For as long as they have existed, speakers have flouted them…”


Grammarian Geoffrey K Pullum on …

“intellectual abdication”
“should be ashamed”
“current around 1900”
“a perversion of grammatical education”
“blind to textual evidence even when he himself exhibits it”
“dishonest and stupid”
“vile little compendium of tripe about style”

“More passives in Orwell's pompous essay with the warning about how you mustn't use them than in any periodical you can lay your hands on!”

This usage stuff is not straightforward and easy. If ever someone tells you that the rules of English grammar are simple and logical and you should just learn them and obey them, walk away, because you're getting advice from a fool.

http://languagelog.ldc.upenn.edu/nll/?p=2790

Corpus

Key modern tool for finding out about how language works…

… is a large database of representative language samples …

… 100s of millions of words from (mostly) written language in different genres in small samples (~2000 words) …

… used for linguistic research, making dictionaries, writing grammars, …


Corpora available for teachers


http://corpus.byu.edu

Access to COCA and related BYU corpora is free…


…but free registration required for more than ~10 queries a day


Brown – the grandfather
COCA
BNC
WebCorp
Google


http://www.flickr.com/photos/atoach/3900591006/

Searching a corpus early on in the process of making a generalization can save you a lot of unpleasant surprises later.

How do we use the word dyslexia?

We speak more often of dyslexic children than adults.

We speak more often of dyslexia than any other dys- word.


Concordance

BNC: dyslexic [n*]
http://corpus.byu.edu/bnc

COCA: dyslexic [n*]
http://www.americancorpus.org/


COCA: dys*

Suffixing rules

*yed: played, stayed, portrayed, enjoyed, unemployed, surveyed

*ied: died, tried, married, worried, identified, applied
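The two word lists above are consistent with the rule as it is usually taught: a consonant followed by y becomes -ied, while a vowel followed by y simply takes -ed. The sketch below is one possible way to state that rule in Python. The rule wording and the function name `add_ed` are assumptions made for illustration and are not taken from the slides; consonant doubling (stop, stopped) is deliberately left out.

```python
VOWELS = set("aeiou")

def add_ed(verb: str) -> str:
    """Add -ed using the commonly taught 'y' rule (an assumption, not from the slides)."""
    if verb.endswith("y") and len(verb) > 1 and verb[-2] not in VOWELS:
        return verb[:-1] + "ied"   # consonant + y: try -> tried
    if verb.endswith("e"):
        return verb + "d"          # final e: die -> died
    return verb + "ed"             # vowel + y and everything else: play -> played

for verb in ["play", "stay", "enjoy", "survey", "try", "marry", "identify", "die"]:
    print(verb, "->", add_ed(verb))
```

Running the same rule against the results of a *yed or *ied corpus search is a quick way to spot exceptions before presenting the rule to students.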

The Corpus Magic

*       Any number of characters (incl. 0)
?       Any one character
[ ]     Lemma (all inflectional forms of a word)
[n*]    Part of speech tags (e.g. nouns)

Different corpora use slightly different codes. Read the manual.

*each      each, reach, beach, teach, outreach, …, impeach, …
teach*     teachers, teaching, …, teachable, teacher-librarians, …
t*ch       touch, teach, tech, torch, trench, twitch, …, three-inch, …
teach *    teach the, teach us, teach students, …

?each      reach, beach, teach, peach, leach, keach, …
each?      each- (1), each# (1) [i.e. nothing]
?each?     peachy, bleachy, teacha, reachs (2) [i.e. spelling error], …
t?ch       tech, tach, toch, tuch, tsch, tich
t??ch      touch, teach, torch, tisch, …
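For readers who want to see what the two wildcards do mechanically, here is a minimal sketch that translates a single-word pattern into a regular expression and runs it over a small word list. The helper names `wildcard_to_regex` and `search` and the word list are hypothetical; this is not how the BYU interface is implemented, and multi-word patterns such as teach * are not handled.

```python
import re

def wildcard_to_regex(pattern: str) -> re.Pattern:
    """Translate * (any number of characters) and ? (exactly one) into a regex."""
    parts = []
    for ch in pattern:
        if ch == "*":
            parts.append(r"\S*")        # zero or more non-space characters
        elif ch == "?":
            parts.append(r"\S")         # exactly one non-space character
        else:
            parts.append(re.escape(ch))  # literal character
    return re.compile("".join(parts))

def search(pattern: str, words: list[str]) -> list[str]:
    """Return the words that match the whole pattern."""
    rx = wildcard_to_regex(pattern)
    return [w for w in words if rx.fullmatch(w)]

# Hypothetical mini word list standing in for a corpus.
WORDS = ["each", "reach", "teach", "teacher", "tech", "touch", "torch", "outreach", "peachy"]

print(search("*each", WORDS))   # ['each', 'reach', 'teach', 'outreach']
print(search("t?ch", WORDS))    # ['tech']
print(search("t??ch", WORDS))   # ['teach', 'touch', 'torch']
print(search("?each?", WORDS))  # ['peachy']
```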

[Lemma]


Part of speech tags


[run].[n*]    forms of run tagged as a noun

[run] [n*]    any form of run followed by a noun

Common tags

[n*]                        noun
[NN2]                       plural nouns
[v*]                        verb
[VVD]                       verb past tense
[aj*] (BNC) / [j*] (COCA)   adjective
[av*] (BNC) / [r*] (COCA)   adverb
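Because the adjective and adverb tags differ between the BNC and COCA, it can help to keep the tags in one place and fill them into a query template. The sketch below is only an illustration: the `TAGS` table copies the tags listed above (assuming the noun and verb tags are the same in both corpora, as the list implies), and `build_query` is a hypothetical helper, not part of any corpus interface.

```python
# Tags copied from the list above; anything not listed there is unknown here.
TAGS = {
    "noun": {"BNC": "[n*]", "COCA": "[n*]"},
    "verb": {"BNC": "[v*]", "COCA": "[v*]"},
    "adjective": {"BNC": "[aj*]", "COCA": "[j*]"},
    "adverb": {"BNC": "[av*]", "COCA": "[r*]"},
}

def build_query(template: str, corpus: str) -> str:
    """Fill {noun}, {verb}, {adjective}, {adverb} slots with the corpus's own tag."""
    return template.format(**{name: tags[corpus] for name, tags in TAGS.items()})

print(build_query("{adjective} [student]", "BNC"))   # [aj*] [student]
print(build_query("{adjective} [student]", "COCA"))  # [j*] [student]
```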

Help


You can also

cats and dogs      search for idioms
?each*s            combine wildcards
[=pretty]          search for synonyms
car|bike|horse     search for alternatives
used -car          exclude searches

For more details see:

Concordance + KWIC


*ies.[N*]


KWIC – Key-Word In Context

*ies.[N*]
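A key-word-in-context display lines up every hit with a few words of context on either side. The following is a minimal sketch of the idea on an invented sample text; the function name `kwic` and the sample are assumptions for illustration. Unlike the *ies.[N*] query, a plain regular expression cannot tell nouns from verbs, so both appear in the output.

```python
import re

def kwic(text: str, pattern: str, width: int = 3) -> list[str]:
    """Return one line per hit: left context, [keyword], right context."""
    tokens = re.findall(r"[A-Za-z'-]+", text)
    rx = re.compile(pattern, re.IGNORECASE)
    lines = []
    for i, token in enumerate(tokens):
        if rx.fullmatch(token):
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            # Right-align the left context so the keywords form a column.
            lines.append(f"{left:>30}  [{token}]  {right}")
    return lines

# Invented sample text standing in for corpus data.
SAMPLE = ("She carries the babies home. He tries hard and studies daily. "
          "The stories about cities were long.")

for line in kwic(SAMPLE, r"\w*ies"):
    print(line)
```

Reading down the keyword column makes it easy to check whether each hit is really the kind of word you were looking for.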

Limit searches by genre


Other questions a corpus can answer

Are there more nouns or verbs ending in -ies?
*ies.[V*] vs. *ies.[N*]

Are there four-letter verbs ending in -ed in the present tense?
??ed.[VVB]

What are the most common adjectives describing students vs. pupils?
[j*] [student] vs. [j*] [pupil]

What do we say teachers do most often?
[teacher] [vvb]
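To make the first question concrete, here is a minimal sketch of what a pair of queries such as *ies.[V*] and *ies.[N*] is counting. The tagged tokens are invented for illustration and the tags are simplified to N and V; a real corpus supplies the tagging and uses a much fuller tag set.

```python
from collections import Counter

# Hypothetical hand-tagged tokens: (word, tag). N = noun, V = verb.
TAGGED = [
    ("babies", "N"), ("carries", "V"), ("cities", "N"), ("tries", "V"),
    ("stories", "N"), ("studies", "V"), ("studies", "N"), ("parties", "N"),
    ("teacher", "N"), ("plays", "V"),
]

def count_by_tag(tagged, suffix):
    """Count tokens ending in `suffix`, grouped by the first letter of the tag."""
    return Counter(tag[0] for word, tag in tagged if word.endswith(suffix))

counts = count_by_tag(TAGGED, "ies")
print(counts["N"], "noun tokens vs", counts["V"], "verb tokens ending in -ies")
```

Note that studies appears under both tags, which is exactly the kind of homograph the part of speech tag is there to separate.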


Corpus, rules, and regularity


http://www.flickr.com/photos/51505078@N00/352492687

pre*

*ed

*ies.[V*]

Collocations

Limits on variability

See also Kennedy, p. 80-23

Collocations (cont)

Limits on variability

Collocations (cont)


[teacher] must [v*]
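A collocation query like this one boils down to counting which words turn up in a fixed slot. The sketch below imitates [teacher] must [v*] on an invented sample by counting the word that follows "teacher must" or "teachers must". The sample text, the `TEACHER_FORMS` set and the function name are assumptions for illustration; there is no real lemmatisation or part of speech tagging here, so the third slot is simply whatever word comes next.

```python
import re
from collections import Counter

TEACHER_FORMS = {"teacher", "teachers"}  # stand-in for the lemma [teacher]

def after_teacher_must(text: str) -> Counter:
    """Count the word in the third slot of 'teacher(s) must ___'."""
    tokens = re.findall(r"[a-z']+", text.lower())
    triples = zip(tokens, tokens[1:], tokens[2:])
    return Counter(w3 for w1, w2, w3 in triples
                   if w1 in TEACHER_FORMS and w2 == "must")

# Invented sample text standing in for corpus data.
SAMPLE = ("Teachers must ensure that pupils read. A teacher must be patient. "
          "The teacher must ensure fairness, and pupils must try.")

for word, freq in after_teacher_must(SAMPLE).most_common():
    print(word, freq)
```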

Idioms and set phrases


275 results

359 results

Google as a Corpus


"put the search text in quotes"

use * for the search item


Google as a Corpus Pros & Cons

PRO: rare, low-frequency usage; up-to-date usage

CON: no sampling, no frequency sort, no genre limit, no part of speech tags

Google results counts are only rough estimates…


http://searchengineland.com/why-google-cant-count-results-properly-53559

Different people searching in different geographic locations can get different numbers

Sometimes searching for A gives fewer results than searching for A without B, even though every result for the second search should also match the first

…but Google fights can be fun

WebCorp makes Google search results linguist-friendly


Avoid Common Corpus Errors


Be aware of limitations: sampling, coverage, size, presence of typos and errors, bad part of speech tagging

Beware of low frequency results

Beware of homographs

Check results come from multiple sources

Check KWIC to confirm relevance

Limit search by genre

http://www.flickr.com/photos/andreassolberg/433734311

Check examples and sources


Always check low frequency results


must [v*] [n*]

…sometimes they come from the same source
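One way to picture this check: for each result, count the distinct sources as well as the hits. The sketch below does that on invented data. The phrases, the source identifiers and the function name `source_spread` are all hypothetical; in practice the source information comes from the corpus's own expanded results view.

```python
from collections import Counter

def source_spread(hits):
    """hits: list of (phrase, source_id). Returns {phrase: (hit count, distinct sources)}."""
    totals = Counter(phrase for phrase, _ in hits)
    sources = {}
    for phrase, source in hits:
        sources.setdefault(phrase, set()).add(source)
    return {phrase: (totals[phrase], len(sources[phrase])) for phrase in totals}

HITS = [  # hypothetical hits for a query like: must [v*] [n*]
    ("must take care", "news_01"), ("must take care", "fiction_07"),
    ("must take care", "academic_03"),
    ("must wear armbands", "fiction_12"), ("must wear armbands", "fiction_12"),
    ("must wear armbands", "fiction_12"),
]

for phrase, (n, n_sources) in source_spread(HITS).items():
    flag = "  <- single source, treat with caution" if n_sources == 1 else ""
    print(f"{phrase}: {n} hits from {n_sources} source(s){flag}")
```

Both phrases have three hits, but only the first is spread across different texts.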

False roots

http://etymonline.com

corner, silly, preface, cockroach, protest, stable …

Make your own corpus with TextSTAT

http://neon.niederlandistik.fu-berlin.de/en/textstat

Make your own corpus with AntConc


http://www.antlab.sci.waseda.ac.jp/software.html
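The most basic thing a home-made corpus gives you is a word frequency list. As a rough illustration of the idea (not of how TextSTAT or AntConc work internally), the sketch below builds one from a couple of invented sentences; the function name `word_frequencies` and the sample texts are assumptions, and in practice the strings would be read from your own files.

```python
import re
from collections import Counter

def word_frequencies(texts):
    """Build a word frequency list over an iterable of text strings."""
    counts = Counter()
    for text in texts:
        counts.update(re.findall(r"[a-z']+", text.lower()))
    return counts

# Invented mini corpus; replace with the contents of your own texts.
MY_CORPUS = [
    "The cat sat on the mat.",
    "The dog sat on the log.",
]

for word, freq in word_frequencies(MY_CORPUS).most_common(3):
    print(word, freq)
```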

Corpus in the classroom


teacher preparation

student discovery

Teacher preparation

find relevant, common examples

prepare worksheets

check for exceptions

find out answers to student questions about rules and usage

Student discovery


show search results to students to work out rules or word meanings

teach students how to search for questions

ask students to give each other puzzles for searching

For heavy classroom use…


register for group access to prevent spam lock out

Corpus v dictionary


Non-classroom corpus use


supplement dictionary

crossword puzzles

check typical usage when writing

Where to go next?


http://www.corpora4learning.net

Thank you

Contact: http://dominiklukes.net
