Using online corpus for literacy teachers

Online Corpus: A Literacy Teacher's Best Friend

Dominik Lukeš
dlukes@dyslexiaaction.org.uk
@techczech

Outline

http://www.flickr.com/photos/adactio/3563832656
What is a corpus
Answering questions with a corpus
The language of corpus searches
The corpus and the classroom
Practice

Corpus / Corpora????

Knowledge of language / knowledge about language
http://www.flickr.com/photos/missturner/3029700617/

An important dichotomy (one of many) in the study of language

Prescriptivism: how language should be used
v.
Descriptivism: how language is used

"Most of the prescriptive rules of the language mavens make no sense on any level. They are bits of folklore that originated for screwball reasons several hundred years ago ... For as long as they have existed, speakers have flouted them"

"intellectual abdication"
"should be ashamed"
"current around 1900"
"a perversion of grammatical education"
"blind to textual evidence even when he himself exhibits it"
"dishonest and stupid"
"vile little compendium of tripe about style"

Grammarian Geoffrey K. Pullum:
"More passives in Orwell's pompous essay with the warning about how you mustn't use them than in any periodical you can lay your hands on!"

"This usage stuff is not straightforward and easy. If ever someone tells you that the rules of English grammar are simple and logical and you should just learn them and obey them, walk away, because you're getting advice from a fool."
http://languagelog.ldc.upenn.edu/nll/?p=2790

Corpus
Key modern tool for finding out how language works

A corpus is a large database of representative language samples

Corpus: hundreds of millions of words from (mostly) written language in different genres, stored in small samples (~2000 words each)

Corpus: used for linguistic research, making dictionaries, writing grammars, ...

Corpora available for teachers
http://corpus.byu.edu

BYU corpora available
COCA (contemporary Am English)
COHA (historical Am English)
GloWbE (global web English)
Wikipedia
Google Books (BrEng/AmEng)
BNC (British National Corpus)
Hansard (British parliamentary speeches)
Spanish/Portuguese

Access to COCA and related BYU corpora is free

but free registration required for more than ~10 queries a day

Other resources derived from BYU corpora
WordFrequency.info
WordAndPhrase.info
AcademicWords.info
Collocates.info

http://www.webcorp.org.uk

http://corpus.leeds.ac.uk

http://www.flickr.com/photos/atoach/3900591006/
Searching a corpus early on in the process of making a generalization can save you a lot of unpleasant surprises later.

How do we use the word dyslexia?

We speak more often of dyslexic children than adults.

We speak more often of dyslexia than any other dys- word.

Concordance

BNC: dyslexic [n*]
COCA: dyslexic [n*]

http://www.americancorpus.org/
http://corpus.byu.edu/bnc

Confirms the hypothesis that we speak of dyslexic children more than adults, and of boys more than girls. How about dyslexia v. dystopia?

COCA: dys*

Suffixing rules
*yed: played, stayed, portrayed, enjoyed, unemployed, surveyed
*ied: died, tried, married, worried, identified, applied
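The two wildcard searches can be imitated offline on any word list. The following is a minimal sketch in Python, not a description of how the BYU interface works. It uses the standard-library `fnmatch` module, whose `*` also stands for any number of characters, on the example words from this slide, and then looks at the letter in front of each ending. The word list and the helper names `search` and `letter_before_ending` are assumptions made for the illustration.

```python
from fnmatch import fnmatchcase

# Example words taken from the slide; a real search would run on a full corpus.
WORDS = [
    "played", "stayed", "portrayed", "enjoyed", "unemployed", "surveyed",
    "died", "tried", "married", "worried", "identified", "applied",
]

VOWELS = set("aeiou")


def search(pattern, words):
    """Return the words that match a shell-style wildcard pattern."""
    return [w for w in words if fnmatchcase(w, pattern)]


def letter_before_ending(word, ending):
    """Return the letter immediately before the given ending."""
    return word[-len(ending) - 1]


yed_words = search("*yed", WORDS)
ied_words = search("*ied", WORDS)

# Does every -yed word in this sample have a vowel before the y?
print(all(letter_before_ending(w, "yed") in VOWELS for w in yed_words))
# Which letters come before the ending in the -ied words?
print(sorted({letter_before_ending(w, "ied") for w in ied_words}))
```

A pattern match is only a starting point for a rule: died appears in the *ied list but comes from die, not from a word ending in -y, so the hits still need to be checked by eye.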

The Corpus Magic
*      Any number of characters (incl. 0)
?      Any one character
[ ]    Lemma (all inflectional forms of a word)
[n*]   Part of speech tags (e.g. nouns)
Different corpora use slightly different codes. Read the manual.

*
*each: each, reach, beach, teach, outreach, ..., impeach
teach*: teachers, teaching, ..., teachable, teacher-librarians
t*ch: touch, teach, tech, torch, trench, twitch, ..., three-inch
teach *: teach the, teach us, teach students, ...

?
?each: reach, beach, teach, peach, leach, keach
each?: each- (1), each# (1) [i.e. nothing]
?each?: peachy, bleachy, teacha, reachs (2) [i.e. spelling error], ...
t?ch: tech, tach, toch, tuch, tsch, tich
t??ch: touch, teach, torch, tisch, ...
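The difference between the two wildcards is easy to try out on a small word list. The sketch below is an illustration under stated assumptions, not the corpus software itself: it translates `*` (any number of characters, including none) and `?` (exactly one character) into a Python regular expression and matches whole words only. The function names `wildcard_to_regex` and `search` and the toy word list are hypothetical.

```python
import re


def wildcard_to_regex(pattern):
    """Translate a corpus-style wildcard pattern into a compiled regex.

    * stands for any number of characters (including none),
    ? stands for exactly one character; everything else is literal.
    """
    parts = []
    for ch in pattern:
        if ch == "*":
            parts.append(r"\S*")
        elif ch == "?":
            parts.append(r"\S")
        else:
            parts.append(re.escape(ch))
    return re.compile("".join(parts))


def search(pattern, words):
    """Return the words that match the whole pattern."""
    regex = wildcard_to_regex(pattern)
    return [w for w in words if regex.fullmatch(w)]


# A toy word list built from the examples on these slides.
WORDS = ["each", "reach", "beach", "teach", "outreach", "impeach",
         "teachers", "teaching", "touch", "tech", "torch", "trench", "peachy"]

print(search("*each", WORDS))   # includes "each", because * can match nothing
print(search("?each", WORDS))   # excludes "each" and "outreach"
print(search("t??ch", WORDS))
print(search("?each?", WORDS))
```

Real corpus interfaces add lemma and part-of-speech codes on top of this, which a plain regular expression over spellings cannot reproduce.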

[Lemma]

Part of speech tags
[run].[n*]

[run] [n*]

Common tags
[n*]    noun
[NN2]   plural nouns
[v*]    verb
[VVD]   verb past tense
[aj*] (BNC) / [j*] (COCA)    adjective
[av*] (BNC) / [r*] (COCA)    adverb

Help

You can also
cats and dogs: search for idioms
?each*s: combine wildcards
[=pretty]: search for synonyms
car|bike|horse: search for alternatives
used -car: exclude searches

For more details see:

Concordance + KWIC

*ies.[N*]

KWIC Key-Word In Context

*ies.[N*]
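A key-word-in-context display simply lines up every hit with a few words on either side. The sketch below shows the idea in Python on an invented sentence; the function name `kwic`, the `width` parameter and the sample text are assumptions made for the illustration, and a plain text like this has no part-of-speech tags.

```python
def kwic(tokens, is_hit, width=3):
    """Yield (left context, keyword, right context) for every matching token."""
    for i, token in enumerate(tokens):
        if is_hit(token):
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            yield left, token, right


# Invented sample text; a real concordance would come from corpus data.
TEXT = ("The teacher tells stories about families while "
        "the baby cries and the boy tries again")
tokens = TEXT.split()

lines = list(kwic(tokens, lambda t: t.lower().endswith("ies")))
for left, keyword, right in lines:
    # Right-align the left context so the keywords line up in one column.
    print(f"{left:>30}  {keyword}  {right}")
```

Because the sample is untagged, the output mixes nouns (stories, families) with verbs (cries, tries). Reading the context column is what lets you tell them apart, which is the job the .[N*] part of the corpus query does automatically.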

Limit searches by genre

Other questions a corpus can answer
Are there more nouns or verbs ending in -ies? *ies.[V*] vs. *ies.[N*]
Are there four-letter verbs ending in -ed in the present tense? ??ed.[VVB]
What are the most common adjectives describing students vs. pupils? [j*] [student] vs. [j*] [pupil]
What do we say teachers do most often? [teacher] [vvb]

Corpus, rules, and regularity
http://www.flickr.com/photos/51505078@N00/352492687

pre*

*ed

*ies.[V*]

Collocations
Limits on variability

See also Kennedy, p. 80-23

Collocations (cont)

[teacher] must [v*]
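A collocation search of this kind boils down to counting what follows a fixed frame. The sketch below imitates it in Python on a handful of invented sentences. It has no part-of-speech tagger, so it counts whichever word follows "teacher(s) must" rather than verbs specifically; the sentences, the hand-listed lemma forms and the function name `words_after` are all assumptions made for the illustration.

```python
from collections import Counter

# Invented sentences standing in for corpus hits.
SENTENCES = [
    "The teacher must be patient",
    "Teachers must know their students",
    "A teacher must be prepared",
    "Every teacher must decide what to assess",
    "Teachers must be able to explain the rule",
]

# Hand-listed forms of the lemma; a tagged corpus supplies these automatically.
TEACHER_FORMS = {"teacher", "teachers"}


def words_after(sentences, lemma_forms, fixed_word):
    """Count the word that follows '<lemma form> <fixed_word>' in each sentence."""
    counts = Counter()
    for sentence in sentences:
        tokens = sentence.lower().split()
        for first, second, third in zip(tokens, tokens[1:], tokens[2:]):
            if first in lemma_forms and second == fixed_word:
                counts[third] += 1
    return counts


collocates = words_after(SENTENCES, TEACHER_FORMS, "must")
for word, count in collocates.most_common():
    print(word, count)
```

Sorting by frequency is the step that a web search engine does not offer, and it is what makes the typical patterns stand out from the one-off ones.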

Idioms and set phrases

275 results
359 results

Google as a Corpus

"put the search text in quotes"
use * for the search item

training.dyslexiaaction.org.uk

Google as a Corpus
PRO: rare, low frequency usage, up-to-date usage
CON: no sampling, no frequency sort, no genre limit, no part of speech tags

Google results counts are only rough estimates

http://searchengineland.com/why-google-cant-count-results-properly-53559
Different people searching in different geographic locations can get different numbers

Sometimes searching for A gives fewer results than searching for A without B

but Google fights can be fun

WebCorp makes Google search results linguist-friendly

Avoid Common Corpus Errors

http://www.flickr.com/photos/andreassolberg/433734311

Be aware of limitations: sampling, coverage, size, presence of typos and errors, bad part of speech tagging
Beware of low frequency results
Beware of homographs

Check results come from multiple sources
Check KWIC to confirm relevance
Limit search by genre

Check examples and sources

Always check low frequency results

must [v*] [n*]
sometimes they come from the same source
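The check itself is simple bookkeeping: for each result, count how many different sources the hits come from. The sketch below illustrates it in Python with invented data; the phrases, the source labels and the function names `source_spread` and `single_source` are hypothetical, and real corpus interfaces show the source next to each concordance line rather than in this form.

```python
from collections import defaultdict

# Invented hits: (matched phrase, source document).
HITS = [
    ("must pay attention", "newspaper_A"),
    ("must pay attention", "magazine_B"),
    ("must pay attention", "fiction_C"),
    ("must obey orders", "blog_D"),
    ("must obey orders", "blog_D"),
    ("must obey orders", "blog_D"),
]


def source_spread(hits):
    """Map each phrase to (number of hits, number of distinct sources)."""
    sources = defaultdict(list)
    for phrase, source in hits:
        sources[phrase].append(source)
    return {p: (len(s), len(set(s))) for p, s in sources.items()}


def single_source(hits):
    """Return the phrases whose hits all come from one source."""
    return sorted(p for p, (_, n) in source_spread(hits).items() if n == 1)


print(source_spread(HITS))
print(single_source(HITS))
```

Both phrases have the same raw frequency in this toy data, but only one of them is spread across several sources, so only that one says something about general usage.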

False roots

http://etymonline.com
corner, silly, preface, cockroach, protest, stable

Make your own corpus with TextSTAT

http://neon.niederlandistik.fu-berlin.de/en/textstat

Make your own corpus with AntConc

http://www.antlab.sci.waseda.ac.jp/software.html

Corpus in the classroom
teacher preparation

student discovery

Teacher preparation
find relevant, common examples
prepare worksheets
check for exceptions
find out answers to student questions about rules and usage

Student discovery
show search results to students to work out rules or word meanings
teach students how to search for questions
ask students to give each other puzzles for searching

For heavy classroom use

register for group access to prevent spam lock out

Corpus v dictionary

Non-classroom corpus use
supplement dictionary
crossword puzzles
check typical usage when writing

Where to go next?

http://www.corpora4learning.net

Thank you
Contact: dlukes@dyslexiaaction.org.uk