Using Corpora and how to build them Adam Kilgarriff Lexical Computing Ltd


  • Using Corpora and how to build them

    Adam Kilgarriff, Lexical Computing Ltd

  • Corpora show us the facts of the language

  • What is a corpus?

    A corpus is a collection of texts when viewed as an object of language research.

  • Which texts?

    Written
    Spoken

  • Written

    Books: fiction, non-fiction, textbooks
    Newspapers
    Letters, unpublished
    Web pages
    Academic journals
    Student essays

  • Spoken

    Must be transcribed, for text corpora
    Conversation: who? Region, class, age-group, situation
    Lectures
    TV and radio
    Film transcripts
    Meetings, seminars

  • Which texts?

    Different purposes, different text types
    Making dictionaries: cover the whole language, some of everything

  • How much?

    Most words are rare
    Zipf's Law
    To get enough data for most words, we need very big corpora

  • Zipf's Law

  • Zipf's Law

    the: 6%
    100 most frequent words: 45%
    7,500 most frequent words: 90%
    all others: rare

  • Zipf's Law
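The coverage figures on the Zipf's Law slide can be read as the share of running text accounted for by the N most frequent words (that reading is an assumption; the slide does not spell it out). A minimal sketch of how such a figure could be computed from a tokenised text follows. The function name `coverage` and the toy sentence are illustrative only, not corpus data.

```python
from collections import Counter


def coverage(tokens, top_n):
    """Share of all tokens accounted for by the top_n most frequent word types."""
    counts = Counter(tokens)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    # most_common(top_n) returns the top_n (word, frequency) pairs.
    covered = sum(freq for _, freq in counts.most_common(top_n))
    return covered / total


# Toy text, purely illustrative.
tokens = "the cat sat on the mat and the dog sat on the log".split()
print(coverage(tokens, 1))  # share covered by the single most frequent word
print(coverage(tokens, 3))  # share covered by the three most frequent words
```

On a real corpus the same function would be called with the full token list and values such as 100 or 7,500 for `top_n`.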

  • Leading English Corpora: Size

    [Chart: size of corpora (in words) by decade, 1960s to 2000s: Brown/LOB, COBUILD, BNC, OEC]

  • Good news

    The web

  • You can't help noticing

    Replaceable or replacable?

  • Very very large

    2006 estimates for duplicate-free, linguistic, Google-indexed web:
    German: 44 billion words
    Italian: 25 billion words
    English: 1,000 billion to 10,000 billion words
    Most languages
    Most language types
    Up-to-date
    Free
    Instant access

  • What is a corpus?

    A corpus is a collection of texts when viewed as an object of language research.

  • Is the web a corpus?


  • but it's not representative

  • Sublanguage

    Language = core + sublanguages
    Options for corpus construction: none, some, all
    None: impoverished view of language
    Some: BNC
    cake recipes and gastro-uterine disease
    not car repair manuals or astronomy or ...
    All: until recently, not viable

  • Representativeness

    The web is not representative, but nor is anything else
    Text type variation: under-researched, lacking in theory
    Atkins, Clear and Ostler 1993 on design brief for BNC; Biber 1988; Kilgarriff 2001
    Text type is an issue across NLP
    Web: the issue is acute because, as against BNC or WSJ, we simply don't know what is there

  • What is out there?

    What text types are there on the web?
    Some are new: chatroom
    Proportions: is it overwhelmed by porn? How much?
    Hard question

  • Comparing frequency lists

    Web1T: present from Google
    All 1-, 2-, 3-, 4-, 5-grams with f>40 in one trillion (10^12) words of English
    That's 1,000,000,000,000
    Compare with BNC:
    100 words with highest Web1T:BNC ratio
    100 words with lowest ratio
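The comparison described on this slide can be sketched as follows: normalise each word's count to a per-million frequency in each corpus, take the ratio, and sort. The slide does not say how zero counts were handled, so the add-one smoothing here is an assumption, as are the function name `ratio_ranking` and the invented toy counts.

```python
def ratio_ranking(freq_a, size_a, freq_b, size_b, smoothing=1.0):
    """Rank words by the ratio of per-million frequency in corpus A to corpus B.

    freq_a, freq_b: dicts mapping word -> raw count.
    size_a, size_b: total number of tokens in each corpus.
    smoothing: constant added to both per-million values so that words
               missing from one corpus do not cause division by zero
               (an assumption, not taken from the slide).
    """
    ratios = {}
    for word in set(freq_a) | set(freq_b):
        per_million_a = freq_a.get(word, 0) * 1_000_000 / size_a
        per_million_b = freq_b.get(word, 0) * 1_000_000 / size_b
        ratios[word] = (per_million_a + smoothing) / (per_million_b + smoothing)
    # Highest ratio first: words most typical of corpus A.
    return sorted(ratios.items(), key=lambda item: item[1], reverse=True)


# Invented counts for illustration only; not Web1T or BNC figures.
web_counts = {"browser": 500, "she": 100}
bnc_counts = {"browser": 5, "she": 900}
ranked = ratio_ranking(web_counts, 1_000_000, bnc_counts, 1_000_000)
print(ranked[0][0])   # word most over-represented in the first corpus
print(ranked[-1][0])  # word most under-represented in the first corpus
```

Taking the first 100 and last 100 entries of such a ranking gives lists comparable to the "highest ratio" and "lowest ratio" lists on the slide.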

  • Web-high (155 terms)

    61 web and computing: config browser spyware url www forum
    38 porn
    22 US English (incl Spanish influence: los)
    18 business/products common on web: poker viagra lingerie ringtone dvd casino rental collectible tiffany
    NB: BNC is old
    4 legal: trademarks pursuant accordance herein

  • Web-low

    Exclude British English, transcription/tokenisation anomalies

    herself stood seemed she looked yesterday sat considerable had council felt perhaps walked round her towards claimed knew obviously remained himself he him

  • Observations

    Pronouns and past tense verbs: fiction
    Masc vs fem
    Yesterday: probably daily newspapers
    Constancy of ratios: he/him/himself, she/her/herself

  • The web

    A social, cultural, political phenomenon
    New, little understood
    A legitimate object of science
    Mostly language
    We are well placed
    A lot of people will be interested
    Let's:
    study the web
    source of language data
    apply our tools for web use (dictionaries, MT)
    use the web as infrastructure

  • Using Search Engines

    No setup costs
    Start querying today

    Methods:
    Hit counts
    Snippets
    Metasearch engines, WebCorp
    Find pages and download

  • Googleology

    Google hit counts for language modelling

    Example: (Keller & Lapata 2003) 36 queries to estimate freq(fulfil, obligation) to each of Google and Altavista. Very interesting work.

    Great interest in query syntax

  • The Trouble with Google

    Not enough instances: max 1000
    Not enough queries: max 1000 per day with API
    Not enough context: 10-word snippet around search term
    Sort order: search term in titles and headings
    Untrustworthy hit counts
    Limited search options
    Linguistically dumb, e.g. not lemmatised: aime/aimer/aimes/aimons/aimez/aiment

  • Appeal

    Zero-cost entry, just start googling

    Reality

    High-quality work: high-cost methodology

  • Also:

    No replicability
    Methods, stats not published
    At mercy of commercial corporation
    Googleology is bad science
    So

  • Basic steps

    Gather pages: Google hits; select and gather whole sites; general crawl
    Filter
    De-duplicate
    Linguistic processing
    Load into corpus tool
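Of the basic steps, de-duplication is the easiest to illustrate in a few lines. The sketch below removes only exact duplicates after whitespace and case normalisation. Near-duplicates (pages that differ in a date or a navigation bar) need fuzzier methods and are not covered. The function names are illustrative, not taken from any tool mentioned in the slides.

```python
import hashlib
import re


def normalise(text):
    """Collapse whitespace and lower-case so trivial differences are ignored."""
    return re.sub(r"\s+", " ", text).strip().lower()


def deduplicate(pages):
    """Keep the first occurrence of each page, comparing normalised content.

    pages: list of page texts (strings).
    Returns the pages in their original order with exact duplicates dropped.
    """
    seen = set()
    unique = []
    for page in pages:
        digest = hashlib.sha1(normalise(page).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(page)
    return unique


pages = ["Hello  world", "hello world", "Something else"]
print(deduplicate(pages))
```

Hashing keeps memory use low on large crawls, since only the digests are stored rather than the full page texts.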

  • Oxford English Corpus

    Whole domains chosen and harvested
    Control over text type
    2 billion words (Mar 08)

  • Oxford English Corpus

  • DeWaC, ItWaC, UKWaC

    1.5 B words each
    Marco Baroni, Adriano Ferraresi
    Seeds: mid-frequency words from core vocab lists and corpora
    Google on seed words, then crawl

  • Filtering

    Non-text (sound, image etc) files
    Boilerplate (within file): copyright notices, navigation bars
    High markup heuristic
    Not text in sentences: look for function words
    Lists?? Sports results?? Crossword puzzles??
    Spam, pornography: tough
    De-duplication (also tough)
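The "look for function words" heuristic can be sketched as below: running text in sentences contains a high proportion of function words, while lists, sports results and similar material contain few. The short English word list and the 0.25 threshold are assumptions chosen for illustration. A real filter would use a fuller list per language and a threshold tuned on sample pages.

```python
import re

# A small illustrative set of English function words (an assumption,
# not the list used for any corpus named in the slides).
FUNCTION_WORDS = {
    "the", "of", "and", "to", "a", "in", "is", "that",
    "it", "for", "was", "on", "with", "as", "be",
}


def function_word_ratio(text):
    """Proportion of alphabetic tokens in text that are function words."""
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return 0.0
    hits = sum(1 for token in tokens if token in FUNCTION_WORDS)
    return hits / len(tokens)


def looks_like_running_text(text, threshold=0.25):
    """True if the function-word ratio reaches the (assumed) threshold."""
    return function_word_ratio(text) >= threshold


prose = "The cat sat on the mat and it was happy to be in the sun."
results = "Arsenal 2 Chelsea 1 Liverpool 0 Everton 3"
print(looks_like_running_text(prose))
print(looks_like_running_text(results))
```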

  • Small, specialised corpora

    Terminologists
    Translators needing target-language domain-specific vocab
    Specialist dictionaries: don't exist; expensive/inaccessible; out of date

  • BootCaT (Bootstrapping Corpora and Terms)

    Put in seed terms
    Google/Yahoo search
    Retrieve Google/Yahoo hits
    Remove duplicates, boilerplate
    Small instant corpora
    Baroni and Bernardini, LREC 2004
    Web version: WebBootCaT, at Sketch Engine site
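The slide does not say how seed terms are turned into search queries. One plausible approach, assumed here, is to combine the seeds into small random tuples and send each tuple as one query, so that the pages returned mention several seed terms at once. The sketch below only builds the query strings and does not contact any search engine. The function name, tuple size and example seeds are hypothetical.

```python
import itertools
import random


def seed_queries(seeds, tuple_size=3, n_queries=10, rng_seed=0):
    """Build up to n_queries query strings from random tuples of seed terms.

    seeds: iterable of seed terms (duplicates are ignored).
    tuple_size: number of seed terms combined in each query (assumed value).
    rng_seed: fixed seed so the same queries are produced on every run.
    """
    combos = list(itertools.combinations(sorted(set(seeds)), tuple_size))
    rng = random.Random(rng_seed)
    rng.shuffle(combos)
    return [" ".join(combo) for combo in combos[:n_queries]]


# Hypothetical seed terms for a specialist domain.
seeds = ["corpus", "lemma", "collocation", "concordance", "tokenisation"]
for query in seed_queries(seeds, tuple_size=3, n_queries=4):
    print(query)
```

Multi-word seed terms would need quoting before being joined into a query, which this sketch leaves out.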

  • Task

    Choose area of specialist interest
    English or Spanish
    Select at least 5 seed terms (specialist: good)
    Build corpus: at least 100,000 words
    Iterate if necessary
    Find at least six words/phrases/meanings you did not know before
    Write up