Web Corpora


DESCRIPTION

Adam Kilgarriff. Web Corpora. You can't help noticing. Replaceable or replacable? http://googlefight.com. Very very large: 2006 estimates for the duplicate-free, linguistic, Google-indexed web. German: 44 billion words; Italian: 25 billion words; English: 1,000 billion to 10,000 billion words.

Text of Web Corpora

  • Web Corpora
    Adam Kilgarriff

    Kivik 2013
    Kilgarriff: Web corpora

  • You can't help noticing

    Replaceable or replacable?
    http://googlefight.com

  • Very very large

    2006 estimates for the duplicate-free, linguistic, Google-indexed web:
      German: 44 billion words
      Italian: 25 billion words
      English: 1,000 billion to 10,000 billion words

    Most languages
    Most language types
    Up-to-date
    Free
    Instant access

  • Overview

    Is the web a corpus?
    Representativeness
    What is out there?
    Web1T
    Googleology
    Web corpus types
    Targeted sites: Oxford English Corpus
    General: WaC family
    WebBootCaT

  • Is the web a corpus?

    Sinclair, "Corpus and Text: Basic Principles", in Developing Linguistic Corpora: a Guide to Good Practice:
    not a corpus because
      dimensions unknown, constantly changing
      not designed from a linguistic perspective
    But
      We can find out dimensions
      Many corpora are not designed
        "as much chatroom dialogue as I can get"
    Def: a corpus is a collection of texts when viewed as an object of language research

  • Is the web a corpus?

    Yes

  • But it's not representative

  • Theory

    A random sample of a population is representative of it.
    Observations on the sample support inferences about the population (within confidence bounds).
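As a small illustration of "inferences within confidence bounds", the sketch below estimates a proportion from a random sample and attaches a 95% interval. It uses the textbook normal approximation; the function name and the example counts are invented for illustration and do not come from the talk.

```python
import math


def proportion_confidence_interval(hits, sample_size, z=1.96):
    """Normal-approximation interval for a proportion seen in a random sample.

    hits: number of sampled items with the property of interest
    sample_size: number of items sampled (must be positive)
    z: 1.96 corresponds to a 95% confidence level
    """
    if sample_size <= 0:
        raise ValueError("sample_size must be positive")
    p = hits / sample_size
    margin = z * math.sqrt(p * (1 - p) / sample_size)
    # Clamp to [0, 1] because a proportion cannot leave that range
    return max(0.0, p - margin), min(1.0, p + margin)


# Hypothetical numbers: 120 of 1,000 randomly sampled sentences contain a passive
low, high = proportion_confidence_interval(120, 1000)
print(f"estimated share: between {low:.3f} and {high:.3f}")
```

The interval is only meaningful if the sample really is random with respect to a defined population, which is the condition the following slides call into question.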

  • Theory

    A random sample of a population is

    What is the population?
      production and reception
      speech and text
      copying

  • Theory

    Population not defined
    Representative sample not possible

  • Sublanguage

    Language = core + sublanguages
    Options for corpus construction: none, some, all
      None: impoverished view of language
      Some: BNC
        cake recipes and gastro-uterine disease
        not car repair manuals or astronomy or
      All: until recently, not viable

  • Representativeness

    The web is not representative
      but nor is anything else
    Text type variation
      under-researched, lacking in theory
      Atkins, Clear and Ostler 1993 on design brief for BNC; Biber 1988; Kilgarriff 2001
    Text type is an issue across NLP
    Web: issue is acute because, as against BNC or WSJ, we simply don't know what is there

  • What is out there?

    What text types are there on the web?
      some are new: chatroom
      proportions
        is it overwhelmed by porn? How much?
    Hard question

  • The web

    a social, cultural, political phenomenon
    new, little understood
    a legitimate object of science
    mostly language
      we are well placed
      a lot of people will be interested
    Let's
      study the web
      source of language data
      apply our tools for web use (dictionaries, MT)
      use the web as infrastructure

  • Using Search Engines

    No setup costs
    Start querying today

    Methods
      Hit counts
      Snippets
      Metasearch engines, WebCorp
      Find pages and download

  • Googleology

    Google hit counts for language modelling

    Example: (Keller & Lapata 2003) 36 queries to estimate freq(fulfil, obligation) to each of Google and Altavista
    Very interesting work

    Great interest in query syntax

  • The Trouble with Google

    not enough instances: max 1000
    not enough queries: max 1000 per day with API
    not enough context: 10-word snippet around search term
    sort order: search term in titles and headings
    untrustworthy hit counts
    limited search options
    linguistically dumb, e.g. not lemmatised: aime/aimer/aimes/aimons/aimez/aiment

  • Appeal
      Zero-cost entry, just start googling
    Reality
      High-quality work: high-cost methodology

  • Also:

    No replicability
    Methods, stats not published
    At mercy of commercial corporation

    Googleology is bad science

  • Better: web-sourced corpora

    Gather pages
      Google hits
      Select and gather whole sites
      General crawl
    Filter
    De-duplicate
    Linguistic processing
    Load into corpus tool
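The steps on this slide can be pictured as a small pipeline. The sketch below is a toy version under stated assumptions: pages have already been gathered as (url, text) pairs, "filter" means a crude minimum word count, "de-duplicate" means dropping exact copies after whitespace and case normalisation, and "linguistic processing" is reduced to lower-cased tokenisation. All function names and thresholds are hypothetical and are not taken from any tool named in the talk.

```python
import hashlib
import re


def looks_like_running_text(text, min_words=5):
    # Hypothetical filter: keep only texts with at least min_words word tokens
    return len(re.findall(r"\w+", text)) >= min_words


def build_corpus(pages, min_words=5):
    """Filter, de-duplicate and tokenise already-gathered (url, text) pairs."""
    seen_digests = set()
    corpus = []
    for url, text in pages:
        # Filter: drop pages with too little text (menus, stubs, error pages)
        if not looks_like_running_text(text, min_words):
            continue
        # De-duplicate: exact match after normalising case and whitespace
        normalised = " ".join(text.lower().split())
        digest = hashlib.sha1(normalised.encode("utf-8")).hexdigest()
        if digest in seen_digests:
            continue
        seen_digests.add(digest)
        # Linguistic processing: a stand-in for tokenising, lemmatising, tagging
        tokens = re.findall(r"\w+", text.lower())
        corpus.append({"url": url, "tokens": tokens})
    return corpus


pages = [
    ("http://example.org/a", "The web is a very large corpus of text"),
    ("http://example.org/b", "the web is a very   large corpus of text"),
    ("http://example.org/c", "Home | Login"),
]
for document in build_corpus(pages):
    print(document["url"], len(document["tokens"]))
```

The returned list stands in for what would be loaded into a corpus tool; a real pipeline would replace each stage with a dedicated component.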

  • Oxford English Corpus

    Whole domains chosen and harvested
      control over text type
    2.3 billion words


  • WaC family

    1.5 B words each
    Baroni and colleagues
    Seeds: mid-frequency words from core vocab lists and corpora
    Google on seed words, then crawl

  • TenTen Family

    Processing chain
      SpiderLing, a linguistic crawler
        a billion words a day
      jusText for cleaning: removing non-text
      Onion: remove duplicates (paragraph level)
    All major world languages
    2 to 20 billion words
    Lexical Computing
    All available in Sketch Engine
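To make "remove duplicates (paragraph level)" concrete, here is a minimal sketch that drops any paragraph already seen earlier in the collection, even when it appears in a different document. It only catches exact repeats after case and whitespace normalisation; the actual tool may well use a different, near-duplicate method, so treat this purely as an illustration. The function name and the blank-line paragraph convention are assumptions.

```python
import hashlib


def deduplicate_paragraphs(documents):
    """Keep only the first occurrence of each paragraph across all documents.

    documents: list of strings with paragraphs separated by blank lines
    (an assumed convention for this sketch).
    """
    seen = set()
    cleaned = []
    for doc in documents:
        kept = []
        for paragraph in doc.split("\n\n"):
            key = " ".join(paragraph.lower().split())
            if not key:
                continue  # skip empty paragraphs
            digest = hashlib.md5(key.encode("utf-8")).hexdigest()
            if digest not in seen:
                seen.add(digest)
                kept.append(paragraph.strip())
        cleaned.append("\n\n".join(kept))
    return cleaned


docs = [
    "We use cookies.\n\nCorpora are collections of texts.",
    "We use cookies.\n\nKeywords show how corpora differ.",
]
print(deduplicate_paragraphs(docs))
```

Working at paragraph level rather than document level means repeated boilerplate such as the cookie notice is removed while the unique text of each page is kept.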

  • Small, specialised corpora

    Terminologists
    Translators needing target-language domain-specific vocab
    Specialist dictionaries
      Don't exist
      Expensive/inaccessible
      Out of date

  • BootCaT (Bootstrapping Corpora and Terms)

    Put in seed terms
    Google/Yahoo search
    Retrieve Google/Yahoo hits
    Remove duplicates, boilerplate
    Small instant corpora
    Baroni and Bernardini, LREC 2004
    Web version
      WebBootCaT
      At Sketch Engine site
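The first step, turning seed terms into search queries, can be sketched as follows. The version here combines seeds into small tuples and samples a few of them as query strings; the tuple size, the number of queries, the function name and the seed words are illustrative assumptions, and the search and download steps are deliberately left out because they depend on whichever search service is available.

```python
import itertools
import random


def seed_queries(seeds, tuple_size=3, n_queries=5, rng_seed=0):
    """Build search queries by combining seed terms into tuples.

    Returns up to n_queries distinct query strings, each made of
    tuple_size different seeds. rng_seed makes the sample repeatable.
    """
    unique_seeds = sorted(set(seeds))
    combos = list(itertools.combinations(unique_seeds, tuple_size))
    rng = random.Random(rng_seed)
    rng.shuffle(combos)
    return [" ".join(combo) for combo in combos[:n_queries]]


# Hypothetical seed terms for a small corpus about corpus linguistics
seeds = ["corpus", "lemma", "concordance", "collocation", "tagger", "parser"]
for query in seed_queries(seeds):
    print(query)
```

Each query would then be sent to a search engine, and the pages behind the returned hits downloaded and cleaned to form the small instant corpus.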

  • But did I make a good corpus?

  • Bad Science
    Ben Goldacre

    Biases in samples
      A quarter of the people who tested positive had just been on holiday in Mexico
      But the research team didn't notice

  • Bad linguistics

    Our corpus study shows X
    But what was in the corpus?
    Moral: Get to know your corpus

  • How?

    Read it?
      Too big to read
      Not designed to be read

  • How?

    Compare it with other(s)
      Keyword lists
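One way to produce such a keyword list is to score each word by the ratio of its relative frequencies in the two corpora. The sketch below uses per-million frequencies with a small constant added to both sides so that words absent from one corpus do not cause division by zero. The scoring formula, the smoothing value and the toy data are assumptions made for illustration; the slides do not say which statistic was used.

```python
from collections import Counter


def keywords(focus_tokens, reference_tokens, smoothing=1.0, top=10):
    """Rank words by how much more frequent they are in the focus corpus.

    score = (focus per-million + smoothing) / (reference per-million + smoothing)
    """
    focus, reference = Counter(focus_tokens), Counter(reference_tokens)
    focus_total, reference_total = sum(focus.values()), sum(reference.values())
    if focus_total == 0 or reference_total == 0:
        raise ValueError("both corpora must contain at least one token")
    scores = {}
    for word in set(focus) | set(reference):
        focus_pm = focus[word] * 1_000_000 / focus_total
        reference_pm = reference[word] * 1_000_000 / reference_total
        scores[word] = (focus_pm + smoothing) / (reference_pm + smoothing)
    return sorted(scores, key=scores.get, reverse=True)[:top]


# Toy corpora: American versus British spellings
american = "color center color program the the".split()
british = "colour centre colour programme the the".split()
print(keywords(american, british, top=3))
print(keywords(british, american, top=3))
```

Running the comparison in both directions gives one list per corpus, which is the shape of the two word lists on the following slides.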

  • UKWaC vs. enTenTen12

  • enTenTen vs. UKWaC

    enTenTen:
    accord actually amendment among bad because behavior believe bill blog ca center citizen color defense determine do dollar earth effort election even evil fact faculty favor favorite federal foreign forth guess guy he her him himself his honor human kid kill kind know labor law let liberal like man maybe me military movie my nation never nor not nothing official oh organization percent political post president pretty professor program realize recognize say shall she sin soul speak state suppose tell terrorist that thing think thou thy toward true truth unto upon violation vote voter war what while why woman yes

    UKWaC:
    accommodation achieve advice aim area assessment available band behaviour building centre charity click client club colour consultation contact council delivery detail develop development disabled email enable enquiry ensure event excellent facility favourite full further garden guidance guide holiday improve information insurance join link local main manage management match mm nd offer opportunity organisation organise page partnership please pm poker pp programme project pub pupil quality range rd realise recognise road route scheme sector service shop site skill specialist st staff stage suitable telephone th top tour training transport uk undertake venue village visit visitor website welcome whilst wide workshop www
