Corpora by Web Services Adam Kilgarriff Lexical Computing Ltd Lexicography MasterClass Ltd Universities of Leeds and Sussex

  • Corpora by Web Services
    Adam Kilgarriff
    Lexical Computing Ltd
    Lexicography MasterClass Ltd
    Universities of Leeds and Sussex

  • Starting a PhD in NLP
    Then
    Prolog
    Type in a few grammar rules
    Lexical entries
    Example sentences
    We're off!

    Kilgarriff: Corpora by Web Services

  • Now
    Corpus
    Which?
    Budget/schedule
    How much can we afford?
    Hard disk space
    Access software
    Build
    Big job, making it fast is hard
    or
    Research, acquire, install, maintain

  • Research question
    Morphology, syntax, discourse structure, semantics, anaphor
    First six months at least
    Acquiring data, software
    Complications

  • If you're not super-geeky
    Did I do it properly?
    Dumbing down
    Let's choose an easier question
    Looking over shoulder

  • Disappointment

  • Making it easy
    Like picking up a hire car

  • Corpora by web services
    Possible?
    Already available

  • Sketch Engine
    Corpus querying
    Fast
    Handles large corpora
    In use for lexicography at OUP, CUP, Macmillan, Collins, Le Robert
    Word sketches
    Data-driven summary of a word's grammatical and collocational behaviour

  • Corpora

  • Big, High Quality corpora
    Big
    Performance
    Banko and Brill 2004
    There's no data like more data
    Ample data for rare phenomena
    Big subcorpora
    5b
    Medical: 30m

  • Quality
    Bad data
    Spam
    Navigation-bars
    Duplicates
    Lists
    Bungled formatting
    Wrong language
    Less discussed
    Maybe a footnote
    I wonder why
    Quick fixes and run

  • The Google/Yahoo/Bing option
    Appeal
    No setup costs
    Start googling today

  • Very interesting work
    Keller and Lapata
    Validity of SE counts vs BNC counts vs psycholinguistic validity of collocations
    36 queries per collocation
    fulfil obligation
    fulfil ? obligation
    fulfilling obligations ...
    Nakov, Nakov and Hearst
    Great interest in query syntax

  • but
    Limited hits-per-query
    Limited hits-per-day
    Sort order
    Not documented
    'unsorted' not possible
    Snippets too short for research
    No (documented) morphology
    Limited query syntax

  • and
    At mercy of commercial company
    Might change at any time
    Not replicable

  • So
    Appeal
    No setup costs
    Serious research
    Many difficult practical issues
    Not a tool designed for linguists
    Conclusion
    If only SE indexes are big enough
    Yes
    Else no

  • Strategy
    More languages
    Corpus Factory, as Sharoff
    Bigger
    Big Web Corpus (BiWeC)
    Currently 5.5b fully processed
    Target 20b
    Better

  • New Model Corpus
    BNC is past its sell-by
    Early 1990s
    Pre web
    Still dominant model
    New model needed

  • Model
    Small: model train
    Model train
    Design: software model
    NMC
    1:100 for BiWeC-scale
    100m
    Update of BNC as design model
    Data from web but
    Text type available

  • Open-source/collaboration
    We distribute
    You annotate
    POS-tags, parses, anaphor, discourse moves, semantics, multiwords, entity-types ...
    Domain, register, region ...
    Send us annotations
    We integrate
    And give access in SkE

  • Divide and rule
    Bigger (BiWeC)
    Better (NMC)
    Take best annotations from NMC, apply to BiWeC
    Accuracy
    Speed
    Usefulness
    Good collaboration

  • TEDDCLOG
    Taiwan English Data-Driven CLOze Generation
    with Simon Smith and colleagues, Taipei
    API case study

  • Cloze
    'fill-the-gap'
    Several metals _____ violently with cold water
    A: behave
    B: react
    C: realise
    D: respond
    Popular with students, teachers, testers
    Unpopular with theorists :-(

  • One objection
    Test item writers make them up
    Not naturally-occurring language
    The Sinclair-Johns critique

    Also: expensive

    TEDDCLOG
    Uses corpus sentences and distractors

  • API calls
    Find distractors
    thesaurus
    Find key-only collocate
    Sketch diffs
    Needs optimising
    Find carrier sentence
    Concordance with GDEX module
    Good Dictionary Example Finder
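
The three steps on this slide can be sketched as a small pipeline. This is only an illustration under stated assumptions, not the TEDDCLOG implementation: the dictionaries below are toy in-memory stand-ins for the thesaurus, sketch-diff and concordance services, the numeric scores merely imitate a GDEX-style ranking, and every function name is hypothetical. No real Sketch Engine API call is shown.

```python
import re
from dataclasses import dataclass

# Toy stand-ins for the three services (hypothetical data, not real API output).
THESAURUS = {"react": ["respond", "behave", "realise"]}
COLLOCATES = {
    "react": {"violently", "badly", "quickly"},
    "respond": {"quickly", "badly"},
    "behave": {"badly"},
    "realise": {"quickly"},
}
# (sentence, score) pairs; the score imitates a GDEX-style "good example" rank.
CONCORDANCE = [
    ("Several metals react violently with cold water", 0.9),
    ("They react violently whenever, as noted in (3b), the sample is wet", 0.2),
    ("Customers react badly to price rises", 0.7),
]


@dataclass
class ClozeItem:
    stem: str
    options: list
    answer: str


def has_word(word, sentence):
    """True if `word` occurs in `sentence` as a whole word."""
    return re.search(rf"\b{re.escape(word)}\b", sentence) is not None


def find_distractors(key, n=3):
    """Step 1: thesaurus neighbours of the key serve as distractors."""
    return THESAURUS.get(key, [])[:n]


def key_only_collocates(key, distractors):
    """Step 2: collocates of the key that no distractor shares (a sketch diff)."""
    shared = set().union(*(COLLOCATES.get(d, set()) for d in distractors))
    return COLLOCATES.get(key, set()) - shared


def find_carrier(key, collocates):
    """Step 3: best-scoring sentence containing the key and a key-only collocate."""
    candidates = [
        (score, sentence)
        for sentence, score in CONCORDANCE
        if has_word(key, sentence) and any(has_word(c, sentence) for c in collocates)
    ]
    return max(candidates)[1] if candidates else None


def make_cloze(key):
    distractors = find_distractors(key)
    sentence = find_carrier(key, key_only_collocates(key, distractors))
    if sentence is None:
        return None
    stem = re.sub(rf"\b{re.escape(key)}\b", "_____", sentence, count=1)
    return ClozeItem(stem, sorted(distractors + [key]), key)


item = make_cloze("react")
print(item.stem)
for letter, option in zip("ABCD", item.options):
    print(f"{letter}: {option}")
```

Requiring a key-only collocate in the carrier sentence is what keeps the distractors wrong: in this toy data "violently" goes with "react" but with none of the alternatives, so only the key fits the gap.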

  • Current status
    TEDDCLOG
    Next phase: producing decent results
    Corpora by Web Services
    Upping server capacity
    Looking for users (currently with UKWaC)
    New Model Corpus
    Nervous over copyright but
    Available in SkE, for download

  • Another announcement: DANTE
    Lexical database for English
    Detailed Accurate Extensive of English
    Highly corpus-driven
    3 yr project
    18 expert lexicographers
    Led by Sue Atkins
    BNC, FrameNet, Euralex, COBUILD...
    English side, New English-Irish dictionary
    Available for NLP research imminently
