Page 1: Language Data Resources

Language Data Resources

About Corpora

Page 2: Language Data Resources

J. Sinclair: “Language looks rather different when you look at a lot of it at once.”

P. Eisner: “Do you know it, this language of yours? One would think that if I am to love something, I must know it. Yet you do not know Czech, and when I say so, it is neither an accusation nor even a reproach. You cannot know it and grasp it in full; nobody has yet managed that completely…”

Page 3: Language Data Resources

Merriam-Webster OnLine:

Page 4: Language Data Resources

Corpus

• F. Čermák: corpus – a structured, unified (and often also tagged) large collection of language data

• T. McEnery: Corpus data – the raw fuel of NLP

Page 5: Language Data Resources

Corpus linguistics

• A study of language that includes all processes related to processing, usage and analysis of written or spoken machine-readable corpora. Corpus linguistics is a relatively modern term used to refer to a methodology, which is based on examples of ‘real life’ language use

• Corpus linguistics is not a language theory.

Page 6: Language Data Resources

Corpora classification

• A. by medium:
  – printed, electronic text, digitized speech, video

• B. by design method:
  – balanced vs. special

• C. by language variables:
  – monolingual vs. multilingual
  – original vs. translations
  – native speaker vs. learner

• D. by language evolution:
  – synchronic vs. diachronic

• E. plain vs. annotated

Page 7: Language Data Resources

Balanced corpora (?)

• T. McEnery: “Sampling is inescapable.”

• Proportions corresponding to the real language usage

• Is that possible? Criteria for choosing styles, genres, and eventually concrete texts?

• reception (a few authors, large audience) vs. perception (production of a large community of language users)

• N. Chomsky: “Any natural corpus will be skewed. Some sentences won’t occur because they are obvious, others because they are false, still others because they are impolite. The corpus, if natural, will be so wildly skewed that the description would be no more than a mere list.”

Page 8: Language Data Resources

Corpus size

• Brown Corpus – 1 MW (1964)

• British National Corpus – 100 MW (1994)
  – http://www.natcorp.ox.ac.uk/

• Cosmas – 1.6 GW (2004)
  – http://corpora.ids-mannheim.de/cosmas/

Page 9: Language Data Resources

Exercise

• Could you estimate the amount of Czech texts (measured in running words) available on the Internet?

Page 10: Language Data Resources

Antecedents of corpora

• Excerption tickets
  – for Czech, systematically from 1911

• Electronic corpus of Czech texts
  – 1970s
  – around 500 kW

Page 11: Language Data Resources

Corpus annotation

• K. Pala: “Annotating consists of adding selected linguistic information to an existing corpus of written or spoken language. Typically, this is done by some kind of coding being attached (semi)automatically or manually to the electronic representation of the text.”

• Raw texts: difficult to exploit

• Solution: gradual “information adding” (more exactly: adding information in an explicit, machine-tractable form)

• Annotation brings ease of exploitation and reusability

Page 12: Language Data Resources

Criticism of corpus annotation

• Corpus annotations produce impure corpora
  – forced interpretations

• Consistency vs. accuracy

Page 13: Language Data Resources

Czech National Corpus

• http://ucnk.ff.cuni.cz

• ÚČNK (Institute of the Czech National Corpus), founded in 1994

• diachronic section, 13th-19th century: DIAKORP

• synchronic section, from around 1900
  – written language: 100 MW in SYN2000
  – spoken language: Prague Spoken Corpus (PMK), Brno Spoken Corpus (BMK)
  – dialects

Page 14: Language Data Resources

Czech National Corpus

Page 15: Language Data Resources

SYN2000

[Pie chart: SYN2000 corpus, representation of specialized and thematic literature. Shares in legend order:]

• journalism: 60.00%
• fiction: 15.00%
• arts studies: 3.48%
• social sciences: 3.67%
• law and security: 0.82%
• natural sciences: 3.37%
• technology: 4.61%
• economics and management: 2.27%
• faith, religion: 0.74%
• lifestyle: 5.55%
• administration: 0.49%

Page 16: Language Data Resources

Preprocessing

• Collect textual material
  – electronic form
  – scanning + OCR
  – trend: WWW as a corpus

• Conversion and cleaning
  – unified format (problem: losing some information)
  – unified encoding (problem: encoding detection)

• Document classification

• Document segmentation
  – segmentation on sentence boundaries (problem: tables, direct speech…)
  – tokenization on word boundaries (problem: what is a word?)
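The segmentation step can be sketched in a few lines of Python. This is a deliberately naive illustration, not the method used for any corpus named in these slides: the function names `split_sentences` and `tokenize` and both regular expressions are assumptions made for the example. It shows why the "problems" in the list above are real: abbreviations, direct speech and hyphenated forms all need a decision.

```python
import re

# Naive tokenizer: a run of word characters (optionally joined by an internal
# hyphen or apostrophe), or any single non-space symbol such as punctuation.
TOKEN = re.compile(r"\w+(?:[-']\w+)*|[^\w\s]")


def split_sentences(text):
    """Split after ., ! or ? but only if the next chunk starts uppercase."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    sentences = []
    for part in parts:
        if sentences and not part[:1].isupper():
            # Probably an abbreviation such as "str. 5": glue it back on.
            sentences[-1] += " " + part
        else:
            sentences.append(part)
    return [s for s in sentences if s]


def tokenize(sentence):
    """Return word and punctuation tokens of one sentence."""
    return TOKEN.findall(sentence)


if __name__ == "__main__":
    sample = "Sampling is inescapable. Is that possible? Don't ask."
    for sentence in split_sentences(sample):
        print(tokenize(sentence))
```

Whether "Don't" or "mám-li" is one token or two is exactly the "what is a word?" question; this sketch keeps them whole, which is only one possible choice.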

Page 17: Language Data Resources

(Morphological) Tagging

• (1) Morphological analysis
  – for each word form, list all possible lemma+tag pairs (or a list of sequences of such pairs, if tokenization is not straightforward)

• (2) Disambiguation
  – choose one lemma+tag pair
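The two steps can be illustrated with a toy example. Everything here is hypothetical: the lexicon, the tag names and the hand-written scores are invented for illustration, and the greedy left-to-right choice is a simple stand-in for the statistical disambiguation a real tagger would learn from an annotated corpus.

```python
# Hypothetical toy lexicon: word form -> all possible (lemma, tag) pairs.
LEXICON = {
    "the": [("the", "DET")],
    "saw": [("see", "VERB"), ("saw", "NOUN")],
    "dogs": [("dog", "NOUN"), ("dog", "VERB")],
    "bark": [("bark", "VERB"), ("bark", "NOUN")],
}

# Hypothetical scores for "tag B follows tag A"; "<s>" marks sentence start.
# A real tagger would estimate such values from annotated data.
TRANSITION_SCORE = {
    ("<s>", "DET"): 1.0,
    ("DET", "NOUN"): 1.0,
    ("NOUN", "VERB"): 0.8,
    ("NOUN", "NOUN"): 0.2,
}


def analyze(word):
    """Step 1: list every possible (lemma, tag) pair for a word form."""
    return LEXICON.get(word.lower(), [(word, "UNKNOWN")])


def disambiguate(words):
    """Step 2: greedily keep one (lemma, tag) pair per word."""
    previous_tag = "<s>"
    chosen = []
    for word in words:
        candidates = analyze(word)
        best = max(
            candidates,
            key=lambda pair: TRANSITION_SCORE.get((previous_tag, pair[1]), 0.0),
        )
        chosen.append(best)
        previous_tag = best[1]
    return chosen


if __name__ == "__main__":
    print(analyze("saw"))
    print(disambiguate(["The", "dogs", "bark"]))
```

Separating `analyze` from `disambiguate` mirrors the slide: the first step is a dictionary lookup that keeps all readings, the second is where context is used to pick one.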

Page 18: Language Data Resources

Parallel corpora

• texts and their translations into another language (or into more languages)

• added value: alignment
  – explicit pairing of corresponding chunks of text
  – ideally diagonal
  – often just sentence-level alignment
  – automatic alignment? (anchor points, word pairs, …)

• http://utkl.ff.cuni.cz/~rosen/public/parabrati.ppt
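To make "sentence-level alignment" concrete, here is a minimal sketch of a length-based aligner using dynamic programming. It is a simplification written for this illustration, not the procedure behind any corpus mentioned here: the cost function, the 0.3 penalty and the set of allowed pairings (1-1, 1-0, 0-1, 2-1, 1-2) are all assumptions. The only idea it relies on is that a sentence and its translation tend to have similar character lengths.

```python
def bead_cost(src_len, tgt_len, shape):
    """Relative length mismatch, plus a penalty for anything but 1-1."""
    mismatch = abs(src_len - tgt_len) / max(src_len + tgt_len, 1)
    penalty = 0.0 if shape == (1, 1) else 0.3  # assumed value
    return mismatch + penalty


def align(source, target):
    """Pair up two lists of sentences; returns (source_chunk, target_chunk)."""
    n, m = len(source), len(target)
    inf = float("inf")
    # Allowed pairings: (source sentences used, target sentences used).
    shapes = [(1, 1), (1, 0), (0, 1), (2, 1), (1, 2)]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            if cost[i][j] == inf:
                continue
            for di, dj in shapes:
                ni, nj = i + di, j + dj
                if ni > n or nj > m:
                    continue
                src_len = sum(len(s) for s in source[i:ni])
                tgt_len = sum(len(t) for t in target[j:nj])
                total = cost[i][j] + bead_cost(src_len, tgt_len, (di, dj))
                if total < cost[ni][nj]:
                    cost[ni][nj] = total
                    back[ni][nj] = (i, j)
    # Walk back from the end to recover the cheapest path.
    pairs = []
    i, j = n, m
    while (i, j) != (0, 0):
        pi, pj = back[i][j]
        pairs.append((source[pi:i], target[pj:j]))
        i, j = pi, pj
    pairs.reverse()
    return pairs


if __name__ == "__main__":
    english = ["Hello.", "This is a much longer sentence about corpora.", "Bye."]
    czech = ["Ahoj.", "Toto je mnohem delší věta.", "Je o korpusech.", "Nashle."]
    for src, tgt in align(english, czech):
        print(src, "<->", tgt)
```

When both texts have the same sentences in the same order, the cheapest path is a run of 1-1 pairings, which is the "ideally diagonal" case from the slide; anchor points or word pairs would be additional evidence on top of length.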

Page 19: Language Data Resources

MULTEXT-EAST

• Multilingual Text Tools and Corpora for Central and Eastern European Languages

• Lexical resources
  – Entry: word form + lemma + MSD
  – MSD: morphosyntactic description (e.g. Ncms = Noun, common, masculine, singular)

• Annotated multilingual corpus
  – Translations of George Orwell's "1984", about 100 kW
  – Bulgarian, Czech, Estonian, Hungarian, Romanian, and Slovene, as well as English (hub language)
  – (and recently also Croatian, Lithuanian, Resian, Romanian, Russian, Slovene)
  – Hand-validated sentence alignment

• http://nl.ijs.si/ME

• Version 3 released in 2004 (publicly available)

• TEI P4 XML
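An MSD such as Ncms is positional: the first character gives the category and each following character one attribute. The sketch below decodes only what the slide's example shows. The tables `CATEGORY` and `NOUN_ATTRIBUTES` are an illustrative subset assumed for this example, not the full MULTEXT-East specification, and treating "-" as "attribute not specified" is likewise an assumption.

```python
# Illustrative subset only: noun category and its first three positions.
CATEGORY = {"N": "Noun"}
NOUN_ATTRIBUTES = [
    ("type", {"c": "common", "p": "proper"}),
    ("gender", {"m": "masculine", "f": "feminine", "n": "neuter"}),
    ("number", {"s": "singular", "p": "plural"}),
]


def decode_msd(msd):
    """Turn a positional MSD string such as 'Ncms' into named features."""
    if not msd or msd[0] not in CATEGORY:
        raise ValueError("unsupported MSD in this sketch: %r" % msd)
    features = {"category": CATEGORY[msd[0]]}
    # zip stops at the shorter sequence, so positions beyond the small
    # table above are silently ignored here.
    for code, (name, values) in zip(msd[1:], NOUN_ATTRIBUTES):
        if code == "-":  # assumed placeholder for an unspecified attribute
            continue
        features[name] = values.get(code, "unknown code %r" % code)
    return features


if __name__ == "__main__":
    print(decode_msd("Ncms"))
```

Keeping the attribute tables as data rather than code means a fuller table per category and per language could be dropped in without changing `decode_msd`.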

Page 20: Language Data Resources

PCEDT

• Prague Czech-English Dependency Treebank

• http://ufal.ms.mff.cuni.cz/pcedt/

• Czech translation of 21,600 English sentences from the Wall Street Journal part of the Penn Treebank 3 corpus

• Czech-English corpus of plain text from Reader's Digest 1993-1996, consisting of 53,000 parallel sentences

• Automatically morphologically annotated and parsed into two levels of dependency structures (analytical and tectogrammatical)

• Available via LDC

Page 21: Language Data Resources

• E. Brill: “More data is more important than better algorithms.”

• E. Charniak: “Future is in statistics.”