
Corpora: Annotating and Searching

LING 5200 Computational Corpus Linguistics
Martha Palmer

LING 5200, 2006. Based on Kevin Cohen’s LING 5200.

Features of corpora

Size (little/big/huge)
Plasticity (finite/monitor)
Metadata (none/lots)
Annotation (none, …, lots)
Balance



Features: size

Relative over time
  1960s: 1M words (Brown)
  1990s: 4.5M words (Penn Treebank)
  2000s: 415M words (BOE)
  2000s: 1,000M words (English Gigaword)

Currently, micro/small/large/massive


Features

Finite
  size established in advance
  sample sizes adjusted accordingly
  doesn't change over time
Monitor
  allows diachronic analysis
  grows over time


Metadata

(practically) none
  language, at least
  document boundaries
some
  document attributes
    title
    body
    author
    date

PMID- 6509398
DP - 1984 Nov
TI - The natural history of Machado-Joseph disease. An analysis of 138 personally examined cases.
PG - 510-25
AB - We have examined 138 cases of a disorder previously described in people of Portuguese origin and which has received many names. By computer analysis of 46 different items of a standardized neurological examination carried out in each patient, we have been able to delineate the main components of
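
The record excerpt above uses a tagged-field layout (a short field tag, a hyphen, then the value), so document attributes can be pulled out mechanically. The following is a minimal sketch of that idea, not an official parser: the regular expression, the function name `parse_record`, and the treatment of untagged lines as continuations are all assumptions made for illustration.

```python
import re

# Assumed tag pattern: 2-4 capital letters, optional padding, a hyphen, a space.
TAG_LINE = re.compile(r"^([A-Z]{2,4})\s*-\s+(.*)$")

def parse_record(text):
    """Return a dict mapping each field tag to a list of its values."""
    fields = {}
    current = None
    for line in text.splitlines():
        if not line.strip():
            continue
        match = TAG_LINE.match(line)
        if match:
            # A new tagged field starts here; repeated tags accumulate.
            current = match.group(1)
            fields.setdefault(current, []).append(match.group(2).strip())
        elif current is not None:
            # Assumption: an untagged line continues the previous value.
            fields[current][-1] += " " + line.strip()
    return fields

record = """PMID- 6509398
DP - 1984 Nov
TI - The natural history of Machado-Joseph disease. An analysis of 138
 personally examined cases.
PG - 510-25
MH - Aged
MH - Azores/ethnology
"""

parsed = parse_record(record)
print(parsed["PMID"])
print(parsed["MH"])
```

Keeping every field as a list means repeated tags, such as the MH subject headings, are collected rather than overwritten.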


Metadata

Lots
  Author characteristics
    gender, age, mother tongue(s), dialect, educational level
  genre classification
    news
    scientific
    personal
  topic relevance

MH - Aged

MH - Azores/ethnology

MH - Cerebellar Ataxia/diagnosis

MH - Gene Frequency

MH - Human

MH - Phenotype

MH - Portugal/ethnology

MH - Support, Non-U.S. Gov't

MH - Syndrome

MH - United States

MH - Variation (Genetics)


Balanced corpora

What are you balancing?

Most common: genre
Authors
  gender
  age
  education
  dialect


Balanced corpora

Composition of the International Corpus of English (adapted from Meyer 2002)

Tree diagram, levels from top to bottom:
  speech / writing
  unpublished / published
  non-fiction / fiction
  informative / instructional / persuasive
  academic / popular / news


Balanced corpora

Composition of the International Corpus of English (adapted from Meyer 2002)

Tree diagram, levels from top to bottom:
  speech / writing
  dialogue / monologue
  scripted / unscripted
  talks / news / speeches


Corpus length

Overall length
Sample size
  partial
    2,000 words (Brown, LOB, ICE)
    5,000 words (London-Lund)
  full
    takes up space
    copyright permission issues harder


Sample size

Motivating assumption: more important to maximize number of authors/genres than length of text from each
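
To make the trade-off concrete, here is a rough sketch of drawing one fixed-length sample per text, so that many authors can be covered at a constant cost per author. The function names, the whitespace tokenization, and the random contiguous window are assumptions made for illustration; they are not a description of how Brown, LOB, ICE, or London-Lund were actually sampled.

```python
import random

def sample_words(text, sample_size=2000, seed=0):
    """Return a contiguous run of up to sample_size whitespace-delimited tokens."""
    tokens = text.split()
    if len(tokens) <= sample_size:
        # Short texts are kept whole.
        return tokens
    # Assumption: the window start is chosen at random (seeded for repeatability).
    start = random.Random(seed).randrange(len(tokens) - sample_size + 1)
    return tokens[start:start + sample_size]

def build_corpus(texts, sample_size=2000):
    """Take one sample per source text (hypothetical helper)."""
    return {name: sample_words(body, sample_size) for name, body in texts.items()}

# Toy stand-ins for full texts by two authors.
texts = {
    "author_a": " ".join(f"a{i}" for i in range(5000)),
    "author_b": " ".join(f"b{i}" for i in range(300)),
}

corpus = build_corpus(texts)
print({name: len(sample) for name, sample in corpus.items()})
```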



By purpose

Linguistic-y
  lexicon vs. other
NLP
  General purpose
  information retrieval
  information extraction
Foreign language instruction
  Native L2
  "Learner" L2



Is there a corpus…

http://www.ldc.upenn.edu/


Annotation

None/some/lots


Annotation

None
  "collection"
Some
  POS
  lemmas
    lemma(be) = {be, am, is, are, were, being, been}
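
The lemma set above can be inverted into a lookup from word form to lemma. This is only a toy sketch: the table holds just the one entry from the slide, and the names `LEMMA_FORMS`, `FORM_TO_LEMMA`, and `lemmatize` are hypothetical. Forms missing from the table are passed through lowercased.

```python
# Lemma table; the single entry copies the set given for "be" on the slide.
LEMMA_FORMS = {
    "be": {"be", "am", "is", "are", "were", "being", "been"},
}

# Invert the table so each word form points back to its lemma.
FORM_TO_LEMMA = {
    form: lemma
    for lemma, forms in LEMMA_FORMS.items()
    for form in forms
}

def lemmatize(tokens):
    """Map each token to its lemma; unknown forms are returned lowercased."""
    return [FORM_TO_LEMMA.get(token.lower(), token.lower()) for token in tokens]

print(lemmatize("Dogs are being dogs".split()))
```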


Annotation

Lots
  syntax (treebank, "bracketing")
  semantics
    predicate/argument structure
    ontological

<class (mammal, pet)>Dogs</class> make me happy.
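
Syntactic "bracketing" stores a parse as nested labeled parentheses, which can be read back into a tree with a few lines of code. The sketch below assumes every bracket carries a label and that words contain no parentheses. The bracketed string for the example sentence is a hand-made illustration, not an excerpt from any treebank, and the function names are hypothetical.

```python
def parse_bracketed(text):
    """Parse one bracketed tree into nested lists of the form [label, child, ...]."""
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def read_node():
        nonlocal pos
        if tokens[pos] != "(":
            raise ValueError(f"expected '(' at token {pos}")
        pos += 1
        node = [tokens[pos]]  # constituent or POS label
        pos += 1
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                node.append(read_node())   # nested constituent
            else:
                node.append(tokens[pos])   # a word
                pos += 1
        pos += 1  # consume ")"
        return node

    tree = read_node()
    if pos != len(tokens):
        raise ValueError("unexpected text after the closing bracket")
    return tree

def leaves(node):
    """Collect the words of a parsed tree, left to right."""
    words = []
    for child in node[1:]:
        if isinstance(child, list):
            words.extend(leaves(child))
        else:
            words.append(child)
    return words

# Hand-made bracketing of the example sentence (illustrative only).
example = "(S (NP (NNS Dogs)) (VP (VBP make) (S (NP (PRP me)) (ADJP (JJ happy)))) (. .))"
tree = parse_bracketed(example)
print(tree[0])
print(leaves(tree))
```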


Diachronic

Historical (OE, ME, …)
Later sampling of earlier balanced corpus
Monitor


Spoken

Phonetically motivated (elicited)
Other ("natural")


Multilingual

Parallel
  L1 contents == L2 contents
  Parliamentary proceedings in English & French
  Shakespeare in English and German
Translation/comparable
  two L1's; genre == genre
  E.g., weather reports


Penn Treebank

treebank: corpus of syntactically-annotated data

first release: 4.5 million words, 3 years' work

currently 4.9 M


Penn Treebank

Scientific abstracts     231K
Newspaper stories      3,066K
DOA bulletins             79K
Fiction                  106K
MUC-3                    112K
Computer manuals          89K
Radio transcripts         12K
Flight-booking            20K
Brown corpus           1,172K
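
As a quick arithmetic check, the section sizes listed above can be summed and compared with the overall figure of about 4.9M words. The dictionary below simply copies the slide's numbers (in thousands of words); the variable names are illustrative.

```python
# Section sizes in thousands of words, copied from the slide.
SECTIONS_K = {
    "Scientific abstracts": 231,
    "Newspaper stories": 3066,
    "DOA bulletins": 79,
    "Fiction": 106,
    "MUC-3": 112,
    "Computer manuals": 89,
    "Radio transcripts": 12,
    "Flight-booking": 20,
    "Brown corpus": 1172,
}

total_k = sum(SECTIONS_K.values())
print(f"Total: {total_k}K words (about {total_k / 1000:.1f}M)")

# Share of each section, largest first.
for name, size_k in sorted(SECTIONS_K.items(), key=lambda item: -item[1]):
    print(f"{name:<22}{size_k:>6}K {100 * size_k / total_k:5.1f}%")
```

The sections add up to 4,887K words, which rounds to the 4.9M total, and newspaper stories alone make up well over half of it.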


Penn Treebank

POS-tagged Switchboard data
  http://www.cis.upenn.edu/~treebank/switch-samp-pos.html
Dysfluency-annotated Switchboard data
  http://www.cis.upenn.edu/~treebank/switch-samp-dfl.html
Syntactically-annotated Switchboard data
  http://www.cis.upenn.edu/~treebank/switch-samp-bkt.html


GENIA

2000 abstracts
red blood cell transcription factors
POS-tagged (HW2, #16)
semantic annotation with molecular biology ontology


Corpora/resources

Dictionaries, ontologies, ...
  CELEX
  WordNet


Corpora/resources

Dictionaries, ontologies, ...
"discovery procedure"
  phonology
    contrasts
    phonotactics
  morphology
    term formation
    inflectional


McEnery & Wilson's definition of "corpus"
  sampled & representative
  finite size
  machine-readable
  "standard reference"

???