Corpus Annotation for corpus linguistics
Jorge Baptista
Universidade do Algarve
L2F Spoken Language Laboratory, INESC ID Lisboa
Erasmus Mundus Master in Natural Language Processing and Human Language Technology
Universitat Autònoma de Barcelona, Campus de Bellaterra, November 10 and 12, 2009


Lecture on corpus annotation for corpus linguistics. Contents: DIY corpus, e-texts, character set and text encoding issues, document structure, DTDs, documentation; tools and issues in annotation procedures, good practices; examples from anaphora resolution and named entity recognition annotation campaigns; evaluation of corpus annotation



Corpus Annotation for corpus linguistics, Jorge Baptista©2009

Plan


corpus linguistics
corpus annotation
before you get to work with your corpus
once you have got your eText
  character set
  document structure
  DTDs
evaluation of annotated corpus
  gold standard
  evaluation methods
annotating a corpus for Anaphora Resolution
annotating a corpus for Named Entity Recognition
annotation tools
references

http://www.visualthesaurus.com/


Corpus linguistics

corpus (a definition):
  a large body of linguistic evidence, typically composed of attested language use
  machine-readable form
  well-organized collection of data
  collected within a sampling frame, designed for exploration of linguistic features
  balance, representativeness
  multifunctional resource, serving many different disciplines

McEnery 2003, in Mitkov (ed) 2003: 448 ff.


corpus annotation

corpus ‘enhanced with linguistic information’
analysts (humans and/or computers)
linguistic analysis is imposed upon the corpus (making the implicit linguistic information explicit)
encoded by reference to a specified range of features
advantages of corpus annotation: ease of corpus exploitation, reusability, multifunctionality, explicit analysis


corpus annotation (continued)

markup
  metadata: corpus information (document id, speaker id, sex, age, date, revision number and history, etc.)
  information pertaining to the text as such: paragraphs, formatting (italics, bold)
annotation
  linguistic information superimposed on the text
  POS tags, NE tags, discourse-structure tags, referential information, syntactic tags, semantic tags (for WSD), etc.


corpus annotation (continued)

annotation process
  automatic (lemmatization, PoS tagging: 3% error rate)
  semi-automatic (treebank)
  manual (reference chains for anaphora resolution)


Before you get to work with your corpus*

Corpus-based approach to (computational) linguistics
  quality of corpora > RESULTS
  methodology and procedures for corpus collection, preparation and distribution
General remarks:
  true problems and difficulties lie in the details
  text (whatever its support) and eText (in any digital medium)

* Thompson 2000 in Dale et al. 2000: 385 ff.


Once you got your eText …

Preparation
in an ideal scenario
  UNICODE (ISO 10646) encoding
  SGML (ISO 8879) mark-up
in a real-world scenario
  raw text, different text-file types
  different sources and poor metadata
  different encodings
  no markup at all, or mixed and inconsistent markup


character set and encoding

characters: abstract objects (as opposed to glyphs, their visual shapes)
character set: a mapping from a set of integers (code points) to a set of characters
encoding: a mapping from a computer-representable byte or word stream to a sequence of code points
examples: ASCII, UNICODE, JIS, ISO-Latin-1 (ISO 8859-1), UTF-8
issues: choosing, recoding, word boundaries
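The difference between code points and encodings, and the recoding step, can be sketched as follows. Python is an assumption here (the slides name no programming language), and the sample word is only an illustration.

```python
# A character is identified by an integer code point;
# an encoding maps a sequence of code points to a byte stream.
text = "Aut\u00f2noma"  # "Autònoma", with U+00F2 written as an escape

# Code points (integers), one per character.
code_points = [ord(ch) for ch in text]

# The same code points give different byte streams under different encodings.
latin1_bytes = text.encode("iso-8859-1")  # one byte per character for this word
utf8_bytes = text.encode("utf-8")         # U+00F2 takes two bytes in UTF-8

# Recoding: decode with the source encoding, then encode with the target one.
recoded = latin1_bytes.decode("iso-8859-1").encode("utf-8")

# Decoding with the wrong encoding raises no error but garbles the text.
mojibake = utf8_bytes.decode("iso-8859-1")

print(code_points)
print(len(latin1_bytes), len(utf8_bytes))
print(recoded == utf8_bytes)
print(mojibake)
```

The last line shows why the source encoding must be known before recoding: a wrong guess fails silently.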


document structure

any eText already has some structure
  words, sentences, paragraphs, quotations, headings, …
  font size and face changes
what to annotate explicitly?
  sentence boundaries
  (never replace orthographic symbols, but always add sentence boundaries)
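A minimal sketch of the rule "add boundaries, never replace orthographic symbols", assuming Python and a deliberately naive splitting rule (sentence-final punctuation followed by whitespace). The function name mark_sentences is hypothetical, and real corpora would need abbreviation handling that this sketch ignores.

```python
import re
from xml.sax.saxutils import escape

# Naive assumption: a sentence ends at ., ! or ? followed by whitespace.
SENTENCE_END = re.compile(r"(?<=[.!?])\s+")

def mark_sentences(paragraph: str) -> str:
    """Wrap each sentence in <s>...</s>, keeping the original punctuation."""
    sentences = [s for s in SENTENCE_END.split(paragraph.strip()) if s]
    # escape() protects &, < and > so the added markup stays unambiguous.
    return " ".join(f"<s>{escape(s)}</s>" for s in sentences)

marked = mark_sentences("John arrived. He looked tired.")
print(marked)
```

Stripping the added tags gives back the original text, full stops included, which is the point of the rule.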


document structure (continued)

How is explicit structural information recorded?

keep in mind: the most user-friendly and reusable way
1. design your own idiosyncratic annotation syntax
2. use a database
3. use a standard markup language: SGML, XML
   a. public DTD (document-type definition): TEI, CES
   b. design your very own DTD


document structure (continued)

SGML (Standard Generalized Markup Language), ISO 8879
XML (eXtensible Markup Language)
  simplified version of SGML, originally targeted at providing flexible document markup for the WWW
low-level grammar of annotation (how markup is to be distinguished from text)
definition of the structure of families of related documents, or document types


Text Encoding Initiative (TEI)

sponsored by ACL, ALLC and ACH
guidelines to facilitate data exchange
standardizing mark-up or encoding of information stored in electronic form
each text (document):
  header <teiHeader>
  body
  each one may have several elements


TEI

Header <teiHeader>
  file description <fileDesc>: full bibliographic description of an electronic file
  encoding description <encodingDesc>: relates the eText to its source(s)
  text profile <profileDesc>: non-bibliographic description (languages, sublanguages, situation of production, participants and settings)
  revision history <revisionDesc>: records changes made to the file


TEI

Body of document
  elements: <p>, <s>, <w>, <c>
  <w POS=AT0>the</w>
  simplified: <w AT0>the
TEI scheme may be expressed in different formal languages:
  SGML (system independent)
  XML (simplified SGML, for the web)
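Word-level markup of this kind can be produced programmatically. The sketch below assumes Python's standard xml.etree.ElementTree module. The token list is hypothetical: AT0 is the tag shown on the slide, while NN1 is assumed here purely for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical (token, tag) pairs, e.g. the output of a PoS tagger.
tagged = [("the", "AT0"), ("corpus", "NN1")]

# Build one <s> element containing one <w> element per token.
s = ET.Element("s")
for token, pos in tagged:
    w = ET.SubElement(s, "w", {"POS": pos})
    w.text = token

# Serialize as XML; attribute values are quoted, as XML requires.
xml_string = ET.tostring(s, encoding="unicode")
print(xml_string)

# Reading the annotation back is symmetric.
tags = [w.get("POS") for w in ET.fromstring(xml_string)]
print(tags)
```

Unlike the SGML forms on the slide, the XML output always quotes attribute values and closes every element.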


Corpus Encoding Standard (CES)
  specifically designed for encoding language corpora
  EAGLES (Expert Advisory Group on Language Engineering Standards)
  TEI-compliant application of SGML
  available both in SGML and XML (XCES)


DTDs (document-type definitions)
  context-free grammars of allowed tag structures
  allowed attributes for each tag
up-translation
  consistency
  preexisting markup > replace > XML
  sed, awk, perl scripting
  record every step! (backtracking changes)
  manual post-processing > context-sensitive patches
  diff
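An up-translation step with a recorded diff might look like the sketch below. It uses Python's re and difflib in place of the sed/awk/perl and diff tools named on the slide. The legacy conventions (blank-line paragraphs, *emphasis*), the target tags and the file names are all assumptions made for illustration.

```python
import difflib
import re

# Hypothetical legacy markup: blank-line-separated paragraphs and *emphasis*.
legacy = "First *important* paragraph.\n\nSecond paragraph."

def up_translate(text: str) -> str:
    """Replace the assumed legacy conventions with XML-style tags.

    Escaping of & and < in the running text is left out of this sketch.
    """
    text = re.sub(r"\*(.+?)\*", r"<hi>\1</hi>", text)
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    return "\n".join(f"<p>{p}</p>" for p in paragraphs)

converted = up_translate(legacy)

# Record the step: a unified diff documents exactly what changed,
# so the change can be reviewed or backtracked later.
log = list(difflib.unified_diff(
    legacy.splitlines(), converted.splitlines(),
    fromfile="legacy.txt", tofile="converted.xml", lineterm=""))

print(converted)
print("\n".join(log))
```

Keeping one such diff per transformation step gives the trail that manual, context-sensitive patches can later be checked against.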


DTDs

<?xml version="1.0" encoding="ISO-8859-1"?>
<!DOCTYPE colHAREM [
<!ELEMENT colHAREM (DOC)*>
<!ATTLIST colHAREM versao CDATA #REQUIRED>
<!ELEMENT DOC (#PCDATA|ALT|EM|OMITIDO|P)*>
<!ATTLIST DOC DOCID CDATA #REQUIRED>
<!ELEMENT P (#PCDATA|ALT|EM|OMITIDO)*>
<!ELEMENT ALT (#PCDATA|EM|OMITIDO)*>
<!ELEMENT EM (#PCDATA)>
<!ATTLIST EM
  ID CDATA #REQUIRED
  CATEG CDATA #IMPLIED
  TIPO CDATA #IMPLIED
  SUBTIPO CDATA #IMPLIED
  COMENT CDATA #IMPLIED
  TIPOREL CDATA #IMPLIED
  COREL CDATA #IMPLIED
  TEMPO_REF (ENUNCIACAO|TEXTUAL) #IMPLIED
  SENTIDO (ANTERIOR|POSTERIOR|SIMULT|ANTERIOR_OU_SIMUL|POSTERIOR_OU_SIMULT) #IMPLIED
  VAL_DELTA CDATA #IMPLIED
  VAL_NORM CDATA #IMPLIED>
<!ELEMENT OMITIDO (#PCDATA|EM)*>
]>
<colHAREM versao="ColeccaoSegundoHAREM-2.0">
<DOC DOCID="cha-73943">
<P>Dividir o IRA, eis a estratégia</P>
<P>Hugo Estenssoro, em Londres</P>
<P>O IRA esteve esta semana na ofensiva, paralisando o aeroporto de Londres e causando prejuízos à temporada turística britânica, com presença obrigatória nas grandes manchetes. As bombas não explodiram, mas o IRA matou um polícia no Ulster em frente à esposa grávida. Foi uma violência anunciada: o líder do Sinn Fein -- o braço político do IRA -- falara poucos dias antes num «`show' espectacular» como resposta à iniciativa anglo-irlandesa lançada pelos primeiros-ministros da Grã-Bretanha e da República da Irlanda com a sua «declaração» de 15 de Dezembro do ano passado. Mas a campanha terrorista foi só parte da resposta.</P>
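A document shaped by this DTD can be read with any XML parser. The sketch below assumes Python's standard xml.etree.ElementTree, which parses but does not validate against a DTD. The fragment is hypothetical: the EM elements, their ID values and the CATEG values are invented for illustration and are not taken from the collection.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment following the element structure of the DTD above.
sample = """<colHAREM versao="example">
<DOC DOCID="doc-1">
<P>Dividir o <EM ID="e1" CATEG="ORGANIZACAO">IRA</EM>, eis a estratégia</P>
<P>Hugo Estenssoro, em <EM ID="e2" CATEG="LOCAL">Londres</EM></P>
</DOC>
</colHAREM>"""

root = ET.fromstring(sample)

# The DTD declares ID as #REQUIRED on EM; since this parser does not
# validate, the constraint is checked by hand.
missing = [em for em in root.iter("EM") if "ID" not in em.attrib]

# Collect (ID, category, text) for every entity mention.
entities = [(em.get("ID"), em.get("CATEG"), em.text) for em in root.iter("EM")]
print(entities)
print("EM elements without ID:", len(missing))
```

Full checking of content models and enumerated attribute values would need a validating parser rather than this manual test.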


Evaluation of annotated corpus
  machine-learning techniques
  evaluation of NLP systems
  analysis systems (linguistic input → abstract representation or classification)
  gold standard (‘correct’ output)
  analysis components: segmentation, tagging, information extraction and information retrieval

Hirschman and Mani (2003) in Mitkov (ed.) 2003 : 414 ff.
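Comparing system output with a gold standard can be sketched as below. Precision, recall and F1 are one common choice of measure; the slide does not name a measure, so this choice is an assumption. The annotations, given as (start, end, category) triples, are hypothetical.

```python
def precision_recall_f1(gold: set, system: set) -> tuple:
    """Score system output against the gold standard (exact matches only)."""
    true_positives = len(gold & system)
    precision = true_positives / len(system) if system else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Hypothetical entity annotations as (start, end, category) triples.
gold = {(0, 4, "PERSON"), (10, 16, "LOCATION"), (20, 23, "ORGANIZATION")}
system = {(0, 4, "PERSON"), (10, 16, "ORGANIZATION")}

p, r, f = precision_recall_f1(gold, system)
print(f"P={p:.2f} R={r:.2f} F1={f:.2f}")
```

Exact matching is the strictest option: the second system entity has the right span but the wrong category, so it counts as an error on both sides.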


gold-standard-based measures
gold-standard evaluation methods:
  definition of the evaluation task and an associated ‘gold-standard’ format
  annotation guidelines
  annotation and scoring tools
  validation (inter-annotator agreement)
  annotated training and test corpora
  release (data + tools), evaluation
  interpretation (baseline and ceiling)
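Inter-annotator agreement is often reported with a chance-corrected coefficient. The sketch below computes Cohen's kappa for two annotators. The choice of kappa is an assumption (the slide names no coefficient), and the labels are hypothetical.

```python
from collections import Counter

def cohen_kappa(labels_a: list, labels_b: list) -> float:
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: proportion of items given the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement by chance, from each annotator's label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one and the same label throughout
    return (observed - expected) / (1 - expected)

# Hypothetical labels from two annotators for the same six tokens.
annotator_1 = ["PER", "LOC", "O", "O", "ORG", "O"]
annotator_2 = ["PER", "ORG", "O", "O", "ORG", "O"]
kappa = cohen_kappa(annotator_1, annotator_2)
print(round(kappa, 3))
```

Human agreement of this kind is what bounds the ceiling when system scores are interpreted.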


Annotating a corpus for Anaphora Resolution

John arrived. He looked tired.

antecedent (John), anaphor (He), anaphora (the relation linking them)


AR (continued)

John arrived. He looked tired.

<NE ID=267 TYPE="person">John</NE> arrived.

<REF TYPE=pro COREF=267>He</REF> looked tired.
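Annotation of this form can be read back to recover the antecedent of each anaphor. The sketch below is written in Python and uses regular expressions, because the bare attribute values (ID=267) are SGML-style and would be rejected by an XML parser. The helper name parse_attrs is hypothetical.

```python
import re

annotated = ('<NE ID=267 TYPE="person">John</NE> arrived. '
             '<REF TYPE=pro COREF=267>He</REF> looked tired.')

# Attribute values may be quoted or bare, as on the slide.
ATTR = re.compile(r'(\w+)=(?:"([^"]*)"|(\S+))')
TAG = re.compile(r"<(NE|REF)\s+([^>]*)>(.*?)</\1>")

def parse_attrs(raw: str) -> dict:
    """Turn 'ID=267 TYPE="person"' into {'ID': '267', 'TYPE': 'person'}."""
    return {name: quoted or bare for name, quoted, bare in ATTR.findall(raw)}

antecedents, links = {}, []
for tag, raw_attrs, text in TAG.findall(annotated):
    attrs = parse_attrs(raw_attrs)
    if tag == "NE":
        antecedents[attrs["ID"]] = text        # candidate antecedent by ID
    else:
        links.append((text, attrs["COREF"]))   # anaphor and the ID it points to

# Follow each COREF pointer to the text of its antecedent.
resolved = [(anaphor, antecedents.get(ref)) for anaphor, ref in links]
print(resolved)
```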


AR (an exercise)
  identification of all the markables (NPs) in a text, regardless of whether they were coreferential or not
  coref and ucoref (out of ARE)
  relations marked between entities: IDENTITY, SYNONYMY, GENERALISATION and SPECIALISATION
  indirect anaphora relation was not annotated: (the house ... the door)

Hasler et al. (2006); Orasan et al. (2009)


task#1 Pronominal AR on pre-annotated texts

evaluation of pronoun algorithms

NPs annotated (known candidates)

only pronominal NPs were marked as referential (to be resolved)

no influence from wrongly identified candidates


task#2 Coreferential chains on pre-annotated texts

cluster coreferential NPs together in coreferential chains

all referential NPs were marked (to be resolved), not only pronouns

NPs outside coreferential chains were not annotated

no influence from wrongly identified candidates
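Clustering coreferential NPs into chains can be sketched as grouping mentions that are connected by pairwise links. The Python sketch below uses a union-find structure for this. It illustrates the data structure only, not the method used in the exercise, and the mention ids and links are hypothetical.

```python
from collections import defaultdict

def build_chains(mentions: list, links: list) -> list:
    """Group mentions into chains from pairwise coreference links."""
    parent = {m: m for m in mentions}

    def find(m):
        # Follow parent pointers to the representative of m's group.
        while parent[m] != m:
            parent[m] = parent[parent[m]]  # path halving
            m = parent[m]
        return m

    for a, b in links:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    groups = defaultdict(list)
    for m in mentions:  # iterate in text order
        groups[find(m)].append(m)
    # Singletons are not chains: NPs outside chains stay unannotated.
    return [chain for chain in groups.values() if len(chain) > 1]

# Hypothetical mention ids in text order, with pairwise links.
mentions = ["John#1", "the house#2", "He#3", "the man#4", "it#5"]
links = [("John#1", "He#3"), ("He#3", "the man#4"), ("the house#2", "it#5")]
chains = build_chains(mentions, links)
print(chains)
```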


an example: NER

www.linguateca.pt/avaliacaoconjunta


annotation tools

PALinkA: Perspicuous and Adjustable Links Annotator
  http://clg.wlv.ac.uk/projects/PALinkA/index.php

Alembic workbench: a natural language engineering environment for the development of tagged corpora
  http://www.mitre.org/tech/alembic-workbench/

ATLAS: Architecture and Tools for Linguistic Analysis Systems
  http://www.nist.gov/speech/atlas/

CLaRK system: an XML Based System For Corpora Development
  http://www.bultreebank.org/clark/index.html

GATE: an architecture, framework and development environment for language engineering, which can also be used to annotate texts
  http://www.gate.ac.uk/

MMAX: a tool for multi-modal annotation in XML, but the new version is no longer free
  http://mmax.eml-research.de/


References

Dale, Robert; Moisl, Hermann; Somers, Harold (eds.). 2000. Handbook of Natural Language Processing. New York/Basel: Marcel Dekker, Inc.

Hasler, Laura K.; Naumann, K.; Orasan, C. 2006. Guidelines for Annotation of Within-document NP Coreference. http://clg.wlv.ac.uk/projects/NP4E/NP_guidelines_2006.pdf.

Hajičová, E.; Abeillé, A.; Hajič, J.; Mirovský, J. 2010. Treebank annotation. In Indurkhya and Damerau (2010), pp. 167-188.

Hirschman, Lynette; Mani, Inderjeet. 2003. Evaluation. In Mitkov, Ruslan (ed.) 2003, pp. 414-429.

Indurkhya, Nitin; Damerau, Fred (eds.). 2010. Handbook of Natural Language Processing (2nd ed.). Chapman & Hall/CRC.

McEnery, Tony. 2003. Corpus Linguistics. In Mitkov, Ruslan (ed.) 2003, pp. 448-463.

McEnery, Tony; Xiao, Richard; Tono, Yukio. 2006. Corpus-Based Language Studies. An advanced resource book. Routledge.

Mitkov, Ruslan (ed.). 2003. Oxford Handbook of Computational Linguistics. Oxford: Oxford University Press.

Mitkov, Ruslan; Orasan, Constantin; Evans, Richard. 1999. The importance of annotated corpora for NLP: the cases of anaphora resolution and clause splitting. TALN ’99. http://clg.wlv.ac.uk/papers/mitkov-99b.pdf.

Orăsan, Constantin; Cristea, Dan; Mitkov, Ruslan; Branco, António. 2008. Anaphora Resolution Exercise: An overview. Proceedings of the 6th Language Resources and Evaluation Conference (LREC’2008), Marrakesh, Morocco, 28-30 May. http://clg.wlv.ac.uk/papers/713_paper.pdf.

Renouf, Antoinette; Kehoe, Andrew (eds.). 2009. Corpus Linguistics: Refinements and Reassessments. Amsterdam/New York: Rodopi.

Thompson, Henry S. 2000. Corpus Creation for Data-Intensive Linguistics. In Dale et al. (eds.) 2000, pp. 385-401.

Xiao, Richard. 2010. Corpus Creation. In Indurkhya and Damerau (2010), pp. 147-166.

Resources

http://www.ldc.upenn.edu/annotation/
http://www.routledge.com/textbooks/0415286239