Corpus annotation for corpus linguistics (nov2009)

Jorge Baptista, Universidade do Algarve

L2F Spoken Language Laboratory, INESC ID Lisboa

Erasmus Mundus Master in Natural Language Processing and Human Language Technology

Universitat Autònoma de Barcelona, Campus de Bellaterra, November 10 and 12, 2009

Corpus Annotation for corpus linguistics, Jorge Baptista©2009

Plan

corpus linguistics
corpus annotation
before you get to work with your corpus
once you have your eText
  character set
  document structure
  DTDs
Evaluation of annotated corpus
  gold-standard evaluation methods
Annotating a corpus for Anaphora Resolution
Annotating a corpus for Named Entities Recognition
Annotation tools
References




Corpus linguistics

corpus (a definition):
  a large body of linguistic evidence, typically composed of attested language use
  in machine-readable form
  a well-organized collection of data
  collected within a sampling frame, designed for the exploration of linguistic features
  balance, representativeness
  a multifunctional resource that serves many different disciplines

McEnery 2003, in Mitkov (ed) 2003: 448 ff.



corpus annotation

a corpus ‘enhanced with linguistic information’
  analysts: humans and/or computers
  linguistic analysis is imposed upon the corpus (making the implicit linguistic information explicit)
  encoded by reference to a specified range of features

advantages of corpus annotation: ease of corpus exploitation, reusability, multifunctionality, explicit analysis



corpus annotation (continued)

markup
  metadata: corpus information (doc id, speaker id, sex, age, date, review number and history, etc.)
  information pertaining to the text as such: paragraphs, formatting (italics, bold)

annotation
  linguistic information superimposed on the text
  POS, NE tags, discourse-structure tags, referential information, syntactic tags, semantic tags (for WSD), etc.



corpus annotation (continued)

annotation process
  automatic (lemmatization, PoS tagging: 3% error rate)
  semi-automatic (treebank)
  manual (reference chains for anaphora resolution)



Before you get to work with your corpus*

Corpus-based approach to (computational) linguistics

Quality of corpora > RESULTS

Methodology and procedures for corpus collection, preparation and distribution

General remarks: true problems and difficulties lie in the details

text (whatever its support) and eText (in any digital medium)

* Thompson 2000 in Dale et al. 2000: 385 ff.



Once you have your eText …

Preparation
  in an ideal scenario
    UNICODE (ISO 10646) encoding
    SGML (ISO 8879) mark-up
  in a real-world scenario
    raw text, different text-file types
    different sources and poor metadata
    different encodings
    no markup at all, or mixed and inconsistent markup



character set and encoding

characters: abstract objects, as distinct from glyphs (their visual shapes)
character set: a mapping from a set of integers (code points) to a set of characters
encoding: a mapping from a computer-representable byte or word stream to a sequence of code points
  ASCII, UNICODE, JIS, ISO-Latin-1 (ISO 8859-1), UTF-8
issues: choosing an encoding, recoding, word boundaries
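The difference between code points and encodings, and what recoding involves, can be illustrated with a minimal Python sketch. The sample word and the helper name `recode` are illustrative assumptions, not part of the original slides.

```python
# Characters are abstract; a character set gives each one an integer code point.
text = "estratégia"
code_points = [ord(ch) for ch in text]

# An encoding maps between code points and a byte stream.
utf8_bytes = text.encode("utf-8")         # 'é' takes two bytes
latin1_bytes = text.encode("iso-8859-1")  # 'é' takes one byte


def recode(data: bytes, source: str, target: str) -> bytes:
    """Recode a byte stream by decoding to code points and encoding again."""
    return data.decode(source).encode(target)


# Decoding with the wrong encoding silently garbles the text.
garbled = utf8_bytes.decode("iso-8859-1")

print(code_points)
print(len(utf8_bytes), len(latin1_bytes))
print(garbled)
```

Recoding only works when the source encoding is known, which is why poor metadata about encodings is a real-world problem.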



document structure

any eText already has some structure
  words, sentences, paragraphs, quotations, headings, …
  font size and face changes

what to annotate explicitly?
  sentence boundaries (never replace orthographic symbols, but always add sentence boundaries)
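A minimal Python sketch of adding sentence boundaries without touching the orthography. It assumes, naively, that a sentence ends at ".", "!" or "?" followed by whitespace (abbreviations would break this rule), and the function name `mark_sentences` is illustrative.

```python
import re
from xml.sax.saxutils import escape

# Naive assumption: a sentence ends at '.', '!' or '?' followed by whitespace.
SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def mark_sentences(text: str) -> str:
    """Wrap each sentence in <s>...</s>, keeping the punctuation in place."""
    sentences = [s for s in SENTENCE_END.split(text.strip()) if s]
    return " ".join(f"<s>{escape(s)}</s>" for s in sentences)


original = "John arrived. He looked tired."
marked = mark_sentences(original)
print(marked)

# Removing the tags gives back the original text: nothing was replaced.
recovered = re.sub(r"</?s>", "", marked)
```

Because the full stops are kept, the markup can be stripped again without losing information.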



document structure (continued)

How is explicit structural information recorded?

keep in mind: the most user-friendly and reusable way
1. design your own idiosyncratic annotation syntax
2. use a database
3. use a standard markup language: SGML, XML
   a. public DTD (document-type definition): TEI, CES
   b. design your very own DTD



document structure (continued)

SGML (Standard Generalized Markup Language), ISO 8879
XML (eXtensible Markup Language)
  simplified version of SGML, originally targeted at providing flexible document markup for the WWW

low-level grammar of annotation (how is markup to be distinguished from text)
definition of the structure of families of related documents or document types


Text Encoding Initiative (TEI)

sponsored by ACL, ALLC and ACH
guidelines to facilitate data exchange
standardizing mark-up or encoding of information stored in electronic form
each text (document):
  header <teiHeader>
  body
  each one may have several elements



TEI

Header <teiHeader>
  file description <fileDesc>: full bibliographic description of an electronic file
  encoding description <encodingDesc>: relates eText to its source(s)
  text profile <profileDesc>: non-bibliographic description: languages, sublanguages, situation of production, participants and settings
  revision history <revisionDesc>: records changes made to file
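A minimal Python sketch of the four header parts listed above, built with the standard-library ElementTree. The function name, the sample values and the choice of child elements are illustrative assumptions. The result is deliberately incomplete and has not been checked against the TEI schema.

```python
import xml.etree.ElementTree as ET


def build_tei_header(title, encoding_note, language, change_note):
    """Return a skeletal <teiHeader> with its four main parts (not validated)."""
    header = ET.Element("teiHeader")

    # <fileDesc>: bibliographic description of the electronic file
    file_desc = ET.SubElement(header, "fileDesc")
    title_stmt = ET.SubElement(file_desc, "titleStmt")
    ET.SubElement(title_stmt, "title").text = title

    # <encodingDesc>: how the eText relates to its source(s)
    encoding_desc = ET.SubElement(header, "encodingDesc")
    ET.SubElement(encoding_desc, "p").text = encoding_note

    # <profileDesc>: non-bibliographic description, here only the language
    profile_desc = ET.SubElement(header, "profileDesc")
    lang_usage = ET.SubElement(profile_desc, "langUsage")
    ET.SubElement(lang_usage, "language", ident=language)

    # <revisionDesc>: changes made to the file
    revision_desc = ET.SubElement(header, "revisionDesc")
    ET.SubElement(revision_desc, "change").text = change_note
    return header


header = build_tei_header(
    title="Sample newspaper text",           # made-up values
    encoding_note="Transcribed from print.",
    language="pt",
    change_note="First electronic version.",
)
print(ET.tostring(header, encoding="unicode"))
```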



TEI

Body of document
  <s>, <w>, <c>
  <w POS=AT0>the</w>
  simplified: <w AT0>the

TEI scheme may be expressed in different formal languages: SGML, XML (system independent)
  XML (simplified SGML, for the web)



Corpus Encoding Standard (CES)

specifically designed for encoding language corpora
EAGLES (Expert Advisory Group on Language Engineering Standards)
TEI-compliant application of SGML
available both in SGML and XML (XCES)


DTDs (document-type definitions)
  context-free grammars of allowed tag structures
  allowed attributes for each tag

up-translation
  consistency
  preexisting markup >replace> XML
  sed, awk, Perl scripting
  record every step! (backtracking changes)
  manual post-processing > context-sensitive patches
  diff
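The up-translation steps above can be sketched in Python instead of sed, awk or Perl. The legacy conventions (`_word_` for italics, `*word*` for bold), the target `<hi rend="...">` tags and the function name `up_translate` are assumptions made for this sketch only.

```python
import difflib
import re
from xml.sax.saxutils import escape

# Hypothetical legacy conventions, assumed for this sketch only:
# _word_ marks italics and *word* marks bold.
RULES = [
    ("italics", re.compile(r"_([^_]+)_"), r'<hi rend="italic">\1</hi>'),
    ("bold", re.compile(r"\*([^*]+)\*"), r'<hi rend="bold">\1</hi>'),
]


def up_translate(text):
    """Replace legacy markup with XML-style tags, recording every step."""
    converted = escape(text)  # protect &, < and > before adding tags
    log = []
    for name, pattern, replacement in RULES:
        converted, count = pattern.subn(replacement, converted)
        log.append((name, count))  # which rule ran, how many replacements
    return converted, log


legacy = "The _IRA_ was *on the offensive* this week."
converted, log = up_translate(legacy)

# A diff of before and after makes each change easy to review or undo.
diff = list(difflib.unified_diff([legacy], [converted], "before", "after", lineterm=""))
print(log)
print("\n".join(diff))
```

Cases the rules get wrong would then be handled by manual, context-sensitive patches, with the diff as a record of what changed.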



DTDs

<?xml version="1.0" encoding="ISO-8859-1"?>
<!DOCTYPE colHAREM [
<!ELEMENT colHAREM (DOC)*>
<!ATTLIST colHAREM
  versao CDATA #REQUIRED>
<!ELEMENT DOC (#PCDATA|ALT|EM|OMITIDO|P)*>
<!ATTLIST DOC
  DOCID CDATA #REQUIRED>
<!ELEMENT P (#PCDATA|ALT|EM|OMITIDO)*>
<!ELEMENT ALT (#PCDATA|EM|OMITIDO)*>
<!ELEMENT EM (#PCDATA)>
<!ATTLIST EM
  ID CDATA #REQUIRED
  CATEG CDATA #IMPLIED
  TIPO CDATA #IMPLIED
  SUBTIPO CDATA #IMPLIED
  COMENT CDATA #IMPLIED
  TIPOREL CDATA #IMPLIED
  COREL CDATA #IMPLIED
  TEMPO_REF (ENUNCIACAO|TEXTUAL) #IMPLIED
  SENTIDO (ANTERIOR|POSTERIOR|SIMULT|ANTERIOR_OU_SIMUL|POSTERIOR_OU_SIMULT) #IMPLIED
  VAL_DELTA CDATA #IMPLIED
  VAL_NORM CDATA #IMPLIED>
<!ELEMENT OMITIDO (#PCDATA|EM)*>
]>
<colHAREM versao="ColeccaoSegundoHAREM-2.0">
<DOC DOCID="cha-73943">
Dividir o IRA, eis a estratégia
Hugo Estenssoro, em Londres
O IRA esteve esta semana na ofensiva, paralisando o aeroporto de Londres e causando prejuízos à temporada turística britânica, com presença obrigatória nas grandes manchetes. As bombas não explodiram, mas o IRA matou um polícia no Ulster em frente à esposa grávida. Foi uma violência anunciada: o líder do Sinn Fein -- o braço político do IRA -- falara poucos dias antes num «`show' espectacular» como resposta à iniciativa anglo-irlandesa lançada pelos primeiros-ministros da Grã-Bretanha e da República da Irlanda com a sua «declaração» de 15 de Dezembro do ano passado. Mas a campanha terrorista foi só parte da resposta.
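What the DTD above constrains (allowed children and required attributes) can be approximated in a short Python sketch. The standard-library ElementTree parser does not validate against a DTD, so the content model below is transcribed by hand from the declarations above. The function name `check`, the sample documents and their attribute values are made up for illustration. A validating parser would be the proper tool for real work.

```python
import xml.etree.ElementTree as ET

# Content model transcribed by hand from the colHAREM DTD (text content ignored).
ALLOWED_CHILDREN = {
    "colHAREM": {"DOC"},
    "DOC": {"ALT", "EM", "OMITIDO", "P"},
    "P": {"ALT", "EM", "OMITIDO"},
    "ALT": {"EM", "OMITIDO"},
    "EM": set(),
    "OMITIDO": {"EM"},
}
REQUIRED_ATTRIBUTES = {"colHAREM": {"versao"}, "DOC": {"DOCID"}, "EM": {"ID"}}


def check(root):
    """Return a list of violations of the hand-transcribed content model."""
    problems = []
    for element in root.iter():
        if element.tag not in ALLOWED_CHILDREN:
            problems.append(f"unknown element <{element.tag}>")
            continue
        missing = REQUIRED_ATTRIBUTES.get(element.tag, set()) - set(element.attrib)
        for name in sorted(missing):
            problems.append(f"<{element.tag}> lacks required attribute {name}")
        for child in element:
            if child.tag not in ALLOWED_CHILDREN[element.tag]:
                problems.append(f"<{child.tag}> not allowed inside <{element.tag}>")
    return problems


# Made-up sample documents; the attribute values are illustrative only.
good = ET.fromstring(
    '<colHAREM versao="x"><DOC DOCID="d1"><P>O <EM ID="e1">IRA</EM> '
    "esteve na ofensiva.</P></DOC></colHAREM>"
)
bad = ET.fromstring('<colHAREM versao="x"><DOC><EM>IRA<P/></EM></DOC></colHAREM>')

print(check(good))
print(check(bad))
```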



Evaluation of annotated corpus

machine-learning techniques
evaluation of NLP systems
  analysis systems (linguistic input → abstract representation or classification)
  gold standard (‘correct’ output)
  analysis components: segmentation, tagging, information extraction and information retrieval

Hirschman and Mani (2003) in Mitkov (ed.) 2003 : 414 ff.
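Scoring system output against a gold standard is commonly done with precision, recall and F-measure. The slide does not name a specific measure, so this choice is an assumption. A minimal Python sketch, with annotations represented as made-up (start, end, label) triples:

```python
def precision_recall_f1(gold, predicted):
    """Compare two sets of annotations and return (precision, recall, F1)."""
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Made-up annotations: (start offset, end offset, label).
gold = {(0, 4, "person"), (20, 26, "place"), (40, 43, "organization")}
system = {(0, 4, "person"), (20, 26, "organization")}

precision, recall, f1 = precision_recall_f1(gold, system)
print(precision, recall, f1)
```

Here an annotation counts as correct only if both span and label match exactly. Evaluation campaigns often define more lenient matching, which would change the scores.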



gold-standard-based measures

gold-standard evaluation methods:
  definition of the evaluation task and an associated ‘gold-standard’ format
  annotation guidelines
  annotation and scoring tools
  validation (inter-annotator agreement)
  annotated training and test corpora
  release (data + tools), evaluation
  interpretation (baseline and ceiling)
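Inter-annotator agreement is often reported with Cohen's kappa, which corrects raw agreement for chance. The slide does not say which coefficient was meant, so kappa is an assumption here. A minimal Python sketch for two annotators, with made-up labels:

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement: probability that both pick the same label independently.
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Made-up labels for eight items.
annotator_a = ["PER", "PER", "LOC", "ORG", "LOC", "PER", "ORG", "LOC"]
annotator_b = ["PER", "LOC", "LOC", "ORG", "LOC", "PER", "PER", "LOC"]

kappa = cohen_kappa(annotator_a, annotator_b)
print(round(kappa, 3))
```

In this example the annotators agree on 6 of 8 items (0.75), but kappa is lower because part of that agreement is expected by chance.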



Annotating a corpus for Anaphora Resolution

John arrived. He looked tired.

antecedent (John), anaphor (He), anaphora (the relation linking the two)



AR (continued)

John arrived. He looked tired.

<NE ID="267" TYPE="person">John</NE> arrived.

<REF TYPE="pro" COREF="267">He</REF> looked tired.
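A minimal Python sketch of how such an annotation can be read back: each REF is linked to the NE whose ID matches its COREF value. The wrapping `<text>` element and the quoted attribute values are assumptions added so that the fragment parses as XML.

```python
import xml.etree.ElementTree as ET

# The <text> wrapper is an assumption: XML needs a single root element.
annotated = (
    '<text><NE ID="267" TYPE="person">John</NE> arrived. '
    '<REF TYPE="pro" COREF="267">He</REF> looked tired.</text>'
)
root = ET.fromstring(annotated)

# Index the entities by ID, then follow each COREF pointer.
entities = {ne.get("ID"): ne.text for ne in root.iter("NE")}
links = [(ref.text, entities.get(ref.get("COREF"))) for ref in root.iter("REF")]

# The original text is still recoverable by dropping the tags.
raw_text = "".join(root.itertext())

print(links)
print(raw_text)
```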



AR (an exercise)

identification of all the markables (NPs) in a text, regardless of whether they are coreferential or not
coref and ucoref (out of ARE)
relations marked between entities: IDENTITY, SYNONYMY, GENERALISATION and SPECIALISATION
the indirect anaphora relation was not annotated (the house ... the door)

Hasler et al. (2006); Orasan et al. (2009)



task#1 Pronominal AR on pre-annotated texts

evaluation of pronoun algorithms

NPs annotated (known candidates)

only PRO NP were marked referential (to be resolved)

no influence from wrongly identified candidates



task#2 Coreferential chains on pre-annotated texts

cluster coreferential NPs together in coreferential chains

all referential NP were marked (to be resolved), not only PRO

NPs outside coreferential chains were not annotated

no influence from wrongly identified candidates
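Clustering coreferential NPs into chains can be sketched in Python with a union-find structure over pairwise links. The input format (anaphor, antecedent pairs with made-up markable ids) and the function name `build_chains` are assumptions for illustration. This is not the scoring procedure of the exercise itself.

```python
from collections import defaultdict


def build_chains(links):
    """Group markables connected by coreference links into chains."""
    parent = {}

    def find(markable):
        parent.setdefault(markable, markable)
        while parent[markable] != markable:
            parent[markable] = parent[parent[markable]]  # path compression
            markable = parent[markable]
        return markable

    for anaphor, antecedent in links:
        root_a, root_b = find(anaphor), find(antecedent)
        if root_a != root_b:
            parent[root_a] = root_b  # merge the two chains

    chains = defaultdict(list)
    for markable in list(parent):
        chains[find(markable)].append(markable)
    return sorted(sorted(chain) for chain in chains.values())


# Made-up markable ids: m2 and m3 point back to m1, m5 points back to m4.
links = [("m2", "m1"), ("m3", "m2"), ("m5", "m4")]
chains = build_chains(links)
print(chains)
```

NPs that take part in no link never enter the structure, which matches the task setup where NPs outside coreferential chains are not annotated.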



an example: NER

www.linguateca.pt/avaliacaoconjunta



annotation tools

PALinkA: Perspicuous and Adjustable Links Annotator
  http://clg.wlv.ac.uk/projects/PALinkA/index.php

Alembic Workbench: a natural language engineering environment for the development of tagged corpora
  http://www.mitre.org/tech/alembic-workbench/

ATLAS: Architecture and Tools for Linguistic Analysis Systems
  http://www.nist.gov/speech/atlas/

CLaRK system: an XML-based system for corpora development
  http://www.bultreebank.org/clark/index.html

GATE: an architecture, framework and development environment for language engineering, which can also be used to annotate texts
  http://www.gate.ac.uk/

MMAX: a tool for multi-modal annotation in XML (the new version is no longer free)
  http://mmax.eml-research.de/




References

Dale, Robert; Moisl, Hermann; Somers, Harold (eds.). 2000. Handbook of Natural Language Processing. New York/Basel: Marcel Dekker, Inc.

Hajičová, E.; Abeillé, A.; Hajič, J.; Mirovský, J. 2010. Treebank annotation. In Indurkhya and Damerau (2010): 167-188.

Hasler, Laura K.; Naumann, K.; Orasan, C. 2006. Guidelines for Annotation of Within-document NP Coreference. http://clg.wlv.ac.uk/projects/NP4E/NP_guidelines_2006.pdf

Hirschman, Lynette; Mani, Inderjeet. 2003. Evaluation. In Mitkov, Ruslan (ed.) 2003, pp. 414-429.

Indurkhya, Nitin; Damerau, Fred (eds.). 2010. Handbook of Natural Language Processing (2nd ed.). Chapman & Hall/CRC.

McEnery, Tony. 2003. Corpus Linguistics. In Mitkov, Ruslan (ed.) 2003, pp. 448-463.

McEnery, Tony; Xiao, Richard; Tono, Yukio. 2006. Corpus-Based Language Studies. An advanced resource book. Routledge.

Mitkov, Ruslan (ed.). 2003. Oxford Handbook of Computational Linguistics. Oxford: Oxford University Press.

Mitkov, Ruslan; Orasan, Constantin; Evans, Richard. 1999. The importance of annotated corpora for NLP: the cases of anaphora resolution and clause splitting. TALN ’99. http://clg.wlv.ac.uk/papers/mitkov-99b.pdf

Orăsan, Constantin; Cristea, Dan; Mitkov, Ruslan; Branco, António. 2008. Anaphora Resolution Exercise: An overview. Proceedings of the 6th Language Resources and Evaluation Conference (LREC’2008), Marrakesh, Morocco, 28-30 May. http://clg.wlv.ac.uk/papers/713_paper.pdf

Renouf, Antoinette; Kehoe, Andrew (eds.). 2009. Corpus Linguistics: Refinements and Reassessments. Amsterdam/New York: Rodopi.

Thompson, Henry S. 2000. Corpus Creation for Data-Intensive Linguistics. In Dale et al. (eds.) 2000, pp. 385-401.

Xiao, Richard. 2010. Corpus Creation. In Indurkhya and Damerau (2010): 147-166.

Resources

http://www.ldc.upenn.edu/annotation/
http://www.routledge.com/textbooks/0415286239
