Page 1: Corpus annotation Corpus Linguistics Richard Xiao lancsxiaoz@googlemail.com

Corpus annotation

Corpus Linguistics

Richard Xiao



Outline of the session

• Lecture
  – Rationale for corpus annotation
  – Leech’s maxims of corpus annotation
  – Types of annotation

• Lab
  – CLAWS POS tagger (online and Windows-based)
  – Introducing Wmatrix
  – ICTCLAS


Corpora and annotation

• Unannotated corpus
  – simple plain text or raw text
  – the linguistic information is implicit
    • e.g. no explicit representation of present as a noun

• Annotated corpus
  – no longer just text
  – real repository of linguistic information
    • the relevant linguistic information is now explicit (e.g. present as a noun, adjective, or verb)


Corpus annotation

• What is annotation?
  – “The process of adding […] interpretive, linguistic information to an electronic corpus of spoken and/or written language data” (Leech 1997)
  – Broadly, also refers to the results of the annotation process

• In a strict sense, different from corpus markup
  – Markup provides objective, verifiable information
    • e.g. author, paragraph boundary
  – Annotation is concerned with interpretive linguistic information
    • e.g. part-of-speech


Why annotate a corpus?

• It makes information retrieval and extraction easier and faster, and enables human analysts to exploit and retrieve analyses of which they are not themselves capable
• Annotated corpora are reusable resources
• Annotated corpora are multifunctional: they can be annotated for one purpose and reused for another
• Corpus annotation records a linguistic analysis explicitly
• Corpus annotation provides a standard reference resource, a stable base of linguistic analyses, so that successive studies can be compared and contrasted on a common basis


How are corpora annotated?

• Automatic annotation
  – Can be automated reliably for some types (POS, lemmatization)
  – Can annotate large amounts of data quickly at low cost
  – Post-editing or human correction may be necessary to improve accuracy

• Computer-assisted annotation
  – The semi-automatic annotation process (human-machine interface) may produce more reliable results than fully automated annotation, but it is also slower and more costly

• Manual annotation
  – Occurs where no annotation tool is available or where the accuracy of available systems is not high enough to be useful
  – Expensive and time-consuming, typically only feasible for small corpora


Leech’s 7 maxims of annotation

1. It should be possible to remove the annotation from an annotated corpus in order to revert to the raw corpus.
2. It should be possible to extract the annotations by themselves from the text.
3. The annotation scheme should be based on guidelines which are available to the end user.
4. It should be made clear how and by whom the annotation was carried out.
5. The end user should be made aware that the corpus annotation is not error-free or infallible, but simply a potentially useful tool.
6. Annotation schemes should be based as far as possible on widely agreed and theory-neutral principles.
7. No annotation scheme has the a priori right to be considered as a standard. Standards emerge through practical consensus.
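Maxims 1 and 2 can be illustrated with a minimal Python sketch. It assumes a simple word_TAG layout with whitespace-separated tokens; the function name `split_annotation` and the sample sentence are illustrative only, and the original spacing of the raw text is not recovered.

```python
def split_annotation(tagged: str, sep: str = "_"):
    """Separate a word_TAG string into its words and its tags.

    Assumes whitespace-separated tokens in which the LAST `sep`
    divides the word from its tag (a hypothetical input layout).
    """
    words, tags = [], []
    for token in tagged.split():
        if sep in token:
            word, tag = token.rsplit(sep, 1)
        else:
            # Untagged token: keep the word, record no tag.
            word, tag = token, None
        words.append(word)
        tags.append(tag)
    return words, tags


words, tags = split_annotation("He_PPHS1 was_VBDZ going_VVGK to_TO die_VVI ._.")
raw_text = " ".join(words)   # maxim 1: the text without its annotation
tag_only = " ".join(tags)    # maxim 2: the annotation on its own
```

Keeping both operations this cheap is the practical point of the two maxims: an annotation layer that cannot be peeled off cleanly ties every later user to one analysis.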


Types of corpus annotation

• Phonological level
  – Syllable boundaries (phonetic/phonemic annotation)
  – Prosodic or suprasegmental features (prosodic annotation, e.g. pitch, loudness, intonation)

• Morphological level
  – Prefixes, suffixes, stems (morphological annotation)

• Lexical level
  – Tokenisation (essential for Chinese)
  – Parts of speech (POS tagging)
    • e.g. present: NN1, VVB, JJ
  – Lemmas (lemmatization)
    • stop, stopped, stops, stopping → stop
  – Semantic fields (semantic annotation)
    • cricket: sport, insect


Tokenisation

• The one-to-one correspondence between orthographic and morpho-syntactic word tokens can be considered as a default in English with three main exceptions
  – Multiword units (e.g. so that and in spite of)
  – Mergers (e.g. can’t and gonna)
  – Variably spelt compounds (e.g. noticeboard, notice-board, notice board)

• CLAWS examples (“ditto tags”)
  – so that: so_CS21 that_CS22
  – in spite of: in_II31 spite_II32 of_II33
  – can’t: ca_VM n’t_XX
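The ditto tags in these examples can be regrouped into multiword units with a short Python sketch. It assumes the two trailing digits mean "number of words in the unit" and "position within the unit" (so II31 is word 1 of a 3-word preposition), and that input is well-formed word_TAG text; `group_ditto` is a hypothetical helper, not part of CLAWS.

```python
import re

# Base tag in capitals, then unit size and position (assumed ditto-tag layout).
DITTO = re.compile(r"^(?P<tag>[A-Z]+)(?P<size>\d)(?P<pos>\d)$")


def group_ditto(tagged: str):
    """Merge ditto-tagged word sequences into single (unit, tag) pairs."""
    tokens = [tuple(tok.rsplit("_", 1)) for tok in tagged.split()]
    units = []
    i = 0
    while i < len(tokens):
        word, tag = tokens[i]
        m = DITTO.match(tag)
        size = int(m["size"]) if m else 0
        group = tokens[i:i + size]
        # A valid unit runs TAGn1, TAGn2, ... TAGnn without gaps.
        expected = [f"{m['tag']}{size}{k}" for k in range(1, size + 1)] if m else []
        if m and size >= 2 and m["pos"] == "1" and [t for _, t in group] == expected:
            units.append((" ".join(w for w, _ in group), m["tag"]))
            i += size
        else:
            units.append((word, tag))
            i += 1
    return units


print(group_ditto("in_II31 spite_II32 of_II33 the_AT rain_NN1"))
```

Ordinary tags with a single trailing digit, such as NN1, do not match the pattern and pass through unchanged.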


BNC-style POS tagging

Explosives found on Hampstead Heath.

<s>                  new sentence
<w NN2>Explosives    plural noun
<w VVD>found         past tense verb
<w PRP>on            preposition
<w NP0>Hampstead     proper noun
<w NP0>Heath         proper noun
<PUN>                punctuation
</s>
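A minimal Python sketch of how word/tag pairs might be pulled out of BNC-style markup like the example on this slide. The regular expression assumes every word is introduced by `<w TAG>` and runs up to the next `<`; punctuation elements are ignored, and `bnc_pairs` is an illustrative name.

```python
import re

# One <w TAG> opener followed by the word text, up to the next tag.
W_ELEMENT = re.compile(r"<w ([A-Z0-9]+)>([^<]*)")


def bnc_pairs(markup: str):
    """Return (word, tag) pairs from simplified-SGML BNC-style markup."""
    return [(word.strip(), tag) for tag, word in W_ELEMENT.findall(markup)]


sample = ("<s><w NN2>Explosives <w VVD>found <w PRP>on "
          "<w NP0>Hampstead <w NP0>Heath <PUN></s>")
for word, tag in bnc_pairs(sample):
    print(f"{word}\t{tag}")
```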


Example of semantic tagging

See http://ucrel.lancs.ac.uk/usas/USASSemanticTagset.pdf for the tagset.


Types of corpus annotation

• Syntactic level
  – Parsing / treebanking / bracketing

(S (NP Mary)
   (VP visited
       (NP a
           (ADJP very nice)
           boy)))

• Stanford Parser
  – http://nlp.stanford.edu:8080/parser/
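Bracketed parses like the one on this slide are easy to read into a nested structure. The following Python sketch assumes well-formed brackets in which every opening bracket is followed by a category label; it is a hand-rolled reader for illustration, not the Stanford Parser's own API.

```python
def parse_brackets(text: str):
    """Read a bracketed parse into nested (label, children) tuples."""
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def node():
        nonlocal pos
        assert tokens[pos] == "(", "expected an opening bracket"
        pos += 1
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(node())       # nested constituent
            else:
                children.append(tokens[pos])  # terminal word
                pos += 1
        pos += 1  # consume the closing bracket
        return (label, children)

    return node()


def leaves(tree):
    """Return the words of a parsed tree in their original order."""
    _, children = tree
    words = []
    for child in children:
        words.extend(leaves(child) if isinstance(child, tuple) else [child])
    return words


tree = parse_brackets("(S (NP Mary) (VP visited (NP a (ADJP very nice) boy)))")
print(leaves(tree))
```

Reading the leaves back out recovers the unannotated sentence, which is the syntactic-level counterpart of stripping POS tags.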


Types of corpus annotation

• Discourse level
  – Anaphoric relations (coreference annotation)
    (6 the married couple 6) said that <REF=6 they were happy with <REF=6 their lot.
  – Speech acts (pragmatic annotation)
    • 3 layers of coding
      – Segmentation (dividing dialogue into textual units, i.e. utterances)
      – Functional annotation (dialogue act annotation)
      – Utterance tags (applying utterance tags that characterize the role of the utterance as a dialogue act)
  – Stylistic features such as speech and thought presentation (stylistic annotation)
    • The representation of people’s speech and thoughts, known as speech and thought presentation (S&TP)


Types of corpus annotation

• Other types
  – Error tagging
    • Applies to learner corpus data
    • The CLEC error tagging scheme consists of 61 error types clustered in 11 categories
  – Problem-specific annotation
    • Not exhaustive: only the phenomena directly relevant to a particular research question are annotated
    • Developed for its relevance to the specific research question, not for broad coverage or consensus-based theory-neutrality
    • E.g. Hunston (1993) studies how people talk about sameness and difference (“local grammar”)


Annotation styles

• Embedded style
  – LOB style: going_VVGK
  – TEI entity references: going&VVGK;
  – WSJ style: going/VVGK
  – SGML: <w POS=VVGK>going</w>
  – BNC style (simplified SGML): <w VVGK>going
  – XML: <w POS="VVGK">going</w>

• Standalone style
  – <s><w id="1">He</w><w id="2">was</w><w id="3">going</w><w id="4">to</w><w id="5">die</w><w id="6">.</w></s>
  – <s><word id="1">PPHS1</word><word id="2">VBDZ</word><word id="3">VVGK</word><word id="4">TO</word><word id="5">VVI</word><word id="6">.</word></s>
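In the standalone style the text and the tags sit in separate layers linked by id. A minimal Python sketch, assuming both layers are well-formed XML with straight quotes, shows how they could be merged back into an embedded LOB-style line; the function name `merge_layers` is illustrative.

```python
import xml.etree.ElementTree as ET

TEXT_LAYER = ('<s><w id="1">He</w><w id="2">was</w><w id="3">going</w>'
              '<w id="4">to</w><w id="5">die</w><w id="6">.</w></s>')
TAG_LAYER = ('<s><word id="1">PPHS1</word><word id="2">VBDZ</word>'
             '<word id="3">VVGK</word><word id="4">TO</word>'
             '<word id="5">VVI</word><word id="6">.</word></s>')


def merge_layers(text_xml: str, tag_xml: str, sep: str = "_") -> str:
    """Join a text layer and a tag layer on their shared id attribute."""
    tags = {el.get("id"): el.text for el in ET.fromstring(tag_xml).iter("word")}
    words = ET.fromstring(text_xml).iter("w")
    return " ".join(f"{w.text}{sep}{tags[w.get('id')]}" for w in words)


print(merge_layers(TEXT_LAYER, TAG_LAYER))
```

Passing `sep="/"` would give the WSJ-style variant instead; the choice of embedded format is a presentation decision once the layers are linked.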


Introducing CLAWS

• CLAWS: some basic facts
  – The Constituent Likelihood Automatic Word-tagging System
  – Best known POS tagger for general English
  – Has been used to tag a number of large corpora, including the 100M-word BNC
  – Has consistently achieved 96-97% accuracy
  – Free online tagging service allows academic users to tag 100,000 words at a time (from an academic website)
    • http://ucrel.lancs.ac.uk/claws/trial.html


CLAWS tagsets

• C7 tagset
  – A detailed tagset of 146 tags
  – http://ucrel.lancs.ac.uk/claws7tags.html

• C5 tagset
  – Less refined, 61 tags (BNC tagset)
  – http://ucrel.lancs.ac.uk/claws5tags.html

• The mapping between C7 and C5 is a many-to-one conversion, and is available in a tab-delimited text file

• C8 tagset is an extension of C7 tagset that makes further distinctions in the determiner and pronoun categories as well as for auxiliary verbs
  – http://ucrel.lancs.ac.uk/claws8tags.pdf
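A many-to-one conversion of this kind amounts to a dictionary lookup. The Python sketch below assumes a two-column tab-delimited layout (C7 tag, then C5 tag); the column order and the sample rows are illustrative assumptions and are not copied from the official mapping file.

```python
import csv
import io

# Illustrative rows only (assumed layout: C7 tag <TAB> C5 tag).
SAMPLE_MAPPING = "NP1\tNP0\nNP2\tNP0\nJJ\tAJ0\nII\tPRP\n"


def load_mapping(handle):
    """Read a tab-delimited C7 -> C5 table into a dictionary."""
    return {row[0]: row[1]
            for row in csv.reader(handle, delimiter="\t")
            if len(row) >= 2}


def c7_to_c5(tags, mapping):
    """Convert C7 tags to C5, leaving unknown tags untouched."""
    return [mapping.get(tag, tag) for tag in tags]


mapping = load_mapping(io.StringIO(SAMPLE_MAPPING))
print(c7_to_c5(["NP1", "NP2", "JJ"], mapping))
```

Because several C7 tags collapse onto one C5 tag, the conversion cannot be reversed: going from C5 back to C7 would require retagging.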


Free CLAWS trial service


CLAWS output formats

Vertical output format

Horizontal output format (Use copy & paste and save as a plain text file)

Pseudo-XML output format


Windows-based CLAWS D:\ZJU CL\tools\Jclaws\lib\run_jclaws.bat (or antclawsgui)

…tagging text in a file


Wmatrix

• An online corpus analysis and comparison system
• A web interface that gives you access to the CLAWS part-of-speech tagger and the USAS semantic tagger
  – CLAWS
  – USAS: UCREL Semantic Analysis System
• Including standard corpus research tools
  – Frequency, KWIC concordance, wordlist, keyword list, word cluster (n-gram), collocation
  – Built-in log-likelihood statistic for corpus comparison
• Integrating POS tagging and semantic field annotation into a single profiling tool
• Introduction to Wmatrix
  – http://ucrel.lancs.ac.uk/wmatrix/
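The log-likelihood statistic used for corpus comparison can be sketched in a few lines of Python. This follows the two-corpus form commonly given for keyword analysis, LL = 2 × Σ O × ln(O / E); whether Wmatrix computes it in exactly this way is an assumption, and the frequencies in the example are invented.

```python
import math


def log_likelihood(freq1: int, freq2: int, size1: int, size2: int) -> float:
    """Log-likelihood for one item's frequency in two corpora.

    freq1, freq2: observed frequencies of the item in each corpus
    size1, size2: total number of tokens in each corpus
    """
    total = freq1 + freq2
    expected1 = size1 * total / (size1 + size2)
    expected2 = size2 * total / (size1 + size2)
    ll = 0.0
    for observed, expected in ((freq1, expected1), (freq2, expected2)):
        if observed > 0:  # a zero observation contributes nothing
            ll += observed * math.log(observed / expected)
    return 2 * ll


# Invented figures: a word seen 120 times in a 5,000-word text
# and 60 times in a 6,000-word text.
print(round(log_likelihood(120, 60, 5000, 6000), 2))
```

A higher value means the difference in relative frequency is less likely to be due to chance; values are conventionally compared against chi-square critical values with one degree of freedom.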


Your Wmatrix account

• You will need a username and password to use Wmatrix
• Write down your username and password
  – Tag and download your text as soon as possible if you wish to use Wmatrix to tag your data (POS / semantic) for your project
• …and now log in with your account
  – http://ucrel.lancs.ac.uk/wmatrix3.html


Click here to find out more about the UCREL Semantic Annotation System

Click here to run “tag wizard”

Click here to see your work area (for data you have already processed)


Amongst other things, the link explains the categorisation scheme utilised …

Hierarchy of 21 major discourse fields (or domains), which expand into 232 semantic field tags (see the web link)

semantic field (or domain) = “A named area of meaning in which lexemes interrelate and define each other in specific ways” (Crystal 1995: 157)

Note: the USAS scheme is derived from McArthur (1981)


The USAS system

• Designed to undertake the automatic semantic analysis of present-day English texts (spoken and written)

• Involving two stages
  (i) POS tagging by CLAWS
      A POS tag is assigned to every lexical item or multi-word expression (MWE), using probabilistic Markov models of likely part-of-speech sequences (accuracy of 97%+)
  (ii) Output fed into SEMTAG for semantic annotation
      Semantic tags are assigned automatically on the basis of pattern matching between the target text and two computer dictionaries developed for use with the program (accuracy of 92%+)

• Present applications: market research, content analysis, information extraction, assistance for translation, linguistic analysis, etc.
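The second stage can be pictured as a dictionary lookup over POS-tagged input. The Python sketch below is a toy illustration only: it assumes the two dictionaries are a single-word lexicon and a multi-word expression lexicon, and the entries and field labels are invented placeholders, not real USAS tags or SEMTAG behaviour.

```python
def semantic_tag(tagged_words, single_lexicon, mwe_lexicon, max_len=3):
    """Assign candidate semantic fields to POS-tagged words.

    tagged_words: list of (word, pos) pairs from the POS-tagging stage
    single_lexicon: {(word, pos): [fields]}  (hypothetical layout)
    mwe_lexicon: {(word1, word2, ...): [fields]}  (hypothetical layout)
    """
    result = []
    i = 0
    while i < len(tagged_words):
        # Try the longest multi-word expression first.
        for n in range(max_len, 1, -1):
            words = tuple(w.lower() for w, _ in tagged_words[i:i + n])
            if len(words) == n and words in mwe_lexicon:
                result.append((" ".join(words), mwe_lexicon[words]))
                i += n
                break
        else:
            # No multi-word match: fall back to the single-word lexicon.
            word, pos = tagged_words[i]
            fields = single_lexicon.get((word.lower(), pos), ["unmatched"])
            result.append((word, fields))
            i += 1
    return result


single = {("cricket", "NN1"): ["sport", "insect"]}   # invented entries
mwe = {("in", "spite", "of"): ["concession"]}        # invented entries
tagged = [("Cricket", "NN1"), ("in", "II31"), ("spite", "II32"),
          ("of", "II33"), ("rain", "NN1")]
print(semantic_tag(tagged, single, mwe))
```

An ambiguous word such as cricket comes back with more than one candidate field, so a real tagger also needs a disambiguation step that this sketch leaves out.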


Let’s do some tagging

Once you have logged in:

• From the Wmatrix home page, click on Tag wizard

• This will bring up the following page …


Let’s do some tagging

Tag the following two texts:
  – Tip: it’s good practice to create one folder for each file

• Conservative MP Michael Howard’s farewell speech to his party (2005)
  – D:\ZJU CL\texts\Howard_speech.txt

• New Labour MP Tony Blair’s farewell speech to his party (2006)
  – D:\ZJU CL\texts\texts\Blair_speech.txt


part-of-speech tagging semantic tagging frequency lists

A quick “how to”!

• Enter new work area name (Blair / Howard)

• Click the browse button to select the right file

• Click the “upload now” button …

• A new screen will provide you with an update report … e.g.


You will then be taken to your work area [My folders]


What you’ll see in the Simple “VIEW of folder”

Click on Frequency to see the most frequent words

You can also do concordance searches of words/phrases


Advanced View of Howard Folder

Click on Frequency to see the most frequent words (as before)

How might we discover the most ‘frequent’ POS? Jot them down

--- and the most ‘frequent’ semantic fields? Make a note of them

We can also see all of the keywords using this VIEW

--- and investigate key parts of speech (POS) and key concepts / domains
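Outside Wmatrix, the same question about the most frequent POS can be asked of a downloaded tagged text. This Python sketch assumes a horizontal word_TAG file; the sample sentence is invented, and `pos_frequencies` is an illustrative name.

```python
from collections import Counter


def pos_frequencies(tagged_text: str, sep: str = "_") -> Counter:
    """Count how often each POS tag occurs in word_TAG text."""
    tags = [token.rsplit(sep, 1)[1]
            for token in tagged_text.split()
            if sep in token]
    return Counter(tags)


counts = pos_frequencies("the_AT cat_NN1 saw_VVD the_AT dog_NN1 ._.")
for tag, freq in counts.most_common(3):
    print(tag, freq)
```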


Frequency of words in Howard and Blair (using advanced view)

Make a note of the similarities and differences …


Download the tagged text

Remember to change filename and file type


Tagging Chinese text

• ICTCLAS: Institute of Computing Technology, Chinese Lexical Analysis System
  – Best Chinese tagger
    • Fast and reliable (98.45%)
  – Online demonstration
  – Free download of shareware version
  – http://ictclas.org/


Online demo


Standalone ICTCLAS

D:\ZJU CL\tools\ICTCLAS\ICTCLAS_Win.exe

Tagset - http://www.lancs.ac.uk/fass/projects/corpus/LCMC/lcmc/lcmc_tagset.htm

