Corpus 03 Corpus Analysis
Corpus analysis
- Annotation: Lemmatization, Tagging, Parsing
- Listing, Sorting, Counting, Concordancing
- Tools
Lemmatization The process of classifying together all the identical or related forms of a word under a common head word. Problem areas: irregular forms and homonyms. Automatic lemmatizers employ two processes:
Lemmatization by means of a set of rules: go, going, goes and went are grouped under the head word go. Lemmatization by means of affix stripping: if a word form does not match any rule in the affix rule system and is not listed as a specific exception, the form is treated as a lemma in its own right. Automatic lemmatization still requires substantial manual input involving subjective decisions.
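The two processes can be sketched in a few lines of code. This is a minimal illustration, not a real lemmatizer: the irregular-form table, suffix list and length threshold are all invented for the example.

```python
# Sketch of the two lemmatization processes: a rule table for irregular
# forms, then affix stripping, with the fallback that an unmatched form
# becomes its own lemma. All rules here are illustrative only.

IRREGULAR = {"went": "go", "goes": "go", "going": "go"}  # rule table
SUFFIXES = ["ing", "ed", "es", "s"]                      # affixes to strip

def lemmatize(word):
    word = word.lower()
    if word in IRREGULAR:              # rule-based lookup first
        return IRREGULAR[word]
    for suf in SUFFIXES:               # then affix stripping
        if word.endswith(suf) and len(word) > len(suf) + 2:
            return word[: -len(suf)]
    return word                        # fallback: form is its own lemma

# "corpus" wrongly becomes "corpu" -- exactly the kind of error that
# makes manual input and subjective decisions necessary.
print([lemmatize(w) for w in ["went", "walking", "walked", "corpus"]])
```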
Word-class tagging Tagset: a set of descriptions and tags used for automatic tagging. It may include major word classes and their inflectional variants, major function words, certain important individual lexemes, certain punctuation marks and a small number of discourse markers. Figure 4.1 (p. 210): the tagset used for tagging the Brown Corpus. There is no obvious way of improving the accuracy of rule-based taggers: the set of tags and the rule-based program have to be modified to take account of what actually occurs in the texts, i.e. the probability of occurrence of particular sequences of words or tags.
Word-class tagging Probabilities derived from earlier corpus tagging show which part of speech a particular homograph is most likely to belong to. CLAWS (Constituent Likelihood Automatic Word-tagging System): CLAWS1 mis-tagged 3% to 4% of word tokens. Later versions of CLAWS were improved by reducing the size of the tagset to about 60 tags, developing new portmanteau tags, expanding the lexicon to about 10,000 items, and enhancing the idiom list of word sequences.
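The idea behind constituent-likelihood tagging can be sketched as follows: for a homograph, choose the tag that maximises the product of how often the word carries that tag and how often that tag follows the preceding tag. The counts below are invented for illustration, not taken from any real tagged corpus.

```python
# Hedged sketch of CLAWS-style probabilistic homograph resolution.
# Counts stand in for probabilities derived from earlier corpus tagging.

WORD_TAG = {"record": {"NN": 70, "VB": 30}}        # word-tag counts
TAG_BIGRAM = {("TO", "VB"): 80, ("TO", "NN"): 5,   # tag-sequence counts
              ("AT", "NN"): 90, ("AT", "VB"): 2}

def best_tag(word, prev_tag):
    candidates = WORD_TAG[word]
    # score = (how often word takes tag) * (how often tag follows prev_tag)
    return max(candidates,
               key=lambda t: candidates[t] * TAG_BIGRAM.get((prev_tag, t), 1))

print(best_tag("record", "TO"))  # after "to": the verb reading wins
print(best_tag("record", "AT"))  # after an article: the noun reading wins
```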
The general tagging procedure Figure 4.3: SGML markup before CLAWS (p. 216). Figure 4.4: after word-class tagging with CLAWS (p. 216). Figure 4.5: post-CLAWS processing by SANTA CLAWS (p. 217).
Factors that affect tagging Language theories: the tagset reflects a particular theory of language. The theoretical assumptions we make about the relationship between morphology, syntax and semantics can affect the types of tags assigned to words. Size of the tagset: the necessary degree of delicacy of analysis. In general, the greater the number of tag categories, the greater the potential for tagging error, but also the greater the precision of analysis available.
Parsing Parsing is a more demanding task, involving not only annotation but also linguistic analysis. Treebank: a collection of labeled constituent structures or phrase markers. A parsed corpus provides a labeled analysis of each sentence to show how the various words function. Two main approaches: probabilistic and rule-based.
Difficulties for rule-based approach Some features of natural languages (prepositional phrase attachment, coordination, discontinuous elements, ellipsis, pragmatic features) cause difficulty for rule-based analysis. Under generative grammar, parsers are derived from schemes of grammatical rules: rules in a predetermined grammar and lexicon. If the rules cannot generate a sentence, the sentence is considered ungrammatical. But ambiguous sentences have to be interpreted with further semantic and pragmatic information.
Probabilistic parsing The treebank provides statistics on the frequency of grammatical structures for the automatic parser to use subsequently in the probabilistic analysis of unrestricted text.
Examples of parsing Rule-based: Lancaster-Leeds Treebank (Figure 4.9, p. 234): based on a context-free phrase-structure grammar with a huge number of grammatical rules. Sentence: I next wondered if she would like to bear down on Shaftesbury Avenue and see a play.
Examples of parsing Probabilistic parsing: LOB Corpus Treebank (Figure 4.10, p. 235): with a parsing scheme based on the experience and statistics derived from the Lancaster-Leeds Treebank. Sentence: it arose during talks following President Kennedy's report to the British Prime Minister of the outcome of his recent visit to Paris.
Examples of parsing Improved probabilistic parsing: UCREL skeleton parsing (Figure 4.11, p. 235): tagging is best done automatically with manual post-editing, using specially developed software to speed up data entry. Sentence: well let's now turn our attention to cricket once more it's been a year when test matches the ubiquitous one day internationals proliferated around the world.
Examples of parsing Souter (1991): a probabilistic parser compatible with systemic functional grammar. Manually parsing the Polytechnic of Wales Corpus of 60,800 words resulted in over 4,500 syntactic rules, most of which occurred infrequently.
Examples of parsing Penn Treebank: coarse surface-structure analysis (Figure 4.12, p. 237). SUSANNE: Surface and Underlying Structural Analyses of Naturalistic English.
Examples of parsing Sampson noted that available parsed corpora have been analysed with varying degrees of delicacy, not all of which show the deep structure or logical relationships in the analysed sentences. The SUSANNE Corpus analytic scheme attempts to specify norms and a comprehensive checklist of linguistic phenomena for the annotation of modern English in terms of both its underlying or logical structure and its surface grammar. Figure 4.13 (p. 239): the SUSANNE Corpus.
Corpus analysis: Listing Lemmatization produces a smaller list. Different arrangements serve different purposes: alphabetical order of the last letter, which groups words with the same suffixes (Table 4.3, p. 248); word length (Table 4.4, p. 249); descending order of frequency (Table 4.5, p. 250).
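The three arrangements can be sketched with a toy word list. The sample text is invented; the point is only how each ordering is computed.

```python
# Sketch of the three list arrangements: descending frequency,
# word length, and last-letter (reverse-spelling) order, which
# brings words sharing a suffix together.

from collections import Counter

text = "the cat sat on the mat and the dog ran on the mat".split()
freq = Counter(text)

by_frequency = freq.most_common()                     # Table 4.5 style
by_length = sorted(freq, key=len)                     # Table 4.4 style
by_last_letter = sorted(freq, key=lambda w: w[::-1])  # Table 4.3 style

print(by_frequency[0])  # most frequent type with its count
print(by_last_letter)   # "cat", "mat", "sat" end up adjacent
```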
Corpus analysis: Concordance A formatted version or display of all the occurrences or tokens of a particular type in a corpus. Two types: Batch generation: find every occurrence in the corpus and make a file containing each occurrence with a predetermined amount of context. Pre-indexed type: if the corpus has been pre-processed by computer to index each word, the software can find almost instantly all the occurrences of a type, and the size of the context can also be easily altered. Most usual format: Key Word in Context concordance (KWIC).
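A batch-style KWIC concordance is simple to sketch: every token of the type is printed with a fixed amount of left and right context, the keyword aligned in a centre column. The sample text is invented.

```python
# Minimal batch-generation KWIC concordance: scan for every token of
# a type and format it with `width` characters of context on each side.

def kwic(tokens, keyword, width=20):
    lines = []
    for i, tok in enumerate(tokens):
        if tok.lower() == keyword:
            left = " ".join(tokens[:i])[-width:]
            right = " ".join(tokens[i + 1:])[:width]
            lines.append(f"{left:>{width}}  {tok}  {right}")
    return lines

tokens = "the bank raised rates while the river bank flooded".split()
for line in kwic(tokens, "bank"):
    print(line)
```

Lining up the keyword this way is what makes the "company words keep", and the different senses of a type, easy to scan by eye.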
Corpus analysis: Concordance Examples: Figure 4.17 (unsorted), 4.18 (right-sorted), 4.19 (left-sorted). A concordance can provide information on the company words keep in a corpus. It also shows the different senses of a word type.
Corpus analysis: Concordance Figure 4.20: a tagged corpus can reveal collocational behavior not only in terms of particular word sequences but also in terms of the distribution of word-class sequences. Amount of context: 80 characters for an A4 page, 130 characters for wide-carriage paper, or the whole sentence.
Statistics in corpus analysis
- Number of different word types
- Mean sentence length
- Number of sentences containing particular numbers of words
- Number of sentences in the text
- Proportion of sentences with more than a specified number of words
Possible statistical inferences
- Statistical significance of occurrence at a greater-than-chance frequency
- Chi-square analysis
- ANOVA
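As an illustration of testing greater-than-chance co-occurrence, a 2x2 chi-square can be computed directly. The contingency counts below are invented; the statistic is compared against the critical value 3.84 (df = 1, p = 0.05).

```python
# Hedged sketch of a chi-square test for co-occurrence significance,
# using the shortcut formula for a 2x2 contingency table.

def chi_square_2x2(a, b, c, d):
    """a = both words present, b = word1 only, c = word2 only, d = neither."""
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))

stat = chi_square_2x2(30, 70, 20, 880)   # invented counts
print(round(stat, 2), stat > 3.84)       # True = significant at p < 0.05
```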
Software
- OCP
- WordCruncher
- TACT
- LEXA
The Oxford Concordance Program (OCP) A batch program for making wordlists, concordances and indexes. OCP operates with three main files: the researcher specifies the name of the text file to be analysed, the command file which specifies what analyses are to be carried out, and the output file which will contain the product of the analyses. Operating system: MS-DOS. Commercial software.
WordCruncher Two separate programs: 1. a batch process to index a text file or corpus, producing a series of specially annotated files; 2. a menu-driven program to locate data in the pre-indexed text. It provides fast retrieval of all the tokens of morphemes, words, phrases and collocations, with both adjacent and discontinuous collocates. Operating system: MS-DOS. Commercial software.
TACT Originally designed to show where vocabulary is distributed in a text, and which characters in a literary work tend to be associated with the use of particular words and with what frequency. TACT can display lists of all the word forms in a text with the number of tokens, with user-defined and variable amounts of context from a line to a screenful. It will show whether particular words tend to occur in particular parts of a text.
TACT Calculates z scores to measure the relationship between observed and expected frequencies of the co-occurrence of items in a corpus. Operating system: MS-DOS. Shareware.
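The z-score measure can be sketched as follows: how far the observed co-occurrence frequency departs from what an even spread of the collocate through the corpus would predict. The frequencies and the span-based expectation formula are illustrative assumptions, not TACT's exact computation.

```python
# Sketch of a collocation z score: (observed - expected) / sd,
# where "expected" assumes the collocate is spread evenly through
# the corpus within `span` words of each node token. Values invented.

import math

def z_score(observed, freq_node, freq_collocate, corpus_size, span=4):
    p = freq_collocate / corpus_size      # chance rate of the collocate
    expected = p * freq_node * span       # chance co-occurrences in the spans
    return (observed - expected) / math.sqrt(expected * (1 - p))

z = z_score(observed=25, freq_node=200, freq_collocate=500, corpus_size=100_000)
print(round(z, 2))  # well above ~2: a stronger-than-chance collocation
```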
LEXA A package containing 60 interrelated programs for many kinds of linguistic analysis. The programs range from a lemmatizer and tagger to software which lists all the types i