Corpus 03 Corpus Analysis

Corpus 03 Corpus Analysis. Corpus analysis Annotation –Lemmatization –Tagging –Parsing Corpus analysis –Listing –Sorting –Counting –Concordancing Tools

Corpus 03

Corpus Analysis

Corpus analysis

• Annotation

– Lemmatization

– Tagging

– Parsing

• Corpus analysis

– Listing

– Sorting

– Counting

– Concordancing

• Tools

Lemmatization

• A process of classifying together all the identical or related forms of a word under a common head word.

• Irregularities, homonyms

• Automatic lemmatizers employ two processes:

Lemmatization

• by means of a set of rules: go, going, goes and went are all listed under the head word go

• by means of affix stripping: If a word form does not appear as part of the affix rule system, or is not listed as a specific exception, then the word is listed as a lemma in the same form.

• Automatic lemmatization requires substantial manual input involving subjective decisions.
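
The two processes above can be sketched roughly as follows. This is a minimal illustration, not the method of any particular lemmatizer: the exception list and suffix rules are invented for the example, and a real system would need far larger, manually checked tables.

```python
from collections import defaultdict

# Hypothetical rule tables, invented for illustration only.
EXCEPTIONS = {"went": "go", "gone": "go", "goes": "go"}   # listed irregular forms
SUFFIX_RULES = ["ing", "es", "s"]                         # affixes to strip, in order


def lemmatize(word_form):
    """Return a head word: exceptions first, then affix stripping."""
    form = word_form.lower()
    if form in EXCEPTIONS:
        return EXCEPTIONS[form]
    for suffix in SUFFIX_RULES:
        stem = form[: len(form) - len(suffix)]
        if form.endswith(suffix) and len(stem) >= 2:
            return stem
    # No rule applies and no exception is listed:
    # the form is listed as a lemma in the same form.
    return form


def group_by_lemma(word_forms):
    """Classify word forms together under their common head word."""
    groups = defaultdict(list)
    for form in word_forms:
        groups[lemmatize(form)].append(form)
    return dict(groups)


print(group_by_lemma(["go", "going", "goes", "went"]))
```

Naive stripping of this kind also turns "corpus" into "corpu", which is one reason automatic lemmatization needs manual input.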

Word-class tagging

• Tagset: a set of descriptions and tags for automatic tagging. It may include major word classes and their inflectional variants, major function words, certain important individual lexemes, certain punctuation marks and a small number of discourse markers.

• Figure 4.1 (P. 210): The tagset used for tagging the Brown Corpus.

• There is no obvious way of improving the accuracy of rule-based taggers. The set of tags and the rule-based program have to be modified to take account of what actually occurs in the texts: the probability of occurrence of particular sequences of words or tags.

Word-class tagging

• Probabilities derived from earlier corpus tagging showed the part of speech a particular homograph was most likely to belong to.

• CLAWS (Constituent Likelihood Automatic Word-tagging System): CLAWS 1 made an error in 3% to 4% of word-tokens.

• Later versions of CLAWS have been improved by reducing the size of the tagset to about 60 tags, developing new portmanteau tags, expanding the lexicon to about 10,000 items, and enhancing the idiom list of word sequences.
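
The idea of using probabilities from an earlier tagged corpus can be sketched as below. This is not CLAWS: it only picks the most frequent tag for each word form and ignores tag sequences. The tagged sample and the tag names are invented for illustration.

```python
from collections import Counter, defaultdict

# Tiny hand-made "earlier tagged corpus"; tags are illustrative, not a real tagset.
TAGGED_SAMPLE = [
    ("the", "DET"), ("run", "NOUN"), ("was", "VERB"), ("long", "ADJ"),
    ("they", "PRON"), ("run", "VERB"), ("daily", "ADV"),
    ("we", "PRON"), ("run", "VERB"), ("home", "ADV"),
]


def train_tag_frequencies(tagged_tokens):
    """Count how often each word form carries each tag."""
    counts = defaultdict(Counter)
    for word, tag in tagged_tokens:
        counts[word.lower()][tag] += 1
    return counts


def most_likely_tag(word, counts, default="UNKNOWN"):
    """Return the tag a homograph most often received in the training sample."""
    tag_counts = counts.get(word.lower())
    if not tag_counts:
        return default
    return tag_counts.most_common(1)[0][0]


counts = train_tag_frequencies(TAGGED_SAMPLE)
print(most_likely_tag("run", counts))  # VERB (2 of the 3 occurrences)
```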

The general tagging procedure

• Figure 4.3 SGML markup before CLAWS (p.216)

• Figure 4.4 After word-class tagging with CLAWS (p.216)

• Figure 4.5 Post-CLAWS processing by SANTA CLAWS (p.217)

Factors that affect tagging

• Language theories: The tagset reflects a particular theory of language. The theoretical assumptions we make about the relationship between morphology, syntax and semantics can affect the types of tags assigned to words.

• Size of the tagset: this depends on the necessary degree of delicacy of analysis. In general, the greater the number of tag categories, the greater the potential for tagging error, but also the greater the precision of analysis available.

Parsing

• Parsing is a more demanding task, involving not only annotation but also linguistic analysis.

• Treebank: collections of labeled constituent structures or phrase markers.

• A parsed corpus provides a labeled analysis for each sentence to show how the various words function.

• Two main approaches: probabilistic and rule-based

Difficulties for rule-based approach

• Some features of natural languages (prepositional phrase attachment, coordination, discontinuous elements, ellipsis, pragmatic features) can cause difficulty for the rule-based analysis.

• According to generative grammar, parsers could be derived from schemes of grammatical rules, rules in a predetermined grammar and lexicon. If rules could not generate the sentence, this sentence would be considered ungrammatical. But ambiguous sentences have to be interpreted with further semantic and pragmatic information.

Probabilistic parsing

• The treebank provides statistics on the frequency of grammatical structures for the automatic parser to use subsequently in the probabilistic analysis of unrestricted text.
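
How a treebank can supply such statistics is sketched below, assuming a toy treebank of two hand-built trees and a simple relative-frequency estimate for each rewrite rule. The tree encoding and labels are invented for the example and do not follow any of the treebanks discussed here.

```python
from collections import Counter

# Toy treebank: each tree is (label, child, child, ...); leaves are word strings.
TOY_TREEBANK = [
    ("S", ("NP", ("PRON", "she")),
          ("VP", ("V", "saw"), ("NP", ("DET", "a"), ("N", "play")))),
    ("S", ("NP", ("DET", "the"), ("N", "play")),
          ("VP", ("V", "ended"))),
]


def count_rules(tree, counts):
    """Count every rule 'label -> children' found in one tree."""
    label, *children = tree
    rhs = tuple(c[0] if isinstance(c, tuple) else c for c in children)
    counts[(label, rhs)] += 1
    for child in children:
        if isinstance(child, tuple):
            count_rules(child, counts)


def rule_probabilities(treebank):
    """Estimate P(rule | left-hand side) by relative frequency."""
    counts = Counter()
    for tree in treebank:
        count_rules(tree, counts)
    lhs_totals = Counter()
    for (lhs, _), n in counts.items():
        lhs_totals[lhs] += n
    return {rule: n / lhs_totals[rule[0]] for rule, n in counts.items()}


probs = rule_probabilities(TOY_TREEBANK)
print(probs[("NP", ("DET", "N"))])  # 2 of the 3 NP expansions
```

A probabilistic parser would then prefer, among competing analyses of a sentence, the one built from the more frequent rules.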

Examples of parsing

• Rule-based: Lancaster-Leeds Treebank (Figure 4.9, p. 234): based on a context-free phrase-structure grammar which exhibited a huge number of grammatical rules.

• Sentence: I next wondered if she would like to bear down on Shaftesbury Avenue and see a play.

Examples of parsing

• Probabilistic parsing: LOB Corpus Treebank (Figure 4.10, p. 235): with a parsing scheme based on the experience and statistics derived from the Lancaster-Leeds Treebank.

• Sentence: it arose during talks following President Kennedy’s report to the British Prime Minister of the outcome of his recent visit to Paris.

Examples of parsing

• Improved probabilistic parsing: UCREL skeleton parsing (Figure 4.11, p. 235): tagging could be done best automatically with manual post-editing, using specially developed software to speed up data entry.

• Sentence: Well let’s now turn our attention to cricket once more it’s been a year when test matches the ubiquitous one day internationals proliferated around the world.

Examples of parsing

• Souter (1991): probabilistic parser compatible with systemic functional grammar. The manually parsed Polytechnic of Wales Corpus of 60,800 words yielded over 4,500 syntactic rules, most of which occurred infrequently.

Examples of parsing

• Penn Treebank: coarse surface-structure analysis (Figure 4.12, p. 237)

• SUSANNE: Surface and Underlying Structural Analyses of Naturalistic English.

Examples of parsing

• Sampson noted that available parsed corpora have been analysed with varying degrees of delicacy, not all of which show the deep structure or logical relationships in the analysed sentences. The SUSANNE Corpus analytic scheme is an attempt to specify norms and a comprehensive checklist of linguistic phenomena for the annotation of modern English in terms of both its underlying or logical structure and its surface grammar.

• Figure 4.13 p. 239: SUSANNE Corpus

Corpus analysis: Listing

• Lemmatization: a smaller list results.

• Different arrangement for different purposes:

– the alphabetical order of the last letter for the same suffixes (Table 4.3, p. 248)

– word length (Table 4.4, p. 249)

– descending order of frequency (Table 4.5, p. 250)
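
The three arrangements can be sketched with ordinary sorting. The tokenizer below is a deliberately crude assumption (lower-cased alphabetic strings); the function names are invented for the example.

```python
import re
from collections import Counter


def tokenize(text):
    """Very rough tokenizer: lower-cased alphabetic strings only."""
    return re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())


def word_lists(text):
    """Return the word types of a text in several arrangements."""
    freq = Counter(tokenize(text))
    types = sorted(freq)
    return {
        "alphabetical": types,
        # Sorting on the reversed spelling groups words with the same suffix.
        "reverse_alphabetical": sorted(types, key=lambda w: w[::-1]),
        "by_length": sorted(types, key=lambda w: (len(w), w)),
        "by_frequency": sorted(types, key=lambda w: (-freq[w], w)),
    }


lists = word_lists("walking talking walked the the the cat")
print(lists["reverse_alphabetical"])
# ['walked', 'the', 'talking', 'walking', 'cat']
```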

Corpus analysis: Concordance

• A formatted version or display of all the occurrences or tokens of a particular type in a corpus.

• Two types

– Batch generation: find every occurrence in the corpus and make a file containing each such occurrence with a predetermined amount of context.

– Pre-indexed type: if the corpus has been pre-processed by computer to index each word, the software can find almost instantly all the occurrences of a type, and the size of the context can also be easily altered.

• Most usual format: Key Word in Context Concordance (KWIC)
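
A pre-indexed KWIC concordance can be sketched as follows, assuming the corpus is already a list of tokens. The index maps each type to its token positions, so lookup does not scan the text and the context width can be changed per query. Function names are invented for the example.

```python
from collections import defaultdict


def build_index(tokens):
    """Pre-process the corpus once: map each word type to its positions."""
    index = defaultdict(list)
    for pos, tok in enumerate(tokens):
        index[tok.lower()].append(pos)
    return index


def kwic(tokens, index, keyword, width=3):
    """Return (left context, key word, right context) for every occurrence."""
    lines = []
    for pos in index.get(keyword.lower(), []):
        left = " ".join(tokens[max(0, pos - width):pos])
        right = " ".join(tokens[pos + 1:pos + 1 + width])
        lines.append((left, tokens[pos], right))
    return lines


def format_kwic(lines):
    """Right-align the left context so the key words line up in a column."""
    pad = max((len(left) for left, _, _ in lines), default=0)
    return [f"{left.rjust(pad)}  {key}  {right}" for left, key, right in lines]


tokens = "the cat sat on the mat and the dog sat too".split()
index = build_index(tokens)
for line in format_kwic(kwic(tokens, index, "sat", width=2)):
    print(line)
```

Batch generation would instead scan the whole token list for each query and write the lines to a file with a fixed context size.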

Corpus analysis: Concordance

• Examples: figure 4.17 (unsorted), 4.18 (right-sorted), 4.19 (left-sorted)

• A concordance can provide information on the company words keep in a corpus. It also shows different senses of a word type.
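
Right-sorting and left-sorting can be sketched as below, assuming each concordance line is a (left context, key word, right context) tuple. Right-sorting orders lines by the words following the key word; left-sorting orders them by the words preceding it, starting with the nearest. The sample lines are invented.

```python
def sort_right(lines):
    """Order concordance lines by the words to the right of the key word."""
    return sorted(lines, key=lambda line: [w.lower() for w in line[2].split()])


def sort_left(lines):
    """Order lines by the words to the left, nearest word first."""
    return sorted(
        lines,
        key=lambda line: [w.lower() for w in reversed(line[0].split())],
    )


lines = [
    ("the old", "house", "was empty"),
    ("a big", "house", "on fire"),
    ("my new", "house", "is small"),
]
print(sort_right(lines)[0])  # ('my new', 'house', 'is small')
print(sort_left(lines)[0])   # ('a big', 'house', 'on fire')
```

Sorting in this way brings repeated neighbours together, which makes recurring collocates easy to spot.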

Corpus analysis: Concordance

• Figure 4.20: A tagged corpus can reveal not only collocational behavior in terms of particular word sequences, but also in terms of the distribution of word-class sequences.

• Amount of context: 80 characters for A4 page, 130 characters for wide carriage paper, the whole sentence.

Statistics in corpus analysis

• Number of different word types

• Mean sentence length

• Number of sentences containing particular numbers of words

• Number of sentences in the text

• Proportion of sentences with more than a specified number of words
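
These counts can be sketched in a few lines. The sentence splitter and word pattern below are crude assumptions (split after ., ! or ?; words are alphabetic strings), so real corpora with abbreviations or markup would need something more careful.

```python
import re
from collections import Counter

WORD = re.compile(r"[A-Za-z']+")


def text_statistics(text, long_threshold=20):
    """Basic descriptive statistics for a plain-text sample."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    lengths = [len(WORD.findall(s)) for s in sentences]
    tokens = [t.lower() for t in WORD.findall(text)]
    n_sent = len(sentences)
    n_long = sum(1 for n in lengths if n > long_threshold)
    return {
        "tokens": len(tokens),
        "types": len(set(tokens)),                 # number of different word types
        "sentences": n_sent,                       # number of sentences in the text
        "mean_sentence_length": sum(lengths) / n_sent if n_sent else 0.0,
        "sentences_by_length": Counter(lengths),   # how many sentences have n words
        "proportion_long": n_long / n_sent if n_sent else 0.0,
    }


stats = text_statistics("The cat sat. The dog sat on the mat. Stop!", long_threshold=2)
print(stats["types"], stats["sentences"])  # 7 3
```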

Possible statistical inferences

• Statistical significance of occurrence at a greater than chance frequency

• Chi-square analysis

• ANOVA
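
As a rough illustration of the chi-square approach, the sketch below compares the frequency of one word in two corpora using a 2 x 2 table (word versus all other tokens, corpus A versus corpus B). The figures are invented, and the function assumes the word occurs at least once overall and does not make up the whole of both corpora, so that no expected value is zero.

```python
def chi_square_2x2(freq_a, size_a, freq_b, size_b):
    """Pearson chi-square for a word's frequency in two corpora (1 degree of freedom)."""
    observed = [
        [freq_a, size_a - freq_a],   # corpus A: the word, all other tokens
        [freq_b, size_b - freq_b],   # corpus B: the word, all other tokens
    ]
    total = size_a + size_b
    row_totals = [size_a, size_b]
    col_totals = [freq_a + freq_b, total - freq_a - freq_b]
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / total
            chi2 += (observed[i][j] - expected) ** 2 / expected
    return chi2


print(round(chi_square_2x2(20, 100, 10, 100), 4))  # 3.9216
```

With one degree of freedom the critical value at the 0.05 level is 3.84, so the invented difference above would only just count as significant.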

Software

• OCP

• WordCruncher

• TACT

• LEXA

The Oxford Concordance Program (OCP)

• A batch program for making wordlists, concordances and indexes. OCP operates with three main files. The researcher specifies the name of the text file to be analysed, the command file which specifies what analyses are to be carried out, and the output file which will contain the product of the analyses.

• Operating system: MS-DOS

• Commercial software

WordCruncher

• Two separate programs:

– First, a batch process that indexes a text file or corpus to produce a series of specially annotated files.

– Second, a menu-driven program that locates data in the pre-indexed text. It provides fast retrieval of all the tokens of morphemes, words, phrases and collocations of both adjacent and discontinuous collocates.

• Operating system: MS-DOS

• Commercial software

TACT

• TACT was originally designed to show where vocabulary is distributed in a text, and which characters in a literary work tend to be associated with the use of particular words and with what frequency.

• TACT can display lists of all the word forms in a text with the number of tokens, with user-defined and variable amounts of context from a line to a screenful.

• It will show whether there is a tendency for particular words to occur in particular parts of a text.
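
The kind of distribution display described above can be approximated by dividing a text into equal parts and counting a word in each. This is only a sketch of the idea, not TACT's own procedure; the function name and the choice of four parts are assumptions.

```python
def distribution(tokens, keyword, n_parts=4):
    """Count occurrences of keyword in each of n_parts equal slices of the text."""
    counts = [0] * n_parts
    for i, tok in enumerate(tokens):
        if tok.lower() == keyword.lower():
            # Map token position i to the index of the slice it falls in.
            counts[i * n_parts // len(tokens)] += 1
    return counts


tokens = "a b a c d a a a".split()
print(distribution(tokens, "a"))  # [1, 1, 1, 2]
```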

TACT

• Calculates z-scores to measure the relationship between observed and expected frequencies of the co-occurrence of items in a corpus.

• Operating system: MS-DOS

• Shareware
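
One common way to compute a collocation z-score is sketched below. Whether TACT uses exactly this formula is an assumption; the sketch treats each token in the windows around the node word as an independent trial whose chance of being the collocate equals the collocate's overall relative frequency. All parameter names are invented.

```python
import math


def collocation_z_score(observed, collocate_freq, corpus_size, node_freq, span):
    """z = (observed - expected) / standard deviation, binomial approximation.

    observed        co-occurrences of the collocate within the node's windows
    collocate_freq  frequency of the collocate in the whole corpus
    corpus_size     number of tokens in the corpus
    node_freq       frequency of the node word
    span            number of tokens examined around each node occurrence
    """
    p = collocate_freq / corpus_size      # chance that any token is the collocate
    window_tokens = node_freq * span      # tokens examined in total
    expected = p * window_tokens
    std_dev = math.sqrt(window_tokens * p * (1 - p))
    return (observed - expected) / std_dev


print(round(collocation_z_score(12, 100, 10_000, 50, 8), 2))  # 4.02
```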

LEXA

• A package containing 60 interrelated programs for many kinds of linguistic analysis.

• The programs range from a lemmatizer and tagger to software which lists all the types in a text or corpus with frequency distribution. Other programs transfer user-tagged items to a database for further processing, identify and retrieve all tokens of user-specified grammatical patterns or particular word strings, or make concordances in various formats.

The Longman Mini Concordancer (LMC)

• A simple concordancer which is very fast and easy to use and particularly accessible for language learners who are exploring the distribution of vocabulary in a text and for seeing authentic corpus-based examples of words in context.

• It gives access to wordlists and concordances with various sorting options

MicroConcord

• An MS-DOS concordancer. It can search an ASCII-format corpus of several million words very quickly without indexing.

• MicroConcord provides a large range of options for searching, sorting, browsing, editing and formatting the analyses of words, part words or strings of words in varied context size for KWIC concordances or sentences.

VocaProfile

• An MS-DOS program for comparing the vocabulary overlap of different texts and the proportion of any text made up of words from pre-determined frequency lists.
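
Both measures can be sketched as below. This is not the program's own algorithm: coverage is taken as the share of tokens found in a given word list, and overlap is measured here as shared types divided by all types in either text, which is only one of several possible definitions.

```python
def coverage(tokens, word_list):
    """Proportion of the text's tokens that appear in a frequency list."""
    listed = {w.lower() for w in word_list}
    if not tokens:
        return 0.0
    return sum(1 for t in tokens if t.lower() in listed) / len(tokens)


def vocabulary_overlap(tokens_a, tokens_b):
    """Shared word types divided by all word types in either text."""
    types_a = {t.lower() for t in tokens_a}
    types_b = {t.lower() for t in tokens_b}
    all_types = types_a | types_b
    return len(types_a & types_b) / len(all_types) if all_types else 0.0


text = "the cat sat on the mat".split()
print(coverage(text, ["the", "on"]))  # 0.5
```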

International Corpus of English Corpus Utility Program (ICECUP)

• ICECUP can search for words, affixes, word combinations, word-class tags and other markup, count them, sort them by frequency or by alphabetical order before or after the keyword, and display them in lists or concordances.

WordSmith

• Provides more detailed analyses of the frequencies of concordanced items and extracts collocational information easily.

• Searches on complex search arguments including tags, wildcards, and/or/not operators and discontinuous sequences.

SARA

• Distributed with the British National Corpus.

• It is compatible with SGML and is aware of annotations.

XKwic

• Also known as CQP