Corpus 03 Corpus Analysis
Corpus analysis
- Annotation: Lemmatization, Tagging, Parsing
- Listing, Sorting, Counting, Concordancing
- Tools
Lemmatization The process of classifying together all the identical or related forms of a word under a common head word. Problem areas: irregular forms and homonyms. Automatic lemmatizers employ two processes:
Lemmatization by means of a set of rules: go, going, goes and went are grouped under the head word go. Lemmatization by means of affix stripping: if a word form does not match any rule in the affix rule system and is not listed as a specific exception, the form is treated as a lemma in its own right. Automatic lemmatization still requires substantial manual input involving subjective decisions.
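The two processes can be sketched in a few lines of code. This is a minimal illustration, not a real lemmatizer: the irregular-form table, suffix list and length threshold are all invented for the example.

```python
# Sketch of the two lemmatization processes: a rule table for irregular
# forms, then affix stripping, with the fallback that an unmatched form
# becomes its own lemma. All rules here are illustrative only.

IRREGULAR = {"went": "go", "goes": "go", "going": "go"}  # rule table
SUFFIXES = ["ing", "ed", "es", "s"]                      # affixes to strip

def lemmatize(word):
    word = word.lower()
    if word in IRREGULAR:              # rule-based lookup first
        return IRREGULAR[word]
    for suf in SUFFIXES:               # then affix stripping
        if word.endswith(suf) and len(word) > len(suf) + 2:
            return word[: -len(suf)]
    return word                        # fallback: form is its own lemma

# "corpus" wrongly becomes "corpu" -- exactly the kind of error that
# makes manual input and subjective decisions necessary.
print([lemmatize(w) for w in ["went", "walking", "walked", "corpus"]])
```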
Word-class tagging Tagset: a set of descriptions and tags used for automatic tagging. It may include major word classes and their inflectional variants, major function words, certain important individual lexemes, certain punctuation marks and a small number of discourse markers. Figure 4.1 (p. 210): the tagset used for tagging the Brown Corpus. There is no obvious way of improving the accuracy of rule-based taggers: the set of tags and the rule-based program have to be modified to take account of what actually occurs in the texts, i.e. the probability of occurrence of particular sequences of words or tags.
Word-class tagging Probabilities derived from earlier corpus tagging show which part of speech a particular homograph is most likely to belong to. CLAWS (Constituent Likelihood Automatic Word-tagging System): CLAWS1 mis-tagged 3% to 4% of word tokens. Later versions of CLAWS were improved by reducing the size of the tagset to about 60 tags, developing new portmanteau tags, expanding the lexicon to about 10,000 items, and enhancing the idiom list of word sequences.
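The idea behind constituent-likelihood tagging can be sketched as follows: for a homograph, choose the tag that maximises the product of how often the word carries that tag and how often that tag follows the preceding tag. The counts below are invented for illustration, not taken from any real tagged corpus.

```python
# Hedged sketch of CLAWS-style probabilistic homograph resolution.
# Counts stand in for probabilities derived from earlier corpus tagging.

WORD_TAG = {"record": {"NN": 70, "VB": 30}}        # word-tag counts
TAG_BIGRAM = {("TO", "VB"): 80, ("TO", "NN"): 5,   # tag-sequence counts
              ("AT", "NN"): 90, ("AT", "VB"): 2}

def best_tag(word, prev_tag):
    candidates = WORD_TAG[word]
    # score = (how often word takes tag) * (how often tag follows prev_tag)
    return max(candidates,
               key=lambda t: candidates[t] * TAG_BIGRAM.get((prev_tag, t), 1))

print(best_tag("record", "TO"))  # after "to": the verb reading wins
print(best_tag("record", "AT"))  # after an article: the noun reading wins
```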
The general tagging procedure Figure 4.3: SGML markup before CLAWS (p. 216). Figure 4.4: after word-class tagging with CLAWS (p. 216). Figure 4.5: post-CLAWS processing by SANTA CLAWS (p. 217).
Factors that affect tagging Language theories: the tagset reflects a particular theory of language. The theoretical assumptions we make about the relationship between morphology, syntax and semantics can affect the types of tags assigned to words. Size of the tagset: the necessary degree of delicacy of analysis. In general, the greater the number of tag categories, the greater the potential for tagging error, but also the greater the precision of analysis available.
Parsing Parsing is a more demanding task, involving not only annotation but also linguistic analysis. Treebank: a collection of labeled constituent structures or phrase markers. A parsed corpus provides a labeled analysis of each sentence to show how the various words function. Two main approaches: probabilistic and rule-based.
Difficulties for rule-based approach Some features of natural languages (prepositional phrase attachment, coordination, discontinuous elements, ellipsis, pragmatic features) cause difficulty for rule-based analysis. Under generative grammar, parsers are derived from schemes of grammatical rules: rules in a predetermined grammar and lexicon. If the rules cannot generate a sentence, the sentence is considered ungrammatical. But ambiguous sentences have to be interpreted with further semantic and pragmatic information.
Probabilistic parsing The treebank provides statistics on the frequency of grammatical structures for the automatic parser to use subsequently in the probabilistic analysis of unrestricted text.
Examples of parsing Rule-based: Lancaster-Leeds Treebank (Figure 4.9, p. 234): based on a context-free phrase-structure grammar with a huge number of grammatical rules. Sentence: I next wondered if she would like to bear down on Shaftesbury Avenue and see a play.
Examples of parsing Probabilistic parsing: LOB Corpus Treebank (Figure 4.10, p. 235): with a parsing scheme based on the experience and statistics derived from the Lancaster-Leeds Treebank. Sentence: it arose during talks following President Kennedy's report to the British Prime Minister of the outcome of his recent visit to Paris.
Examples of parsing Improved probabilistic parsing: UCREL skeleton parsing (Figure 4.11, p. 235): tagging is best done automatically with manual post-editing, using specially developed software to speed up data entry. Sentence: well let's now turn our attention to cricket once more it's been a year when test matches the ubiquitous one day internationals proliferated around the world.
Examples of parsing Souter (1991): a probabilistic parser compatible with systemic functional grammar. Manually parsing the Polytechnic of Wales Corpus of 60,800 words resulted in over 4,500 syntactic rules, most of which occurred infrequently.
Examples of parsing Penn Treebank: coarse surface-structure analysis (Figure 4.12, p. 237). SUSANNE: Surface and Underlying Structural Analyses of Naturalistic English.
Examples of parsing Sampson noted that available parsed corpora have been analysed with varying degrees of delicacy, not all of which show the deep structure or logical relationships in the analysed sentences. The SUSANNE Corpus analytic scheme attempts to specify norms and a comprehensive checklist of linguistic phenomena for the annotation of modern English in terms of both its underlying or logical structure and its surface grammar. Figure 4.13 (p. 239): the SUSANNE Corpus.
Corpus analysis: Listing Lemmatization produces a smaller list. Different arrangements serve different purposes: alphabetical order of the last letter, which groups words with the same suffixes (Table 4.3, p. 248); word length (Table 4.4, p. 249); descending order of frequency (Table 4.5, p. 250).
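The three arrangements can be sketched with a toy word list. The sample text is invented; the point is only how each ordering is computed.

```python
# Sketch of the three list arrangements: descending frequency,
# word length, and last-letter (reverse-spelling) order, which
# brings words sharing a suffix together.

from collections import Counter

text = "the cat sat on the mat and the dog ran on the mat".split()
freq = Counter(text)

by_frequency = freq.most_common()                     # Table 4.5 style
by_length = sorted(freq, key=len)                     # Table 4.4 style
by_last_letter = sorted(freq, key=lambda w: w[::-1])  # Table 4.3 style

print(by_frequency[0])  # most frequent type with its count
print(by_last_letter)   # "cat", "mat", "sat" end up adjacent
```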
Corpus analysis: Concordance A formatted version or display of all the occurrences or tokens of a particular type in a corpus. Two types: Batch generation: find every occurrence in the corpus and make a file containing each occurrence with a predetermined amount of context. Pre-indexed type: if the corpus has been pre-processed by computer to index each word, the software can find almost instantly all the occurrences of a type, and the size of the context can also be easily altered. Most usual format: Key Word in Context concordance (KWIC).
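A batch-style KWIC concordance is simple to sketch: every token of the type is printed with a fixed amount of left and right context, the keyword aligned in a centre column. The sample text is invented.

```python
# Minimal batch-generation KWIC concordance: scan for every token of
# a type and format it with `width` characters of context on each side.

def kwic(tokens, keyword, width=20):
    lines = []
    for i, tok in enumerate(tokens):
        if tok.lower() == keyword:
            left = " ".join(tokens[:i])[-width:]
            right = " ".join(tokens[i + 1:])[:width]
            lines.append(f"{left:>{width}}  {tok}  {right}")
    return lines

tokens = "the bank raised rates while the river bank flooded".split()
for line in kwic(tokens, "bank"):
    print(line)
```

Lining up the keyword this way is what makes the "company words keep", and the different senses of a type, easy to scan by eye.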
Corpus analysis: Concordance Examples: Figure 4.17 (unsorted), 4.18 (right-sorted), 4.19 (left-sorted). A concordance can provide information on the company words keep in a corpus. It also shows the different senses of a word type.
Corpus analysis: Concordance Figure 4.20: a tagged corpus can reveal collocational behavior not only in terms of particular word sequences but also in terms of the distribution of word-class sequences. Amount of context: 80 characters for an A4 page, 130 characters for wide-carriage paper, or the whole sentence.
Statistics in corpus analysis
- Number of different word types
- Mean sentence length
- Number of sentences containing particular numbers of words
- Number of sentences in the text
- Proportion of sentences with more than a specified number of words
Possible statistical inferences
- Statistical significance of occurrence at a greater-than-chance frequency
- Chi-square analysis
- ANOVA
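As an illustration of testing greater-than-chance co-occurrence, a 2x2 chi-square can be computed directly. The contingency counts below are invented; the statistic is compared against the critical value 3.84 (df = 1, p = 0.05).

```python
# Hedged sketch of a chi-square test for co-occurrence significance,
# using the shortcut formula for a 2x2 contingency table.

def chi_square_2x2(a, b, c, d):
    """a = both words present, b = word1 only, c = word2 only, d = neither."""
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))

stat = chi_square_2x2(30, 70, 20, 880)   # invented counts
print(round(stat, 2), stat > 3.84)       # True = significant at p < 0.05
```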
Software
- OCP
- WordCruncher
- TACT
- LEXA
The Oxford Concordance Program (OCP) A batch program for making wordlists, concordances and indexes. OCP operates with three main files: the researcher specifies the name of the text file to be analysed, the command file which specifies what analyses are to be carried out, and the output file which will contain the product of the analyses. Operating system: MS-DOS. Commercial software.
WordCruncher Two separate programs: 1. a batch process to index a text file or corpus, producing a series of specially annotated files; 2. a menu-driven program to locate data in the pre-indexed text. It provides fast retrieval of all the tokens of morphemes, words, phrases and collocations, with both adjacent and discontinuous collocates. Operating system: MS-DOS. Commercial software.
TACT Originally designed to show where vocabulary is distributed in a text, and which characters in a literary work tend to be associated with the use of particular words and with what frequency. TACT can display lists of all the word forms in a text with the number of tokens, with user-defined and variable amounts of context from a line to a screenful. It will show whether particular words tend to occur in particular parts of a text.
TACT Calculates z scores to measure the relationship between observed and expected frequencies of the co-occurrence of items in a corpus. Operating system: MS-DOS. Shareware.
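The z-score measure can be sketched as follows: how far the observed co-occurrence frequency departs from what an even spread of the collocate through the corpus would predict. The frequencies and the span-based expectation formula are illustrative assumptions, not TACT's exact computation.

```python
# Sketch of a collocation z score: (observed - expected) / sd,
# where "expected" assumes the collocate is spread evenly through
# the corpus within `span` words of each node token. Values invented.

import math

def z_score(observed, freq_node, freq_collocate, corpus_size, span=4):
    p = freq_collocate / corpus_size      # chance rate of the collocate
    expected = p * freq_node * span       # chance co-occurrences in the spans
    return (observed - expected) / math.sqrt(expected * (1 - p))

z = z_score(observed=25, freq_node=200, freq_collocate=500, corpus_size=100_000)
print(round(z, 2))  # well above ~2: a stronger-than-chance collocation
```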
LEXA A package containing 60 interrelated programs for many kinds of linguistic analysis. The programs range from a lemmatizer and tagger to software which lists all the types i