Collocation Frequency as a Readability Factor, Anagnostou

Collocation frequency as a readability factor

George R. S. Weir and Nikolaos K. Anagnostou

Department of Computer and Information Sciences

University of Strathclyde,

Richmond Street, Glasgow, G1 1XH, U.K.

[email protected], [email protected] Abstract A readability measure that can reliably estimate the difficulty level of sample texts has great potential in ESL teaching. We argue for the inclusion of collocation frequency as a generic factor in estimates of ESL text readability, since the readability of any text will be affected adversely by the presence of collocations. Collocational word combinations are seldom amenable to comprehension solely on the basis of acquaintance with individual word components, so they present particular difficulties for L2 learners. Our paper describes the measure of average collocation frequency, how this is derived from a reference corpus and can be applied as a factor in estimating textual readability. A software tool, based upon this approach, is presently in development. Keywords Readability, collocation, average collocation frequency, ESL. Introduction Readability formulae work by using quantifiable textual aspects, in order to estimate the difficulty inherent in that text. Commonly, the key factors considered in readability measures are word length and sentence length, or variations on these constructs. These aspects are founded in readability studies (Dale and Chall, 1945). Since the introduction of computer-based textual analysis, newer factors such as word frequency can be included in readability formulae. The frequency of words, as derived from large reference corpora, reflects a viable factor in estimating readability since more common words are likely to be familiar to more readers. Thereby, a text composed mainly of highly common words is likely to prove more readable (more comprehensible). In spite of the plausibility of including word frequency in readability measures, there are few reported examples (cf. Weir and Ritchie, 2007; Stenner et al., 1988). The logic underlying a focus on word frequency

as an affective factor in readability also extends to frequency of word sequences. For this reason, our research activity on readability considers the impact of word sequences. This work has two strands. Firstly, the impact on text readability from the presence of n-grams (word combinations of length n) will also reflect the commonality of such word sequences, i.e., the more frequently any n-gram appears in general usage, the more likely that it will be familiar to a reader and thereby have less impact upon the difficulty of a text than a sequence of similar length but with lower general frequency of occurrence. The second strand of our investigation of word sequence influence on readability centers on collocations. We follow the sense of Manning and Schutze (1999), who describe collocations as any turn of phrase or accepted usage where somehow the whole is perceived to have an existence beyond the sum of its parts (p.29). Choueka (1988), offers a similar description: [a collocation is] a sequence of two or more consecutive words, that has characteristics of a syntactic and semantic unit, and whose exact and unambiguous meaning cannot be derived directly from the meaning or connotation of its components. In the context of our research, the significant characteristic of a collocation is that its meaning is not simply derivable from the meaning of its constituent words. Since their composite meaning cannot be derived from an understanding of their components, from a readability perspective, collocations are semantically opaque. Because collocations have this complex semantics, their sophistication presents particular difficulties for language learners. Inevitably, language learners, particularly L2 learners who are non-native to the language culture, lack exposure to the contexts and usages that imbue collocations with meaning. This complexity in collocations can not be captured at the tractable levels of sentences, words

or syllables and thereby constitutes a layer of semantics that is not currently considered in existing readability measures. If we can accommodate a plausible measure of collocational influence upon the readability of a text, then we should be able to ascertain a more accurate estimate for the semantic difficulty of that text. In what follows, we propose a method for gauging the degree of collocational influence on any sample English text. 1 Estimating collocational impact Our procedure has three steps: quantification, scaling and aggregation. In the first step, we quantify the number and frequency of occurrence for collocations in a sample text. Next, the scaling step accounts for the effect of each particular collocation. This is derived by reference to an external measure of likely familiarity for collocations and is factored by the frequency of occurrence of the collocation in the sample text. Finally, we aggregate the individual measures of collocational influence in order to arrive at an estimate for overall impact of collocations upon the sample text. This approach to measuring the collocational impact upon a text is similar to that used for gauging the impact of individual words upon readability (Campbell & Weir, 2007). In both cases, the key requirement is a frequency list derived from a reference such as the British National Corpus (Burnard, 2000). Our external measure of likely familiarity for a collocation reflects the frequency of that collocation in a reference corpus. We assume that any collocation (indeed, any word sequence) that occurs with high frequency in a plausible reference corpus is more likely to be familiar to the language user than another collocation that has lower frequency of occurrence. In order to derive such measures, we create a collocation frequency list from a reference corpus such as the British National Corpus (or, for initial proof of concept, the BNC Baby). Collocation frequency lists are similar in nature to word frequency lists, in which the two main fields are the word type and its frequency in the corpus. Corpus frequency is given either as a percentage (relative frequency) or as a number of occurrences (absolute frequency). In similar vein, collocation frequency lists measure the frequency of collocations rather than individual words. We propose that the frequency of a particular collocation in the reference collocation frequency list is an indicator of its semantic opacity, or simply put, its comprehensibility. As such, higher frequency of occurrence for a collocation signifies a

greater likelihood that it will be understood. Thereby, on the assumption that we can generate a reference frequency list for collocations, we have a plausible means of automating the measurement of collocational impact for sample English texts. 2 Collocation extraction The use of a reference collocation frequency list is central to the procedure we propose for gauging collocational impact. Ideally, this frequency list is derived from a large representative reference corpus. Since such corpora are not available with collocations pre-identified, a requirement for our frequency list is a means of collocation extraction or identification. Once the collocations have been identified, we can readily count their frequency of occurrence in the reference corpus and thereby populate our reference collocation frequency list. 2.1 Association measures Association measures are the criteria employed to decide whether any specific sequence of words qualifies as a collocation. In keeping with Wermter and Hahn (2004), such measures may be classified as:

frequency-based measures (e.g., based on absolute and relative co-occurrence frequencies);

information-theoretic measures (e.g., mutual information, entropy);

statistical measures (e.g., chi-square, t-test, log-likelihood, Dices coefficient);

The most common association measures used in collocation extraction are T-score, Mutual information, log-likelihood, Dice coefficient and Z-score (cf. Evert, 2004, p. 21, McEnery and Wilson, 2001, p. 86-87). Based upon such techniques, a range of software tools is available that aims to identify a list of the collocations in a given text. For the most part, these approaches are more or less noisy, in the sense that they identify as likely collocations, some multi-word units that would not be so considered by English language users. Such software tools are further detailed in Anagnostou & Weir (2007). Despite the inherent noisiness in current collocation extraction techniques, we have selected the Collocate program (Barlow, 2004): as a basis for our proof of concept approach to gauging collocational impact upon readability. Collocate provides two main functions for collocation extraction, called Extract and Full

Extract. . The first of these applies filters such as word/phrase or word/tag combinations, along with regular expressions. While Extract is geared towards targeted searches of collocations, Full Extract allows for the comprehensive extraction of n-grams and collocation candidates from a corpus and is better suited to general collocation identification and thereby, to our requirements. Consequently, this approach was used for producing our reference collocation frequency list and was also used to identify the collocations present in our sample texts. Armed with our reference collocation frequency list and a method for identifying the collocations present in any sample text, we are able to perform the scaling step in our three part measurement process. This scaling takes the reference frequency of a collocation instance and multiplies this by the number of occurrences of this collocation in the sample text, divided by the total number of all collocations in the sample text. This provides a collocational impact factor for each individual collocation in the sample text, i.e., fi*ni/nc where fi is the reference frequency for collocation instance i, ni is the absolute number of occurrences of the collocation instance i in the sample text and nc is the total number of different collocations in the sample text. Having produced a measure of impact upon the sample text for each individual collocation instance, we proceed to the final (aggregation) step in our measurement process. 3 Average Collocation Frequency Our aggregation step allows us to combine the impact measures of individual collocations in a sample text, in order to derive a single metric for the whole text. This metric we term the average collocation frequency (ACF), and this is given by the formula:

Where:

ACF = average collocation frequency nc = total collocation occurrences in sample text m = number of different collocations in sample text fi = frequency of collocation i in reference corpus ni = number of instances of collocation i in sample text

Consider the following scenario, using the BNC Baby (approximately five million words) as the reference corpus. The collocation manna from heaven appears once in the BNC Baby, thus it has a frequency of 0.2 occurrences per million words

(pmw). This is a relatively rare collocation, therefore it is rather difficult and its impact on the semantic opacity of a passage that includes it will be significant. We also take the view that the semantic impact of the collocation is increased if it appears more than once in a text. In other words, the higher the frequency of occurrence of a difficult collocation in a sample text, the harder understanding the text is going to be. Any measure of semantic difficulty based upon collocations needs to accommodate this fact, and this is the role that the relative weight of a collocation plays in the ACF. (Of course, we could consider the repeated appearance of a collocation as reduced in its semantic impact. This would accommodate the idea that repeated use may assist the reader to interpret the meaning of that collocation.) The following example helps to illustrate how the ACF is calculated for any sample text. In this instance, we have a text containing three different types (m=3) of collocations. In Figure 1, column ni indicates how many times each collocation appears in the sample text. By adding the cells in this column we can calculate nc, or the total number of collocation occurrences in the sample text. The cells in column fi are populated with the frequency of each collocation in the reference corpus, in this case, measured as occurrences per million words (pmw).

Collocations found in the sample text ni fi put up with n1 = 2 f1 = 60 pmw

kick the bucket n2 = 1 f2 = 22 pmwpull yourself together n3 = 4 f3 = 46 pmw

m = 3

nc = n1 + n2 + n3 = 2 + 1 + 4 = 7number of different types of collocations

in sample text total number of collocation occurrences

in sample text

number of occurrences of

collocation type iin sample text

frequency of collocation

type i in reference

corpus

Figure 1: Factoring collocations Figure 1 provides the data required in order to proceed to the calculation of the ACF (Figure 2).

3

1

1 1 1 160 2 22 1 46 47 7 7 7

2 1 4 120 22 18460 22 46 46.6 pwm7 7 7 7

i ii

acf f n=

= = + + =+ += + + =

frequency of collocation put up with in the language

relative weight of collocation put up with in the sample

text

Frequency of a hypothetical collocation, which, should it

substitute all the collocations in the sample text, would have the

same impact on semantic difficulty

pmw

Figure 2: Deriving a value for the ACF As illustrated in Figure 2, the ACF acts like a replacement or hypothetical collocation, which

would have the same effect on the semantic difficulty of the text as all the collocations in the considered text. In this example, the ACF is measured in occurrences per million words. The ACF unit of measurement is always the unit of the collocation frequencies in the reference collocation frequency list. For instance, if the collocation frequencies were percentages, the ACF would return a percentage. Whatever unit of measure is applied in the ACF calculation, we believe that the result is a plausible estimate of the aggregate impact of the collocations present in the sample text. 4 Conclusions and further work Given the inherent difficulty that collocations present to English language learners, a method that can quantify such impact in sample texts has considerable potential as a teaching aid. Furthermore, such a measure can serve as a semantic factor in estimating the readability of texts and thereby supplement the word and sentence-based factors conventionally employed in such techniques. We propose the Average Collocation Frequency as filling such roles and have argued for its plausibility as a gauge for the semantic impact of collocations in any sample English text. A software tool is under development for use in conjunction with a collocation extraction facility (such as Collocate). This new tool will generate ACF measurements based upon the approach and the factors described in this paper. The prototype of this system, entitled ACFCalc, is illustrated in Figure 3. This requires a reference collocation frequency list and a sample text collocation frequency list. From these inputs, the average collocation frequency for the sample text is derived.

Figure 3: ACFCalc prototype

This prototype is described further in Anagnostou & Weir (2008). We anticipate that this software facility will be made generally available to the research community in due course. References Anagnostou, N. K. and Weir, G. R. S. (2007).

Review of software applications for deriving collocations, in G. R. S. Weir & T. Ozasa (Eds), Texts, Textbooks and Readability. University of Strathclyde Publishing, Glasgow, pp. 63-72.

Anagnostou, N. K. and Weir, G. R. S. (2008). The ACFCalc Tool, in Proceedings of ICTATLL 2008, Sri Lanka (forthcoming).

Barlow, M. (2004). Collocate 1.0: Locating collocations and terminology. Collocate User Manual.

Burnard, L. (2000). Users Reference Guide for the British National Corpus. Technical report, Oxford University Computing Services.

Campbell, G. and Weir, G. R. S. (2007) Matching Readers to Texts, in G. R. S. Weir & T. Ozasa (Eds), Texts, Textbooks and Readability. University of Strathclyde Publishing, Glasgow, pp. 49-55.

Choueka, Y. (1988). Looking for needles in a haystack. Proceedings of RIAO 88, pp. 609623.

Dale, E. and Chall, J.S. (1948). A formula for predicting readability. Educational research bulletin, 27, pp. 11-20, 37-54.

Evert, S. (2004). The Statistics of Word Cooccurrences (Word Pairs and Collocations). Ph. D. dissertation, Universitat Stuttgart.

Manning, C. D. and Schutze H. (1999). Foundations of Statistical Natural Language Processing. MIT Press, Cambridge, MA.

McEnery, A.M. and Wilson, A. (2001). Corpus Linguistics. Edinburgh: Edinburgh University Press.

Stenner, A. J., Horabin, I., Smith, D. R. and Smith, R. (1988). The Lexile Framework. Durham. NC: Metametrics, Inc.

Weir, G. R. S. and Ritchie, C. (2007). Estimating Readability with the Strathclyde Readability Measure, in G. R. S. Weir & T. Ozasa (Eds), Texts, Textbooks and Readability. University of Strathclyde Publishing, Glasgow, pp. 26-33.

Wermter, J. and Hahn. U. (2004). Collocation extraction based on modifiability statistics. Proceedings of the 20th International Conference on Computational Linguistics, Geneva, Switzerland.

Documents

Collocation Frequency as a Readability Factor, Anagnostou