
A Comparison of Features for Automatic Readability Assessment


Page 1: A Comparison of Features for Automatic Readability Assessment

Lijun Feng (1), Matt Huenerfauth (1), Martin Jansche (2), Noémie Elhadad (3)
(1) City University of New York; (2) Google, Inc.; (3) Columbia University
Coling 2010: Poster Volume, pages 276–284, Beijing, August 2010
Presenter: 劉憶年, 2014/4/21

Page 2: Outline

Introduction
Corpus
Features
Experiments and Discussion
Conclusions

Page 3: Motivation and Method

Readability assessment quantifies the difficulty with which a reader understands a text. Automatic readability assessment enables the selection of appropriate reading material for readers of varying proficiency. We use grade levels, which indicate the number of years of education required to completely understand a text, as a proxy for reading difficulty. We treat readability assessment as a classification task and evaluate trained classifiers in terms of their prediction accuracy.

Page 4: Related Work

Many traditional readability metrics are linear models with a few (often two or three) predictor variables based on superficial properties of words, sentences, and documents. These traditional metrics are easy to compute and use, but they are not reliable, as demonstrated by several recent studies in the field. With the advancement of natural language processing tools, a wide range of more complex text properties have been explored at various linguistic levels. In addition to lexical and syntactic features, several researchers have started to explore discourse-level features and examine their usefulness in predicting text readability.
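A classic example of such a two-variable metric is the Flesch-Kincaid grade level, which combines only average sentence length and average syllables per word. A minimal sketch (the regex tokenization and syllable counter below are rough heuristics added for illustration, not part of the paper):

```python
import re

def flesch_kincaid_grade(text):
    """Flesch-Kincaid grade level: a traditional two-variable readability metric."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    # Crude syllable estimate: count vowel groups in each word, at least one per word.
    syllables = sum(max(1, len(re.findall(r"[aeiouy]+", w.lower()))) for w in words)
    words_per_sentence = len(words) / len(sentences)
    syllables_per_word = syllables / len(words)
    return 0.39 * words_per_sentence + 11.8 * syllables_per_word - 15.59
```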

Page 5: Corpus

We contacted the Weekly Reader corporation, an on-line publisher producing magazines for elementary and high school students, and were granted access in October 2008 to an archive of their articles. While pre-processing the texts, we found that many articles, especially those for lower grade levels, consist of only puzzles and quizzes, often in the form of simple multiple-choice questions. We discarded such texts and kept only 1433 full articles.

Page 6: Entity-Density Features (1/2)

Conceptual information is often introduced in a text by entities, which consist of general nouns and named entities, e.g. people's names, locations, organizations, etc. These are important in text comprehension, because established entities form basic components of concepts and propositions, on which higher-level discourse processing is based. We hypothesized that the number of entities introduced in a text relates to the working memory burden on their targeted readers, individuals with intellectual disabilities. We defined entities as the union of named entities and general nouns (nouns and proper nouns) contained in a text, with overlapping general nouns removed.
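As an illustration of how such entity counts might be extracted, here is a minimal sketch using spaCy (the choice of spaCy is an assumption for this example, not the toolkit used in the paper):

```python
import spacy

nlp = spacy.load("en_core_web_sm")

def entity_density_features(text):
    """Counts of named entities and general nouns, per document and per sentence."""
    doc = nlp(text)
    named = {tok.i for ent in doc.ents for tok in ent}        # tokens inside named entities
    nouns = {tok.i for tok in doc if tok.pos_ in ("NOUN", "PROPN")}
    entities = named | nouns                                  # union, overlaps counted once
    n_sents = max(1, len(list(doc.sents)))
    return {
        "total_entities": len(entities),
        "entities_per_sentence": len(entities) / n_sents,
        "unique_entity_strings": len({doc[i].lemma_.lower() for i in entities}),
    }
```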

Page 7: Entity-Density Features (2/2)

We believe entity-density features may also relate to the readability of a text for a general audience.

Page 8: Lexical Chain Features

During reading, a more challenging task with entities is not just to keep track of them, but to resolve the semantic relations among them, so that information can be processed, organized, and stored in a structured way for comprehension and later retrieval.

The length of a chain is the number of entities contained in the chain; the span of a chain is the distance between the indices of the first and last entities in the chain. A chain is defined to be active for a word or an entity if the chain passes through its current location.
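A minimal sketch of these chain-based measures, assuming each chain is represented simply as a sorted list of token positions at which its entities occur (the chain construction itself, e.g. via semantic relatedness, is omitted):

```python
def chain_features(chains):
    """Length, span, and active-chain statistics for lexical chains.

    `chains` is a list of chains; each chain is a sorted list of token indices
    where the chain's entities appear in the document.
    """
    lengths = [len(c) for c in chains]
    spans = [c[-1] - c[0] for c in chains]

    def active_chains_at(position):
        # A chain is active at a position if it starts before and ends after it.
        return sum(1 for c in chains if c[0] <= position <= c[-1])

    return {
        "num_chains": len(chains),
        "avg_chain_length": sum(lengths) / len(chains),
        "avg_chain_span": sum(spans) / len(chains),
        "max_active_chains": max(active_chains_at(p) for c in chains for p in c),
    }
```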

Page 9: Coreference Inference Features

Relations among concepts and propositions are often not stated explicitly in a text. Automatically resolving implicit discourse relations is a hard problem. The ability to resolve referential relations is important for text comprehension.

Inference distance is the difference between the index of the referent and that of its pronominal reference. If the same referent occurs more than once in a chain, the index of the closest occurrence is used when computing the inference distance.
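A minimal sketch of this measure, assuming the referent's mention positions and the pronoun's position are already known (how the coreference chains are obtained is left out):

```python
def inference_distance(referent_positions, pronoun_position):
    """Distance from a pronoun back to the nearest preceding mention of its referent."""
    earlier = [p for p in referent_positions if p < pronoun_position]
    closest = max(earlier)  # nearest preceding occurrence of the referent
    return pronoun_position - closest

# Example: referent mentioned at tokens 3 and 17, pronoun at token 21 -> distance 4.
assert inference_distance([3, 17], 21) == 4
```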

Page 10: Entity Grid Features

Coherent texts are easier to read. Each text is abstracted into a grid that captures the distribution of entity patterns at the level of sentence-to-sentence transitions. The entity grid is a two-dimensional array, with one dimension corresponding to the salient entities in the text and the other corresponding to each sentence of the text.
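A minimal sketch of how such a grid and the resulting transition features might be built, assuming each sentence has already been reduced to the set of entities it mentions (the original entity-grid model also records syntactic roles such as subject or object, which is omitted here):

```python
from collections import Counter
from itertools import product

def entity_grid_features(sentence_entities):
    """Distribution of presence/absence transition patterns between adjacent sentences.

    `sentence_entities` is a list of sets, one per sentence, of entity strings.
    """
    entities = sorted(set().union(*sentence_entities))
    # Grid: rows = entities, columns = sentences, cell = 'X' (present) or '-' (absent).
    grid = {e: ["X" if e in s else "-" for s in sentence_entities] for e in entities}

    # Count transitions between adjacent sentence columns for every entity row.
    transitions = Counter()
    for row in grid.values():
        for a, b in zip(row, row[1:]):
            transitions[(a, b)] += 1
    total = sum(transitions.values()) or 1
    # Probability of each of the four possible transition patterns (XX, X-, -X, --).
    return {f"{a}{b}": transitions[(a, b)] / total for a, b in product("X-", repeat=2)}
```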

Page 11: Language Modeling Features

We use grade levels to divide the whole corpus into four smaller subsets. In addition to implementing Schwarm and Ostendorf's information-gain approach, we also built LMs based on three other types of text sequences for comparison purposes: the word-token-only sequence (i.e., the original text), the POS-only sequence, and the paired word-POS sequence. For each grade level, we use the SRI Language Modeling Toolkit (with Good-Turing discounting and Katz backoff for smoothing) to train 5 language models (1- to 5-gram) using each of the four text sequences, resulting in 4 × 5 × 4 = 80 perplexity features for each text tested.
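A minimal sketch of how such perplexity features could be generated by shelling out to the SRILM command-line tools ngram-count and ngram. The file-naming scheme is an assumption, the information-gain word selection step is omitted, and only three of the four sequence types are listed:

```python
import re
import subprocess

GRADES = [2, 3, 4, 5]
SEQUENCES = ["word", "pos", "wordpos"]  # plus the information-gain sequence in the paper
ORDERS = [1, 2, 3, 4, 5]

def train_lm(train_file, order, lm_file):
    # SRILM's ngram-count applies Good-Turing discounting with Katz backoff by default.
    subprocess.run(["ngram-count", "-text", train_file, "-order", str(order),
                    "-lm", lm_file], check=True)

def perplexity(lm_file, order, test_file):
    out = subprocess.run(["ngram", "-lm", lm_file, "-order", str(order),
                          "-ppl", test_file], capture_output=True, text=True, check=True)
    return float(re.search(r"ppl=\s*([\d.]+)", out.stdout).group(1))

def lm_features(test_file):
    """One perplexity value per (grade level, sequence type, n-gram order) combination."""
    return {f"g{g}_{seq}_{n}gram": perplexity(f"lm_g{g}_{seq}_{n}.lm", n, test_file)
            for g in GRADES for seq in SEQUENCES for n in ORDERS}
```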

Page 12: Parsed Syntactic Features

Our parsed syntactic features focus on clauses (SBAR), noun phrases (NP), verb phrases (VP), and prepositional phrases (PP).
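A minimal sketch of counting these constituents, assuming sentences have already been parsed into Penn Treebank-style bracketed trees (read here with NLTK's Tree class; the parser itself is not shown):

```python
from nltk.tree import Tree

LABELS = ("SBAR", "NP", "VP", "PP")

def parse_features(bracketed_parses):
    """Average counts of selected constituent types per sentence."""
    counts = {label: 0 for label in LABELS}
    for parse in bracketed_parses:
        tree = Tree.fromstring(parse)
        for subtree in tree.subtrees():
            if subtree.label() in counts:
                counts[subtree.label()] += 1
    n = len(bracketed_parses)
    return {f"avg_{label}_per_sentence": c / n for label, c in counts.items()}

# Example with one pre-parsed sentence:
print(parse_features(["(S (NP (DT The) (NN cat)) (VP (VBD sat) (PP (IN on) (NP (DT the) (NN mat)))))"]))
```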

Page 13: POS-based Features

Part-of-speech-based grammatical features were shown to be useful in readability prediction. To extend prior work, we systematically studied a number of common categories of words and investigated to what extent they are related to a text's complexity. We focus primarily on five classes of words (nouns, verbs, adjectives, adverbs, and prepositions) and two broad categories (content words and function words). Content words include nouns, verbs, numerals, adjectives, and adverbs; the remaining types are function words.
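A minimal sketch using NLTK's off-the-shelf POS tagger, assuming its tokenizer and tagger models are installed; the tag-to-class mapping below is an illustrative assumption and may differ from the paper's exact groupings:

```python
import nltk

CONTENT_PREFIXES = ("NN", "VB", "JJ", "RB", "CD")  # nouns, verbs, adjectives, adverbs, numerals

def pos_features(text):
    """Per-word-class ratios plus the content/function word split."""
    tagged = nltk.pos_tag(nltk.word_tokenize(text))
    n = len(tagged)

    def ratio(prefixes):
        return sum(1 for _, tag in tagged if tag.startswith(prefixes)) / n

    content = ratio(CONTENT_PREFIXES)
    return {
        "noun_ratio": ratio(("NN",)),
        "verb_ratio": ratio(("VB",)),
        "adj_ratio": ratio(("JJ",)),
        "adv_ratio": ratio(("RB",)),
        "prep_ratio": ratio(("IN",)),  # IN also covers subordinating conjunctions
        "content_ratio": content,
        "function_ratio": 1 - content,
    }
```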

Page 14: Shallow Features

Shallow features, which are limited to superficial text properties, are computationally much less expensive than syntactic or discourse features.
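A minimal sketch of a few typical shallow features; the exact feature set used in the paper is not reproduced here:

```python
import re

def shallow_features(text):
    """Superficial document statistics: lengths of words, sentences, and the document."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    return {
        "num_sentences": len(sentences),
        "num_words": len(words),
        "avg_sentence_length": len(words) / len(sentences),
        "avg_word_length": sum(len(w) for w in words) / len(words),
        "long_word_ratio": sum(1 for w in words if len(w) >= 7) / len(words),
    }
```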

Page 15: Other Features

For comparison, we replicated the 6 out-of-vocabulary features described in Schwarm and Ostendorf (2005). We also replicated the 12 perplexity features implemented by Schwarm and Ostendorf (2005).

Page 16: Experiments and Discussion

In our research, we have used various models, including linear regression; standard classification (Logistic Regression and SVM), which assumes no relation between grade levels; and ordinal regression/classification (provided by Weka, with Logistic Regression and SMO as base functions), which assumes that the grade levels are ordered. In this paper, we present the best results, which are obtained by the standard classifiers. We evaluate classification accuracy using repeated 10-fold cross-validation on the Weekly Reader corpus. Classification accuracy is defined as the percentage of texts predicted with correct grade levels. We repeat each experiment 10 times and report the mean accuracy and its standard deviation.
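A minimal sketch of this evaluation protocol using scikit-learn rather than Weka (an illustrative substitution; the paper uses Weka's implementations):

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

def evaluate(X, y):
    """Mean and standard deviation of accuracy over 10 repetitions of 10-fold CV.

    X: feature matrix (documents x features), y: grade-level labels.
    """
    clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
    cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=10, random_state=0)
    scores = cross_val_score(clf, X, y, cv=cv, scoring="accuracy")
    return scores.mean(), scores.std()
```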

Page 17: Discourse Features

We first discuss the improvement made by extending our earlier entity-density features (Feng et al., 2009). With the earlier features only, the model achieves 53.66% accuracy; with our new features added, it reaches 59.63%. Table 6 presents the classification accuracy of models trained with discourse features.

Page 18: Language Modeling Features

Table 7 compares the performance of models generated using our approach and our replication of Schwarm and Ostendorf's (2005) approach.

Page 19: Parsed Syntactic Features (1/2)

Table 8 compares a classifier trained on the four parse features of Schwarm and Ostendorf (2005) to a classifier trained on our expanded set of parse features.

Page 20: Parsed Syntactic Features (2/2)

Table 9 shows a detailed comparison of particular parsed syntactic features.

Page 21: POS-based Features

The classification accuracy of models trained with various POS features is presented in Table 10.

Page 22: Shallow Features

We present some notable findings on shallow features in Table 11.

Page 23: Comparison with Previous Studies

Using the same experiment design, we train classifiers with three combinations of our features, as listed in Table 12.

Page 24: Conclusions

Discourse features do not seem to be very useful in building an accurate readability metric. The reason could lie in the fact that the texts in the corpus we studied exhibit relatively low complexity, since they are aimed at primary-school students. In future work, we plan to investigate whether these discourse features exhibit different discriminative power for texts at higher grade levels.