Cicero’s Hardest Sentence?:
Measuring Readability for Latin Literature with CLTK
Patrick J. Burns
Institute for the Study of the Ancient World
Classical Language Toolkit
Global Philology Open Conference. Universität Leipzig, 22.02.17
What is the Classical Language Toolkit?
The Classical Language Toolkit (CLTK) is a free and open-source Python package that offers natural language processing (NLP) support for the languages of Ancient, Classical, and Medieval Eurasia.
Language-specific tokenizers, lemmatizers, POS-taggers, morphological parsers, etc. are available, under development, or in the feature-request list. Latin and Greek functionality is currently most complete.
Who is the Classical Language Toolkit?
● Open-source community collaborating at https://github.com/cltk
● Founded by Kyle P. Johnson, Classics PhD from NYU and NLP Research Scientist at Accenture
● Academic Advisors: Gregory Crane (Leipzig/Tufts), Neil Coffee (Buffalo), Peter Meineck (NYU), Leonard Muellner (Brandeis/CHS)
● CLTK Archive developer: Luke Hollis
CLTK Goals
● Low: Good analysis-friendly corpora/datasets for NLP of historical languages (Latin, Ancient/Classical Greek, Egyptian hieroglyphs, Hebrew, Sanskrit, Tibetan, Classical Chinese, etc.)
● Medium: Collect & generate linguistic data for quantified classics
● High: Framework for an integrated study of the ancient world
CLTK Stats
● Began 2014
● 1,702 commits at https://github.com/cltk/cltk
● 38 contributors
● 43 watchers, 166 stars, 117 forks
● 37 people, 18 teams
● 46 releases (with Zenodo DOI for every release)
● 83% code coverage
● Supports POSIX OS (and partially Windows)
● 2016 Google Summer of Code participating organization (with application in for 2017)
Google Summer of Code 2016
● CLTK work on Backoff Latin Lemmatizer
● Modeled after NLTK Backoff POS Tagger
● Series of trained and rules-based lemmatizers run in sequence
● Can be “tuned” for specific languages
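The idea of lemmatizers "run in sequence" can be illustrated with a minimal sketch. The class names, the tiny lookup table, and the suffix rule below are all hypothetical and are not the CLTK implementation; they only show how each stage either answers or defers to the next one in the chain.

```python
class BackoffLemmatizer:
    """Hypothetical base class: answer if possible, else defer to `backoff`."""

    def __init__(self, backoff=None):
        self.backoff = backoff

    def choose_lemma(self, token):
        # Subclasses return a lemma, or None when they have no answer.
        raise NotImplementedError

    def lemmatize_one(self, token):
        lemma = self.choose_lemma(token)
        if lemma is None and self.backoff is not None:
            return self.backoff.lemmatize_one(token)
        return lemma

    def lemmatize(self, tokens):
        return [(token, self.lemmatize_one(token)) for token in tokens]


class DictLemmatizer(BackoffLemmatizer):
    """Looks tokens up in a form-to-lemma table (stands in for a trained model)."""

    def __init__(self, lookup, backoff=None):
        super().__init__(backoff)
        self.lookup = lookup

    def choose_lemma(self, token):
        return self.lookup.get(token)


class SuffixRuleLemmatizer(BackoffLemmatizer):
    """Applies the first matching (suffix, replacement) rule."""

    def __init__(self, rules, backoff=None):
        super().__init__(backoff)
        self.rules = rules

    def choose_lemma(self, token):
        for suffix, replacement in self.rules:
            if token.endswith(suffix) and len(token) > len(suffix):
                return token[: -len(suffix)] + replacement
        return None


class IdentityLemmatizer(BackoffLemmatizer):
    """Last resort: return the token unchanged."""

    def choose_lemma(self, token):
        return token


# Illustrative chain: dictionary first, then one suffix rule, then identity.
chain = DictLemmatizer(
    {"est": "sum"},
    backoff=SuffixRuleLemmatizer([("orum", "us")], backoff=IdentityLemmatizer()),
)
result = chain.lemmatize(["est", "amicorum", "et"])
# → [('est', 'sum'), ('amicorum', 'amicus'), ('et', 'et')]
```

"Tuning" for a language would then amount to choosing which stages to include, in which order, and with which data or rules.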
What is Classic Readability?
“[Readability] is the sum total (including the interactions) of all those elements within a given piece of printed material that affects the success that a group of readers have with it. The success is the extent to which they understand it, read it at an optimum speed, and find it interesting.”—Edgar Dale & Jeanne Chall
Dale, E. & Chall, J. 1949. “The Concept of Readability,” Elementary English 26: 19-26.
Classic Readability
Classic readability refers to the measurement of relative ease or difficulty of reading material through two main measurements:
● Sentence complexity (e.g. number of words)
● Lexical difficulty
Example of readability in practice
Code at https://github.com/diyclassics/ritchies/blob/master/ritchies.ipynb
Wheelock’s Latin Chapter 1 Sententiae Antiquae
● 3.5 words per sentence
● 19.2 characters per sentence
● 5.49 char./word per sentence

Wheelock’s Latin Chapter 40 Sententiae Antiquae
● 11.75 words per sentence
● 73.65 characters per sentence
● 6.27 char./word per sentence
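Averages of this kind can be computed with a few lines of plain Python. The sketch below is only an approximation of the linked notebook: it assumes a naive split on sentence-final punctuation instead of a trained tokenizer, and it assumes that characters are counted with spaces and internal punctuation included. The function name and the sample text are illustrative.

```python
import re


def sentence_stats(text):
    # Naive split on ., ! or ?; a stand-in for a trained sentence tokenizer.
    sentences = [s.strip() for s in re.split(r"[.!?]+", text) if s.strip()]
    words_per_sent = [len(s.split()) for s in sentences]
    chars_per_sent = [len(s) for s in sentences]  # includes spaces and commas
    n = len(sentences)
    total_words = sum(words_per_sent)
    total_chars = sum(chars_per_sent)
    return {
        "words_per_sentence": total_words / n,
        "chars_per_sentence": total_chars / n,
        "chars_per_word": total_chars / total_words,
    }


sample = "Labor me vocat. Mone me, si erro."
stats = sentence_stats(sample)
# stats["words_per_sentence"] → 3.5
# stats["chars_per_sentence"] → 15.0
```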
Latin Readability Goals
Long-term goals of this project are:
1. to develop objective measurements for comparing different Latin authors, works, parts of works, etc.
2. to use comparative measurements to allow students to be matched with reading material at an appropriate level, but especially to help emergent readers move from textbooks to reading Latin texts with the least amount of difficulty.
Key CLTK code for readability
from cltk.tokenize.sentence import TokenizeSentence
from cltk.tokenize.word import WordTokenizer

sent_tokenizer = TokenizeSentence('latin')
word_tokenizer = WordTokenizer('latin')

def sent_count(text):
    sents = sent_tokenizer.tokenize(text)
    return len(sents)

def word_count(text):
    words = word_tokenizer.tokenize(text)
    return len(words)

def char_count(text):
    # Counts every character, including spaces and punctuation
    return len(text)

def syll_count(text):
    # Rough proxy: counts vowel letters rather than true syllables
    return sum(letter in 'aeiouy' for letter in text.lower())

def comp_word_count(text):
    # "Complex" words: more than three vowel letters
    words = text.split()
    comp_words = [word for word in words if syll_count(word) > 3]
    return len(comp_words)
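A quick check shows how the vowel-letter heuristic in syll_count behaves on Latin words. The function is copied here so the example runs on its own; the word list is illustrative. Diphthongs and "qu" are counted letter by letter, so some words receive more "syllables" than they have.

```python
def syll_count(text):
    # Same heuristic as above: count vowel letters, not true syllables.
    return sum(letter in 'aeiouy' for letter in text.lower())


examples = {word: syll_count(word) for word in ["movet", "quae", "causa", "commoratio"]}
# → {'movet': 2, 'quae': 3, 'causa': 3, 'commoratio': 5}
# "quae" (one syllable) and "causa" (two syllables) are overcounted.
```

Any formula built on this count, such as Flesch-Kincaid or Gunning Fog, inherits the same overcounting, so the scores are best read as relative rather than absolute.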
Automated Readability Index
Senter, R.J. & Smith, E.A. 1967. “Automated Readability Index.” Wright-Patterson AFB: iii.
via https://en.wikipedia.org/wiki/Automated_readability_index
Key CLTK code for readability
# Automated Readability Index
def ari(text):
    c, w, s = char_count(text), word_count(text), sent_count(text)
    score = 4.71 * (c / w) + 0.5 * (w / s) - 21.43
    return score
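As a rough worked example, the Wheelock’s Latin averages reported earlier can be plugged into the same formula. This is a sketch under assumptions: it treats the reported per-sentence averages as if they were corpus-level ratios, and the helper name ari_from_averages is hypothetical.

```python
def ari_from_averages(chars_per_word, words_per_sentence):
    # Same coefficients as ari(), applied to precomputed averages.
    return 4.71 * chars_per_word + 0.5 * words_per_sentence - 21.43


ch1 = ari_from_averages(5.49, 3.5)     # Chapter 1 figures, about 6.18
ch40 = ari_from_averages(6.27, 11.75)  # Chapter 40 figures, about 13.98
```

The index was calibrated on English text, so the absolute values mean little for Latin; the point is that the later chapter scores clearly higher than the first.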
# Flesch-Kincaid Grade Level
def fleschkincaid(text):
    w, s, sy = word_count(text), sent_count(text), syll_count(text)
    score = (0.39 * (w / s)) + (11.8 * (sy / w)) - 15.59
    return score

# Gunning Fog Index
def gunningfog(text):
    w, s, cw = word_count(text), sent_count(text), comp_word_count(text)
    score = 0.4 * ((w / s) + (100 * (cw / w)))
    return score

# Coleman-Liau Index
def coleman_liau(text):
    c, w, s = char_count(text), word_count(text), sent_count(text)
    cs, ss = (c / w) * 100, (s / w) * 100  # characters and sentences per 100 words
    score = (0.0588 * cs) - (0.296 * ss) - 15.8
    return score
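All of these functions divide by the word count or the sentence count, so an empty string raises ZeroDivisionError. One possible guard is sketched below; the name ari_or_none and the choice to return None are assumptions, not part of the original code, and the function takes raw counts so it runs without CLTK.

```python
def ari_or_none(chars, words, sents):
    # Return None instead of raising when there is nothing to score.
    if words == 0 or sents == 0:
        return None
    return 4.71 * (chars / words) + 0.5 * (words / sents) - 21.43


empty_score = ari_or_none(0, 0, 0)   # None
small_score = ari_or_none(30, 7, 2)  # a float, roughly 0.51
```

The same check would apply to the other formulas when scoring a corpus that may contain empty or fragmentary passages.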
Dale-Chall Readability Formula
Dale E. & Chall J. 1948. “A Formula for Predicting Readability.” Edu. Research Bulletin 27: 11–20.
via https://en.wikipedia.org/wiki/Dale–Chall_readability_formula
Key CLTK code for readability
# Note: `lemmatizer` and `dcc_lemmas_simple` are not defined on these slides;
# they are assumed to be set up elsewhere.
def difficult_words(text):
    # Based on DCC Core Vocabulary
    tokens = word_tokenizer.tokenize(text)
    lemmas = lemmatizer.lemmatize(tokens)
    difficult_words = [lemma for lemma in lemmas if lemma not in dcc_lemmas_simple]
    return len(difficult_words)

def dalechall(text):
    w, s, dw = word_count(text), sent_count(text), difficult_words(text)
    return (0.1579 * ((dw / w) * 100)) + (0.0496 * (w / s))
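The following self-contained sketch shows the same calculation without CLTK. The set core_lemmas is a tiny hypothetical stand-in for a core-vocabulary list, and the input is assumed to be already lemmatized. The optional adjust flag applies the standard Dale-Chall adjustment (add 3.6365 when more than 5% of words are difficult), which the function above leaves out; whether that adjustment is meaningful for Latin is an open question.

```python
# Hypothetical stand-in for a core-vocabulary list of lemmas.
core_lemmas = {"sum", "et", "in", "qui", "res", "ad", "non", "moveo"}


def difficult_count(lemmas, known=core_lemmas):
    # A lemma is "difficult" if it is absent from the core list.
    return sum(1 for lemma in lemmas if lemma not in known)


def dalechall_from_counts(words, sents, difficult, adjust=False):
    pct_difficult = (difficult / words) * 100
    score = 0.1579 * pct_difficult + 0.0496 * (words / sents)
    if adjust and pct_difficult > 5:
        score += 3.6365  # standard adjustment, calibrated on English
    return score


lemmas = ["nam", "et", "commoratio", "unus", "in", "res", "permultum", "moveo"]
dw = difficult_count(lemmas)  # nam, commoratio, unus, permultum → 4
raw = dalechall_from_counts(len(lemmas), 1, dw)
adjusted = dalechall_from_counts(len(lemmas), 1, dw, adjust=True)
```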
So, if we look at formal features like word & sentence length...
What is the hardest sentence in Cicero?
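One way to answer the question with formal features alone is to split a text into sentences and take the one that maximizes a length measure. The slides do not state which measure was used for the ranking; the sketch below assumes word count, uses a naive regular-expression splitter in place of the CLTK sentence tokenizer, and runs on a short illustrative sample.

```python
import re


def split_sentences(text):
    # Naive split after ., ! or ? followed by whitespace.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def hardest_sentence(text, key=lambda s: len(s.split())):
    # "Hardest" here simply means the sentence with the highest key value.
    return max(split_sentences(text), key=key)


sample = (
    "Gallia est omnis divisa in partes tres. "
    "Veni, vidi, vici. "
    "Quousque tandem abutere, Catilina, patientia nostra?"
)
longest = hardest_sentence(sample)
# → 'Gallia est omnis divisa in partes tres.'
```

Passing a different key, for example len for character count or one of the readability scores, changes the ranking criterion without changing the search.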
Cicero’s Hardest Sentence
Nam et commoratio una in re permultum movet et inlustris explanatio rerumque quasi gerantur sub aspectum paene subiectio quae et in exponenda re plurimum valent et ad inlustrandum id quod exponitur et ad amplificandum ut eis qui audient illud quod augebimus quantum efficere oratio poterit tantum esse videatur et huic contraria saepe percursio est et plus ad intellegendum quam dixeris significatio et distincte concisa brevitas et extenuatio et huic adiuncta inlusio a praeceptis caesaris non abhorrens et ab re digressio in qua cum fuerit delectatio tum reditus ad rem aptus et concinnus esse debebit propositioque quid sis dicturus et ab eo quod est dictum seiunctio et reditus ad propositum et iteratio et rationis apta conclusio tum augendi minuendive causa veritatis supralatio atque traiectio et rogatio atque huic finitima quasi percontatio eitioque sententiae suae tum illa quae maxime quasi inrepit in hominum mentis alia dicentis ac significantis dissimulatio quae est periucunda cum orationis non contentione sed sermone tractatur deinde dubitatio tum distributio tum correctio vel ante vel postquam dixeris vel cum aliquid a te ipso reicias praemunitio etiam est ad id quod adgrediare et traiectio in alium communicatio quae est quasi cum eis ipsis apud quos dicas deliberatio morum ac vitae imitatio vel in personis vel sine illis magnum quoddam ornamentum orationis et aptum ad animos conciliandos vel maxime saepe autem etiam ad commovendos personarum ficta inductio vel gravissimum lumen augendi descriptio erroris inductio ad hilaritatem impulsio anteoccupatio tum duo illa quae maxime movent similitudo et exemplum digestio interpellatio contentio reticentia commendatio vox quaedam libera atque etiam effrenatio augendi causa iracundia obiurgatio promissio deprecatio obsecratio declinatio brevis a proposito non ut superior illa digressio purgatio conciliatio laesio optatio atque exsecratio.
Cicero’s Hardest Sentence
● 1919 characters
● 280 words
● 6.85 characters per word
● almost 2 pages in LCL
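Plugging these figures into two of the formulas shows how differently they react to a single very long sentence. This is a sketch using only the counts listed above, with the sentence count taken as 1; the variable names are illustrative.

```python
chars, words, sents = 1919, 280, 1

chars_per_word = chars / words  # about 6.85

# Automated Readability Index: the sentence-length term alone contributes 140.
ari = 4.71 * (chars / words) + 0.5 * (words / sents) - 21.43

# Coleman-Liau: sentence length enters only through sentences per 100 words.
coleman_liau = 0.0588 * (chars / words * 100) - 0.296 * (sents / words * 100) - 15.8

# ari is about 150.85; coleman_liau is about 24.39
```

ARI is dominated by the words-per-sentence term, whereas Coleman-Liau barely responds to it, so the choice of formula strongly affects which sentence counts as "hardest".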
Challenges with historical languages
Latin, classical Greek, and other historical languages present unique challenges when approaching readability. Some main difficulties are:
● Fewer expert readers (and no native readers)
● All Latin/Greek words are learned in formal educational contexts
● Compressed timeline for developing “proficiency”
● Much less standardized testing data for comparison and cross-validation
Next steps
Some ideas of where to go next with readability studies for Latin/Greek:
● Identification / development of "criterion passages", i.e. passages rated on an "independent measure such as a test of reading comprehension, rate of reading, or judgments of difficulty."
● Improved, expanded word lists
● Increased attention to the role of the other primary domains of readability studies: the reader and the context.
Conclusion
“It should be remembered that readability formulas estimate difficulty on the basis of factors that have the highest prediction of difficulty. ...No readability formula is a complete and full measure of text difficulty.” —Jeanne Chall & Edgar Dale
Dale, E. & Chall, J. 1995. Readability revisited. Northampton, MA.
Select Bibliography
Anderson, J. 1983. “Lix and Rix: Variations on a Little-Known Readability Index.” Journal of Reading 26 (6): 490–496.
Bamberger, R. and A. T. Rabin. 1984. “New Approaches to Readability: Austrian Research.” The Reading Teacher 37 (6): 512–519.
Björnsson, C. H. 1983. “Readability of Newspapers in 11 Languages.” Reading Research Quarterly 18 (4): 480–497.
Chall, J. S. and E. Dale. 1995. Readability Revisited: The New Dale-Chall Readability Formula. Northampton, MA: Brookline Books.
Coleman, M. and Liau, T. L. 1975. “A Computer Readability Formula Designed for Machine Scoring,” Journal of Applied Psychology 60: 283–284.
Dale, E. and J. S. Chall. 1948. “A Formula for Predicting Readability.” Educational Research Bulletin 27 (1): 11–28.
Dale, E. and J. S. Chall. 1949. “The Concept of Readability.” Elementary English 26 (1): 19–26.
Flesch, R. 1948. “A New Readability Yardstick.” Journal of Applied Psychology 32 (3): 221–233. doi:10.1037/h0057532.
Gunning, R. 1952. The Technique of Clear Writing. New York: McGraw-Hill.
Ireland, S. 1976. “The Computer and Its Role in Classical Research.” G&R 23 (1): 40–54.
Janson, T. 1967. “Word, Syllable, and Letter in Latin.” Eranos 65: 49–64.
Moritz, M., G. Franzini, G. Crane, and B. Pavlek. 2016. “Sentence Shortening via Morpho-Syntactic Annotated Data in Historical Language Learning.” JCCH 9 (1): 1–9. doi:10.1145/2810040.
Muccigrosso, J. D. 2004. “Frequent Vocabulary in Latin Instruction.” CW 97 (4): 409–433. doi:10.2307/4352875.
Rabin, A. T. 1988. “Determining Difficulty Levels in Text Written in Languages Other than English.” In B. L. Zakaluk and S. J. Samuels, 46–76. Newark, DE: International Reading Association.
Rydberg-Cox, J. A, and A. Mahoney. 2002. “Vocabulary Building in the Perseus Digital Library.” CO 79 (4): 145–149.
Saragi, T., I. S. P. Nation, and G. F. Meister. 1978. “Vocabulary Learning and Reading.” System 6 (2): 72–78. doi:10.1016/0346-251X(78)90027-1.
Senter, R. J., and E. A. Smith. 1967. “Automated Readability Index.” AMRL-TR-66-22. Wright Patterson AFB, Ohio: Aerospace Medical Div.
Zakaluk, B. L., and S. J. Samuels. 1988. Readability: Its Past, Present, and Future. Newark, DE: International Reading Association.