Structured vs Unstructured: Extracting Information from Classics Scholarly Texts



Matteo Romanello

Centre for Computing in the Humanities

PhD Seminar, London, 28/01/2010



Overview

Introduction

Motivations and Background

Methodology

Work Phases

Expected Results






The Project at a glance

- Project started in October 2009
- Field of application: Digital Humanities, Classics (particularly Greek literature)
- Co-supervision between the CCH and the CS department at King’s -> application of Computational Linguistics methods



Goal

Devising an automatic system to improve information retrieval over a discipline-specific corpus of unstructured texts

- focus on secondary sources
- automatic -> scalable to large amounts of data
- information retrieval -> the task of retrieving relevant information from a collection
- unstructured texts -> raw texts (e.g. .txt files) as opposed to structured/encoded XML






The Million Book Library

- archive.org, Google Books -> growth in the volume of information available in electronic format
- longer “shelf-life” of books in Classics/Humanities
- results of traditional search engines -> high recall but low precision
- need for effective tools to access information for research purposes



Information extraction in Classics

- lack of tools comparable to CiteSeer, CiteSeerX, GoPubMed for other disciplines
- are JSTOR’s features/functionalities enough for scholarly purposes?
- still issues with the encoding of Ancient Greek (e.g., The +$%j& of Danaids)



Access points to information

- going beyond TOCs or string-matching-based IR
- access points meaningful for Classics scholars

Contribution to research

- problems peculiar to Classics can help improve the performance of existing tools/algorithms
- analysis of papers published in a Classics journal (or archive) as corpus



Mining and information extraction from classics texts

- no ad-hoc gold standards/training sets
- lack of tools specifically tailored to Classics resources
- electronically available text does not mean electronic text

Possible corpus analysis

- citation patterns
- citation and co-citation networks
- trends in Classics citation practice
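
As an illustration of the co-citation idea listed above, here is a minimal sketch in Python. It assumes the references have already been extracted and normalized; the article identifiers and the `citations` mapping are hypothetical toy data, not results from the corpus.

```python
from collections import Counter
from itertools import combinations

# Hypothetical toy data: each article mapped to the set of works it cites.
citations = {
    "article_1": {"Hom. Il.", "Aesch. Sept.", "Ar. Arch."},
    "article_2": {"Hom. Il.", "Aesch. Sept."},
    "article_3": {"Hom. Il.", "Ar. Arch."},
}

def cocitation_counts(citations):
    """Count how often each pair of works is cited by the same article."""
    counts = Counter()
    for cited in citations.values():
        # sorted() gives every unordered pair a single canonical key
        for pair in combinations(sorted(cited), 2):
            counts[pair] += 1
    return counts

counts = cocitation_counts(citations)
```

The resulting counts can be read as edge weights of a co-citation network: two works are linked more strongly the more articles cite them together.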






Finding Mentions of Realia

- mentions of realia are information that matters -> importance of print indexes in Classics
- using realia as access points to information
- identifying mentions of realia
- disambiguation, different spellings or translations of names

Kinds of realia we are interested in extracting

1. Place names (ancient and modern)
2. Relevant person names (mythological names, ancient authors, modern scholars)
3. References to primary and secondary sources (canonical texts and modern publications about them)



Reuse of Structured Information

Over recent years scholars have produced several structured datasources:

- use of structured information to train machine-learning-based tools to mine unstructured texts
- related projects: EROCS by IBM
- current practice: Wikipedia/DBpedia as datasource of structured information
- what improvements can be gained by using a discipline-specific Knowledge Base?








Corpus building

Getting materials

- Crawling online archives

Characteristics of considered corpora

- Open Access -> publicly accessible
- Possibly multilingual

Extracting the text from collected documents

- Tools for text extraction from PDF -> open issues with Ancient Greek encoding
- re-OCR documents, even the natively digital ones



Corpus Building II

Corpora

- Princeton/Stanford Working Papers in Classics (PSWPC)
- Lexis
- 300 articles in 2 corpora

OCR

- FineReader
- OCRopus (layout analysis)
- text extracted from PDFs (tools like pdftotext etc.)



Structured datasources

- Information about the same entities (i.e. realia) can be spread over several datasources
  - partial overlaps
- Datasources can use different formats (text, DB, HTML, XML etc.)
  - no interoperability



Structured datasources II

To create a semantic knowledge base (KB)

- import each datasource
- map it to high-level ontologies (e.g., CIDOC-CRM)
- find overlaps between datasources -> aligning the records

The resulting knowledge base will be used as support for all the text processing tasks
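
The record-alignment step can be illustrated with a minimal Python sketch. It is only one possible approach, assuming that records carry a textual label and that overlaps are found by comparing normalized labels; the source names, record identifiers and the `normalize` and `align` helpers are hypothetical.

```python
import unicodedata
from collections import defaultdict

def normalize(name):
    """Lowercase and strip accents and punctuation so spelling variants compare equal."""
    decomposed = unicodedata.normalize("NFKD", name)
    no_marks = "".join(c for c in decomposed if not unicodedata.combining(c))
    return "".join(c for c in no_marks.lower() if c.isalnum())

def align(datasources):
    """Group records from several datasources that share a normalized label."""
    groups = defaultdict(list)
    for source, records in datasources.items():
        for record in records:
            groups[normalize(record["label"])].append((source, record["id"]))
    # keep only entities attested in more than one datasource
    return {k: v for k, v in groups.items() if len({s for s, _ in v}) > 1}

# Hypothetical toy records from two datasources.
datasources = {
    "source_a": [{"id": "a1", "label": "Aeschylus"}, {"id": "a2", "label": "Hesiod"}],
    "source_b": [{"id": "b7", "label": "AESCHYLUS"}, {"id": "b9", "label": "Callimachus"}],
}
aligned = align(datasources)
```

Label normalization only catches trivial variants. Translated forms such as "Callimaco" and "Callimachus" would not be aligned this way and need additional evidence, for example shared identifiers or attributes.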



Corpus Processing

1. sentence identification
2. entity extraction (named entity recognition + disambiguation)
   - KB employed to build up an entity context
3. canonical references extraction
   - KB provides training data
4. modern bibliographic references extraction
   - KB provides lists of journals/place names/authors to improve the performance of the tool
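
The first two steps above can be sketched in Python as follows. This is a deliberately naive illustration, not the system described in these slides: the `KB` dictionary is a hypothetical stand-in for the knowledge base, and entity extraction is reduced to a dictionary lookup without disambiguation.

```python
import re

# Hypothetical KB excerpt: surface form -> entity identifier.
KB = {"Homer": "author:homer", "Athens": "place:athens"}

def split_sentences(text):
    """Naive splitter: break after ., ! or ? when followed by whitespace and a capital letter."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+(?=[A-Z])", text) if s.strip()]

def find_entities(sentence, kb):
    """Return (surface form, entity id) for every KB name occurring as a whole word."""
    found = []
    for name, entity_id in kb.items():
        if re.search(r"\b" + re.escape(name) + r"\b", sentence):
            found.append((name, entity_id))
    return found

text = "Homer is cited often. The festival took place in Athens."
result = [(s, find_entities(s, KB)) for s in split_sentences(text)]
```

Note that such a splitter breaks abbreviated citations like "Hom. Il. XII 1" into several pieces, which is one reason why sentence identification in this kind of text needs more than a punctuation rule.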



Canonical References Extraction

1. citations used specifically for primary sources (i.e. works of ancient authors)
2. essential entry point to information: they refer to the research object, i.e. ancient texts
3. logical instead of physical citation scheme (e.g., chapter/paragraph vs. page)
4. variation -> time, style, language (regexp insufficient!)

Example

Hom. Il. XII 1
Aesch. ’Sept.’ 565-67, 628-30; Ar. ’Arch.’ 803
Hes. fr. 321 M.-W.
Callimaco, ’ep.’ 28 Pf., 5-6
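
To make the point about regular expressions concrete, here is a small Python sketch. The pattern is an assumption written only for this illustration: it expects an abbreviated author, an abbreviated work, an optional Roman-numeral book and a line number or range. It handles the first example above and fails on the others.

```python
import re

# Assumed pattern: "Author. Work. [Book] lines", e.g. "Hom. Il. XII 1".
PATTERN = re.compile(
    r"(?P<author>[A-Z][a-z]+\.)\s+"
    r"(?P<work>[A-Z][a-z]+\.)\s+"
    r"(?:(?P<book>[IVXLC]+)\s+)?"
    r"(?P<lines>\d+(?:-\d+)?)"
)

def parse(citation):
    """Return the parsed fields, or None if the citation does not fit the pattern."""
    match = PATTERN.fullmatch(citation)
    return match.groupdict() if match else None

homer = parse("Hom. Il. XII 1")
hesiod = parse("Hes. fr. 321 M.-W.")              # fragment number + editors' initials
callimachus = parse("Callimaco, ’ep.’ 28 Pf., 5-6")  # Italian name form, quoted title
```

Each variant could be covered by a further hand-written rule, but the number of rules grows with every period, style and language encountered.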






Results

- Automatically provide multiple meaningful entry points to information
- Enrich the corpus with links to resources (particularly primary sources)
- Improve user access to the corpus
- Demonstrate the scalability of the approach

Tools/Resources

- Knowledge Base for Classics
- Articles with improved text quality
- Corpora released
- single tools for information extraction (e.g. Canonical References Extractor)

