Structured Vs Unstructured: Extracting Information from Classics Scholarly Texts

DESCRIPTION

PhD seminar presentation at CCH/KCL

Page 1

Introduction Motivations Methodology WorkPhases ExpectedResults

Structured Vs Unstructured: Extracting Information From Classics Scholarly Texts

Matteo Romanello, Centre for Computing in the Humanities

PhD Seminar, London, 28/01/2010

Extracting Information From Classics Scholarly Texts CCH

Page 2

Overview

Introduction

Motivations and Background

Methodology

Work Phases

Expected Results

Page 4

The Project at a glance

- Project started in October 2009
- Field of application: Digital Humanities, Classics (particularly Greek literature)
- Co-supervision between the CCH and the CS department at King’s -> application of Computational Linguistics methods

Page 5

Goal

Devising an automatic system to improve information retrieval over a discipline-specific corpus of unstructured texts

- focus on secondary sources
- automatic -> scalable to huge amounts of data
- information retrieval -> the task of retrieving information
- unstructured texts -> raw texts (e.g. .txt files) as opposed to structured/encoded XML

Page 7

The Million Book Library

- archive.org, Google Books -> growth in the volume of information available in electronic format
- longer “shelf-life” of books in Classics/Humanities
- results of traditional search engines -> high recall but low precision
- need for effective tools to access information for research purposes
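
To make the "high recall but low precision" point concrete, here is a minimal Python sketch of the two measures. The document identifiers and the query scenario are invented for illustration and do not come from the project.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a single query.

    precision = |retrieved ∩ relevant| / |retrieved|
    recall    = |retrieved ∩ relevant| / |relevant|
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical query: a plain string search returns 100 documents,
# and all 5 documents the scholar actually needs are among them.
retrieved = [f"doc{i}" for i in range(100)]
relevant = ["doc3", "doc17", "doc42", "doc58", "doc99"]

precision, recall = precision_recall(retrieved, relevant)
print(precision, recall)
```

In this made-up case nothing relevant is missed, yet 95 of the 100 results are noise the scholar has to sift through by hand.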

Page 8

Information extraction in Classics

- lack of tools comparable to Citeseer, CiteseerX, GoPubMed for other disciplines
- are JSTOR’s features/functionalities enough for scholarly purposes?
- still issues with the encoding of Ancient Greek (e.g., The +$%j& of Danaids)

Page 9

Access points to information

- going beyond TOCs or string-matching-based IR
- access points meaningful for Classics scholars

Contribution to research

- problems peculiar to Classics can help to improve the performance of existing tools/algorithms
- analysis of papers published in a Classics journal (or archive) as corpus

Page 10

Mining and information extraction from classics texts

- no ad-hoc gold standards/training sets
- lack of tools specifically tailored to Classics resources
- electronically available text does not mean electronic text

Possible corpus analysis

- citation patterns
- citation and co-citation networks
- trends in Classics citation practice
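
As a rough illustration of a co-citation network: two sources are co-cited when the same paper cites both of them. The sketch below counts such pairs in Python. The paper ids and their reference lists are invented, and the function name is only illustrative.

```python
from collections import Counter
from itertools import combinations


def cocitation_counts(citations):
    """Count how often each pair of sources is cited by the same paper.

    `citations` maps a citing paper to the list of sources it cites.
    """
    counts = Counter()
    for cited in citations.values():
        # sorted() gives every pair one canonical order.
        for pair in combinations(sorted(set(cited)), 2):
            counts[pair] += 1
    return counts


# Hypothetical data: paper ids and reference lists are made up.
citations = {
    "paper_a": ["Hom. Il.", "Aesch. Sept.", "Hes. fr."],
    "paper_b": ["Hom. Il.", "Aesch. Sept."],
    "paper_c": ["Hom. Il.", "Hes. fr."],
}

counts = cocitation_counts(citations)
for pair, n in counts.most_common():
    print(pair, n)
```

The pair counts can serve as edge weights of a co-citation graph. This presupposes that the citations have already been extracted and normalised to one form per source.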

Page 12

Finding Mentions of Realia

- mentions of realia are information that matters -> importance of print indexes in Classics
- using realia as access points to information
- identifying mentions of realia
- disambiguation, different spellings or translations of names

Kinds of realia we are interested in extracting

1. Place names (ancient and modern)
2. Relevant person names (mythological names, ancient authors, modern scholars)
3. References to primary and secondary sources (canonical texts and modern publications about them)

Page 13

Reuse of Structured Information

Over the last few years, scholars have produced several structured datasources:

- use of structured information to train machine-learning-based tools to mine unstructured texts
- related projects: EROCS by IBM
- current practice: Wikipedia/DBpedia as datasource of structured information
- what improvements can be gained by using a discipline-specific Knowledge Base?

Page 16

Corpus building

Getting materials

Crawling online archives

Characteristics of considered corpora

- Open Access -> publicly accessible
- Possibly multilingual

Extracting the text from collected documents

- Tools for text extraction from PDF -> open issues with Ancient Greek encoding
- Re-OCR documents, even the natively digital ones
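
One way to decide which extracted documents need re-OCR is to flag tokens that look like encoding debris. The Python sketch below is only a heuristic assumed for illustration, not the project's method: it reports tokens in which fewer than half of the characters are letters or digits. The function name and both thresholds are hypothetical.

```python
def suspicious_tokens(text, min_length=3, min_letter_share=0.5):
    """Return whitespace-separated tokens that are mostly non-alphanumeric."""
    flagged = []
    for token in text.split():
        if len(token) < min_length:
            continue  # too short to judge
        share = sum(ch.isalnum() for ch in token) / len(token)
        if share < min_letter_share:
            flagged.append(token)
    return flagged


# Garbled Greek word as it appears in extracted text.
sample = "The +$%j& of Danaids"
flagged = suspicious_tokens(sample)
print(flagged)
```

Correctly encoded Greek consists of Unicode letters and therefore passes the check. Debris that happens to be made of ordinary Latin letters would go unnoticed, so the heuristic can only serve as a first filter.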

Page 17

Corpus Building II

Corpora

- Princeton/Stanford Working Papers in Classics (PSWPC)
- Lexis
- 300 articles in 2 corpora

OCR

- FineReader
- OCRopus (layout analysis)
- text extracted from PDFs (tools like pdftotext etc.)

Page 18

Structured datasources

- Information about the same entities (i.e. realia) can be spread over several datasources
- partial overlaps
- Datasources can use different formats (text, DB, HTML, XML etc.)
- no interoperability

Page 19

Structured datasources II

To create a semantic knowledge base (KB)

- import each datasource
- map it to high-level ontologies (e.g., CIDOC-CRM)
- find overlaps between datasources -> aligning the records

The resulting knowledge base will be used as support for all the text processing tasks
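
The alignment step can be sketched in Python under a strong simplifying assumption: two records refer to the same entity when their names are identical after lower-casing and removing diacritics and punctuation. The datasource names, record ids and helper functions below are all hypothetical.

```python
import unicodedata
from collections import defaultdict


def normalise(name):
    """Lower-case a name and strip diacritics and punctuation."""
    decomposed = unicodedata.normalize("NFKD", name)
    kept = [ch for ch in decomposed if ch.isalnum() or ch.isspace()]
    return " ".join("".join(kept).lower().split())


def align(datasources):
    """Group record ids from several datasources by normalised name."""
    groups = defaultdict(list)
    for source, records in datasources.items():
        for record_id, name in records.items():
            groups[normalise(name)].append((source, record_id))
    # Keep only names that occur in more than one datasource.
    return {
        key: ids for key, ids in groups.items()
        if len({source for source, _ in ids}) > 1
    }


# Hypothetical datasources and identifiers.
datasources = {
    "gazetteer": {"g1": "Athens", "g2": "Delphi"},
    "lexicon": {"x7": "ATHENS", "x9": "Thebes"},
}
overlaps = align(datasources)
print(overlaps)
```

Exact matching on normalised names will not catch translated or variant forms of the same name (for instance an Italian and an English spelling), so it is only a starting point for finding overlaps.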

Page 20

Corpus Processing

1. sentence identification
2. entity extraction (named entity recognition + disambiguation)
   - KB used to build up an entity context
3. canonical references extraction
   - KB provides training data
4. modern bibliographic references extraction
   - KB provides lists of journals/place names/authors to improve the performance of the tool
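
The first two steps can be sketched in Python as a naive sentence splitter followed by a dictionary lookup against the knowledge base. This is an illustration under stated assumptions, not the project's implementation: the `KB` excerpt, its identifiers and both function names are invented, and a real system would use a trained recogniser rather than exact lookup.

```python
import re

# Hypothetical knowledge-base excerpt: surface form -> (entity id, type).
KB = {
    "Aeschylus": ("kb:aeschylus", "person"),
    "Argos": ("kb:argos", "place"),
    "Danaids": ("kb:danaids", "person"),
}


def split_sentences(text):
    """Naive splitter: break after . ! or ? when a capital letter follows."""
    return re.split(r"(?<=[.!?])\s+(?=[A-Z])", text.strip())


def find_entities(sentence, kb=KB):
    """Return (surface form, entity id, type) for every KB name found."""
    found = []
    for token in re.findall(r"\w+", sentence):
        if token in kb:
            entity_id, entity_type = kb[token]
            found.append((token, entity_id, entity_type))
    return found


text = "Aeschylus sets the play in Argos. The Danaids arrive as suppliants."
sentences = split_sentences(text)
entities = [find_entities(s) for s in sentences]
print(entities)
```

A splitter this simple would also break inside abbreviated references such as "Hom. Il. XII 1", which is one reason sentence identification is listed as a step of its own.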

Page 21

Canonical References Extraction

1. citations used specifically for primary sources (i.e. works of ancient authors)
2. essential entry point to information: they refer to the research object, i.e. Ancient Texts
3. logical instead of physical citation scheme (e.g., chapter/paragraph vs. page)
4. variation -> time, style, language (regexp insufficient!)

Example

Hom. Il. XII 1
Aesch. ’Sept.’ 565-67, 628-30; Ar. ’Arch.’ 803
Hes. fr. 321 M.-W.
Callimaco, ’ep.’ 28 Pf., 5-6
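
To see why a regular expression alone falls short, the Python sketch below runs one deliberately naive pattern over the citations from the example. The pattern is an assumption made for illustration (abbreviated author, abbreviated work, optional Roman-numeral book, line number) and is not taken from the project.

```python
import re

# Deliberately naive: "Author. Work. [book in Roman numerals] line[-line]".
NAIVE_REFERENCE = re.compile(
    r"\b[A-Z][a-z]+\.\s+[A-Z][a-z]+\.\s+(?:[IVXLC]+\s+)?\d+(?:-\d+)?"
)

# The citations from the example, one per string.
examples = [
    "Hom. Il. XII 1",
    "Aesch. ’Sept.’ 565-67, 628-30",
    "Ar. ’Arch.’ 803",
    "Hes. fr. 321 M.-W.",
    "Callimaco, ’ep.’ 28 Pf., 5-6",
]

matched = [ref for ref in examples if NAIVE_REFERENCE.search(ref)]
missed = [ref for ref in examples if not NAIVE_REFERENCE.search(ref)]

print("matched:", matched)
print("missed:", missed)
```

Only the Homer reference matches. Quoted work titles, fragment numbers with editor sigla, and an author name written out in Italian each break the pattern, and every patch for one style risks over-matching elsewhere.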

Page 23

Results

- Automatically provide multiple meaningful entry points to information
- Enrich the corpus with links to resources (particularly primary sources)
- Improve user access to the corpus
- Demonstrate the scalability of the approach

Tools/Resources

- Knowledge Base for Classics
- Articles with improved text quality
- Released corpora
- Individual tools for information extraction (e.g. Canonical References Extractor)
