PhD seminar presentation at CCH/KCL
Introduction Motivations Methodology WorkPhases ExpectedResults
Structured vs. Unstructured: Extracting Information from Classics Scholarly Texts
Matteo Romanello
Centre for Computing in the Humanities
PhD Seminar, London, 28/01/2010
Extracting Information From Classics Scholarly Texts CCH
Overview
Introduction
Motivations and Background
Methodology
Work Phases
Expected Results
The Project at a glance
- Project started in October 2009
- Field of application: Digital Humanities, Classics (particularly Greek literature)
- Co-supervision between the CCH and the CS department at King's -> application of Computational Linguistics methods
Goal
Devising an automatic system to improve information retrieval over a discipline-specific corpus of unstructured texts
- focus on secondary sources
- automatic -> scalable to huge amounts of data
- information retrieval -> the task of finding the material relevant to an information need
- unstructured texts -> raw texts (e.g. .txt files) as opposed to structured/encoded XML
The Million Book Library
- archive.org, Google Books -> growth of the volume of information available in electronic format
- longer “shelf-life” of books in Classics/Humanities
- results of traditional search engines -> high recall but low precision
- need for effective tools to access information for research purposes
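The "high recall but low precision" point can be made concrete with a small sketch. The document identifiers and relevance judgements are invented for illustration only; they do not come from any evaluation in this project.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for the results of a single query."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical query: the engine returns 10 documents; the collection holds
# 2 relevant documents and both of them are among the results.
retrieved = [f"doc{i}" for i in range(1, 11)]
relevant = ["doc3", "doc7"]

p, r = precision_recall(retrieved, relevant)
print(f"precision={p:.2f} recall={r:.2f}")
```

In this toy case every relevant document is found (recall 1.0), but the reader still has to sift through eight irrelevant hits (precision 0.2).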
Information extraction in Classics
- lack of tools comparable to what CiteSeer, CiteSeerX and GoPubMed offer for other disciplines
- are JSTOR’s features/functionalities enough for scholarly purposes?
- still issues with the encoding of Ancient Greek (e.g., The +$%j& of Danaids)
Access points to information
- going beyond tables of contents (TOCs) or string-matching-based IR
- access points meaningful for Classics scholars
Contribution to research
- problems peculiar to Classics can help to improve the performance of existing tools/algorithms
- analysis of papers published in a Classics journal (or archive) as corpus
Mining and information extraction from classics texts
- no ad-hoc gold standards/training set
- lack of tools specifically tailored to Classics resources
- electronically available text does not mean electronic text
Possible corpus analysis
- citation patterns
- citation and co-citation networks
- trends in the Classics citation practice
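As an illustration of the co-citation analysis listed above, here is a minimal sketch that counts how often two works are cited by the same article. The articles and the cited works are invented toy data, and `cocitation_counts` is a hypothetical helper, not a tool of the project.

```python
from collections import Counter
from itertools import combinations


def cocitation_counts(citations):
    """Count how often each pair of works is cited together.

    `citations` maps a citing article to the list of works it cites.
    """
    counts = Counter()
    for cited in citations.values():
        # Sorting gives every unordered pair a single, stable key.
        for pair in combinations(sorted(set(cited)), 2):
            counts[pair] += 1
    return counts


# Invented toy data: three articles and the primary sources they cite.
citations = {
    "article_a": ["Hom. Il.", "Hes. Th.", "Aesch. Sept."],
    "article_b": ["Hom. Il.", "Hes. Th."],
    "article_c": ["Hom. Il.", "Aesch. Sept."],
}

counts = cocitation_counts(citations)
for pair, n in counts.most_common():
    print(pair, n)
```

The resulting pair counts are the edge weights of a co-citation network; the citation network itself is simply the `citations` mapping read as a directed graph.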
Finding Mentions of Realia
- mentions of realia are information that matters -> importance of print indexes in Classics
- using realia as access points to information
- identifying mentions of realia
- disambiguation, different spellings or translations of names
Kinds of realia we are interested in extracting
1. Place names (ancient and modern)
2. Relevant person names (mythological names, ancient authors, modern scholars)
3. References to primary and secondary sources (canonical texts and modern publications about them)
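The problem of different spellings or translations of the same name can be sketched as a lookup against a list of known variants. The mini-gazetteer, the identifiers and the `normalize`/`lookup` helpers below are all hypothetical; a real system would draw the variants from the knowledge base and would still need context to disambiguate.

```python
import unicodedata


def normalize(name):
    """Case-fold a name and strip diacritics so that variants compare equal."""
    decomposed = unicodedata.normalize("NFD", name)
    stripped = "".join(c for c in decomposed if not unicodedata.combining(c))
    return stripped.casefold().strip()


# Hypothetical mini-gazetteer: identifier -> known spellings/translations.
GAZETTEER = {
    "person:aeschylus": ["Aeschylus", "Aischylos", "Eschilo", "Eschyle"],
    "person:callimachus": ["Callimachus", "Kallimachos", "Callimaco"],
    "place:athens": ["Athens", "Athenai", "Atene", "Athènes"],
}

# Inverted index: normalized variant -> identifiers of the entities using it.
INDEX = {}
for entity_id, variants in GAZETTEER.items():
    for variant in variants:
        INDEX.setdefault(normalize(variant), set()).add(entity_id)


def lookup(mention):
    """Return the candidate entities for a mention (empty set if unknown)."""
    return INDEX.get(normalize(mention), set())


print(lookup("Callimaco"))
print(lookup("ATHÈNES"))
```

`lookup` returns a set on purpose: when one spelling belongs to several entities, all candidates are kept and the choice is left to a later disambiguation step.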
Reuse of Structured Information
Over the last years scholars have been producing several structured datasources:
- use of structured information to train machine-learning-based tools to mine unstructured texts
- related projects: EROCS by IBM
- current practice: Wikipedia/DBpedia as datasource of structured information
- what improvements can be gained by using a discipline-specific Knowledge Base?
Corpus building
Getting materials
- Crawling online archives
Characteristics of considered corpora
- Open Access -> publicly accessible
- Possibly multilingual
Extracting the text from collected documents
- Tools for text extraction from PDF -> open issues with Ancient Greek encoding
- re-OCR documents, even the born-digital ones
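One possible way to spot the Greek encoding issue automatically is to check whether a passage that should contain Greek actually contains Greek letters after extraction. This is only a heuristic sketch: `greek_ratio` is a hypothetical helper, and it assumes that garbled Greek surfaces as non-Greek characters.

```python
import unicodedata


def greek_ratio(text):
    """Share of the alphabetic characters in `text` that are Greek letters."""
    letters = [c for c in text if c.isalpha()]
    if not letters:
        return 0.0
    # Unicode names of Greek and Greek Extended letters start with "GREEK".
    greek = [c for c in letters if unicodedata.name(c, "").startswith("GREEK")]
    return len(greek) / len(letters)


print(greek_ratio("μῆνιν ἄειδε θεά"))       # correctly extracted Greek
print(greek_ratio("The +$%j& of Danaids"))  # Greek lost during extraction
```

A passage known to quote Greek but scoring zero is a candidate for re-OCR.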
Corpus Building II
Corpora
- Princeton/Stanford Working Papers in Classics (PSWPC)
- Lexis
- 300 articles in 2 corpora
OCR
- FineReader
- OCRopus (layout analysis)
- text extracted from PDFs (tools like pdftotext etc.)
Structured datasources
- Information about the same entities (i.e. realia) can be spread over several datasources
  - partial overlaps
- Datasources can use different formats (text, DB, HTML, XML etc.)
  - no interoperability
Structured datasources II
To create a semantic knowledge base (KB):
- import each datasource
- map it to high-level ontologies (e.g., CIDOC-CRM)
- find overlaps between datasources -> aligning the records
The resulting knowledge base will support all the text processing tasks
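The record alignment step can be sketched as grouping records from different datasources under a shared key. The datasource names, identifiers and record layout are invented, and matching on an exact normalized name is a deliberate simplification; the ontology mapping is not shown.

```python
from collections import defaultdict


def align(records):
    """Group records that share the same normalized name.

    Each record is a dict with `source`, `id` and `name` keys. Only groups
    that span more than one datasource are returned.
    """
    groups = defaultdict(list)
    for record in records:
        # Case-fold and collapse whitespace to get a comparable key.
        key = " ".join(record["name"].casefold().split())
        groups[key].append((record["source"], record["id"]))
    return {
        key: members
        for key, members in groups.items()
        if len({source for source, _ in members}) > 1
    }


# Invented records from two hypothetical datasources.
records = [
    {"source": "gazetteer_a", "id": "a1", "name": "Athens"},
    {"source": "gazetteer_b", "id": "b7", "name": " athens "},
    {"source": "gazetteer_b", "id": "b9", "name": "Delphi"},
]

print(align(records))
```

Each returned group is a candidate overlap: the two "Athens" records are linked, while "Delphi", present in one datasource only, is left alone.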
Corpus Processing
1. sentence identification
2. entity extraction (named entity recognition + disambiguation)
   - KB employed to build up an entity context
3. canonical references extraction
   - KB provides training data
4. modern bibliographic references extraction
   - KB provides lists of journals/place names/authors to improve the performance of the tool
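Sentence identification, the first step above, is harder than it looks in this corpus because abbreviated authors and works ("Hom. Il.") end in full stops. The sketch below is a naive rule-based splitter, not the project's tool; the abbreviation list is hypothetical and could in principle be derived from the KB.

```python
import re

# Hypothetical abbreviation list (author names, work titles, common Latin).
ABBREVIATIONS = {"hom.", "il.", "aesch.", "sept.", "fr.", "cf.", "e.g.", "i.e."}


def split_sentences(text):
    """Split after ., ! or ? before a capital, unless an abbreviation precedes."""
    sentences, start = [], 0
    for match in re.finditer(r"[.!?]\s+(?=[A-Z])", text):
        # Last whitespace-delimited token before the candidate boundary.
        token = text[start:match.start() + 1].split()[-1].casefold()
        if token in ABBREVIATIONS:
            continue
        sentences.append(text[start:match.start() + 1].strip())
        start = match.end()
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences


text = "The simile recalls Hom. Il. XII 1. A different view is possible."
print(split_sentences(text))
```

Without the abbreviation check the same text would be cut after "Hom." and "Il.", breaking the canonical reference apart before it can be extracted.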
Canonical References Extraction
1. citations used specifically for primary sources (i.e. works of ancient authors)
2. essential entry point to information: they refer to the research object, i.e. ancient texts
3. logical instead of physical citation scheme (e.g., chapter/paragraph vs. page)
4. variation -> time, style, language (regexp insufficient!)
Example
Hom. Il. XII 1
Aesch. ’Sept.’ 565-67, 628-30; Ar. ’Arch.’ 803
Hes. fr. 321 M.-W.
Callimaco, ’ep.’ 28 Pf., 5-6
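To see why a regular expression alone is insufficient, here is a deliberately simple baseline run on the four examples above. The pattern is an assumption made for illustration (abbreviated author, abbreviated work, numeric scope); it is not the extractor proposed in this project.

```python
import re

CANONICAL_REF = re.compile(
    r"(?P<author>[A-Z][a-z]+\.)\s+"             # abbreviated author: "Hom."
    r"[’']?(?P<work>[A-Z][a-z]+\.)[’']?\s+"     # abbreviated work: "Il."
    r"(?P<scope>(?:[IVXLC]+\s+)?\d+(?:-\d+)?)"  # scope: "XII 1" or "565-67"
)


def find_refs(text):
    """Return the substrings of `text` matched by the baseline pattern."""
    return [m.group(0) for m in CANONICAL_REF.finditer(text)]


examples = [
    "Hom. Il. XII 1",
    "Aesch. ’Sept.’ 565-67, 628-30; Ar. ’Arch.’ 803",
    "Hes. fr. 321 M.-W.",
    "Callimaco, ’ep.’ 28 Pf., 5-6",
]

for example in examples:
    print(repr(example), "->", find_refs(example))
```

The baseline finds the first example and two references in the second, but it drops the second range (628-30) and misses the fragment citation and the Italian-language reference entirely. Every new variant would need another hand-written rule, which is the case for a trainable approach.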
Results
- Automatically provide multiple meaningful entry points to information
- Enrich the corpus with links to resources (particularly primary sources)
- Improve user access to the corpus
- Demonstrate the scalability of the approach
Tools/Resources
- Knowledge Base for Classics
- Articles with improved text quality
- Corpora released
- Single tools for information extraction (e.g. Canonical References Extractor)