Structured and Unstructured: Extracting Information From Classics Scholarly Texts
Matteo Romanello, Centre for Computing in the Humanities, King’s College London
Graduate Colloquium - DHSI 2010, University of Victoria BC - 8th June 2010
Romanello CCH Extracting Information From Scholarly Texts

DESCRIPTION

Slides of the talk given at the DHSI 2010 Graduate Colloquium at UVic (Canada).

  • 1. Structured and Unstructured: Extracting Information From Classics Scholarly Texts
      Matteo Romanello, Centre for Computing in the Humanities, King’s College London
      Graduate Colloquium - DHSI 2010, University of Victoria BC - 8th June 2010

  • 2. The Project at a glance
      - Project started in October 2009
      - Disciplines: Digital Humanities, Classics, Computer Science
      - Co-supervised by:
          - Willard McCarty (KCL, Department of Digital Humanities)
          - Jonathan Ginzburg (KCL, Department of Computer Science)
      - Project supported by an AHRC (Arts and Humanities Research Council) award

  • 3. Goal
      Devising an automatic system to improve semantic information retrieval over a discipline-specific corpus of unstructured texts
      - focus on secondary sources (e.g. journal papers) as opposed to primary sources (i.e. Ancient Texts)
      - automatic -> scalable with huge amounts of data
      - information retrieval -> the task of retrieving information
      - unstructured texts -> raw texts (e.g. .txt files) as opposed to structured/encoded XML
      Example
      - Hom. Il. XII 1: a sequence of 14 characters meaning "first line of the twelfth book of Homer’s Iliad"

  • 4. Semantic Information Retrieval
      - Semantic vs String Matching based IR

  • 5. Named Entities as Entry Point to Information
      Entities to be extracted:
      1. Place names (ancient and modern)
      2. Relevant person names (mythological names, ancient authors, modern scholars)
      3. References to primary and secondary sources (canonical texts and modern publications about them)

  • 6. Work Phases

  • 7. Corpus building
      - Getting materials
          - Crawling online archives
      - Extracting the text from collected documents
          - Tools for text extraction from PDF -> open issues with Ancient Greek encoding
          - re-OCR documents, even the native digital ones

  • 8.
    Corpus Building II
      - Corpora
          - open access, multilingual
          - Princeton/Stanford Working Papers in Classics (PSWPC)
          - Lexis online
          - 470 articles in 2 corpora
      - OCR
          - FineReader
          - OCRopus (layout analysis)
          - text extracted from PDFs (tools like pdftotext etc.)
          - alignment of multiple OCR outputs

  • 9. Building the Knowledge Base (KB)
      - Goal: integrate different data sources into a single KB
      - Why?
          - Information about the same entities is spread over several data sources
          - Data sources might use different output formats (raw text, DBs, HTML, XML etc.)
          - Partial overlaps but no interoperability
      - How? Use of high-level ontologies to map records related to the same entity
      - Result: KB containing semantic data

  • 10. Corpus Processing Tasks
      1. sentence identification
      2. entity extraction (named entity recognition + disambiguation)
          - KB used to build up an entity context
      3. canonical reference extraction
          - KB provides training data
      4. modern bibliographic reference extraction
          - KB provides lists of journals/place names/authors to improve the performance of the tool

  • 11. Canonical References

  • 12. Canonical References Extraction
      1. citations used specifically for primary sources (i.e. works of ancient authors)
      2. essential entry point to information: they refer to the research object, i.e. ancient texts
      3. logical instead of physical citation scheme (e.g., chapter/paragraph vs. page)
      4. variation -> time, style, language (regexp insufficient!)
      Example
      - Hom. Il. XII 1
      - Aesch. Sept. 565-67, 628-30; Ar. Arch. 803
      - Hes. fr. 321 M.-W.
      - Callimaco, ep. 28 Pf., 5-6

  • 13. So What?
      New possible research questions:
      - How has the citing of primary sources in Classics changed?
      - What are the characteristics of citation and co-citation networks?
      - Are the traditional IR tools in Classics actually exhaustive?

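Slide 12 notes that canonical references vary too much for regular expressions alone. As an illustration only (not the project's actual method), the Python sketch below applies one deliberately naive pattern to the four example citations from that slide. The pattern `CANONICAL_REF` and the helpers `roman_to_int` and `parse_reference` are hypothetical names introduced here, and the assumed citation shape (abbreviated author, abbreviated work, optional Roman-numeral book, line or line range) is an assumption about the simplest case.

```python
import re

# Hypothetical baseline: "Author. Work. [BOOK] start[-end]".
# This shape is an assumption covering only the simplest citations.
CANONICAL_REF = re.compile(
    r"(?P<author>[A-Z][a-z]+\.)\s+"      # abbreviated author, e.g. "Hom."
    r"(?P<work>[A-Z][a-z]+\.)\s+"        # abbreviated work, e.g. "Il."
    r"(?:(?P<book>[IVXLC]+)\s+)?"        # optional Roman-numeral book
    r"(?P<start>\d+)(?:-(?P<end>\d+))?"  # line number or line range
)

ROMAN = {"I": 1, "V": 5, "X": 10, "L": 50, "C": 100}


def roman_to_int(numeral):
    """Convert a Roman numeral (up to C) to an integer."""
    total = 0
    for i, ch in enumerate(numeral):
        value = ROMAN[ch]
        following = ROMAN[numeral[i + 1]] if i + 1 < len(numeral) else 0
        # A smaller value before a larger one is subtracted (e.g. IX = 9).
        total += -value if value < following else value
    return total


def parse_reference(text):
    """Return the parsed fields, or None if the whole string does not fit."""
    match = CANONICAL_REF.fullmatch(text.strip())
    if match is None:
        return None
    book = match.group("book")
    end = match.group("end")
    return {
        "author": match.group("author"),
        "work": match.group("work"),
        "book": roman_to_int(book) if book else None,
        "start": int(match.group("start")),
        "end": int(end) if end else None,
    }


# The four example citations from slide 12.
examples = [
    "Hom. Il. XII 1",
    "Aesch. Sept. 565-67, 628-30; Ar. Arch. 803",
    "Hes. fr. 321 M.-W.",
    "Callimaco, ep. 28 Pf., 5-6",
]
results = {ref: parse_reference(ref) for ref in examples}
for ref, parsed in results.items():
    print(f"{ref!r} -> {parsed}")
```

Under these assumptions only "Hom. Il. XII 1" is parsed. The multi-range citation, the fragment with an edition siglum, and the Italian-style citation all return `None`, which is the kind of variation the slide points to.
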
  • 14. Why a Digital Humanities project?
      - Better understanding of
          - the discipline's specificities
          - users' needs
      - Writing code to develop a project means
          - formalizing the way a given result is obtained
          - creating a repeatable and thus refutable process
          - introducing reasoning based on the analysis of quantitative data into Classics
      - Being able to apply the product of DH research to traditional scholarship

  • 15. Thanks for your attention
      [email protected]
      kcl.academia.edu/MatteoRomanello