Alan Tonge Semantic Web Data Repositories from Chemistry e-Thesis Data Mining Open Repositories 2008 Southampton University 2 April 2008 SPECTRa-T Project

Alan Tonge

Semantic Web Data Repositories from Chemistry e-Thesis Data Mining

Open Repositories 2008, Southampton University, 2 April 2008

SPECTRa-T Project

• 12-month project between University of Cambridge and Imperial College London to develop text- and data-mining tools to extract chemical data from e-theses

• Part of the JISC Digital Repositories programme

Project Overview

Submission, Preservation and Exposure of Chemistry Teaching and Research Data - in Theses

Background

Chemistry is an experimental science

Synthetic Organic Chemistry is the basis of Pharmaceutical and Agrochemical industries

Where does the information to make this molecule come from?

Systematic Name: Ethyl 4,5-epoxy-hex-2-enolate

Molecular Formula: C8H12O3

Search chemical patent and journal abstracting services, e.g.:

Chemical Abstracts (9000+ journals, 12,000 structures/day)

Beilstein (180 core journals)

Patents (CAS, Derwent, MDL) (400,000/annum)

Academic chemistry publications are largely derived from PhD theses

Perhaps ~10K theses published per year worldwide

Synthetic thesis: contains 50-60 preparations, only 20% published in detail

• List of Starting Materials & Reagents

• Recipe: Reaction Conditions & Work-up

• Product Characterization – spectroscopic & physical properties

Sample preparation from synthetic chemistry thesis

The Problem

• ~80% of (academic) synthetic preparations remain locked in theses

• Manual abstraction (cf. journals/patents) is not an option

The Solution

• OSCAR3 : Automatic high-throughput chemical name and chemical term recognition

Open Source Chemistry Analysis Routines is an extensible Open Source framework which can identify much of the chemical terminology in electronic articles

• Semantic Web : Deposit extracted terms in searchable RDF triplestore

OSCAR Name recognition:

1. Dictionary of chemical names/terms (ChEBI Ontology)

2. Rules; chemical suffix filters

3. Regular expressions to recognise: data, formulae
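The three recognition layers listed above can be illustrated with a minimal sketch. This is not OSCAR3 code: the tiny dictionary, the suffix list, the formula pattern, and the function name `classify` are all assumptions made for illustration only.

```python
import re

# Hypothetical mini-dictionary; OSCAR3 draws on the much larger ChEBI ontology.
KNOWN_NAMES = {"benzene", "ethanol", "chloroform"}

# Hypothetical suffix filter: word endings that are common in chemical names.
CHEMICAL_SUFFIXES = ("ane", "ene", "yne", "one", "oate", "amine")

# Simple molecular formulae such as C8H12O3: two or more element symbols,
# each optionally followed by a count.
FORMULA_RE = re.compile(r"^(?:[A-Z][a-z]?\d*){2,}$")


def classify(token: str) -> str:
    """Return which layer (if any) flags the token as chemical."""
    raw = token.strip(".,;:()")
    word = raw.lower()
    if word in KNOWN_NAMES:
        return "dictionary"
    # Require a digit so that acronyms such as NMR are not taken for formulae.
    if FORMULA_RE.match(raw) and any(ch.isdigit() for ch in raw):
        return "formula"
    if len(word) > 4 and word.endswith(CHEMICAL_SUFFIXES):
        return "suffix"
    return "none"


if __name__ == "__main__":
    for tok in ["Benzene,", "C8H12O3", "hexanone", "NMR", "stirred"]:
        print(tok, "->", classify(tok))
```

A suffix filter this crude will also flag ordinary words such as "alone", which is one reason a real system combines several layers rather than relying on one.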

Input: PDF Legacy Format

PDF is the de facto format for electronic document deposition in digital repositories

Problem:

• irregular word order

• line-breaks: loss of continuous text; paragraphs difficult to identify

• loss of subscripts and superscripts

• non-printing characters

• erroneous character assignment with OCR

PDF is a Page Description Format, optimized for human, not machine, readability

PDF -> UTF-8 text -> OSCAR3 -> SAF XML -> (XSLT) -> RDF statements

OSCAR used ‘as is’ on PDF e-theses:

• Gives 5000 terms / thesis (80% duplicates)

• Cannot identify chemical objects (spectra assignments; properties)

Programmatic modifications to:

• Remove linebreaks from extended chemical names

• Remove text fragments derived from Figures and Tables

• Correct whitespace in chemical names
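Two of the programmatic modifications named on this slide (rejoining names broken across lines, and correcting whitespace inside names) can be sketched with regular expressions. The function names and the heuristics are assumptions for illustration; the project's actual clean-up rules are not shown in these slides.

```python
import re


def rejoin_broken_names(text: str) -> str:
    """Join text that was split by a line break directly after a hyphen.

    Example: "5,6-epoxy-\n6-methyl" becomes "5,6-epoxy-6-methyl".
    """
    return re.sub(r"(?<=-)[ \t]*\n[ \t]*(?=\S)", "", text)


def fix_name_whitespace(text: str) -> str:
    """Remove stray spaces around hyphens that sit between word characters.

    Example: "hex- 2 -enoate" becomes "hex-2-enoate".
    """
    return re.sub(r"(?<=\w)\s*-\s*(?=\w)", "-", text)


def clean_pdf_text(text: str) -> str:
    """Apply both heuristics in sequence."""
    return fix_name_whitespace(rejoin_broken_names(text))


if __name__ == "__main__":
    sample = "8-(p-Methoxy-benzyloxy)-5,6-epoxy-\n6-methyl-oct- 3 -en-2-one"
    print(clean_pdf_text(sample))
```

Both rules are deliberately blunt: they would also close up a spaced hyphen in ordinary prose, so in practice they would need to be restricted to spans already suspected of being chemical names.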

Input: MS Office Open XML – ‘docx’

• No information loss from student’s deposited thesis (written with MS software)

• Identification of experimental sections no longer a problem -> Chemical Objects

• Conversion of chemical objects (COs) into Chemical Markup Language (CML)
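Part of the reason docx loses less information than PDF is that a .docx file is a ZIP archive whose main body, `word/document.xml`, keeps paragraphs as explicit XML elements. The sketch below reads paragraph text using only the Python standard library; the function name `docx_paragraphs` is an assumption, and this is not the SPECTRa-T extraction code.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by word/document.xml in Office Open XML files.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def docx_paragraphs(source) -> list[str]:
    """Return the text of each non-empty paragraph in a .docx file.

    `source` may be a file path or a binary file-like object.
    """
    with zipfile.ZipFile(source) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    paragraphs = []
    for para in root.iter(f"{{{W_NS}}}p"):
        # A paragraph is split into runs; concatenate the text of every w:t node.
        text = "".join(node.text or "" for node in para.iter(f"{{{W_NS}}}t"))
        if text:
            paragraphs.append(text)
    return paragraphs


def demo_docx() -> io.BytesIO:
    """Build a tiny in-memory .docx-like archive for demonstration only."""
    body = (
        f'<w:document xmlns:w="{W_NS}"><w:body>'
        "<w:p><w:r><w:t>Experimental</w:t></w:r></w:p>"
        "<w:p><w:r><w:t>IR (film) </w:t></w:r><w:r><w:t>3446, 1672</w:t></w:r></w:p>"
        "</w:body></w:document>"
    )
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("word/document.xml", body)
    buffer.seek(0)
    return buffer


if __name__ == "__main__":
    print(docx_paragraphs(demo_docx()))
```

Because paragraph boundaries survive intact, locating an experimental section becomes a matter of scanning paragraphs rather than reconstructing them from page layout.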

[Workflow diagram]

DocX -> Extract chemical terms (OSCAR3) -> RDF statements

DocX -> Extract chemical objects -> CML data files -> Data Repository (URI)

RDF statements and CML data files are linked together

Sample preparation from synthetic chemistry thesis

Sample preparation from chemistry thesis

CML Infra-Red ASSIGNMENTS

<cml:spectrum type="cml:ir">
  <cml:conditionList>
    <cml:condition title="the form of the IR spectrum" dictRef="cml:irform">film</cml:condition>
  </cml:conditionList>
  <cml:peakList>
    <cml:peak id="p1" xValue="3446" title="OH" />
    <cml:peak id="p2" xValue="3062" title="unassigned" />
    <cml:peak id="p3" xValue="3029" title="unassigned" />
    <cml:peak id="p4" xValue="2922" title="unassigned" />
    <cml:peak id="p5" xValue="1672" title="C=O" />
    <cml:peak id="p6" xValue="1604" title="C=C" />
    <cml:peak id="p7" xValue="1496" title="unassigned" />
    <cml:peak id="p8" xValue="1454" title="unassigned" />
    <cml:peak id="p9" xValue="1366" title="unassigned" />
    <cml:peak id="p10" xValue="1299" title="unassigned" />
    <cml:peak id="p11" xValue="1135" title="unassigned" />
    <cml:peak id="p12" xValue="1078" title="unassigned" />
    <cml:peak id="p13" xValue="974" title="unassigned" />
  </cml:peakList>
</cml:spectrum>

CML C-13 NMR ASSIGNMENTS

<cml:spectrum type="cml:cnmr">
  <cml:parameterList>
    <cml:parameter dictRef="cml:frequency" units="units:MHz">50</cml:parameter>
  </cml:parameterList>
  <cml:substanceList>
    <cml:substance ref="" />
  </cml:substanceList>
  <cml:peakList>
    <cml:peak xValue="198.6" integral="" peakMultiplicity="" title="C=O" />
    <cml:peak xValue="198.5" integral="" peakMultiplicity="" title="" />
    <cml:peak xValue="145.0" integral="" peakMultiplicity="" title="C" />
    <cml:peak xValue="142.7" integral="" peakMultiplicity="" title="C" />
    <cml:peak xValue="137.3" integral="" peakMultiplicity="" title="CH2" />
    <cml:peak xValue="136.7" integral="" peakMultiplicity="" title="CH2" />
    <cml:peak xValue="129.1" integral="" peakMultiplicity="" title="" />
    <cml:peak xValue="128.6" integral="" peakMultiplicity="" title="" />
    <cml:peak xValue="126.7" integral="" peakMultiplicity="" title="" />
    <cml:peak xValue="124.0" integral="" peakMultiplicity="" title="aryl-C" />
    <cml:peak xValue="62.5" integral="" peakMultiplicity="" title="CH" />
    <cml:peak xValue="59.0" integral="" peakMultiplicity="" title="CH" />
    <cml:peak xValue="55.2" integral="" peakMultiplicity="" title="CH" />
    <cml:peak xValue="54.9" integral="" peakMultiplicity="" title="CH" />
    <cml:peak xValue="38.5" integral="" peakMultiplicity="" title="CH2" />
    <cml:peak xValue="32.8" integral="" peakMultiplicity="" title="CH2" />
    <cml:peak xValue="26.1" integral="" peakMultiplicity="" title="CH3" />
    <cml:peak xValue="26.0" integral="" peakMultiplicity="" title="CH3" />
  </cml:peakList>
</cml:spectrum>
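Once spectra are held as CML, peak assignments can be read back with any XML parser. The sketch below uses Python's standard library on a shortened copy of the IR example. The slide fragments omit the namespace declaration, so the `xmlns:cml` value used here is an assumption, as is the function name `assigned_peaks`.

```python
import xml.etree.ElementTree as ET

# Assumed CML namespace URI; the slide fragment does not declare one.
CML_NS = "http://www.xml-cml.org/schema"

# Shortened copy of the IR spectrum shown on the slide, with the namespace added.
SAMPLE = f"""<cml:spectrum xmlns:cml="{CML_NS}" type="cml:ir">
  <cml:peakList>
    <cml:peak id="p1" xValue="3446" title="OH" />
    <cml:peak id="p2" xValue="3062" title="unassigned" />
    <cml:peak id="p5" xValue="1672" title="C=O" />
  </cml:peakList>
</cml:spectrum>"""


def assigned_peaks(cml_text: str) -> list[tuple[float, str]]:
    """Return (position, assignment) for every peak that carries an assignment."""
    root = ET.fromstring(cml_text)
    peaks = []
    for peak in root.iter(f"{{{CML_NS}}}peak"):
        title = peak.get("title", "")
        # Skip peaks with an empty title or the explicit "unassigned" marker.
        if title and title != "unassigned":
            peaks.append((float(peak.get("xValue")), title))
    return peaks


if __name__ == "__main__":
    print(assigned_peaks(SAMPLE))
```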

RDF - Resource Description Framework.

A component of the Semantic Web, it is based upon the idea of making statements about resources/data in the form of a subject-predicate-object (or resource-property-value) expression (called a triple), e.g.:

My_thesis has_chemical_entity 2,4-dinitrobenzene

The value of one property can in turn be used as the resource for another.
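The chaining idea can be shown with plain tuples, with no RDF library involved. The first triple is the one from the slide; the second triple, its `has_molecular_formula` predicate, and the helper names are hypothetical additions for illustration.

```python
Triple = tuple[str, str, str]

TRIPLES: list[Triple] = [
    # Triple from the slide: subject, predicate, object.
    ("My_thesis", "has_chemical_entity", "2,4-dinitrobenzene"),
    # Hypothetical second statement: the object above is now used as a subject.
    ("2,4-dinitrobenzene", "has_molecular_formula", "C6H4N2O4"),
]


def objects(triples: list[Triple], subject: str, predicate: str) -> list[str]:
    """Return every object stated for the given subject and predicate."""
    return [o for s, p, o in triples if s == subject and p == predicate]


def follow(triples: list[Triple], subject: str, *predicates: str) -> list[str]:
    """Walk a chain of predicates, using each result as the next subject."""
    current = [subject]
    for predicate in predicates:
        current = [o for s in current for o in objects(triples, s, predicate)]
    return current


if __name__ == "__main__":
    print(follow(TRIPLES, "My_thesis", "has_chemical_entity", "has_molecular_formula"))
```

Following two predicates in a row is what lets a query start at a thesis and arrive at a property of a compound mentioned in it.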

RDF TRIPLESTORE ENTRY

<rdf:RDF xmlns:dc="http://purl.org/dc/elements/1.1/"
         xmlns:dcrdf="http://purl.org/metadata/dublin_core#"
         xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:spectra-t="http://wwmm.ch.cam.ac.uk/spectra-t#">
  <rdf:Description rdf:about="file:/C:/spectra-t-theses/Juergen_Harter.docx">
    <spectra-t:hasChemicalName>
      <rdf:Description>
        <spectra-t:chemicalName>CDCl3</spectra-t:chemicalName>
        <spectra-t:hasSMILES>ClC([2H])(Cl)Cl</spectra-t:hasSMILES>
        <spectra-t:hasInChI>InChI=1/CHCl3/c2-1(3)4/h1H/i1D</spectra-t:hasInChI>
      </rdf:Description>
    </spectra-t:hasChemicalName>
    <spectra-t:hasChemicalName>
      <rdf:Description>
        <spectra-t:chemicalName>1-Benzyloxy-but-3-yne</spectra-t:chemicalName>
        <spectra-t:hasSMILES>C#CCCOCC1=CC=CC=C1</spectra-t:hasSMILES>
        <spectra-t:hasInChI>InChI=1/C11H12O/c1-2-3-9-12-10-11-7-5-4-6-8-11/h1,4-8H,3,9-10H2</spectra-t:hasInChI>
        <spectra-t:hasHNMRSpectrum>http://ch.cam.ac.uk:8182/1ea7f8cd07/data-0.cml</spectra-t:hasHNMRSpectrum>
        <spectra-t:hasCMLMolecule>http://ch.cam.ac.uk:8182/1ea7f8cd07/data-0.cml</spectra-t:hasCMLMolecule>
        <spectra-t:hasPreparation>http://ch.cam.ac.uk:8182/1ea7f8cd07/preparation-0.sci.xml</spectra-t:hasPreparation>
      </rdf:Description>
    </spectra-t:hasChemicalName>
    <spectra-t:hasChemicalName>
      <rdf:Description>
        <spectra-t:chemicalName>(3E,5S,6S)-8-(p-Methoxy-benzyloxy)-5,6-epoxy-6-methyl-oct-3-en-2-one</spectra-t:chemicalName>
        <spectra-t:hasHNMRSpectrum>http://fiwlt.ch.cam.ac.uk:8182/8f2d98b04/data-20.cml</spectra-t:hasHNMRSpectrum>
        <spectra-t:hasIRSpectrum>http://fiwlt.ch.cam.ac.uk:8182/8f2d98b04/data-20.cml</spectra-t:hasIRSpectrum>
        <spectra-t:hasMassSpectrum>http://fiwlt.ch.cam.ac.uk:8182/8f2d98b04/data-20.cml</spectra-t:hasMassSpectrum>
        <spectra-t:hasHRMSSpectrum>http://fiwlt.ch.cam.ac.uk:8182/8f2d98b04/data-20.cml</spectra-t:hasHRMSSpectrum>
        <spectra-t:hasPreparation>http://fiwlt.ch.cam.ac.uk:8182/8f2d98b04/preparation-20.sci.xml</spectra-t:hasPreparation>
      </rdf:Description>
    </spectra-t:hasChemicalName>
  </rdf:Description>
</rdf:RDF>
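An entry of this shape can be inspected without a triplestore. The sketch below treats the RDF/XML as ordinary XML and pulls out each chemical name with its SMILES string, using a shortened copy of the first annotation. It is only an illustration: a general RDF/XML document can be serialised in several equivalent ways, so a real consumer would use an RDF parser. The function name `names_with_smiles` is an assumption.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
ST_NS = "http://wwmm.ch.cam.ac.uk/spectra-t#"

# Shortened copy of the triplestore entry, keeping only the first annotation.
SAMPLE = f"""<rdf:RDF xmlns:rdf="{RDF_NS}" xmlns:spectra-t="{ST_NS}">
  <rdf:Description rdf:about="file:/C:/spectra-t-theses/Juergen_Harter.docx">
    <spectra-t:hasChemicalName>
      <rdf:Description>
        <spectra-t:chemicalName>CDCl3</spectra-t:chemicalName>
        <spectra-t:hasSMILES>ClC([2H])(Cl)Cl</spectra-t:hasSMILES>
      </rdf:Description>
    </spectra-t:hasChemicalName>
  </rdf:Description>
</rdf:RDF>"""


def names_with_smiles(rdf_xml: str) -> dict:
    """Map each chemical name to its SMILES string (None if no SMILES is given)."""
    root = ET.fromstring(rdf_xml)
    result = {}
    for annotation in root.iter(f"{{{ST_NS}}}hasChemicalName"):
        name = annotation.findtext(f".//{{{ST_NS}}}chemicalName")
        smiles = annotation.findtext(f".//{{{ST_NS}}}hasSMILES")
        if name:
            result[name] = smiles
    return result


if __name__ == "__main__":
    print(names_with_smiles(SAMPLE))
```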

SPARQL QUERY

PREFIX st: <http://wwmm.ch.cam.ac.uk/spectra-t#>
PREFIX dcrdf: <http://purl.org/metadata/dublin_core#>
CONSTRUCT {
  ?thesis st:hasBicycloMoleculeAndHNMR ?chemical .
  ?thesis dcrdf:author ?author
}
WHERE {
  ?thesis dcrdf:creator ?author .
  ?thesis st:hasChemicalName ?annot .
  ?annot st:chemicalName ?chemical .
  ?annot st:hasHNMRSpectrum ?hnmr .
  FILTER regex(?chemical, ".*bicyclo.*") .
}
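The logic of this query can be traced in plain Python over a handful of in-memory triples. This is an emulation for explanation only, not a SPARQL engine: the sample triples, the `thesis1` identifier, the blank-node labels, and the author string are hypothetical. Since SPARQL's `regex` matches anywhere in the string, the pattern ".*bicyclo.*" behaves like a substring search for "bicyclo".

```python
import re

ST = "http://wwmm.ch.cam.ac.uk/spectra-t#"
DC = "http://purl.org/metadata/dublin_core#"

# Hypothetical data: one thesis with two annotations, only one of which
# has both a bicyclo name and an H NMR spectrum.
TRIPLES = [
    ("thesis1", DC + "creator", "Example Author"),
    ("thesis1", ST + "hasChemicalName", "_:a1"),
    ("_:a1", ST + "chemicalName", "5-Phenyl-bicyclo[4.2.1]nona-3,7-diene"),
    ("_:a1", ST + "hasHNMRSpectrum", "data-0.cml"),
    ("thesis1", ST + "hasChemicalName", "_:a2"),
    ("_:a2", ST + "chemicalName", "CDCl3"),
]


def values(triples, subject, predicate):
    """Return all objects for one subject and predicate."""
    return [o for s, p, o in triples if s == subject and p == predicate]


def bicyclo_with_hnmr(triples):
    """Return (thesis, chemical name, author) rows matching the WHERE clause."""
    rows = []
    for thesis, predicate, annot in triples:
        if predicate != ST + "hasChemicalName":
            continue
        # ?annot st:hasHNMRSpectrum ?hnmr : the annotation must have a spectrum.
        if not values(triples, annot, ST + "hasHNMRSpectrum"):
            continue
        for name in values(triples, annot, ST + "chemicalName"):
            # FILTER regex(?chemical, ".*bicyclo.*")
            if re.search("bicyclo", name):
                for author in values(triples, thesis, DC + "creator"):
                    rows.append((thesis, name, author))
    return rows


if __name__ == "__main__":
    for row in bicyclo_with_hnmr(TRIPLES):
        print(row)
```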

RESULT

<rdf:Description rdf:about="file:/C:/spectra-t-articles/B207708F.docx">
  <st:hasBicycloMoleculeAndHNMR>5-Acetyl-7,8-bis(trimethylsilyl)bicyclo[4.2.1]nona-4,7-diene</st:hasBicycloMoleculeAndHNMR>
  <dcrdf:author>N.R.Champness</dcrdf:author>
  <st:hasBicycloMoleculeAndHNMR>5-Acetyl-bicyclo[4.2.1]nona-4,7-diene</st:hasBicycloMoleculeAndHNMR>
  <dcrdf:author>N.R.Champness</dcrdf:author>
  <st:hasBicycloMoleculeAndHNMR>5-Phenyl-bicyclo[4.2.1]nona-3,7-diene</st:hasBicycloMoleculeAndHNMR>
  <dcrdf:author>N.R.Champness</dcrdf:author>
  <st:hasBicycloMoleculeAndHNMR>5-Acetyl-7,8-bis(trimethylsilyl)bicyclo[4.2.1]nona-4,7-diene</st:hasBicycloMoleculeAndHNMR>
  <dcrdf:author>N.R.Champness</dcrdf:author>
  <st:hasBicycloMoleculeAndHNMR>5-Acetyl-bicyclo[4.2.1]nona-4,7-diene</st:hasBicycloMoleculeAndHNMR>
  <dcrdf:author>N.R.Champness</dcrdf:author>
  <st:hasBicycloMoleculeAndHNMR>5-Phenyl-bicyclo[4.2.1]nona-3,7-diene</st:hasBicycloMoleculeAndHNMR>
  <dcrdf:author>N.R.Champness</dcrdf:author>
</rdf:Description>

Caveats (Proof-of-concept):

Single subject area (synthetic organic chemistry)

Single institution docx (limited variation in document structure)

Limited thesis availability

Solutions :

Domain ontology development

Make your e-theses public!

Message to repository managers:

PDF is a limited format for data extraction from e-theses

Docx allows chemical data object extraction (~80% precision / recall)

Acknowledgements

• Project Director: Peter Morgan, UL Cambridge

• Chemistry leads: Henry Rzepa, Peter Murray-Rust

• Developers: Jim Downing, Diana Stewart, Joe Townsend, Matt Harvey

• Project Manager: Alan Tonge

http://www.lib.cam.ac.uk/spectra-t/

SPECTRa Tools Workshop

Autumn 2008

Unilever Centre, Cambridge, UK

Contact: Peter Murray-Rust, Peter Morgan