SemanticCamp London, 16th February 2008

Gathering data

Extracting (meta)data

Using the data

Thanks

Automatically indexing science using natural-language processing, RDF and SPARQL

Andrew Walkingshaw, Nick Day, Peter Corbett, Jim Downing, Joe Townsend, Peter Murray-Rust

February 16, 2008

Data sources

• Supplemental and experimental data

• Journals

• Self-archived papers (e.g. arXiv)

• Mainstream journalism

• Blogs

Supplemental data: CrystalEye

• http://wwmm.ch.cam.ac.uk/crystaleye/

• Repository for crystallographic data

Journals and arXiv

• “Traditional” journal articles

• Titles and abstracts...

Journalism and blogs

• Unstructured text with little semantics;

• ... hence Google Scholar, Web of Science, etc.

Semi-structured data: Golem

• We’ve got a lot of chemical data as CML

• http://en.wikipedia.org/wiki/Chemical_Markup_Language

• ... but we still need to get data out of that and into a more useful form

• hence Golem: http://www.lexical.org.uk/science/golem/

• GRDDLish strategy for extracting data from CML files: identify dialect-specific concepts with XPath expressions and XSLT stylesheets

• upshot: we can extract JSON objects from CML files.
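The XPath-to-JSON idea can be sketched as follows. This is not Golem's actual API: it is a minimal illustration using only Python's standard library, and the sample CML fragment, the `ex:` dictionary references, and the concept names are all invented for the example.

```python
import json
import xml.etree.ElementTree as ET

# Namespace used by CML documents.
CML_NS = {"cml": "http://www.xml-cml.org/schema"}

# Hypothetical CML fragment; a real file would come from a repository entry.
SAMPLE_CML = """<cml xmlns="http://www.xml-cml.org/schema">
  <module>
    <property dictRef="ex:cell_volume">
      <scalar units="ex:angstrom3">123.4</scalar>
    </property>
    <property dictRef="ex:temperature">
      <scalar units="ex:kelvin">150</scalar>
    </property>
  </module>
</cml>"""

# Each "concept" is a name paired with the XPath expression that locates it.
# These names and expressions are assumptions made for this sketch.
CONCEPTS = {
    "cell_volume": ".//cml:property[@dictRef='ex:cell_volume']/cml:scalar",
    "temperature": ".//cml:property[@dictRef='ex:temperature']/cml:scalar",
}


def extract_observables(cml_text, concepts=CONCEPTS):
    """Return a JSON-serialisable dict of the concepts found in cml_text."""
    root = ET.fromstring(cml_text)
    result = {}
    for name, xpath in concepts.items():
        node = root.find(xpath, CML_NS)
        if node is not None and node.text:
            result[name] = {
                "value": float(node.text),
                "units": node.get("units"),
            }
    return result


if __name__ == "__main__":
    print(json.dumps(extract_observables(SAMPLE_CML), indent=2))
```

Keeping the XPath expressions in a table, rather than in code, is what would let one extractor cope with several CML dialects: a new dialect needs only a new table.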

Free text: OSCAR3

• http://oscar3-chem.sourceforge.net/

• Natural-language parser for documents about chemistry

• Dark magic: don’t ask me how it works!

• ... but it can be run as a Jetty webservice so as long as it does, I’m happy

• Author’s blog: http://wwmm.ch.cam.ac.uk/blogs/corbett/

Getting the data in

• Everything (more or less) talks RSS nowadays...

• RSS 0.91, RSS 1.0 (which one?), Atom, etc. etc. etc.

• Thankfully: feedparser (http://feedparser.org/)
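A minimal sketch of reading a feed with feedparser, assuming it is installed (`pip install feedparser`). `feedparser.parse` hides the differences between RSS flavours and Atom, so the same entry fields can be read either way. The helper names `entry_free_text` and `iter_feed_texts` are invented for this example.

```python
def entry_free_text(entry):
    """Join the title and summary of a feedparser-style (dict-like) entry."""
    parts = [entry.get("title", ""), entry.get("summary", "")]
    return " ".join(p.strip() for p in parts if p and p.strip())


def iter_feed_texts(url):
    """Yield (link, free text) for every entry in the feed at url."""
    import feedparser  # third-party; imported here so the helper above works without it

    parsed = feedparser.parse(url)
    for entry in parsed.entries:
        yield entry.get("link"), entry_free_text(entry)
```

Calling `iter_feed_texts` on a feed URL would give the free text that later gets sent for annotation.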

Serializing metadata

• RDF – using:

• Dublin Core terms

• A homebrew ontology based on the IUCr’s CIF data format

• and another homebrew ontology for OSCAR annotations

• (it’d be good to standardise these, but to be honest, not many people are doing this sort of thing)
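To make the Dublin Core part concrete, here is a small sketch that writes N-Triples by hand, without an RDF library. The subject URI and the sample values are hypothetical, and the homebrew CIF and OSCAR ontologies are left out because their terms are not given here.

```python
# Dublin Core terms namespace.
DC = "http://purl.org/dc/terms/"


def literal(value):
    """Format a plain string as an N-Triples literal."""
    escaped = (
        value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")
    )
    return '"%s"' % escaped


def triple(subject, predicate, obj, obj_is_uri=False):
    """Format one N-Triples statement."""
    o = "<%s>" % obj if obj_is_uri else literal(obj)
    return "<%s> <%s> %s ." % (subject, predicate, o)


def describe_entry(uri, title, contributors):
    """Serialise a title and contributor list for one entry as N-Triples."""
    lines = [triple(uri, DC + "title", title)]
    for name in contributors:
        lines.append(triple(uri, DC + "contributor", name))
    return "\n".join(lines)


if __name__ == "__main__":
    # Hypothetical entry, for illustration only.
    print(describe_entry("http://example.org/entry/1", "A title", ["A. Author"]))
```

In practice a library such as rdflib would handle serialisation and escaping. The hand-rolled version is only meant to show what ends up in the triple store.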

The process

• For each feed in a list of feeds:

• If it’s supplying CML data, set Golem on each entry, get the observables out, and turn them into triples; run OSCAR3 over the title and/or abstract

• If it’s not, extract the free text from each entry, send it to the OSCAR web service, and assign triples based on the chemical entities OSCAR finds

• Upload the RDF to your triple store

• (I’m using the Talis platform, so that’s just curl)

• And...
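The loop described above can be sketched as follows. Everything here is an assumption made for illustration: the feed and entry dictionaries, the `mentions` predicate, and the three callables (`golem_extract`, `oscar_annotate`, `upload`) stand in for Golem, the OSCAR web service, and the triple-store upload. None of them is a real interface to those tools.

```python
def process_feeds(feeds, golem_extract, oscar_annotate, upload):
    """Turn every entry of every feed into triples and upload them per feed.

    feeds          -- iterable of dicts: {"supplies_cml": bool, "entries": [...]}
    golem_extract  -- callable: CML text -> {observable name: value}
    oscar_annotate -- callable: free text -> list of chemical entity names
    upload         -- callable: list of (subject, predicate, object) triples
    Returns the total number of triples produced.
    """
    total = 0
    for feed in feeds:
        triples = []
        for entry in feed["entries"]:
            subject = entry["link"]
            if feed["supplies_cml"]:
                # Structured route: observables come straight out of the CML.
                for name, value in golem_extract(entry["cml"]).items():
                    triples.append((subject, name, value))
                text = entry.get("title", "")
            else:
                # Unstructured route: only free text is available.
                text = entry.get("text", "")
            # Both routes run the text through the chemistry annotator.
            for entity in oscar_annotate(text):
                triples.append((subject, "mentions", entity))
        upload(triples)
        total += len(triples)
    return total
```

Passing the three services in as callables keeps the loop testable with stubs, and means the upload step can be anything from a `curl` subprocess to an HTTP client call.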

SPARQL is great.

Just post queries at a SPARQL endpoint:

    authortemplate = '''
    PREFIX dc: <http://purl.org/dc/terms/>
    PREFIX ce: <http://wwmm.ch.cam.ac.uk/crystaleye/dictionary#>
    DESCRIBE ?file WHERE { ?file dc:contributor some author . }
    '''
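A hedged sketch of filling in such a template and sending it over the standard SPARQL protocol (an HTTP GET with a `query` parameter), using only the standard library. Two assumptions are made here: the author is matched as a plain string literal, and the endpoint URL is supplied by the caller. The function names are invented for the example.

```python
import urllib.parse
import urllib.request

AUTHOR_TEMPLATE = """PREFIX dc: <http://purl.org/dc/terms/>
PREFIX ce: <http://wwmm.ch.cam.ac.uk/crystaleye/dictionary#>
DESCRIBE ?file WHERE { ?file dc:contributor "%s" . }"""


def build_author_query(author):
    """Fill the template, escaping characters that would break the literal."""
    escaped = author.replace("\\", "\\\\").replace('"', '\\"')
    return AUTHOR_TEMPLATE % escaped


def build_request_url(endpoint, query):
    """Encode the query as the 'query' parameter of a GET request."""
    return endpoint + "?" + urllib.parse.urlencode({"query": query})


def run_query(endpoint, query):
    """Send the query and return the response body as text (RDF/XML requested)."""
    request = urllib.request.Request(
        build_request_url(endpoint, query),
        headers={"Accept": "application/rdf+xml"},
    )
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8")
```

Escaping the author name matters as soon as the value comes from a web form rather than from the programmer.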

SPARQL isn’t (entirely) great.

• Scientists shouldn’t have to know this stuff.

• So we need to build a front end which your average senior academic might be able to use...

• (i.e. it’s got to look like a website.)

What queries do we want?

• What experimental data is an author responsible for?

• What chemical entities are in some data?

• Where is a given chemical entity talked about?

• So we can build a web app around these queries.

• django + rdflib + sparql + Talis Platform
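One way the three questions might map onto query templates that a web view could fill in. The `ex:` namespace and the `ex:mentions` predicate are hypothetical placeholders for the homebrew OSCAR ontology, whose real terms are not given here. Only `dc:contributor` comes from the slides.

```python
PREFIXES = """PREFIX dc: <http://purl.org/dc/terms/>
PREFIX ex: <http://example.org/hypothetical-oscar#>
"""

# One template per question; ex:mentions is an invented predicate.
QUERIES = {
    "data_by_author": 'SELECT ?file WHERE { ?file dc:contributor "%s" . }',
    "entities_in_data": "SELECT ?entity WHERE { <%s> ex:mentions ?entity . }",
    "data_mentioning_entity": 'SELECT ?file WHERE { ?file ex:mentions "%s" . }',
}


def render(name, value):
    """Return the full SPARQL text for one of the named questions."""
    if name == "entities_in_data":
        # The value is used as an IRI, so reject characters IRIs cannot hold.
        if any(c in value for c in '<>" {}|\\^`'):
            raise ValueError("not a usable IRI: %r" % value)
        safe = value
    else:
        # The value is used as a string literal, so escape it.
        safe = value.replace("\\", "\\\\").replace('"', '\\"')
    return PREFIXES + QUERIES[name] % safe
```

A front end then needs only one form field per question. The user never sees the SPARQL.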

And here it is.

Thanks to...

• Talis (http://n2.talis.com/) for access to their platform

• and to the RSC and IUCr for their support of CrystalEye.
