SemanticCamp London, 16th February 2008

Automatically indexing science using natural-language processing, RDF and SPARQL

Andrew Walkingshaw, Nick Day, Peter Corbett, Jim Downing, Joe Townsend, Peter Murray-Rust

Gathering data

Extracting (meta)data

Using the data

Thanks

Data sources

• Supplemental and experimental data

• Journals

• Self-archived papers (e.g. arXiv)

• Mainstream journalism

• Blogs




Supplemental data: CrystalEye

• http://wwmm.ch.cam.ac.uk/crystaleye/

• Repository for crystallographic data




Journals and arXiv

• “Traditional” journal articles

• Titles and abstracts...




Journalism and blogs

• Unstructured text with little semantics;

• ... hence Google Scholar, Web of Science, etc.




Semi-structured data: Golem

• We’ve got a lot of chemical data as CML

• http://en.wikipedia.org/wiki/Chemical_Markup_Language

• ... but we still need to get data out of that and into a more useful form

• hence Golem: http://www.lexical.org.uk/science/golem/

• GRDDLish strategy for extracting data from CML files: identify dialect-specific concepts with XPath expressions and XSLT stylesheets

• upshot: we can extract JSON objects from CML files.
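
To make the XPath half of that strategy concrete, here is a minimal sketch using only the Python standard library. It is not Golem's API: the sample document, the `ex:` dictRef values, the concept names and the `extract_concepts` helper are all invented for illustration, and the XSLT side is left out entirely.

```python
import json
import xml.etree.ElementTree as ET

# CML namespace. The dictRef values and concept names below are invented
# for this example; they do not come from Golem or a real CML dictionary.
NS = {"cml": "http://www.xml-cml.org/schema"}

SAMPLE_CML = """\
<cml xmlns="http://www.xml-cml.org/schema">
  <molecule id="m1">
    <property dictRef="ex:cellVolume">
      <scalar units="ex:angstrom3">1234.5</scalar>
    </property>
    <property dictRef="ex:temperature">
      <scalar units="ex:kelvin">150</scalar>
    </property>
  </molecule>
</cml>
"""

# One XPath expression per concept, in the spirit of a dialect dictionary.
CONCEPTS = {
    "cell_volume": ".//cml:property[@dictRef='ex:cellVolume']/cml:scalar",
    "temperature": ".//cml:property[@dictRef='ex:temperature']/cml:scalar",
}


def extract_concepts(cml_text, concepts=CONCEPTS):
    """Return a JSON-serialisable dict of the concepts found in a CML document."""
    root = ET.fromstring(cml_text)
    found = {}
    for name, xpath in concepts.items():
        node = root.find(xpath, NS)
        if node is not None and node.text is not None:
            found[name] = {
                "value": float(node.text),
                "units": node.get("units"),
            }
    return found


if __name__ == "__main__":
    print(json.dumps(extract_concepts(SAMPLE_CML), indent=2, sort_keys=True))
```

Keeping the expressions in a dictionary means a new CML dialect only needs a new mapping, not new code.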




Free text: OSCAR3

• http://oscar3-chem.sourceforge.net/

• Natural-language parser for documents about chemistry

• Dark magic: don’t ask me how it works!

• ... but it can be run as a Jetty webservice, so as long as it does, I’m happy

• Author’s blog: http://wwmm.ch.cam.ac.uk/blogs/corbett/




Getting the data in

• Everything (more or less) talks RSS nowadays...

• RSS 0.91, RSS 1.0 (which one?), Atom, etc etc etc.

• Thankfully: feedparser (http://feedparser.org/)




Serializing metadata

• RDF – using:

• Dublin Core terms

• A homebrew ontology based on the IUCr’s CIF data format

• and another homebrew ontology for OSCAR annotations

• (it’d be good to standardise these, but to be honest, not many people are doing this sort of thing)
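
As a rough illustration of what that metadata can look like on the wire, the sketch below writes a few statements as N-Triples by hand. The `http://example.org/ontology#` namespace and its `mentions` property are placeholders standing in for the homebrew ontologies, which are not reproduced here; only the Dublin Core terms namespace is real. In practice an RDF library such as rdflib would normally do the serialising.

```python
DCTERMS = "http://purl.org/dc/terms/"
# Hypothetical namespace standing in for a homebrew ontology.
EX = "http://example.org/ontology#"


def escape_literal(text):
    """Escape a string for use inside an N-Triples literal."""
    return (
        text.replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
        .replace("\r", "\\r")
    )


def triple(subject, predicate, obj, literal=True):
    """Format one N-Triples statement; obj is a literal unless literal=False."""
    rendered = '"%s"' % escape_literal(obj) if literal else "<%s>" % obj
    return "<%s> <%s> %s ." % (subject, predicate, rendered)


def describe_entry(uri, title, contributors, entities):
    """Serialise one entry's title, contributors and chemical entities."""
    lines = [triple(uri, DCTERMS + "title", title)]
    lines += [triple(uri, DCTERMS + "contributor", name) for name in contributors]
    lines += [triple(uri, EX + "mentions", name) for name in entities]
    return "\n".join(lines)


if __name__ == "__main__":
    print(describe_entry(
        "http://example.org/entry/1",
        "An example structure",
        ["A. Author"],
        ["benzene"],
    ))
```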




The process

• For each feed in a list of feeds:

• If it’s supplying CML data, set Golem on each entry, get the observables out, and turn them into triples; run OSCAR3 over the title and/or abstract

• If it’s not, extract the free text from each entry, send it to the OSCAR web service, and assign triples based on the chemical entities OSCAR finds

• Upload the RDF to your triple store

• (I’m using the Talis platform, so that’s just curl)

• And...
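
The loop above can be sketched roughly as follows. This is an outline under stated assumptions, not the actual indexing code: the feed and entry dictionaries have an invented shape, `extract_observables` and `find_entities` are caller-supplied stand-ins for Golem and the OSCAR3 web service, the `ex:` predicates are placeholders, and the upload to the triple store is left out.

```python
def index_feeds(feeds, extract_observables, find_entities):
    """Walk a list of feeds and collect (subject, predicate, object) triples.

    feeds: iterable of dicts with "is_cml" and "entries" keys (an assumed shape).
    extract_observables: stand-in for Golem; maps CML text to a dict.
    find_entities: stand-in for the OSCAR3 web service; maps text to names.
    """
    triples = []
    for feed in feeds:
        for entry in feed["entries"]:
            subject = entry["link"]
            if feed["is_cml"]:
                # Structured route: observables become triples directly.
                for name, value in extract_observables(entry["cml"]).items():
                    triples.append((subject, "ex:" + name, value))
                # OSCAR still gets the title and/or abstract.
                text = " ".join(
                    part
                    for part in (entry.get("title"), entry.get("summary"))
                    if part
                )
            else:
                # Free-text route: whatever prose the entry carries.
                text = entry.get("summary") or entry.get("title") or ""
            for entity in find_entities(text):
                triples.append((subject, "ex:mentions", entity))
    return triples
```

Passing the two extractors in as functions keeps the loop testable without a running OSCAR3 service.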




SPARQL is great.

Just post queries at a SPARQL endpoint:

authortemplate = '''
PREFIX dc: <http://purl.org/dc/terms/>
PREFIX ce: <http://wwmm.ch.cam.ac.uk/crystaleye/dictionary#>
DESCRIBE ?file WHERE { ?file dc:contributor some author . }
'''
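
For a sense of what posting a query involves, here is a minimal standard-library sketch. It assumes the author name is substituted into the template as a quoted string literal and that the endpoint accepts a form-encoded POST as described in the SPARQL protocol; the endpoint URL in the usage note, the `Accept` type and the helper names are illustrative choices, not details of the Talis platform.

```python
import urllib.parse
import urllib.request

# Assumption: the author's name goes into the query as a string literal.
AUTHOR_TEMPLATE = """\
PREFIX dc: <http://purl.org/dc/terms/>
DESCRIBE ?file WHERE { ?file dc:contributor "%s" . }
"""


def escape_sparql_string(value):
    """Escape backslashes, double quotes and newlines for a SPARQL literal."""
    return (
        value.replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
    )


def build_request(endpoint, author):
    """Build (but do not send) a form-encoded POST carrying the query."""
    query = AUTHOR_TEMPLATE % escape_sparql_string(author)
    body = urllib.parse.urlencode({"query": query}).encode("utf-8")
    headers = {
        "Content-Type": "application/x-www-form-urlencoded",
        "Accept": "application/rdf+xml",
    }
    return urllib.request.Request(endpoint, data=body, headers=headers)


def run_query(endpoint, author):
    """Send the query and return the raw response body as bytes."""
    with urllib.request.urlopen(build_request(endpoint, author)) as response:
        return response.read()
```

Calling `run_query("http://example.org/sparql", "A. Author")` would send the request; the URL there is a placeholder for whichever endpoint the store exposes.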




SPARQL isn’t (entirely) great.

• Scientists shouldn’t have to know this stuff.

• So we need to build a front end which your average senior academic might be able to use...

• (i.e. it’s got to look like a website.)




What queries do we want?

• What experimental data is an author responsible for?

• What chemical entities are in some data?

• Where is a given chemical entity talked about?

• So we can build a web app around these queries.

• Django + rdflib + SPARQL + Talis Platform
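
One way a web app might hold those three questions is as parameterised query templates, sketched below. The `ex:` namespace and its `mentions` property are hypothetical stand-ins for the OSCAR annotation ontology, and the helper names are invented; only `dc:contributor` appears in the talk itself.

```python
PREFIXES = """\
PREFIX dc: <http://purl.org/dc/terms/>
PREFIX ex: <http://example.org/ontology#>
"""


def quote_literal(value):
    """Render a Python string as a SPARQL string literal."""
    return '"%s"' % value.replace("\\", "\\\\").replace('"', '\\"')


def quote_iri(value):
    """Render an IRI reference, refusing characters that cannot appear in one."""
    if any(ch in value for ch in '<>"{}|^`\\ '):
        raise ValueError("not a safe IRI: %r" % value)
    return "<%s>" % value


# Each entry pairs a template with the quoting function its parameter needs.
QUERIES = {
    # What experimental data is an author responsible for?
    "data_by_author": (
        "SELECT ?file WHERE { ?file dc:contributor %s . }",
        quote_literal,
    ),
    # What chemical entities are in some data?
    "entities_in_data": (
        "SELECT ?entity WHERE { %s ex:mentions ?entity . }",
        quote_iri,
    ),
    # Where is a given chemical entity talked about?
    "mentions_of_entity": (
        "SELECT ?doc WHERE { ?doc ex:mentions %s . }",
        quote_literal,
    ),
}


def render(name, value):
    """Fill in one named query with a safely quoted parameter."""
    template, quote = QUERIES[name]
    return PREFIXES + template % quote(value)


if __name__ == "__main__":
    print(render("mentions_of_entity", "benzene"))
```

Quoting the parameter inside `render` keeps form input from a web page out of the query structure.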




Demo!

And here it is.




Thanks to...

• Talis (http://n2.talis.com/) for access to their platform

• and to the RSC and IUCr for their support of CrystalEye.



