RSC ChemSpider Science Commons Symposium Pacific Northwest #scspn

ChemSpider: Collecting and Curating the World’s Chemistry with the Community

A Pragmatic Vision

“Build a Structure Centric Community”

December 2006 – A hobby project initiated to connect chemistry on the web

Integrate chemical structure data on the web
Create a “structure-based hub” to information and data
Provide access to structure-based “algorithms”
Let chemists contribute their own data
Allow the community to curate/correct data

Where is chemistry online?
Encyclopedic articles (Wikipedia)
Chemical vendor databases
Metabolic pathway databases
Property databases
Patents with chemical structures
Drug Discovery data
Scientific publications
Compound aggregators
Blogs/Wikis and Open Notebook Science

Chemistry on the Internet TODAY

Chemistry searches are generally limited to text-based searches across the internet

Data are dirty: sorting the wheat from the chaff. Who can you trust?

Too many searches are required to source data

media.obsessable.com

As few interfaces as possible

What do humans want?

Chemistry on the Internet FUTURE
The semantic web for chemistry is in place
Crowdsourced contributions are commonplace
Chemists will search by structure/substructure
Chemistry articles indexed and searchable
Reduced number of searches to find data
Data are integrated – compounds, vendors, syntheses, data, publications and patents
A world of Open Access and Open Data

Classical business models will have to morph

Getting it done

March 2007 – A beta system opened online
One purchased computer, two home-built
Seeded with 10.5 million structures
Structure/substructure searching

June 2007
A curating layer to flag data
A deposition interface to add to the data

And so it continued….

ChemSpider Searches

Search Cholesterol

Linked across the internet

Kyoto Encyclopedia of Genes and Genomes

Links to Patents based on structure

Articles Linked

ChemSpider Complex Searches

Link off a structure in ChemSpider

Chemical suppliers
Other publications
Analytical Data
Related Reactions
Wikipedia
Patents
“Everything”

Answering Questions for Chemists

Questions a chemist might ask…
What is the melting point of n-butanol?
What is the chemical structure of Xanax?
Chemically, what is phenolphthalein?
What are the stereocenters of cholesterol?
Where can I find publications about xylene?
What are the different trade names for Ketoconazole?
What is the NMR spectrum of Aspirin?
What are the safety handling issues for Thymol Blue?

What is a compound?

ChemSpider is a structure-centric hub

ChemSpider aggregates and links out across the internet

Data aggregate based on “structures and links”

What defines a chemical compound?

Linked Data on the Web

Taken from: Rafael Sidis’ Blog

Where Would You look? What Do You Trust?

Question Everything online: www.dhmo.org

Di-Hydrogen Monoxide: 2H + 1O = H2O = Water

It’s all on Wikipedia…

Chemistry on The Internet Is Messy

It’s Methane…

What’s Methane?

What ELSE is Methane???

PubChem

Chemistry is REALLY Messy

Vancomycin

Who will curate?

How would you clean such a large dataset?

Assertions!!!

Vancomycin on ChemSpider 1 compound – 3 days

The EXPERTS must get it right?!

Wikipedia, C&E News, PubChem
C&E News (from ACS)

http://www.chemspider.com/RecordView.aspx?id=4445428

The InChI Identifier

Multiple Layers

InChI Strings Hash to InChIKeys

InChIs for Taxol

InChIKeys for Taxol

DrugBank: RCINICONZNJXQF-CLDWUXIMDD
ChEBI: RCINICONZNJXQF-GXKQXQCDDN
Wikipedia: RCINICONZNJXQF-MZXODVADBJ

ChEBI and Wikipedia are the SAME structure

DrugBank is a DIFFERENT structure – it differs at ONE stereocenter
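The comparison of the three Taxol keys can be sketched in a few lines. This is a minimal illustration, assuming the hyphen-separated key layout in which the first block is a hash of the connectivity and the remainder covers everything else (stereochemistry and the other layers). The helper names are hypothetical. The sketch only reports where two keys differ, not why, so a difference in the second block still calls for a look at the full InChI strings.

```python
# Keys copied from the slide above; helper names are hypothetical.
TAXOL_KEYS = {
    "DrugBank": "RCINICONZNJXQF-CLDWUXIMDD",
    "ChEBI": "RCINICONZNJXQF-GXKQXQCDDN",
    "Wikipedia": "RCINICONZNJXQF-MZXODVADBJ",
}


def split_key(key):
    """Return (first_block, remainder) of a hyphen-separated InChIKey."""
    first, _, rest = key.partition("-")
    return first, rest


def compare_keys(key_a, key_b):
    """Classify two keys by how much of the key they share."""
    if key_a == key_b:
        return "identical"
    if split_key(key_a)[0] == split_key(key_b)[0]:
        return "same skeleton"
    return "different"


def compare_all(keys):
    """Compare every pair of sources once, in alphabetical order."""
    sources = sorted(keys)
    return {
        (a, b): compare_keys(keys[a], keys[b])
        for i, a in enumerate(sources)
        for b in sources[i + 1:]
    }


if __name__ == "__main__":
    for pair, verdict in compare_all(TAXOL_KEYS).items():
        print(pair, verdict)
```

All three sources share the first block, so every pair is reported as the same skeleton; deciding which pairs are truly the same structure needs the InChI strings themselves.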

Does one stereocenter matter?


Distaval, Talimol, Nibrol, Sedimide, Quietoplex, Contergan, Neurosedyn, and Softenon


Assertion and Chemical Entities

Who says what Taxol is?

What is the “timeline” for a molecule?

How do we clean up the Public data?

The Quality source is Chemical Abstracts Service…

Vancomycin – Search the Internet

Full Molecule Search: 4 Hits

Full Skeleton Search: 104 Hits
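One way the two hit counts could arise is sketched below. It assumes that a full molecule search matches the whole InChIKey while a full skeleton search matches only the first block; whether ChemSpider implements it this way is not stated in the slides. The function names are hypothetical, and the records reuse the Taxol keys quoted earlier in this deck rather than Vancomycin data.

```python
from collections import defaultdict

# (source, InChIKey) pairs; the keys are the Taxol keys quoted earlier.
RECORDS = [
    ("DrugBank", "RCINICONZNJXQF-CLDWUXIMDD"),
    ("ChEBI", "RCINICONZNJXQF-GXKQXQCDDN"),
    ("Wikipedia", "RCINICONZNJXQF-MZXODVADBJ"),
]


def skeleton_block(key):
    """First hyphen-separated block of the key (assumed connectivity hash)."""
    return key.split("-", 1)[0]


def build_indexes(records):
    """Index sources by the full key and by the first block of the key."""
    by_full = defaultdict(list)
    by_skeleton = defaultdict(list)
    for source, key in records:
        by_full[key].append(source)
        by_skeleton[skeleton_block(key)].append(source)
    return by_full, by_skeleton


def full_molecule_search(by_full, key):
    """Sources whose key matches exactly."""
    return by_full.get(key, [])


def full_skeleton_search(by_skeleton, key):
    """Sources that share the first block, whatever the rest of the key says."""
    return by_skeleton.get(skeleton_block(key), [])


if __name__ == "__main__":
    full, skeleton = build_indexes(RECORDS)
    query = "RCINICONZNJXQF-CLDWUXIMDD"
    print("full molecule:", full_molecule_search(full, query))
    print("full skeleton:", full_skeleton_search(skeleton, query))
```

The skeleton search is the looser of the two, which is why it returns more hits and why it is useful for finding records that differ only in stereochemistry.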

The InChI “Resolver”
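Because an InChIKey is a hash, it cannot be converted back into a structure; something has to remember which InChI produced it. The sketch below shows that idea as a plain lookup table. The table contents (water and methane) and the function name are illustrative assumptions, not the ChemSpider resolver itself.

```python
# Hypothetical lookup table: InChIKey -> InChI string.
KNOWN_STRUCTURES = {
    "XLYOFNOQVPJJNP-UHFFFAOYSA-N": "InChI=1S/H2O/h1H2",   # water
    "VNWKTOKETHGBQD-UHFFFAOYSA-N": "InChI=1S/CH4/h1H4",   # methane
}


def resolve(key, table=KNOWN_STRUCTURES):
    """Return the InChI registered for a key, or None if the key is unknown.

    The key is normalised first so that stray whitespace or lower-case
    input does not cause a miss.
    """
    return table.get(key.strip().upper())


if __name__ == "__main__":
    print(resolve("XLYOFNOQVPJJNP-UHFFFAOYSA-N"))
    print(resolve("not-a-known-key"))
```

A resolver can only answer for structures that someone has deposited, which ties its coverage directly to the size of the underlying database.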

Citizen Scientists

Crowd-sourcing Chemistry Curation

Crowd-sourced curation: identify/tag errors, edit names, synonyms, identify records to deprecate

Building a Structure Centric Community for Chemists

Multi-level Curation and Approval
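A two-level workflow of this kind could be modelled roughly as follows. This is only a sketch: the class names, the status values and the rule that a proposal is reviewed exactly once are assumptions made for illustration, and the action list simply mirrors the curation tasks named on the earlier slide. The record ID is the one from the URL quoted earlier in the deck.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Action(Enum):
    TAG_ERROR = "tag error"
    EDIT_NAME = "edit name"
    EDIT_SYNONYM = "edit synonym"
    DEPRECATE_RECORD = "deprecate record"


class Status(Enum):
    PROPOSED = "proposed"
    APPROVED = "approved"
    REJECTED = "rejected"


@dataclass
class Proposal:
    """A change suggested by a community member, awaiting review."""
    record_id: int
    action: Action
    proposed_by: str
    note: str = ""
    status: Status = Status.PROPOSED
    reviewed_by: Optional[str] = None


def review(proposal, curator, approve):
    """Second level: a curator approves or rejects a pending proposal."""
    if proposal.status is not Status.PROPOSED:
        raise ValueError("proposal has already been reviewed")
    proposal.status = Status.APPROVED if approve else Status.REJECTED
    proposal.reviewed_by = curator
    return proposal


if __name__ == "__main__":
    p = Proposal(4445428, Action.TAG_ERROR, proposed_by="user", note="example")
    print(review(p, curator="curator", approve=True))
```

Keeping the proposer and the reviewer on the record preserves who asserted what, which matters when assertions from different sources disagree.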

Citizens as Data Sources

Semantic Markup: Project Prospect

Entity-Extraction, Mark-up, Annotate

Success Depends on Dictionaries

ChemMantis and CJOC

Name-Structure Pairs
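Dictionary-driven entity extraction can be sketched with a small name-to-identifier table and a longest-match-first pattern. The dictionary below is a three-entry assumption for illustration (two names for water plus methane, each mapped to its InChIKey), and the function names are hypothetical. Real markup pipelines need far larger dictionaries and handling for ambiguous names.

```python
import re

# Hypothetical name-structure pairs: chemical name -> InChIKey.
NAME_TO_KEY = {
    "water": "XLYOFNOQVPJJNP-UHFFFAOYSA-N",
    "dihydrogen monoxide": "XLYOFNOQVPJJNP-UHFFFAOYSA-N",
    "methane": "VNWKTOKETHGBQD-UHFFFAOYSA-N",
}


def build_pattern(names):
    """Compile one case-insensitive pattern that matches any whole name.

    Longer names come first so that a multi-word name wins over a
    shorter name it happens to contain.
    """
    ordered = sorted(names, key=len, reverse=True)
    alternatives = "|".join(re.escape(name) for name in ordered)
    return re.compile(r"\b(" + alternatives + r")\b", re.IGNORECASE)


def find_entities(text, dictionary=NAME_TO_KEY):
    """Return (start offset, matched text, identifier) for each name found."""
    pattern = build_pattern(dictionary)
    return [
        (m.start(), m.group(0), dictionary[m.group(0).lower()])
        for m in pattern.finditer(text)
    ]


if __name__ == "__main__":
    for hit in find_entities("Dihydrogen monoxide is water; methane is not."):
        print(hit)
```

Names missing from the dictionary are silently skipped, which is the sense in which success depends on the dictionaries.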

Species – linked to Wikipedia

Semantic Linking of Structures

What would you want to link off a structure?
Chemical suppliers
Other publications
Analytical Data
Related Reactions
Wikipedia
Patents
“Everything”

ChemSpider Everywhere: Embed

ChemSpider Everywhere: Spectral Game

ChemSpider Everywhere: Crowdsourced Curation of Spectra

ChemSpider Everywhere: What do computers want?

Web services

flickr.com/photos/microcosmos

ChemSpider Everywhere

Linked from Wikipedia and many Public Databases

Linked from Open Notebook Science sites

Linked from Blogs using Structure/Spectra EMBED

Integrated into structure drawing packages

Integrated to software offerings from Thermo, Waters, Agilent, Bruker

ChemSpider Everywhere: ChemMobi

There will always be gaps...

What ChemSpider does not deal with, yet...

Materials
Minerals
Polymers
Biological macromolecules

Open Source, Access and Data

ChemSpider is NOT Open Source but we do use Open Source components (OpenBabel, JSpecView, Jmol). Thanks Microsoft!

ChemSpider is not an “Open Access Database” – it’s a “free access” resource

We do not assume copyright. Rights to the data and the creative works remain with the depositor

Is ChemSpider “Open Data”?

Open Data?

Who declares data as Open?
Data licensing is very interesting and can spark “interesting” conversations.
Opinions differ: Are images data? Are assertions data? What on a ChemSpider record is data? Is PubChem or PubMed Open Data?

We allow people to declare their data as Open and add an Open Data button at upload

A lot of data on ChemSpider are free but not Open
Pragmatism: Our focus is a community resource

Conclusions: ChemSpider Today

ChemSpider is an established community resource
>23 million compounds from >300 data sources
About 7000 unique users per day and up to ½ million transactions per day
A crowdsourced deposition and curation platform
Grows daily – more depositions, more links, more data
Web services provider
Linked to commercial and open source software
Supporting analytical companies: Agilent, Thermo, Waters, Bruker
Serving ONS, providing games to students, ChemSpidey robot

A publishing platform for the community

ChemSpider Tomorrow

Continue the curation effort and keep cleaning

Finish depositions – millions left to deposit

Integrate RSC content – a massive archive!

Integrate RSC publishing workflows and databases

Enable the semantic web for chemistry

Acknowledgments

Royal Society of Chemistry
Valery Tkachenko and Sergey Shevelev
Commercial Software: Microsoft, Advanced Chemistry Development, OpenEye and Symyx
Open Source Software: Jmol, OpenBabel, JSpecView
JC Bradley, Andrew Lang – The Spectral Game and Open Notebook Science integration
The “Crowd” of curators
306 Data Source providers
SyntheticPages.org

Thank you

[email protected]
ChemSpiderman
http://www.chemspider.com/blog
SLIDES: http://www.slideshare.net/AntonyWilliams
