ChemSpider: Collecting and Curating the World’s Chemistry with the Community

RSC ChemSpider Science Commons Symposium Pacific Northwest #scspn

DESCRIPTION

These are the slides I will be presenting at the Science Commons Symposium Pacific Northwest, on the Microsoft campus in Redmond, in about 5 minutes' time.

Citation preview

Page 1: RSC ChemSpider Science Commons Symposium Pacific Northwest #scspn

ChemSpider: Collecting and Curating the World’s Chemistry with the Community

Page 2: RSC ChemSpider Science Commons Symposium Pacific Northwest #scspn

A Pragmatic Vision

“Build a Structure Centric Community”

December 2006 – A hobby project initiated to connect chemistry on the web

Integrate chemical structure data on the web
Create a “structure-based hub” to information and data
Provide access to structure-based “algorithms”
Let chemists contribute their own data
Allow the community to curate/correct data

Page 3: RSC ChemSpider Science Commons Symposium Pacific Northwest #scspn

Where is chemistry online?

Encyclopedic articles (Wikipedia)
Chemical vendor databases
Metabolic pathway databases
Property databases
Patents with chemical structures
Drug Discovery data
Scientific publications
Compound aggregators
Blogs/Wikis and Open Notebook Science

Page 4: RSC ChemSpider Science Commons Symposium Pacific Northwest #scspn

Chemistry on the Internet TODAY

Chemistry searches are generally limited to text-based searches across the internet

Data are dirty: sorting the wheat from the chaff. Who can you trust?

Too many searches required to resource data

Page 5: RSC ChemSpider Science Commons Symposium Pacific Northwest #scspn

media.obsessable.com

As few interfaces as possible

What do humans want?

Page 6: RSC ChemSpider Science Commons Symposium Pacific Northwest #scspn

Chemistry on the Internet FUTURE

The semantic web for chemistry is in place
Crowdsourced contributions are commonplace
Chemists will search by structure/substructure
Chemistry articles indexed and searchable
Reduced number of searches to find data
Data are integrated – compounds, vendors, syntheses, data, publications and patents
A world of Open Access and Open Data

Classical business models will have to morph

Page 7: RSC ChemSpider Science Commons Symposium Pacific Northwest #scspn

Getting it done

March 2007 – A beta system opened online
One purchased computer, two home-built
Seeded with 10.5 million structures
Structure/substructure searching

June 2007
A curating layer to flag data
A deposition interface to add to the data

And so it continued….

Page 8: RSC ChemSpider Science Commons Symposium Pacific Northwest #scspn

ChemSpider Searches

Page 9: RSC ChemSpider Science Commons Symposium Pacific Northwest #scspn

Search Cholesterol

Page 10: RSC ChemSpider Science Commons Symposium Pacific Northwest #scspn

Search Cholesterol

Page 11: RSC ChemSpider Science Commons Symposium Pacific Northwest #scspn

Search Cholesterol

Page 12: RSC ChemSpider Science Commons Symposium Pacific Northwest #scspn

Search Cholesterol

Page 13: RSC ChemSpider Science Commons Symposium Pacific Northwest #scspn

Search Cholesterol

Page 14: RSC ChemSpider Science Commons Symposium Pacific Northwest #scspn

Linked across the internet

Page 15: RSC ChemSpider Science Commons Symposium Pacific Northwest #scspn

Kyoto Encyclopedia of Genes and Genomes

Page 16: RSC ChemSpider Science Commons Symposium Pacific Northwest #scspn

Links to Patents based on structure

Page 17: RSC ChemSpider Science Commons Symposium Pacific Northwest #scspn

Articles Linked

Page 18: RSC ChemSpider Science Commons Symposium Pacific Northwest #scspn

ChemSpider Complex Searches

Page 19: RSC ChemSpider Science Commons Symposium Pacific Northwest #scspn

Link off a structure in ChemSpider

Chemical suppliers
Other publications
Analytical Data
Related Reactions
Wikipedia
Patents
“Everything”

Page 20: RSC ChemSpider Science Commons Symposium Pacific Northwest #scspn

Answering Questions for Chemists

Questions a chemist might ask…

What is the melting point of n-butanol?
What is the chemical structure of Xanax?
Chemically, what is phenolphthalein?
What are the stereocenters of cholesterol?
Where can I find publications about xylene?
What are the different trade names for Ketoconazole?
What is the NMR spectrum of Aspirin?
What are the safety handling issues for Thymol Blue?

Page 21: RSC ChemSpider Science Commons Symposium Pacific Northwest #scspn

What is a compound?

Page 22: RSC ChemSpider Science Commons Symposium Pacific Northwest #scspn

ChemSpider is a structure-centric hub

ChemSpider aggregates and links out across the internet

Data aggregate based on “structures and links”

What defines a chemical compound?

Page 23: RSC ChemSpider Science Commons Symposium Pacific Northwest #scspn

Linked Data on the Web

Taken from: Rafael Sidi’s Blog

Page 24: RSC ChemSpider Science Commons Symposium Pacific Northwest #scspn

Where Would You look? What Do You Trust?

Page 25: RSC ChemSpider Science Commons Symposium Pacific Northwest #scspn

Question Everything online: www.dhmo.org

Page 26: RSC ChemSpider Science Commons Symposium Pacific Northwest #scspn

Di-Hydrogen Monoxide

2H

Page 27: RSC ChemSpider Science Commons Symposium Pacific Northwest #scspn

Di-Hydrogen Monoxide

2H + 1O

Page 28: RSC ChemSpider Science Commons Symposium Pacific Northwest #scspn

Di-Hydrogen Monoxide

H2O

Page 29: RSC ChemSpider Science Commons Symposium Pacific Northwest #scspn

Di-Hydrogen Monoxide

H2O

Water

Page 30: RSC ChemSpider Science Commons Symposium Pacific Northwest #scspn

It’s all on Wikipedia…

Page 31: RSC ChemSpider Science Commons Symposium Pacific Northwest #scspn

Chemistry on The Internet Is Messy

Page 32: RSC ChemSpider Science Commons Symposium Pacific Northwest #scspn

It’s Methane…

Page 33: RSC ChemSpider Science Commons Symposium Pacific Northwest #scspn

What’s Methane?

Page 34: RSC ChemSpider Science Commons Symposium Pacific Northwest #scspn

What’s Methane?

Page 35: RSC ChemSpider Science Commons Symposium Pacific Northwest #scspn

What ELSE is Methane???

Page 36: RSC ChemSpider Science Commons Symposium Pacific Northwest #scspn

PubChem

Page 37: RSC ChemSpider Science Commons Symposium Pacific Northwest #scspn

Chemistry is REALLY Messy

Page 38: RSC ChemSpider Science Commons Symposium Pacific Northwest #scspn

Vancomycin

Who will curate?

How would you clean such a large dataset?

Assertions!!!

Page 39: RSC ChemSpider Science Commons Symposium Pacific Northwest #scspn

Vancomycin on ChemSpider 1 compound – 3 days

Page 40: RSC ChemSpider Science Commons Symposium Pacific Northwest #scspn

The EXPERTS must get it right?!

Page 41: RSC ChemSpider Science Commons Symposium Pacific Northwest #scspn

Wikipedia, C&E News, PubChem C&E News (from ACS)

Page 42: RSC ChemSpider Science Commons Symposium Pacific Northwest #scspn

The InChI Identifier

Page 43: RSC ChemSpider Science Commons Symposium Pacific Northwest #scspn

Multiple Layers
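
As a rough illustration of the layered design, an InChI string can be split on "/" into its layers. The sketch below is a naive splitter, not a validating InChI parser: it assumes a single-component string with a formula layer and ignores repeated or fixed-hydrogen layers. The L-alanine string, the function name `split_inchi`, and the human-readable layer names are supplied here for illustration and do not come from the slides.

```python
# Naive illustration: split an InChI string into its slash-separated layers.
# Assumption: single-component InChI with a formula layer; no validation.
LAYER_NAMES = {
    "c": "connectivity",
    "h": "hydrogens",
    "q": "charge",
    "p": "protonation",
    "b": "double-bond stereo",
    "t": "tetrahedral stereo",
    "m": "stereo inversion flag",
    "s": "stereo type",
    "i": "isotopes",
}


def split_inchi(inchi: str) -> dict:
    """Return a dict mapping layer name to layer content."""
    prefix = "InChI="
    if not inchi.startswith(prefix):
        raise ValueError("not an InChI string")
    parts = inchi[len(prefix):].split("/")
    if len(parts) < 2:
        raise ValueError("expected at least a version and a formula layer")
    layers = {"version": parts[0], "formula": parts[1]}
    for part in parts[2:]:
        # The first character of each later layer is its one-letter prefix.
        name = LAYER_NAMES.get(part[0], part[0])
        layers[name] = part[1:]
    return layers


# L-alanine, used here only as a small example with a stereo layer.
alanine = "InChI=1S/C3H7NO2/c1-2(4)3(5)6/h2H,4H2,1H3,(H,5,6)/t2-/m0/s1"
layers = split_inchi(alanine)
for name, value in layers.items():
    print(f"{name:22s} {value}")
```

Two records that agree on the formula and connectivity layers but differ in a stereo layer describe the same skeleton with different stereochemistry, which is the situation the Taxol slides go on to show.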

Page 44: RSC ChemSpider Science Commons Symposium Pacific Northwest #scspn

InChI Strings Hash to InChIKeys
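
The practical point of hashing is that an InChI string of any length maps to a short, fixed-length key that is easy to index and to search for on the web. The sketch below shows only that property, using plain SHA-256 from Python's standard library. It is not the InChIKey algorithm, which hashes specific layers and encodes the result as uppercase letters. The function name `toy_key` is hypothetical, and the water and methane InChI strings are supplied here for illustration.

```python
import hashlib


def toy_key(inchi: str, length: int = 14) -> str:
    """Toy fixed-length key: truncated SHA-256 hex digest (NOT a real InChIKey)."""
    digest = hashlib.sha256(inchi.encode("utf-8")).hexdigest()
    return digest[:length].upper()


water = "InChI=1S/H2O/h1H2"
methane = "InChI=1S/CH4/h1H4"

for inchi in (water, methane):
    # Same input always gives the same key; different inputs give different keys
    # (up to hash collisions, which truncation makes more likely).
    print(f"{inchi:20s} -> {toy_key(inchi)}")
```

Because a hash is one-way, a key cannot be turned back into a structure without a lookup table, which is why a resolver service is needed.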

Page 45: RSC ChemSpider Science Commons Symposium Pacific Northwest #scspn

InChIs for Taxol

Page 46: RSC ChemSpider Science Commons Symposium Pacific Northwest #scspn

InChIKeys for Taxol

DrugBank: RCINICONZNJXQF-CLDWUXIMDD
ChEBI: RCINICONZNJXQF-GXKQXQCDDN
Wikipedia: RCINICONZNJXQF-MZXODVADBJ

ChEBI and Wikipedia are the SAME structure

DrugBank is a DIFFERENT structure – ONE stereocenter
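
The comparison can be sketched in a few lines, assuming the usual InChIKey layout in which the block before the hyphen hashes the connectivity (the skeleton) and the remainder covers stereochemistry and the other layers. The keys are copied from the slide as transcribed; they are 25 characters long, shorter than the 27-character standard InChIKey. The helper name `split_key` is hypothetical.

```python
# Keys as transcribed from the slide (25-character form).
keys = {
    "DrugBank": "RCINICONZNJXQF-CLDWUXIMDD",
    "ChEBI": "RCINICONZNJXQF-GXKQXQCDDN",
    "Wikipedia": "RCINICONZNJXQF-MZXODVADBJ",
}


def split_key(key: str) -> tuple:
    """Split a key into (first block, remainder) at the first hyphen."""
    first, _, rest = key.partition("-")
    return first, rest


first_blocks = {split_key(key)[0] for key in keys.values()}
same_skeleton = len(first_blocks) == 1

print("All sources share one skeleton block:", same_skeleton)
for source, key in keys.items():
    first, rest = split_key(key)
    print(f"{source:10s} skeleton={first} remainder={rest}")
```

A differing remainder says only that something outside the connectivity differs. It can also change with the InChI version or generation options, so deciding which layer is responsible means comparing the full InChI strings.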

Page 47: RSC ChemSpider Science Commons Symposium Pacific Northwest #scspn

Does one stereocenter matter?

Page 48: RSC ChemSpider Science Commons Symposium Pacific Northwest #scspn

Does one stereocenter matter?

Distaval, Talimol, Nibrol, Sedimide, Quietoplex, Contergan, Neurosedyn, and Softenon

Page 49: RSC ChemSpider Science Commons Symposium Pacific Northwest #scspn

Does one stereocenter matter?

Distaval, Talimol, Nibrol, Sedimide, Quietoplex, Contergan, Neurosedyn, and Softenon

Page 50: RSC ChemSpider Science Commons Symposium Pacific Northwest #scspn

Assertion and Chemical Entities

Who says what Taxol is?

What is the “timeline” for a molecule?

How do we clean up the Public data?

The Quality source is Chemical Abstracts Service…

Page 51: RSC ChemSpider Science Commons Symposium Pacific Northwest #scspn

Vancomycin – Search the Internet

Page 52: RSC ChemSpider Science Commons Symposium Pacific Northwest #scspn

Full Molecule Search: 4 Hits

Page 53: RSC ChemSpider Science Commons Symposium Pacific Northwest #scspn

Full Skeleton Search: 104 Hits
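
The gap between the two hit counts is what one would expect if a full-molecule search matches the whole key while a skeleton search matches only its first block. The slides do not describe how ChemSpider implements either search, so the sketch below is an assumption. It builds a tiny in-memory index, and because the slides give no keys for vancomycin it reuses the three Taxol keys from the earlier slide as sample records. All function names are hypothetical.

```python
from collections import defaultdict

# Sample records: (source, key). Taxol keys reused from the earlier slide.
RECORDS = [
    ("DrugBank", "RCINICONZNJXQF-CLDWUXIMDD"),
    ("ChEBI", "RCINICONZNJXQF-GXKQXQCDDN"),
    ("Wikipedia", "RCINICONZNJXQF-MZXODVADBJ"),
]


def skeleton_of(key: str) -> str:
    """First block of the key, assumed to hash the connectivity only."""
    return key.split("-", 1)[0]


def build_indexes(records):
    """Index records by full key and by skeleton block."""
    by_full = defaultdict(list)
    by_skeleton = defaultdict(list)
    for source, key in records:
        by_full[key].append(source)
        by_skeleton[skeleton_of(key)].append(source)
    return by_full, by_skeleton


def full_molecule_search(query: str, by_full) -> list:
    return list(by_full.get(query, []))


def full_skeleton_search(query: str, by_skeleton) -> list:
    return list(by_skeleton.get(skeleton_of(query), []))


by_full, by_skeleton = build_indexes(RECORDS)
query = "RCINICONZNJXQF-GXKQXQCDDN"
exact_hits = full_molecule_search(query, by_full)
skeleton_hits = full_skeleton_search(query, by_skeleton)
print("Full molecule:", exact_hits)
print("Full skeleton:", skeleton_hits)
```

On these sample records the exact search returns one source and the skeleton search returns all three, the same pattern as the 4 versus 104 hits on the slides.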

Page 54: RSC ChemSpider Science Commons Symposium Pacific Northwest #scspn

The InChI “Resolver”

Page 55: RSC ChemSpider Science Commons Symposium Pacific Northwest #scspn

Citizen Scientists

Page 56: RSC ChemSpider Science Commons Symposium Pacific Northwest #scspn

Crowd-sourcing Chemistry Curation

Crowd-sourced curation: identify/tag errors, edit names, synonyms, identify records to deprecate

Page 57: RSC ChemSpider Science Commons Symposium Pacific Northwest #scspn

Building a Structure Centric Community for Chemists

Multi-level Curation and Approval

Page 58: RSC ChemSpider Science Commons Symposium Pacific Northwest #scspn

Citizens as Data Sources

Page 59: RSC ChemSpider Science Commons Symposium Pacific Northwest #scspn

Page 60: RSC ChemSpider Science Commons Symposium Pacific Northwest #scspn

Semantic Markup: Project Prospect

Page 61: RSC ChemSpider Science Commons Symposium Pacific Northwest #scspn

Entity-Extraction, Mark-up, Annotate

Page 62: RSC ChemSpider Science Commons Symposium Pacific Northwest #scspn

Success Depends on Dictionaries

Page 63: RSC ChemSpider Science Commons Symposium Pacific Northwest #scspn

ChemMantis and CJOC

Page 64: RSC ChemSpider Science Commons Symposium Pacific Northwest #scspn

Name-Structure Pairs

Page 65: RSC ChemSpider Science Commons Symposium Pacific Northwest #scspn

Species – linked to Wikipedia

Page 66: RSC ChemSpider Science Commons Symposium Pacific Northwest #scspn

Semantic Linking of Structures

What would you want to link off a structure?

Chemical suppliers
Other publications
Analytical Data
Related Reactions
Wikipedia
Patents
“Everything”

Page 67: RSC ChemSpider Science Commons Symposium Pacific Northwest #scspn

ChemSpider Everywhere: Embed

Page 68: RSC ChemSpider Science Commons Symposium Pacific Northwest #scspn

ChemSpider Everywhere: Spectral Game

Page 69: RSC ChemSpider Science Commons Symposium Pacific Northwest #scspn

ChemSpider Everywhere: Crowdsourced Curation of Spectra

Page 70: RSC ChemSpider Science Commons Symposium Pacific Northwest #scspn

ChemSpider Everywhere: What do computers want?

Web services

flickr.com/photos/microcosmos

Page 71: RSC ChemSpider Science Commons Symposium Pacific Northwest #scspn

ChemSpider Everywhere

Linked from Wikipedia and many Public Databases

Linked from Open Notebook Science sites

Linked from Blogs using Structure/Spectra EMBED

Integrated into structure drawing packages

Integrated into software offerings from Thermo, Waters, Agilent, Bruker

Page 72: RSC ChemSpider Science Commons Symposium Pacific Northwest #scspn

ChemSpider Everywhere: ChemMobi

Page 73: RSC ChemSpider Science Commons Symposium Pacific Northwest #scspn

There will always be gaps...

What ChemSpider does not deal with, yet...

Materials
Minerals
Polymers
Biological macromolecules

Page 74: RSC ChemSpider Science Commons Symposium Pacific Northwest #scspn

Open Source, Access and Data

ChemSpider is NOT Open Source but we do use Open Source components (OpenBabel, JSpecView, Jmol). Thanks Microsoft!

ChemSpider is not an “Open Access Database” – it’s a “free access” resource

We do not assume copyright. Rights to the data and the creative works remain with the depositor

Is ChemSpider “Open Data”?

Page 75: RSC ChemSpider Science Commons Symposium Pacific Northwest #scspn

Open Data?

Page 76: RSC ChemSpider Science Commons Symposium Pacific Northwest #scspn

Who declares data as Open?

Data licensing is very interesting and can spark “interesting” conversations.

Opinions differ:
Are images data?
Are assertions data?
What on a ChemSpider record is data?
Is PubChem or PubMed Open Data?

We allow people to declare their data as Open and add an Open Data button at upload

A lot of data on ChemSpider are free but not Open

Pragmatism: Our focus is a community resource

Page 77: RSC ChemSpider Science Commons Symposium Pacific Northwest #scspn

Conclusions: ChemSpider Today

ChemSpider is an established community resource

>23 million compounds from >300 data sources
About 7000 unique users per day and up to ½ million transactions per day
A crowdsourced deposition and curation platform
Grows daily – more depositions, more links, more data

Web services provider

Linked to commercial and open source software
Supporting analytical companies: Agilent, Thermo, Waters, Bruker
Serving ONS, providing games to students, ChemSpidey robot

A publishing platform for the community

Page 78: RSC ChemSpider Science Commons Symposium Pacific Northwest #scspn

ChemSpider Tomorrow

Continue the curation effort and keep cleaning

Finish depositions – millions left to deposit

Integrate RSC content – a massive archive!

Integrate RSC publishing workflows and databases

Enable the semantic web for chemistry

Page 79: RSC ChemSpider Science Commons Symposium Pacific Northwest #scspn

Acknowledgments

Royal Society of Chemistry
Valery Tkachenko and Sergey Shevelev
Commercial Software: Microsoft, Advanced Chemistry Development, OpenEye and Symyx
Open Source Software: Jmol, OpenBabel, JSpecView
JC Bradley, Andrew Lang – The Spectral Game and Open Notebook Science integration
The “Crowd” of curators
306 Data Source providers
SyntheticPages.org

Page 80: RSC ChemSpider Science Commons Symposium Pacific Northwest #scspn

Thank you

[email protected]: ChemSpiderman
www.chemspider.com/blog
SLIDES: www.slideshare.net/AntonyWilliams