orcid-0000-0002-2668-4821
These are the slides I will be giving at the Science Commons Symposium Pacific Northwest, at the Microsoft Campus in Redmond, in about 5 minutes' time.
ChemSpider: Collecting and Curating the World’s Chemistry with the Community
A Pragmatic Vision
“Build a Structure Centric Community”
December 2006 – A hobby project initiated to connect chemistry on the web
Integrate chemical structure data on the web
Create a “structure-based hub” to information and data
Provide access to structure-based “algorithms”
Let chemists contribute their own data
Allow the community to curate/correct data
Where is chemistry online?
Encyclopedic articles (Wikipedia)
Chemical vendor databases
Metabolic pathway databases
Property databases
Patents with chemical structures
Drug Discovery data
Scientific publications
Compound aggregators
Blogs/Wikis and Open Notebook Science
Chemistry on the Internet TODAY
Chemistry searches are generally limited to text-based searches across the internet
Data are dirty: sorting the wheat from the chaff. Who can you trust?
Too many searches required to resource data
media.obsessable.com
As few interfaces as possible
What do humans want?
Chemistry on the Internet FUTURE
The semantic web for chemistry is in place
Crowdsourced contributions are commonplace
Chemists will search by structure/substructure
Chemistry articles indexed and searchable
Reduced number of searches to find data
Data are integrated – compounds, vendors, syntheses, data, publications and patents
A world of Open Access and Open Data
Classical business models will have to morph
Getting it done
March 2007 – A beta system opened online
One purchased computer, two home-built
Seeded with 10.5 million structures
Structure/substructure searching
June 2007
A curating layer to flag data
A deposition interface to add to the data
And so it continued….
ChemSpider Searches
Search Cholesterol
Linked across the internet
Kyoto Encyclopedia of Genes and Genomes
Links to Patents based on structure
Articles Linked
ChemSpider Complex Searches
Link off a structure in ChemSpider
Chemical suppliers
Other publications
Analytical Data
Related Reactions
Wikipedia
Patents
“Everything”
Answering Questions for Chemists
Questions a chemist might ask…
What is the melting point of n-butanol?
What is the chemical structure of Xanax?
Chemically, what is phenolphthalein?
What are the stereocenters of cholesterol?
Where can I find publications about xylene?
What are the different trade names for Ketoconazole?
What is the NMR spectrum of Aspirin?
What are the safety handling issues for Thymol Blue?
What is a compound?
ChemSpider is a structure-centric hub
ChemSpider aggregates and links out across the internet
Data aggregate based on “structures and links”
What defines a chemical compound?
Linked Data on the Web
Taken from: Rafael Sidis’ Blog
Where Would You look? What Do You Trust?
Question Everything online: www.dhmo.org
Di-Hydrogen Monoxide
2H + 1O
H2O
Water
It’s all on Wikipedia…
Chemistry on The Internet Is Messy
It’s Methane…
What’s Methane?
What ELSE is Methane???
PubChem
Chemistry is REALLY Messy
Vancomycin
Who will curate?
How would you clean such a large dataset?
Assertions!!!
Vancomycin on ChemSpider 1 compound – 3 days
The EXPERTS must get it right?!
Wikipedia, C&E News, PubChem C&E News (from ACS)
The InChI Identifier
Multiple Layers
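The layering can be seen by splitting an InChI string on its "/" separators. The following is a minimal sketch, not part of the original slides: the function name split_inchi_layers is hypothetical, and the parsing is deliberately simplified (it assumes the second segment is the formula, which holds for ordinary compounds such as the ethanol example but not for every InChI).

```python
def split_inchi_layers(inchi):
    """Split an InChI string into (label, content) pairs, one per layer.

    Simplified for illustration: the first segment is treated as the
    version, the second as the molecular formula, and every later
    segment is labelled by its one-letter prefix (c, h, b, t, m, s, ...).
    """
    prefix = "InChI="
    if not inchi.startswith(prefix):
        raise ValueError("expected a string starting with 'InChI='")
    parts = inchi[len(prefix):].split("/")
    layers = [("version", parts[0])]
    if len(parts) > 1:
        layers.append(("formula", parts[1]))
    for part in parts[2:]:
        if part:  # skip empty segments
            layers.append((part[0], part[1:]))
    return layers


# Ethanol: a connectivity layer (c) and a hydrogen layer (h)
for label, content in split_inchi_layers("InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3"):
    print(label, content)
```

A list of pairs is used rather than a dictionary because some layer prefixes can appear more than once in longer InChI strings.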
InChI Strings Hash to InChIKeys
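The idea of hashing a variable-length InChI string to a fixed-length key can be illustrated with a toy example. This is a sketch only and is NOT the real InChIKey algorithm: the real encoding has its own block structure and check rules, and the function toy_structure_key below is hypothetical. It shows only the general property that makes hashed keys convenient for web search: the same string always gives the same short key, and different strings give different keys.

```python
import hashlib


def toy_structure_key(inchi, length=14):
    """Return a fixed-length uppercase key derived from an InChI string.

    Toy illustration only; this does not reproduce real InChIKeys.
    """
    digest = hashlib.sha256(inchi.encode("ascii")).digest()
    # Map the first `length` digest bytes onto the letters A-Z
    return "".join(chr(ord("A") + byte % 26) for byte in digest[:length])


methane = "InChI=1S/CH4/h1H4"
water = "InChI=1S/H2O/h1H2"
print(toy_structure_key(methane))
print(toy_structure_key(water))
```

Because a hash cannot be reversed, recovering the InChI from a key needs a lookup service, which is the role of a resolver.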
InChIs for Taxol
InChIKeys for Taxol
DrugBank: RCINICONZNJXQF-CLDWUXIMDD
ChEBI: RCINICONZNJXQF-GXKQXQCDDN
Wikipedia: RCINICONZNJXQF-MZXODVADBJ
ChEBI and Wikipedia are the SAME structure
DrugBank is a DIFFERENT structure – ONE stereocenter
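The three Taxol keys above can be compared block by block. In an InChIKey the block before the hyphen encodes the connectivity (the skeleton), and the remainder encodes the other layers, including stereochemistry. The sketch below is illustrative only: the helper group_by_first_block is hypothetical, and the keys are copied as they appear on the slide.

```python
from collections import defaultdict

# Keys as listed on the slide, keyed by data source
TAXOL_KEYS = {
    "DrugBank": "RCINICONZNJXQF-CLDWUXIMDD",
    "ChEBI": "RCINICONZNJXQF-GXKQXQCDDN",
    "Wikipedia": "RCINICONZNJXQF-MZXODVADBJ",
}


def group_by_first_block(keys):
    """Group sources by the connectivity block of their InChIKey."""
    groups = defaultdict(list)
    for source, key in keys.items():
        first_block, _, remainder = key.partition("-")
        groups[first_block].append((source, remainder))
    return dict(groups)


for skeleton, records in group_by_first_block(TAXOL_KEYS).items():
    print(skeleton, records)
```

All three sources fall into a single group, so they agree on the skeleton; any disagreement between the records lies in the layers encoded after the hyphen.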
Does one stereocenter matter?
Distaval, Talimol, Nibrol, Sedimide, Quietoplex, Contergan, Neurosedyn, and Softenon
Assertion and Chemical Entities
Who says what Taxol is?
What is the “timeline” for a molecule?
How do we clean up the Public data?
The Quality source is Chemical Abstracts Service…
Vancomycin – Search the Internet
Full Molecule Search: 4 Hits
Full Skeleton Search: 104 Hits
The InChI “Resolver”
Citizen Scientists
Crowd-sourcing Chemistry Curation
Crowd-sourced curation: identify/tag errors, edit names, synonyms, identify records to deprecate
Building a Structure Centric Community for Chemists
Multi-level Curation and Approval
Citizens as Data Sources
Semantic Markup: Project Prospect
Entity-Extraction, Mark-up, Annotate
Success Depends on Dictionaries
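Why dictionaries matter for entity extraction can be shown with a minimal sketch. It assumes a small hand-made dictionary of name-structure pairs (NAME_TO_INCHI, hypothetical and far smaller than any real dictionary) and a hypothetical helper find_chemical_names. It is not the method used by Project Prospect or ChemMantis. A name that is missing from the dictionary is simply never found.

```python
import re

# Hypothetical name-structure dictionary: chemical name -> InChI string
NAME_TO_INCHI = {
    "methane": "InChI=1S/CH4/h1H4",
    "water": "InChI=1S/H2O/h1H2",
    "ethanol": "InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3",
}


def find_chemical_names(text, dictionary):
    """Return (matched text, start offset, InChI) for each dictionary hit."""
    # Try longer names first so they win over shorter names they contain
    names = sorted(dictionary, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(name) for name in names) + r")\b",
        re.IGNORECASE,
    )
    return [
        (match.group(0), match.start(), dictionary[match.group(0).lower()])
        for match in pattern.finditer(text)
    ]


for hit in find_chemical_names("Dissolve in Ethanol, then add water.", NAME_TO_INCHI):
    print(hit)
```

The word-boundary match keeps "methane" from being tagged inside a longer name such as "methanethiol", which would otherwise be linked to the wrong structure.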
ChemMantis and CJOC
Name-Structure Pairs
Species – linked to Wikipedia
Semantic Linking of Structures
What would you want to link off a structure?
Chemical suppliers
Other publications
Analytical Data
Related Reactions
Wikipedia
Patents
“Everything”
ChemSpider Everywhere: Embed
ChemSpider Everywhere: Spectral Game
ChemSpider Everywhere: Crowdsourced Curation of Spectra
ChemSpider Everywhere: What do computers want?
Web services
flickr.com/photos/microcosmos
ChemSpider Everywhere
Linked from Wikipedia and many Public Databases
Linked from Open Notebook Science sites
Linked from Blogs using Structure/Spectra EMBED
Integrated into structure drawing packages
Integrated to software offerings from Thermo, Waters, Agilent, Bruker
ChemSpider Everywhere: ChemMobi
There will always be gaps...
What ChemSpider does not deal with, yet...
Materials
Minerals
Polymers
Biological macromolecules
Open Source, Access and Data
ChemSpider is NOT Open Source but we do use Open Source components (OpenBabel, JSpecView, Jmol). Thanks Microsoft!
ChemSpider is not an “Open Access Database” – it’s a “free access” resource
We do not assume copyright. Rights to the data and the creative works remain with the depositor
Is ChemSpider “Open Data”?
Open Data?
Who declares data as Open?
Data licensing is very interesting and can spark “interesting” conversations.
Opinions differ: Are images data? Are assertions data? What on a ChemSpider record is data? Is PubChem or PubMed Open Data?
We allow people to declare their data as Open and add an Open Data button at upload
A lot of data on ChemSpider are free but not Open
Pragmatism: Our focus is a community resource
Conclusions: ChemSpider Today
ChemSpider is an established community resource
>23 million compounds from >300 data sources
About 7000 unique users per day and up to ½ million transactions per day
A crowdsourced deposition and curation platform
Grows daily – more depositions, more links, more data
Web services provider
Linked to commercial and open source software
Supporting analytical companies: Agilent, Thermo, Waters, Bruker
Serving ONS, providing games to students, ChemSpidey robot
A publishing platform for the community
ChemSpider Tomorrow
Continue the curation effort and keep cleaning
Finish depositions – millions left to deposit
Integrate RSC content – a massive archive!
Integrate RSC publishing workflows and databases
Enable the semantic web for chemistry
Acknowledgments
Royal Society of Chemistry
Valery Tkachenko and Sergey Shevelev
Commercial Software: Microsoft, Advanced Chemistry Development, OpenEye and Symyx
Open Source Software: Jmol, OpenBabel, JSpecView
JC Bradley, Andrew Lang – The Spectral Game and Open Notebook Science integration
The “Crowd” of curators
306 Data Source providers
SyntheticPages.org
Thank you
[email protected]: ChemSpiderman
www.chemspider.com/blog
SLIDES: www.slideshare.net/AntonyWilliams