How the InChI identifier is used to underpin our online chemistry databases at RSC Antony Williams, Valery Tkachenko and Ken Karapetyan ACS San Francisco August 2014

DESCRIPTION

The Royal Society of Chemistry hosts a growing collection of online chemistry content. For much of our work the InChI identifier is an important component underpinning our projects. This enables the integration of chemical compounds with our archive of scientific publications, the delivery of a reaction database containing millions of reactions as well as a chemical validation and standardization platform developed to help improve the quality of structural representations on the internet. The InChI has been a fundamental part of each of our projects and has been pivotal in our support of international projects such as the Open PHACTS semantic web project integrating chemistry and biology data and the PharmaSea project focused on identifying novel chemical components from the ocean with the intention of identifying new antibiotics. This presentation will provide an overview of the importance of InChI in the development of many of our eScience platforms and how we have used it to provide integration across hundreds of websites and chemistry databases across the web. We will discuss how we are now expanding our efforts to develop a platform encompassing efforts in Open Source Drug Discovery and the support of data management for neglected diseases.

How the InChI identifier is used to underpin our online chemistry databases at RSC

Antony Williams, Valery Tkachenko and Ken Karapetyan

ACS San Francisco, August 2014

What can I say that I haven’t said?

YouTube InChIKey Collision Movie

InChIs are for machines but do have a human aspect…

Many Names, One Structure

Structure Identifiers

OPSIN (chemical name to structure) http://opsin.ch.cam.ac.uk/

• InChI support systems…

InChI mapping helps a lot!

• We wanted to map together chemical data on the web

• We knew that chemical name mapping was difficult but dictionaries were useful

• It is InChI that became the foundation technology for our database…

• We accepted all the limitations of InChI

• We lived with the “Useful but not ideal”

• And so….
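
The mapping idea on this slide can be sketched in a few lines. This is only an illustration, not the ChemSpider implementation: the source names and record layout are hypothetical, and only the two InChIKeys (aspirin and caffeine) are real. Records that arrive under different names collapse onto one entry because they share an InChIKey.

```python
from collections import defaultdict

# Hypothetical records from three sources; only the InChIKeys are real.
RECORDS = [
    {"source": "vendor_catalog", "name": "aspirin",
     "inchikey": "BSYNRYMUTXBXSQ-UHFFFAOYSA-N"},
    {"source": "journal_article", "name": "acetylsalicylic acid",
     "inchikey": "BSYNRYMUTXBXSQ-UHFFFAOYSA-N"},
    {"source": "open_database", "name": "2-acetoxybenzoic acid",
     "inchikey": "BSYNRYMUTXBXSQ-UHFFFAOYSA-N"},
    {"source": "vendor_catalog", "name": "caffeine",
     "inchikey": "RYYVLZVUVIJVGH-UHFFFAOYSA-N"},
]


def link_by_inchikey(records):
    """Group records from any number of sources under their shared InChIKey."""
    linked = defaultdict(lambda: {"names": set(), "sources": set()})
    for rec in records:
        entry = linked[rec["inchikey"]]
        entry["names"].add(rec["name"])
        entry["sources"].add(rec["source"])
    return dict(linked)


linked = link_by_inchikey(RECORDS)
for key, entry in linked.items():
    print(key, sorted(entry["names"]), sorted(entry["sources"]))
```

The three aspirin names never have to be matched against each other as text. The structure identifier does the joining, and the collected names become the "really big dictionary" as a side effect.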

• ~32 million chemicals and growing

• Data sourced from >500 different sources

• Crowd sourced curation and annotation

• Ongoing deposition of data from our journals and our collaborators

• Structure centric hub for web-searching

• …and a really big dictionary!!!

And where can we travel???

NEW 15th Edition

*The name THE MERCK INDEX is owned by Merck Sharp & Dohme Corp., a subsidiary of Merck & Co., Inc., Whitehouse Station, N.J., U.S.A., and is licensed to The Royal Society of Chemistry for use in the U.S.A. and Canada.

Where else is RSC using InChIs

Text Mining

The N-(β-hydroxyethyl)-N-methyl-N'-(2-trifluoromethyl-1,3,4-thiadiazol-5-yl)urea prepared in Example 6 , thionyl chloride ( 5 ml ) and benzene ( 50 ml ) were charged into a glass reaction vessel equipped with a mechanical stirrer , thermometer and reflux condenser .

The reaction mixture was heated at reflux with stirring , for a period of about one-half hour .

After this time the benzene and unreacted thionyl chloride were stripped from the reaction mixture under reduced pressure to yield the desired product N-(β-chloroethyl)-N-methyl-N'-(2-trifluoromethyl-1,3,4-thiaidazol-5-yl)urea as a solid residue

SO MANY reactions!

Extracting our Archive

• What could we get from our archive?

• Find chemical names and generate structures

• Find chemical images and generate structures

• Find reactions

• Find data (MP, BP, LogP) and deposit

• Find figures and database them

• Find spectra (and link to structures)

• And of course InChIfy the entire collection
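
As a rough illustration of the "find data" step, the sketch below pulls melting points out of free text with a regular expression. The pattern and the function name are assumptions made for this example; real article text needs far more variants (units, "melting point" spelled out, decomposition notes) and this is not the pipeline used on the archive.

```python
import re

# Hypothetical pattern: "mp" or "m.p." followed by one value or a range in °C.
MP_PATTERN = re.compile(
    r"\bm\.?p\.?\s*[:=]?\s*"            # "mp", "m.p.", optional ":" or "="
    r"(\d+(?:\.\d+)?)"                  # lower (or only) value
    r"(?:\s*[-–]\s*(\d+(?:\.\d+)?))?"   # optional upper value of a range
    r"\s*°\s*C",
    re.IGNORECASE,
)


def find_melting_points(text):
    """Return (low, high) melting points in °C; single values give low == high."""
    results = []
    for low, high in MP_PATTERN.findall(text):
        low_c = float(low)
        high_c = float(high) if high else low_c
        results.append((low_c, high_c))
    return results


sample = "White solid, mp 123-125 °C; the second fraction had m.p. 78 °C."
print(find_melting_points(sample))
```

Each extracted value would still need to be attached to the right compound, which is where the name-to-structure step and the resulting InChI come in.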

After we mine the Archive

Models published from data

Text-mining Data to compare

Progress to date

• We have text-mined all 21st century articles… >100k articles from 2000-2013

• Marked up with XML and published onto the HTML forms of the articles

• Required multiple iterations based on dictionaries, markup, text mining iterations

• New visualization tools in development – not just chemical names. Add chemical and biomedical terms markup also!

MedChemComm markup

InChIs under our “repository”

• Scientific publications are a summary of work

• Is all work reported?

• How much science is lost to pruning?

• What of value sits in notebooks and is lost?

• Publications offering access to “real data”?

• How much data is lost?

• How many compounds never reported?

• How many syntheses fail or succeed?

• How many characterization measurements?

New Repository Architecture

doi: 10.1007/s10822-014-9784-5

What are we building?

• We are building the “RSC Data Repository”

• Containers for compounds, reactions, analytical data, tabular data

• Algorithms for data validation and standardization

• Flexible indexing and search technologies

• A platform for modeling data and hosting existing models and predictive algorithms

New Repository Architecture

Data tier: Compounds, Reactions, Spectra, Materials, Documents

Data access tier: Compounds API, Reactions API, Spectra API, Materials API, Documents API

User interface components tier: Compounds Widgets, Reactions Widgets, Spectra Widgets, Materials Widgets, Documents Widgets

User interface tier (examples): Analytical Laboratory application, Electronic Laboratory Notebook, Chemical Inventory application, Paid 3rd party integrations (various platforms – SharePoint, Google, etc.)

Deposition of Data

Compounds

Reactions

Analytical data

Crystallography data

InChIs under the repository

• All compound-based data handling will of course connect with InChIs

• Compounds

• Reactions

• Compound-spectra matching

• Etc. etc. etc…

For Deposition of Data

• Developing systems that provide feedback to users regarding data quality

• Validate/standardize chemical compounds

• Check for balanced reactions

• Check spectral data

• EXAMPLE Future work

• Properties – compare experimental to predicted

• Automated structure verification - NMR
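
The "check for balanced reactions" item can be illustrated with a small atom-counting sketch. It is an assumption-laden toy rather than the repository's validator: the function names are made up for this example, and it works on plain molecular formulas without brackets, charges or isotopes. The test case is a thionyl chloride chlorination, with ethanol standing in for the alcohol.

```python
import re
from collections import Counter

# One element symbol followed by an optional count, e.g. "C2", "H6", "O".
ELEMENT_COUNT = re.compile(r"([A-Z][a-z]?)(\d*)")


def atom_counts(formula):
    """Count atoms in a simple formula such as 'C2H6O' (no brackets or charges)."""
    counts = Counter()
    for element, number in ELEMENT_COUNT.findall(formula):
        counts[element] += int(number) if number else 1
    return counts


def is_balanced(reactants, products):
    """True when both sides of the reaction contain the same atoms."""
    left = sum((atom_counts(f) for f in reactants), Counter())
    right = sum((atom_counts(f) for f in products), Counter())
    return left == right


# Ethanol + thionyl chloride -> chloroethane + sulfur dioxide + hydrogen chloride
print(is_balanced(["C2H6O", "SOCl2"], ["C2H5Cl", "SO2", "HCl"]))
# Same reaction with the HCl by-product left out
print(is_balanced(["C2H6O", "SOCl2"], ["C2H5Cl", "SO2"]))
```

Text-mined reactions often omit by-products, so in practice an unbalanced result is feedback for the depositor rather than a reason to reject the record.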

RSC Cheminformatics Projects

• RSC as a provider of support for grant-based projects

• Utilizing ChemSpider initially as a platform

• Developing Chemical Registry Service

• Utilizing core architecture and widgets to serve the projects

The PharmaSea Website

• ChemSpider IDs and InChIs/InChIKeys made open and available for linking

• Exposed via the Open PHACTS RDF export

• A structure ID standard to enable further linking across the semantic web of science
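
Before an InChIKey is published as a link target it is worth checking that it is well formed. The sketch below is a syntax check only, with a hypothetical function name and field names chosen for this example. It relies on the published InChIKey layout: 14 letters, a hyphen, 8 letters plus a standard/non-standard flag and a version letter, a hyphen, and a final protonation letter. It cannot tell whether the key actually corresponds to any structure.

```python
import re

# 14 letters, then 8 letters + S/N flag + version letter, then 1 protonation letter.
INCHIKEY_RE = re.compile(r"^([A-Z]{14})-([A-Z]{8})([SN])([A-Z])-([A-Z])$")


def parse_inchikey(key):
    """Split a syntactically valid InChIKey into its blocks, or return None."""
    match = INCHIKEY_RE.match(key)
    if match is None:
        return None
    first, second, flag, version, protonation = match.groups()
    return {
        "first_block": first,        # hash of the core (connectivity) layers
        "second_block": second,      # hash of the remaining layers
        "standard": flag == "S",     # "S" marks a standard InChIKey
        "version": version,
        "protonation": protonation,  # "N" means no (de)protonation recorded
    }


print(parse_inchikey("BSYNRYMUTXBXSQ-UHFFFAOYSA-N"))
print(parse_inchikey("not-an-inchikey"))
```

Grouping records on `first_block` alone is one way to find entries that share a skeleton but differ in the layers hashed into the second block, such as stereochemistry.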

InChIs and DDP

Electronic Notebook Data

• Development work integrating chemistry into the Southampton Labtrove notebook

• Stoichiometry table development

• Analytical data integration

• “ChemTrove” includes chemistry widgets and InChI as an important data field

Side Effects of InChI Usage

SMILES by comparison…

Side Effects of InChI Usage

Standardization Issues

Depiction based on molfile

Standardize

• Use the SRS as guidance for standardization

• Adjust as necessary to our needs

Nitro groups

Salt and Ionic Bonds

What needs to happen?

• If we could validate

• Catch errors in databases (and clean)

• Proactively catch errors in publications/patents

• Reduce junk in the ether – improve QUALITY!

• If we standardized

• Interlinking should improve

Validate and Standardize

CVSP Filtering

CVSP Filtering of DrugBank

DrugBank (ca. 6000 records)

• 38 records with InChI not matching the structure, e.g. DB08521, DB08187

• 24 records where names (IUPAC_NAME) did not match the structure, e.g. DB08346

• 38 records with SMILES not matching the structure, e.g. DB08293

• 53 records with unusual valence, e.g. DB01983 with boron(V)
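
The "InChI not matching the structure" and "SMILES not matching the structure" counts come from a consistency check: regenerate the identifier from the stored structure and compare it with the stored identifier. The sketch below shows only that comparison logic and is not CVSP code. The record IDs, field names and the lookup table standing in for a cheminformatics toolkit are all hypothetical. Only the two InChI strings (ethanol and methanol) are real.

```python
def find_identifier_mismatches(records, regenerate):
    """Return IDs of records whose stored InChI differs from the regenerated one.

    `regenerate` is any callable that maps a structure (e.g. molfile text) to
    an InChI string; in practice it would wrap a cheminformatics toolkit.
    """
    mismatched = []
    for record in records:
        regenerated = regenerate(record["structure"])
        if regenerated != record["stored_inchi"]:
            mismatched.append(record["id"])
    return mismatched


# Stand-in for a toolkit call: a fixed lookup from placeholder structures to InChIs.
FAKE_TOOLKIT = {
    "<ethanol molfile>": "InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3",
    "<methanol molfile>": "InChI=1S/CH4O/c1-2/h2H,1H3",
}

records = [
    {"id": "REC-1", "structure": "<ethanol molfile>",
     "stored_inchi": "InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3"},
    # Stored InChI describes ethanol, but the structure is methanol.
    {"id": "REC-2", "structure": "<methanol molfile>",
     "stored_inchi": "InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3"},
]

mismatches = find_identifier_mismatches(records, FAKE_TOOLKIT.__getitem__)
print(mismatches)
```

Passing the generator in as a parameter keeps the check independent of any particular toolkit, and the same function works for SMILES or names by swapping the callable and the stored field.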

ChEMBL (1.3 million records)

• 11,020 records with 4 bonds and zero charge, e.g. CHEMBL501101 or CHEMBL501973

• 271 records with hypervalent oxygen (e.g. CHEMBL2219679), carbon (e.g. 1005895), boron, chlorine, iodine or phosphine

• 6,177 records where direction of bond makes no sense, e.g. CHEMBL12760 and CHEMBL34704
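
Valence problems like these can be flagged with a lookup of allowed valences per element and formal charge. The table below is a deliberately tiny assumption for illustration, not the rule set used by CVSP, and the function name is hypothetical. The valence passed in is the sum of bond orders including implicit hydrogens.

```python
# Simplified table of allowed valences keyed by (element, formal charge).
# An illustrative assumption only; a real rule set covers many more cases.
ALLOWED_VALENCES = {
    ("B", 0): {3}, ("B", -1): {4},
    ("C", 0): {4},
    ("N", 0): {3}, ("N", 1): {4},
    ("O", 0): {2}, ("O", 1): {3}, ("O", -1): {1},
}


def has_unusual_valence(element, valence, charge=0):
    """True when the valence is not allowed for this element and charge.

    Element/charge pairs missing from the table are not judged (returns False).
    """
    allowed = ALLOWED_VALENCES.get((element, charge))
    return allowed is not None and valence not in allowed


print(has_unusual_valence("N", 4, charge=0))   # four bonds, zero charge
print(has_unusual_valence("N", 4, charge=1))   # ammonium-type nitrogen
print(has_unusual_valence("B", 5))             # pentavalent boron
print(has_unusual_valence("O", 3))             # hypervalent neutral oxygen
```

Many such records are drawing errors, for example a missing charge, so a flag of this kind is usually followed by a standardization step or a manual look rather than automatic deletion.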

ChemSpider Standardization

• Entire ChemSpider database will be standardized using modified FDA rule set

• Original Molfiles will be standardized and all properties (predicted properties, SMILES, InChIs, Names) will be regenerated

• CLEAN’ed database to compounds repository

• Standardization procedures automatically applied to all future depositions

Recent Data (last week)

Internet Data

Data Repositories and InChI

Commercial Software, Pre-competitive Data, Open Science, Open Data, Publishers, Educators, Open Databases, Chemical Vendors

Small organic molecules, Undefined materials, Organometallics, Nanomaterials, Polymers, Minerals, Particle bound, Links to Biologicals

If InChI had not been developed…

• Database linking would suffer dramatically

• The web would not be “structure searchable”

• Cheminformatics tools would likely not be linking to public domain databases in the same way

• We wouldn’t be here discussing….

• And ChemSpider would not have been built

Acknowledgments

• The InChI team

• The entire RSC cheminformatics team…

• Daniel Lowe for the text mining work

• Igor Tetko for OCHEM modeling

Thank you

Email: [email protected]

ORCID: 0000-0002-2668-4821

Twitter: @ChemConnector

Personal Blog: www.chemconnector.com

SLIDES: www.slideshare.net/AntonyWilliams