Enhancing Discoverability Across Royal Society Of Chemistry Content By Integrating To Chem Spider,...

DESCRIPTION

The ability to query across a chemistry publisher's content using chemical structure searching can dramatically enhance discoverability. RSC has been applying a number of procedures to integrate RSC’s ChemSpider community resource with our published content and databases. These include: 1) entity extraction procedures, 2) chemical name conversion procedures using software algorithms and curated dictionaries, 3) semantic markup and 4) crowdsourced curation processes. This presentation will provide an overview of the processes we have utilized in order to provide structure-based integration to RSC content. We will discuss our ongoing efforts to extend the approaches to the mining of data from the rich supplementary information sections of many RSC publications. Our intention is to provide access to synthesis procedures and analytical data and further enrich the ChemSpider database for the benefit of the chemistry community.

Enhancing discoverability across Royal Society of Chemistry content by integrating to ChemSpider, an online database of chemical structures

A Pragmatic Vision

“Build a Structure Centric Community to Serve Chemists”

Integrate chemical structure data on the web
Create a “structure-based hub” to information, data and algorithmic predictions
Let chemists contribute their own data
Allow the community to curate/correct data

ChemSpider Today

Over 25 million unique compounds
Sourced from over 300 data sources
Growing daily – new compounds, annotations, data

Structures, text, spectra, images, movies, syntheses

Text searching the web is far from optimal
Structure searching the web is not a dream
The quality of data on the web is a problem
An example…

Keep Your Plants Healthy-Looking

Which is better for Plants? Vodka, Sprite or Viagra?

It Works – Viagra Wins the Day

Now Which is Better?

Viagra or Cialis?

Images sourced from Wikipedia

Cialis

I want…
The structure
Any patent information
Related publications
Where can I buy it?
Metabolic pathway info
What else is easy to find…

Cialis on Google?

What is Cialis?

What is Cialis? Can we trust Wikipedia?

What is Cialis?

6 hits on PubChem

What is Cialis?

Search by Trade Name

Search by CAS Number (from Wikipedia)

Are there other names???

PubMed hits: 736 for Tadalafil, 744 for Cialis

Are There Other Names?

IC351 on PubChem?

5 HITS for IC351

ZERO HITS for IC 351
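The gap between "IC351" and "IC 351" comes down to how the separator inside a compound code is handled. As a minimal sketch, assuming a plain text corpus and that only whitespace and hyphens vary between spellings, a tolerant matcher could look like the following. The function names are hypothetical and are not part of PubChem or ChemSpider.

```python
import re

# Assumption: variants of a development code differ only by an optional
# space or hyphen between its letter and digit runs (IC351, IC 351, IC-351).
SEPARATOR = r"[\s\-]?"


def code_pattern(code: str) -> re.Pattern:
    """Build a case-insensitive regex matching separator variants of a code."""
    chunks = re.findall(r"[A-Za-z]+|\d+", code)
    if not chunks:
        raise ValueError(f"no letters or digits in code: {code!r}")
    body = SEPARATOR.join(re.escape(chunk) for chunk in chunks)
    # Word boundaries stop IC351 from matching inside IC3510 or BASIC351.
    return re.compile(rf"\b{body}\b", re.IGNORECASE)


def mentions(code: str, text: str) -> bool:
    """Return True if any separator variant of the code occurs in the text."""
    return code_pattern(code).search(text) is not None
```

Either spelling can be used as the query; `mentions("IC 351", "IC351 was dosed orally")` and `mentions("IC351", "treated with IC-351")` both match. This only papers over spelling variation. It does nothing for trade names or registry numbers, which need a dictionary.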

Text Searching the Web

Text searching the web for chemical compounds is an enormous challenge

RSC has multiple databases, >500,000 articles and a lot of other resources. How do we do?

The RSC Publishing Platform (Beta)

2+2 = 4 Articles?

CAS Number Search

Text Searching the Web

Text searching RSC Publishing for chemical compounds to retrieve ALL hits is a challenge

Dictionaries of name-structure relationships could be very enabling. Creating validated dictionaries is, also, an enormous challenge

Search ChemSpider for Cialis

Cialis on ChemSpider : 1 hit

Chemicals are curated/validated on ChemSpider by ourselves and the community

Based on assertions from various sources. Iterative, time-consuming and exacting!

We believe we know the structure now

Cialis – Searching the Web by InChI

Search Molecular SKELETON

Search Full Molecule

InChI Search the Web by Skeleton: 78 Hits by Skeleton

InChI Search the Web, Exact Match: 32 Hits by InChIKey

InChI Search the Web, Exact Match: 6 Hits by Standard InChIKey

InChIfying the Web

Different versions of InChI lead to complex search results

There are more than 2X as many “skeleton” hits for Cialis as exact matches. Different stereo? Mistakes?

Our judgment, based on the following experience: MISTAKES

Vancomycin – Search the Internet

Full Molecule Search: 4 Hits

Full Skeleton Search: 104 Hits
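The difference between skeleton and full-molecule hits follows from the layout of an InChIKey: the first 14-character block hashes the connectivity, the second block hashes the remaining layers such as stereochemistry and isotopes and ends in a standard/non-standard flag plus a version letter, and the final character records protonation. The sketch below splits a key along those lines and compares the two kinds of match. The keys used are placeholders invented for illustration, not real compounds, and the function names are hypothetical.

```python
def split_inchikey(key: str) -> dict:
    """Split an InChIKey into its connectivity block and remaining fields."""
    blocks = key.strip().upper().split("-")
    if len(blocks) != 3 or [len(b) for b in blocks] != [14, 10, 1]:
        raise ValueError(f"not an InChIKey: {key!r}")
    first, second, protonation = blocks
    return {
        "skeleton": first,              # hash of the connectivity layer
        "layers": second[:8],           # hash of stereo, isotopes, etc.
        "standard": second[8] == "S",   # 'S' standard, 'N' non-standard
        "version": second[9],
        "protonation": protonation,
    }


def skeleton_matches(query: str, candidates: list) -> list:
    """Keys that share the query's connectivity block."""
    target = split_inchikey(query)["skeleton"]
    return [c for c in candidates if split_inchikey(c)["skeleton"] == target]


def exact_matches(query: str, candidates: list) -> list:
    """Keys identical to the query in every block."""
    wanted = query.strip().upper()
    return [c for c in candidates if c.strip().upper() == wanted]


# Placeholder keys (not real compounds), standing in for keys found on the web.
QUERY = "AAAAAAAAAAAAAA-BBBBBBBBSA-N"
FOUND = [
    QUERY,
    "AAAAAAAAAAAAAA-CCCCCCCCSA-N",  # same skeleton, different stereo layer
    "AAAAAAAAAAAAAA-BBBBBBBBNA-N",  # same skeleton, non-standard key
    "ZZZZZZZZZZZZZZ-BBBBBBBBSA-N",  # different skeleton
]
```

With these placeholders, a skeleton search returns three keys and an exact search returns one, the same kind of gap as in the hit counts above. Whether the extra skeleton hits are genuine stereoisomers or errors still needs manual review.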

ChemSpider – Patents Linked

SURECHEM PATENTS GOOGLE

Google Patents

Google Books

Microsoft Academic Search

Google Scholar – Found By CAS #

Identifiers for Tadalafil

Validated Registry Number
Same Result as Searching PubMed

How Many Articles in RSC Journals?

Based on 171596-29-5, there are 13 articles in RSC journals
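One inexpensive validation step for registry numbers is the CAS check digit: the last digit equals the weighted sum of the preceding digits, counted from the right with weights 1, 2, 3 and so on, modulo 10. The sketch below strips stray whitespace, which registry numbers copied from web pages often pick up, and then applies that check. The function names are hypothetical. Passing the check only shows that a number is well formed, not that it belongs to the intended compound.

```python
import re


def normalize_cas(raw: str) -> str:
    """Remove all whitespace from a registry number string."""
    return re.sub(r"\s+", "", raw)


def is_valid_cas(raw: str) -> bool:
    """Check the format and check digit of a CAS Registry Number."""
    match = re.fullmatch(r"(\d{2,7})-(\d{2})-(\d)", normalize_cas(raw))
    if match is None:
        return False
    digits = match.group(1) + match.group(2)
    # Weight the digits 1, 2, 3, ... starting from the rightmost one.
    total = sum(
        weight * int(digit)
        for weight, digit in enumerate(reversed(digits), start=1)
    )
    return total % 10 == int(match.group(3))
```

For example, `is_valid_cas("171596-29 -5")` passes once the stray space is removed, while changing the final digit makes the check fail.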

What about if we VALIDATE identifiers?

How Many Articles in RSC Journals?

RSC Journals

REMEMBER 2+2 = 4

RSC Books

PubMed

Google Books – Expanded Hit Set

Google Scholar – Expanded Hit Set

Microsoft Academic Search

More mussels than drugs…

RSC Databases

media.obsessable.com

As few interfaces as possible

Did we solve this problem now?

What Do We Know?
Validated Name-Structure Dictionaries enable “structure-searching” the web.

Search the structure on ChemSpider and we have integrated many services online:
NCBI Entrez PubMed
Google Scholar, Books, Patents
Microsoft Academic Search
SureChem Patents
…..
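The expanded hit sets come from searching on every validated identifier of a structure rather than on a single name. A minimal sketch of that idea follows, assuming a small in-memory dictionary and a generic quoted `OR` query syntax. Both are illustrative assumptions, not the ChemSpider data model or the query language of any particular search engine.

```python
# Illustrative dictionary: one validated record and its identifiers,
# all of which appear in this talk for the same compound.
VALIDATED_SYNONYMS = {
    "tadalafil": ["Tadalafil", "Cialis", "IC351", "171596-29-5"],
}


def synonyms_for(name: str, dictionary: dict) -> list:
    """Return every validated identifier for the record that contains name."""
    wanted = name.strip().lower()
    for identifiers in dictionary.values():
        if wanted in (identifier.lower() for identifier in identifiers):
            return list(identifiers)
    # Unknown names fall back to themselves, so the search still runs.
    return [name.strip()]


def expanded_query(name: str, dictionary: dict) -> str:
    """Build a quoted OR query covering all validated identifiers."""
    return " OR ".join(f'"{s}"' for s in synonyms_for(name, dictionary))
```

Calling `expanded_query("Cialis", VALIDATED_SYNONYMS)` yields one query covering the trade name, the generic name, the development code and the registry number. The query is only as good as the dictionary, since a wrong synonym pulls in unrelated hits.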

Semantic Markup: Project Prospect

Prospected Compound Deposition

Success Depends on Dictionaries

Link to a Structure or the Right Structure?

Name-Structure Pairs

Semantic Linking of Structures

What would you want to link off a structure?
Chemical suppliers
Other publications
Analytical Data
Related Reactions
Wikipedia
Patents
“Everything”

ChemSpider SyntheticPages

Other RSC Resources…

Once we have validated name-structure dictionaries we can tap other RSC resources

There is ALWAYS a validation stage

Ultimately crowdsourced curation is necessary

Roses’ Crystal Image Collection

MP3s and Videos : Titanium

Beautiful Elements

Periodic Table Images

Other system enhancements?

What ChemSpider doesn’t deal with yet...

Markush structures and other “non-defineds”
Materials
Minerals
Polymers
Biological macromolecules

Leaving Markush to Patent Indexers

What’s Next?

Continue the curation effort and keep cleaning

Enhanced integration with RSC publishing workflows and databases

Tighter integration to RSC databases:
Natural Product Updates
Methods of Organic Synthesis

Use ChemSpider dictionaries to enhance markup precision and recall

What’s Next?

Use entity extraction approaches and ChemSpider dictionaries to analyze the entire RSC archive

Deposit structures into ChemSpider from the backfile

Use crowdsourced curation approaches to optimize the results
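The backfile steps above amount to dictionary-driven entity extraction followed by review. As a rough sketch, assuming a name-to-structure dictionary held in memory and simple case-insensitive matching, candidate mentions could be collected as below. The structure identifiers and function names are hypothetical, and a production pipeline would also need name-to-structure conversion for names that are not in the dictionary.

```python
import re


def build_matcher(name_to_id: dict) -> re.Pattern:
    """Compile one regex that finds any dictionary name as a whole token."""
    # Longest names first, so a longer name wins over a shorter prefix.
    names = sorted(name_to_id, key=len, reverse=True)
    alternatives = "|".join(re.escape(name) for name in names)
    return re.compile(
        rf"(?<![A-Za-z0-9])(?:{alternatives})(?![A-Za-z0-9])",
        re.IGNORECASE,
    )


def extract_mentions(text: str, name_to_id: dict) -> list:
    """Return (start, end, matched text, structure id) for each mention."""
    if not name_to_id:
        return []
    lookup = {name.lower(): sid for name, sid in name_to_id.items()}
    matcher = build_matcher(name_to_id)
    return [
        (m.start(), m.end(), m.group(0), lookup[m.group(0).lower()])
        for m in matcher.finditer(text)
    ]


# Hypothetical structure identifiers, used only for this illustration.
DICTIONARY = {"tadalafil": "S1", "Cialis": "S1", "vancomycin": "S2"}
```

Each returned tuple carries the character offsets of the mention, so the hits can be used both for markup in the article and as a queue for curators to confirm or reject.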

The InChI “Resolver”

InChI Resolver to DOIs
Structure Search the Web

Most Chemistry is NOT Published

Only a fraction of chemistry is published

Only a tiny fraction of chemistry is patented

What of the “Lost Chemistry”, never published and cannot be abstracted?
Reactions performed
Structures made and studied
Spectra acquired and then disposed of

ChemSpider can give it all a home…

Chemistry on the Internet FUTURE
The semantic web for chemistry is in place
Crowdsourced contributions are commonplace
Chemists will search by structure/substructure
Chemistry articles indexed and searchable
Reduced number of searches to find data
Data are integrated – compounds, vendors, syntheses, data, publications and patents
A world of Open Access and Open Data

Classical business models will have to morph

Anyone from Penn State here?

Please see me afterwards…

Thank you

antony.williams@chemspider.com
Twitter: ChemSpiderman
www.chemspider.com/blog
SLIDES: www.slideshare.net/AntonyWilliams