Chemical Entity extraction using the chemicalize.org-technology Josef Scheiber Novartis Pharma AG – NITAS/TMS

Where the story of this project started ...

Dreirosenbrücke, Novartis Campus

A day in October 2008

Some time around 7:45 in the morning ...

Vision for text mining

Integration of chemical and biological knowledge

Mining for Chemical Knowledge - Rationale

- Make text corpora searchable for chemistry

- Generate chemistry databases for use in research based on Scientific Papers or Patents

- Link chemical information with further annotation in an automated way, e.g. for chemogenomics applications

- Patent analysis for MedChem projects

Connection table

Mining for Chemical Knowledge - Rationale

Information on compounds targeting GPCRs

2005: >14,000 publications

1992: 256 articles & 34 patents

1988: 9 journal articles

HELP

Information explosion

Source: Banville, Debra L. “Mining chemical structural information from the drug literature.” Drug Discovery Today, Number 1/2 Jan. 2006, p.35-42

Example: Project Prospect – Royal Society of Chemistry

Enhancing Journal Articles with Chemical Features

This helps you identify other articles that discuss the same molecule
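One way to picture how chemical annotation links articles is an inverted index from structure identifiers to the articles that mention them. The following is a minimal sketch, not a description of how Project Prospect is implemented: the article ids, the `ARTICLE_STRUCTURES` data, and both function names are hypothetical, and plain SMILES strings stand in for whatever canonical identifier a real system would use.

```python
from collections import defaultdict

# Hypothetical input: article id -> structure identifiers found in that article.
# Plain SMILES strings stand in for a canonical identifier here.
ARTICLE_STRUCTURES = {
    "article-A": {"CCO", "CC(=O)Oc1ccccc1C(=O)O"},
    "article-B": {"CC(=O)Oc1ccccc1C(=O)O"},
    "article-C": {"CCO"},
}


def build_structure_index(article_structures):
    """Map each structure identifier to the set of articles mentioning it."""
    index = defaultdict(set)
    for article_id, structures in article_structures.items():
        for structure in structures:
            index[structure].add(article_id)
    return index


def related_articles(index, article_structures, article_id):
    """Return the other articles sharing at least one structure with article_id."""
    related = set()
    for structure in article_structures[article_id]:
        related |= index[structure]
    related.discard(article_id)
    return sorted(related)


INDEX = build_structure_index(ARTICLE_STRUCTURES)
print(related_articles(INDEX, ARTICLE_STRUCTURES, "article-B"))
```

Keying the index on a structure identifier rather than on the name as written is what lets two articles match even when they use different names for the same molecule.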

Mining for Chemical Knowledge – Focus for today

- Make text corpora searchable for chemistry

- Generate chemistry databases for use in research based on Scientific Papers or Patents

- Link chemical information with further annotation in an automated way, e.g. for chemogenomics applications

- Patent analysis for MedChem projects

Connection table

A use case for successful patent mining (molecules you sometimes find in your inbox ;-) )

Vardenafil (2003, Bayer) – € 1.24 billion (USD 1.6 billion)

Sildenafil (1998, Pfizer) – € 11.7 billion (USD 15.1 billion)

Slide inspired by an example from Steve Boyer/IBM; sales data from the Prous Integrity database

Conventional Database Building

Facts – current standard

... (ACS) owes most of its wealth to its two 'information services' divisions — the publications arm and the Chemical Abstracts Service (CAS), a rich database of chemical information and literature. Together, in 2004, these divisions made about $340 million — 82% of the society's revenue — and accounted for $300 million (74%) of its expenditure. Over the past five years, the society has seen its revenue and expenditure grow steadily ...

Source: ACS homepage

Facts

- Established application

- Straightforward use

- De facto gold standard

- Unique data source

- Very costly

- No structure export at a reasonable price

- Very limited in large-scale follow-up analysis

- Most recent patents not available

Not data (search), but integration, analysis and insight, leading to decisions and discovery

Now – What would be the perfect solution?

All patent offices require applicants to provide all claimed structures in a machine-readable version, available for one-click download

Text extraction

Definition: Extract all molecules that are mentioned in a patent text of interest, convert them to structures and make them available in a machine-readable format
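The definition above has three steps: find the molecule mentions, convert them to structures, and emit a machine-readable result. The following toy sketch illustrates that flow with a plain dictionary lookup. It is an illustration only and not the chemicalize.org method: real extractors parse systematic (IUPAC) names with a name-to-structure converter instead of a fixed list. `NAME_TO_SMILES`, `extract_structures`, and the sample sentence are all hypothetical.

```python
import re

# Hypothetical name-to-structure dictionary (compound name -> SMILES).
NAME_TO_SMILES = {
    "ethanol": "CCO",
    "benzene": "c1ccccc1",
    "aspirin": "CC(=O)Oc1ccccc1C(=O)O",
}


def extract_structures(text, dictionary=NAME_TO_SMILES):
    """Return (name, SMILES) pairs for dictionary names found in text.

    Names are matched case-insensitively and reported once each,
    in order of first mention.
    """
    found = []
    seen = set()
    for token in re.findall(r"[A-Za-z][A-Za-z0-9\-]*", text):
        name = token.lower()
        if name in dictionary and name not in seen:
            seen.add(name)
            found.append((name, dictionary[name]))
    return found


SAMPLE = (
    "The tablet contains aspirin; ethanol was used as solvent "
    "and benzene was avoided. Aspirin was milled before use."
)

# Tab-separated name/SMILES lines are one simple machine-readable output.
for name, smiles in extract_structures(SAMPLE):
    print(f"{name}\t{smiles}")
```

A lookup like this only finds names it already knows, which is why name-to-structure conversion matters for patents that disclose novel compounds.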

Mining for Chemical Knowledge

Technologies from providers

Text entity recognition

(a) Extractors (IUPAC names)

- TEMIS Chemical Entity Relationships Skill Cartridge

- Accelrys Pipeline Pilot extractor (Notiora)

- Fraunhofer (ProMiner Chemistry)

- Chemaxon (chemicalize.org)

- Oscar (Corbett, Murray-Rust et al.)

- SureChem

- IBM ChemFrag Annotator

(b) Converters (names to connection table)

- CambridgeSoft name=struct

- Openeye Lexichem

- Chemaxon

Image recognition

- OSRA (NIH)

- Clide Pro (Keymodule Ltd.)

- Fraunhofer chemoCR

- ChemReader

The objective

To provide a tool that offers sophisticated text analysis methods to NIBR scientists and thereby leverages the methods of TMS

Mining for Chemical Knowledge – Novartis Tools – the chemicalize-technology is working under the hood!

Clipboard Analysis

Patent text

Identified structures

View structure onMouseOver

Export to other applications

Mining for Knowledge – Novartis Tools

Input example: J Med Chem Paper

Mining for Chemical Knowledge – Use Case

A medicinal chemist wants to synthesize a competitor compound as a tool compound for their own project

Identification of core scaffold

Analysis of substitution patterns

This enables the identification of the compounds most representative of a competitor patent

Example – A text-based patent

Automated text extraction: 452 compounds

Reference: 636 compounds

71%

A patent example

Example – An image-based patent

Text extraction is not suitable for this case: it finds only a meager 40 molecules, against 1129 in the reference. Why?

An entirely image-based patent example
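The two patent examples can be compared as simple coverage ratios. The sketch below assumes that the 71% on the text-based slide is the extracted count divided by the reference count (452 / 636); the slides do not state whether every extracted compound also appears in the reference set, so this is a ratio of counts rather than a verified recall. The function name `coverage` is hypothetical.

```python
def coverage(extracted, reference):
    """Ratio of extracted compound count to reference compound count."""
    if reference <= 0:
        raise ValueError("reference count must be positive")
    return extracted / reference


# Counts taken from the two patent examples on the slides.
TEXT_PATENT = coverage(452, 636)
IMAGE_PATENT = coverage(40, 1129)

print(f"text-based patent:  {TEXT_PATENT:.1%}")
print(f"image-based patent: {IMAGE_PATENT:.1%}")
```

Under that assumption, the image-based patent yields only a few percent of the reference set, which is why image recognition is needed in addition to text extraction.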

Language issues – e.g. Japanese patents

Encountered problems

OCR (Optical Character Recognition)!!

USPTO and WIPO patents are now available as full text in most cases

Typos!

Name2Struct problems (less of an issue here)
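OCR errors and typos usually change only a character or two of a compound name, so one common mitigation is fuzzy matching against a list of known names. The following is a minimal sketch using `difflib.get_close_matches` from the Python standard library. It is an assumption about how such a correction step could look, not a description of the tools discussed in this talk; `KNOWN_NAMES`, `correct_name`, and the 0.8 cutoff are hypothetical choices.

```python
import difflib

# Hypothetical list of known compound names to correct against.
KNOWN_NAMES = ["sildenafil", "vardenafil"]


def correct_name(token, known_names=KNOWN_NAMES, cutoff=0.8):
    """Return the closest known name for a possibly misspelled token, or None.

    cutoff is the minimum similarity ratio (0 to 1) required for a match.
    """
    matches = difflib.get_close_matches(
        token.lower(), known_names, n=1, cutoff=cutoff
    )
    return matches[0] if matches else None


# "l" misread as "1" by OCR, and a typo swapping "i" for "l".
print(correct_name("sildenafi1"))
print(correct_name("vardenafll"))
# An ordinary word should not be mapped to any compound name.
print(correct_name("patent"))
```

The cutoff trades missed corrections against false matches: a lower value repairs more damaged names but also risks mapping one real compound name onto another similar one.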

IBM initiative Patent Mining / ChemVerse database (Steve Boyer)

The objective is to automatically extract all molecules from all patents available and make them searchable in a database

They leverage cloud computing and have access to all full-text patents

This is going in absolutely the right direction

They annotate the molecules with information from freely available databases

Future ideas: Patent Analysis

Markush translation, Image+Target

Ranking capabilities of the outcome for the user

"Blurred" dictionaries for translating generic terms such as aryl, cycloalkyl, etc.

Select and annotate as entity: on-the-fly error correction

Results go into a database

Crowdsourcing efforts to improve and store results

Suggest functionality

To enable true Patinformatics analyses ...

Definition by Tony Trippe:

Acknowledgements

Alex Fromm Katia Vella Olivier Kreim

Therese Vachon Daniel Cronenberger Pierre Parisot Martin Romacker Nicolas Grandjean

NITAS/TMS Clayton Springer Naeem Yusuff Bharat Lagu

And many other people in different divisions of NIBR for their support