Chemaxon UGM, Budapest 21/05/2014 SureChEMBL: Open Patent Data Mark Davies ChEMBL Group, EMBL-EBI

EUGM 2014 - Mark Davies (EMBL-EBI): SureChEMBL – Open Patent Data



EMBL-EBI Resources

Genes, genomes & variation: Ensembl, Ensembl Genomes, European Nucleotide Archive, 1000 Genomes, European Genome-phenome Archive, Metagenomics portal

Gene, protein & metabolite expression: ArrayExpress, Expression Atlas, MetaboLights, PRIDE

Protein sequences, families & motifs: InterPro, Pfam, UniProt

Chemical biology: ChEMBL, ChEBI

Literature & ontologies: Europe PubMed Central, Gene Ontology, Experimental Factor Ontology

Molecular structures: Protein Data Bank in Europe, Electron Microscopy Data Bank

Reactions, interactions & pathways: IntAct, Reactome, MetaboLights

Systems: BioModels, Enzyme Portal, BioSamples

ChEMBL – Data for Drug Discovery

Bioactivity data

Compound

Assay/Target

>Thrombin
MAHVRGLQLPGCLALAALCSLVHSQHVFLAPQQARSLLQRVRRANTFLEEVRKGNLE
RECVEETCSYEEAFEALESSTATDVFWAKYTACETARTPRDKLAACLEGNCAEGLGT
NYRGHVNITRSGIECQLWRSRYPHKPEINSTTHPGADLQENFCRNPDSSTTGPWCYT
TDPTVRRQECSIPVCGQDQVTVAMTPRSEGSSVNLSPPLEQCVPDRGQQYQGRLAVT
THGLPCLAWASAQAKALSKHQDFNSAVQLVENFCRNPDGDEEGVWCYVAGKPGDFGY
CDLNYCEEAVEEETGDGLDEDSDRAIEGRTATSEYQTFFNPRTFGSGEADCGLRPLF
EKKSLEDKTERELLESYIDGRIVEGSDAEIGMSPWQVMLFRKSPQELLCGASLISDR
WVLTAAHCLLYPPWDKNFTENDLLVRIGKHSRTRYERNIEKISMLEKIYIHPRYNWR
ENLDRDIALMKLKKPVAFSDYIHPVCLPDRETAASLLQAGYKGRVTGWGNLKETWTA
NVGKGQPSVLQVVNLPIVERPVCKDSTRIRITDNMFCAGYKPDEGKRGDACEGDSGG
PFVMKSPFNNRWYQMGIVSWGEGCDRDGKYGFYTHVFRLKKWIQKVIDQFGE

3. Insight, tools and resources for translational drug discovery

2. Organization, integration, curation and standardization of pharmacology data

1. Scientific facts

Ki = 4.5 nM

APTT = 11 min.

Patent Data

• Do we include patent data in the ChEMBL database?

• We do provide cross-references (UniChem), but not the underlying chemical data

• Most common question asked about ChEMBL during training and outreach

• Why is this important to Drug Discovery researchers?

• Patent literature 2-3 years ahead of published literature

• Prior art and freedom to operate

• Lots more data – but high cost to extract + lots of noisy data

SureChem = SureChEMBL

• December 2013 EMBL-EBI acquired SureChem – a leading ‘chemistry patent mining’ product from Digital Science, Macmillan Group

• SureChem not aligned with core future academic business

• SureChem provides a live (updated daily) view of chemical patent space

• Existing SureChem User base

• Free (SureChemOpen)

• Paying (SureChemPro + API)

• EMBL-EBI will support existing licensees (all licences have now expired)

• EMBL-EBI will provide an ongoing, free and open resource to the entire community

• Rebranded SureChEMBL

EMBL-EBI Chemistry Resources

RDF and REST API interfaces

REST API Interface

• Atlas: ligand induced transcript response (750)

• PDBe: ligand structures from structurally defined protein complexes (15K)

• ChEBI: nomenclature of primary and secondary metabolites; chemical ontology (24K)

• SureChEMBL: ligand structures from patent literature (15M)

• ChEMBL: bioactivity data from literature and depositions (1.5M)

• UniChem: InChI-based chemical resolver (full + relaxed ‘lenses’) (>70M)

• 3rd Party Data: ZINC, PubChem, ThomsonPharma DOTF, IUPHAR, DrugBank, KEGG, NIH NCC, eMolecules, FDA SRS, PharmGKB, Selleck, …. (~55M)

SureChEMBL System

https://www.surechembl.org/

Keyword search

Filter by authority

Structure sketch

Filter by document section

Help

Paste SMILES, MOL, name

Types of chemistry search

Filter by date

Patent number search

SureChEMBL System

Data Export and View Patent Family

SureChEMBL System

SureChEMBL API Access

System Capabilities

• Searching capabilities

• Free text keywords and Lucene fields

• Patent IDs & bibliographic information

• Patent authority & date

• Chemical structure

• Retrieving capabilities

• Retrieve chemistry (with additional filters)

• Retrieve patent family information

• Retrieve annotated full patent text

• Accessible via Web Interface and API
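
As a rough illustration of how the keyword, authority and date searches listed above could be combined, the sketch below assembles a Lucene-style query string. The field names `authority` and `pub_date` are hypothetical placeholders, not the actual SureChEMBL field names, and only the general Lucene syntax (`field:value`, `field:[from TO to]`, `AND`) is assumed.

```python
def build_query(keywords, authority=None, date_from=None, date_to=None):
    """Combine free-text keywords with optional fielded filters (Lucene syntax)."""
    parts = []
    for kw in keywords:
        # Quote multi-word keywords so they are treated as a phrase.
        escaped = kw.replace('"', r'\"')
        parts.append(f'"{escaped}"' if " " in kw else escaped)
    if authority:
        # Hypothetical field name for the patent authority (e.g. WO, EP, US).
        parts.append(f"authority:{authority}")
    if date_from or date_to:
        # Hypothetical field name for the publication date; "*" leaves a bound open.
        lower = date_from or "*"
        upper = date_to or "*"
        parts.append(f"pub_date:[{lower} TO {upper}]")
    return " AND ".join(parts)


query = build_query(["thrombin inhibitor", "anticoagulant"],
                    authority="WO", date_from="20070101")
print(query)
```

Single-word keywords are passed through unescaped here, so a real client would also need to escape Lucene special characters in them.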

SureChEMBL Data Coverage

Data                        Description & Languages          Years
EP applications  Bib. data  DocDB + Original
                 Full text  Original (EN, DE, FR)            From 1978
EP granted       Bib. data  DocDB + Original
                 Full text  Original (EN, DE, FR)            From 1980
WO applications  Bib. data  DocDB + Original                 From 1978
                 Full text  Original (EN, DE, FR, ES, RU)    From 1978
US applications  Bib. data  DocDB + Original                 From 2001
                 Full text  Original (EN)                    From 2001
US granted       Bib. data  DocDB + Original                 From 1920
                 Full text  Original (EN)                    From 1976
JP applications  Bib. data  DocDB                            From 1973
                 Full text  PAJ - English abstracts/titles   From 1976
JP granted       Bib. data  DocDB                            From 1994
90+ countries    Bib. data  DocDB                            From 1920

All patents from the above data sources are searchable via SureChEMBL

SureChEMBL Chemistry Data Coverage

• Exemplified structures from patent title, description, abstract and claims

• Structures from text 1976 onwards

• Structures from images 2007 onwards

• USPTO have provided ‘Complex Work Units’ since 2001

• CWU file types include MOL and CDX

• CWUs processed as part of pipeline

SureChEMBL Data Processing

Patent Offices: WO, EP Applications & Granted, US Applications & Granted, JP Abstracts

Chemistry Database

SureChEMBL System

Patent PDFs (service)

Application Server

Users

API

Database

Entity Recognition

SureChem IP

1-[4-ethoxy-3-(6,7-dihydro-1-methyl-7-oxo-3-propyl-1H-pyrazolo[4,3-d]pyrimidin-5-yl)phenylsulfonyl]-4-methylpiperazine

Image to Structure (one method)

Name to Structure (five methods)

OCR

Processed patents (service)

SureChEMBL and Chemaxon

[Diagram: the SureChEMBL data processing pipeline, repeated from the previous slide]

ChEMBL Overlap

• InChI based comparison using filtered parent compounds

• SureChEMBL (exported 08/05/14): 12.2M

• ChEMBL (ChEMBL 18): 1.3M

• Overlap: 235K (18.4%)

Filters

• MW between 100 and 1200

• #Atoms between 6 and 70

• ALogP between -10 and 10

• #Rings > 0

• #C > 0

• #C != #Atoms

• RTB <= 20
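
A minimal sketch of how the filters above and an InChI-based overlap count might be combined. It assumes descriptors have already been calculated for each parent compound, treats the "between" bounds as inclusive, and uses toy records with made-up values and placeholder InChI strings; the class and function names are illustrative, not part of any ChEMBL or SureChEMBL tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Compound:
    """Parent compound with precomputed descriptors (supplied by the caller)."""
    inchi: str       # InChI string of the parent structure
    mw: float        # molecular weight
    n_atoms: int     # atom count
    alogp: float     # calculated logP
    n_rings: int     # ring count
    n_carbons: int   # carbon atom count
    rtb: int         # rotatable bonds


def passes_filters(c: Compound) -> bool:
    """Apply the slide's filters; bounds are assumed to be inclusive."""
    return (
        100 <= c.mw <= 1200
        and 6 <= c.n_atoms <= 70
        and -10 <= c.alogp <= 10
        and c.n_rings > 0
        and c.n_carbons > 0
        and c.n_carbons != c.n_atoms   # excludes carbon-only structures
        and c.rtb <= 20
    )


def overlap(source_a, source_b):
    """Return the InChI strings present in both filtered sources."""
    keys_a = {c.inchi for c in source_a if passes_filters(c)}
    keys_b = {c.inchi for c in source_b if passes_filters(c)}
    return keys_a & keys_b


# Toy data: descriptor values and InChI strings are placeholders.
surechembl = [
    Compound("InChI=toy-1", 350.4, 25, 2.1, 3, 18, 5),
    Compound("InChI=toy-2", 78.1, 6, 2.0, 1, 6, 0),    # fails MW and #C != #Atoms
    Compound("InChI=toy-3", 420.5, 30, 3.3, 2, 22, 7),
]
chembl = [
    Compound("InChI=toy-1", 350.4, 25, 2.1, 3, 18, 5),
    Compound("InChI=toy-2", 78.1, 6, 2.0, 1, 6, 0),
]
shared = overlap(surechembl, chembl)
print(len(shared), "shared compound(s) after filtering")
```

Filtering both sides before intersecting keeps the comparison symmetric, so a compound rejected in one source cannot be counted as shared.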

SureChEMBL and UniChem

• 12.2M SureChEMBL compounds are being loaded into UniChem - InChI based ‘Unified Chemical Identifier’ system

• SureChem drug-like subset (~5M) previously loaded

• Other UniChem sources include:

https://www.ebi.ac.uk/unichem/
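
To make the idea of an InChI-based identifier resolver concrete, here is a small in-memory sketch that groups source identifiers by InChIKey. The "relaxed" mode, which matches on the first (connectivity) block of the InChIKey only, is an assumption about how a looser match could work and is not taken from the UniChem implementation. The SureChEMBL identifiers and the second InChIKey are made up for illustration.

```python
from collections import defaultdict


def lookup_key(inchikey, relaxed=False):
    """Full InChIKey, or only its first hyphen-separated block when relaxed."""
    return inchikey.split("-")[0] if relaxed else inchikey


def build_index(records, relaxed=False):
    """Index (source, source_id, inchikey) records by lookup key."""
    index = defaultdict(set)
    for source, source_id, inchikey in records:
        index[lookup_key(inchikey, relaxed)].add((source, source_id))
    return index


def cross_references(index, inchikey, relaxed=False):
    """Return all (source, source_id) pairs sharing the lookup key."""
    return sorted(index.get(lookup_key(inchikey, relaxed), set()))


ASPIRIN = "BSYNRYMUTXBXSQ-UHFFFAOYSA-N"
records = [
    ("chembl", "CHEMBL25", ASPIRIN),
    ("surechembl", "EXAMPLE-ID-1", ASPIRIN),                        # made-up ID
    ("surechembl", "EXAMPLE-ID-2", "BSYNRYMUTXBXSQ-ZZZZZZZZSA-N"),  # made-up key
]

full = cross_references(build_index(records), ASPIRIN)
relaxed = cross_references(build_index(records, relaxed=True), ASPIRIN, relaxed=True)
print(len(full), "full matches;", len(relaxed), "relaxed matches")
```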

Migration Status

• System currently built and optimised to run on Amazon Web Services

• The time and cost are considered too high to move away from AWS in the short term

• The long-term plan is to migrate onto EBI infrastructure

• 3 Phase migration process

1. Patent Data Server (IFI Claims/Fairview) – done

2. Data Processing Pipeline – done

3. Web Application/API – work in progress

Technical Challenges

• User account migration

• System currently uses Digital Science authentication system

• User account required by Pro account service

• Plan to move over to an open system, e.g. OAuth/OpenID

• We aim to minimise impact on existing users, but may require users to sign up again

• Impact of providing free and open access to the Pro account service and API

• Need to monitor usage

• Usage limitations may be required

Entity Extraction

Enhanced Entity Extraction

• Identify new entity types, e.g. proteins, diseases, cell lines, assays, …

• Extend using ChEMBL dictionaries + others

• Ontology mapping/Semantic tagging

• Protein/biotherapeutic sequence extraction

• Sequence based patent searches

• The system currently provides minimal cross-referencing

• Quickly enhance using UniChem

• Tag up all commonly used identifiers (ChEBI, ChEMBL, PubChem, UniProt,…)
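
One simple way to tag commonly used identifiers in patent text is pattern matching. The sketch below is an assumption about how such tagging could start, not a description of the SureChEMBL pipeline; it covers ChEMBL, ChEBI and UniProt identifiers only, and the UniProt pattern follows the published accession number format.

```python
import re

# Regular expressions for a few identifier types; others could be added similarly.
PATTERNS = {
    "ChEMBL": re.compile(r"\bCHEMBL\d+\b"),
    "ChEBI": re.compile(r"\bCHEBI:\d+\b"),
    "UniProt": re.compile(
        r"\b(?:[OPQ][0-9][A-Z0-9]{3}[0-9]"
        r"|[A-NR-Z][0-9](?:[A-Z][A-Z0-9]{2}[0-9]){1,2})\b"
    ),
}


def tag_identifiers(text):
    """Return (start, end, source, identifier) tuples sorted by position."""
    hits = []
    for source, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            hits.append((match.start(), match.end(), source, match.group()))
    return sorted(hits)


sentence = "Aspirin (CHEMBL25, CHEBI:15365) was tested against thrombin (P00734)."
for start, end, source, identifier in tag_identifiers(sentence):
    print(f"{source:8s} {identifier} [{start}:{end}]")
```

Pattern matching alone will produce false positives on text that merely looks like an accession, so a dictionary check against the source database would be a natural next step.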

Bioactivity Data Extraction?

Compounds

Target/Assay

Bioactivity

Markush Structure Extraction?

-alkyl

-aryl

-heteroaryl

-heterocyclyl

-cycloalkyl

….

Image Processing

• Image extraction starts from 01/01/2007

• Use Amazon EC2 Spot Instances to process pre-2007 image data

• Spot instances are significantly cheaper, e.g. an m1.xlarge instance costs:

• Standard Cost = $0.52/hour

• Spot Instance Cost = ~$0.125/hour

• New methods and tools can be introduced to improve compound image extraction

• System currently uses CLiDE, alternatives include OSRA and Imago

• Document segmentation, developed as part of curation system, could be applied to complex patent images
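
The price difference quoted above works out to a saving of roughly 76% per instance-hour. The short calculation below shows the arithmetic; the 10,000 instance-hours figure is a hypothetical workload chosen only for illustration, and the spot price is the approximate value from the slide.

```python
STANDARD_RATE = 0.52   # $/hour, m1.xlarge standard price quoted on the slide
SPOT_RATE = 0.125      # $/hour, approximate spot price quoted on the slide


def saving_fraction(standard, spot):
    """Fraction of the standard price saved by using spot instances."""
    return 1.0 - spot / standard


def job_cost(instance_hours, rate):
    """Total cost of a job that needs the given number of instance-hours."""
    return instance_hours * rate


hours = 10_000  # hypothetical backlog size, not a figure from the talk
print(f"Saving per hour: {saving_fraction(STANDARD_RATE, SPOT_RATE):.0%}")
print(f"Standard: ${job_cost(hours, STANDARD_RATE):,.2f}")
print(f"Spot:     ${job_cost(hours, SPOT_RATE):,.2f}")
```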

Open PHACTS Extension

• Open PHACTS project is keen to include patent data in future extensions to the project

• ENSO approved - funding to include SureChEMBL data in Open PHACTS

• RDF conversion, target indexing and API development

• The EBI-RDF project also benefits from the RDF conversion

• SureChEMBL is updated daily, compared to quarterly ChEMBL updates

• An interesting challenge for us in creating exports, and for systems loading SureChEMBL data

More Future Plans

• Refactor interface for EMBL look and feel

• Third party user support system migration

• Workflow tool enhancements

• Update and release existing KNIME protocols + Pipeline Pilot

• Lots of interest in bringing the system in-house for internal document processing and searching

• Complex licensing issues

• AWS setup makes this easier

• Ligand Ensemble-based mapping of ChEMBL literature to patents

• Provide weekly/monthly feed of patent structures to PubChem

Rebranding Complete

People and Groups Involved

The ChEMBL Group

• John Overington

• Mark Davies

• George Papadatos

• Jon Chambers

• Anne Hersey

Digital Science

• Nicko Goncharoff

• James Siddle

• Richard Koks

• Tom Llewellyn

ChEMBL 18 Released

Website

Web Services

Widgets

Downloads

Virtual Machine

Semantic Web

1,359,508 compounds

12,419,715 activities

1,042,374 assays

9,414 targets

53,298 documents

19 bioactivity sources

https://www.ebi.ac.uk/chembl/

myChEMBL Update Coming Soon

http://chembl.blogspot.co.uk/