DESCRIPTION
The patent literature has historically been complex and inaccessible to the searches required for effective IP management and maintenance of a competitive position, particularly when it comes to chemical structure information. The availability of raw patent text feeds in a structured form has allowed the application of text-to-structure and image-to-structure conversion techniques. The problem then became one of applying this solution across massive data sets in an accurate and scalable manner, to deliver a turnkey patent informatics system with automatically extracted and searchable chemical structures. SureChem, an advanced cloud application, uses a tournament of methods to achieve higher coverage and accuracy than any single approach. The product was launched under a freemium business model and licensed by a user community. Latterly, user feedback and market shifts indicated a need to link biological data into patents too (sequences, genes, targets, diseases, etc.). This created an opportunity to transition SureChem to EMBL-EBI, a public organisation with a remit of data dissemination and sharing and deep experience of biodata, including the large ChEMBL database of Structure Activity Relationship data. In 2014 SureChem became SureChEMBL. The presentation will review the development of SureChem, discuss the marketplace for patent informatics, and look ahead to future development plans for SureChEMBL.
John P. Overington - EMBL-EBI
Nicko Goncharoff – Digital Science
SureChEMBL: Open patent chemistry data
EMBL-EBI’s Mission
• Provide freely available data and bioinformatics services
to all facets of the scientific community in ways that
promote scientific progress
• Contribute to the advancement of biology through basic
investigator-driven research in bioinformatics
• Provide advanced bioinformatics training to scientists at
all levels, from PhD students to independent investigators
• Help disseminate cutting-edge technologies to industry
• Coordinate biological data provision throughout Europe
EMBL Member States
Austria, Belgium, Croatia, Czech
Republic, Denmark, Finland,
France, Germany, Greece,
Iceland, Ireland, Israel, Italy,
Luxembourg, the Netherlands,
Norway, Portugal, Spain,
Sweden, Switzerland and the
United Kingdom
Associate member states:
Australia, Argentina
ChEMBL
• The world’s largest primary public database of medicinal chemistry data
• https://www.ebi.ac.uk/chembl
• >1.4 million compounds, >9,000 targets, >12 million bioactivities
• Truly Open Data
• CC-BY-SA license
• Many download/access formats
• myChEMBL – Linux VM, PostgreSQL, RDKit, KNIME…
• Semantic Web
• RDF download, SPARQL endpoint at http://rdf.ebi.ac.uk/chembl
SAR Data
Compound
Assay
Ki=4.5 nM
>Thrombin MAHVRGLQLPGCLALAALCSLVHSQHVFLAPQQARSLLQRVRRANTFLEEVRKGNLERECVEETCSY
EEAFEALESSTATDVFWAKYTACETARTPRDKLAACLEGNCAEGLGTNYRGHVNITRSGIECQLWRS
RYPHKPEINSTTHPGADLQENFCRNPDSSTTGPWCYTTDPTVRRQECSIPVCGQDQVTVAMTPRSEG
SSVNLSPPLEQCVPDRGQQYQGRLAVTTHGLPCLAWASAQAKALSKHQDFNSAVQLVENFCRNPDGD
EEGVWCYVAGKPGDFGYCDLNYCEEAVEEETGDGLDEDSDRAIEGRTATSEYQTFFNPRTFGSGEAD
CGLRPLFEKKSLEDKTERELLESYIDGRIVEGSDAEIGMSPWQVMLFRKSPQELLCGASLISDRWVL
TAAHCLLYPPWDKNFTENDLLVRIGKHSRTRYERNIEKISMLEKIYIHPRYNWRENLDRDIALMKLK
KPVAFSDYIHPVCLPDRETAASLLQAGYKGRVTGWGNLKETWTANVGKGQPSVLQVVNLPIVERPVC
KDSTRIRITDNMFCAGYKPDEGKRGDACEGDSGGPFVMKSPFNNRWYQMGIVSWGEGCDRDGKYGFY
THVFRLKKWIQKVIDQFGE
ED2=230 nM
Inhibition of human Thrombin
PTT (partial thromboplastin time)
ChEMBL
SureChem = SureChEMBL
• December 2013 EMBL-EBI ‘acquired’ SureChem
• Existing SureChem user base
• Free (SureChemOpen)
• Paying (SureChemPro + API)
• EMBL-EBI supported existing licensees during transition
• EMBL-EBI provides an ongoing, free and open resource to
entire community
• Private, Secure, and Free
• No login system
• Rebranded as SureChEMBL
• https://www.surechembl.org
6 PDG Biotech Meeting
Rebranding Complete!
https://www.surechembl.org/
EMBL-EBI Chemistry Resources
RDF and REST API interfaces
REST API Interface
• Atlas – Ligand induced transcript response (750)
• PDBe – Ligand structures from structurally defined protein complexes (15K)
• ChEBI – Nomenclature of primary and secondary metabolites. Chemical Ontology (24K)
• SureChEMBL – Chemical structures from patent literature (16M)
• ChEMBL – Bioactivity data from literature and depositions (1.5M)
• UniChem – InChI-based chemical resolver (full + relaxed ‘lenses’) (>70M)
• 3rd Party Data – ZINC, PubChem, ThomsonPharma DOTF, IUPHAR, DrugBank, KEGG, NIH NCC, eMolecules, FDA SRS, PharmGKB, Selleck, …. (~55M)
SureChEMBL Data Pipeline
[Diagram; labelled components listed below]
• Patent Offices: WO; EP applications & granted; US applications & granted; JP abstracts
• Processed patents (IFI Claims)
• Entity Recognition, shown with the name 1-[4-ethoxy-3-(6,7-dihydro-1-methyl-7-oxo-3-propyl-1H-pyrazolo[4,3-d]pyrimidin-5-yl)phenylsulfonyl]-4-methylpiperazine
• Name to Structure (five methods)
• Image to Structure (one method)
• OCR
• SureChEMBL System: Chemistry Database, Database, Application Server, API, Patent PDFs (service)
• Users
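The abstract describes a "tournament of methods", and the pipeline lists five name-to-structure converters. The slides do not say how the tournament is scored, so the following is only a minimal sketch of one plausible scheme, a simple majority vote. The function name `tournament_vote`, the `min_votes` threshold, and the idea that each converter returns a comparable structure key (for example an InChIKey string) are all assumptions made for illustration, not the SureChEMBL implementation.

```python
from collections import Counter
from typing import Callable, Dict, Optional

# Hypothetical converter type: takes a chemical name, returns a structure
# key (e.g. an InChIKey string) or None when it cannot parse the name.
Converter = Callable[[str], Optional[str]]


def tournament_vote(
    name: str,
    converters: Dict[str, Converter],
    min_votes: int = 2,  # assumed threshold, not taken from the slides
) -> Optional[str]:
    """Return the structure key that most converters agree on, or None."""
    votes: Counter = Counter()
    for convert in converters.values():
        try:
            result = convert(name)
        except Exception:
            result = None  # a crashing converter simply casts no vote
        if result:
            votes[result] += 1
    if not votes:
        return None
    winner, count = votes.most_common(1)[0]
    # Require agreement between at least `min_votes` converters.
    return winner if count >= min_votes else None
```

With a single converter and the default `min_votes=2` the function returns None, which is the point of the sketch: agreement between independent methods is what raises confidence. A real system would also need a rule for ties, which this sketch resolves arbitrarily in favour of the key seen first.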
SureChEMBL data coverage
Data | Description & Languages | Years
EP applications | Bib. data: DocDB + Original; Full text: Original (EN, DE, FR) | From 1978
EP granted | Bib. data: DocDB + Original; Full text: Original (EN, DE, FR) | From 1980
WO applications | Bib. data: DocDB + Original; Full text: Original (EN, DE, FR, ES, RU) | Bib. data from 1978; full text from 1978
US applications | Bib. data: DocDB + Original; Full text: Original (EN) | Bib. data from 2001; full text from 2001
US granted | Bib. data: DocDB + Original; Full text: Original (EN) | Bib. data from 1920; full text from 1976
JP applications | Bib. data: DocDB; PAJ - English abstracts/titles | Bib. data from 1973; abstracts/titles from 1976
JP granted | Bib. data: DocDB | From 1994
90+ countries | Bib. data: DocDB | From 1920
SureChEMBL Chemistry Data Coverage
• Structures from text: 1976 onwards
• Title, abstract, claims, description
• SureChem Chemical Entity Recognition - proprietary algorithms
• ACD/Labs, ChemAxon, OpenEye, OPSIN, PerkinElmer name-structure conversion
• Structures from images: 2007 onwards
• CLiDE image-structure conversion
• Will extend image processing backwards using AWS Spot Pricing compute
• USPTO offers ‘Complex Work Units’ since 2001
• CWU file types include MOL and CDX
• CWUs processed as part of pipeline: 2007 onwards
Chemical Entity Extraction
SureChEMBL Content (September 2014)
• 15,668,225 compounds
• 12,888,125 patents
• ~80,000 new compounds extracted from ~50,000 patents
monthly
• 1–7 days for published patent to become searchable in
SureChEMBL
• System provides search access to all patents (not just
chemistry)
Current System Capabilities
• Searching capabilities
• Free text keywords and Lucene fields
• Patent IDs & bibliographic information
• Patent authority & date
• Chemical structure
• Retrieval capabilities
• Retrieve chemistry (with additional filters)
• Retrieve patent family information
• Retrieve annotated full patent text
• Retrieve patent document as PDF
https://www.surechembl.org/
Compound Report Page
https://www.surechembl.org/chemical/SCHEMBL1895
UniChem Integration
On-the-fly integration with 71M structures from 25 data sources
SureChEMBL Data Access
• UniChem
• https://www.ebi.ac.uk/unichem
• Weekly updates
• Private, secure, live integration with >25 chemistry resources
• UniChem will soon be the world's largest chemical structure integration resource…
• FTP Site
• ftp://ebi.ac.uk/public
• Quarterly updates
• All SureChEMBL compounds in SDF and CSV format
• Raw data – not filtered for ‘funnies’
• Further downloads planned in future
• Blog for announcements – https://chembl.blogspot.com
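Since the FTP download ships all compounds as SDF, a reader may want to walk the file without a cheminformatics toolkit. The sketch below relies only on the general SDF layout (a molblock ending in `M  END`, `> <FIELD>` data items, records separated by `$$$$`). The function names are illustrative, and no field names from the actual SureChEMBL files are assumed, so check the downloaded file for the real ones.

```python
from typing import Dict, Iterator, List, TextIO


def _parse_record(lines: List[str]) -> Dict[str, str]:
    """Split one SDF record into its molblock and its data items."""
    record: Dict[str, str] = {}
    # The molblock runs up to and including the "M  END" line.
    end = next((i for i, l in enumerate(lines) if l.startswith("M  END")), -1)
    record["molblock"] = "\n".join(lines[: end + 1])
    field = None
    values: List[str] = []
    for line in lines[end + 1:]:
        if line.startswith(">"):
            if field is not None:
                record[field] = "\n".join(values).strip()
            # The field name sits between angle brackets: ">  <NAME>"
            start, stop = line.find("<"), line.rfind(">")
            field = line[start + 1:stop] if 0 <= start < stop else None
            values = []
        elif field is not None:
            values.append(line)
    if field is not None:
        record[field] = "\n".join(values).strip()
    return record


def iter_sdf_records(handle: TextIO) -> Iterator[Dict[str, str]]:
    """Yield one dict per record, reading the file line by line."""
    lines: List[str] = []
    for raw in handle:
        line = raw.rstrip("\r\n")
        if line.strip() == "$$$$":  # record separator
            if lines:
                yield _parse_record(lines)
            lines = []
        else:
            lines.append(line)
    if any(l.strip() for l in lines):  # final record without a separator
        yield _parse_record(lines)
```

Typical use would be `for rec in iter_sdf_records(open(path)): ...`, where `path` points at the downloaded file. Because the slide warns that the data are raw and unfiltered, any downstream code should expect odd or unparseable structures.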
OCR Errors
• Small, poor quality images
• OCR errors in names (OCR done by IFI). There is an OCR correction step, but it cannot fix all errors
-> ‘2,6-Difluoro-Λ/-{1 -r(4-iodo-2-methylphenyl)methvn-1 H-pyrazol-3- vDbenzamide’
• Reliability is better for US patents due to inclusion of MOL files
Name Conversion Errors
Pentyl Thiol
2-(2-((3-chloro-6-methyl-5,5-dioxido-6,11-dihydrodibenzo[c,f][1,2]thiazepin-11-yl)amino)ethoxy)acetic acid
ChEMBL – SureChEMBL Overlap (ChEMBL 18)
• InChI based comparison using filtered parent compounds
• SureChEMBL: 12.2M; ChEMBL: 1.3M; overlap: 235K (18.4%)
Filters
• MW between 100 and 1200
• #Atoms between 6 and 70
• ALogP between -10 and 10
• #C > 0
• #Rings > 0
• #C != #Atoms
• RTB <= 20
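The filters and the InChI-based comparison can be sketched in a few lines. This is a minimal illustration, assuming the descriptors have already been computed by some toolkit and are supplied in a dict. The key names (`mw`, `atoms`, `alogp`, `carbons`, `rings`, `rtb`) are hypothetical, and treating "between" as inclusive is an assumption, since the slide does not say.

```python
from typing import Dict, Iterable, Set


def passes_filters(d: Dict[str, float]) -> bool:
    """Apply the slide's filters to one compound's precomputed descriptors."""
    return (
        100 <= d["mw"] <= 1200           # MW between 100 and 1200 (inclusive assumed)
        and 6 <= d["atoms"] <= 70        # #Atoms between 6 and 70
        and -10 <= d["alogp"] <= 10      # ALogP between -10 and 10
        and d["carbons"] > 0             # #C > 0
        and d["rings"] > 0               # #Rings > 0
        and d["carbons"] != d["atoms"]   # #C != #Atoms (not carbon-only)
        and d["rtb"] <= 20               # rotatable bonds <= 20
    )


def inchi_overlap(first: Iterable[str], second: Iterable[str]) -> Set[str]:
    """Compounds present in both collections, compared by InChI string."""
    return set(first) & set(second)
```

Comparing on the InChI of the filtered parent compound means that two records count as the same compound only if their identifier strings match exactly, so the preceding standardisation step matters as much as the set intersection itself.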
Future Entity Extraction and Indexing
• Identify new entity types e.g. proteins, diseases and cell lines
• Extend using ChEMBL dictionaries + others
• Ontology/synonym mapping - semantic tagging
• Target-relevance assessment
• Protein/biotherapeutic sequence extraction
• Sequence-based patent searches
• Enhanced cross-referencing
• Tag up all commonly used identifiers (Company codes, CAS,
ChEBI, ChEMBL, PubChem, ENSEMBL, RefSeq, UniProt,…)
EFO – http://www.ebi.ac.uk/efo
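Tagging commonly used identifiers could start with pattern matching before any dictionary lookup. The sketch below is an illustration only, not the planned SureChEMBL design: it covers four identifier types whose textual form is well known (CAS Registry Numbers with their check digit, ChEMBL IDs, SureChEMBL IDs of the form seen in the compound report URL, and ChEBI IDs). The names `PATTERNS`, `cas_checksum_ok` and `tag_identifiers` are invented for the example.

```python
import re
from typing import List, Tuple

# Regex per identifier type. Word boundaries keep "CHEMBL" from matching
# inside "SCHEMBL1895".
PATTERNS = {
    "CAS": re.compile(r"\b\d{2,7}-\d{2}-\d\b"),
    "ChEMBL": re.compile(r"\bCHEMBL\d+\b"),
    "SureChEMBL": re.compile(r"\bSCHEMBL\d+\b"),
    "ChEBI": re.compile(r"\bCHEBI:\d+\b"),
}


def cas_checksum_ok(cas: str) -> bool:
    """Validate a CAS Registry Number using its trailing check digit."""
    digits = cas.replace("-", "")
    body, check = digits[:-1], int(digits[-1])
    # Weight digits 1, 2, 3, ... starting from the rightmost body digit.
    total = sum(int(ch) * w for w, ch in enumerate(reversed(body), start=1))
    return total % 10 == check


def tag_identifiers(text: str) -> List[Tuple[str, str, int]]:
    """Return (type, identifier, offset) tuples in order of appearance."""
    hits = []
    for label, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            token = match.group(0)
            if label == "CAS" and not cas_checksum_ok(token):
                continue  # looks like a CAS number but fails the checksum
            hits.append((label, token, match.start()))
    return sorted(hits, key=lambda h: h[2])
```

The checksum step matters on OCR-derived text, where digit strings resembling CAS numbers are common. Company codes and gene or protein accessions would need dictionaries or more careful patterns than this sketch provides.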
Far Future - Bioactivity Data Extraction?
Target/Assay
Bioactivity
Far Future – Markush Extraction?
-alkyl
-aryl
-heteroaryl
-heterocyclyl
-cycloalkyl
….
Acknowledgements
• ChEMBL team
• John Overington
• Jon Chambers
• George Papadatos
• Mark Davies
• Nathan Dedman
• Anna Gaulton
• Digital Science
• Nicko Goncharoff
• James Siddle
• Richard Koks
Funding:
• Wellcome Trust Strategic Award for
ChEMBL database
(WT086151/Z/08/Z &
WT104104/Z/14/Z)
• Open PHACTS - Innovative
Medicines Initiative Joint Undertaking
(grant no. 115191)
• European Molecular Biology
Laboratory
• BioMedBridges - European
Commission FP7 Capacities Specific
Programme (grant no. 284209)
• Technology Partners: