EMBL-EBI Resources
• Genes, genomes & variation: European Nucleotide Archive, 1000 Genomes, Ensembl, Ensembl Genomes, European Genome-phenome Archive, Metagenomics portal
• Gene, protein & metabolite expression: ArrayExpress, Expression Atlas, MetaboLights, PRIDE
• Protein sequences, families & motifs: InterPro, Pfam, UniProt
• Molecular structures: Protein Data Bank in Europe, Electron Microscopy Data Bank
• Chemical biology: ChEMBL, ChEBI
• Reactions, interactions & pathways: IntAct, Reactome, MetaboLights
• Systems: BioModels, Enzyme Portal, BioSamples
• Literature & ontologies: Europe PubMed Central, Gene Ontology, Experimental Factor Ontology
Bioactivity data
Compound
Assay/Target
>Thrombin
MAHVRGLQLPGCLALAALCSLVHSQHVFLAPQQARSLLQRVRRANTFLEEVRKGNLE
RECVEETCSYEEAFEALESSTATDVFWAKYTACETARTPRDKLAACLEGNCAEGLGT
NYRGHVNITRSGIECQLWRSRYPHKPEINSTTHPGADLQENFCRNPDSSTTGPWCYT
TDPTVRRQECSIPVCGQDQVTVAMTPRSEGSSVNLSPPLEQCVPDRGQQYQGRLAVT
THGLPCLAWASAQAKALSKHQDFNSAVQLVENFCRNPDGDEEGVWCYVAGKPGDFGY
CDLNYCEEAVEEETGDGLDEDSDRAIEGRTATSEYQTFFNPRTFGSGEADCGLRPLF
EKKSLEDKTERELLESYIDGRIVEGSDAEIGMSPWQVMLFRKSPQELLCGASLISDR
WVLTAAHCLLYPPWDKNFTENDLLVRIGKHSRTRYERNIEKISMLEKIYIHPRYNWR
ENLDRDIALMKLKKPVAFSDYIHPVCLPDRETAASLLQAGYKGRVTGWGNLKETWTA
NVGKGQPSVLQVVNLPIVERPVCKDSTRIRITDNMFCAGYKPDEGKRGDACEGDSGG
PFVMKSPFNNRWYQMGIVSWGEGCDRDGKYGFYTHVFRLKKWIQKVIDQFGE
1. Scientific facts (e.g. Ki = 4.5 nM, APTT = 11 min)
2. Organization, integration, curation and standardization of pharmacology data
3. Insight, tools and resources for translational drug discovery
ChEMBL – Data for Drug Discovery
Patent Data
• Do we include patent data in the ChEMBL database?
  • We provide cross-references (via UniChem), but not the underlying chemical data
  • This is the most common question asked about ChEMBL during training and outreach
• Why is this important to drug discovery researchers?
  • The patent literature is 2-3 years ahead of the published literature
  • Prior art and freedom to operate
  • Lots more data, but a high cost to extract and lots of noisy data
SureChem = SureChEMBL
• In December 2013 EMBL-EBI acquired SureChem, a leading ‘chemistry patent mining’ product, from Digital Science, Macmillan Group
• SureChem did not fit the core future academic business of Digital Science
• SureChem provides a live (updated daily) view of chemical patent space
• Existing SureChem user base
  • Free (SureChemOpen)
  • Paying (SureChemPro + API)
• EMBL-EBI will support existing licensees (all licences have now expired)
• EMBL-EBI will provide an ongoing, free and open resource to the entire community
• Rebranded as SureChEMBL
EMBL-EBI Chemistry Resources
• Atlas: ligand-induced transcript response (750)
• PDBe: ligand structures from structurally defined protein complexes (15K)
• ChEBI: nomenclature of primary and secondary metabolites; chemical ontology (24K)
• SureChEMBL: ligand structures from patent literature (15M)
• ChEMBL: bioactivity data from literature and depositions (1.5M)
• UniChem: InChI-based chemical resolver, with full and relaxed ‘lenses’ (>70M)
• 3rd party data: ZINC, PubChem, ThomsonPharma DOTF, IUPHAR, DrugBank, KEGG, NIH NCC, eMolecules, FDA SRS, PharmGKB, Selleck, … (~55M)
• Interfaces: RDF and REST API
SureChEMBL System
https://www.surechembl.org/
[Screenshot of the search interface, with callouts:]
• Keyword search
• Patent number search
• Structure sketch
• Paste SMILES, MOL, name
• Types of chemistry search
• Filter by authority
• Filter by date
• Filter by document section
• Help
System Capabilities
• Searching capabilities
  • Free text keywords and Lucene fields
  • Patent IDs & bibliographic information
  • Patent authority & date
  • Chemical structure
• Retrieving capabilities
  • Retrieve chemistry (with additional filters)
  • Retrieve patent family information
  • Retrieve annotated full patent text
• Accessible via Web Interface and API
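The keyword, authority and date searches listed above can be combined in a single fielded query. The following is a minimal sketch of composing such a query string in standard Lucene query syntax (`field:"value"`, `field:[low TO high]`, `AND`). The field names `authority` and `pubdate` are hypothetical placeholders, not the actual SureChEMBL field names, and the helper functions are illustrative only.

```python
def lucene_term(field, value):
    """Return a quoted fielded term, escaping backslashes and quotes."""
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    return f'{field}:"{escaped}"'


def lucene_range(field, low, high):
    """Return an inclusive Lucene range clause."""
    return f"{field}:[{low} TO {high}]"


def build_query(keywords, authority=None, date_from=None, date_to=None):
    """Combine free-text keywords with optional field filters using AND.

    'authority' and 'pubdate' are hypothetical field names used for
    illustration; substitute the fields the target index really exposes.
    """
    # Multi-word keywords are quoted so they are searched as phrases.
    clauses = [f'"{kw}"' if " " in kw else kw for kw in keywords]
    if authority is not None:
        clauses.append(lucene_term("authority", authority))
    if date_from is not None and date_to is not None:
        clauses.append(lucene_range("pubdate", date_from, date_to))
    return " AND ".join(clauses)


if __name__ == "__main__":
    print(build_query(["kinase inhibitor"], authority="EP",
                      date_from="20100101", date_to="20131231"))
```

Building the string in one place keeps quoting and escaping consistent, whether the query is pasted into the web interface or sent through an API.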
SureChEMBL Data Coverage
Data: description & languages (years)
• EP applications (from 1978): bib. data from DocDB + Original; full text from Original (EN, DE, FR)
• EP granted (from 1980): bib. data from DocDB + Original; full text from Original (EN, DE, FR)
• WO applications: bib. data from DocDB + Original (from 1978); full text from Original (EN, DE, FR, ES, RU) (from 1978)
• US applications: bib. data from DocDB + Original (from 2001); full text from Original (EN) (from 2001)
• US granted: bib. data from DocDB + Original (from 1920); full text from Original (EN) (from 1976)
• JP applications: bib. data from DocDB (from 1973); full text from PAJ, English abstracts/titles (from 1976)
• JP granted: bib. data from DocDB (from 1994)
• 90+ countries: bib. data from DocDB (from 1920)
All patents from the above data sources are searchable via SureChEMBL
SureChEMBL Chemistry Data Coverage
• Exemplified structures from patent title, description, abstract and claims
• Structures from text: 1976 onwards
• Structures from images: 2007 onwards
• The USPTO has provided ‘Complex Work Units’ (CWUs) since 2001
  • CWU file types include MOL and CDX
  • CWUs are processed as part of the pipeline
SureChEMBL Data Processing
[Diagram of the patent processing pipeline, with these components:]
• Patent offices: WO; EP applications & granted; US applications & granted; JP abstracts
• Services: Patent PDFs (service), Processed patents (service)
• Processing components (SureChem IP): OCR, Entity Recognition, Name to Structure (five methods), Image to Structure (one method)
• Chemistry Database
• SureChEMBL System: Database, Application Server, API, Users
• Example chemical name: 1-[4-ethoxy-3-(6,7-dihydro-1-methyl-7-oxo-3-propyl-1H-pyrazolo[4,3-d]pyrimidin-5-yl)phenylsulfonyl]-4-methylpiperazine
SureChEMBL and Chemaxon
[Diagram: the same data processing pipeline components as on the previous slide]
ChEMBL Overlap
• InChI based comparison using filtered parent compounds
• SureChEMBL (exported 08/05/14): 12.2M compounds
• ChEMBL (ChEMBL 18): 1.3M compounds
• Overlap: 235K compounds (18.4%)
Filters
• MW between 100 and 1200
• #Atoms between 6 and 70
• ALogP between -10 and 10
• #Rings > 0
• #C > 0
• #C != #Atoms
• RTB <= 20
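The filter list and the InChI-based comparison can be sketched in a few lines. This is a minimal illustration, not the pipeline actually used: it assumes the descriptors have already been calculated by a cheminformatics toolkit, that "between" is inclusive, that #Atoms counts heavy atoms, and that the reported percentage is relative to the filtered ChEMBL set. The `Descriptors` class and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Descriptors:
    """Precomputed descriptors for one parent compound (hypothetical container)."""
    mol_weight: float
    n_atoms: int        # assumed to be the heavy-atom count
    alogp: float
    n_rings: int
    n_carbons: int
    n_rot_bonds: int    # RTB on the slide


def passes_filters(d):
    """Apply the slide's filters; range bounds are assumed to be inclusive."""
    return (
        100 <= d.mol_weight <= 1200
        and 6 <= d.n_atoms <= 70
        and -10 <= d.alogp <= 10
        and d.n_rings > 0
        and d.n_carbons > 0
        and d.n_carbons != d.n_atoms   # reject carbon-only structures
        and d.n_rot_bonds <= 20
    )


def inchi_overlap(patent_inchis, chembl_inchis):
    """Return (shared count, shared fraction of the ChEMBL set)."""
    patent_set, chembl_set = set(patent_inchis), set(chembl_inchis)
    shared = patent_set & chembl_set
    fraction = len(shared) / len(chembl_set) if chembl_set else 0.0
    return len(shared), fraction


if __name__ == "__main__":
    candidate = Descriptors(180.16, 13, 1.2, 1, 9, 3)  # illustrative values
    print(passes_filters(candidate))
    print(inchi_overlap({"A", "B", "C"}, {"B", "C", "D", "E"}))
```

Comparing on InChI strings (or InChIKeys) makes the overlap a plain set intersection, independent of how each database drew or registered the structure.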
SureChEMBL and UniChem
• 12.2M SureChEMBL compounds are being loaded into UniChem, the InChI based ‘Unified Chemical Identifier’ system
• SureChem drug-like subset (~5M) previously loaded
• Other UniChem sources include:
https://www.ebi.ac.uk/unichem/
Migration Status
• The system is currently built and optimised to run on Amazon Web Services (AWS)
• The time and cost of moving away from AWS are considered too high in the short term
• The long term plan is to migrate on to EBI infrastructure
• 3 phase migration process
  1. Patent Data Server (IFI Claims/Fairview): done
  2. Data Processing Pipeline: done
  3. Web Application/API: work in progress
Technical Challenges
• User account migration
  • The system currently uses the Digital Science authentication system
  • A user account is required by the Pro account service
  • Plan to move over to an open system, e.g. OAuth/OpenID
  • We aim to minimise the impact on existing users, but may require users to sign up again
• Impact of providing free and open access to the Pro account service and API
  • Need to monitor usage
  • Usage limitations may be required
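One common way to combine usage monitoring with usage limits is a per-user sliding-window counter. The sketch below is only an illustration of that general technique, assuming an in-memory store and hypothetical limits; it does not describe how SureChEMBL enforces or plans to enforce limits.

```python
import time
from collections import defaultdict, deque


class SlidingWindowLimiter:
    """Allow at most max_requests per user within window_seconds."""

    def __init__(self, max_requests, window_seconds, clock=time.monotonic):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self._clock = clock                  # injectable for testing
        self._hits = defaultdict(deque)      # user id -> request timestamps

    def allow(self, user_id):
        """Record the request and return True if the user is under the limit."""
        now = self._clock()
        hits = self._hits[user_id]
        # Drop timestamps that have fallen out of the window.
        while hits and now - hits[0] >= self.window_seconds:
            hits.popleft()
        if len(hits) >= self.max_requests:
            return False
        hits.append(now)
        return True


if __name__ == "__main__":
    limiter = SlidingWindowLimiter(max_requests=2, window_seconds=60)
    print([limiter.allow("alice") for _ in range(3)])
```

Because the timestamps are kept per user, the same structure doubles as a simple usage log before any limit is actually enforced.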
Enhanced Entity Extraction
• Identify new entity types, e.g. proteins, diseases, cell lines, assays
  • Extend using ChEMBL dictionaries + others
  • Ontology mapping/semantic tagging
• Protein/biotherapeutic sequence extraction
  • Sequence based patent searches
• The system currently provides minimal cross-referencing
  • Quickly enhance using UniChem
  • Tag up all commonly used identifiers (ChEBI, ChEMBL, PubChem, UniProt, …)
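Dictionary-based tagging of the kind mentioned above can be sketched as a case-insensitive lookup of known terms in patent text. This is a minimal illustration under stated assumptions: the two-entry dictionary is a toy stand-in for real ChEMBL or ontology dictionaries, `EXAMPLE:0001` is a placeholder identifier, and `build_tagger` is a hypothetical helper rather than part of any existing system.

```python
import re


def build_tagger(dictionary):
    """Build a tagger from {surface form: (entity_type, identifier)}."""
    # Longest terms first so longer names win over shorter overlapping ones.
    terms = sorted(dictionary, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(t) for t in terms) + r")\b",
        re.IGNORECASE,
    )
    lookup = {term.lower(): value for term, value in dictionary.items()}

    def tag(text):
        """Return (start, end, matched text, entity_type, identifier) tuples."""
        return [
            (m.start(), m.end(), m.group(0), *lookup[m.group(0).lower()])
            for m in pattern.finditer(text)
        ]

    return tag


if __name__ == "__main__":
    toy_dictionary = {
        "thrombin": ("protein", "P00734"),           # UniProt accession
        "thrombosis": ("disease", "EXAMPLE:0001"),   # placeholder identifier
    }
    tag = build_tagger(toy_dictionary)
    for hit in tag("Inhibitors of Thrombin for treating thrombosis"):
        print(hit)
```

Returning character offsets alongside the identifier lets the tags be overlaid on the annotated full text without altering it.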
Image Processing
• Image extraction currently covers patents from 01/01/2007 onwards
• Use Amazon EC2 Spot Instances to process pre-2007 image data
• Spot Instances are significantly cheaper, e.g. an m1.xlarge instance costs:
  • Standard cost = $0.52/hour
  • Spot Instance cost = ~$0.125/hour
• New methods and tools can be introduced to improve compound image extraction
  • The system currently uses CLiDE; alternatives include OSRA and Imago
• Document segmentation, developed as part of the curation system, could be applied to complex patent images
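The quoted m1.xlarge prices imply a saving of about 76% per instance-hour, i.e. Spot is roughly 4.2 times cheaper. The short sketch below works this through; the 10,000 instance-hours figure is a hypothetical workload chosen only to make the arithmetic concrete, and the spot price is the approximate value from the slide.

```python
STANDARD_RATE = 0.52   # USD per hour, standard price quoted above
SPOT_RATE = 0.125      # USD per hour, approximate spot price quoted above


def job_cost(instance_hours, hourly_rate):
    """Total cost in USD for a number of instance-hours at a given rate."""
    return instance_hours * hourly_rate


def spot_saving_fraction(standard=STANDARD_RATE, spot=SPOT_RATE):
    """Fraction of the standard cost saved by running on Spot Instances."""
    return 1 - spot / standard


if __name__ == "__main__":
    hours = 10_000  # hypothetical size of the pre-2007 backlog
    print(f"standard: ${job_cost(hours, STANDARD_RATE):,.2f}")
    print(f"spot:     ${job_cost(hours, SPOT_RATE):,.2f}")
    print(f"saving:   {spot_saving_fraction():.1%}")
```

Spot prices fluctuate and instances can be reclaimed, so a real estimate would also need to allow for interrupted and repeated work.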
Open PHACTS Extension
• The Open PHACTS project is keen to include patent data in future extensions to the project
• ENSO approved: funding to include SureChEMBL data in Open PHACTS
  • RDF conversion, target indexing and API development
  • The EBI-RDF project will benefit from the RDF conversion
• SureChEMBL is updated daily, compared to quarterly ChEMBL updates
  • Creating exports and systems for loading SureChEMBL is an interesting challenge for us
More Future Plans
• Refactor interface for EMBL look and feel
• Third party user support system migration
• Workflow tool enhancements
  • Update and release existing KNIME protocols + Pipeline Pilot
• Lots of interest in bringing the system in-house for internal document processing and searching
  • Complex licensing issues
  • AWS setup makes this easier
• Ligand Ensemble-based mapping of ChEMBL literature to patents
• Provide weekly/monthly feed of patent structures to PubChem
People and Groups Involved
The ChEMBL Group
• John Overington
• Mark Davies
• George Papadatos
• Jon Chambers
• Anne Hersey
Digital Science
• Nicko Goncharoff
• James Siddle
• Richard Koks
• Tom Llewellyn
ChEMBL 18 Released
Access: Website, Web Services, Widgets, Downloads, Virtual Machine, Semantic Web
1,359,508 compounds
12,419,715 activities
1,042,374 assays
9,414 targets
53,298 documents
19 bioactivity sources
https://www.ebi.ac.uk/chembl/