Molecular similarity searching methods, seminar

Molecular similarity searching methods in drug discovery

A Presentation in advanced graphical engineering systems seminar 2011/2012

By: Haytham Hijazi

Advisor: Univ-Prof. Hon-Prof. Dr. Dieter Roller




In this work, I propose a contribution to the field of cheminformatics. Cheminformatics means solving chemical problems using computational methods [1].

James Rhodes, Stephen Boyer, Jeffrey Kreulen, Ying Chen, Patricia Ordonez, “Mining patents using molecular similarity search”, IBM Almaden Services Research, Pacific Symposium on Biocomputing 12:304-315 (2007).


Agenda

• The main question in this research

• The principle of similarity

• Drug discovery as an application

• Research problem

• Molecular representations (1D, 2D…)

• Searching the similarity

• Similarity coefficients calculations

• The probabilistic model (BIM)

• The contribution (MDC)

• Experiments, conclusions and discussion

Shape, colour, size, pattern

“The similarity is in the eye of the beholder”

Can we claim?

Question: Which molecules in a database are similar to the query molecule?

Application:

• Finding better compounds than the initial lead compound (drug discovery)

• Property prediction for unknown compounds

The main question

Structurally similar molecules are assumed to have similar biological properties.

Similar biological properties → drug discovery.

In our context…the principle

1. Sylvaine Roy and Laurence Lafanechère, “Chemogenomics and Chemical Genetics: A User's Introduction for Biologists, Chemists and Informaticians”, Molecular similarity, Springer Berlin, ISBN 978-3-642-19614-0, 1st Edition. 17.06.2011

[1]


Problems

Claim: General manufacturing problems!


The Map

Molecule representation

Feature selection

Similarity coefficients

calculations and ranking for search

Historical progression

◦ Complete structure

◦ Substructure

Descriptors

◦ 1D (physicochemical properties), 2D, 3D, and 4D

Connectivity tables and graph theory!
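As a minimal sketch of what a connectivity table looks like in code, the example below stores ethanol (SMILES `CCO`, hydrogens implicit) as an atom list plus a bond list and derives the graph view from it. The molecule, the variable names, and the helper `adjacency` are illustrative assumptions, not part of the original slides.

```python
# Ethanol ("CCO") as a connectivity table; hydrogens are left implicit.
atoms = ["C", "C", "O"]            # atom index -> element symbol
bonds = [(0, 1, 1), (1, 2, 1)]     # (atom i, atom j, bond order)


def adjacency(n_atoms, bonds):
    """Build an adjacency list (the graph view) from the bond table."""
    adj = {i: [] for i in range(n_atoms)}
    for i, j, _order in bonds:
        adj[i].append(j)
        adj[j].append(i)
    return adj


graph = adjacency(len(atoms), bonds)
print(graph)  # {0: [1], 1: [0, 2], 2: [1]}
```

Once the molecule is a graph, exact-structure and substructure questions become graph-matching questions.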

Molecular representation

Image Source: Karine Audouze, “Representation of molecular structures and structural diversity”, ChemoInformatics in Drug Discovery, 2009.

2D structure, line notation

CC(=O)OC1=CC=CC=C1C(=O)O

CCCC1=NN(C2=C1NC(=NC2=O)C3=C(C=CC(=C3)S(=O)(=O)N4CCN(CC4)C)OCC)C

SMILES – Simplified Molecular Input Line Entry System

SMILES


A fingerprint is a vector encoding the presence (‘1’) or absence (‘0’) of FRAGMENT substructures in a molecule

Dictionary-based and hash-based fingerprints

2D Fingerprints - Structural key

Descriptor   Fragment
1            AR
2            CCCCN
3            Me
9            NH2
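The table above is a dictionary-based key, where each bit has a fixed meaning. For the hash-based alternative, the sketch below enumerates linear atom paths up to a maximum length and hashes each path into a fixed-length bit vector. It assumes a toy molecule (ethanol), a 32-bit vector, and CRC32 as the hash; the function names `linear_paths` and `hashed_fingerprint` are hypothetical, and real toolkits use richer path labels (bond orders, ring information).

```python
import zlib

atoms = ["C", "C", "O"]              # ethanol, hydrogens implicit
adj = {0: [1], 1: [0, 2], 2: [1]}    # adjacency list of the molecular graph


def linear_paths(atoms, adj, max_len):
    """Collect element-symbol tuples for all simple paths with up to max_len atoms."""
    found = set()

    def walk(path):
        symbols = tuple(atoms[i] for i in path)
        # A path read from either end is the same fragment.
        found.add(min(symbols, symbols[::-1]))
        if len(path) < max_len:
            for nxt in adj[path[-1]]:
                if nxt not in path:
                    walk(path + [nxt])

    for start in adj:
        walk([start])
    return found


def hashed_fingerprint(atoms, adj, n_bits=32, max_len=3):
    """Set one bit per path; different paths may collide on the same bit."""
    bits = [0] * n_bits
    for symbols in linear_paths(atoms, adj, max_len):
        key = "-".join(symbols).encode("ascii")
        bits[zlib.crc32(key) % n_bits] = 1
    return bits


paths = linear_paths(atoms, adj, max_len=3)
fingerprint = hashed_fingerprint(atoms, adj)
print(sorted("-".join(p) for p in paths))  # ['C', 'C-C', 'C-C-O', 'C-O', 'O']
print("".join(map(str, fingerprint)))
```

Unlike a structural key, no fragment dictionary is needed, but a set bit can no longer be traced back to a single fragment.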

2. Source: Karine Audouze, “Representation of molecular structures and structural diversity”, ChemoInformatics in Drug Discovery, 2009.

[1] [2]

3D fingerprint topology

In 3D keys, the position of each bit corresponds to a certain range of distances or angles.

Computationally complex



The Map

Similarity coefficients

calculations and ranking for search

Molecule representation

Feature selection

Structure search

• Exact structure search

• Substructure search

• Similarity searching: maximum common subgraph isomorphism, Tanimoto/Dice/Cosine coefficients

Searching the similarity

The similarity measure (coefficient) is a quantitative measure of similarity

Used to rank the results of the query

Results are sorted in decreasing order of similarity
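The ranking step can be sketched as follows, assuming fingerprints are stored as sets of on-bit positions and the Tanimoto coefficient is the chosen measure. The database contents and the names `mol_1` to `mol_3` are made up for illustration.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto similarity of two fingerprints given as sets of on-bit positions."""
    union = len(fp_a | fp_b)
    return len(fp_a & fp_b) / union if union else 0.0


query = {1, 4, 7, 9}
database = {
    "mol_1": {1, 4, 7, 9},   # identical to the query
    "mol_2": {1, 4, 8},      # shares two bits
    "mol_3": {2, 3},         # shares nothing
}

# Most similar first.
ranked = sorted(database, key=lambda name: tanimoto(query, database[name]), reverse=True)
for name in ranked:
    print(name, round(tanimoto(query, database[name]), 2))
```

In a real search only the top of this list (a similarity threshold or the first k hits) would be returned.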

Searching the similarity

Distance coefficients. Probabilistic coefficients. Correlation coefficients. Association coefficients.

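Notation: $a$ and $b$ are the numbers of bits set in molecules A and B, $c$ is the number of bits set in both, and $d$ is the number of bits set in neither, so the fingerprint length is $n = a + b - c + d$.

Associative

Simple matching coefficient: $(c + d) / (a + b - c + d)$

Jaccard measure (Tanimoto): $c / (a + b - c)$, i.e. AND/OR

Cosine (Ochiai): $c / \sqrt{ab}$

Dice: $c / (0.5\,(a + b)) = 2c / (a + b)$

Distance

Hamming distance: $a + b - 2c$

Euclidean distance: $\sqrt{a + b - 2c}$

Soergel distance: $(a + b - 2c) / (a + b - c)$

Other coefficients

Pattern difference: $(a - c)(b - c) / (a + b - c + d)^2$

Size: $(a - b)^2 / (a + b - c + d)^2$

More coefficients!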
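A minimal sketch of several of these coefficients as code, under the assumption that a and b are the numbers of bits set in each molecule, c is the number of bits set in both, and n is the fingerprint length (so d = n - (a + b - c)). The cosine is written in its usual form for this notation, c/√(ab). The function name `coefficients` is hypothetical, and the inputs reuse the counts of the worked example in these slides (a = 13, b = 8, c = 6, 32 bits).

```python
import math


def coefficients(a, b, c, n):
    """Similarity and distance coefficients from bit counts.

    Assumes at least one bit is set in each fingerprint (a > 0 and b > 0).
    """
    d = n - (a + b - c)  # bits set in neither fingerprint
    return {
        "simple_matching": (c + d) / (a + b - c + d),
        "tanimoto": c / (a + b - c),
        "cosine": c / math.sqrt(a * b),
        "dice": 2 * c / (a + b),
        "hamming": a + b - 2 * c,
        "euclidean": math.sqrt(a + b - 2 * c),
        "soergel": (a + b - 2 * c) / (a + b - c),
    }


values = coefficients(a=13, b=8, c=6, n=32)
for name, value in values.items():
    print(f"{name:16s} {value:.4f}")
```

Note that the Soergel distance is the complement of the Tanimoto similarity, so the two always sum to 1.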

Naomie Salim, “The study of probability model for compound similarity searching”, UTM Research Management Centre Project Vote – 75207, University of Malaysia, 2009

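Assume we generate fragment-based fingerprint bits for two molecules:

Molecule A: 00010100010101000101010011110100

Molecule B: 00000000100101001001000011100000

Tanimoto coefficient $= \frac{c}{a + b - c}$, where $a$ and $b$ are the numbers of bits set in A and in B, and $c$ is the number of bits set in A AND B.

Here $a = 13$, $b = 8$, $c = 6$, so Tanimoto $= \frac{6}{(13 + 8) - 6} = \frac{6}{15} = 0.4$

Example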
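The counts in this example can be checked directly from the two bit strings. The sketch below uses integer bitwise AND for the common bits; the variable names are illustrative.

```python
A = "00010100010101000101010011110100"
B = "00000000100101001001000011100000"

a = A.count("1")                              # bits set in A
b = B.count("1")                              # bits set in B
c = bin(int(A, 2) & int(B, 2)).count("1")     # bits set in A AND B

tanimoto = c / (a + b - c)
print(a, b, c, tanimoto)  # 13 8 6 0.4
```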

Associate the relevance of a structure to an explicit feature

• p_i = probability that bit b_i appears in an active structure.

• q_i = probability that bit b_i appears in an inactive structure.

• α_i = binary selector: α_i = 1 if the bit occurs in the structure; otherwise α_i = 0 and the term is negated.

• P(A|S) = probability that the structure is active, given S.

• P(NA|S) = probability that the structure is inactive, given S.

• P(A) = probability of actives.

• P(NA) = probability of inactives.

A probabilistic model (BIM)
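The slide lists the symbols but not the scoring formula itself. As a hedged sketch, the code below uses the standard textbook form of the binary independence model, in which bits are assumed independent and a structure is ranked by the log-odds log(P(A|S)/P(NA|S)). The function name `bim_log_odds` and all probability values are illustrative assumptions, not figures from the cited report.

```python
import math


def bim_log_odds(alpha, p, q, prior_active=0.5):
    """log(P(A|S) / P(NA|S)) assuming the bits are independent.

    alpha: 0/1 selectors for the structure S
    p, q:  per-bit probabilities of the bit appearing in actives / inactives
    """
    score = math.log(prior_active / (1.0 - prior_active))
    for a_i, p_i, q_i in zip(alpha, p, q):
        if a_i:
            score += math.log(p_i / q_i)                   # bit present
        else:
            score += math.log((1.0 - p_i) / (1.0 - q_i))   # bit absent: negated terms
    return score


# Made-up per-bit probabilities for three bits.
p = [0.8, 0.6, 0.1]
q = [0.2, 0.5, 0.3]

active_like = bim_log_odds([1, 1, 0], p, q)
inactive_like = bim_log_odds([0, 0, 1], p, q)
print(round(active_like, 3), round(inactive_like, 3))
```

A positive score means the structure is judged more likely active than inactive, so sorting by this score gives the ranked output of the search.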

Naomie Salim, “The study of probability model for compound similarity searching”, UTM Research Management Centre Project Vote – 75207, University of Malaysia, 2009


Problems again

Claim: General manufacturing problems !


My proposed hybrid search design: the Molecular Dynamic Classification (MDC) method

Components of the design (from the diagram):

• Active compounds database, partitioned into Class 1, Class 2, ..., Class n

• Molecular dynamics simulation tool

• Physicochemical properties

• Classification algorithm

• Voting
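The slides do not specify which classification algorithm or voting scheme is meant. Purely as an illustration of how the classification and voting components could fit together, the sketch below assigns a query compound to a class by majority vote among its k nearest neighbours in a property space. The two-value property vectors, the class labels, and the function `classify_by_vote` are all hypothetical.

```python
import math
from collections import Counter

# Hypothetical labelled compounds: (property vector, class label).
# The two properties are placeholders for simulated physicochemical values.
labelled = [
    ((180.2, 1.2), "class_1"),
    ((194.1, 1.0), "class_1"),
    ((474.6, 2.7), "class_2"),
    ((460.3, 3.1), "class_2"),
    ((310.0, 0.2), "class_n"),
]


def classify_by_vote(query, labelled, k=3):
    """Majority vote among the k labelled compounds closest to the query."""
    nearest = sorted(labelled, key=lambda item: math.dist(query, item[0]))[:k]
    votes = Counter(label for _props, label in nearest)
    return votes.most_common(1)[0][0]


predicted = classify_by_vote((185.0, 1.1), labelled)
print(predicted)  # class_1
```

In practice the properties would need scaling to comparable ranges before distances are computed, and a similarity search would then be restricted to the predicted class.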

Better insight into similarity in terms of bioactivity, toxicity, reactivity... (+)

Search time (+)

Prediction and voting possibilities (+)

Cost of simulation tools (-)

Classification errors (-)

MDC discussion

Materials Explorer

Itemtracker -Freezer/Cryogen sample tracking system

CHARMM

MDynaMix

Simulation tools

Fingerprint generation time experiment

Data source: simulating tool indicated in the report [17]

Consider the cost if we have more than 1000 bits!

[Figure: Fingerprint generation time. x-axis: max path length; y-axis: time (ms), 0 to 30; series: 2 bits, 3 bits, 4 bits.]

Hit rate experiment

[Figure: Hit rate versus selection size. x-axis: selection size, 0 to 2500; y-axis: hit rate, 0 to 0.18.]

Data source: simulating tool indicated in the report [17]

As the size of the feature selection increases, the hit rate for finding actives decreases.

Even fragment-based fingerprint generation is time-consuming

Probabilistic models and machine learning introduced substantial changes

Mixing more than one type of descriptor seems efficient in terms of both time and result quality

Experimental results are still needed

General evaluation and conclusions



Thank you for listening

Haytham Hijazi
