Chemoinformatics tools for lead discovery

Chemoinformatics tools for lead discovery

Virtual screeningThe huge numbers of molecules available in

public and in-house databases means that there is a requirement for tools to rank compounds in order of decreasing probability of activity

Range of methods available, varying in the sophistication and the amount of information that is available

Use of structure-based methods when an X-ray structure for the biological target is available

If this is not the case then must make use of information about (potential) ligands

Ligand-Based MethodsSimilarity searching

Use when just a single bioactive reference structure is available

3D pharmacophore searchingUse when it has been possible to carry out a

pharmacophore mapping exerciseMachine learning

Use when a fair number of both actives and inactives have been identified

Similarity Searching: IUse of a similarity measure to quantify

the resemblance between an active target, or reference, structure and each database structure

The similar property principle means that high-ranked structures are likely to have similar activities to that of the target structure

Similarity searching hence provides an obvious way of following-up on an initial active

Similarity searching: IIMany ways in which the similarity

between two molecules can be computedA similarity measure has two components

A structure representationA similarity coefficient to compare two

representationsMost operational systems use similarity

measures based on 2D fingerprints and the Tanimoto coefficient

Fragment bit-strings (fingerprints)

Originally developed for 2D substructure search

Similarity is based on the fragments common to two molecules

Widely used in both in-house and commercial chemoinformatics systems

C C C

C

CC C C

O

Similarity coefficientsTanimoto coefficient for binary bit strings

C bits set in common between Target and Database Structure

T bits set in TargetD bits set in Database structure

Values between zero (no bits in common) and unity (identical fingerprints)

Many other, related similarity coefficients exist:Tversky, cosine, Euclidean distance …..

CDTCSTD

Combination of search techniques using data fusion: I

Tanimoto/fingerprint measures most common but many other types, e.g., Computed physicochemical properties3D grid describing the molecular electrostatic

potential These reflect different molecular

characteristics, so may enhance search performance by using more than one similarity measureData fusion or consensus scoring

Combination of search techniques using data fusion: IICombination of different rankings of

the same sets of moleculesTwo basic approaches

• Generate rankings from the same molecule using different similarity measures (similarity fusion)

• Generate rankings from different molecules using the same similarity measure but different molecules (group fusion)

Reference 1

Group fusion

Reference 2

Reference 3

After truncation to required rank

Reference 2

Reference 1

Reference 3

Fused

r = 2000

r = 1000

Active found in earlier list

New Active

Final truncated

GroupFusion

Group fusion rulesUseful performance increases, even with just

10 actives, as better coverage of structural space with multiple starting points

Improvement most obvious when searching for heterogeneous sets of active molecules

Best results obtained by • Fusing similarity coefficient values, rather than

ranks • Re-ranking using the maximum of the similarity

values associated with each molecule • Using the Tanimoto coefficient

Turbo similarity searching: I Similar property principle: nearest

neighbours are likely to exhibit the same activity as the reference structure

Group fusion improves the identification of active compounds

Potential for further enhancements by group fusion of rankings from the reference structure and from its assumed active nearest neighbours

Turbo similarity searching: IIREFERENCE STRUCTURE

NEAREST NEIGHBOURS

RANKED LIST

Experimental details

MDL Drug Data report (MDDR) dataset of 11 activity classes and 102K structures• In all, 8294 actives in the 11 classes, with

(turbo) similarity searches being carried out using each of these as the reference structure

• ECFP_4 fingerprints/Tanimoto coefficient• MAX group fusion on similarity scores• Increasing numbers of nearest neighbours

Numbers of nearest neighbours

36

37

38

39

40

41

42

43

44

45

46

SS TSS-5 TSS-10 TSS-15 TSS-20 TSS-30 TSS-40 TSS-50 TSS-100

TSS-200

Rec

alls

at 5

% (%

)

Upper and lower bound experiments

30

35

40

45

50

55

60

65

70

SS TSS-100 Upperbound(reference +active NNsamong the100NNs)

Lowerbound(inactive

NNs amongthe 100NNs)

Upperbound(reference +100 active

NNs)

Lowerbound(100

inactiveNNs)

Rec

all a

t 5%

Rationale for upper bound resultsThe true actives in the set of assumed actives

yield significant enhancements in performance

The true inactives in the set of assumed actives have little effect on performance

Taken together, the two groups of compounds yield the observed net enhancement

Use of machine-learning methods for similarity searching: ITurbo similarity searching uses group

fusion to enhance conventional similarity searching

Machine learning is a more powerful virtual screening tool than similarity searchingBut requires a training-set containing known

actives and inactivesGiven an active reference structure, a

training-set can be generated fromUsing the k nearest neighbours of the

reference structure as the activesUsing k randomly chosen, low-similarity

compounds as the inactives

Use of machine-learning methods for similarity searching: II

REFERENCESTRUCTURE

NEARESTNEIGHBOURS

RANKEDLIST

TRAININGSET

SIMILARITYSEARCHING

MACHINELEARNING

RANDOMLYSELECTEDCOMPOUNDS

Results: IExperiments with the MDDR dataset show

that group fusion better than machine-learning methods when averaged over all of the classes

However, group fusion inferior for the most diverse datasets (as measured by the mean pair-wise similarities)

Additional searches using 10 MDDR activity classes that are as structurally diverse as possible

Results: II

18

19

20

21

22

23

24

25

26

27

28

SS TSS-100 TSS-100-SSA-R4

TSS-100-BKD-OPT

TSS-100-SVM

Rec

all a

t 5%

Conclusions: IFingerprint-based similarity searching

using a known reference structure is long-established in chemoinformatics

When small numbers of actives are available, group fusion will enhance performance when the sought actives are structurally heterogeneous

Conclusions: IICan also enhance conventional similarity

search, even if there is just a single active, by assuming that the nearest neighbours are also active

Can be effected in two ways• Use of group fusion to combine similarity

rankings (overall best approach)• Use of substructural analysis to compute

fragment weights (best with highly heterogeneous sets of actives)

Soal untuk dipelajariTunjukkan peran khemoinformatik dalam QSARData dan analisis dari khemoinformatik yang banyak

digunakan dalam docking molekulIndeks kemiripan (similarity index) banyak digunakan

untuk mendapatkan informasi tentang senyawa baru yang memiliki aktivitas biologis tinggi. Jelaskan secara singkat sistem kerjanya

Dalam penemuan obat baru yang lebih potensial dari yang sudah dikenal, banyak memanfaatkan khemoinformatiks. Jelaskan dengan beberapa contoh.

Apa perbedaan penggunaan Khemoinformatiks dalam QSAR, molecular docking dan similarity searching?

Documents

Chemoinformatics tools for lead discovery