37
Characterizing the Density of Chemical Spaces and its Use in Outlier Analysis and Clustering Rajarshi Guha School of Informatics Indiana University 17 th August, 2007 Cambridge, MA

Characterizing the Density of Chemical Spaces and its Use in Outlier Analysis and Clustering

  • Upload
    rguha

  • View
    711

  • Download
    2

Embed Size (px)

Citation preview

Page 1: Characterizing the Density of Chemical Spaces and its Use in Outlier Analysis and Clustering

Characterizing the Density of Chemical Spaces andits Use in Outlier Analysis and Clustering

Rajarshi Guha

School of InformaticsIndiana University

17th August, 2007Cambridge, MA

Page 2: Characterizing the Density of Chemical Spaces and its Use in Outlier Analysis and Clustering

Outline

R-NN curves for diversity analysis

Counting clusters with R-NN curves

Density and domain applicability

Page 3: Characterizing the Density of Chemical Spaces and its Use in Outlier Analysis and Clustering

R-NN Curves for Diversity AnalysisLinking R-NN Curves to Cluster CountsDensity & Domain Applicability

Nearest Neighbor Methods

Traditional kNN methods aresimple, fast, intuitive

Applications in

regression & classificationdiversity analysis

Can be misleading if the nearestneighbor is far away

R-NN methods may be moresuitable

●●

Page 4: Characterizing the Density of Chemical Spaces and its Use in Outlier Analysis and Clustering

R-NN Curves for Diversity AnalysisLinking R-NN Curves to Cluster CountsDensity & Domain Applicability

Diversity Analysis

Why is it Important?

Compound acquisition

Lead hopping

Knowledge of the distribution of compounds in a descriptorspace may improve predictive models

Page 5: Characterizing the Density of Chemical Spaces and its Use in Outlier Analysis and Clustering

R-NN Curves for Diversity AnalysisLinking R-NN Curves to Cluster CountsDensity & Domain Applicability

Approaches to Diversity Analysis

Cell based

Divide space into bins

Compounds are mapped to bins

Disadvantages

Not useful for high dimensionaldata

Choosing the bin size can be tricky

● ●

● ●

●●

●●

●●

●●

●●

●●

● ●

●●●

●●

●●

●●

● ●

● ●

●●

●●

●●

●●

●●

●●

● ●

●●●

●●

●●

●●

Schnur, D.; J. Chem. Inf. Comput. Sci. 1999, 39, 36–45

Agrafiotis, D.; Rassokhin, D.; J. Chem. Inf. Comput. Sci. 2002, 42, 117–12

Page 6: Characterizing the Density of Chemical Spaces and its Use in Outlier Analysis and Clustering

R-NN Curves for Diversity AnalysisLinking R-NN Curves to Cluster CountsDensity & Domain Applicability

Approaches to Diversity Analysis

Distance based

Considers distance betweencompounds in a space

Generally requires pairwise distancecalculation

Can be sped up by kD trees, MVPtrees etc.

●●

Agrafiotis, D. K. and Lobanov, V. S.; J. Chem. Inf. Comput. Sci. 1999, 39, 51–58

Page 7: Characterizing the Density of Chemical Spaces and its Use in Outlier Analysis and Clustering

R-NN Curves for Diversity AnalysisLinking R-NN Curves to Cluster CountsDensity & Domain Applicability

Generating an R-NN Curve

Observations

Consider a query point with a hypersphere, of radius R,centered on it

For small R, the hypersphere will contain very few or noneighbors

For larger R, the number of neighbors will increase

When R ≥ Dmax , the neighbor set is the whole dataset

The question is . . .

Does the variation of nearest neighbor count with radius allow usto characterize the location of a query point in a dataset?

Page 8: Characterizing the Density of Chemical Spaces and its Use in Outlier Analysis and Clustering

R-NN Curves for Diversity AnalysisLinking R-NN Curves to Cluster CountsDensity & Domain Applicability

Generating an R-NN Curve

Observations

Consider a query point with a hypersphere, of radius R,centered on it

For small R, the hypersphere will contain very few or noneighbors

For larger R, the number of neighbors will increase

When R ≥ Dmax , the neighbor set is the whole dataset

The question is . . .

Does the variation of nearest neighbor count with radius allow usto characterize the location of a query point in a dataset?

Page 9: Characterizing the Density of Chemical Spaces and its Use in Outlier Analysis and Clustering

R-NN Curves for Diversity AnalysisLinking R-NN Curves to Cluster CountsDensity & Domain Applicability

Generating an R-NN Curve

Algorithm

Dmax ← max pairwise distancefor molecule in dataset do

R ← 0.01× Dmax

while R ≤ Dmax doFind NN’s within radius RIncrement R

end whileend forPlot NN count vs. R

●●

●●

● ●

●●

●●

● ●

●●

●●

●●

●●

5%

10%

30%

50%

Guha, R. et al.; J. Chem. Inf. Model., 2006, 46, 1713–1772

Page 10: Characterizing the Density of Chemical Spaces and its Use in Outlier Analysis and Clustering

R-NN Curves for Diversity AnalysisLinking R-NN Curves to Cluster CountsDensity & Domain Applicability

Generating an R-NN Curve

●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●

●●●●●●●●●●●●●

0 20 40 60 80 100

050

100

150

200

250

R

Num

ber

of N

eigh

bors

Sparse

●●●●●

●●●

●●

●●●

●●

●●●●●●●

●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●

0 20 40 60 80 1000

5010

015

020

025

0R

Num

ber

of N

eigh

bors

Dense

Page 11: Characterizing the Density of Chemical Spaces and its Use in Outlier Analysis and Clustering

R-NN Curves for Diversity AnalysisLinking R-NN Curves to Cluster CountsDensity & Domain Applicability

Characterizing an R-NN Curve

Converting the Plot to Numbers

Since R-NN curves are sigmoidal, fit them to the logisticequation

NN = a · 1 + me−R/τ

1 + ne−R/τ

m, n should characterize the curve

Problems

Two parametersNon-linear fitting is dependent on the starting pointFor some starting points, the fir does not converge andrequires repetition

Page 12: Characterizing the Density of Chemical Spaces and its Use in Outlier Analysis and Clustering

R-NN Curves for Diversity AnalysisLinking R-NN Curves to Cluster CountsDensity & Domain Applicability

Characterizing an R-NN Curve

Converting the Plot to a Number

Determine the value of R where thelower tail transitions to the linearportion of the curve

Solution

Determine the slope at variouspoints on the curve

Find R for the first occurence ofthe maximal slope (Rmax(S))

Can be achieved using a finitedifference approach

S’’

S = 0

S = 0

S’

Page 13: Characterizing the Density of Chemical Spaces and its Use in Outlier Analysis and Clustering

R-NN Curves for Diversity AnalysisLinking R-NN Curves to Cluster CountsDensity & Domain Applicability

Characterizing an R-NN Curve

Converting the Plot to a Number

Determine the value of R where thelower tail transitions to the linearportion of the curve

Solution

Determine the slope at variouspoints on the curve

Find R for the first occurence ofthe maximal slope (Rmax(S))

Can be achieved using a finitedifference approach

S’’

S = 0

S = 0

S’

Page 14: Characterizing the Density of Chemical Spaces and its Use in Outlier Analysis and Clustering

R-NN Curves for Diversity AnalysisLinking R-NN Curves to Cluster CountsDensity & Domain Applicability

Characterizing Multiple R-NN Curves

Problem

Visual inspection of curves is usefulfor a few molecules

For larger datasets we need tosummarize R-NN curves

Solution

Plot Rmax(S) values for eachmolecule in the dataset

Points at the top of the plot arelocated in the sparsest regions

Points at the bottom are located inthe densest regions

0 1000 2000 3000 4000

2040

6080

Serial NumberR m

axS

1861

3904

Kazius, J. et al.; J. Med. Chem. 2005, 48, 312–320

Page 15: Characterizing the Density of Chemical Spaces and its Use in Outlier Analysis and Clustering

R-NN Curves for Diversity AnalysisLinking R-NN Curves to Cluster CountsDensity & Domain Applicability

R-NN Curves and Outliers

●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●

●●●●

0 20 40 60 80 100

010

0020

0030

0040

00

R

Num

ber

of N

eigh

bors

3904

●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●

●●●●●●●●●●●●●●●●●●●●●●●●●●●●●

0 20 40 60 80 100

010

0020

0030

0040

00

R

Num

ber

of N

eigh

bors

1861

NH

HN OO

NH

NH

O

HO

O

HN

NH

OHO

H2N

N•

O H2N

O

HN

S

HO

S

O

HN

HN

O

HN

O

HN

O

NH

O

HN

OOH

HN

O NH

O

NH

O

OH

HN

O

HO

NH

O

HN

O

NH

O

NH

NH

O

HN

O

NH

H2N

O

O

NH2

•N

HN

HN

NH

O

HN

O

NH

S

H2N

N•

O

NH2

•N

NH2

O

NHO

HO

NH

OH

O

O

NH

H2N

•N

3904

ONH

N

NHO

O

NH

O

O

N

HO

HO

O

HO

OH

N

NH2

OH

HN

NHO

O

HO

OH

NH2

O

H2N

O

H2N

HNO

OH

H2NO

N

S

N

S HN

O

N

N

N

O

O

O

O

1861

Page 16: Characterizing the Density of Chemical Spaces and its Use in Outlier Analysis and Clustering

R-NN Curves for Diversity AnalysisLinking R-NN Curves to Cluster CountsDensity & Domain Applicability

How Can We Use It For Large Datasets?

Breaking the O(n2) barrier

Traditional NN detection has a time complexity of O(n2)

Modern NN algorithms such as kD-trees

have lower time complexityrestricted to the exact NN problem

Solution is to use approximate NN algorithms such as LocalitySensitive Hashing (LSH)

Bentley, J.; Commun. ACM 1980, 23, 214–229

Datar, M. et al.; SCG ’04: Proc. 20th Symp. Comp. Geom.; ACM Press, 2004

Dutta, D.; Guha, R.; Jurs, P.; Chen, T.; J. Chem. Inf. Model. 2006, 46, 321–333

Page 17: Characterizing the Density of Chemical Spaces and its Use in Outlier Analysis and Clustering

R-NN Curves for Diversity AnalysisLinking R-NN Curves to Cluster CountsDensity & Domain Applicability

How Can We Use It For Large Datasets?

Why LSH?

Theoretically sublinear

Shown to be 3 orders ofmagnitude faster thantraditional kNN

Very accurate (> 94%)0.02 0.03 0.04 0.05 0.06 0.07 0.08

−3.5

−3.0

−2.5

−2.0

−1.5

−1.0

Radiuslo

g [M

ean

Que

ry T

ime

(sec

)]

LSH (142 descriptors)LSH (20 descriptors)kNN (142 descriptors)kNN (20 descriptors)

Comparison of NN detection speed

on a 42,000 compound dataset

using a 200 point query set

Dutta, D.; Guha, R.; Jurs, P.; Chen, T.; J. Chem. Inf. Model. 2006, 46, 321–333

Page 18: Characterizing the Density of Chemical Spaces and its Use in Outlier Analysis and Clustering

R-NN Curves for Diversity AnalysisLinking R-NN Curves to Cluster CountsDensity & Domain Applicability

Alternatives?

Why not use PCA?

R-NN curves are fundamentally aform of dimension reduction

Principal Components Analysis isalso a form of dimension reduction

Disadvantages

Eigendecomposition via SVD isO(n3)

Difficult to visualize more than 2 or3 PC’s at the same time

We are no longer in the originaldescriptor space

0 5 10

−6−4

−20

24

Principal Component 1

Prin

cipal

Com

pone

nt 2

Page 19: Characterizing the Density of Chemical Spaces and its Use in Outlier Analysis and Clustering

R-NN Curves for Diversity AnalysisLinking R-NN Curves to Cluster CountsDensity & Domain Applicability

R-NN Curves and Clusters

−5 0 5 10 15 20 25

05

1015

84

88

137

251

0 20 40 60 80 100

050

100

150

200

250

300

251

0 20 40 60 80 100

050

100

150

200

250

300

84

0 20 40 60 80 100

050

100

150

200

250

300

88

0 20 40 60 80 100

050

100

150

200

250

300

137

Smoothed R-NN Curves

R-NN curves are indicative of the number of clusters

Page 20: Characterizing the Density of Chemical Spaces and its Use in Outlier Analysis and Clustering

R-NN Curves for Diversity AnalysisLinking R-NN Curves to Cluster CountsDensity & Domain Applicability

R-NN Curves and Clusters

Counting the steps

Essentially a curve matchingproblem

All points will not beindicative of the number ofclusters

Not applicable for radiallydistributed clusters

Approaches

Hausdorff / Frechet distance- requires canonical curves

RMSE from distance matrix

Slope analysis

Page 21: Characterizing the Density of Chemical Spaces and its Use in Outlier Analysis and Clustering

R-NN Curves for Diversity AnalysisLinking R-NN Curves to Cluster CountsDensity & Domain Applicability

R-NN Curves and Their Slopes

0 20 40 60 80 100

050

100

150

200

250

300

251

0 20 40 60 80 100

050

100

150

200

250

300

84

0 20 40 60 80 100

050

100

150

200

250

300

88

0 20 40 60 80 100

050

100

150

200

250

300

137

0 20 40 60 80 100

02

46

810

1214

251

Slo

pe o

f the

RN

N C

urve

0 20 40 60 80 100

05

1015

20

84

Slo

pe o

f the

RN

N C

urve

0 20 40 60 80 1000

510

15

88

Slo

pe o

f the

RN

N C

urve

0 20 40 60 80 100

05

1015

137

Slo

pe o

f the

RN

N C

urve

Smoothed first derivative of the R-NN Curves

Identifying peaks identifies the number of clusters

Automated picking can identify spurious peaks

Guha, R. et al., J. Chem. Inf. Model., 2007, 47, 1308–1318

Page 22: Characterizing the Density of Chemical Spaces and its Use in Outlier Analysis and Clustering

R-NN Curves for Diversity AnalysisLinking R-NN Curves to Cluster CountsDensity & Domain Applicability

R-NN Curves and Their Slopes

0 20 40 60 80 100

050

100

150

200

250

300

251

0 20 40 60 80 100

050

100

150

200

250

300

84

0 20 40 60 80 100

050

100

150

200

250

300

88

0 20 40 60 80 100

050

100

150

200

250

300

137

0 20 40 60 80 100

02

46

810

1214

251

Slo

pe o

f the

RN

N C

urve

0 20 40 60 80 100

05

1015

20

84

Slo

pe o

f the

RN

N C

urve

0 20 40 60 80 1000

510

15

88

Slo

pe o

f the

RN

N C

urve

0 20 40 60 80 100

05

1015

137

Slo

pe o

f the

RN

N C

urve

Smoothed first derivative of the R-NN Curves

Identifying peaks identifies the number of clusters

Automated picking can identify spurious peaks

Guha, R. et al., J. Chem. Inf. Model., 2007, 47, 1308–1318

Page 23: Characterizing the Density of Chemical Spaces and its Use in Outlier Analysis and Clustering

R-NN Curves for Diversity AnalysisLinking R-NN Curves to Cluster CountsDensity & Domain Applicability

Slope Analysis of R-NN Curves

Procedure

for i in molecules doEvaluate R-NN curveF ← smoothed R-NN curveEvaluate F ′′

Smooth F ′′

Nroot,i ← no. of roots of F ′′

end forNcluster = [max(Nroot) + 1]/2

0 20 40 60 80 100

050

100

150

200

250

300

RNN Curve

0 20 40 60 80 100

02

46

8 D'

0 20 40 60 80 100

−0.

50.

00.

5

D''

●●●●●●●●●

●●●●●●

●●●●●●●●●●●●●●●●

●●●●●●

●●●●

●●

●●●●●

●●●●

●●●●

●●●●●●●●●●●●●●●●●●●●●●

Page 24: Characterizing the Density of Chemical Spaces and its Use in Outlier Analysis and Clustering

R-NN Curves for Diversity AnalysisLinking R-NN Curves to Cluster CountsDensity & Domain Applicability

Simulated Data

Simulated 2D data using a Thomas point process

Predicted k, followed by kmeans clustering using k

Investigated similar values of k

k ASW

4 0.613 0.745 0.70

k ASW

3 0.442 0.654 0.47

k ASW

4 0.643 0.485 0.56

ASW - average silhouette width, higher is better; k - number of clusters

Page 25: Characterizing the Density of Chemical Spaces and its Use in Outlier Analysis and Clustering

R-NN Curves for Diversity AnalysisLinking R-NN Curves to Cluster CountsDensity & Domain Applicability

A Mixed Dataset

Dataset composition

277 DHFR inhibitors based on

substituteed pyrimidinediamine anddiaminopteridine scaffolds

277 molecules from the DIPPR project

mainly simple hydrocarbonsboiling point was modeled

Evaluated 147 Molconn-Z descriptors,reduced to 24

We expect at least 3 clusters

Sutherland, J. et al.; J. Chem. Inf. Comput. Sci., 2003, 43, 1906–1915

Goll, E. and Jurs, P.; J. Chem. Inf. Comput. Sci., 1999, 39, 974–983

Page 26: Characterizing the Density of Chemical Spaces and its Use in Outlier Analysis and Clustering

R-NN Curves for Diversity AnalysisLinking R-NN Curves to Cluster CountsDensity & Domain Applicability

A Mixed Dataset

Desc. Set k ASW Purity

4 descriptors 2 0.71 0.633 0.67 0.894 0.73 0.84

6 descriptors 2 0.67 0.943 0.70 0.974 0.61 0.94

All 24 descriptors 2 0.29 0.963 0.33 0.964 0.23 0.90

●●●●

●●●

●●

●●

●●

●●●

●●●

●●●●●●

●●●●●●

●●

●●●

●●●

●●●

●●●

●●

●●

●●●●●●●

●●

●●

●●

●●

●●●

●●

●●

●●●

●●●

●●●

●●

●●

●●

●●●●

●●

●●

●●●

●●●

●●●

●●

●●

●●●

●●●●

●●

●●

●●

●●

●●

●●●●

●●●●

●●

●●●

●●

●●●

●●

●●

●●●●●●

●●●

●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●

●●●●●●●●●●●●●●●●

●●●●●●●●●

●●●●●●●●●●●●●●●●●●●●●●●●●●●

●●●●

●●●●●●●●●●●●●●●●●●●●●●●●●●●●●

●●●●

●●●●●●●●●

●●

●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●

●●

●●●

●●●●●●

●●●●●●●●●●●●●●●●

●●●●●●

●●●●●●●●●●●●

●●

●●●●●●●●●●

●●

●●●●●●●●●

●●

●●

−10 −5 0 5

−10

−5

05

PC1

PC

2

DHFRDIPPR

PC plot indicates 3 mainclusters

In all cases, 3 clusters isoptimal for both qualitymeasures

Page 27: Characterizing the Density of Chemical Spaces and its Use in Outlier Analysis and Clustering

R-NN Curves for Diversity AnalysisLinking R-NN Curves to Cluster CountsDensity & Domain Applicability

Drawbacks

Currently works well for well separated clusters

But radial clusters will not be detected

We now consider all the points in the dataset

InefficientSampling experiments seem to indicate that we can make dowith 45% of the pointsCould prioritize by considering points wth low Rmax(S) values

Small changes in local density can mislead the algorithm

But this is somewhat subjective

Page 28: Characterizing the Density of Chemical Spaces and its Use in Outlier Analysis and Clustering

R-NN Curves for Diversity AnalysisLinking R-NN Curves to Cluster CountsDensity & Domain Applicability

Domain Applicability for QSAR Models

What is it?

How similar should a new structure be to the training set toget a reliable prediction

Approaches

Descriptor based - distance from the centroid, leverage

Structure based - fingerprint similarity

Cluster based

Stanforth, R.W. et al., QSAR. Comb. Sci., 2007, 26, 837–844

Netzeva, T.I. et al., Altern. Lab. Anim., 2005, 33, 155–173

Sheridan, R.P. et al., J. Chem. Inf. Comput. Sci., 2004, 44, 1912–1928

Page 29: Characterizing the Density of Chemical Spaces and its Use in Outlier Analysis and Clustering

R-NN Curves for Diversity AnalysisLinking R-NN Curves to Cluster CountsDensity & Domain Applicability

Domain Applicability for QSAR Models

One Domain or Multiple Domains?

It’s possible to encompass the entire training set into a singledomain

This is a very broad approach

Modeling approaches based on neighborhoods essentiallyconsider multiple, local, domains

Even for a global model, some observations are betterpredicted than others

Suggests that certain regions of a descriptor space arepredicted better than others

Guha, R. et al., J. Chem. Inf. Model., 2006, 46, 1836–1847

Page 30: Characterizing the Density of Chemical Spaces and its Use in Outlier Analysis and Clustering

R-NN Curves for Diversity AnalysisLinking R-NN Curves to Cluster CountsDensity & Domain Applicability

Domain Applicability for QSAR Models

One Domain or Multiple Domains?

It’s possible to encompass the entire training set into a singledomain

This is a very broad approach

Modeling approaches based on neighborhoods essentiallyconsider multiple, local, domains

Even for a global model, some observations are betterpredicted than others

Suggests that certain regions of a descriptor space arepredicted better than others

Guha, R. et al., J. Chem. Inf. Model., 2006, 46, 1836–1847

Page 31: Characterizing the Density of Chemical Spaces and its Use in Outlier Analysis and Clustering

R-NN Curves for Diversity AnalysisLinking R-NN Curves to Cluster CountsDensity & Domain Applicability

Local Descriptor Distributions

For local regression models,the descriptor distribution inthe neighborhood will affectpredictive ability

The effect is more severewhen methods such as kNNare used which do not doany interpolation

Nearest neighbors may notactually be nearby

PEOE_VSA.0.1 (TSET)

Den

sity

0 20 40 60 80 100 120

0.00

00.

005

0.01

00.

015

PEOE_VSA.0.1 (Neighborhood)

Den

sity

40 50 60 70 80 90 100

0.00

00.

010

0.02

0

PEOE_VSA.0.1 (Neighborhood)

Den

sity

40 50 60 70 80

0.00

0.02

0.04

0.06

PEOE_VSA_NEG (TSET)

Den

sity

50 100 150 200 250 300

0.00

00.

004

0.00

8

PEOE_VSA_NEG (Neighborhood)

Den

sity

130 140 150 160 170 180 190 200

0.00

00.

005

0.01

00.

015

0.02

0

PEOE_VSA_NEG (Neighborhood)

Den

sity

80 90 100 110 120

0.00

0.02

0.04

0.06

SlogP (TSET)

Den

sity

0 1 2 3 4 5 6

0.0

0.1

0.2

0.3

0.4

SlogP (Neighborhood)

Den

sity

1 2 3 4

0.0

0.1

0.2

0.3

0.4

0.5

SlogP (Neighborhood)

Den

sity

2.0 2.5 3.0

0.0

0.2

0.4

0.6

0.8

1.0

1.2

SlogP_VSA1 (TSET)

Den

sity

0 50 100 150

0.00

00.

005

0.01

00.

015

0.02

0

SlogP_VSA1 (Neighborhood)

Den

sity

30 40 50 60 70 80 90

0.00

0.01

0.02

0.03

0.04

SlogP_VSA1 (Neighborhood)

Den

sity

50 55 60 65 70

0.00

0.02

0.04

0.06

0.08

0.10

SMR_VSA4 (TSET)

Den

sity

0 20 40 60 80 100

0.00

0.01

0.02

0.03

0.04

SMR_VSA4 (Neighborhood)

Den

sity

10 15 20 25 30 35 400.

000.

010.

020.

030.

04

SMR_VSA4 (Neighborhood)

Den

sity

5 10 15 20 25 30 35

0.00

0.04

0.08

0.12

Guha, R. et al., J. Chem. Inf. Model., 2006, 46, 1836–1847

Page 32: Characterizing the Density of Chemical Spaces and its Use in Outlier Analysis and Clustering

R-NN Curves for Diversity AnalysisLinking R-NN Curves to Cluster CountsDensity & Domain Applicability

Mapping Domain Applicability via Density

Query molecules that occurin dense regions of of thetraining set descriptor spaceare expected to be wellpredicted

Doesn’t penalize for being alittle far from the centroid ofthe space

Page 33: Characterizing the Density of Chemical Spaces and its Use in Outlier Analysis and Clustering

R-NN Curves for Diversity AnalysisLinking R-NN Curves to Cluster CountsDensity & Domain Applicability

An Aqeuous Solubility Dataset

Estimate Std. Error t value

(Intercept) -2.1103 0.0909 -23.20PEOE RPC+1 1.8754 0.2437 7.70PEOE RPC-1 3.4954 0.1738 20.11SlogP -0.8199 0.0131 -62.55CASA 0.0008 0.0001 -9.83

RMSE = 0.7542 F (4, 1231) = 1993

1236 compounds

Evaluated 147 descriptors,reduced to 61

Searched for a good subsetusing a GA

●●●

●●

●●

●●

● ●

●●

●●

●●

●●

●●

●●

●●

●●

● ●

●●

● ●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

● ●

●●

●● ●

●●

●●

●●

●●

●●

●● ●

●●

● ●

●●

●●

● ●

● ●

●●

●●

●●

●●

●●

●●

●●

●●

● ●

●●

● ●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●● ●

●●

● ●●

●●●

●●

● ●

●●

●●

●●

●●

●●

●●●

● ●

●●●●

●● ●

●●

●●

● ●

●●

● ●

●●●

●●

●●

● ●

●●

●●●

●●

●●

●●

●●

●●

●●●

● ●

●●

● ●

●●

●●

● ●

●●

● ●

● ●●

● ●

●●

●●

●●

● ●

●●

●●

●●

●●

●● ●

●●

●● ●

●● ●

●●

●●

●●

● ●●

● ●

●●

●●

●●

●●

●●

●● ●

●●

●●●

●●

●●

●●●

●●

●●●

● ●

● ●

●●

● ●

●●

●●

●●

● ●●

●●

● ●●

●●

●●

● ●

●●

●●

●●

● ●

● ● ●●

●●

●●●

●●

● ●

●●

●●●

●●

●●

●● ●

●●

●●

● ●●

●●

● ●●

●●

●●●

●●

●●

●●

●●

●●

●●

● ●

●●

●●

●●

−10 −5 0

−10

−5

0

Observed Aq. Solubility

Pre

dict

ed A

q. S

olub

ility

●●●

●●

●●

●●

●●

●●●

●●●●●

●●●

●●

●●●

●●●●●

●●●●●●●●

●●●●●●●●●●●●●●

●●●●●●●●●●

●●●●●●●●●●●●●●●●●

●●●●

●●

●●●●●

●●

●●

●●●

●●●

●●

●●●

●●

●●

●●

●●●●

●●

●●

●●●

●●●

●●●●

●●

●●

●●

●●●

●●

●●

●●

●●

●●

●●●

●●

●●●

●●

●●

●●

●●

●●

●●

●●●

●●

●●

●●●

●●●

●●

●●●●●

●●

●●●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●●

●●

●●

●●

●●

●●●●●

●●

●●

●●

●●●●●●●

●●●●●●

●●

●●●●

●●

●●

●●●●●

●●

●●●●●●●●●●●

●●

●●

●●●●●

●●●●●●●●●●●

●●●●●

●●

●●

●●

●●

●●

●●

●●●●

●●●●

●●●

●●

●●

●●

●●

●●

●●

●●

●●

●●●●●

●●

●●

●●●

●●●

●●●●

●●●●●

●●

●●●●●●●●

●●●●●

●●

●●

●●

●●

●●●

●●●●●●●●●

●●

●●●

●●

●●

●●●●

●●

●●

●●

●●

●●●●

●●●

●●●●

●●

●●●

●●●

●●

●●

●●●

●●●●●●

●●

●●

●●●●

●●●●●●●

●●

●●

●●●

●●

●●

●●

●●●

●●

●●●

●●●●

●●●●●●●

●●

●●

●●

●●

●●

●●

●●●●●●●●●●●●●●●●●●●●●●●●●

●●●●●●●

●●

●●

●●

●●●●●

●●●

●●●●●●●

●●

●●●●●●

●●●●●

●●●

●●●

●●

●●●

●●●●

●●●●●●●●

●●

●●

●●●●●

●●●

●●

●●●

●●●

●●

●●

●●

●●

●●●●

●●

●●

●●

●●●

●●●●

●●

●●

●●

●●

●●

●●●

●●

●●

●●

●●

●●

●●●

●●●●

●●●

●●

●●

●●●

●●

0 200 400 600 800 1000 1200

020

4060

8010

0

Index

Rm

ax(S

)

Huuskonen, J., J. Chem. Inf. Comput. Sci., 2000, 40, 773–777

Page 34: Characterizing the Density of Chemical Spaces and its Use in Outlier Analysis and Clustering

R-NN Curves for Diversity AnalysisLinking R-NN Curves to Cluster CountsDensity & Domain Applicability

An Aqueous Solubility Dataset

●●●

●●

●●

●●

● ●

●●

●●

●●

●●

●●

●●

●●

●●

● ●

●●

● ●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

● ●

●●

●● ●

●●

●●

●●

●●

●●

●● ●

●●

● ●

●●

●●

● ●

● ●

●●

●●

●●

●●

●●

●●

●●

●●

● ●

●●

● ●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●● ●

●●

● ●●

●●●

●●

● ●

●●

●●

●●

●●

●●

●●●

● ●

●●●●

●● ●

●●

●●

● ●

●●

● ●

●●●

●●

●●

● ●

●●

●●●

●●

●●

●●

●●

●●

●●●

● ●

●●

● ●

●●

●●

● ●

●●

● ●

● ●●

● ●

●●

●●

●●

● ●

●●

●●

●●

●●

●● ●

●●

●● ●

●● ●

●●

●●

●●

● ●●

● ●

●●

●●

●●

●●

●●

●● ●

●●

●●●

●●

●●

●●●

●●

●●●

● ●

● ●

●●

● ●

●●

●●

●●

● ●●

●●

● ●●

●●

●●

● ●

●●

●●

●●

● ●

● ● ●●

●●

●●●

●●

● ●

●●

●●●

●●

●●

●● ●

●●

●●

● ●●

●●

● ●●

●●

●●●

●●

●●

●●

●●

●●

●●

● ●

●●

●●

●●

−10 −5 0

−10

−5

0

Observed Aq. Solubility

Pre

dict

ed A

q. S

olub

ility

● ●

●●

●●

●●

● ● ● ● ●

● ● ●

●●

●● ●

●●

●●●

● ● ● ● ●

●●

●●●●●

●●

● ● ●

● ●● ●● ●●● ● ●

●●●

●● ● ●● ● ● ●●●● ●●●

●●

●●

●●

●●

●●

●●

●●

●●

● ●

●●

●●●

●●

● ●

● ●●

●●

●●●

● ●● ●

●●

●●

●●

●●●

●●

●●

●●

● ●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

●●

● ●

●●

●●

●●

●●

●●

●●

● ●

●●

● ●

●●

●●

●●

●●●

● ●

●●

●●

●●

● ● ●●

●● ●●

●●

● ● ●

●●

●●

●● ● ●●

●●

●●●

●● ●●

●●

● ●●●●

●●

●●

●●●●●● ●

●●● ●

●●

●●

●●

●●

●●

●●●●

●●

●●

●●●

●●

●●

● ●

●●

● ●● ●●

●●

●●

●●

●●

●●●

● ●

● ● ● ● ● ●●●

●●

●●

●●

●●

● ●● ●●●●

● ●

●●

● ●

●●

●●

●●

● ●

●●

●●

●●

●● ●●

●●

●●

●●●●

● ●

●●

●●

●● ●

●●

● ●

● ●

● ●

●●

● ●●

● ●

● ● ●

●● ●●

● ●●

●●

●●

●●

●●●●●●● ●●

●●●●● ● ●●●● ●● ●● ●●

●●●●● ●●

●● ●●●

● ●●

● ● ●●●

●●

●●

●● ●● ●●

● ● ● ● ●

●●

●●

● ●

●●

● ● ●●

●●●

●●

● ●

●●

●●

●●

●●

●●

●●

●●●

●●

●●

●●

● ●

●●

●●●

●●

●●

●●

●●●

●●

●●

●●

● ●●

●●

●● ●

●●

●●

●●

●●

0 1 2 3 4 5

020

4060

8010

0Absolute Studentized Residual

Rm

ax(S

)

Isolated compounds can be predicted well

The Rmax(S) values could be calculated more rigorously from asmoothed curve

Page 35: Characterizing the Density of Chemical Spaces and its Use in Outlier Analysis and Clustering

R-NN Curves for Diversity AnalysisLinking R-NN Curves to Cluster CountsDensity & Domain Applicability

Mapping Domain Applicability via Density

Caveats

The approach assumes that similar compounds will havesimilar activities

This is not always true (activity cliffs)

Correlation between density as measured by Rmax(S) andprediction accuracy needs more investigation

Take into account descriptor importance via weightedEuclidean distances

Maggiora, G.M., J. Chem. Inf. Model., 2006, 46, 1535–1535

Page 36: Characterizing the Density of Chemical Spaces and its Use in Outlier Analysis and Clustering

R-NN Curves for Diversity AnalysisLinking R-NN Curves to Cluster CountsDensity & Domain Applicability

Summary

R-NN curves . . .

Simple way to characterize spatial distributions and identifyoutliers

Applicable to datasets of arbitrary dimensions and size, viaapproximate NN algorithms such as LSH

Summarizing a dataset does not require user-definedparameters

. . . Clustering

Provides an approach to a priori identification of the numberof clusters, avoiding trial and error

Appears to be more reliable than the silhouette width

Probably not useful for hierarchical clusterings

Page 37: Characterizing the Density of Chemical Spaces and its Use in Outlier Analysis and Clustering

R-NN Curves for Diversity AnalysisLinking R-NN Curves to Cluster CountsDensity & Domain Applicability