Upload
rguha
View
711
Download
2
Tags:
Embed Size (px)
Citation preview
Characterizing the Density of Chemical Spaces andits Use in Outlier Analysis and Clustering
Rajarshi Guha
School of InformaticsIndiana University
17th August, 2007Cambridge, MA
Outline
R-NN curves for diversity analysis
Counting clusters with R-NN curves
Density and domain applicability
R-NN Curves for Diversity AnalysisLinking R-NN Curves to Cluster CountsDensity & Domain Applicability
Nearest Neighbor Methods
Traditional kNN methods aresimple, fast, intuitive
Applications in
regression & classificationdiversity analysis
Can be misleading if the nearestneighbor is far away
R-NN methods may be moresuitable
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
R-NN Curves for Diversity AnalysisLinking R-NN Curves to Cluster CountsDensity & Domain Applicability
Diversity Analysis
Why is it Important?
Compound acquisition
Lead hopping
Knowledge of the distribution of compounds in a descriptorspace may improve predictive models
R-NN Curves for Diversity AnalysisLinking R-NN Curves to Cluster CountsDensity & Domain Applicability
Approaches to Diversity Analysis
Cell based
Divide space into bins
Compounds are mapped to bins
Disadvantages
Not useful for high dimensionaldata
Choosing the bin size can be tricky
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
● ●
●
● ●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●●
●
●●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●●
●
●
●
●
● ●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●●
●
●
●
●●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
● ●
●
● ●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●●
●
●●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●●
●
●
●
●
● ●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●●
●
●
●
●●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
Schnur, D.; J. Chem. Inf. Comput. Sci. 1999, 39, 36–45
Agrafiotis, D.; Rassokhin, D.; J. Chem. Inf. Comput. Sci. 2002, 42, 117–12
R-NN Curves for Diversity AnalysisLinking R-NN Curves to Cluster CountsDensity & Domain Applicability
Approaches to Diversity Analysis
Distance based
Considers distance betweencompounds in a space
Generally requires pairwise distancecalculation
Can be sped up by kD trees, MVPtrees etc.
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
Agrafiotis, D. K. and Lobanov, V. S.; J. Chem. Inf. Comput. Sci. 1999, 39, 51–58
R-NN Curves for Diversity AnalysisLinking R-NN Curves to Cluster CountsDensity & Domain Applicability
Generating an R-NN Curve
Observations
Consider a query point with a hypersphere, of radius R,centered on it
For small R, the hypersphere will contain very few or noneighbors
For larger R, the number of neighbors will increase
When R ≥ Dmax , the neighbor set is the whole dataset
The question is . . .
Does the variation of nearest neighbor count with radius allow usto characterize the location of a query point in a dataset?
R-NN Curves for Diversity AnalysisLinking R-NN Curves to Cluster CountsDensity & Domain Applicability
Generating an R-NN Curve
Observations
Consider a query point with a hypersphere, of radius R,centered on it
For small R, the hypersphere will contain very few or noneighbors
For larger R, the number of neighbors will increase
When R ≥ Dmax , the neighbor set is the whole dataset
The question is . . .
Does the variation of nearest neighbor count with radius allow usto characterize the location of a query point in a dataset?
R-NN Curves for Diversity AnalysisLinking R-NN Curves to Cluster CountsDensity & Domain Applicability
Generating an R-NN Curve
Algorithm
Dmax ← max pairwise distancefor molecule in dataset do
R ← 0.01× Dmax
while R ≤ Dmax doFind NN’s within radius RIncrement R
end whileend forPlot NN count vs. R
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
● ●
●
●
●●
●
●
●
●
●
●
●
●●
●
●
●
●
● ●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●●
●
●
5%
10%
30%
50%
Guha, R. et al.; J. Chem. Inf. Model., 2006, 46, 1713–1772
R-NN Curves for Diversity AnalysisLinking R-NN Curves to Cluster CountsDensity & Domain Applicability
Generating an R-NN Curve
●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●
●
●
●
●
●
●
●●●●●●●●●●●●●
0 20 40 60 80 100
050
100
150
200
250
R
Num
ber
of N
eigh
bors
Sparse
●●●●●
●
●
●●●
●
●●
●●●
●
●●
●
●●●●●●●
●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●
0 20 40 60 80 1000
5010
015
020
025
0R
Num
ber
of N
eigh
bors
Dense
R-NN Curves for Diversity AnalysisLinking R-NN Curves to Cluster CountsDensity & Domain Applicability
Characterizing an R-NN Curve
Converting the Plot to Numbers
Since R-NN curves are sigmoidal, fit them to the logisticequation
NN = a · 1 + me−R/τ
1 + ne−R/τ
m, n should characterize the curve
Problems
Two parametersNon-linear fitting is dependent on the starting pointFor some starting points, the fir does not converge andrequires repetition
R-NN Curves for Diversity AnalysisLinking R-NN Curves to Cluster CountsDensity & Domain Applicability
Characterizing an R-NN Curve
Converting the Plot to a Number
Determine the value of R where thelower tail transitions to the linearportion of the curve
Solution
Determine the slope at variouspoints on the curve
Find R for the first occurence ofthe maximal slope (Rmax(S))
Can be achieved using a finitedifference approach
S’’
S = 0
S = 0
S’
R-NN Curves for Diversity AnalysisLinking R-NN Curves to Cluster CountsDensity & Domain Applicability
Characterizing an R-NN Curve
Converting the Plot to a Number
Determine the value of R where thelower tail transitions to the linearportion of the curve
Solution
Determine the slope at variouspoints on the curve
Find R for the first occurence ofthe maximal slope (Rmax(S))
Can be achieved using a finitedifference approach
S’’
S = 0
S = 0
S’
R-NN Curves for Diversity AnalysisLinking R-NN Curves to Cluster CountsDensity & Domain Applicability
Characterizing Multiple R-NN Curves
Problem
Visual inspection of curves is usefulfor a few molecules
For larger datasets we need tosummarize R-NN curves
Solution
Plot Rmax(S) values for eachmolecule in the dataset
Points at the top of the plot arelocated in the sparsest regions
Points at the bottom are located inthe densest regions
0 1000 2000 3000 4000
2040
6080
Serial NumberR m
axS
1861
3904
Kazius, J. et al.; J. Med. Chem. 2005, 48, 312–320
R-NN Curves for Diversity AnalysisLinking R-NN Curves to Cluster CountsDensity & Domain Applicability
R-NN Curves and Outliers
●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●
●
●
●
●
●
●
●
●●●●
0 20 40 60 80 100
010
0020
0030
0040
00
R
Num
ber
of N
eigh
bors
3904
●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●
●
●
●
●
●
●
●
●
●
●●●●●●●●●●●●●●●●●●●●●●●●●●●●●
0 20 40 60 80 100
010
0020
0030
0040
00
R
Num
ber
of N
eigh
bors
1861
NH
HN OO
NH
NH
O
HO
O
HN
NH
OHO
H2N
N•
O H2N
O
HN
S
HO
S
O
HN
HN
O
HN
O
HN
O
NH
O
HN
OOH
HN
O NH
O
NH
O
OH
HN
O
HO
NH
O
HN
O
NH
O
NH
NH
O
HN
O
NH
H2N
O
O
NH2
•N
HN
HN
NH
O
HN
O
NH
S
H2N
N•
O
NH2
•N
NH2
O
NHO
HO
NH
OH
O
O
NH
H2N
•N
3904
ONH
N
NHO
O
NH
O
O
N
HO
HO
O
HO
OH
N
NH2
OH
HN
NHO
O
HO
OH
NH2
O
H2N
O
H2N
HNO
OH
H2NO
N
S
N
S HN
O
N
N
N
O
O
O
O
1861
R-NN Curves for Diversity AnalysisLinking R-NN Curves to Cluster CountsDensity & Domain Applicability
How Can We Use It For Large Datasets?
Breaking the O(n2) barrier
Traditional NN detection has a time complexity of O(n2)
Modern NN algorithms such as kD-trees
have lower time complexityrestricted to the exact NN problem
Solution is to use approximate NN algorithms such as LocalitySensitive Hashing (LSH)
Bentley, J.; Commun. ACM 1980, 23, 214–229
Datar, M. et al.; SCG ’04: Proc. 20th Symp. Comp. Geom.; ACM Press, 2004
Dutta, D.; Guha, R.; Jurs, P.; Chen, T.; J. Chem. Inf. Model. 2006, 46, 321–333
R-NN Curves for Diversity AnalysisLinking R-NN Curves to Cluster CountsDensity & Domain Applicability
How Can We Use It For Large Datasets?
Why LSH?
Theoretically sublinear
Shown to be 3 orders ofmagnitude faster thantraditional kNN
Very accurate (> 94%)0.02 0.03 0.04 0.05 0.06 0.07 0.08
−3.5
−3.0
−2.5
−2.0
−1.5
−1.0
Radiuslo
g [M
ean
Que
ry T
ime
(sec
)]
LSH (142 descriptors)LSH (20 descriptors)kNN (142 descriptors)kNN (20 descriptors)
Comparison of NN detection speed
on a 42,000 compound dataset
using a 200 point query set
Dutta, D.; Guha, R.; Jurs, P.; Chen, T.; J. Chem. Inf. Model. 2006, 46, 321–333
R-NN Curves for Diversity AnalysisLinking R-NN Curves to Cluster CountsDensity & Domain Applicability
Alternatives?
Why not use PCA?
R-NN curves are fundamentally aform of dimension reduction
Principal Components Analysis isalso a form of dimension reduction
Disadvantages
Eigendecomposition via SVD isO(n3)
Difficult to visualize more than 2 or3 PC’s at the same time
We are no longer in the originaldescriptor space
0 5 10
−6−4
−20
24
Principal Component 1
Prin
cipal
Com
pone
nt 2
R-NN Curves for Diversity AnalysisLinking R-NN Curves to Cluster CountsDensity & Domain Applicability
R-NN Curves and Clusters
−5 0 5 10 15 20 25
05
1015
84
88
137
251
0 20 40 60 80 100
050
100
150
200
250
300
251
0 20 40 60 80 100
050
100
150
200
250
300
84
0 20 40 60 80 100
050
100
150
200
250
300
88
0 20 40 60 80 100
050
100
150
200
250
300
137
Smoothed R-NN Curves
R-NN curves are indicative of the number of clusters
R-NN Curves for Diversity AnalysisLinking R-NN Curves to Cluster CountsDensity & Domain Applicability
R-NN Curves and Clusters
Counting the steps
Essentially a curve matchingproblem
All points will not beindicative of the number ofclusters
Not applicable for radiallydistributed clusters
Approaches
Hausdorff / Frechet distance- requires canonical curves
RMSE from distance matrix
Slope analysis
R-NN Curves for Diversity AnalysisLinking R-NN Curves to Cluster CountsDensity & Domain Applicability
R-NN Curves and Their Slopes
0 20 40 60 80 100
050
100
150
200
250
300
251
0 20 40 60 80 100
050
100
150
200
250
300
84
0 20 40 60 80 100
050
100
150
200
250
300
88
0 20 40 60 80 100
050
100
150
200
250
300
137
0 20 40 60 80 100
02
46
810
1214
251
Slo
pe o
f the
RN
N C
urve
0 20 40 60 80 100
05
1015
20
84
Slo
pe o
f the
RN
N C
urve
0 20 40 60 80 1000
510
15
88
Slo
pe o
f the
RN
N C
urve
0 20 40 60 80 100
05
1015
137
Slo
pe o
f the
RN
N C
urve
Smoothed first derivative of the R-NN Curves
Identifying peaks identifies the number of clusters
Automated picking can identify spurious peaks
Guha, R. et al., J. Chem. Inf. Model., 2007, 47, 1308–1318
R-NN Curves for Diversity AnalysisLinking R-NN Curves to Cluster CountsDensity & Domain Applicability
R-NN Curves and Their Slopes
0 20 40 60 80 100
050
100
150
200
250
300
251
0 20 40 60 80 100
050
100
150
200
250
300
84
0 20 40 60 80 100
050
100
150
200
250
300
88
0 20 40 60 80 100
050
100
150
200
250
300
137
0 20 40 60 80 100
02
46
810
1214
251
Slo
pe o
f the
RN
N C
urve
0 20 40 60 80 100
05
1015
20
84
Slo
pe o
f the
RN
N C
urve
0 20 40 60 80 1000
510
15
88
Slo
pe o
f the
RN
N C
urve
0 20 40 60 80 100
05
1015
137
Slo
pe o
f the
RN
N C
urve
Smoothed first derivative of the R-NN Curves
Identifying peaks identifies the number of clusters
Automated picking can identify spurious peaks
Guha, R. et al., J. Chem. Inf. Model., 2007, 47, 1308–1318
R-NN Curves for Diversity AnalysisLinking R-NN Curves to Cluster CountsDensity & Domain Applicability
Slope Analysis of R-NN Curves
Procedure
for i in molecules doEvaluate R-NN curveF ← smoothed R-NN curveEvaluate F ′′
Smooth F ′′
Nroot,i ← no. of roots of F ′′
end forNcluster = [max(Nroot) + 1]/2
0 20 40 60 80 100
050
100
150
200
250
300
RNN Curve
0 20 40 60 80 100
02
46
8 D'
0 20 40 60 80 100
−0.
50.
00.
5
D''
●●●●●●●●●
●
●
●
●
●
●
●
●
●●●●●●
●●●●●●●●●●●●●●●●
●
●
●
●●●●●●
●
●
●
●
●
●
●
●
●●●●
●●
●
●
●
●
●●●●●
●
●
●
●
●
●
●
●
●●●●
●●●●
●
●●●●●●●●●●●●●●●●●●●●●●
R-NN Curves for Diversity AnalysisLinking R-NN Curves to Cluster CountsDensity & Domain Applicability
Simulated Data
Simulated 2D data using a Thomas point process
Predicted k, followed by kmeans clustering using k
Investigated similar values of k
k ASW
4 0.613 0.745 0.70
k ASW
3 0.442 0.654 0.47
k ASW
4 0.643 0.485 0.56
ASW - average silhouette width, higher is better; k - number of clusters
R-NN Curves for Diversity AnalysisLinking R-NN Curves to Cluster CountsDensity & Domain Applicability
A Mixed Dataset
Dataset composition
277 DHFR inhibitors based on
substituteed pyrimidinediamine anddiaminopteridine scaffolds
277 molecules from the DIPPR project
mainly simple hydrocarbonsboiling point was modeled
Evaluated 147 Molconn-Z descriptors,reduced to 24
We expect at least 3 clusters
Sutherland, J. et al.; J. Chem. Inf. Comput. Sci., 2003, 43, 1906–1915
Goll, E. and Jurs, P.; J. Chem. Inf. Comput. Sci., 1999, 39, 974–983
R-NN Curves for Diversity AnalysisLinking R-NN Curves to Cluster CountsDensity & Domain Applicability
A Mixed Dataset
Desc. Set k ASW Purity
4 descriptors 2 0.71 0.633 0.67 0.894 0.73 0.84
6 descriptors 2 0.67 0.943 0.70 0.974 0.61 0.94
All 24 descriptors 2 0.29 0.963 0.33 0.964 0.23 0.90
●●●●
●
●
●●●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●●●
●
●●●
●
●
●●●●●●
●●●●●●
●●
●
●
●●●
●●●
●●●
●●●
●
●
●●
●●
●●●●●●●
●
●
●
●
●●
●
●
●
●●
●●
●
●●
●
●●●
●●
●
●
●
●●
●
●●●
●
●
●
●
●
●●●
●
●
●
●
●
●●●
●
●●
●
●●
●
●
●
●
●
●
●
●
●●
●●●●
●●
●●
●
●●●
●
●●●
●●●
●●
●
●●
●
●
●
●
●●●
●●●●
●
●
●
●●
●
●●
●
●
●
●
●
●●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●●●●
●
●●●●
●
●●
●
●●●
●●
●●●
●
●●
●
●●
●
●
●
●●●●●●
●●●
●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●
●●●●●●●●●●●●●●●●
●●●●●●●●●
●
●●●●●●●●●●●●●●●●●●●●●●●●●●●
●
●●●●
●●●●●●●●●●●●●●●●●●●●●●●●●●●●●
●
●
●●●●
●●●●●●●●●
●●
●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●●
●●
●●●
●●●●●●
●●●●●●●●●●●●●●●●
●
●●●●●●
●●●●●●●●●●●●
●●
●
●●●●●●●●●●
●
●●
●●●●●●●●●
●
●●
●●
●
●
−10 −5 0 5
−10
−5
05
PC1
PC
2
DHFRDIPPR
PC plot indicates 3 mainclusters
In all cases, 3 clusters isoptimal for both qualitymeasures
R-NN Curves for Diversity AnalysisLinking R-NN Curves to Cluster CountsDensity & Domain Applicability
Drawbacks
Currently works well for well separated clusters
But radial clusters will not be detected
We now consider all the points in the dataset
InefficientSampling experiments seem to indicate that we can make dowith 45% of the pointsCould prioritize by considering points wth low Rmax(S) values
Small changes in local density can mislead the algorithm
But this is somewhat subjective
R-NN Curves for Diversity AnalysisLinking R-NN Curves to Cluster CountsDensity & Domain Applicability
Domain Applicability for QSAR Models
What is it?
How similar should a new structure be to the training set toget a reliable prediction
Approaches
Descriptor based - distance from the centroid, leverage
Structure based - fingerprint similarity
Cluster based
Stanforth, R.W. et al., QSAR. Comb. Sci., 2007, 26, 837–844
Netzeva, T.I. et al., Altern. Lab. Anim., 2005, 33, 155–173
Sheridan, R.P. et al., J. Chem. Inf. Comput. Sci., 2004, 44, 1912–1928
R-NN Curves for Diversity AnalysisLinking R-NN Curves to Cluster CountsDensity & Domain Applicability
Domain Applicability for QSAR Models
One Domain or Multiple Domains?
It’s possible to encompass the entire training set into a singledomain
This is a very broad approach
Modeling approaches based on neighborhoods essentiallyconsider multiple, local, domains
Even for a global model, some observations are betterpredicted than others
Suggests that certain regions of a descriptor space arepredicted better than others
Guha, R. et al., J. Chem. Inf. Model., 2006, 46, 1836–1847
R-NN Curves for Diversity AnalysisLinking R-NN Curves to Cluster CountsDensity & Domain Applicability
Domain Applicability for QSAR Models
One Domain or Multiple Domains?
It’s possible to encompass the entire training set into a singledomain
This is a very broad approach
Modeling approaches based on neighborhoods essentiallyconsider multiple, local, domains
Even for a global model, some observations are betterpredicted than others
Suggests that certain regions of a descriptor space arepredicted better than others
Guha, R. et al., J. Chem. Inf. Model., 2006, 46, 1836–1847
R-NN Curves for Diversity AnalysisLinking R-NN Curves to Cluster CountsDensity & Domain Applicability
Local Descriptor Distributions
For local regression models,the descriptor distribution inthe neighborhood will affectpredictive ability
The effect is more severewhen methods such as kNNare used which do not doany interpolation
Nearest neighbors may notactually be nearby
PEOE_VSA.0.1 (TSET)
Den
sity
0 20 40 60 80 100 120
0.00
00.
005
0.01
00.
015
PEOE_VSA.0.1 (Neighborhood)
Den
sity
40 50 60 70 80 90 100
0.00
00.
010
0.02
0
PEOE_VSA.0.1 (Neighborhood)
Den
sity
40 50 60 70 80
0.00
0.02
0.04
0.06
PEOE_VSA_NEG (TSET)
Den
sity
50 100 150 200 250 300
0.00
00.
004
0.00
8
PEOE_VSA_NEG (Neighborhood)
Den
sity
130 140 150 160 170 180 190 200
0.00
00.
005
0.01
00.
015
0.02
0
PEOE_VSA_NEG (Neighborhood)
Den
sity
80 90 100 110 120
0.00
0.02
0.04
0.06
SlogP (TSET)
Den
sity
0 1 2 3 4 5 6
0.0
0.1
0.2
0.3
0.4
SlogP (Neighborhood)
Den
sity
1 2 3 4
0.0
0.1
0.2
0.3
0.4
0.5
SlogP (Neighborhood)
Den
sity
2.0 2.5 3.0
0.0
0.2
0.4
0.6
0.8
1.0
1.2
SlogP_VSA1 (TSET)
Den
sity
0 50 100 150
0.00
00.
005
0.01
00.
015
0.02
0
SlogP_VSA1 (Neighborhood)
Den
sity
30 40 50 60 70 80 90
0.00
0.01
0.02
0.03
0.04
SlogP_VSA1 (Neighborhood)
Den
sity
50 55 60 65 70
0.00
0.02
0.04
0.06
0.08
0.10
SMR_VSA4 (TSET)
Den
sity
0 20 40 60 80 100
0.00
0.01
0.02
0.03
0.04
SMR_VSA4 (Neighborhood)
Den
sity
10 15 20 25 30 35 400.
000.
010.
020.
030.
04
SMR_VSA4 (Neighborhood)
Den
sity
5 10 15 20 25 30 35
0.00
0.04
0.08
0.12
Guha, R. et al., J. Chem. Inf. Model., 2006, 46, 1836–1847
R-NN Curves for Diversity AnalysisLinking R-NN Curves to Cluster CountsDensity & Domain Applicability
Mapping Domain Applicability via Density
Query molecules that occurin dense regions of of thetraining set descriptor spaceare expected to be wellpredicted
Doesn’t penalize for being alittle far from the centroid ofthe space
R-NN Curves for Diversity AnalysisLinking R-NN Curves to Cluster CountsDensity & Domain Applicability
An Aqeuous Solubility Dataset
Estimate Std. Error t value
(Intercept) -2.1103 0.0909 -23.20PEOE RPC+1 1.8754 0.2437 7.70PEOE RPC-1 3.4954 0.1738 20.11SlogP -0.8199 0.0131 -62.55CASA 0.0008 0.0001 -9.83
RMSE = 0.7542 F (4, 1231) = 1993
1236 compounds
Evaluated 147 descriptors,reduced to 61
Searched for a good subsetusing a GA
●
●
●
●
●
●
●
●
●
●
●
●
●●●
●
●●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
● ●
●
●
●●
●
●
●
●
●
●
●
●
●●
●
●
●●
●
●
●
●●
●●
●
●
●
●
●
●●
●
●●
●
●●
●
● ●
●
●
●●
●
●
●
●
●
●
●
●
● ●●
●
●●
●●
●
●●
●
●
●
●
●
●●
●
●
●
●
●
●
●●
●●
●
●●
●●
●
●
●
●
●●
●
●●
●
●
●●
●
●
●
●
●
●
●
●
●●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●●
●
●
●
●●
●
●
●
●
●
●
● ●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●● ●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●● ●
●
●
●
●
●●
●
●
● ●
●
●
●
●
●
●
●●
●
●
●●
● ●
●
●
●
●
● ●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●●
●
●
●
●●
●
●
●
●●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
● ●
●●
●
●
●
●
●
● ●
●
●
●●
●
●
●
●●
●
●
●●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●●
●
●
●
●
●
●
●
●
●●
●
●●
●
●
●
●
●
●
●● ●
●
●
●
●●
●
●
●
● ●●
●●●
●
●
●
●●
●
● ●
●●
●●
●
●
●
●
●
●
●●
●
●
●
●
●●
●
●●
●
●
●
●
●●●
● ●
●
●●●●
●● ●
●●
●●
●
●
● ●
●
●●
●
●
●
●
●
●
●
●
● ●
●
●
●
●
●●●
●
●●
●●
●
●
●
●
●
●
● ●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●●●
●
●
●
●●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●●
●●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●●●
● ●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
● ●
●
●
●
●
●●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
● ●
●●
●
●
●
●
●
●
●
●
●
●
● ●
● ●●
●
●
●
● ●
●
●
●
●
●●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●●
●
●
●
● ●
●
●
●●
●●
●●
●●
●
●
●
●
●
●● ●
●
●
●
●●
●● ●
●● ●
●
●●
●
●●
●
●
●
●
●
●
●●
●
●
●
●
●
●
● ●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
● ●
●
●
●
●
●
●●
●
●
●
●●
●
●
●
●●
●
●●
●
●●
●● ●
●
●
●
●●
●
●●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●●
●
●
●
●●●
●●
●●●
●
●
●
●
●
● ●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
● ●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
● ●
●
●
●●
●
●
●
●
●
●
●
●
●●
●
●
●
●●
● ●●
●
●
●
●
●
●●
●
●
● ●●
●
●
●●
●●
●
●
●
●
● ●
●
●
●●
●
●
●●
●●
●
●
●
●
●
● ●
● ● ●●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●●
●
●
●
●
●
●
●
●
●
●
●
●
●●
● ●
●●
●●●
●
●
●
●●
●●
●● ●
●
●●
●●
●
● ●●
●●
●
●
●
●
●
●
● ●●
●
●●
●
●
●
●●●
●●
●
●●
●
●
●
●
●
●●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●●
●
● ●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●●
●
●
●
−10 −5 0
−10
−5
0
Observed Aq. Solubility
Pre
dict
ed A
q. S
olub
ility
●
●●●
●●
●●
●
●●
●
●
●●
●●●
●
●
●●●●●
●●●
●
●
●●
●
●
●
●
●
●●●
●
●●●●●
●●●●●●●●
●
●●●●●●●●●●●●●●
●
●●●●●●●●●●
●
●●●●●●●●●●●●●●●●●
●
●●●●
●
●●
●
●
●
●
●●●●●
●
●●
●
●●
●
●
●●●
●●●
●●
●●●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●●
●●●●
●
●
●
●
●
●
●
●
●●
●
●
●
●●
●●●
●
●●●
●
●
●
●
●●●●
●●
●
●
●●
●
●●
●
●
●●●
●
●
●
●●
●
●
●
●●
●●
●
●●
●
●●
●
●
●
●
●
●
●
●●●
●
●
●
●
●
●
●
●●
●●●
●
●
●
●
●
●
●
●
●●
●●
●
●
●●
●
●
●●
●
●●
●
●
●●
●
●
●
●●●
●
●
●
●
●
●●
●
●●
●
●
●
●
●
●
●
●●●
●●●
●
●
●
●
●
●
●
●●
●
●
●
●
●●●●●
●●
●●●●
●
●
●
●
●
●
●
●
●●
●●
●●
●
●
●
●
●
●
●●
●
●●
●
●
●
●
●●
●
●
●●
●
●
●●
●
●
●●
●
●
●●
●
●
●●●
●
●
●
●
●●
●
●●
●
●
●
●
●●
●●
●
●
●●●●●
●
●
●
●
●
●
●
●
●
●
●●
●●
●●
●
●
●
●
●
●
●●●●●●●
●
●
●
●
●
●●●●●●
●
●
●●
●
●●●●
●
●
●
●
●
●
●●
●
●
●
●
●●
●
●●●●●
●
●
●
●●
●●●●●●●●●●●
●●
●●
●
●●●●●
●
●
●
●
●●●●●●●●●●●
●
●
●
●
●●●●●
●
●
●
●
●
●●
●●
●
●
●
●
●●
●
●
●●
●
●
●●
●●
●●●●
●●●●
●
●●●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●●
●
●●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●●
●
●●
●●
●
●
●
●●
●
●
●
●●●●●
●
●●
●●
●
●
●
●
●
●
●
●●●
●●●
●
●●●●
●
●
●
●
●●●●●
●●
●●●●●●●●
●
●●●●●
●●
●●
●●
●
●●
●●●
●
●
●
●
●
●
●●●●●●●●●
●
●
●
●
●
●
●●
●●●
●●
●
●
●
●
●
●
●
●●
●
●●●●
●
●
●
●
●●
●
●
●
●
●●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●●
●
●●●●
●
●●●
●●●●
●●
●
●
●
●●●
●
●
●●●
●●
●●
●
●●●
●●●●●●
●
●
●
●
●●
●
●
●
●
●
●●
●
●
●
●
●●●●
●●●●●●●
●
●
●
●
●
●
●
●●
●
●●
●●●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●●
●
●
●
●
●
●●●
●
●●
●
●
●●●
●●●●
●
●
●
●●●●●●●
●
●●
●
●
●
●
●
●
●
●●
●
●
●
●●
●
●
●
●
●●
●
●
●●
●
●
●●
●
●
●●●●●●●●●●●●●●●●●●●●●●●●●
●
●●●●●●●
●●
●●
●
●
●
●
●
●●
●
●
●
●
●●●●●
●
●
●●●
●
●●●●●●●
●
●
●
●●
●
●
●
●●●●●●
●
●●●●●
●
●
●
●●●
●
●●●
●
●
●●
●●●
●
●●●●
●
●●●●●●●●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●●
●●●●●
●●●
●●
●
●
●
●
●●●
●●●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●●
●●
●
●●●●
●
●●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●●●
●
●●●●
●
●●
●●
●
●
●
●
●
●
●
●
●
●●
●
●
●●
●
●
●
●
●
●
●
●●
●
●
●
●
●●●
●●
●
●
●
●●
●●
●
●
●
●
●
●
●●
●
●
●
●●
●
●
●
●●●
●●●●
●
●
●
●
●●●
●
●
●
●
●
●
●●
●●
●
●
●
●●●
●
●
●
●●
●
●
0 200 400 600 800 1000 1200
020
4060
8010
0
Index
Rm
ax(S
)
Huuskonen, J., J. Chem. Inf. Comput. Sci., 2000, 40, 773–777
R-NN Curves for Diversity AnalysisLinking R-NN Curves to Cluster CountsDensity & Domain Applicability
An Aqueous Solubility Dataset
●
●
●
●
●
●
●
●
●
●
●
●
●●●
●
●●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
● ●
●
●
●●
●
●
●
●
●
●
●
●
●●
●
●
●●
●
●
●
●●
●●
●
●
●
●
●
●●
●
●●
●
●●
●
● ●
●
●
●●
●
●
●
●
●
●
●
●
● ●●
●
●●
●●
●
●●
●
●
●
●
●
●●
●
●
●
●
●
●
●●
●●
●
●●
●●
●
●
●
●
●●
●
●●
●
●
●●
●
●
●
●
●
●
●
●
●●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●●
●
●
●
●●
●
●
●
●
●
●
● ●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●● ●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●● ●
●
●
●
●
●●
●
●
● ●
●
●
●
●
●
●
●●
●
●
●●
● ●
●
●
●
●
● ●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●●
●
●
●
●●
●
●
●
●●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
● ●
●●
●
●
●
●
●
● ●
●
●
●●
●
●
●
●●
●
●
●●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●●
●
●
●
●
●
●
●
●
●●
●
●●
●
●
●
●
●
●
●● ●
●
●
●
●●
●
●
●
● ●●
●●●
●
●
●
●●
●
● ●
●●
●●
●
●
●
●
●
●
●●
●
●
●
●
●●
●
●●
●
●
●
●
●●●
● ●
●
●●●●
●● ●
●●
●●
●
●
● ●
●
●●
●
●
●
●
●
●
●
●
● ●
●
●
●
●
●●●
●
●●
●●
●
●
●
●
●
●
● ●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●●●
●
●
●
●●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●●
●●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●●●
● ●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
● ●
●
●
●
●
●●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
● ●
●●
●
●
●
●
●
●
●
●
●
●
● ●
● ●●
●
●
●
● ●
●
●
●
●
●●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●●
●
●
●
● ●
●
●
●●
●●
●●
●●
●
●
●
●
●
●● ●
●
●
●
●●
●● ●
●● ●
●
●●
●
●●
●
●
●
●
●
●
●●
●
●
●
●
●
●
● ●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
● ●
●
●
●
●
●
●●
●
●
●
●●
●
●
●
●●
●
●●
●
●●
●● ●
●
●
●
●●
●
●●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●●
●
●
●
●●●
●●
●●●
●
●
●
●
●
● ●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
● ●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
● ●
●
●
●●
●
●
●
●
●
●
●
●
●●
●
●
●
●●
● ●●
●
●
●
●
●
●●
●
●
● ●●
●
●
●●
●●
●
●
●
●
● ●
●
●
●●
●
●
●●
●●
●
●
●
●
●
● ●
● ● ●●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●●
●
●
●
●
●
●
●
●
●
●
●
●
●●
● ●
●●
●●●
●
●
●
●●
●●
●● ●
●
●●
●●
●
● ●●
●●
●
●
●
●
●
●
● ●●
●
●●
●
●
●
●●●
●●
●
●●
●
●
●
●
●
●●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●●
●
● ●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●●
●
●
●
−10 −5 0
−10
−5
0
Observed Aq. Solubility
Pre
dict
ed A
q. S
olub
ility
●
● ●
●
●●
●
●
●
●●
●
●
●
●
●●
●
●
●
● ● ● ● ●
● ● ●
●
●
●●
●
●
●
●
●
●● ●
●
●●
●●●
● ● ● ● ●
●●
●
●
●●●●●
●●
●
●
●
●
● ● ●
●
● ●● ●● ●●● ● ●
●
●●●
●● ● ●● ● ● ●●●● ●●●
●
●●
●●
●
●●
●
●
●
●
●
●●
●●
●
●●
●
●●
●
●
●
●●
●
● ●
●●
●●●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
● ●
● ●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●●●
●
●
●
●
● ●● ●
●●
●
●
●●
●
●●
●
●
●●●
●
●
●
●●
●
●
●
●●
●●
●
●
●
●
● ●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●●
●
●●
●
●
●
●
●
●
●
●●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●●
●
●
●●
● ●
●
●
●
●
●
●
●
●
●
●
●●
●●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●●
●
●
●●
●
●
●
●
●
●
●●
●
●
● ●
●
●
●●
●
●
●
●
●
● ●
●
●●
●
●
●
●
●●
●●
●
●
●●●
● ●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●●
●
●
●
●
●
●
●
●●
● ● ●●
●
●
●
●
●
●● ●●
●●
●
●
●
●
●
● ● ●
●
●
●
●
●
●
●
●●
●
●
●
●
●●
●
●● ● ●●
●
●
●
●
●
●
●●
●●●
●
●● ●●
●
●
●●
●
● ●●●●
●
●
●
●
●●
●●
●●●●●● ●
●
●
●
●
●
●●● ●
●
●
●
●
●
●●
●
●
●
●
●
●
●●
●
●
●●
●
●
●●
●●
●●●●
●●
●●
●
●●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
● ●
●
●
●
●
●
●●
●
●
●
● ●● ●●
●
●●
●●
●
●
●
●
●
●
●
●
●●
●
●●
●
●
●●●
●
●
●
●
●
●
●
● ●
●
●
● ● ● ● ● ●●●
●
●
●●
●●
●
●
●
●
●
●
●
●●
●●
●
●
●
●
●
●
●
●
●
● ●● ●●●●
●
●
●
●
●
●
● ●
●
●
●
●●
●
●
●
●
●
●
●
● ●
●
●
●
●●
●
●
●
●
●●
●
●
●
●
●●
●
●
● ●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●●
●
●
●
●●
●● ●●
●●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●●●
●
●
●
●
● ●
●
●
●
●
●
●
●
●
●
●
●
●●
●●
●● ●
●●
● ●
●
●
●
●
●
●
●
●
●
●
●
●
●
● ●
●
● ●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
● ●●
●
● ●
●
●
● ● ●
●● ●●
●
●
●
●
●
●
●
● ●●
●
●●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●●●●●●● ●●
●●●●● ● ●●●● ●● ●● ●●
●
●●●●● ●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●● ●●●
●
●
● ●●
●
● ● ●●●
●●
●
●
●
●●
●
●
●
●● ●● ●●
●
● ● ● ● ●
●
●
●
●
●●
●
●●
●
●
●
● ●
●
●●
●
● ● ●●
●
●●●
●●
●
● ●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●●
●●
●
●●
●●
●
●●
●
●
●
●
●●●
●
●●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
● ●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●●●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
●
●●●
●●
●
●
●
●●
●
●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
● ●●
●
●
●●
●
●
●
●
●● ●
●
●
●
●
●
●
●●
●●
●
●
●
●
●●
●
●
●
●●
●
●
0 1 2 3 4 5
020
4060
8010
0Absolute Studentized Residual
Rm
ax(S
)
Isolated compounds can be predicted well
The Rmax(S) values could be calculated more rigorously from asmoothed curve
R-NN Curves for Diversity AnalysisLinking R-NN Curves to Cluster CountsDensity & Domain Applicability
Mapping Domain Applicability via Density
Caveats
The approach assumes that similar compounds will havesimilar activities
This is not always true (activity cliffs)
Correlation between density as measured by Rmax(S) andprediction accuracy needs more investigation
Take into account descriptor importance via weightedEuclidean distances
Maggiora, G.M., J. Chem. Inf. Model., 2006, 46, 1535–1535
R-NN Curves for Diversity AnalysisLinking R-NN Curves to Cluster CountsDensity & Domain Applicability
Summary
R-NN curves . . .
Simple way to characterize spatial distributions and identifyoutliers
Applicable to datasets of arbitrary dimensions and size, viaapproximate NN algorithms such as LSH
Summarizing a dataset does not require user-definedparameters
. . . Clustering
Provides an approach to a priori identification of the numberof clusters, avoiding trial and error
Appears to be more reliable than the silhouette width
Probably not useful for hierarchical clusterings
R-NN Curves for Diversity AnalysisLinking R-NN Curves to Cluster CountsDensity & Domain Applicability