Journal of Computer Aided Chemistry, Vol.18, 124-142 (2017)  ISSN 1345-8647
Copyright 2017 Chemical Society of Japan

Small Random Forest Models for Effective Chemogenomic Active Learning

Christin Rakers [1], Daniel Reker [2], J.B. Brown [3]*

[1] Institute of Transformative bio-Molecules (WPI-ITbM), Nagoya University, Furo-cho, Chikusa-ku, Nagoya 464-8602, Japan
[2] Koch Institute for Integrative Cancer Research, Massachusetts Institute of Technology (MIT), 500 Main St, Cambridge, MA 02139, USA
[3] Life Science Informatics Research Unit, Laboratory for Molecular Biosciences, Kyoto University Graduate School of Medicine, Kyoto 606-8501, Japan

(Received April 21, 2017; Accepted July 11, 2017)

The identification of new compound-protein interactions has long been the fundamental quest in the field of medicinal chemistry. With increasing amounts of biochemical data, advanced machine learning techniques such as active learning have proven beneficial for building high-performance prediction models upon subsets of such complex data. In a recently published paper, chemogenomic active learning was applied to the interaction spaces of kinases and G protein-coupled receptors featuring over 150,000 compound-protein interactions. Prediction models were actively trained based on random forest classification using 500 decision trees per experiment. In a new direction for chemogenomic active learning, we address the question of how forest size influences model evolution and performance. In addition to the original chemogenomic active learning findings that highly predictive models could be constructed from a small fraction of the available data, we find here that model complexity, as viewed by forest size, can be reduced to one-fourth or one-fifth of the previously investigated forest size while still maintaining reliable prediction performance. Thus, chemogenomic active learning can yield predictive models with reduced complexity based on only a fraction of the data available for model construction.

Key Words: Virtual screening, chemogenomics, computational chemistry, active learning, drug discovery, random forest

* [email protected]



1. Introduction

In search of new and potent drug candidates, continuous advancements in high-throughput screening technologies have brought forth large amounts of experimental data that characterize compound-protein interactions (CPIs). Despite increasing efforts to identify new drugs and increased spending by the pharmaceutical industry, all while producing ever more data, the average number of new drugs brought to market is currently declining [1,2]. Nevertheless, these complex, human-intractable databases of molecular interactions might hold the key to accelerating drug discovery. Data mining approaches such as machine learning (ML) represent valuable computational tools to rationalize given activity spaces and derive structure-activity relationships (SARs) that can be used to predict new desired endpoints (e.g. molecules, targets, or interactions) [3,4]. Traditionally, computational studies that explore SARs focus on leveraging information about either small molecules (ligand-based approaches) [5-9] or protein structures (receptor-based approaches) [10,11]. By combining these two worlds, chemogenomic (or proteochemometric) methods explore interaction spaces based on joint compound-protein descriptors and extrapolate on both target and chemical spaces while extending the model's applicability domain [12-17].

Chemogenomically-derived prediction models not only allow for "one-sided" prediction of novel compounds or targets, but also for prediction of compound-protein interactions (CPIs) that are absent in the training matrix, and prediction of pairs of ligands and targets both outside of the training data [16,18]. Potential application areas of chemogenomic approaches therefore also include assessment of target selectivities, receptor deorphanization, and drug repurposing. The slow but steady increase in retro- and prospective studies using chemogenomic methodologies hints at their utility and benefit for different applications in medicinal chemistry and chemical biology [14,19-23]. Nevertheless, the sheer data volume, sparseness, and complexity of the compound-protein matrix often necessitate complex chemogenomic machine learning approaches. Advanced approaches, such as (deep layered) neural networks, and computationally sophisticated architectures, including GPU computing, are harnessed to manage and incorporate the ever-increasing amount of available data.

In recent years, active learning (AL), a concept adapted to the field of medicinal chemistry approximately 15 years ago [24], has re-attracted interest in the drug discovery community [25,26]. Based on actively updated, feedback-driven model training, AL approaches capitalize on selection strategies that guide iterative sample selection [25,26]. A prominent strategy employed for active model training utilizes an explorative approach driven by prediction uncertainty (a "curious" sample selection strategy). Here, learning is enforced by selecting training samples that yield maximum prediction ambiguity, under the expectation that such samples ultimately add knowledge to the model [26]. To date, several pro- and retrospective studies report successful applications of AL strategies throughout different fields of research [27-46]. In the area of drug discovery, active learning has been shown to efficiently derive high-performance prediction models based on small subsets of input data [32,34,39,46]. Furthermore, actively trained models not only reached significantly higher hit rates compared to experimental standards, which frequently remain below 1% in cases of unbiased chemical libraries [34,39,40,47-49], but such models also contributed to the successful identification of novel bioactive compounds [33,36,42] and cancer rescue mutants of p53 [31]. Overall, AL approaches bear the potential to improve drug discovery processes by increasing hit rates, reducing the amount of time- and cost-intensive experimentation, and accelerating hit-to-lead processes through integration into a feedback-driven experimentation workflow [33,36,42,43,50].

In a recently published study, Reker et al. applied an active learning strategy to family-wide interaction spaces including more than 150,000 interaction data points between 308 biomolecular targets and approximately 100,000 ligands [50]. The authors set out to investigate the applicability and utility of active learning toward reducing the amount of data necessary for constructing highly predictive, target family-wide chemogenomic models. Comparing active learning selection strategies in terms of resulting model performances, characterized by Matthews correlation coefficients (MCC) [51], calculations showed that a "curious" (explorative) selection strategy based on maximum prediction variance outperformed random and greedy (exploitive) selection [50] (see Methods below for the formula and interpretation of MCC). Further, the curious AL strategy achieved highly predictive performances using less than a quarter of the available data points, thereby indicating chemogenomic active learning's beneficial applicability for drug discovery and chemical biology screening approaches. This further validates the generalized active learning concept as an adaptive data sampling technique to reduce the training data.


Apart from the selection strategy, the machine learning models and their associated parameters influence learning behavior and predictive performance. Interestingly, these parameters can also directly affect the resulting model complexity as well as the computational costs of learning and prospective prediction. In this article, the influence of random forest size, i.e. the number of decision trees, on prediction model performance was explored (see Figure 1 for concept and workflow). The interaction spaces of G protein-coupled receptors (GPCRs) and kinases were actively modeled with the curious selection strategy, with the number of trees per experiment treated as the key variable. Constructed prediction models were assessed regarding model performance (MCC), the evolution of picked samples, and stopping criteria for model training.

2. Materials & Methods

Datasets

Active learning approaches for classification were based on datasets from a previously published study [50]; their attributes are summarized in the following. The datasets covered the ligand-target interaction space of human GPCRs and kinases (sources: the GPCR-specific GLASS database [52], and ChEMBL GPCR SARfari 3 and Kinase SARfari 5 [53]). Activities were given as IC50 values for Kinase SARfari 5 and as Ki values for the two GPCR datasets. The activity threshold for CPIs was set to 100 nM or stronger. Non-CPIs were defined by setting a threshold of 10 µM or weaker for Kinase SARfari 5 and GPCR SARfari 3, and a threshold of 1 µM or weaker for GPCR GLASS. Contradictions and data points between these ranges were discarded. The numbers of records retained and discarded are given in Table 1. Note that it is possible for a given compound-protein pair to have multiple supporting records.

From these databases, interaction pairs, i.e. compound-protein interactions (CPIs) and non-interactions (non-CPIs), were extracted (Figure 1, Table 1). Data preparation resulted in three datasets containing 39,706, 47,602, and 69,960 compound-protein pairs (CPIs and non-CPIs) with 48%, 82%, and 71% positive (strong) interactions (CPIs) for Kinase SARfari 5, GPCR SARfari 3, and GPCR GLASS, respectively. The datasets covered the protein space of 98 kinases and 100 (GPCR SARfari 3) or 110 (GPCR GLASS) GPCRs.

Descriptors for active learning modeling were generated by translating compound structures into circular topological extended connectivity fingerprints (ECFP) [54] using a radius of 4 and a bit length of 4096 (OpenEye OEChem library). Target descriptors (of length 400) were generated by vectorizing the dipeptide frequencies of the amino acid sequences. The input for a compound-protein pair is the concatenation of the two descriptors together with its (non-)interaction status.
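The protein-descriptor portion can be sketched in pure Python. This is an illustrative reconstruction, not the original code: the paper generated the 4096-bit ECFPs with the OpenEye OEChem library (not reproduced here), and normalizing dipeptide counts to frequencies is an assumption consistent with the length-400 descriptor described above.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# All ordered dipeptides over the 20 standard amino acids: 20 * 20 = 400 entries
DIPEPTIDES = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]

def dipeptide_frequencies(sequence):
    """Return the length-400 vector of dipeptide frequencies for a protein sequence."""
    counts = {dp: 0 for dp in DIPEPTIDES}
    for i in range(len(sequence) - 1):
        dp = sequence[i:i + 2]
        if dp in counts:  # skip windows containing non-standard residues
            counts[dp] += 1
    total = max(len(sequence) - 1, 1)
    return [counts[dp] / total for dp in DIPEPTIDES]
```

A compound-protein input would then be the 4096-bit fingerprint list concatenated with this 400-element vector, giving a 4496-dimensional descriptor.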

For the pair of GPCR datasets, only two targets had identical FASTA sequences (Supplementary Figure S6). However, most sequences in one database had a highly homologous counterpart in the other database, as validated by clustering protein pairwise sequence similarity values derived from the Local Alignment Kernel (Supplementary Figure S7) [55]. 14,195 ligands had identical full InChI string representations in the two datasets (Supplementary Figure S6).

Table 1: Dataset statistics. The ratio of active records is similar to the ratio of active CPIs.

  Data source                     KS5            GS3            GLASS
  Bioactivity type                IC50           Ki             Ki
  Threshold for interaction       100 nM         100 nM         100 nM
  Threshold for non-interaction   10 µM          10 µM          1 µM
  Interaction records             25118 (52%)    49622 (84%)    130483 (76%)
  Non-interaction records         22881          9690           40951
  Discarded records               32567          40539          57893
  Number of CPIs                  19231 (48%)    39166 (82%)    49815 (71%)
  Number of non-CPIs              20475          8436           20145
  Number of dual-class targets    98 (100%)      99 (99%)       82 (75%)

Figure 1. Overview of chemogenomic active learning. Abbreviations: CPIs = compound-protein interactions; non-CPIs = compound-protein non-interactions; MCC = Matthews correlation coefficient.


Active learning methodology

Active learning strategies based on random forest classification using the Python library scikit-learn [56] were applied to the interaction spaces of GPCRs and kinases as detailed in the previous section, identical to the protocol of Reker et al. [50]. The curious selection strategy steers instance picking per iteration toward the most "interesting" or controversial candidates, i.e. those with the largest disagreement among tree predictions (maximum prediction variance; uncertainty-based selection) [50].
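A minimal sketch of variance-based (curious) instance picking, assuming the per-tree binary votes for the candidate pool are available as a matrix; the function name and data layout are illustrative, not taken from the original code:

```python
import numpy as np

def curious_pick(tree_votes):
    """Select the most controversial candidate for labeling.

    tree_votes: array of shape (n_trees, n_candidates) holding each tree's
    binary class vote (0 = non-CPI, 1 = CPI) for each unlabeled candidate.
    Returns the index of the candidate with maximum prediction variance,
    i.e. the largest disagreement among the trees.
    """
    variance = tree_votes.var(axis=0)  # for binary votes this is p * (1 - p)
    return int(np.argmax(variance))
```

In an AL loop this pick would be labeled, moved into the training set, and the forest retrained before the next iteration.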

The forest size, i.e. the number of trees (ntrees), was varied through the scikit-learn constructor call RandomForestClassifier(n_estimators=X). Decision trees were not pruned for depth. The maximum number of features considered at a node split in a tree was the square root of the total number of features available, identical to the previous protocol [50]. Hence, each decision tree was constructed by considering as many as floor(sqrt(4096+400)) = 67 randomly selected features at a node. Each dataset was modeled using 500, 150, 100, 50, 25, 10, 5, 4, 3, 2, and 1 trees as a standard set for comparison. Additional AL models with larger forest sizes were calculated using ntrees of 1000 for GPCR SARfari 3, 2000 and 1000 for GPCR GLASS, and 1000, 300, 250, and 200 for Kinase SARfari 5.
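The forest construction described above can be sketched with scikit-learn as follows; the helper name is illustrative, and max_features="sqrt" reproduces the square-root rule (unpruned trees are the scikit-learn default):

```python
import math
from sklearn.ensemble import RandomForestClassifier

N_FEATURES = 4096 + 400  # ECFP bits + dipeptide descriptor length

def make_forest(n_trees):
    """Unpruned random forest considering sqrt(n_features) candidate
    features per node split, as in the protocol of Reker et al. [50]."""
    return RandomForestClassifier(n_estimators=n_trees, max_features="sqrt")

# The square-root rule yields floor(sqrt(4496)) = 67 candidates per split
assert math.floor(math.sqrt(N_FEATURES)) == 67
```

Only n_estimators is swept across the experiments (500 down to 1, plus the larger control sizes); all other parameters stay fixed so that forest size is the sole variable.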

In addition, control experiments using random picking per iteration, a subsampling-based selection strategy, were performed; for random sampling, the forest size parameter was left at 500, in accordance with Reker et al. [50]. In all experimental setups, 10 executions of modeling and evaluation were performed to assess aggregate performance.

Assessment of model performance and stopping criteria

The performance of actively trained classification models was assessed by calculating the Matthews correlation coefficient, MCC, per iteration. The MCC ranges between -1 (inverse classification) and 1 (perfect classification), and is defined as

MCC = (TP×TN - FP×FN) / √((TP+FP)(TP+FN)(TN+FP)(TN+FN)),   (1)

using the confusion matrix counts of true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN). Values above 0.6 signal moderate predictive ability, and values above 0.8 signal strong predictive ability. The MCC was chosen as the performance metric due to its superior reliability on imbalanced data sets compared to, for example, the accuracy metric (TP+TN)/(TP+TN+FP+FN). Further, previous work demonstrated the potential deception in result evaluation for chemogenomic modeling that can occur as a result of evaluating only true positive or true negative prediction rates [50]. For comparison, the false positive (FPR), false negative (FNR), true positive (TPR), and true negative (TNR) rates were also calculated over the course of iterations.
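Equation 1 translates directly to code. A minimal sketch follows; the convention of returning 0.0 when the denominator vanishes (any marginal count is zero) is an assumption not stated in the text:

```python
import math

def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient from confusion-matrix counts (Eq. 1).

    Returns a value in [-1, 1]; by convention 0.0 is returned when the
    denominator is zero (i.e. when the MCC is formally undefined).
    """
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0
```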

In addition to MCC curves, the evolution of picked CPIs and non-CPIs per iteration was monitored, and the ratio of CPI to non-CPI picks was calculated in order to track the relationship between forest size, selection ratio, and model performance.

To address the question of when to stop model training, a previously established stopping criterion [50] was employed that allows calculation of the number of iterations and corresponding MCC for a given local speed of learning, i.e. the slope (derivative) of the MCC-iteration curve, which is retrieved by fitting the MCC curves to exponential decay functions. Here, stopping criteria for dMCC/dIter equal to 1 and 0.8 were assessed for all executions of active learning. Results from standardized experiments are discussed below and visualized in a dataset-wise fashion in Figures 2 to 4. Additional experiments with deviating numbers of trees are given as supplementary information.
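As a sketch of the stopping criterion, assume an exponential-decay fit of the form MCC(t) = a - b·exp(-c·t); the exact parameterization used in the original work is not given here, but for any such fit the derivative can be solved in closed form for a target slope:

```python
import math

def stop_iteration(b, c, slope):
    """Iteration at which learning slows to a target rate.

    For a fitted curve MCC(t) = a - b*exp(-c*t), the learning speed is
    dMCC/dt = b*c*exp(-c*t). Solving b*c*exp(-c*t) = slope for t gives
    the stop iteration for that slope.
    """
    return math.log(b * c / slope) / c

def mcc_at(a, b, c, t):
    """MCC value of the fitted curve at iteration t."""
    return a - b * math.exp(-c * t)
```

In practice a, b, and c would come from a least-squares fit of the observed MCC-iteration curve (e.g. via scipy.optimize.curve_fit), after which stop_iteration and mcc_at yield the boxplot values shown in panels (c) and (d) of Figures 2-4.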

Statistical analysis was performed by assessing the distributions of MCC at the stopping criteria for pairs of forest sizes. Though we have previously shown the normality of the MCC distribution [50] for both curiosity-driven and randomly-picked chemogenomic active learning, we nonetheless executed the statistical analyses with both the parametric Welch t-test and the non-parametric Wilcoxon rank-sum test, as implemented in SciPy [57]. Hyperbolic and sigmoidal curve fitting of prediction performance as a function of forest size and iteration was performed using GraphPad Prism version 7.0 (GraphPad Software, La Jolla, California, USA, www.graphpad.com).

Execution times of active learning runs were assessed using the following computational host: Xeon E5-2697v4 18-core CPU x2, 16 threads per execution, ECFP-dipeptide descriptors for GPCR SARfari 3. Datasets were loaded into memory prior to learning, and the input time was subtracted from the execution wall time.

3. Results

Rooted in the methodology established in the previous investigation [50], the initial number of trees was set to 500 (control experiments), and active learning with varying numbers of trees was performed. Due to the strongly similar model performance of the 500-, 300-, 250-, and 200-tree models for the Kinase SARfari 5 dataset (Supplementary Figure S3), active learning of the GPCR interaction spaces (GPCR SARfari 3 and GLASS) was executed with a standardized set of forest sizes: 150, 100, 50, 25, 10, 5, 4, 3, 2, and 1. Further experiments with larger numbers of trees are provided in Supplementary Figures S1 to S3.

Influence of forest size on active learning performance

Figure 2. Active learning results for the GPCR GLASS dataset. a) Evolution of prediction model performances as MCC per iteration (10 executions per picking strategy and forest size). b) Ratio of counts of CPI to non-CPI samples picked per iteration during prediction model development with varying numbers of trees. c,d) Boxplot evaluation of stopping criteria applied to prediction models with varying numbers of trees (x-axis) based on solutions (MCC and iteration values) for the slopes dMCC/dIter = 1 and 0.8. The horizontal gray line represents the solution average for the random picking strategy with 500 trees. e) MCC values from stopping criterion application were analyzed using the parametric Welch t-test and non-parametric Wilcoxon rank-sum test. Values in the heatmaps are the p-values of the tests, where the null hypothesis is that there is no difference in the mean MCCs of a pair of forest sizes. The average MCC was statistically significantly different between different forest sizes. Colors in panels (a) and (b) are indexed according to the color key in the lower right.


Per pair of dataset and forest size, ten executions of model construction and prediction performance assessment were performed. Model performance was assessed by calculating MCC values per iteration for all experiments (Figures 2a, 3a, and 4a). The overall shape of the MCC curves and the average performances vary between the three datasets due to inherent dataset variabilities (e.g. the numbers of CPIs and non-CPIs, the complexity of the underlying structure-activity relationship, and the potential for target bias). Overall, for forests of 25 or more trees, the curious selection strategy outperforms random selection and prediction using 500 trees. As can be seen in panel (a) of Figures 2-4, no practical loss of performance was incurred for ntrees of 500, 150, and 100. For the SARfari datasets, MCC values of 0.6 were reached within 1,000 iterations, and MCCs of 0.8 within 2,000 and 3,000 iterations for Kinase SARfari 5 and GPCR SARfari 3, respectively (Figures 3a and 4a). The best-performing models for GPCR GLASS (ntrees = 500, 150, 100) reached MCC values of 0.6 within 3,000 iterations (Figure 2a). Using the curious active learning strategy, prediction models for all datasets (ntrees = 500, 150, 100) achieved MCC values of 0.6 within 2.1%, 2.5%, and 4.3% of the GPCR SARfari 3, Kinase SARfari 5, and GPCR GLASS datasets, respectively. This fully corroborates the previously reported efficiency of active learning for subset selection on chemogenomic datasets [50].

Figure 3. Active learning results for the GPCR SARfari 3 dataset. Panels are analogous to Figure 2.

Assessing the influence of forest size on model performance, comparison of the MCC curves indicates a slower speed of learning (shallower initial slopes) for decreased numbers of trees, and smaller forests generally converge on lower horizontal asymptotes of the curve functions (in the range of 5,000 iterations). For ntrees ≤ 50, an increased number of iterations, i.e. more model training, was required to reach reliable MCC values of 0.6 or 0.8. Across all datasets, the MCC curves indicate that model training with 25 trees generally necessitates almost twice the amount of data to reach MCC values of 0.6 or 0.8 compared to models trained with 500 trees. For example, an MCC of 0.6 at 2,500 iterations is achieved for the GLASS dataset with 500 trees, whereas a forest of 25 trees required approximately 5,000 iterations (Figure 2a). Though it is not difficult to conceive the hypothesis that more trees could boost predictive performance, actively trained models with 1,000 and 2,000 decision trees showed practically identical performance to models of 500 trees, suggesting that the per-dataset prediction performance is possibly already saturated at 500 trees and no further model complexity is necessary to fit the data. The same saturation effect was found for all three datasets tested (Supplementary Figures S1 to S3).

Figure 4. Active learning results for the Kinase SARfari 5 dataset. Panels are analogous to Figure 2. Critically, we see an effect of forest size on model performance evolution in panel (a) despite the balanced selection ratios shown in panel (b).

From an alternative viewpoint, the MCC increases over different datasets and different iteration thresholds seem to follow a sigmoidal development (Supplementary Figure S4). In addition, model performances were assessed by calculating the classification metrics of TNR (specificity), TPR (sensitivity), FPR, and FNR per training iteration for all three datasets (Supplementary Figure S5). For experiments with 500 trees, models based on the GPCR datasets showed high sensitivity (identification of true positives), while models based on kinases performed better at identifying true negatives (higher specificity or TNR). Our results support the hypothesis that prediction model performances should be evaluated holistically, and that focusing on a single metric which does not incorporate the full confusion matrix, such as the TPR, can be misleading. While MCC values give a balanced interpretation of the corresponding performance, investigating specificity and sensitivity can be insightful for understanding whether a model over- or underestimates the interaction potential. For example, the GPCR datasets have high ratios of actives, and the calculated TPR of the GPCR prediction models tended to over-estimate performance compared to the corresponding FPR and MCC.

Regarding the influence of varying the number of decision trees, our results showed that forests trained on GPCR data with five or fewer trees tend to be weaker at improving TNR and FPR over the training iterations. In contrast, forests with more than 25 trees showed significant improvements in identifying true negatives and reducing the number of false positives during training. For the GPCR prediction models, which were trained on inherently lower ratios of negatives (71% and 82% positives (CPIs)), becoming better at identifying true negatives correlated with improved MCC performance. Overall, these classification metric curves for all three datasets indicated superior performance with forests of ntrees > 25, suggesting that more complex models can improve model performance from different perspectives.

Figure 5. Influence of random forest complexity on active learning runtime. Experiments were run in triplicate using a Xeon E5-2697v4 18-core (36-thread) CPU x2 with 16 threads per execution.


Stopping criteria for active learning with varying forest sizes

In order to compare different learning experiments and to determine the point at which sufficient prediction performance has been achieved, or at which an increase in training rounds (iterations) would not significantly contribute to an increase in performance, a stopping criterion is needed. As previously shown [50], differentiating the fitted MCC curve functions and solving them for given slopes (dMCC/dIter) of 1 and 0.8 provides such an evaluation criterion. The solutions for the given derivatives provide the corresponding MCCs and iteration numbers at the given rates of learning (panels (c) and (d) in Figures 2-4).

For all three datasets, the resulting MCC values decrease with decreasing numbers of trees. The corresponding stop iteration values decrease, slightly increase, and remain relatively equal for the GPCR GLASS, GPCR SARfari 3, and Kinase SARfari 5 datasets, respectively.

Analyzing the MCC results by the Welch and Wilcoxon tests, we find that the probability of equivalent means for each pair of forest sizes is almost always less than 0.05, which leads us to reject the null hypotheses of these tests that the means are equal. Thus, each increasingly large forest size contributes some level of gain in prediction performance in a statistically reproducible manner.

While statistically significant, the difference between the MCC means of the groups from 500 to 100 trees is relatively small, indicating a very minor loss in model performance when reducing complexity to 100 trees. Reducing beyond 100 trees, the interval between means increases.

Balance in sample selection during active learning with varying random forest size

Another aspect influencing prediction model development is the ratio of positives (CPIs) to negatives (non-CPIs) picked per iteration. The sampled CPI and non-CPI picks per iteration were calculated for all experiments using the curious selection strategy, and the ratios of selected CPIs to non-CPIs are visualized in Figures 2b to 4b and Supplementary Figures S1b to S3b. In the preceding study [50], it was validated that the original training dataset biases (71%, 82%, and 48% CPIs for GPCR GLASS, GPCR SARfari 3, and Kinase SARfari 5, respectively) were converged upon when using the random picking strategy as a subsampling method.

The curious selection strategy, on the other hand, maintains balanced (1:1) picking ratios of CPIs to non-CPIs per iteration regardless of the original training dataset ratios [50]. Here we could show that a reduction in the number of decision trees influenced the selection ratio of sampled positives to negatives per iteration for the two imbalanced GPCR datasets (Figures 2b and 3b). Control experiments (ntrees = 500) and experiments with 150 and 100 trees resulted in pick ratios of approximately 50%, but further reduction of the forest size led to progressive approximation of the underlying input ratios of GPCR GLASS and GPCR SARfari 3. In other words, the number of selected CPIs per iteration drifted toward the inherent dataset bias with decreasing numbers of decision trees. Apparently, the curious selection function's ability to sample skewed datasets in a balanced manner depends on sufficient model complexity.
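The pick-ratio monitoring described above can be sketched as a cumulative running ratio; this is an illustrative helper, not taken from the original code:

```python
def running_cpi_ratio(picked_labels):
    """Cumulative fraction of CPI picks (label 1) after each iteration,
    of the kind tracked in panels (b) of Figures 2-4. A balanced curious
    strategy keeps this near 0.5 regardless of the dataset's inherent bias."""
    ratios, cpis = [], 0
    for i, label in enumerate(picked_labels, start=1):
        cpis += label
        ratios.append(cpis / i)
    return ratios
```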

Figure 6. Assessment of the influence of random forest complexity on model performance (MCC) with regard to data usage for model training. The full datasets comprised 69,960, 47,602, and 39,706 compound-protein pairs for the GPCR GLASS, GPCR SARfari 3, and Kinase SARfari 5 datasets, respectively.


Balanced selection during active learning is a major, but not the only, factor in efficient learning of chemogenomic datasets

We showed that forest size (and hence complexity)

influences both the overall learning performance as well

as the ability of active learning to sample interactions and

non-interactions in a balanced manner. Taken together,

this suggests that a balanced selection is key for efficient

learning of the chemogenomic interaction space. Indeed,

for the GPCR datasets, the model performances (MCC

values) for different forest sizes (Figures 2a and 3a) seem

to correlate with the ratio of selected positives (CPIs) per

iteration (Figures 2b and 3b).

However, this trend was not detectable in the Kinase SARfari 5 dataset, which is inherently balanced; for all model complexities, including naïve random subsampling, picking ratios remained at about 50% (Figure 4b). Comparison of these findings further supports the hypothesis that the superior model performance of the curiosity-driven active learning strategy (relative to random selection) originates from the quality of the CPIs/non-CPIs picked for model training (e.g. sample position in the activity landscape) and not solely from the ratio of positives (CPIs) to negatives (non-CPIs).
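A curiosity criterion of the kind discussed here is commonly realized as uncertainty sampling over the forest's vote fractions. The following is a minimal sketch under that assumption, with synthetic data and a hypothetical helper `curiosity_pick`; it is not the authors' exact implementation.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def curiosity_pick(model, X_pool, batch_size=10):
    """Select the pool samples whose forest votes are most evenly split."""
    proba = model.predict_proba(X_pool)[:, 1]     # fraction of trees voting "CPI"
    uncertainty = np.abs(proba - 0.5)             # 0 = maximally uncertain sample
    return np.argsort(uncertainty)[:batch_size]   # indices of most uncertain samples

# toy demonstration with random descriptors (hypothetical data)
rng = np.random.default_rng(0)
X_train, y_train = rng.random((100, 16)), rng.integers(0, 2, 100)
model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_train, y_train)
picked = curiosity_pick(model, rng.random((500, 16)))
print(picked)
```

With very few trees the vote fractions are coarse (e.g. only 0, 0.5, 1 for two trees), which illustrates why small forests lose the ability to discriminate uncertainty finely and drift towards the dataset's inherent class balance.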

Smaller random forest models require

significantly less computational resources

Results suggest that a sufficiently large number of trees in a random forest can ensure optimal performance, both in terms of maximal MCC values and a balanced selection of interactions and non-interactions. However, smaller models with a reduced number of trees might be advantageous in terms of computational cost.

To test whether the models presented here indeed

show a noticeable difference in terms of the required

computational time and effort, we timed the active

learning campaigns for selected model sizes (ntrees = 500,

150, 50). Not only did we observe a marked and statistically significant decrease in wall time for smaller random forests, but the growth in computational time at later iterations also appeared linear rather than exponential for smaller forests, at least over the practically relevant number of iterations (Figure 5). It is important to

realize that the exact numbers and trends will depend on

the execution architecture, e.g. available compute

core/thread count versus threads used for computation,

dataset size, and forest size.
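The wall-time effect of forest size can be quantified in a few lines; this is a sketch with synthetic data and a single compute thread, not the benchmark used in this study, so absolute numbers will differ by architecture as noted above.

```python
import time
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(1)
X, y = rng.random((2000, 32)), rng.integers(0, 2, 2000)

timings = {}
for ntrees in (500, 150, 50):
    start = time.perf_counter()
    RandomForestClassifier(n_estimators=ntrees, n_jobs=1, random_state=0).fit(X, y)
    timings[ntrees] = time.perf_counter() - start   # wall time in seconds

# on a single thread, training time scales with the number of trees
print(sorted(timings, key=timings.get))
```

In an active learning campaign this cost is paid at every iteration, so the per-fit savings of a smaller forest compound over thousands of iterations.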

Smaller models can still achieve good

performance by drawing from more data

To better understand the relationship between model

complexity and number of iterations necessary to achieve

a certain performance, we visualized heatmaps of MCC

values against those two parameters (Figure 6). These indicate that a trade-off indeed exists between the amount of data used and model complexity: simpler models require more data, while more complex models can learn from smaller datasets.

In practical computational drug discovery and

chemical biology applications, therefore, the tradeoff

plots allow a team to identify the proper production-level

parameters based on project-specific constraints (e.g.,

data availability, screening budget, permitted complexity,

or compute time).

4. Discussion

The advantage of random forests compared to single

decision trees is their superior prediction performance as

an ensemble and their stability towards data perturbations.

On the other hand, with increasing numbers of trees used

for random forest development, computational costs

increase. Therefore, the question of how many trees

should be used in a random forest approach becomes

worth considering. Despite the common practice of setting the number of trees to a default initial guess (generally between 100 and 500 [42,58-61]), a handful of

investigatory studies have been reported that address the

influence of tree numbers in a forest. One intuitive

assumption follows the concept of "the larger the better"

[62]. Nevertheless, Breiman's random forest introduction

in 2001 suggests the existence of an asymptotic limit for

the generalization error of a random forest which impedes

model performance improvement by simply adding more

trees [63], and we observed such limits for experiments

with very large forests (Supplementary Figures).

To date, several non-chemogenomic studies have

shown that the number of trees in a random forest could

be reduced to a certain degree without losing model

performance [64-68], and often rule-of-thumb

suggestions are made and accepted within the different

computational communities, which allow model training

without previous parameter optimization. To probe

whether parameter optimization might be an

advantageous step instead of accepting rule-of-thumb

guidelines, we investigated how chemogenomic active

learning performance changed when reducing the number

of trees. Our results suggest that model performance

follows a sigmoidal development, with little change in

performance for very small or very large forests

(Supplementary Figure S4).

A particularly novel finding in this report is our

discovery that the number of trees influences the sample

picking ratios of CPIs to non-CPIs during curiosity-driven

sample selection. The curiosity strategy that was applied

to actively train the prediction models generally steers the

sample selection towards balanced picking ratios of CPIs

to non-CPIs (1:1) for larger forests. As we gradually

decreased the forest size, the picking ratios shifted

towards inherent CPI to non-CPI ratios of the underlying

datasets (i.e. the number of picked CPIs in relation to

picked non-CPIs increased).

These findings lead us back to the question of how to

set the number of trees prior to model development.

Ideally, we would implement a minimized necessary

number of trees to allow optimal performance with

reduced computational complexity. In a study on random

forest parameter sensitivity, it was pointed out that the

choice for optimum parameters depends on the underlying

dataset and that parameters should therefore be tuned

data-set wise [69]. One way to determine parameters such

as ntree prior to experimentation is the application of

parameter estimators such as the grid search method in

scikit-learn which has recently been applied to drug-target

interaction prediction [56,70].
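A grid search over ntree with scikit-learn, as referenced above, might look as follows. This is a sketch with synthetic data; the grid values and MCC scoring choice are illustrative assumptions, not parameters from the cited studies.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import make_scorer, matthews_corrcoef

rng = np.random.default_rng(2)
X, y = rng.random((300, 16)), rng.integers(0, 2, 300)

# cross-validated grid search over forest size, scored by MCC
search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={"n_estimators": [25, 50, 100, 200]},
    scoring=make_scorer(matthews_corrcoef),
    cv=3,
)
search.fit(X, y)
print(search.best_params_["n_estimators"])
```

The cost of such a search is itself proportional to the sum of the candidate forest sizes times the number of folds, which is why rule-of-thumb defaults remain attractive in practice.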

Alternatively, based on the concept of trial and error, Boulesteix et al. suggest increasing the number of trees

until convergence on the value of interest (e.g. prediction

error) [60]. One could also consider a “parachuting” approach in which the initial guess is set extremely high and the number of trees is then systematically reduced, e.g. by repeated halving (base-2 reduction), to a value at which only a minimal and acceptable loss in prediction performance is incurred. Another solution could be to use

the sigmoidal shape of the MCC versus the number of

trees (Figure S4) to predict an inflection point from a

fitting of the data. Depending on the quality of the fit, this

would then be indicative of the region of the number of

trees necessary to achieve a certain predictive quality.
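The “parachuting” idea above can be sketched as a base-2 reduction that halves the forest size until the performance drop relative to the largest forest exceeds a tolerance. Here `evaluate` is a hypothetical callback (e.g. returning a cross-validated MCC), and the toy saturating curve stands in for real measurements.

```python
def parachute_ntrees(evaluate, start=1024, tolerance=0.05):
    """Halve the forest size until performance drops more than `tolerance`
    below the largest forest's score; return the last acceptable size."""
    reference = evaluate(start)
    ntrees = start
    while ntrees > 1:
        candidate = ntrees // 2
        if reference - evaluate(candidate) > tolerance:
            break                    # drop too large: keep the current size
        ntrees = candidate
    return ntrees

# toy evaluate(): performance saturates with forest size (hypothetical curve)
mcc = lambda n: 0.7 * n / (n + 40.0)
print(parachute_ntrees(mcc))
```

Because each halving only requires one additional evaluation, the search touches O(log2(start)) forest sizes rather than a full grid.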

Overall, our findings from prediction model development showed that, despite reducing the size of the ensemble (forest) used for prediction to 20% or less of the 500 trees used in our previous study and commonly adopted elsewhere [42,58,60,61], active learning strategies could still achieve reliable performances (MCC values of 0.6 or better)

within the first 5,000 iterations. This MCC threshold is

equivalent to roughly 7.2%, 12.6%, and 10.5% of data in

the GPCR GLASS, GPCR SARfari 3, and Kinase

SARfari 5 datasets, respectively. Thus, given that we have

shown the feasibility of forest reduction for

chemogenomic active learning, a reason for reducing the

number of trees in prospective applied investigations will

be to push toward increased model interpretability.

Another interesting approach towards forest reduction has been not only to reduce the number of trees quantitatively, but to do so by considering the quality of individual trees and assessing their contribution to overall model performance [65,66,71]. Using backward deletion

strategies during model training (e.g. deletion of trees

with minimum contribution to overall prediction

performance), it has been shown that smaller sub-forests

can be capable of representing the "whole" random forest

[65]. In a similar study on sub-forest development, the authors found that incorporating specific trees may even diminish the performance of a larger random forest [66]. Incorporation

of a strategy to change the underlying model as a way to

adaptively improve predictive performance and sampling

behavior represents a valuable extension for active

learning approaches [26,72].
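A backward deletion strategy of the kind cited above can be sketched with scikit-learn by greedily dropping the tree whose removal least hurts validation accuracy. This is an illustrative toy on synthetic data, not the procedure of refs. [65,66]; `backward_delete` and `ensemble_accuracy` are hypothetical helpers.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def ensemble_accuracy(trees, X, y):
    """Accuracy of a majority vote over a list of fitted decision trees."""
    votes = np.mean([t.predict(X) for t in trees], axis=0)
    return np.mean((votes >= 0.5) == y)

def backward_delete(forest, X_val, y_val, keep=10):
    """Greedily remove the tree whose deletion least reduces accuracy."""
    trees = list(forest.estimators_)
    while len(trees) > keep:
        scores = [ensemble_accuracy(trees[:i] + trees[i + 1:], X_val, y_val)
                  for i in range(len(trees))]
        trees.pop(int(np.argmax(scores)))   # tree with minimal contribution
    return trees

rng = np.random.default_rng(3)
X, y = rng.random((200, 8)), rng.integers(0, 2, 200)
forest = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)
sub = backward_delete(forest, X, y, keep=10)
print(len(sub))
```

In practice the deletion should be scored on held-out data rather than the training set used here, and integrating such pruning into each active learning iteration would realize the adaptive model change discussed above.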

Regarding chemical interpretability, we can consider

the implementation of the random forest classifier

algorithm used. At each iteration, all descriptor values are

presented for the CPIs subsampled, and up to

floor(sqrt(4496))=67 descriptors can be considered at

each node split of a tree. We empirically found that few,

if any, of the ECFP descriptors had zero variance; the distribution of bit-fingerprint variances peaked close to 0.1. We can therefore expect that the decision tree

building algorithm will rapidly find discriminative

descriptors, including the potential re-use of a

discriminative descriptor at multiple nodes in a tree.

Since the maximum number of descriptors to be included

in a tree was not bounded in this study, small forests could

still randomly sample and identify multiple discriminative

chemical fingerprints. Therefore, we do not anticipate a

change in chemical interpretability.
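The descriptor-sampling behavior described above corresponds to scikit-learn's `max_features="sqrt"` setting with unbounded tree depth; a minimal check, assuming the 4,496-descriptor count from the text:

```python
import math
from sklearn.ensemble import RandomForestClassifier

n_descriptors = 4496
# descriptors considered at each node split under the sqrt rule
per_split = math.floor(math.sqrt(n_descriptors))
print(per_split)

# max_features="sqrt" reproduces this behavior; max_depth=None leaves trees
# unbounded, so a discriminative descriptor may be re-used at multiple nodes
model = RandomForestClassifier(n_estimators=100, max_features="sqrt",
                               max_depth=None)
```

Since every split independently redraws 67 of the 4,496 descriptors, even a small forest samples thousands of descriptor subsets across its nodes, which supports the interpretability argument made here.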

As another direction to further develop the reported

active learning methodology, the algorithm could be

expanded to address multi-class predictions.

Consideration of the CPI data that were discarded in this study after applying the threshold separating active from inactive molecules would

represent a suitable starting point for three-class

predictions of low-, moderate- and high-affinity

compound-protein interactions. It should be noted,

however, that many classification challenges in

proteochemometric modelling discard intermediate

affinities, not only given their subjective nature but also

because of potential differences in their class membership

given by variance in experimental measurements.

Intermediate-class modeling also requires an additional experimental design step: weighting the evaluation of predictions for instances close to the bioactivity thresholds used to create the three classes. For example, if we

consider the thresholds 100 nM and 10 µM, then CPIs in

the intermediate class with bioactivities of 101 nM or 9

µM should not be penalized as strongly for

misclassification, because of the subjectivity induced as

noted above.
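A three-class split of the kind described, using the example thresholds of 100 nM and 10 µM, could be sketched as follows. The labeling function is hypothetical (not part of this study's pipeline) and takes bioactivities in nM.

```python
def affinity_class(activity_nm):
    """Map a bioactivity value in nM to a three-class affinity label."""
    if activity_nm < 100:          # below 100 nM: high affinity
        return "active"
    if activity_nm <= 10_000:      # between 100 nM and 10 uM
        return "intermediate"
    return "inactive"              # above 10 uM: low affinity

# CPIs at 101 nM or 9 uM fall just inside the intermediate class and, as
# argued in the text, should incur a reduced misclassification penalty
print(affinity_class(101), affinity_class(9_000), affinity_class(50_000))
```

A cost-sensitive evaluation could then down-weight errors on samples whose activities lie within, say, the experimental error margin of either threshold.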

5. Conclusions

Chemogenomic active learning was applied to the

compound-protein interaction spaces of kinases and

GPCRs using random forests of reduced size, resulting in robust prediction models trained on no more than 12.6% of the data (the first 5,000 iterations). Assessment of the influence of the number of

decision trees on predictive performance indicates that

forests can be safely cut down to 100 (20%) or 50 (10%)

trees depending on the dataset. Further, model

performances of experiments with varying numbers of

trees indicate the existence of a threshold that cannot be

overcome by simply adding more trees. The way in

which the pharmacological network was explored,

meaning the number of CPIs and non-CPIs incorporated

for model development, was shown to dynamically shift

toward a bias in picking CPIs as the size of the forest was

reduced. Prospective applications of chemogenomic

active learning can benefit from understanding the

behaviors uncovered in this work.

Acknowledgements

D.R. is grateful for support from the Swiss National

Science Foundation (P2EZP3_168827 to D Reker). C.R.

thanks Professor Stephan Irle and Assistant Professor Yuh

Hijikata for supporting this study. The ITbM is supported

by the World Premier International Research Initiative,

Ministry of Education, Culture, Sports, Science and

Technology of Japan. Support from the Japan Society

for the Promotion of Science was used for computational

resources under a Grant-in-Aid for Young Scientists (B)

25870336 and a Grant-in-Aid for Scientific Research (S)

JP16H06306 (to J.B. Brown). We thank OpenEye

Scientific Software (Santa Fe, NM) for providing an

academic license and use of the OEChem chemical

processing library.

References and Notes

[1] Hay, M.; Thomas, D. W.; Craighead, J. L.;

Economides, C.; Rosenthal, J., Nature

Biotechnology, 32, 40 (2014).

[2] Scannell, J. W.; Blanckley, A.; Boldon, H.;

Warrington, B., Nature Reviews: Drug

Discovery, 11, 191 (2012).

[3] Lavecchia, A., Drug Discovery Today, 20, 318

(2015).

[4] Mitchell, J. B., Wiley Interdiscip Rev Comput

Mol Sci, 4, 468 (2014).

[5] Ekins, S.; Mestres, J.; Testa, B., British Journal of

Pharmacology, 152, 9 (2007).

[6] Geppert, H.; Vogt, M.; Bajorath, J., Journal of

Chemical Information and Modeling, 50, 205

(2010).

[7] Brown, J. B.; Urata, T.; Tamura, T.; Arai, M. A.;

Kawabata, T.; Akutsu, T., Journal of

Bioinformatics and Computational Biology, 8

Suppl 1, 63 (2010).

[8] Achenbach, J.; Tiikkainen, P.; Franke, L.;

Proschak, E., Future Medicinal Chemistry, 3,

961 (2011).

[9] Vidal, D.; Garcia-Serna, R.; Mestres, J., Methods

in Molecular Biology, 672, 489 (2011).

[10] Lyne, P. D., Drug Discovery Today, 7, 1047

(2002).

[11] Berger, S. I.; Iyengar, R., Bioinformatics, 25,

2466 (2009).

[12] Caron, P. R.; Mullican, M. D.; Mashal, R. D.;

Wilson, K. P.; Su, M. S.; Murcko, M. A., Current

Opinion in Chemical Biology, 5, 464 (2001).

[13] Bleicher, K. H., Current Medicinal Chemistry, 9,

2077 (2002).

[14] van Westen, G. J. P.; Wegner, J. K.; Ijzerman, A.

P.; van Vlijmen, H. W. T.; Bender, A.,

MedChemComm, 2, 16 (2011).

[15] Bajorath, J., Molecular Informatics, 32, 1025

(2013).

[16] Brown, J. B.; Niijima, S.; Okuno, Y., Molecular

Informatics, 32, 906 (2013).

[17] Brown, J. B.; Okuno, Y.; Marcou, G.; Varnek,

A.; Horvath, D., Journal of Computer-Aided

Molecular Design, 28, 597 (2014).

[18] Cortes-Ciriano, I.; Ain, Q. U.; Subramanian, V.;

Lenselink, E. B.; Mendez-Lucio, O.; Ijzerman, A.

P.; Wohlfahrt, G.; Prusis, P.; Malliavin, T. E.;

van Westen, G. J. P. et al., MedChemComm, 6, 24

(2015).

[19] Yabuuchi, H.; Niijima, S.; Takematsu, H.; Ida,

T.; Hirokawa, T.; Hara, T.; Ogawa, T.; Minowa,

Y.; Tsujimoto, G.; Okuno, Y., Molecular

Systems Biology, 7, 472 (2011).

[20] van Westen, G. J.; Wegner, J. K.; Geluykens, P.;

Kwanten, L.; Vereycken, I.; Peeters, A.;

Ijzerman, A. P.; van Vlijmen, H. W.; Bender, A.,

PloS One, 6, e27518 (2011).

[21] Cortes-Ciriano, I.; van Westen, G. J.; Lenselink,

E. B.; Murrell, D. S.; Bender, A.; Malliavin, T.,

Journal of Cheminformatics, 6, 35 (2014).

[22] Cortes-Ciriano, I.; Murrell, D. S.; van Westen, G.

J.; Bender, A.; Malliavin, T. E., Journal of

Cheminformatics, 7, 1 (2015).

[23] Bosc, N.; Wroblowski, B.; Meyer, C.; Bonnet, P.,

Journal of Chemical Information and Modeling,

57, 93 (2017).

[24] Warmuth, M. K.; Rätsch, G.; Mathieson, M.;

Liao, J.; Lemmen, C., Advances in Neural

Information Processing Systems, 2, 1449 (2002).

[25] Murphy, R. F., Nature Chemical Biology, 7, 327

(2011).

[26] Reker, D.; Schneider, G., Drug Discovery Today,

20, 458 (2015).

[27] Warmuth, M. K.; Liao, J.; Ratsch, G.; Mathieson,

M.; Putta, S.; Lemmen, C., Journal of Chemical

Information and Computer Science, 43, 667

(2003).

[28] Danziger, S. A.; Zeng, J.; Wang, Y.; Brachmann,

R. K.; Lathrop, R. H., Bioinformatics, 23, i104

(2007).

[29] De Grave, K.; Ramon, J.; De Raedt, L.,

Discovery Science: 11th International

Conference, DS 2008, Budapest, Hungary,

October 13-16, 2008. Proceedings, 185 (2008).

[30] Clark, J.; Frederking, R. E.; Levin, L. S., LREC (2008).

[31] Danziger, S. A.; Baronio, R.; Ho, L.; Hall, L.;

Salmon, K.; Hatfield, G. W.; Kaiser, P.; Lathrop,

R. H., PLoS Computational Biology, 5,

e1000498 (2009).

[32] Mohamed, T. P.; Carbonell, J. G.; Ganapathiraju,

M. K., BMC Bioinformatics, 11 Suppl 1, S57

(2010).

[33] Besnard, J.; Ruda, G. F.; Setola, V.; Abecassis,

K.; Rodriguiz, R. M.; Huang, X. P.; Norval, S.;

Sassano, M. F.; Shin, A. I.; Webster, L. A. et al.,

Nature, 492, 215 (2012).

[34] Cobanoglu, M. C.; Liu, C.; Hu, F.; Oltvai, Z. N.;

Bahar, I., Journal of Chemical Information and

Modeling, 53, 3399 (2013).

[35] Heikamp, K.; Bajorath, J., Journal of Chemical

Information and Modeling, 53, 791 (2013).

[36] Desai, B.; Dixon, K.; Farrant, E.; Feng, Q.;

Gibson, K. R.; van Hoorn, W. P.; Mills, J.;

Morgan, T.; Parry, D. M.; Ramjee, M. K. et al.,

Journal of Medicinal Chemistry, 56, 3033 (2013).

[37] Naik, A. W.; Kangas, J. D.; Langmead, C. J.;

Murphy, R. F., PloS One, 8, e83996 (2013).

[38] Ahmadi, M.; Vogt, M.; Iyer, P.; Bajorath, J.;

Frohlich, H., Journal of Chemical Information

and Modeling, 53, 553 (2013).

[39] Kangas, J. D.; Naik, A. W.; Murphy, R. F., BMC

Bioinformatics, 15, 143 (2014).

[40] Maciejewski, M.; Wassermann, A. M.; Glick,

M.; Lounkine, E., Journal of Chemical

Information and Modeling, 55, 956 (2015).

[41] Wei, K.; Iyer, R. K.; Bilmes, J. A., ICML,

1954 (2015).

[42] Reker, D.; Schneider, P.; Schneider, G.,

Chemical Science, 7, 3919 (2016).

[43] Naik, A. W.; Kangas, J. D.; Sullivan, D. P.;

Murphy, R. F., Elife, 5, e10047 (2016).

[44] Ueno, T.; Rhone, T. D.; Hou, Z.; Mizoguchi, T.;

Tsuda, K., Materials Discovery, 4, 18 (2016).

[45] Alvarsson, J.; Lampa, S.; Schaal, W.; Andersson,

C.; Wikberg, J. E. S.; Spjuth, O., Journal of

Cheminformatics, 8, 39 (2016).

[46] Lang, T.; Flachsenberg, F.; von Luxburg, U.;

Rarey, M., Journal of Chemical Information and

Modeling, 56, 12 (2016).

[47] Mullarky, E.; Lucki, N. C.; Beheshti Zavareh,

R.; Anglin, J. L.; Gomes, A. P.; Nicolay, B. N.;

Wong, J. C.; Christen, S.; Takahashi, H.; Singh,

P. K. et al., Proceedings of the National Academy

of Sciences of the United States of America, 113,

1778 (2016).

[48] Lucantoni, L.; Fidock, D. A.; Avery, V. M.,

Antimicrobial Agents and Chemotherapy, 60,

2097 (2016).

[49] Lopez-Sambrooks, C.; Shrimal, S.; Khodier, C.;

Flaherty, D. P.; Rinis, N.; Charest, J. C.; Gao, N.;

Zhao, P.; Wells, L.; Lewis, T. A. et al., Nature

Chemical Biology, 12, 1023 (2016).

[50] Reker, D.; Schneider, P.; Schneider, G.; Brown,

J. B., Future Medicinal Chemistry, (2017).

[51] Matthews, B. W., Biochimica et Biophysica Acta

(BBA) - Protein Structure, 405, 442 (1975).

[52] Chan, W. K.; Zhang, H.; Yang, J.; Brender, J. R.;

Hur, J.; Ozgur, A.; Zhang, Y., Bioinformatics, 31,

3035 (2015).

[53] Bento, A. P.; Gaulton, A.; Hersey, A.; Bellis, L.

J.; Chambers, J.; Davies, M.; Kruger, F. A.;

Light, Y.; Mak, L.; McGlinchey, S. et al., Nucleic

Acids Research, 42, D1083 (2014).

[54] Rogers, D.; Hahn, M., Journal of Chemical

Information and Modeling, 50, 742 (2010).

[55] Saigo, H.; Vert, J.-P.; Ueda, N.; Akutsu, T.,

Bioinformatics, 20, 1682 (2004).

[56] Pedregosa, F.; Varoquaux, G.; Gramfort, A.;

Michel, V.; Thirion, B.; Grisel, O.; Blondel, M.;

Prettenhofer, P.; Weiss, R.; Dubourg, V.,

Journal of Machine Learning Research, 12,

2825 (2011).

[57] Jones, E.; Oliphant, T.; Peterson, P., (2014).

[58] Svetnik, V.; Liaw, A.; Tong, C.; Culberson, J. C.;

Sheridan, R. P.; Feuston, B. P., Journal of

Chemical Information and Computer Sciences,

43, 1947 (2003).

[59] Bernard, S.; Adam, S.; Heutte, L., Pattern

Recognition Letters, 33, 1580 (2012).

[60] Boulesteix, A.-L.; Janitza, S.; Kruppa, J.; König,

I. R., Wiley Interdisciplinary Reviews: Data

Mining and Knowledge Discovery, 2, 493 (2012).

[61] Reutlinger, M.; Rodrigues, T.; Schneider, P.;

Schneider, G., Angewandte Chemie

International Edition, 53, 4244 (2014).

[62] Goldstein, B. A.; Polley, E. C.; Briggs, F. B.,

Statistical Applications in Genetics and

Molecular Biology, 10, 32 (2011).

[63] Breiman, L., Machine Learning, 45, 5 (2001).

[64] Latinne, P.; Debeir, O.; Decaestecker, C.,

Multiple Classifier Systems: Second

International Workshop, MCS 2001 Cambridge,

UK, July 2–4, 2001 Proceedings, 178 (2001).

[65] Zhang, H.; Wang, M., Stat Interface, 2, 381

(2009).

[66] Bernard, S.; Heutte, L.; Adam, S., Emerging

Intelligent Computing Technology and

Applications. With Aspects of Artificial

Intelligence: 5th International Conference on

Intelligent Computing, ICIC 2009 Ulsan, South

Korea, September 16-19, 2009 Proceedings, 536

(2009).

[67] Van Essen, B.; Macaraeg, C.; Gokhale, M.;

Prenger, R., Field-Programmable Custom

Computing Machines (FCCM), 2012 IEEE 20th

Annual International Symposium on,

232 (2012).

[68] Oshiro, T. M.; Perez, P. S.; Baranauskas, J. A.,

Machine Learning and Data Mining in Pattern

Recognition: 8th International Conference,

MLDM 2012, Berlin, Germany, July 13-20, 2012.

Proceedings, 154 (2012).

[69] Huang, B. F.; Boutros, P. C., BMC

Bioinformatics, 17, 331 (2016).

[70] Coelho, E. D.; Arrais, J. P.; Oliveira, J. L., 2016

IEEE 29th International Symposium on

Computer-Based Medical Systems (CBMS),

36 (2016).

[71] Adnan, M. N.; Islam, M. Z., Knowledge-Based

Systems, 110, 86 (2016).

[72] Baram, Y.; Yaniv, R. E.; Luz, K., Journal of

Machine Learning Research, 5, 255 (2004).

Supplementary Materials


Figure S1. Active learning results for the GPCR GLASS dataset with tree numbers ≥ 500. a) Evolution of prediction model performance (MCC) per iteration; no gain in performance is achieved for large forests. b) Ratio of picked CPI to non-CPI samples per iteration during model development for experiments with ≥ 500 trees. Abbreviations: CPIs = compound-protein interactions; non-CPIs = compound-protein non-interactions; MCC = Matthews correlation coefficient.

Figure S2. Active learning results for the GPCR SARfari 3 dataset with tree numbers ≥ 500. Panels are analogous

to Figure S1.

Figure S3. Active learning results for the Kinase SARfari 5 dataset with tree numbers ≥ 200. Panels are analogous to

Figure S1. A limit on performance is reached as early as 200 trees.

Figure S4. Active learning performance as a function of random forest complexity. MCC curves were fit with a hyperbolic function (upper panel) and a sigmoidal function (lower panel).

Figure S5. Evaluation of alternative predictive metrics. Evolution of prediction model performances as

FNR, FPR, TNR, and TPR (from upper to lower panels) per iteration (10 executions per forest size) for all three

datasets (columns). We observe that the GPCR datasets with higher ratios of actives have over-estimated True

Positive Rates and could potentially mislead model performance interpretation. Performance curves are indexed

according to the color key in the lower left corner. Abbreviations: FNR = false negative rate; FPR = false positive

rate; TNR = true negative rate; TPR = true positive rate.

Figure S6. Venn diagrams of identity calculations for the ChEMBL GPCR SARfari 3 (green) and GPCR GLASS datasets

(red). a) Overlap of compounds based on full InChI string comparison. b) Overlap of molecular targets (GPCRs) based on FASTA

sequences.

Figure S7. Clustered pairwise comparison of FASTA sequences of targets in the GPCR datasets (ChEMBL GPCR SARfari

3 and GLASS). Scores represent normalized similarities of protein pairs as calculated by the Local Alignment Kernel. Deep

colored clusters along the diagonal demonstrate cross-dataset homolog pairs.