A Study on Feature Selection for Toxicity Prediction * Gongde Guo 1 , Daniel Neagu 1 and Mark Cronin 2 1 Department of Computing, University of Bradford 2 School of Pharmacy and Chemistry, Liverpool John Moores University *EPSRC Project: PYTHIA – Predictive Toxicology Knowledge representation and Processing Tool based on a Hybrid Intelligent Systems Approach, Grant Reference:GR/T02508/01

Page 1:

A Study on Feature Selection for Toxicity Prediction*

Gongde Guo1, Daniel Neagu1 and Mark Cronin2

1Department of Computing, University of Bradford2School of Pharmacy and Chemistry, Liverpool John Moores University

*EPSRC Project: PYTHIA – Predictive Toxicology Knowledge representation and Processing Tool based on a Hybrid Intelligent Systems Approach, Grant Reference:GR/T02508/01

Page 2:

Outline of Presentation

1. Predictive Toxicology
2. Feature Selection Methods
3. Relief Family: Relief, ReliefF
4. kNNMFS Feature Selection
5. Evaluation Criteria
6. Toxicity Dataset: Phenols
7. Evaluation I: Toxicity
8. Evaluation II: Mechanism of Action
9. Conclusions

Page 3:

Predictive Toxicology

• The goal of predictive toxicology is to describe the relationships between the chemical structure of a molecule and biological and toxicological processes (Structure-Activity Relationships, SAR), and to use these relationships to predict the behaviour of new, unknown chemical compounds.

• Predictive toxicology data mining comprises the steps of data preparation; data reduction (including feature selection); data modelling; prediction (classification, regression); evaluation of results; and further knowledge discovery tasks.

Page 4:

Feature Selection Methods

Feature selection is the process of identifying and removing as much of the irrelevant and redundant information as possible.

Seven feature selection methods (Witten and Frank, 2000) are involved in our study:

1. GR – Gain Ratio feature evaluator;
2. IG – Information Gain ranking filter;
3. Chi – Chi-squared ranking filter;
4. ReliefF – ReliefF feature selection;
5. SVM – SVM feature evaluator;
6. CS – Consistency Subset evaluator;
7. CFS – Correlation-based Feature Selection.

In this work, however, we focus on the drawbacks of the ReliefF feature selection method and propose the kNNMFS feature selection method.
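As a concrete illustration of a ranking filter, the sketch below scores discrete features by information gain; the data and feature names are invented for illustration, and the study itself used Weka's implementations of the methods listed above.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(feature_values, labels):
    """Reduction in class entropy after splitting on a discrete feature."""
    n = len(labels)
    remainder = 0.0
    for v in set(feature_values):
        subset = [y for x, y in zip(feature_values, labels) if x == v]
        remainder += len(subset) / n * entropy(subset)
    return entropy(labels) - remainder

# Toy data: two discrete descriptors and a binary toxicity label.
f1 = [0, 0, 1, 1]   # informative: matches the label exactly
f2 = [0, 1, 0, 1]   # uninformative
y  = [0, 0, 1, 1]
ranking = sorted([("f1", information_gain(f1, y)),
                  ("f2", information_gain(f2, y))], key=lambda t: -t[1])
```

A ranking filter of this kind scores each feature independently and keeps the top-scoring ones, which is what the NSF column in the result tables counts.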

Page 5:

Relief Feature Selection Method

The Relief algorithm works by randomly sampling an instance and locating its nearest neighbour from the same and the opposite class. The values of the features of the nearest neighbours are compared to the sampled instance and used to update the relevance scores for each feature.

[Figure: Relief with K=1, showing the nearest hit and nearest miss. Open issues: sensitivity to noise; how to set M; how to choose the M sampled instances.]

Page 6:

Relief Feature Selection Method

Algorithm Relief
Input: for each training instance, a vector of attribute values and the class value
Output: the vector W of estimations of the qualities of attributes

Set all weights W[Ai] = 0.0, i = 1, 2, ..., p;
for j = 1 to m do begin
    randomly select an instance Xj;
    find nearest hit Hj and nearest miss Mj;
    for k = 1 to p do begin
        W[Ak] = W[Ak] - diff(Ak, Xj, Hj)/m + diff(Ak, Xj, Mj)/m;
    end;
end;
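The pseudocode above translates almost line for line into Python. The sketch below assumes numeric attributes scaled to [0, 1], so diff reduces to an absolute difference; the dataset and m are illustrative.

```python
import random

def diff(a, x, y):
    """Attribute difference for numeric attributes scaled to [0, 1]."""
    return abs(x[a] - y[a])

def relief(X, labels, m, seed=0):
    """Relief feature weighting: m random samples, one hit and one miss each."""
    rng = random.Random(seed)
    p = len(X[0])
    W = [0.0] * p
    for _ in range(m):
        j = rng.randrange(len(X))
        xj, cj = X[j], labels[j]
        dist = lambda i: sum(diff(a, X[i], xj) for a in range(p))
        hits = [i for i in range(len(X)) if i != j and labels[i] == cj]
        misses = [i for i in range(len(X)) if labels[i] != cj]
        H = X[min(hits, key=dist)]    # nearest hit (same class)
        M = X[min(misses, key=dist)]  # nearest miss (opposite class)
        for k in range(p):
            W[k] += -diff(k, xj, H) / m + diff(k, xj, M) / m
    return W

# Toy check: feature 0 separates the classes, feature 1 is constant.
weights = relief([[0.0, 0.5], [0.1, 0.5], [1.0, 0.5], [0.9, 0.5]],
                 [0, 0, 1, 1], m=8)
```

A relevant feature differs little between a sample and its hit but a lot between a sample and its miss, so its weight grows; an irrelevant feature's weight stays near zero.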

Page 7:

ReliefF Feature Selection Method

[Figure: ReliefF with K=3, using the K nearest hits and misses. Open issues: handling of a noisy instance X; how to set K and M; how to choose the M instances.]

Page 8:

ReliefF Feature Selection Method

Page 9:

kNN Model-based Classification Method (Guo et al., 2003)

The basic idea of the kNN model-based classification method is to find a set of more meaningful representatives of the complete dataset to serve as the basis for further classification.

kNNModel can generate a set of optimal representatives by inductively learning from the dataset.

Page 10:

An Example of kNNModel

Each representative di takes the form <Cls(di), Sim(di), Num(di), Rep(di)>, which respectively represents: the class label of di; the similarity of di to the furthest instance among the instances covered by Ni; the number of instances covered by Ni; and a representation of the instance di.
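A minimal sketch of this tuple as a data structure; the field names follow the slide, but the encoding and the example values are invented, since the slide does not prescribe an implementation.

```python
from dataclasses import dataclass
from typing import Tuple

@dataclass
class Representative:
    cls: str                 # Cls(di): class label of the representative
    sim: float               # Sim(di): similarity to the furthest covered instance
    num: int                 # Num(di): number of instances covered by Ni
    rep: Tuple[float, ...]   # Rep(di): the descriptor vector of di itself

# Hypothetical representative: class label and values are purely illustrative.
r = Representative(cls="polar narcosis", sim=0.87, num=12, rep=(0.3, 1.2, -0.5))
```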

Page 11:

kNNMFS: kNN Model-based Feature Selection

kNNMFS takes the output of the kNNModel as seeds for further feature selection. Given a new instance, kNNMFS finds the nearest representative for each class and then directly uses the inductive information of each representative generated by kNNModel for feature weight calculation. The k in ReliefF is varied in our algorithm. Its value depends on the number of instances covered by each nearest representative used for feature weight calculation. The M in kNNMFS is the number of representatives output from the kNNModel.
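A rough sketch of the weight update just described, under several stated assumptions: representatives are encoded as simple tuples, distance is Manhattan, and Num(di) scales the contribution of the nearest representative of each class so that the effective k varies per seed. None of this is the authors' actual code.

```python
from collections import namedtuple

# Minimal stand-in for a kNNModel representative: class label, Num(di)
# (instances covered), and the descriptor vector of di.
Rep = namedtuple("Rep", "cls num vec")

def knnmfs_update(seed, representatives, W, M):
    """Update feature weights W from one seed representative.

    The nearest representative of each class is found; a hit (same class)
    decreases the weights, a miss increases them. Num(di) stands in for
    that many individual neighbours, and M is the total number of
    representatives output by kNNModel.
    """
    p = len(seed.vec)
    nearest = {}
    for r in representatives:
        if r is seed:
            continue
        d = sum(abs(a - b) for a, b in zip(seed.vec, r.vec))
        if r.cls not in nearest or d < nearest[r.cls][0]:
            nearest[r.cls] = (d, r)
    for cls, (_, r) in nearest.items():
        sign = -1.0 if cls == seed.cls else 1.0
        for k in range(p):
            W[k] += sign * r.num * abs(seed.vec[k] - r.vec[k]) / M
    return W
```

Because each seed is itself a summary of Num(di) instances, far fewer iterations are needed than ReliefF's m random samples over the raw data.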

Page 12:

kNNMFS Feature Selection Method

Page 13:

Toxicity Dataset: Phenols

The phenols dataset was collected from the TETRATOX database (Schultz, 1997) and contains 250 compounds. A total of 173 descriptors were calculated for each compound using different software tools, e.g. ACD/Labs, Chem-X and TSAR. These descriptors were calculated to represent the physico-chemical, structural and topological properties relevant to toxicity. Some features are irrelevant to, or correlate poorly with, the class label.

[Scatter plots: X: CX-EMP20, Y: toxicity; X: TS_QuadXX, Y: toxicity]

Page 14:

Evaluation Measures for Continuous Class Value Prediction
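The formulas on this slide were presented as an image. As a sketch, the five measures reported in Table 1 (CC, MAE, RMSE, RAE, RRSE) have the following standard definitions; the exact formulas on the original slide may differ.

```python
import math

def regression_metrics(actual, predicted):
    """Correlation coefficient, MAE, RMSE, and the relative errors
    (RAE, RRSE) of a predictor against a mean-value baseline."""
    n = len(actual)
    ma = sum(actual) / n
    mp = sum(predicted) / n
    cov = sum((a - ma) * (p - mp) for a, p in zip(actual, predicted))
    cc = cov / math.sqrt(sum((a - ma) ** 2 for a in actual) *
                         sum((p - mp) ** 2 for p in predicted))
    mae = sum(abs(a - p) for a, p in zip(actual, predicted)) / n
    rmse = math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / n)
    # Relative errors: predictor error divided by the mean predictor's error.
    rae = (sum(abs(a - p) for a, p in zip(actual, predicted)) /
           sum(abs(a - ma) for a in actual))
    rrse = math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) /
                     sum((a - ma) ** 2 for a in actual))
    return cc, mae, rmse, rae, rrse
```

RAE and RRSE are below 100% whenever the model beats simply predicting the mean toxicity, which is how the percentages in Table 1 should be read.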

Page 15:

Endpoint I: Toxicity

FSM = feature selection method; NSF = number of selected features. Evaluation using linear regression:

FSM      NSF    CC      MAE     RMSE    RAE       RRSE
Phenols  173    0.8039  0.3993  0.5427  59.4360%  65.3601%
MostU     12    0.7543  0.4088  0.5454  60.8533%  65.6853%
GR        20    0.7722  0.4083  0.5291  60.7675%  63.7304%
IG        20    0.7662  0.3942  0.5325  58.6724%  63.1352%
Chi       20    0.7570  0.4065  0.5439  60.5101%  65.5146%
ReliefF   20    0.8353  0.3455  0.4568  51.4319%  55.0232%
SVM       20    0.8239  0.3564  0.4697  53.0501%  56.5722%
CS        13    0.7702  0.3982  0.5292  59.2748%  63.7334%
CFS        7    0.8049  0.3681  0.4908  54.7891%  59.1181%
kNNMFS    35    0.8627  0.3150  0.4226  46.8855%  50.8992%

Table 1. Performance of linear regression algorithm on different phenols subsets

Page 16:

Endpoint II: Mechanism of Action

10-fold cross validation using wkNN (k=5):

FSM      NSF    Average Accuracy   Variance   Deviation
GR        20    89.32              1.70       1.31
IG        20    89.08              1.21       1.10
Chi       20    88.68              0.50       0.71
ReliefF   20    91.40              1.32       1.15
SVM       20    91.80              0.40       0.63
CS        13    89.40              0.76       0.87
CFS        7    80.76              1.26       1.12
kNNMFS    35    93.24              0.44       0.67
Phenols  173    86.24              0.43       0.66

Table 2. Performance of wkNN algorithm on different phenols subsets

Page 17:

Conclusion and Future Research Directions

• Using a kNN model as the starter, the selection can choose a set of more meaningful representatives to replace the original data for feature selection;

• Presenting a more reasonable ‘difference function calculation’ based on inductive information in each representative obtained by kNNModel.

• Better performance is obtained by kNNMFS on the subsets of the phenols dataset with different endpoints;

• Investigating the effectiveness of boundary data or cluster-centre data chosen as seeds for kNNMFS;

• More comprehensive experiments on the benchmark data will be carried out.

Page 18:

References

1. Witten, I.H. and Frank, E.: Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations. Morgan Kaufmann, San Francisco (2000)

2. Guo, G., Wang, H., Bell, D. et al.: kNN Model-based Approach in Classification. In Proc. of CoopIS/DOA/ODBASE 2003, LNCS 2888, Springer-Verlag, pp. 986-996 (2003)

3. Schultz, T.W.: TETRATOX: The Tetrahymena Pyriformis Population Growth Impairment Endpoint – A Surrogate for Fish Lethality. Toxicol. Methods, 7, 289-309 (1997)

Page 19:

Thank you very much!