Review Article

Pooja Mishra1, Vijay Tripathi1, Brijesh Singh Yadav2*

Corresponding Author: [email protected]

GERF Bulletin of Biosciences 2010, 1(1): 37-40

www.gerfbb.com

In Silico QSAR Modeling and Drug Development Process

Abstract

In silico quantitative structure-activity relationship (QSAR) modeling is one of the well-developed areas of computational chemistry in drug development. Similar molecules with just a slight variation in their structure can have quite different biological activity. This relationship between molecular structure and change in biological activity is the central focus of QSAR modeling. QSAR models are based on a comparison between some type of activity and the chemical structure or physicochemical properties of a series of chemicals. QSAR is the process by which chemical structure is quantitatively correlated with a well-defined process, such as biological activity or chemical reactivity: Activity = f(physicochemical properties and/or structural properties). A wide variety of descriptors is available for use in QSAR studies. Descriptors are molecular properties expressed in mathematical form. Subsets of these descriptors are potentially useful for predicting ADME/Tox properties, which describe the disposition of a pharmaceutical compound within an organism. ADME is an acronym used in pharmacokinetics and pharmacology for absorption, distribution, metabolism and excretion. Predicting the ADME properties of a compound is a very important step in drug design, and it is also useful to predict the toxicity of particular compounds. This article also reviews current achievements in the field of QSAR modeling and their impact on modern drug discovery processes. The applications of QSAR modeling in drug discovery, such as compound selection, virtual library generation, virtual high-throughput screening, HTS data mining, and in silico ADMET, are discussed. We also present a quantitative structure-human intestinal absorption relationship obtained by regression analysis with Tsar.

1 Center of Bioinformatics, University of Allahabad, Allahabad, India
2* Indian Veterinary Research Institute, Izatnagar, Bareilly, U.P., India

Keywords: Quantitative structure-activity relationship (QSAR), ADME, Descriptors, Clustering, Regression, TSAR

Introduction

A drug is a complex set of molecules. It is not distributed throughout the body in this complex form: after being absorbed (orally or by another route), it is broken down to its simplest form, and only then is it distributed throughout the body by the blood. After absorption and distribution, the remaining waste part of the drug is excreted from the body. Historically, drug absorption, distribution, metabolism, excretion, and toxicity (ADMET) studies in animal models were performed after a lead compound was identified. Now, pharmaceutical companies are employing higher-throughput, in vitro assays to evaluate the ADMET characteristics of potential leads at earlier stages of development (Jun Xu, 2002). This is done in order to eliminate candidates as early as possible, thus avoiding costs that would otherwise have been spent on chemical synthesis and biological testing. Scientists are developing computational methods to select only compounds with reasonable ADMET properties for screening. Molecules from these computationally screened virtual libraries can then be synthesized for high-throughput biological activity screening. As the predictive ability of ADME/Tox software improves, and as pharmaceutical companies incorporate computational prediction methods into their R&D programs, the drug discovery process will move from a screening-based to a knowledge-based paradigm. In silico QSAR modeling uses feature selection to reduce the number of descriptors per compound. Successful data mining depends on good descriptor selection. If molecules are represented by improper descriptors, they will not lead to reasonable predictions. Correct descriptor selection relies on understanding the computational problem that one is trying to solve.

Correlation analysis and relevance analysis approaches can help with this understanding. The selected descriptors should meet the following criteria: they should be related to bioactivity (requiring correlation analysis); they should be informative (having diversified value distributions); they should be independent of each other (if two descriptors are correlated with each other, the related property will be unfairly biased); and they should be simple to extract, easy to explain to a chemist, invariant to irrelevant transformations, insensitive to noise, and efficient at discriminating patterns in different categories (specificity). After comparing performance and predictability in high throughput data mining, researchers from multiple groups have consistently.
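The selection criteria above (related to activity, informative, mutually independent) can be sketched as a simple filter. This is a minimal illustration under stated assumptions, not the procedure used by the authors or by Tsar: the descriptor names, the toy values, and the thresholds `min_var` and `max_r` are all hypothetical.

```python
from statistics import mean, pvariance

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy) if sx and sy else 0.0

def select_descriptors(columns, activity, min_var=1e-6, max_r=0.9):
    """Rank descriptors by |r| with activity, then drop flat or redundant ones.

    min_var and max_r are illustrative thresholds, not recommended values.
    """
    ranked = sorted(columns, key=lambda n: -abs(pearson(columns[n], activity)))
    kept = []
    for name in ranked:
        if pvariance(columns[name]) < min_var:
            continue  # not informative: (almost) the same value everywhere
        if any(abs(pearson(columns[name], columns[k])) > max_r for k in kept):
            continue  # not independent: duplicates a descriptor already kept
        kept.append(name)
    return kept

# Hypothetical toy data: five compounds, four descriptors.
descriptors = {
    "logP": [1.0, 2.0, 3.0, 4.0, 5.0],
    "logP_copy": [2.0, 4.0, 6.0, 8.0, 11.0],   # nearly collinear with logP
    "n_rings": [1.0, 1.0, 1.0, 1.0, 1.0],      # constant, carries no information
    "h_donors": [3.0, 1.0, 4.0, 1.0, 5.0],
}
activity = [10.0, 20.0, 30.0, 40.0, 50.0]

print(select_descriptors(descriptors, activity))  # ['logP', 'h_donors']
```

Ranking by correlation with activity first means that, of two redundant descriptors, the one more strongly related to activity is the one retained.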


Computational Methods for QSAR Modeling

Genetic Algorithm (GA)

A genetic algorithm is an optimization algorithm used to find exact or approximate solutions to optimization and search problems. GAs are categorized as global search heuristics. General applications of GAs include topology optimization, genetic training algorithms, and control parameter optimization. Genetic methods represent a powerful class of computational methodologies, since a GA stands for an effectively unlimited number of possible algorithms that can be used to examine combinatorial problems. This means that the problem should dictate the exact form of the algorithm, such as the coding scheme for putative solutions (Niculescu, 2003). A generic GA is therefore probably not the best suited to examine a particular problem.
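As a rough illustration of how a GA can search a combinatorial space such as descriptor subsets, the sketch below evolves bit strings in which each bit marks one descriptor as selected. The coding scheme, the operators (tournament selection, one-point crossover, bit-flip mutation, elitism), the parameter values, and the toy fitness function are all assumptions made for this example. A real study would score each subset by the quality of the QSAR model built from it.

```python
import random

def genetic_search(fitness, n_bits, pop_size=20, generations=30,
                   mutation_rate=0.05, seed=0):
    """Maximise `fitness` over bit strings of length n_bits (n_bits >= 2)."""
    rng = random.Random(seed)
    pop = [[rng.randint(0, 1) for _ in range(n_bits)] for _ in range(pop_size)]
    best = max(pop, key=fitness)
    history = [fitness(best)]          # best fitness seen in each generation

    def tournament():
        # Tournament selection: the fitter of two random individuals.
        a, b = rng.sample(pop, 2)
        return a if fitness(a) >= fitness(b) else b

    for _ in range(generations):
        children = [best[:]]           # elitism: the best solution survives
        while len(children) < pop_size:
            mum, dad = tournament(), tournament()
            cut = rng.randrange(1, n_bits)        # one-point crossover
            child = mum[:cut] + dad[cut:]
            child = [1 - g if rng.random() < mutation_rate else g
                     for g in child]              # bit-flip mutation
            children.append(child)
        pop = children
        best = max(pop, key=fitness)
        history.append(fitness(best))
    return best, history

# Hypothetical fitness: descriptors 0, 3 and 5 are the useful ones.
USEFUL = {0, 3, 5}

def toy_fitness(mask):
    hits = sum(1 for i, g in enumerate(mask) if g and i in USEFUL)
    return hits - 0.5 * (sum(mask) - hits)   # penalise irrelevant descriptors

best_mask, history = genetic_search(toy_fitness, n_bits=8)
print(best_mask, history[-1])
```

Because the best individual is always copied into the next generation, the recorded best fitness can never decrease from one generation to the next.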

Artificial neural network (ANN)

An ANN, often just called an "NN", is an interconnected group of artificial neurons that uses a mathematical or computational model. NNs are applied in business, using several different approaches, and in QSAR modeling. They are available as turnkey applications and as NN development tools (commercial packages include NeuroShell and BrainMaker). They are also used extensively in the PMH (position-specific iterative predictions), in visualizing protein structure and computing structural properties, in the GRAIL gene finder, and in sequence analysis, database searching, and pairwise alignment. Variable selection is a particularly important and challenging problem in the development of ANN models. Why are ANNs useful here? If there is a deterministic relation between some features of the molecules and the property to be predicted, then QSAR amounts to a regression problem, i.e. the determination of that unknown relation. From a statistical point of view, NNs represent a class of non-parametric adaptive models (Kustrin, 2001). In this framework, an important issue is to evaluate the performance of the models. This is done by separating the data into two sets: the training set and the testing set. The parameters of the network (i.e. the values of the synaptic weights) are computed using the training set, and its performance is then measured on the testing set.
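The training/testing procedure described above can be sketched as follows. To keep the example short, the "network" is reduced to a single linear neuron trained by batch gradient descent. This is a deliberate simplification, and the function names, learning rate, epoch count, and noise-free toy data are assumptions for illustration only.

```python
import random

def train_test_split(rows, test_fraction=0.25, seed=0):
    """Shuffle the rows and hold out a fraction of them for testing."""
    rows = rows[:]
    random.Random(seed).shuffle(rows)
    n_test = max(1, int(len(rows) * test_fraction))
    return rows[n_test:], rows[:n_test]

def train_neuron(train, lr=0.5, epochs=2000):
    """Fit y ~ w*x + b by batch gradient descent, using the training set only."""
    w, b = 0.0, 0.0
    n = len(train)
    for _ in range(epochs):
        grad_w = sum((w * x + b - y) * x for x, y in train) / n
        grad_b = sum((w * x + b - y) for x, y in train) / n
        w -= lr * grad_w               # update the synaptic weight
        b -= lr * grad_b               # update the bias
    return w, b

def mse(model, rows):
    """Mean squared error of the model on a set of (x, y) rows."""
    w, b = model
    return sum((w * x + b - y) ** 2 for x, y in rows) / len(rows)

# Hypothetical data: one descriptor x and a property y = 2x + 1.
data = [(i / 20, 2.0 * (i / 20) + 1.0) for i in range(20)]
train, test = train_test_split(data)
model = train_neuron(train)
print(len(train), len(test), mse(model, test))
```

The error reported on the testing set estimates how the model behaves on compounds it has not seen, which the training error alone cannot show.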

Self organizing map (SOM)

The SOM is a subtype of ANN. It is trained using unsupervised learning to produce low-dimensional representations of the training samples while preserving the topological properties of the input space. SOMs have been used in research in applications such as automatic speech recognition, clinical voice analysis, monitoring of the condition of industrial plants and processes, cloud classification from satellite images, microarray analysis, analysis of electrical signals from the brain, organization of and retrieval from large document collections, and analysis and visualization of large collections of statistical data. SOMs have also been applied to studies in the field of QSAR. The fundamental premise of QSAR studies is that structurally related (similar) compounds will have similar properties. Determining similarity is a complex task, and many methods exist, such as principal component analysis and hierarchical cluster analysis. In QSAR studies (Guha, 2004), SOMs are used to choose a subset of molecular descriptors, to reduce the dimensionality of a dataset by visualizing it as a graphical lower-dimensional display, and to reduce the amount of data by representing them with a smaller number of models ordered on a discrete map lattice.
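A minimal sketch of SOM training on a one-dimensional map lattice is given below. It is only an illustration of the idea of representing data by a small number of ordered models: the number of units, the linearly decaying learning rate and neighbourhood radius, the Gaussian neighbourhood function, and the toy descriptor values (assumed to be scaled to the range 0 to 1) are all assumptions.

```python
import math
import random

def best_matching_unit(units, x):
    """Index of the map unit whose weight vector is closest to x."""
    return min(range(len(units)),
               key=lambda i: sum((a - b) ** 2 for a, b in zip(units[i], x)))

def train_som(data, n_units=4, epochs=50, lr0=0.5, radius0=2.0, seed=0):
    """Train a 1-D self-organizing map; returns the unit weight vectors."""
    rng = random.Random(seed)
    dim = len(data[0])
    units = [[rng.random() for _ in range(dim)] for _ in range(n_units)]
    for epoch in range(epochs):
        frac = 1.0 - epoch / epochs
        lr = lr0 * frac                       # learning rate decays to zero
        radius = max(radius0 * frac, 0.5)     # neighbourhood shrinks over time
        for x in data:
            bmu = best_matching_unit(units, x)
            for i, unit in enumerate(units):
                # Units close to the winner on the lattice move the most.
                h = math.exp(-((i - bmu) ** 2) / (2 * radius ** 2))
                for d in range(dim):
                    unit[d] += lr * h * (x[d] - unit[d])
    return units

# Hypothetical compounds described by two descriptors scaled to [0, 1].
data = [(0.10, 0.20), (0.15, 0.10), (0.20, 0.15),
        (0.80, 0.90), (0.90, 0.85), (0.85, 0.80)]
units = train_som(data)
for x in data:
    print(x, "-> unit", best_matching_unit(units, x))
```

After training, each compound is represented by the index of its best-matching unit, so similar compounds tend to fall on the same or neighbouring positions of the lattice.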

Support vector machine (SVM)

SVMs are a set of related supervised learning methods that can be used for classification and regression, that is, for binary classification (pattern recognition) and real-valued function approximation (regression estimation) tasks. SVMs non-linearly map their n-dimensional input space into a high-dimensional feature space, in which a linear classifier is constructed. A special property of SVMs is that they simultaneously minimize the empirical classification error and maximize the geometric margin. SVMs have been applied to address challenging problems in QSAR analysis. The goal of QSAR analysis is to predict the bioactivity of molecules. Each molecule has many potential descriptors that may be highly correlated with each other or irrelevant to the target bioactivity (Burbidge, 2001). The bioactivity is known for only a few molecules. These issues make model validation challenging and overfitting easy. The results of SVMs are somewhat unstable: small changes in the training and validation data or in the model parameters may produce rather different sets of nonzero-weight attributes.
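The trade-off between classification error and margin can be made concrete with the standard soft-margin objective for a linear classifier, 0.5*||w||^2 + C * (sum of hinge losses). The sketch below only evaluates this objective for two hand-picked candidate hyperplanes. It does not train an SVM, and the data points, candidates, and values of C are hypothetical.

```python
def svm_objective(w, b, data, C=1.0):
    """Soft-margin objective: 0.5*||w||^2 + C * sum of hinge losses."""
    norm_sq = sum(wi * wi for wi in w)
    hinge = sum(
        max(0.0, 1.0 - y * (sum(wi * xi for wi, xi in zip(w, x)) + b))
        for x, y in data
    )
    return 0.5 * norm_sq + C * hinge

def geometric_margin(w):
    """Distance from the separating hyperplane to the margin, 1/||w||."""
    return 1.0 / sum(wi * wi for wi in w) ** 0.5

# Hypothetical compounds: label +1 (active) or -1 (inactive).
data = [((2.0, 2.0), 1), ((3.0, 3.0), 1), ((0.0, 0.0), -1), ((1.0, 0.0), -1)]

narrow = ((1.0, 1.0), -3.0)   # no margin violations, smaller margin
wide = ((0.5, 0.5), -1.5)     # larger margin, one point inside the margin

for C in (1.0, 10.0):
    print(C, svm_objective(*narrow, data, C=C), svm_objective(*wide, data, C=C))
# C = 1.0  -> 1.0 and 0.75  (the wider margin wins)
# C = 10.0 -> 1.0 and 5.25  (the margin violation now costs too much)
```

The parameter C therefore controls how strongly training errors are penalized relative to a wide margin, which is one of the model parameters to which the results can be sensitive.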

Decision Forest

Decision Forest is a decision support tool that uses a graph or model of decisions and their possible consequences. Decision forest models often reach a degree of accuracy that cannot be obtained using a large, single-tree model. They are as easy to create as single-tree models, they can be applied to both regression and classification, and the stochastic (randomization) element in the decision tree forest algorithm makes them highly resistant to overfitting. A decision forest can handle hundreds or thousands of predictor variables. It is a novel pattern recognition method that combines the results of multiple distinct but comparable decision tree models to reach a consensus prediction. A decision forest model is developed using a structurally diverse training data set of compounds whose activity has been tested. The model is subsequently validated using a test data set of selected compounds and then applied to a large data set of compounds as a screening tool.
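The consensus idea can be sketched with a very small forest. In this illustration each "tree" is reduced to a one-split decision stump trained on a bootstrap resample with a randomly chosen descriptor, and the forest predicts by majority vote. This is a simplified stand-in for the decision forest method described above, not a reimplementation of it, and the data, number of trees, and function names are assumptions.

```python
import random

def train_stump(sample, feature):
    """Best single threshold on one descriptor; returns a predict function."""
    values = sorted({x[feature] for x, _ in sample})
    candidates = [(a + b) / 2 for a, b in zip(values, values[1:])] or [values[0]]
    best = None
    for t in candidates:
        for high_label in (0, 1):      # which class lies above the threshold
            errors = sum(
                (high_label if x[feature] > t else 1 - high_label) != y
                for x, y in sample
            )
            if best is None or errors < best[0]:
                best = (errors, t, high_label)
    _, t, high_label = best
    return lambda x: high_label if x[feature] > t else 1 - high_label

def train_forest(data, n_trees=25, seed=0):
    """Train stumps on bootstrap resamples, each with a random descriptor."""
    rng = random.Random(seed)
    n_features = len(data[0][0])
    trees = []
    for _ in range(n_trees):
        sample = [rng.choice(data) for _ in data]   # bootstrap resample
        feature = rng.randrange(n_features)         # random descriptor
        trees.append(train_stump(sample, feature))
    return trees

def predict(trees, x):
    """Consensus prediction: majority vote over all trees."""
    votes = sum(tree(x) for tree in trees)
    return 1 if 2 * votes > len(trees) else 0

# Hypothetical training set: (descriptors, class), 0 = inactive, 1 = active.
data = [((1.0, 0.20), 0), ((1.5, 0.10), 0), ((2.0, 0.30), 0),
        ((2.5, 0.25), 0), ((1.2, 0.15), 0), ((2.2, 0.05), 0),
        ((6.0, 0.80), 1), ((6.5, 0.90), 1), ((7.0, 0.70), 1),
        ((7.5, 0.95), 1), ((6.2, 0.85), 1), ((7.2, 0.75), 1)]
trees = train_forest(data)
print(predict(trees, (1.8, 0.2)), predict(trees, (6.8, 0.9)))
```

Because each stump sees a different resample and descriptor, an occasional poor stump is outvoted by the others, which is the source of the robustness attributed to forests.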

Partial least squares

Partial least squares projection to latent structures (PLS) is a robust multivariate generalized regression method that uses projections to summarize multitudes of potentially collinear variables (Waterbeemd, 2008). Multivariate statistics is a set of statistical tools for analysing data matrices (e.g., chemical and biological) using regression and/or pattern recognition techniques. The PLS regression technique is especially useful in the quite common case where the number of descriptors (independent variables) is comparable to or greater than the number of compounds (data points) and/or there exist other factors leading to correlations between variables (Khlebnikov, 2007). Among the many methodologies used in QSAR modeling, PLS handles collinear input data and places no restriction on the number of variables used. PLS leads to stable, correct and highly predictive models even for correlated descriptors (Gieleciak 2007).
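To show how projection onto a latent variable copes with collinear descriptors, the sketch below fits a one-component PLS model for a single response (often called PLS1). It is a minimal illustration only: practical PLS uses several components chosen by cross-validation, and the function name and toy data (two perfectly collinear descriptors) are assumptions.

```python
def fit_pls1(X, y):
    """One-component PLS regression for a single response.

    Returns a function that predicts y for a new descriptor vector.
    """
    n, p = len(X), len(X[0])
    x_mean = [sum(row[j] for row in X) / n for j in range(p)]
    y_mean = sum(y) / n
    Xc = [[row[j] - x_mean[j] for j in range(p)] for row in X]   # centred X
    yc = [v - y_mean for v in y]                                 # centred y
    # Weight vector: direction in descriptor space that covaries most with y.
    w = [sum(Xc[i][j] * yc[i] for i in range(n)) for j in range(p)]
    norm = sum(v * v for v in w) ** 0.5
    w = [v / norm for v in w]
    # Scores: projection of each compound onto the latent direction.
    t = [sum(Xc[i][j] * w[j] for j in range(p)) for i in range(n)]
    # Regress the response on the single latent variable.
    q = sum(ti * yi for ti, yi in zip(t, yc)) / sum(ti * ti for ti in t)

    def predict(x):
        score = sum((x[j] - x_mean[j]) * w[j] for j in range(p))
        return y_mean + q * score

    return predict

# Hypothetical data: the second descriptor is exactly twice the first, so
# ordinary least squares on both descriptors has no unique solution.
X = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0), (5.0, 10.0)]
y = [3.0, 6.0, 9.0, 12.0, 15.0]          # y = 3 * first descriptor
predict = fit_pls1(X, y)
print(predict((6.0, 12.0)))              # close to 18
```

The two collinear descriptors are summarized by one latent variable, so the model remains well defined where a direct regression on both descriptors would not be.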

Multiple linear regression

This is a mathematical technique used in both fundamental and technical analysis. It uses a number of variables to predict some unknown variable. In statistics, regression analysis examines the relation of a dependent variable (response variable) to specified independent variables (predictors) (Papa, 2007). It assumes that the underlying relationship is linear and that any deviation from linearity is normally distributed (a parametric assumption). It also assumes that the drug properties are real numbers and that they are independent of each other, so that the effect of one variable does not depend on the other variables. In many QSAR problems, however, it is desirable to learn relationships that are non-linear.
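As a worked illustration, a multiple linear regression of the form y = b0 + b1*x1 + b2*x2 can be fitted by solving the least-squares normal equations (X^T X) b = X^T y. The sketch below does this with plain Gaussian elimination. The function names and the noise-free toy data are assumptions, and a statistics package would normally be used instead.

```python
def solve(a, b):
    """Solve the square system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]    # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):                      # back substitution
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x

def fit_mlr(X, y):
    """Least-squares coefficients [b0, b1, ...] for y ~ b0 + sum(b_j * x_j)."""
    rows = [[1.0] + list(r) for r in X]               # intercept column
    p = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * t for r, t in zip(rows, y)) for i in range(p)]
    return solve(xtx, xty)

# Hypothetical data generated from y = 1 + 2*x1 + 3*x2.
X = [(1.0, 2.0), (2.0, 1.0), (3.0, 4.0), (4.0, 3.0), (5.0, 5.0)]
y = [9.0, 8.0, 19.0, 18.0, 26.0]
coefficients = fit_mlr(X, y)
print([round(c, 6) for c in coefficients])
```

The normal equations have a unique solution only when the descriptors are not collinear, which is why the independence assumption above matters in practice.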

K-means clustering

K-means clustering is an alternative to the hierarchical method. It is a top-down approach and is useful if there is prior knowledge about the number of clusters that should be represented in the data. In k-means clustering, objects are partitioned into a fixed number (k) of clusters, such that the clusters are internally similar but externally dissimilar (Mutihac 2008). The process is as follows. All initial objects are randomly assigned to one of k clusters (where k is pre-specified). With k = 2, for example, the data are partitioned into two groups. An average expression vector is calculated for each cluster, and this is used to compute the distances between clusters. Objects are then reassigned and the expression vectors for each cluster are recalculated. The k-means clustering algorithm uses an interchange (or switching) method to divide n data points into k groups (clusters), where k is known before clustering. The results of k-means clustering depend on the order of the rows in the input data, the k-means initialization options, and the number of iterations for minimizing distance. The k-means approach involves an NP-hard problem (combinatorial explosion).
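The iterative procedure can be sketched as below. Note one deliberate difference from the description above: instead of randomly assigning every object to a cluster, this sketch starts from k randomly chosen data points as initial cluster centres, a common variant. The function names, seed, and toy points are assumptions for illustration.

```python
import random

def dist2(a, b):
    """Squared Euclidean distance between two points."""
    return sum((u - v) ** 2 for u, v in zip(a, b))

def kmeans(points, k, max_iter=100, seed=0):
    """Partition points into k clusters; returns (labels, centroids)."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest centroid.
        new_labels = [min(range(k), key=lambda c: dist2(p, centroids[c]))
                      for p in points]
        if new_labels == labels:
            break                      # no point changed cluster: converged
        labels = new_labels
        # Update step: recompute each centroid as the mean of its members.
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids

# Hypothetical data: two well-separated groups of three points each.
points = [(1.0, 1.0), (1.5, 2.0), (1.2, 0.8),
          (8.0, 8.0), (9.0, 8.5), (8.5, 9.5)]
labels, centroids = kmeans(points, k=2)
print(labels)
print(centroids)
```

Which group receives label 0 depends on the random initialization, which illustrates why k-means results can vary with the initialization options.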

Principal component analysis

PCA (closely related to SVD, or singular value decomposition, by which it is commonly computed) is an exploratory technique used to visually estimate the number of clusters represented in the data. PCA is a powerful technique for the analysis of QSAR modeling data when used with other classification techniques such as k-means or SOM (Ivanenkov, 2009).
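As a small illustration of what PCA computes, the sketch below finds the first principal component of a data set by power iteration on the sample covariance matrix, together with the fraction of total variance it explains. Power iteration is used here only to keep the example dependency-free. Libraries usually compute PCA through an SVD, and the function name, iteration count, and toy data are assumptions.

```python
def first_principal_component(rows, iterations=100):
    """First principal component and the fraction of variance it explains."""
    n, p = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    centred = [[r[j] - means[j] for j in range(p)] for r in rows]
    # Sample covariance matrix of the descriptors.
    cov = [[sum(r[i] * r[j] for r in centred) / (n - 1) for j in range(p)]
           for i in range(p)]
    # Power iteration; assumes the start vector is not orthogonal to the
    # leading eigenvector and that the leading eigenvalue is unique.
    v = [1.0] + [0.0] * (p - 1)
    for _ in range(iterations):
        v = [sum(cov[i][j] * v[j] for j in range(p)) for i in range(p)]
        norm = sum(x * x for x in v) ** 0.5
        v = [x / norm for x in v]
    eigenvalue = sum(v[i] * cov[i][j] * v[j]
                     for i in range(p) for j in range(p))
    explained = eigenvalue / sum(cov[i][i] for i in range(p))
    return v, explained

# Hypothetical data: two descriptors that vary mostly along the diagonal.
rows = [(1.0, 1.0), (2.0, 2.0), (3.0, 3.0), (4.0, 4.0), (2.0, 3.0), (3.0, 2.0)]
pc, explained = first_principal_component(rows)
print(pc, explained)
```

Projecting each compound onto the first one or two components gives the low-dimensional view in which clusters can be inspected visually.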

TSAR

Tsar is a fully integrated quantitative structure-activity relationship (QSAR) package for library design and lead optimization. Tsar can be used throughout drug discovery, from initial compound selection for primary screening to reagent selection and creation of focused libraries for lead optimization. Tsar's easy-to-use chemical spreadsheet interface is equally accessible to medicinal chemists, computational chemists and project team leaders (Ivanenkov, 2009).

Typical applications of Tsar:
• Exploring physicochemical properties, 2D or 3D, to understand which promote activity
• Reagent selection by sampling substituent or reagent properties
• Designing combinatorial libraries by focusing on desired product properties
• Similar or diverse subset selection
• Developing predictive models of activity

Advantages of Tsar

Tsar accelerates the design and selection of single compounds and libraries for screening. It lets you improve activity and eliminate undesirable properties in lead optimization.

Conclusion

QSAR model prediction depends on good descriptor selection, because similar molecules with a slight variation in their structure can have quite different biological activity. This relationship between molecular structure and change in biological activity is the central focus of QSAR modeling. Correlation analysis and relevance analysis approaches deal with this task efficiently. The criterion used for QSAR modeling was the biological-activity relationship within the selected descriptors, which was achieved efficiently with the descriptors of a set of 64 drugs and their experimentally derived intestinal absorption (%) values, showing 95% correlation. The application of QSAR modeling is currently at a developmental phase. After comparing performance and predictability in high throughput data mining, researchers from multiple groups have consistently. Improper descriptor selection will not lead to reasonable predictions. Correct descriptor selection relies on understanding the computational problem that one is trying to solve. Regression analysis and clustering in the field of QSAR modeling have a crucial impact on modern drug discovery processes. The applications of QSAR modeling in drug discovery, such as compound selection, virtual library generation, virtual high-throughput screening, HTS data mining, and in silico ADMET, have made it a center of study.

References

1. Jun Xu (2002). Chemoinformatics and Drug Discovery. Molecules. 7(8):566-600.

2. Niculescu SP (2003). Artificial neural networks and genetic algorithms in QSAR. J. Molecular Structure: Theochem. 622(1-2):71-83.

3. Kustrin SA et al. (2001). ANN modeling of the penetration across a polydimethylsiloxane membrane from theoretically derived molecular descriptors. J. Pharmaceutical and Biomedical Analysis. 26(2):241-254.

4. Guha R et al. (2004). Generation of QSAR sets with a self-organizing map. J. Molecular Graphics and Modelling. 23(1):1-14.

5. Burbidge R et al. (2001). Drug Design by Machine Learning: Support Vector Machine for Pharmaceutical Data Analysis. Computers and Chemistry. 26(1):5-14.

6. Waterbeemd HVD et al. (2008). Glossary of Terms Used in Computational Drug Design (IUPAC Recommendations 1997). Annual Reports in Medicinal Chemistry. 33:397-409.

7. Khlebnikov AI et al. (2007). Improved Quantitative Structure-Activity Relationship Models to Predict Antioxidant Activity of Flavonoids in Chemical, Enzymatic, and Cellular Systems. Bioorg. Med. Chem. 15(4):1749-1770.

8. Gieleciak R and Polanski J (2007). Modeling Robust QSAR. 2. Iterative Variable Elimination Schemes for CoMSA: Application for Modeling Benzoic Acid pKa Values. J. Chem. Inf. Model. 47:547-556.

9. Papa E et al. (2007). Linear QSAR regression models for the prediction of bioconcentration factors by physicochemical properties and structural theoretical molecular descriptors. Chemosphere. 67(2):351-358.

10. Mutihac L and Mutihac R (2008). Mining in chemometrics. Analytica Chimica Acta. 612(1):1-18.

11. Ivanenkov YA (2009). Computational mapping tools for drug discovery. Drug Discovery Today. 14:767-775.