QSAR itq21 param


  • 8/14/2019 QSAR itq21 param


    Prediction of ITQ-21 Zeolite Phase Crystallinity: Parametric Versus Non-parametric Strategies

    Laurent A. Baumes, Manuel Moliner, Avelino Corma*

    Instituto de Tecnología Química (UPV-CSIC), Av. de los Naranjos, E-46022 Valencia, Spain. E-mail: [email protected]

    Keywords: Data mining, High-throughput, ITQ-21 zeolite, Parametric, Regression, Statistics

    Received: May 8, 2006; Accepted: July 13, 2006

    DOI: 10.1002/qsar.200620064

    Abstract

    This work deals with data analysis techniques and high-throughput tools for the synthesis and characterization of solid materials. In previous studies, it was found that the final properties of materials could be successfully modeled using learning systems. Machine learning algorithms such as neural networks, support vector machines, and regression trees are non-parametric strategies. They are compared here to traditional parametric statistical approaches. We review a wide range of statistical methodologies, and all the methods are evaluated using experimental data derived from an exploration-optimization of the material ITQ-21. The results are judged on the numerical prediction of phase crystallinity. We discuss the theoretical aspects of such statistical techniques, which make them an attractive alternative to other learning strategies for modeling the properties of solids. Advantages and drawbacks are highlighted. We show that such approaches, by offering broad solutions, can reach high-level performance while offering ease of use, comprehensibility, and control. Finally, we shed light on both the interpretation and the stability of results, which remain the main drawbacks of the majority of machine learning methodologies when trying to retrieve knowledge from the data treatment.

    1 Introduction

    Molecular sieves, and more specifically zeolites, are materials of considerable interest in gas adsorption and separation, in catalysis, and for electronic and medical uses [1, 2]. Recent research work from different groups has contributed to the understanding of the synthesis mechanism, as well as to the discovery of new zeolitic structures [3–11]. The discovery of new structures, the enlargement of the synthesis space, and the optimization of existing structures require a considerable experimental effort. This effort can be reduced by using High-Throughput (HT) synthesis and characterization techniques [12–14], since the number of samples that can be processed is tremendously increased, and consequently so is the number of parameters that can be explored simultaneously. Thus, the discovery of new materials or the coverage of a phase diagram may be strongly accelerated. Advanced strategies that optimize the retrieval of knowledge from experiments while keeping their number at a reasonable level are a critical part of the discovery and optimization processes. Numerous Machine Learning (ML) techniques have been successfully applied to model experimental data obtained during the exploration of multi-component materials. However, the synthesis of zeolitic materials through HT experimentation has received less attention, and fewer studies have been reported. The models make it possible to predict the properties of unsynthesized materials (so-called virtual solids), taking their expected compositions or preparation conditions as input variables. Among the different ML techniques, Neural Networks (NNs) have often yielded the best modeling results. They have been applied to model and predict the catalytic performance of libraries for a variety of reactions; selected examples are the water–gas shift reaction [15], the oxidative dehydrogenation of ethane [16], the oxidative dehydrogenation of propane to propene [17], and the oxidation of propene to propene oxide [18]. However, NNs may suffer from overfitting and from reproducibility problems; therefore, there is still a need to use or even develop other techniques. In this sense, Support Vector Machines (SVMs) can be a suitable method for overcoming the pitfalls of NNs when they occur, and a first comparison has recently been made for the design of catalysts and materials [19, 20]. In this case, even if overfitting is largely avoided, the interpretation of results remains difficult when complex kernel functions are used.

    QSAR Comb. Sci. 00, 0000, No.&, 1 18 2006 WILEY-VCH Verlag GmbH & Co. KGaA, Weinheim &1&

    These are not the final page numbers!

    Full Papers


    ML methods do not assume any parametric form for the model; they belong to the set of distribution-free methods. Instead of starting with assumptions about a particular problem, ML uses a toolbox approach in order to identify the correct model structure directly from the available data. One of the main consequences is that these methods typically require larger datasets than parametric statistics. In the materials science domain, even when HT techniques are used, the number of examples remains low (i.e., fewer than 150), which makes it very difficult for non-parametric procedures to prevent overfitting. Since the 1990s, a large number of publications have appeared that use only such ML methods, while traditional parametric statistics has remained relatively neglected, as emphasized in [21]. In [22], the authors make use of traditional statistical analysis while examining a split-plot design, and very recently a new hybrid statistical methodology has been proposed [23], which combines evolutionary algorithm operators with a statistical criterion for optimizing the structure characterization of a given search space, taking into account an a priori limited number of experiments to be conducted.

    This work, which deals with data analysis techniques and HT tools for the synthesis and characterization of solid materials, aims at showing that statistics can enable a better interpretation of results while achieving a similar quality of performance and avoiding the pitfalls of ML. We review a wide range of statistical methodologies and discuss the theoretical aspects of such techniques, which make them an attractive alternative to the other learning strategies for modeling the properties of solids. Advantages and drawbacks are highlighted. We show that such statistical approaches, by offering broad solutions, can reach high-level performance while offering ease of use, comprehensibility, and control. Finally, we shed light on both the interpretation and the stability of results, which remain the major drawbacks of the black-box learning methodologies.

    All the methods are evaluated here using experimental data derived from the exploration/optimization of the synthesis of a zeolitic material (ITQ-21) [24]. ITQ-21 is a zeolite with a three-dimensional pore network containing 1.18-nm-wide cavities, each of which is accessible through six circular 0.74-nm-wide windows. The structure is shown in Figure 1a. We have chosen this system because ITQ-21 is one of the most interesting large-pore zeolites: it combines the catalytic properties of USY zeolites with a higher diffusivity and a lower rate of catalyst deactivation. There is therefore an incentive to better understand and optimize the synthesis and chemical composition of this material. The results are judged on the numerical prediction of phase crystallinity.

    2 Experimental Section

    2.1 Synthesis Experimental Data

    A large number of parameters govern the hydrothermal crystallization of microporous materials, determining which phases are formed and their crystallization kinetics. In this study, a detailed exploration of the hydrothermal synthesis in the system SiO2/GeO2/Al2O3/F−/H2O/N(16)-methylsparteinium (MSPT) has been performed in order to understand the effect of these factors on the growth of ITQ-21. The synthesis variables have been selected to cover the broadest range of the most promising parameter space based on previous experience, while keeping the total number of experiments within a feasible and reasonable range. The five synthesis variables and their respective values are: Si/Ge ∈ {15, 20, 25, 50}, Al/(Si + Ge) ∈ {0.02, 0.04, 0.06}, MSPT/(Si + Ge) ∈ {0.25, 0.5}, H2O/(Si + Ge) ∈ {2, 5, 10}, and time (days) ∈ {1, 5}. A sixth variable, F−/(Si + Ge), is always maintained


    Figure 1. a) Structure of the ITQ-21 zeolite. b) Standard diffractogram of the ITQ-21 zeolite.


    equal to the MSPT amount to obtain a neutral pH. The distribution of experiments comes from a full factorial design with Si/Ge, H2O/(Si + Ge), (MSPT and F−)/(Si + Ge), Al/(Si + Ge), and the synthesis duration, noted t, at 4, 3, 2, 3, and 2 levels, respectively. Therefore, each experiment is one combination among the 144 possible (4 × 3 × 2 × 3 × 2). All the experiments were carried out in a random order.

    The reagents employed for gel syntheses were ammonium fluoride (98%, Aldrich), germanium oxide (99.998%, Aldrich), aluminum isopropoxide (98%, Aldrich), methyl-sparteine, LUDOX (AS 40 wt%, Aldrich), MilliQ water (Millipore), and N(16)-methyl-sparteinium hydroxide. Automated gel synthesis was done inside Teflon vials (3 mL), which were then inserted in a multi-autoclave of 15 positions, sealed with a Teflon-lined stainless-steel tip, and allowed to crystallize at 175 °C. The samples were then washed and filtered in parallel and dried at 100 °C overnight. Finally, the samples were weighed and characterized by XRD using a multi-sample Philips X'Pert diffractometer employing Cu Kα radiation. The standard X-ray diffractogram for ITQ-21 is shown in Figure 1b. The occurrence and crystallinity were calculated by integrating the area of the characteristic peaks. For ITQ-21, the 2θ range lies between 25.7° and 26.5°.
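    The factorial design described in this section can be enumerated programmatically. The following is a minimal sketch in Python, not the authors' tooling: the factor levels are those listed in the text, while the dictionary keys, the function names, and the random seed are illustrative assumptions.

```python
from itertools import product
import random

# Factor levels as listed in the text (ratios are relative to Si + Ge).
LEVELS = {
    "Si/Ge": [15, 20, 25, 50],
    "Al/(Si+Ge)": [0.02, 0.04, 0.06],
    "MSPT/(Si+Ge)": [0.25, 0.5],  # F/(Si+Ge) is kept equal to this value
    "H2O/(Si+Ge)": [2, 5, 10],
    "time_days": [1, 5],
}


def full_factorial(levels):
    """Return every combination of factor levels as a list of dicts."""
    names = list(levels)
    return [dict(zip(names, combo)) for combo in product(*levels.values())]


def randomized_run_order(design, seed=0):
    """Return a shuffled copy of the design (the seed is an assumption)."""
    runs = list(design)
    random.Random(seed).shuffle(runs)
    return runs


design = full_factorial(LEVELS)
runs = randomized_run_order(design)
print(len(design))  # 4 * 3 * 2 * 3 * 2 combinations
```

    Shuffling the run order, as the authors did, guards against confounding a time-dependent drift with one of the factors.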

    2.2 Computational Methods

    In regression problems, the objective is to estimate the value of a continuous output variable, which in our case is the crystallinity of a given phase, from input variables such as the synthesis parameters. All the techniques used in this study are briefly described below, except NNs, which have already received considerable attention; see [15–20] for recent applications in materials science and [25, 26] for more technical explanations. In order to provide a fair comparison between the different techniques investigated, 28% of the data, chosen randomly among the whole available dataset of 144 distinct experiments, is kept unused for the evaluation of model generalization.
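    As an illustration of this hold-out scheme, the sketch below sets aside 40 of the 144 experiments (about 28%) at random. The test-set size of 40 is taken from the partition described in Section 4; the function name and the seed are assumptions, and the authors' actual random selection is not known.

```python
import random


def holdout_split(n_samples, n_test, seed=0):
    """Return (train_idx, test_idx) index lists for a random hold-out split."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return sorted(indices[n_test:]), sorted(indices[:n_test])


train_idx, test_idx = holdout_split(n_samples=144, n_test=40)
test_share = round(100 * len(test_idx) / 144)  # percentage kept for testing
print(len(train_idx), len(test_idx), test_share)
```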

    2.2.1 Multiple Linear Regression (MLR)

    An MLR model specifies the relationship between one dependent variable $y$ and a set of predictor variables $X$, so that $y = b_0 + \sum_{i=1}^{k} b_i x_i$, where $b_i$ are the regression coefficients.
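    A minimal sketch of how the coefficients $b_i$ of such a model can be estimated by ordinary least squares (via the normal equations), written in plain Python. The toy data are invented for illustration and are not the ITQ-21 measurements; the helper names are hypothetical.

```python
def solve_linear_system(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):  # back substitution
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


def fit_mlr(rows, y):
    """Least-squares estimate of [b0, b1, ..., bk] via the normal equations."""
    design = [[1.0] + list(row) for row in rows]  # prepend intercept column
    k = len(design[0])
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(k)] for i in range(k)]
    xty = [sum(d[i] * yi for d, yi in zip(design, y)) for i in range(k)]
    return solve_linear_system(xtx, xty)


def predict_mlr(coeffs, row):
    """Evaluate y = b0 + sum_i b_i * x_i for one sample."""
    return coeffs[0] + sum(b * x for b, x in zip(coeffs[1:], row))


# Toy data generated from y = 1 + 2*x1 - 3*x2 (not the ITQ-21 data).
X = [(0, 0), (1, 0), (0, 1), (1, 1), (2, 1)]
y = [1 + 2 * x1 - 3 * x2 for x1, x2 in X]
coeffs = fit_mlr(X, y)
```

    Because nothing constrains the linear predictor, `predict_mlr` can return negative values, which is the limitation of MLR for crystallinity discussed in Section 3.2.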

    2.2.2 Generalized Linear Model (GLZ)

    GLZ can be used to predict responses both for dependent variables with discrete distributions and for dependent variables that are non-linearly related to the predictors. GLZ differs from the linear model in two major aspects: (i) the distribution of the dependent variable can be explicitly non-normal, and (ii) the dependent variable values are predicted from a linear combination of predictor variables, which are connected to the dependent variable via a function called the link function. The relationship in GLZ is assumed to be $y = g(b_0 + b_1 x_1 + \dots + b_k x_k) + e$, where $e$ stands for the error variability. The inverse function $g^{-1} = f$ is the link function, so that $f(\tilde{y}) = b_0 + \sum_{i=1}^{k} b_i x_i$, where $\tilde{y}$ stands for the expected value of $y$. For additional information about GLZ, see [27, 28].
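    The role of the link function can be illustrated with a short sketch. Assuming a log link ($f = \ln$, hence $g = \exp$), predictions stay positive whatever the sign of the linear predictor, whereas the identity link of MLR can return negative values. The coefficients below are hypothetical and the function names are illustrative.

```python
import math


def linear_predictor(coeffs, x):
    """Compute b0 + sum_i b_i * x_i."""
    return coeffs[0] + sum(b * xi for b, xi in zip(coeffs[1:], x))


def predict_identity_link(coeffs, x):
    """MLR-style prediction: the expected value is the linear predictor itself."""
    return linear_predictor(coeffs, x)


def predict_log_link(coeffs, x):
    """Log link f = ln, so the expected value is g(eta) = exp(eta) > 0."""
    return math.exp(linear_predictor(coeffs, x))


coeffs = [1.0, -0.5]  # hypothetical b0 and b1
identity_value = predict_identity_link(coeffs, [4.0])  # 1 - 0.5 * 4
log_value = predict_log_link(coeffs, [4.0])            # exp(1 - 0.5 * 4)
```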

    2.2.3 Piecewise Linear Regression (PLR)

    This model specifies a common intercept $b_0$ and a slope that is equal to $b_1$ if $y \le 100$ and to $b_2$ otherwise. Considering a problem with only two variables, the model is $y = b_0 + b_1 x\,(y \le 100) + b_2 x\,(y > 100)$, where each parenthesized condition evaluates to 1 when it is true and to 0 otherwise. Stepwise model-building techniques for regression designs with a single dependent variable are described in numerous sources [29, 30].

    2.2.4 SVMs as Regression Tool

    A general introduction to SVMs was already presented in [20]. With ε-SV regression [31], the goal is to find a function f(x) that deviates by at most ε from the targets y_i for all the training data and, at the same time, is as flat as possible. Formally, the problem is written as a convex optimization problem.
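    The ε-insensitive idea can be written down in a few lines. This sketch only shows the loss around which ε-SV regression is built; it does not solve the convex optimization problem, and the numeric values are illustrative assumptions.

```python
def epsilon_insensitive_loss(y_true, y_pred, epsilon):
    """Return 0 inside the epsilon tube, and the excess deviation outside it."""
    return max(0.0, abs(y_true - y_pred) - epsilon)


# Hypothetical crystallinity values with a tolerance of 1 unit.
inside_tube = epsilon_insensitive_loss(10.0, 10.5, epsilon=1.0)
outside_tube = epsilon_insensitive_loss(10.0, 13.0, epsilon=1.0)
```

    Errors smaller than ε cost nothing, so only the training points lying on or outside the tube influence the fitted function.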

    2.2.5 Regression Trees (RTs)

    Regression trees may be considered as a variant of decision trees, designed to approximate real-valued functions instead of being used for classification tasks. An RT is built through a process known as binary recursive partitioning. This is an iterative process of splitting the data into partitions, and then splitting each of the resulting branches further. In our experiments the classical C&RT [32] tree is used.
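    One step of binary recursive partitioning can be sketched as follows, assuming the usual least-squares criterion for regression trees (the split that minimizes the summed squared error of the two branches). The data are invented; a full tree would apply the same search recursively to each branch and to every input variable.

```python
def sse(values):
    """Sum of squared deviations from the mean (0 for an empty list)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def best_split(x, y):
    """Return (threshold, total_sse) of the best binary split on one variable."""
    pairs = sorted(zip(x, y))
    best = (None, sse(y))  # no split at all is the baseline
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # identical x values cannot be separated
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [yy for xx, yy in pairs if xx <= threshold]
        right = [yy for xx, yy in pairs if xx > threshold]
        total = sse(left) + sse(right)
        if total < best[1]:
            best = (threshold, total)
    return best


# Invented example: a water-content-like input and a crystallinity-like output.
x = [2, 2, 5, 5, 10, 10]
y = [80, 90, 20, 30, 0, 5]
threshold, total_sse = best_split(x, y)
```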

    3 Results of Parametric Statistics and Prediction of ITQ-21 Phase Crystallinity

    3.1. Experimental Results

    Figure 2 shows the effect of each synthesis variable on ITQ-21 crystallinity. ITQ-21 is favored by certain combinations of synthesis variables. The highest values of crystallinity appear in concentrated gels [H2O/(Si + Ge)


    Figure 2. Variation of ITQ-21 phase crystallinity with the studied variables.


    3.2. MLR and First Inspection of the Dataset

    In Figure 3, the MLR is calculated with the synthesis variables as input. The real ITQ-21 phase crystallinity is indicated on the y-axis, while the expected one is represented on the x-axis. The adjustment was R² = 0.61164 [F(5,138) = 43.46; p reported as 0.00000]. According to this method, 61.16% of the original variability has been explained, and (1 − R²) is the residual variability. Regression coefficients are given in Table 1, where highlighted values (gray background color) are significant. As indicated by the β values, Si/Ge and H2O/(Si + Ge) (variables 3 and 6, respectively) are the most important predictors of ITQ-21 phase crystallinity, and all are statistically significant (p


    ue is not statistically significant when the null hypothesis is false is called a Type II error. For more details on this aspect, see [23].

    Another way of looking at the unique contribution of each independent variable is to compute the partial and semi-partial correlations. In Table 1, the partial correlation is the correlation between the respective independent variable, adjusted by all other variables, and the dependent variable, also adjusted by all other variables. The semi-partial correlation is the correlation of the respective independent variable, adjusted by all other variables, with the raw dependent variable. The values of the partial and semi-partial correlations in Table 1 are relatively similar and confirm the trends observed with the β values. In Table 2, the partial correlation measures the correlation between two variables that remains after partialling out one other variable (marked in the table), while the correlation coefficient does not take such a control into account. It can be observed that the correlations and partial correlations between each variable and ITQ-21 crystallinity are quite similar. However, when the effect of H2O is partialled out (i.e., fifth column of partial correlations, in bold), the correlation between Si/Ge and ITQ-21 crystallinity decreases by ten points. A similar shift is observed for all the correlations when H2O is partialled out: the positive correlation (MSPT or F−) increases, while the negative partial correlations decrease. Surprisingly, an increase in H2O, which has an overall negative effect on ITQ-21 crystallinity, appears, when combined with the other variables, to dampen the negative relations and to weaken the only positive one.
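    The first-order partial correlation discussed above has a standard closed form, sketched below. As a worked example, and assuming that the synthesis variables are mutually uncorrelated (as would be expected for a balanced full factorial design, an assumption not stated by the authors), partialling H2O out of the Si/Ge correlation using the magnitudes printed in Table 2 (0.40 and 0.61) gives about 0.50, close to the tabulated 0.51.

```python
import math


def partial_correlation(r_xy, r_xz, r_yz):
    """Correlation of x and y after partialling out a single variable z."""
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz**2) * (1 - r_yz**2))


# Magnitudes from Table 2: x = Si/Ge, y = ITQ-21 crystallinity, z = H2O.
# r_xz = 0 encodes the assumed orthogonality of the design factors.
si_ge_given_h2o = partial_correlation(r_xy=0.40, r_xz=0.0, r_yz=0.61)
```

    Under this assumption the formula reduces to r_xy / sqrt(1 - r_yz²), so removing a variable that is strongly related to crystallinity inflates the magnitude of every remaining partial correlation, whatever its sign.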

    Moreover, the three variables that have the greatest influence on the formation of ITQ-21 are H2O, Si/Ge, and the Al content. For the levels chosen in the present work, the water content is the variable that has the largest influence on ITQ-21 crystallinity. This phase prefers concentrated gels with H2O/(Si + Ge) values lower than 5. This may also indicate that a high concentration of F− has a positive effect on crystallization. The Ge content in the framework of ITQ-21 is a critical factor. When the Ge content of the starting gel decreases, the rate of crystallization of ITQ-21 decreases, and for high values of Si/Ge (>30), only small amounts of ITQ-21 (low crystallinity) are obtained with the set of times and the temperature reported here. Finally, the other factor that is statistically interesting is the Al content. The highest values of crystallinity have been obtained at low levels of Al. The reason is that the number of negative framework charges introduced by Al, which have to be neutralized by the Organic Structure Directing Agent (OSDA), is limited: the OSDA also has to compensate the F− located within the double four-membered rings [33], and the void volume of the ITQ-21 structure can accommodate only a limited number of MSPT cations. It has to be noted that H2O has the largest effect on crystallization only within the chosen ranges of variation.

    The use of parametric procedures makes it possible to take advantage of the whole theory behind the model. However, the assumptions should always be verified first; otherwise the conclusions may not be accurate. For example, in MLR, it is assumed that the residuals are normally distributed. Many tests are robust with regard to violations of this assumption. The normal probability plot of residuals gives a quick indication of whether or not violations have occurred. If the observed residuals (plotted on the x-axis of Figure 4) are normally distributed, then all values should fall onto a straight line. If the residuals are not normally distributed, they will deviate from the line. Figure 4 shows a particular lack of fit: the data seem to form an S-shape around the line. This pattern typically indicates that the dependent variable may need a log-transformation to pull in the tails of the distribution.
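    The coordinates of a normal probability plot can be computed as in the following sketch, which pairs each sorted residual with the standard-normal quantile expected for its rank. The plotting positions (i − 0.5)/n are one common convention and are an assumption here, as are the residual values; the software used by the authors may use a different convention.

```python
from statistics import NormalDist


def normal_probability_points(residuals):
    """Pair each sorted residual with its expected standard-normal quantile."""
    n = len(residuals)
    ordered = sorted(residuals)
    # Plotting positions (i - 0.5) / n: one common convention (an assumption).
    quantiles = [NormalDist().inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    return list(zip(ordered, quantiles))


# Invented residuals, for illustration only.
points = normal_probability_points([0.3, -1.2, 0.1, 2.0, -0.4])
```

    Plotting the pairs and checking how closely they follow a straight line reproduces the visual test described in the text.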

    Another important step when building models is the detection of outliers. If one experiment is clearly an outlier, the regression line tends to be pulled toward it, so that different B coefficients would be found if the respective case were excluded. As mentioned before, such a deviation would be rather small compared to the consequences (overfitting) that might be observed using ML models. Figure 5 shows the deleted residual statistic, which is the standardized residual for the respective case that one would obtain if the case were excluded from the analysis. If the deleted residual differs markedly from the standardized residual, the regression analysis may be biased by the given case. No such case appears in our experimental dataset, and therefore the entire set is kept. Other tests, such as tests for heteroscedasticity, may also be of interest. Homoscedasticity is the assumption that the variability in scores for one variable is roughly the same at all values of the other variable. It is related to normality: when normality is not met, the variables are often heteroscedastic rather than homoscedastic. For example, the Goldfeld–Quandt test is applicable if heteroscedasticity is thought to be related to only one of the x variables. This test is of great interest, for example, if the heating system of a multi-channel reactor becomes erratic as the temperature increases, generating additional noise.

    Table 2. Partial correlations and correlation coefficients between all variables involved in the synthesis study.

    Variables    ITQ-21 partial correlation, by partialled-out variable    ITQ-21 correlation
                 Time    Si/Ge   Al      MSPT or F   H2O
    Time                 0.10    0.09    0.09        0.12                  0.09
    Si/Ge        0.40            0.41    0.41        0.51                  0.40
    Al           0.23    0.26            0.23        0.29                  0.23
    MSPT or F    0.11    0.12    0.11                0.13                  0.11
    H2O          0.62    0.67    0.63    0.62                              0.61

    The partial correlation measures the correlation between two variables that remains after controlling for (i.e., partialling out) one or more other variables. Gray cells contain significant values at p

    To summarize, the MLR fits the objective variable only moderately (~61%) and fails to keep the fitted ITQ-21 phase crystallinity from taking negative values. Moreover, the number of false positives is very high (i.e., the gray squares on the x-axis in Figure 3), and for the other experiments the phase crystallinity is greatly underestimated. Nevertheless, such a preliminary methodology has given us a first idea of how the dependent variable can be modeled and of its correlations with the synthesis variables through the estimates. In addition, this technique makes it possible to test assumptions that are too often accepted without being tested. Assumptions about the normality of residuals, the detection of outliers, the significance of variables, and others are of great help to the user in taking the first steps toward understanding how the underlying mechanism works. Addressing the normality assumption requires more complex methodologies that transform the dependent variable in order to respect the hypothesis of normal distribution while preventing negative predictions. The GLZ, as an extension of the MLR, is investigated below.

    3.3. Generalized Linear Model

    The construction of a GLZ starts by selecting an appropriate link function and response probability distribution. Two alternatives are investigated: fitting a distribution and choosing the corresponding link function, or only transforming the dependent variable through a chosen link function. There are many potential distributions (normal, exponential, Weibull, log-normal, Gamma, etc.) that could be used as a distributional model for the data. Therefore, two basic questions are addressed: (i) Does a given distributional model provide an adequate fit to the data? (ii) Does one distribution fit the data better than another? Goodness-of-Fit (GoF) tests provide a method to answer these two questions. The Kolmogorov–Smirnov (KS) test is chosen for the following reason: unlike the parametric t-test for independent samples, which tests for differences in the location (means) of two samples, the KS test is also sensitive to differences in the general shapes of the distributions in the two samples (i.e., differences in dispersion, skewness, etc.). The GoF tests confirm that either the Gamma or the log-normal distribution would provide a good model for these data. Finally, the best fit is obtained with the Gamma distribution, which is defined as $f(x) = (x/b)^{c-1} e^{-x/b} [b\,\Gamma(c)]^{-1}$, with $b > 0$ the scale parameter, $c > 0$ the so-called shape parameter, and $\Gamma$ the gamma function, given by $\Gamma(a) = \int_0^{\infty} t^{a-1} e^{-t}\,dt$. Here $b = 39.3$ and $c = 0.456$; see Figure 6. The corresponding link function for this distribution is the log function. Considering the second option proposed earlier, the normal probability plot of residuals has indicated that the observed residuals are not normally distributed. Since the data follow an S-shape pattern around the line, we have supposed that the dependent variable should be transformed into a new one such as $g(y) = \ln(y)$


    Figure 4. Normal probability plot of residuals for the ITQ-21 phase crystallinity linear model. This visualization makes it possible to check quickly whether the assumption of normally distributed residuals is respected. The tails show an S-shape pattern.

    Figure 5. Residuals vs. deleted residuals plot. This technique makes it possible to separate outliers from the dataset when they lie relatively far from the line.


    in order to pull in the tail of the distribution. This modification is handled through the log link function, which forces the predicted values to remain positive while the distribution of the dependent variable is still assumed to be normal. Figure 3 shows the predictions of ITQ-21 crystallinity for the GLZ model using the Gamma distribution with log link function, the GLZ model using the normal distribution assumption with log link function, and the MLR (i.e., normal distribution assumption and identity link function). GLZ clearly fits better than MLR, and the GLZ using the normal distribution and log link function remains the best. In this situation, compared to the Gamma assumption, more weight is indirectly given to non-null crystallinity values of the ITQ-21 phase, and therefore the variability of the response for high crystallinity values is narrower.
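    For reference, the Gamma density quoted in this section can be evaluated directly. The sketch below uses the scale and shape values reported in the text (b = 39.3, c = 0.456); the function and variable names are illustrative, and the derived mean b·c is a standard property of the Gamma distribution rather than a value reported by the authors.

```python
import math


def gamma_pdf(x, b, c):
    """Gamma density f(x) = (x/b)**(c-1) * exp(-x/b) / (b * Gamma(c)), x > 0."""
    if x <= 0:
        return 0.0
    return (x / b) ** (c - 1) * math.exp(-x / b) / (b * math.gamma(c))


B_SCALE, C_SHAPE = 39.3, 0.456  # fitted values reported in the text
fitted_mean = B_SCALE * C_SHAPE  # the mean of a Gamma distribution is b * c
density_low = gamma_pdf(1.0, B_SCALE, C_SHAPE)
density_high = gamma_pdf(10.0, B_SCALE, C_SHAPE)
```

    With a shape parameter below 1, the density decreases monotonically from zero, which is consistent with a dataset in which low crystallinity values are the most frequent.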

    The previous GLZ models were defined with only first-order effects, i.e., $b_i x_i$. However, in GLZ, more advanced configurations, such as factorial, fractional, polynomial, or quadratic models, or even some special user-defined effects, can be specified. In Figure 7, all the models are estimated and their respective predicted values of ITQ-21 crystallinity are plotted, with the corresponding effect estimates given in Table 3. The values of the parameters ($b_i$ and the scale parameter) in the GLZ are obtained by maximum likelihood estimation. Note that highlighted values correspond to statistically significant estimates for α = 0.5. On the basis of the estimated values and their significance under the different model forms, we can say that the MLR does not contain enough features to capture the underlying information, and consequently all the input variables are significant. On the contrary, the full factorial design takes into account too many variables; thus, the information is spread and smoothed over the numerous estimates. Finally, the models retained are the quadratic one for its overall performance, the fractional factorial to degree 2 for its simplicity, and the fractional factorial to degree 3 since it represents an intermediate solution. The relationship between the predictors and their interactive effects (e.g., two predictors masking the effects of a third) is much more complex. However, the conclusion drawn previously about the effect of the Si/Ge and Al contents, with or without considering the effect of water, is confirmed here through the inspection of the significance of the 2-way interaction effects.

    One can also make statistical inferences about the parameters using confidence intervals and hypothesis tests. The confidence interval for a specific statistic gives a range of values around the statistic within which the true value can be expected to lie with a given level of certainty (here the level is set to 90%). It is therefore possible to provide confidence intervals for predicted values. An example is given in Figure 8 for the best model found earlier, i.e., the quadratic response surface regression model. As the required confidence level decreases, the interval becomes narrower. These are examples of the numerous advantages of such parametric modeling.
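    The dependence of the interval width on the confidence level can be illustrated with a simple normal-approximation interval. This is a sketch only: the estimate and standard error are hypothetical, and the intervals shown in Figure 8 were not necessarily computed this way.

```python
from statistics import NormalDist


def normal_interval(estimate, std_error, confidence=0.90):
    """Two-sided interval estimate +/- z * std_error (normal approximation)."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return estimate - z * std_error, estimate + z * std_error


# Hypothetical predicted crystallinity of 50 with a standard error of 5.
low_90, high_90 = normal_interval(50.0, 5.0, confidence=0.90)
low_95, high_95 = normal_interval(50.0, 5.0, confidence=0.95)
```

    The 90% interval is strictly contained in the 95% interval: asking for less certainty yields a narrower range.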

    3.4. Piecewise Linear Regression

    The slope of a function at a particular point can be computed as the first-order derivative of the function at that point. The slope of the slope is the second-order derivative, which tells us how fast the slope is changing at the respective point, and in which direction. The quasi-Newton method, at each step, evaluates the function at different points in order to estimate the first-order and second-order derivatives. It uses this information to follow a path toward the minimum of the loss function. We have chosen the quasi-Newton method since, for most applications, it yields good performance. Other procedures that use various geometrical approaches to function minimization may be more robust, that is, they are less likely to converge on a local minimum and are less sensitive to bad starting values. However, special attention has been given to such parameters, and care was taken to ensure the reproducibility of the results. The loss function is the least-squares criterion, as in many other cases. In Figure 9 the predicted values of ITQ-21 crystallinity are plotted against the observed values. It is surprising to see that this very simple method, compared to all other approaches, gives a quasi-perfect fit of very low or even null crystallinity values. The equation of the PLR model with a breakpoint at 17.4582 is

    5.1988  0.1013 t  0.0244 Si/Ge  24.8966 Al/(Si + Ge)  4.0457 MSPT/(Si + Ge)  0.5039 H2O/(Si + Ge)

    and

    113.28  3.2668 t  2.1384 Si/Ge  557.013 Al/(Si + Ge)  26.9783 MSPT/(Si + Ge)  8.2507 H2O/(Si + Ge).


    Figure 6. Distribution fitting of ITQ-21 phase crystallinity with the Gamma function.


    One has to note that the breakpoint is defined on the dependent variable; therefore, in order to assign a value to a new experiment, it must first be determined on which side of the breakpoint the dependent variable will fall. For this purpose, a previous model can be used, or a classification algorithm with a two-class system defined by the threshold (i.e., the breakpoint). The final PLR efficiency therefore depends on this prior estimation. A quick classification using the quadratic model misclassified only six experiments.
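    The two-stage use of the PLR model described above (first decide on which side of the breakpoint a new experiment falls, then apply the corresponding linear segment) can be sketched as follows. Only the breakpoint value is taken from the text; the segment coefficients and the preliminary model are hypothetical placeholders, not the fitted ITQ-21 model.

```python
BREAKPOINT = 17.4582  # breakpoint on the dependent variable, from the text


def plr_predict(x, low_coeffs, high_coeffs, is_above_breakpoint):
    """Predict with the segment selected by a prior estimate of the side."""
    coeffs = high_coeffs if is_above_breakpoint(x) else low_coeffs
    return coeffs[0] + sum(b * xi for b, xi in zip(coeffs[1:], x))


def preliminary_model(x):
    """Hypothetical stand-in for a previously fitted model (e.g., quadratic)."""
    return 60.0 - 5.0 * x[0]


def side_classifier(x):
    """Two-class rule defined by the threshold, fed by the preliminary model."""
    return preliminary_model(x) > BREAKPOINT


LOW_SEGMENT = [5.0, -0.5]     # hypothetical [b0, b1] below the breakpoint
HIGH_SEGMENT = [100.0, -8.0]  # hypothetical [b0, b1] above the breakpoint

concentrated_gel = plr_predict([2.0], LOW_SEGMENT, HIGH_SEGMENT, side_classifier)
diluted_gel = plr_predict([10.0], LOW_SEGMENT, HIGH_SEGMENT, side_classifier)
```

    Any misclassification by the first stage sends the experiment to the wrong segment, which is why the overall PLR accuracy depends on the quality of the prior estimate.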

    4 Results of Non-parametric Approaches and Prediction of ITQ-21 Phase Crystallinity

    Having previously estimated the distribution of the data collected in the ITQ-21 study, we compare the predictions of the previous parametric statistics with those of NNs, SVMs, and RTs. For each ML approach, the whole dataset, which contains 144 data, is divided into three different sets, namely training, selection, and test, with 64, 40, and 40 individuals, respectively, in order to avoid overfitting. Thus, the test set represents 28% of the entire dataset, as mentioned before.

    4.1 Comparison and Performance Assessments

    As in the case of traditional MLR models, fitted GLZ models can be summarized through statistics such as parameter estimates, their standard errors, and GoF statistics. Here, different statistics have been calculated: the correlation coefficient $r$ between the predicted and observed output values (Eq. 2), the coefficient of determination ($R^2$, Eq. 3), the adjusted $R^2$ ($R^2_{adj}$, Eq. 4), the standard deviation (Eq. 1) of the target output variable ($s_y$), and the standard deviation of the errors for the output variable ($s_e$). $r$ represents the linear relationship between two variables. A perfect prediction will have a correlation coefficient of 1. A correlation of 1 does not necessarily indicate a perfect prediction (only a prediction

    QSAR Comb. Sci. 00, 0000, No.&, 1 18 www.qcs.wiley-vch.de 2006 WILEY-VCH Verlag GmbH & Co. KGaA, Weinheim &9&

    These are not the final page numbers!

    Table 3. Estimates of GLZ models using different configurations of effects. Gray cells contain significant values at p

  • 8/14/2019 QSAR itq21 param

    10/18

    which is perfectly linearly correlated with the actual out-puts), although in practice the correlation coefficient is a

    good indicator of performance. It also provides a simpleand familiar way to compare the performance of statisticaland ML methods. In Eqs. 1 4, formulas are given for eachstatistics, with n the amount of data, and p the number ofpredictors. Adding more independent variables to a modelcan only increase the R2. Since the number of variables se-lected by the NN is different from the one used in the oth-er approaches, R2adj has also been used.

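    $$s = \sqrt{\frac{1}{N}\sum_i (x_i - \bar{x})^2} \tag{1}$$

    $$r = \frac{n\sum_i x_i y_i - \sum_i x_i \sum_i y_i}{\left[n\sum_i x_i^2 - \left(\sum_i x_i\right)^2\right]^{1/2}\left[n\sum_i y_i^2 - \left(\sum_i y_i\right)^2\right]^{1/2}} \tag{2}$$

    $$R^2 = 1 - \frac{\sum_i (y_i - \tilde{y}_i)^2}{\sum_i (y_i - \bar{y})^2} \tag{3}$$

    $$R^2_{adj} = 1 - (1 - R^2)\,\frac{n - 1}{n - p - 1} \tag{4}$$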
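The four statistics of Eqs. 1–4 can be computed with a few lines of plain Python. The sketch below follows the equations as printed (in particular, Eq. 1 is taken with a 1/N factor); the function names and the small numerical example are illustrative assumptions, not values from the study.

```python
import math


def std_dev(x):
    """Eq. 1: standard deviation with a 1/N factor."""
    mean = sum(x) / len(x)
    return math.sqrt(sum((v - mean) ** 2 for v in x) / len(x))


def pearson_r(x, y):
    """Eq. 2: linear correlation coefficient between x and y."""
    n = len(x)
    sx, sy = sum(x), sum(y)
    sxy = sum(a * b for a, b in zip(x, y))
    sxx = sum(a * a for a in x)
    syy = sum(b * b for b in y)
    return (n * sxy - sx * sy) / (math.sqrt(n * sxx - sx ** 2) * math.sqrt(n * syy - sy ** 2))


def r_squared(y, y_pred):
    """Eq. 3: coefficient of determination."""
    mean = sum(y) / len(y)
    ss_res = sum((a - b) ** 2 for a, b in zip(y, y_pred))
    ss_tot = sum((a - mean) ** 2 for a in y)
    return 1.0 - ss_res / ss_tot


def adjusted_r_squared(r2, n, p):
    """Eq. 4: R2 penalized by the number of predictors p."""
    return 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)


# Hypothetical observed and predicted values.
observed = [1.0, 2.0, 3.0, 4.0]
predicted = [1.1, 1.9, 3.2, 3.8]
r2 = r_squared(observed, predicted)
r2_adj = adjusted_r_squared(r2, n=len(observed), p=1)
```

Because the penalty factor (n − 1)/(n − p − 1) grows with p, two models with the same R2 but different numbers of inputs receive different R2adj values, which is the reason for reporting both.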

    4.2 Performances of Neural Networks, Regression Trees, and SVMs

    The most common NN architectures have outputs in a limited range (e.g., 0–1 for the logistic activation function we use). Such a bounded output is well suited to classification problems, as has been investigated [15]. For regression problems, however, there is clearly an issue to be resolved, and some of the consequences are quite subtle. A scaling algorithm can be applied to ensure that the network's output lies in a sensible range. The simplest scaling function finds the minimum and maximum values of a variable in the training data and performs a linear transformation to convert the values into the target range. The network's output is then constrained to lie within this range. This, however, raises the problem of extrapolating to new materials outside the range defined by the training cases. For a fair simulation of the prediction of the crystallinity of new materials, one has to consider that the expected values can be lower than the worst experiment or higher than the best case seen so far in the current dataset. Thus, we have chosen to always rescale the training data within the range [0, 0.9]: a crystallinity lower than zero cannot be attained (zero corresponding to an amorphous material), whereas it may be possible to obtain a more crystalline ITQ-21 sample than the best one obtained up to now. The 100% crystallized material may not belong to the training set (e.g., because of the random selection of the training set), and 100% crystallinity has been arbitrarily defined by the best zeolite found in our experimentation. New syntheses could nevertheless achieve an even better crystallized sample.

    Figure 7. ITQ-21 phase crystallinity fitting using GLZs; MLR is given as a reference.
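The rescaling to [0, 0.9] can be illustrated with a simple min-max transformation and its inverse. This is a sketch under stated assumptions: the helper `fit_minmax` and the training values are hypothetical, and the actual scaling routine used by the NN software is not described in the text.

```python
def fit_minmax(values, lo=0.0, hi=0.9):
    """Build forward and inverse linear maps from the training range to [lo, hi]."""
    vmin, vmax = min(values), max(values)
    if vmax == vmin:
        raise ValueError("training values must not all be equal")
    scale = (hi - lo) / (vmax - vmin)

    def forward(v):
        return lo + (v - vmin) * scale

    def inverse(s):
        return vmin + (s - lo) / scale

    return forward, inverse


# Hypothetical training crystallinities (percent); 100 is the best sample seen.
train_crystallinity = [0.0, 12.5, 40.0, 100.0]
to_scaled, to_crystallinity = fit_minmax(train_crystallinity)

# A logistic output unit saturates at 1.0; mapped back, this lies above the
# best training sample, which leaves room for a more crystalline material.
headroom = to_crystallinity(1.0)
```

With these example values, the upper bound of the logistic output corresponds to a crystallinity of about 111%, so a new sample better than the current reference can still be predicted, while the lower bound stays at zero.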

    In all cases, NNs in the Multi-Layer Perceptron (MLP) form and SVMs using the RBF kernel form have reached the best performances. In Table 4, the best NN model for the prediction of ITQ-21 crystallinity is shown. Two points have to be underlined concerning the performance assessment of NNs. (i) The work required to obtain and select the best NN is by far more time-consuming than for the other non-parametric approaches. Considerable attention has been given to NNs because of the high variability of the results we have obtained. Numerous architectures, activation functions, and other parameters have been tested. Several NN models have been discarded because of the large difference in performance between the training/selection and the test sets, indicating clear overfitting. (ii) Since a feature selection algorithm has been combined with the NN, some of the first good networks selected are composed of very few input variables. From the standpoint of zeolite synthesis, it cannot be shown that any of the variables we have used is without effect and can be eliminated from the synthesis steps. However, the selection of input variables makes it possible to eliminate the variables whose information the network could not exploit once the others had been taken into account. Moreover, reducing the pool of input variables inherently limits the potential for extrapolation when a broader range is used for the synthesis variables, since the role of the discarded variables could then emerge. Both R2 and corrected R2 have been given, although the use of the latter can be questioned for the above reasons. It has to be noted that such a feature selection mechanism could also have been used for SVM or regression trees. The stability of these methodologies, however, is usually better, partially because of their very small number of parameters compared to the numerous ones contained in the NN architecture, as will be shown later.

    Figure 8. Confidence intervals of predicted values for the best GLZ models. Three different α values are considered (i.e., 10, 5, and 1%).


    Figure 9. ITQ-21 phase crystallinity modeling with best GLZ, PLR, SVM, NN, and RT.

    Table 4. Description of all the selected models for the prediction of ITQ-21 phase crystallinity.

    Model                                           Correlation       R2      R2 adjusted   Standard deviation
                                                    coefficient (r)                          of errors
    MLR                                             0.782             0.611   0.597         15.931
    GLZ (normal distribution, log link function)    0.919             0.844   0.839         10.151
    Full factorial                                  0.953             0.909   0.906         7.695
    Polynomial of degree 2                          0.923             0.853   0.847         9.813
    Quadratic response surface regression           0.955             0.913   0.910         7.514
    Fractional factorial to degree 3                0.952             0.907   0.904         7.782
    Fractional factorial to degree 2                0.941             0.885   0.881         8.664
    Piecewise linear regression                     0.962             0.925   0.923         6.978
    Neural network (MLP 4:4-5-1:1)                  0.918             0.843   0.838         10.139
    SVM (radial basis function)                     0.921             0.849   0.844         9.956
    Regression tree                                 0.916             0.840   0.835         10.216

    Black cells are used for non-parametric approaches and gray ones are the selected models. Mean of the whole dataset: 17.458. Standard deviation of the whole dataset: 25.565.


    Figure 9 shows the predictions of ITQ-21 crystallinity for the given NN. Since the effect of the synthesis variable named Time is rather low (as indicated earlier), the NN has removed it from the input parameters. The number of false positives is much higher for NN-MLP and SVM-RBF (radial basis function) than for all other techniques. The SVM-RBF is the best among the non-parametric approaches considering the overall criteria given in Table 4, but, on the other hand, numerous negative crystallinity values can be observed. A k-fold (k = 10) Cross-Validation (CV) has been used to optimize the capacity (C) and epsilon (ε) at the same time. For C = 10, gamma (γ) has been set at 0.2, and ε = 0.1. The Regression Tree (RT) produces accurate predictions based on a few logical if-then conditions. A ten-fold CV is used for pruning. The original version of the RT was composed of 13 non-terminal nodes and 14 (terminal) leaves. In Figure 10, some terminal nodes have been pruned further (the leaves containing fewer than 20 individuals are removed), making the tree easier to read. The gray-scale rectangles show that the RT succeeds in isolating the different levels of ITQ-21 crystallinity. It is interesting to observe that the position of the rectangles gives an intuitive classification of the samples studied, allowing an easy visualization of the crystallinity and the synthesis factors. The leaves in the left branches present an increasing crystallization (dark rectangles) going down the splits, while in the right branches the crystallinity decreases (bright rectangles). For each leaf, the mean (μ) of the samples is indicated. Figure 10 shows that the highest crystalline samples are obtained for concentrated gels (H2O


    However, simpler kernels such as polynomials of degrees 2 and 3 have also been tested. Results for a 30% test set and 10-fold CV are the following:

    {Degree, C, ε, γ, coeff.} = {3, 10, 0.1, 0.3} gives r = 0.915 (training), r = 0.85 (test)

    {Degree, C, ε, γ, coeff.} = {2, 10, 0.1, 0.3} gives r = 0.920 (training), r = 0.87 (test)

    These results confirm what was concluded through the MLR and GLZ examination, i.e., the use of second-degree effects is useful while the integration of higher-order effects is not. The difference in performance between the RBF and such a simple kernel is very small, which once again rules out all forms of more complex models in this study. Finally, both RT and SVM with a polynomial kernel of degree 2 are selected. RT should be used for a quick overview of the system, while SVM makes it possible to draw a precise contour plot.
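The k-fold cross-validation used to tune the SVM settings can be sketched generically as below. This is only an outline under stated assumptions: `k_fold_indices` and `grid_search_cv` are hypothetical helper names, and `fit_and_score` is a placeholder that the reader would replace with the actual training and scoring of an SVM; no SVM implementation is included here.

```python
import itertools
import random


def k_fold_indices(n, k=10, seed=0):
    """Yield (training, validation) index lists for k-fold cross-validation."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        validation = folds[i]
        training = [j for m, fold in enumerate(folds) if m != i for j in fold]
        yield training, validation


def grid_search_cv(fit_and_score, grid, n, k=10):
    """Return (mean score, parameters) of the best parameter tuple in the grid."""
    best = None
    for params in grid:
        scores = [fit_and_score(params, tr, va) for tr, va in k_fold_indices(n, k)]
        mean_score = sum(scores) / len(scores)
        if best is None or mean_score > best[0]:
            best = (mean_score, params)
    return best


# Placeholder scorer (an assumption): peaks at C = 10 and epsilon = 0.1 so that
# the example runs without any SVM library. A real scorer would train on the
# `training` indices and return, e.g., the correlation r on `validation`.
def dummy_score(params, training, validation):
    c, epsilon = params
    return -abs(c - 10) - abs(epsilon - 0.1)


candidate_grid = list(itertools.product([1, 10, 100], [0.01, 0.1, 1.0]))
best_score, best_params = grid_search_cv(dummy_score, candidate_grid, n=100)
```

Searching C and ε jointly, as done in the study, means every pair in the grid is evaluated on the same folds, so the comparison between pairs is not affected by the random partition.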

    5 Advanced Analysis of Methodologies and Interpretation

    Both to illustrate the difficulties encountered when using NNs and to better justify the selection of the SVM methodology over NNs in this study, the two techniques are compared on the stability/variability of their performances as a function of the amount of data available for the training step. We have chosen to assess generalization performance for only these two approaches because SVM has been described as a more stable technique than NN, and because all the other techniques used are either far less likely to overfit the data or can easily be combined with a fast post-processing treatment, such as pruning for RT. This analysis also investigates whether decreasing the size of the test set in order to allocate more resources to training makes the variability of performance higher, and thus increases the risk of falsely accepting a model.

    The dataset is divided into two parts: training (Tr) and test (Te). Their respective sizes vary, and the fitting capacity is assessed. The relative amount of data in the test subset is set to either 70 or 30% of the whole available dataset. Five different samplings for each distribution into training and test are presented for both NNs and SVM. The frequencies of responses have been checked in order to have a minimum number of each type of experiment, i.e., low and high ITQ-21 crystallinity values, in both training and test sets. This makes it possible to assess the performance of the modeling on three different ranges of crystallinity: {0, ]0...50], ]50...100]}. Table 5 gives the mean and standard deviation of each sample taking the ranges into account, while Tables 6 and 7 indicate the statistics for the predicted values. For NNs, the best solution using RBF and MLP (three or four layers) is retained, while SVM makes use of the RBF model form only.

    For NNs, the best network found is kept for each sampling after elimination of the networks that show clear overfitting through the difference between training and test performances, as done before. However, set #2 in Table 6 shows an obvious estimation failure. Only one input has been kept; consequently, the range of maximum interest (i.e., >50) is greatly underestimated while amorphous materials are overestimated. Obviously, such an NN has been trapped in a local optimum. In Tables 6–9, the gray cells indicate where a given failure has been encountered, while the black cells indicate the selected models. The different kinds of failure are described below. It has to be pointed out that the following criteria are not independent, and therefore only the most significant criterion of each failure is shown in gray. On the basis of the traditional statistics listed in Tables 8 and 9, the following failures are identified.

    (1) A large drop in performance from the training set to the test set, such as for the NN using the MLP technique tested with set #1 (11.4% = 98.1 − 86.7, Table 8), which represents the greatest fall, but also set #5 for NN-MLP in Table 8.

    (2) Relatively low performance compared to all other models of the same type. Therefore, set #4 for NN-RBF in Table 9 is discarded.

    (3) Relatively high error standard deviation. Note that even if a prediction error mean extremely close to zero is expected, a zero prediction error mean can be obtained simply by predicting the average of the training data, without any recourse to the input variables or to any advanced methodology at all. Thus, the error standard deviation is of great interest in order to avoid falsely good models such as NN-RBF tested on set #4 in Tables 8 and 9. NN-RBF with set #2 in Table 8 could have been discarded directly on the basis of the error mean. Note that if the error standard deviation is no better than the training data standard deviation, then the technique has performed no better than a simple mean estimator.

    (4) A weak (i.e., non-robust) architecture. Not only the NN-MLP tested on set #2 in Table 8, but also NN-MLP and NN-RBF tested on set #5 in Table 9 possess a very low number of inputs, indicating that the networks did not manage to use the information brought by all variables. In Tables 6 and 7, predictions are followed on separate ranges of crystallinity.

    (5) Difference between the observed and predicted mean of ITQ-21 crystallinity. This is generally observed for high values of crystallinity (sets #2, #3, #5 for NN-MLP, and sets #4 and #5 for NN-RBF in Table 6, and sets #1, #3, #4 with NN-RBF in Table 7). This is due to the relatively low number of experiments belonging to the range >50. On the other hand, in set #2 for both NN-RBF and NN-MLP in Table 6, a very poor recognition of amorphous materials is detected, as well as for set #1 for NN-MLP. The prediction for medium crystallized materials is overestimated in set #1 for NN-MLP, making the margin between the medium and highly crystallized zeolites very narrow.

    (6) Overfitting is also detected through the high standard deviation of the predicted ITQ-21 crystallinity for medium and highly crystallized materials. This is observed for the majority of NN models: set #1 for NN-MLP, sets #4 and #5 for NN-RBF in Table 6, and all NN-MLP except the one tested on set #2, and sets #3–5 for NN-RBF in Table 7. This statistic shows that NNs often fail to find a good and stable model over the whole range of ITQ-21 crystallinity.

    Table 8. Different statistics are given for each type of model, parameters, and test set: the mean error, the error standard deviation, the ratio of the prediction error standard deviation to the original output data standard deviation (noted SD ratio), as well as the Pearson correlation r for both training and test sets. Note that a lower SD ratio indicates a better prediction, and this is equivalent to 1 minus the explained variance of the model. The percentage of test set used is 70, as indicated in the first column.

    70% of Test set

    Methodology: Neural networks

    Test set   Error mean   Error SD   SD ratio   r, training (selection)   r, test   Form   Parameters
    Set #1     0.0430       8.3626     0.5254     0.98191                   0.86779   MLP    4:4-2-1:1
    Set #2     4.2375       12.8548    0.7489     0.55621                   0.70001   MLP    1:1-1-1-1:1
    Set #3     2.8432       6.5612     0.4112     0.91686                   0.91527   MLP    3:3-1-2-1:1
    Set #4     1.3291       8.3130     0.4696     0.93298                   0.88991   MLP    3:3-2-1:1
    Set #5     3.5427       7.8134     0.4612     0.96023                   0.89572   MLP    3:3-3-1:1
    Set #1     0.6261       8.9759     0.5109     0.92194                   0.87771   RBF    3:3-9-1:1
    Set #2     8.3668       12.8457    0.5488     0.90154                   0.86598   RBF    3:3-10-1:1
    Set #3     1.8639       9.4848     0.5318     0.92553                   0.87900   RBF    3:3-9-1:1
    Set #4     2.0565       13.1512    0.6105     0.82674                   0.79190   RBF    4:4-9-1:1
    Set #5     3.4021       11.8073    0.5934     0.81881                   0.80956   RBF    4:4-10-1:1

    Methodology: Support vector machines (RBF, {C, ε, γ} = {10, 0.1, 0.3})

    Test set   Error mean   Error SD   SD ratio   r, training (selection)   r, test
    Set #1     0.3540       9.5929     0.5268     0.93276                   0.8517
    Set #2     1.2760       0.30301    0.4636     0.91980                   0.8597
    Set #3     2.3972       8.7152     0.4634     0.93419                   0.8781
    Set #4     0.5403       10.4749    0.4992     0.91399                   0.8581
    Set #5     0.4055       10.2129    0.5233     0.95273                   0.8670

    Table 7. (cont.)

    Neural networks, test sets only

    MLP models   4:4-8-1:1    4:4-3-1:1    4:4-10-8-1:1   5:5-3-1:1   3:3-1-3-1:1
    RBF models   5:5-20-1:1   4:4-15-1:1   4:4-19-1:1     5:5-6-1:1   2:2-9-1:1

    ITQ-21 crystallinity, by range
    Mean   0      1.3549    0.1903    3.5478    3.1014    1.5366
           >50    33.6393   23.2889   31.4614   29.4760   27.0582
           50     14.3402   14.2591   20.1331   8.8716    20.6843
           50     33.1845   24.0824   26.9391   29.1968   29.1049
           50     17.0246   14.4054   11.9040   12.9160   16.2556
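Two of the failure criteria listed above, the training-to-test performance drop of criterion (1) and the mean-estimator check of criterion (3), can be written as small screening functions. The correlation values below are those reported for NN-MLP in Table 8; the 5-point threshold and the function names are assumptions introduced for this sketch, since the text does not state a numerical cut-off.

```python
def performance_drop(r_train, r_test):
    """Criterion (1): drop in correlation from training to test, in percentage points."""
    return 100.0 * (r_train - r_test)


def no_better_than_mean(sd_ratio):
    """Criterion (3): an SD ratio of 1 or more means the model does no better
    than predicting the training mean for every sample."""
    return sd_ratio >= 1.0


# (r training, r test) for NN-MLP with 70% of the data in the test set (Table 8).
NN_MLP_70 = {
    1: (0.98191, 0.86779),
    2: (0.55621, 0.70001),
    3: (0.91686, 0.91527),
    4: (0.93298, 0.88991),
    5: (0.96023, 0.89572),
}

DROP_THRESHOLD = 5.0  # hypothetical cut-off, in percentage points

flagged_sets = sorted(
    s for s, (r_tr, r_te) in NN_MLP_70.items()
    if performance_drop(r_tr, r_te) > DROP_THRESHOLD
)  # [1, 5]
```

With this assumed threshold, the screening flags sets #1 and #5, the two NN-MLP cases singled out under criterion (1); a different threshold would of course change the selection.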

    Considering all these criteria, one can observe that NNs are much more affected than SVMs for both sizes of training sets. As expected, the number of detected failures increases as the amount of training data decreases. Relatively small test sets increase the risk of selecting a false model, as seen through the higher variability of the criteria. In the case with 70% of the data in the test set, it can be checked that the number of inputs selected for the NNs is lower than in the other case. The relative lack of experiments does not allow the whole set of features to be exploited: the variability of the responses is quickly associated with a few variables, and the others are considered to bring redundant or noisy information.
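The range-wise follow-up of predictions used in this section (ranges {0, ]0...50], ]50...100]}) can be sketched as below. The grouping is done on the observed crystallinity, and the mean and standard deviation of the predictions are then reported per range. The data values and the function names are hypothetical, and the use of a 1/N standard deviation is an assumption.

```python
from statistics import fmean, pstdev


def crystallinity_range(value):
    """Assign an observed crystallinity (percent) to one of the three ranges."""
    if value == 0:
        return "0"
    if 0 < value <= 50:
        return "]0...50]"
    if 50 < value <= 100:
        return "]50...100]"
    raise ValueError("observed crystallinity must lie in [0, 100]")


def stats_by_range(observed, predicted):
    """Return {range label: (mean, SD)} of the predictions in each observed range."""
    groups = {}
    for obs, pred in zip(observed, predicted):
        groups.setdefault(crystallinity_range(obs), []).append(pred)
    return {label: (fmean(vals), pstdev(vals)) for label, vals in groups.items()}


# Hypothetical observed and predicted crystallinities.
observed = [0, 0, 20, 50, 80, 100]
predicted = [1.0, 3.0, 25.0, 45.0, 70.0, 90.0]
summary = stats_by_range(observed, predicted)
```

Comparing the per-range mean of the predictions with the per-range mean of the observations corresponds to criterion (5), while a large per-range standard deviation of the predictions corresponds to criterion (6).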

    6 Conclusions

    This work presents a broad investigation of different modeling techniques for the prediction of performances in materials science. Two types of methodologies are examined: parametric strategies on the one hand and non-parametric techniques on the other. The non-parametric methods employed here are all ML algorithms, namely NNs, SVMs, and regression trees. They reach a reasonable fitting accuracy. However, with ML techniques, the recurrent problem of overfitting had to be considered and investigated. The parametric methods are less subject to this important problem. The difference is due to the fact that the statistical models are inherently restricted in their model forms, while learning methods, and particularly NNs, possess a high flexibility and numerous setting parameters. The advanced performance assessment of NNs and SVM has made it possible to verify this assumption. As general advice, the parametric approach should always be employed as a reference for further work. Both approaches are compatible, and the selection of a unique model is not compulsory. Rather, we advocate the use of the simplest methodologies at the beginning, while more complicated but less informative techniques are kept for when complex underlying systems are detected. The combination of multiple approaches is of great appeal and should make it possible to reach higher and more stable performances, taking advantage of the contribution of each method, while expected drawbacks might be eliminated or corrected through the complementary strengths of each technique. This study has pointed out the difficulty of selecting models when no a priori preference is given to a particular modeling technique. Of course, such work is problem-dependent, as the number of input variables, the available amount of data, and the complexity of the system investigated make a given approach more or less suitable. Thus, such a preliminary inspection of techniques appears to be mandatory to decrease the risk of deceptive results. Data mining technology is increasingly applied in production mode, which usually requires automatic analysis of data and related results in order to reach conclusions. But we have shown here that the selection of a given model remains a difficult task, making the automation of the whole combinatorial loop a problem that is too often underestimated. Unlike traditional data mining contexts, which deal with voluminous amounts of data, materials science is actually characterized by a scarcity of data, owing to the cost and time involved in conducting simulations or setting up experimental apparatus for data collection. In such domains, it is prudent to balance speed through automation against the utility of data. For these reasons, human interaction, verification, and guidance may lead to better quality output.

    Table 9. Different statistics are given for each type of model, parameters, and test set: the mean error, the error standard deviation, the ratio of the prediction error standard deviation to the original output data standard deviation (noted SD ratio), as well as the Pearson correlation r for both training and test sets. Note that a lower SD ratio indicates a better prediction, and this is equivalent to 1 minus the explained variance of the model. The percentage of test set used is 30, as indicated in the first column.

    30% of Test set

    Methodology: Neural networks

    Test set   Error mean   Error SD   SD ratio   r, training (selection)   r, test   Form   Parameters
    Set #1     1.7774       6.7509     0.3846     0.9309                    0.9246    MLP    4:4-8-1:1
    Set #2     0.7065       6.6734     0.3610     0.9316                    0.9336    MLP    4:4-3-1:1
    Set #3     1.4581       7.7908     0.5077     0.9281                    0.8625    MLP    4:4-10-8-1:1
    Set #4     0.7467       7.2215     0.4313     0.9520                    0.9081    MLP    5:5-3-1:1
    Set #5     0.2503       8.2122     0.3942     0.9086                    0.9197    MLP    3:3-1-3-1:1
    Set #1     4.3222       6.9701     0.4833     0.8688                    0.8891    RBF    5:5-20-1:1
    Set #2     0.7534       4.0173     0.4108     0.9018                    0.9133    RBF    4:4-15-1:1
    Set #3     0.6127       8.7909     0.5222     0.8497                    0.8528    RBF    4:4-19-1:1
    Set #4     2.1398       11.4482    0.6267     0.7989                    0.7793    RBF    5:5-6-1:1
    Set #5     2.5788       6.9778     0.4557     0.8225                    0.8938    RBF    2:2-9-1:1

    Methodology: Support vector machines (RBF, {C, ε, γ} = {10, 0.1, 0.3})

    Test set   Error mean   Error SD   SD ratio   r, training (selection)   r, test
    Set #1     2.2549       7.5327     0.3688     0.9323                    0.9297
    Set #2     0.6212       8.3591     0.3921     0.9258                    0.9208
    Set #3     0.3328       8.2413     0.4362     0.9420                    0.9055
    Set #4     1.8904       6.8110     0.3529     0.9430                    0.9452
    Set #5     1.6718       8.8380     0.3939     0.9207                    0.9204

    Acknowledgements

    The EU Commission (TOPCOMBI Project) is gratefully acknowledged.

    References

    [1] H. Lee, S. I. Zones, M. E. Davis, Nature 2003, 425, 385–387.
    [2] A. Corma, J. Catal. 2003, 216(1–2), 298–312.
    [3] C. S. Cundy, P. A. Cox, Chem. Rev. 2003, 103, 663–702.
    [4] S. I. Zones, S. J. Hwang, S. Elomari, I. Ogino, M. E. Davis, A. W. Burton, C. R. Chim. 2005, 8, 267–282.
    [5] J. L. Paillaud, B. Harbuzaru, J. Patarin, N. Bats, Science 2004, 304(5673), 990–992.
    [6] K. G. Strohmaier, D. E. Vaughan, J. Am. Chem. Soc. 2003, 125(51), 16035–16039.
    [7] R. Millini, C. Perego, L. Carluccio, G. Bellussi, D. E. Cox, B. J. Campbell, A. K. Cheetham, Proceedings of the International Zeolite Conference 12th, Baltimore, July 5–10, 1998, Meeting Date 1998, 1999, 541–549.
    [8] A. Corma, F. Rey, J. Rius, M. J. Sabatier, S. Valencia, Nature 2004, 431, 287–290.
    [9] A. Corma, M. J. Díaz-Cabañas, F. Rey, S. Nicolopoulus, K. Boulahya, Chem. Commun. 2004, 12, 1356–1357.
    [10] C. S. Cundy, P. A. Cox, Micropor. Mesopor. Mat. 2005, 82, 1–78.
    [11] A. Corma, V. Fornes, U. Diaz, Chem. Commun. 2001, 24, 2642–2643.
    [12] M. Moliner, J. M. Serra, A. Corma, E. Argente, S. Valero, V. Botti, Micropor. Mesopor. Mat. 2005, 78, 73–81.
    [13] O. B. Vistad, D. E. Akporiaye, K. Mejland, R. Wendelbo, A. Karlsson, M. Plassen, K. P. Lillerud, Stud. Surf. Sci. Catal. 2004, 154, 731–738.
    [14] A. Cantín, A. Corma, M. J. Díaz-Cabañas, J. L. Jorda, M. Moliner, J. Am. Chem. Soc. 2006, 128, 4216–4217.
    [15] L. A. Baumes, D. Farrusseng, M. Lengliz, C. Mirodatos, QSAR Comb. Sci. 2004, 29, 767–778.
    [16] A. Corma, J. M. Serra, E. Argente, S. Valero, V. Botti, Chem. Phys. Chem. 2002, 3, 939–945.
    [17] M. Holena, M. Baerns, Catal. Today 2003, 81, 485–494.
    [18] C. Klanner, D. Farrusseng, L. A. Baumes, C. Mirodatos, F. Schüth, Angew. Chem. Int. Ed. 2004, 43, 5347–5349.
    [19] J. M. Serra, L. A. Baumes, M. Moliner, P. Serna, A. Corma, Comb. Chem. High Throughput Screen. 2006 (Submitted).
    [20] L. A. Baumes, J. M. Serra, P. Serna, A. Corma, J. Comb. Chem. 2006, 8, 583–596.
    [21] D. Nicolaides, QSAR Comb. Sci. 2005, 24, 15–21.
    [22] M. M. Gardner, J. N. Cawse, in: J. M. Cawse (Ed.), Experimental Design for Combinatorial and High Throughput Materials Development, John Wiley & Sons, Hoboken, New Jersey, 2003, pp. 129–145.
    [23] L. A. Baumes, J. Comb. Chem. 2006, 8, 304–313.
    [24] A. Corma, M. J. Díaz-Cabañas, J. Martínez-Triguero, F. Rey, J. Rius, Nature 2002, 418, 514–517.
    [25] C. Bishop, Neural Networks for Pattern Recognition, Oxford University Press, Oxford, 1995.
    [26] S. Haykin, Neural Networks: A Comprehensive Foundation, Macmillan Publishing, New York, 1994.
    [27] P. J. Green, B. W. Silverman, Nonparametric Regression and Generalized Linear Models: A Roughness Penalty Approach, Chapman & Hall, New York, 1994.
    [28] A. J. Dobson, An Introduction to Generalized Linear Models, Chapman & Hall, New York, 1990.
    [29] J. Stevens, Applied Multivariate Statistics for the Social Sciences, Erlbaum, Hillsdale, NJ, 1986.
    [30] M. S. Younger, A First Course in Linear Regression, 2nd ed., Duxbury Press, Boston, 1985.
    [31] V. Vapnik, The Nature of Statistical Learning Theory, Springer, Berlin, Germany, 1995.
    [32] L. Breiman, J. H. Friedman, R. A. Olshen, C. J. Stone, Classification and Regression Trees, Wadsworth & Brooks/Cole Advanced Books & Software, Monterey, CA, 1984.
    [33] T. Blasco, A. Corma, M. J. Díaz-Cabañas, F. Rey, J. Rius, G. Sastre, J. A. Vidal-Moya, J. Am. Chem. Soc. 2004, 126, 13414–13423.
