Validation of the Calibration Procedure in Atomic Absorption Spectrometric Methods


Journal of Analytical Atomic Spectrometry

W. PENNINCKX, C. HARTMANN, D. L. MASSART AND J. SMEYERS-VERBEKE
ChemoAC, Pharmaceutical Institute, Vrije Universiteit Brussel, Laarbeeklaan 103, 1090 Brussels, Belgium

A general strategy for the validation of the calibration procedure in AAS was developed. To this end, the suitability of different experimental designs and statistical tests for tracing outliers, examining the behaviour of the variance and detecting a lack-of-fit was evaluated. Parametric as well as randomization tests were considered. Simulated data, based on real measurements, were used for these investigations. The results indicate that, to validate a straight-line model, the measurement points should preferably be distributed over three or four concentration levels. To check the goodness-of-fit, the significance of the quadratic term should be investigated. A lack-of-fit to a second degree model is better detected when the measurement points are distributed over more than four concentration levels. For an unweighted second degree model, an analysis of variance (ANOVA) lack-of-fit test can be used, while a randomization test is proposed for a weighted model. A one-tailed F-test or an alternative randomization test should be used to trace a non-constant variance.

Keywords: Method validation; calibration; randomization test

Method validation is the process of demonstrating the ability of a newly developed method to produce reliable results.¹ Generally, one starts this process by validating the applied calibration procedure. The calibration model is used to describe the relationship between the analytical signal ($y$) and the concentration ($x$). One can assume, for example, that the Lambert-Beer law is valid within the applied concentration range, so that a straight-line model ($y = b_0 + b_1 x$) can be used. However, if the Lambert-Beer law is not valid and the straight-line model is fitted to the data, the calibration procedure introduces a systematic error in the analysis results. The calibration method has to be evaluated prior to its routine use, since the limited number of data points used routinely does not permit such an evaluation.

Different approaches can be followed to validate the calibration procedure. A number of guidelines to validate the calibration function are, for example, published by the International Organisation for Standardisation (ISO).²,³ However, some important problems, such as the investigation of the lack-of-fit of a weighted and a second degree calibration line, are not discussed by ISO. Therefore, in this paper a more general strategy for the validation of AAS calibration procedures is given.

This work investigates the validation of straight-line models ($y = b_0 + b_1 x$) and second degree models ($y = b_0 + b_1 x + b_2 x^2$), which are the most applied in practice and are included in the ISO guidelines.²,³ Other calibration models are reported in the literature. Some workers⁴ use, for example, a cubic model ($y = b_0 + b_1 x + b_2 x^2 + b_3 x^3$). However, such a model has little physical meaning and requires a large number of calibration standards during routine analysis. Barnett⁵ has described the calibration model that is included in the Perkin-Elmer atomic absorption spectrometers. Other models have been reported by Phillips and Eyring⁶ and Kościelniak.⁷ Since these models are not generally applied, their validation is not discussed further here.

The validation of the calibration procedure involves an examination of the behaviour of the variance and of the goodness-of-fit of the selected model. To accomplish this, aqueous standards are measured at different concentration levels, covering the complete calibration range. ISO,²,³ for example, recommends distributing ten standards uniformly over the calibration range and performing ten replicate analyses of each of the lowest and highest concentrations. The experimental design is important since it influences the probability that a problem, such as a lack-of-fit or a non-constant variance, is detected. Therefore, this work evaluates the applicability of different experimental designs.

After the experiments have been performed, one should first evaluate the data graphically. Often this permits an easy detection of important problems. A good way to do this is by examining the residuals.⁸ Since this is not included in the ISO²,³ recommendations, it is discussed briefly in this paper.

For a statistical evaluation of the results, one should check whether the data are free of outliers. Outlying points may disturb the normality of the data, which is required by most of the tests used to examine the behaviour of the variance and the goodness-of-fit. Moreover, the occurrence of multiple outliers would indicate a fundamental problem with the method. In this paper, two tests to trace single outliers, namely the Dixon⁹ and the Grubbs tests, and a test to trace paired outliers, are studied. Next, the behaviour of the variance is investigated. This work evaluates the suitability of tests which compare variances at different concentration levels, such as the F,²,³ Cochran,¹² Hartley¹² and Bartlett¹³ tests, as well as alternatives for these tests, where the standard deviations at the different levels are estimated by the range.¹⁴ Finally, in order to trace a lack-of-fit, the applicability of an analysis of variance (ANOVA) procedure¹⁵ and the significance of the quadratic term¹⁶ are evaluated. An alternative test is considered for weighted models.¹⁷ This work also investigates the suitability of a number of randomization tests. In that case, the computed test statistic is not compared with a critical value, but with a distribution which is obtained by random assignment of the experimental data. By deriving the distribution from the data themselves, these tests should be less sensitive to deviations from normality.¹⁸

The different tests and experimental designs are evaluated in a systematic way by means of simulations which are based on a number of real data sets. The results of this evaluation are used to construct a general validation strategy for the calibration procedure in AAS. This could form the basis for a more general strategy applicable to other measurement techniques.

EXPERIMENTAL

The symbols that are used throughout this paper are summarized in Table 1.

Journal of Analytical Atomic Spectrometry, April 1996, Vol. 11 (237-246)
Published on 01 January 1996. Downloaded by Universidade Nova de Lisboa on 30/06/2014 13:38:48.
http://dx.doi.org/10.1039/JA9961100237

Table 1 Symbols that are used

Total number of measurements: $N$
Concentration levels: $1; \ldots; i; \ldots; n$
Number of replicates: $m_1; \ldots; m_i; \ldots; m_n$
$x_i$ = concentration at level $i$ ($i = 1, \ldots, n$)
$y_{ij}$ = $j$th absorbance measured at level $i$ ($j = 1, \ldots, m_i$)
$\bar{y}_i$ = mean absorbance at level $i$
$s_i$ = the standard deviation of the absorbances of level $i$
$w_i$ = the range of the absorbances of level $i$
$b_0; b_1; b_2$ = the estimated calibration parameters
$\hat{y}_i$ = estimated absorbance at level $i$
    $= b_0 + b_1 x_i$ for a straight-line model
    $= b_0 + b_1 x_i + b_2 x_i^2$ for a second degree model
$e_{ij} = y_{ij} - \hat{y}_i$ = $j$th residual at level $i$
$\bar{e}_i$ = mean of the residuals at level $i$
$\bar{e}$ = mean of all residuals
$ew_{ij} = \sqrt{q_i}\,[y_{ij} - \hat{y}_i]$ = $j$th weighted residual at level $i$
$q_i = 1/s_i^2$ = the weight at level $i$

Instrumental

All programming was performed on a Compaq ProLinea 4/25s personal computer. Visual Basic 3.0 (Microsoft) was used as the programming environment. A Perkin-Elmer (Norwalk, CT, USA) Zeeman 3030 atomic absorption spectrometer equipped with an HGA-600 graphite furnace, an AS-60 autosampler and a PR-100 printer was used for ETAAS determinations. For flame AAS determinations, a Perkin-Elmer 373 spectrometer with a PRS-10 printer sequencer was used.

Planning of Simulations

Description of experimental data

In order to simulate the data as realistically as possible, some real experimental results were obtained first. The investigated data sets contain absorbances measured in aqueous solutions, at different levels and over a large concentration range. Zn, Fe and Cu measurements were obtained with flame AAS, while ETAAS was used for Pb, Cd and Mn measurements.

For the different data sets, the standard deviation is constant at low concentration levels, but from a certain level it increases with the absorbance. This is illustrated in Fig. 1 for Fe measurements obtained with flame AAS. For the given example it is clear that, from a certain concentration level, the standard deviation increases linearly with the absorbance, but this could not be shown for all examined cases. The concentration level where the behaviour of the variance changes can be situated above, below or within the selected calibration range. This must be investigated during the validation, because the first situation (constant variance) permits the use of an unweighted model, while for the other two (non-constant variance) the most precise results are obtained with a weighted model.

Fig. 1 Standard deviation of the absorbance (n = 6) as a function of the Fe concentration in mg l⁻¹ (data obtained with flame AAS)

Fig. 2 Absorbances as a function of the Zn concentration in mg l⁻¹ (data obtained with flame AAS). The line represents the weighted second degree function that is computed from these data

Generally, up to a certain concentration, a straight-line model can be used to describe the relationship between concentration and absorbance. For higher concentrations a second degree calibration model is needed. This is also the case for the inspected data sets. Moreover, it is difficult to build a good calibration model in a concentration range where the calibration line is partially straight (lower part of the range) and partially curved (upper part of the range). Fig. 2 shows, for example, the absorbance values measured with flame AAS as a function of the Zn concentration. It can be seen that the weighted second degree model that is computed from the data does not describe the measurement results accurately. In such a situation the calibration range should be split into two parts. For the given example, an unweighted straight-line model can be used for concentrations up to 2 mg l⁻¹, while a weighted second degree model has to be used in the range between 2 and 15 mg l⁻¹ (results not shown). Consequently, in the selected calibration range, three situations can occur. The calibration line can be straight or curved over the complete range, or it can be partially straight and partially curved. Moreover, it is found that in the region where a straight-line model can be used, the variance remains constant, while it increases when the calibration line starts to bend.

Simulations

In order to evaluate the different statistical tests and experimental designs, 100 data sets were simulated for each situation considered. Normally distributed random numbers were generated in Visual Basic, using the method proposed by Box and Muller.¹⁹ The suitability of the random generator was confirmed by simulating a large number of data, of which the normality of the distribution was checked graphically (with a histogram) as well as statistically (with a χ² test). Moreover, no correlation was observed between data that were successively simulated.
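The generator described above can be illustrated with a minimal Python sketch of the Box-Muller transform. This is not the authors' Visual Basic code: the function names `box_muller` and `normal_sample`, the seed handling and the example mean and standard deviation are assumptions made only for illustration.

```python
import math
import random


def box_muller(rng: random.Random) -> tuple[float, float]:
    """Return two independent standard normal deviates (Box-Muller transform)."""
    u1 = 1.0 - rng.random()  # lies in (0, 1], so log(u1) is always defined
    u2 = rng.random()
    radius = math.sqrt(-2.0 * math.log(u1))
    angle = 2.0 * math.pi * u2
    return radius * math.cos(angle), radius * math.sin(angle)


def normal_sample(n: int, mean: float, sd: float, seed: int = 0) -> list[float]:
    """Draw n normally distributed values with the given mean and standard deviation."""
    rng = random.Random(seed)
    values: list[float] = []
    while len(values) < n:
        z1, z2 = box_muller(rng)
        values.extend((mean + sd * z1, mean + sd * z2))
    return values[:n]


if __name__ == "__main__":
    # Hypothetical absorbance level of 0.17 with a standard deviation of 0.002.
    sample = normal_sample(10, mean=0.17, sd=0.002, seed=1)
    print([round(v, 4) for v in sample])
```

As in the text, a generator of this kind would still be checked with a histogram, a χ² test and a look at the correlation between successive values before being used.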

The simulations are based on the Pb data that were obtained by ETAAS. The applied parameters are given in Table 2. The calibration range (0-100 µg l⁻¹) is divided into two equal parts in which different conditions can be valid. For example, a data set can be simulated with a constant standard deviation in the lower part, but an increasing standard deviation in the upper part of the calibration range. Similarly, a first degree model can be valid in the lower part but not in the upper part. For


Table 2 Parameters applied for the simulation of the data

Experimental conditions
Concentration range: 0-100 µg l⁻¹ (divided into two equal parts)
Equation calibration line: $y_i = \beta_0 + \beta_1 x_i + \beta_2 x_i^2$
    with $\beta_0 = 0.02$
         $\beta_1 = 0.003$
         $\beta_2$ = -4.0 ×, -3.8 ×, ..., -0.2 ×, 0.0 ×
Standard deviation:
    homoscedastic: $\sigma_i = 0.002$
    heteroscedastic: $\sigma_i$ = 0.002 + …_i 0.02
Introduction outlier: $y_{ij}' = y_{ij} + k\sigma$ with k = 4, 6, 8, 10 or 12
In case of a problem at the lowest concentration level (see Section 4): $y_{1j}$ = … + c

Experimental design
Design (d)    Number of concentration levels (n)    Number of replicates at each level (m_i)
[1]           3                                     12
[2]           4                                     9
[3]           6                                     6
[4]           9                                     4
[5]           12                                    3

each investigation (evaluation of outlier tests, tests to trace heteroscedasticity and goodness-of-fit tests) different simulations were performed, which are specified further.

Five different ways to distribute 36 measurement points symmetrically over the calibration range are considered (see Table 2). Design 1, for example, positions 12 measurements at each of three concentration levels, namely 0, 50 and 100 µg l⁻¹. The choice of 36 points is arbitrary. The main reason why this number was selected is that it permits the residual variance to be estimated with a large number of degrees of freedom (> 30) and that the measurement points can be distributed evenly over three or four concentration levels. These are the minimum numbers of levels needed to investigate a lack-of-fit to a straight-line and a second degree model, respectively (see below).
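A data set for one of the five designs could be simulated along the following lines. This is a sketch under stated assumptions, not the authors' program: the levels are assumed to be equally spaced between 0 and 100 µg l⁻¹, the standard deviation is assumed to stay at `sd_low` up to 50 µg l⁻¹ and then to rise linearly to `sd_high`, and the names `DESIGNS` and `simulate` are hypothetical.

```python
import numpy as np

# Design number -> (number of concentration levels n, replicates per level m_i)
DESIGNS = {1: (3, 12), 2: (4, 9), 3: (6, 6), 4: (9, 4), 5: (12, 3)}


def simulate(design, beta=(0.02, 0.003, 0.0), sd_low=0.002, sd_high=0.002, seed=0):
    """Simulate 36 absorbances for one design over the 0-100 range.

    beta holds (beta0, beta1, beta2) of the true calibration line.
    The noise model (flat, then linear above x = 50) is an assumption.
    """
    n_levels, m = DESIGNS[design]
    levels = np.linspace(0.0, 100.0, n_levels)  # assumed equally spaced levels
    x = np.repeat(levels, m)
    y_true = beta[0] + beta[1] * x + beta[2] * x**2
    # Fraction of the way through the upper half of the range (0 in the lower half).
    upper_fraction = np.clip((x - 50.0) / 50.0, 0.0, 1.0)
    sd = sd_low + (sd_high - sd_low) * upper_fraction
    rng = np.random.default_rng(seed)
    y = y_true + sd * rng.standard_normal(x.size)
    return x, y


if __name__ == "__main__":
    # Design 1, heteroscedastic in the upper half: sd 0.002, 0.002 and 0.008.
    x, y = simulate(1, sd_high=0.008, seed=1)
    for level in np.unique(x):
        print(level, round(float(y[x == level].std(ddof=1)), 4))
```

Keeping the noise model in one place makes it easy to swap in another variance function when a different heteroscedastic situation is to be studied.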


The design that is proposed by ISO,²,³ namely ten replicates at both extremes and a single measurement point at the eight other concentration levels, is not considered, mainly for practical reasons. In the first place, this design cannot be applied for the validation of a weighted model. In order to determine the weight factors, the variance must be estimated at the different concentration levels, which requires replicate measurements. The lack of replicates also hampers the application of outlier tests at the different levels, as well as the evaluation of tests which compare variance estimates at different levels, such as the Cochran test. Moreover, if one does not take into account the replicates at the extremes of the calibration range for the investigation of the goodness-of-fit, as is proposed by ISO, several tests that are evaluated here (e.g., ANOVA lack-of-fit) cannot be performed.

RESULTS AND DISCUSSION

1. Examination of the Residuals

The residuals ($e_{ij}$) as given in Table 1 are the differences between the responses actually measured ($y_{ij}$) and those predicted by the calibration model ($\hat{y}_i$). In order to examine the residuals, they are plotted against the predicted value. When a weighted model is used, the weighted residuals (see Table 1) are plotted against the predicted value. Fig. 3 illustrates four situations that can occur. In the first situation [Fig. 3(a)] the residuals form a horizontal band, which indicates no abnormality. In the second situation [Fig. 3(b)] the spread of the residuals increases with the size of the predicted value (and thus also with the concentration), which indicates that the variance is not constant. In such a situation the use of a weighted calibration model should be considered. Fig. 3(c) shows a trend in the residuals, which indicates that the model is inadequate. Fig. 3(d) illustrates the residual plot in a situation where the calibration set contains an outlier.

Draper and Smith⁸ have shown that the estimated residuals are correlated, but they indicate that this correlation does not invalidate the residual plot when $N$ is large compared with the number of regression parameters estimated.

Fig. 3 The residuals (i.e., the difference between the measured and estimated absorbances) are plotted against the estimated absorbance (predicted value) to detect a calibration problem. The following plots can be obtained, which indicate: (a), no abnormality; (b), variance that increases with the estimated absorbance (i.e., heteroscedasticity); (c), a lack-of-fit; and (d), an outlier


Table 3 Test criteria to trace an outlier at concentration level i. The applied symbols are explained in Table 1

Test criteria for Dixon's test, as presented by ISO:5
When 2 < m < 8: …
When 7 < m < 13: Q = …
When 12 …
… 6).

3. The Behaviour of the Variance

3.1. Description of the evaluated tests

Parametric tests. ISO proposes to use a one-tailed F-test to check (a) whether the variance at the highest concentration level is significantly larger than at the lowest concentration level and (b) whether the variance at the lowest concentration level is significantly larger than at the highest concentration level.²,³ For AAS applications, only the first test is meaningful, because one knows that the variance, when not constant, increases with the concentration. Other tests, which can be applied to trace a non-constant variance, use estimated variances at different levels. In this work the suitability of the Cochran, Hartley¹² and Bartlett¹³ tests is investigated. Table 4 shows how the test statistics are computed. Cochran compares the ratio between the highest variance and the sum of the variances with a critical value. Hartley, on the other hand, uses the ratio between the highest and the lowest variance. Theoretically, both these tests, for which specific tables exist, require an equal number of measurement points at each concentration level. However, ISO indicates that for the Cochran test small differences in the number of points can be ignored, and applies the Cochran criterion for the number of measurements that occur at most concentration levels. The more complex Bartlett test does not assume an equal number of points at the different levels. A quantity $M/C$ is computed (see Table 4), which is distributed as χ² when the variances are not significantly different. All these tests (Cochran, Hartley, Bartlett and F-test) are based on the comparison of estimated variances ($s^2$). Some workers¹⁴ propose to test the sample range ($w$). Correction terms are published¹⁴ by which the range


Table 4 Evaluated test parameters to detect a non-constant variance. The symbols are explained in Table 1

Hartley: $r = s_{\max}^2 / s_{\min}^2$
F-test: …
Bartlett: $M/C$, with: …
    where $\nu_i = m_i$ …
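The variance-based statistics discussed in Section 3.1 can be sketched in Python as follows. The Cochran, Hartley and one-tailed F ratios follow the verbal definitions in the text. For Bartlett, the common textbook form of $M$ and $C$ with $\nu_i = m_i - 1$ is assumed, which may differ in notation from Table 4. The function names are hypothetical, and the resulting values still have to be compared with tabulated critical values.

```python
import math


def sample_variance(values):
    """Unbiased variance estimate s^2 of one concentration level."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / (len(values) - 1)


def variance_statistics(groups):
    """Compute variance-comparison statistics for replicate groups.

    groups is a list of replicate absorbances per level, ordered from the
    lowest to the highest concentration. All variances must be non-zero.
    """
    s2 = [sample_variance(g) for g in groups]
    nu = [len(g) - 1 for g in groups]  # degrees of freedom per level
    nu_total = sum(nu)
    pooled = sum(v * s for v, s in zip(nu, s2)) / nu_total
    # Bartlett quantities in the usual textbook form (assumed here).
    m_stat = nu_total * math.log(pooled) - sum(v * math.log(s) for v, s in zip(nu, s2))
    c_stat = 1.0 + (sum(1.0 / v for v in nu) - 1.0 / nu_total) / (3.0 * (len(groups) - 1))
    return {
        "cochran": max(s2) / sum(s2),      # highest variance / sum of variances
        "hartley": max(s2) / min(s2),      # highest variance / lowest variance
        "f_one_tailed": s2[-1] / s2[0],    # highest level / lowest level
        "bartlett": m_stat / c_stat,       # compared with a chi-squared value
    }


if __name__ == "__main__":
    stats = variance_statistics([[1.0, 2.0, 3.0], [2.0, 4.0, 6.0], [3.0, 6.0, 9.0]])
    print(stats)
```

Because the one-tailed F ratio uses only the first and the last group, it is insensitive to what happens at the intermediate levels, which is the property exploited in Section 3.3.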

must be divided to obtain an estimate of $s$. These estimates can then be applied in parametric statistical tests, such as the F-test. Alternatives for the Cochran and Hartley tests, based on the ranges, have also been described.¹⁴ The test statistics for these tests are also given in Table 4.

Randomization test. Apart from the parametric tests, which are generally applied, this work also investigates the applicability of a randomization test to trace heteroscedasticity. The randomization F-test that we propose is illustrated in Table 5. First, at each concentration level, the squared differences between the individual and the mean measurement results are computed ($d_{ij}^2$). Since the size of the $d^2$ values depends on the variance at that level, the test statistic for the experimental data ($R_e$) is computed as the sum of the $d^2$ values at the upper concentration level divided by the sum of the values at the lowest level. In a next step, the $d^2$ values are randomly permuted between the concentration levels, and for each permutation an $R_p$ value is computed. If the $d^2$ values (and thus the variances) at the upper and the lower concentration levels are similar, $R_p$ values will be found that are distributed around $R_e$. However, if the $d^2$ values at the upper concentration level are significantly larger, most of the $R_p$ values will be smaller than $R_e$. Consequently, the significance level can be computed as the ratio of the number of permutations with $R_p > R_e$ to the total number of permutations. In this work the test results are based on 1000 permutations obtained by means of a random data permutation program.¹⁸

3.2. Performance under normal conditions

As already mentioned, five different ways to distribute the 36 measurement points symmetrically over the calibration range were considered. Fig. 5 shows the number of positive test results for these designs, in a situation where the variance is

Fig. 5 Percentage of positive test results obtained with the Cochran, Hartley, Bartlett and F-tests for designs 1 to 5 in a heteroscedastic situation

constant in the lower part of the calibration range, but increases in the upper part. It can be seen that for all tests the best results are obtained with the design that positions all measurement points at three concentration levels (design 1). Distributing the measurements over more levels, and thus reducing the number of replicates at each level, decreases the probability of detecting the heteroscedasticity. When the different tests are compared, one can conclude that the best results are obtained with the Bartlett and the F-tests. The F-test has its simplicity as an additional advantage. Moreover, this test is recommended by ISO.²,³ For homoscedastic measurements the evaluated tests produce between 2 and 10% of false positive results, which is in agreement with the specified significance level of 5%.

For the tests that estimate the standard deviation by the range, a similar performance is observed. The test that uses the ratio of the smallest to largest absorbance range, for example, performs similarly to the classical Hartley test. Moreover, determining the significance level of the F-test by a randomization procedure gives similar results to using a critical value.

3.3. Performance in the presence of outliers

Since there is always a probability that an outlier is not detected, it is important to examine the effect of such a measurement point on the applied tests. Here, a number of situations are considered, where an outlier at one of the concentration levels results in an overestimation of the variance at that level.

With homoscedasticity, this overestimated variance can lead to an increased number of false positive conclusions. Fig. 6 illustrates this, in a situation where the lower or the upper concentration level contains an outlier at 6σ. With the Bartlett test, for example, between 50 and 80% of false positive results are obtained, depending on the applied design. Similar results are obtained with the Cochran and Hartley tests (not shown). The F-test is only affected by the outliers at the extreme levels, because it does not use the data of the other levels. Moreover, since this test is one-sided, an outlier at the lowest concentration level does not increase the number of false positive results (see Fig. 6). When the upper concentration level contains an outlier, the F-test gives a number of false positive results comparable to that of the other tests. However, for the most suitable designs (designs 1, 2 and 3), this number largely decreases when the significance level of the test is determined by a randomization procedure (see Fig. 6). A similar conclusion is obtained for data sets where one of the concentration levels is contaminated with two outliers. The number of false positive results obtained with the F-test is only increased by outliers

Fig. 6 Percentage of positive test results obtained with the Bartlett test (squares), the F-test (triangles) and the randomization F-test (diamonds) for designs 1 to 5 in a homoscedastic situation, but with an outlier (6σ) at the lower (solid line) or upper (broken line) concentration level


at the upper concentration levels, while the other tests are also affected by an overestimation of the variance at the other levels.

When, in a heteroscedastic situation, the variance at the lowest concentration level is overestimated owing to an outlier, an increased number of false negative conclusions can be obtained. In that case, the real difference between the variance at the highest and lowest concentration levels is underestimated, so that an existing heteroscedasticity is masked. As an example, a situation is considered where design 1 is applied and the real standard deviations at the lowest, the middle and the upper level of the concentration range equal 0.002, 0.002 and 0.008, respectively. The lowest level is contaminated with an outlier, which leads to an overestimation of the variance at this level. With an outlier positioned at 4σ or at 6σ, the probability of detecting the heteroscedasticity with the F-test decreases from about 100% to 65% and 30%, respectively. Determining the significance levels of the F-test by a randomization procedure does not improve the results. With the Bartlett test, this decrease is also observed, but the number of positive test results stabilizes and even slightly increases for large outliers. This is because, in this situation, the variance at the lowest level is overestimated in such a way that it becomes significantly larger than the variance at the other levels. This effect is not experienced with the one-tailed F-test, which checks whether the variance at the upper level is significantly higher than at the lower level.

4. Goodness-of-fit for Homoscedastic Data

4.1. Description of the tests

Parametric tests. ISO²,³ recommends, as a goodness-of-fit test, to check whether the data are better fitted by a second degree model ($y = b_0 + b_1 x + b_2 x^2$) than by a straight-line model ($y = b_0 + b_1 x$) using an F-test. In this work we preferred to check whether $b_2$ is significantly different from zero, using a t-test (see Table 6). When it is shown that the data are better fitted by a second degree than by a straight-line model, one concludes that the straight-line model does not fit the data accurately. This does not, however, prove that the second degree model fits the data correctly, as is illustrated in Fig. 2. ANOVA lack-of-fit is a test which can be applied to check the goodness-of-fit of a model such as a straight-line or second degree model. The test, which is described in Table 6, compares the error due to lack-of-fit with the pure experimental error.

Randomization tests. Van der Voet²² has proposed a randomization test to compare the predictive accuracy of two calibration models. In this work we applied this test to check whether better predictions are made by a second degree model than by a straight-line model. Consequently, it can be considered as a

Table 5 Procedure of the randomization F-test

1. Computation of $d^2$ values:
    $d_{ij}^2 = (y_{ij} - \bar{y}_i)^2$
2. Test parameter for experimental data:
    $R_e = \sum_j d_{nj}^2 \,/\, \sum_j d_{1j}^2$
3. Permutations:
    The following steps are repeatedly performed:
    3.1. The $d^2$ values are randomly permuted between the concentration levels
    3.2. The test parameter is computed for each permutation: $R_p$
4. Determination of significance level:
    = (number of permutations with $R_p > R_e$) / (total number of permutations)
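The procedure of Table 5 can be sketched in Python as shown below. It is an illustration rather than the authors' implementation: the $d^2$ values of all levels are assumed to be pooled and shuffled together, the function name `randomization_f_test` and the example data are hypothetical, and the lowest level is assumed to show some spread so that $R_e$ is defined.

```python
import random


def randomization_f_test(groups, n_perm=1000, seed=0):
    """Randomization F-test for heteroscedasticity.

    groups lists the replicate absorbances per level, ordered from the lowest
    to the highest concentration (at least two levels). Returns (R_e, p).
    """
    # Step 1: squared deviations from the level mean at every level.
    d2 = []
    for level in groups:
        mean = sum(level) / len(level)
        d2.append([(y - mean) ** 2 for y in level])
    # Step 2: test parameter for the experimental data (upper / lowest level).
    r_e = sum(d2[-1]) / sum(d2[0])
    # Step 3: shuffle the pooled d^2 values over the levels (assumed pooling).
    pooled = [v for level in d2 for v in level]
    m_low, m_high = len(d2[0]), len(d2[-1])
    rng = random.Random(seed)
    exceed = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        low = sum(pooled[:m_low])
        high = sum(pooled[-m_high:])
        if high > r_e * low:  # same as high / low > R_e, without dividing by zero
            exceed += 1
    # Step 4: fraction of permutations with R_p > R_e.
    return r_e, exceed / n_perm


if __name__ == "__main__":
    low = [0.100, 0.101, 0.099, 0.100, 0.1005, 0.0995]
    mid = [0.200, 0.201, 0.199, 0.200, 0.2005, 0.1995]
    high = [0.30, 0.32, 0.28, 0.31, 0.29, 0.33]
    print(randomization_f_test([low, mid, high]))
```

A small returned fraction indicates that the spread at the upper level is unusually large compared with what random reassignment of the $d^2$ values produces.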

Table 6 Parameters applied to detect a lack-of-fit of an unweighted calibration model. The symbols are explained in Table 1

ANOVA lack-of-fit test:

Source         Sum of squares (SS)                                         Degrees of freedom (df)    Mean squares (MS)
Pure error     $SS_{pe} = \sum_i \sum_j (y_{ij} - \bar{y}_i)^2$             $N - n$                    $MS_{pe} = SS_{pe}/df_{pe}$
Lack of fit    $SS_{lof} = \sum_i m_i (\bar{y}_i - \hat{y}_i)^2$            $n - k$                    $MS_{lof} = SS_{lof}/df_{lof}$

with k = 2 for a straight-line model
     k = 3 for a second degree model

$F = MS_{lof} / MS_{pe}$, compared with $F_{(N-n),(n-k),\alpha}$

Significance of second degree term: …
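Both parametric checks can be sketched with NumPy as follows. The lack-of-fit F ratio follows the sums of squares of Table 6. The t value for the quadratic term is computed in the usual least-squares way ($b_2$ divided by its standard error), which is assumed here. The function names are hypothetical, and the returned values must still be compared with tabulated F and t values.

```python
import numpy as np


def lack_of_fit_f(x, y, degree):
    """ANOVA lack-of-fit F ratio for a polynomial of the given degree (1 or 2)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    coef = np.polyfit(x, y, degree)
    levels = np.unique(x)
    k = degree + 1  # number of estimated parameters
    ss_pe = 0.0
    ss_lof = 0.0
    for level in levels:
        y_level = y[x == level]
        y_hat = np.polyval(coef, level)
        ss_pe += np.sum((y_level - y_level.mean()) ** 2)          # pure error
        ss_lof += y_level.size * (y_level.mean() - y_hat) ** 2    # lack of fit
    df_lof = levels.size - k
    df_pe = y.size - levels.size
    return (ss_lof / df_lof) / (ss_pe / df_pe), df_lof, df_pe


def quadratic_term_t(x, y):
    """t value of b2 in y = b0 + b1*x + b2*x^2 and its degrees of freedom."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    design = np.column_stack([np.ones_like(x), x, x**2])
    b, *_ = np.linalg.lstsq(design, y, rcond=None)
    residuals = y - design @ b
    df = y.size - 3
    s2 = residuals @ residuals / df                 # residual variance
    cov = s2 * np.linalg.inv(design.T @ design)     # covariance of b0, b1, b2
    return b[2] / np.sqrt(cov[2, 2]), df


if __name__ == "__main__":
    levels = np.array([0.0, 25.0, 50.0, 75.0, 100.0])
    x = np.repeat(levels, 3)
    noise = np.tile([-0.001, 0.0, 0.001], levels.size)  # fixed illustrative noise
    y = 0.02 + 0.003 * x - 4e-6 * x**2 + noise
    print(lack_of_fit_f(x, y, 1), lack_of_fit_f(x, y, 2), quadratic_term_t(x, y))
```

Note that the lack-of-fit test needs replicates and more levels than parameters, which is why at least three levels are required for a straight line and four for a second degree model.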

randomization alternative of the parametric test proposed by ISO.²,³ The test compares the squared residuals for the straight-line model ($e_{ij,1}^2$) and the squared residuals for the second degree model ($e_{ij,2}^2$). If both models have the same predictive ability, $e_{ij,1}^2$ and $e_{ij,2}^2$ have equal distributions ($H_0$ = null hypothesis). However, if better predictions are obtained with the second degree model, the $e_{ij,1}^2$ values are generally larger than the $e_{ij,2}^2$ values. In order to test this, the difference between the squared residuals, $ds_{ij} = e_{ij,1}^2 - e_{ij,2}^2$, is computed for each measurement point. The mean of these values, which is called $T$, is then used as the test parameter. This value is first computed for the experimental data ($T_e$). If both models have the same predictive ability, the $ds_{ij}$ values are small and distributed around zero, so that $T_e$ will be almost zero. However, if the second degree model provides better estimations, the $ds_{ij}$ values will generally be positive, so that a $T_e$ value larger than zero is obtained. In the randomization test, at each iteration random signs are attached to the $ds_{ij}$ values and the $T$ value is computed ($T_r$). If the original $ds_{ij}$ values are distributed around zero ($H_0$ true), this operation has little effect on $T$, so that $T_r$ values are obtained which are sometimes larger and sometimes smaller than $T_e$. However, if random signs are attached to $ds_{ij}$ values that are predominantly positive ($H_0$ not true), most $T_r$ values will be smaller than $T_e$. Therefore, the significance level is computed as the ratio between the number of iterations with $T_r > T_e$ and the total number of iterations.
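The sign-randomization step described above can be sketched as follows. The sketch takes the residuals of both fitted models as input. The function name `sign_randomization_test`, the default of 1000 iterations and the example residuals are assumptions for illustration only.

```python
import random


def sign_randomization_test(res_line, res_quad, n_iter=1000, seed=0):
    """Compare two models by randomly signing differences of squared residuals.

    res_line and res_quad hold the residuals of the straight-line and the
    second degree model for the same measurement points. Returns (T_e, p).
    """
    ds = [e1 ** 2 - e2 ** 2 for e1, e2 in zip(res_line, res_quad)]
    t_e = sum(ds) / len(ds)  # test parameter for the experimental data
    rng = random.Random(seed)
    exceed = 0
    for _ in range(n_iter):
        # Attach a random sign to every ds value and recompute the mean.
        t_r = sum(d if rng.random() < 0.5 else -d for d in ds) / len(ds)
        if t_r > t_e:
            exceed += 1
    return t_e, exceed / n_iter


if __name__ == "__main__":
    res_line = [0.005, -0.004, 0.006, -0.005, 0.004, -0.006, 0.005, -0.004]
    res_quad = [0.001, -0.001, 0.001, -0.001, 0.001, -0.001, 0.001, -0.001]
    print(sign_randomization_test(res_line, res_quad))
```

When every squared residual of the straight-line model exceeds that of the second degree model, no random signing can raise the mean above $T_e$, so the returned fraction is zero.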

In this paper we propose another randomization test to trace a lack-of-fit. This test investigates whether the prediction error, estimated by the residuals, is independent of the concentration level. As shown in Fig. 3(c), a lack-of-fit can be detected by demonstrating that the size of the residuals depends on the level where they are measured. Table 7 explains how this can be used in a randomization test. The numerator of the applied test parameter is an estimate of the variation of the mean residuals between the different concentration levels, while the denominator estimates the variation within the levels. First, the parameter value is computed for the experimental data ($L_e$). Then, data permutations are performed. At each step the residuals are re-assigned to the different concentration levels, and the test parameter is re-calculated ($L_p$). Re-assigning the data has little effect on the test parameter when the residuals are randomly distributed over the concentration levels [see Fig. 3(a)]. Consequently, for the different permutations one will find test values which are sometimes larger and sometimes smaller than that obtained for the experimental data. However, when the size of the residuals depends on the concentration level [see Fig. 3(c)], re-assigning the data will affect the test parameter. In fact, for most permutations a test value will be

    242 Journal of Analytical Atomic Spectrometry, April 1996, Vol.11

    Publishedon01January1996.DownloadedbyUniversidadeNov

    adeLisboaon30/06/201413:38:48.

    View Article Online

    http://dx.doi.org/10.1039/JA9961100237
  • 5/23/2018 Validation of the Calibration Procedure in Atomic Absorption Spectrometr...

    http:///reader/full/validation-of-the-calibration-procedure-in-atomic-absorption

    Table7 Procedure of the randomization test to investigate thegoodness-of-fit1. The test parameter:

    m i [ ~ i - ~ z

    2. Computation of the test parameter for the experimental data:

    1008060

    4020

    W.I -20 -1 5 -1 0 -5 0che residuals [el ; . ; e i j ;. ; enm]are used to compute L .-

    3. Permutations:The following steps are repeatedly performed:3.1. The residuals are randomly assigned to the n concentration3.2. The test parameter is computed for the ith permutation: L,levels

    4.Determination of signi,ficance level:= total number of permutationsnumber of permutations with L,>L,found which is lower than that obtained for the experimentaldata. The significance level is then calculated as the ratiobetween the number of permutations with a test parameterhigher than that computed with the experimental data L,.>L,)and the total number of permutations. In this work the testresults are again based on 1000 random permutations.

    Strictly speaking, the proposed test, as with most statisticaltests, is applicable only when the data are independent. Forthe estimated residuals this is not the case owing to thecorrelation that exists between them. However, when thenumber of measurement points is large compared with thenumber of regression parameters estimated, the effect is small,so that it can be ignored.
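The permutation procedure of Table 7 can be sketched as follows. This is a hedged illustration rather than the authors' code: the exact form of the test parameter is an assumption (a plain ratio of the between-level sum of squares of the mean residuals to the within-level sum of squares, without degrees-of-freedom factors, which would be the same constant for every permutation), and the names `level_ratio` and `lack_of_fit_permutation_test` are hypothetical.

```python
import random


def level_ratio(groups):
    """Between-level over within-level variation of the residuals.

    groups: one list of residuals per concentration level.
    Assumed form of the test parameter L (see lead-in).
    """
    pooled = [e for g in groups for e in g]
    grand_mean = sum(pooled) / len(pooled)
    between = 0.0
    within = 0.0
    for g in groups:
        level_mean = sum(g) / len(g)
        between += len(g) * (level_mean - grand_mean) ** 2
        within += sum((e - level_mean) ** 2 for e in g)
    return between / within


def lack_of_fit_permutation_test(groups, n_perm=1000, seed=0):
    """Return (L_e, significance level) for the residuals in `groups`."""
    rng = random.Random(seed)
    sizes = [len(g) for g in groups]
    pooled = [e for g in groups for e in g]
    l_exp = level_ratio(groups)  # step 2: experimental data

    exceed = 0
    for _ in range(n_perm):
        # Step 3.1: randomly re-assign the residuals to the levels.
        rng.shuffle(pooled)
        shuffled, start = [], 0
        for m in sizes:
            shuffled.append(pooled[start:start + m])
            start += m
        # Step 3.2: recompute the test parameter for this permutation.
        if level_ratio(shuffled) > l_exp:
            exceed += 1
    # Step 4: fraction of permutations with L_i > L_e.
    return l_exp, exceed / n_perm


if __name__ == "__main__":
    # Illustrative residuals whose size clearly depends on the level.
    residuals = [[1.0, 1.1, 0.9], [-1.0, -1.1, -0.9], [0.0, 0.1, -0.1]]
    print(lack_of_fit_permutation_test(residuals))
```

A small significance level indicates that the residuals are not randomly distributed over the concentration levels. The sketch ignores the correlation between the residuals, in line with the remark above.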

    4.2. Performance under normal conditions

    First, the validation of a straight-line model is investigated. For the test that is based on the significance of the quadratic term the best result is theoretically found with design 1. It follows from the D-optimality principle (ref. 23) that the volume of the confidence region for the calibration parameters of a second degree polynomial is minimized when the measurements are equally distributed over the two extremes and the middle of the concentration range. It is clear that the more precisely the calibration parameters are estimated, the easier it is to demonstrate that the quadratic term is significantly different from zero. For the ANOVA lack-of-fit it is more difficult to explain theoretically which is the optimum design. The simulations show, however, that the design that distributes all measurement points over three concentration levels (design 1) is the best to trace a lack-of-fit of the straight-line model with both tests. The sensitivity of the tests decreases when the number of concentration levels over which the measurements are distributed increases and the number of replicates at each level decreases. This is illustrated in Fig. 7(a) for the ANOVA lack-of-fit. One can see, for example, that for a calibration line with a quadratic term equal to $-12 \times 10^{-7}$, the lack-of-fit to the straight-line model is almost certainly detected when design 1 is used. With design 5, on the other hand, the probability of correctly detecting the lack-of-fit is about 30%. The test based on the significance of the quadratic term is less affected by the applied design [see Fig. 7(b)]. Compared with the ANOVA lack-of-fit, this test gives similar results when the most suitable design is applied, but better results when the other designs are applied. The randomization tests perform less well. For example, the ANOVA lack-of-fit detects a problem in 97, 88 and 34% of the cases when the calibration lines contain a quadratic term equal to $-12 \times 10^{-7}$, $-8 \times 10^{-7}$ and $-4 \times 10^{-7}$, respectively (design 1 applied). With the randomization test described in Table 7, this is in 93, 72 and 18% of the cases. The randomization test proposed by van der Voet (ref. 22) detects a lack-of-fit in 82, 56 and 13% of the cases. The number of false positive conclusions obtained with the randomization tests is situated between 2 and 6%, which is in agreement with the specified significance level of 5%.

    Fig. 7 Percentage of cases for which (a) a lack-of-fit is detected with an ANOVA, and (b) a significance of the quadratic term is detected, for a curved calibration line and for different designs (designs 1 to 5). Horizontal axis: quadratic term.

    Similar conclusions are obtained for situations where the lack-of-fit to the straight-line model is the result of problems other than a curvature to the x-axis. For example, a situation is considered where data following a straight-line model are simulated, and a constant value $c_b$ is added to all measurement points of the lowest concentration level. Also, in that situation, the test based on the significance of the quadratic term and the ANOVA lack-of-fit give the best results, and they should preferably be combined with design 1.
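As an illustration of the ANOVA lack-of-fit for an unweighted straight-line model, the sketch below computes the F ratio from replicated measurements. It is a minimal example under assumptions, not the procedure as coded by the authors: the function names are hypothetical, the lack-of-fit degrees of freedom are taken as the number of levels minus the number of model parameters, and the pure-error degrees of freedom as the number of points minus the number of levels. No p-value is computed, because the F distribution is not part of the Python standard library.

```python
def fit_line(x, y):
    """Ordinary least-squares fit y = b0 + b1*x; returns (b0, b1)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    return mean_y - b1 * mean_x, b1


def lack_of_fit_f(levels, replicates):
    """F ratio of the ANOVA lack-of-fit for a straight-line model.

    levels: concentration of each level; replicates: list of signal lists,
    one per level. Returns (f_value, df_lack_of_fit, df_pure_error).
    """
    x = [c for c, ys in zip(levels, replicates) for _ in ys]
    y = [v for ys in replicates for v in ys]
    b0, b1 = fit_line(x, y)

    df_lof = len(levels) - 2          # levels minus parameters (k = 2)
    df_pe = len(y) - len(levels)      # points minus levels
    if df_lof <= 0 or df_pe <= 0:
        raise ValueError("need more levels than parameters and replicates")

    ss_lof = 0.0  # deviation of the level means from the fitted line
    ss_pe = 0.0   # scatter of the replicates around their level mean
    for c, ys in zip(levels, replicates):
        mean_y = sum(ys) / len(ys)
        ss_lof += len(ys) * (mean_y - (b0 + b1 * c)) ** 2
        ss_pe += sum((v - mean_y) ** 2 for v in ys)

    f_value = (ss_lof / df_lof) / (ss_pe / df_pe)
    # A p-value could be obtained, for instance, with
    # scipy.stats.f.sf(f_value, df_lof, df_pe) if SciPy is available.
    return f_value, df_lof, df_pe


if __name__ == "__main__":
    # Illustrative curved data on three levels (not taken from the paper).
    conc = [0.0, 50.0, 100.0]
    signals = [[0.00, 0.01, -0.01], [0.60, 0.61, 0.59], [1.00, 1.01, 0.99]]
    print(lack_of_fit_f(conc, signals))
```

With three levels and a straight-line model one degree of freedom is left for the lack-of-fit; with a three-parameter model on the same design the function would have nothing left to test, which is why it raises an error when the number of levels does not exceed the number of parameters.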

    In order to validate a second degree model, an ANOVA lack-of-fit is applied. Design 1, which uses only three concentration levels, cannot be applied here. Table 6 illustrates that with three parameters to estimate ($k = 3$) and three concentration levels ($n = 3$), the degrees of freedom of the error due to lack-of-fit would be zero. In order to simulate a lack-of-fit to a second degree calibration model, curved calibration lines were simulated and a constant value ($c_b$) was added to the measurement results of the lowest concentration level. Fig. 8 gives the percentage of positive test results, for the different $c_b$ values added, and for different designs. For the given situations, the most suitable designs to detect a lack-of-fit are designs 3 and 4. For example, when $c_b = 0.005$, with design 2 a lack-of-fit is detected in 30% of the cases, while designs 3 and 4 detect it in about 50% of the cases. Thus, in contrast to the straight-line model, the distribution of the measurement points over the minimum number of concentration levels (four in this situation) does not guarantee the best results. The reason for this is explained further for the weighted second degree model, where this problem is even more important. The randomization test performs less well than the ANOVA lack-of-fit. For example, for $c_b$ values of 0.002, 0.004 and 0.006, a lack-of-fit is detected in 16, 33 and 62% of the cases, respectively, when ANOVA lack-of-fit and design 3 are applied. With the same design, the randomization test detects a lack-of-fit in 4, 16 and 42% of the cases, respectively.

    Fig. 8 Percentage of positive test results obtained with an ANOVA lack-of-fit for a second degree model (designs 2 to 5), in a situation where the lowest concentration level is contaminated. To simulate this problem, constant values ($c_b$) of 0.00, 0.02, 0.03, 0.04, 0.05 and 0.06 were added to the absorbances of the lowest concentration level.

    4.3. Performance in the presence of outliers

    The effect of outliers is only evaluated for the validation of a straight-line model. Two outlier problems are examined, namely a curvature that is masked by one or two excessively high measurement results at the upper concentration level and a curvature that is falsely detected owing to one or two excessively high measurements at the middle concentration level. For the most suitable design (design 1) the effect of an outlier at the upper level is the same for the ANOVA lack-of-fit and the test based on the significance of the quadratic term. When the curvature is strong and the outlier is not very large, the probability of detecting a lack-of-fit to the straight-line model is little affected. However, when the curvature is weak and the outlier is large, the probability of detecting the curvature decreases. For example, when the quadratic term equals $-12 \times 10^{-7}$, $-8 \times 10^{-7}$ and $-4 \times 10^{-7}$, a lack-of-fit is detected in 97, 95 and 34% of the cases, respectively. When an outlier ($6\sigma$) is introduced at the upper concentration level, the probabilities decrease to 89, 28 and 3%. With two outliers at the upper concentration level ($4\sigma$ and $8\sigma$), the probabilities of detecting the lack-of-fit are 25, 3 and 1%. As in the outlier-free situation, the test based on the significance of the quadratic term is less affected by the applied design in the presence of an outlier.

    The probability of falsely detecting a curvature increases if the middle concentration level contains one or two outliers. With design 1, for example, and a single outlier ($6\sigma$) at the middle concentration level, one falsely detects a lack-of-fit to the straight-line model in 10% of the cases, with the ANOVA lack-of-fit as well as with the test based on the significance of the quadratic term. When the middle concentration level is contaminated with two outliers ($4\sigma$ and $8\sigma$), a lack-of-fit is detected in 22% of the cases. The tests are performed at a significance level of 5%, so that a number of false positive results around this level is expected.

    As in the outlier-free situation, the randomization tests are less sensitive than the parametric tests in the presence of outliers. This means that the number of false positive results, owing to an outlier in the middle of the concentration range, is lower with the randomization tests. However, the probability of correctly detecting a curvature when the upper concentration level contains an outlier is also lower with the randomization tests than with the parametric tests. Therefore, one cannot say that the randomization tests are more robust to outliers.

    5. Goodness-of-fit for Heteroscedastic Data

    5.1. Description of the tests

    In order to examine the goodness-of-fit of a weighted calibration model, two parametric and two randomization tests are evaluated. The first test (ref. 17) computes the sum of squares $S = \sum_i w_i (y_i - \hat{y}_i)^2$, where the weight factor $w_i$ is the inverse of the variance of $y_i$. If the calibration model describes the data accurately, the value of $S$ has a $\chi^2$ distribution, with $n - 2$ or $n - 3$ degrees of freedom for a straight-line and a second degree model, respectively. As a second test, to validate a straight-line model, one can also check the significance of the quadratic term for a weighted model.

    The evaluated randomization tests are similar to those described in Section 4.1. In order to apply the test proposed by van der Voet (ref. 22) for weighted models, the weighted residuals $\sqrt{w_i}\,(y_{ij} - \hat{y}_i)$ are used. Thus, one computes the mean difference between the squared weighted residual for the straight-line model and the squared weighted residual for the second degree model and applies the test to these values. For the test that is proposed by us (see Table 7), one states that the weighted residuals must be randomly distributed over the concentration levels. Thus, the randomization test which is explained in Table 7 can also be applied to the weighted residuals. Here it is also assumed that the correlation that exists between the weighted residuals does not affect the test results.

    5.2. Performance

    Fig. 9 compares the performance of the evaluated tests for the validation of a weighted straight-line model. A situation is considered where all measurement points are distributed over three concentration levels (design 1) and where the relative standard deviation equals 2%. It can be seen that the best results are obtained with the test that determines the significance of the quadratic term. The randomization test that is proposed by us performs less well. The other tests seem less suitable. Regarding the selection of the design, the same conclusions can be made as for the unweighted straight-line model, namely that the best results are obtained with design 1. It should also be remarked that when the heteroscedasticity is not detected, and the goodness-of-fit tests for an unweighted model are performed, satisfactory results are still obtained when the data set is free of outliers. However, when the middle concentration level is contaminated with two outliers ($4\sigma$ and $8\sigma$), the probability of obtaining a false positive result with the unweighted test is very high (between 20 and 30%). The weighted tests, on the other hand, are little affected by these outliers. This is probably because the outliers increase the variance at the middle concentration level so that a smaller weight is given to this level.

    Fig. 9 Percentage of cases where a lack-of-fit to a weighted straight-line model is detected with the $\chi^2$ test, the randomization test proposed in Table 7, the randomization test proposed by van der Voet and by determining the significance of the quadratic term. Horizontal axis: quadratic term.

    Fig. 10 Weighted residuals for a second degree calibration model in a situation where a contamination of the lowest concentration level occurs. Design 2 (a) and design 5 (b) are applied. Horizontal axis: concentration.

    In order to evaluate the goodness-of-fit of a weighted second degree model, the randomization test described in Table 7, performed on the weighted residuals, is found to be the most suitable test. The probability of detecting a lack-of-fit increases when the measurement points are spread over an increasing number of concentration levels, so that the best results are obtained with design 5. This is because, for a small number of concentration levels, an alternative model can be found which fits the actual data accurately, but does not give a correct description of the relation between concentration and absorbance. This is illustrated in Fig. 10. The weighted residuals for a calibration line obtained with design 2 and design 5 are given in a situation where a curved line is simulated, but a value $c_b$ is added to all data points of the lowest level. Design 5 clearly indicates a problem, because the weighted residuals are not randomly distributed over the concentration levels. With design 2, on the other hand, no problem is detected. However, the calibration model that is found with this latter design cannot be used to make correct estimations. The $\chi^2$ test performs less well than the randomization test. With design 4, for example, and $c_b$ = 0.002, 0.004 and 0.006, the randomization test detects a lack-of-fit in 24, 58 and 90% of the cases, respectively. With the $\chi^2$ test, the lack-of-fit is only detected in 1, 16 and 38% of the cases, respectively. In contrast to what is concluded for the straight-line model, the goodness-of-fit tests that assume homoscedasticity fail when they are applied to a second degree model in a heteroscedastic situation.
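The weighted sum of squares used by the first parametric test of Section 5.1 can be illustrated for a straight-line model as follows. The sketch rests on several assumptions that are not stated in the text: the signal at each level is taken as the mean of the replicates, its variance is estimated from those replicates, and the weighted least-squares line is fitted to the level means. The function names are hypothetical, and the value 3.84 used in the usage example is the 5% critical value of the chi-square distribution with one degree of freedom.

```python
from statistics import mean, variance


def weighted_line_fit(x, y, w):
    """Weighted least-squares fit y = b0 + b1*x; returns (b0, b1)."""
    sw = sum(w)
    mean_x = sum(wi * xi for wi, xi in zip(w, x)) / sw
    mean_y = sum(wi * yi for wi, yi in zip(w, y)) / sw
    sxx = sum(wi * (xi - mean_x) ** 2 for wi, xi in zip(w, x))
    sxy = sum(wi * (xi - mean_x) * (yi - mean_y)
              for wi, xi, yi in zip(w, x, y))
    b1 = sxy / sxx
    return mean_y - b1 * mean_x, b1


def weighted_sum_of_squares(levels, replicates):
    """Return (S, degrees of freedom) for a weighted straight-line model.

    Assumption: y_i is the mean signal at level i and w_i is the inverse
    of the estimated variance of that mean (replicates / sample variance).
    """
    means = [mean(ys) for ys in replicates]
    weights = [len(ys) / variance(ys) for ys in replicates]
    b0, b1 = weighted_line_fit(levels, means, weights)
    s = sum(w * (m - (b0 + b1 * c)) ** 2
            for w, m, c in zip(weights, means, levels))
    return s, len(levels) - 2  # n - 2 for a straight line


if __name__ == "__main__":
    conc = [10.0, 50.0, 100.0]
    curved = [[0.09, 0.10, 0.11], [0.58, 0.60, 0.62], [0.95, 1.00, 1.05]]
    s, dof = weighted_sum_of_squares(conc, curved)
    # Compare S with the tabulated chi-square value for `dof` degrees of
    # freedom (3.84 at the 5% level when dof = 1).
    print(s, dof, s > 3.84)
```

For a second degree model the same statistic would be compared with the chi-square distribution with n - 3 degrees of freedom, after replacing the line fit by a weighted quadratic fit.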

    Recommended Strategy

    The described results were used to build a strategy for the validation of atomic absorption calibration models. The proposed strategy is based on the assumption that the analyst has an idea of the linear range of the calibration line before he starts the validation. Consequently, he will preferably try to demonstrate the suitability of a straight-line model within this range. However, sometimes the linear range is so small that one is obliged to work in the curved concentration range. The analyst will then try to demonstrate the suitability of a second degree model within the specified calibration range.

    A number of experienced analysts, whose opinion was asked, stated that to ensure the general acceptance of the validation strategy it should be combined with information on how to continue when a problem, such as a lack-of-fit, is detected. Although this is not part of the validation, the results of the validation experiments can be used to give these recommendations. In order to avoid confusion, in this section a distinction is made between the real validation of the calibration line and a number of additional tests that are performed to advise the analyst on how to continue.

    Validation strategy

    In order to validate a straight-line model, the analyst is advised to apply experimental design 2 (i.e., nine replicates at four concentration levels). Design 1 (three concentration levels) is more sensitive but does not permit a further investigation when a lack-of-fit is detected. Designs 4 and 5, on the other hand, do not permit an accurate outlier detection at the different concentration levels and are less suitable for the validation of a straight-line model. Moreover, owing to the small number of replicates at each concentration level, these designs are not really suitable to evaluate a possible heteroscedasticity.

    For the validation of a second degree model, an alternative design is proposed because none of the evaluated designs seems optimum. The designs that position all measurement points at a small number of concentration levels (designs 1, 2 and 3) are the most appropriate to detect a heteroscedasticity but are the least suitable to detect a lack-of-fit, especially for a weighted second degree model. The designs that spread the measurement points over a large number of levels (designs 4 and 5), on the other hand, are the most suited to detect a lack-of-fit, but the small number of replicates makes them scarcely appropriate to investigate the behaviour of the variance. Therefore, it is proposed to position nine replicates at both extremes and six replicates at five other concentration levels, equally spread over the concentration range. This design requires more measurement points than that proposed for the validation of a straight-line model (48 instead of 36), but additional effort can be required for the validation of a more complex model.
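The arithmetic behind the two recommended designs can be checked with a short sketch. The helper below is hypothetical and assumes that "equally spread" means seven equidistant levels including both extremes; the concentration range used in the example is arbitrary.

```python
def second_degree_design(low, high, n_extreme=9, n_inner=6, n_inner_levels=5):
    """Return [(concentration, replicates), ...] for the proposed design.

    Assumption: the inner levels are equidistant between the extremes.
    """
    step = (high - low) / (n_inner_levels + 1)
    design = [(low, n_extreme)]
    design += [(low + step * (i + 1), n_inner) for i in range(n_inner_levels)]
    design.append((high, n_extreme))
    return design


if __name__ == "__main__":
    design = second_degree_design(0.0, 120.0)  # arbitrary range
    for concentration, replicates in design:
        print(concentration, replicates)
    # Total number of measurement points: 2 * 9 + 5 * 6 = 48
    print(sum(replicates for _, replicates in design))
    # Design 2 for the straight-line model: 9 replicates at 4 levels = 36
    print(9 * 4)
```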

    After the performance of the experiments, the results should first be evaluated visually. The most appropriate way to do this is by plotting the residuals versus the predicted value. The statistical evaluation of the experimental results for a straight-line model is summarized in Fig. 11. First, single outliers are traced at the different concentration levels, applying a Grubbs test. When no single outliers are found, the presence of possible paired outliers is investigated. When, in the complete data set, no more than two outliers are detected, the analyst can remove them and continue the evaluation of the data. More outliers indicate a fundamental problem with the analysis method, which must be investigated. When the problem that is responsible for the outliers is solved, a new data set should be prepared. The homoscedasticity of the data is then investigated. One investigates whether the variance at the highest concentration level is significantly larger than at the lowest concentration level. In order to accomplish this one can use a one-tailed F-test or the alternative randomization test, which is even more suitable. Depending on the result, an unweighted or a weighted model must be used.
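The two statistics needed for the first steps of this evaluation can be sketched as follows. This is an illustration only: the function names are hypothetical, the single-outlier Grubbs statistic is written in its usual form (largest absolute deviation from the mean divided by the standard deviation), and the critical values, which must be taken from tables for the chosen significance level and number of replicates, are deliberately left out.

```python
from statistics import mean, stdev, variance


def grubbs_statistic(values):
    """Single-outlier Grubbs statistic for the replicates of one level."""
    m = mean(values)
    s = stdev(values)
    return max(abs(v - m) for v in values) / s


def variance_ratio(lowest_level, highest_level):
    """F ratio for the one-tailed test of a larger variance at the
    highest concentration level than at the lowest one."""
    return variance(highest_level) / variance(lowest_level)


if __name__ == "__main__":
    # Illustrative absorbances (not taken from the paper).
    low = [0.100, 0.110, 0.090]
    high = [1.000, 1.050, 0.950]
    print(grubbs_statistic(high))
    # Compare with the tabulated one-tailed F value for (m - 1, m - 1)
    # degrees of freedom, m being the number of replicates per level.
    print(variance_ratio(low, high))
```

A variance ratio above the tabulated F value points to heteroscedasticity, in which case a weighted model is used.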

    In order to investigate the goodness-of-fit of the straight-line model, one checks whether the data are better fitted by a second degree model. If this is not the case, one confirms the suitability of a straight-line model with an ANOVA lack-of-fit (for an unweighted model) or with a randomization test (for a weighted model). These tests are also those used to demonstrate the suitability of a second degree model.

    Fig. 11 Strategy for the evaluation of the experimental results

    Recommendations after the detection of a lack-of-fit

    When the validation shows that calibration data cannot be described by a straight-line model, two actions can be taken. First, one can check whether a straight-line model can be valid over a smaller concentration range. To do so, the upper concentration level of the design is eliminated and the tests for the validation of a straight-line model are applied to the reduced calibration set. Possibly, a decrease of the calibration range also allows the use of an unweighted instead of a weighted model. When no suitable straight-line model can be built, one can investigate the suitability of a second degree model. It must be clear that those tests on a reduced calibration set are only applied to give an indication of how the analysis method can be adapted. They cannot be used as validation results. If these tests indicate, for example, that a straight-line calibration model seems suitable over a smaller concentration range than specified at the start of the validation, the analyst can adapt his method and start the validation of this new method.

    REFERENCES

    1 Taylor, J. K., Anal. Chem., 1983, 55, 600A.
    2 ISO International Standard 8466-1, Water Quality-Calibration and Evaluation of Analytical Methods and Estimation of Performance Characteristics-Part 1: Statistical Evaluation of the Linear Calibration Function, International Organisation for Standardisation, Geneva, 1990.
    3 ISO International Standard 8466-2, Water Quality-Calibration and Evaluation of Analytical Methods and Estimation of Performance Characteristics-Part 2: Calibration Strategy for Non-linear Second Order Functions, International Organisation for Standardisation, Geneva, 1990.
    4 Wendt, R. H., At. Absorpt. Newsl., 1968, 7, 28.
    5 Barnett, W. B., Spectrochim. Acta, Part B, 1984, 39, 829.
    6 Phillips, G. R., and Eyring, E. M., Anal. Chem., 1983, 55, 1134.
    7 Koscielniak, P., Anal. Chim. Acta, 1993, 278, 177.
    8 Draper, N., and Smith, H., Applied Regression Analysis, Wiley, New York, 2nd edn., 1981.
    9 ISO International Standards 5725, International Organisation for Standardisation, Geneva, 1986.
    10 Grubbs, F. E., and Beck, G., Technometrics, 1972, 14, 847.
    11 Kelly, P. C., J. Assoc. Off. Anal. Chem., 1990, 73, 58.
    12 CETEMA, Statistique Appliquée à l'Exploitation des Mesures, Masson, Paris, 2nd edn., 1986.
    13 Snedecor, G. W., and Cochran, W. G., Statistical Methods, The Iowa State University Press, Ames, 7th edn., 1982.
    14 Lang-Michaut, C., Pratique des Tests Statistiques, Dunod, Paris, 1990.
    15 Massart, D. L., Vandeginste, B. G. M., Deming, S. N., Michotte, Y., and Kaufman, L., Chemometrics: a Textbook, Elsevier, Amsterdam, 1988.
    16 Garden, J. S., Mitchel, D. G., and Mills, W. N., Anal. Chem., 1980, 52, 2310.
    17 Cooper, B. E., Statistics for Experimentalists, Pergamon Press, Oxford, 1969.
    18 Edgington, E. S., Randomization Tests, Marcel Dekker, New York, 1987.
    19 Box, G., and Muller, M., Ann. Math. Stat., 1958, 29, 610.
    20 AOAC Referee, 1994, October, p. 6.
    21 ISO DIS 5725-1 to 5725-3 (Draft versions), Accuracy (Trueness and Precision) of Measurement Methods and Results, International Organisation for Standardisation, Geneva, 1990/1991.
    22 Van der Voet, H., Chemometr. Intell. Lab., 1994, 25, 313.
    23 Atkinson, A. C., Chemometr. Intell. Lab., 1995, 28, 35.

    Paper 5/07400B
    Received November 10, 1995
    Accepted December 13, 1995
