Calibration Models


    STATGRAPHICS Rev. 7/7/2005

    2005 by StatPoint, Inc. Calibration Models - 1

    Calibration Models

    Summary

    The Calibration Models procedure is designed to construct a statistical model describing the relationship between two variables, X and Y, where the intent of the model-building is to construct an equation that can be used to predict X given Y. In a typical application, X represents the true value of some important quantity, while Y is the measured value. Initially, a set of samples with known X values is used to calibrate the model. Later, when samples with unknown X values are measured, the fitted model is used to make an inverse prediction of X from the measured values Y.

    Any of 27 linear and nonlinear models may be fit. The output parallels that of the Simple

    Regression procedure.

    Sample StatFolio: calibration.sgp

    Sample Data:

    The file galactose.sf6 contains data on an experiment performed using a new method for measuring the concentration of galactose in blood. The data are similar to those reported by Neter et al. (1998). A total of n = 12 samples with known galactose concentrations X ranging between 1.0 and 10.0 were measured. The data are shown below:

    Known   Measured
    1       0.82
    1       0.95
    1       0.87
    4       4.14
    4       4.04
    4       4.01
    7       7.13
    7       6.92
    7       6.81
    10      9.95
    10      10.15
    10      10.08

    An additional sample of unknown concentration was measured, yielding Y = 6.52. An estimate

    of the actual concentration of the additional sample is desired, with a 95% confidence interval.
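As a quick check of the sample data, the following sketch fits the straight line by ordinary least squares in plain Python and inverts it at the measured value Y = 6.52. The variable names are illustrative and are not part of STATGRAPHICS; only the point estimate is computed here, not the confidence interval.

```python
# Sample data from galactose.sf6 as listed above.
known = [1, 1, 1, 4, 4, 4, 7, 7, 7, 10, 10, 10]
measured = [0.82, 0.95, 0.87, 4.14, 4.04, 4.01,
            7.13, 6.92, 6.81, 9.95, 10.15, 10.08]

n = len(known)
x_bar = sum(known) / n
y_bar = sum(measured) / n

# Least squares slope and intercept for measured = a + b * known.
sxx = sum((x - x_bar) ** 2 for x in known)
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(known, measured))
b = sxy / sxx
a = y_bar - b * x_bar

# Inverse prediction: solve y_new = a + b * x for x.
y_new = 6.52
x_hat = (y_new - a) / b

print(f"measured = {a:.7f} + {b:.5f}*known")
print(f"estimated concentration at Y = {y_new}: {x_hat:.5f}")
```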


    Data Input

    The data input dialog box can be used in two ways:

    1. Given measurements of samples with known values of X, it can be used to fit the calibration model. The coefficients of the model may be saved for later use.

    2. If new measurements are made, the stored coefficients can be used to predict the true value of X.

    Fitting the Calibration Model

    Y (measured): numeric column containing the n measured values of the quantity to be predicted.

    X (actual): numeric column containing the n known values of that quantity.

    Fitted Model Statistics: left blank when fitting a new model.

    Weights: optional numeric column containing weights to be applied to the residuals if performing a weighted least squares fit. If the variability of Y changes as a function of X, these weights can be used to compensate for the different levels of variability.

    Select: subset selection.

    Action: select Fit New Model to estimate a new model from Y and X.


    Using a Stored Model

    Y (measured): numeric column (or single number) containing the measured values of the quantity to be predicted.

    Fitted Model Statistics: column containing the statistics saved from the original model estimation. This would normally have been created using the Save Results option when the model was calibrated. The column consists of the estimated intercept, slope, and other relevant information.

    Action: select Predict X from Y.


    Analysis Summary

    When fitting a new calibration model, the Analysis Summary shows information about the fitted model.

    Calibration Models - measured vs. known
    Y (measured): measured
    X (actual): known
    Linear model: Y = a + b*X

                 Least Squares   Standard      T
    Parameter    Estimate        Error         Statistic   P-Value
    Intercept    -0.0896667      0.0643624     -1.39315    0.1938
    Slope        1.01433         0.00999098    101.525     0.0000

    Analysis of Variance
    Source           Sum of Squares   Df   Mean Square   F-Ratio     P-Value
    Model            138.898          1    138.898       10307.30    0.0000
    Residual         0.134757         10   0.0134757
      Lack-of-Fit    0.0434233        2    0.0217117     1.90        0.2110
      Pure Error     0.0913333        8    0.0114167
    Total (Corr.)    139.032          11

    Correlation Coefficient = 0.999515
    R-Squared = 99.9031 percent
    R-Squared (adjusted for d.f.) = 99.8934 percent
    Standard Error of Est. = 0.116085
    Mean absolute error = 0.0923889
    Durbin-Watson statistic = 1.50024 (P=0.0942)
    Lag 1 residual autocorrelation = 0.206661

    Residual Analysis
            Estimation       Validation
    n       12
    MSE     0.0134757
    MAE     0.0923889
    MAPE    3.07549
    ME      -4.81097E-16
    MPE     -0.982253

    Included in the output are:

    Variables and model: identification of the input variables and the model that was fit. By default, a linear model of the form

    Y = a + b X (1)

    is fit, although a different model may be selected using Analysis Options.

    Coefficients: the estimated coefficients, standard errors, t-statistics, and P-values. The estimates of the model coefficients can be used to write the fitted equation, which in the example is

    measured = -0.0896667 + 1.01433*known (2)

    The t-statistic tests the null hypothesis that the corresponding model parameter equals 0,

    versus the alternative hypothesis that it does not equal 0. Small P-Values (less than 0.05 if

    operating at the 5% significance level) indicate that a model coefficient is significantly


    different from 0. In the sample data, the slope is significantly different from 0 but the intercept is not.

    Analysis of Variance: decomposition of the variability of the dependent variable Y into a model sum of squares and a residual or error sum of squares. The residual sum of squares is further partitioned into a lack-of-fit component and a pure error component. Of particular interest are the F-tests and the associated P-values. The F-test on the Model line tests the statistical significance of the fitted model. A small P-value (less than 0.05 if operating at the 5% significance level) indicates that a significant relationship of the form specified exists between Y and X. In the sample data, the model is highly significant. The F-test on the Lack-of-Fit line tests the adequacy of the selected linear model in describing the observed relationship between Y and X. A small P-value indicates that the selected model does not adequately describe the relationship. In such cases, a nonlinear model could be selected using Analysis Options. For the sample data, the large P-value indicates that the linear model is adequate. Note: the lack-of-fit test is available only when more than one measurement has been obtained at the same value of X.

    Statistics: summary statistics for the fitted model, including:

    Correlation coefficient - measures the strength of the linear relationship between Y and X on a scale ranging from -1 (perfect negative linear correlation) to +1 (perfect positive linear correlation). In the sample data, the correlation is very strong.

    R-squared - represents the percentage of the variability in Y which has been explained by the fitted regression model, ranging from 0% to 100%. For the sample data, the regression has accounted for about 99.9% of the variability amongst the measurements.

    Adjusted R-Squared - the R-squared statistic, adjusted for the number of coefficients in the model. This value is often used to compare models with different numbers of coefficients.

    Standard Error of Est. - the estimated standard deviation of the residuals (the deviations around the model). This value is used to create prediction limits for new observations.

    Mean Absolute Error - the average absolute value of the residuals.

    Durbin-Watson Statistic - a measure of serial correlation in the residuals. If the residuals vary randomly, this value should be close to 2. A small P-value indicates a non-random pattern in the residuals. For data recorded over time, a small P-value could indicate that some trend over time has not been accounted for.

    Lag 1 Residual Autocorrelation - the estimated correlation between consecutive residuals, on a scale of -1 to +1. Values far from 0 indicate that significant structure remains unaccounted for by the model.

    Residual Analysis - if a subset of the rows in the datasheet have been excluded from the analysis using the Select field on the data input dialog box, the fitted model is used to make predictions of the Y values for those rows. This table shows statistics on the prediction errors, defined by


    $e_i = y_i - \hat{y}_i$ (3)

    Included are the mean squared error (MSE), the mean absolute error (MAE), the mean absolute percentage error (MAPE), the mean error (ME), and the mean percentage error (MPE). These validation statistics can be compared to the statistics for the fitted model to determine how well that model predicts observations outside of the data used to fit it.
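The quantities in the Analysis Summary can be reproduced from the sample data with a few lines of plain Python. The sketch below is an independent recalculation using textbook formulas, not STATGRAPHICS code. It assumes the lag 1 autocorrelation is computed around a residual mean of zero, and it reports the error statistics for the estimation rows only. Note that the MSE shown in the Estimation column equals the residual mean square, which divides by n - 2.

```python
import math
from collections import defaultdict

known = [1, 1, 1, 4, 4, 4, 7, 7, 7, 10, 10, 10]
measured = [0.82, 0.95, 0.87, 4.14, 4.04, 4.01,
            7.13, 6.92, 6.81, 9.95, 10.15, 10.08]
n = len(known)

# Least squares fit of measured = a + b * known.
x_bar = sum(known) / n
y_bar = sum(measured) / n
sxx = sum((x - x_bar) ** 2 for x in known)
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(known, measured))
b = sxy / sxx
a = y_bar - b * x_bar
resid = [y - (a + b * x) for x, y in zip(known, measured)]  # e_i = y_i - yhat_i

# Analysis of variance.
ss_total = sum((y - y_bar) ** 2 for y in measured)
ss_resid = sum(e ** 2 for e in resid)
ss_model = ss_total - ss_resid
ms_resid = ss_resid / (n - 2)
f_model = ss_model / ms_resid

# Pure error: scatter of replicate measurements around their own group mean.
groups = defaultdict(list)
for x, y in zip(known, measured):
    groups[x].append(y)
ss_pure = sum((y - sum(ys) / len(ys)) ** 2 for ys in groups.values() for y in ys)
df_pure = n - len(groups)
df_lof = len(groups) - 2
ss_lof = ss_resid - ss_pure
f_lof = (ss_lof / df_lof) / (ss_pure / df_pure)

# Summary statistics.
r_squared = ss_model / ss_total
adj_r_squared = 1 - (1 - r_squared) * (n - 1) / (n - 2)
std_error_est = math.sqrt(ms_resid)
durbin_watson = sum((resid[i] - resid[i - 1]) ** 2 for i in range(1, n)) / ss_resid
lag1_autocorr = sum(resid[i] * resid[i - 1] for i in range(1, n)) / ss_resid

# Error statistics based on equation (3).
mae = sum(abs(e) for e in resid) / n
mape = 100 * sum(abs(e / y) for e, y in zip(resid, measured)) / n
me = sum(resid) / n
mpe = 100 * sum(e / y for e, y in zip(resid, measured)) / n

print(f"F (model) = {f_model:.1f}, F (lack-of-fit) = {f_lof:.2f}")
print(f"R-squared = {100 * r_squared:.4f}%, adjusted = {100 * adj_r_squared:.4f}%")
print(f"SE of est. = {std_error_est:.6f}, DW = {durbin_watson:.5f}")
print(f"MAE = {mae:.7f}, MAPE = {mape:.5f}, ME = {me:.3e}, MPE = {mpe:.6f}")
```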

    Analysis Options

    Type of Model: the model to be estimated. All of the models displayed can be linearized by transforming either X, Y, or both. When fitting a nonlinear model, STATGRAPHICS first transforms the data, then fits the model, and then inverts the transformation to display the results.

    Include Constant: whether to include a constant term or intercept in the model. If the constant is removed, the fitted model will pass through the origin at (X,Y) = (0,0).

    The available models are shown in the following table:


    Model                         Equation                                                        Transformation on Y      Transformation on X
    Linear                        $y = \beta_0 + \beta_1 x$                                       none                     none
    Square root-Y                 $y = (\beta_0 + \beta_1 x)^2$                                   square root              none
    Exponential                   $y = e^{\beta_0 + \beta_1 x}$                                   log                      none
    Reciprocal-Y                  $y = (\beta_0 + \beta_1 x)^{-1}$                                reciprocal               none
    Squared-Y                     $y = \sqrt{\beta_0 + \beta_1 x}$                                square                   none
    Square root-X                 $y = \beta_0 + \beta_1 \sqrt{x}$                                none                     square root
    Double square root            $y = (\beta_0 + \beta_1 \sqrt{x})^2$                            square root              square root
    Log-Y square root-X           $y = e^{\beta_0 + \beta_1 \sqrt{x}}$                            log                      square root
    Reciprocal-Y square root-X    $y = (\beta_0 + \beta_1 \sqrt{x})^{-1}$                         reciprocal               square root
    Squared-Y square root-X       $y = \sqrt{\beta_0 + \beta_1 \sqrt{x}}$                         square                   square root
    Logarithmic-X                 $y = \beta_0 + \beta_1 \ln(x)$                                  none                     log
    Square root-Y log-X           $y = (\beta_0 + \beta_1 \ln(x))^2$                              square root              log
    Multiplicative                $y = \beta_0 x^{\beta_1}$                                       log                      log
    Reciprocal-Y log-X            $y = \frac{1}{\beta_0 + \beta_1 \ln(x)}$                        reciprocal               log
    Squared-Y log-X               $y = \sqrt{\beta_0 + \beta_1 \ln(x)}$                           square                   log
    Reciprocal-X                  $y = \beta_0 + \beta_1 / x$                                     none                     reciprocal
    Square root-Y reciprocal-X    $y = (\beta_0 + \beta_1 / x)^2$                                 square root              reciprocal
    S-curve                       $y = e^{\beta_0 + \beta_1 / x}$                                 log                      reciprocal
    Double reciprocal             $y = [\beta_0 + \beta_1 / x]^{-1}$                              reciprocal               reciprocal
    Squared-Y reciprocal-X        $y = \sqrt{\beta_0 + \beta_1 / x}$                              square                   reciprocal
    Squared-X                     $y = \beta_0 + \beta_1 x^2$                                     none                     square
    Square root-Y squared-X       $y = (\beta_0 + \beta_1 x^2)^2$                                 square root              square
    Log-Y squared-X               $y = e^{\beta_0 + \beta_1 x^2}$                                 log                      square
    Reciprocal-Y squared-X        $y = (\beta_0 + \beta_1 x^2)^{-1}$                              reciprocal               square
    Double squared                $y = \sqrt{\beta_0 + \beta_1 x^2}$                              square                   square
    Logistic                      $y = \frac{e^{\beta_0 + \beta_1 x}}{1 + e^{\beta_0 + \beta_1 x}}$   $\ln(y/(1-y))$           none
    Log probit                    $y = \Phi(\beta_0 + \beta_1 \ln(x))$                            $\Phi^{-1}(y)$ (inv. normal)   log
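The transform, fit, and invert sequence described under Type of Model can be sketched for one row of the table, the Multiplicative model. The data below are made up and noise-free, generated from y = 2 x^1.5 purely for illustration. The helper names fit_line and predict_x are hypothetical, and the code is not the STATGRAPHICS implementation.

```python
import math

def fit_line(x, y):
    """Least squares intercept and slope for y = a + b*x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b = sxy / sxx
    return y_bar - b * x_bar, b

# Made-up, noise-free data generated from y = 2 * x**1.5 (illustration only).
x = [1.0, 2.0, 4.0, 8.0]
y = [2.0 * xi ** 1.5 for xi in x]

# Step 1: transform both variables (log on Y, log on X).
# Step 2: fit a straight line on the transformed scale.
a, b = fit_line([math.log(xi) for xi in x], [math.log(yi) for yi in y])

# Step 3: invert the transformation to express the fit as y = beta0 * x**beta1.
beta0, beta1 = math.exp(a), b

def predict_x(y_new):
    """Inverse prediction of x for the multiplicative model."""
    return (y_new / beta0) ** (1.0 / beta1)

print(beta0, beta1, predict_x(2.0 * 3.0 ** 1.5))
```

With real, noisy measurements the recovered coefficients would only approximate the underlying values, and the least squares criterion applies on the transformed scale rather than the original one.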


    To determine which model to fit to the data, the output in the Comparison of Alternative Models

    pane described below can be helpful, since it fits all of the models and lists them in decreasing

    order of R-squared.

    Plot of Fitted Model

    This pane shows the fitted model or models, together with confidence limits and prediction limits

    if desired.

    [Figure: Plot of Fitted Model, measured = -0.0896667 + 1.01433*known. Horizontal axis: known; vertical axis: measured. The inverse prediction at measured = 6.52 is marked as 6.51627 (6.25035, 6.78317).]

    The plot includes:

    The line of best fit or prediction equation:

    $\hat{y} = \hat{a} + \hat{b}x$ (4)

    This is the equation that would be used to predict values of the dependent variable Y given values of the independent variable X, or vice versa.

    Confidence intervals for the mean response at X. These are the inner bounds in the above plot and describe how well the location of the line has been estimated given the available data sample. As the size of the sample n increases, these bounds will become tighter. You should also note that the width of the bounds varies as a function of X, with the line estimated most precisely near the average value $\bar{x}$.

    Prediction limits for new observations. These are the outer bounds in the above plot and describe how precisely one could predict where a single new observation would lie. Regardless of the size of the sample, new observations will vary around the true line with a standard deviation equal to $\sigma$.

    Prediction of a single value. Using Pane Options, a single prediction can be made and plotted. For example, the above plot predicts the value of X given a sample with measured value Y = 6.52. The predicted value of X equals 6.516, with 95% confidence limits extending from 6.250 to 6.783.
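To illustrate how the two sets of bounds behave, the sketch below computes the half-widths of the confidence limits for the mean response and of the prediction limits for a single new observation at a few values of X. It uses the standard simple-regression formulas. The constant T_CRIT is the tabulated two-sided 95% Student's t value for 10 degrees of freedom, entered by hand, and the function name limits is illustrative.

```python
import math

known = [1, 1, 1, 4, 4, 4, 7, 7, 7, 10, 10, 10]
measured = [0.82, 0.95, 0.87, 4.14, 4.04, 4.01,
            7.13, 6.92, 6.81, 9.95, 10.15, 10.08]
T_CRIT = 2.228139  # tabulated t value: 95% two-sided, n - 2 = 10 d.f.

n = len(known)
x_bar = sum(known) / n
y_bar = sum(measured) / n
sxx = sum((x - x_bar) ** 2 for x in known)
b = sum((x - x_bar) * (y - y_bar) for x, y in zip(known, measured)) / sxx
a = y_bar - b * x_bar
mse = sum((y - (a + b * x)) ** 2 for x, y in zip(known, measured)) / (n - 2)

def limits(x0):
    """Return (fitted value, confidence half-width, prediction half-width) at x0."""
    y_hat = a + b * x0
    spread = 1 / n + (x0 - x_bar) ** 2 / sxx
    ci_half = T_CRIT * math.sqrt(mse * spread)        # mean response at x0
    pi_half = T_CRIT * math.sqrt(mse * (1 + spread))  # single new observation
    return y_hat, ci_half, pi_half

for x0 in (1.0, 5.5, 10.0):
    y_hat, ci_half, pi_half = limits(x0)
    print(f"x = {x0:4.1f}: fit = {y_hat:.4f}, +/-{ci_half:.4f} (mean), +/-{pi_half:.4f} (new obs.)")
```

The confidence half-width is smallest at the mean of X and grows toward the ends of the calibration range, while the prediction half-width never falls below the value set by the residual standard deviation.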


    Pane Options

    Include: the limits to include on the plot.

    Confidence Level: the confidence percentage for the limits.

    Predict: whether to predict Y or X. Enter the value of the other variable in the At field.

    Mean Size or Weight: if the measured value is the average of more than one sample, enter the number of samples m used to calculate the average.

    Predicted Values

    The model can be used to predict X given Y or Y given X. In the first case, the output is shown

    below:

    Predicted Values for X
                           95.00%
             Predicted     Prediction Limits
    Y-bar    X             Lower        Upper
    6.52     6.51627       6.25035      6.78317

    Included in the table are:

    Y - the measured value at which the prediction is to be made.

    Predicted X - the predicted value of X using the fitted model.

    Prediction limits - prediction limits for X at the selected level of confidence.

    These are the same values displayed on the plot of the fitted model.


    Pane Options

    Predict: whether to predict Y or X.

    Confidence Level: the confidence percentage for the limits.

    Mean Size or Weight: if the measured value is the average of more than one sample, enter the number of samples m used to calculate the average.

    Predict At: up to 10 values at which to make predictions.

    Confidence Intervals

    The Confidence Intervals pane shows the potential estimation error associated with each coefficient in the model.

    95.0% confidence intervals for coefficient estimates
                              Standard
    Parameter    Estimate     Error         Lower Limit   Upper Limit
    CONSTANT     -0.0896667   0.0643624     -0.233075     0.0537421
    SLOPE        1.01433      0.00999098    0.992072      1.03659
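The limits in this table follow the usual form, estimate plus or minus a t value times the standard error. A minimal sketch, assuming the textbook standard-error formulas for simple regression and a hand-entered t value (T_CRIT, the two-sided 95% point for 10 degrees of freedom):

```python
import math

known = [1, 1, 1, 4, 4, 4, 7, 7, 7, 10, 10, 10]
measured = [0.82, 0.95, 0.87, 4.14, 4.04, 4.01,
            7.13, 6.92, 6.81, 9.95, 10.15, 10.08]
T_CRIT = 2.228139  # tabulated t value: 95% two-sided, n - 2 = 10 d.f.

n = len(known)
x_bar = sum(known) / n
y_bar = sum(measured) / n
sxx = sum((x - x_bar) ** 2 for x in known)
b = sum((x - x_bar) * (y - y_bar) for x, y in zip(known, measured)) / sxx
a = y_bar - b * x_bar
mse = sum((y - (a + b * x)) ** 2 for x, y in zip(known, measured)) / (n - 2)

# Standard errors of the intercept and slope.
se_a = math.sqrt(mse * (1 / n + x_bar ** 2 / sxx))
se_b = math.sqrt(mse / sxx)

# Two-sided 95% confidence intervals.
ci_a = (a - T_CRIT * se_a, a + T_CRIT * se_a)
ci_b = (b - T_CRIT * se_b, b + T_CRIT * se_b)

print(f"CONSTANT {a:.7f}  SE {se_a:.7f}  [{ci_a[0]:.6f}, {ci_a[1]:.7f}]")
print(f"SLOPE    {b:.5f}  SE {se_b:.8f}  [{ci_b[0]:.6f}, {ci_b[1]:.5f}]")
```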

    Pane Options


    Type of Interval: either a two-sided confidence interval or a one-sided confidence bound may be created.

    Confidence Level: percentage level for the interval or bound.

    Hypothesis Tests

    The Hypothesis Tests pane can be used to test hypotheses about the model coefficients. In each case, a t-test is performed. The default tests are shown below:

    Hypothesis Tests
    Null hypothesis: intercept = 0.0
    Alternative hypothesis: intercept not equal 0.0
    Computed t statistic = -1.39315
    P-value = 0.193765
    Do not reject the null hypothesis for alpha = 0.05.

    Null hypothesis: slope = 1.0
    Alternative hypothesis: slope not equal 1.0
    Computed t statistic = 1.43463
    P-value = 0.181919
    Do not reject the null hypothesis for alpha = 0.05.

    The first test concerns whether or not the intercept equals 0. If so, the model goes through the origin. A small P-value (less than 0.05 if operating at the 5% significance level) would indicate that the intercept was not equal to 0. In this case, the result is not significant, so the line may well go through the origin. If the slope of the line equals 1, a non-zero intercept would be related to bias in the measurements.

    The second test concerns whether or not the slope equals 1. For a linear model, a slope of 1 indicates that when the known value changes, the measured value changes by the same amount. A small P-value would indicate that the slope was significantly different from 1.

    In the current case, neither null hypothesis is rejected, indicating that a possible equation for the calibration curve is measured = known.
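The two t statistics can be recomputed as (estimate - hypothesized value) / standard error. The sketch below does this in plain Python and compares each statistic with a hand-entered critical value (T_CRIT, the two-sided 5% point for 10 degrees of freedom) instead of computing a P-value, since the standard library has no Student's t distribution function.

```python
import math

known = [1, 1, 1, 4, 4, 4, 7, 7, 7, 10, 10, 10]
measured = [0.82, 0.95, 0.87, 4.14, 4.04, 4.01,
            7.13, 6.92, 6.81, 9.95, 10.15, 10.08]
T_CRIT = 2.228139  # tabulated t value: alpha = 0.05 two-sided, 10 d.f.

n = len(known)
x_bar = sum(known) / n
y_bar = sum(measured) / n
sxx = sum((x - x_bar) ** 2 for x in known)
b = sum((x - x_bar) * (y - y_bar) for x, y in zip(known, measured)) / sxx
a = y_bar - b * x_bar
mse = sum((y - (a + b * x)) ** 2 for x, y in zip(known, measured)) / (n - 2)
se_a = math.sqrt(mse * (1 / n + x_bar ** 2 / sxx))
se_b = math.sqrt(mse / sxx)

# Null hypotheses: intercept = 0.0 and slope = 1.0 (the defaults shown above).
t_intercept = (a - 0.0) / se_a
t_slope = (b - 1.0) / se_b

for name, t in (("intercept = 0", t_intercept), ("slope = 1", t_slope)):
    verdict = "reject" if abs(t) > T_CRIT else "do not reject"
    print(f"H0: {name}: t = {t:.5f} -> {verdict} at alpha = 0.05")
```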

    Pane Options

    Intercept: the value of the intercept specified by the null hypothesis.

    Slope: the value of the slope specified by the null hypothesis.


    Alternative: the type of alternative hypothesis. If Not Equal is selected, a two-sided P-value is calculated. Otherwise, a one-sided P-value is calculated.

    Alpha: the probability of a Type I error (rejecting the null hypothesis when it is true). This does not affect the P-value, only the conclusion stated beneath it.

    Observed versus Predicted

    The Observed versus Predicted plot shows the observed values of Y on the vertical axis and the predicted values $\hat{Y}$ on the horizontal axis.

    [Figure: Plot of measured. Horizontal axis: predicted; vertical axis: observed.]

    If the model fits well, the points should be randomly scattered around the diagonal line. It is sometimes possible to see curvature in this plot, which would indicate the need for a curvilinear model rather than a linear model. Any change in variability from low values of Y to high values of Y might also indicate the need to transform the dependent variable before fitting a model to the data.

    Residual Plots

    As with all statistical models, it is good practice to examine the residuals. In a regression, the residuals are defined by

    $e_i = y_i - \hat{y}_i$ (5)

    i.e., the residuals are the differences between the observed data values and the fitted model.

    The Calibration Models procedure creates various types of residual plots, depending on Pane Options.


    Scatterplot versus X

    This plot is helpful in visualizing any need for a curvilinear model.

    [Figure: Residual Plot. Horizontal axis: known; vertical axis: Studentized residual.]

    Normal Probability Plot

    This plot can be used to determine whether or not the deviations around the line follow a normal distribution, which is the assumption used to form the prediction intervals.

    [Figure: Normal Probability Plot for measured. Horizontal axis: Studentized residual; vertical axis: percentage.]

    If the deviations follow a normal distribution, they should fall approximately along a straight line.


    Residual Autocorrelations

    This plot calculates the autocorrelation between residuals as a function of the number of rows between them in the datasheet.

    [Figure: Residual Autocorrelations for measured. Horizontal axis: lag; vertical axis: autocorrelation.]

    It is only relevant if the data have been collected sequentially. Any bars extending beyond the probability limits would indicate significant dependence between residuals separated by the indicated lag, which would violate the assumption of independence made when fitting the regression model.

    Pane Options

    Plot: the type of residuals to plot:

    1. Residuals - the residuals from the least squares fit.

    2. Studentized residuals - the difference between the observed values $y_i$ and the predicted values $\hat{y}_i$ when the model is fit using all observations except the i-th, divided by the estimated standard error. These residuals are sometimes called externally deleted


    residuals, since they measure how far each value is from the fitted model when that model is fit using all of the data except the point being considered. This is important, since a large outlier might otherwise affect the model so much that it would not appear to be unusually far away from the line.

    Type: the type of plot to be created. A Scatterplot is used to test for curvature. A Normal Probability Plot is used to determine whether the model residuals come from a normal distribution. An Autocorrelation Function is used to test for dependence between consecutive residuals.

    Plot Versus: for a Scatterplot, the quantity to plot on the horizontal axis.

    Number of Lags: for an Autocorrelation Function, the maximum number of lags. For small data sets, the number of lags plotted may be less than this value.

    Confidence Level: for an Autocorrelation Function, the level used to create the probability limits.
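The Studentized (externally deleted) residuals described under Plot can be computed literally as stated: refit the line without the i-th row, predict that row, and divide the prediction error by its estimated standard error. The sketch below does this by brute-force refitting. The function names are illustrative, and the standard error used is the usual one for a new observation under the leave-one-out fit, which is an assumption about how "estimated standard error" is meant.

```python
import math

known = [1, 1, 1, 4, 4, 4, 7, 7, 7, 10, 10, 10]
measured = [0.82, 0.95, 0.87, 4.14, 4.04, 4.01,
            7.13, 6.92, 6.81, 9.95, 10.15, 10.08]

def fit_line(x, y):
    """Least squares fit of y = a + b*x; also returns x_bar, Sxx and the residual SS."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    a = y_bar - b * x_bar
    sse = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
    return a, b, x_bar, sxx, sse

def studentized_deleted_residuals(x, y):
    """Leave-one-out prediction error for each row, scaled by its standard error."""
    out = []
    for i in range(len(x)):
        xs = x[:i] + x[i + 1:]  # all rows except the i-th
        ys = y[:i] + y[i + 1:]
        a, b, x_bar, sxx, sse = fit_line(xs, ys)
        m = len(xs)
        s2 = sse / (m - 2)      # residual variance without row i
        pred_err = y[i] - (a + b * x[i])
        se = math.sqrt(s2 * (1 + 1 / m + (x[i] - x_bar) ** 2 / sxx))
        out.append(pred_err / se)
    return out

t_values = studentized_deleted_residuals(known, measured)
unusual = [i + 1 for i, t in enumerate(t_values) if abs(t) >= 2.0]
print("rows with |Studentized residual| >= 2:", unusual)
print("row 9:", round(t_values[8], 3))
```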

    Comparison of Alternative Models

    The Comparison of Alternative Models pane shows the R-squared values obtained when fitting each of the 27 available models:

    Comparison of Alternative Models
    Model                          Correlation   R-Squared
    Linear                         0.9995        99.90%
    Double square root             0.9994        99.88%
    Double squared                 0.9993        99.87%
    Double reciprocal              0.9965        99.30%
    Square root-Y logarithmic-X    0.9902        98.05%
    Multiplicative                 0.9902        98.05%
    Square root-X                  0.9891        97.83%
    Square root-Y                  0.9850        97.02%
    Logarithmic-Y square root-X    0.9829        96.60%
    S-curve model                  0.9781        95.67%
    Squared-Y                      0.9697        94.04%
    Squared-X                      0.9697        94.03%
    Logarithmic-X                  0.9551        91.22%
    Exponential                    0.9441        89.13%
    Squared-Y square root-X        0.9226        85.12%
    Square root-Y squared-X        0.9182        84.31%
    Reciprocal-X                   0.8628        74.45%
    Squared-Y logarithmic-X        0.8539        72.92%
    Logarithmic-Y squared-X        0.8431        71.09%
    Squared-Y reciprocal-X         0.7174        51.47%
    Reciprocal-Y squared-X         0.7011        49.15%
    Reciprocal-Y
    Reciprocal-Y square root-X
    Reciprocal-Y logarithmic-X
    Square root-Y reciprocal-X
    Logistic
    Log probit

    The models are listed in decreasing order of R-squared. When selecting an alternative model, consideration should be given to those models near the top of the list. However, since the R-squared statistics are calculated after transforming X and/or Y, the model with the highest R-squared may not be the best. You should always plot the fitted model to see whether it does a good job for your data.

    Unusual Residuals

    Once the model has been fit, it is useful to study the residuals to determine whether any outliers exist that should be removed from the data. The Unusual Residuals pane lists all observations that have Studentized residuals of 2.0 or greater in absolute value.

    Unusual Residuals
                        Predicted                 Studentized
    Row   X     Y       Y            Residual     Residual
    9     7.0   6.81    7.01067      -0.200667    -2.12

    Studentized residuals greater than 3 in absolute value correspond to points more than 3 standard deviations from the fitted model, which is an extremely rare event for a normal distribution.

    Note: Points can be removed from the fit while examining the Plot of the Fitted Model by clicking on a point and then pressing the Exclude/Include button on the analysis toolbar. Excluded points are marked with an X.

    Influential Points

    In fitting a regression model, not all observations have an equal influence on the parameter estimates in the fitted model. In a simple regression, points located at very low or very high values of X have greater influence than those located nearer to the mean of X. The Influential Points pane displays any observations that have high influence on the fitted model:

    Influential Points
                      Predicted    Studentized
    Row   X    Y      Y            Residual       Leverage

    Average leverage of single data point = 0.166667

    The table shows every point with leverage equal to 3 or more times that of an average data point, where the leverage of an observation is a measure of its influence on the estimated model coefficients. In general, values with leverage exceeding 5 times that of an average data value should be examined closely, since they have unusually large impact on the fitted model.
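For a straight-line fit with an intercept, the leverage of each row depends only on how far its X value lies from the mean of X. The sketch below uses the standard formula h_i = 1/n + (x_i - x̄)²/S_xx, which is assumed here to match the leverage reported by the procedure, and applies the 3-times-average screening rule described above.

```python
known = [1, 1, 1, 4, 4, 4, 7, 7, 7, 10, 10, 10]

n = len(known)
x_bar = sum(known) / n
sxx = sum((x - x_bar) ** 2 for x in known)

# Leverage of each observation in a simple regression with intercept.
leverage = [1 / n + (x - x_bar) ** 2 / sxx for x in known]

# The leverages sum to the number of coefficients (2), so the average is 2/n.
average = sum(leverage) / n

# Rows whose leverage is at least 3 times the average.
flagged = [(row + 1, h) for row, h in enumerate(leverage) if h >= 3 * average]

print(f"average leverage = {average:.6f}")
print(f"largest leverage = {max(leverage):.6f}")
print("flagged rows:", flagged)
```

For the sample data no row reaches the threshold, which is consistent with the empty Influential Points table.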

    Save Results

    The following results may be saved to the datasheet:

    1. Model Statistics - a column of numeric values with information about the fitted model. This column can be used later to predict values of X by selecting Predict X from Y on the data input dialog box.
    2. Predicted Values - the predicted value of Y corresponding to each of the n observations.
    3. Lower Limits for Predictions - the lower prediction limits for each predicted value.
    4. Upper Limits for Predictions - the upper prediction limits for each predicted value.
    5. Lower Limits for Forecast Means - the lower confidence limits for the mean value of Y at each of the n values of X.
    6. Upper Limits for Forecast Means - the upper confidence limits for the mean value of Y at each of the n values of X.
    7. Residuals - the n residuals.
    8. Studentized Residuals - the n Studentized residuals.
    9. Leverages - the leverage values corresponding to the n values of X.
    10. Coefficients - the estimated model coefficients.

    Calculations

    Inverse Predictions

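    $\hat{x}_{new} = \frac{y_{new} - \hat{\beta}_0}{\hat{\beta}_1}$ (6)

    Lower and upper limits for $x_{new}$ are found using Fieller's approach, which solves for the values of $x_{new}$ at which the prediction limits

    $\hat{y} \pm t_{\alpha/2,n-2} \sqrt{MSE \left( \frac{1}{m} + \frac{1}{n} + \frac{(x_{new} - \bar{x})^2}{S_{xx}} \right)}$ (7)

    are equal to $y_{new}$, where $\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x_{new}$, m is the mean size or weight, and

    $S_{xx} = \sum_{i=1}^{n} (x_i - \bar{x})^2$ (8)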
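Setting the limits in (7) equal to y_new and squaring gives a quadratic in u = x_new - x̄, whose two roots are the lower and upper limits. The sketch below works through that algebra for the sample data. It is a reconstruction from equations (6) to (8), not STATGRAPHICS code. The function name inverse_prediction is hypothetical, and T_CRIT is a hand-entered tabulated t value (two-sided 95%, 10 degrees of freedom).

```python
import math

known = [1, 1, 1, 4, 4, 4, 7, 7, 7, 10, 10, 10]
measured = [0.82, 0.95, 0.87, 4.14, 4.04, 4.01,
            7.13, 6.92, 6.81, 9.95, 10.15, 10.08]
T_CRIT = 2.228139  # tabulated t value: 95% two-sided, n - 2 = 10 d.f.

def inverse_prediction(x, y, y_new, t_crit, m=1):
    """Point estimate (6) and Fieller-type limits for x_new from (7) and (8)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)                     # equation (8)
    b = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    a = y_bar - b * x_bar
    mse = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y)) / (n - 2)
    x_hat = (y_new - a) / b                                      # equation (6)

    # (y_new - y_bar - b*u)^2 = t^2 * MSE * (1/m + 1/n + u^2/Sxx), u = x_new - x_bar
    c = t_crit ** 2 * mse
    d = y_new - y_bar
    qa = b ** 2 - c / sxx
    qb = -2 * b * d
    qc = d ** 2 - c * (1 / m + 1 / n)
    disc = qb ** 2 - 4 * qa * qc
    if qa <= 0 or disc < 0:
        raise ValueError("finite limits do not exist for these data")
    root = math.sqrt(disc)
    lower = x_bar + (-qb - root) / (2 * qa)
    upper = x_bar + (-qb + root) / (2 * qa)
    return x_hat, lower, upper

x_hat, lower, upper = inverse_prediction(known, measured, 6.52, T_CRIT)
print(f"X = {x_hat:.5f}  ({lower:.5f}, {upper:.5f})")

# If 6.52 were the average of m = 3 measurements, the limits would tighten.
_, lower3, upper3 = inverse_prediction(known, measured, 6.52, T_CRIT, m=3)
print(f"m = 3: ({lower3:.5f}, {upper3:.5f})")
```

The limits are not symmetric about the point estimate in general, because the quadratic accounts for uncertainty in the slope as well as in the new measurement.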

    Additional calculations may be found in the Simple Regression documentation.