

CLIN. CHEM. 39/6, 929-935 (1993)

CLINICAL CHEMISTRY, Vol. 39, No. 6, 1993 929

Chemistry with Confidence: Should Clinical Chemistry Require Confidence Intervals for Analytical and Other Data?

A. Ralph Henderson

Confidence intervals are not commonly provided with analytical or other data reported in Clinical Chemistry although P values are. However, confidence intervals provide an explicit demonstration of the direction and magnitude of uncertainty and are intuitively easy to grasp, unlike P values. It is therefore argued that the Journal should adopt a policy requiring the provision of confidence intervals. Such a policy would improve the statistical rigor of Journal reports.

Indexing Terms: statistics · parametric vs nonparametric distributions · likelihood ratio · receiver-operating characteristic curve

When possible, quantify findings and present them with appropriate indicators of measurement error or uncertainty (such as confidence intervals). Avoid sole reliance on statistical hypothesis testing, such as the use of P values, which fails to convey important quantitative information.

International Committee of Medical Journal Editors (1)

Recently, a reviewer of a paper my colleagues and I submitted to Clinical Chemistry asked for the P value of a difference between sets of data. We had instead provided the confidence intervals for the data, but this was of no interest: P values were, for the reviewer, the “bottom line” in such comparisons. Or are they? It is the intent of this Opinion to make the case for more appropriate statistical descriptions of experimental data than are currently used in Clinical Chemistry but which are in accord with the most recent Uniform Requirements of the International Committee of Medical Journal Editors (1) quoted above. The Journal issues very detailed statistical guidelines for authors (2), which do include a requirement for the use of appropriate indicators of measurement error or uncertainty, but these guidelines are often not observed in practice and frequently the wrong type of statistical information is provided. Examples are readily found on random inspection of the Journal; some are mentioned here for illustrative purposes only but are not cited in this Opinion, it being my intent to persuade rather than pillory.

For example, test A was claimed to be “more sensitive” [true positive (TP) rate or fraction = 44%] than test B (TP rate = 17%), but the 95% confidence intervals for these reported sensitivity values actually overlapped, i.e., 22% to 69% and 3.5% to 41%.¹ Test A is not more sensitive; these tests are equivalent.² In another paper, test C was claimed to be superior to test D on the basis of a receiver-operating characteristic (ROC) curve analysis, but neither the areas under the curve (i.e., accuracy) nor the confidence intervals of these accuracies were provided. Finally, a set of tests were compared by ROC curve analyses of each test’s accuracy; one test was claimed to possess less “discriminating power” than the others, but the confidence interval of each estimate of test accuracy was not determined and indeed visual inspection of the data suggested that all tests were equivalent. These are three examples taken from one issue of the Journal.

Department of Clinical Biochemistry, University Hospital (University of Western Ontario), P.O. Box 5339, London, Ontario, Canada N6A 5A5.

Received September 9, 1992; accepted January 25, 1993.

How should experimental findings be presented? There are now many sources of good advice. For example, Clinical Biostatistics (4) and Statistics in Practice (5), both collections of articles first appearing in Clinical Pharmacology and Therapeutics and the British Medical Journal, respectively, are invaluable; the latter is particularly good on the practical aspects. These aspects are also well addressed in Altman’s recent book (3), as its title suggests. Statistics deals with samples taken from populations consisting of all the possible observations that could be made; these samples are assumed to possess the same characteristics as the parent population. Altman (3) has pointed out, however, that sampling from a population with a truly gaussian distribution (synonym: normal distribution) may not always produce samples that are themselves gaussian.

Samples consist of observed data and possess empirical distributions. However, both samples and their parent populations may conform to a variety of probability distributions. These mathematical abstractions are characterized by one or more parameters; for example, the gaussian distribution is completely described by two parameters: mean and standard deviation (SD). The advantage of using appropriate probability distributions to describe empirical data is that it permits statistical analysis by powerful mathematical tools. When such distributional assumptions are made, the process is described as “parametric.” When no assumptions are made at all about the distribution of data, the process is “nonparametric” or “distribution free.” Initially, attention will be paid here to parametric methods.

¹ Nonstandard abbreviations: TP, true positive; TN, true negative; ROC, receiver-operating characteristic; and LR, likelihood ratio.

² The details of the calculation of confidence intervals are outlined in a later section (see Examples of Confidence Interval Calculations). However, it is conceptually easy to grasp, even at this stage, the very obvious overlapping of the two intervals. The more conventional approaches to analyzing data on nominal scales are the proportion tests such as the z-test, χ² test, Fisher’s exact test, or McNemar’s test, using Yates’ continuity correction when the sample sizes are small (3). All of these tests are more involved than the simple calculation of the confidence intervals; what is more, in the quoted example, they all show that there is no statistical difference, i.e., P >0.05, between the stated sensitivities.

The most commonly used probability distribution in medicine is the gaussian, and much biological data is adequately represented by it. For example, Flynn et al. (6) reported that, of 19 analytes examined from 1000 blood donors, nearly all showed gaussian-type distributions, either with or without logarithmic transformation. However, before sampled data, either raw or after application of a transformation, are assumed to show normality, this must be validated either by visual inspection or, better still, by a variety of formal checks (3). That such formal checks, even of the simplest nature, are not always used can readily be appreciated by random review of this Journal. A recent paper gave a mean and SD of 11.9 and 8.7 kU/mg, respectively. If the data were distributed normally, 95% of the values would be found in the region -5.2 to 29 kU/mg: -5.2 kU/mg!
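The plausibility check applied to that paper can be sketched in a few lines of Python. The function name `central_range` is hypothetical and only illustrates the arithmetic; a negative lower limit for a quantity that cannot be negative is a hint that the data are not gaussian.

```python
def central_range(mean, sd, z=1.96):
    """Interval expected to contain about 95% of values IF the data are gaussian."""
    return mean - z * sd, mean + z * sd


# Mean and SD quoted in the text (kU/mg)
low, high = central_range(11.9, 8.7)
print(f"{low:.1f} to {high:.1f} kU/mg")

if low < 0:
    # A concentration or activity cannot be negative, so the gaussian
    # assumption behind "mean +/- 1.96 SD" is doubtful for these data.
    print("Negative lower limit: check the distribution before using mean and SD.")
```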

Let us assume the normality of a set of experimental data. Its variability is measured by the SD. The percentage of the data contained by multiples of the SD on each side of the mean is, for 1 SD, 68.3%; for 2 SDs, 95.5%; and for 3 SDs, 99.7%. Clearly, any percentage of the data may be chosen, but the most usual ones are 90%, 95%, and 99%, for which the following multiples of the SD on each side of the mean apply: 1.645, 1.96, and 2.576, respectively. Whereas SD is a descriptive index, the standard error of the mean (SE or, less commonly, SEM) is a measure of uncertainty. Feinstein (7) comments that neither “standard” nor “error” is an appropriate term for this parameter and that these terms can only serve to confuse the unwary.
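The percentages and multipliers for the gaussian distribution can be checked with Python’s standard library. This is only a sketch; the helper names `coverage` and `multiplier` are hypothetical.

```python
from statistics import NormalDist

std_normal = NormalDist()  # gaussian with mean 0 and SD 1


def coverage(k):
    """Fraction of a gaussian population within k SDs on each side of the mean."""
    return std_normal.cdf(k) - std_normal.cdf(-k)


def multiplier(level):
    """Number of SDs on each side of the mean that encloses `level` of the data."""
    return std_normal.inv_cdf(0.5 + level / 2)


for k in (1, 2, 3):
    print(f"within {k} SD: {100 * coverage(k):.1f}%")

for level in (0.90, 0.95, 0.99):
    print(f"{100 * level:.0f}% of the data: mean +/- {multiplier(level):.3f} SD")
```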

Despite its inappropriate name, the SE is an extremely important index. It is calculated from the SD and sample size ($SE = SD/\sqrt{n}$). If the population is repeatedly sampled, and each sample has its mean and SD calculated, how well do these mean values estimate the true mean of the parent population? If each of these sample means is thought of as an individual value, then the standard deviation of these means is the SE. It is thus “a measure of the uncertainty of a single sample mean as an estimate of the population mean” (3). If the population distribution is gaussian, then the distribution of these sample means will also be gaussian. In addition, the distribution of the sample means will also approach normality, whatever the distribution of the variables in the parent population, provided the sample is sufficiently large (the Central Limit Theorem).
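The idea that the SE is the standard deviation of repeated sample means can be illustrated by simulation. The sketch below is an assumption-laden toy: it draws from a uniform (deliberately nongaussian) population, and the sample size, number of samples, and seed are arbitrary choices, not values from the text.

```python
import random
import statistics

random.seed(1)  # fixed seed so the sketch is repeatable

POP_SD = (1 / 12) ** 0.5  # SD of a uniform(0, 1) population
N = 25                    # size of each sample (arbitrary)
N_SAMPLES = 2000          # number of repeated samples (arbitrary)

# Mean of each repeated sample
sample_means = [
    statistics.fmean(random.random() for _ in range(N))
    for _ in range(N_SAMPLES)
]

sd_of_means = statistics.stdev(sample_means)  # empirical SD of the sample means
predicted_se = POP_SD / N ** 0.5              # SE = SD / sqrt(n)

print(f"SD of sample means: {sd_of_means:.4f}")
print(f"SD / sqrt(n):       {predicted_se:.4f}")
```

The two printed values should agree closely, and a histogram of `sample_means` would look approximately gaussian even though the parent population is uniform.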

These remarks are a necessary prerequisite to introducing the concept of a confidence interval. This interval for a mean extends on both of its sides by a multiple of the SE. This idea is exactly analogous to that previously described for the SD. Thus 1.96 × SE defines the 95% confidence interval that will include the mean of the parent population with a probability of 0.95. Or, put another way, the 95% confidence interval will not include the true population mean 5% of the time, i.e., in 1 out of 20 times. When working with small samples, say, n <50, where n is the sample size, it is necessary to use the t distribution instead of the normal distribution. The t-value is obtained from a table of Student’s t distribution, for n - 1 degrees of freedom, and the two-tailed percentage point (α) for the appropriate confidence interval (e.g., for 99%, α = 0.01; for 95%, α = 0.05; for 90%, α = 0.1). A plot of the 95% confidence interval, using the t distribution, is shown in Figure 1 for values of n between 5 and 250. Thus, for sample sizes between 25 and 50, the confidence interval is in the range of 0.25 to 0.5 SDs, whereas for sample sizes >50, the confidence interval will be <0.25 SDs.

Fig. 1. The width of the 95% confidence interval as multiples of the standard deviation in relation to the sample size
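As a worked illustration of how the interval narrows with sample size, the half-width expressed in SD units is simply t divided by the square root of n. The t values 2.064 and 1.984 used here are taken from a standard table of Student’s t (24 and 99 degrees of freedom); they are assumptions of this sketch and are not quoted in the text at this point.

```latex
\bar{x} \pm t_{n-1,\,\alpha/2}\,\frac{\mathrm{SD}}{\sqrt{n}}
\qquad\Longrightarrow\qquad
\frac{\text{half-width}}{\mathrm{SD}} = \frac{t_{n-1,\,\alpha/2}}{\sqrt{n}}

n = 25:\quad \frac{2.064}{\sqrt{25}} \approx 0.41
\qquad\qquad
n = 100:\quad \frac{1.984}{\sqrt{100}} \approx 0.20
```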

What advantage does the knowledge of the confidence interval confer over the more traditional use of SD and SE? The latter parameters are usually used in the traditional process of stating a null, and often an alternative, hypothesis and then using a test statistic to obtain a P value for rejecting or accepting the hypotheses (3, 8). This process gives no indication at all of the magnitude of the effect being studied; it merely produces a probability value. (This aspect will be examined in more detail later: see P Values, below.) By contrast, the confidence interval demonstrates, explicitly, the magnitude of the uncertainty, and its direction, as well as being an intuitively easy concept to grasp. Both Lancet and British Medical Journal have published numerous articles on this topic, which have been gathered, by the latter journal, into a book (9) with an associated computer program, the Confidence Interval Analysis calculator (10).³ It is, of course, accepted that the SD value may also be used to calculate the SE and the confidence interval. In practice, this is rarely done, as a random inspection of this Journal will show.

³ These references may be obtained from Subscriber Services, American College of Physicians, Independence Mall West, Sixth Street at Race, Philadelphia, PA 19106-1572.

Fig. 2. Relationship between population size and the 95% confidence interval for an estimate of sensitivity. These limits were obtained from the Geigy Scientific Tables (14); the values were corroborated by using the Confidence Interval Analysis program (10)

Examples of Confidence Interval Calculations

Some examples are provided to demonstrate the importance of the explicit description of uncertainty.

Confidence intervals: means. The simplest example of the value of using the confidence interval can be seen when referring to a mean value (11). Its confidence interval is obtained by calculating SE, obtaining the appropriate t-value, as explained above, and evaluating the term (t × SE). Thus, when the sample size is 15, the mean = 10.0, SD = 3.0, $SE = 3/\sqrt{15} = 0.775$, and t = 2.145, the 95% confidence interval on each side of the mean is 10 ± 0.775 × 2.145 = 10 ± 1.66 (i.e., 8.34 to 11.66); for a population of n = 100, $SE = 3/\sqrt{100} = 0.3$, t = 1.984, and the 95% confidence interval is now mean 10 ± 0.595, or 9.405 to 10.595. Such data illustrate the profound influence of population size on the confidence interval already demonstrated in Figure 1. Again, it is not hard to find articles in Clinical Chemistry that display, for example, mean ± SD values for several groups but with the sizes of the groups varying from 20 to >100! Confidence intervals would have given a much clearer understanding of the variability of the data.

The calculations described above may be completely avoided by use of the Confidence Interval Analysis program mentioned earlier (10).
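For readers without that program, the two worked examples above can be reproduced with a short Python sketch. The function name `mean_ci95` is hypothetical, and the small lookup table holds only the two t values quoted in the text; a full implementation would take t from a statistics library or a complete table.

```python
import math

# Two-tailed 5% points of Student's t, keyed by degrees of freedom (n - 1).
# Only the two values quoted in the text are included in this sketch.
T_95 = {14: 2.145, 99: 1.984}


def mean_ci95(mean, sd, n):
    """95% confidence interval for a mean: mean +/- t * SD / sqrt(n)."""
    se = sd / math.sqrt(n)
    half_width = T_95[n - 1] * se
    return mean - half_width, mean + half_width


for n in (15, 100):
    low, high = mean_ci95(10.0, 3.0, n)
    print(f"n = {n:3d}: 95% CI {low:.3f} to {high:.3f}")
```

The same mean and SD give an interval almost three times wider at n = 15 than at n = 100, which is the point the text makes about reporting mean ± SD for groups of very different sizes.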

Confidence intervals: proportions. Sensitivity (TP rate) and specificity (true negative, or TN, rate) data are commonly reported in Clinical Chemistry. However, it is unusual to see the confidence intervals provided with such data. The need for these can readily be appreciated by an examination of Figure 2, which shows the effect of population size on the 95% confidence intervals for a test with a sensitivity of 90%. When the population is <20, the zone of uncertainty exceeds 30%. This aspect is certainly not appreciated by many workers who appear to be seduced by the apparently satisfactory test performance indicated by a sensitivity of 90%. The confidence interval of proportions, such as sensitivity and specificity, follows a binomial distribution; therefore, unless the proportion is exactly 50% or the sample size is large, the distribution is asymmetric, as shown in Figure 2. These intervals may be calculated (12, 13) or exact values for the 90% and 95% zones may be obtained for population sizes from n = 2 to 100 from the table of binomial distributions in the Geigy Scientific Tables (14). When n >100, the simple formula given by Gardner and Altman (12) suffices. Alternatively, these limits may be obtained by use of the Confidence Interval Analysis program mentioned earlier (10).

Confidence intervals: likelihood ratios. Bayesian analysis is frequently invoked in Clinical Chemistry. The likelihood ratio is the link between the pretest and the posttest odds of disease (15). Of course, as with all such estimates, the likelihood ratio is subject to error, which the confidence interval quantifies (16). The 95% confidence interval of a likelihood ratio (LR) value is $\mathrm{LR}^a$ to $\mathrm{LR}^b$, where $a = 1 - (1.96/\chi)$ and $b = 1 + (1.96/\chi)$. $\chi^2$ is evaluated by the simplified formula for a 2 × 2 predictive value table. Beck (17) provides a detailed example of the calculations. Again, it is uncommon to see likelihood ratios associated with this essential indication of variability in Clinical Chemistry, although it is surely as important as the provision of SD or SE.
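Both kinds of interval discussed in the two preceding paragraphs can be sketched in Python. This is a minimal illustration under stated assumptions, not the method of the cited references: the exact binomial limits are found by bisection on the binomial distribution (the Clopper-Pearson construction), the likelihood-ratio limits assume the test-based form LR^(1 ± 1.96/χ) with χ² from the usual shortcut formula for a 2 × 2 table, and the 2 × 2 counts are hypothetical.

```python
import math


def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(math.comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


def _solve(f, target):
    """Bisection on p in (0, 1) for f(p) = target, with f decreasing in p."""
    lo, hi = 0.0, 1.0
    for _ in range(60):
        mid = (lo + hi) / 2
        if f(mid) > target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def exact_ci(k, n, alpha=0.05):
    """Exact (Clopper-Pearson) confidence interval for k successes in n trials."""
    # Lower limit: P(X >= k | p) = alpha/2, i.e. P(X <= k - 1 | p) = 1 - alpha/2
    lower = 0.0 if k == 0 else _solve(lambda p: binom_cdf(k - 1, n, p), 1 - alpha / 2)
    # Upper limit: P(X <= k | p) = alpha/2
    upper = 1.0 if k == n else _solve(lambda p: binom_cdf(k, n, p), alpha / 2)
    return lower, upper


def lr_test_based_ci(tp, fn, fp, tn, z=1.96):
    """Positive likelihood ratio with test-based limits LR**(1 -/+ z/chi)."""
    n = tp + fn + fp + tn
    chi2 = n * (tp * tn - fn * fp) ** 2 / (
        (tp + fn) * (fp + tn) * (tp + fp) * (fn + tn)
    )
    chi = math.sqrt(chi2)
    lr = (tp / (tp + fn)) / (fp / (fp + tn))  # sensitivity / (1 - specificity)
    return lr, lr ** (1 - z / chi), lr ** (1 + z / chi)


# Hypothetical 2 x 2 table: 20 diseased (18 test positive), 20 healthy (4 test positive)
sens_low, sens_high = exact_ci(18, 20)
print(f"sensitivity 90% (18/20): 95% CI {100 * sens_low:.1f}% to {100 * sens_high:.1f}%")

lr, lr_low, lr_high = lr_test_based_ci(tp=18, fn=2, fp=4, tn=16)
print(f"LR = {lr:.2f}, 95% CI {lr_low:.2f} to {lr_high:.2f}")
```

With only 20 diseased subjects the interval around a sensitivity of 90% spans roughly 30 percentage points and is clearly asymmetric, in line with the description of Figure 2.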

Confidence intervals: area under the ROC curve. ROC curve analysis is an important and powerful tool for evaluating a test’s diagnostic accuracy. The essential index of accuracy, when using ROC curve analysis, is the area under the curve (18). Swets (18) also suggests that areas of 0.5 to 0.7 denote low accuracy, 0.7 to 0.9 moderate accuracy, and >0.9 high accuracy. However, one must actually measure the area to establish the magnitude of the accuracy. Bamber (19) has shown that the area under the curve is related to the Mann-Whitney U-statistic (a nonparametric test based on rank order). This is the basis for the Hanley-McNeil procedure for obtaining these areas (20). Nonetheless, it is rare to encounter this essential index in the pages of Clinical Chemistry, although ROC curves are frequently used. But all such estimates of accuracy also require an indication of the extent of error of this estimation, which is provided by the SE. Unfortunately, the calculations of both the area under the curve and the SE (20, 21) are tedious and prone to error, and are best performed either by spreadsheet analysis (21) or by a more comprehensive computer program (22, 23). Beck and Shultz (21) give an extended illustration of these calculations.
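The link between the area and the Mann-Whitney statistic can be sketched in Python. The test scores below are hypothetical, and the SE uses the approximation commonly attributed to Hanley and McNeil (with Q1 = A/(2 - A) and Q2 = 2A²/(1 + A)); it is offered as an illustration of the idea rather than as the exact procedure of references (20) and (21).

```python
import math


def roc_area(diseased, healthy):
    """Area under the ROC curve as the Mann-Whitney probability that a
    diseased subject scores higher than a healthy one (ties count one half)."""
    wins = sum(
        1.0 if d > h else 0.5 if d == h else 0.0
        for d in diseased
        for h in healthy
    )
    return wins / (len(diseased) * len(healthy))


def hanley_mcneil_se(area, n_diseased, n_healthy):
    """Approximate standard error of the area under the ROC curve."""
    q1 = area / (2 - area)
    q2 = 2 * area**2 / (1 + area)
    variance = (
        area * (1 - area)
        + (n_diseased - 1) * (q1 - area**2)
        + (n_healthy - 1) * (q2 - area**2)
    ) / (n_diseased * n_healthy)
    return math.sqrt(variance)


# Hypothetical test results (higher values suggest disease)
diseased = [3, 5, 6, 8]
healthy = [1, 2, 4, 5]

area = roc_area(diseased, healthy)
se = hanley_mcneil_se(area, len(diseased), len(healthy))
print(f"area = {area:.3f}, SE = {se:.3f}")
print(f"approximate 95% CI: {area - 1.96 * se:.2f} to {min(1.0, area + 1.96 * se):.2f}")
```

With groups this small the interval is very wide, which is why an area quoted without its SE or confidence interval says little about whether two tests really differ.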

Confidence intervals: regression.⁴ Clinical Chemistry (2) requires an extensive list of statistical parameters when contributors use linear-regression analysis. Although sufficient information is provided by this requirement for a reader to construct the confidence interval for the mean value of y, it would be more useful to provide that information directly on the graph. Figure 3A illustrates the usual type of plot seen in Clinical Chemistry, whereas Figure 3B shows, in addition, two zones of uncertainty (26). Thus, it is possible to see (in the inner zone) that for serum glucose concentrations of 5 and 25 mmol/L, the 95% confidence intervals for the mean blood glucose meter readings are 3.39-10.4 and 19.4-26.0 mmol/L, respectively. The outer zone shows the uncertainty in predicted values of y for an individual value of x: the prediction or tolerance interval. Clearly, graph B provides much more information about the scatter of the data than does graph A, although the latter type of graph is that commonly seen in Clinical Chemistry.

⁴ I have avoided discussion of the advantages of the Deming plot over the conventional least-squares method (24) or of the bias plot (25) for examining the relationship between two variables.

Fig. 3. Relationship between serum glucose and blood glucose meter estimations. The regression line is $y = 0.79x + 2.94$, $r = 0.941$, $\bar{x} = 15.5$, $\bar{y} = 15.2$, $S_x = 6.99$, $S_y = 5.87$, $S_{y\cdot x} = 2.14$, SE of the slope = 0.116, and SE of the intercept = 1.94. Graph A shows the eight points used to obtain the least-squares fit; graph B shows, in addition, the 95% confidence interval for the mean blood glucose meter reading (inner zone) and the 95% prediction interval for an individual blood glucose meter reading (outer zone). Both panels plot serum glucose (mmol/L) on the x axis.

The confidence interval for the mean value of y for a given value of x is calculated as described earlier for means, but with use of the standard error of the estimate, $S_{y\cdot x}$, which is obtained during the linear-regression procedure, and the appropriate value of t for n - 2 degrees of freedom and the percentage point (α) for the appropriate confidence interval (e.g., for 99%, α = 0.01, etc.) as previously mentioned.

The value of y ($y_{est}$) is calculated for the chosen value of x; thus, $y_{est} = (0.79 \times 5.0) + 2.94$, and $SE(y_{est})$ is calculated from the expression:

$$SE(y_{est}) = S_{y\cdot x}\sqrt{\frac{1}{n} + \frac{(x - \bar{x})^2}{(n - 1)S_x^2}}$$

For x = 5 mmol/L, $y_{est} = 6.89$, and the values for $S_{y\cdot x}$ and $S_x$ taken from the legend to Figure 3, $SE(y_{est})$ is evaluated thus:

$$SE(y_{est}) = 2.14\sqrt{\frac{1}{8} + \frac{(5 - 15.5)^2}{7(6.99)^2}} = 1.43$$

The value of t, for 6 degrees of freedom and α = 0.05, is 2.45; the 95% confidence interval is therefore:

$$y_{est} - t \cdot SE(y_{est}) \ \text{to}\ y_{est} + t \cdot SE(y_{est}),$$

i.e., $y_{est} - 2.45\,SE(y_{est})$ to $y_{est} + 2.45\,SE(y_{est})$,

or 3.39 to 10.39 mmol/L.

The predictive interval is calculated on the same basis as the confidence interval calculated above, but with use of a different expression to obtain $SE(y_{pred})$:

$$SE(y_{pred}) = S_{y\cdot x}\sqrt{1 + \frac{1}{n} + \frac{(x - \bar{x})^2}{(n - 1)S_x^2}}$$

For x = 5 mmol/L, $y_{pred} = (0.79 \times 5.00) + 2.94 = 6.89$, and the expression for $SE(y_{pred})$ is:

$$SE(y_{pred}) = 2.14\sqrt{1 + \frac{1}{8} + \frac{(5 - 15.5)^2}{7(6.99)^2}} = 2.57$$

As before, the value of t, for 6 degrees of freedom and α = 0.05, is 2.45; the 95% prediction interval is therefore:

$$y_{pred} - t \cdot SE(y_{pred}) \ \text{to}\ y_{pred} + t \cdot SE(y_{pred}),$$

i.e., $y_{pred} - 2.45\,SE(y_{pred})$ to $y_{pred} + 2.45\,SE(y_{pred})$,

or 0.598 to 13.2 mmol/L.

These rather tedious calculations can be completely avoided by use of the Confidence Interval Analysis program mentioned earlier (10).
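As an alternative illustration, the confidence and prediction intervals for the glucose example can be computed from the summary statistics in the legend to Figure 3. The sketch below is a hypothetical helper, not the cited program; it carries full precision through the calculation, so its limits can differ from the text in the last digit because the text rounds the SE before multiplying by t.

```python
import math

# Summary statistics from the legend to Figure 3
SLOPE, INTERCEPT = 0.79, 2.94
N, X_BAR, S_X, S_YX = 8, 15.5, 6.99, 2.14
T_6DF = 2.45  # two-tailed 5% point of t for n - 2 = 6 degrees of freedom


def regression_intervals(x):
    """Return (y, confidence interval for the mean of y, prediction interval)."""
    y = SLOPE * x + INTERCEPT
    # Common term: 1/n + (x - mean of x)^2 / ((n - 1) * Sx^2)
    spread = 1 / N + (x - X_BAR) ** 2 / ((N - 1) * S_X**2)
    se_mean = S_YX * math.sqrt(spread)      # SE for the mean value of y at x
    se_pred = S_YX * math.sqrt(1 + spread)  # SE for an individual value of y at x
    ci = (y - T_6DF * se_mean, y + T_6DF * se_mean)
    pi = (y - T_6DF * se_pred, y + T_6DF * se_pred)
    return y, ci, pi


for x in (5, 25):
    y, ci, pi = regression_intervals(x)
    print(f"x = {x:2d}: y = {y:.2f}, "
          f"95% CI {ci[0]:.2f} to {ci[1]:.2f}, "
          f"95% PI {pi[0]:.2f} to {pi[1]:.2f}")
```

The only difference between the two intervals is the extra 1 under the square root, which is why the prediction zone for an individual reading is always wider than the confidence zone for the mean.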

Confidence intervals: correlation. The correlation coefficient, r, also has a degree of uncertainty and this can be estimated if x and y have a joint bivariate normal distribution (26). The calculation requires the transformation of r to Fisher’s z, as follows:

$$z = \frac{1}{2}\ln\left(\frac{1 + r}{1 - r}\right)$$

which has a standard error of $1/\sqrt{n - 3}$. For the 90% confidence interval, the standard error is multiplied by 1.645, for 95% by 1.96, and for 99% by 2.576. Therefore, for the 95% confidence interval, the upper and lower limits are:

$$z_2 = z + \frac{1.96}{\sqrt{n - 3}} \quad\text{and}\quad z_1 = z - \frac{1.96}{\sqrt{n - 3}}$$

The values $z_1$ and $z_2$ need to be transformed back to the original scale to provide the 95% confidence interval for the correlation coefficient, by use of the expression:

$$\frac{e^{2z_1} - 1}{e^{2z_1} + 1} \quad\text{to}\quad \frac{e^{2z_2} - 1}{e^{2z_2} + 1}$$

For the correlation coefficient in the legend to Figure 3, r = 0.941, z = 1.7467, $z_2$ = 2.6228, $z_1$ = 0.87058, and the 95% confidence interval for the correlation coefficient is 0.702 to 0.989.

All of these equations can be evaluated by direct entry into the sets of z-transformations in the Geigy Scientific Tables (27), thus avoiding a set of awkward arithmetic manipulations. Alternatively, the Confidence Interval Analysis program may be used (10).
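The same transformation is also available directly in Python, because Fisher’s z is the inverse hyperbolic tangent of r and the back-transformation is the hyperbolic tangent. The function name `correlation_ci` is hypothetical; the sketch assumes, as the text does, a joint bivariate normal distribution.

```python
import math


def correlation_ci(r, n, multiplier=1.96):
    """Confidence interval for a correlation coefficient via Fisher's z."""
    z = math.atanh(r)          # identical to 0.5 * ln((1 + r) / (1 - r))
    se = 1 / math.sqrt(n - 3)  # standard error of z
    z1, z2 = z - multiplier * se, z + multiplier * se
    # tanh(z) = (e^(2z) - 1) / (e^(2z) + 1) transforms back to the r scale
    return math.tanh(z1), math.tanh(z2)


low, high = correlation_ci(0.941, 8)  # r and n from Figure 3
print(f"95% CI for r: {low:.3f} to {high:.3f}")
```

Even a correlation coefficient as high as 0.941 is compatible with a true value near 0.70 when only eight points are available.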

Confidence intervals: nonparametric analyses. When a studied population has a nonnormal distribution (a fairly common occurrence in the practice of clinical chemistry) the commonly used descriptor of the population is the median. Assume that 11 observations have been made, the results of which are listed in ascending order (a necessary first step in nonparametric statistics): 5, 7, 9, 11, 13, 15, 17, 19, 21, 23, and 25, with a median value of 15. This data set will be used to obtain the 95% confidence interval for the median.

The approximate confidence interval for the median (28) may be calculated as follows:

$$L = \frac{n}{2} - \left(1.96\,\frac{\sqrt{n}}{2}\right) \quad\text{and}\quad U = 1 + \frac{n}{2} + \left(1.96\,\frac{\sqrt{n}}{2}\right)$$

where L and U are the lower and upper limits, respectively, and the multiplier is 1.645 for the 90% confidence interval, 1.96 for 95%, and 2.576 for 99%. For the example above, these limits evaluate to:

$$L = \frac{11}{2} - \left(1.96\,\frac{\sqrt{11}}{2}\right) = 2.25 \quad\text{and}\quad U = 1 + \frac{11}{2} + \left(1.96\,\frac{\sqrt{11}}{2}\right) = 9.75$$

or, on rounding to the nearest integer, L = 2 and U = 10. Therefore, the lower 95% limit is the 2nd observation (value 7) in the ordered set and the upper limit is the 10th observation (value 23). Alternatively, exact values for a range of α can be obtained for population sizes from n = 2 to 499 from the table of binomial distributions (P = 0.5) in the Geigy Scientific Tables (14); from these tables, L = 1 (value 5) and U = 10 (value 23). Note that the 95% confidence interval for the median is very wide: it includes 82% (9 of 11) of the population. This is a feature of confidence intervals for small samples from nonnormal distributions.

Finally, the Confidence Interval Analysis program (10) may be used to obtain either a Wilcoxon or binomial-based confidence interval. I emphasize that these calculations provide only approximate values; standard textbooks should be consulted for more detailed procedures (29, 30).
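The approximate rank-based calculation for the median can be sketched as follows. The function name `median_ci` is hypothetical, the ranks are clamped to the available observations as a safety measure not mentioned in the text, and, like the formula itself, the result is only approximate for small samples.

```python
import math


def median_ci(values, multiplier=1.96):
    """Approximate confidence interval for the median from ordered ranks."""
    data = sorted(values)  # ranking is the necessary first step
    n = len(data)
    half = multiplier * math.sqrt(n) / 2
    lower_rank = max(1, round(n / 2 - half))      # L
    upper_rank = min(n, round(1 + n / 2 + half))  # U
    # Ranks are 1-based; Python lists are 0-based
    return lower_rank, upper_rank, data[lower_rank - 1], data[upper_rank - 1]


observations = [5, 7, 9, 11, 13, 15, 17, 19, 21, 23, 25]
L, U, low, high = median_ci(observations)
print(f"L = {L}, U = {U}: 95% CI for the median is {low} to {high}")
```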

P Values

Information for Authors (2) suggests that sole reliance should not be placed on, for example, P values, but the experience quoted above suggests otherwise. As far back as 1978, Rothman (31) stated, in an editorial in the New England Journal of Medicine, that “P values . . . are not good measures of the strength of the relation between study variables. P values serve poorly as descriptive statistics.” Bailar and Mosteller write (32), in an article originally prepared for the Annals of Internal Medicine, although not cited in Clinical Chemistry’s Information for Authors, “Confidence intervals offer a more informative way to deal with the significance test than does a simple P value. Confidence intervals for a single mean or a proportion provide information about both magnitude and its variability.” Likewise, Gardner and Altman (33), professional statisticians, comment that “. . . even precise P values convey nothing about the sizes of the differences between study groups.” However, a random search through Clinical Chemistry shows an undue reliance on P value boundaries, e.g., P <0.01, >0.05, and so on. Although, in the past, it was necessary to rely on statistical tables for the values of P, many commonly available microcomputer statistical programs can calculate an exact value for P; so why are P values still given in this manner? The undue reliance on a P value above or below 0.05 has in any case been savaged by Feinstein (34), who as a mathematician and clinical epidemiologist speaks with considerable authority, in terms that should be reprinted in all Advice to Authors-type articles:

the statistical strategy proposed by Sir Ronald Fisher, who regarded 95% of the inner values [of a distribution] as common and the remaining 5% as significantly uncommon. Although the strategy is regularly used to designate the outer 5% of values as abnormal, Murphy [35] has pointed out that “contrary to popular opinion,” this demarcation of abnormality “is not a recommendation of statisticians and . . . has no support from statistical theory.” Fisher’s proposed boundary of uncommon occurrences was intended for inferential decisions about P values, not for descriptive decisions about normality. Nevertheless, after years of exposure to 0.05 as the magic level of stochastic significance, many clinicians have become thoroughly conditioned to accept the same boundary marker for abnormality.

The acceptance or rejection of a null hypothesis on the basis of a P value of 0.049 or 0.051 is clearly nonsensical, but this aspect dominates biomedical thinking, as Feinstein observed.

It may now be obvious that there is a close relationship between the P value and the confidence interval of the result for a sample. When the P value is “significant,” i.e., <0.05, the 95% confidence interval will not include the value specified by the null hypothesis. However, the P value does not indicate the magnitude of the difference, or its direction, and the degree of associated uncertainty is unknown. By contrast, the 95% confidence interval provides some sense of all of these aspects and therefore is a more useful index for assessing the validity of the data.
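The correspondence between a “significant” P value and a 95% confidence interval that excludes the null value can be illustrated with a short sketch. The Python fragment below is not part of the original article: it assumes two independent groups and a large-sample (normal) approximation for the difference between their means, and the function names `diff_ci_and_p` and `excludes_zero`, as well as the summary statistics, are hypothetical and chosen only for illustration.

```python
from math import sqrt
from statistics import NormalDist


def diff_ci_and_p(mean1, sd1, n1, mean2, sd2, n2, level=0.95):
    """Return (difference, (lower, upper), two-sided P) for mean1 - mean2.

    Assumes independent groups and a large-sample normal approximation.
    """
    std_normal = NormalDist()
    diff = mean1 - mean2
    # Standard error of the difference between two independent means
    se = sqrt(sd1 ** 2 / n1 + sd2 ** 2 / n2)
    # Critical value (about 1.96 for a 95% interval)
    z_crit = std_normal.inv_cdf(0.5 + level / 2)
    ci = (diff - z_crit * se, diff + z_crit * se)
    # Two-sided P value for the null hypothesis "difference = 0"
    p_value = 2 * (1 - std_normal.cdf(abs(diff / se)))
    return diff, ci, p_value


def excludes_zero(ci):
    """True when the interval does not contain the null value of 0."""
    lower, upper = ci
    return lower > 0 or upper < 0


# Hypothetical summary statistics: (mean, SD, n) for each of two groups.
examples = {
    "larger difference": (5.40, 1.2, 50, 4.90, 1.3, 50),
    "smaller difference": (5.40, 1.2, 50, 4.95, 1.3, 50),
}
for label, args in examples.items():
    diff, ci, p = diff_ci_and_p(*args)
    print(f"{label}: diff = {diff:.2f}, "
          f"95% CI {ci[0]:.2f} to {ci[1]:.2f}, "
          f"P = {p:.3f}, CI excludes 0: {excludes_zero(ci)}")
```

Under these assumptions the P value falls below 0.05 exactly when the 95% interval excludes zero, so the two summaries agree on “significance”; only the interval, however, also shows the size and direction of the difference and how wide the plausible range is.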

A Statistical Check List for Clinical Chemistry?

Many publications have stressed the inappropriate use of statistical techniques, and in recent years several journals have started to use statistical check lists. Clinical Chemistry now uses a style check list; could a statistical check list be added? The following topics may possibly ensure statistical respectability. They have been culled from several sources (3, 32, 36):

Design of the reported study. Have the study’s design and objectives been sufficiently described? Have the null and alternative hypotheses been stated? Was the sample size based on pre-study considerations of statistical power? How were the subjects in the study chosen?

Analysis and presentation of the data. Have technical terms been correctly used? Have statistical terms and abbreviations been adequately defined? Was the number of subjects/samples stated? Were the statistical analyses appropriate and were reasons given for their use? Were these statistical procedures adequately described or referenced? Were indications of measurement error or uncertainty provided? Is there undue reliance on P values? Are the tables and figures adequate?

When the major medical journals are increasingly insisting on rational statistical rigor, can Clinical Chemistry afford to ignore such a commendable trend?

I believe that this brief outline of the need to employ confidence intervals in the Journal should convince others that the provision of more relevant statistical measurements will enhance the presentation of experimental data in Clinical Chemistry.

References
1. International Committee of Medical Journal Editors. Uniform requirements for manuscripts submitted to biomedical journals [Special Report]. N Engl J Med 1992;324:424-8.
2. Information for authors. Clin Chem 1992;38:1-5.

3. Altman DG. Practical statistics for medical research. London: Chapman and Hall, 1991:611pp.
4. Feinstein AR. Clinical biostatistics. St. Louis, MO: CV Mosby Co., 1977:468pp.
5. Gore SM, Altman DG, eds. Statistics in practice: comprising Statistics in question and Statistics and ethics in medical research. London: British Medical Association, 1982:100pp.
6. Flynn FV, Piper KAJ, Garcia-Webb P, McPherson K, Healy MJR. The frequency distributions of commonly determined blood constituents in healthy blood donors. Clin Chim Acta 1974;52:163-71.
7. Feinstein AR. Clinical biostatistics XXXVII: demeaned errors, confidence games, non-plussed minuses, inefficient coefficients, and other statistical disruptions of scientific communication. Clin Pharmacol Ther 1976;20:617-31.
8. Ware JH, Mosteller F, Delgado F, Donnelly C, Ingelfinger JA. P values. In: Bailar JC, Mosteller F, eds. Medical uses of statistics, 2nd ed. Boston: NEJM Books, 1992:181-200.
9. Gardner MJ, Altman DG, eds. Statistics with confidence: confidence intervals and statistical guidelines. London: British Medical Journal, 1989:140pp.
10. Gardner MJ, Gardner SB, Winter PD. Confidence interval analysis microcomputer program. London: British Medical Journal, 1989:77pp.
11. Altman DG, Gardner MJ. Calculating confidence intervals for means and their differences. In: Gardner MJ, Altman DG, eds. Statistics with confidence: confidence intervals and statistical guidelines. London: British Medical Journal, 1989:20-7.
12. Gardner MJ, Altman DG. Calculating confidence intervals for proportions and their differences. In: Gardner MJ, Altman DG, eds. Statistics with confidence: confidence intervals and statistical guidelines. London: British Medical Journal, 1989:28-33.
13. Armitage P, Berry G. Statistical methods in medical research, 2nd ed. Oxford: Blackwell Scientific Publications, 1987:115-20.
14. Diem K, Seldrup J. Geigy scientific tables, Vol. 2. Introduction to statistics, statistical tables, & mathematical formulae: the binomial distribution, 8th ed. Basle: Ciba-Geigy, 1982:73-107.
15. Sackett DL, Haynes RB, Guyatt GH, Tugwell P. Clinical epidemiology. A basic science for clinical medicine, 2nd ed. Boston: Little, Brown and Co., 1991:441pp.
16. Miettinen OS. Estimability and estimation in case-referent studies. Am J Epidemiol 1976;103:226-35.
17. Beck JR. Likelihood ratios: another enhancement of sensitivity and specificity [Editorial]. Arch Pathol Lab Med 1986;110:685-6.
18. Swets JA. Measuring the accuracy of diagnostic systems. Science 1988;240:1285-93.
19. Bamber D. The area above the ordinal dominance graph and the area below the receiver operating graph. J Math Psychol 1975;12:387-415.
20. Hanley JA, McNeil BJ. The meaning and use of the area under a receiver operating characteristic (ROC) curve. Radiology 1982;143:29-36.
21. Beck JR, Shultz BK. The use of relative operating characteristic (ROC) curves in test performance evaluation. Arch Pathol Lab Med 1986;110:13-20.
22. Pellar TG, Leung FY, Henderson AR. A computer program for rapid generation of receiver operating characteristic curves and likelihood ratios in the evaluation of diagnostic tests. Ann Clin Biochem 1988;25:411-6.
23. Pellar TG, Galbraith LV, Leung FY, Henderson AR. A computer program to determine diagnostic decision thresholds and likelihood ratios illustrated with aspartate aminotransferase activities after a myocardial infarction. Ann Clin Biochem 1989;26:533-7.
24. Cornbleet PJ, Gochman N. Incorrect least-square regression coefficients in method-comparison analysis. Clin Chem 1979;25:432-8.
25. Bland JM, Altman DG. Statistical methods for assessing agreement between two methods of clinical measurement. Lancet 1986;i:307-10.
26. Altman DG, Gardner MJ. Calculating confidence intervals for regression and correlation. In: Gardner MJ, Altman DG, eds. Statistics with confidence: confidence intervals and statistical guidelines. London: British Medical Journal, 1989:34-49.
27. Diem K, Seldrup J. Geigy scientific tables, Vol. 2. Introduction to statistics, statistical tables & mathematical formulae: z-transformation, 8th ed. Basle: Ciba-Geigy, 1982:64-7.
28. Campbell MJ, Gardner MJ. Calculating confidence intervals for some non-parametric analyses. In: Gardner MJ, Altman DG, eds. Statistics with confidence: confidence intervals and statistical guidelines. London: British Medical Journal, 1989:71-9.
29. Sprent P. Quick statistics: an introduction to non-parametric methods. Harmondsworth, UK: Penguin Books, 1981:264pp.
30. Sprent P. Applied nonparametric statistical methods. London: Chapman and Hall, 1989:259pp.
31. Rothman K. A show of confidence [Editorial]. N Engl J Med 1978;299:1362-3.
32. Bailar JC, Mosteller F. Guidelines for statistical reporting in articles for medical journals: amplifications and explanations. In: Bailar JC, Mosteller F, eds. Medical uses of statistics, 2nd ed. Boston: NEJM Books, 1992:313-31.
33. Gardner MJ, Altman DG. Estimation rather than hypothesis testing: confidence intervals rather than P values. In: Gardner MJ, Altman DG, eds. Statistics with confidence: confidence intervals and statistical guidelines. London: British Medical Journal, 1989:6-19.
34. Feinstein AR. Clinical epidemiology: the architecture of clinical research. Philadelphia: WB Saunders Co., 1985:812pp.
35. Murphy BA. The normal, and perils of the sylleptic argument. Perspect Biol Med 1972;15:566-82.
36. Gardner MJ, Machin D, Campbell MJ. Use of check lists in assessing the statistical content of medical studies. In: Gardner MJ, Altman DG, eds. Statistics with confidence: confidence intervals and statistical guidelines. London: British Medical Journal, 1989:101-8.

Additional Reading
1. Bland M. An introduction to medical statistics. Oxford: Oxford Medical Publications, 1987:365pp.
2. Conover WJ. Practical nonparametric statistics, 2nd ed. New York: John Wiley & Sons, 1980:493pp.
3. Fleiss JL. Statistical methods for rates and proportions, 2nd ed. New York: John Wiley & Sons, 1981:321pp.
4. Sprent P. Applied nonparametric statistical methods. London: Chapman and Hall, 1989:259pp.