
Measurement as Inference: Fundamental Ideas

W. Tyler Estler (2)
Precision Engineering Division

National Institute of Standards and Technology
Gaithersburg, MD 20899 USA

Abstract:

We review the logical basis of inference as distinct from deduction, and show that measurements in general, and dimensional metrology in particular, are best viewed as exercises in probable inference: reasoning from incomplete information. The result of a measurement is a probability distribution that provides an unambiguous encoding of one's state of knowledge about the measured quantity. Such states of knowledge provide the basis for rational decisions in the face of uncertainty. We show how simple requirements for rationality, consistency, and accord with common sense lead to a set of unique rules for combining probabilities and thus to an algebra of inference. Methods of assigning probabilities and application to measurement, calibration, and industrial inspection are discussed.

Keywords: dimensional metrology, measurement uncertainty, information

1. Introduction

The growing acceptance and use of the ISO Guide to the Expression of Uncertainty in Measurement (GUM) [10] has stimulated renewed thinking about errors, tolerances, statistics, and the concepts of randomness and determinism as they relate to manufacturing engineering and metrology. While we fully subscribe to the notion of determinism as articulated by J. B. Bryan [3] and R. R. Donaldson [6], the knowledge that a machine moves in perfect accord with natural law provides only small comfort when we must assign an uncertainty to measurements of its positioning errors. We emphasize here the conceptual distinction between a state of nature (for example, the geometry of a highly repeatable machine tool) and the uncertainty of a process designed to measure that state (linear positioning error, for example, measured with a displacement interferometer).

Traditionally, there has been little in the education of a typical engineer or physicist that provides a fundamental viewpoint or logical basis for dealing with measurement uncertainty, in the way that the laws of Newton and Hooke provide a foundation for major portions of engineering science. While computing the mean and variance of a set of repeated measurements seems like a reasonable thing to do, many statistical tests seem ad hoc and poorly motivated, and they provide no guidance in situations where repeatability is not an issue or where no population of parts exists.

It is a pleasure to discover that there exists a unique mathematical system for plausible reasoning in the presence of uncertainty that satisfies very elementary and non-controversial requirements for consistency and rational agreement with common sense. In this paper we present a brief outline of the fundamental ideas of this system, called simply probability theory, with emphasis on its applications to engineering metrology. The development of probability theory as logic had its origins in the work of P. S. Laplace, who remarked that 'probability theory is nothing but common sense reduced to calculation.' The modern development owes much to the work of H. Jeffreys [16], G. Polya [26], R. T. Cox [4-5], and E. T. Jaynes [12-15]. Detailed applications to problems of data analysis and measurement uncertainty from a modern point of view are given by D. S. Sivia [30] and K. Weise and W. Wöger [33]. The latter paper is an excellent introduction to the approach to uncertainty advocated by the GUM.

2. Deduction and Plausible Inference

2.1 Deductive logic

Classical deductive logic deals with propositions (written simply A, B, C, ...) that are either true or false. Typical propositions are declarative statements such as:

A ≡ 'There is life on Mars.'
B ≡ 'The error in the length of the workpiece is less than 5 µm.'
C ≡ 'The cost of the workpiece is less than $10.'

Propositions are combined and manipulated using a set of three basic operations defined as follows:

Negation: ~A ≡ 'A is false'

Logical product: AB ≡ 'A and B are both true'

Logical sum: A + B ≡ 'at least one of the propositions (A, B) is true'

Relations among propositions form the subject of Boolean algebra, which relates logical combinations of propositions that have the same truth value.

A typical Boolean expression is:

~(A + B) = (~A)(~B). (1)

Here, the left-hand side says 'It is not true that at least one of the propositions (A, B) is true', while the right-hand side says 'A and B are both false.' Clearly these verbal expressions have the same logical status and semantic meaning, a feature of any valid Boolean expression. Because of logical relations such as (1), only two of the three basic operations are independent, a fact that will simplify the development of the rules of probability theory.

Deductive logic is a two-valued logic (true/false, up/down, zero/one, etc.) and together with the Boolean formalism provides the binary mathematical basis of computer science. Those familiar with the operation of logic gates will recognize the logical sum, for example, as defining the action of an 'inclusive OR' binary gate.
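Relation (1) is just De Morgan's law, and it can be checked mechanically by enumerating truth values. The following minimal sketch (the Boolean variables are abstract stand-ins for propositions, not part of the original text) confirms that both sides agree for every assignment:

```python
from itertools import product

# Check Eq. (1): the negation of the logical sum equals the logical
# product of the negations, for every truth assignment of (A, B).
for A, B in product([True, False], repeat=2):
    lhs = not (A or B)           # ~(A + B)
    rhs = (not A) and (not B)    # (~A)(~B)
    assert lhs == rhs
print("Eq. (1) holds for all four truth assignments.")
```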

A basic construction in classical logic is the implication, written 'A implies B', which means that if A is true, then B is also, necessarily, true. The connection is logical rather than (necessarily) causal; for example, the proposition A ≡ 'there is life on Mars' would logically imply B1 ≡ 'there is liquid water on Mars', B2 ≡ 'there is oxygen on Mars', and so on. [In anticipation of objections on semantic grounds we point out that we are using the term 'life' in the sense of life forms similar to those that exist on the Earth.]

Deductive logic then proceeds from the implication in two complementary ways, according to the following syllogisms:

'If A implies B and A is true, then B is true.'

and

'If A implies B and B is false, then A is false.'

These are very simple logical structures with common sense meanings. If it could be proven beyond doubt, for example, that Mars was devoid of water, then we could conclude that no (Earth-like) Martian life could exist.

2.2 Plausible inference and probability

Now suppose that A implies B for some relevant pair of propositions, and in the course of contemplating A we happen to learn that B is true. What does this tell us about A? This question is quite different from those in deductive logic and belongs to the field of plausible inference that was richly explored by Polya [26]. Here, knowledge that B is true supplies evidence for the truth of A, but certainly not deductive proof. We may feel intuitively that A is more likely to be true upon learning that one of its consequences is true, but how much more likely?

It is easy to see that the change in our strength of belief in proposition A will depend on the nature of the information supplied by consequence B. Consider the proposition A ≡ 'the length error of the workpiece is less than 5 µm', and suppose that we learn, based on a preliminary measurement, that B1 ≡ 'the length error of the workpiece is less than 100 µm' is true. Such information would certainly make A seem more likely to be true, but it would be much more significant to learn from a more recent measurement that B2 ≡ 'the length error of the workpiece is less than 7 µm' is true. In this way we can qualitatively order degrees of plausibility in the sense of: 'A is more likely to be true, given B1' and 'A is much more likely to be true, given B2'. In neither case does A become certain, but this qualitative ordering is something we do naturally as a matter of common sense reasoning.

What we need now is a way to extend deductive logic into this region of inference between certainty and impossibility. Such an extended logic should provide a general quantitative system for reasoning in the face of uncertainty or when supplied with incomplete information. In the development of such a quantitative system of inductive logic or plausible reasoning, we need a numerical measure of credibility or degree of reasonable and consistent belief that will serve to describe our state of knowledge about propositions that are neither certain nor impossible. Following the modern interpretation as expressed, for example, in the GUM, we call this measure the probability, and write:

p(A | I0) ≡ the probability that A is true, given that I0 is true.

Here, I0 stands for the reasoning environment: the set of all relevant background information that conditions our knowledge of A. We will carry I0 along explicitly in order to emphasize that all probabilities are conditional on some set of propositions known (or assumed) to be true. There is a natural intuitive basis for defining probability in this manner. The degree of partial belief in an uncertain proposition will always depend not only on the proposition itself, but also on whatever information we possess that is relevant to the matter. For this reason, there is no such thing as an unconditional probability. The probability we assign to the chance of rain tomorrow depends, for example, upon whether we have heard a weather forecast, or whether it is presently raining, or whether storm clouds are gathering, and so on.

In Polya's studies of plausible inference he reasoned, and common sense would agree, that if A implies B, then necessarily p(A | B I0) ≥ p(A | I0), since the probability that A is true, if it changes at all, can only be increased by learning that one of its consequences is true. In our example above concerning the length error of a workpiece, the probabilities would be ordered according to p(A | B2 I0) ≥ p(A | B1 I0) ≥ p(A | I0). Here we are introducing the customary and colloquial association of stronger belief with greater probability. While such a transitive ordering indicates the direction in which a probability might change in light of new evidence, it provides no way to calculate the amount of such a change, and Polya's work stopped short of providing a quantitative formulation. For this we turn to the work of R. T. Cox [4-5].

3. The Rules of Probability Theory

The following is a brief sketch of the logic leading to the unique rules for manipulating probabilities. For a more complete tutorial introduction we suggest the excellent synopsis of Smith and Erickson [31]. Following Jaynes [12], we list three desired properties (desiderata) that ought to be satisfied by a quantitative system of inference. These are not strict mathematical requirements or constraints, but any system lacking all of these properties would be of little or no value for reasoning from incomplete information.

Desideratum I. Probabilities should be represented by real numbers. This is a simple desire for mathematical simplicity.

Desideratum II. Probabilities should display qualitative agreement with rationality and common sense. This means, for example, that as evidence for the truth of a proposition accumulates, the number representing its probability should increase continuously and monotonically, and the probability of its negation should decrease continuously and monotonically. It also means that the system of reasoning should contain the deductive limits of certainty or impossibility as special cases when appropriate.

Desideratum III. Rules for manipulating probabilities should be consistent. For example, if we can reason our way to a conclusion in more than one way, then all ways should lead to the same result. It should not matter in what order we incorporate relevant information into our reasoning.

3.1 The two axioms of probability theory

Equipped with these quite reasonable requirements, we can proceed to derive the rules of probability theory. We first seek a way to relate the probability that a proposition is true to the probability that it is false. That is, given p(A | I0), what is p(~A | I0)? Cox reasoned that if we know enough, on information I0, to decide if A is true, then the same information should be sufficient to decide if A is false. This makes intuitive sense from the point of view of symmetry, since what we call 'A' and what we call '~A' is a matter of convention. Cox stated this as the first axiom of probability theory:

Axiom 1.

'The probability of an inference (a proposition) on given evidence (the conditioning information) determines the probability of its contradictory (its negation) on the same evidence.'

In symbolic form, this says:

p(~A | I0) = F1[p(A | I0)], (2)

where F1 is some function of a single variable.

We next seek a way to relate the probability of the logical product AB of two propositions to the probabilities of A and B separately. That is, suppose we know p(A | I0), p(B | I0), p(B | A I0), and so on, and we want to know p(AB | I0). For example, suppose that an engineer is considering the feasibility of manufacturing a metal spacer for a particular application. In order to meet its functional requirements, the spacer must have a length error of no more than 5 µm, while for economic reasons the cost of production must be held to less than $10. Now consider the two propositions:

A ≡ 'the spacer can be produced with an error of less than 5 µm.'

B ≡ 'the spacer can be produced for less than $10.'

and their logical product:

AB ≡ 'the spacer can be produced with an error of less than 5 µm, for less than $10.'

In considering whether or not to proceed, the engineer might first decide whether he has the process capability to machine a spacer with an error of less than 5 µm [p(A | I0)], and then, assuming that this is possible, decide whether the cost of production can be held to less than $10 [p(B | A I0)]. Alternatively, the engineer might first address the cost issue and assign p(B | I0), and then, on the assumption that the cost target can be met, decide whether the length error can be held to less than 5 µm [p(A | B I0)]. Either of these approaches seems reasonable, and either should provide enough information to determine p(AB | I0). Common sense reasoning along these lines led Cox to the second axiom of probability theory:

Axiom 2.

'The probability on given evidence that both of two inferences (propositions) are true is determined by their separate probabilities, one on the given evidence, the other on this evidence with the additional assumption that the first inference (proposition) is true.'

As a mathematical assertion, this becomes:

p(AB | I0) = F2[p(A | I0), p(B | A I0)], (3)

where F2 is some function of the two variables. Of course, AB and BA are logically equivalent, so by Desideratum III we could interchange A and B in (3). Any assumed functional relation that differs from (3) can be shown to run afoul of our common sense requirements; Tribus [32] gives an exhaustive demonstration.

At this point the reader is encouraged to ponder the logical content of Cox's two axioms and to see how they agree with the intuitive process of everyday plausible reasoning. The writer knows of no case where these axioms have been shown to disagree with common sense, while the demonstrations of Tribus have shown that they are unique in this property. This is very important because once these two assertions are accepted as the axiomatic basis for probability theory, the formal rules of calculation follow by deductive logic in the form of mathematical theorems.

3.2 The sum and product rules

Equations (2) and (3) are not very informative as they stand. Some obvious constraints on the unknown functions F1 and F2 follow from Boolean algebra. Since AB = BA, for example, we must have

F2[p(A | I0), p(B | A I0)] = F2[p(B | I0), p(A | B I0)]. (4)

Also, since ~~A = A, the function F1 must be such that

F1[F1(x)] = x, (5)

where x is an arbitrary probability. Neither of these constraints provides a sufficient restriction to determine the forms of the functions.

Using a different set of Boolean relations and the requirement of consistency, R. T. Cox demonstrated that the axiomatic relations (2) and (3) can be reduced to a pair of functional equations whose solutions he proceeded to find. Details of the proofs may be found in references [4,5,12,31].

In the case of Axiom 2, the result is called the product rule:

p(AB | I0) = p(A | I0) p(B | A I0). (6)

This is one of the two fundamental rules of probability theory. One of its immediate consequences is that certainty is represented by a probability equal to one. To see this, suppose that A implies B, so that B is certain given A. Then logically AB = A, and from (6):

p(AB | I0) = p(A | I0) = p(A | I0) p(B | A I0), so that if p(A | I0) ≠ 0, then p(B | A I0) = 1 for B certainly true.

In the case of Axiom 1, solution of a second functional equation yields the sum rule:

p(A | I0) + p(~A | I0) = 1. (7)

This is the second fundamental rule of probability theory. An immediate consequence of the sum rule is that impossibility is represented by a probability equal to zero. For if A is certainly true, then ~A is false, so that p(A | I0) = 1 and from (7) we must have p(~A | I0) = 0. The sum rule expresses a primitive form of normalization for probabilities.

We noted previously that only two of the three basic Boolean operations (logical product, logical sum, and negation) are independent. It follows that the sum and product rules, together with Boolean operations among propositions, are sufficient to derive the probability of any proposition, such as the generalized sum rule:

p(A + B | I0) = p(A | I0) + p(B | I0) − p(AB | I0). (8)

Note here that the plus sign (+) takes on different meanings depending on context, being a logical operator when it relates propositions and representing ordinary addition when applied to numbers such as probabilities. The context will make clear the meaning; the alternative is to introduce new mathematical notation which may have a strange look while adding little clarity.
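As a small illustration of (8), suppose probabilities are assigned to the four mutually exclusive conjunctions AB, A(~B), (~A)B, and (~A)(~B); the numbers below are invented purely for the sketch. The generalized sum rule then reproduces p(A + B | I0) obtained directly as one minus the probability that both propositions are false:

```python
# Hypothetical probabilities for the four mutually exclusive conjunctions
# of A and B (illustrative numbers only; they sum to 1).
p_AB, p_AnB, p_nAB, p_nAnB = 0.10, 0.25, 0.30, 0.35

p_A = p_AB + p_AnB
p_B = p_AB + p_nAB
p_A_or_B = p_A + p_B - p_AB            # generalized sum rule, Eq. (8)

# Direct route: A + B is false only when A and B are both false.
assert abs(p_A_or_B - (1.0 - p_nAnB)) < 1e-12
print(f"p(A+B | I0) = {p_A_or_B:.2f}")
```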

At this point we collect the results of the last few paragraphs and present a summary of the unique rules for manipulating probabilities. These two simple operations form the basis for the system of reasoning called by Cox the algebra of probable inference:

Product Rule:

p(AB | I0) = p(A | I0) p(B | A I0) (9a)
           = p(B | I0) p(A | B I0) (9b)

Sum Rule:

p(A | I0) + p(~A | I0) = 1 (10)

Deductive Limits:

A is true ⇒ p(A | I0) = 1; A is false ⇒ p(A | I0) = 0 (11)

These results may look quite familiar, since they are the common rules that are derived in conventional treatments of probability and statistics, where probability is defined as the frequency of successful outcomes in a series of repeated trials. In fact, there are several distinct axiom systems for probability theory, beginning with the work of A. N. Kolmogorov [19], that lead to the same formal rules for calculation (for a discussion, see D. V. Lindley [21]). We have chosen to follow the approach of Cox because of its intuitive appeal and close connection with the process of human reasoning. The logical flow from first principles has proceeded according to:

Desiderata → Cox's two axioms → sum and product rules

The result is a general and unique system of extended logic, an algebra of inference, that is applicable to any situation where limited information precludes deductive reasoning. The uniqueness should be emphasized, because any system of reasoning in which probabilities are represented by real numbers and which disagrees with the sum and product rules will necessarily violate the very elementary, common sense requirements for rationality and consistency.

3.3 Common sense reduced to calculation

A nice demonstration of the way in which the sum and product rules accord with common sense and reproduce the way we reason intuitively follows from the work of A. J. M. Garrett and D. J. Fisher [9]. Suppose that we have an hypothesis H, with an initial probability p(H | I0) conditioned on I0, and we then obtain new information in the form of data D. Equating the two equivalent forms of the product rule, (9a-b), using propositions H and D gives

p(H | D I0) = K p(H | I0) p(D | H I0), (12)

where K = 1/p(D | I0). Repeating this operation with H replaced with ~H and dividing (12) by the resulting expression yields:

p(H | D I0) / p(~H | D I0) = [p(H | I0) / p(~H | I0)] × [p(D | H I0) / p(D | ~H I0)]. (13)

Now, p(~H | I0) = 1 − p(H | I0) and p(~H | D I0) = 1 − p(H | D I0) from the sum rule, so that replacing p(~H | I0) and p(~H | D I0) in (13) and rearranging gives:

p(H | D I0) = {1 + [1/p(H | I0) − 1] [p(D | ~H I0) / p(D | H I0)]}⁻¹. (14)

This is a very general result that shows how the prior (pre-data) probability p(H | I0) changes, as a result of obtaining data D, to yield the posterior (post-data) probability p(H | D I0). This is just the process of learning, whereby a state of knowledge gets updated in light of new information.

Let us explore the special cases of (14) with a particular example. Suppose that a doctor must decide a course of treatment for a patient whose symptoms and medical history suggest a working hypothesis: H ≡ 'my patient has disease X.' A blood test for disease X is then performed, with result D ≡ 'the patient has tested positive for disease X.' Before performing the test, the doctor's examination of the patient leads him to assign an initial probability p(H | I0) to his working hypothesis. Here, the conditioning information I0 includes everything relevant to the doctor's diagnosis, including his training and experience as well as the symptoms and medical history of the patient. What is the effect of obtaining the positive result of the blood test? Consider the following special cases:

1. If p(H | I0) = 1, then p(H | D I0) = 1. If the doctor is certain that the patient has disease X before the blood test, then the positive outcome could be anticipated a priori and would add no useful information. In such a case, the test itself would be unnecessary.

2. If p(H | I0) = 0, then p(H | D I0) = 0. If the doctor is certain that the patient does not have disease X before the test, then the data will have no effect on his state of belief. A positive result would most likely be dismissed as a 'false positive.' Two remarks seem relevant here. First, given that X is deemed impossible to begin with, one wonders why a blood test to detect it would be performed. We can also see the danger posed by a dogmatic refusal to allow one's beliefs to be changed by what might be highly relevant new information.

3. If p(D | H I0) = 0, then p(H | D I0) = 0. If it were impossible for a person with disease X to have a positive response to the blood test, then since the patient did test positive, he could not possibly have disease X.

4. If p(D | H I0) = p(D | ~H I0), then p(H | D I0) = p(H | I0). If data D (here a positive blood test) is equally likely whether H is true or not, then D is irrelevant for reasoning about H. The doctor would learn nothing, for example, by flipping a coin.

5. If H implies D, so that p(D | H I0) = 1, then

p(H | D I0) = p(H | I0) / {p(H | I0) + [1 − p(H | I0)] p(D | ~H I0)}. (15)

If a positive response always results when disease X is present, then the post-test probability p(H | D I0), given the positive response, lies in the range p(H | I0) ≤ p(H | D I0) ≤ 1 and depends strongly on p(D | ~H I0), the probability of a 'false positive.' For a perfect test, a false positive would be impossible [p(D | ~H I0) = 0] and a positive result would make H certain to be true. On the other hand, if p(D | ~H I0) ≈ 1, so that any test would be likely to yield a positive response, then p(H | D I0) ≈ p(H | I0), and one learns almost nothing.

Expression (15) provides the quantitative generalization to the work of Polya to which we referred at the end of Section 2.2. In the case where H implies D, we see that the effect of learning that D is true depends, for a given state of prior knowledge, on the probability that D is true if H is assumed to be false.

Also note the very important role played by the prior probability p(H | I0). If the doctor assigns p(H | I0) > 0.9 following the initial examination, then immediate treatment for X would be indicated, with no need for a blood test. On the other hand, if p(H | I0) ≈ 0.2, the doctor might feel hesitant about beginning a treatment. In this case, a positive blood test with p(D | ~H I0) = 0.05 (a 5% chance of a false positive) would yield a post-test probability of p(H | D I0) ≈ 0.83, and the doctor would feel comfortable in treating the patient for disease X.
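The doctor's numbers are easy to reproduce from (14). The short sketch below assumes, as the use of (15) implies, a test that never misses the disease, p(D | H I0) = 1; the prior of 0.2 and the 5% false-positive rate are the values quoted above.

```python
def posterior(prior, p_D_given_H, p_D_given_notH):
    """Post-data probability p(H | D I0) from Eq. (14)."""
    return 1.0 / (1.0 + (1.0 / prior - 1.0) * (p_D_given_notH / p_D_given_H))

# Prior belief 0.2, perfect sensitivity, 5 % chance of a false positive.
print(round(posterior(0.2, 1.0, 0.05), 3))   # -> 0.833, i.e. about 0.83
```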

3.4 Mutually exclusive and exhaustive propositions

A very common situation arises when we have a set of N propositions (B1, B2, ... BN), one and only one of which can possibly be true, conditioned on information I0. Such propositions are said to be mutually exclusive given I0, a condition that is written using the product rule:

p(Bi Bj | I0) = p(Bi | I0) p(Bj | Bi I0) = 0, for i ≠ j. (16)

It follows from (16) and repeated use of the generalized sum rule (8) that the probability that one of the propositions is true is given by

p(B1 + B2 + … + BN | I0) = Σ_{k=1}^{N} p(Bk | I0). (17)

If it is further known from prior information I0 that one and only one of the propositions is certainly true, then the propositions are also exhaustive, so that the sum in (17) must be equal to one:

Σ_{k=1}^{N} p(Bk | I0) = 1. (18)

This is the general statement of normalization for a finite set of N mutually exclusive and exhaustive propositions, a property that occurs frequently in probability theory.

3.5 Marginal probabilities

Another very common and useful operation involving mutually exclusive and exhaustive sets of propositions is called marginalization, which we will illustrate by the following example.

Suppose that a manufacturer produces a large batch of metal spacers, dividing the task among N diamond turning machines. The machines have been individually adjusted, error-mapped, and characterized for machining accuracy, so that the probability that machine k produces good spacers may be assumed to be p(G | Mk I0), where G ≡ 'the spacer is good (within tolerance)', and Mk ≡ 'the spacer was produced by machine k.' Because of machine and operator variations, the spacer production rate varies from machine to machine. By the end of a shift, machine Mk has produced nk spacers, so that the N machines together produce a total of n1 + n2 + … + nN spacers, which are then mixed together and sent to inspection. If an inspector now arbitrarily selects one of these spacers, what can he say about the probability that it is in tolerance, before actually performing a measurement?

We can answer this question as follows. The joint probability that the spacer is in tolerance and that it was produced by machine k is p(G Mk | I0). From the product rule we then have

p(G Mk | I0) = p(G | I0) p(Mk | G I0)
             = p(Mk | I0) p(G | Mk I0). (19)

Equating these expressions and summing over the N machines gives

p(G | I0) Σ_{k=1}^{N} p(Mk | G I0) = Σ_{k=1}^{N} p(G | Mk I0) p(Mk | I0). (20)

Now observe that the propositions Mk form a mutually exclusive and exhaustive set, so that

Σ_{k=1}^{N} p(Mk | G I0) = 1. (21)

The inclusion of the proposition G as a part of the conditioning information does not alter the normalization constraint, since the condition of the spacer does not change the fact that it was produced by only one of the N machines. The probability that the spacer is good is thus:

p(G | I0) = Σ_{k=1}^{N} p(G | Mk I0) p(Mk | I0). (22)

The left-hand side of (22) is called the marginal probability of G, and we can see that it is a weighted sum over the probabilities p(G | Mk I0) for the individual machines to produce good spacers, with each term weighted by the probability p(Mk | I0) that the particular spacer chosen was produced by machine k. The latter may be easily shown (and is probably intuitively obvious to the reader) to be equal to nk / (n1 + n2 + … + nN), the fraction of the total number of spacers produced by machine k.
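Equation (22) amounts to a probability-weighted average over machines. The sketch below works through a hypothetical shift with three machines; the production counts and per-machine probabilities are invented for illustration only.

```python
# Hypothetical shift data: spacers produced by each machine and each
# machine's probability of producing an in-tolerance ("good") spacer.
n = [400, 250, 350]                  # n_k, spacers from machine k
p_G_given_M = [0.98, 0.94, 0.90]     # p(G | M_k I0)

total = sum(n)
p_M = [nk / total for nk in n]       # p(M_k | I0) = n_k / (n_1 + ... + n_N)

# Marginal probability that an arbitrarily selected spacer is good, Eq. (22)
p_G = sum(pg * pm for pg, pm in zip(p_G_given_M, p_M))
print(f"p(G | I0) = {p_G:.3f}")      # ~0.942 for these numbers
```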

In a problem like this the proposition Mk is called a nuisance parameter, which means a quantity that affects the inference and occurs in the analysis but is of no particular interest in itself. Another example is the error of a measuring instrument that affects the estimate of a measured quantity but is itself unknown. Marginalization is the way to account for the effects of nuisance parameters by effectively averaging over all possible values.

4. Uncertainty and random variables

4.1 The meaning of a random variable

Since no measurement is perfect, no statement of an exact value for a measured quantity is logically certain to be true. Therefore our belief in a proposition such as: y ≡ 'the length of the spacer lies between y and y + Δy' is necessarily uncertain no matter how well we perform a length measurement. Consistency then requires that we communicate the result of a measurement in the language of probability theory, using the unique rules of the algebra of probable inference. In order to do this, we need a mathematical representation for a state of knowledge about a measurand (such as the length of a spacer) corresponding to all available information after performing a measurement.

In the view of measurement as inference, all physical quantities (except, of course, for defined constants such as the speed of light in vacuum) are treated as random variables. This may seem counter to the spirit of deterministic metrology, because the words 'random' and 'variable' suggest an uncontrolled environment and noisy instruments, where meaningful data can only be obtained by repeated sampling and statistical analysis. The word 'variable', in particular, seems singularly inappropriate to describe the result of a dimensional measurement. At the time of its measurement, for example, the length of a metal spacer is not a variable at all but rather an unknown constant whose value we are trying to estimate on the basis of given (but incomplete) information.

The issue here turns out to be purely one of semantics. In probability theory, a random variable is defined as 'a variable that may take any of the values of a specified set of values and with which is associated a probability distribution' (GUM C.2.2). In discussing a quantity such as length, it is important to distinguish between (a) length as a concept (specified by a description, or definition), (b) the length Y of a particular spacer (a random variable), and (c) the set of values that could reasonably be attributed to Y, consistent with whatever information is available. The result of a measurement is only one of an infinite number of such values that could, with varying degrees of credibility, be so attributed. Similarly, a handbook value for a parameter such as a thermal expansion coefficient is only one of its possible values, given a state of incomplete information. Probability theory, as applied to the measurement process, is concerned with these possible values, or outcomes, and their associated probability distributions.

4.2 Continuous probability distributions

A state of knowledge about (or degree of belief in) the value of a quantity, such as the length of a metal spacer, can be represented by a smooth continuous function whose qualitative features can be derived using the sum and product rules as follows. Denote the length of a spacer by Y, let y be some particular value, and consider the probability

p(Y ≤ y | I0) ≡ F(y), 0 ≤ F(y) ≤ 1. (23)

Here F(y) is evidently a monotonic non-decreasing function of y called a cumulative distribution function (CDF). Since the length of any real spacer will certainly be greater than some very small value of y and less than a very large value, the qualitative behavior of F(y) will look similar to the curve shown in Fig. 1.

Figure 1. The probability p(Y ≤ y | I0) that the length Y of a spacer is less than or equal to a given length y, where y denotes position along a length axis.

Now suppose we are interested in the probability that Y lies in the interval a < Y ≤ b. Define the propositions:

A ≡ 'Y ≤ a'
B ≡ 'Y ≤ b'
C ≡ 'a < Y ≤ b'.

These propositions satisfy the Boolean relation (logical sum) B = A + C, and since A and C are mutually exclusive:

p(B | I0) = p(A + C | I0)
          = p(A | I0) + p(C | I0),

we have:

p(C | I0) = p(B | I0) − p(A | I0)
          = F(b) − F(a)
          = ∫_a^b f(y) dy,

where f(y) ≡ dF(y)/dy is called the probability density function (pdf) for the possible values of Y. The qualitative behavior of the pdf for the CDF of Fig. 1 is displayed in Fig. 2.

The pdf f(y) = dF/dy is typically a continuous, single-peaked (called unimodal) symmetric function of location y. In order to avoid the proliferation of mathematical symbols, we will use the notation p(y | I0) ≡ f(y), so that the probability of the proposition y ≡ 'the length of the spacer lies in the interval (y, y + dy)' will be written simply p(y | I0) dy. The identification of p(y | I0) with a probability density rather than a simple probability should be clear from the context. Also, a density function may sometimes be called a 'distribution' in accord with common parlance, and for brevity, the same symbol may be used for a quantity and its possible values.

The best estimate of the length of the spacer is, by definition, the expectation (also called the expected value or mean) of the distribution, given by:

E(Y) ≡ y0 = ∫_{−∞}^{+∞} y p(y | I0) dy. (24)

Figure 2. The probability density function (pdf) f(y) corresponding to the cumulative distribution function of Fig. 1. For this function, the best estimate (or expectation) of Y, denoted y0, corresponds to the peak in the pdf.

For a symmetric single-peaked pdf such as the one shown in Fig. 2, y0 is also the value for which p(y | I0) is a maximum, called the mode of the pdf. A useful parameter that characterizes the dispersion of plausible or reasonable values of Y about the best estimate y0 is given by the positive square root of the variance σy², where

σy² = E[(Y − y0)²]
    = ∫_{−∞}^{+∞} (y − y0)² p(y | I0) dy
    = E(Y²) − y0². (25)
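Equations (24) and (25) can be evaluated numerically for any assumed state of knowledge. The sketch below does so for a hypothetical symmetric triangular pdf of half-width a = 3 µm; the particular density and numbers are assumptions made only for illustration.

```python
import numpy as np

# An assumed pdf p(y|I0): symmetric triangular, half-width a = 3 um,
# centred on 25 000 um.  Any normalized density could be used instead.
a, centre = 3.0, 25000.0                        # micrometres
y = np.linspace(centre - a, centre + a, 10001)
pdf = (a - np.abs(y - centre)) / a**2           # integrates to 1

y0 = np.trapz(y * pdf, y)                       # best estimate, Eq. (24)
u = np.sqrt(np.trapz((y - y0)**2 * pdf, y))     # standard uncertainty, Eq. (25)
print(y0, u)                                    # ~25 000 um and a/sqrt(6) ~ 1.22 um
```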

The quantity σy is called the standard deviation of the pdf p(y | I0). The GUM defines an estimated standard deviation to be the standard uncertainty associated with an estimate y0, using the notation u(y0) = σy. The uncertainty characterizes a state of knowledge and is not a physical attribute of the spacer or something that could be measured in a metrology laboratory. For this reason it makes no sense to argue about the 'true' value of the uncertainty. An expression of uncertainty is always correct when properly based on all relevant information. If two people express different uncertainties then they must be reasoning on different states of prior information or sets of prior assumptions.

In a similar way, a probability density function models a state of knowledge, and is not something that could be measured in an experiment. The function shown in Fig. 2 is the familiar normal (or Gaussian) density defined by

p(y | I0) = (1 / √(2πσ²)) exp[−(y − y0)² / (2σ²)] ≡ N(y; y0, σ²), (26)

where for simplicity we write σ in place of σy. As we shall see in Sec. 6.3, the normal density is a consequence of a general principle for assigning probabilities, called the principle of maximum entropy, when one's knowledge consists only of an estimate y0, together with an associated standard uncertainty σ. The normal pdf plays a central role in probability theory and measurement science.

4.3 Levels of confidence and coverage factors

In the language of the GUM, we associate a level of confidence in our knowledge of a quantity with a number k called a coverage factor. For the spacer example, with estimated length y0 and associated uncertainty σ, this is interpreted to mean that the length Y may be expected to lie in the interval y0 ± kσ with an integrated, or cumulative, probability P(k). The standard deviation (or standard uncertainty) thus sets the scale of uncertainty and is often called a scale parameter. The relation between k and P depends on the assumed functional form of the pdf, and for the normal distribution we have the well-known and often-employed values of P = [68%, 95.5%, 99.7%] for k = 1, 2, and 3, respectively. Since we are reasoning about a single, particular spacer, we point out that these probabilities have no frequency interpretation. Their magnitudes become significant: (a) in the propagation of uncertainty, where the result of some other measurement depends on the spacer length, and (b) in the context of a subsequent decision where the length of the spacer is an element of risk.
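For a normal pdf the level of confidence is given by the error function, P(k) = erf(k/√2). A minimal check of the quoted values:

```python
import math

def P(k):
    """Normal-pdf level of confidence for the interval y0 +/- k*sigma."""
    return math.erf(k / math.sqrt(2.0))

for k in (1, 2, 3):
    print(k, f"{100.0 * P(k):.1f} %")   # roughly 68.3 %, 95.4 %, 99.7 %
```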

A great deal of time can be wasted in heated arguments concerning the exact form of the density p(y | I0), which describes not reality in itself but only one's knowledge about reality. It can be helpful to realize that there exists a very general and useful quantitative bounding relation on the level of confidence associated with the best estimate y0 which is independent of the detailed nature of the pdf, so long as it has finite expectation and variance and is properly normalized. The latter condition means that

∫_{−∞}^{+∞} p(y | I0) dy = 1. (27)

If y0 is the best estimate of Y, then it is straightforward to show that

p(|Y − y0| ≤ kσ | I0) ≥ 1 − 1/k², (28)

a result known as the Bienaymé-Chebyshev inequality [7, 28]. From this we see, for example, that not less than 8/9 ≈ 89% of the reasonably probable values of the length of the spacer are contained in the interval y0 ± 3σ, whatever the distribution p(y | I0). Thus we suggest that there is little to be gained in debate over the exact form of the pdf. If the uncertainty σ is too large to permit a confident decision, then the proper course of action is usually to reduce uncertainty and sharpen the distribution p(y | I0) by performing an appropriate measurement.
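A quick comparison of the distribution-free bound (28) with the exact coverage of a normal pdf shows how conservative the bound is, and why arguing over the precise shape of the pdf rarely pays:

```python
import math

# Distribution-free lower bound of Eq. (28) versus the exact coverage of a
# normal pdf, for the interval y0 +/- k*sigma.
for k in (2, 3, 4):
    chebyshev = 1.0 - 1.0 / k**2
    normal = math.erf(k / math.sqrt(2.0))
    print(f"k = {k}: bound >= {chebyshev:.3f}, normal pdf gives {normal:.4f}")
```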

[NOTE: In writing expressions such as (24) and (27), we use the formal limits of (−∞, +∞) and recognize that since physical lengths are positive, we must strictly require that p(y | I0) = 0 for y ≤ 0. In practice it is common to represent states of knowledge by pdfs, such as the normal distribution, that are non-zero over an infinite range. The mathematical convenience afforded by these analytic functions more than compensates for the infinitesimally small, non-zero probabilities for impossible values of physical quantities.]

5. Measurement as inference: Bayes' Theorem

Now suppose that we have a proposition H in the form of an hypothesis, and that we subsequently obtain some relevant data D. As usual we denote our prior information by I0. Writing the two equivalent forms of the product rule (9a-b):

p(HD | I0) = p(H | I0) p(D | H I0) = p(D | I0) p(H | D I0),

and rearranging, yields Bayes' Theorem:

p(H | D I0) = p(H | I0) p(D | H I0) / p(D | I0), (29)

which is the starting point for the system of reasoning known as Bayesian inference. From its very trivial derivation we see that Bayes' theorem is not a profound piece of mathematics, being no more than a restatement of the consistency requirement of probability theory. Nevertheless, Bayes' theorem gives the general procedure for updating a probability in light of new, relevant information, and is a modified form of (14) in which only the hypothesis H appears, and not its negation.

Before we obtain data D, the degree of belief in hypothesis H, conditioned on information I0, is represented by the prior probability p(H | I0). When we learn of the data D, the prior probability is multiplied by the ratio on the right side of (29) to yield the posterior probability p(H | D I0). The quantity p(D | H I0) is called the likelihood of H given the data D, and is viewed as the probability of obtaining the data if the hypothesis is assumed to be true. The denominator p(D | I0) has no special name, although it is sometimes called the global likelihood. It is equal to the probability of obtaining the data whether H is true or not, and can be written as a marginal probability using the sum rule:

p(D | I0) = p(D | H I0) p(H | I0) + p(D | ~H I0) p(~H | I0). (30)

Since p(D | I0) is a constant, independent of H, Bayes' theorem is commonly written in the form

p(H | D I0) = K p(H | I0) p(D | H I0), (31)

with K equal to a normalization constant. In a typical measurement problem, H stands for a proposition concerning a dimension of interest and D represents the measurement data. The likelihood is then equal to the probability of obtaining the data D as a function of an assumed dimension specified in H. The way in which the result of the measurement affects our degree of belief in H is completely contained in the likelihood function.

To illustrate how Bayes' theorem is used in dimensional metrology, let us consider a very simple one-dimensional example in which a linear indicator is used to measure the length of a metal spacer. Assume that we have just manufactured such a spacer and that we need to measure its length in order to make a decision as to whether or not it is acceptable. Before performing the measurement, our knowledge of the length of the spacer is described by a prior pdf p(y | I0), where as before p(y | I0) dy is the probability that the length of the spacer lies in the interval (y, y + dy). The width of the prior pdf, as characterized by its variance σp², is a measure of our uncertainty in the length of the spacer, with the best estimate of the length, yp, corresponding to the expectation of the distribution. Usually we would have only limited information about the spacer, conditioned primarily by our understanding and experience with the production process, with such vague knowledge reflected in a broad prior distribution. This is not a weakness of the approach but rather its motivation: the whole purpose of performing the measurement is to sharpen this broad distribution, refine our knowledge, and reduce our uncertainty with respect to the length of the spacer.

We now measure the length of the spacer as illustrated in Fig. 3. Using a linear indicator we take a pair of readings before and after insertion of the spacer as shown.

Figure 3. The length of a metal spacer is measured using a linear indicator. The result of the measurement is the estimate ym.

The difference in the two indicator readings is the result of the measurement ym. The probability that a spacer of actual length y would yield measurement data ym is just the likelihood function p(ym | y I0), whose width, as characterized by its variance σm², is a measure of the quality of the measurement process (here, the linear indicator). This is where experimental design enters the picture, because we want the likelihood to be sharply peaked about the actual length of the spacer. We then use Bayes' theorem to find the updated (posterior) probability distribution that describes our knowledge of the length of the spacer after performing the measurement:

p(y | ym I0) = K p(y | I0) p(ym | y I0), (32)

where K⁻¹ = ∫_{−∞}^{+∞} p(ym | y I0) p(y | I0) dy.

This process is illustrated in Fig. 4, where we sketch the qualitative forms of the relevant distributions. When the likelihood is sharply peaked relative to the prior (pre-data) distribution, the posterior (post-data) distribution will be dominated by the peak in the likelihood, so that the exact form of the prior distribution becomes irrelevant. This is almost always the case for common engineering measurements, where the measurement process is arranged so that σm² << σp² (sharply peaked likelihood). Under these conditions, the prior distribution will be nearly constant in the region where the likelihood is appreciable, and essentially all knowledge of the measurand (here, the length of the spacer) derives from the measurement data. For such a locally uniform prior probability, Bayes' theorem thus reduces to the approach known as maximum likelihood, so-called because the best post-data estimate of the value of the measurand coincides with the peak in the likelihood function.

Figure 4. In a typical engineering measurement such as measuring the length of a metal spacer, the (post-data) posterior distribution is dominated by a sharply peaked likelihood function. The best estimate of the spacer length, ym, then very nearly coincides with the peak in the likelihood, and the prior (pre-data) distribution becomes irrelevant. The curves are not to scale.
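The update in (32) is straightforward to carry out numerically. The sketch below assumes, purely for illustration, normal forms for both the prior and the likelihood, with σm << σp as discussed above; the posterior estimate and uncertainty collapse onto those of the measurement, showing that the broad prior is indeed irrelevant.

```python
import numpy as np

# Hypothetical numbers: broad prior from process knowledge, sharply
# peaked likelihood from the indicator reading (sigma_m << sigma_p).
y_p, sigma_p = 25.000, 0.050     # prior estimate and uncertainty, mm
y_m, sigma_m = 25.012, 0.002     # measurement result and uncertainty, mm

y = np.linspace(24.8, 25.2, 40001)
prior = np.exp(-(y - y_p)**2 / (2 * sigma_p**2))
likelihood = np.exp(-(y - y_m)**2 / (2 * sigma_m**2))

posterior = prior * likelihood            # Bayes' theorem, Eq. (32)
posterior /= np.trapz(posterior, y)       # normalization constant K

y_post = np.trapz(y * posterior, y)
u_post = np.sqrt(np.trapz((y - y_post)**2 * posterior, y))
print(y_post, u_post)                     # essentially y_m and sigma_m
```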

A common source of systematic error in such a length measurement is a possible scale error in the linear indicator. In order to correct for this error, we can perform a calibration using a gauge block (length standard) whose estimated length yg is known to within a small uncertainty σg. In the case of a calibration, the measurand is the error, and Bayes' theorem is written:

p(e | em I0) = K′ p(e | I0) p(em | e I0), (33)

where K′ is a constant, e ≡ 'the indicator systematic error lies in the range (e, e + de),' and em is the result of the measurement, given by the difference between the indicator data and the estimated length of the standard: em = ym − yg. The prior distribution p(e | I0) is typically symmetric about zero in the absence of any a priori knowledge about the sign of the systematic error. The likelihood p(em | e I0) will be sharply peaked because of the small uncertainty in the length of the standard. Again, the posterior distribution for the indicator systematic error is dominated by the peak in the likelihood and whatever is known a priori becomes irrelevant. This situation is illustrated in Fig. 5.

Measurement and calibration are thus seen to be complementary operations in Bayesian inference. The mechanics of taking the data are exactly the same in both cases but we are asking different questions. In a measurement we focus on the length of a workpiece, in a calibration on the systematic error of an indicator. The mathematics is the same, the only differences being in the identification of the measurand and the nature of the prior information. The calibration/measurement process relies on the ordering σg² << σm² << σp².

Figure 5. Calibration of a linear indicator using a gauge block. The measurand is now the systematic error of the indicator, and the sharply peaked likelihood reflects the low uncertainty in the length realized by a gauge block.

The GUM makes no reference to a prior probability distribution for a measurand (while encouraging the use of assumed a priori distributions to describe knowledge of the input quantities upon which the measurand depends). From a theoretical point of view this has to be regarded as inconsistent. Operationally, it amounts to an implicit assumption of a uniform (constant) distribution to describe prior knowledge of the measurand, with the best estimate to be supplied by the measurement data via the likelihood function.

6. The assignment of probabilities

The sum and product rules, together with Bayes' theorem, are the unique algebraic tools for working with and manipulating probabilities, but the question remains of how to assign prior probabilities in the first place in order for a calculation to get started. Since probabilities represent (or encode) states of knowledge or degrees of reasonable belief, what is needed are principles by which whatever information is available can be uniquely incorporated into a probability distribution. This problem is addressed in the GUM, for variables other than the measurand, where such distributions are called a priori probability distributions, with associated variances whose positive square roots are called Type B standard uncertainties.

There is no easy way to assign a real number to the probability of an uncertain proposition such as A ≡ 'there is life on Mars', but for the quantities of interest in engineering metrology the International System of Units (SI) provides a set of location parameters that makes such assignment possible. These parameters are the continuous variables such as position or mass, with respect to which we can order degrees of belief and over which we can sum discrete probabilities or integrate probability densities in order to effect normalization.

There are three principal theoretical approaches to the consistent assignment of prior probabilities in problems of engineering metrology. By 'consistent' we mean that two persons with the same state of knowledge should assign the same probabilities. There is really no conceptual difference between assigning a prior probability distribution for a measurand before performing a measurement, and evaluating the likelihood function for the measurement process after the data is in hand. Both operations yield probability distributions that describe degrees of belief and both require the exercise of judgment, insight, knowledge, experience, and skill. In the final analysis it should be recognized that the limiting uncertainty of a measurement cannot be gleaned from anything in the measurement data itself, nor can the error be known in the sense of a logical deduction.

6.1 The representation of ignorance

Since a probability distribution for a quantity of interestencodes what is known about the quantity, it is interestingto ask for the distribution that describes a state of completeignorance. For example, suppose that a long metal bar isengraved with a single ruled line whose position along thebar is unknown. Here our state of knowledge consistssimply of the line's existence, with no information that wouldlead us to favor any location over any other. How can werepresent this state of ignorance? We reason as follows:denote position along the bar by x, and let f x dx� � be theprobability that the line lies in the interval x x dx, � .Ignorance of location then suggests that the probabilityshould be invariant with respect to the translationx x x a� � � � , where a is an arbitrary constant. Thus thedensity f x� � should satisfy

f(x) dx = f(x') dx' ,  (34)

and since dx' = dx, we have f(x) = f(x + a), which implies that

f(x) = constant.  (35)

Thus the probability density that describes ignorance of a location parameter, such as the position of the ruled line or the magnitude of an error, is the uniform density.

Now suppose that there are two lines ruled on the metal bar, thus forming a line scale, and that we are interested in the length L between them. The probability that the length lies in the interval (L, L + dL) is written g(L) dL. Suppose that we are completely ignorant of the line spacing, in the sense that we have no definite scale for the unit of length. We can imagine drawing a graph of g(L) versus L, using some local, arbitrary unit of length. Another metrologist, perhaps using a photograph of the line scale, might draw a graph in different units, g'(L'), where L' = θL, with θ equal to an unknown scale factor. If the two states of knowledge (or more correctly, ignorance) are to be the same, then we should assign the same probability to equivalent intervals on the two graphs. That is, we should require that g(L) dL = g'(L') dL', with L' = θL, so that:

g(θL) d(θL) = g(L) dL .  (36)

Thus we require that g(θL) = (1/θ) g(L), so that the probability density g(L) is given by

g(L) ∝ 1/L .  (37)

A parameter such as the line spacing that is known a priori to be positive is called a scale parameter. Another scale parameter is the standard deviation of a probability distribution for the error of a length measurement. We have shown that the invariant density that represents ignorance of a scale parameter is the reciprocal density g(L) ∝ 1/L. This is a strange looking probability density that appears more reasonable if we write the equivalent forms

g(L) dL ∝ dL/L = d(ln L) ,  (38)

so that requiring g(θL) d(θL) = g(L) dL is equivalent to the statement that

p(ln L) = constant.  (39)

Thus ignorance of a scale parameter is represented by a uniform distribution of the logarithm of the parameter.

The results given by (35) and (39) for the prior densities representing ignorance for location and scale parameters were originally proposed by Jeffreys [16], using heuristic plausibility arguments. They were subsequently placed on a firm theoretical foundation by Jaynes [14], who invoked a 'desideratum of consistency' to express the reasonable requirement that in two problems where we have the same information, we should assign the same probabilities. In the case of complete ignorance, where the parameters have infinite range (-∞ < x < ∞ and 0 < L < ∞), the prior probability densities (35) and (39) cannot be normalized, since the corresponding integrals are undefined. Such prior distributions are called improper priors and have been the subject of much controversy and criticism, since a non-normalizable function can obviously not represent a probability density. In response, we make several observations. First, in almost any real application using Bayes' theorem, the prior distribution occurs in both the numerator and denominator, and so cancels out of the calculation. In such a case, the fact that we might be using an improper prior becomes moot. Next, in the real world of engineering metrology we are never completely ignorant in the mathematical sense. As previously argued, the length of a real workpiece, such as a metal spacer, will certainly be greater than some definite small value and less than a definite large value, so that the relevant probability density will vanish outside of such finite limits, and the normalization integral will always converge to unity. In an unusual case where the posterior distribution itself should turn out to be improper, this fact should serve as a warning that there is not enough information in the measurement data to be able to make a confident inference with respect to the measurand.

In spite of the mainly theoretical problems with improper priors, they are useful in real problems as labor-saving devices when the exact finite limits of the relevant prior densities make no resolvable difference in the calculations.
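As a small numerical aside (not part of the original argument), the Python sketch below truncates the 1/L density to arbitrary, purely illustrative limits, normalizes it, and confirms that it assigns equal probability to every decade of the scale parameter, which is just the uniform-in-log-L statement of (38)-(39).

```python
import numpy as np

# Truncated 1/L prior for a scale parameter on [L_min, L_max];
# the limits are arbitrary illustrative values.
L_min, L_max = 1e-3, 1e3

def prob(a, b):
    """Probability assigned by the normalized 1/L density to the interval [a, b]."""
    return (np.log(b) - np.log(a)) / (np.log(L_max) - np.log(L_min))

# Every decade of L receives the same probability (1/6 here): ignorance of
# scale is a uniform distribution over log L, cf. Eqs. (38)-(39).
for a in [1e-3, 1e-2, 1e-1, 1.0, 1e1, 1e2]:
    print(f"P({a:g} <= L <= {10 * a:g}) = {prob(a, 10 * a):.4f}")
```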

6.2 Symmetry and the principle of indifference

Consider a discrete collection of n propositions (A_1, ..., A_n) that form an exhaustive and mutually exclusive set on prior information I_0. Furthermore, suppose that there is nothing in information I_0 that would lead us to believe that any one of the propositions was more or less probable than any other. In such a case we must then have p(A_j | I_0) = p(A_k | I_0) for any pair of propositions (A_j, A_k). If this were not the case, then by simply permuting the numbering scheme of the propositions we could demonstrate two problems, each with the same prior information but with different probability assignments. The assignment of equal probabilities in this case is perhaps intuitively obvious given the symmetry of the situation, and employs what is often called the principle of indifference, a term introduced by J. M. Keynes [18].

Now since Σ_{k=1}^{n} p(A_k | I_0) = 1 (exhaustive constraint), and since all of the probabilities p(A_k | I_0) are equal, we have necessarily:


p(A_k | I_0) = 1/n ,  k = 1, ..., n.  (40)

The result (40) is perhaps the oldest and most familiar of all probability assignments. It will appear as a special case of the principle of maximum entropy to be described in the next section, but we chose to introduce it separately because of its importance in probability theory. The principle of indifference leads, of course, to the equal a priori probabilities that characterize games of chance such as drawing cards or rolling dice. Note, however, that the 1/n probability assignment is a logical consequence of the sum and product rules of probability theory applied to a set of exhaustive and mutually exclusive propositions, given a particular state of prior knowledge. There is no need to imagine an infinite set of repeated experiments and an imagined distribution of limiting frequencies. Of course, given the probabilities, it is a straightforward procedure to calculate the expected frequency of any particular outcome in a set of repeated trials, and thus to compute, for example, the familiar odds of the gambler. Such calculations are developed in great detail in most books on probability and statistics.

The uniform 1/n discrete probability distribution can be usefully employed to characterize ignorance of a physical dimension, such as the length Y of a metal spacer. We choose an interval [y_min, y_max] that is certain, based on engineering judgment, to contain the length Y, and we divide this interval into a large number n of discrete lengths (y_1, ..., y_n). Here n is chosen so that the discrete lengths y_k are separated by less than the measurement resolution. A state of knowledge about the length of the spacer can now be represented by the discrete probability distribution (p_1, ..., p_n), where p_k = p(Y = y_k | I_0). If now our prior information I_0 consists only of knowledge of the interval [y_min, y_max] together with an enumeration of the possible lengths (y_1, ..., y_n), then the only consistent and unbiased probability assignment is the uniform distribution p_1 = ... = p_n = 1/n.
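A minimal sketch of this assignment (the interval limits below are invented) also shows the standard uncertainty it implies, which approaches the familiar (y_max - y_min)/√12 of a rectangular distribution.

```python
import numpy as np

# Hypothetical engineering limits (mm) certain to contain the spacer length Y.
y_min, y_max = 9.95, 10.05
n = 1001                              # discrete lengths, finer than the resolution
y = np.linspace(y_min, y_max, n)
p = np.full(n, 1.0 / n)               # principle of indifference: p_k = 1/n

estimate = np.sum(p * y)                           # expectation (the midpoint)
u = np.sqrt(np.sum(p * (y - estimate) ** 2))       # implied standard uncertainty

print(estimate)                                    # 10.0
print(u, (y_max - y_min) / np.sqrt(12))            # ~0.0289, the rectangular-distribution value
```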

6.3 The principle of maximum entropy

Since probabilities represent states of knowledge, it is useful and productive to think about the information content of a probability distribution for a physical quantity. In this view, an accurate measurement supplies missing information that sharpens a vague, poorly informative prior distribution. Said a different way, the information provided by a measurement serves to reduce uncertainty with respect to the value of an unknown quantity, such as the length of a metal spacer. In the interpretation of the GUM, what we call 'uncertainty' is just the standard deviation of the probability distribution that describes the distribution of values of a quantity that are reasonable or plausible in the sense of being consistent with whatever is known (or assumed) to be true. This kind of uncertainty we might call 'location uncertainty' because the standard deviation is a characteristic measure of the region about the expectation of the distribution in which there is an appreciable probability that the value of the quantity is located.

If we think more carefully about this, however, we can see that the GUM-type of location uncertainty is useful and realistic only for particular states of knowledge. To illustrate, suppose that an inspector has two highly repeatable length gauges of identical quality, except for the fact that one of them has a significant zero offset z_0, while the other has a negligible offset.

Figure 6. A bi-modal probability distribution for the length of a spacer measured using a gauge with one of two possible zero offsets, zero or z_0. The actual offset is unknown. If the peaks are very narrow relative to their separation, the combined standard uncertainty of the measurement (standard deviation of the distribution) is approximately equal to z_0/2.

The inspector proceeds to measure the length of a metal spacer, but fails to record which of the two gauges was used for the measurement. In this case the measurement process would yield a doubly-peaked (or bi-modal) probability distribution, with the two peaks separated by the unknown gauge offset z_0, as shown in Fig. 6. If the other uncertainty components were negligible, the two peaks would be very narrow and the combined standard uncertainty (standard deviation of the distribution) would be well-approximated by z_0/2.

Several features of this situation should be noted. First we see that the standard deviation z_0/2 is a measure of the width of the region between the two peaks of the distribution, over most of which there is a negligible probability of containing the true length of the spacer. The expectation of the distribution, in particular, lies in the center of this low probability region. From this we see that the GUM identification of a best estimate with an expectation is useful only for certain types of probability distributions, and that an estimated standard deviation may not be the best uncertainty parameter in all cases. In particular, we see that should the unknown zero offset increase, so would the combined standard uncertainty, together with the inclusion of more and more highly improbable values for the spacer length.

Now notice that there is a sense in which increasing the gauge offset error z_0 adds no additional uncertainty at all. If we asked 'Which of the two gauges was used to perform the measurement?', and somehow managed to obtain this information, then the spacer length probability distribution would collapse via Bayes' theorem to a single narrow peak, and the length of the spacer would be known with high accuracy. This operation is clearly independent of z_0, depending only on our knowing that the probability distribution has two narrow peaks, independent of their separation. The information supplied by the answer to our question decreases our uncertainty about the length of the spacer, just as might be accomplished by repeating the measurement with a gauge of known offset. This suggests that there is another way to think about the uncertainty of a probability distribution that depends only on the form of the distribution itself and not on the actual values of the quantity described by the distribution. Such an approach leads to the concept of entropy.



Consider again a set (y_1, ..., y_n) of possible lengths of a spacer, with a corresponding discrete probability distribution (p_1, ..., p_n). We have argued that a state of complete ignorance as to the length of the spacer is represented by the uniform distribution p_1 = ... = p_n = 1/n, and it seems intuitively reasonable that the uniform distribution describes a state of maximum uncertainty. Now imagine a contrasting situation in which we know for certain that the length of the spacer is Y = y_k, so that p_k = 1 and p_j = 0, j ≠ k. A plot of the distribution (p_1, ..., p_n) versus index number j would display a single spike at j = k with unit probability and zeroes everywhere else. Since the length of the spacer in this case is known, we have zero uncertainty in the sense of needing no more information in order to decide the length state of the spacer, and our certainty is reflected in the sharply spiked probability distribution.

We see here how the shape of the probability distribution encodes general properties that we identify with information and uncertainty. This raises the interesting question as to whether there exists some unique function of the distribution (p_1, ..., p_n) that might serve as a numerical measure of the amount of information (in a sense to be described) needed to reduce a state of incomplete knowledge to a state of certainty. Such a function, called the entropy of the distribution, was found by C. E. Shannon [29] in the context of communication theory. We proceed to sketch the arguments that lead to the mathematical form of the entropy function.

Given a discrete probability distribution (p_1, ..., p_n), we seek a function H(p_1, ..., p_n) that will serve to measure information uncertainty (in contrast to the location uncertainty as measured by a standard deviation). Following Shannon, we require the function H, if it exists, to satisfy the following reasonable conditions:

Condition 1. H(p_1, ..., p_n) should be a continuous function of the probabilities (p_1, ..., p_n).

Condition 2. If all of the probabilities are equal, so that p_k = 1/n for all k, then H(1/n, ..., 1/n) should be a monotonically increasing function of the positive integer n. More choices should mean more uncertainty.

Condition 3. If a problem is reformulated by grouping subsets of the probabilities and calculating the uncertainty in stages, the final result must be the same for all possible groupings. This is a consistency requirement.

We illustrate Condition 3 by example (see Fig. 7). Consider a problem in which there are three possible inferences with probabilities (p_1, p_2, p_3) as shown in Figure 7(a). The information uncertainty is H(p_1, p_2, p_3). Now suppose that we proceed in two steps by grouping the inferences as shown in Fig. 7(b). The first step involves the choice of either p_1 or q = p_2 + p_3, with an uncertainty of H(p_1, q). Then, with probability q, there will be an additional uncertainty associated with the choice of either p_2 or p_3 in the amount of H(p_2/q, p_3/q). Shannon's Condition 3 then requires that the information uncertainty be the same in both cases:

H(p_1, p_2, p_3) = H(p_1, q) + q H(p_2/q, p_3/q) .  (41)

Shannon generalized the result (41) to derive a functional equation for H(p_1, ..., p_n) and then showed that the unique solution for the measure of information uncertainty, called the entropy of the distribution (p_1, ..., p_n), is given by

H(p_1, ..., p_n) = -K Σ_{i=1}^{n} p_i log p_i .  (42)

In this expression K is a positive constant that depends on the base of the logarithms. Such a choice is arbitrary, so we simplify by setting K = 1 and writing for the entropy

H(p_1, ..., p_n) = -Σ_{i=1}^{n} p_i log p_i .  (43)

The entropy H of (43) behaves quantitatively as we might expect from a measure of uncertainty. If one of the probabilities is equal to one and the rest equal to zero (a state of certainty), then

H(p_1, ..., p_n) = H(0, ..., 0, 1, 0, ..., 0) = 0 ,  (44)

while the uniform distribution, p_k = 1/n for all k, has entropy

H(1/n, ..., 1/n) = log n ,  (45)

which is the maximum value of H. The logarithmic dependence of the entropy on the number of equally-likely choices can be understood most easily in base-2 binary logic. The answers to N 'yes/no' questions (i.e. N 'bits' of information) would be sufficient to uniquely specify one of n = 2^N possibilities, so that the entropy is H = log_2 n = N.

As the number of possibilities increases exponentially, the entropy increases only linearly, so that, for example, deciding among twice as many possibilities requires only one more bit of information.
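The following short Python sketch (added for illustration, with arbitrary example probabilities) evaluates the entropy of (43) and checks the grouping condition (41) and the limiting values (44)-(45) numerically.

```python
import numpy as np

def entropy(p):
    """Shannon entropy -sum p_i log p_i of Eq. (43); zero probabilities contribute nothing."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return -np.sum(p * np.log(p))

p1, p2, p3 = 0.5, 0.3, 0.2            # arbitrary example probabilities
q = p2 + p3

# Grouping condition (41): both routes give the same information uncertainty
lhs = entropy([p1, p2, p3])
rhs = entropy([p1, q]) + q * entropy([p2 / q, p3 / q])
print(lhs, rhs)                        # both ~1.0297 nats

# Limiting cases (44)-(45): certainty has zero entropy, the uniform case gives log n
print(entropy([1, 0, 0]), entropy([1/3, 1/3, 1/3]), np.log(3))
```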

In the case of a continuous probability distribution for a parameter such as the length of a spacer, where prior ignorance is described by a uniform distribution, the entropy becomes

H = -∫ p(y | I_0) log p(y | I_0) dy ,  (46)

where the integral is over all possible values of the length.

Figure 7. Illustrating the grouping of inferences. The information uncertainty should be the same in both cases. In (b), the uncertainty associated with the choice of p_2 or p_3 occurs with probability q = p_2 + p_3.


There is a close connection between entropy in the sense of information and uncertainty and the entropy of statistical mechanics. In fact, all of equilibrium statistical mechanics can be viewed as an exercise in probable inference with respect to the unknown microscopic state of a thermodynamic system, when our information consists only of estimates of a few macroscopic variables such as temperature and pressure. The interested reader should see, for example, the pioneering work of Jaynes [15] and the excellent introductory text by Baierlein [1].

The entropy is a unique measure of uncertainty, in the sense of missing information, with respect to a state of nature. Our natural desire for objectivity and freedom from bias would therefore suggest that among all possible prior distributions that might describe knowledge of a measurement variable, we should choose the one that maximizes the entropy in a way that is consistent with whatever is known (or assumed) to be true. This is the principle of maximum entropy (PME). The resulting probability distribution then reproduces what we assume to be true while distributing the remaining uncertainty in the most honest and unbiased manner. At the same time, PME is a procedure that satisfies our desire for consistency in the sense that two persons with the same information (state of knowledge) should assign the same probabilities. Jaynes [14] has described the maximum entropy distribution as being 'maximally noncommittal with regard to missing information' and has also observed that this distribution '... is the one which is, in a certain sense, spread out as much as possible without contradicting the given information, i.e., it gives free rein to all possible variability of [the unknown quantity] allowed by the constraints. Thus it accomplishes, in at least one sense, the intuitive purpose of assigning a prior distribution; it agrees with what is known, but expresses a 'maximum uncertainty' with respect to all other matters, and thus leaves a maximum possible freedom for our final decisions to be influenced by the subsequent sample data.'

The mathematical procedure that underlies the PME is one of constrained maximization, which seeks to maximize the entropy (either the discrete or continuous form, as appropriate) subject to constraints on the probability distribution imposed by prior information, using the method of Lagrange multipliers [1, 13, 30, 35]. The example of the metal spacer will serve to illustrate the procedure for particular states of available information.

Suppose that we are certain, based on engineering judgment and the known properties of a production process, that the length of a spacer is contained in the interval [y_min, y_max]. Such knowledge constrains the distribution p(y | I_0) via the normalization requirement:

∫_{y_min}^{y_max} p(y | I_0) dy = 1 .  (47)

Maximizing the entropy (46) subject to the constraint (47) then yields the rectangular, or uniform, density given by

p(y | I_0) = 1/(y_max - y_min)  (48)

in the allowed range of y, and zero otherwise. We could have guessed this simple distribution based on the symmetry of the situation, but it is instructive to see how the PME works with such meager information.

In many cases we may have a prior estimate y_0 of the length, together with an estimated variance σ_y², related to p(y | I_0) by (24) and (25). We might know, for example, that the spacer was produced by a reliable machine or process with a well-characterized production history. Maximizing the entropy subject to these constraints, together with the normalization requirement of (27), yields the normal (or Gaussian) density:

p(y | I_0) = (1/(σ_y √(2π))) exp[-(y - y_0)²/(2σ_y²)] = N(y; y_0, σ_y²) .  (49)

This is a very important and useful result. Prior information about the length of the spacer might be based not on the known characteristics of a machine or production process but rather on the result of a previous measurement, perhaps performed by a supplier. If the supplier follows the recommendations of the GUM, the result of the measurement will be reported in the form Y = y_0 ± k u_c(y_0), where k is a coverage factor and the combined standard uncertainty u_c(y_0) is an estimated standard deviation of the probability distribution that characterizes the supplier's measurement process. Given only this information, the best prior probability assignment (being least informative in the sense of the PME) for encoding knowledge of the length of the spacer is just p(y | I_0) = N(y; y_0, u_c²). Thus the normal distribution, rather than being an unwarranted assumption, is the least biased and 'maximally noncommittal' of all distributions for given mean and variance. Consistency would then require anyone using the supplier's measurement result to assign the same normal distribution.
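As an added numerical illustration of this 'maximally noncommittal' property (all values invented), the sketch below compares the entropy of a discretized Gaussian with that of a two-point distribution having the same mean and variance; the Gaussian is far less committal about where the value lies.

```python
import numpy as np

def entropy(p):
    p = p[p > 0]
    return -np.sum(p * np.log(p))

# Invented supplier result: estimate y0 with standard uncertainty u_c (mm)
y0, u_c = 10.000, 0.001
y = np.linspace(y0 - 5 * u_c, y0 + 5 * u_c, 2001)

# Discretized Gaussian with mean y0 and standard deviation u_c
p_gauss = np.exp(-0.5 * ((y - y0) / u_c) ** 2)
p_gauss /= p_gauss.sum()

# A two-point distribution at y0 +/- u_c has the same mean and variance...
p_two = np.zeros_like(y)
p_two[np.argmin(np.abs(y - (y0 - u_c)))] = 0.5
p_two[np.argmin(np.abs(y - (y0 + u_c)))] = 0.5

# ...but a much smaller entropy: it is far more committal than the PME assignment
print(entropy(p_gauss), entropy(p_two))   # ~6.7 vs ~0.69 nats on this grid
```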

7. The ubiquitous normal distribution

The normal, or Gaussian, distribution has a very special status in probability theory and measurement science. In this section we describe some of the reasons for the ubiquitous occurrence of this particular distribution.

7.1 The central limit theorem

When many small independent effects combine additively to affect either a production process or a set of repeated measurements, the resultant frequency distributions (histograms) of either the workpiece errors or the measurement results will usually be well approximated by normal distributions. The central limit theorem (CLT) provides a theoretical basis for modeling this behavior, under very general and non-restrictive assumptions about the various probability distributions that characterize the individual effects. The CLT is a general result in the theory of random variables. Without attempting a formal proof, the CLT says that if Z is the sum Z = X_1 + ... + X_n of n independent random variables X_i, each of which has finite mean and variance, with none of the variances significantly larger than the others, then the distribution of Z will be approximately normal, converging towards a normal distribution for large n. In practical applications, 'large n' may mean n no greater than three or four.
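A quick Monte Carlo sketch (added here; the choice of uniform component errors is arbitrary) shows how fast this convergence is in practice.

```python
import numpy as np

rng = np.random.default_rng(0)
n, trials = 4, 100_000

# Z is the sum of n = 4 independent uniform errors on [-0.5, 0.5]
Z = rng.uniform(-0.5, 0.5, size=(trials, n)).sum(axis=1)

sigma_Z = np.sqrt(n / 12.0)               # each uniform term has variance 1/12
# Empirical two-sided tail probability beyond 2 sigma vs. the Gaussian value 0.0455
print(np.mean(np.abs(Z) > 2 * sigma_Z))   # ~0.043: already close, with only four terms
```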

7.1.1 Gaussian sampling and Type A uncertainties

There is perhaps no source of measurement uncertainty more basic and fundamental than that caused by getting a different answer every time a measurement is repeated. The CLT suggests a useful and realistic model of a noisy, non-repeatable measurement procedure. Consider, for example, a well-calibrated electronic indicator used to measure the length of a metal spacer. A set of n repeated measurements yields a data set of indicator readings


D ≡ {y_1, y_2, ..., y_n}, where each reading is equal to the length plus an error that fluctuates from reading to reading. Guided by the CLT we assume that each error is the sum of a large number of small random (meaning unpredictable) errors, and model the procedure as repeated sampling from a normal frequency distribution with an expectation (or mean) ⟨y⟩ = μ and a standard deviation σ that characterizes the measurement process repeatability.

In many situations, the standard deviation σ may be known from prior experience with the process. The post-data (posterior) distribution for the spacer length then follows from Bayes' theorem:

p(μ | D I_0) = K p(μ | I_0) p(D | μ I_0) ,  (50)

where the prior information I_0 includes the known value of σ. In (50), K is a normalization constant and p(μ | I_0) is the prior distribution for μ, which we assume to be constant (a uniform density), corresponding to knowing little about the value of μ a priori. The last factor on the right side of (50) is the likelihood function:

p(D | μ I_0) = p(y_1 ⋯ y_n | μ I_0) .  (51)

We assume that the sequential measurements are independent, which means that the probability of obtaining datum y_i does not depend upon the results of previous measurements. For the first two samples, using the product rule, we then have:

p(y_1 y_2 | μ I_0) = p(y_1 | μ I_0) p(y_2 | y_1 μ I_0) = p(y_1 | μ I_0) p(y_2 | μ I_0) .  (52)

Independence means, according to Cox, that knowledge of y_1 is irrelevant for reasoning about y_2.

Now by definition of the model, the probability of obtaining a particular indicator reading y_i is given by the normal distribution:

p(y_i | μ I_0) = (1/(σ √(2π))) exp[-(y_i - μ)²/(2σ²)] .  (53)

Repeated use of the product rule then yields for the likelihood:

p(D | μ I_0) = p(y_1 | μ I_0) ⋯ p(y_n | μ I_0) ∝ exp[-Σ_{i=1}^{n} (y_i - μ)²/(2σ²)] .  (54)

Now since Σ (y_i - μ)² = Σ (y_i - ȳ)² + n(ȳ - μ)², with ȳ = Σ y_i / n (the sample mean), and since the first term is fixed, given the data, the likelihood becomes

p(D | μ I_0) ∝ exp[-n(ȳ - μ)²/(2σ²)] .  (55)

Using this result for the likelihood in Bayes' theorem (50), with a constant (uniform) prior distribution, we have finally:

p(μ | D I_0) ∝ exp[-n(μ - ȳ)²/(2σ²)] .  (56)

The post-data distribution for the expectation μ is seen to be a normal distribution centered at the best estimate μ_est = μ_0 = ȳ with standard deviation (GUM Type A standard uncertainty) u(μ_0) = σ/√n. This familiar result is called a maximum likelihood estimate, which is seen to be no more than Bayesian inference in the case of a uniform prior distribution and Gaussian sampling distribution (likelihood function). This is an example of the way in which probability theory as extended logic reproduces the results of traditional statistical sampling theory when warranted by the available information.

The case where σ is unknown is straightforward but more complicated, so that we simply state the results. For details, see references [2,16]. Using a constant prior density for μ and Jeffreys' log-uniform prior density for σ, Bayes' theorem leads to a posterior distribution for μ given by Student's t-distribution for n - 1 degrees of freedom. The best estimate μ_0 is again given by the sample mean ȳ, with variance (in the notation of the GUM) given by

u²(μ_0) = [(n - 1)/(n - 3)] s²/n .  (57)

In this expression, s² is the sample variance, computed from the data according to

s² = [1/(n - 1)] Σ_{i=1}^{n} (y_i - ȳ)² .  (58)

The uncertainty u(μ_0) is seen from (57) to be larger than the value s/√n recommended in the GUM, and in fact is defined only for n > 3. In Bayesian inference, this is a signal that for small n, one needs more prior information about σ than is provided by the log-uniform density. As n increases, the result (57) approaches the GUM recommendation. To sum up the results when sampling from an assumed normal frequency distribution N(y; μ, σ²), when very little is known a priori about μ: the best estimate μ_0 is always given by the sample mean ȳ = Σ y_i / n; if σ is known a priori, then u(μ_0) is σ/√n; if σ is unknown, then u(μ_0), for n of 10 or so, is s/√n, with s computed according to (58).
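A minimal sketch of this comparison (same invented readings as above):

```python
import numpy as np

data = np.array([10.003, 10.001, 10.004, 9.999, 10.002, 10.003])   # invented readings (mm)
n = len(data)
ybar = data.mean()
s = data.std(ddof=1)                                   # sample standard deviation, Eq. (58)

u_gum = s / np.sqrt(n)                                 # GUM Type A recommendation
u_bayes = np.sqrt((n - 1) / (n - 3)) * s / np.sqrt(n)  # from Eq. (57); needs n > 3

print(ybar, u_gum, u_bayes)    # the Bayesian value is larger, approaching s/sqrt(n) as n grows
```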

7.2 Inspection measurements I: process control

Suppose that a manufacturer produces a large number of metal spacers, using a well-designed production process. Because of unavoidable process variations there will be some distribution of lengths in any particular batch of spacers. We can think of each spacer as sampling the manufacturing process, with the process adding a random error to the intended dimension [24]. Assume that the process is such that the distribution of spacer lengths, as displayed in a histogram, is well approximated by a normal distribution:

f(y) = (1/(σ_p √(2π))) exp[-(y - y_0)²/(2σ_p²)] = N(y; y_0, σ_p²) .  (59)

Here y_0 is the average length of a spacer and the variance σ_p² characterizes the variability of the production process. The quantity f(y) is a frequency distribution function, where f(y) Δy is approximately equal to the fraction of spacers with lengths in the range (y, y + Δy). Figure 8 shows such a histogram, together with its normal approximation, for a


typical large run of spacers.

During production, the spacers are measured using a calibrated length gauge that has been corrected for all significant systematic errors. The inspection measurement process has a combined standard uncertainty σ_m that includes the effects of temperature, gauge calibration uncertainty, measurement process reproducibility, and so forth. From experience with this gauge and the measurement process, it is known that length measurement errors are well described by the normal distribution:

Figure 8. A histogram showing the frequency distribution of the lengths of a large run of spacers produced by a machine. The solid curve is a normal distribution fit to the histogram.

p(e | I_0) = (1/(σ_m √(2π))) exp[-e²/(2σ_m²)] = N(e; 0, σ_m²) ,  (60)

so that p(e | I_0) de = N(e; 0, σ_m²) de is the probability that the error of a length measurement lies in the range (e, e + de). Because all known significant systematic effects have been accounted for, the measurement error has an expectation of zero.

As part of a statistical quality control program, an inspector uses this gauge to measure the lengths of a large sample of spacers, and plots the result in a histogram. What can we say about this frequency distribution? In general, the result y_m of a particular length measurement will be the sum of an unknown length y and an unknown measurement error e:

y_m = y + e  (61)

Such a result could be realized in an infinite number of ways, corresponding to the infinite number of pairs (y, e) that satisfy (61). The error e here is a nuisance parameter, present in the data but of no interest in itself, that can be eliminated via marginalization. Let p(y_m | I_0) dy_m be the probability of the proposition 'the result of a measurement lies in the interval (y_m, y_m + dy_m).' The distribution p(y_m | I_0) can be found by averaging over all possible measurement errors to yield a marginal distribution, analogous to the result of (22), where we had discrete probability distributions.

The marginal distribution for y_m is found according to:

p(y_m | I_0) = ∫ p(y_m e | I_0) de = ∫ p(y_m | e I_0) p(e | I_0) de = ∫ N(y_m; y_0 + e, σ_p²) N(e; 0, σ_m²) de .  (62)

Here the second step follows directly from the product rule, and the result p(y_m | e I_0) = N(y_m; y_0 + e, σ_p²) says that for a given error e, the distribution of y_m would equal the production distribution, shifted and centered at y_0 + e. The last integral in (62), called a convolution integral, is straightforward [2] and leads to the basic result:

p(y_m | I_0) = (1/(σ_T √(2π))) exp[-(y_m - y_0)²/(2σ_T²)] = N(y_m; y_0, σ_T²) ,  (63)

where

σ_T² = σ_p² + σ_m² .  (64)

We see that the distribution of values that might reasonably be expected to result from the measurement of a spacer chosen at random is normally distributed, centered at the average production value y_0, with a standard deviation given by σ_T = √(σ_p² + σ_m²). The inspector's measurements can be thought of as a sequence of samples from this distribution, so that the resulting histogram can be expected to be approximately Gaussian with a sample variance s² ≈ σ_T².

The two sources of variation, production and measurement, are seen to be mixed or confounded in the measurement process, and the behavior of the variance σ_T² accords with our common sense. If the measurement process were nearly perfect and noise free so that σ_m² ≈ 0, then σ_T² ≈ σ_p² and the dispersion of measurement results would reflect only production variation. On the other hand, if all of the spacers were nearly identical, so that σ_p² ≈ 0, then σ_T² ≈ σ_m² and such dispersion would be dominated by measurement variability. In a similar way, any observed drift in the measurement results away from the nominal length y_0 would be the sum of production and measurement drift, requiring more information (i.e. gauge re-calibration) before being unambiguously attributed to changes in the production process.
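A small Monte Carlo sketch of this confounding (all numbers are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative values: production spread and measurement repeatability (mm)
y0, sigma_p, sigma_m = 10.0, 0.010, 0.004
N = 200_000

y = rng.normal(y0, sigma_p, N)      # spacer lengths drawn from the production process, Eq. (59)
e = rng.normal(0.0, sigma_m, N)     # measurement errors with zero expectation, Eq. (60)
y_m = y + e                         # observed values, Eq. (61)

print(y_m.std(ddof=1))                      # dispersion of the inspection data
print(np.sqrt(sigma_p**2 + sigma_m**2))     # sigma_T of Eq. (64): ~0.01077, the same
```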

From this example we see that it is very important in industrial inspection to understand the difference between actual workpiece variation and measurement uncertainty. The reliable characterization of a production process requires a measurement system whose expected error is held close to zero and whose combined standard uncertainty has been independently and carefully evaluated.

7.3 Inspection measurements II: a particular workpiece

While the result (63) is interesting and useful for process control, it is not the information that an inspector would need in order to accept or reject a given workpiece. In order to decide if a spacer is acceptable or not, what an inspector needs to know, given a particular measurement result y_m, is the best estimate of the length of the spacer actually


measured, together with an evaluation of the measurementuncertainty.

Before performing the measurement, the inspector's knowledge of the length of the spacer is guided by his experience with the production process and data such as that shown in the histogram of Fig. 8. Using this prior information he assigns a normal prior distribution:

p(y | I_0) = (1/(σ_p √(2π))) exp[-(y - y_0)²/(2σ_p²)] = N(y; y_0, σ_p²) .  (65)

While this has the same mathematical form as f(y) in (59), it should be stressed that f(y) is a measured frequency distribution of lengths, while p(y | I_0) is an assigned probability distribution for a single spacer drawn from (59). Probability and frequency are not the same thing. We also observe that in the real world of manufacturing, many (if not most) workpieces are never measured at all, but rather accepted for use based upon pure inference in which knowledge of the workpiece is implicitly encoded by a distribution such as p(y | I_0).

After obtaining the measurement data y_m, we update the prior distribution to obtain the post-data (posterior) pdf using Bayes' theorem

p(y | y_m I_0) = p(y | I_0) p(y_m | y I_0) / p(y_m | I_0) = K p(y | I_0) p(y_m | y I_0) ,  (66)

where the constant K = p(y_m | I_0)⁻¹ is independent of y and will be absorbed into the normalization of p(y | y_m I_0). The likelihood p(y_m | y I_0) is the probability of obtaining data y_m as a function of an assumed value y. Given a measurement process with error probability distributed as in (60), this is p(y_m | y I_0) = N(y_m; y, σ_m²), i.e., a Gaussian centered at the assumed value of y. Thus:

p(y | y_m I_0) = K N(y; y_0, σ_p²) N(y_m; y, σ_m²) .  (67)

We see from (67) that the posterior pdf p(y | y_m I_0) is proportional to the product of two normal distributions. It is a straightforward exercise to show that the result is another normal distribution:

p(y | y_m I_0) = (1/(σ̂ √(2π))) exp[-(y - ŷ)²/(2σ̂²)] = N(y; ŷ, σ̂²) ,  (68)

where:

ŷ = (σ_m² y_0 + σ_p² y_m)/(σ_p² + σ_m²) ,   σ̂² = (σ_p⁻² + σ_m⁻²)⁻¹ .  (69)

These results can be written somewhat more elegantly, and in a form easier to remember, by defining a weight parameter (or simply a weight) w for a probability distribution as the reciprocal of the variance: w ≡ 1/σ², so that w_p = 1/σ_p², w_m = 1/σ_m², and ŵ = 1/σ̂². With these definitions (69) becomes:

ŷ = (w_p y_0 + w_m y_m)/(w_p + w_m) ,   ŵ = w_p + w_m .  (70)

From the results (68-70) we see that the best estimate ŷ of the length of the spacer, given the measurement data y_m, is a weighted average of the prior estimate y_0 and the measured value y_m. The weights characterize the sharpness of the respective probability distributions for y_0 and y_m, and the posterior estimate ŷ will be biased toward the value in which we have the most confidence, as measured by its weight. If we study this result, we find that it accords very well with what our intuition might suggest. In a typical industrial inspection, the measurement procedure is arranged such that w_m >> w_p, whence ŷ ≈ y_m and the best estimate of the spacer length derives almost completely from the measurement data. On the other hand, imagine using a hand micrometer to measure the diameter of a cylindrical workpiece produced by a modern diamond turning machine. In this case we could well have w_p >> w_m, and in effect we would be using the workpiece to calibrate the micrometer.

From (69-70) we see that the posterior weight ŵ is always greater than either of the weights w_p or w_m, so that a measurement always supplies information that reduces uncertainty. Usually w_m >> w_p so that ŵ ≈ w_m and the information supplied by the measurement overwhelms whatever we may know a priori. In many cases of modern manufacture, however, such as the production of complex aspheric optics by single-point diamond machining, the workpieces are very difficult to measure independently, and we could well have w_m ≈ w_p and ŵ ≈ 2 w_m, so that prior knowledge of a machine's positioning error characteristics could lead to a meaningful reduction in measurement uncertainty [25].
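The update of (68)-(70) is a one-line calculation; the sketch below (illustrative numbers only) shows the typical inspection case in which the measurement carries most of the weight.

```python
import numpy as np

# Illustrative numbers: prior knowledge from production history, plus one measurement (mm)
y0, sigma_p = 10.000, 0.010      # prior estimate and its standard deviation
y_m, sigma_m = 10.006, 0.002     # measured value and measurement standard uncertainty

w_p, w_m = 1 / sigma_p**2, 1 / sigma_m**2        # weights: reciprocal variances

y_hat = (w_p * y0 + w_m * y_m) / (w_p + w_m)     # posterior best estimate, Eq. (70)
u_hat = 1 / np.sqrt(w_p + w_m)                   # posterior standard uncertainty

print(y_hat, u_hat)     # ~10.0058 and ~0.00196: dominated by the sharper measurement
```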

7.3.1 Comparison measurements

It is a common practice for a dimensional measurement laboratory to evaluate its procedures in relation to similar laboratories by participating in comparison measurements, such as round-robins. Here each participating laboratory measures, in turn, the same artifact and the results are then used to evaluate the equivalency of the various participants. The question naturally arises as to how best to consolidate and compare the results of the individual measurements. The optimum way to do so follows by extending the results of the previous section, using the rules of probability theory. We give a simple example and then touch upon practical difficulties.

Suppose that n laboratories independently measure a quantity θ and report the results in the form θ = x_i ± σ_i, i = 1, ..., n, where σ_i is the combined standard uncertainty of the i-th measurement, evaluated according to the procedures of the GUM. Here θ might be the length of a gauge block or the diameter of a cylindrical standard. Guided by the principle of maximum entropy, we assign a Gaussian probability density to each of the measurement results, so that

p(x_i | θ I_0) ∝ exp[-(x_i - θ)²/(2σ_i²)]  (71)


is the probability density for the i-th measurement result. Assuming that the measurements are completely independent, the likelihood of the data set x ≡ (x_1, ..., x_n) is

p(x | θ I_0) = Π_{i=1}^{n} p(x_i | θ I_0) ∝ exp[-Σ_{i=1}^{n} (x_i - θ)²/(2σ_i²)] ,  (72)

where x ≡ x_1 x_2 ⋯ x_n. We also assume a Gaussian prior density for θ, centered on the estimate θ_p:

p(θ | I_0) = (1/(σ_p √(2π))) exp[-(θ - θ_p)²/(2σ_p²)] = N(θ; θ_p, σ_p²)  (73)

where the subscript denotes the pre-data estimate. [Prior ignorance can be well-approximated by allowing σ_p² → ∞.] Bayes' theorem then gives for the post-data (posterior) density p(θ | x I_0):

p(θ | x I_0) = K p(θ | I_0) p(x | θ I_0) ,  (74)

where as usual, K is a normalization constant. Substituting expressions (72) and (73) into (74) gives a product of n + 1 normal distributions, yielding, after simplification, the normal posterior density

p(θ | x I_0) = (1/(σ̂ √(2π))) exp[-(θ - θ̂)²/(2σ̂²)] .  (75)

Here θ̂ and σ̂ are, respectively, the best estimate of θ and its combined standard uncertainty after incorporation of all of the measurement data, and are given by

θ̂ = (w_p θ_p + Σ_{i=1}^{n} w_i x_i) / (w_p + Σ_{i=1}^{n} w_i) ,   ŵ = w_p + Σ_{i=1}^{n} w_i ,  (76)

where ŵ ≡ 1/σ̂² and so on for the rest of the weights.

The results (75-76) have all of the intuitive properties that we might expect in pondering the situation. If w_p >> w_i for all i, then θ̂ ≈ θ_p and rather than learning about the artifact, the round-robin would reveal estimates of the systematic errors of the various measurements. This might describe a round-robin in which a National Metrology Institute (NMI) circulated a well-calibrated 'golden artifact' among a group of lower echelon laboratories, perhaps as part of a laboratory accreditation program.

More common is a comparison round-robin (sometimes called a key comparison when a number of NMIs are involved), in which only a nominal value of θ is known a priori and the goal is laboratory intercomparison. In such comparisons it will almost always be the case that for any laboratory w_i >> w_p (the measurements overwhelm prior information), so that w_p ≈ 0 and Eqs. (76) simplify to:

θ̂ = Σ_{i=1}^{n} w_i x_i / Σ_{i=1}^{n} w_i ,   ŵ = Σ_{i=1}^{n} w_i .  (77)

If all laboratories report the same uncertainty σ_0, then w_i = w_0 = 1/σ_0² and θ̂ = x̄ = Σ x_i / n with uncertainty σ̂ = σ_0/√n, a familiar result. If laboratory k has a much smaller uncertainty than the others due to a superior measurement process (w_k >> w_i, i ≠ k), then θ̂ ≈ x_k with uncertainty σ̂ ≈ σ_k, which is just as it should be. A single high-accuracy measurement is more valuable than a number of poor ones.
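A minimal sketch of the weighted combination (77), using invented laboratory results:

```python
import numpy as np

# Invented round-robin results x_i with stated standard uncertainties sigma_i (micrometres)
x = np.array([5.02, 4.96, 5.10, 4.99])
sigma = np.array([0.02, 0.03, 0.10, 0.02])

w = 1 / sigma**2                            # weights; prior information negligible (w_p -> 0)
theta_hat = np.sum(w * x) / np.sum(w)       # weighted mean, Eq. (77)
u_theta = 1 / np.sqrt(np.sum(w))            # combined standard uncertainty

print(theta_hat, u_theta)
print(x.mean())                             # the simple un-weighted mean, for comparison
```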

A central question in the analysis of round-robin data is how to choose a reference value θ_ref in order to effect the comparisons. We see from Eqs. (77) that the logical and consistent way to do so is to use the weighted mean θ_ref = θ̂, which is the best estimate of the measurand using all available information. In spite of this, it seems to be common practice to use the simple un-weighted mean value, θ_ref = x̄, which discounts the variation in the measurement uncertainty. The motivation for this choice is to prevent a participant from claiming a very small uncertainty and forcing a weighted mean value toward his own result. We see here how in the real world it is easy to move beyond purely technical considerations and into areas that have psychological, political, and economic aspects.

Choice of a reference value is only one of several problems that arise in the design and data analysis of comparison measurements and that are subjects of active discussion and debate. Among the others are:

• Correlations. It is difficult to perform a set of comparison measurements that are all logically completely independent of each other. Independence means that knowledge of Laboratory A's systematic errors would convey no information that would affect Laboratory B's result. Use of common reference standards, using instruments from the same manufacturer, using the same empirical equation (such as Edlén's equation for the refractive index of air) or phenomenological model --- all of these will correlate the experimental results. Depending upon the particular nature of the measurement, failing to account for significant correlations among the input quantities will lead to either an underestimation or an overestimation of the uncertainty in the final result.

• Method of measurement. Performing a measurement in two different ways will often give two different answers, even when the individual procedures are highly repeatable. From the point of view of probability theory, we would say that the measurements occur in different reasoning environments I_0 and I_0', so that p(x | θ I_0) ≠ p(x | θ I_0') for the same data x, a result often called an 'apples and oranges' comparison. This introduces a component of uncertainty due to the method of measurement that can be studied by an appropriate experimental design.


• Definition of the measurand. In many comparison measurements, the uncertainties may be dominated by an incomplete definition of the measured quantity. Diameter, for example, is not well defined for a cylinder that is not perfectly round. Similarly, the width of a chromium line deposited on glass is not well described by a single number at the level of a few nanometers. This lack of complete definition can also interact strongly with the measurement technique, further complicating both the evaluation of the uncertainty and comparison with other results.

• Unrecognized errors. It is not uncommon for the results of two independent measurements of the same quantity to be inconsistent, which means that the difference between the measured values exceeds the sum of the individual uncertainties by more than a 'reasonable' amount. An effort to achieve a very small measurement uncertainty requires the correction for smaller and smaller effects, and it is easy for some tiny effect to go unrecognized in the data analysis. In such a case, at least one of the results must be wrong, but it can be difficult, if not impossible, to find the source of the inconsistency. Of course, one of the principal reasons for performing a comparison such as a round-robin is to discover such unrecognized errors, and it is important to have a consistent procedure for handling them. An interesting approach has been demonstrated by F. Fröhner, who calls inconsistent results 'one of the thorniest problems in data evaluation [8].' He models the unrecognized errors themselves as being normally distributed, with maximum entropy prior distributions for the unknown means and variances. The resultant Bayesian inference yields a best estimate for the unknown measurand, together with best estimates for the unrecognized errors and their uncertainties, in a straightforward way.

Differences between the results of measurements of the 'same' measurand will, in general, be due to some admixture of these last three sources of variation. These effects cannot be separated without a more complete understanding and analysis of the various measurement processes. As suggested above, much creative thinking is needed about how best to treat the measurement data created in the course of measurement intercomparisons.

7.4 Industrial inspections III: accept/reject decisions

Let us return to the inspector who has measured the length of a metal spacer and must now decide whether or not it is acceptable for use. The nominal length of the spacer is y_0, and the design specification calls for y_0 to be centered in a specification zone of width T, where T is called the tolerance. This means that the length of an acceptable spacer must lie in the range LSL ≤ Y ≤ USL, where the lower specification limit LSL = y_0 - T/2 and the upper specification limit USL = y_0 + T/2. The tolerance is related to the specification limits by T = USL - LSL, as shown in Fig. 9.

Figure 9. The specification zone for a metal spacer of design length y_0 and tolerance T.

The goal of the inspector's measurement is to answer the question 'Is the length of the measured spacer contained in the specification zone with an acceptable probability?' Clearly what is meant by 'acceptable probability' is a question of professional or business judgment that involves matters such as costs and risks. For the purposes of our discussion we assume that there is a critical probability P_0 such that a spacer will be acceptable if there is a probability P ≥ P_0 that its length lies within the specification zone. Typically P_0 will be a number such as 0.95 or 0.99, corresponding to a level of confidence of 95% or 99%.

The inspector's knowledge of the length of the spacer following his measurement is summed up in the posterior density p(y | y_m I_0) = N(y; ŷ, σ̂²) of (68-69), which describes the distribution of reasonably probable values. The probability P_G that the spacer is good (within tolerance) is just the fraction of this distribution contained between the specification limits (see Figure 10):

P_G = ∫_{LSL}^{USL} p(y | y_m I_0) dy .  (78)

From (68) we have explicitly:

P_G = (1/(σ̂ √(2π))) ∫_{LSL}^{USL} exp[-(y - ŷ)²/(2σ̂²)] dy .  (79)

This integral cannot be evaluated in closed form, but the result can be expressed in terms of the standard normal cumulative distribution function (CDF) defined by

Φ(x) = (1/√(2π)) ∫_{-∞}^{x} exp(-t²/2) dt .  (80)

The CDF Φ(x) is tabulated in many statistics texts and is commonly included in commercial mathematics and spreadsheet software. Letting z = (y - ŷ)/σ̂, P_G is given by

P_G = Φ((USL - ŷ)/σ̂) - Φ((LSL - ŷ)/σ̂) .  (81)


Figure 10. The probability density p(y | y_m I_0) = N(y; ŷ, σ̂²) of a measured spacer, superimposed on the specification zone. The best estimate of the length is ŷ. The probability P_G that the spacer is good is the fraction of the area under the curve (shown cross-hatched) contained between the specification limits. The spacer is in tolerance if P_G ≥ P_0, where P_0 is a threshold value determined by economic considerations.

Now defining the dimensionless variables

y* ≡ (ŷ - LSL)/T ,   σ* ≡ σ̂/T ,  (82)

(81) becomes:

P_G = Φ((1 - y*)/σ*) - Φ(-y*/σ*) ≡ P_G(y*, σ*) .  (83)

The tolerance T thus provides a natural length scale for the inspector's decision problem. For a spacer to have a chance to be accepted, the best estimate ŷ of its length must lie within the specification zone, for otherwise the probability of being in tolerance would be less than 50%. The specification zone corresponds to the region 0 ≤ y* ≤ 1, with y* = 0 being the lower specification limit, y* = 1 being the upper specification limit, and y* = 0.5 being the center of the specification zone.

The dimensionless uncertainty parameter σ* = σ̂/T is sometimes called a gauging ratio, and typically has a value such as 0.25 (a 4-to-1 ratio) or 0.10 (a 10-to-1 ratio, occasionally called the gauge maker's rule). The reduced quantities y* and σ* are closely related to various process capability indices such as C_p and C_pk that are used in statistical quality control [23].
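Equation (83) is easily evaluated; the sketch below (added for illustration, with arbitrary reduced values) computes P_G using the error function.

```python
from math import erf, sqrt

def Phi(x):
    """Standard normal cumulative distribution function, Eq. (80)."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def P_good(y_star, sigma_star):
    """Probability of conformance in reduced variables, Eq. (83)."""
    return Phi((1.0 - y_star) / sigma_star) - Phi(-y_star / sigma_star)

# Illustrative cases
print(P_good(0.80, 0.10))   # estimate at 80% of the zone, 10-to-1 gauging ratio: ~0.977
print(P_good(0.50, 0.25))   # centered estimate, 4-to-1 gauging ratio: ~0.954
```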

This result can be appreciated by examining Fig. 11, which shows the locus of constant probability for two levels of confidence (P_G = 0.95 and P_G = 0.99) in the (y*, σ*) plane.

Figure 11. The (y*, σ*) plane, showing the locus of constant probability P_G, from Eq. (83), for P_G = 0.95 (upper curve) and P_G = 0.99 (lower curve).

For a given level of confidence, acceptable spacers lie in the region below the corresponding curve. The horizontal dotted line in Fig. 11 locates a particular 'gauge maker's rule' of a 10-to-1 ratio of tolerance to measurement uncertainty. The intersection of such a line with a particular probability locus defines a conformance zone whose width determines the range of measured values ŷ allowed for acceptable spacers. The 99% conformance zone is shown, so that a spacer whose measured length is such that y* lies in this region has at least a 99% probability of being in tolerance, so long as σ̂ ≤ T/10.

A simplified way of showing the same decision problem follows from the recently adopted International Standard ISO 14253-1 [11], which defines default decision rules for proving conformance or non-conformance to specification. The basic idea is shown in Fig. 12.

Figure 12. Illustrating the specification and conformance zones according to ISO 14253-1. The quantity U = k u_c is the expanded uncertainty, with k equal to a coverage factor according to the GUM.

According to this standard, the specification zone is reduced by twice the expanded uncertainty U = k u_c of the measurement in order for a supplier to prove conformance with specification. On the other hand, for a customer to prove non-conformance requires that he add the expanded uncertainty to the result of his measurement, thus increasing the size of the conformance zone. The measurement uncertainty always works against whoever is making a conformance or non-conformance decision, and there is always a tradeoff involving costs and risks.
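A sketch of this default rule with made-up numbers (the limits and uncertainty below are purely illustrative, and the calculation is a simplified reading of the standard):

```python
# Default ISO 14253-1 style zones, with made-up numbers (mm); a simplified sketch only.
y0, T = 10.000, 0.020           # nominal length and tolerance
u_c, k = 0.001, 2               # combined standard uncertainty and default coverage factor
U = k * u_c                     # expanded uncertainty

LSL, USL = y0 - T / 2, y0 + T / 2

# Supplier proving conformance: specification zone reduced by U at each end
conformance_zone = (LSL + U, USL - U)
# Customer proving non-conformance: measured value must fall outside the enlarged limits
non_conformance_limits = (LSL - U, USL + U)

print(conformance_zone)         # (9.992, 10.008)
print(non_conformance_limits)   # (9.988, 10.012)
```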

In ISO 14253-1, the default coverage factor is k = 2. It should be emphasized that this is a default procedure that fails to consider important economic issues such as the costs and risks associated with making erroneous


decisions [34]. These considerations can greatly affect the boundaries of the conformance and non-conformance zones, so that default rules such as ISO 14253-1 will likely be of marginal value for real decisions in the marketplace [22,27].

8. Summary

We have attempted to give a broad overview of the fundamental ideas of inference, where probability is viewed as a degree of rational belief. In this view, engineering metrology is seen to be a particular application of a very general system of extended logic that applies to any situation where incomplete information precludes the use of deductive reasoning. The two major questions of probability theory are (a) how to assign initial probabilities and (b) how to revise probabilities in order to incorporate new information. We have shown how the answers to these questions are provided by (1) the principle of maximum entropy and (2) the sum and product rules that follow from the axioms of Cox. These are the fundamental ideas. All of the standard results of statistical sampling theory follow as special cases when necessary. Lack of repeatability is only one component of uncertainty. Ultimately, any physical measurement will be limited by uncertainty in the realization of the unit and will reduce to a set of Type B assumed distributions best estimated by the method of maximum entropy.

9. Acknowledgments

It is a sincere pleasure to thank the many individuals who made valuable comments and suggestions based on earlier drafts of this paper, and for their contributions and constructive criticisms that helped to guide its revision. In particular, I am deeply indebted to W. Wöger (PTB - Braunschweig) and S. Sartori (IMGC - CNR, Torino) for their very thorough critiques of the revised manuscript and their detailed suggestions for clarifying and correcting substantial portions of the paper. My gratitude is also extended to [* denotes CIRP member]:

• J. Bryan* - Pleasanton, CA USA
• K. Bowen* - Bede Scientific, Englewood, CO USA
• D. Banks - NIST, Gaithersburg, MD USA
• D. DeBra* - Stanford University, Stanford, CA USA
• T. Doiron - NIST, Gaithersburg, MD USA
• T. Charlton* - Brown and Sharpe, North Kingston, RI USA
• C. Evans* - NIST, Gaithersburg, MD USA
• R. Hocken* - University of North Carolina - Charlotte, NC USA
• H. Kunzmann* - PTB, Braunschweig, Germany
• R. Levi* - Politecnico di Torino, Italy
• D. Lucca* - Oklahoma State University, Stillwater, OK USA
• P. McKeown* - Cranfield University, Cranfield, United Kingdom
• J. Meijer* - University of Twente, Enschede, Netherlands
• E. Pardo - NPL, Teddington, United Kingdom
• G. Peggs* - NPL, Teddington, United Kingdom
• J. Peters* - Instituut voor Werktuigkunde, Heverlee, Belgium
• S. Phillips - NIST, Gaithersburg, MD USA
• J. Potzick - NIST, Gaithersburg, MD USA
• J. Raja - University of North Carolina - Charlotte, NC USA

I would also like to acknowledge the late Professor L. R. Wilcox of the State University of New York at Stony Brook and the late Dr. C. E. Kuyatt of NIST/Gaithersburg for their essential contributions to my understanding of the nature of probability and uncertainty.

10. References

[1] Baierlein, R., 1971, Atoms and Information Theory, W. H. Freeman, San Francisco.
[2] Box, G. E. P. and Tiao, G. C., 1973, Bayesian Inference in Statistical Analysis, Wiley Classics Library Ed. 1992, J. Wiley and Sons, New York.
[3] Bryan, J. B., 1993, The Deterministic Approach in Metrology and Manufacturing, Int. Forum on Dimensional Tolerancing and Metrology, ASME, Dearborn, Michigan.
[4] Cox, R. T., 1946, Probability, Frequency, and Reasonable Expectation, Am. J. Phys., 14: 1-13.
[5] Cox, R. T., 1961, The Algebra of Probable Inference, Johns Hopkins Press, Baltimore.
[6] Donaldson, R. R., 1972, The Deterministic Approach to Machining Accuracy, Soc. Mech. Eng. Fabrication Technology Symposium, Golden, Colorado.
[7] Estler, W. T., 1997, A Distribution-Independent Bound on the Level of Confidence in the Result of a Measurement, J. Res. Natl. Inst. Stand. Technol. 102, 587-88.
[8] Fröhner, F. H., 1989, Bayesian Evaluation of Discrepant Experimental Data, in Maximum Entropy and Bayesian Methods, J. Skilling, ed., Kluwer Academic Publishers, Dordrecht, Netherlands.
[9] Garrett, A. J. M. and Fisher, D. J., 1992, Combining Data from Different Experiments: Bayesian Analysis and Meta-analysis, in Maximum Entropy and Bayesian Methods, Seattle 1991, C. R. Smith et al., eds., Kluwer Academic Publishers, Dordrecht, Netherlands, 273-86.
[10] International Organization for Standardization (ISO), 1995, Guide to the Expression of Uncertainty in Measurement, ISO, Geneva.
[11] International Organization for Standardization (ISO), 1998, International Standard 14253-1, Geometrical Product Specifications (GPS) - Part 1: Decision rules for proving conformance or non-conformance with specification.

[12] Jaynes, E. T., 1994, Probability Theory: The Logic ofScience, preliminary version at ftp://bayes.wustl.edu/

pub/Jaynes/book.probability.theory.[13] Jaynes, E. T., 1989, Papers on Probability, Statistics,

and Statistical Physics, R. D. Rosenkrantz, Ed., D.Kluwer Academic Publishers, Dordrecht, Netherlands

[14] Jaynes, E. T., 1968, Prior Probabilities, IEEE Trans.Syst. Sci. and Cybernetics, Vol. SSC-4, 227-41.[reprinted in Ref. 13.]

[15] Jaynes, E. T., 1957, Information Theory andStatistical Mechanics, I, II, Phys. Rev. 106, 620-30,108, 171-90. [reprinted in Ref. 13.]

[16] Jeffreys, H., 1967, Theory of Probability, ClarendonPress, Oxford.

[17] Jessop, A., 1995, Informed Assessments - AnIntroduction to Information, Entropy, and Statistics,Ellis Horwood, London.

[18] Keynes, J. M., 1921, A Treatise on Probability,Macmillan, London.

[19] Kolmogorov, A. N., 1950, Foundations of the Theory of Probability, Chelsea Publishing Co., New York.

[20] Kyburg Jr., H. E. and Smokler, H. E., eds., 1964, Studies in Subjective Probability, John Wiley and Sons, New York.

[21] Lindley, D. V., 1990, The 1988 Wald Memorial Lectures: The Present Position in Bayesian Statistics, Stat. Sci. 5, No. 1, 44-89.

[22] Lindley, D. V., 1985, Making Decisions, 2nd Ed., John Wiley and Sons, London.

[23] Messina, W. S., 1987, Statistical Quality Control for Manufacturing Managers, John Wiley and Sons, New York.

[24] Patterson, S. R., 1996, Treatment of Errors and Uncertainty, Tutorial Notes, American Society for Precision Engineering, Raleigh, North Carolina.

[25] Phillips, S. D., Estler, W. T., Levenson, M. S., and Eberhardt, K. R., 1998, Calculation of Measurement Uncertainty Using Prior Information, J. Res. Natl. Inst. Stand. Technol. 103, 625-32.

[26] Polya, G., 1954, Mathematics and Plausible Reasoning, 2 Vols., Princeton University Press.

[27] Schlaifer, R., 1959, Probability and Statistics for Business Decisions, McGraw-Hill, New York.

[28] Savage, I. R., 1961, Probability Inequalities of the Tchebyscheff Type, J. Res. Natl. Bur. Stand. 65B, 211-22.

[29] Shannon, C. E. and Weaver, W., 1963, The Mathematical Theory of Communication, Univ. of Illinois Press, Urbana, Illinois.

[30] Sivia, D. S., 1996, Data Analysis - A Bayesian Tutorial, Clarendon Press, Oxford.

[31] Smith, C. R. and Erickson, G., 1989, From Rationality and Consistency to Bayesian Probability, in Maximum Entropy and Bayesian Methods, Kluwer Academic Publishers, Dordrecht, Netherlands.

[32] Tribus, M., 1969, Rational Descriptions, Decisions, and Designs, Pergamon Press, New York.

[33] Weise, K. and Wöger, W., 1992, A Bayesian Theory of Measurement Uncertainty, Meas. Sci. Technol. 3, 1-11.

[34] Williams, R. H. and Hawkins, C. F., 1993, The Economics of Guardband Placement, Proc. 24th IEEE International Test Conference, Baltimore, USA.

[35] Wöger, W., 1987, Probability Assignment to Systematic Deviations by the Principle of Maximum Entropy, IEEE Trans. Inst. Meas., Vol. IM-36, 655-58.