
    Data analysis recipes:

Fitting a model to data

David W. Hogg
Center for Cosmology and Particle Physics, Department of Physics, New York University
Max-Planck-Institut für Astronomie, Heidelberg

Jo Bovy
Center for Cosmology and Particle Physics, Department of Physics, New York University

Dustin Lang
Department of Computer Science, University of Toronto
Princeton University Observatory

    Abstract

We go through the many considerations involved in fitting a model to data, using as an example the fit of a straight line to a set of points in a two-dimensional plane. Standard weighted least-squares fitting is only appropriate when there is a dimension along which the data points have negligible uncertainties, and another along which all the uncertainties can be described by Gaussians of known variance; these conditions are rarely met in practice. We consider cases of general, heterogeneous, and arbitrarily covariant two-dimensional uncertainties, and situations in which there are bad data (large outliers), unknown uncertainties, and unknown but expected intrinsic scatter in the linear relationship being fit. Above all we emphasize the importance of having a generative model for the data, even an approximate one. Once there is a generative model, the subsequent fitting is non-arbitrary because the model permits direct computation of the likelihood of the parameters or the posterior probability distribution. Construction of a posterior probability distribution is indispensable if there are nuisance parameters to marginalize away.

It is conventional to begin any scientific document with an introduction that explains why the subject matter is important. Let us break with tradition and observe that in almost all cases in which scientists fit a straight line to their data, they are doing something that is simultaneously wrong and unnecessary. It is wrong because circumstances in which a set of two-dimensional measurements (outputs from an observation, experiment, or calculation) are truly drawn from a narrow, linear relationship are exceedingly rare.

    The notes begin on page 39, including the license1 and the acknowledgements2.

arXiv:1008.4686v1 [astro-ph.IM] 27 Aug 2010


Indeed, almost any transformation of coordinates renders a truly linear relationship non-linear. Furthermore, even if a relationship looks linear, unless there is a confidently held theoretical reason to believe that the data are generated from a linear relationship, it probably isn't linear in detail; in these cases fitting with a linear model can introduce substantial systematic error, or generate apparent inconsistencies among experiments that are intrinsically consistent.

Even if the investigator doesn't care that the fit is wrong, it is likely to be unnecessary. Why? Because it is rare that, given a complicated observation, experiment, or calculation, the important result of that work, to be communicated forward in the literature and seminars, is the slope and intercept of a best-fit line! Usually the full distribution of data is much more rich, informative, and important than any simple metrics made by fitting an overly simple model.

That said, it must be admitted that one of the most effective ways to communicate scientific results is with catchy punchlines and compact, approximate representations, even when those are unjustified and unnecessary. It can also sometimes be useful to fit a simple model to predict new data, given existing data, even in the absence of a physical justification for the fit.3

For these reasons (and because in rare cases the fit is justifiable and essential) the problem of fitting a line to data comes up very frequently in the life of a scientist.

It is a miracle with which we hope everyone reading this is familiar that if you have a set of two-dimensional points (x, y) that depart from a perfect, narrow, straight line y = m x + b only by the addition of Gaussian-distributed noise of known amplitudes in the y direction only, then the maximum-likelihood or best-fit line for the points has a slope m and intercept b that can be obtained justifiably by a perfectly linear matrix-algebra operation known as weighted linear least-square fitting. This miracle deserves contemplation.

Once any of the input assumptions is violated (and note that there are many input assumptions), all bets are off, and there are no consensus methods used by scientists across disciplines. Indeed, there is a large literature that implies that the violation of the assumptions opens up to the investigator a wide range of possible options, with little but aesthetics to distinguish them! Most of these methods of "linear regression" (a term we avoid) performed


in science are either unjustifiable or else unjustified in almost all situations in which they are used.4

Perhaps because of this plethora of options, or perhaps because there is no agreed-upon bible or cookbook to turn to, or perhaps because most investigators would rather make some stuff up that works "good enough" under deadline, or perhaps because many realize, deep down, that much fitting is really unnecessary anyway, there are some egregious procedures and associated errors and absurdities in the literature.5

When an investigator cites, relies upon, or transmits a best-fit slope or intercept reported in the literature, he or she may be propagating more noise than signal.

It is one of the objectives of this document to promote an understanding that if the data can be modeled statistically there is little arbitrariness. It is another objective to promote consensus; when the choice of method is not arbitrary, consensus will emerge automatically. We present below simple, straightforward, comprehensible, and (above all) justifiable methods for fitting a straight line to data with general, non-trivial, and uncertain properties.

This document is part polemic (though we have tried to keep most of the polemic in the notes), part attempt at establishing standards, and part crib sheet for our own future re-use. We apologize in advance for some astrophysics bias and failure to review basic ideas of linear algebra, function optimization, and practical statistics. Very good reviews of the latter exist in abundance.6 We also apologize for not reviewing the literature; this is a cheat-sheet, not a survey. Our focus is on the specific problems of linear fitting that we so often face; the general ideas appear in the context of these concrete and relatively realistic example problems. The reader is encouraged to do the Exercises; many of them ask you to produce plots that can be compared with the Figures.

    1 Standard practice

You have a set of N > 2 points (x_i, y_i), with known Gaussian uncertainties σ_yi in the y direction, and no uncertainty at all (that is, perfect knowledge) in the x direction. You want to find the function f(x) of the form

f(x) = m x + b ,   (1)

where m is the slope and b is the intercept, that best fits the points. What is meant by the term "best fit" is, of course, very important; we are making a


    choice here to which we will return. For now, we describe standard practice:

Construct the matrices

\[
Y = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_N \end{bmatrix} \quad , \quad (2)
\]

\[
A = \begin{bmatrix} 1 & x_1 \\ 1 & x_2 \\ \vdots & \vdots \\ 1 & x_N \end{bmatrix} \quad , \quad (3)
\]

\[
C = \begin{bmatrix} \sigma_{y1}^2 & 0 & \cdots & 0 \\ 0 & \sigma_{y2}^2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \sigma_{yN}^2 \end{bmatrix} \quad , \quad (4)
\]

where one might call Y a vector, and the covariance matrix C could be generalized to the non-diagonal case in which there are covariances among the uncertainties of the different points. The best-fit values for the parameters m and b are just the components of a column vector X found by

\[
\begin{bmatrix} b \\ m \end{bmatrix} = X = \left[ A^\top C^{-1} A \right]^{-1} \left[ A^\top C^{-1} Y \right] \quad . \quad (5)
\]

This is actually the simplest thing that can be written down that is linear, obeys matrix multiplication rules, and has the right relative sensitivity to data of different statistical significance. It is not just the simplest thing; it is the correct thing, when the assumptions hold.7 It can be justified in one of several ways; the linear algebra justification starts by noting that you want to solve the equation

\[
Y = A X \quad , \quad (6)
\]

but you can't because that equation is over-constrained. So you weight everything with the inverse of the covariance matrix (as you would if you were doing, say, a weighted average), and then left-multiply everything by A^T to reduce the dimensionality, and then equation (5) is the solution of that reduced-dimensionality equation.

This procedure is not arbitrary; it minimizes an objective function8 χ² (chi-squared), which is the total squared error, scaled by the uncertainties,


or

\[
\chi^2 = \sum_{i=1}^{N} \frac{[y_i - f(x_i)]^2}{\sigma_{yi}^2} = [Y - A X]^\top C^{-1} [Y - A X] \quad , \quad (7)
\]

that is, equation (5) yields the values for m and b that minimize χ². Conceptually, χ² is like a metric distance in the data space.9 This, of course, is only one possible meaning of the phrase "best fit"; that issue is addressed more below.

When the uncertainties are Gaussian and their variances σ²_yi are correctly estimated, the matrix [A^T C^{-1} A]^{-1} that appears in equation (5) is just the covariance matrix (Gaussian uncertainty, not error;10 variances on the diagonal, covariances off the diagonal) for the parameters in X. We will have more to say about this in Section 4.
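The matrix algebra of equations (2) through (5) is compact enough to implement directly. The following is a minimal sketch (in Python with NumPy; the function and variable names are ours, not part of the original text) that computes X = (b, m) and the matrix [A^T C^{-1} A]^{-1}, assuming x, y, and sigma_y are equal-length one-dimensional arrays.

    import numpy as np

    def weighted_least_squares(x, y, sigma_y):
        # Standard weighted least squares of equations (2)-(5).
        # Returns the parameter vector X = (b, m) and the covariance
        # matrix [A^T C^-1 A]^-1 discussed in Section 4 (equation 18).
        Y = np.asarray(y, dtype=float)
        A = np.vander(np.asarray(x, dtype=float), 2, increasing=True)  # columns [1, x]
        C = np.diag(np.asarray(sigma_y, dtype=float) ** 2)             # diagonal covariance
        Cinv = np.linalg.inv(C)
        cov = np.linalg.inv(A.T @ Cinv @ A)
        X = cov @ (A.T @ Cinv @ Y)
        return X, cov

Under these assumptions, the diagonal of the returned covariance matrix gives σ²_b and σ²_m, in that order.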

Exercise 1: Using the standard linear algebra method of this Section, fit the straight line y = m x + b to the x, y, and σ_y values for data points 5 through 20 in Table 1 on page 6. That is, ignore the first four data points, and also ignore the columns for σ_x and ρ_xy. Make a plot showing the points, their uncertainties, and the best-fit line. Your plot should end up looking like Figure 1. What is the standard uncertainty variance σ²_m on the slope of the line?

Exercise 2: Repeat Exercise 1 but for all the data points in Table 1 on page 6. Your plot should end up looking like Figure 2. What is the standard uncertainty variance σ²_m on the slope of the line? Is there anything you don't like about the result? Is there anything different about the new points you have included beyond those used in Exercise 1?

Exercise 3: Generalize the method of this Section to fit a general quadratic (second-order) relationship. Add another column to matrix A containing the values x²_i, and another element to vector X (call it q). Then re-do Exercise 1 but fitting for and plotting the best quadratic relationship

g(x) = q x² + m x + b .   (8)

Your plot should end up looking like Figure 3.


Table 1.

ID    x    y   σ_y  σ_x  ρ_xy
 1  201  592   61    9  -0.84
 2  244  401   25    4   0.31
 3   47  583   38   11   0.64
 4  287  402   15    7  -0.27
 5  203  495   21    5  -0.33
 6   58  173   15    9   0.67
 7  210  479   27    4  -0.02
 8  202  504   14    4  -0.05
 9  198  510   30   11  -0.84
10  158  416   16    7  -0.69
11  165  393   14    5   0.30
12  201  442   25    5  -0.46
13  157  317   52    5  -0.03
14  131  311   16    6   0.50
15  166  400   34    6   0.73
16  160  337   31    5  -0.52
17  186  423   42    9   0.90
18  125  334   26    8   0.40
19  218  533   16    6  -0.78
20  146  344   22    5  -0.56

Note. The full uncertainty covariance matrix for each data point is given by

\[
\begin{bmatrix} \sigma_x^2 & \rho_{xy}\,\sigma_x\,\sigma_y \\ \rho_{xy}\,\sigma_x\,\sigma_y & \sigma_y^2 \end{bmatrix} \quad .
\]
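For readers working the Exercises numerically, the contents of Table 1 can be transcribed directly into arrays. The sketch below (Python/NumPy, column order as in the table; variable names are ours) is simply that transcription.

    import numpy as np

    # Columns of Table 1: ID, x, y, sigma_y, sigma_x, rho_xy
    data = np.array([
        [ 1, 201, 592, 61,  9, -0.84],
        [ 2, 244, 401, 25,  4,  0.31],
        [ 3,  47, 583, 38, 11,  0.64],
        [ 4, 287, 402, 15,  7, -0.27],
        [ 5, 203, 495, 21,  5, -0.33],
        [ 6,  58, 173, 15,  9,  0.67],
        [ 7, 210, 479, 27,  4, -0.02],
        [ 8, 202, 504, 14,  4, -0.05],
        [ 9, 198, 510, 30, 11, -0.84],
        [10, 158, 416, 16,  7, -0.69],
        [11, 165, 393, 14,  5,  0.30],
        [12, 201, 442, 25,  5, -0.46],
        [13, 157, 317, 52,  5, -0.03],
        [14, 131, 311, 16,  6,  0.50],
        [15, 166, 400, 34,  6,  0.73],
        [16, 160, 337, 31,  5, -0.52],
        [17, 186, 423, 42,  9,  0.90],
        [18, 125, 334, 26,  8,  0.40],
        [19, 218, 533, 16,  6, -0.78],
        [20, 146, 344, 22,  5, -0.56],
    ])
    x, y = data[:, 1], data[:, 2]
    sigma_y, sigma_x, rho_xy = data[:, 3], data[:, 4], data[:, 5]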


[Figure 1 here: data points 5 through 20 with error bars and the best-fit line y = (2.24 ± 0.11) x + (34 ± 18).]

Figure 1. Partial solution to Exercise 1: The standard weighted least-square fit.

[Figure 2 here: the full data set with error bars and the best-fit line y = (1.08 ± 0.08) x + (213 ± 14).]

Figure 2. Partial solution to Exercise 2: The standard weighted least-square fit but now on a larger data set.


[Figure 3 here: the data with the best-fit quadratic y = (0.0023 ± 0.0020) x² + (1.60 ± 0.58) x + (73 ± 39).]

Figure 3. Partial solution to Exercise 3: The standard weighted least-square fit but now with a second-order (quadratic) model.

    2 The objective function

A scientist's justification of equation (5) cannot appeal purely to abstract ideas of linear algebra, but must originate from the scientific question at hand. Here and in what follows, we will advocate that the only reliable procedure is to use all one's knowledge about the problem to construct a (preferably) justified, (necessarily) scalar (or, at a minimum, one-dimensional) objective function that represents monotonically the quality of the fit.11 In this framework, fitting anything to anything involves a scientific question about the objective function representing "goodness of fit" and then a separate and subsequent engineering question about how to find the optimum and, possibly, the posterior probability distribution function around that optimum.12 Note that in the previous Section we did not proceed according to these rules; indeed, the procedure was introduced prior to the objective function, and the objective function was not justified.

In principle, there are many choices for objective function. But the only procedure that is truly justified, in the sense that it leads to interpretable probabilistic inference, is to make a generative model for the data. A generative model is a parameterized, quantitative description of a statistical


procedure that could reasonably have generated the data set you have. In the case of the straight-line fit in the presence of known, Gaussian uncertainties in one dimension, one can create this generative model as follows: Imagine that the data really do come from a line of the form y = f(x) = m x + b, and that the only reason that any data point deviates from this perfect, narrow, straight line is that to each of the true y values a small y-direction offset has been added, where that offset was drawn from a Gaussian distribution of zero mean and known variance σ²_y. In this model, given an independent position x_i, an uncertainty σ_yi, a slope m, and an intercept b, the frequency distribution p(y_i | x_i, σ_yi, m, b) for y_i is

\[
p(y_i \,|\, x_i, \sigma_{yi}, m, b) = \frac{1}{\sqrt{2\pi\,\sigma_{yi}^2}} \exp\left( -\frac{[y_i - m\,x_i - b]^2}{2\,\sigma_{yi}^2} \right) \quad , \quad (9)
\]

where this gives the expected frequency (in a hypothetical set of repeated experiments13) of getting a value in the infinitesimal range [y_i, y_i + dy] per unit dy.

The generative model provides us with a natural, justified, scalar objective: We seek the line (parameters m and b) that maximizes the probability of the observed data given the model or (in standard parlance) the likelihood of the parameters.14 In our generative model the data points are independently drawn (implicitly), so the likelihood L is the product of conditional probabilities

\[
L = \prod_{i=1}^{N} p(y_i \,|\, x_i, \sigma_{yi}, m, b) \quad . \quad (10)
\]

Taking the logarithm,

\[
\ln L = K - \sum_{i=1}^{N} \frac{[y_i - m\,x_i - b]^2}{2\,\sigma_{yi}^2} = K - \frac{1}{2}\,\chi^2 \quad , \quad (11)
\]

where K is some constant. This shows that likelihood maximization is identical to χ² minimization, and we have justified, scientifically, the procedure of the previous Section.

The Bayesian generalization of this is to say that

\[
p(m, b \,|\, \{y_i\}_{i=1}^{N}, I) = \frac{p(\{y_i\}_{i=1}^{N} \,|\, m, b, I)\; p(m, b \,|\, I)}{p(\{y_i\}_{i=1}^{N} \,|\, I)} \quad , \quad (12)
\]


where m and b are the model parameters, {y_i}_{i=1}^N is a short-hand for all the data y_i, I is a short-hand for all the prior knowledge of the x_i and the σ_yi and everything else about the problem15, p(m, b | {y_i}_{i=1}^N, I) is the posterior probability distribution for the parameters given the data and the prior knowledge, p({y_i}_{i=1}^N | m, b, I) is the likelihood L just computed in equation (10) (which is a frequency distribution for the data), p(m, b | I) is the prior probability distribution for the parameters that represents all knowledge except the data {y_i}_{i=1}^N, and the denominator can, for our purposes, be thought of as a normalization constant, obtained by marginalizing the numerator over all parameters. Unless the prior p(m, b | I) has a lot of variation in the region of interest, the posterior distribution function p(m, b | {y_i}_{i=1}^N, I) here is going to look very similar to the likelihood function in equation (10) above. That is, with an uninformative prior, the straight line that maximizes the likelihood will come extremely close to maximizing the posterior probability distribution function; we will return to this later.

We have succeeded in justifying the standard method as optimizing a justified, scalar objective function: the likelihood of a generative model for the data. It is just the great good luck of Gaussian distributions that the objective can be written as a quadratic function of the data. This makes the optimum a linear function of the data, and therefore trivial to obtain. This is a miracle.16
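As a concrete illustration of equations (9) through (11), the log-likelihood of the straight-line parameters under this generative model fits in a few lines; this is a sketch with names of our own choosing, and the constant K is dropped because it does not depend on (m, b).

    import numpy as np

    def ln_likelihood(m, b, x, y, sigma_y):
        # ln L of equation (11) up to the constant K; equal to -chi^2 / 2,
        # so maximizing it is the same as minimizing chi^2.
        resid = y - (m * x + b)
        return -0.5 * np.sum(resid ** 2 / sigma_y ** 2)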

Exercise 4: Imagine a set of N measurements t_i, with uncertainty variances σ²_ti, all of the same (unknown) quantity T. Assuming the generative model that each t_i differs from T by a Gaussian-distributed offset, taken from a Gaussian with zero mean and variance σ²_ti, write down an expression for the log likelihood ln L for the data given the model parameter T. Take a derivative and show that the maximum-likelihood value for T is the usual weighted mean.

Exercise 5: Take the matrix formulation for χ² given in equation (7) and take derivatives to show that the minimum is at the matrix location given in equation (5).


    3 Pruning outliers

The standard linear fitting method is very sensitive to outliers, that is, points that are substantially farther from the linear relation than expected, or not on the relation at all, because of unmodeled experimental uncertainty or unmodeled but rare sources of noise (or because the model has a restricted scope and doesn't apply to all the data). There are two general approaches to mitigating this sensitivity, which are not necessarily different. The first is to find ways to objectively remove, reject, or become insensitive to bad points; this is the subject of this Section. The second is to better model your data-point uncertainties, permitting larger deviations than the Gaussian noise estimates; this is the subject of Section 5. Both of these are strongly preferable to sorting through the data and rejecting points by hand, for reasons of subjectivity and irreproducibility that we need not state. They are also preferable to the standard (in astrophysics anyway) procedure known as "sigma clipping", which is a procedure and not the outcome of justifiable modeling.17

If we want to explicitly and objectively reject bad data points, we must add to the problem a set of N binary integers q_i, one per data point, each of which is unity if the ith data point is good, and zero if the ith data point is bad.18 In addition, to construct an objective function, one needs a parameter P_b, the prior probability that any individual data point is bad, and parameters (Y_b, V_b), the mean and variance of the distribution of bad points (in y). We need these latter parameters because, as we have emphasized above, our inference comes from evaluating a probabilistic generative model for the data, and that model must generate the bad points as well as the good points!

All these N + 3 extra parameters ({q_i}_{i=1}^N, P_b, Y_b, V_b) may seem like crazy baggage, but their values can be inferred and marginalized out, so in the end we will be left with an inference just of the line parameters (m, b). In general, there is no reason to be concerned just because you have more parameters than data.19 However, the marginalization will require that we have a measure on our parameters (integrals require measures) and that measure is provided by a prior. That is, the marginalization will require, technically, that we become Bayesian; more on this below.


    In this case, the likelihood is

\[
L \equiv p(\{y_i\}_{i=1}^{N} \,|\, m, b, \{q_i\}_{i=1}^{N}, Y_b, V_b, I)
\]
\[
L = \prod_{i=1}^{N} \left[ p_{\mathrm{fg}}(\{y_i\}_{i=1}^{N} \,|\, m, b, I) \right]^{q_i} \left[ p_{\mathrm{bg}}(\{y_i\}_{i=1}^{N} \,|\, Y_b, V_b, I) \right]^{[1 - q_i]}
\]
\[
L = \prod_{i=1}^{N} \left[ \frac{1}{\sqrt{2\pi\,\sigma_{yi}^2}} \exp\left( -\frac{[y_i - m\,x_i - b]^2}{2\,\sigma_{yi}^2} \right) \right]^{q_i} \left[ \frac{1}{\sqrt{2\pi\,[V_b + \sigma_{yi}^2]}} \exp\left( -\frac{[y_i - Y_b]^2}{2\,[V_b + \sigma_{yi}^2]} \right) \right]^{[1 - q_i]} \quad , \quad (13)
\]

where p_fg(·) and p_bg(·) are the generative models for the foreground (good, or straight-line) and background (bad, or outlier) points. Because we are permitting data rejection, there is an important prior probability on the {q_i}_{i=1}^N that penalizes each rejection:

\[
p(m, b, \{q_i\}_{i=1}^{N}, P_b, Y_b, V_b \,|\, I) = p(\{q_i\}_{i=1}^{N} \,|\, P_b, I)\; p(m, b, P_b, Y_b, V_b \,|\, I)
\]
\[
p(\{q_i\}_{i=1}^{N} \,|\, P_b, I) = \prod_{i=1}^{N} [1 - P_b]^{q_i}\, P_b^{[1 - q_i]} \quad , \quad (14)
\]

that is, the binomial probability of the particular sequence {q_i}_{i=1}^N. The priors on the other parameters (P_b, Y_b, V_b) should either be set according to an analysis of your prior knowledge or else be uninformative (as in flat in P_b in the range [0, 1], flat in Y_b in some reasonable range, and flat in the logarithm of V_b in some reasonable range). If your data are good, it won't matter whether you choose uninformative priors or priors that really represent your prior knowledge; in the end good data dominate either.

We have made a somewhat restrictive assumption that all data points are equally likely a priori to be bad, but you rarely know enough about the badness of your data to assume anything better.20 The fact that the prior probabilities P_b are all set equal, however, does not mean that the posterior probabilities that individual points are bad will be at all similar. Indeed, this model permits an objective ranking (and hence public embarrassment) of different contributions to any data set.21

We have used in this likelihood formulation a Gaussian model for the bad points; is that permitted? Well, we don't know much about the bad


points (that is one of the things that makes them bad), so this Gaussian model must be wrong in detail. But the power of this and the methods to follow comes not from making an accurate model of the outliers; it comes simply from modeling them. If you have an accurate or substantiated model for your bad data, use it by all means. If you don't, then because the Gaussian is the maximum-entropy distribution described by a mean and a variance, it is, in some sense, the least restrictive assumption, given that we have allowed that mean and variance to vary in the fitting (and, indeed, we will marginalize over both in the end).

The posterior probability distribution function, in general, is the likelihood times the prior, properly normalized. In this case, the posterior we (tend to) care about is that for the line parameters (m, b). This is the full posterior probability distribution marginalized over the bad-data parameters ({q_i}_{i=1}^N, P_b, Y_b, V_b). Defining θ ≡ (m, b, {q_i}_{i=1}^N, P_b, Y_b, V_b) for brevity, the full posterior is

\[
p(\theta \,|\, \{y_i\}_{i=1}^{N}, I) = \frac{p(\{y_i\}_{i=1}^{N} \,|\, \theta, I)}{p(\{y_i\}_{i=1}^{N} \,|\, I)}\; p(\theta \,|\, I) \quad , \quad (15)
\]

where the denominator can, for our purposes here, be thought of as a normalization integral that makes the posterior distribution function integrate to unity. The marginalization looks like

\[
p(m, b \,|\, \{y_i\}_{i=1}^{N}, I) = \int \mathrm{d}\{q_i\}_{i=1}^{N}\, \mathrm{d}P_b\, \mathrm{d}Y_b\, \mathrm{d}V_b\; p(\theta \,|\, \{y_i\}_{i=1}^{N}, I) \quad , \quad (16)
\]

where by the integral over d{q_i}_{i=1}^N we really mean a sum over all of the 2^N possible settings of the q_i, and the integrals over the other parameters are over their entire prior support. Effectively, then, this marginalization involves evaluating 2^N different likelihoods and marginalizing each one of them over (P_b, Y_b, V_b)! Of course, because very few points are true candidates for rejection, there are many ways to trim the tree and do this more quickly, but thought is not necessary: Marginalization is in fact analytic.

Marginalization of this (what we will call "exponential") model leaves a much more tractable (what we will call "mixture") model. Imagine marginalizing over an individual q_i. After this marginalization, the ith point can be thought of as being drawn from a mixture (sum with amplitudes that sum to unity) of the straight-line and outlier distributions. That is, the


generative model for the data {y_i}_{i=1}^N integrates (or sums, really) to

\[
L \equiv p(\{y_i\}_{i=1}^{N} \,|\, m, b, P_b, Y_b, V_b, I)
\]
\[
L = \prod_{i=1}^{N} \left[ (1 - P_b)\, p_{\mathrm{fg}}(\{y_i\}_{i=1}^{N} \,|\, m, b, I) + P_b\, p_{\mathrm{bg}}(\{y_i\}_{i=1}^{N} \,|\, Y_b, V_b, I) \right]
\]
\[
L = \prod_{i=1}^{N} \left[ \frac{1 - P_b}{\sqrt{2\pi\,\sigma_{yi}^2}} \exp\left( -\frac{[y_i - m\,x_i - b]^2}{2\,\sigma_{yi}^2} \right) + \frac{P_b}{\sqrt{2\pi\,[V_b + \sigma_{yi}^2]}} \exp\left( -\frac{[y_i - Y_b]^2}{2\,[V_b + \sigma_{yi}^2]} \right) \right] \quad , \quad (17)
\]

where p_fg(·) and p_bg(·) are the generative models for the foreground and background points, P_b is the probability that a data point is bad (or, more properly, the amplitude of the bad-data distribution function in the mixture), and (Y_b, V_b) are parameters of the bad-data distribution.

Marginalization of the posterior probability distribution function generated by the product of the mixture likelihood in equation (17) times a prior on (m, b, P_b, Y_b, V_b) does not involve exploring an exponential number of alternatives. An integral over only three parameters produces the marginalized posterior probability distribution function for the straight-line parameters (m, b). This scale of problem is very well suited to simple multi-dimensional integrators; in particular it works very well with a simple Markov-Chain Monte Carlo, which we advocate below.
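To make the mixture likelihood of equation (17) concrete, here is one way to evaluate its logarithm for a given parameter set; this is a sketch under our own naming choices, and it works with the logarithms of the two mixture components for numerical stability (np.logaddexp adds two probabilities given their logs).

    import numpy as np

    def ln_mixture_likelihood(params, x, y, sigma_y):
        # ln of the mixture likelihood of equation (17).
        # params = (m, b, P_b, Y_b, V_b): slope, intercept, outlier fraction,
        # and mean and variance of the outlier (background) distribution.
        m, b, P_b, Y_b, V_b = params
        if not (0.0 < P_b < 1.0) or V_b <= 0.0:
            return -np.inf  # outside the assumed prior support
        var_fg = sigma_y ** 2
        var_bg = V_b + sigma_y ** 2
        ln_fg = (np.log(1.0 - P_b) - 0.5 * np.log(2.0 * np.pi * var_fg)
                 - 0.5 * (y - m * x - b) ** 2 / var_fg)
        ln_bg = (np.log(P_b) - 0.5 * np.log(2.0 * np.pi * var_bg)
                 - 0.5 * (y - Y_b) ** 2 / var_bg)
        return np.sum(np.logaddexp(ln_fg, ln_bg))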

We have lots to discuss. We have gone from a procedure (Section 1) to an objective function (Section 2) to a general expression for a non-trivial posterior probability distribution function involving priors and multi-dimensional integration. We will treat each of these issues in turn: We must set the prior, we must find a way to integrate to marginalize the distribution function, and then we have to decide what to do with the marginalized posterior: Optimize it or sample from it?

It is unfortunate that prior probabilities must be specified. Conventional frequentists take this problem to be so severe that for some it invalidates Bayesian approaches entirely! But a prior is a necessary condition for marginalization, and marginalization is necessary whenever extra parameters are introduced beyond the two (m, b) that we care about for our straight-line fit.22 Furthermore, in any situation of straight-line fitting where the


data are numerous and generally good, these priors do not matter very much to the final answer. The way we have written our likelihoods, the peak in the posterior distribution function becomes very directly analogous to what you might call the maximum-likelihood answer when the (improper, non-normalizable) prior "flat in m, flat in b" is adopted, and something sensible has been chosen for (P_b, Y_b, V_b). Usually the investigator does have some basis for setting more informative priors, but a deep and sensible discussion of priors is beyond the scope of this document. Suffice it to say, in this situation (pruning bad data) the setting of priors is the least of one's problems.

The second unfortunate thing about our problem is that we must marginalize. It is also a fortunate thing that we can marginalize. As we have noted, marginalization is necessary if you want to get estimates with uncertainties or confidence intervals in the straight-line parameters (m, b) that properly account for their covariances with the nuisance parameters (P_b, Y_b, V_b). This marginalization can be performed by direct numerical integration (think gridding up the nuisance parameters, evaluating the posterior probability at every grid point, and summing), or it can be performed with some kind of sampler, such as a Markov-Chain Monte Carlo. We strongly recommend the latter because in situations where the integral is of a probability distribution, not only does the MCMC scheme integrate it, it also produces a sampling from the distribution as a side-product for free.

The standard Metropolis-Hastings MCMC procedure works extremely well in this problem for both the optimization and the integration and sampling. A full discussion of this algorithm is beyond the scope of this document, but briefly the algorithm is this simple: Starting at some position in parameter space, where you know the posterior probability, iterate the following: (a) Step randomly in parameter space. (b) Compute the new posterior probability at the new position in parameter space. (c) Draw a random number R in the range 0 < R < 1. (d) If R is less than the ratio of the new posterior probability to the old (pre-step) probability, accept the step and add the new parameters to the chain; if it is greater, reject the step and re-add the old parameters to the chain. The devil is in the details, but good descriptions abound23, and with a small amount of experimentation (try Gaussians for the proposal distribution and scale them until the acceptance ratio is about half) rapid convergence and sampling is possible in this problem. A large class of MCMC algorithms exists, many of which outperform the simple Metropolis algorithm with little extra effort.24
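The steps (a) through (d) above fit in a dozen lines. The following is an illustrative sketch only (the log-posterior function, the starting point, and the proposal scales are placeholders that the user must supply and tune), working with log probabilities so that the accept/reject comparison of step (d) becomes a difference of logs.

    import numpy as np

    def metropolis(ln_posterior, start, scales, n_steps, rng=None):
        # Simple Metropolis sampler with a Gaussian proposal distribution.
        # ln_posterior(theta) returns the log posterior at parameter vector theta;
        # start is the initial parameter vector, scales the per-parameter step sizes.
        rng = np.random.default_rng() if rng is None else rng
        chain = [np.asarray(start, dtype=float)]
        ln_p = ln_posterior(chain[-1])
        for _ in range(n_steps):
            trial = chain[-1] + scales * rng.normal(size=len(chain[-1]))  # (a) random step
            ln_p_trial = ln_posterior(trial)                              # (b) new posterior
            if np.log(rng.uniform()) < ln_p_trial - ln_p:                 # (c)+(d) accept
                chain.append(trial)
                ln_p = ln_p_trial
            else:
                chain.append(chain[-1])                                   # (d) reject: re-add old
        return np.array(chain)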


Once you have run an MCMC or equivalent and obtained a chain of samples from the posterior probability distribution for the full parameter set (m, b, P_b, Y_b, V_b), you still must decide what to report about this sampling. A histogram of the samples in any one dimension is an approximation to the fully marginalized posterior probability distribution for that parameter, and in any two is that for the pair. You can find the "maximum a posteriori" (or MAP) value by taking the peak of the marginalized posterior probability distribution for the line parameters (m, b). This, in general, will be different from the (m, b) value at the peak of the unmarginalized probability distribution for all the parameters (m, b, P_b, Y_b, V_b), because the posterior distribution can be very complex in the five-dimensional space. Because, for our purposes, the three parameters (P_b, Y_b, V_b) are nuisance parameters, it is more correct to take the best-fit (m, b) from the marginalized distribution.

Of course finding the MAP value from a sampling is not trivial. At best, you can take the highest-probability sample. However, this works only for the unmarginalized distribution. The MAP value from the marginalized posterior probability distribution function for the two line parameters (m, b) will not be the value of (m, b) for the highest-probability sample. To find the MAP value in the marginalized distribution, it will be necessary to make a histogram or perform density estimation in the sample; this brings up an enormous host of issues way beyond the scope of this document. For this reason, we usually advise just taking the mean or median or some other simple statistic on the sampling. It is also possible to establish confidence intervals (or credible intervals) using the distribution of the samples.25

While the MAP answer (or any other simple "best answer") is interesting, it is worth keeping in mind the third unfortunate thing about any Bayesian method: It does not return an "answer" but rather a posterior probability distribution. Strictly, this posterior distribution function is your answer. However, what scientists are usually doing is not inference but rather decision-making. That is, the investigator wants a specific answer, not a distribution. There are (at least) two reactions to this. One is to ignore the fundamental Bayesianism at the end, and choose simply the MAP answer. This is the Bayesian's analog of maximum likelihood; the standard uncertainty on the MAP answer would be based on the peak curvature or variance of the marginalized posterior distribution. The other reaction is to suck it up and sample the posterior probability distribution and carry forward not one answer to the problem but M answers, each of which is drawn fairly and independently from the posterior distribution function. The latter is to be


preferred because (a) it shows the uncertainties very clearly, and (b) the sampling can be carried forward to future inferences as an approximation to the posterior distribution function, useful for propagating uncertainty, or standing in as a prior for some subsequent inference.26 It is also returned, trivially (as we noted above), by the MCMC integrator we advised for performing the marginalization.

Exercise 6: Using the mixture model proposed above, which treats the distribution as a mixture of a thin line containing a fraction [1 − P_b] of the points and a broader Gaussian containing a fraction P_b of the points, find the best-fit (the maximum a posteriori) straight line y = m x + b for the x, y, and σ_y for the data in Table 1 on page 6. Before choosing the MAP line, marginalize over parameters (P_b, Y_b, V_b). That is, if you take a sampling approach, this means sampling the full five-dimensional parameter space but then choosing the peak value in the histogram of samples in the two-dimensional parameter space (m, b). Make one plot showing this two-dimensional histogram, and another showing the points, their uncertainties, and the MAP line. How does this compare to the standard result you obtained in Exercise 2? Do you like the MAP line better or worse? For extra credit, plot a sampling of 10 lines drawn from the marginalized posterior distribution for (m, b) (marginalized over P_b, Y_b, V_b) and plot the samples as a set of light grey or transparent lines. Your plot should look like Figure 4.

Exercise 7: Solve Exercise 6 but now plot the fully marginalized (over m, b, Y_b, V_b) posterior distribution function for parameter P_b. Is this distribution peaked about where you would expect, given the data? Now repeat the problem, but dividing all the data uncertainty variances σ²_yi by 4 (or dividing the uncertainties σ_yi by 2). Again plot the fully marginalized posterior distribution function for parameter P_b. Your plots should look something like those in Figure 5. Discuss.

    4 Uncertainties in the best-fit parameters

In the standard linear-algebra method of χ² minimization given in Section 1, the uncertainties in the best-fit parameters (m, b) are given by the two-dimensional output covariance matrix


[Figure 4 here: left panel, posterior samples in the (b, m) plane; right panel, the data with the MAP line and posterior draws.]

Figure 4. Partial solution to Exercise 6: On the left, a sampling approximation to the marginalized posterior probability distribution for the outlier (mixture) model. On the right, the marginalized MAP line (dark grey line) and draws from the sampling (light grey lines).


[Figure 5 here: the marginalized posterior for P_b in two panels, "using correct data uncertainties" (left) and "using data uncertainties / 2" (right).]

Figure 5. Partial solution to Exercise 7: The marginalized posterior probability distribution function for the P_b parameter (prior probability that a point is an outlier); this distribution gets much worse when the data uncertainties are underestimated.


\[
\begin{bmatrix} \sigma_b^2 & \sigma_{mb} \\ \sigma_{mb} & \sigma_m^2 \end{bmatrix} = \left[ A^\top C^{-1} A \right]^{-1} \quad , \quad (18)
\]

where the ordering is defined by the ordering in matrix A. These uncertainties for the model parameters only strictly hold under three extremely strict conditions, none of which is met in most real situations: (a) the uncertainties in the data points must have variances correctly described by the σ²_yi; (b) there must be no rejection of any data or any departure from the exact, standard definition of χ² given in equation (7); and (c) the generative model of the data implied by the method (that is, that the data are truly drawn from a negligible-scatter linear relationship and subsequently had noise added, where the noise offsets were generated by a Gaussian process) must be an accurate description of the data.

These conditions are rarely met in practice. Often the noise estimates are rough (or missing entirely!), the investigator has applied data rejection or equivalent conditioning, and the relationship has intrinsic scatter and curvature.27 For these generic reasons, we much prefer empirical estimates of the uncertainty in the best-fit parameters.

In the Bayesian outlier-modeling schemes of Section 3, the output is a posterior distribution for the parameters (m, b). This distribution function is closer than the standard estimate [A^T C^{-1} A]^{-1} to being an empirical measurement of the uncertainties of the parameters. The uncertainty variances (σ²_m, σ²_b) and the covariance σ_mb can be computed as second moments of this posterior distribution function. Computing the variances this way does involve assumptions, but it is not extremely sensitive to the assumption that the model is a good fit to the data; that is, as the model becomes a bad fit to the data (for example when the data points are not consistent with being drawn from a narrow, straight line), these uncertainties change in accordance.

That is in strong contrast to the elements of the matrix [A^T C^{-1} A]^{-1}, which don't depend in any way on the quality of fit.

Either way, unless the output of the fit is being used only to estimate the slope, and nothing else, it is a mistake to ignore the off-diagonal terms of the uncertainty variance tensor. That is, you generally know some linear combinations of m and b much, much better than you know either individually (see, for example, Figures 4 and 6). When the data are good, this covariance is related to the location of the data relative to the origin, but any non-trivial propagation of the best-fit results requires use of either the full 2 × 2


uncertainty tensor or else a two-dimensional sampling of the two-dimensional (marginalized, perhaps) posterior distribution.

In any non-Bayesian scheme, or when the full posterior has been discarded in favor of only the MAP value, or when data-point uncertainties are not trusted, there are still empirical methods for determining the uncertainties in the best-fit parameters. The two most common are bootstrap and jackknife. The first attempts to empirically create new data sets that are similar to your actual data set. The second measures your differential sensitivity to each individual data point.

In bootstrap, you do the unlikely-sounding thing of drawing N data points randomly from the N data points you have, with replacement. That is, some data points get dropped, and some get doubled (or even tripled), but the important thing is that you select each of the N that you are going to use in each trial independently from the whole set of N you have. You do this selection of N points once for each of M bootstrap trials j. For each of the M trials j, you get an estimate, by whatever method you are using (linear fitting, fitting with rejection, optimizing some custom objective function), of the parameters (m_j, b_j), where j goes from 1 to M. An estimate of your uncertainty variance on m is

\[
\sigma_m^2 = \frac{1}{M} \sum_{j=1}^{M} [m_j - m]^2 \quad , \quad (19)
\]

where m stands for the best-fit m using all the data. The uncertainty variance on b is the same but with [b_j − b]² in the sum, and the covariance σ_mb is the same but with [m_j − m] [b_j − b]. Bootstrap creates a new parameter, the number M of trials. There is a huge literature on this, so we won't say anything too specific, but one intuition is that once you have M comparable to N, there probably isn't much else you can learn, unless you got terribly unlucky with your random number generator.
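A bootstrap estimate of the slope uncertainty variance (equation 19) might look like the following sketch; fit_slope is a placeholder of our own for whatever fitting procedure is in use (for example the weighted least-squares function sketched in Section 1), assumed here to return the best-fit slope m.

    import numpy as np

    def bootstrap_slope_variance(x, y, sigma_y, fit_slope, n_trials=1000, rng=None):
        # Bootstrap estimate of the uncertainty variance on the slope (equation 19).
        # fit_slope(x, y, sigma_y) is assumed to return the best-fit slope m.
        rng = np.random.default_rng() if rng is None else rng
        N = len(x)
        m_best = fit_slope(x, y, sigma_y)          # best-fit m using all the data
        m_trials = np.empty(n_trials)
        for j in range(n_trials):
            idx = rng.integers(0, N, size=N)       # draw N points with replacement
            m_trials[j] = fit_slope(x[idx], y[idx], sigma_y[idx])
        return np.mean((m_trials - m_best) ** 2)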

In jackknife, you make your measurement N times, each time leaving out data point i. Again, it doesn't matter what method you are using; for each leave-one-out trial i you get an estimate (m_i, b_i) found by fitting with all the data except point i. Then you calculate

\[
m = \frac{1}{N} \sum_{i=1}^{N} m_i \quad , \quad (20)
\]


and the uncertainty variance becomes

\[
\sigma_m^2 = \frac{N-1}{N} \sum_{i=1}^{N} [m_i - m]^2 \quad , \quad (21)
\]

with the obvious modifications to make σ²_b and σ_mb. The factor [N − 1]/N accounts, magically, for the fact that the samples are not independent in any sense; this factor can only be justified in the limit that everything is Gaussian and all the points are identical in their noise properties.
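The corresponding jackknife estimate (equations 20 and 21) is a similar sketch; again fit_slope is a placeholder returning a best-fit slope.

    import numpy as np

    def jackknife_slope_variance(x, y, sigma_y, fit_slope):
        # Jackknife estimate of the slope uncertainty variance (equations 20-21).
        # fit_slope(x, y, sigma_y) is assumed to return the best-fit slope m.
        N = len(x)
        m_i = np.empty(N)
        for i in range(N):
            keep = np.arange(N) != i               # leave out data point i
            m_i[i] = fit_slope(x[keep], y[keep], sigma_y[keep])
        m_mean = np.mean(m_i)                      # equation (20)
        return (N - 1) / N * np.sum((m_i - m_mean) ** 2)   # equation (21)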

Jackknife and bootstrap are both extremely useful when you don't know or don't trust your uncertainties {σ²_yi}_{i=1}^N. However, they do make assumptions about the problem. For example, they can only be justified when the data represent a reasonable sampling from some stationary frequency distribution of possible data from a set of hypothetical, similar experiments. This assumption is, in some ways, very strong, and often violated.28 For another example, these methods will not give correct answers if the model does not fit the data reasonably well. That said, if you do believe your uncertainty estimates, any differences between the jackknife, bootstrap, and traditional uncertainty estimates for the best-fit line parameters undermine confidence in the validity of the model. If you believe the model, then any differences between the jackknife, bootstrap, and traditional uncertainty estimates undermine confidence in the data-point uncertainty variances {σ²_yi}_{i=1}^N.

Exercise 8: Compute the standard uncertainty σ²_m obtained for the slope of the line found by the standard fit you did in Exercise 2. Now make jackknife (20 trials) and bootstrap estimates for the uncertainty σ²_m. How do the uncertainties compare and which seems most reasonable, given the data and uncertainties on the data?

Exercise 9: Re-do Exercise 6 (the mixture-based outlier model) but just with the "inlier" points 5 through 20 from Table 1 on page 6. Then do the same again, but with all measurement uncertainties reduced by a factor of 2 (uncertainty variances reduced by a factor of 4). Plot the marginalized posterior probability distributions for line parameters (m, b) in both cases. Your plots should look like those in Figure 6. Did these posterior distributions get smaller or larger with the reduction in the data-point uncertainties? Compare this with the dependence of the standard uncertainty estimate [A^T C^{-1} A]^{-1}.


[Figure 6 here: marginalized posterior samples in the (b, m) plane for the two cases of Exercise 9.]

Figure 6. Partial solution to Exercise 9: The posterior probability distribution can become multi-modal (right panel) when you underestimate your observational uncertainties.

    5 Non-Gaussian uncertainties

The standard method of Section 1 is only justified when the measurement uncertainties are Gaussian with known variances. Many noise processes in the real world are not Gaussian; what to do? There are several general kinds of methodologies for dealing with non-Gaussian uncertainties. Before we describe them, permit an aside on uncertainty and error.

In note 10, we point out that an uncertainty is a measure of what you don't know, while an error is a mistake or incorrect value. The fact that a measurement y_i differs from some kind of "true" value Y_i that it would have in some kind of "perfect experiment" can be for two reasons (which are not unrelated): It could be that there is some source of noise in the experiment that generates offsets for each data point i away from its true value. Or it could be that the measurement y_i is actually the outcome of some probabilistic inference itself and therefore comes with a posterior probability distribution. In either case, there is a finite uncertainty, and in both cases it comes somehow from imperfection in the experiment, but the way you understand the distribution of the uncertainty is different. In the


first case, you have to make a model for the noise in the experiment. In the second, you have to analyze some posteriors. In either case, clearly, the uncertainty can be non-Gaussian.

All that said, and even when you suspect that the sources of noise are not Gaussian, it is still sometimes okay to treat the uncertainties as Gaussian: The Gaussian distribution is the maximum-entropy distribution for a fixed variance. That means that if you have some (perhaps magical) way to estimate the variance of your uncertainty, and you aren't sure of anything else but the variance, then the Gaussian is, contrary to popular opinion, the most conservative thing you can assume.29 Of course it is very rare that you do know the variance of your uncertainty or noise; most simple uncertainty estimators are measurements of the central part of the distribution function

(not the tails) and are therefore not good estimators of the total variance.

Back to business. The first approach to non-Gaussian noise or uncertainty is (somehow) to estimate the true or total variance of your uncertainties and then do the most conservative thing, which is to once again assume that the noise is Gaussian! This is the simplest approach, but rarely possible, because when there are non-Gaussian uncertainties, the observed variance of a finite sample of points can be very much smaller than the true variance of the generating distribution function. The second approach is to make no attempt to understand the non-Gaussianity, but to use the data-rejection methods of Section 3. We generally recommend data-rejection methods, but there is a third approach in which the likelihood is artificially or heuristically softened to reduce the influence of the outliers. This approach is not usually a good idea when it makes the objective function far from anything justifiable.30 The fourth approach, the only approach that is really justified, is to fully understand and model the non-Gaussianity.

For example, if data points are generated by a process the noise from which looks like a sum or mixture of k Gaussians with different variances, and the distribution can be understood well enough to be modeled, the investigator can replace the Gaussian objective function with something more realistic. Indeed, any (reasonable) frequency distribution can be described as a mixture of Gaussians, so this approach is extremely general (in principle; it may not be easy). For example, the generative model in Section 2 for the standard case was built out of individual data-point likelihoods given in equation (9). If the frequency distribution for the noise contribution to data point y_i has been expressed as the sum of k Gaussians, these likelihoods


become

\[
p(y_i \,|\, x_i, \sigma_{yi}, m, b) = \sum_{j=1}^{k} \frac{a_{ij}}{\sqrt{2\pi\,\sigma_{yij}^2}} \exp\left( -\frac{[y_i + \Delta y_{ij} - m\,x_i - b]^2}{2\,\sigma_{yij}^2} \right) \quad , \quad (22)
\]

where the k Gaussians for measurement i have variances σ²_yij, offsets (means; there is no need for the Gaussians to be concentric) Δy_ij, and amplitudes a_ij that sum to unity,

\[
\sum_{j=1}^{k} a_{ij} = 1 \quad . \quad (23)
\]

In this case, fitting becomes much more challenging (from an optimization standpoint), but this can become a completely general objective function for arbitrarily complicated non-Gaussian uncertainties. It can even be generalized to situations of arbitrarily complicated joint distributions for all the measurement uncertainties.
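A sketch of the per-point likelihood of equation (22): here amp, offset, and sigma are length-k arrays holding the amplitudes a_ij, offsets Δy_ij, and root-variances σ_yij for a single data point i (the names and interface are our own), with the amplitudes assumed to sum to unity as in equation (23).

    import numpy as np

    def ln_point_likelihood(y_i, x_i, m, b, amp, offset, sigma):
        # ln p(y_i | x_i, m, b) for noise described by a mixture of k Gaussians
        # (equation 22); amp, offset, sigma hold a_ij, Delta y_ij, sigma_yij.
        resid = y_i + offset - m * x_i - b
        ln_terms = (np.log(amp) - 0.5 * np.log(2.0 * np.pi * sigma ** 2)
                    - 0.5 * resid ** 2 / sigma ** 2)
        return np.logaddexp.reduce(ln_terms)  # log of the sum over the k components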

One very common non-Gaussian situation is one in which the data points include upper or lower limits, or the investigator has a value with an uncertainty estimate but knows that the true value can't enter into some region (for example, there are many situations in which one knows that all the true Y_i must be greater than zero). In all these situations, of upper limits or lower limits or otherwise limited uncertainties, the best thing to do is to model the uncertainty distribution as well as possible, construct the properly justified scalar objective function, and carry on.31 Optimization might become challenging, but that is an engineering problem that must be tackled for scientific correctness.

    6 Goodness of fit and unknown uncertainties

How can you decide if your fit is a good one, or that your assumptions are justified? And how can you infer or assess the individual data-point uncertainties if you don't know them, or don't trust them? These seem like different questions, but they are coupled: In the standard (frequentist) paradigm, you can't test your assumptions unless you are very confident about your uncertainty estimates, and you can't test your uncertainty estimates unless you are very confident about your assumptions.


Imagine that the generative model in Section 2 is a valid description of the data. In this case, the noise contribution for each data point i has been drawn from a Gaussian with variance σ²_yi. The expectation is that data point i will provide a mean squared error comparable to σ²_yi, and a contribution of order unity to χ² when the parameters (m, b) are set close to their true values. Because in detail there are two fit parameters (m, b) which you have been permitted to optimize, the expectation for χ² is smaller than N. The model is linear, so the distribution of χ² is known analytically and is given (unsurprisingly) by a chi-square distribution; in the limit of large N the rule of thumb is that, when the model is a good fit, the uncertainties are Gaussian with known variances, and there are two linear parameters (m and b in this case),

\[
\chi^2 = [N - 2] \pm \sqrt{2\,[N - 2]} \quad , \quad (24)
\]

where the ± symbol is used loosely to indicate something close to a standard uncertainty (something close to a 68-percent confidence interval). If you find χ² in this ballpark, it is conceivable that the model is good.32

Model rejection on the basis of too-large χ² is a frequentist's option. A Bayesian can't interpret the data without a model, so there is no meaning to the question "is this model good?". Bayesians only answer questions of the form "is this model better than that one?". Because we are only considering the straight-line model in this document, further discussion of this (controversial, it turns out) point goes beyond our scope.33

It is easy to get a χ² much higher than [N − 2]; getting a lower value seems impossible; yet it happens very frequently. This is one of the many reasons that rejection or acceptance of a model on the χ² basis is dangerous; exactly the same kinds of problems that can make χ² unreasonably low can make it unreasonably high; worse yet, they can make χ² reasonable when the model is bad. Reasons the χ² value can be lower include that uncertainty estimates can easily be overestimated. The opposite of this problem can make χ² high when the model is good. Even worse is the fact that uncertainty estimates are often correlated, which can result in a χ² value significantly different from the expected value.

Correlated measurements are not uncommon; the process by which the observations were taken often involves shared information or assumptions. If the individual data points y_i have been estimated by some means that effectively relies on that shared information, then there will be large covariances among the data points. These covariances bring off-diagonal elements into


the covariance matrix C, which was trivially constructed in equation (4) under the assumption that all covariances (off-diagonal elements) are precisely zero. Once the off-diagonal elements are non-zero, χ² must be computed by the matrix expression in equation (7); this is equivalent to replacing the single sum over i with two sums over i and j and considering all cross terms,

\[
\chi^2 = [Y - A X]^\top C^{-1} [Y - A X] = \sum_{i=1}^{N} \sum_{j=1}^{N} w_{ij}\, [y_i - f(x_i)]\, [y_j - f(x_j)] \quad , \quad (25)
\]

where the w_ij are the elements of the inverse covariance matrix C^{-1}.

In principle, data-point uncertainty variance underestimation or mean point-to-point covariance can be estimated by adjusting them until χ² is reasonable, in a model you know (for independent reasons) to be good. This is rarely a good idea, both because you rarely know that the model is a good one, and because you are much better served understanding your variances and covariances directly.
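Equation (25) in code, as a sketch for a general (not necessarily diagonal) N × N data covariance matrix C; the function name is ours, and np.linalg.solve is used in place of forming the inverse explicitly.

    import numpy as np

    def chi_squared_general(m, b, x, y, C):
        # chi^2 of equation (25) for a general N x N data covariance matrix C.
        r = y - (m * x + b)               # residual vector [Y - A X]
        return float(r @ np.linalg.solve(C, r))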

If you don't have or trust your uncertainty estimates and don't care about them at all, your best bet is to go Bayesian, infer them, and marginalize them out. Imagine that you don't know anything about your individual data-point uncertainty variances {σ²_yi}_{i=1}^N at all. Imagine, further, that you don't even know anything about their relative magnitudes; that is, you can't even assume that they have similar magnitudes. In this case a procedure is the following: Move the uncertainties into the model parameters to get the large parameter list (m, b, {σ²_yi}_{i=1}^N). Pick a prior on the uncertainties (on which you must have some prior information, given, for example, the range of your data and the like). Apply Bayes's rule to obtain a posterior distribution function for all the parameters (m, b, {σ²_yi}_{i=1}^N). Marginalize over the uncertainties to obtain the properly marginalized posterior distribution function for (m, b). No sane person would imagine that the procedure described here can lead to any informative inference. However, if the prior on the {σ²_yi}_{i=1}^N is relatively flat in the relevant region, the lack of specificity of the model when the uncertainties are large pushes the system to smaller uncertainties, and the inaccuracy of the model when the uncertainties are small pushes the system to larger uncertainties. If the model is reasonable, the inferred uncertainties will be reasonable too.

Exercise 10: Assess the χ² value for the fit performed in Exercise 1 (do


that problem first if you haven't already). Is the fit good? What about for the fit performed in Exercise 2?

Exercise 11: Re-do the fit of Exercise 1 but setting all σ²_yi = S, that is, ignoring the uncertainties and replacing them all with the same value S. What uncertainty variance S would make χ² = N − 2? Relevant plots are shown in Figure 7. How does it compare to the mean and median of the uncertainty variances {σ²_yi}_{i=1}^N?

[Figure 7 here: χ² as a function of the assumed common uncertainty variance S, over the full range (left) and zoomed in near the minimum-χ² region (right).]

Figure 7. Partial solution to Exercise 11: The dependence of the fitting scalar χ² on what is assumed about the data uncertainties.

Exercise 12: Flesh out and write all equations for the Bayesian uncertainty estimation and marginalization procedure described in this Section. Note that the inference and marginalization would be very expensive without excellent sampling tools! Make the additional (unjustified) assumption that all the uncertainties have the same variance σ²_yi = S to make the problem tractable. Apply the method to the x and y values for points 5 through 20 in Table 1 on page 6. Make a plot showing the points, the maximum a posteriori value of the uncertainty variance as error bars, and the maximum a posteriori straight line. For extra credit, plot two straight lines, one


that is maximum a posteriori for the full posterior and one that is the same but for the posterior after the uncertainty variance S has been marginalized out. Your result should look like Figure 8. Also plot two sets of error bars, one that shows the maximum for the full posterior and one for the posterior after the line parameters (m, b) have been marginalized out.

Figure 8 (two panels; left: MAP line y = 2.22 x + 29 with MAP S = 28.7; right: marginalized over S, y = 2.22 x + 25, and marginalized over (m, b), S = 29.1). Partial solution to Exercise 12: The results of inferring the observational uncertainties simultaneously with the fit parameters.

    7 Arbitrary two-dimensional uncertainties

Of course most real two-dimensional data {x_i, y_i}^N_{i=1} come with uncertainties in both directions (in both x and y). You might not know the amplitudes of these uncertainties, but it is unlikely that the x values are known to sufficiently high accuracy that any of the straight-line fitting methods given so far is valid. Recall that everything so far has assumed that the x-direction uncertainties were negligible. This might be true, for example, when the x_i are the times at which a set of stellar measurements are made, and the y_i are the declinations of the star at those times. It is not going to be true when the x_i are the right ascensions of the star.

In general, when one makes a two-dimensional measurement (x_i, y_i), that measurement comes with uncertainties (σ²_xi, σ²_yi) in both directions, and some covariance σ_xyi between them. These can be put together into a covariance tensor S_i

S_i \equiv \begin{bmatrix} \sigma^2_{xi} & \sigma_{xyi} \\ \sigma_{xyi} & \sigma^2_{yi} \end{bmatrix} \quad . \qquad (26)

If the uncertainties are Gaussian, or if all that is known about the uncertainties is their variances, then the covariance tensor can be used in a two-dimensional Gaussian representation of the probability of getting measurement (x_i, y_i) when the true value (the value you would have for this data point if it had been observed with negligible noise) is (x, y):

p(x_i, y_i \,|\, S_i, x, y) = \frac{1}{2\pi\,\sqrt{\det(S_i)}} \exp\!\left( -\frac{1}{2}\,[Z_i - Z]^\top S_i^{-1}\,[Z_i - Z] \right) \quad , \qquad (27)

where we have implicitly made column vectors

Z = \begin{bmatrix} x \\ y \end{bmatrix} \;;\quad Z_i = \begin{bmatrix} x_i \\ y_i \end{bmatrix} \quad . \qquad (28)

Now in the face of these general (though Gaussian) two-dimensional uncertainties, how do we fit a line? Justified objective functions will have something to do with the probability of the observations {x_i, y_i}^N_{i=1} given the uncertainties {S_i}^N_{i=1}, as a function of properties (m, b) of the line. As in Section 2, the probability of the observations given model parameters (m, b) is proportional to the likelihood of the parameters given the observations. We will now construct this likelihood and maximize it, or else multiply it by a prior on the parameters and report the posterior on the parameters.

Schematically, the construction of the likelihood involves specifying the line (parameterized by m, b), finding the probability of each observed data point given any true point on that line, and marginalizing over all possible true points on the line. This is a model in which each point really does have a true location on the line, but that location is not directly measured; it is, in some sense, a missing datum for each point.34 One approach is to think of the straight line as a two-dimensional Gaussian with an infinite eigenvalue corresponding to the direction of the line and a zero eigenvalue for the direction orthogonal to this. In the end, this line of argument leads to projections of the two-dimensional uncertainty Gaussians along the line (or onto the subspace that is orthogonal to the line), and evaluation of those at the projected displacements. Projection is a standard linear algebra technique, so we will use linear algebra (matrix) notation.35

A slope m can be described by a unit vector v̂ orthogonal to the line or linear relation:

\hat{v} = \frac{1}{\sqrt{1 + m^2}} \begin{bmatrix} -m \\ 1 \end{bmatrix} = \begin{bmatrix} -\sin\theta \\ \cos\theta \end{bmatrix} \quad , \qquad (29)

where we have defined the angle θ = arctan m made between the line and the x axis. The orthogonal displacement Δ_i of each data point (x_i, y_i) from the line is given by

\Delta_i = \hat{v}^\top Z_i - b\,\cos\theta \quad , \qquad (30)

where Z_i is the column vector made from (x_i, y_i) in equation (28). Similarly, each data point's covariance matrix S_i projects down to an orthogonal variance Σ²_i given by

\Sigma_i^2 = \hat{v}^\top S_i\,\hat{v} \quad , \qquad (31)

and then the log likelihood for (m, b) or (θ, b cos θ) can be written as

\ln\mathcal{L} = K - \sum_{i=1}^{N} \frac{\Delta_i^2}{2\,\Sigma_i^2} \quad , \qquad (32)

where K is some constant. This likelihood can be maximized, and the resulting values of (m, b) are justifiably the best-fit values. The only modification we would suggest is performing the fit or likelihood maximization not in terms of (m, b) but rather (θ, b⊥), where b⊥ ≡ b cos θ is the perpendicular distance of the line from the origin. This removes the paradox that the standard prior assumption of standard straight-line fitting treats all slopes m equally, putting way too much attention on angles near ±π/2. The Bayesian must set a prior. Again, there are many choices, but the most natural is something like flat in θ and flat in b⊥ (the latter not proper).
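A minimal numerical sketch of this likelihood is below; it is not the authors' code, and the synthetic data, the made-up per-point covariance tensors, the optimizer choice, and the starting guess are all illustrative assumptions.

```python
import numpy as np
from scipy.optimize import minimize

# Minimal sketch of the likelihood in equation (32): project each covariance
# tensor onto the direction orthogonal to the line, then sum the scaled
# squared orthogonal displacements.
rng = np.random.default_rng(7)
N = 16
S = np.array([[[120.0, 40.0], [40.0, 350.0]]] * N)            # covariance tensors S_i
x_true = np.sort(rng.uniform(20.0, 280.0, size=N))
xy = np.stack([x_true, 2.2 * x_true + 30.0], axis=1) \
     + np.array([rng.multivariate_normal([0.0, 0.0], S[i]) for i in range(N)])

def neg_ln_like(params):
    theta, b_perp = params
    v = np.array([-np.sin(theta), np.cos(theta)])             # unit vector orthogonal to the line
    delta = xy @ v - b_perp                                    # orthogonal displacements, eq. (30)
    sigma2 = np.einsum("i,nij,j->n", v, S, v)                  # projected variances, eq. (31)
    return np.sum(0.5 * delta ** 2 / sigma2)                   # minus eq. (32), up to the constant K

best = minimize(neg_ln_like, x0=[np.arctan(2.0), 20.0], method="Nelder-Mead")
theta, b_perp = best.x
m, b = np.tan(theta), b_perp / np.cos(theta)                   # back to slope and intercept
print(f"m = {m:.2f}, b = {b:.1f}")
```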

The implicit generative model here is that there are N points with true values that lie precisely on a narrow, linear relation in the x-y plane. To each of these true points a Gaussian offset has been added to make each observed point (x_i, y_i), where the offset was drawn from the two-dimensional Gaussian with covariance tensor S_i. As usual, if this generative model is a good approximation to the properties of the data set, the method works very well. Of course there are many situations in which this is not a good approximation. In Section 8, we consider the (very common) case that the relationship is near-linear but not narrow, so there is an intrinsic width or scatter in the true relationship. Another case is that there are outliers; this can be taken care of by methods very precisely analogous to the methods in Section 3. We ask for this in the Exercises below.

It is true that standard least-squares fitting is easy and simple; presumably this explains why it is used so often when it is inappropriate. We hope to have convinced some readers that doing something justifiable and sensible when there are uncertainties in both dimensions (when standard linear fitting is inappropriate) is neither difficult nor complex. That said, there is something fundamentally wrong with the generative model of this Section, and it is that the model generates only the displacements of the points orthogonal to the linear relationship. The model is completely unspecified for the distribution along the relationship.36

In the astrophysics literature (see, for example, the Tully-Fisher literature), there is a tradition, when there are uncertainties in both directions, of fitting the forward and reverse relations (that is, fitting y as a function of x and then x as a function of y) and then splitting the difference between the two slopes so obtained, or treating the difference between the slopes as a systematic uncertainty. This is unjustified.37 Another common method for finding the linear relationship in data when there are uncertainties in both directions is principal components analysis. The manifold reasons not to use PCA are beyond the scope of this document.38

Exercise 13: Using the method of this Section, fit the straight line y = m x + b to the x, y, σ²_x, σ_xy, and σ²_y values of points 5 through 20 taken from Table 1 on page 6. Make a plot showing the points, their two-dimensional uncertainties (show them as one-sigma ellipses), and the best-fit line. Your plot should look like Figure 9.

Figure 9 (best-fit line y = 2.20 x + 36). Partial solution to Exercise 13: The best fit after properly accounting for covariant noise in both dimensions.

Exercise 14: Repeat Exercise 13, but using all of the data in Table 1 on page 6. Some of the points are now outliers, so your fit may look worse. Follow the fit by a robust procedure analogous to the Bayesian mixture model with bad-data probability P_b described in Section 3. Use something sensible for the prior probability distribution for (m, b). Plot the two results with the data and uncertainties. For extra credit, plot a sampling of 10 lines drawn from the marginalized posterior distribution for (m, b) and plot the samples as a set of light grey or transparent lines. For extra extra credit, mark each data point on your plot with the fully marginalized probability that the point is bad (that is, rejected, or has q = 0). Your result should look like Figure 10.

Exercise 15: Perform the abominable forward-reverse fitting procedure on points 5 through 20 from Table 1 on page 6. That is, fit a straight line to the y values of the points, using the y-direction uncertainties σ²_y only, by the standard method described in Section 1. Now transpose the problem and fit the same data but fitting the x values using the x-direction uncertainties σ²_x only. Make a plot showing the data points, the x-direction and y-direction uncertainties, and the two best-fit lines. Your plot should look like Figure 11. Comment.

Figure 10 (left panel: best-fit line y = 1.33 x + 164; right panel: the data points labeled with their bad-data probabilities). Partial solution to Exercise 14: On the left, the same as Figure 9 but including the outlier points. On the right, the same as in Figure 4 but applying the outlier (mixture) model to the case of two-dimensional uncertainties.

Figure 11 (forward: y = (2.24 ± 0.11) x + (34 ± 18); reverse: y = (2.64 ± 0.12) x + (50 ± 21)). Partial solution to Exercise 15: Results of forward and reverse fitting. Don't ever do this.

Exercise 16: Perform principal components analysis on points 5 through 20 from Table 1 on page 6. That is, diagonalize the 2 × 2 matrix Q given by

Q = \sum_{i=1}^{N} [Z_i - \langle Z\rangle]\,[Z_i - \langle Z\rangle]^\top \quad , \qquad (33)

\langle Z\rangle = \frac{1}{N} \sum_{i=1}^{N} Z_i \quad . \qquad (34)

Find the eigenvector of Q with the largest eigenvalue. Now make a plot showing the data points, and the line that goes through the mean ⟨Z⟩ of the data with the slope corresponding to the direction of the principal eigenvector. Your plot should look like Figure 12. Comment.
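For concreteness, here is a minimal sketch of this PCA procedure on synthetic stand-in data (the arrays below are assumptions, not the Table 1 values). Note that nothing in the computation uses the measurement uncertainties at all.

```python
import numpy as np

# Minimal sketch of the PCA procedure in equations (33)-(34).
rng = np.random.default_rng(11)
N = 16
x = np.sort(rng.uniform(20.0, 280.0, size=N))
y = 2.2 * x + 30.0 + rng.normal(0.0, 30.0, size=N)

Z = np.stack([x, y], axis=1)                 # the N two-vectors Z_i
Z_mean = Z.mean(axis=0)                      # the mean vector, eq. (34)
Q = (Z - Z_mean).T @ (Z - Z_mean)            # the 2x2 scatter matrix, eq. (33)
eigvals, eigvecs = np.linalg.eigh(Q)         # eigh sorts eigenvalues in ascending order
principal = eigvecs[:, -1]                   # eigenvector with the largest eigenvalue
m = principal[1] / principal[0]              # slope of the principal direction
b = Z_mean[1] - m * Z_mean[0]                # line through the mean with that slope
print(f"PCA line: y = {m:.2f} x + {b:.1f}")
```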

Figure 12 (principal-component line y = 2.42 x − 5). Partial solution to Exercise 16: The dominant component from a principal components analysis.

    8 Intrinsic scatter

So far, everything we have done has implicitly assumed that there truly is a narrow linear relationship between x and y (or there would be if they were both measured with negligible uncertainties). The words "narrow" and "linear" are both problematic, since there are very few problems in science, especially in astrophysics, in which a relationship between two observables is expected to be either. The reasons for intrinsic scatter abound; but generically the quantities being measured are produced or affected by a large number of additional, unmeasured or unmeasurable quantities; the relation between x and y is rarely exact, even when observed by theoretically perfect, noise-free observations.

Proper treatment of these problems gets into the complicated area of estimating density given finite and noisy samples; again this is a huge subject so we will only consider one simple solution to the one simple problem of a relationship that is linear but not narrow. We will not consider anything that looks like subtraction of variances in quadrature; that is pure procedure and almost never justifiable.39 Instead, as usual, we introduce a model that generates the data and infer the parameters of that model.

Introduce an intrinsic Gaussian variance V, orthogonal to the line. In this case, the parameters of the relationship become (θ, b⊥, V). In this case, each data point can be treated as being drawn from a projected distribution function that is a convolution of the projected uncertainty Gaussian, of variance Σ²_i defined in equation (31), with the intrinsic scatter Gaussian of variance V. Convolution of Gaussians is trivial and the likelihood in this case becomes

\ln\mathcal{L} = K - \sum_{i=1}^{N} \frac{1}{2}\,\ln(\Sigma_i^2 + V) - \sum_{i=1}^{N} \frac{\Delta_i^2}{2\,[\Sigma_i^2 + V]} \quad , \qquad (35)

where again K is a constant, everything else is defined as it is in equation (32), and an additional term has appeared to penalize very broad variances (which, because they are less specific than small variances, have lower likelihoods). Actually, that term existed in equation (32) as well, but because there was no V parameter, it did not enter into any optimization so we absorbed it into K.
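Relative to the sketch after equation (32), the only change is the extra variance V and the ln(Σ²_i + V) term; under the same illustrative assumptions (synthetic data, made-up covariance tensors, a Nelder-Mead optimizer), one might write:

```python
import numpy as np
from scipy.optimize import minimize

# Minimal sketch of the likelihood in equation (35): the same projection as
# before, with an intrinsic variance V added to each projected variance.
rng = np.random.default_rng(5)
N = 16
S = np.array([[[120.0, 40.0], [40.0, 350.0]]] * N)
x_true = np.sort(rng.uniform(20.0, 280.0, size=N))
y_true = 2.2 * x_true + 30.0 + rng.normal(0.0, 15.0, size=N)   # intrinsic scatter
xy = np.stack([x_true, y_true], axis=1) \
     + np.array([rng.multivariate_normal([0.0, 0.0], S[i]) for i in range(N)])

def neg_ln_like(params):
    theta, b_perp, lnV = params
    V = np.exp(lnV)                                            # keeps V positive
    v = np.array([-np.sin(theta), np.cos(theta)])
    delta = xy @ v - b_perp
    sigma2 = np.einsum("i,nij,j->n", v, S, v) + V              # convolved variances
    return np.sum(0.5 * np.log(sigma2) + 0.5 * delta ** 2 / sigma2)

best = minimize(neg_ln_like, x0=[np.arctan(2.0), 20.0, np.log(100.0)],
                method="Nelder-Mead")
theta, b_perp, lnV = best.x
print(f"m = {np.tan(theta):.2f}, b_perp = {b_perp:.1f}, V = {np.exp(lnV):.1f}")
```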

As we mentioned in note 36, there are limitations to this procedure because it models only the distribution orthogonal to the relationship.40 Probably all good methods fully model the intrinsic two-dimensional density function, the function that, when convolved with each data point's intrinsic uncertainty Gaussian, creates a distribution function from which that data point is a single sample. Such methods fall into the intersection of density estimation and deconvolution, and completely general methods exist.41

Exercise 17: Re-do Exercise 13, but now allowing for an orthogonal intrinsic Gaussian variance V and only excluding data point 3. Re-make the plot, showing not the best-fit line but rather the ±√V lines for the maximum-likelihood intrinsic relation. Your plot should look like Figure 13.

Figure 13. Partial solution to Exercise 17: The maximum-likelihood fit, allowing for intrinsic scatter.

Exercise 18: Re-do Exercise 17 but as a Bayesian, with sensible Bayesian priors on (θ, b⊥, V). Find and marginalize the posterior distribution over (θ, b⊥) to generate a marginalized posterior probability for the intrinsic variance parameter V. Plot this posterior with the 95 and 99 percent upper limits on V marked. Your plot should look like Figure 14. Why did we ask only for upper limits?

Figure 14 (posterior probability plotted against V). Partial solution to Exercise 18: The marginalized posterior probability distribution for the intrinsic variance.


    Notes

1 Copyright 2010 by the authors. You may copy and distribute this document provided that you make no changes to it whatsoever.

2 Above all we owe a debt to Sam Roweis (Toronto & NYU, now deceased), who taught us to find and optimize justified scalar objective functions. He will be sorely missed by all three of us. In addition it is a pleasure to thank Mike Blanton (NYU), Scott Burles (D. E. Shaw), Daniel Foreman-Mackey (Queen's), Brandon C. Kelly (Harvard), Iain Murray (Edinburgh), Hans-Walter Rix (MPIA), and David Schiminovich (Columbia) for valuable comments and discussions. This research was partially supported by NASA (ADP grant NNX08AJ48G), NSF (grant AST-0908357), and a Research Fellowship of the Alexander von Humboldt Foundation. This research made use of the Python programming language and the open-source Python packages scipy, numpy, and matplotlib.

3 When fitting is used for data-driven prediction (that is, using existing data to predict the properties of new data not yet acquired), the conditions for model applicability are weaker. For example, a line fit by the standard method outlined in Section 1 can be, under a range of assumptions, the best linear predictor of new data, even when it is not a good or appropriate model. A full discussion of this, which involves model selection and clarity about exactly what is being predicted (it is different to predict y given x than to predict x given y or x and y together), is beyond the scope of this document, especially since in astrophysics and other observational sciences it is relatively rare that data prediction is the goal of fitting.

4 The term "linear regression" is used in many different contexts, but in the most common usage it seems to mean linear fitting without any use of the uncertainties. In other words, linear regression is usually just the procedure outlined in this Section, but with the covariance matrix C set equal to the identity matrix. This is equivalent to fitting under the assumption not just that the x-direction uncertainties are negligible, but also that all of the y-direction uncertainties are simultaneously identical and Gaussian.

Sometimes the term "linear regression" is meant to cover a much larger range of options; in our field of astrophysics there have been attempts to test them all (for example, see Isobe et al. 1990). This kind of work (testing ad hoc procedures on simulated data) can only be performed if you are willing to simultaneously commit (at least) two errors: First, the linear regression methods are procedures; they do not (necessarily) optimize anything that a scientist would care about. That is, the vast majority of procedures fail to produce a best-fit line in any possible sense of the word "best". Of course, some procedures do produce a best-fit line under some assumptions, but scientific procedures should flow from assumptions and not the other way around. Scientists do not trade in procedures, they trade in objectives, and choose procedures only when they are demonstrated to optimize their objectives (we very much hope). Second, in this kind of testing, the investigator must decide which procedure performs best by applying the procedures to sets of simulated data, where truth (the true generative model) is known. All that possibly can be shown is which procedure does well at reproducing the parameters of the model with which the simulated data are generated! This has no necessary relationship with anything by which the real data are generated. Even if the generative model for the artificial data matches perfectly the generation of the true data, it is still much better to analyze the likelihood created by this generative model than to explore procedure space for a procedure that comes close to doing that by accident (unless, perhaps, when computational limitations demand that only a small number of simple operations be performed on the data).

5 Along the lines of egregious, in addition to the complaints in the previous note, the above-cited paper (Isobe et al., 1990) makes another substantial error: When deciding whether to fit for y as a function of x or x as a function of y, the authors claim that the decision should be based on the physics of x and y! This is the independent variable fallacy: the investigator thinks that y is the independent variable because physically it seems independent (as in "it is really the velocity that depends on the distance, not the distance that depends on the velocity" and other statistically meaningless intuitions). The truth is that this decision must be based on the uncertainty properties of x and y. If x has much smaller uncertainties, then you must fit y as a function of x; if the other way then the other way, and if neither has much smaller uncertainties, then that kind of linear fitting is invalid. We have more to say about this generic situation in later Sections. What is written in Isobe et al. (1990), by professional statisticians, we could add, is simply wrong.

6 For getting started with modern data modeling and inference, the texts Mackay (2003), Press et al. (2007), and Sivia & Skilling (2006) are all very useful.

7 Even in situations in which linear fitting is appropriate, it is still often done wrong! It is not unusual to see the individual data-point uncertainty estimates ignored, even when they are known at least approximately. It is also common for the problem to get transposed such that the coordinates for which uncertainties are negligible are put into the Y vector and the coordinates for which uncertainties are not negligible are put into the A matrix. In this latter case, the procedure makes no sense at all really; it happens when the investigator thinks of some quantity really being the "dependent variable", despite the fact that it has the smaller uncertainty. For example there are mistakes in the Hubble expansion literature, because many have an intuition that the velocity depends on the distance when in fact velocities are measured better, so from the point of view of fitting, it is better to think of the distance depending on the velocity. In the context of fitting, there is no meaning to the (ill-chosen) terms "independent variable" and "dependent variable" beyond the uncertainty or noise properties of the data. This is the independent variable fallacy mentioned in note 5. In performing this standard fit, the investigator is effectively assuming that the x values have negligible uncertainties; if they do not, then the investigator is making a mistake.

8 As we mention in note 4, scientific procedures should achieve justified objectives; this objective function makes that statement well posed. The objective function is also sometimes called the "loss function" in the statistics literature.

9 The inverse covariance matrix appears in the construction of the χ² objective function like a linear metric for the data space: It is used to turn an N-dimensional vector displacement into a scalar squared distance between the observed values and the values predicted by the model. This distance can then be minimized. This idea (that the covariance matrix is a metric) is sometimes useful for thinking about statistics as a physicist; it will return in Section 7 when we think about data for which there are uncertainties in both the x and y directions.

Because of the purely quadratic nature of this metric distance, the standard problem solved in this Section has a linear solution. Not only that, but the objective function landscape is also convex so the optimum is global and the solution is unique. This is remarkable, and almost no perturbation of this problem (as we will see in later Sections) has either of these properties, let alone both. For this standard problem, the linearity and convexity permit a one-step solution with no significant engineering challenges, even when N gets exceedingly large. As we modify (and make more realistic) this problem, linearity and convexity will both be broken; some thought will have to be put into the optimization.

10 There is a difference between uncertainty and error: For some reason, the uncertainty estimates assigned to data points are often called "errors" or "error bars" or "standard errors". That terminology is wrong: They are uncertainties, not errors; if they were errors, we would have corrected them! DWH's mentor Gerry Neugebauer liked to say this; Neugebauer credits the quotation to physicist Matthew Sands. Data come with uncertainties, which are limitations of knowledge, not mistakes.

11 In geometry or physics, a scalar is not just a quantity with a magnitude; it is a quantity with a magnitude that has certain transformation properties: It is invariant under some relevant transformations of the coordinates. The objective ought to be something like a scalar in this sense, because we want the inference to be as insensitive to coordinate choices as possible. The standard χ² objective function is a scalar in the sense that it is invariant to translations or rescalings of the data in the Y vector space defined in equation (2). Of course a rotation of the data in the N-dimensional space of the {y_i}^N_{i=1} would be pretty strange, and would make the covariance matrix C much more complicated, so perhaps the fact that the objective is literally a linear algebra scalar (this is clear from its matrix representation in equation (7)) is something of a red herring.

The more important thing about the objective is that it must have one magnitude only. Often when one asks scientists what they are trying to optimize, they say things like "I am trying to maximize the resolution, minimize the noise, and minimize the error in measured fluxes." Well, those are good objectives, but you can't optimize them all. You either have to pick one, or make a convex combination of them (such as by adding them together in quadrature with weights). This point is so obvious, it is not clear what else to say, except that many scientific investigations are so ill-posed that the investigator does not know what is being optimized. When there is a generative model for the data, this problem doesn't arise.

12 Of course, whether you are a frequentist or a Bayesian, the most scientifically responsible thing to do is not just optimize the objective, but pass forward a description of the dependence of the objective on the parameters, so that other investigators can combine your results with theirs in a responsible way. In the fully Gaussian case described in this Section, this means passing forward the optimum and second derivatives around that optimum (since a Gaussian is fully specified by its mode, or mean, and its second derivative at the mode). In what follows, when things aren't purely Gaussian, more sophisticated outputs will be necessary.

13 We have tried in this document to be a bit careful with words. Following Jaynes (2003), we have tried to use the word "probability" for our knowledge or uncertainty about a parameter in the problem, and "frequency" for the rate at which a random variable comes a particular way. So the noise process that sets the uncertainty σ_yi has a frequency distribution, but if we are just talking about our knowledge about the true value of y for a data point measured to have value y_i, we might describe that with a probability distribution. In the Bayesian context, then (to get extremely technical), the posterior probability distribution function is truly a probability distribution, but the likelihood factor (really, our generative model) is constructed from frequency distributions.

14 The likelihood L has been defined in this document to be the frequency distribution for the observables evaluated at the observed data {y_i}^N_{i=1}. It is a function of both the data and the model parameters. Although it appears as an explicit function of the data, this is called the likelihood for the parameters and it is thought of as a function of the parameters in subsequent inference. In some sense, this is because the data are not variables but rather fixed observed values, and it is only the parameters that are permitted to vary.

15 One oddity in this Section is that when we set up the Bayesian problem we put the {x_i}^N_{i=1} and {σ²_yi}^N_{i=1} into the prior information I and not into the data. These were not considered data. Why not? This was a choice, but we made it to emphasize that in the standard approach to fitting, there is a total asymmetry between the x_i and the y_i. The x_i are considered part of the experimental design; they are the input to an experiment that got the y_i as output. For example, the x_i are the times (measured by your perfect and reliable clock) at which you chose a priori to look through the telescope, and the y_i are the declinations you measured at those times. As soon as there is any sense in which the x_i are themselves also data, the method (Bayesian or frequentist) of this Section is invalid.

16 It remains a miracle (to us) that the optimization of the χ² objective, which is the only sensible objective under the assumptions of the standard problem, has a linear solution. One can attribute this to the magical properties of the Gaussian distribution, but the Gaussian distribution is also the maximum-entropy distribution (the least informative possible distribution) constrained to have zero mean and a known variance; it is the limiting distribution of the central limit theorem. That this leads to a convex, linear algebra solution is something for which we all ought to give thanks.

17 Sigma clipping is a procedure that involves performing the fit, identifying the worst outliers in a χ² sense (the points that contribute more than some threshold Q² to the χ² sum), removing those, fitting again, identifying again, and so on to convergence. This procedure is easy to implement, fast, and reliable (provided that the threshold Q² is set high enough), but it has various problems that make it less suitable than methods in which the outliers are modeled. One is that sigma clipping is a procedure and not the result of optimizing an objective function. The procedure does not necessarily optimize any objective (let alone any justifiable objective). A second is that the procedure gives an answer that depends, in general, on the starting point or initialization, and because there is no way to compare different answers (there is no objective function), the investigator can't decide which of two converged answers is better. You might think that the answer with least scatter (smallest χ² per data point) is best, but that will favor solutions that involve rejecting most of the data; you might think the answer that uses the most data is best, but that can have a very large χ². These issues relate to the fact that the method does not explicitly penalize the rejection of data; this is another bad consequence of not having an explicit objective function. A third problem is that the procedure does not necessarily converge to anything non-trivial at all; if the threshold Q² gets very small, there are situations in which all but two of the data points can be rejected by the procedure. All that said, with Q² ≫ 1 and standard, pretty good data, the sigma-clipping procedure is easy to implement and fast; we have often used it ourselves.
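For what it is worth, a minimal sketch of such a sigma-clipping loop, on synthetic stand-in data and with an assumed threshold Q² = 9 (shown only to make the procedure concrete, not as a recommendation):

```python
import numpy as np

# Minimal sketch of a sigma-clipping loop on synthetic data with a few
# injected outliers; Q^2 = 9 is roughly "3-sigma" clipping.
rng = np.random.default_rng(2)
N, Q2 = 30, 9.0
x = np.sort(rng.uniform(0.0, 300.0, size=N))
sigma_y = np.full(N, 25.0)
y = 2.2 * x + 30.0 + rng.normal(0.0, sigma_y)
y[:4] += rng.uniform(200.0, 400.0, size=4)           # inject a few outliers

keep = np.ones(N, dtype=bool)
for _ in range(100):                                 # fit, clip, and repeat to convergence
    A = np.vander(x[keep], 2)                        # design matrix with columns [x, 1]
    w = 1.0 / sigma_y[keep] ** 2                     # inverse-variance weights
    coeffs = np.linalg.solve(A.T @ (w[:, None] * A), A.T @ (w * y[keep]))
    chi2_i = ((y - np.vander(x, 2) @ coeffs) / sigma_y) ** 2
    new_keep = chi2_i < Q2                           # drop points above the threshold
    if np.array_equal(new_keep, keep):
        break                                        # kept set unchanged: converged
    keep = new_keep

print(f"kept {keep.sum()} of {N} points; m = {coeffs[0]:.2f}, b = {coeffs[1]:.1f}")
```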

Complicating matters is the fact that sigma clipping is often employed in the common situation in which you have to perform data rejection but you don't know the magnitudes of your