
Practical Statistics - University of Arizona


Page 1: Practical Statistics - University of Arizona

Practical Statistics

• Lecture 3 (Aug. 28):
- Correlation
- Hypothesis Testing

• Lecture 4 (Aug. 30):
- Parameter Estimation
- Bayesian Analysis
- Rejecting Outliers
- Bootstrap + Jack-knife

• Lecture 5 (Sep. 4):
- Random Numbers
- Monte Carlo Modeling

• Lecture 6 (Sep. 6):
- Markov Chain MC

• Lecture 7 (Sep. 11):
- Fourier Techniques
- Filtering
- Unevenly Sampled Data

• Lecture 8 (Sep. 20):
- Principal Component Analysis

Good Reference: Numerical Recipes, Press et al., p. 825

Page 2: Practical Statistics - University of Arizona

More Matlab routines

• Some possibly useful functions, for reference (a short example follows below):
- hist(vec,numbins);

‣ creates a histogram of the values in “vec” with a selectable number of bins.

- [X,Y]=meshgrid(xvec,yvec);

‣ given vectors of x and y values, this creates the 2D coordinate arrays needed to evaluate or plot a function over the grid.

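A quick illustration of both calls (the variable names and the example function are arbitrary):

    % Histogram of 1000 Gaussian deviates in 20 bins
    vec = randn(1000,1);
    hist(vec, 20);

    % 2D coordinate grids for evaluating a function of x and y
    xvec = -3:0.1:3;
    yvec = -2:0.1:2;
    [X,Y] = meshgrid(xvec, yvec);
    Z = exp(-(X.^2 + Y.^2));   % a 2D Gaussian evaluated on the grid
    contour(X, Y, Z);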

Page 3: Practical Statistics - University of Arizona

Markov Chain Monte Carlo (MCMC)

• Often want to explore a multi-dimensional parameter space to evaluate a metric.
- A grid search approach is inefficient.
- Want an algorithm that maps out regions of higher probability more effectively.

General procedure (a minimal sketch follows below):
- Start with a given set of parameters and calculate the metric.
- Choose a new set of parameters and calculate the new metric.
- Accept the new point with probability P = metric(new)/metric(old) × P(x1,x2).

- The procedure provides the optimum parameter values, and also explores the parameter space in a way that allows derivation of confidence intervals.
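A minimal sketch of this loop in Matlab, assuming a symmetric Gaussian proposal so that the acceptance probability reduces to metric(new)/metric(old); the metric function, step size, and starting point below are placeholders:

    % Minimal Metropolis-style sampler (symmetric Gaussian proposal)
    metric = @(x) exp(-0.5*sum(x.^2));   % placeholder: metric proportional to probability
    nsteps = 10000;
    sigma  = 0.5;                        % proposal step size
    x      = [2 2];                      % starting parameters
    chain  = zeros(nsteps, numel(x));
    for i = 1:nsteps
        y = x + sigma*randn(size(x));    % candidate point
        if rand < metric(y)/metric(x)    % accept with P = metric(new)/metric(old)
            x = y;
        end
        chain(i,:) = x;                  % keep the current point (new or repeated)
    end
    % The distribution of chain values (after discarding burn-in) gives the
    % confidence intervals on the parameters.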

Page 4: Practical Statistics - University of Arizona

MCMC References

Detailed Lecture Notes from Phil Gregory:
http://www.astro.ufl.edu/~eford/astrostats/Florida2Mar2010.pdf

Tutorial Lecture from Sujit Saha:
http://www.soe.ucsc.edu/classes/cmps290c/Winter06/paps/mcmc.pdf

MCMC example from Murali Haran:
http://www.stat.psu.edu/~mharan/MCMCtut/MCMC.html

Good Lecture series on Astrostatistics:
http://www.astro.ufl.edu/~eford/astrostats/

Page 5: Practical Statistics - University of Arizona

Bayesian Review


“I see I have drawn 6 red balls out of 10 total trials.”

“I hypothesize that there are an equal number of red and white balls in a box.”

“There is a 24% chance that my hypothesis is correct.”

“Odds” on what is in the box.

Page 6: Practical Statistics - University of Arizona

Bayes’ Theorem

Bayes’ formula is used to merge data with prior information.

A is typically the data, B the statistic we want to know. P(B) is the “prior” information we may know about the experiment.

P(data) is just a normalization constant.

P(B|A) = P(A|B) · P(B) / P(A)

With A the data, this gives

P(B|data) ∝ P(data|B) · P(B)

Page 7: Practical Statistics - University of Arizona

Application to the Balls in a Box Problem

• P(n=6 | frac=5) is calculated from the binomial probability distribution:
- “If frac = 5/10, then p = frac = 0.5, and P(6) = 0.21.”

• P(frac) can be assumed to be uniform.
- Is this a good choice?

• P(n=6) is the sum over all possibilities, from frac = 0/10 to frac = 10/10, of P(6 | frac); this sum is 0.91 (the uniform prior cancels in the ratio).

• Putting these together, P(frac=5 | n=6) = 24% (a short calculation of these numbers follows below).
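The slide’s numbers can be reproduced in a few lines (a sketch; the variable names are mine):

    % Posterior for the balls-in-a-box example
    n = 10; k = 6;                        % 10 draws, 6 of them red
    frac = (0:10)/10;                     % hypothesized fractions of red balls
    like = nchoosek(n,k) * frac.^k .* (1-frac).^(n-k);   % P(n=6 | frac)
    post = like / sum(like);              % uniform prior cancels in the ratio
    post(frac == 0.5)                     % posterior probability that the box is half red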

Page 8: Practical Statistics - University of Arizona

Goal of MCMC in Bayesian Analysis

Assume we have a set of data, D, and a metric, P(D | x), that tells us the probability of getting D, given a set of parameters, x.

If we assume a prior, P(x), then Bayes’ theorem gives us:

π(x) = P(D | x) · P(x)

Since we don’t know the normalizing constant, P(D), we would otherwise have to integrate this function (numerically or analytically) to obtain an answer.

The value of MCMC is that its points are distributed in direct proportion to π(x), so that normalization is never needed.

Page 9: Practical Statistics - University of Arizona

What is a Markov Chain?

A sequence of random numbers in which each value depends on the previous one.

Our previous discussion of Monte Carlo used completely independent random values.

Example: A die roll is an independent random number; Brownian motion is a Markov process. (A minimal numerical illustration follows below.)
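A sketch of the distinction in Matlab:

    % Independent random values (like repeated die rolls)
    indep = randi(6, 1, 100);            % 100 independent die rolls

    % A Markov chain: each value depends on the previous one
    % (a random walk, i.e. discrete Brownian motion)
    chain = zeros(1, 100);
    for i = 2:100
        chain(i) = chain(i-1) + randn;   % next value = previous value + random step
    end
    plot(1:100, indep, 1:100, chain);    % compare the two sequences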

Page 10: Practical Statistics - University of Arizona

Markov Chains

• Metropolis et al. (1953) realized that using a Markov chain to relate successive points allows the sequence to visit points in proportion to π(x).
- This relies on the chain being ergodic.

• The chain preserves π(x) if its transition probabilities satisfy the detailed-balance (reversibility) condition:

π(x1) p(x2|x1) = π(x2) p(x1|x2)

• From this it can be shown that if x1 is drawn from π(x), then so is x2 (a short numerical check follows below).
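This can be verified for a simple two-state chain; the transition probabilities below are a toy example of mine, constructed to satisfy the relation for π = [0.25, 0.75]:

    % Two-state chain built to satisfy detailed balance for pi = [0.25 0.75]
    piTgt = [0.25 0.75];
    p12 = 0.6;                        % chosen transition probability 1 -> 2
    p21 = piTgt(1)*p12/piTgt(2);      % 2 -> 1 fixed by pi(x1)p(x2|x1) = pi(x2)p(x1|x2)
    P = [1-p12 p12; p21 1-p21];       % full transition matrix
    piTgt*P                           % returns [0.25 0.75]: pi is preserved by the chain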

Page 11: Practical Statistics - University of Arizona

MCMC jargon:

• Candidate point: New values of parameters that are compared to the current value in terms of the relative probability.

• Proposal distribution: The distribution from which candidate points are drawn; it typically depends on the current value.

• Acceptance probability: The probability that a candidate point will be accepted as the next step in the MC.


Page 12: Practical Statistics - University of Arizona

Candidate points and Proposal Distributions

• A “candidate” point, y, is generated from a proposal distribution q that depends on the current point x(t):

y ~ q(y | x(t))

• Hastings developed the general criterion for using any proposal distribution with a Markov chain.

• The acceptance probability is

α(x(t), y) = min( 1, [π(y) q(x(t)|y)] / [π(x(t)) q(y|x(t))] )

Page 13: Practical Statistics - University of Arizona

Acceptance Probability

• In general,

α(x(t), y) = min( 1, [π(y) q(x(t)|y)] / [π(x(t)) q(y|x(t))] )

• The proposal distribution is often selected to simplify this. If it is symmetric, q(x|y) = q(y|x), the proposal terms cancel and

α(x(t), y) = min( 1, π(y) / π(x(t)) )

(A sketch of this acceptance step appears below.)
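A sketch of a single acceptance step with a general (possibly asymmetric) proposal; piFun, qRnd, and qPdf are placeholder names for the target density π, a proposal sampler, and the proposal density q:

    % One Metropolis-Hastings step
    % piFun(x):  unnormalized target density pi(x)
    % qRnd(x):   draws a candidate y from q(y|x)
    % qPdf(y,x): evaluates the proposal density q(y|x)
    function xnew = mh_step(x, piFun, qRnd, qPdf)
        y = qRnd(x);                                            % candidate point
        alpha = min(1, piFun(y)*qPdf(x,y) / (piFun(x)*qPdf(y,x)));
        if rand < alpha                                         % accept with probability alpha
            xnew = y;
        else
            xnew = x;                                           % otherwise stay at the current point
        end
    end
    % For a symmetric proposal (e.g. y = x + sigma*randn), the qPdf terms cancel
    % and alpha = min(1, piFun(y)/piFun(x)).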

Page 14: Practical Statistics - University of Arizona

But how do we get x1?

• The starting point may be far from the equilibrium solution.
- Even very unlikely points in a probability distribution occasionally occur.

- The number of points needed for the chain to “forget” where it started is called the “burn-in” time. This is longer if the starting point was a very unlikely possibility, or if the movement from one point to another is small.
‣ MCMC methods should use other ways of obtaining a best guess before starting.

Two “random walks” that have forgotten where they started and appear interchangeable after ~10 iterations.

See http://www.soe.ucsc.edu/classes/cmps290c/Winter06/paps/mcmc.pdf for a more detailed discussion.

Page 15: Practical Statistics - University of Arizona

Burn-in Guidance

• The best solution for determining when the initial conditions have been forgotten is to simply look at the output of the calculations.

• Independent starting values can (and should) be used to check when the burn-in process is complete.
- These are parallel computations which are trivial to implement on today’s multi-core CPUs. (A sketch follows below.)
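A sketch of this check: run a few chains from widely separated starting points and overlay their traces; burn-in is roughly over once the traces become indistinguishable. The target function, starting values, and step size below are placeholders:

    % Several independent chains from deliberately spread-out starting points
    metric = @(x) exp(-0.5*x.^2);            % placeholder 1-D target
    starts = [-10 0 10];
    nsteps = 2000; sigma = 1;
    chains = zeros(nsteps, numel(starts));
    for c = 1:numel(starts)
        x = starts(c);
        for i = 1:nsteps
            y = x + sigma*randn;             % Gaussian proposal
            if rand < metric(y)/metric(x)    % Metropolis acceptance
                x = y;
            end
            chains(i,c) = x;
        end
    end
    plot(chains);    % traces should overlap once burn-in is complete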


Page 16: Practical Statistics - University of Arizona

How do we choose the sampling?

• You want to choose a proposal distribution that gives a high acceptance rate:
- Suggests a small variation (small sigma, if Gaussian).

• You want to explore parameter space in a “complete” way and forget the starting conditions (“burn in”) quickly:
- Suggests a larger variation (larger sigma, if Gaussian).

• Together, these competing goals suggest that an adaptive approach may be useful (a sketch follows below).

• This area is where most of the “art” in MCMC techniques lies.

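One common form of the adaptive idea (my sketch, not taken from the lecture): during burn-in only, monitor the fraction of accepted candidates and nudge the Gaussian step size toward a moderate acceptance rate. It assumes the variables x and metric from the earlier sketch; nburn, the batch size, and the target rate of 0.3 are arbitrary choices:

    % Adapt the proposal width during burn-in using the recent acceptance rate
    sigma = 1; naccept = 0; batch = 100; target = 0.3; nburn = 2000;
    for i = 1:nburn
        y = x + sigma*randn(size(x));
        if rand < metric(y)/metric(x)
            x = y; naccept = naccept + 1;
        end
        if mod(i, batch) == 0                    % every 'batch' steps...
            rate = naccept/batch; naccept = 0;
            if rate < target
                sigma = sigma*0.8;               % too few acceptances: take smaller steps
            else
                sigma = sigma*1.2;               % too many acceptances: take larger steps
            end
        end
    end
    % Freeze sigma after burn-in so the production chain samples pi(x) correctly.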

Page 17: Practical Statistics - University of Arizona

MCMC example: Fitting Images

• MCMC approaches can be used to derive best-fit values and uncertainties for multi-dimensional fits to data sets.


From Skemer et al. 2008

Page 18: Practical Statistics - University of Arizona

Fitting Procedure

• 3-parameter fit for each star:
- x, y, flux

• 3 additional PSF parameters:
- width, e, PA

• Do the best 12-d fit with Levenberg-Marquardt minimization.

• Use the covariance matrix as a first guess for the step size. (A sketch of this idea follows below.)

see Skemer et al. 2008 for details
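A sketch of how a covariance matrix from an initial fit might seed the MCMC proposal (my illustration of the idea, not Skemer et al.'s code): draw correlated Gaussian steps using the Cholesky factor of the covariance matrix C.

    % C: parameter covariance matrix from the Levenberg-Marquardt fit
    % x: current parameter vector (row vector)
    L = chol(C, 'lower');                        % C = L*L'
    scale = 0.5;                                 % overall step-size tuning factor
    y = x + scale * (L * randn(numel(x),1))';    % correlated candidate step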

Page 19: Practical Statistics - University of Arizona

Example of results


Page 20: Practical Statistics - University of Arizona

A simple MCMC example

• Assume we have a known probability distribution, which is weirdly shaped:


Page 21: Practical Statistics - University of Arizona

Proposal distribution

• Choose zero mean, normally distributed values with sigma, s, to add to initial values.

• Accept new values with probability given by P(xnew)/P(xold).

• Want to look at how long “burn-in” lasts vs. s (a stand-in sketch appears after the script links below).

• What is the range of the parameters?


Script available at:

http://zero.as.arizona.edu/Ast518/mcmc_example.m

also need:

http://zero.as.arizona.edu/Ast518/MCMCpdf.m
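The actual course script is at the URLs above; the sketch below instead uses a stand-in “weirdly shaped” bimodal pdf (an assumption, in place of MCMCpdf.m) and loops over a few values of s to compare how long burn-in lasts:

    % Stand-in for MCMCpdf.m: a 'weirdly shaped' bimodal 1-D pdf
    pdf = @(x) exp(-0.5*(x-3).^2) + 0.3*exp(-0.5*((x+3)/0.5).^2);
    svals  = [0.1 1 5];                  % proposal widths (sigma = s) to compare
    nsteps = 5000;
    chains = zeros(nsteps, numel(svals));
    for j = 1:numel(svals)
        x = 10;                          % deliberately poor starting value
        for i = 1:nsteps
            y = x + svals(j)*randn;      % zero-mean normal proposal
            if rand < pdf(y)/pdf(x)      % accept with probability P(xnew)/P(xold)
                x = y;
            end
            chains(i,j) = x;
        end
    end
    plot(chains);                        % compare burn-in length for each s
    legend('s = 0.1', 's = 1', 's = 5');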

Page 22: Practical Statistics - University of Arizona

Summary

• MCMC techniques are useful both as an optimization tool and for characterizing the confidence intervals of parameters.

• They are most useful for high-dimensional datasets, or ones where the probability distribution function is complex or cannot be manipulated analytically.

• The key to how they work is that points are visited in proportion to their relative probability of occurring.
- Good for parameter estimation.

Can’t get enough astrostatistics? See: http://astrostatistics.psu.edu/