

    Physics 129c Problem Set Number 9

    Due 5:00PM Wednesday, June 3, 2015

    This is the last homework set for Ph 129c. Please turn in to Hyungrok Kim's mail slot on the 4th floor of Lauritsen.

    Reading: Read the note (posted on the course web page) on "Density Estimation".

    38. (worth two problems) Problem 34 illustrated a situation in which a particular statistic sometimes carries no information on the hypothesis to be tested. This example raises the question: Can we devise a goodness-of-fit test for sampling from an exponential distribution with unknown mean? Certainly any test designed to test against the true mean is useless, since the true mean is unknown. But we should be able to test against other features of the distribution, such as against the exponential fall-off.

    A promising approach then is to compare the observed cumulative distribution against the expected CDF for an exponential. This is a very common sort of test, and a variety of different specific tests have been devised to make such tests, with different strengths and weaknesses. Several of these are discussed in section 6.7 of the note on hypothesis testing.

    Let's try such a test, the Kolmogorov-Smirnov ("KS") test, which is frequently applied in physics research. The idea of this test is to look for the maximum deviation between the two CDFs being compared. In our case, the two CDFs are:

    (a) The "CDF" for the data. We have a sample of size n, with values x1, ..., xn. With this data, we generate the empirical CDF (ECDF) as follows. Order the xi from lowest to highest, and call the ordered values yi, i = 1, ..., n. The ECDF starts at zero for x → -∞. The ECDF is zero until the first sample value, y1, is reached, at which point it steps to a value 1/n. As x increases, the ECDF makes additional steps of size 1/n for each sample value encountered.

    (b) The CDF of the model being tested. In this case the model is just the exponential distribution:

    f(x; θ) = (1/θ) e^(-x/θ). (10)

    However, we don't know θ, so we substitute the MLE for θ. The effect of this substitution will sometimes not be important, but should be studied to be sure (just as we studied such a technique in section 6.7 of the note on hypothesis testing).

    The KS test statistic is then the maximum difference, D, between the ECDF and the model CDF. Note that D should be small if the data was drawn from the model. The value of D thus obtained is then compared with the distribution of D under the null hypothesis (that the data is drawn from the model), and a P-value may be obtained. The distribution of D under the null hypothesis may be obtained by simulations (see again section 6.7 for examples of doing this), although there are also known forms for this distribution (e.g., see Narsky & Porter; for a given sample


    size, the distribution of the KS statistic actually doesn't depend on f if the null hypothesis is simple).
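    To make the construction concrete, here is a minimal Python sketch of the KS statistic for the exponential model with the MLE substituted for θ (you may equally use the R or MATLAB tools; the function name below is my own):

```python
import numpy as np

def ks_statistic_exponential(x):
    """Two-sided KS distance between the ECDF of x and an exponential
    CDF whose mean theta is replaced by its MLE, the sample mean."""
    y = np.sort(x)                       # ordered values y_1 <= ... <= y_n
    n = len(y)
    theta_hat = y.mean()                 # MLE of the exponential mean
    f = 1.0 - np.exp(-y / theta_hat)     # model CDF at the ordered points
    i = np.arange(1, n + 1)
    # the ECDF steps from (i-1)/n to i/n at y_i; check both sides of the step
    return max(np.max(i / n - f), np.max(f - (i - 1) / n))

rng = np.random.default_rng(0)
d = ks_statistic_exponential(rng.exponential(scale=20.0, size=1000))
```

    For data actually drawn from an exponential, D computed this way comes out a small multiple of 1/sqrt(n).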

    We'll explore this test as applied to the exponential model in this problem. You may code things yourself or use the R or MATLAB tools as you wish. Note that ks.test is the R function that performs a KS test. In MATLAB, stats::ksGOFT is a choice.

    I want you to produce four histograms in this problem, each histogram with 1000 entries (simulated experiments).

    (a) Histogram the distribution of the KS statistic D for a sample of size n = 1000 (that is, each experiment corresponds to a sample of size 1000) drawn from an exponential distribution with θ = 20. Use the true value of θ (20) in your KS test. Of course, this is not what you would actually do in a real situation, since you aren't supposed to know θ, but we are doing this to see how much difference it makes (comparing with the second histogram...).

    (b) Same as the first histogram, except use the sample mean as your estimate for θ in the KS test.

    (c) Same as above, except now sample from the θ = 23 distribution (and again use the sample mean as your estimate of the mean). You should get a similar histogram to the previous part.

    (d) Same as above, except generate the sample (of size 1000) from a uniform distribution on (0, 20). Use the sample mean as your estimate for θ in the KS test (your model is still the exponential!).
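    The simulation loop behind histograms (a) and (b) can be sketched as follows (a Python sketch; cases (c) and (d) only change the sampling line):

```python
import numpy as np

def ks_stat(x, theta):
    """KS distance between the ECDF of x and an Exp(theta) model CDF."""
    y = np.sort(x)
    n = len(y)
    f = 1.0 - np.exp(-y / theta)
    i = np.arange(1, n + 1)
    return max(np.max(i / n - f), np.max(f - (i - 1) / n))

rng = np.random.default_rng(1)
n, n_exp, theta = 1000, 1000, 20.0
d_true, d_mle = [], []
for _ in range(n_exp):
    x = rng.exponential(scale=theta, size=n)
    d_true.append(ks_stat(x, theta))      # histogram (a): known mean
    d_mle.append(ks_stat(x, x.mean()))    # histogram (b): MLE mean
# for (c) sample rng.exponential(23.0, n); for (d) rng.uniform(0.0, 20.0, n),
# in both cases still using x.mean() as the exponential-model estimate.
```

    Histogramming d_true and d_mle (e.g., with matplotlib) should show the MLE-substituted statistic shifted toward smaller values, since the fitted mean adapts to each sample.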

    Now let's do a little analysis, looking at some probabilities. First, compute the value of D, call it D95, for which 95% of the experiments have D less than D95. Do this for the data in the first two histograms. That is, the critical region of the test at the 5% significance level would be all D values larger than D95. Note that we would have to generate more experiments if we want to get reliable results for smaller significance levels; this is a limitation of this approach, even with fast computers. Do you find that the critical region determined with the known mean can be used in a test with unknown mean?

    Recall our null hypothesis is: "the actual distribution is exponential". Part (c) demonstrated (I hope) that D is a useful statistic for this hypothesis, since it is insensitive to the parameter θ. Now determine the power of your 5% significance level test, for the alternative of the uniform distribution in (d). You should use the critical region as determined from the data in the second histogram.
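    As a sketch of this analysis in Python (here D95 comes from the unknown-mean null simulation, matching the second histogram, and the power uses the uniform alternative of part (d)):

```python
import numpy as np

def ks_stat(x, theta):
    """KS distance between the ECDF of x and an Exp(theta) model CDF."""
    y = np.sort(x)
    n = len(y)
    f = 1.0 - np.exp(-y / theta)
    i = np.arange(1, n + 1)
    return max(np.max(i / n - f), np.max(f - (i - 1) / n))

rng = np.random.default_rng(2)
n, n_exp = 1000, 1000
# null: exponential data, exponential model with MLE mean (second histogram)
d_null = [ks_stat(x, x.mean())
          for x in (rng.exponential(20.0, n) for _ in range(n_exp))]
# alternative: uniform data, same exponential-model test (fourth histogram)
d_alt = [ks_stat(x, x.mean())
         for x in (rng.uniform(0.0, 20.0, n) for _ in range(n_exp))]
d95 = np.quantile(d_null, 0.95)          # 95% of null experiments have D < D95
power = np.mean(np.array(d_alt) > d95)   # fraction rejected at the 5% level
```

    With n = 1000 the uniform alternative sits far from the exponential model, so the estimated power comes out close to 1.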

    39. Let's try a problem in density estimation. We'll need a dataset with a bit more structure than most that you have generated so far. Generate a sample of size 1000 from the following distribution:

    0.3N(3, 1) + 0.5N(5, 1) + 0.2N(8, 2), (11)

    where N denotes the normal distribution, in which the first argument is the mean, and the second is the standard deviation. Make a graph showing the following components (with the same normalizations):


    (a) A histogram of the data.

    (b) A curve showing the above sampling distribution.

    (c) A curve showing the result of a kernel density estimate based on your sampled data. You may program it yourself or use the density function in R, or perhaps ksdensity in MATLAB. You may use a Gaussian kernel, or something else; just say what you use, and also the values of any smoothing parameters you use.
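    Generating the mixture sample and the kernel estimate might look like this in Python (a sketch using scipy's gaussian_kde, which uses a Gaussian kernel with Scott's-rule bandwidth by default; the plotting itself is omitted):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
n = 1000
# draw mixture component labels with probabilities 0.3, 0.5, 0.2
comp = rng.choice(3, size=n, p=[0.3, 0.5, 0.2])
means = np.array([3.0, 5.0, 8.0])
sds = np.array([1.0, 1.0, 2.0])
sample = rng.normal(means[comp], sds[comp])   # one normal draw per label

# kernel density estimate and the true mixture density on a common grid
kde = stats.gaussian_kde(sample)
grid = np.linspace(-1.0, 15.0, 400)
est_pdf = kde(grid)
true_pdf = (0.3 * stats.norm.pdf(grid, 3, 1)
            + 0.5 * stats.norm.pdf(grid, 5, 1)
            + 0.2 * stats.norm.pdf(grid, 8, 2))
```

    Plotting est_pdf and true_pdf over the histogram of sample (all with the same normalization) gives the requested graph.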

    40. We have mentioned the bootstrap method (a type of resampling) a few times in class, have used it in the context of parameter estimation in problem 32, and it is introduced in the density estimation note. Let us get a hint for how it works and how it might be useful in the context of density estimation. Recall that the bootstrap consists in generating "replicas" of our dataset. The ensemble of replicas is then used to answer questions, for example, about the distribution of statistics of interest. Here, we will be interested in the distribution of our density estimate.

    The bootstrap algorithm is very simple, to recap: Given a dataset of size n, we obtain bootstrap replicas, each of size n, by sampling from the empirical pdf. That is, to generate one replica, we randomly select n observations from our original dataset. This is done "with replacement", that is, a given observation may appear in our replica multiple times. You can think about doing it sequentially. With equal probability for each observation, we draw an observation from our dataset and place it into the bootstrap dataset. That observation is kept in the original dataset ("replaced"), and a new sample drawn. We do this n times to obtain one bootstrap replica. Then we go through the whole process again to generate another replica, etc.
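    The sequential procedure just described reduces to sampling with replacement; a minimal Python sketch (the function name and the stand-in dataset are mine):

```python
import numpy as np

def bootstrap_replicas(data, n_replicas, rng):
    """Draw n_replicas bootstrap datasets, each of size len(data),
    by sampling from data with replacement (i.e., the empirical pdf)."""
    n = len(data)
    return [rng.choice(data, size=n, replace=True) for _ in range(n_replicas)]

rng = np.random.default_rng(3)
data = rng.exponential(5.0, size=1000)        # stand-in for your dataset
replicas = bootstrap_replicas(data, 5, rng)   # a few replicas, as in the text
```

    Each replica can then be fed to the kernel estimator from the previous problem and the resulting curves overlaid on one plot.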

    We try this on the previous problem. Generate several (not many; we are going to graph them just to get the idea about how things work) bootstrap replicas from your data in the previous problem. Apply the kernel estimation to each replica. Make a plot showing the results, so that the curves may be compared. The idea is to get a visual impression for the variance in your density estimate. Again, you may either code it yourself or make use of available tools. If you are using R, you may come across the boot package. However, I suggest that you will find it easier to just use the sample function. In MATLAB, you could use randsample, making sure to specify replacement, or alternatively bootstrp.
