Resampling Statistics
Introduction to Resampling
Probability Modeling
Resample add-in
Bootstrapping values, vectors, matrices
R boot package
Conclusions
Conventional Statistics
Assumptions of “conventional” statistics:
- Variables are randomly sampled
- Variables follow a normal (Gaussian) distribution
Thus, the basis of “conventional” inference is that samples are drawn at random from a larger population and the observations in the sample are then presumed to reflect the population (e.g., mean & variance).
Resampling Statistics
In resampling statistics, statistical estimates are formed by taking random samples directly from the data at hand.
In other words, you randomly sample your random sample!
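A minimal sketch of "sampling your sample" in R. The data vector x is made up for illustration and is not from the lecture; sample() with replace = TRUE draws a resample of the same size from the data at hand.

```r
# Hypothetical data for illustration only (not from the lecture)
x <- c(4.1, 5.3, 2.8, 6.0, 3.7, 4.9, 5.5, 3.2)

set.seed(1)  # make the random draws reproducible

# One resample: draw length(x) values from x, with replacement
resample <- sample(x, size = length(x), replace = TRUE)

# Repeat many times and record the statistic of interest (here, the mean)
boot_means <- replicate(1000, mean(sample(x, size = length(x), replace = TRUE)))

mean(x)         # estimate from the original sample
sd(boot_means)  # resampling estimate of the standard error of the mean
```

Because sampling is with replacement, some observations appear more than once in a given resample and others not at all; that variation is what the procedure exploits.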
Resampling Statistics - Key Features -
1. For small data sets, resampling procedures probably provide more accurate statistical answers than conventional statistics.
2. For large data sets, resampling answers and conventional answers usually agree.
3. Resampling can handle virtually any statistic, not just those for which a distribution is known.
4. Resampling typically generates accurate 95% confidence intervals (95% CIs).
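Features 3 and 4 can be illustrated together with the median, a statistic with no simple textbook standard-error formula. The sketch below assumes the simple percentile method (one of several ways to form a bootstrap interval) and uses made-up data.

```r
# Hypothetical data for illustration only
x <- c(12, 15, 9, 22, 17, 11, 30, 14, 13, 19)

set.seed(42)

# Median of each of 2000 resamples drawn with replacement
boot_medians <- replicate(2000, median(sample(x, replace = TRUE)))

# Percentile 95% CI: the 2.5th and 97.5th percentiles of the resampled medians
ci <- quantile(boot_medians, probs = c(0.025, 0.975))
ci
```

The same three lines work for any other statistic: replace median() with the function of interest.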
Resampling Statistics - Terminology -
Resampling is a generic term for a whole array of computer-intensive methods for testing hypotheses based on Monte Carlo and resampling simulations.
Bootstrapping and jackknifing represent the two most common forms applied to “conventional statistical designs.”
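To make the distinction concrete, here is a hedged sketch comparing the two on the standard error of a mean. The jackknife recomputes the statistic leaving out one observation at a time; the bootstrap recomputes it on random resamples drawn with replacement. The data are made up for illustration.

```r
# Hypothetical data for illustration only
x <- c(4.1, 5.3, 2.8, 6.0, 3.7, 4.9, 5.5, 3.2)
n <- length(x)

# Jackknife: n recomputations, each omitting one observation
jack_means <- sapply(seq_len(n), function(i) mean(x[-i]))
jack_se <- sqrt((n - 1) / n * sum((jack_means - mean(jack_means))^2))

# Bootstrap: many recomputations on resamples drawn with replacement
set.seed(1)
boot_means <- replicate(1000, mean(sample(x, replace = TRUE)))
boot_se <- sd(boot_means)

c(jackknife = jack_se, bootstrap = boot_se, formula = sd(x) / sqrt(n))
```

For the mean, the jackknife standard error reproduces the usual formula sd(x)/sqrt(n) exactly, while the bootstrap value varies slightly from run to run.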
This lecture will focus primarily on bootstrapping procedures.
Resampling Statistics - References -
These procedures have been around for a long time but have really only begun to be applied recently because of enhanced computer technology.
Selected References:
Efron, B. 1982. The jackknife, the bootstrap, and other resampling plans. Society for Industrial and Applied Mathematics (SIAM), Philadelphia, PA.
Simon, J.L. 1997. Resampling: The new statistics, 2nd ed. (online) http://www.resample.com/content/text/index.shtml
Good, P.I. 2005. Introduction to statistics through resampling methods and R/S-Plus. Wiley Interscience, New York, NY.
Probability Modeling
Direct modeling of probabilities is the primary point of resampling statistics.
Consider a simple coin flip example.
A coin flip has two outcomes: heads (1) or tails (0).
If you flip 100 times, the expectation is 50:50, or half 1s and half 0s.
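A quick sketch of the coin flip in R, using the same coding (1 = heads, 0 = tails). The seed is arbitrary and only makes the run reproducible.

```r
set.seed(123)

# Flip a fair coin 100 times: 1 = heads, 0 = tails
flips <- sample(c(0, 1), size = 100, replace = TRUE)

table(flips)  # counts of 0s and 1s
mean(flips)   # proportion of heads; close to 0.5, but rarely exactly 0.5
```

Any single run of 100 flips will usually deviate a little from 50:50; the expectation describes the long-run average.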
Probability Modeling
Consider a less trivial & more biological case of probabilities:
In clutch sizes of 8, how often would you expect to see 3 males and 5 females (i.e., 3:5 ratio)?
This can be modeled using a coin flip algorithm. Assume the probability of male vs. female is equal and independent of previous clutches.
One can flip 8 coins, count the heads (males), and repeat this procedure many times.
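The procedure just described (flip 8 coins, count the heads, repeat) can be sketched in a few lines of R. The variable names and the choice of 1000 repetitions are illustrative assumptions.

```r
set.seed(2024)
n_trials <- 1000

# Each trial: flip 8 fair coins (1 = male, 0 = female) and count the males
males <- replicate(n_trials, sum(sample(c(0, 1), size = 8, replace = TRUE)))

table(males)                 # how often each number of males occurred
p3_sim <- mean(males == 3)   # simulated proportion of clutches with 3 males
p3_sim
```

The simulated proportion changes a little with each seed; increasing n_trials tightens it.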
Probability Modeling
The only possible logistical difficulty in this is the “many times” part.
Resampling statistical software is available in a variety of forms. A simple Excel add-in is available for $99 (academic pricing) or calculations can be done various ways in R.
Let's first look at a simple example using the Excel add-in to get the general idea, using our clutch size data. We can mathematically flip a coin 8 times, determine how many males there are, and do this many, many times:
Select Resample, set the input range to A1:A2, and place the data at D1 in a group of 8.
Resampling Software
The result is 8 values of 0 or 1 placed in column D.
Cell D9 contains the column sum (5 males for this one case of 8 flips).
We need to do this 999 more times!
Click OK, then double-click on this cell (it will turn red when selected), then double-click on any empty cell; one score is recorded.
Next, click on RS (Repeat and Score), enter 1000 trials, click OK, go to output tab…
Data are sorted high to low.
The sums (males) of the 1000 groups of 8 flips are placed in column A on the output sheet.
Now, using the stats add-in from Excel, construct a histogram of the 1000 resamples.
Three males occurs in 210 of the 1000 simulated clutches, a proportion of 0.210, or ca. 1 in 5 clutches.
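As a check on the simulated value of 0.210, the exact probability follows from the binomial distribution under the same assumptions (8 independent offspring, probability of male = 0.5): choose(8, 3) / 2^8 = 56/256.

```r
# Exact probability of 3 males in a clutch of 8 with P(male) = 0.5
p3_exact <- dbinom(3, size = 8, prob = 0.5)
p3_exact             # 0.21875

# Same value written out by hand
choose(8, 3) / 2^8   # 56 / 256 = 0.21875
```

The resampled estimate (0.210) and the exact value (0.21875) agree to within the sampling error expected from 1000 trials.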
Resampling Software
Boot Package
v. 1.2-43
25-SEP-11
http://cran.r-project.org/web/packages/boot/boot.pdf
The boot package is designed to provide extensive facilities for all forms of bootstrapping and resampling.
One can bootstrap a simple statistic (e.g., median), a vector (e.g., regression weights), or an entire matrix.
The main bootstrapping function is boot() and has the following format:
bootobject <- boot(data= , statistic= , R= , ...)
where,
data = a vector, matrix, or dataframe
statistic = a function that produces the k statistics to be bootstrapped (k=1 if bootstrapping a single statistic). The function should include an “indices” parameter that the boot() function can use to select cases for each replication.
R = the number of bootstrap replicates
… = additional parameters
boot() calls the statistic function R times.
Each time, it generates a set of random indices, with replacement. (Just like the resample Excel add-in.)
These indices are used within the statistic function to select a sample.
The statistics are calculated on the sample and the results accumulated in bootobject.
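The steps above can be mimicked by hand to show what happens on each replicate. This is a conceptual sketch only, not the actual source of boot(); the data and the function name med_stat are made up for illustration.

```r
# A statistic function in the shape boot() expects: data first, indices second
med_stat <- function(data, indices) median(data[indices])

x <- c(12, 15, 9, 22, 17, 11, 30, 14, 13, 19)  # hypothetical data
R <- 500                                        # number of replicates

set.seed(7)
t0 <- med_stat(x, seq_along(x))  # statistic on the original data
t  <- numeric(R)                 # storage for the replicates

for (r in seq_len(R)) {
  idx  <- sample(seq_along(x), replace = TRUE)  # random indices, with replacement
  t[r] <- med_stat(x, idx)                      # statistic on the selected cases
}

t0
head(t)
```

Passing indices rather than the resampled data itself is what lets one statistic function serve vectors, matrices, and data frames alike.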
The bootobject structure includes:
t0 = The observed values of k statistics applied to the original data
t = An R x k matrix where each row is a bootstrap replicate of the k statistics.
You can access these as bootobject$t0 and bootobject$t
Once the bootstrap samples have been generated, use print(bootobject) and plot(bootobject) to examine the results.
boot.ci() can be used to obtain confidence intervals for the statistic(s).
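A minimal end-to-end sketch of these pieces for a single statistic (the median). The data and the function name med_stat are illustrative assumptions; boot(), boot.ci(), and the t0 and t components are as described above. The percentile interval type is used here for simplicity.

```r
library(boot)

x <- c(12, 15, 9, 22, 17, 11, 30, 14, 13, 19)  # hypothetical data

# Statistic function with the indices parameter that boot() requires
med_stat <- function(data, indices) median(data[indices])

set.seed(7)
bootobject <- boot(data = x, statistic = med_stat, R = 1000)

bootobject$t0      # observed median of the original data
dim(bootobject$t)  # R x k matrix of replicates: 1000 x 1 here

print(bootobject)  # original value, bias, and standard error

# Percentile 95% confidence interval
ci <- boot.ci(bootobject, type = "perc")
ci$percent[4:5]    # lower and upper endpoints
```

Calling plot(bootobject) additionally draws a histogram and a normal Q-Q plot of the replicates.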
Let's load the library boot and use the mtcars dataset (which ships with base R in the datasets package):
...
We can try a standard linear model of mpg as a function of weight and displacement:
> summary(reg)

Call:
lm(formula = mpg ~ wt + disp)

Residuals:
    Min      1Q  Median      3Q     Max
-3.4087 -2.3243 -0.7683  1.7721  6.3484

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 34.96055    2.16454  16.151 4.91e-16 ***
wt          -3.35082    1.16413  -2.878  0.00743 **
disp        -0.01773    0.00919  -1.929  0.06362 .
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 2.917 on 29 degrees of freedom
Multiple R-squared: 0.7809, Adjusted R-squared: 0.7658
F-statistic: 51.69 on 2 and 29 DF, p-value: 2.744e-10
> results

ORDINARY NONPARAMETRIC BOOTSTRAP

Call:
boot(data = mtcars, statistic = rsq, R = 1000, formula = mpg ~ wt + disp)

Bootstrap Statistics :
     original       bias    std. error
t1* 0.7809306  0.009334923  0.04890951
> quartz(height=4,width=7)
> plot(results)
> boot.ci(results, type="bca")

BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
Based on 1000 bootstrap replicates

CALL :
boot.ci(boot.out = results, type = "bca")

Intervals :
Level       BCa
95%   ( 0.6314,  0.8525 )
Calculations and Intervals on Original Scale
Some BCa intervals may be unstable
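The boot() call above uses a statistic function named rsq whose definition is not shown. A plausible sketch, assuming it mirrors the indices-based pattern described earlier and returns the model R-squared (this is an assumption, not the lecture's own code):

```r
library(boot)

# Assumed form of the statistic function (the original definition is not shown)
rsq <- function(formula, data, indices) {
  d <- data[indices, ]          # rows selected by boot() for this replicate
  fit <- lm(formula, data = d)  # refit the model on the resampled rows
  summary(fit)$r.squared        # return the single statistic of interest
}

set.seed(1234)  # arbitrary seed, for reproducibility only
results <- boot(data = mtcars, statistic = rsq,
                R = 1000, formula = mpg ~ wt + disp)

results$t0  # R-squared on the original data
```

The formula argument is passed through boot()'s ... to the statistic function; because it is matched by name, the data and indices that boot() supplies positionally fall into the remaining two parameters. The bias and standard error will differ slightly from the printed output, since they depend on the random draws.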
We can extend the single-value bootstrap to an entire vector. Continuing with the same example, this time we bootstrap the model's regression coefficients:
> bsmodel <- function(formula, data, indices) {
+   d <- data[indices,]  # allows boot to select sample
+   fit <- lm(formula, data=d)
+   return(coef(fit))
+ }
> results <- boot(data=mtcars,
+                 statistic=bsmodel,
+                 R=1000, formula=mpg~wt+disp)
> results
ORDINARY NONPARAMETRIC BOOTSTRAP
Call:
boot(data = mtcars, statistic = bsmodel, R = 1000, formula = mpg ~ wt + disp)
Bootstrap Statistics :
       original         bias    std. error
t1* 34.96055404   9.262732e-02 2.493484690
t2* -3.35082533  -5.329619e-02 1.180377872
t3* -0.01772474   3.939446e-05 0.008735869
> results$t
          [,1]        [,2]          [,3]
 [1,] 31.65568 -2.06400409 -2.212067e-02
 [2,] 34.12020 -2.88466428 -1.819257e-02
 [3,] 38.02991 -4.35540788 -1.735722e-02
 [4,] 33.95197 -3.77649064 -9.752654e-03
 [5,] 34.43601 -3.16552898 -1.873982e-02
 [6,] 34.47165 -2.89633129 -2.302154e-02
 [7,] 35.48928 -3.69683419 -1.510129e-02
 [8,] 35.47456 -3.11758947 -2.271243e-02
 [9,] 33.57981 -2.30608721 -2.730837e-02
[10,] 36.10200 -4.51600675 -4.876640e-03
[11,] 31.67622 -2.60958056 -1.730342e-02
. . .

> results$t0
(Intercept)          wt        disp
34.96055404 -3.35082533 -0.01772474
> boot.ci(results, type="bca", index=1) # intercept

BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
Based on 1000 bootstrap replicates

CALL :
boot.ci(boot.out = results, type = "bca", index = 1)

Intervals :
Level       BCa
95%   (29.83, 39.96 )
Calculations and Intervals on Original Scale

> boot.ci(results, type="bca", index=2) # wt
> boot.ci(results, type="bca", index=3) # disp
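Rather than calling boot.ci() once per coefficient, intervals for all three can be read off the columns of results$t in one step. The sketch below swaps in the simple percentile method in place of BCa, so the endpoints will not match the BCa output exactly; the seed is arbitrary.

```r
library(boot)

# Statistic function returning the vector of regression coefficients
bsmodel <- function(formula, data, indices) {
  d <- data[indices, ]  # allows boot to select sample
  coef(lm(formula, data = d))
}

set.seed(1234)
results <- boot(data = mtcars, statistic = bsmodel,
                R = 1000, formula = mpg ~ wt + disp)

# Percentile 95% interval for each column (coefficient) of the replicate matrix
ci <- t(apply(results$t, 2, quantile, probs = c(0.025, 0.975)))
rownames(ci) <- names(results$t0)  # (Intercept), wt, disp
ci
```

Each row of ci corresponds to one coefficient, in the same order as the index argument of boot.ci().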
CarBoot.R Script File
Resampling - Conclusions -
Hopefully, by now, you can see that there is a very general principle here that can be applied to virtually any statistical design.
Resampling via bootstrapping is a powerful tool in many statistical situations (cf. Chpt. 19 in W&S textbook).
A nice overview of the concepts examined here can be found in:
Diaconis, P., and B. Efron. 1983. Computer-intensive methods in statistics. Scientific American 248(5): 116-130.