More on Sampling Distributions and Confidence Intervals
Jared S. Murray
The University of Texas at Austin, McCombs School of Business
Recall: Sampling Distributions and Standard Errors
Sampling distributions describe how our estimates are likely to
change if we had seen slightly different data (a different sample
from the same population)
Large spread in the sampling distribution → low confidence that
our estimate – which is one random draw from this distribution – is
close to the true value (usually the mean, or close to the mean, of
this distribution)
An estimate’s standard error is the standard deviation (spread) of
its sampling distribution.
Estimating standard errors
Estimating standard errors
We saw last week how to estimate sampling distributions and standard errors using the bootstrap. This approach is useful, general, and easy to implement.
For some important statistics, we can also directly calculate
estimates of standard errors, under some assumptions. This is
probably how you did it in your last stats class.
For example: the standard error of the sample mean ȳ is

s_ȳ = √(σ²/n) ≈ √(s_y²/n)

where n is the sample size, σ² is the population variance of y, and s_y² is the sample variance of y.
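As a sketch of this plug-in calculation in R, using simulated stand-in data (the AFC dataset itself is not reproduced here):

```r
# Direct (plug-in) estimate of the standard error of the sample mean:
# s_y / sqrt(n). The data `y` below are simulated, purely illustrative.
set.seed(1)
y <- rnorm(100, mean = 50, sd = 10)  # hypothetical sample, n = 100
n <- length(y)
se_direct <- sqrt(var(y) / n)        # same as sd(y) / sqrt(n)
se_direct
```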
Example: AFC data
Let’s see how this works out in the AFC data.... (R script)
Bootstrap vs Direct estimation of standard errors
Why bother with the bootstrap? It's general and easy to implement, often makes fewer assumptions, and works in cases where a mathematical expression for the standard error is impossible to obtain.
Why bother with direct estimation? Often faster to compute, and
tells us something about how the estimates behave (How does the
standard error of the sample mean change with the sample size?
With the spread of the data (population variance)?)
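A quick simulation (hypothetical, not the AFC data) makes the scaling behavior concrete: the SE of the mean shrinks like 1/√n and grows linearly with the population spread.

```r
# Monte Carlo estimate of the SE of the sample mean for various n and sigma.
# True SE is sigma / sqrt(n); halving the SE requires 4x the sample size.
set.seed(2)
se_hat <- function(n, sigma) sd(replicate(2000, mean(rnorm(n, 0, sigma))))
se_hat(100, 10)   # roughly 10/sqrt(100) = 1
se_hat(400, 10)   # roughly 0.5: quadrupling n halves the SE
se_hat(100, 20)   # roughly 2: doubling the spread doubles the SE
```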
Normal approximations to sampling distributions
Normal approximations to sampling distributions
We've seen several examples where sampling distributions looked approximately normal.
This is not a coincidence! For many statistics the sampling
distribution looks like a normal distribution, especially in large
samples – this is the Central Limit Theorem at work.
Normal approximations to sampling distributions
For example, for a sample mean, if n is large then

ȳ ∼ N(µ, σ²/n)  (approximately).

For sample means, this approximation is quite good.
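A small simulation illustrates the CLT at work: even when the population is skewed, the sampling distribution of the mean is close to N(µ, σ²/n). (Simulated data, not from the course script.)

```r
# Sampling distribution of the mean of skewed (exponential) data.
# For rate = 1: mu = 1 and sigma = 1, so the CLT predicts N(1, 1/n).
set.seed(3)
n <- 50
means <- replicate(5000, mean(rexp(n, rate = 1)))
c(mean(means), sd(means))   # roughly (1, 1/sqrt(50)) = (1, 0.141)
# hist(means) looks bell-shaped despite the skewed population
```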
Confidence intervals
Confidence Intervals
At a high level, confidence intervals give us a set of plausible
values for the quantity we’re trying to estimate.
What do we mean by plausible? Consistent with the data we
observe and what we expect the error to be in repeated samples
(i.e., the spread of the sampling distribution)
We have a few different ways to compute confidence intervals....
Confidence Intervals (Standard Error Method)
Consider estimating a confidence interval for the population mean µ. We have (approximately)

Ȳ ∼ N(µ, s_Ȳ²)

so our error has the distribution

(Ȳ − µ) ∼ N(0, s_Ȳ²)

- What is a good prediction for µ? What is our best guess? Ȳ
- How do we make mistakes? How far from µ might we be? About 95% of the time our error is within ±2 × s_Ȳ
- [Ȳ ± 2 × s_Ȳ] gives a 95% confidence interval for µ. You can think of this as a set of plausible values for µ.
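The standard-error method in R, sketched on simulated stand-in data:

```r
# 95% CI for the mean by the standard-error method: ybar +/- 2 * s_y/sqrt(n).
# `y` is a hypothetical simulated sample, not the AFC data.
set.seed(4)
y <- rnorm(80, mean = 5, sd = 2)
se <- sd(y) / sqrt(length(y))
ci <- mean(y) + c(-2, 2) * se   # (lower, upper)
ci
```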
Confidence Intervals (Standard Error Method)
We can use a different critical value (number of standard errors)
to get a confidence interval with a different level than 95%.
We can compute an estimate of the standard error either directly from the data – e.g., s_y/√n for a sample mean – or using the bootstrap.
Confidence Intervals (Percentile Method)
When the normal approximation to the sampling distribution is
good, a 90% confidence interval (for example) runs approximately
from the 5th to the 95th percentile of the bootstrap distribution.
When the normal approximation is bad, we can use percentiles of
the bootstrap distribution directly. (With some corrections; see
footnotes in DSGI Ch 5)
Often the percentile and standard-error confidence intervals are close; if they differ, reporting the wider one (or the union of the two) is a reasonable thing to do.
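The percentile method is a one-liner once you have the bootstrap distribution. A sketch on simulated skewed data:

```r
# Percentile-method 90% CI: resample with replacement, recompute the
# statistic, take the 5th and 95th percentiles. Simulated data only.
set.seed(5)
y <- rexp(60, rate = 1)   # hypothetical skewed sample
boot_means <- replicate(2000, mean(sample(y, replace = TRUE)))
quantile(boot_means, c(0.05, 0.95))   # 90% percentile interval
```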
Summary: 3 ways to estimate sampling distributions/CIs
There are three tools in our toolchest:

- The Central Limit Theorem: assuming the estimator is (approximately or asymptotically) unbiased, compute the standard error directly and do calculations based on the normal distribution
- Bootstrapped standard errors: use the bootstrap to estimate the standard error of the estimator, then do calculations based on the normal distribution (CLT)
- Percentile bootstrap: use the bootstrap to estimate the sampling distribution directly, and form confidence intervals using quantiles of the estimated sampling distribution

These are all easy to do in R. (See examples in the R script.) Technically item 3 requires additional adjustments to be correct; see footnotes in DSGI.
Interpreting Confidence Intervals
However they're constructed, the goal of a 100(1 − α)% confidence interval is to "cover" (contain) the true quantity in 100(1 − α)% of the datasets it could be computed from.
(See p. 119 of DSGI and the module linked from this week's post.)
In any particular dataset, we can't say whether the confidence interval actually contains the true value. But over many analyses, reporting a confidence interval as the set of plausible values means that our intervals will usually contain the true value – far better than reporting just the estimate!
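The coverage property can be checked by simulation: generate many datasets from a known population, build the interval each time, and count how often it contains the truth.

```r
# Coverage check: a 95% SE-method interval for the mean should contain
# the true mean (here mu = 0) in roughly 95% of simulated datasets.
set.seed(6)
covered <- replicate(2000, {
  y <- rnorm(50)              # true mu = 0, sigma = 1
  se <- sd(y) / sqrt(50)
  abs(mean(y)) <= 2 * se      # does [ybar +/- 2*se] cover 0?
})
mean(covered)                 # typically close to 0.95
```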
Revisiting standard errors and confidence
intervals for regression models
Sampling Distribution of Regression Coefficients
Like the sample mean, regression coefficients estimated via least squares have their own central limit theorem, so in large samples their sampling distribution is approximately normal.
We've seen how to bootstrap regression models; with some assumptions we can directly estimate their standard errors too. In particular, we assume:

- The residuals are independent of each other
- The residual standard deviation is constant (the spread of the residuals doesn't change with X, or with other factors like time)
Sampling distribution of the slope
Can we intuit what should be in the formula for the standard error of the slope, s_β₁?

- How should the residual standard deviation s_e figure in the formula?
- What about n?
- Anything else?

s_β₁² ≈ s_e² / Σ(xᵢ − x̄)² = s_e² / ((n − 1) s_x²)

Three factors: sample size (n), residual variance (s_e²), and X-spread (s_x).
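The formula can be checked numerically against what `lm` reports. A sketch with simulated data (the formula assumes independent, constant-variance residuals):

```r
# Verify s_b1^2 = s_e^2 / ((n - 1) * s_x^2) against lm's reported SE.
set.seed(7)
n <- 100
x <- rnorm(n)
y <- 1 + 2 * x + rnorm(n)            # simulated, constant-variance errors
fit <- lm(y ~ x)
s_e <- summary(fit)$sigma            # residual standard deviation
se_slope_formula <- sqrt(s_e^2 / ((n - 1) * var(x)))
se_slope_lm <- summary(fit)$coefficients["x", "Std. Error"]
c(se_slope_formula, se_slope_lm)     # the two agree
```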
Sampling distribution of the intercept
s_β₀² ≈ s_e² (1/n + (1/(n − 1)) (x̄/s_x)²)

Three factors: sample size (n), residual variance (s_e²), and the standardized distance between x̄ and zero:

x̄/s_x = (x̄ − 0)/s_x
Extracting these standard errors from lm
R script...
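In place of the course's R script, a minimal sketch of pulling these quantities out of a fitted `lm` object (toy simulated data):

```r
# Extracting coefficient standard errors and CIs from lm.
set.seed(8)
x <- runif(40)
y <- 3 - x + rnorm(40, sd = 0.5)     # simulated illustration
fit <- lm(y ~ x)
coef(summary(fit))[, "Std. Error"]   # SEs for (Intercept) and x
confint(fit, level = 0.95)           # t-based 95% confidence intervals
```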
About that bootstrap...
Again: these formulas "work" when the residuals are independent and have the same variance (i.e., the spread of values around the regression line is constant).
The bootstrapped SEs don't require the constant-variance assumption, and can be more appropriate if it seems to be violated (see the AFC example in the R script).
But if the constant variance assumption seems OK, the standard
errors/confidence intervals from lm will tend to be good in large
samples and/or when the residuals are approximately normal.
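A sketch of bootstrapping a regression slope by resampling rows, on simulated data where the residual spread deliberately grows with x (so the constant-variance assumption fails):

```r
# Bootstrap SE of the slope by resampling (x, y) pairs. The residual sd
# below grows with x, violating the constant-variance assumption that
# lm's formula-based SE relies on.
set.seed(9)
n <- 100
x <- runif(n, 0, 3)
y <- 1 + 2 * x + rnorm(n, sd = 0.5 * x)   # non-constant spread
boot_slopes <- replicate(2000, {
  idx <- sample(n, replace = TRUE)        # resample rows
  coef(lm(y[idx] ~ x[idx]))[2]
})
sd(boot_slopes)                            # bootstrap SE of the slope
summary(lm(y ~ x))$coefficients[2, 2]      # formula-based SE, for comparison
```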