Sampling Distributions
CHAPTER 6
Sample Mean and Variance
A statistic is a function of the sample observations that contains no unknown parameters.
The sampling distribution describes probabilities associated with a statistic when a random sample is drawn from a population.
The sampling distribution is the probability distribution or probability density function of the statistic.
Derivation of the sampling distribution is the first step in calculating a confidence interval or carrying out a hypothesis test for a parameter.
Example
Suppose that X1, ..., Xn are a simple random sample from a normally distributed population with expected value µ and known variance σ². Then the sample mean x̄ is a statistic used to give information about the population parameter µ; x̄ is normally distributed with expected value µ and variance σ²/n.
Principle of centrality for sampling distributions of means: The sample means tend to center around the population mean.
Principle of variability: The variability among the sample means decreases as the sample size increases.
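Both principles can be checked with a small simulation. The sketch below is illustrative only: the population (normal with µ = 50 and σ = 10), the sample sizes, the number of repetitions, and the helper name `simulate_sample_means` are assumptions chosen for the demonstration, not values from the text.

```python
import random
import statistics

def simulate_sample_means(n, reps, mu=50.0, sigma=10.0, seed=0):
    """Draw `reps` samples of size n from N(mu, sigma^2); return their means."""
    rng = random.Random(seed)
    return [
        statistics.fmean(rng.gauss(mu, sigma) for _ in range(n))
        for _ in range(reps)
    ]

for n in (4, 16, 64):
    means = simulate_sample_means(n, reps=2000)
    # Centrality: the average of the sample means should be close to mu = 50.
    # Variability: their variance should be close to sigma^2 / n = 100 / n.
    print(n, round(statistics.fmean(means), 2), round(statistics.pvariance(means), 2))
```

The printed variances should shrink roughly in proportion to 1/n, while the averages stay near 50.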
For a random variable X, the expected value is E(X) = µ and the variance is Var(X) = σ².
If the random variable X can take on any of N values, each with probability 1/N, then the mean and variance become:
$$\mu = \frac{1}{N}\sum_{i=1}^{N} x_i, \qquad \sigma^2 = \frac{1}{N}\sum_{i=1}^{N} (x_i - \mu)^2$$
The value xi denotes the ith possible value of X.
The observations X1, ..., Xn can be considered as elements of a random sample from an infinite population having a probability distribution with mean µ and variance σ².
For a sample of size n, the sample mean is
$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$$
where xi is the ith sample observation.
To measure the variability of the sample, we might try a sample copy of the variance, namely
$$s_n^2 = \frac{1}{n}\sum_{i=1}^{n} (x_i - \bar{x})^2$$
We might say that x̄ is a good estimator of µ.
The sample variance with denominator n is not a very good estimator of the population variance. Whereas random samples tend to center at their population mean, they tend to show less variability than the population they came from. To compensate for this, we change the denominator from n to n-1:
$$s^2 = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2$$
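The two denominators can be compared directly on a small data set. This is a minimal sketch; the data values are hypothetical and chosen only so that the arithmetic is easy to follow. It also checks that Python's `statistics.pvariance` (denominator n) and `statistics.variance` (denominator n-1) agree with the hand computation.

```python
import statistics

data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]  # hypothetical sample

n = len(data)
x_bar = statistics.fmean(data)
ss = sum((x - x_bar) ** 2 for x in data)  # sum of squared deviations from x_bar

s2_n = ss / n        # "sample copy" of the variance, denominator n
s2 = ss / (n - 1)    # sample variance, denominator n - 1

# The statistics module implements the same two formulas.
assert abs(s2_n - statistics.pvariance(data)) < 1e-12
assert abs(s2 - statistics.variance(data)) < 1e-12
print(x_bar, s2_n, round(s2, 4))
```

The n-1 version is always the larger of the two, which is the compensation described above.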
The sampling distribution of the mean is the probability distribution of the mean of a random sample. Its mean and variance follow from the linearity of expectation and the independence of the observations: E(X̄) = µ and Var(X̄) = σ²/n.
The central limit theorem states that the sampling distribution of the mean, for any set of independent and identically distributed random variables with finite variance, will tend towards the normal distribution as the sample size gets larger. This may be restated as follows:
Central Limit Theorem
Central Limit Theorem: When n is sufficiently large, the sampling distribution of X̄ is well approximated by a normal curve, even when the population distribution is not itself normal.
We sometimes abbreviate the CLT to the phrase “X̄ is asymptotically normally distributed with mean µ and variance σ²/n”.
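Therefore,
$$P(\bar{X} \le b) = P\left(\frac{\bar{X}-\mu}{\sigma/\sqrt{n}} \le \frac{b-\mu}{\sigma/\sqrt{n}}\right) \approx P\left(Z \le \frac{b-\mu}{\sigma/\sqrt{n}}\right)$$
where Z is a standard normal random variable.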
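The approximation can be tried on a population that is clearly not normal. The sketch below assumes an exponential population with rate 1 (so µ = 1 and σ = 1), a sample size of 50, and a cutoff b = 1.1; these choices and the function names are illustrative assumptions, not part of the text. It compares the normal approximation to P(X̄ ≤ b) with a simulated estimate.

```python
import math
import random
from statistics import NormalDist

def clt_probability(b, mu, sigma, n):
    """Normal approximation to P(X-bar <= b) for a sample of size n."""
    z = (b - mu) / (sigma / math.sqrt(n))
    return NormalDist().cdf(z)

def simulated_probability(b, n, reps=10_000, seed=1):
    """Estimate P(X-bar <= b) by simulation for an exponential(1) population."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(reps):
        x_bar = sum(rng.expovariate(1.0) for _ in range(n)) / n
        hits += x_bar <= b  # True counts as 1
    return hits / reps

# Exponential(1) has mu = 1 and sigma = 1 and is strongly skewed.
approx = clt_probability(b=1.1, mu=1.0, sigma=1.0, n=50)
simulated = simulated_probability(b=1.1, n=50)
print(round(approx, 4), round(simulated, 4))
```

The two numbers should be close but not identical; the remaining gap reflects the skewness of the population and simulation noise.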
Example
The fracture strengths of a certain type of glass average 14 (in thousands of pounds per square inch) and have a standard deviation of 2.
A) What is the probability that the average fracture strength for 100 pieces of this glass exceeds 14.5?
B) Find an interval that includes the average fracture strength for 100 pieces of this glass with probability 0.95.
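A) The average strength has approximately a normal distribution with mean µ = 14 and standard deviation
$$\frac{\sigma}{\sqrt{n}} = \frac{2}{\sqrt{100}} = 0.2$$
Thus,
$$P(\bar{X} > 14.5) = P\left(\frac{\bar{X}-\mu}{\sigma/\sqrt{n}} > \frac{14.5-\mu}{\sigma/\sqrt{n}}\right) \approx P\left(Z > \frac{14.5-14}{0.2}\right) = P\left(Z > \frac{0.5}{0.2}\right) = P(Z > 2.5) = 0.5 - 0.4938 = 0.0062$$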
The probability of seeing an average value (n=100) more than 0.5 unit above the population mean is, in this case, very small.
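Part A can be reproduced without a normal table. This is a minimal sketch using `statistics.NormalDist` from the Python standard library; the variable names are illustrative.

```python
import math
from statistics import NormalDist

mu, sigma, n = 14.0, 2.0, 100       # values from the example
se = sigma / math.sqrt(n)           # standard deviation of the sample mean
z = (14.5 - mu) / se                # standardized cutoff
p_exceeds = 1.0 - NormalDist().cdf(z)   # P(Z > z)
print(round(se, 3), round(z, 2), round(p_exceeds, 4))
```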
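B) We have seen that
$$P\left(-1.96\frac{\sigma}{\sqrt{n}} \le \bar{X} - \mu \le 1.96\frac{\sigma}{\sqrt{n}}\right) = 0.95$$
for a normally distributed X̄. In this problem,
$$\mu - 1.96\frac{\sigma}{\sqrt{n}} = 14 - 1.96\frac{2}{\sqrt{100}} \approx 13.6 \quad\text{and}\quad \mu + 1.96\frac{\sigma}{\sqrt{n}} = 14 + 1.96\frac{2}{\sqrt{100}} \approx 14.4$$
Approximately 95% of the sample mean fracture strengths, for samples of size 100, should lie between 13.6 and 14.4.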
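The interval in part B can be computed the same way. In this sketch the 1.96 multiplier is obtained from `NormalDist().inv_cdf(0.975)` rather than typed in; the variable names are illustrative.

```python
import math
from statistics import NormalDist

mu, sigma, n = 14.0, 2.0, 100           # values from the example
se = sigma / math.sqrt(n)               # 0.2
z_crit = NormalDist().inv_cdf(0.975)    # about 1.96
lower, upper = mu - z_crit * se, mu + z_crit * se

# Probability that a N(mu, se^2) sample mean falls inside the interval.
coverage = NormalDist(mu, se).cdf(upper) - NormalDist(mu, se).cdf(lower)
print(round(z_crit, 2), round(lower, 3), round(upper, 3), round(coverage, 3))
```

The unrounded endpoints are about 13.608 and 14.392, which round to the 13.6 and 14.4 used in the example.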
Sampling Distribution of Sums
The Normal Approximation to the Binomial Distribution
Requirements for a Binomial Distribution
- Fixed number of trials, n
- Trials are independent
- Each trial has 2 possible outcomes
- Probabilities remain constant from trial to trial (p and q, where q = 1 - p)
Formula:
$$P(X = x) = \binom{n}{x} p^x q^{n-x}$$
Mean = np, Variance = npq
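The formula and the two moments can be verified numerically. The sketch below assumes n = 20 and p = 0.25 purely for illustration; `binomial_pmf` is a hypothetical helper name.

```python
import math

def binomial_pmf(x, n, p):
    """P(X = x) for X ~ Binomial(n, p): nCx * p^x * q^(n-x)."""
    return math.comb(n, x) * p**x * (1 - p) ** (n - x)

n, p = 20, 0.25    # illustrative values
pmf = [binomial_pmf(x, n, p) for x in range(n + 1)]
mean = sum(x * px for x, px in enumerate(pmf))
variance = sum((x - mean) ** 2 * px for x, px in enumerate(pmf))

# The probabilities sum to 1; mean and variance match np and npq.
print(round(sum(pmf), 6), round(mean, 6), round(variance, 6))
```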
Example:
54% of people have answering machines. Sample 1000 households. What is the probability that more than 556 have answering machines?
P(X >556)= P(557)+P(558)+P(559)+...+P(999)+P(1000)
We are going to use the normal distribution to approximate a binomial distribution.
Requirements for Using a Normal Distribution as an Approximation to a Binomial Distribution:
If np≥5 and nq≥5, then the binomial random variable is approximately normally distributed with the mean and standard deviation of μ=np and σ = sqrt(npq).
Continuity correction factor
We are using a continuous model to approximate a discrete model, so we need to make an adjustment for continuity. This adjustment is the continuity correction factor.
Because the normal distribution can take all real numbers (is continuous) but the binomial distribution can only take integer values (is discrete), a normal approximation to the binomial should identify the binomial event "8" with the normal interval "(7.5, 8.5)" (and similarly for other integer values).
Continuity Correction Factor
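$$P(X \le x) \approx P\left(Z \le \frac{x + 0.5 - np}{\sqrt{np(1-p)}}\right)$$
$$P(X \ge x) \approx 1 - P\left(Z \le \frac{x - 0.5 - np}{\sqrt{np(1-p)}}\right)$$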
Example: If n=20 and p=.25, what is the probability that X is greater than or equal to 8?
The normal approximation without the continuity correction factor yields Z=(8-20 × .25)/(20 × .25 × .75)^.5 = 1.55, hence P(X ≥ 8) ~ .0606
The continuity correction factor requires us to use 7.5 in order to include 8 since the inequality is weak and we want the region to the right. z = (7.5 - 5)/(20 × .25 × .75)^.5 = 1.29, hence the area under the normal curve is .0985.
The exact binomial probability is about .1018. Hence for small n, the continuity correction factor gives a much better approximation.
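The three numbers in this example can be recomputed side by side. This is a minimal sketch using only the standard library; because z is not rounded to two decimals here, the results may differ from table lookups in the last digit.

```python
import math
from statistics import NormalDist

n, p, x = 20, 0.25, 8
mu = n * p
sigma = math.sqrt(n * p * (1 - p))
std_normal = NormalDist()

# P(X >= 8) computed three ways
uncorrected = 1 - std_normal.cdf((x - mu) / sigma)        # cutoff at 8
corrected = 1 - std_normal.cdf((x - 0.5 - mu) / sigma)    # cutoff at 7.5
exact = sum(math.comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(x, n + 1))
print(round(uncorrected, 4), round(corrected, 4), round(exact, 4))
```

The corrected value lands much closer to the exact binomial sum than the uncorrected one.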
Example: 54% of people have answering machines. Sample 1000 households. Estimate the probability that more than 556 have answering machines.
1) Test whether the normal approximation is appropriate: np = .54 × 1000 = 540 and nq = .46 × 1000 = 460, and both are greater than 5.
2) Find the mean and the standard deviation: µ = np = 540 and σ = sqrt(npq) = sqrt(248.4) = 15.76.
3) Draw the normal curve and identify the region representing the probability to be found.
4) Find the continuity correction factor: "more than 556" means 557 or more, so the region starts at 556.5.
5) Estimate the probability:
P(N > 556) = P(N ≥ 557) ≈ P(N ≥ 556.5) = P((N - µ)/σ ≥ (556.5 - 540)/15.76) = P(Z ≥ 1.05) = 0.5 - 0.3531 = 0.1469
What about the probability that fewer than 519 have answering machines?
P(N < 519) = P(N ≤ 518) ≈ P(N ≤ 518.5) = P((N - µ)/σ ≤ (518.5 - 540)/15.76) = P(Z ≤ -1.36) = 0.5 - 0.4131 = 0.0869
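As a numerical check on this example, the exact binomial tail sums can be compared with the continuity-corrected normal approximation. The sketch below computes σ as sqrt(npq) and places the cutoffs at 556.5 (for "more than 556") and 518.5 (for "fewer than 519"). The helper `binom_log_pmf` is a hypothetical name; it works on the log scale with `math.lgamma` so that n = 1000 causes no overflow.

```python
import math
from statistics import NormalDist

def binom_log_pmf(x, n, p):
    """log P(X = x) for X ~ Binomial(n, p), computed with lgamma for stability."""
    log_comb = math.lgamma(n + 1) - math.lgamma(x + 1) - math.lgamma(n - x + 1)
    return log_comb + x * math.log(p) + (n - x) * math.log(1 - p)

n, p = 1000, 0.54
mu, sigma = n * p, math.sqrt(n * p * (1 - p))

# P(X > 556) = P(X >= 557): exact sum versus normal approximation at 556.5
exact_upper = sum(math.exp(binom_log_pmf(x, n, p)) for x in range(557, n + 1))
approx_upper = 1 - NormalDist(mu, sigma).cdf(556.5)

# P(X < 519) = P(X <= 518): exact sum versus normal approximation at 518.5
exact_lower = sum(math.exp(binom_log_pmf(x, n, p)) for x in range(0, 519))
approx_lower = NormalDist(mu, sigma).cdf(518.5)

print(round(sigma, 2), round(exact_upper, 4), round(approx_upper, 4))
print(round(exact_lower, 4), round(approx_lower, 4))
```

With n this large the approximation and the exact sums should agree to about two decimal places.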
• A sample statistic used to estimate an unknown population parameter is called an estimator; the value it takes for a particular sample is called an estimate.
• The discrepancy between the estimate and the true parameter value is known as sampling error.
• A statistic is a random variable with a probability distribution, called the sampling distribution, which is generated by repeated sampling.
• We use the sampling distribution of a statistic to assess the sampling error in an estimate.
The Sampling Distribution of the Sample Variance
There is no analog to the CLT for S² that gives a large-sample approximation for an arbitrary distribution.
The exact distribution of S² can be derived when X1, ..., Xn are i.i.d. normal.
The sampling distribution of S² is needed to infer the variability of a population from the variability of its sample. The simplest case is when the population has a normal distribution.
Chi-Square Distribution
Theorem. If S² is the variance of a random sample of size n taken from a normal population having the variance σ², then the statistic
$$\chi^2 = \frac{(n-1)S^2}{\sigma^2}$$
has a chi-squared distribution with ν = n - 1 degrees of freedom.
Mean of S²
The χ² distribution has a mean equal to its degrees of freedom, n-1, and a variance equal to twice its degrees of freedom, 2(n-1).
From this information we find the mean and variance of S².
$$E\left[\frac{(n-1)S^2}{\sigma^2}\right] = n-1$$
therefore
$$\frac{n-1}{\sigma^2}\,E(S^2) = n-1$$
and
$$E(S^2) = \frac{\sigma^2}{n-1}(n-1) = \sigma^2$$
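Variance of S²
$$V\left[\frac{(n-1)S^2}{\sigma^2}\right] = 2(n-1)$$
therefore
$$\frac{(n-1)^2}{\sigma^4}\,V(S^2) = 2(n-1)$$
and
$$V(S^2) = \frac{2(n-1)\sigma^4}{(n-1)^2} = \frac{2\sigma^4}{n-1}$$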
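The results E(S²) = σ² and V(S²) = 2σ⁴/(n-1) can be checked by simulation for a normal population. The sketch below is illustrative only: n = 10, σ² = 4, the number of repetitions, and the function name `simulate_sample_variances` are assumptions, not values from the text.

```python
import random

def simulate_sample_variances(n, sigma2, reps, seed=0):
    """Return `reps` sample variances S^2 from normal samples of size n."""
    rng = random.Random(seed)
    sd = sigma2 ** 0.5
    out = []
    for _ in range(reps):
        xs = [rng.gauss(0.0, sd) for _ in range(n)]
        x_bar = sum(xs) / n
        out.append(sum((x - x_bar) ** 2 for x in xs) / (n - 1))
    return out

n, sigma2 = 10, 4.0   # illustrative values
s2 = simulate_sample_variances(n, sigma2, reps=5000)
mean_s2 = sum(s2) / len(s2)
var_s2 = sum((v - mean_s2) ** 2 for v in s2) / (len(s2) - 1)

print(round(mean_s2, 3), "theory:", sigma2)                              # E(S^2) = sigma^2
print(round(var_s2, 3), "theory:", round(2 * sigma2**2 / (n - 1), 3))    # V(S^2) = 2 sigma^4 / (n-1)
```

The simulated values should be close to the theoretical ones, with small differences from sampling noise.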
Example: For a certain launching mechanism, the distances by which the projectile misses the target center have a normal distribution with variance σ² = 100 square meters. An experiment involving n = 25 launches is to be conducted. Let S² denote the sample variance of the distances between the impact of the projectile and the target center.
Approximate P(S² > 50). Find E(S²) and V(S²).
Let U = (n-1)S²/σ², which has a χ²(24) distribution for n = 25. Then P(S² > 50) = P((n-1)S²/σ² > 24 × 50/100) = P(U > 12), which, from Table 6, is a little larger than 0.975.
We know that E(S²) = σ² = 100 and V(S²) = 2σ⁴/(n-1) = 2(100)²/24 = 10000/12 ≈ 833.3.
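The table lookup can be replaced by a direct computation. The Python standard library has no chi-square distribution, so the sketch below relies on a closed form that holds only for an even number of degrees of freedom: P(U > x) = exp(-x/2) Σ_{j<ν/2} (x/2)^j / j!. The function name `chi2_sf_even_df` is a hypothetical helper, not a library call.

```python
import math

def chi2_sf_even_df(x, df):
    """P(U > x) for U ~ chi-square with an even number of degrees of freedom.

    Uses the closed form exp(-x/2) * sum_{j < df/2} (x/2)^j / j!, which is
    valid only when df is even.
    """
    if df % 2 != 0:
        raise ValueError("closed form requires even df")
    half = x / 2.0
    return math.exp(-half) * sum(half**j / math.factorial(j) for j in range(df // 2))

n, sigma2 = 25, 100.0                  # values from the example
u = (n - 1) * 50.0 / sigma2            # cutoff for U: 12.0
p = chi2_sf_even_df(u, n - 1)          # P(S^2 > 50) = P(U > 12)
mean_s2 = sigma2                       # E(S^2) = sigma^2
var_s2 = 2 * sigma2**2 / (n - 1)       # V(S^2) = 2 sigma^4 / (n - 1)
print(u, round(p, 4), mean_s2, round(var_s2, 2))
```

The computed probability falls slightly above 0.975, in line with the table-based answer.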