CAS MA 113 Elementary Statistics Summer II Lecture 8, 9
Lecture 8, 9 (Chapter 5.1, Chapter 7.1)
the greater part of our happiness or misery depends upon our disposition, and not upon our circumstances. Martha Washington (1732-1802)
Plan:
Introduction to Confidence Interval Estimation and Hypothesis Testing.
Confidence Intervals for Population Mean.
Making Decisions using Confidence Intervals. Sample Size Determination.
Confidence Intervals for Population Proportion.
Introduction to Confidence Interval Estimation and Hypothesis Testing.
Population versus Sample; Parameter versus Statistic
Statistical inference is the process of drawing conclusions about a population parameter based
on data, that is, on a statistic: an estimate or a summary computed from the observations. There are two
types of estimates for a population parameter: confidence interval estimates and point estimates.
A confidence interval estimate is a range of values for the population parameter with a pre-
defined level of confidence (e.g., 95% confidence interval). A point estimate for a population
parameter is the statistic and can be considered the best (available) single-number estimate of that
parameter.
Examples:
1) Based on a random sample of the 100 patients visiting ER this weekend, the average waiting
time is estimated to be 37.85 minutes.
Answer: point estimate.
2) Based on a random sample of the 100 patients visiting ER this weekend, the average waiting
time is estimated to be between 32.65 minutes and 42.10 minutes.
Answer: confidence interval.
Question: How far is the sample mean of 37.85 minutes from the true (population) average waiting
time at the ER (the mean over all weekends and all patients)?
To answer this question we will learn the two most common statistical inference procedures,
Confidence Interval Estimation and Hypothesis Testing, which are based on confidence interval
estimates and point estimates, respectively.
Confidence Interval Estimation is based on confidence interval estimates. Using these
estimates, the researcher is fairly confident that the confidence interval will cover the true,
unknown value of the population parameter.
Hypothesis testing uses a point estimate to attempt to reject/accept a hypothesis about the
population. Usually researchers want to reject the notion that chance alone can explain the
sample results.
Hypothesis testing is applied to population parameters by specifying a null value for the
parameter: a value that would indicate that nothing of interest is happening.
Hypothesis testing proceeds by obtaining a sample, computing a point estimate (sample
statistic), and assessing how unlikely the sample statistic would be if the null parameter
value were correct.
In most cases, the researchers are trying to show that the null value is not correct.
Achieving statistical significance is equivalent to rejecting the idea that the observed results
are plausible if the null value is correct.
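The steps above (sample, point estimate, "how unlikely under the null value") can be sketched numerically. This is only an illustration: the null value of 35 minutes, the known standard deviation of 9.5, and the function name z_test_two_sided are assumptions chosen to match the ER waiting-time numbers used elsewhere in these notes, and the "two-sided p-value" is one common way to quantify "how unlikely", not a procedure the notes define.

```python
from math import sqrt
from statistics import NormalDist

def z_test_two_sided(xbar, null_mean, sigma, n):
    """Return (z, p_value) for a two-sided z-test with known sigma.

    Hypothetical helper for illustration; assumes the sample mean is
    (approximately) normal with standard deviation sigma / sqrt(n).
    """
    standard_error = sigma / sqrt(n)
    z = (xbar - null_mean) / standard_error
    # Probability of a sample mean at least this far from the null value,
    # in either direction, if the null value were correct.
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value

# Assumed inputs: sample mean 37.85 from n = 100 patients, null value 35, sigma = 9.5.
z, p = z_test_two_sided(xbar=37.85, null_mean=35, sigma=9.5, n=100)
print(f"z = {z:.2f}, two-sided p-value = {p:.4f}")
```

Under these assumptions the sample mean sits about 3 standard errors above the null value, so such a result would be rare if the null value were correct.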
In this course, we will study statistical inference methods for main population parameters that
involve either proportions (for categorical data) or means (for quantitative data).
Population proportion $p$ versus sample proportion $\hat{p}$ (for a categorical response);
Population mean $\mu$ versus sample mean $\bar{x}$ (for a quantitative response).
Exercise: For each quantity below, choose the matching symbol: $p$, $\hat{p}$, $\mu$, or $\bar{x}$.
1) The average waiting time in the emergency room on weekends, based on the sample of 100
patients
2) The proportion of adults that believe in love at first sight
3) A sample of 1000 workers resulted in 460 (for 46%) stating they work more than 40 hours
per week.
4) Average GPA of Boston University students
5) Average GPA of Boston University students, based on MA113 Summer II students
Exercise: Do you work more than 40 hours per week?
A poll was conducted by The Heldrich Center for Workforce Development (at Rutgers University).
A probability sample of 1000 workers resulted in 460 (for 46%) stating they work more than 40
hours per week.
Population = all workers
Parameter = $p$, the proportion of all workers who work more than 40 hours per week
Sample = 1000 workers
Statistic = $\hat{p}$ = 0.46, the proportion of sampled workers who work more than 40 hours per week
Can anyone say how close this observed sample proportion $\hat{p}$ is to the true population
proportion $p$? Answer: no.
If we were to take another random sample of the same size n = 1000, would we get the same
value for the sample proportion $\hat{p}$? Answer: not necessarily.
Exercise: Management of an airline uses a normal distribution to model the value claimed for a lost piece
of luggage on domestic flights. The mean of the distribution is $600 and the standard deviation is $85.
Suppose a random sample of 100 pieces of luggage is to be selected.
Describe the approximate sampling distribution of the sample mean claimed value for a random
sample of the 100 pieces of lost luggage. Provide all features of the distribution.
Here the parent population is modeled as normal, so the sample mean $\bar{X}$ is normally distributed with mean $\mu_{\bar{X}} = \mu = 600$ and standard deviation $\sigma_{\bar{X}} = \sigma/\sqrt{n} = 85/\sqrt{100} = 8.5$.
Recall the main results about the Sampling Distribution of the Sample Mean and the CLT:
Sampling Distribution of the Sample Mean:
If the parent population IS a normal distribution with a mean $\mu$ and a standard deviation $\sigma$,
then for any sample size (small or large), the sample mean will have a normal
distribution with a mean of $\mu$ and a standard deviation of $\sigma/\sqrt{n}$.
Central Limit Theorem (holds approximately for large n, about 30 or more):
If the parent population is NOT a normal distribution but has a mean $\mu$ and a standard
deviation $\sigma$, then for a large sample size ($n \ge 30$), the sample mean will have an approximately
normal distribution with a mean of $\mu$ and a standard deviation of $\sigma/\sqrt{n}$.
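Both results give the same two features for the sample mean: its mean equals the population mean, and its standard deviation is the population standard deviation divided by the square root of n. A minimal sketch, applied to the lost-luggage exercise (the function name is made up for illustration):

```python
from math import sqrt

def sample_mean_distribution(mu, sigma, n):
    """Mean and standard deviation of the sampling distribution of the sample mean."""
    return mu, sigma / sqrt(n)

# Lost-luggage exercise: population mean 600, standard deviation 85, n = 100.
mean, sd = sample_mean_distribution(mu=600, sigma=85, n=100)
print(mean, sd)  # 600 8.5
```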
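Recall the Empirical Rule for the Normal Distribution: about 68% of values fall within 1 standard deviation of the mean, about 95% within 2, and about 99.7% within 3.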
Exercise (Example 5.1, page 175): Estimating Mean Waiting Time in the Emergency Room.
Suppose we know that the true population mean time in the ER during weekends is $\mu = 35$ minutes,
with standard deviation $\sigma = 9.5$.
a) Describe and draw the sampling distribution of the sample mean waiting time based on a random
sample of n=35 and n=100 patients coming during last weekend.
b) Using the Empirical Rule, identify the lower and the upper bounds of the interval within which
99.7% of the most typical values for a sample mean fall.
c) Which interval is wider, the one for n=35 or the one for n=100? Why?
The interval for n = 35, because the standard deviation of the sample mean, $\sigma/\sqrt{n}$, is larger for the smaller sample ($9.5/\sqrt{35} \approx 1.61$ versus $9.5/\sqrt{100} = 0.95$).
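Parts b) and c) can be checked numerically. The sketch below assumes the Empirical Rule multiplier of 3 for 99.7% and uses a made-up helper name; it is one way to organize the arithmetic, not the textbook's worked solution.

```python
from math import sqrt

def empirical_rule_bounds(mu, sigma, n, k=3):
    """Bounds mu +/- k standard deviations of the sample mean (k=3 covers about 99.7%)."""
    sd_of_mean = sigma / sqrt(n)
    return mu - k * sd_of_mean, mu + k * sd_of_mean

for n in (35, 100):
    lower, upper = empirical_rule_bounds(mu=35, sigma=9.5, n=n)
    print(f"n={n}: ({lower:.2f}, {upper:.2f}), width={upper - lower:.2f}")
```

The interval for n=35 comes out wider than the one for n=100, in line with part c).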
d) Now assume that the true population mean $\mu$ is unknown. How can we estimate $\mu$?
Using a point estimate ($\bar{x}$) or a confidence interval.
Usually the true population parameter value is unknown, so we take a sample and use the sample
statistic to estimate the parameter. The sample statistic (a single point estimator) may not be
equal to the population parameter; in fact, it could change every time we take a new sample.
Question: Will the observed sample statistic value be a reasonable estimate?
Answer: If our sample is a random sample, then we will be able to say something about the accuracy
of the estimation process. But a sample statistic only estimates a population
parameter; it is NOT necessarily equal to it.
e) Based on two independent random samples of 100 patients, we computed two values for the sample
mean: $\bar{x}_1 = 37.85$ minutes and $\bar{x}_2 = 33.75$ minutes. Which one is better? Why?
If the true population mean is unknown, then there is no way to say which estimate is better
(more accurate). In terms of the problem, people would like to believe that the average waiting
time in ER on weekends is actually closer to 33.75 minutes than 37.85 minutes.
f) Which estimate is closer to the true (known, $\mu = 35$) population mean?
$\bar{x} = 33.75$ minutes (it is 1.25 minutes from 35, whereas 37.85 is 2.85 minutes away).
g) Which estimate is closer to the true (unknown) population mean?
(see part e))
h) Assume that the true population mean is unknown, but the standard deviation is known, $\sigma = 9.5$,
and that a random sample of 100 patients gave the sample mean $\bar{x} = 37.85$
minutes. What would be a rough guess for the lower and the upper bounds of the interval within
which 99.7% of the most typical values for a sample mean fall?
Roughly, $\bar{x} \pm 3\,\sigma/\sqrt{n} = 37.85 \pm 3(9.5/\sqrt{100}) = 37.85 \pm 2.85 = (35.00, 40.70)$.
Note: Each value in the computed 99.7% confidence interval provides a possible (or acceptable) value
for the sample mean waiting time. Each possible value of the sample mean represents a point
estimate of the population mean; therefore, the computed interval contains possible (or
acceptable) values for the population mean, and about 99.7% of the time (when we repeat the process of
taking samples of the same size from the target population) that would be true.
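When the sample size is large ($n \ge 30$), the general form of the confidence interval for the
population mean $\mu$ is
$\bar{x} \pm Z_{1-\alpha/2}\,\dfrac{\sigma}{\sqrt{n}}$
Question: Why does this formula work?
According to the CLT, for a large enough sample size ($n \ge 30$), $\bar{X}$ is approximately normal with mean $\mu$ and standard deviation $\sigma/\sqrt{n}$.
Therefore $Z = \dfrac{\bar{X} - \mu}{\sigma/\sqrt{n}}$ approximately follows the Standard Normal Distribution, for which $P(-1.96 < Z < 1.96) = 0.95$. Rearranging the inequality for $\mu$ gives $P(\bar{X} - 1.96\,\sigma/\sqrt{n} < \mu < \bar{X} + 1.96\,\sigma/\sqrt{n}) = 0.95$.
The 95% Confidence Interval for the
population mean is: $\bar{x} \pm 1.96\,\dfrac{\sigma}{\sqrt{n}}$
Here $Z_{1-\alpha/2} = 1.96$ and $\alpha = 0.05$; $\alpha$ refers to the total area in the tails of the Standard Normal
Distribution.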
Example (Example 5.1, ER waiting time):
a) Compute the 95% confidence interval for the mean waiting time in the ER during weekends:
$\sigma = 9.5$, $n = 100$, $\bar{x} = 37.85$
$37.85 \pm 1.96 \times 9.5/10 = 37.85 \pm 1.86 = (35.99, 39.71)$
b) Is it different from our first rough guess? Why?
Yes. The rough guess used the Empirical Rule multiplier 3 (a 99.7% interval); the Empirical Rule multiplier for 95% is 2, while the exact multiplier is 1.96.
c) Does the 95% CI computed in part a) contain the true population mean waiting time of 35
minutes?
No.
d) Compute the 95% confidence interval for the mean waiting time in the ER during weekends
using $\bar{x} = 33.75$ (the same n=100 and the same standard deviation):
$33.75 \pm 1.86 = (31.89, 35.61)$
e) Does this 95% CI contain the true population mean waiting time of 35 minutes?
Yes.
f) What will happen to the 95% CI when n=35?
The interval becomes wider, because the margin of error $1.96\,\sigma/\sqrt{n}$ grows as n decreases: $1.96 \times 9.5/\sqrt{35} \approx 3.15$, compared with 1.86 for n=100.
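The two intervals in this example can be reproduced with a short script. This is a sketch under the example's assumptions (known standard deviation 9.5, n = 100); the function name z_interval is made up here, and the multiplier is taken from Python's standard normal quantile function rather than rounded to 1.96.

```python
from math import sqrt
from statistics import NormalDist

def z_interval(xbar, sigma, n, confidence=0.95):
    """Confidence interval for a population mean when sigma is known (Z-based)."""
    alpha = 1 - confidence
    z = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96 for 95%
    margin = z * sigma / sqrt(n)
    return xbar - margin, xbar + margin

for xbar in (37.85, 33.75):
    lower, upper = z_interval(xbar, sigma=9.5, n=100)
    covers = lower <= 35 <= upper
    print(f"xbar={xbar}: ({lower:.2f}, {upper:.2f}), contains 35: {covers}")
```

The first interval misses 35 and the second covers it, matching the answers to parts c) and e).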
Definition:
A confidence interval (CI) is a range of values that are likely to cover the true population
parameter.
The basic structure for any confidence interval is:
point estimate ± multiplier × standard error,
where multiplier × standard error = the margin of error.
Definitions:
The margin of error is a built-in component that addresses how close (or how far) the point
estimates are from the true, unknown parameter.
The (estimated) standard deviation of a point estimate (e.g., $\bar{x}$) is called the standard error.
Standard Error Interpretation:
If repeated samples of size (sample size) are obtained from this same population, we would expect the
resulting sample (statistic) to be about (value of the standard error) away from the true (population
parameter) on average.
The standard error will depend on the sample size and the true population standard
deviation.
The multiplier used will depend on the confidence level and the parameter of interest. A Z multiplier does not depend on
the sample size; a t multiplier also depends on the degrees of freedom, n - 1.
The confidence level is the proportion of times the method will produce an interval that does
contain the true parameter in repeated random sampling.
Confidence interval applet
http://onlinestatbook.com/stat_sim/conf_interval/index.html
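The meaning of the confidence level can also be explored with a small simulation in the same spirit as the applet. The sketch below is an assumption-laden illustration: it draws samples from a normal population with the ER values (mean 35, standard deviation 9.5, n = 100), builds a Z-based 95% interval from each, and reports the fraction of intervals that cover the true mean. The function name, the number of repetitions, and the seed are arbitrary choices.

```python
import random
from math import sqrt
from statistics import NormalDist

def coverage(mu, sigma, n, confidence=0.95, repetitions=2000, seed=1):
    """Fraction of simulated Z-based intervals that contain the true mean mu."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    margin = z * sigma / sqrt(n)
    hits = 0
    for _ in range(repetitions):
        # Draw one random sample of size n and compute its sample mean.
        xbar = sum(rng.gauss(mu, sigma) for _ in range(n)) / n
        if xbar - margin <= mu <= xbar + margin:
            hits += 1
    return hits / repetitions

print(coverage(mu=35, sigma=9.5, n=100))
```

The printed fraction should land close to 0.95, though it varies slightly with the seed and the number of repetitions.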
Note: For the same sample estimate, a 99% CI is wider than a 95%
CI, because the margin of error is greater when a 99% confidence level is
used than when a 95% confidence level is used.
Confidence Intervals for Population Mean Summary
Note: when n is large, the t-distribution is very close to the Standard Normal Distribution.
Exercise: (Example 5.4, page 183) Estimating Mean IQ. Estimate, using a 95% confidence interval, the
mean IQ for all 12-year-olds. We select a random sample of sixteen (n=16) 12-year-olds and compute
their average IQ score to be 106 with a (sample) standard deviation of 12.4.
Assuming that IQs are approximately normally distributed, compute the 95% confidence interval using an
appropriate formula.
$n = 16$, $\bar{x} = 106$, $s = 12.4$
$df = n - 1 = 16 - 1 = 15$
$t = 2.131$, $\alpha = 1 - 0.95 = 0.05$
$UB = 106 + 2.131 \times 12.4/4 = 112.6$
$LB = 106 - 2.131 \times 12.4/4 = 99.4$
So the 95% CI is (99.4, 112.6).
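The same arithmetic as a short sketch. Python's standard library has no t-distribution quantile function, so the multiplier 2.131 (df = 15, 95% confidence) is passed in as read from a t table; the function name is made up for illustration.

```python
from math import sqrt

def t_interval(xbar, s, n, t_multiplier):
    """Confidence interval for a mean using a t multiplier looked up for df = n - 1."""
    margin = t_multiplier * s / sqrt(n)
    return xbar - margin, xbar + margin

lower, upper = t_interval(xbar=106, s=12.4, n=16, t_multiplier=2.131)
print(f"({lower:.1f}, {upper:.1f})")  # (99.4, 112.6)
```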
Exercise: (Example 5.5, page 184) Estimating Mean Number of Visits to Primary Care. Estimate,
using a 99% confidence interval, the mean number of visits to primary care doctor over 3 years for
all patients with Type II diabetes. We select a random sample of 65 (n=65) patients and recorded
the number of visits each makes. The mean number of visits is 16 with the standard deviation of
1.4.
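Since n=65>30, we can use the formula that uses Z; for a 99% CI, $Z_{1-\alpha/2} = 2.576$.
With n=65, $\bar{x} = 16$, s=1.4, the 99% CI is $16 \pm 2.576 \times 1.4/\sqrt{65} = 16 \pm 0.45 = (15.55, 16.45)$.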
Sample Size Determination
Note from the examples that the margins of error in the confidence intervals vary widely from
example to example, depending on both the variability (the population or sample standard
deviation, which drives the standard error) and the sample size.
The margin of error is small when the sample size is large and/or when the standard error is
small.
In experimental design, statisticians are concerned with determining the numbers of subjects
and the sampling strategy to be employed in a particular application so as to satisfy specific
precision criteria in the statistical inference phase of the analysis.
Suppose we wish to estimate the mean of a population and it is important to produce an
estimate that is within 5 units of the true mean with 95% confidence.
Exercise: (Example 5.2 and 5.6, pages 180 and 185)
a) Generate a 95% confidence interval estimate for the mean age at which patients with
hypertension are diagnosed, based on a sample of n = 12 subjects. The mean age for the 12 subjects is 47.
Suppose the standard deviation is known and equals 7.2.
b) Compute the margin of error:
c) Suppose that we wanted a more precise estimate, for example, an interval with a margin of
error not exceeding 2 years.
In applications in which one desires a confidence interval estimate for the mean of a
population, formulas can be used to determine the necessary sample size.
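Parts a) and b) of this exercise can be worked numerically as follows. This is a sketch assuming a Z-based interval with known standard deviation 7.2, n = 12, and sample mean 47; the helper name is made up for illustration.

```python
from math import sqrt
from statistics import NormalDist

def margin_of_error(sigma, n, confidence=0.95):
    """Margin of error Z * sigma / sqrt(n) for a Z-based interval."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return z * sigma / sqrt(n)

E = margin_of_error(sigma=7.2, n=12)
print(f"E = {E:.2f}, CI = ({47 - E:.2f}, {47 + E:.2f})")
```

With only 12 subjects the margin of error is roughly 4 years, about twice the 2-year target in part c), which is what motivates computing a larger sample size.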
In order to design the experiment, specifically to determine the sample size, the following questions
must be answered:
1. How much error can be tolerated in the estimate (i.e., how close must the estimate be to the true mean)?
2. What level of confidence is desired?
CI General Form: $\bar{x} \pm E$, where $E = Z_{1-\alpha/2}\,\dfrac{\sigma}{\sqrt{n}}$ is the margin of error
Solving for n gives: $n = \left(\dfrac{Z_{1-\alpha/2}\,\sigma}{E}\right)^2$
Note: we will always round n upward, because n is the minimum number of subjects required to
ensure a margin of error no larger than E in the confidence interval for the population mean with the
specified level of confidence reflected in $Z_{1-\alpha/2}$.
Let's compute n, the minimum number of patients required to estimate the mean age at which patients with hypertension are diagnosed, with 95% confidence, to within 2 years:
$n = \left(\dfrac{1.96 \times 7.2}{2}\right)^2 = 49.8$, so n = 50 patients.
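A minimal sketch of this sample size calculation, using the standard deviation 7.2 from the exercise and a target margin of error of 2 years. The function name is an assumption for illustration; rounding upward is done with ceil, as the note requires.

```python
from math import ceil
from statistics import NormalDist

def sample_size_for_mean(sigma, E, confidence=0.95):
    """Smallest n whose Z-based margin of error does not exceed E."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return ceil((z * sigma / E) ** 2)  # always round upward

print(sample_size_for_mean(sigma=7.2, E=2))  # 50
```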
Note: if the true population standard deviation $\sigma$ is unknown, use the sample standard deviation s.
If neither s nor $\sigma$ is available, use a conservative estimate given as:
Confidence Interval for a Population Proportion Summary
(Chapter 7.1, 7.6, pp. 294-296, pp. 328-330)
The assumptions required for CI for a population proportion to be valid:
the sample size n is large enough (check: $np \ge 5$ and $n(1-p) \ge 5$)
the data are a random sample from that population.
General Formula for Confidence Interval for the Population Proportion p
$\hat{p} \pm Z_{1-\alpha/2}\sqrt{\dfrac{p(1-p)}{n}}$
Formula for Approximate Confidence Interval for the Population Proportion p
$\hat{p} \pm Z_{1-\alpha/2}\sqrt{\dfrac{\hat{p}(1-\hat{p})}{n}}$
Formula for Conservative Confidence Interval for the Population Proportion p
$\hat{p} \pm Z_{1-\alpha/2}\,\dfrac{1}{2\sqrt{n}}$
Note: here p = 0.5 is used to compute the standard error, $SE(\hat{p}) = \sqrt{\dfrac{p(1-p)}{n}} = \sqrt{\dfrac{0.5(1-0.5)}{n}} = \dfrac{1}{2\sqrt{n}}$
The following formula is used to determine the sample size required to produce an estimate for the
population proportion p with a certain level of precision:
$n = p(1-p)\left(\dfrac{Z_{1-\alpha/2}}{E}\right)^2$
If the population parameter p is unknown, the sample estimate $\hat{p}$ can be used instead:
$n = \hat{p}(1-\hat{p})\left(\dfrac{Z_{1-\alpha/2}}{E}\right)^2$
If such an estimate is not available, it can be shown that the function p(1-p) is maximized at p=0.5.
Therefore the most conservative estimate of n, the sample size, is produced by substituting p=0.5 in
the general formula, which is equivalent to
$n = 0.25\left(\dfrac{Z_{1-\alpha/2}}{E}\right)^2$
Exercise: The proportion of adults that believe in love at first sight. Assume that in a sample of
100 people, 40 say they do believe.
1) Compute an approximate 95% confidence interval.
Since the true population proportion is unknown, use the formula for the approximate confidence
interval:
$\hat{p} \pm Z_{1-\alpha/2}\sqrt{\dfrac{\hat{p}(1-\hat{p})}{n}}$
2) Compute a conservative 95% confidence interval.
$\hat{p} \pm Z_{1-\alpha/2}\,\dfrac{1}{2\sqrt{n}}$
3) Which confidence interval is wider?
__________________________________
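Both intervals for this exercise can be computed side by side. The sketch below assumes 40 believers out of n = 100 and a 95% confidence level; the function name is made up, and the code is only one way to organize the two formulas.

```python
from math import sqrt
from statistics import NormalDist

def proportion_intervals(successes, n, confidence=0.95):
    """Return (approximate CI, conservative CI) for a population proportion."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    p_hat = successes / n
    approx_margin = z * sqrt(p_hat * (1 - p_hat) / n)  # uses p_hat in the SE
    conservative_margin = z / (2 * sqrt(n))            # uses p = 0.5 in the SE
    return ((p_hat - approx_margin, p_hat + approx_margin),
            (p_hat - conservative_margin, p_hat + conservative_margin))

approx, conservative = proportion_intervals(successes=40, n=100)
print(f"approximate:  ({approx[0]:.3f}, {approx[1]:.3f})")
print(f"conservative: ({conservative[0]:.3f}, {conservative[1]:.3f})")
```

Because p(1-p) is largest at p = 0.5, the conservative margin can never be smaller than the approximate one.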
Exercise: Do you work more than 40 hours per week?
A poll was conducted by The Heldrich Center for Workforce Development (at Rutgers University).
A probability sample of 1000 workers resulted in 460 (for 46%) stating they work more than 40
hours per week.
a) Compute an approximate 95% confidence interval for the true population proportion of people
who work more than 40 hours.
$\hat{p} \pm Z_{1-\alpha/2}\sqrt{\dfrac{\hat{p}(1-\hat{p})}{n}}$
b) How many people do we need to survey in order to estimate the true population proportion with
10% accuracy? (E=0.1)
From the statement of the problem: $\hat{p} = 0.46$, $E = 0.1$
Since the true population proportion p is unknown, use the following formula:
$n = \hat{p}(1-\hat{p})\left(\dfrac{Z_{1-\alpha/2}}{E}\right)^2$
c) What would a conservative sample size be (assuming unknown p and $\hat{p}$)?
$n = 0.25\left(\dfrac{Z_{1-\alpha/2}}{E}\right)^2$
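Parts b) and c) can be sketched as follows. The exercise does not state a confidence level for the sample size question, so 95% is assumed here; the function name and its default p = 0.5 (the conservative choice) are also assumptions made for illustration.

```python
from math import ceil
from statistics import NormalDist

def sample_size_for_proportion(E, p=0.5, confidence=0.95):
    """Smallest n giving margin of error at most E; p=0.5 is the conservative choice."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return ceil(p * (1 - p) * (z / E) ** 2)  # always round upward

print(sample_size_for_proportion(E=0.1, p=0.46))  # 96, using the sample estimate 0.46
print(sample_size_for_proportion(E=0.1))          # 97, conservative (p = 0.5)
```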
Exercise: Work through Examples 7.2 (p. 296) and 7.15 (p. 329).
Principles for using Confidence Intervals to Guide Decision Making:
Principle 1: A value not in a CI can be rejected as a possible value of the population
parameter. A value in a CI is an acceptable or reasonable possibility for the value of the
population parameter.
Principle 2: When the CIs for parameters for two different populations do not overlap, it is
reasonable to conclude that the parameters for the two populations are different.
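The two principles amount to simple interval checks, sketched below. The interval (35.99, 39.71) and the value 35 come from the ER example; the second interval, labeled as weekday waiting times, is hypothetical and invented purely to illustrate Principle 2. The helper names are also made up.

```python
def contains(ci, value):
    """Principle 1: is the value an acceptable possibility according to this CI?"""
    lower, upper = ci
    return lower <= value <= upper

def overlap(ci_a, ci_b):
    """Principle 2: do the two intervals share at least one value?"""
    return ci_a[0] <= ci_b[1] and ci_b[0] <= ci_a[1]

weekend_ci = (35.99, 39.71)  # 95% CI from the ER example
weekday_ci = (28.00, 33.50)  # hypothetical interval for a second population

print(contains(weekend_ci, 35))         # False: 35 can be rejected
print(overlap(weekend_ci, weekday_ci))  # False: the means plausibly differ
```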
1. The probability that the true parameter lies in a particular, already computed, confidence interval is either 0 or 1. The interval is now fixed and the parameter is not random, so the
parameter is either in that particular interval or it is not.
2. A 95% Confidence Interval: We are 95% confident that the true parameter value lies inside the confidence interval. The interval provides a range of reasonable values for the
population parameter.
3. The 95% Confidence Level: If the procedure were repeated many times (that is, if we repeatedly took a random sample of the same size and computed the 95% confidence
interval for each sample), we would expect 95% of the resulting confidence intervals to
contain the true population parameter.
Note: giving a correct interpretation of the interval and of the confidence level can be challenging.
Example: We are 95% confident that the true average waiting time in ER on weekends is between
35.99 minutes and 39.71 minutes.
If the procedure were repeated many times (that is, if we repeatedly took random samples of 100
patients and computed the 95% confidence interval for each sample), we would
expect 95% of the resulting confidence intervals to contain the true average waiting time in ER on
weekends.