Confidence Interval

CAS MA 113 Elementary Statistics Summer II Lecture 8, 9

Lecture 8, 9 (Chapter 5.1, Chapter 7.1)

"The greater part of our happiness or misery depends upon our disposition, and not upon our circumstances." Martha Washington (1732-1802)

Plan:

Introduction to Confidence Interval Estimation and Hypothesis Testing.

Confidence Intervals for Population Mean.

Making Decisions using Confidence Intervals. Sample Size Determination.

Confidence Intervals for Population Proportion.

Introduction to Confidence Interval Estimation and Hypothesis Testing.

Population versus Sample; Parameter versus Statistic

Statistical inference is the process of drawing conclusions about a population parameter based on data, that is, on a statistic: an estimate or a summary computed from the observations. There are two types of estimates for a population parameter: confidence interval estimates and point estimates.


A confidence interval estimate is a range of values for the population parameter with a predefined level of confidence (e.g., a 95% confidence interval). A point estimate for a population parameter is a statistic computed from the sample and can be considered the best (available) single-number estimate of that parameter.

Examples:

1) Based on a random sample of the 100 patients visiting ER this weekend, the average waiting

time is estimated to be 37.85 minutes.

_________________________point estimate____________

2) Based on a random sample of the 100 patients visiting ER this weekend, the average waiting

time is estimated to be between 32.65 minutes and 42.10 minutes.

______________________confidence interval______________

Question: How far is the sample mean of 37.85 minutes from the true (population) average waiting time at the ER (the mean for all weekends and all patients)?

To answer this question we will learn the two most common statistical inference procedures, Confidence Interval Estimation and Hypothesis Testing, which are related to confidence interval estimates and point estimates, respectively.

Confidence Interval Estimation is based on confidence interval estimates. Using these estimates, the researcher is fairly confident that the confidence interval will cover the true, unknown value of the population parameter.

Hypothesis testing uses a point estimate to decide whether to reject a hypothesis about the population. Usually researchers want to reject the notion that chance alone can explain the sample results.

Hypothesis testing is applied to population parameters by specifying a null value for the parameter, a value that would indicate that nothing of interest is happening.

Hypothesis testing proceeds by obtaining a sample, computing a point estimate (sample

statistic), and assessing how unlikely the sample statistic would be if the null parameter

value were correct.

In most cases, the researchers are trying to show that the null value is not correct.

Achieving statistical significance is equivalent to rejecting the idea that the observed results

are plausible if the null value is correct.

In this course, we will study statistical inference methods for main population parameters that

involve either proportions (for categorical data) or means (for quantitative data).

Population proportion $p$ versus sample proportion $\hat{p}$ (for categorical response);

Population mean $\mu$ versus sample mean $\bar{x}$ (for quantitative response).


Exercise: For each description below, identify the appropriate symbol: $\mu$, $\bar{x}$, $p$, or $\hat{p}$.

1) The average waiting time in the emergency room on weekends, based on the sample of 100 patients

2) The proportion of adults that believe in love at first sight

3) A sample of 1000 workers resulted in 460 (or 46%) stating they work more than 40 hours per week.

4) Average GPA of Boston University students

5) Average GPA of Boston University students, based on MA113 Summer II students

Exercise: Do you work more than 40 hours per week?

A poll was conducted by The Heldrich Center for Workforce Development (at Rutgers University).

A probability sample of 1000 workers resulted in 460 (for 46%) stating they work more than 40

hours per week.

Population = all workers

Parameter = p, the proportion of people who work more than 40 hours

Sample = 1000 workers

Statistic = $\hat{p}$ = 0.46, the proportion of those in the sample who work more than 40 hours

Can anyone say how close this observed sample proportion $\hat{p}$ is to the true population proportion $p$? No.

If we were to take another random sample of the same size n = 1000, would we get the same value for the sample proportion $\hat{p}$? Not necessarily.

Exercise: Management of an airline uses a normal distribution to model the value claimed for a lost piece of luggage on domestic flights. The mean of the distribution is $600 and the standard deviation is $85. Suppose a random sample of 100 pieces of luggage is to be selected.

Describe the approximate sampling distribution of the sample mean claimed value for a random sample of 100 pieces of lost luggage. Provide all features of the distribution.

According to the Central Limit Theorem (CLT), here the sample mean is approximately normally distributed, with a mean of 600 and a standard deviation of 85/√100 = 8.5.

Recall the main results about the Sampling Distribution of the Sample Mean and the CLT:

Sampling Distribution of the Sample Mean $\bar{X}$:

If the parent population IS a normal distribution with a mean $\mu$ and a standard deviation $\sigma$, then for any sample size (small or large), the sample mean $\bar{X}$ will have a normal distribution with a mean of $\mu$ and a standard deviation of $\sigma/\sqrt{n}$.

Central Limit Theorem: $\bar{X}$ is approximately normal for large n ($n \ge 30$).

If the parent population is NOT a normal distribution but has a mean $\mu$ and a standard deviation $\sigma$, then for a large sample size ($n \ge 30$), the sample mean $\bar{X}$ will have an approximately normal distribution with a mean of $\mu$ and a standard deviation of $\sigma/\sqrt{n}$.

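Recall the Empirical Rule for the Normal Distribution: about 68% of the values fall within 1 standard deviation of the mean, about 95% within 2 standard deviations, and about 99.7% within 3 standard deviations.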
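The lost-luggage exercise can be checked numerically. The sketch below is only an illustration: the helper names `sampling_distribution` and `empirical_rule_ranges` are made up for this note, and the multipliers 1, 2 and 3 are the usual Empirical Rule values for roughly 68%, 95% and 99.7% of a normal distribution.

```python
from math import sqrt


def sampling_distribution(mu, sigma, n):
    """Mean and standard deviation of the sample mean for samples of size n."""
    return mu, sigma / sqrt(n)


def empirical_rule_ranges(center, spread):
    """Ranges holding roughly 68%, 95% and 99.7% of a normal distribution."""
    return {
        label: (center - k * spread, center + k * spread)
        for label, k in (("68%", 1), ("95%", 2), ("99.7%", 3))
    }


# Lost-luggage exercise: mean $600, standard deviation $85, n = 100.
mean, sd = sampling_distribution(mu=600, sigma=85, n=100)
ranges = empirical_rule_ranges(mean, sd)
print(mean, sd)
for label, (low, high) in ranges.items():
    print(label, low, high)
```

Under these assumptions the sample mean has a standard deviation of 8.5, so almost all sample means for n = 100 would fall between 574.5 and 625.5.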


Exercise (Example 5.1, page 175): Estimating Mean Waiting Time in the Emergency Room.

Suppose we know that the true population mean time in the ER during weekends is $\mu = 35$ minutes, with standard deviation $\sigma = 9.5$ minutes.

a) Describe and draw the sampling distribution of a sample mean waiting time based on a random

sample of n=35 and n=100 patients coming during last weekend.

b) Using the Empirical Rule, identify the lower and the upper bounds of the interval within which 99.7% of the values of a sample mean fall.

c) Which interval is wider, the one for n=35 or the one for n=100? Why?

The interval for n = 35 is wider, because the standard deviation of the sample mean, $\sigma/\sqrt{n}$, is larger for the smaller sample.

d) Now assume that the true population mean $\mu$ is unknown. How can we estimate $\mu$?

Using a point estimate ($\bar{x}$) or a confidence interval.

Usually, the true population parameter value is unknown, so we take a sample and use the sample statistic to estimate the parameter. The sample statistic (a single point estimator) may not be equal to the population parameter; in fact, it could change every time we take a new sample.

Question: Will the observed sample statistic value be a reasonable estimate?

Answer: If our sample is a random sample, then we will be able to say something about the accuracy

of the estimation process. But a sample statistic estimates (NOT necessarily equal to) a population

parameter.

e) Based on two independent random samples of 100 patients, we computed two values for the sample mean: $\bar{x}_1 = 37.85$ minutes and $\bar{x}_2 = 33.75$ minutes. Which one is better? Why?

If the true population mean is unknown, then there is no way to say which estimate is better (more accurate). In terms of the problem, people would like to believe that the average waiting time in the ER on weekends is actually closer to 33.75 minutes than to 37.85 minutes.


f) Which estimate is closer to the true (known, $\mu = 35$) population mean?

$\bar{x} = 33.75$ minutes, since |33.75 - 35| = 1.25 is smaller than |37.85 - 35| = 2.85.

g) Which estimate is closer to the true (unknown) population mean?

(see part e))

h) Assume that the true population mean is unknown but the standard deviation is known, $\sigma = 9.5$, and that a random sample of 100 patients gave the sample mean $\bar{x} = 37.85$ minutes. What would be a rough guess for the lower and the upper bounds of the interval within which 99.7% of the values of a sample mean fall?

Roughly, $\bar{x} \pm 3\sigma/\sqrt{n} = 37.85 \pm 3(9.5/\sqrt{100}) = 37.85 \pm 2.85 = (35.00, 40.70)$.

Note: Each value in the computed 99.7% confidence interval provides a possible (or acceptable) value for the sample mean waiting time. Each possible value of the sample mean represents a point estimate for the population mean; therefore, the computed interval contains possible (or acceptable) values for the population mean, and this would be true 99.7% of the time when we repeat the process of taking samples of the same size from the target population.

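When the sample size is large ($n \ge 30$), the general form of the confidence interval for the population mean is

$$\bar{x} \pm Z_{1-\alpha/2}\,\frac{\sigma}{\sqrt{n}}$$

Question: Why does this formula work?

According to the CLT, for a large enough sample size ($n \ge 30$), $\bar{X}$ is approximately normal with mean $\mu$ and standard deviation $\sigma/\sqrt{n}$.

Therefore $Z = \dfrac{\bar{X} - \mu}{\sigma/\sqrt{n}}$ follows the Standard Normal Distribution, for which $P(-1.96 < Z < 1.96) = 0.95$.

The 95% Confidence Interval for the population mean is:

$$\bar{x} \pm 1.96\,\frac{\sigma}{\sqrt{n}}$$

Here $Z_{1-\alpha/2} = 1.96$ and $\alpha = 0.05$; $\alpha$ refers to the total area in the tails of the Standard Normal Distribution.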
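The step from a probability statement about Z to an interval for the population mean can be written out explicitly. This is a sketch of the standard argument, assuming $\sigma$ is known and n is large enough for the CLT to apply:

```latex
\begin{align*}
0.95 &\approx P\!\left(-1.96 < \frac{\bar{X}-\mu}{\sigma/\sqrt{n}} < 1.96\right) \\
     &= P\!\left(-1.96\,\frac{\sigma}{\sqrt{n}} < \bar{X}-\mu < 1.96\,\frac{\sigma}{\sqrt{n}}\right) \\
     &= P\!\left(\bar{X} - 1.96\,\frac{\sigma}{\sqrt{n}} < \mu < \bar{X} + 1.96\,\frac{\sigma}{\sqrt{n}}\right)
\end{align*}
```

The last line says that the random interval centered at the sample mean covers $\mu$ in about 95% of samples, which is where the 95% confidence level comes from.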

Example (Example 5.1, ER waiting time):

a) Compute the 95% confidence interval for the mean waiting time in the ER during weekends:

$\sigma = 9.5$, $n = 100$, $\bar{x} = 37.85$

$37.85 \pm 1.96 \times 9.5/10 = 37.85 \pm 1.86 = (35.99, 39.71)$

b) Is it different from our first rough guess? Why?

Yes, because the rough guess was based on the Empirical Rule, which uses a multiplier of 3 for a 99.7% CI (and 2 for a 95% CI), whereas here we used 1.96.

c) Does the 95% CI computed in the part a) contain the true population mean waiting time of 35

minutes?

NO

d) Compute the 95% confidence interval for the mean waiting time in the ER during weekends using $\bar{x} = 33.75$ (the same n=100 and the same standard deviation):

$33.75 \pm 1.86 = (31.89, 35.61)$

e) Does this 95% CI contain the true population mean waiting time of 35 minutes?

Yes

f) What will happen to the 95% CI when n=35?

The interval will be wider, because the margin of error $1.96\,\sigma/\sqrt{n}$ increases as n decreases.
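The ER computations can be reproduced in a few lines. This is a minimal sketch, assuming a known population standard deviation; the function names `z_confidence_interval` and `width` are illustrative, and the multiplier comes from Python's `statistics.NormalDist` rather than a printed table, so it is 1.95996... instead of the rounded 1.96.

```python
from math import sqrt
from statistics import NormalDist


def z_confidence_interval(xbar, sigma, n, confidence=0.95):
    """Return (lower, upper) for xbar +/- z * sigma / sqrt(n)."""
    alpha = 1 - confidence
    z = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96 when confidence is 0.95
    margin = z * sigma / sqrt(n)
    return xbar - margin, xbar + margin


def width(interval):
    """Length of an interval given as (lower, upper)."""
    lower, upper = interval
    return upper - lower


first = z_confidence_interval(37.85, 9.5, 100)
second = z_confidence_interval(33.75, 9.5, 100)
smaller_sample = z_confidence_interval(37.85, 9.5, 35)

print([round(v, 2) for v in first])
print([round(v, 2) for v in second])
print(width(smaller_sample) > width(first))
```

Rounded to two decimals, the two intervals match the hand computations, and the interval for n = 35 is wider than the one for n = 100.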


Definition:

A confidence interval (CI) is a range of values that is likely to cover the true population parameter.

The basic structure for any confidence interval is:

point estimate ± multiplier × standard error,

where multiplier × standard error = the margin of error.

Definitions:

The margin of error is a built-in component that addresses how close (or how far) the point estimates are from the true, unknown parameter.

The (estimated) standard deviation of a point estimate (e.g., of $\bar{x}$) is called the standard error.

Standard Error Interpretation:

If repeated samples of (sample size) are obtained from this same population, we would estimate the

resulting sample (statistic) to be about (value of standard errors) away from the true (population

parameter) on average.

The standard error will depend on the sample size and the true population standard

deviation.

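The multiplier used will depend on the confidence level and the parameter of interest. The Z multiplier does not depend on the sample size; the t multiplier, used for small samples, depends on it through the degrees of freedom.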

The confidence level is the proportion of times the method will produce an interval that does

contain the true parameter in repeated random sampling.
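The meaning of the confidence level can be illustrated by simulation. The sketch below makes several assumptions that are not part of the notes: waiting times are drawn from a normal population with the ER values $\mu = 35$ and $\sigma = 9.5$, and the number of repetitions and the random seed are arbitrary choices. The function name `coverage` is hypothetical.

```python
import random
from math import sqrt
from statistics import NormalDist, fmean


def coverage(mu, sigma, n, confidence=0.95, repetitions=2000, seed=1):
    """Fraction of simulated intervals xbar +/- z*sigma/sqrt(n) that contain mu."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    margin = z * sigma / sqrt(n)
    hits = 0
    for _ in range(repetitions):
        # Draw one random sample of size n and compute its mean.
        sample = [rng.gauss(mu, sigma) for _ in range(n)]
        xbar = fmean(sample)
        if xbar - margin <= mu <= xbar + margin:
            hits += 1
    return hits / repetitions


share = coverage(mu=35, sigma=9.5, n=100)
print(f"share of intervals containing mu: {share:.3f}")
```

The printed share should be close to 0.95, although any single run will differ slightly from that value because the simulation itself is random.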

Confidence interval applet

http://onlinestatbook.com/stat_sim/conf_interval/index.html

Note: For the same sample estimate, a 99% CI is wider than a 95% CI, because the margin of error is greater when a 99% confidence level is used than when a 95% confidence level is used.


Confidence Intervals for Population Mean Summary

Note: when n is large, the t-distribution is very close to the Standard Normal Distribution.

Exercise: (Example 5.4, page 183) Estimating Mean

IQ. Estimate, using a 95% confidence interval, the

mean IQ for all 12-year-olds. We select a random

sample of sixteen (n=16) 12-year-olds and computed

their average IQ score to be 106 with a (sample)

standard deviation 12.4.

Assuming that IQs are approximately normally

distributed, compute 95% confidence interval using an

appropriate formula.

n = 16, $\bar{x} = 106$, s = 12.4

df = n-1 = 16-1 = 15

t = 2.131, $\alpha$ = 1 - 0.95 = 0.05

UB = 106 + 2.131*12.4/4 = 112.6

LB = 106 - 2.131*12.4/4 = 99.4

So, the 95% CI is (99.4, 112.6)
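The same arithmetic can be checked with a short script. This is a sketch only: Python's standard library has no t-distribution quantile function, so the critical value 2.131 (df = 15, 95% confidence) is passed in as the table value quoted in the example, and the function name `t_confidence_interval` is illustrative.

```python
from math import sqrt


def t_confidence_interval(xbar, s, n, t_critical):
    """Return (lower, upper) for xbar +/- t * s / sqrt(n).

    t_critical must be looked up for df = n - 1 and the chosen confidence level.
    """
    margin = t_critical * s / sqrt(n)
    return xbar - margin, xbar + margin


# IQ example: n = 16, so df = 15; t = 2.131 is the table value for 95% confidence.
lower, upper = t_confidence_interval(xbar=106, s=12.4, n=16, t_critical=2.131)
print(round(lower, 1), round(upper, 1))
```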

Exercise: (Example 5.5, page 184) Estimating Mean Number of Visits to Primary Care. Estimate,

using a 99% confidence interval, the mean number of visits to primary care doctor over 3 years for

all patients with Type II diabetes. We select a random sample of 65 (n=65) patients and recorded

the number of visits each makes. The mean number of visits is 16 with the standard deviation of

1.4.

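Since n=65>30, we can use the formula that uses Z; for a 99% CI, $Z_{1-\alpha/2} = 2.576$.

With n=65, $\bar{x} = 16$, s=1.4, the 99% CI is

$16 \pm 2.576 \times 1.4/\sqrt{65} = 16 \pm 0.45 = (15.55, 16.45)$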


Sample Size Determination

Note from the examples that the margins of error in the confidence intervals vary widely from example to example, depending on both the variability in the data (the population or sample standard deviation, which drives the standard error) and the sample size.

The margin of error is small when the sample size is large and/or when the standard deviation is small.

In experimental design, statisticians are concerned with determining the numbers of subjects

and the sampling strategy to be employed in a particular application so as to satisfy specific

precision criteria in the statistical inference phase of the analysis.

Suppose we wish to estimate the mean of a population and it is important to produce an

estimate that is within 5 units of the true mean with 95% confidence.

Exercise: (Example 5.2 and 5.6, pages 180 and 185)

a) Generate a 95% confidence interval estimate for the mean age at which patients with

hypertension based on a sample of n = 12 subjects. The mean age for 12 subjects is 47.

Suppose the standard deviation is known and equal to 7.2.

b) Compute the margin of error:

c) Suppose that we wanted a more precise estimate, for example, an interval with a margin of

error not exceeding 2 years.

In applications in which one desires a confidence interval estimate for the mean of a

population, formulas can be used to determine the necessary sample size.


In order to design the experiment, specifically to determine the sample size, the following questions

must be answered:

1. How much error can be tolerated in the estimate (i.e., how close must the estimate be to the true mean)?

2. What level of confidence is desired?

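CI General Form: $\bar{x} \pm E$, where $E = Z_{1-\alpha/2}\,\sigma/\sqrt{n}$ is the margin of error

Solving for n gives: $n = \left(\dfrac{Z_{1-\alpha/2}\,\sigma}{E}\right)^2$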

Note: we will always round n upward, because n is the minimum number of subjects required to ensure a margin of error no greater than E in the confidence interval for the population mean, with the specified level of confidence reflected in $Z_{1-\alpha/2}$.

Let's compute n, the minimum number of patients required to estimate the mean age at which patients with hypertension with 95% confidence within 2 years: $n = (1.96 \times 7.2/2)^2 \approx 49.8$, which rounds up to n = 50.
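The margin of error for n = 12 and the required sample size for a 2-year margin can both be computed directly. This is a hedged sketch: the function names are invented for illustration, $\sigma = 7.2$ is treated as known, and the Z multiplier is taken from `statistics.NormalDist` instead of a table.

```python
from math import ceil, sqrt
from statistics import NormalDist


def z_multiplier(confidence):
    """Standard normal multiplier for a two-sided interval."""
    return NormalDist().inv_cdf(1 - (1 - confidence) / 2)


def margin_of_error(sigma, n, confidence=0.95):
    """E = z * sigma / sqrt(n)."""
    return z_multiplier(confidence) * sigma / sqrt(n)


def sample_size_for_mean(sigma, max_error, confidence=0.95):
    """Smallest n with z * sigma / sqrt(n) <= max_error (always rounded up)."""
    return ceil((z_multiplier(confidence) * sigma / max_error) ** 2)


# Hypertension exercise: sigma = 7.2, current n = 12, target margin of 2 years.
print(round(margin_of_error(sigma=7.2, n=12), 2))
n_required = sample_size_for_mean(sigma=7.2, max_error=2)
print(n_required)
```

Rounding up matters here: with one subject fewer than the computed n, the margin of error would slightly exceed 2 years.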

Note: if the true population standard deviation is unknown, use the sample standard deviation s.

If neither s nor $\sigma$ is available, use a conservative estimate given as:


Confidence Interval for a Population Proportion Summary

(Chapter 7.1, 7.6, pp. 294-296, pp. 328-330)

The assumptions required for CI for a population proportion to be valid:

the sample size n is large enough (check: $np \ge 5$ and $n(1-p) \ge 5$)

the data are a random sample from that population.

General Formula for Confidence Interval for the Population Proportion p

$$\hat{p} \pm Z_{1-\alpha/2}\sqrt{\frac{p(1-p)}{n}}$$

Formula for Approximate Confidence Interval for the Population Proportion p

$$\hat{p} \pm Z_{1-\alpha/2}\sqrt{\frac{\hat{p}(1-\hat{p})}{n}}$$

Formula for Conservative Confidence Interval for the Population Proportion p

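Note: here $p = 0.5$ is used to compute the standard error, $SE(\hat{p}) = \sqrt{\dfrac{p(1-p)}{n}} = \sqrt{\dfrac{0.5(1-0.5)}{n}} = \dfrac{1}{2\sqrt{n}}$

$$\hat{p} \pm Z_{1-\alpha/2}\,\frac{1}{2\sqrt{n}}$$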

The following formula is used to determine the sample size required to produce an estimate for the

population proportion p with a certain level of precision:

$$n = p(1-p)\left(\frac{Z_{1-\alpha/2}}{E}\right)^2$$

If the population parameter p is unknown, the sample estimate $\hat{p}$ can be used instead:

$$n = \hat{p}(1-\hat{p})\left(\frac{Z_{1-\alpha/2}}{E}\right)^2$$

If such estimate is not available, it can be shown that the function p (1-p) is maximized at p=0.5.

Therefore the most conservative estimate of n, the sample size, is produced by substituting p=0.5 in

the general formula, which is equivalent to

$$n = 0.25\left(\frac{Z_{1-\alpha/2}}{E}\right)^2$$

Exercise: The proportion of adults that believe in love at first sight. Assume that in a sample of 100 people, 40 say they do believe.

1) Compute an approximate 95% confidence interval.

Since the true population proportion is unknown, use the formula for approximate confidence

interval:

$$\hat{p} \pm Z_{1-\alpha/2}\sqrt{\frac{\hat{p}(1-\hat{p})}{n}}$$

2) Compute a conservative 95% confidence interval.

$$\hat{p} \pm Z_{1-\alpha/2}\,\frac{1}{2\sqrt{n}}$$

3) Which confidence interval is wider?

__________________________________
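Both intervals for the love-at-first-sight exercise can be computed side by side. This is a minimal sketch with an illustrative function name, `proportion_intervals`; it uses the approximate standard error based on the sample proportion and the conservative standard error based on p = 0.5.

```python
from math import sqrt
from statistics import NormalDist


def proportion_intervals(successes, n, confidence=0.95):
    """Return (approximate, conservative) confidence intervals for p."""
    p_hat = successes / n
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    approx_margin = z * sqrt(p_hat * (1 - p_hat) / n)  # standard error uses p_hat
    conservative_margin = z / (2 * sqrt(n))            # standard error uses p = 0.5
    approximate = (p_hat - approx_margin, p_hat + approx_margin)
    conservative = (p_hat - conservative_margin, p_hat + conservative_margin)
    return approximate, conservative


# 40 of 100 people say they believe in love at first sight.
approximate, conservative = proportion_intervals(successes=40, n=100)
print([round(v, 3) for v in approximate])
print([round(v, 3) for v in conservative])
```

The conservative interval can never be narrower than the approximate one, because p(1-p) is largest at p = 0.5.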

Exercise: Do you work more than 40 hours per week?

A poll was conducted by The Heldrich Center for Workforce Development (at Rutgers University).

A probability sample of 1000 workers resulted in 460 (for 46%) stating they work more than 40

hours per week.

a) Compute an approximate 95% confidence interval for the true population proportion of people

who work more than 40 hours.

$$\hat{p} \pm Z_{1-\alpha/2}\sqrt{\frac{\hat{p}(1-\hat{p})}{n}}$$


b) How many people do we need to survey in order to estimate the true population proportion with

10% accuracy? (E=0.1)

From the statement of the problem: $E = 0.1$ and $\hat{p} = 0.46$.

Since the true population proportion p is unknown, use the following formula:

$$n = \hat{p}(1-\hat{p})\left(\frac{Z_{1-\alpha/2}}{E}\right)^2$$

c) What would be a conservative sample size (assuming both p and $\hat{p}$ are unknown)?

$$n = 0.25\left(\frac{Z_{1-\alpha/2}}{E}\right)^2$$
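Parts b) and c) can be checked with a small helper. This sketch assumes a 95% confidence level, which the exercise does not state explicitly for the sample-size parts, and the function name `sample_size_for_proportion` is hypothetical.

```python
from math import ceil
from statistics import NormalDist


def sample_size_for_proportion(max_error, p=0.5, confidence=0.95):
    """Smallest n with z * sqrt(p * (1 - p) / n) <= max_error.

    The default p = 0.5 gives the conservative (largest) sample size.
    """
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return ceil(p * (1 - p) * (z / max_error) ** 2)


# Part b): use the sample estimate 0.46; part c): conservative choice p = 0.5.
n_with_estimate = sample_size_for_proportion(max_error=0.1, p=0.46)
n_conservative = sample_size_for_proportion(max_error=0.1)
print(n_with_estimate, n_conservative)
```

The two answers are close because the sample proportion 0.46 is already near 0.5, where p(1-p) reaches its maximum.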

Exercise: Work through the Examples 7.2(p. 296) and 7.15 (p. 329).


Principles for using Confidence Intervals to Guide Decision Making:

Principle 1: A value not in a CI can be rejected as possible value of the population

parameter. A value in a CI is an acceptable or reasonable possibility for the value of a

population parameter.

Principle 2: When the CIs for parameters for two different populations do not overlap, it is

reasonable to conclude that the parameters for the two populations are different.

1. The probability that the true parameter lies in a particular, already computed, confidence interval is either 0 or 1. The interval is now fixed and the parameter is not random, so the

parameter is either in that particular interval or it is not.

2. A 95% Confidence Interval: We are 95% confident that the true parameter value lies inside the confidence interval. The interval provides a range of reasonable values for the

population parameter.

3. The 95% Confidence Level: If the procedure were repeated many times (that is, if we repeatedly took a random sample of the same size and computed the 95% confidence

interval for each sample), we would expect 95% of the resulting confidence intervals to

contain the true population parameter.

Note: giving a correct interpretation of the interval and of the confidence level can be challenging.

Example: We are 95% confident that the true average waiting time in the ER on weekends is between 35.99 minutes and 39.71 minutes.

If the procedure were repeated many times (that is, if we repeatedly took random samples of 100 patients and computed the 95% confidence interval for each sample), we would expect 95% of the resulting confidence intervals to contain the true average waiting time in the ER on weekends.
