
Confidence Interval


  • CAS MA 113 Elementary Statistics Summer II Lecture 8, 9

    Lecture 8, 9 (Chapter 5.1, Chapter 7.1)

    “… the greater part of our happiness or misery depends upon our disposition, and not upon our circumstances.” Martha Washington (1732-1802)

    Plan:

    Introduction to Confidence Interval Estimation and Hypothesis Testing.

    Confidence Intervals for Population Mean.

    Making Decisions using Confidence Intervals. Sample Size Determination.

    Confidence Intervals for Population Proportion.

    Introduction to Confidence Interval Estimation and Hypothesis Testing.

    Population versus Sample; Parameter versus Statistic

    Statistical inference is the process of drawing conclusions about a population parameter based on data, that is, on a statistic: an estimate or a summary computed from the observations. There are two types of estimates for a population parameter: confidence interval estimates and point estimates.


    A confidence interval estimate is a range of values for the population parameter with a predefined level of confidence (e.g., a 95% confidence interval). A point estimate for a population parameter is the statistic and can be considered the best (available) single-number estimate of that parameter.

    Examples:

    1) Based on a random sample of the 100 patients visiting ER this weekend, the average waiting

    time is estimated to be 37.85 minutes.

    _________________________point estimate____________

    2) Based on a random sample of the 100 patients visiting ER this weekend, the average waiting

    time is estimated to be between 32.65 minutes and 42.10 minutes.

    ______________________confidence interval______________

    Question: How far is the sample mean of 37.85 minutes from the true (population) average waiting time at the ER (the mean for all weekends and all patients)?

    To answer this question we will learn the two most common statistical inference procedures, Confidence Interval Estimation and Hypothesis Testing, which are based on confidence interval estimates and point estimates, respectively.

    Confidence Interval Estimation is based on confidence interval estimates. Using these estimates, the researcher is fairly confident that the confidence interval will cover the true, unknown value of the population parameter.

    Hypothesis testing uses a point estimate to attempt to reject/accept a hypothesis about the population. Usually researchers want to reject the notion that chance alone can explain the sample results.

    Hypothesis testing is applied to population parameters by specifying a null value for the parameter, a value that would indicate that nothing of interest is happening.

    Hypothesis testing proceeds by obtaining a sample, computing a point estimate (sample

    statistic), and assessing how unlikely the sample statistic would be if the null parameter

    value were correct.

    In most cases, the researchers are trying to show that the null value is not correct.

    Achieving statistical significance is equivalent to rejecting the idea that the observed results

    are plausible if the null value is correct.

    In this course, we will study statistical inference methods for main population parameters that

    involve either proportions (for categorical data) or means (for quantitative data).

    Population proportion $p$ versus sample proportion $\hat{p}$ (for categorical response);

    Population mean $\mu$ versus sample mean $\bar{x}$ (for quantitative response).


    Exercise: For each description, choose the matching symbol.

    1) The average waiting time in the emergency room on weekends, based on the sample of 100 patients

    $p$   $\hat{p}$   $\mu$   $\bar{x}$

    2) The proportion of adults that believe in love at first sight

    $p$   $\hat{p}$   $\mu$   $\bar{x}$

    3) A sample of 1000 workers resulted in 460 (for 46%) stating they work more than 40 hours per week.

    $p$   $\hat{p}$   $\mu$   $\bar{x}$

    4) Average GPA of Boston University students

    $p$   $\hat{p}$   $\mu$   $\bar{x}$

    5) Average GPA of Boston University students, based on MA113 Summer II students

    $p$   $\hat{p}$   $\mu$   $\bar{x}$

    Exercise: Do you work more than 40 hours per week?

    A poll was conducted by The Heldrich Center for Workforce Development (at Rutgers University).

    A probability sample of 1000 workers resulted in 460 (for 46%) stating they work more than 40

    hours per week.

    Population = all workers

    Parameter = p, the proportion of people who work more than 40 hours

    Sample = 1000 workers

    Statistic = $\hat{p}$ = 0.46, the sample proportion of those who work more than 40 hours

    Can anyone say how close this observed sample proportion $\hat{p}$ is to the true population proportion $p$? ____no____

    If we were to take another random sample of the same size n = 1000, would we get the same value for the sample proportion $\hat{p}$? ____not necessarily____

    Exercise: Management of an airline uses a normal distribution to model the value claimed for a lost piece of luggage on domestic flights. The mean of the distribution is $600 and the standard deviation is $85. Suppose a random sample of 100 pieces of luggage is to be selected.

    Describe the approximate sampling distribution of the sample mean claimed value for a random sample of 100 pieces of lost luggage. Provide all features of the distribution.

    According to the Central Limit Theorem (CLT), here we have: the sample mean $\bar{X}$ is normally distributed with mean $\mu_{\bar{X}} = \mu = 600$ and standard deviation $\sigma_{\bar{X}} = \sigma/\sqrt{n} = 85/\sqrt{100} = 8.5$.
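    The features of the sampling distribution in the luggage exercise can be checked with a few lines of Python. This is only a minimal sketch: the function name `sampling_distribution_of_mean` is illustrative and does not come from the notes.

```python
import math

def sampling_distribution_of_mean(mu, sigma, n):
    """Return (mean, standard error) of the sample mean for samples of size n."""
    return mu, sigma / math.sqrt(n)

# Lost-luggage exercise: mu = 600 dollars, sigma = 85 dollars, n = 100
mean, se = sampling_distribution_of_mean(mu=600, sigma=85, n=100)
print(f"mean = {mean}, standard error = {se}")  # mean = 600, standard error = 8.5
```

    The spread of the sample mean (8.5) is ten times smaller than the spread of individual claims (85) because the sample size is 100.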

    Recall the main results about the Sampling Distribution of the Sample Mean and the CLT:

    Sampling Distribution of the Sample Mean $\bar{X}$:

    If the parent population IS a normal distribution with a mean $\mu$ and a standard deviation $\sigma$, then for any sample size (small or large), the sample mean $\bar{X}$ will have a __normal__ distribution with a mean of __$\mu$__ and a standard deviation of __$\sigma/\sqrt{n}$__.

    Central Limit Theorem: $\bar{X}$ is approximately $N(\mu, \sigma/\sqrt{n})$ for large n (30 or more)

    If the parent population is NOT a normal distribution but has a mean $\mu$ and a standard deviation $\sigma$, then for a large sample size ($n \ge 30$), the sample mean $\bar{X}$ will have an __approximately normal__ distribution with a mean of __$\mu$__ and a standard deviation of __$\sigma/\sqrt{n}$__.
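    The Central Limit Theorem can be illustrated with a small simulation. The sketch below is not part of the notes: it assumes a skewed (exponential) parent population with $\mu = 1$ and $\sigma = 1$, a hypothetical sample size of 40, and a fixed random seed so that the run is repeatable.

```python
import random
import statistics

random.seed(113)  # fixed seed so the run is repeatable

def simulate_sample_means(n, reps, rate=1.0):
    """Draw `reps` samples of size n from an exponential population; return their means."""
    return [
        statistics.fmean([random.expovariate(rate) for _ in range(n)])
        for _ in range(reps)
    ]

means = simulate_sample_means(n=40, reps=5000)
center = statistics.fmean(means)   # should be close to mu = 1
spread = statistics.stdev(means)   # should be close to sigma / sqrt(n) = 1 / sqrt(40), about 0.158
print(round(center, 3), round(spread, 3))
```

    Even though individual values are far from normal, the simulated sample means cluster around the population mean with a spread close to $\sigma/\sqrt{n}$.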

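    Recall the Empirical Rule for the Normal Distribution: about 68% of values fall within 1 standard deviation of the mean, about 95% within 2 standard deviations, and about 99.7% within 3 standard deviations.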


    Exercise (Example 5.1, page 175): Estimating Mean Waiting Time in the Emergency Room.

    Suppose we know that the true population mean waiting time in the ER during weekends is $\mu = 35$ minutes, with standard deviation $\sigma = 9.5$.

    a) Describe and draw the sampling distribution of the sample mean waiting time based on a random sample of n=35 and of n=100 patients coming during the last weekend.

    b) Using the Empirical Rule, identify the lower and upper bounds of the interval that contains 99.7% of the most typical values of the sample mean.

    c) Which interval is wider, the one for n=35 or the one for n=100? Why?

    For n = 35, because the standard error $\sigma/\sqrt{n}$ is larger when n is smaller.
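    Parts b) and c) can be worked numerically. The following is a minimal sketch using the values given in the exercise ($\mu = 35$, $\sigma = 9.5$); the helper name `empirical_rule_interval` is hypothetical.

```python
import math

def empirical_rule_interval(mu, sigma, n, k=3):
    """Interval mu +/- k standard errors for the sample mean (k=3 covers about 99.7%)."""
    se = sigma / math.sqrt(n)
    return mu - k * se, mu + k * se

for n in (35, 100):
    lo, hi = empirical_rule_interval(mu=35, sigma=9.5, n=n)
    print(f"n={n}: ({lo:.2f}, {hi:.2f}), width = {hi - lo:.2f}")
# n=35: (30.18, 39.82), width = 9.63
# n=100: (32.15, 37.85), width = 5.70
```

    The smaller sample gives the wider interval, since dividing by $\sqrt{35}$ shrinks the standard deviation less than dividing by $\sqrt{100}$.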

    d) Now assume that the true population mean $\mu$ is unknown. How can we estimate $\mu$?

    Using a point estimate ($\bar{x}$) or a confidence interval.

    Usually the true population parameter value is unknown, so we take a sample and use the sample statistic to estimate the parameter. The sample statistic (a single point estimator) may not be equal to the population parameter; in fact, it could change every time we take a new sample.

    Question: Will the observed sample statistic value be a reasonable estimate?

    Answer: If our sample is a random sample, then we will be able to say something about the accuracy of the estimation process. But a sample statistic only estimates (and is NOT necessarily equal to) a population parameter.

    e) Based on two independent random samples of 100 patients, we computed two values for the sample mean: $\bar{x} = 37.85$ minutes and $\bar{x} = 33.75$ minutes. Which one is better? Why?

    If the true population mean is unknown, then there is no way to say which estimate is better

    (more accurate). In terms of the problem, people would like to believe that the average waiting

    time in ER on weekends is actually closer to 33.75 minutes than 37.85 minutes.


    f) Which estimate is closer to the true (known, $\mu = 35$) population mean?

    $\bar{x} = 33.75$ minutes

    g) Which estimate is closer to the true (unknown) population mean?

    (see part e))

    h) Assume that the true population mean is unknown but the standard deviation is known, $\sigma = 9.5$, and that a random sample of 100 patients gave the sample mean $\bar{x} = 37.85$ minutes. What would be a rough guess for the lower and upper bounds of the interval that contains 99.7% of the most typical values of the sample mean?

    Roughly, $\bar{x} \pm 3\sigma/\sqrt{n} = 37.85 \pm 3(9.5/\sqrt{100}) = 37.85 \pm 2.85 = (35.00, 40.70)$.

    Note: Each value in the computed 99.7% confidence interval provides a possible (or acceptable) value for the sample mean waiting time. Each possible value of the sample mean represents a point estimator for the population mean; therefore, the computed interval will contain possible (or acceptable) values for the population mean, and that would be true 99.7% of the time (when we repeat the process of taking samples of the same size from the target population).

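    When the sample size is large ($n \ge 30$), the general form of the confidence interval for the population mean is

    $$\bar{x} \pm Z_{1-(\alpha/2)} \frac{\sigma}{\sqrt{n}}$$

    Question: Why does this formula work?

    According to the CLT, for a large enough sample size ($n \ge 30$), $\bar{X}$ is approximately $N(\mu, \sigma/\sqrt{n})$.

    Therefore $Z = \frac{\bar{X} - \mu}{\sigma/\sqrt{n}}$ follows (approximately) the Standard Normal Distribution, for which $P(-1.96 < Z < 1.96) = 0.95$. Rearranging the inequality for $\mu$ gives

    $$P\left(\bar{X} - 1.96\frac{\sigma}{\sqrt{n}} < \mu < \bar{X} + 1.96\frac{\sigma}{\sqrt{n}}\right) = 0.95$$

    The 95% Confidence Interval for the population mean is:

    $$\bar{x} \pm 1.96 \frac{\sigma}{\sqrt{n}}$$

    Here $Z_{1-(\alpha/2)} = 1.96$ and $\alpha = 0.05$; $\alpha$ refers to the total area in the tails of the Standard Normal Distribution.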

    Example (Example 5.1, ER waiting time):

    a) Compute the 95% confidence interval for the mean waiting time in the ER during weekends:

    $\sigma = 9.5$, $n = 100$, $\bar{x} = 37.85$

    37.85 ± 1.96*9.5/10 = 37.85 ± 1.86 = (35.99, 39.71)

    b) Is it different from our first rough guess? Why?

    Yes, because the rough guess used the Empirical Rule multipliers (3 for a 99.7% CI and 2 for a 95% CI), not 1.96.

    c) Does the 95% CI computed in part a) contain the true population mean waiting time of 35 minutes?

    NO

    d) Compute the 95% confidence interval for the mean waiting time in the ER during weekends using $\bar{x} = 33.75$ (the same n=100 and the same standard deviation):

    33.75 ± 1.86 = (31.89, 35.61)

    e) Does this 95% CI contain the true population mean waiting time of 35 minutes?

    Yes

    f) What will happen to the 95% CI when n=35?

    Wider interval, because the margin of error $1.96\,\sigma/\sqrt{n}$ grows as n decreases.
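    The two 95% intervals in the ER example can be reproduced in Python. This is a sketch under the example's assumptions (known $\sigma = 9.5$, n = 100); the function name `z_confidence_interval` is illustrative, and the multiplier is taken from the standard library's `statistics.NormalDist` rather than from a table.

```python
import math
from statistics import NormalDist

def z_confidence_interval(xbar, sigma, n, confidence=0.95):
    """CI for a population mean when sigma is known (or n is large)."""
    alpha = 1 - confidence
    z = NormalDist().inv_cdf(1 - alpha / 2)   # about 1.96 for 95% confidence
    margin = z * sigma / math.sqrt(n)
    return xbar - margin, xbar + margin

for xbar in (37.85, 33.75):
    lo, hi = z_confidence_interval(xbar, sigma=9.5, n=100)
    covers = lo <= 35 <= hi
    print(f"xbar={xbar}: ({lo:.2f}, {hi:.2f}), contains 35: {covers}")
# xbar=37.85: (35.99, 39.71), contains 35: False
# xbar=33.75: (31.89, 35.61), contains 35: True
```

    Both intervals have the same width because they share $\sigma$, n, and the confidence level; only the center moves with the sample mean.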


    Definition:

    A confidence interval (CI) is a range of values that is likely to cover the true population parameter.

    The basic structure for any confidence interval is:

    point estimate ± multiplier * standard error,

    where multiplier * standard error = the margin of error.

    Definitions:

    The margin of error is a built-in component that addresses how close (or how far) the point estimates are from the true, unknown parameter.

    The (estimated) variability in point estimates (e.g., $\bar{x}$, $\hat{p}$) is called the standard error.

    Standard Error Interpretation:

    If repeated samples of (sample size) are obtained from this same population, we would estimate the

    resulting sample (statistic) to be about (value of standard errors) away from the true (population

    parameter) on average.

    The standard error will depend on the sample size and the true population standard

    deviation.

    The multiplier used will depend on the confidence level, the parameter of interest, but not

    the sample size.

    The confidence level is the proportion of times the method will produce an interval that does

    contain the true parameter in repeated random sampling.

    Confidence interval applet

    http://onlinestatbook.com/stat_sim/conf_interval/index.html

    Note: For the same sample estimate, a 99% CI is wider than a 95% CI, because the margin of error is greater when a 99% confidence level is used than when a 95% confidence level is used.
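    The effect of the confidence level on the multiplier can be seen directly. The sketch below assumes the normal (Z) multiplier and reuses the ER example's standard error of 9.5/10 = 0.95; the helper name `z_multiplier` is illustrative.

```python
from statistics import NormalDist

def z_multiplier(confidence):
    """Standard normal multiplier Z_{1-(alpha/2)} for a given confidence level."""
    alpha = 1 - confidence
    return NormalDist().inv_cdf(1 - alpha / 2)

standard_error = 0.95  # ER example: 9.5 / sqrt(100)
for level in (0.95, 0.99):
    z = z_multiplier(level)
    print(f"{level:.0%}: multiplier = {z:.3f}, margin of error = {z * standard_error:.2f}")
# 95%: multiplier = 1.960, margin of error = 1.86
# 99%: multiplier = 2.576, margin of error = 2.45
```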


    Confidence Intervals for Population Mean Summary

    Note: when n is large, the t-distribution is very close to the Standard Normal Distribution.

    Exercise: (Example 5.4, page 183) Estimating Mean IQ. Estimate, using a 95% confidence interval, the mean IQ for all 12-year-olds. We select a random sample of sixteen (n=16) 12-year-olds and compute their average IQ score to be 106 with a (sample) standard deviation of 12.4.

    Assuming that IQs are approximately normally distributed, compute the 95% confidence interval using an appropriate formula.

    n=16, $\bar{x}$=106, s=12.4

    df = n-1 = 16-1 = 15

    $\alpha$ = 1-0.95 = 0.05, t = 2.131

    UB = 106 + 2.131*12.4/4 = 112.6

    LB = 106 - 2.131*12.4/4 = 99.4

    So, (99.4, 112.6)
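    The IQ interval above can be reproduced as follows. Python's standard library has no t-distribution, so this sketch simply takes the table value t = 2.131 (df = 15, 95% confidence) from the notes as an input; the function name `t_confidence_interval` is illustrative.

```python
import math

def t_confidence_interval(xbar, s, n, t_multiplier):
    """CI for a mean using a t multiplier looked up for df = n - 1."""
    margin = t_multiplier * s / math.sqrt(n)
    return xbar - margin, xbar + margin

# t = 2.131 is the table value for df = 15 and 95% confidence, as used in the notes.
lo, hi = t_confidence_interval(xbar=106, s=12.4, n=16, t_multiplier=2.131)
print(f"({lo:.1f}, {hi:.1f})")  # (99.4, 112.6)
```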

    Exercise: (Example 5.5, page 184) Estimating Mean Number of Visits to Primary Care. Estimate, using a 99% confidence interval, the mean number of visits to a primary care doctor over 3 years for all patients with Type II diabetes. We select a random sample of 65 (n=65) patients and record the number of visits each makes. The mean number of visits is 16 with a standard deviation of 1.4.

    Since n=65>30, we can use the formula that uses Z; for a 99% CI, $Z_{1-(\alpha/2)} = 2.576$.

    With n=65, $\bar{x}$=16, s=1.4, the 99% CI is 16 ± 2.576*1.4/$\sqrt{65}$ = 16 ± 0.45 = (15.55, 16.45).


    Sample Size Determination

    Note from the examples that the margins of error in the confidence intervals vary widely from example to example, depending on both the variability in the data (the population or sample standard deviation) and the sample size.

    The margin of error is small when the sample size is large and/or when the standard deviation is small.

    In experimental design, statisticians are concerned with determining the numbers of subjects

    and the sampling strategy to be employed in a particular application so as to satisfy specific

    precision criteria in the statistical inference phase of the analysis.

    Suppose we wish to estimate the mean of a population and it is important to produce an

    estimate that is within 5 units of the true mean with 95% confidence.

    Exercise: (Example 5.2 and 5.6, pages 180 and 185)

    a) Generate a 95% confidence interval estimate for the mean age at which patients with hypertension are diagnosed, based on a sample of n = 12 subjects. The mean age for the 12 subjects is 47. Suppose the standard deviation is known and equals 7.2.

    b) Compute the margin of error:

    c) Suppose that we wanted a more precise estimate, for example, an interval with a margin of error not exceeding 2 years.

    In applications in which one desires a confidence interval estimate for the mean of a

    population, formulas can be used to determine the necessary sample size.


    In order to design the experiment, specifically to determine the sample size, the following questions must be answered:

    1. How much error can be tolerated in the estimate (i.e., how close must the estimate be to the true mean)?

    2. What level of confidence is desired?

    CI General Form: $\bar{x} \pm E$, where $E = Z_{1-(\alpha/2)} \frac{\sigma}{\sqrt{n}}$ is the margin of error

    Solving for n gives:

    $$n = \left(\frac{Z_{1-(\alpha/2)}\,\sigma}{E}\right)^2$$

    Note: we will always round n upward, because n is the minimum number of subjects required to ensure a margin of error no larger than E in the confidence interval for the population mean with the specified level of confidence reflected in $Z_{1-(\alpha/2)}$.

    Let's compute n, the minimum number of patients required to estimate, with 95% confidence and within 2 years, the mean age at which patients with hypertension are diagnosed:

    $$n = \left(\frac{1.96 \times 7.2}{2}\right)^2 = 49.8, \text{ so } n = 50$$
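    The sample size calculation for the hypertension example can be sketched in Python. The code assumes a known $\sigma = 7.2$ and a normal (Z) multiplier; the function names `margin_of_error` and `sample_size_for_mean` are illustrative and not from the notes.

```python
import math
from statistics import NormalDist

def z_multiplier(confidence):
    """Standard normal multiplier Z_{1-(alpha/2)}."""
    return NormalDist().inv_cdf(1 - (1 - confidence) / 2)

def margin_of_error(sigma, n, confidence=0.95):
    """Margin of error E = Z * sigma / sqrt(n)."""
    return z_multiplier(confidence) * sigma / math.sqrt(n)

def sample_size_for_mean(sigma, margin, confidence=0.95):
    """Smallest n with margin of error <= margin; always rounded up."""
    return math.ceil((z_multiplier(confidence) * sigma / margin) ** 2)

print(round(margin_of_error(sigma=7.2, n=12), 2))   # 4.07 years with the original 12 subjects
print(sample_size_for_mean(sigma=7.2, margin=2))    # 50 subjects for a margin of 2 years
```

    Rounding up matters here: with 49 subjects the margin of error would still be slightly above 2 years, while 50 subjects bring it just below.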

    Note: if the true population standard deviation $\sigma$ is unknown, use the sample standard deviation s.

    If neither s nor $\sigma$ is available, use a conservative estimate given as:


    Confidence Interval for a Population Proportion Summary

    (Chapter 7.1, 7.6, pp. 294-296, pp. 328-330)

    The assumptions required for the CI for a population proportion to be valid:

    the sample size n is large enough (check: $n\hat{p} \ge 5$ and $n(1-\hat{p}) \ge 5$)

    the data are a random sample from that population.

    General Formula for Confidence Interval for the Population Proportion p

    $$\hat{p} \pm Z_{1-(\alpha/2)} \sqrt{\frac{p(1-p)}{n}}$$

    Formula for Approximate Confidence Interval for the Population Proportion p

    $$\hat{p} \pm Z_{1-(\alpha/2)} \sqrt{\frac{\hat{p}(1-\hat{p})}{n}}$$

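    Formula for Conservative Confidence Interval for the Population Proportion p

    Note: here $p = 0.5$ is used to compute the standard error, $SE(\hat{p}) = \sqrt{\frac{p(1-p)}{n}} = \sqrt{\frac{0.5(1-0.5)}{n}} = \frac{1}{2\sqrt{n}}$

    $$\hat{p} \pm Z_{1-(\alpha/2)} \frac{1}{2\sqrt{n}}$$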

    The following formula is used to determine the sample size required to produce an estimate for the population proportion p with a certain level of precision:

    $$n = p(1-p)\left(\frac{Z_{1-(\alpha/2)}}{E}\right)^2$$

    If the population parameter p is unknown, the sample estimate $\hat{p}$ can be used instead:

    $$n = \hat{p}(1-\hat{p})\left(\frac{Z_{1-(\alpha/2)}}{E}\right)^2$$

    If such an estimate is not available, it can be shown that the function p(1-p) is maximized at p=0.5. Therefore the most conservative estimate of n, the sample size, is produced by substituting p=0.5 in the general formula, which is equivalent to

    $$n = 0.25\left(\frac{Z_{1-(\alpha/2)}}{E}\right)^2$$

    Exercise: The proportion of adults that believe in love at first sight. Assume that in a sample of 100 people, 40 say they do believe.

    1) Compute an approximate 95% confidence interval.

    Since the true population proportion is unknown, use the formula for the approximate confidence interval:

    $$\hat{p} \pm Z_{1-(\alpha/2)} \sqrt{\frac{\hat{p}(1-\hat{p})}{n}}$$

    2) Compute a conservative 95% confidence interval.

    $$\hat{p} \pm Z_{1-(\alpha/2)} \frac{1}{2\sqrt{n}}$$

    3) Which confidence interval is wider?

    __________________________________
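    One way to check answers to this exercise is to compute both intervals for $\hat{p} = 40/100$. The code below is a minimal sketch; the function names `approximate_ci` and `conservative_ci` are illustrative, and the multiplier comes from `statistics.NormalDist` rather than a table.

```python
import math
from statistics import NormalDist

Z95 = NormalDist().inv_cdf(0.975)  # about 1.96

def approximate_ci(p_hat, n, z=Z95):
    """Approximate CI: standard error uses the sample proportion p_hat."""
    margin = z * math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - margin, p_hat + margin

def conservative_ci(p_hat, n, z=Z95):
    """Conservative CI: standard error uses p = 0.5, i.e. 1 / (2 * sqrt(n))."""
    margin = z / (2 * math.sqrt(n))
    return p_hat - margin, p_hat + margin

p_hat, n = 40 / 100, 100
approx = approximate_ci(p_hat, n)
conserv = conservative_ci(p_hat, n)
print(f"approximate:  ({approx[0]:.3f}, {approx[1]:.3f})")    # (0.304, 0.496)
print(f"conservative: ({conserv[0]:.3f}, {conserv[1]:.3f})")  # (0.302, 0.498)
```

    The conservative interval can never be narrower than the approximate one, because p(1-p) is largest at p = 0.5.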

    Exercise: Do you work more than 40 hours per week?

    A poll was conducted by The Heldrich Center for Workforce Development (at Rutgers University).

    A probability sample of 1000 workers resulted in 460 (for 46%) stating they work more than 40

    hours per week.

    a) Compute an approximate 95% confidence interval for the true population proportion of people

    who work more than 40 hours.

    $$\hat{p} \pm Z_{1-(\alpha/2)} \sqrt{\frac{\hat{p}(1-\hat{p})}{n}}$$


    b) How many people do we need to survey in order to estimate the true population proportion to within 10%? (E=0.1)

    From the statement of the problem: $\hat{p} = 0.46$ and $E = 0.1$.

    Since the true population proportion p is unknown, use the following formula

    $$n = \hat{p}(1-\hat{p})\left(\frac{Z_{1-(\alpha/2)}}{E}\right)^2$$

    c) What would a conservative sample size be (assuming unknown p and $\hat{p}$)?

    $$n = 0.25\left(\frac{Z_{1-(\alpha/2)}}{E}\right)^2$$
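    Parts a) through c) of the working-hours exercise can be checked numerically. The sketch below assumes 95% confidence throughout (the notes state it only for part a); the function names `proportion_ci` and `sample_size_for_proportion` are illustrative.

```python
import math
from statistics import NormalDist

Z95 = NormalDist().inv_cdf(0.975)  # about 1.96

def proportion_ci(p_hat, n, z=Z95):
    """Approximate CI for a population proportion."""
    margin = z * math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - margin, p_hat + margin

def sample_size_for_proportion(margin, p=0.5, z=Z95):
    """Required n for a given margin of error; p=0.5 gives the conservative size."""
    return math.ceil(p * (1 - p) * (z / margin) ** 2)

p_hat = 460 / 1000
lo, hi = proportion_ci(p_hat, n=1000)
print(f"a) ({lo:.3f}, {hi:.3f})")                        # a) (0.429, 0.491)
print("b)", sample_size_for_proportion(0.1, p=p_hat))    # b) 96
print("c)", sample_size_for_proportion(0.1))             # c) 97
```

    The two sample sizes are nearly equal here because $\hat{p} = 0.46$ is already close to 0.5, where p(1-p) reaches its maximum.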

    Exercise: Work through the Examples 7.2(p. 296) and 7.15 (p. 329).


    Principles for using Confidence Intervals to Guide Decision Making:

    Principle 1: A value not in a CI can be rejected as a possible value of the population parameter. A value in a CI is an acceptable or reasonable possibility for the value of the population parameter.

    Principle 2: When the CIs for parameters for two different populations do not overlap, it is

    reasonable to conclude that the parameters for the two populations are different.

    1. The probability that the true parameter lies in a particular, already computed, confidence interval is either 0 or 1. The interval is now fixed and the parameter is not random, so the

    parameter is either in that particular interval or it is not.

    2. A 95% Confidence Interval: We are 95% confident that the true parameter value lies inside the confidence interval. The interval provides a range of reasonable values for the

    population parameter.

    3. The 95% Confidence Level: If the procedure were repeated many times (that is, if we repeatedly took a random sample of the same size and computed the 95% confidence

    interval for each sample), we would expect 95% of the resulting confidence intervals to

    contain the true population parameter.

    Note: giving a correct interpretation of the interval and of the confidence level can be challenging.

    Example: We are 95% confident that the true average waiting time in the ER on weekends is between 35.99 minutes and 39.71 minutes.

    If the procedure were repeated many times (that is, if we repeatedly took random samples of 100 patients and computed the 95% confidence interval for each sample), we would expect 95% of the resulting confidence intervals to contain the true average waiting time in the ER on weekends.
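    The repeated-sampling interpretation of the confidence level can be illustrated by simulation. This is only a sketch under assumptions that are not in the notes: waiting times are drawn from a normal distribution with $\mu = 35$ and $\sigma = 9.5$, 2000 samples of 100 patients are taken, and the random seed is fixed for repeatability.

```python
import math
import random
from statistics import NormalDist, fmean

random.seed(8)  # fixed seed so the run is repeatable
Z95 = NormalDist().inv_cdf(0.975)  # about 1.96

def coverage(mu, sigma, n, reps):
    """Fraction of 95% CIs (known sigma) that contain the true mean mu."""
    margin = Z95 * sigma / math.sqrt(n)
    hits = 0
    for _ in range(reps):
        # Hypothetical normal model for individual waiting times
        xbar = fmean([random.gauss(mu, sigma) for _ in range(n)])
        if xbar - margin <= mu <= xbar + margin:
            hits += 1
    return hits / reps

rate = coverage(mu=35, sigma=9.5, n=100, reps=2000)
print(f"{rate:.1%} of the intervals contain the true mean")
```

    The printed percentage should land near 95%. Any single interval, once computed, either contains the true mean or does not; the 95% describes the long-run behavior of the procedure.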