Business Statistics for Managerial Decision

Inference for proportions

Some statistical studies concern variables measured on a scale of equal units, such as dollars or grams. We have discussed inference about the mean of variables like these in our previous lectures.

Other studies record categorical variables, such as the race or occupation of a person, the make of a car, or the type of complaint received from a customer.

When we record categorical variables, our data consist of counts or of percents obtained from counts.

The parameters we want to do inference about in these settings are population proportions.

Just as in the case of inference about population means, we may be concerned with a single population or with comparing two populations.

Inference about one or two proportions is very similar to inference about means, and it is based on sampling distributions that are approximately Normal.

Example: Work stress and personal life

The human resources manager of a chain of restaurants is concerned that work stress may be affecting the chain’s employees. She asks a random sample of 100 employees to respond Yes or No to the question “Does work stress have a negative impact on your personal life?” Of these, 68 say “Yes.”

The parameter of interest is the proportion of the chain’s employees who would answer “Yes” if asked.

This is the population proportion, which we call $p$.

The statistic used to estimate the unknown parameter is the sample proportion

$$\hat{p} = \frac{68}{100} = 0.68$$

Inference for a Single Proportion

The sample proportion $\hat{p}$ is a discrete random variable. In this example, with $n = 100$, it can take the values 0, 1/100, 2/100, …, 99/100, or 1.

The probability model for $\hat{p}$ can be based on the Binomial distributions for counts.

If the sample size $n$ is very small, we must base tests and confidence intervals for $p$ on the discrete distribution of $\hat{p}$.

We can approximate the distribution of $\hat{p}$ by a Normal distribution when the sample size is large.

Sampling Distribution of a Sample Proportion

Choose an SRS of size $n$ from a large population that contains population proportion $p$ of “successes.” Let $\hat{p}$ be the sample proportion of successes,

$$\hat{p} = \frac{\text{count of successes in the sample}}{n} = \frac{X}{n}$$

Then:

As the sample size increases, the sampling distribution of $\hat{p}$ becomes approximately Normal.

The mean of the sampling distribution is $p$.

The standard deviation of the sampling distribution is

$$\sqrt{\frac{p(1-p)}{n}}$$

The sampling distribution of the sample proportion of successes $\hat{p}$ is approximately Normal, with mean $p$ and standard deviation $\sqrt{p(1-p)/n}$.
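To get a feel for how closely the Normal approximation tracks the exact Binomial model, here is a minimal Python sketch that compares the two cumulative probabilities for $\hat{p}$. The function names `binom_cdf` and `normal_approx_cdf` and the values n = 100 and p = 0.5 are illustrative assumptions, not part of the lecture, and no continuity correction is applied.

```python
from math import comb, sqrt
from statistics import NormalDist


def binom_cdf(k, n, p):
    """Exact P(X <= k) for a Binomial(n, p) count X."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


def normal_approx_cdf(k, n, p):
    """Normal approximation to P(p_hat <= k/n): mean p, sd sqrt(p(1-p)/n)."""
    sd = sqrt(p * (1 - p) / n)
    return NormalDist(mu=p, sigma=sd).cdf(k / n)


# Illustrative (assumed) values: n = 100 draws, true proportion p = 0.5.
n, p = 100, 0.5
for k in (40, 45, 50, 55, 60):
    exact = binom_cdf(k, n, p)
    approx = normal_approx_cdf(k, n, p)
    print(f"P(p_hat <= {k / n:.2f}): exact {exact:.4f}, Normal approx {approx:.4f}")
```

Rerunning the comparison with a much smaller n is one way to see why very small samples call for the exact discrete distribution instead.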

Confidence Interval for a Single Proportion

The sample proportion $\hat{p} = X/n$ is the natural estimator of the population proportion $p$.

The traditional confidence interval for $p$ is based on the Normal approximation to the distribution of $\hat{p}$.

Unfortunately, confidence intervals based on this statistic can be quite inaccurate, even for large samples.

We can do better by moving the sample proportion $\hat{p}$ slightly away from 0 and 1. The following simple adjustment works very well in practice.

Wilson Estimate: Assume we have 4 additional observations, 2 of which are successes and 2 of which are failures.

The new sample size is $n + 4$ and the count of successes is $X + 2$.

The estimator of the population proportion is

$$\tilde{p} = \frac{X + 2}{n + 4}$$

We base a confidence interval on the z statistic obtained by standardizing the Wilson estimate $\tilde{p}$.

The distribution of $\tilde{p}$ is close to the Normal distribution with mean $p$ and standard deviation

$$\sqrt{\frac{p(1-p)}{n+4}}$$

Choose an SRS of size $n$ from a large population with unknown proportion $p$ of successes. The Wilson estimate of the population proportion is

$$\tilde{p} = \frac{X + 2}{n + 4}$$

The standard error of $\tilde{p}$ is

$$SE_{\tilde{p}} = \sqrt{\frac{\tilde{p}(1-\tilde{p})}{n+4}}$$

An approximate level $C$ confidence interval for $p$ is

$$\tilde{p} \pm z^* SE_{\tilde{p}}$$

where $z^*$ is the value for the standard Normal density curve with area $C$ between $-z^*$ and $z^*$.

Use this interval when the sample size is at least $n = 5$ and the confidence level is 90% or more.

Example: Estimating the effect of work stress

The sample survey in the previous example found that 68 out of 100 employees agreed that work stress had a negative impact on their personal lives. The sample size is $n = 100$ and the count of successes is $X = 68$. The Wilson estimate of the proportion of all employees affected by work stress is

$$\tilde{p} = \frac{X + 2}{n + 4} = \frac{68 + 2}{100 + 4} = 0.6731$$

The standard error is

$$SE_{\tilde{p}} = \sqrt{\frac{\tilde{p}(1-\tilde{p})}{n+4}} = \sqrt{\frac{(0.6731)(1 - 0.6731)}{104}} = 0.0460$$

The z critical value for 95% confidence is $z^* = 1.96$, so the confidence interval is

$$\tilde{p} \pm z^* SE_{\tilde{p}} = 0.6731 \pm (1.96)(0.0460) = 0.673 \pm 0.090$$

We are 95% confident that between 58.3% and 76.3% of the restaurant chain’s employees feel that work stress is damaging their personal lives.
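The interval calculation can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the function name `wilson_interval` is made up for this example, and the critical value comes from the standard library's `NormalDist` rather than from Table A, so it is 1.95996… instead of the rounded 1.96.

```python
from math import sqrt
from statistics import NormalDist


def wilson_interval(x, n, confidence=0.95):
    """Interval based on the Wilson estimate: add 2 successes and 2 failures."""
    z_star = NormalDist().inv_cdf((1 + confidence) / 2)  # critical value z*
    p_tilde = (x + 2) / (n + 4)                          # Wilson estimate
    se = sqrt(p_tilde * (1 - p_tilde) / (n + 4))         # standard error
    margin = z_star * se
    return p_tilde, se, (p_tilde - margin, p_tilde + margin)


# Work stress example: X = 68 "Yes" answers out of n = 100 employees.
p_tilde, se, (low, high) = wilson_interval(68, 100)
print(f"Wilson estimate {p_tilde:.4f}, SE {se:.4f}, 95% CI ({low:.3f}, {high:.3f})")
```

With the example's counts this should agree with the hand calculation to the three decimal places reported above.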

Significance Test for a Single Proportion

The sample proportion $\hat{p} = X/n$ is approximately Normal with mean $\mu_{\hat{p}} = p$ and standard deviation

$$\sigma_{\hat{p}} = \sqrt{\frac{p(1-p)}{n}}$$

For the confidence interval we used the Wilson estimate and estimated the standard deviation from the data.

When performing a significance test, the null hypothesis specifies a value for $p$, which we call $p_0$.

We assume that the hypothesized value is actually true, substitute $p_0$ for $p$ in the expression for $\sigma_{\hat{p}}$, and then standardize $\hat{p}$.

Example: Work stress

A national survey of restaurant employees found that 75% said that work stress had a negative impact on their personal lives. A sample of 100 employees of a restaurant chain found that 68 answered “Yes” when asked, “Does work stress have a negative impact on your personal life?” Is this good reason to think that the proportion of all employees of this chain who say “Yes” differs from the national proportion $p_0 = 0.75$?

To answer this question, we test

$H_0: p = 0.75$

$H_a: p \neq 0.75$

The expected numbers of “Yes” and “No” responses are 100 × 0.75 = 75 and 100 × 0.25 = 25. Both are greater than 10, so we can use the z test.

The test statistic is

$$z = \frac{\hat{p} - p_0}{\sqrt{\dfrac{p_0(1-p_0)}{n}}} = \frac{0.68 - 0.75}{\sqrt{\dfrac{(0.75)(0.25)}{100}}} = -1.62$$

From Table A we find

$$P(Z \geq 1.62) = 1 - 0.9474 = 0.0526$$

Because the alternative is two-sided, the P-value is twice this tail area: $P = 2 \times 0.0526 = 0.1052$.

We conclude that the chain restaurant data are compatible with the survey results.
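As a check on the arithmetic, the test can be sketched in Python. The function name `one_proportion_z_test` is an assumption made for this illustration. The code keeps the unrounded z, so its P-value is close to, but not exactly, the table-based value obtained from z rounded to two decimals.

```python
from math import sqrt
from statistics import NormalDist


def one_proportion_z_test(x, n, p0):
    """Two-sided z test of H0: p = p0 using the sample proportion x/n."""
    p_hat = x / n
    sd0 = sqrt(p0 * (1 - p0) / n)      # standard deviation of p_hat under H0
    z = (p_hat - p0) / sd0
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))  # both tails beyond |z|
    return z, p_value


# Work stress example: 68 of 100 say "Yes"; national proportion p0 = 0.75.
n, p0 = 100, 0.75
# Rule of thumb from the example: expected counts should both exceed 10.
print("expected Yes / No:", n * p0, n * (1 - p0))
z, p_value = one_proportion_z_test(68, n, p0)
print(f"z = {z:.2f}, two-sided P-value = {p_value:.4f}")
```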

Choosing a Sample Size

We want to see how to choose the sample size $n$ to obtain a confidence interval with a specified margin of error $m$ for a population proportion.

The margin of error for the confidence interval for a population proportion is

$$m = z^* SE_{\tilde{p}} = z^* \sqrt{\frac{\tilde{p}(1-\tilde{p})}{n+4}}$$

Choosing a confidence level $C$ fixes the critical value $z^*$.

The margin of error also depends on the value of $\tilde{p}$ and the sample size $n$.

We don’t know the value of $\tilde{p}$ until we gather data, so we must guess a value to use in the calculations. Let’s call the guessed value $p^*$. There are two ways to get $p^*$:

Use a sample estimate from a pilot study or from similar studies done earlier.

Use $p^* = 0.5$. Because the margin of error is largest when $\tilde{p} = 0.5$, this choice gives a sample size that is somewhat larger than we really need for the confidence level we choose. It is a safe choice no matter what the data later show.
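The claim that the margin of error peaks at 0.5 is easy to check numerically. The sketch below is only an illustration: the function name `margin_of_error`, the sample size n = 100, and the grid of guessed values are all assumptions chosen for the demonstration.

```python
from math import sqrt


def margin_of_error(p_guess, n, z_star=1.96):
    """Margin of error z* * sqrt(p(1-p)/(n+4)) for a guessed proportion."""
    return z_star * sqrt(p_guess * (1 - p_guess) / (n + 4))


# Assumed illustration: fixed n = 100, a handful of guessed proportions.
guesses = [0.1, 0.3, 0.5, 0.7, 0.9]
margins = {g: margin_of_error(g, n=100) for g in guesses}
for g, m in margins.items():
    print(f"p* = {g:.1f}: margin of error = {m:.4f}")

worst_case = max(margins, key=margins.get)
print("largest margin occurs at p* =", worst_case)
```

The margins are symmetric around 0.5, which is why planning with $p^* = 0.5$ can never understate the sample size needed.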

The level $C$ confidence interval for a proportion $p$ will have a margin of error approximately equal to a specified value $m$ when the sample size satisfies

$$n + 4 = \left(\frac{z^*}{m}\right)^2 p^*(1 - p^*)$$

Here $z^*$ is the critical value for confidence $C$, and $p^*$ is a guessed value for the proportion of successes in the future sample.

The margin of error will be less than or equal to $m$ if $p^*$ is chosen to be 0.5. The sample size required is then given by

$$n + 4 = \left(\frac{z^*}{2m}\right)^2$$

Example: Planning a sample of customers

Your company has received complaints about its customer support service. You intend to hire a consulting company to carry out a sample survey of customers. Before contacting the consultant, you want some idea of the sample size you will have to pay for. One critical question is the degree of satisfaction with your customer service, measured on a five-point scale. You want to estimate the proportion $p$ of your customers who are satisfied (that is, who choose either “satisfied” or “very satisfied,” the two highest levels on the five-point scale).

You want to estimate $p$ with 95% confidence and a margin of error less than or equal to 3%. For planning purposes, you are willing to use $p^* = 0.5$. The sample size required is

$$n + 4 = \left(\frac{z^*}{2m}\right)^2 = \left(\frac{1.96}{(2)(0.03)}\right)^2 = 1067.1$$

Round up to get $n + 4 = 1068$, or $n = 1064$. (Always round up. Rounding down would give a margin of error slightly greater than 0.03.)

Similarly, for a 2.5% margin of error we have (after rounding up)

$$n + 4 = \left(\frac{1.96}{(2)(0.025)}\right)^2 = 1537$$

that is, $n = 1533$.
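The planning calculation lends itself to a small helper. The following Python sketch assumes the formula $n + 4 = (z^*/m)^2 p^*(1 - p^*)$ from this section; the function name `required_sample_size` and its default arguments are choices made for this illustration.

```python
from math import ceil


def required_sample_size(margin, z_star=1.96, p_guess=0.5):
    """Return (n + 4, n) for a desired margin of error, always rounding up."""
    n_plus_4 = ceil((z_star / margin) ** 2 * p_guess * (1 - p_guess))
    return n_plus_4, n_plus_4 - 4


# Customer survey example: 95% confidence, p* = 0.5.
for m in (0.03, 0.025):
    n_plus_4, n = required_sample_size(m)
    print(f"margin {m}: n + 4 = {n_plus_4}, n = {n}")
```

Passing a `p_guess` taken from a pilot study instead of 0.5 gives a smaller required sample, at the cost of relying on that guess being close to the truth.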

Comparing Two Proportions

We often want to compare the proportions of two groups (such as men and women) that have some characteristic.

We call the two groups being compared population 1 and population 2.

We call the two population proportions of “successes” $p_1$ and $p_2$.

The data consist of two independent SRSs. The sample sizes are $n_1$ from population 1 and $n_2$ from population 2.

The proportion of successes in each sample estimates the corresponding population proportion.

Here is the notation we will use:

Population    Population proportion    Sample size    Count of successes    Sample proportion
1             $p_1$                    $n_1$          $X_1$                 $\hat{p}_1 = X_1/n_1$
2             $p_2$                    $n_2$          $X_2$                 $\hat{p}_2 = X_2/n_2$

Sampling Distribution of $\hat{p}_1 - \hat{p}_2$

Choose independent SRSs of sizes $n_1$ and $n_2$ from two populations with proportions $p_1$ and $p_2$ of successes.

Let $D = \hat{p}_1 - \hat{p}_2$ be the difference between the two sample proportions of successes.

Then, as both sample sizes increase, the sampling distribution of $D$ becomes approximately Normal.

The mean of the sampling distribution is $p_1 - p_2$.

The standard deviation of the sampling distribution is

$$\sigma_D = \sqrt{\frac{p_1(1-p_1)}{n_1} + \frac{p_2(1-p_2)}{n_2}}$$

The sampling distribution of the difference of two sample proportions, $\hat{p}_1 - \hat{p}_2$, is approximately Normal.

The mean and standard deviation are found from the two population proportions of successes, $p_1$ and $p_2$.

Confidence Interval

Just as in the case of estimating a single proportion, a small modification of the sample proportions greatly improves the accuracy of confidence intervals.

The Wilson estimates of the two population proportions are

$$\tilde{p}_1 = \frac{X_1 + 1}{n_1 + 2} \qquad \tilde{p}_2 = \frac{X_2 + 1}{n_2 + 2}$$

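The standard deviation of $\tilde{D} = \tilde{p}_1 - \tilde{p}_2$ is approximately

$$\sigma_{\tilde{D}} \approx \sqrt{\frac{p_1(1-p_1)}{n_1+2} + \frac{p_2(1-p_2)}{n_2+2}}$$

To obtain a confidence interval for $p_1 - p_2$, we replace the unknown parameters in the standard deviation by estimates to obtain an estimated standard deviation, or standard error:

$$SE_{\tilde{D}} = \sqrt{\frac{\tilde{p}_1(1-\tilde{p}_1)}{n_1+2} + \frac{\tilde{p}_2(1-\tilde{p}_2)}{n_2+2}}$$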

Confidence Interval for Comparing Two Proportions

Example: “No Sweat” Garment Labels

Following complaints about the working conditions in some apparel factories both in the United States and abroad, a joint government and industry commission recommended in 1998 that companies that monitor and enforce proper standards be allowed to display a “No Sweat” label on their products. A survey of U.S. residents aged 18 or older asked a series of questions about how likely they would be to purchase a garment under various conditions.

For some conditions, it was stated that the garment had a “No Sweat” label; for others, there was no mention of such a label. On the basis of the responses, each person was classified as a “label user” or a “label nonuser.” About 16.5% of those surveyed were label users. One purpose of the study was to describe the demographic characteristics of users and nonusers.


The study suggested that there is a gender difference in the proportion of label users. Here is a summary of the data. Let $X$ denote the number of label users.

Population    $n$    $X$    $\hat{p} = X/n$    $\tilde{p} = (X+1)/(n+2)$
1 (women)     296    63     0.213              0.215
2 (men)       251    27     0.108              0.111


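First calculate the standard error of the observed difference.

$$SE_{\tilde{D}} = \sqrt{\frac{\tilde{p}_1(1-\tilde{p}_1)}{n_1+2} + \frac{\tilde{p}_2(1-\tilde{p}_2)}{n_2+2}} = \sqrt{\frac{(0.215)(0.785)}{296+2} + \frac{(0.111)(0.889)}{251+2}} = 0.0309$$

The 95% confidence interval is

$$(\tilde{p}_1 - \tilde{p}_2) \pm z^* SE_{\tilde{D}} = (0.215 - 0.111) \pm (1.96)(0.0309) = 0.104 \pm 0.061 \approx (0.04, 0.16)$$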

With 95% confidence we can say that the difference in the proportions is between 0.04 and 0.16. Alternatively, we can report that the proportion of label users is about 10 percentage points higher among women than among men, with a 95% margin of error of 6 percentage points.

In this example we chose women to be the first population. Had we chosen men as the first population, the estimate of the difference would be negative (−0.104).

Because it is easier to discuss positive numbers, we generally choose the first population to be the one with the higher proportion.

The choice does not affect the substance of the analysis.
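The two-sample interval can be sketched the same way as the one-sample case. In this minimal Python illustration the function name `two_proportion_interval` is an assumption, and the code works with unrounded Wilson estimates and an unrounded critical value, so intermediate figures may differ in the last digit from a hand calculation that rounds at each step.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_interval(x1, n1, x2, n2, confidence=0.95):
    """Interval for p1 - p2 using Wilson estimates (X + 1) / (n + 2)."""
    z_star = NormalDist().inv_cdf((1 + confidence) / 2)
    p1 = (x1 + 1) / (n1 + 2)
    p2 = (x2 + 1) / (n2 + 2)
    se = sqrt(p1 * (1 - p1) / (n1 + 2) + p2 * (1 - p2) / (n2 + 2))
    diff = p1 - p2
    margin = z_star * se
    return diff, se, (diff - margin, diff + margin)


# "No Sweat" example: 63 of 296 women and 27 of 251 men are label users.
diff, se, (low, high) = two_proportion_interval(63, 296, 27, 251)
print(f"difference {diff:.3f}, SE {se:.4f}, 95% CI ({low:.2f}, {high:.2f})")

# Swapping the populations only flips the sign of the estimate.
diff_swapped, _, _ = two_proportion_interval(27, 251, 63, 296)
print(f"men first: difference {diff_swapped:.3f}")
```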

Significance Tests

It is sometimes useful to test the null hypothesis that the two population proportions are the same.

We standardize $D = \hat{p}_1 - \hat{p}_2$ by subtracting its mean $p_1 - p_2$ and then dividing by its standard deviation

$$\sigma_D = \sqrt{\frac{p_1(1-p_1)}{n_1} + \frac{p_2(1-p_2)}{n_2}}$$

If $n_1$ and $n_2$ are large, the standardized difference is approximately $N(0, 1)$.

To estimate $\sigma_D$ we take into account the null hypothesis that $p_1 = p_2$.

If these two proportions are equal, we can view all of the data as coming from a single population.

Let $p$ denote the common value of $p_1$ and $p_2$. The standard deviation of $D = \hat{p}_1 - \hat{p}_2$ is then

$$\sigma_{Dp} = \sqrt{\frac{p(1-p)}{n_1} + \frac{p(1-p)}{n_2}} = \sqrt{p(1-p)\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}$$

We estimate the common value of $p$ by the overall proportion of successes in the two samples:

$$\hat{p} = \frac{\text{number of successes in both samples}}{\text{number of observations in both samples}} = \frac{X_1 + X_2}{n_1 + n_2}$$

This estimate of $p$ is called the pooled estimate.

To estimate the standard deviation of $D$, substitute $\hat{p}$ for $p$ in the expression for $\sigma_{Dp}$. The result is a standard error for $D$ under the condition that the null hypothesis $H_0: p_1 = p_2$ is true.

The test statistic uses this standard error to standardize the difference between the two sample proportions.

Significance Tests for Comparing Two Proportions

Example: Men, women, and garment labels

The previous example presented the survey data on whether consumers are “label users” who pay attention to label details when buying a shirt. Are men and women equally likely to be label users?

Here is the data summary:

Population    $n$    $X$    $\hat{p} = X/n$
1 (women)     296    63     0.213
2 (men)       251    27     0.108

We compare the proportions of label users in the two populations (women and men) by testing the hypotheses

$H_0: p_1 = p_2$

$H_a: p_1 \neq p_2$

The pooled estimate of the common value of $p$ is

$$\hat{p} = \frac{63 + 27}{296 + 251} = \frac{90}{547} = 0.1645$$

This is the proportion of label users in the entire sample.


The test statistic is calculated as follows:

$$SE_{Dp} = \sqrt{(0.1645)(0.8355)\left(\frac{1}{296} + \frac{1}{251}\right)} = 0.03181$$

$$z = \frac{\hat{p}_1 - \hat{p}_2}{SE_{Dp}} = \frac{0.213 - 0.108}{0.03181} = 3.30$$

The observed difference is more than 3 standard deviations away from zero.


The P-value is

$$P = 2P(Z \geq 3.30) = 2(1 - 0.9995) = 2(0.0005) = 0.001$$

Conclusion: 21% of women are label users versus only 11% of men; the difference is statistically significant.
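The pooled test can also be sketched in Python. As before, this is an illustration under stated assumptions: the function name `two_proportion_z_test` is hypothetical, and because the code does not round the sample proportions or z, its z and P-value differ slightly from the table-based figures.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_z_test(x1, n1, x2, n2):
    """Two-sided z test of H0: p1 = p2 using the pooled estimate."""
    p1_hat, p2_hat = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)                 # pooled estimate of p
    se = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1_hat - p2_hat) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))   # both tails beyond |z|
    return pooled, se, z, p_value


# Label-user example: 63 of 296 women versus 27 of 251 men.
pooled, se, z, p_value = two_proportion_z_test(63, 296, 27, 251)
print(f"pooled {pooled:.4f}, SE {se:.5f}, z = {z:.2f}, P-value = {p_value:.4f}")
```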
