Business Statistics for Managerial Decision Farideh Dehkordi-Vakil


Comparing Two Proportions

We often want to compare the proportions of two groups (such as men and women) that have some characteristic.

We call the two groups being compared Population 1 and Population 2.

The two population proportions of "successes" are P1 and P2.

The data consist of two independent SRSs. The sample sizes are n1 from Population 1 and n2 from Population 2.

Comparing Two Proportions

The proportion of successes in each sample estimates the corresponding population proportion.

Here is the notation we will use:

Population   Population proportion   Sample size   Count of successes   Sample proportion
1            P1                      n1            X1                   $\hat{p}_1 = X_1/n_1$
2            P2                      n2            X2                   $\hat{p}_2 = X_2/n_2$

Sampling Distribution of $\hat{p}_1 - \hat{p}_2$

Choose independent SRSs of sizes n1 and n2 from two populations with proportions P1 and P2 of successes.

Let $D = \hat{p}_1 - \hat{p}_2$ be the difference between the two sample proportions of successes.

Then as both sample sizes increase, the sampling distribution of D becomes approximately Normal. The mean of the sampling distribution is $P_1 - P_2$. The standard deviation of the sampling distribution is

$$\sigma_D = \sqrt{\frac{P_1(1-P_1)}{n_1} + \frac{P_2(1-P_2)}{n_2}}$$
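As a rough illustration of the mean and standard deviation of D described above, the following Python sketch evaluates both quantities. The population proportions and sample sizes passed in are made-up values assumed purely for illustration; they do not come from the lecture.

```python
import math

def diff_sampling_distribution(p1, n1, p2, n2):
    """Return (mean, sd) of D = p1_hat - p2_hat for two independent SRSs."""
    mean = p1 - p2
    sd = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    return mean, sd

# Hypothetical inputs (assumed for illustration, not taken from the lecture).
mean_d, sd_d = diff_sampling_distribution(p1=0.5, n1=100, p2=0.4, n2=100)
print(f"mean of D = {mean_d:.3f}, sd of D = {sd_d:.4f}")
```

Because the population proportions are unknown in practice, this calculation is mainly useful for seeing how the spread of D shrinks as n1 and n2 grow.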

Sampling Distribution of $\hat{p}_1 - \hat{p}_2$

The sampling distribution of the difference of two sample proportions is approximately Normal.

The mean and standard deviation are found from the two population proportions of successes, P1 and P2.

Confidence Interval

Just as in the case of estimating a single proportion, a small modification of the sample proportions greatly improves the accuracy of confidence intervals.

The Wilson estimates of the two population proportions are

$$\tilde{p}_1 = \frac{X_1 + 1}{n_1 + 2}, \qquad \tilde{p}_2 = \frac{X_2 + 1}{n_2 + 2}$$

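Confidence Interval

The standard deviation of $\tilde{D} = \tilde{p}_1 - \tilde{p}_2$ is approximately

$$\sigma_{\tilde{D}} = \sqrt{\frac{P_1(1-P_1)}{n_1+2} + \frac{P_2(1-P_2)}{n_2+2}}$$

To obtain a confidence interval for P1 - P2, we replace the unknown parameters in the standard deviation by estimates to obtain an estimated standard deviation, or standard error:

$$SE_{\tilde{D}} = \sqrt{\frac{\tilde{p}_1(1-\tilde{p}_1)}{n_1+2} + \frac{\tilde{p}_2(1-\tilde{p}_2)}{n_2+2}}$$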

Confidence Interval for Comparing Two Proportions

Example: "No Sweat" Garment Labels

Following complaints about the working conditions in some apparel factories both in the United States and abroad, a joint government and industry commission recommended in 1998 that companies that monitor and enforce proper standards be allowed to display a "No Sweat" label on their products. A survey of U.S. residents aged 18 or older asked a series of questions about how likely they would be to purchase a garment under various conditions.

For some conditions, it was stated that the garment had a "No Sweat" label; for others, there was no mention of such a label. On the basis of the responses, each person was classified as a "label user" or a "label nonuser." About 16.5% of those surveyed were label users. One purpose of the study was to describe the demographic characteristics of users and nonusers.

The study suggested that there is a gender difference in the proportion of label users. Here is a summary of the data. Let X denote the number of label users.

Population   n     X    $\hat{p} = X/n$   $\tilde{p} = (X+1)/(n+2)$
1 (women)    296   63   0.213             0.215
2 (men)      251   27   0.108             0.111
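The two kinds of estimate in the summary table can be reproduced with a few lines of Python. This is only a sketch: the function names are illustrative choices, and the counts are the ones given in the table (63 of 296 women, 27 of 251 men).

```python
def sample_proportion(x, n):
    """Ordinary sample proportion X / n."""
    return x / n

def wilson_estimate(x, n):
    """Wilson estimate (X + 1) / (n + 2)."""
    return (x + 1) / (n + 2)

# Counts from the summary table: (label users X, sample size n).
groups = {"women": (63, 296), "men": (27, 251)}

for name, (x, n) in groups.items():
    p_hat = sample_proportion(x, n)
    p_tilde = wilson_estimate(x, n)
    print(f"{name}: p_hat = {p_hat:.3f}, p_tilde = {p_tilde:.3f}")
```

The Wilson estimate behaves as if one success and one failure had been added to each sample, which pulls each proportion slightly toward 1/2.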


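First calculate the standard error of the observed difference.

$$SE_{\tilde{D}} = \sqrt{\frac{\tilde{p}_1(1-\tilde{p}_1)}{n_1+2} + \frac{\tilde{p}_2(1-\tilde{p}_2)}{n_2+2}} = \sqrt{\frac{(0.215)(0.785)}{296+2} + \frac{(0.111)(0.889)}{251+2}} = 0.0308$$

The 95% confidence interval is

$$(\tilde{p}_1 - \tilde{p}_2) \pm z^{*}\, SE_{\tilde{D}} = (0.215 - 0.111) \pm (1.96)(0.0308) = 0.104 \pm 0.060 = (0.04,\ 0.16)$$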
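The interval calculation can be checked with a short Python sketch. The function name and return layout are illustrative assumptions; the counts and the critical value 1.96 are those used in this example. Because the code carries unrounded Wilson estimates, the standard error it prints may differ from the hand calculation in the last decimal place, while the interval still rounds to the same endpoints.

```python
import math

def wilson_diff_ci(x1, n1, x2, n2, z_star=1.96):
    """Confidence interval for P1 - P2 based on the Wilson estimates."""
    p1 = (x1 + 1) / (n1 + 2)
    p2 = (x2 + 1) / (n2 + 2)
    # Standard error of the difference of the Wilson estimates.
    se = math.sqrt(p1 * (1 - p1) / (n1 + 2) + p2 * (1 - p2) / (n2 + 2))
    diff = p1 - p2
    margin = z_star * se
    return diff, se, (diff - margin, diff + margin)

# Women: 63 label users out of 296; men: 27 out of 251.
diff, se, (lower, upper) = wilson_diff_ci(63, 296, 27, 251)
print(f"difference = {diff:.3f}, SE = {se:.4f}")
print(f"95% CI = ({lower:.2f}, {upper:.2f})")
```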

Example: "No Sweat" Garment Labels

With 95% confidence we can say that the difference in the proportions is between 0.04 and 0.16. Alternatively, we can report that women are about 10 percentage points more likely to be label users than men, with a 95% margin of error of 6 percentage points.

In this example we chose women to be the first population. Had we chosen men as the first population, the estimate of the difference would be negative (-0.104).

Because it is easier to discuss positive numbers, we generally choose the first population to be the one with the higher proportion.

The choice does not affect the substance of the analysis.

Significance Tests

It is sometimes useful to test the null hypothesis that the two population proportions are the same.

We standardize $D = \hat{p}_1 - \hat{p}_2$ by subtracting its mean P1 - P2 and then dividing by its standard deviation

$$\sigma_D = \sqrt{\frac{P_1(1-P_1)}{n_1} + \frac{P_2(1-P_2)}{n_2}}$$

If n1 and n2 are large, the standardized difference is approximately N(0, 1).

To estimate $\sigma_D$ we take into account the null hypothesis that P1 = P2.

Significance Tests

If these two proportions are equal, we can view all of the data as coming from a single population.

Let P denote the common value of P1 and P2. The standard deviation of $D = \hat{p}_1 - \hat{p}_2$ is then

$$\sigma_{Dp} = \sqrt{\frac{P(1-P)}{n_1} + \frac{P(1-P)}{n_2}} = \sqrt{P(1-P)\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}$$

Significance Tests

We estimate the common value of P by the overall proportion of successes in the two samples:

$$\hat{p} = \frac{\text{number of successes in both samples}}{\text{number of observations in both samples}} = \frac{X_1 + X_2}{n_1 + n_2}$$

This estimate of P is called the pooled estimate.

To estimate the standard deviation of D, substitute $\hat{p}$ for P in the expression for $\sigma_{Dp}$. The result is a standard error for D under the condition that the null hypothesis H0: P1 = P2 is true.

The test statistic uses this standard error to standardize the difference between the two sample proportions.

Significance Tests for Comparing Two Proportions

Example: Men, Women, and Garment Labels

The previous example presented the survey data on whether consumers are "label users" who pay attention to label details when buying a shirt. Are men and women equally likely to be label users?

Here is the data summary:

Population   n     X    $\hat{p} = X/n$
1 (women)    296   63   0.213
2 (men)      251   27   0.108

Example: Men, Women, and Garment Labels

We compare the proportions of label users in the two populations (women and men) by testing the hypotheses

H0: P1 = P2

Ha: P1 ≠ P2

The pooled estimate of the common value of P is:

$$\hat{p} = \frac{63 + 27}{296 + 251} = \frac{90}{547} = 0.1645$$

This is the proportion of label users in the entire sample.


The test statistic is calculated as follows:

$$SE_{Dp} = \sqrt{(0.1645)(0.8355)\left(\frac{1}{296} + \frac{1}{251}\right)} = 0.03181$$

$$z = \frac{\hat{p}_1 - \hat{p}_2}{SE_{Dp}} = \frac{0.213 - 0.108}{0.03181} = 3.30$$

The observed difference is more than 3 standard deviations away from zero.


The P-value is:

$$2P(z \ge 3.30) = 2(1 - 0.9995) = 2(0.0005) = 0.001$$

Conclusion: 21% of women are label users versus only 11% of men; the difference is statistically significant.
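The whole test (pooled estimate, standard error, z statistic, two-sided P-value) can be sketched in Python using only the standard library. The function names are illustrative assumptions, and the Normal distribution function is built from `math.erf` instead of being read from a table, so the P-value may differ slightly from the table-based 0.001.

```python
import math

def normal_cdf(z):
    """Standard Normal cumulative distribution function."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

def two_proportion_z_test(x1, n1, x2, n2):
    """Two-sided z test of H0: P1 = P2 using the pooled estimate."""
    p1_hat = x1 / n1
    p2_hat = x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    # Standard error of D under H0 (common proportion estimated by pooling).
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1_hat - p2_hat) / se
    p_value = 2 * (1 - normal_cdf(abs(z)))
    return pooled, se, z, p_value

# Women: 63 label users out of 296; men: 27 out of 251.
pooled, se, z, p_value = two_proportion_z_test(63, 296, 27, 251)
print(f"pooled = {pooled:.4f}, SE = {se:.5f}, z = {z:.2f}, P-value = {p_value:.4f}")
```

The code uses unrounded sample proportions, so its z statistic can differ from the hand calculation in the second decimal place; the conclusion is unaffected.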

Simple Regression

Simple regression analysis is a statistical tool that gives us the ability to estimate the mathematical relationship between a dependent variable (usually called y) and an independent variable (usually called x).

The dependent variable is the variable for which we want to make a prediction.

While various non-linear forms may be used, simple linear regression models are the most common.

Introduction

• The primary goal of quantitative analysis is to use current information about a phenomenon to predict its future behavior.

• Current information is usually in the form of a set of data.

• In a simple case, when the data form a set of pairs of numbers, we may interpret them as representing the observed values of an independent (or predictor) variable X and a dependent (or response) variable Y.

Lot size   Man-hours
30         73
20         50
60         128
80         170
40         87
50         108
60         135
30         69
70         148
60         132

Introduction

The goal of the analyst who studies the data is to find a functional relation

$$y = f(x)$$

between the response variable y and the predictor variable x.

[Figure: Statistical relation between Lot size and Man-Hour. Horizontal axis: Lot size; vertical axis: Man-Hour.]

Regression Function

The statement that the relation between X and Y is statistical should be interpreted as providing the following guidelines:

1. Regard Y as a random variable.

2. For each X, take f(x) to be the expected value (i.e., mean value) of Y.

3. Given that E(Y) denotes the expected value of Y, call the equation

$$E(Y) = f(x)$$

the regression function.

Historical Origin of Regression

Regression Analysis was first developed by Sir Francis Galton, who studied the relation between heights of sons and fathers.

Heights of sons of both tall and short fathers appeared to “revert” or “regress” to the mean of the group.

Basic Assumptions of a Regression Model

A regression model is based on the following assumptions:

1. There is a probability distribution of Y for each level of X.

2. Given that $\mu_Y$ is the mean value of Y, the standard form of the model is

$$Y = f(x) + \varepsilon$$

where $\varepsilon$ is a random variable with a normal distribution.

Statistical Relation between Lot Size and Number of Man-Hours: Westwood Company Example

[Figure: Statistical relation between Lot size and number of Man-Hours.]

Pictorial Presentation of Linear Regression Model

Construction of Regression Models

• Selection of independent variables

• Functional form of regression relation

• Scope of model

Uses of Regression Analysis

Regression analysis serves three major purposes.

1. Description

2. Control

3. Prediction

The several purposes of regression analysis frequently overlap in practice.

Formal Statement of the Model

General regression model

$$Y = \beta_0 + \beta_1 X + \varepsilon$$

1. $\beta_0$ and $\beta_1$ are parameters

2. X is a known constant

3. Deviations $\varepsilon$ are independent $N(0, \sigma^2)$

Meaning of Regression Coefficients

The values of the regression parameters $\beta_0$ and $\beta_1$ are not known. We estimate them from data.

$\beta_1$ indicates the change in the mean response per unit increase in X.

Regression Line

If the scatter plot of our sample data suggests a linear relationship between two variables, i.e.

$$y = \beta_0 + \beta_1 x$$

we can summarize the relationship by drawing a straight line on the plot.

The least squares method gives us the "best" estimated line for our set of sample data.

Regression Line

We will write an estimated regression line based on sample data as

$$\hat{y} = b_0 + b_1 x$$

The method of least squares chooses the values for b0 and b1 to minimize the sum of squared errors

$$SSE = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = \sum_{i=1}^{n} (y_i - b_0 - b_1 x_i)^2$$
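To make the minimization criterion concrete, the sketch below evaluates the sum of squared errors for two candidate lines on the lot size and man-hours data listed earlier in these notes. The intercept 10 and slope 2 are what the least squares formulas give for these data; the second line (12, 1.9) is an arbitrary comparison chosen only for illustration.

```python
# Lot size (x) and man-hours (y) from the table earlier in the notes.
lot_size = [30, 20, 60, 80, 40, 50, 60, 30, 70, 60]
man_hours = [73, 50, 128, 170, 87, 108, 135, 69, 148, 132]

def sse(b0, b1, xs, ys):
    """Sum of squared errors of the line y = b0 + b1 * x."""
    return sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))

sse_least_squares = sse(10, 2, lot_size, man_hours)
sse_other_line = sse(12, 1.9, lot_size, man_hours)  # arbitrary comparison line

print(f"SSE for the least squares line: {sse_least_squares}")
print(f"SSE for the comparison line:    {sse_other_line:.1f}")
```

Any line other than the least squares line gives a larger SSE on the same data, which is what "best" means in this context.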

Regression Line

Using calculus, we obtain estimating formulas:

$$b_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}$$

$$b_0 = \bar{y} - b_1 \bar{x}$$
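A minimal Python sketch of these estimating formulas follows, applied to the lot size and man-hours data given earlier in the notes. The function name `least_squares` is an illustrative choice, not something defined in the lecture.

```python
def least_squares(xs, ys):
    """Return (b0, b1) for the least squares line y-hat = b0 + b1 * x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Numerator and denominator of the slope formula.
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    b1 = s_xy / s_xx
    b0 = y_bar - b1 * x_bar
    return b0, b1

# Lot size (x) and man-hours (y) from the table earlier in the notes.
lot_size = [30, 20, 60, 80, 40, 50, 60, 30, 70, 60]
man_hours = [73, 50, 128, 170, 87, 108, 135, 69, 148, 132]

b0, b1 = least_squares(lot_size, man_hours)
print(f"b0 = {b0:.1f}, b1 = {b1:.1f}")
```

For these data the fitted line has intercept 10 and slope 2, so each additional unit of lot size is associated with about 2 more man-hours on average.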

Estimation of Mean Response

The fitted regression line can be used to estimate the mean value of y for a given value of x.

Example

The weekly advertising expenditure (x) and weekly sales (y) are presented in the following table.

y      x
1250   41
1380   54
1425   63
1425   54
1450   48
1300   46
1400   62
1510   61
1575   64
1650   71

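Point Estimation of Mean Response

From the previous table we have:

$$n = 10, \quad \sum x = 564, \quad \sum x^2 = 32604, \quad \sum y = 14365, \quad \sum xy = 818755$$

The least squares estimates of the regression coefficients are:

$$b_1 = \frac{n\sum xy - \sum x \sum y}{n\sum x^2 - (\sum x)^2} = \frac{10(818755) - (564)(14365)}{10(32604) - (564)^2} = 10.8$$

$$b_0 = \bar{y} - b_1 \bar{x} = 1436.5 - 10.8(56.4) = 828$$

(The value 828 is obtained with the unrounded slope, $b_1 \approx 10.787$.)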
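The computational form of the slope formula can be checked directly from the summary sums. The Python sketch below uses only the totals quoted for this example; the variable names are illustrative.

```python
# Summary quantities for the advertising example.
n = 10
sum_x = 564        # total weekly advertising expenditure
sum_y = 14365      # total weekly sales
sum_x2 = 32604     # sum of squared expenditures
sum_xy = 818755    # sum of expenditure * sales

# Slope and intercept from the computational formulas.
b1 = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
b0 = sum_y / n - b1 * sum_x / n

print(f"b1 = {b1:.4f}, b0 = {b0:.2f}")
```

Carrying full precision gives a slope of about 10.79 and an intercept of about 828.1, which round to the 10.8 and 828 used in the estimated regression function.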


The estimated regression function is:

$$\hat{y} = 828 + 10.8x$$

that is, Sales = 828 + 10.8 Expenditure.

This means that if the weekly advertising expenditure is increased by $1, we would expect the weekly sales to increase by $10.80.


Fitted values for the sample data are obtained by substituting the x value into the estimated regression function.

For example, if the advertising expenditure is $50, then the estimated sales is:

$$\text{Sales} = 828 + 10.8(50) = 1368$$

This is called the point estimate of the mean response (sales).

Residual

The difference between the observed value $y_i$ and the corresponding fitted value $\hat{y}_i$:

$$e_i = y_i - \hat{y}_i$$

Residuals are highly useful for studying whether a given regression model is appropriate for the data at hand.

Example: Weekly Advertising Expenditure

y      x    y-hat    Residual (e)
1250   41   1270.8   -20.8
1380   54   1411.2   -31.2
1425   63   1508.4   -83.4
1425   54   1411.2   13.8
1450   48   1346.4   103.6
1300   46   1324.8   -24.8
1400   62   1497.6   -97.6
1510   61   1486.8   23.2
1575   64   1519.2   55.8
1650   71   1594.8   55.2
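The fitted values and residuals for the advertising example can be reproduced with the short Python sketch below. It assumes the rounded coefficients 828 and 10.8 from the estimated regression function; the variable and function names are illustrative.

```python
# Weekly sales (y) and weekly advertising expenditure (x) from the example.
sales = [1250, 1380, 1425, 1425, 1450, 1300, 1400, 1510, 1575, 1650]
ad_spend = [41, 54, 63, 54, 48, 46, 62, 61, 64, 71]

# Rounded least squares coefficients used in the estimated regression function.
B0, B1 = 828, 10.8

def fitted_sales(x):
    """Point estimate of mean sales at advertising expenditure x."""
    return B0 + B1 * x

residuals = [y - fitted_sales(x) for x, y in zip(ad_spend, sales)]

for x, y, e in zip(ad_spend, sales, residuals):
    print(f"{y:6d} {x:4d} {fitted_sales(x):8.1f} {e:8.1f}")

print(f"sum of residuals = {sum(residuals):.1f}")
```

With unrounded least squares coefficients the residuals would sum to zero; here the sum is about -6.2 because the slope and intercept were rounded.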
