Business Statistics for Managerial Decision
Farideh Dehkordi-Vakil
Comparing Two Proportions
We often want to compare the proportions of two groups (such as men and women) that share some characteristic.
We call the two groups being compared Population 1 and Population 2.
The two population proportions of “successes” are P1 and P2.
The data consist of two independent simple random samples (SRS). The sample sizes are n1 from Population 1 and n2 from Population 2.
The proportion of successes in each sample estimates the corresponding population proportion.
Here is the notation we will use:

Population   Population proportion   Sample size   Count of successes   Sample proportion
1            P1                      n1            X1                   $\hat{p}_1 = X_1/n_1$
2            P2                      n2            X2                   $\hat{p}_2 = X_2/n_2$
Sampling Distribution of $\hat{p}_1 - \hat{p}_2$
Choose independent SRS of sizes n1 and n2 from two populations with proportions P1 and P2 of successes.
Let $D = \hat{p}_1 - \hat{p}_2$ be the difference between the two sample proportions of successes.
Then, as both sample sizes increase, the sampling distribution of D becomes approximately Normal. The mean of the sampling distribution is $P_1 - P_2$. The standard deviation of the sampling distribution is
$$\sigma_D = \sqrt{\frac{P_1(1-P_1)}{n_1} + \frac{P_2(1-P_2)}{n_2}}$$
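As a rough illustration of the mean and standard deviation just stated, the following sketch evaluates them for given population proportions and sample sizes. The function name and the numbers passed to it are hypothetical, chosen only to show the arithmetic.

```python
import math

def diff_sampling_distribution(p1, p2, n1, n2):
    """Return (mean, standard deviation) of D = p1_hat - p2_hat.

    p1, p2 are the population proportions; n1, n2 are the sample sizes.
    """
    mean = p1 - p2
    sd = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    return mean, sd

# Hypothetical inputs, not taken from the slides.
mean_d, sd_d = diff_sampling_distribution(0.5, 0.4, 100, 100)
print(f"mean = {mean_d:.2f}, sd = {sd_d:.2f}")
```

In practice P1 and P2 are unknown, which is why the later slides replace them with estimates.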
The sampling distribution of the difference of two sample proportions, $\hat{p}_1 - \hat{p}_2$, is approximately Normal.
The mean and standard deviation are found from the two population proportions of successes, P1 and P2.
Confidence Interval
Just as in the case of estimating a single proportion, a small modification of the sample proportions greatly improves the accuracy of confidence intervals.
The Wilson estimates of the two population proportions are
$$\tilde{p}_1 = \frac{X_1 + 1}{n_1 + 2}, \qquad \tilde{p}_2 = \frac{X_2 + 1}{n_2 + 2}$$
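The Wilson adjustment amounts to adding one success and one failure to each sample before taking the proportion. A minimal sketch, with a hypothetical helper name and made-up counts:

```python
def wilson_estimate(successes, n):
    """Wilson estimate of a proportion: (X + 1) / (n + 2)."""
    return (successes + 1) / (n + 2)

# Hypothetical counts: 9 successes out of 18 observations.
p_tilde = wilson_estimate(9, 18)
p_hat = 9 / 18
print(p_hat, p_tilde)
```

The estimate pulls the sample proportion slightly toward 0.5; when the sample proportion is already 0.5, as here, it is left unchanged.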
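The standard deviation of $\tilde{D} = \tilde{p}_1 - \tilde{p}_2$ is approximately
$$\sigma_{\tilde{D}} \approx \sqrt{\frac{P_1(1-P_1)}{n_1+2} + \frac{P_2(1-P_2)}{n_2+2}}$$
To obtain a confidence interval for P1 − P2, we replace the unknown parameters in the standard deviation by estimates to obtain an estimated standard deviation, or standard error:
$$SE_{\tilde{D}} = \sqrt{\frac{\tilde{p}_1(1-\tilde{p}_1)}{n_1+2} + \frac{\tilde{p}_2(1-\tilde{p}_2)}{n_2+2}}$$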
Confidence Interval for Comparing Two Proportions
Example: “No Sweat” Garment Labels
Following complaints about the working conditions in some apparel factories both in the United States and abroad, a joint government and industry commission recommended in 1998 that companies that monitor and enforce proper standards be allowed to display a “No Sweat” label on their products. A survey of U.S. residents aged 18 or older asked a series of questions about how likely they would be to purchase a garment under various conditions.
For some conditions, it was stated that the garment had a “No Sweat” label; for others, there was no mention of such a label. On the basis of the responses, each person was classified as a “label user” or a “label nonuser.” About 16.5% of those surveyed were label users. One purpose of the study was to describe the demographic characteristics of users and nonusers.
The study suggested that there is a gender difference in the proportion of label users. Here is a summary of the data. Let X denote the number of label users.

Population   n     X    $\hat{p} = X/n$   $\tilde{p} = (X+1)/(n+2)$
1 (women)    296   63   0.213             0.215
2 (men)      251   27   0.108             0.111
First calculate the standard error of the observed difference:
$$SE_{\tilde{D}} = \sqrt{\frac{\tilde{p}_1(1-\tilde{p}_1)}{n_1+2} + \frac{\tilde{p}_2(1-\tilde{p}_2)}{n_2+2}} = \sqrt{\frac{(0.215)(0.785)}{296+2} + \frac{(0.111)(0.889)}{251+2}} = 0.0309$$
The 95% confidence interval is
$$(\tilde{p}_1 - \tilde{p}_2) \pm z^{*} SE_{\tilde{D}} = (0.215 - 0.111) \pm (1.96)(0.0309) = 0.104 \pm 0.060 = (0.04, 0.16)$$
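The interval calculation for the garment-label data can be reproduced with a short script. This is a sketch under the formulas given on these slides; the function name is hypothetical, and because it works with unrounded Wilson estimates, the last digit of the standard error can differ slightly from a hand calculation with rounded inputs.

```python
import math

def wilson_diff_ci(x1, n1, x2, n2, z_star=1.96):
    """Confidence interval for P1 - P2 based on Wilson estimates.

    Returns (difference, standard error, (lower, upper)).
    z_star = 1.96 corresponds to 95% confidence.
    """
    p1 = (x1 + 1) / (n1 + 2)
    p2 = (x2 + 1) / (n2 + 2)
    se = math.sqrt(p1 * (1 - p1) / (n1 + 2) + p2 * (1 - p2) / (n2 + 2))
    diff = p1 - p2
    margin = z_star * se
    return diff, se, (diff - margin, diff + margin)

# Counts from the survey summary: 63 of 296 women, 27 of 251 men.
diff, se, (low, high) = wilson_diff_ci(63, 296, 27, 251)
print(f"difference = {diff:.3f}, SE = {se:.4f}, 95% CI = ({low:.2f}, {high:.2f})")
```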
With 95% confidence we can say that the difference in the proportions is between 0.04 and 0.16. Alternatively, we can report that the proportion of label users among women is about 10 percentage points higher than among men, with a 95% margin of error of 6 percentage points.
In this example we chose women to be the first population. Had we chosen men as the first population, the estimate of the difference would be negative (-0.104).
Because it is easier to discuss positive numbers, we generally choose the first population to be the one with the higher proportion.
The choice does not affect the substance of the analysis.
Significance Tests
It is sometimes useful to test the null hypothesis that the two population proportions are the same.
We standardize $D = \hat{p}_1 - \hat{p}_2$ by subtracting its mean P1 − P2 and then dividing by its standard deviation
$$\sigma_D = \sqrt{\frac{P_1(1-P_1)}{n_1} + \frac{P_2(1-P_2)}{n_2}}$$
If n1 and n2 are large, the standardized difference is approximately N(0, 1).
To estimate $\sigma_D$ we take into account the null hypothesis that P1 = P2.
If these two proportions are equal, we can view all of the data as coming from a single population.
Let P denote the common value of P1 and P2. The standard deviation of $D = \hat{p}_1 - \hat{p}_2$ is then
$$\sigma_{Dp} = \sqrt{\frac{P(1-P)}{n_1} + \frac{P(1-P)}{n_2}} = \sqrt{P(1-P)\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}$$
We estimate the common value of P by the overall proportion of successes in the two samples:
$$\hat{p} = \frac{\text{number of successes in both samples}}{\text{number of observations in both samples}} = \frac{X_1 + X_2}{n_1 + n_2}$$
This estimate of P is called the pooled estimate. To estimate the standard deviation of D, substitute $\hat{p}$ for P in the expression for $\sigma_{Dp}$. The result is a standard error for D under the condition that the null hypothesis H0: P1 = P2 is true. The test statistic uses this standard error to standardize the difference between the two sample proportions.
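The pooled procedure described above can be sketched as a small function. The function name and the counts used to call it are hypothetical; only the formulas come from the slides.

```python
import math

def pooled_two_proportion_z(x1, n1, x2, n2):
    """Pooled estimate, pooled standard error, and z statistic for H0: P1 = P2."""
    p1_hat = x1 / n1
    p2_hat = x2 / n2
    # Pooled estimate: all successes over all observations.
    p_pooled = (x1 + x2) / (n1 + n2)
    # Standard error of D when the two proportions share the value p_pooled.
    se_pooled = math.sqrt(p_pooled * (1 - p_pooled) * (1 / n1 + 1 / n2))
    z = (p1_hat - p2_hat) / se_pooled
    return p_pooled, se_pooled, z

# Hypothetical counts: 60 of 100 successes versus 40 of 100.
p_pooled, se_pooled, z = pooled_two_proportion_z(60, 100, 40, 100)
print(f"pooled = {p_pooled:.3f}, SE = {se_pooled:.4f}, z = {z:.2f}")
```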
Significance Tests for Comparing Two Proportions
Example: Men, Women, and Garment Labels
The previous example presented the survey data on whether consumers are “label users” who pay attention to label details when buying a shirt. Are men and women equally likely to be label users?
Here is the data summary:

Population   n     X    $\hat{p} = X/n$
1 (women)    296   63   0.213
2 (men)      251   27   0.108
We compare the proportions of label users in the two populations (women and men) by testing the hypotheses
H0: P1 = P2
Ha: P1 ≠ P2
The pooled estimate of the common value of P is:
$$\hat{p} = \frac{63 + 27}{296 + 251} = \frac{90}{547} = 0.1645$$
This is the proportion of label users in the entire sample.
The test statistic is calculated as follows:
$$SE_{Dp} = \sqrt{(0.1645)(0.8355)\left(\frac{1}{296} + \frac{1}{251}\right)} = 0.03181$$
$$z = \frac{\hat{p}_1 - \hat{p}_2}{SE_{Dp}} = \frac{0.213 - 0.108}{0.03181} = 3.30$$
The observed difference is more than 3 standard deviations away from zero.
The P-value is:
$$2P(z \ge 3.30) = 2(1 - 0.9995) = 2(0.0005) = 0.001$$
Conclusion: 21% of women are label users versus only 11% of men; the difference is statistically significant.
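For readers who want to check the test numerically, here is a sketch that runs the pooled z test on the survey counts and obtains the two-sided P-value from the standard Normal tail. It uses the identity 2(1 − Φ(|z|)) = erfc(|z|/√2) with Python's `math.erfc` instead of a Normal table; variable names are illustrative, and because unrounded proportions are used, z comes out marginally different from the hand-calculated 3.30.

```python
import math

# Survey counts: 63 of 296 women and 27 of 251 men are label users.
x1, n1 = 63, 296
x2, n2 = 27, 251

p_pooled = (x1 + x2) / (n1 + n2)
se_pooled = math.sqrt(p_pooled * (1 - p_pooled) * (1 / n1 + 1 / n2))
z = (x1 / n1 - x2 / n2) / se_pooled

# Two-sided P-value for a standard Normal test statistic.
p_value = math.erfc(abs(z) / math.sqrt(2))

print(f"pooled = {p_pooled:.4f}, SE = {se_pooled:.5f}, z = {z:.2f}, P = {p_value:.4f}")
```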
Simple Regression
Simple regression analysis is a statistical tool that gives us the ability to estimate the mathematical relationship between a dependent variable (usually called y) and an independent variable (usually called x).
The dependent variable is the variable for which we want to make a prediction.
While various non-linear forms may be used, simple linear regression models are the most common.
Introduction
• The primary goal of quantitative analysis is to use current information about a phenomenon to predict its future behavior.
• Current information is usually in the form of a set of data.
• In a simple case, when the data form a set of pairs of numbers, we may interpret them as representing the observed values of an independent (or predictor) variable X and a dependent (or response) variable Y.

Lot size   Man-hours
30         73
20         50
60         128
80         170
40         87
50         108
60         135
30         69
70         148
60         132
The goal of the analyst who studies the data is to find a functional relation
$$y = f(x)$$
between the response variable y and the predictor variable x.
[Figure: Statistical relation between Lot size and Man-Hour. Scatter plot with Lot size (0 to 90) on the horizontal axis and Man-Hour (0 to 180) on the vertical axis.]
Regression Function
The statement that the relation between X and Y is statistical should be interpreted as providing the following guidelines:
1. Regard Y as a random variable.
2. For each X, take f(x) to be the expected value (i.e., mean value) of y.
3. Given that E(Y) denotes the expected value of Y, call the equation
$$E(Y) = f(x)$$
the regression function.
Historical Origin of Regression
Regression Analysis was first developed by Sir Francis Galton, who studied the relation between heights of sons and fathers.
Heights of sons of both tall and short fathers appeared to “revert” or “regress” to the mean of the group.
Basic Assumptions of a Regression Model
A regression model is based on the following assumptions:
1. There is a probability distribution of Y for each level of X.
2. Given that $\mu_y = f(x)$ is the mean value of Y, the standard form of the model is
$$Y = f(x) + \varepsilon$$
where $\varepsilon$ is a random variable with a normal distribution.
Statistical relation between Lot Size and number of Man-Hours: Westwood Company Example
[Figure: Statistical relation between Lot size and number of Man-Hours. Scatter plot; horizontal axis 0 to 90, vertical axis 0 to 180.]
Pictorial Presentation of Linear Regression Model
Construction of Regression Models
• Selection of independent variables
• Functional form of regression relation
• Scope of model
Uses of Regression Analysis
Regression analysis serves three major purposes:
1. Description
2. Control
3. Prediction
The several purposes of regression analysis frequently overlap in practice.
Formal Statement of the Model
General regression model
$$Y = \beta_0 + \beta_1 X + \varepsilon$$
1. $\beta_0$ and $\beta_1$ are parameters
2. X is a known constant
3. Deviations $\varepsilon$ are independent $N(0, \sigma^2)$
Meaning of Regression Coefficients
The values of the regression parameters $\beta_0$ and $\beta_1$ are not known. We estimate them from data.
$\beta_1$ indicates the change in the mean response per unit increase in X.
Regression Line
If the scatter plot of our sample data suggests a linear relationship between two variables, i.e.
$$y = \beta_0 + \beta_1 x$$
we can summarize the relationship by drawing a straight line on the plot.
The least squares method gives us the “best” estimated line for our set of sample data.
We will write an estimated regression line based on sample data as
$$\hat{y} = b_0 + b_1 x$$
The method of least squares chooses the values for $b_0$ and $b_1$ to minimize the sum of squared errors
$$SSE = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = \sum_{i=1}^{n} (y_i - b_0 - b_1 x_i)^2$$
Using calculus, we obtain estimating formulas:
$$b_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}$$
$$b_0 = \bar{y} - b_1 \bar{x}$$
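The estimating formulas translate directly into code. The sketch below is one plain-Python way to compute them; the function name and the toy data (points lying exactly on y = 1 + 2x) are assumptions for illustration only.

```python
def least_squares_line(xs, ys):
    """Return (b0, b1) for the least squares line y_hat = b0 + b1 * x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Numerator and denominator of the slope formula.
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    return b0, b1

# Hypothetical data on an exact line, so the fit should recover it.
b0, b1 = least_squares_line([1, 2, 3, 4], [3, 5, 7, 9])
print(b0, b1)
```

Because the toy points have no scatter, the fitted intercept and slope coincide with the line that generated them; real data will leave nonzero residuals.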
Estimation of Mean Response
The fitted regression line can be used to estimate the mean value of y for a given value of x.
Example
The weekly advertising expenditure (x) and weekly sales (y) are presented in the following table.

y      x
1250   41
1380   54
1425   63
1425   54
1450   48
1300   46
1400   62
1510   61
1575   64
1650   71
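Point Estimation of Mean Response
From the previous table we have:
$$n = 10, \quad \sum x = 564, \quad \sum x^2 = 32604, \quad \sum y = 14365, \quad \sum xy = 818755$$
The least squares estimates of the regression coefficients are:
$$b_1 = \frac{n\sum xy - \sum x \sum y}{n\sum x^2 - (\sum x)^2} = \frac{10(818755) - (564)(14365)}{10(32604) - (564)^2} = 10.8$$
$$b_0 = \bar{y} - b_1\bar{x} = 1436.5 - 10.8(56.4) \approx 828$$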
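The sums and coefficient estimates for the advertising example can be checked with a few lines of Python. This is only a verification sketch; the variable names are illustrative. It uses the unrounded slope when computing the intercept, which is what gives a value that rounds to 828.

```python
# Weekly advertising expenditure (x) and weekly sales (y) from the table.
expenditure = [41, 54, 63, 54, 48, 46, 62, 61, 64, 71]
sales = [1250, 1380, 1425, 1425, 1450, 1300, 1400, 1510, 1575, 1650]

n = len(expenditure)
sum_x = sum(expenditure)
sum_y = sum(sales)
sum_x2 = sum(x * x for x in expenditure)
sum_xy = sum(x * y for x, y in zip(expenditure, sales))

# Computational form of the least squares slope and intercept.
b1 = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
b0 = sum_y / n - b1 * sum_x / n

print(n, sum_x, sum_x2, sum_y, sum_xy)
print(f"b1 = {b1:.1f}, b0 = {b0:.0f}")
```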
The estimated regression function is:
$$\hat{y} = 828 + 10.8x$$
$$\widehat{\text{Sales}} = 828 + 10.8\,\text{Expenditure}$$
This means that if the weekly advertising expenditure is increased by $1, we would expect the weekly sales to increase by $10.8.
Fitted values for the sample data are obtained by substituting the x value into the estimated regression function.
For example, if the advertising expenditure is $50, then the estimated sales is:
$$\widehat{\text{Sales}} = 828 + 10.8(50) = 1368$$
This is called the point estimate of the mean response (sales).
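The point estimate is a single evaluation of the estimated regression function. A minimal sketch, with a hypothetical function name and the rounded coefficients from the slides:

```python
def predict_sales(expenditure, b0=828.0, b1=10.8):
    """Point estimate of mean weekly sales for a given advertising expenditure."""
    return b0 + b1 * expenditure

estimate = predict_sales(50)
print(f"{estimate:.0f}")
```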
Residual
The residual is the difference between the observed value $y_i$ and the corresponding fitted value $\hat{y}_i$:
$$e_i = y_i - \hat{y}_i$$
Residuals are highly useful for studying whether a given regression model is appropriate for the data at hand.
Example: weekly advertising expenditure

y      x    y-hat    Residual (e)
1250   41   1270.8   -20.8
1380   54   1411.2   -31.2
1425   63   1508.4   -83.4
1425   54   1411.2   13.8
1450   48   1346.4   103.6
1300   46   1324.8   -24.8
1400   62   1497.6   -97.6
1510   61   1486.8   23.2
1575   64   1519.2   55.8
1650   71   1594.8   55.2
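The fitted values and residuals in the table can be regenerated as follows. This sketch assumes the rounded coefficients 828 and 10.8 used on the slides; the variable names are illustrative.

```python
B0, B1 = 828.0, 10.8  # rounded least squares estimates from the slides

# (sales y, advertising expenditure x) pairs from the table.
data = [(1250, 41), (1380, 54), (1425, 63), (1425, 54), (1450, 48),
        (1300, 46), (1400, 62), (1510, 61), (1575, 64), (1650, 71)]

fitted = [B0 + B1 * x for _, x in data]
residuals = [y - y_hat for (y, _), y_hat in zip(data, fitted)]

for (y, x), y_hat, e in zip(data, fitted, residuals):
    print(f"{y:5d} {x:3d} {y_hat:8.1f} {e:7.1f}")
```

With the exact least squares coefficients the residuals would sum to zero; with the rounded coefficients used here they do not, so the sum is not a useful check in this form.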