Statistical Modeling - 1

US Army Logistics Management College

Part 1: Regression Analysis Estimating Relationships



Preparing to Use StatTools Pharmex Drug Stores

• StatTools is part of the DecisionTools Suite.

• Open both Excel and StatTools.

• Select StatTools + Data Set Manager.

• Select New.

• Highlight the portion of the spreadsheet that includes the data and select OK.

Pharmex.xls



Scatterplots: Graphing Relationships Pharmex Drug Stores

• Pharmex is a chain of drugstores that operates around the country.

• The company has collected data from 50 randomly selected metropolitan regions. In each region it has collected data on its promotional expenditures and sales over the past year.

• There are two variables, each of which is an index, not a dollar amount:

• Promote: Pharmex’s promotional expenditures as a percentage of those of the leading competitor.

• Sales: Pharmex’s sales as a percentage of those of the leading competitor.

• The company expects a positive relationship between the two variables, so that regions with relatively more promotional expenditures have relatively more sales. However, the nature of this relationship is not clear.

Pharmex.xls



Creating the Scatterplot Pharmex Drug Stores

• Select any data cell.

• Select StatTools + Summary Graphs + Scatterplot…

• The tricky part is to decide which variable should be on the horizontal axis. In regression analysis, we always put the explanatory variable on the horizontal axis and the response variable on the vertical axis.

• In this example the company believes that larger promotional expenditures “cause” larger sales, so select “Sales” as the Y variable (the vertical axis).

• Select “Promote” as the X variable (the horizontal axis).

Pharmex.xls



Interpretation Pharmex Drug Stores

• The scatterplot indicates that there is a positive relationship between Promote and Sales - the points tend to rise from bottom left to top right - but the relationship is not perfect.

• The correlation of 0.673 is shown automatically on the plot. The important things to note about the correlation are that it is positive and that its magnitude is moderately large.

• Causation - we can never make definitive statements about causation based on regression analysis alone. Regression identifies only a statistical relationship, not a causal relationship.
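The correlation that StatTools prints on the scatterplot can be computed by hand. The following is a minimal sketch, assuming the usual Pearson definition; the `promote` and `sales` lists are made-up index values, not the Pharmex data.

```python
import math

def pearson_r(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical index values for illustration only (NOT the Pharmex data).
promote = [77, 110, 110, 93, 90, 95, 100, 85]
sales = [85, 103, 102, 109, 85, 103, 110, 86]

r = pearson_r(promote, sales)
print(f"correlation = {r:.3f}")  # positive: points rise from bottom left to top right
```

A value near +1 or -1 indicates a nearly perfect linear relationship; a value near 0 indicates almost none.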



Simple Linear Regression Pharmex Drug Stores

• The Pharmex scatterplot hints at a linear relationship between Promote and Sales. We want to draw the “best fitting” straight line through the points to quantify that linear relationship.

• Since the relationship is not perfect, not all points lie exactly on the line. The fitted value for a point is the height of the line above the horizontal axis at that point’s Promote value. The residual is the vertical distance from the point to the line: the observed value minus the fitted value.

• We define the “best fitting” line through the points in the scatterplot to be the one with the smallest sum of squared residuals. This line is called the least squares line.

• We now want to find the least squares line for the Pharmex drugstore data, using Sales as the response variable and Promote as the explanatory variable.
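The least squares idea can be sketched in a few lines. The code below uses the standard closed-form solution for one explanatory variable; the `x` and `y` values are hypothetical, and the function names are illustrative rather than anything StatTools exposes.

```python
def least_squares_line(x, y):
    """Return (intercept, slope) of the line minimizing the sum of squared residuals."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

def sum_squared_residuals(x, y, intercept, slope):
    """Sum of squared (observed - fitted) values for a candidate line."""
    return sum((b - (intercept + slope * a)) ** 2 for a, b in zip(x, y))

# Hypothetical data for illustration only.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

b0, b1 = least_squares_line(x, y)
sse_best = sum_squared_residuals(x, y, b0, b1)
sse_other = sum_squared_residuals(x, y, b0, b1 + 0.1)  # any other line fits worse

print(f"intercept = {b0:.2f}, slope = {b1:.2f}")
print(f"SSE at least squares line = {sse_best:.4f}, at perturbed line = {sse_other:.4f}")
```

Changing the slope or intercept in either direction increases the sum of squared residuals, which is exactly what “best fitting” means here.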



Least Squares Line with StatTools Pharmex Drug Stores

• Select any data cell.

• From the Menu bar, select StatTools + Regression & Classification + Regression…

• Specify that “Sales” is the response (dependent) variable.

• Specify that “Promote” is the explanatory (independent) variable.

• Select the graph option “Residuals vs Fitted Values”.



Regression Output Table Pharmex Drug Stores

• The “Constant” and “Promote” coefficients in B18:C18 imply that the equation for the least squares line is:

Predicted Sales = 25.1264 + (0.7623 x Promote)



Least Squares Line Equation Pharmex Drug Stores

We can interpret this equation as follows:

• The slope 0.7623 indicates that the sales index tends to increase by about 0.76 for each unit increase in the promotional expenses index.

• The interpretation of the intercept (25.1264) is less important. It is literally the predicted sales index for a region that does no promotions.
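As a quick check on this interpretation, the fitted equation can be evaluated directly. This sketch only uses the two coefficients quoted above; the Promote values plugged in are arbitrary illustrations.

```python
INTERCEPT = 25.1264  # "Constant" coefficient from the regression output
SLOPE = 0.7623       # "Promote" coefficient from the regression output

def predicted_sales(promote):
    """Predicted Sales index for a given Promote index."""
    return INTERCEPT + SLOPE * promote

at_100 = predicted_sales(100)   # region that matches the competitor's promotion
at_101 = predicted_sales(101)   # one index point more promotion
at_zero = predicted_sales(0)    # region with no promotion (the intercept)

print(f"Promote = 100 -> predicted Sales = {at_100:.4f}")
print(f"One more unit of Promote adds {at_101 - at_100:.4f}")
print(f"Promote = 0 -> predicted Sales = {at_zero:.4f}")
```

The difference between any two predictions one Promote unit apart is always the slope, regardless of the starting value.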

The Scatterplot

• A useful graph in almost any regression analysis is a scatterplot of residuals (on the vertical axis) versus fitted values.

• We typically examine the scatterplot for striking patterns. A “good” fit not only has small residuals, but it has residuals scattered randomly around 0 with no apparent pattern. This is the case here.



The Scatterplot of Residuals vs Fitted Values



Multiple Regression Bendrix Automotive Parts Company

• The Bendrix Company manufactures various types of parts for automobiles.

• The factory manager wants to get a better understanding of overhead costs, including supervision, indirect labor, supplies, payroll taxes, overtime premiums, depreciation, and a number of miscellaneous items such as insurance, utilities, and janitorial and maintenance expenses.

• Some of the overhead costs are “fixed” in the sense they do not vary appreciably with the volume of work being done, whereas others are “variable” and do vary directly with the volume of work being done. It is not easy to draw a clear line between the fixed and variable overhead components.

• The Bendrix manager has tracked total overhead costs for 36 months.



Explanatory Variables Bendrix Automotive Parts Company

• The factory manager collected data on two variables he believes might be responsible for variations in overhead costs:

• MachHrs: the number of machine hours used during the month.

• ProdRuns: the number of separate production runs during the month. (Bendrix manufactures parts in fairly large batches called production runs, and there is downtime between runs.)

• Each observation (row) corresponds to a single month.

• We need to estimate and interpret the equation for Overhead when both explanatory variables, MachHrs and ProdRuns, are included in the regression equation. Because these are time series variables, we should also look out for relationships between these variables and the Month variable.

Bendrix.xls



Multiple Regression with StatTools Bendrix Automotive Parts Company

• Select StatTools + Regression & Classification + Regression…

• Check “Overhead” as the response (dependent) variable.

• Check “MachHrs” and “ProdRuns” as the explanatory (independent) variables.

• Select the Graph options in the dialog box as shown here.



Multiple Regression Output Table Bendrix Automotive Parts Company

• The coefficients in B18:B20 indicate that the estimated regression equation is

Predicted Overhead = 3997 + (43.54 x MachHrs) + (883.62 x ProdRuns)



Interpretation of Equation Bendrix Automotive Parts Company

• If the number of production runs is held constant, the overhead cost is expected to increase by $43.54 for each extra machine hour.

• If the number of machine hours is held constant, the overhead cost is expected to increase by $883.62 for each extra production run.

• $3997 is the fixed component of overhead.

• The slope terms involving MachHrs and ProdRuns are the variable components of overhead.
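The split into fixed and variable components can be made concrete by evaluating the equation for one month. This is a minimal sketch using the $43.54 and $883.62 slopes from the interpretation above; the 1,500 machine hours and 40 production runs are hypothetical inputs, not a row from Bendrix.xls.

```python
FIXED = 3997.0          # intercept: fixed component of overhead
PER_MACH_HR = 43.54     # extra overhead per machine hour, ProdRuns held constant
PER_PROD_RUN = 883.62   # extra overhead per production run, MachHrs held constant

def predicted_overhead(mach_hrs, prod_runs):
    """Return (fixed, variable, total) predicted overhead for one month."""
    variable = PER_MACH_HR * mach_hrs + PER_PROD_RUN * prod_runs
    return FIXED, variable, FIXED + variable

# Hypothetical month for illustration only.
fixed, variable, total = predicted_overhead(mach_hrs=1500, prod_runs=40)
print(f"fixed = {fixed:,.2f}, variable = {variable:,.2f}, total = {total:,.2f}")

# Holding MachHrs constant, one more production run adds exactly PER_PROD_RUN.
extra_run = predicted_overhead(1500, 41)[2] - total
print(f"one extra production run adds {extra_run:,.2f}")
```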



Equation Comparison Bendrix Automotive Parts Company

• It is interesting to compare this equation with the separate single-variable equations:

Predicted Overhead = 48,621 + 34.70 MachHrs

and

Predicted Overhead = 75,606 + 655.07 ProdRuns

The equation with both variables is:

Predicted Overhead = 3,997 + 43.54 MachHrs + 883.62 ProdRuns

• Note that both slope coefficients have increased, and the intercept is now lower than either intercept in the single-variable equations. It is difficult to guess the changes that more explanatory variables will cause, but it is likely that changes will occur.

• The reason is that when MachHrs is the only variable in the equation, we are not holding ProdRuns constant - we are ignoring it. In effect, the coefficient 34.70 of MachHrs reflects the effect of MachHrs together with the effect of the omitted ProdRuns on Overhead.

• When we include both variables, the coefficient 43.54 of MachHrs indicates the effect of MachHrs only, holding ProdRuns constant.

• Since the coefficients have different meanings, it is not surprising that we obtain different estimates.



Modeling Possibilities Fifth National Bank Gender-Discrimination Suit

• The Fifth National Bank of Springfield is facing a gender-discrimination suit. The charge is that its female employees receive substantially smaller salaries than its male employees.

• The bank’s employee database is listed in this file. Here is a partial list of the data.

Bank.xls



Variables Fifth National Bank Gender-Discrimination Suit

For each of the 208 employees, the variables in the data set are:

• EducLev: education level with categories 1 (high school grad), 2 (some college), 3 (bachelor’s degree), 4 (some graduate courses) & 5 (graduate degree)

• JobGrade: current job level, the possible levels being from 1-6 (6 is highest)

• YrHired: year employee was hired

• Salary: current annual salary in thousands of dollars

• YrBorn: year employee was born

• Gender: a categorical variable with values “Female” and “Male”

• YrsPrior: number of years of work experience at another bank prior to working at Fifth National

• PCJob: a dummy variable with value 1 if the employee’s current job is computer-related and value 0 otherwise

Do the data provide evidence that females are discriminated against in terms of salary?



Naïve Approach Fifth National Bank Gender-Discrimination Suit

• A naïve approach to the problem is to compare the average salaries of the males and females.

• The average of all salaries is $39,922, the average female salary is $37,210, and the average male salary is $45,505.

• The difference between the averages is statistically significant. The females are definitely earning less, but perhaps there is a reason.

• The question is whether the difference between the average salaries is still evident after taking other attributes into account. This is a perfect task for regression.



Dummy Variables Fifth National Bank Gender-Discrimination Suit

• Some potential explanatory variables are categorical and cannot be measured on a quantitative scale. However, we often need to use these variables because they are related to the response variable.

• The trick is to create dummy variables, also called indicator or 0-1 variables, that indicate the category a given observation is in.

• To create dummy variables we can use an IF statement, or we can use StatTools’ Dummy variable procedure, which is usually easier, particularly when there are multiple categories.

• Once the dummy variables are created, we can combine categories if we like by simply adding the corresponding dummy columns to get the dummy for the combined category.



Regression Analysis w/Dummy Variables Fifth National Bank Gender-Discrimination Suit

• In this example we create dummy variables for Gender and JobGrade. We also create another variable: YrsExper = 95 - YrHired (since this is 1995 data).

• We must follow two rules:

• We shouldn’t use any of the original categorical variables that the dummies are based on.

• We should use one less dummy than the number of categories for any categorical variable.

• Then we can run a regression analysis with Salary as the response variable, using any combination of numerical and dummy explanatory variables.
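Outside of StatTools, the same preparation can be sketched in plain Python. The records below are hypothetical (they are not rows from Bank.xls), and `add_dummies` is an illustrative helper, not a StatTools function. It applies the second rule by skipping a chosen reference category.

```python
# Hypothetical employee records for illustration only (NOT rows from Bank.xls).
employees = [
    {"Gender": "Female", "JobGrade": 1, "YrHired": 92},
    {"Gender": "Male", "JobGrade": 3, "YrHired": 81},
    {"Gender": "Female", "JobGrade": 2, "YrHired": 87},
    {"Gender": "Male", "JobGrade": 1, "YrHired": 90},
]

def add_dummies(rows, column, reference):
    """Add a 0-1 column for every category of `column` except `reference`."""
    categories = sorted({row[column] for row in rows})
    for row in rows:
        for category in categories:
            if category != reference:
                row[f"{column}_{category}"] = int(row[column] == category)

add_dummies(employees, "Gender", reference="Male")  # creates Gender_Female only
add_dummies(employees, "JobGrade", reference=1)     # creates JobGrade_2 and JobGrade_3

for row in employees:
    row["YrsExper"] = 95 - row["YrHired"]           # 1995 data, as in the text

for row in employees:
    print(row)
```

With two Gender categories there is one dummy, and with three job grades in this toy data there are two, so each categorical variable contributes one less dummy than it has categories.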



Creating Dummy Variables Gender Categorical Variable

To create a dummy variable called Female for Gender:

• Select any data cell.

• From the Menu bar, select StatTools + Data Utilities + Dummy…

• Select “Gender” as the variable.

• Select “Create One Dummy Variable for Each Distinct Category”.

• Answer “Yes” to warnings.

Repeat the procedure for JobGrade.



Regression Analysis Gender Only

• We first estimate a regression equation with Female as the only variable. The resulting equation is:

Predicted Salary = 45.505 - 8.296Female

• To interpret this equation, recall that Female has only two possible values, 0 and 1. If we substitute 1, the predicted salary is 37.209; if we substitute 0, the predicted salary is 45.505.

• These are the average salaries of females and males. The interpretation of the -8.2955 coefficient of the Female dummy variable is therefore straightforward: it is the difference between the average female salary and the average male salary.

• The above equation tells only part of the story, because it ignores all information except gender.
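That a regression on a single 0-1 dummy simply reproduces the two group averages can be verified on a toy example. The salaries below are hypothetical (in thousands of dollars), and the fit uses the standard closed-form least squares formulas rather than StatTools.

```python
def least_squares_line(x, y):
    """Return (intercept, slope) of the least squares line of y on x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

# Hypothetical data for illustration only: 3 females, 2 males.
female = [1, 1, 1, 0, 0]
salary = [36.0, 38.0, 37.0, 44.0, 46.0]

intercept, slope = least_squares_line(female, salary)

male_mean = sum(s for f, s in zip(female, salary) if f == 0) / female.count(0)
female_mean = sum(s for f, s in zip(female, salary) if f == 1) / female.count(1)

print(f"intercept = {intercept:.3f} (male average = {male_mean:.3f})")
print(f"intercept + slope = {intercept + slope:.3f} (female average = {female_mean:.3f})")
```

The intercept is the average for the reference category (males), and the dummy coefficient is the gap between the two averages.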



Regression Analysis Gender + YrsExper + YrsPrior

• We expand this equation by adding YrsExper and YrsPrior.

• The corresponding equation is:

Predicted Salary = 35.492 + 0.988YrsExper + 0.131YrsPrior - 8.080Female

• It is useful to write two separate equations, one for females:

Predicted Salary = 27.412 + 0.988YrsExper + 0.131YrsPrior

and one for males:

Predicted Salary = 35.492 + 0.988YrsExper + 0.131YrsPrior

• We interpret the coefficient -8.080 of the Female dummy variable as the average salary disadvantage for females relative to males after controlling for experience (YrsExper and YrsPrior). But there is still more of the story to tell.



Regression Analysis Gender + YrsExper + YrsPrior + JobGrade

• We next add job grade to the equation by including five of the six job grade dummies. Although any five can be used, we use Job_2 through Job_6.

• The estimated regression equation is now:

Predicted Salary = 30.230 + 0.408YrsExper + 0.149YrsPrior - 1.962Female + 2.575Job_2 + 6.295Job_3 + 10.475Job_4 + 16.011Job_5 + 27.647Job_6

• There are now two categorical variables involved, gender and job grade. However, we can still write a separate equation for any combination of categories by setting the dummies to the appropriate values.



Interpretation Gender + YrsExper + YrsPrior + JobGrade

• The equation for females at the fifth job grade is found by setting Female = 1, Job_5 = 1, and the other job dummies equal to 0:

Predicted Salary = 44.279 + 0.408YrsExper + 0.149YrsPrior

• The expected salary increase for one extra year of experience at Fifth National is $408; the expected salary increase for one extra year of experience with another bank is $149 (for either gender and any job grade).

• The coefficients of the job dummies indicate the average increase in salary an employee can expect relative to the reference (lowest) job grade.

• The key coefficient, the negative $1962 for females, indicates the average salary disadvantage for females relative to males, given that they have the same experience levels and are in the same job grade.

• The “penalty” is less than a fourth of the penalty we saw before. It appears that females might be getting paid less on average partly because they are in the lower job categories.
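Setting the dummies to the appropriate values is easy to mechanize. The sketch below encodes the estimated equation quoted above (salaries in thousands of dollars); the function and dictionary names are illustrative.

```python
# Job grade dummy coefficients; grade 1 is the reference category (coefficient 0).
JOB_COEF = {1: 0.0, 2: 2.575, 3: 6.295, 4: 10.475, 5: 16.011, 6: 27.647}

def predicted_salary(yrs_exper, yrs_prior, female, job_grade):
    """Predicted salary (in $1000s); `female` is the 0-1 dummy."""
    return (30.230 + 0.408 * yrs_exper + 0.149 * yrs_prior
            - 1.962 * female + JOB_COEF[job_grade])

# Intercept of the equation for females at the fifth job grade.
female_grade5_intercept = predicted_salary(0, 0, female=1, job_grade=5)
print(f"female, grade 5 intercept = {female_grade5_intercept:.3f}")

# Same experience and same job grade: the gap is the Female coefficient.
gap = predicted_salary(10, 2, 0, 3) - predicted_salary(10, 2, 1, 3)
print(f"male minus female, same profile = {gap:.3f}")

# The penalty relative to the earlier 8.080 estimate.
print(f"ratio to earlier penalty = {1.962 / 8.080:.3f}")
```

The ratio of roughly 0.24 is where the “less than a fourth” comparison comes from.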



Pivot Table Concentration of Females in Lower Paid Jobs

• We can use a pivot table to check whether females are disproportionately in the lower job categories (set JobGrade in the row area, Gender in the column area and the count (expressed as a percentage) of any variable in the data area).

• Clearly, females tend to be concentrated at the lower job grades.

• This helps explain why females get lower salaries on average, but doesn’t explain why females are at the lower job grades in the first place.

• We won’t be able to provide a thorough analysis of this issue.



Conclusion

• The main conclusion we can draw from the output is that there is still a plausible case to be made for discrimination against females, even after including information on all the variables in the database in the regression equation.



Interaction Terms Fifth National Bank Gender-Discrimination Suit

• An interaction variable algebraically is the product of two variables. Its effect is to allow the effect of one of the variables on Y to depend on the value of the other variable.

• The interaction term allows the slope of the regression line to differ between the two categories.

• Earlier we estimated an equation for Salary using the numerical explanatory variables YrsExper and YrsPrior and the dummy variable Female.

• If we drop the YrsPrior variable from the equation (for simplicity) and rerun the regression, we obtain the equation

Predicted Salary = 35.824 + 0.981YrsExper - 8.012Female

• The R² value for this equation is 49.1%. What is the effect of including an interaction variable between YrsExper and Female in this equation?



Solution with Interaction Terms Fifth National Bank Gender-Discrimination Suit

• We first need to form an interaction variable that is the product of YrsExper and Female.

• This can be done in two ways in Excel:

• Do it manually by introducing a new variable that contains the product of the two variables involved, or

• Use StatTools + Data Utilities + Interaction…

• With the latter approach, we must select Female and YrsExper as the variables.

• Once the interaction variable has been created, we include it in the regression equation in addition to the other variables.



Interpretation w/ Interaction Terms Fifth National Bank Gender-Discrimination Suit

• The estimated regression equation is

Predicted Salary = 30.430 + 1.528YrsExper + 4.098Female - 1.248YrsExper_Female

• The female equation is: Pred Salary = 34.528 + 0.280YrsExper

& the male equation is: Pred Salary = 30.430 + 1.528YrsExper

• Graphically - Nonparallel Female and Male Salary Lines
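The two nonparallel lines follow directly from the single interaction equation. The sketch below derives the female and male intercepts and slopes from the four estimated coefficients and, as an additional illustration not stated in the slides, computes the experience level at which the two lines cross.

```python
B0, B_EXPER, B_FEMALE, B_INTER = 30.430, 1.528, 4.098, -1.248

def predicted_salary(yrs_exper, female):
    """Predicted salary (in $1000s) with the YrsExper x Female interaction."""
    return (B0 + B_EXPER * yrs_exper + B_FEMALE * female
            + B_INTER * yrs_exper * female)

# Female = 1 shifts the intercept by B_FEMALE and the slope by B_INTER.
female_intercept = B0 + B_FEMALE
female_slope = B_EXPER + B_INTER
print(f"female line: {female_intercept:.3f} + {female_slope:.3f} YrsExper")
print(f"male line:   {B0:.3f} + {B_EXPER:.3f} YrsExper")

# The lines cross where B_FEMALE + B_INTER * YrsExper = 0.
crossover = -B_FEMALE / B_INTER
print(f"lines cross at about {crossover:.2f} years of experience")
```

Under this fitted equation the predicted female salary is higher only for the first three years or so; beyond the crossover the male line is above the female line and the gap widens by about 1.248 per year.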



Conclusion w/Interaction Terms Fifth National Bank Gender-Discrimination Suit

• The Y-intercept for the female line is slightly higher - females with no experience at Fifth National Bank tend to start out slightly higher than males - but the slope of the female line is much lower. That is, males tend to move up the salary ladder much more quickly than females.

• This provides another argument, although a somewhat different one, for gender discrimination against females.

• The R² value increased from 49.1% to 63.9%. The interaction variable has definitely added to the explanatory power of the equation.



Part 2: Regression Analysis Statistical Inference



Inference About Regression Coefficients Bendrix Automotive Parts Company

• As before, the response variable is Overhead and the explanatory variables are MachHrs and ProdRuns.

• What inferences can we make about the regression coefficients?

• We obtain the output using StatTools.

Bendrix1.xls



Multiple Regression Output Bendrix Automotive Parts Company

• Regression coefficients estimate the true, but unobservable, population coefficients.

• The standard error of each coefficient b_i indicates the accuracy of its point estimate.

• For example, the estimated effect on Overhead of a one-unit increase in MachHrs is 43.536. We are 95% confident that the true coefficient is between 36.234 and 50.839. Similar statements can be made for the coefficient of ProdRuns and the intercept term.

Predicted Overhead = 3997 + 43.54MachHrs + 883.62ProdRuns



A Test for the Overall Fit: The ANOVA Table Bendrix Automotive Parts Company

• Does the ANOVA table for the Bendrix manufacturing data indicate that the combination of MachHrs and ProdRuns has at least some ability to explain variation in Overhead?

• The F-ratio is “off the charts” and the p-value is practically 0.



Interpretation of the ANOVA Table Bendrix Automotive Parts Company

• This information wouldn’t be much comfort for the Bendrix manager who is trying to understand the causes of variation in overhead costs.

• This manager already knows that machine hours and production runs are related positively to overhead costs - everyone in the company knows that!

• What he really wants is a set of explanatory variables that yields a high R² and a low standard error of estimate, se.

• The low p-value in the ANOVA table does not guarantee these. All it guarantees is that MachHrs and ProdRuns are of “some help” in explaining variation in Overhead.



Violations of Regression Assumptions Bendrix Automotive Parts Company

• Is there evidence of nonconstant variance?

• Is there any evidence of lag 1 autocorrelation in the Bendrix data when Overhead is regressed on MachHrs and ProdRuns?

• Is there evidence of non-normality?



Do the Residuals Have Constant Variance? Bendrix Automotive Parts Company

• If the residual variance is not constant, the standard error of the regression coefficient, s(bi), is incorrect.

• Note: when we ran the regression we selected “Residuals vs Fitted Values” graphs.



Plot of Residuals vs Fitted Values Bendrix Automotive Parts Company

• Residuals appear to have equal variances (homoscedasticity).

[Figure: Scatterplot of Residual vs Fit. Horizontal axis: Fit (75,000 to 120,000); vertical axis: Residual (-12,000 to 8,000).]



Autocorrelated Residuals Bendrix Automotive Parts Company

• The residuals of time series data are often autocorrelated, and the most frequent type is positive autocorrelation. If residuals separated by 1 month are correlated, this is called lag 1 autocorrelation.

• We use the fitted values (column C) and residuals (column D) in the “Regression” tab. The residuals represent how much the regression over-predicts (if negative) or under-predicts (if positive) the overhead cost for that month.



Durbin-Watson Test Bendrix Automotive Parts Company

• We can check for lag 1 autocorrelation in two ways, with the Durbin-Watson(DW) statistic and by examining the time series graph of the residuals.

• The Durbin-Watson (DW) statistic is scaled between 0 and 4:

• DW near 2 - little lag 1 autocorrelation

• DW < 2 - positive autocorrelation

• DW > 2 - negative autocorrelation

• Rule of thumb: with about 30 observations (n = 30) and 1 to 5 explanatory variables, a DW value below 1.2 signals a problem.

• We calculate the DW statistic in cell E45 with the formula:

=StatDurbinWatson(D45:D80)

• Based on our guidelines, the DW value of 1.3131 suggests positive autocorrelation - it is less than 2 - but not enough to cause concern.
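For readers without StatTools, the statistic can be computed from the residual column directly. This is a minimal sketch using the textbook Durbin-Watson definition (sum of squared successive differences divided by the sum of squared residuals); it is an assumption that StatDurbinWatson implements the same formula. The residual lists are artificial patterns chosen to show the 0-to-4 scale, not the Bendrix residuals.

```python
def durbin_watson(residuals):
    """Durbin-Watson statistic for a sequence of residuals in time order."""
    numerator = sum((residuals[t] - residuals[t - 1]) ** 2
                    for t in range(1, len(residuals)))
    denominator = sum(e ** 2 for e in residuals)
    return numerator / denominator

# Artificial residual patterns for illustration only.
long_runs = [1, 1, 1, 1, -1, -1, -1, -1]     # positives follow positives
alternating = [1, -1, 1, -1, 1, -1, 1, -1]   # sign flips every period

dw_runs = durbin_watson(long_runs)
dw_alt = durbin_watson(alternating)

print(f"long runs:   DW = {dw_runs:.2f}  (well below 2: positive autocorrelation)")
print(f"alternating: DW = {dw_alt:.2f}  (well above 2: negative autocorrelation)")
```

Applied to the 36 Bendrix residuals in D45:D80, the same calculation would be expected to give the 1.3131 reported above, provided the formulas agree.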



Time Series Graph of Residuals Bendrix Automotive Parts Company

• This general conclusion is supported by the time series graph.

• Serious positive autocorrelation of lag 1 would tend to show long runs of residuals on one side of the horizontal axis, followed by long runs on the other side - positives would tend to follow positives and negatives would tend to follow negatives. There is some indication of this in the graph but not an excessive amount.

[Figure: Time Series of Residuals / Data Set #2. Horizontal axis: Observation #; vertical axis: residual (-12,000 to 8,000).]

To create the graph:

• Add the range A44:D80 as a data set.

• Select StatTools + Time Series & Forecasting + Time Series Graph.

• Select Residuals as the variable.



Are the Residuals Normally Distributed? Bendrix Automotive Parts Company

• The inferences we want to make assume the residuals are normally distributed.

• Using Data Set #2

• Select: StatTools + Normality Tests + Q-Q Normal Plot

• Select “Residuals” as the variable

• Check “Plot Using Standardized Q-Values” and “Include Reference Line”



Normal Probability Plot Bendrix Automotive Parts Company

• Error terms appear to be normally distributed.

[Figure: Q-Q Normal Plot of Residuals / Data Set #2. Horizontal axis: Z-Value; vertical axis: Standardized Q-Value; both run from -3.5 to 3.5.]



Multicollinearity Height vs Left & Right Feet

• The relationship between the explanatory variable X and the response variable Y is not always accurately reflected in the coefficient of X; it depends on which other X’s are included or not included in the equation (especially when there is a linear relationship between two or more explanatory variables, in which case we have multicollinearity).

• Multicollinearity is the presence of a fairly strong linear relationship between two or more explanatory variables, and it can make estimation difficult.

• We want to explain a person’s height by means of foot length. The response variable is Height, and the explanatory variables are Right and Left, the length of the right foot and the left foot, respectively.

• It is likely that there is a large correlation between height and foot size, so we would expect this regression equation to do a good job. The R² value will probably be large. But what about the coefficients of Right and Left?



Correlation of Left & Right Height vs Left & Right Feet

• To show what can happen numerically, we generated a hypothetical data set of heights and left and right foot lengths in this file.

• We did this so that, except for random error, height is approximately 32 plus 3.2 times foot length (in inches).

• The correlations between Height and either Right or Left in our data set are quite large, and the correlation between Right and Left is very close to 1.

Height.xls

• To compute the correlations, select StatTools + Summary Statistics + Correlation & Covariance.



Multiple Regression Height vs Left & Right Feet

• The regression output tells a somewhat confusing story.

• The multiple R and the corresponding R² are about what we would expect, given the correlations between Height and either Right or Left.

• In particular, the multiple R is close to the correlation between Height and either Right or Left. Also, the se value is quite good. It implies that predictions of height from this regression equation will typically be off by only about 2 inches.

• However, the coefficients of Right and Left are not at all what we might expect, given that we generated heights as approximately 32 plus 3.2 times foot length.

• In fact, the coefficient of Left has the wrong sign - it is negative!

• Besides this wrong sign, the tip-off that there is a problem is that the t-value of Left is quite small and the corresponding p-value is quite large.



Solution

• Judging by this, we might conclude that Height and Left are either not related or are related negatively. But we know from the table of correlations that both of these conclusions are false.

• In contrast, the coefficient of Right has the “correct” sign, and its t-value and associated p-value do imply statistical significance, at least at the 5% level.

• However, this happened mostly by chance; slight changes in the data could change the results completely.



Solution

• Although both Right and Left are clearly related to Height, it is impossible for the least squares method to distinguish their separate effects.

• Note that the sum of the coefficients is 3.178, which is close to the coefficient of 3.2 we used to generate the data. Therefore, the estimated equation will work well for predicting heights, but it does not provide reliable estimates of the individual coefficients of Right and Left.

• When Right is the only variable: Predicted Height = 31.546 + 3.195Right

• R² = 81.6%, se = 2.005, and for the coefficient of Right the t-value = 21.34 and the p-value = 0.000 - very significant.

• When Left is the only variable: Predicted Height = 31.526 + 3.197Left

• R² = 81.1%, se = 2.033, and for the coefficient of Left the t-value = 20.99 and the p-value = 0.000 - again very significant.

• Clearly, both of these equations tell almost identical stories, and they are much easier to interpret than the equation with both Right and Left included.
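The behavior described above can be reproduced with a small simulation. This is a sketch under stated assumptions, not a reconstruction of Height.xls: foot lengths, measurement noise, the sample size of 100, and the random seed are all hypothetical choices, and only the generating rule (height is roughly 32 plus 3.2 times foot length) comes from the text. The two-predictor fit solves the centered normal equations directly.

```python
import random

def fit_two_predictors(x1, x2, y):
    """Least squares fit of y = b0 + b1*x1 + b2*x2 via the centered normal equations."""
    n = len(y)
    m1, m2, my = sum(x1) / n, sum(x2) / n, sum(y) / n
    s11 = sum((a - m1) ** 2 for a in x1)
    s22 = sum((b - m2) ** 2 for b in x2)
    s12 = sum((a - m1) * (b - m2) for a, b in zip(x1, x2))
    s1y = sum((a - m1) * (c - my) for a, c in zip(x1, y))
    s2y = sum((b - m2) * (c - my) for b, c in zip(x2, y))
    det = s11 * s22 - s12 ** 2          # close to 0 when x1 and x2 are nearly collinear
    b1 = (s22 * s1y - s12 * s2y) / det
    b2 = (s11 * s2y - s12 * s1y) / det
    b0 = my - b1 * m1 - b2 * m2
    return b0, b1, b2

# Hypothetical simulated data (NOT the values in Height.xls).
random.seed(1)
foot = [random.gauss(11.0, 1.3) for _ in range(100)]
right = [f + random.gauss(0, 0.1) for f in foot]   # right and left differ only slightly
left = [f + random.gauss(0, 0.1) for f in foot]
height = [32 + 3.2 * f + random.gauss(0, 2.0) for f in foot]

b0, b_right, b_left = fit_two_predictors(right, left, height)
print(f"Right coefficient = {b_right:.3f}, Left coefficient = {b_left:.3f}")
print(f"sum of coefficients = {b_right + b_left:.3f}")
```

Rerunning with different seeds typically moves the individual coefficients around a great deal, sometimes flipping a sign, while their sum stays near 3.2. That is the sense in which least squares cannot separate the two effects.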



Stepwise Regression HyTex Catalogs

• HyTex is a direct marketer of stereo equipment, personal computers, and other electronic products. HyTex advertises entirely by mailing catalogs to its customers, and all of its orders are taken over the telephone.

• The company spends a great deal of money on its catalog mailings, and it wants to be sure that this is paying off in sales. Data on 250 customers who purchased mail-order products from the HyTex Company in 1998 are available.

• Stepwise regression will be used to produce a regression equation for the amount spent in 1998.



The Data HyTex Catalogs

For each customer there are data on the following variables:

• Age: (1 = 30 or younger, 2 = 31 to 55, 3 = 56 and older)

• Gender: (1 = male, 0 = female)

• OwnHome: (1 = customer owns home, 0 otherwise)

• Married: (1 = customer is currently married, 0 otherwise)

• Close: (1 = customer lives reasonably close to a shopping area that sells similar merchandise, 2 otherwise)

• Salary: combined annual salary of customer and spouse (if any)

• Children: number of children living with customer

• Customer97: (1 = customer purchased from HyTex during 1997, 0 otherwise)

• Spent97: total amount of purchases in 1997 from HyTex

• Catalogs: number of catalogs sent to the customer in 1998

• Spent98: total amount of purchases in 1998 from HyTex



Stepwise Regression

• Many statistical packages provide some assistance by including automatic equation-building options. These options estimate a series of regression equations by successively adding (or deleting) variables according to prescribed rules.

• Generically, these methods are referred to as stepwise regression.

• There are three types: forward, backward and stepwise.

• Forward - begins with no explanatory variables in the equation and successively adds one at a time until no explanatory variables make a significant contribution.

• Backward - begins with all potential explanatory variables in the equation and deletes them one at a time until further deletion would do more harm than good.

• Stepwise - much like a forward procedure, except that it also considers possible deletions along the way.



Stepwise Regression in StatTools HyTex Catalogs

• Select StatTools + Regression & Classification + Regression

• Select Regression Type: Stepwise.

• Specify Spent98 as the response variable and select all of the other variables (besides Customer) as potential explanatory variables.

• Choose p-values or F-values as the appropriate criterion.



Interpretation of Final Regression Equation

• The coefficient of Catalogs implies that about $42.00 more was spent for each additional catalog sent, holding the other variables constant.

• The coefficient of Married implies that married customers spent $330.44 more on average than otherwise similar unmarried customers.

• The coefficient of OwnHome implies that customers who own their home spent $206.28 more on average than otherwise similar customers who do not.

• The coefficients for Spent97 and Customer97 are somewhat more difficult to interpret. First, both are 0 for customers who didn’t purchase the previous year. For those who did, the terms become -1,117.95 + 0.93Spent97.



The Partial F Test Fifth National Bank Gender-Discrimination Suit

• The Fifth National Bank is facing a gender-discrimination suit charging that its female employees receive substantially smaller salaries than its male employees.

• Previously we ran several regressions for Salary to see whether there is convincing evidence of salary discrimination against females.

• Now, we will perform the following analysis:

• We will regress Salary on Gender_Female, Yrs_Exper, and Yrs_Exper*Gender_Female_1. This will be the reduced equation.

• Then we’ll see whether the variables JobGrade_2 through JobGrade_6 add anything significant to the reduced equation.

• Next, we’ll see whether the variables Gender_Female_1*JobGrade_2_1 through Gender_Female_1*JobGrade_6_1 add anything significant to what we already have.

• Finally, we’ll see whether the education dummies EducLev_2 through EducLev_5 (education level 1 is the reference category) add anything significant to what we already have.



First Solution Fifth National Bank Gender-Discrimination Suit

• First, note that we created all of the dummies and interaction variables with StatTools’ Data Utilities procedures.

• Also, note that we have used three sets of dummies, for gender, job grade and education level. When we use these in a regression equation, the dummy for one category of each should always be excluded; it is the reference category. The reference categories we have used are “male”, job grade 1 and education level 1.

• The “smallest” equation uses Gender_Female, Yrs_Exper, and Yrs_Exper*Gender_Female_1 as explanatory variables.

• We’re off to a good start. These three variables already explain 63.9% of the variation of Salary.

• The next equation adds the explanatory variables JobGrade_2 through JobGrade_6.



Second Solution Fifth National Bank Gender-Discrimination Suit

• This equation appears much better (R² increased to 81.1%). Check whether it is significantly better with the partial F test.

• Calculate the F-ratio. Given SSE_R = 9478.232, SSE_C = 4958.368, MSE_C = 24.916, and k – j = 8 – 3 = 5 (the number of extra variables), the F-ratio is 36.28.

• Calculate the corresponding p-value. Using Excel, the formula is: “=FDIST(x, dof1, dof2)” where x is the F-ratio from the partial F test (above), dof1 is the number of additional variables (k – j), and dof2 is the unexplained (residual) degrees of freedom of the complete equation, n – k – 1.

Since FDIST(36.28,5,199) is 0 to the displayed precision, there is no doubt the added variables contribute to the explanatory power of the equation.

$$F\text{-ratio} = \frac{(\mathrm{SSE}_R - \mathrm{SSE}_C)/(k - j)}{\mathrm{MSE}_C}$$
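The arithmetic on this slide can be checked with a few lines of Python. This is only a sketch, not StatTools output: the function name `partial_f_ratio` and its argument names are our own, and the sample size n = 208 is the employee count quoted on the outliers slide later in this part. It also derives MSE_C as SSE_C divided by the unexplained degrees of freedom, n – k – 1.

```python
def partial_f_ratio(sse_reduced, sse_complete, k, j, n):
    """Partial F-ratio for adding k - j variables to a reduced equation.

    k and j are the numbers of explanatory variables in the complete and
    reduced equations; n is the number of observations.
    """
    df_complete = n - k - 1                    # unexplained df, complete equation
    mse_complete = sse_complete / df_complete  # MSE_C
    f_ratio = ((sse_reduced - sse_complete) / (k - j)) / mse_complete
    return f_ratio, mse_complete, df_complete


# Second solution: job grade dummies added to the reduced equation.
f_ratio, mse_c, dof2 = partial_f_ratio(9478.232, 4958.368, k=8, j=3, n=208)
print(round(f_ratio, 2), round(mse_c, 3), dof2)
```

The same call with the SSE values and variable counts from the later slides reproduces their F-ratios.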



Third Solution Fifth National Bank Gender-Discrimination Suit

• This equation appears better (R² increased to 84%). Check whether it is significantly better with the partial F test.

• Calculate the F-ratio. Given SSE_R = 4958.368, SSE_C = 4206.345, MSE_C = 21.682, and k – j = 13 – 8 = 5, the F-ratio is 6.9368.


Since FDIST(6.9368,5,194) is 0 to the displayed precision, there is no doubt the added variables contribute to the explanatory power of the equation.

$$F\text{-ratio} = \frac{(\mathrm{SSE}_R - \mathrm{SSE}_C)/(k - j)}{\mathrm{MSE}_C}$$



Fourth Solution Fifth National Bank Gender-Discrimination Suit

• This equation seems very slightly better (R² increased to 84.7%). Check whether it is significantly better with the partial F test.

• Calculate the F-ratio. Given SSE_R = 4206.345, SSE_C = 4005.418, MSE_C = 21.081, and k – j = 17 – 13 = 4, the F-ratio is 2.383.


Since FDIST(2.383,4,190) = 0.0529, we cannot be 95% confident the added variables contribute to the explanatory power of the equation. We therefore choose not to include them in the model.

$$F\text{-ratio} = \frac{(\mathrm{SSE}_R - \mathrm{SSE}_C)/(k - j)}{\mathrm{MSE}_C}$$



Solution Fifth National Bank Gender-Discrimination Suit

• According to the partial F test, the variables added in the fourth equation do not improve the fit enough to qualify for statistical significance at the 5% level.

• Based on this evidence, there is not much to gain from including the education dummies in the equation, so we would probably elect to exclude them.

• As a result, the third solution is considered the complete solution.
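For readers without Excel at hand, the three FDIST p-values can be approximated in plain Python. The sketch below is an illustration under stated assumptions, not the method StatTools or Excel uses: it relies on the standard identity P(F > x) = I_u(dof2/2, dof1/2) with u = dof2 / (dof2 + dof1·x), where I is the regularized incomplete beta function, and evaluates the integral with Simpson's rule. The function name `f_upper_tail` and the labels in `tests` are our own.

```python
from math import exp, lgamma


def f_upper_tail(x, dof1, dof2, steps=4000):
    """Approximate P(F > x) for an F distribution with (dof1, dof2) df.

    Assumes dof1 >= 2 and dof2 >= 2 so the integrand has no singularity,
    and that steps is even (required by Simpson's rule).
    """
    a, b = dof2 / 2.0, dof1 / 2.0
    u = dof2 / (dof2 + dof1 * x)
    h = u / steps

    def density(t):
        # Unnormalized beta(a, b) density.
        return t ** (a - 1.0) * (1.0 - t) ** (b - 1.0)

    total = density(0.0) + density(u)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * density(i * h)
    integral = total * h / 3.0
    log_beta = lgamma(a) + lgamma(b) - lgamma(a + b)
    return integral / exp(log_beta)


# (F-ratio, dof1, dof2) as reported on the second, third and fourth solution slides.
tests = {
    "job grade dummies": (36.28, 5, 199),
    "gender x job grade": (6.9368, 5, 194),
    "education dummies": (2.383, 4, 190),
}
for label, (f_ratio, dof1, dof2) in tests.items():
    p_value = f_upper_tail(f_ratio, dof1, dof2)
    print(f"{label}: p-value = {p_value:.4f}, significant at 5%: {p_value < 0.05}")
```

Only the last block of variables fails the 5% test, which matches the decision to stop at the third equation.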



Concluding Comments Fifth National Bank Gender-Discrimination Suit

• The partial F test is the formal test of significance for an extra set of variables. Many users look only at the R² and/or s_e values to check whether extra variables are doing a “good job”.

• If the partial F test shows that a block of variables is significant, it does not imply that each variable in this block is significant. Some added variables can have low t-values.



Concluding Comments Fifth National Bank Gender-Discrimination Suit

• Producing all of these outputs and completing the partial F Test is a lot of work.

• StatTools includes a routine called “Block” that simplifies the process.

• Select StatTools + Regression & Classification + Regression

• Select Regression Type: Block.

• Choose 4 blocks and identify which additional variables enter each block.



Concluding Comments

• While we have concentrated on the partial F test and statistical significance in this example, don’t lose sight of the bigger picture. Once we have decided on a “final” regression equation we need to analyze its implications for the problem at hand.

• In this case the bank is interested in possible salary discrimination against females, so we should interpret this final equation in these terms. Don’t get so caught up in the details of statistical significance that you lose sight of the original purpose of the analysis!



Outliers Fifth National Bank Gender-Discrimination Suit

• Are there any obvious outliers from the 208 employees?

• In what sense are they outliers?

• Does it matter to the regression results, particularly those concerning gender discrimination, whether the outliers are removed?

• There are several places we could look for outliers. An obvious place is the Salary variable.

• The boxplot shown here shows that there are several employees making substantially more in salary than most of the employees.



Solution

• We could consider these outliers and remove them, arguing perhaps that these are senior managers who shouldn’t be included in the discrimination analysis. We leave it to you to check whether the regression results are any different with these high salary employees than without them.

• Another place to look is at the scatterplot of the residuals versus the fitted values. This type of plot shows points with abnormally large residuals. For example, we ran the Regression with Female, YrsExper, Fem-YrsExper and the five job grade dummies, and we obtained the output and scatterplot shown here.



Solution

• This scatterplot has several points that could be considered outliers, but we focus on the point identified in the figure.

• The residual for this point is approximately -21.

• Given the se for this regression is approximately 5, this residual is over four standard errors below 0 - quite a lot.

• On inspection, this employee turns out to be unusual, and special circumstances can explain the large residual.

• We delete this employee and rerun the regression with the same variables.
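The "over four standard errors" statement is a one-line calculation. The snippet below is a minimal sketch using the approximate values quoted on this slide; the cutoff of 3 standard errors is a common rule of thumb that we assume here for illustration, not a threshold stated in the slides.

```python
residual = -21.0   # approximate residual of the flagged employee (from the slide)
s_e = 5.0          # approximate standard error of estimate (from the slide)

standardized = residual / s_e      # residual measured in standard errors
flagged = abs(standardized) > 3    # hypothetical rule-of-thumb cutoff

print(standardized, flagged)
```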



Solution

• Recalling that gender discrimination is the key issue in this example, we compare the coefficients of Female and Fem_YrsExper in the two outputs.

• The coefficient of Female has dropped from 6.063 to 4.353. In words, the Y-intercept for the female regression line used to be about $6000 higher than for the male line; now it is only about $4350 higher.

• More importantly, the coefficient of Fem_YrsExper has changed from -1.021 to -0.721. This indicates how much less steep the female line for Salary versus Yrs_Exper is than the male line.

• A change from -1.021 to -0.721 indicates less discrimination against females now than before. This unusual female employee accounts for a good bit of the discrimination argument - although a strong argument still exists even without her.
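To see what the two coefficient changes mean together, the predicted salary gap between the female and male lines can be evaluated at several experience levels. This is a sketch under two assumptions: Salary is measured in $1000s (as the "$6000" reading of 6.063 suggests), and job grade is held fixed so that only the Female and Fem_YrsExper terms differ between the lines. The function name `female_minus_male` is our own.

```python
def female_minus_male(yrs_exper, b_female, b_interaction):
    """Predicted Salary gap (female minus male, in $1000s) at a given
    experience level, holding job grade fixed."""
    return b_female + b_interaction * yrs_exper


before = (6.063, -1.021)   # coefficients with the unusual employee included
after = (4.353, -0.721)    # coefficients after deleting her

for yrs in (0, 5, 10, 20):
    gap_before = female_minus_male(yrs, *before)
    gap_after = female_minus_male(yrs, *after)
    print(f"{yrs:2d} years: {gap_before:8.3f} before, {gap_after:8.3f} after")
```

In both outputs the gap turns negative after roughly six years of experience, but it widens more slowly once the unusual employee is removed.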



Part 3: Analysis of Variance and Experimental Design



One-Way ANOVA



One-way ANOVA

• A one-way analysis of variance, or one-way ANOVA, is the procedure for analyzing the differences between more than two population means. A one-way ANOVA is also used in randomized experiments where a single population is treated in one of several ways.

• The data analysis in these two situations is identical; only the interpretation of the results differs.



One-way ANOVA Process

• The one-way ANOVA procedure is usually run in two stages.

• The first stage tests the null hypothesis that all of the means are equal.

• If the p-value is not sufficiently small, then there is not enough evidence to reject the equal-means hypothesis, and the analysis stops.

• If the p-value is sufficiently small, we can conclude with some assurance that the means are not all equal.

• If the means are not all equal, the second stage determines which of the groups differ significantly from the others.



Background Information Effects of Shelf Height on Cereal Sales at Midway

• Midway Company selects 125 stores in its chain of supermarkets to conduct an experiment on cereal sales.

• These stores are similar in terms of store size, customer traffic, customer types, and other characteristics.

• Each store stocks cereal in a similar location in the store on five-shelf displays.

• The 125 stores are randomly assigned to one of five groups, where each group stocks Brand X cereal in a specific shelf location (highest, next highest, middle, next lowest, lowest).

• The number of Brand X boxes sold at each store is recorded for the last two weeks of the experiment (the first two weeks allow customers to get used to the shelving positions).

• Objective: does shelf height make any difference in mean sales of Brand X cereal, and if so, which shelf heights outperform the others?



One-way ANOVA Solution Cereal Sales at Midway

• Note that this is a designed experiment:

• Initial stores were chosen in an attempt to control for extraneous factors.

• Stores were randomly assigned to treatment levels (shelf heights).

• The output consists of three basic parts:

• summary statistics

• the ANOVA table

• confidence intervals

• Select Statistical Inference + One Way ANOVA

• The next slide contains this output.



One-way ANOVA Solution Cereal Sales at Midway

Results of one-way ANOVA

Summary stats for samples
                                Lowest  Next-to-lowest      Middle  Next-to-highest     Highest
Sample sizes                        25              25          25               25          25
Sample means                   334.920         378.680     383.440          426.280     383.880
Sample standard deviations      61.043          84.081      75.625           85.054      69.619
Sample variances              3726.243        7069.560    5719.173         7234.210    4846.777
Weights for pooled variance      0.200           0.200       0.200            0.200       0.200

Number of samples            5
Total sample size            125
Grand mean                   381.440
Pooled variance              5719.193
Pooled standard deviation    75.625

OneWay ANOVA table
Source                      SS   df          MS       F  p-value
Between variation   104807.680    4   26201.920   4.581   0.0018
Within variation    686303.120  120    5719.193
Total variation     791110.800  124

Confidence intervals for mean differences
Confidence level 95.0%

Tukey method
Difference                          Mean diff      Lower      Upper  Signif?
Lowest - Next-to-lowest               -43.760   -103.050     15.530  No
Lowest - Middle                       -48.520   -107.810     10.770  No
Lowest - Next-to-highest              -91.360   -150.650    -32.070  Yes
Lowest - Highest                      -48.960   -108.250     10.330  No
Next-to-lowest - Middle                -4.760    -64.050     54.530  No
Next-to-lowest - Next-to-highest      -47.600   -106.890     11.690  No
Next-to-lowest - Highest               -5.200    -64.490     54.090  No
Middle - Next-to-highest              -42.840   -102.130     16.450  No
Middle - Highest                       -0.440    -59.730     58.850  No
Next-to-highest - Highest              42.400    -16.890    101.690  No



Summary Statistics Cereal Sales at Midway

• The summary statistics show that the next to highest shelf position has the largest mean store sales (426.28), and the lowest shelf has the smallest mean store sales (334.92), with the others in between.

• The sample standard deviations (or variances) vary somewhat across the shelf positions, but not enough to invalidate the procedure (we assume equal variance).

• The side-by-side boxplots in the figure on the next slide illustrate these summary measures graphically. However, there is too much overlap to tell whether the differences are statistically significant.



Boxplot of Mean Results by Region Cereal Sales at Midway

[Figure: side-by-side boxplots of store sales for the five shelf positions (Lowest, Next-to-lowest, Middle, Next-to-highest, Highest); horizontal axis runs from 150 to 600.]



ANOVA Table Results Cereal Sales at Midway

• The Total variation in the ANOVA Table is based on the total variation of all observations around the grand mean in the summary section, and is used mainly to aid in calculations.

• The grand mean is the sample mean of all observations.

• The between variation is the sum of squared differences between the treatment level means and the grand mean, weighted by the treatment sample sizes (df = number of groups – 1).

• The within variation is variation due to differences within individual treatment groups (df = total sample size - # groups)

• The F-ratio for the test is 4.581 with a corresponding p-value of 0.0018 (since < .05, we reject the null hypothesis that all means are equal).

• Since the means are not all equal, we proceed to a comparison test to determine which means differ.
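The ANOVA table entries can be rebuilt from the summary statistics alone. The following is a minimal sketch, not a reproduction of the StatTools procedure; the variable names are our own, and the inputs are the sample sizes, means and variances from the output slide. The within sum of squares differs from the table in the last decimal place because the inputs are already rounded.

```python
sizes = [25, 25, 25, 25, 25]
means = [334.920, 378.680, 383.440, 426.280, 383.880]
variances = [3726.243, 7069.560, 5719.173, 7234.210, 4846.777]

n_total = sum(sizes)
n_groups = len(sizes)
grand_mean = sum(n * m for n, m in zip(sizes, means)) / n_total

# Between variation: squared deviations of group means from the grand
# mean, weighted by group sample sizes.
ss_between = sum(n * (m - grand_mean) ** 2 for n, m in zip(sizes, means))
# Within variation: pooled squared deviations inside each group.
ss_within = sum((n - 1) * v for n, v in zip(sizes, variances))

df_between = n_groups - 1
df_within = n_total - n_groups
ms_between = ss_between / df_between
ms_within = ss_within / df_within      # this is the pooled variance
f_ratio = ms_between / ms_within

print(round(ss_between, 2), round(ss_within, 2), round(f_ratio, 3))
```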



Results Cereal Sales at Midway

• The final section of output lists a set of pairwise comparisons between treatment levels (shelf heights).

• The difference shows which two shelf heights are being compared, and the mean difference shows how much difference there is between the mean sales for the two shelf heights.

• The lower and upper limits give the confidence interval for the difference in mean sales between the two shelf heights. If the lower limit is negative and the upper limit is positive, then 0 is contained in the interval and we cannot conclude that there is a statistically significant difference in sales between those two heights.

• The only statistically significant difference we can discern is between the next to highest shelf and the lowest shelf (largest and smallest mean sales).

• The company needs to discern if that difference is practically significant, or if any external factors confounded the experiment.
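The "does the interval contain 0" rule can be applied mechanically to all ten pairs. This sketch does not compute the Tukey critical value; it simply takes the interval half-width of 59.290 implied by the output (mean difference minus lower limit) as given. The names `half_width` and `significant` are our own.

```python
from itertools import combinations

means = {
    "Lowest": 334.920,
    "Next-to-lowest": 378.680,
    "Middle": 383.440,
    "Next-to-highest": 426.280,
    "Highest": 383.880,
}
half_width = 59.290   # read off the Tukey output, not derived here

significant = []
for (name_a, mean_a), (name_b, mean_b) in combinations(means.items(), 2):
    diff = mean_a - mean_b
    lower, upper = diff - half_width, diff + half_width
    if lower > 0 or upper < 0:          # interval excludes 0
        significant.append((name_a, name_b))

print(significant)
```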



Two-Way ANOVA



Background Information Golf Ball Testing

• Many golf ball manufacturers claim to have the “longest ball,” that is, the ball that goes the farthest on drives.

• This example illustrates how these claims might be tested by testing five major brands (Brand A through E)

• A consumer testing service runs an experiment where 60 balls of each brand are driven under three temperature conditions.

• The first 20 are driven in cool weather, the next 20 are driven in mild weather, and the last 20 are driven in warm weather.

• The goal is to see whether some brands differ significantly, on average, from other brands and what effect temperature has on the mean differences between brands.



Experimental Design Golf Ball Testing

• Like the last example, this example represents a controlled experiment (20 golf balls of each brand are randomly assigned to each of three temperature levels).

• In general terminology, the experimental units are the individual golf balls and the response variable is the length (in yards) of each drive.

• There are two factors (brand and temperature), each with different treatment levels (brand has levels A through E, and temperature has three levels: cool, mild, and warm).

• The design is balanced because the same number of balls, 20, is used at each of the 5 x 3 = 15 treatment level combinations.

• There is one further piece of terminology. We call this a full factorial two-way design because we test golf balls at each of the 15 possible treatment level combinations.



Conducting the Experiment Golf Ball Testing

• How should the consumer testing service carry out the experiment?

• One possibility is to have 15 golfers, each of approximately the same skill level, hit 20 balls each. The downside of this design could be that the golfers assigned to a certain brand could be having a good day.

• Golfers could be spread out (each golfer could hit 2 balls). This, however, introduces an unwanted source of variation: the different abilities of the golfers.

• You could use the same golfer for 300 balls. Unfortunately, the golfer might get tired in the process of hitting this many balls.

• These are the types of things designers of experiments must consider.



Conducting the Experiment Golf Ball Testing

• The design should attempt to eliminate as many unwanted sources of variation as possible, so that any difference across the factor levels of interest can be attributed to these factors and not to extraneous factors.

• In this example, we suspect the best solution is to employ a “mechanical” golf ball driver to hit all 300 balls.

• This should reduce the inevitable random variation that would occur by using human golfers.



Coding the data Golf Ball Testing

• Although many rows in the figure are hidden, there are actually 300 rows of data, 20 for each of the 15 combinations of Brand and Temp.

• There must be two “code” variables that represent the levels of the two factors and a measurement variable that represents the response variable.

• Again this is a balanced design, which is what StatTools expects for its two-way ANOVA procedure.
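The layout described here (two code variables plus one measurement variable, 20 rows per combination) can be sketched as follows. The column names Brand, Temp and Yards follow the slides; the yardages are left empty because the actual data in the spreadsheet are not reproduced here.

```python
from collections import Counter
from itertools import product

brands = ["A", "B", "C", "D", "E"]
temps = ["Cool", "Mild", "Warm"]
balls_per_cell = 20

# One row per ball: two code variables and one (still empty) measurement.
rows = [
    {"Brand": brand, "Temp": temp, "Yards": None}
    for brand, temp in product(brands, temps)
    for _ in range(balls_per_cell)
]

# A design is balanced when every treatment level combination has the
# same number of observations.
cell_counts = Counter((row["Brand"], row["Temp"]) for row in rows)
balanced = len(set(cell_counts.values())) == 1

print(len(rows), len(cell_counts), balanced)
```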



Analysis of Results Golf Ball Testing

Prompted by the table, here are some questions we might ask:

1. Look at column I. Do any brands average significantly more yards than any others (where these averages are averages over all temperatures)?

2. Look at the bottom row. Do average yardages differ significantly across temperatures (where these averages are across all brands)?

3. Look at the middle of the table. Do differences among averages of brands depend on temperature? For example, does one brand dominate in cool weather and another in warm weather? Also, do differences among averages of temperatures depend on brand? For example, are some brands very sensitive to changes in temperature while others are not?




• It is useful to characterize the type of information these questions are seeking.

• Question 1 is asking about the main effect of the brand factor. If we ignore the temperature, do some brands tend to go farther than some others?

• Question 2 is also asking about a main effect, the main effect of the temperature factor. If we ignore the brand, do balls tend to go farther in some temperatures than others? (This answer is obvious to golfers: balls compress better and go farther in warm temperatures.) Therefore this is not a key question, although we would expect the study to confirm what common sense tells us.

• Question 3 is asking about interactions between the two factors. These interactions are often the most interesting results of a two-way study. In this example interactions are patterns of the averages that could not be guessed by looking only at the “main effect” averages.



Interaction Effects Golf Ball Testing

• Specifically, the order of brands in column F, from largest to smallest average yardages, is E, C, B, A, D. If there were no interactions at all, this ordering would hold at each temperature. For these data it is close.

• At cool temperatures the ordering is C, E, B, A, D; for mild, it is E, B, C, D, A; for warm, it is E, C, A, B, D.

• Actually, having no interaction implies even more than the preservation of these rankings.



Interaction Effects Golf Ball Testing

• It implies that the difference between any two brand averages is the same at any of the three temperature levels.

• For example, the differences between brands E and D at the three temperature levels are:

Cool: 224.8 - 215.0 = 9.8
Mild: 255.7 - 237.6 = 18.1
Warm: 270.9 - 256.1 = 14.8

• If there were no interactions at all, these three differences would be equal.
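The comparison above is easy to script for any pair of brands. In this sketch the six cell averages are the ones quoted on the slide; assigning them to cool, mild and warm in that order is our assumption, based on the order in which the temperature levels are listed and on yardage rising with temperature.

```python
brand_e = {"Cool": 224.8, "Mild": 255.7, "Warm": 270.9}
brand_d = {"Cool": 215.0, "Mild": 237.6, "Warm": 256.1}

# Difference between the two brand averages at each temperature level.
differences = {temp: round(brand_e[temp] - brand_d[temp], 1) for temp in brand_e}

# With no interaction at all, every difference would be equal, so the
# spread between the largest and smallest difference would be 0.
spread = round(max(differences.values()) - min(differences.values()), 1)

print(differences, spread)
```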



Interaction Graphically Golf Ball Testing

• The concept of interaction is much easier to understand by looking at graphs.

• The following graphs, which are both outputs from StatTools’ two-way ANOVA procedure, represent two ways of looking at the pattern of averages for different combinations of brand and temperature.

• The first graph shows a line for each brand, where each point on the line corresponds to a different temperature. The second shows the same information with the roles of brand and temperature reversed.




Interaction Plot: Brand by Temp

[Figure: one line per brand (A through E) plotted across the temperature levels Cool, Mild, and Warm; vertical axis runs from 200.00 to 270.00.]




Interaction Plot: Temp by Brand

[Figure: one line per temperature level (Cool, Mild, Warm) plotted across brands A through E; vertical axis runs from 200.00 to 280.00.]




• Neither graph is better than the other; they simply show the same data from different perspectives.

• The key to either is whether the lines are parallel. If they are, then there are no interactions - the effect of one factor on average yardage is the same regardless of the level of the other factor. The more nonparallel they are, however, the stronger the interactions are.

• The lines in either of these graphs are not exactly parallel but they are nearly so. This implies that there is very little interaction between brand and temperature.



Type of Interactions

• In general, interactions can be of several types.

• Shown here are two contrasting types. These graphs are based on different data than in GOLFBALLS.XLS.

• In the first graph brand A dominates at all temperatures. However, there is an interaction because the difference between brands increases as temperatures increase.




• In this situation the interaction effect is interesting, but the main effect of brand - brand A is better when averaged over all temperatures - is also interesting.

• The situation is quite different in the next graph, where there is a crossover.




• Brand A is somewhat better at cool temperatures, but brand B is better at mild and warm temperatures.

• In this case the interaction is the most interesting finding, and the main effect of brand is much less interesting.



Type of Interactions

• In simple terms, if you are a golfer, you’d buy brand A in cool temperatures and brand B otherwise, and you wouldn’t care very much which brand is better when averaged over all temperatures.

• For these reasons, we check first for interactions in a two-way design.

• If there are significant interactions, then the main effects might not be as interesting.

• However, if there are no significant interactions, then main effects generally become more important.



Main Effects versus Interactions

• Main effects are differences in averages across the levels of one factor, where these averages are averages over all levels of the other factor.

• In a table of sample means, we can check for main effects by looking at the averages in the “Grand Total” column and row.

• In contrast, the interactions are patterns of averages in the main body of the table and are best shown graphically. They indicate whether the effect of one factor depends on the level of the other factor.



Two Way ANOVA Table

• The next question is whether the main effects and interactions we see in the table of sample means are statistically significant.

• As in a one-way ANOVA, this is answered by an ANOVA table.

• However, instead of having just two sources of variation, within and between, as in a one-way ANOVA, there are now four sources of variation: one for the main effect of each factor, one for interactions, and one for the variation within treatment level combinations.




• For the golf ball data, two-way ANOVA separates the total variation across all 300 observations into four sources.

1. There is variation due to different brands producing different average yardages.

2. There is variation due to different average yardages at different temperatures.

3. There is variation due to the interactions we saw in the interaction graphs.

4. There is the same type of “within” variation as in one-way ANOVA. This is the variation that occurs because yardages for the 20 balls of the same brand hit at the same temperature are not all identical.
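The four sources listed above add up exactly to the total variation in a balanced design. The sketch below illustrates this on a small, entirely hypothetical data set (2 brands, 2 temperatures, 3 balls per cell); the yardages are made up for illustration and are not taken from GOLFBALLS.XLS.

```python
from itertools import product

# Hypothetical yardages: 2 brands x 2 temperatures x 3 balls per cell.
data = {
    ("A", "Cool"): [220.0, 224.0, 222.0],
    ("A", "Warm"): [250.0, 254.0, 252.0],
    ("B", "Cool"): [228.0, 230.0, 232.0],
    ("B", "Warm"): [268.0, 270.0, 266.0],
}
brands = ["A", "B"]
temps = ["Cool", "Warm"]
reps = 3


def mean(values):
    values = list(values)
    return sum(values) / len(values)


all_values = [y for cell in data.values() for y in cell]
grand = mean(all_values)
cell_mean = {key: mean(vals) for key, vals in data.items()}
# In a balanced design, a factor-level mean is the mean of its cell means.
brand_mean = {b: mean(cell_mean[(b, t)] for t in temps) for b in brands}
temp_mean = {t: mean(cell_mean[(b, t)] for b in brands) for t in temps}

ss_brand = reps * len(temps) * sum((brand_mean[b] - grand) ** 2 for b in brands)
ss_temp = reps * len(brands) * sum((temp_mean[t] - grand) ** 2 for t in temps)
ss_inter = reps * sum(
    (cell_mean[(b, t)] - brand_mean[b] - temp_mean[t] + grand) ** 2
    for b, t in product(brands, temps)
)
ss_within = sum((y - cell_mean[key]) ** 2 for key, vals in data.items() for y in vals)
ss_total = sum((y - grand) ** 2 for y in all_values)

print(ss_brand, ss_temp, ss_inter, ss_within, ss_total)
```

Dividing each sum of squares by its degrees of freedom and comparing with the within mean square gives the F-ratios reported in a two-way ANOVA table.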



Output of Results Golf Ball Testing

• Select the StatTools/Statistical Inference/Two-way ANOVA menu item, selecting Brand and Temp as the “code” variables and Yards as the “measurement” variable.

• Output includes tables of sample sizes, sample means, and sample standard deviations, as well as the ANOVA table.




• We test whether main effects or interactions are statistically significant in the usual way - by examining p-values.

• Looking first at the interactions, the p-value is about 0.03, which says that the lines in the interaction graphs are significantly non-parallel, at least at the 5% significance level. There is at least some interaction between brand and temperature (although the practical significance could be disputed).

• The two p-values for the main effects in cells G32 and G33 are practically 0, meaning that there are differences across brands and across temperatures.

• Of course, the main effect of temperature was a foregone conclusion - we already know that balls do not go as far in cold temperatures - but the main effect of brand is more interesting.

• According to the evidence, some brands definitely go farther, on average, than some others.
