

TEACHING SUGGESTIONS

Teaching Suggestion 4.1: Which Is the Independent Variable? We find that students are often confused about which variable is independent and which is dependent in a regression model. For example, in Triple A’s problem, clarify which variable is X and which is Y. Emphasize that the dependent variable (Y) is what we are trying to predict based on the value of the independent (X) variable. Use examples such as the time required to drive to a store and the distance traveled, the total number of units sold and the selling price of a product, and the cost of a computer and the processor speed.

Teaching Suggestion 4.2: Statistical Correlation Does Not Always Mean Causality. Students should understand that a high R² doesn’t always mean one variable will be a good predictor of the other. Explain that skirt lengths and stock market prices may be correlated, but raising one doesn’t necessarily mean the other will go up or down. An interesting study indicated that, over a 10-year period, the salaries of college professors were highly correlated to the dollar sales volume of alcoholic beverages (both were actually correlated with inflation).

Teaching Suggestion 4.3: Give students a set of data and have them plot the data and manually draw a line through the data. A discussion of which line is “best” can help them appreciate the least squares criterion.

Teaching Suggestion 4.4: Select some randomly generated values for X and Y (you can use random numbers from the random number table in Chapter 15 or use the RAND function in Excel). Develop a regression line using Excel and discuss the coefficient of determination and the F-test. Students will see that a regression line can always be developed, but it may not necessarily be useful.

Teaching Suggestion 4.5: A discussion of the long formulas and short-cut formulas that are provided in the appendix is helpful. The long formulas provide students with a better understanding of the meaning of the SSE and SST. Since many people use computers for regression problems, it helps to see the original formulas. The short-cut formulas are helpful if students are performing the computations on a calculator.

ALTERNATIVE EXAMPLES

Alternative Example 4.1: The sales manager of a large apartment rental complex feels the demand for apartments may be related to the number of newspaper ads placed during the previous month. She has collected the data shown in the accompanying table.

Ads purchased (X)    Apartments leased (Y)
       15                     6
        9                     4
       40                    16
       20                     6
       25                    13
       25                     9
       15                    10
       35                    16

We can find a mathematical equation by using the least squares regression approach.

Leases, Y    Ads, X    (X − X̄)²    (X − X̄)(Y − Ȳ)
     6         15          64             32
     4          9         196             84
    16         40         289            102
     6         20           9             12
    13         25           4              6
     9         25           4             −2
    10         15          64              0
    16         35         144             72

ΣY = 80    ΣX = 184    Σ(X − X̄)² = 774    Σ(X − X̄)(Y − Ȳ) = 306

b1 = 306/774 = 0.395

b0 = 10 − 0.395(23) = 0.915

The estimated regression equation is

Ŷ = 0.915 + 0.395X

or

Apartments leased = 0.915 + 0.395 (ads placed)

If the number of ads is 30, we can estimate the number of apartments leased with the regression equation

0.915 + 0.395(30) = 12.76 or 13 apartments
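For instructors who want students to check the hand computation, the following is a minimal Python sketch of the same least squares steps. The function name fit_line and the variable names are our own choices for illustration, not part of the text. Because the sketch carries full precision, the intercept comes out near 0.907 rather than 0.915 (the text rounds b1 to 0.395 before computing b0); the forecast still rounds to 13 apartments.

```python
def fit_line(x, y):
    """Return (b0, b1) for the least squares line y = b0 + b1 * x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # Sum of squared deviations of X and sum of cross-products with Y
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    return b0, b1


# Data from Alternative Example 4.1
ads = [15, 9, 40, 20, 25, 25, 15, 35]
leases = [6, 4, 16, 6, 13, 9, 10, 16]

b0, b1 = fit_line(ads, leases)
forecast = b0 + b1 * 30  # estimated leases when 30 ads are placed
print(round(b1, 3), round(b0, 3), round(forecast, 2))
```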

Alternative Example 4.2: Given the data on ads and apartment rentals in Alternative Example 4.1, find the coefficient of determination. The following have been computed in the table that follows:

SST = 150; SSE = 29.02; SSR = 120.76

(Note: Round-off error may cause this to be slightly different than a computer solution.)

Ȳ = 80/8 = 10; X̄ = 184/8 = 23

CHAPTER 4: Regression Models

M04_REND6289_10_IM_C04.QXD 5/7/08 2:49 PM Page 46REVISED


From this the coefficient of determination is

r² = SSR/SST = 120.76/150 = 0.81

Alternative Example 4.3: For Alternative Examples 4.1 and 4.2, dealing with ads, X, and apartments leased, Y, compute the correlation coefficient.

Since r² = 0.81 and the slope is positive (+0.395), the positive square root of 0.81 is the correlation coefficient: r = 0.90.
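The sums of squares behind these two examples can be reproduced with a short Python sketch. It uses the rounded coefficients 0.915 and 0.395 from Alternative Example 4.1 so that the results line up with the printed values; the function name sums_of_squares is ours and is only illustrative.

```python
import math


def sums_of_squares(x, y, b0, b1):
    """Return (SST, SSE, SSR) for the line y_hat = b0 + b1 * x."""
    y_bar = sum(y) / len(y)
    y_hat = [b0 + b1 * xi for xi in x]
    sst = sum((yi - y_bar) ** 2 for yi in y)               # total variation
    sse = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))  # unexplained
    ssr = sum((yh - y_bar) ** 2 for yh in y_hat)           # explained
    return sst, sse, ssr


ads = [15, 9, 40, 20, 25, 25, 15, 35]
leases = [6, 4, 16, 6, 13, 9, 10, 16]

sst, sse, ssr = sums_of_squares(ads, leases, b0=0.915, b1=0.395)
r_squared = ssr / sst
r = math.sqrt(r_squared)  # positive root, because the slope is positive
print(round(sst, 2), round(sse, 2), round(ssr, 2), round(r_squared, 2), round(r, 2))
```

Because the coefficients are rounded, SSE + SSR falls slightly short of SST, which is the round-off effect mentioned in the note to Alternative Example 4.2.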

SOLUTIONS TO DISCUSSION QUESTIONS AND PROBLEMS

4-1. The term least-squares means that the regression line will minimize the sum of the squared errors (SSE). No other line will give a lower SSE.

4-2. Dummy variables are used when a qualitative factor such as the gender of an individual (male or female) is to be included in the model. Usually this is given a value of 1 when the condition is met (e.g., person is male) and 0 otherwise. When there are more than two levels or values for the qualitative factor, more than one dummy variable must be used. The number of dummy variables is one less than the number of possible values or categories. For example, if students are classified as freshmen, sophomores, juniors, and seniors, three dummy variables would be necessary.
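The "one less than the number of categories" rule can be shown with a small Python sketch. The helper dummy_code and the choice of the first category as the baseline are our own assumptions for illustration; any category could serve as the baseline.

```python
def dummy_code(values, categories):
    """Encode each value with len(categories) - 1 indicator (0/1) variables.

    The first category is treated as the baseline and is coded as all zeros.
    """
    non_baseline = categories[1:]
    return [[1 if v == c else 0 for c in non_baseline] for v in values]


levels = ["freshman", "sophomore", "junior", "senior"]
students = ["junior", "freshman", "senior", "sophomore"]

coded = dummy_code(students, levels)
for student, row in zip(students, coded):
    print(student, row)  # three dummy variables per student
```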

4-3. The coefficient of determination (r²) is the square of the coefficient of correlation (r). Both of these give an indication of how well a regression model fits a particular set of data. An r² value of 1 would indicate a perfect fit of the regression model to the points. This would also mean that r would equal +1 or −1.

4-4. A scatter diagram is a plot of the data. This graphical image helps to determine if a linear relationship is present, or if another type of relationship would be more appropriate.

4-5. The adjusted r² value is used to help determine if a new variable should be added to a regression model. Generally, if the adjusted r² value increases when a new variable is added to a model, this new variable should be included in the model. If the adjusted r² value declines or does not increase when a new variable is added, then the variable should not be added to the model.
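As a rough illustration of this rule, the sketch below uses the standard adjusted r² formula, 1 − (1 − r²)(n − 1)/(n − k − 1), where n is the number of observations and k the number of independent variables. The r² values and the sample size n = 12 are hypothetical numbers chosen only to show the comparison.

```python
def adjusted_r_squared(r_squared, n, k):
    """Adjusted r^2 for n observations and k independent variables."""
    return 1 - (1 - r_squared) * (n - 1) / (n - k - 1)


# Hypothetical values: adding a second variable raises r^2 only slightly
n = 12
adj_one_variable = adjusted_r_squared(0.974, n, k=1)
adj_two_variables = adjusted_r_squared(0.975, n, k=2)

# The new variable is kept only if the adjusted r^2 goes up
keep_new_variable = adj_two_variables > adj_one_variable
print(round(adj_one_variable, 4), round(adj_two_variables, 4), keep_new_variable)
```

Here r² rises from 0.974 to 0.975, but the adjusted value falls, so the second variable would not be added.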

4-6. The F-test is used to determine if the overall regression model is helpful in predicting the value of the dependent variable (Y). If the F-value is large and the p-value or significance level is low, then we can conclude that there is a linear relationship and the model is useful, as these results would probably not occur by chance. If the significance level is high, then the model is not useful and the results in the sample could be due to random variations.

4-7. The SSE is the sum of the squared errors in a regression model. SST = SSE + SSR.

4-8. When the residuals (errors) are plotted after a regression line is found, the errors should be random and should not show any significant pattern. If a pattern does exist, then the assumptions may not be met or another model (perhaps nonlinear) would be more appropriate.

4-9. a. Ŷ = 36 + 4.3(70) = 337

b. Ŷ = 36 + 4.3(80) = 380

c. Ŷ = 36 + 4.3(90) = 423

Computations for Alternative Example 4.2:

    Y        X     (Y − Ȳ)²   Ŷ = 0.915 + 0.395X   (Y − Ŷ)²   (Ŷ − Ȳ)²
  6.00    15.00       16            6.84             0.706       9.986
  4.00     9.00       36            4.47             0.221      30.581
 16.00    40.00       36           16.715            0.511      45.091
  6.00    20.00       16            8.815            7.924       1.404
 13.00    25.00        9           10.79             4.884       0.624
  9.00    25.00        1           10.79             3.204       0.624
 10.00    15.00        0            6.84             9.986       9.986
 16.00    35.00       36           14.74             1.588      22.468
 80.00   184.00   SST = 150.00     80.00         SSE = 29.02   SSR = 120.76

4-10. a. [Scatter diagram: Demand (Y) plotted against TV Appearances (X).]


4-10. b.

Demand, Y   TV Appearances, X   (X − X̄)²   (Y − Ȳ)²   (X − X̄)(Y − Ȳ)    Ŷ    (Y − Ŷ)²   (Ŷ − Ȳ)²
     3              3             6.25       12.25          8.75          4       1         6.25
     6              4             2.25        0.25          0.75          5       1         2.25
     7              7             2.25        0.25          0.75          8       1         2.25
     5              6             0.25        2.25         −0.75          7       4         0.25
    10              8             6.25       12.25          8.75          9       1         6.25
     8              5             0.25        2.25         −0.75          6       4         0.25
ΣY = 39.0       ΣX = 33          17.5        29.5          17.5                  12        17.5
Ȳ = 6.5         X̄ = 5.5                     (SST)                              (SSE)     (SSR)

SST = 29.5; SSE = 12; SSR = 17.5

b1 = 17.5/17.5 = 1

b0 = 6.5 − 1(5.5) = 1

The regression equation is Ŷ = 1 + 1X.

c. Ŷ = 1 + 1X = 1 + 1(6) = 7.

4-11. See the table for the solution to problem 4-10 to obtain some of these numbers.

MSE = SSE/(n − k − 1) = 12/(6 − 1 − 1) = 3

MSR = SSR/k = 17.5/1 = 17.5

F = MSR/MSE = 17.5/3 = 5.83

df1 = k = 1

df2 = n − k − 1 = 6 − 1 − 1 = 4

F0.05, 1, 4 = 7.71

Do not reject H0 since 5.83 < 7.71. Therefore, we cannot conclude there is a statistically significant relationship at the 0.05 level.
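The same F-test arithmetic can be sketched in Python. The function name f_statistic is ours; the critical value 7.71 is taken from the F table as quoted above rather than computed.

```python
def f_statistic(sse, ssr, n, k):
    """Return F = MSR / MSE for n observations and k independent variables."""
    mse = sse / (n - k - 1)  # mean squared error
    msr = ssr / k            # mean squared regression
    return msr / mse


# Values from the solution to problem 4-10
f_value = f_statistic(sse=12, ssr=17.5, n=6, k=1)
critical_value = 7.71  # F(0.05; df1 = 1, df2 = 4), from the table
reject_h0 = f_value > critical_value
print(round(f_value, 2), reject_h0)
```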

4-12. Using Excel, the regression equation is Ŷ = 1 + 1X. F = 5.83, and the significance level is 0.073. This is significant at the 0.10 level (0.073 < 0.10), but it is not significant at the 0.05 level. There is marginal evidence that there is a relationship between demand for drums and TV appearances.

4-13.

Fin. Ave. (Y)   Test 1 (X)   (X − X̄)²   (Y − Ȳ)²   (X − X̄)(Y − Ȳ)     Ŷ     (Y − Ŷ)²   (Ŷ − Ȳ)²
     93             98        285.235      196         236.444        91.5     2.264     156.135
     78             77         16.901        1           4.111        76       4.168       9.252
     84             88         47.457       25          34.444        84.1     0.009      25.977
     73             80          1.235       36           6.667        78.2    26.811       0.676
     84             96        221.679       25          74.444        90      36.188     121.345
     64             61        404.457      225         301.667        64.1     0.015     221.396
     64             66        228.346      225         226.667        67.8    14.592     124.994
     95             95        192.901      256         222.222        89.3    32.766     105.592
     76             69        146.679        9          36.333        70      35.528      80.291
    711            730       1544.9        998        1143                   152.341     845.659

b1 = 1143/1544.9 = 0.740

b0 = (711/9) − 0.740(730/9) = 18.99

a. Ŷ = 18.99 + 0.74X

b. Ŷ = 18.99 + 0.74(83) = 80.41

c. r² = SSR/SST = 845.659/998 = 0.85; r = 0.92; this means that 85% of the variability in the final average can be explained by the variability in the first test score.

[Figure for Problem 4-19a: scatter diagram of Ridership (100,000s) versus Tourists (Millions).]

4-14. See the table for the solution to problem 4-13 to obtain some of these numbers.

MSE = SSE/(n − k − 1) = 152.341/(9 − 1 − 1) = 21.76

MSR = SSR/k = 845.659/1 = 845.659

F = MSR/MSE = 845.659/21.76 = 38.9

df1 = k = 1

df2 = n − k − 1 = 9 − 1 − 1 = 7

F0.05, 1, 7 = 5.59

Because 38.9 > 5.59, we can conclude (at the 0.05 level) that there is a statistically significant relationship between the first test grade and the final average.

4-15. F = 38.86; the significance level = 0.0004 (which is extremely small), so there is definitely a statistically significant relationship.

4-16. a. Ŷ = 13,473 + 37.65(1,860) = $83,502.

b. The predicted average selling price for a house this size would be $83,502. Some will sell for more and some will sell for less. There are other factors besides size that influence the price of the house.

c. Some other variables that might be included are age of the house, number of bedrooms, and size of the lot. There are other factors in addition to these that one can identify.

d. The coefficient of determination (r²) = (0.63)² = 0.3969.

4-17. The multiple regression equation is Ŷ = $90.00 + $48.50X1 + $0.40X2

a. Number of days on the road: X1 = 5; Distance traveled: X2 = 300 miles

The amount he may be expected to claim is

Ŷ = $90.00 + $48.50(5) + $0.40(300) = $452.50

b. The reimbursement request, according to the model, appears to be too high. However, this does not mean that it is not justified. The accountants should question Thomas Williams about his expenses to see if there are other explanations for the high cost.

c. A number of other variables should be included, such as the type of travel (air or car), conference fees if any, expenses for entertainment of customers, and other transportation (cab and limousine) expenses. In addition, the coefficient of correlation is only 0.68 and r² = (0.68)² = 0.46. Thus, about 46% of the variability in the cost of the trip is explained by this model; the other 54% is due to other factors.
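A prediction from a multiple regression equation is just the intercept plus each coefficient times its variable. The short Python sketch below repeats the 4-17 calculation; the helper name predict is our own.

```python
def predict(intercept, coefficients, values):
    """Evaluate Y_hat = b0 + b1*X1 + b2*X2 + ... for one observation."""
    return intercept + sum(b * x for b, x in zip(coefficients, values))


# Problem 4-17: X1 = days on the road, X2 = miles traveled
expected_claim = predict(90.00, [48.50, 0.40], [5, 300])
r_squared = 0.68 ** 2  # share of cost variability explained by the model
print(round(expected_claim, 2), round(r_squared, 2))
```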

4-18. Using computer software to get the regression equation, we get

Ŷ = 1.03 + 0.0034X

where Ŷ = predicted GPA and X = SAT score.

If a student scores 450 on the SAT, we get

Ŷ = 1.03 + 0.0034(450) = 2.56.

If a student scores 800 on the SAT, we get

Ŷ = 1.03 + 0.0034(800) = 3.75.

4-19. a. A linear model is reasonable from the scatter diagram of ridership against tourists.

b. Ŷ = 5.060 + 1.593X

c. Ŷ = 5.060 + 1.593(10) = 20.99, or 2,099,000 people.

d. If there are no tourists, the predicted ridership would be 5.06 (100,000s) or 506,000. Because X = 0 is outside the range of values that were used to construct the regression model, this number may be questionable.

4-20. The F-value for the F-test is 52.6 and the significance level is extremely small (0.00002), which indicates that there is a statistically significant relationship between number of tourists and ridership. The coefficient of determination is 0.84, indicating that 84% of the variability in ridership from one year to the next could be explained by the variations in the number of tourists.

4-21. a. Ŷ = 24,328 + 3026.67X1 + 6684X2

where Ŷ = predicted starting salary; X1 = GPA; X2 = 1 if business major, 0 otherwise.

b. Ŷ = 24,328 + 3026.67(3.0) + 6684(1) = $40,092.01.

c. The starting salary for business majors tends to be about $6,684 higher than non-business majors in this sample, even after adjusting for variations in GPA.

d. The overall significance level is 0.099 and r² = 0.69. Thus, the model is significant at the 0.10 level and 69% of the variability in starting salary is explained by GPA and major. The model is useful in predicting starting salary.

4-22. a. Let

Ŷ = predicted selling price

X1 = square footage

X2 = number of bedrooms

X3 = age

The model with square footage: Ŷ = 2367.26 + 46.60X1; r² = 0.65

The model with number of bedrooms: Ŷ = 1923.5 + 36137.76X2; r² = 0.36

The model with age: Ŷ = 147670.9 − 2424.16X3; r² = 0.78


All of these models are significant at the 0.01 level or less. The best model uses age as the independent variable. The coefficient of determination is highest for this, and it is significant.

4-23. Ŷ = 5701.45 + 48.51X1 − 2540.39X2 and r² = 0.65.

Ŷ = 5701.45 + 48.51(2000) − 2540.39(3) = 95,100.28.

Notice the r² value is the same as it was in the previous problem with just square footage as the independent variable. Adding the number of bedrooms did not add any significant information that was not already captured by the square footage. It should not be included in the model. The r² for this is lower than for age alone in the previous problem.

4-24. Ŷ = 82185.5 + 25.94X1 − 2151.7X2 − 1711.5X3 and r² = 0.89.

Ŷ = 82185.5 + 25.94(2000) − 2151.7(3) − 1711.5(10) = $110,495.4.

4-25. Ŷ = 3071.885 + 6.5326X where

Ŷ = DJIA and X = S&P.

r = 0.84 and r² = 0.70.

Ŷ = 3071.885 + 6.5326(1100) = 10257.8 (rounded)

4-26. With one independent variable, beds, in the model, r² = 0.88. With just admissions in the model, r² = 0.974. When both variables are in the model, r² = 0.975. Thus, the model with only admissions as the independent variable is the best. Adding the number of beds had virtually no impact on r², and the adjusted r² decreased slightly. Thus, the best model is Ŷ = 1.518 + 0.6686X where Ŷ = expense and X = admissions.

4-27. Using Excel with Y = MPG, X1 = horsepower, and X2 = weight, the models are:

Ŷ = 53.87 − 0.269X1; r² = 0.77
Ŷ = 57.53 − 0.01X2; r² = 0.73.

Thus, the model with horsepower as the independent variable is better since r² is higher.

4-28. Ŷ = 57.69 − 0.17X1 − 0.005X2 where

Y = MPG

X1 = horsepower

X2 = weight

r² = 0.82.

This model is better because the coefficient of determination is much higher with both variables than it is with either one individually.

4-29. Let Y = MPG; X1 = horsepower; X2 = weight

The model Ŷ = b0 + b1X1 + b2X1² is Ŷ = 69.93 − 0.620X1 + 0.001747X1² and has r² = 0.798.

The model Ŷ = b0 + b3X2 + b4X2² is Ŷ = 89.09 − 0.0337X2 + 0.0000039X2² and has r² = 0.800.

The model Ŷ = b0 + b1X1 + b2X1² + b3X2 + b4X2² is Ŷ = 89.2 − 0.51X1 + 0.001889X1² − 0.01615X2 + 0.00000162X2² and has r² = 0.883. This model has a higher r² value than the model in 4-28. A graph of the data would show a nonlinear relationship.

4-30. If SAT median score alone is used to predict the cost, we get

Ŷ = −7793.1 + 21.8X1 with r² = 0.22.

If both SAT and a dummy variable (X2 = 1 for private, 0 otherwise) are used to predict the cost, we get r² = 0.79. The model is

Ŷ = 7121.8 + 5.16X1 + 9354.99X2.

This says that a private school tends to be about $9,355 more expensive than a public school when the median SAT score is used to adjust for the quality of the school. The coefficient of determination indicates that about 79% of the variability in cost can be explained by these factors. The model is significant at the 0.001 level.

4-31.

There is a significant relationship between the number of victories (Y) and the payroll (X) at the 0.054 level, which is marginally significant. However, r² = 0.24, so the relationship is not very strong. Only about 24% of the variability in victories is explained by this model.

4-32. a.

b.

c. The correlation coefficient for the first stock is only 0.19 while the correlation coefficient for the second is 0.96. Thus, there is a much stronger correlation between stock 2 and the DJI than there is for stock 1 and the DJI.

CASE STUDIES

SOLUTION TO NORTH–SOUTH AIRLINE CASE

Northern Airline Data

Year   Airframe Cost per Aircraft   Engine Cost per Aircraft   Average Age (Hours)
2001            51.80                       43.49                     6,512
2002            54.92                       38.58                     8,404
2003            69.70                       51.48                    11,077
2004            68.90                       58.72                    11,717
2005            63.72                       45.47                    13,275
2006            84.73                       50.26                    15,215
2007            78.74                       79.60                    18,390

Southeast Airline Data

Year   Airframe Cost per Aircraft   Engine Cost per Aircraft   Average Age (Hours)
2001            13.29                       18.86                     5,107
2002            25.15                       31.55                     8,145
2003            32.18                       40.43                     7,360
2004            31.78                       22.10                     5,773
2005            25.34                       19.69                     7,150
2006            32.78                       32.58                     9,364
2007            35.56                       38.07                     8,259

Utilizing QM for Windows, we can develop the following regression equations for the variables of interest.

Northern Airline—airframe maintenance cost:

Cost = 36.10 + 0.0025 (airframe age)
Coefficient of determination = 0.7694
Coefficient of correlation = 0.8771

Regression equations for Problems 4-31 and 4-32:

Ŷ = −31.54 + 0.0058X

Ŷ = 42.43 + 0.0004X

Ŷ = 67.8 + 0.0145X


Northern Airline—engine maintenance cost:

Cost = 20.57 + 0.0026 (airframe age)
Coefficient of determination = 0.6124
Coefficient of correlation = 0.7825

Southeast Airline—airframe maintenance cost:

Cost = 4.60 + 0.0032 (airframe age)
Coefficient of determination = 0.3904
Coefficient of correlation = 0.6248

Southeast Airline—engine maintenance cost:

Cost = −0.671 + 0.0041 (airframe age)
Coefficient of determination = 0.4599
Coefficient of correlation = 0.6782
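Students without access to QM for Windows can check one of these fits directly. The sketch below is a minimal Python version of simple least squares applied to the Northern Airline airframe data from the table above; the function name simple_regression is ours, and the code is an illustration rather than the procedure QM for Windows uses. The intercept and coefficient of determination should come out close to the printed 36.10 and 0.7694.

```python
def simple_regression(x, y):
    """Return (b0, b1, r_squared) for the least squares line y = b0 + b1 * x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    syy = sum((yi - y_bar) ** 2 for yi in y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    r_squared = sxy ** 2 / (sxx * syy)
    return b0, b1, r_squared


# Northern Airline, 2001-2007: average airframe age (hours) and airframe cost
age = [6512, 8404, 11077, 11717, 13275, 15215, 18390]
airframe_cost = [51.80, 54.92, 69.70, 68.90, 63.72, 84.73, 78.74]

b0, b1, r_squared = simple_regression(age, airframe_cost)
print(round(b0, 2), round(b1, 4), round(r_squared, 4))
```

The same function can be run on the engine cost column or on the Southeast Airline data to compare the four fitted lines.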

The graphs below portray both the actual data and the regression lines for airframe and engine maintenance costs for both airlines. Note that the two graphs have been drawn to the same scale to facilitate comparisons between the two airlines.

Northern Airline: There seem to be modest correlations between maintenance costs and airframe age for Northern Airline. There is certainly reason to conclude, however, that airframe age is not the only important factor.

Southeast Airline: The relationships between maintenance costs and airframe age for Southeast Airline are much less well defined. It is even more obvious that airframe age is not the only important factor, and perhaps not even the most important factor.

Overall, it would seem that:

1. Northern Airline has the smallest variance in maintenance costs, indicating that the day-to-day management of maintenance is working pretty well.
2. Maintenance costs seem to be more a function of airline than of airframe age.
3. The airframe and engine maintenance costs for Southeast Airline are not only lower but more nearly similar than those for Northern Airline, but, from the graphs at least, appear to be rising more sharply with age.
4. From an overall perspective, it appears that Southeast Airline may perform more efficiently on sporadic or emergency repairs, and Northern Airline may place more emphasis on preventive maintenance.

Ms. Young’s report should conclude that:

1. There is evidence to suggest that maintenance costs could be made to be a function of airframe age by implementing more effective management practices.
2. The difference between maintenance procedures of the two airlines should be investigated.
3. The data with which she is presently working do not provide conclusive results.

[Figure: Northern Airline. Airframe and engine cost ($) versus average airframe age (thousands).]

[Figure: Southeast Airline. Airframe and engine cost ($) versus average airframe age (thousands).]