Upload
others
View
9
Download
0
Embed Size (px)
Citation preview
Chapter 10 Re-expressing Data: It’s Easier Than You Think 145
Chapter 10 – Re-expressing Data:
It’s Easier Than You Think
1. Models.
a)
e)
b) c) d)
2. More models.
a) b) c)
d) e)
3. Gas mileage.
a) The association between weight and gas mileage of cars is fairly linear, strong, andnegative. Heavier cars tend to have lower gas mileage.
b) For each additional thousand pounds of weight, the linear model predicts a decrease of7.652 miles per gallon in gas mileage.
c) The linear model is not appropriate. There is a curved pattern in the residuals plot. Themodel tends to underestimate gas mileage for cars with relatively low and high gasmileages, and overestimates the gas mileage of cars with average gas mileage.
4. Crowdedness.
a) The scatterplot shows that the relationship between Crowdedness and GDP is strong,negative, and curved. Re-expression may yield an association that is more linear.
b) Start with logs, since GDP is non-negative. A plot of the log of GDP against Crowdednessscore may be straighter.
ˆ . .
ˆ . . ( )
ˆ .
y x
y
y
= += +=
1 2 0 81 2 0 8 22 8
ln ˆ . .
ln ˆ . . ( )
ln ˆ .
ˆ ..
y x
y
y
y e
= += +=
= =
1 2 0 81 2 0 8 22 8
16 442 8
ˆ . .
ˆ . . ( )
ˆ .
ˆ . .
y x
y
y
y
= +
= +
=
= =
1 2 0 8
1 2 0 8 2
2 8
2 8 7 842
11 2 0 8
11 2 0 8 2
12 8
12 8
0 36
ˆ. .
ˆ. . ( )
ˆ.
ˆ.
.
yx
y
y
y
= +
= +
=
= =
ˆ . ( )
ˆ .
.y
y
==
1 2 22 09
0 8
ˆ . . log
ˆ . . log( )
ˆ .
y x
y
y
= += +=
1 2 0 81 2 0 8 21 44
log ˆ . .
log ˆ . . ( )
log ˆ .
ˆ ..
y x
y
y
y
= += +=
= =
1 2 0 81 2 0 8 22 8
10 630 962 8
ˆ . .
ˆ . .
ˆ .
y x
y
y
= +
= +=
1 2 0 8
1 2 0 8 22 33
ˆ . ( . )
ˆ . ( . )
ˆ .
y
y
y
x=
==
1 2 0 8
1 2 0 80 77
2
ˆ . .
ˆ . ( ) . ( )
ˆ .
y x x
y
y
= + +
= + +=
0 8 1 2 1
0 8 2 1 2 2 16 6
2
2
146 Part II Exploring Relationships Between Variables
5. Gas mileage revisited.
a) The residuals plot for the re-expressed relationship is much more scattered. This is anindication of an appropriate model.
b) The linear model that predicts the number of gallons per 100 miles in gas mileage from theweight of a car is: Gal Weightˆ / . .100 0 625 1 178= + ( ) .
c) For each additional 1000 pounds of weight, the model predicts that the car will require anadditional 1.178 gallons to drive 100 miles.
d)
Gal Weight
Gal
Gal
ˆ / . .
ˆ / . . .
ˆ / .
100 0 625 1 178
100 0 625 1 178 3 5
100 4 748
= + ( )= + ( )=
6. Crowdedness again.
a) This re-expression is not useful. The student has gone too far down the ladder of powers.We now see marked downward curvature and increasing scatter.
b) The next step would be to try a “weaker” re-expression, like reciprocal square root or log ofGDP. Having gone too far, the student should move back “up” the ladder of powers.
7. Catapults.
D W
D
D
=
= ×=
1 1
1 1 3 7 89531 59
13
13
.
. ( , )
.
8. Pressure.
The scatterplot at the right shows a strong,curved, negative association between the heightof the cylinder and the pressure inside. Becauseof the curved form of the association, a linearmodel is not appropriate.
According to the model, a car that weighs 3500pounds (3.5 thousand pounds) is expected to requireapproximately 4.748 gallons to drive 100 miles, or0.04748 gallons per mile.
This is 1
0 0474821 06
..≈ miles per gallon.
40
60
80
100
120
20 30 40
Boyle’s Pressure and Volume
Pres
sure
Height(# of marked spaces)
According to the model, the torsion spring hole ofArchimedes’ catapult was about 31.59 dactyls in diameter.
Chapter 10 Re-expressing Data: It’s Easier Than You Think 147
Re-expressing the pressure as the reciprocal of the pressure produces a scatterplot (belowthe first) that is much straighter. Computer regression output for the height versus thereciprocal of pressure is below.
The reciprocal re-expression is very straight (perfectly straight, as far as the statisticalsoftware is concerned!). R2= 100%, meaning that 100% of the variability in the reciprocal ofpressure is explained by the model. The equation of the model is:
10 000077 0 000713
PressureHeight
ˆ. .= − + ( ) .
Re-expressing each variable using logarithms is also a good model. TI-83 regressionoutput for the log-log re-expression is given below.
There is a strong, negative, linear relationship between log(Pressure) and log(Height).
log( ˆ ) . . (log( ))Pressure Height= −3 150 1 001 models the situation well, explaining nearly 100%of the variability in the logarithm of pressure. The residuals plot is fairly scattered,indicating an appropriate model.
0.0150
0.0225
0.0300
20 30 40
1/Pr
essu
re
Height (inches)
Height vs.Reciprocal of PressureDependent variable is:
No Selectorrecip pressure
R squared = 100.0% R squared (adjusted) = 100.0%s = 0.0001 with 12 - 2 = 10 degrees of freedom
Source
RegressionResidual
Sum of Squares
0.0008410.000000
df
11 0
Mean Square
0.0008410.000000
F - r a t i o
75241
Var iable
ConstantHeight
Coefficient
-7.66970e-5 7.13072e-4
s.e. of Coeff
0.00010.0000
t - r a t i o
-0.982274
prob
0.3494 ≤ 0.0001
Scatterplot of Log(pressure) Regression Output Residuals Plot vs. Log(height)
148 Part II Exploring Relationships Between Variables
9. Brakes.
a) The association between speed and stopping distance is strong, positive, and appearsstraight. Higher speeds are generally associated with greater stopping distances. Thelinear regression model, with equation Distance Speedˆ . .= − + ( )65 9 5 98 , has R2 = 96.9%,meaning that the model explains 96.9% of the variability in stopping distance. However,the residuals plot has a curved pattern. The linear model is not appropriate. A modelusing re-expressed variables should be used.
b) Stopping distances appear to be relatively higher for higher speeds. This increase in therate of change might be able to be straightened by taking the square root of the responsevariable, stopping distance. The scatterplot of Speed versus Distance seems like it mightbe a bit straighter.
c) The model for the re-expressed data is Dist̂ance . .= + ( )3 303 0 235 Speed . The residuals plotshows no pattern, and R2= 98.4%, so 98.4% of the variability in the square root of thestopping distance can be explained by the model.
75
150
225
300
20 30 40 50
Speed (mph)
Stopping Distance and Speed
Stop
ping
Dis
tanc
e
-15
0
15
100 150 200 250
Predicted Distance(feet)
Res
idua
l
Residuals Plot
10.0
12.5
15.0
17.5
20 30 40 50
Speed (mph)
Squa
re R
oot o
f
Square RootRe-expression
-0.6
-0.3
0.0
0.3
10 12 14 16
Residuals PlotSquare Root Re-expression
Predicted
Res
idua
l
Chapter 10 Re-expressing Data: It’s Easier Than You Think 149
d)
Distance Speed
Distance
Distance
Distance
ˆ . .
ˆ . .
ˆ .
ˆ . .
= + ( )= + ( )=
= ≈
3 303 0 235
3 303 0 235 55
16 228
16 228 263 42
e)
Distance Speed
Distance
Distance
Distance
ˆ . .
ˆ . .
ˆ .
ˆ . .
= + ( )= + ( )=
= ≈
3 303 0 235
3 303 0 235 70
19 753
19 753 390 22
f) The level of confidence in the predictions should be quite high. R2 is high, and theresiduals plot is scattered. The prediction for 70 mph is a bit of an extrapolation, butshould still be reasonably close.
10. Pendulum.
a) The scatterplot shows the association between thelength of string and the number of swings apendulum took every 20 seconds to be strong,negative, and curved. A pendulum with a longerstring tended to take fewer swings in 20 seconds.The linear model is not appropriate, because theassociation is curved.
b) Curvature in a negative relationship sometimes isan indication of a reciprocal relationship. Try re-expressing the response variable with the reciprocal.
According to the model, a car traveling 55 mph isexpected to require approximately 263.4 feet to cometo a stop.
According to the model, a car traveling 70 mph isexpected to require approximately 390.2 feet to cometo a stop.
10.0
12.5
15.0
17.5
20.0
7.5 15.0 22.5 30.0
Length (inches)
Num
ber
of S
win
gs
Pendulum
0.0500
0.0625
0.0750
0.0875
0.1000
7.5 15.0 22.5 30.0
Length (inches)
Reciprocal Re-expression
(1/#
of s
win
gs)
-0.0015
0.0000
0.0015
0.0030
0.060 0.075 0.090
Res
idua
ls
Residuals PlotReciprocal Re-expression
Predicted
150 Part II Exploring Relationships Between Variables
c) The reciprocal re-expression yields the model 1
0 0367 0 00176Swi ngs
Lengthˆ . . ( )= + . The
residuals plot is scattered, and R2 = 98.1%, indicating that the model explains 98.1% of thevariability in the reciprocal of the number of swings. The model is both appropriate andaccurate.
Re-expressing each variable with logarithms also results in a good model, and has theadded benefit of maintaining the direction of the relationship between number of swingsand the length of the string. TI-83 regression output for the log- log re-expression is givenbelow.
There is a strong, negative, linear relationship between log(swings) and log(length).
log( ˆ ) . . (log( ))Swi ngs Length= −1 721 0 453 explains 99% of the variability in log(swings), andthe residuals plot is scattered. The log-log re-expression is useful, as well.
d) Using the reciprocal model:
10 0367 0 00176 0 0367 0 00176 4 0 04374
Swi ngsLengthˆ . . ( ) . . ( ) .= + = + =
Swi ngsˆ.
.= ≈10 04374
22 9
According to the reciprocal model, a pendulum with a 4” string is expected to swingapproximately 22.9 times in 20 seconds.
Using the log-log model:
To make estimates involving logs, it is a good idea to use as much accuracy as possible. Ifthe equation of the model is stored in the calculator, the model predicts that a pendulumwith a 4” string will swing approximately 28 times in 20 seconds.
e) Using the reciprocal model:
10 0367 0 00176 0 0367 0 00176 48 0 12118
Swi ngsLengthˆ . . ( ) . . ( ) .= + = + =
Swi ngsˆ.
.= ≈10 12118
8 3. The model predicts 8.3 swings in 20 seconds for a 48” string.
Scatterplot of Log(swings) Regression Output Residuals Plot vs. Log(length)
Chapter 10 Re-expressing Data: It’s Easier Than You Think 151
Using the log-log model:
To make estimates involving logs, it is a good idea to use as much accuracy as possible. Ifthe equation of the model is stored in the calculator, the model predicts that a pendulumwith a 48” string will swing approximately 9.1 times in 20 seconds.
f) Confidence in the predictions is fairly high. The model is appropriate, as indicated by thescattered residuals plot, and accurate, indicated by the high value of R2 . The only concernis the fact that these predictions are slight extrapolations. The lengths of the strings aren’ttoo far outside the range of the data, so the predictions should be reasonably accurate.
11. Baseball salaries.
a) The scatterplot of year versus highest salary ismoderately strong, positive and curved. Thehighest salary has generally increased over theyears. Since the scatterplot shows a curvedrelationship, the linear model is not appropriate.Using the logarithm re-expression on thesalaries seems to straighten the scatterplotconsiderably.
b) log( ˆ ) . . ( )Salary Year= − +113 473 0 05734 is a good mode for predicting the highest salary fromthe year. The residuals plot is scattered and R2 = 84.9%, indicating that 84.9% of thevariability in highest salary can be explained by the model.
c) log( ˆ ) . . ( ) . . ( ) .Salary Year= − + = − + =113 473 0 05734 113 473 0 05734 2005 1 4937
Salaryˆ ..= ≈10 31 21 4937
According to the model, the highest salary in baseball is expected to be approximately 31.2million dollars. (Be careful with decimal places here. When the slope is small and thevalue of the explanatory variable is large, like a year, too much rounding of the slope canbe detrimental to the accuracy of predictions. When using logs, this is an even biggerproblem!)
7.5
15.0
22.5
1980 1985 1990 1995
YearSa
lary
Baseball’s Highest Salaries
0.0
0.3
0.6
0.9
1.2
1980 1985 1990 1995
Year
Log(
Sala
ry)
Log Re-expression
Res
idua
ls
-0.15
0.00
0.15
0.30
0.3 0.6 0.9 1.2
Predicted
Residuals PlotLog Re-expression
152 Part II Exploring Relationships Between Variables
12. Planet distances and years.
a) The association between distance from the sunand planet year is strong, positive, and curvedconcave upward. Generally, planets fartherfrom the sun have longer years than closerplanets.
b) The rate of change in length of year per unit distance appears to be increasing, but notexponentially. Re-expressing with the logarithm of each variable may straighten a plotsuch as this. The scatterplot and residuals plot for the linear model relating log(Distance)and log(Length of Year) appear below.
The regression model for the log-log re-expression is :log( ˆ ) . . (log( ))Length Distance= − +2 955 1 501 .
c) R2 = 100%, so the model explains 100% of the variability in the log of the length of theplanetary year, at least according to the accuracy of the statistical software. The residualsplot is scattered, and the residuals are all extremely small. This is a very accurate model.
50
100
150
200
750 1500 2250 3000
Distance from the sun(millions of miles)
Leng
th o
f Yea
r(E
arth
yea
rs)
Planetary Distance andLength of Year
-0.00
0.75
1.50
2.25
2.0 2.5 3.0 3.5
log Distance
log Length
Log-Log Re-expression
-0.0004
0.0000
0.0004
0.0008
-0.00 0.75 1.50
Residuals Plot
Res
idua
l
Predicted
Chapter 10 Re-expressing Data: It’s Easier Than You Think 153
13. Planet distances and order.
a) The association between planetary position and distance from the sun is strong, positive,and curved (below left). A good re-expression of the data is position versus Log(Distance).The scatterplot with regression line (below center) shows the straightened association. Theequation of the model is log( ˆ ) . . ( )Distance Position= +1 245 0 271 . The residuals plot (belowright) may have some pattern, but after trying several re-expressions, this is the best thatcan be done. R2= 98.2%, so the model explains 98.2% of the variability in the log of theplanets distance from the sun.
b) At first glance, this model appears to provide little evidence to support the contention ofthe astronomers. Pluto appears to fit the pattern, although Pluto’s distance from the sun isa bit less than expected. A model generated without Pluto does not have a dramaticallyimproved residuals plot, does not have a significantly higher R2, nor a different slope.Pluto does not appear to be influential.
But don’t forget that a logarithmic scale is being used for the vertical axis. The higher upthe vertical axis you go, the greater the effect of a small change.
log( ˆ ) . . ( )
log( ˆ ) . . ( )
log( ˆ ) .
ˆ .
Distance Position
Distance
Distance
Distance
= += +=
= ≈
1 24521 0 270922
1 24521 0 270922 9
3 683508
10 48253 683508
Pluto doesn’t fit the pattern for position and distance in the solar system. In fact, the modelmade with Pluto included isn’t a good one, because Pluto influences those predictions. Themodel without Pluto, log( ˆ ) . . ( )Distance Position= +1 20259 0 283706 , works much better. Ithas a high R2 , and scattered residuals plot. This new model predicts that the 9th planetshould be a whopping 5701 million miles away from the sun! There is evidence that theastronomers are correct. Pluto doesn’t behave like planet in its relation to position anddistance.
750
1500
2250
3000
2 3 4 5 6 7 8 9
Position Number
Dis
tanc
e fr
om s
un
Planetary Positionand Distance
2.1
2.6
3.1
3.6
1 2 3 4 5 6 7 8 9
Position Number
Position vs.Log(Distance)
Log
(Dis
tanc
e)
-0.150
-0.075
0.000
0.075
2.0 2.5 3.0 3.5
Residuals PlotLog Re-expression
Res
idua
l
Predicted
According to the model, the 9th planet inthe solar system is predicted to beapproximately 4825 million miles awayfrom the sun. Pluto is actually 3666million miles away.
154 Part II Exploring Relationships Between Variables
14. Planets, part 3.
Using the revised planetary numbering system, and straightening the scatterplot using thesame methods as in Exercise 9, the new model, log( ˆ ) . . ( )Distance Position= +1 3058 0 232923 ,is a slightly better fit. The residuals plot is more scattered, and R2 is slightly higher, withthe improved model explaining 99.4% of the variability in the log of distance from the sun.
Pluto still doesn’t fit very well. The new model predicts that Pluto, as 10th planet, shouldbe about 4315 million miles away. That’s about 650 million miles farther away. A bettermodel yet is log( ˆ ) . . ( )Distance Position= +1 28514 0 238826 , a model made with the newnumbering system and with Pluto omitted.
15. Quaoar, planets, part 4.
a)
log( ˆ ) . . ( )
log( ˆ ) . . ( )
log( ˆ ) .
ˆ .
Distance Position
Distance
Distance
Distance
= += +=
= ≈
1 20259 0 283706
1 20259 0 283706 9
3 755944
10 57013 755944
b) With Quaoar as the 9th planet, the model that relates planetary position to distance islog( ˆ ) . . ( )Distance Position= +1 23679 0 273447 .
log( ˆ ) . . ( )
log( ˆ ) . . ( )
log( ˆ ) .
ˆ .
Distance Position
Distance
Distance
Distance
= += +=
= ≈
1 23679 0 273447
1 23679 0 273447 9
3 697813
10 49873 697813
16. Models and laws, planets part 5
The re-expressed data relating distance and year length (Exercise 8), are better described bytheir model than the re-expressed data relating position and distance (Exercises 9-11). Themodel relating distance and year length has R2 = 100%, and a very scattered residuals plot(with miniscule residuals), possibly a natural “law”. If planets in another solar systemfollowed the Titius-Bode pattern, this belief would be reinforced. Similarly, if data were
750
1500
2250
3000
3750
2 4 6 8 10
New Positionand Distance
Dis
tanc
e fr
om s
un
Position Number
Position vsLog(Distance)
Log(
Dis
tanc
e)
Position Number
2.1
2.6
3.1
3.6
2 4 6 8 10
-0.05
0.00
0.05
2.0 2.5 3.0 3.5
Res
idua
l
Predicted
Residuals PlotLog Re-expression
Using the model predicting distance fromthe sun from position, the 9th planet isexpected to be about 5701 million milesfrom the sun. At 4000 million miles,Quaoar is a better fit than Pluto, but stillnot a good fit.
According to the new model, the 9th planetis predicted to be 4987 million miles fromthe sun. Quaoar, at approximately 4000million miles, is a better fit than Pluto, butdoesn’t appear to be an additional planet.
Chapter 10 Re-expressing Data: It’s Easier Than You Think 155
acquired from planets in another solar system that did not follow this pattern, we would beunlikely to think that this relationship was a universal law.
17 – 22. Curved models.
17. (Exercise 8) Pressure.
The power model, Pressure Heightˆ . ( ) .= −1413 3 1 001 , works well. The model explains 99.99%of the variability in pressure, the residuals are very small, and the residuals plot shows nostrong pattern.
18. (Exercise 9) Brakes.
The quadratic model, Distance Speed Speedˆ . ( ) . ( ) .= + +0 05 1 96 4 42 , works well. The modelexplains 97.9% of the variability in stopping distance and the residuals plot shows nostrong pattern, although stopping distance is harder to predict accurately at high speeds.
Scatterplot of Pressure vs. Regression Output Residuals Plot Height
Scatterplot of Stopping Regression Output Residuals Plot Distance vs. Speed
156 Part II Exploring Relationships Between Variables
19. (Exercise 10) Pendulum.
The power model, Swi ngs Lengthˆ . ( ) .= −52 547 0 453 , works well. The model explains 99.01% ofthe variability in the number of swings in 20 seconds, and the residuals plot shows littlepattern.
20. (Exercise 11) Baseball salaries.
Express the years in years since 1900. For example, 1980 = 80.
The exponential model, Sal ary Yearˆ ( . )( . )= 0 0000297 1 14 , works well. The model explains96.22% of the variability in salary, and the residuals plot shows no pattern. The 25.3million dollar salary of Alex Rodriguez is much higher than the model predicts.
21. (Exercise 12) Planet distances and years.
Scatterplot of Number of Regression Output Residuals Plot Swings vs. Length
Scatterplot of Salary Regression Output Residuals Plot vs. Years since 1900
Scatterplot of Length of Year Regression Output Residuals Plot vs. Distance from Sun
Chapter 10 Re-expressing Data: It’s Easier Than You Think 157
The power model, Year Distanceˆ . ( ) .= 0 00111 1 501, works well. The model explains nearly100% of the variability in the length of the year and the residuals plot shows not apparentpattern.
22. (Exercise 13) Planet distances and order.
The exponential model, Distance Positionˆ . ( . )= 17 598 1 866 , is the best fitting curved model. Themodel explains 98.15% of the variability in distance from the Sun. The pattern in theresiduals seems to be associated with the inclusion of Pluto, supporting the theory thatPluto may not be a true planet.
The quadratic model, Distance Position Positionˆ . ( ) . ( ) .= − +88 14 434 63 487 172 , fits the givendata well, but doesn’t seem to be appropriate. Although the model explains 99.42% of thevariability in distance from the Sun and the residuals plot is scattered, the model has aminimum around the second planet. The model predicts that the first planet is furtherfrom the Sun than the second and third planets! That doesn’t seem reasonable.
Scatterplot of Distance from Regression Output Residuals Plot Sun vs. Position
Scatterplot of Distance from Regression Output Residuals Plot Sun vs. Position
158 Part II Exploring Relationships Between Variables
23. Logs (not logarithms).
a) The association between the diameter of a log and thenumber of board feet of lumber is strong, positive, andcurved. As the diameter of the log increases, so does thenumber of board feet of lumber contained in the log.
The model used to generate the table used by the logbuyers is based upon a square root re-expression. Thevalues in the table correspond exactly to the model
BoardFeet Diameterˆ = − +4 .
b)
BoardFeet Diameter
BoardFeet
BoardFeet
BoardFeet
ˆ
ˆ ( )
ˆ
ˆ
= − +
= − +
=
=
4
4 10
6
36c)
BoardFeet Diameter
BoardFeet
BoardFeet
BoardFeet
ˆ
ˆ ( )
ˆ
ˆ
= − +
= − +
=
=
4
4 36
32
1024Normally, we would be cautious of this prediction, because it is an extrapolation beyondthe given data, but since this is a prediction made from an exact model based on thevolume of the log, the prediction will be accurate.
24. Weight lifting.
a) The association between weight class and weight liftedfor gold medal winners in weightlifting at the 2000Olympics is strong, positive, and curved. The linearmodel that best fits the data isLi ft WeightClassˆ . . ( )= +179 918 2 401 .
Although this model explains 96.9% of the variability inweight lifted, it does not fit the data well.
125
250
375
500
8 12 16 20 24
# of
boa
rd fe
et
Diameter of log (inches)
Doyle Log Scale
4
8
12
16
20
24
8 12 16 20 24
Diameter of log (inches)Sq
uare
roo
t of #
of b
dft
Square root re-expressionAccording to themodel, a log 10” indiameter is expectedto contain 36 boardfeet of lumber.
According to themodel, a log 36” indiameter is expectedto contain 1024 boardfeet of lumber.
Men’s Weightlifting2000 Olympics
325
350
375
400
425
62.5 75.0 87.5 100.0
Weight Class (kg)
Chapter 10 Re-expressing Data: It’s Easier Than You Think 159
b) The residuals plot for the linear model shows a curvedpattern, indicating that the linear model has failed tomodel the association well. A re-expressed model or acurved model might fit the association between weightclass and weight lifted better than the linear model.
c) A first attempt at re-expressing the data might be one of the logarithmic re-expressions, orthe exponential model from the TI-83. These models do not decrease the amount ofcurvature in the residuals plot, and seem to fit no better than the linear model. Thequadratic model, Li ft WeightClass WeightClassˆ . ( ) . ( ) .= − + +0 025 6 42 25 592 , fits well.
d) The quadratic model is a better model, because it has a residuals plot that shows lesspattern than the residuals plot for the linear model. Also, the quadratic model explains99.1% of the variability in weight lifted. Care must be taken in using this model, however.The presence of one unusual point, the lift of 357.5 kg in the 69 kg weight class seems to beaffecting the model.
e) Boevski, from Bulgaria, lifted 357.5 kg. His lift does not fit the pattern established by theother lifts. The lift in this weight class is predicted to be 349.2 kg. According to the model,Boevski lifted 8.3 kg (over 18 pounds) more than expected.
-5
0
5
10
330 360 390 420
Res
idua
l
Predicted Lift (kg)
Residuals Plot
Scatterplot of Lift Regression Output Residuals Plot vs. Weight class
160 Part II Exploring Relationships Between Variables
25. Life expectancy.
The association between year and life expectancy isstrong, curved and positive. As the years passed, lifeexpectancy for white males has increased. The bend(in mathematical language, the change in curvature)in the scatterplot makes it impossible to straighten theassociation by re-expression.If all of the data points are to be used, the linearmodel, LifeExp Yearˆ . . ( )= − +454 938 0 265697 isprobably the best model we can make. It doesn’t fitthe data well, but accounts for 93.6% of the variabilityin life expectancy.
In order to make better future predictions, we mightuse only a subset of the data points. The scatterplot isvery linear from 1970 to 2000, and these recent yearsare likely to be more indicative of the current trend inlife expectancy. Using only these four data points, wemight be able to make more accurate futurepredictions.The model LifeExp Yearˆ . . ( )= − +379 020 0 227 explains99.6% of the variability in life expectancy, and has ascattered residuals plot. Great care should be taken inusing this model for predictions, since it is basedupon only 4 points, and, as always, prediction into thefuture is risky. This is especially true since the response variable is life expectancy, whichcannot be expected to increase at the same rate indefinitely.
26. Lifting more weight.
a) Without Boevski, the quadratic model is: Li ft WeightClass WeightClassˆ . . .= − ( ) + ( ) +0 02 5 8 45 52
b) According to the quadratic model, the winner of the 69 kg weight class was predicted to lift346.47 kg.
c) Since Boevski actually lifted 357.5 kg, his residual was 357.5 – 346.47 = 11.03. He lifted 11kg (over 24 pounds) more than predicted!
27. Slower is cheaper?
The association between speed and mileage is strong andcurved. The linear model, Mileage Speedˆ . . ( )= −32 20 0 097 , is aterrible fit. For speeds under 50 miles per hour, higher speedsare associated with higher mileage. For speeds in excess of 50mph, higher speeds are associated with lower mileage. Forthis reason, the scatterplot cannot be straightened by themethods used in this chapter. A curved model may be abetter fit.
Life
Exp
ecta
ncy
(yea
rs)
52.5
60.0
67.5
75.0
1900 1940 1980
Year
Life ExpectancyWhite Males
69.0
70.5
72.0
73.5
1980 1990 2000
Year
Life ExpectancyWhite Males
Life
Exp
ecta
ncy
(yea
rs)
Mile
s pe
r ga
llon
Speed and Mileage
24.0
25.5
27.0
28.5
40 50 60 70
Speed (MPH)
Chapter 10 Re-expressing Data: It’s Easier Than You Think 161
The quadratic model: Mileage Speed Speedˆ . ( ) . ( ) .= − + −0 114 1 158 0 4192 , works well.
Since the residuals plot scattered, the model is appropriate. The model is strong, sinceR2 96≈ % . This means that 96% of the variation in mileage is explained by the model.
Another logical modeling decision is to divide the curve up into two linear models. Theassociation seems fairly linear from 35—50 mph, and also fairly linear from 55—75 mph.
For lower speeds, the model isMileage Speedˆ . . ( )= +18 040 0 232 .The model has some pattern inthe residuals plot, but no re-expression seems to change it.The model explains 96.7% ofthe variability in mileage. Forlow speeds, the model predictsan increase of 0.232 miles pergallon for each additional mileper hour of speed.
For high speed, the model isMileage Speedˆ . . ( ).= −46 800 0 320The residuals plot isscattered and the modelexplains 99.1% of thevariability in mileage. Forhigher speeds, the modelpredicts a decrease inmileage of 0.320 miles pergallon for each additionalmile per hour of speed
Note that these models willnot produce accurate predictions about the gas mileage people may get in daily driving.People don’t drive at consistent speeds.
Mileage atLower Speeds
26
27
28
29
40 45 50
Speed (mph)
Mile
s pe
r ga
llon
-0.15
0.00
0.15
0.30
27 28 29
Predicted
Residuals Plot35—50 mph
24.0
25.5
27.0
28.5
55 60 65 70
Speed (mph)
Mile
s p
er g
allo
n
Mileage atHigher Speeds
-0.15
0.00
0.15
0.30
24.0 27.0
e
d
M
Residuals Plot55—75 mph
Res
idua
l
Predicted
Scatterplot of Mileage Regression Output Residuals Plot vs. Speed
162 Part II Exploring Relationships Between Variables
28. Orange production.
The association between thenumber of oranges per tree andthe average weight of theoranges is strong, negative, andappears linear at first look.Generally, trees that containlarger numbers of oranges havelower average weight perorange. The linear model maylook appropriate, until theresiduals plot is examined.Even though the linear model isa close fit, the residuals plotshows a strong curved pattern.The data should be re-expressed.
Plotting the number of oranges pertree and the reciprocal of theaverage weight per orangestraightens the relationshipconsiderably. The residuals plotshows little pattern and the valueof R2indicates that the modelexplains 99.8% of the variability inthe reciprocal of the averageweight per orange. The moreappropriate model is:
11 603 0 00112
Ave wtOranges Tree
. ˆ. . (# / )= + .
29. Years to live.
a) The association between the age and estimatedadditional years of life for black males is strong, curved,and negative. Older men generally have fewer estimatedyears of life remaining. The square root re-expression ofthe data, YearsLeft Ageˆ . . ( )= −8 39253 0 069553 , straightensthe data considerably, but has an extremely patternedresiduals plot. The model is not a mathematicallyappropriate model, but fits so closely that it should befine for predictions within the range of data. The modelexplains 99.8% of the variability in the estimated numberof additional years.
-0.0075
0.0000
0.0075
0.40 0.50
0.40
0.45
0.50
0.55
0.60
200 400 600 800
Ave
rage
Wei
ght (
lbs)
# of orangesper tree
Orange Production Residuals Plot
Res
idua
l
Predicted
1.75
2.00
2.25
2.50
200 600
ec
we
gh
ReciprocalRe-expression
1/(A
ve. W
eigh
t)
# of orangesper tree
-0.02
-0.01
0.00
0.01
0.02
1.75 2.25
Residuals Plot
Predicted
15
30
45
60
0 20 40 60 80
Age
Number of Years LeftBlack Males
Yea
rs le
ft
Chapter 10 Re-expressing Data: It’s Easier Than You Think 163
Another good model is the quadratic model, YearsLeft Age Ageˆ . ( ) . ( ) .= − +0 0052 1 21 71 572 .This model explains 99.9% of the variation in the number of years left and although theresiduals plot shows a pattern, the residuals are very small. We don’t have any need toextrapolate, so this model should work well.
b) Using the square root re-expression:
YearsLeft Age
YearsLeft
YearsLeft
YearsLeft
ˆ . . ( )
ˆ . . ( )
ˆ .
ˆ . .
= −
= −
=
= ≈
8 39253 0 069553
8 39253 0 069553 18
7 140936
7 140936 50 992
Using the quadratic model:
If the quadratic model, estimated as YearsLeft Age Ageˆ . ( ) . ( ) .= − +0 0052 1 21 71 572 , is stored inthe memory of the calculator, an 18-year-old black male is expected to live another 51.43years, for a total age of about 69.4 years.
-0.1
0.0
0.1
0.2
2 4 6 8
3.0
4.5
6.0
7.5
0 20 40 60 80
Age
Square RootRe-expression
Squa
re R
oot
Predicted
Res
idua
l
Residuals Plot
According to the model, an 18-year-oldblack male is expected to live and additional50.99 years, for a total age of 68.99 years.
Scatterplot of Years Left Regression Output Residuals Plot vs. Age
164 Part II Exploring Relationships Between Variables
30. Oil production (again).
The association between the year and thenumber of barrels of oil is strong and curved.Oil production generally increased until themid-1980s, and then decreased until the year2000. Because of the association reversesdirection, no re-expression of the y-variablewill straighten this scatterplot. This abruptchange in direction also makes it unlikelythat any model will be able to make accuratepredictions about the future. Oil productionis falling now, but we have no indication thatthis trend will continue. Very little faithshould be placed in predictions made from any model.
31. Internet.
a) The association between the year and the numberof electronic journals is strong, positive, and curved. Thenumber of electronic journals rapidly increased in the lastseveral years. Clearly, the linear model would not beappropriate for modeling these data.
A good model is the log re-expression Log -( ˆ ) . . ( )Journals Year= +686 763 0 345547 . Theresiduals plot is scattered, and the value of R2indicates that the model explains 95.9% ofthe variability in the number of journals.
2000000
2400000
2800000
3200000
1950 1965 1980 1995
United States Oil Production
Oil
Prod
uctio
n(1
000s
of b
arre
ls)
Year
500
1000
1500
2000
1992 1994 1996
Year
# of
Jour
nals
Internet Journals
1.9
2.4
2.9
3.4
1992 1994 1996
Year
Log Re-expression
Log(
# of
Jour
nals
)
-0.250
-0.125
0.000
0.125
1.5 2.0 2.5 3.0
Residuals Plot
Predicted
Chapter 10 Re-expressing Data: It’s Easier Than You Think 165
b)
Log -Log -Log
( ˆ ) . . ( )
( ˆ ) . . ( )
( ˆ ) .
ˆ ,.
Journals Year
Journals
Journals
Journals
= += +=
= ≈
686 763 0 345547686 763 0 345547 2000
4 334
10 21 4294 334
c) The estimate of 21,429 journals in 2000 seems very high, and seems unlikely to be accurate.The year 2000 is an extrapolation 3 years beyond the most recent year included in the dataset. The model is increasing exponentially, but it is quite unlikely that the number ofjournals could keep increasing exponentially.
32. Tree growth.
a) The association between ageand average diameter ofgrapefruit trees is strong,curved, and positive.Generally, older trees havelarger average diameters.
The linear model for this association, AverageDiameter Ageˆ . . ( )= +1 973 0 463 is notappropriate. The residuals plot shows a clear pattern.
Because of the change in curvature in the association, these data cannot be straightened byre-expression. The bend in the scatterplot may indicate a cubic relationship.
The cubic model, Diameter Age Age Ageˆ . ( ) . ( ) . ( ) .= − + −0 0023 0 083 1 31 0 1973 2 , is a great model.It explains 99.98% of the variability in tree diameter, and the residuals plot shows nopattern. It should work well for making estimates within the range of the data, and forsmall extrapolations. It would be risky to make estimates for very young or old trees.
b) If diameters from individual trees were given, instead of averages, the association wouldhave been weaker. Individual observations are more variable than averages.
According to the model, there are expectedto be approximately 21,429 electronicjournals in the year 2000.
4
6
8
10
4 8 12 16
Dia
met
er (i
nche
s)
Grapefruit Trees
-0.6
-0.3
0.0
0.3
4 6 8 10Predicted
Residuals Plot
Age (years)
Scatterplot of Diameter Regression Output Residuals Plot vs. Age