Chapter 10 – Re-expressing Data: It’s Easier Than You Think · 2012. 9. 17. · Chapter 10 Re-expressing Data: It’s Easier Than You Think 147 Re-expressing the pressure as the

Chapter 10 Re-expressing Data: It’s Easier Than You Think 145

Chapter 10 – Re-expressing Data:

It’s Easier Than You Think

1. Models.

a)

e)

b) c) d)

2. More models.

a) b) c)

d) e)

3. Gas mileage.

a) The association between weight and gas mileage of cars is fairly linear, strong, andnegative. Heavier cars tend to have lower gas mileage.

b) For each additional thousand pounds of weight, the linear model predicts a decrease of7.652 miles per gallon in gas mileage.

c) The linear model is not appropriate. There is a curved pattern in the residuals plot. Themodel tends to underestimate gas mileage for cars with relatively low and high gasmileages, and overestimates the gas mileage of cars with average gas mileage.

4. Crowdedness.

a) The scatterplot shows that the relationship between Crowdedness and GDP is strong,negative, and curved. Re-expression may yield an association that is more linear.

b) Start with logs, since GDP is non-negative. A plot of the log of GDP against Crowdednessscore may be straighter.

ˆ . .

ˆ . . ( )

ˆ .

y x

y

y

= += +=

1 2 0 81 2 0 8 22 8

ln ˆ . .

ln ˆ . . ( )

ln ˆ .

ˆ ..

y x

y

y

y e

= += +=

= =

1 2 0 81 2 0 8 22 8

16 442 8

ˆ . .

ˆ . . ( )

ˆ .

ˆ . .

y x

y

y

y

= +

= +

=

= =

1 2 0 8

1 2 0 8 2

2 8

2 8 7 842

11 2 0 8

11 2 0 8 2

12 8

12 8

0 36

ˆ. .

ˆ. . ( )

ˆ.

ˆ.

.

yx

y

y

y

= +

= +

=

= =

ˆ . ( )

ˆ .

.y

y

==

1 2 22 09

0 8

ˆ . . log

ˆ . . log( )

ˆ .

y x

y

y

= += +=

1 2 0 81 2 0 8 21 44

log ˆ . .

log ˆ . . ( )

log ˆ .

ˆ ..

y x

y

y

y

= += +=

= =

1 2 0 81 2 0 8 22 8

10 630 962 8

ˆ . .

ˆ . .

ˆ .

y x

y

y

= +

= +=

1 2 0 8

1 2 0 8 22 33

ˆ . ( . )

ˆ . ( . )

ˆ .

y

y

y

x=

==

1 2 0 8

1 2 0 80 77

2

ˆ . .

ˆ . ( ) . ( )

ˆ .

y x x

y

y

= + +

= + +=

0 8 1 2 1

0 8 2 1 2 2 16 6

2

2

146 Part II Exploring Relationships Between Variables

5. Gas mileage revisited.

a) The residuals plot for the re-expressed relationship is much more scattered. This is anindication of an appropriate model.

b) The linear model that predicts the number of gallons per 100 miles in gas mileage from theweight of a car is: Gal Weightˆ / . .100 0 625 1 178= + ( ) .

c) For each additional 1000 pounds of weight, the model predicts that the car will require anadditional 1.178 gallons to drive 100 miles.

d)

Gal Weight

Gal

Gal

ˆ / . .

ˆ / . . .

ˆ / .

100 0 625 1 178

100 0 625 1 178 3 5

100 4 748

= + ( )= + ( )=

6. Crowdedness again.

a) This re-expression is not useful. The student has gone too far down the ladder of powers.We now see marked downward curvature and increasing scatter.

b) The next step would be to try a “weaker” re-expression, like reciprocal square root or log ofGDP. Having gone too far, the student should move back “up” the ladder of powers.

7. Catapults.

D W

D

D

=

= ×=

1 1

1 1 3 7 89531 59

13

13

.

. ( , )

.

8. Pressure.

The scatterplot at the right shows a strong,curved, negative association between the heightof the cylinder and the pressure inside. Becauseof the curved form of the association, a linearmodel is not appropriate.

According to the model, a car that weighs 3500pounds (3.5 thousand pounds) is expected to requireapproximately 4.748 gallons to drive 100 miles, or0.04748 gallons per mile.

This is 1

0 0474821 06

..≈ miles per gallon.

40

60

80

100

120

20 30 40

Boyle’s Pressure and Volume

Pres

sure

Height(# of marked spaces)

According to the model, the torsion spring hole ofArchimedes’ catapult was about 31.59 dactyls in diameter.


Re-expressing the pressure as the reciprocal of the pressure produces a scatterplot (belowthe first) that is much straighter. Computer regression output for the height versus thereciprocal of pressure is below.

The reciprocal re-expression is very straight (perfectly straight, as far as the statisticalsoftware is concerned!). R2= 100%, meaning that 100% of the variability in the reciprocal ofpressure is explained by the model. The equation of the model is:

10 000077 0 000713

PressureHeight

ˆ. .= − + ( ) .

Re-expressing each variable using logarithms is also a good model. TI-83 regressionoutput for the log-log re-expression is given below.

There is a strong, negative, linear relationship between log(Pressure) and log(Height).

log( ˆ ) . . (log( ))Pressure Height= −3 150 1 001 models the situation well, explaining nearly 100%of the variability in the logarithm of pressure. The residuals plot is fairly scattered,indicating an appropriate model.

0.0150

0.0225

0.0300

20 30 40

1/Pr

essu

re

Height (inches)

Height vs.Reciprocal of PressureDependent variable is:

No Selectorrecip pressure

R squared = 100.0% R squared (adjusted) = 100.0%s = 0.0001 with 12 - 2 = 10 degrees of freedom

Source

RegressionResidual

Sum of Squares

0.0008410.000000

df

11 0

Mean Square

0.0008410.000000

F - r a t i o

75241

Var iable

ConstantHeight

Coefficient

-7.66970e-5 7.13072e-4

s.e. of Coeff

0.00010.0000

t - r a t i o

-0.982274

prob

0.3494 ≤ 0.0001

Scatterplot of Log(pressure) Regression Output Residuals Plot vs. Log(height)


9. Brakes.

a) The association between speed and stopping distance is strong, positive, and appearsstraight. Higher speeds are generally associated with greater stopping distances. Thelinear regression model, with equation Distance Speedˆ . .= − + ( )65 9 5 98 , has R2 = 96.9%,meaning that the model explains 96.9% of the variability in stopping distance. However,the residuals plot has a curved pattern. The linear model is not appropriate. A modelusing re-expressed variables should be used.

b) Stopping distances appear to be relatively higher for higher speeds. This increase in therate of change might be able to be straightened by taking the square root of the responsevariable, stopping distance. The scatterplot of Speed versus Distance seems like it mightbe a bit straighter.

c) The model for the re-expressed data is Dist̂ance . .= + ( )3 303 0 235 Speed . The residuals plotshows no pattern, and R2= 98.4%, so 98.4% of the variability in the square root of thestopping distance can be explained by the model.

75

150

225

300

20 30 40 50

Speed (mph)

Stopping Distance and Speed

Stop

ping

Dis

tanc

e

-15

0

15

100 150 200 250

Predicted Distance(feet)

Res

idua

l

Residuals Plot

10.0

12.5

15.0

17.5

20 30 40 50

Speed (mph)

Squa

re R

oot o

f

Square RootRe-expression

-0.6

-0.3

0.0

0.3

10 12 14 16

Residuals PlotSquare Root Re-expression

Predicted

Res

idua

l


d)

Distance Speed

Distance

Distance

Distance

ˆ . .

ˆ . .

ˆ .

ˆ . .

= + ( )= + ( )=

= ≈

3 303 0 235

3 303 0 235 55

16 228

16 228 263 42

e)

Distance Speed

Distance

Distance

Distance

ˆ . .

ˆ . .

ˆ .

ˆ . .

= + ( )= + ( )=

= ≈

3 303 0 235

3 303 0 235 70

19 753

19 753 390 22

f) The level of confidence in the predictions should be quite high. R2 is high, and theresiduals plot is scattered. The prediction for 70 mph is a bit of an extrapolation, butshould still be reasonably close.

10. Pendulum.

a) The scatterplot shows the association between thelength of string and the number of swings apendulum took every 20 seconds to be strong,negative, and curved. A pendulum with a longerstring tended to take fewer swings in 20 seconds.The linear model is not appropriate, because theassociation is curved.

b) Curvature in a negative relationship sometimes isan indication of a reciprocal relationship. Try re-expressing the response variable with the reciprocal.

According to the model, a car traveling 55 mph isexpected to require approximately 263.4 feet to cometo a stop.

According to the model, a car traveling 70 mph isexpected to require approximately 390.2 feet to cometo a stop.

10.0

12.5

15.0

17.5

20.0

7.5 15.0 22.5 30.0

Length (inches)

Num

ber

of S

win

gs

Pendulum

0.0500

0.0625

0.0750

0.0875

0.1000

7.5 15.0 22.5 30.0

Length (inches)

Reciprocal Re-expression

(1/#

of s

win

gs)

-0.0015

0.0000

0.0015

0.0030

0.060 0.075 0.090

Res

idua

ls

Residuals PlotReciprocal Re-expression

Predicted


c) The reciprocal re-expression yields the model 1

0 0367 0 00176Swi ngs

Lengthˆ . . ( )= + . The

residuals plot is scattered, and R2 = 98.1%, indicating that the model explains 98.1% of thevariability in the reciprocal of the number of swings. The model is both appropriate andaccurate.

Re-expressing each variable with logarithms also results in a good model, and has theadded benefit of maintaining the direction of the relationship between number of swingsand the length of the string. TI-83 regression output for the log- log re-expression is givenbelow.

There is a strong, negative, linear relationship between log(swings) and log(length).

log( ˆ ) . . (log( ))Swi ngs Length= −1 721 0 453 explains 99% of the variability in log(swings), andthe residuals plot is scattered. The log-log re-expression is useful, as well.

d) Using the reciprocal model:

10 0367 0 00176 0 0367 0 00176 4 0 04374

Swi ngsLengthˆ . . ( ) . . ( ) .= + = + =

Swi ngsˆ.

.= ≈10 04374

22 9

According to the reciprocal model, a pendulum with a 4” string is expected to swingapproximately 22.9 times in 20 seconds.

Using the log-log model:

To make estimates involving logs, it is a good idea to use as much accuracy as possible. Ifthe equation of the model is stored in the calculator, the model predicts that a pendulumwith a 4” string will swing approximately 28 times in 20 seconds.

e) Using the reciprocal model:

10 0367 0 00176 0 0367 0 00176 48 0 12118

Swi ngsLengthˆ . . ( ) . . ( ) .= + = + =

Swi ngsˆ.

.= ≈10 12118

8 3. The model predicts 8.3 swings in 20 seconds for a 48” string.

Scatterplot of Log(swings) Regression Output Residuals Plot vs. Log(length)


Using the log-log model:

To make estimates involving logs, it is a good idea to use as much accuracy as possible. Ifthe equation of the model is stored in the calculator, the model predicts that a pendulumwith a 48” string will swing approximately 9.1 times in 20 seconds.

f) Confidence in the predictions is fairly high. The model is appropriate, as indicated by thescattered residuals plot, and accurate, indicated by the high value of R2 . The only concernis the fact that these predictions are slight extrapolations. The lengths of the strings aren’ttoo far outside the range of the data, so the predictions should be reasonably accurate.

11. Baseball salaries.

a) The scatterplot of year versus highest salary ismoderately strong, positive and curved. Thehighest salary has generally increased over theyears. Since the scatterplot shows a curvedrelationship, the linear model is not appropriate.Using the logarithm re-expression on thesalaries seems to straighten the scatterplotconsiderably.

b) log( ˆ ) . . ( )Salary Year= − +113 473 0 05734 is a good mode for predicting the highest salary fromthe year. The residuals plot is scattered and R2 = 84.9%, indicating that 84.9% of thevariability in highest salary can be explained by the model.

c) log( ˆ ) . . ( ) . . ( ) .Salary Year= − + = − + =113 473 0 05734 113 473 0 05734 2005 1 4937

Salaryˆ ..= ≈10 31 21 4937

According to the model, the highest salary in baseball is expected to be approximately 31.2million dollars. (Be careful with decimal places here. When the slope is small and thevalue of the explanatory variable is large, like a year, too much rounding of the slope canbe detrimental to the accuracy of predictions. When using logs, this is an even biggerproblem!)

7.5

15.0

22.5

1980 1985 1990 1995

YearSa

lary

Baseball’s Highest Salaries

0.0

0.3

0.6

0.9

1.2

1980 1985 1990 1995

Year

Log(

Sala

ry)

Log Re-expression

Res

idua

ls

-0.15

0.00

0.15

0.30

0.3 0.6 0.9 1.2

Predicted

Residuals PlotLog Re-expression


12. Planet distances and years.

a) The association between distance from the sunand planet year is strong, positive, and curvedconcave upward. Generally, planets fartherfrom the sun have longer years than closerplanets.

b) The rate of change in length of year per unit distance appears to be increasing, but notexponentially. Re-expressing with the logarithm of each variable may straighten a plotsuch as this. The scatterplot and residuals plot for the linear model relating log(Distance)and log(Length of Year) appear below.

The regression model for the log-log re-expression is :log( ˆ ) . . (log( ))Length Distance= − +2 955 1 501 .

c) R2 = 100%, so the model explains 100% of the variability in the log of the length of theplanetary year, at least according to the accuracy of the statistical software. The residualsplot is scattered, and the residuals are all extremely small. This is a very accurate model.

50

100

150

200

750 1500 2250 3000

Distance from the sun(millions of miles)

Leng

th o

f Yea

r(E

arth

yea

rs)

Planetary Distance andLength of Year

-0.00

0.75

1.50

2.25

2.0 2.5 3.0 3.5

log Distance

log Length

Log-Log Re-expression

-0.0004

0.0000

0.0004

0.0008

-0.00 0.75 1.50

Residuals Plot

Res

idua

l

Predicted


13. Planet distances and order.

a) The association between planetary position and distance from the sun is strong, positive,and curved (below left). A good re-expression of the data is position versus Log(Distance).The scatterplot with regression line (below center) shows the straightened association. Theequation of the model is log( ˆ ) . . ( )Distance Position= +1 245 0 271 . The residuals plot (belowright) may have some pattern, but after trying several re-expressions, this is the best thatcan be done. R2= 98.2%, so the model explains 98.2% of the variability in the log of theplanets distance from the sun.

b) At first glance, this model appears to provide little evidence to support the contention ofthe astronomers. Pluto appears to fit the pattern, although Pluto’s distance from the sun isa bit less than expected. A model generated without Pluto does not have a dramaticallyimproved residuals plot, does not have a significantly higher R2, nor a different slope.Pluto does not appear to be influential.

But don’t forget that a logarithmic scale is being used for the vertical axis. The higher upthe vertical axis you go, the greater the effect of a small change.

log( ˆ ) . . ( )

log( ˆ ) . . ( )

log( ˆ ) .

ˆ .

Distance Position

Distance

Distance

Distance

= += +=

= ≈

1 24521 0 270922

1 24521 0 270922 9

3 683508

10 48253 683508

Pluto doesn’t fit the pattern for position and distance in the solar system. In fact, the modelmade with Pluto included isn’t a good one, because Pluto influences those predictions. Themodel without Pluto, log( ˆ ) . . ( )Distance Position= +1 20259 0 283706 , works much better. Ithas a high R2 , and scattered residuals plot. This new model predicts that the 9th planetshould be a whopping 5701 million miles away from the sun! There is evidence that theastronomers are correct. Pluto doesn’t behave like planet in its relation to position anddistance.

750

1500

2250

3000

2 3 4 5 6 7 8 9

Position Number

Dis

tanc

e fr

om s

un

Planetary Positionand Distance

2.1

2.6

3.1

3.6

1 2 3 4 5 6 7 8 9

Position Number

Position vs.Log(Distance)

Log

(Dis

tanc

e)

-0.150

-0.075

0.000

0.075

2.0 2.5 3.0 3.5


Res

idua

l

Predicted

According to the model, the 9th planet inthe solar system is predicted to beapproximately 4825 million miles awayfrom the sun. Pluto is actually 3666million miles away.


14. Planets, part 3.

Using the revised planetary numbering system, and straightening the scatterplot using thesame methods as in Exercise 9, the new model, log( ˆ ) . . ( )Distance Position= +1 3058 0 232923 ,is a slightly better fit. The residuals plot is more scattered, and R2 is slightly higher, withthe improved model explaining 99.4% of the variability in the log of distance from the sun.

Pluto still doesn’t fit very well. The new model predicts that Pluto, as 10th planet, shouldbe about 4315 million miles away. That’s about 650 million miles farther away. A bettermodel yet is log( ˆ ) . . ( )Distance Position= +1 28514 0 238826 , a model made with the newnumbering system and with Pluto omitted.

15. Quaoar, planets, part 4.

a)

log( ˆ ) . . ( )

log( ˆ ) . . ( )

log( ˆ ) .

ˆ .

Distance Position

Distance

Distance

Distance

= += +=

= ≈

1 20259 0 283706

1 20259 0 283706 9

3 755944

10 57013 755944

b) With Quaoar as the 9th planet, the model that relates planetary position to distance islog( ˆ ) . . ( )Distance Position= +1 23679 0 273447 .

log( ˆ ) . . ( )

log( ˆ ) . . ( )

log( ˆ ) .

ˆ .

Distance Position

Distance

Distance

Distance

= += +=

= ≈

1 23679 0 273447

1 23679 0 273447 9

3 697813

10 49873 697813

16. Models and laws, planets part 5

The re-expressed data relating distance and year length (Exercise 8), are better described bytheir model than the re-expressed data relating position and distance (Exercises 9-11). Themodel relating distance and year length has R2 = 100%, and a very scattered residuals plot(with miniscule residuals), possibly a natural “law”. If planets in another solar systemfollowed the Titius-Bode pattern, this belief would be reinforced. Similarly, if data were

750

1500

2250

3000

3750

2 4 6 8 10

New Positionand Distance

Dis

tanc

e fr

om s

un

Position Number

Position vsLog(Distance)

Log(

Dis

tanc

e)

Position Number

2.1

2.6

3.1

3.6

2 4 6 8 10

-0.05

0.00

0.05

2.0 2.5 3.0 3.5

Res

idua

l

Predicted


Using the model predicting distance fromthe sun from position, the 9th planet isexpected to be about 5701 million milesfrom the sun. At 4000 million miles,Quaoar is a better fit than Pluto, but stillnot a good fit.

According to the new model, the 9th planetis predicted to be 4987 million miles fromthe sun. Quaoar, at approximately 4000million miles, is a better fit than Pluto, butdoesn’t appear to be an additional planet.


acquired from planets in another solar system that did not follow this pattern, we would beunlikely to think that this relationship was a universal law.

17 – 22. Curved models.

17. (Exercise 8) Pressure.

The power model, Pressure Heightˆ . ( ) .= −1413 3 1 001 , works well. The model explains 99.99%of the variability in pressure, the residuals are very small, and the residuals plot shows nostrong pattern.

18. (Exercise 9) Brakes.

The quadratic model, Distance Speed Speedˆ . ( ) . ( ) .= + +0 05 1 96 4 42 , works well. The modelexplains 97.9% of the variability in stopping distance and the residuals plot shows nostrong pattern, although stopping distance is harder to predict accurately at high speeds.

Scatterplot of Pressure vs. Regression Output Residuals Plot Height

Scatterplot of Stopping Regression Output Residuals Plot Distance vs. Speed


19. (Exercise 10) Pendulum.

The power model, Swi ngs Lengthˆ . ( ) .= −52 547 0 453 , works well. The model explains 99.01% ofthe variability in the number of swings in 20 seconds, and the residuals plot shows littlepattern.

20. (Exercise 11) Baseball salaries.

Express the years in years since 1900. For example, 1980 = 80.

The exponential model, Sal ary Yearˆ ( . )( . )= 0 0000297 1 14 , works well. The model explains96.22% of the variability in salary, and the residuals plot shows no pattern. The 25.3million dollar salary of Alex Rodriguez is much higher than the model predicts.

21. (Exercise 12) Planet distances and years.

Scatterplot of Number of Regression Output Residuals Plot Swings vs. Length

Scatterplot of Salary Regression Output Residuals Plot vs. Years since 1900

Scatterplot of Length of Year Regression Output Residuals Plot vs. Distance from Sun


The power model, Year Distanceˆ . ( ) .= 0 00111 1 501, works well. The model explains nearly100% of the variability in the length of the year and the residuals plot shows not apparentpattern.

22. (Exercise 13) Planet distances and order.

The exponential model, Distance Positionˆ . ( . )= 17 598 1 866 , is the best fitting curved model. Themodel explains 98.15% of the variability in distance from the Sun. The pattern in theresiduals seems to be associated with the inclusion of Pluto, supporting the theory thatPluto may not be a true planet.

The quadratic model, Distance Position Positionˆ . ( ) . ( ) .= − +88 14 434 63 487 172 , fits the givendata well, but doesn’t seem to be appropriate. Although the model explains 99.42% of thevariability in distance from the Sun and the residuals plot is scattered, the model has aminimum around the second planet. The model predicts that the first planet is furtherfrom the Sun than the second and third planets! That doesn’t seem reasonable.

Scatterplot of Distance from Regression Output Residuals Plot Sun vs. Position

Scatterplot of Distance from Regression Output Residuals Plot Sun vs. Position


23. Logs (not logarithms).

a) The association between the diameter of a log and thenumber of board feet of lumber is strong, positive, andcurved. As the diameter of the log increases, so does thenumber of board feet of lumber contained in the log.

The model used to generate the table used by the logbuyers is based upon a square root re-expression. Thevalues in the table correspond exactly to the model

BoardFeet Diameterˆ = − +4 .

b)

BoardFeet Diameter

BoardFeet

BoardFeet

BoardFeet

ˆ

ˆ ( )

ˆ

ˆ

= − +

= − +

=

=

4

4 10

6

36c)

BoardFeet Diameter

BoardFeet

BoardFeet

BoardFeet

ˆ

ˆ ( )

ˆ

ˆ

= − +

= − +

=

=

4

4 36

32

1024Normally, we would be cautious of this prediction, because it is an extrapolation beyondthe given data, but since this is a prediction made from an exact model based on thevolume of the log, the prediction will be accurate.

24. Weight lifting.

a) The association between weight class and weight liftedfor gold medal winners in weightlifting at the 2000Olympics is strong, positive, and curved. The linearmodel that best fits the data isLi ft WeightClassˆ . . ( )= +179 918 2 401 .

Although this model explains 96.9% of the variability inweight lifted, it does not fit the data well.

125

250

375

500

8 12 16 20 24

# of

boa

rd fe

et

Diameter of log (inches)

Doyle Log Scale

4

8

12

16

20

24

8 12 16 20 24

Diameter of log (inches)Sq

uare

roo

t of #

of b

dft

Square root re-expressionAccording to themodel, a log 10” indiameter is expectedto contain 36 boardfeet of lumber.

According to themodel, a log 36” indiameter is expectedto contain 1024 boardfeet of lumber.

Men’s Weightlifting2000 Olympics

325

350

375

400

425

62.5 75.0 87.5 100.0

Weight Class (kg)


b) The residuals plot for the linear model shows a curvedpattern, indicating that the linear model has failed tomodel the association well. A re-expressed model or acurved model might fit the association between weightclass and weight lifted better than the linear model.

c) A first attempt at re-expressing the data might be one of the logarithmic re-expressions, orthe exponential model from the TI-83. These models do not decrease the amount ofcurvature in the residuals plot, and seem to fit no better than the linear model. Thequadratic model, Li ft WeightClass WeightClassˆ . ( ) . ( ) .= − + +0 025 6 42 25 592 , fits well.

d) The quadratic model is a better model, because it has a residuals plot that shows lesspattern than the residuals plot for the linear model. Also, the quadratic model explains99.1% of the variability in weight lifted. Care must be taken in using this model, however.The presence of one unusual point, the lift of 357.5 kg in the 69 kg weight class seems to beaffecting the model.

e) Boevski, from Bulgaria, lifted 357.5 kg. His lift does not fit the pattern established by theother lifts. The lift in this weight class is predicted to be 349.2 kg. According to the model,Boevski lifted 8.3 kg (over 18 pounds) more than expected.

-5

0

5

10

330 360 390 420

Res

idua

l

Predicted Lift (kg)

Residuals Plot

Scatterplot of Lift Regression Output Residuals Plot vs. Weight class


25. Life expectancy.

The association between year and life expectancy isstrong, curved and positive. As the years passed, lifeexpectancy for white males has increased. The bend(in mathematical language, the change in curvature)in the scatterplot makes it impossible to straighten theassociation by re-expression.If all of the data points are to be used, the linearmodel, LifeExp Yearˆ . . ( )= − +454 938 0 265697 isprobably the best model we can make. It doesn’t fitthe data well, but accounts for 93.6% of the variabilityin life expectancy.

In order to make better future predictions, we mightuse only a subset of the data points. The scatterplot isvery linear from 1970 to 2000, and these recent yearsare likely to be more indicative of the current trend inlife expectancy. Using only these four data points, wemight be able to make more accurate futurepredictions.The model LifeExp Yearˆ . . ( )= − +379 020 0 227 explains99.6% of the variability in life expectancy, and has ascattered residuals plot. Great care should be taken inusing this model for predictions, since it is basedupon only 4 points, and, as always, prediction into thefuture is risky. This is especially true since the response variable is life expectancy, whichcannot be expected to increase at the same rate indefinitely.

26. Lifting more weight.

a) Without Boevski, the quadratic model is: Li ft WeightClass WeightClassˆ . . .= − ( ) + ( ) +0 02 5 8 45 52

b) According to the quadratic model, the winner of the 69 kg weight class was predicted to lift346.47 kg.

c) Since Boevski actually lifted 357.5 kg, his residual was 357.5 – 346.47 = 11.03. He lifted 11kg (over 24 pounds) more than predicted!

27. Slower is cheaper?

The association between speed and mileage is strong andcurved. The linear model, Mileage Speedˆ . . ( )= −32 20 0 097 , is aterrible fit. For speeds under 50 miles per hour, higher speedsare associated with higher mileage. For speeds in excess of 50mph, higher speeds are associated with lower mileage. Forthis reason, the scatterplot cannot be straightened by themethods used in this chapter. A curved model may be abetter fit.

Life

Exp

ecta

ncy

(yea

rs)

52.5

60.0

67.5

75.0

1900 1940 1980

Year

Life ExpectancyWhite Males

69.0

70.5

72.0

73.5

1980 1990 2000

Year

Life ExpectancyWhite Males

Life

Exp

ecta

ncy

(yea

rs)

Mile

s pe

r ga

llon

Speed and Mileage

24.0

25.5

27.0

28.5

40 50 60 70

Speed (MPH)


The quadratic model: Mileage Speed Speedˆ . ( ) . ( ) .= − + −0 114 1 158 0 4192 , works well.

Since the residuals plot scattered, the model is appropriate. The model is strong, sinceR2 96≈ % . This means that 96% of the variation in mileage is explained by the model.

Another logical modeling decision is to divide the curve up into two linear models. Theassociation seems fairly linear from 35—50 mph, and also fairly linear from 55—75 mph.

For lower speeds, the model isMileage Speedˆ . . ( )= +18 040 0 232 .The model has some pattern inthe residuals plot, but no re-expression seems to change it.The model explains 96.7% ofthe variability in mileage. Forlow speeds, the model predictsan increase of 0.232 miles pergallon for each additional mileper hour of speed.

For high speed, the model isMileage Speedˆ . . ( ).= −46 800 0 320The residuals plot isscattered and the modelexplains 99.1% of thevariability in mileage. Forhigher speeds, the modelpredicts a decrease inmileage of 0.320 miles pergallon for each additionalmile per hour of speed

Note that these models willnot produce accurate predictions about the gas mileage people may get in daily driving.People don’t drive at consistent speeds.

Mileage atLower Speeds

26

27

28

29

40 45 50

Speed (mph)

Mile

s pe

r ga

llon

-0.15

0.00

0.15

0.30

27 28 29

Predicted

Residuals Plot35—50 mph

24.0

25.5

27.0

28.5

55 60 65 70

Speed (mph)

Mile

s p

er g

allo

n

Mileage atHigher Speeds

-0.15

0.00

0.15

0.30

24.0 27.0

e

d

M

Residuals Plot55—75 mph

Res

idua

l

Predicted

Scatterplot of Mileage Regression Output Residuals Plot vs. Speed


28. Orange production.

The association between thenumber of oranges per tree andthe average weight of theoranges is strong, negative, andappears linear at first look.Generally, trees that containlarger numbers of oranges havelower average weight perorange. The linear model maylook appropriate, until theresiduals plot is examined.Even though the linear model isa close fit, the residuals plotshows a strong curved pattern.The data should be re-expressed.

Plotting the number of oranges pertree and the reciprocal of theaverage weight per orangestraightens the relationshipconsiderably. The residuals plotshows little pattern and the valueof R2indicates that the modelexplains 99.8% of the variability inthe reciprocal of the averageweight per orange. The moreappropriate model is:

11 603 0 00112

Ave wtOranges Tree

. ˆ. . (# / )= + .

29. Years to live.

a) The association between the age and estimatedadditional years of life for black males is strong, curved,and negative. Older men generally have fewer estimatedyears of life remaining. The square root re-expression ofthe data, YearsLeft Ageˆ . . ( )= −8 39253 0 069553 , straightensthe data considerably, but has an extremely patternedresiduals plot. The model is not a mathematicallyappropriate model, but fits so closely that it should befine for predictions within the range of data. The modelexplains 99.8% of the variability in the estimated numberof additional years.

-0.0075

0.0000

0.0075

0.40 0.50

0.40

0.45

0.50

0.55

0.60

200 400 600 800

Ave

rage

Wei

ght (

lbs)

# of orangesper tree

Orange Production Residuals Plot

Res

idua

l

Predicted

1.75

2.00

2.25

2.50

200 600

ec

we

gh

ReciprocalRe-expression

1/(A

ve. W

eigh

t)

# of orangesper tree

-0.02

-0.01

0.00

0.01

0.02

1.75 2.25

Residuals Plot

Predicted

15

30

45

60

0 20 40 60 80

Age

Number of Years LeftBlack Males

Yea

rs le

ft


Another good model is the quadratic model, YearsLeft Age Ageˆ . ( ) . ( ) .= − +0 0052 1 21 71 572 .This model explains 99.9% of the variation in the number of years left and although theresiduals plot shows a pattern, the residuals are very small. We don’t have any need toextrapolate, so this model should work well.

b) Using the square root re-expression:

YearsLeft Age

YearsLeft

YearsLeft

YearsLeft

ˆ . . ( )

ˆ . . ( )

ˆ .

ˆ . .

= −

= −

=

= ≈

8 39253 0 069553

8 39253 0 069553 18

7 140936

7 140936 50 992

Using the quadratic model:

If the quadratic model, estimated as YearsLeft Age Ageˆ . ( ) . ( ) .= − +0 0052 1 21 71 572 , is stored inthe memory of the calculator, an 18-year-old black male is expected to live another 51.43years, for a total age of about 69.4 years.

-0.1

0.0

0.1

0.2

2 4 6 8

3.0

4.5

6.0

7.5

0 20 40 60 80

Age

Square RootRe-expression

Squa

re R

oot

Predicted

Res

idua

l

Residuals Plot

According to the model, an 18-year-oldblack male is expected to live and additional50.99 years, for a total age of 68.99 years.

Scatterplot of Years Left Regression Output Residuals Plot vs. Age


30. Oil production (again).

The association between the year and thenumber of barrels of oil is strong and curved.Oil production generally increased until themid-1980s, and then decreased until the year2000. Because of the association reversesdirection, no re-expression of the y-variablewill straighten this scatterplot. This abruptchange in direction also makes it unlikelythat any model will be able to make accuratepredictions about the future. Oil productionis falling now, but we have no indication thatthis trend will continue. Very little faithshould be placed in predictions made from any model.

31. Internet.

a) The association between the year and the numberof electronic journals is strong, positive, and curved. Thenumber of electronic journals rapidly increased in the lastseveral years. Clearly, the linear model would not beappropriate for modeling these data.

A good model is the log re-expression Log -( ˆ ) . . ( )Journals Year= +686 763 0 345547 . Theresiduals plot is scattered, and the value of R2indicates that the model explains 95.9% ofthe variability in the number of journals.

2000000

2400000

2800000

3200000

1950 1965 1980 1995

United States Oil Production

Oil

Prod

uctio

n(1

000s

of b

arre

ls)

Year

500

1000

1500

2000

1992 1994 1996

Year

# of

Jour

nals

Internet Journals

1.9

2.4

2.9

3.4

1992 1994 1996

Year

Log Re-expression

Log(

# of

Jour

nals

)

-0.250

-0.125

0.000

0.125

1.5 2.0 2.5 3.0

Residuals Plot

Predicted


b)

Log -Log -Log

( ˆ ) . . ( )

( ˆ ) . . ( )

( ˆ ) .

ˆ ,.

Journals Year

Journals

Journals

Journals

= += +=

= ≈

686 763 0 345547686 763 0 345547 2000

4 334

10 21 4294 334

c) The estimate of 21,429 journals in 2000 seems very high, and seems unlikely to be accurate.The year 2000 is an extrapolation 3 years beyond the most recent year included in the dataset. The model is increasing exponentially, but it is quite unlikely that the number ofjournals could keep increasing exponentially.

32. Tree growth.

a) The association between ageand average diameter ofgrapefruit trees is strong,curved, and positive.Generally, older trees havelarger average diameters.

The linear model for this association, AverageDiameter Ageˆ . . ( )= +1 973 0 463 is notappropriate. The residuals plot shows a clear pattern.

Because of the change in curvature in the association, these data cannot be straightened byre-expression. The bend in the scatterplot may indicate a cubic relationship.

The cubic model, Diameter Age Age Ageˆ . ( ) . ( ) . ( ) .= − + −0 0023 0 083 1 31 0 1973 2 , is a great model.It explains 99.98% of the variability in tree diameter, and the residuals plot shows nopattern. It should work well for making estimates within the range of the data, and forsmall extrapolations. It would be risky to make estimates for very young or old trees.

b) If diameters from individual trees were given, instead of averages, the association wouldhave been weaker. Individual observations are more variable than averages.

According to the model, there are expectedto be approximately 21,429 electronicjournals in the year 2000.

4

6

8

10

4 8 12 16

Dia

met

er (i

nche

s)

Grapefruit Trees

-0.6

-0.3

0.0

0.3

4 6 8 10Predicted

Residuals Plot

Age (years)

Scatterplot of Diameter Regression Output Residuals Plot vs. Age

Documents

Chapter 10 – Re-expressing Data: It’s Easier Than You Think · 2012. 9. 17. · Chapter 10 Re-expressing Data: It’s Easier Than You Think 147 Re-expressing the pressure as the