6- Single Variable Regression (Part II)




    7. Residual Plots

After the curve is fit, it is important to examine whether the fitted curve is reasonable. This is done using residuals. The residual for a point is the difference between the observed value and the predicted value, i.e., the residual from fitting a straight line is found as:

residual_i = Y_i - Ŷ_i = Y_i - (b0 + b1 X_i)

There are several standard residual plots:

plot of residuals vs predicted values;

plot of residuals vs X;

plot of residuals vs the time ordering of the observations.

In all cases, the residual plots should show random scatter around zero with no obvious

pattern. Don't plot residuals vs Y; this will lead to odd-looking plots which are an artifact of

the plot and don't mean anything.
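As a rough illustration, here is a minimal Python sketch (the x and y arrays are invented stand-ins for a real data set) that fits a straight line by least squares and draws residuals against the predicted values, against X, and against the time ordering:

```python
import numpy as np
import matplotlib.pyplot as plt

# Invented data standing in for a real (X, Y) data set.
rng = np.random.default_rng(1)
x = np.linspace(0, 10, 30)
y = 2.0 + 1.5 * x + rng.normal(scale=1.0, size=x.size)

# Fit a straight line by least squares and compute the residuals.
b1, b0 = np.polyfit(x, y, 1)      # slope, intercept
y_hat = b0 + b1 * x               # predicted values
residuals = y - y_hat             # observed minus predicted

# The three standard residual plots: vs predicted, vs X, vs time ordering.
fig, axes = plt.subplots(1, 3, figsize=(12, 4))
for ax, horiz, label in zip(axes,
                            [y_hat, x, np.arange(x.size)],
                            ["predicted", "X", "time order"]):
    ax.scatter(horiz, residuals)
    ax.axhline(0.0, linestyle="--")
    ax.set_xlabel(label)
    ax.set_ylabel("residual")
plt.tight_layout()
plt.show()
```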

    8. Probability Plots

The probability plot is a graphical technique for assessing whether or not a data set

follows a given distribution, such as the normal distribution. The data are plotted against a theoretical normal distribution in such a way that the points should form approximately a

straight line. Departures from this straight line indicate departures from the specified

distribution.


The points on this plot form a nearly linear pattern, which indicates that the normal

distribution is a good model for this data set.

The normal probability plot is formed by:

Vertical axis: ordered response values

Horizontal axis: normal order statistic medians

The observations are plotted as a function of the corresponding normal order statistic

quantiles. In addition, a straight line can be fit to the points and added as a reference line.

The further the points vary from this line, the greater the indication of departures from

normality. The correlation coefficient of the points on the normal probability plot can be

compared to a table of critical values to provide a formal test of the hypothesis that the

data come from a normal distribution.
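One way to produce such a plot and its correlation coefficient is SciPy's probplot; a minimal sketch, with a hypothetical residuals array standing in for the residuals of a fitted model:

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

# Hypothetical residuals standing in for those of a fitted regression.
rng = np.random.default_rng(2)
residuals = rng.normal(size=30)

# probplot sorts the data, pairs each value with a theoretical normal
# quantile, fits a reference line, and returns the correlation r of the
# plotted points.
(osm, osr), (slope, intercept, r) = stats.probplot(residuals,
                                                   dist="norm", plot=plt)
plt.title("Normal probability plot of residuals")
plt.show()

# r is compared with a table of critical values (such as the one referenced
# below): values well below the critical value indicate departure from
# normality.
print(f"correlation of the probability plot points: r = {r:.4f}")
```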

[Table: critical values of the normal probability plot correlation coefficient at the 0.01, 0.05 and 0.10 significance levels, for sample sizes n from 4 to 75; see the NIST/SEMATECH e-Handbook link below.]

The normal probability plot is used to answer the following questions:

1. Are the data (meaning the residuals) normally distributed?

2. What is the nature of the departure from normality (data skewed, shorter than

expected tails, longer than expected tails)?


http://www.itl.nist.gov/div898/handbook/eda/section3/eda3676.htm

Typical Normal Probability Plot: Normally Distributed Data

The following normal probability plot is from the heat flow meter data.


Normal Probability Plot for Data with Short Tails

The following is a normal probability plot for 500 random numbers generated from a

Tukey-Lambda distribution with the parameter equal to 1.1.


Typical Normal Probability Plot: Data Have Long Tails

The following is a normal probability plot of 500 numbers generated from a

double exponential distribution. The double exponential distribution is symmetric, but

relative to the normal it declines rapidly and has longer tails.


The non-linearity of the normal probability plot can show up in two ways. First, the middle of the data may show an S-like pattern. This is common for both short and long tails; in this particular case, the S pattern in the middle is fairly mild. Second, the first few and the last few points show marked departure from the reference fitted line. In the plot above, this is most noticeable for the first few data points. In comparing this plot to the short-tail example in the previous section, the important difference is the direction of the departure from the fitted line for the first few and the last few points. For long tails, the first few points show increasing departure below the fitted line and the last few points show increasing departure above the fitted line. For short tails, this pattern is reversed.

In this case we can reasonably conclude that the normal distribution can be improved upon as a model for these data. For probability plots that indicate long-tailed distributions, the next step might be to generate a Tukey-Lambda PPCC plot of the data.
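The short- and long-tailed reference cases above can be reproduced for comparison with SciPy's tukeylambda and laplace (double exponential) generators; a rough sketch:

```python
import matplotlib.pyplot as plt
from scipy import stats

# 500 draws from a Tukey-Lambda distribution with parameter 1.1 (short tails)
# and 500 draws from a double exponential (Laplace) distribution (long tails).
short_tails = stats.tukeylambda.rvs(1.1, size=500, random_state=3)
long_tails = stats.laplace.rvs(size=500, random_state=3)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
stats.probplot(short_tails, dist="norm", plot=ax1)
ax1.set_title("Short-tailed sample (Tukey-Lambda, parameter 1.1)")
stats.probplot(long_tails, dist="norm", plot=ax2)
ax2.set_title("Long-tailed sample (double exponential)")
plt.tight_layout()
plt.show()
```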


Typical Normal Probability Plot: Data are Skewed Right

Discussion: This quadratic pattern in the normal probability plot is the signature of a significantly right-skewed data set. Similarly, if all the points on the normal probability plot fell above the reference line connecting the first and last points, that would be the signature pattern for a significantly left-skewed data set.

In this case we can quite reasonably conclude that we need to model these data with a right-skewed distribution such as the Weibull or lognormal.

9. Example: Yield and Fertilizer

We wish to investigate the relationship between yield (liters) and fertilizer (kg/ha) for tomato plants. An experiment was conducted in the Schwarz household one summer on plots of land where the amount of fertilizer was varied and the yield measured at the end of the season.

The amount of fertilizer applied to each plot was chosen between 5 and 18 kg/ha. While the levels were not systematically chosen (e.g. they were not evenly spaced between the highest and lowest values), they represent commonly used amounts based on a preliminary survey of producers.

Interest also lies in predicting the yield when 16 kg/ha are assigned. The levels of fertilizer were randomly assigned to the plots. At the end of the experiment, the yields were measured and the following data were obtained.


http://www.itl.nist.gov/div898/handbook/eda/section3/eda3668.htm
http://www.itl.nist.gov/div898/handbook/eda/section3/eda3669.htm

In this study, it is quite clear that the fertilizer is the predictor (X) variable, while the response variable (Y) is the yield.

The population consists of all possible field plots with all possible tomato plants of this type grown under all possible fertilizer levels between about 5 and 18 kg/ha.

If all of the population could be measured (which it can't), you could find the relationship between the yield and the amount of fertilizer applied. This relationship would have the form:

Y = β0 + β1 X + ε

where β0 and β1 represent the true population intercept and slope respectively. The term ε represents random variation that is always present, i.e. even if the same plot were grown twice in a row with the same amount of fertilizer, the yield would not be identical (why?).

The population parameters to be estimated are β0, the true average yield when the amount of fertilizer is 0, and β1, the true average change in yield per unit change in the amount of fertilizer. These are taken over all plants in all possible field plots of this type. The values of β0 and β1 are impossible to obtain as the entire population could never be measured.

Analysis

Here is the data entered into the statistical package's data sheet. Note the scale of both variables (continuous). The ordering of the rows is NOT important; however, it is often easier to find individual data points if the data are sorted by the X value and the rows for future predictions are placed at the end of the dataset.
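A comparable layout can be set up in any general-purpose tool; for example, a pandas sketch (the numbers are invented placeholders, with the future-prediction row left blank and placed last):

```python
import numpy as np
import pandas as pd

# Invented values standing in for the real data sheet.
data = pd.DataFrame({
    "fertilizer": [12.0, 5.0, 15.0, 17.0, 8.0, 14.0, 18.0, 11.0],
    "yield_l":    [27.0, 18.0, 29.0, 31.0, 22.0, 28.0, 33.0, 25.0],
})

# Sort by the X value so that individual points are easy to find ...
data = data.sort_values("fertilizer").reset_index(drop=True)

# ... and place the row reserved for the future prediction (yield unknown)
# at the end of the data set.
future = pd.DataFrame({"fertilizer": [16.0], "yield_l": [np.nan]})
data = pd.concat([data, future], ignore_index=True)
print(data)
```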


Use the Statistics -> Regression Analysis -> Simple Regression platform to start the analysis. Specify the Y and X variables as needed.

Then click OK. A new spreadsheet will be created that contains the regression results.


At this stage, it would also be useful to draw a scatter plot of the data (refer to the previous tutorials).

The relationship looks approximately linear; there don't appear to be any outlier or influential points; and the scatter appears to be roughly equal across the entire regression line. Residual plots will be used later to check these assumptions in more detail.

The Fit menu item allows you to fit the least-squares line. The actual fitted line is drawn on the scatter plot, and the coefficients of the fitted straight line (here called A1 for the intercept and A2 for the slope) are printed below the fit spreadsheet.
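The same least-squares fit can be reproduced outside the package, for example with scipy.stats.linregress; a minimal sketch in which the fertilizer and yield arrays are placeholders to be replaced by the columns of the data sheet:

```python
import numpy as np
from scipy import stats

# Placeholder arrays -- substitute the fertilizer (kg/ha) and yield (L)
# columns from the data sheet.
fertilizer = np.array([5.0, 8.0, 11.0, 12.0, 14.0, 15.0, 17.0, 18.0])
yield_l = np.array([18.0, 22.0, 25.0, 27.0, 28.0, 29.0, 31.0, 33.0])

fit = stats.linregress(fertilizer, yield_l)
print("intercept b0:", fit.intercept)
print("slope b1:    ", fit.slope)
print("se(b1):      ", fit.stderr)   # used later for intervals and tests

# Predicted values and residuals, e.g. for the residual plots of section 7.
y_hat = fit.intercept + fit.slope * fertilizer
residuals = yield_l - y_hat
```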


The estimated regression line is:

estimated yield = 12.856 + 1.101 (amount of fertilizer)

In terms of estimates, b0 = 12.856 is the estimated intercept, and b1 = 1.101 is the estimated slope.

The estimated slope is the estimated change in yield when the amount of fertilizer is increased by 1 unit. In this case, the yield is expected to increase (why?) by about 1.10 L when the fertilizer amount is increased by 1 kg/ha. NOTE that the slope is the change in the average yield per unit change in fertilizer. In this particular case the intercept has a meaningful interpretation, but I'd be worried about extrapolating outside the range of the observed X values.

Once again, these are the results from a single experiment. If the experiment were repeated, you would obtain different estimates (b0 and b1 would change). The sampling distribution would describe the variation in b0 and b1 over all possible experiments. The standard deviation of b0 and b1 over all possible experiments is again referred to as the standard error of b0 and b1.
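This idea can be made concrete by simulation: repeatedly generate data from an assumed "true" line, refit each simulated experiment, and look at the spread of the estimates. A sketch (all of the "true" values here are invented for illustration):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
x = np.linspace(5, 18, 11)             # 11 plots; illustrative X values
beta0, beta1, sigma = 12.0, 1.1, 2.0   # assumed "true" population values

intercepts, slopes = [], []
for _ in range(5000):                  # 5000 simulated experiments
    y = beta0 + beta1 * x + rng.normal(scale=sigma, size=x.size)
    fit = stats.linregress(x, y)
    intercepts.append(fit.intercept)
    slopes.append(fit.slope)

# The standard deviation of the estimates over the repeated experiments is,
# by definition, the standard error of b0 and b1.
print("simulated se(b0):", np.std(intercepts, ddof=1))
print("simulated se(b1):", np.std(slopes, ddof=1))
```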

The formulae for the standard errors of b0 and b1 are messy, and hopeless to compute by hand. And just like inference for a mean or a proportion, we can obtain estimates of the standard errors from the statistical package (from the regression results sheet created earlier).


The estimated standard error for b1 (the estimated slope) is 0.132 L per kg. This is an estimate of the standard deviation of b1 over all possible experiments. Normally, the intercept is of limited interest, but a standard error can also be found for it as shown in the above table.

Using exactly the same logic as when we found a confidence interval for the population mean, a confidence interval for the population slope (β1) is found (approximately) as b1 ± 2(estimated se). In the above example, an approximate confidence interval for β1 is found as

1.101 ± 2 × (0.132) = 1.101 ± 0.264, i.e. (0.837 to 1.365) L/kg

of fertilizer applied. An "exact" confidence interval can be computed by the software as shown above. The "exact" confidence interval is based on the t-distribution and is slightly wider than our approximate confidence interval because the total sample size (11 pairs of points) is rather small.

We interpret this interval as 'being 95% confident that the true increase in yield when the amount of fertilizer is increased by one unit is somewhere between 0.837 and 1.365 L/kg.'
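Both intervals can be reproduced from the reported estimates; a sketch using the slope, standard error, and sample size quoted above, where the "exact" interval simply replaces the multiplier 2 with the t quantile for 9 degrees of freedom:

```python
from scipy import stats

b1, se_b1, n = 1.101, 0.132, 11     # estimates quoted in the text
df = n - 2                          # 9 degrees of freedom

# Approximate 95% interval: estimate +/- 2 standard errors.
approx = (b1 - 2 * se_b1, b1 + 2 * se_b1)

# "Exact" interval: replace 2 by the t quantile for 9 df (about 2.26),
# which is why the exact interval is slightly wider.
t_crit = stats.t.ppf(0.975, df)
exact = (b1 - t_crit * se_b1, b1 + t_crit * se_b1)

print("approximate:", approx)
print("exact:      ", exact)
```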


Be sure to carefully distinguish between β and b. Note that the confidence interval is computed using b1, but it is a confidence interval for β1, the population parameter that is unknown.

In linear regression problems, one hypothesis of interest is whether the true slope is zero. This would correspond to no linear relationship between the response and predictor variable (why?). In many cases, a confidence interval tells the entire story.

The software produces a test of the hypothesis that each of the parameters (the slope and the intercept in the population) is zero. The output is reproduced below:

The test of hypothesis about the intercept is not of interest (why?).

Let

β1 be the true (unknown) slope, and

b1 be the estimated slope. In this case b1 = 1.1014.

The hypothesis testing proceeds as follows. Again note that we are interested in the population parameters and not the sample statistics:

1. Specify the null and alternate hypotheses:

H0: β1 = 0
H1: β1 ≠ 0

Notice that the null hypothesis is in terms of the population parameter β1. This is a two-sided test, as we are interested in detecting differences from zero in either direction.

2. Find the test statistic and the p-value. The test statistic is computed as:

T = (b1 - 0) / se(b1) = 1.101 / 0.132 ≈ 8.3

In other words, the estimate is over 8 standard errors away from the hypothesized value! This will be compared to a t-distribution with n - 2 = 9 degrees of freedom. The p-value is found to be very small (less than 0.0001).
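The same test statistic and two-sided p-value can be recomputed directly from the reported estimates; a minimal sketch:

```python
from scipy import stats

b1, se_b1, n = 1.101, 0.132, 11     # estimates quoted in the text
df = n - 2                          # 9 degrees of freedom

t_stat = (b1 - 0) / se_b1                    # roughly 8.3
p_value = 2 * stats.t.sf(abs(t_stat), df)    # two-sided p-value

print(f"T = {t_stat:.2f}, p = {p_value:.2g}")  # p is well below 0.0001
```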


3. Form a conclusion. Because the p-value is so small, there is strong evidence against the hypothesis that the true slope is zero; there appears to be a linear relationship between yield and the amount of fertilizer.


Two types of predictions are often of interest. First, the experimenter may be interested in predicting a single FUTURE individual response at a particular X, i.e. the yield of one additional plot when 16 kg/ha of fertilizer is added.

Second, the experimenter may be interested in predicting the average of ALL FUTURE responses at a particular X. This would correspond to the average yield for all future plots when 16 kg/ha of fertilizer is added.

The prediction interval for an individual response is sometimes called a confidence interval for an individual response, but this is an unfortunate (and incorrect) use of the term confidence interval. Strictly speaking, confidence intervals are computed for fixed unknown parameter values; prediction intervals are computed for future random variables.
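The distinction shows up directly in software output; for example, statsmodels reports both kinds of intervals at a new X value. A sketch with placeholder data (substitute the real fertilizer and yield columns), predicting at 16 kg/ha:

```python
import pandas as pd
import statsmodels.formula.api as smf

# Placeholder data -- substitute the real fertilizer and yield columns.
df = pd.DataFrame({
    "fertilizer": [5.0, 8.0, 11.0, 12.0, 14.0, 15.0, 17.0, 18.0],
    "yield_l":    [18.0, 22.0, 25.0, 27.0, 28.0, 29.0, 31.0, 33.0],
})

fit = smf.ols("yield_l ~ fertilizer", data=df).fit()
new = pd.DataFrame({"fertilizer": [16.0]})
pred = fit.get_prediction(new).summary_frame(alpha=0.05)

# mean_ci_lower/upper: confidence interval for the AVERAGE yield at X = 16
# obs_ci_lower/upper:  prediction interval for ONE future plot at X = 16
print(pred[["mean", "mean_ci_lower", "mean_ci_upper",
            "obs_ci_lower", "obs_ci_upper"]])
```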

(To be continued.)
