6- Single Variable Regression (Part II)




    7. Residual Plots

After the curve is fit, it is important to examine whether the fitted curve is reasonable. This is done using residuals. The residual for a point is the difference between the observed value and the predicted value, i.e., the residual from fitting a straight line is found as:

residual_i = Y_i - Ŷ_i = Y_i - (b0 + b1 X_i)

There are several standard residual plots:

plot of residuals vs predicted values;

plot of residuals vs X;

plot of residuals vs the time ordering of the observations.

In all cases, the residual plots should show random scatter around zero with no obvious

pattern. Don't plot residuals vs Y; this will lead to odd-looking plots which are an artifact of

the plot and don't mean anything.
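As a rough illustration, here is a minimal Python sketch (the x and y arrays are invented stand-ins for a real data set) that fits a straight line by least squares and draws residuals against the predicted values, against X, and against the time ordering:

```python
import numpy as np
import matplotlib.pyplot as plt

# Invented data standing in for a real (X, Y) data set.
rng = np.random.default_rng(1)
x = np.linspace(0, 10, 30)
y = 2.0 + 1.5 * x + rng.normal(scale=1.0, size=x.size)

# Fit a straight line by least squares and compute the residuals.
b1, b0 = np.polyfit(x, y, 1)      # slope, intercept
y_hat = b0 + b1 * x               # predicted values
residuals = y - y_hat             # observed minus predicted

# The three standard residual plots: vs predicted, vs X, vs time ordering.
fig, axes = plt.subplots(1, 3, figsize=(12, 4))
for ax, horiz, label in zip(axes,
                            [y_hat, x, np.arange(x.size)],
                            ["predicted", "X", "time order"]):
    ax.scatter(horiz, residuals)
    ax.axhline(0.0, linestyle="--")
    ax.set_xlabel(label)
    ax.set_ylabel("residual")
plt.tight_layout()
plt.show()
```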

    8. Probability Plots

The probability plot is a graphical technique for assessing whether or not a data set

follows a given distribution, such as the normal distribution. The data are plotted against a theoretical normal distribution in such a way that the points should form approximately a

straight line. Departures from this straight line indicate departures from the specified

distribution.


The points on this plot form a nearly linear pattern, which indicates that the normal

distribution is a good model for this data set.

The normal probability plot is formed by:

Vertical axis: ordered response values

Horizontal axis: normal order statistic medians

The observations are plotted as a function of the corresponding normal order statistic

quantiles. In addition, a straight line can be fit to the points and added as a reference line.

The further the points vary from this line, the greater the indication of departures from

normality. The correlation coefficient of the points on the normal probability plot can be

compared to a table of critical values to provide a formal test of the hypothesis that the

data come from a normal distribution.
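One way to produce such a plot and its correlation coefficient is SciPy's probplot; a minimal sketch, with a hypothetical residuals array standing in for the residuals of a fitted model:

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

# Hypothetical residuals standing in for those of a fitted regression.
rng = np.random.default_rng(2)
residuals = rng.normal(size=30)

# probplot sorts the data, pairs each value with a theoretical normal
# quantile, fits a reference line, and returns the correlation r of the
# plotted points.
(osm, osr), (slope, intercept, r) = stats.probplot(residuals,
                                                   dist="norm", plot=plt)
plt.title("Normal probability plot of residuals")
plt.show()

# r is compared with a table of critical values (such as the one referenced
# below): values well below the critical value indicate departure from
# normality.
print(f"correlation of the probability plot points: r = {r:.4f}")
```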

[Table: critical values of the normal probability plot correlation coefficient at the 0.01, 0.05 and 0.10 significance levels, for sample sizes n from 4 to 75; see the NIST/SEMATECH e-Handbook link below.]

The normal probability plot is used to answer the following questions:

1. Are the data (meaning the residuals) normally distributed?

2. What is the nature of the departure from normality (data skewed, shorter than

expected tails, longer than expected tails)?


http://www.itl.nist.gov/div898/handbook/eda/section3/eda3676.htm

Typical Normal Probability Plot: Normally Distributed Data

The following normal probability plot is from the heat flow meter data.


Normal Probability Plot for Data with Short Tails

The following is a normal probability plot for 500 random numbers generated from a

Tukey-Lambda distribution with the parameter equal to 1.1.


Typical Normal Probability Plot: Data Have Long Tails

The following is a normal probability plot of 500 numbers generated from a

double exponential distribution. The double exponential distribution is symmetric, but

relative to the normal it declines rapidly and has longer tails.


The non-linearity of the normal probability plot can show up in two ways. First, the middle of the data may show an S-like pattern. This is common for both short and long tails; in this particular case, the S pattern in the middle is fairly mild. Second, the first few and the last few points show marked departure from the reference fitted line. In the plot above, this is most noticeable for the first few data points. In comparing this plot to the short-tail example in the previous section, the important difference is the direction of the departure from the fitted line for the first few and the last few points. For long tails, the first few points show increasing departure below the fitted line and the last few points show increasing departure above the fitted line. For short tails, this pattern is reversed.

In this case we can reasonably conclude that the normal distribution can be improved upon as a model for these data. For probability plots that indicate long-tailed distributions, the next step might be to generate a Tukey-Lambda PPCC plot of the data.
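The short- and long-tailed reference cases above can be reproduced for comparison with SciPy's tukeylambda and laplace (double exponential) generators; a rough sketch:

```python
import matplotlib.pyplot as plt
from scipy import stats

# 500 draws from a Tukey-Lambda distribution with parameter 1.1 (short tails)
# and 500 draws from a double exponential (Laplace) distribution (long tails).
short_tails = stats.tukeylambda.rvs(1.1, size=500, random_state=3)
long_tails = stats.laplace.rvs(size=500, random_state=3)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
stats.probplot(short_tails, dist="norm", plot=ax1)
ax1.set_title("Short-tailed sample (Tukey-Lambda, parameter 1.1)")
stats.probplot(long_tails, dist="norm", plot=ax2)
ax2.set_title("Long-tailed sample (double exponential)")
plt.tight_layout()
plt.show()
```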


Typical Normal Probability Plot: Data are Skewed Right

Discussion: This quadratic pattern in the normal probability plot is the signature of a significantly right-skewed data set. Similarly, if all the points on the normal probability plot fell above the reference line connecting the first and last points, that would be the signature pattern for a significantly left-skewed data set.

In this case we can quite reasonably conclude that we need to model these data with a right-skewed distribution such as the Weibull or lognormal.

9. Example: Yield and Fertilizer

We wish to investigate the relationship between yield (liters) and fertilizer (kg/ha) for tomato plants. An experiment was conducted in the Schwarz household one summer on plots of land where the amount of fertilizer was varied and the yield measured at the end of the season.

The amount of fertilizer applied to each plot was chosen between 5 and 18 kg/ha. While the levels were not systematically chosen (e.g. they were not evenly spaced between the highest and lowest values), they represent commonly used amounts based on a preliminary survey of producers.

Interest also lies in predicting the yield when 16 kg/ha are assigned. The levels of fertilizer were randomly assigned to the plots. At the end of the experiment, the yields were measured and the following data were obtained.


http://www.itl.nist.gov/div898/handbook/eda/section3/eda3668.htm
http://www.itl.nist.gov/div898/handbook/eda/section3/eda3669.htm

In this study, it is quite clear that the fertilizer is the predictor (X) variable, while the response variable (Y) is the yield.

The population consists of all possible field plots with all possible tomato plants of this type grown under all possible fertilizer levels between about 5 and 18 kg/ha.

If all of the population could be measured (which it can't), you could find the relationship between the yield and the amount of fertilizer applied. This relationship would have the form:

Y = β0 + β1 X + ε

where β0 and β1 represent the true population intercept and slope respectively. The term ε represents random variation that is always present, i.e. even if the same plot were grown twice in a row with the same amount of fertilizer, the yield would not be identical (why?).

The population parameters to be estimated are β0, the true average yield when the amount of fertilizer is 0, and β1, the true average change in yield per unit change in the amount of fertilizer. These are taken over all plants in all possible field plots of this type. The values of β0 and β1 are impossible to obtain as the entire population could never be measured.

Analysis

Here is the data entered into the statistical package's data sheet. Note the scale of both variables (continuous). The ordering of the rows is NOT important; however, it is often easier to find individual data points if the data are sorted by the X value and the rows for future predictions are placed at the end of the dataset.
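A comparable layout can be set up in any general-purpose tool; for example, a pandas sketch (the numbers are invented placeholders, with the future-prediction row left blank and placed last):

```python
import numpy as np
import pandas as pd

# Invented values standing in for the real data sheet.
data = pd.DataFrame({
    "fertilizer": [12.0, 5.0, 15.0, 17.0, 8.0, 14.0, 18.0, 11.0],
    "yield_l":    [27.0, 18.0, 29.0, 31.0, 22.0, 28.0, 33.0, 25.0],
})

# Sort by the X value so that individual points are easy to find ...
data = data.sort_values("fertilizer").reset_index(drop=True)

# ... and place the row reserved for the future prediction (yield unknown)
# at the end of the data set.
future = pd.DataFrame({"fertilizer": [16.0], "yield_l": [np.nan]})
data = pd.concat([data, future], ignore_index=True)
print(data)
```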


Use the Statistics -> Regression Analysis -> Simple Regression platform to start the analysis. Specify the Y and X variables as needed.

Then click OK. A new spreadsheet will be created that contains the regression results.


At this stage, it would also be useful to draw a scatter plot of the data (refer to the previous tutorials).

The relationship looks approximately linear; there don't appear to be any outlier or influential points; and the scatter appears to be roughly equal across the entire regression line. Residual plots will be used later to check these assumptions in more detail.

The Fit menu item allows you to fit the least-squares line. The actual fitted line is drawn on the scatter plot, and the coefficients of the fitted straight line (here called A1 for the intercept and A2 for the slope) are printed below the fit spreadsheet.
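The same least-squares fit can be reproduced outside the package, for example with scipy.stats.linregress; a minimal sketch in which the fertilizer and yield arrays are placeholders to be replaced by the columns of the data sheet:

```python
import numpy as np
from scipy import stats

# Placeholder arrays -- substitute the fertilizer (kg/ha) and yield (L)
# columns from the data sheet.
fertilizer = np.array([5.0, 8.0, 11.0, 12.0, 14.0, 15.0, 17.0, 18.0])
yield_l = np.array([18.0, 22.0, 25.0, 27.0, 28.0, 29.0, 31.0, 33.0])

fit = stats.linregress(fertilizer, yield_l)
print("intercept b0:", fit.intercept)
print("slope b1:    ", fit.slope)
print("se(b1):      ", fit.stderr)   # used later for intervals and tests

# Predicted values and residuals, e.g. for the residual plots of section 7.
y_hat = fit.intercept + fit.slope * fertilizer
residuals = yield_l - y_hat
```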


The estimated regression line is:

estimated yield = 12.856 + 1.101 (amount of fertilizer)

In terms of estimates, b0 = 12.856 is the estimated intercept, and b1 = 1.101 is the estimated slope.

The estimated slope is the estimated change in yield when the amount of fertilizer is increased by 1 unit. In this case, the yield is expected to increase (why?) by about 1.10 L when the fertilizer amount is increased by 1 kg/ha. NOTE that the slope is the change in the average yield per unit change in fertilizer. In this particular case the intercept has a meaningful interpretation, but I'd be worried about extrapolating outside the range of the observed X values.

Once again, these are the results from a single experiment. If the experiment were repeated, you would obtain different estimates (b0 and b1 would change). The sampling distribution would describe the variation in b0 and b1 over all possible experiments. The standard deviation of b0 and b1 over all possible experiments is again referred to as the standard error of b0 and b1.
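This idea can be made concrete by simulation: repeatedly generate data from an assumed "true" line, refit each simulated experiment, and look at the spread of the estimates. A sketch (all of the "true" values here are invented for illustration):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
x = np.linspace(5, 18, 11)             # 11 plots; illustrative X values
beta0, beta1, sigma = 12.0, 1.1, 2.0   # assumed "true" population values

intercepts, slopes = [], []
for _ in range(5000):                  # 5000 simulated experiments
    y = beta0 + beta1 * x + rng.normal(scale=sigma, size=x.size)
    fit = stats.linregress(x, y)
    intercepts.append(fit.intercept)
    slopes.append(fit.slope)

# The standard deviation of the estimates over the repeated experiments is,
# by definition, the standard error of b0 and b1.
print("simulated se(b0):", np.std(intercepts, ddof=1))
print("simulated se(b1):", np.std(slopes, ddof=1))
```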

The formulae for the standard errors of b0 and b1 are messy, and hopeless to compute by hand. And just like inference for a mean or a proportion, we can obtain estimates of the standard errors from the statistical package (from the regression results sheet created earlier).


The estimated standard error for b1 (the estimated slope) is 0.132 L per kg. This is an estimate of the standard deviation of b1 over all possible experiments. Normally, the intercept is of limited interest, but a standard error can also be found for it as shown in the above table.

Using exactly the same logic as when we found a confidence interval for the population mean, a confidence interval for the population slope (β1) is found (approximately) as b1 ± 2(estimated se). In the above example, an approximate confidence interval for β1 is found as

1.101 ± 2 × (0.132) = 1.101 ± 0.264, i.e. (0.837 to 1.365) L/kg

of fertilizer applied. An "exact" confidence interval can be computed by the software as shown above. The "exact" confidence interval is based on the t-distribution and is slightly wider than our approximate confidence interval because the total sample size (11 pairs of points) is rather small.

We interpret this interval as 'being 95% confident that the true increase in yield when the amount of fertilizer is increased by one unit is somewhere between 0.837 and 1.365 L/kg.'
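Both intervals can be reproduced from the reported estimates; a sketch using the slope, standard error, and sample size quoted above, where the "exact" interval simply replaces the multiplier 2 with the t quantile for 9 degrees of freedom:

```python
from scipy import stats

b1, se_b1, n = 1.101, 0.132, 11     # estimates quoted in the text
df = n - 2                          # 9 degrees of freedom

# Approximate 95% interval: estimate +/- 2 standard errors.
approx = (b1 - 2 * se_b1, b1 + 2 * se_b1)

# "Exact" interval: replace 2 by the t quantile for 9 df (about 2.26),
# which is why the exact interval is slightly wider.
t_crit = stats.t.ppf(0.975, df)
exact = (b1 - t_crit * se_b1, b1 + t_crit * se_b1)

print("approximate:", approx)
print("exact:      ", exact)
```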


Be sure to carefully distinguish between β and b. Note that the confidence interval is computed using b1, but it is a confidence interval for β1, the population parameter that is unknown.

In linear regression problems, one hypothesis of interest is whether the true slope is zero. This would correspond to no linear relationship between the response and predictor variable (why?). In many cases, a confidence interval tells the entire story.

The software produces a test of the hypothesis that each of the parameters (the slope and the intercept in the population) is zero. The output is reproduced below:

The test of hypothesis about the intercept is not of interest (why?).

Let

β1 be the true (unknown) slope, and

b1 be the estimated slope. In this case b1 = 1.1014.

The hypothesis testing proceeds as follows. Again note that we are interested in the population parameters and not the sample statistics:

1. Specify the null and alternate hypotheses:

H0: β1 = 0
H1: β1 ≠ 0

Notice that the null hypothesis is in terms of the population parameter β1. This is a two-sided test, as we are interested in detecting differences from zero in either direction.

2. Find the test statistic and the p-value. The test statistic is computed as:

T = (b1 - 0) / se(b1) = 1.101 / 0.132 ≈ 8.3

In other words, the estimate is over 8 standard errors away from the hypothesized value! This will be compared to a t-distribution with n - 2 = 9 degrees of freedom. The p-value is found to be very small (less than 0.0001).
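The same test statistic and two-sided p-value can be recomputed directly from the reported estimates; a minimal sketch:

```python
from scipy import stats

b1, se_b1, n = 1.101, 0.132, 11     # estimates quoted in the text
df = n - 2                          # 9 degrees of freedom

t_stat = (b1 - 0) / se_b1                    # roughly 8.3
p_value = 2 * stats.t.sf(abs(t_stat), df)    # two-sided p-value

print(f"T = {t_stat:.2f}, p = {p_value:.2g}")  # p is well below 0.0001
```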


3. Form a conclusion. Because the p-value is so small, there is strong evidence against the hypothesis that the true slope is zero; there appears to be a linear relationship between yield and the amount of fertilizer.


Two types of predictions are often of interest. First, the experimenter may be interested in predicting a single FUTURE individual response at a particular X, i.e. the yield of one additional plot when 16 kg/ha of fertilizer is added.

Second, the experimenter may be interested in predicting the average of ALL FUTURE responses at a particular X. This would correspond to the average yield for all future plots when 16 kg/ha of fertilizer is added.

The prediction interval for an individual response is sometimes called a confidence interval for an individual response, but this is an unfortunate (and incorrect) use of the term confidence interval. Strictly speaking, confidence intervals are computed for fixed unknown parameter values; prediction intervals are computed for future random variables.
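The distinction shows up directly in software output; for example, statsmodels reports both kinds of intervals at a new X value. A sketch with placeholder data (substitute the real fertilizer and yield columns), predicting at 16 kg/ha:

```python
import pandas as pd
import statsmodels.formula.api as smf

# Placeholder data -- substitute the real fertilizer and yield columns.
df = pd.DataFrame({
    "fertilizer": [5.0, 8.0, 11.0, 12.0, 14.0, 15.0, 17.0, 18.0],
    "yield_l":    [18.0, 22.0, 25.0, 27.0, 28.0, 29.0, 31.0, 33.0],
})

fit = smf.ols("yield_l ~ fertilizer", data=df).fit()
new = pd.DataFrame({"fertilizer": [16.0]})
pred = fit.get_prediction(new).summary_frame(alpha=0.05)

# mean_ci_lower/upper: confidence interval for the AVERAGE yield at X = 16
# obs_ci_lower/upper:  prediction interval for ONE future plot at X = 16
print(pred[["mean", "mean_ci_lower", "mean_ci_upper",
            "obs_ci_lower", "obs_ci_upper"]])
```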

(To be continued.)
