orlando-estes
Regression revisited. FPP 11 and 12 and a little more. Statistical modeling. Often researchers seek to explain or predict one variable from others. In most contexts, it is impossible to do this perfectly: too much we don’t know. - PowerPoint PPT Presentation
FPP 11 and 12 and a little more
Regression revisited
Statistical modeling
Often researchers seek to explain or predict one variable from others.
In most contexts, it is impossible to do this perfectly: too much we don’t know.
Use mathematical models that describe relationships as best we can.
Incorporate chance error into models so that we can incorporate uncertainty in our explanations/predictions.
Linear regression
Linear regression is probably the most common statistical model.
The idea is like regression lines from Chapter 10. But the slope and intercept of a sample regression line are estimates of the slope and intercept of a true (population) line, just like a sample mean is an estimate of a population mean.
Hence, we can make inferences (confidence intervals and hypothesis tests) for the true slope and true intercept.
Linear regression
Often relationships are described reasonably well by a linear trend.
Linear regression allows us to estimate these trends.
Plan of attack
Pose regression model and investigate assumptions
Estimate regression parameters from data
Use hypothesis testing and confidence interval ideas to determine whether the observed relationship between two variables could have occurred by chance alone
Regression with multiple predictors
Regression terminology
Typically, we label the outcome variable as Y and the predictor as X.
Synonyms for outcome variables: response variable, dependent variable
Synonyms for predictor variables: explanatory variables, independent variables, covariates
Some notation
Recall the regression line (least squares line) notation from earlier in the class:
$y = \alpha + \beta x$
α denotes the population intercept
β denotes the population slope
Sample regression line
If we collect a sample from some population and use the sample values to calculate a regression line, then there is uncertainty associated with the sample slope and intercept estimates.
The following notation is used to denote the sample regression line:
$\hat{y} = a + bx$
Motivating example
A forest service official needs to determine the total volume of lumber on a piece of forest land.
Any ideas on how she might do this without cutting down lots of trees?
She hopes that predicting the volume of wood from tree diameter for individual trees will help determine the total volume for the piece of forest land. She investigates: “Can the volume of wood for a tree be predicted by its diameter?”
Motivating example
First she randomly samples 31 trees and measures the diameter and then the volume of each tree.
Then she constructs a scatter plot of the data collected and checks for a linear pattern.
Is the relationship linear?
We know how to estimate the slope and intercept of the line that “best” fits the data.
But
Motivating example
What would happen if the forest service agent took another sample of 31 trees?
Would the slope change?
Would the intercept change?
What about a third sample of 31 trees?
a and b are statistics: their values depend on the sample.
We know how to compute them.
They are also estimates of a population intercept and slope.
Mathematics of regression model
To accommodate the added uncertainty associated with the regression line, we add one more term to the model:
$y_i = \alpha + \beta x_i + \varepsilon_i$, where $\varepsilon_i$ comes from $N(0, \sigma_\varepsilon)$
This model specification has four assumptions:
1. the average value of Y for each X falls on the line
2. the deviations don’t depend on X
3. the deviations from the straight line follow a normal curve
4. all units are independent
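To make the model concrete, here is a minimal Python sketch that draws a few samples of 31 units from a population following $y_i = \alpha + \beta x_i + \varepsilon_i$ and refits the least squares line each time. The population values (`ALPHA`, `BETA`, `SIGMA`), the range of x, and the helper names are assumptions chosen purely for illustration; they are not estimates from the forest data.

```python
import random

# Hypothetical population parameters (assumed for illustration only).
ALPHA, BETA, SIGMA = -37.0, 5.0, 4.0
N = 31  # sample size, matching the number of trees in the example


def draw_sample(rng, n=N):
    """Simulate n (x, y) pairs from y = ALPHA + BETA*x + N(0, SIGMA) error."""
    xs = [rng.uniform(8.0, 20.0) for _ in range(n)]  # assumed range of x
    ys = [ALPHA + BETA * x + rng.gauss(0.0, SIGMA) for x in xs]
    return xs, ys


def least_squares(xs, ys):
    """Return the sample intercept a and slope b of the least squares line."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b = sxy / sxx
    a = y_bar - b * x_bar
    return a, b


rng = random.Random(0)  # fixed seed so the run is reproducible
fits = [least_squares(*draw_sample(rng)) for _ in range(3)]
for a, b in fits:
    print(f"a = {a:.2f}, b = {b:.2f}")
```

Each sample gives a somewhat different a and b, even though the population line never changes. That sample-to-sample variation is what the inference in the following slides quantifies.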
The mechanics of regression
Questions we aim to answer
How do we perform statistical inference on the intercept and slope of the regression line?
What is a typical deviation from the regression line?
How do we know the regression line explains the data well?
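Estimating intercept and slope
From early in the semester, recall that the slope and intercept estimates for the line of “best” fit are
$b = r \frac{SD_y}{SD_x} = 0.97 \left( \frac{16.46}{3.14} \right) \approx 5.07$
$a = \bar{y} - b\bar{x} = 30.14 - 5.07(13.25) \approx -37.02$
$\hat{y} = -37.02 + 5.07x$
(The inputs shown are rounded, so recomputing from them can differ slightly in the last digits.)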
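The slope and intercept formulas can be checked with a few lines of Python. This sketch uses only the rounded summary statistics printed on the slide; the variable names are illustrative.

```python
# Summary statistics as reported on the slide (all rounded).
r = 0.97        # correlation between diameter and volume
sd_y = 16.46    # SD of volume
sd_x = 3.14     # SD of diameter
y_bar = 30.14   # mean volume
x_bar = 13.25   # mean diameter

b = r * sd_y / sd_x      # sample slope: b = r * SD_y / SD_x
a = y_bar - b * x_bar    # sample intercept: a = y_bar - b * x_bar

print(f"b = {b:.2f}, a = {a:.2f}")
```

With these rounded inputs the result is roughly b = 5.08 and a = -37.2, close to the 5.07 and -37.02 on the slide. The small gap is what one would expect if the slide values were computed from unrounded statistics.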
Root mean square error (RMSE)
What is the typical deviation from the regression line for a given x?
The typical deviation is denoted by $\sigma_\varepsilon$.
The root mean square error (RMSE) is a measure of the typical deviation from the regression line for a given x.
For the trees data this is 4.28.
A tree with a diameter of 15 inches can be expected to have a volume of -37.02 + 5.07(15) = 39.03 cubic inches, give or take about 4.28 cubic inches.
JMP output
Residuals are used to compute RMSE
The deviation of each $y_i$ from the line is called a residual, which we will denote by $d_i$:
$d_i = y_i - \hat{y}_i = y_i - (a + bx_i)$
The estimate of $\sigma_\varepsilon$ used in most software packages is denoted by $s_\varepsilon$:
$s_\varepsilon = \sqrt{\frac{1}{n-2} \sum_{i=1}^{n} d_i^2}$
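The residual and RMSE formulas translate directly into code. The sketch below uses a tiny made-up data set (hypothetical values, not the tree measurements) so that every step can be followed by hand.

```python
import math

# Hypothetical toy data (not the tree measurements).
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.0, 5.0, 4.0, 5.0]

n = len(xs)
x_bar = sum(xs) / n
y_bar = sum(ys) / n

# Least squares slope and intercept.
b = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sum(
    (x - x_bar) ** 2 for x in xs
)
a = y_bar - b * x_bar

# Residuals d_i = y_i - (a + b*x_i).
residuals = [y - (a + b * x) for x, y in zip(xs, ys)]

# s_eps = sqrt( sum(d_i^2) / (n - 2) ), the estimate of sigma_eps.
s_eps = math.sqrt(sum(d ** 2 for d in residuals) / (n - 2))

print([round(d, 2) for d in residuals], round(s_eps, 3))
```

For this toy data the fitted line is y = 2.2 + 0.6x, the residuals sum to zero (as they always do for a least squares line with an intercept), and the divisor is n - 2 = 3 rather than n.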
Significance tests and CIs
Going back to the example of trees sampled from the plot of land:
The sampled trees are one possible random sample from all trees in the plot of land.
Questions:
What is a likely range for the population regression slope?
Does the sample regression slope provide enough evidence to say with conviction that the population slope doesn’t equal zero?
Why zero?
CI for slope
Est. ± multiplier*SE
Same old friend in a new hat
We will use the sample slope as the estimate.
The multiplier is found from a t-distribution with (n-2) degrees of freedom.
The SE of the slope (not to be confused with RMSE) is
$SE_b = \frac{s_\varepsilon}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}}$
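As a rough check of the SE formula, the sum of squared deviations of x can be rebuilt from the SD of diameter. This sketch assumes the reported SD of 3.14 uses the n - 1 divisor, which the slides do not state, so treat the result as approximate.

```python
import math

n = 31          # number of sampled trees
s_eps = 4.28    # RMSE reported for the trees data
sd_x = 3.14     # SD of diameter (ASSUMED to use the n - 1 divisor)

# Under that assumption, sum (x_i - x_bar)^2 = (n - 1) * sd_x^2.
sxx = (n - 1) * sd_x ** 2
se_b = s_eps / math.sqrt(sxx)

print(f"SE_b = {se_b:.4f}")
```

This gives about 0.249, in line with the 0.249567 used in the confidence interval. The formula also shows why the slope is estimated more precisely with more units or with x values that are more spread out: both make the denominator larger.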
CI slope
A 95% confidence interval for the population slope between diameter of tree and volume is
b ± multiplier*SE_b
5.07 ± multiplier*0.249567
Where does the multiplier value come from?
We use the t-table: find the entry for n-2 degrees of freedom and match it with the desired confidence level.
But with 31-2 = 29 d.f. we are not able to use the t-table, so we use the normal curve instead:
5.07 ± 1.96*0.249567
(4.58, 5.56)
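The interval arithmetic can be reproduced in Python with the slide's numbers. For comparison, the sketch also uses 2.045, the 95% multiplier that standard t-tables list for 29 degrees of freedom; that value is brought in from outside the slides.

```python
from statistics import NormalDist

b = 5.07          # sample slope from the slides
se_b = 0.249567   # standard error of the slope from the slides

# Normal multiplier for 95% confidence (about 1.96).
z = NormalDist().inv_cdf(0.975)
ci_normal = (b - z * se_b, b + z * se_b)

# For comparison: t multiplier for 29 d.f. (value taken from a standard
# t-table, not from the slides).
T_29 = 2.045
ci_t = (b - T_29 * se_b, b + T_29 * se_b)

print(f"normal: ({ci_normal[0]:.2f}, {ci_normal[1]:.2f})")
print(f"t(29):  ({ci_t[0]:.2f}, {ci_t[1]:.2f})")
```

The normal multiplier reproduces (4.58, 5.56). The t multiplier gives a slightly wider interval, so at this sample size the normal approximation changes the answer only in the second decimal place.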
CI of slope
We found that a 95% confidence interval for β is (4.58, 5.56).
What is the interpretation of this interval?
We are 95% confident that the population slope that describes the relationship between a tree’s diameter and its lumber volume is between 4.58 and 5.56 cubic inches of volume per inch of diameter.
What does the statement “95% confidence” mean? (This is the same thing as statistical confidence.)
We are confident that the method used will give correct results in 95% of all possible samples.
That is, 95% of all samples will produce confidence intervals that contain the true population slope.
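The "95% of all samples" interpretation can be illustrated by simulation. This sketch repeatedly samples from a hypothetical population (the values of `ALPHA`, `BETA`, `SIGMA` and the range of x are assumptions, not estimates from the forest data), builds a 95% interval for the slope each time, and counts how often the interval contains the true slope. The multiplier 2.045 is the standard t-table value for 29 degrees of freedom.

```python
import math
import random

ALPHA, BETA, SIGMA = -37.0, 5.0, 4.0  # hypothetical population values
N = 31
T_MULT = 2.045  # t multiplier for 95% confidence with N - 2 = 29 d.f.


def slope_interval(xs, ys):
    """95% confidence interval for the slope from one sample."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sxx
    a = y_bar - b * x_bar
    sse = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    se_b = math.sqrt(sse / (n - 2)) / math.sqrt(sxx)
    return b - T_MULT * se_b, b + T_MULT * se_b


rng = random.Random(1)  # fixed seed so the run is reproducible
n_samples = 2000
hits = 0
for _ in range(n_samples):
    xs = [rng.uniform(8.0, 20.0) for _ in range(N)]
    ys = [ALPHA + BETA * x + rng.gauss(0.0, SIGMA) for x in xs]
    low, high = slope_interval(xs, ys)
    hits += low <= BETA <= high  # True counts as 1

coverage = hits / n_samples
print(f"coverage = {coverage:.3f}")
```

The printed coverage should land close to 0.95. Any single interval either contains the true slope or it does not; the 95% describes the long-run behaviour of the method.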
Hypothesis test for existence of linear relationship
What parameter (α or β) should we test to determine whether X is useful in predicting Y?
We want to test:
H0: There is NO linear relationship between two numerical variables (X is NOT useful for predicting Y)
Ha: There is a linear relationship between two numerical variables (X is useful for predicting Y)
Draw the picture
The hypotheses can also be stated as H0: β = 0 vs. Ha: β ≠ 0
Hypothesis test
The test statistic is
$t = \frac{\text{est.} - \text{hyp.}}{SE} = \frac{5.07 - 0}{0.2495} \approx 20.3$
To find the p-value associated with this test statistic, we find the area under a t-curve with (n-2) degrees of freedom.
According to JMP, this p-value is smaller than 0.0001. According to the table, it is smaller than 0.0005.
Hence, there is strong evidence against the null. Conclude that the sample regression slope is not consistent with a population regression slope being equal to zero. There does appear to be a relationship between the diameter of a tree and its volume.
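The test statistic can be reproduced in a few lines. For the p-value, this sketch uses the normal curve from Python's standard library as a stand-in for the t-curve with 29 degrees of freedom, so it is an approximation rather than the exact calculation JMP performs.

```python
from statistics import NormalDist

b = 5.07        # sample slope
se_b = 0.2495   # standard error of the slope
beta_0 = 0.0    # slope under the null hypothesis

t_stat = (b - beta_0) / se_b

# Two-sided p-value from the normal curve, used here as a stand-in
# for the t-curve with 29 degrees of freedom.
p_value = 2 * (1 - NormalDist().cdf(abs(t_stat)))

print(f"t = {t_stat:.2f}, p < 0.0001: {p_value < 0.0001}")
```

A statistic more than 20 standard errors from zero leaves essentially no area in the tails, which agrees with the reported p-value of less than 0.0001.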
JMP output
How well does regression model fit data?
To determine this, we need to check the assumptions made when using the model.
Recall that the regression assumptions are
1. The average value of Y for each X falls on a line (i.e. the relationship between Y and X is linear)
2. The spread of the deviations (RMSE) is the same for all X
3. For any X, the distribution of Y around its mean is a normal curve.
4. All units are independent
Check the regression fit to the data
When the assumptions are true, values of the residuals should reflect chance error.
That is, there should be only random patterns in the residuals.
Check this by plotting the residuals versus the predictor.
If there is a non-random pattern in this plot, assumptions might be violated.
Diagnosing residual plots
When the pattern in the residuals around the horizontal line at zero is:
Curved (e.g. parabolic shape): Assumption 1 (linearity) is violated
Fan-shaped: Assumption 2 (equal spread) is violated
Filled with many outliers: Assumption 3 (normality) is violated
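To see what a curved residual pattern looks like in numbers, the sketch below fits a straight line to made-up data that actually follow a parabola (hypothetical values chosen so the arithmetic is exact) and lists the residuals in order of increasing x.

```python
# Hypothetical data with a curved relationship: y = x^2.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [x ** 2 for x in xs]

n = len(xs)
x_bar = sum(xs) / n
y_bar = sum(ys) / n
b = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sum(
    (x - x_bar) ** 2 for x in xs
)
a = y_bar - b * x_bar

# Residuals from the straight-line fit, listed in order of increasing x.
residuals = [y - (a + b * x) for x, y in zip(xs, ys)]
for x, d in zip(xs, residuals):
    print(f"x = {x:.0f}, residual = {d:+.1f}")
```

The residuals run positive, negative, negative, negative, positive: a U shape around zero rather than random scatter. That is the signature of a violated linearity assumption.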
Possible patterns in Residual Plots
Residual plot
Do the residuals look randomly scattered?
Or is there some pattern?
Is the spread of the points similar at different values of diameter?
One number summary of regression fit
$R^2$ is the proportion of variation in the Y’s explained by the regression line, often reported as a percentage.
$R^2$ lies between 0 and 1.
Values near 1 indicate the regression predicts the y’s in the data set very closely.
Values near 0 indicate the regression does not predict the y’s in the data set very closely.
Interpretation in tree example
We get $R^2 = 0.93$. Hence, the regression line between diameter and volume explains 93% of the variability in volume.
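The slides do not show a formula for $R^2$. As a supplementary sketch, one standard way to write it in the notation used here, with residuals $d_i$ and correlation $r$, is:

```latex
R^2 = 1 - \frac{\sum_{i=1}^{n} d_i^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2} = r^2
```

The fraction compares the variation left over around the regression line with the total variation of the y’s around their mean. The last equality holds for regression with a single predictor, so the rounded r = 0.97 from the earlier slide gives a value near 0.94, roughly in line with the reported 0.93.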
Caution about $R^2$
Don’t rely exclusively on $R^2$ as a measure of the goodness of fit of the regression.
It can be large even when assumptions are violated
Always check the assumptions with residual plots before accepting any regression model
Predictions from regression
To predict an outcome for a unit with unobserved Y but known X, use the fitted regression model
$\hat{y} = a + bx$
Example from the tree data:
Predict the volume of a tree that has a 15 inch diameter
$\hat{y} = -37.02 + 5.07x = -37.02 + 5.07(15) = 39.03$ cubic inches
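A prediction is just the fitted line evaluated at the new x. The sketch below wraps this in a small function using the slide's estimates. The optional `x_min` and `x_max` arguments are a hypothetical addition for guarding against extrapolation; the slides do not report the observed diameter range, so it has to be supplied by the user.

```python
A = -37.02   # sample intercept from the slides
B = 5.07     # sample slope from the slides
RMSE = 4.28  # typical deviation from the line, from the slides


def predict_volume(diameter, x_min=None, x_max=None):
    """Predicted volume a + b*x; raises if x lies outside [x_min, x_max]."""
    if x_min is not None and x_max is not None:
        if not x_min <= diameter <= x_max:
            raise ValueError("diameter outside observed range: extrapolation")
    return A + B * diameter


y_hat = predict_volume(15)
print(f"predicted volume: {y_hat:.2f}, give or take about {RMSE}")
```

For a 15 inch tree this gives 39.03, matching the RMSE slide. Passing the smallest and largest sampled diameters as `x_min` and `x_max` makes the function refuse to predict beyond the range of the data.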
Recall warnings
Predicting Y at values of X beyond the range of the X’s in the data is dangerous (extrapolation)
Association doesn’t imply causation
Influential points/outliers
  Fit the model with and without the point to see if the estimates change
Often we aren’t interested in the intercept
Ecological inference
  Regression fits for aggregated data tend to show stronger relationships
With census data there is no sampling variability (we’ve exhausted the population)
  There is no standard error
  Sometimes census data are viewed as a random sample from a hypothetical “super-population”. In this case the census data provide inferences about the super-population
When using time as the X variable, care must be taken, as the independent units assumption is often not valid
  Most likely will need to use special models