10/01/09 Lecture 9 1
STOR 155 Introductory Statistics
Lecture 9: Cautions about Regression
and Correlation, Causation
The UNIVERSITY of NORTH CAROLINA
at CHAPEL HILL
Review
• Least-Squares Regression Lines
• Equation and interpretation of the line
• Prediction using the line
• Correlation and Regression
• Coefficient of Determination
Regression Diagnostics
• Look at residuals (errors):
– A residual is the difference between an
observed value of the response variable and
the value predicted by the regression line, i.e.,
residual = $y - \hat{y}$.
– The sum of the least-squares residuals is
always zero. Why?
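One way to answer the question: the least-squares intercept is a = ȳ − b·x̄, so the fitted line passes through the point (x̄, ȳ), and the residuals above and below the line cancel exactly. The following is a minimal Python sketch of that check. The data values and the helper name `least_squares` are made up for illustration and do not come from the lecture.

```python
def least_squares(xs, ys):
    """Return (intercept, slope) of the least-squares line y = a + b*x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar  # forces the line through (x_bar, y_bar)
    return intercept, slope


# Hypothetical data, chosen only to illustrate the property.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

a, b = least_squares(xs, ys)
residuals = [y - (a + b * x) for x, y in zip(xs, ys)]  # observed minus predicted

# The sum is zero up to floating-point rounding error.
print(sum(residuals))
```

The same cancellation holds for any data set, as long as the line is fitted by least squares with an intercept.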
Residual Plots
• A residual plot is a scatterplot of the
regression residuals against the
explanatory variable.
• Residual plots help us assess the fit of a
regression line.
Residual Plot
• If the regression line catches the overall
pattern of the data, there should be no
pattern in the residuals: they should look
totally random.
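As a rough illustration of what a pattern in the residuals looks like, the sketch below fits a straight line to data that actually follow a curve. The data (y = x² at five points) and the helper name `least_squares` are assumptions made for this example, not part of the lecture.

```python
def least_squares(xs, ys):
    """Return (intercept, slope) of the least-squares line y = a + b*x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


# Hypothetical curved data: the true relationship is y = x**2, not a line.
xs = [1, 2, 3, 4, 5]
ys = [x ** 2 for x in xs]

a, b = least_squares(xs, ys)
residuals = [y - (a + b * x) for x, y in zip(xs, ys)]

# A text version of the residual plot: residual against the explanatory variable.
for x, r in zip(xs, residuals):
    print(f"x={x}  residual={r:+.1f}")
```

Here the residuals are 2, -1, -2, -1, 2: positive at both ends and negative in the middle. That U shape is a systematic pattern, which signals that a straight line does not capture the relationship.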
Diabetes Patients: FPG vs HbA
• FPG: fasting plasma glucose.
• HbA: percent of red blood cells that have a
glucose molecule attached.
• Both are measuring blood glucose.
• We expect a positive association.
• 18 subjects, r = 0.4819.
• See the scatterplot on the next page.
Outliers and Influential Observations
• An outlier is a point that lies outside the overall
pattern of the other points.
– Outliers in the y direction have large residuals, but
other outliers may not.
• An influential observation is a point whose
removal would markedly change the regression
line.
– Outliers in the x direction are often influential
points.
– But not always…
Outliers & Influential Obs.
• Outliers in the y direction can be spotted
from the residual plot.
• Influential points can be identified by
fitting regression lines with and without those
points. They are more serious.
– They cannot be identified via the residual plot.
– The scatterplot gives us some hint.
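The with/without comparison can be sketched in a few lines of Python. The six data points and the helper name `least_squares` are hypothetical, chosen so that the last point is an outlier in the x direction.

```python
def least_squares(xs, ys):
    """Return (intercept, slope) of the least-squares line y = a + b*x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


# Hypothetical data: five points on the line y = x, plus one point far out in x.
xs = [1, 2, 3, 4, 5, 15]
ys = [1, 2, 3, 4, 5, 3]

a_all, b_all = least_squares(xs, ys)             # fit with the suspect point
a_drop, b_drop = least_squares(xs[:-1], ys[:-1])  # fit without it

residuals_all = [y - (a_all + b_all * x) for x, y in zip(xs, ys)]

print(f"slope with the point:    {b_all:.3f}")
print(f"slope without the point: {b_drop:.3f}")
print(f"residual of the point:   {residuals_all[-1]:+.3f}")
print(f"largest other residual:  {max(abs(r) for r in residuals_all[:-1]):.3f}")
```

In this made-up example the slope drops from 1 to about 0.08 when the point is included, yet the point's own residual is smaller in size than that of an ordinary point. The point pulls the line toward itself, which is why the residual plot alone does not reveal it.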
Cautions about correlation and regression
• Linear only
• DO NOT extrapolate
• Not resistant
• Beware lurking variables
• Beware correlations based on averaged
data
• The restricted-range problem
Lurking Variable
• A lurking (hidden) variable is a variable that has an
important effect on the relationship among the variables
in a study, but is not included among the variables being
studied.
• Examples:
– SAT scores and college grades
• Lurking variable: IQ
Lurking variables can create nonsense correlations.
• For the world’s nations, let x be the number of TVs/person and y be the average life expectancy;
• A high positive correlation – nations with more TV sets have higher life expectancies.
– Could we lengthen the lives of people in Rwanda by shipping them more TVs?
• Lurking variable: wealth of the nation
– Rich nations: more TV sets.
– Rich nations: longer life expectancies because of better nutrition, clean water, and better health care.
• There is no cause-and-effect tie between TV sets and length of life.
• Association vs causation.
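A small made-up data set shows how a lurking variable can produce a strong correlation on its own. In the sketch below, `wealth` drives both x and y, while the remaining variation in x and y is unrelated. All numbers are unitless and hypothetical, and the helper name `correlation` is introduced only for this example.

```python
from math import sqrt


def correlation(xs, ys):
    """Pearson correlation r between two equal-length lists."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


# Hypothetical nations at three wealth levels (the lurking variable z).
wealth = [1] * 4 + [5] * 4 + [9] * 4
x_noise = [1, 1, -1, -1] * 3   # variation in x unrelated to y
y_noise = [1, -1, 1, -1] * 3   # variation in y unrelated to x

x = [w + e for w, e in zip(wealth, x_noise)]  # e.g. a TVs-per-person score
y = [w + e for w, e in zip(wealth, y_noise)]  # e.g. a life-expectancy score

r_all = correlation(x, y)

# Correlation among nations that share the same wealth level.
r_within = {
    level: correlation(
        [xi for xi, w in zip(x, wealth) if w == level],
        [yi for yi, w in zip(y, wealth) if w == level],
    )
    for level in sorted(set(wealth))
}

print(f"r across all nations: {r_all:.3f}")
for level, r in r_within.items():
    print(f"r at wealth level {level}: {r:.3f}")
```

Across all twelve hypothetical nations r is about 0.91, but among nations with the same wealth it is 0. The overall association comes entirely from wealth, not from any link between x and y.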
Beware correlations based on averaged data
• A correlation based on averages over
many individuals is usually higher than the
correlation between the same variables
based on data for individuals.
• Age vs Height
• (Basketball) score % vs practice time
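The effect of averaging can be sketched with the age vs height example. The ages, heights, and the helper name `correlation` below are hypothetical and chosen so the arithmetic is easy to follow.

```python
from math import sqrt


def correlation(xs, ys):
    """Pearson correlation r between two equal-length lists."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


# Hypothetical children: age in years, height in cm, three children per age.
ages = [2, 2, 2, 4, 4, 4, 6, 6, 6]
heights = [75, 85, 95, 90, 100, 110, 105, 115, 125]

r_individuals = correlation(ages, heights)

# Replace each age group by its average height, then correlate the averages.
age_levels = sorted(set(ages))
mean_heights = [
    sum(h for a, h in zip(ages, heights) if a == level) / ages.count(level)
    for level in age_levels
]
r_averages = correlation(age_levels, mean_heights)

print(f"r for individual children: {r_individuals:.3f}")
print(f"r for age-group averages:  {r_averages:.3f}")
```

Averaging removes the child-to-child variation within each age, so the averages fall on a line (r = 1 in this made-up data), while r for the individual children is about 0.83.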
The restricted-range problem
• A restricted-range problem occurs when
one does not get to observe the full range
of the variables.
• When data suffer from restricted range, r
and $r^2$ are lower than they would be if the
full range could be observed.
• SAT scores vs College GPA
– Princeton vs Generic State College (Ex 2.26)
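A minimal sketch of the restricted-range effect, assuming a made-up population in which the outcome equals the score plus or minus a fixed amount of scatter. The variable names, the cutoff of 7, and the helper name `correlation` are all hypothetical.

```python
from math import sqrt


def correlation(xs, ys):
    """Pearson correlation r between two equal-length lists."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


# Hypothetical full population: scores 1..9, two students per score,
# with outcomes scattered 2 units below and above the score.
scores = [s for s in range(1, 10) for _ in (-2, 2)]
outcomes = [s + d for s in range(1, 10) for d in (-2, 2)]

# A selective school only observes students with high scores.
kept = [(s, o) for s, o in zip(scores, outcomes) if s >= 7]
scores_top = [s for s, _ in kept]
outcomes_top = [o for _, o in kept]

r_full = correlation(scores, outcomes)
r_restricted = correlation(scores_top, outcomes_top)

print(f"full range:       r = {r_full:.3f}, r^2 = {r_full ** 2:.3f}")
print(f"restricted range: r = {r_restricted:.3f}, r^2 = {r_restricted ** 2:.3f}")
```

The scatter around the trend is the same in both cases, but the restricted sample covers much less of the x range, so r² falls from 0.625 to about 0.143 in this example.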
Causation vs Association
• Some studies want to find the existence of causation.
• Example of causation:
– Increased drinking of alcohol causes a decrease in
coordination.
– Smoking and lung cancer.
• Example of association:
– The above two examples.
– SAT scores and freshman-year GPA.
Association does not imply causation.
• An association between two variables x and y
can reflect many types of relationship among x,
y, and one or more lurking variables.
• An association between a predictor x and a
response y, even if it is very strong, is not by
itself good evidence that changes in x actually
cause changes in y.
Explaining Association: Causation
• Cause-and-effect
• Examples:
– Amount of fertilizer and yield of corn
– Weight of a car and its MPG
– Dosage of a drug and the survival rate of the mice
Explaining Association: Common Response
• Lurking variables
• Both x and y change in response to changes in z, the lurking variable
• There may be no direct causal link between x and y.
• Examples:
– SAT scores vs College GPA (IQ, Attitude)
– Monthly flow of money into stock mutual funds vs rate of return for the stock market (Market Condition, Investor Attitude)
Explaining Association: Confounding
• Two variables are confounded when their effects
on a response variable are mixed together.
• One explanatory variable may be confounded
with other explanatory variables or lurking
variables.
• Examples:
– More education leads to higher income.
• Family background…
– Religious people live longer.
• Life style…
Establishing causation
• The only compelling method: Designed
experiment (More in Chapter 3)
• Hot disputes:
– Does gun control reduce violent crime?
– Does meat consumption in your diet cause
heart disease?
– Does smoking cause lung cancer?
Does smoking CAUSE lung cancer?
• causation: smoking causes lung cancer.
• common response: people who have a
genetic predisposition to lung cancer also
have a genetic predisposition to smoking.
• confounding: people who drink too much,
don't exercise, eat unhealthy foods, etc.
are more likely to get lung cancer as a
result of their lifestyle. Such people may
be more likely to be smokers as well.
Some guidelines when a designed experiment is impossible:
• strong association
• association consistent across various
studies
• higher dose associated with stronger
responses
• the cause precedes the effect in time
• plausibility