samuel-boone
EXPLORING RELATIONSHIPS BETWEEN QUANTITATIVE VARIABLES
• SCATTERPLOTS, ASSOCIATION, AND CORRELATION
ADDITIONAL REFERENCE READING MATERIAL
• COURSEPACK PAGES 29 – 58
LINEAR RELATIONSHIPS BETWEEN TWO VARIABLES X AND Y
WHY STUDY LINEAR RELATIONSHIPS?
• LINEAR RELATIONSHIPS ARE THE EASIEST TO UNDERSTAND AND ANALYZE;
• MANY RELATIONSHIPS ARE APPROXIMATELY LINEAR;
• VARIABLES WITH A NONLINEAR RELATIONSHIP CAN OFTEN BE TRANSFORMED SO THAT THE RELATIONSHIP OF THE TRANSFORMED VARIABLES IS LINEAR. FOR EXAMPLE, CONSIDER THE EQUATION RELATING BRAIN WEIGHT W AND BODY WEIGHT Z:
$W = cZ^{m}$
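As a brief sketch of the transformation idea, assume the brain-weight relationship has the power-law form $W = cZ^{m}$, where c and m are constants (this form is an assumption read from the slide). Taking logarithms of both sides gives a linear relationship between the transformed variables:

```latex
\begin{align*}
W      &= c\,Z^{m} \\
\log W &= \log c + m \log Z
\end{align*}
```

Under that assumption, a plot of log W against log Z would follow a straight line with slope m and intercept log c.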
EXAMPLES
• RELATIONSHIP BETWEEN SMOKING AND LUNG CANCER;
• RELATIONSHIP BETWEEN ALTITUDE AND THE BOILING POINT OF WATER;
• RELATIONSHIP BETWEEN TEMPERATURE AND OZONE CONCENTRATION IN THE AIR;
IN EACH OF THESE EXAMPLES, TWO VARIABLES ARE INVOLVED: THE RESPONSE VARIABLE, Y, AND THE EXPLANATORY VARIABLE, X.
Three Tools We Will Use
• Scatterplot, a two-dimensional graph of data values
• Correlation, a statistic that measures the strength and direction of a linear relationship between two quantitative variables.
• Regression equation, an equation that describes the average relationship between a quantitative response and explanatory variable.
LEAST SQUARES LINE (REGRESSION LINE)
• GIVEN A SET OF n OBSERVATIONS $(X_1, Y_1), \ldots, (X_n, Y_n)$,
• QUESTION: WHAT LINE “BEST” FITS THE OBSERVATIONS?
• WE SHALL ANSWER THIS QUESTION GRAPHICALLY USING A SCATTERPLOT, AND ANALYTICALLY USING THE LEAST SQUARES REGRESSION FORMULA.
• SCATTERPLOTS: A SCATTERPLOT IS A PLOT OF THE POINTS $(X_1, Y_1), \ldots, (X_n, Y_n)$.
WHAT LINE “BEST FITS” THE SET OF OBSERVATIONS?
•GRAPHICAL SOLUTION
EXAMPLE: GIVEN THE SET OF OBSERVATIONS, (1,2), (2,5), (3,4), (4,1), (5,8), (6,3), (7,2), PLOT A SCATTERGRAM.
• SCATTERGRAM
[FIGURE: SCATTERGRAM OF THE SEVEN OBSERVATIONS, WITH X ON THE HORIZONTAL AXIS AND Y ON THE VERTICAL AXIS]
Example: Height and Handspan
Data shown are the first 12 observations of a data set that includes the heights (in inches) and fully stretched handspans (in centimeters) of 167 college students
Taller people tend to have greater handspan measurements than shorter people do.
When two variables tend to increase together, we say that they have a positive association.
The handspan and height measurements may have a linear relationship.
Example: Driver Age and Maximum Legibility Distance of Highway Signs
• A research firm determined the maximum distance at which each of 30 drivers could read a newly designed sign.
• The 30 participants in the study ranged in age from 18 to 82 years old.
• We want to examine the relationship between age and the sign legibility distance
• We see a negative association with a linear pattern.
• We will use a straight-line equation to model this relationship.
LOOKING AT SCATTERPLOTS
• SCATTERPLOTS ARE THE BEST WAY TO START OBSERVING THE RELATIONSHIP BETWEEN TWO QUANTITATIVE VARIABLES.
• BY JUST LOOKING AT THEM, YOU CAN SEE PATTERNS, TRENDS, RELATIONSHIPS, AND EVEN THE OCCASIONAL EXTRAORDINARY VALUE SITTING APART FROM THE OTHERS.
• THERE ARE FOUR THINGS WE LOOK FOR IN A SCATTERPLOT:
– DIRECTION
– FORM
– STRENGTH
– UNUSUAL FEATURES
Looking for Patterns with Scatterplots
Questions to Ask about a Scatterplot
• What is the average pattern? Does it look like a straight line, or is it curved?
• What is the direction of the pattern?
• How much do individual points vary from the average pattern?
• Are there any unusual data points?
What We Look for in a Scatterplot
We examine a scatterplot to study association. How do values on the response variable change as values of the explanatory variable change?
You can describe the overall pattern of a scatterplot by the trend, direction, and strength of the relationship between the two variables.
• Trend: linear, curved, clusters, no pattern
• Direction: positive, negative, no direction
• Strength: how closely the points fit the trend
Also look for outliers from the overall trend.
DIRECTION
• POSITIVE: THE PATTERN RUNS FROM THE BOTTOM LEFT TO THE UPPER RIGHT.
• NEGATIVE: THE PATTERN RUNS FROM THE UPPER LEFT TO THE LOWER RIGHT.
• NEITHER
[FIGURE: THREE SCATTERPLOTS ILLUSTRATING POSITIVE, NEGATIVE, AND NEITHER DIRECTION]
FORM
STRAIGHT
CURVED
EXOTIC
NO PATTERNS
FORM: POSITIVE STRAIGHT DIRECTION
POSITIVELY STRAIGHT RELATIONSHIP
[FIGURE: SCATTERPLOT SHOWING A POSITIVE STRAIGHT-LINE RELATIONSHIP]
Example: 100 Cars on the Lot of a Used-Car Dealership
Question: Would you expect a positive association, a negative association or no association between the age of the car and the mileage on the odometer?
– Positive association
– Negative association
– No association
FORM: NEGATIVE STRAIGHT DIRECTION
• NEGATIVELY STRAIGHT RELATIONSHIP
[FIGURE: SCATTERPLOT SHOWING A NEGATIVE STRAIGHT-LINE RELATIONSHIP]
Positive and Negative Associations
• Two quantitative variables x and y are
– Positively associated when
  • high values of x tend to occur with high values of y.
  • low values of x tend to occur with low values of y.
– Negatively associated when high values of one variable tend to pair with low values of the other variable.
Positive, Negative Associations, Linear Relationships
• Two variables have a positive association when the values of one variable tend to increase as the values of the other variable increase.
• Two variables have a negative association when the values of one variable tend to decrease as the values of the other variable increase.
• Two variables have a linear relationship when the pattern of their relationship resembles a straight line.
FORM: CURVED RELATIONSHIP
• CURVED ASSOCIATION BETWEEN X AND Y
[FIGURE: SCATTERPLOTS SHOWING CURVED ASSOCIATIONS BETWEEN X AND Y]
FORM: EXOTIC – SHARP POINTS
• OUTSTANDING FEATURE – SHARP POINTS
[FIGURE: SCATTERPLOTS WHOSE PATTERNS HAVE SHARP POINTS]
FORM: NO CLEAR PATTERNS
STRENGTH
STRONG, MODERATE, WEAK
[FIGURE: THREE SCATTERPLOTS SHOWING STRONG, MODERATE, AND WEAK ASSOCIATION]
UNUSUAL FEATURES
OUTLIERS, SUBGROUPS
[FIGURE: TWO SCATTERPLOTS, ONE SHOWING OUTLIERS AND ONE SHOWING SUBGROUPS]
EXAMPLES
REGRESSION WISDOM: CORRELATION COEFFICIENT
• CORRELATION COEFFICIENT r
THE CORRELATION COEFFICIENT IS A NUMERICAL MEASURE OF THE DIRECTION AND STRENGTH OF A LINEAR ASSOCIATION.
• FOR A SET OF PAIRED DATA $(X_1, Y_1), (X_2, Y_2), \ldots, (X_n, Y_n)$, THE LINEAR CORRELATION COEFFICIENT r IS GIVEN BY
$r = \dfrac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{(n-1)\, S_x S_y}$
• WHERE $S_x$ AND $S_y$ ARE RESPECTIVELY THE STANDARD DEVIATIONS OF X AND Y.
• EXAMPLE: FIND THE LINEAR CORRELATION COEFFICIENT FOR THE FOLLOWING FIVE PAIRS OF NUMBERS: (6,5), (10,3), (14,7), (19,8), (21,12).
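The example above can be worked through in a few lines. This is a minimal sketch that applies the formula for r directly, using sample standard deviations (divisor n - 1); the function name `correlation` is only illustrative.

```python
import math

# The five pairs from the example.
pairs = [(6, 5), (10, 3), (14, 7), (19, 8), (21, 12)]

def correlation(points):
    """r = sum((x - x_bar)(y - y_bar)) / ((n - 1) * S_x * S_y)."""
    n = len(points)
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Sample standard deviations of X and Y (divisor n - 1).
    s_x = math.sqrt(sum((x - x_bar) ** 2 for x in xs) / (n - 1))
    s_y = math.sqrt(sum((y - y_bar) ** 2 for y in ys) / (n - 1))
    cross = sum((x - x_bar) * (y - y_bar) for x, y in points)
    return cross / ((n - 1) * s_x * s_y)

r = correlation(pairs)
print(round(r, 3))  # 0.855
```

The value is positive and fairly close to 1, which matches a strong positive linear association in these five points.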
PROPERTIES OF CORRELATION COEFFICIENT, r
• $-1 \le r \le 1$, WITH POSITIVE r MEANING POSITIVE RELATIONSHIP AND NEGATIVE r MEANING NEGATIVE RELATIONSHIP BETWEEN THE TWO VARIABLES.
• r = 1 IF AND ONLY IF ALL POINTS LIE EXACTLY ON A LINE WITH POSITIVE SLOPE.
• r = -1 IF AND ONLY IF ALL POINTS LIE EXACTLY ON A LINE WITH NEGATIVE SLOPE.
• THE VALUE OF r DOES NOT CHANGE IF THE UNITS OF MEASUREMENT ARE CHANGED.
• r MEASURES THE STRENGTH AND DIRECTION OF THE LINEAR RELATIONSHIP BETWEEN Y AND X.
•THE VALUE OF r DOES NOT DEPEND ON WHICH OF THE TWO VARIABLES IS LABELED X.
• THE VALUE OF r IS A MEASURE OF THE EXTENT TO WHICH X AND Y ARE LINEARLY RELATED – THAT IS, THE EXTENT TO WHICH THE POINTS IN THE SCATTER PLOT FALL CLOSE TO A STRAIGHT LINE. A VALUE OF r CLOSE TO ZERO DOES NOT RULE OUT ANY STRONG RELATIONSHIP BETWEEN X AND Y; THERE COULD STILL BE A STRONG RELATIONSHIP BUT ONE THAT IS NOT LINEAR.
[SCALE FOR r: STRONG (-1 TO -0.8), MODERATE (-0.8 TO -0.5), WEAK (-0.5 TO 0.5), MODERATE (0.5 TO 0.8), STRONG (0.8 TO +1)]
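Two of the properties listed above (r is unchanged by a change of units, and r does not depend on which variable is labeled X) can be checked numerically. The sketch below reuses the five pairs from the earlier example; the factor 2.54 is a hypothetical unit conversion chosen only for illustration.

```python
import math

def correlation(xs, ys):
    """Pearson r; algebraically the same as the (n - 1) * S_x * S_y form."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

xs = [6, 10, 14, 19, 21]
ys = [5, 3, 7, 8, 12]

r_original = correlation(xs, ys)
# Hypothetical change of units for X (for example, inches to centimeters).
r_rescaled = correlation([2.54 * x for x in xs], ys)
# Swap the roles of X and Y.
r_swapped = correlation(ys, xs)

print(r_original, r_rescaled, r_swapped)
```

All three values agree up to floating-point rounding.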
Scatterplots and Correlation Coefficient
• Let’s get a feel for the correlation r by looking at its values for the scatterplots shown below
CORRELATION CONDITIONS
• QUANTITATIVE VARIABLES CONDITION
• CORRELATION APPLIES ONLY TO QUANTITATIVE VARIABLES.
• BE SURE NOT TO APPLY CORRELATION TO CATEGORICAL DATA MASQUERADING AS QUANTITATIVE.
• CHECK THE VARIABLES’ UNITS AND WHAT THEY MEASURE.
STRAIGHT ENOUGH CONDITION
• MAKE SURE THE FORM OF THE SCATTERPLOT IS STRAIGHT ENOUGH THAT A LINEAR RELATIONSHIP MAKES SENSE.
• CORRELATION MEASURES THE STRENGTH ONLY OF THE LINEAR ASSOCIATION, AND WILL BE MISLEADING IF THE RELATIONSHIP IS NOT LINEAR.
• IF A RELATIONSHIP IS CURVED, THEN SUMMARIZING ITS STRENGTH WITH A CORRELATION WOULD BE MISLEADING.
OUTLIER CONDITION
• OUTLIERS CAN DISTORT THE CORRELATION DRAMATICALLY.
• AN OUTLIER CAN MAKE AN OTHERWISE WEAK CORRELATION LOOK BIG OR HIDE A STRONG CORRELATION.
• AN OUTLIER CAN EVEN GIVE AN OTHERWISE POSITIVE ASSOCIATION A NEGATIVE CORRELATION (AND VICE VERSA)
• WHEN YOU SEE AN OUTLIER, IT’S OFTEN A GOOD IDEA TO REPORT THE CORRELATION WITH AND WITHOUT THAT POINT.
WHICH LINE “BEST FITS” THE SET OF OBSERVATIONS?
•THE ANALYTICAL APPROACH
Regression Line
•The first step of a regression analysis is to identify the response and explanatory variables.
– We use y to denote the response variable.
– We use x to denote the explanatory variable.
FITTING THE MODEL: THE LEAST SQUARES METHOD
• CONSIDER THE EXAMPLE:
• SUPPOSE AN APPLIANCE STORE CONDUCTS A FIVE-MONTH EXPERIMENT TO DETERMINE THE EFFECT OF ADVERTISING ON SALES REVENUE. THE RESULTS ARE SHOWN IN THE TABLE.

MONTH   ADVERTISING EXPENDITURE, X ($100s)   SALES REVENUE, Y ($1,000s)
1       1                                    1
2       2                                    1
3       3                                    2
4       4                                    2
5       5                                    4
FIRST STEP IS TO MAKE A SCATTERGRAM
[FIGURE: SCATTERGRAM OF SALES REVENUE, Y ($1,000s), AGAINST ADVERTISING EXPENDITURE, X ($100s)]
WHAT IS THE BEST FIT?
• SCATTERGRAM WITH POSSIBLE FITS
[FIGURE: THE SAME SCATTERGRAM OF Y AGAINST X WITH SEVERAL CANDIDATE LINES DRAWN THROUGH THE POINTS]
GENERAL EQUATION OF A REGRESSION LINE
$y = a + bx + \text{error}$, WHERE $a + bx$ IS THE LINEAR PART
• THE LINEAR PART OF THE EQUATION REQUIRES DETERMINATION OF TWO COEFFICIENTS – a (THE Y-INTERCEPT) AND b (THE SLOPE) IN ORDER TO PREDICT VALUES OF Y.
• ONCE a AND b ARE OBTAINED, THE STRAIGHT LINE IS KNOWN AND CAN BE PLOTTED ON THE SCATTER DIAGRAM. THEN WE COULD MAKE A VISUAL COMPARISON OF HOW WELL OUR PARTICULAR STATISTICAL MODEL (A STRAIGHT LINE) FITS THE ORIGINAL DATA.
Regression Equation
• The regression line predicts the value for the response variable y as a straight-line function of the value x of the explanatory variable.
• Let $\hat{y}$ denote the predicted value of y. The equation for the regression line has the form
$\hat{y} = a + bx$
• In this formula, a denotes the y-intercept and b denotes the slope.
Example: Height Based on Human Remains
Regression Equation: $\hat{y} = 61.4 + 2.4x$
$\hat{y}$ is the predicted height and $x$ is the length of a femur (thighbone), measured in centimeters.
Use the regression equation to predict the height of a person whose femur length was 50 cm:
$\hat{y} = 61.4 + 2.4(50) = 181.4$
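The femur prediction is a direct substitution into the regression equation. A minimal sketch, with the function name `predicted_height` chosen only for illustration:

```python
def predicted_height(femur_length_cm):
    """Predicted height from the regression equation y_hat = 61.4 + 2.4x."""
    return 61.4 + 2.4 * femur_length_cm

height = predicted_height(50)
print(round(height, 1))  # 181.4
```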
COMPUTING a AND b
LETTING $\bar{X}$ AND $\bar{Y}$ REPRESENT THE MEANS OF $X_1, X_2, \ldots, X_n$ AND $Y_1, Y_2, \ldots, Y_n$ RESPECTIVELY, THE RESULTING FORMULAS FOR THE INTERCEPT AND SLOPE ARE GIVEN BY
$\hat{b} = \dfrac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}$
ALTERNATIVE FORMULA WHEN THE STANDARD DEVIATIONS OF X AND Y VARIABLES ARE KNOWN
$\hat{b} = r \dfrac{s_y}{s_x}$
$\hat{a} = \bar{y} - \hat{b}\,\bar{x}$
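As a sketch of how the two slope formulas agree, the code below fits the least squares line to the five-month advertising data from the appliance-store table (X in $100s, Y in $1,000s). The variable names are illustrative only.

```python
import math

# Advertising expenditure (X) and sales revenue (Y) from the table.
x = [1, 2, 3, 4, 5]
y = [1, 1, 2, 2, 4]
n = len(x)

x_bar = sum(x) / n
y_bar = sum(y) / n
sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
sxx = sum((xi - x_bar) ** 2 for xi in x)
syy = sum((yi - y_bar) ** 2 for yi in y)

# Slope from the sums of squares, then the intercept.
b_hat = sxy / sxx
a_hat = y_bar - b_hat * x_bar

# Alternative slope formula: b = r * s_y / s_x.
s_x = math.sqrt(sxx / (n - 1))
s_y = math.sqrt(syy / (n - 1))
r = sxy / ((n - 1) * s_x * s_y)
b_alt = r * s_y / s_x

print(round(b_hat, 3), round(a_hat, 3))  # 0.7 -0.1
```

For this table the fitted line is therefore y-hat = -0.1 + 0.7x, and both slope formulas give the same value up to rounding.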
NOTE: THE ‘HAT’ IS PLACED OVER THE LETTERS a AND b TO REMIND US THAT THESE ARE THE VALUES THAT MINIMIZE THE SUM OF SQUARED DEVIATIONS.
• CLASS WORK
• Example From Midterm 1 Review Sheet
Example: Baseball Scoring Vs Batting average
• Given the following statistics from data on baseball scoring versus batting average, find the regression line.
$\hat{a} = \bar{y} - \hat{b}\,\bar{x} = 4.45 - 26.07(0.2597) = -2.32$
• The regression line to predict team scoring from batting average is
$\hat{y} = -2.32 + 26.1x$
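The intercept arithmetic for the baseball example can be checked quickly. This sketch uses the three figures that appear in the intercept calculation on the slide (4.45, 26.07, and 0.2597); reading them as the mean of y, the slope, and the mean of x is an assumption based on the formula for the intercept.

```python
# Figures read from the slide's intercept calculation (assumed roles).
y_bar = 4.45     # mean of y (team scoring)
b_hat = 26.07    # slope
x_bar = 0.2597   # mean of x (batting average)

a_hat = y_bar - b_hat * x_bar
print(round(a_hat, 2))  # -2.32
```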
Slope: Positive, Negative, Zero
Interpreting the Slope
•Slope: measures the change in the predicted value of the response variable (y) for a 1-unit increase in the explanatory variable (x).
•Example: A 1 cm increase in femur length corresponds to a 2.4 cm increase in predicted height.
The Slope and Correlation
•Slope:
– Numerical value depends on the units used to measure the variables.
– Does not tell us whether the association is strong or weak.
– The two variables must be identified as response and explanatory variables.
– The regression equation can be used to predict values of the response variable for given values of the explanatory variable.
Interpreting the y-Intercept
• y-Intercept:
– The predicted value for y when x = 0;
– This fact helps in plotting the line;
– May not have any interpretative value if no observations had x values near 0.
It does not make sense for femur length to be 0 cm, so the y-intercept for the equation $\hat{y} = 61.4 + 2.4x$ is not a relevant predicted height.
REGRESSION WISDOM: PREDICTION, RESIDUALS, CORRELATION
• PREDICTION
• OBTAINING THE REGRESSION FORMULA IS NOT THE END OF THE ANALYSIS. MOSTLY, WE ARE INTERESTED IN PREDICTING FUTURE OUTCOMES WITH THE REGRESSION FUNCTION.
• TYPES OF PREDICTION
• EXTRAPOLATION: EXTRAPOLATION IS THE USE OF A REGRESSION LINE FOR PREDICTION OUTSIDE THE RANGE OF VALUES OF THE EXPLANATORY VARIABLE X THAT IS USED TO OBTAIN THE LINE. SUCH PREDICTION CANNOT BE TRUSTED.
Extrapolation is Dangerous
•Extrapolation: Using a regression line to predict y-values for x-values outside the observed range of the data.
– Riskier the farther we move from the range of the given x-values.
– There is no guarantee that the relationship given by the regression equation holds outside the range of sampled x-values.
INTERPOLATION:
• INTERPOLATION IS THE USE OF A REGRESSION LINE FOR PREDICTION WITHIN THE RANGE OF VALUES OF THE EXPLANATORY VARIABLE X THAT IS USED TO OBTAIN THE LINE. INTERPOLATION IS GENERALLY SAFE.
REMARKS
• EXTRAPOLATION SHOULD BE HANDLED WITH CAUTION. LIMIT PREDICTIONS TO X VALUES WHICH ARE WITHIN THE RANGE OF THE DATA USED TO COMPUTE THE LEAST SQUARES LINE.
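One way to follow this advice in practice is to flag any prediction whose x value lies outside the observed range. The sketch below is illustrative only: it uses the x values from the five-month advertising table and the line y-hat = -0.1 + 0.7x that the least squares formulas give for that table; the function name `predict` is hypothetical.

```python
# Observed x values (advertising expenditure, $100s) used to fit the line.
x_observed = [1, 2, 3, 4, 5]

# Least squares coefficients for the advertising table.
a_hat = -0.1
b_hat = 0.7

def predict(x_new):
    """Return (predicted y, True if x_new is outside the observed x range)."""
    is_extrapolation = x_new < min(x_observed) or x_new > max(x_observed)
    return a_hat + b_hat * x_new, is_extrapolation

print(predict(3))   # interpolation: x is inside the observed range
print(predict(10))  # extrapolation: x is outside the observed range
```

A flagged prediction is still computed, but it should be treated with caution, as the remarks above recommend.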
INTERPOLATING AND EXTRAPOLATING
• ILLUSTRATION
• DO NOT MAKE PREDICTIONS OUTSIDE THE CONTEXT OF THE STUDY IN WHICH THE DATA WERE COLLECTED. FOR EXAMPLE, IT IS INAPPROPRIATE TO USE THE LEAST SQUARES LINE FITTED TO DATA ON AMERICAN BABIES TO PREDICT WEIGHTS FOR BABIES BORN IN CHINA.
DIAGNOSTICS: AFTER OBTAINING THE REGRESSION LINE, WE WOULD LIKE TO KNOW HOW WELL THE REGRESSION LINE FITS THE DATA. ALSO, WE WOULD LIKE TO KNOW IF THERE IS ANY POINT THAT STRONGLY AFFECTS THE REGRESSION LINE. A DIAGNOSTIC ANALYSIS SUCH AS THE ANALYSIS OF RESIDUALS IS VERY USEFUL.
REGRESSION WISDOM - RESIDUALS
• THE DISCREPANCY BETWEEN DATA AND MODEL IS CALLED THE RESIDUAL. HENCE, A RESIDUAL IS THE DIFFERENCE BETWEEN AN OBSERVED VALUE OF Y AND THE VALUE PREDICTED BY THE REGRESSION LINE. THAT IS,
RESIDUAL = OBSERVED Y – PREDICTED Y = $y - \hat{y}$
WHERE $\hat{y} = \hat{a} + \hat{b}x_i$
NOTATION: THE RESIDUAL IS DENOTED BY THE LETTER e
• THE RESIDUAL FOR INDIVIDUAL i IS DENOTED BY
$e_i = y_i - (\hat{a} + \hat{b}x_i) = y_i - \hat{y}_i$
EXAMPLE: A LINEAR MODEL RELATING HURRICANES’ WIND SPEEDS TO THEIR CENTRAL PRESSURES IS
MaxWindSpeed = 955.27 – 0.897 × CentralPressure
HURRICANE KATRINA HAD A CENTRAL PRESSURE OF 920 MILLIBARS. WHAT DOES OUR REGRESSION MODEL PREDICT FOR HER MAXIMUM WIND SPEED? HOW GOOD IS THAT PREDICTION, GIVEN THAT KATRINA’S ACTUAL WIND SPEED WAS MEASURED AT 110 KNOTS?
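The Katrina question can be worked as a prediction followed by a residual. This is a minimal sketch using only the numbers given in the example; the function name `predicted_wind_speed` is illustrative.

```python
def predicted_wind_speed(central_pressure_mb):
    """MaxWindSpeed = 955.27 - 0.897 * CentralPressure (knots)."""
    return 955.27 - 0.897 * central_pressure_mb

observed = 110.0                          # Katrina's measured wind speed, knots
predicted = predicted_wind_speed(920)     # central pressure of 920 millibars
residual = observed - predicted           # observed y minus predicted y

print(round(predicted, 2), round(residual, 2))  # 130.03 -20.03
```

The residual is negative, so the model overestimates Katrina's wind speed by about 20 knots.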
Analysis of Residuals
• ANALYSIS OF RESIDUALS HELPS US TO ASSESS THE ADEQUACY OF A MODEL AND HELPS TO IDENTIFY OUTLIERS OR OTHER INTERESTING DATA POINTS.
• WHEN A REGRESSION MODEL IS APPROPRIATE, IT SHOULD MODEL THE UNDERLYING RELATIONSHIP. NOTHING INTERESTING SHOULD BE LEFT BEHIND. SO AFTER WE FIT A REGRESSION MODEL, WE USUALLY PLOT THE RESIDUALS IN THE HOPE OF FINDING … NOTHING.
Analysis of Residuals – Residual Plots
• RESIDUAL PLOT IS A SCATTERPLOT OF THE RESIDUALS [ON THE VERTICAL AXIS] AGAINST THE EXPLANATORY VARIABLE, X, ON THE HORIZONTAL AXIS.
• THE PLOT SHOULD NOT HAVE ANY INTERESTING FEATURES, LIKE A DIRECTION OR SHAPE. IT SHOULD STRETCH HORIZONTALLY, WITH ABOUT THE SAME AMOUNT OF SCATTER THROUGHOUT. IT SHOULD SHOW NO BENDS, AND IT SHOULD HAVE NO OUTLIERS.
NOTE: SUM OF RESIDUALS = 0
• A RESIDUAL PLOT
[FIGURE: RESIDUAL PLOT WITH THE RESIDUALS e ON THE VERTICAL AXIS AND x ON THE HORIZONTAL AXIS]
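The note that the residuals sum to zero can be illustrated with the five-month advertising table. The sketch below refits the least squares line to that data and lists the residuals; the variable names are illustrative.

```python
# Advertising data from the appliance-store table.
x = [1, 2, 3, 4, 5]
y = [1, 1, 2, 2, 4]
n = len(x)

x_bar = sum(x) / n
y_bar = sum(y) / n
b_hat = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
         / sum((xi - x_bar) ** 2 for xi in x))
a_hat = y_bar - b_hat * x_bar

# Residual = observed y minus predicted y, one per observation.
residuals = [yi - (a_hat + b_hat * xi) for xi, yi in zip(x, y)]
total = sum(residuals)

for xi, e in zip(x, residuals):
    print(xi, round(e, 2))
print("sum of residuals is zero up to rounding:", abs(total) < 1e-9)
```

Positive residuals mark months where the line underestimates revenue, and negative residuals mark months where it overestimates.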
CLASS EXAMPLES
INTERPRETATION OF POSITIVE AND NEGATIVE RESIDUALS
• POSITIVE RESIDUAL: THE MODEL OR PREDICTED VALUES UNDERESTIMATE THE ACTUAL DATA VALUE.
• NEGATIVE RESIDUAL: THE MODEL OR PREDICTED VALUES OVERESTIMATE THE ACTUAL DATA VALUE
REMARK
• MOST COMPUTER STATISTICS PACKAGES PLOT THE RESIDUALS AGAINST THE PREDICTED VALUES, RATHER THAN AGAINST THE X-VALUES. WHEN THE SLOPE IS NEGATIVE, THE TWO VERSIONS ARE MIRROR IMAGES. WHEN THE SLOPE IS POSITIVE, THEY ARE VIRTUALLY IDENTICAL EXCEPT FOR THE AXIS LABELS. SINCE ALL WE CARE ABOUT IS THE PATTERNS (OR, BETTER, LACK OF PATTERNS) IN THE RESIDUAL PLOT, IT REALLY DOES NOT MATTER WHICH WAY WE PLOT THE RESIDUALS.
CLASS EXAMPLES
COEFFICIENT OF DETERMINATION
• THE COEFFICIENT OF DETERMINATION MEASURES THE PROPORTION OF THE VARIATION IN Y THAT IS EXPLAINED BY THE INDEPENDENT VARIABLE X IN THE REGRESSION MODEL. THAT IS, IT MEASURES THE PROPORTION OF THE TOTAL VARIABILITY IN Y THAT IS REMOVED BY ADDING X TO THE LINEAR MODEL.
• NOTATION: $r^2$
THE COEFFICIENT OF DETERMINATION IS USEFUL WHEN INTERPRETING r. ITS SYMBOL EXPLAINS HOW IT IS COMPUTED; TO OBTAIN IT, SIMPLY SQUARE r – THE CORRELATION COEFFICIENT.
Coefficient of Determination
•The typical way to interpret $r^2$ is as the proportion of the variation in the y-values that is accounted for by the linear relationship of y with x.
•When a strong linear association exists, the regression equation predictions tend to be much better than the predictions using only $\bar{y}$.
•We measure the proportional reduction in error and call it $r^2$.
Coefficient of Determination
• $r^2$ measures the proportion of the variation in the y-values that is accounted for by the linear relationship of y with x.
•A correlation of .9 means that $r^2 = (.9)^2 = .81 = 81\%$
– 81% of the variation in the y-values can be explained by the explanatory variable, x.
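The "proportional reduction in error" reading of r squared can be checked numerically. The sketch below uses the five-month advertising table as illustrative data: it compares the squared error from predicting every y by the mean of y with the squared error from the least squares line, and confirms that the proportional reduction equals the squared correlation.

```python
# Advertising data from the appliance-store table.
x = [1, 2, 3, 4, 5]
y = [1, 1, 2, 2, 4]
n = len(x)

x_bar = sum(x) / n
y_bar = sum(y) / n
sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
sxx = sum((xi - x_bar) ** 2 for xi in x)
syy = sum((yi - y_bar) ** 2 for yi in y)

b_hat = sxy / sxx
a_hat = y_bar - b_hat * x_bar

# Squared correlation.
r_squared = sxy ** 2 / (sxx * syy)

# Error using only the mean of y, and error using the regression line.
sst = syy
sse = sum((yi - (a_hat + b_hat * xi)) ** 2 for xi, yi in zip(x, y))
reduction = (sst - sse) / sst

print(round(r_squared, 4), round(reduction, 4))  # 0.8167 0.8167
```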
PROPERTIES OF COEFFICIENT OF DETERMINATION
1. $0 \le r^2 \le 1$
2. $r^2 = 1$ IF AND ONLY IF ALL POINTS LIE ON A LINE.
3. $r^2$ DOES NOT CHANGE IF THE UNITS OF MEASUREMENT ARE CHANGED.
4. $r^2$ MEASURES THE STRENGTH OF LINEAR ASSOCIATION BETWEEN THE VARIABLES Y AND X. IT IS POSSIBLE THAT X AND Y ARE STRONGLY RELATED, BUT $r^2$ IS CLOSE TO 0.
Remark
• THE COEFFICIENT OF DETERMINATION, WHEN CONVERTED TO A PERCENTAGE, INDICATES HOW MUCH OF THE VARIANCE IN ONE VARIABLE IS ACCOUNTED FOR BY THE VARIANCE IN THE OTHER VARIABLE.
• Examples From Midterm Review 1 Sheet
Outliers and Influential Points
•A regression outlier is an observation that lies far away from the trend that the rest of the data follows.
•An observation is influential if
– its x value is relatively low or high compared to the remainder of the data.
– the observation is a regression outlier.
Influential observations tend to pull the regression line toward that data point and away from the rest of the data points.
Be Cautious of Influential Points
•One reason to plot the data before you do a correlation or regression analysis is to check for unusual observations.
•Search for observations that are regression outliers, being well removed from the trend that the rest of the data follow.
Outliers and Influential Points
• An observation is a regression outlier if it is far removed from the trend that the rest of the data follow. The top two points are regression outliers. Not all regression outliers are influential in affecting the correlation or slope. Question: Which regression outlier in this figure is influential?
Correlation Does not Imply Causation
•In a regression analysis, suppose that as x goes up, y also tends to go up (or down). Can we conclude that there’s a causal connection, with changes in x causing changes in y?
– A strong correlation between x and y means that there is a strong linear association that exists between the two variables.
– A strong correlation between x and y, does not mean that x causes y to change.
Correlation Does not Imply Causation (Extra-Credit Exercise)
Data are available for all fires in Chicago last year on x = number of firefighters at the fire and y = cost of damages due to the fire.
1. Would you expect the correlation to be negative, zero, or positive?
2. If the correlation is positive, does this mean that having more firefighters at a fire causes the damages to be worse? Yes or No?
3. Identify a third variable that could be considered a common cause of x and y:
– Distance from the fire station
– Intensity of the fire
– Size of the fire
Lurking Variables & Confounding
A lurking variable is a variable, usually unobserved, that influences the association between the variables of primary interest.
Ice cream sales and drowning – lurking variable = temperature
Reading level and shoe size – lurking variable = age
Childhood obesity rate and GDP – lurking variable = time
When two explanatory variables are both associated with a response variable but are also associated with each other, there is said to be confounding.
Lurking variables are not measured in the study but have the potential for confounding.
The Effect of Lurking Variables on Associations
• Lurking variables can affect associations in many ways. For instance, a lurking variable may be a common cause of both the explanatory and response variable.
• In practice, there’s usually not a single variable that causally explains a response variable or the association between two variables. More commonly, there are multiple causes. When there are multiple causes, the association among them makes it difficult to study the effect of any single variable.
The Effects of Confounding on Associations
• When two explanatory variables are both associated with a response variable but are also associated with each other, confounding occurs.
• It is difficult to determine whether either of them truly causes the response because a variable’s effect could be at least partly due to its association with the other variable.
LEVERAGE AND INFLUENTIAL POINTS
LEVERAGE
DATA POINTS WITH X-VALUES FAR FROM THE MEAN OF X ARE SAID TO EXERT LEVERAGE ON A LINEAR MODEL. HIGH LEVERAGE POINTS PULL THE LINE CLOSE TO THEM, AND SO THEY CAN HAVE A LARGE EFFECT ON THE LINE, SOMETIMES COMPLETELY DETERMINING THE SLOPE AND Y-INTERCEPT. WITH HIGH ENOUGH LEVERAGE, THEIR RESIDUALS CAN APPEAR TO BE DECEPTIVELY SMALL.
INFLUENTIAL POINT
• IF OMITTING A POINT FROM THE DATA RESULTS IN A VERY DIFFERENT REGRESSION MODEL, THEN THAT POINT IS CALLED AN INFLUENTIAL POINT.
• ILLUSTRATIVE EXAMPLES
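A simple way to check for influence is to fit the line with and without the point in question and compare the results. The sketch below does this for the five-month advertising table plus one extra point, (12, 1); that extra point is hypothetical and is added only to illustrate a high-leverage observation. The helper name `least_squares` is also illustrative.

```python
def least_squares(points):
    """Return (intercept, slope) of the least squares line through points."""
    n = len(points)
    x_bar = sum(x for x, _ in points) / n
    y_bar = sum(y for _, y in points) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in points)
    sxx = sum((x - x_bar) ** 2 for x, _ in points)
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope

# Advertising data from the appliance-store table.
data = [(1, 1), (2, 1), (3, 2), (4, 2), (5, 4)]
# Hypothetical extra observation with an x value far from the mean of x.
suspect = (12, 1)

a_without, b_without = least_squares(data)
a_with, b_with = least_squares(data + [suspect])

print("slope without the point:", round(b_without, 3))
print("slope with the point:   ", round(b_with, 3))
```

With the hypothetical point included, the slope drops from 0.7 to slightly below zero, so omitting that point gives a very different model and the point would be called influential.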