1 EXPLORING RELATIONSHIPS BETWEEN QUANTITATIVE VARIABLES SCATTERPLOTS, ASSOCIATION, AND CORRELATION ADDITIONAL REFERENCE READING MATERIAL COURSEPACK PAGES

1

EXPLORING RELATIONSHIPS BETWEEN QUANTITATIVE VARIABLES

• SCATTERPLOTS, ASSOCIATION, AND CORRELATION

ADDITIONAL REFERENCE READING MATERIAL

• COURSEPACK PAGES 29 – 58

2

LINEAR RELATIONSHIPS BETWEEN TWO VARIABLES X AND Y

WHY STUDY LINEAR RELATIONSHIPS?

• LINEAR RELATIONSHIPS ARE THE EASIEST TO UNDERSTAND AND ANALYZE;

• MANY RELATIONSHIPS ARE APPROXIMATELY LINEAR;

• VARIABLES WITH A NONLINEAR RELATIONSHIP CAN OFTEN BE TRANSFORMED SO THAT THE RELATIONSHIP OF THE TRANSFORMED VARIABLES IS LINEAR. FOR EXAMPLE, CONSIDER THE EQUATION RELATING BRAIN WEIGHT W AND BODY WEIGHT Z:

$W = cZ^{m}$

TAKING LOGARITHMS GIVES $\log W = \log c + m \log Z$, WHICH IS LINEAR IN $\log Z$.
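For a power-law relationship of the form W = cZ^m, taking logarithms of both sides gives a straight line in log Z with slope m. The following is a minimal sketch of that idea; the constants `c` and `m` and the body weights are hypothetical values chosen only for illustration, not data from the course.

```python
import math

# Hypothetical constants and body weights, for illustration only.
c, m = 0.01, 0.75
body_weights = [1.0, 10.0, 100.0, 1000.0]            # Z
brain_weights = [c * z ** m for z in body_weights]   # W = c * Z^m

# Transform both variables with the natural logarithm.
log_z = [math.log(z) for z in body_weights]
log_w = [math.log(w) for w in brain_weights]

# Slope between successive points on the log-log scale.
slopes = [
    (log_w[i + 1] - log_w[i]) / (log_z[i + 1] - log_z[i])
    for i in range(len(log_z) - 1)
]
print(slopes)  # each slope equals m, up to rounding error
```

Because every slope on the log-log scale is the same, the transformed points fall on a straight line even though the original relationship is curved.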

3

EXAMPLES

• RELATIONSHIP BETWEEN SMOKING AND LUNG CANCER;

• RELATIONSHIP BETWEEN ALTITUDE AND THE BOILING POINT OF WATER;

• RELATIONSHIP BETWEEN TEMPERATURE AND OZONE CONCENTRATION IN THE AIR;

IN THESE EXAMPLES, TWO VARIABLES ARE INVOLVED, NAMELY THE RESPONSE VARIABLE, Y, AND THE EXPLANATORY VARIABLE, X.

Three Tools We Will Use

• Scatterplot, a two-dimensional graph of data values

• Correlation, a statistic that measures the strength and direction of a linear relationship between two quantitative variables.

• Regression equation, an equation that describes the average relationship between a quantitative response and explanatory variable.

4

5

LEAST SQUARES LINE (REGRESSION LINE)

• GIVEN A SET OF n OBSERVATIONS, $(X_1, Y_1), \ldots, (X_n, Y_n)$,

• QUESTION: WHAT LINE “BEST” FITS THE OBSERVATIONS?

• WE SHALL ANSWER THIS QUESTION GRAPHICALLY USING A SCATTERPLOT, AND ANALYTICALLY USING THE LEAST SQUARES REGRESSION FORMULA.

• SCATTERPLOTS: A SCATTERPLOT IS A PLOT OF THE POINTS $(X_1, Y_1), \ldots, (X_n, Y_n)$.

6

WHAT LINE “BEST FITS” THE SET OF OBSERVATIONS?

•GRAPHICAL SOLUTION

7

EXAMPLE: GIVEN THE SET OF OBSERVATIONS, (1,2), (2,5), (3,4), (4,1), (5,8), (6,3), (7,2), PLOT A SCATTERGRAM.

• SCATTERGRAM

[Figure: scattergram of the seven observations, with X on the horizontal axis and Y on the vertical axis.]

Example: Height and Handspan

8

Data shown are the first 12 observations of a data set that includes the heights (in inches) and fully stretched handspans (in centimeters) of 167 college students

Example: Height and Handspan

Taller people tend to have greater handspan measurements than shorter people do.

When two variables tend to increase together, we say that they have a positive association.

The handspan and height measurements may have a linear relationship.

9

Example: Driver Age and Maximum Legibility Distance of Highway Signs

• A research firm determined the maximum distance at which each of 30 drivers could read a newly designed sign.

• The 30 participants in the study ranged in age from 18 to 82 years old.

• We want to examine the relationship between age and the sign legibility distance

10


11


• We see a negative association with a linear pattern.

• We will use a straight-line equation to model this relationship.

12

13

LOOKING AT SCATTERPLOTS

• SCATTERPLOTS ARE THE BEST WAY TO START OBSERVING THE RELATIONSHIP BETWEEN TWO QUANTITATIVE VARIABLES.

• BY JUST LOOKING AT THEM, YOU CAN SEE PATTERNS, TRENDS, RELATIONSHIPS, AND EVEN THE OCCASIONAL EXTRAORDINARY VALUE SITTING APART FROM THE OTHERS.

• THERE ARE FOUR THINGS WE LOOK FOR IN A SCATTERPLOT:

– DIRECTION

– FORM

– STRENGTH

– UNUSUAL FEATURES

Looking for Patterns with Scatterplots

Questions to Ask about a Scatterplot

• What is the average pattern? Does it look like a straight line, or is it curved?

• What is the direction of the pattern?

• How much do individual points vary from the average pattern?

• Are there any unusual data points?

14

What we Look for in a scatterplot

We examine a scatterplot to study association. How do values on the response variable change as values of the explanatory variable change?

You can describe the overall pattern of a scatterplot by the trend, direction, and strength of the relationship between the two variables.

Trend: linear, curved, clusters, no pattern

Direction: positive, negative, no direction

Strength: how closely the points fit the trend

Also look for outliers from the overall trend.

15

16

DIRECTION

• POSITIVE: THE PATTERN RUNS FROM THE BOTTOM LEFT TO THE UPPER RIGHT.

• NEGATIVE: THE PATTERN RUNS FROM THE UPPER LEFT TO THE LOWER RIGHT.

• NEITHER

[Figure: three scatterplots illustrating POSITIVE, NEGATIVE, and NEITHER direction.]

17

FORM

STRAIGHT

CURVED

EXOTIC

NO PATTERNS

18

FORM: POSITIVE STRAIGHT DIRECTION

POSITIVELY STRAIGHT RELATIONSHIP

[Figure: scatterplot showing a positive straight-line pattern.]

Example: 100 Cars on the Lot of a Used-Car Dealership

Question: Would you expect a positive association, a negative association or no association between the age of the car and the mileage on the odometer?

– Positive association

– Negative association

– No association

19

20

FORM: NEGATIVE STRAIGHT DIRECTION

• NEGATIVELY STRAIGHT RELATIONSHIP

[Figure: scatterplot showing a negative straight-line pattern.]

Positive and Negative Associations

• Two quantitative variables x and y are

– Positively associated when

   • high values of x tend to occur with high values of y.

   • low values of x tend to occur with low values of y.

– Negatively associated when high values of one variable tend to pair with low values of the other variable.

21

Positive, Negative Associations, Linear Relationships

• Two variables have a positive association when the values of one variable tend to increase as the values of the other variable increase.

• Two variables have a negative association when the values of one variable tend to decrease as the values of the other variable increase.

• Two variables have a linear relationship when the pattern of their relationship resembles a straight line.

22

23

FORM: CURVED RELATIONSHIP

• CURVED ASSOCIATION BETWEEN X AND Y

[Figure: scatterplots showing curved patterns of association between X and Y.]

24

FORM: EXOTIC – SHARP POINTS

• OUTSTANDING FEATURE – SHARP POINTS

[Figure: scatterplot whose pattern changes direction abruptly, forming sharp points.]

25

FORM: NO CLEAR PATTERNS

26

STRENGTH

[Figure: three scatterplots labeled STRONG, MODERATE, and WEAK.]

27

UNUSUAL FEATURES

[Figure: two scatterplots labeled OUTLIERS and SUBGROUPS.]

Unusual Features

28

29

EXAMPLES

30

REGRESSION WISDOM: CORRELATION COEFFICIENT

• CORRELATION COEFFICIENT r

THE CORRELATION COEFFICIENT IS A NUMERICAL MEASURE OF THE DIRECTION AND STRENGTH OF A LINEAR ASSOCIATION.

• FOR A SET OF PAIRED DATA $(X_1, Y_1), (X_2, Y_2), \ldots, (X_n, Y_n)$, THE LINEAR CORRELATION COEFFICIENT r IS GIVEN BY

$$r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{(n-1)\, S_x S_y}$$

31

• WHERE $S_x$ AND $S_y$ ARE RESPECTIVELY THE STANDARD DEVIATIONS OF X AND Y.

• EXAMPLE: FIND THE LINEAR CORRELATION COEFFICIENT FOR THE FOLLOWING FIVE PAIRS OF NUMBERS: (6,5), (10,3), (14,7), (19,8), (21,12).
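The example can be worked directly from the definition of r. The sketch below computes the means, the sample standard deviations (with divisor n − 1), and then r for the five pairs listed in the example; the variable names are illustrative choices.

```python
import math

pairs = [(6, 5), (10, 3), (14, 7), (19, 8), (21, 12)]
n = len(pairs)
xs = [x for x, _ in pairs]
ys = [y for _, y in pairs]

x_bar = sum(xs) / n   # 14.0
y_bar = sum(ys) / n   # 7.0

# Sample standard deviations, using the divisor n - 1.
s_x = math.sqrt(sum((x - x_bar) ** 2 for x in xs) / (n - 1))
s_y = math.sqrt(sum((y - y_bar) ** 2 for y in ys) / (n - 1))

# r = sum of cross-products of deviations / ((n - 1) * s_x * s_y)
r = sum((x - x_bar) * (y - y_bar) for x, y in pairs) / ((n - 1) * s_x * s_y)
print(round(r, 3))  # 0.855
```

The sum of cross-products is 72, and the sums of squared deviations are 154 for x and 46 for y, so r = 72 / sqrt(154 × 46), a strong positive linear association.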

32

PROPERTIES OF CORRELATION COEFFICIENT, r

• $-1 \le r \le 1$, WITH POSITIVE r MEANING A POSITIVE RELATIONSHIP AND NEGATIVE r MEANING A NEGATIVE RELATIONSHIP BETWEEN THE TWO VARIABLES.

• r = 1 IF AND ONLY IF ALL POINTS LIE ON A LINE WITH POSITIVE SLOPE.

• r = -1 IF AND ONLY IF ALL POINTS LIE ON A LINE WITH NEGATIVE SLOPE.

• THE VALUE OF r DOES NOT CHANGE IF THE UNITS OF MEASUREMENT ARE CHANGED.

• r MEASURES THE STRENGTH AND DIRECTION OF THE LINEAR RELATIONSHIP BETWEEN Y AND X.
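The claim that r is unaffected by a change of units can be checked numerically. In this sketch the heights and handspans are hypothetical values invented for illustration, and `pearson_r` is a helper defined here, not a library function.

```python
import math

def pearson_r(xs, ys):
    """Correlation coefficient; algebraically equal to the (n - 1) * s_x * s_y form."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical measurements, for illustration only.
heights_in = [64, 66, 68, 70, 72]             # inches
handspans_cm = [19.0, 20.5, 20.0, 22.0, 23.0]  # centimeters

heights_cm = [2.54 * h for h in heights_in]    # same heights, new units

r_inches = pearson_r(heights_in, handspans_cm)
r_cm = pearson_r(heights_cm, handspans_cm)
print(r_inches, r_cm)  # the two values agree up to rounding error
```

Rescaling x multiplies both the numerator and the denominator of r by the same factor, which is why the value does not move.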

33

•THE VALUE OF r DOES NOT DEPEND ON WHICH OF THE TWO VARIABLES IS LABELED X.

• THE VALUE OF r IS A MEASURE OF THE EXTENT TO WHICH X AND Y ARE LINEARLY RELATED – THAT IS, THE EXTENT TO WHICH THE POINTS IN THE SCATTER PLOT FALL CLOSE TO A STRAIGHT LINE. A VALUE OF r CLOSE TO ZERO DOES NOT RULE OUT ANY STRONG RELATIONSHIP BETWEEN X AND Y; THERE COULD STILL BE A STRONG RELATIONSHIP BUT ONE THAT IS NOT LINEAR.
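Both points on this slide can be illustrated with a short calculation. The sketch below uses made-up data in which y is completely determined by x through y = x², yet r is zero, and it also checks that swapping the roles of X and Y leaves r unchanged. The helper `pearson_r` and the data are assumptions for illustration.

```python
import math

def pearson_r(xs, ys):
    """Correlation coefficient computed from sums of deviations."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# A perfect but nonlinear relationship: y = x^2 on x values symmetric about 0.
xs = [-3, -2, -1, 0, 1, 2, 3]
ys = [x ** 2 for x in xs]
r_curved = pearson_r(xs, ys)
print(r_curved)  # zero: no linear association, despite an exact relationship

# Swapping which variable is labeled X does not change r.
a = [6, 10, 14, 19, 21]
b = [5, 3, 7, 8, 12]
print(pearson_r(a, b), pearson_r(b, a))  # identical values
```

A scatterplot of the first data set would show an obvious U-shaped curve, which is why the plot should always be examined before relying on r.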

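[Figure: scale of r from -1 to +1 with marks at -0.8, -0.5, 0, 0.5, and 0.8. The ends (beyond -0.8 and beyond 0.8) are labeled STRONG, the bands from -0.8 to -0.5 and from 0.5 to 0.8 are labeled MODERATE, and the band from -0.5 to 0.5 is labeled WEAK.]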

Scatterplots and Correlation Coefficient

• Let’s get a feel for the correlation r by looking at its values for the scatterplots shown below

34

35

36

CORRELATION CONDITIONS

• QUANTITATIVE VARIABLES CONDITION

• CORRELATION APPLIES ONLY TO QUANTITATIVE VARIABLES.

• BE SURE NOT TO APPLY CORRELATION TO CATEGORICAL DATA MASQUERADING AS QUANTITATIVE.

• CHECK THE VARIABLES’ UNITS AND WHAT THEY MEASURE.

37

STRAIGHT ENOUGH CONDITION

• MAKE SURE THE FORM OF THE SCATTERPLOT IS STRAIGHT ENOUGH THAT A LINEAR RELATIONSHIP MAKES SENSE.

• CORRELATION MEASURES THE STRENGTH ONLY OF THE LINEAR ASSOCIATION, AND WILL BE MISLEADING IF THE RELATIONSHIP IS NOT LINEAR.

• IF A RELATIONSHIP IS CURVED, THEN SUMMARIZING ITS STRENGTH WITH A CORRELATION WOULD BE MISLEADING.

38

OUTLIER CONDITION

• OUTLIERS CAN DISTORT THE CORRELATION DRAMATICALLY.

• AN OUTLIER CAN MAKE AN OTHERWISE WEAK CORRELATION LOOK BIG OR HIDE A STRONG CORRELATION.

• AN OUTLIER CAN EVEN GIVE AN OTHERWISE POSITIVE ASSOCIATION A NEGATIVE CORRELATION (AND VICE VERSA)

• WHEN YOU SEE AN OUTLIER, IT’S OFTEN A GOOD IDEA TO REPORT THE CORRELATION WITH AND WITHOUT THAT POINT.
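Reporting the correlation with and without a suspect point is easy to do in code. The data below are hypothetical: five points with almost no linear association, plus one extreme point at (20, 20). The helper `pearson_r` is defined here for illustration.

```python
import math

def pearson_r(xs, ys):
    """Correlation coefficient computed from sums of deviations."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical data, for illustration only.
xs = [1, 2, 3, 4, 5]
ys = [3, 1, 4, 2, 3]
outlier = (20, 20)

r_without = pearson_r(xs, ys)
r_with = pearson_r(xs + [outlier[0]], ys + [outlier[1]])

print(round(r_without, 2))  # weak correlation among the five ordinary points
print(round(r_with, 2))     # close to 1 once the single outlier is included
```

Here one point turns a weak correlation into one that looks very strong, which is the distortion the outlier condition warns about.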

39

WHICH LINE “BEST FITS” THE SET OF OBSERVATIONS?

•THE ANALYTICAL APPROACH

Regression Line

•The first step of a regression analysis is to identify the response and explanatory variables.

– We use y to denote the response variable.

– We use x to denote the explanatory variable.

40

41

FITTING THE MODEL: THE LEAST SQUARES METHOD

• CONSIDER THE EXAMPLE:

• SUPPOSE AN APPLIANCE STORE CONDUCTS A FIVE-MONTH EXPERIMENT TO DETERMINE THE EFFECT OF ADVERTISING ON SALES REVENUE. THE RESULTS ARE SHOWN IN THE TABLE.

MONTH    ADVERTISING EXPENDITURE, X ($100s)    SALES REVENUE, Y ($1,000s)

1        1                                     1

2        2                                     1

3        3                                     2

4        4                                     2

5        5                                     4

42

FIRST STEP IS TO MAKE A SCATTERGRAM

[Figure: scattergram of SALES REVENUE, Y ($1000s), against AD. EXPENDITURE, X ($100s), for the five months.]

43

WHAT IS THE BEST FIT?

• SCATTERGRAM WITH POSSIBLE FITS

[Figure: the scattergram of Y against X with possible fitted lines drawn through the points.]

44

GENERAL EQUATION OF A REGRESSION LINE

y = a + bx + error, WHERE a + bx IS THE LINEAR PART

• THE LINEAR PART OF THE EQUATION REQUIRES DETERMINATION OF TWO COEFFICIENTS, a (THE Y-INTERCEPT) AND b (THE SLOPE), IN ORDER TO PREDICT VALUES OF Y.

• ONCE a AND b ARE OBTAINED, THE STRAIGHT LINE IS KNOWN AND CAN BE PLOTTED ON THE SCATTER DIAGRAM. THEN WE COULD MAKE A VISUAL COMPARISON OF HOW WELL OUR PARTICULAR STATISTICAL MODEL (A STRAIGHT LINE) FITS THE ORIGINAL DATA.

Regression Equation

• The regression line predicts the value for the response variable y as a straight-line function of the value x of the explanatory variable.

• Let $\hat{y}$ denote the predicted value of y. The equation for the regression line has the form

$\hat{y} = a + bx$

• In this formula, a denotes the y-intercept and b denotes the slope.

Example: Height Based on Human Remains

Regression Equation: $\hat{y} = 61.4 + 2.4x$

$\hat{y}$ is the predicted height and $x$ is the length of a femur (thighbone), measured in centimeters.

Use the regression equation to predict the height of a person whose femur length was 50 cm:

$\hat{y} = 61.4 + 2.4(50) = 181.4$
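The prediction is a direct substitution into the regression equation. As a small sketch, the function name `predict_height` is an illustrative choice; the intercept 61.4 and slope 2.4 come from the example.

```python
def predict_height(femur_cm):
    """Predicted height (cm) from femur length (cm): y-hat = 61.4 + 2.4 x."""
    return 61.4 + 2.4 * femur_cm

print(round(predict_height(50), 1))  # 181.4

# The slope is the change in predicted height per 1 cm of femur length.
print(round(predict_height(51) - predict_height(50), 1))  # 2.4
```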

47

COMPUTING a AND b

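LETTING $\bar{X}$ AND $\bar{Y}$ REPRESENT THE MEANS OF $X_1, X_2, \ldots, X_n$ AND $Y_1, Y_2, \ldots, Y_n$ RESPECTIVELY, THE RESULTING FORMULAS FOR THE INTERCEPT AND SLOPE ARE GIVEN BY

$$\hat{b} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}, \qquad \hat{a} = \bar{y} - \hat{b}\,\bar{x}$$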
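Applying the least squares formulas to the five-month advertising data from the appliance store example gives a concrete line. This is a sketch of the hand calculation; the variable names are illustrative.

```python
# Advertising expenditure X ($100s) and sales revenue Y ($1,000s), months 1-5.
xs = [1, 2, 3, 4, 5]
ys = [1, 1, 2, 2, 4]
n = len(xs)

x_bar = sum(xs) / n   # 3.0
y_bar = sum(ys) / n   # 2.0

# Slope: sum of cross-products of deviations over sum of squared x deviations.
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))   # 7.0
sxx = sum((x - x_bar) ** 2 for x in xs)                        # 10.0
b_hat = sxy / sxx

# Intercept: the fitted line passes through the point of means.
a_hat = y_bar - b_hat * x_bar

print(round(b_hat, 2), round(a_hat, 2))  # 0.7 -0.1
```

The fitted line is therefore y-hat = -0.1 + 0.7x: each additional $100 of advertising is associated with a predicted $700 increase in sales revenue over the range observed.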

48

ALTERNATIVE FORMULA WHEN THE STANDARD DEVIATIONS OF THE X AND Y VARIABLES ARE KNOWN

$$\hat{b} = r\,\frac{s_y}{s_x}, \qquad \hat{a} = \bar{y} - \hat{b}\,\bar{x}$$

49

NOTE: THE ‘HAT’ IS PLACED OVER THE LETTERS a AND b TO REMIND US THAT THESE ARE THE VALUES WHICH MINIMIZE THE SUM OF SQUARED DEVIATIONS.
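The alternative formula, slope = r × (s_y / s_x), gives the same line as the direct least squares formula. The sketch below checks this on the five-month advertising data; the variable names are illustrative.

```python
import math

xs = [1, 2, 3, 4, 5]   # advertising expenditure ($100s)
ys = [1, 1, 2, 2, 4]   # sales revenue ($1,000s)
n = len(xs)
x_bar, y_bar = sum(xs) / n, sum(ys) / n

sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
sxx = sum((x - x_bar) ** 2 for x in xs)
syy = sum((y - y_bar) ** 2 for y in ys)

# Sample standard deviations and the correlation coefficient.
s_x = math.sqrt(sxx / (n - 1))
s_y = math.sqrt(syy / (n - 1))
r = sxy / ((n - 1) * s_x * s_y)

b_from_r = r * s_y / s_x          # alternative formula
b_direct = sxy / sxx              # direct least squares formula
a_hat = y_bar - b_from_r * x_bar  # intercept

print(round(b_from_r, 2), round(b_direct, 2), round(a_hat, 2))  # 0.7 0.7 -0.1
```

The two slopes agree because the factor (n − 1) and one copy of s_x cancel when r × (s_y / s_x) is written out.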

• CLASS WORK

• Example From Midterm 1 Review Sheet

Example: Baseball Scoring Vs Batting average

• Given the following statistics from data on baseball scoring versus batting average, find the regression line.

50

Example: Baseball Scoring Vs Batting Average

51

$a = \bar{y} - b\bar{x} = 4.45 - 26.07(0.2597) = -2.32$

Example: Baseball Scoring Vs Batting Average

• The regression line to predict team scoring from batting average is

$\hat{y} = -2.32 + 26.1x$
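The intercept in this example follows from the slope and the two means. As a quick arithmetic check, the sketch below uses the values that appear in the calculation (mean scoring 4.45, mean batting average 0.2597, slope 26.07); the variable names are illustrative.

```python
y_bar = 4.45     # mean team scoring
x_bar = 0.2597   # mean batting average
b = 26.07        # slope (shown rounded to 26.1 in the final equation)

# Intercept: a = y_bar - b * x_bar
a = y_bar - b * x_bar
print(round(a, 2))  # -2.32
```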

Slope: Positive, Negative, Zero

53

Interpreting the Slope

•Slope: measures the change in the predicted variable (y) for a 1 unit increase in the explanatory variable (x).

•Example: A 1 cm increase in femur length results in a 2.4 cm increase in predicted height.

54

The Slope and Correlation

•Slope:

– Numerical value depends on the units used to measure the variables.

– Does not tell us whether the association is strong or weak.

– The two variables must be identified as response and explanatory variables.

– The regression equation can be used to predict values of the response variable for given values of the explanatory variable.

55

Interpreting the y - Intercept

• y-Intercept:

– The predicted value for y when x = 0;

– This fact helps in plotting the line;

– May not have any interpretative value if no observations had x values near 0.

It does not make sense for femur length to be 0 cm, so the y-intercept for the equation $\hat{y} = 61.4 + 2.4x$ is not a relevant predicted height.

57

REGRESSION WISDOM: PREDICTION, RESIDUALS, CORRELATION

• PREDICTION

• OBTAINING THE REGRESSION FORMULA IS NOT THE END OF THE ANALYSIS. MOSTLY, WE ARE INTERESTED IN PREDICTING FUTURE OUTCOMES WITH THE REGRESSION FUNCTION.

• TYPES OF PREDICTION

• EXTRAPOLATION: EXTRAPOLATION IS THE USE OF A REGRESSION LINE FOR PREDICTION OUTSIDE THE RANGE OF VALUES OF THE EXPLANATORY VARIABLE X THAT WAS USED TO OBTAIN THE LINE. SUCH PREDICTIONS CANNOT BE TRUSTED.

Extrapolation is Dangerous

•Extrapolation: Using a regression line to predict y-values for x-values outside the observed range of the data.

– Riskier the farther we move from the range of the given x-values.

– There is no guarantee that the relationship given by the regression equation holds outside the range of sampled x-values.

58

59

INTERPOLATION:

• INTERPOLATION IS THE USE OF A REGRESSION LINE FOR PREDICTION WITHIN THE RANGE OF VALUES OF THE EXPLANATORY VARIABLE X THAT IS USED TO OBTAIN THE LINE. INTERPOLATION IS GENERALLY SAFE.

REMARKS

• EXTRAPOLATION SHOULD BE HANDLED WITH CAUTION. LIMIT PREDICTIONS TO X VALUES WHICH ARE WITHIN THE RANGE OF THE DATA USED TO COMPUTE THE LEAST SQUARES LINE.
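One simple way to follow this advice is to check each new x value against the range of the data before predicting. The sketch below is an illustration only: the function name `prediction_type` is hypothetical, and the x values are the advertising expenditures from the appliance store example.

```python
def prediction_type(x_new, xs_observed):
    """Classify a prediction by whether x_new lies inside the observed x range."""
    lo, hi = min(xs_observed), max(xs_observed)
    return "interpolation" if lo <= x_new <= hi else "extrapolation"

ad_x = [1, 2, 3, 4, 5]   # advertising expenditure ($100s) used to fit the line

print(prediction_type(3.5, ad_x))  # interpolation
print(prediction_type(10, ad_x))   # extrapolation
```

A value flagged as extrapolation can still be plugged into the equation, but the result rests on the unverified assumption that the same straight-line relationship continues outside the observed range.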

60

INTERPOLATING AND EXTRAPOLATING

• ILLUSTRATION

61

• DO NOT MAKE PREDICTIONS OUTSIDE THE CONTEXT OF THE STUDY IN WHICH THE DATA WERE COLLECTED. FOR EXAMPLE, IT IS INAPPROPRIATE TO USE THE LEAST SQUARES LINE FITTED TO DATA ON AMERICAN BABIES TO PREDICT WEIGHTS FOR BABIES BORN IN CHINA.

DIAGNOSTICS: AFTER OBTAINING THE REGRESSION LINE, WE WOULD LIKE TO KNOW HOW WELL THE REGRESSION LINE FITS THE DATA. ALSO, WE WOULD LIKE TO KNOW IF THERE IS ANY POINT THAT STRONGLY AFFECTS THE REGRESSION LINE. A DIAGNOSTIC ANALYSIS SUCH AS THE ANALYSIS OF RESIDUALS IS VERY USEFUL.

62

REGRESSION WISDOM - RESIDUALS

• THE DISCREPANCY BETWEEN THE DATA AND THE MODEL IS CALLED A RESIDUAL. HENCE, A RESIDUAL IS THE DIFFERENCE BETWEEN AN OBSERVED VALUE OF Y AND THE VALUE PREDICTED BY THE REGRESSION LINE. THAT IS,

RESIDUAL = OBSERVED Y – PREDICTED Y = $y - \hat{y}$

WHERE $\hat{y} = \hat{a} + \hat{b}x_i$

63

NOTATION: A RESIDUAL IS DENOTED BY THE LETTER e

• THE RESIDUAL FOR INDIVIDUAL i IS DENOTED BY

$e_i = y_i - (\hat{a} + \hat{b}x_i) = y_i - \hat{y}_i$

EXAMPLE: A LINEAR MODEL RELATING HURRICANES’ WIND SPEEDS TO THEIR CENTRAL PRESSURES IS

MaxWindSpeed = 955.27 – 0.897 × CentralPressure

HURRICANE KATRINA HAD A CENTRAL PRESSURE OF 920 MILLIBARS. WHAT DOES OUR REGRESSION MODEL PREDICT FOR HER MAXIMUM WIND SPEED? HOW GOOD IS THAT PREDICTION, GIVEN THAT KATRINA’S ACTUAL WIND SPEED WAS MEASURED AT 110 KNOTS?
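The Katrina question can be answered by substituting into the model and then subtracting the prediction from the observed value. In this sketch the function name is an illustrative choice; the coefficients, the pressure of 920 millibars, and the observed 110 knots come from the example.

```python
def predicted_max_wind_speed(central_pressure_mb):
    """Predicted maximum wind speed (knots) from central pressure (millibars)."""
    return 955.27 - 0.897 * central_pressure_mb

observed = 110.0                               # Katrina's measured wind speed
predicted = predicted_max_wind_speed(920)      # model prediction at 920 mb
residual = observed - predicted                # observed y minus predicted y

print(round(predicted, 2))  # 130.03
print(round(residual, 2))   # -20.03
```

The residual is about -20 knots, so the model overestimates Katrina's wind speed by roughly 20 knots.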

Analysis of Residuals

• ANALYSIS OF RESIDUALS HELPS US TO ASSESS THE ADEQUACY OF A MODEL AND HELPS TO IDENTIFY OUTLIERS OR OTHER INTERESTING DATA POINTS.

• WHEN A REGRESSION MODEL IS APPROPRIATE, IT SHOULD MODEL THE UNDERLYING RELATIONSHIP. NOTHING INTERESTING SHOULD BE LEFT BEHIND. SO AFTER WE FIT A REGRESSION MODEL, WE USUALLY PLOT THE RESIDUALS IN THE HOPE OF FINDING … NOTHING.

64

65

Analysis of Residuals – Residual Plots

• RESIDUAL PLOT IS A SCATTERPLOT OF THE RESIDUALS [ON THE VERTICAL AXIS] AGAINST THE EXPLANATORY VARIABLE, X, ON THE HORIZONTAL AXIS.

• THE PLOT SHOULD NOT HAVE ANY INTERESTING FEATURES, LIKE A DIRECTION OR SHAPE. IT SHOULD STRETCH HORIZONTALLY, WITH ABOUT THE SAME AMOUNT OF SCATTER THROUGHOUT. IT SHOULD SHOW NO BENDS, AND IT SHOULD HAVE NO OUTLIERS.

66

NOTE: SUM OF RESIDUALS = 0

• A RESIDUAL PLOT

[Figure: residual plot with the residuals e on the vertical axis and x on the horizontal axis.]
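The fact that least squares residuals sum to zero can be verified on a small data set. This sketch fits the line to the five-month advertising data from the appliance store example and lists the residuals that would go on the vertical axis of a residual plot; the variable names are illustrative.

```python
xs = [1, 2, 3, 4, 5]   # advertising expenditure ($100s)
ys = [1, 1, 2, 2, 4]   # sales revenue ($1,000s)
n = len(xs)
x_bar, y_bar = sum(xs) / n, sum(ys) / n

# Least squares slope and intercept.
b_hat = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sum(
    (x - x_bar) ** 2 for x in xs
)
a_hat = y_bar - b_hat * x_bar

# Residual = observed y minus predicted y, one per observation.
residuals = [y - (a_hat + b_hat * x) for x, y in zip(xs, ys)]

print([round(e, 1) for e in residuals])  # [0.4, -0.3, 0.0, -0.7, 0.6]
print(abs(sum(residuals)) < 1e-9)        # True: the residuals sum to zero
```

Positive residuals (months 1 and 5) are points above the line, and negative residuals (months 2 and 4) are points below it.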

67

CLASS EXAMPLES

68

INTERPRETATION OF POSITIVE AND NEGATIVE RESIDUALS

• POSITIVE RESIDUAL: THE MODEL OR PREDICTED VALUES UNDERESTIMATE THE ACTUAL DATA VALUE.

• NEGATIVE RESIDUAL: THE MODEL OR PREDICTED VALUES OVERESTIMATE THE ACTUAL DATA VALUE

69

REMARK

• MOST COMPUTER STATISTICS PACKAGES PLOT THE RESIDUALS AGAINST THE PREDICTED VALUES, RATHER THAN AGAINST THE X-VALUES. WHEN THE SLOPE IS NEGATIVE, THE TWO VERSIONS ARE MIRROR IMAGES. WHEN THE SLOPE IS POSITIVE, THEY ARE VIRTUALLY IDENTICAL EXCEPT FOR THE AXIS LABELS. SINCE ALL WE CARE ABOUT IS THE PATTERNS (OR, BETTER, LACK OF PATTERNS) IN THE RESIDUAL PLOT, IT REALLY DOES NOT MATTER WHICH WAY WE PLOT THE RESIDUALS.

70

CLASS EXAMPLES

71

COEFFICIENT OF DETERMINATION

• THE COEFFICIENT OF DETERMINATION MEASURES THE PROPORTION OF VARIATION THAT IS EXPLAINED BY THE INDEPENDENT VARIABLE X IN THE REGRESSION MODEL. THAT IS, IT MEASURES THE PROPORTION OF THE TOTAL VARIABILITY IN Y THAT IS REMOVED BY ADDING X TO THE LINEAR MODEL.

• NOTATION: $r^2$

THE COEFFICIENT OF DETERMINATION IS USEFUL WHEN INTERPRETING r. ITS SYMBOL EXPLAINS HOW IT IS COMPUTED; TO OBTAIN IT, SIMPLY SQUARE r, THE CORRELATION COEFFICIENT.

Coefficient of Determination

• The typical way to interpret $r^2$ is as the proportion of the variation in the y-values that is accounted for by the linear relationship of y with x.

• When a strong linear association exists, the regression equation predictions tend to be much better than the predictions using only $\bar{y}$.

• We measure the proportional reduction in error and call it $r^2$.
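The "proportional reduction in error" reading of r² can be checked numerically: compare the squared error from predicting every y with the mean of y against the squared error from the regression line. The sketch below uses the five-month advertising data from the appliance store example; the names `sst` and `sse` are illustrative labels for the two error totals.

```python
import math

xs = [1, 2, 3, 4, 5]   # advertising expenditure ($100s)
ys = [1, 1, 2, 2, 4]   # sales revenue ($1,000s)
n = len(xs)
x_bar, y_bar = sum(xs) / n, sum(ys) / n

sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
sxx = sum((x - x_bar) ** 2 for x in xs)
b_hat = sxy / sxx
a_hat = y_bar - b_hat * x_bar

# Error when predicting with the mean of y only.
sst = sum((y - y_bar) ** 2 for y in ys)
# Error when predicting with the regression line.
sse = sum((y - (a_hat + b_hat * x)) ** 2 for x, y in zip(xs, ys))

reduction = (sst - sse) / sst        # proportional reduction in error
r = sxy / math.sqrt(sxx * sst)       # correlation coefficient

print(round(reduction, 4), round(r ** 2, 4))  # 0.8167 0.8167
```

Both routes give the same number: about 82% of the variation in sales revenue is accounted for by the linear relationship with advertising expenditure.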

Coefficient of Determination

• $r^2$ measures the proportion of the variation in the y-values that is accounted for by the linear relationship of y with x.

• A correlation of .9 means that

$(.9)^2 = .81 = 81\%$

– 81% of the variation in the y-values can be explained by the explanatory variable, x.

74

PROPERTIES OF COEFFICIENT OF DETERMINATION

1. $0 \le r^2 \le 1$.

2. $r^2 = 1$ IF AND ONLY IF ALL POINTS LIE ON A LINE.

3. $r^2$ DOES NOT CHANGE IF THE UNITS OF MEASUREMENT ARE CHANGED.

4. $r^2$ MEASURES THE STRENGTH OF LINEAR ASSOCIATION BETWEEN THE VARIABLES Y AND X. IT IS POSSIBLE THAT X AND Y ARE STRONGLY RELATED, BUT $r^2$ IS CLOSE TO 0.

75

Remark

• THE COEFFICIENT OF DETERMINATION, WHEN CONVERTED TO A PERCENTAGE, INDICATES HOW MUCH OF THE VARIANCE IN ONE VARIABLE IS ACCOUNTED FOR BY THE VARIANCE IN THE OTHER VARIABLE.

• Examples From Midterm Review 1 Sheet

Outliers and Influential Points

•A regression outlier is an observation that lies far away from the trend that the rest of the data follows.

• An observation is influential if

– its x value is relatively low or high compared to the remainder of the data, and

– the observation is a regression outlier.

Influential observations tend to pull the regression line toward that data point and away from the rest of the data points.

76

Be Cautious of Influential Points

•One reason to plot the data before you do a correlation or regression analysis is to check for unusual observations.

•Search for observations that are regression outliers, being well removed from the trend that the rest of the data follow.

77


78


• An observation is a regression outlier if it is far removed from the trend that the rest of the data follow. The top two points are regression outliers. Not all regression outliers are influential in affecting the correlation or slope. Question: Which regression outlier in this figure is influential?

79

Correlation Does not Imply Causation

•In a regression analysis, suppose that as x goes up, y also tends to go up (or down). Can we conclude that there’s a causal connection, with changes in x causing changes in y?

– A strong correlation between x and y means that there is a strong linear association between the two variables.

– A strong correlation between x and y does not mean that x causes y to change.

80

Correlation Does not Imply Causation (Extra – Credit Exercise)

Data are available for all fires in Chicago last year on x = number of firefighters at the fire and y = cost of damages due to the fire.

1. Would you expect the correlation to be negative, zero, or positive?

2. If the correlation is positive, does this mean that having more firefighters at a fire causes the damages to be worse? Yes or No?

3. Identify a third variable that could be considered a common cause of x and y:

– Distance from the fire station

– Intensity of the fire

– Size of the fire

81

Lurking Variables & Confounding

A lurking variable is a variable, usually unobserved, that influences the association between the variables of primary interest.

Ice cream sales and drowning – lurking variable = temperature

Reading level and shoe size – lurking variable = age

Childhood obesity rate and GDP – lurking variable = time

When two explanatory variables are both associated with a response variable but are also associated with each other, there is said to be confounding.

Lurking variables are not measured in the study but have the potential for confounding.

82

The Effect of Lurking Variables on Associations

• Lurking variables can affect associations in many ways. For instance, a lurking variable may be a common cause of both the explanatory and response variable.

• In practice, there’s usually not a single variable that causally explains a response variable or the association between two variables. More commonly, there are multiple causes. When there are multiple causes, the association among them makes it difficult to study the effect of any single variable.

83

The Effects of Confounding on Associations

• When two explanatory variables are both associated with a response variable but are also associated with each other, confounding occurs.

• It is difficult to determine whether either of them truly causes the response because a variable’s effect could be at least partly due to its association with the other variable.

84

85

LEVERAGE AND INFLUENTIAL POINTS

LEVERAGE

DATA POINTS WITH X-VALUES FAR FROM THE MEAN OF X ARE SAID TO EXERT LEVERAGE ON A LINEAR MODEL. HIGH LEVERAGE POINTS PULL THE LINE CLOSE TO THEM, AND SO THEY CAN HAVE A LARGE EFFECT ON THE LINE, SOMETIMES COMPLETELY DETERMINING THE SLOPE AND Y-INTERCEPT. WITH HIGH ENOUGH LEVERAGE, THEIR RESIDUALS CAN APPEAR TO BE DECEPTIVELY SMALL.

86

INFLUENTIAL POINT

• IF OMITTING A POINT FROM THE DATA RESULTS IN A VERY DIFFERENT REGRESSION MODEL, THEN THAT POINT IS CALLED AN INFLUENTIAL POINT.

• ILLUSTRATIVE EXAMPLES
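The definition suggests a direct check: fit the line with and without the suspect point and compare. In the sketch below the first five points are the advertising data from the appliance store example, while the sixth point (10, 1) is a hypothetical high-leverage observation added only for illustration; `least_squares` is a helper defined here.

```python
def least_squares(xs, ys):
    """Return (intercept, slope) of the least squares line."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    slope = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sum(
        (x - x_bar) ** 2 for x in xs
    )
    return y_bar - slope * x_bar, slope

xs = [1, 2, 3, 4, 5]
ys = [1, 1, 2, 2, 4]
suspect = (10, 1)   # hypothetical point: x far from the mean, y off the trend

_, slope_without = least_squares(xs, ys)
_, slope_with = least_squares(xs + [suspect[0]], ys + [suspect[1]])

print(round(slope_without, 3))  # 0.7
print(round(slope_with, 3))     # about 0.023
```

Adding the single point flattens the slope from 0.7 to nearly zero, so by the definition above it is an influential point.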
