Section 7.2 ~ Interpreting Correlations

Introduction to Probability and Statistics
Ms. Young ~ room 113

Objective

Sec. 7.2

After this section you will be aware of important cautions concerning the interpretation of correlations, especially the effects of outliers, the effects of grouping data, and the crucial fact that correlation does not necessarily imply causality.

Beware of Outliers

When examining a scatterplot to determine correlation, be aware of any outliers

They can greatly affect the correlation coefficient, possibly resulting in a misleading conclusion about the relationship between the variables

The scatterplot below has an outlier located in the top right

With the outlier included, r = 0.880, which represents a very strong positive correlation

If you calculate the correlation coefficient without the outlier, it is 0, which represents absolutely no correlation

Even though outliers can mask or exaggerate a correlation, you should not remove them without having a strong reason to believe that they do not belong in the data set
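
To see how much a single point can move the correlation coefficient, here is a minimal Python sketch. The five data points and the helper name pearson_r are made up for illustration; they are not the data from the scatterplot on this slide. The helper uses the standard formula for r.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Correlation coefficient r for paired lists xs and ys."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Hypothetical data: four points with no trend at all (corners of a square)
x_main = [1, 1, 2, 2]
y_main = [1, 2, 1, 2]

# One hypothetical outlier far away in the top right
x_with = x_main + [10]
y_with = y_main + [10]

r_without = pearson_r(x_main, y_main)  # no correlation among the four points
r_with = pearson_r(x_with, y_with)     # the single outlier pulls r close to 1

print("r without outlier:", round(r_without, 3))
print("r with outlier:   ", round(r_with, 3))
```

In this toy data set the strong correlation comes entirely from the one distant point, which is the same situation the slide describes.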

Example 1 ~ Masked Correlation

You’ve conducted a study to determine how the number of calories a person consumes in a day correlates with time spent in vigorous bicycling. Your sample consisted of ten women cyclists, all of approximately the same height and weight. Over a period of two weeks, you asked each woman to record the amount of time she spent cycling each day and what she ate on each of those days. You used the eating records to calculate the calories consumed each day. The diagram below shows each woman’s mean time spent cycling on the horizontal axis and mean caloric intake on the vertical axis. Do higher cycling times correspond to higher intake of calories?

Example 1 ~ Solution

If you look at the data as a whole, your eye will probably tell you that there is a positive correlation in which greater cycling time tends to go with higher caloric intake. But the correlation is very weak, with a correlation coefficient of 0.374

However, notice that two points are outliers: one representing a cyclist who cycled about a half-hour per day and consumed more than 3,000 calories, and the other representing a cyclist who cycled more than 2 hours per day on only 1,200 calories

It’s difficult to explain the two outliers, given that all the women in the sample have similar heights and weights. We might therefore suspect that these two women either recorded their data incorrectly or were not following their usual habits during the two-week study. If we can confirm this suspicion, then we would have reason to delete the two data points as invalid.

Without those two outlier points the correlation is quite strong, and it suggests that caloric intake rises by a little more than 500 calories for each hour of cycling. However, we should not remove the outliers without confirming our suspicion that they are invalid data points, and if we do leave them out we should report our reasons.
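
The comparison in this solution can be sketched in Python. The ten (hours, calories) pairs below are hypothetical values chosen only to mimic the pattern described (eight points near a line, plus one low-time, high-calorie point and one high-time, low-calorie point); they are not the study's actual data, and the helper names are assumptions. The slope is the ordinary least-squares slope, in calories per hour.

```python
from math import sqrt

def sums_of_squares(xs, ys):
    """Return (Sxy, Sxx, Syy), the sums of deviation products."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy, sxx, syy

def pearson_r(xs, ys):
    sxy, sxx, syy = sums_of_squares(xs, ys)
    return sxy / sqrt(sxx * syy)

def ls_slope(xs, ys):
    """Least-squares slope of y on x."""
    sxy, sxx, _ = sums_of_squares(xs, ys)
    return sxy / sxx

# Hypothetical cyclists: mean hours per day, mean calories per day
hours_clean = [0.5, 0.75, 1.0, 1.5, 2.0, 2.5, 2.75, 3.0]
cal_clean = [1750, 1900, 2010, 2290, 2530, 2810, 2920, 3070]

# Two hypothetical suspect points, like the two outliers in the example
hours_all = hours_clean + [0.5, 2.25]
cal_all = cal_clean + [3100, 1200]

r_all = pearson_r(hours_all, cal_all)        # weak positive correlation
r_clean = pearson_r(hours_clean, cal_clean)  # strong positive correlation
slope = ls_slope(hours_clean, cal_clean)     # a little over 500 cal per hour

print("r with all ten points:  ", round(r_all, 3))
print("r without the outliers: ", round(r_clean, 3))
print("calories per extra hour:", round(slope, 1))
```

Running both versions side by side, rather than silently dropping points, also makes it easy to report what was excluded and how much it changed the result.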

Beware of Inappropriate Grouping

Sometimes grouping data inappropriately can hide correlations

Data may appear to have no correlation, but when grouped differently, a correlation is apparent

Ex. ~ Consider a study in which researchers seek a correlation between hours of TV watched per week and high school grade point average (GPA). They collect the 21 data pairs in Table 7.3.

The scatterplot shows virtually no correlation, and the correlation coefficient equals -0.063

The apparent conclusion is that TV viewing habits are unrelated to academic achievement

Beware of Inappropriate Grouping Cont’d…

However, after further investigation, one astute researcher realizes that some of the students watched mostly educational programs, while others tended to watch comedies, dramas, and movies.

She therefore divides the data set into two groups, one for the students who watched mostly educational television and one for the other students.

After graphing each of the groups separately, we find two very strong correlations:

A strong positive correlation for the students who watched educational programs (r = 0.855)

A strong negative correlation for the other students (r = -0.951)
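
A short Python sketch shows how two opposite trends can cancel when the groups are pooled. The hours and GPA values are hypothetical and deliberately symmetric; they are not the 21 data pairs from Table 7.3, and pearson_r is an illustrative helper using the standard formula for r.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Correlation coefficient r for paired lists xs and ys."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Hypothetical weekly TV hours, the same for both groups of students
hours = [2, 4, 6, 8, 10]

# Hypothetical GPAs: rising with hours for educational viewers,
# falling with hours for the other students
gpa_educational = [2.5, 2.6, 3.1, 3.2, 3.6]
gpa_other = [3.5, 3.4, 2.9, 2.8, 2.4]

# Pooled data: the two opposite trends cancel each other out
r_all = pearson_r(hours + hours, gpa_educational + gpa_other)

# Each group on its own shows a strong correlation
r_educational = pearson_r(hours, gpa_educational)
r_other = pearson_r(hours, gpa_other)

print("all students:       ", round(r_all, 3))
print("educational viewers:", round(r_educational, 3))
print("other students:     ", round(r_other, 3))
```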

Sometimes data may appear to have a correlation, but when grouped differently there is no correlation

Ex. ~ Consider the data collected by a consumer group studying the relationship between the weights and prices of cars.

The data set as a whole shows a strong positive correlation (r = 0.949)

After closer examination, you can see that there are two rather distinct categories: light cars and heavy cars

If you analyze the light cars alone, r = 0.019 (nearly no correlation)

If you analyze the heavy cars alone, r = -0.022 (nearly no correlation)

This false correlation occurred because of the separation between the two clusters
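
The reverse effect, where two separated clusters produce a correlation that neither cluster has on its own, can be sketched the same way. The weights (in pounds) and prices (in dollars) are hypothetical and are not the consumer group's data; pearson_r is an illustrative helper using the standard formula for r.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Correlation coefficient r for paired lists xs and ys."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Hypothetical light cars: price does not depend on weight within the group
light_weight = [2000, 2000, 2200, 2200]
light_price = [15000, 17000, 15000, 17000]

# Hypothetical heavy cars: again no trend within the group
heavy_weight = [4000, 4000, 4200, 4200]
heavy_price = [35000, 37000, 35000, 37000]

r_light = pearson_r(light_weight, light_price)
r_heavy = pearson_r(heavy_weight, heavy_price)

# Pooled data: the gap between the clusters looks like a strong trend
r_all = pearson_r(light_weight + heavy_weight, light_price + heavy_price)

print("light cars only:", round(r_light, 3))
print("heavy cars only:", round(r_heavy, 3))
print("all cars:       ", round(r_all, 3))
```

Here the pooled coefficient only reflects that heavy cars as a group cost more than light cars as a group; within either group, knowing the weight says nothing about the price.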

Correlation Does Not Imply Causality

Just because the numbers tell us that there is a correlation between two variables, it does not necessarily mean that the relationship is a causal one

In other words, “correlation does not imply causality”: one variable does not necessarily cause the other

Here are some possible explanations for a correlation:

The correlation may be a coincidence
Ex. ~ Super Bowl and the stock market (refer to ex. 2 on P.303)

Both correlated variables might be directly influenced by some common underlying cause
Ex. ~ As eggnog sales increase in Pennsylvania, accident rates increase as well; the underlying cause would be that eggnog is typically sold in the winter and accidents are more common in the winter due to inclement weather

One of the correlated variables may actually be a cause of the other, but it may just be one of several causes
Ex. ~ There is a correlation between smoking and lung cancer, but smoking is not the only way one can get lung cancer
