Section 7.2 ~ Interpreting Correlations
Introduction to Probability and Statistics
Ms. Young ~ room 113
Objective
Sec. 7.2
After this section you will be aware of important cautions concerning the interpretation of correlations, especially the effects of outliers, the effects of grouping data, and the crucial fact that correlation does not necessarily imply causality.
Beware of Outliers
When examining a scatterplot to determine correlation, be aware of any outliers.
They can greatly affect the correlation coefficient, possibly resulting in a misleading conclusion about the relationship between the variables.
The scatterplot below has an outlier located in the top right.
With the outlier included, r = 0.880, which represents a very strong positive correlation.
Without the outlier, the correlation coefficient is 0, which represents no linear correlation at all.
Even though outliers can create or mask a correlation, you should not remove them without having a strong reason to believe that they do not belong in the data set.
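The effect of a single outlier can be checked numerically. The sketch below uses made-up points (not the data from the slide's scatterplot) and a small helper, `pearson_r`, written here for illustration: four points arranged in a square have no linear correlation, and adding one distant point in the top right produces a very strong one.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical data: four points in a square, plus one outlier in the top right.
cluster = [(1, 1), (1, 2), (2, 1), (2, 2)]
outlier = (10, 10)

xs, ys = zip(*cluster)
r_without = pearson_r(xs, ys)

xs_all, ys_all = zip(*(cluster + [outlier]))
r_with = pearson_r(xs_all, ys_all)

print(f"r without outlier: {r_without:.3f}")
print(f"r with outlier:    {r_with:.3f}")
```

With these made-up points, r is 0 without the outlier and roughly 0.98 with it, the same pattern the slide describes.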
Example 1 ~ Masked Correlation
You’ve conducted a study to determine how the number of calories a person consumes in a day correlates with time spent in vigorous bicycling. Your sample consisted of ten women cyclists, all of approximately the same height and weight. Over a period of two weeks, you asked each woman to record the amount of time she spent cycling each day and what she ate on each of those days. You used the eating records to calculate the calories consumed each day. The diagram below shows each woman’s mean time spent cycling on the horizontal axis and mean caloric intake on the vertical axis. Do higher cycling times correspond to higher intake of calories?
Example 1 ~ Solution
If you look at the data as a whole, your eye will probably tell you that there is a positive correlation in which greater cycling time tends to go with higher caloric intake. But the correlation is very weak, with a correlation coefficient of 0.374.
However, notice that two points are outliers: one representing a cyclist who cycled about a half-hour per day and consumed more than 3,000 calories, and the other representing a cyclist who cycled more than 2 hours per day on only 1,200 calories.
It’s difficult to explain the two outliers, given that all the women in the sample have similar heights and weights. We might therefore suspect that these two women either recorded their data incorrectly or were not following their usual habits during the two-week study. If we can confirm this suspicion, then we would have reason to delete the two data points as invalid.
Without those two outlier points, the correlation is quite strong, and it suggests that caloric intake rises by a little more than 500 calories for each hour of cycling. However, we should not remove the outliers without confirming our suspicion that they are invalid data points, and if we do remove them, we should report our reasons for leaving them out.
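The slides do not say how the figure of a little more than 500 calories per hour was obtained. One common reading, assumed here, is that it is the slope b of the least-squares line through the remaining eight points. That slope is tied to the correlation coefficient by:

```latex
b = r \cdot \frac{s_y}{s_x}
```

Here s_x and s_y are the sample standard deviations of cycling time (in hours) and caloric intake (in calories), so b is in calories per hour. A strong r by itself does not fix the slope; the slope also depends on how spread out the two variables are.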
Beware of Inappropriate Grouping
Sometimes grouping data inappropriately can hide correlations.
Data may appear to have no correlation, but when grouped differently, a correlation is apparent.
Ex. ~ Consider a study in which researchers seek a correlation between hours of TV watched per week and high school grade point average (GPA). They collect the 21 data pairs in Table 7.3.
The scatterplot shows virtually no correlation, and the correlation coefficient equals -0.063.
The apparent conclusion is that TV viewing habits are unrelated to academic achievement.
Beware of Inappropriate Grouping Cont’d…
However, after further investigation, one astute researcher realizes that some of the students watched mostly educational programs, while others tended to watch comedies, dramas, and movies.
She therefore divides the data set into two groups, one for the students who watched mostly educational television and one for the other students.
Beware of Inappropriate Grouping Cont’d…
After graphing each of the groups separately, we find two very strong correlations:
  A strong positive correlation for the students who watched educational programs (r = 0.855)
  A strong negative correlation for the other students (r = -0.951)
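The cancelling effect can be reproduced with a small sketch. The numbers below are invented for illustration and are not the 21 data pairs from Table 7.3; the helper `pearson_r` and the group names are likewise assumptions. One group's GPA rises with viewing hours, the other's falls, and pooling the two groups leaves almost no correlation.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical (hours of TV per week, GPA) pairs for two groups of students.
hours = [2, 4, 6, 8, 10]
gpa_educational = [2.4, 2.9, 3.0, 3.6, 3.8]  # GPA rises with hours
gpa_other = [3.7, 3.5, 2.9, 2.6, 2.1]        # GPA falls with hours

r_educational = pearson_r(hours, gpa_educational)
r_other = pearson_r(hours, gpa_other)

# Pool both groups into one data set, as in the original analysis.
r_pooled = pearson_r(hours + hours, gpa_educational + gpa_other)

print(f"educational viewers: r = {r_educational:.3f}")
print(f"other viewers:       r = {r_other:.3f}")
print(f"pooled data:         r = {r_pooled:.3f}")
```

Each group alone shows a strong correlation, while the pooled coefficient is close to zero because the two opposite trends cancel.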
Beware of Inappropriate Grouping Cont’d…
Sometimes data may appear to have a correlation, but when grouped differently there is no correlation.
Ex. ~ Consider the data collected by a consumer group studying the relationship between the weights and prices of cars.
  The data set as a whole shows a strong positive correlation (r = 0.949)
  After closer examination, you can see that there are two rather distinct categories: light cars and heavy cars
  If you analyze the light cars alone, r = 0.019 (nearly no correlation)
  If you analyze the heavy cars alone, r = -0.022 (nearly no correlation)
  The overall correlation is misleading: it arises only from the separation between the two clusters, not from any relationship within either cluster
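A sketch with invented numbers (not the consumer group's data) shows how two separated clusters can produce a strong overall correlation even when neither cluster has one. The helper `pearson_r`, the units, and the values are assumptions chosen for illustration.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical cars: weight in thousands of pounds, price in thousands of dollars.
light_weights = [2.0, 2.2, 2.4, 2.6]
light_prices = [15, 17, 17, 15]   # no trend within the light cars
heavy_weights = [4.0, 4.2, 4.4, 4.6]
heavy_prices = [34, 36, 36, 34]   # no trend within the heavy cars

r_light = pearson_r(light_weights, light_prices)
r_heavy = pearson_r(heavy_weights, heavy_prices)
r_all = pearson_r(light_weights + heavy_weights, light_prices + heavy_prices)

print(f"light cars only: r = {r_light:.3f}")
print(f"heavy cars only: r = {r_heavy:.3f}")
print(f"all cars:        r = {r_all:.3f}")
```

Within each cluster r is essentially 0, yet the combined data give r of about 0.97, because the coefficient is driven by the gap between the clusters.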
Correlation Does Not Imply Causality
Just because the numbers tell us that there is a correlation between two variables, it does not mean that one variable is responsible for the changes in the other.
In other words, “correlation does not imply causality”: one variable does not necessarily cause the other one.
Here are some possible explanations for a correlation:
The correlation may be a coincidence
  Ex. ~ Super Bowl and the stock market (refer to ex. 2 on P.303)
Both correlated variables might be directly influenced by some common underlying cause
  Ex. ~ As eggnog sales increase in Pennsylvania, accident rates increase as well; the underlying cause would be that eggnog is typically sold in the winter and accidents are more common in the winter due to inclement weather
One of the correlated variables may actually be a cause of the other, but it may just be one of several causes
  Ex. ~ There is a correlation between smoking and lung cancer, but smoking is not the only way one can get lung cancer
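The common-cause explanation can be illustrated with a toy model. Everything in the sketch below is invented: a monthly "winter severity" score drives both eggnog sales and accident counts, each with its own small fluctuation, and neither quantity appears in the formula for the other. The helper `pearson_r` and all of the numbers are assumptions, not data from Pennsylvania.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical winter-severity score for each month, January to December.
winter = [9, 8, 5, 3, 1, 0, 0, 1, 3, 5, 8, 10]

# Small, unrelated month-to-month fluctuations (also made up).
eggnog_noise = [3, -2, 1, -1, 2, -3, 1, 0, -2, 2, -1, 0]
accident_noise = [-4, 5, -3, 2, 0, 4, -5, 1, 3, -2, -1, 0]

# Each variable depends only on winter severity plus its own fluctuation.
eggnog_sales = [10 + 12 * w + e for w, e in zip(winter, eggnog_noise)]
accidents = [200 + 15 * w + a for w, a in zip(winter, accident_noise)]

r = pearson_r(eggnog_sales, accidents)
print(f"eggnog sales vs. accidents: r = {r:.3f}")
```

The two series come out strongly correlated even though the model contains no link from eggnog to accidents; the shared dependence on the season accounts for all of it.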