elaine-hurrell
Correlation: Oh yeah!
Outline
• Basics
• Visualization
• Covariance
• Significance testing and interval estimation
• Effect size
• Bias
• Factors affecting correlation
• Issues with correlational studies
Correlation
•Research question: What is the relationship between two variables?
•Correlation is a measure of the direction and degree of linear association between 2 variables.
•Correlation is the standardized covariance between two variables
Questions to be asked…
• Is there a linear relationship between x and y?
• What is the strength of this relationship?
▫ Pearson Product Moment Correlation Coefficient r
• Can we describe this relationship and use this to predict y from x?
▫ y = bx + a
• Is the relationship we have described statistically significant?
▫ Not a very interesting one if tested against a null of r = 0
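The first three questions can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed method: the scores are invented, and the helper names `pearson_r` and `fit_line` are hypothetical.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    syy = sum((yi - my) ** 2 for yi in y)
    return sxy / sqrt(sxx * syy)

def fit_line(x, y):
    """Least-squares slope b and intercept a for the line y = bx + a."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    b = sxy / sxx
    a = my - b * mx
    return b, a

# Hypothetical scores, invented for illustration only.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

r = pearson_r(x, y)    # strength and direction of the linear relationship
b, a = fit_line(x, y)  # prediction equation y = bx + a
print(f"r = {r:.3f}, y = {b:.1f}x + {a:.1f}")
```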
Other stuff
• Check scatterplots to see whether a Pearson r makes sense
• Use both r and R² to understand the situation
• If data is non-metric or non-normal, use “non-parametric” correlations
• Correlation does not prove causation
▫ True relationship may be in opposite direction, co-causal, or due to other variables
• However, correlation is the primary statistic used in making an assessment of causality
▫ ‘Potential’ causation
Possible outcomes
• r ranges from -1 to +1
• As one variable increases/decreases, the other variable increases/decreases
▫ Positive covariance
• As one variable increases/decreases, another decreases/increases
▫ Negative covariance
• No linear relationship
▫ r = 0 (independent variables have r = 0, but r = 0 alone does not establish independence)
• Non-linear relationship?
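The last point can be made concrete with a small sketch: a perfectly predictable but non-linear relationship can still give r = 0. The data below are constructed for illustration, and `pearson_r` is a hypothetical helper.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    syy = sum((yi - my) ** 2 for yi in y)
    return sxy / sqrt(sxx * syy)

# y is completely determined by x, but the relationship is a symmetric curve.
x = [-2, -1, 0, 1, 2]
y = [xi ** 2 for xi in x]

r = pearson_r(x, y)
print(r)  # the positive and negative cross-products cancel
```

This is one reason to check the scatterplot before interpreting r.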
Scatterplots
• As we discussed previously, scatterplots provide a pictorial examination of the relationship between two quantitative variables
• Predictor variable on the X-axis (abscissa); criterion variable on the Y-axis (ordinate)
• Each subject is located in the scatterplot by means of a pair of scores (score on the X variable and score on the Y variable)
▫ Plot each pair of observations (X, Y)
  X = predictor variable (independent)
  Y = criterion variable (dependent)
• Check for linear relationship
▫ ‘Line of best fit’
▫ y = a + bx
• Check for outliers
Example of a Scatterplot
• The relationship between scores on a test of quantitative skills taken by students on the first day of a stats course (X-axis) and their combined scores on two semester exams (Y-axis)
Example of a Scatterplot
• The two variables are positively related
▫ As quantitative skill increases, so does performance on the two midterm exams
• Linear relationship between the variables
▫ Line of best fit drawn on the graph: the ‘regression line’
• The ‘strength’ or ‘degree’ of the linear relationship is measured by a correlation coefficient, i.e. how tightly the data points cluster around the regression line
• We can use this information to determine whether the linear relationship represents a true relationship in the population or is due entirely to chance factors
What do we look for in a Scatterplot?
• Overall pattern: ellipse
▫ Any striking deviations (outliers)
• Form: is it linear? (curved? clustered?)
• Direction: is it positive (high values of the two variables tend to occur together)
▫ Or negative (high values of one variable tend to occur with low values of the other variable)?
• Strength: how close the points lie to the line of best fit (if a linear relationship)
[Figure: eight scatterplot panels showing r = 1, 0.95, 0.7, 0.4, -0.4, -0.7, -0.95, and -1]
Linear Correlation / Covariance
•How do we obtain a quantitative measure of the linear association between X and Y?
•The Pearson Product-Moment Correlation Coefficient, r, comes from the covariance statistic. It reflects the degree to which the two variables vary together
Covariance
• The variance shared by two variables
• When X and Y move in the same direction (i.e. their deviations from the mean are similarly positive or negative)
▫ cov(x, y) = positive
• When X and Y move in opposite directions
▫ cov(x, y) = negative
• When no constant relationship
▫ cov(x, y) = 0
$$ \mathrm{cov}(x, y) = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{n - 1} $$
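As a rough sketch of the sample covariance (sum of cross-products of deviations, divided by n - 1), the snippet below computes it for one pair of variables that move together and one pair that move in opposite directions. The data and the function name `covariance` are illustrative assumptions.

```python
def covariance(x, y):
    """Sample covariance: cross-products of deviations divided by n - 1."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    return sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / (n - 1)

# Hypothetical data, invented for illustration only.
x = [1, 2, 3, 4, 5]
same_direction = [2, 4, 5, 4, 5]      # tends to rise with x
opposite_direction = [5, 4, 5, 4, 2]  # tends to fall as x rises

cov_pos = covariance(x, same_direction)
cov_neg = covariance(x, opposite_direction)
print(cov_pos, cov_neg)
```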
Covariance
•Covariance is not easily interpreted on its own and cannot be compared across different scales of measurement
•Solution: standardize this measure
•Pearson’s r:
$$ r_{xy} = \frac{\mathrm{cov}(x, y)}{s_x s_y} $$
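To illustrate why standardizing helps, the sketch below rescales one variable (metres to centimetres). The covariance changes by the scale factor while r does not. The measurements are invented, and the helper names are hypothetical.

```python
from statistics import stdev

def covariance(x, y):
    """Sample covariance with an n - 1 denominator."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    return sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / (n - 1)

def pearson_r(x, y):
    """Pearson's r as the covariance divided by the two standard deviations."""
    return covariance(x, y) / (stdev(x) * stdev(y))

# Hypothetical measurements, invented for illustration only.
height_m = [1.5, 1.6, 1.7, 1.8, 1.9]
weight_kg = [50, 58, 63, 70, 79]
height_cm = [h * 100 for h in height_m]  # same heights on a different scale

cov_m = covariance(height_m, weight_kg)
cov_cm = covariance(height_cm, weight_kg)
r_m = pearson_r(height_m, weight_kg)
r_cm = pearson_r(height_cm, weight_kg)
print(cov_m, cov_cm)  # differ by a factor of 100
print(r_m, r_cm)      # agree up to floating-point rounding
```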
Significance test for correlation
• All correlations in a practical setting will be non-zero
• A significance test can be conducted in an effort to infer to a population
• Key question: “Is the r large enough that it is unlikely to have come from a population in which the two variables are unrelated?”
• Testing the null hypothesis that
▫ H0: ρ = 0 vs. alternative hypothesis H1: ρ ≠ 0
▫ ρ = population product-moment correlation coefficient
Significance test for correlation
• However, with larger N, small, possibly non-meaningful, correlations can be deemed ‘significant’
• So the better question is: Is a test against zero useful?
• Tests of significance for r typically have limited utility if testing against a zero value
• Go by the size and judge worth by what is seen in the relevant literature
df (N - 2)    Critical r (α = .05)
5             .67
10            .50
15            .41
20            .36
25            .32
30            .30
50            .23
200           .11
500           .07
1000          .05
Significance test for correlation
• Furthermore, the approaches outlined in Howell, while standard, are really not necessary
• Using the t-distribution as described, we would only really be able to test a null hypothesis of zero
• If we want to test against some specific value, we have to convert r in some odd fashion and test using these new values
▫ Fisher transformation
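$$ t = \frac{r\sqrt{N - 2}}{\sqrt{1 - r^2}}, \qquad df = N - 2 $$
$$ r' = (0.5)\,\log_e\left|\frac{1 + r}{1 - r}\right|, \qquad s_{r'} = \frac{1}{\sqrt{N - 3}} $$
$$ z = \frac{r' - \rho'}{1/\sqrt{N - 3}} $$
$$ CI(\rho') = r' \pm z_{cv}\,\frac{1}{\sqrt{N - 3}} $$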
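A minimal sketch of these calculations, assuming a hypothetical sample with r = .50 and N = 28. It relies on the identity that `math.atanh(r)` equals 0.5 log_e((1 + r)/(1 - r)), and it back-transforms the interval limits to the r scale with `tanh`. The function names are illustrative.

```python
from math import atanh, sqrt, tanh
from statistics import NormalDist

def t_statistic(r, n):
    """t for testing H0: rho = 0, with df = n - 2."""
    return r * sqrt(n - 2) / sqrt(1 - r ** 2)

def fisher_ci(r, n, level=0.95):
    """Confidence interval for rho via the Fisher transformation."""
    r_prime = atanh(r)        # 0.5 * log_e((1 + r) / (1 - r))
    se = 1 / sqrt(n - 3)      # standard error of r'
    z_cv = NormalDist().inv_cdf(1 - (1 - level) / 2)
    lower = r_prime - z_cv * se
    upper = r_prime + z_cv * se
    return tanh(lower), tanh(upper)  # back on the r scale

# Hypothetical sample values, chosen for illustration only.
r, n = 0.5, 28
t = t_statistic(r, n)
lo, hi = fisher_ci(r, n)
print(f"t({n - 2}) = {t:.3f}")
print(f"95% CI for rho: ({lo:.3f}, {hi:.3f})")
```

The back-transformed interval is not symmetric about r, which reflects the skew in the sampling distribution of r when the population value is not zero.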
Test of the difference between two rs
• While those new values create an r′ that approximates a normal distribution, why do we have to do it?
• The reason for this transformation is that, since r has limits of ±1, the larger the absolute value of r, the more skewed its sampling distribution about the population value ρ (rho)
Sampling distribution of a correlation
• Via the bootstrap, we can see for ourselves that the sampling distribution becomes more and more skewed as we deviate from a null value of zero
The better approach
• Nowadays, we can bootstrap the r or difference between two rs and do hypothesis tests without unnecessary (and most likely problematic) transformations and assumptions
▫ Even for small samples of about 30 it performs as well as the transformation in ideal situations (Efron, 1988)
• Furthermore, it can be applied to other correlation metrics.
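One way to sketch a bootstrap for r is the percentile interval below: resample (x, y) pairs with replacement, recompute r each time, and read off the percentiles. The simulated data, the seed values, the number of resamples, and the function names are all assumptions made for illustration; other bootstrap intervals exist and may behave differently.

```python
import random
from math import sqrt

def pearson_r(x, y):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    syy = sum((yi - my) ** 2 for yi in y)
    return sxy / sqrt(sxx * syy)

def bootstrap_r_ci(x, y, n_boot=2000, level=0.95, seed=0):
    """Percentile bootstrap interval for r, resampling pairs with replacement."""
    rng = random.Random(seed)
    n = len(x)
    estimates = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        estimates.append(pearson_r([x[i] for i in idx], [y[i] for i in idx]))
    estimates.sort()
    alpha = 1 - level
    lower = estimates[int((alpha / 2) * n_boot)]
    upper = estimates[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper

# Simulated sample of 30 pairs, generated only for illustration.
data_rng = random.Random(1)
x = [data_rng.gauss(0, 1) for _ in range(30)]
y = [0.6 * xi + data_rng.gauss(0, 0.8) for xi in x]

r_obs = pearson_r(x, y)
lo, hi = bootstrap_r_ci(x, y)
print(f"r = {r_obs:.3f}, 95% percentile interval = ({lo:.3f}, {hi:.3f})")
```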
Correlation
• Typically though, for a single sample, correlations among the variables should be considered descriptive statistics, and often the correlation matrix is the data set that forms the basis of an analysis
• A correlation can also be thought of as an effect size in and of itself
▫ Standardized measure of amount of covariation
▫ The strength and degree of a linear relationship between variables
▫ The amount some variable moves in standard deviation units with a 1 standard deviation change in another variable
• R² is also an effect size
▫ Amount of variability seen in y that can be explained by the variability seen in x
▫ Amount of variance they share
Biased estimate- Adjusted r
• r turns out to be upwardly biased, and the smaller the sample size, the greater the bias
▫ With large samples the difference will be negligible
• With smaller samples one should report adjusted r or R²
$$ r_{adj} = \sqrt{1 - \frac{(1 - r^2)(N - 1)}{N - 2}} $$
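A short sketch of the adjustment, comparing a small and a large hypothetical sample with the same observed r = .50. The function name `adjusted_r` is illustrative, and clamping a negative value under the square root to zero is an assumption added here to keep the function defined for very small r.

```python
from math import sqrt

def adjusted_r(r, n):
    """Adjusted r: sqrt(1 - (1 - r^2)(N - 1) / (N - 2))."""
    value = 1 - (1 - r ** 2) * (n - 1) / (n - 2)
    # Assumption: treat a negative value as zero rather than raising an error.
    return sqrt(max(value, 0.0))

# Same observed r, two hypothetical sample sizes.
small_sample = adjusted_r(0.5, 10)
large_sample = adjusted_r(0.5, 1000)
print(f"N = 10:   r_adj = {small_sample:.3f}")
print(f"N = 1000: r_adj = {large_sample:.3f}")
```

With N = 10 the adjustment is substantial; with N = 1000 it is negligible.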
Factors affecting correlation
• Linearity
• Heterogeneous subsamples
• Range restrictions
• Outliers
Linearity
•Nonlinear relationships will have an adverse effect on a measure designed to find a linear relationship
Heterogeneous subsamples
• Sub-samples may artificially increase or decrease overall r, or in a corollary to Simpson’s paradox, produce opposite sign relations for the aggregated data compared to the groups
• Solution: calculate r separately for sub-samples and overall, and look for differences
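The sign reversal can be illustrated with a small constructed example: within each hypothetical group the relationship is perfectly negative, yet the pooled data give a strong positive r. The groups and the helper name are invented for illustration.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    syy = sum((yi - my) ** 2 for yi in y)
    return sxy / sqrt(sxx * syy)

# Two hypothetical sub-samples, invented for illustration only.
group_a_x, group_a_y = [1, 2, 3], [3, 2, 1]
group_b_x, group_b_y = [6, 7, 8], [8, 7, 6]

r_a = pearson_r(group_a_x, group_a_y)
r_b = pearson_r(group_b_x, group_b_y)
r_pooled = pearson_r(group_a_x + group_b_x, group_a_y + group_b_y)

print(f"group A: {r_a:.2f}, group B: {r_b:.2f}, pooled: {r_pooled:.2f}")
```

The pooled r is driven by the difference between the group means, not by the relationship within either group.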
Range restriction
• Limiting the variability of your data can in turn limit the possibility for covariability between two variables, thus attenuating r.
• Common example occurs with Likert scales
▫ E.g. 1 - 4 vs. 1 - 9
• However, it is also the case that restricting the range can actually increase r if, by doing so, highly influential data points would be kept out (Wilcox, 2001)
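The attenuation case can be sketched with constructed data: the same x-y relationship is measured once over the full range of x and once over a narrow middle slice. The data, the cut-offs of 8 and 13, and the helper name are assumptions for illustration; the sketch shows only attenuation, not the opposite case attributed to Wilcox.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    syy = sum((yi - my) ** 2 for yi in y)
    return sxy / sqrt(sxx * syy)

# Constructed data: y follows x with a fixed alternating disturbance of +/-3.
x = list(range(1, 21))
disturbance = [3 if xi % 2 else -3 for xi in x]
y = [xi + d for xi, d in zip(x, disturbance)]

r_full = pearson_r(x, y)

# Keep only the middle of the x range (hypothetical cut-offs).
kept = [(xi, yi) for xi, yi in zip(x, y) if 8 <= xi <= 13]
r_restricted = pearson_r([p[0] for p in kept], [p[1] for p in kept])

print(f"full range: {r_full:.3f}, restricted range: {r_restricted:.3f}")
```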
Effect of Outliers
•Outliers can artificially increase or decrease r
• Options
▫ Compute r with and without outliers
▫ Conduct robustified R!
  For example, recode outliers as having more conservative scores (winsorize)
▫ Transform variables (last resort)
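The first two options might look like the sketch below, which compares r with the outlier, without it, and after winsorizing y. The data are invented, and the simple `winsorize` helper (clamping the k most extreme values in each tail) is one hypothetical way to recode outliers, not a standard library routine.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    syy = sum((yi - my) ** 2 for yi in y)
    return sxy / sqrt(sxx * syy)

def winsorize(values, k=1):
    """Clamp the k smallest and k largest values to the nearest retained value."""
    ordered = sorted(values)
    low, high = ordered[k], ordered[-k - 1]
    return [min(max(v, low), high) for v in values]

# Hypothetical data; the last y value is an obvious outlier.
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [2, 3, 3, 5, 6, 6, 8, -10]

r_with = pearson_r(x, y)
r_without = pearson_r(x[:-1], y[:-1])
r_winsorized = pearson_r(x, winsorize(y))

print(f"with outlier: {r_with:.3f}")
print(f"without outlier: {r_without:.3f}")
print(f"winsorized y: {r_winsorized:.3f}")
```

Reporting all three values shows how much a single point drives the result.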
Advantages of correlational studies
• Show the amount (strength) of relationship present
• Can be used to make predictions about the variables studied
• Often easier to collect correlational data, and interpretation is fairly straightforward.
Disadvantages of correlational studies
• Can’t assume that a cause-effect relationship exists
• Little or no control (experimental manipulation) of the variables is usually seen
• Relationships may be accidental or due to a third variable or unmeasured factor
▫ Common causes
▫ Spurious correlations and mediators