Correlation Oh yeah!. Outline Basics Visualization Covariance Significance testing and interval estimation Effect size Bias Factors affecting correlation

Correlation Oh yeah!

Outline

•Basics
•Visualization
•Covariance
•Significance testing and interval estimation
•Effect size
•Bias
•Factors affecting correlation
•Issues with correlational studies

Correlation

•Research question: What is the relationship between two variables?

•Correlation is a measure of the direction and degree of linear association between 2 variables.

•Correlation is the standardized covariance between two variables

Questions to be asked…

• Is there a linear relationship between x and y?
• What is the strength of this relationship?
  ▫ Pearson Product Moment Correlation Coefficient r
• Can we describe this relationship and use this to predict y from x?
  ▫ y = bx + a
• Is the relationship we have described statistically significant?
  ▫ Not a very interesting question if tested against a null of r = 0
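As a rough illustration of the prediction question, the sketch below fits the line y = bx + a by least squares. The function name `fit_line` and the five score pairs are invented for illustration; they are not from the slides.

```python
from statistics import mean

def fit_line(x, y):
    """Return (b, a) for the least-squares line y = b*x + a."""
    mx, my = mean(x), mean(y)
    # Sum of cross-products of deviations and sum of squared x deviations
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    b = sxy / sxx        # slope
    a = my - b * mx      # intercept
    return b, a

# Hypothetical scores, made up purely for this example
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
b, a = fit_line(x, y)
predicted_at_6 = b * 6 + a   # predicted y for a new x of 6
```

For these made-up scores the slope works out to 0.6 and the intercept to 2.2.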

Other stuff

• Check scatterplots to see whether a Pearson r makes sense
• Use both r and R² to understand the situation
• If data are non-metric or non-normal, use “non-parametric” correlations
• Correlation does not prove causation
  ▫ True relationship may be in opposite direction, co-causal, or due to other variables
• However, correlation is the primary statistic used in making an assessment of causality
  ▫ ‘Potential’ causation

Possible outcomes

•r ranges from -1 to +1
•As one variable increases/decreases, the other variable increases/decreases
  ▫Positive covariance
•As one variable increases/decreases, the other decreases/increases
  ▫Negative covariance
•No relationship (independence)
  ▫r = 0
•Non-linear relationship?

Scatterplots

• As we discussed previously, scatterplots provide a pictorial examination of the relationship between two quantitative variables
• Predictor variable on the X-axis (abscissa); criterion variable on the Y-axis (ordinate)
• Each subject is located in the scatterplot by means of a pair of scores (score on the X variable and score on the Y variable)
  ▫ Plot each pair of observations (X, Y)
  ▫ X = predictor variable (independent)
  ▫ Y = criterion variable (dependent)
• Check for linear relationship
  ▫ ‘Line of best fit’
  ▫ y = a + bx
• Check for outliers

Example of a Scatterplot

• The relationship between scores on a test of quantitative skills taken by students on the first day of a stats course (X-axis) and their combined scores on two semester exams (Y-axis)

Example of a Scatterplot

• The two variables are positively related
  ▫ As quantitative skill increases, so does performance on the two exams
• Linear relationship between the variables
  ▫ Line of best fit drawn on the graph - the ‘regression line’
• The ‘strength’ or ‘degree’ of the linear relationship is measured by a correlation coefficient, i.e. how tightly the data points cluster around the regression line
• We can use this information to determine whether the linear relationship represents a true relationship in the population or is due entirely to chance factors

What do we look for in a Scatterplot?

•Overall pattern: Ellipse
  ▫ Any striking deviations (outliers)
•Form: is it linear? (curved? clustered?)
•Direction: is it positive…
  ▫ (high values of the two variables tend to occur together)
  ▫ Or negative (high values of one variable tend to occur with low values of the other variable)?
•Strength: how close the points lie to the line of best fit (if a linear relationship)

[Figure: eight scatterplot panels showing r = 1, 0.95, 0.7, 0.4, -0.4, -0.7, -0.95, and -1]

Linear Correlation / Covariance

•How do we obtain a quantitative measure of the linear association between X and Y?

•The Pearson Product-Moment Correlation Coefficient, r, comes from the covariance statistic; it reflects the degree to which the two variables vary together

Covariance

• The variance shared by two variables
• When X and Y move in the same direction (i.e. their deviations from the mean are similarly positive or negative)
  ▫ cov(x,y) = positive
• When X and Y move in opposite directions
  ▫ cov(x,y) = negative
• When there is no consistent relationship
  ▫ cov(x,y) = 0

$$\mathrm{cov}(x, y) = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{n - 1}$$
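The covariance formula can be computed directly, as in the minimal sketch below. The data are made up for illustration, and the second call simply multiplies x by 10 to show how the covariance changes with the unit of measurement.

```python
from statistics import mean

def covariance(x, y):
    """Sample covariance: sum of cross-products of deviations over n - 1."""
    mx, my = mean(x), mean(y)
    return sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / (len(x) - 1)

# Hypothetical scores, made up purely for this example
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

cov_xy = covariance(x, y)                              # 6 / 4 = 1.5
cov_rescaled = covariance([10 * xi for xi in x], y)    # same data, x in new units
```

Rescaling x by 10 multiplies the covariance by 10 even though the relationship itself has not changed.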

Covariance

•Covariance is not easily interpreted on its own and cannot be compared across different scales of measurement

•Solution: standardize this measure
•Pearson’s r:

$$r_{xy} = \frac{\mathrm{cov}(x, y)}{s_x s_y}$$
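A small sketch of the standardization step: dividing the covariance by the two standard deviations gives r, which does not change when a variable is rescaled. The function names and the data are illustrative assumptions, not part of the slides.

```python
from statistics import mean, stdev

def covariance(x, y):
    mx, my = mean(x), mean(y)
    return sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / (len(x) - 1)

def pearson_r(x, y):
    """Pearson's r as the covariance divided by the product of the SDs."""
    return covariance(x, y) / (stdev(x) * stdev(y))

# Hypothetical scores, made up purely for this example
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

r = pearson_r(x, y)
r_rescaled = pearson_r([10 * xi for xi in x], y)   # x expressed in different units
```

Here r is about 0.77 in both cases, whereas the covariance would differ by a factor of 10.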

Significance test for correlation

•All correlations in a practical setting will be non-zero
•A significance test can be conducted in an effort to infer to a population
•Key Question: “Is the r large enough that it is unlikely to have come from a population in which the two variables are unrelated?”
•Testing the null hypothesis that
  ▫H0: ρ = 0 vs. alternative hypothesis H1: ρ ≠ 0
  ▫ρ = population product-moment correlation coefficient

Significance test for correlation

• However, with larger N, small, possibly non-meaningful, correlations can be deemed ‘significant’
• So the better question is: Is a test against zero useful?
• Tests of significance for r typically have limited utility if testing against a zero value
• Go by the size and judge worth by what is seen in the relevant literature

df (N - 2)    critical r (α = .05)
5             .67
10            .50
15            .41
20            .36
25            .32
30            .30
50            .23
200           .11
500           .07
1000          .05
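Critical values of r like those tabled above can be recovered from a t cutoff by solving the t formula for r, which gives r = t / sqrt(t² + df). The sketch below assumes the t cutoffs are read from a standard t table. The tabled values seem to match one-tailed cutoffs at α = .05; that reading is an inference from the numbers, not something the slides state.

```python
from math import sqrt

def critical_r(t_crit, df):
    """Smallest |r| that reaches the given t cutoff with df = N - 2."""
    return t_crit / sqrt(t_crit ** 2 + df)

# t cutoffs from a standard t table (one-tailed, alpha = .05); an assumption
r_df5 = critical_r(2.015, 5)     # rounds to .67
r_df10 = critical_r(1.812, 10)   # rounds to .50
```

As df grows, the cutoff shrinks toward zero, which is the point of the slide: with a large N almost any r is ‘significant’.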

Significance test for correlation

• Furthermore, the approaches outlined in Howell, while standard, are really not necessary
• Using the t-distribution as described, we would only really be able to test a null hypothesis of zero
• If we want to test against some specific value, we have to convert r in some odd fashion and test using these new values
  ▫ Fisher transformation

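$$t = \frac{r\sqrt{N - 2}}{\sqrt{1 - r^2}}, \qquad df = N - 2$$

$$r' = (0.5)\,\log_e\left|\frac{1 + r}{1 - r}\right|, \qquad se_{r'} = \frac{1}{\sqrt{N - 3}}$$

$$z = \frac{r' - \rho'}{\sqrt{\frac{1}{N - 3}}}$$

$$CI(\rho') = r' \pm z_{cv}\,\frac{1}{\sqrt{N - 3}}$$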
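The t statistic and the Fisher-transformed interval can be sketched in a few lines. This is a minimal illustration under two assumptions: `math.atanh` is used because it equals 0.5·ln((1 + r)/(1 - r)) for |r| < 1, and the limits are back-transformed to the r scale with `tanh`, a step the slides do not show. The values r = 0.5 and N = 30 are made up.

```python
from math import atanh, tanh, sqrt
from statistics import NormalDist

def t_statistic(r, n):
    """t for testing H0: rho = 0, with df = n - 2."""
    return r * sqrt(n - 2) / sqrt(1 - r ** 2)

def fisher_ci(r, n, level=0.95):
    """Confidence interval for rho via the Fisher transformation."""
    r_prime = atanh(r)                     # 0.5 * ln((1 + r) / (1 - r))
    se = 1 / sqrt(n - 3)                   # standard error of r'
    z_cv = NormalDist().inv_cdf(1 - (1 - level) / 2)
    lo, hi = r_prime - z_cv * se, r_prime + z_cv * se
    return tanh(lo), tanh(hi)              # back to the r scale (assumption)

t = t_statistic(0.5, 30)       # roughly 3.06
lo, hi = fisher_ci(0.5, 30)    # roughly (0.17, 0.73)
```

Note that the interval is not symmetric around 0.5, which reflects the skew in the sampling distribution of r.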

Test of the difference between two rs

• While those new values create an r′ that approximates a normal distribution, why do we have to do it?
• The reason for this transformation is that, since r has limits of ±1, the larger the absolute value of r, the more skewed its sampling distribution about the population value ρ (rho)

Sampling distribution of a correlation

• Via the bootstrap, we can see for ourselves that the sampling distribution becomes more and more skewed as we deviate from a null value of zero

The better approach

•Nowadays, we can bootstrap the r or the difference between two rs and do hypothesis tests without unnecessary (and most likely problematic) transformations and assumptions
  ▫ Even for small samples of about 30 it performs as well as the transformation in ideal situations (Efron, 1988)
•Furthermore, it can be applied to other correlation metrics.
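One simple way to bootstrap r is the percentile interval sketched below: resample the (x, y) pairs with replacement, recompute r each time, and read the interval off the sorted replicates. The simulated data, the seeds, the number of replicates, and the choice of the percentile method are all assumptions for illustration; the slides do not specify a particular bootstrap variant.

```python
import random
from statistics import mean, stdev

def pearson_r(x, y):
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / (len(x) - 1)
    return cov / (stdev(x) * stdev(y))

def bootstrap_r(x, y, n_boot=2000, level=0.95, seed=1):
    """Percentile bootstrap interval for r; also returns the replicates."""
    rng = random.Random(seed)
    n = len(x)
    reps = []
    for _ in range(n_boot):
        # Resample whole (x, y) pairs so the pairing is preserved
        idx = [rng.randrange(n) for _ in range(n)]
        reps.append(pearson_r([x[i] for i in idx], [y[i] for i in idx]))
    reps.sort()
    lo = reps[int(n_boot * (1 - level) / 2)]
    hi = reps[int(n_boot * (1 + level) / 2) - 1]
    return lo, hi, reps

# Simulated sample of 30 pairs (illustrative only)
rng = random.Random(7)
x = [rng.gauss(0, 1) for _ in range(30)]
y = [0.8 * xi + 0.6 * rng.gauss(0, 1) for xi in x]

r_obs = pearson_r(x, y)
lo, hi, reps = bootstrap_r(x, y)
```

The sorted replicates in `reps` approximate the sampling distribution of r, so plotting them as a histogram is a quick way to look at its skew.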

Correlation

• Typically though, for a single sample, correlations among the variables should be considered descriptive statistics, and often the correlation matrix is the data set that forms the basis of an analysis
• A correlation can also be thought of as an effect size in and of itself
  ▫ Standardized measure of amount of covariation
  ▫ The strength and degree of a linear relationship between variables
  ▫ The amount some variable moves in standard deviation units with a 1 standard deviation change in another variable
• R² is also an effect size
  ▫ Amount of variability seen in y that can be explained by the variability seen in x
  ▫ Amount of variance they share
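The "standard deviation units" reading can be checked numerically: after converting both variables to z-scores, the least-squares slope equals r, and its square is R². A minimal sketch with made-up data follows.

```python
from statistics import mean, stdev

def zscores(v):
    """Convert raw scores to standard deviation units."""
    m, s = mean(v), stdev(v)
    return [(vi - m) / s for vi in v]

def slope(x, y):
    """Least-squares slope of y on x."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx

# Hypothetical scores, made up purely for this example
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

std_slope = slope(zscores(x), zscores(y))   # equals Pearson's r
r_squared = std_slope ** 2                  # proportion of shared variance
```

For these data the standardized slope is about 0.77, so a 1 SD change in x goes with a 0.77 SD change in predicted y, and R² is 0.6.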

Biased estimate - Adjusted r

•r turns out to be upwardly biased, and the smaller the sample size, the greater the bias
  ▫With large samples the difference will be negligible
•With smaller samples one should report adjusted r or R²

$$r_{adj} = \sqrt{1 - \frac{(1 - r^2)(N - 1)}{N - 2}}$$
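A quick sketch of the adjustment shows how much it matters at different sample sizes. Clamping a negative value under the square root to zero is an assumption added here so the function never fails for small r and N. The function returns a magnitude only, so the sign of r has to be carried over separately.

```python
from math import sqrt

def adjusted_r(r, n):
    """Adjusted r: sqrt(1 - (1 - r^2)(N - 1) / (N - 2)), returned as a magnitude."""
    adj_sq = 1 - (1 - r ** 2) * (n - 1) / (n - 2)
    return sqrt(max(adj_sq, 0.0))   # clamp at zero (assumption, not from the slides)

small_sample = adjusted_r(0.5, 10)     # about 0.40
large_sample = adjusted_r(0.5, 1000)   # about 0.50
```

With N = 10 an observed r of .5 shrinks to roughly .40, while with N = 1000 it is essentially unchanged.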

Factors affecting correlation

•Linearity
•Heterogeneous subsamples
•Range restrictions
•Outliers

Linearity

•Nonlinear relationships will have an adverse effect on a measure designed to find a linear relationship

Heterogeneous subsamples

• Sub-samples may artificially increase or decrease overall r, or, in a corollary to Simpson’s paradox, produce opposite sign relations for the aggregated data compared to the groups
• Solution - calculate r separately for sub-samples & overall, look for differences
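The sign reversal can be reproduced with a tiny made-up data set: two groups that each show a positive relationship but sit at different levels on both variables. The group values below are invented purely to illustrate the effect.

```python
from statistics import mean, stdev

def pearson_r(x, y):
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / (len(x) - 1)
    return cov / (stdev(x) * stdev(y))

# Hypothetical sub-samples: group A is low on x and high on y, group B the reverse
xa, ya = [1, 2, 3, 4], [6, 8, 7, 9]
xb, yb = [6, 7, 8, 9], [1, 3, 2, 4]

r_a = pearson_r(xa, ya)                 # within group A: 0.8
r_b = pearson_r(xb, yb)                 # within group B: 0.8
r_all = pearson_r(xa + xb, ya + yb)     # groups pooled: -0.7
```

Computing r separately and overall, as the slide suggests, is what reveals the discrepancy.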

Heterogeneous subsamples

Range restriction

•Limiting the variability of your data can in turn limit the possibility for covariability between two variables, thus attenuating r.
•Common example occurs with Likert scales
  ▫E.g. 1 - 4 vs. 1 - 9
•However, it is also the case that restricting the range can actually increase r if, by doing so, highly influential data points would be kept out
  ▫Wilcox 2001
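The attenuating case can be seen in a small simulation. The sketch below generates data with a population correlation of 0.6 and then keeps only cases with above-average x. The sample size, seed, and cutoff are arbitrary choices made for illustration.

```python
import random
from statistics import mean, stdev

def pearson_r(x, y):
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / (len(x) - 1)
    return cov / (stdev(x) * stdev(y))

rng = random.Random(42)
x = [rng.gauss(0, 1) for _ in range(5000)]
# y built so that the population correlation with x is 0.6
y = [0.6 * xi + 0.8 * rng.gauss(0, 1) for xi in x]

r_full = pearson_r(x, y)

# Restrict the range: keep only cases in the upper half of x
kept = [(xi, yi) for xi, yi in zip(x, y) if xi > 0]
r_restricted = pearson_r([p[0] for p in kept], [p[1] for p in kept])
```

With the full range r lands near 0.6, while the restricted sample gives a noticeably smaller value.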

Effect of Outliers

•Outliers can artificially increase or decrease r

•Options
  ▫Compute r with and without outliers
  ▫Conduct robustified R!
    For example, recode outliers as having more conservative scores (winsorize)
  ▫Transform variables (last resort)
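The first two options can be compared on a small made-up data set with one extreme y value. The `winsorize` helper below is a simple hypothetical version that pulls the k lowest and k highest scores in to their nearest remaining neighbours; applying it to y only is an illustrative choice.

```python
from statistics import mean, stdev

def pearson_r(x, y):
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / (len(x) - 1)
    return cov / (stdev(x) * stdev(y))

def winsorize(v, k=1):
    """Replace the k smallest and k largest values with their nearest neighbours."""
    s = sorted(v)
    lo, hi = s[k], s[-k - 1]
    return [min(max(vi, lo), hi) for vi in v]

# Hypothetical data: the last case has an extreme y score
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
y = [2, 1, 4, 3, 6, 5, 8, 7, 10, -15]

r_with = pearson_r(x, y)                   # outlier included
r_without = pearson_r(x[:-1], y[:-1])      # outlier dropped
r_winsorized = pearson_r(x, winsorize(y))  # outlier recoded to a milder score
```

In this example the single outlier is enough to turn a strong positive r into a slightly negative one; winsorizing recovers a positive but more conservative value.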

Advantages of correlational studies

•Show the amount (strength) of relationship present
•Can be used to make predictions about the variables studied
•Often easier to collect correlational data, and interpretation is fairly straightforward.

Disadvantages of correlational studies

•Can’t assume that a cause-effect relationship exists
•Little or no control (experimental manipulation) of the variables is usually seen
•Relationships may be accidental or due to a third variable or unmeasured factor
  ▫Common causes
  ▫Spurious correlations and mediators
