
Principles of Biostatistics Chapter 17 Correlation


Page 1: Principles of Biostatistics Chapter 17     Correlation

Principles of Biostatistics

Chapter 17 Correlation

宇传华 http://statdtedm.6to23.com

Free Online Statistics Resources (8)

Page 2: Principles of Biostatistics Chapter 17     Correlation

Terminology

scatter plot: 散点图
correlation: 相关
linear correlation: 直线相关
correlation coefficient: 相关系数
Pearson's correlation coefficient: Pearson 相关系数
Spearman's rank correlation coefficient: Spearman 等级相关系数

Page 3: Principles of Biostatistics Chapter 17     Correlation

CONTENTS

§17.1 The Two-Way Scatter Plot

§17.2 Pearson’s Correlation Coefficient: r

§17.3 Spearman’s Correlation Coefficient: rs

§17.4 Further Application

Page 4: Principles of Biostatistics Chapter 17     Correlation

Correlation (coefficient)

The correlation between two random variables, X and Y, is a measure (指标) of the degree of linear association between the two variables.

The population correlation is denoted by the Greek letter ρ (rho).

The sample correlation is denoted by the Latin letter r. It can take on any value from -1 to 1.

r = -1 indicates a perfect negative linear relationship
r = 1 indicates a perfect positive linear relationship
r = 0 indicates no linear relationship

The absolute value of r indicates the strength (强度) of the relationship.

-1 < r < 0 indicates a negative linear relationship
0 < r < 1 indicates a positive linear relationship

The sign of r indicates the direction (方向) of the relationship.

Page 5: Principles of Biostatistics Chapter 17     Correlation

Before we conduct a correlation analysis, we should always create a two-way scatter plot (scatter diagram).

X variable: horizontal axis. Y variable: vertical axis. Each point on the graph represents one pair of values (Xi, Yi).

From the scatter plot, we can often determine whether a linear relationship exists between X and Y.

One statistical technique often employed to measure such an association is known as correlation analysis.

Page 6: Principles of Biostatistics Chapter 17     Correlation

§17.1 The Two-Way Scatter Plot

Table: Relationship between thrombin concentration (X) and clotting time (Y)

Subject   Concentration of thrombin (Xi)   Clotting time (Yi)
1         1.1                              14
2         1.0                              15
3         0.9                              15
4         1.2                              13
5         0.6                              17
6         1.0                              14
7         0.9                              16

Page 7: Principles of Biostatistics Chapter 17     Correlation

Scatter Plot

Page 8: Principles of Biostatistics Chapter 17     Correlation

[Figure: six scatter plot panels]
Perfect positive correlation, r = 1
Strong positive correlation, r = 0.99
Positive correlation, r = 0.80
Strong negative correlation, r = -0.98
No correlation, r = 0.00
Non-linear correlation

Page 9: Principles of Biostatistics Chapter 17     Correlation

The importance of a scatter plot

In the next chapter (simple linear regression), we also need a scatter plot to check whether the relationship between X and Y is linear and, if so, whether it is positive or negative.

So, before any correlation or regression analysis, we should make a scatter plot.

Page 10: Principles of Biostatistics Chapter 17     Correlation

§17.2 Pearson's Correlation Coefficient (r)

Synonyms: product moment (积矩) correlation coefficient; simple linear (简单线性) correlation coefficient

Definition:

r is a statistical index that describes the intensity (strength) and the direction of the association between two variables (X, Y).

r is a dimensionless number (无量纲数): it has no units of measurement, and -1 ≤ r ≤ 1.

Page 11: Principles of Biostatistics Chapter 17     Correlation

X, Y: random variables that follow a bivariate normal distribution.

Both Xi and Yi are measured on the same (i-th) subject.

Page 12: Principles of Biostatistics Chapter 17     Correlation

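How do we calculate r?

$$
r = \frac{\sum_{i=1}^{n}(X_i-\bar{X})(Y_i-\bar{Y})}{\sqrt{\sum_{i=1}^{n}(X_i-\bar{X})^2\,\sum_{i=1}^{n}(Y_i-\bar{Y})^2}} = \frac{l_{XY}}{\sqrt{l_{XX}\,l_{YY}}}
$$

Sum of squares of deviations of X from its mean:

$$
l_{XX} = \sum_{i=1}^{n}(X_i-\bar{X})^2 = \sum_{i=1}^{n}X_i^2 - \frac{\left(\sum_{i=1}^{n}X_i\right)^2}{n}
$$

Sum of squares of deviations of Y from its mean:

$$
l_{YY} = \sum_{i=1}^{n}(Y_i-\bar{Y})^2 = \sum_{i=1}^{n}Y_i^2 - \frac{\left(\sum_{i=1}^{n}Y_i\right)^2}{n}
$$

Sum of cross products of the deviations of X from its mean and the deviations of Y from its mean:

$$
l_{XY} = \sum_{i=1}^{n}(X_i-\bar{X})(Y_i-\bar{Y}) = \sum_{i=1}^{n}X_iY_i - \frac{\left(\sum_{i=1}^{n}X_i\right)\left(\sum_{i=1}^{n}Y_i\right)}{n}
$$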

Page 13: Principles of Biostatistics Chapter 17     Correlation

Subject i   Concentration of thrombin x (u/ml)   Clotting time y (second)   x^2     y^2    x×y
1           1.1                                  14                         1.21    196    15.4
2           1.2                                  13                         1.44    169    15.6
3           1.0                                  15                         1.00    225    15.0
4           0.9                                  15                         0.81    225    13.5
5           1.2                                  13                         1.44    169    15.6
6           1.1                                  14                         1.21    196    15.4
7           0.9                                  16                         0.81    256    14.4
8           0.6                                  17                         0.36    289    10.2
9           1.0                                  14                         1.00    196    14.0
10          0.9                                  16                         0.81    256    14.4
11          1.1                                  15                         1.21    225    16.5
12          0.9                                  16                         0.81    256    14.4
13          1.1                                  14                         1.21    196    15.4
14          1.0                                  15                         1.00    225    15.0
15          0.7                                  17                         0.49    289    11.9

Sum         Σx = 14.7                            Σy = 224                   Σx^2 = 14.81   Σy^2 = 3368   Σxy = 216.7

Page 14: Principles of Biostatistics Chapter 17     Correlation

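lXX = 0.404, lYY = 22.933, lXY = -2.82

2) Calculation of r

$$
r = \frac{\sum(X_i-\bar{X})(Y_i-\bar{Y})}{\sqrt{\sum(X_i-\bar{X})^2\,\sum(Y_i-\bar{Y})^2}} = \frac{l_{XY}}{\sqrt{l_{XX}\,l_{YY}}} = \frac{-2.82}{\sqrt{0.404 \times 22.933}} = -0.926
$$

X, Y: strong negative relationship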
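The arithmetic on this slide can be checked with a short script. The following is a minimal Python sketch, not part of the original slides (which use SAS); the function names `sums_of_squares` and `pearson_r` are illustrative. It applies the computational formulas for lXX, lYY and lXY to the 15 thrombin/clotting-time pairs from the table.

```python
from math import sqrt

# Thrombin concentration (u/ml) and clotting time (seconds) for the 15 subjects
x = [1.1, 1.2, 1.0, 0.9, 1.2, 1.1, 0.9, 0.6, 1.0, 0.9, 1.1, 0.9, 1.1, 1.0, 0.7]
y = [14, 13, 15, 15, 13, 14, 16, 17, 14, 16, 15, 16, 14, 15, 17]


def sums_of_squares(x, y):
    """Return (l_xx, l_yy, l_xy) using the computational formulas."""
    n = len(x)
    sx, sy = sum(x), sum(y)
    l_xx = sum(v * v for v in x) - sx * sx / n
    l_yy = sum(v * v for v in y) - sy * sy / n
    l_xy = sum(a * b for a, b in zip(x, y)) - sx * sy / n
    return l_xx, l_yy, l_xy


def pearson_r(x, y):
    """Pearson's r = l_xy / sqrt(l_xx * l_yy)."""
    l_xx, l_yy, l_xy = sums_of_squares(x, y)
    return l_xy / sqrt(l_xx * l_yy)


l_xx, l_yy, l_xy = sums_of_squares(x, y)
r = pearson_r(x, y)
print(f"l_XX={l_xx:.3f}  l_YY={l_yy:.3f}  l_XY={l_xy:.2f}  r={r:.3f}")
```

The printed values agree with the slide: lXX = 0.404, lYY = 22.933, lXY = -2.82 and r = -0.926.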

Page 15: Principles of Biostatistics Chapter 17     Correlation

Inference about the correlation coefficient r: hypothesis test

1) Establish the hypotheses and determine the significance level α

H0: ρ = 0 (no linear association between X and Y)

H1: ρ ≠ 0 (a linear association between X and Y exists)

α = 0.05 (two-sided probability of a type I error)

Page 16: Principles of Biostatistics Chapter 17     Correlation

2) Calculate the test statistic

$$
t_r = \frac{r-0}{S_r} = \frac{r}{\sqrt{(1-r^2)/(n-2)}}, \qquad \nu = n-2
$$

where $S_r = \sqrt{(1-r^2)/(n-2)}$ is the standard error of r.

For the above example, ν = 15 - 2 = 13:

$$
t_r = \frac{-0.926}{\sqrt{(1-0.926^2)/(15-2)}} = -8.874 \sim t_{(13)}
$$

From the t distribution table (Table A4, Appendix), the critical value is t0.05/2(13) = 2.160 < |t| = 8.874, so P < 0.05.

The correlation coefficient is statistically significant at α = 0.05. Concentration of thrombin and clotting time are negatively related.
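As a check on the test statistic, here is a small Python sketch (an illustration added here, not the slides' own code). It starts from the column totals of the thrombin table so that r is not rounded before it enters the formula; the critical value 2.160 is the one quoted from the t table in the text, and `t_statistic` is an illustrative name.

```python
from math import sqrt

# Column totals from the worked thrombin example (n = 15 subjects)
n = 15
sum_x, sum_y = 14.7, 224
sum_x2, sum_y2, sum_xy = 14.81, 3368, 216.7

l_xx = sum_x2 - sum_x ** 2 / n
l_yy = sum_y2 - sum_y ** 2 / n
l_xy = sum_xy - sum_x * sum_y / n
r = l_xy / sqrt(l_xx * l_yy)


def t_statistic(r, n):
    """t = r / S_r with S_r = sqrt((1 - r^2) / (n - 2)); degrees of freedom = n - 2."""
    s_r = sqrt((1 - r * r) / (n - 2))
    return r / s_r


t = t_statistic(r, n)
T_CRIT = 2.160  # two-sided critical value t_{0.05/2}(13), taken from the text
reject_h0 = abs(t) > T_CRIT
print(f"r={r:.3f}  t={t:.3f}  df={n - 2}  reject H0: {reject_h0}")
```

With unrounded r the script gives t = -8.874. Plugging in the rounded value r = -0.926 gives about -8.84, so small differences in the last digits come from rounding.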

Page 17: Principles of Biostatistics Chapter 17     Correlation

§17.3 Spearman's Rank Correlation Coefficient: rs

Rank: 秩, 等级

Spearman's rank correlation (a nonparametric method) is applied if the two variables are distributed far from normal, i.e. the normality requirement is not satisfied.

Page 18: Principles of Biostatistics Chapter 17     Correlation

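The steps of the hypothesis test

Rank the values of each of the two variables according to magnitude: (Xi, Yi) → (Xir, Yir)

Calculate Spearman's rank correlation coefficient based on the ranks:

$$
r_s = \frac{l_{X_{ir}Y_{ir}}}{\sqrt{l_{X_{ir}X_{ir}}\,l_{Y_{ir}Y_{ir}}}}
$$

If there are no tied ranks, then

$$
r_s = 1 - \frac{6\sum_{i=1}^{n} d_i^2}{n(n^2-1)}
$$

where $d_i = X_{ir} - Y_{ir}$ is the difference between the two ranks of subject i.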

Page 19: Principles of Biostatistics Chapter 17     Correlation

Table: Hemorrhage degree and platelet count (10^9/L) in 12 children with acute leukemia

Patient i   Platelet Xi   Rank Xir   (Xir)^2   Bleeding Yi   Rank Yir   (Yir)^2   Xir × Yir
(1)         (2)           (3)        (4)       (5)           (6)        (7)       (8)
1           121           1          1         +++           11.5       132.25    11.5
2           138           2          4         ++            9.0        81.00     18.0
3           165           3          9         +             7.0        49.00     21.0
4           310           4          16        –             3.5        12.25     14.0
5           426           5          25        ++            9.0        81.00     45.0
6           540           6          36        ++            9.0        81.00     54.0
7           740           7          49        –             3.5        12.25     24.5
8           1060          8          64        –             3.5        12.25     28.0
9           1260          9          81        –             3.5        12.25     31.5
10          1290          10         100       –             3.5        12.25     35.0
11          1438          11         121       +++           11.5       132.25    126.5
12          2004          12         144       –             3.5        12.25     42.0

Total                     78         650                     78         630       451

For tied (equal) ranks, the mean rank is used instead. For the six "–" grades, mean rank = (1+2+3+4+5+6)/6 = 3.5.
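The mean-rank rule for ties can be sketched in Python as follows. This is an illustration under stated assumptions, not the slides' method: the helper name `average_ranks` and the numeric coding of the bleeding grades (0 to 3, with an ASCII "-" standing for the negative grade) are hypothetical, and only the ordering of the grades matters.

```python
def average_ranks(values):
    """Rank values from smallest to largest; tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next sorted value equals the current one
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + 1 + end + 1) / 2  # sorted positions are 1-based
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks


# Hypothetical numeric coding of the bleeding grades; "-" is the negative grade
GRADE = {"-": 0, "+": 1, "++": 2, "+++": 3}
bleeding = ["+++", "++", "+", "-", "++", "++", "-", "-", "-", "-", "+++", "-"]
platelet = [121, 138, 165, 310, 426, 540, 740, 1060, 1260, 1290, 1438, 2004]

x_rank = average_ranks(platelet)
y_rank = average_ranks([GRADE[g] for g in bleeding])
print(x_rank)
print(y_rank)
```

The six negative grades share rank 3.5, the three "++" grades share rank 9 and the two "+++" grades share rank 11.5, as in column (6) of the table.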

Page 20: Principles of Biostatistics Chapter 17     Correlation

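Calculation of rs (numerical values are from the table above)

Column totals: ΣXir = 78, Σ(Xir)^2 = 650, ΣYir = 78, Σ(Yir)^2 = 630, Σ(Xir × Yir) = 451, n = 12

$$
l_{X_{ir}X_{ir}} = \sum X_{ir}^2 - \frac{\left(\sum X_{ir}\right)^2}{n} = 650 - \frac{78^2}{12} = 143
$$

$$
l_{Y_{ir}Y_{ir}} = \sum Y_{ir}^2 - \frac{\left(\sum Y_{ir}\right)^2}{n} = 630 - \frac{78^2}{12} = 123
$$

$$
l_{X_{ir}Y_{ir}} = \sum X_{ir}Y_{ir} - \frac{\left(\sum X_{ir}\right)\left(\sum Y_{ir}\right)}{n} = 451 - \frac{(78)(78)}{12} = -56
$$

$$
r_s = \frac{l_{X_{ir}Y_{ir}}}{\sqrt{l_{X_{ir}X_{ir}}\,l_{Y_{ir}Y_{ir}}}} = \frac{-56}{\sqrt{143 \times 123}} = -0.422
$$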
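The same arithmetic can be reproduced from the column totals alone. This is a minimal Python sketch with illustrative variable names; the only inputs are the totals 78, 650, 78, 630 and 451 from the table and n = 12.

```python
from math import sqrt

n = 12
sum_x, sum_x2 = 78, 650   # totals of the X ranks and of their squares
sum_y, sum_y2 = 78, 630   # totals of the Y ranks and of their squares
sum_xy = 451              # total of the rank cross products

l_xx = sum_x2 - sum_x ** 2 / n
l_yy = sum_y2 - sum_y ** 2 / n
l_xy = sum_xy - sum_x * sum_y / n
r_s = l_xy / sqrt(l_xx * l_yy)
print(l_xx, l_yy, l_xy, round(r_s, 3))
```

Note that the cross-product term is negative (451 - 507 = -56), so rs is negative: higher platelet ranks go with lower bleeding ranks.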

Page 21: Principles of Biostatistics Chapter 17     Correlation

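From the general formula:

$$
r_s = \frac{l_{X_{ir}Y_{ir}}}{\sqrt{l_{X_{ir}X_{ir}}\,l_{Y_{ir}Y_{ir}}}} = -0.422
$$

From the shortcut formula:

$$
r_s = 1 - \frac{6\sum_{i=1}^{n} d_i^2}{n(n^2-1)} = 1 - \frac{6 \times 378}{12(12^2-1)} = -0.322
$$

Because there are tied ranks in Y, we cannot use the latter formula.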
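To see how far the shortcut formula drifts when ranks are tied, both formulas can be evaluated on the rank columns of the leukemia table. This is a hedged Python sketch; `rank_correlation` is an illustrative helper name, and the two rank lists are copied from columns (3) and (6) of that table.

```python
from math import sqrt

x_rank = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]
y_rank = [11.5, 9, 7, 3.5, 9, 9, 3.5, 3.5, 3.5, 3.5, 11.5, 3.5]
n = len(x_rank)

# Shortcut formula: only valid when there are no tied ranks
sum_d2 = sum((a - b) ** 2 for a, b in zip(x_rank, y_rank))
r_shortcut = 1 - 6 * sum_d2 / (n * (n ** 2 - 1))


def rank_correlation(u, v):
    """General formula: Pearson's r computed on the ranks."""
    m = len(u)
    l_uu = sum(a * a for a in u) - sum(u) ** 2 / m
    l_vv = sum(b * b for b in v) - sum(v) ** 2 / m
    l_uv = sum(a * b for a, b in zip(u, v)) - sum(u) * sum(v) / m
    return l_uv / sqrt(l_uu * l_vv)


r_general = rank_correlation(x_rank, y_rank)
print(f"sum d^2 = {sum_d2}, shortcut = {r_shortcut:.3f}, general = {r_general:.3f}")
```

The shortcut gives -0.322 while the general formula gives -0.422; with 11 of the 12 Y ranks tied, the gap is too large to ignore.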

Page 22: Principles of Biostatistics Chapter 17     Correlation

Explanation of Spearman's rank correlation coefficient: rs

(1) -1 ≤ rs ≤ 1, with a meaning similar to that of r.

(2) Difference between rs and r: in general rs ≠ r, because rs is calculated from the ranks, whereas r is calculated from the original data values.

Page 23: Principles of Biostatistics Chapter 17     Correlation

Statistical inference about rs

1) Set up the hypotheses and determine the significance level

H0: ρs = 0; H1: ρs ≠ 0; α = 0.05

2) Calculate the test statistic

$$
t_{r_s} = \frac{r_s-0}{S_{r_s}} = \frac{r_s}{\sqrt{(1-r_s^2)/(n-2)}} = \frac{-0.422}{\sqrt{\left[1-(-0.422)^2\right]/(12-2)}} = -1.47
$$

|t| = 1.47 < t0.05/2(10) = 2.228, so P > 0.05 and we fail to reject H0.

3) Conclusion: there is not sufficient evidence of an association between platelet count (血小板) and bleeding (出血).

Page 24: Principles of Biostatistics Chapter 17     Correlation

Notices in application

1. r=0 does not mean no correlation (might be non-linear correlation)

[Figure: three scatter plots of Y against X, each labeled H0: ρ = 0]
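A small numerical illustration of this point, using hypothetical data rather than anything from the textbook: Y is completely determined by X through Y = X², yet Pearson's r is zero because the relationship is not linear. The helper name `pearson_r` is illustrative.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson's r from deviations about the means."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    l_xy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    l_xx = sum((a - mx) ** 2 for a in x)
    l_yy = sum((b - my) ** 2 for b in y)
    return l_xy / sqrt(l_xx * l_yy)


# Hypothetical data: a perfect quadratic (non-linear) relationship
x = [-3, -2, -1, 0, 1, 2, 3]
y = [v * v for v in x]

r = pearson_r(x, y)
print(f"r = {r:.3f}")
```

The positive and negative cross products cancel exactly, so r = 0 even though knowing X tells us Y exactly. This is why a scatter plot should be inspected before r is interpreted.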

Page 25: Principles of Biostatistics Chapter 17     Correlation

Notices in application

2. When the levels of either variable X or Y are artificially selected, Pearson's correlation analysis is not suitable (but Spearman's rank correlation analysis can be used).

Pearson's correlation analysis requires that both X and Y follow a normal distribution.

Page 26: Principles of Biostatistics Chapter 17     Correlation

Notices in application

3. Outliers can heavily affect the correlation coefficient.
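The effect can be sketched with a small hypothetical data set (invented for illustration, not taken from the textbook): five points with no linear association at all, to which a single extreme point is added. The helper name `pearson_r` is illustrative.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson's r via the computational formulas for l_xx, l_yy and l_xy."""
    n = len(x)
    l_xx = sum(a * a for a in x) - sum(x) ** 2 / n
    l_yy = sum(b * b for b in y) - sum(y) ** 2 / n
    l_xy = sum(a * b for a, b in zip(x, y)) - sum(x) * sum(y) / n
    return l_xy / sqrt(l_xx * l_yy)


# Hypothetical data with r = 0
x = [1, 2, 3, 4, 5]
y = [2, 4, 3, 4, 2]
r_without = pearson_r(x, y)

# The same data plus one outlier at (20, 20)
r_with = pearson_r(x + [20], y + [20])

print(f"without outlier: r = {r_without:.3f}")
print(f"with outlier:    r = {r_with:.3f}")
```

One added point moves r from 0 to about 0.97, which is why the scatter plot should be checked for outliers before r is reported.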

Page 27: Principles of Biostatistics Chapter 17     Correlation

Notices in application

4. Correlation ≠ cause-effect association (因果联系); correlation ≠ intrinsic association (固有联系).

5. The difference between statistical significance (P value) and intensity of correlation (absolute value of r):

Statistical significance of the correlation coefficient: the probability of obtaining such an r from a population with ρ = 0 is small (the P value is small).

Intensity of correlation: the absolute value of r.
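The distinction can be made concrete with two hypothetical cases, evaluated with the t statistic t = r / sqrt((1 - r²)/(n - 2)) used in this chapter. In this Python sketch the (r, n) pairs are invented for illustration, and the two-sided 5% critical values are taken from a standard t table (1.962 for 998 degrees of freedom, 2.447 for 6).

```python
from math import sqrt


def t_statistic(r, n):
    """t = r / sqrt((1 - r^2) / (n - 2)), with n - 2 degrees of freedom."""
    return r / sqrt((1 - r * r) / (n - 2))


# (r, n, two-sided 5% critical value for n - 2 degrees of freedom)
cases = [
    (0.10, 1000, 1.962),  # weak correlation, large sample
    (0.60, 8, 2.447),     # moderate correlation, small sample
]

results = []
for r, n, t_crit in cases:
    t = t_statistic(r, n)
    significant = abs(t) > t_crit
    results.append((r, n, t, significant))
    print(f"r={r:.2f}  n={n:4d}  t={t:.3f}  significant at 0.05: {significant}")
```

The weak correlation (r = 0.10) is statistically significant because n is large, while the stronger correlation (r = 0.60) is not significant with only 8 subjects. A small P value therefore says nothing by itself about how strong the association is.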

Page 28: Principles of Biostatistics Chapter 17     Correlation

§17.4 Further Application

SAS code for the textbook's Table 17.1 and Table 17.2

/* Read the 20 (X, Y) pairs */
DATA EXP17_12;
INPUT X Y;
CARDS;
77 118
69 65
32 184
85 8
94 43
99 12
89 55
13 208
95 7
95 9
54 9
89 124
95 10
87 6
91 33
98 16
73 32
47 145
76 87
90 9
;
/* Pearson and Spearman correlation coefficients with P values */
PROC CORR PEARSON SPEARMAN;
VAR X Y;
RUN;

Page 29: Principles of Biostatistics Chapter 17     Correlation

The CORR Procedure

2 Variables: X Y

Simple Statistics

Variable    N     Mean        Std Dev     Median      Minimum     Maximum
X           20    77.40000    23.65409    88.00000    13.00000     99.00000
Y           20    59.00000    63.86581    32.50000     6.00000    208.00000

Pearson Correlation Coefficients, N = 20
Prob > |r| under H0: Rho=0

        X           Y
X       1.00000     -0.79107
                    <.0001
Y       -0.79107    1.00000
        <.0001

Spearman Correlation Coefficients, N = 20
Prob > |r| under H0: Rho=0

        X           Y
X       1.00000     -0.54319
                    0.0133
Y       -0.54319    1.00000
        0.0133
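For readers without SAS, the two coefficients in this output can be cross-checked with a few lines of Python. This is a sketch using only the standard library, not an equivalent of PROC CORR (it does not compute P values); the helper names are illustrative, and the data are the 20 pairs listed after CARDS in the EXP17_12 data step.

```python
from math import sqrt

# (X, Y) pairs from the DATA step EXP17_12
data = [
    (77, 118), (69, 65), (32, 184), (85, 8), (94, 43), (99, 12), (89, 55),
    (13, 208), (95, 7), (95, 9), (54, 9), (89, 124), (95, 10), (87, 6),
    (91, 33), (98, 16), (73, 32), (47, 145), (76, 87), (90, 9),
]
x = [p[0] for p in data]
y = [p[1] for p in data]


def pearson_r(u, v):
    """Pearson's r via the computational formulas."""
    n = len(u)
    l_uu = sum(a * a for a in u) - sum(u) ** 2 / n
    l_vv = sum(b * b for b in v) - sum(v) ** 2 / n
    l_uv = sum(a * b for a, b in zip(u, v)) - sum(u) * sum(v) / n
    return l_uv / sqrt(l_uu * l_vv)


def average_ranks(values):
    """Mean rank for ties: first sorted position plus (tie count + 1) / 2."""
    s = sorted(values)
    return [s.index(v) + (s.count(v) + 1) / 2 for v in values]


def spearman_r(u, v):
    """Spearman's r_s: Pearson's r computed on the (tie-averaged) ranks."""
    return pearson_r(average_ranks(u), average_ranks(v))


r = pearson_r(x, y)
r_s = spearman_r(x, y)
print(f"Pearson r = {r:.5f}, Spearman r_s = {r_s:.5f}")
```

The results agree with the SAS output (-0.79107 and -0.54319). Spearman's coefficient is computed with the general formula on mean ranks, since both X and Y contain ties.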

Page 30: Principles of Biostatistics Chapter 17     Correlation

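SAS code for the textbook's Table 17.3 and Table 17.4

/* Read the 20 (X, Y) pairs */
DATA EXP17_34;
INPUT X Y;
CARDS;
5 600
100 3
98 67
84 170
100 6
99 15
70 120
50 170
26 300
6 830
100 10
37 800
35 500
96 60
55 100
90 10
96 5
99 5
99 8
95 120
;
/* Pearson and Spearman correlation coefficients with P values */
PROC CORR PEARSON SPEARMAN;
VAR X Y;
RUN;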

Page 31: Principles of Biostatistics Chapter 17     Correlation

The CORR Procedure

2 Variables: X Y

Simple Statistics

Variable    N     Mean         Std Dev      Median      Minimum    Maximum
X           20     72.00000     33.79193    92.50000    5.00000    100.00000
Y           20    194.95000    268.92211    83.50000    3.00000    830.00000

Pearson Correlation Coefficients, N = 20
Prob > |r| under H0: Rho=0

        X           Y
X       1.00000     -0.87681
                    <.0001
Y       -0.87681    1.00000
        <.0001

Spearman Correlation Coefficients, N = 20
Prob > |r| under H0: Rho=0

        X           Y
X       1.00000     -0.88969
                    <.0001
Y       -0.88969    1.00000
        <.0001

Page 32: Principles of Biostatistics Chapter 17     Correlation

SUMMARY

1. Simple linear correlation coefficient: r

Condition: both the X and Y variables follow the normal distribution.

2. Spearman's rank correlation coefficient: rs

It does not require that X or Y follow the normal distribution.

Page 33: Principles of Biostatistics Chapter 17     Correlation

Assignment

Review Exercises 5 (p. 412)