Principles of Biostatistics Chapter 17 Correlation

宇传华 http://statdtedm.6to23.com

Free Online Statistics Resources (Part 8)

Terminology

scatter plot: 散点图
correlation: 相关
linear correlation: 直线相关
correlation coefficient: 相关系数
Pearson’s correlation coefficient: Pearson 相关系数
Spearman’s rank correlation coefficient: Spearman 等级相关系数

CONTENTS

§17.1 The Two-Way Scatter Plot

§17.2 Pearson’s Correlation Coefficient: r

§17.3 Spearman’s Correlation Coefficient: rs

§17.4 Further Application

The correlation between two random variables, X and Y, is a measure of the degree of linear association between the two variables.

The population correlation is denoted by the Greek letter ρ (rho).

The sample correlation is denoted by the Latin letter r.

Both ρ and r can take on any value from -1 to 1.

ρ = -1 indicates a perfect negative linear relationship.
ρ = 1 indicates a perfect positive linear relationship.
ρ = 0 indicates no linear relationship.

The absolute value of ρ indicates the strength of the relationship.

-1 < ρ < 0 indicates a negative linear relationship; 0 < ρ < 1 indicates a positive linear relationship.

The sign of ρ indicates the direction of the relationship. The sample correlation r is read in the same way.
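
As a small illustration of the two ends of this scale, the sketch below computes the sample correlation for points lying exactly on a rising line and exactly on a falling line. The helper name `pearson_r` and the data are made up for illustration; they are not taken from the slides.

```python
import math

def pearson_r(xs, ys):
    """Sample correlation: cross-product sum divided by the root of the two sums of squares."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    lxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    lxx = sum((x - mx) ** 2 for x in xs)
    lyy = sum((y - my) ** 2 for y in ys)
    return lxy / math.sqrt(lxx * lyy)

xs = [1, 2, 3, 4, 5]                  # made-up values
rising = [2 * x + 1 for x in xs]      # exactly on a line with positive slope
falling = [10 - 3 * x for x in xs]    # exactly on a line with negative slope

r_up = pearson_r(xs, rising)
r_down = pearson_r(xs, falling)
print(round(r_up, 6), round(r_down, 6))
```

The slope itself (2 or -3 here) does not matter; only whether the points fall exactly on a line, and in which direction the line runs.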

Correlation (coefficient)

Before conducting a correlation analysis, we should always create a two-way scatter plot (scatter diagram).

X variable: horizontal axis. Y variable: vertical axis. Each point on the graph represents a pair of values (Xi, Yi).

From a scatter plot we can often determine whether a linear relationship exists between X and Y.

One statistical technique often employed to measure such an association is correlation analysis.

§17.1 The Two-Way Scatter Plot

Table: Relationship between thrombin concentration (X) and clotting time (Y)

Subject   Concentration of thrombin (Xi)   Clotting time (Yi)
1         1.1                              14
2         1.0                              15
3         0.9                              15
4         1.2                              13
5         0.6                              17
6         1.0                              14
7         0.9                              16

Scatter Plot

[Figure: six scatter-plot panels: perfect positive correlation (r = 1); strong positive correlation (r = 0.99); positive correlation (r = 0.80); strong negative correlation (r = -0.98); no correlation (r = 0.00); non-linear correlation]

The importance of a scatter plot

In the next chapter (simple linear regression), we also need a scatter plot to check whether the relationship between X and Y is linear and whether it is positive.

So, before any correlation or regression analysis, we should make a scatter plot.

§17.2 Pearson’s Correlation Coefficient (r)

Synonyms: product-moment correlation coefficient; simple linear correlation coefficient

Definition:

r: a statistical index describing the intensity (strength) and the direction of the association between two variables (X, Y).

r is a dimensionless number; it has no units of measurement. -1 ≤ r ≤ 1

X, Y: random variables following a bivariate normal distribution.

Both Xi and Yi are measured on the same (ith) subject.

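How do we calculate r?

$$ r = \frac{\sum_{i=1}^{n}(X_i-\bar{X})(Y_i-\bar{Y})}{\sqrt{\sum_{i=1}^{n}(X_i-\bar{X})^2 \,\sum_{i=1}^{n}(Y_i-\bar{Y})^2}} = \frac{l_{XY}}{\sqrt{l_{XX}\, l_{YY}}} $$

Sum of squares of deviations of X from its mean:

$$ l_{XX} = \sum_{i=1}^{n}(x_i-\bar{x})^2 = \sum_{i=1}^{n}x_i^2 - \frac{\left(\sum_{i=1}^{n}x_i\right)^2}{n} $$

Sum of squares of deviations of Y from its mean:

$$ l_{YY} = \sum_{i=1}^{n}(y_i-\bar{y})^2 = \sum_{i=1}^{n}y_i^2 - \frac{\left(\sum_{i=1}^{n}y_i\right)^2}{n} $$

Sum of cross products of the deviations of X from its mean and the deviations of Y from its mean:

$$ l_{XY} = \sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y}) = \sum_{i=1}^{n}x_i y_i - \frac{\left(\sum_{i=1}^{n}x_i\right)\left(\sum_{i=1}^{n}y_i\right)}{n} $$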

x: concentration of thrombin (u/ml); y: clotting time (seconds)

Subject i    x      y     x²      y²     x×y
1            1.1    14    1.21    196    15.4
2            1.2    13    1.44    169    15.6
3            1.0    15    1.00    225    15.0
4            0.9    15    0.81    225    13.5
5            1.2    13    1.44    169    15.6
6            1.1    14    1.21    196    15.4
7            0.9    16    0.81    256    14.4
8            0.6    17    0.36    289    10.2
9            1.0    14    1.00    196    14.0
10           0.9    16    0.81    256    14.4
11           1.1    15    1.21    225    16.5
12           0.9    16    0.81    256    14.4
13           1.1    14    1.21    196    15.4
14           1.0    15    1.00    225    15.0
15           0.7    17    0.49    289    11.9
Sum          14.7   224   14.81   3368   216.7
             (Σx)   (Σy)  (Σx²)   (Σy²)  (Σxy)

lXX=0.404 , lYY=22.933 , lXY=-2.82

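2) Calculation of r

$$ r = \frac{\sum (X_i-\bar{X})(Y_i-\bar{Y})}{\sqrt{\sum (X_i-\bar{X})^2 \,\sum (Y_i-\bar{Y})^2}} = \frac{l_{XY}}{\sqrt{l_{XX}\, l_{YY}}} = \frac{-2.82}{\sqrt{0.404 \times 22.933}} = -0.926 $$

X, Y: strong negative relationship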
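
The arithmetic for this example can be checked with a short sketch that recomputes the three sums and r from the 15 pairs in the table. The variable names are illustrative choices; the formulas are the computational forms (sum of squares minus the squared sum over n).

```python
import math

# Thrombin concentration (u/ml) and clotting time (seconds) for the 15 subjects
x = [1.1, 1.2, 1.0, 0.9, 1.2, 1.1, 0.9, 0.6, 1.0, 0.9, 1.1, 0.9, 1.1, 1.0, 0.7]
y = [14, 13, 15, 15, 13, 14, 16, 17, 14, 16, 15, 16, 14, 15, 17]
n = len(x)

# Computational forms: raw sum minus (product of sums) / n
l_xx = sum(v * v for v in x) - sum(x) ** 2 / n
l_yy = sum(v * v for v in y) - sum(y) ** 2 / n
l_xy = sum(a * b for a, b in zip(x, y)) - sum(x) * sum(y) / n

r = l_xy / math.sqrt(l_xx * l_yy)
print(f"l_XX={l_xx:.3f}  l_YY={l_yy:.3f}  l_XY={l_xy:.2f}  r={r:.3f}")
```

Using the deviation forms Σ(x - x̄)² and Σ(x - x̄)(y - ȳ) instead would give the same values up to rounding.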

Inference about the correlation coefficient r: hypothesis test

1) Establish the hypotheses and determine the significance level α

H0: ρ = 0 (no linear association between X and Y)

H1: ρ ≠ 0 (a linear association between X and Y exists)

α = 0.05 (two-sided probability of a type I error)

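2) Calculating the test statistic

$$ t_r = \frac{r-0}{S_r} = \frac{r-0}{\sqrt{(1-r^2)/(n-2)}}, \qquad \nu = n-2 $$

where $S_r = \sqrt{(1-r^2)/(n-2)}$ is the standard error of r.

For the above example, ν = 15 - 2 = 13:

$$ t_r = \frac{-0.926}{\sqrt{(1-0.926^2)/(15-2)}} = -8.874 \sim t_{(13)} $$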

From the t distribution table (Table A4, Appendix), the critical value is t0.05/2(13) = 2.160 < |t| = 8.874, so P < 0.05.

The correlation coefficient is statistically significant at α = 0.05: concentration of thrombin and clotting time are negatively related.
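
The test can be sketched in a few lines. The function name `t_for_correlation` is a hypothetical helper; the critical value 2.160 is simply the table value quoted above, not computed by the code. With the unrounded r the statistic comes out at roughly -8.87, in line with the slide.

```python
import math

def t_for_correlation(r, n):
    """t statistic for H0: rho = 0, referred to a t distribution with n - 2 degrees of freedom."""
    s_r = math.sqrt((1 - r * r) / (n - 2))   # standard error of r
    return r / s_r

n = 15
r = -2.82 / math.sqrt(0.404 * 22.933)        # r from l_XY, l_XX and l_YY of the example
t = t_for_correlation(r, n)

t_crit = 2.160                               # t_{0.05/2}(13), taken from the table lookup
reject_h0 = abs(t) > t_crit
print(f"t = {t:.3f}, df = {n - 2}, reject H0: {reject_h0}")
```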

§17.3 Spearman’s Rank Correlation Coefficient: rs

"Rank" is translated into Chinese as 秩 or 等级.

Spearman’s rank correlation (a nonparametric method) is applied when the two variables are distributed far from normal, i.e. when the normality requirement is not satisfied.

The steps of the hypothesis test

Rank the values of each of the two variables by magnitude: (Xi, Yi) → (Xir, Yir)

Calculate Spearman’s rank correlation coefficient from the ranks:

$$ r_s = \frac{l_{X_r Y_r}}{\sqrt{l_{X_r X_r}\, l_{Y_r Y_r}}} $$

If there are no tied ranks, this reduces to

$$ r_s = 1 - \frac{6\sum_{i=1}^{n} d_i^2}{n(n^2-1)}, \qquad d_i = X_{ir} - Y_{ir} $$

Table: Degree of hemorrhage and platelet counts (10^9/L) in 12 children with acute leukemia

Patient   Platelet Xi   Rank Xir   (Xir)²   Bleeding Yi   Rank Yir   (Yir)²    Xir × Yir
(1)       (2)           (3)        (4)      (5)           (6)        (7)       (8)
1         121           1          1        +++           11.5       132.25    11.5
2         138           2          4        ++            9.0        81.00     18.0
3         165           3          9        +             7.0        49.00     21.0
4         310           4          16       –             3.5        12.25     14.0
5         426           5          25       ++            9.0        81.00     45.0
6         540           6          36       ++            9.0        81.00     54.0
7         740           7          49       –             3.5        12.25     24.5
8         1060          8          64       –             3.5        12.25     28.0
9         1260          9          81       –             3.5        12.25     31.5
10        1290          10         100      –             3.5        12.25     35.0
11        1438          11         121      +++           11.5       132.25    126.5
12        2004          12         144      –             3.5        12.25     42.0
Total                   78         650                    78         630       451

For tied (equal) values, the mean of the ranks they occupy is used. For example, the six ‘–’ grades occupy ranks 1 to 6, so each receives (1+2+3+4+5+6)/6 = 3.5.
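
The mean-rank rule can be written as a short function. This is a minimal sketch: the name `mean_ranks` and the numeric coding of the bleeding grades (0 for ‘–’ up to 3 for ‘+++’) are assumptions made here so that the grades can be ordered; only the resulting ranks come from the table.

```python
def mean_ranks(values):
    """Rank from smallest to largest; tied values share the mean of the ranks they occupy."""
    ranks = []
    for v in values:
        smaller = sum(u < v for u in values)   # observations strictly below v
        tied = sum(u == v for u in values)     # observations equal to v (v itself included)
        # the tied group occupies ranks smaller+1 .. smaller+tied; take their mean
        ranks.append(smaller + (tied + 1) / 2)
    return ranks

# Assumed coding of the bleeding grades: '–' -> 0, '+' -> 1, '++' -> 2, '+++' -> 3
bleeding = [3, 2, 1, 0, 2, 2, 0, 0, 0, 0, 3, 0]
y_ranks = mean_ranks(bleeding)
print(y_ranks)
```

The ranks always sum to n(n+1)/2 (78 for n = 12), with or without ties, which is a quick check on a hand-ranked column.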

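Calculation of rs (numerical values are from the table above; column totals: ΣXir = 78, ΣXir² = 650, ΣYir = 78, ΣYir² = 630, ΣXirYir = 451)

$$ l_{X_r X_r} = \sum X_{ir}^2 - \frac{\left(\sum X_{ir}\right)^2}{n} = 650 - \frac{78^2}{12} = 143 $$

$$ l_{Y_r Y_r} = \sum Y_{ir}^2 - \frac{\left(\sum Y_{ir}\right)^2}{n} = 630 - \frac{78^2}{12} = 123 $$

$$ l_{X_r Y_r} = \sum X_{ir} Y_{ir} - \frac{\left(\sum X_{ir}\right)\left(\sum Y_{ir}\right)}{n} = 451 - \frac{(78)(78)}{12} = -56 $$

$$ r_s = \frac{l_{X_r Y_r}}{\sqrt{l_{X_r X_r}\, l_{Y_r Y_r}}} = \frac{-56}{\sqrt{143 \times 123}} = -0.422 $$

For comparison, the formula for untied ranks gives

$$ r_s = 1 - \frac{6\sum_{i=1}^{n} d_i^2}{n(n^2-1)} = 1 - \frac{6 \times 378}{12\,(12^2-1)} = -0.322 $$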

Because there are tied ranks in Y, the latter formula cannot be used here.
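
The gap between the two formulas can be reproduced directly from the rank columns of the table. The sketch below is illustrative only (variable names are assumptions): it applies the general formula and the d² shortcut to the same ranks.

```python
import math

# Ranks taken from columns (3) and (6) of the table
x_rank = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]
y_rank = [11.5, 9.0, 7.0, 3.5, 9.0, 9.0, 3.5, 3.5, 3.5, 3.5, 11.5, 3.5]
n = len(x_rank)

# General formula: Pearson's r applied to the ranks (valid with or without ties)
l_xx = sum(v * v for v in x_rank) - sum(x_rank) ** 2 / n
l_yy = sum(v * v for v in y_rank) - sum(y_rank) ** 2 / n
l_xy = sum(a * b for a, b in zip(x_rank, y_rank)) - sum(x_rank) * sum(y_rank) / n
rs_general = l_xy / math.sqrt(l_xx * l_yy)

# Shortcut formula: exact only when no ranks are tied
d2 = sum((a - b) ** 2 for a, b in zip(x_rank, y_rank))
rs_shortcut = 1 - 6 * d2 / (n * (n * n - 1))

print(f"general: {rs_general:.3f}, shortcut: {rs_shortcut:.3f}")
```

With the many ties in the bleeding grades the shortcut is off by about 0.1, which is why the general formula is the one to report.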

Explanation of Spearman’s rank correlation coefficient: rs

(1) -1 ≤ rs ≤ 1, with a meaning similar to that of r.

(2) Difference between rs and r: rs is calculated from the ranks, whereas r is calculated from the original data values, so in general rs ≠ r.

Statistical inference about rs

1) Setting up the hypotheses and determining the significance level

H0: ρs = 0    H1: ρs ≠ 0    α = 0.05

2) Calculating the test statistic

$$ t_{r_s} = \frac{r_s - 0}{S_{r_s}} = \frac{r_s - 0}{\sqrt{(1-r_s^2)/(n-2)}} = \frac{-0.422 - 0}{\sqrt{\left(1-(-0.422)^2\right)/(12-2)}} = -1.47 $$

|-1.47| < t0.05/2(10) = 2.228, so P > 0.05 and we fail to reject H0.

3) Conclusion: no statistically significant association between platelet count and bleeding was found.

Notices in application

1. r=0 does not mean no correlation (might be non-linear correlation)

[Figure: three scatter plots of Y against X, each labelled H0: ρ = 0]
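
A made-up example shows how r can be zero even though Y is completely determined by X. The data (Y = X² on a range symmetric about zero) and the helper `pearson_r` are illustrative assumptions, not taken from the slides.

```python
import math

def pearson_r(xs, ys):
    """Sample correlation from deviation sums."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    lxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    lxx = sum((x - mx) ** 2 for x in xs)
    lyy = sum((y - my) ** 2 for y in ys)
    return lxy / math.sqrt(lxx * lyy)

x = [-3, -2, -1, 0, 1, 2, 3]
y = [v * v for v in x]        # exact, but non-linear, dependence of Y on X

r = pearson_r(x, y)
print(r)
```

The positive and negative cross products cancel exactly, so r measures no linear trend although a scatter plot would show a clear parabola.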

2. When the levels of either variable X or Y are artificially selected, Pearson’s correlation analysis is not suitable (but Spearman’s rank correlation analysis can still be used).

Pearson’s correlation analysis requires that both X and Y follow a normal distribution.

3. Outliers can heavily affect the correlation coefficient.
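
How strongly a single point can move r is easy to see with invented numbers. In the sketch below the six base points and the extreme point (20, 10) are made up for illustration; `pearson_r` is a hypothetical helper.

```python
import math

def pearson_r(xs, ys):
    """Sample correlation from deviation sums."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    lxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    lxx = sum((x - mx) ** 2 for x in xs)
    lyy = sum((y - my) ** 2 for y in ys)
    return lxy / math.sqrt(lxx * lyy)

# Made-up data: six points with almost no linear trend
x = [1, 2, 3, 4, 5, 6]
y = [2.1, 1.9, 2.0, 2.2, 1.8, 2.0]
r_without = pearson_r(x, y)

# The same data plus one extreme point far from the rest
r_with = pearson_r(x + [20], y + [10.0])
print(f"without outlier: {r_without:.2f}, with outlier: {r_with:.2f}")
```

One added point turns a weak negative r into a value above 0.9, which is another reason to look at the scatter plot before trusting the coefficient.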

4. Correlation ≠ cause-effect association; correlation ≠ intrinsic association.

5. The difference between statistical significance (P value) and intensity of correlation (absolute value of r):

Statistical significance of the correlation coefficient: the probability of obtaining such an r from a population with ρ = 0 is small (the P value is small).

Intensity of correlation: the absolute value of r.
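
The distinction can be made concrete by holding r fixed and changing only the sample size. The value r = 0.30 and the two sample sizes are invented for illustration; the critical values are standard two-sided 5% t-table entries (2.306 for 8 degrees of freedom, and approximately 1.984 for 98).

```python
import math

def t_for_correlation(r, n):
    """t statistic for H0: rho = 0, with n - 2 degrees of freedom."""
    return r * math.sqrt((n - 2) / (1 - r * r))

r = 0.30                                   # same intensity of correlation in both cases
t_small = t_for_correlation(r, 10)         # n = 10, df = 8
t_large = t_for_correlation(r, 100)        # n = 100, df = 98

significant_small = abs(t_small) > 2.306   # t_{0.05/2}(8)
significant_large = abs(t_large) > 1.984   # t_{0.05/2}(98), approximate table value
print(f"n=10: t={t_small:.2f}, significant={significant_small}")
print(f"n=100: t={t_large:.2f}, significant={significant_large}")
```

The same modest r is not significant with 10 observations but is with 100, so a small P value says nothing by itself about how strong the association is.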

§17.4 Further Application

SAS code for the textbook’s Table 17.1 and Table 17.2

DATA EXP17_12;
  INPUT X Y;
  CARDS;
77 118
69 65
32 184
85 8
94 43
99 12
89 55
13 208
95 7
95 9
54 9
89 124
95 10
87 6
91 33
98 16
73 32
47 145
76 87
90 9
;
PROC CORR PEARSON SPEARMAN;
  VAR X Y;
RUN;

The CORR Procedure

2 Variables: X Y

Simple Statistics

Variable    N        Mean     Std Dev      Median     Minimum     Maximum
X          20    77.40000    23.65409    88.00000    13.00000    99.00000
Y          20    59.00000    63.86581    32.50000     6.00000   208.00000

Pearson Correlation Coefficients, N = 20
Prob > |r| under H0: Rho=0

             X           Y
X      1.00000    -0.79107
                    <.0001
Y     -0.79107     1.00000
        <.0001

Spearman Correlation Coefficients, N = 20
Prob > |r| under H0: Rho=0

             X           Y
X      1.00000    -0.54319
                    0.0133
Y     -0.54319     1.00000
        0.0133
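
As an independent cross-check of this listing, the two coefficients can be recomputed outside SAS. The sketch below is plain Python, not the procedure SAS uses internally; the 20 pairs are read off the CARDS block of DATA EXP17_12, and the helper names are assumptions. Spearman's coefficient is obtained as Pearson's r applied to mean ranks.

```python
import math

# (X, Y) pairs as read from the CARDS block of DATA EXP17_12
pairs = [(77, 118), (69, 65), (32, 184), (85, 8), (94, 43), (99, 12), (89, 55),
         (13, 208), (95, 7), (95, 9), (54, 9), (89, 124), (95, 10), (87, 6),
         (91, 33), (98, 16), (73, 32), (47, 145), (76, 87), (90, 9)]
x = [p[0] for p in pairs]
y = [p[1] for p in pairs]

def pearson_r(xs, ys):
    """Sample correlation from deviation sums."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    lxy = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    lxx = sum((a - mx) ** 2 for a in xs)
    lyy = sum((b - my) ** 2 for b in ys)
    return lxy / math.sqrt(lxx * lyy)

def mean_ranks(values):
    """Ranks from smallest to largest; tied values share the mean of their ranks."""
    return [sum(u < v for u in values) + (sum(u == v for u in values) + 1) / 2
            for v in values]

r = pearson_r(x, y)
rs = pearson_r(mean_ranks(x), mean_ranks(y))
print(f"Pearson r = {r:.5f}, Spearman rs = {rs:.5f}")
```

Both values should agree with the listing to the printed precision; the P values are not reproduced here because they need the t distribution function.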

SAS code for the textbook’s Table 17.3 and Table 17.4

DATA EXP17_34;
  INPUT X Y;
  CARDS;
5 600
100 3
98 67
84 170
100 6
99 15
70 120
50 170
26 300
6 830
100 10
37 800
35 500
96 60
55 100
90 10
96 5
99 5
99 8
95 120
;
PROC CORR PEARSON SPEARMAN;
  VAR X Y;
RUN;

The CORR Procedure

2 Variables: X Y

Simple Statistics

Variable    N        Mean     Std Dev      Median     Minimum     Maximum
X          20    72.00000    33.79193    92.50000     5.00000   100.00000
Y          20   194.95000   268.92211    83.50000     3.00000   830.00000

Pearson Correlation Coefficients, N = 20
Prob > |r| under H0: Rho=0

             X           Y
X      1.00000    -0.87681
                    <.0001
Y     -0.87681     1.00000
        <.0001

Spearman Correlation Coefficients, N = 20
Prob > |r| under H0: Rho=0

             X           Y
X      1.00000    -0.88969
                    <.0001
Y     -0.88969     1.00000
        <.0001

SUMMARY

1. Simple linear correlation coefficient: r
Condition: both X and Y follow the normal distribution.

2. Spearman’s rank correlation coefficient: rs
It does not require that X or Y follow the normal distribution.

Assignment: Review Exercises 5 (p. 412)