Principles of Biostatistics
Chapter 17 Correlation
宇传华 http://statdtedm.6to23.com
Free Online Statistical Resources (8)
Terminology
scatter plot 散点图
correlation 相关
linear correlation 直线相关
correlation coefficient 相关系数
Pearson’s correlation coefficient Pearson 相关系数
Spearman’s rank correlation coefficient Spearman 等级相关系数
CONTENTS
§17.1 The Two-Way Scatter Plot
§17.2 Pearson’s Correlation Coefficient: r
§17.3 Spearman’s Correlation Coefficient: rs
§17.4 Further Application
Correlation (coefficient)
The correlation between two random variables, X and Y, is a measure of the degree of linear association between the two variables.
The population correlation is denoted by the Greek letter ρ (pronounced "rho").
The sample correlation is denoted by r (a Latin letter).
ρ (or r) can take on any value from -1 to 1.
ρ = -1 indicates a perfect negative linear relationship.
ρ = 1 indicates a perfect positive linear relationship.
ρ = 0 indicates no linear relationship.
-1 < ρ < 0 indicates a negative linear relationship; 0 < ρ < 1 indicates a positive linear relationship.
The sign of ρ indicates the direction of the relationship.
The absolute value of ρ indicates the strength of the relationship.
§17.1 The Two-Way Scatter Plot
Before we conduct a correlation analysis, we should always create a two-way scatter plot (scatter diagram).
X variable: horizontal axis. Y variable: vertical axis. Each point on the graph represents one pair of values (Xi, Yi).
From the scatter plot, we can often determine whether a linear relationship exists between X and Y.
One statistical technique often employed to measure such an association is known as correlation analysis.

Table: Relationship between thrombin concentration (X) and clotting time (Y)
Subject   Concentration of thrombin (Xi)   Clotting time (Yi)
1         1.1                              14
2         1.0                              15
3         0.9                              15
4         1.2                              13
5         0.6                              17
6         1.0                              14
7         0.9                              16
Scatter Plot
[Figure: six example scatter plots. Panels: perfect positive correlation (r = 1); strong positive correlation (r = 0.99); positive correlation (r = 0.80); strong negative correlation (r = -0.98); no correlation (r = 0.00); non-linear correlation.]
The importance of a scatter plot
In the next chapter (simple linear regression), we also need a scatter plot to check whether the relationship between X and Y is linear and, if so, whether it is positive.
So, before any analysis of correlation or regression, we should make a scatter plot.
§17.2 Pearson’s Correlation Coefficient: r
Synonyms: product-moment correlation coefficient; simple linear correlation coefficient.
Definition:
r is a statistical index that describes the intensity (strength) and the direction of the association between two variables (X, Y).
r is a dimensionless number: it has no units of measurement, and -1 ≤ r ≤ 1.
X, Y: random variables that follow a bivariate normal distribution.
Both Xi and Yi are measured on the same (i-th) subject.
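How do we calculate r?
$$ r = \frac{\sum_{i=1}^{n}(X_i-\bar{X})(Y_i-\bar{Y})}{\sqrt{\sum_{i=1}^{n}(X_i-\bar{X})^2 \, \sum_{i=1}^{n}(Y_i-\bar{Y})^2}} = \frac{l_{XY}}{\sqrt{l_{XX}\, l_{YY}}} $$
Sum of squares of deviations of X from its mean:
$$ l_{XX} = \sum_{i=1}^{n}(x_i-\bar{x})^2 = \sum_{i=1}^{n} x_i^2 - \frac{\left(\sum_{i=1}^{n} x_i\right)^2}{n} $$
Sum of squares of deviations of Y from its mean:
$$ l_{YY} = \sum_{i=1}^{n}(y_i-\bar{y})^2 = \sum_{i=1}^{n} y_i^2 - \frac{\left(\sum_{i=1}^{n} y_i\right)^2}{n} $$
Sum of cross products of the deviations of X from its mean and the deviations of Y from its mean:
$$ l_{XY} = \sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y}) = \sum_{i=1}^{n} x_i y_i - \frac{\left(\sum_{i=1}^{n} x_i\right)\left(\sum_{i=1}^{n} y_i\right)}{n} $$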
Subject i   Concentration of thrombin x (u/ml)   Clotting time y (seconds)   x²      y²     x×y
1           1.1                                  14                          1.21    196    15.4
2           1.2                                  13                          1.44    169    15.6
3           1.0                                  15                          1.00    225    15.0
4           0.9                                  15                          0.81    225    13.5
5           1.2                                  13                          1.44    169    15.6
6           1.1                                  14                          1.21    196    15.4
7           0.9                                  16                          0.81    256    14.4
8           0.6                                  17                          0.36    289    10.2
9           1.0                                  14                          1.00    196    14.0
10          0.9                                  16                          0.81    256    14.4
11          1.1                                  15                          1.21    225    16.5
12          0.9                                  16                          0.81    256    14.4
13          1.1                                  14                          1.21    196    15.4
14          1.0                                  15                          1.00    225    15.0
15          0.7                                  17                          0.49    289    11.9
Sum         Σx = 14.7                            Σy = 224                    Σx² = 14.81   Σy² = 3368   Σxy = 216.7
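1) Calculation of lXX, lYY and lXY
lXX = 0.404, lYY = 22.933, lXY = -2.82
2) Calculation of r
$$ r = \frac{\sum (X_i-\bar{X})(Y_i-\bar{Y})}{\sqrt{\sum (X_i-\bar{X})^2 \, \sum (Y_i-\bar{Y})^2}} = \frac{l_{XY}}{\sqrt{l_{XX}\, l_{YY}}} = \frac{-2.82}{\sqrt{0.404 \times 22.933}} = -0.926 $$
X and Y show a strong negative linear relationship.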
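The hand calculation can be cross-checked with a short Python sketch. It is illustrative only: the function name `pearson_r` is an assumption of this example rather than anything from the textbook, and it applies the computational formulas for l_XX, l_YY and l_XY to the 15 observations in the table.

```python
import math

# Thrombin concentration x (u/ml) and clotting time y (seconds), 15 subjects.
x = [1.1, 1.2, 1.0, 0.9, 1.2, 1.1, 0.9, 0.6, 1.0, 0.9, 1.1, 0.9, 1.1, 1.0, 0.7]
y = [14, 13, 15, 15, 13, 14, 16, 17, 14, 16, 15, 16, 14, 15, 17]


def pearson_r(x, y):
    """Return (l_xx, l_yy, l_xy, r) using the computational formulas."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    l_xx = sum(v * v for v in x) - sum_x ** 2 / n
    l_yy = sum(v * v for v in y) - sum_y ** 2 / n
    l_xy = sum(a * b for a, b in zip(x, y)) - sum_x * sum_y / n
    return l_xx, l_yy, l_xy, l_xy / math.sqrt(l_xx * l_yy)


l_xx, l_yy, l_xy, r = pearson_r(x, y)
print(f"l_XX={l_xx:.3f}  l_YY={l_yy:.3f}  l_XY={l_xy:.2f}  r={r:.3f}")
```

The printed values should agree, to the displayed precision, with lXX = 0.404, lYY = 22.933, lXY = -2.82 and r = -0.926.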
Inference about the correlation coefficient r: hypothesis test
1) Establish the hypotheses and determine the significance level α
H0: ρ = 0 (no linear association between X and Y)
H1: ρ ≠ 0 (a linear association between X and Y exists)
α = 0.05 (two-sided probability of a type I error)
2) Calculate the test statistic
$$ t_r = \frac{r - 0}{S_r} = \frac{r}{\sqrt{(1 - r^2)/(n - 2)}}, \qquad \nu = n - 2, $$
where $S_r = \sqrt{(1 - r^2)/(n - 2)}$ is the standard error of r.
For the above example, ν = 15 - 2 = 13 and
$$ t_r = \frac{-0.926}{\sqrt{(1 - 0.926^2)/(15 - 2)}} = -8.874 \sim t_{(13)} $$
(the value -8.874 is obtained with r before rounding).
From the t distribution table (Table A4, Appendix), the critical value is t0.05/2(13) = 2.160 < |t| = 8.874, so P < 0.05.
The correlation coefficient is statistically significant at α = 0.05: concentration of thrombin and clotting time are negatively related.
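As a rough check of the test statistic, the sketch below recomputes r from the column sums of the 15-subject table and then forms t with ν = n - 2. The variable names are assumptions of this example, and the critical value 2.160 is hard-coded from the t table rather than computed.

```python
import math

# Column sums from the 15-subject table (thrombin concentration x, clotting time y).
n = 15
sum_x, sum_y = 14.7, 224
sum_x2, sum_y2, sum_xy = 14.81, 3368, 216.7

l_xx = sum_x2 - sum_x ** 2 / n
l_yy = sum_y2 - sum_y ** 2 / n
l_xy = sum_xy - sum_x * sum_y / n
r = l_xy / math.sqrt(l_xx * l_yy)

# Test statistic for H0: rho = 0.
df = n - 2
s_r = math.sqrt((1 - r ** 2) / df)   # standard error of r
t = r / s_r

# Critical value t_{0.05/2}(13) as read from the t table; hard-coded here.
t_crit = 2.160
reject_h0 = abs(t) > t_crit
print(f"r = {r:.3f}, t = {t:.3f}, df = {df}, reject H0: {reject_h0}")
```

Working from the unrounded r matters here: plugging the rounded value 0.926 into the formula gives a slightly smaller |t| (about 8.84).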
§17.3 Spearman’s Rank Correlation Coefficient: rs
("Rank" may be translated into Chinese as 秩 or 等级.)
Spearman’s rank correlation (a nonparametric method) is applied when the two variables are distributed far from normal, i.e. when the normality requirement is not satisfied.
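The steps of the hypothesis test
1) Rank the values of each of the two variables according to magnitude: (Xi, Yi) → (Xir, Yir).
2) Calculate Spearman’s rank correlation coefficient from the ranks:
$$ r_s = \frac{l_{X_{ir}Y_{ir}}}{\sqrt{l_{X_{ir}X_{ir}}\, l_{Y_{ir}Y_{ir}}}} $$
If there are no tied ranks, rs can also be calculated as
$$ r_s = 1 - \frac{6\sum_{i=1}^{n} d_i^2}{n(n^2-1)}, $$
where $d_i = X_{ir} - Y_{ir}$ is the difference between the two ranks of subject i.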
Table: Degree of hemorrhage and platelet (thrombocyte) counts (10^9/L) for 12 children with acute leukemia
Patient   Platelet Xi   Rank Xir   (Xir)²   Bleeding Yi   Rank Yir   (Yir)²    Xir × Yir
(1)       (2)           (3)        (4)      (5)           (6)        (7)       (8)
1         121           1          1        +++           11.5       132.25    11.5
2         138           2          4        ++            9.0        81.00     18.0
3         165           3          9        +             7.0        49.00     21.0
4         310           4          16       –             3.5        12.25     14.0
5         426           5          25       ++            9.0        81.00     45.0
6         540           6          36       ++            9.0        81.00     54.0
7         740           7          49       –             3.5        12.25     24.5
8         1060          8          64       –             3.5        12.25     28.0
9         1260          9          81       –             3.5        12.25     31.5
10        1290          10         100      –             3.5        12.25     35.0
11        1438          11         121      +++           11.5       132.25    126.5
12        2004          12         144      –             3.5        12.25     42.0
Total                   78         650                    78         630       451
For tied (equal) ranks, the mean rank is used instead. There are six "–" grades, so each receives the mean rank (1+2+3+4+5+6)/6 = 3.5.
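The mean-rank rule for ties can be sketched in a few lines of Python. This is a minimal illustration, not the textbook's procedure: the helper `average_ranks` and the numeric coding of the bleeding grades (0 to 3 by severity) are assumptions made for this example.

```python
def average_ranks(values):
    """Rank values from smallest to largest; tied values share the mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the block while the next value equals the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + 1 + end + 1) / 2   # mean of positions start+1 .. end+1
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks


# Bleeding grades coded by severity: '-' = 0, '+' = 1, '++' = 2, '+++' = 3
# (this numeric coding is an assumption made for the sketch).
grade = {"-": 0, "+": 1, "++": 2, "+++": 3}
bleeding = ["+++", "++", "+", "-", "++", "++", "-", "-", "-", "-", "+++", "-"]
y_ranks = average_ranks([grade[b] for b in bleeding])
print(y_ranks)
```

The six "-" grades occupy positions 1 to 6 and therefore each get rank 3.5; the three "++" grades occupy positions 8 to 10 and each get rank 9.0, matching column (6) of the table.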
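Calculation of rs (numerical values are the column totals of the table above: ΣXir = 78, Σ(Xir)² = 650, ΣYir = 78, Σ(Yir)² = 630, ΣXirYir = 451)
$$ l_{X_{ir}X_{ir}} = \sum X_{ir}^2 - \frac{\left(\sum X_{ir}\right)^2}{n} = 650 - \frac{78^2}{12} = 143 $$
$$ l_{Y_{ir}Y_{ir}} = \sum Y_{ir}^2 - \frac{\left(\sum Y_{ir}\right)^2}{n} = 630 - \frac{78^2}{12} = 123 $$
$$ l_{X_{ir}Y_{ir}} = \sum X_{ir}Y_{ir} - \frac{\left(\sum X_{ir}\right)\left(\sum Y_{ir}\right)}{n} = 451 - \frac{(78)(78)}{12} = -56 $$
$$ r_s = \frac{l_{X_{ir}Y_{ir}}}{\sqrt{l_{X_{ir}X_{ir}}\, l_{Y_{ir}Y_{ir}}}} = \frac{-56}{\sqrt{143 \times 123}} = -0.422 $$
For comparison, the formula for data without tied ranks gives a different value:
$$ r_s = 1 - \frac{6\sum_{i=1}^{n} d_i^2}{n(n^2-1)} = 1 - \frac{6 \times 378}{12\,(12^2 - 1)} = -0.322 $$
Because there are tied ranks in Y, the latter formula cannot be used here.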
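The two formulas can be compared numerically with the ranks from the table. The sketch below is only an illustration (variable names are assumptions of this example); it shows that the shortcut formula based on Σd² departs from the general formula once ties are present.

```python
import math

# Ranks copied from columns (3) and (6) of the table.
x_ranks = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]
y_ranks = [11.5, 9.0, 7.0, 3.5, 9.0, 9.0, 3.5, 3.5, 3.5, 3.5, 11.5, 3.5]
n = len(x_ranks)

# General formula: Pearson's r applied to the ranks (valid with ties).
l_xx = sum(v * v for v in x_ranks) - sum(x_ranks) ** 2 / n
l_yy = sum(v * v for v in y_ranks) - sum(y_ranks) ** 2 / n
l_xy = sum(a * b for a, b in zip(x_ranks, y_ranks)) - sum(x_ranks) * sum(y_ranks) / n
rs_general = l_xy / math.sqrt(l_xx * l_yy)

# Shortcut formula: exact only when there are no tied ranks.
sum_d2 = sum((a - b) ** 2 for a, b in zip(x_ranks, y_ranks))
rs_shortcut = 1 - 6 * sum_d2 / (n * (n ** 2 - 1))

print(f"l_XX={l_xx:.0f} l_YY={l_yy:.0f} l_XY={l_xy:.0f} sum_d2={sum_d2:.0f}")
print(f"general: {rs_general:.3f}  shortcut: {rs_shortcut:.3f}")
```

With these ranks the general formula gives about -0.422 and the shortcut about -0.322, so the choice of formula is not a cosmetic detail when ties occur.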
Explanation of Spearman’s rank correlation coefficient: rs
(1) -1 ≤ rs ≤ 1, with a meaning similar to that of r.
(2) Difference between rs and r: in general rs ≠ r, because rs is calculated from the ranks whereas r is calculated from the original data values.
Statistical inference about rs
1) Set up the hypotheses and determine the significance level
H0: ρs = 0; H1: ρs ≠ 0; α = 0.05
2) Calculate the test statistic
$$ t_{r_s} = \frac{r_s - 0}{S_{r_s}} = \frac{r_s}{\sqrt{(1 - r_s^2)/(n - 2)}} = \frac{-0.422}{\sqrt{\left(1 - (-0.422)^2\right)/(12 - 2)}} = -1.47 $$
|-1.47| < t0.05/2(10) = 2.228, so P > 0.05 and we fail to reject H0.
3) Conclusion: no statistically significant association was found between platelet count and degree of bleeding.
Notices in application
1. r = 0 does not mean no correlation (there might be a non-linear correlation).
[Figure: three scatter plots of Y against X, each labelled H0: ρ = 0.]
2. When the levels of either variable X or Y are artificially selected, Pearson’s correlation analysis is not suitable (but Spearman’s rank correlation analysis can be used).
Pearson’s correlation analysis requires that both X and Y follow the normal distribution.
3. Outliers can affect the correlation coefficient heavily.
4. Correlation ≠ cause-effect association; correlation ≠ intrinsic association.
5. The difference between statistical significance (P value) and intensity of correlation (absolute value of r):
Statistical significance of the correlation coefficient: the probability of obtaining such an r from a population with ρ = 0 is small (the P value is small).
Intensity of correlation: the absolute value of r.
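Notices 1 and 3 can be illustrated numerically. The sketch below uses made-up data (not from the textbook) and a hypothetical helper `pearson_r`: the first data set follows y = x² exactly yet gives r = 0, and in the second a single added outlier reverses the sign of r.

```python
import math


def pearson_r(x, y):
    """Pearson's r computed from deviations about the means."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    l_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    l_xx = sum((a - mean_x) ** 2 for a in x)
    l_yy = sum((b - mean_y) ** 2 for b in y)
    return l_xy / math.sqrt(l_xx * l_yy)


# Notice 1: an exact but non-linear relationship (y = x^2) can give r = 0.
x = [-3, -2, -1, 0, 1, 2, 3]
y = [v ** 2 for v in x]
r_curve = pearson_r(x, y)

# Notice 3: one outlier can change r markedly (made-up data).
x_base = [1, 2, 3, 4, 5, 6]
y_base = [2, 1, 4, 3, 6, 5]
r_base = pearson_r(x_base, y_base)
r_outlier = pearson_r(x_base + [20], y_base + [-10])

print(f"y = x^2: r = {r_curve:.3f}")
print(f"without outlier: r = {r_base:.3f}; with outlier: r = {r_outlier:.3f}")
```

Both effects are easy to spot on a scatter plot, which is one more reason to draw the plot before computing r.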
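§17.4 Further Application
SAS code for the textbook’s Table 17.1 and Table 17.2
DATA EXP17_12;
  /* The trailing @@ lets one data line hold several (X, Y) pairs */
  INPUT X Y @@;
CARDS;
77 118  69 65  32 184  85 8  94 43  99 12  89 55  13 208  95 7  95 9  54 9
89 124  95 10  87 6  91 33  98 16  73 32  47 145  76 87  90 9
;
/* Request both Pearson and Spearman correlation coefficients */
PROC CORR PEARSON SPEARMAN;
  VAR X Y;
RUN;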
The CORR Procedure
2 Variables: X Y

Simple Statistics
Variable   N    Mean       Std Dev    Median     Minimum    Maximum
X          20   77.40000   23.65409   88.00000   13.00000   99.00000
Y          20   59.00000   63.86581   32.50000   6.00000    208.00000

Pearson Correlation Coefficients, N = 20
Prob > |r| under H0: Rho=0
        X          Y
X       1.00000    -0.79107
                   <.0001
Y       -0.79107   1.00000
        <.0001

Spearman Correlation Coefficients, N = 20
Prob > |r| under H0: Rho=0
        X          Y
X       1.00000    -0.54319
                   0.0133
Y       -0.54319   1.00000
        0.0133
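For readers without SAS, the two coefficients for the Table 17.1 data can be reproduced with a small Python sketch. It is an illustration under stated assumptions: the helpers `pearson_r` and `average_ranks` are hypothetical names, and Spearman's coefficient is computed as Pearson's r applied to mean ranks. P values are not computed.

```python
import math

# (X, Y) pairs from the DATA step for Table 17.1.
pairs = [(77, 118), (69, 65), (32, 184), (85, 8), (94, 43), (99, 12), (89, 55),
         (13, 208), (95, 7), (95, 9), (54, 9), (89, 124), (95, 10), (87, 6),
         (91, 33), (98, 16), (73, 32), (47, 145), (76, 87), (90, 9)]
x = [p[0] for p in pairs]
y = [p[1] for p in pairs]


def pearson_r(a, b):
    """Pearson's r computed from deviations about the means."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    l_ab = sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b))
    l_aa = sum((u - mean_a) ** 2 for u in a)
    l_bb = sum((v - mean_b) ** 2 for v in b)
    return l_ab / math.sqrt(l_aa * l_bb)


def average_ranks(values):
    """Rank from smallest to largest; tied values receive their mean rank."""
    ordered = sorted(values)
    # First 1-based position of the value plus half of (count - 1).
    return [ordered.index(v) + 1 + (ordered.count(v) - 1) / 2 for v in values]


r = pearson_r(x, y)
rs = pearson_r(average_ranks(x), average_ranks(y))
print(f"Pearson r = {r:.5f}, Spearman rs = {rs:.5f}")
```

The results should match the PROC CORR output to the printed precision (-0.79107 and -0.54319).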
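SAS code for the textbook’s Table 17.3 and Table 17.4
DATA EXP17_34;
  /* The trailing @@ lets one data line hold several (X, Y) pairs */
  INPUT X Y @@;
CARDS;
5 600  100 3  98 67  84 170  100 6  99 15  70 120  50 170  26 300  6 830  100 10
37 800  35 500  96 60  55 100  90 10  96 5  99 5  99 8  95 120
;
/* Request both Pearson and Spearman correlation coefficients */
PROC CORR PEARSON SPEARMAN;
  VAR X Y;
RUN;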
The CORR Procedure
2 Variables: X Y

Simple Statistics
Variable   N    Mean        Std Dev     Median     Minimum   Maximum
X          20   72.00000    33.79193    92.50000   5.00000   100.00000
Y          20   194.95000   268.92211   83.50000   3.00000   830.00000

Pearson Correlation Coefficients, N = 20
Prob > |r| under H0: Rho=0
        X          Y
X       1.00000    -0.87681
                   <.0001
Y       -0.87681   1.00000
        <.0001

Spearman Correlation Coefficients, N = 20
Prob > |r| under H0: Rho=0
        X          Y
X       1.00000    -0.88969
                   <.0001
Y       -0.88969   1.00000
        <.0001
SUMMARY
1. Simple linear correlation coefficient: r
Condition: both X and Y follow the normal distribution.
2. Spearman’s rank correlation coefficient: rs
It does not require that X or Y follow the normal distribution.
Assignment
Review Exercises 5 (p. 412)