Upload
tranthuy
View
220
Download
1
Embed Size (px)
Citation preview
Chapter 10
상관 분석
Correlation analysis
상관모형 (Correlation)
• 전형적인 회귀모형에서 X는 고정(fixed non-random)
Y는 확률적 (random)
• X, Y가 동시에 확률적 상관모형
(If both X & Y are random -> bivariate normal dist’n)
이변량 정규분포2 2
2
| |
2
| |
( , ) ~ ( , , , , )
| ~ ( , )
| ~ ( , )
x y x y
x y x y
y x y x
X Y BN
X Y y N
Y X x N
• 는 모집단 상관계수로 다음과 같이 정의 된다.
(population correlation coefficeint is defined by)
( , ) [( )( )]X Y
X Y X Y
Cov X Y E X Y
X
Y
…..
..
X
Y
그림 10.1.1 이변량 정규분포bivariate normal dist’n
(a)는 이변량정규분포의 결합밀도함수를 보여준다.(b)는 이변량 정규분포에서 X가 주어졌을 때 Y의 부분모집단의 주변분포함수를 보여주며, 이때 부분모집단은 정규분포를 따른다.(c)는 Y가 주어졌을 때 X의 부분모집단의 분포를 보여주며, (b)와 마찬가지로 이 경우에도 정규분포를 따른다.
(a) (b) (c)
10.2 상관계수(Correlation Coefficient)
births<-read.csv('E:\\kim\\yes\\myweb\\int\\2018\\data\\birth.csv',header=T)
births$bweight #birth weight of baby
births$gestwks #gestation period
model<-lm(births$bweight~births$gestwks)
summary(model)
cor.test(births$bweight,births$gestwks)
•Sample correlation (r2) and the coefficient of determination (r)
2
1
2 2 2 2
2 2 2 2
/ ( 1)
( ) / / ( 1) ( ) / / ( 1)
( ) ( )
i i i i
i i i i
i i i i
i i i i
r r
x y n x y nr
x x n n y y n n
n x y x y
n x x n y y
표본상관계수 : 관계수모집단 상 의 추정치
• If
2
2~ ( 2)
1
nt r t n
r
0
•When is not 0
① Fisher’s method ( n>25)1 1
ln2 1
~ (0 , 1)1 3
r
r
rz
r
z zz N
n
r=0.872, n=203 H0 :ρ=0.80 vs HA :ρ≠0.80
𝑟 = 0.872 -> 𝑧𝑟=1.34137 by (10.2.4) 𝜌𝑜 = 0.8, 𝑧𝜌𝑜 = 1.09861
𝑍 =1.34137 − 1.09861
1/ 203 − 3= 3.43
-> Reject Ho
*
*
* ** * *
3
4
1
1
test stat : ( ) 11 1
(3 )where :
4
rr
z
z rz z
n
n
zz z n
n
z
n
=z
Hotelling’s method (10≤n≤25)
의 신뢰구간 (CI)
𝑧𝑟 ± 𝑧1−𝛼/2 ∙1
𝑛−3
𝑧𝜌에 대한 95% 신뢰구간 (CI)
1.34137 ± 1.96 ∙1
203−3⟹ (1.20277, 1.47996)
𝒛𝒓 𝒓
1.20277 0.834
1.47996 0.901
> tanh(1.20277)[1] 0.8344976> tanh(1.47996)[1] 0.9014605> library(psych)> fisherz(0.834)[1] 1.201133> fisherz(0.8344976)[1] 1.20277> fisherz(0.9014605)[1] 1.47996
10.3 Some precautions
1. Checking assumptions(Normality, linear relations, homogeneity, etc.)
2. Two variables from a same subject
3. Strong association does not necessarily mean causal effect.
– X ≈Y : X->Y or Y->X or
A->X & A->Y or by chance
4. Be cautious on extrapolation (외연법)
•외연법 (extrapolation)
# regg.Rregg<-read.table(header=TRUE, text="xx yy 50 6155 6160 5965 7170 8075 7680 9085 10690 9895 100100 114")attach(regg)result<-lm(yy~xx) summary(result)plot(xx,yy,xlim=c(0,100),ylim=c(0,120))abline(result)p<-as.data.frame(predict(result, level=0.95, interval="confidence"))lines(cbind(xx,p$lwr), col="blue", lty="dashed")lines(cbind(xx,p$upr), col="blue", lty="dashed")
plot(xx,yy,xlim=c(0,100),ylim=c(0,120))abline(result)pp<-as.data.frame(predict(result, level=0.95, interval="prediction"))lines(cbind(xx,pp$lwr), col="red", lty="dashed")lines(cbind(xx,pp$upr), col="red", lty="dashed")
SAS example corr.sas
/* file corr.sas
SAS example for Chapter 8 */
data corr;
input method1 method2;
x2=method1**2;
y2=method2**2;
xy=method1*method2 ;
cards;
132 130
138 134
144 132
146 140
148 150
152 144
158 150
130 122
162 160
168 150
172 160
174 178
180 168
180 174
188 186
194 172
194 182
200 178
200 196
204 188
210 180
210 196
216 210
220 190
220 202
;run;
* To compare sums and squared sum
on Table 8.7.2 ;
proc means sum;
var method1 method2 x2 y2 xy;
run;
proc plot;
plot method2*method1;
run;
proc corr;
var method1 method2 ;
run;
proc corr fisher(rho0=0.98);var method1 method2 ;run;
R examplecorr<-
read.table(header=TRUE,
text="
method1 method2
132 130
138 134
144 132
146 140
148 150
152 144
158 150
130 122
162 160
168 150
172 160
174 178
180 168
180 174
188 186
194 172
194 182
200 178
200 196
204 188
210 180
210 196
216 210
220 190
220 202
")
attach(corr)
cor.test(method1,method2)
10.4 중상관모형multiple correlation model
Both x and y are random variablesx and y ~ multivariate normalMultiple correlation can be used to see the correlation between them.
0 1 1i j k kj jy x x
x와 y가 모수확률변수일 때
x와 y가 다변량 정규분포일 때
Y와 Xi와의 상관정도 중상관계수
보기 10.4.1> data <-read.csv('E:\\kim\\yes\\myweb\\int\\2018\\data\\example_10_4_1.csv',header=T)
> summary(lm(y~x1+x2,data=data))
Call:
lm(formula = y ~ x1 + x2, data = data)
Residuals:
Min 1Q Median 3Q Max
-62.98 -31.45 -9.12 15.25 258.87
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 34.81450 12.35254 2.818 0.0077 **
x1 34.53919 22.26469 1.551 0.1293
x2 0.04143 0.01259 3.291 0.0022 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 55.15 on 37 degrees of freedom
Multiple R-squared: 0.3269, Adjusted R-squared: 0.2905
F-statistic: 8.984 on 2 and 37 DF, p-value: 0.0006602
𝑅𝑦.122 = 0.3269 -> R𝑦.12 = R𝑦.12
2 = 0.572
Test: 𝐹 =𝑅𝑦.12⋯𝑘2
1−𝑅𝑦.12⋯𝑘2 ∙
𝑛−𝑘−1
𝑘
𝐹 =0.327
1−0.327∙40−2−1
2= 8.9848 > qf(0.05,2,37,lower.tail=F)=3.252
유의수준 0.05에서 영가설 𝐻0 ∶ 𝛽1 = 𝛽2 = 0을 기각하고
모집단에서 𝑌, 𝑋1, 𝑋2 사이에 선형성이 존재한다고 결론 내림
/* file mcorr.sas
SAS example for Table 9.7.1
*/
data mcorr;
input chol weight sbp;
cards;
162.2 51.0 108
158.0 52.9 111
157.0 56.0 115
155.0 56.5 116
156.0 58.0 117
154.1 60.1 120
169.1 58.0 124
181.0 61.0 127
174.9 59.4 122
180.2 56.1 121
174.0 61.2 125
;run;
proc plot;
plot chol*weight chol*sbp
sbp*weight;
run;
proc corr;
var chol weight sbp ;
run;
proc corr;
var chol weight ;
partial sbp ;
run;
proc corr;
var chol sbp ;
partial weight ;
run;
proc corr;
var weight sbp ;
partial chol ;
run;
부분상관계수 (Partial Corr. Coef)
• 다른 변수의 효과를 제어한 상태에서의 관계조사
• Linear association after controlling for other covariates
1.2 2
1
. . :y
e g r X
Y X
를 상수로 하고,
와 과의 상관성을 측정하는 부분상관계수
1.2 2
1
. . : when y
e g r X
Y X
is a constant (is fixed to a same value),
partial corr. coef. bewteen and
0 1.2...
21.2...1.2...
: 0
1
1
y k
y ky k
H
n kt r
r
𝑋2를 통제한 후 𝑌와 𝑋1 사이의 편상관 𝑟𝑦1.2 =(𝑟𝑦1−𝑟𝑦2𝑟12)
(1−𝑟𝑦22 )(1−𝑟12
2 )
𝑋1를 통제한 후 𝑌와 𝑋2 사이의 편상관
𝑟𝑦2.1 =(𝑟𝑦2−𝑟𝑦1𝑟12)
(1−𝑟𝑦12 )(1−𝑟12
2 )
𝑌를 통제한 후 𝑋1과 𝑋2 사이의 편상관 𝑋𝑟12.𝑦 =(𝑟12−𝑟𝑦1𝑟𝑦2)
(1−𝑟𝑦12 )(1−𝑟𝑦2
2 )
> data<-read.csv("E:\\kim\\yes\\myweb\\int\\2018\\data\\example_10_4_1.csv",header=T)> cor(data)
y x1 x2y 1.0000000 0.3604168 0.5320713x1 0.3604168 1.0000000 0.3025645x2 0.5320713 0.3025645 1.0000000> library(ppcor)> pcor.test(data$y,data$x1,data$x2)
estimate p.value statistic n gp Method1 0.2471221 0.1293425 1.551299 40 1 pearson> pcor.test(data$x1,data$x2,data$y)
estimate p.value statistic n gp Method1 0.1402861 0.3943218 0.8618496 40 1 pearson> pcor.test(data$y,data$x2,data$x1)
estimate p.value statistic n gp Method1 0.4758025 0.002202508 3.290531 40 1 pearson>
pair-wise plot by R ## pair.r
## put histograms on the diagonal
panel.hist <- function(x, ...)
{
usr <- par("usr"); on.exit(par(usr))
par(usr = c(usr[1:2], 0, 1.5) )
h <- hist(x, plot = FALSE)
breaks <- h$breaks; nB <- length(breaks)
y <- h$counts; y <- y/max(y)
rect(breaks[-nB], 0, breaks[-1], y, col="cyan", ...)
}
## put (absolute) correlations on the upper panels,
## with size proportional to the correlations.
panel.cor <- function(x, y, digits=2, prefix="", cex.cor, ...)
{
usr <- par("usr"); on.exit(par(usr))
par(usr = c(0, 1, 0, 1))
r <- abs(cor(x, y))
txt <- format(c(r, 0.123456789), digits=digits)[1]
txt <- paste(prefix, txt, sep="")
if(missing(cex.cor)) cex.cor <- 0.8/strwidth(txt)
text(0.5, 0.5, txt, cex = cex.cor * r)
}
?swiss
summary(swiss)
cor(swiss)
pairs(swiss)
pairs(swiss, lower.panel=panel.smooth, upper.panel=panel.cor)
pairs(swiss, lower.panel=panel.smooth, upper.panel=panel.cor, diag.panel=panel.hist)