Linear Regression using R


Sungsu Lim
Applied Algorithm Lab.

KAIST


Regression

• Regression analysis answers questions about the dependencies of a response variable on one or more predictors,
• including prediction of future values of a response, discovering which predictors are important, and estimating the impact of changing a predictor or a treatment on the value of the response.
• In linear regression, models of the unknown parameters are estimated from the data using linear functions. (Usually, the conditional mean of Y given the value of X.)


Correlation Coefficient

• The correlation coefficient between two random variables X and Y is defined as

$$\rho(X, Y) = \frac{\mathrm{Cov}(X, Y)}{\sigma_X \sigma_Y}.$$

• If we have a series of n measurements of X and Y, then the sample correlation coefficient is defined as

$$r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}.$$

• It has a value between -1 and 1, and it indicates the degree of linear dependence between the variables. It detects only linear dependencies between two variables.
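As a quick check of the sample formula, the sum-based definition can be computed by hand and compared with R's built-in cor(); a minimal sketch, assuming the fueldata frame constructed on the next slide:

> x <- fueldata$Dlic; y <- fueldata$Fuel
> r <- sum((x - mean(x)) * (y - mean(y))) / sqrt(sum((x - mean(x))^2) * sum((y - mean(y))^2))
> r           # should agree with cor(x, y), about 0.4685 (see the cor() output below)
> cor(x, y)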

Example

> install.packages("alr3")  # Installing a package
> library(alr3)             # loading a package
> data(fuel2001)            # loading a specific data set
> fueldata<-fuel2001[,1:5]
> fueldata[,1]<-fuel2001$Tax
> fueldata[,2]<-1000*fuel2001$Drivers/fuel2001$Pop
> fueldata[,3]<-fuel2001$Income/1000
> fueldata[,4]<-log(fuel2001$Miles,2)
> fueldata[,5]<-1000*fuel2001$FuelC/fuel2001$Pop
> colnames(fueldata)<-c("Tax","Dlic","Income","logMiles","Fuel")


Example

> cor(fueldata)
                 Tax        Dlic      Income    logMiles       Fuel
Tax       1.00000000 -0.08584424 -0.01068494 -0.04373696 -0.2594471
Dlic     -0.08584424  1.00000000 -0.17596063  0.03059068  0.4685063
Income   -0.01068494 -0.17596063  1.00000000 -0.29585136 -0.4644050
logMiles -0.04373696  0.03059068 -0.29585136  1.00000000  0.4220323
Fuel     -0.25944711  0.46850627 -0.46440498  0.42203233  1.0000000
> round(cor(fueldata),2)
           Tax  Dlic Income logMiles  Fuel
Tax       1.00 -0.09  -0.01    -0.04 -0.26
Dlic     -0.09  1.00  -0.18     0.03  0.47
Income   -0.01 -0.18   1.00    -0.30 -0.46
logMiles -0.04  0.03  -0.30     1.00  0.42
Fuel     -0.26  0.47  -0.46     0.42  1.00
> cor(fueldata$Dlic,fueldata$Fuel)
[1] 0.4685063

Example

> pairs(fuel2001)  # scatterplot matrix of all pairs of variables


Simple Linear Regression


• We make n paired observations on two variables: $(x_1, y_1), \ldots, (x_n, y_n)$.
• The objective is to test for a linear relationship between them:

$$y_i = \underbrace{\beta_0 + \beta_1 x_i}_{\text{predictable}} + \underbrace{\varepsilon_i}_{\text{random error}}$$

• How to quantify a good fit? The least squares approach: choose $\beta = (\beta_0, \beta_1)'$ to minimize

$$L(\beta) = \sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 x_i)^2.$$

Simple Linear Regression


• $L(\beta)$ is the sum of squared errors (SSE).
• It is minimized by solving $L'(\beta) = 0$, which gives

$$\hat\beta_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}, \qquad \hat\beta_0 = \bar{y} - \hat\beta_1 \bar{x}.$$

• If we assume $\varepsilon_i \sim N(0, \sigma^2)$ i.i.d. (identically & independently distributed), then least squares also yields the MLE (maximum likelihood estimates).
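The closed-form estimates can be verified numerically against lm(); a small sketch, assuming the fueldata frame built in the earlier example:

> x <- fueldata$Dlic; y <- fueldata$Fuel
> b1 <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)  # slope estimate
> b0 <- mean(y) - b1 * mean(x)                                     # intercept estimate
> c(b0, b1)
> coef(lm(y ~ x))  # should match the hand-computed values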

Simple Linear Regression


• Assumptions of the linear model

1. Errors are normally distributed (normality of errors).
2. Error variances are equal (homoscedasticity of errors).
3. Errors are independent (independence of errors).
4. Y has a linear dependence on X.

The model: $y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$ with $\varepsilon_i \sim N(0, \sigma^2)$.

Example

> library(alr3)
> data(forbes)
> forbes
    Temp Pressure  Lpres
1  194.5    20.79 131.79
…
17 212.2    30.06 147.80
> g<-lm(Lpres ~ Temp, data=forbes)
> g

Call:
lm(formula = Lpres ~ Temp, data = forbes)

Coefficients:
(Intercept)         Temp
   -42.1378       0.8955


> plot(forbes$Temp,forbes$Lpres)
> abline(g$coeff,col="red")

Example

> summary(g)

Call:
lm(formula = Lpres ~ Temp, data = forbes)

Residuals:
     Min       1Q   Median       3Q      Max
-0.32220 -0.14473 -0.06664  0.02184  1.35978

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) -42.13778    3.34020  -12.62 2.18e-09 ***
Temp          0.89549    0.01645   54.43  < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.379 on 15 degrees of freedom
Multiple R-squared: 0.995, Adjusted R-squared: 0.9946
F-statistic: 2963 on 1 and 15 DF, p-value: < 2.2e-16
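A fitted model can also be used for the prediction mentioned at the start; a sketch using predict() on the fitted object g (Temp = 200 is an arbitrary illustrative value):

> predict(g, newdata=data.frame(Temp=200))                           # point prediction of Lpres
> predict(g, newdata=data.frame(Temp=200), interval="prediction")    # with a 95% prediction interval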

Multiple Linear Regression


• Assumptions of the linear model

1. Errors are normally distributed: $\varepsilon_i \sim N(0, \sigma^2)$.
2. Error variances are equal: $Var(Y|X) = \sigma^2$.
3. Errors are independent.
4. Y has a linear dependence on X: $E(Y|X) = \beta_0 + \beta_1 x_1 + \cdots + \beta_p x_p$.

Multiple Linear Regression


• Using the matrix representation, the model $y_j = \beta_0 + \beta_1 x_{j1} + \cdots + \beta_p x_{jp} + e_j$, $j = 1, \ldots, n$, with $e_j \sim \mathrm{NID}(0, \sigma^2)$, can be written as $Y = X\beta + e$ with $e \sim N(0, \sigma^2 I_n)$:

$$\begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{pmatrix} =
\begin{pmatrix} 1 & x_{11} & x_{12} & \cdots & x_{1p} \\ 1 & x_{21} & x_{22} & \cdots & x_{2p} \\ \vdots & \vdots & \vdots & & \vdots \\ 1 & x_{n1} & x_{n2} & \cdots & x_{np} \end{pmatrix}
\begin{pmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_p \end{pmatrix} +
\begin{pmatrix} e_1 \\ e_2 \\ \vdots \\ e_n \end{pmatrix}$$

Multiple Linear Regression


• The residual sum of squares is

$$L(\beta) = \sum_{i=1}^{n} (y_i - x_i'\beta)^2 = (Y - X\beta)'(Y - X\beta) = e'e.$$

• We can compute the $\hat\beta$ that minimizes $L(\beta)$ by using the matrix representation:

$$L(\beta) = (Y - X\beta)'(Y - X\beta) = Y'Y - \beta'X'Y - Y'X\beta + \beta'X'X\beta = Y'Y - 2\beta'X'Y + \beta'X'X\beta.$$

Setting the gradient to zero at $\hat\beta$ gives the matrix version of the normal equations,

$$\frac{\partial L(\beta)}{\partial \beta}\bigg|_{\hat\beta} = -2X'Y + 2X'X\hat\beta = 0,$$

so $\hat\beta = (X'X)^{-1}X'Y$. These are the OLS (ordinary least squares) estimates.
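The OLS formula can be evaluated directly with R's matrix operators and compared with lm(); a sketch using the fueldata example:

> X <- model.matrix(Fuel ~ ., data=fueldata)   # design matrix with an intercept column
> Y <- fueldata$Fuel
> betahat <- solve(t(X) %*% X, t(X) %*% Y)     # solves the normal equations (X'X) betahat = X'Y
> cbind(betahat, coef(lm(Fuel ~ ., data=fueldata)))   # the two columns should agree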

Multiple Linear Regression


• To minimize SSE = e'e, we must have X'e = 0 at the minimizer (the residual vector is orthogonal to every column of X).
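This orthogonality can be checked numerically on the fueldata fit; each cross-product below should be zero up to rounding error:

> m <- lm(Fuel ~ ., data=fueldata)
> t(model.matrix(m)) %*% resid(m)   # X'e: every entry should be numerically zero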

Multiple Linear Regression


• Fact: $\hat\sigma^2 = MSE = \dfrac{SSE}{n-(p+1)}$ is an unbiased estimator of $\sigma^2$.
• If $e$ is normally distributed, then

$$\frac{(n-(p+1))\,\hat\sigma^2}{\sigma^2} = \frac{SSE}{\sigma^2} \sim \chi^2(n-(p+1)).$$

• Define $SS_{reg} = SYY - SSE$ (SYY = the sum of squares of Y). As with simple regression, the coefficient of determination is

$$R^2 = \frac{SS_{reg}}{SYY} = 1 - \frac{SSE}{SYY}.$$

Its square root $R$ is called the multiple correlation coefficient because it is the maximum of the correlation between Y and any linear combination of the terms in the mean function.
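These quantities are easy to recompute from a fitted model; a sketch using the fueldata example:

> m <- lm(Fuel ~ ., data=fueldata)
> SSE <- sum(resid(m)^2)                               # residual sum of squares
> SYY <- sum((fueldata$Fuel - mean(fueldata$Fuel))^2)  # sum of squares of Y
> 1 - SSE/SYY            # R-squared, about 0.5105 (see the summary() output on the next slide)
> summary(m)$r.squared   # the same value reported by summary()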

Example

> summary(lm(Fuel~.,data=fueldata))  # How can we analyze these results?

Call:
lm(formula = Fuel ~ ., data = fueldata)

Residuals:
     Min       1Q   Median       3Q      Max
-163.145  -33.039    5.895   31.989  183.499

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) 154.1928   194.9062    0.791 0.432938
Tax          -4.2280     2.0301   -2.083 0.042873 *
Dlic          0.4719     0.1285    3.672 0.000626 ***
Income       -6.1353     2.1936   -2.797 0.007508 **
logMiles     18.5453     6.4722    2.865 0.006259 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 64.89 on 46 degrees of freedom
Multiple R-squared: 0.5105, Adjusted R-squared: 0.4679
F-statistic: 11.99 on 4 and 46 DF, p-value: 9.33e-07

The adjusted R² reported above is

$$R^2_{adj} = 1 - \frac{SSE/(n-(p+1))}{SYY/(n-1)} = 1 - (1 - R^2)\,\frac{n-1}{n-(p+1)}.$$
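The reported adjusted R² can be reproduced from the second form of the formula; here n = 51 and p = 4:

> R2 <- 0.5105; n <- 51; p <- 4
> 1 - (1 - R2) * (n - 1) / (n - (p + 1))            # about 0.4679, matching Adjusted R-squared
> summary(lm(Fuel~.,data=fueldata))$adj.r.squared   # the same value from summary()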

t-test


• We want to test $H_0: \beta_i = b_i$ versus $H_1: \beta_i \ne b_i$.
• Assume $e \sim N(0, \sigma^2 I)$; then $\hat\beta_i \sim N(\beta_i, \sigma^2 c_{ii})$, where $c_{ii} = [(X'X)^{-1}]_{ii}$.
• Since $\hat\beta_i \sim N(\beta_i, \sigma^2 c_{ii})$ and $\dfrac{RSS}{\sigma^2} = \dfrac{(n-(p+1))\,MSE}{\sigma^2} \sim \chi^2(n-(p+1))$ independently, we have

$$\frac{\hat\beta_i - \beta_i}{\sqrt{c_{ii}\,MSE}} \sim t(n-(p+1)).$$

t-test


• Hypothesis concerning one of the terms: $H_0: \beta_i = b_i$ versus $H_1: \beta_i \ne b_i$.
• t-test statistic:

$$t = \frac{\hat\beta_i - b_i}{\sqrt{c_{ii}\,MSE}}.$$

• If $H_0$ is true, $t \sim t(n-(p+1))$, so we reject $H_0$ at level $\alpha$ if $|t| > t_{\alpha/2}(n-(p+1))$.
• The confidence interval for $\beta_i$ is $\hat\beta_i \pm t_{\alpha/2}(n-(p+1))\,\sqrt{c_{ii}\,MSE}$.

Example: t-test

> summary(lm(Fuel~.,data=fueldata))

Call:
lm(formula = Fuel ~ ., data = fueldata)

Residuals:
     Min       1Q   Median       3Q      Max
-163.145  -33.039    5.895   31.989  183.499

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) 154.1928   194.9062    0.791 0.432938
Tax          -4.2280     2.0301   -2.083 0.042873 *
Dlic          0.4719     0.1285    3.672 0.000626 ***
Income       -6.1353     2.1936   -2.797 0.007508 **
logMiles     18.5453     6.4722    2.865 0.006259 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 64.89 on 46 degrees of freedom
Multiple R-squared: 0.5105, Adjusted R-squared: 0.4679
F-statistic: 11.99 on 4 and 46 DF, p-value: 9.33e-07
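The t value column is just Estimate divided by Std. Error, and the confidence intervals follow the formula on the previous slide; a sketch checking the Tax row:

> g <- lm(Fuel ~ ., data=fueldata)
> -4.2280 / 2.0301        # t value for Tax, about -2.083
> 2 * pt(-2.083, df=46)   # two-sided p-value, about 0.043
> confint(g)              # 95% CIs; the interval for Tax excludes 0, consistent with p < 0.05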

F-test


• We refer to the full model with all the predictors as the complete model. A model containing only some of these predictors is called the reduced model (nested within the complete model).
• Testing whether the complete model is identical to the reduced model is equivalent to testing whether the extra parameters in the complete model equal 0 (i.e., none of the extra variables increases the explained variability in Y).

F-test


• We may assume:

full: $E(Y|X) = \beta_0 + \beta_1 x_1 + \cdots + \beta_r x_r + \cdots + \beta_p x_p$
reduced: $E(Y|X) = \beta_0 + \beta_1 x_1 + \cdots + \beta_r x_r$

• Hypothesis test for the reduced model: $H_0: \beta_{r+1} = \cdots = \beta_p = 0$ versus $H_1$: not $H_0$.
• When $H_0$ is true, $SSE_R \ge SSE_F$ and

$$\frac{SSE_F}{\sigma^2} \sim \chi^2(n-(p+1)), \qquad \frac{SSE_R}{\sigma^2} \sim \chi^2(n-(r+1)), \qquad \frac{SSE_R - SSE_F}{\sigma^2} \sim \chi^2(p-r).$$

F-test


• Hypothesis test for the reduced model: $H_0: \beta_{r+1} = \cdots = \beta_p = 0$ versus $H_1$: not $H_0$.
• F test statistic:

$$F = \frac{(SSE_R - SSE_F)/(p-r)}{SSE_F/(n-(p+1))}.$$

• If $H_0$ is true, $F \sim F(p-r,\, n-(p+1))$, so we reject $H_0$ at level $\alpha$ if $F > F_\alpha(p-r,\, n-(p+1))$.
• From this test, we conclude whether the hypotheses are plausible and decide which model is adequate.
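The partial F-statistic can be computed by hand from the two residual sums of squares (deviance() returns the RSS of an lm fit); for a single dropped coefficient, F equals the square of its t value:

> g1 <- lm(Fuel ~ . - Tax, data=fueldata)   # reduced model
> g2 <- lm(Fuel ~ ., data=fueldata)         # full model
> Fstat <- ((deviance(g1) - deviance(g2)) / 1) / (deviance(g2) / df.residual(g2))
> Fstat                                     # about 4.337; compare (-2.083)^2, about 4.34
> pf(Fstat, 1, df.residual(g2), lower.tail=FALSE)   # p-value, about 0.043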

Example: F-test

> summary(lm(Fuel~.,data=fueldata))

Call:
lm(formula = Fuel ~ ., data = fueldata)

Residuals:
     Min       1Q   Median       3Q      Max
-163.145  -33.039    5.895   31.989  183.499

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) 154.1928   194.9062    0.791 0.432938
Tax          -4.2280     2.0301   -2.083 0.042873 *
Dlic          0.4719     0.1285    3.672 0.000626 ***
Income       -6.1353     2.1936   -2.797 0.007508 **
logMiles     18.5453     6.4722    2.865 0.006259 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 64.89 on 46 degrees of freedom
Multiple R-squared: 0.5105, Adjusted R-squared: 0.4679
F-statistic: 11.99 on 4 and 46 DF, p-value: 9.33e-07

ANOVA


• In the analysis of variance, the mean function with all the terms, $E(Y|X) = \beta_0 + \beta_1 x_1 + \cdots + \beta_p x_p$, is compared with the mean function that includes only an intercept, $E(Y|X=x) = \beta_0$.
• For the second case, $\hat\beta_0 = \bar{y}$ and the residual sum of squares is SYY.
• We have $SSE \le SYY$, and the difference between these two, $SS_{reg} = SYY - SSE$, is the variability explained by the larger mean function that is not explained by the smaller mean function.

ANOVA


• By the F-test, we measure the goodness of fit of the regression model.

Source       df        SS      MS                       F            P-value
Regression   p         SSreg   MSreg = SSreg / p        MSreg / MSE
Residual     n-(p+1)   SSE     MSE = SSE / (n-(p+1))
Total        n-1       SYY
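For the overall F-test reported by summary(), F can also be written in terms of R²; a quick arithmetic check with R² = 0.5105, n = 51, p = 4:

> R2 <- 0.5105; n <- 51; p <- 4
> (R2 / p) / ((1 - R2) / (n - (p + 1)))   # about 11.99, matching the reported F-statistic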

Example: ANOVA

> g1<-lm(Fuel~.-Tax,data=fueldata)
> g2<-lm(Fuel~.,data=fueldata)
> anova(g1,g2)  # Reduced model vs Full model
Analysis of Variance Table

Model 1: Fuel ~ (Tax + Dlic + Income + logMiles) - Tax
Model 2: Fuel ~ Tax + Dlic + Income + logMiles
  Res.Df    RSS Df Sum of Sq      F  Pr(>F)
1     47 211964
2     46 193700  1     18264 4.3373 0.04287 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

• Since the p-value is 0.04 (< 0.05), we have modest evidence that the coefficient for Tax is different from 0. This is called a partial F-test.


Here the hypotheses are $NH: \beta_{Tax} = 0$ versus $AH$: not $NH$.

Example: sequential ANOVA


> f0<-lm(Fuel~1,data=fueldata)
> f1<-lm(Fuel~Dlic,data=fueldata)
> f2<-lm(Fuel~Dlic+Tax,data=fueldata)
> f3<-lm(Fuel~Dlic+Tax+Income,data=fueldata)
> f4<-lm(Fuel~.,data=fueldata)

> anova(f0,f1,f2,f3,f4)
Analysis of Variance Table

Model 1: Fuel ~ 1
Model 2: Fuel ~ Dlic
Model 3: Fuel ~ Dlic + Tax
Model 4: Fuel ~ Dlic + Tax + Income
Model 5: Fuel ~ Tax + Dlic + Income + logMiles
  Res.Df    RSS Df Sum of Sq       F    Pr(>F)
1     50 395694
2     49 308840  1     86854 20.6262 4.019e-05 ***
3     48 289681  1     19159  4.5498 0.0382899 *
4     47 228273  1     61408 14.5833 0.0003997 ***
5     46 193700  1     34573  8.2104 0.0062592 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
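The same sequential (Type I) table can be produced in one call by passing a single model to anova(); note that the sums of squares depend on the order in which the terms enter the formula:

> anova(lm(Fuel ~ Dlic + Tax + Income + logMiles, data=fueldata))  # sequential SS, one term at a time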

Variable Selection


• Usually, we don’t expect every candidate predictor to be related to the response. We want to identify a useful subset of the variables:
▫ Forward selection
▫ Backward elimination
▫ Stepwise method
• Using the F-test, we can add variables to or remove variables from the model. The procedure ends when none of the candidate variables has a p-value smaller than the pre-specified value. (A sketch with R's step() follows below.)
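Base R implements these searches with step(), which scores models by AIC rather than by F-test p-values; a sketch of all three directions on the fueldata example:

> null <- lm(Fuel ~ 1, data=fueldata)                    # intercept-only model
> full <- lm(Fuel ~ ., data=fueldata)                    # all candidate predictors
> step(null, scope=formula(full), direction="forward")   # forward selection
> step(full, direction="backward")                       # backward elimination
> step(null, scope=formula(full), direction="both")      # stepwise method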

Multicollinearity


• When the independent variables are correlated among themselves, multicollinearity among them is said to exist.
• Estimated regression coefficients vary widely when the independent variables are highly correlated.
• Variance Inflation Factor (VIF): a diagnostic for multicollinearity, whose symptoms include large changes in the estimated regression coefficients when a predictor variable is added or deleted, or when an observation is altered or deleted.

Multicollinearity


• $VIF_j = \dfrac{1}{1 - R_j^2}$, where $R_j^2$ is the coefficient of determination when $X_j$ is regressed on the other X variables.
• VIFs measure how much the variances of the estimated regression coefficients are inflated compared to when the predictor variables are not linearly related.
• Generally, a maximum VIF value in excess of 5~10 is taken as an indication of multicollinearity.

> vif(lm(Fuel~.,data=fueldata))
     Tax     Dlic   Income logMiles
1.010786 1.040992 1.132311 1.099395
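Each reported VIF can be reproduced from its definition; a sketch for Tax:

> rsq <- summary(lm(Tax ~ Dlic + Income + logMiles, data=fueldata))$r.squared   # R_j^2 for Tax
> 1 / (1 - rsq)   # should agree with the vif() value for Tax, about 1.0108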

Model Assumption


> par(mfrow=c(2,2))
> plot(lm(Fuel~.,data=fueldata))

Check the model assumptions:
1. Is the model linear?
2. Are the errors normally distributed?
3. Are the error variances equal?
4. Are there values with a strong influence on the fitted equation? => detect outliers

Residual Analysis


> result=lm(Fuel~.,data=fueldata)
> plot(resid(result))
> line1=sd(resid(result))*1.96
> line2=sd(resid(result))*-1.96
> abline(line1,0)
> abline(line2,0)
> abline(0,0)
> par(mfrow=c(1,2))
> boxplot(resid(result))
> hist(resid(result))

Variable Transformation


• Box-Cox transformation: fit the transformed model

$$Y^{(\lambda)} = \beta_0 + \beta_1 x_1 + \cdots + \beta_p x_p$$

and select the $\lambda$ minimizing SSE. (Generally, it is between -2 and 2.)

> boxcox(lm(Fuel~.,data=fueldata))
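boxcox() lives in the MASS package (load it explicitly if needed) and returns a grid of λ values with their profile log-likelihoods; maximizing that log-likelihood corresponds to the SSE criterion above. A sketch for reading off the best λ:

> library(MASS)                 # boxcox() is provided by MASS
> bc <- boxcox(lm(Fuel ~ ., data=fueldata))
> bc$x[which.max(bc$y)]         # lambda with the highest profile log-likelihood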

Regression Procedure


• Preprocessing
▫ Explore the data
▫ Check the independence of the explanatory variables (e.g., via multicollinearity diagnostics)
▫ Check the normality of the error term
▫ Check the linearity of the data and the homoscedasticity of the errors via residual analysis
▫ Apply a Box-Cox transformation if needed
▫ Check for outliers

• Regression analysis
▫ Examine scatterplots and the covariance (or correlation) matrix
▫ Fit the full model and run a t-test on each variable
▫ Compare several variable selection methods and choose the best set of variables
▫ Interpret the model
▫ Present the final candidate model
