Linear Regression using R


Sungsu Lim
Applied Algorithm Lab.

KAIST


Regression

• Regression analysis answers questions about the dependencies of a response variable on one or more predictors,
• including prediction of future values of a response, discovering which predictors are important, and estimating the impact of changing a predictor or a treatment on the value of the response.
• In linear regression, models of the unknown parameters are estimated from the data using linear functions. (Usually, the conditional mean of Y given the value of X.)


Correlation Coefficient

• The correlation coefficient between two random variables X and Y is defined as

$$\rho(X, Y) = \frac{\mathrm{Cov}(X, Y)}{\sigma_X \sigma_Y}.$$

• If we have a series of n measurements of X and Y, then the sample correlation coefficient is defined as

$$r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}.$$

• It has a value between -1 and 1, and it indicates the degree of linear dependence between the variables. It detects only linear dependencies between two variables.
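As a quick check of the sample formula, the sum-based definition can be computed by hand and compared with R's built-in cor(); a minimal sketch, assuming the fueldata frame constructed on the next slide:

> x <- fueldata$Dlic; y <- fueldata$Fuel
> r <- sum((x - mean(x)) * (y - mean(y))) / sqrt(sum((x - mean(x))^2) * sum((y - mean(y))^2))
> r           # should agree with cor(x, y), about 0.4685 (see the cor() output below)
> cor(x, y)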

Example

> install.packages("alr3")  # Installing a package
> library(alr3)             # loading a package
> data(fuel2001)            # loading a specific data set
> fueldata<-fuel2001[,1:5]
> fueldata[,1]<-fuel2001$Tax
> fueldata[,2]<-1000*fuel2001$Drivers/fuel2001$Pop
> fueldata[,3]<-fuel2001$Income/1000
> fueldata[,4]<-log(fuel2001$Miles,2)
> fueldata[,5]<-1000*fuel2001$FuelC/fuel2001$Pop
> colnames(fueldata)<-c("Tax","Dlic","Income","logMiles","Fuel")


Example

> cor(fueldata)
                 Tax        Dlic      Income    logMiles       Fuel
Tax       1.00000000 -0.08584424 -0.01068494 -0.04373696 -0.2594471
Dlic     -0.08584424  1.00000000 -0.17596063  0.03059068  0.4685063
Income   -0.01068494 -0.17596063  1.00000000 -0.29585136 -0.4644050
logMiles -0.04373696  0.03059068 -0.29585136  1.00000000  0.4220323
Fuel     -0.25944711  0.46850627 -0.46440498  0.42203233  1.0000000
> round(cor(fueldata),2)
           Tax  Dlic Income logMiles  Fuel
Tax       1.00 -0.09  -0.01    -0.04 -0.26
Dlic     -0.09  1.00  -0.18     0.03  0.47
Income   -0.01 -0.18   1.00    -0.30 -0.46
logMiles -0.04  0.03  -0.30     1.00  0.42
Fuel     -0.26  0.47  -0.46     0.42  1.00
> cor(fueldata$Dlic,fueldata$Fuel)
[1] 0.4685063

Example

> pairs(fuel2001)  # scatterplot matrix of all pairs of variables


Simple Linear Regression


• We make n paired observations on two variables: $(x_1, y_1), \ldots, (x_n, y_n)$.
• The objective is to test for a linear relationship between them:

$$y_i = \underbrace{\beta_0 + \beta_1 x_i}_{\text{predictable}} + \underbrace{\varepsilon_i}_{\text{random error}}$$

• How to quantify a good fit? The least squares approach: choose $\beta = (\beta_0, \beta_1)'$ to minimize

$$L(\beta) = \sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 x_i)^2.$$

Simple Linear Regression


• $L(\beta)$ is the sum of squared errors (SSE).
• It is minimized by solving $L'(\beta) = 0$, which gives

$$\hat\beta_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}, \qquad \hat\beta_0 = \bar{y} - \hat\beta_1 \bar{x}.$$

• If we assume $\varepsilon_i \sim N(0, \sigma^2)$ i.i.d. (identically & independently distributed), then least squares also yields the MLE (maximum likelihood estimates).
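The closed-form estimates can be verified numerically against lm(); a small sketch, assuming the fueldata frame built in the earlier example:

> x <- fueldata$Dlic; y <- fueldata$Fuel
> b1 <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)  # slope estimate
> b0 <- mean(y) - b1 * mean(x)                                     # intercept estimate
> c(b0, b1)
> coef(lm(y ~ x))  # should match the hand-computed values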

Simple Linear Regression


• Assumptions of the linear model

1. Errors are normally distributed (normality of errors).
2. Error variances are equal (homoscedasticity of errors).
3. Errors are independent (independence of errors).
4. Y has a linear dependence on X.

The model: $y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$ with $\varepsilon_i \sim N(0, \sigma^2)$.

Example

> library(alr3)
> data(forbes)
> forbes
    Temp Pressure  Lpres
1  194.5    20.79 131.79
…
17 212.2    30.06 147.80
> g<-lm(Lpres ~ Temp, data=forbes)
> g

Call:
lm(formula = Lpres ~ Temp, data = forbes)

Coefficients:
(Intercept)         Temp
   -42.1378       0.8955


> plot(forbes$Temp,forbes$Lpres)
> abline(g$coeff,col="red")

Example

> summary(g)

Call:
lm(formula = Lpres ~ Temp, data = forbes)

Residuals:
     Min       1Q   Median       3Q      Max
-0.32220 -0.14473 -0.06664  0.02184  1.35978

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) -42.13778    3.34020  -12.62 2.18e-09 ***
Temp          0.89549    0.01645   54.43  < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.379 on 15 degrees of freedom
Multiple R-squared: 0.995, Adjusted R-squared: 0.9946
F-statistic: 2963 on 1 and 15 DF, p-value: < 2.2e-16
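A fitted model can also be used for the prediction mentioned at the start; a sketch using predict() on the fitted object g (Temp = 200 is an arbitrary illustrative value):

> predict(g, newdata=data.frame(Temp=200))                           # point prediction of Lpres
> predict(g, newdata=data.frame(Temp=200), interval="prediction")    # with a 95% prediction interval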

Multiple Linear Regression


• Assumptions of the linear model

1. Errors are normally distributed: $\varepsilon_i \sim N(0, \sigma^2)$.
2. Error variances are equal: $Var(Y|X) = \sigma^2$.
3. Errors are independent.
4. Y has a linear dependence on X: $E(Y|X) = \beta_0 + \beta_1 x_1 + \cdots + \beta_p x_p$.

Multiple Linear Regression


• Using the matrix representation, the model $y_j = \beta_0 + \beta_1 x_{j1} + \cdots + \beta_p x_{jp} + e_j$, $j = 1, \ldots, n$, with $e_j \sim \mathrm{NID}(0, \sigma^2)$, can be written as $Y = X\beta + e$ with $e \sim N(0, \sigma^2 I_n)$:

$$\begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{pmatrix} =
\begin{pmatrix} 1 & x_{11} & x_{12} & \cdots & x_{1p} \\ 1 & x_{21} & x_{22} & \cdots & x_{2p} \\ \vdots & \vdots & \vdots & & \vdots \\ 1 & x_{n1} & x_{n2} & \cdots & x_{np} \end{pmatrix}
\begin{pmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_p \end{pmatrix} +
\begin{pmatrix} e_1 \\ e_2 \\ \vdots \\ e_n \end{pmatrix}$$

Multiple Linear Regression


• The residual sum of squares is

$$L(\beta) = \sum_{i=1}^{n} (y_i - x_i'\beta)^2 = (Y - X\beta)'(Y - X\beta) = e'e.$$

• We can compute the $\hat\beta$ that minimizes $L(\beta)$ by using the matrix representation:

$$L(\beta) = (Y - X\beta)'(Y - X\beta) = Y'Y - \beta'X'Y - Y'X\beta + \beta'X'X\beta = Y'Y - 2\beta'X'Y + \beta'X'X\beta.$$

Setting the gradient to zero at $\hat\beta$ gives the matrix version of the normal equations,

$$\frac{\partial L(\beta)}{\partial \beta}\bigg|_{\hat\beta} = -2X'Y + 2X'X\hat\beta = 0,$$

so $\hat\beta = (X'X)^{-1}X'Y$. These are the OLS (ordinary least squares) estimates.
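The OLS formula can be evaluated directly with R's matrix operators and compared with lm(); a sketch using the fueldata example:

> X <- model.matrix(Fuel ~ ., data=fueldata)   # design matrix with an intercept column
> Y <- fueldata$Fuel
> betahat <- solve(t(X) %*% X, t(X) %*% Y)     # solves the normal equations (X'X) betahat = X'Y
> cbind(betahat, coef(lm(Fuel ~ ., data=fueldata)))   # the two columns should agree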

Multiple Linear Regression


• To minimize SSE = e'e, we must have X'e = 0 at the minimizer (the residual vector is orthogonal to every column of X).
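This orthogonality can be checked numerically on the fueldata fit; each cross-product below should be zero up to rounding error:

> m <- lm(Fuel ~ ., data=fueldata)
> t(model.matrix(m)) %*% resid(m)   # X'e: every entry should be numerically zero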

Multiple Linear Regression


• Fact: $\hat\sigma^2 = MSE = \dfrac{SSE}{n-(p+1)}$ is an unbiased estimator of $\sigma^2$.
• If $e$ is normally distributed, then

$$\frac{(n-(p+1))\,\hat\sigma^2}{\sigma^2} = \frac{SSE}{\sigma^2} \sim \chi^2(n-(p+1)).$$

• Define $SS_{reg} = SYY - SSE$ (SYY = the sum of squares of Y). As with simple regression, the coefficient of determination is

$$R^2 = \frac{SS_{reg}}{SYY} = 1 - \frac{SSE}{SYY}.$$

Its square root $R$ is called the multiple correlation coefficient because it is the maximum of the correlation between Y and any linear combination of the terms in the mean function.
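These quantities are easy to recompute from a fitted model; a sketch using the fueldata example:

> m <- lm(Fuel ~ ., data=fueldata)
> SSE <- sum(resid(m)^2)                               # residual sum of squares
> SYY <- sum((fueldata$Fuel - mean(fueldata$Fuel))^2)  # sum of squares of Y
> 1 - SSE/SYY            # R-squared, about 0.5105 (see the summary() output on the next slide)
> summary(m)$r.squared   # the same value reported by summary()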

Example

> summary(lm(Fuel~.,data=fueldata))  # How can we analyze these results?

Call:
lm(formula = Fuel ~ ., data = fueldata)

Residuals:
     Min       1Q   Median       3Q      Max
-163.145  -33.039    5.895   31.989  183.499

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) 154.1928   194.9062    0.791 0.432938
Tax          -4.2280     2.0301   -2.083 0.042873 *
Dlic          0.4719     0.1285    3.672 0.000626 ***
Income       -6.1353     2.1936   -2.797 0.007508 **
logMiles     18.5453     6.4722    2.865 0.006259 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 64.89 on 46 degrees of freedom
Multiple R-squared: 0.5105, Adjusted R-squared: 0.4679
F-statistic: 11.99 on 4 and 46 DF, p-value: 9.33e-07

The adjusted R² reported above is

$$R^2_{adj} = 1 - \frac{SSE/(n-(p+1))}{SYY/(n-1)} = 1 - (1 - R^2)\,\frac{n-1}{n-(p+1)}.$$
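The reported adjusted R² can be reproduced from the second form of the formula; here n = 51 and p = 4:

> R2 <- 0.5105; n <- 51; p <- 4
> 1 - (1 - R2) * (n - 1) / (n - (p + 1))            # about 0.4679, matching Adjusted R-squared
> summary(lm(Fuel~.,data=fueldata))$adj.r.squared   # the same value from summary()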

t-test


• We want to test $H_0: \beta_i = b_i$ versus $H_1: \beta_i \ne b_i$.
• Assume $e \sim N(0, \sigma^2 I)$; then $\hat\beta_i \sim N(\beta_i, \sigma^2 c_{ii})$, where $c_{ii} = [(X'X)^{-1}]_{ii}$.
• Since $\hat\beta_i \sim N(\beta_i, \sigma^2 c_{ii})$ and $\dfrac{RSS}{\sigma^2} = \dfrac{(n-(p+1))\,MSE}{\sigma^2} \sim \chi^2(n-(p+1))$ independently, we have

$$\frac{\hat\beta_i - \beta_i}{\sqrt{c_{ii}\,MSE}} \sim t(n-(p+1)).$$

t-test


• Hypothesis concerning one of the terms: $H_0: \beta_i = b_i$ versus $H_1: \beta_i \ne b_i$.
• t-test statistic:

$$t = \frac{\hat\beta_i - b_i}{\sqrt{c_{ii}\,MSE}}.$$

• If $H_0$ is true, $t \sim t(n-(p+1))$, so we reject $H_0$ at level $\alpha$ if $|t| > t_{\alpha/2}(n-(p+1))$.
• The confidence interval for $\beta_i$ is $\hat\beta_i \pm t_{\alpha/2}(n-(p+1))\,\sqrt{c_{ii}\,MSE}$.

Example: t-test

> summary(lm(Fuel~.,data=fueldata))

Call:
lm(formula = Fuel ~ ., data = fueldata)

Residuals:
     Min       1Q   Median       3Q      Max
-163.145  -33.039    5.895   31.989  183.499

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) 154.1928   194.9062    0.791 0.432938
Tax          -4.2280     2.0301   -2.083 0.042873 *
Dlic          0.4719     0.1285    3.672 0.000626 ***
Income       -6.1353     2.1936   -2.797 0.007508 **
logMiles     18.5453     6.4722    2.865 0.006259 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 64.89 on 46 degrees of freedom
Multiple R-squared: 0.5105, Adjusted R-squared: 0.4679
F-statistic: 11.99 on 4 and 46 DF, p-value: 9.33e-07
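The t value column is just Estimate divided by Std. Error, and the confidence intervals follow the formula on the previous slide; a sketch checking the Tax row:

> g <- lm(Fuel ~ ., data=fueldata)
> -4.2280 / 2.0301        # t value for Tax, about -2.083
> 2 * pt(-2.083, df=46)   # two-sided p-value, about 0.043
> confint(g)              # 95% CIs; the interval for Tax excludes 0, consistent with p < 0.05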

F-test


• We refer to the full model with all the predictors as the complete model. A model containing only some of these predictors is called the reduced model (nested within the complete model).
• Testing whether the complete model is identical to the reduced model is equivalent to testing whether the extra parameters in the complete model equal 0 (i.e., none of the extra variables increases the explained variability in Y).

F-test


• We may assume:

full: $E(Y|X) = \beta_0 + \beta_1 x_1 + \cdots + \beta_r x_r + \cdots + \beta_p x_p$
reduced: $E(Y|X) = \beta_0 + \beta_1 x_1 + \cdots + \beta_r x_r$

• Hypothesis test for the reduced model: $H_0: \beta_{r+1} = \cdots = \beta_p = 0$ versus $H_1$: not $H_0$.
• When $H_0$ is true, $SSE_R \ge SSE_F$ and

$$\frac{SSE_F}{\sigma^2} \sim \chi^2(n-(p+1)), \qquad \frac{SSE_R}{\sigma^2} \sim \chi^2(n-(r+1)), \qquad \frac{SSE_R - SSE_F}{\sigma^2} \sim \chi^2(p-r).$$

F-test


• Hypothesis test for the reduced model: $H_0: \beta_{r+1} = \cdots = \beta_p = 0$ versus $H_1$: not $H_0$.
• F test statistic:

$$F = \frac{(SSE_R - SSE_F)/(p-r)}{SSE_F/(n-(p+1))}.$$

• If $H_0$ is true, $F \sim F(p-r,\, n-(p+1))$, so we reject $H_0$ at level $\alpha$ if $F > F_\alpha(p-r,\, n-(p+1))$.
• From this test, we conclude whether the hypotheses are plausible and decide which model is adequate.
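The partial F-statistic can be computed by hand from the two residual sums of squares (deviance() returns the RSS of an lm fit); for a single dropped coefficient, F equals the square of its t value:

> g1 <- lm(Fuel ~ . - Tax, data=fueldata)   # reduced model
> g2 <- lm(Fuel ~ ., data=fueldata)         # full model
> Fstat <- ((deviance(g1) - deviance(g2)) / 1) / (deviance(g2) / df.residual(g2))
> Fstat                                     # about 4.337; compare (-2.083)^2, about 4.34
> pf(Fstat, 1, df.residual(g2), lower.tail=FALSE)   # p-value, about 0.043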

Example: F-test

> summary(lm(Fuel~.,data=fueldata))

Call:
lm(formula = Fuel ~ ., data = fueldata)

Residuals:
     Min       1Q   Median       3Q      Max
-163.145  -33.039    5.895   31.989  183.499

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) 154.1928   194.9062    0.791 0.432938
Tax          -4.2280     2.0301   -2.083 0.042873 *
Dlic          0.4719     0.1285    3.672 0.000626 ***
Income       -6.1353     2.1936   -2.797 0.007508 **
logMiles     18.5453     6.4722    2.865 0.006259 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 64.89 on 46 degrees of freedom
Multiple R-squared: 0.5105, Adjusted R-squared: 0.4679
F-statistic: 11.99 on 4 and 46 DF, p-value: 9.33e-07

ANOVA


• In the analysis of variance, the mean function with all the terms, $E(Y|X) = \beta_0 + \beta_1 x_1 + \cdots + \beta_p x_p$, is compared with the mean function that includes only an intercept, $E(Y|X=x) = \beta_0$.
• For the second case, $\hat\beta_0 = \bar{y}$ and the residual sum of squares is SYY.
• We have $SSE \le SYY$, and the difference between these two, $SS_{reg} = SYY - SSE$, is the variability explained by the larger mean function that is not explained by the smaller mean function.

ANOVA


• By the F-test, we measure the goodness of fit of the regression model.

Source       df        SS      MS                       F            P-value
Regression   p         SSreg   MSreg = SSreg / p        MSreg / MSE
Residual     n-(p+1)   SSE     MSE = SSE / (n-(p+1))
Total        n-1       SYY
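For the overall F-test reported by summary(), F can also be written in terms of R²; a quick arithmetic check with R² = 0.5105, n = 51, p = 4:

> R2 <- 0.5105; n <- 51; p <- 4
> (R2 / p) / ((1 - R2) / (n - (p + 1)))   # about 11.99, matching the reported F-statistic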

Example: ANOVA

> g1<-lm(Fuel~.-Tax,data=fueldata)
> g2<-lm(Fuel~.,data=fueldata)
> anova(g1,g2)  # Reduced model vs Full model
Analysis of Variance Table

Model 1: Fuel ~ (Tax + Dlic + Income + logMiles) - Tax
Model 2: Fuel ~ Tax + Dlic + Income + logMiles
  Res.Df    RSS Df Sum of Sq      F  Pr(>F)
1     47 211964
2     46 193700  1     18264 4.3373 0.04287 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

• Since the p-value is 0.04 (< 0.05), we have modest evidence that the coefficient for Tax is different from 0. This is called a partial F-test.


Here the hypotheses are $NH: \beta_{Tax} = 0$ versus $AH$: not $NH$.

Example: sequential ANOVA


> f0<-lm(Fuel~1,data=fueldata)
> f1<-lm(Fuel~Dlic,data=fueldata)
> f2<-lm(Fuel~Dlic+Tax,data=fueldata)
> f3<-lm(Fuel~Dlic+Tax+Income,data=fueldata)
> f4<-lm(Fuel~.,data=fueldata)

> anova(f0,f1,f2,f3,f4)
Analysis of Variance Table

Model 1: Fuel ~ 1
Model 2: Fuel ~ Dlic
Model 3: Fuel ~ Dlic + Tax
Model 4: Fuel ~ Dlic + Tax + Income
Model 5: Fuel ~ Tax + Dlic + Income + logMiles
  Res.Df    RSS Df Sum of Sq       F    Pr(>F)
1     50 395694
2     49 308840  1     86854 20.6262 4.019e-05 ***
3     48 289681  1     19159  4.5498 0.0382899 *
4     47 228273  1     61408 14.5833 0.0003997 ***
5     46 193700  1     34573  8.2104 0.0062592 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
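The same sequential (Type I) table can be produced in one call by passing a single model to anova(); note that the sums of squares depend on the order in which the terms enter the formula:

> anova(lm(Fuel ~ Dlic + Tax + Income + logMiles, data=fueldata))  # sequential SS, one term at a time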

Variable Selection


• Usually, we don’t expect every candidate predictor to be related to the response. We want to identify a useful subset of the variables:
▫ Forward selection
▫ Backward elimination
▫ Stepwise method
• Using the F-test, we can add variables to or remove variables from the model. The procedure ends when none of the candidate variables has a p-value smaller than the pre-specified value. (A sketch with R's step() follows below.)
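Base R implements these searches with step(), which scores models by AIC rather than by F-test p-values; a sketch of all three directions on the fueldata example:

> null <- lm(Fuel ~ 1, data=fueldata)                    # intercept-only model
> full <- lm(Fuel ~ ., data=fueldata)                    # all candidate predictors
> step(null, scope=formula(full), direction="forward")   # forward selection
> step(full, direction="backward")                       # backward elimination
> step(null, scope=formula(full), direction="both")      # stepwise method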

Multicollinearity


• When the independent variables are correlated among themselves, multicollinearity among them is said to exist.
• Estimated regression coefficients vary widely when the independent variables are highly correlated.
• Variance Inflation Factor (VIF): a diagnostic for multicollinearity, whose symptoms include large changes in the estimated regression coefficients when a predictor variable is added or deleted, or when an observation is altered or deleted.

Multicollinearity


• $VIF_j = \dfrac{1}{1 - R_j^2}$, where $R_j^2$ is the coefficient of determination when $X_j$ is regressed on the other X variables.
• VIFs measure how much the variances of the estimated regression coefficients are inflated compared to when the predictor variables are not linearly related.
• Generally, a maximum VIF value in excess of 5~10 is taken as an indication of multicollinearity.

> vif(lm(Fuel~.,data=fueldata))
     Tax     Dlic   Income logMiles
1.010786 1.040992 1.132311 1.099395
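Each reported VIF can be reproduced from its definition; a sketch for Tax:

> rsq <- summary(lm(Tax ~ Dlic + Income + logMiles, data=fueldata))$r.squared   # R_j^2 for Tax
> 1 / (1 - rsq)   # should agree with the vif() value for Tax, about 1.0108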

Model Assumption


> par(mfrow=c(2,2))
> plot(lm(Fuel~.,data=fueldata))

Check the model assumptions:
1. Is the model linear?
2. Are the errors normally distributed?
3. Are the error variances equal?
4. Are there values with a strong influence on the fitted equation? => detect outliers

Residual Analysis


> result=lm(Fuel~.,data=fueldata)
> plot(resid(result))
> line1=sd(resid(result))*1.96
> line2=sd(resid(result))*-1.96
> abline(line1,0)
> abline(line2,0)
> abline(0,0)
> par(mfrow=c(1,2))
> boxplot(resid(result))
> hist(resid(result))

Variable Transformation


• Box-Cox transformation: fit the transformed model

$$Y^{(\lambda)} = \beta_0 + \beta_1 x_1 + \cdots + \beta_p x_p$$

and select the $\lambda$ minimizing SSE. (Generally, it is between -2 and 2.)

> boxcox(lm(Fuel~.,data=fueldata))
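boxcox() lives in the MASS package (load it explicitly if needed) and returns a grid of λ values with their profile log-likelihoods; maximizing that log-likelihood corresponds to the SSE criterion above. A sketch for reading off the best λ:

> library(MASS)                 # boxcox() is provided by MASS
> bc <- boxcox(lm(Fuel ~ ., data=fueldata))
> bc$x[which.max(bc$y)]         # lambda with the highest profile log-likelihood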

Regression Procedure


• Preprocessing
▫ Explore the data
▫ Check the independence of the explanatory variables (e.g., via multicollinearity diagnostics)
▫ Check the normality of the error term
▫ Check the linearity of the data and the homoscedasticity of the errors via residual analysis
▫ Apply a Box-Cox transformation if needed
▫ Check for outliers

• Regression analysis
▫ Examine scatterplots and the covariance (or correlation) matrix
▫ Fit the full model and run a t-test on each variable
▫ Compare several variable selection methods and choose the best set of variables
▫ Interpret the model
▫ Present the final candidate model
