Lecture 5: Simple Linear Regression Laura McAvinue School of Psychology Trinity College Dublin


Previous Lecture

• Regression Line

– Offers a model of the relationship between two variables

• A straight line that represents the ‘best fit’

– Enables us to predict Variable Y on the basis of Variable X

$\hat{Y} = bX + a$

Today

• Calculation of the regression line

• Measuring the accuracy of prediction

• Some practice!

How is the regression line calculated?

• The Method of Least Squares

– Computes a line that minimises the difference between the predicted values of Y (Y’) and the actual values of Y (Y)

– Minimises…

• (Y – Y’)s

• Errors of prediction

• Residuals

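[Figure: scatterplot of Y against X with the fitted regression line. The lines drawn from each data point to the regression line are the errors of prediction, (Y – Y’)s, also called residuals.]

[Figure: scatterplot of Y against X with one data point labelled: actual value Y = 6, predicted value Y’ = 5.]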

Method of Least Squares

• When fitting a line to the data, the regression procedure attempts to fit a line that minimises these errors of prediction, the total of the (Y – Y’)s

– But! You can’t simply minimise the sum of the (Y – Y’)s, as they will have positive and negative values, which will cancel each other out

• So, you square the residuals, add them up, and minimise the total, $\sum (Y - Y')^2$

– Hence the name, ‘Method of Least Squares’
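The least-squares idea above can be sketched in a few lines of Python. This is only an illustration: the data values and the function names `fit_line` and `sum_sq_residuals` are made up for the example and do not come from the lecture data-set. It uses the standard closed-form solution for the slope and intercept.

```python
# Illustrative data only (not from the lecture data-set).
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.0, 5.0, 4.0, 6.0]

def fit_line(xs, ys):
    """Return slope b and intercept a of the least-squares line Y' = bX + a."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-products and sum of squares of X around their means.
    sp_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    b = sp_xy / ss_x
    a = mean_y - b * mean_x
    return b, a

def sum_sq_residuals(xs, ys, b, a):
    """Sum of squared errors of prediction for the line Y' = bX + a."""
    return sum((y - (b * x + a)) ** 2 for x, y in zip(xs, ys))

b, a = fit_line(xs, ys)
residuals = [y - (b * x + a) for x, y in zip(xs, ys)]

print("slope b:", b, "intercept a:", a)
# The raw residuals cancel out (sum is zero up to rounding) ...
print("sum of residuals:", sum(residuals))
# ... so the squared residuals are what gets minimised.
print("sum of squared residuals:", sum_sq_residuals(xs, ys, b, a))
# Any other line gives a larger sum of squares, e.g. a slightly steeper one:
print("steeper line:", sum_sq_residuals(xs, ys, b + 0.1, a))
```

The last two lines show the point of the method: nudging the slope away from the fitted value makes the sum of squared residuals larger.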

How do we measure the accuracy of prediction?

• The regression line is fitted in such a way that the errors of prediction are kept as small as possible

– You can fit a regression line to any dataset, but that doesn’t mean it’s a good fit!

– How do we measure how good this fit is?

– How do we measure the accuracy of the prediction that our regression equation makes?

• Three methods

– Standard Error of the Estimate

– r²

– Statistical Significance

Standard Error of the Estimate

• A measure of the size of the errors of prediction

• We’ve seen that…

– The regression line is computed in such a way as to minimise the difference between the predicted values (Y’) and the actual values (Y)

– The differences between these values are known as errors of prediction or residuals, (Y – Y’)s

• For any set of data, the errors of prediction will vary

– Some data points will be close to the line, so (Y – Y’) will be small

– Some data points will be far from the line, so (Y – Y’) will be big

• One way to assess the fit of the regression line is to take the standard deviation of all of these errors

– On average, how much do the data points vary from the regression line?

• Standard error of the estimate

• One point to note…

– Standard error is a measure of the standard deviation of data points around the regression line

– (Standard error)² expresses the variance of the data points around the regression line

– Residual or error variance
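As a rough sketch of the calculation described above, the standard error of the estimate can be computed from the residuals. The data and the helper names are illustrative, not from the lecture. The sketch assumes the usual divisor of n – 2 for simple linear regression (the error degrees of freedom), which is an assumption added here; the slides only describe it as the standard deviation of the errors.

```python
import math

# Illustrative data only (not from the lecture data-set).
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.0, 5.0, 4.0, 6.0]

def fit_line(xs, ys):
    """Least-squares slope b and intercept a for Y' = bX + a."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    b = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))
    return b, mean_y - b * mean_x

def standard_error_of_estimate(xs, ys):
    """Square root of SS_error / (n - 2); assumes one predictor."""
    b, a = fit_line(xs, ys)
    ss_error = sum((y - (b * x + a)) ** 2 for x, y in zip(xs, ys))
    return math.sqrt(ss_error / (len(xs) - 2))

se = standard_error_of_estimate(xs, ys)
print("standard error of the estimate:", se)
# Squaring it gives the residual (error) variance.
print("residual variance:", se ** 2)
```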

r²

• Interested in the relationship between two variables

– Variable X

• A set of scores that vary around a mean, $\bar{X}$

– Variable Y

• A set of scores that vary around a mean, $\bar{Y}$

• If these two variables are correlated, they will share some variance

[Figure: two overlapping circles labelled X and Y. The overlap is the shared variance between X and Y; the remainder of Y is variance in Y that is not related to X; the remainder of X is variance in X that is not related to Y.]

• In regression, we are trying to explain Variable Y as a function of Variable X

• It would be useful if we could find out what percentage of variance in Variable Y can be explained by variance in Variable X

• Total Variance in Variable Y (SStotal) splits into two parts

– Variance due to Variable X: Regression / Model Variance, SSm = SStotal – SSerror

– Variance due to other factors: Error Variance, SSerror

r²

• To calculate the percentage of variance in Variable Y that can be explained by variance in Variable X

$$r^2 = \frac{SS_m}{SS_{total}} = \frac{\text{Variance due to X (regression)}}{\text{Total variance in Y}}$$

r²

• (Pearson Correlation)²

– Shared variance between two variables

– Used in simple linear regression to show what percentage of the variance in Variable Y can be explained by Variable X

• For example

– If $r_{xy} = .8$, $r^2_{xy} = .64$, then 64% of the variability in Y is directly predictable from Variable X

– If $r_{xy} = .2$, $r^2_{xy} = .04$, then 4% of the variability in Y is due to / can be explained by X
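To make the link between the two descriptions of r² concrete, here is a small sketch that computes it both ways: as SSm / SStotal and as the square of the Pearson correlation. The data and function names are illustrative only and are not taken from the lecture data-set.

```python
import math

# Illustrative data only (not from the lecture data-set).
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.0, 5.0, 4.0, 6.0]

def sums_of_squares(xs, ys):
    """Return (SS_total, SS_m, SS_error) for the least-squares line."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    b = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))
    a = mean_y - b * mean_x
    ss_total = sum((y - mean_y) ** 2 for y in ys)
    ss_error = sum((y - (b * x + a)) ** 2 for x, y in zip(xs, ys))
    return ss_total, ss_total - ss_error, ss_error

def pearson_r(xs, ys):
    """Pearson correlation between two equal-length lists of scores."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sp_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    ss_y = sum((y - mean_y) ** 2 for y in ys)
    return sp_xy / math.sqrt(ss_x * ss_y)

ss_total, ss_m, ss_error = sums_of_squares(xs, ys)
r_squared = ss_m / ss_total
r = pearson_r(xs, ys)

print("r squared from sums of squares:", r_squared)
print("Pearson r squared:", r ** 2)
```

Both printed values agree up to rounding, which is why the squared correlation can be read as the proportion of variance in Y explained by X.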

Statistical Significance

• Does the regression model predict Variable Y better than chance?

• Simple linear regression

– Does X significantly predict Y?

– If the correlation between X & Y is statistically significant, the regression model will be statistically significant

– Not so for multiple regression (next lecture)

• F Ratio

Statistical Significance

• F-Ratio

$$F = \frac{\text{Average variance due to the regression}}{\text{Average variance due to error}} = \frac{MS_m}{MS_{error}} = \frac{SS_m / df_m}{SS_{error} / df_{error}}$$

– It uses the mean square rather than the sum of squares in order to compare the average variance

• You want the F-Ratio to be large and statistically significant

• If large, then more variance is explained by the regression than by the error in the model
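The F-ratio formula can be sketched as a small function. The sums of squares below are made-up illustrative values, and the degrees of freedom assume simple linear regression (one predictor, so df for the model is 1 and df for error is n – 2). The second function uses the standard identity for the one-predictor case, F = r² × df_error / (1 – r²), which shows why a significant correlation and a significant regression model go together.

```python
def f_ratio(ss_m, df_m, ss_error, df_error):
    """F = mean square for the model / mean square for error."""
    ms_m = ss_m / df_m
    ms_error = ss_error / df_error
    return ms_m / ms_error

def f_from_r_squared(r_squared, n):
    """Same F for simple linear regression, computed from r squared."""
    df_error = n - 2
    return r_squared * df_error / (1 - r_squared)

# Illustrative values only (not from the lecture data-set).
ss_m, ss_error, n = 6.4, 2.4, 5

f_direct = f_ratio(ss_m, 1, ss_error, n - 2)
r_squared = ss_m / (ss_m + ss_error)
f_via_r = f_from_r_squared(r_squared, n)

print("F from mean squares:", f_direct)
print("F from r squared:", f_via_r)
```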

An example

‘Linear regression’ data-set

• I want to predict a person’s verbal coherency based on the number of units of alcohol they consume

• Record how much alcohol is consumed and administer a test of verbal coherency

– SPSS

– Analyse, Regression, Linear

– Dependent variable: Verbal Coherency

– Independent variable: Alcohol

– Method: Enter

Three parts to the output

• Model Summary: r² and Standard error

• Anova: F Ratio

• Coefficients: Regression Equation

Model Summary

Model   R       R Square   Adjusted R Square   Std. Error of the Estimate
1       .628a   .394       .393                10.29

a. Predictors: (Constant), alcohol units

• Table: how well our regression model explains the variation in verbal coherency

– R: Pearson r between alcohol and verbal coherency

– R Square: proportion of variation in verbal coherency that is related to alcohol

– Adjusted R Square: statistical estimate of the population proportion of variation in verbal coherency that is related to alcohol

– Std. Error of the Estimate: statistical estimate of the error in the regression model

ANOVA (b)

Model 1      Sum of Squares   df    Mean Square   F         Sig.
Regression   30105.539        1     30105.539     284.515   .000a
Residual     46240.488        437   105.813
Total        76346.027        438

a. Predictors: (Constant), alcohol units

b. Dependent Variable: VRBLCOHR

• Sum of Squares, Regression: total variation in data due to regression model

• Sum of Squares, Residual: total variation in data NOT due to regression model

• Mean Square, Regression: average variation in data due to regression model

• Mean Square, Residual: average variation in data NOT due to regression model

• F: ratio of variation in data due to regression model & variation not due to model

• Sig.: probability of observing an F-ratio this large (or larger) if H0 is true
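The figures in the SPSS output can be checked against each other by hand. The sketch below takes the sums of squares and degrees of freedom from the ANOVA table and recomputes the mean square, F, R Square and the standard error of the estimate. The variable names are chosen for this example, and the step that takes the standard error as the square root of the residual mean square is the usual convention rather than something stated on the slides.

```python
import math

# Values copied from the ANOVA table above.
ss_regression = 30105.539
ss_residual = 46240.488
df_regression = 1
df_residual = 437

ss_total = ss_regression + ss_residual          # Total row: 76346.027
ms_regression = ss_regression / df_regression
ms_residual = ss_residual / df_residual         # about 105.813

f_value = ms_regression / ms_residual           # about 284.5
r_square = ss_regression / ss_total             # about .394
std_error = math.sqrt(ms_residual)              # about 10.29

print("Mean Square (Residual):", round(ms_residual, 3))
print("F:", round(f_value, 3))
print("R Square:", round(r_square, 3))
print("Std. Error of the Estimate:", round(std_error, 2))
```

The recomputed values line up with the Model Summary and ANOVA tables, which shows that the three parts of the output are different views of the same partition of variance.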

Coefficients (a)

                Unstandardized Coefficients   Standardized Coefficients
Model 1         B         Std. Error          Beta                        t        Sig.
(Constant)      21.543    1.063                                           20.276   .000
alcohol units   4.715     .280                .628                        16.868   .000

a. Dependent Variable: VRBLCOHR

• B: values that we use in the regression equation (Y’ = BX + a)

– Verbal coherency = B (alcohol) + constant

– Verbal coherency = 4.7 (alcohol) + 21.5

– As alcohol increases by 1 unit, predicted verbal coherency increases by 4.7 units

• t-statistic: tells us whether using the predictor variable gives us a better than chance prediction of the DV

– Alcohol is a significant predictor of verbal coherency
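Once the coefficients are known, making a prediction is just a matter of plugging a value of X into the equation. A minimal sketch, using the unrounded B values from the Coefficients table (the function and constant names are invented for this example):

```python
# Coefficients copied from the SPSS Coefficients table.
B_ALCOHOL = 4.715   # slope for alcohol units
CONSTANT = 21.543   # intercept

def predict_verbal_coherency(alcohol_units):
    """Predicted verbal coherency, Y' = B * alcohol + constant."""
    return B_ALCOHOL * alcohol_units + CONSTANT

pred_0 = predict_verbal_coherency(0)
pred_5 = predict_verbal_coherency(5)
change_per_unit = predict_verbal_coherency(6) - predict_verbal_coherency(5)

print("predicted score at 0 units:", pred_0)
print("predicted score at 5 units:", pred_5)
# The difference between adjacent units is always the slope B.
print("change per extra unit:", change_per_unit)
```

The constant is the predicted score when alcohol is 0, and the slope is the change in the predicted score for each extra unit of alcohol.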

Second Example

• Can we predict how many months a person survives after being diagnosed with cancer, based on their level of optimism?

• Linear Regression dataset

• Analyse, Regression, Linear

• Dependent variable: Survival

• Independent variable: Optimism

Aspects of Regression analysis

• Write the regression equation

• Explain what this equation tells us about the relationship between Variables X and Y

• Make a prediction of Y when given a value of X

• State the standard error of your prediction

• Ascertain if the regression model significantly predicts the dependent variable Y

• State what percentage of the variance in Variable Y is explained by Variable X

State the following…

• Describe the relationship between survival (Y) and optimism (X) in terms of a regression equation.

• In your own words, explain what this equation tells us about the relationship between survival and optimism.

• Using this equation, predict how many months a person will survive for if their optimism score is 10.


• What is the standard error of your prediction?

• Does the regression model significantly predict the dependent variable?

• What percentage of variance in survival is explained by optimism level?

Answers

• Describe the relationship between survival (Y) and optimism (X) in terms of a regression equation.

• Y’ = .69X + 18.4

• In your own words, explain what this equation tells us about the relationship between survival and optimism.

• As optimism level increases by one unit, predicted survival increases by .69 months

• When a person’s optimism score is 0, his/her predicted length of survival is 18.4 months

• Using this equation, predict how many months a person will survive for if their optimism score is 10.

• Y’ = .69(10) + 18.4 = 25.3 months


• What is the standard error of your prediction?

• 4.5 months

• Does the regression model significantly predict the dependent variable?

• Yes, F (1, 432) = 202, p < .001

• What percentage of variance in survival is explained by optimism level?

• 32%
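The answers above can be cross-checked with a little arithmetic. The sketch below recomputes the prediction for an optimism score of 10 and recovers r² from the reported F and degrees of freedom, using the identity r² = F × df_model / (F × df_model + df_error), which follows from the definitions of F and r². The function name is invented for this example.

```python
def r_squared_from_f(f_value, df_model, df_error):
    """Proportion of variance explained, recovered from an F-ratio."""
    return (f_value * df_model) / (f_value * df_model + df_error)

# Regression equation from the answers: Y' = .69X + 18.4
predicted_months = 0.69 * 10 + 18.4

# Reported test: F(1, 432) = 202
r_squared = r_squared_from_f(202, 1, 432)

print("predicted survival (months):", round(predicted_months, 1))
print("percentage of variance explained:", round(r_squared * 100))
```

The recovered percentage rounds to 32, in line with the answer given for the variance in survival explained by optimism.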

Summary

• Simple linear regression

– Provides a model of the relationship between two variables

– Creates a straight line that best represents the relationship between two variables

– Enables us to estimate the percentage of variance in one variable that can be explained by another

– Enables us to predict one variable on the basis of another

• Remember that a regression line can be fitted to any dataset. It’s necessary to assess the accuracy of the fit.
