Regression in R
Seth Margolis
GradQuant
May 31, 2018
What is Regression Good For?
• Assessing relationships between variables (this probably covers most of what you do)
Example:
What is the relationship between intelligence and GPA?
Intelligence is the independent variable, which we will call X
GPA is the dependent variable, which we will call Y
[Scatterplot: Intelligence (60–130) on the x-axis vs. GPA (2.0–4.0) on the y-axis]
Person Intelligence GPA
1 70 2.7
2 75 2.5
3 80 3.4
4 85 2.6
5 90 2.7
6 95 2.7
7 100 3.2
8 105 3.5
9 110 3.2
10 115 3.8
11 120 3.6
What is Regression?
• Predicts an outcome (Y) from 1+ predictors (Xs)
• Fitting a line to data: Y = mX + b + error
Y = intercept + slope*X + error
• Y = b0 + b1X1 + b2X2 + … + e
• Y is one column of data
• Each X is a column of data
• Each b is the coefficient/weight for its X
• b0 is the intercept: the prediction for Y when all Xs are 0
• e is the error/residual: what is left over after prediction
Yactual − Ypredicted; one value for each case (row of data)
How does R select the best bs?
• Y = b0 + b1X1 + b2X2 + e
• What would the best bs do?
They would lead to predictions of Y that are closest to the actual Y values
• How close is one prediction to the actual value for that case (row in the data)?
The residual/error: Yactual − Ypredicted
Better bs lead to smaller residuals/errors
• Proposed solution: find bs that lead to the lowest average residual/error
• Problem: the average residual is always 0
• Solution: find bs that lead to the lowest average squared residual/error
This is ordinary least squares (OLS) regression
• R finds the bs that minimize the squared residuals/errors using matrix algebra
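As a sketch of what that matrix algebra looks like (on simulated data; in practice `lm()` uses a QR decomposition rather than inverting X'X directly), the OLS solution is b = (X'X)^(-1) X'y:

```r
# Simulated data with a known slope and intercept
set.seed(1)
x <- rnorm(100)
y <- 0.5 + 2 * x + rnorm(100)

# Design matrix: a column of 1s for the intercept, then the predictor
X <- cbind(1, x)

# Normal-equations solution: b = (X'X)^-1 X'y
b <- solve(t(X) %*% X) %*% t(X) %*% y

# lm() arrives at the same coefficients
fit <- lm(y ~ x)
cbind(matrix_algebra = as.numeric(b), lm = as.numeric(coef(fit)))
```

Both columns of the printed comparison should match to machine precision.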
Assumptions
1. Homoscedasticity: the variance of the residuals does not change across levels of X
When violated, bootstrap
2. Residuals/errors normally distributed: check with a histogram or a P-P / Q-Q plot
When violated, bootstrap
3. Independence of residuals/errors
When violated, we can use multilevel modeling
4. Linearity
When violated, transform the Xs to meet this assumption
• The Xs themselves can have any distribution
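A minimal sketch of checking assumptions 1 and 2 in base R (the model and data here are simulated for illustration):

```r
# Simulated example model
set.seed(2)
d <- data.frame(x = rnorm(200))
d$y <- 1 + 0.5 * d$x + rnorm(200)
fit <- lm(y ~ x, data = d)

# Normality of residuals: histogram and Q-Q plot
hist(resid(fit))
qqnorm(resid(fit))
qqline(resid(fit))

# Homoscedasticity: residuals vs. fitted values should show constant spread
plot(fit, which = 1)
```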
R and RStudio
• Script
• Console
• Environment
• Object assignment
• Plotting
• Packages
• Help
• Commenting
• Data frames
• Functions and arguments
Load Data
• Usually you will use:
dataname = read.csv("filepath")
• We will use a dataset from the corrgram package about 322 Major League Baseball regular and substitute hitters in 1986
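A runnable sketch of the `read.csv()` pattern (the CSV and its columns here are hypothetical, written to a temporary file so the example is self-contained; the workshop's own data come from the corrgram package instead):

```r
# Generic pattern: dataname <- read.csv("filepath")
# Self-contained demo with a hypothetical two-column CSV:
tmp <- tempfile(fileext = ".csv")
writeLines(c("Player,Salary", "A,500", "B,750"), tmp)
BaseballData <- read.csv(tmp)
BaseballData

# The workshop dataset itself (assuming corrgram is installed):
# library(corrgram); data(baseball)
```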
Some Data Transformations – Not Focus of this Workshop
1 Continuous Predictor – Scatterplot
• Y = b0 + b1X1 + e
• A scatterplot can be quite revealing
• Want to make sure the relationship looks linear
• Otherwise you will probably want to transform your predictor (later in the workshop)
[Scatterplot: "Salary and Season Batting Average" — Salary (thousands of $ per year, 0–2500) vs. Season Batting Average (0.20–0.35)]
1 Continuous Predictor – Model
• Estimate = b = regression coefficient
• Standard error = standard deviation of the sampling distribution
Provides confidence intervals
Most affected by N (sample size)
• t = b / se
• t and df -> p, where df = N − k − 1
N = sample size
k = number of predictors
• Multiple R2 = squared correlation of the predicted and actual values
Its p-value is the p from the F test (lower right of the output)
• Adjusted R2 adjusts R2 for the number of predictors
For each 1-unit increase in SeasonSalary, the expected SeasonBattingAvg increases by .000022
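The model on this slide can be reproduced with `lm()`; the data below are simulated stand-ins (the column names SeasonSalary and SeasonBattingAvg are assumed from the transformation step):

```r
# Simulated stand-in for the 322-hitter dataset
set.seed(3)
BaseballData <- data.frame(SeasonSalary = runif(322, 60, 2500))
BaseballData$SeasonBattingAvg <- 0.25 + 0.000022 * BaseballData$SeasonSalary +
  rnorm(322, sd = 0.02)

fit <- lm(SeasonBattingAvg ~ SeasonSalary, data = BaseballData)
summary(fit)   # Estimate, Std. Error, t, p, Multiple/Adjusted R-squared
confint(fit)   # confidence intervals built from the standard errors
```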
Standardized Coefficients
• What if all values were standardized before entering the model?
• Z = (X − Mean) / SD
• The resulting βs can be interpreted like correlations
They range from −1 to 1
Called standardized regression coefficients
For each 1 SD increase in SeasonSalary, SeasonBattingAvg increases by .33 SDs
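One way to get standardized coefficients is to wrap the variables in `scale()` (simulated data below); with a single predictor the standardized slope equals the correlation exactly:

```r
set.seed(4)
d <- data.frame(x = rnorm(100))
d$y <- 2 * d$x + rnorm(100)

# Standardize Y and X, then refit: the slope is now a beta
fit_std <- lm(scale(y) ~ scale(x), data = d)
coef(fit_std)[2]   # standardized coefficient
cor(d$x, d$y)      # identical with a single predictor
```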
Multiple Continuous Predictors
• Coefficients are partial coefficients
Contribution of that variable holding other variables constant
Unique contribution of that variable over and above other variables
[Venn diagram: the overlap of Y with X1 and X2]
Adding a Predictor
For each 1-unit increase in CareerYears, the expected SeasonBattingAvg changes by −.00099, holding the other variables constant
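Adding a predictor is just another term on the right-hand side of the formula (simulated data; the coefficients are then partial coefficients, as above):

```r
set.seed(5)
d <- data.frame(x1 = rnorm(150), x2 = rnorm(150))
d$y <- 1 + 0.5 * d$x1 - 0.3 * d$x2 + rnorm(150)

# Each coefficient is the unique contribution holding the other constant
fit2 <- lm(y ~ x1 + x2, data = d)
summary(fit2)
```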
Adding a Predictor
Categorical Predictors: Dichotomous
[Boxplot: Season Batting Average (0.20–0.35) by League (A vs. N)]
The NL has a batting average .0036 lower than the AL
Categorical Predictors: 3+ Levels
• Dummy Coding
     1  2  3  4  5  6
1B   1  0  0  0  0  0
2B   0  1  0  0  0  0
3B   0  0  1  0  0  0
C    0  0  0  1  0  0
DH   0  0  0  0  1  0
OF   0  0  0  0  0  0
SS   0  0  0  0  0  1
• Effects Coding
     1  2  3  4  5  6
1B   1  0  0  0  0  0
2B   0  1  0  0  0  0
3B   0  0  1  0  0  0
C    0  0  0  1  0  0
DH   0  0  0  0  1  0
OF  -1 -1 -1 -1 -1 -1
SS   0  0  0  0  0  1
• Dummy Coding
Pick a reference group
Its level of Y is estimated as the intercept
For every other group, the b estimates how different its level of Y is from the reference group
I usually prefer this over effects coding
• Effects Coding
The grand mean (mean of all group means) is estimated as the intercept
For every group except one, the b estimates how different its level of Y is from the grand mean
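In R, both coding schemes are set through factor contrasts (the Position factor and batting averages below are simulated): `relevel()` picks the reference group for dummy (treatment) coding, and `contr.sum()` switches to effects coding.

```r
# Hypothetical Position factor with 7 levels
set.seed(6)
d <- data.frame(Position = factor(sample(c("1B","2B","3B","C","DH","OF","SS"),
                                         210, replace = TRUE)))
d$BattingAvg <- 0.26 + rnorm(210, sd = 0.02)

# Dummy (treatment) coding with OF as the reference group
d$Position <- relevel(d$Position, ref = "OF")
coef(lm(BattingAvg ~ Position, data = d))   # intercept = OF mean

# Effects (sum) coding: intercept = grand mean of the group means
contrasts(d$Position) <- contr.sum(7)
coef(lm(BattingAvg ~ Position, data = d))
```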
Categorical Predictors: 3+ Levels
Categorical Predictors: 3+ Levels
[Boxplot: Season Batting Average (0.20–0.35) by Position (1B, 2B, 3B, C, DH, OF, SS)]
Catchers have a batting average .019 lower than outfielders
Comparing to ANOVA
Between-Group Variability
Within-Group Variability
ANCOVA – Categorical and Continuous Predictors
Catchers have a batting average .010 lower than outfielders, holding the other variables constant
Non-linear Relationships
[Scatterplot: Career Years (5–20) vs. Season Batting Average (0.20–0.35)]
Interactions
• Interaction = the effect of X1 depends on the level of X2
If that is true, the opposite is true too
• Y = b0 + b1X1 + b2X2 + b3X1X2
• Y = b0 + b1X1 + (b2 + b3X1)X2
b3 = for each 1-unit increase in X1, how much b2 increases
And vice-versa
• Can have 3-way, 4-way, etc. interactions
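In an R formula, `x1 * x2` expands to the main effects plus their product (simulated data below), so the fitted model is exactly the equation above:

```r
set.seed(7)
d <- data.frame(x1 = rnorm(200), x2 = rnorm(200))
d$y <- 1 + 0.5 * d$x1 + 0.3 * d$x2 + 0.4 * d$x1 * d$x2 + rnorm(200)

# x1 * x2 is shorthand for x1 + x2 + x1:x2
fit_int <- lm(y ~ x1 * x2, data = d)
coef(fit_int)   # the "x1:x2" estimate is b3
```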
Interaction Plot
Spline Regression – Making Data
Spline Regression – Model
The bs are the slopes before and after the knot
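One common way to build a piecewise-linear spline by hand is a `pmax()` hinge term at the knot (simulated data, knot assumed at x = 10; in this parameterization the slope before the knot is b1 and the slope after is b1 + b2, which may differ from the slides' exact basis):

```r
set.seed(8)
d <- data.frame(x = runif(200, 0, 20))
d$y <- ifelse(d$x < 10, 0.30 - 0.004 * d$x,
                        0.26 - 0.002 * (d$x - 10)) + rnorm(200, sd = 0.01)

# Hinge term pmax(x - 10, 0) is 0 before the knot, (x - 10) after it
fit_sp <- lm(y ~ x + pmax(x - 10, 0), data = d)
coef(fit_sp)
```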
Count Outcome
Count Outcome – Negative Binomial Model
Count Outcome – Negative Binomial Model
For each 1-unit increase in log(CareerYears), the expected log(CareerAtBats) increases by 1.12
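A negative binomial count model can be fit with `glm.nb()` from MASS, which ships with R (simulated data; the variable names mirror the slide's CareerYears and CareerAtBats):

```r
library(MASS)
set.seed(9)
d <- data.frame(CareerYears = sample(1:20, 300, replace = TRUE))
d$CareerAtBats <- rnbinom(300, mu = exp(3 + 1.1 * log(d$CareerYears)), size = 2)

# Coefficients are on the log(expected count) scale
fit_nb <- glm.nb(CareerAtBats ~ log(CareerYears), data = d)
coef(fit_nb)
```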
Count Outcome – Zero-Inflated Model
[Scatterplot: BaseballData$SeasonSalary (0–2500) vs. BaseballData$SeasonHitsZeroInflated (0–140)]
Use when many observations have a 0 on the Y
For each 1-unit increase in SeasonSalary, the expected log(SeasonHits) increases by .0004
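A common way to fit this in R is `zeroinfl()` from the pscl package (assumed installed; data simulated, with extra zeros mixed in; the part of the formula after `|` models the zero-inflation process):

```r
library(pscl)   # assumed installed; provides zeroinfl()
set.seed(10)
d <- data.frame(SeasonSalary = runif(300, 60, 2500))
mu <- exp(3 + 0.0004 * d$SeasonSalary)
d$SeasonHits <- ifelse(runif(300) < 0.3, 0, rpois(300, mu))  # 30% extra zeros

# Count model | zero-inflation model (intercept-only here)
fit_zi <- zeroinfl(SeasonHits ~ SeasonSalary | 1, data = d)
summary(fit_zi)
```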
Proportional Outcome – Beta Regression
For each 1-unit increase in SeasonSalary, the log odds of SeasonBattingAvg increases by .00011
Log odds = log(probability / (1 − probability))
Can do algebra to get the probability at each X
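Beta regression for a 0–1 outcome can be fit with `betareg()` from the betareg package (assumed installed; simulated data below). The mean-model coefficients are on the logit scale, and `predict(..., type = "response")` does the log-odds-to-probability algebra for you:

```r
library(betareg)   # assumed installed
set.seed(11)
d <- data.frame(SeasonSalary = runif(300, 60, 2500))
mu <- plogis(-1 + 0.00011 * d$SeasonSalary)      # mean on the 0-1 scale
d$SeasonBattingAvg <- rbeta(300, mu * 50, (1 - mu) * 50)

fit_beta <- betareg(SeasonBattingAvg ~ SeasonSalary, data = d)
coef(fit_beta)                              # logit (log-odds) scale
predict(fit_beta, type = "response")[1:3]   # back on the 0-1 scale
```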
Quantile (Percentile) Regression
[Scatterplot: Season Salary (0–2500) vs. Season Batting Average (0.20–0.35), with fitted lines at the 20th, 50th, and 80th percentiles]
Predict a certain percentile of Y at each X
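Quantile regression is available through `rq()` in the quantreg package (assumed installed; simulated data); `tau` picks the percentile(s) to model:

```r
library(quantreg)   # assumed installed
set.seed(12)
d <- data.frame(SeasonSalary = runif(300, 60, 2500))
d$SeasonBattingAvg <- 0.25 + 0.00002 * d$SeasonSalary + rnorm(300, sd = 0.02)

# One set of coefficients per requested percentile
fit_q <- rq(SeasonBattingAvg ~ SeasonSalary, tau = c(0.2, 0.5, 0.8), data = d)
coef(fit_q)   # a column of coefficients for each tau
```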
Dichotomous Outcome – Logistic Regression
[Plot: Season Batting Average (0.20–0.35) vs. Probability of Outfielder (0–1)]
For each 1-unit increase in SeasonBattingAvg, the log odds of being an outfielder increases by 65
odds ratio = 2 -> probability of .66
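Logistic regression is `glm()` with `family = binomial` (simulated data; `exp()` of a coefficient turns a log-odds slope into an odds ratio):

```r
set.seed(13)
d <- data.frame(SeasonBattingAvg = runif(300, 0.20, 0.35))
p <- plogis(-18 + 65 * d$SeasonBattingAvg)   # true probability of outfielder
d$Outfielder <- rbinom(300, 1, p)

fit_log <- glm(Outfielder ~ SeasonBattingAvg, family = binomial, data = d)
coef(fit_log)        # log-odds scale
exp(coef(fit_log))   # odds ratios
```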
Ordinal Outcome - Ordered Logistic Regression
[Plot: Season Walks (20–100) vs. Probability (0–1), with curves for 30, 40, 50, 60, 70, 80, 90, 100, and 200 Season Hits]
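Ordered logistic regression can be fit with `polr()` from MASS (simulated data; the ordered outcome here is a hypothetical three-band version of the slide's hits variable):

```r
library(MASS)
set.seed(14)
d <- data.frame(SeasonWalks = runif(400, 20, 100))
latent <- 0.05 * d$SeasonWalks + rlogis(400)
d$HitsBand <- cut(latent, breaks = c(-Inf, 2, 4, Inf),
                  labels = c("low", "medium", "high"), ordered_result = TRUE)

# Hess = TRUE keeps the Hessian so summary() can give standard errors
fit_ord <- polr(HitsBand ~ SeasonWalks, data = d, Hess = TRUE)
summary(fit_ord)
predict(fit_ord, type = "probs")[1:3, ]   # per-category probabilities
```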
Categorical Outcome – Multinomial Logistic Regression
[Plot: Season Batting Average (0.20–0.35) vs. Probability (0–1), with a curve for each position: Catcher, 1st Baseman, 2nd Baseman, 3rd Baseman, Shortstop, Outfielder]
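Multinomial logistic regression can be fit with `multinom()` from the nnet package, which ships with R (simulated data, reduced to three hypothetical positions):

```r
library(nnet)
set.seed(15)
d <- data.frame(SeasonBattingAvg = runif(400, 0.20, 0.35))
d$Position <- factor(sample(c("C", "1B", "OF"), 400, replace = TRUE))

# One row of coefficients per non-reference category
fit_mn <- multinom(Position ~ SeasonBattingAvg, data = d, trace = FALSE)
coef(fit_mn)
predict(fit_mn, type = "probs")[1:3, ]   # probabilities across categories
```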
Nested Data – Multilevel Modeling
• Assumption of regression: independence of residuals/errors
When violated, we can use multilevel modeling
• Dependence of residuals/errors usually results from grouping/nesting in the measured DV
Students nested in classrooms/teachers
Observations nested within participants (i.e., repeated measures)
Participants nested in countries
• Conceptually, it's like running separate regression models for each classroom and then aggregating them
• Can have student-level and classroom-level predictors
• Can have more than 2 levels
Multilevel Equations
• Example with level-1 and level-2 predictors:
Level-1 Model:
Yij = β0j + β1jXij + rij
Level-2 Model:
β0j = γ00 + γ01Wj + u0j
β1j = γ10 + γ11Wj + u1j
Combined Model:
Yij = γ00 + γ01Wj + γ10Xij + γ11WjXij + u0j + u1jXij + rij
• Var(rij) = σ2
• Var(u0j) = τ00
• Var(u1j) = τ11
• Cov(u0j, u1j) = τ01
Nested Data – Multilevel Modeling
• Level-1 Model:
Salaryij = β0j + β1jSeasonHomeRunsij + rij
• Level-2 Model:
β0j = 313.9 + u0j
β1j = 19.3 + u1j
• Var(rij) = σ2 = 150331.9
• Var(u0j) = τ00 = 1560.0
• Var(u1j) = τ11 = 168.7
• Cor(u0j, u1j) = -.71
Within teams, for each 1-unit increase in SeasonHomeRuns, the expected Salary increases by 19.3 units
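A model of this shape can be fit with `lmer()` from the lme4 package (assumed installed; the team structure and salaries below are simulated, so the estimates will not match the slide's numbers):

```r
library(lme4)   # assumed installed
set.seed(16)
d <- data.frame(Team = factor(rep(1:20, each = 15)),
                SeasonHomeRuns = rpois(300, 10))
team_int <- rnorm(20, 0, 40)[d$Team]   # team-level intercept deviations
d$Salary <- 300 + 19 * d$SeasonHomeRuns + team_int + rnorm(300, sd = 100)

# Random intercept and random slope for SeasonHomeRuns by Team
fit_ml <- lmer(Salary ~ SeasonHomeRuns + (1 + SeasonHomeRuns | Team), data = d)
summary(fit_ml)   # fixed effects (gammas) and variance components (taus, sigma^2)
```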
Regularization
• To prevent overfitting, take parsimony into account
• Ridge (L2)
Causes regression coefficients to shrink
• Lasso (L1)
Causes some regression coefficients to become exactly 0
• Elastic Net
A hybrid of the other two
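All three penalties are available through the glmnet package (assumed installed; simulated data with one real predictor and nine noise predictors). `alpha` moves between ridge (0) and lasso (1), with elastic net in between; `cv.glmnet()` picks lambda by cross-validation:

```r
library(glmnet)   # assumed installed
set.seed(17)
X <- matrix(rnorm(200 * 10), 200, 10)
y <- 0.5 * X[, 1] + rnorm(200)

cv_ridge <- cv.glmnet(X, y, alpha = 0)    # ridge: shrinks all coefficients
cv_lasso <- cv.glmnet(X, y, alpha = 1)    # lasso: can zero coefficients out
coef(cv_lasso, s = "lambda.min")          # coefficients at the best lambda
```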
Overfit Multiple Regression Model
[Scatterplot: MultipleRegressionModel$fitted.values (0.20–0.40) vs. BaseballData$SeasonBattingAvg (0.20–0.35)]
Multicollinearity
Ridge Regression
[Plot: log(Lambda) (−2 to 6) vs. cross-validated Mean-Squared Error; the top axis shows all 10 predictors retained at every lambda]
Ridge Regression
[Scatterplot: Predicted vs. Actual Season Batting Average (standardized, −2 to 2)]
Lasso Regression
Elastic Net Regression
Bootstrapping
• Two assumptions of regression:
1. Homoscedasticity
Variance of residuals does not change at levels of X
2. Residuals/errors normally distributed
Can use a histogram or a P-P / Q-Q plot
• Bootstrapping can address both:
Imagine a dataset with N rows
Sample rows with replacement N times
Run the model with that new dataset
Record the results
Repeat 10,000 times
This provides a distribution of results with a mean and standard deviation/error
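The steps above can be sketched with the boot package, which ships with R (simulated data; 2,000 resamples here to keep the demo quick, where the slides suggest 10,000):

```r
library(boot)
set.seed(18)
d <- data.frame(x = rnorm(200))
d$y <- 1 + 0.5 * d$x + rnorm(200)

# Statistic: refit the model on a rows-resampled-with-replacement dataset
boot_coefs <- function(data, idx) coef(lm(y ~ x, data = data[idx, ]))

res <- boot(d, boot_coefs, R = 2000)
res                                      # bootstrap SEs for intercept and slope
boot.ci(res, type = "perc", index = 2)   # percentile CI for the slope
```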
Bootstrapping Three-Predictor Model
Causal Claims
• Correctly specified model
Regression coefficients can be interpreted as causal effects if the model is "correctly specified"
Correctly specified: all causes of Y that are correlated with any X in the model are also in the model
Other models are still valid prediction models
Rare, except for…
• Random assignment
Creates a correctly specified model
If the randomly assigned condition is a predictor, nothing is correlated with it (assuming a large enough N)
So the model is correctly specified
If covariates are added, the effects of condition can still be interpreted causally
Must be able to manipulate the IV
If you cannot, try to make the model as "correct" as possible
Future Directions
• Structural Equation Modeling (SEM)
Path analysis
Mediation
Latent variables (model measurement error)
• Factor Analysis
• Longitudinal data analysis
Regressed change: predict t2 from t1 and other variables
Difference scores: the outcome is t2 − t1
MLM
SEM: latent growth models
Cross-lagged models
Time series analyses
• Machine Learning