Regression in R
Seth Margolis
GradQuant
May 31, 2018
What is Regression Good For?
• Assessing relationships between variables (this probably covers most of what you do)
Example:
What is the relationship between intelligence and GPA?
Intelligence is the independent variable, which we will call X
GPA is the dependent variable, which we will call Y
[Scatterplot: Intelligence (60–130) on the x-axis vs. GPA (2.0–4.0) on the y-axis]
Person Intelligence GPA
1 70 2.7
2 75 2.5
3 80 3.4
4 85 2.6
5 90 2.7
6 95 2.7
7 100 3.2
8 105 3.5
9 110 3.2
10 115 3.8
11 120 3.6
What is Regression?
• Predicts an outcome (Y) from 1+ predictors (Xs)
• Fitting a line to data: Y = mX + b + error
Y = intercept + slope*X + error
• Y = b0 + b1X1 + b2X2 + … + e
• Y is one column of data
• Each X is a column of data
• Each b is the coefficient/weight for its X
• b0 is the intercept: the prediction for Y when all Xs are 0
• e is the error/residual: what is left over after prediction
Yactual − Ypredicted; one value for each case (row of data)
How does R select the best bs?
• Y = b0 + b1X1 + b2X2 + e
• What would the best bs do?
They would lead to predictions of Y that are closest to the actual Y values
• How close is one prediction to the actual value for that case (row in the data)?
The residual/error: Yactual − Ypredicted
Better bs lead to smaller residuals/errors
• Proposed solution: find bs that lead to the lowest average residual/error
• Problem: the average residual is always 0
• Solution: find bs that lead to the lowest average squared residual/error
This is ordinary least squares (OLS) regression
• R finds the bs that minimize the squared residuals/errors using matrix algebra
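As a sketch of what that matrix algebra looks like (on simulated data; in practice `lm()` uses a QR decomposition rather than inverting X'X directly), the OLS solution is b = (X'X)^(-1) X'y:

```r
# Simulated data with a known slope and intercept
set.seed(1)
x <- rnorm(100)
y <- 0.5 + 2 * x + rnorm(100)

# Design matrix: a column of 1s for the intercept, then the predictor
X <- cbind(1, x)

# Normal-equations solution: b = (X'X)^-1 X'y
b <- solve(t(X) %*% X) %*% t(X) %*% y

# lm() arrives at the same coefficients
fit <- lm(y ~ x)
cbind(matrix_algebra = as.numeric(b), lm = as.numeric(coef(fit)))
```

Both columns of the printed comparison should match to machine precision.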
Assumptions
1. Homoscedasticity: the variance of the residuals does not change across levels of X
When violated, bootstrap
2. Residuals/errors normally distributed: check with a histogram or a P-P / Q-Q plot
When violated, bootstrap
3. Independence of residuals/errors
When violated, we can use multilevel modeling
4. Linearity
When violated, transform the Xs to meet this assumption
• The Xs themselves can have any distribution
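A minimal sketch of checking assumptions 1 and 2 in base R (the model and data here are simulated for illustration):

```r
# Simulated example model
set.seed(2)
d <- data.frame(x = rnorm(200))
d$y <- 1 + 0.5 * d$x + rnorm(200)
fit <- lm(y ~ x, data = d)

# Normality of residuals: histogram and Q-Q plot
hist(resid(fit))
qqnorm(resid(fit))
qqline(resid(fit))

# Homoscedasticity: residuals vs. fitted values should show constant spread
plot(fit, which = 1)
```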
R and RStudio
• Script
• Console
• Environment
• Object assignment
• Plotting
• Packages
• Help
• Commenting
• Data frames
• Functions and arguments
Load Data
• Usually you will use:
dataname = read.csv("filepath")
• We will use a dataset from the corrgram package about 322 Major League Baseball regular and substitute hitters in 1986
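A runnable sketch of the `read.csv()` pattern (the CSV and its columns here are hypothetical, written to a temporary file so the example is self-contained; the workshop's own data come from the corrgram package instead):

```r
# Generic pattern: dataname <- read.csv("filepath")
# Self-contained demo with a hypothetical two-column CSV:
tmp <- tempfile(fileext = ".csv")
writeLines(c("Player,Salary", "A,500", "B,750"), tmp)
BaseballData <- read.csv(tmp)
BaseballData

# The workshop dataset itself (assuming corrgram is installed):
# library(corrgram); data(baseball)
```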
Some Data Transformations – Not Focus of this Workshop
1 Continuous Predictor – Scatterplot
• Y = b0 + b1X1 + e
• A scatterplot can be quite revealing
• Want to make sure the relationship looks linear
• Otherwise you will probably want to transform your predictor (later in the workshop)
[Scatterplot: "Salary and Season Batting Average" — Salary (thousands of $ per year, 0–2500) vs. Season Batting Average (0.20–0.35)]
1 Continuous Predictor – Model
• Estimate = b = regression coefficient
• Standard error = standard deviation of the sampling distribution
Provides confidence intervals
Most affected by N (sample size)
• t = b / se
• t and df -> p, where df = N − k − 1
N = sample size
k = number of predictors
• Multiple R2 = squared correlation of the predicted and actual values
Its p-value is the p from the F test (lower right of the output)
• Adjusted R2 adjusts R2 for the number of predictors
For each 1-unit increase in SeasonSalary, the expected SeasonBattingAvg increases by .000022
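The model on this slide can be reproduced with `lm()`; the data below are simulated stand-ins (the column names SeasonSalary and SeasonBattingAvg are assumed from the transformation step):

```r
# Simulated stand-in for the 322-hitter dataset
set.seed(3)
BaseballData <- data.frame(SeasonSalary = runif(322, 60, 2500))
BaseballData$SeasonBattingAvg <- 0.25 + 0.000022 * BaseballData$SeasonSalary +
  rnorm(322, sd = 0.02)

fit <- lm(SeasonBattingAvg ~ SeasonSalary, data = BaseballData)
summary(fit)   # Estimate, Std. Error, t, p, Multiple/Adjusted R-squared
confint(fit)   # confidence intervals built from the standard errors
```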
Standardized Coefficients
• What if all values were standardized before entering the model?
• Z = (X − Mean) / SD
• The resulting βs can be interpreted like correlations
They range from −1 to 1
Called standardized regression coefficients
For each 1 SD increase in SeasonSalary, SeasonBattingAvg increases by .33 SDs
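One way to get standardized coefficients is to wrap the variables in `scale()` (simulated data below); with a single predictor the standardized slope equals the correlation exactly:

```r
set.seed(4)
d <- data.frame(x = rnorm(100))
d$y <- 2 * d$x + rnorm(100)

# Standardize Y and X, then refit: the slope is now a beta
fit_std <- lm(scale(y) ~ scale(x), data = d)
coef(fit_std)[2]   # standardized coefficient
cor(d$x, d$y)      # identical with a single predictor
```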
Multiple Continuous Predictors
• Coefficients are partial coefficients
Contribution of that variable holding other variables constant
Unique contribution of that variable over and above other variables
[Venn diagram: the overlap of Y with X1 and X2]
Adding a Predictor
For each 1-unit increase in CareerYears, the expected SeasonBattingAvg changes by −.00099, holding the other variables constant
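Adding a predictor is just another term on the right-hand side of the formula (simulated data; the coefficients are then partial coefficients, as above):

```r
set.seed(5)
d <- data.frame(x1 = rnorm(150), x2 = rnorm(150))
d$y <- 1 + 0.5 * d$x1 - 0.3 * d$x2 + rnorm(150)

# Each coefficient is the unique contribution holding the other constant
fit2 <- lm(y ~ x1 + x2, data = d)
summary(fit2)
```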
Adding a Predictor
Categorical Predictors: Dichotomous
[Boxplot: Season Batting Average (0.20–0.35) by League (A vs. N)]
The NL has a batting average .0036 lower than the AL
Categorical Predictors: 3+ Levels
• Dummy Coding
     1  2  3  4  5  6
1B   1  0  0  0  0  0
2B   0  1  0  0  0  0
3B   0  0  1  0  0  0
C    0  0  0  1  0  0
DH   0  0  0  0  1  0
OF   0  0  0  0  0  0
SS   0  0  0  0  0  1
• Effects Coding
     1  2  3  4  5  6
1B   1  0  0  0  0  0
2B   0  1  0  0  0  0
3B   0  0  1  0  0  0
C    0  0  0  1  0  0
DH   0  0  0  0  1  0
OF  -1 -1 -1 -1 -1 -1
SS   0  0  0  0  0  1
• Dummy Coding
Pick a reference group
Its level of Y is estimated as the intercept
For every other group, the b estimates how different its level of Y is from the reference group
I usually prefer this over effects coding
• Effects Coding
The grand mean (mean of all group means) is estimated as the intercept
For every group except one, the b estimates how different its level of Y is from the grand mean
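In R, both coding schemes are set through factor contrasts (the Position factor and batting averages below are simulated): `relevel()` picks the reference group for dummy (treatment) coding, and `contr.sum()` switches to effects coding.

```r
# Hypothetical Position factor with 7 levels
set.seed(6)
d <- data.frame(Position = factor(sample(c("1B","2B","3B","C","DH","OF","SS"),
                                         210, replace = TRUE)))
d$BattingAvg <- 0.26 + rnorm(210, sd = 0.02)

# Dummy (treatment) coding with OF as the reference group
d$Position <- relevel(d$Position, ref = "OF")
coef(lm(BattingAvg ~ Position, data = d))   # intercept = OF mean

# Effects (sum) coding: intercept = grand mean of the group means
contrasts(d$Position) <- contr.sum(7)
coef(lm(BattingAvg ~ Position, data = d))
```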
Categorical Predictors: 3+ Levels
Categorical Predictors: 3+ Levels
[Boxplot: Season Batting Average (0.20–0.35) by Position (1B, 2B, 3B, C, DH, OF, SS)]
Catchers have a batting average .019 lower than outfielders
Comparing to ANOVA
Between-Group Variability
Within-Group Variability
ANCOVA – Categorical and Continuous Predictors
Catchers have a batting average .010 lower than outfielders, holding the other variables constant
Non-linear Relationships
[Scatterplot: Career Years (5–20) vs. Season Batting Average (0.20–0.35)]
Interactions
• Interaction = the effect of X1 depends on the level of X2
If that is true, the opposite is true too
• Y = b0 + b1X1 + b2X2 + b3X1X2
• Y = b0 + b1X1 + (b2 + b3X1)X2
b3 = for each 1-unit increase in X1, how much b2 increases
And vice-versa
• Can have 3-way, 4-way, etc. interactions
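In an R formula, `x1 * x2` expands to the main effects plus their product (simulated data below), so the fitted model is exactly the equation above:

```r
set.seed(7)
d <- data.frame(x1 = rnorm(200), x2 = rnorm(200))
d$y <- 1 + 0.5 * d$x1 + 0.3 * d$x2 + 0.4 * d$x1 * d$x2 + rnorm(200)

# x1 * x2 is shorthand for x1 + x2 + x1:x2
fit_int <- lm(y ~ x1 * x2, data = d)
coef(fit_int)   # the "x1:x2" estimate is b3
```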
Interaction Plot
Spline Regression – Making Data
Spline Regression – Model
The bs are the slopes before and after the knot
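One common way to build a piecewise-linear spline by hand is a `pmax()` hinge term at the knot (simulated data, knot assumed at x = 10; in this parameterization the slope before the knot is b1 and the slope after is b1 + b2, which may differ from the slides' exact basis):

```r
set.seed(8)
d <- data.frame(x = runif(200, 0, 20))
d$y <- ifelse(d$x < 10, 0.30 - 0.004 * d$x,
                        0.26 - 0.002 * (d$x - 10)) + rnorm(200, sd = 0.01)

# Hinge term pmax(x - 10, 0) is 0 before the knot, (x - 10) after it
fit_sp <- lm(y ~ x + pmax(x - 10, 0), data = d)
coef(fit_sp)
```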
Count Outcome
Count Outcome – Negative Binomial Model
Count Outcome – Negative Binomial Model
For each 1-unit increase in log(CareerYears), the expected log(CareerAtBats) increases by 1.12
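A negative binomial count model can be fit with `glm.nb()` from MASS, which ships with R (simulated data; the variable names mirror the slide's CareerYears and CareerAtBats):

```r
library(MASS)
set.seed(9)
d <- data.frame(CareerYears = sample(1:20, 300, replace = TRUE))
d$CareerAtBats <- rnbinom(300, mu = exp(3 + 1.1 * log(d$CareerYears)), size = 2)

# Coefficients are on the log(expected count) scale
fit_nb <- glm.nb(CareerAtBats ~ log(CareerYears), data = d)
coef(fit_nb)
```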
Count Outcome – Zero-Inflated Model
[Scatterplot: BaseballData$SeasonSalary (0–2500) vs. BaseballData$SeasonHitsZeroInflated (0–140)]
Use when many observations have a 0 on the Y
For each 1-unit increase in SeasonSalary, the expected log(SeasonHits) increases by .0004
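A common way to fit this in R is `zeroinfl()` from the pscl package (assumed installed; data simulated, with extra zeros mixed in; the part of the formula after `|` models the zero-inflation process):

```r
library(pscl)   # assumed installed; provides zeroinfl()
set.seed(10)
d <- data.frame(SeasonSalary = runif(300, 60, 2500))
mu <- exp(3 + 0.0004 * d$SeasonSalary)
d$SeasonHits <- ifelse(runif(300) < 0.3, 0, rpois(300, mu))  # 30% extra zeros

# Count model | zero-inflation model (intercept-only here)
fit_zi <- zeroinfl(SeasonHits ~ SeasonSalary | 1, data = d)
summary(fit_zi)
```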
Proportional Outcome – Beta Regression
For each 1-unit increase in SeasonSalary, the log odds of SeasonBattingAvg increases by .00011
Log odds = log(probability / (1 − probability))
Can do algebra to get the probability at each X
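Beta regression for a 0–1 outcome can be fit with `betareg()` from the betareg package (assumed installed; simulated data below). The mean-model coefficients are on the logit scale, and `predict(..., type = "response")` does the log-odds-to-probability algebra for you:

```r
library(betareg)   # assumed installed
set.seed(11)
d <- data.frame(SeasonSalary = runif(300, 60, 2500))
mu <- plogis(-1 + 0.00011 * d$SeasonSalary)      # mean on the 0-1 scale
d$SeasonBattingAvg <- rbeta(300, mu * 50, (1 - mu) * 50)

fit_beta <- betareg(SeasonBattingAvg ~ SeasonSalary, data = d)
coef(fit_beta)                              # logit (log-odds) scale
predict(fit_beta, type = "response")[1:3]   # back on the 0-1 scale
```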
Quantile (Percentile) Regression
[Scatterplot: Season Salary (0–2500) vs. Season Batting Average (0.20–0.35), with fitted lines at the 20th, 50th, and 80th percentiles]
Predict a certain percentile of Y at each X
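Quantile regression is available through `rq()` in the quantreg package (assumed installed; simulated data); `tau` picks the percentile(s) to model:

```r
library(quantreg)   # assumed installed
set.seed(12)
d <- data.frame(SeasonSalary = runif(300, 60, 2500))
d$SeasonBattingAvg <- 0.25 + 0.00002 * d$SeasonSalary + rnorm(300, sd = 0.02)

# One set of coefficients per requested percentile
fit_q <- rq(SeasonBattingAvg ~ SeasonSalary, tau = c(0.2, 0.5, 0.8), data = d)
coef(fit_q)   # a column of coefficients for each tau
```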
Dichotomous Outcome – Logistic Regression
[Plot: Season Batting Average (0.20–0.35) vs. Probability of Outfielder (0–1)]
For each 1-unit increase in SeasonBattingAvg, the log odds of being an outfielder increases by 65
odds ratio = 2 -> probability of .66
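Logistic regression is `glm()` with `family = binomial` (simulated data; `exp()` of a coefficient turns a log-odds slope into an odds ratio):

```r
set.seed(13)
d <- data.frame(SeasonBattingAvg = runif(300, 0.20, 0.35))
p <- plogis(-18 + 65 * d$SeasonBattingAvg)   # true probability of outfielder
d$Outfielder <- rbinom(300, 1, p)

fit_log <- glm(Outfielder ~ SeasonBattingAvg, family = binomial, data = d)
coef(fit_log)        # log-odds scale
exp(coef(fit_log))   # odds ratios
```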
Ordinal Outcome - Ordered Logistic Regression
[Plot: Season Walks (20–100) vs. Probability (0–1), with curves for 30, 40, 50, 60, 70, 80, 90, 100, and 200 Season Hits]
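Ordered logistic regression can be fit with `polr()` from MASS (simulated data; the ordered outcome here is a hypothetical three-band version of the slide's hits variable):

```r
library(MASS)
set.seed(14)
d <- data.frame(SeasonWalks = runif(400, 20, 100))
latent <- 0.05 * d$SeasonWalks + rlogis(400)
d$HitsBand <- cut(latent, breaks = c(-Inf, 2, 4, Inf),
                  labels = c("low", "medium", "high"), ordered_result = TRUE)

# Hess = TRUE keeps the Hessian so summary() can give standard errors
fit_ord <- polr(HitsBand ~ SeasonWalks, data = d, Hess = TRUE)
summary(fit_ord)
predict(fit_ord, type = "probs")[1:3, ]   # per-category probabilities
```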
Categorical Outcome – Multinomial Logistic Regression
[Plot: Season Batting Average (0.20–0.35) vs. Probability (0–1), with a curve for each position: Catcher, 1st Baseman, 2nd Baseman, 3rd Baseman, Shortstop, Outfielder]
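Multinomial logistic regression can be fit with `multinom()` from the nnet package, which ships with R (simulated data, reduced to three hypothetical positions):

```r
library(nnet)
set.seed(15)
d <- data.frame(SeasonBattingAvg = runif(400, 0.20, 0.35))
d$Position <- factor(sample(c("C", "1B", "OF"), 400, replace = TRUE))

# One row of coefficients per non-reference category
fit_mn <- multinom(Position ~ SeasonBattingAvg, data = d, trace = FALSE)
coef(fit_mn)
predict(fit_mn, type = "probs")[1:3, ]   # probabilities across categories
```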
Nested Data – Multilevel Modeling
• Assumption of regression: independence of residuals/errors
When violated, we can use multilevel modeling
• Dependence of residuals/errors usually results from grouping/nesting in the measured DV
Students nested in classrooms/teachers
Observations nested within participants (i.e., repeated measures)
Participants nested in countries
• Conceptually, it's like running separate regression models for each classroom and then aggregating them
• Can have student-level and classroom-level predictors
• Can have more than 2 levels
Multilevel Equations
• Example with level-1 and level-2 predictors:
Level-1 Model:
Yij = β0j + β1jXij + rij
Level-2 Model:
β0j = γ00 + γ01Wj + u0j
β1j = γ10 + γ11Wj + u1j
Combined Model:
Yij = γ00 + γ01Wj + γ10Xij + γ11WjXij + u0j + u1jXij + rij
• Var(rij) = σ2
• Var(u0j) = τ00
• Var(u1j) = τ11
• Cov(u0j, u1j) = τ01
Nested Data – Multilevel Modeling
• Level-1 Model:
Salaryij = β0j + β1jSeasonHomeRunsij + rij
• Level-2 Model:
β0j = 313.9 + u0j
β1j = 19.3 + u1j
• Var(rij) = σ2 = 150331.9
• Var(u0j) = τ00 = 1560.0
• Var(u1j) = τ11 = 168.7
• Cor(u0j, u1j) = -.71
Within teams, for each 1-unit increase in SeasonHomeRuns, the expected Salary increases by 19.3 units
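A model of this shape can be fit with `lmer()` from the lme4 package (assumed installed; the team structure and salaries below are simulated, so the estimates will not match the slide's numbers):

```r
library(lme4)   # assumed installed
set.seed(16)
d <- data.frame(Team = factor(rep(1:20, each = 15)),
                SeasonHomeRuns = rpois(300, 10))
team_int <- rnorm(20, 0, 40)[d$Team]   # team-level intercept deviations
d$Salary <- 300 + 19 * d$SeasonHomeRuns + team_int + rnorm(300, sd = 100)

# Random intercept and random slope for SeasonHomeRuns by Team
fit_ml <- lmer(Salary ~ SeasonHomeRuns + (1 + SeasonHomeRuns | Team), data = d)
summary(fit_ml)   # fixed effects (gammas) and variance components (taus, sigma^2)
```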
Regularization
• To prevent overfitting, take parsimony into account
• Ridge (L2)
Causes regression coefficients to shrink
• Lasso (L1)
Causes some regression coefficients to become exactly 0
• Elastic Net
A hybrid of the other two
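All three penalties are available through the glmnet package (assumed installed; simulated data with one real predictor and nine noise predictors). `alpha` moves between ridge (0) and lasso (1), with elastic net in between; `cv.glmnet()` picks lambda by cross-validation:

```r
library(glmnet)   # assumed installed
set.seed(17)
X <- matrix(rnorm(200 * 10), 200, 10)
y <- 0.5 * X[, 1] + rnorm(200)

cv_ridge <- cv.glmnet(X, y, alpha = 0)    # ridge: shrinks all coefficients
cv_lasso <- cv.glmnet(X, y, alpha = 1)    # lasso: can zero coefficients out
coef(cv_lasso, s = "lambda.min")          # coefficients at the best lambda
```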
Overfit Multiple Regression Model
[Scatterplot: MultipleRegressionModel$fitted.values (0.20–0.40) vs. BaseballData$SeasonBattingAvg (0.20–0.35)]
Multicollinearity
Ridge Regression
[Plot: log(Lambda) (−2 to 6) vs. cross-validated Mean-Squared Error; the top axis shows all 10 predictors retained at every lambda]
Ridge Regression
[Scatterplot: Predicted vs. Actual Season Batting Average (standardized, −2 to 2)]
Lasso Regression
Elastic Net Regression
Bootstrapping
• Two assumptions of regression:
1. Homoscedasticity
Variance of residuals does not change at levels of X
2. Residuals/errors normally distributed
Can use a histogram or a P-P / Q-Q plot
• Bootstrapping can address both:
Imagine a dataset with N rows
Sample rows with replacement N times
Run the model with that new dataset
Record the results
Repeat 10,000 times
This provides a distribution of results with a mean and standard deviation/error
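The steps above can be sketched with the boot package, which ships with R (simulated data; 2,000 resamples here to keep the demo quick, where the slides suggest 10,000):

```r
library(boot)
set.seed(18)
d <- data.frame(x = rnorm(200))
d$y <- 1 + 0.5 * d$x + rnorm(200)

# Statistic: refit the model on a rows-resampled-with-replacement dataset
boot_coefs <- function(data, idx) coef(lm(y ~ x, data = data[idx, ]))

res <- boot(d, boot_coefs, R = 2000)
res                                      # bootstrap SEs for intercept and slope
boot.ci(res, type = "perc", index = 2)   # percentile CI for the slope
```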
Bootstrapping Three-Predictor Model
Causal Claims
• Correctly specified model
Regression coefficients can be interpreted as causal effects if the model is "correctly specified"
Correctly specified: all causes of Y that are correlated with any X in the model are also in the model
Other models are still valid prediction models
Rare, except for…
• Random assignment
Creates a correctly specified model
If the randomly assigned condition is a predictor, nothing is correlated with it (assuming a large enough N)
So the model is correctly specified
If covariates are added, the effects of condition can still be interpreted causally
Must be able to manipulate the IV
If you cannot, try to make the model as "correct" as possible
Future Directions
• Structural Equation Modeling (SEM)
Path analysis
Mediation
Latent variables (model measurement error)
• Factor Analysis
• Longitudinal data analysis
Regressed change: predict t2 from t1 and other variables
Difference scores: the outcome is t2 − t1
MLM
SEM: latent growth models
Cross-lagged models
Time series analyses
• Machine Learning