Upload
others
View
5
Download
0
Embed Size (px)
Citation preview
Lecture 6: Review of Regression October 14, 2014
Categorical Data Analysis, AUT 2014 1
1
Biost 536 / Epi 536Categorical Data Analysis in
EpidemiologyScott S. Emerson, M.D., Ph.D.
Professor of BiostatisticsUniversity of Washington
Lecture 6:Review of Regression
October 14, 2014
Lecture 6: Review of Regression October 14, 2014
Categorical Data Analysis, AUT 2014 2
2
Lecture Outline
• General Regression Model
• Confounding and Precision in Regression– Linear regression with adjustment for a single covariate– Logistic regression with adjustment for a single covariate– PH regression with adjustment for a single covariate– Linear regression with adjustment for multiple covariates
Lecture 6: Review of Regression October 14, 2014
Categorical Data Analysis, AUT 2014 3
3
General Regression Model
Lecture 6: Review of Regression October 14, 2014
Categorical Data Analysis, AUT 2014 4
4
Regression Models
• According to the parameter compared across groups– Means Linear regression– Geom Means Linear regression on logs– Odds Logistic regression– Rates Poisson regression– Hazards Proportional Hazards regr– Quantiles Parametric (AFT) survival regr
• A special case for this course– Cumulative incidence Exponential regression
Lecture 6: Review of Regression October 14, 2014
Categorical Data Analysis, AUT 2014 5
5
General Regression
• General notation for variables and parameter
• The parameter might be the mean, geometric mean, odds, rate, instantaneous risk of an event (hazard), etc.
ofon distributi ofParameter subjectth for the variablesadjustment of Value
subject th for the POI theof Value subject th on the measured Response
,, 21
ii
ii
i
i
Yi
ii
WWXY
Lecture 6: Review of Regression October 14, 2014
Categorical Data Analysis, AUT 2014 6
6
Multiple Regression
• General notation for multiple regression model
• The link function is usually either none (means) or log (geommean, odds, hazard)
• In this class: complementary log log link for proportions measuring cumulative incidence, which relate to log link on exponential hazards– log (- log p) = log + log Δt
" covariatefor Slope" "Interest of Predfor Slope" Intercept""
modelingfor usedfunction link""
1
1
0
231210
jj
iiii
WX
gWWXg
Lecture 6: Review of Regression October 14, 2014
Categorical Data Analysis, AUT 2014 7
7
Borrowing Information
• Use other groups to make estimates in groups with sparse data
• Intuitively: 67 and 69 year olds would provide some relevant information about 68 year olds
• Assuming straight line relationship tells us how to adjust data from other (even more distant) age groups– If we do not know about the exact functional relationship, we
might want to borrow information only close to each group
Lecture 6: Review of Regression October 14, 2014
Categorical Data Analysis, AUT 2014 8
8
Defining “Contrasts”
• Define a comparison across groups to use when answering scientific question
• If straight line relationship in parameter, slope for POI compares parameter between groups differing by 1 unit in X when all othercovariates in model are equal
• If nonlinear relationship in parameter, slope is average comparison of parameter between groups differing by 1 unit in X “holding covariates constant”– Statistical jargon: a “contrast” across the groups
Lecture 6: Review of Regression October 14, 2014
Categorical Data Analysis, AUT 2014 9
9
Comparison of Models
• The major difference between regression models is interpretationof the parameters
• Summary: Mean, geometric mean, odds, hazards• Comparison of groups: Difference, ratio
• Issues related to inclusion of covariates remain the same• Address the scientific question
– Predictor of interest; Effect modifiers• Address confounding• Increase precision
Lecture 6: Review of Regression October 14, 2014
Categorical Data Analysis, AUT 2014 10
10
Interpretation of Parameters
• Intercept– Corresponds to a population with all modeled covariates equal to
zero– Most often outside range of data; quite often impossible; very
rarely of interest by itself
• Slope– A comparison between groups differing by 1 unit in corresponding
covariate, but agreeing on all other modeled covariates– Sometimes impossible to use this definition when modeling
interactions or complex curves
Lecture 6: Review of Regression October 14, 2014
Categorical Data Analysis, AUT 2014 11
11
Regression with Identity Link
• E.g., modeling mean of response Y on predictors X, W1, W2, …
1
12110
1210
0
1210
: of Difference
1
00
Model
ijijii
ijijii
jii
iii
wxwWxXwxwWxX
WX
WX
Lecture 6: Review of Regression October 14, 2014
Categorical Data Analysis, AUT 2014 12
12
Regression with Log Link
• E.g., modeling geometric mean, odds, rate, hazard of response Yon predictors X, W1, W2, …
1
12110
1210
0
1210
:log of Difference
log1 log
log00
log Model
ijijii
ijijii
jii
iii
wxwWxXwxwWxX
WX
WX
Lecture 6: Review of Regression October 14, 2014
Categorical Data Analysis, AUT 2014 13
13
Regression with Log Link: Back Transform
• E.g., modeling geometric mean, odds, rate, hazard of response Yon predictors X, W1, W2, …
1
12110
1210
0
:of Ratio
1
00
log Model
1210
e
ewWxXewWxX
eWX
WX
i
i
wxjijii
wxjijii
jii
iii
Lecture 6: Review of Regression October 14, 2014
Categorical Data Analysis, AUT 2014 14
14
Stratification vs Regression
• Generally, any stratified analysis could be performed as a regression model
• But stratification adjusts for covariates and all interactions among those covariates– E.g, sex, race, and the sex-race interaction
• Our habit in regression is to just adjust for the covariates (the “main effect”), and consider interactions less often
Lecture 6: Review of Regression October 14, 2014
Categorical Data Analysis, AUT 2014 15
15
Regression with “Saturated” Models
• Regression model is “saturated” if as many regression parameters as distinct combinations of predictors in the data
• No borrowing of information across groups
• Fitted values will correspond to common descriptive statistics– Linear regression: sample means (proportions)– Logistic regression: (log) sample odds– Poisson regression: (log) sample rates
Lecture 6: Review of Regression October 14, 2014
Categorical Data Analysis, AUT 2014 16
16
Example of “Saturated” Models
• Suppose model sex (two values), race (four values), and categorized age (three levels)
• Suppose we have at least one observation for each combination of sex, race, and categorized age
• Then a regression model that fits r = 24 parameters is saturated
• Model dummy variables with all interactions:– Intercept (1)– Main effects: sex (1), race (3), age (2)– Two-way interactions: sex-race (3), sex-age (2), race-age(6)– Three-way interactions: sex-race-age (6)
• (24 equations in 24 unknowns: Fit descriptive statistics exactly)
Lecture 6: Review of Regression October 14, 2014
Categorical Data Analysis, AUT 2014 17
17
Regression with Binary Predictor
• Common regression models correspond to common two-sample tests when only a single binary predictor is modeled
• Linear regression (classic SE) t test, equal variance (exact)
• Linear regression (robust SE) t test, unequal variance (approx)
• Logistic regression chi square test (score test)
• Prop hazard regression logrank test (score test)
Lecture 6: Review of Regression October 14, 2014
Categorical Data Analysis, AUT 2014 18
18
Stata: Multiple Regression
• In Stata, common syntax is used for common regression models – Keyword indicating model (and summary measure-link)– Response variable (except in proportional hazards)– List of predictors / model
• Special syntax for creating dummy variables, interactions– Options
• Output– Table of regression parameters, CI, p values
• (Optional display: back-transformed parameters of RR, OR, HR)– Test of entire regression model also provided
• A test that all slopes are equal to 0
• Post-estimation commands– Joint tests of parameters: test, testparm– Inference about combinations of parameters: lincom, nlcom
Lecture 6: Review of Regression October 14, 2014
Categorical Data Analysis, AUT 2014 19
19
Stata: Regression Commands
• Linear regression (means, proportions)– regress Y X W1 W2 W3, robust
• Linear regression ( geometric means)– regress logY X W1 W2 W3, robust
• Logistic regression (odds)– logistic Y X W1 W2 W3, [robust]– logit Y X W1 W2 W3, [robust]
• Poisson regression (rates)– poisson Y X W1 W2 W3, robust
• Proportional hazards regression (hazards)– stset Y Event– stcox X W1 W2 W3, robust
Lecture 6: Review of Regression October 14, 2014
Categorical Data Analysis, AUT 2014 20
20
Stata: glm Regression Commands
• Difference of means, proportions (linear regression)– glm Y X W1, robust family(gaussian) link(identity)
• Difference of log means, log proportions– glm Y X W1 W2, robust link(log)
• Difference of log geometric means (linear regression on logs)– glm logY X W1 W2, robust
• Difference of log odds (logistic regression)– glm Y X W1, [robust] family(binomial) link(logit)
• Difference of log rates (Poisson regression)– glm Y X W1 W2, robust family(poisson) link(log)
• Back transforming regression parameters with log link– glm Y X W1 W2, robust family(…) link(…) eform
Lecture 6: Review of Regression October 14, 2014
Categorical Data Analysis, AUT 2014 21
21
R: Multiple Regression
• The uwIntroStats package in R adopts similar syntax to Stata
• A single function takes the “functional” as its first argument– “Functional” = “summary measure”– E.g., mean, geometric mean, odds, rates, hazard
• Allows pre-specification of “multiple-partial” tests– Joint significance of dummy variables, polynomials, interactions,
etc.
Lecture 6: Review of Regression October 14, 2014
Categorical Data Analysis, AUT 2014 22
22
Regression Models withBinary Response
Lecture 6: Review of Regression October 14, 2014
Categorical Data Analysis, AUT 2014 23
23
Common Regression Models
• According to the parameter and contrast across groups– Mean (differences) Linear regression (a GLM)– Mean (ratios) GLM with log link (a GLM)– Geom Means (ratios) Linear regression on logs (a GLM)– Odds (ratios) Logistic regression (a GLM)– Rates (ratios) Poisson regression (a GLM)– Hazards (ratios) Proportional Hazards regr– Quantiles (ratios) Parametric (AFT) survival regr
• A special case for this course– Cumulative incidence Exponential regression (a GLM)
Lecture 6: Review of Regression October 14, 2014
Categorical Data Analysis, AUT 2014 24
24
General Regression
• General notation for variables and parameter
• The parameter might be the mean, geometric mean, odds, rate, instantaneous risk of an event (hazard), etc.
• (Sometimes we will use multiple regression predictors to model the scientific POI– E.g., linear and quadratic terms, dummy variables, etc.)
ofon distributi of measure)(summary Parameter subjectth for the variablesadjustment of Value
subject th for the POI theof Value subject th on the measured Response
,, 21
ii
ii
i
i
Yi
ii
WWXY
Lecture 6: Review of Regression October 14, 2014
Categorical Data Analysis, AUT 2014 25
25
Multiple Regression
• General notation for multiple regression model
• The link function can usually be viewed as either none (means) or log (geom mean, odds, hazard) but some exceptions– log odds viewed by many as logit mean– complementary log log link for proportions measuring cumulative
incidence, which relate to log link on exponential hazards• log (- log p) = log + log Δt
" covariatefor Slope" "Interest of Predfor Slope" Intercept""
modelingfor usedfunction link""
1
1
0
231210
jj
iiii
WX
gWWXg
Lecture 6: Review of Regression October 14, 2014
Categorical Data Analysis, AUT 2014 26
26
Generalized Linear Model (GLM)
• A special subset of my general regression model is the “Generalized Linear Model”– θ is the mean of response Y– Y has a 1-parameter exponential family distribution
• Includes normal (Gaussian), binomial, Poisson, exponential, gamma, negative binomial, log-normal…
• Estimating equations derived from maximum likelihood theory in these “regular” models (i.e., distributions that satisfy certain technical assumptions)– MLEs are “consistent”: arbitrarily close to the truth with large n– MLEs are “asymptotically efficient”: have greatest precision
possible in general (meet Cramer-Rao lower bound)
Lecture 6: Review of Regression October 14, 2014
Categorical Data Analysis, AUT 2014 27
27
GLM Estimating Equations
• Use maximum likelihood estimation assuming a parametric distribution for the data
• Quasi-likelihood and generalized estimating equations (GEE) use same general equation, but do not assume parametric distribution– Only model mean and mean-variance relationship
0ˆˆˆ
:satisfy toˆ estimatesparameter regression Find
:equation" estimatingfunction Score",~|
ˆ1
112
221102
ii
in
i i
iij
i
iiii
n
i i
iiin
i i
iij
iiiiiiiii
VYU
xgVYYU
xxxVxXY
Lecture 6: Review of Regression October 14, 2014
Categorical Data Analysis, AUT 2014 28
28
“Family”: Mean Variance Relationship
• If we trust our linear model and data distribution completely, then using the correct mean-variance relationship will be more efficient– Weights observations less if the variance is greater– Works because every weighted average will correctly estimate
the straight line relationship
in
i i
ii
in
i i
ii
in
i ii
ii
in
i
ii
YU
YU
YU
YU
12
1
1
12
lExponentia
Poisson
1 Bernoulli
Gaussian
Lecture 6: Review of Regression October 14, 2014
Categorical Data Analysis, AUT 2014 29
29
Role of Link Function
• Conceptually, we can use any link function with any distributionfamily
• There are often advantages of using the “canonical” link for each family: These tend to be the default value in statistical software– Gaussian distribution Identity link– Binomial distribution Logit link (on means)– Poisson distribution Log link– Exponential distribution Inverse link: g(μ) = 1/ μ
Lecture 6: Review of Regression October 14, 2014
Categorical Data Analysis, AUT 2014 30
30
Binary Data: Role of Link Function
• With Bernoulli data, we might consider three different link functions
ORX
eeYU
RRXeeYU
RDXXX
XYU
i
n
iX
X
i
i
n
iX
Xi
i
n
i ii
ii
i
i
i
i
1
1ˆ
1
1 Logit
1 Log
ˆ1ˆ Identity
Lecture 6: Review of Regression October 14, 2014
Categorical Data Analysis, AUT 2014 31
31
Uses of Regression
• Modeling questions about associations– “Defining averages” across groups– “Defining contrasts” for associations between response and POI– “Defining contrasts” for detecting effect modification
• Enabling comparisons across POI groups that are “otherwise similar” with respect to other variables– Adjusting for confounding– Adjusting to gain precision
Lecture 6: Review of Regression October 14, 2014
Categorical Data Analysis, AUT 2014 32
32
Borrowing Information
• Use other groups to make estimates in groups with sparse data
• Intuitively: 67 and 69 year olds would provide some relevant information about 68 year olds
• Assuming straight line relationship in modeled covariates tells us how to adjust data from other (even more distant) age groups– If we do not know about the exact functional relationship, we
might want to borrow information only close to each group
Lecture 6: Review of Regression October 14, 2014
Categorical Data Analysis, AUT 2014 33
33
Defining “Contrasts”
• Define a comparison across groups to use when answering scientific question
• If straight line relationship in parameter, slope for POI compares parameter between groups differing by 1 unit in X when all othercovariates in model are equal
• If nonlinear relationship in parameter, slope is average comparison of parameter between groups differing by 1 unit in X “holding covariates constant”– Statistical jargon: a “contrast” across the groups
• If multiple regression predictors model the POI, interpetation of the contrast is more difficult
Lecture 6: Review of Regression October 14, 2014
Categorical Data Analysis, AUT 2014 34
34
Comparison of Models
• The major difference between regression models is interpretationof the parameters– Summary: Mean, geometric mean, odds, hazards– Comparison of groups: Difference, ratio
• Issues related to inclusion of covariates remain the same– Address the scientific question
• Predictor of interest; Effect modifiers– Address confounding– Increase precision
Lecture 6: Review of Regression October 14, 2014
Categorical Data Analysis, AUT 2014 35
35
Interpretation of Parameters
• Intercept– Corresponds to a population with all modeled covariates equal to
zero– Most often outside range of data; quite often impossible; very
rarely of interest by itself
• Slope– A comparison between groups differing by 1 unit in corresponding
covariate, but agreeing on all other modeled covariates• Identity link: a difference in the summary measure• Log link: a ratio in the summary measure (difference in logs)
– Sometimes impossible to use this definition when modeling interactions or complex curves
• I most often resort to looking at predicted summary measures
Lecture 6: Review of Regression October 14, 2014
Categorical Data Analysis, AUT 2014 36
36
Regression with Identity Link
• E.g., modeling mean of response Y on predictors X, W1, W2, …
1
12110
1210
0
1210
: of Difference
1
00
Model
ijijii
ijijii
jii
iii
wxwWxXwxwWxX
WX
WX
Lecture 6: Review of Regression October 14, 2014
Categorical Data Analysis, AUT 2014 37
37
Regression with Identity Link
• Most common example: Linear regression for difference in means– Ordinary (unweighted) least squares
• Optimal (in some sense): independent, homoscedastic response– Weighted least squares: weighting by inverse variance
• Optimal (in some sense): independent, heteroscedastic response– (Biost 540) General least squares: weighting by covariance matrix
• Optimal (in some sense) for correlated response data
• Optimality: For any distribution having a variance, then– “Best linear unbiased estimate”: greatest precision among all
unbiased estimators that are a linear combination of the data• Gauss-Markov theorem
– If response is normal within groups and linear model correct, then• Unbiased and efficient (greatest possible precision)• Also MLE
Lecture 6: Review of Regression October 14, 2014
Categorical Data Analysis, AUT 2014 38
38
Linear Regression: Ordinary, Weighted
• Use least squares estimation– Ordinary least squares: assume equal variance– Weighted least squares: use actual variances
0ˆ
ˆ
:satisfy toˆ estimatesparameter regression Find
:equation" estimatingfunction Score"
Minimize :squares"Least "
,~|
12
12
12
222110
2
ji
n
i i
iiij
i
ji
n
i i
iiij
n
i i
iii
iiiiiiii
xxYU
xxYU
xY
xxxxxXY
Lecture 6: Review of Regression October 14, 2014
Categorical Data Analysis, AUT 2014 39
39
Properties of Least Squares Estimates
• In unweighted regression with no covariates– Estimated intercept is the sample mean
• In unweighted regression with a single binary estimator X– Estimated intercept is the sample mean for group with X=0– Estimated slope is the difference in sample means
• Group with X=1 minus group with X=0
• In unweighted regression with as many parameters as distinct combinations of predictors– The fitted value for every group will correspond exactly to the
sample mean
• In weighted regressions the last two may not hold, unless all subjects with the same covariate values have the same weights
Lecture 6: Review of Regression October 14, 2014
Categorical Data Analysis, AUT 2014 40
40
Linear Regression: Binary Response
• A mean-variance relationship: mean μ = p variance p(1-p)
• Optimal (BLUE) estimates would use weighted least squares– We have to estimate the weights in an iterative search
0ˆ1ˆ
ˆˆ
:satisfy toˆ estimatesparameter regression Find
1
:equation" estimatingfunction Score"
1
112
ji
n
iiiii
iiij
i
ji
n
i iiii
iiiji
n
i i
iiij
xxx
xYU
xxx
xYxxYU
Lecture 6: Review of Regression October 14, 2014
Categorical Data Analysis, AUT 2014 41
41
Stata: Binary Response with Identity Link
• Weighted least squares using iteratively estimated weights:– glm y x1 …,family(binomial) link(identity)– binreg y x1 …,rd– cs y x1
• Ordinary (unweighted) least squares :– glm y x1 …,family(gaussian) link(identity)– regress y x1 …
• Robust standard errors can be obtained with either– Specify option robust with glm or regress – Specify option vce(robust) with binreg
Lecture 6: Review of Regression October 14, 2014
Categorical Data Analysis, AUT 2014 42
42
Examples: From Fall 2013 HW #2
• Modeling females’ risk of dying within 4 years as a function of– Estrogen use (0 or 1)– Prior history of cardiovascular disease (0 or 1)– Age (65 – 100 years)
• Derived variable estr_prev = estrogen * prevdis
• Comparisons– No covariates– Binary covariate– Saturated model with two binary covariates– Continuous covariate
Lecture 6: Review of Regression October 14, 2014
Categorical Data Analysis, AUT 2014 43
43
Fitted Values with Linear Age Fit
• Some fitted values estimate a negative probability– Weights would be negative in weighted regression– Alters the “contrast” across groups
0.0
5.1
.15
.2Fi
tted
valu
es
60.00 70.00 80.00 90.00 100.00age
No estrogen, no CVD No estrogen, prior CVDEstrogen, no CVD Estrogen, prior CVD
fit5: Age Continuous Linear
Lecture 6: Review of Regression October 14, 2014
Categorical Data Analysis, AUT 2014 44
44
Relative Advantages
• IF the linear model is correct, weighting by the estimated mean-variance relationship is most efficient
• HOWEVER, if the linear model is not correct, we can end up with negative variance estimates– Not a good thing to use when weighting
• The unweighted analysis– is unbiased if the linear model is correct, – interpretable as a linear trend when the linear model is not exactly
correct, and– can be adjusted for the heteroscedasticity when making inference
Lecture 6: Review of Regression October 14, 2014
Categorical Data Analysis, AUT 2014 45
45
Linear Regression: “Gaussian” Response
• No mean-variance relationship
• Optimal (BLUE) estimates only if homoscedastic– But unbiased regardless
0ˆˆ
:satisfy toˆ estimatesparameter regression Find
:equation" estimatingfunction Score"
1
1
ji
n
iiij
i
ji
n
iiij
xxYU
xxYU
Lecture 6: Review of Regression October 14, 2014
Categorical Data Analysis, AUT 2014 46
46
Regression with Log Link
• E.g., modeling geometric mean, odds, rate, hazard of response Yon predictors X, W1, W2, …
1
12110
1210
0
1210
:log of Difference
log1 log
log00
log Model
ijijii
ijijii
jii
iii
wxwWxXwxwWxX
WX
WX
Lecture 6: Review of Regression October 14, 2014
Categorical Data Analysis, AUT 2014 47
47
Regression with Log Link: Back Transform
• E.g., modeling geometric mean, odds, rate, hazard of response Yon predictors X, W1, W2, …
1
12110
1210
0
:of Ratio
1
00
log Model
1210
e
ewWxXewWxX
eWX
WX
i
i
wxjijii
wxjijii
jii
iii
Lecture 6: Review of Regression October 14, 2014
Categorical Data Analysis, AUT 2014 48
48
Log Link: Binary Response
• A mean-variance relationship: mean μ = p variance p(1-p)
• Log link: log(p) = linear predictor
01
ˆ
:satisfy toˆ estimatesparameter regression Find
1
:equation"estimatingfunction Score"
1ˆ
ˆ
112
ji
n
ix
xi
j
i
xji
n
ixx
xix
ji
n
i i
xi
j
xeeYU
exee
eYexeYU
ii
ii
ii
iiii
iiii
ii
Lecture 6: Review of Regression October 14, 2014
Categorical Data Analysis, AUT 2014 49
49
Relative Advantages
• IF the linear model is correct, weighting by the estimated mean-variance relationship is most efficient
• HOWEVER, if the linear model is not correct, we can end up with negative variance estimates when the mean estimate > 1– Not a good thing to use when weighting
• We can do an unweighted analysis based on Poisson regression– is consistent if the linear model is correct, – interpretable as a linear trend when the linear model is not exactly
correct, and– can be adjusted for the heteroscedasticity when making inference
• This is generally important to do– (Can use a time variable equal to 1 for all subjects)
Lecture 6: Review of Regression October 14, 2014
Categorical Data Analysis, AUT 2014 50
50
Poisson Regression
22110
1
22110
loglog:logfor 1 oft coefficienknown a
having exposure"" with regression formulate that weNote
0ˆ:satisfy toˆ estimatesparameter regression Find
link Log
:equation" estimatingfunction Score"
log~|
iiiii
i
j
i
i
n
i
Xii
iiiiiiiii
xxttt
U
XetYU
xxxtPxXY
i
Lecture 6: Review of Regression October 14, 2014
Categorical Data Analysis, AUT 2014 51
51
Simple Poisson Regression
• Modeling rate of count response Y on predictor X
110
10
0
10
log1 log
log0
loglog,| Model!
|Pr P~ Distn
xxXxxX
X
XTTXTYEk
tetTkYtY
ii
ii
ii
iiiiiii
kii
t
iiiiii
ii
Lecture 6: Review of Regression October 14, 2014
Categorical Data Analysis, AUT 2014 52
52
Interpretation as Rates
• Exponentiation of parameters
110
10
0
1 0
loglog,| Model!
|Pr P~ Distn
10
eeexXeexX
eX
XTTXTYEk
tetTkYtY
xii
xii
ii
iiiiiii
kii
t
iiiiii
ii
Lecture 6: Review of Regression October 14, 2014
Categorical Data Analysis, AUT 2014 53
53
Simple Poisson Regression
• Interpretation of the model
• Rate when predictor is 0– Found by exponentiation of the intercept from the Poisson
regression: exp(0)
• Rate ratio between groups differing in the value of the predictor by 1 unit– Found by exponentiation of the slope from the Poisson
regression: exp(1)
Lecture 6: Review of Regression October 14, 2014
Categorical Data Analysis, AUT 2014 54
54
Stata Commands
• Same form as for other regression models
• Exception:– If the observed counts are measured over different amounts of
time or space, we must specify the length of “exposure”– poisson respvar predvar, exposure(tm) [robust]
• Exposure can also be given as the “offset”, which is just the log of the exposure time– poisson respvar predvar, offset(logtm) [robust]
• By default, Stata reports estimates on the log mean and log mean ratio scale– Specifying the option irr will cause Stata to suppress output of
the intercept and to report “incidence rate ratios”
Lecture 6: Review of Regression October 14, 2014
Categorical Data Analysis, AUT 2014 55
55
Stata: Poisson regression
• Weighted least squares using iteratively estimated weights:– poisson y x1 …,exposure(t)– glm y x1 …,family(poisson) link(log)
• Robust standard errors can be obtained with either– Specify option robust with glm or poisson– Specify option vce(robust) with binreg
Lecture 6: Review of Regression October 14, 2014
Categorical Data Analysis, AUT 2014 56
56
Similarity to Other Regressions
• Poisson regression uses maximum likelihood estimation to find parameter estimates
• If a saturated model is fit, the estimated mean in each group will agree exactly with the sample mean
• In large samples, the regression parameter estimates are approximately normally distributed– P values and CI that are displayed for each parameter estimate
are Wald- based estimates
ˆˆ
ˆ
)()()( :statTest
ˆˆˆ )()()( :CI 95%
0
2/1
esZ
errstdnullestimateZ
eszerrstdvaluecritestimate
Lecture 6: Review of Regression October 14, 2014
Categorical Data Analysis, AUT 2014 57
57
Technical Details
• Unlike linear regression, there is no closed form expression to find the Poisson regression parameter estimates
• Instead, computer programs use an iterative search
• This search may fail in saturated or nearly saturated models if some parameter corresponds to a group having no events– In this setting, Poisson regression parameters modeling the log
mean are trying to estimate negative infinity– The sample size is too small for the model
Lecture 6: Review of Regression October 14, 2014
Categorical Data Analysis, AUT 2014 58
58
Example
• Prevalence of stroke (cerebrovascular accident- CVA) by age in subset of Cardiovascular Health Study
• Response variable is CVA– Binary variable: 0= no history of prior stroke, 1= prior history of
stroke
• Predictor variable is Age– Continuous predictor
Lecture 6: Review of Regression October 14, 2014
Categorical Data Analysis, AUT 2014 59
59
Example: Regression Model
• Answer question by assessing linear trends in log odds of strokeby age
• Estimate best fitting line to log odds of CVA within age groups
• An association will exist if the slope (1) is nonzero– In that case, the odds (and probability) of CVA will be different
across different age groups
AgeAgeCVAE 10log
Lecture 6: Review of Regression October 14, 2014
Categorical Data Analysis, AUT 2014 60
60
Parameter Estimates. poisson cva age, robust
(iteration info deleted)
Poisson regression
Number of obs = 735
Wald chi2(1) = 2.61
Prob > chi2 = 0.1064
Log pseudolklhd = -245.0 Pseudo R2 = 0.0044
| Robust
cva | Coef. StdErr z P>|z| [95% Conf Intvl]
age | .0298 .0184 1.61 0.106 -.00637 .0659
_cons | -4.52 1.40 -3.23 0.001 -7.26 -1.77
Lecture 6: Review of Regression October 14, 2014
Categorical Data Analysis, AUT 2014 61
61
Interpretation of Stata Output
• Regression model for CVA on age
• Intercept is labeled by “_cons”– Estimated intercept: -4.52
• Slope is labeled by variable name: “age”– Estimated slope: 0.0298
• Estimated linear relationship:– log probability of CVA by age group given by
iAgeCVA 0298.052.4]E[ log
Lecture 6: Review of Regression October 14, 2014
Categorical Data Analysis, AUT 2014 62
62
Interpretation of Intercept
• Estimated log odds CVA for newborns is -4.52– Probability of CVA for newborns is e-4.52 = 0.0109
• Pretty ridiculous to try to estimate– We never sampled anyone less than 67– In this problem, the intercept is just a tool in fitting the model
iAgeCVA 0298.052.4] E[ log
Lecture 6: Review of Regression October 14, 2014
Categorical Data Analysis, AUT 2014 63
63
Interpretation of Slope
• Estimated difference in log probability of CVA for two groups differing by one year in age is 0.0298, with older group tending to higher probability– Risk Ratio: e0.0298= 1.0302– For 5 year age difference: e5x0.0298= 1.03025 = 1.161
• (If a straight line relationship is not true, we interpret the slope as an average difference in log odds CVA per one year difference inage)
iAgeCVA 0298.052.4] E[ log
Lecture 6: Review of Regression October 14, 2014
Categorical Data Analysis, AUT 2014 64
64
Example: Interpretation
• “From Poisson regression analysis, we estimate that for each year difference in age, the probability of stroke is 3.02% higher in the older group, though this estimate is not statistically significant (P = .106). A 95% CI suggests that this observation is not unusual if a group that is one year older might have probability of stroke that was anywhere from 0.6% lower or 6.8% higher than the younger group.”
Lecture 6: Review of Regression October 14, 2014
Categorical Data Analysis, AUT 2014 65
65
Logistic Regression
• Binary response variable
• Allows continuous (or multiple) grouping variables– But is OK with binary grouping variable also
• Compares odds of response across groups– “Odds ratio”
Lecture 6: Review of Regression October 14, 2014
Categorical Data Analysis, AUT 2014 66
66
Logistic Regression
0ˆ:satisfy toˆ estimatesparameter regression Find
1 link Logit
:equation" estimatingfunction Score"
1,1~|
1
22110
j
i
i
n
iX
X
i
iiiX
X
iii
U
Xe
eYU
xxxe
eBxXY
i
i
i
i
Lecture 6: Review of Regression October 14, 2014
Categorical Data Analysis, AUT 2014 67
67
Simple Logistic Regression
• Modeling odds of binary response Y on predictor X
110
10
0
10
oddslog1 oddslog
oddslog0
1loglogit Model
1Pr on Distributi
xxXxxX
X
Xp
pp
pY
i
i
i
ii
ii
ii
Lecture 6: Review of Regression October 14, 2014
Categorical Data Analysis, AUT 2014 68
68
Interpretation as Odds
• Exponentiation of regression parameters
110
10
0
10
odds1 odds odds0
1 Model
1Pr on Distributi
eeexXeexX
eX
eep
p
pY
xi
xi
i
X
i
i
ii
i
Lecture 6: Review of Regression October 14, 2014
Categorical Data Analysis, AUT 2014 69
69
Estimating Proportions
• Proportion = odds / (1 + odds)
110
110
10
10
00
10
10
11
1
1/0
1 Model
1Pr on Distributi
eeeeeepxX
eeeepxX
eepX
eeeep
pY
x
x
ii
x
x
ii
ii
X
X
i
ii
i
i
Lecture 6: Review of Regression October 14, 2014
Categorical Data Analysis, AUT 2014 70
70
Parameter Interpretation
• Interpretation of the logistic regression parameters based on odds
• Odds when predictor is 0– Found by exponentiation of the intercept from the logistic
regression: exp(0)
• Odds ratio between groups differing in the value of the predictor by 1 unit– Found by exponentiation of the slope from the logistic regression:
exp(1)
Lecture 6: Review of Regression October 14, 2014
Categorical Data Analysis, AUT 2014 71
71
Similarity to Other Regressions
• Logistic regression uses maximum likelihood estimation to find parameter estimates
• If a saturated model is fit, the estimated odds of event in eachgroup will agree exactly with the sample odds
• In large samples, the regression parameter estimates are approximately normally distributed– P values and CI that are displayed for each parameter estimate
are Wald- based estimates
ˆˆ
ˆ
)()()( :statTest
ˆˆˆ )()()( :CI 95%
0
2/1
esZ
errstdnullestimateZ
eszerrstdvaluecritestimate
Lecture 6: Review of Regression October 14, 2014
Categorical Data Analysis, AUT 2014 72
72
Technical Details
• Unlike linear regression, there is no closed form expression to find the logistic regression parameter estimates
• Instead, computer programs use an iterative search
• This search may fail in saturated or nearly saturated models if some parameter corresponds to a group having all events or no events– In this setting, logistic regression parameters modeling the log
odds are trying to estimate positive or negative infinity– The sample size is too small for the model
Lecture 6: Review of Regression October 14, 2014
Categorical Data Analysis, AUT 2014 73
73
Stata: Logistic regression
• Weighted least squares using iteratively estimated weights:– logistic y x1 …, or logit y x1 …,– glm y x1 …,family(binomial) link(logit)– binreg y x1 …,
• Robust standard errors can be obtained with either– Specify option robust with glm or logistic / logit– Specify option vce(robust) with binreg
Lecture 6: Review of Regression October 14, 2014
Categorical Data Analysis, AUT 2014 74
74
Example
• Prevalence of stroke (cerebrovascular accident- CVA) by age in subset of Cardiovascular Health Study
• Response variable is CVA– Binary variable: 0= no history of prior stroke, 1= prior history of
stroke
• Predictor variable is Age– Continuous predictor
Lecture 6: Review of Regression October 14, 2014
Categorical Data Analysis, AUT 2014 75
75
Example: Regression Model
• Answer question by assessing linear trends in log odds of strokeby age
• Estimate best fitting line to log odds of CVA within age groups
• An association will exist if the slope (1) is nonzero– In that case, the odds (and probability) of CVA will be different
across different age groups
AgeAgeCVAodds 10log
Lecture 6: Review of Regression October 14, 2014
Categorical Data Analysis, AUT 2014 76
76
Parameter Estimates. logit cva age
(iteration info deleted)
Number of obs = 735
LR chi2(1) = 2.45
Prob > chi2 = 0.1175
Log likelihood = -240.98969
Pseudo R2 = 0.0051
cva | Coef StdErr z P>|z| [95% Conf Int]
age | .0336 .0210 1.59 0.111 -.0077 .0748
_cons | -4.69 1.591 -2.95 0.003 -7.810 -1.572
Lecture 6: Review of Regression October 14, 2014
Categorical Data Analysis, AUT 2014 77
77
Interpretation of Stata Output
• Regression model for CVA on age
• Intercept is labeled by “_cons”– Estimated intercept: -4.69
• Slope is labeled by variable name: “age”– Estimated slope: 0.0336
• Estimated linear relationship:– log odds CVA by age group given by
iAgeCVA 0336.069.4 odds log
Lecture 6: Review of Regression October 14, 2014
Categorical Data Analysis, AUT 2014 78
78
Interpretation of Intercept
• Estimated log odds CVA for newborns is -4.69– Odds of CVA for newborns is e-4.69 = 0.0092– Probability of CVA for newborns
• Use prob = odds / (1+odds): .0092 / (1+.0092)= .0091
• Pretty ridiculous to try to estimate– We never sampled anyone less than 67– In this problem, the intercept is just a tool in fitting the model
iAgeCVA 0336.069.4 odds log
Lecture 6: Review of Regression October 14, 2014
Categorical Data Analysis, AUT 2014 79
79
Interpretation of Slope
• Estimated difference in log odds CVA for two groups differing byone year in age is 0.0336, with older group tending to higher log odds– Odds Ratio: e0.0336= 1.034– For 5 year age difference: e5x0.0336= 1.0345 = 1.183
• (If a straight line relationship is not true, we interpret the slope as an average difference in log odds CVA per one year difference inage)
iAgeCVA 0336.069.4 odds log
Lecture 6: Review of Regression October 14, 2014
Categorical Data Analysis, AUT 2014 80
80
Stata: “logit” versus “logistic”
• We are rarely interested in the intercept by itself– We do have to use it when estimating odds of an event in a single
group
• Given that we are rarely interested in the intercept, we might as well use the “logistic” command– It will provide inference for the odds ratio– We don’t have to exponentiate the slope estimate
Lecture 6: Review of Regression October 14, 2014
Categorical Data Analysis, AUT 2014 81
81
Odds Ratios using “logistic”.logistic cva age
Logistic regression Number of obs = 735
LR chi2(1) = 2.45
Prob > chi2 = 0.1175
Log likelihood = -240.98969
Pseudo R2 = 0.0051
cva |Odds Ratio StdErr z P>|z| [95% Conf Int]
age | 1.034 .0218 1.59 0.111 .992 1.078
Lecture 6: Review of Regression October 14, 2014
Categorical Data Analysis, AUT 2014 82
82
Example: Interpretation
• “From logistic regression analysis, we estimate that for each year difference in age, the odds of stroke is 3.4% higher in the older group, though this estimate is not statistically significant (P = .113). A 95% CI suggests that this observation is not unusual if a group that is one year older might have odds of stroke that was anywhere from 0.8% lower or 7.8% higher than the younger group.”
Lecture 6: Review of Regression October 14, 2014
Categorical Data Analysis, AUT 2014 83
83
Comments on Interpretation
• I express this as a difference between group odds rather than a change with aging– We did not do a longitudinal study
• To the extent that the true group log odds have a linear relationship, this interpretation applies exactly
• If the true relationship is nonlinear– The slope estimates the “first order trend” for the sampled age
distribution– We should not regard the estimates of individual group
probabilities / odds as accurate
Lecture 6: Review of Regression October 14, 2014
Categorical Data Analysis, AUT 2014 84
84
Logistic Regression and 2 Test
• Logistic regression with a binary predictor (two groups) corresponds to familiar chi squared test
• Three possible statistics from logistic regression– Wald: The test based on the estimate and SE– Score: Corresponds to chi squared test, but not given in Stata
output– Likelihood ratio test: Can be obtained using post-regression
commands in Stata (covered with adjusting for covariates)
Lecture 6: Review of Regression October 14, 2014
Categorical Data Analysis, AUT 2014 85
85
Properties of MLEs with Canonical Link
• In regression with no covariates– Estimated intercept is the sample mean
• In regression with a single binary estimator X– Estimated intercept is the sample mean for group with X=0– Estimated slope is the difference in sample means
• Group with X=1 minus group with X=0
• In regression with as many parameters as distinct combinations of predictors– The fitted value for every group will correspond exactly to the
sample mean– A “saturated model”
Lecture 6: Review of Regression October 14, 2014
Categorical Data Analysis, AUT 2014 86
86
Stata: Regression Commands
• Linear regression (means, proportions)– regress Y X W1 W2 W3, robust
• Linear regression ( geometric means)– regress logY X W1 W2 W3, robust
• Logistic regression (odds)– logistic Y X W1 W2 W3, [robust]– logit Y X W1 W2 W3, [robust]
• Poisson regression (rates)– poisson Y X W1 W2 W3, robust exposure(timevar)
• Proportional hazards regression (hazards)– stset Y Event– stcox X W1 W2 W3, robust
Lecture 6: Review of Regression October 14, 2014
Categorical Data Analysis, AUT 2014 87
87
Stata: glm Regression Commands
• Difference of means, proportions (linear regression)– glm Y X W1, robust family(gaussian) link(identity)
• Difference of log means, log proportions– glm Y X W1 W2, robust link(log)
• Difference of log geometric means (linear regression on logs)– glm logY X W1 W2, robust
• Difference of log odds (logistic regression)– glm Y X W1, [robust] family(binomial) link(logit)
• Difference of log rates (Poisson regression)– glm Y X W1 W2, robust family(poisson) link(log)
• Back transforming regression parameters with log link– glm Y X W1 W2, robust family(…) link(…) eform
Lecture 6: Review of Regression October 14, 2014
Categorical Data Analysis, AUT 2014 88
88
Adjustment for Covariates
• We “adjust” for other covariates
• Define groups according to– Predictor of interest, and– Other covariates
• Compare the distribution of response across groups which– differ with respect to the Predictor of Interest, but– are the same with respect to the other covariates
• “holding other variables constant”
Lecture 6: Review of Regression October 14, 2014
Categorical Data Analysis, AUT 2014 89
89
Adjusted Regression Analyses
Lecture 6: Review of Regression October 14, 2014
Categorical Data Analysis, AUT 2014 90
90
Controlling Variation
• In a two sample comparison of means, we might control some variable in order to decrease the within group variability– Restrict population sampled– Standardize ancillary treatments and exposures after accrual– Standardize measurement procedure
Lecture 6: Review of Regression October 14, 2014
Categorical Data Analysis, AUT 2014 91
91
Adjusting for Covariates
• When comparing means using stratified analyses or linear regression, adjustment for precision variables decreases the within group standard deviation– Var (Y | X) vs Var (Y | X, W)
Lecture 6: Review of Regression October 14, 2014
Categorical Data Analysis, AUT 2014 92
92
Ex: Linear Regression
tindependen if 0 ,orthogonal if 0
|
1ˆˆ
ion)randomizat (completet independen if,|ˆ|ˆ)stratified (balanced, "orthogonal" if
,,1,,~,| :Adj
,,1,,~| :Unadj
22
22
22
2
2
1
2
1
11
11
2210
210
|,|
,||
,|
|
XWXW
XW
W
ii
ind
iii
i
ind
ii
rr
XWVar
rXnVarse
XnVarse
WXEEXE
niWXWXY
niXXY
XYWXY
WXYXY
WXY
XY
Lecture 6: Review of Regression October 14, 2014
Categorical Data Analysis, AUT 2014 93
93
Precision with Proportions
• When analyzing proportions (means), the mean variance relationship is important
• Precision is greatest when proportion is close to 0 or 1
• Greater homogeneity of groups makes results more deterministic– (At least, I always hope for this)
• Hence, we should get lower within-group variance upon adjusting for prognostic variables
Lecture 6: Review of Regression October 14, 2014
Categorical Data Analysis, AUT 2014 94
94
Ex: Diff of Indep Proportions
2
22
1
212
221
2212121
2121
ˆ/1
1
ˆˆˆ/;
,,1;2,1,,1~
nnnVserrV
pp
YYpppp
nnrnnn
njipBYind
iii
iiij
Lecture 6: Review of Regression October 14, 2014
Categorical Data Analysis, AUT 2014 95
95
Precision with Odds
• When analyzing odds (a nonlinear function of the mean), adjusting for a precision variable results in more extreme estimates– odds = p / (1-p)– odds using average of stratum specific p is not the average of
stratum specific odds
• Generally, little “precision” is gained due to the mean-variance relationship– Unless the precision variable is highly prognostic
Lecture 6: Review of Regression October 14, 2014
Categorical Data Analysis, AUT 2014 96
96
Precision with Hazards
• When analyzing hazards, adjusting for a precision variable results in more extreme estimates
• The standard error tends to still be related to the number of observed events– Higher hazard ratio with same standard error greater precision
Lecture 6: Review of Regression October 14, 2014
Categorical Data Analysis, AUT 2014 97
97
Adjustment for Covariates
• We “adjust” for other covariates
• Define groups according to– Predictor of interest, and– Other covariates
• Compare the distribution of response across groups which– differ with respect to the Predictor of Interest, but– are the same with respect to the other covariates
• “holding other variables constant”
Lecture 6: Review of Regression October 14, 2014
Categorical Data Analysis, AUT 2014 98
98
Unadjusted vs Adjusted Models
• Adjustment for covariates changes the scientific question
• Unadjusted models– Slope compares parameters across groups differing by 1 unit in
the modeled predictor• Groups may also differ with respect to other variables
• Adjusted models– Slope compares parameters across groups differing by 1 unit in
the modeled predictor but similar with respect to other modeled covariates
Lecture 6: Review of Regression October 14, 2014
Categorical Data Analysis, AUT 2014 99
99
Interpretation of Slopes
• Difference in interpretation of slopes
– β1 = Compares for groups differing by 1 unit in X• (The distribution of W might differ across groups being compared)
– γ1 = Compares for groups differing by 1 unit in X, but agreeing in their values of W
iiii WXWXg 210, :Model Adjusted
ii XXg 10 :Model Unadjusted
Lecture 6: Review of Regression October 14, 2014
Categorical Data Analysis, AUT 2014 100
100
Comparing models
?ˆˆˆˆ is When
?ˆˆ isn Whe:Statistics
?ˆˆ is When
? isen Wh:Science
, Adjusted
Unadjusted
11
11
11
11
210
10
eses
sese
WXWXg
XXg
iiii
ii
Lecture 6: Review of Regression October 14, 2014
Categorical Data Analysis, AUT 2014 101
101
General Results
• These questions can not be answered precisely in the general case
• However, in linear regression we can derive exact results
• These will serve as a basis for later examination of– Logistic regression– Poisson regression– Proportional hazards regression
Lecture 6: Review of Regression October 14, 2014
Categorical Data Analysis, AUT 2014 102
102
Linear Regression
• Difference in interpretation of slopes
– β1 = Diff in mean Y for groups differing by 1 unit in X• (The distribution of W might differ across groups being compared)
– γ1 = Diff in mean Y for groups differing by 1 unit in X, but agreeing in their values of W
iiiii WXWXYE 210, :Model Adjusted
iii XXYE 10 :Model Unadjusted
Lecture 6: Review of Regression October 14, 2014
Categorical Data Analysis, AUT 2014 103
103
Relationships: True Slopes• The slope of the unadjusted model will tend to be
• Hence, true adjusted and unadjusted slopes for X are estimating the same quantity only if
– ρXW = 0 (X and W are truly uncorrelated), OR
– (no association between W and Y after adjusting for X)
211
X
WXW
02
Lecture 6: Review of Regression October 14, 2014
Categorical Data Analysis, AUT 2014 104
104
Relationships: Estimated Slopes• The estimated slope of the unadjusted model will be
• Hence, estimated adjusted and unadjusted slopes for X are equal only if
– rXW = 0 (X and W are uncorrelated in the sample, which can be arranged by experimental design), OR
– (which cannot be predetermined, because Y is random)
– sW = 0 (W is controlled at a single value in which case rXW = 0)
XWYWYXX
WXW rrrs
sr211 ˆ1ˆˆ
0ˆ2
Lecture 6: Review of Regression October 14, 2014
Categorical Data Analysis, AUT 2014 105
105
Relationships: True SE
2,|
2|
22
2|
22
22
1
2
1
,|||
1,
ˆ Model Adjusted
ˆ Model Unadjusted
WXYXWXY
XW
WXYVarXWVarXYVar
rXnVarWXYVar
se
XnVarXYVar
se
Lecture 6: Review of Regression October 14, 2014
Categorical Data Analysis, AUT 2014 106
106
Relationships: True SE
0| OR 0AND
0if ˆˆ Thus,
,|||
1,
ˆ Model Adjusted
ˆ Model Unadjusted
2
11
22
22
1
2
1
XWVar
rsese
WXYVarXWVarXYVar
rXnVarWXYVar
se
XnVarXYVar
se
XW
XW
Lecture 6: Review of Regression October 14, 2014
Categorical Data Analysis, AUT 2014 107
107
Relationships: Estimated SE
2210
2
10
222
1
2
2
1
ˆˆˆ,|
ˆˆ|
113/,
ˆˆ Model Adjusted
12/ˆˆ Model Unadjusted
iii
ii
XWX
X
WXYWXYSSE
XYXYSSE
rsnnWXYSSE
es
snnXYSSE
es
Lecture 6: Review of Regression October 14, 2014
Categorical Data Analysis, AUT 2014 108
108
Relationships: Estimated SE
3/,2/AND
0if ˆˆˆˆ Thus,
113/,
ˆˆ Model Adjusted
12/ˆˆ Model Unadjusted
11
222
1
2
2
1
nWXYSSEnXYSSE
reses
rsnnWXYSSE
es
snnXYSSE
es
XW
XWX
X
Lecture 6: Review of Regression October 14, 2014
Categorical Data Analysis, AUT 2014 109
109
Residual Squared Error
WXYSSEXYSSE
WXYWXYSSE
XYXYSSE
iii
ii
,||:data same on the calculatedWhen
ˆˆˆ,|
ˆˆ|2
210
2
10
Lecture 6: Review of Regression October 14, 2014
Categorical Data Analysis, AUT 2014 110
110
Relationships: Estimated SE
0ˆ if ,|| and ,0OR
,|| casein which ,0ˆ if ˆˆ Now
ˆˆˆ,|
ˆˆ|
2
2
11
2210
2
10
WXYSSEXYSSEr
WXYSSEXYSSE
WXYWXYSSE
XYXYSSE
XW
iii
ii
Lecture 6: Review of Regression October 14, 2014
Categorical Data Analysis, AUT 2014 111
111
Special Cases
• Behavior of unadjusted and adjusted models according to whether– X and W are uncorrelated (no association in means)– W is associated with Y after adjustment for X
InflationVar Irrelevant0gConfoundinPrecision0
00
2
2
XWXW rr
Lecture 6: Review of Regression October 14, 2014
Categorical Data Analysis, AUT 2014 112
112
Simulations
• Unadjusted and adjusted estimates of effect of binary POI as a function of– Effect of a covariate on summary of outcome (mean, odds, …)– Sampling scheme: Association between covariate and POI
• Difference in mean covariate• Difference in median covariate
iiii
ii
XXX
WXW
iii
WXWXg
XXg
WWssr
XXWE
210
10
011
10
, : Adjusted
:Unadjusted
ˆ
: Sampling
Lecture 6: Review of Regression October 14, 2014
Categorical Data Analysis, AUT 2014 113
113
Linear Regression
• Simulation results
Truth Avg Estimates (SE)
ΔMdn α1 rXW γ2 γ1 β1 γ1
Irrelevant 0.0 0.0 0.00 0.0 0.0 0.0 (0.20) 0.0 (0.20)
Precision 0.0 0.0 0.00 1.0 0.0 0.0 (0.28) 0.0 (0.19)
Precision - 0.3 0.0 0.00 1.0 0.0 0.0 (0.28) 0.0 (0.20)
Precision 0.0 0.0 0.00 1.0 1.0 1.0 (0.28) 1.0 (0.20)
Confound 0.3 0.3 0.15 1.0 0.0 0.3 (0.28) 0.0 (0.21)
Confound 0.0 0.3 0.15 1.0 0.0 0.3 (0.29) 0.0 (0.21)
Var Inflatn 0.0 1.0 0.45 0.0 0.0 0.0 (0.20) 0.0 (0.22)
Lecture 6: Review of Regression October 14, 2014
Categorical Data Analysis, AUT 2014 114
114
Linear Regression
• Simulation results
Truth Avg Estimates (SE)
ΔMdn α1 rXW γ2 γ1 β1 γ1
Irrelevant 0.0 0.0 0.00 0.0 0.0 0.0 (0.20) 0.0 (0.20)
Precision 0.0 0.0 0.00 1.0 0.0 0.0 (0.28) 0.0 (0.19)
Precision - 0.3 0.0 0.00 1.0 0.0 0.0 (0.28) 0.0 (0.20)
Precision 0.0 0.0 0.00 1.0 1.0 1.0 (0.28) 1.0 (0.20)
Confound 0.3 0.3 0.15 1.0 0.0 0.3 (0.28) 0.0 (0.21)
Confound 0.0 0.3 0.15 1.0 0.0 0.3 (0.29) 0.0 (0.21)
Var Inflatn 0.0 1.0 0.45 0.0 0.0 0.0 (0.20) 0.0 (0.22)
Lecture 6: Review of Regression October 14, 2014
Categorical Data Analysis, AUT 2014 115
115
Linear Regression
• Simulation results
Truth Avg Estimates (SE)
ΔMdn α1 rXW γ2 γ1 β1 γ1
Irrelevant 0.0 0.0 0.00 0.0 0.0 0.0 (0.20) 0.0 (0.20)
Precision 0.0 0.0 0.00 1.0 0.0 0.0 (0.28) 0.0 (0.19)
Precision - 0.3 0.0 0.00 1.0 0.0 0.0 (0.28) 0.0 (0.20)
Precision 0.0 0.0 0.00 1.0 1.0 1.0 (0.28) 1.0 (0.20)
Confound 0.3 0.3 0.15 1.0 0.0 0.3 (0.28) 0.0 (0.21)
Confound 0.0 0.3 0.15 1.0 0.0 0.3 (0.29) 0.0 (0.21)
Var Inflatn 0.0 1.0 0.45 0.0 0.0 0.0 (0.20) 0.0 (0.22)
Lecture 6: Review of Regression October 14, 2014
Categorical Data Analysis, AUT 2014 116
116
Linear Regression
• Simulation results
Truth Avg Estimates (SE)
ΔMdn α1 rXW γ2 γ1 β1 γ1
Irrelevant 0.0 0.0 0.00 0.0 0.0 0.0 (0.20) 0.0 (0.20)
Precision 0.0 0.0 0.00 1.0 0.0 0.0 (0.28) 0.0 (0.19)
Precision - 0.3 0.0 0.00 1.0 0.0 0.0 (0.28) 0.0 (0.20)
Precision 0.0 0.0 0.00 1.0 1.0 1.0 (0.28) 1.0 (0.20)
Confound 0.3 0.3 0.15 1.0 0.0 0.3 (0.28) 0.0 (0.21)
Confound 0.0 0.3 0.15 1.0 0.0 0.3 (0.29) 0.0 (0.21)
Var Inflatn 0.0 1.0 0.45 0.0 0.0 0.0 (0.20) 0.0 (0.22)
Lecture 6: Review of Regression October 14, 2014
Categorical Data Analysis, AUT 2014 117
117
Linear Regression
• Simulation results
Truth Avg Estimates (SE)
ΔMdn α1 rXW γ2 γ1 β1 γ1
Irrelevant 0.0 0.0 0.00 0.0 0.0 0.0 (0.20) 0.0 (0.20)
Precision 0.0 0.0 0.00 1.0 0.0 0.0 (0.28) 0.0 (0.19)
Precision - 0.3 0.0 0.00 1.0 0.0 0.0 (0.28) 0.0 (0.20)
Precision 0.0 0.0 0.00 1.0 1.0 1.0 (0.28) 1.0 (0.20)
Confound 0.3 0.3 0.15 1.0 0.0 0.3 (0.28) 0.0 (0.21)
Confound 0.0 0.3 0.15 1.0 0.0 0.3 (0.29) 0.0 (0.21)
Var Inflatn 0.0 1.0 0.45 0.0 0.0 0.0 (0.20) 0.0 (0.22)
Lecture 6: Review of Regression October 14, 2014
Categorical Data Analysis, AUT 2014 118
118
Linear Regression
• Simulation results
Truth Avg Estimates (SE)
ΔMdn α1 rXW γ2 γ1 β1 γ1
Irrelevant 0.0 0.0 0.00 0.0 0.0 0.0 (0.20) 0.0 (0.20)
Precision 0.0 0.0 0.00 1.0 0.0 0.0 (0.28) 0.0 (0.19)
Precision - 0.3 0.0 0.00 1.0 0.0 0.0 (0.28) 0.0 (0.20)
Precision 0.0 0.0 0.00 1.0 1.0 1.0 (0.28) 1.0 (0.20)
Confound 0.3 0.3 0.15 1.0 0.0 0.3 (0.28) 0.0 (0.21)
Confound 0.0 0.3 0.15 1.0 0.0 0.3 (0.29) 0.0 (0.21)
Var Inflatn 0.0 1.0 0.45 0.0 0.0 0.0 (0.20) 0.0 (0.22)
Lecture 6: Review of Regression October 14, 2014
Categorical Data Analysis, AUT 2014 119
119
Take Home Message: Risk Difference
• The magnitude of confounding is a product of– Magnitude of association between covariate and response AND– Difference of mean covariate value across POI groups
• If adjustment would be linear, mean (not median, etc) matters
• If no confounding, adjusted and unadjusted estimates will be equal (approximately)
• Standard errors are– Decreased if added covariates are (conditionally) associated with
response– Increased if added covariates that are (conditionally) correlated
with the POI
• After adjusting for a confounder, the precision of the estimates for response-POI association can be larger or smaller than in an unadjusted analysis
Lecture 6: Review of Regression October 14, 2014
Categorical Data Analysis, AUT 2014 120
120
Logistic Regression
• Simulation results
Truth Avg Estimates (SE)
ΔMdn α1 rXW γ2 γ1 β1 γ1
Irrelevant 0.0 0.0 0.00 0.0 0.0 0.0 (0.42) 0.0 (0.42)
Precision 0.0 0.0 0.00 1.0 0.0 0.0 (0.40) 0.0 (0.42)
Precision - 0.3 0.0 0.00 1.0 0.0 0.0 (0.42) 0.0 (0.43)
Precision 0.0 0.0 0.00 1.0 1.0 0.8 (0.43) 1.0 (0.49)
Confound 0.3 0.3 0.15 1.0 0.0 0.3 (0.43) 0.0 (0.48)
Confound 0.0 0.3 0.15 1.0 0.0 0.2 (0.41) 0.0 (0.47)
Var Inflatn 0.0 1.0 0.45 0.0 0.0 0.0 (0.41) 0.0 (0.47)
Lecture 6: Review of Regression October 14, 2014
Categorical Data Analysis, AUT 2014 121
121
Logistic Regression
• Simulation results
Truth Avg Estimates (SE)
ΔMdn α1 rXW γ2 γ1 β1 γ1
Irrelevant 0.0 0.0 0.00 0.0 0.0 0.0 (0.42) 0.0 (0.42)
Precision 0.0 0.0 0.00 1.0 0.0 0.0 (0.40) 0.0 (0.42)
Precision - 0.3 0.0 0.00 1.0 0.0 0.0 (0.42) 0.0 (0.43)
Precision 0.0 0.0 0.00 1.0 1.0 0.8 (0.43) 1.0 (0.49)
Confound 0.3 0.3 0.15 1.0 0.0 0.3 (0.43) 0.0 (0.48)
Confound 0.0 0.3 0.15 1.0 0.0 0.2 (0.41) 0.0 (0.47)
Var Inflatn 0.0 1.0 0.45 0.0 0.0 0.0 (0.41) 0.0 (0.47)
Lecture 6: Review of Regression October 14, 2014
Categorical Data Analysis, AUT 2014 122
122
Logistic Regression
• Simulation results
Truth Avg Estimates (SE)
ΔMdn α1 rXW γ2 γ1 β1 γ1
Irrelevant 0.0 0.0 0.00 0.0 0.0 0.0 (0.42) 0.0 (0.42)
Precision 0.0 0.0 0.00 1.0 0.0 0.0 (0.40) 0.0 (0.42)
Precision - 0.3 0.0 0.00 1.0 0.0 0.0 (0.42) 0.0 (0.43)
Precision 0.0 0.0 0.00 1.0 1.0 0.8 (0.43) 1.0 (0.49)
Confound 0.3 0.3 0.15 1.0 0.0 0.3 (0.43) 0.0 (0.48)
Confound 0.0 0.3 0.15 1.0 0.0 0.2 (0.41) 0.0 (0.47)
Var Inflatn 0.0 1.0 0.45 0.0 0.0 0.0 (0.41) 0.0 (0.47)
Lecture 6: Review of Regression October 14, 2014
Categorical Data Analysis, AUT 2014 123
123
Logistic Regression
• Simulation results
Truth Avg Estimates (SE)
ΔMdn α1 rXW γ2 γ1 β1 γ1
Irrelevant 0.0 0.0 0.00 0.0 0.0 0.0 (0.42) 0.0 (0.42)
Precision 0.0 0.0 0.00 1.0 0.0 0.0 (0.40) 0.0 (0.42)
Precision - 0.3 0.0 0.00 1.0 0.0 0.0 (0.42) 0.0 (0.43)
Precision 0.0 0.0 0.00 1.0 1.0 0.8 (0.43) 1.0 (0.49)
Confound 0.3 0.3 0.15 1.0 0.0 0.3 (0.43) 0.0 (0.48)
Confound 0.0 0.3 0.15 1.0 0.0 0.2 (0.41) 0.0 (0.47)
Var Inflatn 0.0 1.0 0.45 0.0 0.0 0.0 (0.41) 0.0 (0.47)
Lecture 6: Review of Regression October 14, 2014
Categorical Data Analysis, AUT 2014 124
124
Logistic Regression
• Simulation results
Truth Avg Estimates (SE)
ΔMdn α1 rXW γ2 γ1 β1 γ1
Irrelevant 0.0 0.0 0.00 0.0 0.0 0.0 (0.42) 0.0 (0.42)
Precision 0.0 0.0 0.00 1.0 0.0 0.0 (0.40) 0.0 (0.42)
Precision - 0.3 0.0 0.00 1.0 0.0 0.0 (0.42) 0.0 (0.43)
Precision 0.0 0.0 0.00 1.0 1.0 0.8 (0.43) 1.0 (0.49)
Confound 0.3 0.3 0.15 1.0 0.0 0.3 (0.43) 0.0 (0.48)
Confound 0.0 0.3 0.15 1.0 0.0 0.2 (0.41) 0.0 (0.47)
Var Inflatn 0.0 1.0 0.45 0.0 0.0 0.0 (0.41) 0.0 (0.47)
Lecture 6: Review of Regression October 14, 2014
Categorical Data Analysis, AUT 2014 125
125
Take Home Message: Odds Ratios
• The magnitude of confounding is primarily a function of– Magnitude of association between covariate and response AND– Difference of mean covariate value across POI groups
• If adjustment would be linear, mean (not median, etc) matters most
• Even if no confounding, adjusted and unadjusted estimates will be different if new covariate (conditionally) associated with response– Adjusting for a precision variable will “deattenuate” OR (and
estimate)– The amount of “deattenuation” depends on
• The strength of association between the added covariate and the response, and
• The adjusted strength of association between the POI and the response
Lecture 6: Review of Regression October 14, 2014
Categorical Data Analysis, AUT 2014 126
126
Deattenuation of OR
Lecture 6: Review of Regression October 14, 2014
Categorical Data Analysis, AUT 2014 127
127
Take Home Message: Odds Ratios
• Standard errors of estimated OR measuring POI – response association are largely driven by the mean-variance relationship– SE of odds behaves like 1 / p(1-p)– Greater homogeneity of groups leads to p closer to 0 or 1– Usually statistical significance of the POI – response association
is affected very little by adjustment for a “precision” variable unless it is very strongly associated with response
• Note that if we persist in estimating the population OR instead of the adjusted OR, we do gain precision by adjustment– We do have to take a weighted average, however– See Tangen & Koch, Stat Med, 2000
Lecture 6: Review of Regression October 14, 2014
Categorical Data Analysis, AUT 2014 128
128
Poisson Regression
• Simulation results
Truth Avg Estimates (SE)
ΔMdn α1 rXW γ2 γ1 β1 γ1
Irrelevant 0.0 0.0 0.00 0.0 0.0 0.0 (0.18) 0.0 (0.18)
Precision 0.0 0.0 0.00 1.0 0.0 0.0 (0.27) 0.0 (0.10)
Precision - 0.3 0.0 0.00 1.0 0.0 0.4 (0.75) 0.0 (0.10)
Precision 0.0 0.0 0.00 1.0 1.0 1.0 (0.26) 1.0 (0.08)
Confound 0.3 0.3 0.15 1.0 0.0 0.3 (0.28) 0.0 (0.09)
Confound 0.0 0.3 0.15 1.0 0.0 0.7 (0.76) 0.0 (0.09)
Var Inflatn 0.0 1.0 0.45 0.0 0.0 0.0 (0.12) 0.0 (0.13)
Lecture 6: Review of Regression October 14, 2014
Categorical Data Analysis, AUT 2014 129
129
Poisson Regression
• Simulation results
Truth Avg Estimates (SE)
ΔMdn α1 rXW γ2 γ1 β1 γ1
Irrelevant 0.0 0.0 0.00 0.0 0.0 0.0 (0.18) 0.0 (0.18)
Precision 0.0 0.0 0.00 1.0 0.0 0.0 (0.27) 0.0 (0.10)
Precision - 0.3 0.0 0.00 1.0 0.0 0.4 (0.75) 0.0 (0.10)
Precision 0.0 0.0 0.00 1.0 1.0 1.0 (0.26) 1.0 (0.08)
Confound 0.3 0.3 0.15 1.0 0.0 0.3 (0.28) 0.0 (0.09)
Confound 0.0 0.3 0.15 1.0 0.0 0.7 (0.76) 0.0 (0.09)
Var Inflatn 0.0 1.0 0.45 0.0 0.0 0.0 (0.12) 0.0 (0.13)
Lecture 6: Review of Regression October 14, 2014
Categorical Data Analysis, AUT 2014 130
130
Poisson Regression
• Simulation results
Truth Avg Estimates (SE)
ΔMdn α1 rXW γ2 γ1 β1 γ1
Irrelevant 0.0 0.0 0.00 0.0 0.0 0.0 (0.18) 0.0 (0.18)
Precision 0.0 0.0 0.00 1.0 0.0 0.0 (0.27) 0.0 (0.10)
Precision - 0.3 0.0 0.00 1.0 0.0 0.4 (0.75) 0.0 (0.10)
Precision 0.0 0.0 0.00 1.0 1.0 1.0 (0.26) 1.0 (0.08)
Confound 0.3 0.3 0.15 1.0 0.0 0.3 (0.28) 0.0 (0.09)
Confound 0.0 0.3 0.15 1.0 0.0 0.7 (0.76) 0.0 (0.09)
Var Inflatn 0.0 1.0 0.45 0.0 0.0 0.0 (0.12) 0.0 (0.13)
Lecture 6: Review of Regression October 14, 2014
Categorical Data Analysis, AUT 2014 131
131
Poisson Regression
• Simulation results
Truth Avg Estimates (SE)
ΔMdn α1 rXW γ2 γ1 β1 γ1
Irrelevant 0.0 0.0 0.00 0.0 0.0 0.0 (0.18) 0.0 (0.18)
Precision 0.0 0.0 0.00 1.0 0.0 0.0 (0.27) 0.0 (0.10)
Precision - 0.3 0.0 0.00 1.0 0.0 0.4 (0.75) 0.0 (0.10)
Precision 0.0 0.0 0.00 1.0 1.0 1.0 (0.26) 1.0 (0.08)
Confound 0.3 0.3 0.15 1.0 0.0 0.3 (0.28) 0.0 (0.09)
Confound 0.0 0.3 0.15 1.0 0.0 0.7 (0.76) 0.0 (0.09)
Var Inflatn 0.0 1.0 0.45 0.0 0.0 0.0 (0.12) 0.0 (0.13)
Lecture 6: Review of Regression October 14, 2014
Categorical Data Analysis, AUT 2014 132
132
Take Home Message: Risk Ratios
• In simulations, I allowed for “overdispersion” of mean-variance
• The magnitude of confounding is primarily a function of– Magnitude of association between covariate and response AND– Difference of mean covariate value across POI groups
• If adjustment would be linear, mean matters most, though the skewness of the predictor distribution does have an effect
• (There is a mixture of Poisson random variables)
• Adjusted and unadjusted estimates can be different if new covariate (conditionally) associated with response– Influence of skewed observations can be great– If added covariate symmetric with relatively small variance, there
is little difference between adjusted, unadjusted estimates
Lecture 6: Review of Regression October 14, 2014
Categorical Data Analysis, AUT 2014 133
133
Take Home Message: Risk Ratio
• Standard errors tend to be– Decreased if added covariates are (conditionally) associated with
response– Increased if added covariates that are (conditionally) correlated
with the POI
• After adjusting for a confounder, the precision of the estimates for response-POI association can be larger or smaller than in an unadjusted analysis
Lecture 6: Review of Regression October 14, 2014
Categorical Data Analysis, AUT 2014 134
134
Proportional Hazards Regression
• Simulation results
Truth Avg Estimates (SE)
ΔMdn α1 rXW γ2 γ1 β1 γ1
Irrelevant 0.0 0.0 0.00 0.0 0.0 0.0 (0.20) 0.0 (0.20)
Precision 0.0 0.0 0.00 1.0 0.0 0.0 (0.21) 0.0 (0.22)
Precision - 0.3 0.0 0.00 1.0 0.0 0.0 (0.21) 0.0 (0.21)
Precision 0.0 0.0 0.00 1.0 1.0 0.7 (0.21) 1.0 (0.22)
Confound 0.3 0.3 0.15 1.0 0.0 0.2 (0.21) 0.0 (0.21)
Confound 0.0 0.3 0.15 1.0 0.0 0.1 (0.20) 0.0 (0.22)
Var Inflatn 0.0 1.0 0.45 0.0 0.0 0.0 (0.20) 0.0 (0.23)
Next
Lecture 6: Review of Regression October 14, 2014
Categorical Data Analysis, AUT 2014 135
135
Proportional Hazards Regression
• Simulation results
Truth Avg Estimates (SE)
ΔMdn α1 rXW γ2 γ1 β1 γ1
Irrelevant 0.0 0.0 0.00 0.0 0.0 0.0 (0.20) 0.0 (0.20)
Precision 0.0 0.0 0.00 1.0 0.0 0.0 (0.21) 0.0 (0.22)
Precision - 0.3 0.0 0.00 1.0 0.0 0.0 (0.21) 0.0 (0.21)
Precision 0.0 0.0 0.00 1.0 1.0 0.7 (0.21) 1.0 (0.22)
Confound 0.3 0.3 0.15 1.0 0.0 0.2 (0.21) 0.0 (0.21)
Confound 0.0 0.3 0.15 1.0 0.0 0.1 (0.20) 0.0 (0.22)
Var Inflatn 0.0 1.0 0.45 0.0 0.0 0.0 (0.20) 0.0 (0.23)
Next
Lecture 6: Review of Regression October 14, 2014
Categorical Data Analysis, AUT 2014 136
136
Proportional Hazards Regression
• Simulation results
Truth Avg Estimates (SE)
ΔMdn α1 rXW γ2 γ1 β1 γ1
Irrelevant 0.0 0.0 0.00 0.0 0.0 0.0 (0.20) 0.0 (0.20)
Precision 0.0 0.0 0.00 1.0 0.0 0.0 (0.21) 0.0 (0.22)
Precision - 0.3 0.0 0.00 1.0 0.0 0.0 (0.21) 0.0 (0.21)
Precision 0.0 0.0 0.00 1.0 1.0 0.7 (0.21) 1.0 (0.22)
Confound 0.3 0.3 0.15 1.0 0.0 0.2 (0.21) 0.0 (0.21)
Confound 0.0 0.3 0.15 1.0 0.0 0.1 (0.20) 0.0 (0.22)
Var Inflatn 0.0 1.0 0.45 0.0 0.0 0.0 (0.20) 0.0 (0.23)
Next
Lecture 6: Review of Regression October 14, 2014
Categorical Data Analysis, AUT 2014 137
137
Proportional Hazards Regression
• Simulation results
Truth Avg Estimates (SE)
ΔMdn α1 rXW γ2 γ1 β1 γ1
Irrelevant 0.0 0.0 0.00 0.0 0.0 0.0 (0.20) 0.0 (0.20)
Precision 0.0 0.0 0.00 1.0 0.0 0.0 (0.21) 0.0 (0.22)
Precision - 0.3 0.0 0.00 1.0 0.0 0.0 (0.21) 0.0 (0.21)
Precision 0.0 0.0 0.00 1.0 1.0 0.7 (0.21) 1.0 (0.22)
Confound 0.3 0.3 0.15 1.0 0.0 0.2 (0.21) 0.0 (0.21)
Confound 0.0 0.3 0.15 1.0 0.0 0.1 (0.20) 0.0 (0.22)
Var Inflatn 0.0 1.0 0.45 0.0 0.0 0.0 (0.20) 0.0 (0.23)
Next
Lecture 6: Review of Regression October 14, 2014
Categorical Data Analysis, AUT 2014 138
138
Take Home Message: Hazard Ratios
• The magnitude of confounding is primarily a function of– Magnitude of association between covariate and response AND– Difference of mean covariate value across POI groups
• If adjustment would be linear, mean (not median, etc) matters most
• Even if no confounding, adjusted and unadjusted estimates will be different if new covariate (conditionally) associated with response (PH regression is related to logistic regression)– Adjusting for a precision variable will “deattenuate” HR (and
estimate)– The amount of “deattenuation” depends on
• The strength of association between the added covariate and the response, and
• The adjusted strength of association between the POI and the response
Lecture 6: Review of Regression October 14, 2014
Categorical Data Analysis, AUT 2014 139
139
Take Home Message: Hazard Ratios
• Standard errors of estimated HR measuring POI – response association are largely driven by the mean-variance relationship of the sample size ratios– Unless there is a very strong association, the SE tends to be fairly
constant for a range of different HRs
• Holding the SE constant, we will obtain a lower p value with a deattenuated HR
Lecture 6: Review of Regression October 14, 2014
Categorical Data Analysis, AUT 2014 140
140
Precision: Linear Regression
• E.g., X, W independent in population (or completely randomized experiment) AND W associated with Y independent of X
1111
1111
2
ˆˆˆˆˆˆ Errs Std
ˆˆ Slopes
Estimates Value True
00
esessese
XW
Lecture 6: Review of Regression October 14, 2014
Categorical Data Analysis, AUT 2014 141
141
Precision: Logistic Regression
• Adjusting for a precision variable • Deattenuates slope away from the null• Standard errors reflect mean-variance relationship
– Substantially increased power only in extreme cases» (OR > 5 for equal samples sizes of binary W)
1111
11111
11111
ˆˆˆˆˆˆ Errs Std
ˆˆ :0
ˆˆ :0 Slopes
Estimates Value True
esessese
Lecture 6: Review of Regression October 14, 2014
Categorical Data Analysis, AUT 2014 142
142
Precision: Poisson Regression
• Adjusting for a precision variable (with robust SE)• No effect on the slope (similar to linear regression)
– log ratios are linear in log means• Standard errors reflect mean-variance relationship
– Virtually no effect on power
1111
1111
ˆˆˆˆˆˆ Errs Std
ˆˆ Slopes
Estimates Value True
esessese
Lecture 6: Review of Regression October 14, 2014
Categorical Data Analysis, AUT 2014 143
143
Precision: PH Regression
• Adjusting for a precision variable • Deattenuates slope away from the null• Standard errors stay fairly constant
– (Complicated result of binomial mean-variance)
1111
11111
11111
ˆˆˆˆˆˆ Errs Std
ˆˆ :0
ˆˆ :0 Slopes
Estimates Value True
esessese
Lecture 6: Review of Regression October 14, 2014
Categorical Data Analysis, AUT 2014 144
144
Lin Reg: Stratified Randomization
• Stratified (orthogonal) randomization in a designed experiment– Also when designing observational study with matching
• Unless we adjust for the stratification variables, we estimate the true standard errors incorrectly
1111
1111
2
ˆˆˆˆˆˆ Errs Std
ˆˆ Slopes
Estimates Value True
00
esessese
rXW
Lecture 6: Review of Regression October 14, 2014
Categorical Data Analysis, AUT 2014 145
145
Multiple Adjustment
• We can generalize the results about the difference in mean confounder values across a groups defined by a binary POI
• Difference in interpretation of slopes
– β1 = Diff in mean Y for groups differing by 1 unit in X• (The distribution of W might differ across groups being compared)
– γ1 = Diff in mean Y for groups differing by 1 unit in X, but agreeing in their values of all W’s
iiiiiii WWXWWXYE 23121021 ,,, :Model Adjusted
iii XXYE 10 :Model Unadjusted
Lecture 6: Review of Regression October 14, 2014
Categorical Data Analysis, AUT 2014 146
146
Multiple Adjustment: Good News• For a binary POI, the slope of the unadjusted model will tend to be
• Hence, true adjusted and unadjusted slopes for X are definitelyestimating the same quantity if for each j
• (Of course, it is also possible the various terms cancel each other out…)
p
jXjXjj WW
101111
covariatesother for adjusting afteroutcome of predictivenot is Covariate
0
group POIeach in equal are valuescovariateMean
1
01
j
XjXj
OR
WW
Lecture 6: Review of Regression October 14, 2014
Categorical Data Analysis, AUT 2014 147
147
Multiple Adjustment: Bad News• We must consider the conditional relationships between a covariate and
the POI after adjustment for other variables• Consider four possible models involving X, W, and Z
• Deciding between models U and W, we only need consider ρXW
• Deciding between models Z and WZ, we need to consider ρXZ and ρWZ as wel
iZiWiXiiii
iZiXiii
iWiXiii
iXii
ZWXZWXYE
ZXZXYE
WXWXYE
XXYE
0
0
0
0
,, : WZModel
, : ZModel
, : WModel
: UModel
Lecture 6: Review of Regression October 14, 2014
Categorical Data Analysis, AUT 2014 148
148
Multiple Adjustment: Bad News
• Even though W and X might be marginally uncorrelated, our models Z and ZW estimate the same quantity only if – X | Z is uncorrelated with W | Z, OR– Y | X, Z is uncorrelated with W | X, Z
WX
W
XW
WZXZXX
Z
WWZ
X
ZXZ
X
WXW
XW
WXX
XXWX
WXWXX
XW
XW
2
0
2
0
11
iZiWiXiiii
iZiXiii
iWiXiii
iXii
ZWXZWXYE
ZXZXYE
WXWXYE
XXYE
0
0
0
0
,, : WZModel
, : ZModel
, : WModel
: UModel
Lecture 6: Review of Regression October 14, 2014
Categorical Data Analysis, AUT 2014 149
149
Take Home Messages
• “Everything should be as simple as possible, but no simpler”
• Understanding the role of confounding and precision allows some generalizations, but there are details that can mess us up– Linear regression: We have formulas to guide us when modeling
differences in means– Other regressions: Generally we use computers in an iterative
search, but each step is often a linear regression
• When adjusting for multiple variables, each slope parameter is interpreted by conditioning on the other variables in the model– We have to talk about adjusting for “residual confounding”– We have to talk about “added precision”