Sociology 601 Class 21: November 10, 2009


• Review

– formulas for b and se(b)

– Stata regression commands & output

• Violations of Model Assumptions, and their effects (9.6)

• Causality (10)


Formulas for b, a, r, and se(b)

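$$ b = \frac{\sum (X - \bar{X})(Y - \bar{Y})}{\sum (X - \bar{X})^2}; \qquad a = \bar{Y} - b\bar{X}; \qquad r = b\,\frac{s_x}{s_y} $$

$$ \hat{Y} = a + bX; \qquad SSE = \sum (Y - \hat{Y})^2 $$

$$ se(b) = \frac{\sqrt{SSE/(n-2)}}{s_x\sqrt{n-1}} $$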

Stata Example of Inference about a Slope

. summarize murder poverty

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
      murder |        51    8.727451    10.71758        1.6       78.5
     poverty |        51    14.25882    4.584242          8       26.4

. regress murder poverty

      Source |       SS       df       MS              Number of obs =      51
-------------+------------------------------           F(  1,    49) =   23.08
       Model |  1839.06931     1  1839.06931           Prob > F      =  0.0000
    Residual |  3904.25223    49  79.6786169           R-squared     =  0.3202
-------------+------------------------------           Adj R-squared =  0.3063
       Total |  5743.32154    50  114.866431           Root MSE      =  8.9263

------------------------------------------------------------------------------
      murder |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
     poverty |    1.32296   .2753711     4.80   0.000     .7695805    1.876339
       _cons |   -10.1364   4.120616    -2.46   0.017    -18.41708   -1.855707
------------------------------------------------------------------------------
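
As a rough check, several entries in the regression output can be reproduced by hand with Stata's display command. This is only a sketch: it uses the rounded values printed in the output above, so the results should agree with the Std. Err., t, _cons, and R-squared entries to about three or four digits.

```stata
* Hand check of the regress output, using values copied from the output above.

* se(b) = Root MSE / (sd(x) * sqrt(n - 1))
display 8.9263 / (4.584242 * sqrt(51 - 1))

* t = b / se(b)
display 1.32296 / .2753711

* a = mean(y) - b * mean(x)
display 8.727451 - 1.32296 * 14.25882

* R-squared = Model SS / Total SS
display 1839.06931 / 5743.32154
```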


. correlate murder poverty
(obs=51)

             |   murder  poverty
-------------+------------------
      murder |   1.0000
     poverty |   0.5659   1.0000

. correlate murder poverty, covariance
(obs=51)

             |   murder  poverty
-------------+------------------
      murder |  114.866
     poverty |  27.8024  21.0153

sqrt(114.866) = 10.72 = sd(y); sqrt(21.0153) = 4.58 = sd(x)

Alternative Formula for b

$$ b = \frac{\sum (X - \bar{X})(Y - \bar{Y})}{\sum (X - \bar{X})^2} = \frac{\sum (X - \bar{X})(Y - \bar{Y})/(N-1)}{\sum (X - \bar{X})^2/(N-1)} = \frac{\mathrm{covariance}(x,y)}{\mathrm{variance}(x)} $$

b = 27.8024 / 21.0153 = 1.323
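
The same ratio can be computed from stored results instead of retyped numbers. A minimal sketch, assuming the murder/poverty data are in memory: after correlate with the covariance option, r(cov_12) holds the covariance of the two variables, and r(Var_1) and r(Var_2) hold the variances of the first and second variable listed.

```stata
* Slope as covariance(x,y) / variance(x), from correlate's stored results.
quietly correlate murder poverty, covariance
display "b = " r(cov_12) / r(Var_2)

* Correlation from the same pieces: r = cov(x,y) / (sd(x) * sd(y)).
display "r = " r(cov_12) / sqrt(r(Var_1) * r(Var_2))
```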



scatter murder poverty || lfit murder poverty



. regress murder poverty if state!="DC"

      Source |       SS       df       MS              Number of obs =      50
-------------+------------------------------           F(  1,    48) =   31.36
       Model |  307.342297     1  307.342297           Prob > F      =  0.0000
    Residual |  470.406476    48  9.80013492           R-squared     =  0.3952
-------------+------------------------------           Adj R-squared =  0.3826
       Total |  777.748773    49  15.8724239           Root MSE      =  3.1305

------------------------------------------------------------------------------
      murder |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
     poverty |   .5842405    .104327     5.60   0.000     .3744771    .7940039
       _cons |  -.8567153   1.527798    -0.56   0.578     -3.92856    2.215129
------------------------------------------------------------------------------

Assumptions Needed to make Population Inferences for slopes.

• The sample is selected randomly.

• X and Y are interval scale variables.

• The mean of Y is related to X by the linear equation E{Y} = α + βX.

• The conditional standard deviation of Y is identical at each X value. (no heteroscedasticity)

• The conditional distribution of Y at each value of X is normal.

• There is no error in the measurement of X.
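
Several of these assumptions can be examined informally after fitting the model. The following is a sketch only, assuming the murder/poverty data from the earlier example are in memory; the variable name resid is made up for illustration.

```stata
regress murder poverty

* Linearity and constant spread: residuals vs. fitted values should show
* no curve and no funnel shape.
rvfplot, yline(0)

* Breusch-Pagan / Cook-Weisberg test for heteroscedasticity.
estat hettest

* Normality of the conditional distribution: quantile-normal plot of residuals.
predict resid, residuals
qnorm resid
```

Random sampling and measurement error in X cannot be checked this way; they depend on how the data were collected.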


Common Ways to Violate These Assumptions

• The sample is selected randomly.

o Cluster sampling (e.g., census tracts / neighborhoods) causes observations within a cluster to be more similar to each other than to observations outside the cluster.

o Autocorrelation (spatial and temporal)

o Two or more siblings in the same family.

o Sample = populations (e.g., states in the U.S.)

• X and Y are interval scale variables.

o Ordinal scale attitude measures

o Nominal scale categories (e.g., race/ethnicity, religion)
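
Two common responses in Stata are sketched below with hypothetical variable names (y, x, tract, and race are placeholders, not variables from the class data): cluster-robust standard errors for clustered samples, and factor-variable notation so that a nominal variable enters as a set of indicators instead of being treated as interval scale.

```stata
* Clustered sample: allow errors to be correlated within census tracts.
regress y x, vce(cluster tract)

* Nominal predictor: i. expands race into indicator (dummy) variables.
regress y x i.race
```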

Common Ways to Violate These Assumptions (2)

• The mean of Y is related to X by the linear equation E{Y} = α + βX.

o U-shape: e.g., Kuznets inverted-U curve (inequality <- GDP/capita)

o Thresholds:

o Logarithmic (e.g., earnings <- education)

• The conditional standard deviation of Y is identical at each X value. (no heteroscedasticity)

o earnings <- education

o hours worked <- years

o adult child occupational status <- parental occupational status
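
A sketch of how these two problems are often handled, using hypothetical variable names (inequality, gdppc, earnings, educ, ln_earn): a squared term for a U-shaped pattern, a log transformation for a logarithmic one, and heteroscedasticity-robust standard errors when the spread of Y changes with X.

```stata
* Inverted U: add a squared term via factor-variable notation.
regress inequality c.gdppc##c.gdppc

* Logarithmic relationship: model the log of earnings.
generate ln_earn = ln(earnings)
regress ln_earn educ

* Unequal conditional variance: same OLS slope, robust standard errors.
regress earnings educ, vce(robust)
```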


Common Ways to Violate These Assumptions (3)

• The conditional distribution of Y at each value of X is normal.

o earnings (skewed) <- education

o Y is binary

o Y is a %

• There is no error in the measurement of X.

o almost everything

o what is the effect of measurement error in x on b?
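
One way to explore the last question is by simulation. The sketch below uses made-up data and assumes classical measurement error (random noise unrelated to the true x and to the regression error). Under that assumption the expected slope shrinks toward zero by the ratio var(x) / (var(x) + var(error)). Here that ratio is 0.5, so the second slope should come out near 1 instead of 2.

```stata
clear
set seed 601
set obs 10000

* True predictor (variance 1) and outcome with a true slope of 2.
generate x = rnormal()
generate y = 2*x + rnormal()

* Observed predictor: true x plus random error with variance 1.
generate xobs = x + rnormal()

* Slope using the true x, then using the error-ridden x.
regress y x
regress y xobs
```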


Things to watch out for: extrapolation.

Extrapolation beyond observed values of X is dangerous.

• The pattern may be nonlinear.

• Even if the pattern is linear, the standard errors become increasingly wide.

• Be especially careful interpreting the Y-intercept: it may lie outside the observed data.

o e.g., year zero

o e.g., zero education in the U.S.

o e.g., zero parity
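
The widening uncertainty can be seen by plotting the fitted line with its confidence band. A minimal sketch, assuming the murder/poverty data are in memory; se_fit is a made-up variable name.

```stata
* Fitted line with a 95% confidence band; the band is narrowest near the
* mean of poverty and widens toward the extremes.
twoway lfitci murder poverty || scatter murder poverty

* Standard error of the fitted value for each observation.
regress murder poverty
predict se_fit, stdp
sort poverty
list poverty se_fit
```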


Things to watch out for: outliers

• Influential observations and outliers may unduly influence the fit of the model.

• The slope and standard error of the slope may be affected by influential observations.

• This is an inherent weakness of least squares regression.

• You may wish to evaluate two models; one with and one without the influential observations.
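
For the murder/poverty example, the two-model comparison and one standard influence measure might look like the sketch below. It assumes the data from the earlier slides are in memory; cook_d is a made-up variable name, and the 4/n cutoff is a common rule of thumb, not a formal test.

```stata
* Model with all observations; flag influential cases with Cook's distance.
regress murder poverty
predict cook_d, cooksd
list state murder poverty cook_d if cook_d > 4/e(N)

* Model without the influential observation.
regress murder poverty if state != "DC"
```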


Things to watch out for: truncated samples

Truncated samples cause problems opposite to those caused by influential observations and outliers.

• Truncation on the X axis reduces the correlation coefficient for the remaining data.

• Truncation on the Y axis is a worse problem, because it violates the assumption of normally distributed errors.

• Examples: top-coded income data; health as measured by number of days spent in a hospital in a year.
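
A small simulation with made-up data illustrates both kinds of truncation. This is a sketch, not an analysis of real data: x and y are generated with a true slope of 1, and the cutoffs are arbitrary.

```stata
clear
set seed 601
set obs 5000
generate x = rnormal()
generate y = x + rnormal()

* Truncation on X: the correlation falls in the remaining data.
correlate y x
correlate y x if x > 0

* Truncation on Y: the estimated slope is biased toward zero.
regress y x
regress y x if y < 1
```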


Causality

• We never prove that x causes y

• Research and theory make it increasingly likely

• Criteria:

o association

o time order

o no alternative explanations

• is the relationship spurious?


Alternative Explanations

Example: Neighborhood poverty -> Low Test Scores


Possible solutions:

• multivariate models

o e.g., control for parents’ education, income

o controls for other measurable differences

• fixed effects models

o e.g., changes in poverty -> changes in test scores

o controls for constant, unmeasured differences

• instrumental variables

o find an instrument that affects x1 but affects y only through x1

• experiments

o e.g., Moving to Opportunity

o randomize increases in $
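
In Stata, the first two strategies might be sketched as follows. All variable names (score, nbhd_pov, par_educ, fam_income, child_id, year) are hypothetical, and the fixed effects model assumes panel data with repeated observations per child.

```stata
* Multivariate model: control for measured family differences.
regress score nbhd_pov par_educ fam_income

* Fixed effects: use only within-child change over time, which removes
* constant unmeasured differences between children.
xtset child_id year
xtreg score nbhd_pov, fe
```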



Example: Fertility -> Lower Mothers’ LFP




Possible solutions:

• multivariate models

o e.g., control for gender attitudes

o controls for other measurable differences

• fixed effects models

o e.g., changes in # children -> dropping out

o controls for constant, unmeasured differences

• instrumental variables

o find an instrument that affects x1 but affects y only through x1

o e.g., mothers whose two children are the same sex

• experiments

o not feasible (or ethical)
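
The instrumental variables strategy could be sketched with ivregress. Variable names here (lfp, nkids, samesex, age, educ) are hypothetical; samesex would be an indicator that a mother's two children are the same sex. The approach rests on the untestable assumption that the children's sex mix affects labor force participation only through the number of children.

```stata
* Two-stage least squares: nkids is instrumented by samesex.
ivregress 2sls lfp age educ (nkids = samesex)

* First-stage diagnostics: is the instrument strongly related to nkids?
estat firststage
```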


Types of 3-variable Causal Models

• Spurious

o x2 causes both x1 and y

o e.g., religion causes fertility and women’s LFP

• Intervening

o x1 causes x2, which causes y

o e.g., fertility raises time spent on children, which lowers time in the labor force

• What is the statistical difference between these?
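
Statistically, the two models look the same: in both, the coefficient on x1 shrinks once x2 is added to the regression. A sketch with placeholder variable names y, x1, and x2:

```stata
* Bivariate association between x1 and y.
regress y x1

* Add x2. If the x1 coefficient drops toward zero, the pattern fits either
* a spurious or an intervening model; the regression cannot tell them apart.
regress y x1 x2
```

Choosing between the two interpretations depends on theory and time order (does x2 come before or after x1?), not on the regression output.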


Another type of 3-variable relationship: Statistical Interaction Effects


The relationship between x1 and y depends on the value of another variable, x2

• e.g., marital status -> earnings depends on gender
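
An interaction can be specified with factor-variable notation. A sketch with hypothetical 0/1 variables married and female and a hypothetical outcome earnings:

```stata
* ## includes both main effects and their product.
regress earnings i.married##i.female

* Predicted earnings in each of the four marital status by gender groups.
margins married#female
```

If the coefficient on the product term is not zero, the earnings difference between married and unmarried respondents differs by gender.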
