Introduction to and Overview of DEf: An R software package for cross-cultural research

E. Anthon Eff, Malcolm M. Dow, Wes Routon

Anthropological Sciences Conference, Albuquerque, March 18, 2014


Galton's Problem: A Network Autocorrelation Approach to Cross-Cultural Research


The two major problems with cross-cultural data analysis addressed by DEf are:

Missing Data

All of the major cross-cultural data sets have substantial missing data. Single imputation methods (mean substitution, regression-predicted scores, hot deck, etc.) result in coefficient variance estimates that are biased downward. Data editing procedures, e.g. listwise deletion, generally result in small samples (loss of power) and also require very strong assumptions about why data are missing. These assumptions are very unlikely to hold. Single imputation methods are no longer recommended.

DEf employs the Multiple Imputation by Chained Equations (mice) approach to handling missing data.

Non-Independence of Sample Units

Sample cases in cross-cultural and cross-national data are frequently not independent of one another due to various inter-societal network processes: cultural trait borrowing, conquest, emulation, inheritance from ancestral populations, etc. This is the classic Galton's Problem in anthropology, understood more generally as the problem of cultural trait transmission. DEf addresses this issue by incorporating networks of relations into regression models, and employing instrumental variables procedures to generate consistent and relatively efficient estimates.

First problem: Missing Data

Society         markin  markout  money  commland  sharefood
Nama Hottentot  NA      NA       1      NA        NA
Kung Bushmen    1       4        1      3         6
Thonga          4       4        3      3         6
Lozi            3       3        1      3         NA
Mbundu          NA      NA       4      NA        NA
Suku            2       2        4      2         2

Two solutions: listwise deletion or multiple imputation.

Listwise deletion

Applied to the table above, listwise deletion has these consequences:
Lose three observations.
Lose all of the information in the cells marked in red on the original slide.
Of 186 societies, 156 would have been dropped using listwise deletion.
No longer testing against the full range of human societies. Losing the big advantage of the SCCS.
Probable sample selection bias.

Multiple imputation

Society         markin  markout  money  commland  sharefood
Nama Hottentot  3       4        1      2         3
Kung Bushmen    1       4        1      3         6
Thonga          4       4        3      3         6
Lozi            3       3        1      3         6
Mbundu          4       5        4      3         1
Suku            2       2        4      2         2

Society         markin  markout  money  commland  sharefood
Nama Hottentot  2       3        1      1         2
Kung Bushmen    1       4        1      3         6
Thonga          4       4        3      3         6
Lozi            3       3        1      3         4
Mbundu          3       5        4      5         3
Suku            2       2        4      2         2

Society         markin  markout  money  commland  sharefood
Nama Hottentot  3       5        1      2         3
Kung Bushmen    1       4        1      3         6
Thonga          4       4        3      3         6
Lozi            3       3        1      3         5
Mbundu          2       6        4      4         2
Suku            2       2        4      2         2

Replace missing values with imputed values, drawn from the conditional distribution. Create several (5 to 10) new data sets with imputed values.

Step 1 of the DEf Approach to Multiple Imputation of Missing Data: finding auxiliary variables

The mice procedure imputes values for missing observations on the variables specified in the structural regression model of interest, using both these variables themselves plus a set of auxiliary variables. Ideal auxiliary variables are usually a subset of those with no missing values in the full data set. Auxiliary variables must be correlated with the variables in the structural regression model that have missing values, since the imputation procedure is designed to borrow information from them to help impute the missing values.

DEf will employ auxiliary variables provided by the user. Alternatively, DEf will identify suitable auxiliary variables as follows:

1. Identify all categorical, ordinal, and interval variables with no missing values in the complete data set.
2. Identify the variables that one wants to impute and, one at a time, treating each as a dependent variable:
   i) regress (using binary/ordinal logit, multinomial, OLS) the dependent variable on the covariate that provides the highest correlation, and save the residual;
   ii) add to the regression model the covariate that correlates most highly with the residual, and calculate the new residual;
   iii) repeat the above steps 8 times (or more);
   iv) calculate the relative importance of the predictors, drop variables that fall below a given threshold, and recalculate the residual;
   v) repeat steps ii through iv.

Step 2: Create m complete data sets

The mice procedure is repeated m times to create m copies of the data set, each containing different sets of imputed values.

Since each data set is now complete, each can be analyzed using any of the usual statistical models that require complete data.

m = 10 to 100 is currently suggested, depending on sample size and the amount of missing data.

Step 3: Analyzing the data and pooling the results: Rubin's Rules


Rubin's pooling procedures can be applied to any statistic generated by the statistical method employed to analyze the m imputed data sets.

Galton's Problem: incorporating inter-societal networks into network autocorrelation effects regression models

Galton's problem: observations are not independent. Two sources:
   Common descent (language phylogeny)
   Cultural borrowing (geographic distance)

In a regression context, Galton's problem will cause biased coefficients and biased standard errors.

Galton's problem example:

         alcohol  wives
Ecuador  1        0
Iran     0        2
Ireland  1        0
Morocco  0        3
Spain    1        0
Yemen    0        4

Pearson correlation = -0.9332565, p-value = 0.0065

An observed correlation between a pair of cultural traits across cultures could be due to the borrowing of the traits, as a package, from a common source (horizontal transmission), or could be due to their transmission, as a package, from a common ancestor (vertical transmission), or could be due to a true functional relationship.

Hypothesis: Drinking alcohol dampens the libido of religious specialists.

Adapted from Victor de Munck and Andrey Korotayev. 2000. Cultural Units in Cross-Cultural Research. Ethnology 39(4): 335-348.

What processes might be inducing non-independence?

Spatial diffusion: societies in close proximity have more opportunity to emulate, conform to, adopt, or borrow neighbors' behaviors, beliefs, customs, and rituals (horizontal diffusion).

Language similarity: similarity due to populations splitting off from the same ancestral population (vertical diffusion).

Religion: Marriage practices spread world-wide by the colonization of large swaths of the world by European Christian nations.

Equivalence: units similarly situated in a network and not necessarily proximate, e.g., economic similarity, core/periphery position in the world system, colonial status, ecological setting.

Assessing non-independence: Tobler's First Law of Geography

"Everything is related to everything else, but near things are more closely related than distant things."

This law suggests that the scores on variable y for the ith society should be similar to the scores of those societies with which it has the closest relationships. Call these societies i's neighborhood set.

If so, $y_i$ should be similar to the weighted average of the set of y scores for i's neighborhood set, where the weights indicate relative closeness. If the N scores on y are significantly correlated with the N weighted average scores, conclude that the y variable is auto-(self)-correlated.

Weighting sample units

First, construct an NxN connectivity matrix C of pair-wise relatedness scores among sample units, and then row-normalize C to unity to get the required weights matrix W. That is, $w_{ij} = c_{ij} / \sum_j c_{ij}$.

(If a variable y is premultiplied by W, i.e. Wy, the product will be an Nx1 vector of weighted averages that are on the same scale as y.)

Raw connectivity matrix C     Weights matrix W                       y    Wy
0 1 1 1 0 0 0                 0    1/3  1/3  1/3  0    0    0        6    7
1 0 0 1 0 0 0                 1/2  0    0    1/2  0    0    0        5    7
1 0 0 1 0 0 0                 1/2  0    0    1/2  0    0    0        8    7
0 1 1 0 1 0 0                 0    1/3  1/3  0    1/3  0    0        8    5.3
0 0 0 1 0 1 1                 0    0    0    1/3  0    1/3  1/3      3    3.3
0 0 0 0 1 0 1                 0    0    0    0    1/2  0    1/2      1    2
0 0 0 0 1 1 0                 0    0    0    0    1/2  1/2  0        1    2
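The row-normalization and the weighted averages in the small 7-unit example can be checked with a few lines of code. The following is a minimal Python sketch, not DEf code (DEf itself is an R package); the function names `row_normalize`, `lag`, and `pearson` are hypothetical helpers introduced only for this illustration. It also computes the plain Pearson correlation between y and Wy as a rough descriptive check of autocorrelation; it does not perform a significance test.

```python
from math import sqrt

# Raw connectivity matrix C and scores y, copied from the example above.
C = [
    [0, 1, 1, 1, 0, 0, 0],
    [1, 0, 0, 1, 0, 0, 0],
    [1, 0, 0, 1, 0, 0, 0],
    [0, 1, 1, 0, 1, 0, 0],
    [0, 0, 0, 1, 0, 1, 1],
    [0, 0, 0, 0, 1, 0, 1],
    [0, 0, 0, 0, 1, 1, 0],
]
y = [6, 5, 8, 8, 3, 1, 1]


def row_normalize(C):
    """Divide each row by its sum so that every row of W sums to 1."""
    W = []
    for row in C:
        total = sum(row)
        W.append([c / total if total else 0.0 for c in row])
    return W


def lag(W, v):
    """Premultiply a vector by W: each element is a weighted neighbor average."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def pearson(a, b):
    """Plain Pearson correlation between two equal-length sequences."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((p - ma) * (q - mb) for p, q in zip(a, b))
    va = sum((p - ma) ** 2 for p in a)
    vb = sum((q - mb) ** 2 for q in b)
    return cov / sqrt(va * vb)


W = row_normalize(C)
Wy = lag(W, y)
r = pearson(y, Wy)

print([round(v, 1) for v in Wy])  # compare with the Wy column of the table
print(round(r, 3))                # correlation between y and Wy
```

Rounded to one decimal place, the computed weighted averages reproduce the Wy column of the table (7, 7, 7, 5.3, 3.3, 2, 2).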

Incorporating autocorrelated variables into multiple regression

Cross-cultural researchers are usually interested in testing whether hypothesized predictor variables act on a dependent variable, as well as in identifying the processes that induce autocorrelation in it.

The Network Autocorrelation Regression Effects Models in DEf do just that.

The most commonly used network autocorrelation regression model is the Network Autocorrelation Effects model:

$$y = \alpha + \rho W y + X\beta + \varepsilon$$

Where: W is a row-normalized NxN weighting matrix with $w_{ij} > 0$ if i and j are related, 0 otherwise, and $w_{ii} = 0$ for all i; $\alpha$ is the intercept; $\rho$ is the network autocorrelation coefficient; y is an Nx1 vector; Wy is an Nx1 vector where each element i is a weighted average of y values for i's neighborhood set; X is an Nxk matrix of exogenous variables; $\beta$ is a kx1 vector of coefficients; $\varepsilon$ is an Nx1 vector of error terms.

Also called the Network Lag model, by analogy to time series, since W acts similarly to the lag operator in time series models, except that W lags the y variable in other kinds of social and physical spaces.

This is the model currently implemented in DEf.

Estimating the network autocorrelation effects regression model $y = \alpha + \rho W y + X\beta + \varepsilon$

MLE: Maximum Likelihood Estimation. This is usually the method of choice. But the log-likelihood function contains the term $\ln|A|$, where $A = (I - \rho W)$. Since A is asymmetric and usually not sparse, finding the eigenvalues is computationally burdensome for large N. And, for more than two endogenous Wy variables, the likelihood function is intractable.

OLS: Ordinary Least Squares. The basic assumption of OLS is that all r.h.s. variables are independent of (uncorrelated with) the error term $\varepsilon$. If not, all coefficient estimates ($\rho$ and $\beta$) are biased and inconsistent. Here, y is by definition a function of $\varepsilon$, so Wy is also a function of $\varepsilon$. That is, $\mathrm{Cov}(Wy, \varepsilon) \neq 0$. Wy is thus an endogenous regressor.

IV: Instrumental Variables. Provides a way to obtain consistent parameter estimates for models with endogenous variables. 2SLS is an IV estimation procedure. Can deal with large samples and multiple endogenous variables. DEf uses IV estimation procedures.

An intuitive view of the IV regression approach

OLS model: $y = \alpha + \rho W y + \varepsilon$

[Path diagram relating Z, Wy, and y]

Z is an instrument for Wy if

$\mathrm{Cov}(Z, \varepsilon) = 0$ (Z is valid) and $\mathrm{Cov}(Z, Wy) \neq 0$ (Z is relevant).

So, we need to find one or more additional variables Z that are correlated with Wy but uncorrelated with $\varepsilon$ to serve as instruments for Wy.

An intuitive view of the 2SLS IV estimation procedure

Consider again the network effects model $y = \alpha + \rho W y + X\beta + \varepsilon$.

Suppose we use WX, the lagged values of X, as an instrument for Wy.

Step 1. Using OLS, estimate $Wy = a + WXc + u$. Save the predicted scores $\hat{w} = \hat{a} + WX\hat{c}$.

Step 2. Again using OLS, estimate $y = \alpha + \rho\hat{w} + X\beta + \varepsilon$.

(Note: the reported standard errors from step 2 are incorrect. This is not an issue for the 1-step procedures used in all the usual software packages.)

2SLS estimation of the network autocorrelation effects regression model with IVs: general case

Where to get appropriate instruments?

Usually, it's hard to find additional variables that meet the conditions required. Variables that affect the endogenous variable(s) are often also likely to affect the dependent variable.

Kelejian and Prucha (1998) show that the set of variables $\{WX, W^2X, W^3X, \ldots\}$ is optimal as instruments for Wy, where $W^2, W^3, \ldots$ are the 2-step and 3-step connections between sample units. In practice, the WX variables or some subset of them will usually be sufficient.
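To make the two-step logic and the lagged-X instruments concrete, here is a minimal pure-Python sketch. It is an illustration under stated assumptions, not DEf's implementation: the helper functions (`solve`, `ols`, etc.), the exogenous variable `X`, and the "true" coefficients are all hypothetical, and the data are generated without an error term so that the result is easy to check. Note that the first stage here includes the constant and X alongside WX and W²X, which is the conventional way to run 2SLS and slightly more than the simplified Step 1 described above.

```python
def transpose(M):
    return [list(col) for col in zip(*M)]


def matvec(M, v):
    return [sum(a * b for a, b in zip(row, v)) for row in M]


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [list(map(float, A[i])) + [float(b[i])] for i in range(n)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (M[i][n] - sum(M[i][k] * x[k] for k in range(i + 1, n))) / M[i][i]
    return x


def ols(Z, y):
    """Least-squares coefficients of y on the columns of Z (rows are cases)."""
    Zt = transpose(Z)
    ZtZ = [[sum(a * b for a, b in zip(ci, cj)) for cj in Zt] for ci in Zt]
    Zty = [sum(a * b for a, b in zip(ci, y)) for ci in Zt]
    return solve(ZtZ, Zty)


# A small 7-unit connectivity matrix, row-normalized to give W.
C = [
    [0, 1, 1, 1, 0, 0, 0],
    [1, 0, 0, 1, 0, 0, 0],
    [1, 0, 0, 1, 0, 0, 0],
    [0, 1, 1, 0, 1, 0, 0],
    [0, 0, 0, 1, 0, 1, 1],
    [0, 0, 0, 0, 1, 0, 1],
    [0, 0, 0, 0, 1, 1, 0],
]
W = [[c / sum(row) for c in row] for row in C]
n = len(W)

# Hypothetical exogenous variable and coefficients (illustration only).
X = [2.0, 1.0, 4.0, 3.0, 0.5, 1.5, 2.5]
alpha, rho, beta = 1.0, 0.4, 2.0

# Generate y from y = alpha + rho*W*y + X*beta with NO error term,
# i.e. solve (I - rho*W) y = alpha + X*beta.
A = [[(1.0 if i == j else 0.0) - rho * W[i][j] for j in range(n)] for i in range(n)]
y = solve(A, [alpha + beta * x for x in X])

Wy = matvec(W, y)
WX = matvec(W, X)      # 1-step lag of X
W2X = matvec(W, WX)    # 2-step lag of X, i.e. (W^2) X

# Step 1: regress Wy on the constant, X, and the instruments WX and W2X.
H = [[1.0, X[i], WX[i], W2X[i]] for i in range(n)]
Wy_hat = matvec(H, ols(H, Wy))

# Step 2: regress y on the constant, the predicted Wy, and X.
Z_hat = [[1.0, Wy_hat[i], X[i]] for i in range(n)]
alpha_hat, rho_hat, beta_hat = ols(Z_hat, y)

print(alpha_hat, rho_hat, beta_hat)
```

Because the toy data contain no error term, the second-stage estimates should reproduce the chosen coefficients up to floating-point rounding. With real data the point estimates would differ from the truth, and the standard errors from a manual second stage would need correcting, as the note above says.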

Evaluating the quality of the instrumental variables

The quality of 2SLS estimators depends on the quality of the IVs.

IVs must be valid: require that $\mathrm{Cov}(Z, \varepsilon) = 0$. IV estimation is vulnerable on this point. Tests are available only if there are more instruments than endogenous variables (overidentification).

IVs also need to be relevant, i.e., they should predict the endogenous variables independently of the other exogenous variables. Shea (1997) proposed a partial $R^2$ measure of instrument relevance for models with multiple endogenous variables.

A merely marginal association between the endogenous variable(s) and Z is known as the weak instruments problem. Some diagnostics are available.

No perfect collinearity between all exogenous variables.

Overidentification tests

If there is more than one instrumental variable available for Wy, one can test the null hypothesis that the instruments are uncorrelated with the errors, against the alternative that at least one of them is correlated with the errors.

Sargan (1958) is the best known test: $T_s = N R_u^2 \sim \chi^2$ (with df = #IVs - #endogenous variables), where $R_u^2$ is the $R^2$ of the OLS regression of the 2SLS residuals on the IVs.

Basmann (1960) provides an alternate, though similar, test.

Kirby and Bollen (2009) discuss additional variants of Sargan and Basmann in the context of SEM.

Weak Instruments

Bound et al. (1995) show that when the instruments are only weakly correlated with the endogenous variables, IV estimates are biased in the same direction as OLS estimates, and may be more biased than OLS. In addition, weak IV regression estimates may not be consistent.

Staiger and Stock (1997) suggest that the partial F-statistic from the increase in the regression $R^2$ after adding the auxiliary instruments to the exogenous variables in the first stage regression should be greater than 10. Stock and Yogo (2005) provide tables that give some guidance as to how much greater than 10 the F-statistic may have to be.
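The partial F-statistic can be computed directly from the two first-stage $R^2$ values using the standard F-test for nested regressions. The sketch below assumes that formula; the function name `partial_f` and all of the numbers are hypothetical and serve only to show the arithmetic.

```python
def partial_f(r2_restricted, r2_unrestricted, n_instruments, n_obs, n_params):
    """Partial F for adding instruments to a first-stage regression.

    r2_restricted   : R^2 with the exogenous variables only
    r2_unrestricted : R^2 after adding the instruments
    n_instruments   : number of instruments added
    n_obs           : number of observations
    n_params        : number of coefficients in the unrestricted model,
                      including the constant
    """
    numerator = (r2_unrestricted - r2_restricted) / n_instruments
    denominator = (1.0 - r2_unrestricted) / (n_obs - n_params)
    return numerator / denominator


# Hypothetical first-stage results: 186 cases, 6 coefficients, 2 instruments.
F = partial_f(r2_restricted=0.10, r2_unrestricted=0.25,
              n_instruments=2, n_obs=186, n_params=6)
print(round(F, 2), "passes the F > 10 rule of thumb" if F > 10 else "weak")
```

With these made-up inputs the statistic is 18, which clears the rule-of-thumb threshold of 10; the Stock and Yogo tables may call for a higher value.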

Example: Monogamy in the Pre-industrial World

Multiple proposed determinants of the long-term historical shift in marriage preference from polygynous to monogamous unions are tested using data from the Standard Cross-Cultural Sample.

Determinants of Monogamy (adapted from Dow and Eff 2013)

Agent-level perspectives

Theoretical perspective: Males provide essential resources
Primary sources: Orians 1969; Borgerhoff-Mulder et al 1990; Marlowe 2000; Low 2003; Alexander et al 1979
Determinants (expected sign): male resource inequality (-), female economic contribution (-), beneficial natural environment (-)

Theoretical perspective: Female intra-sexual aggression
Primary sources: Gowaty 1996; Reichard 2003
Determinants (expected sign): endemic violence (-)

Theoretical perspective: Male intra-sexual aggression
Primary sources: Emlen & Oring 1977; Hawkes et al 1995; Marlowe 2000; Borgerhoff-Mulder 1990; Quinlan and Quinlan 2007; van Schaik and Dunbar 1990; Wrangham et al 1999; Ember & Ember 1992
Determinants (expected sign): endemic violence (-), social control (-)

Theoretical perspective: Extrinsic risk
Primary sources: Quinlan and Quinlan 2007; Del Giudice 2009; Low 1988, 1990, 2003, 2007
Determinants (expected sign): pathogen stress (-)

Group-level processes

Theoretical perspective: Collective action in small-scale societies
Primary sources: Olson 1971; Alexander et al 1979; Price 1999; Betzig 1986
Determinants (expected sign): (inverse of) societal scale (-)

Theoretical perspective: Socially Imposed Monogamy (SIM)
Primary sources: Alexander et al 1979; Betzig 1986
Determinants (expected sign): societal scale (+)

Theoretical perspective: Cultural trait transmission
Primary sources: Divale and Seda 2001; Dow and Eff 2009; Herlihy 2005; Price 1999
Determinants (expected sign): distance (+), language (+), modernization (+)

W matrices employed

Geographical distance: the $W_D$ matrix is described in Dow and Eff (2009), where $c_{ij} = (1/d_{ij})^2$. Use only the nearest 20 societies.

Language similarity: the $W_L$ matrix is described in Eff (2008), where $c_{ij} = e^{-\mathrm{score}(ij)}$.

If the Ws are collinear, they can be combined into a single matrix: $W_{DL} = \pi_D W_D + \pi_L W_L$, where $0 \le \pi_D, \pi_L \le 1$ and $\pi_D + \pi_L = 1$.

Then, run all combinations of $W_{DL}$ and select as best the matrix that maximizes $R^2_{IV}$. This also yields information on the weights that produce the best combined W.
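The search over combined matrices can be sketched as a simple grid over the distance weight. This is only an illustration of the idea, not DEf's code: the two 3x3 matrices are made up, the names `combine` and `best_combination` are hypothetical, and `toy_score` is a stand-in for the real criterion, which would require fitting the IV model with each candidate W and returning its R².

```python
def combine(WD, WL, pi_d):
    """Convex combination pi_d*WD + (1 - pi_d)*WL of two weight matrices."""
    pi_l = 1.0 - pi_d
    return [[pi_d * d + pi_l * l for d, l in zip(row_d, row_l)]
            for row_d, row_l in zip(WD, WL)]


def best_combination(WD, WL, score, steps=10):
    """Try pi_d = 0, 1/steps, ..., 1 and keep the highest-scoring matrix."""
    best = None
    for k in range(steps + 1):
        pi_d = k / steps
        W = combine(WD, WL, pi_d)
        s = score(W)
        if best is None or s > best[0]:
            best = (s, pi_d, W)
    return best


# Made-up row-normalized distance and language matrices for three units.
WD = [[0.0, 0.8, 0.2], [0.5, 0.0, 0.5], [0.3, 0.7, 0.0]]
WL = [[0.0, 0.3, 0.7], [0.6, 0.0, 0.4], [0.5, 0.5, 0.0]]


def toy_score(W):
    # Placeholder criterion only; a real run would return the R^2 of the
    # IV regression estimated with this W.
    return -(W[0][1] - 0.6) ** 2


best_score, best_pi_d, best_W = best_combination(WD, WL, toy_score)
print(best_pi_d, 1.0 - best_pi_d)
```

Because both inputs are row-normalized and the weights sum to 1, every combined matrix is also row-normalized, so no re-normalization step is needed inside the loop.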

2SLS estimation of the network autocorrelation regression model using the composite distance/language W matrix. The dependent variable is a Box-Cox transform of the percentage of married females in monogamous marriages [$(\mathrm{monofem}^{\lambda} - 1)/\lambda$].
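For reference, the Box-Cox transform of a positive value can be written as a small function. This is a generic sketch of the standard transform, with the usual convention that the transform becomes the natural log when the parameter is 0; the function name `box_cox` is hypothetical, and the value of λ used in the analysis and the handling of zero percentages are not stated in the source.

```python
from math import log


def box_cox(x, lam):
    """Standard Box-Cox transform of a positive value x with parameter lam."""
    if x <= 0:
        raise ValueError("Box-Cox requires x > 0")
    if lam == 0:
        return log(x)          # limiting case as lam approaches 0
    return (x ** lam - 1.0) / lam


# Hypothetical percentages of married females in monogamous marriages.
monofem = [25.0, 50.0, 90.0]
transformed = [box_cox(v, 0.5) for v in monofem]
print(transformed)
```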

Summary:

DEf is a new statistical package designed for cross-cultural and cross-national data sets.

Given the ubiquity of missing data in such data sets, DEf includes a suite of programs for multiple imputation of missing data.

Given that sample units in comparative data sets are non-independent due to various processes of cultural trait diffusion, DEf includes a suite of programs to implement network autocorrelation effects models.

Available as an R workspace and on the XSEDE CoSSci/DEf Science Gateway.

Separate analyses of m multiply imputed samples generate m estimates of any statistic of interest. In the general case, for any statistic Q, an analysis of m data sets yields $\hat{Q}_j$ and $\hat{U}_j$, estimates of the statistic and its variance for the jth data set (j = 1, 2, ..., m). The multiple imputation point estimate of each parameter is simply the mean of the m estimates:

$$\bar{Q} = \frac{1}{m}\sum_{j=1}^{m}\hat{Q}_j$$

To calculate the variance of this estimate, both the m estimated variances $\hat{U}_j$ and the variance in the $\hat{Q}_j$ across the m estimations must be combined. First, the mean of the m estimated variances for each parameter is obtained as a simple average:

$$\bar{U} = \frac{1}{m}\sum_{j=1}^{m}\hat{U}_j$$

This quantity is known as the within-imputation variance.

Next, the variance in the m estimated values is calculated as:

$$B = \frac{1}{m-1}\sum_{j=1}^{m}\left(\hat{Q}_j - \bar{Q}\right)^2$$

This quantity is known as the between-imputation variance. These two variances are then combined to get the total variance in the combined estimate of Q:

$$T = \bar{U} + \left(1 + \frac{1}{m}\right)B$$

Rubin (1987: 79) shows that the quantity

$$\frac{Q - \bar{Q}}{\sqrt{T}}$$

is approximately distributed as a t-distribution, where the degrees of freedom, df, is given by

$$df = (m-1)\left[1 + \frac{\bar{U}}{\left(1 + \frac{1}{m}\right)B}\right]^2$$
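The pooling steps described above (mean of the estimates, within-imputation variance, between-imputation variance, total variance, and degrees of freedom) follow Rubin's standard combining rules and can be sketched in a few lines of Python. The function name `pool_rubin` and the five estimates and variances are hypothetical; in R, the mice package's own pooling function would normally be used instead.

```python
from statistics import mean, variance


def pool_rubin(estimates, variances):
    """Pool m estimates and their variances using Rubin's rules."""
    m = len(estimates)
    q_bar = mean(estimates)               # pooled point estimate
    u_bar = mean(variances)               # within-imputation variance
    b = variance(estimates)               # between-imputation variance (m - 1 denominator)
    t = u_bar + (1.0 + 1.0 / m) * b       # total variance
    if b > 0:
        df = (m - 1) * (1.0 + u_bar / ((1.0 + 1.0 / m) * b)) ** 2
    else:
        df = float("inf")                 # no between-imputation variation
    return {"estimate": q_bar, "within": u_bar, "between": b,
            "total": t, "se": t ** 0.5, "df": df}


# Hypothetical coefficient estimates and variances from m = 5 imputed data sets.
q_hat = [0.50, 0.60, 0.40, 0.55, 0.45]
u_hat = [0.010, 0.012, 0.011, 0.009, 0.008]

pooled = pool_rubin(q_hat, u_hat)
print(pooled)
```

The same function applies to any scalar statistic for which an estimate and a variance are available from each imputed data set.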