Structural Equation Modeling
Dr. Arshad Hassan




Structural Equation Modeling

SEM is an extension of the general linear model that enables a researcher to test a set of regression equations simultaneously. SEM software can test traditional models, but it also permits examination of more complex relationships and models, such as confirmatory factor analysis and path analysis.

SEM goes by several names:
- Structural Equation Modeling (SEM)
- Covariance Structure Analysis (CSA)
- Causal Models
- Simultaneous Equation Modeling

SEM is a combination of factor analysis and multiple regression.

Structural Equation Modeling

The researcher first specifies a model based on theory, then determines how to measure the constructs, collects data, and inputs the data into the SEM software package. The package fits the data to the specified model and produces the results, which include overall model-fit statistics and parameter estimates.


Theory

Theorize your model:
- What observed variables?
- What latent variables?
- Relationships between latent variables?
- Relationships between latent variables and observed variables?
- Correlated errors of measurement?

Structural Equation Modeling

SEM has a language all its own. Manifest (observed) variables are directly measured by researchers, while latent (unobserved) variables are not directly measured but are inferred from the relationships or correlations among measured variables in the analysis. This statistical estimation is accomplished in much the same way that an exploratory factor analysis infers the presence of latent factors from shared variance among observed variables.

Independent variables, which are assumed to be measured without error, are called exogenous or upstream variables; dependent or mediating variables are called endogenous or downstream variables.

Structural Equation Modeling

SEM users represent relationships among observed and unobserved variables using path diagrams. Ovals or circles represent latent variables, while rectangles or squares represent measured variables. Residuals are always unobserved, so they are represented by ovals or circles.

Vocabulary

Measured variable:
- Observed variables, indicators, or manifest variables in an SEM design
- Predictors and outcomes in path analysis
- Squares in the diagram

Latent variable:
- Unobservable variable in the model; a factor or construct
- The construct driving measured variables in the measurement model
- Circles in the diagram

Vocabulary

- Error (E): variance left over after prediction of a measured variable.
- Disturbance (D): variance left over after prediction of a factor.
- Exogenous variable: a variable that predicts other variables.
- Endogenous variable: a variable that is predicted by another variable. A predicted variable is endogenous even if it in turn predicts another variable.

Parameters: the parameters of the model are the regression coefficients for paths between variables and the variances/covariances of independent variables. Parameters may be fixed to a certain value (usually 0 or 1) or may be estimated. In the diagram, an asterisk represents a parameter to be estimated; a 1 indicates that the parameter has been fixed to the value 1. When two variables are not connected by a path, the coefficient for that path is fixed at 0.

Why SEM?

- Assumptions underlying the statistical analyses are clear and testable, giving the investigator full control and potentially furthering understanding of the analyses.
- Graphical interface software boosts creativity and facilitates rapid model debugging.
- SEM programs provide overall tests of model fit and individual parameter estimate tests simultaneously.
- Regression coefficients, means, and variances may be compared simultaneously, even across multiple between-subjects groups.
- Measurement and confirmatory factor analysis models can be used to purge errors, making estimated relationships among latent variables less contaminated by measurement error.
- Ability to fit non-standard models, including flexible handling of longitudinal data, databases with autocorrelated error structures (time-series analysis), and databases with non-normally distributed variables and incomplete data.

This last feature is SEM's most attractive quality: SEM provides a unifying framework under which numerous linear models may be fit using flexible, powerful software.

SEM Assumptions

- A reasonable sample size
- Continuously and normally distributed endogenous variables
- Model identification

Identification

Identification is a structural or mathematical requirement for the SEM analysis to take place. Identification refers to the idea that there is at least one unique solution for each parameter estimate in an SEM model.

- Models in which there is only one possible solution for each parameter estimate are said to be just-identified.
- Models for which there are an infinite number of possible parameter estimate values are said to be under-identified.
- Models that have more than one possible solution (but one best or optimal solution) for each parameter estimate are considered over-identified.

Model Identification

To determine whether the model is identified, compare the number of data points to the number of parameters to be estimated. Since the input data set is the sample variance/covariance matrix, the number of data points is the number of variances and covariances in that matrix, which can be calculated as

    m(m + 1) / 2

where m is the number of measured variables.

Structural Equation Modeling

The SEM can be divided into two parts. The measurement model is the part that relates measured variables to latent variables. The structural model is the part that relates latent variables to one another.
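The counting rule above can be sketched directly in code. The helper below is an illustrative sketch (not from the slides): it applies the m(m + 1)/2 rule and classifies a model by the sign of the resulting degrees of freedom. Note that this count is a necessary but not sufficient condition for identification.

```python
def identification_status(m, n_free_params):
    """Classify a model by comparing data points to free parameters.

    m: number of measured variables
    n_free_params: number of parameters to be estimated
    """
    data_points = m * (m + 1) // 2  # variances + covariances in the sample matrix
    df = data_points - n_free_params
    if df == 0:
        return data_points, df, "just-identified"
    if df > 0:
        return data_points, df, "over-identified"
    return data_points, df, "under-identified"

# Example: 6 measured variables give 6*7/2 = 21 data points;
# a model with 13 free parameters leaves df = 8.
print(identification_status(6, 13))  # (21, 8, 'over-identified')
```

With three measured variables and six free parameters the count is exactly zero, the just-identified case.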

Structural Equation Modeling

- Measurement models
- Structural models
- Simultaneous models

Identification of the Measurement Model

The measurement portion of the model will probably be identified if:
- There is only one latent variable, it has at least three indicators that load on it, and the errors of these indicators are not correlated with one another.
- There are two or more latent variables, each has at least three indicators that load on it, the errors of these indicators are not correlated, each indicator loads on only one factor, and the factors are allowed to covary.
- There are two or more latent variables, but there is a latent variable on which only two indicators load; the errors of the indicators are not correlated, each indicator loads on only one factor, and none of the variances or covariances between factors is zero.

Identification of the Structural Model

This portion of the model may be identified if:
- None of the latent dependent variables predicts another latent dependent variable.
- When a latent dependent variable does predict another latent dependent variable, the relationship is recursive and the disturbances are not correlated.

Handling of Incomplete Data

Typical ad hoc solutions to missing-data problems include listwise deletion of cases, where an entire case's record is deleted if the case has one or more missing data points, and pairwise deletion, where bivariate correlations are computed only on cases with available data. Pairwise deletion results in a different N for each bivariate covariance or correlation in the database. Another commonly used ad hoc technique is substitution of the variable's mean for the missing data points on that variable.

Handling of Incomplete Data

Listwise deletion can result in a substantial loss of power, particularly if many cases each have a few data points missing on a variety of variables, not to mention limiting statistical inference to individuals who complete all measures in the database. Pairwise deletion is marginally better, but using different Ns for each covariance or correlation can have profound consequences for model-fitting efforts, including impossible solutions in some instances. Finally, mean substitution will shrink the variances of the variables where substitution took place, which is not desirable.

If the proportion of cases with missing data is small, say five percent or less, listwise deletion may be acceptable (Roth, 1994). Of course, if those cases are not missing completely at random, inconsistent parameter estimates can result. Otherwise, missing-data experts (e.g., Little and Rubin, 1987) recommend using a maximum likelihood estimation method, which makes use of all available data points. AMOS features maximum likelihood estimation in the presence of missing data.
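These trade-offs are easy to see on toy data. The sketch below (hypothetical numbers, not from the slides) shows listwise deletion discarding whole cases, pairwise deletion producing different Ns for different quantities, and mean substitution shrinking a variable's variance:

```python
import statistics

# Toy data: None marks a missing value.
x = [1.0, 2.0, 3.0, None, 5.0]
y = [2.0, None, 4.0, 5.0, 6.0]

# Listwise deletion: keep only cases complete on every variable.
complete = [(a, b) for a, b in zip(x, y) if a is not None and b is not None]
print("listwise N:", len(complete))  # 3 of 5 cases survive

# Pairwise deletion: each quantity uses whatever cases it has,
# so the covariance of (x, y) and the variance of x get different Ns.
n_xy = sum(1 for a, b in zip(x, y) if a is not None and b is not None)
n_x = sum(1 for a in x if a is not None)
print("pairwise Ns:", n_xy, n_x)  # 3 vs 4

# Mean substitution: filling gaps with the variable mean shrinks the variance.
obs = [a for a in x if a is not None]
filled = [a if a is not None else statistics.mean(obs) for a in x]
print(statistics.variance(obs) > statistics.variance(filled))  # True
```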

Reliability of Measured Variables

The variance in each measured variable is assumed to stem from variance in the underlying latent variable. Classically, the variance of a measured variable can be partitioned into true variance (that related to the true variable) and (random) error variance. The reliability of a measured variable is the ratio of true variance to total (true + error) variance. In SEM, the reliability of a measured variable is estimated by a squared correlation coefficient: the proportion of variance in the measured variable that is explained by variance in the latent variable(s).
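As a minimal numeric sketch (the loading value is assumed, not from the slides): with standardized variables, an indicator's reliability reduces to its squared standardized loading, since the loading is the correlation between indicator and latent variable.

```python
def reliability(true_var, error_var):
    """Reliability = true variance / (true + error) variance."""
    return true_var / (true_var + error_var)

# Hypothetical standardized loading of 0.8: the indicator's true variance
# is 0.8**2 and, with total variance 1, the error variance is 1 - 0.8**2.
loading = 0.8
print(reliability(loading**2, 1 - loading**2))  # ~0.64
```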

How SEM Works

Statistically, the model is evaluated by comparing two variance/covariance matrices. From the data, a sample variance/covariance matrix is calculated. From this matrix and the model, an estimated population variance/covariance matrix is computed. If the estimated population variance/covariance matrix is very similar to the sample variance/covariance matrix, the model is said to fit the data well.
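The "similarity" of the two matrices is quantified by a discrepancy function; under maximum likelihood it is F = ln|Sigma| + tr(S Sigma^-1) - ln|S| - p, which is zero when the model-implied matrix reproduces the sample matrix exactly. A sketch with hypothetical matrices (not from the slides):

```python
import numpy as np

def ml_discrepancy(S, Sigma):
    """ML fit function: ln|Sigma| + tr(S @ inv(Sigma)) - ln|S| - p.

    S: sample covariance matrix; Sigma: model-implied covariance matrix.
    Zero when Sigma reproduces S exactly; grows with misfit.
    """
    p = S.shape[0]
    _, logdet_s = np.linalg.slogdet(S)
    _, logdet_m = np.linalg.slogdet(Sigma)
    return logdet_m + np.trace(S @ np.linalg.inv(Sigma)) - logdet_s - p

S = np.array([[1.0, 0.5], [0.5, 1.0]])      # toy sample matrix
print(ml_discrepancy(S, S))                  # ~0: perfect fit
Sigma = np.array([[1.0, 0.3], [0.3, 1.0]])  # implied matrix that misfits
print(ml_discrepancy(S, Sigma) > 0)          # True
```

In ML estimation the software iteratively adjusts the free parameters to minimize this quantity; (N - 1) times the minimized value is the model chi-square.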


Evaluating Model Fit

In AMOS output, the Default model contains the fit statistics for the model you specified in your AMOS Graphics diagram. The Saturated and Independence models are two baseline or comparison models automatically fitted by AMOS as part of every analysis. The Saturated model contains as many parameter estimates as there are available degrees of freedom or inputs into the analysis; it is thus the least restricted model that AMOS can fit. By contrast, the Independence model is one of the most restrictive models that can be fit: it contains estimates of the variances of the observed variables only. In other words, the Independence model assumes all relationships between the observed variables are zero.

Tests of Fit

The chi-square test is a test of overall model fit: when the probability value of the chi-square test is smaller than the conventional .05 level, you reject the null hypothesis that the model fits the data. Because the chi-square test of absolute model fit is sensitive to sample size and to non-normality in the underlying distribution of the input variables, investigators often turn to various descriptive fit statistics to assess the overall fit of a model to the data.

These fit statistics are similar to the adjusted R² in multiple regression analysis: the parsimony fit statistics penalize large models with many estimated parameters. The Tucker-Lewis Index (TLI) and the Comparative Fit Index (CFI) compare the absolute fit of your specified model to the absolute fit of the Independence model. The greater the discrepancy between the overall fit of the two models, the larger the values of these descriptive statistics.

Tests of Fit

The chi-square test is an absolute test of model fit: if the probability value (P) is below .05, the model is rejected. Hu and Bentler (1999) recommend RMSEA values below .06 and Tucker-Lewis Index values of .95 or higher. The analysis uses an iterative procedure to minimize the difference between the sample variance/covariance matrix and the estimated population variance/covariance matrix. Maximum likelihood (ML) estimation is the method most frequently employed.

Goodness-of-fit Statistics

Many goodness-of-fit statistics are built from the following quantities:
- Tb = chi-square test statistic for the baseline model
- Tm = chi-square test statistic for the hypothesized model
- dfb = degrees of freedom for the baseline model
- dfm = degrees of freedom for the hypothesized model

Goodness-of-fit Statistics

The Normed Fit Index (NFI) is simply the difference between the two models' chi-squares divided by the chi-square for the independence model. Values of .9 or higher (some say .95 or higher) indicate good fit. The Comparative Fit Index (CFI) uses a similar approach (with a noncentral chi-square) and is said to be a good index even with small samples. It ranges from 0 to 1, like the NFI, and .95 (or .9) or higher indicates good fit.

PRATIO is the ratio of how many paths you dropped to how many you could have dropped (all of them). The Parsimony Normed Fit Index (PNFI) is the product of the NFI and PRATIO, and the PCFI is the product of the CFI and PRATIO. The PNFI and PCFI are intended to reward models that are parsimonious (contain few paths).

NPAR is the number of parameters in the model. CMIN is a chi-square statistic comparing the tested model and the independence model to the saturated model. CMIN/DF, the relative chi-square, is an index of how much the fit of data to model has been reduced by dropping one or more paths. One rule of thumb is that you have dropped too many paths if this index exceeds 2 or 3.

RMR, the root mean square residual, is an index of the amount by which the estimated (by your model) variances and covariances differ from the observed variances and covariances. Smaller is better.

GFI, the goodness-of-fit index, tells you what proportion of the variance in the sample variance-covariance matrix is accounted for by the model. This should exceed .9 for a good model; for the saturated model it will be a perfect 1. AGFI (adjusted GFI) is an alternative index in which the value is adjusted for the number of parameters in the model. The fewer the parameters in the model relative to the number of data points (variances and covariances in the sample variance-covariance matrix), the closer the AGFI will be to the GFI. The PGFI (P is for parsimony) adjusts the index to reward simple models and penalize models in which few paths have been deleted.

The Root Mean Square Error of Approximation (RMSEA) estimates lack of fit compared to the saturated model. RMSEA of .05 or less indicates good fit, and .08 or less adequate fit. PCLOSE is the p value testing the null hypothesis that RMSEA is no greater than .05.
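Several of these indices can be computed directly from the model and baseline chi-squares and degrees of freedom (Tm, dfm, Tb, dfb) defined earlier. A sketch using the standard textbook formulas, with hypothetical numbers (not from the slides):

```python
import math

def fit_indices(Tm, dfm, Tb, dfb, n):
    """Descriptive fit indices from model (Tm, dfm) and baseline (Tb, dfb)
    chi-squares, for sample size n."""
    nfi = (Tb - Tm) / Tb
    cfi = 1 - max(Tm - dfm, 0) / max(Tm - dfm, Tb - dfb, 0)
    tli = ((Tb / dfb) - (Tm / dfm)) / ((Tb / dfb) - 1)
    rmsea = math.sqrt(max(Tm - dfm, 0) / (dfm * (n - 1)))
    return {"NFI": round(nfi, 3), "CFI": round(cfi, 3),
            "TLI": round(tli, 3), "CMIN/DF": round(Tm / dfm, 3),
            "RMSEA": round(rmsea, 3)}

# Hypothetical results: tested model Tm = 45 on dfm = 30;
# independence model Tb = 600 on dfb = 45; N = 200.
print(fit_indices(45, 30, 600, 45, 200))
# NFI 0.925, CFI 0.973, TLI 0.959, CMIN/DF 1.5, RMSEA 0.05
```

With these numbers the model would pass the usual cutoffs: CFI and TLI at or above .95 and RMSEA at .05.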

Goodness-of-fit Statistics: Component Fit

Use substantive experience:
- Are the signs correct?
- Any nonsensical results?
- R²s for the individual equations?
- Negative error variances?
- Do the standard errors seem reasonable?

SEM Limitations

- SEM is a confirmatory approach: you need established theory about the relationships. It cannot be used to explore possible relationships when you have more than a handful of variables, although exploratory methods (e.g., model modification) can be used on top of the original theory.
- SEM is not causal; experimental design establishes cause. SEM is often thought of as strictly correlational but can be used (like regression) with experimental data.

Path Analysis

Theoretical assumptions (causality):
- X1 and Y1 correlate.
- X1 precedes Y1 chronologically.
- X1 and Y1 are still related after controlling for other dependencies.

Statistical assumptions:
- The model needs to be recursive.
- It is OK to use ordinal data.
- All variables are measured (and analyzed) without measurement error (error = 0).

Path Analysis

Path analysis estimates the effects of variables in a causal system. It starts with a structural equation: a mathematical equation representing the structure of the variables' relationships to each other.
