Data Screaming! Validating and Preparing your data Lyytinen & Gaskin

Data Screening

• Data screening (also known to us as “data screaming”) ensures your data is “clean” and ready to go before you conduct your planned statistical analyses.

• Data must always be screened to ensure it is reliable and valid for testing the type of causal theory you have planned.

• Screening and cooking are not synonymous – screening is like preparing the best ingredients for your gourmet food!

Necessary Data Screening To Do:

• Handle missing data

• Address outliers and influentials

• Meet multivariate statistical assumptions for alternative tests (scales, n, normality, covariance)

Statistical Problems with Missing Data

• If you are missing much of your data, this can cause several problems; e.g., you may not be able to calculate the estimated model.

• EFA, CFA, and path models require a certain minimum number of data points in order to compute estimates – each missing data point reduces your valid n by 1.

• Greater model complexity (number of items, number of paths) and improved power require larger samples.

Logical Problem with Missing Data

• Missing data may indicate systematic bias: respondents may have skipped particular questions in your survey because of a common cause (poor formulation, sensitivity, etc.).

• For example, if you ask about gender, and if females are less likely to report their gender than males, then you will have “male-biased” data. Perhaps only 50% of the females reported their gender, but 95% of the males reported gender.

• If you use gender as a moderator in your causal models, then you will be heavily biased toward males, because you will not end up using the unreported responses from females. You may also have a biased sample of female respondents.
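To make the arithmetic of the gender example concrete, here is a small Python sketch. The sample sizes (100 females, 100 males) are assumed purely for illustration; only the 50% and 95% reporting rates come from the example.

```python
# Hypothetical sample sizes; only the reporting rates come from the example.
females, males = 100, 100
female_report_rate, male_report_rate = 0.50, 0.95

reported_females = females * female_report_rate  # usable female records
reported_males = males * male_report_rate        # usable male records

usable = reported_females + reported_males
male_share = reported_males / usable

print(f"Usable records: {usable:.0f} of {females + males}")
print(f"Male share among usable records: {male_share:.1%}")
```

Under these assumed counts, a sample that was collected half and half ends up roughly two-thirds male once the unreported cases drop out.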

Detecting Missing Values
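As a rough illustration outside SPSS, the sketch below counts missing values per variable and per respondent. The records, variable names, and the use of `None` as the missing-value marker are all made up for the example.

```python
# Hypothetical survey records; None marks a missing answer.
records = [
    {"age": 34, "income": 52000, "satisfaction": 4},
    {"age": None, "income": 48000, "satisfaction": 5},
    {"age": 29, "income": None, "satisfaction": None},
    {"age": 41, "income": 61000, "satisfaction": 3},
]

def missing_by_variable(rows):
    """Count missing values in each variable (column)."""
    return {var: sum(row[var] is None for row in rows) for var in rows[0]}

def missing_by_respondent(rows):
    """Count missing values in each respondent (row)."""
    return [sum(value is None for value in row.values()) for row in rows]

print(missing_by_variable(records))
print(missing_by_respondent(records))
```

Looking at both directions matters because a variable and a respondent can each be the source of the gaps.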

Handling Missing Data

• Missing less than 10% from a variable or respondent is typically not problematic (unless you lose specific items, or one end of the tail)

• Method for handling missing data:
– >10% - Just don't use that variable/respondent unless you go below acceptable n
– <10% - Impute if not categorical
– Warning: If you remove too many respondents, you will introduce response bias

• If the DV is missing, then there is little you can do with that record

• One alternative is to impute and run models with and without missing data to see how sensitive the result is
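The 10% rule of thumb can be sketched as a small triage function. This is only an illustration: the function names, the column data, and the choice to treat exactly 10% as imputable are assumptions, not part of the rule as stated.

```python
def missing_fraction(values):
    """Share of entries that are missing (None)."""
    return sum(v is None for v in values) / len(values)

def triage(columns, categorical=(), threshold=0.10):
    """Suggest an action per variable using the 10% rule of thumb."""
    plan = {}
    for name, values in columns.items():
        frac = missing_fraction(values)
        if frac == 0:
            plan[name] = "complete"
        elif frac > threshold:
            plan[name] = "drop unless n falls too low"
        elif name in categorical:
            plan[name] = "do not impute (categorical)"
        else:
            plan[name] = "impute"
    return plan

# Hypothetical variables with 20 respondents each.
columns = {
    "age": [30] * 20,                         # nothing missing
    "income": [52] * 19 + [None],             # 5% missing
    "gender": ["f"] * 10 + ["m"] * 9 + [None],  # 5% missing, categorical
    "tenure": [3] * 15 + [None] * 5,          # 25% missing
}
plan = triage(columns, categorical={"gender"})
for name, action in plan.items():
    print(f"{name}: {action}")
```

The same function can be applied to respondents instead of variables by passing rows rather than columns.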

Imputation Methods (Hair, table 2-2)

• Use only valid data
– No imputation, just use valid cases or variables
– In SPSS: Exclude Pairwise (variable), Listwise (case)

• Use known replacement values
– Match missing value with similar case’s value

• Use calculated replacement values
– Use variable mean, median, or mode
– Regression based on known relationships

• Model based methods
– Iterative two step estimation of value and descriptives to find most appropriate replacement value
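The "calculated replacement values" option is the simplest to sketch. The example below fills missing entries with the mean, median, or mode of the observed values; the `impute` helper and the sample scores are hypothetical, and `None` is assumed to mark a missing value.

```python
from statistics import mean, median, mode

def impute(values, method="mean"):
    """Replace None with the mean, median, or mode of the observed values."""
    observed = [v for v in values if v is not None]
    fill = {"mean": mean, "median": median, "mode": mode}[method](observed)
    return [fill if v is None else v for v in values]

scores = [4, 5, None, 3, 4, None, 2]  # hypothetical 5-point item
print(impute(scores, "mean"))
print(impute(scores, "median"))
```

Filling every gap with a single value reduces the variable's variance, which is one reason to compare results with and without imputation.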

Mean Imputation in SPSS

2. Include each variable that has values that need imputing

3. For each variable you can choose the new name (for the imputed column) and the type of imputation

Best Method – Prevention!

• Short surveys (pre testing critical!)

• Easy to understand and answer survey items (pre testing critical)

• Force completion (incentives, technology)

• Bribe/motivate (iPad drawing)

• Digital surveys (rather than paper)

• Put dependent variables at the beginning of the survey!

Order for handling missing data

1. First decide which variables are going to be used in the model

2. Then handle missing data based on that set of variables

3. Then decide the method to handle missing data (see Hair Chapter 2)

Outliers and Influentials

• Outliers can influence your results, pulling the mean away from the median.

• Outliers also affect distributional assumptions and often reflect false or mistaken responses

• Two types of outliers:
– Outliers for individual variables (univariate): extreme values for a single variable
– Outliers for the model (multivariate): extreme (uncommon) values for a correlation

Detecting Univariate Outliers

[Figure: boxplot with callouts “Outliers!”, “Mean”, “50% should fall within the box”, and “99% should fall within this range”]
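A boxplot flags points that fall far outside the box. One common convention, assumed here, is Tukey's rule: anything more than 1.5 times the interquartile range beyond the quartiles is flagged. Quartile conventions differ between packages, so the cutoffs below may not match SPSS exactly; the data are made up.

```python
from statistics import quantiles

def boxplot_outliers(values, k=1.5):
    """Return values outside [Q1 - k*IQR, Q3 + k*IQR] (Tukey's rule)."""
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]

ages = [12, 14, 15, 15, 16, 17, 18, 19, 20, 58]  # hypothetical variable
print(boxplot_outliers(ages))
```

The flagged values are candidates for inspection, not automatic deletions.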

Handling Univariate Outliers

• Univariate outliers should be examined on a case by case basis.

• If the outlier is truly abnormal, and not representative of your population, then it is okay to remove. But this requires careful examination of the data points
– e.g., you are studying dogs, but somehow a cat got ahold of your survey
– e.g., someone answered “1” for all 75 questions on the survey

• However, just because a datapoint doesn’t fit comfortably with the distributions does not nominate that datapoint for removal
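The "same answer to all 75 questions" case is easy to screen for automatically. The sketch below is one possible check; the respondent IDs and answers are invented, and a flagged respondent still deserves a look before removal.

```python
def is_straight_liner(answers):
    """True when every answered item has the same value."""
    answered = [a for a in answers if a is not None]
    return len(answered) > 1 and len(set(answered)) == 1

# Hypothetical respondents, 75 items each.
responses = {
    "r001": [1] * 75,              # answered "1" everywhere
    "r002": [3, 4, 2, 5, 4] * 15,  # varied answers
    "r003": [4] * 74 + [5],        # almost, but not quite, constant
}
flagged = [rid for rid, ans in responses.items() if is_straight_liner(ans)]
print(flagged)
```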

Detecting Multivariate Outliers

• Multivariate outliers refer to sets of data points (tuples) that do not fit the standard sets of correlations exhibited by the other data points in the dataset with regards to your causal model.

• For example, if all but one person in the dataset report that diet has a positive effect on weight loss, but this one guy reports that he gains weight when he diets, then his record would be considered an outlier.

• To detect these influential multivariate outliers, you need to calculate the Mahalanobis d-squared. (Easy in AMOS)

Anything less than .05 in the p1 column is abnormal, and is a candidate for inspection.

These are row numbers from SPSS.
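For intuition about what the Mahalanobis d-squared measures, here is a minimal two-variable sketch in plain Python. It assumes the p-value is taken from a chi-square distribution with 2 degrees of freedom (whose upper tail is exactly exp(-d²/2)); whether this matches the p1 column in AMOS exactly is an assumption. The variable names and data are hypothetical.

```python
import math
from statistics import mean

def mahalanobis_two_vars(xs, ys):
    """Return a (d_squared, p_value) pair per case for two variables."""
    n = len(xs)
    mx, my = mean(xs), mean(ys)
    # Sample variances and covariance (n - 1 denominator).
    sxx = sum((x - mx) ** 2 for x in xs) / (n - 1)
    syy = sum((y - my) ** 2 for y in ys) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n - 1)
    det = sxx * syy - sxy ** 2
    results = []
    for x, y in zip(xs, ys):
        dx, dy = x - mx, y - my
        # d^2 = [dx dy] * inverse(covariance) * [dx dy]^T for a 2x2 matrix.
        d2 = (syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det
        # Chi-square upper tail with 2 degrees of freedom.
        results.append((d2, math.exp(-d2 / 2)))
    return results

# Hypothetical data: weeks on a diet and kilograms lost.
weeks = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
kg_lost = [1.1, 2.0, 2.9, 4.2, 5.1, 5.8, 7.0, 8.1, 9.0, 2.0]

results = mahalanobis_two_vars(weeks, kg_lost)
flagged = [i for i, (_, p) in enumerate(results) if p < 0.05]
print(flagged)  # zero-based positions of cases to inspect
```

Neither value of the last case is extreme on its own; it is flagged because the combination breaks the pattern the other cases follow.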

Handling Multivariate Outliers

• Create a new variable in SPSS called “Outlier”
– Code 0 for Mahalanobis p1 > .05
– Code 1 for Mahalanobis p1 < .05

• I have a tool for this if you want…

• Then in AMOS, when selecting data files, use “Outlier” as a grouping variable, with the grouping value set to 0
– This then runs your model with only non-outliers

Before and after removing outliers

[Figure: model results BEFORE (N=340) and AFTER (N=295) removing outliers]

Even after you remove outliers, the Mahalanobis will come up with a whole new set of outliers, so these should be checked on a case by case basis, using the Mahalanobis as a guide for inspection.

“Best Practice” for outliers

• In general, it is a bad idea to remove outliers, unless they are truly “abnormal” and do not represent accurate observations from the population. The logic of removal needs to be based on the semantics of the data.

• Removing outliers (especially en masse, as demonstrated with the Mahalanobis values) is risky because it decreases your ability to generalize: you do not know the cause of this type of variance, and it may be more than just noise.

Statistical Assumptions

Part of data screening is ensuring you meet the four main statistical assumptions for multivariate data analysis:
1. Normality
2. Homoscedasticity
3. Linearity
4. Multicollinearity (lack of)

These assumptions are intended to hold for scalar and continuous variables, rather than categorical (we prefer gender to be bimodal)

Normality

• Normality refers to the distributional assumptions of a variable.

• We usually assume in covariance-based models that the data is normally distributed, even though many times it is not!

• Other tests like PLS or binomial regressions do not require such assumptions.

• t tests and F tests assume normal distributions.

• Normality is assessed in many ways: shape, skewness, and kurtosis (flat/peaked).

• Normality issues affect small sample sizes (<50) much more than large sample sizes (>200).

[Figure: example distributions with panels labeled Shape, Skewness, Kurtosis, Bimodal, and Flat]

Tests for Skewness and Kurtosis

• Relaxed rule:
– Skewness > 1 = positive (right) skewed
– Skewness < -1 = negative (left) skewed
– Skewness between -1 and 1 is fine

• Strict rule:
– Abs(Skewness) > 3*Std. error = Skewed
– Same for Kurtosis
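Both rules can be applied in a few lines. The sketch below assumes the bias-adjusted sample skewness and the usual standard error formula sqrt(6n(n-1) / ((n-2)(n+1)(n+3))), which is what statistical packages commonly report; treat the match with SPSS output as an assumption. The function names and data are illustrative.

```python
from math import sqrt
from statistics import mean, stdev

def sample_skewness(values):
    """Bias-adjusted sample skewness."""
    n = len(values)
    m, s = mean(values), stdev(values)
    return n / ((n - 1) * (n - 2)) * sum(((v - m) / s) ** 3 for v in values)

def skewness_std_error(n):
    """Standard error of skewness for a sample of size n."""
    return sqrt(6 * n * (n - 1) / ((n - 2) * (n + 1) * (n + 3)))

def screen_skewness(values):
    """Apply the relaxed (+/-1) and strict (3 * std. error) rules."""
    skew = sample_skewness(values)
    se = skewness_std_error(len(values))
    if skew > 1:
        relaxed = "positive (right) skewed"
    elif skew < -1:
        relaxed = "negative (left) skewed"
    else:
        relaxed = "fine"
    strict = "skewed" if abs(skew) > 3 * se else "fine"
    return skew, se, relaxed, strict

symmetric = [1, 2, 3, 4, 5]
long_right_tail = [1, 1, 1, 1, 2, 2, 3, 20]
print(screen_skewness(symmetric))
print(screen_skewness(long_right_tail))
```

Because the standard error shrinks as n grows, the strict rule flags ever smaller departures from symmetry in large samples.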

Tests for Normality

SPSS
1. Analyze
2. Explore
3. Plots
4. Normality

*Neither of these variables would be considered normally distributed according to the KS or SW measures, but a visual inspection shows that role conflict (left) is roughly normal and participation (right) is positive skewed. So, ALWAYS conduct visual inspections!

Fixing Normality Issues

• Fix flat distribution with:
– Inverse: 1/X

• Fix negative skewed distribution with:
– Squared: X*X
– Cubed: X*X*X

• Fix positive skewed distribution with:
– Square root: SQRT(X)
– Logarithm: LG10(X)
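To see what a transformation does to skewness, the sketch below applies the square root and base-10 logarithm to a made-up positive skewed variable and compares a simple moment-based skewness before and after. The data and helper are illustrative, and both transformations assume all values are positive.

```python
import math

def moment_skewness(values):
    """Simple moment-based skewness: m3 / m2 ** 1.5."""
    n = len(values)
    m = sum(values) / n
    m2 = sum((v - m) ** 2 for v in values) / n
    m3 = sum((v - m) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

positive_skewed = [1, 1, 1, 1, 2, 2, 3, 20]  # hypothetical variable, all > 0

skew_raw = moment_skewness(positive_skewed)
skew_sqrt = moment_skewness([math.sqrt(x) for x in positive_skewed])
skew_log = moment_skewness([math.log10(x) for x in positive_skewed])

print(f"raw:      {skew_raw:.2f}")
print(f"SQRT(X):  {skew_sqrt:.2f}")
print(f"LG10(X):  {skew_log:.2f}")
```

On this data the logarithm pulls the long right tail in more strongly than the square root, though neither removes the skew entirely when a single value is far from the rest.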

Before and After Transformation

[Figure: a negative skewed distribution before, and the same variable after being cubed]

Homoscedasticity

• Homoscedasticity is a nasty word that helps impress your listeners!

• If a variable has this property it means that the DV exhibits consistent variance across different levels of the IV.

• A simple way to determine if a relationship is homoscedastic, is to do a scatter plot with the IV on the x-axis and the DV on the y-axis.

• If the points form a band of roughly even vertical spread around the trend line across the whole range of the IV, we have homoscedasticity!

• If the spread fans out or narrows as the IV changes (a cone or funnel shape), then the relationship is heteroscedastic.
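The idea of "consistent variance across different levels of the IV" can also be checked numerically alongside the scatter plot. The sketch below is an informal check, not a formal test: it fits a straight line, then compares the average squared residual in the upper half of the IV with the lower half. A ratio near 1 suggests even spread. The function name, data, and any cutoff you apply to the ratio are assumptions.

```python
from statistics import mean

def residual_spread_ratio(iv, dv):
    """Mean squared residual for high-IV cases divided by low-IV cases."""
    mx, my = mean(iv), mean(dv)
    slope = (sum((x - mx) * (y - my) for x, y in zip(iv, dv))
             / sum((x - mx) ** 2 for x in iv))
    intercept = my - slope * mx
    # Sort cases by the IV so residuals can be split into low and high halves.
    residuals = [y - (intercept + slope * x) for x, y in sorted(zip(iv, dv))]
    half = len(residuals) // 2
    low = mean(r * r for r in residuals[:half])
    high = mean(r * r for r in residuals[-half:])
    return high / low

iv = list(range(1, 11))
# Hypothetical DVs: constant noise versus noise that grows with the IV.
even_spread = [2 * x + (1 if i % 2 == 0 else -1) for i, x in enumerate(iv)]
fanning_out = [2 * x + (0.3 * x if i % 2 == 0 else -0.3 * x)
               for i, x in enumerate(iv)]

print(round(residual_spread_ratio(iv, even_spread), 2))
print(round(residual_spread_ratio(iv, fanning_out), 2))
```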

Scatterplot approach

Linearity

• Linearity refers to the consistent slope of change that represents the relationship between an IV and a DV.

• If the relationship between the IV and the DV is radically inconsistent, then it will throw off your SEM analyses, as your data is not linear.

• Sometimes you can achieve linearity with transformations (e.g., log-linear).

[Figure: example scatterplots labeled “Good” and “Bad”]

Multicollinearity

• Multicollinearity is not desirable in regressions (but desirable in factor analysis!).

• It means that independent variables are too highly correlated with each other and share too much variance

• Influences the accuracy of estimates for DV and inflates error terms for DV (Hair).

• How much unique variance does the black circle actually account for?

Detecting Multicollinearity

• An easy way to check this is to calculate a Variance Inflation Factor (VIF) for each independent variable: run a multiple regression using one of the IVs as the dependent variable, regressing it on all the remaining IVs, then swap out the IVs one at a time.

• The rules of thumb for the VIF are as follows:
– VIF < 3; no problem
– VIF > 3; potential problem
– VIF > 5; very likely problem
– VIF > 10; definitely problem
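For readers who want to see the calculation itself, the textbook definition is VIF = 1 / (1 - R²), where R² comes from regressing one IV on all the others. An equivalent shortcut, used in the sketch below, is to read the VIFs off the diagonal of the inverted correlation matrix of the IVs. The helper names and the loyalty scores are hypothetical, and the matrix inversion is a plain Gauss-Jordan routine intended for a handful of variables only.

```python
from statistics import mean

def pearson(xs, ys):
    """Pearson correlation between two equally long lists."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

def invert(matrix):
    """Invert a small square matrix by Gauss-Jordan elimination."""
    n = len(matrix)
    aug = [list(row) + [float(i == j) for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        # Partial pivoting: bring the largest remaining entry to the diagonal.
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [a - f * b for a, b in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]

def vifs(ivs):
    """VIF per IV: diagonal of the inverse correlation matrix."""
    names = list(ivs)
    corr = [[pearson(ivs[a], ivs[b]) for b in names] for a in names]
    inverse = invert(corr)
    return {name: inverse[i][i] for i, name in enumerate(names)}

def vif_label(vif):
    """Apply the rules of thumb listed above."""
    if vif > 10:
        return "definitely problem"
    if vif > 5:
        return "very likely problem"
    if vif > 3:
        return "potential problem"
    return "no problem"

# Hypothetical item scores; loyalty2 and loyalty3 are nearly identical.
items = {
    "loyalty1": [3, 1, 4, 1, 5, 2, 6],
    "loyalty2": [1, 2, 3, 4, 5, 6, 7],
    "loyalty3": [1, 2, 3, 4, 5, 7, 6],
}
result = vifs(items)
for name, vif in result.items():
    print(f"{name}: VIF = {vif:.2f} ({vif_label(vif)})")
```

Two nearly identical items inflate each other's VIF, so dropping one of the pair and recomputing is the usual next step.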

Handling Multicollinearity

Loyalty 2 and Loyalty 3 seem to be too similar in both of these tests.

Dropping Loyalty 2 fixed the problem.
