Download ppt - Pre-Processing & Item Analysis DeShon - 2005. Pre-Processing Method of Pre-processing depends on the…

Pre-ProcessingPre-Processing & Item Analysis & Item Analysis

DeShon - 2005DeShon - 2005

Pre-ProcessingPre-Processing Method of Pre-processing depends on Method of Pre-processing depends on

the type of measurement instrument the type of measurement instrument usedused

General IssuesGeneral Issues Responses within range?Responses within range? Missing dataMissing data Item directionalityItem directionality ScoringScoring

Transforming responses into numbers that Transforming responses into numbers that are useful for the desired inferenceare useful for the desired inference

Checking response rangeChecking response range

First step…First step… Make sure there are no observations Make sure there are no observations

outside the range of your measure.outside the range of your measure. If you use a 1-5 response measure, you If you use a 1-5 response measure, you

can’t have a response of 6. can’t have a response of 6. Histograms and summary statistics Histograms and summary statistics

(min, max)(min, max)

Reverse ScoringReverse Scoring

Used when combining multiple Used when combining multiple measures (e.g., items) into a measures (e.g., items) into a compositecomposite All items should refer to the target trait All items should refer to the target trait

in the same directionin the same direction Alg: (high scale score +1) – scoreAlg: (high scale score +1) – score

Missing DataMissing Data Huge issue in most behavioral research!Huge issue in most behavioral research! Key issues:Key issues:

Why is the data missing?Why is the data missing? Planned, missing randomly, response bias?Planned, missing randomly, response bias?

What’s the best analytic strategy with missing What’s the best analytic strategy with missing data?data?

Statistical PowerStatistical Power Biased resultsBiased results

Collins, L. M., Schafer, J. L., & Kam, C. M. (2001). A comparison of inclusive and restrictive strategies in modern missing data Collins, L. M., Schafer, J. L., & Kam, C. M. (2001). A comparison of inclusive and restrictive strategies in modern missing data procedures. procedures. Psychological MethodsPsychological Methods, , 66, 330_351., 330_351.

Schafer, J. L., & Graham, J. W. (2002). Missing data: our view of the state of the art. Schafer, J. L., & Graham, J. W. (2002). Missing data: our view of the state of the art. Psychological MethodsPsychological Methods, , 77, 147-177., 147-177.

Causes of Missing DataCauses of Missing Data

Common in social researchCommon in social research nonresponse, loss to followupnonresponse, loss to followup lack of overlap between linked data setslack of overlap between linked data sets social processessocial processes

dropping out of school, graduation, etc.dropping out of school, graduation, etc. survey designsurvey design

““skip patterns” between respondentsskip patterns” between respondents

Missing DataMissing Data Step 1: Do everything ethically feasible Step 1: Do everything ethically feasible

to avoid missing data during data to avoid missing data during data collectioncollection

Step 2: Do everything ethically possible Step 2: Do everything ethically possible to recover missing datato recover missing data

Step 3: Examine amount and patterns of Step 3: Examine amount and patterns of missing datamissing data

Step 4: Use statistical models and Step 4: Use statistical models and methods that replace missing data or methods that replace missing data or are unaffected by missing data are unaffected by missing data

Missing Data MechanismsMissing Data Mechanisms

Missing Completely at Random Missing Completely at Random (MCAR)(MCAR)

Missing at Random (MAR)Missing at Random (MAR) Not Missing at Random (NMAR)Not Missing at Random (NMAR)

X X not subject to nonresponse (age) not subject to nonresponse (age)Y Y subject to nonresponse (income) subject to nonresponse (income)

Missing Completely at RandomMissing Completely at Random

MCARMCAR Probability of response is Probability of response is

independent of X & Yindependent of X & Y Ex: Probability that income is Ex: Probability that income is

recorded is the same for all recorded is the same for all individuals regardless of age or individuals regardless of age or incomeincome

Missing at RandomMissing at Random MARMAR Probability of response is Probability of response is

dependent on X but not Ydependent on X but not Y Probability of missingness Probability of missingness does notdoes not

depend on unobserved informationdepend on unobserved information Ex: Probability that income is Ex: Probability that income is

recorded varies according to age recorded varies according to age but it is not related to income but it is not related to income within a particular age groupwithin a particular age group

Not Missing at RandomNot Missing at Random

NMARNMAR Probability of missingness Probability of missingness doesdoes

depend on unobserved informationdepend on unobserved information Ex: Probability that income is Ex: Probability that income is

recorded varies according to recorded varies according to income and possibly ageincome and possibly age

How can you tell?How can you tell?

• Look for patternsLook for patterns

• Run a logistic regression with your IV’s predicting a dichotomous variable (1=missing; 0=nonmissing

LOGISTIC REGRESSIONLOGISTIC REGRESSIONCoefficients:Coefficients: Estimate Std. Error z value Pr(>|z|)Estimate Std. Error z value Pr(>|z|) (Intercept) (Intercept) -5.058793 0.367083 -13.781 < 2e-16 ***-5.058793 0.367083 -13.781 < 2e-16 ***NEW.AGE NEW.AGE 0.181625 0.007524 24.140 < 0.181625 0.007524 24.140 <

2e-16 ***2e-16 ***SEXMale SEXMale -0.847947 0.131475 -6.450 1.12e-10 -0.847947 0.131475 -6.450 1.12e-10

******DVHHIN94 DVHHIN94 0.047828 0.026768 1.787 0.047828 0.026768 1.787 0.0740 .0.0740 . DVSMKT94 DVSMKT94 -0.015131 0.031662 -0.478 0.6327 -0.015131 0.031662 -0.478 0.6327 NEW.DVPP94 = 0NEW.DVPP94 = 0 0.233188 0.226732 1.028 0.3037 0.233188 0.226732 1.028 0.3037 NUMCHRON NUMCHRON -0.087992 0.048783 -1.804 -0.087992 0.048783 -1.804 0.0713 . 0.0713 . VISITS VISITS 0.012483 0.006563 1.902 0.012483 0.006563 1.902 0.0572 .0.0572 . NEW.WT6 NEW.WT6 -0.043935 0.077407 -0.568 0.5703 -0.043935 0.077407 -0.568 0.5703 NEW.DVBMI94 NEW.DVBMI94 -0.015622 0.017299 -0.903 0.3665 -0.015622 0.017299 -0.903 0.3665

Missing response DVHST94 vs Gender

0

500

1000

1500

2000

2500

3000

Male Female Total

Missing

Observed

Missing Data MechanismsMissing Data Mechanisms If MAR or MCAR, the missing data If MAR or MCAR, the missing data

mechanism is ignorable for full information mechanism is ignorable for full information likelihood-based inferenceslikelihood-based inferences

If MCAR, the mechanism is also ignorable for If MCAR, the mechanism is also ignorable for sampling-based inferences (OLS regression)sampling-based inferences (OLS regression)

If NMAR, the mechanism is nonignorable – If NMAR, the mechanism is nonignorable – thus any statistic could be biasedthus any statistic could be biased

Missing Data MethodsMissing Data Methods Always Bad MethodsAlways Bad Methods

Listwise deletionListwise deletion Pairwise deletion Pairwise deletion a.k.a.a.k.a. available case analysis available case analysis Person or item mean replacementPerson or item mean replacement

Often Good MethodsOften Good Methods Regression replacementRegression replacement Full-Information Maximum LikelihoodFull-Information Maximum Likelihood

SEM – must have full datasetSEM – must have full dataset Multiple ImputationMultiple Imputation

Listwise DeletionListwise Deletion

Assumes that the data are MCAR.Assumes that the data are MCAR. Only appropriate for small amounts Only appropriate for small amounts

of missing data.of missing data. Can lower power substantiallyCan lower power substantially

InefficientInefficient Now very rareNow very rare Don’t do it!Don’t do it!

FIML - AMOSFIML - AMOS

Imputation-based ProceduresImputation-based Procedures

Missing values are filled-in and the Missing values are filled-in and the resulting “Completed” data are resulting “Completed” data are analyzedanalyzed Hot deckHot deck Mean imputationMean imputation Regression imputationRegression imputation

Some imputation procedures (e.g., Some imputation procedures (e.g., Rubin’s multiple imputation) are really Rubin’s multiple imputation) are really model-based procedures.model-based procedures.

Mean ImputationMean Imputation TechniqueTechnique

Calculate mean over cases that have values for YCalculate mean over cases that have values for Y Impute this mean where Y is missingImpute this mean where Y is missing Ditto for XDitto for X11, X, X22, etc., etc.

Implicit modelsImplicit models Y=Y=YY XX11==11 XX22==22

ProblemsProblems ignores relationships among X and Yignores relationships among X and Y

underestimates covariancesunderestimates covariances

Regression ImputationRegression Imputation Technique & implicit modelsTechnique & implicit models

If Y is missingIf Y is missing impute mean of cases impute mean of cases

with similar values for Xwith similar values for X11, X, X22 Y = Y = XX11 XX22

Likewise, if XLikewise, if X22 is missing is missing impute mean of cases impute mean of cases

with similar values for Xwith similar values for X11, Y, Y XX11 = = XX11 Y Y

If both Y and XIf both Y and X22 are missing are missing impute means of cases impute means of cases

with similar values for Xwith similar values for X11 Y Y = = XX11

XX22= = XX11

ProblemProblem Ignores random components (no Ignores random components (no ))

Underestimates variances, se’sUnderestimates variances, se’s

Little and Rubin’s PrinciplesLittle and Rubin’s Principles

Imputations should beImputations should be Conditioned on observed variablesConditioned on observed variables MultivariateMultivariate Draws from a predictive distributionDraws from a predictive distribution

Single imputation methods do not Single imputation methods do not provide a means to correct standard provide a means to correct standard errors for estimation error.errors for estimation error.

Multiple ImputationMultiple Imputation Context: Multiple regression (in general)Context: Multiple regression (in general) Missing values are replaced Missing values are replaced with “plausible” with “plausible”

substitutes based on distributions or modelsubstitutes based on distributions or model Construct Construct mm>1 simulated versions>1 simulated versions Analyze each of the Analyze each of the mm simulated complete simulated complete

datasets by standard methodsdatasets by standard methods Combine the Combine the mm estimates estimates get confidence intervals using Rubin’s rules get confidence intervals using Rubin’s rules

((micombinemicombine))

ADVANTAGE: sampling variability is taken into ADVANTAGE: sampling variability is taken into account by restoring error varianceaccount by restoring error variance

Multiple ImputationMultiple Imputation

obsY missY

1 1 Vary yT T

2 2 Vary yT T

Varm my yT T

yT

Variance within + Variance Imputation

(Rubin, 1987, 1996)

imputations

obsY

obsY

obsY

(1)missY

(2)missY

( )missmY

Point estimate

Another ViewAnother View

IMPUTATION ANALYSIS POOLING

IMPUTED DATA

ANALYSIS RESULTS

FINAL RESULTS

• IMPUTATION:

Impute the missing entries of the incomplete data sets M times, resulting in M complete data sets.

• ANALYSIS:

Analyze each of the M completed data sets using weighted least squares.

• POOLING:

Integrate the M analysis results into a final result. Simple rules exist for combining the M analyses.

INCOMPLETE DATA

How many Imputations?How many Imputations? 55 Efficiency of an estimate:Efficiency of an estimate: (1+(1+γγ/m)/m)-1-1

γγ = percentage of missing info = percentage of missing infom = number of imputationsm = number of imputations

If 30% missing, 3 imputations If 30% missing, 3 imputations 91% 91% 5 imputations 5 imputations 94% 94% 10 imputations 10 imputations 97% 97%

Imputation in SASImputation in SASPROC MIPROC MI

By default generates 5 imputation values for each missing valueBy default generates 5 imputation values for each missing value

Imputation method: MCMC (Markov Chain Monte Carlo)Imputation method: MCMC (Markov Chain Monte Carlo)

EM algorithm determines initial valuesEM algorithm determines initial values

MCMC repeatedly simulates the distribution of interest from which the imputed values are drawnMCMC repeatedly simulates the distribution of interest from which the imputed values are drawn

Assumption: Data follows multivariate normal distributionAssumption: Data follows multivariate normal distribution

PROC REGPROC REGFits five weighted linear regression models to the Fits five weighted linear regression models to the five complete data sets obtained from PROC MI five complete data sets obtained from PROC MI (used by_imputation_statement )(used by_imputation_statement )

PROC MIANALIZE PROC MIANALIZE Reads the parameter estimates and associated Reads the parameter estimates and associated covariance matrix from the analysis performed on covariance matrix from the analysis performed on the multiple imputed data sets and derives valid the multiple imputed data sets and derives valid statistics for the parametersstatistics for the parameters

ExampleExamplecase maleness years_over_20 weight

1 0 282 1 19 2183 1 37 2354 0 24 1505 1 186 1 1767 18 09 0 2810 1 46 19511 0 2312 0 2913 1 44 22114 015 0 2116 1 41 20417 0 4018 1 37 20819 020 1 43

Case 1 is missing Case 1 is missing weightweight Given 1’s sex and ageGiven 1’s sex and age

generate a plausible generate a plausible distributiondistributionfor 1’s weightfor 1’s weight

75 100 125 150 175 200 225Weight

0.0020.0040.0060.0080.01

density

At random, sample 5 (or more) plausible weights for case 1At random, sample 5 (or more) plausible weights for case 1 Impute Y!Impute Y!

For case 6, sample from conditional distribution of age.For case 6, sample from conditional distribution of age. Use Y to impute X!Use Y to impute X!

For case 7, sample from conditional bivariate distribution of For case 7, sample from conditional bivariate distribution of age & weightage & weight

ExampleExample PROC MI;

DATA=missing_weight_age OUT=weight_age_mi; VAR years_over_20 weight

maleness; run;

_Imputation_ case maleness years_over_20 weight1 1 0 28 1781 2 1 19 2181 3 1 37 2352 1 0 28 1012 2 1 19 2182 3 1 37 235

3 1 0 28 1673 2 1 19 2183 3 1 37 235

4 1 0 28 1524 2 1 19 2184 3 1 37 235

5 1 0 28 1595 2 1 19 2185 3 1 37 235

PROC REG; data=weight_age_mi model weight = maleness

years_over_20; by _imputation_;

run;

Example – Standard ErrorsExample – Standard Errors Total variance in Total variance in bb00

Variation due to sampling + variation due to imputationVariation due to sampling + variation due to imputation Mean(Mean(ss22

bb00) + Var() + Var(bb00 ) ) Actually, there’s a correction factor of (1+1/Actually, there’s a correction factor of (1+1/MM))

for the number of imputations for the number of imputations MM. . (Here (Here MM=5.)=5.) So total variance in estimating So total variance in estimating 0 0 isis

Mean(Mean(ss22bb00) + (1+1/) + (1+1/MM) Var() Var(bb00 ) )

= 179.53 + (1.2) 511.59 = 793.44= 179.53 + (1.2) 511.59 = 793.44 Standard error is Standard error is 793.44 = 28.17793.44 = 28.17

ExampleExample PROC MIAnalyze data=parameters;

VAR intercept maleness years_over_20;

run;Multiple Imputation Parameter Estimates

Parameter Estimate Std Error 95% Confidence Limits DF

intercept 178.564526 28.168160 70.58553 286.5435 2.2804maleness 67.110037 14.801696 21.52721 112.6929 3.1866years_over_20 -0.960283 0.819559 -3.57294 1.6524 2.991

Other Software www.stat.psu.edu/~jls/misoftwa.htmlwww.stat.psu.edu/~jls/misoftwa.html

http://www.stat.psu.edu/~jls/misoftwa.html

Item AnalysisItem Analysis Relevant for tests with a right / Relevant for tests with a right /

wrong answerwrong answer Score the item so that 1=right and Score the item so that 1=right and

0=wrong0=wrong Where do the answers come from?Where do the answers come from?

Rational analysisRational analysis Empirical keyingEmpirical keying

Item AnalysisItem Analysis Goal of Item AnalysisGoal of Item Analysis

Determine the extent to which the item Determine the extent to which the item is useful in differentiating individuals is useful in differentiating individuals with respect to the focal constructwith respect to the focal construct

Improve the measure for future Improve the measure for future administrationsadministrations

Item AnalysisItem Analysis

Item analysis provides info on the Item analysis provides info on the effectiveness of the individual items for effectiveness of the individual items for future use future use

Typology of Item AnalysisTypology of Item Analysis

Item AnalysisItem Analysis

Classical Item Response theory

Rasch IRT2 IRT3

Item AnalysisItem Analysis Classical analysis is the easiest and most Classical analysis is the easiest and most

widely used form of analysiswidely used form of analysis The statistics can be computed by generic The statistics can be computed by generic

statistical packages (or by hand) and need statistical packages (or by hand) and need no specialist softwareno specialist software

The item statistics apply only to The item statistics apply only to that that group group of testees on of testees on that that collection of itemscollection of items Sample Dependent!Sample Dependent!

Classical Item AnalysisClassical Item Analysis Item DifficultyItem Difficulty

Proportion Correct (1=correct; 0=wrong)Proportion Correct (1=correct; 0=wrong) the higher the proportion the easier the itemthe higher the proportion the easier the item

In general, need a wide range of item In general, need a wide range of item difficulties to cover the range of the trait difficulties to cover the range of the trait being assessedbeing assessed

If mastery test, need item difficulties to If mastery test, need item difficulties to cluster around the cut scorecluster around the cut score

Very easy (0.0) or very hard items (1.0) are Very easy (0.0) or very hard items (1.0) are uselessuseless

Most variance at p=.5Most variance at p=.5

Classical Item AnalysisClassical Item Analysis Item Discrimination – 2 methodsItem Discrimination – 2 methods

Difference in proportion correct between Difference in proportion correct between high and low test score groups (27%)high and low test score groups (27%)

Item-total correlation (output in Item-total correlation (output in Cronbach’s alpha routines)Cronbach’s alpha routines) No negative discriminatorsNo negative discriminators

Check key or drop itemCheck key or drop item Zero discriminators are not usefulZero discriminators are not useful

Item difficulty and discrimination are Item difficulty and discrimination are interdependentinterdependent

Classical Item AnalysisClassical Item AnalysisItem Answer Item-total r Item-DiscI1 4 0.85 0.30 0.23I2 2 0.96 0.26 0.07I3 1 0.94 0.25 0.15I4 3 0.78 0.45 0.39I5 4 0.88 0.37 0.26I6 2 0.81 0.34 0.32I7 1 0.78 0.46 0.41I8 2 0.87 0.44 0.29I9 3 0.83 0.32 0.22I10 3 0.88 0.18 0.11I11 3 0.90 0.19 0.13I12 2 0.95 0.24 0.10I13 3 0.96 0.21 0.08I14 2 0.94 0.32 0.14I15 4 0.78 0.20 0.32I16 3 0.50 0.24 0.21

Item Diff