Pre-ProcessingPre-Processing & Item Analysis & Item Analysis
DeShon - 2005DeShon - 2005
Pre-ProcessingPre-Processing Method of Pre-processing depends on Method of Pre-processing depends on
the type of measurement instrument the type of measurement instrument usedused
General IssuesGeneral Issues Responses within range?Responses within range? Missing dataMissing data Item directionalityItem directionality ScoringScoring
Transforming responses into numbers that Transforming responses into numbers that are useful for the desired inferenceare useful for the desired inference
Checking response rangeChecking response range
First step…First step… Make sure there are no observations Make sure there are no observations
outside the range of your measure.outside the range of your measure. If you use a 1-5 response measure, you If you use a 1-5 response measure, you
can’t have a response of 6. can’t have a response of 6. Histograms and summary statistics Histograms and summary statistics
(min, max)(min, max)
Reverse ScoringReverse Scoring
Used when combining multiple Used when combining multiple measures (e.g., items) into a measures (e.g., items) into a compositecomposite All items should refer to the target trait All items should refer to the target trait
in the same directionin the same direction Alg: (high scale score +1) – scoreAlg: (high scale score +1) – score
Missing DataMissing Data Huge issue in most behavioral research!Huge issue in most behavioral research! Key issues:Key issues:
Why is the data missing?Why is the data missing? Planned, missing randomly, response bias?Planned, missing randomly, response bias?
What’s the best analytic strategy with missing What’s the best analytic strategy with missing data?data?
Statistical PowerStatistical Power Biased resultsBiased results
Collins, L. M., Schafer, J. L., & Kam, C. M. (2001). A comparison of inclusive and restrictive strategies in modern missing data Collins, L. M., Schafer, J. L., & Kam, C. M. (2001). A comparison of inclusive and restrictive strategies in modern missing data procedures. procedures. Psychological MethodsPsychological Methods, , 66, 330_351., 330_351.
Schafer, J. L., & Graham, J. W. (2002). Missing data: our view of the state of the art. Schafer, J. L., & Graham, J. W. (2002). Missing data: our view of the state of the art. Psychological MethodsPsychological Methods, , 77, 147-177., 147-177.
Causes of Missing DataCauses of Missing Data
Common in social researchCommon in social research nonresponse, loss to followupnonresponse, loss to followup lack of overlap between linked data setslack of overlap between linked data sets social processessocial processes
dropping out of school, graduation, etc.dropping out of school, graduation, etc. survey designsurvey design
““skip patterns” between respondentsskip patterns” between respondents
Missing DataMissing Data Step 1: Do everything ethically feasible Step 1: Do everything ethically feasible
to avoid missing data during data to avoid missing data during data collectioncollection
Step 2: Do everything ethically possible Step 2: Do everything ethically possible to recover missing datato recover missing data
Step 3: Examine amount and patterns of Step 3: Examine amount and patterns of missing datamissing data
Step 4: Use statistical models and Step 4: Use statistical models and methods that replace missing data or methods that replace missing data or are unaffected by missing data are unaffected by missing data
Missing Data MechanismsMissing Data Mechanisms
Missing Completely at Random Missing Completely at Random (MCAR)(MCAR)
Missing at Random (MAR)Missing at Random (MAR) Not Missing at Random (NMAR)Not Missing at Random (NMAR)
X X not subject to nonresponse (age) not subject to nonresponse (age)Y Y subject to nonresponse (income) subject to nonresponse (income)
Missing Completely at RandomMissing Completely at Random
MCARMCAR Probability of response is Probability of response is
independent of X & Yindependent of X & Y Ex: Probability that income is Ex: Probability that income is
recorded is the same for all recorded is the same for all individuals regardless of age or individuals regardless of age or incomeincome
Missing at RandomMissing at Random MARMAR Probability of response is Probability of response is
dependent on X but not Ydependent on X but not Y Probability of missingness Probability of missingness does notdoes not
depend on unobserved informationdepend on unobserved information Ex: Probability that income is Ex: Probability that income is
recorded varies according to age recorded varies according to age but it is not related to income but it is not related to income within a particular age groupwithin a particular age group
Not Missing at RandomNot Missing at Random
NMARNMAR Probability of missingness Probability of missingness doesdoes
depend on unobserved informationdepend on unobserved information Ex: Probability that income is Ex: Probability that income is
recorded varies according to recorded varies according to income and possibly ageincome and possibly age
How can you tell?How can you tell?
• Look for patternsLook for patterns
• Run a logistic regression with your IV’s predicting a dichotomous variable (1=missing; 0=nonmissing
LOGISTIC REGRESSIONLOGISTIC REGRESSIONCoefficients:Coefficients: Estimate Std. Error z value Pr(>|z|)Estimate Std. Error z value Pr(>|z|) (Intercept) (Intercept) -5.058793 0.367083 -13.781 < 2e-16 ***-5.058793 0.367083 -13.781 < 2e-16 ***NEW.AGE NEW.AGE 0.181625 0.007524 24.140 < 0.181625 0.007524 24.140 <
2e-16 ***2e-16 ***SEXMale SEXMale -0.847947 0.131475 -6.450 1.12e-10 -0.847947 0.131475 -6.450 1.12e-10
******DVHHIN94 DVHHIN94 0.047828 0.026768 1.787 0.047828 0.026768 1.787 0.0740 .0.0740 . DVSMKT94 DVSMKT94 -0.015131 0.031662 -0.478 0.6327 -0.015131 0.031662 -0.478 0.6327 NEW.DVPP94 = 0NEW.DVPP94 = 0 0.233188 0.226732 1.028 0.3037 0.233188 0.226732 1.028 0.3037 NUMCHRON NUMCHRON -0.087992 0.048783 -1.804 -0.087992 0.048783 -1.804 0.0713 . 0.0713 . VISITS VISITS 0.012483 0.006563 1.902 0.012483 0.006563 1.902 0.0572 .0.0572 . NEW.WT6 NEW.WT6 -0.043935 0.077407 -0.568 0.5703 -0.043935 0.077407 -0.568 0.5703 NEW.DVBMI94 NEW.DVBMI94 -0.015622 0.017299 -0.903 0.3665 -0.015622 0.017299 -0.903 0.3665
Missing response DVHST94 vs Gender
0
500
1000
1500
2000
2500
3000
Male Female Total
Missing
Observed
Missing Data MechanismsMissing Data Mechanisms If MAR or MCAR, the missing data If MAR or MCAR, the missing data
mechanism is ignorable for full information mechanism is ignorable for full information likelihood-based inferenceslikelihood-based inferences
If MCAR, the mechanism is also ignorable for If MCAR, the mechanism is also ignorable for sampling-based inferences (OLS regression)sampling-based inferences (OLS regression)
If NMAR, the mechanism is nonignorable – If NMAR, the mechanism is nonignorable – thus any statistic could be biasedthus any statistic could be biased
Missing Data MethodsMissing Data Methods Always Bad MethodsAlways Bad Methods
Listwise deletionListwise deletion Pairwise deletion Pairwise deletion a.k.a.a.k.a. available case analysis available case analysis Person or item mean replacementPerson or item mean replacement
Often Good MethodsOften Good Methods Regression replacementRegression replacement Full-Information Maximum LikelihoodFull-Information Maximum Likelihood
SEM – must have full datasetSEM – must have full dataset Multiple ImputationMultiple Imputation
Listwise DeletionListwise Deletion
Assumes that the data are MCAR.Assumes that the data are MCAR. Only appropriate for small amounts Only appropriate for small amounts
of missing data.of missing data. Can lower power substantiallyCan lower power substantially
InefficientInefficient Now very rareNow very rare Don’t do it!Don’t do it!
FIML - AMOSFIML - AMOS
Imputation-based ProceduresImputation-based Procedures
Missing values are filled-in and the Missing values are filled-in and the resulting “Completed” data are resulting “Completed” data are analyzedanalyzed Hot deckHot deck Mean imputationMean imputation Regression imputationRegression imputation
Some imputation procedures (e.g., Some imputation procedures (e.g., Rubin’s multiple imputation) are really Rubin’s multiple imputation) are really model-based procedures.model-based procedures.
Mean ImputationMean Imputation TechniqueTechnique
Calculate mean over cases that have values for YCalculate mean over cases that have values for Y Impute this mean where Y is missingImpute this mean where Y is missing Ditto for XDitto for X11, X, X22, etc., etc.
Implicit modelsImplicit models Y=Y=YY XX11==11 XX22==22
ProblemsProblems ignores relationships among X and Yignores relationships among X and Y
underestimates covariancesunderestimates covariances
Regression ImputationRegression Imputation Technique & implicit modelsTechnique & implicit models
If Y is missingIf Y is missing impute mean of cases impute mean of cases
with similar values for Xwith similar values for X11, X, X22 Y = Y = XX11 XX22
Likewise, if XLikewise, if X22 is missing is missing impute mean of cases impute mean of cases
with similar values for Xwith similar values for X11, Y, Y XX11 = = XX11 Y Y
If both Y and XIf both Y and X22 are missing are missing impute means of cases impute means of cases
with similar values for Xwith similar values for X11 Y Y = = XX11
XX22= = XX11
ProblemProblem Ignores random components (no Ignores random components (no ))
Underestimates variances, se’sUnderestimates variances, se’s
Little and Rubin’s PrinciplesLittle and Rubin’s Principles
Imputations should beImputations should be Conditioned on observed variablesConditioned on observed variables MultivariateMultivariate Draws from a predictive distributionDraws from a predictive distribution
Single imputation methods do not Single imputation methods do not provide a means to correct standard provide a means to correct standard errors for estimation error.errors for estimation error.
Multiple ImputationMultiple Imputation Context: Multiple regression (in general)Context: Multiple regression (in general) Missing values are replaced Missing values are replaced with “plausible” with “plausible”
substitutes based on distributions or modelsubstitutes based on distributions or model Construct Construct mm>1 simulated versions>1 simulated versions Analyze each of the Analyze each of the mm simulated complete simulated complete
datasets by standard methodsdatasets by standard methods Combine the Combine the mm estimates estimates get confidence intervals using Rubin’s rules get confidence intervals using Rubin’s rules
((micombinemicombine))
ADVANTAGE: sampling variability is taken into ADVANTAGE: sampling variability is taken into account by restoring error varianceaccount by restoring error variance
Multiple ImputationMultiple Imputation
obsY missY
1 1 Vary yT T
2 2 Vary yT T
Varm my yT T
yT
Variance within + Variance Imputation
(Rubin, 1987, 1996)
imputations
obsY
obsY
obsY
(1)missY
(2)missY
( )missmY
Point estimate
Another ViewAnother View
IMPUTATION ANALYSIS POOLING
IMPUTED DATA
ANALYSIS RESULTS
FINAL RESULTS
• IMPUTATION:
Impute the missing entries of the incomplete data sets M times, resulting in M complete data sets.
• ANALYSIS:
Analyze each of the M completed data sets using weighted least squares.
• POOLING:
Integrate the M analysis results into a final result. Simple rules exist for combining the M analyses.
INCOMPLETE DATA
How many Imputations?How many Imputations? 55 Efficiency of an estimate:Efficiency of an estimate: (1+(1+γγ/m)/m)-1-1
γγ = percentage of missing info = percentage of missing infom = number of imputationsm = number of imputations
If 30% missing, 3 imputations If 30% missing, 3 imputations 91% 91% 5 imputations 5 imputations 94% 94% 10 imputations 10 imputations 97% 97%
Imputation in SASImputation in SASPROC MIPROC MI
By default generates 5 imputation values for each missing valueBy default generates 5 imputation values for each missing value
Imputation method: MCMC (Markov Chain Monte Carlo)Imputation method: MCMC (Markov Chain Monte Carlo)
EM algorithm determines initial valuesEM algorithm determines initial values
MCMC repeatedly simulates the distribution of interest from which the imputed values are drawnMCMC repeatedly simulates the distribution of interest from which the imputed values are drawn
Assumption: Data follows multivariate normal distributionAssumption: Data follows multivariate normal distribution
PROC REGPROC REGFits five weighted linear regression models to the Fits five weighted linear regression models to the five complete data sets obtained from PROC MI five complete data sets obtained from PROC MI (used by_imputation_statement )(used by_imputation_statement )
PROC MIANALIZE PROC MIANALIZE Reads the parameter estimates and associated Reads the parameter estimates and associated covariance matrix from the analysis performed on covariance matrix from the analysis performed on the multiple imputed data sets and derives valid the multiple imputed data sets and derives valid statistics for the parametersstatistics for the parameters
ExampleExamplecase maleness years_over_20 weight
1 0 282 1 19 2183 1 37 2354 0 24 1505 1 186 1 1767 18 09 0 2810 1 46 19511 0 2312 0 2913 1 44 22114 015 0 2116 1 41 20417 0 4018 1 37 20819 020 1 43
Case 1 is missing Case 1 is missing weightweight Given 1’s sex and ageGiven 1’s sex and age
generate a plausible generate a plausible distributiondistributionfor 1’s weightfor 1’s weight
75 100 125 150 175 200 225Weight
0.0020.0040.0060.0080.01
density
At random, sample 5 (or more) plausible weights for case 1At random, sample 5 (or more) plausible weights for case 1 Impute Y!Impute Y!
For case 6, sample from conditional distribution of age.For case 6, sample from conditional distribution of age. Use Y to impute X!Use Y to impute X!
For case 7, sample from conditional bivariate distribution of For case 7, sample from conditional bivariate distribution of age & weightage & weight
ExampleExample PROC MI;
DATA=missing_weight_age OUT=weight_age_mi; VAR years_over_20 weight
maleness; run;
_Imputation_ case maleness years_over_20 weight1 1 0 28 1781 2 1 19 2181 3 1 37 2352 1 0 28 1012 2 1 19 2182 3 1 37 235
3 1 0 28 1673 2 1 19 2183 3 1 37 235
4 1 0 28 1524 2 1 19 2184 3 1 37 235
5 1 0 28 1595 2 1 19 2185 3 1 37 235
PROC REG; data=weight_age_mi model weight = maleness
years_over_20; by _imputation_;
run;
Example – Standard ErrorsExample – Standard Errors Total variance in Total variance in bb00
Variation due to sampling + variation due to imputationVariation due to sampling + variation due to imputation Mean(Mean(ss22
bb00) + Var() + Var(bb00 ) ) Actually, there’s a correction factor of (1+1/Actually, there’s a correction factor of (1+1/MM))
for the number of imputations for the number of imputations MM. . (Here (Here MM=5.)=5.) So total variance in estimating So total variance in estimating 0 0 isis
Mean(Mean(ss22bb00) + (1+1/) + (1+1/MM) Var() Var(bb00 ) )
= 179.53 + (1.2) 511.59 = 793.44= 179.53 + (1.2) 511.59 = 793.44 Standard error is Standard error is 793.44 = 28.17793.44 = 28.17
ExampleExample PROC MIAnalyze data=parameters;
VAR intercept maleness years_over_20;
run;Multiple Imputation Parameter Estimates
Parameter Estimate Std Error 95% Confidence Limits DF
intercept 178.564526 28.168160 70.58553 286.5435 2.2804maleness 67.110037 14.801696 21.52721 112.6929 3.1866years_over_20 -0.960283 0.819559 -3.57294 1.6524 2.991
Other Software www.stat.psu.edu/~jls/misoftwa.htmlwww.stat.psu.edu/~jls/misoftwa.html
Item AnalysisItem Analysis Relevant for tests with a right / Relevant for tests with a right /
wrong answerwrong answer Score the item so that 1=right and Score the item so that 1=right and
0=wrong0=wrong Where do the answers come from?Where do the answers come from?
Rational analysisRational analysis Empirical keyingEmpirical keying
Item AnalysisItem Analysis Goal of Item AnalysisGoal of Item Analysis
Determine the extent to which the item Determine the extent to which the item is useful in differentiating individuals is useful in differentiating individuals with respect to the focal constructwith respect to the focal construct
Improve the measure for future Improve the measure for future administrationsadministrations
Item AnalysisItem Analysis
Item analysis provides info on the Item analysis provides info on the effectiveness of the individual items for effectiveness of the individual items for future use future use
Typology of Item AnalysisTypology of Item Analysis
Item AnalysisItem Analysis
Classical Item Response theory
Rasch IRT2 IRT3
Item AnalysisItem Analysis Classical analysis is the easiest and most Classical analysis is the easiest and most
widely used form of analysiswidely used form of analysis The statistics can be computed by generic The statistics can be computed by generic
statistical packages (or by hand) and need statistical packages (or by hand) and need no specialist softwareno specialist software
The item statistics apply only to The item statistics apply only to that that group group of testees on of testees on that that collection of itemscollection of items Sample Dependent!Sample Dependent!
Classical Item AnalysisClassical Item Analysis Item DifficultyItem Difficulty
Proportion Correct (1=correct; 0=wrong)Proportion Correct (1=correct; 0=wrong) the higher the proportion the easier the itemthe higher the proportion the easier the item
In general, need a wide range of item In general, need a wide range of item difficulties to cover the range of the trait difficulties to cover the range of the trait being assessedbeing assessed
If mastery test, need item difficulties to If mastery test, need item difficulties to cluster around the cut scorecluster around the cut score
Very easy (0.0) or very hard items (1.0) are Very easy (0.0) or very hard items (1.0) are uselessuseless
Most variance at p=.5Most variance at p=.5
Classical Item AnalysisClassical Item Analysis Item Discrimination – 2 methodsItem Discrimination – 2 methods
Difference in proportion correct between Difference in proportion correct between high and low test score groups (27%)high and low test score groups (27%)
Item-total correlation (output in Item-total correlation (output in Cronbach’s alpha routines)Cronbach’s alpha routines) No negative discriminatorsNo negative discriminators
Check key or drop itemCheck key or drop item Zero discriminators are not usefulZero discriminators are not useful
Item difficulty and discrimination are Item difficulty and discrimination are interdependentinterdependent
Classical Item AnalysisClassical Item AnalysisItem Answer Item-total r Item-DiscI1 4 0.85 0.30 0.23I2 2 0.96 0.26 0.07I3 1 0.94 0.25 0.15I4 3 0.78 0.45 0.39I5 4 0.88 0.37 0.26I6 2 0.81 0.34 0.32I7 1 0.78 0.46 0.41I8 2 0.87 0.44 0.29I9 3 0.83 0.32 0.22I10 3 0.88 0.18 0.11I11 3 0.90 0.19 0.13I12 2 0.95 0.24 0.10I13 3 0.96 0.21 0.08I14 2 0.94 0.32 0.14I15 4 0.78 0.20 0.32I16 3 0.50 0.24 0.21
Item Diff