Running Head: IRT ANALYSIS OF IPIP
Item Response Theory Analysis of the IPIP Big-Five Scales
D. Matthew Trippe and Robert J. Harvey
Virginia Polytechnic Institute and State University
Abstract
We used Samejima's (1969) graded-response item response theory model to evaluate the Five
Factor Model scales of the public-domain International Personality Item Pool (IPIP; Goldberg,
1999). Test information and standard error functions showed that the IPIP scales provided
relatively good measurement precision across most of the scale ranges.
IRT Analysis of the IPIP - 2
Item Response Theory Analysis of the IPIP Big-Five Scales
The Five Factor Model (FFM) or “Big Five” approach has emerged as the dominant
taxonomic approach in the realm of personality research (e.g., Digman, 1990; John, 1990).
Although FFM critics exist (e.g., Block, 1995), many researchers and practitioners have
embraced the common framework provided by the FFM (e.g., McCrae & John, 1992; Goldberg,
1993; Costa & McCrae, 1995). In applied use, FFM constructs have been shown to be valid predictors
of a wide variety of criteria (e.g., Paunonen & Ashton, 2001), and in particular, Industrial-
Organizational psychologists have produced a large body of research examining the predictive
relationships between FFM personality dimensions and job performance (e.g., Barrick & Mount,
1991; Mount & Barrick, 1995; Salgado, 1997; Hurtz & Donovan, 2000). Although some of
these studies have reached more optimistic conclusions than others, the bulk of the available
evidence suggests that FFM personality dimensions (especially Conscientiousness) exhibit at
least a small-to-moderate predictive relationship with respect to job performance, with improved
validity being obtained when the linkage between the predictor and criterion is logically and
theoretically justified (e.g., Hurtz & Donovan, 2000; Paunonen & Ashton, 2001), and when
problems regarding under-specification of the job performance criterion are avoided (e.g., Austin
& Villanova, 1992).
The increased focus that has been seen with respect to using personality traits for
employee selection and promotion purposes has led to a concomitant increase in the need for
researchers and practitioners to evaluate the quality and measurement precision of the personality
instruments they use to estimate the FFM dimensions. Item response theory (IRT) presents an
excellent methodology for evaluating personality instruments in this regard, given that unlike
classical test theory (CTT) it does not assume that tests are equally precise across the full range
of possible test scores. That is, rather than providing a point estimate of the standard error of
measurement (SEM) for a personality scale as in CTT, IRT provides a test information function
(TIF) and a test standard error (TSE) function to index the degree of measurement precision
across the full range of the latent trait (denoted θ). Using IRT, personality tests can be evaluated
in terms of the amount of information and precision they provide at specific ranges of test scores
that are of particular interest (e.g., when used for top-down employee selection purposes,
measurement precision at the upper end of the θ continuum would likely be the primary focus,
and even a relatively large lack of precision at the lower end of the scale might be excused).
Because many standardized tests tend to provide their highest levels of measurement precision in
the middle range of scores, with declines in precision being seen at the high and low ends of the
scale (e.g., Harvey, Murry, & Markham, 1994), it is quite possible that a test might be deemed
adequate for assessing individuals scoring in the middle range of the scale, but unacceptably
imprecise at the high or low ends (which, depending on the direction of the test’s scale, may
represent precisely the most relevant ranges of scores for employee selection purposes).
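The inverse relation between test information and measurement error described above can be sketched numerically. The information values below are purely hypothetical (a center-weighted TIF of the kind many standardized tests exhibit), not taken from any actual instrument:

```python
import numpy as np

# Hypothetical, center-weighted test information function (TIF).
theta = np.linspace(-3, 3, 13)            # latent trait grid
info = 10 * np.exp(-0.5 * theta**2) + 1.0

# The conditional standard error of measurement is the reciprocal
# square root of the test information at each theta level.
se = 1.0 / np.sqrt(info)

# Precision is best where information peaks (theta = 0 here) and
# worst at the extremes -- the region that matters most for
# top-down selection.
print(se[6] < se[0] and se[6] < se[-1])   # prints True
```

Under this sketch, a test whose information peaks mid-scale yields small standard errors for typical examinees but noticeably larger ones for exactly the high (or low) scorers a top-down rule would rank.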
Despite the advantages offered by IRT over the older CTT-based methods of assessing
test performance, relatively few IRT analyses of personality inventories have been reported to
date. For example, Harvey, Murry, and Markham (1994) evaluated the Myers Briggs Type
Indicator, finding that short-form versions of the instrument provided considerably less
measurement precision than the full-length form, and that precision for all of the forms was quite
low at both the high and low ranges of scores (a finding that would strongly caution against
using its scales in a top-down selection situation). Likewise, Rouse, Finger, and Butcher (1999)
evaluated the Minnesota Multiphasic Personality Inventory-2 Personality Psychopathology-
Five scales, finding that although three scales provided peak information at the high end of the θ
scale, very little information was produced at low levels of θ, a finding that suggests that
practitioners should be cautious when interpreting low and moderate scores. Finally, of more
direct relevance to the present investigation, McBride and Harvey (2002) examined the
performance of both the NEO-PI-R (Costa & McCrae, 1992) and scales formed from subsets of
the 1,200 items contained in Goldberg’s (1999) public domain International Personality Item
Pool (IPIP) that were designed to parallel the FFM scales of the NEO-PI. Using the graded-
response model of Samejima (1969) to analyze the Likert-type item responses, McBride and
Harvey (2002) found that although 20-item IPIP scales were less precise than the NEO-PI
scales, 60-item IPIP scales outperformed the NEO-PI across most of the range of scores.
The present study was performed to further investigate the performance of the 60-item
IPIP scales formed by Goldberg (1999) to parallel the FFM dimensions measured by the NEO-
PI; two factors suggested the need for additional study. First, the McBride and Harvey (2002)
analyses were conducted using archival data collected as part of the IPIP development project
from carefully selected, compensated study participants who completed a wide range of survey
instruments over the course of the study; the degree to which similar results would be found in
additional samples (especially, ones not participating in a long-term research project) needed to
be determined (e.g., the archival participants in the original IPIP study may well have exhibited
appreciably higher levels of factors such as dedication, candor, veracity, etc., that potentially
may exert an impact on the obtained item parameters). Second, although McBride and Harvey
(2002) evaluated the degree to which some of IRT’s assumptions were satisfied (primarily,
unidimensionality), a closer examination of the performance of the graded-response IRT model
may be warranted (e.g., Drasgow, Levine, Tsien, Williams, & Mead, 1995).
In the present study, we were particularly interested in determining the degree of
measurement precision that was present in these IPIP item pools for the ranges of scores that
would most likely be of interest to practitioners when using the FFM scales for employee
selection purposes (i.e., the desirable pole of each dimension). That is, the McBride and Harvey
(2002) results showed that although the TIF and TSE functions were relatively flat (a desirable
characteristic for a general-purpose instrument), all five of the FFM dimensions for both the
NEO-PI and IPIP scales showed their lowest levels of measurement precision for the pole of the
scale that would presumably be most useful in a selection context (i.e., at the high end of the
Agreeableness, Conscientiousness, Extraversion, and Openness scales, and at the low end of the
Neuroticism scale). If such results are found to be generalizable to additional samples of raters,
the importance of modifying the item pools to increase the performance of the IPIP scales in
these target score ranges (i.e., by including additional items having their points of maximum
information toward these poles) would be underscored.
Method
Participants
Participants were recruited primarily from the Introductory Psychology participant pool
at a large southeastern university; participants received extra credit toward their final grade. All
participants completed the IPIP in an online testing session in which they answered the test items
via a web browser, and in which test items were presented in alternating fashion with the five
scales intermixed in screens of 8 items per screen; respondents were not allowed to go back to
change answers from earlier screens. Because the IPIP was hosted on a public web server, a
number of non-student participants also volunteered to participate. Out of a
total sample of approximately 700 individuals, respondents were discarded if they took less than
15 minutes to complete the online survey, or if their responses exhibited very low variance
across items (e.g., by selecting ‘2’ for all item responses), producing a final sample of N = 624
individuals for the IRT analyses.
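The screening rules just described can be sketched as a simple filter. The variance threshold below is an illustrative value of ours, not one reported in the study:

```python
import numpy as np

def keep_respondent(responses, minutes_taken,
                    min_minutes=15.0, min_variance=0.05):
    """Return True if a respondent passes both screens.

    responses: sequence of Likert answers (1-5).
    min_variance is an illustrative threshold, not the authors' value.
    """
    responses = np.asarray(responses, dtype=float)
    if minutes_taken < min_minutes:       # completed too quickly
        return False
    if np.var(responses) < min_variance:  # e.g., '2' for every item
        return False
    return True

# A straight-lining respondent (all 2s) is dropped regardless of time;
# a too-fast respondent is dropped regardless of response variability.
print(keep_respondent([1, 2, 3, 4, 5] * 60, minutes_taken=25))  # True
print(keep_respondent([2] * 300, minutes_taken=25))             # False
print(keep_respondent([1, 2, 3, 4, 5] * 60, minutes_taken=10))  # False
```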
Measure
The items from the IPIP used in this study were the 300 items identified by Goldberg
(2001; see http://ipip.ori.org/ipip/ for items and descriptive information) that paralleled the FFM
constructs of Agreeableness, Conscientiousness, Extraversion, Neuroticism and Openness as
measured by the subscales of the NEO-PI-R (Costa & McCrae, 1992). Each of the broad five
facet scales contained 60 items (10 for each of the NEO-PI subfactors), with each item composed
of a short statement (e.g., “love order and regularity”) to which participants responded on a five-
point Likert scale (1 = “very inaccurate” to 5 = “very accurate”).
Scale Dimensionality
Most IRT models assume that a response to any one item is unrelated to other item
responses if the latent trait is controlled for (e.g., Lord & Novick, 1968). Consequently, IRT
models assume that the latent trait construct space is either strictly unidimensional, or as a
practical matter, dominated by a general underlying factor. Reckase (1979) recommended that
the first factor account for at least twenty percent of the variance in order to obtain stable item
parameters; to evaluate the predominance of the factors underlying each FFM pool in the IPIP,
each scale was subjected to a common factor analysis using maximum likelihood estimation.
Modified parallel analysis (MPA; e.g., Drasgow & Lissak, 1983) was also used to assess the
dimensionality of each scale; MPA is an extension of the Humphreys and Montanelli (1975)
method of parallel analysis in which the eigenvalues from a synthetically created data set (i.e.,
one that satisfies the unidimensionality assumption of IRT) are compared to those estimated
from the actual data. In this approach, the synthetic data set is generated from item and person
parameters based on the actual data set. Appreciable multidimensionality is said to be present
when the second eigenvalue obtained from the actual data set is significantly larger than the
second eigenvalue obtained from the synthetic data set.
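The second-eigenvalue comparison at the heart of MPA can be sketched as follows. This is a simplified illustration with hypothetical data: the "synthetic" set here is generated from a simple one-factor linear model, standing in for data simulated from the estimated graded-response item parameters as in the actual MPA procedure:

```python
import numpy as np

rng = np.random.default_rng(0)

def second_eigenvalue(data):
    """Second-largest eigenvalue of the inter-item correlation matrix."""
    corr = np.corrcoef(data, rowvar=False)
    return np.sort(np.linalg.eigvalsh(corr))[::-1][1]

n, k = 500, 20
theta = rng.normal(size=(n, 1))            # dominant latent trait

# 'Synthetic' data: strictly unidimensional (a stand-in for data
# generated from the estimated item and person parameters).
synthetic = 0.7 * theta + rng.normal(scale=0.5, size=(n, k))

# 'Real' data: same dominant factor plus a nuisance factor loading
# on half the items, mimicking mild multidimensionality.
nuisance = rng.normal(size=(n, 1))
loadings = np.array([0.5] * (k // 2) + [0.0] * (k // 2))
real = 0.7 * theta + nuisance * loadings + rng.normal(scale=0.5, size=(n, k))

# MPA flags multidimensionality when the real data's second
# eigenvalue clearly exceeds the synthetic second eigenvalue.
print(second_eigenvalue(real) > second_eigenvalue(synthetic))  # True
```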
Item Parameter Estimation and Model Fit
MULTILOG 6.0 (Thissen, 1991) was used to estimate Samejima’s (1969) Graded
Response Model parameters for items in each of the five scales. The maximum number of
iterations (cycles) was set to 2000, and all scales converged before reaching this maximum. The
fit of parameters obtained from each scale was evaluated using the graphical and statistical
procedures recommended by Drasgow, Levine, Tsien, Williams and Mead (1995). Results from
the graphical procedure are not reported due to space limitations; χ2 fit statistics were computed
for item singles, doubles and triples, with large χ2 values seen as indicative of poor fit. Drasgow
et al. (1995) recommended interpreting values lower than 3.0 as indications of acceptable fit.
Given our inability to perform a cross-validation in light of our sample size (i.e., dividing the
sample into two groups of 312 would fall below the Reise & Yu, 1990, recommended minimum
sample size of at least 500 to recover stable item parameters), some degree of caution may be
appropriate when interpreting our results.
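A heavily simplified sketch of one such fit index follows: a chi-square/df ratio comparing observed and model-implied category counts for a single item (a "singlet"). The counts below are hypothetical, and the actual Drasgow et al. (1995) procedure adjusts the degrees of freedom for estimated parameters and extends to item pairs and triples:

```python
import numpy as np

def chisq_df_ratio(observed, expected):
    """Chi-square/df ratio for one item's category counts.
    Simplified: df = categories - 1, with no adjustment for the
    number of estimated item parameters. Ratios below ~3.0 are
    read as acceptable fit."""
    observed = np.asarray(observed, dtype=float)
    expected = np.asarray(expected, dtype=float)
    chisq = np.sum((observed - expected) ** 2 / expected)
    df = len(observed) - 1
    return chisq / df

# Hypothetical counts for a 5-category Likert item, N = 624:
obs = [40, 120, 210, 180, 74]   # observed frequencies
exp = [45, 115, 205, 185, 74]   # model-implied frequencies
print(round(chisq_df_ratio(obs, exp), 3))  # 0.258 -- acceptable fit
```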
Results
Dimensionality
Table 1 presents statistics relevant to scale dimensionality; internal consistency using
CTT methods was reasonably strong, ranging from .90 for the Openness scale to .95 for the
Neuroticism scale. Although the variance explained by the first eigenvalue of each scale
exceeds Reckase’s (1979) criterion of 20%, and the results in Figure 1 (which presents the
eigenvalue plots from the MPA) show that a clear and dominant first factor is present in all five
scales, the second eigenvalues from the real data are all somewhat larger than the second
eigenvalues from the synthetic data, indicating that all five IPIP FFM scales are to some
degree multidimensional.
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Insert Table 1 and Figure 1 about here
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Of course, no objective or straightforward criterion exists to indicate how
much multidimensionality is “too much” for the recovery of stable and accurate item parameters
in IRT. In a series of studies on Monte Carlo and real data sets, Reckase (1979) found that when
a dominant first factor was present, IRT models primarily estimated the first factor, and that
model fit was directly related to the size of the first eigenvalue. That is, as the size of the first
eigenvalue increased, the deviations from fit decreased in approximately linear fashion. Reckase
concluded that an eigenvalue that accounted for at least 20% of the total variance is needed for
reasonable ability estimates and stable item parameters. Additionally, research by Drasgow and
Parsons (1983) and Parsons and Hulin (1982) reported evidence that dichotomous models are
considerably robust to rather severe violations of the unidimensionality assumption. For
example, Parsons and Hulin (1982) were able to recover reasonably stable item parameters when
analyzing the Job Descriptive Index (JDI), which is considerably multidimensional (i.e., the JDI
contains 4 distinct satisfaction facets, but still contains a dominant general satisfaction factor).
Similarly, in a series of simulated data sets Drasgow and Parsons (1983) found that as the
predominance of the general factor decreased, the estimation program LOGIST was drawn to the
strongest factor. They concluded that estimating parameters on moderately heterogeneous data
sets, such as those found in achievement tests and attitude assessment is justified. Kirisci, Hsu,
and Yu (2001) investigated the robustness of polytomous item parameter estimation using
MULTILOG to violations of the unidimensionality assumption, concluding that (a) when data
are multidimensional, a test length of more than 20 items and a sample size of over 250 are
necessary to recover stable parameter estimates, and (b) when there is one dominant dimension
with several minor dimensions, a unidimensional IRT model is likely justified.
Thus, given past research, we viewed the moderate violations of strict unidimensionality
seen in the IPIP personality scales in Figure 1 and Table 1 as being unlikely to exert an
appreciable distorting effect on IRT parameter estimation. To further assess model fit, χ2
statistics are reported in Table 2. The mean χ2 value for all 5 scales is well below Drasgow et
al.’s (1995) recommended cutoff of 3.0. Although such results are supportive of the fit of the IRT
model to these data, it should be stressed that even more definitive results could be obtained via
the use of separate and larger validation and holdout samples.
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Insert Table 2 about here
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Test Information and Standard Error Functions
Item information functions for each of the five FFM scales were aggregated to form the
test information functions (TIFs) reported in Figure 2. These TIFs indicate the areas on the θ
continuum in which the IPIP scales provide the most information or best discrimination among
test takers. Figure 3 reports plots of the test standard error functions based on the test information
functions. That is, as the SEM for a given level of θ decreases, the information at that level of θ
increases (the SEM is the reciprocal of the square root of the test information).
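The aggregation just described, item information functions summing to a TIF whose reciprocal square root gives the TSE, can be sketched for Samejima's graded-response model. The item parameters below are hypothetical, chosen only to illustrate the computation:

```python
import numpy as np

def grm_item_information(theta, a, bs):
    """Fisher information for one graded-response item.
    a: discrimination; bs: ascending category thresholds.
    Category probabilities are differences of adjacent boundary
    response curves; information is sum_k (dP_k/dtheta)^2 / P_k."""
    theta = np.asarray(theta, dtype=float)
    # Boundary curves P*_k, padded with 1 (below lowest threshold)
    # and 0 (above highest) so differencing yields all categories.
    pstar = 1.0 / (1.0 + np.exp(-a * (theta[:, None] - np.asarray(bs))))
    pstar = np.hstack([np.ones((len(theta), 1)), pstar,
                       np.zeros((len(theta), 1))])
    dpstar = a * pstar * (1.0 - pstar)    # derivatives of boundary curves
    p = pstar[:, :-1] - pstar[:, 1:]      # category probabilities
    dp = dpstar[:, :-1] - dpstar[:, 1:]   # their derivatives
    return np.sum(dp**2 / p, axis=1)

# Hypothetical (a, thresholds) for a tiny 3-item 'scale'.
theta = np.linspace(-3, 3, 61)
items = [(1.5, [-1.5, -0.5, 0.5, 1.5]),
         (1.0, [-2.0, -1.0, 0.0, 1.0]),
         (2.0, [-0.5, 0.5, 1.0, 2.0])]

# TIF = sum of item information functions; TSE = 1 / sqrt(TIF).
tif = sum(grm_item_information(theta, a, bs) for a, bs in items)
tse = 1.0 / np.sqrt(tif)
```

With real IPIP parameters, the same two lines at the end would reproduce the TIF and TSE plots of Figures 2 and 3.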
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Insert Figures 2-3 about here
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
An examination of the TIF and TSE results presented in Figures 2 and 3 indicates that for
all scales except Neuroticism, test information clearly declines at the higher end of the θ scale;
typically, the TSE for these scales hovers around .25 for mid- and lower ranges of θ, then
gradually rises to the .40’s and higher towards the positive extreme of each latent trait (the
reverse situation is seen for the Neuroticism scale). Unfortunately, because higher scores on the
non-Neuroticism scales would typically be viewed as desirable among applicants for many jobs
(with lower scores desirable on Neuroticism), the fact that the IPIP provides very good levels of
information across most ranges of θ is to some degree offset by the notable drop in information
and measurement precision that occurs in the regions of the θ scale that are most desirable in
selection contexts. Although this loss of precision is to some degree mitigated by the fact that a
smaller percentage of individuals would be expected to lie at the extremes of these FFM scales,
implementing a top-down hiring strategy based on these personality test results would tend to
exacerbate the relative loss of measurement precision existing in this region of θ.
Discussion
Overall, consistent with the earlier McBride and Harvey (2002) study of the IPIP and
NEO-PI, the results of the present study indicate that the IPIP FFM scales are capable of
providing strong measurement precision across most of the range of their respective personality
traits. As with most psychometric inventories, measurement precision tends to decline somewhat
at the extreme ends of the latent traits. Although these findings are strongly supportive of the
usefulness of the IPIP pools as indicators of the FFM constructs for “general purpose”
assessment situations (which are typically focused primarily on the middle range of scores that
contains the majority of examinees), our findings should instead motivate caution on the part of
test users who intend to use these FFM scales as assessed by the IPIP (or the NEO-PI as well,
based on the McBride & Harvey, 2002, results) in order to make employment or other selection
or placement decisions on a top-down basis. Additional research using either actual or Monte
Carlo methods is now needed to determine the degree to which actual employment decisions
might be affected based on the magnitude of measurement error implied by the TSE and TIF
results reported above.
Even given these cautionary notes, it must be stressed that overall, our results are quite
consistent with the McBride and Harvey (2002) results in indicating that these IPIP item pools
provide an impressive level of performance for most ranges of θ. This strong measurement
performance is all the more notable in light of the fact that the IPIP was designed to be a non-
proprietary resource available to all researchers. Given the lack of proprietary restrictions on the
use of these item pools, ideally research on the IPIP will fulfill its designer’s goals of progressing
at a faster rate than would be possible with a proprietary instrument (e.g., Goldberg, 1999).
Of course, applied uses of the full 300-item instrument we used in this study might well
be viewed as problematic, given the lengthy and cumbersome nature of such a survey when used
for employee selection or other assessment purposes. Fortunately, the relatively flat test standard
error and information functions seen for these item pools suggest that this 300-item pool might
form the basis for driving a computer adaptive testing (CAT) version of the IPIP that could
considerably reduce test administration time. That is, unlike many full-length psychological tests
– which typically exhibit a test information function that is strongly center-weighted, and often
considerably lacking at the high and low ends of the scale – the TIFs shown in Figure 2 are (as in
the McBride & Harvey, 2002, study) remarkably flat across a wide range of θ. Thus, it is quite
likely that a CAT-based IPIP for these FFM constructs could appreciably cut testing time without
causing an undue reduction in information and precision in estimating θ for examinees.
Additional research evaluating the degree to which CAT can reduce testing time, plus studies
designed to assess the susceptibility of the IPIP items to differential item functioning (DIF) on
criteria relevant to employee selection (e.g., race, sex) is now needed.
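The maximum-information item-selection step at the heart of such a CAT can be sketched as follows. For brevity this sketch uses 2PL (dichotomous) item information and a hypothetical 10-item bank; an actual IPIP CAT would use graded-response item information instead:

```python
import numpy as np

def item_information_2pl(theta, a, b):
    """2PL item information: a^2 * p * (1 - p)."""
    p = 1.0 / (1.0 + np.exp(-a * (theta - b)))
    return a**2 * p * (1.0 - p)

def next_item(theta_hat, a, b, administered):
    """Select the unadministered item with maximum information at the
    current trait estimate -- the core step of a simple CAT."""
    info = item_information_2pl(theta_hat, a, b)
    info[list(administered)] = -np.inf   # exclude items already given
    return int(np.argmax(info))

# Hypothetical item bank: discriminations and difficulties.
a = np.array([1.2, 0.8, 1.5, 1.0, 2.0, 0.9, 1.1, 1.7, 1.3, 0.7])
b = np.array([-2.0, -1.5, -1.0, -0.5, 0.0, 0.5, 1.0, 1.5, 2.0, 2.5])

# At theta_hat = 0, the sharpest item centered at 0 is chosen first;
# once administered, the next most informative item is chosen.
print(next_item(theta_hat=0.0, a=a, b=b, administered=set()))  # 4
print(next_item(theta_hat=0.0, a=a, b=b, administered={4}))    # 2
```

After each response, theta_hat would be re-estimated and the loop repeated until a precision target (e.g., a TSE threshold) is met, which is how a CAT cuts testing time without sacrificing precision where items are informative.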
References
Austin, J.T., & Villanova, P. (1992). The criterion problem. Journal of Applied Psychology, 77,
836-874.
Barrick, M.R., & Mount, M.K. (1991). The Big Five personality dimensions and job
performance: A meta-analysis. Personnel Psychology, 44, 1-26.
Block, J. (1995). A contrarian view of the five factor approach to personality description.
Psychological Bulletin, 117, 187-215.
Costa, P. T., & McCrae, R.R. (1995). Primary traits of Eysenck’s P-E-N system: Three and five
factor solutions. Journal of Personality and Social Psychology, 69, 308-317.
Costa, P. T., Jr., & McCrae, R. R. (1992). Revised NEO Personality Inventory (NEO-PI-R) and
NEO Five-Factor Inventory (NEO-FFI) professional manual. Odessa, FL: Psychological
Assessment Resources.
Digman, J.M. (1990). Personality structure: Emergence of the five factor model. Annual Review
of Psychology, 41, 417-440.
Drasgow, F., Levine, M.V., Tsien, S., Williams, B., & Mead, A.D. (1995). Fitting polytomous
item response theory models to multiple choice tests. Applied Psychological
Measurement, 19, 143-165.
Drasgow, F., & Lissak, R. I. (1983). Modified parallel analysis: A procedure for examining the
latent dimensionality of dichotomously scored item responses. Journal of Applied
Psychology, 68, 363-373.
Drasgow, F., & Parsons, C.K. (1983). Application of unidimensional item response theory
models to multidimensional data. Applied Psychological Measurement, 7, 189-199.
Goldberg, L. R. (1999). A broad-bandwidth, public domain, personality inventory measuring the
lower-level facets of several five-factor models. In I. Mervielde, I. Deary, F. De Fruyt, &
F. Ostendorf (Eds.), Personality Psychology in Europe, Vol. 7 (pp. 7-28). Tilburg, The
Netherlands: Tilburg University Press.
Goldberg, L. R. (1993). The structure of phenotypic personality traits. American Psychologist,
48, 26-34.
Harvey, R. J., Murry, W.D., & Markham, S.E. (1994). Evaluation of three short-form versions of
the Myers-Briggs Type Indicator. Journal of Personality Assessment, 63, 181-184.
Humphreys, L.G., & Montanelli, R.G., Jr. (1975). An investigation of the parallel analysis
criterion for determining the number of common factors. Multivariate Behavioral
Research, 10, 193-205.
Hurtz, G. M., & Donovan, J.J. (2000). Personality and job performance: The Big Five revisited.
Journal of Applied Psychology, 85, 869-879.
International Personality Item Pool (2001). A Scientific Collaboratory for the Development of
Advanced Measures of Personality Traits and Other Individual Differences
(http://ipip.ori.org/). Internet Web Site.
Kirisci, L., Hsu, T., & Yu, L. (2001). Robustness of item parameter estimation programs to
assumptions of unidimensionality and normality. Applied Psychological Measurement,
25, 146-162.
Lord, F. M., & Novick, M. R. (1968). Statistical theories of mental test scores. Reading, MA:
Addison-Wesley.
McBride, N. L., & Harvey, R. J. (2002, April). Item response theory comparison of the IPIP and
NEO-PI-R. Paper presented at the Annual Conference of the Society for Industrial and
Organizational Psychology, Toronto.
McCrae, R.R. & John, O. P. (1992). An introduction to the five factor model and its application.
Journal of Personality, 60, 175-215.
Mount, M.K. & Barrick, M.R. (1995). The Big Five personality dimensions: Implications for
research and practice in human resources management. In K. M. Rowland and G. Ferris
(Eds.), Research in personnel and human resource management (Vol. 13, pp 153-200).
Greenwich, CT: JAI Press.
Parsons, C.K. & Hulin, C.L., (1982). An empirical comparison of item response theory and
hierarchical factor analysis in applications to the measurement of job satisfaction. Journal
of Applied Psychology, 67, 826-834.
Paunonen, S.V. & Ashton, M.C. (2001). Big five factors and facets and the prediction of
behavior. Journal of Personality and Social Psychology, 81, 524-539.
Pervin, L.A. (1994). A critical analysis of current trait theory. Psychological Inquiry, 5, 103-113.
Reckase, M.D. (1979). Unifactor latent trait models applied to multifactor tests: Results and
implications. Journal of Educational Statistics, 4, 207-230.
Reise, S. P. & Yu, J. (1990). Parameter recovery in the graded response model using
MULTILOG. Journal of Educational Measurement, 27, 133-144.
Rouse, S.V., Finger, M.S., & Butcher, J.N. (1999). Advances in clinical personality
measurement: An item response theory analysis of the MMPI-2 PSY-5 Scales. Journal
of Personality Assessment, 72, 282-307.
Salgado, J.F. (1997). The five factor model of personality and job performance in the European
community. Journal of Applied Psychology, 82, 30-43.
Samejima, F. (1969). Estimation of latent ability using a response pattern of graded scores.
Psychometrika Monograph Supplement, No. 17.
Thissen, D. (1991). MULTILOG version 6.0 user’s guide [computer program]. Chicago:
Scientific Software International.
Table 1

Descriptive Statistics Relevant to Dimensionality for the International Personality Item Pool FFM
Scales

IPIP Scale          Cronbach's Alpha (raw)   First Eigenvalue   Variance Explained by
                                                                First Eigenvalue
Agreeableness               .92                   10.85                 38%
Conscientiousness           .94                   13.38                 45%
Extraversion                .93                   13.45                 44%
Neuroticism                 .95                   15.09                 48%
Openness                    .90                    8.83                 30%
Table 2

χ2 Fit Statistics for the International Personality Item Pool's FFM Scales

Frequency Table of Chi-Square/DF Ratios for IPIP300 Agreeableness Scale
            <1   1<2  2<3  3<4  4<5  5<7  >7    Mean    SD
Singlets    60    0    0    0    0    0    0    0.044   0.051
Doublets    23   25    7    3    0    2    0    1.471   1.098
Triplets     3   13    3    1    0    0    0    1.534   0.691

Frequency Table of Chi-Square/DF Ratios for IPIP300 Conscientiousness Scale
            <1   1<2  2<3  3<4  4<5  5<7  >7    Mean    SD
Singlets    60    0    0    0    0    0    0    0.076   0.088
Doublets    14   32   11    1    0    1    1    1.597   1.105
Triplets     3   15    0    1    1    0    0    1.449   0.817

Frequency Table of Chi-Square/DF Ratios for IPIP300 Extraversion Scale
            <1   1<2  2<3  3<4  4<5  5<7  >7    Mean    SD
Singlets    60    0    0    0    0    0    0    0.039   0.048
Doublets    18   30    8    4    0    0    0    1.484   0.770
Triplets     3   14    3    0    0    0    0    1.452   0.456

Frequency Table of Chi-Square/DF Ratios for IPIP300 Neuroticism Scale
            <1   1<2  2<3  3<4  4<5  5<7  >7    Mean    SD
Singlets    60    0    0    0    0    0    0    0.124   0.136
Doublets     9   33   10    2    4    1    1    1.947   1.353
Triplets     2   12    5    0    1    0    0    1.744   0.838

Frequency Table of Chi-Square/DF Ratios for IPIP300 Openness Scale
            <1   1<2  2<3  3<4  4<5  5<7  >7    Mean    SD
Singlets    60    0    0    0    0    0    0    0.075   0.100
Doublets    25   28    4    2    0    0    1    1.325   1.191
Triplets     6   11    2    1    0    0    0    1.391   0.573
Figure Captions
Figure 1. Modified parallel analysis scree plots for the five IPIP scales.

Figure 2. Test level information functions for the five IPIP scales.

Figure 3. Test level standard error of measurement plots for the five IPIP scales.