IRT wikibook


ITEM RESPONSE THEORY: AN UNDERSTANDING


Contents

Articles

Item response theory
Classical test theory
Concept inventory
Differential item functioning
Person-fit analysis
Psychometrics
Rasch model
Scale (social sciences)
Standardized test
Generalizability theory

References

Article Sources and Contributors
Image Sources, Licenses and Contributors

Article Licenses

License


    Item response theory

In psychometrics, item response theory (IRT), also known as latent trait theory, strong true score theory, or

    modern mental test theory, is a paradigm for the design, analysis, and scoring of tests, questionnaires, and similar

    instruments measuring abilities, attitudes, or other variables. It is based on the application of related mathematical

    models to testing data. Because it is generally regarded as superior to classical test theory, it is the preferred method

    for the development of high-stakes tests such as the Graduate Record Examination (GRE) and Graduate

    Management Admission Test (GMAT).

    The name item response theory is due to the focus of the theory on the item, as opposed to the test-level focus of

    classical test theory, by modeling the response of an examinee of given ability to each item in the test. The term item

    is used because many test questions are not actually questions; they might be multiple choice questions that have

    incorrect and correct responses, but are also commonly statements on questionnaires that allow respondents to

    indicate level of agreement (a rating or Likert scale), or patient symptoms scored as present/absent. IRT is based on

    the idea that the probability of a correct/keyed response to an item is a mathematical function of person and item

    parameters. The person parameter is called latent trait or ability; it may, for example, represent a person's

    intelligence or the strength of an attitude. Item parameters include difficulty (location), discrimination (slope or

    correlation), and pseudoguessing (lower asymptote).

    Overview

    The concept of the item response function was around before 1950. The pioneering work of IRT as a theory occurred

    during the 1950s and 1960s. Three of the pioneers were the Educational Testing Service psychometrician Frederic

    M. Lord,[1]

    the Danish mathematician Georg Rasch, and Austrian sociologist Paul Lazarsfeld, who pursued parallel

    research independently. Key figures who furthered the progress of IRT include Benjamin Wright and David Andrich.

    IRT did not become widely used until the late 1970s and 1980s, when personal computers gave many researchers

    access to the computing power necessary for IRT.

    Among other things, the purpose of IRT is to provide a framework for evaluating how well assessments work, and

    how well individual items on assessments work. The most common application of IRT is in education, where

    psychometricians use it for developing and refining exams, maintaining banks of items for exams, and equating for

    the difficulties of successive versions of exams (for example, to allow comparisons between results over time).[2]

    IRT models are often referred to as latent trait models. The term latent is used to emphasize that discrete item

    responses are taken to be observable manifestations of hypothesized traits, constructs, or attributes, not directly

    observed, but which must be inferred from the manifest responses. Latent trait models were developed in the field of

    sociology, but are virtually identical to IRT models.

IRT is generally regarded as an improvement over classical test theory (CTT). For tasks that can be accomplished using CTT, IRT generally brings greater flexibility and provides more sophisticated information. Some applications,

    such as computerized adaptive testing, are enabled by IRT and cannot reasonably be performed using only classical

    test theory. Another advantage of IRT over CTT is that the more sophisticated information IRT provides allows a

    researcher to improve the reliability of an assessment.

    IRT entails three assumptions:

1. A unidimensional trait denoted by $\theta$;

    2. Local independence of items;

    3. The response of a person to an item can be modeled by a mathematical item response function (IRF).

    The trait is further assumed to be measurable on a scale (the mere existence of a test assumes this), typically set to a

    standard scale with a mean of 0.0 and a standard deviation of 1.0. 'Local independence' means that items are not

    related except for the fact that they measure the same trait, which is equivalent to the assumption of


    unidimensionality, but presented separately because multidimensionality can be caused by other issues. The topic of

    dimensionality is often investigated with factor analysis, while the IRF is the basic building block of IRT and is the

    center of much of the research and literature.

    The item response function

    The IRF gives the probability that a

    person with a given ability level will

    answer correctly. Persons with lower

    ability have less of a chance, while

    persons with high ability are very

    likely to answer correctly; for example,

    students with higher math ability are

    more likely to get a math item correct.

    The exact value of the probability

    depends, in addition to ability, on a set

    of item parameters for the IRF. For

    example, in the three parameter

logistic (3PL) model, the probability of a correct response to an item i is:

$p_i(\theta) = c_i + \frac{1 - c_i}{1 + e^{-a_i(\theta - b_i)}}$

where $\theta$ is the person (ability) parameter and $a_i$, $b_i$, and $c_i$ are the item parameters.

The item parameters simply determine the shape of the IRF and in some cases have a direct interpretation. The figure to the right depicts an example of the 3PL model of the ICC (item characteristic curve) with an overlaid conceptual explanation of the

parameters. The parameter $b_i$ represents the item location which, in the case of attainment testing, is referred to as the item difficulty. It is the point on $\theta$ where the IRF has its maximum slope. The example item is of medium difficulty since $b_i = 0.0$, which is near the center of the distribution. Note that this model scales the item's difficulty

    and the person's trait onto the same continuum. Thus, it is valid to talk about an item being about as hard as Person

    A's trait level or of a person's trait level being about the same as Item Y's difficulty, in the sense that successful

    performance of the task involved with an item reflects a specific level of ability.

The item parameter $a_i$ represents the discrimination of the item: that is, the degree to which the item discriminates between persons in different regions on the latent continuum. This parameter characterizes the slope of the IRF where the slope is at its maximum. The example item has $a_i = 1.0$, which discriminates fairly well; persons with low

    ability do indeed have a much smaller chance of correctly responding than persons of higher ability.

For items such as multiple choice items, the parameter $c_i$ is used in an attempt to account for the effects of guessing on the probability of a correct response. It indicates the probability that very low ability individuals will get this item correct by chance, mathematically represented as a lower asymptote. A four-option multiple choice item might have an IRF like the example item; there is a 1/4 chance of an extremely low ability candidate guessing the correct answer, so $c_i$ would be approximately 0.25. This approach assumes that all options are equally plausible, because if one option made no sense, even the lowest ability person would be able to discard it, so IRT parameter estimation methods take this into account and estimate a $c_i$ based on the observed data.[3]
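As an illustration of the formula above, the following minimal Python sketch evaluates a 3PL IRF; the function name and the example parameter values are arbitrary choices for illustration, not part of any particular IRT package.

```python
import numpy as np

def irf_3pl(theta, a=1.0, b=0.0, c=0.25):
    """Three-parameter logistic (3PL) item response function.

    theta : person ability
    a     : discrimination (maximum slope)
    b     : difficulty (location on the ability scale)
    c     : pseudo-guessing (lower asymptote)
    """
    return c + (1.0 - c) / (1.0 + np.exp(-a * (theta - b)))

# Probability of a correct response for a range of abilities on an item like
# the example above (b = 0.0, a = 1.0, c = 0.25).
abilities = np.array([-3.0, -1.0, 0.0, 1.0, 3.0])
print(irf_3pl(abilities))  # low-ability examinees stay near the asymptote c = 0.25
```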


    IRT models

    Broadly speaking, IRT models can be divided into two families: unidimensional and multidimensional.

Unidimensional models require a single trait (ability) dimension $\theta$. Multidimensional IRT models model response

    data hypothesized to arise from multiple traits. However, because of the greatly increased complexity, the majority

    of IRT research and applications utilize a unidimensional model.

    IRT models can also be categorized based on the number of scored responses. The typical multiple choice item is

    dichotomous; even though there may be four or five options, it is still scored only as correct/incorrect (right/wrong).

    Another class of models apply to polytomous outcomes, where each response has a different score value.[4]

    [5]

A common example of this is the Likert-type item, e.g., "Rate on a scale of 1 to 5."

    Number of IRT parameters

    Dichotomous IRT models are described by the number of parameters they make use of.[6]

    The 3PL is named so

    because it employs three item parameters. The two-parameter model (2PL) assumes that the data have minimal

guessing, but that items can vary in terms of location ($b_i$) and discrimination ($a_i$). The one-parameter model (1PL) assumes that there is minimal guessing and that items have equivalent discriminations, so that items are only described by a single parameter ($b_i$). Additionally, there is theoretically a four-parameter model, with an upper asymptote. However, this is rarely used. This naming is somewhat confusing; note that it does not follow alphabetical order. The 1PL uses only $b_i$, the 2PL uses $b_i$ and $a_i$, and the 3PL adds $c_i$.

The 2PL is equivalent to the 3PL model with $c_i = 0$, and is appropriate for testing items where guessing the

    correct answer is highly unlikely, such as attitude items. For example, guessing is not relevant for the item "I like

    Broadway musicals," with responses of agree/disagree.

    The 1PL assumes not only that guessing is irrelevant, but that all items are equivalent in terms of discrimination.

    Here, the trait is analogous to a single factor in factor analysis. In fact, one can estimate a normal-ogive latent trait

    model by factor-analyzing a matrix of tetrachoric correlations between items.[7]

    This means it is technically possible

    to estimate a simple IRT model using general-purpose statistical software. Individual items or individuals might have

    secondary factors but these are assumed to be mutually independent and collectively orthogonal.

    Logistic and normal IRT models

    An alternative formulation constructs IRFs based on the normal probability distribution; these are sometimes called

normal ogive models. For example, the formula for a two-parameter normal-ogive IRF is:

$p_i(\theta) = \Phi\!\left(\frac{\theta - b_i}{\sigma_i}\right)$

where $\Phi$ is the cumulative distribution function (cdf) of the standard normal distribution.

The normal-ogive model derives from the assumption of normally distributed measurement error and is theoretically appealing on that basis. Here $b_i$ is, again, the difficulty parameter. The discrimination parameter is $\sigma_i$, the standard deviation of the measurement error for item i, and comparable to $1/a_i$.

    With rescaling of the ability parameter, it is possible to make the 2PL logistic model closely approximate the

    cumulative normal ogive. Typically, the 2PL logistic and normal-ogive IRFs differ in probability by no more than

    0.01 across the range of the function. The difference is greatest in the distribution tails, however, which tend to have

    more influence on results.
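To make the size of that difference concrete, the following sketch (an illustration only, assuming SciPy's normal cdf and the conventional scaling constant D ≈ 1.702 used to align the logistic with the normal ogive) compares the two IRFs across a range of abilities.

```python
import numpy as np
from scipy.stats import norm

D = 1.702  # conventional constant for rescaling the logistic to approximate the normal ogive

def logistic_2pl(theta, a, b):
    # 2PL logistic IRF with the scaling constant applied to the slope
    return 1.0 / (1.0 + np.exp(-D * a * (theta - b)))

def normal_ogive_2pl(theta, a, b):
    # 2PL normal-ogive IRF: Phi(a * (theta - b))
    return norm.cdf(a * (theta - b))

theta = np.linspace(-4.0, 4.0, 801)
gap = np.abs(logistic_2pl(theta, 1.0, 0.0) - normal_ogive_2pl(theta, 1.0, 0.0))
print(gap.max())  # on the order of 0.01, consistent with the text above
```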

    The latent trait/IRT model was originally developed using normal ogives, but this was considered too

    computationally demanding for the computers at the time (1960s). The logistic model was proposed as a simpler

alternative, and has enjoyed wide use since. More recently, however, it was demonstrated that, using standard polynomial approximations to the normal cdf,[8] the normal-ogive model is no more computationally demanding than logistic models.[9]


    The Rasch model

    The Rasch model is often considered to be an IRT model, namely the one-parameter model, but is in fact a

    completely different approach to conceptualizing the relationship between data and the theory.[10]

    IRT attempts to fit

a model to observed data,[11] while the Rasch model specifies requirements for fundamental measurement, and thus

    requires adequate data-model fit before a test or research instrument can be claimed to measure a trait.[12]

Operationally, this means that the IRT approaches adjust model parameters to reflect the patterns observed in the data, while the Rasch approach requires that the data fit the Rasch model before claims regarding the presence of a

    latent trait can be considered valid. Therefore, under Rasch models, misfitting responses require diagnosis of the

    reason for the misfit, and may be excluded from the data set if substantive explanations can be made that they do not

    address the latent trait.[13]

    It is important to note that this Rasch perspective is in contrast to exploratory approaches,

    which attempt to develop a theory or model to account for observed data, but follows a confirmatory approach,

    where a theory or model is hypothesized prior to data collection and data-model fit is used to confirm the research

    hypotheses. As in any confirmatory analysis, care must be taken to avoid confirmation bias.

    A major point of contention is the use of the guessing, or pseudo-chance, parameter. The IRT approach recognizes

    that guessing is present in multiple choice examinations, and will therefore typically employ a guessing parameter to

    account for this. In contrast, the Rasch approach assumes guessing adds random noise to the data. As the noise is

    randomly distributed, provided sufficient items are tested, the rank-ordering of persons along the latent trait by raw

    score will not change, but will simply undergo a linear rescaling. The presence of random guessing will not therefore

    affect the relationships between Rasch person measures, although a larger number of items may be needed to achieve

    the desired level of reliability and separation. A form of guessing correction is available within Rasch measurement

    by excluding all responses where person ability and item difficulty differ by preset amounts, so persons are not tested

    on items where guessing or unlucky mistakes are likely to affect results. However, if guessing is not random, arising

    through poorly written distractors that address an irrelevant trait, for example, then more sophisticated identification

    of pseudo-chance responses is needed to correct for guessing. Rasch fit statistics allow identification of unlikely

    responses which may be excluded from the analysis if they are attributed to guessing.[14]

    This obviously assumes

    that the researcher is able to identify whether a student guessed or not by simply examining the patterns of responses

    in the data, so is typically used in analysis of distractor effectiveness in pilot administrations of operational tests or

    validation of research instruments, where exclusion of outlying persons is normal practice, rather than operational

    testing, where legal concerns typically dictate the use of rescaled raw scores without correction for guessing or

    misfit. If misfitting responses are retained, the Rasch model typically results in some items misfitting the model, and,

    if the number of misfitting items is excessive, there is a data-model mismatch, which has been a major criticism of

    the approach for decades.[15]

    Three-parameter IRT, by contrast, achieves data-model fit by selecting a model that fits

    the data. Unsurprisingly, such methods result in better data-model fit, but, as a model is not specified in advance for

    confirmation, such an exploratory approach sacrifices the use of fit statistics as a diagnostic tool to confirm whether

    the theorized model is an acceptable description of the latent trait. Two and three-parameter models will still report

    fit statistics, but the exploratory nature of the analysis means that they are irrelevant as a tool for confirmatory

    analysis and lack the diagnostic value of Rasch fit statistics.

The obvious question, then, is why one would use the Rasch approach rather than IRT models. Rasch[16]

    showed

    that parameters are separable with measurement in the physical sciences, which provides fundamental person-free

    measurement if applied to the social sciences. But of course the data in the social sciences has considerable noise and

    error. Accordingly, proponents of Rasch measurement models assert that only data which adequately fit the Rasch

    model satisfy the requirements of fundamental measurement, where persons and items can be mapped onto the same

    invariant scale.[17]

    In other words, in IRT models where the discrimination parameter is controlled, the item response

    curves of different items can cross. This means that the relative difficulties of items are not invariant across the

sample of persons, in violation of the assumptions of invariant measurement. The implication of this is that sample-independent measurement is only possible in one-parameter models, where the probability of a correct response is a

    function only of the difference between person ability and item difficulty. By introducing further parameters and


    allowing item response curves to cross, the measurement scale becomes sample dependent and relative item

    difficulties vary for different persons. Thus two or three-parameter models are only appropriate for uses where

    persons and items are not required to be mapped onto a single scale of measurement because the relative difficulty of

    items is not stable for different persons. The Rasch model thus has major benefits as a tool for validation of research

    instruments, where extensive piloting is conducted to diagnose poorly performing items, but mapping persons and

    items onto an invariant scale is of central interest, while two and three-parameter models have value for standardized

    testing where extensive piloting is impractical and correction for badly performing distractors is required.

    Another characteristic of the Rasch approach is that estimation of parameters is more straightforward in Rasch

    models due to the presence of sufficient statistics, which in this application means a one-to-one mapping of raw

    number-correct scores to Rasch estimates.[18]

    A practical benefit of this is that the validation analysis of a test can

    be used to produce a "score table" listing Rasch measures (scores) corresponding to raw scores, and this can then be

    used for operational scoring of the same test form without requiring the item responses of the new administrations to

    be collected and analyzed. For classroom testing purposes, such procedures greatly improve the practicality of

    immediate IRT-type scoring with paper-and-pencil administration. However, for some purposes, this existence of

    sufficient statistics is also a disadvantage because it means that there are only as many Rasch measures on the

standard scale as there are raw scores, whereas 2- or 3-parameter models report a person measure for each possible response string, producing finer differentiation among examinees, which is typically the goal of proficiency tests.

    However, unless the test is extremely long, the measurement error for individual persons will generally exceed the

    size of differences between Rasch and IRT person measures, so persons will not be substantively affected unless

    they display extremely misfitting response strings, which would require investigation to determine whether the

    person was following the same latent trait as other examinees. Also, for other instruments such as diagnostic tests or

    research instruments, finely calibrated differentiation of ability may not be the primary objective, hence different

    models are appropriate for different purposes. Two or three parameter models are useful for analyzing large

    dichotomous datasets where guessing may be significant, extensive pilot-testing to identify poorly discriminating

    items is not possible, and estimation of person ability is the only requirement, but outside of large-scale standardized

testing, the Rasch model's ability to analyze smaller data sets than more complex IRT models and provision of invariant, sample-independent measurement has major practical and theoretical benefits.

    Analysis of model fit

    As with any use of mathematical models, it is important to assess the fit of the data to the model. If item misfit with

    any model is diagnosed as due to poor item quality, for example confusing distractors in a multiple-choice test, then

    the items may be removed from that test form and rewritten or replaced in future test forms. If, however, a large

    number of misfitting items occur with no apparent reason for the misfit, the construct validity of the test will need to

    be reconsidered and the test specifications may need to be rewritten. Thus, misfit provides invaluable diagnostic

    tools for test developers, allowing the hypotheses upon which test specifications are based to be empirically tested

    against data.

    There are several methods for assessing fit, such as a chi-square statistic, or a standardized version of it. Two and

    three-parameter IRT models adjust item discrimination, ensuring improved data-model fit, so fit statistics lack the

    confirmatory diagnostic value found in one-parameter models, where the idealized model is specified in advance.

    Data should not be removed on the basis of misfitting the model, but rather because a construct relevant reason for

    the misfit has been diagnosed, such as a non-native speaker of English taking a science test written in English. Such

    a candidate can be argued to not belong to the same population of persons depending on the dimensionality of the

    test, and, although one parameter IRT measures are argued to be sample-independent, they are not population

    independent, so misfit such as this is construct relevant and does not invalidate the test or the model. Such an

    approach is an essential tool in instrument validation. In two and three-parameter models, where the psychometric

    model is adjusted to fit the data, future administrations of the test must be checked for fit to the same model used in


    the initial validation in order to confirm the hypothesis that scores from each administration generalize to other

    administrations. If a different model is specified for each administration in order to achieve data-model fit, then a

    different latent trait is being measured and test scores cannot be argued to be comparable between administrations.

    Information

    One of the major contributions of item response theory is the extension of the concept of reliability. Traditionally,

    reliability refers to the precision of measurement (i.e., the degree to which measurement is free of error). And

    traditionally, it is measured using a single index defined in various ways, such as the ratio of true and observed score

    variance. This index is helpful in characterizing a test's average reliability, for example in order to compare two tests.

    But IRT makes it clear that precision is not uniform across the entire range of test scores. Scores at the edges of the

    test's range, for example, generally have more error associated with them than scores closer to the middle of the

    range.

    Item response theory advances the concept of item and test information to replace reliability. Information is also a

    function of the model parameters. For example, according to Fisher information theory, the item information

supplied in the case of the 1PL for dichotomous response data is simply the probability of a correct response multiplied by the probability of an incorrect response, or

$I(\theta) = p_i(\theta)\, q_i(\theta)$

where $q_i(\theta) = 1 - p_i(\theta)$.

The standard error of estimation (SE) is the reciprocal of the square root of the test information at a given trait level:

$SE(\theta) = \frac{1}{\sqrt{I(\theta)}}$

Thus more information implies less error of measurement.

For other models, such as the two and three parameter models, the discrimination parameter plays an important role in the function. The item information function for the two parameter model is

$I(\theta) = a_i^2\, p_i(\theta)\, q_i(\theta)$

The item information function for the three parameter model is

$I(\theta) = a_i^2\, \frac{(p_i(\theta) - c_i)^2}{(1 - c_i)^2}\, \frac{q_i(\theta)}{p_i(\theta)}$ [19]

    In general, item information functions tend to look bell-shaped. Highly discriminating items have tall, narrow

    information functions; they contribute greatly but over a narrow range. Less discriminating items provide less

    information but over a wider range.

    Plots of item information can be used to see how much information an item contributes and to what portion of the

scale score range. Because of local independence, item information functions are additive. Thus, the test information function is simply the sum of the information functions of the items on the exam. Using this property with a large

    item bank, test information functions can be shaped to control measurement error very precisely.
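The additivity of item information can be shown with a short sketch. The item parameters below are invented for illustration, and the 2PL information formula from above, $I(\theta) = a_i^2 p_i(\theta) q_i(\theta)$, is assumed.

```python
import numpy as np

def p_2pl(theta, a, b):
    """2PL item response function."""
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

def item_information_2pl(theta, a, b):
    """Fisher information of a single 2PL item: a^2 * p * (1 - p)."""
    p = p_2pl(theta, a, b)
    return a ** 2 * p * (1.0 - p)

# Hypothetical small item bank: (a, b) pairs.
items = [(1.5, -1.0), (1.0, 0.0), (2.0, 0.5), (0.8, 1.5)]

theta = np.linspace(-3.0, 3.0, 7)
# Local independence lets item informations be summed into the test information.
test_information = sum(item_information_2pl(theta, a, b) for a, b in items)
standard_error = 1.0 / np.sqrt(test_information)  # SE(theta) = 1 / sqrt(I(theta))
print(test_information)
print(standard_error)
```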

    Characterizing the accuracy of test scores is perhaps the central issue in psychometric theory and is a chief difference

    between IRT and CTT. IRT findings reveal that the CTT concept of reliability is a simplification. In the place of

reliability, IRT offers the test information function, which shows the degree of precision at different values of theta ($\theta$).

    These results allow psychometricians to (potentially) carefully shape the level of reliability for different ranges of

    ability by including carefully chosen items. For example, in a certification situation in which a test can only be

passed or failed, where there is only a single "cutscore," and where the actual passing score is unimportant, a very

    efficient test can be developed by selecting only items that have high information near the cutscore. These items

    generally correspond to items whose difficulty is about the same as that of the cutscore.


    Scoring

The person parameter $\theta$ represents the magnitude of the latent trait of the individual, which is the human capacity or

    attribute measured by the test.[20]

    It might be a cognitive ability, physical ability, skill, knowledge, attitude,

    personality characteristic, etc.

    The estimate of the person parameter - the "score" on a test with IRT - is computed and interpreted in a very different

manner as compared to traditional scores like number or percent correct. The individual's total number-correct score is not the actual score; rather, the score is based on the IRFs, leading to a weighted score when the model contains item

    discrimination parameters. It is actually obtained by multiplying the item response function for each item to obtain a

likelihood function, the highest point of which is the maximum likelihood estimate of $\theta$. This highest point is

    typically estimated with IRT software using the Newton-Raphson method.[21]

    While scoring is much more

    sophisticated with IRT, for most tests, the (linear) correlation between the theta estimate and a traditional score is

    very high; often it is .95 or more. A graph of IRT scores against traditional scores shows an ogive shape implying

    that the IRT estimates separate individuals at the borders of the range more than in the middle.
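A bare-bones sketch of the maximum likelihood scoring step described above is given below for the 2PL case: the log-likelihood is maximized over theta with Newton-Raphson updates. The item parameters and responses are invented, and real IRT software adds safeguards (for example, all-correct or all-incorrect response strings have no finite maximum likelihood estimate).

```python
import numpy as np

def p_2pl(theta, a, b):
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

def mle_theta(responses, a, b, theta=0.0, max_iter=25, tol=1e-6):
    """Maximum likelihood estimate of theta for dichotomous 2PL responses,
    located with Newton-Raphson iterations on the log-likelihood."""
    responses, a, b = (np.asarray(x, dtype=float) for x in (responses, a, b))
    for _ in range(max_iter):
        p = p_2pl(theta, a, b)
        first = np.sum(a * (responses - p))        # d logL / d theta
        second = -np.sum(a ** 2 * p * (1.0 - p))   # d^2 logL / d theta^2 (always negative)
        step = first / second
        theta -= step
        if abs(step) < tol:
            break
    return theta

# Hypothetical five-item test and one examinee's right/wrong responses.
a = [1.2, 0.8, 1.5, 1.0, 0.9]
b = [-1.0, -0.5, 0.0, 0.5, 1.0]
print(mle_theta([1, 1, 1, 0, 0], a, b))
```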

    An important difference between CTT and IRT is the treatment of measurement error, indexed by the standard error

    of measurement. All tests, questionnaires, and inventories are imprecise tools; we can never know a person's true

    score, but rather only have an estimate, the observed score. There is some amount of random error which may push

    the observed score higher or lower than the true score. CTT assumes that the amount of error is the same for each

    examinee, but IRT allows it to vary.[22]

    Also, nothing about IRT refutes human development or improvement or assumes that a trait level is fixed. A person

    may learn skills, knowledge or even so called "test-taking skills" which may translate to a higher true-score. In fact, a

    portion of IRT research focuses on the measurement of change in trait level.[23]

A comparison of classical and item response theory

    Classical test theory (CTT) and IRT are largely concerned with the same problems but are different bodies of theory

    and therefore entail different methods. Although the two paradigms are generally consistent and complementary,

    there are a number of points of difference:

    IRT makes stronger assumptions than CTT and in many cases provides correspondingly stronger findings;

    primarily, characterizations of error. Of course, these results only hold when the assumptions of the IRT models

    are actually met.

    Although CTT results have allowed important practical results, the model-based nature of IRT affords many

    advantages over analogous CTT findings.

    CTT test scoring procedures have the advantage of being simple to compute (and to explain) whereas IRT scoring

    generally requires relatively complex estimation procedures.

IRT provides several improvements in scaling items and people. The specifics depend upon the IRT model, but most models scale the difficulty of items and the ability of people on the same metric. Thus the difficulty of an

    item and the ability of a person can be meaningfully compared.

    Another improvement provided by IRT is that the parameters of IRT models are generally not sample- or

    test-dependent whereas true-score is defined in CTT in the context of a specific test. Thus IRT provides

    significantly greater flexibility in situations where different samples or test forms are used. These IRT findings

    are foundational for computerized adaptive testing.

    It is worth also mentioning some specific similarities between CTT and IRT which help to understand the

    correspondence between concepts. First, Lord[24]

showed that under the assumption that $\theta$ is normally distributed,

    discrimination in the 2PL model is approximately a monotonic function of the point-biserial correlation. In

    particular:


$a_i \cong \frac{\rho_{it}}{\sqrt{1 - \rho_{it}^{2}}}$

where $\rho_{it}$ is the point-biserial correlation of item i. Thus, if the assumption holds, where there is a higher discrimination there will generally be a higher point-biserial correlation.

    Another similarity is that while IRT provides for a standard error of each estimate and an information function, it is

    also possible to obtain an index for a test as a whole which is directly analogous to Cronbach's alpha, called the

    separation index. To do so, it is necessary to begin with a decomposition of an IRT estimate into a true location and

error, analogous to decomposition of an observed score into a true score and error in CTT. Let

$\hat{\theta} = \theta + \epsilon$

where $\theta$ is the true location, and $\epsilon$ is the error associated with an estimate. Then $SE(\hat{\theta})$ is an estimate of the standard deviation of $\epsilon$ for a person with a given weighted score, and the separation index is obtained as follows:

$R_\theta = \frac{\operatorname{var}[\hat{\theta}] - \overline{SE^2}}{\operatorname{var}[\hat{\theta}]}$

where the mean squared standard error of the person estimates, $\overline{SE^2}$, gives an estimate of the variance of the errors $\epsilon$ across persons. The standard errors are normally produced as a by-product of the estimation process. The separation index is typically very close in value to Cronbach's alpha.[25]
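A minimal sketch of such a calculation is shown below, assuming one already has person estimates and their standard errors from a calibration run; the numbers are invented and the helper name separation_index is not from any particular package.

```python
import numpy as np

def separation_index(theta_hat, se):
    """Person separation index: the share of variance in the person estimates
    not attributable to estimation error,
    (var(theta_hat) - mean(SE^2)) / var(theta_hat)."""
    theta_hat, se = np.asarray(theta_hat, float), np.asarray(se, float)
    error_variance = np.mean(se ** 2)      # mean squared standard error
    observed_variance = np.var(theta_hat)
    return (observed_variance - error_variance) / observed_variance

# Invented person estimates and standard errors.
theta_hat = [-1.3, -0.4, 0.1, 0.6, 1.2, 1.9]
se = [0.45, 0.38, 0.35, 0.36, 0.40, 0.50]
print(separation_index(theta_hat, se))
```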

    IRT is sometimes called strong true score theory or modern mental test theory because it is a more recent body of

    theory and makes more explicit the hypotheses that are implicit within CTT.

    References

    [1] ETS Research Overview (http://www.ets.org/portal/site/ets/menuitem.c988ba0e5dd572bada20bc47c3921509/

    ?vgnextoid=26fdaf5e44df4010VgnVCM10000022f95190RCRD&vgnextchannel=ceb2be3a864f4010VgnVCM10000022f95190RCRD)

[2] Hambleton, R. K., Swaminathan, H., & Rogers, H. J. (1991). Fundamentals of Item Response Theory. Newbury Park, CA: Sage Press.

[3] Bock, R.D., & Aitkin, M. (1981). Marginal maximum likelihood estimation of item parameters: application of an EM algorithm. Psychometrika, 46, 443-459.

[4] Ostini & Nering (http://www.sagepub.com/booksProdDesc.nav?prodId=Book226392)

    [5] Nering & Ostini (http://www.routledgeeducation.com/books/

    Handbook-of-Polytomous-Item-Response-Theory-Models-isbn9780805859928)

    [6] Thissen, D. & Orlando, M. (2001). Item response theory for items scored in two categories. In D. Thissen & Wainer, H. (Eds.), Test Scoring

    (pp. 73-140). Mahwah, NJ: Lawrence Erlbaum Associates, Inc.

[7] Jöreskog, K. G., & Sörbom, D. (1988). PRELIS 1 user's manual, version 1. Chicago: Scientific Software, Inc.

[8] Abramowitz, M., & Stegun, I.A. (1972). Handbook of Mathematical Functions. Washington, DC: U.S. Government Printing Office.

[9] Uebersax, J.S. (1999). Probit latent class analysis with dichotomous or ordered category measures: conditional independence/dependence models. Applied Psychological Measurement, 23(4), 283-297.

[10] Andrich, D. (1989). "Distinctions between assumptions and requirements in measurement in the social sciences", in Keats, J.A., Taft, R., Heath, R.A., & Lovibond, S. (Eds.), Mathematical and Theoretical Systems, Elsevier Science Publishers, North Holland, Amsterdam, pp. 7-16.

[11] Steinberg, J. (2000). Frederic Lord, Who Devised Testing Yardstick, Dies at 87. New York Times, February 10, 2000.

[12] Andrich, D. (2004). Controversy and the Rasch model: a characteristic of incompatible paradigms? Medical Care, 42, 1-16.

    [13] Smith, R.M. (1990). Theory and practice of fit. Rasch Measurement Transactions, 3(4), p.78. http://rasch.org/rmt/rmt34b.htm

    [14] Smith, R.M. (1990). Theory and practice of fit. Rasch Measurement Transactions, 3(4), p.78. http://rasch.org/rmt/rmt34b.htm

    [15] Goldstein, H., & Blinkhorn, S. (1982). The Rasch model still does not fit. British Educational Research Journal, 8(2), 167-170.

[16] Rasch, G. (1960/1980). Probabilistic models for some intelligence and attainment tests. (Copenhagen, Danish Institute for Educational Research), expanded edition (1980) with foreword and afterword by B.D. Wright. Chicago: The University of Chicago Press.

[17] Wright, B.D. (1992). IRT in the 1990s: Which Models Work Best? Rasch Measurement Transactions, 6(1), 196-200.

[18] Fischer, G.H., & Molenaar, I.W. (1995). Rasch Models: Foundations, Recent Developments, and Applications. New York: Springer.

[19] de Ayala, R.J. (2009). The Theory and Practice of Item Response Theory. New York, NY: The Guilford Press. (6.12), p. 144.

[20] Lazarsfeld, P.F., & Henry, N.W. (1968). Latent Structure Analysis. Boston: Houghton Mifflin.

[21] Thompson, N.A. (2009). Ability estimation with IRT (http://www.assess.com/docs/Thompson (2009) - Ability estimation with IRT.pdf).

    [22] Kolen, Michael J., Zeng, Lingjia, Hanson, Bradley A. (1996). Conditional Standard Errors of Measurement for Scale Scores Using IRT.

    Journal of Educational Measurement, 33(2), 129-40.


[23] Hall, L.A., & McDonald, J.L. (2000). Measuring Change in Teachers' Perceptions of the Impact that Staff Development Has on Teaching. Paper presented at the Annual Meeting of the American Educational Research Association (New Orleans, LA, April 24-28, 2000). (http://eric.ed.gov/ERICWebPortal/custom/portlets/recordDetails/detailmini.jsp?_nfpb=true&_&ERICExtSearch_SearchValue_0=ED441789&ERICExtSearch_SearchType_0=no&accno=ED441789)

    [24] Lord, F.M. (1980).Applications of item response theory to practical testing problems. Mahwah, NJ: Lawrence Erlbaum Associates, Inc.

[25] Andrich, D. (1982). An index of person separation in latent trait theory, the traditional KR-20 index, and the Guttman scale response pattern.

    Education Research and Perspectives, 9, 95-104.

    Additional reading

    Many books have been written that address item response theory or contain IRT or IRT-like models. This is a partial

    list, focusing on texts that provide more depth.

Lord, F.M. (1980). Applications of item response theory to practical testing problems. Mahwah, NJ: Erlbaum.

This book summarizes much of Lord's IRT work, including chapters on the relationship between IRT and

    classical methods, fundamentals of IRT, estimation, and several advanced topics. Its estimation chapter is now

    dated in that it primarily discusses joint maximum likelihood method rather than the marginal maximum

    likelihood method implemented by Darrell Bock and his colleagues.

Embretson, S. and Reise, S. (2000). Item response theory for psychologists. Mahwah, NJ: Erlbaum.

    This book is an accessible introduction to IRT, aimed, as the title says, at psychologists.

    Baker, Frank (2001). The Basics of Item Response Theory. ERIC Clearinghouse on Assessment and Evaluation,

    University of Maryland, College Park, MD.

This introductory book is by one of the pioneers in the field, and is available online at (http://edres.org/irt/baker/).

Baker, F. B., & Kim, S.-H. (2004). Item response theory: Parameter estimation techniques (2nd ed.). New York:

    Marcel Dekker.

    This book describes various item response theory models and furnishes detailed explanations of algorithms

    that can be used to estimate the item and ability parameters. Portions of the book are available online as

limited preview at (http://books.google.com/).

Van der Linden, W.J. & Hambleton, R.K. (Eds.) (1997). Handbook of modern item response theory. New York:

    Springer.

    This book provides a comprehensive overview regarding various popular IRT models. It is well suited for

    persons who already have gained basic understanding of IRT.

De Boeck, P., & Wilson, M. (Eds.) (2004). Explanatory Item Response Models: A Generalized Linear and

    Nonlinear Approach. New York: Springer.

This volume gives an integrated introduction to item response models, mainly aimed at practitioners,

    researchers and graduate students.

Fox, J.-P. (2010). Bayesian Item Response Modeling: Theory and Applications. New York: Springer.

    This book discusses the Bayesian approach towards item response modeling. The book will be useful for

    persons (who are familiar with IRT) with an interest in analyzing item response data from a Bayesian

    perspective.


    External links

A Simple Guide to the Item Response Theory (PDF) (http://www.creative-wisdom.com/computer/sas/IRT.pdf)

Psychometric Software Downloads (http://www.umass.edu/remp/main_software.html)

IRT Tutorial (http://work.psych.uiuc.edu/irt/tutorial.asp)

IRT Tutorial FAQ (http://sites.google.com/site/benroydo/irt-tutorial)

An introduction to IRT (http://edres.org/irt/)

The Standards for Educational and Psychological Testing (http://www.apa.org/science/standards.html)

IRT Command Language (ICL) computer program (http://www.b-a-h.com/software/irt/icl/)

IRT Programs from SSI, Inc. (http://www.ssicentral.com/irt/index.html)

IRT Programs from Assessment Systems Corporation (http://assess.com/xcart/home.php?cat=37)

IRT Programs from Winsteps (http://www.winsteps.com)

Latent Trait Analysis and IRT Models (http://www.john-uebersax.com/stat/lta.htm)

Rasch analysis (http://www.rasch-analysis.com/)

Free IRT software (http://www.john-uebersax.com/stat/papers.htm)

    Classical test theory

Classical test theory is a body of related psychometric theory that predicts outcomes of psychological testing such as

    the difficulty of items or the ability of test-takers. Generally speaking, the aim of classical test theory is to understand

    and improve the reliability of psychological tests.

    Classical test theory may be regarded as roughly synonymous with true score theory. The term "classical" refers not

    only to the chronology of these models but also contrasts with the more recent psychometric theories, generally

    referred to collectively as item response theory, which sometimes bear the appellation "modern" as in "modern latent

    trait theory".

    Classical test theory as we know it today was codified by Novick (1966) and described in classic texts such as Lord

& Novick (1968) and Allen & Yen (1979/2002). The description of classical test theory below follows these seminal

    publications.

    Definitions

Classical test theory assumes that each person has a true score, T, that would be obtained if there were no errors in

    measurement. A person's true score is defined as the expected number-correct score over an infinite number of

    independent administrations of the test. Unfortunately, test users never observe a person's true score, only an

observed score, X. It is assumed that observed score = true score plus some error:

X = T + E

(observed score = true score + error)

Classical test theory is concerned with the relations between the three variables $X$, $T$, and $E$ in the population.

    These relations are used to say something about the quality of test scores. In this regard, the most important concept

is that of reliability. The reliability of the observed test scores $X$, which is denoted as $\rho^2_{XT}$, is defined as the ratio of true score variance $\sigma^2_T$ to the observed score variance $\sigma^2_X$:

$\rho^2_{XT} = \frac{\sigma^2_T}{\sigma^2_X}$

Because the variance of the observed scores can be shown to equal the sum of the variance of true scores and the variance of error scores, this is equivalent to

$\rho^2_{XT} = \frac{\sigma^2_T}{\sigma^2_T + \sigma^2_E}$


    This equation, which formulates a signal-to-noise ratio, has intuitive appeal: The reliability of test scores becomes

    higher as the proportion of error variance in the test scores becomes lower and vice versa. The reliability is equal to

    the proportion of the variance in the test scores that we could explain if we knew the true scores. The square root of

    the reliability is the correlation between true and observed scores.
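These relations are easy to verify with a small simulation; the sketch below generates hypothetical true scores and errors, and checks that var(T)/var(X) matches the squared correlation between true and observed scores.

```python
import numpy as np

rng = np.random.default_rng(0)

n_persons = 100_000
true_scores = rng.normal(loc=50.0, scale=10.0, size=n_persons)  # T
errors = rng.normal(loc=0.0, scale=5.0, size=n_persons)         # E, independent of T
observed = true_scores + errors                                  # X = T + E

reliability = true_scores.var() / observed.var()        # sigma_T^2 / sigma_X^2
print(reliability)                                       # close to 100 / 125 = 0.8
print(np.corrcoef(true_scores, observed)[0, 1] ** 2)     # squared T-X correlation, also ~0.8
```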

    Reliability

    Reliability cannot be estimated directly since that would require one to know the true scores, which according to

    classical test theory is impossible. However, estimates of reliability can be obtained by various means. One way of

    estimating reliability is by constructing a so-called parallel test. The fundamental property of a parallel test is that it

    yields the same true score and the same observed score variance as the original test for every individual. If we have

parallel tests x and x', then this means that

$T_x = T_{x'}$

and

$\sigma^2_{E_x} = \sigma^2_{E_{x'}}$

    Under these assumptions, it follows that the correlation between parallel test scores is equal to reliability (see Lord &

    Novick, 1968, Ch. 2, for a proof).

    Using parallel tests to estimate reliability is cumbersome because parallel tests are very hard to come by. In practice

the method is rarely used. Instead, researchers use a measure of internal consistency known as Cronbach's $\alpha$. Consider a test consisting of $k$ items $U_j$, $j = 1, \ldots, k$. The total test score is defined as the sum of the individual item scores, so that for individual $i$

$X_i = \sum_{j=1}^{k} U_{ij}$

Then Cronbach's alpha equals

$\alpha = \frac{k}{k-1}\left(1 - \frac{\sum_{j=1}^{k} \sigma^2_{U_j}}{\sigma^2_X}\right)$

Cronbach's $\alpha$ can be shown to provide a lower bound for reliability under rather mild assumptions. Thus, the reliability of test scores in a population is never lower than the value of Cronbach's $\alpha$ in that population. This method is empirically feasible and, as a result, it is very popular among researchers. Calculation of Cronbach's $\alpha$ is included in many standard statistical packages such as SPSS and SAS.[1]
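As a concrete illustration of the formula above, the following sketch computes Cronbach's alpha directly from a person-by-item score matrix; the data are invented, and this is not a substitute for the routines in SPSS or SAS mentioned in the text.

```python
import numpy as np

def cronbach_alpha(item_scores):
    """Cronbach's alpha for an (n_persons x k_items) score matrix:
    alpha = k/(k-1) * (1 - sum of item variances / variance of total score)."""
    item_scores = np.asarray(item_scores, dtype=float)
    k = item_scores.shape[1]
    item_variances = item_scores.var(axis=0, ddof=1)
    total_variance = item_scores.sum(axis=1).var(ddof=1)
    return k / (k - 1) * (1.0 - item_variances.sum() / total_variance)

# Invented 0/1 responses of six persons to four items.
scores = [
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [1, 1, 1, 1],
    [0, 0, 0, 0],
    [1, 1, 1, 0],
]
print(cronbach_alpha(scores))
```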

    As has been noted above, the entire exercise of classical test theory is done to arrive at a suitable definition of

    reliability. Reliability is supposed to say something about the general quality of the test scores in question. The

    general idea is that, the higher reliability is, the better. Classical test theory does not say how high reliability is

supposed to be. Too high a value for $\alpha$, say over .9, indicates redundancy of items. Around .8 is recommended for

    personality research, while .9+ is desirable for individual high-stakes testing.[2]

    It must be noted that these 'criteria'

    are not based on formal arguments, but rather are the result of convention and professional practice. The extent to

    which they can be mapped to formal principles of statistical inference is unclear.


    Alternatives

    Classical test theory is an influential theory of test scores in the social sciences. In psychometrics, the theory has

    been superseded by the more sophisticated models in Item Response Theory (IRT) and Generalizability theory

    (G-theory). However, IRT is not included in standard statistical packages like SPSS and SAS, whereas these

packages routinely provide estimates of Cronbach's $\alpha$; specialized software is necessary for IRT analyses.

    Notes

[1] Pui-Wa Lei and Qiong Wu (2007). "CTTITEM: SAS macro and SPSS syntax for classical item analysis" (http://brm.psychonomic-journals.org/content/39/3/527.full.pdf). Behavior Research Methods 39 (3): 527-530. doi:10.3758/BF03193021. PMID 17958163.

[2] Streiner, D. L. (2003). "Starting at the Beginning: An Introduction to Coefficient Alpha and Internal Consistency". Journal of Personality Assessment 80 (1): 99-103. doi:10.1207/S15327752JPA8001_18. PMID 12584072.

    References

Allen, M.J., & Yen, W. M. (2002). Introduction to Measurement Theory. Long Grove, IL: Waveland Press.

Novick, M.R. (1966). The axioms and principal results of classical test theory. Journal of Mathematical Psychology, 3(1), 1-18.

Lord, F. M. & Novick, M. R. (1968). Statistical theories of mental test scores. Reading, MA: Addison-Wesley Publishing Company.

    Further reading

Gregory, Robert J. (2011). Psychological Testing: History, Principles, and Applications (Sixth ed.). Boston: Allyn & Bacon. ISBN 978-0-205-78214-7. Lay summary (http://www.pearsonhighered.com/bookseller/product/Psychological-Testing-History-Principles-and-Applications-6E/9780205782147.page) (7 November 2010).

Hogan, Thomas P.; Brooke Cannon (2007). Psychological Testing: A Practical Introduction (Second ed.). Hoboken, NJ: John Wiley & Sons. ISBN 978-0-471-73807-7. Lay summary (http://www.wiley.com/WileyCDA/WileyTitle/productCd-EHEP000675.html) (21 November 2010).

    External links

International Test Commission article on Classical Test Theory (http://www.intestcom.org/Publications/ORTA/Classical+test+theory.php)


    Concept inventory

    A concept inventory is a criterion-referenced test designed to evaluate whether a student has an accurate working

    knowledge of a specific set of concepts. To ensure interpretability, it is common to have multiple items that address a

    single idea. Typically, concept inventories are organized as multiple-choice tests in order to ensure that they are

    scored in a reproducible manner, a feature that also facilitates administration in large classes. Unlike a typical,

    teacher-made multiple-choice test, questions and response choices on concept inventories are the subject of extensive

    research. The aims of the research include ascertaining (a) the range of what individuals think a particular question is

    asking and (b) the most common responses to the questions. Concept inventories are evaluated to ensure test

    reliability and validity. In its final form, each question includes one correct answer and several distractors. The

distractors are incorrect answers that are usually (but not always) based on students' commonly held

    misconceptions.[1]

    Ideally, a score on a criterion-referenced test reflects the amount of content knowledge a student has mastered.

    Criterion-referenced tests differ from norm-referenced tests in that (in theory) the former is not used to compare an

    individual's score to the scores of the group. Ordinarily, the purpose of a criterion-referenced test is to ascertain

    whether a student mastered a predetermined amount of content knowledge; upon obtaining a test score that is at or

    above a cutoff score, the student can move on to study a body of content knowledge that follows next in a learning

    sequence. In general, item difficulty values ranging between 30% and 70% are best able to provide information

    about student understanding.
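
    For instance, item difficulty in this classical sense is simply the proportion of examinees who answer an item correctly, so the 30%-70% guideline can be checked directly from scored responses. A minimal sketch in Python, using a small hypothetical response matrix rather than data from any particular inventory:

        import numpy as np

        # Hypothetical scored responses: rows are students, columns are items (1 = correct, 0 = incorrect).
        responses = np.array([
            [1, 1, 0, 1, 0],
            [1, 0, 0, 1, 0],
            [1, 1, 1, 1, 0],
            [0, 1, 0, 1, 1],
        ])

        # Classical item difficulty is the proportion of correct responses to each item.
        difficulty = responses.mean(axis=0)

        # Items falling inside the 30%-70% band are generally the most informative.
        for i, p in enumerate(difficulty, start=1):
            flag = "" if 0.30 <= p <= 0.70 else " (outside the 30%-70% band)"
            print(f"Item {i}: difficulty = {p:.2f}{flag}")

    With only four simulated examinees these proportions are of course unstable; in practice they are estimated from the full pilot sample before items are retained or revised.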

    Distractors are often based on ideas commonly held by students, as determined by years of research on

    misconceptions. Test developers often research student misconceptions by examining students' responses to

    open-ended essay questions and conducting "think-aloud" interviews with students. The distractors chosen by

    students help researchers understand student thinking and give instructors insights into students' prior knowledge

    (and, sometimes, firmly held beliefs). This foundation in research underlies instrument construction and design, and

    plays a role in helping educators obtain clues about students' ideas, scientific misconceptions, and didaskalogenic (that is, teacher-induced) confusions and conceptual lacunae that interfere with learning.

    Concept inventories in use

    The first concept inventory was developed in 1987.[2]

    It concerned photosynthesis and respiration in plants. The

    concept inventory not only used misconceptions as distractors, but also employed two-tiered items. First-tier items

    ask "What happens when . . . ?" (which students often know); second-tier items ask "Why does this happen?" (which

    students often don't know). Tiers of questions based on the distinction between student knowledge of outcome and

    mechanism provide an additional source of information for instructors.

    Hestenes, Halloun, and Wells developed the first of the concept inventories to be widely disseminated, the Force

    Concept Inventory (FCI).[3]

    [4]

    The FCI was designed to assess student understanding of the Newtonian concepts of

    force. Hestenes (1998) found that "while nearly 80% of the [students completing introductory college physics

    courses] could state Newton's Third Law at the beginning of the course, FCI data showed that less than 15% of

    them fully understood it at the end". These results have been replicated in a number of studies involving students at a

    range of institutions (see sources section below), and have led to greater recognition in the physics education

    research community of the importance of students' "active engagement" with the materials to be mastered.


    [Figure: schematic of the Hake plot (see the Redish page[5]).]

    Since the development of the FCI, other physics instruments have been

    developed. These include the Force and Motion Conceptual Evaluation

    developed by Thornton and Sokoloff[6]

    and the Brief Electricity and

    Magnetism Assessment developed by Ding et al.[7]

    For a discussion of

    how a number of concept inventories were developed see Beichner.[8]

    Information about physics concept tests can be found at the NC State

    Physics Education Research Group website (see the external links

    below).

    Concept inventories have been developed in physics,[6][7][9] statistics,[10] chemistry,[11][12] astronomy,[13] basic biology,[14][15][16] natural selection,[17][18][19] genetics,[20] engineering,[21] and geoscience.[22] A review of many concept inventories can be found in two papers (#4, Libarkin, and #5, Reed-Rhoads)[23] commissioned by the National Research Council.

    A different type of conceptual assessment has been created by the Thinking Like a Biologist research group at

    Michigan State University. To date, they have created approximately 80 items exploring students' understanding of matter and energy, organized into Diagnostic Question Clusters that are available for download. These items are

    valuable for engaging students in collaborative problem-solving activities in class. Another approach is illustrated by

    the Biological Concepts Instrument (BCI),[24]

    which is a 24-item, multiple-choice, research-based instrument

    (available on-line) designed to reveal students' (and teachers') understanding of foundational ideas within the

    (primarily) molecular biological arena. For example, results from the administration of the BCI indicate that students

    have difficulty grasping the implications of random processes in biological systems.[25]

    In many areas, foundational scientific concepts transcend disciplinary boundaries. An example of an inventory that

    assesses knowledge of such concepts is an instrument developed by Odom and Barrow (1995) to evaluate

    understanding of diffusion and osmosis.[2]

    In addition, there are non-multiple choice conceptual instruments, such as

    the essay-based approach suggested by Wright et al. (1998)[12] and the essay and oral exams used by Nehm and

    Schonfeld (2008).[18]

    Caveats associated with concept inventory use

    Some concept inventories are problematic. Some inventories created by scientists do not align with best practices in

    scale development. Concept inventories created to simply diagnose student ideas may not be viable as

    research-quality measures of conceptual understanding. Users should be careful to ensure that concept inventories

    are actually testing conceptual understanding, rather than test-taking ability, English skills, or other factors that can

    influence test performance.

    The use of multiple-choice exams as concept inventories is not without controversy. The very structure of multiple-choice concept inventories raises questions about the extent to which complex, and often nuanced

    situations and ideas must be simplified or clarified to produce unambiguous responses. For example, a

    multiple-choice exam designed to assess knowledge of key concepts in natural selection[17]

    does not meet a number

    of standards of quality control.[19]

    One problem with the exam is that the members of each of several pairs of parallel

    items, each pair designed to measure one key concept in natural selection, sometimes have very different levels of

    difficulty.[18]

    Another problem is that the multiple-choice exam overestimates knowledge of natural selection as

    reflected in student performance on a diagnostic essay exam and a diagnostic oral exam, two instruments with

    reasonably good construct validity.[18]

    Although scoring concept inventories in the form of essay or oral exams is

    labor intensive, costly, and difficult to implement with large numbers of students, such exams can offer a more

    realistic appraisal of the actual levels of students' conceptual mastery as well as their misconceptions.[12]

    [18]

    Recently, however, computer technology has been developed that can score essay responses on concept inventories


    in biology and other domains (Nehm, Ha, & Mayfield, 2011),[26]

    promising to facilitate the scoring of concept

    inventories organized as (transcribed) oral exams as well as essays.

    References

    [1] "Development and Validation of Instruments to Measure Learning of Expert-Like Thinking." W. K. Adams & C. E. Wieman, 2010.

    International Journal of Science Education, 1-24. iFirst, DOI: 10.1080/09500693.2010.512369

    [2] Odom AL, Barrow LH 1995 Development and application of a two-tier diagnostic test measuring college biology students' understanding of

    diffusion and osmosis after a course of instruction. Journal of Research In Science Teaching 32: 45-61.

    [3] Hestenes D, Wells M, Swackhamer G 1992 Force concept inventory. The Physics Teacher 30: 141-166.

    [4] Hestenes D 1998. Am. J. Phys. 66:465 (http://scitation.aip.org/vsearch/servlet/VerityServlet?KEY=AJPIAS&CURRENT=NO&

    ONLINE=YES&smode=strresults&sort=rel&maxdisp=25&threshold=0&pjournals=AJPIAS&pyears=2001,2000,1999&possible1=465&

    possible1zone=fpage&fromvolume=66&SMODE=strsearch&OUTLOG=NO&viewabs=AJPIAS&key=DISPLAY&docID=1&page=1&

    chapter=0)

    [5] Redish page. Visited Feb. 14, 2011. (http://www.physics.umd.edu/perg/papers/redish/nas/nas.htm)

    [6] Thornton RK, Sokoloff DR (1998) Assessing student learning of Newton's laws: The Force and Motion Conceptual Evaluation and

    Evaluation of Active Learning Laboratory and Lecture Curricula. . Amer J Physics 66: 338-352. (http://scitation.aip.org/getabs/servlet/

    GetabsServlet?prog=normal&id=AJPIAS000066000004000338000001&idtype=cvips&gifs=yes)

    [7] Ding, L, Chabay, R, Sherwood, B, & Beichner, R (2006). Evaluating an electricity and magnetism assessment tool: Brief electricity and

    magnetism assessment Brief Electricity and Magnetism Assessment (BEMA). Phys. Rev. ST Physics Ed. Research 2, 7 pages. (http://

    prst-per.aps.org/abstract/PRSTPER/v2/i1/e010105)

    [8] Beichner, R. Testing student interpretation of kinematics graphs, Am. J. Phys., 62, 750-762, (1994).

    [9] Hake RR (1998) Interactive-engagement versus traditional methods: a six-thousand-student survey of mechanics test data for introductory

    physics courses. Am J Physics 66: 64-74. (http://www.physics.indiana.edu/~sdi/ajpv3i.pdf)

    [10] Allen, K (2006) The Statistics Concept Inventory: Development and Analysis of a Cognitive Assessment Instrument in Statistics. Doctoral

    dissertation, The University of Oklahoma. (https://engineering.purdue.edu/SCI/pubs/Kirk Allen dissertation.pdf)

    [11] The Chemical Concepts Inventory. Visited Feb. 14, 2011 (http://jchemed.chem.wisc.edu/JCEDlib/QBank/collection/CQandChP/CQs/

    ConceptsInventory/CCIIntro.html)

    [12] Wright et al., 1998. J. Chem. Ed. 75: 986-992. (http://bioliteracy.colorado.edu/Readings/Wright.pdf)

    [13] Astronomy Diagnostic Test (ADT) Version 2.0, visited Feb. 14, 2011. (http://solar.physics.montana.edu/aae/adt/)

    [14] DAvanzo. C. 2008. Biology concept inventories: overview, status, and next steps. BioScience 58: 1079-85

    [15] D'Avanzo C, Anderson CW, Griffith A, Merrill J. 2010. Thinking like a biologist: Using diagnostic questions to help students reason with biological principles. (17 January 2010; www.biodqc.org/)

    [16] Wilson CD, Anderson CW, Heidemann M, Merrill JE, Merritt BW, Richmond G, Sibley DF, Parker JM. 2007. Assessing students' ability to

    trace matter in dynamic systems in cell biology. CBE Life Sciences Education 5: 323-331.

    [17] Anderson DL, Fisher KM, Norman GJ (2002) Development and evaluation of the conceptual inventory natural selection. Journal of

    Research In Science Teaching 39: 952-978. (http://adsabs.harvard.edu/abs/2002JRScT..39..952A)

    [18] Nehm R & Schonfeld IS (2008). Measuring knowledge of natural selection: A comparison of the C.I.N.S., an open-response instrument, and

    an oral interview. Journal of Research in Science Teaching, 45, 1131-1160. (http://www1.ccny.cuny.edu/prospective/socialsci/

    psychology/faculty/upload/Nehm-Schonfeld-2008-JRST.pdf)

    [19] Nehm R & Schonfeld IS (2010). The future of natural selection knowledge measurement: A reply to Anderson et al. (2010). Journal of

    Research in Science Teaching, 47, 358-362. (http://www1.ccny.cuny.edu/prospective/socialsci/psychology/faculty/upload/

    Nehm-and-Schonfeld-2010-JRST.pdf)

    [20] Smith MK, Wood WB, Knight JK (2008) The Genetics Concept Assessment: A New Concept Inventory for Gauging Student Understanding of Genetics. CBE Life Sci Educ 7(4): 422-430. (http://www.lifescied.org/cgi/content/full/7/4/422?maxtoshow=&hits=10&

    RESULTFORMAT=1&author1=smith&author2=knight&andorexacttitle=and&andorexacttitleabs=and&andorexactfulltext=and&

    searchid=1&FIRSTINDEX=0&sortspec=relevance&resourcetype=HWCIT,HWELTR)

    [21] Concept Inventory Assessment Instruments for Engineering Science. Visited Feb. 14, 2011. (http://www.foundationcoalition.org/home/

    keycomponents/assessment_evaluation.html)

    [22] http://geoscienceconceptinventory.wikispaces.com/home

    [23] http://www7.nationalacademies.org/bose/PP_Commissioned_Papers.html

    [24] Klymkowsky, Underwood, & Garvin-Doxas, 2010. (http://arxiv.org/abs/1012.4501)

    [25] Garvin-Doxas & Klymkowsky, 2009. (http://www.lifescied.org/cgi/content/full/7/2/227)

    [26] Nehm, R.H., Ha, M., Mayfield, E. (in press). Transforming Biology Assessment with Machine Learning: Automated Scoring of Written

    Evolutionary Explanations. Journal of Science Education and Technology. (http://www.springerlink.com/content/1059-0145/preprint/

    ?sort=p_OnlineDate&sortorder=desc&o=30)


    External links

    Astronomy (http://solar.physics.montana.edu/aae/adt/)

    Basic Biology (http://bioliteracy.colorado.edu/)

    Bio-Diagnostic Question Clusters (http://www.biodqc.org)

    Classroom Concepts and Diagnostic Tests (http://www.flaguide.org/cat/diagnostic/diagnostic5.php)

    Chemistry (http://jchemed.chem.wisc.edu/JCEDlib/QBank/collection/CQandChP/CQs/ConceptsInventory/CCIIntro.html)

    Diagnostic Question Clusters in Biology (http://dqc.crcstl.msu.edu)

    Engineering (http://www.foundationcoalition.org/home/keycomponents/assessment_evaluation.html)

    Evolution Assessment (http://www.evolutionassessment.org)

    Force Concept Inventory (http://modeling.la.asu.edu/R&E/Research.html)

    Genetics (http://www.lifescied.org/cgi/content/full/7/4/422?maxtoshow=&hits=10&RESULTFORMAT=1&author1=smith&author2=knight&andorexacttitle=and&andorexacttitleabs=and&andorexactfulltext=and&searchid=1&FIRSTINDEX=0&sortspec=relevance&resourcetype=HWCIT,HWELTR)

    Molecular Life Sciences Concept Inventory (http://www.lifescinventory.edu.au/)

    Physics (http://www.ncsu.edu/per/TestInfo.html)

    Statistics (https://engineering.purdue.edu/SCI)

    Thinking Like a Biologist (http://www.biodqc.org/contacts)

    Differential item functioning

    Differential item functioning (DIF) occurs when people from different groups (commonly gender or ethnicity) with

    the same latent trait (ability/skill) have a different probability of giving a certain response on a questionnaire or

    test.[1] DIF analysis provides an indication of unexpected item behavior on a test. An item does not display DIF if

    people from different groups simply have a different probability of giving a certain response; it displays DIF if people from

    different groups with the same underlying true ability have a different probability of giving a certain response. More

    precisely, an item displays DIF when the difficulty (b), discrimination (a), or lower-asymptote (c) parameters

    estimated by item response theory (IRT) for an item differ across groups. Thus, when one or more item parameters

    differ across groups, an item displays DIF.
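
    In the three-parameter logistic (3PL) IRT model, for example, these parameters enter the item response function as

        P(X = 1 | \theta) = c + (1 - c) \cdot \frac{1}{1 + e^{-a(\theta - b)}}

    so an item is free of DIF only if this curve, and hence the set (a, b, c), is the same for the reference and focal groups at every ability level \theta.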

    The Mantel–Haenszel and logistic regression procedures are the most widely used methods to investigate DIF.

    Zumbo (2007)[2]

    offers a review of various DIF detection methods and strategies.
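
    To make the logistic regression procedure concrete, the studied item's responses are typically modelled from a matching variable (usually the total or rest score), group membership, and their interaction; a significant group effect suggests uniform DIF, and a significant interaction suggests non-uniform DIF. A minimal sketch on simulated data, assuming Python with statsmodels (the variable names and the simulated DIF effect are hypothetical):

        import numpy as np
        import pandas as pd
        import statsmodels.formula.api as smf

        rng = np.random.default_rng(0)
        n = 500

        # Hypothetical data: rest score on the other items and group membership (0 = reference, 1 = focal).
        df = pd.DataFrame({
            "total": rng.integers(0, 31, size=n),
            "group": rng.integers(0, 2, size=n),
        })
        # Simulate the studied item with a small uniform-DIF effect against the focal group.
        eta = -3.0 + 0.2 * df["total"] - 0.5 * df["group"]
        df["item"] = rng.binomial(1, 1 / (1 + np.exp(-eta)))

        # Step-up models: matching variable only, plus group (uniform DIF), plus interaction (non-uniform DIF).
        m1 = smf.logit("item ~ total", data=df).fit(disp=False)
        m2 = smf.logit("item ~ total + group", data=df).fit(disp=False)
        m3 = smf.logit("item ~ total * group", data=df).fit(disp=False)

        # Likelihood-ratio comparisons flag uniform and non-uniform DIF respectively.
        print("uniform DIF LR chi-square:    ", 2 * (m2.llf - m1.llf))
        print("non-uniform DIF LR chi-square:", 2 * (m3.llf - m2.llf))

    The Mantel–Haenszel procedure addresses the same question nonparametrically, comparing the odds of a correct response across groups within strata of matched test scores.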

    References

    [1] Embretson, S.E., & Reise, S.P. (2000). Item Response Theory for Psychologists.

    [2] Zumbo, B.D. (2007). Three generations of differential item functioning (DIF) analyses: Considering where it has been, where it is now, and where it is going. Language Assessment Quarterly, vol. 4, pp. 223-233. (http://educ.ubc.ca/faculty/zumbo/papers/Zumbo_LAQ_reprint.pdf)


    Person-fit analysis

    Person-fit analysis is a technique for determining whether a person's results on a given test are valid.

    The purpose of a person-fit analysis is to detect item-score vectors that are unlikely given a hypothesized test theory

    model such as item response theory, or unlikely compared with the majority of item-score vectors in the sample. An

    item-score vector is a list of "scores" that a person gets on the items of a test, where "1" is often correct and "0" is

    incorrect. For example, if a person took a 10-item quiz and only got the first five correct, the vector would be

    {1111100000}.
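
    One simple, nonparametric way to flag an unlikely item-score vector is to count Guttman errors: pairs of items in which a relatively hard item is answered correctly while an easier item is missed, with "easier" defined by the proportion of correct answers in the sample. A minimal sketch in Python (the difficulty values are hypothetical; model-based person-fit statistics such as lz apply the same idea using IRT response probabilities):

        import numpy as np

        def guttman_errors(score_vector, proportion_correct):
            """Count item pairs answered in the 'wrong' order: an easier item (higher
            proportion correct in the sample) missed while a harder item is answered correctly."""
            order = np.argsort(-np.asarray(proportion_correct))  # easiest item first
            x = np.asarray(score_vector)[order]
            errors = 0
            for i in range(len(x)):
                for j in range(i + 1, len(x)):
                    if x[i] == 0 and x[j] == 1:  # easier item wrong, harder item right
                        errors += 1
            return errors

        # Hypothetical proportion-correct values, ordered from easiest to hardest item.
        p_correct = np.linspace(0.9, 0.1, 10)

        print(guttman_errors([1, 1, 1, 1, 1, 0, 0, 0, 0, 0], p_correct))  # 0: consistent with the vector above
        print(guttman_errors([0, 0, 0, 0, 0, 1, 1, 1, 1, 1], p_correct))  # 25: maximally aberrant pattern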

    In individual decision-making in education, psychology, and personnel selection, it is critically important that test

    users can have confidence in the test scores used. The validity of individual test scores may be threatened when the

    examinee's answers are governed by factors other than the psychological trait of interest - factors that can range from

    something as benign as the examinee dozing off to concerted fraud efforts. Person-fit methods are used to detect

    item-score vectors where such external factors may be relevant, and as a result, indicate invalid measurement.

    Unfortunately, person-fit statistics only indicate whether a set of responses is likely or unlikely; they cannot prove anything.

    The results of the analysis might suggest that an examinee cheated, but there is no way to go back to when the test was administered and prove it. This limits its practical applicability on an individual scale. However, it might be useful

    on a larger scale; if most examinees at a certain test site or with a certain proctor have unlikely responses, an

    investigation might be warranted.

    References

    Emons, W.H.M., Sijtsma, K., & Meijer, R.R. (2005). Global, local and graphical person-fit analysis using person response functions. Psychological Methods, 10(1), 101-119.

    Emons, W.H.M., Glas, C.A.W., Meijer, R.R., & Sijtsma, K. (2003). Person fit in order-restricted latent class models. Applied Psychological Measurement, 27(6), 459-478.

    Meijer, R. R., & Sijtsma, K. (2001). Methodology review: Evaluating person-fit. Applied Psychological Measurement, 25, 107-135.


    Psychometrics

    Psychometrics is the field of study concerned with the theory and technique of psychological measurement, which

    includes the measurement of knowledge, abilities, attitudes, personality traits, and educational measurement. The

    field is primarily concerned with the construction and validation of measurement instruments such as questionnaires,

    tests, and personality assessments.

    It involves two major research tasks, namely: (i) the construction of instruments and procedures for measurement;

    and (ii) the development and refinement of theoretical approaches to measurement. Those who practice

    psychometrics are known as psychometricians. All psychometricians possess a specific psychometric qualification,

    and while many are clinical psychologists, others work as human resources or learning and development

    professionals.

    Origins and background

    Much of the early theoretical and applied work in psychometrics was undertaken in an attempt to measure

    intelligence. Francis Galton, often referred to as "the father of psychometrics", devised and included mental tests among his anthropometric measures. However, the origin of psychometrics also has connections to the related field

    of psychophysics. Two other pioneers of psychometrics obtained doctorates in the Leipzig Psychophysics

    Laboratory under Wilhelm Wundt: James McKeen Cattell in 1886 and Charles Spearman in 1906.

    The psychometrician L. L. Thurstone, founder and first president of the Psychometric Society in 1936, developed

    and applied a theoretical approach to measurement referred to as the law of comparative judgment, an approach that

    has close connections to the psychophysical theory of Ernst Heinrich Weber and Gustav Fechner. In addition,

    Spearman and Thurstone both made important contributions to the theory and application of factor analysis, a

    statistical method developed and used extensively in psychometrics.

    More recently, psychometric theory has been applied in the measurement of personality, attitudes, and beliefs, and

    academic achievement. Measurement of these unobservable phenomena is difficult, and much of the research and

    accumulated science in this discipline has been developed in an attempt to properly define and quantify such

    phenomena. Critics, including practitioners in the physical sciences and social activists, have argued that such

    definition and quantification is impossibly difficult, and that such measurements are often misused, such as with

    psychometric personality tests used in employment procedures:

    "For example, an employer wanting someone for a role requiring consistent attention to repetitive detail will

    probably not want to give that job to someone who is very creative and gets bored easily."[1]

    Figures who made significant contributions to psychometrics include Karl Pearson, Henry F. Kaiser, L. L.

    Thurstone, Georg Rasch, Johnson O'Connor, Frederic M. Lord, Ledyard R Tucker, Arthur Jensen, and David

    Andrich.

    Psychometric, psychometrician and psychometrist appreciation week is the first week in November.

    Definition of measurement in the social sciences

    The definition of measurement in the social sciences has a long history. A currently widespread definition, proposed

    by Stanley Smith Stevens (1946), is that measurement is "the assignment of numerals to objects or events according

    to some rule". This definition was introduced in the paper in which Stevens proposed four levels of measurement.

    Although widely adopted, this definition differs in important respects from the more classical definition of

    measurement adopted in the physical sciences, which is that measurement is the numerical estimation and

    expression of the magnitude of one quantity relative to another (Michell, 1997).
