Test Reliability & Development Using IRT
University of Kansas
Item Response Theory
Stats Camp ‘07
Overview
• Reliability with IRT
  – Item and Test Information Functions
    • Concepts
    • Equations
    • Uses and Examples
• Optimal Test Design
Reliability with IRT
• We all know that reliability (precision) is a desirable property for an assessment.
• The more reliable a test is, the more precisely we can measure the construct.
• For any scaling procedure (IRT or CTT), as reliability goes up, the standard error of measurement goes down.
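For the CTT case, this relationship is usually written as below. The notation is an assumption, since the slides do not define symbols: σ_X is the standard deviation of observed scores and ρ_XX' is the reliability coefficient.

$$SEM = \sigma_X \sqrt{1 - \rho_{XX'}}$$

With σ_X held fixed, a reliability of 0.91 gives SEM = 0.3 σ_X, while a reliability of 0.75 gives SEM = 0.5 σ_X.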
Reliability with IRT
• In CTT, reliability is a one-number summary of test precision, and there is a corresponding single standard error of measurement that is used for any test score.
• In IRT, test precision is conceptualized as something called Information, which is conditional on the trait level being measured.
  – Some tests could measure certain trait levels very well but measure others poorly…
Reliability with IRT
• A further advantage of IRT with respect to evaluating reliability is that we can consider the amount of Information an item and/or a test provides.
• In CTT, measures of item quality exist, but these are only indirectly related to what the reliability of the test will be.
Item Information Function
• “Item Information” indicates an item’s usefulness for assessing ability.
• By “usefulness” we basically mean how good an item is at distinguishing examinees with lower ability levels from those with higher ability levels.
• More Information means more Precision.
[Figure: an item characteristic curve, P(u = 1 | θ), and its item information function, Info(θ), each plotted against Ability (θ) from -3 to 3.]
Item Information Function
• Items are basically more informative where the slope of the ICC is steepest, which happens when…
  – bj is relatively close to θi,
  – aj is relatively high, and
  – cj is relatively low
• If cj = 0, an item provides its maximum information when θi = bj
[Figure: ICCs, P(u = 1 | θ), and IIFs, Info(θ), against Ability (θ) for two items with a = 1.0, c = 0.0, and b = 1.0 or 2.0.]
[Figure: ICCs, P(u = 1 | θ), and IIFs, Info(θ), against Ability (θ) for two items with b = -1.0, c = 0.2, and a = 1.0 or 0.5.]
[Figure: ICCs, P(u = 1 | θ), and IIFs, Info(θ), against Ability (θ) for two items with a = 1.0, b = 0.0, and c = 0.0 or 0.2.]
Item Information Function
• IMPORTANT: information is a function of θ, which means that an item could be very informative for some ability levels and relatively uninformative for others.
• Example: difficult items are informative for higher ability levels, but don’t tell us much about lower ability levels (because lower-ability examinees mostly get those items wrong!)
[Figure: ICCs, P(u = 1 | θ), and IIFs, Info(θ), against Ability (θ) for two items with c = 0.0, a = 1.2 or 0.8, and b = 1.0 or 0.0.]
Item Information Function for the 3-PL

$$I_j(\theta) = \frac{[P_j'(\theta)]^2}{P_j(\theta)\,Q_j(\theta)} = \frac{D^2 a_j^2 (1 - c_j)}{\left[c_j + e^{D a_j(\theta - b_j)}\right]\left[1 + e^{-D a_j(\theta - b_j)}\right]^2}$$
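The 3-PL item information formula can be sketched in Python as follows. This is a minimal illustration, not the presenters' code: the function names are hypothetical, and D = 1.7 is an assumed value for the scaling constant, which the slide leaves unspecified. The second function recomputes the same quantity from the slope of the ICC, which may help connect the closed form to the "steepest slope" intuition.

```python
import math

D = 1.7  # assumed scaling constant; the slides do not state its value


def p_3pl(theta, a, b, c, d=D):
    """Probability of a correct response under the 3-PL model."""
    return c + (1.0 - c) / (1.0 + math.exp(-d * a * (theta - b)))


def item_information(theta, a, b, c, d=D):
    """Item information I_j(theta) from the closed-form 3-PL expression."""
    z = d * a * (theta - b)
    numerator = (d * a) ** 2 * (1.0 - c)
    denominator = (c + math.exp(z)) * (1.0 + math.exp(-z)) ** 2
    return numerator / denominator


def item_information_from_slope(theta, a, b, c, d=D, h=1e-5):
    """Same quantity as [P'(theta)]^2 / (P * Q), using a central-difference slope."""
    slope = (p_3pl(theta + h, a, b, c, d) - p_3pl(theta - h, a, b, c, d)) / (2 * h)
    p = p_3pl(theta, a, b, c, d)
    return slope ** 2 / (p * (1.0 - p))


if __name__ == "__main__":
    # Illustrative item: a = 1.0, b = 0.0, c = 0.2
    for theta in (-1.0, 0.0, 1.0):
        closed = item_information(theta, 1.0, 0.0, 0.2)
        numeric = item_information_from_slope(theta, 1.0, 0.0, 0.2)
        print(f"theta={theta:+.1f}  closed form={closed:.4f}  from slope={numeric:.4f}")
```

The two columns should agree to several decimal places, which is a quick sanity check on the formula.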
Notes on IIF
• The roles of aj and cj are easy to see
  – as aj increases, information increases
  – as cj increases, information decreases
• As ability moves away from bj (+ or -) the denominator increases, so information approaches zero.
Maximum Information
If cj = 0, then Information is maximized at bj
If cj > 0, then Information is maximized at an ability level slightly greater than bj
$$\theta_{\max} = b_j + \frac{1}{D a_j} \ln\left[0.5\left(1 + \sqrt{1 + 8 c_j}\right)\right]$$
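As a rough check on the maximum-information result, the sketch below compares the closed-form location of the maximum with a brute-force grid search over θ. The function names and the value D = 1.7 are assumptions made for illustration.

```python
import math

D = 1.7  # assumed scaling constant


def item_information(theta, a, b, c, d=D):
    """3-PL item information at a given ability level."""
    z = d * a * (theta - b)
    return (d * a) ** 2 * (1.0 - c) / ((c + math.exp(z)) * (1.0 + math.exp(-z)) ** 2)


def theta_max(a, b, c, d=D):
    """Closed-form ability level at which a 3-PL item is most informative."""
    return b + math.log(0.5 * (1.0 + math.sqrt(1.0 + 8.0 * c))) / (d * a)


def theta_max_grid(a, b, c, d=D, lo=-4.0, hi=4.0, steps=80001):
    """Brute-force check: scan a fine grid and return the most informative theta."""
    grid = (lo + i * (hi - lo) / (steps - 1) for i in range(steps))
    return max(grid, key=lambda t: item_information(t, a, b, c, d))


if __name__ == "__main__":
    # Illustrative item with guessing: a = 1.0, b = 0.0, c = 0.2
    print(theta_max(1.0, 0.0, 0.2), theta_max_grid(1.0, 0.0, 0.2))
```

With cj = 0 the logarithm term is ln(1) = 0, so the maximum falls exactly at bj; with cj > 0 it shifts slightly above bj.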
Test Information Function
• Just like we add up ICCs to get a TCC, we add up IIFs to get a TIF.
• Information will continue to increase as we add test items, therefore increasing precision.
• All things equal, longer tests provide increased measurement precision.
Test Information Function
• Defined for a set of items at each point along the ability (θ) scale
• Test information is influenced by the ‘quality’ and the number of test items
$$I(\theta) = \sum_{j=1}^{n} I_j(\theta)$$
[Figure: for a set of test items, plotted against Ability (θ): the item characteristic curves, P(u = 1 | θ); the test characteristic curve, E(X | θ); the item information functions, Info(θ); and the test information function, Info(θ).]
Conditional Error for Maximum Likelihood Estimates
• One of the great benefits of IRT scaling is that measurement precision and error can now be considered conditional on θ.
Conditional Error for Maximum Likelihood Estimates
• Standard error of an MLE is determined by:
$$SE(\hat{\theta}) = \frac{1}{\sqrt{I(\hat{\theta})}}$$
Conditional Standard Error
• The imprecision of ability estimation is therefore inversely related to the amount of Information with respect to ability that is available.
• Since Information increases with the quality and number of items, the SE conversely decreases…which hopefully makes some sense!
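The inverse relationship between Information and the standard error can be sketched in a few lines. The function name is hypothetical and the information values are illustrative.

```python
import math


def standard_error(information):
    """Conditional standard error of an ability estimate: 1 / sqrt(I(theta))."""
    if information <= 0:
        raise ValueError("information must be positive")
    return 1.0 / math.sqrt(information)


if __name__ == "__main__":
    # Illustrative information values at three ability levels.
    for info in (9.0, 3.0, 1.0):
        print(f"I = {info:.0f}  ->  SE = {standard_error(info):.2f}")
```

Because of the square root, quadrupling the information only halves the standard error.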
[Figure: 8-item Test Information Function. Info(θ) and SE(θ) plotted against Ability (θ).]
[Figure: Info(θ) and SE(θ) against Ability (θ). Information may be spread across a relatively wide range…]
[Figure: Info(θ) and SE(θ) against Ability (θ). …or maximized around an ability level of interest (e.g., a cutscore).]
Info and SE Example
At θ = 1.0, I(θ = 1) = 9:

$$SE(\hat{\theta}) = \frac{1}{\sqrt{I(\hat{\theta})}} = \frac{1}{\sqrt{9}} = 0.33$$

$$\text{If } \hat{\theta} = 1.0, \quad SE(\hat{\theta}) = 0.33$$
Info and SE Example
At θ = 0.0, I(θ = 0) = 3:

$$SE(\hat{\theta}) = \frac{1}{\sqrt{I(\hat{\theta})}} = \frac{1}{\sqrt{3}} = 0.58$$

$$\text{If } \hat{\theta} = 0.0, \quad SE(\hat{\theta}) = 0.58$$
Info and SE Example
At θ = -1.0, I(θ = -1) = 1:

$$SE(\hat{\theta}) = \frac{1}{\sqrt{I(\hat{\theta})}} = \frac{1}{\sqrt{1}} = 1.0$$

$$\text{If } \hat{\theta} = -1.0, \quad SE(\hat{\theta}) = 1.0$$
95% Confidence Interval
• Because MLEs are asymptotically normally distributed, we create a 95% confidence interval around a point estimate of ability by adding and subtracting 1.96 standard errors:
• Estimate ± 1.96 SE (recall critical values from a standard normal distribution)
[Figure: Standard Normal Distribution. Probability density over -3 to 3, with 0.95 of the area in the center and 0.025 in each tail.]
95% Confidence Interval
• For θ = 1, SE = 0.33 → 1.0 ± 0.65
  – 95% confidence that the examinee’s true ability is between 0.35 and 1.65
• For θ = 0, SE = 0.58 → 0.0 ± 1.14
  – 95% confidence that the examinee’s true ability is between -1.14 and 1.14
• For θ = -1, SE = 1.0 → -1.0 ± 1.96
  – 95% confidence that the examinee’s true ability is between -2.96 and 0.96
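The interval arithmetic above can be reproduced with a short sketch. The function names are hypothetical; the critical value is taken from Python's standard-library `statistics.NormalDist`, which returns roughly 1.96 for a 95% two-sided interval.

```python
from statistics import NormalDist


def z_critical(level=0.95):
    """Two-sided critical value from the standard normal distribution."""
    return NormalDist().inv_cdf(0.5 + level / 2.0)


def confidence_interval(theta_hat, se, z=1.96):
    """Return (lower, upper) bounds: estimate plus or minus z standard errors."""
    half_width = z * se
    return theta_hat - half_width, theta_hat + half_width


if __name__ == "__main__":
    print(f"critical value: {z_critical(0.95):.4f}")
    # Ability estimates and standard errors from the three examples.
    for theta_hat, se in ((1.0, 0.33), (0.0, 0.58), (-1.0, 1.0)):
        lower, upper = confidence_interval(theta_hat, se)
        print(f"theta_hat={theta_hat:+.1f}  SE={se:.2f}  CI=({lower:.2f}, {upper:.2f})")
```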
95% Confidence Interval
• As information increases…
  – SE decreases
  – CI becomes narrower
  – Increased trust in ability estimate
• As information decreases…
  – SE increases
  – CI becomes wider
  – Decreased trust in ability estimate
Notes on IIF and TIF
• Note that the contribution of Ij(θ) to I(θ) does not depend on the particular combination of test items.
  – Each item contributes independently
• This is a very big advantage of IRT over CTT: reliability can be described conditionally (as information), and it does not depend on the particular set of items.
Mini-CTT lesson
• In CTT, item discrimination (quality) is the item-total correlation
• This will depend on the item itself, but is also influenced by the other test items.
• Adding items changes the total score, thus changing the correlation.
• Therefore, it’s difficult to anticipate the reliability of a test when creating a form from a bank of previously piloted items, unless those items all appeared together.
CTT versus IRT
• In IRT, item quality is Information, which is affected by aj, bj, cj, and θ.
• An item’s information function will be independent of the other items on the test, as will its contribution to the TIF.
• Adding more and/or better items will increase TIF, but won’t impact any IIF.
• Therefore, it’s easy to anticipate the reliability of a test when creating a form from a bank of previously piloted items.
Excel Spreadsheet Demo
• Show Excel Spreadsheet containing eight items, their ICCs, TCC, IIFs, TIF and SE.
• Specify different item parameters and determine how changes affect the resulting graphs.
Uses of Item and Test Information Functions
1) Providing conditional SE of trait
2) Building a test to meet desired statistical specifications
3) Revising an existing test
4) Comparing tests
Conditional SE
• As previously stated, the precision (reliability) and imprecision (error) of a test scaled with IRT are conditional on θ.
• Tests may be better or worse for measuring certain trait levels
Test Development
• From a pool of previously piloted test items, IRT makes it relatively easy to switch items in and out and determine what the resulting Information function will be.
• This tells the test maker what the conditional standard errors will be, too.
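One way to picture switching items in and out of a form is the sketch below. It is not a procedure given in the slides: the selection rule (rank the pool by information at a single target ability and keep the top items), the item pool, the function names, and D = 1.7 are all assumptions made for illustration. Because item information is additive, this rule maximizes test information at the chosen target, though it says nothing about other ability levels or content constraints.

```python
import math

D = 1.7  # assumed scaling constant


def item_information(theta, a, b, c, d=D):
    """3-PL item information at a given ability level."""
    z = d * a * (theta - b)
    return (d * a) ** 2 * (1.0 - c) / ((c + math.exp(z)) * (1.0 + math.exp(-z)) ** 2)


def assemble_form(pool, target_theta, n_items):
    """Pick the n_items (a, b, c) tuples most informative at target_theta."""
    ranked = sorted(pool, key=lambda item: item_information(target_theta, *item),
                    reverse=True)
    return ranked[:n_items]


def conditional_se(theta, form):
    """Standard error implied by the assembled form at a given ability level."""
    total = sum(item_information(theta, *item) for item in form)
    return 1.0 / math.sqrt(total)


if __name__ == "__main__":
    # Hypothetical pool of previously piloted items: (a, b, c).
    pool = [(1.0, -2.0, 0.0), (1.0, 1.0, 0.0), (1.5, 1.0, 0.0), (0.5, 1.0, 0.2)]
    form = assemble_form(pool, target_theta=1.0, n_items=2)
    print(form, round(conditional_se(1.0, form), 3))
```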
Test Development
• Another benefit to test development is that multiple forms may be built to the same statistical specifications.
• This process is often referred to as “Pre-equating.”
• Building strictly parallel forms is always difficult, but these procedures can help.
Test Revision
• Likewise, test items may be removed from previously existing forms (e.g., to create a “short form” of a test).
• Test items may also need to be added if the previous form is found to be unreliable.
• Estimating the new reliability of the test is straightforward with IRT
Test Revision
• In CTT, such test revisions require the assumption that the deleted or added items are of comparable statistical quality to those already on the test.
  – Spearman-Brown prophecy formula
  – This may or may not be true!
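For reference, the Spearman-Brown prophecy formula predicts the reliability of a lengthened or shortened test as k·ρ / (1 + (k − 1)·ρ), where ρ is the current reliability and k is the ratio of new to old test length. A minimal sketch, with a hypothetical function name and illustrative numbers:

```python
def spearman_brown(reliability, length_factor):
    """Predicted reliability after changing test length by length_factor.

    Assumes the added or removed items are statistically comparable to the rest,
    which is exactly the assumption questioned above.
    """
    return length_factor * reliability / (1.0 + (length_factor - 1.0) * reliability)


if __name__ == "__main__":
    # Illustrative: a test with reliability 0.80, doubled and then halved.
    print(round(spearman_brown(0.80, 2.0), 3))
    print(round(spearman_brown(0.80, 0.5), 3))
```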
Comparing Tests
• When comparing the reliability (i.e., precision) of two test forms, it’s useful to determine the ratio of their information with respect to θ.
• This ratio is known as the relative efficiency of a test: RE(θ).
• Consider two previous example TIFs
[Figure: Info(θ) and SE(θ) against Ability (θ). Information targeted around a cutscore. We’ll call this “Form X”.]
[Figure: Info(θ) and SE(θ) against Ability (θ). Information spread across a wide range. We’ll call this “Form Y”.]
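$$RE(\theta) = \frac{I_X(\theta)}{I_Y(\theta)}, \qquad I_X(\theta) = \text{info for form X at } \theta, \quad I_Y(\theta) = \text{info for form Y at } \theta$$

Suppose at θ = 1, I_X(θ) = 9.0 and I_Y(θ) = 3.6. Then

$$RE(\theta = 1) = \frac{9}{3.6} = 2.5$$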
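The relative efficiency calculation is a simple ratio, sketched here with a hypothetical function name and the information values from the example.

```python
def relative_efficiency(info_x, info_y):
    """Relative efficiency of form X versus form Y at one ability level."""
    if info_y <= 0:
        raise ValueError("information for form Y must be positive")
    return info_x / info_y


if __name__ == "__main__":
    # Information values at theta = 1 from the example.
    re_at_1 = relative_efficiency(9.0, 3.6)
    print(f"RE(theta = 1) = {re_at_1:.1f}")
```

Values above 1 favor Form X at that ability level, values below 1 favor Form Y, and a value of 1 means the two forms are equally precise there.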
[Figure: Info(θ) for Form X and Form Y against Ability (θ). In the region θ = 1, Form X is 2.5 times more efficient than Form Y.]
[Figure: Info(θ) for Form X and Form Y against Ability (θ). In the region θ ≈ 0.10, Form X is just as efficient as Form Y.]
[Figure: Info(θ) for Form X and Form Y against Ability (θ). In the region θ = -1, Form X is LESS efficient than Form Y: RE(θ) = 0.23.]
[Figure: RE(θ) against Ability (θ). Form X is more efficient than Form Y above the point θ ≈ 0.1.]
[Figure: RE(θ) against Ability (θ). Form Y is more efficient than Form X below the point θ ≈ 0.1.]
Next…
• Test Score Equating using IRT