Upload
austin-bryant
View
222
Download
0
Tags:
Embed Size (px)
Citation preview
Application of Item Response Theory to PRO Development
Michael A. Kallen, PhD, MPHAssociate Professor, Research Faculty
Department of Medical Social SciencesFeinberg School of Medicine
Northwestern UniversityChicago, Illinois
Outline
The new measurement
Constructing measures
Extending the usefulness of measurement
Outline
The new measurement
Constructing measures
Extending the usefulness of measurement
To see what’s new… The Lower Extremity Function Scale (LEFS)
from a “classical” perspective
LEFS ITEMS
“Activities”
Extreme difficulty or unable to perform
activity
Quite a bit of
difficulty
Moderate difficulty
A little bit of
difficulty
No difficulty
1. Perform any of your usual work, housework, or school activities 0 1 2 3 4
2. Perform your usual hobbies, recreational or sporting activities 0 1 2 3 4
3. Getting into or out of the bath 0 1 2 3 4
4. Walking between rooms 0 1 2 3 4
5. Putting on your shoes or socks 0 1 2 3 4
6. Squatting 0 1 2 3 4
7. Lifting an object, like a bag of groceries from the floor 0 1 2 3 4
8. Performing light activities around your home 0 1 2 3 4
9. Performing heavy activities around your home 0 1 2 3 4
10. Getting into or out of a car 0 1 2 3 4
11. Walking 2 blocks 0 1 2 3 4
12. Walking a mile 0 1 2 3 4
13. Going up or down 10 stairs (about 1 flight of stairs) 0 1 2 3 4
14. Standing for 1 hour 0 1 2 3 4
15. Sitting for 1 hour 0 1 2 3 4
16. Running on even ground 0 1 2 3 4
17. Running on uneven ground 0 1 2 3 4
18. Making sharp turns while running fast 0 1 2 3 4
19. Hopping 0 1 2 3 4
20. Rolling over in bed 0 1 2 3 4
Original LEFS Instructions“We are interested in knowing whether you are
having any difficulty at all with the activities listed below because of your lower limb problem for which you are currently seeking attention. Please provide an answer for each activity.”
“Today, do you or would you have any difficulty at all with:
(Circle one number on each line)”
LEFS ITEMS
“Activities”
Extreme difficulty or unable to
perform activity
Quite a bit of
difficulty
Moderate difficulty
A little bit of
difficulty
No difficulty
1. Perform any of your usual work, housework, or school activities 0 1 2 3 4
2. Perform your usual hobbies, recreational or sporting activities 0 1 2 3 4
3. Getting into or out of the bath 0 1 2 3 4
4. Walking between rooms 0 1 2 3 4
5. Putting on your shoes or socks 0 1 2 3 4
6. Squatting 0 1 2 3 4
7. Lifting an object, like a bag of groceries from the floor 0 1 2 3 4
8. Performing light activities around your home 0 1 2 3 4
9. Performing heavy activities around your home 0 1 2 3 4
10. Getting into or out of a car 0 1 2 3 4
11. Walking 2 blocks 0 1 2 3 4
12. Walking a mile 0 1 2 3 4
13. Going up or down 10 stairs (about 1 flight of stairs) 0 1 2 3 4
14. Standing for 1 hour 0 1 2 3 4
15. Sitting for 1 hour 0 1 2 3 4
16. Running on even ground 0 1 2 3 4
17. Running on uneven ground 0 1 2 3 4
18. Making sharp turns while running fast 0 1 2 3 4
19. Hopping 0 1 2 3 4
20. Rolling over in bed 0 1 2 3 4
Measurement products needed?
All patient individual item scores, so as to obtain
a patient’s total score
The burden of measurement For the healthcare provider/researcher:
instrument administration score summation score interpretation
For the patient: The actual time (and thoughtfulness) it
requires to respond Being asked inappropriate questions
LEFS ITEMS
“Activities”
Extreme difficulty or unable to
perform activity
Quite a bit of
difficulty
Moderate difficulty
A little bit of
difficulty
No difficulty
1. Perform any of your usual work, housework, or school activities 0 1 2 3 4
2. Perform your usual hobbies, recreational or sporting activities 0 1 2 3 4
3. Getting into or out of the bath 0 1 2 3 4
4. Walking between rooms 0 1 2 3 4
5. Putting on your shoes or socks 0 1 2 3 4
6. Squatting 0 1 2 3 4
7. Lifting an object, like a bag of groceries from the floor 0 1 2 3 4
8. Performing light activities around your home 0 1 2 3 4
9. Performing heavy activities around your home 0 1 2 3 4
10. Getting into or out of a car 0 1 2 3 4
11. Walking 2 blocks 0 1 2 3 4
12. Walking a mile 0 1 2 3 4
13. Going up or down 10 stairs (about 1 flight of stairs) 0 1 2 3 4
14. Standing for 1 hour 0 1 2 3 4
15. Sitting for 1 hour 0 1 2 3 4
16. Running on even ground 0 1 2 3 4
17. Running on uneven ground 0 1 2 3 4
18. Making sharp turns while running fast 0 1 2 3 4
19. Hopping 0 1 2 3 4
20. Rolling over in bed 0 1 2 3 4
The cost of this burden? For the healthcare provider/researcher:
Perhaps a loss of willingness to use the instrument to collect data, score responses, or interpret a patient’s self-reported condition.
For the patient: Perhaps a loss of willingness to respond in a
focused, honest way to an instrument that seems unresponsive or even annoying.
Classical psychometrics Its beginnings go back to the turn of the
20th century.
Consider Spearman’s work in disattenuating the correlation coefficient.
Demonstration of formulae for true measurement of correlation. American Journal of Psychology, 1907.
Psychometrics’ First “Golden Years” The zenith of classical psychometrics arguably
occurred during the time period surrounding the publication of Gulliksen’s book, Theory of Mental Tests (1950).
This time period saw the development of many of the best and brightest aspects of classical psychometrics: e.g., 1) that measurements, not instruments, have
psychometric properties; 2) the introduction of alpha reliability; 3) validity as the sine qua non of measurement.
Classical Test Theory (CTT) CTT is a cornerstone of classical
psychometrics.
It is theory-based measurement: Individual scores are theory-defined as
composed of a true score component and an error component.
That is, observed score = true score + error
CTT as Theory
Because it is theory, CTT or “true score theory” does not provide:
hypotheses that are testable, or
models that are falsifiable.
CTT and Circular Definitions In CTT, item difficulty is defined as:
the proportion of examines in a group of interest who answer an item correctly.
Thus, an item being “hard” or “easy” depends on the ability of the group of examinees measured.
As a result, it is a challenge to determine score meaning: Examinee and test characteristics are entangled; each can be interpreted only in the context of the other.
Very Easy Test
Very Hard Test
Medium Test
1 8
1 8
1
Expected Score 8
Person
Expected Score 0
Person
Expected Score 5
Person
8
Reprinted with permission from: Wright, B.D. & Stone, M. (1979) Best test design, Chicago: MESA Press, p. 5.
Scores depend on the difficulty of test items
The New Measurement
In the past 40-plus years, measurement has undergone a quiet evolution.
Modern Psychometrics
In 1968 Lord and Novick’s book, Statistical Theories of Mental Tests, was published.
In it, model-based measurement, a foundation of modern psychometrics, was introduced.
Item Response Theory (IRT) IRT, a form of model-based measurement, now
plays a fundamental role in modern psychometrics.
It addresses CTT’s major shortcomings: By providing test-independent and group-independent
measurement; by employing models that can be tested and falsified.
If a proposed IRT model does NOT adequately explain the data at hand, it can be determined whether assumptions were met or an inappropriate model was used.
CTT vs. IRT CTT, from classical psychometrics, is
theory-based, sample and test-specific, and focuses on test performance.
IRT, from modern psychometrics, is model-based, ability or functioning level-specific, and focuses on item performance.
Outline The new measurement
Constructing measures
Extending the usefulness of measurement
IRT approach
For any approach Getting the measure right helps in getting the
measurement right
For the IRT approach There have come to be accepted and
recommended steps to take for getting a measure right
Adapteval-HIV Project (Yang, Kallen)
Goal: To develop and evaluate a set of HIV-specific item banks to support
IRT-based CAT assessment
Implementation on multiple technology platforms (web/phone/PDA)
Item Response Theory
Q Q Q Q Q Q Q Q Q Q Q
GoodPoor
Easy Hard
PersonPerson Latent Trait Latent Trait
ItemItem Location Location
IRT assumptions Unidimensionality
Measure one “thing” only
Monotonicity The “better” the trait status on a single scale item, the
“better” the trait status on the overall scale
Local independence Items are independent of each other statistically, after
controlling for shared dimensionality
Monotonicity
0
5
10
15
20
25
1 2 3 4 5
Response Category
Me
an
Sc
ale
Sc
ore
P1P2P3P4
The “better” the trait status on a single scale item, the “better” the trait status on the overall scale
Local Independence Items are independent of each other statistically,
after controlling for shared dimensionality
Components of a scale item’s variance Required: shared variance, in common with the scale’s
unidimensional factor Expected: some residual or noise/error variance Problem: If the residual variance of one item is
correlated with that of another item, at some point the variance is no longer just noise
Advantages of IRT approach to measurement Focus is on the item, not the scale
Each item possesses trait estimation capacities
Provides item- and group-independent measurement Not tied to sample or particular items used
Makes computer adaptive testing a reality Accumulation of detailed knowledge about individual
items and their functioning Customized item presentation reduces the number of
patient responses needed to achieve measurements of similar quality
Adapteval-HIV: 14 item “sets”
Pain Fatigue Sleep Emotion Functioning Cognitive Functioning Self Care/Daily Living Physical Act./Leisure
Life Satisfaction Body Image Physical Symptoms Sat w/ Medical Care Work/Employment Neg. Social Issues Pos. Social Exp.
Measure development process
Identify initial set of items (adapted from 9 existing instruments) (248 items)
Panel (20 docs/nurses/patients/psychosocial) eval (226 items)
Panel (+ input from 3 psychosocial researchers) evaluation (192 items)
Pilot test: 50 patients Analysis of floor/ceiling, missing data, low SD
(146 items) Primary data collection: 400 patients Primary psychometric analysis
(107 items) IRT parameter calibration to implement CAT
Psychometric analyses:item exclusion/inclusion Primary Criteria
Missingness > 20% (2 items excluded) CFA factor loadings < .50 (16 excluded) Local dependence > .20 (5 excluded) Lack of monotonicity (9 excluded) Potential item bias (3 excluded)
Secondary Criteria Multi-dimensionality (3 excluded) Failed IRT parameter convergence (2 excluded)
Project History Phase I
50 patients at Northwestern Web/phone Prelim psychometric analyses (cognitive interviews, etc.)
Phase II 225 each at Northwestern and UIC Web/phone/PDA Complete psychometric analyses
Ethnicity/race and gender characteristics of primary data collection (NU + UIC)
Ethnic Category
Sex/Gender
Females Males
Unknown or Not
Reported Total
Hispanic or Latino 12 34 0 46
Not Hispanic or Latino 80 266 0 346
Unknown (individuals not reporting ethnicity) 1 7 0 8
Ethnic Category: Total of All Subjects 93 307 0 400
Racial Categories
American Indian/Alaska Native 1 3 0 4
Asian 0 5 0 5
Native Hawaiian or Other Pacific Islander 1 3 0 4
Black or African American 67 133 0 200
White 14 134 0 148
More Than One Race 2 13 0 15
Unknown or Not Reported 8 16 0 24
Racial Categories: Total of All Subjects 93 307 0 400
Adapteval-HIV question distribution
Domain Pool Domain Pool
Pain 4 Life Sat. 8
Fatigue 5 Body Img. 4
Sleep 5 Phy Symp 9
Emotional 23 Sat. Med. 5
Cognitive 9 Work 6
Self Care 6 Neg. Soc. 8
Phy. Act. 7 Pos. Soc. 8
Overall Total: Pool (107)
Ceiling and Floor Effects
Domain Sz Ceil Floor Domain Sz Ceil Floor
Pain 389 102 (26%) 0 (0%) Life Sat. 327 34 (10%) 8 (2%)
Fatigue 393 49 (12%) 4 (1%) Body Img. 390 143 (37%) 3 (1%)
Sleep 334 61 (18%) 5 (1%) Phy Sym 319 37 (12%) 0 (0%)
Emotion 382 5 (1%) 0 (0%) Sat. Med. 362 151 (42%) 8 (2%)
Cognitive 392 88 (2%) 0 (0%) Work 336 56 (17%) 11 (3%)
Self Care 362 87 (24%) 17 (5%) Neg. Soc. 348 19 (5%) 0 (0%)
Phy. Act. 311 64 (21%) 0 (0%) Pos. Soc. 333 35 (11%) 4 (1%)
Aiming at < 20%
*Entries in red denote the values falling within the expected ranges.
CFA Goodness of Fit (Standards for Unidimensionality) Root Mean Square Error of Approximation (RMSEA) < 0.10 Comparative Fit Index (CFI) > 0.90 Standardized Root Mean Square Residual (SRMR) < 0.08
RMSEA CFI SRMR RMSEA CFI SRMR
Pain 0.051 1.00 0.021 Life Sat. 0.95 0.99 0.040
Fatigue 0.060 1.00 0.020 Body Img 0.020 1.00 0.017
Sleep 0.088 0.99 0.032 Phy Symp 0.11 0.95 0.090
Emotion 0.12 0.96 0.092 Sat. Med. 0.15 0.98 0.060
Cognitive 0.099 0.98 0.048 Work 0.16 0.96 0.086
Self Care 0.093 0.99 0.036 Neg. Soc. 0.13 0.94 0.11
Phy. Act. 0.0 1.00 0.024 Pos. Soc. 0.10 0.98 0.054
*Entries in red denote the values falling within the expected ranges.
Scale Inter-Correlations
N=400 pain fatig sleep emot cog selfca phyact satlif body physym satmed work socneg
fatig 0.710
sleep 0.596 0.610
emot 0.461 0.535 0.550
cog 0.491 0.557 0.541 0.662
selfca 0.315 0.371 0.265 0.280 0.287
phyact 0.434 0.486 0.348 0.325 0.356 0.653
satlif 0.296 0.365 0.374 0.609 0.470 0.492 0.526
body 0.278 0.360 0.330 0.496 0.429 0.175 0.196 0.336
physym 0.563 0.610 0.511 0.582 0.613 0.324 0.445 0.417 0.502
satmed 0.123 0.126 0.113 0.150 0.165 0.290 0.236 0.367 0.078 0.101
work 0.489 0.559 0.424 0.453 0.444 0.307 0.420 0.368 0.442 0.557 0.061
socneg 0.438 0.563 0.525 0.695 0.591 0.282 0.307 0.488 0.638 0.614 0.069 0.611
socpos 0.213 0.226 0.261 0.415 0.361 0.367 0.380 0.585 0.242 0.231 0.448 0.259 0.389
Inter-correlations: ranged from 0.069-0.710
*Entries in red denote the values falling within the expected ranges (i.e., <0.90).
Internal Consistency
Pain .809 Life Sat. .915
Fatigue .907 Body Img. .777
Sleep .898 Phy. Symp. .818
Emotional .950 Sat. Med. .859
Cognitive .930 Work .866
Daily Living .899 Neg. Soc. .798
Phy. Act. .888 Pos. Soc. .885
Group comparison: Cronbach's alpha > 0.7 Individual comparison: Cronbach's alpha > 0.9
*Entries in red denote the values falling within the expected ranges.
Computerized Adaptive Testing
2. Select & present optimal scale item
1. Begin with initial trait estimate
5. Is stopping
rule satisfied
7. End of
battery
6. End of
assessment
4. Re-estimate trait
3. Record and score response
8. Administer next scale
item
9. Stop
No
Yes
Yes
No
Source: Wainer et al. (2000). Computerized Adaptive Testing: A Primer 2nd Ed. LEA. Mahwah N.J.
CAT performance: # of questions
Domain Pool Avg/SD Mx/Mn Domain Pool Avg/SD Mx/Mn
Pain 4 3.1/1.0 4/2 Life Sat. 8 3.1/1.5 7/2
Fatigue 5 2.6/1.1 5/2 Body Img. 4 3.7/0.5 4/3
Sleep 5 3.3/1.1 5/2 Phy Symp 9 7.7/1.9 9/3
Emotional 23 4.6/2.6 11/2 Sat. Med. 5 4.3/1.1 5/2
Cognitive 9 4.9/2.5 9/2 Work 6 4.9/1.0 6/3
Self Care 6 3.9/1.4 6/2 Neg. Soc. 8 6.9/1.0 8/5
Phy. Act. 7 3.7/2.2 7/2 Pos. Soc. 8 4.4/1.8 8/3
Overall Total: Pool (107), Average (61.1)
Static (Full) vs. CAT/IRT
Correlation of assessment scores between full instrument and CAT/IRT implementation
Body Image (4 items, avg. 3.7): .99 Cognitive Functioning (9 / 4.9): .91 Emotional Functioning (23 / 4.9): .87
Relief of measurement burden For the healthcare provider/researcher,
computerizing the test can relieve some of the burden. Use a scoring algorithm. Deliver score-specific interpretation.
This could be accomplished witha computer administered test.
Relief for respondents?
Yes, if the measure is presented as a computer adaptive test (CAT),
i.e., the test adapts itself or customizes itself according to the responses presented to it by an individual patient.
CATs and Item Structure Item presentation order
CTT: standardized All patients start at Item #1 and complete all items in
order IRT: customized
Patients are presented new items based on their responses to previous items
IRT “logic” There is an underlying hierarchy of activities. Activities can be ordered (“calibrated”) from
easiest to hardest.
CALIBRATIONS ITEMS
62.8 (hardest item) Making sharp turns while running fast (18.)
62.8 Running on uneven ground (17.)
60.7 Running on even ground (16.)
59.7 Hopping (19.)
53.6 Walking a mile (12.)
51.7 Performing your usual hobbies, recreational or sporting activities (2.)
51.6 Squatting (6.)
51.4 Standing for 1 hour (14.)
50.8 Performing heavy activities around your home (9.)
48.5 Going up or down 10 stairs (about 1 flight of stairs) (13.)
46.5 Performing any of your usual work, housework, or school activities (1.)
45.3 Walking 2 blocks (11.)
41.0 Lifting an object, like a bag of groceries from the floor (7.)
38.6 Getting into or out of a car (10.)
37.9 Getting into or out of the bath (3.)
37.6 Performing light activities around your home (8.)
36.1 Putting on your shoes or socks (5.)
32.1 (easiest item) Walking between rooms (4.)
0
10
20
30
40
50
60
70
80
90
100
Functioning
Making sharp turns while running fast Running on uneven ground Running on even ground Hopping Walking a mile Performing your usual hobbies, recreational or sporting
activities Squatting Standing for 1 hour Performing heavy activities around your home Going up or down 10 stairs (about 1 flight of stairs) Performing any of your usual work, housework, or
school activities Walking 2 blocks Lifting an object, like a bag of groceries from the floor Getting into or out of a car Getting into or out of the bath Performing light activities around your home Putting on your shoes or socks Walking between rooms CALIBRATION:CALIBRATION:
0-100 SCALE0-100 SCALE
30
35
40
45
50
55
60
65
Functioning
Making sharp turns while running fast Running on uneven ground Running on even ground Hopping Walking a mile Performing your usual hobbies, recreational or sporting
activities Squatting Standing for 1 hour Performing heavy activities around your home Going up or down 10 stairs (about 1 flight of stairs) Performing any of your usual work, housework, or
school activities Walking 2 blocks Lifting an object, like a bag of groceries from the floor Getting into or out of a car Getting into or out of the bath Performing light activities around your home Putting on your shoes or socks Walking between rooms VIEW:VIEW:
30-65 RANGE30-65 RANGE
CATs CATs have starting rules.
LEFS CAT: Begin with an item of moderate difficulty.
And CATs have stopping rules. LEFS CAT:
When the SEM < 4 (score range: 0-100), or when the average score change for last 3 score
estimates < 1, or when all LEFS items are completed.
CAT simulation:Focus on item selection
Item Pool: 18-items from the Lower Extremity Functioning Scale (LEFS)
30
35
40
45
50
55
60
65
Functioning
- - - - - -
- - - - 1. Performing any of your usual work, housework, or
school activities (A little bit of difficulty) - - - - - - -
ITEM PRESENTATION ORDER AND RESPONSE FROM LEFS EXERCISE
30
35
40
45
50
55
60
65
Functioning
- - - - 2. Walking a mile (Quite a bit) -
- - - - 1. Performing any of your usual work, housework, or
school activities (A little bit of difficulty) - - - - - - -
ITEM PRESENTATION ORDER AND RESPONSE FROM LEFS EXERCISE
30
35
40
45
50
55
60
65
Functioning
- 3. Running on uneven ground (Extreme/unable) - - 2. Walking a mile (Quite a bit) -
- - - - 1. Performing any of your usual work, housework, or
school activities (A little bit of difficulty) - - - - - - -
ITEM PRESENTATION ORDER AND RESPONSE FROM LEFS EXERCISE
30
35
40
45
50
55
60
65
Functioning
4. Making sharp turns while running fast (Extreme/unable) 3. Running on uneven ground (Extreme/unable) - - 2. Walking a mile (Quite a bit) -
- - - - 1. Performing any of your usual work, housework, or school
activities (A little bit of difficulty) - - - - - - -
ITEM PRESENTATION ORDER AND RESPONSE FROM LEFS EXERCISE
30
35
40
45
50
55
60
65
Functioning
4. Making sharp turns while running fast (Extreme/unable) 3. Running on uneven ground (Extreme/unable) 5. Running on even ground (Extreme/unable) - 2. Walking a mile (Quite a bit) -
- - - - 1. Performing any of your usual work, housework, or school
activities (A little bit of difficulty) - - - - - - -
ITEM PRESENTATION ORDER AND RESPONSE FROM LEFS EXERCISE
30
35
40
45
50
55
60
65
Functioning
4. Making sharp turns while running fast (Extreme/unable) 3. Running on uneven ground (Extreme/unable) 5. Running on even ground (Extreme/unable) 6. Hopping (Quite a bit) 2. Walking a mile (Quite a bit) -
- - - - 1. Performing any of your usual work, housework, or school
activities (A little bit of difficulty) - - - - - - -
ITEM PRESENTATION ORDER AND RESPONSE FROM LEFS EXERCISE
30
35
40
45
50
55
60
65
Functioning
4. Making sharp turns while running fast (Extreme/unable) 3. Running on uneven ground (Extreme/unable) 5. Running on even ground (Extreme/unable) 6. Hopping (Quite a bit) 2. Walking a mile (Quite a bit) 7. Performing your usual hobbies, recreational or sporting
activities (Moderate) - - - - 1. Performing any of your usual work, housework, or school
activities (A little bit of difficulty) - - - - - - -
ITEM PRESENTATION ORDER AND RESPONSE FROM LEFS EXERCISE
30
35
40
45
50
55
60
65
Functioning
4. Making sharp turns while running fast (Extreme/unable) 3. Running on uneven ground (Extreme/unable) 5. Running on even ground (Extreme/unable) 6. Hopping (Quite a bit) 2. Walking a mile (Quite a bit) 7. Performing your usual hobbies, recreational or sporting
activities (Moderate) - - 8. Performing heavy activities around your home (Moderate) - 1. Performing any of your usual work, housework, or school
activities (A little bit of difficulty) - - - - - - -
ITEM PRESENTATION ORDER AND RESPONSE FROM LEFS EXERCISE
The potential to relieve respondent burden Only 8 individual LEFS items required
responses.
As many as 10 potentially inappropriate items did NOT require responses.
This patient’s LEFS score is 46.02, the activity level at which the individual can function with little or no difficulty.
CAT simulation: Focus on precision
Item Pool: 23-items of the Modified Roland-Morris Low Back Pain Disability Questionnaire
2.01.00.0-1.0-2.0-3.0
Back Disability
Low High
Trait Continuum
2.01.00.0-1.0-2.0-3.0
Q1: I find it difficult to get out of a chair because of my back
SEM = 1.4
A: Yes
2.01.00.0-1.0-2.0-3.0
Q2: I stay at home most of the time.
SEM = 0.67
A: No
2.01.00.0-1.0-2.0-3.0
Q3: I can only walk short distances.
SEM = 0.59
A: Yes
2.01.00.0-1.0-2.0-3.0
Q4: I use a handrail to get up stairs.
SEM = 0.55
A: Yes
2.01.00.0-1.0-2.0-3.0
Q5: I’m not doing jobs that I usually do around the house.
SEM = 0.47
A: No
2.01.00.0-1.0-2.0-3.0
2.01.00.0-1.0-2.0-3.0
2.01.00.0-1.0-2.0-3.0
2.01.00.0-1.0-2.0-3.0
2.01.00.0-1.0-2.0-3.0
Yes
No
Yes
Yes
No
Low HighBack Disability
2.01.00.0-1.0-2.0-3.0
2.01.00.0-1.0-2.0-3.0
2.01.00.0-1.0-2.0-3.0
2.01.00.0-1.0-2.0-3.0
2.01.00.0-1.0-2.0-3.0
2.01.00.0-1.0-2.0-3.0
Estimate gets more precise with every question
SEM after each
question
Q1:
Q2:
Q3:
Q4:
Q5:
What response efficiencies does a 9-item “Physical Functioning” CAT offer?
Objective: To evaluate the response efficiencies provided by a 9-item CAT:
“Adapteval-HIV: Physical Functioning”
Methods 400 HIV patients from 2 clinics completed all 9
items of the “AD-HIV: Physical Functioning” scale.
CAT simulations were then conducted, based on the original, full-scale responses.
CATs used a set of commonly employed starting and stopping rules: begin with an item of moderate difficulty; end when a targeted score estimate precision is
achieved (standard error-based).
Results 3099 out of 3600 potential responses needed
to obtain CAT-based scores for 400 patients.
Eliminated the need for 501 responses (14%).
245 patients (61%) needed 9 responses to obtain physical functioning scores.
155 patients (39%) required <9 responses (894 responses instead of 1395).
Results 13 distinct CAT-based scales provided scores.
These CATs ranged in length from 3-8 items.
Of the 155 patients with reduced-item CATs: 35 (22.6%) had 6-item CATs, 30 (19.4%) had 8-item CATs, 29 (18.7%) had 5-item CATs, 26 (16.8%) had 7-item CATs, 23 (14.8%) had 3-item CATs, 12 (7.7%) had 4-item CATs.
Conclusions Initial studies of simulated CAT responses
to “AD-HIV: Physical Functioning” suggest that this 9-item CAT can provide response efficiencies: 155 patients (39%) required <9 responses. These patients averaged a 36% response
reduction.
Implications “CAT AD-HIV: Physical Functioning” offers
the potential to relieve patient response burden.
Scales of <10 items may provide real response efficiencies when delivered as CATs.
Issues in constructing and employing measures
Having a fit: impact of number of items and distribution of data on traditional criteria for assessing IRT’s unidimensionality assumption (Cook, Kallen, Amtmann)
Number of items an influencing factor on studied CFA model fit statistics CFI, NNFI, RMSEA, SRMR, WRMR
What about the context of CFA modeling: full scale analysis, short form analysis, item bank construction, static vs. dynamic (CAT) assessment
Technology aversion?
Education Differences Up to HS (172) vs. Some College or More (218)
p-val eta2 p-val eta2
Pain .009 .018 Life Sat. .015 .015
Fatigue .014 .016 Body Img. .019 .014
Sleep .176 .005 Phy. Symp. .000 .035
Emotional .001 .028 Sat. Med. .037 .011
Cognitive .000 .037 Work .006 .020
Self Care .004 .021 Neg. Soc. .079 .008
Phy. Act. .000 .065 Pos. Soc. .001 .029
*Higher education had better scores than lower education in all domains. Results based on ANOVA. 95% significance level (p-val)
Income Differences <20k (140) vs. 20k+ (199)
p-val eta2 p-val eta2
Pain .000 .078 Life Sat. .000 .046
Fatigue .000 .066 Body Img. .001 .033
Sleep .000 .049 Phy. Symp. .000 .145
Emotional .000 .060 Sat. Med. .010 .020
Cognitive .000 .088 Work .000 .097
Self Care .000 .070 Neg. Soc. .000 .057
Phy. Act. .000 .184 Pos. Soc. .000 .058
*Higher income had better scores than lower income in all domains. Results based on ANOVA. 95% significance level (p-val) and 0.10 effect size (eta2) assumed
Platform Selection Process Semi-random assignments; allowing self-selection to ensure valid
input from subjects. Platform selections affected by subject sociodemographic and
medical status.
p-value p-value
Sex .643 Occupation .076
Ethnicity .062 Income .000
Race .153 CD4 .676
Age .921 Viral Load .440
Education .000 Months Since Diag .023
*Results based on Pearson Chi-Square.. 95% significance level assumed.
Response invalidity
Outpatient population study of trust Consecutive outpatients from one public, one
private, and one VA clinic were recruited from waiting rooms prior to their appointments.
Completed questionnaires were received from 104 African Americans and 131 White Americans.
The overall sample composition was: 46% female, 57% incomes <= $20,000 per year, 36% completed high school or less, and 47% had experienced moderate to severe pain
during the previous 4 weeks.
The Physician Trust Scale:In “easiest-to-hardest to endorse” order
1. Your doctor will do whatever it takes to get you all the care you need. (R)
6. Your doctor is totally honest in telling you about all of the different treatment options available for your condition. (R)
4. Your doctor is extremely thorough and careful. (R)
5. You completely trust your doctor’s decisions about which medical treatments are best for you. (R)
10. All in all, you have complete trust in your doctor. (R)
7. Your doctor only thinks about what is best for you. (R)
9. You have no worries about putting your life in your doctor’s hands. (R)
8. Sometimes your doctor does not pay full attention to what you are trying to tell him/her.
3. Your doctor’s medical skills are not as good as they should be.
2. Sometimes your doctor cares more about what is convenient for him/her than about your medical needs.
Response options:1-Strongly agree 2-Agree 3-Neutral 4-Disagree 5-Strongly disagree
Scoring: Higher score = greater trust
Validity: Property of An Inference
A measure is not valid in an absolute sense.
Validity interest is in: the appropriateness of inferences about individuals, which were made based on an interpretation of
scores, which were derived from data collected in a specific
context.
Studying Measurement Invalidity: Person Fit
A unique advantage of IRT measurement models: Individualized person fit information is
calculated in conjunction with the estimation of person measures.
This individual person level fit information, i.e., person fit statistics, can then be used for inquiries about measurement validity.
SF-36 Physical Functioning Items:In “easiest-to-hardest” order
Measure Item # Item Content
2.36 SF3 (hardest) Vigorous activities
1.15 SF6 Climbing several flights of stairs
0.66 SF9 Walking more than a mile
0.48 SF8 Bending, kneeling, or stooping
0.10 SF4 Moderate activities
-0.18 SF10 Walking several hundred yards
-0.60 SF5 Lifting or carrying groceries
-0.60 SF7 Climbing one flight of stairs
-0.89 SF11 Walking one hundred yards
-2.48 SF12 (easiest) Bathing or dressing yourself
SF-36 Physical Functioning Items:An Individual’s Response Validity
Response Item # Item ContentNo, not limited at all. SF3 (hardest) Vigorous activitiesYes, limited a little. SF6 Climbing several flights of stairs
No, not limited at all. SF9 Walking more than a mileYes, limited a little. SF8 Bending, kneeling, or stoopingYes, limited a little. SF4 Moderate activitiesYes, limited a lot. SF10 Walking several hundred yardsYes, limited a lot. SF5 Lifting or carrying groceriesYes, limited a lot. SF7 Climbing one flight of stairs
Yes, limited a little. SF11 Walking one hundred yardsNo, not limited at all. SF12 (easiest) Bathing or dressing yourself
Using Person Fit Information 1) Identify individuals with high levels of invalidity
in their responses.
2) What percent of the sample is represented by these high invalidity individuals?
3) Can these individuals be characterized – demographically or otherwise? If no, invalid responses may be randomly distributed. If yes, measurements may be less appropriate for
certain groups.
Characterizing Invalidity In the study employing the Physician Trust Scale:
19.4% were found to have high invalidity responses (person infit or outfit mean sq >2).
This increased to 29.7% for those with Up-to-HS education, 44.1% for those with Up-to-HS education, little or no
pain.
Results: Identify Invalid Responses A sizeable percent of the sample (19.4) was identified as
having responses with questionable validity.
Characteristics common to respondents with low validity responses could be identified: Primarily - level of education; to some extent - pain status.
A question exists as to the stability of study findings when analyses are conducted with and without low validity data.
The validity of instrument responses across all patient subgroups of interest may not be equivalent. What, then, is effect, and what is inappropriate measurement?
Potential for measurement bias
If there is uncertainty in measurement…
In employing a measurement instrument, user expectation is individuals and population subgroups are NOT
likely to be advantaged or disadvantaged by the measurements they receive.
If an “advantaging” or “disadvantaging” occurs: there will be systematic error in the way in which
an instrument provides measures for members of a specific group or groups
The bottom line: Measurement disparities produce bias.
…there is uncertainty in results
The use of biased measurements in research introduces a significant question: Will conclusions hold?
In a measure of health status, what if something other than state of health has influenced the way in which certain individuals respond?
What, then, if observed group differences really reflect something other than what an outcomes instrument was intended to measure?
A question will remain about a study’s findings: Are observations about disparities in health care due to true group differences or were they influenced by the use of culturally-biased measurements?
Studying Measurement Disparity: DIF
DIF is “differential item functioning.”
A measurement item shows DIF if individuals of the same trait level (e.g., same level of trust) but originating from different groups do not have the same probability of endorsing an item.
Trust Item Difficulties: Up to HS vs. > HS Education
-1.09-0.91
-0.73
-0.47-0.30
0.34
0.550.62 0.65 0.67
-1.5
-1.0
-0.5
0.0
0.5
1.0
I1 I6 I4 I5 I10 I7 I9 I8 I3 I2
Items
Difficulty(in Logits)
Up to HS
> HS
Trust Item Difficulties: Up to HS vs. > HS Education
-1.09-0.91
0.65 0.67
-1.5
-1.0
-0.5
0.0
0.5
1.0
I1 I6 I4 I5 I10 I7 I9 I8 I3 I2
Items
Difficulty(in Logits)
Up to HS
> HS
1. Your doctor will do whatever it takes to get you all the care you need.6. Your doctor is totally honest in telling you about all of the different treatment options available for your condition. 3. Your doctor’s medical skills are not as good as they should be.2. Sometimes your doctor cares more about what is convenient for him/her than about your medical needs.
Address potential measurement bias Four items show statistically significant DIF, with
moderate to large DIF effect sizes.
The trust SCALE, as a whole, appears to be understood and responded to differently by the lower education versus higher education group.
Considering that 36% of the study’s sample completed high school or less, this trust instrument may not be able to provide interpretable scores for this study.
Outline The new measurement
Constructing measures
Extending the usefulness of measurement
When patient-reported outcomes…
…are reported to patients
A pivotal study
Measuring quality of life in routine oncology practice improves communication and patient well-being: a randomized controlled trial
G. Velikova et al, Journal of Clinical Oncology, February 15, 2004
Objective
To examine the effects of regular repeated collection and feedback of HRQL data to oncologists
Study Design: Groups Patients randomly assigned to
Intervention (I) complete HRQL questionnaires; feedback of results to
physicians Attention-Control (A-C)
complete HRQL questionnaires; no feedback of results to physicians
Control (C) no HRQL measurement before clinic encounter
Improvement = “>+7 points” in FACT-G well-being from baseline
FACT-G Overall
Physical Functional Emotional Social/family
I vs C yes yes yes yes no
I vs A-C no no no no no
A-C vs C yes yes yes no no
Fig 4. Proportions of patients showing clinically meaningful improvement, no change, or deterioration in Functional Assessment of Cancer–General (FACT-G) score after three encounters, by study arm. Intervention versus attention-control and control groups, P = .001; intervention and attention-control versus control, P = .003, using ordinal regression, controlling for baseline FACT-G, performance status, and time on study.
“Leftover” Finding
Completion of questionnaires may have effect on patient well-being, without feedback to physicians
Improvement = “>+7 points” in FACT-G well-being from baseline
FACT-G Overall
Physical Functional Emotional Social/family
I vs C yes yes yes yes no
I vs A-C no no no no no
A-C vs C yes yes yes no no
Patient-centric effect Achieved by completing measures
What if measures were systematically reported to patients, across time?
Effect yet to be harnessed by healthcare providers or systems
Two PRO-based clinical care projects Department of Palliative Care and
Rehabilitation Medicine, MD Anderson
Cancer-Related Fatigue Clinic, MD Anderson
Palliative Care (Yang, Kallen, Bruera)
NCI-sponsored SBIR
RFP Topic 246: “Integrating Patient-Reported Outcomes in
Hospice and Palliative Care Practices”
PRO Assessment
112
PRO Trending
113
PRO Assessment Snapshot
114
Expected benefits
Workflow efficiency
Data accuracy
Potential for temporal/causal analyses
Facilitate provider-provider and patient-provider communication
115
Cancer-Related Fatigue Clinic (Kallen, Yang, Escalante)
Expected benefits Timely presentation of interpreted
measurement information to all involved parties
Improve communication, shared decision-making, and outcomes
Summary The new measurement
Constructing measures
Extending the usefulness of measurement