Application of Item Response Theory to PRO Development Michael A. Kallen, PhD, MPH Associate Professor, Research Faculty Department of Medical Social Sciences

Application of Item Response Theory to PRO Development

Michael A. Kallen, PhD, MPH
Associate Professor, Research Faculty

Department of Medical Social Sciences
Feinberg School of Medicine

Northwestern University
Chicago, Illinois

Outline

The new measurement

Constructing measures

Extending the usefulness of measurement

Outline

The new measurement



To see what’s new… The Lower Extremity Functional Scale (LEFS) from a “classical” perspective

LEFS ITEMS (“Activities”)

Response options: 0 = Extreme difficulty or unable to perform activity; 1 = Quite a bit of difficulty; 2 = Moderate difficulty; 3 = A little bit of difficulty; 4 = No difficulty

1. Perform any of your usual work, housework, or school activities 0 1 2 3 4

2. Perform your usual hobbies, recreational or sporting activities 0 1 2 3 4

3. Getting into or out of the bath 0 1 2 3 4

4. Walking between rooms 0 1 2 3 4

5. Putting on your shoes or socks 0 1 2 3 4

6. Squatting 0 1 2 3 4

7. Lifting an object, like a bag of groceries from the floor 0 1 2 3 4

8. Performing light activities around your home 0 1 2 3 4

9. Performing heavy activities around your home 0 1 2 3 4

10. Getting into or out of a car 0 1 2 3 4

11. Walking 2 blocks 0 1 2 3 4

12. Walking a mile 0 1 2 3 4

13. Going up or down 10 stairs (about 1 flight of stairs) 0 1 2 3 4

14. Standing for 1 hour 0 1 2 3 4

15. Sitting for 1 hour 0 1 2 3 4

16. Running on even ground 0 1 2 3 4

17. Running on uneven ground 0 1 2 3 4

18. Making sharp turns while running fast 0 1 2 3 4

19. Hopping 0 1 2 3 4

20. Rolling over in bed 0 1 2 3 4

Original LEFS Instructions

“We are interested in knowing whether you are having any difficulty at all with the activities listed below because of your lower limb problem for which you are currently seeking attention. Please provide an answer for each activity.”

“Today, do you or would you have any difficulty at all with: (Circle one number on each line)”

Measurement products needed?

All patient individual item scores, so as to obtain a patient’s total score

The burden of measurement

For the healthcare provider/researcher: instrument administration, score summation, score interpretation

For the patient: the actual time (and thoughtfulness) it requires to respond; being asked inappropriate questions

The cost of this burden?

For the healthcare provider/researcher: Perhaps a loss of willingness to use the instrument to collect data, score responses, or interpret a patient’s self-reported condition.

For the patient: Perhaps a loss of willingness to respond in a focused, honest way to an instrument that seems unresponsive or even annoying.

Classical psychometrics

Its beginnings go back to the turn of the 20th century.

Consider Spearman’s work in disattenuating the correlation coefficient: “Demonstration of formulae for true measurement of correlation,” American Journal of Psychology, 1907.

Psychometrics’ First “Golden Years”

The zenith of classical psychometrics arguably occurred during the time period surrounding the publication of Gulliksen’s book, Theory of Mental Tests (1950).

This time period saw the development of many of the best and brightest aspects of classical psychometrics, e.g.: 1) that measurements, not instruments, have psychometric properties; 2) the introduction of alpha reliability; 3) validity as the sine qua non of measurement.

Classical Test Theory (CTT)

CTT is a cornerstone of classical psychometrics.

It is theory-based measurement: Individual scores are theory-defined as composed of a true score component and an error component.

That is, observed score = true score + error

CTT as Theory

Because it is theory, CTT or “true score theory” does not provide:

hypotheses that are testable, or

models that are falsifiable.

CTT and Circular Definitions

In CTT, item difficulty is defined as the proportion of examinees in a group of interest who answer an item correctly.

Thus, an item being “hard” or “easy” depends on the ability of the group of examinees measured.

As a result, it is a challenge to determine score meaning: Examinee and test characteristics are entangled; each can be interpreted only in the context of the other.
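The entanglement described above can be illustrated with a minimal sketch. The response vectors below are hypothetical (not from any study cited here); the function simply computes the CTT difficulty of one item as the proportion answering correctly, separately for two groups.

```python
def ctt_item_difficulty(responses):
    """CTT item difficulty: proportion of examinees answering correctly (1 = correct)."""
    return sum(responses) / len(responses)

# Hypothetical responses to the SAME item from two groups of examinees.
high_ability_group = [1, 1, 1, 0, 1, 1, 1, 1, 0, 1]
low_ability_group = [0, 1, 0, 0, 1, 0, 0, 1, 0, 0]

p_high = ctt_item_difficulty(high_ability_group)  # the item looks "easy" here
p_low = ctt_item_difficulty(low_ability_group)    # the same item looks "hard" here
print(p_high, p_low)
```

The item has not changed between the two calls; only the group has, which is why the CTT difficulty value cannot be read without knowing who was measured.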

[Figure: the same person measured with three tests. Very Easy Test: expected score 8. Very Hard Test: expected score 0. Medium Test: expected score 5.]

Reprinted with permission from: Wright, B.D. & Stone, M. (1979). Best test design. Chicago: MESA Press, p. 5.

Scores depend on the difficulty of test items

The New Measurement

In the past 40-plus years, measurement has undergone a quiet evolution.

Modern Psychometrics

In 1968 Lord and Novick’s book, Statistical Theories of Mental Test Scores, was published.

In it, model-based measurement, a foundation of modern psychometrics, was introduced.

Item Response Theory (IRT)

IRT, a form of model-based measurement, now plays a fundamental role in modern psychometrics.

It addresses CTT’s major shortcomings by providing test-independent and group-independent measurement and by employing models that can be tested and falsified.

If a proposed IRT model does NOT adequately explain the data at hand, it can be determined whether assumptions were met or an inappropriate model was used.

CTT vs. IRT

CTT, from classical psychometrics, is theory-based, sample and test-specific, and focuses on test performance.

IRT, from modern psychometrics, is model-based, ability or functioning level-specific, and focuses on item performance.

Outline The new measurement



IRT approach

For any approach

Getting the measure right helps in getting the measurement right

For the IRT approach

There have come to be accepted and recommended steps to take for getting a measure right

Adapteval-HIV Project (Yang, Kallen)

Goal: To develop and evaluate a set of HIV-specific item banks to support

IRT-based CAT assessment

Implementation on multiple technology platforms (web/phone/PDA)

Item Response Theory

[Figure: items (Q) and a person placed on the same continuum. Person latent trait is anchored by Poor and Good; item location is anchored by Easy and Hard.]
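Placing persons and items on one continuum can be sketched with the one-parameter logistic (Rasch) model, in which the probability of endorsing an item depends only on the distance between person trait level and item location. The choice of the Rasch model and the item locations below are assumptions for illustration; the slides do not name a specific IRT model at this point.

```python
import math

def rasch_probability(theta, b):
    """P(endorse) for a person at trait level theta on an item located at b (logits)."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

# Hypothetical item locations on the easy-to-hard continuum.
item_locations = {"easy": -1.5, "medium": 0.0, "hard": 1.5}
person_theta = 0.0  # hypothetical person trait level

probs = {name: rasch_probability(person_theta, b) for name, b in item_locations.items()}
for name, p in probs.items():
    print(f"{name}: {p:.2f}")
```

A person located exactly at an item has a 50% chance of endorsing it; easier items are more likely to be endorsed and harder items less likely.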

IRT assumptions

Unidimensionality: Measure one “thing” only

Monotonicity: The “better” the trait status on a single scale item, the “better” the trait status on the overall scale

Local independence: Items are independent of each other statistically, after controlling for shared dimensionality

Monotonicity

[Figure: Mean Scale Score (axis 0 to 25) by Response Category (1 to 5), with series P1, P2, P3, P4.]

The “better” the trait status on a single scale item, the “better” the trait status on the overall scale

Local Independence

Items are independent of each other statistically, after controlling for shared dimensionality

Components of a scale item’s variance

Required: shared variance, in common with the scale’s unidimensional factor

Expected: some residual or noise/error variance

Problem: If the residual variance of one item is correlated with that of another item, at some point the variance is no longer just noise

Advantages of IRT approach to measurement

Focus is on the item, not the scale: each item possesses trait estimation capacities

Provides item- and group-independent measurement: not tied to sample or particular items used

Makes computer adaptive testing a reality: accumulation of detailed knowledge about individual items and their functioning; customized item presentation reduces the number of patient responses needed to achieve measurements of similar quality

Adapteval-HIV: 14 item “sets”

Pain; Fatigue; Sleep; Emotion Functioning; Cognitive Functioning; Self Care/Daily Living; Physical Act./Leisure; Life Satisfaction; Body Image; Physical Symptoms; Sat w/ Medical Care; Work/Employment; Neg. Social Issues; Pos. Social Exp.

Measure development process

1. Identify initial set of items, adapted from 9 existing instruments (248 items)
2. Panel (20 docs/nurses/patients/psychosocial) evaluation (226 items)
3. Panel (+ input from 3 psychosocial researchers) evaluation (192 items)
4. Pilot test: 50 patients; analysis of floor/ceiling, missing data, low SD (146 items)
5. Primary data collection: 400 patients; primary psychometric analysis (107 items)
6. IRT parameter calibration to implement CAT

Psychometric analyses: item exclusion/inclusion

Primary Criteria
Missingness > 20% (2 items excluded)
CFA factor loadings < .50 (16 excluded)
Local dependence > .20 (5 excluded)
Lack of monotonicity (9 excluded)
Potential item bias (3 excluded)

Secondary Criteria
Multi-dimensionality (3 excluded)
Failed IRT parameter convergence (2 excluded)
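The primary criteria above can be sketched as a simple screening function. This is a hedged illustration, not the project's actual procedure: the dictionary keys and the two example items are hypothetical, and reading "local dependence > .20" as a residual correlation threshold is an assumption.

```python
def primary_exclusion_reasons(item):
    """Return the primary criteria (from the slide) that a candidate item fails."""
    reasons = []
    if item["missing_rate"] > 0.20:
        reasons.append("missingness > 20%")
    if item["cfa_loading"] < 0.50:
        reasons.append("CFA factor loading < .50")
    if item["max_residual_corr"] > 0.20:  # assumed meaning of "local dependence > .20"
        reasons.append("local dependence > .20")
    if not item["monotonic"]:
        reasons.append("lack of monotonicity")
    if item["bias_flag"]:
        reasons.append("potential item bias")
    return reasons

# Hypothetical item statistics for illustration only.
kept_item = {"missing_rate": 0.03, "cfa_loading": 0.72, "max_residual_corr": 0.08,
             "monotonic": True, "bias_flag": False}
dropped_item = {"missing_rate": 0.25, "cfa_loading": 0.41, "max_residual_corr": 0.08,
                "monotonic": True, "bias_flag": False}

print(primary_exclusion_reasons(kept_item))
print(primary_exclusion_reasons(dropped_item))
```

An item with an empty list of reasons would move on to the secondary criteria.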

Project History

Phase I: 50 patients at Northwestern; web/phone; preliminary psychometric analyses (cognitive interviews, etc.)

Phase II: 225 each at Northwestern and UIC; web/phone/PDA; complete psychometric analyses

Ethnicity/race and gender characteristics of primary data collection (NU + UIC)

Columns (Sex/Gender): Females, Males, Unknown or Not Reported, Total

Ethnic Category

Hispanic or Latino 12 34 0 46

Not Hispanic or Latino 80 266 0 346

Unknown (individuals not reporting ethnicity) 1 7 0 8

Ethnic Category: Total of All Subjects 93 307 0 400

Racial Categories

American Indian/Alaska Native 1 3 0 4

Asian 0 5 0 5

Native Hawaiian or Other Pacific Islander 1 3 0 4

Black or African American 67 133 0 200

White 14 134 0 148

More Than One Race 2 13 0 15

Unknown or Not Reported 8 16 0 24

Racial Categories: Total of All Subjects 93 307 0 400

Adapteval-HIV question distribution

Domain Pool Domain Pool

Pain 4 Life Sat. 8

Fatigue 5 Body Img. 4

Sleep 5 Phy Symp 9

Emotional 23 Sat. Med. 5

Cognitive 9 Work 6

Self Care 6 Neg. Soc. 8

Phy. Act. 7 Pos. Soc. 8

Overall Total: Pool (107)

Ceiling and Floor Effects

Domain N Ceiling Floor Domain N Ceiling Floor

Pain 389 102 (26%) 0 (0%) Life Sat. 327 34 (10%) 8 (2%)

Fatigue 393 49 (12%) 4 (1%) Body Img. 390 143 (37%) 3 (1%)

Sleep 334 61 (18%) 5 (1%) Phy Sym 319 37 (12%) 0 (0%)

Emotion 382 5 (1%) 0 (0%) Sat. Med. 362 151 (42%) 8 (2%)

Cognitive 392 88 (22%) 0 (0%) Work 336 56 (17%) 11 (3%)

Self Care 362 87 (24%) 17 (5%) Neg. Soc. 348 19 (5%) 0 (0%)

Phy. Act. 311 64 (21%) 0 (0%) Pos. Soc. 333 35 (11%) 4 (1%)

Aiming at < 20%

*Entries in red denote the values falling within the expected ranges.
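The ceiling and floor percentages in the table are the share of respondents at the maximum or minimum possible domain score. A minimal sketch, using hypothetical scores and an assumed 0 to 10 score range, plus a check of one table entry (Pain: 102 of 389 at ceiling):

```python
def ceiling_floor_rates(scores, min_score, max_score):
    """Proportion of respondents at the maximum (ceiling) and minimum (floor) score."""
    n = len(scores)
    ceiling = sum(1 for s in scores if s == max_score) / n
    floor = sum(1 for s in scores if s == min_score) / n
    return ceiling, floor

# Hypothetical domain scores on an assumed 0-10 range.
scores = [0, 5, 10, 10, 10, 7, 3, 10]
ceiling, floor = ceiling_floor_rates(scores, min_score=0, max_score=10)
exceeds_target = ceiling >= 0.20 or floor >= 0.20  # slide target: aiming at < 20%

# Reproducing one table entry: Pain, 102 of 389 respondents at ceiling.
pain_ceiling_pct = round(100 * 102 / 389)
print(ceiling, floor, exceeds_target, pain_ceiling_pct)
```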

CFA Goodness of Fit (Standards for Unidimensionality)

Root Mean Square Error of Approximation (RMSEA) < 0.10
Comparative Fit Index (CFI) > 0.90
Standardized Root Mean Square Residual (SRMR) < 0.08

Domain RMSEA CFI SRMR Domain RMSEA CFI SRMR

Pain 0.051 1.00 0.021 Life Sat. 0.095 0.99 0.040

Fatigue 0.060 1.00 0.020 Body Img 0.020 1.00 0.017

Sleep 0.088 0.99 0.032 Phy Symp 0.11 0.95 0.090

Emotion 0.12 0.96 0.092 Sat. Med. 0.15 0.98 0.060

Cognitive 0.099 0.98 0.048 Work 0.16 0.96 0.086

Self Care 0.093 0.99 0.036 Neg. Soc. 0.13 0.94 0.11

Phy. Act. 0.0 1.00 0.024 Pos. Soc. 0.10 0.98 0.054
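The three standards listed above can be applied mechanically to each row of the fit table. The sketch below uses the thresholds from the slide and two rows of the table (Pain and Emotion); the function name is hypothetical.

```python
def meets_unidimensionality_standards(rmsea, cfi, srmr):
    """Compare CFA fit statistics against the slide's standards."""
    return {
        "RMSEA": rmsea < 0.10,
        "CFI": cfi > 0.90,
        "SRMR": srmr < 0.08,
    }

pain = meets_unidimensionality_standards(rmsea=0.051, cfi=1.00, srmr=0.021)
emotion = meets_unidimensionality_standards(rmsea=0.12, cfi=0.96, srmr=0.092)

print("Pain:", pain)
print("Emotion:", emotion)
```

A domain can satisfy one standard while missing another, as the Emotion row shows, so the indices are best read together rather than singly.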


Scale Inter-Correlations

N=400 pain fatig sleep emot cog selfca phyact satlif body physym satmed work socneg

fatig 0.710

sleep 0.596 0.610

emot 0.461 0.535 0.550

cog 0.491 0.557 0.541 0.662

selfca 0.315 0.371 0.265 0.280 0.287

phyact 0.434 0.486 0.348 0.325 0.356 0.653

satlif 0.296 0.365 0.374 0.609 0.470 0.492 0.526

body 0.278 0.360 0.330 0.496 0.429 0.175 0.196 0.336

physym 0.563 0.610 0.511 0.582 0.613 0.324 0.445 0.417 0.502

satmed 0.123 0.126 0.113 0.150 0.165 0.290 0.236 0.367 0.078 0.101

work 0.489 0.559 0.424 0.453 0.444 0.307 0.420 0.368 0.442 0.557 0.061

socneg 0.438 0.563 0.525 0.695 0.591 0.282 0.307 0.488 0.638 0.614 0.069 0.611

socpos 0.213 0.226 0.261 0.415 0.361 0.367 0.380 0.585 0.242 0.231 0.448 0.259 0.389

Inter-correlations: ranged from 0.061 to 0.710

*Entries in red denote the values falling within the expected ranges (i.e., <0.90).

Internal Consistency

Pain .809 Life Sat. .915

Fatigue .907 Body Img. .777

Sleep .898 Phy. Symp. .818

Emotional .950 Sat. Med. .859

Cognitive .930 Work .866

Daily Living .899 Neg. Soc. .798

Phy. Act. .888 Pos. Soc. .885

Standards: for group comparison, Cronbach's alpha > 0.7; for individual comparison, Cronbach's alpha > 0.9
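The internal consistency values above are Cronbach's alpha coefficients. A minimal sketch of the standard formula, alpha = k/(k-1) * (1 - sum of item variances / variance of total scores), follows; the response matrix is hypothetical and only illustrates the calculation.

```python
from statistics import variance

def cronbach_alpha(rows):
    """rows: one list of k item scores per respondent."""
    k = len(rows[0])
    item_variances = [variance([row[j] for row in rows]) for j in range(k)]
    total_variance = variance([sum(row) for row in rows])
    return (k / (k - 1)) * (1 - sum(item_variances) / total_variance)

# Hypothetical responses: 5 respondents, 3 items.
data = [[2, 3, 3], [4, 4, 5], [1, 2, 1], [3, 3, 4], [5, 4, 5]]
alpha = cronbach_alpha(data)
print(round(alpha, 3))
print("meets group standard (> 0.7):", alpha > 0.7)
print("meets individual standard (> 0.9):", alpha > 0.9)
```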


Computerized Adaptive Testing

1. Begin with initial trait estimate
2. Select & present optimal scale item
3. Record and score response
4. Re-estimate trait
5. Is stopping rule satisfied? No: return to step 2. Yes: go to step 6.
6. End of assessment
7. End of battery? No: go to step 8. Yes: go to step 9.
8. Administer next scale item
9. Stop

Source: Wainer et al. (2000). Computerized Adaptive Testing: A Primer 2nd Ed. LEA. Mahwah N.J.
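Steps 1 through 5 of the flowchart can be sketched as a short loop. This is a simplified illustration under stated assumptions, not the Adapteval-HIV implementation: it assumes dichotomous items under a Rasch model, a standard normal prior, expected a posteriori (EAP) estimation on a fixed grid, and a stopping rule based on the posterior standard deviation. All names and the item bank are hypothetical.

```python
import math

GRID = [g / 10.0 for g in range(-40, 41)]  # trait grid from -4 to 4 logits

def p_endorse(theta, b):
    return 1.0 / (1.0 + math.exp(-(theta - b)))

def estimate_trait(responses):
    """EAP estimate and posterior SD from [(difficulty, response), ...], N(0,1) prior."""
    weights = []
    for theta in GRID:
        w = math.exp(-0.5 * theta * theta)
        for b, x in responses:
            p = p_endorse(theta, b)
            w *= p if x == 1 else 1.0 - p
        weights.append(w)
    total = sum(weights)
    mean = sum(t * w for t, w in zip(GRID, weights)) / total
    var = sum((t - mean) ** 2 * w for t, w in zip(GRID, weights)) / total
    return mean, math.sqrt(var)

def run_cat(item_bank, answer, sem_target=0.5):
    """item_bank: {item_id: difficulty}; answer(item_id) returns 0 or 1."""
    theta, sem = 0.0, float("inf")               # step 1: initial trait estimate
    remaining, responses, asked = dict(item_bank), [], []
    while remaining and sem >= sem_target:       # step 5: stopping rule
        # Step 2: under the Rasch model, the most informative item is the one
        # whose difficulty is closest to the current trait estimate.
        item = min(remaining, key=lambda i: abs(remaining[i] - theta))
        b = remaining.pop(item)
        responses.append((b, answer(item)))      # step 3: record and score response
        asked.append(item)
        theta, sem = estimate_trait(responses)   # step 4: re-estimate trait
    return theta, sem, asked

# Hypothetical 7-item bank and a respondent who endorses items easier than 0.7 logits.
bank = {f"q{i}": b for i, b in enumerate([-2.0, -1.0, -0.5, 0.0, 0.5, 1.0, 2.0])}
respondent = lambda item: 1 if bank[item] < 0.7 else 0
theta, sem, asked = run_cat(bank, respondent, sem_target=0.6)
print(asked, round(theta, 2), round(sem, 2))
```

Steps 6 through 9 (moving on to the next scale in a battery) would wrap this function in an outer loop over item banks.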

CAT performance: # of questions

Domain Pool Avg/SD Mx/Mn Domain Pool Avg/SD Mx/Mn

Pain 4 3.1/1.0 4/2 Life Sat. 8 3.1/1.5 7/2

Fatigue 5 2.6/1.1 5/2 Body Img. 4 3.7/0.5 4/3

Sleep 5 3.3/1.1 5/2 Phy Symp 9 7.7/1.9 9/3

Emotional 23 4.6/2.6 11/2 Sat. Med. 5 4.3/1.1 5/2

Cognitive 9 4.9/2.5 9/2 Work 6 4.9/1.0 6/3

Self Care 6 3.9/1.4 6/2 Neg. Soc. 8 6.9/1.0 8/5

Phy. Act. 7 3.7/2.2 7/2 Pos. Soc. 8 4.4/1.8 8/3

Overall Total: Pool (107), Average (61.1)

Static (Full) vs. CAT/IRT

Correlation of assessment scores between full instrument and CAT/IRT implementation

Body Image (4 items / avg. 3.7): .99
Cognitive Functioning (9 / 4.9): .91
Emotional Functioning (23 / 4.6): .87

Relief of measurement burden

For the healthcare provider/researcher, computerizing the test can relieve some of the burden: use a scoring algorithm; deliver score-specific interpretation.

This could be accomplished with a computer-administered test.

Relief for respondents?

Yes, if the measure is presented as a computer adaptive test (CAT), i.e., the test adapts itself or customizes itself according to the responses presented to it by an individual patient.

CATs and Item Structure

Item presentation order

CTT: standardized. All patients start at Item #1 and complete all items in order.

IRT: customized. Patients are presented new items based on their responses to previous items.

IRT “logic”

There is an underlying hierarchy of activities.

Activities can be ordered (“calibrated”) from easiest to hardest.

CALIBRATIONS ITEMS

62.8 (hardest item) Making sharp turns while running fast (18.)

62.8 Running on uneven ground (17.)

60.7 Running on even ground (16.)

59.7 Hopping (19.)

53.6 Walking a mile (12.)

51.7 Performing your usual hobbies, recreational or sporting activities (2.)

51.6 Squatting (6.)

51.4 Standing for 1 hour (14.)

50.8 Performing heavy activities around your home (9.)

48.5 Going up or down 10 stairs (about 1 flight of stairs) (13.)

46.5 Performing any of your usual work, housework, or school activities (1.)

45.3 Walking 2 blocks (11.)

41.0 Lifting an object, like a bag of groceries from the floor (7.)

38.6 Getting into or out of a car (10.)

37.9 Getting into or out of the bath (3.)

37.6 Performing light activities around your home (8.)

36.1 Putting on your shoes or socks (5.)

32.1 (easiest item) Walking between rooms (4.)

[Figure: the 18 calibrated LEFS items placed along the Functioning scale. CALIBRATION: 0-100 SCALE]

[Figure: the 18 calibrated LEFS items placed along the Functioning scale. VIEW: 30-65 RANGE]

CATs

CATs have starting rules. LEFS CAT: Begin with an item of moderate difficulty.

And CATs have stopping rules. LEFS CAT: Stop when the SEM < 4 (score range: 0-100), or when the average score change for the last 3 score estimates < 1, or when all LEFS items are completed.
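The three LEFS CAT stopping conditions can be written as one predicate. This is a sketch under an assumption: "average score change for last 3 score estimates" is read here as the mean absolute change across the three most recent updates of the score estimate, which requires at least four estimates. The function and argument names are hypothetical.

```python
def lefs_cat_should_stop(sem, score_history, items_answered, pool_size=18):
    """Return True if any of the three stopping conditions from the slide holds."""
    if sem < 4:  # precision target on the 0-100 score scale
        return True
    if len(score_history) >= 4:
        # Assumed reading: mean absolute change over the last three updates.
        changes = [abs(score_history[i] - score_history[i - 1]) for i in range(-3, 0)]
        if sum(changes) / 3 < 1:
            return True
    return items_answered >= pool_size  # item pool exhausted

print(lefs_cat_should_stop(3.9, [48.0], 1))
print(lefs_cat_should_stop(5.0, [50.0, 50.5, 50.2, 50.4], 4))
print(lefs_cat_should_stop(5.0, [40.0, 50.0, 45.0, 52.0], 4))
```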

CAT simulation: Focus on item selection

Item Pool: 18 items from the Lower Extremity Functional Scale (LEFS)

ITEM PRESENTATION ORDER AND RESPONSE FROM LEFS EXERCISE

[Figure sequence: items appear one at a time on the Functioning scale (30-65 range view), at their calibrated positions, in the order the CAT presents them.]

1. Performing any of your usual work, housework, or school activities (A little bit of difficulty)
2. Walking a mile (Quite a bit)
3. Running on uneven ground (Extreme/unable)
4. Making sharp turns while running fast (Extreme/unable)
5. Running on even ground (Extreme/unable)
6. Hopping (Quite a bit)
7. Performing your usual hobbies, recreational or sporting activities (Moderate)
8. Performing heavy activities around your home (Moderate)

The potential to relieve respondent burden

Only 8 individual LEFS items required responses.

As many as 10 potentially inappropriate items did NOT require responses.

This patient’s LEFS score is 46.02, the activity level at which the individual can function with little or no difficulty.

CAT simulation: Focus on precision

Item Pool: 23 items of the Modified Roland-Morris Low Back Pain Disability Questionnaire

[Figure sequence: Back Disability trait continuum, Low to High, axis from -3.0 to 2.0, redrawn after each question.]

Q1: I find it difficult to get out of a chair because of my back. A: Yes. SEM = 1.4

Q2: I stay at home most of the time. A: No. SEM = 0.67

Q3: I can only walk short distances. A: Yes. SEM = 0.59

Q4: I use a handrail to get up stairs. A: Yes. SEM = 0.55

Q5: I’m not doing jobs that I usually do around the house. A: No. SEM = 0.47

Estimate gets more precise with every question (SEM after each question, Q1 through Q5).
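Why the SEM shrinks with each question can be sketched with item information. Assuming a Rasch model for dichotomous (Yes/No) items, each item contributes information p(1 - p) at the current trait level, and the SEM is the reciprocal square root of the accumulated information. The difficulties below are hypothetical, and this sketch is not expected to reproduce the SEM values shown above, which come from the original simulation.

```python
import math

def rasch_information(theta, b):
    """Fisher information of one dichotomous Rasch item at trait level theta."""
    p = 1.0 / (1.0 + math.exp(-(theta - b)))
    return p * (1.0 - p)

def sem_after_each_item(theta, difficulties):
    """SEM = 1 / sqrt(total information), recomputed after each administered item."""
    total_information = 0.0
    sems = []
    for b in difficulties:
        total_information += rasch_information(theta, b)
        sems.append(1.0 / math.sqrt(total_information))
    return sems

# Hypothetical item difficulties (logits) for a respondent at theta = 0.
sems = sem_after_each_item(0.0, [0.0, 0.5, -0.5, 1.0, -1.0])
print([round(s, 2) for s in sems])
```

Because information only accumulates, the SEM can only go down; items located near the respondent's trait level bring it down fastest.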

What response efficiencies does a 9-item “Physical Functioning” CAT offer?

Objective: To evaluate the response efficiencies provided by a 9-item CAT: “Adapteval-HIV: Physical Functioning”

Methods

400 HIV patients from 2 clinics completed all 9 items of the “AD-HIV: Physical Functioning” scale.

CAT simulations were then conducted, based on the original, full-scale responses.

CATs used a set of commonly employed starting and stopping rules: begin with an item of moderate difficulty; end when a targeted score estimate precision is achieved (standard error-based).

Results

3099 out of 3600 potential responses were needed to obtain CAT-based scores for 400 patients.

This eliminated the need for 501 responses (14%).

245 patients (61%) needed 9 responses to obtain physical functioning scores.

155 patients (39%) required <9 responses (894 responses instead of 1395).

Results

13 distinct CAT-based scales provided scores.

These CATs ranged in length from 3-8 items.

Of the 155 patients with reduced-item CATs: 35 (22.6%) had 6-item CATs, 30 (19.4%) had 8-item CATs, 29 (18.7%) had 5-item CATs, 26 (16.8%) had 7-item CATs, 23 (14.8%) had 3-item CATs, 12 (7.7%) had 4-item CATs.
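The reported counts are internally consistent, which the following arithmetic sketch confirms using only the figures given in the results (variable names are illustrative):

```python
patients, items_per_scale = 400, 9

# CAT length -> number of patients, for the reduced-item CATs reported above.
cat_lengths = {6: 35, 8: 30, 5: 29, 7: 26, 3: 23, 4: 12}

reduced_patients = sum(cat_lengths.values())
reduced_responses = sum(length * n for length, n in cat_lengths.items())
full_patients = patients - reduced_patients

total_responses = full_patients * items_per_scale + reduced_responses
potential_responses = patients * items_per_scale
saved = potential_responses - total_responses

overall_saving_pct = round(100 * saved / potential_responses)
reduced_group_saving_pct = round(
    100 * (1 - reduced_responses / (reduced_patients * items_per_scale))
)
print(reduced_patients, reduced_responses, total_responses, saved)
print(overall_saving_pct, reduced_group_saving_pct)
```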

Conclusions

Initial studies of simulated CAT responses to “AD-HIV: Physical Functioning” suggest that this 9-item CAT can provide response efficiencies: 155 patients (39%) required <9 responses. These patients averaged a 36% response reduction.

Implications

“CAT AD-HIV: Physical Functioning” offers the potential to relieve patient response burden.

Scales of <10 items may provide real response efficiencies when delivered as CATs.

Issues in constructing and employing measures

Having a fit: impact of number of items and distribution of data on traditional criteria for assessing IRT’s unidimensionality assumption (Cook, Kallen, Amtmann)

Number of items an influencing factor on studied CFA model fit statistics CFI, NNFI, RMSEA, SRMR, WRMR

What about the context of CFA modeling: full scale analysis, short form analysis, item bank construction, static vs. dynamic (CAT) assessment

Technology aversion?

Education Differences Up to HS (172) vs. Some College or More (218)

p-val eta2 p-val eta2

Pain .009 .018 Life Sat. .015 .015

Fatigue .014 .016 Body Img. .019 .014

Sleep .176 .005 Phy. Symp. .000 .035

Emotional .001 .028 Sat. Med. .037 .011

Cognitive .000 .037 Work .006 .020

Self Care .004 .021 Neg. Soc. .079 .008

Phy. Act. .000 .065 Pos. Soc. .001 .029

*Higher education had better scores than lower education in all domains. Results based on ANOVA. 95% significance level (p-val)

Income Differences <20k (140) vs. 20k+ (199)

p-val eta2 p-val eta2

Pain .000 .078 Life Sat. .000 .046

Fatigue .000 .066 Body Img. .001 .033

Sleep .000 .049 Phy. Symp. .000 .145

Emotional .000 .060 Sat. Med. .010 .020

Cognitive .000 .088 Work .000 .097

Self Care .000 .070 Neg. Soc. .000 .057

Phy. Act. .000 .184 Pos. Soc. .000 .058

*Higher income had better scores than lower income in all domains. Results based on ANOVA. 95% significance level (p-val) and 0.10 effect size (eta2) assumed

Platform Selection Process

Semi-random assignments, allowing self-selection to ensure valid input from subjects.

Platform selections affected by subject sociodemographic and medical status.

p-value p-value

Sex .643 Occupation .076

Ethnicity .062 Income .000

Race .153 CD4 .676

Age .921 Viral Load .440

Education .000 Months Since Diag .023

*Results based on Pearson Chi-Square. 95% significance level assumed.

Response invalidity

Outpatient population study of trust

Consecutive outpatients from one public, one private, and one VA clinic were recruited from waiting rooms prior to their appointments.

Completed questionnaires were received from 104 African Americans and 131 White Americans.

The overall sample composition was: 46% female, 57% incomes <= $20,000 per year, 36% completed high school or less, and 47% had experienced moderate to severe pain during the previous 4 weeks.

The Physician Trust Scale: In “easiest-to-hardest to endorse” order

1. Your doctor will do whatever it takes to get you all the care you need. (R)

6. Your doctor is totally honest in telling you about all of the different treatment options available for your condition. (R)

4. Your doctor is extremely thorough and careful. (R)

5. You completely trust your doctor’s decisions about which medical treatments are best for you. (R)

10. All in all, you have complete trust in your doctor. (R)

7. Your doctor only thinks about what is best for you. (R)

9. You have no worries about putting your life in your doctor’s hands. (R)

8. Sometimes your doctor does not pay full attention to what you are trying to tell him/her.

3. Your doctor’s medical skills are not as good as they should be.

2. Sometimes your doctor cares more about what is convenient for him/her than about your medical needs.

Response options: 1-Strongly agree, 2-Agree, 3-Neutral, 4-Disagree, 5-Strongly disagree

Scoring: Higher score = greater trust

Validity: Property of An Inference

A measure is not valid in an absolute sense.

Validity interest is in: the appropriateness of inferences about individuals, which were made based on an interpretation of scores, which were derived from data collected in a specific context.

Studying Measurement Invalidity: Person Fit

A unique advantage of IRT measurement models: Individualized person fit information is calculated in conjunction with the estimation of person measures.

This individual person level fit information, i.e., person fit statistics, can then be used for inquiries about measurement validity.
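Person fit statistics of the infit/outfit mean square type can be sketched for the simplest case. The example assumes dichotomous items under a Rasch model and a known person measure; the study discussed here used a 5-category rating scale, so this is an illustration of the idea rather than the analysis actually performed. Outfit is the mean of squared standardized residuals; infit is the sum of squared residuals divided by the sum of response variances.

```python
import math

def person_fit(theta, responses):
    """Return (infit, outfit) mean squares for [(item difficulty, response), ...]."""
    squared_residuals, variances = [], []
    for b, x in responses:
        p = 1.0 / (1.0 + math.exp(-(theta - b)))  # expected response
        squared_residuals.append((x - p) ** 2)
        variances.append(p * (1.0 - p))
    outfit = sum(r / v for r, v in zip(squared_residuals, variances)) / len(responses)
    infit = sum(squared_residuals) / sum(variances)
    return infit, outfit

difficulties = [-2.0, -1.0, 0.0, 1.0, 2.0]  # hypothetical items, easiest to hardest

# Expected pattern: endorses the easier items, not the harder ones.
consistent = list(zip(difficulties, [1, 1, 1, 0, 0]))
# Unexpected pattern: fails the easiest items but endorses the hardest.
aberrant = list(zip(difficulties, [0, 0, 1, 1, 1]))

infit_ok, outfit_ok = person_fit(0.0, consistent)
infit_bad, outfit_bad = person_fit(0.0, aberrant)
print(round(infit_ok, 2), round(outfit_ok, 2))
print(round(infit_bad, 2), round(outfit_bad, 2))
```

Values near 1 indicate responses about as predictable as the model expects; values well above 1 flag response patterns that contradict the item hierarchy.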

SF-36 Physical Functioning Items: In “easiest-to-hardest” order

Measure Item # Item Content

2.36 SF3 (hardest) Vigorous activities

1.15 SF6 Climbing several flights of stairs

0.66 SF9 Walking more than a mile

0.48 SF8 Bending, kneeling, or stooping

0.10 SF4 Moderate activities

-0.18 SF10 Walking several hundred yards

-0.60 SF5 Lifting or carrying groceries

-0.60 SF7 Climbing one flight of stairs

-0.89 SF11 Walking one hundred yards

-2.48 SF12 (easiest) Bathing or dressing yourself

SF-36 Physical Functioning Items: An Individual’s Response Validity

Response Item # Item Content

No, not limited at all. SF3 (hardest) Vigorous activities

Yes, limited a little. SF6 Climbing several flights of stairs

No, not limited at all. SF9 Walking more than a mile

Yes, limited a little. SF8 Bending, kneeling, or stooping

Yes, limited a little. SF4 Moderate activities

Yes, limited a lot. SF10 Walking several hundred yards

Yes, limited a lot. SF5 Lifting or carrying groceries

Yes, limited a lot. SF7 Climbing one flight of stairs

Yes, limited a little. SF11 Walking one hundred yards

No, not limited at all. SF12 (easiest) Bathing or dressing yourself

Using Person Fit Information

1) Identify individuals with high levels of invalidity in their responses.

2) What percent of the sample is represented by these high invalidity individuals?

3) Can these individuals be characterized – demographically or otherwise? If no, invalid responses may be randomly distributed. If yes, measurements may be less appropriate for certain groups.

Characterizing Invalidity

In the study employing the Physician Trust Scale: 19.4% were found to have high invalidity responses (person infit or outfit mean square > 2).

This increased to 29.7% for those with Up-to-HS education, and to 44.1% for those with Up-to-HS education and little or no pain.

Results: Identify Invalid Responses

A sizeable percent of the sample (19.4%) was identified as having responses with questionable validity.

Characteristics common to respondents with low validity responses could be identified: Primarily - level of education; to some extent - pain status.

A question exists as to the stability of study findings when analyses are conducted with and without low validity data.

The validity of instrument responses across all patient subgroups of interest may not be equivalent. What, then, is effect, and what is inappropriate measurement?

Potential for measurement bias

If there is uncertainty in measurement…

In employing a measurement instrument, user expectation is that individuals and population subgroups are NOT likely to be advantaged or disadvantaged by the measurements they receive.

If an “advantaging” or “disadvantaging” occurs, there will be systematic error in the way in which an instrument provides measures for members of a specific group or groups.

The bottom line: Measurement disparities produce bias.

…there is uncertainty in results

The use of biased measurements in research introduces a significant question: Will conclusions hold?

In a measure of health status, what if something other than state of health has influenced the way in which certain individuals respond?

What, then, if observed group differences really reflect something other than what an outcomes instrument was intended to measure?

A question will remain about a study’s findings: Are observations about disparities in health care due to true group differences or were they influenced by the use of culturally-biased measurements?

Studying Measurement Disparity: DIF

DIF is “differential item functioning.”

A measurement item shows DIF if individuals of the same trait level (e.g., same level of trust) but originating from different groups do not have the same probability of endorsing an item.
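The definition above can be made concrete with a small sketch. Assuming a dichotomous Rasch item whose difficulty is calibrated separately in two groups, DIF appears as a gap in endorsement probability between two people who share the same trait level. The difficulty values are hypothetical, and this is an illustration of the concept, not the DIF test used in the trust study.

```python
import math

def p_endorse(theta, b):
    return 1.0 / (1.0 + math.exp(-(theta - b)))

def dif_gap(theta, b_group_1, b_group_2):
    """Difference in endorsement probability at the SAME trait level theta."""
    return p_endorse(theta, b_group_1) - p_endorse(theta, b_group_2)

# Same trait level (theta = 0) in both groups; hypothetical group-specific difficulties.
no_dif = dif_gap(0.0, 0.3, 0.3)     # item behaves identically in both groups
with_dif = dif_gap(0.0, -0.4, 0.6)  # item is harder to endorse in group 2
print(round(no_dif, 3), round(with_dif, 3))
```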

Trust Item Difficulties: Up to HS vs. > HS Education

[Figure: item difficulty (in logits, axis from -1.5 to 1.0) by education group (Up to HS, > HS) for items I1, I6, I4, I5, I10, I7, I9, I8, I3, I2. Labels shown: -1.09, -0.91, -0.73, -0.47, -0.30, 0.34, 0.55, 0.62, 0.65, 0.67.]

Trust Item Difficulties: Up to HS vs. > HS Education

[Figure: item difficulty (in logits, axis from -1.5 to 1.0) by education group (Up to HS, > HS) for items I1, I6, I4, I5, I10, I7, I9, I8, I3, I2. Labels shown: -1.09, -0.91, 0.65, 0.67. Items called out:]

1. Your doctor will do whatever it takes to get you all the care you need.

6. Your doctor is totally honest in telling you about all of the different treatment options available for your condition.

3. Your doctor’s medical skills are not as good as they should be.

2. Sometimes your doctor cares more about what is convenient for him/her than about your medical needs.

Address potential measurement bias

Four items show statistically significant DIF, with moderate to large DIF effect sizes.

The trust SCALE, as a whole, appears to be understood and responded to differently by the lower education versus higher education group.

Considering that 36% of the study’s sample completed high school or less, this trust instrument may not be able to provide interpretable scores for this study.

Outline The new measurement



When patient-reported outcomes…

…are reported to patients

A pivotal study

Measuring quality of life in routine oncology practice improves communication and patient well-being: a randomized controlled trial

G. Velikova et al, Journal of Clinical Oncology, February 15, 2004

Objective

To examine the effects of regular repeated collection and feedback of HRQL data to oncologists

Study Design: Groups

Patients randomly assigned to:

Intervention (I): complete HRQL questionnaires; feedback of results to physicians

Attention-Control (A-C): complete HRQL questionnaires; no feedback of results to physicians

Control (C): no HRQL measurement before clinic encounter

Improvement = “>+7 points” in FACT-G well-being from baseline

Columns: FACT-G Overall, Physical, Functional, Emotional, Social/family

I vs C yes yes yes yes no

I vs A-C no no no no no

A-C vs C yes yes yes no no

Fig 4. Proportions of patients showing clinically meaningful improvement, no change, or deterioration in Functional Assessment of Cancer Therapy–General (FACT-G) score after three encounters, by study arm. Intervention versus attention-control and control groups, P = .001; intervention and attention-control versus control, P = .003, using ordinal regression, controlling for baseline FACT-G, performance status, and time on study.

“Leftover” Finding

Completion of questionnaires may have effect on patient well-being, without feedback to physicians

Patient-centric effect Achieved by completing measures

What if measures were systematically reported to patients, across time?

Effect yet to be harnessed by healthcare providers or systems

Two PRO-based clinical care projects

Department of Palliative Care and Rehabilitation Medicine, MD Anderson

Cancer-Related Fatigue Clinic, MD Anderson

Palliative Care (Yang, Kallen, Bruera)

NCI-sponsored SBIR

RFP Topic 246: “Integrating Patient-Reported Outcomes in Hospice and Palliative Care Practices”

PRO Assessment


PRO Trending


PRO Assessment Snapshot


Expected benefits

Workflow efficiency

Data accuracy

Potential for temporal/causal analyses

Facilitate provider-provider and patient-provider communication


Cancer-Related Fatigue Clinic (Kallen, Yang, Escalante)

Expected benefits

Timely presentation of interpreted measurement information to all involved parties

Improve communication, shared decision-making, and outcomes

Summary The new measurement


