Researcher judgment and study design: Challenges of using administrative data

AMERICAN JOURNAL OF INDUSTRIAL MEDICINE 53:37–41 (2010)

Commentary

Researcher Judgment and Study Design: Challenges of Using Administrative Data

Leslie I. Boden, PhD1* and Al Ozonoff, PhD2

Background: Questions have been raised about methods of studies finding substantial undercounting of workplace injuries and illnesses by the Bureau of Labor Statistics (BLS) and workers’ compensation agencies. A more recent study of Minnesota concluded that the BLS survey captures 84–90% of workers’ compensation cases.

Methods: We examined the sensitivity of findings in two studies to alternate sample definitions and study assumptions.

Results: Applying alternate sample construction rules to the earlier study increased estimated BLS reporting rates from 68% to 77%, assuming source independence. Applying alternate assumptions to the more recent Minnesota study reduced its high estimate of BLS reporting from 90% to 53–64%.

Conclusions: Studies linking administrative data from different sources require substantial judgment in constructing research datasets and choosing analytic methods. Moreover, different sample construction rules lead to different results. This suggests that sensitivity analysis should be carried out when alternatives cannot be ruled out. In this case, sensitivity analysis supports the hypothesis of substantial underreporting. Am. J. Ind. Med. 53:37–41, 2010. © 2009 Wiley-Liss, Inc.

KEY WORDS: occupational injuries; workers’ compensation; capture–recapture; study design

INTRODUCTION

Using administrative data in public health research

presents many challenges. Unlike data specifically designed

to answer a study question, administrative data such as those

collected by state workers’ compensation systems can be

difficult to use for research because they are collected with a

non-research purpose. These databases are primarily used to

provide an overview of the number and types of injuries, keep

track of overall payments, and possibly track the performance

of individual employers and insurers. Information

important to the researcher may not be central to these

purposes and may, as a consequence, frequently be missing or

inaccurate. Attempting to link two administrative datasets,

each designed with different goals, presents even greater

challenges. Data inaccuracies, different definitions of data

elements, and disparate criteria for inclusion all require

researchers to make choices about how to assemble and

analyze the data. All of these issues are present in the

data used by Oleinick and Zaidman [2009] in this issue and

in the earlier studies by Rosenman et al. [2006] and by

1Department of Environmental Health, Boston University School of Public Health, Boston, Massachusetts

2Department of Biostatistics, Boston University School of Public Health, Boston, Massachusetts

Contract grant sponsor: National Institute for Occupational Safety and Health; Contract grant number: 5R01 OH 007596; Contract grant sponsor: State of California, Commission on Health and Safety and Workers’ Compensation.

*Correspondence to: Leslie I. Boden, Department of Environmental Health, Boston University School of Public Health, 715 Albany Street, Boston, MA 02118. E-mail: [email protected]

Accepted 9 October 2009. DOI 10.1002/ajim.20787. Published online in Wiley InterScience (www.interscience.wiley.com)

Boden and Ozonoff [2008]. These data shortcomings and

researchers’ responses to them may have important implications

for research results. Oleinick and Zaidman have done us

all a service by focusing on many of the decisions made in

estimating the number of occupational injuries and illnesses

in Rosenman et al. [2006] and Boden and Ozonoff [2008].

(In the 2008 study, Boden was responsible for almost all of

these decisions. Ozonoff provided statistical expertise.)

In their article in this issue and in an earlier article,

Oleinick and Zaidman have demonstrated the importance

of understanding the legal and institutional setting when

undertaking research using administrative data sources. They

gave the Boden and Ozonoff [2008] study a very careful and

thoughtful reading and raised some important issues.

Their alternate assumptions would reduce estimates of

underreporting, which suggests a sensitivity analysis based

on these assumptions. However, some of their criticism rests

on misunderstanding of the methods used to link and analyze

the data. Moreover, their analysis of the completeness

of workers’ compensation and Bureau of Labor Statistics

(BLS) injury reporting is based on aggregate data. We will

demonstrate that their conclusions about the extent of

underreporting, comparing overall counts in two datasets,

are based on assumptions that are unlikely to be valid. Only

comparing individual case data can support valid conclusions

about the number of workplace injuries and illnesses missed

by these two systems.

Accuracy of Linking

Oleinick and Zaidman spend a substantial portion of

their article discussing the methods for linking workers’

compensation and BLS cases in Boden and Ozonoff [2008].

They are concerned that the first, deterministic, step

missed multi-establishment employers and, by implication,

that the overall strategy missed linking many injuries

reported in both systems. We think that neither of these is

the case. First, we note that attributes of individuals and

injuries were most important in linking. As we reported, we

used employer identifier, employer name, employer address,

employer zip code or city, worker’s first initial, worker’s

last name, sex, date of injury, and date of birth or age at injury.

This was not described in full detail in the original

article, but we used the Soundex code for the last name as

well, and if both the last name and the Soundex code

matched, each counted toward the required number of

matches. The same is true for age and date of birth, and

for month of injury and date of injury. For the probabilistic

linking, the number of exact matches needed was relaxed,

producing a large number of potentially matched pairs, many

of which referred to clearly different injuries. Then two

members of the research team independently examined each

potential linked pair to find pairs that seemed to represent the

same injury.
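
The field-agreement count described above can be sketched as follows. This is an illustration only, not the linkage code used in the 2008 study: the field names, record layout, and example records are hypothetical, and the routine shown is the standard American Soundex, which may differ in detail from the implementation actually used. A Soundex agreement on the last name is counted in addition to an exact last-name agreement, and month of injury and age at injury are treated the same way alongside date of injury and date of birth.

```python
# Hypothetical field names; the record layout of the study data is not given here.
EXACT_FIELDS = (
    "employer_id", "employer_name", "employer_address", "employer_zip",
    "first_initial", "last_name", "sex",
    "injury_date", "injury_month", "birth_date", "age_at_injury",
)

_SOUNDEX_CODES = {
    letter: digit
    for letters, digit in (
        ("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
        ("L", "4"), ("MN", "5"), ("R", "6"),
    )
    for letter in letters
}


def soundex(name):
    """Return the standard American Soundex code of a name ('' if no letters)."""
    letters = [c for c in name.upper() if c.isascii() and c.isalpha()]
    if not letters:
        return ""
    code = letters[0]
    previous = _SOUNDEX_CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "HW":  # H and W do not separate repeated codes
            continue
        digit = _SOUNDEX_CODES.get(ch, "")
        if digit and digit != previous:
            code += digit
        previous = digit  # a vowel resets the previous code
    return (code + "000")[:4]


def count_agreements(rec_a, rec_b):
    """Count agreeing fields; a missing value never counts as an agreement."""
    agreements = 0
    for field in EXACT_FIELDS:
        a, b = rec_a.get(field), rec_b.get(field)
        if a is not None and a == b:
            agreements += 1
    code_a = soundex(rec_a.get("last_name") or "")
    code_b = soundex(rec_b.get("last_name") or "")
    if code_a and code_a == code_b:
        agreements += 1  # Soundex agreement counts as its own match
    return agreements


# Invented example records: last name misspelled, injury date off by one day.
wc_claim = {"employer_id": "E100", "first_initial": "J", "last_name": "Robert",
            "sex": "M", "injury_date": "1999-03-14", "injury_month": "1999-03",
            "birth_date": "1960-05-02"}
soii_case = {"employer_id": None, "first_initial": "J", "last_name": "Rupert",
             "sex": "M", "injury_date": "1999-03-15", "injury_month": "1999-03",
             "birth_date": "1960-05-02"}
n_agreements = count_agreements(wc_claim, soii_case)
```

A deterministic pass would accept pairs whose count reaches some required number; lowering that number, as in the probabilistic step, produces more candidate pairs for manual review. The required number is not stated in this commentary, so none is hard-coded in the sketch.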

The function of probabilistic linking is to link pairs well

in the face of both missing and inaccurate data, the situation

that Oleinick and Zaidman describe in some detail. This

method is widely used by the Census Bureau [Winkler,

2006], in mortality epidemiology using the National Death

Index [Adams et al., 2007; Koster et al., 2009], as well as by

other epidemiologic studies [Lee et al., 2006; Meray et al.,

2007]. Still, it is possible that some of the cases we classified

as unlinked were actually linked. This is a particular concern

for multi-establishment employers, an issue that our study

addresses.

Appropriateness of Data Restrictions

Oleinick and Zaidman’s article and ours both compare

workplace injury reporting using datasets for which both

population and variable definitions differ. To derive unbiased

measures from these comparisons, it is important to limit the

comparisons to subsets that refer to the same population or to

account for the differences in the analysis. For example, the

BLS Survey of Occupational Injuries and Illnesses (SOII) is a

stratified random sample of establishments, but workers’

compensation systems cover almost all establishments in a

state. This means that a valid analysis of reporting should not

treat injuries outside the SOII sampling frame as if they were

missed by the SOII. During the course of our study, we

discovered that it was sometimes difficult to decide which

injuries were covered by both the SOII and workers’

compensation databases. In most situations, we made these

decisions so that the likely bias in our estimates was toward

complete reporting.

For example, we focused on estimates that assumed

source independence—that an injury reported to the workers’

compensation system was no more likely to be reported to the

BLS than an injury not reported to the workers’ compensation

system. It is likely, however, that reporting to workers’

compensation substantially increases the probability that an

injury is reported to the BLS. In this situation,

capture–recapture estimates based on source independence will make

reporting appear more complete than it actually is. We are

unaware of any estimates of the degree of this positive source

dependence, and we could not estimate this with the data we

used. Given this limitation, we provided alternate estimates

assuming different degrees of positive source dependence.

Also, when we were uncertain about whether injuries were

linked, we treated them as linked. In addition, when limiting

the workers’ compensation and SOII data to comparable

subsets by removing BLS cases not meeting the workers’

compensation waiting period, we discarded unlinked cases

but kept those that were linked.
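
Under source independence, the standard two-source capture–recapture estimate of the cases missed by both systems is the product of the two single-source counts divided by the linked count. The sketch below applies that formula to the counts in the first panel of Table I. The function and variable names are illustrative assumptions, and small differences from the table reflect rounding.

```python
def independent_capture_recapture(linked, bls_only, wc_only):
    """Two-source estimate assuming the two sources report independently."""
    missed_by_both = bls_only * wc_only / linked
    total = linked + bls_only + wc_only + missed_by_both
    return {
        "missed_by_both": missed_by_both,
        "total": total,
        "bls_rate": (linked + bls_only) / total,  # share of all cases in the SOII
        "wc_rate": (linked + wc_only) / total,    # share in workers' compensation
    }


# Counts from the first panel of Table I.
original = independent_capture_recapture(
    linked=75_916, bls_only=41_238, wc_only=36_335
)
print(f"BLS {original['bls_rate']:.0%}, workers' compensation {original['wc_rate']:.0%}")
```

If reporting to one system raises the chance of reporting to the other, the true number missed by both is larger than this formula gives, so the rates computed here overstate completeness.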

Oleinick and Zaidman note that we did not remove

certain cases from our analysis, and they argue that this led us

to underestimate the completeness of reporting. We were

aware of the issues they raise, both because we had read their

valuable 2004 article [Oleinick and Zaidman, 2004] and

because we had used Mr. Zaidman as our source of

information about the Minnesota workers’ compensation

system, including details about Minnesota’s waiting period.

However, the Oleinick and Zaidman [2004] and the Boden

and Ozonoff [2008] studies used different inclusion rules. For

example, we decided to include cases involving stipulation or

settlement. These cases involve disputes, often over the

compensability of the claim, the duration of temporary

disability, or the degree of permanent disability. The disputes

and the nature of their eventual resolution often mean that

data are missing from workers’ compensation reports. For

example, the first day of lost time (FDLT) is frequently

missing, even for cases involving payment of temporary total

disability (TTD). Often, the first report of injury is missing.

Neither a missing first report of injury nor a missing FDLT

implies that there was no FDLT, but rather that it was not

reported. In stipulations and settlements, the FDLT may have

no relevance to the final disposition, so it is no surprise that it

is often missing. Data that have no practical importance to a

case are more likely than other data to be missing or

inaccurate. Disputes also may increase the probability that

such cases will not be entered into the OSHA log. For these

reasons and because injuries resulting in stipulations and

settlements tend to be more severe, we decided to include

these cases in our analysis. We recognize that some of these

cases would not be eligible for inclusion in the SOII, but we

thought that this would occur infrequently and that the benefit

of keeping these cases would outweigh the cost.

We also believed that many cases with permanent partial

disability (PPD) benefits but for which there is no record of

TTD payments involve at least one day away from work. By

far the most frequent of these cases with an identified nature

of injury were strains of the back, shoulder, or knee. In our

judgment, given that these injuries were severe enough to

lead to PPD payments, many of these would have involved at

least enough time lost from work to qualify for the SOII. For

this reason, our data included these cases. (Hearing loss cases

without TTD benefits should have been eliminated from the

analysis. Still, they amount to much less than 1% of all cases.)

Oleinick and Zaidman also suggest that our study did not

adequately restrict the SOII sample and should have dropped

all SOII cases with less than 4 days away from work (DAFW)

and placed a similar restriction on workers’ compensation

cases, even those that exceeded the waiting period. In our

analysis, we used the workers’ compensation rule to calculate

whether the waiting period would apply and then eliminated

unlinked SOII cases that did not meet this criterion. Their

exclusion criterion is definitely more restrictive than ours,

eliminating thousands of cases that meet the criteria for

inclusion in the SOII and the Minnesota definition of a

compensable workers’ compensation case. If short-duration

cases are less likely to be reported than longer-duration cases,

this would bias Oleinick and Zaidman’s estimates toward

complete reporting. Also, a substantial number of these

excluded SOII cases had a corresponding workers’ compensation

case and were included among our linked cases.

We could address each of the other objections that

Oleinick and Zaidman have with the reasoning behind our

choices, but the main point we would like to make is that our

decisions were not uninformed. Rather, they were choices

made after we did our best to understand the nature of

the systems and the data we used. Oleinick and Zaidman

highlight these decisions, and their critique leads us to think

that our study could have benefited from a sensitivity analysis

to determine the impact of different assumptions from the

ones that we made. However, they imply that their choices are

correct and ours are unjustified. Here we must disagree.

RESULTS

Sensitivity Analysis of Data Restrictions

To quantify the difference that using Oleinick and

Zaidman’s assumptions might make, we present a rough

sensitivity analysis that takes into account most of the

alternate assumptions used by Oleinick and Zaidman but

still keeps all linked workers’ compensation and SOII

cases—whether or not they meet the inclusion criteria. We

no longer have access to the data used to estimate reporting

rates, but because the crude capture–recapture estimates are

close to the adjusted estimates, we can approximate the

impact of most of their assumptions on our estimates.

Because we know that a portion of the cases that they

would eliminate appear in both workers’ compensation and

the SOII, and these cases are not discarded in our analysis, we

subtract from unlinked cases a fraction of the cases that

Oleinick and Zaidman suggest should be removed from the

sample. They observe that 15% of workers’ compensation

cases are recorded as paying no TTD benefits. Twenty

percent of these cases, 3% of the total, appear in both

datasets. We remove the remaining 12% of the 112,251

reported workers’ compensation cases from those classified

as not reported to BLS, reducing this number from 36,335 to

22,865. It is harder to apply the Oleinick and Zaidman criteria

to unlinked SOII cases. Many of the cases they would exclude

are eligible for workers’ compensation benefits even though

they had less than four lost workdays. Also, some of these

cases were linked to workers’ compensation claims in our

study. Still, in an attempt to see how much of a

difference the alternate assumptions could make, we further

eliminate 25% of BLS unlinked cases, reducing the number of

these cases from 41,238 to 30,929. The results are displayed

in Table I.
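
The arithmetic behind these two adjustments can be retraced in a few lines. This is a sketch with illustrative variable names; the half-up rounding helper is an assumption chosen because it reproduces the published counts (Python's built-in round would send 30,928.5 to 30,928).

```python
import math


def round_half_up(x):
    """Round to the nearest integer, with halves rounded up."""
    return math.floor(x + 0.5)


wc_reported = 112_251    # all reported workers' compensation cases
wc_not_in_bls = 36_335   # workers' compensation cases not linked to the SOII
bls_unlinked = 41_238    # SOII cases not linked to workers' compensation

no_ttd_share = 0.15              # cases recorded as paying no TTD benefits
linked_share_of_no_ttd = 0.20    # portion of those that appear in both datasets
removed_share = no_ttd_share * (1 - linked_share_of_no_ttd)  # 12% of the total

wc_not_in_bls_adj = round_half_up(wc_not_in_bls - removed_share * wc_reported)
bls_unlinked_adj = round_half_up(bls_unlinked - 0.25 * bls_unlinked)

print(wc_not_in_bls_adj, bls_unlinked_adj)
```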

For comparison, the original capture–recapture

estimates assuming source independence are shown in the

first panel of Table I. The second panel shows how altering

the underlying assumptions as described in the previous

paragraph changes these estimates. The estimated proportion

of injuries reported to the workers’ compensation system

rises from 65% to 71%, and the SOII proportion rises from

68% to 77%. Relaxing the assumption of source independence

so that cases reported to workers’ compensation are

more likely to be reported to BLS (with an odds ratio of 3.0)

reduces estimated reporting rates to 63% (workers’ compensation)

and 68% (SOII). The original estimates are

altered by the changes suggested by Oleinick and Zaidman,

but the resulting estimates are still far from 100%.
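
One way to express positive source dependence is to fix the odds ratio of the 2 × 2 table. With a linked count n11 and single-source counts n10 and n01, an odds ratio OR implies a missed-by-both count of OR × n10 × n01 / n11, and OR = 1 recovers the independence estimate. The sketch below uses illustrative names and the stringent counts described above; under these assumptions it reproduces the second and third panels of Table I to within rounding.

```python
from collections import namedtuple

Estimate = namedtuple("Estimate", "missed_by_both total bls_rate wc_rate")


def capture_recapture(linked, bls_only, wc_only, odds_ratio=1.0):
    """Two-source estimate for a given odds ratio between the sources."""
    missed_by_both = odds_ratio * bls_only * wc_only / linked
    total = linked + bls_only + wc_only + missed_by_both
    return Estimate(
        missed_by_both,
        total,
        (linked + bls_only) / total,
        (linked + wc_only) / total,
    )


# Counts after applying the more stringent sample restrictions.
stringent = dict(linked=75_916, bls_only=30_929, wc_only=22_865)
independent = capture_recapture(**stringent)
dependent = capture_recapture(**stringent, odds_ratio=3.0)

for label, est in (("OR = 1", independent), ("OR = 3", dependent)):
    print(f"{label}: BLS {est.bls_rate:.0%}, workers' compensation {est.wc_rate:.0%}")
```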

Sensitivity Analysis of Analytic Assumptions

Perhaps the most important limitation of the Oleinick

and Zaidman article is that they lack individual data. The first

panel of Table II shows Oleinick and Zaidman’s estimates

of the concordance of workers’ compensation and BLS

estimates for Minnesota for 1998–2001, using the upper end

of their range for the number of workers’ compensation

cases. This is displayed in the same format as Table I to make

the point that the only way to derive their estimate using

individual data is to assume that all workers’ compensation

cases are reported in the BLS data, a very unlikely

occurrence. In fact, applying this assumption to the marginal

totals in Boden and Ozonoff [2008] would imply that the

SOII captured 100% of all injuries and illnesses.

Many scenarios are consistent with the data in

Oleinick and Zaidman [2009], each represented by a 2 × 2

table with identical values on the two observed marginal

totals but different counts of cases in the interior cells.

The second panel of Table II shows one of these. This

scenario assumes that 71% of reported SOII cases are also in

the workers’ compensation data. This is higher than the

proportion derived in our 2008 study and comes from

the second panel of Table I, based on Oleinick and Zaidman’s

assumptions.

If we assume that reporting to both systems is

independent and that 71% of reported SOII cases are linked

to workers’ compensation claims, we can use the total

number of SOII and workers’ compensation cases [Oleinick

and Zaidman, 2009, Table II] to calculate reporting

percentages. Given these conditions, it follows that only

64% of injuries are reported to the BLS. This estimate of BLS

reporting is actually lower than the Boden and Ozonoff

[2008] estimate of 68%. Assuming positive source dependence

with an odds ratio of 3 reduces the estimate of BLS

reporting from 64% to 53%. Of course, we could have made a

TABLE I. Sensitivity Analysis of Boden and Ozonoff Estimates; 1998–2001 Injuries, Minnesota

Original estimates: source independence

                      Workers’ compensation
BLS          No report        Report           Total
No report    19,738 (11%)     36,335 (21%)     56,073 (32%)
Report       41,238 (24%)     75,916 (44%)     117,154 (68%)
Total        60,976 (35%)     112,251 (65%)    173,227 (100%)

Estimate with stringent assumptions: source independence

                      Workers’ compensation
BLS          No report        Report           Total
No report    9,315 (7%)       22,865 (16%)     32,180 (23%)
Report       30,929 (22%)     75,916 (55%)     106,845 (77%)
Total        40,244 (29%)     98,781 (71%)     139,025 (100%)

Estimate with stringent assumptions: source dependence, OR = 3

                      Workers’ compensation
BLS          No report        Report           Total
No report    27,946 (18%)     22,865 (15%)     50,811 (32%)
Report       30,929 (20%)     75,916 (48%)     106,845 (68%)
Total        58,874 (37%)     98,781 (63%)     157,655 (100%)

Rows and columns may not add to totals because of rounding.

TABLE II. Concordance Estimates, 1998, High Estimate of Workers’ Compensation Cases

Oleinick and Zaidman estimates

                      Workers’ compensation
BLS          No report        Report           Total
No report    0 (0%)           9,260 (10%)      9,260 (10%)
Report       0 (0%)           83,753 (90%)     83,753 (90%)
Total        0 (0%)           93,013 (100%)    93,013 (100%)

Alternate concordance estimate: source independence

                      Workers’ compensation
BLS          No report        Report           Total
No report    13,703 (10%)     33,548 (26%)     47,251 (36%)
Report       24,288 (19%)     59,465 (45%)     83,753 (64%)
Total        37,991 (29%)     93,013 (71%)     131,004 (100%)

Numbers in bold are those reported as observed by Oleinick and Zaidman as the upper range of the number of BLS cases. All other numbers are based on assumptions.

wide range of assumptions about source dependence and the

proportion of cases reported to workers’ compensation, and

that is precisely our point. The aggregate data used by

Oleinick and Zaidman do not support a specific estimate of

reporting completeness, and their assumptions are highly

optimistic.
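
The dependence of the result on these choices can be illustrated with a small sweep over the two unknowns, holding fixed only the two marginal totals in Table II. This is a sketch under stated assumptions: the function name and the grid of linked shares and odds ratios are illustrative choices, not estimates.

```python
BLS_TOTAL = 83_753   # SOII cases (marginal total in Table II)
WC_TOTAL = 93_013    # workers' compensation cases, upper end of the range


def bls_reporting_rate(linked_share, odds_ratio=1.0):
    """Estimated share of all injuries in the SOII for an assumed linked share."""
    linked = round(linked_share * BLS_TOTAL)   # SOII cases also in workers' comp
    bls_only = BLS_TOTAL - linked
    wc_only = WC_TOTAL - linked
    missed_by_both = odds_ratio * bls_only * wc_only / linked
    total = linked + bls_only + wc_only + missed_by_both
    return BLS_TOTAL / total


# Illustrative grid; only 0.71 and odds ratios of 1 and 3 come from the text.
for share in (0.60, 0.71, 0.85, 1.00):
    for odds_ratio in (1.0, 3.0):
        rate = bls_reporting_rate(share, odds_ratio)
        print(f"linked share {share:.2f}, OR {odds_ratio:.0f}: BLS reporting {rate:.0%}")
```

A linked share of 0.71 gives the 64% and 53% figures discussed above, while a linked share of 1.00 leaves no SOII case outside workers’ compensation and returns the 90% shown in the first panel of Table II.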

CONCLUSIONS

Oleinick and Zaidman have made a very important point

deserving of close attention from all researchers in this area:

to use workers’ compensation data properly, a detailed

understanding of the law and regulations is critical. We thank

them for the opportunity to consider the effect on our

estimates of changing the underlying assumptions. Applying

more restrictive inclusion criteria to our original research

increases our reporting estimates, but they remain below

80%. Using Oleinick and Zaidman’s data but applying a less

optimistic assumption about the proportion of SOII cases in

the workers’ compensation data suggests that fewer than two

in three injuries are reported to the SOII.

More generally, research using secondary data sources

typically requires researchers to make judgments about how

to use the data and how to interpret data elements that do not

precisely match the concepts they represent. In such cases,

sensitivity analysis can prove a useful tool.

A remaining issue is whether the requirements of

academic studies and policy analysis can be met simultaneously.

For example, we simulated the impact of the

unknown amount of source dependence on our estimates.

Based on how information enters the BLS and workers’

compensation data, there must be a substantial amount of

source dependence. Yet policymakers may well focus on the

source-independence estimates. Similarly, by choosing

conservative assumptions for which observations to keep in

the analysis, researchers may follow good academic practice,

but policymakers may infer that the undercount of occupational

injuries and illnesses is lower than it actually is.

Finally, Oleinick and Zaidman underline the importance

of additional research to narrow the range of uncertainty

about reporting and to understand the reasons for the

undercount of workplace injuries and illnesses. Work on

these issues has already begun with a re-analysis of

Wisconsin data by the BLS [Nestoriak and Pierce, 2009],

and this agency will be providing support for further public

health research in this area.

ACKNOWLEDGMENTS

Research on which this commentary is based was

supported by the National Institute for Occupational Safety

and Health (Research Grant # 5R01 OH 007596) and the

State of California, Commission on Health and Safety and

Workers’ Compensation. We also acknowledge the help and

cooperation of the Bureau of Labor Statistics and the states

of Minnesota, New Mexico, Oregon, Washington, West

Virginia, and Wisconsin.

REFERENCES

Adams TD et al. 2007. Long-term mortality after gastric bypass surgery. NEJM 357(8):753–761.

Boden LI, Ozonoff A. 2008. Capture-recapture estimates of nonfatal workplace injuries and illnesses. Ann Epidemiol 18(6):500–506.

Koster A, Harris TB, Moore SC, Schatzkin A, Hollenbeck AR, van Eijk JTM, Leitzmann MF. 2009. Joint associations of adiposity and physical activity with mortality: The National Institutes of Health-AARP Diet and Health Study. Am J Epidemiol 169(11):1344–1351.

Lee DJ, Fleming LE, LeBlanc WG, Arheart KL, Chung-Bridges K, Christ SL, Caban AJ, Pitman T. 2006. Occupation and lung cancer mortality in a nationally representative U.S. cohort: The National Health Interview Survey (NHIS). J Occup Env Med 48(8):823–832.

Meray N, Reitsma JB, Ravelli A, Bonsel G. 2007. Probabilistic record linkage is a valid and transparent tool to combine databases without a patient identification number. J Clin Epidemiol 60(9):883.e1–883.e11.

Nestoriak N, Pierce B. 2009. Comparing workers’ compensation claims with establishments’ responses to the SOII. Mon Labor Rev 132(5):57–64.

Oleinick A, Zaidman B. 2004. Methodologic issues in the use of workers’ compensation databases for the study of work injuries with days away from work. I. Sensitivity of case ascertainment. Am J Ind Med 45:260–274.

Oleinick A, Zaidman B. 2009. The law and incomplete database information as confounders in epidemiologic research on occupational injuries and illnesses. Am J Ind Med This issue.

Rosenman KD, Kalush A, Reilly MJ, Gardiner JC, Reeves M, Luo Z. 2006. How much work-related injury and illness is missed by the current national surveillance system? J Occup Environ Med 48:357–365.

Winkler WE. 2006. Overview of record linkage and current research directions. Technical Report, Statistical Research Division, U.S. Census Bureau. http://www.census.gov/srd/papers/pdf/rrs2006-02.pdf.
