AMERICAN JOURNAL OF INDUSTRIAL MEDICINE 53:37–41 (2010)
Commentary
Researcher Judgment and Study Design: Challenges of Using Administrative Data
Leslie I. Boden, PhD1* and Al Ozonoff, PhD2
Background Questions have been raised about methods of studies finding substantial undercounting of workplace injuries and illnesses by the Bureau of Labor Statistics (BLS) and workers’ compensation agencies. A more recent study of Minnesota concluded that the BLS survey captures 84–90% of workers’ compensation cases.
Methods We examined the sensitivity of findings in two studies to alternate sample definitions and study assumptions.
Results Applying alternate sample construction rules to the earlier study increased estimated BLS reporting rates from 68% to 77%, assuming source independence. Applying alternate assumptions to the more recent Minnesota study reduced its high estimate of BLS reporting from 90% to 53–64%.
Conclusions Studies linking administrative data from different sources require substantial judgment in constructing research datasets and choosing analytic methods. Moreover, different sample construction rules lead to different results. This suggests that sensitivity analysis should be carried out when alternatives cannot be ruled out. In this case, sensitivity analysis supports the hypothesis of substantial underreporting. Am. J. Ind. Med. 53:37–41, 2010. © 2009 Wiley-Liss, Inc.
KEY WORDS: occupational injuries; workers’ compensation; capture–recapture; study design
INTRODUCTION
Using administrative data in public health research
presents many challenges. Unlike data specifically designed
to answer a study question, administrative data such as those
collected by state workers’ compensation systems can be
difficult to use for research because they are collected with a
non-research purpose. These databases are primarily used to
provide an overview of the number and types of injuries, keep
track of overall payments, and possibly track the per-
formance of individual employers and insurers. Information
important to the researcher may not be central to these
purposes and may, as a consequence, frequently be missing or
inaccurate. Attempting to link two administrative datasets,
each designed with different goals, presents even greater
challenges. Data inaccuracies, different definitions of data
elements, and disparate criteria for inclusion all require
researchers to make choices about how to assemble and
analyze the data. All of these issues are present in the
data used by Oleinick and Zaidman [2009] in this issue and
in the earlier studies by Rosenman et al. [2006] and by
Boden and Ozonoff [2008]. These data shortcomings and
researchers’ responses to them may have important implications
for research results. Oleinick and Zaidman have done us
all a service by focusing on many of the decisions made in
estimating the number of occupational injuries and illnesses
in Rosenman et al. [2006] and Boden and Ozonoff [2008].
(In the 2008 study, Boden was responsible for almost all of
these decisions. Ozonoff provided statistical expertise.)

1Department of Environmental Health, Boston University School of Public Health, Boston, Massachusetts
2Department of Biostatistics, Boston University School of Public Health, Boston, Massachusetts
Contract grant sponsor: National Institute for Occupational Safety and Health; Contract grant number: 5R01 OH 007596; Contract grant sponsor: State of California, Commission on Health and Safety and Workers’ Compensation.
*Correspondence to: Leslie I. Boden, Department of Environmental Health, Boston University School of Public Health, 715 Albany Street, Boston, MA 02118. E-mail: [email protected]
Accepted 9 October 2009
DOI 10.1002/ajim.20787. Published online in Wiley InterScience (www.interscience.wiley.com)

In their article in this issue and in an earlier article,
Oleinick and Zaidman have demonstrated the importance
of understanding the legal and institutional setting when
undertaking research using administrative data sources. They
gave the Boden and Ozonoff [2008] study a very careful and
thoughtful reading and raised some important issues.
Their alternate assumptions would reduce estimates of
underreporting, which suggests a sensitivity analysis based
on these assumptions. However, some of their criticism rests
on a misunderstanding of the methods used to link and analyze
the data. Moreover, their analysis of the completeness
of workers’ compensation and Bureau of Labor Statistics
(BLS) injury reporting is based on aggregate data. We will
demonstrate that their conclusions about the extent of
underreporting, comparing overall counts in two datasets,
are based on assumptions that are unlikely to be valid. Only
comparing individual case data can support valid conclusions
about the number of workplace injuries and illnesses missed
by these two systems.
Accuracy of Linking
Oleinick and Zaidman spend a substantial portion of
their article discussing the methods for linking workers’
compensation and BLS cases in Boden and Ozonoff [2008].
They are concerned that the first, deterministic, step
missed multi-establishment employers and, by implication,
that the overall strategy missed linking many injuries
reported in both systems. We think that neither of these is
the case. First, we note that attributes of individuals and
injuries were most important in linking. As we reported, we
used employer identifier, employer name, employer address,
employer zip code or city, worker’s first initial, worker’s
last name, sex, date of injury, and date of birth or age at injury.
Although the original article did not describe this in full
detail, we also used the Soundex code for the last name;
if both the last name and the Soundex code
matched, each counted toward the required number of
matches. The same is true for age and date of birth, and
for month of injury and date of injury. For the probabilistic
linking, the number of exact matches needed was relaxed,
producing a large number of potentially matched pairs, many
of which referred to clearly different injuries. Then two
members of the research team independently examined each
potential linked pair to find pairs that seemed to represent the
same injury.
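
The match-counting idea described above can be sketched in a few lines. This is a minimal illustration and not the linkage code used in the study: the field names, the example records, and the function names are hypothetical, and the required number of matches is not stated in the text, so no threshold is applied here. The Soundex routine follows the standard American Soundex rules.

```python
# Hypothetical sketch: count agreeing fields between a workers' compensation
# record and a SOII record, letting the Soundex code of the last name count
# as an additional match alongside the exact last name.

SOUNDEX_CODES = {}
for letters, digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                       ("L", "4"), ("MN", "5"), ("R", "6")):
    for ch in letters:
        SOUNDEX_CODES[ch] = digit


def soundex(name: str) -> str:
    """Return the four-character American Soundex code of a name."""
    letters = [c for c in name.upper() if c.isalpha()]
    if not letters:
        return ""
    out = letters[0]
    prev = SOUNDEX_CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "HW":
            continue  # H and W do not separate letters with the same code
        code = SOUNDEX_CODES.get(ch, "")
        if code and code != prev:
            out += code
        prev = code  # vowels reset the previous code
    return (out + "000")[:4]


# Field names are assumptions made for this example.
FIELDS = ("employer_id", "first_initial", "last_name", "sex",
          "injury_date", "birth_date")


def count_matches(a: dict, b: dict) -> int:
    """Count fields that agree exactly; missing values never count."""
    score = sum(1 for f in FIELDS
                if a.get(f) is not None and a.get(f) == b.get(f))
    if a.get("last_name") and b.get("last_name"):
        if soundex(a["last_name"]) == soundex(b["last_name"]):
            score += 1  # Soundex agreement counts as its own match
    return score


wc_record = {"employer_id": "MN-001", "first_initial": "J",
             "last_name": "Smith", "sex": "M",
             "injury_date": "1999-03-02", "birth_date": "1960-05-17"}
soii_record = {"employer_id": None, "first_initial": "J",
               "last_name": "Smyth", "sex": "M",
               "injury_date": "1999-03-02", "birth_date": None}

score = count_matches(wc_record, soii_record)
print(score)
```

In this example the misspelled last name fails the exact comparison but still contributes one match through its Soundex code. A pair scoring above a relaxed threshold would only become a candidate; as described above, the final decision rested on independent manual review.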
The function of probabilistic linking is to link pairs well
in the face of both missing and inaccurate data, the situation
that Oleinick and Zaidman describe in some detail. This
method is widely used by the Census Bureau [Winkler,
2006], in mortality epidemiology using the National Death
Index [Adams et al., 2007; Koster et al., 2009], as well as by
other epidemiologic studies [Lee et al., 2006; Meray et al.,
2007]. Still, it is possible that some of the cases we classified
as unlinked were actually linked. This is a particular concern
for multi-establishment employers, an issue that our study
addresses.
Appropriateness of Data Restrictions
Oleinick and Zaidman’s article and ours both compare
workplace injury reporting using datasets for which both
population and variable definitions differ. To derive unbiased
measures from these comparisons, it is important to limit the
comparisons to subsets that refer to the same population or to
account for the differences in the analysis. For example, the
BLS Survey of Occupational Injuries and Illnesses (SOII) is a
stratified random sample of establishments, but workers’
compensation systems cover almost all establishments in a
state. This means that a valid analysis of reporting should not
treat injuries outside the SOII sampling frame as if they were
missed by the SOII. During the course of our study, we
discovered that it was sometimes difficult to decide which
injuries were covered by both the SOII and workers’
compensation databases. In most situations, we made these
decisions so that the likely bias in our estimates was toward
complete reporting.
For example, we focused on estimates that assumed
source independence—that an injury reported to the workers’
compensation system was no more likely to be reported to the
BLS than an injury not reported to the workers’ compensa-
tion system. It is likely, however, that reporting to workers’
compensation substantially increases the probability that an
injury is reported to the BLS. In this situation, capture–
recapture estimates based on source independence will make
reporting appear more complete than it actually is. We are
unaware of any estimates of the degree of this positive source
dependence, and we could not estimate this with the data we
used. Given this limitation, we provided alternate estimates
assuming different degrees of positive source dependence.
Also, when we were uncertain about whether injuries were
linked, we treated them as linked. In addition, when limiting
the workers’ compensation and SOII data to comparable
subsets by removing BLS cases not meeting the workers’
compensation waiting period, we discarded unlinked cases
but kept those that were linked.
Oleinick and Zaidman note that we did not remove
certain cases from our analysis, and they argue that this led us
to underestimate the completeness of reporting. We were
aware of the issues they raise, both because we had read their
valuable 2004 article [Oleinick and Zaidman, 2004] and
because we had used Mr. Zaidman as our source of
information about the Minnesota workers’ compensation
system, including details about Minnesota’s waiting period.
However, the Oleinick and Zaidman [2004] and the Boden
and Ozonoff [2008] studies used different inclusion rules. For
example, we decided to include cases involving stipulation or
settlement. These cases involve disputes, often over the
compensability of the claim, the duration of temporary
disability, or the degree of permanent disability. The disputes
and the nature of their eventual resolution often mean that
data are missing from workers’ compensation reports. For
example, the first day of lost time (FDLT) is frequently
missing, even for cases involving payment of temporary total
disability (TTD). Often, the first report of injury is missing.
Neither a missing first report of injury nor a missing FDLT
implies that there was no FDLT, but rather that it was not
reported. In stipulations and settlements, the FDLT may have
no relevance to the final disposition, so it is no surprise that it
is often missing. Data that have no practical importance to a
case are more likely than other data to be missing or
inaccurate. Disputes also may increase the probability that
such cases will not be entered into the OSHA log. For these
reasons and because injuries resulting in stipulations and
settlements tend to be more severe, we decided to include
these cases in our analysis. We recognize that some of these
cases would not be eligible for inclusion in the SOII, but we
thought that this would occur infrequently and that the benefit
of keeping these cases would outweigh the cost.
We also believed that many cases with permanent partial
disability (PPD) benefits but for which there is no record of
TTD payments involve at least one day away from work. By
far the most frequent of these cases with an identified nature
of injury were strains of the back, shoulder, or knee. In our
judgment, given that these injuries were severe enough to
lead to PPD payments, many of these would have involved at
least enough time lost from work to qualify for the SOII. For
this reason, our data included these cases. (Hearing loss cases
without TTD benefits should have been eliminated from the
analysis. Still, they amount to much less than 1% of all cases.)
Oleinick and Zaidman also suggest that our study did not
adequately restrict the SOII sample and should have dropped
all SOII cases with less than 4 days away from work (DAFW)
and placed a similar restriction on workers’ compensation
cases, even those that exceeded the waiting period. In our
analysis, we used the workers’ compensation rule to calculate
whether the waiting period would apply and then eliminated
unlinked SOII cases that did not meet this criterion. Their
exclusion criterion is definitely more restrictive than ours,
eliminating thousands of cases that meet the criteria for
inclusion in the SOII and the Minnesota definition of a
compensable workers’ compensation case. If short-duration
cases are less likely to be reported than longer-duration cases,
this would bias Oleinick and Zaidman’s estimates toward
complete reporting. Also, a substantial number of these
excluded SOII cases had a corresponding workers’ compen-
sation case and were included among our linked cases.
We could address each of the other objections that
Oleinick and Zaidman have with the reasoning behind our
choices, but the main point we would like to make is that our
decisions were not uninformed. Rather, they were choices
made after we did our best to understand the nature of
the systems and the data we used. Oleinick and Zaidman
highlight these decisions, and their critique leads us to think
that our study could have benefited from a sensitivity analysis
to determine the impact of different assumptions from the
ones that we made. However, they imply that their choices are
correct and ours are unjustified. Here we must disagree.
RESULTS
Sensitivity Analysis of Data Restrictions
To quantify the difference that using Oleinick and
Zaidman’s assumptions might make, we present a rough
sensitivity analysis that takes into account most of the
alternate assumptions used by Oleinick and Zaidman but
still keeps all linked workers’ compensation and SOII
cases—whether or not they meet the inclusion criteria. We
no longer have access to the data used to estimate reporting
rates, but because the crude capture–recapture estimates are
close to the adjusted estimates, we can approximate the
impact of most of their assumptions on our estimates.
Because we know that a portion of the cases that they
would eliminate appear in both workers’ compensation and
the SOII, and these cases are not discarded in our analysis, we
subtract from unlinked cases a fraction of the cases that
Oleinick and Zaidman suggest should be removed from the
sample. They observe that 15% of workers’ compensation
cases are recorded as paying no TTD benefits. Twenty
percent of these cases, 3% of the total, appear in both
datasets. We remove the remaining 12% of the 112,251
reported workers’ compensation cases from those classified
as not reported to BLS, reducing this number from 36,335 to
22,865. It is harder to apply the Oleinick and Zaidman criteria
to unlinked SOII cases. Many of the cases they would exclude
are eligible for workers’ compensation benefits even though
they had less than four lost workdays. Also, some of these
cases were linked to workers’ compensation claims in our
study. Still, in an attempt to see how much of a
difference the alternate assumptions could make, we further
eliminate 25% of BLS unlinked cases, reducing the number of
these cases from 41,238 to 30,929. The results are displayed
in Table I.
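
The arithmetic behind these two adjustments can be checked with a short sketch. The variable names and the half-up rounding helper are assumptions made for the example; the counts and percentages are the ones quoted in the paragraph above.

```python
import math


def round_half_up(x: float) -> int:
    """Round to the nearest integer, with halves rounded up.

    Python's built-in round() rounds halves to the nearest even integer,
    so a small helper is used here (an assumption about how the published
    counts were rounded).
    """
    return math.floor(x + 0.5)


wc_reported = 112_251    # all reported workers' compensation cases
wc_unlinked = 36_335     # workers' compensation cases not found in the SOII
soii_unlinked = 41_238   # SOII cases not found in workers' compensation

# Remove the 12% of workers' compensation cases that paid no TTD benefits
# and were not linked to a SOII case.
wc_unlinked_adj = round_half_up(wc_unlinked - 0.12 * wc_reported)

# Remove 25% of the unlinked SOII cases.
soii_unlinked_adj = round_half_up(0.75 * soii_unlinked)

print(wc_unlinked_adj, soii_unlinked_adj)
```

The two adjusted counts, 22,865 and 30,929, are the unlinked cells used in the second panel of Table I. The linked count of 75,916 is left unchanged.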
For comparison, the original capture–recapture
estimates assuming source independence are shown in the
first panel of Table I. The second panel shows how altering
the underlying assumptions as described in the previous
paragraph changes these estimates. The estimated proportion
of injuries reported to the workers’ compensation system
rises from 65% to 71%, and the SOII proportion rises from
68% to 77%. Relaxing the assumption of source independence
so that cases reported to workers’ compensation are
more likely to be reported to BLS (with an odds ratio of 3.0)
reduces estimated reporting rates to 63% (workers’
compensation) and 68% (SOII). The original estimates are
altered by the changes suggested by Oleinick and Zaidman,
but the resulting estimates are still far from 100%.
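
The estimates in this paragraph follow from the standard two-source capture-recapture calculation, sketched here under stated assumptions. The function and variable names are hypothetical. The sketch uses the crude (unadjusted) cell counts, and it models source dependence by fixing the odds ratio of the 2 × 2 table, so that the number of cases missed by both systems is the odds ratio times (WC only × BLS only) / linked.

```python
def capture_recapture(both, wc_only, bls_only, odds_ratio=1.0):
    """Estimate total injuries and reporting rates from three observed cells.

    both      -- cases found in both workers' compensation (WC) and the SOII
    wc_only   -- cases found only in WC
    bls_only  -- cases found only in the SOII
    odds_ratio -- assumed odds ratio between the sources (1.0 = independence)
    """
    missed = odds_ratio * wc_only * bls_only / both
    total = both + wc_only + bls_only + missed
    return {
        "missed_by_both": missed,
        "total": total,
        "wc_rate": (both + wc_only) / total,
        "bls_rate": (both + bls_only) / total,
    }


original = capture_recapture(75_916, 36_335, 41_238)
stringent = capture_recapture(75_916, 22_865, 30_929)
dependent = capture_recapture(75_916, 22_865, 30_929, odds_ratio=3.0)

for label, est in (("original", original),
                   ("stringent", stringent),
                   ("stringent, OR = 3", dependent)):
    print(f"{label}: WC {est['wc_rate']:.0%}, BLS {est['bls_rate']:.0%}")
```

Rounded to whole percentages, the three calls give 65% and 68%, 71% and 77%, and 63% and 68%, in line with the reporting rates quoted above. Because the unobserved cell grows in proportion to the assumed odds ratio, any positive dependence lowers both estimated reporting rates.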
Sensitivity Analysis of Analytic Assumptions
Perhaps the most important limitation of the Oleinick
and Zaidman article is that they lack individual data. The first
panel of Table II shows Oleinick and Zaidman’s estimates
of the concordance of workers’ compensation and BLS
estimates for Minnesota for 1998–2001, using the upper end
of their range for the number of workers’ compensation
cases. This is displayed in the same format as Table I to make
the point that the only way to derive their estimate using
individual data is to assume that all workers’ compensation
cases are reported in the BLS data, a very unlikely
occurrence. In fact, applying this assumption to the marginal
totals in Boden and Ozonoff [2008] would imply that the
SOII captured 100% of all injuries and illnesses.
Many scenarios are consistent with the data in
Oleinick and Zaidman [2009], each represented by a 2 × 2
table with identical values on the two observed marginal
totals but different counts of cases in the interior cells.
The second panel of Table II shows one of these. This
scenario assumes that 71% of reported SOII cases are also in
the workers’ compensation data. This is higher than the
proportion derived in our 2008 study and comes from
the second panel of Table I, based on Oleinick and Zaidman’s
assumptions.
If we assume that reporting to both systems is
independent and that 71% of reported SOII cases are linked
to workers’ compensation claims, we can use the total
number of SOII and workers’ compensation cases [Oleinick
and Zaidman, 2009, Table II] to calculate reporting
percentages. Given these conditions, it follows that only
64% of injuries are reported to the BLS. This estimate of BLS
reporting is actually lower than the Boden and Ozonoff
[2008] estimate of 68%. Assuming positive source depend-
ence with an odds ratio of 3 reduces the estimate of BLS
reporting from 64% to 53%. Of course, we could have made a
wide range of assumptions about source dependence and the
proportion of cases reported to workers’ compensation, and
that is precisely our point. The aggregate data used by
Oleinick and Zaidman do not support a specific estimate of
reporting completeness, and their assumptions are highly
optimistic.

TABLE I. Sensitivity Analysis of Boden and Ozonoff Estimates; 1998–2001 Injuries, Minnesota

Original estimates: source independence
                  Workers’ compensation
BLS               No report        Report           Total
No report         19,738 (11%)     36,335 (21%)     56,073 (32%)
Report            41,238 (24%)     75,916 (44%)     117,154 (68%)
Total             60,976 (35%)     112,251 (65%)    173,227 (100%)

Estimate with stringent assumptions: source independence
                  Workers’ compensation
BLS               No report        Report           Total
No report         9,315 (7%)       22,865 (16%)     32,180 (23%)
Report            30,929 (22%)     75,916 (55%)     106,845 (77%)
Total             40,244 (29%)     98,781 (71%)     139,025 (100%)

Estimate with stringent assumptions: source dependence, OR = 3
                  Workers’ compensation
BLS               No report        Report           Total
No report         27,946 (18%)     22,865 (15%)     50,811 (32%)
Report            30,929 (20%)     75,916 (48%)     106,845 (68%)
Total             58,874 (37%)     98,781 (63%)     157,655 (100%)

Rows and columns may not add to totals because of rounding.

TABLE II. Concordance Estimates, 1998, High Estimate of Workers’ Compensation Cases

Oleinick and Zaidman estimates
                  Workers’ compensation
BLS               No report        Report           Total
No report         0 (0%)           9,260 (10%)      9,260 (10%)
Report            0 (0%)           83,753 (90%)     83,753 (90%)
Total             0 (0%)           93,013 (100%)    93,013 (100%)

Alternate concordance estimate: source independence
                  Workers’ compensation
BLS               No report        Report           Total
No report         13,703 (10%)     33,548 (26%)     47,251 (36%)
Report            24,288 (19%)     59,465 (45%)     83,753 (64%)
Total             37,991 (29%)     93,013 (71%)     131,004 (100%)

Numbers in bold are those reported as observed by Oleinick and Zaidman as the upper range of the number of BLS cases. All other numbers are based on assumptions.
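
The dependence of the BLS reporting estimate on what is assumed about the interior of the 2 × 2 table can be sketched as follows. The function name and parameters are hypothetical; the two marginal totals are the ones shown in Table II, and the share of SOII cases also found in workers' compensation is the assumed quantity that the aggregate data cannot supply.

```python
def bls_reporting_rate(bls_total, wc_total, share_linked, odds_ratio=1.0):
    """BLS reporting rate implied by two marginal totals plus assumptions.

    bls_total    -- observed number of SOII cases
    wc_total     -- observed number of workers' compensation (WC) cases
    share_linked -- assumed share of SOII cases that are also in WC
    odds_ratio   -- assumed odds ratio between the sources (1.0 = independence)
    """
    both = share_linked * bls_total
    bls_only = bls_total - both
    wc_only = wc_total - both
    missed = odds_ratio * bls_only * wc_only / both
    total = both + bls_only + wc_only + missed
    return bls_total / total


BLS_TOTAL = 83_753
WC_TOTAL = 93_013

all_linked = bls_reporting_rate(BLS_TOTAL, WC_TOTAL, share_linked=1.0)
independent = bls_reporting_rate(BLS_TOTAL, WC_TOTAL, share_linked=0.71)
dependent = bls_reporting_rate(BLS_TOTAL, WC_TOTAL, share_linked=0.71,
                               odds_ratio=3.0)

print(f"{all_linked:.0%} {independent:.0%} {dependent:.0%}")
```

With the same two marginal totals, the implied BLS reporting rate is about 90% when every SOII case is assumed to appear in workers' compensation, 64% when 71% do and the sources are independent, and 53% with an odds ratio of 3. The marginal totals alone cannot distinguish among these scenarios.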
CONCLUSIONS
Oleinick and Zaidman have made a very important point
deserving of close attention from all researchers in this area:
to use workers’ compensation data properly, a detailed
understanding of the law and regulations is critical. We thank
them for the opportunity to consider the effect on our
estimates of changing the underlying assumptions. Applying
more restrictive inclusion criteria to our original research
increases our reporting estimates, but they remain below
80%. Using Oleinick and Zaidman’s data but applying a less
optimistic assumption about the proportion of SOII cases in
the workers’ compensation data suggests that fewer than two
in three injuries are reported to the SOII.
More generally, research using secondary data sources
typically requires researchers to make judgments about how
to use the data and how to interpret data elements that do not
precisely match the concepts they represent. In such cases,
sensitivity analysis can prove a useful tool.
A remaining issue is whether the requirements of
academic studies and policy analysis can be met simulta-
neously. For example, we simulated the impact of the
unknown amount of source dependence on our estimates.
Based on how information enters the BLS and workers’
compensation data, there must be a substantial amount of
source dependence. Yet policymakers may well focus on the
source-independence estimates. Similarly, by choosing
conservative assumptions for which observations to keep in
the analysis, researchers may follow good academic practice,
but policymakers may infer that the undercount of occupa-
tional injuries and illnesses is lower than it actually is.
Finally, Oleinick and Zaidman underline the importance
of additional research to narrow the range of uncertainty
about reporting and to understand the reasons for the
undercount of workplace injuries and illnesses. Work on
these issues has already begun with a re-analysis of
Wisconsin data by the BLS [Nestoriak and Pierce, 2009],
and this agency will be providing support for further public
health research in this area.
ACKNOWLEDGMENTS
Research on which this commentary is based was
supported by the National Institute for Occupational Safety
and Health (Research Grant # 5R01 OH 007596) and the
State of California, Commission on Health and Safety and
Workers’ Compensation. We also acknowledge the help and
cooperation of the Bureau of Labor Statistics and the states
of Minnesota, New Mexico, Oregon, Washington, West
Virginia, and Wisconsin.
REFERENCES
Adams TD et al. 2007. Long-term mortality after gastric bypass surgery. NEJM 357(8):753–761.
Boden LI, Ozonoff A. 2008. Capture-recapture estimates of nonfatal workplace injuries and illnesses. Ann Epidemiol 18(6):500–506.
Koster A, Harris TB, Moore SC, Schatzkin A, Hollenbeck AR, van Eijk JTM, Leitzmann MF. 2009. Joint associations of adiposity and physical activity with mortality: The National Institutes of Health-AARP Diet and Health Study. Am J Epidemiol 169(11):1344–1351.
Lee DJ, Fleming LE, LeBlanc WG, Arheart KL, Chung-Bridges K, Christ SL, Caban AJ, Pitman T. 2006. Occupation and lung cancer mortality in a nationally representative U.S. cohort: The National Health Interview Survey (NHIS). J Occup Environ Med 48(8):823–832.
Meray N, Reitsma JB, Ravelli A, Bonsel G. 2007. Probabilistic record linkage is a valid and transparent tool to combine databases without a patient identification number. J Clin Epidemiol 60(9):883.e1–883.e11.
Nestoriak N, Pierce B. 2009. Comparing workers’ compensation claims with establishments’ responses to the SOII. Mon Labor Rev 132(5):57–64.
Oleinick A, Zaidman B. 2004. Methodologic issues in the use of workers’ compensation databases for the study of work injuries with days away from work. I. Sensitivity of case ascertainment. Am J Ind Med 45:260–274.
Oleinick A, Zaidman B. 2009. The law and incomplete database information as confounders in epidemiologic research on occupational injuries and illnesses. Am J Ind Med This issue.
Rosenman KD, Kalush A, Reilly MJ, Gardiner JC, Reeves M, Luo Z. 2006. How much work-related injury and illness is missed by the current national surveillance system? J Occup Environ Med 48:357–365.
Winkler WE. 2006. Overview of record linkage and current research directions. Technical Report, Statistical Research Division, U.S. Census Bureau. http://www.census.gov/srd/papers/pdf/rrs2006-02.pdf.