American Academy of Political and Social Science

Nonexperimental versus Experimental Estimates of Earnings Impacts
Author(s): Steven Glazerman, Dan M. Levy and David Myers
Source: Annals of the American Academy of Political and Social Science, Vol. 589, Misleading Evidence and Evidence-Led Policy: Making Social Science More Experimental (Sep., 2003), pp. 63-93
Published by: Sage Publications, Inc. in association with the American Academy of Political and Social Science
Stable URL: http://www.jstor.org/stable/3658561
Accessed: 13/03/2013 22:03
This content downloaded on Wed, 13 Mar 2013 22:03:05 PM. All use subject to JSTOR Terms and Conditions.
Nonexperimental versus Experimental Estimates of Earnings Impacts

By
STEVEN GLAZERMAN, DAN M. LEVY, and DAVID MYERS
To assess nonexperimental (NX) evaluation methods in the context of welfare, job training, and employment services programs, the authors reexamined the results of twelve case studies intended to replicate impact estimates from an experimental evaluation by using NX methods. They found that the NX methods sometimes came close to replicating experimentally derived results but often produced estimates that differed by policy-relevant margins, which the authors interpret as estimates of bias. Although the authors identified several study design factors associated with smaller discrepancies, no combination of factors would consistently eliminate discrepancies. Even with a large number of impact estimates, the positive and negative bias estimates did not always cancel each other out. Thus, it was difficult to identify an aggregation strategy that consistently removed bias while answering a focused question about earnings impacts of a program. They conclude that although the empirical evidence from this literature can be used in the context of training and welfare programs to improve NX research designs, it cannot on its own justify the use of such designs.

Keywords: nonexperimental; quasi-experimental; design replication
Assessing Alternatives to Social Experiments
Controlled experiments, where subjects are
randomly assigned to receive interventions, are
desirable but often thought to be infeasible or
overly burdensome, especially in social settings.
Therefore, researchers often substitute nonexperimental (NX) or "quasi-experimental" methods, in which researchers use treatment and comparison groups but do not randomly assign subjects to the groups.1 NX methods are
Steven Glazerman is a senior researcher at Mathematica Policy Research. He has designed and conducted numerous evaluations of social programs, using both experimental and nonexperimental methods. His recent and current work includes evaluations of such programs as Job Corps, Teach for America, the NIH Loan Repayment Program, and a federal initiative to provide supportive housing for people with severe mental illness.
DOI: 10.1177/0002716203254879
ANNALS, AAPSS, 589, September 2003
THE ANNALS OF THE AMERICAN ACADEMY
less intrusive and sometimes less costly than controlled experiments, but their validity rests on untestable assumptions about the differences between treatment and comparison groups.

Recently, a growing number of case studies have tried to use randomized experiments to validate NX methods. To date, this growing literature has not been integrated in a systematic review or meta-analysis. The most comprehensive summary (Bloom et al. 2002) addressed the portion of this literature dealing with mandatory welfare programs. However, efforts to put the quantitative bias estimates from these studies in a common metric and combine them to draw general lessons have been lacking.

This article reports on a systematic review of such replication studies to assess the ability of NX designs to produce valid impacts of social programs on participants' earnings.2 Specifically, this article addresses the following questions:
* Can NX methods approximate the results from a well-designed and well-executed experiment?
* Which NX methods are more likely to replicate impact estimates from a well-designed and well-executed experiment, and under what conditions are they likely to perform better?
* Can averaging multiple NX impact estimates approximate the results from a well-designed and well-executed experiment?
The answers to these questions will help consumers of evaluation research, including those who conduct literature reviews and meta-analyses, decide whether and how to consider NX evidence. They will also help research designers decide, when
Dan M. Levy is a researcher at Mathematica Policy Research, where he has been involved in evaluations of several social programs, including the 21st Century Community Learning Centers and the Welfare-to-Work Grants program, and is developing a series of evaluation training forums for the U.S. Centers for Disease Control and Prevention. He is also an adjunct faculty member at Georgetown University's Public Policy Institute, where he teaches quantitative methods.
David Myers is vice president of human services research and a senior fellow at Mathematica Policy Research. He specializes in design and implementation of random assignment and quasi-experimental evaluations of education programs. He has directed policy analyses of elementary, secondary, postsecondary, and adult education issues, including a study of workplace literacy programs for the U.S. Department of Education. He currently leads a large, multisite, random assignment study to evaluate Upward Bound, a federal program to help low-income students attend college. He serves on the advisory panel of the What Works Clearinghouse sponsored by the U.S. Department of Education and is a co-convener of the Campbell Collaboration's Quasi-Experimental Design Group.
NOTE: This research was supported by grants from the William and Flora Hewlett Foundation and the Smith Richardson Foundation; however, the conclusions do not necessarily represent the official position or policies of the Hewlett Foundation or the Smith Richardson Foundation. The authors thank Harris Cooper, Phoebe Cottingham, Allen Schirm, Jeff Valentine, and participants of workshops held by the Campbell Collaboration, Child Trends, Mathematica, and the Smith Richardson Foundation. Also, we are grateful to the authors of the studies that we included in this review, many of whom spent time answering our questions and providing additional data.
NONEXPERIMENTAL VERSUS EXPERIMENTAL ESTIMATES
random assignment is not feasible, whether there are conditions that justify a NX
research design.
Between- and within-study comparisons
Researchers use two types of empirical evidence to assess NX methods:
between-study comparisons and within-study comparisons (Shadish 2000). Thisarticle synthesizes evidence from within-study comparisons, but we describe
between-study evidence as background.
Between-study comparisons. Between-study comparisons look at multiple studies that use different research designs and study samples to estimate the impact of the same type of program. By comparing results from experimental studies with those of NX ones, researchers try to derive the relationship between the design and the estimates of impact. Examples include Reynolds and Temple (1995), who compared three studies; and Cooper et al. (2000, Table 2), the National Reading Panel (2000, chapter I, Tables 6-7), and Shadish and Ragsdale (1996), who all compared dozens or hundreds of studies by including research design variables as moderators in their meta-analyses. These analyses produced mixed evidence on whether quasi-experiments produced higher or lower impact estimates than experiments.

An even more comprehensive between-study analysis by Lipsey and Wilson (1993) found mixed evidence as well. For many types of interventions, the average of the NX studies gives a slightly different answer from the average of the experimental studies, while for some, it gives a markedly different answer. The authors found seventy-four meta-analyses that distinguished between randomized and nonrandomized treatment assignment and showed that the average effect sizes for the two groups were similar: 0.46 of a standard deviation from the experimental designs and 0.41 from the NX designs. But such findings were based on averages over a wide range of content domains, spanning nearly the entire applied psychology literature. Graphing the distribution of differences between random and nonrandom treatment assignment within each meta-analysis (where each one pertains to a single content domain), they showed that the average difference between findings based on experimental versus NX designs was close to zero, implying no bias. But the range extended from about -1.0 standard deviation to +1.6 standard deviations, with the bulk of differences falling between -0.20 and +0.40. Thus, the between-study evidence does not resolve whether differences in impact estimates are due to design or to some other factor.
Within-study comparisons. In a within-study comparison, researchers estimate a program's impact by using a randomized control group and then reestimate the impact by using one or more nonrandomized comparison groups. We refer to these comparisons, described formally below, as "design replication" studies. The nonrandomized comparison groups are formed and their outcomes adjusted by using statistical or econometric techniques aimed at estimating or eliminating selection bias. Design replication studies can use multiple comparison groups or
the same comparison group with multiple sample restrictions to examine the effect of different comparison group strategies. The NX estimate is meant to mimic what would have been estimated if a randomized experiment had not been conducted. If the NX estimate is close to the experimental estimate, then the NX technique is assumed to be "successful" at replicating an unbiased research design.

Within-study comparisons make it clear that the difference in findings between methods is attributable to the methods themselves rather than to investigator bias, differences in how the intervention was implemented, or differences in treatment
setting. For this reason, within-study comparisons can yield relatively clean estimates of selection bias. On the other hand, it is more difficult to rule out the effects of chance for a given set of within-study comparisons. Therefore, general conclusions require, as in this article, several within-study comparisons in a variety of contexts.
Design replication to estimate bias
The current review differs from standard meta-analysis because the "effect size" of interest is not the impact of some intervention on a given outcome but the discrepancy between experimental and NX impact estimates. This, we argue, is itself an estimate of the bias. Bias can never be directly observed, because the true impact, θ, is not known. This review includes two equivalent types of studies that allow us to estimate the bias empirically. The first type presents up to K NX estimators, θ̂_k, of the impact, where k = 1, ..., K, and one experimental estimate, θ̂_0, such that E[θ̂_0] = θ. The second type compares the average outcome for a control group, Ȳ_0, with the (adjusted) average outcome, Ȳ_k, for some comparison group based on NX method k. The relationship between these variables is shown in equations (1) and (2), where Ȳ_T represents the average outcome for the treated group and B(θ̂_k) is the bias.

θ̂_k = Ȳ_T − Ȳ_k    (1)

θ̂_0 = Ȳ_T − Ȳ_0    (2)
Using these estimates, we can estimate the bias associated with each of the k estimators, defined as B(θ̂_k) = E[θ̂_k − θ]. Since the true parameter is not observable, we estimate the bias as the difference between the NX and experimental estimators. Subtracting equation (2) from equation (1) yields two forms of the bias estimate, B̂(θ̂_k), corresponding to the two types of reporting formats discussed above:

B̂(θ̂_k) = θ̂_k − θ̂_0 = Ȳ_0 − Ȳ_k    (3)

Thus, the two types of studies are equivalent, even though the latter type does not use information from the treatment group.
If the experiment is well executed, then the estimated bias should itself be unbiased, as shown in equation (4).

E[B̂(θ̂_k)] = E[θ̂_k] − E[θ̂_0] = E[θ̂_k − θ] = B(θ̂_k)    (4)
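The equivalence of the two reporting formats can be checked numerically. The sketch below computes the bias estimate both ways, as a difference of impact estimates and as a difference of control versus comparison outcomes; the earnings figures are fabricated for illustration, and only the algebra comes from equations (1) through (3).

```python
from statistics import mean

def bias_estimate(y_treat, y_control, y_comparison):
    """Compute the NX bias estimate in its two equivalent forms."""
    theta_exp = mean(y_treat) - mean(y_control)       # experimental impact, eq. (2)
    theta_nx = mean(y_treat) - mean(y_comparison)     # NX impact, eq. (1)
    form_1 = theta_nx - theta_exp                     # difference of impact estimates
    form_2 = mean(y_control) - mean(y_comparison)     # control vs. comparison outcomes
    return form_1, form_2                             # eq. (3): both are the same

# fabricated annual earnings (dollars), for illustration only
treat = [12000, 9000, 15000]       # randomized treatment group
control = [10000, 8000, 12000]     # randomized control group
comparison = [9000, 11000, 13000]  # adjusted NX comparison group

f1, f2 = bias_estimate(treat, control, comparison)
print(f1, f2)  # both forms give -1000
```

Note that the second form never touches the treatment group, which is why studies reporting only control-versus-comparison contrasts still qualify for the review.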
The goal of the analysis in this review is to model B(θ̂_k) as a function of the characteristics and context of the study, the estimator, and the intervention whose impact is being estimated. We recognize an important practical limitation in estimating such models, which is that the observed bias estimates vary not only because of the performance of the NX method (in reducing selection bias) and other contextual variables already noted but because of random sampling error in both the experimental and NX estimators. This sampling variance makes it difficult to judge when bias estimates are large enough or spread out enough to be evidence that a NX method has failed. Therefore, we have refrained throughout the analysis from making general statements that go beyond the collection of case studies we reviewed.
Data and Method
In recent years, the number of design replication studies has been growing to the point where it is now possible to begin synthesizing the results to look for patterns. This article draws on such a synthesis of the design replication literature, focusing on studies that used earnings as an outcome.3 The rest of this section describes the methods we used to assemble the design replication studies, construct the data set, and conduct the analysis.
Inclusion criteria and search strategy
To be included in the review, a study had to meet the following criteria:

* A randomized control group was used to evaluate a program, and a comparison group was available for computing at least one NX estimate of the same impact.
Because some studies estimated bias directly by comparing comparison and control groups, the presence of a treatment group is not required.
* The experimental-NX comparison was based on estimates from the same experiment.
This criterion excludes the between-study comparisons described in Section I.
* The experimental and NX estimates pertained to the same intervention in the same sites.
This criterion excludes, for example, a study of education programs in Bolivia
(Newman et al. 2001), which compared findings from an experimental design in
one region with findings from a NX design in another. Such a study confounds
regional differences with differences in study design.
* The intervention's purpose was to raise participants' earnings. This criterion restricts our focus to programs that provide job training and employment services.4
The search process produced dozens of candidate studies. We narrowed them
down to thirty-three for closer examination, and determined that twelve, listed in
Table 1, met the search criteria. The twelve studies correspond to nine interventions; four of these studies addressed the same intervention, the National Supported Work Demonstration (NSW). All of the interventions involved job training or employment services, such as job search assistance or vocational rehabilitation, and participation was mandatory in about half of them. In terms of location, three interventions were single-site programs (in San Diego, California; Riverside, California; and Bergen, Norway); one was a multisite program in a single state (Florida); and the remaining five were multistate in the United States. Half of the interventions were studied in the 1990s; only one (NSW) was studied before 1980. Seven of the studies appeared in peer-reviewed journals or in books, three are final reports of government contractors, and two are working papers or unpublished manuscripts.
The quality of the evidence in these studies, in particular the quality of the experiment, is critical to our analysis. The use of design replication as a validation exercise assumes that the experimental estimators in the studies are themselves unbiased.5 Common threats to the validity of the experimental estimator include differential attrition or nonresponse, randomization bias, spillover effects, substitution bias, John Henry effects, and Hawthorne effects.6 Bias could also arise from nonuniform collection of data from treatment and control groups and from assignments that were not truly random. Noncompliance with treatment assignment, even if monitored and documented, can threaten an experiment's ability to answer interesting policy questions.
interesting policy questions.To evaluate the validity of the experimental estimates, we assessed the nine
experiments in our review and found them to be of generally high quality. Most
were well funded and were carried out by research organizations with established
track records in random assignment and data collection. The Manpower Demon-
TABLE 1
DESIGN REPLICATION STUDIES WITH EARNINGS AS AN OUTCOME

Study | Intervention | Years of Intervention | Program Mandatory? | Geographic Location
Lalonde (1986) | National Supported Work Demonstration (NSW) | 1976-77 | No | National, 10 sites
Fraker and Maynard (1987) | National Supported Work Demonstration (NSW) | 1976-77 | No | National, 10 sites
Dehejia and Wahba (1999) | National Supported Work Demonstration (NSW) | 1976-77 | No | National, 10 sites
Smith and Todd (forthcoming) | National Supported Work Demonstration (NSW) | 1976-77 | No | National, 10 sites
Hotz, Imbens, and Mortimer (1999) | Saturation Work Initiative Model (SWIM) | 1982-85 | Yes | San Diego, CA
Bell et al. (1995) | Aid to Families with Dependent Children (AFDC) Homemaker-Home Health Aide Demonstration | 1984-85 | No | AR, KY, NJ, NY, OH, SC, TX
Heckman et al. (1998) | Job Training Partnership Act (JTPA) | 1985-89 | No | Corpus Christi, TX; Fort Wayne, IN; Jersey City, NJ; Providence, RI
TABLE 1 (continued)

Study | Intervention | Years of Intervention | Program Mandatory? | Geographic Location
Hotz, Imbens, and Klerman (2000) | Greater Avenues for Independence (GAIN) | 1988-90 | Yes | Riverside, CA
Bloom et al. (2002) | National Evaluation of Welfare to Work Strategies (NEWWS) | 1991-94 | Yes | Oklahoma City, OK; Riverside, CA; Detroit, MI; Grand Rapids, MI; Portland, OR
Bratberg, Grasdal, and Risa (2002) | Bergen Experiment (vocational rehabilitation) | 1994 | No | Bergen, Norway
Olsen and Decker (2001) | Job Search Assistance demonstration (JSA) | 1995 | Yes | Florida, 10 counties
Gritz and Johnson (2001) | Job Corps | 1995-97 | No | National, 110 sites

NOTE: OLS = ordinary least squares regression; DD = difference-in-differences; SC = econometric selection model.
stration Research Corporation (MDRC) oversaw random assignment in four of the experiments; Abt Associates oversaw two; and Mathematica Policy Research (MPR) oversaw two. The remaining experiment was overseen by university-based researchers. Because details of the experimental designs and their implementation were not reported in all the replication studies, we retrieved background reports and methodological appendixes, examined nonresponse analyses, and corresponded with researchers. We concluded that most of the experiments had relatively low crossover and attrition rates and that the attrition and nonresponse did
not appear to be related to treatment status in a way that would threaten the conclusions' validity.
Protocol and coding
Once the studies were assembled, we followed a procedure laid out in a formal protocol (Glazerman, Levy, and Myers 2002) to extract data from the source studies and code them for analysis. For example, the coding form had questions about the source of the comparison group used for each NX estimator in each study. Two coders read the twelve studies and extracted all the information needed for the analysis. They coded two studies together to ensure a consistent understanding of the coding instrument and process. Then, each coded a subset of the rest, with ample consultation built into the process to increase coding accuracy (see Glazerman, Levy, and Myers [2002] for details on this and other aspects of the research synthesis methods). We also contacted authors of nearly every source study to obtain clarification and, sometimes, additional data. Further details of the variables that were coded are mentioned below.
Analysis methods
The goal of our analysis is to determine how selection bias varies with the type of estimator employed, the setting, and the interaction between the setting and the type of estimator. To answer this, we model B(θ̂_jk), the bias associated with estimator k, as a function of the characteristics and context of the study (indexed by j) and
its intervention, captured in a vector labeled Z, and the characteristics of the estimator itself, captured in a vector labeled W.

|B̂(θ̂_jk)| = f(Z_j, W_k, Z_j·W_k)    (5)

We use the absolute value of B̂(θ̂_jk) on the left-hand side of the equation because a researcher or research synthesist wants to choose designs to minimize the bias, whatever its direction. An interaction between study-level and estimator-level variables is included to capture the interplay between method and context.
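To make the role of the interaction term concrete, the sketch below fits a saturated version of equation (5) with one binary study-level variable Z and one binary estimator-level variable W. This is our own two-variable illustration, not the authors' actual specification: with binary regressors the saturated model reproduces the four cell means, so the interaction coefficient is a difference-in-differences of cell means and can be computed directly. All numbers are fabricated.

```python
from collections import defaultdict
from statistics import mean

def saturated_fit(rows):
    """Fit |B| = a + b*Z + c*W + d*Z*W for binary Z and W via cell means."""
    cells = defaultdict(list)
    for z, w, abs_bias in rows:
        cells[(z, w)].append(abs_bias)
    m = {cell: mean(values) for cell, values in cells.items()}
    a = m[(0, 0)]                                      # baseline cell mean
    b = m[(1, 0)] - m[(0, 0)]                          # effect of Z when W = 0
    c = m[(0, 1)] - m[(0, 0)]                          # effect of W when Z = 0
    d = (m[(1, 1)] - m[(1, 0)]) - (m[(0, 1)] - m[(0, 0)])  # interaction
    return a, b, c, d

# fabricated data: Z = 1 if the program is mandatory, W = 1 if the
# estimator uses matching; outcome = absolute bias in annual earnings ($)
rows = [(0, 0, 1200), (0, 0, 1000), (1, 0, 900), (1, 0, 700),
        (0, 1, 800), (0, 1, 600), (1, 1, 300), (1, 1, 100)]

a, b, c, d = saturated_fit(rows)
# intercept 1100, Z effect -300, W effect -400, interaction -200
```

In this fabricated example, matching is associated with lower absolute bias everywhere (c and d are negative), and more so in mandatory programs; the review's actual model includes many more Z and W indicators and is fit by regression.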
One might expect that some types of NX design perform better than others and that some designs are more appropriate under certain study conditions. To test this, it is important to describe each type of NX estimator. Of the many classification schemes, the most commonly used are those given by Campbell and Stanley (1966) and Cook and Campbell (1979). Alternative formulations by economists such as Heckman and Hotz (1989) and Heckman et al. (1998) are also useful for categorizing methods in a general way; however, we prefer to avoid forcing the methods into mutually exclusive categories, because many of the estimators used multiple approaches. Instead, we describe each estimator by a vector of characteristics that pertain to (1) the source of the comparison group and (2) the analytic techniques used to adjust for differences between the comparison group and the treatment population.

Because there is a limited range of NX designs assessed in the design replication literature, we must use very gross indicators to categorize NX designs. For the source of the comparison group, we coded three binary indicator variables: one for whether the comparison group is drawn from a national data set, such as the Survey of Income and Program Participation (SIPP); one for whether the comparison group is based on sample members from the same geographic area as the treated population; and one for whether the comparison group is formed by using the randomized control group from a different experiment. For the type of statistical adjustment, we used four variables to broadly indicate (1) whether background variables were used as covariates in a regression model; (2) whether matching methods, such as stratification on an estimated propensity score, were used; (3) whether the estimator used preintervention measures of the outcome (examples include difference-in-differences models, fixed-effect models, or even regression or matching models using baseline earnings); and (4) whether the estimator was based on an econometric sample selection model. The selection modeling indicator (4) would be set to one, for example, if the estimator used the inverse Mills' ratio or instrumental variables with a set of variables that were included in a model of program participation but excluded from the model of earnings determination.
We constructed other indicators to identify conditions under which quasi-experiments were potentially more likely to replicate experiments. One set of indicators measured whether a specification test was conducted, and if so, whether the test would have led the researcher to avoid the estimator a priori. Another set of indicators measured whether the background variables used in the regression or
in the matching procedure were detailed, as with a survey, or sparse, as is typically the case with administrative data. Variables included in the W vector include the experiment's sample size, grouped into categories for small, medium, and large; and the program's estimated effectiveness (effective, ineffective, or indeterminate).

To estimate the average bias reduction associated with these design and context variables, we used both bivariate analyses (tabulations) and multivariate analyses (regression). Because such a small collection of studies limits the degrees of freedom, we expect to find the data consistent with several competing explanations for why the estimated bias is high or low. The bivariate analyses use sample weights to account for the unequal sample sizes of the source studies, although we found that
weighting made little difference to the qualitative findings. Similarly, for the multivariate analyses, we tried alternative aggregation procedures to deal with lack of statistical independence among bias estimates from a single study. To minimize artificial replication, the regression results in the next section use the average of the absolute value of the bias estimates associated with each unique combination of design variables. For example, if one study produced eight quarterly bias estimates corresponding to impacts after random assignment, we aggregated them into a single estimate for the two-year period, as long as the policy interpretation for the two-year period made sense.

A constraint on more detailed analyses than those just described was dictated by
having just twelve replication studies. While many of these studies assessed multiple NX estimators, resulting in more than one thousand bias estimates, the overall diversity of designs was not as comprehensive a catalogue of quasi-experimental methods as those described by Cook and Campbell (1979) and others. Among those methods that were assessed, not every method was assessed in every setting. As more empirical work comes to light, more sophisticated analysis may be possible.
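The aggregation step described above interacts with the sign of the bias: averaging signed quarterly estimates before taking an absolute value is not the same as averaging their absolute values, because positive and negative biases can partly cancel. The numbers below are fabricated for illustration.

```python
from statistics import mean

# fabricated quarterly bias estimates (dollars) for one design combination
quarterly_bias = [500, -200, 300, 100, -50, 250, 400, -100]

# aggregate the signed estimates into one period-level estimate,
# then take the absolute value
abs_of_mean = abs(mean(quarterly_bias))

# take absolute values quarter by quarter, then average
mean_of_abs = mean([abs(b) for b in quarterly_bias])

print(abs_of_mean, mean_of_abs)  # 150 vs 237.5: signed biases partly cancel
```

The gap between the two summaries (150 versus 237.5 here) is one way that cancellation can understate how far individual NX estimates sit from the experimental benchmark, which is why the choice of aggregation procedure matters.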
Results
Our synthesis of design replication studies takes an important first step toward answering the research questions of this article. However, interpretation of the evidence remains a challenge. Even among authors of the studies we reviewed, there was no consensus on how to judge the size of differences between experimental and NX impact estimates. The authors differed in the extent to which they probed the statistical and policy significance of their results. Some focused narrowly on their own case studies; others made broader statements praising or condemning a NX method. Four studies concluded that NX methods performed well, four found evidence that some NX methods performed well while others did not, and four found that NX methods did not perform well or that there was insufficient evidence that they did perform well. A summary of their conclusions is given in the appendix. In this section, univariate analyses describe the range of bias estimates in the literature. Bivariate analyses then relate the absolute size of the bias to several explanatory factors. The multivariate analysis that follows uses regression to determine whether the different explanations of bias overlap and whether one predominates. Finally, we examine the distribution of the bias estimates to consider whether they cancel out across studies and whether their variation is due to true variation in the performance of NX methods or some other explanation.
Univariate analysis
From the twelve studies, we extracted 1,150 separate estimates of the bias, about 96 estimates per study. While some of the bias estimates were close to zero, some were very large, overestimating or underestimating annual earnings impacts by as much as $10,000 or more. Table 2 shows the bias estimates by study.
The definition of a "large" bias depends on the program and the policy decision at stake. However, for disadvantaged workers, even a $1,000 difference in annual earnings is important. For example, in a benefit-cost study of Job Corps (McConnell and Glazerman 2001), a steady-state impact on annual earnings of about $1,200 was used to justify the program's expenditure levels, one of the highest per trainee (about $16,500) for any federal training program. A difference of $800 in the annual earnings impact estimate would have completely changed the study's outcome and might have led to a recommendation to eliminate rather than expand the annual $1.4 billion program. For programs, such as the Job Training Partnership Act (JTPA) and the various welfare-to-work programs captured in our data, where both the program costs and the impacts on earnings are likely much smaller, a difference of $1,000 or more can make a dramatic difference in the policy recommendation.

Another benchmark is the average earnings of control group members. In many of the studies we reviewed, the inflation-adjusted annual earnings of control group members was about $10,000, which includes zero earnings for nonworkers. Thus, a $1,000 bias would represent 10 percent of earnings, a substantial amount.
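The stakes can be made concrete with a small sketch: if a program is judged worthwhile only when its estimated annual earnings impact clears a break-even level derived from costs, a bias of a few hundred dollars can flip the verdict. The $900 break-even figure below is hypothetical, not a number from the Job Corps study:

```python
# Illustrative only: whether a bias in the impact estimate flips a
# pass/fail policy verdict. The break-even threshold is hypothetical;
# in a real benefit-cost study it would be derived from program costs.

def verdict(annual_impact, break_even):
    return "expand" if annual_impact >= break_even else "eliminate"

experimental_impact = 1200   # e.g., a steady-state annual earnings impact
break_even = 900             # hypothetical break-even impact

biased_estimate = experimental_impact - 800   # an $800 understatement

truth = verdict(experimental_impact, break_even)
biased = verdict(biased_estimate, break_even)
```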
NONEXPERIMENTAL VERSUS EXPERIMENTAL ESTIMATES
TABLE 2
DESCRIPTIVE STATISTICS OF BIAS ESTIMATES BY STUDY

Bias Estimates (Annual Earnings in 1996 Dollars)

Study                                 Range of Estimates    Average of   Average of Absolute Value   Number of   Number of Types
                                                            Estimates    of Types of Estimates(a)    Estimates   of Estimates

Bell et al. (1995)                       -723 to +5,008          661              813                    54            3
Bloom et al. (2002)                   -21,251 to +12,215         498            1,114                   564            8
Bratberg, Grasdal, and Risa (2002)      -18,702 to -654       -4,826            2,907                    13            5
Dehejia and Wahba (1999)               -1,939 to +1,212          173            4,163                    40            4
Fraker and Maynard (1987)                -3,673 to +871         -751            1,103                    48            3
Gritz and Johnson (2001)               -1,091 to +3,189          497              780                    48            2
Heckman et al. (1998)                  -7,669 to +8,154         -423            3,273                    45           17
Hotz, Imbens, and Mortimer (1999)      -1,682 to +2,192         -128              585                    36            2
Hotz, Imbens, and Klerman (2000)         -1,248 to +438         -174              371                    64            6
Lalonde (1986)                         -5,853 to +4,143         -636            2,849                   112            8
Olsen and Decker (2001)                -1,548 to +1,107         -363            1,397                    10            5
Smith and Todd (forthcoming)          -11,743 to +4,829       -1,655            4,019                   116            6
Total                                 -21,251 to +12,215         -637           2,325                 1,150           69

a. The average of the absolute value of the bias is used to compare different research designs. Therefore, we calculate it by first averaging bias within design type and then averaging the results across the sixty-nine design types.
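The two-stage aggregation described in note a can be sketched directly; the design-type labels and bias values below are hypothetical:

```python
# Two-stage average of |bias|, as in Table 2, note a: first average
# within each design type, then average those type means. Hypothetical data.
from collections import defaultdict

def avg_abs_bias_by_type(estimates):
    """estimates: iterable of (design_type, bias) pairs."""
    by_type = defaultdict(list)
    for design_type, bias in estimates:
        by_type[design_type].append(abs(bias))
    type_means = [sum(v) / len(v) for v in by_type.values()]
    return sum(type_means) / len(type_means)

sample = [("matching", 500), ("matching", -1500), ("regression", 2000)]
# within-type means: matching -> 1,000, regression -> 2,000
result = avg_abs_bias_by_type(sample)
```

This keeps a design type with many reported estimates from dominating the comparison across designs.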
As mentioned earlier, one should interpret the statistics in Table 2 with caution. The average of the bias estimates can be substantially influenced by outliers reflecting small samples or unrealistic estimators. However, the average does indicate whether the estimates are centered on zero and whether they tend to overestimate or underestimate impacts relative to the experimental benchmark. Eight of the twelve studies in our analysis showed that NX methods tended to understate impacts; four showed the opposite. All the studies included bias estimates that were both negative and positive, except for the one by Bratberg, Grasdal, and Risa (2002), in which all the econometric and matching techniques had negative bias estimates. As one would expect, the study with the greatest number of estimates
(Bloom et al. 2002) found the broadest range of estimates, with very large positive and negative values.7

The absolute value of the bias provides a more direct measure of the performance of the NX estimator, where a smaller value always represents better performance. With that measure, the typical NX estimate of impact on annual earnings deviates from the corresponding experimental estimate by about $2,000. The average absolute value in any one study ranged from twice that amount, as in the attempts by Dehejia and Wahba (1999) and by Smith and Todd (forthcoming) to replicate the findings of the NSW experiment using national data sets, to less than $600 per year, as in the two studies by Hotz and colleagues (Hotz, Imbens, and Klerman 2000; Hotz, Imbens, and Mortimer 1999).
Bivariate analysis

To begin to explain the range of NX bias, we conducted simple bivariate analyses examining the relationship between several possible explanatory variables and the size of the bias. The candidate variables are those that describe the quasi-experimental approach and the study in which it was implemented, including the source of the comparison group, the statistical method, and the quality of the data. For each value of an explanatory variable, we computed the average of the absolute value of the bias for all NX estimators with that value (see Table 3). For the entire sample of studies, the unweighted average of the absolute value of the bias associated with using NX methods was about $1,500. However, this was based on all 1,150 bias estimates without aggregating to account for nonindependence of the estimates or unequal sample size. Therefore, we constructed two sets of weights. The first (weight 1) gives more emphasis to estimates based on studies that had larger samples as measured by the number of control group members in the randomized experiment; the other (weight 2) multiplies the sample-size weight by a factor inversely proportional to the number of estimates for a given sample. For example, if a researcher used 10 different methods to estimate the same impact for
TABLE 3
AVERAGE BIAS BY CHARACTERISTICS OF ESTIMATOR

Average of Absolute Value of Bias Estimate (Annual Earnings in 1996 Dollars)
Columns: Unweighted; Weight 1 (Sample Size); Weight 2 (Sample Size, Frequency)

Explanatory Variable and Category
Entire sample
Source of comparison group(a)
  Same labor market
  Control group from another site
  National data set
Statistical method: general(a)
  Regression
  Matching
  Selection correction or instrumental variables
  None, simple mean differences
Statistical method: type of matching
  Propensity score matching: one to one
  Propensity score matching: one to many
  Other matching technique
  Did not use matching
Statistical method: specification test result
  Specification not recommended
  Specification recommended
  No test conducted
Quality of background data: regression
  Poor set of controls
  Extensive set of controls
  Very extensive set of controls
  Did not use regression
Quality of background data: matching
  Poor set of covariates
  Extensive set of covariates
  Very extensive set of covariates
  Did not use matching
Quality of background data: overall
  Used prior earnings
  Did not use prior earnings
Experimental sample size
  Small (<500 controls)
  Medium (500 to 1,500 controls)
  Large (>1,500 controls)
Experimental impact finding
  Program is effective
  Program is ineffective
  Indeterminate
Number of observations

1,477  1,101  1,110
932  843
2,817
821  902
2,409
1,010  828
2,071  1,323
1,101  1,143  2,251  2,791
1,047  1,181  1,231  1,750
4,027  1,155  1,247
739  852
1,037  1,288
3,165  857
1,047
1,438  1,030  1,008  1,412
2,336  1,228  1,026  2,431
1,752  1,392  1,113  1,750
1,224  2,662
2,533  1,080
819
2,089
1,197  924
1,290  951
802  1,288
1,040  1,379
2,378  1,001
819
1,288
920  1,021
1,150  1,150
885  814
2,131
958  924
1,412  1,515
744  929
1,297  1,311
2,870  1,103
988
1,590  1,036  1,016  1,589
1,313  1,330
920  1,311
1,003  1,591
2,728  960  800
1,276
1,105  911
1,150

a. Categories are not mutually exclusive or exhaustive.
one site or subgroup, then the corresponding bias estimates received a weight of 1/10 times the sample size. Although the results vary somewhat by type of weight used, the qualitative conclusions drawn from them do not, so we focus on the results in the last column, which account for sample size and frequency of sample. Both weights reduce the average absolute value of the bias to about $1,100.
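The two weighting schemes can be written out explicitly. Weight 1 is the experimental control-group size; weight 2 divides that by the number of bias estimates computed on the same sample, so ten estimators applied to one site each carry one-tenth of its sample-size weight. A sketch with hypothetical inputs:

```python
# Weighted averages of |bias| under the two schemes described in the text.
# weight 1: control-group sample size; weight 2: sample size divided by
# the number of estimates sharing that sample. Numbers are hypothetical.

def weighted_mean(values, weights):
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)

# (abs_bias, control_n, n_estimates_on_same_sample)
rows = [
    (800, 1000, 10),   # ten estimators applied to one large sample
    (2400, 200, 1),    # a single estimator on a small sample
]
w1 = [n for _, n, _ in rows]
w2 = [n / k for _, n, k in rows]
abs_bias = [b for b, _, _ in rows]

avg_w1 = weighted_mean(abs_bias, w1)
avg_w2 = weighted_mean(abs_bias, w2)
```

Under weight 2, the large sample no longer dominates simply because its researcher reported many estimates on it.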
Table 3 shows that some factors are indeed associated with higher and lower bias. As one would expect, the source of the comparison group has a role. The average bias was lower (less than $900) when the comparison group came from the same labor market as the treated population or was composed of randomized control group members from a separate experiment and higher (more than $2,000) when the comparison group was drawn from a national data set. This finding suggests that, while convenient, publicly available data sets at the national level are not the best for evaluating training or welfare programs. Aspects of the statistical method also were associated with the size of the bias.
There was little difference between regression and matching methods overall, but some matching methods performed better than others. In particular, one-to-one propensity score matching had lower bias than other propensity score methods or non-propensity-score matching. Five of the studies (Lalonde 1986; Heckman et al. 1998; Gritz and Johnson 2001; Bratberg, Grasdal, and Risa 2002; and Bloom et al. 2002) included some form of econometric selection correction procedure such as the Heckman two-step estimator or instrumental variables estimator, but these methods performed poorly on average, about as poorly as using no method at all.

Rather than examine all quasi-experimental estimators, it may be more productive to focus on the performance of those one would expect (in the absence of a randomized experiment) to be the best ones. To make such a priori predictions, researchers use specification tests, as illustrated by Heckman and Hotz (1989) in their reanalysis of Lalonde's (1986) replications of the NSW experiment. The typical specification test applies the NX estimator to outcome data from before the intervention. If the estimated impacts, which should be zero since nobody has been exposed to the intervention, are larger than would be expected by chance, then the estimator is rejected, and its use is not recommended. Many of the design replication studies that we reviewed did not conduct specification tests. Among those that did, the average absolute bias of rejected estimators was nearly $2,900, almost three times the bias of recommended ones. This suggests, consistent with the findings of Heckman and Hotz, that specification testing, where feasible, can help eliminate poor-performing estimators. The estimated bias of the recommended estimators, however, was still large in absolute value: more than $1,000.
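The preprogram specification test can be sketched as follows: apply the estimator to an outcome measured before anyone was treated, and reject the estimator if the pseudo-impact, which should be zero, is statistically significant. This minimal version uses a simple two-sample t-statistic for a mean-difference estimator; the tests in the reviewed studies are more elaborate:

```python
# Pre-program specification test: estimate a pseudo-impact on earnings
# measured BEFORE the intervention. The true effect is zero, so a
# pseudo-impact significantly different from zero flags a biased estimator.
import math

def pseudo_impact_t(pre_treat, pre_comp):
    """Two-sample t-statistic for the pre-period difference in means."""
    n1, n2 = len(pre_treat), len(pre_comp)
    m1 = sum(pre_treat) / n1
    m2 = sum(pre_comp) / n2
    v1 = sum((x - m1) ** 2 for x in pre_treat) / (n1 - 1)
    v2 = sum((x - m2) ** 2 for x in pre_comp) / (n2 - 1)
    return (m1 - m2) / math.sqrt(v1 / n1 + v2 / n2)

def recommend(pre_treat, pre_comp, critical=1.96):
    """Recommend the estimator only if the pseudo-impact is insignificant."""
    return abs(pseudo_impact_t(pre_treat, pre_comp)) < critical

# Hypothetical pre-program earnings: the comparison group earns far more,
# so a naive mean-difference estimator should be rejected.
treat = [8000, 9000, 8500, 9500]
comp = [14000, 15000, 14500, 15500]
ok = recommend(treat, comp)
```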
Some authors (Heckman et al. 1998; Smith and Todd forthcoming) have suggested that data quality may be as important as the research design. By categorizing estimators by the richness of the background variables (used as covariates in a regression or as matching variables) to explain the size of the bias, we found some support for this claim.8 The results in Table 3 suggest that the estimators based on a more extensive set of variables in a regression or matching method had lower bias.
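A minimal sketch of the one-to-one propensity score matching compared above. Here the propensity scores are taken as given; in practice they would be estimated from the background covariates just discussed, and all numbers are hypothetical:

```python
# One-to-one nearest-neighbor matching on a propensity score (with
# replacement), followed by the matched mean difference in outcomes.

def match_one_to_one(treated, comparison):
    """treated/comparison: lists of (score, outcome). Returns the matched
    impact estimate: mean over treated of (own outcome - match outcome)."""
    diffs = []
    for score, outcome in treated:
        match = min(comparison, key=lambda c: abs(c[0] - score))
        diffs.append(outcome - match[1])
    return sum(diffs) / len(diffs)

treated = [(0.8, 12000), (0.6, 10000)]
comparison = [(0.79, 11000), (0.62, 9800), (0.10, 20000)]
impact = match_one_to_one(treated, comparison)
```

Note that the dissimilar comparison unit (score 0.10) is never used as a match, which is exactly how matching discards comparison group members unlike the treated population.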
The most important variable to include in the variable set was prior earnings. Studies without it had a bias of about $1,600; those with it, $1,000.9
Finally, we found that the performance of NX methods was related to the sample size and direction of impacts for the experiment. Specifically, the NX methods more closely replicated the experiments when the randomized control groups were large and when the experiments did not show the program was effective. One possible explanation for the large average bias (more than $2,700) in small studies is that the experimental impacts were not precisely estimated, so the estimate of bias is also not precisely estimated. Another possible explanation is the size of the nonrandomized comparison group, which tends to be small when the control
group is small, so the larger estimated bias may reflect random noise in the NX estimate. Because the sample sizes of control and comparison groups are correlated, it is difficult to distinguish between these two stories. The relationship between the direction of the experimental impact and the size of bias suggests that a false positive finding (concluding from the NX evidence that a program works when it does not) may be more common than a false negative.

A limitation of this bivariate analysis is that the design elements listed in Table 3 are not independent. For example, a study that uses a national data set to select a comparison group is likely also to use a relatively poor set of controls; this means that the large average bias for studies that use a national data set could be reflecting a poor set of controls. It is difficult to distinguish these explanations. We therefore proceed with multivariate regression analysis to try to disentangle the factors associated with lower bias.
Multivariate analysis

To examine the effect of research design on bias, we estimated several regressions with the absolute value of the bias in annual earnings as the dependent variable and the design attributes as explanatory variables (see Table 4). As suggested earlier, other types of explanatory variables could also explain bias. However, we
have limited degrees of freedom, so we use a parsimonious model that includes indicator variables for each design replication study to proxy for all the measured and unmeasured characteristics that vary only at the study level. Because the regression models focus on NX design configurations, we did not weight the individual bias estimates. We averaged them within design types so that each design type would be represented no more than once for each study. This aggregation resulted in an analysis data set of sixty-nine bias estimates. The regression results are meant to be illustrative, because some of the design attributes are highly correlated with each other, the data set is very small, and the results depend on the regression specification used.
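A regression of this form, with |bias| for each design type as the dependent variable and 0/1 design attributes as regressors, can be sketched in a self-contained way. The data below are hypothetical stand-ins for the sixty-nine design types, and the solver is a bare-bones normal-equations routine (in practice one would use a statistical package):

```python
# OLS of |bias| on 0/1 design-attribute indicators, solved via the
# normal equations (X'X) beta = X'y with Gaussian elimination.

def ols(X, y):
    k = len(X[0])
    A = [[sum(row[i] * row[j] for row in X) for j in range(k)] for i in range(k)]
    b = [sum(row[i] * yi for row, yi in zip(X, y)) for i in range(k)]
    for i in range(k):                       # forward elimination with pivoting
        p = max(range(i, k), key=lambda r: abs(A[r][i]))
        A[i], A[p] = A[p], A[i]
        b[i], b[p] = b[p], b[i]
        for r in range(i + 1, k):
            f = A[r][i] / A[i][i]
            for c in range(i, k):
                A[r][c] -= f * A[i][c]
            b[r] -= f * b[i]
    beta = [0.0] * k
    for i in reversed(range(k)):             # back substitution
        beta[i] = (b[i] - sum(A[i][j] * beta[j] for j in range(i + 1, k))) / A[i][i]
    return beta

# columns: intercept, used_regression, used_matching (hypothetical data)
X = [[1, 0, 0], [1, 1, 0], [1, 0, 1], [1, 1, 1]]
y = [5000.0, 3400.0, 3600.0, 2100.0]         # |bias|, 1996 dollars
intercept, b_regression, b_matching = ols(X, y)
```

The intercept estimates the bias of an unadjusted comparison, and each coefficient estimates how much a design feature shifts that bias.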
Keeping these limitations in mind, we found the regressions largely confirm what one would expect. Outcomes for the various nonrandomized comparison groups available to evaluators are not good approximations to the counterfactual outcomes if left unadjusted. The intercept in the regression models shown in the odd-numbered columns of Table 4 represents the bias associated with raw mean differences, estimated to be in the range of $4,400 to $5,800 in annual earnings (see the first row). This coefficient is the expected bias in our sample, if one did not make any adjustments to a typical comparison group. In the regression models shown in the even-numbered columns, we include a separate intercept for each study, a study-level fixed effect described above.

The entries in the next two rows of Table 4 suggest that using background data as either covariates or matching variables is about equally effective at reducing bias. These techniques reduce the bias by about $3,100 to $3,600, once we account for the studies' fixed effects (column 6). The sensitivity of this result to the inclusion of fixed effects suggests that the relative performance of regression-based designs versus matching designs is confounded with contextual factors.
Combining methods is better than applying them individually. Models 5 and 6 include an interaction term with a positive coefficient, which suggests that the bias reduction from these two methods is not fully additive, although there is likely some increased benefit from their combination. In model 5, for example, the bias from raw differences in means, represented by the intercept, is $5,775. This value is reduced to $2,550, if only regression is used, and to $3,312, if only matching is used (holding comparison group variables fixed at the value of the omitted categories). If matching and regression are both used, they reinforce each other to reduce the bias to $1,038.
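This arithmetic follows mechanically from the model 5 coefficients implied by the quoted predictions: an intercept of 5,775, main effects that bring the bias to 2,550 (regression only) and 3,312 (matching only), and an interaction of +951 when both are used:

```python
# Predicted |bias| from the interacted regression described in the text
# (model 5). Main effects are derived from the quoted predictions rather
# than read directly off the table.

COEF = {
    "intercept": 5775.0,
    "regression": 2550.0 - 5775.0,     # -3,225
    "matching": 3312.0 - 5775.0,       # -2,463
    "regression_x_matching": 951.0,
}

def predicted_bias(use_regression, use_matching):
    bias = COEF["intercept"]
    if use_regression:
        bias += COEF["regression"]
    if use_matching:
        bias += COEF["matching"]
    if use_regression and use_matching:
        bias += COEF["regression_x_matching"]
    return bias
```

The positive interaction means the combined reduction (to $1,038) is smaller than the sum of the two separate reductions would imply, which is the "not fully additive" point in the text.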
Baseline measures of the outcome are important. The negative coefficients on the difference-in-differences indicator, which equals one if the estimator uses preintervention earnings in any way, show that using baseline measures of the outcome reduces bias, as reported in the literature. For the simpler models 2 and 4, difference-in-differences estimators reduce the bias by about $1,600 in annual earnings, a reduction slightly larger than that achieved with other estimators. The interaction terms of difference-in-differences with the regression and matching (see models 5 and 6) indicate that these methods are also partially offsetting.
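The difference-in-differences estimator referred to here subtracts each group's preintervention mean from its postintervention mean before comparing groups, so fixed differences in earnings levels drop out. A sketch with hypothetical earnings:

```python
# Difference-in-differences: the (post - pre) change for the treated
# group minus the same change for the comparison group.

def mean(xs):
    return sum(xs) / len(xs)

def diff_in_diff(treat_pre, treat_post, comp_pre, comp_post):
    return (mean(treat_post) - mean(treat_pre)) - (mean(comp_post) - mean(comp_pre))

treat_pre = [8000, 9000]      # treated group earns less at baseline...
treat_post = [11000, 12000]
comp_pre = [14000, 15000]     # ...so a raw post-period comparison misleads
comp_post = [15000, 16000]

impact = diff_in_diff(treat_pre, treat_post, comp_pre, comp_post)
```

A raw post-period comparison of these groups would yield -$4,000; differencing out the baseline gap gives $2,000 instead.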
TABLE 4
RESULTS SHOWING THE EFFECT OF NONEXPERIMENTAL APPROACH ON BIAS

Model Specification: 1, 2, 3, 4

Explanatory Variable
Intercept: 4,467*** (657); 4,687*** (1,231)
Statistical method
  Regression: -1,583** (729); -1,516** (705); -1,476** (675); -1,416** (706)
  Matching: -478* (715); -1,268* (794); -807* (692); -1,427** (799)
  Regression x Matching: 951 (1,422); 1,320 (1,484)
  Difference-in-Differences: -1,874 (763); -1,596 (816); -1,859 (718); -1,568 (813)
  Regression x Difference-in-Differences: 2,325 (1,455)
  Matching x Difference-in-Differences
Selection correction: 2,508* (1,248); 2,376 (1,305); 4,619 (1,048); 2,441 (1,299)
Comparison group strategy
  Geographic match: -387 (973); -646 (1,182)
  National data set: 1,145 (1,062); 1,695 (1,536)
  Control group from another site: -1,762 (1,011); NA
Study dummies included: No; Yes; No; Yes

NOTE: Dependent variable is the absolute value of the bias in annual earnings, expressed in 1996 dollars. All explanatory variables are dummy variables. Sample size is 69 bias estimate types.
*Significantly different from zero at the .10 level. **Significantly different from zero at the .05 level. ***Significantly different from zero at the .01 level, two-tailed test.
The one estimator that did not reduce bias at all, but in fact increased it, was the selection correction estimator, but this should be interpreted cautiously. Few estimates in our data were based on econometric methods such as the two-step estimator. Of these, one study (Bratberg, Grasdal, and Risa 2002) rejected the specification based on a hypothesis test but still reported the bias estimate, which was particularly large.10 Of the others, none produced a compelling justification for the exclusion restrictions that typically justify such an approach. An exclusion restriction is an assumption that some variable is known to influence participation in the program (selection into treatment) but not the outcome of interest (earnings).
program (selection into treatment) but not the outcome of interest (earnings).The use of a comparison group that is matched to the same labor market or geo-
graphic area reduced bias by about $600. Funders of evaluation research probably
preferto use
largenationaldatasets to evaluate
programs,because
secondary analy-ses are far less costly than new data collection. Our findings suggest that such a
strategy comes with a penalty, an increase of average bias by about $1,700 (Table 4,column 6).
We coded another comparison group strategy that determined whether the source was a control group from another study or another site. Several studies (for example, those by Hotz, Imbens, and Mortimer 1999; Hotz, Imbens, and Klerman 2000; and Bloom et al. 2002) compared the control group from one site to the control group from another site and labeled one as the nonrandomized comparison group. We included the "control group from another site" indicator variable in the regression primarily to distinguish those studies from others that used comparison groups more readily available to researchers, such as eligible nonapplicants (for example, Heckman et al. 1998) or individuals who applied to the program but were screened out (for example, Bell et al. 1995). One might argue that control groups are not available to most evaluators, so the more relevant bias estimates are the larger ones found when the "other control group" indicator equals zero.
The regression analysis described above is robust to the definition of the dependent variable. We conducted the same analysis using the signed value of the bias and found very similar results. Those results, available from the authors, show
FIGURE 1
DISTRIBUTION OF BIAS ESTIMATES FOR ALL TWELVE STUDIES
[Histogram; horizontal axis: Bias, Annual Earnings, 1996$, from -5,000 to 5,000]
that overall, the unadjusted bias is large and negative. The regressors representing design features increase the bias (toward zero) in much the same way that they decreased the absolute value of the bias as shown in Table 4. Other dependent variables can be used to further analyze the bias estimates. For example, we created two indicator variables, one for whether the NX impact estimate led to the same statistical inference and another for whether it led to the same policy conclusion. Constructing these variables required some additional information, such as the threshold value that would change the policy conclusion, but they allow us to include in a meta-analysis the results from a wider range of design replication studies, including those that focus on education interventions and those whose outcomes are not measured in earnings. The analyses based on these binary outcome variables are beyond the scope of the current article and will be presented in future work.
Aggregation of NX evidence

Now we turn to the third research question, whether averaging multiple NX impact estimates can approximate the results from a well-designed and well-executed experiment. The above discussion suggests that while some factors are associated with lower bias estimates, a single NX estimator cannot reliably replicate an experimental one. The inability to achieve reliable replication may be due to bias or to sampling error in either the experiment or the quasi-experiment. However, a possibility exists that a large enough group of NX estimates pertaining to different study sites, time periods, or interventions might replicate experiments on average. We therefore examined the extent to which positive and negative bias estimates cancel out. If they do, it would provide motivation for those who conduct literature reviews to be able to accumulate a large body of NX evidence, when experiments are scarce, to draw valid conclusions. A useful way to make this assessment is by examining the full distribution of bias estimates for various groupings, such as by intervention, by method, or for a collection of interventions, and looking for a pattern of estimates that tends toward zero.

FIGURE 2
DISTRIBUTION OF BIAS ESTIMATES FROM NATIONAL EVALUATION OF WELFARE TO WORK STRATEGIES (NEWWS) (BLOOM ET AL. 2002)
[Histogram; horizontal axis: Bias, Annual Earnings, 1996$, from -5,000 to 5,000]

The distribution of the 1,150 bias estimates from the twelve studies reviewed in this article provides a case where the bias estimates do appear to cancel out (Figure 1). The distribution is nearly centered on zero with a slight skew. The average bias was about -$600. Applying the weights described above brings the overall bias closer to zero, -$217; and removing the outliers and applying weights makes it even smaller, about -$97.11 This is a crude indicator, but it suggests, consistent with the work of Lipsey and Wilson (1993), that if enough NX studies are combined, the average effect will be close to what the experimental evidence would predict.12
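The sequence of overall averages quoted above (raw, then weighted, then weighted with outliers removed) corresponds to an aggregation like the following; the bias values, weights, and $10,000 trimming cutoff are hypothetical:

```python
# Aggregating signed bias estimates: raw mean, weighted mean, and a
# weighted mean after trimming outliers. Data and the trimming cutoff
# are hypothetical, not the article's 1,150 estimates.

def weighted_mean(values, weights):
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)

biases = [-12000, -800, -400, 200, 600, -1200]
weights = [1, 10, 10, 10, 10, 5]          # e.g., sample-size weights

raw_mean = sum(biases) / len(biases)
wtd_mean = weighted_mean(biases, weights)

kept = [(b, w) for b, w in zip(biases, weights) if abs(b) < 10000]
trimmed_mean = weighted_mean([b for b, _ in kept], [w for _, w in kept])
```

Each step pulls the average toward zero, mirroring the -$600, -$217, -$97 progression in the text, though without an experimental benchmark the trimming rule itself is a judgment call.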
Rarely, however, is the NX research used to answer such broad questions as
whether all programs are effective. Instead, we would like to identify dimensions
along which the bias begins to cancel out for more focused questions such as "What is the average impact of program X?" For the studies reviewed in this article, the average bias was sometimes close to zero (see Table 2) but often was still substantial.

FIGURE 3
DISTRIBUTION OF BIAS ESTIMATES FROM SUPPORTED WORK (SMITH AND TODD FORTHCOMING)
[Histogram; horizontal axis: Bias, Annual Earnings, 1996$, from -5,000 to 5,000]

Each of the studies in the review addressed a single intervention, although some assessed more NX estimators, analyzed more subgroups, had larger samples, or included more sites. The distribution of bias estimates within studies (particularly studies that use multiple subgroups, study sites, or time periods, in addition to multiple estimators) makes this clearer. Figures 2 and 3 display the distribution of bias estimates for two of the studies that examined the largest number of estimators. For the first study (Bloom et al. 2002), the bias estimates are centered roughly on zero, with an average of -$151; but for the second study (Smith and Todd forthcoming), they clearly are not, with an average of -$2,563. It is possible to remove outliers from the estimates reported by Smith and Todd (forthcoming) to achieve an average bias that is closer to zero, but identifying outliers without the benefit of a randomized experiment as a benchmark may be difficult. The within-study evidence from the other studies (Table 2) suggests that the average bias across all methods, subgroups, and time periods is sometimes positive, sometimes negative, and often still in the hundreds of dollars. This suggests that a mechanistic application of a large number of NX estimators might improve the inference one could draw from such evidence, but not in a predictable way. Whether the average bias, properly weighted within and between studies, is really close enough to zero for policy makers, and whether the bias cancels out within a narrower domain of research, are questions that we plan to address as more design replication studies are completed.
What Have We Learned about NX Methods?
Our preliminary review of the evidence suggests that the twelve design replication case studies we identified, even taken together, will not resolve any of the long-standing debates about NX methods. From the case studies, we uncovered some factors that might reduce bias, but we have not identified a reliable strategy for eliminating it either in a single study or in a collection of studies. The findings can be summarized in terms of the three empirical questions posed in section I.
Question: Can NX methods approximate the results from a well-designed and well-executed experiment?
Answer: Occasionally, but many NX estimators produced results dramatically different from the experimental benchmark.

* Quantitative analysis of the bias estimates underscored the potential for very large bias. Some NX impact estimates fell within a few hundred dollars of the experimental estimate, but others were off by several thousand dollars.
* The size and direction of the "average" bias depends on how the average is computed and what weighting assumptions are applied.
* The average of the absolute bias over all studies was more than $1,000, which is about 10 percent of annual earnings for a typical population of disadvantaged workers.
Question: Which NX methods are more likely to replicate impact estimates from a well-designed and well-executed experiment, and under what conditions are they likely to perform better?
Answer: We identified some factors associated with lower estimated bias. However, even with these factors present, the estimated bias was often large.

* The source of the comparison group made a difference in the average bias estimate. For example, bias was lower when the comparison group was drawn from within the evaluation itself rather than from a national data set, locally matched to the treatment population, or drawn as a control group in an evaluation of a similar program or the same program at a different study site.
* Statistical adjustments, in general, reduced bias, but the bias reduction associated with the most common methods (regression, propensity score matching, or other forms of matching) did not differ substantially. Estimators that combined methods had the lowest bias. Classical econometric estimators that used an instrumental variable or a separate predictor of program participation performed poorly.
* Bias was lower when researchers used measures of preprogram earnings and other detailed background measures to control for individual differences.
* Specification tests were useful in eliminating the poorest performing NX estimators.
* Experiments with larger samples were more likely to be closely replicated than those with smaller samples.
* "No impact" or indeterminate impact findings from an experiment were more nearly replicated than were positive experimental impact findings.
Question: Can averaging multiple NX impact estimates approximate the results from a well-designed and well-executed experiment?
Answer: Maybe, but we have not identified an aggregation strategy that consistently removed bias while answering a focused question about earnings impacts.
* Estimated biases were both positive and negative, and their distribution across all the studies reviewed was centered roughly on zero. This was true both for the full set of estimators and for groups of estimates across all studies that used a single method, such as regression or matching.
* For a given intervention, the distribution of bias estimates was sometimes centered near zero, and sometimes was not.
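The distinction drawn in these bullets can be sketched numerically. The figures below are invented for illustration, not taken from the article's data: a set of bias estimates can average out to zero even though every individual estimate is far from the benchmark.

```python
# Hypothetical sketch: bias estimates whose distribution is centered on zero
# in the aggregate, even though individual biases are large. Invented numbers.

biases = [-1800, -1100, -600, -150, 100, 550, 1200, 1800]  # dollars

mean_bias = sum(biases) / len(biases)                      # centered on zero
mean_abs_bias = sum(abs(b) for b in biases) / len(biases)  # still large

# Averaging across many estimates can wash out the sign of the bias while
# leaving each individual estimate far from the experimental benchmark.
```

This is why a roughly zero-centered distribution across all studies does not justify relying on any single NX estimate.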
We caution that this summary of findings gives only part of the picture, and it does so for a specific area of program evaluation research: the impacts of job training and welfare programs on participant earnings. A somewhat more complete story can be developed in the short term as additional design replication studies, including some that are now in progress, come to light. In the meantime, those who plan and design new studies to evaluate the impacts of training or welfare programs on participants' earnings can use the empirical evidence to improve NX evaluation designs but not to justify their use. Similarly, those who wish to summarize a group of NX studies or average over a set of different NX estimates to reach a conclusion about the impact of a single program can draw on the design replication literature to identify stronger or weaker estimates but not to justify the validity of such a summary.
Appendix
What Did the Studies Conclude?
An alternative way to review the literature is to summarize what the authors concluded in their own words. The twelve design replication studies divided into three equal groups about the value of nonexperimental (NX) methods and the degree of similarity between NX findings and those of randomized experiments: four studies concluded that NX methods performed well, four found evidence that some NX methods performed well while others did not, and four found that NX methods did not perform well or that there was insufficient evidence that they did perform well (see Table A1).
The four studies that found positive results (evidence of small bias) qualified their conclusions by indicating that a researcher needs detailed background data (particularly prior earnings), overlap in background characteristics, or intake workers' subjective ratings of the applicants they screened.
It is important to probe the authors' conclusions further than the present discussion allows. The various study authors used different standards to assess the size of the bias and, in some cases, reached different conclusions with the same data. Furthermore, the studies are not of equal value. Some more realistically replicated what would have been done in the absence of random assignment than others. Within studies, some of the estimators or comparison groups were more or less likely to have been used than others, absent an experimental benchmark. Some estimates were based on smaller samples than others. A recent summary by Bloom et al. (2002, chap. 2) describes many of these studies individually. Section III of this article presents a quantitative analysis of all the studies combined.
TABLE A1
AUTHOR CONCLUSIONS FROM THE STUDIES REVIEWED

Nonexperimental estimators performed well
  Bell et al. (1995). Intervention: WTW. Methods examined: OLS, instrumental variables. Comparison samples used: withdrawals, screenouts, no-shows.
  Dehejia and Wahba (1999). Intervention: supported work. Methods examined: PSM, DD. Comparison samples used: CPS, PSID.
  Hotz, Imbens, and Klerman (2000). Intervention: WTW. Methods examined: OLS, DD. Comparison samples used: control groups from other sites.
  Hotz, Imbens, and Mortimer (1999). Intervention: WTW. Methods examined: OLS, PSM. Comparison samples used: control groups from other sites.

Mixed results
  Bratberg, Grasdal, and Risa (2002). Intervention: occupational therapy. Methods examined: OLS, PSM, DD, SC. Comparison samples used: eligible nonparticipants.
  Gritz and Johnson (2001). Intervention: job training. Methods examined: PSM, SC. Comparison samples used: no-shows.
  Heckman et al. (1998). Intervention: job training. Methods examined: OLS, PSM, DD, SC. Comparison samples used: eligible nonparticipants; national data set (SIPP).
  Olsen and Decker (2001). Intervention: job search assistance. Methods examined: OLS, PSM. Comparison samples used: comparison group from related study.

Nonexperimental estimators performed poorly
  Bloom et al. (2002). Intervention: WTW. Methods examined: OLS, PSM, DD. Comparison samples used: control groups from other sites.
  Fraker and Maynard (1987). Intervention: supported work. Methods examined: OLS, cell match, SC. Comparison samples used: CPS.
  Lalonde (1986). Intervention: supported work. Methods examined: OLS, DD, SC. Comparison samples used: CPS, PSID.
  Smith and Todd (forthcoming). Intervention: supported work. Methods examined: OLS, PSM, DD. Comparison samples used: CPS, PSID.
NOTE: WTW = welfare-to-work program; OLS = ordinary least squares regression-adjusted difference in means; PSM = propensity score matching; DD = difference-in-differences; SC = parametric or nonparametric selection correction; SIPP = Survey of Income and Program Participation; CPS = Current Population Survey; PSID = Panel Study of Income Dynamics.
Notes
1. This article uses the term "nonexperimental" as a synonym for "quasi-experimental," although "quasi-experimental" is used in places to connote a more purposeful attempt by the researcher to mimic randomized trials. In general, any approach that does not use random assignment is labeled nonexperimental (NX).
2. Findings reported here are drawn from a research synthesis prepared under the guidelines of the Campbell Collaboration. The published protocol is available at http://www.campbellcollaboration.org/doc-pdf/qedprot.pdf.
3. A broader review we have undertaken for the Campbell Collaboration includes design replication studies that estimate bias for other outcomes, such as student achievement, school dropout, and receipt of public assistance benefits. Forthcoming results from that study examine binary indicators for whether the experimental and NX estimators support the same statistical inference and whether they support the same policy conclusion.
4. An important area excluded by this criterion was health-related interventions (for example, McKay et al. 1995, 1998). Models of program participation, the key factor in sample selection bias, might be similar among education-, training-, and employment-related interventions but are likely to differ markedly for a medical or community health intervention. Furthermore, the outcomes would typically be very different. We initially applied a broader criterion that included school-related outcomes such as school dropout and test scores but ultimately focused on interventions with earnings as the main outcome to limit the number of confounding factors. A forthcoming Campbell review will draw on the wider literature.
5. It is less important for our purposes that the experimental estimator be externally valid or that it represent one policy parameter in particular (such as the effect of the treatment on the treated or the local average treatment effect), as long as the NX estimator purports to measure the same thing.
6. Randomization bias results when the treatment group's experience is influenced by the presence of a randomized evaluation. Spillover effects result when the control group's experience is influenced by the presence of a treatment group. Substitution bias results when control group members are given an alternative treatment that they would not have received absent the experiment. John Henry and Hawthorne effects result from members of the control and treatment group, respectively, behaving differently because they are aware of their inclusion in an experiment.
7. It is important to recall that a wide range of bias estimates does not necessarily imply a wide range of biases, because of sampling error in both the experimental and NX impact estimates.
8. The coding of the variables representing quality of background data (for regression or matching) necessarily involves some subjectivity. To be systematic, we applied the following criteria: if the specification included several quarters of baseline earnings and a large number of relevant background variables, we coded the quality of the data as "very extensive." If the specification contained some baseline measure of earnings and a set of individual background variables that captures the key elements that are likely to affect outcomes, then it was coded as "extensive." Otherwise, it was coded as "poor."
9. Some researchers, such as Bloom et al. (2002) and Smith and Todd (forthcoming), tried to determine the number of quarters of prior earnings needed to reduce bias to acceptable levels, but there were not enough other examples to draw any general lessons.
10. The study population for Bratberg, Grasdal, and Risa (2002) differs from the populations targeted in the other studies under review not only because the population was made up of Norwegians but also because the sample members were not disadvantaged workers. The larger bias estimates would apply to a larger earnings base and therefore not be as substantively important as a similarly sized bias found in a study of U.S. welfare participants. Some of this effect is measured by the study fixed effect (see the even-numbered columns in Table 4).
11. Removing the outliers in this case is probably reasonable because the outlying bias estimates correspond to NX impact estimates that were implausible on their face (given the collection of other impact estimates). One cannot count on being able to identify this type of outlier in general, when an experimental benchmark is not available.
12. We also examined the distribution across studies for a given method (matching and regression) and found a similar result. This suggests that aggregation need not be done across methods if a large collection of studies and interventions is used.
References

Bell, Stephen H., Larry L. Orr, John D. Blomquist, and Glen C. Cain. 1995. Program applicants as a comparison group in evaluating training programs. Kalamazoo, MI: Upjohn Institute for Employment Research.
Bloom, Howard, Charles Michalopoulos, Carolyn Hill, and Ying Lei. 2002, June. Can non-experimental comparison group methods match the findings from a random assignment evaluation of mandatory welfare-to-work programs? New York: Manpower Demonstration Research Corporation.
Bratberg, Espen, Astrid Grasdal, and Alf Erling Risa. 2002. Evaluating social policy by experimental and NX methods. Scandinavian Journal of Economics 104 (1): 147-71.
Campbell, Donald, and J. C. Stanley. 1966. Experimental and quasi-experimental designs for research. Chicago: Rand McNally.
Cook, Thomas, and Donald Campbell. 1979. Quasi-experimentation: Design and analysis issues for field settings. Hopewell, NJ: Houghton-Mifflin.
Cooper, H., K. Charlton, J. C. Valentine, and L. Muhlenbruck. 2000. Making the most of summer school: A meta-analytic and narrative review. Monographs of the Society for Research in Child Development. Malden, MA: Blackwell.
Dehejia, Rajeev, and Sadek Wahba. 1999. Causal effects in NX studies: Reevaluating the evaluation of training programs. Journal of the American Statistical Association 94 (448): 1053-62.
Fraker, Thomas, and Rebecca Maynard. 1987. The adequacy of comparison group designs for evaluations of employment-related programs. Journal of Human Resources 22 (2): 194-227.
Glazerman, Steven M., Dan M. Levy, and David Myers. 2002. NX replications of social experiments in education, training, and employment services: Revised protocol. Philadelphia: Campbell Collaboration. Available from http://www.campbellcollaboration.org/doc-pdf/qedprot.pdf.
Gritz, R. Mark, and Terry Johnson. 2001. National Job Corps Study: Assessing program effects on earnings for students achieving key program milestones. Washington, DC: Battelle Memorial Institute.
Heckman, James J., and V. Joseph Hotz. 1989. Choosing among alternative NX methods for estimating the impact of social programs: The case of manpower training. Journal of the American Statistical Association 84 (408): 862-74.
Heckman, James J., Hidehiko Ichimura, Jeffrey C. Smith, and Petra Todd. 1998. Characterizing selection bias. Econometrica 66 (5): 1017-98.
Hotz, V. Joseph, Guido W. Imbens, and Jacob Klerman. 2000. The long-term gains from GAIN: A re-analysis of the impacts of the California GAIN Program. NBER Working Paper 8007. Cambridge, MA: National Bureau of Economic Research.
Hotz, V. Joseph, Guido W. Imbens, and Julie H. Mortimer. 1999. Predicting the efficacy of future training programs using past experiences. NBER Technical Working Paper 238. Cambridge, MA: National Bureau of Economic Research.
Lalonde, Robert. 1986. Evaluating the econometric evaluations of training with experimental data. American Economic Review 76 (4): 604-20.
Lipsey, Mark W., and David B. Wilson. 1993. The efficacy of psychological, educational, and behavioral treatment: Confirmation from meta-analysis. American Psychologist 48:1181-209.
McConnell, Sheena, and Steven M. Glazerman. 2001. National Job Corps Study: The benefits and costs of Job Corps. Washington, DC: Mathematica Policy Research, Inc.
McKay, James R., Arthur I. Alterman, Thomas McLellan, Chris R. Boardman, Frank D. Mulvaney, and Charles P. O'Brien. 1998. Random versus nonrandom assignment in the evaluation of treatment for cocaine abusers. Journal of Consulting and Clinical Psychology 66 (4): 697-701.
McKay, James R., Arthur I. Alterman, Thomas McLellan, Edward C. Snider, and Charles P. O'Brien. 1995. Effect of random versus nonrandom assignment in a comparison of inpatient and day hospital rehabilitation for male alcoholics. Journal of Consulting and Clinical Psychology 63 (1): 70-78.
National Reading Panel. 2000. Teaching children to read: An evidence-based assessment of the scientific research literature on reading and its implications for reading instruction. Reports of the subgroups. NIH Pub. no. 00-4754, December. Bethesda, MD: National Institute of Child Health and Human Development.
Newman, John, Menno Pradhan, Laura Rawlings, Geert Ridder, Ramiro Coa, and Jose Luis Evia. 2001. An impact evaluation of education, health and water supply investments of the Bolivian Social Investment Fund. World Bank Economic Review, June. Washington, DC: World Bank.
Olsen, Robert, and Paul Decker. 2001. Testing different methods of estimating the impacts of worker profiling and reemployment services systems. Washington, DC: Mathematica Policy Research, Inc.
Reynolds, Arthur J., and Judy A. Temple. 1995. Quasi-experimental estimates of the effects of a preschool intervention: Psychometric and econometric comparisons. Evaluation Review 19 (4): 347-73.
Shadish, William R. 2000. The empirical program of quasi-experimentation. In Research design: Donald Campbell's legacy, edited by L. Bickman. Thousand Oaks, CA: Sage.
Shadish, William R., and Kevin Ragsdale. 1996. Random versus nonrandom assignment in psychotherapy experiments: Do you get the same answer? Journal of Consulting and Clinical Psychology 64:1290-305.
Smith, Jeffrey C., and Petra Todd. Forthcoming. Does matching overcome Lalonde's critique of NX estimators? Journal of Econometrics.