American Academy of Political and Social Science

Nonexperimental versus Experimental Estimates of Earnings Impacts
Author(s): Steven Glazerman, Dan M. Levy and David Myers
Source: Annals of the American Academy of Political and Social Science, Vol. 589, Misleading Evidence and Evidence-Led Policy: Making Social Science More Experimental (Sep., 2003), pp. 63-93
Published by: Sage Publications, Inc. in association with the American Academy of Political and Social Science
Stable URL: http://www.jstor.org/stable/3658561
Accessed: 13/03/2013 22:03
This content downloaded on Wed, 13 Mar 2013 22:03:05 PM. All use subject to JSTOR Terms and Conditions.
Nonexperimental versus Experimental Estimates of Earnings Impacts

By
STEVEN GLAZERMAN, DAN M. LEVY, and DAVID MYERS
To assess nonexperimental (NX) evaluation methods in the context of welfare, job training, and employment services programs, the authors reexamined the results of twelve case studies intended to replicate impact estimates from an experimental evaluation by using NX methods. They found that the NX methods sometimes came close to replicating experimentally derived results but often produced estimates that differed by policy-relevant margins, which the authors interpret as estimates of bias. Although the authors identified several study design factors associated with smaller discrepancies, no combination of factors would consistently eliminate discrepancies. Even with a large number of impact estimates, the positive and negative bias estimates did not always cancel each other out. Thus, it was difficult to identify an aggregation strategy that consistently removed bias while answering a focused question about earnings impacts of a program. They conclude that although the empirical evidence from this literature can be used in the context of training and welfare programs to improve NX research designs, it cannot on its own justify the use of such designs.

Keywords: nonexperimental; quasi-experimental; design replication
Assessing Alternatives to Social Experiments
Controlled experiments, where subjects are
randomly assigned to receive interventions, are
desirable but often thought to be infeasible or
overly burdensome, especially in social settings.
Therefore, researchers often substitute nonexperimental (NX) or "quasi-experimental" methods, in which researchers use treatment and comparison groups but do not randomly assign subjects to the groups.1 NX methods are
Steven Glazerman is a senior researcher at Mathematica Policy Research. He has designed and conducted numerous evaluations of social programs, using both experimental and nonexperimental methods. His recent and current work includes evaluations of such programs as Job Corps, Teach for America, the NIH Loan Repayment Program, and a federal initiative to provide supportive housing for people with severe mental illness.
DOI: 10.1177/0002716203254879
ANNALS, AAPSS, 589, September 2003
THE ANNALS OF THE AMERICAN ACADEMY
less intrusive and sometimes less costly than controlled experiments, but their validity rests on untestable assumptions about the differences between treatment and comparison groups.

Recently, a growing number of case studies have tried to use randomized experiments to validate NX methods. To date, this growing literature has not been integrated in a systematic review or meta-analysis. The most comprehensive summary (Bloom et al. 2002) addressed the portion of this literature dealing with mandatory welfare programs. However, efforts to put the quantitative bias estimates from these studies in a common metric and combine them to draw general lessons have been lacking.

This article reports on a systematic review of such replication studies to assess the ability of NX designs to produce valid impacts of social programs on participants' earnings.2 Specifically, this article addresses the following questions:
* Can NX methods approximate the results from a well-designed and well-executed experiment?
* Which NX methods are more likely to replicate impact estimates from a well-designed and well-executed experiment, and under what conditions are they likely to perform better?
* Can averaging multiple NX impact estimates approximate the results from a well-designed and well-executed experiment?
The answers to these questions will help consumers of evaluation research, including those who conduct literature reviews and meta-analyses, decide whether and how to consider NX evidence. They will also help research designers decide, when
Dan M. Levy is a researcher at Mathematica Policy Research, where he has been involved in evaluations of several social programs, including the 21st Century Community Learning Centers and the Welfare-to-Work Grants program, and is developing a series of evaluation training forums for the U.S. Centers for Disease Control and Prevention. He is also an adjunct faculty member at Georgetown University's Public Policy Institute, where he teaches quantitative methods.
David Myers is vice president of human services research and a senior fellow at Mathematica Policy Research. He specializes in design and implementation of random assignment and quasi-experimental evaluations of education programs. He has directed policy analyses of elementary, secondary, postsecondary, and adult education issues, including a study of workplace literacy programs for the U.S. Department of Education. He currently leads a large, multisite, random assignment study to evaluate Upward Bound, a federal program to help low-income students attend college. He serves on the advisory panel of the What Works Clearinghouse sponsored by the U.S. Department of Education and is a co-convener of the Campbell Collaboration's Quasi-Experimental Design Group.
NOTE: This research was supported by grants from the William and Flora Hewlett Foundation and the Smith Richardson Foundation; however, the conclusions do not necessarily represent the official position or policies of the Hewlett Foundation or the Smith Richardson Foundation. The authors thank Harris Cooper, Phoebe Cottingham, Allen Schirm, Jeff Valentine, and participants of workshops held by the Campbell Collaboration, Child Trends, Mathematica, and the Smith Richardson Foundation. Also, we are grateful to the authors of the studies that we included in this review, many of whom spent time answering our questions and providing additional data.
NONEXPERIMENTAL VERSUS EXPERIMENTAL ESTIMATES
random assignment is not feasible, whether there are conditions that justify a NX
research design.
Between- and within-study comparisons
Researchers use two types of empirical evidence to assess NX methods:
between-study comparisons and within-study comparisons (Shadish 2000). Thisarticle synthesizes evidence from within-study comparisons, but we describe
between-study evidence as background.
Between-study comparisons. Between-study comparisons look at multiple studies that use different research designs and study samples to estimate the impact of the same type of program. By comparing results from experimental studies with those of NX ones, researchers try to derive the relationship between the design and the estimates of impact. Examples include Reynolds and Temple (1995), who compared three studies; and Cooper et al. (2000, Table 2), the National Reading Panel (2000, chapter I, Tables 6-7), and Shadish and Ragsdale (1996), who all compared dozens or hundreds of studies by including research design variables as moderators in their meta-analyses. These analyses produced mixed evidence on whether quasi-experiments produced higher or lower impact estimates than experiments.

An even more comprehensive between-study analysis by Lipsey and Wilson (1993) found mixed evidence as well. For many types of interventions, the average of the NX studies gives a slightly different answer from the average of the experimental studies, while for some, it gives a markedly different answer. The authors found seventy-four meta-analyses that distinguished between randomized and nonrandomized treatment assignment and showed that the average effect sizes for the two groups were similar: 0.46 of a standard deviation from the experimental designs and 0.41 from the NX designs. But such findings were based on averages over a wide range of content domains, spanning nearly the entire applied psychology literature. Graphing the distribution of differences between random and nonrandom treatment assignment within each meta-analysis (where each one pertains to a single content domain), they showed that the average difference between findings based on experimental versus NX designs was close to zero, implying no bias. But the range extended from about -1.0 standard deviation to +1.6 standard deviations, with the bulk of differences falling between -0.20 and +0.40. Thus, the between-study evidence does not resolve whether differences in impact estimates are due to design or to some other factor.
Within-study comparisons. In a within-study comparison, researchers estimate a program's impact by using a randomized control group and then reestimate the impact by using one or more nonrandomized comparison groups. We refer to these comparisons, described formally below, as "design replication" studies. The nonrandomized comparison groups are formed and their outcomes adjusted by using statistical or econometric techniques aimed at estimating or eliminating selection bias. Design replication studies can use multiple comparison groups or
the same comparison group with multiple sample restrictions to examine the effect of different comparison group strategies. The NX estimate is meant to mimic what would have been estimated if a randomized experiment had not been conducted. If the NX estimate is close to the experimental estimate, then the NX technique is assumed to be "successful" at replicating an unbiased research design.

Within-study comparisons make it clear that the difference in findings between methods is attributable to the methods themselves rather than to investigator bias, differences in how the intervention was implemented, or differences in treatment
setting. For this reason, within-study comparisons can yield relatively clean estimates of selection bias. On the other hand, it is more difficult to rule out the effects of chance for a given set of within-study comparisons. Therefore, general conclusions require, as in this article, several within-study comparisons in a variety of contexts.
Design replication to estimate bias
The current review differs from standard meta-analysis because the "effect size" of interest is not the impact of some intervention on a given outcome but the discrepancy between experimental and NX impact estimates. This, we argue, is itself an estimate of the bias. Bias can never be directly observed, because the true impact, θ, is not known. This review includes two equivalent types of studies that allow us to estimate the bias empirically. The first type presents up to K NX estimators, θ̂_k, of the impact, where k = 1, ..., K, and one experimental estimate, θ̂_0, such that E[θ̂_0] = θ. The second type compares the average outcome for a control group, Ȳ_0, with the (adjusted) average outcome, Ȳ_k, for some comparison group based on NX method k. The relationship between these variables is shown in equations (1) and (2), where Ȳ_T represents the average outcome for the treated group and B(θ̂_k) is the bias.

θ̂_k = Ȳ_T − Ȳ_k    (1)

θ̂_0 = Ȳ_T − Ȳ_0    (2)
Using these estimates, we can estimate the bias associated with each of the k estimators, defined as B(θ̂_k) = E[θ̂_k − θ]. Since the true parameter is not observable, we estimate the bias as the difference between the NX and experimental estimators. Subtracting equation (2) from equation (1) yields two forms of the bias estimate, B̂(θ̂_k), corresponding to the two types of reporting formats discussed above:

B̂(θ̂_k) = θ̂_k − θ̂_0 = Ȳ_0 − Ȳ_k    (3)

Thus, the two types of studies are equivalent, even though the latter type does not use information from the treatment group.
If the experiment is well executed, then the estimated bias should itself be unbiased, as shown in equation (4).

E[B̂(θ̂_k)] = E[θ̂_k] − E[θ̂_0] = E[θ̂_k − θ] = B(θ̂_k)    (4)
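The equivalence of the two reporting formats can be checked numerically. The sketch below computes the bias estimate both ways, as a difference of impact estimates and as a difference of control versus comparison outcomes; the earnings figures are fabricated for illustration, and only the algebra comes from equations (1) through (3).

```python
from statistics import mean

def bias_estimate(y_treat, y_control, y_comparison):
    """Compute the NX bias estimate in its two equivalent forms."""
    theta_exp = mean(y_treat) - mean(y_control)       # experimental impact, eq. (2)
    theta_nx = mean(y_treat) - mean(y_comparison)     # NX impact, eq. (1)
    form_1 = theta_nx - theta_exp                     # difference of impact estimates
    form_2 = mean(y_control) - mean(y_comparison)     # control vs. comparison outcomes
    return form_1, form_2                             # eq. (3): both are the same

# fabricated annual earnings (dollars), for illustration only
treat = [12000, 9000, 15000]       # randomized treatment group
control = [10000, 8000, 12000]     # randomized control group
comparison = [9000, 11000, 13000]  # adjusted NX comparison group

f1, f2 = bias_estimate(treat, control, comparison)
print(f1, f2)  # both forms give -1000
```

Note that the second form never touches the treatment group, which is why studies reporting only control-versus-comparison contrasts still qualify for the review.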
The goal of the analysis in this review is to model B(θ̂_k) as a function of the characteristics and context of the study, the estimator, and the intervention whose impact is being estimated. We recognize an important practical limitation in estimating such models, which is that the observed bias estimates vary not only because of the performance of the NX method (in reducing selection bias) and other contextual variables already noted but because of random sampling error in both the experimental and NX estimators. This sampling variance makes it difficult to judge when bias estimates are large enough or spread out enough to be evidence that a NX method has failed. Therefore, we have refrained throughout the analysis from making general statements that go beyond the collection of case studies we reviewed.
Data and Method
In recent years, the number of design replication studies has been growing to the point where it is now possible to begin synthesizing the results to look for patterns. This article draws on such a synthesis of the design replication literature, focusing on studies that used earnings as an outcome.3 The rest of this section describes the methods we used to assemble the design replication studies, construct the data set, and conduct the analysis.
Inclusion criteria and search strategy
To be included in the review, a study had to meet the following criteria:

* A randomized control group was used to evaluate a program, and a comparison group was available for computing at least one NX estimate of the same impact.
Because some studies estimated bias directly by comparing comparison and control groups, the presence of a treatment group is not required.
* The experimental-NX comparison was based on estimates from the same experiment.
This criterion excludes the between-study comparisons described in Section I.
* The experimental and NX estimates pertained to the same intervention in the same sites.
This criterion excludes, for example, a study of education programs in Bolivia
(Newman et al. 2001), which compared findings from an experimental design in
one region with findings from a NX design in another. Such a study confounds
regional differences with differences in study design.
* The intervention's purpose was to raise participants' earnings. This criterion restricts our focus to programs that provide job training and employment services.4
The search process produced dozens of candidate studies. We narrowed them
down to thirty-three for closer examination, and determined that twelve, listed in
Table 1, met the search criteria. The twelve studies correspond to nine interventions; four of these studies addressed the same intervention, the National Supported Work Demonstration (NSW). All of the interventions involved job training or employment services, such as job search assistance or vocational rehabilitation, and participation was mandatory in about half of them. In terms of location, three interventions were single-site programs (in San Diego, California; Riverside, California; and Bergen, Norway); one was a multisite program in a single state (Florida); and the remaining five were multistate in the United States. Half of the interventions were studied in the 1990s; only one (NSW) was studied before 1980. Seven of the studies appeared in peer-reviewed journals or in books, three are final reports of government contractors, and two are working papers or unpublished manuscripts.
The quality of the evidence in these studies, in particular the quality of the experiment, is critical to our analysis. The use of design replication as a validation exercise assumes that the experimental estimators in the studies are themselves unbiased.5 Common threats to the validity of the experimental estimator include differential attrition or nonresponse, randomization bias, spillover effects, substitution bias, John Henry effects, and Hawthorne effects.6 Bias could also arise from nonuniform collection of data from treatment and control groups and from assignments that were not truly random. Noncompliance with treatment assignment, even if monitored and documented, can threaten an experiment's ability to answer interesting policy questions.
interesting policy questions.To evaluate the validity of the experimental estimates, we assessed the nine
experiments in our review and found them to be of generally high quality. Most
were well funded and were carried out by research organizations with established
track records in random assignment and data collection. The Manpower Demon-
TABLE 1
DESIGN REPLICATION STUDIES WITH EARNINGS AS AN OUTCOME

Study | Intervention | Years of Intervention | Program Mandatory? | Geographic Location
Lalonde (1986) | National Supported Work Demonstration (NSW) | 1976-77 | No | National, 10 sites
Fraker and Maynard (1987) | National Supported Work Demonstration (NSW) | 1976-77 | No | National, 10 sites
Dehejia and Wahba (1999) | National Supported Work Demonstration (NSW) | 1976-77 | No | National, 10 sites
Smith and Todd (forthcoming) | National Supported Work Demonstration (NSW) | 1976-77 | No | National, 10 sites
Hotz, Imbens, and Mortimer (1999) | Saturation Work Initiative Model (SWIM) | 1982-85 | Yes | San Diego, CA
Bell et al. (1995) | Aid to Families with Dependent Children (AFDC) Homemaker-Home Health Aide Demonstration | 1984-85 | No | AR, KY, NJ, NY, OH, SC, TX
Heckman et al. (1998) | Job Training Partnership Act (JTPA) | 1985-89 | No | Corpus Christi, TX; Fort Wayne, IN; Jersey City, NJ; Providence, RI
TABLE 1 (continued)

Study | Intervention | Years of Intervention | Program Mandatory? | Geographic Location
Hotz, Imbens, and Klerman (2000) | Greater Avenues for Independence (GAIN) | 1988-90 | Yes | Riverside, CA
Bloom et al. (2002) | National Evaluation of Welfare to Work Strategies (NEWWS) | 1991-94 | Yes | Oklahoma City, OK; Riverside, CA; Detroit, MI; Grand Rapids, MI; Portland, OR
Bratberg, Grasdal, and Risa (2002) | Bergen Experiment (vocational rehabilitation) | 1994 | No | Bergen, Norway
Olsen and Decker (2001) | Job Search Assistance demonstration (JSA) | 1995 | Yes | Florida, 10 counties
Gritz and Johnson (2001) | Job Corps | 1995-97 | No | National, 110 sites

NOTE: OLS = ordinary least squares regression; DD = difference-in-differences; SC = econometric selection model.
stration Research Corporation (MDRC) oversaw random assignment in four of the experiments; Abt Associates oversaw two; and Mathematica Policy Research (MPR) oversaw two. The remaining experiment was overseen by university-based researchers. Because details of the experimental designs and their implementation were not reported in all the replication studies, we retrieved background reports and methodological appendixes, examined nonresponse analyses, and corresponded with researchers. We concluded that most of the experiments had relatively low crossover and attrition rates and that the attrition and nonresponse did
not appear to be related to treatment status in a way that would threaten the conclusions' validity.
Protocol and coding
Once the studies were assembled, we followed a procedure laid out in a formal protocol (Glazerman, Levy, and Myers 2002) to extract data from the source studies and code them for analysis. For example, the coding form had questions about the source of the comparison group used for each NX estimator in each study. Two coders read the twelve studies and extracted all the information needed for the analysis. They coded two studies together to ensure a consistent understanding of the coding instrument and process. Then, each coded a subset of the rest, with ample consultation built into the process to increase coding accuracy (see Glazerman, Levy, and Myers [2002] for details on this and other aspects of the research synthesis methods). We also contacted authors of nearly every source study to obtain clarification and, sometimes, additional data. Further details of the variables that were coded are mentioned below.
Analysis methods
The goal of our analysis is to determine how selection bias varies with the type of estimator employed, the setting, and the interaction between the setting and the type of estimator. To answer this, we model B(θ̂_jk), the bias associated with estimator k, as a function of the characteristics and context of the study (indexed by j) and
its intervention, captured in a vector labeled Z, and the characteristics of the estimator itself, captured in a vector labeled W.

|B̂(θ̂_jk)| = f(Z_j, W_k, Z_j·W_k)    (5)

We use the absolute value of B̂(θ̂_jk) on the left-hand side of the equation because a researcher or research synthesist wants to choose designs to minimize the bias, whatever its direction. An interaction between study-level and estimator-level variables is included to capture the interplay between method and context.
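To make the role of the interaction term concrete, the sketch below fits a saturated version of equation (5) with one binary study-level variable Z and one binary estimator-level variable W. This is our own two-variable illustration, not the authors' actual specification: with binary regressors the saturated model reproduces the four cell means, so the interaction coefficient is a difference-in-differences of cell means and can be computed directly. All numbers are fabricated.

```python
from collections import defaultdict
from statistics import mean

def saturated_fit(rows):
    """Fit |B| = a + b*Z + c*W + d*Z*W for binary Z and W via cell means."""
    cells = defaultdict(list)
    for z, w, abs_bias in rows:
        cells[(z, w)].append(abs_bias)
    m = {cell: mean(values) for cell, values in cells.items()}
    a = m[(0, 0)]                                      # baseline cell mean
    b = m[(1, 0)] - m[(0, 0)]                          # effect of Z when W = 0
    c = m[(0, 1)] - m[(0, 0)]                          # effect of W when Z = 0
    d = (m[(1, 1)] - m[(1, 0)]) - (m[(0, 1)] - m[(0, 0)])  # interaction
    return a, b, c, d

# fabricated data: Z = 1 if the program is mandatory, W = 1 if the
# estimator uses matching; outcome = absolute bias in annual earnings ($)
rows = [(0, 0, 1200), (0, 0, 1000), (1, 0, 900), (1, 0, 700),
        (0, 1, 800), (0, 1, 600), (1, 1, 300), (1, 1, 100)]

a, b, c, d = saturated_fit(rows)
# intercept 1100, Z effect -300, W effect -400, interaction -200
```

In this fabricated example, matching is associated with lower absolute bias everywhere (c and d are negative), and more so in mandatory programs; the review's actual model includes many more Z and W indicators and is fit by regression.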
One might expect that some types of NX design perform better than others and that some designs are more appropriate under certain study conditions. To test this, it is important to describe each type of NX estimator. Of the many classification schemes, the most commonly used are those given by Campbell and Stanley (1966) and Cook and Campbell (1979). Alternative formulations by economists such as Heckman and Hotz (1989) and Heckman et al. (1998) are also useful for categorizing methods in a general way; however, we prefer to avoid forcing the methods into mutually exclusive categories, because many of the estimators used multiple approaches. Instead, we describe each estimator by a vector of characteristics that pertain to (1) the source of the comparison group and (2) the analytic techniques used to adjust for differences between the comparison group and the treatment population.

Because there is a limited range of NX designs assessed in the design replication literature, we must use very gross indicators to categorize NX designs. For the source of the comparison group, we coded three binary indicator variables: one for whether the comparison group is drawn from a national data set, such as the Survey of Income and Program Participation (SIPP); one for whether the comparison group is based on sample members from the same geographic area as the treated population; and one for whether the comparison group is formed by using the randomized control group from a different experiment. For the type of statistical adjustment, we used four variables to broadly indicate (1) whether background variables were used as covariates in a regression model; (2) whether matching methods, such as stratification on an estimated propensity score, were used; (3) whether the estimator used preintervention measures of the outcome (examples include difference-in-differences models, fixed-effect models, or even regression or matching models using baseline earnings); and (4) whether the estimator was based on an econometric sample selection model. The selection modeling indicator (4) would be set to one, for example, if the estimator used the inverse Mills' ratio or instrumental variables with a set of variables that were included in a model of program participation but excluded from the model of earnings determination.
We constructed other indicators to identify conditions under which quasi-experiments were potentially more likely to replicate experiments. One set of indicators measured whether a specification test was conducted, and if so, whether the test would have led the researcher to avoid the estimator a priori. Another set of indicators measured whether the background variables used in the regression or
in the matching procedure were detailed, as with a survey, or sparse, as is typically the case with administrative data. Variables included in the W vector include the experiment's sample size, grouped into categories for small, medium, and large; and the program's estimated effectiveness (effective, ineffective, or indeterminate).

To estimate the average bias reduction associated with these design and context variables, we used both bivariate analyses (tabulations) and multivariate analyses (regression). Because such a small collection of studies limits the degrees of freedom, we expect to find the data consistent with several competing explanations for why the estimated bias is high or low. The bivariate analyses use sample weights to account for the unequal sample sizes of the source studies, although we found that
weighting made little difference to the qualitative findings. Similarly, for the multivariate analyses, we tried alternative aggregation procedures to deal with lack of statistical independence among bias estimates from a single study. To minimize artificial replication, the regression results in the next section use the average of the absolute value of the bias estimates associated with each unique combination of design variables. For example, if one study produced eight quarterly bias estimates corresponding to impacts after random assignment, we aggregated them into a single estimate for the two-year period, as long as the policy interpretation for the two-year period made sense.

A constraint on more detailed analyses than those just described was dictated by
having just twelve replication studies. While many of these studies assessed multiple NX estimators, resulting in more than one thousand bias estimates, the overall diversity of designs was not as comprehensive a catalogue of quasi-experimental methods as those described by Cook and Campbell (1979) and others. Among those methods that were assessed, not every method was assessed in every setting. As more empirical work comes to light, more sophisticated analysis may be possible.
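The aggregation step described above interacts with the sign of the bias: averaging signed quarterly estimates before taking an absolute value is not the same as averaging their absolute values, because positive and negative biases can partly cancel. The numbers below are fabricated for illustration.

```python
from statistics import mean

# fabricated quarterly bias estimates (dollars) for one design combination
quarterly_bias = [500, -200, 300, 100, -50, 250, 400, -100]

# aggregate the signed estimates into one period-level estimate,
# then take the absolute value
abs_of_mean = abs(mean(quarterly_bias))

# take absolute values quarter by quarter, then average
mean_of_abs = mean([abs(b) for b in quarterly_bias])

print(abs_of_mean, mean_of_abs)  # 150 vs 237.5: signed biases partly cancel
```

The gap between the two summaries (150 versus 237.5 here) is one way that cancellation can understate how far individual NX estimates sit from the experimental benchmark, which is why the choice of aggregation procedure matters.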
Results
Our synthesis of design replication studies takes an important first step toward answering the research questions of this article. However, interpretation of the evidence remains a challenge. Even among authors of the studies we reviewed, there was no consensus on how to judge the size of differences between experimental and NX impact estimates. The authors differed in the extent to which they probed the statistical and policy significance of their results. Some focused narrowly on their own case studies; others made broader statements praising or condemning a NX method. Four studies concluded that NX methods performed well, four found evidence that some NX methods performed well while others did not, and four found that NX methods did not perform well or that there was insufficient evidence that they did perform well. A summary of their conclusions is given in the appendix. In this section, univariate analyses describe the range of bias estimates in the literature. Bivariate analyses then relate the absolute size of the bias to several explanatory factors. The multivariate analysis that follows uses regression to determine whether the different explanations of bias overlap and whether one predominates. Finally, we examine the distribution of the bias estimates to consider whether they cancel out across studies and whether their variation is due to true variation in the performance of NX methods or some other explanation.
Univariate analysis
From the twelve studies, we extracted 1,150 separate estimates of the bias, about 96 estimates per study. While some of the bias estimates were close to zero, some were very large, overestimating or underestimating annual earnings impacts by as much as $10,000 or more. Table 2 shows the bias estimates by study.
The definition of a "large" bias depends on the program and the policy decision at stake. However, for disadvantaged workers, even a $1,000 difference in annual earnings is important. For example, in a benefit-cost study of Job Corps (McConnell and Glazerman 2001), a steady-state impact on annual earnings of about $1,200 was used to justify the program's expenditure levels, one of the highest per trainee (about $16,500) for any federal training program. A difference of $800 in the annual earnings impact estimate would have completely changed the study's outcome and might have led to a recommendation to eliminate rather than expand the annual $1.4 billion program. For programs, such as the Job Training Partnership Act (JTPA) and the various welfare-to-work programs captured in our data, where both the program costs and the impacts on earnings are likely much smaller, a difference of $1,000 or more can make a dramatic difference in the policy recommendation.

Another benchmark is the average earnings of control group members. In many of the studies we reviewed, the inflation-adjusted annual earnings of control group members was about $10,000, which includes zero earnings for nonworkers. Thus, a $1,000 bias would represent 10 percent of earnings, a substantial amount.
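The stakes can be made concrete with a small sketch: if a program is judged worthwhile only when its estimated annual earnings impact clears a break-even level derived from costs, a bias of a few hundred dollars can flip the verdict. The $900 break-even figure below is hypothetical, not a number from the Job Corps study:

```python
# Illustrative only: whether a bias in the impact estimate flips a
# pass/fail policy verdict. The break-even threshold is hypothetical;
# in a real benefit-cost study it would be derived from program costs.

def verdict(annual_impact, break_even):
    return "expand" if annual_impact >= break_even else "eliminate"

experimental_impact = 1200   # e.g., a steady-state annual earnings impact
break_even = 900             # hypothetical break-even impact

biased_estimate = experimental_impact - 800   # an $800 understatement

truth = verdict(experimental_impact, break_even)
biased = verdict(biased_estimate, break_even)
```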
NONEXPERIMENTAL VERSUS EXPERIMENTAL ESTIMATES
TABLE 2
DESCRIPTIVE STATISTICS OF BIAS ESTIMATES BY STUDY

Bias Estimates (Annual Earnings in 1996 Dollars)

Study                                 Range of Estimates    Average of   Average of Absolute Value   Number of   Number of Types
                                                            Estimates    of Types of Estimates(a)    Estimates   of Estimates

Bell et al. (1995)                       -723 to +5,008          661              813                    54            3
Bloom et al. (2002)                   -21,251 to +12,215         498            1,114                   564            8
Bratberg, Grasdal, and Risa (2002)      -18,702 to -654       -4,826            2,907                    13            5
Dehejia and Wahba (1999)               -1,939 to +1,212          173            4,163                    40            4
Fraker and Maynard (1987)                -3,673 to +871         -751            1,103                    48            3
Gritz and Johnson (2001)               -1,091 to +3,189          497              780                    48            2
Heckman et al. (1998)                  -7,669 to +8,154         -423            3,273                    45           17
Hotz, Imbens, and Mortimer (1999)      -1,682 to +2,192         -128              585                    36            2
Hotz, Imbens, and Klerman (2000)         -1,248 to +438         -174              371                    64            6
Lalonde (1986)                         -5,853 to +4,143         -636            2,849                   112            8
Olsen and Decker (2001)                -1,548 to +1,107         -363            1,397                    10            5
Smith and Todd (forthcoming)          -11,743 to +4,829       -1,655            4,019                   116            6
Total                                 -21,251 to +12,215         -637           2,325                 1,150           69

a. The average of the absolute value of the bias is used to compare different research designs. Therefore, we calculate it by first averaging bias within design type and then averaging the results across the sixty-nine design types.
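The two-stage aggregation described in note a can be sketched directly; the design-type labels and bias values below are hypothetical:

```python
# Two-stage average of |bias|, as in Table 2, note a: first average
# within each design type, then average those type means. Hypothetical data.
from collections import defaultdict

def avg_abs_bias_by_type(estimates):
    """estimates: iterable of (design_type, bias) pairs."""
    by_type = defaultdict(list)
    for design_type, bias in estimates:
        by_type[design_type].append(abs(bias))
    type_means = [sum(v) / len(v) for v in by_type.values()]
    return sum(type_means) / len(type_means)

sample = [("matching", 500), ("matching", -1500), ("regression", 2000)]
# within-type means: matching -> 1,000, regression -> 2,000
result = avg_abs_bias_by_type(sample)
```

This keeps a design type with many reported estimates from dominating the comparison across designs.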
As mentioned earlier, one should interpret the statistics in Table 2 with caution. The average of the bias estimates can be substantially influenced by outliers reflecting small samples or unrealistic estimators. However, the average does indicate whether the estimates are centered on zero and whether they tend to overestimate or underestimate impacts relative to the experimental benchmark. Eight of the twelve studies in our analysis showed that NX methods tended to understate impacts; four showed the opposite. All the studies included bias estimates that were both negative and positive, except for the one by Bratberg, Grasdal, and Risa (2002), in which all the econometric and matching techniques had negative bias estimates. As one would expect, the study with the greatest number of estimates
(Bloom et al. 2002) found the broadest range of estimates, with very large positive and negative values.7

The absolute value of the bias provides a more direct measure of the performance of the NX estimator, where a smaller value always represents better performance. With that measure, the typical NX estimate of impact on annual earnings deviates from the corresponding experimental estimate by about $2,000. The average absolute value in any one study ranged from twice that amount, as in the attempts by Dehejia and Wahba (1999) and by Smith and Todd (forthcoming) to replicate the findings of the NSW experiment using national data sets, to less than $600 per year, as in the two studies by Hotz and colleagues (Hotz, Imbens, and Klerman 2000; Hotz, Imbens, and Mortimer 1999).
Bivariate analysis

To begin to explain the range of NX bias, we conducted simple bivariate analyses examining the relationship between several possible explanatory variables and the size of the bias. The candidate variables are those that describe the quasi-experimental approach and the study in which it was implemented, including the source of the comparison group, the statistical method, and the quality of the data. For each value of an explanatory variable, we computed the average of the absolute value of the bias for all NX estimators with that value (see Table 3). For the entire sample of studies, the unweighted average of the absolute value of the bias associated with using NX methods was about $1,500. However, this was based on all 1,150 bias estimates without aggregating to account for nonindependence of the estimates or unequal sample size. Therefore, we constructed two sets of weights. The first (weight 1) gives more emphasis to estimates based on studies that had larger samples as measured by the number of control group members in the randomized experiment; the other (weight 2) multiplies the sample-size weight by a factor inversely proportional to the number of estimates for a given sample. For example, if a researcher used 10 different methods to estimate the same impact for
TABLE 3
AVERAGE BIAS BY CHARACTERISTICS OF ESTIMATOR

Average of Absolute Value of Bias Estimate (Annual Earnings in 1996 Dollars)
Columns: Unweighted; Weight 1 (Sample Size); Weight 2 (Sample Size, Frequency)

Explanatory Variable and Category
Entire sample
Source of comparison group(a)
  Same labor market
  Control group from another site
  National data set
Statistical method: general(a)
  Regression
  Matching
  Selection correction or instrumental variables
  None, simple mean differences
Statistical method: type of matching
  Propensity score matching: one to one
  Propensity score matching: one to many
  Other matching technique
  Did not use matching
Statistical method: specification test result
  Specification not recommended
  Specification recommended
  No test conducted
Quality of background data: regression
  Poor set of controls
  Extensive set of controls
  Very extensive set of controls
  Did not use regression
Quality of background data: matching
  Poor set of covariates
  Extensive set of covariates
  Very extensive set of covariates
  Did not use matching
Quality of background data: overall
  Used prior earnings
  Did not use prior earnings
Experimental sample size
  Small (<500 controls)
  Medium (500 to 1,500 controls)
  Large (>1,500 controls)
Experimental impact finding
  Program is effective
  Program is ineffective
  Indeterminate
Number of observations

1,477  1,101  1,110
932  843
2,817
821  902
2,409
1,010  828
2,071  1,323
1,101  1,143  2,251  2,791
1,047  1,181  1,231  1,750
4,027  1,155  1,247
739  852
1,037  1,288
3,165  857
1,047
1,438  1,030  1,008  1,412
2,336  1,228  1,026  2,431
1,752  1,392  1,113  1,750
1,224  2,662
2,533  1,080
819
2,089
1,197  924
1,290  951
802  1,288
1,040  1,379
2,378  1,001
819
1,288
920  1,021
1,150  1,150
885  814
2,131
958  924
1,412  1,515
744  929
1,297  1,311
2,870  1,103
988
1,590  1,036  1,016  1,589
1,313  1,330
920  1,311
1,003  1,591
2,728  960  800
1,276
1,105  911
1,150

a. Categories are not mutually exclusive or exhaustive.
one site or subgroup, then the corresponding bias estimates received a weight of 1/10 times the sample size. Although the results vary somewhat by type of weight used, the qualitative conclusions drawn from them do not, so we focus on the results in the last column, which account for sample size and frequency of sample. Both weights reduce the average absolute value of the bias to about $1,100.
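The two weighting schemes can be written out explicitly. Weight 1 is the experimental control-group size; weight 2 divides that by the number of bias estimates computed on the same sample, so ten estimators applied to one site each carry one-tenth of its sample-size weight. A sketch with hypothetical inputs:

```python
# Weighted averages of |bias| under the two schemes described in the text.
# weight 1: control-group sample size; weight 2: sample size divided by
# the number of estimates sharing that sample. Numbers are hypothetical.

def weighted_mean(values, weights):
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)

# (abs_bias, control_n, n_estimates_on_same_sample)
rows = [
    (800, 1000, 10),   # ten estimators applied to one large sample
    (2400, 200, 1),    # a single estimator on a small sample
]
w1 = [n for _, n, _ in rows]
w2 = [n / k for _, n, k in rows]
abs_bias = [b for b, _, _ in rows]

avg_w1 = weighted_mean(abs_bias, w1)
avg_w2 = weighted_mean(abs_bias, w2)
```

Under weight 2, the large sample no longer dominates simply because its researcher reported many estimates on it.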
Table 3 shows that some factors are indeed associated with higher and lower bias. As one would expect, the source of the comparison group has a role. The average bias was lower (less than $900) when the comparison group came from the same labor market as the treated population or was composed of randomized control group members from a separate experiment and higher (more than $2,000) when the comparison group was drawn from a national data set. This finding suggests that, while convenient, publicly available data sets at the national level are not the best for evaluating training or welfare programs. Aspects of the statistical method also were associated with the size of the bias.
There was little difference between regression and matching methods overall, but some matching methods performed better than others. In particular, one-to-one propensity score matching had lower bias than other propensity score methods or non-propensity-score matching. Five of the studies (Lalonde 1986; Heckman et al. 1998; Gritz and Johnson 2001; Bratberg, Grasdal, and Risa 2002; and Bloom et al. 2002) included some form of econometric selection correction procedure such as the Heckman two-step estimator or instrumental variables estimator, but these methods performed poorly on average, about as poorly as using no method at all.

Rather than examine all quasi-experimental estimators, it may be more productive to focus on the performance of those one would expect (in the absence of a randomized experiment) to be the best ones. To make such a priori predictions, researchers use specification tests, as illustrated by Heckman and Hotz (1989) in their reanalysis of Lalonde's (1986) replications of the NSW experiment. The typical specification test applies the NX estimator to outcome data from before the intervention. If the estimated impacts, which should be zero since nobody has been exposed to the intervention, are larger than would be expected by chance, then the estimator is rejected, and its use is not recommended. Many of the design replication studies that we reviewed did not conduct specification tests. Among those that did, the average absolute bias of rejected estimators was nearly $2,900, almost three times the bias of recommended ones. This suggests, consistent with the findings of Heckman and Hotz, that specification testing, where feasible, can help eliminate poor-performing estimators. The estimated bias of the recommended estimators, however, was still large in absolute value: more than $1,000.
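The preprogram specification test can be sketched as follows: apply the estimator to an outcome measured before anyone was treated, and reject the estimator if the pseudo-impact, which should be zero, is statistically significant. This minimal version uses a simple two-sample t-statistic for a mean-difference estimator; the tests in the reviewed studies are more elaborate:

```python
# Pre-program specification test: estimate a pseudo-impact on earnings
# measured BEFORE the intervention. The true effect is zero, so a
# pseudo-impact significantly different from zero flags a biased estimator.
import math

def pseudo_impact_t(pre_treat, pre_comp):
    """Two-sample t-statistic for the pre-period difference in means."""
    n1, n2 = len(pre_treat), len(pre_comp)
    m1 = sum(pre_treat) / n1
    m2 = sum(pre_comp) / n2
    v1 = sum((x - m1) ** 2 for x in pre_treat) / (n1 - 1)
    v2 = sum((x - m2) ** 2 for x in pre_comp) / (n2 - 1)
    return (m1 - m2) / math.sqrt(v1 / n1 + v2 / n2)

def recommend(pre_treat, pre_comp, critical=1.96):
    """Recommend the estimator only if the pseudo-impact is insignificant."""
    return abs(pseudo_impact_t(pre_treat, pre_comp)) < critical

# Hypothetical pre-program earnings: the comparison group earns far more,
# so a naive mean-difference estimator should be rejected.
treat = [8000, 9000, 8500, 9500]
comp = [14000, 15000, 14500, 15500]
ok = recommend(treat, comp)
```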
Some authors (Heckman et al. 1998; Smith and Todd forthcoming) have suggested that data quality may be as important as the research design. By categorizing estimators by the richness of the background variables (used as covariates in a regression or as matching variables) to explain the size of the bias, we found some support for this claim.8 The results in Table 3 suggest that the estimators based on a more extensive set of variables in a regression or matching method had lower bias.
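A minimal sketch of the one-to-one propensity score matching compared above. Here the propensity scores are taken as given; in practice they would be estimated from the background covariates just discussed, and all numbers are hypothetical:

```python
# One-to-one nearest-neighbor matching on a propensity score (with
# replacement), followed by the matched mean difference in outcomes.

def match_one_to_one(treated, comparison):
    """treated/comparison: lists of (score, outcome). Returns the matched
    impact estimate: mean over treated of (own outcome - match outcome)."""
    diffs = []
    for score, outcome in treated:
        match = min(comparison, key=lambda c: abs(c[0] - score))
        diffs.append(outcome - match[1])
    return sum(diffs) / len(diffs)

treated = [(0.8, 12000), (0.6, 10000)]
comparison = [(0.79, 11000), (0.62, 9800), (0.10, 20000)]
impact = match_one_to_one(treated, comparison)
```

Note that the dissimilar comparison unit (score 0.10) is never used as a match, which is exactly how matching discards comparison group members unlike the treated population.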
The most important variable to include in the variable set was prior earnings. Studies without it had a bias of about $1,600; those with it, $1,000.9
Finally, we found that the performance of NX methods was related to the sample size and direction of impacts for the experiment. Specifically, the NX methods more closely replicated the experiments when the randomized control groups were large and when the experiments did not show the program was effective. One possible explanation for the large average bias (more than $2,700) in small studies is that the experimental impacts were not precisely estimated, so the estimate of bias is also not precisely estimated. Another possible explanation is the size of the nonrandomized comparison group, which tends to be small when the control
group is small, so the larger estimated bias may reflect random noise in the NX estimate. Because the sample sizes of control and comparison groups are correlated, it is difficult to distinguish between these two stories. The relationship between the direction of the experimental impact and the size of bias suggests that a false positive finding (concluding from the NX evidence that a program works when it does not) may be more common than a false negative.

A limitation of this bivariate analysis is that the design elements listed in Table 3 are not independent. For example, a study that uses a national data set to select a comparison group is likely also to use a relatively poor set of controls; this means that the large average bias for studies that use a national data set could be reflecting a poor set of controls. It is difficult to distinguish these explanations. We therefore proceed with multivariate regression analysis to try to disentangle the factors associated with lower bias.
Multivariate analysis

To examine the effect of research design on bias, we estimated several regressions with the absolute value of the bias in annual earnings as the dependent variable and the design attributes as explanatory variables (see Table 4). As suggested earlier, other types of explanatory variables could also explain bias. However, we
have limited degrees of freedom, so we use a parsimonious model that includes indicator variables for each design replication study to proxy for all the measured and unmeasured characteristics that vary only at the study level. Because the regression models focus on NX design configurations, we did not weight the individual bias estimates. We averaged them within design types so that each design type would be represented no more than once for each study. This aggregation resulted in an analysis data set of sixty-nine bias estimates. The regression results are meant to be illustrative, because some of the design attributes are highly correlated with each other, the data set is very small, and the results depend on the regression specification used.
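A regression of this form, with |bias| for each design type as the dependent variable and 0/1 design attributes as regressors, can be sketched in a self-contained way. The data below are hypothetical stand-ins for the sixty-nine design types, and the solver is a bare-bones normal-equations routine (in practice one would use a statistical package):

```python
# OLS of |bias| on 0/1 design-attribute indicators, solved via the
# normal equations (X'X) beta = X'y with Gaussian elimination.

def ols(X, y):
    k = len(X[0])
    A = [[sum(row[i] * row[j] for row in X) for j in range(k)] for i in range(k)]
    b = [sum(row[i] * yi for row, yi in zip(X, y)) for i in range(k)]
    for i in range(k):                       # forward elimination with pivoting
        p = max(range(i, k), key=lambda r: abs(A[r][i]))
        A[i], A[p] = A[p], A[i]
        b[i], b[p] = b[p], b[i]
        for r in range(i + 1, k):
            f = A[r][i] / A[i][i]
            for c in range(i, k):
                A[r][c] -= f * A[i][c]
            b[r] -= f * b[i]
    beta = [0.0] * k
    for i in reversed(range(k)):             # back substitution
        beta[i] = (b[i] - sum(A[i][j] * beta[j] for j in range(i + 1, k))) / A[i][i]
    return beta

# columns: intercept, used_regression, used_matching (hypothetical data)
X = [[1, 0, 0], [1, 1, 0], [1, 0, 1], [1, 1, 1]]
y = [5000.0, 3400.0, 3600.0, 2100.0]         # |bias|, 1996 dollars
intercept, b_regression, b_matching = ols(X, y)
```

The intercept estimates the bias of an unadjusted comparison, and each coefficient estimates how much a design feature shifts that bias.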
Keeping these limitations in mind, we found the regressions largely confirm what one would expect. Outcomes for the various nonrandomized comparison groups available to evaluators are not good approximations to the counterfactual outcomes if left unadjusted. The intercept in the regression models shown in the odd-numbered columns of Table 4 represents the bias associated with raw mean differences, estimated to be in the range of $4,400 to $5,800 in annual earnings (see the first row). This coefficient is the expected bias in our sample, if one did not make any adjustments to a typical comparison group. In the regression models shown in the even-numbered columns, we include a separate intercept for each study, a study-level fixed effect described above.

The entries in the next two rows of Table 4 suggest that using background data as either covariates or matching variables is about equally effective at reducing bias. These techniques reduce the bias by about $3,100 to $3,600, once we account for the studies' fixed effects (column 6). The sensitivity of this result to the inclusion of fixed effects suggests that the relative performance of regression-based designs versus matching designs is confounded with contextual factors.
Combining methods is better than applying them individually. Models 5 and 6 include an interaction term with a positive coefficient, which suggests that the bias reduction from these two methods is not fully additive, although there is likely some increased benefit from their combination. In model 5, for example, the bias from raw differences in means, represented by the intercept, is $5,775. This value is reduced to $2,550, if only regression is used, and to $3,312, if only matching is used (holding comparison group variables fixed at the value of the omitted categories). If matching and regression are both used, they reinforce each other to reduce the bias to $1,038.
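This arithmetic follows mechanically from the model 5 coefficients implied by the quoted predictions: an intercept of 5,775, main effects that bring the bias to 2,550 (regression only) and 3,312 (matching only), and an interaction of +951 when both are used:

```python
# Predicted |bias| from the interacted regression described in the text
# (model 5). Main effects are derived from the quoted predictions rather
# than read directly off the table.

COEF = {
    "intercept": 5775.0,
    "regression": 2550.0 - 5775.0,     # -3,225
    "matching": 3312.0 - 5775.0,       # -2,463
    "regression_x_matching": 951.0,
}

def predicted_bias(use_regression, use_matching):
    bias = COEF["intercept"]
    if use_regression:
        bias += COEF["regression"]
    if use_matching:
        bias += COEF["matching"]
    if use_regression and use_matching:
        bias += COEF["regression_x_matching"]
    return bias
```

The positive interaction means the combined reduction (to $1,038) is smaller than the sum of the two separate reductions would imply, which is the "not fully additive" point in the text.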
Baseline measures of the outcome are important. The negative coefficients on the difference-in-differences indicator, which equals one if the estimator uses preintervention earnings in any way, show that using baseline measures of the outcome reduces bias, as reported in the literature. For the simpler models 2 and 4, difference-in-differences estimators reduce the bias by about $1,600 in annual earnings, a reduction slightly larger than that achieved with other estimators. The interaction terms of difference-in-differences with the regression and matching (see models 5 and 6) indicate that these methods are also partially offsetting.
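The difference-in-differences estimator referred to here subtracts each group's preintervention mean from its postintervention mean before comparing groups, so fixed differences in earnings levels drop out. A sketch with hypothetical earnings:

```python
# Difference-in-differences: the (post - pre) change for the treated
# group minus the same change for the comparison group.

def mean(xs):
    return sum(xs) / len(xs)

def diff_in_diff(treat_pre, treat_post, comp_pre, comp_post):
    return (mean(treat_post) - mean(treat_pre)) - (mean(comp_post) - mean(comp_pre))

treat_pre = [8000, 9000]      # treated group earns less at baseline...
treat_post = [11000, 12000]
comp_pre = [14000, 15000]     # ...so a raw post-period comparison misleads
comp_post = [15000, 16000]

impact = diff_in_diff(treat_pre, treat_post, comp_pre, comp_post)
```

A raw post-period comparison of these groups would yield -$4,000; differencing out the baseline gap gives $2,000 instead.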
TABLE 4
RESULTS SHOWING THE EFFECT OF NONEXPERIMENTAL APPROACH ON BIAS

Model Specification: 1, 2, 3, 4

Explanatory Variable
Intercept: 4,467*** (657); 4,687*** (1,231)
Statistical method
  Regression: -1,583** (729); -1,516** (705); -1,476** (675); -1,416** (706)
  Matching: -478* (715); -1,268* (794); -807* (692); -1,427** (799)
  Regression x Matching: 951 (1,422); 1,320 (1,484)
  Difference-in-Differences: -1,874 (763); -1,596 (816); -1,859 (718); -1,568 (813)
  Regression x Difference-in-Differences: 2,325 (1,455)
  Matching x Difference-in-Differences
Selection correction: 2,508* (1,248); 2,376 (1,305); 4,619 (1,048); 2,441 (1,299)
Comparison group strategy
  Geographic match: -387 (973); -646 (1,182)
  National data set: 1,145 (1,062); 1,695 (1,536)
  Control group from another site: -1,762 (1,011); NA
Study dummies included: No; Yes; No; Yes

NOTE: Dependent variable is the absolute value of the bias in annual earnings, expressed in 1996 dollars. All explanatory variables are dummy variables. Sample size is 69 bias estimate types.
*Significantly different from zero at the .10 level. **Significantly different from zero at the .05 level. ***Significantly different from zero at the .01 level, two-tailed test.
The one estimator that did not reduce bias at all, but in fact increased it, was the selection correction estimator, but this should be interpreted cautiously. Few estimates in our data were based on econometric methods such as the two-step estimator. Of these, one study (Bratberg, Grasdal, and Risa 2002) rejected the specification based on a hypothesis test but still reported the bias estimate, which was particularly large.10 Of the others, none produced a compelling justification for the exclusion restrictions that typically justify such an approach. An exclusion restriction is an assumption that some variable is known to influence participation in the program (selection into treatment) but not the outcome of interest (earnings).
program (selection into treatment) but not the outcome of interest (earnings).The use of a comparison group that is matched to the same labor market or geo-
graphic area reduced bias by about $600. Funders of evaluation research probably
preferto use
largenationaldatasets to evaluate
programs,because
secondary analy-ses are far less costly than new data collection. Our findings suggest that such a
strategy comes with a penalty, an increase of average bias by about $1,700 (Table 4,column 6).
We coded another comparison group strategy that determined whether the source was a control group from another study or another site. Several studies (for example, those by Hotz, Imbens, and Mortimer 1999; Hotz, Imbens, and Klerman 2000; and Bloom et al. 2002) compared the control group from one site to the control group from another site and labeled one as the nonrandomized comparison group. We included the "control group from another site" indicator variable in the regression primarily to distinguish those studies from others that used comparison groups more readily available to researchers, such as eligible nonapplicants (for example, Heckman et al. 1998) or individuals who applied to the program but were screened out (for example, Bell et al. 1995). One might argue that control groups are not available to most evaluators, so the more relevant bias estimates are the larger ones found when the "other control group" indicator equals zero.
The regression analysis described above is robust to the definition of the dependent variable. We conducted the same analysis using the signed value of the bias and found very similar results. Those results, available from the authors, show
FIGURE 1
DISTRIBUTION OF BIAS ESTIMATES FOR ALL TWELVE STUDIES
[Histogram; horizontal axis: Bias, Annual Earnings, 1996$, from -5,000 to 5,000]
that overall, the unadjusted bias is large and negative. The regressors representing design features increase the bias (toward zero) in much the same way that they decreased the absolute value of the bias as shown in Table 4. Other dependent variables can be used to further analyze the bias estimates. For example, we created two indicator variables, one for whether the NX impact estimate led to the same statistical inference and another for whether it led to the same policy conclusion. Constructing these variables required some additional information, such as the threshold value that would change the policy conclusion, but they allow us to include in a meta-analysis the results from a wider range of design replication studies, including those that focus on education interventions and those whose outcomes are not measured in earnings. The analyses based on these binary outcome variables are beyond the scope of the current article and will be presented in future work.
Aggregation of NX evidence

Now we turn to the third research question, whether averaging multiple NX impact estimates can approximate the results from a well-designed and well-executed experiment. The above discussion suggests that while some factors are associated with lower bias estimates, a single NX estimator cannot reliably replicate an experimental one. The inability to achieve reliable replication may be due to bias or to sampling error in either the experiment or the quasi-experiment. However, a possibility exists that a large enough group of NX estimates pertaining to different study sites, time periods, or interventions might replicate experiments on average. We therefore examined the extent to which positive and negative bias estimates cancel out. If they do, it would provide motivation for those who conduct literature reviews to be able to accumulate a large body of NX evidence, when experiments are scarce, to draw valid conclusions. A useful way to make this assessment is by examining the full distribution of bias estimates for various groupings, such as by intervention, by method, or for a collection of interventions, and looking for a pattern of estimates that tends toward zero.

FIGURE 2
DISTRIBUTION OF BIAS ESTIMATES FROM NATIONAL EVALUATION OF WELFARE TO WORK STRATEGIES (NEWWS) (BLOOM ET AL. 2002)
[Histogram; horizontal axis: Bias, Annual Earnings, 1996$, from -5,000 to 5,000]

The distribution of the 1,150 bias estimates from the twelve studies reviewed in this article provides a case where the bias estimates do appear to cancel out (Figure 1). The distribution is nearly centered on zero with a slight skew. The average bias was about -$600. Applying the weights described above brings the overall bias closer to zero, -$217; and removing the outliers and applying weights makes it even smaller, about -$97.11 This is a crude indicator, but it suggests, consistent with the work of Lipsey and Wilson (1993), that if enough NX studies are combined, the average effect will be close to what the experimental evidence would predict.12
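The sequence of overall averages quoted above (raw, then weighted, then weighted with outliers removed) corresponds to an aggregation like the following; the bias values, weights, and $10,000 trimming cutoff are hypothetical:

```python
# Aggregating signed bias estimates: raw mean, weighted mean, and a
# weighted mean after trimming outliers. Data and the trimming cutoff
# are hypothetical, not the article's 1,150 estimates.

def weighted_mean(values, weights):
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)

biases = [-12000, -800, -400, 200, 600, -1200]
weights = [1, 10, 10, 10, 10, 5]          # e.g., sample-size weights

raw_mean = sum(biases) / len(biases)
wtd_mean = weighted_mean(biases, weights)

kept = [(b, w) for b, w in zip(biases, weights) if abs(b) < 10000]
trimmed_mean = weighted_mean([b for b, _ in kept], [w for _, w in kept])
```

Each step pulls the average toward zero, mirroring the -$600, -$217, -$97 progression in the text, though without an experimental benchmark the trimming rule itself is a judgment call.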
Rarely, however, is the NX research used to answer such broad questions as
whether all programs are effective. Instead, we would like to identify dimensions
along which the bias begins to cancel out for more focused questions such as "What is the average impact of program X?" For the studies reviewed in this article, the average bias was sometimes close to zero (see Table 2) but often was still substantial.

FIGURE 3
DISTRIBUTION OF BIAS ESTIMATES FROM SUPPORTED WORK (SMITH AND TODD FORTHCOMING)
[Histogram; horizontal axis: Bias, Annual Earnings, 1996$, from -5,000 to 5,000]

Each of the studies in the review addressed a single intervention, although some assessed more NX estimators, analyzed more subgroups, had larger samples, or included more sites. The distribution of bias estimates within studies (particularly studies that use multiple subgroups, study sites, or time periods, in addition to multiple estimators) makes this clearer. Figures 2 and 3 display the distribution of bias estimates for two of the studies that examined the largest number of estimators. For the first study (Bloom et al. 2002), the bias estimates are centered roughly on zero, with an average of -$151; but for the second study (Smith and Todd forthcoming), they clearly are not, with an average of -$2,563. It is possible to remove outliers from the estimates reported by Smith and Todd (forthcoming) to achieve an average bias that is closer to zero, but identifying outliers without the benefit of a randomized experiment as a benchmark may be difficult. The within-study evidence from the other studies (Table 2) suggests that the average bias across all methods, subgroups, and time periods is sometimes positive, sometimes negative, and often still in the hundreds of dollars. This suggests that a mechanistic application of a large number of NX estimators might improve the inference one could draw from such evidence, but not in a predictable way. Whether the average bias, properly weighted within and between studies, is really close enough to zero for policy makers, and whether the bias cancels out within a narrower domain of research, are questions that we plan to address as more design replication studies are completed.
What Have We Learned about NX Methods?
Our preliminary review of the evidence suggests that the twelve design replication case studies we identified, even taken together, will not resolve any of the long-standing debates about NX methods. From the case studies, we uncovered some factors that might reduce bias, but we have not identified a reliable strategy for eliminating it either in a single study or in a collection of studies. The findings can be summarized in terms of the three empirical questions posed in section I.
Question: Can NX methods approximate the results from a well-designed and well-executed experiment?
Answer: Occasionally, but many NX estimators produced results dramatically different from the experimental benchmark.

* Quantitative analysis of the bias estimates underscored the potential for very large bias. Some NX impact estimates fell within a few hundred dollars of the experimental estimate, but others were off by several thousand dollars.
* The size and direction of the "average" bias depends on how the average is computed and what weighting assumptions are applied.
* The average of the absolute bias over all studies was more than $1,000, which is about 10 percent of annual earnings for a typical population of disadvantaged workers.
Question: Which NX methods are more likely to replicate impact estimates from a well-designed and well-executed experiment, and under what conditions are they likely to perform better?
Answer: We identified some factors associated with lower estimated bias. However, even with these factors present, the estimated bias was often large.

* The source of the comparison group made a difference in the average bias estimate. For example, bias was lower when the comparison group was drawn from within the evaluation itself rather than from a national data set, locally matched to the treatment population, or drawn as a control group in an evaluation of a similar program or the same program at a different study site.
* Statistical adjustments, in general, reduced bias, but the bias reduction associated with the most common methods (regression, propensity score matching, or other forms of matching) did not differ substantially. Estimators that combined methods had the lowest bias. Classical econometric estimators that used an instrumental variable or a separate predictor of program participation performed poorly.
* Bias was lower when researchers used measures of preprogram earnings and other detailed background measures to control for individual differences.
* Specification tests were useful in eliminating the poorest performing NX estimators.
* Experiments with larger samples were more likely to be closely replicated than those with smaller samples.
* "No impact" or indeterminate impact findings from an experiment were more nearly replicated than were positive experimental impact findings.
Question: Can averaging multiple NX impact estimates approximate the results from a well-designed and well-executed experiment?
Answer: Maybe, but we have not identified an aggregation strategy that consistently removed bias while answering a focused question about earnings impacts.
* Estimated biases were both positive and negative, and their distribution across all the studies reviewed was centered roughly on zero. This was true both for the full set of estimators and for groups of estimates across all studies that used a single method, such as regression or matching.
* For a given intervention, the distribution of bias estimates was sometimes centered near zero, and sometimes was not.
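The distinction drawn in these bullets can be sketched numerically. The figures below are invented for illustration, not taken from the article's data: a set of bias estimates can average out to zero even though every individual estimate is far from the benchmark.

```python
# Hypothetical sketch: bias estimates whose distribution is centered on zero
# in the aggregate, even though individual biases are large. Invented numbers.

biases = [-1800, -1100, -600, -150, 100, 550, 1200, 1800]  # dollars

mean_bias = sum(biases) / len(biases)                      # centered on zero
mean_abs_bias = sum(abs(b) for b in biases) / len(biases)  # still large

# Averaging across many estimates can wash out the sign of the bias while
# leaving each individual estimate far from the experimental benchmark.
```

This is why a roughly zero-centered distribution across all studies does not justify relying on any single NX estimate.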
We caution that this summary of findings gives only part of the picture, and it does so for a specific area of program evaluation research: the impacts of job training and welfare programs on participant earnings. A somewhat more complete story can be developed in the short term as additional design replication studies, including some that are now in progress, come to light. In the meantime, those who plan and design new studies to evaluate the impacts of training or welfare programs on participants' earnings can use the empirical evidence to improve NX evaluation designs but not to justify their use. Similarly, those who wish to summarize a group of NX studies or average over a set of different NX estimates to reach a conclusion about the impact of a single program can draw on the design replication literature to identify stronger or weaker estimates but not to justify the validity of such a summary.
Appendix
What Did the Studies Conclude?
An alternative way to review the literature is to summarize what the authors concluded in their own words. The twelve design replication studies divided into three equal groups about the value of nonexperimental (NX) methods and the degree of similarity between NX findings and those of randomized experiments: four studies concluded that NX methods performed well, four found evidence that some NX methods performed well while others did not, and four found that NX methods did not perform well or that there was insufficient evidence that they did perform well (see Table A1).
The four studies that found positive results (evidence of small bias) qualified their conclusions by indicating that a researcher needs detailed background data (particularly prior earnings), overlap in background characteristics, or intake workers' subjective ratings of the applicants they screened.
It is important to probe the authors' conclusions further than the present discussion allows. The various study authors used different standards to assess the size of the bias and, in some cases, reached different conclusions with the same data. Furthermore, the studies are not of equal value. Some more realistically replicated what would have been done in the absence of random assignment than others. Within studies, some of the estimators or comparison groups were more or less likely to have been used than others, absent an experimental benchmark. Some estimates were based on smaller samples than others. A recent summary by Bloom et al. (2002, chap. 2) describes many of these studies individually. Section III of this article presents a quantitative analysis of all the studies combined.
TABLE A1
AUTHOR CONCLUSIONS FROM THE STUDIES REVIEWED

Nonexperimental estimators performed well
  Bell et al. (1995). Intervention: WTW. Methods examined: OLS, instrumental variables. Comparison samples used: withdrawals, screenouts, no-shows.
  Dehejia and Wahba (1999). Intervention: supported work. Methods examined: PSM, DD. Comparison samples used: CPS, PSID.
  Hotz, Imbens, and Klerman (2000). Intervention: WTW. Methods examined: OLS, DD. Comparison samples used: control groups from other sites.
  Hotz, Imbens, and Mortimer (1999). Intervention: WTW. Methods examined: OLS, PSM. Comparison samples used: control groups from other sites.

Mixed results
  Bratberg, Grasdal, and Risa (2002). Intervention: occupational therapy. Methods examined: OLS, PSM, DD, SC. Comparison samples used: eligible nonparticipants.
  Gritz and Johnson (2001). Intervention: job training. Methods examined: PSM, SC. Comparison samples used: no-shows.
  Heckman et al. (1998). Intervention: job training. Methods examined: OLS, PSM, DD, SC. Comparison samples used: eligible nonparticipants; national data set (SIPP).
  Olsen and Decker (2001). Intervention: job search assistance. Methods examined: OLS, PSM. Comparison samples used: comparison group from related study.

Nonexperimental estimators performed poorly
  Bloom et al. (2002). Intervention: WTW. Methods examined: OLS, PSM, DD. Comparison samples used: control groups from other sites.
  Fraker and Maynard (1987). Intervention: supported work. Methods examined: OLS, cell match, SC. Comparison samples used: CPS.
  Lalonde (1986). Intervention: supported work. Methods examined: OLS, DD, SC. Comparison samples used: CPS, PSID.
  Smith and Todd (forthcoming). Intervention: supported work. Methods examined: OLS, PSM, DD. Comparison samples used: CPS, PSID.
NOTE: WTW = welfare-to-work program; OLS = ordinary least squares regression-adjusted difference in means; PSM = propensity score matching; DD = difference-in-differences; SC = parametric or nonparametric selection correction; SIPP = Survey of Income and Program Participation; CPS = Current Population Survey; PSID = Panel Study of Income Dynamics.
Notes
1. This article uses the term "nonexperimental" as a synonym for "quasi-experimental," although "quasi-experimental" is used in places to connote a more purposeful attempt by the researcher to mimic randomized trials. In general, any approach that does not use random assignment is labeled nonexperimental (NX).
2. Findings reported here are drawn from a research synthesis prepared under the guidelines of the Campbell Collaboration. The published protocol is available at http://www.campbellcollaboration.org/doc-pdf/qedprot.pdf.
3. A broader review we have undertaken for the Campbell Collaboration includes design replication studies that estimate bias for other outcomes, such as student achievement, school dropout, and receipt of public assistance benefits. Forthcoming results from that study examine binary indicators for whether the experimental and NX estimators support the same statistical inference and whether they support the same policy conclusion.
4. An important area excluded by this criterion was health-related interventions (for example, McKay et al. 1995, 1998). Models of program participation, the key factor in sample selection bias, might be similar among education-, training-, and employment-related interventions but are likely to differ markedly for a medical or community health intervention. Furthermore, the outcomes would typically be very different. We initially applied a broader criterion that included school-related outcomes such as school dropout and test scores but ultimately focused on interventions with earnings as the main outcome to limit the number of confounding factors. A forthcoming Campbell review will draw on the wider literature.
5. It is less important for our purposes that the experimental estimator be externally valid or that it represent one policy parameter in particular (such as the effect of the treatment on the treated or the local average treatment effect), as long as the NX estimator purports to measure the same thing.
6. Randomization bias results when the treatment group's experience is influenced by the presence of a randomized evaluation. Spillover effects result when the control group's experience is influenced by the presence of a treatment group. Substitution bias results when control group members are given an alternative treatment that they would not have received absent the experiment. John Henry and Hawthorne effects result from members of the control and treatment group, respectively, behaving differently because they are aware of their inclusion in an experiment.
7. It is important to recall that a wide range of bias estimates does not necessarily imply a wide range of biases, because of sampling error in both the experimental and NX impact estimates.
8. The coding of the variables representing quality of background data (for regression or matching) necessarily involves some subjectivity. To be systematic, we applied the following criteria: if the specification included several quarters of baseline earnings and a large number of relevant background variables, we coded the quality of the data as "very extensive." If the specification contained some baseline measure of earnings and a set of individual background variables that captures the key elements that are likely to affect outcomes, then it was coded as "extensive." Otherwise, it was coded as "poor."
9. Some researchers, such as Bloom et al. (2002) and Smith and Todd (forthcoming), tried to determine the number of quarters of prior earnings needed to reduce bias to acceptable levels, but there were not enough other examples to draw any general lessons.
10. The study population for Bratberg, Grasdal, and Risa (2002) differs from the populations targeted in the other studies under review not only because the population was made up of Norwegians but also because the sample members were not disadvantaged workers. The larger bias estimates would apply to a larger earnings base and therefore not be as substantively important as a similarly sized bias found in a study of U.S. welfare participants. Some of this effect is measured by the study fixed effect (see the even-numbered columns in Table 4).
11. Removing the outliers in this case is probably reasonable because the outlying bias estimates correspond to NX impact estimates that were implausible on their face (given the collection of other impact estimates). One cannot count on being able to identify this type of outlier in general, when an experimental benchmark is not available.
12. We also examined the distribution across studies for a given method (matching and regression) and found a similar result. This suggests that aggregation need not be done across methods if a large collection of studies and interventions is used.
References

Bell, Stephen H., Larry L. Orr, John D. Blomquist, and Glen C. Cain. 1995. Program applicants as a comparison group in evaluating training programs. Kalamazoo, MI: Upjohn Institute for Employment Research.
Bloom, Howard, Charles Michalopoulos, Carolyn Hill, and Ying Lei. 2002, June. Can non-experimental comparison group methods match the findings from a random assignment evaluation of mandatory welfare-to-work programs? New York: Manpower Demonstration Research Corporation.
Bratberg, Espen, Astrid Grasdal, and Alf Erling Risa. 2002. Evaluating social policy by experimental and NX methods. Scandinavian Journal of Economics 104 (1): 147-71.
Campbell, Donald, and J. C. Stanley. 1966. Experimental and quasi-experimental designs for research. Chicago: Rand McNally.
Cook, Thomas, and Donald Campbell. 1979. Quasi-experimentation: Design and analysis issues for field settings. Hopewell, NJ: Houghton-Mifflin.
Cooper, H., K. Charlton, J. C. Valentine, and L. Muhlenbruck. 2000. Making the most of summer school: A meta-analytic and narrative review. Monographs of the Society for Research in Child Development. Malden, MA: Blackwell.
Dehejia, Rajeev, and Sadek Wahba. 1999. Causal effects in NX studies: Reevaluating the evaluation of training programs. Journal of the American Statistical Association 94 (448): 1053-62.
Fraker, Thomas, and Rebecca Maynard. 1987. The adequacy of comparison group designs for evaluations of employment-related programs. Journal of Human Resources 22 (2): 194-227.
Glazerman, Steven M., Dan M. Levy, and David Myers. 2002. NX replications of social experiments in education, training, and employment services: Revised protocol. Philadelphia: Campbell Collaboration. Available from http://www.campbellcollaboration.org/doc-pdf/qedprot.pdf.
Gritz, R. Mark, and Terry Johnson. 2001. National Job Corps Study: Assessing program effects on earnings for students achieving key program milestones. Washington, DC: Battelle Memorial Institute.
Heckman, James J., and V. Joseph Hotz. 1989. Choosing among alternative NX methods for estimating the impact of social programs: The case of manpower training. Journal of the American Statistical Association 84 (408): 862-74.
Heckman, James J., Hidehiko Ichimura, Jeffrey C. Smith, and Petra Todd. 1998. Characterizing selection bias. Econometrica 66 (5): 1017-98.
Hotz, V. Joseph, Guido W. Imbens, and Jacob Klerman. 2000. The long-term gains from GAIN: A re-analysis of the impacts of the California GAIN Program. NBER Working Paper 8007. Cambridge, MA: National Bureau of Economic Research.
Hotz, V. Joseph, Guido W. Imbens, and Julie H. Mortimer. 1999. Predicting the efficacy of future training programs using past experiences. NBER Technical Working Paper 238. Cambridge, MA: National Bureau of Economic Research.
Lalonde, Robert. 1986. Evaluating the econometric evaluations of training with experimental data. American Economic Review 76 (4): 604-20.
Lipsey, Mark W., and David B. Wilson. 1993. The efficacy of psychological, educational, and behavioral treatment: Confirmation from meta-analysis. American Psychologist 48:1181-209.
McConnell, Sheena, and Steven M. Glazerman. 2001. National Job Corps Study: The benefits and costs of Job Corps. Washington, DC: Mathematica Policy Research, Inc.
McKay, James R., Arthur I. Alterman, Thomas McLellan, Chris R. Boardman, Frank D. Mulvaney, and Charles P. O'Brien. 1998. Random versus nonrandom assignment in the evaluation of treatment for cocaine abusers. Journal of Consulting and Clinical Psychology 66 (4): 697-701.
McKay, James R., Arthur I. Alterman, Thomas McLellan, Edward C. Snider, and Charles P. O'Brien. 1995. Effect of random versus nonrandom assignment in a comparison of inpatient and day hospital rehabilitation for male alcoholics. Journal of Consulting and Clinical Psychology 63 (1): 70-78.
National Reading Panel. 2000. Teaching children to read: An evidence-based assessment of the scientific research literature on reading and its implications for reading instruction. Reports of the subgroups. NIH Pub. no. 00-4754, December. Bethesda, MD: National Institute of Child Health and Human Development.
Newman, John, Menno Pradhan, Laura Rawlings, Geert Ridder, Ramiro Coa, and Jose Luis Evia. 2001. An impact evaluation of education, health and water supply investments of the Bolivian Social Investment Fund. World Bank Economic Review, June. Washington, DC: World Bank.
Olsen, Robert, and Paul Decker. 2001. Testing different methods of estimating the impacts of worker profiling and reemployment services systems. Washington, DC: Mathematica Policy Research, Inc.
Reynolds, Arthur J., and Judy A. Temple. 1995. Quasi-experimental estimates of the effects of a preschool intervention: Psychometric and econometric comparisons. Evaluation Review 19 (4): 347-73.
Shadish, William R. 2000. The empirical program of quasi-experimentation. In Research design: Donald Campbell's legacy, edited by L. Bickman. Thousand Oaks, CA: Sage.
Shadish, William R., and Kevin Ragsdale. 1996. Random versus nonrandom assignment in psychotherapy experiments: Do you get the same answer? Journal of Consulting and Clinical Psychology 64:1290-305.
Smith, Jeffrey C., and Petra Todd. Forthcoming. Does matching overcome Lalonde's critique of NX estimators? Journal of Econometrics.