
  • Program Evaluation and Development Effectiveness. Carolyn J. Heinrich, University of Wisconsin-Madison

    Third Meeting of the Social Policy Monitoring Network, Buenos Aires, November 22-23, 2004

  • Increasing Demand for Performance and Program Evaluation: "If there is a single theme that characterized the public sector in the 1990s, it is the demand for performance. A mantra emerged in this decade, heard at all levels of government, that calls for documentation of performance and explicit outcomes of government action."

    Beryl Radin, Beyond Machiavelli: Policy Analysis Comes of Age (2000)

  • Performance Management versus Impact Evaluation. Performance management's primary goal is accountability: to legislative bodies, taxpayers, and program stakeholders.

    Impact evaluation's main objective is knowledge creation: understanding and improving program impacts and targeting.

  • Performance Management versus Impact Evaluation in Practice

    Performance management:
    - Shorter-term focus
    - Analysis of outcomes
    - Ongoing, relying on readily accessible and regularly collected data
    - Annual performance plans and program performance reports
    - Organization-level accountability, with incentives/sanctions tied to performance

    Impact evaluation:
    - Longer-term focus
    - Analysis of impacts (value added)
    - Periodic, more intensive data collection and analysis
    - Accurate estimation of program impacts and their distribution
    - Informs policy and program design, targeting, and funding decisions

  • Types of Measures Used in Evaluation

    - Inputs (physical and financial resources, staff, etc.)
    - Outputs (goods and services produced)
    - Process (monitoring implementation, use of inputs in the production of outputs)
    - Efficiency (productivity, unit costs)
    - Outcomes (intermediate goals, readily observed), e.g., the number of students graduating (completing the third cycle of the Educación General Básica)
    - Impacts (net gains, value added), e.g., increases in skill levels and human capital due to program participation

  • Linking Performance Management and Program Evaluation

    Challenge: Identify accessible and informative performance measures and methods of analysis that accurately approximate impacts (value added) and reinforce progress toward long-run program goals

    Problems:
    - Evidence from some experimental impact evaluations (the National JTPA Study, the GAIN (Greater Avenues for Independence) Study, and the Job Corps experimental impact evaluation) shows weak relationships between measures of short-term outcomes and longer-term impacts
    - Evidence of gaming behavior is growing

  • Example: Gaming of student achievement tests (Koretz study of competing test forms over a 4-year period)

    [Chart: mean test scores in grade equivalents, by year, for the coached district tests versus the uncoached study test]

    Year    Test C    Test B
    1986    4.3       n/a
    1987    n/a       3.7
    1988    n/a       4.05
    1989    n/a       4.1
    1990    3.7       4.3

    (Legend: district tests, coached; study test, uncoached. Axes: Year vs. Grade Equivalents.)

  • Randomized Experimental Evaluation

    - Individuals are assigned to receive treatment or services by chance, e.g., through a random draw or process
    - Random assignment to experimental (treatment) and control groups establishes statistical equivalence of treatment and control group members
    - Assumption: no differences in average characteristics (observed or unobserved) between treatment and control groups
    - Any outcome differences observed between treatment and control groups are assumed due to treatment

  • Average Impact Estimation in Randomized Experiments: Under random assignment, the observed post-program difference in mean outcomes between treatment and control group members is the estimated impact; a short formal statement of this difference-in-means estimator follows below.
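
    A minimal formal statement of the estimator, in notation added here rather than taken from the slides: with \bar{Y}_T and \bar{Y}_C the mean post-program outcomes of the treatment and control groups,

      \hat{\Delta} = \bar{Y}_T - \bar{Y}_C, \qquad
      \widehat{SE}(\hat{\Delta}) = \sqrt{ s_T^2 / n_T + s_C^2 / n_C },

    where s_T^2, s_C^2 and n_T, n_C are the sample outcome variances and sample sizes of the two groups. Under random assignment, this difference in means is an unbiased estimate of the average program impact.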

  • Limitations of Randomized Experiments

    - Ethical concerns about interference with program processes or denial of access to services
    - The knowledge produced is an average impact estimate; more complex designs (or nonexperimental components) are needed to estimate the distribution of impacts
    - Higher costs of implementation and data collection in experiments
    - Some experiments are highly localized and dependent on context
    - The statistical equivalence property may not hold in small experimental samples

  • Nonexperimental Evaluation (Quasi-experiments)

    - Individuals are not assigned to treatment by a random process
    - The counterfactual state is unobservable
    - It is essential to understand and model the processes by which assignments to treatment are made:
      - Self-selection (e.g., an individual's decision to apply)
      - Administrator selection (e.g., individuals assigned to treatment based on specific criteria)
      - A combination of self- and administrator selection
    - Assert the causal relationship and evaluate plausible alternative explanations that might falsify the claim

  • Nonexperimental Evaluation Design Challenges

    - Individuals who participate are likely to differ in systematic ways from those who do not participate
    - Comparison group members should be similarly eligible for and interested in program participation and/or drawn from the same local labor market
    - Effects of program participation may differ across individuals (heterogeneity in treatment effects)
    - The presence of heterogeneity in effects may affect individuals' responses to the offer of treatment

  • Nonexperimental Impact Estimation: If not corrected, selective differences between treatment and comparison group members will bias program impact estimates, as the decomposition below makes explicit.
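
    A compact way to state the problem, in standard potential-outcomes notation added here rather than taken from the slides: with D = 1 indicating treatment and Y_1, Y_0 the outcomes with and without treatment, the observed treatment-comparison difference decomposes as

      E[Y_1 \mid D=1] - E[Y_0 \mid D=0]
        = \underbrace{E[Y_1 - Y_0 \mid D=1]}_{\text{impact on the treated}}
        + \underbrace{E[Y_0 \mid D=1] - E[Y_0 \mid D=0]}_{\text{selection bias}}

    The second term is zero under random assignment but generally not when participants self-select or are selected by administrators.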

  • Alternative Nonexperimental Designs

    - Ex-ante evaluation methods: estimate likely program impact under alternative behavioral assumptions
    - Use of an exogenous event or factor that influences program participation as a proxy for random assignment (e.g., quotas); examples: the Bolsa Escola and Becas Estudiantiles evaluations
    - Instrumental variables estimation: uses a variable that is uncorrelated with the error term (unobserved characteristics or omitted variables) but that predicts program participation to adjust for selection bias in impact estimation; example: distance to the nearest school in estimating impacts of school choice programs on student educational outcomes (a numerical sketch follows below)
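
    A minimal numerical sketch of the instrumental variables idea with a binary instrument (the Wald estimator); the simulated data and variable names are illustrative, not from the presentation.

      import numpy as np

      rng = np.random.default_rng(0)
      n = 5000

      # Simulated setup: z is an exogenous factor (e.g., living near a school),
      # u is an unobserved characteristic affecting both participation d and outcome y.
      z = rng.integers(0, 2, n)
      u = rng.normal(size=n)
      d = (0.5 * z + 0.5 * u + rng.normal(size=n) > 0.5).astype(int)  # participation
      y = 2.0 * d + 1.5 * u + rng.normal(size=n)                      # true impact = 2.0

      # Naive comparison of participants and non-participants (biased by u).
      naive = y[d == 1].mean() - y[d == 0].mean()

      # Wald/IV estimate: effect of z on y divided by effect of z on participation.
      wald = (y[z == 1].mean() - y[z == 0].mean()) / (d[z == 1].mean() - d[z == 0].mean())

      print(f"naive difference: {naive:.2f}, IV (Wald) estimate: {wald:.2f}")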

  • Alternative Nonexperimental Designs (continued)

    - Panel data methods use data collected from the same individuals at multiple points in time
    - Fixed effects models control for stable (observed and unobserved) characteristics of individuals
    - First-difference and difference-in-differences models control for all stable characteristics, measured and unmeasured, and for measured changing characteristics; they use repeated observations (at least two time points for all individuals or units of analysis)
      - The first-difference model adjusts for stable characteristics that affect the cross-sectional level of the dependent variable
      - The difference-in-differences model adjusts for stable characteristics that affect changes in the dependent variable across time

  • First-difference vs. Difference-in-differences Estimator: A first-difference model will not adjust for characteristics that affect changes in the dependent variable over time; a difference-in-differences estimator is needed,

      [(Y_T2 - Y_T1) - (Y_T1 - Y_T0)] - [(Y_C2 - Y_C1) - (Y_C1 - Y_C0)],

    where T and C index the treatment and comparison groups and 0, 1, 2 index successive time periods (a numerical sketch follows below).
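
    A minimal numerical sketch of the two estimators using illustrative group means (the numbers are made up, not data from the presentation):

      import numpy as np

      # Illustrative group-mean outcomes at three time points; period 2 is the
      # post-program period for the treatment group.
      y_t = np.array([10.0, 12.0, 17.0])   # treatment group means at t = 0, 1, 2
      y_c = np.array([ 9.0, 11.0, 14.0])   # comparison group means at t = 0, 1, 2

      # First difference: pre-post change for the treatment group only. It removes
      # stable characteristics affecting the level of the outcome, but not trends.
      first_diff = y_t[2] - y_t[1]

      # Difference-in-differences estimator as reconstructed above: the change in
      # changes for the treatment group minus the same quantity for the comparison
      # group, which also nets out characteristics affecting trends over time.
      did = ((y_t[2] - y_t[1]) - (y_t[1] - y_t[0])) - ((y_c[2] - y_c[1]) - (y_c[1] - y_c[0]))

      print(f"first-difference estimate: {first_diff:.1f}")
      print(f"difference-in-differences estimate: {did:.1f}")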

  • Nonexperimental Evaluation Data Requirements

    - More periods of data are better: data from pre-program, during-program, and post-program periods
    - Measures should be consistent over time
    - Detailed data for assessing eligibility and describing program participation aid evaluation of heterogeneous program effects
    - Build data collection mechanisms into program plans before program implementation
    - Use independent, pooled cross-sectional data from national surveys (random samples of individuals at different points in time) if panel data are unavailable

  • Assessing Data Needs and Determining Appropriate Measures

    - Determine what you want to measure or explain
    - Describe the probable causal relationship between the program or intervention and the phenomena of interest (e.g., outcome or impact)
    - Identify existing data sources to construct measures (operational definitions), e.g., national sample surveys (Census and other household surveys) and administrative data sources
    - Design data collection instruments to gather data not available from existing sources
    - Determine the term of data collection for each measure (e.g., cross-sectional, repeated cross-sections, longitudinal data)

  • Measurement Challenges and Common Data Limitations

    - Lack of clarity or agreement on measurable program objectives
    - Poorly defined measures produce low-quality data
    - Inconsistent data quality (e.g., due to poor data collection procedures or data entry controls)
    - Inaccuracies in self-reported data
    - Non-response and refusals due to the sensitivity of questions
    - High costs of original data collection
    - Data security, storage requirements, and confidentiality protections

  • Use of Existing Data Sources: National Survey Samples

    Advantages:
    - Data collected at regular intervals
    - Measures typically consistent across time periods
    - Makes use of a large investment in data collection
    - Usually inexpensive to obtain

    Disadvantages:
    - Representation of particular subgroups may be limited
    - Individual identifiers typically not available to link records to other data sources
    - Inability to influence the types of data collected (e.g., question items)

  • Use of Existing Data Sources: Administrative Data

    Advantages:
    - Detailed information about clients, progression through program treatment, and outcomes
    - Complete coverage of service populations
    - Longitudinal data on participants for some programs
    - Low cost of obtaining data for multiple program years
    - Changes/additions to data more readily accommodated

    Disadvantages:
    - Quality and potential utility of administrative data vary considerably
    - Regular and systematic checking of data quality is seldom performed
    - Standardized data collection procedures may not be followed across program sites
    - Privacy/permission issues may delay data access and transfer

  • Other Limitations of Using Administrative Data

    - Rates of program participation, analyses of individuals eligible for services who do not apply, and some program outcomes are not measurable with administrative data
    - Little or no information is collected from individuals once they leave the program, whether or not the program is completed
    - Addressing selection problems and measuring post-program or longer-term outcomes typically require supplementary data collection

  • Advantages and Disadvantages of Sample Survey Data

    - Survey data typically do not cover complete service populations but may cover a broader range of topics (e.g., more detailed information on individuals' backgrounds and outcomes)
    - Surveys facilitate data collection from individuals over a longer time period following program participation
    - Surveys may obtain comparable information from individuals who did not participate
    - Nonresponse is a common source of selection bias in survey data, and there is often a lack of information about when nonresponse is likely to bias results

  • Other Factors Biasing Nonexperimental Estimation

    - Self-selection into or out of the enrollment process or program
    - Program location and other factors selectively limiting access to treatment
    - Variations in program administration and implementation
    - Poor accounting of drop-outs, no-shows, and contamination

  • Nonexperimental Impact Evaluation Strategies to Correct for Bias

    - The source of the comparison group is key: draw from similar geographical areas, use internal controls
    - Improve the data and covariate measures used in adjusting program impact estimates, including pre-intervention measures of outcomes
    - Econometric matching techniques, e.g., propensity score matching
    - Use instrumental variables or other two-stage estimation techniques (e.g., regression discontinuity) to adjust for unobservable differences between treatment and comparison groups
    - Specification tests and sensitivity analyses to evaluate assumptions

  • Econometric Matching Methods

    - Estimate the effect of treatment on the treated under the assumption that, conditional on measured characteristics, program participation is independent of outcomes
    - Useful when data on pre-treatment controls (observed characteristics) are rich and the distribution of characteristics varies significantly with treatment status
    - If treatment status is influenced by unmeasured variables, matching methods are likely to produce biased impact estimates
    - Propensity score matching (estimation of the probability P(X) that an individual with characteristics X participates) reduces the matching problem to a single dimension
    - Imposing common support precludes poor matches between treatment and comparison group members (a sketch follows below)
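
    A minimal propensity score matching sketch on simulated data (the variable names and data-generating process are illustrative, not from the presentation): estimate P(X) with a logistic regression, impose common support, match each treated unit to its nearest comparison unit on the score, and average the matched outcome differences.

      import numpy as np
      from sklearn.linear_model import LogisticRegression

      rng = np.random.default_rng(1)
      n = 2000
      x = rng.normal(size=(n, 3))                                  # observed characteristics X
      p_true = 1 / (1 + np.exp(-(x[:, 0] + 0.5 * x[:, 1])))        # true participation probability
      d = rng.binomial(1, p_true)                                  # observed participation
      y = 1.0 * d + x[:, 0] + 0.5 * x[:, 2] + rng.normal(size=n)   # outcome, true impact = 1.0

      # 1. Estimate propensity scores P(X).
      ps = LogisticRegression().fit(x, d).predict_proba(x)[:, 1]

      # 2. Impose common support: keep treated units whose scores lie within the
      #    range of comparison-group scores, precluding poor matches.
      lo, hi = ps[d == 0].min(), ps[d == 0].max()
      treated = np.where((d == 1) & (ps >= lo) & (ps <= hi))[0]
      comparison = np.where(d == 0)[0]

      # 3. One-to-one nearest-neighbor matching on the score (with replacement).
      matches = comparison[np.abs(ps[comparison][None, :] - ps[treated][:, None]).argmin(axis=1)]

      # 4. Estimated effect of treatment on the treated: mean matched outcome gap.
      att = (y[treated] - y[matches]).mean()
      print(f"matched estimate of the effect on the treated: {att:.2f}")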

  • Regression Discontinuity Example: Regression discontinuity uses a pre-treatment cut-off value, e.g., an eligibility score or other measure that determines treatment status (a numerical sketch follows the figure below).

    [Figure: regression discontinuity design, plotting the post-treatment outcome against the pre-treatment value]
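
    A minimal sharp regression discontinuity sketch on simulated data (the cut-off, bandwidth, and variable names are illustrative, not from the presentation): fit separate local linear regressions on each side of the cut-off and take the gap between their predictions at the cut-off as the impact estimate.

      import numpy as np

      rng = np.random.default_rng(2)
      n = 4000
      cutoff = 50.0

      score = rng.uniform(0, 100, n)                     # pre-treatment eligibility score
      d = (score >= cutoff).astype(int)                  # sharp rule: treated if score >= cutoff
      y = 0.05 * score + 2.0 * d + rng.normal(size=n)    # outcome, true jump at cut-off = 2.0

      # Local linear fits within a bandwidth on each side of the cut-off.
      bandwidth = 10.0
      left = (score >= cutoff - bandwidth) & (score < cutoff)
      right = (score >= cutoff) & (score <= cutoff + bandwidth)

      left_fit = np.polyfit(score[left], y[left], 1)     # returns [slope, intercept]
      right_fit = np.polyfit(score[right], y[right], 1)

      # Estimated impact: discontinuity in predicted outcomes at the cut-off.
      rd_estimate = np.polyval(right_fit, cutoff) - np.polyval(left_fit, cutoff)
      print(f"estimated jump in the outcome at the cut-off: {rd_estimate:.2f}")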

  • Quote from James J. Heckman: "Most of the reported sensitivity of program impact estimates to the choice of estimator is actually a result of violations of basic principles of evaluation analysis: comparing incomparable people. What we really need are better data, not better estimators."

  • Objectives for Ongoing Performance Analysis and Impact Evaluation

    - Conduct randomized experiments periodically to evaluate program impacts (value added)
    - Continue research to identify and develop performance outcome measures that correlate strongly with program impacts
    - Measures more distant from impacts, associated with them through hypothesized relationships and proxy variables (e.g., test scores for educational skills gains), increase the degree of uncertainty in performance and impact analysis and impose resource costs
    - High-stakes evaluations need to recognize errors and imprecision due to data and methodological limitations