35
Optimal Multilevel Matching in Clustered Observational Studies: A Case Study of the School Voucher System in Chile * José R. Zubizarreta Luke Keele March 16, 2015 Abstract A distinctive feature of a clustered observational study is its multilevel or nested data structure arising from the assignment of treatment, in a non-random manner, to groups or clusters of units or individuals. Examples are ubiquitous in the health and social sciences including patients in hospitals, employees in firms, and students in schools. What is the optimal matching strategy in a clustered observational study? At first thought, one might start by matching clusters of individuals and then, within matched clusters, continue by matching individuals. But, as we discuss in this paper, the optimal strategy is the oppo- site: first match individuals and, once all possible combinations of matched individuals are known, then match clusters. In this paper we use dynamic and integer programming to im- plement this strategy and extend optimal matching methods to hierarchical and multilevel settings. In particular, our method attempts to replicate a paired clustered randomized study by finding the largest sample of matched pairs of treated and control individuals within matched pairs of treated and control clusters that is balanced according to speci- fications given by the user. Our method directly balances covariates both at the cluster and individual levels and does not require estimating the propensity score, although the propensity score can be balanced as an additional covariate. We illustrate our method on a case study of the comparative effectiveness of public versus private voucher schools in Chile, a question of intense policy debate in the country at the present. Keywords: Causal Inference; Group Randomization; Hierarchical/Multilevel Data; Observational Study; Optimal Matching * For comments and suggestions, we thank Cinar Kilcioglu, Sam Pimentel and Paul Rosenbaum, and seminar participants at Johns Hopkins University and the University of Pennsylvania. Assistant Professor, Division of Decision, Risk and Operations, and Statistics Department, Columbia Uni- versity, 3022 Broadway, 417 Uris Hall, New York, NY 10027, Email: [email protected]. Associate Professor, Department of Political Science, 211 Pond Lab, Penn State University, University Park, PA 16802 Phone: 814-863-1592, Email: [email protected]. 1

Optimal Multilevel Matching in Clustered Observational ... · Optimal Multilevel Matching in Clustered Observational Studies: A Case Study of the School Voucher System in Chile∗

  • Upload
    others

  • View
    12

  • Download
    0

Embed Size (px)

Citation preview

  • Optimal Multilevel Matching in Clustered Observational Studies:A Case Study of the School Voucher System in Chile∗

    José R. Zubizarreta† Luke Keele‡

    March 16, 2015

    Abstract

    A distinctive feature of a clustered observational study is its multilevel or nested datastructure arising from the assignment of treatment, in a non-random manner, to groups orclusters of units or individuals. Examples are ubiquitous in the health and social sciencesincluding patients in hospitals, employees in firms, and students in schools. What is theoptimal matching strategy in a clustered observational study? At first thought, one mightstart by matching clusters of individuals and then, within matched clusters, continue bymatching individuals. But, as we discuss in this paper, the optimal strategy is the oppo-site: first match individuals and, once all possible combinations of matched individuals areknown, then match clusters. In this paper we use dynamic and integer programming to im-plement this strategy and extend optimal matching methods to hierarchical and multilevelsettings. In particular, our method attempts to replicate a paired clustered randomizedstudy by finding the largest sample of matched pairs of treated and control individualswithin matched pairs of treated and control clusters that is balanced according to speci-fications given by the user. Our method directly balances covariates both at the clusterand individual levels and does not require estimating the propensity score, although thepropensity score can be balanced as an additional covariate. We illustrate our method ona case study of the comparative effectiveness of public versus private voucher schools inChile, a question of intense policy debate in the country at the present.

    Keywords: Causal Inference; Group Randomization; Hierarchical/Multilevel Data; ObservationalStudy; Optimal Matching

    ∗For comments and suggestions, we thank Cinar Kilcioglu, Sam Pimentel and Paul Rosenbaum, and seminarparticipants at Johns Hopkins University and the University of Pennsylvania.

    †Assistant Professor, Division of Decision, Risk and Operations, and Statistics Department, Columbia Uni-versity, 3022 Broadway, 417 Uris Hall, New York, NY 10027, Email: [email protected].

    ‡Associate Professor, Department of Political Science, 211 Pond Lab, Penn State University, University Park,PA 16802 Phone: 814-863-1592, Email: [email protected].

    1

  • 1 Introduction

    1.1 Clustered Observational Studies with Multilevel Data

    Clustered observational studies are ubiquitous in the health and social sciences. Examples include

    patients receiving similar treatments in hospitals, employees facing a policy change inside firms,

    and students following a particular learning program within schools. Clustered observational

    studies have a nested or multilevel data structure with observed and unobserved covariates both

    at the cluster and unit levels. In this context, research interest typically lies in the effect of

    the cluster level treatment on unit level outcomes, however this effect may be confounded by

    differences in the distributions of covariates across the treatment groups both at the cluster

    and unit levels. Therefore, an important question in clustered observational studies is: how to

    adjust for observed covariates taking into account the multilevel structure? In an attempt to

    be transparent, these adjustments will ideally balance covariates at both levels and facilitate

    sensitivity analyses to hidden biases due to unobserved covariates (Rosenbaum 2010).

    Educational settings are perhaps the most well-known multilevel structure with unit level measures

    such as the student’s score on a standardized test, and also cluster level covariates such as school

    enrollment (Lee and Bryk 1989). Both covariates may act as confounders when evaluating, for

    instance, the impact of a study program or administration regime targeted at improving learning.

    A conventional approach to adjust for cluster and unit level covariates is hierarchical or multilevel

    regression modeling. A hierarchical regression model allows the researcher to fit a model for

    the mean outcome using unit level covariates while accounting for unexplained variation among

    clusters. The cluster level predictors are often referred to as “contextual effects,” and may

    be interpreted as causal effects under certain assumptions (Gelman 2006; Feller and Gelman

    2015).

    Nonparametric alternatives to multilevel regression modeling often rely on propensity scores (Hong

    and Raudenbush 2006; Arpino and Mealli 2011; Li et al. 2013). For example, Hong and Rau-

    2

  • denbush (2006) stratify on a multilevel propensity score to approximate a two-stage experiment

    where schools and students are randomly assigned to treatment within blocks. Matching methods

    are extensively used in observational studies (Stuart 2010; Lu et al. 2011), however they do not

    typically account for multilevel data structures.

    In this paper, we develop an optimal matching method for multilevel data structures. Contrary

    to intuition, our method first matches pairs of units and then clusters to create matched pairs

    of clusters with pairs of units within these cluster pairs. Our method is optimal in the sense

    that it maximizes the size of the matched sample and it directly balances the observed covariates

    in a specific way. For example, our method not only allows the researcher to balance central

    moments of distributions but also the entire distribution of the observed covariates. Our method

    does not require estimating the propensity score, although the propensity score can be used as an

    additional balancing covariate. Although we illustrate our method in a study with two levels—

    students within schools—it can readily be extended to settings with three or more levels such as

    students within schools within districts. In particular, we illustrate our method on a case study

    of the comparative effectiveness of public versus private voucher schools in standardized tests in

    Chile. This a question of intense policy debate in the country at the present, and we describe it

    subsequently.

    1.2 Vouchers and School Choice

    Governments often enact policy reforms to improve educational outcomes. One educational

    reform is the use of vouchers to create education markets. In a voucher system, parents receive

    a voucher to choose among a number of competing schools both public and private. While

    voucher systems are relatively rare in the United States, many other countries, however, have

    adopted universal voucher programs.1 Countries with universal voucher programs include Chile,

    Denmark, Netherlands, South Korea and Sweden (Lara et al. 2011). Chile was the first country

    to adopt a universal voucher system in the 1980’s. Public and private voucher schools receive a1As of 2007, only 16% of U.S. students lived in areas with an active voucher system.

    3

  • direct payment from the cover on a per-student basis. This voucher system fueled the growth

    of a parallel private education system. By 2006, nearly 5,000 private voucher schools existed,

    and these schools enrolled nearly 44% of students with 48% attending public schools (Lara et al.

    2011). By 2013, there were over 6,000 private voucher schools with around 53% of students,

    and 38% attended public schools (MINEDUC 2013).

    Are private vouchers schools more effective than public schools? Current evidence on the voucher

    system in Chile is mixed. A number of studies have found that private voucher schools increase

    test scores by at least 15% to 20% of a standard deviation (Mizala and Romaguera 2001; Anand

    et al. 2009) though other studies have found larger effects (Sapelli and Vial 2002, 2005). While

    other work has found effects that are either not statistically detectable (Hsieh and Urquiola 2006;

    McEwan 2001) or are much smaller (Lara et al. 2011). We conduct an observational study of

    whether private voucher schools produce students with higher test scores than public schools.

    These data have a multilevel structure in that we observe student level covariates such as gender

    and socio-economic status as well as school level covariates such as enrollment and whether the

    school is in an urban or rural area.

    Our study also provides a basic template for the design of observational studies of school level

    interventions. We demonstrate how temporal ordering is critical to selecting covariates for the

    removal of overt biases. In an observational study, use of longitudinal data is necessary to avoid

    conditioning on a post-treatment covariate. If data are carefully collected over time as events

    occur, then the temporal order of events is clear, and the distinction between covariates and

    outcomes is clear as well. In contrast, if data are collected from subjects at a single time, as in

    a cross-sectional study, such temporal order is unclear and one might mistakenly condition on an

    outcome. In this study, we use the transition from primary school to secondary school to clearly

    delineate the temporal order of pre- and post-treatment covariates. The Chilean data form a full

    panel over time which allows us to carefully separate both school and student level covariates

    from outcomes.

    4

  • This article is organized as follows. Section 2 describes the Chilean school system, the longitudinal

    census data, and the design that we use in our study. Section 3 reviews cardinality matching for

    finding the largest matched sample that is balanced, explains the multilevel matching method,

    and presents the method more generally. Section 4 shows the resulting matches. Section 5

    analyzes the comparative effectiveness of public and private voucher schools in Chile. Section 6

    concludes with a summary and a discussion.

    2 Schools in Chile; Educational Census Data; Study De-

    sign

    2.1 The Chilean School System and Restricted Secondary School Choice

    In Chile there are three basic types of schools: (i) private non-subsidized schools, (ii) private

    subsidized schools, and (iii) public schools, also called municipal schools. Private non-subsidized

    schools are generally elite schools that do not receive any funds from the state and are funded

    entirely through private tuition. Approximately 7% of the students in Chile attend this type of

    schools. Private subsidized schools receive funding from the state on a per pupil basis, and some

    of these schools charge an additional monthly tuition fee to parents (these are called “shared

    funding" private subsidized schools or “financiamiento compartido" in Spanish). These schools

    are a mix of for-profit and not-for-profit organizations. Approximately 53% of the students in

    Chile attend this type of school. Finally, approximately 40% of students attend public schools.2

    In Chile, the voucher system is based on a direct payment to the schools as a function of daily

    attendance. In our

    We seek to test whether private vouchers schools are more effective than public schools. Pre-2The exception is 60 schools (called “Delegated Administration") that are publicly funded on a non-voucher

    basis, i.e. they receive a fixed amount of money from the state regardless of how many students they enroll.

    5

  • treatment differences in these two groups of student may be either those that are measurable and

    thus form overt biases or those that are unmeasured which are hidden biases. In an observational

    study, analysts use pretreatment covariates and a statistical adjustment strategy to remove overt

    biases in the hopes of consistently estimating treatment effects. We face two challenges in our

    observational study if we simply compare the test scores of students in these two types of schools

    and remove overt biases via matching.

    First, data are not collected on students until they are in the fourth grade. Without measurements

    that precede the treatment, we may mistakenly adjust for an outcome rather than a covariate and

    bias the study (Rosenbaum 1984). Second, the choice to attend a private voucher school is most

    likely highly confounded by not only observed but unobserved factors. To solve both problems,

    we exploit an opportunity: an unusual setting in which we believe confounding from unobserved

    covariates is lessened (Rosenbaum 2010, section 5.1).

    The opportunity we exploit is related to the concept of differential effects, which are used to reduce

    bias in observational studies by the study of parallel treatments (Rosenbaum 2006). Consider the

    following example of a differential effect. It is thought that nonsteroidal anti-inflammatory drugs

    (NSAIDs) reduce the risk of Alzheimer disease. To test this theory one might compare subjects

    that regularly use NSAIDs to those that do not. However, there are a number of observed and

    unobserved reasons why these two populations may differ other than the use of NSAIDs. As an

    alternative design, one might instead compare regular NSAID users to regular users of other pain

    medications such as acetaminophen. The differential effect of acetaminophen versus NSAIDs may

    be less subject to bias from unmeasured confounders than the effect of NSAIDS versus no use

    of pain medication. As such, a differential effect is an effort to reduce sensitivity to bias through

    the comparison of parallel treatments.

    The Chilean school system provides us with an opportunity that is parallel to a differential effect.

    Chilean schools are divided into primary and secondary schools. Primary schools are comprised

    of Grades 1–8 and secondary schools encompass Grades 9–12. Since 2003 students are required

    6

  • to attend both primary and secondary schools. However, primary and secondary schools need not

    be separate institutions. It is not uncommon for primary and secondary schools to be fused into

    a single school, such that students can attend Grades 1–12 at the same institution. Students

    that attend a primary only school have to select a secondary school once they reach Grade 8. In

    2004, 76.4% of public schools and 52% of private voucher schools were primary only schools, and

    thus about 56% of students in the eighth grade had to select a new secondary school to attend

    (Lara et al. 2011). We use this opportunity in our design by restricting the analysis to students

    from public primary schools that had to select a secondary school since they attended a primary

    only school. Under this design, treated students are students that switch from a public primary

    school to a private voucher secondary school. We compare these treated students to students in

    public primary only schools that choose to attend public secondary schools.

    Exploiting this aspect of the Chilean school system has two advantages. First, it allows us to

    clearly delineate the temporal ordering of the treatment. Since the treatment is a private voucher

    secondary school, we may safely condition on covariates collected while students were in primary

    school. In this way we avoid biases from adjusting for a concomitant outcome (Rosenbaum

    1984). Second, we also suspect that a student that switches from a public to a private voucher

    school when they are not required to, may do so for a number of reasons, many of which are

    unobservable. Here, we restrict the analysis to students that attend their local public school but

    must switch due to school structure. This may be a more directly observable treatment selection

    mechanism.

    2.2 Longitudinal Census of Students and Schools

    In 1988, Chile introduced a national student assessment system known as the Sistema Nacional

    de Medición de la Calidad de la Educación or SIMCE. The SIMCE is an “educational census.”

    That is, in the SIMCE, the Ministry of Education collects data to evaluate all students in fourth,

    eighth, tenth and eleventh grades in language, mathematics and sciences, roughly every two years.

    SIMCE data are collected from four different sources. First, data are collected from students,

    7

  • which includes test scores that are complemented with other student covariates such as gender.

    Second, both parents and teachers complete questionnaires. Finally for schools, student test

    scores are aggregated, and a few additional covariates are collected. Students are given unique

    identifiers which allows us to form a true panel over a two year period. Student records can also

    be linked to teacher, parent, and school level covariates.

    In our study, we use SIMCE data from 2003, 2004, and 2006. The SIMCE from 2003 only

    collected data from secondary schools and students enrolled in secondary schools. For 2004 and

    2006, SIMCE collected data from both primary and secondary schools and students. We use test

    scores on language and mathematics administered in 2006 when students are in the tenth grade

    as our outcome measures.

    2.3 Data Structure and Study Design

    In our study, the data structure and design are intricately linked. We now outline how we

    constructed the match to fit the data structure. One advantage of our approach is that we

    can tailor the statistical adjustment to exactly fit the multilevel structure of the data, which

    is important since we have student, parent, teacher, and school level data. We perform two

    matches: one for students and one for schools. Next, we describe the covariates that form the

    student level match.

    For each student with test scores observed in 2006, we match on student, parent, teacher, and

    primary school covariates from the SIMCE data collection in 2004. For the student match, we

    first list student level covariates. The key covariate, here, is student test scores from the 8th

    grade. In 8th grade students are tested on four topics: language, mathematics, social sciences,

    and natural sciences. The student level data also measures gender. For the student match, we

    also include three covariates from parents: income measured in six categories, father’s education,

    and mother’s education. We also link students to primary school level measures, and we match

    on primary school covariates in the student level match. At the primary school level, we match

    8

  • on a five category socio-economic status indicator for each school that is created by the Chilean

    Ministry of Education. This five category indicator is constructed from questions based on

    parental education, family incomes in the school and an index of school vulnerability. We also

    use school level measures that are aggregates of data observed at other levels. As such, we

    match on average test scores for each primary school, the number of teachers and the number

    of enrolled students. Finally, in the teacher survey, teachers are asked what level of education

    they expect the majority of their students to achieve. Teachers responded using a five category

    scale that records responses from 8th grade to a college degree. We aggregate this measure and

    recorded the median for each primary school and use it in the student level match. To reiterate

    while we observed covariates measured at different levels in 2004, we treat all these measures as

    pre-treatment student level covariates in the match.

    The school match is based on secondary school data from 2003. Since the SIMCE forms a panel,

    we can match on characteristics of the secondary schools before any student is exposed to the

    treatment. That is, we match on the schools the students will attend using data from before

    they attend that school. For the school match, we match on enrollment, school level math and

    language test score averages, the percentage of female students in the school, average student

    income, urban versus rural status, and the same five category socio-economic status indicator

    for each school that is recorded for primary schools.3 Some of these covariates are aggregates

    that we created from either student, teacher, or parent level data in 2003. We did not match on

    several other covariates that are also observed in the 2003 data. These measures include whether

    teachers are allowed class preparation time, the proportion of teachers with a post-graduate

    diploma, the average teacher experience, the number of hours teachers worked per week. We

    do not match on these covariates since they are plausibly part of the school level treatment.

    Matching on such covariates would remove their effect on students from the final outcomes and

    thus could potentially attenuate the treatment effect.3For student in secondary schools, the SIMCE only collects test scores on language and math. At the primary

    school level, we have test scores for language, mathematics, social sciences, and natural sciences.

    9

  • We describe the matching algorithm in greater detail in Section 3. However, the match is based on

    integer programming which allows us to enforce different forms of balance for different covariates

    (Zubizarreta 2012; Zubizarreta et al. 2014). This is relevant since we tailored the constraints for

    each covariate. Here, we describe the different balance constraints we applied to each covariate.

    For the student level covariates, we applied a mean balance constraint to primary school test score

    measures, primary school enrollment, the number of teachers in the primary school, the average

    expected level of educational attainment, and the proportion of female student in the primary

    school. For student level test score measures, we enforced a constraint on the entire distribution

    via the Kolmogorov-Smirnov test statistic which is the maximum discrepancy in the empirical

    cumulative distribution functions. For the school level match, we enforced a mean balance

    constraint on secondary school test scores, missingness indicators for test scores, secondary school

    enrollment, income category, SES category, urban or rural status, and the proportion of female

    students in the secondary school.

    For discrete student and school level covariates, we used a fine balance constraint. Under fine

    balance, we exactly balance covariates without exactly matching. Fine balance is achieved for

    discrete covariates by balancing the marginal distributions of covariates exactly in aggregate but

    without constraining who is matched to whom. We applied fine balance to student sex, father

    and mother’s education level, parental income categories, and primary school SES categories. We

    now describe the notation and the optimal matching algorithm. See Rosenbaum et al. (2007) for

    a discussion of fine balance and Rosenbaum (2010, Part II) for a discussion of different forms of

    covariate balance.

    3 Dynamic and Integer Programming for Multilevel Match-

    ing

    The goal of our multilevel matching method is to find the largest sample of matched pairs of

    treated and control units within matched pairs of treated and control clusters that is balanced

    10

  • on the observed covariates. For assessing the sensitivity of results to the influence of unobserved

    covariates we use the methods for sensitivity analysis proposed by Rosenbaum (1987, 2002) and

    tailored to clustered treatment assignments by Hansen et al. (2014) (see subsection 5.4 of the

    paper). In our case study, units are students and clusters are schools, and, importantly, because

    results can be confounded both by student and school level covariates, we match pairs of students

    and schools to balance covariates at both levels. The basic tool that we use in our multilevel

    matching method is cardinality matching which we describe subsequently.

    3.1 Review of Cardinality Matching

    Common matching methods attempt to achieve covariate balance indirectly, by finding treated

    and control units that are close on a summary measure of the covariates such as the Mahalanobis

    distance or the propensity score (see Stuart 2010 and Lu et al. 2011 for reviews). Unlike these

    matching methods, cardinality matching uses the original covariates to match units and directly

    balance their covariate distributions (Zubizarreta et al. 2014). Specifically, by solving an integer

    programming problem, cardinality matching finds the largest matched sample that satisfies the

    researcher’s specifications for covariate balance. Following Zubizarreta (2012), these specifica-

    tions for covariate balance may not only require mean balance, but perhaps also other forms of

    distributional balance such as fine balance (Rosenbaum et al. 2007), x-fine balance (Zubizarreta

    et al. 2011), and strength-k matching (Hsu et al. 2015). For example, cardinality matching will

    find the largest sample of matched pairs in which all the covariates have differences in means

    smaller than one tenth of a standard deviation and the marginal distributions of nominal co-

    variates of greater prognostic importance are perfectly balanced (fine balance). In this manner,

    with cardinality matching subject matter knowledge about the research question at hand comes

    into the matching problem through the specifications for covariate balance, finding the largest

    matched sample that satisfies them.

    As we describe in the next subsection, our multilevel matching method uses cardinality matching

    to match treated and control students across all the possible combinations of treated and control

    11

  • schools, and then uses a modified version of cardinality matching to match schools with the

    largest number of matched students.

    3.2 A Multistage Decision Method for Multilevel Matching

    Let kt ∈ Kt = {1, ..., Kt} index the treated clusters and kc ∈ Kc = {1, ..., Kc} denote the

    control clusters. Let jkt be treated unit j in treated cluster kt, with jkt ∈ Jkt = {1, ..., Jkt},

    and jkc stand for control unit j in control cluster kc with jkc ∈ Jkc = {1, ..., Jkc}. Put xkt for

    the vector of observed covariates of treated cluster kt, and similarly write xjkt for the observed

    covariates of treated unit jkt ; analogous notation applies for control clusters and units. Based

    on the unit-level covariates, calculate a distance δjkt ,jkc between treated unit jkt and control unit

    jkc (for instance, this distance may be the robust Mahalanobis distance specified in section 8.3

    of Rosenbaum 2010). Define A and Ba as the sets of feasible solutions for the cluster- and

    unit-level matches within matched clusters (hence the subindex a in Ba). In practice, A and

    Ba are implemented as linear inequality constraints in a integer program and they enforce the

    researcher’s requirements for covariate balance and matching structures at the cluster and unit

    levels respectively (for instance, A may require the means of the cluster covariates to be balanced

    and the matched groups to form pairs of clusters, and Ba may require the marginal distributions

    of the unit covariates to be balanced and the matched groups to form pairs of units). Importantly,

    since the requirements in A refer to clusters and those in Ba refer to units, A and Ba are disjoint.

    Let J (m)kt be the set of treated units matched in treated cluster kt and J(m)t =

    ⋃kt∈Kt J

    (m)kt

    be the set of treated units matched across all treated clusters. Finally, let K(m)t be the set of

    matched treated clusters.

    Building upon the framework of Rosenbaum (2012a), an optimal cardinality matching of units

    within clusters can be characterized by the quadruple (K(m)t , α,J(m)t , β) of assignments of clus-

    ters α : K(m)t → Kc and units β : J(m)kt→ Jkc that maximize the cardinality of the set of

    matched of units within matched clusters subject to the constraints in A and Ba, respectively.

    If there are two cardinality matchings that satisfy the requirements in A and Ba, then we prefer

    12

  • one matching over the other if it has a larger cardinality, or, alternatively, if they both have

    the same cardinality, if it has a smaller sum of total distances between matched units. For-

    mally, we prefer the cardinality matching (K(m)t , α,J(m)t , β) to (K̃

    (m)t , α̃, J̃

    (m)t , β̃), denoted by

    (K(m)t , α,J(m)t , β) � (K̃

    (m)t , α̃, J̃

    (m)t , β̃), if |J

    (m)t | > |J̃

    (m)t |, or alternatively if |J

    (m)t | = |J̃

    (m)t |

    and ∑jkt∈J

    (m)t

    δjkt ,β(jkt ) <∑jkt∈J̃

    (m)t

    δjkt ,β(jkt ). If |J(m)t | = |J̃

    (m)t | and

    ∑jkt∈J

    (m)t

    δjkt ,β(jkt ) =∑jkt∈J̃

    (m)t

    δjkt ,β(jkt ), then we are indifferent between the two cardinality matchings and write

    (K(m)t , α,J(m)t , β)∼ (K̃

    (m)t , α̃, J̃

    (m)t , β̃). If we have either (K

    (m)t , α,J

    (m)t , β)� (K̃

    (m)t , α̃, J̃

    (m)t , β̃)

    or (K(m)t , α, J(m)t , β) ∼ (K̃

    (m)t , α̃, J̃

    (m)t , β̃), we write (K

    (m)t , α,J

    (m)t , β) % (K̃

    (m)t , α̃, J̃

    (m)t , β̃).

    Our optimal multilevel matching problem is the following.

    Problem 3.1. For given sets of cluster-level constraints A and unit-level constraints Ba, find a

    matching (K(m)t , α,J(m)t , β) that satisfies A and Ba such that, for any other matching (K̃

    (m)t , α̃,

    J̃ (m)t , β̃) that also satisfies A and Ba, (K(m)t , α,J

    (m)t , β) % (K̃

    (m)t , α̃, J̃

    (m)t , β̃).

    Intuition may suggest that the the best way to solve Problem 3.1 and match with multilevel data

    is first to match clusters and then within matched clusters to match units. In our case study, this

    would require first pairing schools and then, within pairs of schools, pairing students. However

    this strategy will not always find the largest matched sample that is balanced as two schools

    that are paired on their school level characteristics may have different student compositions so

    that when their students are paired it may result in a smaller sample size than optimal. For

    this reason, the optimal matching strategy needs to contemplate what is optimal both at the

    student and school levels simultaneously. Applying Bellman’s (1957) principle of optimality, the

    optimal matching strategy is, under the assumption that schools have been matched optimally,

    first match students and then, considering these optimal student matches, match schools.

    In abstract terms, the following algorithm and proposition state this; again, that the optimal

    strategy is first to match units across all the possible combinations of pairs of treated and control

    clusters, and, once all possible combinations of matched units are known, then match clusters.

    To implement the optimal assignments α and β, let akt,kc = 1 if treated cluster kt is paired to

    13

  • control cluster kc and akt,kc = 0 otherwise; similarly let bjkt ,jkc = 1 if treated unit j in treated

    cluster kt is paired to control unit j in control cluster kc, and bjkt ,jkc = 0 otherwise.

    Algorithm 3.2. For each of the possible Kt × Kc pairs of treated and control clusters, find

    the optimal cardinality matching of units that satisfies Ba. This is, for each kt ∈ Kt and each

    kc ∈ Kc find mkt,kc = maxb∑jkt∈Jkt

    ∑jkc∈Jkc bjkt ,jkc subject to b ∈ Ba. Then find the optimal

    cardinality cluster matching that solves maxa∑kt∈Kt

    ∑kc∈Kc mkt,kcakt,kc subject to a ∈ A.

    Proposition 3.3. Algorithm 3.2 solves the optimal multilevel cardinality matching problem 3.1.

    Proof. Let f(a, b) be the the total number of pairs of treated and control units matched by a

    within pairs of treated and clusters matched by b. In the abstract, in Problem 3.1 we want to

    maximize the function f(a, b) subject to the constraints A and Ba. This is, find a and b to solve

    maxa,b

    f(a, b) subject to a ∈ A, b ∈ Ba. (1)

    In a trivial way, we may solve (1) by first solving

    g(a) = maxbf(a, b) subject to b ∈ Ba (2)

    for each a ∈ A, and then solving

    maxa

    g(a) subject to a ∈ A. (3)

    While (2) seems hard in general (because there are many possible choices of b), the nested

    structure of the units-in-clusters problem makes it easier because f(a, b) separates into a sum

    of parts for cluster pairs because the constraint sets A and Ba are disjoint. Algorithm 3.2 does

    exactly this.

    In our case study, for each pairing of schools a, we find the best pairing of students b within

    14

  • those schools (2), and then pick the best pairing of schools with the associated best pairing of

    students for that pairing of schools (3). Again, while (2) seems hard in general (because there

    are many possible student matches b), the nested structure of the students-in-schools problem

    makes it easier because f(a, b) separates into a into a sum of parts for school pairs. For example,

    if treated school kt is paired to control school kc, then the contribution of schools kt and kc is

    the same of number of pairs regardless of how the other schools are paired.

    With Algorithm 3.2, the multilevel cardinality matching problem can be solved optimally by

    breaking it into simpler matching subproblems and recursively finding the optimal match. This is

    an application of dynamic programming to matching in observational studies that takes advantage

    of the multilevel structure of the data (see Bertsekas 2005 for an extensive exposition of dynamic

    programming).

    3.3 Extensions and Computation

    Note that if we had three or more levels (such as students within schools within districts), then

    the multilevel matching procedure would extend naturally. With l levels, the procedure would

    require first matching the lower level l under the assumptions that levels l − 1, l − 2, ..., 1 have

    been matched optimally, to then (once the matches at level l are completed) matching level l−1

    under the assumptions that levels l− 2, l− 3..., 1 have been matched optimally, and so on.

    Note that our multilevel matching method maximizes the size of the matched sample, but it can

    also be formulated to minimize a covariate distance between students. If this was the case and if

    each of the student level matching problems was solved using optimal matching as in Rosenbaum

    (1989) and Hansen (2007), then a trivial worst-case time bound for the multilevel matching

    method would be of order O(J3K2 + K3) where J = max{Jkt=1, ..., Jkt=Kt , Jkc=1, ..., Jkc=Kc}

    and K = max{Kt, Kc}. In general it is possible to find worst-case time bounds based on the

    component problems. In our presentation above, each of the component problems is a cardinality

    matching problem and, while at the present there is no polynomial time algorithm for cardinality

    15

  • matching, in practice many instances with data sets of reasonable size run in time comparable

    to that of optimal matching. Furthermore, a useful feature of problem (1) is that the student

    level matches can be found in parallel by separating all the possible pairs of treated and control

    schools into smaller mutually exclusive but exhaustive pairs of treated and control schools. In

    practice, we found the matches using the package mipmatch for R (Zubizarreta 2012).

    4 Covariate Balance in the Matched Sample

    After applying basic exclusion criteria, there are 64245 students in 517 schools, 150 subsidized

    and 367 public schools (henceforth treated and control schools respectively). Out of the 64245

    students, 15682 students are from treated schools and 48563 are from control schools. Using

    our multilevel matching method, we matched in two stages within similar groups regions of the

    country (namely, regions I-III, IV-V, VI-VII, VIII, IX, X-XII and the Metropolitan region).

    At the student level, we used cardinality matching to find the largest balanced sample of pairs

    of students across all the possible combinations of pairs of schools within the groups of regions.

    In each of these matches we required mean balance for 19 covariates (including student test

    scores, school test scores, and indicators for socioeconomic status and expected educational

    achievement; see Table 1 for details), fine balance for 4 covariates (sex, mother and father

    education, and household income; see Table 2) and distributional balance for the sum of the

    test scores in language and mathematics at baseline. Figure 1 shows not only that the marginal

    distributions of the baseline test scores are very closely balanced after matching but also their

    joint distribution. As a matter of fact, the 95% bivariate normal density contours are almost

    indistinguishable after matching.

    At the school level, we used the modification of cardinality matching in the second stage of

    Algorithm 3.2 and mean balanced 16 other covariates: percentage female, total enrollment, lan-

    guage and math scores (plus indicators for missing values), urban area, parental income categories

    (1-5), and socioeconomic groups (A-D). Again, covariates were exact matched for the 7 region

    16

  • Table 1: Covariate balance at the student level after matching. All the covariates are measuredin 2004.

    Covariate Mean Std.Subsidized Public dif.

    Language score 243.24 243.68 -0.01Mathematics score 243.43 243.04 0.01Natural science score 246.19 247.00 -0.02Social science score 243.01 243.91 -0.02School language score 237.57 237.02 0.03School mathematics score 238.37 237.89 0.03School female proportion 0.51 0.50 0.05School number of students 82.34 82.34 0.00School teacher to student ratio 8.09 8.03 0.03Urban area 0.83 0.83 0.00Socioeconomic status A 0.13 0.12 0.01Socioeconomic status B 0.61 0.61 0.01Socioeconomic status C 0.25 0.26 -0.02Socioeconomic status D 0.01 0.01 0.01Expected education: primary 0.01 0.02 -0.04Expected education: secondary, technical-professional 0.77 0.77 -0.00Expected education: secondary, scientific-humanities 0.13 0.12 0.02Expected education: technical-professional 0.09 0.10 -0.01Expected education: college 0.00 0.00 0.01

    17

  • Table 2: Balance for nominal covariates at the student level. All the covariates are measured in2004. Fine balance constraints balanced sex, type of education of the mother and father, andhousehold income category. The tabulated values are counts of the number of students in eachcategory. In addition, matching was exact for groups of counties (not shown here).

    Covariate Subsidized PublicSex

    Male 2084 2084Female 1981 1981

    Mother educationPrimary school 1886 1886Secondary school 1152 1152Technical 56 56College or higher 17 17Missing 954 954

    Father educationPrimary school 1647 1647Secondary school 1255 1255Technical 56 56College or higher 17 17Missing 954 954

    Household income category (in 1000 pesos)[0, 100) 1647 1647[100, 200] 1255 1255(200, 400] 1643 1643(400, 600] 446 446(600, 1400] 124 124> 1400 84 84Missing 183 183

    18

  • Figure 1: Distribution of student test scores at baseline after matching. The baseline test scoresare measured in 2004. The ellipses trace the 95% bivariate normal density contours of the jointdistributions of test scores for the matched treated and control units. The contours are almostidentical showing that not only marginal distributions of the test scores are very closely balancedbut also their joint distribution.

    100 150 200 250 300 350 400

    100

    150

    200

    250

    300

    350

    400

    PublicSubsidizedPub.

    Sub.

    100 150 200 250 300 350 400

    Test scores in language at baseline

    Pub. Sub.

    100

    150

    200

    250

    300

    350

    400

    Test

    sco

    res

    in m

    athe

    mat

    ics

    at b

    asel

    ine

    Distribution of student-level test scores after matching

    groups. We balanced all covariates with and without weighting for the size of the school; see

    Table 3. Note that after matching all the differences in means are smaller than 0.05 standard

    deviations. In this way, we matched 8130 students in 4065 pairs, and 166 schools in 83 pairs.

    In this match, 7 out of the 13 region of the country are represented in both the treatment and

    19

  • control groups.

    Table 3: Covariate balance at the school level after matching. Both means and standardizeddifferences are weighted by the number of students in each school.

    Covariate Mean Std.Subsidized Public dif.

    Female proportion 0.49 0.49 -0.00Number of students 262.33 262.67 -0.00Language score 241.36 241.43 -0.00Mathematics score 230.20 229.19 0.04Language score missing 0.00 0.00 0.00Math score missing 0.00 0.00 0.00Urban area 0.98 0.99 -0.03Income category 1 0.08 0.09 -0.02Income category 2 0.80 0.79 0.01Income category 3 0.12 0.12 0.00Income category 4 0.00 0.00 0.00Income category 5 0.00 0.00 0.00Socioeconomic status A 0.18 0.18 0.01Socioeconomic status B 0.71 0.71 0.01Socioeconomic status C 0.10 0.11 -0.02Socioeconomic status D 0.00 0.00 0.00

    Thus while our match yields highly comparable treated and control groups, geographic coverage

    is somewhat poor. To that end, we also implemented a second match designed to increase geo-

    graphic representation. Starting from the same student matches, we also found a more externally

    valid school in which all the differences in means are smaller than 0.15 standard deviations, and

    where 10776 students are matched in 5388 pairs and 210 schools are matched in 105 pairs. In

    this match, 12 out of the 13 regions in Chile are represented. We deem the first match a match

    with greater internal validity, and the second match, a match with greater external validity. Table

    (4) compares the results of these two matches. In the next section we compare treatment effect

    estimates for these two different matched samples.

    20

  • Table 4: Comparison of the two school level matches. In the internally valid match the largeststandardized difference in means is smaller than 0.05, whereas in the externally valid matchthis difference is smaller than 0.15. Regions represented are the regions present in both in thetreatment and control groups.

    Internal validity match External validity matchMatched students 8130 10776Matched schools 166 210Regions represented 4, 5, 6, 7, 9, 10, 13 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 12, 13

    5 Outcome Analyses

    Having found the optimal match, we now estimate the voucher effect, test its significance, and

    assess the robustness of these conclusions with sensitivity analysis methods to hidden bias. We

    follow the notation and methods for clustered experiments by Small et al. (2008), and use the

    sensitivity analysis methods for clustered observational studies by Hansen et al. (2014).

    5.1 Notation: Treatment Effects for Students in Schools

    There are S matched pairs of clusters, s = 1, . . . , S, with two schools, j = 1, 2, one treated and

    one control for 2S total units. The ordered pair sj thus identifies a unique cluster. Each cluster

    sj contains nsj > 1 individuals, i = 1, . . . , nsj. Each pair is matched for observed, pretreatment

    covariates, so xs11 = xs22 for each j and i, where xsji represents the observed covariates on

    which we matched. A student i in school sj is described by both observed covariates and an

    unobserved covariate usji. The set (xsji, usji) may describe either the student sji or the school

    sj containing this student. In our study, treatment assignment occurs at the school level as whole

    schools are assigned to treatment or control. If the jth school in pair s receives the treatment,

    write Zsj = 1, whereas if this school receives the control, write Zsj = 0, so Zs1 + Zs2 = 1, for

    each s as each pair contains one treated school and one control school. If nsj = 1 for all sj then

    the clusters are individuals, and we have unclustered treatment assignment.

    Each student has two potential responses; one response that is observed under treatment Zsj = 1

    21

  • and the other observed under control Zsj = 0 (Neyman 1923; Rubin 1974). We denote these

    responses with (yTsji, yCsji), where yTsji is observed from the ith subject in pair s under Zsj = 1,

    and yCsji is observed from this subject under Zsj = 0. In our application yTsji is the test score

    that student sji would exhibit if he or she switched from a public school to a private voucher

    school and yCsji is the test score this same student would exhibit if he or she remained in a public

    school. Under this notation, we allow for interference among students in the same school but

    not across schools. In this context, yTsji denotes the response of student sji if all students in

    school sj receive the treatment, while yCsji denotes the response of student sji if all students

    in school sj receive the control. Therefore, we do not assume that we would observe the same

    response from student sji if the treatment were assigned to some but not all of the students in

    school sj.

    For each student, the unobservable effect of treatment is yTsji − yCsji, which is the change in

    test scores induced by attending a private voucher school. We do not observe both potential

    outcomes, but we do observe responses: Ysji = ZsjyTsji + (1−Zsj)yCsji. Under this framework,

    the observed response Ysji varies with Zsj but the potential outcomes do not vary with treatment

    assignment. Write R = (R111, . . . , RS2,ns2)T for theN =∑s,j ns,j dimensional vector of observed

    responses with the same notation for yc, which are potential responses under control. Below

    we test the sharp null hypothesis of no treatment effect on (yTsji, yCsji) which stipulates that

    H0 : yTsji = yCsji for all sji (Fisher 1935). This hypothesis asserts that changing the treatment

    assigned to school sj would leave the response of student sji unchanged.

    5.2 Randomization Inference When Treatment is Assigned at the School

    Level

    In our analysis, we initially assume that treatment assignment is as-if randomly assigned to

    clusters conditional on the matches. In short, we assume as if the toss of a fair coin was used

    to allocate private voucher status within matched school pairs. We collect in the set Ω the 2S

    22

  • treatment assignments for all 2S clusters: Z = (Z11, Z12, . . . , ZS2)T . In a matched-pair group

    randomized experiment, one treatment assignment Zsj would be picked at random and each

    assignment would therefore have probability Pr(Z = Zsj) = 2−S, which yields the randomization

    distribution. In an observational study, if the probability of receiving treatment is equal for both

    schools in each pair, then the conditional distribution of Z given that there is exactly one treated

    unit in each pair equals this randomization distribution, and Pr(Zsj = 1) = 1/2 for each unit j

    in pair s (see Rosenbaum 2002 for details). However, in an observational study it may not be

    true Pr(Zsj = 1) = 1/2 for each unit j in pair s due to an unobserved covariate usji. We explore

    this possibility through a sensitivity analysis described below.

    To test Fisher’s sharp null hypothesis of no treatment effect, we define T a test statistic which is

    a function of Z and R where T = t(Z,R). Under the sharp null hypothesis R = yc, therefore

    T = t(Z,yc). If the model for treatment assignment above were true then the randomization

    distribution for T is

    Pr{t(Z,R) ≥ w|yTsji, yCsji,xsji, usji,Ω} = Pr{t(Z,yc) ≥ w|yTsji, yCsji,xsji, usji,Ω}

    since yc is fixed by conditioning on yTsji, yCsji,xsji, usji and Pr(Z = Zsj|yTsji, yCsji,xsji, usji,Ω) =

    1/|Ω|. We use a test statistic from Hansen et al. (2014) that provides inferences for units in

    clusters.

    For this test statistic, qsji is a score or rank given to Ysji, so that under the null hypothesis, the

    qsji are functions of the yCsji and xsji, and they do not vary with Zsk. To make qsji resistant to

    outliers, we use the ranks of the residuals when Ysji is regressed on the student level covariates

    using Huber’s method of m-estimation Small et al. (2008). We regressed the outcome, test

    scores recorded in 2006, on student level test scores recorded in 2004 when the student was still

    in primary school. The test statistic T is a weighted sum of the mean ranks in the treated school

    23

  • minus the mean ranks in the control school. Formally the test statistic is

    T =S∑s=1

    BsQs

    where

    Bs = 2Zs1 − 1 = ±1, Qs =wsns1

    ns1∑i=1

    qs1i −wsns2

    ns2∑i=1

    qs2i.

    Hansen et al. (2014) show that T is the sum of S independent random variables each taking the

    value ±Qs with probability 1/2, so E(T ) = 0 and var(T ) =∑Ss=1 Q

    2s. The central limit theorem

    implies that as S → ∞, then T/√

    var(T ) converges in distribution to the standard Normal

    distribution. In the above equation, ws defines the weights which are a function of nsj.

    The choice of weights ws has important implications in our application. Hansen et al. (2014)

    discuss three possible choices for ws. One possibility is to use constant weights, ws ∝ 1. Another

    possibility is to use weights that are proportional to the total number of students in a matched

    cluster pair: ws ∝ ns1 +ns2 or ws = (ns1 +ns2)/∑Sl=1(n11 +n12). These proportional weights are

    particularly useful if we believe that the private school voucher effect varies with cluster size. This

    would be true if, for example, the private school effect was larger in smaller schools. However,

    if we suspect that the private voucher school effect is constant, we could select the weights to

    minimize the variance of T . For example, ws ∝ ns1ns2/(ns1 + ns2) will minimize the variance of

    T if cross cluster variability is low, while constant weights will minimize the variance if there is

    little variance within schools. Hansen et al. (2014) note that for testing the null hypothesis each

    set of weights is valid. Given that our cluster sizes exhibit considerable variation, and it is fully

    possible that the treatment effect varies with cluster size, we use all three sets of weights to test

    the sharp null hypothesis. Below we discuss how we incorporate the different weights into the

    sensitivity analysis.

    If we test the hypothesis of a shift effect instead of the hypothesis of no effect, we can apply the

    method of Hodges and Lehmann (1963) to estimate the voucher school effect. The Hodges and

    24

  • Lehmann (HL) estimate of τ is the value of τ0 that when subtracted from Ysji makes T as as

    close as possible to its null expectation. Intuitively, the point estimate τ̂ is the value of τ0 such

    that T equals 0 when Tτ0 is computed from Ysji − Zsjτ0. Using constant effects is convenient,

    but this assumption can be relaxed; see Rosenbaum (2003). If the treatment has an additive

    effect, Ysji = yCsji + τ then a 95% confidence interval for the additive treatment effect is formed

    by testing a series of hypotheses H0 : τ = τ0 and retaining the set of values of τ0 not rejected at

    the 5% level.

    5.3 Comparative Effectiveness of Public Versus Private Voucher Schools

    We now test the hypothesis of no effect for private voucher schools. We test this hypothesis in

    both matches. For the match with greater external validity, with constant weights ws ∝ 1, the

    approximate one-sided p-value is 0.256. Thus we are unable to reject the null that the voucher

    school are completely without effect. We also found that both sets of weights, ws ∝ ns1 + ns2

    and ws ∝ ns1ns2/(ns1 + ns2), lead to identical p-values of 0.292. In the absence of bias from

    hidden confounders, the point estimate is τ̂ = 2.81 with a 95% confidence interval of -5.68 and

    11.34.

    For the match with greater internal validity with constant weights, the approximate one-sided p-

    value is 0.492. Using non-constant weights, we find the approximate one-sided p-value is 0.633. If

    there are no hidden confounders, the point estimate for the match with greater internal validity is

    τ̂ = 0.0743 with a 95% confidence interval of -8.58 and 9.36. Thus for both matches, we cannot

    reject the hypothesis that attending a private voucher school has no effect on test scores. We

    next explore the likelihood that bias from a hidden confounder masks a treatment effect.

    5.4 Test of Equivalence and Sensitivity Analysis

    In an observational study, one concern is that bias from a hidden covariate can give the impression

    that a treatment effect exists when in fact no effect is present. Bias from hidden confounders can

    25

  • also mask an actual treatment effect leaving the analyst to conclude there is no effect when in

    fact such an effect exists. We explore this possibility using a test of equivalence and a sensitivity

    analysis (Rosenbaum 2008; Rosenbaum and Silber 2009; Rosenbaum 2010).

    Above we were unable to reject the null hypothesis that τ = 0 for all students. Next, we apply

    a test of equivalence to test the hypotheses that τ is not small. Under a test of equivalence, we

    test the following null hypothesis H(δ)6= : |τ | > δ. Rejecting H(δ)6= provides a basis for asserting

    with confidence that |τ | < δ. H(δ)6= is the union of two exclusive hypotheses:←−H

    (δ)0 : τ ≤ −δ

    and −→H (δ)0 : τ ≥ δ, and H(δ)6= is rejected if both

    ←−H

    (δ)0 and

    −→H

    (δ)0 are rejected (Rosenbaum and

    Silber 2009). We can apply the two tests without correction for multiple testing since we test

    two mutually exclusive hypotheses. Thus we can test whether the estimate from our study is

    different from other possible treatment effects which are represented by δ.

    With a test of equivalence, it is not possible to demonstrate a total absence of effect, but if

    this were a randomized trial we could safely test that our estimated effect is not as large as δ.

    That is we may be able to reject H(δ)6= : |τ | > δ. In an observational study, however, there are

    additional complications. Since the treatment was not randomly assigned, it may be the case

    that we reject the null hypothesis of equivalence due to hidden confounding. However, using a

    sensitivity analysis we may find evidence that the test of equivalence is insensitive to biases from

    nonrandom treatment assignment.

    In a sensitivity analysis, we quantify the degree to which a key assumption must be violated in

    order for our inference to be reversed. Our model of treatment assignment assumes that within

    matched pairs, receipt of the treatment is effectively random conditional on the matches. We

    consider how sensitive our conclusions are to violations of this assumption using a model of

    sensitivity analysis discussed in Rosenbaum (2002, ch. 4).

    In our study, matching on observed covariates xsji made students more similar in their chances

    of being exposed to the treatment. However, we may have failed to match on an important

    unobserved covariate usji such that xsji = xsji′ ∀ s, j, i, i′, but possibly usji 6= usji′ . If true, the

    26

  • probability of being exposed to treatment may not be constant within matched pairs. To explore

    this possibility, we use a sensitivity analysis that imagines that before matching, student i in pair

    s had a probability, πs, of being exposed to the voucher school treatment. For two matched

    students in pair s, say i and i′, because they have the same observed covariates xsji = xsji′ it

    may be true that πs = πs′ . However, if these two students differ in an unobserved covariate,

    usji 6= usji′ , then these two students may differ in their odds of being exposed to the voucher

    school treatment by at most a factor of Γ ≥ 1 such that

    1Γ ≤

    πs/(1− πs′)πs′/(1− πs)

    ≤ Γ, ∀ s, s′, with xsji = xsji′ ∀ j, i, i′. (4)

    If Γ = 1, then πs = πs′ , and the randomization distribution for T is valid. If Γ > 1, then quantities

    such as p-values and point estimates are unknown but are bounded by a known interval. In a

    sensitivity analysis, we use several values of Γ to compute bounds on the p-value for the test

    of equivalence. We then observe at which value of Γ the upper bound on the p-value exceeds

    0.05. If the value of Γ is large, we can be confident that it would take a large bias from a hidden

    confounder to reverse the conclusions of the study. The derivation for the sensitivity analysis as

    applied to our test statistic T is in Hansen et al. (2014).

    Under a test of equivalence, we may be able to reject H(δ)6= : |τ | > δ if the p-value from the test

    is low. Rejecting this null, allows us to infer that the estimate treatment effect is not as large

    as δ. We then apply the sensitivity analysis to understand whether this inference is sensitive to

    biases from nonrandom treatment assignment. In the analysis, we observe at what value of Γ the

    p-value exceeds the conventional 0.05 threshold for each test. If this Γ value is relatively large,

    we can be confident that the test of equivalence is not sensitive to hidden bias from nonrandom

    treatment assignment.

    Hansen et al. (2014) note that sensitivity to hidden bias may vary with the choice of weights ws.

    To understand whether different weights lead to different sensitivities to hidden confounders, we

    27

  • can conduct a different sensitivity analysis for each set of weights and correct these tests using

    a Bonferroni correction. However, Rosenbaum (2012b) shows that the Bonferroni correction is

    overly conservative when applied sensitivity analysis. He develops an alternative multiple testing

    correction based on correlations among the test statistics. Under this correction, for a given

    value of Γ we conduct a sensitivity analysis using each set of weights. We then apply the multiple

    testing correction from Rosenbaum (2012b) which produces a single corrected p-value for that

    value of Γ.

    5.5 How Much Bias Would Need to be Present to Mask a Positive

    Effect of Private Voucher Schools?

    We now apply the test of equivalence to both matches. In this test, the null hypothesis asserts

    H(δ)6= : |τ | > δ for some specified δ > 0. Rejection of this null hypothesis provides evidence

    that the effect of attending a private voucher school on test scores is less than δ. What values

    should we select for δ? A number of studies in the literature have found that private voucher

    schools increase test score achievement. The smallest effect size in the extant literature is 0.15

    of a standard deviation (Sapelli and Vial 2002). However, among low income students the effects

    may be as large as 0.5 of a standard deviation, and Sapelli and Vial (2005) find an effect size of 0.6

    standard deviations. These results suggest a range of possible effects from 0.15 to 0.6 standard

    deviations. To that end, we use three values for δ of 0.15, 0.30 and 0.6 standard deviations.

    This allows us to test whether the point estimates in our study are equivalent to small, medium

    or large voucher effects. Thus we define three values δ1, δ2, and δ3 to correspond to these three

    different possible effect sizes.

    We first ask whether the point estimate from the match with greater external validity is large

    relative to effect sizes in the literature. Table 5 contains a summary of the test of equivalence

    and sensitivity analysis to the match with greater external validity. We first assume that there

    is no hidden bias such that Γ = 1. We first test ←−H (δ1)0 and find that the one-sided p-value

    28

  • from this test is 0.008. We then test −→H (δ1)0 and we find that the one-sided p-value is 0.089.

    Therefore we are unable to reject H(δ1)6= for the match with greater external validity. Thus our

    point estimate from this match may be consistent with a small effect. For a larger effect size

    of 0.30 standard deviations, however, we can reject H(δ2)6= with a p-value of 0.003. Thus the

    estimated treatment effect is not consistent with moderate effect size. Is this inference sensitive

    to bias from a confounder? We find that for Γ = 1.85, the p-value is 0.049. A bias of magnitude

    Γ = 1.85 means that two matched students might differ in terms of an unobserved usji such

    that one student is almost twice as like as the other to attend a private voucher school before

    it would alter our conclusions. Finally, we test whether the point estimate is equivalent with a

    large effect size of 0.60 standard deviations. Again, we can reject H(δ3)6= with p < .001. With a

    bias of magnitude Γ = 5.71 the p-value is 0.049. Therefore, it would take a very large bias for

    our conclusions about a large treatment effect to be altered.

    Table 5: Sensitivity Analysis Results With Different Weights and Corrections for Multiple Testingfor the Externally Valid Match

    Γ H0 : |τ0| > δ1 H0 : |τ0| > δ2 H0 : |τ0| > δ31 0.089 0.003

  • In sum, for both matches, we either cannot reject that the estimated effect is as large as the small-

    est effects found in previous studies or that association could be easily explained by unobserved

    confounding. Bias from an unobserved covariate would need to double the odds of selecting a

    private voucher school to mask a moderate size effect of 0.30 standard deviations. To mask a

    large effect size of 0.60 standard deviations, the bias from the unobserved founders would have

    to nearly quintuple the odds of differential treatment assignment in both matches.

    Table 6: Sensitivity Analysis Results With Different Weights and Corrections for Multiple Testingfor the Internally Valid Match

    Γ H0 : |τ0| > δ1 H0 : |τ0| > δ2 H0 : |τ0| > δ31 0.036 0.003 0.00151.1 0.048 0.005

  • for covariate balance may not only require mean balance, but also other forms of distributional

    balance such as fine balance, x-fine balance, and strength-k matching. In practice, this method

    facilitates sensitivity analyses to hidden biases due to unobserved covariates, and it readily extends

    to clustered observational studies with three or more levels of data. To our knowledge, this method

    is the first application of dynamic and integer programming to observational studies.

    31

  • References

    Anand, P., Mizala, A., and Repetto, A. (2009), “Using School Scholarships to Estimate the Effect

    of Government Subsidized Private Education on Academic Achievement in Chile,” Economics

    of Education Review, 28, 370–381.

    Arpino, B. and Mealli, F. (2011), “The specification of the propensity score in multilevel obser-

    vational studies,” Computational Statistics & Data Analysis, 55, 1770–1780.

    Bellman, R. (1957), Dynamic Programming, Princeton, NJ: Princeton University Press.

    Bertsekas, D. P. (2005), Dynamic Programming and Optimal Control, Vol. I, Belmont, MA:

    Athena Scientific.

    Feller, A. and Gelman, A. (2015), “Hierarchical Models for Causal Effects,” Working Paper.

    Fisher, R. A. (1935), The Design of Experiments, London: Oliver and Boyd.

    Gelman, A. (2006), “Multilevel (hierarchical) modeling: what it can and cannot do,” Techno-

    metrics, 48, 432–435.

    Greevy, R., Lu, B., Silber, J. H., and Rosenbaum, P. R. (2004), “Optimal Multivariate Matching

    Before Randomization,” Biostatistics, 5, 263–275.

    Hansen, B. B. (2007), “Flexible, Optimal Matching for Observational Studies,” R News, 7, 18–24.

    Hansen, B. B., Rosenbaum, P. R., and Small, D. S. (2014), “Clustered Treatment Assignments

    and Sensitivity to Unmeasured Biases in Observational Studies,” Journal of the American

    Statistical Association, 109, 133–144.

    Hodges, J. L. and Lehmann, E. (1963), “Estimates of Location Based on Ranks,” The Annals of

    Mathematical Statistics, 34, 598–611.

    Hong, G. and Raudenbush, S. W. (2006), “Evaluating Kindergarten Retention Policy: A Case of

    32

  • Study of Causal Inference for Multilevel Data,” Journal of the American Statistical Association,

    101, 901–910.

    Hsieh, C.-T. and Urquiola, M. (2006), “The Effects of Generalized School Choice on Achievement

    and Stratification: Evidence from Chile’s voucher program,” Journal of Public Economics, 90,

    1477–1503.

    Hsu, J. Y., Zubizarreta, J. R., Small, D. S., and Rosenbaum, P. R. (2015), “Strong Control

    of the Family-Wise Error Rate in Observational Studies that Discover Effect Modification by

    Exploratory Methods,” Working Paper.

    Lara, B., Mizala, A., and Repetto, A. (2011), “The Effectiveness of Privte Voucher Education:

    Evidence From Structural School Switches,” Educational Evaluation and Policy Analysis, 33,

    119–137.

    Lee, V. E. and Bryk, A. S. (1989), “A Multilevel Model of The Social Distribution of High School

    Achievement,” Sociology of Education, 62, 172–192.

    Li, F., Zaslavsky, A. M., and Landrum, M. B. (2013), “Propensity score weighting with multilevel

    data,” Statistics in medicine, 32, 3373–3387.

    Lu, B., Greevy, R., Xu, X., and Beck, C. (2011), “Optimal Nonbipartite Matching and its Statis-

    tical Applications,” The American Statistician, 65, 21–30.

    McEwan, P. J. (2001), “The Effectiveness of Public, Catholic, and Non-Religious Private Schools

    in Chile’s Voucher System,” Education Economics, 9, 183–219.

    MINEDUC (2013), “Estadśticas de la Educación,” http://www.ministeriodesarrollosocial.gob.cl/encuesta-

    post-terremoto/index.html.

    Mizala, A. and Romaguera, P. (2001), “Factors Explaining Secondary Education Outcomes in

    Chile,” El Trimestre Economico, 272, 515–549.

    Neyman, J. (1923), “On the Application of Probability Theory to Agricultural Experiments. Essay

    33

  • on Principles. Section 9.” Statistical Science, 5, 465–472. Trans. Dorota M. Dabrowska and

    Terence P. Speed (1990).

    Rosenbaum, P. R. (1984), “The Consequences of Adjusting For a Concomitant Variable That

    Has Been Affected By The Treatment,” Journal of The Royal Statistical Society Series A, 147,

    656–666.

    — (1987), “Sensitivity Analysis for Certain Permutation Inferences in Matched Observational

    Studies,” Biometrika, 74, 13–26.

    — (1989), “Optimal Matching for Observational Studies,” Journal of the American Statistical

    Association, 84, 1024–1032.

    — (2002), Observational Studies, New York, NY: Springer, 2nd ed.

    — (2003), “Exact Confidence Intervals for Nonconstant Effects by Inverting the Signed Rank

    Test,” The American Statistician, 57, 132–138.

    — (2006), “Differential effects and generic biases in observational studies,” Biometrika, 93,

    573–586.

    — (2008), “Testing hypotheses in order,” Biometrika, 95, 248–252.

    — (2010), Design of Observational Studies, New York: Springer-Verlag.

    — (2012a), “Optimal Matching of an Optimally Chosen Subset in Observational Studies,” Journal

    of Computational and Graphical Statistics, 21, 57–71.

    — (2012b), “Testing One Hypothesis Twice in Observational Studies,” Biometrika, 99, 763–774.

    Rosenbaum, P. R., Ross, R. N., and Silber, J. H. (2007), “Minimum Distance Matched Sampling

    with Fine Balance in an Observational Study of Treatment for Ovarian Cancer,” Journal of the

    American Statistical Association, 102, 75–83.

    Rosenbaum, P. R. and Silber, J. H. (2009), “Sensitivity Analysis for Equivalence and Difference in

    34

  • an Observational Study of Neonatal Intensive Care Units,” Journal of the American Statistical

    Association, 104, 501–511.

    Rubin, D. B. (1974), “Estimating Causal Effects of Treatments in Randomized and Nonrandom-

    ized Studies,” Journal of Educational Psychology, 6, 688–701.

    Sapelli, C. and Vial, B. (2002), “The Perfomance of Private and Public Schools in the Chilean

    Voucher System,” Cuadernos De Economia, 39.

    — (2005), “Private vs public voucher schools in Chile: New evidence on efficiency and peer

    effects,” Working Paper 289, Catholic University of Chile, Instituto de Economia.

    Small, D. S., Have, T. R. T., and Rosenbaum, P. R. (2008), “Randomization Inference in a Group–

    Randomized Trial of Treatments for Depression: Covariate Adjustment, Noncompliance, and

    Quantile Effects,” Journal of the American Statistical Association, 103, 271–279.

    Stuart, E. A. (2010), “Matching Methods for Causal Inference: A Review and a Look Forward,”

    Statistical Science, 25, 1–21.

    Zubizarreta, J. R. (2012), “Using Mixed Integer Programming for Matching in an Observational

    Study of Kidney Failure after Surgery,” Journal of the American Statistical Association, 107,

    1360–1371.

    Zubizarreta, J. R., Paredes, R. D., and Rosenbaum, P. R. (2014), “Matching for Balance, Pairing

    for Heterogeneity in an Observational Study of the Effectiveness of For-profit and Not-for-profit

    High Schools in Chile,” Annals of Applied Statistics, 8, 204–231.

    Zubizarreta, J. R., Reinke, C. E., Kelz, R. R., Silber, J. H., and Rosenbaum, P. R. (2011), “Match-

    ing for Several Sparse Nominal Variables in a Case-Control Study of Readmission Following

    Surgery,” The American Statistician, 65, 229–238.

    35

    IntroductionClustered Observational Studies with Multilevel DataVouchers and School Choice

    Schools in Chile; Educational Census Data; Study DesignThe Chilean School System and Restricted Secondary School Choice Longitudinal Census of Students and SchoolsData Structure and Study Design

    Dynamic and Integer Programming for Multilevel MatchingReview of Cardinality MatchingA Multistage Decision Method for Multilevel MatchingExtensions and Computation

    Covariate Balance in the Matched SampleOutcome AnalysesNotation: Treatment Effects for Students in SchoolsRandomization Inference When Treatment is Assigned at the School LevelComparative Effectiveness of Public Versus Private Voucher SchoolsTest of Equivalence and Sensitivity AnalysisHow Much Bias Would Need to be Present to Mask a Positive Effect of Private Voucher Schools?

    Summary and Discussion