Optimal Multilevel Matching in Clustered Observational ... · Optimal Multilevel Matching in Clustered Observational Studies: A Case Study of the School Voucher System in Chile∗

Optimal Multilevel Matching in Clustered Observational Studies:A Case Study of the School Voucher System in Chile∗

José R. Zubizarreta† Luke Keele‡

March 16, 2015

Abstract

A distinctive feature of a clustered observational study is its multilevel or nested datastructure arising from the assignment of treatment, in a non-random manner, to groups orclusters of units or individuals. Examples are ubiquitous in the health and social sciencesincluding patients in hospitals, employees in firms, and students in schools. What is theoptimal matching strategy in a clustered observational study? At first thought, one mightstart by matching clusters of individuals and then, within matched clusters, continue bymatching individuals. But, as we discuss in this paper, the optimal strategy is the oppo-site: first match individuals and, once all possible combinations of matched individuals areknown, then match clusters. In this paper we use dynamic and integer programming to im-plement this strategy and extend optimal matching methods to hierarchical and multilevelsettings. In particular, our method attempts to replicate a paired clustered randomizedstudy by finding the largest sample of matched pairs of treated and control individualswithin matched pairs of treated and control clusters that is balanced according to speci-fications given by the user. Our method directly balances covariates both at the clusterand individual levels and does not require estimating the propensity score, although thepropensity score can be balanced as an additional covariate. We illustrate our method ona case study of the comparative effectiveness of public versus private voucher schools inChile, a question of intense policy debate in the country at the present.

Keywords: Causal Inference; Group Randomization; Hierarchical/Multilevel Data; ObservationalStudy; Optimal Matching

∗For comments and suggestions, we thank Cinar Kilcioglu, Sam Pimentel and Paul Rosenbaum, and seminarparticipants at Johns Hopkins University and the University of Pennsylvania.

†Assistant Professor, Division of Decision, Risk and Operations, and Statistics Department, Columbia Uni-versity, 3022 Broadway, 417 Uris Hall, New York, NY 10027, Email: [email protected].

‡Associate Professor, Department of Political Science, 211 Pond Lab, Penn State University, University Park,PA 16802 Phone: 814-863-1592, Email: [email protected].

1

1 Introduction

1.1 Clustered Observational Studies with Multilevel Data

Clustered observational studies are ubiquitous in the health and social sciences. Examples include

patients receiving similar treatments in hospitals, employees facing a policy change inside firms,

and students following a particular learning program within schools. Clustered observational

studies have a nested or multilevel data structure with observed and unobserved covariates both

at the cluster and unit levels. In this context, research interest typically lies in the effect of

the cluster level treatment on unit level outcomes, however this effect may be confounded by

differences in the distributions of covariates across the treatment groups both at the cluster

and unit levels. Therefore, an important question in clustered observational studies is: how to

adjust for observed covariates taking into account the multilevel structure? In an attempt to

be transparent, these adjustments will ideally balance covariates at both levels and facilitate

sensitivity analyses to hidden biases due to unobserved covariates (Rosenbaum 2010).

Educational settings are perhaps the most well-known multilevel structure with unit level measures

such as the student’s score on a standardized test, and also cluster level covariates such as school

enrollment (Lee and Bryk 1989). Both covariates may act as confounders when evaluating, for

instance, the impact of a study program or administration regime targeted at improving learning.

A conventional approach to adjust for cluster and unit level covariates is hierarchical or multilevel

regression modeling. A hierarchical regression model allows the researcher to fit a model for

the mean outcome using unit level covariates while accounting for unexplained variation among

clusters. The cluster level predictors are often referred to as “contextual effects,” and may

be interpreted as causal effects under certain assumptions (Gelman 2006; Feller and Gelman

2015).

Nonparametric alternatives to multilevel regression modeling often rely on propensity scores (Hong

and Raudenbush 2006; Arpino and Mealli 2011; Li et al. 2013). For example, Hong and Rau-

2

denbush (2006) stratify on a multilevel propensity score to approximate a two-stage experiment

where schools and students are randomly assigned to treatment within blocks. Matching methods

are extensively used in observational studies (Stuart 2010; Lu et al. 2011), however they do not

typically account for multilevel data structures.

In this paper, we develop an optimal matching method for multilevel data structures. Contrary

to intuition, our method first matches pairs of units and then clusters to create matched pairs

of clusters with pairs of units within these cluster pairs. Our method is optimal in the sense

that it maximizes the size of the matched sample and it directly balances the observed covariates

in a specific way. For example, our method not only allows the researcher to balance central

moments of distributions but also the entire distribution of the observed covariates. Our method

does not require estimating the propensity score, although the propensity score can be used as an

additional balancing covariate. Although we illustrate our method in a study with two levels—

students within schools—it can readily be extended to settings with three or more levels such as

students within schools within districts. In particular, we illustrate our method on a case study

of the comparative effectiveness of public versus private voucher schools in standardized tests in

Chile. This a question of intense policy debate in the country at the present, and we describe it

subsequently.

1.2 Vouchers and School Choice

Governments often enact policy reforms to improve educational outcomes. One educational

reform is the use of vouchers to create education markets. In a voucher system, parents receive

a voucher to choose among a number of competing schools both public and private. While

voucher systems are relatively rare in the United States, many other countries, however, have

adopted universal voucher programs.1 Countries with universal voucher programs include Chile,

Denmark, Netherlands, South Korea and Sweden (Lara et al. 2011). Chile was the first country

to adopt a universal voucher system in the 1980’s. Public and private voucher schools receive a1As of 2007, only 16% of U.S. students lived in areas with an active voucher system.

3

direct payment from the cover on a per-student basis. This voucher system fueled the growth

of a parallel private education system. By 2006, nearly 5,000 private voucher schools existed,

and these schools enrolled nearly 44% of students with 48% attending public schools (Lara et al.

2011). By 2013, there were over 6,000 private voucher schools with around 53% of students,

and 38% attended public schools (MINEDUC 2013).

Are private vouchers schools more effective than public schools? Current evidence on the voucher

system in Chile is mixed. A number of studies have found that private voucher schools increase

test scores by at least 15% to 20% of a standard deviation (Mizala and Romaguera 2001; Anand

et al. 2009) though other studies have found larger effects (Sapelli and Vial 2002, 2005). While

other work has found effects that are either not statistically detectable (Hsieh and Urquiola 2006;

McEwan 2001) or are much smaller (Lara et al. 2011). We conduct an observational study of

whether private voucher schools produce students with higher test scores than public schools.

These data have a multilevel structure in that we observe student level covariates such as gender

and socio-economic status as well as school level covariates such as enrollment and whether the

school is in an urban or rural area.

Our study also provides a basic template for the design of observational studies of school level

interventions. We demonstrate how temporal ordering is critical to selecting covariates for the

removal of overt biases. In an observational study, use of longitudinal data is necessary to avoid

conditioning on a post-treatment covariate. If data are carefully collected over time as events

occur, then the temporal order of events is clear, and the distinction between covariates and

outcomes is clear as well. In contrast, if data are collected from subjects at a single time, as in

a cross-sectional study, such temporal order is unclear and one might mistakenly condition on an

outcome. In this study, we use the transition from primary school to secondary school to clearly

delineate the temporal order of pre- and post-treatment covariates. The Chilean data form a full

panel over time which allows us to carefully separate both school and student level covariates

from outcomes.

4

This article is organized as follows. Section 2 describes the Chilean school system, the longitudinal

census data, and the design that we use in our study. Section 3 reviews cardinality matching for

finding the largest matched sample that is balanced, explains the multilevel matching method,

and presents the method more generally. Section 4 shows the resulting matches. Section 5

analyzes the comparative effectiveness of public and private voucher schools in Chile. Section 6

concludes with a summary and a discussion.

2 Schools in Chile; Educational Census Data; Study De-

sign

2.1 The Chilean School System and Restricted Secondary School Choice

In Chile there are three basic types of schools: (i) private non-subsidized schools, (ii) private

subsidized schools, and (iii) public schools, also called municipal schools. Private non-subsidized

schools are generally elite schools that do not receive any funds from the state and are funded

entirely through private tuition. Approximately 7% of the students in Chile attend this type of

schools. Private subsidized schools receive funding from the state on a per pupil basis, and some

of these schools charge an additional monthly tuition fee to parents (these are called “shared

funding" private subsidized schools or “financiamiento compartido" in Spanish). These schools

are a mix of for-profit and not-for-profit organizations. Approximately 53% of the students in

Chile attend this type of school. Finally, approximately 40% of students attend public schools.2

In Chile, the voucher system is based on a direct payment to the schools as a function of daily

attendance. In our

We seek to test whether private vouchers schools are more effective than public schools. Pre-2The exception is 60 schools (called “Delegated Administration") that are publicly funded on a non-voucher

basis, i.e. they receive a fixed amount of money from the state regardless of how many students they enroll.

5

treatment differences in these two groups of student may be either those that are measurable and

thus form overt biases or those that are unmeasured which are hidden biases. In an observational

study, analysts use pretreatment covariates and a statistical adjustment strategy to remove overt

biases in the hopes of consistently estimating treatment effects. We face two challenges in our

observational study if we simply compare the test scores of students in these two types of schools

and remove overt biases via matching.

First, data are not collected on students until they are in the fourth grade. Without measurements

that precede the treatment, we may mistakenly adjust for an outcome rather than a covariate and

bias the study (Rosenbaum 1984). Second, the choice to attend a private voucher school is most

likely highly confounded by not only observed but unobserved factors. To solve both problems,

we exploit an opportunity: an unusual setting in which we believe confounding from unobserved

covariates is lessened (Rosenbaum 2010, section 5.1).

The opportunity we exploit is related to the concept of differential effects, which are used to reduce

bias in observational studies by the study of parallel treatments (Rosenbaum 2006). Consider the

following example of a differential effect. It is thought that nonsteroidal anti-inflammatory drugs

(NSAIDs) reduce the risk of Alzheimer disease. To test this theory one might compare subjects

that regularly use NSAIDs to those that do not. However, there are a number of observed and

unobserved reasons why these two populations may differ other than the use of NSAIDs. As an

alternative design, one might instead compare regular NSAID users to regular users of other pain

medications such as acetaminophen. The differential effect of acetaminophen versus NSAIDs may

be less subject to bias from unmeasured confounders than the effect of NSAIDS versus no use

of pain medication. As such, a differential effect is an effort to reduce sensitivity to bias through

the comparison of parallel treatments.

The Chilean school system provides us with an opportunity that is parallel to a differential effect.

Chilean schools are divided into primary and secondary schools. Primary schools are comprised

of Grades 1–8 and secondary schools encompass Grades 9–12. Since 2003 students are required

6

to attend both primary and secondary schools. However, primary and secondary schools need not

be separate institutions. It is not uncommon for primary and secondary schools to be fused into

a single school, such that students can attend Grades 1–12 at the same institution. Students

that attend a primary only school have to select a secondary school once they reach Grade 8. In

2004, 76.4% of public schools and 52% of private voucher schools were primary only schools, and

thus about 56% of students in the eighth grade had to select a new secondary school to attend

(Lara et al. 2011). We use this opportunity in our design by restricting the analysis to students

from public primary schools that had to select a secondary school since they attended a primary

only school. Under this design, treated students are students that switch from a public primary

school to a private voucher secondary school. We compare these treated students to students in

public primary only schools that choose to attend public secondary schools.

Exploiting this aspect of the Chilean school system has two advantages. First, it allows us to

clearly delineate the temporal ordering of the treatment. Since the treatment is a private voucher

secondary school, we may safely condition on covariates collected while students were in primary

school. In this way we avoid biases from adjusting for a concomitant outcome (Rosenbaum

1984). Second, we also suspect that a student that switches from a public to a private voucher

school when they are not required to, may do so for a number of reasons, many of which are

unobservable. Here, we restrict the analysis to students that attend their local public school but

must switch due to school structure. This may be a more directly observable treatment selection

mechanism.

2.2 Longitudinal Census of Students and Schools

In 1988, Chile introduced a national student assessment system known as the Sistema Nacional

de Medición de la Calidad de la Educación or SIMCE. The SIMCE is an “educational census.”

That is, in the SIMCE, the Ministry of Education collects data to evaluate all students in fourth,

eighth, tenth and eleventh grades in language, mathematics and sciences, roughly every two years.

SIMCE data are collected from four different sources. First, data are collected from students,

7

which includes test scores that are complemented with other student covariates such as gender.

Second, both parents and teachers complete questionnaires. Finally for schools, student test

scores are aggregated, and a few additional covariates are collected. Students are given unique

identifiers which allows us to form a true panel over a two year period. Student records can also

be linked to teacher, parent, and school level covariates.

In our study, we use SIMCE data from 2003, 2004, and 2006. The SIMCE from 2003 only

collected data from secondary schools and students enrolled in secondary schools. For 2004 and

2006, SIMCE collected data from both primary and secondary schools and students. We use test

scores on language and mathematics administered in 2006 when students are in the tenth grade

as our outcome measures.

2.3 Data Structure and Study Design

In our study, the data structure and design are intricately linked. We now outline how we

constructed the match to fit the data structure. One advantage of our approach is that we

can tailor the statistical adjustment to exactly fit the multilevel structure of the data, which

is important since we have student, parent, teacher, and school level data. We perform two

matches: one for students and one for schools. Next, we describe the covariates that form the

student level match.

For each student with test scores observed in 2006, we match on student, parent, teacher, and

primary school covariates from the SIMCE data collection in 2004. For the student match, we

first list student level covariates. The key covariate, here, is student test scores from the 8th

grade. In 8th grade students are tested on four topics: language, mathematics, social sciences,

and natural sciences. The student level data also measures gender. For the student match, we

also include three covariates from parents: income measured in six categories, father’s education,

and mother’s education. We also link students to primary school level measures, and we match

on primary school covariates in the student level match. At the primary school level, we match

8

on a five category socio-economic status indicator for each school that is created by the Chilean

Ministry of Education. This five category indicator is constructed from questions based on

parental education, family incomes in the school and an index of school vulnerability. We also

use school level measures that are aggregates of data observed at other levels. As such, we

match on average test scores for each primary school, the number of teachers and the number

of enrolled students. Finally, in the teacher survey, teachers are asked what level of education

they expect the majority of their students to achieve. Teachers responded using a five category

scale that records responses from 8th grade to a college degree. We aggregate this measure and

recorded the median for each primary school and use it in the student level match. To reiterate

while we observed covariates measured at different levels in 2004, we treat all these measures as

pre-treatment student level covariates in the match.

The school match is based on secondary school data from 2003. Since the SIMCE forms a panel,

we can match on characteristics of the secondary schools before any student is exposed to the

treatment. That is, we match on the schools the students will attend using data from before

they attend that school. For the school match, we match on enrollment, school level math and

language test score averages, the percentage of female students in the school, average student

income, urban versus rural status, and the same five category socio-economic status indicator

for each school that is recorded for primary schools.3 Some of these covariates are aggregates

that we created from either student, teacher, or parent level data in 2003. We did not match on

several other covariates that are also observed in the 2003 data. These measures include whether

teachers are allowed class preparation time, the proportion of teachers with a post-graduate

diploma, the average teacher experience, the number of hours teachers worked per week. We

do not match on these covariates since they are plausibly part of the school level treatment.

Matching on such covariates would remove their effect on students from the final outcomes and

thus could potentially attenuate the treatment effect.3For student in secondary schools, the SIMCE only collects test scores on language and math. At the primary

school level, we have test scores for language, mathematics, social sciences, and natural sciences.

9

We describe the matching algorithm in greater detail in Section 3. However, the match is based on

integer programming which allows us to enforce different forms of balance for different covariates

(Zubizarreta 2012; Zubizarreta et al. 2014). This is relevant since we tailored the constraints for

each covariate. Here, we describe the different balance constraints we applied to each covariate.

For the student level covariates, we applied a mean balance constraint to primary school test score

measures, primary school enrollment, the number of teachers in the primary school, the average

expected level of educational attainment, and the proportion of female student in the primary

school. For student level test score measures, we enforced a constraint on the entire distribution

via the Kolmogorov-Smirnov test statistic which is the maximum discrepancy in the empirical

cumulative distribution functions. For the school level match, we enforced a mean balance

constraint on secondary school test scores, missingness indicators for test scores, secondary school

enrollment, income category, SES category, urban or rural status, and the proportion of female

students in the secondary school.

For discrete student and school level covariates, we used a fine balance constraint. Under fine

balance, we exactly balance covariates without exactly matching. Fine balance is achieved for

discrete covariates by balancing the marginal distributions of covariates exactly in aggregate but

without constraining who is matched to whom. We applied fine balance to student sex, father

and mother’s education level, parental income categories, and primary school SES categories. We

now describe the notation and the optimal matching algorithm. See Rosenbaum et al. (2007) for

a discussion of fine balance and Rosenbaum (2010, Part II) for a discussion of different forms of

covariate balance.

3 Dynamic and Integer Programming for Multilevel Match-

ing

The goal of our multilevel matching method is to find the largest sample of matched pairs of

treated and control units within matched pairs of treated and control clusters that is balanced

10

on the observed covariates. For assessing the sensitivity of results to the influence of unobserved

covariates we use the methods for sensitivity analysis proposed by Rosenbaum (1987, 2002) and

tailored to clustered treatment assignments by Hansen et al. (2014) (see subsection 5.4 of the

paper). In our case study, units are students and clusters are schools, and, importantly, because

results can be confounded both by student and school level covariates, we match pairs of students

and schools to balance covariates at both levels. The basic tool that we use in our multilevel

matching method is cardinality matching which we describe subsequently.

3.1 Review of Cardinality Matching

Common matching methods attempt to achieve covariate balance indirectly, by finding treated

and control units that are close on a summary measure of the covariates such as the Mahalanobis

distance or the propensity score (see Stuart 2010 and Lu et al. 2011 for reviews). Unlike these

matching methods, cardinality matching uses the original covariates to match units and directly

balance their covariate distributions (Zubizarreta et al. 2014). Specifically, by solving an integer

programming problem, cardinality matching finds the largest matched sample that satisfies the

researcher’s specifications for covariate balance. Following Zubizarreta (2012), these specifica-

tions for covariate balance may not only require mean balance, but perhaps also other forms of

distributional balance such as fine balance (Rosenbaum et al. 2007), x-fine balance (Zubizarreta

et al. 2011), and strength-k matching (Hsu et al. 2015). For example, cardinality matching will

find the largest sample of matched pairs in which all the covariates have differences in means

smaller than one tenth of a standard deviation and the marginal distributions of nominal co-

variates of greater prognostic importance are perfectly balanced (fine balance). In this manner,

with cardinality matching subject matter knowledge about the research question at hand comes

into the matching problem through the specifications for covariate balance, finding the largest

matched sample that satisfies them.

As we describe in the next subsection, our multilevel matching method uses cardinality matching

to match treated and control students across all the possible combinations of treated and control

11

schools, and then uses a modified version of cardinality matching to match schools with the

largest number of matched students.

3.2 A Multistage Decision Method for Multilevel Matching

Let kt ∈ Kt = {1, ..., Kt} index the treated clusters and kc ∈ Kc = {1, ..., Kc} denote the

control clusters. Let jkt be treated unit j in treated cluster kt, with jkt ∈ Jkt = {1, ..., Jkt},

and jkc stand for control unit j in control cluster kc with jkc ∈ Jkc = {1, ..., Jkc}. Put xkt for

the vector of observed covariates of treated cluster kt, and similarly write xjkt for the observed

covariates of treated unit jkt ; analogous notation applies for control clusters and units. Based

on the unit-level covariates, calculate a distance δjkt ,jkc between treated unit jkt and control unit

jkc (for instance, this distance may be the robust Mahalanobis distance specified in section 8.3

of Rosenbaum 2010). Define A and Ba as the sets of feasible solutions for the cluster- and

unit-level matches within matched clusters (hence the subindex a in Ba). In practice, A and

Ba are implemented as linear inequality constraints in a integer program and they enforce the

researcher’s requirements for covariate balance and matching structures at the cluster and unit

levels respectively (for instance, A may require the means of the cluster covariates to be balanced

and the matched groups to form pairs of clusters, and Ba may require the marginal distributions

of the unit covariates to be balanced and the matched groups to form pairs of units). Importantly,

since the requirements in A refer to clusters and those in Ba refer to units, A and Ba are disjoint.

Let J (m)kt be the set of treated units matched in treated cluster kt and J(m)t =

⋃kt∈Kt J

(m)kt

be the set of treated units matched across all treated clusters. Finally, let K(m)t be the set of

matched treated clusters.

Building upon the framework of Rosenbaum (2012a), an optimal cardinality matching of units

within clusters can be characterized by the quadruple (K(m)t , α,J(m)t , β) of assignments of clus-

ters α : K(m)t → Kc and units β : J(m)kt→ Jkc that maximize the cardinality of the set of

matched of units within matched clusters subject to the constraints in A and Ba, respectively.

If there are two cardinality matchings that satisfy the requirements in A and Ba, then we prefer

12

one matching over the other if it has a larger cardinality, or, alternatively, if they both have

the same cardinality, if it has a smaller sum of total distances between matched units. For-

mally, we prefer the cardinality matching (K(m)t , α,J(m)t , β) to (K̃

(m)t , α̃, J̃

(m)t , β̃), denoted by

(K(m)t , α,J(m)t , β) � (K̃

(m)t , α̃, J̃

(m)t , β̃), if |J

(m)t | > |J̃

(m)t |, or alternatively if |J

(m)t | = |J̃

(m)t |

and ∑jkt∈J

(m)t

δjkt ,β(jkt ) <∑jkt∈J̃

(m)t

δjkt ,β(jkt ). If |J(m)t | = |J̃

(m)t | and

∑jkt∈J

(m)t

δjkt ,β(jkt ) =∑jkt∈J̃

(m)t

δjkt ,β(jkt ), then we are indifferent between the two cardinality matchings and write

(K(m)t , α,J(m)t , β)∼ (K̃

(m)t , α̃, J̃

(m)t , β̃). If we have either (K

(m)t , α,J

(m)t , β)� (K̃

(m)t , α̃, J̃

(m)t , β̃)

or (K(m)t , α, J(m)t , β) ∼ (K̃

(m)t , α̃, J̃

(m)t , β̃), we write (K

(m)t , α,J

(m)t , β) % (K̃

(m)t , α̃, J̃

(m)t , β̃).

Our optimal multilevel matching problem is the following.

Problem 3.1. For given sets of cluster-level constraints A and unit-level constraints Ba, find a

matching (K(m)t , α,J(m)t , β) that satisfies A and Ba such that, for any other matching (K̃

(m)t , α̃,

J̃ (m)t , β̃) that also satisfies A and Ba, (K(m)t , α,J

(m)t , β) % (K̃

(m)t , α̃, J̃

(m)t , β̃).

Intuition may suggest that the the best way to solve Problem 3.1 and match with multilevel data

is first to match clusters and then within matched clusters to match units. In our case study, this

would require first pairing schools and then, within pairs of schools, pairing students. However

this strategy will not always find the largest matched sample that is balanced as two schools

that are paired on their school level characteristics may have different student compositions so

that when their students are paired it may result in a smaller sample size than optimal. For

this reason, the optimal matching strategy needs to contemplate what is optimal both at the

student and school levels simultaneously. Applying Bellman’s (1957) principle of optimality, the

optimal matching strategy is, under the assumption that schools have been matched optimally,

first match students and then, considering these optimal student matches, match schools.

In abstract terms, the following algorithm and proposition state this; again, that the optimal

strategy is first to match units across all the possible combinations of pairs of treated and control

clusters, and, once all possible combinations of matched units are known, then match clusters.

To implement the optimal assignments α and β, let akt,kc = 1 if treated cluster kt is paired to

13

control cluster kc and akt,kc = 0 otherwise; similarly let bjkt ,jkc = 1 if treated unit j in treated

cluster kt is paired to control unit j in control cluster kc, and bjkt ,jkc = 0 otherwise.

Algorithm 3.2. For each of the possible Kt × Kc pairs of treated and control clusters, find

the optimal cardinality matching of units that satisfies Ba. This is, for each kt ∈ Kt and each

kc ∈ Kc find mkt,kc = maxb∑jkt∈Jkt

∑jkc∈Jkc bjkt ,jkc subject to b ∈ Ba. Then find the optimal

cardinality cluster matching that solves maxa∑kt∈Kt

∑kc∈Kc mkt,kcakt,kc subject to a ∈ A.

Proposition 3.3. Algorithm 3.2 solves the optimal multilevel cardinality matching problem 3.1.

Proof. Let f(a, b) be the the total number of pairs of treated and control units matched by a

within pairs of treated and clusters matched by b. In the abstract, in Problem 3.1 we want to

maximize the function f(a, b) subject to the constraints A and Ba. This is, find a and b to solve

maxa,b

f(a, b) subject to a ∈ A, b ∈ Ba. (1)

In a trivial way, we may solve (1) by first solving

g(a) = maxbf(a, b) subject to b ∈ Ba (2)

for each a ∈ A, and then solving

maxa

g(a) subject to a ∈ A. (3)

While (2) seems hard in general (because there are many possible choices of b), the nested

structure of the units-in-clusters problem makes it easier because f(a, b) separates into a sum

of parts for cluster pairs because the constraint sets A and Ba are disjoint. Algorithm 3.2 does

exactly this.

In our case study, for each pairing of schools a, we find the best pairing of students b within

14

those schools (2), and then pick the best pairing of schools with the associated best pairing of

students for that pairing of schools (3). Again, while (2) seems hard in general (because there

are many possible student matches b), the nested structure of the students-in-schools problem

makes it easier because f(a, b) separates into a into a sum of parts for school pairs. For example,

if treated school kt is paired to control school kc, then the contribution of schools kt and kc is

the same of number of pairs regardless of how the other schools are paired.

With Algorithm 3.2, the multilevel cardinality matching problem can be solved optimally by

breaking it into simpler matching subproblems and recursively finding the optimal match. This is

an application of dynamic programming to matching in observational studies that takes advantage

of the multilevel structure of the data (see Bertsekas 2005 for an extensive exposition of dynamic

programming).

3.3 Extensions and Computation

Note that if we had three or more levels (such as students within schools within districts), then

the multilevel matching procedure would extend naturally. With l levels, the procedure would

require first matching the lower level l under the assumptions that levels l − 1, l − 2, ..., 1 have

been matched optimally, to then (once the matches at level l are completed) matching level l−1

under the assumptions that levels l− 2, l− 3..., 1 have been matched optimally, and so on.

Note that our multilevel matching method maximizes the size of the matched sample, but it can

also be formulated to minimize a covariate distance between students. If this was the case and if

each of the student level matching problems was solved using optimal matching as in Rosenbaum

(1989) and Hansen (2007), then a trivial worst-case time bound for the multilevel matching

method would be of order O(J3K2 + K3) where J = max{Jkt=1, ..., Jkt=Kt , Jkc=1, ..., Jkc=Kc}

and K = max{Kt, Kc}. In general it is possible to find worst-case time bounds based on the

component problems. In our presentation above, each of the component problems is a cardinality

matching problem and, while at the present there is no polynomial time algorithm for cardinality

15

matching, in practice many instances with data sets of reasonable size run in time comparable

to that of optimal matching. Furthermore, a useful feature of problem (1) is that the student

level matches can be found in parallel by separating all the possible pairs of treated and control

schools into smaller mutually exclusive but exhaustive pairs of treated and control schools. In

practice, we found the matches using the package mipmatch for R (Zubizarreta 2012).

4 Covariate Balance in the Matched Sample

After applying basic exclusion criteria, there are 64245 students in 517 schools, 150 subsidized

and 367 public schools (henceforth treated and control schools respectively). Out of the 64245

students, 15682 students are from treated schools and 48563 are from control schools. Using

our multilevel matching method, we matched in two stages within similar groups regions of the

country (namely, regions I-III, IV-V, VI-VII, VIII, IX, X-XII and the Metropolitan region).

At the student level, we used cardinality matching to find the largest balanced sample of pairs

of students across all the possible combinations of pairs of schools within the groups of regions.

In each of these matches we required mean balance for 19 covariates (including student test

scores, school test scores, and indicators for socioeconomic status and expected educational

achievement; see Table 1 for details), fine balance for 4 covariates (sex, mother and father

education, and household income; see Table 2) and distributional balance for the sum of the

test scores in language and mathematics at baseline. Figure 1 shows not only that the marginal

distributions of the baseline test scores are very closely balanced after matching but also their

joint distribution. As a matter of fact, the 95% bivariate normal density contours are almost

indistinguishable after matching.

At the school level, we used the modification of cardinality matching in the second stage of

Algorithm 3.2 and mean balanced 16 other covariates: percentage female, total enrollment, lan-

guage and math scores (plus indicators for missing values), urban area, parental income categories

(1-5), and socioeconomic groups (A-D). Again, covariates were exact matched for the 7 region

16

Table 1: Covariate balance at the student level after matching. All the covariates are measuredin 2004.

Covariate Mean Std.Subsidized Public dif.

Language score 243.24 243.68 -0.01Mathematics score 243.43 243.04 0.01Natural science score 246.19 247.00 -0.02Social science score 243.01 243.91 -0.02School language score 237.57 237.02 0.03School mathematics score 238.37 237.89 0.03School female proportion 0.51 0.50 0.05School number of students 82.34 82.34 0.00School teacher to student ratio 8.09 8.03 0.03Urban area 0.83 0.83 0.00Socioeconomic status A 0.13 0.12 0.01Socioeconomic status B 0.61 0.61 0.01Socioeconomic status C 0.25 0.26 -0.02Socioeconomic status D 0.01 0.01 0.01Expected education: primary 0.01 0.02 -0.04Expected education: secondary, technical-professional 0.77 0.77 -0.00Expected education: secondary, scientific-humanities 0.13 0.12 0.02Expected education: technical-professional 0.09 0.10 -0.01Expected education: college 0.00 0.00 0.01

17

Table 2: Balance for nominal covariates at the student level. All the covariates are measured in2004. Fine balance constraints balanced sex, type of education of the mother and father, andhousehold income category. The tabulated values are counts of the number of students in eachcategory. In addition, matching was exact for groups of counties (not shown here).

Covariate Subsidized PublicSex

Male 2084 2084Female 1981 1981

Mother educationPrimary school 1886 1886Secondary school 1152 1152Technical 56 56College or higher 17 17Missing 954 954

Father educationPrimary school 1647 1647Secondary school 1255 1255Technical 56 56College or higher 17 17Missing 954 954

Household income category (in 1000 pesos)[0, 100) 1647 1647[100, 200] 1255 1255(200, 400] 1643 1643(400, 600] 446 446(600, 1400] 124 124> 1400 84 84Missing 183 183

18

Figure 1: Distribution of student test scores at baseline after matching. The baseline test scoresare measured in 2004. The ellipses trace the 95% bivariate normal density contours of the jointdistributions of test scores for the matched treated and control units. The contours are almostidentical showing that not only marginal distributions of the test scores are very closely balancedbut also their joint distribution.

100 150 200 250 300 350 400

100

150

200

250

300

350

400

PublicSubsidizedPub.

Sub.

100 150 200 250 300 350 400

Test scores in language at baseline

Pub. Sub.

100

150

200

250

300

350

400

Test

sco

res

in m

athe

mat

ics

at b

asel

ine

Distribution of student-level test scores after matching

groups. We balanced all covariates with and without weighting for the size of the school; see

Table 3. Note that after matching all the differences in means are smaller than 0.05 standard

deviations. In this way, we matched 8130 students in 4065 pairs, and 166 schools in 83 pairs.

In this match, 7 out of the 13 region of the country are represented in both the treatment and

19

control groups.

Table 3: Covariate balance at the school level after matching. Both means and standardizeddifferences are weighted by the number of students in each school.

Covariate Mean Std.Subsidized Public dif.

Female proportion 0.49 0.49 -0.00Number of students 262.33 262.67 -0.00Language score 241.36 241.43 -0.00Mathematics score 230.20 229.19 0.04Language score missing 0.00 0.00 0.00Math score missing 0.00 0.00 0.00Urban area 0.98 0.99 -0.03Income category 1 0.08 0.09 -0.02Income category 2 0.80 0.79 0.01Income category 3 0.12 0.12 0.00Income category 4 0.00 0.00 0.00Income category 5 0.00 0.00 0.00Socioeconomic status A 0.18 0.18 0.01Socioeconomic status B 0.71 0.71 0.01Socioeconomic status C 0.10 0.11 -0.02Socioeconomic status D 0.00 0.00 0.00

Thus while our match yields highly comparable treated and control groups, geographic coverage

is somewhat poor. To that end, we also implemented a second match designed to increase geo-

graphic representation. Starting from the same student matches, we also found a more externally

valid school in which all the differences in means are smaller than 0.15 standard deviations, and

where 10776 students are matched in 5388 pairs and 210 schools are matched in 105 pairs. In

this match, 12 out of the 13 regions in Chile are represented. We deem the first match a match

with greater internal validity, and the second match, a match with greater external validity. Table

(4) compares the results of these two matches. In the next section we compare treatment effect

estimates for these two different matched samples.

20

Table 4: Comparison of the two school level matches. In the internally valid match the largeststandardized difference in means is smaller than 0.05, whereas in the externally valid matchthis difference is smaller than 0.15. Regions represented are the regions present in both in thetreatment and control groups.

Internal validity match External validity matchMatched students 8130 10776Matched schools 166 210Regions represented 4, 5, 6, 7, 9, 10, 13 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 12, 13

5 Outcome Analyses

Having found the optimal match, we now estimate the voucher effect, test its significance, and

assess the robustness of these conclusions with sensitivity analysis methods to hidden bias. We

follow the notation and methods for clustered experiments by Small et al. (2008), and use the

sensitivity analysis methods for clustered observational studies by Hansen et al. (2014).

5.1 Notation: Treatment Effects for Students in Schools

There are S matched pairs of clusters, s = 1, . . . , S, with two schools, j = 1, 2, one treated and

one control for 2S total units. The ordered pair sj thus identifies a unique cluster. Each cluster

sj contains nsj > 1 individuals, i = 1, . . . , nsj. Each pair is matched for observed, pretreatment

covariates, so xs11 = xs22 for each j and i, where xsji represents the observed covariates on

which we matched. A student i in school sj is described by both observed covariates and an

unobserved covariate usji. The set (xsji, usji) may describe either the student sji or the school

sj containing this student. In our study, treatment assignment occurs at the school level as whole

schools are assigned to treatment or control. If the jth school in pair s receives the treatment,

write Zsj = 1, whereas if this school receives the control, write Zsj = 0, so Zs1 + Zs2 = 1, for

each s as each pair contains one treated school and one control school. If nsj = 1 for all sj then

the clusters are individuals, and we have unclustered treatment assignment.

Each student has two potential responses; one response that is observed under treatment Zsj = 1

21

and the other observed under control Zsj = 0 (Neyman 1923; Rubin 1974). We denote these

responses with (yTsji, yCsji), where yTsji is observed from the ith subject in pair s under Zsj = 1,

and yCsji is observed from this subject under Zsj = 0. In our application yTsji is the test score

that student sji would exhibit if he or she switched from a public school to a private voucher

school and yCsji is the test score this same student would exhibit if he or she remained in a public

school. Under this notation, we allow for interference among students in the same school but

not across schools. In this context, yTsji denotes the response of student sji if all students in

school sj receive the treatment, while yCsji denotes the response of student sji if all students

in school sj receive the control. Therefore, we do not assume that we would observe the same

response from student sji if the treatment were assigned to some but not all of the students in

school sj.

For each student, the unobservable effect of treatment is yTsji − yCsji, which is the change in

test scores induced by attending a private voucher school. We do not observe both potential

outcomes, but we do observe responses: Ysji = ZsjyTsji + (1−Zsj)yCsji. Under this framework,

the observed response Ysji varies with Zsj but the potential outcomes do not vary with treatment

assignment. Write R = (R111, . . . , RS2,ns2)T for theN =∑s,j ns,j dimensional vector of observed

responses with the same notation for yc, which are potential responses under control. Below

we test the sharp null hypothesis of no treatment effect on (yTsji, yCsji) which stipulates that

H0 : yTsji = yCsji for all sji (Fisher 1935). This hypothesis asserts that changing the treatment

assigned to school sj would leave the response of student sji unchanged.

5.2 Randomization Inference When Treatment is Assigned at the School

Level

In our analysis, we initially assume that treatment assignment is as-if randomly assigned to

clusters conditional on the matches. In short, we assume as if the toss of a fair coin was used

to allocate private voucher status within matched school pairs. We collect in the set Ω the 2S

22

treatment assignments for all 2S clusters: Z = (Z11, Z12, . . . , ZS2)T . In a matched-pair group

randomized experiment, one treatment assignment Zsj would be picked at random and each

assignment would therefore have probability Pr(Z = Zsj) = 2−S, which yields the randomization

distribution. In an observational study, if the probability of receiving treatment is equal for both

schools in each pair, then the conditional distribution of Z given that there is exactly one treated

unit in each pair equals this randomization distribution, and Pr(Zsj = 1) = 1/2 for each unit j

in pair s (see Rosenbaum 2002 for details). However, in an observational study it may not be

true Pr(Zsj = 1) = 1/2 for each unit j in pair s due to an unobserved covariate usji. We explore

this possibility through a sensitivity analysis described below.

To test Fisher’s sharp null hypothesis of no treatment effect, we define T a test statistic which is

a function of Z and R where T = t(Z,R). Under the sharp null hypothesis R = yc, therefore

T = t(Z,yc). If the model for treatment assignment above were true then the randomization

distribution for T is

Pr{t(Z,R) ≥ w|yTsji, yCsji,xsji, usji,Ω} = Pr{t(Z,yc) ≥ w|yTsji, yCsji,xsji, usji,Ω}

since yc is fixed by conditioning on yTsji, yCsji,xsji, usji and Pr(Z = Zsj|yTsji, yCsji,xsji, usji,Ω) =

1/|Ω|. We use a test statistic from Hansen et al. (2014) that provides inferences for units in

clusters.

For this test statistic, qsji is a score or rank given to Ysji, so that under the null hypothesis, the

qsji are functions of the yCsji and xsji, and they do not vary with Zsk. To make qsji resistant to

outliers, we use the ranks of the residuals when Ysji is regressed on the student level covariates

using Huber’s method of m-estimation Small et al. (2008). We regressed the outcome, test

scores recorded in 2006, on student level test scores recorded in 2004 when the student was still

in primary school. The test statistic T is a weighted sum of the mean ranks in the treated school

23

minus the mean ranks in the control school. Formally the test statistic is

T =S∑s=1

BsQs

where

Bs = 2Zs1 − 1 = ±1, Qs =wsns1

ns1∑i=1

qs1i −wsns2

ns2∑i=1

qs2i.

Hansen et al. (2014) show that T is the sum of S independent random variables each taking the

value ±Qs with probability 1/2, so E(T ) = 0 and var(T ) =∑Ss=1 Q

2s. The central limit theorem

implies that as S → ∞, then T/√

var(T ) converges in distribution to the standard Normal

distribution. In the above equation, ws defines the weights which are a function of nsj.

The choice of weights ws has important implications in our application. Hansen et al. (2014)

discuss three possible choices for ws. One possibility is to use constant weights, ws ∝ 1. Another

possibility is to use weights that are proportional to the total number of students in a matched

cluster pair: ws ∝ ns1 +ns2 or ws = (ns1 +ns2)/∑Sl=1(n11 +n12). These proportional weights are

particularly useful if we believe that the private school voucher effect varies with cluster size. This

would be true if, for example, the private school effect was larger in smaller schools. However,

if we suspect that the private voucher school effect is constant, we could select the weights to

minimize the variance of T . For example, ws ∝ ns1ns2/(ns1 + ns2) will minimize the variance of

T if cross cluster variability is low, while constant weights will minimize the variance if there is

little variance within schools. Hansen et al. (2014) note that for testing the null hypothesis each

set of weights is valid. Given that our cluster sizes exhibit considerable variation, and it is fully

possible that the treatment effect varies with cluster size, we use all three sets of weights to test

the sharp null hypothesis. Below we discuss how we incorporate the different weights into the

sensitivity analysis.

If we test the hypothesis of a shift effect instead of the hypothesis of no effect, we can apply the

method of Hodges and Lehmann (1963) to estimate the voucher school effect. The Hodges and

24

Lehmann (HL) estimate of τ is the value of τ0 that when subtracted from Ysji makes T as as

close as possible to its null expectation. Intuitively, the point estimate τ̂ is the value of τ0 such

that T equals 0 when Tτ0 is computed from Ysji − Zsjτ0. Using constant effects is convenient,

but this assumption can be relaxed; see Rosenbaum (2003). If the treatment has an additive

effect, Ysji = yCsji + τ then a 95% confidence interval for the additive treatment effect is formed

by testing a series of hypotheses H0 : τ = τ0 and retaining the set of values of τ0 not rejected at

the 5% level.

5.3 Comparative Effectiveness of Public Versus Private Voucher Schools

We now test the hypothesis of no effect for private voucher schools. We test this hypothesis in

both matches. For the match with greater external validity, with constant weights ws ∝ 1, the

approximate one-sided p-value is 0.256. Thus we are unable to reject the null that the voucher

school are completely without effect. We also found that both sets of weights, ws ∝ ns1 + ns2

and ws ∝ ns1ns2/(ns1 + ns2), lead to identical p-values of 0.292. In the absence of bias from

hidden confounders, the point estimate is τ̂ = 2.81 with a 95% confidence interval of -5.68 and

11.34.

For the match with greater internal validity with constant weights, the approximate one-sided p-

value is 0.492. Using non-constant weights, we find the approximate one-sided p-value is 0.633. If

there are no hidden confounders, the point estimate for the match with greater internal validity is

τ̂ = 0.0743 with a 95% confidence interval of -8.58 and 9.36. Thus for both matches, we cannot

reject the hypothesis that attending a private voucher school has no effect on test scores. We

next explore the likelihood that bias from a hidden confounder masks a treatment effect.

5.4 Test of Equivalence and Sensitivity Analysis

In an observational study, one concern is that bias from a hidden covariate can give the impression

that a treatment effect exists when in fact no effect is present. Bias from hidden confounders can

25

also mask an actual treatment effect leaving the analyst to conclude there is no effect when in

fact such an effect exists. We explore this possibility using a test of equivalence and a sensitivity

analysis (Rosenbaum 2008; Rosenbaum and Silber 2009; Rosenbaum 2010).

Above we were unable to reject the null hypothesis that τ = 0 for all students. Next, we apply

a test of equivalence to test the hypotheses that τ is not small. Under a test of equivalence, we

test the following null hypothesis H(δ)6= : |τ | > δ. Rejecting H(δ)6= provides a basis for asserting

with confidence that |τ | < δ. H(δ)6= is the union of two exclusive hypotheses:←−H

(δ)0 : τ ≤ −δ

and −→H (δ)0 : τ ≥ δ, and H(δ)6= is rejected if both

←−H

(δ)0 and

−→H

(δ)0 are rejected (Rosenbaum and

Silber 2009). We can apply the two tests without correction for multiple testing since we test

two mutually exclusive hypotheses. Thus we can test whether the estimate from our study is

different from other possible treatment effects which are represented by δ.

With a test of equivalence, it is not possible to demonstrate a total absence of effect, but if

this were a randomized trial we could safely test that our estimated effect is not as large as δ.

That is we may be able to reject H(δ)6= : |τ | > δ. In an observational study, however, there are

additional complications. Since the treatment was not randomly assigned, it may be the case

that we reject the null hypothesis of equivalence due to hidden confounding. However, using a

sensitivity analysis we may find evidence that the test of equivalence is insensitive to biases from

nonrandom treatment assignment.

In a sensitivity analysis, we quantify the degree to which a key assumption must be violated in

order for our inference to be reversed. Our model of treatment assignment assumes that within

matched pairs, receipt of the treatment is effectively random conditional on the matches. We

consider how sensitive our conclusions are to violations of this assumption using a model of

sensitivity analysis discussed in Rosenbaum (2002, ch. 4).

In our study, matching on observed covariates xsji made students more similar in their chances

of being exposed to the treatment. However, we may have failed to match on an important

unobserved covariate usji such that xsji = xsji′ ∀ s, j, i, i′, but possibly usji 6= usji′ . If true, the

26

probability of being exposed to treatment may not be constant within matched pairs. To explore

this possibility, we use a sensitivity analysis that imagines that before matching, student i in pair

s had a probability, πs, of being exposed to the voucher school treatment. For two matched

students in pair s, say i and i′, because they have the same observed covariates xsji = xsji′ it

may be true that πs = πs′ . However, if these two students differ in an unobserved covariate,

usji 6= usji′ , then these two students may differ in their odds of being exposed to the voucher

school treatment by at most a factor of Γ ≥ 1 such that

1Γ ≤

πs/(1− πs′)πs′/(1− πs)

≤ Γ, ∀ s, s′, with xsji = xsji′ ∀ j, i, i′. (4)

If Γ = 1, then πs = πs′ , and the randomization distribution for T is valid. If Γ > 1, then quantities

such as p-values and point estimates are unknown but are bounded by a known interval. In a

sensitivity analysis, we use several values of Γ to compute bounds on the p-value for the test

of equivalence. We then observe at which value of Γ the upper bound on the p-value exceeds

0.05. If the value of Γ is large, we can be confident that it would take a large bias from a hidden

confounder to reverse the conclusions of the study. The derivation for the sensitivity analysis as

applied to our test statistic T is in Hansen et al. (2014).

Under a test of equivalence, we may be able to reject H(δ)6= : |τ | > δ if the p-value from the test

is low. Rejecting this null, allows us to infer that the estimate treatment effect is not as large

as δ. We then apply the sensitivity analysis to understand whether this inference is sensitive to

biases from nonrandom treatment assignment. In the analysis, we observe at what value of Γ the

p-value exceeds the conventional 0.05 threshold for each test. If this Γ value is relatively large,

we can be confident that the test of equivalence is not sensitive to hidden bias from nonrandom

treatment assignment.

Hansen et al. (2014) note that sensitivity to hidden bias may vary with the choice of weights ws.

To understand whether different weights lead to different sensitivities to hidden confounders, we

27

can conduct a different sensitivity analysis for each set of weights and correct these tests using

a Bonferroni correction. However, Rosenbaum (2012b) shows that the Bonferroni correction is

overly conservative when applied sensitivity analysis. He develops an alternative multiple testing

correction based on correlations among the test statistics. Under this correction, for a given

value of Γ we conduct a sensitivity analysis using each set of weights. We then apply the multiple

testing correction from Rosenbaum (2012b) which produces a single corrected p-value for that

value of Γ.

5.5 How Much Bias Would Need to be Present to Mask a Positive

Effect of Private Voucher Schools?

We now apply the test of equivalence to both matches. In this test, the null hypothesis asserts

H(δ)6= : |τ | > δ for some specified δ > 0. Rejection of this null hypothesis provides evidence

that the effect of attending a private voucher school on test scores is less than δ. What values

should we select for δ? A number of studies in the literature have found that private voucher

schools increase test score achievement. The smallest effect size in the extant literature is 0.15

of a standard deviation (Sapelli and Vial 2002). However, among low income students the effects

may be as large as 0.5 of a standard deviation, and Sapelli and Vial (2005) find an effect size of 0.6

standard deviations. These results suggest a range of possible effects from 0.15 to 0.6 standard

deviations. To that end, we use three values for δ of 0.15, 0.30 and 0.6 standard deviations.

This allows us to test whether the point estimates in our study are equivalent to small, medium

or large voucher effects. Thus we define three values δ1, δ2, and δ3 to correspond to these three

different possible effect sizes.

We first ask whether the point estimate from the match with greater external validity is large

relative to effect sizes in the literature. Table 5 contains a summary of the test of equivalence

and sensitivity analysis to the match with greater external validity. We first assume that there

is no hidden bias such that Γ = 1. We first test ←−H (δ1)0 and find that the one-sided p-value

28

from this test is 0.008. We then test −→H (δ1)0 and we find that the one-sided p-value is 0.089.

Therefore we are unable to reject H(δ1)6= for the match with greater external validity. Thus our

point estimate from this match may be consistent with a small effect. For a larger effect size

of 0.30 standard deviations, however, we can reject H(δ2)6= with a p-value of 0.003. Thus the

estimated treatment effect is not consistent with moderate effect size. Is this inference sensitive

to bias from a confounder? We find that for Γ = 1.85, the p-value is 0.049. A bias of magnitude

Γ = 1.85 means that two matched students might differ in terms of an unobserved usji such

that one student is almost twice as like as the other to attend a private voucher school before

it would alter our conclusions. Finally, we test whether the point estimate is equivalent with a

large effect size of 0.60 standard deviations. Again, we can reject H(δ3)6= with p < .001. With a

bias of magnitude Γ = 5.71 the p-value is 0.049. Therefore, it would take a very large bias for

our conclusions about a large treatment effect to be altered.

Table 5: Sensitivity Analysis Results With Different Weights and Corrections for Multiple Testingfor the Externally Valid Match

Γ H0 : |τ0| > δ1 H0 : |τ0| > δ2 H0 : |τ0| > δ31 0.089 0.003

In sum, for both matches, we either cannot reject that the estimated effect is as large as the small-

est effects found in previous studies or that association could be easily explained by unobserved

confounding. Bias from an unobserved covariate would need to double the odds of selecting a

private voucher school to mask a moderate size effect of 0.30 standard deviations. To mask a

large effect size of 0.60 standard deviations, the bias from the unobserved founders would have

to nearly quintuple the odds of differential treatment assignment in both matches.

Table 6: Sensitivity Analysis Results With Different Weights and Corrections for Multiple Testingfor the Internally Valid Match

Γ H0 : |τ0| > δ1 H0 : |τ0| > δ2 H0 : |τ0| > δ31 0.036 0.003 0.00151.1 0.048 0.005

for covariate balance may not only require mean balance, but also other forms of distributional

balance such as fine balance, x-fine balance, and strength-k matching. In practice, this method

facilitates sensitivity analyses to hidden biases due to unobserved covariates, and it readily extends

to clustered observational studies with three or more levels of data. To our knowledge, this method

is the first application of dynamic and integer programming to observational studies.

31

References

Anand, P., Mizala, A., and Repetto, A. (2009), “Using School Scholarships to Estimate the Effect

of Government Subsidized Private Education on Academic Achievement in Chile,” Economics

of Education Review, 28, 370–381.

Arpino, B. and Mealli, F. (2011), “The specification of the propensity score in multilevel obser-

vational studies,” Computational Statistics & Data Analysis, 55, 1770–1780.

Bellman, R. (1957), Dynamic Programming, Princeton, NJ: Princeton University Press.

Bertsekas, D. P. (2005), Dynamic Programming and Optimal Control, Vol. I, Belmont, MA:

Athena Scientific.

Feller, A. and Gelman, A. (2015), “Hierarchical Models for Causal Effects,” Working Paper.

Fisher, R. A. (1935), The Design of Experiments, London: Oliver and Boyd.

Gelman, A. (2006), “Multilevel (hierarchical) modeling: what it can and cannot do,” Techno-

metrics, 48, 432–435.

Greevy, R., Lu, B., Silber, J. H., and Rosenbaum, P. R. (2004), “Optimal Multivariate Matching

Before Randomization,” Biostatistics, 5, 263–275.

Hansen, B. B. (2007), “Flexible, Optimal Matching for Observational Studies,” R News, 7, 18–24.

Hansen, B. B., Rosenbaum, P. R., and Small, D. S. (2014), “Clustered Treatment Assignments

and Sensitivity to Unmeasured Biases in Observational Studies,” Journal of the American

Statistical Association, 109, 133–144.

Hodges, J. L. and Lehmann, E. (1963), “Estimates of Location Based on Ranks,” The Annals of

Mathematical Statistics, 34, 598–611.

Hong, G. and Raudenbush, S. W. (2006), “Evaluating Kindergarten Retention Policy: A Case of

32

Study of Causal Inference for Multilevel Data,” Journal of the American Statistical Association,

101, 901–910.

Hsieh, C.-T. and Urquiola, M. (2006), “The Effects of Generalized School Choice on Achievement

and Stratification: Evidence from Chile’s voucher program,” Journal of Public Economics, 90,

1477–1503.

Hsu, J. Y., Zubizarreta, J. R., Small, D. S., and Rosenbaum, P. R. (2015), “Strong Control

of the Family-Wise Error Rate in Observational Studies that Discover Effect Modification by

Exploratory Methods,” Working Paper.

Lara, B., Mizala, A., and Repetto, A. (2011), “The Effectiveness of Privte Voucher Education:

Evidence From Structural School Switches,” Educational Evaluation and Policy Analysis, 33,

119–137.

Lee, V. E. and Bryk, A. S. (1989), “A Multilevel Model of The Social Distribution of High School

Achievement,” Sociology of Education, 62, 172–192.

Li, F., Zaslavsky, A. M., and Landrum, M. B. (2013), “Propensity score weighting with multilevel

data,” Statistics in medicine, 32, 3373–3387.

Lu, B., Greevy, R., Xu, X., and Beck, C. (2011), “Optimal Nonbipartite Matching and its Statis-

tical Applications,” The American Statistician, 65, 21–30.

McEwan, P. J. (2001), “The Effectiveness of Public, Catholic, and Non-Religious Private Schools

in Chile’s Voucher System,” Education Economics, 9, 183–219.

MINEDUC (2013), “Estadśticas de la Educación,” http://www.ministeriodesarrollosocial.gob.cl/encuesta-

post-terremoto/index.html.

Mizala, A. and Romaguera, P. (2001), “Factors Explaining Secondary Education Outcomes in

Chile,” El Trimestre Economico, 272, 515–549.

Neyman, J. (1923), “On the Application of Probability Theory to Agricultural Experiments. Essay

33

on Principles. Section 9.” Statistical Science, 5, 465–472. Trans. Dorota M. Dabrowska and

Terence P. Speed (1990).

Rosenbaum, P. R. (1984), “The Consequences of Adjusting For a Concomitant Variable That

Has Been Affected By The Treatment,” Journal of The Royal Statistical Society Series A, 147,

656–666.

— (1987), “Sensitivity Analysis for Certain Permutation Inferences in Matched Observational

Studies,” Biometrika, 74, 13–26.

— (1989), “Optimal Matching for Observational Studies,” Journal of the American Statistical

Association, 84, 1024–1032.

— (2002), Observational Studies, New York, NY: Springer, 2nd ed.

— (2003), “Exact Confidence Intervals for Nonconstant Effects by Inverting the Signed Rank

Test,” The American Statistician, 57, 132–138.

— (2006), “Differential effects and generic biases in observational studies,” Biometrika, 93,

573–586.

— (2008), “Testing hypotheses in order,” Biometrika, 95, 248–252.

— (2010), Design of Observational Studies, New York: Springer-Verlag.

— (2012a), “Optimal Matching of an Optimally Chosen Subset in Observational Studies,” Journal

of Computational and Graphical Statistics, 21, 57–71.

— (2012b), “Testing One Hypothesis Twice in Observational Studies,” Biometrika, 99, 763–774.

Rosenbaum, P. R., Ross, R. N., and Silber, J. H. (2007), “Minimum Distance Matched Sampling

with Fine Balance in an Observational Study of Treatment for Ovarian Cancer,” Journal of the

American Statistical Association, 102, 75–83.

Rosenbaum, P. R. and Silber, J. H. (2009), “Sensitivity Analysis for Equivalence and Difference in

34

an Observational Study of Neonatal Intensive Care Units,” Journal of the American Statistical

Association, 104, 501–511.

Rubin, D. B. (1974), “Estimating Causal Effects of Treatments in Randomized and Nonrandom-

ized Studies,” Journal of Educational Psychology, 6, 688–701.

Sapelli, C. and Vial, B. (2002), “The Perfomance of Private and Public Schools in the Chilean

Voucher System,” Cuadernos De Economia, 39.

— (2005), “Private vs public voucher schools in Chile: New evidence on efficiency and peer

effects,” Working Paper 289, Catholic University of Chile, Instituto de Economia.

Small, D. S., Have, T. R. T., and Rosenbaum, P. R. (2008), “Randomization Inference in a Group–

Randomized Trial of Treatments for Depression: Covariate Adjustment, Noncompliance, and

Quantile Effects,” Journal of the American Statistical Association, 103, 271–279.

Stuart, E. A. (2010), “Matching Methods for Causal Inference: A Review and a Look Forward,”

Statistical Science, 25, 1–21.

Zubizarreta, J. R. (2012), “Using Mixed Integer Programming for Matching in an Observational

Study of Kidney Failure after Surgery,” Journal of the American Statistical Association, 107,

1360–1371.

Zubizarreta, J. R., Paredes, R. D., and Rosenbaum, P. R. (2014), “Matching for Balance, Pairing

for Heterogeneity in an Observational Study of the Effectiveness of For-profit and Not-for-profit

High Schools in Chile,” Annals of Applied Statistics, 8, 204–231.

Zubizarreta, J. R., Reinke, C. E., Kelz, R. R., Silber, J. H., and Rosenbaum, P. R. (2011), “Match-

ing for Several Sparse Nominal Variables in a Case-Control Study of Readmission Following

Surgery,” The American Statistician, 65, 229–238.

35

IntroductionClustered Observational Studies with Multilevel DataVouchers and School Choice

Schools in Chile; Educational Census Data; Study DesignThe Chilean School System and Restricted Secondary School Choice Longitudinal Census of Students and SchoolsData Structure and Study Design

Dynamic and Integer Programming for Multilevel MatchingReview of Cardinality MatchingA Multistage Decision Method for Multilevel MatchingExtensions and Computation

Covariate Balance in the Matched SampleOutcome AnalysesNotation: Treatment Effects for Students in SchoolsRandomization Inference When Treatment is Assigned at the School LevelComparative Effectiveness of Public Versus Private Voucher SchoolsTest of Equivalence and Sensitivity AnalysisHow Much Bias Would Need to be Present to Mask a Positive Effect of Private Voucher Schools?

Summary and Discussion

Documents

Optimal Multilevel Matching in Clustered Observational ... · Optimal Multilevel Matching in Clustered Observational Studies: A Case Study of the School Voucher System in Chile∗