
Stability of Linear Structural Equation Models of Causal Inference∗

Karthik Abinav Sankararaman
University of Maryland, College Park
[email protected]

Anand Louis
Indian Institute of Science, Bangalore
[email protected]

Navin Goyal
Microsoft Research, India
[email protected]

Abstract

We consider numerical stability of the parameter recovery problem in the Linear Structural Equation Model (LSEM) of causal inference. Numerical stability is essential for the recovered parameters to be reliable. A long line of work starting from Wright (1920) has focused on understanding which sub-classes of LSEM allow for efficient parameter recovery. Despite decades of study, this question is not yet fully resolved. The goal of the present paper is complementary to this line of work: we want to understand the stability of the recovery problem in the cases when efficient recovery is possible. Numerical stability of Pearl's notion of causality was first studied in Schulman and Srivastava (2016) using the concept of condition number, where they provide ill-conditioned examples. In this work, we provide a condition number analysis for the LSEM. First we prove that under a sufficient condition, for a certain sub-class of LSEM that are bow-free (Brito and Pearl (2002)), parameter recovery is numerically stable. We further prove that randomly chosen input parameters for this family satisfy the condition with a substantial probability. Hence for this family, on a large subset of the parameter space, recovery is stable. Next we construct an example of an LSEM on four vertices with unbounded condition number. We then corroborate our theoretical findings via simulations as well as real-world experiments for a sociology application. Finally, we provide a general heuristic for estimating the condition number of any LSEM instance.

∗Full version of the paper [25]

1 Introduction

Inferring causality, i.e., whether a group of events causes another group of events, is a central problem in a wide range of fields, from the natural to the social sciences. A common approach to inferring causality is randomized controlled trials (RCTs). Here the experimenter intervenes on a system of variables (often called stimulus variables) such that it is not affected by any confounders with the variables of interest (often called response variables), and observes the probability distributions on the response variables. Unfortunately, in many cases of interest, performing an RCT is either costly or impossible due to practical, ethical, or legal reasons. A common example is the age-old debate [32] on whether smoking causes cancer. In such scenarios an RCT is completely out of the question due to ethical issues. This necessitates new inference techniques.

The causal inference problem has been extensively studied in statistics and mathematics (e.g., [18, 20, 23, 24]), where decades of research have led to rich mathematical theories and a framework for conceptualizing and analyzing causal inference. One such line of work is the Linear Structural Equation Model (or LSEM for short) for formalizing causal inference (see the monograph [4] for a survey of classical results). In fact, this is among the most commonly used models of causality in the social sciences [2, 4] and some natural sciences [31]. In this model, we are given a mixed graph on n (observable) variables of the system (see Footnote 1) containing both directed and bi-directed edges (see Figure 1 for an example). We will assume that the directed edges in the mixed graph form a directed acyclic graph (DAG). A directed edge from vertex u to vertex v represents the presence of a causal effect of variable u on variable v, while the bi-directed edges represent the presence of confounding effects (modeled as noise), which we explain next (see Footnote 2).

Footnote 1: In this paper we are interested in properties for large n.

Footnote 2: We also interchangeably refer to the directed edges as observable edges, since they denote the direct causal effects, and to the bi-directed edges as unobservable edges, since they indicate the unobserved common causes.


Figure 1: Mixed Graph: Black solid edges represent causal edges. Green dotted edges represent covariance of the noise.

In the LSEM, the following extra assumption is made (see Equation (1)): the value of a variable v is determined by a (weighted) linear combination of its parents' (in the directed graph) values, added with a zero-mean Gaussian noise term (η_v). The bi-directed graph indicates dependencies between the noise variables (i.e., lack of an edge between variables u and v implies that the covariance between η_u and η_v is 0). We use Λ ∈ ℝ^{n×n} to represent the matrix of edge weights of the DAG, Ω ∈ ℝ^{n×n} to represent the covariance matrix of the Gaussian noise, and Σ ∈ ℝ^{n×n} to represent the covariance matrix of the observation data (henceforth called the data covariance matrix). Let X ∈ ℝ^{n×1} denote a vector of random variables corresponding to the observable variables in the system. Let η ∈ ℝ^{n×1} denote the vector of corresponding noises, whose covariance matrix is given by Ω. Formally, the LSEM assumes the following relationship between the random variables in X:

X = Λ^T X + η.   (1)

Rearranging Equation (1) gives X = (I − Λ)^{-T} η. From the Gaussian assumption on the noise random variable η, it follows that X is also a multivariate Gaussian, with covariance matrix given by

Σ = (I − Λ)^{-T} Ω (I − Λ)^{-1}.   (2)

In a typical setting, the experimenter estimates the joint distribution by estimating the covariance matrix Σ over the observable variables obtained from finitely many samples. The experimenter also has a causal hypothesis (represented as a mixed graph for the causal effects and the covariance among the noise variables, which in turn determines which entries of Λ and Ω are required to be 0, referred to as the zero-patterns of Λ and Ω). One then wants to solve the inverse problem of recovering Λ and Ω given Σ. This problem is solvable for some special types of mixed graphs using parameter recovery algorithms, such as the one in [12, 16].
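To make the forward direction of the model concrete, the following is a minimal numpy sketch (our own illustration, not from the paper; the function names and the small 3-vertex instance are assumptions) that builds Σ from (Λ, Ω) via Equation (2) and, alternatively, estimates it from finitely many samples drawn according to Equation (1):

```python
import numpy as np

def covariance_from_parameters(Lam, Omega):
    """Exact data covariance implied by Eq. (2): (I - Lam)^{-T} Omega (I - Lam)^{-1}."""
    n = Lam.shape[0]
    M = np.linalg.inv(np.eye(n) - Lam)  # (I - Lam)^{-1}
    return M.T @ Omega @ M

def sample_observations(Lam, Omega, num_samples, rng):
    """Draw samples of X per Eq. (1): X = Lam^T X + eta with eta ~ N(0, Omega)."""
    n = Lam.shape[0]
    eta = rng.multivariate_normal(np.zeros(n), Omega, size=num_samples)
    # Rearranging Eq. (1): (I - Lam)^T X = eta, so X = (I - Lam)^{-T} eta.
    return np.linalg.solve((np.eye(n) - Lam).T, eta.T).T

# Illustration: a 3-vertex path 1 -> 2 -> 3 with one bi-directed edge 1 <-> 3.
rng = np.random.default_rng(0)
Lam = np.array([[0.0, 0.5, 0.0], [0.0, 0.0, -0.3], [0.0, 0.0, 0.0]])
Omega = np.array([[1.0, 0.0, 0.4], [0.0, 1.0, 0.0], [0.4, 0.0, 1.0]])
Sigma = covariance_from_parameters(Lam, Omega)
X = sample_observations(Lam, Omega, 100_000, rng)
Sigma_hat = X.T @ X / len(X)  # finite-sample estimate; differs slightly from Sigma
```

The gap between Sigma_hat and Sigma is exactly the kind of data perturbation whose effect on the recovered Λ and Ω we want to bound.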

Thus a central question in the study of LSEMs is: for which mixed graphs (specified by their zero-patterns) and which values of the parameters (Λ and Ω) is the inverse problem above solvable; in other words, which parameters are identifiable? Ideally one would like all values of the parameters to be identifiable.


However, identifiability is often too strong a property to expect to hold for all parameters, and instead we are satisfied with a slightly weaker property, namely generic identifiability (GI): here we require that identifiability holds for all parameter values except for a measure-0 set (according to some reasonable measure). (The issue of identifiability is a very general one that arises in solving inverse problems in statistics and many other areas of science.) A series of works (e.g., [6, 14, 13, 16, 22]) have made progress on this question by providing various classes of mixed graphs that do allow generic identifiability. However, the general problem of generic identifiability has not been fully resolved. This problem is important since it is a version of a central question in science: what kind of causal hypotheses can be validated purely from observational data, as opposed to the situations where one can do RCT. Much of the prior work has focused on designing algorithms with the assumption that the exact joint distribution over the variables is available. However, in practice, the data is noisy and inaccurate, and the joint distribution is generated via finite samples from this noisy data.

While theoretical advances assuming the exact joint distribution have been useful, it is imperative to understand the effect of violation of these assumptions rigorously. Algorithms for identification in LSEMs are numerical algorithms that solve the inverse problem of recovering the underlying parameters from noisy and limited data (a common heuristic is the RICF algorithm [12]). Such models and associated algorithms are useful only if they solve the inverse problem in a stable fashion: if the data is perturbed by a small amount, then the recovered parameters change only by a small amount. If the recovery is unstable, then the recovered parameters are unlikely to tell us much about the underlying causal model, as they are inordinately affected by the noise. We say that robust identifiability (RI) holds for parameter values Λ and Ω if, even after perturbing the corresponding Σ to Σ′ (by a small noise), the recovered parameter values Λ′ and Ω′ are close to the original ones. It follows from the preceding discussion that to consider an inverse problem solved it is not sufficient for generic identifiability to hold; instead, we would like the stronger property of robust identifiability to hold for almost all parameter values (we call this generic robust identifiability; for now we leave the notions of "almost everywhere" and "close" informal). In addition, the problem should also be solvable by an efficient algorithm. The mixed graphs we consider all admit efficient parameter recovery algorithms. Note that GI and RI are properties of the problem and not of the algorithm used to solve the problem.

In the general context of inverse problems, the difference between GI and RI is quite common and important: e.g., in solving a system of n linear equations in n variables given by an n × n matrix M. For any reasonable distribution on n × n matrices, the set of singular matrices has measure 0 (an algebraic set of lower dimension given by det(M) = 0). Hence, M is invertible with probability 1 and GI holds. This, however, does not imply that generic robust identifiability holds: for that one needs to show that the set of ill-conditioned matrices has small measure. To understand RI for this problem, one needs to resort to analyses of the minimum singular value of M, which are non-trivial in comparison to GI (e.g., see [7, Sec 2.4]). In general, proving RI almost everywhere turns out to be harder and remains an open problem in many cases even though GI results are known. One recent example is rank-1 decomposition of tensors (for n × n × n random tensors of rank up to n²/16, GI is known, whereas generic robust identifiability is known only up to rank n^{1.5}); see, e.g., [3]. The problem of tensor decomposition is just one example of a general recent concern for RI results in the theoretical computer science literature, and there are many more examples. Generic robust identifiability remains an open problem for semi-Markovian models for which efficient GI is known, e.g., [23].

In the context of causality, the study of robust identifiability was initiated in [28], where the authors construct a family of examples in the so-called Pearl's notion of causality on semi-Markovian graphs (see e.g., [23] and Footnote 3) and show that for this family there exists an adversarial perturbation of the input which causes the recovered parameters to be drastically different (under an appropriate metric described later) from the actual set of parameters. However, this result has the following limitation. Their adversarial perturbations are carefully crafted, and this worst-case scenario can be alleviated by modifying just a few edges (in fact, just deleting some edges randomly suffices, which can also be achieved without changing the zero-pattern by choosing the parameters appropriately). This leads us to ask: how prevalent are such perturbations? Since there is no canonical notion of what a typical LSEM model is (i.e., the graph structure and the associated parameters), we will assume that the graph structure is given and the parameters are randomly chosen according to some reasonable distribution. Thus we would like to answer the following question (see Footnote 4).

Footnote 3: Unlike LSEM, this is a non-parametric model. The functional dependence of a variable on its parents' variables is allowed to be fully general; in particular, it need not be linear. This of course comes at the price of making the inference problem computationally and statistically harder.

Footnote 4: The question of understanding the condition number (a measure of stability) of LSEMs was raised in [28], though presumably in the worst-case sense of whether there are identifiable instances for which recovery is unstable. As mentioned in their work, when the model is not uniquely identifiable, the authors in [11] show an example where the uncertainty in estimating the parameters can be unbounded.

Question 1.1 (Informal). For the class of LSEMs that are uniquely identifiable, does robust identifiability hold for most choices of parameters?

The question above is informal mainly because "most choices of parameters" is not a formally defined notion. We can quantify this notion by putting a probability measure on the space of all parameters. This is what we will do.

Notation. Throughout this paper, we use the following notation. Bold fonts represent matrices and vectors.

Given matrices A, B ∈ ℝ^{n×m}, we define the relative distance, denoted Rel(A, B), as the following quantity:

Rel(A, B) := max_{1 ≤ i ≤ n, 1 ≤ j ≤ m : |A_{i,j}| ≠ 0} |A_{i,j} − B_{i,j}| / |A_{i,j}|.
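In code, the definition translates directly; a small sketch (ours, not the authors' code):

```python
import numpy as np

def rel(A, B):
    """Relative distance Rel(A, B): the maximum of |A[i,j] - B[i,j]| / |A[i,j]|
    over all entries (i, j) with A[i,j] != 0."""
    mask = A != 0
    return np.max(np.abs(A[mask] - B[mask]) / np.abs(A[mask]))
```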

In this paper we use the notion of condition number (see [7] for a detailed survey on condition numbers in numerical algorithms) to quantitatively measure the effect of perturbations of the data in the parameter recovery problem. The specific definition of condition number we use is a natural extension to matrices of the ℓ∞-condition number studied in [28].

Definition 1.2 (Relative ℓ∞-condition number). Let Σ be a given data covariance matrix and Λ be the corresponding parameter matrix. Let a γ-perturbed family of matrices be denoted by F_γ (i.e., the set of matrices Σ_γ such that Rel(Σ, Σ_γ) ≤ γ). For any Σ_γ ∈ F_γ, let the corresponding recovered parameter matrix be denoted by Λ_γ. Then the relative ℓ∞-condition number is defined as

κ(Λ, Σ) := lim_{γ→0+} sup_{Σ_γ ∈ F_γ} Rel(Λ, Λ_γ) / Rel(Σ, Σ_γ).   (3)

We confine our attention to the stability of recovery of Λ since, once Λ is recovered approximately, Ω = (I − Λ)^T Σ (I − Λ) can be approximated easily and in a stable manner. In this paper, we restrict our theoretical analyses to causal models specified by bow-free paths, which are inspired by bow-free graphs [5] (see Footnote 5). In [5] the authors show that the bow-free property of the mixed graph underlying the causal model is a sufficient condition for unique identifiability. We now define bow-free graphs and bow-free paths; Figure 2 has an example of a bow-free path.

Definition 1.3 (Bow-free graphs). Bow-free (mixed) graphs are those where, for any pair of vertices, a directed and a bi-directed edge are never present simultaneously.

Footnote 5: See footnote 4 in [5] for the significance of bow-free graphs in LSEMs.


Figure 2: Illustration of a bow-free path. Solid lines represent directed causal edges and dotted lines represent bi-directed covariance edges.

Definition 1.4 (Bow-free Paths). A causal model is called a bow-free path if the underlying DAG forms a directed path and the mixed graph is bow-free [5]. Consider n vertices indexed 1, 2, . . . , n. The (observable) DAG forms a path starting at 1, with directed edges going from vertex i to vertex i + 1. Bi-directed edges can exist between pairs of vertices (i, j) only if |i − j| ≥ 2. Thus the only potential non-zero entries in Λ are in the diagonal immediately above the principal diagonal. Similarly, the diagonal immediately above and below the principal diagonal in Ω is always zero.

We further assume the following conditions on the parameters of the bow-free paths for studying the numerical stability of any parameter recovery algorithm. Later, in Section 4, we show that a natural random process of generating Λ and Ω satisfies these assumptions with high probability. Informally, the model can be viewed as a generalization of symmetric diagonally dominant (SDD) matrices.

Model 1.5. Our model satisfies the following conditions on the data covariance matrix and the perturbation, with the underlying graph structure given by a bow-free path. It takes parameters α and λ.

1. Data properties. Σ ≻ 0 is an n × n symmetric matrix satisfying the following conditions for some 0 ≤ α ≤ 1/5 (see Footnote 6):

|Σ_{i−1,i}|, |Σ_{i,i+1}|, |Σ_{i−1,i+1}| ≤ α Σ_{i,i}   for all i ∈ [n − 1].

Additionally, the true parameters Λ corresponding to Σ satisfy the following: for every i, j such that there is a directed edge from i to j, we have 1/n² ≤ 1/λ ≤ |Λ_{i,j}| ≤ 1, where λ is a parameter.

2. Perturbation properties. For each i, j ∈ [n] and a fixed γ ≤ 1/n⁶, let ε_{i,j} = ε_{j,i} be arbitrary numbers satisfying 0 ≤ |ε_{i,j}| = |ε_{j,i}| ≤ γ |Σ_{i,j}| for all i, j, such that there exists a pair i*, j* ∈ [n] with |ε_{i*,j*}| = |ε_{j*,i*}| = γ |Σ_{i*,j*}|. Note that we eventually consider γ in the limit going to 0, hence some small but fixed γ suffices. Let Σ̃_{i,j} := Σ_{i,j} + ε_{i,j} for every pair (i, j).

Footnote 6: We note that the important part is that α is bounded by a constant independent of n. The precise constant is of secondary importance.

Remark 1.6. The covariance matrix has errors arising from three sources: (1) measurement errors, (2) numerical errors, and (3) errors due to finitely many samples. The combined effect is modeled by Item 2 (perturbation properties) of Model 1.5, which is very general, since we only care about the maximum error.

Remark 1.7. We show in the full version that the constant 1/5 is an approximation to the following. Let τ = 1 + 5nγ. Then we want [1 − (τ + 2)α − (τ + 1)α²] > 0 and α ≤ 1/4.

With the definitions and model in place, we can now pose Question 1.1 formally:

Question 1.8 (Formal). For the class of LSEMs represented by bow-free paths and Model 1.5, can we characterize the behavior of the ℓ∞-condition number?

1.1 Our Contributions

Our key contribution in this paper is a step towards understanding Question 1.1 by providing an answer to Question 1.8. In particular, we first prove that when the assumptions of Model 1.5 hold, the ℓ∞-condition number of the parameter recovery problem is upper-bounded by a polynomial in n. Formally, we prove Theorem 1.9. This implies that the loss in precision scales logarithmically, and hence we need at most O(log d) additional bits to get a precision of d bits (see Footnote 7). See [7] and [28] for further discussion on the relation between the condition number and the number of bits needed for d-bit precision in computation, as well as other implications of a small condition number.

Theorem 1.9 (Stability Result). Under the assumptions in Model 1.5, we have the following bound on the condition number for bow-free paths with n vertices:

κ(Λ, Σ) ≤ O(n²).

More specifically, the condition number is upper-bounded by κ(Λ, Σ) ≤ O(λ), and for the parameters chosen in this model this yields the bound in Theorem 1.9.

Furthermore, in Section 4 we show that a natural generative process for constructing the matrices Λ and Ω satisfies the assumptions of Model 1.5. This implies that a large class of instances are well-conditioned, and hence that ill-conditioned models are not prevalent under our generative assumption.

Footnote 7: Note that we already need O(poly(n) log n) bits to represent the graph and the associated matrices.


Moreover, as described in the model preliminaries, all data covariance matrices that are SDD and have the property that every row has at least 8 entries that are not arbitrarily close to 0 satisfy the assumptions in Model 1.5. Thus an important corollary of this theorem is that when the data covariance matrix is in this class of SDD matrices, the instance is well-conditioned.

Next we show that there exist examples of LSEMs with arbitrarily high condition number. This implies that on such examples it is unreasonable to expect any form of accurate computation. Formally, we prove Theorem 1.10. It shows that the recovery problem itself has a bad condition number, independent of the particular algorithm used for parameter recovery. This theorem follows easily using the techniques of [16], and we include it for completeness.

Theorem 1.10 (Instability Result). There exists a bow-free path of length four and a data covariance matrix Σ such that the parameter recovery problem of obtaining Λ from Σ has an arbitrarily large condition number.

We perform numerical simulations to corroborate the theoretical results in this paper. Furthermore, we consider more general graphs (e.g., cliques of paths, layered graphs) and show that a similar behavior of the condition number holds. Finally, we consider a real-world dataset used in [29] and perform a condition number analysis experimentally.

In the full version of the paper, we additionally give a general heuristic for practitioners to determine the condition number of any given instance, which may be of independent interest. This heuristic can detect bad instances with high probability.

1.2 Related work

A standard reference on Structural Causal Models is Bollen [4]. Identifiability and robust identifiability of LSEMs have been studied from various viewpoints in the literature: robustness to model misspecification, to measurement errors, etc. (e.g., Chapter 5 of Bollen [4] and [19, 26, 21]). These works are not directly related to this paper, since they focus on the identifiability problem under erroneous model specification and/or measurement. See the sub-section on measurement errors below for further details.

Identifiability. The problem of (generic) identification in LSEMs is well-studied, though still not fully understood. We give a brief overview of the known results. All works in this section assume that the data covariance matrix is exact and do not consider robustness aspects of the problem; hence they do not directly relate to the problem studied in this paper. This problem has a rich history (see [23] for some classical results), and we only give an overview of recent results. Brito and Pearl [6] gave a sufficient condition called the G-criterion, which implies linear independence of a certain set of equations. The variables involved in this set are called the auxiliary variables. The main theorem in their paper is that if for every variable there is a corresponding "auxiliary" variable, then the model is identifiable. Following this, Foygel et al. [16] introduced the notion of the half-trek criterion, which gives a graphical criterion for identifiability in LSEM. This criterion strictly subsumes the G-criterion framework. Both [6, 16] supply efficient algorithms for identification. Chen et al. [10] tackle the problem of identifying "overidentifying constraints", which leads to identifiability on a class of graphs strictly larger than those in [16] and [6]. Ernest et al. [15] consider a variant of the problem called "Partially Linear Additive SEM" (PLSEM) with Gaussian noise. In this variant, the value of a random variable X is determined both by an additive linear factor of some of its parents (called linear parents) and an additive factor of the other parents (called non-linear parents) via a non-linear function. They give a characterization of identifiability in terms of graphical conditions for this more general model. Chen [8] extends the half-trek criterion of [16] and gives a c-component decomposition (Tian [33]) based algorithm to identify a broader class of models. Drton and Weihs [14] also extend the half-trek criterion [16] to include the ancestor decomposition technique of Tian [33]. Chen et al. [9] approach the parameter identification problem for LSEM via the notion of auxiliary variables. This method subsumes all the previous works above and identifies a strictly larger set of models.

Measurement errors. Another recent line of work [27, 34] studies the problem of identification under measurement errors. Both our work and these works share the same motivation: causal inference in the real world is usually performed on noisy data, and hence it is important to understand how noise affects causal identification. However, their focus differs significantly from the question we tackle in this paper. They pose and answer the problem of causal identification when the variables used are not identical to the ones that were intended to be measured. This leads to different conditional independences and hence a different causal graph; they study the identification problem in this setting and characterize when such identification is possible. Recently, [17] looked at the sample complexity of identification in LSEM. They consider the special case when Ω is the identity matrix and give a new algorithm for identifiability with optimal sample complexity.

In this paper we are interested in the question of robust identifiability, along the lines of the aforementioned work of Schulman and Srivastava [28]. While the work of [28] was a direct inspiration for the present paper, the techniques involved are completely different, since we work with LSEMs and [28] works with the semi-Markovian causal models of Pearl. Moreover, as previously remarked, another important difference between [28] and our work is that our main result is positive: the parameter recovery problem is well-conditioned for most choices of the parameters, for a well-defined notion of "most choices".

2 Bow-free Paths — Stable Instances

In this section, we show that for bow-free paths under Model 1.5, the condition number is small. More precisely, we prove Theorem 1.9.

To prove the main theorem, we set up some helper lemmas. Using Foygel et al. [16], we obtain the following recurrence to compute the values of Λ from the correlation matrix Σ. The base case is Λ_{1,2} = Σ_{1,2} / Σ_{1,1}. For every i ≥ 2 we have

Λ_{i,i+1} = (−Λ_{i−1,i} Σ_{i−1,i+1} + Σ_{i,i+1}) / (−Λ_{i−1,i} Σ_{i−1,i} + Σ_{i,i}).   (4)

We first show that the assumptions of Model 1.5 imply that the parameters recovered from the perturbed instance are not too large.

Lemma 2.1. For each i ∈ [n − 1], we have |Λ̃_{i,i+1}| ≤ τ_i (:= τ_{i−1} + 5γ) ≤ τ (:= 1 + 5nγ).

Next, we show that the relative distance between the true parameter Λ and the parameter Λ̃ recovered from the perturbed instance is not too large.

Lemma 2.2. Let τ = 1 + 5nγ. Define

β_c := [(3 + 3τ)α + (τ + 1)] / [1 − (τ + 2)α − (τ + 1)α² − 4nγ].

Then for each i ∈ [n − 1] we have

|Λ_{i,i+1} − Λ̃_{i,i+1}| ≤ β_c · γ / (1 − γ).

2.1 Proof of Theorem 1.9

We are now ready to prove the main Theorem 1.9. From Lemma 2.2 and the model assumptions, we have

Rel(Λ, Λ̃) = max_i |Λ_{i,i+1} − Λ̃_{i,i+1}| / |Λ_{i,i+1}| ≤ β_c · (γ / (1 − γ)) · λ.

From the perturbation properties in Model 1.5 and the definition of i*, j*, we have

Rel(Σ, Σ̃) = γ |Σ_{i*,j*}| / |Σ_{i*,j*}| = γ.

Therefore,

Rel(Λ, Λ̃) / Rel(Σ, Σ̃) ≤ λ β_c / (1 − γ).

This implies that

lim_{γ→0+} Rel(Λ, Λ̃) / Rel(Σ, Σ̃) ≤ λ β_c ≤ O(λ) ≤ O(n²).

The second-to-last inequality used the fact that β_c ≤ O(1), and the last inequality used the fact that 1/λ ≥ Ω(n^{−2}).

3 Instability Example

In this section, we show that there exist simple bow-free path examples where the condition number can be arbitrarily large. We consider a point on the measure-0 set of unidentifiable parameters and apply a small perturbation (of magnitude ε). The resulting instance in the parameter space is identifiable (by using [16]). We then apply a perturbation (of magnitude γ) to this instance and compute the condition number. By making ε arbitrarily close to 0, we obtain as large a condition number as desired. Any point on the singular variety could be used for this purpose, and the proof of Theorem 1.10 provides a concrete instance. The example we construct is as follows. Consider a path of four vertices. Fix a small value ε. Define the matrices Ω and Λ as follows:

Ω =
[ 1    0    1/2    1/2 ]
[ 0    1    0      1/2 ]
[ 1/2  0    1 + ε  0   ]
[ 1/2  1/2  0      1   ]

and Λ_{1,2} = √2, Λ_{2,3} = −√2, Λ_{3,4} = 1/2.

Thus, we have the following:

(I − Λ)^{-1} =
[ 1   √2   −2    −1    ]
[ 0   1    −√2   −1/√2 ]
[ 0   0    1     1/2   ]
[ 0   0    0     1     ]

Multiplying with Ω, we get that (I − Λ)^{-T} Ω is

[ 1     0            1/2   1/2              ]
[ √2    1            1/√2  1/√2 + 1/2       ]
[ −3/2  −√2          ε     −1 − 1/√2        ]
[ −1/4  1/2 − 1/√2   ε/2   (1/2)(1 − 1/√2)  ]

Therefore, the resulting matrix Σ, obtained as (I − Λ)^{-T} Ω (I − Λ)^{-1}, is

[ 1     √2             −3/2               −1/4              ]
[ √2    3              −5/√2              1/2 − 3/(2√2)     ]
[ −3/2  −5/√2          5 + ε              3/2 − 1/√2 + ε/2  ]
[ −1/4  1/2 − 3/(2√2)  3/2 − 1/√2 + ε/2   5/4 − 1/√2 + ε/4  ]

Perturb every entry of Σ additively by γ to obtain Σ̃. This ensures that Rel(Σ, Σ̃) = 4γ, since all entries of Σ are at least 1/4 in absolute value. We now show that in the reconstruction of Λ̃ from Σ̃, the entry Λ̃_{3,4} → 1. This implies that the condition number is κ(Λ, Σ) = Ω(1/γ), and since γ can be made arbitrarily close to 0, the condition number is unbounded.


The denominator in the expression for Λ̃_{3,4} in (4) is

−Λ̃_{2,3} Σ̃_{2,3} + Σ̃_{3,3} = ε + (1 + √2)γ.

Likewise, the numerator in the expression for Λ̃_{3,4} evaluates to

−Λ̃_{2,3} Σ̃_{2,4} + Σ̃_{3,4} = ε/2 + (1 + √2)γ.

Therefore, when ε → 0 we have Λ̃_{3,4} → 1, and hence Rel(Λ, Λ̃) = O(1), bounded away from 0.
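This blow-up is easy to reproduce numerically. The sketch below (our own; it re-implements recurrence (4) inline) fixes a small additive perturbation γ and drives ε → 0; the recovered last edge weight drifts from the true value 1/2 toward 1 once ε falls below γ:

```python
import numpy as np

def recover_lambda_path(S):
    """Recurrence (4) from Section 2, 0-indexed."""
    n = S.shape[0]
    lam = np.zeros(n - 1)
    lam[0] = S[0, 1] / S[0, 0]
    for i in range(1, n - 1):
        lam[i] = (-lam[i - 1] * S[i - 1, i + 1] + S[i, i + 1]) / \
                 (-lam[i - 1] * S[i - 1, i] + S[i, i])
    return lam

gamma = 1e-8  # fixed additive perturbation applied to every entry of Sigma
for eps in [1e-5, 1e-7, 1e-9, 1e-11]:
    Omega = np.array([[1.0, 0.0, 0.5, 0.5],
                      [0.0, 1.0, 0.0, 0.5],
                      [0.5, 0.0, 1.0 + eps, 0.0],
                      [0.5, 0.5, 0.0, 1.0]])
    Lam = np.zeros((4, 4))
    Lam[0, 1], Lam[1, 2], Lam[2, 3] = np.sqrt(2), -np.sqrt(2), 0.5
    M = np.linalg.inv(np.eye(4) - Lam)
    Sigma = M.T @ Omega @ M                      # exact covariance, Eq. (2)
    lam_tilde = recover_lambda_path(Sigma + gamma)
    print(eps, lam_tilde[2])  # approaches 1 (not 0.5) as eps -> 0 with gamma fixed
```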

4 When do random instances satisfy the model assumptions?

In this section, we prove theoretically that randomly generated Λ and Ω satisfy Model 1.5 for a natural generative process, albeit with slightly weaker constants.

Generative model for LSEM instances. We consider the following generative model for LSEM instances. Λ ∈ ℝ^{n×n} is generated by choosing each non-zero entry to be an independent sample from the uniform distribution U[−h, h]. Ω ∈ ℝ^{n×n} is chosen by first sampling unit vectors v_1, v_2, . . . , v_n ∈ ℝ^d from a d-dimensional unit sphere, such that v_i is a uniform sample in the subspace perpendicular to v_{i−1} for every i (i.e., ⟨v_i, v_{i−1}⟩ = 0), and then letting Ω_{i,j} = ⟨v_i, v_j⟩. Note that in the scenario when Ω need not follow a specified zero-pattern, first generating vectors u_1, u_2, . . . , u_n independently and uniformly from a d-dimensional unit sphere and then setting Ω_{i,j} = ⟨u_i, u_j⟩ gives a uniform distribution over PSD matrices. Hence, the above generative procedure is a natural extension for randomly generating PSD matrices with a specified zero-pattern.
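A sketch of this generative process (our own implementation; the default choice d = n is an assumption, since the text leaves the ambient dimension d unspecified):

```python
import numpy as np

def sample_instance(n, h, rng, d=None):
    """Sample (Lam, Omega) for a bow-free path per the generative model.

    Lam: each path edge weight is uniform on [-h, h].
    Omega: Gram matrix of unit vectors v_0, ..., v_{n-1}, where each v_i is
    uniform on the unit sphere orthogonal to v_{i-1}, enforcing the
    bow-free-path zero pattern Omega[i, i+1] = 0."""
    d = n if d is None else d
    Lam = np.zeros((n, n))
    for i in range(n - 1):
        Lam[i, i + 1] = rng.uniform(-h, h)
    V = np.zeros((n, d))
    v = rng.standard_normal(d)
    V[0] = v / np.linalg.norm(v)
    for i in range(1, n):
        v = rng.standard_normal(d)
        v -= (v @ V[i - 1]) * V[i - 1]   # project out the previous vector
        V[i] = v / np.linalg.norm(v)     # uniform on the sphere orthogonal to v_{i-1}
    return Lam, V @ V.T
```

Projecting a standard Gaussian vector onto the hyperplane orthogonal to v_{i−1} and normalizing gives a uniform sample from the unit sphere of that subspace, which is exactly the constraint ⟨v_i, v_{i−1}⟩ = 0.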

4.1 Regime where the generative model satisfies the Model 1.5 properties

First, we prove in Theorem 4.1 below that the random generative process satisfies the bound α ≤ 1/5.

Theorem 4.1. Consider Λ and Ω generated using the above random model with 0 ≤ h < 1/√2. Then there exist 0 ≤ ζ_h ≤ h + o(1) and χ_h ≥ Ω(n^{−2}) such that for every sufficiently large n, there exist 0 ≤ δ_{1,n} < 1/2 and 0 ≤ δ_{2,n} < 1/2 such that

Pr[∀i: |Σ_{i−1,i}|, |Σ_{i,i+1}|, |Σ_{i−1,i+1}| ≤ ζ_h Σ_{i,i}] ≥ 1 − δ_{1,n},   (5)

Pr[∀i: |Λ_{i,i−1}| ≥ χ_h] ≥ 1 − δ_{2,n}.   (6)

Moreover, we have that lim_{n→∞} δ_{1,n} = 0 and lim_{n→∞} δ_{2,n} = 0. Thus, the above statements hold with high probability.

Remark 4.2. Note that in this theorem, the constant ζ_h for α is at most h. When 0 ≤ h < 1/√2, this is at most 1/√2, where the maximum is approached as h tends to 1/√2. Moreover, when h < 0.2, we have h ≤ 1/5. Therefore, this satisfies the exact requirement in the data properties of Model 1.5 when h < 0.2, and in the regime 0.2 ≤ h < 1/√2 it satisfies the property for a slightly larger value of α, namely α ≤ 1/√2. Nonetheless, in Section 5 we show experimentally that even in this regime the instances are well-conditioned.
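Inequality (5) is easy to test empirically on sampled instances: compute Σ via Equation (2) and report the smallest α that the data property of Model 1.5 tolerates. A sketch (ours, not the authors' code):

```python
import numpy as np

def empirical_alpha(Sigma):
    """Smallest alpha for which the data property of Model 1.5 holds, i.e. the
    maximum over i of max(|S[i-1,i]|, |S[i,i+1]|, |S[i-1,i+1]|) / S[i,i]."""
    n = Sigma.shape[0]
    return max(
        max(abs(Sigma[i - 1, i]), abs(Sigma[i, i + 1]), abs(Sigma[i - 1, i + 1]))
        / Sigma[i, i]
        for i in range(1, n - 1)
    )
```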

4.2 What happens to the random model when h ≥ 1?

We now briefly show that when h ≥ 1, the conditions of Model 1.5 do not hold. More specifically, we can show that the value of α is larger than 1 with non-negligible probability. Note that the following theorem implies that in this regime we cannot expect ζ_h < 1.

Theorem 4.3. For every h = 1 + z, where z ≥ 0 is a constant, and every large enough n, there exists an i ∈ [n] such that, for a constant 0 < C_z ≤ 1 (dependent on z), we have

Pr[|Σ_{i,i+1}| ≥ Σ_{i,i}] ≥ C_z.   (7)

5 Experiments

In this section, we describe our numerical and real-world experimental results. For the purposes of this section, we define the quantity randomized condition number as follows. Consider a given data correlation matrix Σ and the corresponding parameter matrix Λ. Let Σ_ε be a matrix obtained by adding an independent N(0, ε²) random variable to each entry in Σ, and let Λ̃ be the corresponding parameter matrix. Then the randomized condition number is

E[Rel(Λ, Λ̃) / Rel(Σ, Σ_ε)].

We use the randomized condition number as a proxy for studying the ℓ∞-condition number.

5.1 Good instances

We start by showing that most random instances have a small randomized condition number (i.e., the instances are stable). In particular, a large fraction of random instances satisfies the conditions of Model 1.5, and hence the stability proof directly implies a low condition number on these instances.

The first experiment is as follows. We start with an instance of (Λ, Ω) and the bow-free path. We construct the matrix Σ using the recurrence (4). We then plot the values of |Σ_{i,i}|, |Σ_{i,i+1}|, |Σ_{i−1,i}|, |Σ_{i−1,i+1}|. Λ and Ω are generated exactly as described in the generative model discussed in Section 4, with h = 1. Figure 3 shows the plots for three random runs. Note that in all three cases the data properties in Model 1.5 are satisfied.

Next, we explicitly analyze the effect of random perturbations on these instances. As predicted by our theory, the instances are fairly stable. We do this as follows. Given a (Λ, Ω) pair, we can generate Σ in two ways: we can either (1) use the recurrence (4), which can introduce numerical precision errors in Σ, or (2) generate samples of X (observational data) and estimate Σ by taking the average over samples of XX^T, which can introduce sampling errors in Σ.

We add an independent N(0, ε²) random variable to each non-zero entry of the obtained Σ (the perturbations). The value of ε we chose for this experiment is 10^{−6}. We then recover Λ̃ from this perturbed Σ using the recurrence (4). Hence, the "noise" entering the system comes from two sources, namely sampling errors and perturbations. Figure 4 shows the effect of both sampling and perturbations.

In the full version we consider additional experiments. First, we show that sampling errors and perturbations affect the randomized condition number equally. Moreover, we consider general DAGs and show that similar results hold even for more general graphs.

5.2 Bad Instances

In the next experiment, we start with the bad example on a bow-free path of length 4. We show that, as predicted by the theory, this instance is highly unstable. We then perturb this instance slightly and show that in a small enough ball around this instance, it continues to remain unstable. Finally, we plot a graph showing the variation of the randomized condition number as a function of the radius of the ball around this instance (see the next paragraph for the precise definition).

In particular, we start with the bad Λ and Ω. We then consider a region around these matrices as follows. To Λ (and likewise to Ω) we add an independent N(0, ε²) random variable to each non-zero entry. We then consider the effect of random perturbations starting from this new (Λ, Ω) pair. Figure 5 shows the effect of random perturbations in the region around the matrix Λ, and Figure 6 shows the same in the region around Ω. As evident from the figures, the randomized condition number continues to remain large even when slightly perturbed. Hence, there are infinitely many pairs (Λ, Ω) which produce very large condition numbers.

5.3 Real-world data

In this experiment, we analyze a sociology dataset on 6 vertices, taken from a public Internet repository [1]. It was used in [29], whose model closely resembles our LSEM, with the exception that we consider Gaussian noise while they do not. Nonetheless, we experiment with the matrices returned by their algorithm and compute the randomized condition number.

The dataset and pre-processing are the same as in [29] (see Footnote 8). Note that the causal graph given by domain experts is bow-free; therefore, by the theorem in [5], this graph is identifiable. We construct the Σ matrix from the observational data. We then use the algorithm of [16] to recover the parameter Λ. We then perturb the data as follows: to each entry in the observational data, we add independent additive N(0, ε²) noise. Compared to the magnitudes of the actual data, this additive noise is insignificant. We recompute Σ from this perturbed observational data. Using the algorithm of [16], we once again recover the parameter Λ̃. For a given ε, we run 100 independent runs and take the average. Figure 7 shows the variation of the randomized condition number as a function of ε. As the value of ε becomes very small, the randomized condition number remains almost constant and approaches 10^{−1}. This is very close to our well-behaved instances, implying that this dataset is fairly robust when modeled as an LSEM.

6 Conclusion and Future Directions

In this paper, we initiate the study of condition number analysis for LSEMs. We build theory and experiments to analyze the condition number for a class of instances. Further, we give a heuristic that can be used in practice to identify whether a given instance is well-conditioned. This work opens a number of future directions. An immediate direction is to extend our understanding to other instances: in particular, can we bound the condition number for every bow-free graph? More generally, a series of recent works has extended the identifiability criteria beyond bow-free graphs. Can we analyze the condition number of these larger classes of instances? Another direction is to compute the condition number algorithmically. This can be approached either by proving the correctness of the heuristic proposed in this paper or by coming up with a new algorithm. What happens for causal models other than LSEMs? In particular, are the ill-conditioned instances rare, in a well-defined sense, for Pearl's semi-Markovian causal model?

Footnote 8: Obtained via a private communication with the authors of [29].


Figure 3: Local values of Σ for three random runs. x-axis: value of i (0-indexed); y-axis: value of Σ. Thus, a point on the red line with x-axis label 2 represents the value Σ_{2,3}.

Figure 4: Both sampling and perturbation errors. (First) The actual values of Λ. (Second) The values of Λ̃ obtained from the perturbed Σ. (Third) The difference between the two values of Λ. (Fourth) Histogram of the randomized condition number over various runs. (First three plots) x-axis: index of Λ_{i,i+1} (0-indexed); y-axis: value of Λ.

Figure 5: Randomized condition number in a small region around the bad Λ. Scale of y-axis: 10^{10}.

Figure 6: Randomized condition number in a small region around the bad Ω. Scale of y-axis: 10^{10}.

Figure 7: Variation of the randomized condition number as a function of the perturbation error on the sociology dataset.

Finally, our results in this paper can be summarized as suggesting that while arbitrarily ill-conditioned instances exist, they are rare and unlikely to arise in practice. Moreover, one may check how well-conditioned the instance at hand is. However, this statement depends on the underlying probability distribution on the instances. While our probability distribution is natural, it is desirable to achieve a more "robust" result. Smoothed analysis [30] seems well-suited for this purpose. In particular, one could begin with a smoothed analysis of the ill-conditioned instances identified in the present paper.

Acknowledgements

The authors thank Amit Sharma, Ilya Shpitser and Piyush Srivastava for useful discussions on causality, and Shohei Shimizu for providing us with the sociology dataset.

Part of this work was done when KAS was visiting the Indian Institute of Science, Bangalore. KAS was supported in part by NSF Awards CNS 1010789, CCF 1422569, CCF-1749864 and research awards from Adobe, Amazon, and Google. Anand Louis was supported in part by SERB Award ECR/2017/003296. AL is also grateful to Microsoft Research for supporting this collaboration.


References

[1] General Social Survey. http://gss.norc.org/Get-The-Data, 2018.

[2] Peter M. Bentler and David G. Weeks. Linear structural equations with latent variables. Psychometrika, 45(3):289–308, 1980.

[3] Aditya Bhaskara, Moses Charikar, Ankur Moitra, and Aravindan Vijayaraghavan. Smoothed analysis of tensor decompositions. In Symposium on Theory of Computing, STOC 2014, pages 594–603, 2014.

[4] Kenneth A. Bollen. Structural Equations with Latent Variables. Wiley-Interscience, 1989.

[5] Carlos Brito and Judea Pearl. A new identification condition for recursive models with correlated errors. Structural Equation Modeling, 9(4):459–474, 2002.

[6] Carlos Brito and Judea Pearl. Graphical condition for identification in recursive SEM. In Proceedings of the Twenty-Second Conference on Uncertainty in Artificial Intelligence (UAI), pages 47–54, 2006.

[7] Peter Bürgisser and Felipe Cucker. Condition: The Geometry of Numerical Algorithms, volume 349. Springer Science & Business Media, 2013.

[8] Bryant Chen. Identification and overidentification of linear structural equation models. In Advances in Neural Information Processing Systems (NIPS), pages 1587–1595, 2016.

[9] Bryant Chen, Daniel Kumor, and Elias Bareinboim. Identification and model testing in linear structural equation models using auxiliary variables. In International Conference on Machine Learning (ICML), pages 757–766, 2017.

[10] Bryant Chen, Jin Tian, and Judea Pearl. Testable implications of linear structural equation models. In Association for the Advancement of Artificial Intelligence (AAAI), pages 2424–2430, 2014.

[11] N. Cornia and J. M. Mooij. Type-II errors of independence tests can lead to arbitrarily large errors in estimated causal effects: An illustrative example. In CEUR Workshop Proceedings, volume 1274, pages 35–42, 2014.

[12] Mathias Drton, Michael Eichler, and Thomas S. Richardson. Computing maximum likelihood estimates in recursive linear models with correlated errors. Journal of Machine Learning Research, 10(Oct):2329–2348, 2009.

[13] Mathias Drton, Rina Foygel, and Seth Sullivant. Global identifiability of linear structural equation models. The Annals of Statistics, pages 865–886, 2011.

[14] Mathias Drton and Luca Weihs. Generic identifiability of linear structural equation models by ancestor decomposition. Scandinavian Journal of Statistics, 43(4):1035–1045, 2016.

[15] Jan Ernest, Dominik Rothenhäusler, and Peter Bühlmann. Causal inference in partially linear structural equation models: identifiability and estimation. arXiv preprint arXiv:1607.05980, 2016.

[16] Rina Foygel, Jan Draisma, and Mathias Drton. Half-trek criterion for generic identifiability of linear structural equation models. Annals of Statistics, 40(3):1682–1713, 2012.

[17] Asish Ghoshal and Jean Honorio. Learning linear structural equation models in polynomial time and sample complexity. In International Conference on Artificial Intelligence and Statistics, pages 1466–1475, 2018.

[18] Isabelle Guyon, Dominik Janzing, and Bernhard Schölkopf. Causality: objectives and assessment. In Causality: Objectives and Assessment, pages 1–42, 2010.

[19] Gregory R. Hancock. Fortune cookies, measurement error, and experimental design. Journal of Modern Applied Statistical Methods, 2(2):293, 2003.

[20] Paul W. Holland, Clark Glymour, and Clive Granger. Statistics and causal inference. ETS Research Report Series, 1985(2), 1985.

[21] Xianzheng Huang, Leonard A. Stefanski, and Marie Davidian. Latent-model robustness in structural measurement error models. Biometrika, 93(1):53–64, 2006.

[22] Roderick P. McDonald. What can we learn from the path equations?: Identifiability, constraints, equivalence. Psychometrika, 67(2):225–249, 2002.

[23] Judea Pearl. Causality. Cambridge University Press, 2009.

[24] J. Peters, D. Janzing, and B. Schölkopf. Elements of Causal Inference: Foundations and Learning Algorithms. Adaptive Computation and Machine Learning Series. The MIT Press, Cambridge, MA, USA, 2017.

[25] Karthik Abinav Sankararaman, Anand Louis, and Navin Goyal. Stability of linear structural equation models of causal inference. CoRR, abs/1905.06836, 2019.

[26] Albert Satorra. Robustness issues in structural equation modeling: A review of recent developments. 24:367–386, 1990.

[27] Richard Scheines and Joseph Ramsey. Measurement error and causal discovery. In CEUR Workshop Proceedings, volume 1792, page 1. NIH Public Access, 2016.

[28] Leonard J. Schulman and Piyush Srivastava. Stability of causal inference. In Uncertainty in Artificial Intelligence (UAI), 2016.

[29] Shohei Shimizu, Takanori Inazumi, Yasuhiro Sogawa, Aapo Hyvärinen, Yoshinobu Kawahara, Takashi Washio, Patrik O. Hoyer, and Kenneth Bollen. DirectLiNGAM: A direct method for learning a linear non-Gaussian structural equation model. Journal of Machine Learning Research, 12(Apr):1225–1248, 2011.

[30] Daniel A. Spielman and Shang-Hua Teng. Smoothed analysis of algorithms: Why the simplex algorithm usually takes polynomial time. Journal of the ACM (JACM), 51(3):385–463, 2004.

[31] Peter Spirtes. Introduction to causal inference. Journal of Machine Learning Research, 11(May):1643–1662, 2010.

[32] Luther Terry. Smoking and Health. The Reports of the Surgeon General, 1964.

[33] Jin Tian. PhD thesis, 2012.

[34] Kun Zhang, Mingming Gong, Joseph Ramsey, Kayhan Batmanghelich, Peter Spirtes, and Clark Glymour. Causal discovery in the presence of measurement error: Identifiability conditions. arXiv preprint arXiv:1706.03768, 2017.
