Non-Gaussian structural equation models for causal discovery

Shohei Shimizu

Osaka University, Japan


2016 Probabilistic Graphical Model Workshop:

Sparsity, Structure and High-dimensionality

References:

https://sites.google.com/site/sshimizu06/home/lingampapers


Abstract

• Estimation of the causal direction and connection strength of two observed variables in the presence of hidden common causes

• A key challenge in causal discovery

• We propose a non-Gaussian model

– It does not require us to specify the number of hidden common causes

Illustrative example

Significant correlation between chocolate consumption and number of Nobel laureates (Messerli12NEJM)

[Figure (2002-2011): scatter plot. Horizontal axis: Chocolate consumption (kg/yr/capita). Vertical axis: Num. Nobel laureates per 10 million pop. Corr. 0.791, P-value < 0.001]

Eating more chocolate increases the number of Nobel laureates??

• Interpretational Drift (Maurage+13, J. Nutrition)

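[Figure: candidate explanations for the observed correlation (Corr. 0.791, P-value < 0.001) between Chocolate and Nobel: a causal effect between Chclt and Nobel in one direction or the other, with or without a common cause such as GDP. The common cause is hidden.]

Manage this gap!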

Under what conditions can we manage this gap?

• We have shown that it is possible under the following three assumptions (Hoyer+08IJAR; Shimizu+14JMLR)

– Linearity

– Acyclicity

– Non-Gaussianity

• Performing interventions is often very hard

• The theory is closely related to independent component analysis (ICA) (Hyvarinen+01)

Many application areas

• Epidemiology: improving health and QOL. Do sleep problems cause depression mood, or the reverse? (Rosenstrom et al., 2012)

• Economics: policy evaluation. Relations among OpInc.gr, Empl.gr, Sales.gr, and R&D.gr at times t, t+1, and t+2 (Moneta et al., 2012)

• Neuroscience: causal information flow (Boukrina & Graves, 2013)

• Chemistry: what changes absorption spectra? (Campomanes et al., 2014)

Brief review of structural

causal models

Structural causal models (Pearl, 2000)

• A framework for describing causal relations

(or data generating processes)

• An example of linear cases:

$x_1 := e_1$

$x_2 := b_{21} x_1 + e_2$

[Figure: graph x1 → x2 with error terms e1 and e2. Annotation: e1 and e2 dependent]

• Generally speaking, if changing the value of 𝑥1 leads to a change in the value of 𝑥2, then 𝑥1 causes 𝑥2

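Changing the value of x1 from c to d

• Replace the function determining x1 with a constant c, denoted by do(x1=c), and then change the constant to d (Pearl, 2000)

Original model:

$x_1 = e_1$

$x_2 = b_{21} x_1 + e_2$

Intervention: do(x1=c)

$x_1 = c$

$x_2 = b_{21} x_1 + e_2$

[Figure: graph x1 → x2 with errors e1, e2 before the intervention; after the intervention, e1 is replaced by the constant c]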

Average causal effect (Rubin, 1974; Pearl, 2000)

• Average causal effect of x1 on x2 when changing x1 from c to d

– Computed based on the models with do(x1=d) and do(x1=c)

• $E(x_2 \mid do(x_1 = d)) - E(x_2 \mid do(x_1 = c)) = b_{21}(d - c)$

If you have changed the value of $x_1$ from $c$ to $d$, then $E(x_2)$ will change by $b_{21}(d - c)$
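The average causal effect can be checked numerically. The following is a minimal sketch, assuming the linear model $x_2 = b_{21} x_1 + e_2$ from the slide; the values of `b21`, `c`, `d`, the sample size, and the uniform error distribution are illustrative assumptions, not taken from the talk.

```python
import numpy as np

def simulate_x2_under_do(x1_value, b21, n, rng):
    """Sample x2 after the intervention do(x1 = x1_value).

    The equation for x1 is replaced by a constant; the equation
    x2 = b21 * x1 + e2 is left untouched.
    """
    e2 = rng.uniform(-1.0, 1.0, size=n)  # zero-mean non-Gaussian error (assumed)
    x1 = np.full(n, x1_value)
    return b21 * x1 + e2

rng = np.random.default_rng(0)
b21, c, d, n = 0.8, 1.0, 3.0, 200_000  # hypothetical values

# Monte Carlo estimate of E(x2 | do(x1=d)) - E(x2 | do(x1=c))
ace_mc = (simulate_x2_under_do(d, b21, n, rng).mean()
          - simulate_x2_under_do(c, b21, n, rng).mean())
ace_exact = b21 * (d - c)  # closed form from the slide

print(f"Monte Carlo: {ace_mc:.3f}, closed form: {ace_exact:.3f}")
```

The two numbers should agree up to sampling noise, because the error term e2 averages out in both interventional models.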

Formulating the problem

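Estimation of causal direction

• Suppose that data X was randomly generated from either of the following two models:

Model 1: $x_1 = e_1$, $x_2 = b_{21} x_1 + e_2$ $(b_{21} \neq 0)$

or

Model 2: $x_1 = b_{12} x_2 + e_1$, $x_2 = e_2$ $(b_{12} \neq 0)$

[Figure: Model 1 is the graph x1 → x2 with coefficient b21; Model 2 is the graph x1 ← x2 with coefficient b12; the errors are e1 and e2]

• Estimate which model generated the data X based on the data X only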

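Major difficulty

• Errors $e_1$ and $e_2$ are often dependent

• The regression coefficient of $x_2$ on $x_1$ is not equal to $b_{21}$ even if we know the right causal direction

Model 1: $x_1 = e_1$, $x_2 = b_{21} x_1 + e_2$ $(b_{21} \neq 0)$

or

Model 2: $x_1 = b_{12} x_2 + e_1$, $x_2 = e_2$ $(b_{12} \neq 0)$

[Figure: graphs x1 → x2 (coefficient b21) and x1 ← x2 (coefficient b12), each with dependent errors e1 and e2]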

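Hidden common causes

• Such dependency is typically introduced by hidden common causes, say $f_1$

Model 1’:

$x_1 = \underbrace{\lambda_{11} f_1 + e'_1}_{e_1}$

$x_2 = b_{21} x_1 + \underbrace{\lambda_{21} f_1 + e'_2}_{e_2}$

or

Model 2’:

$x_1 = b_{12} x_2 + \underbrace{\lambda_{11} f_1 + e'_1}_{e_1}$

$x_2 = \underbrace{\lambda_{21} f_1 + e'_2}_{e_2}$

[Figure: graphs x1 → x2 (coefficient b21) and x1 ← x2 (coefficient b12), each with hidden common cause f1 and errors e’1, e’2]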
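A small simulation makes the difficulty concrete. This is a sketch under assumed values: the coefficients `b21`, `lam11`, `lam21` and the Laplace and uniform distributions are hypothetical choices. It generates data from Model 1’ and compares the regression of x2 on x1 alone with a regression that also includes f1, which is only possible when f1 is observed.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000
b21, lam11, lam21 = 0.5, 1.0, 1.0  # hypothetical coefficients

f1 = rng.laplace(size=n)           # hidden common cause (non-Gaussian)
e1p = rng.uniform(-1, 1, size=n)   # e'_1
e2p = rng.uniform(-1, 1, size=n)   # e'_2

# Model 1': x1 -> x2 with hidden common cause f1
x1 = lam11 * f1 + e1p
x2 = b21 * x1 + lam21 * f1 + e2p

# Regression of x2 on x1 alone: biased, because e1 and e2 share f1
slope_naive = np.cov(x1, x2)[0, 1] / np.var(x1, ddof=1)

# Regression of x2 on x1 and f1 (three-variable analysis)
design = np.column_stack([x1, f1, np.ones(n)])
coef, *_ = np.linalg.lstsq(design, x2, rcond=None)
slope_adjusted = coef[0]

print(f"true b21 = {b21}, naive = {slope_naive:.3f}, adjusted = {slope_adjusted:.3f}")
```

With these settings the naive slope is far from b21 even though the causal direction is correctly specified, while the adjusted slope recovers it up to sampling noise.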

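A well-known guideline (Pearl2000; Spirtes+1993)

• Observe the hidden common cause $f_1$, incorporate it in the models, and carry out three-variable analysis

• Errors $e'_1, e'_2$ independent!

$x_1 = \lambda_{11} f_1 + e'_1$

$x_2 = b_{21} x_1 + \lambda_{21} f_1 + e'_2$

or

$x_1 = b_{12} x_2 + \lambda_{11} f_1 + e'_1$

$x_2 = \lambda_{21} f_1 + e'_2$

[Figure: graphs x1 → x2 (coefficient b21) and x1 ← x2 (coefficient b12), each with common cause f1 and errors e’1, e’2]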

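Following the guideline is often very hard

• A large number of hidden common causes $f_1, f_2, \ldots, f_Q$ may exist (Q unknown)

• Often no idea what they are

$x_1 = \sum_q \lambda_{1q} f_q + e'_1$

$x_2 = b_{21} x_1 + \sum_q \lambda_{2q} f_q + e'_2$

or

$x_1 = b_{12} x_2 + \sum_q \lambda_{1q} f_q + e'_1$

$x_2 = \sum_q \lambda_{2q} f_q + e'_2$

[Figure: graphs x1 → x2 (coefficient b21) and x1 ← x2 (coefficient b12), each with hidden common causes f1, ..., fQ and errors e’1, e’2]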

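Estimation of causal direction in the presence of hidden common causes

• Estimate which model generated the data X

$x_1 = \sum_q \lambda_{1q} f_q + e'_1$

$x_2 = b_{21} x_1 + \sum_q \lambda_{2q} f_q + e'_2$

or

$x_1 = b_{12} x_2 + \sum_q \lambda_{1q} f_q + e'_1$

$x_2 = \sum_q \lambda_{2q} f_q + e'_2$

[Figure: graphs x1 → x2 (coefficient b21) and x1 ← x2 (coefficient b12), each with hidden common causes f1, ..., fQ and errors e’1, e’2]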

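Note

• If we intervene on x1 (and x2), we have no hidden common causes

• But interventions are often difficult to carry out, for ethical reasons or because of cost

Model 1’:

$x_1 = \sum_q \lambda_{1q} f_q + e'_1$

$x_2 = b_{21} x_1 + \sum_q \lambda_{2q} f_q + e'_2$

Model 1’’:

$x_1 = c$

$x_2 = b_{21} x_1 + \sum_q \lambda_{2q} f_q + e'_2$

[Figure: in Model 1’, f1, ..., fQ are common causes of x1 and x2; in Model 1’’, x1 is set to the constant c, so f1, ..., fQ affect x2 only]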

Major challenges

1. Estimation of causal direction when temporal information is not available

[Figure: x1 → x2 or x1 ← x2?]

2. Managing hidden common causes

[Figure: x1 → x2 or x1 ← x2, with hidden common cause f1?]

Basic non-Gaussian model

(No hidden common cause)

S. Shimizu, P. O. Hoyer, A. Hyvärinen

and A. Kerminen.

Journal of Machine Learning Research,

2006.

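Independent errors

• Implying no hidden common causes

• The two models are distinguishable if the errors e1 and e2 are non-Gaussian (Dodge+00CSTM; Shimizu+06JMLR)

Model 1: $x_1 = e_1$, $x_2 = b_{21} x_1 + e_2$

or

Model 2: $x_1 = b_{12} x_2 + e_1$, $x_2 = e_2$

$(b_{12}, b_{21} \neq 0)$

[Figure: Model 1 is the graph x1 → x2 with coefficient b21; Model 2 is the graph x1 ← x2 with coefficient b12; the errors are e1 and e2]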

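Different directions give different data distributions

Model 1: $x_1 = e_1$, $x_2 = 0.8 x_1 + e_2$

Model 2: $x_1 = 0.8 x_2 + e_1$, $x_2 = e_2$

$E(e_1) = E(e_2) = 0$, $\mathrm{var}(x_1) = \mathrm{var}(x_2) = 1$

[Figure: scatter plots of x1 and x2 under Model 1 and Model 2, for Gaussian errors e1, e2 and for non-Gaussian errors e1, e2]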
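The contrast between the Gaussian and non-Gaussian cases can be reproduced numerically. The sketch below uses a simple fourth-order cumulant statistic chosen for illustration; it is not the estimation method of the talk. For standardized variables with correlation rho, if x1 causes x2 with independent errors, then rho * E[x1^3 x2 - x1 x2^3] = rho^2 (1 - rho^2) kurt(x1), so multiplying by the sign of the kurtosis gives a positive value for x1 → x2 and a negative value for x1 ← x2. The uniform errors, the sample size, and the assumption that both errors share the same kurtosis sign are assumptions of this example.

```python
import numpy as np

def standardize(v):
    return (v - v.mean()) / v.std()

def excess_kurtosis(v):
    return np.mean(standardize(v) ** 4) - 3.0

def direction_statistic(x1, x2):
    """Positive suggests x1 -> x2, negative suggests x1 <- x2.

    Assumes a linear model, independent errors (no hidden common
    causes), and errors whose excess kurtoses share the same sign.
    """
    z1, z2 = standardize(x1), standardize(x2)
    rho = np.mean(z1 * z2)
    kurt_sign = np.sign(excess_kurtosis(x1) + excess_kurtosis(x2))
    return kurt_sign * rho * np.mean(z1**3 * z2 - z1 * z2**3)

rng = np.random.default_rng(2)
n = 100_000
s3 = np.sqrt(3.0)  # uniform(-s3, s3) has variance 1

# Model 1: x1 = e1, x2 = 0.8 x1 + e2, with var(x1) = var(x2) = 1
e1 = rng.uniform(-s3, s3, n)
e2 = 0.6 * rng.uniform(-s3, s3, n)
stat_model1 = direction_statistic(e1, 0.8 * e1 + e2)

# Model 2: x2 = e2, x1 = 0.8 x2 + e1
e2 = rng.uniform(-s3, s3, n)
e1 = 0.6 * rng.uniform(-s3, s3, n)
stat_model2 = direction_statistic(0.8 * e2 + e1, e2)

# Gaussian errors: the statistic carries no directional information
g1 = rng.normal(size=n)
g2 = 0.6 * rng.normal(size=n)
stat_gauss = direction_statistic(g1, 0.8 * g1 + g2)

print(stat_model1, stat_model2, stat_gauss)
```

With non-Gaussian errors the statistic takes opposite signs under the two models; with Gaussian errors it stays near zero, which mirrors the indistinguishable Gaussian scatter plots.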

Independent Component Analysis (ICA) (Jutten & Herault, 1991; Comon, 1994)

• Observed random vector x is modeled by

$x = As$ or $x_i = \sum_{j=1}^{p} a_{ij} s_j$

where

– The mixing matrix $A = [a_{ij}]$

– The hidden variables (independent components) $s_i$ are non-Gaussian and mutually independent

• Then, A is identifiable up to permutation and scaling of the columns

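Sketch of the identifiability proof

• Different directions give different zero/non-zero patterns of the mixing matrices

– No zeros on the diagonal in the causal model

– No permutation indeterminacy

Model 1: $x_1 = e_1$, $x_2 = b_{21} x_1 + e_2$, that is, $x = As$ with

$\begin{pmatrix} x_1 \\ x_2 \end{pmatrix} = \begin{pmatrix} 1 & 0 \\ b_{21} & 1 \end{pmatrix} \begin{pmatrix} e_1 \\ e_2 \end{pmatrix}$

Model 2: $x_1 = b_{12} x_2 + e_1$, $x_2 = e_2$, that is, $x = As$ with

$\begin{pmatrix} x_1 \\ x_2 \end{pmatrix} = \begin{pmatrix} 1 & b_{12} \\ 0 & 1 \end{pmatrix} \begin{pmatrix} e_1 \\ e_2 \end{pmatrix}$

[Figure: graphs of Model 1 (x1 → x2) and Model 2 (x1 ← x2) with errors e1, e2; the zero entry of A is highlighted for each model]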
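The two mixing matrices follow from solving x = Bx + e for x, which gives A = (I - B)^{-1}. A minimal sketch, with hypothetical coefficient values, that computes A for both directions and shows where the zero falls:

```python
import numpy as np

def mixing_matrix(B):
    """For x = B x + e, return A such that x = A e, i.e. A = (I - B)^{-1}."""
    return np.linalg.inv(np.eye(B.shape[0]) - B)

b21, b12 = 0.8, 0.8  # hypothetical coefficients

B_model1 = np.array([[0.0, 0.0],
                     [b21, 0.0]])  # x1 -> x2
B_model2 = np.array([[0.0, b12],
                     [0.0, 0.0]])  # x1 <- x2

A1 = mixing_matrix(B_model1)  # zero in the upper-right entry
A2 = mixing_matrix(B_model2)  # zero in the lower-left entry

print(A1)
print(A2)
```

Because the diagonal of A is non-zero in both models, the column order returned by ICA can be fixed, and the position of the remaining zero then reveals the direction.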

Linear Non-Gaussian Acyclic Models (LiNGAM) (Shimizu+06JMLR)

$x_i = \mu_i + \sum_{j \neq i} b_{ij} x_j + e_i$

– Acyclicity

– Non-Gaussian errors $e_i$

– Independence of errors $e_i$ (no hidden common causes)

• Identifiable: Directions, coefficients, and intercepts

– Can be uniquely estimated without knowing the causal structure

[Figure: example graph over x1, x2, x3 with coefficients b21, b23, b13 and errors e1, e2, e3]
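To make the acyclicity assumption concrete, the sketch below encodes the three-variable example graph (coefficients b21, b23, b13) as a matrix B, recovers a causal order in which every variable follows its parents, and generates data from the model. The coefficient values, the Laplace errors, zero intercepts, and the helper name `causal_order` are assumptions of this example.

```python
import numpy as np

def causal_order(B):
    """Return an ordering in which each variable comes after its parents.

    B[i, j] != 0 means x_j is a parent of x_i. Raises if B is cyclic.
    """
    remaining = list(range(B.shape[0]))
    order = []
    while remaining:
        # A root has no parents among the variables not yet ordered
        roots = [i for i in remaining if all(B[i, j] == 0 for j in remaining)]
        if not roots:
            raise ValueError("B is not acyclic")
        order.append(roots[0])
        remaining.remove(roots[0])
    return order

b21, b23, b13 = 0.8, -0.5, 0.6  # hypothetical coefficients
B = np.array([[0.0, 0.0, b13],   # x1 = b13 x3 + e1
              [b21, 0.0, b23],   # x2 = b21 x1 + b23 x3 + e2
              [0.0, 0.0, 0.0]])  # x3 = e3

order = causal_order(B)           # indices 2, 0, 1: x3, then x1, then x2
B_perm = B[np.ix_(order, order)]  # strictly lower triangular in causal order

rng = np.random.default_rng(4)
n = 1000
E = rng.laplace(size=(n, 3))                 # independent non-Gaussian errors
X = E @ np.linalg.inv(np.eye(3) - B).T       # rows are samples of x = (I - B)^{-1} e

print(order)
```

Acyclicity is equivalent to the existence of such an order, which is why B can always be permuted to a strictly lower triangular matrix.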

Extensions

• Cyclic models (Lacerda+08UAI; Hyvarinen+13JMLR)

[Figure: graph over x1 and x2 with errors e1, e2]

• Time series (Hyvarinen+10JMLR; Huang+15IJCAI; Gong15ICML)

$x(t) = \sum_{\tau=0}^{k} B_\tau x(t - \tau) + e(t)$

• Nonlinearity (Zhang+09UAI; Peters+14JMLR; cf. Imoto02PSB)

$x_i = f_{i,2}^{-1}\left(f_{i,1}(\text{parents of } x_i) + e_i\right)$

• Discrete variables (Peters+11TPAMI; Park+15NIPS)

LiNGAM with hidden

common causes

P. O. Hoyer, S. Shimizu, A. Kerminen,

and M. Palviainen.

Int. J. Approximate Reasoning

2008

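LiNGAM with hidden common causes (Hoyer+08IJAR)

• Extension to incorporate non-Gaussian hidden common causes $f_q$

$x_i = \mu_i + \sum_{q=1}^{Q} \lambda_{iq} f_q + \sum_{j \neq i} b_{ij} x_j + e_i$

where $f_q$ $(q = 1, \ldots, Q)$ are independent

Two-variable case:

$x_1 = \mu_1 + \sum_{q=1}^{Q} \lambda_{1q} f_q + e_1$

$x_2 = \mu_2 + \sum_{q=1}^{Q} \lambda_{2q} f_q + b_{21} x_1 + e_2$

[Figure: graph x1 → x2 with errors e1, e2 and hidden common causes f1, f2]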

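WLG, hidden common causes $f_q$ are assumed to be independent

$x_i = \mu_i + \sum_{q=1}^{Q} \lambda_{iq} f_q + \sum_{j \neq i} b_{ij} x_j + e_i$

[Figure: x1 and x2 with errors e1, e2, shown once with dependent hidden common causes f1, f2 and once with independent hidden common causes $e_{f_1}$, $e_{f_2}$]

Dependent hidden common causes are linear combinations of independent ones:

$\begin{pmatrix} f_1 \\ f_2 \end{pmatrix} = \begin{pmatrix} a_{11} & 0 \\ a_{21} & a_{22} \end{pmatrix} \begin{pmatrix} e_{f_1} \\ e_{f_2} \end{pmatrix}$

so the independent $e_{f_1}$, $e_{f_2}$ can be taken as the hidden common causes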

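Different causal directions give different data distributions (Hoyer, Shimizu, Kerminen and Palviainen, 2008, IJAR)

• Faithfulness + number of hidden common causes “known”

x1 → x2:

$x_1 = \mu_1 + \sum_{q=1}^{Q} \lambda_{1q} f_q + e_1$

$x_2 = \mu_2 + \sum_{q=1}^{Q} \lambda_{2q} f_q + b_{21} x_1 + e_2$

or x1 ← x2:

$x_1 = \mu_1 + \sum_{q=1}^{Q} \lambda_{1q} f_q + b_{12} x_2 + e_1$

$x_2 = \mu_2 + \sum_{q=1}^{Q} \lambda_{2q} f_q + e_2$

[Figure: the two graphs, each with hidden common causes f1, ..., fQ and errors e1, e2, together with the resulting distributions of x1 and x2]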

Previous estimation approaches

• Explicitly model hidden common causes and compare two models with opposite directions of causation

– Maximum likelihood principle (Hoyer+08IJAR)

– Bayesian model selection (Henao & Winther, 2011, JMLR)

• Require us to specify the number of hidden common causes, which is difficult in general

[Figure: graphs x1 → x2 and x1 ← x2, each with hidden common causes f1, ..., fQ and errors e1, e2]

Our proposal:

a Bayesian approach

S. Shimizu and K. Bollen.

Journal of Machine Learning Research,

2014

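Key idea (1/2)

• Another look at the LiNGAM with hidden common causes:

m-th obs.: $x_2^{(m)} = \mu_2 + \sum_{q=1}^{Q} \lambda_{2q} f_q^{(m)} + b_{21} x_1^{(m)} + e_2^{(m)}$

[Figure: the graph x1 → x2 with hidden common causes f1, ..., fQ and errors e1, e2, redrawn for each observation 1, ..., m: every observation has the same coefficient b21 but its own intercept term]

Observations are generated from the LiNGAM model with possibly different intercepts $\mu_2 + \sum_{q} \lambda_{2q} f_q^{(m)}$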

Key idea (2/2)

• Include the sums of hidden common causes as the observation-specific intercepts:

m-th obs.: $x_2^{(m)} = \underbrace{\mu_2 + \sum_{q=1}^{Q} \lambda_{2q} f_q^{(m)}}_{\mu_2^{(m)}\ \text{(obs.-specific intercept)}} + b_{21} x_1^{(m)} + e_2^{(m)}$

• Do not explicitly model hidden common causes

– Necessary neither to specify the number of hidden common causes Q nor to estimate the coefficients $\lambda_{2q}$
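The reparameterization can be illustrated with simulated data. In this sketch the number of hidden common causes `Q`, the coefficients, and the error distributions are hypothetical, and the constant intercepts are set to zero for brevity. The point is that the observation-specific intercepts reproduce the data without any reference to Q, the hidden causes, or their coefficients.

```python
import numpy as np

rng = np.random.default_rng(3)
n, Q = 1000, 5                 # Q is unknown in practice (assumed here)
b21 = 0.8                      # hypothetical coefficient

lam1 = rng.normal(size=Q)      # lambda_{1q}
lam2 = rng.normal(size=Q)      # lambda_{2q}
F = rng.laplace(size=(n, Q))   # hidden common causes f_q^{(m)}, one row per observation
e1 = rng.uniform(-1, 1, size=n)
e2 = rng.uniform(-1, 1, size=n)

# LiNGAM with hidden common causes (x1 -> x2)
x1 = F @ lam1 + e1
x2 = b21 * x1 + F @ lam2 + e2

# Observation-specific intercepts: the sums of hidden common causes
mu1 = F @ lam1                 # mu_1^{(m)}, one value per observation
mu2 = F @ lam2                 # mu_2^{(m)}

# The same data written without Q, F, lam1, or lam2
x1_reparam = mu1 + e1
x2_reparam = mu2 + b21 * x1_reparam + e2

print(np.allclose(x1, x1_reparam), np.allclose(x2, x2_reparam))
```

The price of this rewriting is one extra parameter per observation and variable, which is why an informative prior on the intercepts is needed.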

Bayesian model selection

• Compare the marginal likelihoods of these two models with opposite directions

Model 3 (x1 → x2):

$x_1^{(m)} = \mu_1^{(m)} + e_1^{(m)}$

$x_2^{(m)} = \mu_2^{(m)} + b_{21} x_1^{(m)} + e_2^{(m)}$

Model 4 (x1 ← x2):

$x_1^{(m)} = \mu_1^{(m)} + b_{12} x_2^{(m)} + e_1^{(m)}$

$x_2^{(m)} = \mu_2^{(m)} + e_2^{(m)}$

• Many additional parameters $\mu_i^{(m)}$ $(i = 1, 2;\ m = 1, \ldots, n)$

– Similar to mixed models and multi-level models

– Informative prior for the observation-specific intercepts

Prior for the observation-specific intercepts

• Motivation: Central limit theorem

– Sums of independent variables tend to be more Gaussian

– The intercepts $\mu_1^{(m)}$, $\mu_2^{(m)}$ contain the sums $\sum_{q=1}^{Q} \lambda_{1q} f_q^{(m)}$, $\sum_{q=1}^{Q} \lambda_{2q} f_q^{(m)}$

• Approximate the density by a bell-shaped curve distribution

– $(\mu_1^{(m)}, \mu_2^{(m)})$ ~ t-distribution with given sds, correlation, and DOF v

• Select the hyper-parameter values that maximize the marginal likelihood

– sd for the l-th intercept chosen from {0, 0.2 sd($x_l$), ..., 1.0 sd($x_l$)}; correlation chosen from {0, 0.1, ..., 0.9}

– DOF fixed to be 6 in the experiments below
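The central limit theorem motivation can be checked with a short simulation. This sketch assumes equally weighted, uniformly distributed hidden common causes, which is an illustrative choice; it measures how the excess kurtosis of the sum shrinks toward zero (the Gaussian value) as the number of summed variables Q grows.

```python
import numpy as np

def excess_kurtosis(v):
    z = (v - v.mean()) / v.std()
    return np.mean(z ** 4) - 3.0

rng = np.random.default_rng(5)
n = 200_000

kurt_by_q = {}
for Q in (1, 5, 20):
    # Sum of Q independent uniform variables, one sum per observation
    sums = rng.uniform(-1, 1, size=(n, Q)).sum(axis=1)
    kurt_by_q[Q] = excess_kurtosis(sums)

for Q, k in kurt_by_q.items():
    print(f"Q = {Q:2d}: excess kurtosis = {k:+.3f}")
```

For independent and identically distributed summands the excess kurtosis of the sum is the single-variable value divided by Q, so a bell-shaped prior becomes a better approximation as more hidden common causes contribute.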

The chocolate data revisited

[Figure: scatter plot of Nobel against Chocolate. Corr. 0.791, P-value < 0.001]

Gaussianity rejected for both "Chocolate consumption" and "Num. Nobel laureates"

Model comparison

• No method was previously available to compare these two

Conclusions

• Estimation of causal direction in the presence of hidden common causes is a major challenge in causal discovery

• Proposed a linear non-Gaussian SEM with possibly different intercepts

– Does not require specifying the number of hidden common causes

• Future work

– Sensitivity to the choice of prior distributions

– Better estimation methods that are computationally and statistically efficient … and many others

Pairwise analysis

High-dimensional cases

• Huge number of candidate networks

• Analyze every pair of variables and integrate the results to get an entire causal ordering

• Simpler than trying all the combinations of causal orders
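One simple way to integrate pairwise results into an ordering is a topological sort. The sketch below uses Python's standard-library `graphlib` (Python 3.9+); the pairwise directions listed are hypothetical inputs, and the helper name `integrate_pairwise` is introduced only for this example. It is not necessarily the integration procedure used in the talk.

```python
from graphlib import CycleError, TopologicalSorter

def integrate_pairwise(directions):
    """Combine pairwise (cause, effect) decisions into one causal ordering.

    Raises graphlib.CycleError if the pairwise decisions contradict
    each other (for example a -> b, b -> c, c -> a).
    """
    sorter = TopologicalSorter()
    for cause, effect in directions:
        sorter.add(effect, cause)  # the effect must come after its cause
    return list(sorter.static_order())

# Hypothetical outcome of analyzing every pair among four variables
pairwise = [("x1", "x2"), ("x1", "x3"), ("x1", "x4"),
            ("x2", "x3"), ("x2", "x4"), ("x3", "x4")]

ordering = integrate_pairwise(pairwise)
print(ordering)

# Contradictory pairwise decisions are detected rather than silently ordered
try:
    integrate_pairwise([("a", "b"), ("b", "a")])
    inconsistent_detected = False
except CycleError:
    inconsistent_detected = True
```

With p variables this needs p(p-1)/2 pairwise analyses, compared with p! candidate causal orders when all orderings are tried directly.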

[Figure: pairwise analysis of x1, x2, x3, x4 with hidden common causes f1 and f3. Panel labels: Full graph; Prune redundant edges; Integrate the results]

[Figure: plots of x1 and x2 for non-Gaussian and for Gaussian e1, e2, f1]

Different zero/non-zero patterns of the mixing matrices (Hoyer+08IJAR)

• Faithfulness on 𝑥𝑖, 𝑓𝑖 + Number of 𝑓𝑖 given

[Figure: Models 1., 2., 3. over x1, x2 and hidden common cause f1, each shown with the zero/non-zero pattern (entries * and 0) of its mixing matrix A]