Shohei Shimizu
Osaka University, Japan
Non-Gaussian structural equation
models for causal discovery
2016 Probabilistic Graphical Model Workshop:
Sparsity, Structure and High-dimensionality
References:
https://sites.google.com/site/sshimizu06/home/lingampapers
Abstract
• Estimation of causal direction and
connection strength of two observed
variables in the presence of hidden
common causes
• A key challenge in causal discovery
• We propose a non-Gaussian model
– Does not require us to specify the number of hidden
common causes
Illustrative example
Significant correlation between chocolate consumption and number of Nobel laureates (Messerli12NEJM)
[Figure: scatter plot, 2002-2011. x-axis: Chocolate consumption (kg/yr/capita); y-axis: Num. Nobel laureates per 10 million pop.]
Corr. 0.791
P-value < 0.001
Eating more chocolate increases
the number of Nobel laureates??
• Interpretational Drift (Maurage+13, J. Nutrition)
[Figure: three candidate causal graphs relating Chclt, Nobel and GDP (separated by "or"), each with a hidden common cause, shown next to the scatter plot of Nobel vs. Chocolate (Corr. 0.791, P-value < 0.001)]
Manage this gap!
Under what conditions
can we manage this gap?
• We have shown that it is possible under the
three assumptions (Hoyer+08IJAR; Shimizu+14JMLR)
– Linearity
– Acyclicity
– Non-Gaussianity
• Performing interventions is often very hard
• Theory closely related to independent
component analysis (ICA) (Hyvarinen+01)
Many application areas
• Epidemiology: Sleep problems → Depression mood, or Sleep problems ← Depression mood? Improving health and QOL (Rosenstrom et al., 2012)
• Economics: OpInc.gr, Empl.gr, Sales.gr and R&D.gr at times t, t+1 and t+2. Policy evaluation (Moneta et al., 2012)
• Neuroscience: Causal information flow (Boukrina & Graves, 2013)
• Chemistry: What changes absorption spectra? (Campomanes et al., 2014)
Brief review of structural
causal models
Structural causal models (Pearl, 2000)
• A framework for describing causal relations
(or data generating processes)
• An example of linear cases:
• Generally speaking, if the value of 𝑥1 has
been changed and then that of 𝑥2 changes,
then 𝑥1 causes 𝑥2
$x_1 := e_1$
$x_2 := b_{21} x_1 + e_2$
[Figure: $x_1 \to x_2$, with errors $e_1$ and $e_2$; $e_1$ and $e_2$ dependent]
Changing the value of x1
from c to d
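• Replace the function determining x1 with a constant c, denoted by do(x1=c), and then change the constant to d (Pearl, 2000)
Original model: $x_1 = e_1$, $x_2 = b_{21} x_1 + e_2$
Intervention do(x1=c): $x_1 = c$, $x_2 = b_{21} x_1 + e_2$
[Figure: before the intervention, $e_1 \to x_1 \to x_2 \leftarrow e_2$; after do(x1=c), $e_1$ is replaced by the constant $c$]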
Average causal effect (Rubin, 1974; Pearl, 2000)
• Average causal effect of x1 on x2 when changing x1 from c to d
– Computed based on the models with do(x1=d) and do(x1=c)
• $E(x_2 \mid do(x_1 = d)) - E(x_2 \mid do(x_1 = c)) = b_{21}(d - c)$
• If you have changed the value of $x_1$ from $c$ to $d$, then $E(x_2)$ will change by $b_{21}(d - c)$
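The average causal effect can be checked numerically. The following is a minimal sketch, assuming uniform errors and hypothetical values for $b_{21}$, $c$ and $d$ (none of these come from the slides): it simulates the model under do(x1=d) and do(x1=c) and compares the difference of the means of x2 with $b_{21}(d - c)$.

```python
import random
from statistics import fmean

def simulate_x2(n, b21, do_x1=None, seed=0):
    """Draw n values of x2 from x1 := e1, x2 := b21 * x1 + e2.

    If do_x1 is given, the assignment of x1 is replaced by that constant,
    which is the intervention do(x1 = do_x1).
    """
    rng = random.Random(seed)
    values = []
    for _ in range(n):
        e1 = rng.uniform(-1.0, 1.0)  # assumed error distribution
        e2 = rng.uniform(-1.0, 1.0)  # assumed error distribution
        x1 = e1 if do_x1 is None else do_x1
        values.append(b21 * x1 + e2)
    return values

# Hypothetical values chosen only for illustration
b21, c, d = 0.8, 1.0, 3.0

mean_d = fmean(simulate_x2(50_000, b21, do_x1=d, seed=1))
mean_c = fmean(simulate_x2(50_000, b21, do_x1=c, seed=2))
ace = mean_d - mean_c  # should be close to b21 * (d - c)
```

The errors are drawn independently here for simplicity; the difference of means does not depend on that choice, because e1 no longer enters the model once x1 is fixed by the intervention.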
Formulating the problem
Estimation of causal direction
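• Suppose that data X was randomly generated from either of the following two models:
Model 1: $x_1 = e_1$, $x_2 = b_{21} x_1 + e_2$ $(b_{21} \neq 0)$
or
Model 2: $x_1 = b_{12} x_2 + e_1$, $x_2 = e_2$ $(b_{12} \neq 0)$
[Figure: Model 1 is $x_1 \to x_2$ with coefficient $b_{21}$; Model 2 is $x_1 \leftarrow x_2$ with coefficient $b_{12}$; errors $e_1$, $e_2$]
• Estimate which model generated the data X based on the data X only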
Major difficulty
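• Errors $e_1$ and $e_2$ are often dependent
• Regression coefficient of $x_2$ on $x_1$ is not equal to $b_{21}$ even if we know the right causal direction
Model 1: $x_1 = e_1$, $x_2 = b_{21} x_1 + e_2$ $(b_{21} \neq 0)$
or
Model 2: $x_1 = b_{12} x_2 + e_1$, $x_2 = e_2$ $(b_{12} \neq 0)$
[Figure: Model 1 ($x_1 \to x_2$) and Model 2 ($x_1 \leftarrow x_2$), each with dependent errors $e_1$ and $e_2$]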
Hidden common causes
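• Such dependency is typically introduced by hidden common causes, say $f_1$
Model 1': $x_1 = \underbrace{\lambda_{11} f_1 + e'_1}_{e_1}$, $x_2 = b_{21} x_1 + \underbrace{\lambda_{21} f_1 + e'_2}_{e_2}$
or
Model 2': $x_1 = b_{12} x_2 + \underbrace{\lambda_{11} f_1 + e'_1}_{e_1}$, $x_2 = \underbrace{\lambda_{21} f_1 + e'_2}_{e_2}$
[Figure: Model 1' ($x_1 \to x_2$, coefficient $b_{21}$) and Model 2' ($x_1 \leftarrow x_2$, coefficient $b_{12}$), each with hidden common cause $f_1$ and errors $e'_1$, $e'_2$]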
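The effect of a hidden common cause on ordinary regression can be illustrated with a small simulation. This is a minimal sketch under assumed settings (uniform variables with equal variances and hypothetical coefficient values $b_{21} = 0.8$, $\lambda_{11} = \lambda_{21} = 1$): data are generated from Model 1', and the least-squares coefficient of x2 on x1 is compared with $b_{21}$.

```python
import random
from statistics import fmean

def ols_slope(x, y):
    """Least-squares slope of y regressed on x (with intercept)."""
    mx, my = fmean(x), fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx

rng = random.Random(0)
n = 50_000
b21, lam11, lam21 = 0.8, 1.0, 1.0  # hypothetical values for illustration

x1, x2 = [], []
for _ in range(n):
    f1 = rng.uniform(-1.0, 1.0)        # hidden common cause (never observed)
    e1_prime = rng.uniform(-1.0, 1.0)
    e2_prime = rng.uniform(-1.0, 1.0)
    v1 = lam11 * f1 + e1_prime
    v2 = b21 * v1 + lam21 * f1 + e2_prime
    x1.append(v1)
    x2.append(v2)

slope = ols_slope(x1, x2)
# With equal variances of f1 and e1', the population slope is
# b21 + lam11 * lam21 / (lam11 ** 2 + 1), not b21.
expected_slope = b21 + lam11 * lam21 / (lam11 ** 2 + 1.0)
```

Under these assumptions the regression coefficient settles near 1.3 rather than 0.8, even though the causal direction used for the regression is the right one.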
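A well-known guideline (Pearl2000; Spirtes+1993)
• Observe the hidden common cause $f_1$, incorporate it in the models, and carry out three-variable analysis
• Errors $e'_1, e'_2$ independent!
Model 1': $x_1 = \lambda_{11} f_1 + e'_1$, $x_2 = b_{21} x_1 + \lambda_{21} f_1 + e'_2$
or
Model 2': $x_1 = b_{12} x_2 + \lambda_{11} f_1 + e'_1$, $x_2 = \lambda_{21} f_1 + e'_2$
[Figure: Model 1' ($x_1 \to x_2$) and Model 2' ($x_1 \leftarrow x_2$), each with the common cause $f_1$ and errors $e'_1$, $e'_2$]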
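Following the guideline is often very hard
• A large number of hidden common causes $f_1, f_2, \ldots, f_Q$ may exist (Q unknown)
• Often no idea what they are
Model 1': $x_1 = \sum_q \lambda_{1q} f_q + e'_1$, $x_2 = b_{21} x_1 + \sum_q \lambda_{2q} f_q + e'_2$
or
Model 2': $x_1 = b_{12} x_2 + \sum_q \lambda_{1q} f_q + e'_1$, $x_2 = \sum_q \lambda_{2q} f_q + e'_2$
[Figure: Model 1' ($x_1 \to x_2$) and Model 2' ($x_1 \leftarrow x_2$), each with hidden common causes $f_1, \ldots, f_Q$ and errors $e'_1$, $e'_2$]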
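Estimation of causal direction in the presence of hidden common causes
• Estimate which model generated the data X
Model 1': $x_1 = \sum_q \lambda_{1q} f_q + e'_1$, $x_2 = b_{21} x_1 + \sum_q \lambda_{2q} f_q + e'_2$
or
Model 2': $x_1 = b_{12} x_2 + \sum_q \lambda_{1q} f_q + e'_1$, $x_2 = \sum_q \lambda_{2q} f_q + e'_2$
[Figure: Model 1' ($x_1 \to x_2$) and Model 2' ($x_1 \leftarrow x_2$), each with hidden common causes $f_1, \ldots, f_Q$ and errors $e'_1$, $e'_2$]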
Note
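• If we intervene on x1 (and x2), we have no hidden common causes
• But interventions are often difficult to carry out for ethical and cost reasons
Model 1': $x_1 = \sum_q \lambda_{1q} f_q + e'_1$, $x_2 = b_{21} x_1 + \sum_q \lambda_{2q} f_q + e'_2$
Model 1'': $x_1 = c$, $x_2 = b_{21} x_1 + \sum_q \lambda_{2q} f_q + e'_2$
[Figure: in Model 1', $f_1, \ldots, f_Q$ affect both $x_1$ and $x_2$; in Model 1'', $x_1$ is set to the constant $c$, so $f_1, \ldots, f_Q$ affect only $x_2$]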
Major challenges
1. Estimation of causal direction when temporal information is not available
[Figure: $x_1 \to x_2$ or $x_1 \leftarrow x_2$?]
2. Managing hidden common causes
[Figure: $x_1 \to x_2$ or $x_1 \leftarrow x_2$, with a hidden common cause $f_1$?]
Basic non-Gaussian model
(No hidden common cause)
S. Shimizu, P. O. Hoyer, A. Hyvärinen
and A. Kerminen.
Journal of Machine Learning Research,
2006.
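Independent errors
• Implying no hidden common causes
• The two models are distinguishable if the errors e1 and e2 are non-Gaussian (Dodge+00CSTM; Shimizu+06JMLR)
Model 1: $x_1 = e_1$, $x_2 = b_{21} x_1 + e_2$
or
Model 2: $x_1 = b_{12} x_2 + e_1$, $x_2 = e_2$
$(b_{12}, b_{21} \neq 0)$
[Figure: Model 1 ($x_1 \to x_2$) and Model 2 ($x_1 \leftarrow x_2$) with independent errors $e_1$, $e_2$]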
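Different directions give different data distributions
Model 1: $x_1 = e_1$, $x_2 = 0.8 x_1 + e_2$
Model 2: $x_1 = 0.8 x_2 + e_1$, $x_2 = e_2$
$E(e_1) = E(e_2) = 0$, $\mathrm{var}(x_1) = \mathrm{var}(x_2) = 1$
[Figure: scatter plots of $x_1$ vs. $x_2$ under Model 1 and Model 2, for Gaussian and for non-Gaussian errors $e_1$, $e_2$]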
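The asymmetry between the two directions can be seen in a simulation. The sketch below is an illustration only, not the estimator used in the cited papers: it generates data from Model 1 with uniform errors (an assumed non-Gaussian choice), regresses in both directions, and measures how strongly the residual still depends on the regressor through a simple higher-order correlation, `corr(regressor**3, residual)`. In the correct direction the residual is independent of the regressor; in the wrong direction it is only uncorrelated.

```python
import random
from statistics import fmean

def pearson(a, b):
    """Sample Pearson correlation of two equally long sequences."""
    ma, mb = fmean(a), fmean(b)
    cov = sum((u - ma) * (v - mb) for u, v in zip(a, b))
    va = sum((u - ma) ** 2 for u in a)
    vb = sum((v - mb) ** 2 for v in b)
    return cov / (va * vb) ** 0.5

def residual_dependence(cause, effect):
    """Regress effect on cause by OLS; return |corr(cause^3, residual)|.

    The residual is uncorrelated with the regressor by construction, so the
    cubed regressor is used to pick up remaining higher-order dependence.
    """
    mc, me = fmean(cause), fmean(effect)
    slope = (sum((c - mc) * (e - me) for c, e in zip(cause, effect))
             / sum((c - mc) ** 2 for c in cause))
    resid = [(e - me) - slope * (c - mc) for c, e in zip(cause, effect)]
    cubed = [(c - mc) ** 3 for c in cause]
    return abs(pearson(cubed, resid))

rng = random.Random(0)
n = 100_000
half_width = 3 ** 0.5  # uniform(-sqrt(3), sqrt(3)) has variance 1
# Model 1: x1 = e1, x2 = 0.8 * x1 + e2, with var(x1) = var(x2) = 1
x1 = [rng.uniform(-half_width, half_width) for _ in range(n)]
x2 = [0.8 * a + 0.6 * rng.uniform(-half_width, half_width) for a in x1]

dep_forward = residual_dependence(x1, x2)   # candidate direction x1 -> x2
dep_backward = residual_dependence(x2, x1)  # candidate direction x1 <- x2
inferred = "x1 -> x2" if dep_forward < dep_backward else "x1 <- x2"
```

With Gaussian errors both dependence measures would be close to zero and the two directions could not be told apart, which is the contrast the figure shows.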
Independent Component Analysis
(ICA) (Jutten & Herault, 1991; Comon, 1994)
• Observed random vector x is modeled by
$x = As$, or $x_i = \sum_{j=1}^{p} a_{ij} s_j$
where
– The mixing matrix $A = [a_{ij}]$
– The hidden variables (independent components) $s_i$ are non-Gaussian and mutually independent
• Then, A is identifiable up to permutation and scaling of the columns
Sketch of the identifiability proof
• Different directions give different zero/non-
zero patterns of the mixing matrices
– No zeros on the diagonal in the causal model
– No permutation indeterminacy
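Model 1: $x_1 = e_1$, $x_2 = b_{21} x_1 + e_2$, i.e. $x = As$ with
$\begin{bmatrix} x_1 \\ x_2 \end{bmatrix} = \begin{bmatrix} 1 & 0 \\ b_{21} & 1 \end{bmatrix} \begin{bmatrix} e_1 \\ e_2 \end{bmatrix}$
Model 2: $x_1 = b_{12} x_2 + e_1$, $x_2 = e_2$, i.e. $x = As$ with
$\begin{bmatrix} x_1 \\ x_2 \end{bmatrix} = \begin{bmatrix} 1 & b_{12} \\ 0 & 1 \end{bmatrix} \begin{bmatrix} e_1 \\ e_2 \end{bmatrix}$
[Figure: the two graphs over $x_1$, $x_2$, $e_1$, $e_2$; the zero entry of $A$ marks the missing path from $e_2$ to $x_1$ in Model 1 and from $e_1$ to $x_2$ in Model 2]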
Linear Non-Gaussian Acyclic
Models (LiNGAM) (Shimizu+06JMLR)
• Identifiable: Directions, coefficients, and intercepts
– Can be uniquely estimated without knowing the causal
structure
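$x_i = \mu_i + \sum_{j \neq i} b_{ij} x_j + e_i$
[Figure: example graph over $x_1$, $x_2$, $x_3$ with coefficients $b_{21}$, $b_{23}$, $b_{13}$ and errors $e_1$, $e_2$, $e_3$]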
Acyclicity
Non-Gaussian errors ei
Independence of errors ei
(no hidden common causes)
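The three assumptions can be made concrete with a small data generator. This is a minimal sketch, assuming uniform errors and hypothetical coefficient values for the example graph ($x_3 \to x_1$ with $b_{13}$, $x_1 \to x_2$ with $b_{21}$, $x_3 \to x_2$ with $b_{23}$); the function name `sample_lingam` is made up for this illustration.

```python
import random
from statistics import fmean

def sample_lingam(B, mu, order, n, seed=0):
    """Draw n observations from x_i = mu_i + sum_j B[i][j] * x_j + e_i.

    order lists the variable indices from causes to effects, so that every
    parent is generated before its children (acyclicity).
    """
    rng = random.Random(seed)
    p = len(order)
    data = []
    for _ in range(n):
        x = [0.0] * p
        for i in order:
            e_i = rng.uniform(-1.0, 1.0)  # independent, non-Gaussian error
            x[i] = mu[i] + sum(B[i][j] * x[j] for j in range(p)) + e_i
        data.append(x)
    return data

# Hypothetical coefficients: B[i][j] is the effect of x_(j+1) on x_(i+1)
B = [[0.0, 0.0, 0.5],    # x1 = b13 * x3 + e1
     [0.8, 0.0, -0.4],   # x2 = mu2 + b21 * x1 + b23 * x3 + e2
     [0.0, 0.0, 0.0]]    # x3 = e3
mu = [0.0, 1.0, 0.0]
order = [2, 0, 1]        # causal order x3, x1, x2

data = sample_lingam(B, mu, order, n=20_000)
mean_x2 = fmean(row[1] for row in data)

# Acyclicity: reordering rows and columns by the causal order makes B
# strictly lower triangular.
B_perm = [[B[i][j] for j in order] for i in order]
is_strictly_lower = all(B_perm[r][c] == 0.0
                        for r in range(3) for c in range(r, 3))
```

Generating variables in causal order is possible only because the graph is acyclic; entries of `B` that are still unset contribute zero because `x` is initialised to zeros.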
Extensions
• Cyclic models (Lacerda+08UAI; Hyvarinen+13JMLR)
[Figure: $x_1$ and $x_2$ affecting each other, with errors $e_1$, $e_2$]
• Time series (Hyvarinen+10JMLR; Huang+15IJCAI; Gong15ICML)
$x(t) = \sum_{\tau=0}^{k} B_\tau x(t-\tau) + e(t)$
• Nonlinearity (Zhang+09UAI; Peters+14JMLR; cf. Imoto02PSB)
$x_i = f_{i,2}^{-1}\left(f_{i,1}(\text{parents of } x_i) + e_i\right)$
• Discrete variables (Peters+11TPAMI; Park+15NIPS)
LiNGAM with hidden
common causes
P. O. Hoyer, S. Shimizu, A. Kerminen,
and M. Palviainen.
Int. J. Approximate Reasoning
2008
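LiNGAM with hidden common causes (Hoyer+08IJAR)
• Extension to incorporate non-Gaussian hidden common causes
$x_i = \mu_i + \sum_{q=1}^{Q} \lambda_{iq} f_q + \sum_{j \neq i} b_{ij} x_j + e_i$
where $f_q$ $(q = 1, \ldots, Q)$ are independent
• Two-variable example:
$x_1 = \mu_1 + \sum_{q=1}^{Q} \lambda_{1q} f_q + e_1$
$x_2 = \mu_2 + \sum_{q=1}^{Q} \lambda_{2q} f_q + b_{21} x_1 + e_2$
[Figure: $x_1 \to x_2$ with hidden common causes $f_1$, $f_2$ and errors $e_1$, $e_2$]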
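Without loss of generality (WLG), hidden common causes are assumed to be independent
$x_i = \mu_i + \sum_{q=1}^{Q} \lambda_{iq} f_q + \sum_{j \neq i} b_{ij} x_j + e_i$
• Dependent hidden common causes $f_1$, $f_2$ can be written in terms of independent variables $e_{f_1}$, $e_{f_2}$:
$\begin{bmatrix} f_1 \\ f_2 \end{bmatrix} = \begin{bmatrix} a_{11} & 0 \\ a_{21} & a_{22} \end{bmatrix} \begin{bmatrix} e_{f_1} \\ e_{f_2} \end{bmatrix}$
[Figure: left, dependent hidden common causes $f_1$, $f_2$ generated from $e_{f_1}$, $e_{f_2}$; right, independent hidden common causes obtained by relabeling $f_1 := e_{f_1}$, $f_2 := e_{f_2}$]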
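Different causal directions give different data distributions (Hoyer, Shimizu, Kerminen and Palviainen, 2008, IJAR)
• Faithfulness + number of hidden common causes "known"
$x_1 \to x_2$: $x_1 = \mu_1 + \sum_{q=1}^{Q} \lambda_{1q} f_q + e_1$, $x_2 = \mu_2 + \sum_{q=1}^{Q} \lambda_{2q} f_q + b_{21} x_1 + e_2$
or
$x_1 \leftarrow x_2$: $x_1 = \mu_1 + \sum_{q=1}^{Q} \lambda_{1q} f_q + b_{12} x_2 + e_1$, $x_2 = \mu_2 + \sum_{q=1}^{Q} \lambda_{2q} f_q + e_2$
[Figure: the two graphs with hidden common causes $f_1, \ldots, f_Q$ and errors $e_1$, $e_2$, each with a scatter plot of $x_1$ vs. $x_2$]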
Previous estimation approaches
• Explicitly model hidden common causes and
compare two models with opposite directions of
causation
– Maximum likelihood principle (Hoyer+08IJAR)
– Bayesian model selection (Henao & Winther, 2011, JMLR)
• Require us to specify the number of hidden
common causes, which is difficult in general
[Figure: $x_1 \to x_2$ or $x_1 \leftarrow x_2$, each with hidden common causes $f_1, \ldots, f_Q$ and errors $e_1$, $e_2$]
Our proposal:
a Bayesian approach
S. Shimizu and K. Bollen.
Journal of Machine Learning Research,
2014
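Key idea (1/2)
• Another look at the LiNGAM with hidden common causes:
m-th obs.: $x_2^{(m)} = \mu_2 + \sum_{q=1}^{Q} \lambda_{2q} f_q^{(m)} + b_{21} x_1^{(m)} + e_2^{(m)}$
[Figure: the graph $x_1 \to x_2$ with hidden common causes $f_1, \ldots, f_Q$, redrawn per observation $1, \ldots, m, \ldots$: each observation has its own $x_1^{(m)}$, $x_2^{(m)}$, $e_1^{(m)}$, $e_2^{(m)}$ and intercept, and all share the coefficient $b_{21}$]
• Observations are generated from the LiNGAM model with possibly different intercepts $\mu_2 + \sum_{q=1}^{Q} \lambda_{2q} f_q^{(m)}$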
Key idea (2/2)
• Include the sums of hidden common causes as the observation-specific intercepts:
m-th obs.: $x_2^{(m)} = \mu_2 + \underbrace{\sum_{q=1}^{Q} \lambda_{2q} f_q^{(m)}}_{\text{obs.-specific intercept } \tilde{\mu}_2^{(m)}} + b_{21} x_1^{(m)} + e_2^{(m)}$
• Do not explicitly model hidden common causes
– Necessary neither to specify the number of hidden common causes Q nor to estimate the coefficients $\lambda_{2q}$
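Bayesian model selection
• Compare the marginal likelihoods of these two models with opposite directions
Model 3 ($x_1 \to x_2$):
$x_1^{(m)} = \mu_1 + \tilde{\mu}_1^{(m)} + e_1^{(m)}$
$x_2^{(m)} = \mu_2 + \tilde{\mu}_2^{(m)} + b_{21} x_1^{(m)} + e_2^{(m)}$
Model 4 ($x_1 \leftarrow x_2$):
$x_1^{(m)} = \mu_1 + \tilde{\mu}_1^{(m)} + b_{12} x_2^{(m)} + e_1^{(m)}$
$x_2^{(m)} = \mu_2 + \tilde{\mu}_2^{(m)} + e_2^{(m)}$
• Many additional parameters: the observation-specific intercepts $\tilde{\mu}_i^{(m)}$ $(i = 1, 2;\ m = 1, \ldots, n)$
– Similar to mixed models and multi-level models
– Informative prior for the observation-specific intercepts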
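Once the log marginal likelihoods of the two models are available, the comparison itself is simple arithmetic. The sketch below assumes equal prior probabilities for the two models (an assumption, not stated on the slides) and uses made-up placeholder numbers; computing the marginal likelihoods themselves is the hard part and is not shown.

```python
import math

def posterior_model_probs(log_marginals, log_priors=None):
    """Turn log marginal likelihoods into posterior model probabilities.

    Uses the log-sum-exp trick so that very negative log values do not
    underflow. Equal prior probabilities are assumed when none are given.
    """
    k = len(log_marginals)
    if log_priors is None:
        log_priors = [math.log(1.0 / k)] * k
    scores = [lm + lp for lm, lp in zip(log_marginals, log_priors)]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [w / total for w in weights]

# Placeholder log marginal likelihoods for Model 3 and Model 4.
# These numbers are invented for illustration, not results from the talk.
log_ml_model3, log_ml_model4 = -1234.5, -1237.5

probs = posterior_model_probs([log_ml_model3, log_ml_model4])
chosen = "Model 3" if probs[0] > probs[1] else "Model 4"
```

With equal priors, choosing the model with the larger posterior probability is the same as choosing the one with the larger marginal likelihood.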
Prior for the observation-specific
intercepts
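• Motivation: Central limit theorem
– Sums of independent variables tend to be more Gaussian
$\tilde{\mu}_1^{(m)} = \sum_{q=1}^{Q} \lambda_{1q} f_q^{(m)}$, $\tilde{\mu}_2^{(m)} = \sum_{q=1}^{Q} \lambda_{2q} f_q^{(m)}$
• Approximate the density by a bell-shaped curve distribution:
$(\tilde{\mu}_1^{(m)}, \tilde{\mu}_2^{(m)})^{T} \sim$ t-distribution with sd $\tau_1, \tau_2$, correlation $\gamma_{12}$, and DOF $\nu$
• Select the hyper-parameter values that maximize the marginal likelihood
– $\tau_l \in \{0, 0.2\,\mathrm{sd}(x_l), \ldots, 1.0\,\mathrm{sd}(x_l)\}$, $\gamma_{12} \in \{0, 0.1, \ldots, 0.9\}$
– DOF $\nu$ fixed to be 6 in the experiments below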
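The central limit theorem motivation can be checked numerically. This is a minimal sketch, assuming uniform hidden common causes and all coefficients equal to 1 (both assumptions made only for the illustration): it measures the excess kurtosis of the sum of Q hidden common causes, which is zero for a Gaussian variable.

```python
import random
from statistics import fmean

def excess_kurtosis(values):
    """Sample excess kurtosis (0 for a Gaussian distribution)."""
    m = fmean(values)
    m2 = fmean([(v - m) ** 2 for v in values])
    m4 = fmean([(v - m) ** 4 for v in values])
    return m4 / m2 ** 2 - 3.0

def sum_of_hidden_causes(Q, n, seed=0):
    """n draws of sum_q f_q with f_q uniform on (-1, 1), coefficients 1."""
    rng = random.Random(seed)
    return [sum(rng.uniform(-1.0, 1.0) for _ in range(Q)) for _ in range(n)]

# Excess kurtosis of the observation-specific intercept for growing Q
kurt = {Q: excess_kurtosis(sum_of_hidden_causes(Q, 50_000))
        for Q in (1, 2, 10)}
```

For independent, identically distributed summands the excess kurtosis shrinks in proportion to 1/Q, so `kurt[10]` is much closer to zero than `kurt[1]`.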
The chocolate data revisited
[Figure: scatter plot of Nobel vs. Chocolate. Corr. 0.791, P-value < 0.001]
• Gaussianity rejected for both "Chocolate consumption" and "Num. Nobel laureates"
Model comparison
• No method was previously available to compare these two
Conclusions
• Estimation of causal direction in the presence of
hidden common causes is a major challenge in
causal discovery
• Proposed a linear non-Gaussian SEM with
possibly different intercepts
– Does not require specifying the number of hidden common
causes
• Future work
– Sensitivity to the choice of prior distributions
– Better estimation methods that are computationally and
statistically efficient … and many others
Pairwise
analysis
High-dimensional cases
• Huge number of candidate networks
• Analyze every pair of variables and integrate the
results to get an entire causal ordering
• Simpler than trying all the combinations of
causal orders
[Figure: pairwise analysis of $x_1, \ldots, x_4$ (with hidden common causes $f_1$, $f_3$); integrate the results into a full graph, then prune redundant edges]
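The integration step can be sketched as a topological sort. The pairwise directions below are hypothetical placeholders (not results from the talk), and the helper name `causal_order` is made up for this illustration; how each pairwise direction is estimated is not shown.

```python
from graphlib import TopologicalSorter

def causal_order(pairwise_directions):
    """Integrate pairwise results into one causal ordering.

    pairwise_directions is an iterable of (cause, effect) pairs.
    graphlib raises CycleError if the pairwise results contradict each other.
    """
    predecessors = {}
    for cause, effect in pairwise_directions:
        predecessors.setdefault(cause, set())
        predecessors.setdefault(effect, set()).add(cause)
    return list(TopologicalSorter(predecessors).static_order())

# Hypothetical outcome of analyzing every pair of four variables
pairs = [("x1", "x2"), ("x1", "x3"), ("x1", "x4"),
         ("x2", "x3"), ("x2", "x4"), ("x3", "x4")]
order = causal_order(pairs)
```

The ordering corresponds to the full graph in the figure; redundant edges would still have to be pruned in a separate step.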
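[Figure: scatter plots of $x_1$ vs. $x_2$ for non-Gaussian and for Gaussian $e_1$, $e_2$, $f_1$]
Different zero/non-zero patterns of the mixing matrices (Hoyer+08IJAR)
• Faithfulness on $x_i$, $f_i$ + number of $f_i$ given
• Zero/non-zero patterns of the mixing matrix $A$ for the three models (rows: $x_1$, $x_2$; columns: $f_1$, $e_1$, $e_2$; * = non-zero):
– No direct effect between $x_1$ and $x_2$: $A = \begin{bmatrix} * & * & 0 \\ * & 0 & * \end{bmatrix}$
– $x_1 \to x_2$: $A = \begin{bmatrix} * & * & 0 \\ * & * & * \end{bmatrix}$
– $x_1 \leftarrow x_2$: $A = \begin{bmatrix} * & * & * \\ * & 0 & * \end{bmatrix}$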