Page 1: DAG discovery - Network Analysis 2017

DAG discovery
Network Analysis 2017

Sacha Epskamp

04-12-2017

Page 2: DAG discovery - Network Analysis 2017

Last week

• Regularization controls for spurious connections
  • LASSO regularization
  • EBIC model selection
• Bootstrap methods assess accuracy and stability of results
  • Non-parametric bootstrap
  • Case-drop bootstrap
• Comparing networks takes three steps
  • Visually inspect; correlate weights; permutation test (NetworkComparisonTest)
• Non-normal data
  • Non-paranormal transformation
  • Polychoric correlations

Page 3: DAG discovery - Network Analysis 2017

Bootnet estimation

Page 4: DAG discovery - Network Analysis 2017

Directed Acyclic Graphs

Page 5: DAG discovery - Network Analysis 2017

Building blocks of a DAG

Common cause: A ← B → C
Example: Disease (B) causes two symptoms (A and C).
A ⊥̸⊥ C    A ⊥⊥ C | B

Chain: A → B → C
Example: Insomnia (A) causes fatigue (B), which in turn causes concentration problems (C).
A ⊥̸⊥ C    A ⊥⊥ C | B

Collider: A → B ← C
Example: Difficulty of class (A) and intelligence of student (C) cause grade on a test (B).
A ⊥⊥ C    A ⊥̸⊥ C | B
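
A quick simulation makes the collider case concrete. A minimal base-R sketch (not from the slides; the coefficients are arbitrary):

set.seed(1)
n <- 10000
A <- rnorm(n)                                # independent causes
C <- rnorm(n)
B <- A + C + rnorm(n)                        # common effect (collider)

cor(A, C)                                    # ~ 0: marginally independent
cor(resid(lm(A ~ B)), resid(lm(C ~ B)))      # clearly nonzero: dependent given B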

Page 6: DAG discovery - Network Analysis 2017

To identify whether two variables (e.g., B and F) are conditionally independent given a third (e.g., C) or a set of multiple variables:

• List all paths between the two variables (ignoring edge direction)
• For each path, check if the variable conditioned on is:
  • the middle node in a chain or common-cause structure, or
  • not the middle node (common effect) in a collider structure, nor an effect of such a common effect
• If so, then the path is blocked
• If all such paths are blocked, the two variables are d-separated and thus conditionally independent
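
Such checks can also be automated. A minimal sketch using bnlearn's dsep() (assuming the package is available; the toy DAG below is made up for illustration):

library(bnlearn)

dag <- model2network("[A][C][B|A:C]")        # collider: A -> B <- C
dsep(dag, x = "A", y = "C")                  # TRUE: d-separated by the empty set
dsep(dag, x = "A", y = "C", z = "B")         # FALSE: conditioning on the collider opens the path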

Page 7: DAG discovery - Network Analysis 2017

• A ⊥⊥ B
• A ⊥⊥ D | C
• B ⊥⊥ G | C, E
• ...

Testing this causal model involves testing whether all these conditional independence relations hold

Page 8: DAG discovery - Network Analysis 2017

However, if this model fits:

• A → B → C

Then so do these:

• A ← B → C
• A ← B ← C

Because these models imply the same conditional independence relationships and are therefore equivalent

Page 9: DAG discovery - Network Analysis 2017

DAGS & Probability

• A key problem in statistics is characterizing the joint likelihood function of all data
  • A function that tells you how likely your observed data are given some parameters
  • Pr(A, B, C, D, ...)
• This function is used in estimating parameters
  • Parameters are selected that maximize the likelihood function
• Obtaining the joint likelihood may be complicated though
• DAGs make this much simpler!

Page 10: DAG discovery - Network Analysis 2017

DAGS & Probability

Normally, to obtain the joint likelihood we need to factorize (chain rule):

Pr(A, B, C, D, E) = Pr(A) Pr(B | A) Pr(C | A, B) Pr(D | A, B, C) Pr(E | A, B, C, D)

But if we know the DAG:

A → B → C → D → E

Then we know, e.g., Pr(E | A, B, C, D) = Pr(E | D) (any node depends only on its “parents”), and thus:

Pr(A, B, C, D, E) = Pr(A) Pr(B | A) Pr(C | B) Pr(D | C) Pr(E | D)

Much simpler!
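
To illustrate how this factorization is used, here is a minimal sketch for a simulated linear-Gaussian chain A → B → C (the chain length and coefficients are assumptions of the example):

set.seed(1)
n <- 500
A <- rnorm(n)
B <- 0.5 * A + rnorm(n)                      # B depends only on its parent A
C <- 0.5 * B + rnorm(n)                      # C depends only on its parent B

# Joint log-likelihood = one marginal term plus one conditional term per node:
ll <- sum(dnorm(A, mean(A), sd(A), log = TRUE)) +   # log Pr(A)
  as.numeric(logLik(lm(B ~ A))) +                   # log Pr(B | A)
  as.numeric(logLik(lm(C ~ B)))                     # log Pr(C | B)
ll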

Page 11: DAG discovery - Network Analysis 2017

Joint Likelihood of Multiple Realizations

[Figure: Y1, Y2, Y3, Y4, Y5 with no edges between occasions (lag-0)]

Simplest: independent cases (e.g., cross-sectional data):

Pr(Y) = Pr(Y1) Pr(Y2) Pr(Y3) Pr(Y4) Pr(Y5)

Estimable if all probability distributions are assumed identical

Page 12: DAG discovery - Network Analysis 2017

Joint Likelihood of Multiple Realizations

[Figure: Y1 → Y2 → Y3 → Y4 → Y5 (lag-1)]

Lag-1 factorization (time series):

Pr(Y) = Pr(Y1) Pr(Y2 | Y1) Pr(Y3 | Y2) Pr(Y4 | Y3) Pr(Y5 | Y4)

Estimable if all probability distributions are assumed identical
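
This lag-1 factorization is the likelihood that a first-order autoregressive model maximizes. A minimal sketch (simulated series; the AR coefficient is arbitrary):

set.seed(1)
y <- as.numeric(arima.sim(model = list(ar = 0.5), n = 200))
arima(y, order = c(1, 0, 0))                 # ML fit of the lag-1 dependence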

Page 13: DAG discovery - Network Analysis 2017

Statistical models can often be portrayed as DAGs, in which case they are called graphical models. For example:

Lee, M. D., & Wagenmakers, E. J. (2014). Bayesian cognitive modeling: A practical course. Cambridge University Press.

• Powerful method for showing how the parameters of a complex model interact with one another
• Bayesian software packages (e.g., WinBUGS, JAGS, Stan) use this DAG in sampling from the posterior distribution

Page 14: DAG discovery - Network Analysis 2017

DAG Discovery

• DAG search algorithms intend to identify an equivalence class
  • List equally plausible DAGs
• Two types of algorithms:
  • Constraint-based algorithms
    • (1) identify edge locations, (2) identify colliders, (3) orient edges under the acyclicity assumption
  • Score-based algorithms
    • Find the optimal DAG by model selection/search
• Prior knowledge can be used in both cases to greatly help the algorithm
  • E.g., causation cannot go backward in time

Page 23: DAG discovery - Network Analysis 2017

Assumptions

• Causal Sufficiency Assumption
  • “There exist no common unobserved (also known as hidden or latent) variables in the domain that are parent of one or more observed variables of the domain”.
  • tl;dr: No latent variables
• Markov Assumption
  • “Given a Bayesian network model B, any variable is independent of all its nondescendants in B, given its parents”.
  • tl;dr: Acyclicity
• Faithfulness Assumption
  • “A BN graph G and a probability distribution P are faithful to one another iff every one and all independence relations valid in P are those entailed by the Markov assumption on G”.
  • tl;dr: No weird stuff

Source: Margaritis, D. (2003). Learning Bayesian network model structure from data. Thesis, School of Computer Science, Carnegie Mellon University, Pittsburgh.

Page 32: DAG discovery - Network Analysis 2017

Score-based algorithms

• Score-based algorithms fit several DAGs, score each against some criterion, and select the best
  • Possible criteria are posterior model fit and AIC/BIC
• Searching all possible DAGs is intractable, so some search strategy is needed
• Examples
  • Hill Climbing; Tabu Search
• Used, e.g., by McNally, R. J., Mair, P., Mugno, B. L., & Riemann, B. C. (2017). Co-morbid obsessive-compulsive disorder and depression: a Bayesian network approach. Psychological Medicine, 1-11.

Page 33: DAG discovery - Network Analysis 2017

Hill Climbing

1. Start at an empty, full, or random network
2. Add, remove, or reverse all possible edges
3. Select the best-fitting model that performs better than the current model
4. Go to 2

Page 34: DAG discovery - Network Analysis 2017

Hill Climbing

• Hill Climbing results in a local optimum
  • Random restarts and perturbations can be used to find a global optimum
• No control for overfitting
  • Bootstrapping and only retaining stable edges is highly recommended (see the sketch below)
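
A minimal bnlearn sketch of this workflow, assuming a data frame df of continuous variables (the number of bootstrap samples is an arbitrary choice):

library(bnlearn)

fit  <- hc(df)                               # greedy hill climbing (score-based)
boot <- boot.strength(df, R = 200, algorithm = "hc")
avg  <- averaged.network(boot)               # keep only edges that are stable across bootstraps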

Page 35: DAG discovery - Network Analysis 2017

Constraint-based algorithms

• Structure estimated based on conditional independence relationships
• E.g., the Inductive Causation (IC) algorithm:
  1. For each pair a and b, look for a set Sab such that (a ⊥⊥ b | Sab). If no such Sab exists, then a and b are dependent: connect them with an undirected edge.
  2. For each trio (a, b, c) such that a − c − b, check if c belongs to Sab. If so, then nothing. If c is not in Sab, then make a collider at c, i.e., a → c ← b.
  3. Orient as many of the undirected edges as possible, subject to: (i) no new v-structures and (ii) no cycles.
• Examples:
  • IC algorithm; PC algorithm; Grow-Shrink; Incremental Association Markov Blanket
• Used, e.g., by Borsboom, D., & Cramer, A. O. (2013). Network analysis: an integrative approach to the structure of psychopathology. Annual Review of Clinical Psychology, 9, 91-121.
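
A minimal pcalg sketch of the PC algorithm on continuous data, assuming a numeric data matrix X (the alpha level is an arbitrary choice):

library(pcalg)

suffStat <- list(C = cor(X), n = nrow(X))    # sufficient statistics for the Gaussian CI test
fit <- pc(suffStat, indepTest = gaussCItest,
          alpha = 0.05, labels = colnames(X))
fit                                          # a CPDAG: the estimated equivalence class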

Page 43: DAG discovery - Network Analysis 2017

[Figure: a DAG over Easiness of Class, Intelligence, Grade, IQ, and Diploma]

Page 44: DAG discovery - Network Analysis 2017

[Figure: the same DAG over Easiness of Class, Intelligence, Grade, IQ, and Diploma]

What if we don’t know the structure?

Page 45: DAG discovery - Network Analysis 2017

[Figure: Easiness of Class, Intelligence, Grade, IQ, Diploma]

Are the two nodes (Easiness of Class and Intelligence) independent given *any* set of other nodes (including the empty set)?

• Yes! They are independent to begin with!
• Draw no edge between Easiness of Class and Intelligence

Page 48: DAG discovery - Network Analysis 2017

[Figure: Easiness of Class, Intelligence, Grade, IQ, Diploma]

Are the two nodes (Easiness of Class and Grade) independent given *any* set of other nodes (including the empty set)?

• No!
• Draw an edge between Easiness of Class and Grade

Page 51: DAG discovery - Network Analysis 2017

[Figure: Easiness of Class, Intelligence, Grade, IQ, Diploma]

Are the two nodes (Grade and Intelligence) independent given *any* set of other nodes (including the empty set)?

• No!
• Draw an edge between Grade and Intelligence

Page 54: DAG discovery - Network Analysis 2017

[Figure: the undirected edges found so far among Easiness of Class, Intelligence, Grade, IQ, and Diploma]

Page 55: DAG discovery - Network Analysis 2017

[Figure: Easiness of Class, Intelligence, Grade, IQ, Diploma]

Is the middle node in the set that separated the other two nodes?

• Yes!
• Do nothing

Page 58: DAG discovery - Network Analysis 2017

[Figure: Easiness of Class, Intelligence, Grade, IQ, Diploma]

Is the middle node (Grade) in the set that separated Easiness of Class and Intelligence?

• No! They were separated by the empty set
• Grade is a collider between Easiness of Class and Intelligence: Easiness of Class → Grade ← Intelligence

Page 61: DAG discovery - Network Analysis 2017

[Figure: Easiness of Class, Intelligence, Grade, IQ, Diploma]

Do we now know the direction of the edge between Grade and Diploma?

• Yes! Grade was not a common effect of Diploma and another variable, so the edge must point Grade → Diploma (orienting it the other way would create a new v-structure)

Page 63: DAG discovery - Network Analysis 2017

[Figure: Easiness of Class, Intelligence, Grade, IQ, Diploma]

Do we now know the direction of the edge between Intelligence and IQ?

• No!

Page 65: DAG discovery - Network Analysis 2017

[Figure: the two equivalent DAGs, differing only in the direction of the Intelligence-IQ edge]

Page 66: DAG discovery - Network Analysis 2017

[Figure: the resulting equivalence class over Easiness of Class, Intelligence, Grade, IQ, and Diploma]

Page 67: DAG discovery - Network Analysis 2017

Constraint-based vs. score-based algorithms

• Constraint-based algorithms are more specific and detailed and allow a more confident causal interpretation, but they are also sensitive to error (if one test is wrong, everything fails!)
• Score-based methods provide a metric of confidence in the returned model and are useful in approximating the joint probability distribution
• Hybrid methods that aim to take the best from both worlds have also been developed!
  • E.g., Max-Min Hill Climbing

Page 71: DAG discovery - Network Analysis 2017

Directed Acyclic Graphs

• A DAG implies a set of independence relationships, which can be tested
• If the data are assumed multivariate Gaussian:
  • Each variable normally distributed
  • Linear relationships between variables
• Then the correlation or covariance can be used to test for dependencies, and the partial correlation or partial covariance can be used to test for conditional dependencies

Page 76: DAG discovery - Network Analysis 2017

[Figure: a DAG over A, B, and C in which B blocks the path from A to C]

• Cov(A, C) ≠ 0
• Cov(A, C | B) = 0

Page 77: DAG discovery - Network Analysis 2017

Structural Equation Modeling

• In SEM, the variance-covariance matrix is modeled and compared to the observed variance-covariance matrix
• If multivariate normality holds, then the Schur complement shows that any partial covariance can be expressed solely in terms of variances and covariances:

  Cov(Yi, Yj | X = x) = Cov(Yi, Yj) − Cov(Yi, X) Var(X)⁻¹ Cov(X, Yj)

• Thus, a specific structure of the correlation matrix also implies a model for all possible partial correlations
• If the implied covariance matrix of the SEM exactly matches the observed covariance matrix, then the data contain all d-separations implied by the causal model
  • In that case, the model could have generated the data!
• But this does not mean the model is correct
  • Equivalent models could have generated the same data!
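
A numeric check of this identity in base R (simulated chain, so X is a single conditioning variable and Var(X)⁻¹ is a scalar):

set.seed(1)
n <- 1e5
A <- rnorm(n); B <- 0.7 * A + rnorm(n); C <- 0.7 * B + rnorm(n)
S <- cov(cbind(A, B, C))

# Partial covariance of A and C given B via the Schur complement:
S["A", "C"] - S["A", "B"] * S["B", "C"] / S["B", "B"]
# ~ 0, as the chain A -> B -> C implies Cov(A, C | B) = 0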

Page 84: DAG discovery - Network Analysis 2017

Doosje, B., Loseman, A., & Bos, K. (2013). Determinants of radicalization of Islamic youth in the Netherlands: Personal uncertainty, perceived injustice, and perceived group threat. Journal of Social Issues, 69(3), 586-604.

Page 88: DAG discovery - Network Analysis 2017

What does pcalg come up with?

[Figure: the estimated graph over In-group Identification, Individual Deprivation, Collective Deprivation, Intergroup Anxiety, Symbolic Threat, Realistic Threat, Personal Emotional Uncertainty, Perceived Injustice, Perceived Illegitimacy of Authorities, Perceived In-group Superiority, Distance to Other People, Societal Disconnectedness, Attitude towards Muslim Violence, Own Violent Intentions]

Page 89: DAG discovery - Network Analysis 2017

Does it fit?

##          chisq             df         pvalue            cfi            nfi
##          80.52          39.00           0.00           0.89           0.82
##          rmsea rmsea.ci.lower rmsea.ci.upper
##           0.09           0.06           0.12

• Not really...
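
These are the kind of fit measures lavaan reports. A hedged sketch of how a discovered DAG could be checked, with a made-up two-line model string (df and the variable names are placeholders, not the actual Doosje et al. model):

library(lavaan)

model <- '
  Attitude ~ Superiority + Injustice         # hypothetical directed edges from the DAG
  Violence ~ Attitude
'
fit <- sem(model, data = df)
fitMeasures(fit, c("chisq", "df", "pvalue", "cfi", "nfi",
                   "rmsea", "rmsea.ci.lower", "rmsea.ci.upper"))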

Page 90: DAG discovery - Network Analysis 2017

DAG Discovery

Discovering an equivalence set of DAGs is possible under some assumptions:

• Causal Sufficiency
• Markov Assumption
• Faithfulness

Two general methods:

• Score-based algorithms
• Constraint-based algorithms

DAGs provide useful characterisations of the joint likelihood and can be fitted to the data (e.g., SEM)

Page 98: DAG discovery - Network Analysis 2017

But...

• Assumptions often not plausible
  • Latent variables or acyclicity
• Prone to errors
  • Often edges are estimated in a different direction than you would expect
• Exploratory estimation may suffer from low power
• Confirmatory fit may suffer from many equivalent models

Page 104: DAG discovery - Network Analysis 2017

Software

Several R packages, but mainly:

• pcalg
  • Implements the PC algorithm (a faster variant of the IC algorithm)
• bnlearn
  • Implements everything *but* the PC algorithm

We will see these in the assignment!

Page 105: DAG discovery - Network Analysis 2017

Thank you for your attention!