INVITED PAPER

Causal feature learning: an overview

Krzysztof Chalupka¹ · Frederick Eberhardt² · Pietro Perona³

Received: 29 August 2016 / Accepted: 29 November 2016 / Published online: 26 December 2016

© The Behaviormetric Society 2016

Abstract Causal feature learning (CFL) (Chalupka et al., Proceedings of the

Thirty-First Conference on Uncertainty in Artificial Intelligence. AUAI Press,

Edinburgh, pp 181–190, 2015) is a causal inference framework rooted in the language of causal graphical models (Pearl J, Reasoning and inference. Cambridge

University Press, Cambridge, 2009; Spirtes et al., Causation, Prediction, and Search.

Massachusetts Institute of Technology, Massachusetts, 2000), and computational

mechanics (Shalizi, PhD thesis, University of Wisconsin at Madison, 2001). CFL is

aimed at discovering high-level causal relations from low-level data, and at

reducing the experimental effort to understand confounding among the high-level

variables. We first review the scientific motivation for CFL, then present a detailed

introduction to the framework, laying out the definitions and algorithmic steps. A

simple example illustrates the techniques involved in the learning steps and provides

visual intuition. Finally, we discuss the limitations of the current framework and list

a number of open problems.

Keywords Causal discovery · Causal inference · Graphical models · Bayesian networks · Macrovariables · Multiscale modeling

Communicated by Shohei Shimizu.

✉ Krzysztof Chalupka

[email protected]

1 Computation and Neural Systems, California Institute of Technology, Pasadena, CA, USA

2 Humanities and Social Sciences, California Institute of Technology, Pasadena, CA, USA

3 Electrical Engineering, California Institute of Technology, Pasadena, CA, USA


Behaviormetrika (2017) 44:137–164

DOI 10.1007/s41237-016-0008-2

http://orcid.org/0000-0002-1225-2112


1 Introduction

Causal feature learning (CFL) is an unsupervised machine learning and causal

inference framework with two goals: (1) the formation of high-level causal

hypotheses using low-level input data, and (2) efficient testing of these

hypotheses (Chalupka et al. 2015, 2016a, b). As a motivation, consider the

following archetypical research situation (Fig. 1c): a neuroscientist notices that a

specific neuron responds preferentially to some images containing humans. He

progressively refines and tests this hypothesis by exploring painstakingly the

effect of different poses and occlusions of a large number of human shapes on

the neuron. He concludes that the neuron responds specifically to images of

female faces. This conclusion is based on alternating three main steps:

(a) formulating hypotheses through modeling and intuition, (b) designing

experiments to test such hypotheses, and (c) collecting evidence from such

experiments.

Steps (a) and (b) are guided by prior knowledge, intuition and formal reasoning.

CFL aims to automate this process in situations where observational data is

plentiful, reducing the bias resulting from pre-conceived ideas of the scientist. The

method is predicated on the idea that if the data in fact contains high-level features

(such as faces) that are causal, then these ought to be detectable by an unsupervised

learning algorithm.

In CFL, the distinction between high-level and low-level features is framed in

terms of macrovariables and microvariables, terms often used in physical sciences.

The semantics of these terms as used in science provide direct inspiration for our

algorithms.

Fig. 1 Causal macrovariables. Macrovariables in science can often be thought of as functions of the underlying microvariable space. Each such function f corresponds to a partition on the microvariable state space, defined by the equivalence relation $x_1 \sim x_2 \iff f(x_1) = f(x_2)$. a Temperature may be defined as the mean kinetic energy of a system of particles. It is a one-dimensional function of a high-dimensional system consisting of a large number of particles, each one with a mass and velocity. b El Nino is defined as the sea surface temperature (SST) anomaly in a specific region of the Pacific Ocean exceeding 0.5 °C. It is a binary function of the high-dimensional SST map. c Primate brains are thought to have areas specialized for face detection (see Tsao et al. (2006) for direct evidence in the macaque cortex). “Presence of a face” is a high-level feature of the space of all images


1.1 Macrovariables in science

Just about any scientific discipline is concerned with developing ‘macrovariables’

that summarize an underlying finer-scale structure of ‘microvariables’ (see Fig. 1a–c). Temperature and pressure summarize the particles’ masses and velocities in a

gas at equilibrium; large-scale climate phenomena, such as El Nino, supervene on

the geographical and temporal distribution of sea surface temperatures and wind

speed. Similarly, for the human sciences: macro-economics supervenes on the

economic activities of individuals, which in turn presumably summarize the

psychological processes of each person, which are aggregates of neural states. These

abstractions are particularly useful when one can establish causal relations among

macrovariables that hold independent of the microvariable instantiations of the

macrostates.

Causal feature learning is motivated by the need to automate the process of

developing such hierarchical descriptions starting from the less-constrained space of

microvariables. The key insight is that it is best to discover simultaneously the

macrovariables and their causal relations. The framework draws heavily on ideas

developed in computational mechanics (Shalizi and Crutchfield 2001; Shalizi 2001;

Shalizi and Moore 2003) and connects them with the framework of causal graphical

models (Spirtes et al. 2000; Pearl 2000).

CFL searches for the macrovariable cause/effect hypotheses starting from

microvariable data. Any random variable with a large, possibly infinite, number of

states may be considered a microvariable. Continuous variables, as well as discrete

variables with exponentially many states (such as images) are microvariables.

One may think of macrovariables as equivalence relations on the microvariable state space. For example, all the particle ensembles with the same mean kinetic

energy correspond to the same temperature. Similarly, all the sea surface

temperature maps where the temperature anomaly in a specified region of the

Pacific Ocean exceeds 0.5 °C correspond to El Nino. CFL also defines the relation

between micro- and macrovariables in terms of an equivalence relation, which we

review more formally in Sect. 2.1.

The learning task of CFL may be framed in terms of the micro- and

macrovariable distinction:

1. Take two observational (that is, “sampled by nature”, non-experimental; Pearl 2000, 2010) microvariable datasets H and E as input, with the task of discovering “what in H causes what in E.”

2. Search the space of all macrovariables (equivalence relations) on H and retain only those that could be causes of E.

3. Search the space of all the macrovariables that supervene on E and retain only those that could be effects of H.

4. Propose an efficient experimental procedure that picks out the (unique) macrovariable cause and effect from among the retained macrovariable pairs.

In general, there is an infinite number of macrovariables one could define as

supervening on the two given microvariable spaces. However, not every random


variable can function as a causal variable. First of all, causal variables cannot stand

in logical or definitional relations to one another—X does not cause 2X. Furthermore,

causal variables should permit well-defined experimental interventions. This latter

point raises a subtle but important issue for the interplay between causality and

aggregation: ambiguous manipulations.

1.2 Ambiguous manipulation and causal macrovariables

Figure 2 illustrates a case of an unfortunate choice of a macrovariable. Total

cholesterol used to be considered as a risk factor for heart disease. However, further

analysis revealed that ‘total cholesterol’ is not a good causal variable, since it is a

sum of cholesterol carried by low-density and high-density lipoproteins (LDL and

HDL, commonly called ‘‘bad’’ and ‘‘good’’ cholesterol), which have different

effects on heart disease (see Spirtes and Scheines 2004 for an in-depth discussion of this case).

Consequently, to recommend a ‘‘low-cholesterol diet’’ is to prescribe an

ambiguous manipulation: ‘‘low-cholesterol’’ could mean low in LDL, HDL or

both, but each would have very different consequences for the heart. Unless the

proportions of LDL vs. HDL are known in advance, this makes a proper

experimental verification of the causal link between total cholesterol and heart

disease impossible. The example illustrates that there is an appropriate level of

aggregation to describe the causal relation, and ‘total cholesterol’ is too high-level.

The challenge is to identify when one has reached the correct level of aggregation.

Causal feature learning addresses this concern by learning unambiguous causal

variables: each macrovariable state must have a consistent well-defined causal

effect. This effect can be probabilistic and highly variable, but may not depend on

the microvariable instantiation of the macrovariable—just like the specifics of gas

molecule momenta do not change the effects of temperature, as long as their mean is

equal. In this way, CFL abstracts microscopic details of the problem away, allowing

the scientist to focus on all the relevant macroscopic details. This is analogous to the

role of the macrovariables in Fig. 1a–c.
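The ambiguity can be made concrete with a minimal sketch. It assumes a hypothetical linear risk score in which LDL raises and HDL lowers risk; the coefficients, the example values and the names `heart_risk` and `total_cholesterol` are illustrative assumptions, not clinical quantities. Two microvariable states with the same total cholesterol then yield different risks, so an intervention specified only through the total is ambiguous.

```python
def heart_risk(ldl, hdl):
    """Hypothetical risk score: LDL raises risk, HDL lowers it (assumed coefficients)."""
    return 0.02 * ldl - 0.03 * hdl


def total_cholesterol(ldl, hdl):
    """The candidate macrovariable: a sum of the two microvariables."""
    return ldl + hdl


# Two microvariable states that realize the same macrovariable value.
state_a = {"ldl": 150, "hdl": 50}
state_b = {"ldl": 110, "hdl": 90}
assert total_cholesterol(**state_a) == total_cholesterol(**state_b)

risk_a = heart_risk(**state_a)
risk_b = heart_risk(**state_b)

# The macrovariable is ambiguous if equal totals can produce different effects.
ambiguous = abs(risk_a - risk_b) > 1e-9
print(f"total={total_cholesterol(**state_a)}, risk_a={risk_a:.2f}, risk_b={risk_b:.2f}")
print("manipulating total cholesterol is ambiguous:", ambiguous)
```

A macrovariable in the sense used here would instead have to assign the same effect to every microvariable state within one of its values.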

Fig. 2 Ambiguous macrovariables. Total cholesterol is the sum of cholesterol carried by low- and high-density lipoproteins (LDL and HDL). Suppose that LDL causes heart disease and HDL prevents it. The effect of total cholesterol on heart disease is then ambiguous as it depends on the proportion of HDL vs. LDL (see Sect. 1.2). Experimental procedures based on adjusting total cholesterol only can give inconsistent results



1.3 Macrovariables are task-specific

Although pre-theoretic intuition may suggest that there is some uniquely true

taxonomy to the variables describing the world, we reject this view and propose that

macrovariables should be thought of as task-specific. For example, there is evidence

that the human visual system parses the image in terms of macrovariables that

(among other things) track the location, shape and appearance of faces in the scene.

However, there is no a priori reason that these are ‘optimal’ visual variables. To

other creatures, occupying a different ecological niche and animated by different

behavioral goals, a different aggregation of pixel information may be relevant. For

example, an insect might be far more concerned about luminance, edges and motion

flow in its visual input than about objects and faces. Thus, the appropriate

equivalence relation on the microvariable state space is driven by the relation

between ‘‘input’’ and ‘‘output’’ spaces (e.g., the statistics of the environment as

imaged by the optic array, and the desired behavior), rather than by one or the other

of the spaces considered individually.

Consequently, in the simplest case, CFL is concerned with the relationship

between two microvariable spaces: the ‘input’ and ‘output’ space. The task is then

to detect all the causes (in the input space) of all the effects (in the output space), in

their simplest form. Any distinctions unimportant to the task at hand will be

abstracted away.

For example, Chalupka et al. (2016a) take ‘wind strength map over Pacific’ as the input space, and ‘sea surface temperature (SST)’ as the output space. Applying CFL to this task yields a discrete division of the two spaces into a set of wind pattern classes (‘Westerly Winds’, ‘Easterly Winds’, etc.) and SST pattern classes (‘El Nino’, ‘La Nina’, etc.); see Fig. 3. Knowing which class a wind pattern belongs to then gives all the useful information about its possible effects on SST.¹ However, given a different output space, say “average US income”, the input macrovariable would change, unless the causal consequences are entirely mediated by the same temperature macrovariable.

Fig. 3 Macrovariables of Pacific weather patterns. Chalupka et al. (2016a) applied CFL to climate data. The microvariables consisted of zonal (East–West) wind strength over the equatorial Pacific and sea surface temperatures (SST) over the same region of space. The figure shows the causal hypothesis discovered by CFL. Each image represents one macrovariable state, the average over one cluster of wind W (left) or temperature T (right). The conditional probability table shows $P(T \mid W)$, the probability of the hypothesized SST macrovariable given the hypothesized wind macrovariable. It shows that CFL learned at least two relations that, causally interpreted, are consistent with current climate science: ‘Westerly winds’ ($W = 1$) cause El Nino ($T = 1$) and ‘Easterly winds’ ($W = 0$) cause La Nina ($T = 2$)

1.4 A toy example

To visualize the definitions and main algorithmic steps involved in CFL, we will

resort to a simple toy example of a fictitious study on the influence of color on the

electrodermal response (eda), also known as the skin conductance.² In the fictitious

experiment, the eda response to a (constant but unspecified) stimulus is recorded in

varying environments. At the same time, the predominant hue of the environment is

recorded. Our simulated system is pictured in Fig. 4. In the system, ‘‘Red’’ hues

increase eda (a perhaps controversial but plausible response, see Jacobs and

Hustmyer (1974)). In addition, living in warmer climates increases eda, but also

increases the chance of observing ‘‘Warm’’ colors in the environment. Our

imaginary study consists of picking humans from diverse populations at random,

and measuring their eda as well as the predominant hue in their environment. The

example is set up to exhibit three characteristics:

1. The microvariables (hue and eda) are one-dimensional. Although this necessarily makes the example slightly contrived, the visualizations of the algorithms

and definitions are much simpler and more illuminating than in the higher

dimensional case.

2. Microvariable hue gives rise to intuitive macrovariables: color classes. ‘‘Red’’

colors, ‘‘Natural’’ colors or ‘‘Warm’’ colors are (subjective) partitions of the hue

space, and clearly supervene on hue. For example, ‘‘Red’’ is not caused by hue,

it is simply a range in the hue space.

3. The cause (hue) influences the effect (eda) through direct causation, but they are

at the same time confounded by geographic location. As discussed in Sect. 4.1

in more detail, separating confounding from causation differentiates CFL from many other learning methods.

Whereas our simulated dataset is simple and low-dimensional, CFL, in the same

form as presented here (Algorithms 1 and 2), can be applied to very high-dimensional and complex data (see Chalupka et al. 2015, 2016a, b for examples of complex applications).

In our model, eda [in units normalized to (0, 1) where 0.5 is the global average] is causally influenced by hue (represented in degrees, with 0 being the red hue, see Fig. 4) and lat (geographic latitude, a one-dimensional proxy for “climate” for ease of visualization). Among these microvariables, only hue and eda are observed and lat is latent. Thus, hue causes eda and at the same time the two variables are confounded, as illustrated in Fig. 4. Assuming that lat fully captures the causal confounding between hue and eda, their joint distribution p(hue, eda) factorizes as

$$p(\mathrm{hue}, \mathrm{eda}) = \sum_{\mathrm{lat}} p(\mathrm{eda} \mid \mathrm{hue}, \mathrm{lat})\, p(\mathrm{hue} \mid \mathrm{lat})\, p(\mathrm{lat}), \qquad (1)$$

and can be described by the causal graphical model shown in Fig. 4.

¹ As discussed in some detail in Chalupka et al. (2016a), a causal interpretation of purely observational data is not possible without further assumptions.

² Python code that implements the learning algorithms and reproduces all the figures and experimental results is available online at http://vision.caltech.edu/~kchalupk/code.html.

The probability tables for these factors are shown in Fig. 4. We constructed the

conditional $p(\mathrm{eda} \mid \mathrm{hue}, \mathrm{lat})$ to take a special form: there are four ranges of hue within which $p(\mathrm{eda} \mid \mathrm{hue})$ is constant. For example, the conditional is the same for any $\mathrm{hue} \in (0, 90)$. This construction indicates that there are macrovariables driving

the relationship between hue and eda: to a good approximation, any hue within a

given range has the same effect on eda. The situation is analogous to that of the

temperature macrovariable driving the relationship between, say, water and human

pain receptors. To a good approximation, any body of water with the same

temperature has the same effect on our pain receptors, no matter the individual

velocities of specific water particles.
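The generative process of the toy model can be sketched as ancestral sampling: draw lat, then hue given lat, then eda given hue and lat, and finally discard the latent lat. The probability values below are illustrative assumptions (the actual tables appear only in Fig. 4), except that the 0.8 chance of above-average eda for red hues at lat > 45 follows the Fig. 4 caption. All function names are hypothetical.

```python
import random


def sample_hue(lat, rng):
    """Low latitudes favor 'Warm' hues in (0, 180); the 0.8/0.2 split is assumed."""
    p_warm = 0.8 if lat < 45 else 0.2
    if rng.random() < p_warm:
        return rng.uniform(0.0, 180.0)
    return rng.uniform(180.0, 360.0)


def is_red(hue):
    """'Red' range of the toy model: hue in (0, 90) or (270, 360)."""
    return hue < 90 or hue > 270


def p_above_average(hue, lat):
    """P(eda > 0.5 | hue, lat); all values assumed except 0.8 for red hues at lat > 45."""
    if is_red(hue):
        return 0.9 if lat < 45 else 0.8
    return 0.2 if lat < 45 else 0.1


def sample_eda(hue, lat, rng):
    """eda is uniform within the upper or lower half of (0, 1)."""
    if rng.random() < p_above_average(hue, lat):
        return rng.uniform(0.5, 1.0)
    return rng.uniform(0.0, 0.5)


def sample_observational(n, seed=0):
    """Return n observational (hue, eda) pairs; lat is sampled but not recorded."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        lat = rng.uniform(0.0, 90.0)   # assumed uniform prior over latitude
        hue = sample_hue(lat, rng)
        eda = sample_eda(hue, lat, rng)
        data.append((hue, eda))
    return data


samples = sample_observational(1000)
```

Because lat influences both hue and eda, the dependence between hue and eda in `samples` mixes the causal effect of red hues with the confounding induced by latitude.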

In Sect. 2.1, we will formalize the notion of a macrovariable and its relationship

to the underlying microvariable. We will show that such macrovariables are unique

and can be extracted automatically from the data. For now, we propose a ‘‘ground-

truth’’ macrovariable model that agrees with the microvariable distribution shown in

Fig. 4. Throughout the paper we will show that this model is in fact the

macrovariable structure that supervenes on hue and eda.

Our macrovariables are all binary. A supervenes on (is a function of) eda, with $A = 1$ if and only if $\mathrm{eda} > 0.5$. That is, A represents an “Above-average” skin conductance. A is caused by R, which supervenes on hue, $R = 1 \iff \mathrm{hue} \in (0, 90) \cup (270, 360)$. In addition, A correlates with, but is not caused by, W, another variable that supervenes on hue, $W = 1 \iff \mathrm{hue} \in (0, 180)$. The causal graph of R, W and A, shown in Fig. 4, is determined by the variables’ supervenience on hue and eda. Similarly, the joint P(R, W, A) is fully determined by p(hue, eda). In the remainder of the article, we will show how CFL can recover such macrovariables, as well as the causal graph over these variables, given only observational samples of the microvariables and a few carefully selected interventions in the microvariable space.

Fig. 4 A toy causal model. In the simulated study, predominant hue of the environment causes changes in the electrodermal response: red hues increase eda beyond the average, whereas non-red hues tend to decrease it. In addition, the latitude of the experiment influences eda: lower latitudes, close to the equator, cause higher absolute eda due to warm climate predominant there. Finally, lower latitude environments tend to have visually warmer hues, whereas higher latitude environments often have cooler hues. The probability tables show generative probabilities for our data, where U(a, b) is the uniform distribution between a and b. For example, if $\mathrm{lat} > 45$ and $270 < \mathrm{hue} < 360$, then $p(\mathrm{eda}) = 0.2\,U(0, 0.5) + 0.8\,U(0.5, 1)$, a mixture of two uniform distributions that indicates that most likely, eda is above the average in this situation

2 Learning causal features

CFL takes microvariable inputs and produces macrovariable causal hypotheses. As

discussed in Sect. 1.1, we will now define macrovariables as equivalence classes

over our microvariables. Throughout this section, let $\mathcal{H} = (0, 360)$ denote our input microvariable space (range of the hue variable), and $\mathcal{E} = (0, 1)$ the output microvariable space (range of eda). We will denote the random variables defined over these spaces as hue and eda as before, and their specific instantiations as h and e; for example, we will write $p(\mathrm{eda} = e \mid \mathrm{hue} = h)$ for some $e \in \mathcal{E}$, $h \in \mathcal{H}$.

2.1 Learning the causal hypothesis

Figure 5a shows 1,000 samples from p(hue, eda) together with the ground-truth conditional distribution $p(\mathrm{eda} \mid \mathrm{hue})$. These observations follow Eq. (1), where hue and eda are confounded by the unobserved lat.

The empirical distribution shown in the figure indicates that $p(\mathrm{eda} \mid \mathrm{hue})$ is constant for any $h \in (0, 90)$ as well as for $h \in (90, 180)$, $h \in (180, 270)$ and $h \in (270, 360)$. Figure 4 shows that indeed, this partition of hue into four classes captures all the combinations of macrovariables supervening on hue. For example, $h \in (0, 90)$ if and only if $W = 1$ and $R = 1$, $h \in (90, 180)$ if and only if $W = 1$ and $R = 0$, and so on.

Fig. 5 Model samples and pdf. a Black dots are samples from the joint p(hue, eda), background color shows the ground-truth value of $p(\mathrm{eda} \mid \mathrm{hue})$. b The result of conditional density learning of $p(\mathrm{eda} \mid \mathrm{hue})$ using a mixture density network (see Sect. 2.1)

Such partitioning of a microvariable space into the coarsest cells that retain all the observational distinctions is the key element of CFL. This construction, called the Observational Partition, abstracts away all the irrelevant micro-level details:

Definition 1 (Observational partition, observational class) The observational partition of $\mathcal{H}$, denoted by $\Pi_o(\mathcal{H})$, is the partition induced by the equivalence relation $\sim_h$ such that

$$h_1 \sim_h h_2 \iff \forall e \in \mathcal{E}:\; p(e \mid h_1) = p(e \mid h_2).$$

The observational partition of $\mathcal{E}$, denoted by $\Pi_o(\mathcal{E})$, is the partition induced by the equivalence relation $\sim_e$ such that

$$e_1 \sim_e e_2 \iff \forall h \in \mathcal{H}:\; p(e_1 \mid h) = p(e_2 \mid h).$$

A cell of an observational partition is called an observational class (of $\mathcal{H}$ or $\mathcal{E}$).
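When the conditional distribution is known exactly on a finite grid, the equivalence relation of Definition 1 can be applied directly. The following minimal sketch groups hue values whose conditional rows coincide; the function name `observational_partition`, the chosen hue grid, and the two-bin representation of eda are assumptions made for illustration. The row values reuse the four conditional probabilities quoted in this section.

```python
def observational_partition(cond_table, tol=1e-9):
    """Group states h whose rows p(. | h) agree with a cell representative up to tol.

    cond_table: dict mapping h -> list of probabilities p(e | h) over a fixed grid of e.
    Returns a list of cells, each a list of h values.
    """
    cells = []  # list of (representative_row, member_states)
    for h, row in cond_table.items():
        for rep_row, members in cells:
            if all(abs(a - b) <= tol for a, b in zip(rep_row, row)):
                members.append(h)
                break
        else:
            cells.append((row, [h]))
    return [members for _, members in cells]


# Rows are [P(eda < 0.5 | hue), P(eda > 0.5 | hue)] for a few example hues.
table = {
    10: [0.0, 1.0], 80: [0.0, 1.0],
    100: [2 / 3, 1 / 3], 170: [2 / 3, 1 / 3],
    190: [1.0, 0.0], 260: [1.0, 0.0],
    280: [1 / 5, 4 / 5], 350: [1 / 5, 4 / 5],
}

cells = observational_partition(table)
print(cells)  # [[10, 80], [100, 170], [190, 260], [280, 350]]
```

With estimated rather than exact densities, exact equality is too strict, which is why the learning procedure relies on clustering instead.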

The observational partition of $\mathcal{E}$ is also easily discerned from Fig. 5: $e \in (0, 0.5)$ has the same $P(e \mid h)$ for any h. Let us index the observational classes on $\mathcal{H}$ as $H = 0, 1, 2, 3$ if $h \in (0, 90)$, $(90, 180)$, $(180, 270)$, $(270, 360)$, respectively, and those on $\mathcal{E}$ as $E = 0, 1$ if $e \in (0, 0.5)$ and $(0.5, 1)$, respectively. We can then compress $p(\mathrm{eda} \mid \mathrm{hue})$ to only four numbers without losing any information:

$$\begin{aligned}
P(E = 1 \mid H = 0) &= 1, \\
P(E = 1 \mid H = 1) &= 1/3, \\
P(E = 1 \mid H = 2) &= 0, \\
P(E = 1 \mid H = 3) &= 4/5.
\end{aligned}$$

Note that this corresponds to $p(A \mid R, W)$ in Fig. 4. However, whereas R truly is a

cause of A, the noncausal dependence of A on W results from the confounder lat.

The observational partition can be seen as a macrovariable causal hypothesis for the

causal effect of hue on eda. However, the observational partition of hue does not

necessarily characterize the cause of the observational class of eda. We will discuss

the role of the observational partition in causal learning in detail in Sect. 2.2 below.

Before that, let us see how to learn the observational class of hue and eda from data.

Learning the observational partition amounts to clustering $\mathcal{H}$ such that all the h belonging to one cluster induce the same $p(\mathrm{eda} \mid \mathrm{hue} = h)$, and clustering $\mathcal{E}$ such that all the e in one cluster have the same likelihood $p(\mathrm{eda} = e \mid \mathrm{hue})$ for any value of hue. We outline the procedure in Algorithm 1. Its most involved component is the

density learning subroutine used in Line 1. Fortunately, we only need to estimate

the conditional density well enough to discover its equivalence classes.

In Fig. 5b, the learned density differs from the ground-truth. Nevertheless, we

used this learned density to perform clustering on the H and E spaces into the

ground-truth number of clusters (4 and 2, respectively). Figure 6 shows that simple

K-means clustering of the density vectors accurately discovers the observational

class boundaries in both H and E. Sect. 3.3 discusses in detail observational

partition learning in the more realistic situation where the ground-truth number of

macrovariable states (clusters) is unknown.
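The clustering step can be sketched in a few lines. This is not the authors' implementation: a histogram estimate of the conditional density stands in for the mixture density network, the synthetic sampler draws hue uniformly (an assumption) and uses the four conditional probabilities quoted in Sect. 2.1, and the K-means routine uses a deterministic farthest-point initialization chosen for reproducibility. Only the partition of the hue space is learned here; the eda space would be handled analogously by clustering the columns.

```python
import random

QUADRANT_P_ABOVE = [1.0, 1 / 3, 0.0, 4 / 5]  # P(E = 1 | H) for the four hue ranges


def sample(n, seed=0):
    """Synthetic observational (hue, eda) pairs; uniform hue is an assumption."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        hue = rng.uniform(0.0, 360.0)
        p = QUADRANT_P_ABOVE[min(int(hue // 90), 3)]
        eda = rng.uniform(0.5, 1.0) if rng.random() < p else rng.uniform(0.0, 0.5)
        data.append((hue, eda))
    return data


def density_vectors(data, n_hue_bins=36, n_eda_bins=4):
    """Histogram estimate of p(eda | hue): one normalized row per hue bin."""
    counts = [[0] * n_eda_bins for _ in range(n_hue_bins)]
    for hue, eda in data:
        i = min(int(hue / 360.0 * n_hue_bins), n_hue_bins - 1)
        j = min(int(eda * n_eda_bins), n_eda_bins - 1)
        counts[i][j] += 1
    return [[c / max(sum(row), 1) for c in row] for row in counts]


def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, n_iter=50):
    """Lloyd's algorithm with farthest-point initialization; returns one label per point."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(sq_dist(p, c) for c in centers)))
    labels = [0] * len(points)
    for _ in range(n_iter):
        labels = [min(range(k), key=lambda c: sq_dist(p, centers[c])) for p in points]
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


# One label per 10-degree hue bin; bins sharing a label form one observational class.
labels = kmeans(density_vectors(sample(20000)), k=4)
```

The number of clusters is fixed to the ground-truth value of four, mirroring the simplification made in the text.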



To estimate the conditional density, we used a mixture density network

(MDN) (Bishop 1994) with three hidden layers of 64, 64 and 32 units and four

mixture components. MDNs can be relatively easily applied to high-dimensional

conditional density learning problems with large datasets, even in the online setting

where new data is arriving continuously. In very high-dimensional problems, an

MDN with several mixture components might be unable to learn the true density

accurately. Nevertheless, if the ground-truth generative model has a discrete

macrovariable structure, we can expect the mixture coefficients to have similar

values within each observational class as long as the number of components is not

significantly smaller than the number of observational classes.

2.2 Weeding out the spurious correlates

Our notion of causality is rooted in the framework of Pearl (2000) and Spirtes et al.

(2000). Intuitively, X causes Y if intervening on (or manipulating) X, without

influencing any other variables in the system, changes the distribution of Y. That is, $P(Y \mid \mathrm{do}(X))$ is not constant. But as is well-known, the conditional probability distribution $P(Y \mid X)$ for any two variables X and Y does not fix the causal effect $P(Y \mid \mathrm{do}(X))$. For example, the barometer’s needle predicts rain, but manipulating

the needle will not cause the weather to change.

Fig. 6 Learning the observational partition. a The observational partition learned on $\mathcal{H}$ results from clustering the samples’ h coordinate with respect to the inferred $p(\mathrm{eda} \mid \mathrm{hue})$ shown in Fig. 5b. We indicate the learned partitions with an apostrophe, $H'$ and $E'$, in contrast with the ground-truth H and E. b The observational partition of $\mathcal{E}$, with two cells, results from clustering the samples’ eda-coordinate with respect to the inferred $p(\mathrm{eda} \mid \mathrm{hue})$. c The observational partitions are endowed with probability densities simply by counting the histogram of the microvariable samples in each (conditional) macrovariable state. The ground-truth values (see Fig. 4) are given in square brackets

The observational partition can be used as a basis for an efficient testing procedure of causal hypotheses. To distinguish interventions in the microvariable space from those on the macrovariable space, we denote the manipulation operation in the microvariable space with the operator $\mathrm{man}()$ and reserve the standard $\mathrm{do}()$ operator for causal macrovariables:

Definition 2 (Microvariable manipulation) A microvariable manipulation is the operation $\mathrm{man}(\mathrm{hue} = h)$ (we will often simply write $\mathrm{man}(h)$ for a specific manipulation) that changes the microvariable hue to $h \in \mathcal{H}$, while not (directly) affecting any other variables (such as lat or eda). That is, the manipulated probability distribution of the generative model in Eq. (1) is given by

$$P(\mathrm{eda} \mid \mathrm{man}(\mathrm{hue} = h)) = \sum_{l} P(\mathrm{eda} \mid \mathrm{hue} = h, \mathrm{lat} = l)\, P(\mathrm{lat} = l). \qquad (2)$$

Compare Eq. (2) in the definition with Eq. (1). In contrast to the conditional distribution $p(\mathrm{eda} \mid \mathrm{hue} = h) = \sum_{l} P(\mathrm{eda} \mid \mathrm{hue} = h, \mathrm{lat} = l)\, P(\mathrm{lat} = l \mid \mathrm{hue} = h)$, the dependency between lat and hue is removed in the manipulated probability $p(\mathrm{eda} \mid \mathrm{man}(\mathrm{hue} = h))$. This is because the latter equation models an intervention, where the value $\mathrm{hue} = h$ is set in a controlled setting. For example, placing a subject in a room colored in a particular hue is a micro-level manipulation.
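The difference between conditioning and manipulating can be checked numerically on a small discrete stand-in for the toy model. In the sketch below, lat takes two values and hue is represented by its four ranges; all probability values and names are illustrative assumptions rather than the tables of Fig. 4. The conditional weights each latitude by P(lat | hue), whereas the manipulated distribution weights it by P(lat).

```python
P_LAT = {"low": 0.5, "high": 0.5}

# P(hue range | lat): low latitudes favor the 'Warm' ranges 0 and 1 (assumed values).
P_HUE_GIVEN_LAT = {
    "low": {0: 0.4, 1: 0.4, 2: 0.1, 3: 0.1},
    "high": {0: 0.1, 1: 0.1, 2: 0.4, 3: 0.4},
}

# P(A = 1 | hue range, lat): ranges 0 and 3 are 'Red' (assumed values).
P_A_GIVEN_HUE_LAT = {
    (0, "low"): 0.9, (0, "high"): 0.8,
    (1, "low"): 0.2, (1, "high"): 0.1,
    (2, "low"): 0.2, (2, "high"): 0.1,
    (3, "low"): 0.9, (3, "high"): 0.8,
}


def p_conditional(h):
    """P(A = 1 | hue = h): lat is weighted by its posterior P(lat | hue = h)."""
    joint = {l: P_HUE_GIVEN_LAT[l][h] * P_LAT[l] for l in P_LAT}
    p_h = sum(joint.values())
    return sum(P_A_GIVEN_HUE_LAT[(h, l)] * joint[l] / p_h for l in P_LAT)


def p_manipulated(h):
    """P(A = 1 | man(hue = h)): lat keeps its prior P(lat)."""
    return sum(P_A_GIVEN_HUE_LAT[(h, l)] * P_LAT[l] for l in P_LAT)


for h in range(4):
    print(h, round(p_conditional(h), 3), round(p_manipulated(h), 3))
```

Under these assumed tables the conditional takes four distinct values across the hue ranges, while the manipulated probability takes only two, one for the red ranges and one for the others.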

A macrovariable intervention $\mathrm{do}(X = x)$ amounts to setting the underlying microvariable to any value within the specified partition cell x. The value of the underlying microvariable need not be fully determined by the intervention. For example, $\mathrm{do}(R = 1)$ in our toy model would mean that the subject is placed in a room colored with any hue belonging to the $R = 1$ range as indicated in Fig. 4b.

Note that any such experiment would, according to our model, have the same effect

on eda (and A).

In our model, $P(A \mid \mathrm{do}(R = r)) \neq P(A)$ for any r, and $P(A \mid \mathrm{do}(W = w)) = P(A)$ for any w, which confirms the intuition that R is a cause of A, but W is not. However, the observational partition $\Pi_o(\mathcal{H})$ that we learned in Sect. 2.1 contains information about both R and W.

We can discover which cells of the observational partition are causally relevant using a simple experimental procedure, illustrated in Fig. 7. Pick one representative $h_i$ from each observational class i and perform the intervention $\mathrm{man}(\mathrm{hue} = h_i)$. Then, merge those cells of the observational partition whose representatives induced the same $p(\mathrm{eda} \mid \mathrm{man}(h_i))$ (see Algorithm 2). The resulting causal partition retains only the distinction between $\mathrm{hue} \in (90, 270)$ and $\mathrm{hue} \in (0, 90) \cup (270, 360)$, which is our “Red” variable, the true cause of A.
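The merging step can be sketched as follows. The estimates are the experimental values quoted in the Fig. 7 caption; the greedy comparison against a class representative and the threshold `tol` are simplifying assumptions made for this illustration (a statistical test of equality would be a more principled choice), and the function name is hypothetical.

```python
def merge_observational_cells(effects, tol):
    """Greedily merge cells whose estimated effect is within tol of a class representative.

    effects: dict mapping observational cell index -> estimate of P(E' = 1 | man(h_i)).
    Returns a list of causal classes, each a list of observational cell indices.
    """
    classes = []  # list of (representative_effect, member_cells)
    for cell, effect in effects.items():
        for rep_effect, members in classes:
            if abs(effect - rep_effect) <= tol:
                members.append(cell)
                break
        else:
            classes.append((effect, [cell]))
    return [members for _, members in classes]


# Estimates for representatives hue = 36, 100, 195, 320 (25 trials each, per Fig. 7).
estimates = {0: 22 / 25, 1: 1 / 25, 2: 6 / 25, 3: 23 / 25}

causal_classes = merge_observational_cells(estimates, tol=0.25)  # tol is an assumed threshold
print(causal_classes)  # [[0, 3], [1, 2]]
```

The two resulting classes correspond to the red ranges (cells 0 and 3) and the non-red ranges (cells 1 and 2).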

This procedure can be applied in the general setting. Let us first define the causal

partition, which corresponds to the macrovariable true cause. We will then show

that the causal partition is almost always a coarsening of the observational partition,

just like in our toy model and as illustrated in Fig. 7.

Definition 3 (Fundamental causal partition, causal class) The fundamental causal partition of $\mathcal{H}$, denoted by $\Pi_c(\mathcal{H})$, is the partition induced by the equivalence relation $\sim_h$ such that

$$h_1 \sim_h h_2 \iff \forall e \in \mathcal{E}:\; p(e \mid \mathrm{man}(h_1)) = p(e \mid \mathrm{man}(h_2)).$$

Similarly, the fundamental causal partition of $\mathcal{E}$, denoted by $\Pi_c(\mathcal{E})$, is the partition induced by the equivalence relation $\sim_e$ such that

$$e_1 \sim_e e_2 \iff \forall h \in \mathcal{H}:\; p(e_1 \mid \mathrm{man}(h)) = p(e_2 \mid \mathrm{man}(h)).$$

We call a cell of a causal partition a causal class of hue or eda.

That is, two microvariable states $h_1, h_2 \in \mathcal{H}$ belong to the same causal class if they have the same exact effect on the microvariable eda. This implies that switching between the causal classes of hue is the only way to change $p(\mathrm{eda} \mid \mathrm{man}(\mathrm{hue}))$. The causal class is precisely the value of the macrovariable cause.

Definition 4 (Macrovariable cause and effect) The fundamental cause C is a

random variable whose value stands in a bijective relation to the causal class of H.

The fundamental effect S is a random variable whose value stands in a bijective

relation to the causal class of E. We will also use C and S to denote the functions

that map each h and e, respectively, to its causal class. We will thus write, for

example, $C(h) = c$ to indicate that the causal cell of h is c.

The standard $\mathrm{do}()$-operator is now simply defined as an intervention on such a causal macrovariable. But note that a macrovariable intervention, while well-defined in the macrovariable space, in general has multiple instantiations in the microvariable space. In our simplified example, “$\mathrm{do}(R = 0)$” can be realized by $\mathrm{man}(\mathrm{hue} = h)$ for any $h \in (90, 270)$. Macrovariables treat such distinctions as irrelevant because they make no causal difference.

Definition 5 (Macrovariable manipulation) The operation $\mathrm{do}(X = x)$ on a macrovariable is given by a manipulation of the underlying microvariable $\mathrm{man}(\mathrm{hue} = h)$ to some value h such that $X(h) = x$.

Fig. 7 Learning the causal partition. a The observational partition, obtained from the empirical distribution $p(\mathrm{eda} \mid \mathrm{hue})$ (Eq. 1), is a causal hypothesis: each cell of $\Pi_o(\mathcal{H})$ could have a different effect on the probability of cells of $\Pi_o(\mathcal{E})$. b Conducting experiments to estimate $P(E' \mid \mathrm{do}(H' = h'))$ amounts to estimating $P(E' \mid \mathrm{man}(\mathrm{hue} = h))$ for any $h \in h'$. In this case, we arbitrarily chose $\mathrm{hue} = 36, 100, 195, 320$ as representatives of the observational cells. By experimental estimate, $P(E' = 1 \mid \mathrm{man}(\mathrm{hue} = 36)) = 22/25$ (the ground-truth is 0.83), $P(E' = 1 \mid \mathrm{man}(\mathrm{hue} = 100)) = 1/25$ [0.1], $P(E' = 1 \mid \mathrm{man}(\mathrm{hue} = 195)) = 6/25$ [0.1], $P(E' = 1 \mid \mathrm{man}(\mathrm{hue} = 320)) = 23/25$ [0.83]. c The causal partition on $\mathcal{H}$ results from merging the observational cells whose representatives induce similar $P(E' \mid \mathrm{man}(\mathrm{hue}))$. Here, we show both the causal partition (in color) and 1000 samples from the causal density $p(\mathrm{eda} \mid \mathrm{man}(\mathrm{hue}))$. As expected, the sampled structure is homogeneous within each causal class. It is also different from the observational density, because the $\mathrm{man}()$ operator removes confounding. d Estimates of the macrovariable causal probabilities, obtained from experiments shown in b (ground-truth values in square brackets). Note that a close prediction of the behavior shown in c was obtained from the few samples in b



We are now ready to state our main theorem, which relates the causal and

observational partitions. It turns out that under appropriate, intuitive assumptions,

the causal partition is a coarsening of the observational partition. That is, the causal

partition aligns with the observational partition, but the observational partition may

subdivide some of the causal classes.

2.3 Set-up and definitions

We define general partitions of the microvariable space H.

Definition 6 (Partition $\Pi_f(H)$) Let $\Pi_f(H)$ be the partition on $H$ induced by the relationship $h_1 \sim h_2 \Leftrightarrow f(h_1) = f(h_2)$ for any $h_1, h_2 \in H$.

Here f stands for any function whose domain contains H. For example,

$P(lat \mid hue)$ or $P(hue)$ are such functions, where $h_1 \sim h_2$ means that $P(lat \mid h_1) = P(lat \mid h_2)$ for any value of lat. Thus, the causal and observational partition above can be rewritten as, respectively

$$\Pi_c(H) = \Pi_{P(eda \mid man(hue))}(H) \tag{3}$$

$$\Pi_o(H) = \Pi_{P(eda \mid hue)}(H) \tag{4}$$

We write $C(h)$ to denote the causal class of $h$ in $\Pi_c(H)$ and $O(h)$ to denote the observational class of $h$ in $\Pi_o(H)$.

In addition, we will make use below of a partition $\Pi_{P(hue \mid lat)}(H)$ that we refer to as the confounding partition:

$$h_1 \sim h_2 \Leftrightarrow P(h_1 \mid lat) = P(h_2 \mid lat) \quad \forall\, lat.$$
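For a finite microvariable space, the partition of Definition 6 amounts to grouping elements by their value under $f$. The following is a minimal sketch under that assumption; the name `induced_partition` is hypothetical, and $f$ is assumed to return exactly comparable, hashable values (with estimated probabilities one would instead compare up to a tolerance).

```python
from collections import defaultdict

def induced_partition(space, f):
    """Group the elements of a finite `space` into cells with equal f-value."""
    cells = defaultdict(list)
    for h in space:
        cells[f(h)].append(h)
    # Sort the cells so that the result does not depend on dictionary order.
    return sorted(cells.values())

# Toy usage: eight hue values, with f constant on each 90-degree quadrant.
quadrants = induced_partition(range(0, 360, 45), lambda h: h // 90)
```

Choosing $f$ to be the conditional $P(eda \mid hue)$, $P(eda \mid man(hue))$ or $P(hue \mid lat)$ gives the observational, causal and confounding partitions, respectively.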

We are now ready to state the theorem:

Theorem 7 (Causal coarsening theorem) Among all the joint distributions $P(hue, eda, lat)$, consider the subset that induces any fixed causal partition $\Pi_c(H)$ and a fixed confounding partition $\Pi_{P(hue \mid lat)}(H)$. The following two statements hold within this subset:

1. The subset of distributions for which $\Pi_c(H)$ is not a coarsening of the observational partition $\Pi_o(H)$ is Lebesgue measure zero, and

2. The subset of distributions for which $\Pi_c(E)$ is not a coarsening of $\Pi_o(E)$ is Lebesgue measure zero.

Proof of the theorem is provided in the mathematical Appendix.

An observational partition that is a coarsening of the microvariable space $H$ (which is the only "interesting" kind) can arise for several reasons. To have such a coarsening, the following equation must be satisfied for at least two distinct $h_1, h_2 \in H$:



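$$P(eda \mid h_1) = P(eda \mid h_2) \tag{5}$$

$$\begin{aligned}
&\Leftrightarrow \sum_{lat} \left[ P(eda \mid h_1, lat) P(lat \mid h_1) - P(eda \mid h_2, lat) P(lat \mid h_2) \right] = 0 \\
&\Leftrightarrow \sum_{lat} P(lat) \left( P(eda \mid h_1, lat) P(h_1 \mid lat) - P(eda \mid h_2, lat) P(h_2 \mid lat) \right) = 0 \\
&\Leftrightarrow \sum_{lat} \left[ P(eda \mid h_1, lat) P(h_1 \mid lat) - P(eda \mid h_2, lat) P(h_2 \mid lat) \right] = 0 \quad \text{if } P(lat) \neq 0 \;\; \forall\, lat
\end{aligned} \tag{6}$$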

Since lat is assumed to be a hidden variable there is no significance to states that

have zero probability, so the assumption on the last line is innocuous.

Consequently, for an observational partition to be a coarsening of the

microvariable space, the fundamental parameters must combine in just such a

way that Eq. 6 is satisfied. However, there is an important subclass of such

combinations that satisfy the equation due to the fact that the corresponding

fundamental parameters for h1 and h2 are equal, i.e., when

$$P(eda \mid h_1, lat) = P(eda \mid h_2, lat) \quad \forall\, lat, \qquad P(h_1 \mid lat) = P(h_2 \mid lat) \quad \forall\, lat.$$

It is these cases that are of interest to the discovery of causal macrovariables,

since—intuitively—the coarseness of the observational partition arises from causal

effects that are invariant across distinctions at the micro-level; as is the case in our

simulated example. In other cases that satisfy Eq. 6, the parameters just happen to

combine in such a way as to result in a coarse observational partition.
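The "coincidental" case can be made concrete with a small numerical sketch. All parameter values below are hypothetical and chosen only for illustration: two hue values with different fundamental parameters yield the same observational conditional, but the equality disappears as soon as one parameter is slightly changed.

```python
from fractions import Fraction as F

def p_eda_given_h(p_eda_given_h_lat, p_lat_given_h):
    """P(eda=1 | h) = sum_lat P(eda=1 | h, lat) * P(lat | h), for a binary eda."""
    return sum(a * b for a, b in zip(p_eda_given_h_lat, p_lat_given_h))

# Hypothetical parameters for two hue values and a binary hidden variable lat.
h1 = dict(eff=[F(1, 5), F(4, 5)], lat=[F(1, 2), F(1, 2)])
h2 = dict(eff=[F(1, 2), F(1, 2)], lat=[F(1, 4), F(3, 4)])

obs1 = p_eda_given_h(h1["eff"], h1["lat"])  # equals 1/2
obs2 = p_eda_given_h(h2["eff"], h2["lat"])  # equals 1/2 as well

# Perturbing P(lat | h1) breaks the coincidence.
h1_shifted_lat = [F(2, 5), F(3, 5)]
obs1_shifted = p_eda_given_h(h1["eff"], h1_shifted_lat)
```

Here `h1` and `h2` fall into the same observational cell even though none of their fundamental parameters agree, which is the kind of fine-tuned combination the text sets aside.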

The CCT shows that no matter what partitions we fix $\Pi_c$ and $\Pi_{P(hue \mid lat)}$ to, the set

of distributions consistent with these partitions has the property that the causal

partition will be a coarsening of the observational partition except for a set of

distributions that has measure zero.

In particular, if we assume that the observational partition is a coarsening of $H$ only because both the confounding partition $\Pi_{P(hue \mid lat)}$ and the causal partition $\Pi_{P(eda \mid man(hue))}$ are each coarsenings of $H$, then the theorem justifies the application

of the algorithms developed in this article to problems where the observational

partition is itself already a coarsening of the microvariable space of H. In other

words, when using CFL we assume away cases where a coarse observational

partition arises due to ‘‘coincidental’’ combinations of the fundamental parameters

that satisfy Eq. 6.

Finally, the notion of coincidence here is not measure-theoretic in the standard sense: in a standard measure-theoretic analysis, the event that two fundamental parameters are equal carries the same amount of measure as the event that a combination of parameters satisfies a particular algebraic constraint. However, our set-up takes as its starting point the assumption that there exist causal macrovariables in nature. In that case, the equality of two fundamental parameters $P(eda \mid lat, h_1) = P(eda \mid lat, h_2)$ is not coincidental but a result of a macrovariable, whereas the



satisfaction of some algebraic constraint such as Eq. (6) without equalities in the

fundamental parameters is a rare event.

The theorem allows us to use Algorithm 2 to find the macrovariable cause and

effect with small experimental effort given microvariable data. The theorem is

used in Line 2 of the algorithm. Estimating $P(E' \mid do(H' = h'_i))$ is equivalent to estimating $P(E' \mid man(hue = h_i))$ for only one, arbitrary microvariable representative of $h'_i$. This is made explicit in Fig. 7b, where we performed only four

experiments (one for each observational class) to correctly learn the causal

partition of H.
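The merging step can be sketched using the experimental estimates reported in Fig. 7b. This is not a transcription of Algorithm 2: the greedy grouping rule, the function name `merge_by_effect` and the tolerance of 1/4 are assumptions made for illustration.

```python
from fractions import Fraction as F

def merge_by_effect(estimates, tol):
    """Greedily group observational cells whose interventional estimates differ by < tol.

    `estimates` maps one representative hue per observational cell to an
    estimate of P(E'=1 | man(hue)). Returns a list of groups of representatives,
    i.e. a candidate causal partition.
    """
    groups = []
    for rep, p in sorted(estimates.items()):
        for group in groups:
            # Compare against the first member of each existing group.
            if abs(estimates[group[0]] - p) < tol:
                group.append(rep)
                break
        else:
            groups.append([rep])
    return groups

# Estimates from Fig. 7b: one experiment series (25 trials) per representative.
fig7b = {36: F(22, 25), 100: F(1, 25), 195: F(6, 25), 320: F(23, 25)}
causal_cells = merge_by_effect(fig7b, tol=F(1, 4))
```

With these numbers, the cells represented by hues 36 and 320 end up in one group and those represented by 100 and 195 in the other, matching the two ground-truth causal probabilities (0.83 and 0.1) given in the caption.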

3 CFL in the real world: assumptions and challenges

Sections 2.1 and 2.2 showed how to learn causal macrovariables from observational

microvariable data. Our toy example demonstrates that the ideas work in a very

simple setting. Chalupka et al. (2016b) used a simulated example where the input

space consisted of all binary images, and the output space of all the possible activity

patterns of a population of neurons over a 300 ms time window (see Fig. 8). In that

case, too, CFL recovered the ground-truth macrovariable causal structure—namely

that horizontal bars in an image cause a single spike in the activity of a

subpopulation of neurons, and vertical bars trigger a rhythmic activation of the

subpopulation. Finally, Chalupka et al. (2016a) showed that when CFL is applied to

real-world climate data, it recovers a commonly believed hypothesis on Pacific

wind-SST interactions: exceedingly strong westerly winds cause El Niño and strong easterly winds cause La Niña (see Fig. 3).

Notwithstanding these practical advances, CFL makes a set of assumptions that

do not necessarily hold in real-world settings. Assessing to what degree violations of

these assumptions decrease the accuracy of the algorithm is an open issue, but we

can at least lay out and discuss some of the caveats.

3.1 Discreteness of macrovariables

The essential assumption of CFL in its current form is that the macrovariables are

discrete—that is, the statistics of the system, while supervening on continuous

microvariables, can be captured by discrete variables with manageable cardinalities.

In our toy example, the continuous space H collapses to an observational

partition with four cells. Each cell consists of microvariable states that share exactly

the same $p(eda \mid hue)$. Many real-world phenomena, however, are thought to have

continuous probabilistic structure.

In Fig. 1a, for example, temperature is a continuous macrovariable. The

observational partition still exists, but it divides the state space of particle masses

and velocities into uncountably many cells. Two states belong to the same cell of

that partition if and only if the average kinetic energy of their particles is equal. This

observational partition corresponds precisely to the ‘temperature’ macrovariable.

Unfortunately, an equivalent of the Causal Coarsening Theorem for



uncountable partitions does not currently exist, so the value of such a partition for

causal discovery is unclear.

Consequently, before applying CFL it is essential to establish whether the

probabilistic structure of the problem can possibly be captured or at least well-

approximated by discrete variables. In low-dimensional domains visualization of the

data can provide guidance. Expert knowledge or physical intuition can also justify

the discreteness assumption.

3.2 Smoothness assumptions during learning

While the theory of CFL assumes a discrete macrovariable structure, Algorithm 1

makes the contrary assumption. Consider $p(eda \mid hue)$ evaluated on a fixed set of h samples as a vector-valued function of hue. Figure 5a shows that this function is discontinuous at the boundaries of the observational states (namely at $hue = 0, 90, 180, 270$). However, the learned density shown in Fig. 5b varies

continuously with hue. We chose to learn a continuous density mainly because there

are good algorithms for learning continuous conditionals (described in Sect. 2.1).

As Fig. 5b suggests, these algorithms can take sharp boundaries into account.

Nevertheless, mistakes at the boundaries are a likely artifact of the learning method.

A similar situation is encountered in neural network classification (Rumelhart

et al. 1985; Bishop 1994; Krizhevsky et al. 2012): an essentially discrete problem

(dividing the feature space into a discrete number of classes) is solved using a

Fig. 8 CFL and simulated neuroscience. Chalupka et al. (2016b) applied CFL to a simulated neuroscience dataset. a The generative model of the data: images I constitute the microvariable input. The microvariable output consists of recordings of spikes in a population of a hundred neurons over a simulated time of 300 ms (gray background plots, black dots indicate spikes, each line corresponds to a neuron). The macrovariables in the image are the presence of a horizontal and the presence of a vertical bar. The first one causes neural macrovariable 'pulse', the second causes neural '30 Hz rhythm'. The conditional probability table of the macrovariables that is shown was used to generate 10,000 datapoints in the I–J space (altogether, hundreds of dimensions). b CFL recovered the macrovariables and the probabilistic structure of the system. The histograms show the counts of microvariable states in each cell of the recovered causal partition. For example, $C = 1$ corresponds to the 'first state of the macrovariable cause'. Its histogram, with full mass in the first out of four possible bars, shows that this state contains almost exclusively images that belong to the ground-truth first causal class (that is, images with no bars)



continuous algorithm and appropriate thresholding of the final output. The success of neural networks in machine learning tasks shows that this strategy can yield good results.

3.3 Choosing the number of states

In Fig. 6, we provided the algorithm with the ground-truth number of observational

states. In practice, we want to learn the variables starting only from continuous

microvariable data—their a priori unknown cardinalities must also be discovered. A

solution we propose is to run Algorithm 1 with $N_h$ and $N_e$ (the target number of

observational classes for H and E) slightly larger than our best guess. Steps 8–17

then merge the appropriate classes to obtain the observational partition.

This procedure is based on the assumption that the density learning and

clustering steps return to a good approximation a refinement of the observational

partition. In the limit of infinite samples and a good density learning and clustering

algorithm this should always be true.

Figure 9a illustrates the result of running our algorithm on toy data with $N_h = N_e = 6$ (as opposed to the ground-truth $N_h = 4, N_e = 2$). The algorithm divided $H$ into six groups, which are close to a refinement of the true observational partition. The pink group (spanning about $hue \in (160, 190)$) crosses the true observational boundary at $hue = 180$. This error type can be ascribed to low sample numbers and

clustering mistakes and is hard to avoid given finite sampling.

Given a refinement of the observational partition on H, it is easy to recover the

true observational partition. If any two clusters $h'_i$ and $h'_j$ are subsets of the same observational state, then $P(E' \mid h'_i)$ should be similar to $P(E' \mid h'_j)$, where $E'$ is (the refinement of) the observational partition on $E$. In Fig. 9b, the i-th column corresponds to the empirical $P(E' \mid h'_i)$ where $E'$ and $H'$ are the six-state observational variables. Fig. 9c shows the result of merging those $H'$ states whose corresponding

columns in Fig. 9b have earth mover’s distance (Levina and Bickel 2001) smaller

than 0.2. Due to sampling errors, it deviates from the ground-truth slightly, but

contains four states as expected.
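The column-merging step can be sketched as follows. The 0.2 threshold is the one quoted in the text; everything else is an assumption for illustration: the example columns are made up, the earth mover's distance is computed in its one-dimensional form with unit distance between adjacent cluster indices, and the greedy grouping rule and function names are hypothetical.

```python
def emd_1d(p, q):
    """Earth mover's distance between two histograms on the same ordered bins
    (unit spacing): the sum of absolute differences of their cumulative sums."""
    total, cp, cq = 0.0, 0.0, 0.0
    for a, b in zip(p, q):
        cp += a
        cq += b
        total += abs(cp - cq)
    return total

def merge_columns(columns, threshold):
    """Group column indices whose conditionals P(E' | h'_i) lie within
    `threshold` (in EMD) of the first column of an existing group."""
    groups = []
    for i, col in enumerate(columns):
        for group in groups:
            if emd_1d(columns[group[0]], col) < threshold:
                group.append(i)
                break
        else:
            groups.append([i])
    return groups

# Hypothetical conditionals over three E' states for four over-clustered H' states.
columns = [
    [0.80, 0.10, 0.10],
    [0.10, 0.80, 0.10],
    [0.15, 0.75, 0.10],
    [0.10, 0.10, 0.80],
]
merged = merge_columns(columns, threshold=0.2)
```

In this toy input, columns 1 and 2 are merged because their distance is about 0.05, while the other columns stay separate.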

Figure 9d–f shows the merging process for E. The end result is almost exactly the

ground-truth. The merging process for E requires a slight modification, since one

cannot simply merge $e'_i$ and $e'_j$ when $P(e'_i \mid h') = P(e'_j \mid h')$ for any $h'$. To see why, consider clusters $e'_0$ and $e'_1$ in Fig. 9d (counting from the top of the plot, the dark green and the brown clusters). Both clusters are subsets of the same ground-truth observational cell, but $e'_0$ consists of significantly fewer samples. As a result, the vector $[P(e'_0 \mid h'_0), \ldots, P(e'_0 \mid h'_5)]$ is a scaled version of $[P(e'_1 \mid h'_0), \ldots, P(e'_1 \mid h'_5)]$ (see the first and second rows in Fig. 9b). Normalizing the likelihood vectors such that they

sum to 1 solves the problem. It is always true that if two clusters are subsets of the

same observational class, their normalized likelihood vectors are (in the limit of

infinite sample size) the same.
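The effect of the normalization can be seen in a two-line sketch. The likelihood values are hypothetical, and exact fractions stand in for the infinite-sample limit.

```python
from fractions import Fraction as F

def normalize(vec):
    """Rescale a likelihood vector [P(e' | h'_0), ..., P(e' | h'_n)] to sum to 1."""
    s = sum(vec)
    return [v / s for v in vec]

# Hypothetical rows: e1 belongs to the same observational cell as e0
# but contains four times as many samples, so its row is a scaled copy.
e0 = [F(1, 40), F(2, 40), F(3, 40)]
e1 = [F(1, 10), F(2, 10), F(3, 10)]
same_shape = normalize(e0) == normalize(e1)
```

The raw rows differ, so a direct comparison would keep the clusters apart; after normalization they coincide and the clusters can be merged.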



4 Discussion

Causal feature learning can learn high-level causal knowledge from low-level data

in an automatic, unbiased manner. We provided the necessary theoretical

framework and demonstrated, on an easy-to-visualize example, how modern

machine learning techniques can put the theory into practice. We pointed to

applications of CFL to complex data where the method successfully recovered a

known ground-truth. Finally, we discussed in detail the assumptions that CFL makes

on the domain of application.

4.1 Related work

CFL is directly inspired by the theory of computational mechanics (Shalizi 2001;

Shalizi and Crutchfield 2001; Shalizi and Moore 2003). Computational mechanics

defines macrovariable states in time series in terms of equivalences of conditional

Fig. 9 Observational partition learning: unknown number of states. a We clustered hue w.r.t. $p(eda \mid hue)$ into six (as opposed to the ground-truth four) clusters $H' = 0, \ldots, 5$. With enough samples and good density learning, the overclustering should refine the true observational partition. We repeated the procedure in E, clustering reaction states into six (instead of two) clusters $E'$. b Computing empirical $P(E' \mid H')$ shows that $H' = 1, 2$ induce similar conditionals on $E'$. $H' = 4, 5$ also induce similar probabilities. c Merging clusters with similar conditionals brings us close to the ground-truth observational partition (compare with Fig. 6). We merged clusters whose earth mover's distance is less than 0.1. d–f A similar merging procedure is repeated for $E'$ clusters. In this case, we renormalized the likelihoods to account for different sample counts in clusters with the same $P(E' \mid H')$ (see text for details). Two of the merged clusters correspond well to the ground-truth. Because of sample size issues a small additional cluster at the boundary of the true classes was detected (colored pink)



probabilities. However, in computational mechanics macrovariables are meant to

‘summarize’ the phenomena rather than to support causal reasoning. The

interventional/observational distinction, central to most analyses of causation, is

also lost in heuristic approaches such as that of Hoel et al. (2013). We have adapted

and developed the underlying theory to capture the more general causal setting.

Psychometric models (see Reise et al. 2010 for a good introduction) have a

similar structure to the one presented here (see Fig. 10). There is, however, one

crucial difference. In psychometric models, it is assumed that there are unobserved

variables (such as IQ or some emotional or cognitive state; the dashed variables in

Fig. 10) that stand in some unknown causal relation to one another. The only access

the researcher has to these unobserved variables is through a set of measurement

variables (the $X_i$), e.g., through a questionnaire in which each question is supposed to

provide an indication of the state of (one or more of) the unobserved psychological

variable(s).

As the causal representations already indicate, the measurements are generally

considered to be causal effects of the unobserved variables, in contrast to the

constitutive relation we have considered in this article between the macro- and

microvariables. That is, it is generally thought that specific answers to a

questionnaire are causal consequences of IQ, rather than that having a high IQ

constitutes answering a question in a particular way. In contrast, temperature is thought to be constituted by the mean kinetic energy of a particle ensemble; it is not a consequence of it. But as the example with IQ already suggests, the boundary between causal effects and constitutive relationships in the psychological setting may not always be as sharp. It suffices then to say here that the way we have defined

interventions in this article requires that the relationship between the macro- and the

microvariables is constitutive, since we define an intervention on a macrovariable as

Fig. 10 Example of a psychometric model. Unobserved cognitive variables $L_i$ stand in some unknown causal relation (Reise et al. 2010; Silva et al. 2006). The researcher can only learn about these causal relations via an analysis of the observed measurement variables $X_i$, which are thought to be causal effects of one or more of the unobserved variables. In general, the relation between the $L_i$ and $X_i$ is not thought to be constitutive, in contrast to the case of macro- and microvariables considered in this article



a class of interventions on the microvariable space. If the relations between

unobserved variables and measurement variables in a psychometric system are causal

rather than constitutive, our learning tools may nevertheless prove useful. It remains

an area of application that we have yet to explore.

Work on learning causal graphs over high-dimensional variables (Entner and

Hoyer 2012; Parviainen and Kaski 2015) is orthogonal to ours. The general idea of

these approaches is to extend standard causal discovery methods (Spirtes et al.

2000) to the high-dimensional case. If the high-dimensional variables are

interpreted as microvariables, CFL could be applied ‘‘on top’’ of each causal link

in such a graph to compress it to retain only causal information.

Active learning and automatic experimental design (see for example Chaloner

and Verdinelli 1995; Tong and Koller 2001; Srinivas et al. 2010; Snoek et al. 2012)

share CFL’s goal of decreasing experimental effort in scientific discovery. CFL and

active learning are applicable in complementary situations. CFL serves its purpose

best if microvariable observational data is easy to obtain and/or we suspect the

presence of macrovariables driving the system. Active learning is applicable to

macro-level, experimental data.

4.2 Open problems

The following problems point to future work that would significantly extend the

range of domains CFL can be applied to.

1. Learning without experimentation. Currently, testing the causal hypothesis

requires intervention: one experiment per observational class. However,

in the field of causal discovery there are methods to reject causal hypotheses

based on observational data only. These are either based on the independence

structure of the generative distributions (Spirtes et al. 2000; Chickering 2002;

Silander and Myllymaki 2006; Claassen and Heskes 2012; Hyttinen et al. 2014)

or assumptions about the functional form of the structural equations that govern

the system (Shimizu et al. 2006; Hoyer et al. 2009; Mooij et al. 2011). None of

these methods can be directly applied to the formation of causal macrovariables; they all assume the causal variables of interest are given. Extending

these ideas to the CFL framework would make it useful in domains where direct

experimentation is expensive (medicine) or impossible (climate science).

2. Continuous macrovariables. Section 3 discussed the discreteness assumption

in CFL. A logical next step is to extend the framework to systems where the

macrovariables are continuous, or hybrid systems. Whereas definitions of

continuous macrovariables do not pose a challenge, extending the Causal

Coarsening Theorem—which makes the framework useful—appears nontrivial.

3. Cyclic microvariable graphs. Like the majority of work in causal inference

and discovery—notable exceptions being (Richardson 1996; Mooij et al. 2011;

Lacerda et al. 2012; Hyttinen et al. 2012, 2014)—CFL assumes that the

microvariable system is acyclic: in our toy example, hue has causal influence on

eda, but we assumed (quite plausibly) that eda is not a cause of hue. This

assumption is not always warranted. For example, in the wind-temperature



climate system discussed in Chalupka et al. (2016a) (see also Sect. 3), the

system definitely experiences feedback (over time). While cyclicity may not

break the CCT or our algorithms, the proof of such a claim has not been

completed.

4.3 Brief philosophical considerations

Our goal has been to provide a rigorous and objective account of causal

macrovariables as they occur in the sciences. Motivation has come from examples,

such as temperature that supervenes on the kinetic energy of particles. Just as it is

the room temperature—the mean kinetic energy of the particles—that triggers the

air conditioning rather than the exact distribution of particle velocities in the room,

we have defined causal macrovariables as aggregates of those microvariables that

have the same causal consequences.

Nontrivial macrovariables exist to the extent that there are such equivalent

microstates. There is a sense in which the occurrence of macrovariables is a measure

zero event. Whether this licenses inferences to the existence of macrovariables in

practice depends on the appropriateness of the measure for the description of our

world. But there is a further consideration worth noting: We claimed that it was in

fact the mean kinetic energy and not the exact distribution of kinetic energies of the

particles that determined whether the air conditioning was triggered.

But perhaps that is not quite right. After all, it is the specific movement of the

particles close to the sensor that triggers the air conditioning. We could maintain the

mean kinetic energy of the particles in the room overall constant, while significantly

changing the velocities of the particles close to the sensor. In that case one may

argue that temperature is not a causal macrovariable in this system. Another view is

to say that these sorts of microstates are extraordinarily improbable, and therefore

can be neglected. Assuming that such a view can be properly formalized, the

macrovariable temperature then does not have a completely clean delineation in

terms of its causal consequences. There will be a few microstates within each of its

macrostates that have very different causal consequences from the other microstates

within the same macrostate. Metaphorically, the macrovariable is a little bit ‘‘fuzzy

around the edges’’. Such a metaphysical account of macrovariables may be

anathema to many, but we note that our epistemology—our learning method—is

unable to distinguish between these and sharply delineated macrovariables since we

will in practice never be able to investigate all possible microstates.

5 Mathematical appendix

In this section, we prove the Causal Coarsening Theorem. To (significantly) simplify the notation and to relate this proof more easily to earlier ones, we use the same variable names we used in Chalupka et al. (2016b):

1. the microvariable cause is $I \in \mathcal{I}$,

2. the microvariable effect is $J \in \mathcal{J}$,

3. the confounder is $H$.

These correspond to hue, eda, lat, respectively. Before we prove the CCT, we prove

a simpler Lemma.

Lemma 8 Let $S_{P(H)}$ denote the simplex of multinomial distributions over the values of $H$. For fixed $P(J \mid H, I)$, the subset of $S_{P(H)}$ for which $\Pi_c$ is not equal to $\Pi_{P(J \mid H, I)}(\mathcal{I})$ is measure zero.

Proof We want to show that the subset of $S_{P(H)}$ for which, for any $i_1, i_2 \in \mathcal{I}$ and $h \in H$,

$$P(J \mid H = h, i_1) \neq P(J \mid H = h, i_2), \text{ and} \tag{7}$$

$$P(J \mid man(i_1)) = P(J \mid man(i_2)), \tag{8}$$

is measure zero. (Note that if $P(J \mid H, I)$ is the same for all $i$, equality of $\Pi_c$ and $\Pi_{P(J \mid H, I)}$ follows directly from their definitions.)

Equation 8 is equivalent to

$$\sum_h P(H = h) \left[ P(J \mid H = h, i_1) - P(J \mid H = h, i_2) \right] = 0.$$

Since this is a linear constraint on $S_{P(H)}$, to show that it is satisfied on a measure zero subset we only need to show that there is at least one point which does not satisfy it. First, set $P(H = h) = 1/K$ for all $h$, where $K$ is the number of states of $H$. If the equation is not satisfied, we are done. If it is satisfied, then for some $h_1$ we must have $P(J \mid H = h_1, i_1) - P(J \mid H = h_1, i_2) > 0$ and for some $h_2$, $P(J \mid H = h_2, i_1) - P(J \mid H = h_2, i_2) < 0$. Pick any $0 < \epsilon < \min(1/K, 1 - 1/K)$. Set $P(H = h_1) = 1/K + \epsilon$ and $P(H = h_2) = 1/K - \epsilon$, and $P(H = h) = 1/K$ for other $h$. Then Eq. (8) does not hold. $\square$
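The perturbation argument in the proof can be checked on a small instance. The differences $P(j \mid H = h, i_1) - P(j \mid H = h, i_2)$ below are hypothetical values for a single fixed $j$ and $K = 3$; the function name `linear_constraint` is likewise an illustrative choice.

```python
from fractions import Fraction as F

def linear_constraint(p_h, diff):
    """Left-hand side of the linear form of Eq. 8:
    sum_h P(H=h) * [P(j | H=h, i1) - P(j | H=h, i2)]."""
    return sum(p * d for p, d in zip(p_h, diff))

K = 3
diff = [F(1, 5), F(-1, 5), F(0)]   # hypothetical differences, one per value of H
uniform = [F(1, K)] * K             # the first candidate point, P(H=h) = 1/K

eps = F(1, 10)                      # satisfies 0 < eps < min(1/K, 1 - 1/K)
perturbed = [F(1, K) + eps, F(1, K) - eps, F(1, K)]

at_uniform = linear_constraint(uniform, diff)
at_perturbed = linear_constraint(perturbed, diff)
```

At the uniform point the positive and negative terms cancel, so the constraint holds; moving mass $\epsilon$ from the negative to the positive term leaves a valid distribution on which it fails.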

Causal coarsening theorem Among all the joint distributions $P(J, H, I)$ consider the subset that induces any fixed causal partition $\Pi_c(\mathcal{I})$ and a fixed confounding partition $\Pi_{P(I \mid H)}(\mathcal{I})$. The following two statements hold within this subset:

1. The subset of distributions that induce a fundamental causal partition $\Pi_c(\mathcal{I})$ that is not a coarsening of the observational partition $\Pi_o(\mathcal{I})$ is Lebesgue measure zero, and

2. The subset of distributions that induce a fundamental causal partition $\Pi_c(\mathcal{J})$ that is not a coarsening of $\Pi_o(\mathcal{J})$ is Lebesgue measure zero.

Proof 1. We first prove that $\Pi_c(\mathcal{I})$ is almost always a coarsening of $\Pi_o(\mathcal{I})$.

(i) First set up the notation. Let $H$ be the hidden variable of the system, with cardinality $K$; let $J$ have cardinality $N$ and $I$ cardinality $M$. We can factorize the joint on $I, J, H$ as $P(J, I, H) = P(J \mid H, I) P(I \mid H) P(H)$. $P(J \mid H, I)$ can be parametrized by $(N - 1) \times K \times M$ parameters, $P(I \mid H)$ by $(M - 1) \times K$ parameters, and $P(H)$ by $K - 1$ parameters, all of which are independent. Call the parameters, respectively,

$$\alpha_{j,h,i} \triangleq P(J = j \mid H = h, I = i), \qquad \beta_{i,h} \triangleq P(I = i \mid H = h), \qquad \gamma_h \triangleq P(H = h).$$

We will denote parameter vectors as

$$\alpha = (\alpha_{j_1,h_1,i_1}, \ldots, \alpha_{j_{N-1},h_K,i_M}) \in \mathbb{R}^{(N-1) \times K \times M}$$

$$\beta = (\beta_{i_1,h_1}, \ldots, \beta_{i_{M-1},h_K}) \in \mathbb{R}^{(M-1) \times K}$$

$$\gamma = (\gamma_{h_1}, \ldots, \gamma_{h_{K-1}}) \in \mathbb{R}^{K-1},$$

where the indices are arranged in lexicographical order. This creates a one-to-one correspondence of each possible joint distribution $P(J, H, I)$ with a point $(\alpha, \beta, \gamma) \in P[\alpha, \beta, \gamma] \subset \mathbb{R}^{(N-1) \times K^2 (K-1) \times M (M-1)}$.

(ii) Show that for any $\alpha, \beta$ consistent with $\Pi_c$ and $\Pi_{P(I \mid H)}$, the causal partition

and the confounding partition are, in general, fixed. To proceed with the proof, pick any point in the $P(J \mid H, I) \times P(I \mid H)$ space, that is, fix $\alpha$ and $\beta$. The only remaining free parameters are now in $\gamma$. Varying these values creates a subset of the space of all joints isometric to the $(K - 1)$-dimensional simplex of multinomial distributions over $K$ states (call the simplex $S_{K-1}$):

$$P[\gamma; \alpha, \beta] = \{ (\alpha, \beta, \gamma) \mid \gamma \in S_{K-1} \} \subset [0, 1]^{(K-1)}.$$

Note that fixing $\beta$ directly fixes $\Pi_{P(I \mid H)}$. Fixing $\alpha$ does not directly fix $\Pi_c$. But by Lemma 8, for almost all distributions in $P[\gamma; \alpha, \beta]$ the causal partition $\Pi_c$ equals the partition $\Pi_{P(J \mid H, I)}$, which is directly fixed by $\alpha$. Let $P'[\gamma; \alpha, \beta]$ be $P[\gamma; \alpha, \beta]$ minus this measure zero subset. The statement of the theorem fixes $\Pi_c$ and $\Pi_{P(I \mid H)}$. If the $\alpha, \beta$ we picked are consistent with these partitions within $P'[\gamma; \alpha, \beta]$, continue with the proof. Otherwise, choose other $\alpha, \beta$. We now prove that within $P'[\gamma; \alpha, \beta]$ the set of $\gamma$ for which the causal partition $\Pi_c(\mathcal{I})$ is not a coarsening of the observational partition $\Pi_o(\mathcal{I})$ is of measure zero. Later in (iv) we integrate the result over all $\alpha, \beta$.

(iii) Let the causal coarsening constraint be that for $i_1, i_2 \in \mathcal{I}$, we have

$$O(i_1) = O(i_2) \Rightarrow C(i_1) = C(i_2). \tag{9}$$

That is, it is not the case that two members of $\mathcal{I}$ are observationally equivalent but have causally different effects. We will show that for any $i_1, i_2$ pair, the constraint holds for almost any $\gamma$. Pick any $i_1, i_2 \in \mathcal{I}$. If $C(i_1) = C(i_2)$, then we are done with this pair. So assume that there is a causal difference, i.e., $C(i_1) \neq C(i_2)$. Our goal is now to show that then only a measure zero subset of $P'[\gamma; \alpha, \beta]$ allows for $O(i_1) = O(i_2)$. We first show that $O(i_1) = O(i_2)$ places a polynomial constraint on $P'[\gamma; \alpha, \beta]$. By definition of the observational class, the equality implies that for any $j$,



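$$\frac{1}{P(i_1)} \sum_h \alpha_{j,h,i_1} \beta_{i_1,h} \gamma_h = \frac{1}{P(i_2)} \sum_h \alpha_{j,h,i_2} \beta_{i_2,h} \gamma_h.$$

Picking an arbitrary $j$ and expanding in terms of $\alpha, \beta, \gamma$, we have

$$O(i_1) = O(i_2) \Leftrightarrow \sum_{h_k, h_l} \gamma_{h_k} \gamma_{h_l} \left[ \beta_{i_2,h_k} \beta_{i_1,h_l} \alpha_{j,h_l,i_1} - \beta_{i_1,h_k} \beta_{i_2,h_l} \alpha_{j,h_l,i_2} \right] = 0. \tag{10}$$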
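The expansion leading to Eq. (10) can be verified numerically: with $P(i) = \sum_h \beta_{i,h} \gamma_h$, the double sum equals $P(i_1) P(i_2) \left[ P(j \mid i_1) - P(j \mid i_2) \right]$, so it vanishes exactly when the two conditionals agree for the chosen $j$. The sketch below checks this identity for one hypothetical parameter setting with $K = 2$; the function names and all numbers are illustrative assumptions.

```python
from fractions import Fraction as F
from itertools import product

def eq10_lhs(alpha1, alpha2, beta1, beta2, gamma):
    """Double sum of Eq. 10 for a fixed j.
    alpha1[h] = alpha_{j,h,i1}, beta1[h] = beta_{i1,h}, and likewise for i2."""
    idx = range(len(gamma))
    return sum(
        gamma[k] * gamma[l]
        * (beta2[k] * beta1[l] * alpha1[l] - beta1[k] * beta2[l] * alpha2[l])
        for k, l in product(idx, idx)
    )

def p_i(beta, gamma):
    """P(i) = sum_h beta_{i,h} gamma_h."""
    return sum(b * g for b, g in zip(beta, gamma))

def p_j_given_i(alpha, beta, gamma):
    """P(j | i) = (1 / P(i)) * sum_h alpha_{j,h,i} beta_{i,h} gamma_h."""
    return sum(a * b * g for a, b, g in zip(alpha, beta, gamma)) / p_i(beta, gamma)

# Hypothetical parameters for K = 2 hidden states.
gamma = [F(1, 2), F(1, 2)]
beta1, beta2 = [F(1, 4), F(1, 2)], [F(1, 2), F(1, 4)]
alpha1, alpha2 = [F(1, 5), F(3, 5)], [F(2, 5), F(1, 5)]

lhs = eq10_lhs(alpha1, alpha2, beta1, beta2, gamma)
rhs = p_i(beta1, gamma) * p_i(beta2, gamma) * (
    p_j_given_i(alpha1, beta1, gamma) - p_j_given_i(alpha2, beta2, gamma)
)
```

For these values both sides are nonzero and equal, so this particular $\gamma$ is a point at which $i_1$ and $i_2$ are observationally distinguishable.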

We have thus shown that, for fixed $\alpha, \beta$ and $i_1, i_2$, the violation of the causal coarsening constraint (9) is a polynomial constraint on $P'[\gamma; \alpha, \beta]$. By an algebraic lemma (proven by Okamoto 1973), the subset on which the constraint holds is measure zero if the constraint is not trivial. That is, we only need to find one $\gamma$ for which Eq. (10) does not hold to prove that it almost never holds. To find such $\gamma$, let $\gamma_h = 1/K$ for all $h$. If for this $\gamma$ Eq. (10) does not hold, we are done. If it does hold, since we know $\alpha$ is not all 0, there must be in the sum of the equation at least one factor $[\beta_{i_2,h_k} \beta_{i_1,h_l} \alpha_{j,h_l,i_1} - \beta_{i_1,h_k} \beta_{i_2,h_l} \alpha_{j,h_l,i_2}]$ which is positive, and one that is negative. Call the $h_k, h_l$ corresponding to the positive element $h_{k^+}, h_{l^+}$ and to the negative element $h_{k^-}, h_{l^-}$. Since the factors are different, we must have either $k^+ \neq k^-$ or $l^+ \neq l^-$ (or both). Assume $k^+ \neq k^-$. Now, pick any positive $\epsilon < \min(1/K, 1 - 1/K)$. Set $\gamma_h = 1/K$ for all $h \neq h_{k^+}, h_{k^-}$ and set $\gamma_{h_{k^+}} = \frac{1}{K} + \epsilon$ and $\gamma_{h_{k^-}} = \frac{1}{K} - \epsilon$. In this way, we keep $\sum_h \gamma_h$ unchanged, and are guaranteed that Eq. (10) does not hold. That is, for this $\gamma$ we have $O(i_1) \neq O(i_2)$ (see footnote 3).

(iv) Show that the theorem holds over the space of all distributions.

To reiterate proof progress thus far:

1. We fixed the macro-scale causal partition $\Pi_c$ and the confounding partition $\Pi_{P(I \mid H)}$ and picked arbitrary $\alpha$ and $\beta$ compatible with these partitions.

2. We picked two points $i_1, i_2$ for which $C(i_1) \neq C(i_2)$.

3. We showed that for any such two points, the subset of $P'[\gamma; \alpha, \beta]$ for which $O(i_1) = O(i_2)$ is measure zero.

Since there are only finitely many points in $\mathcal{I}$, it follows that for the fixed $\alpha, \beta$, the subset of $P'[\gamma; \alpha, \beta]$ on which the coarsening constraint (9) does not hold for at least one pair of points is also measure zero. Since $P[\gamma; \alpha, \beta] \setminus P'[\gamma; \alpha, \beta]$ is a set of measure zero, the subset of $P[\gamma; \alpha, \beta]$ on which the causal coarsening constraint does not hold is also measure zero.

3 It is possible that this $\gamma$ is not in $P'[\gamma; \alpha, \beta]$. However, it is guaranteed to be in $P[\gamma; \alpha, \beta]$. Since a subset of measure zero in $P[\gamma; \alpha, \beta]$ is also measure zero in $P'[\gamma; \alpha, \beta]$, this does not influence the proof.

Now, call the set of all joint distributions that agree with $\Pi_c$ and $\Pi_{P(I \mid H)}$ the admittable set, and denote it with $P[\alpha, \beta, \gamma]_A$. For each $\alpha, \beta$ consistent with the two partitions, call the (measure zero) subset of $P[\gamma; \alpha, \beta]_A$ that violates the causal coarsening constraint $z[\alpha, \beta]$. Let $Z = \bigcup_{\alpha, \beta} z[\alpha, \beta] \subset P[\alpha, \beta, \gamma]_A$ be the set of all the admittable joint distributions which violate the causal coarsening constraint. We want to prove that $\mu(Z) = 0$, where $\mu$ is the Lebesgue measure. To show this, we will use the indicator function

$$z(\alpha, \beta, \gamma) = \begin{cases} 1 & \text{if } \gamma \in z[\alpha, \beta], \\ 0 & \text{otherwise.} \end{cases}$$

By basic properties of positive measures, we have

$$\mu(Z) = \int_{P[\alpha, \beta, \gamma]_A} z \, d\mu.$$

For simplicity of notation, let

1. $A \subset \mathbb{R}^{K \times N}$ be the set of all possible $\alpha$'s (a Cartesian product of $K \times N$ 1-d simplexes);

2. $B \subset \mathbb{R}^{N \times K}$ be the set of all possible $\beta$'s (a Cartesian product of $K$ simplexes, each $N - 1$ dimensional);

3. $\Gamma \subset \mathbb{R}^{K}$ be the set of all possible $\gamma$'s (a $(K - 1)$-dimensional simplex).

Note that each set has, in its respective Euclidean space, a nonempty interior, and

comes equipped with the Lebesgue measure.

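Finally, let $I_A(\alpha, \beta)$ be the indicator function that evaluates to 1 if $\alpha, \beta$ are admittable and evaluates to 0 otherwise. We have

$$\begin{aligned}
\int_{P[\alpha, \beta, \gamma]_A} z \, d\mu &= \int_{A \times B \times \Gamma} z(\alpha, \beta, \gamma) \, I_A(\alpha, \beta) \, d(\gamma, \beta, \alpha) \\
&= \int_{A \times B} \int_{\Gamma} z(\alpha, \beta, \gamma) \, d(\gamma) \, I_A(\alpha, \beta) \, d(\beta, \alpha) \\
&= \int_{A \times B} \mu(z[\alpha, \beta]) \, I_A(\alpha, \beta) \, d(\beta, \alpha) \\
&= \int_{A \times B} 0 \cdot I_A(\alpha, \beta) \, d(\beta, \alpha) \\
&= 0.
\end{aligned} \tag{11}$$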

Equation (11) follows as $z$ restricted to $P[\gamma; \alpha, \beta]$ is the indicator function of $z[\alpha, \beta]$. This completes the proof that $Z$, the set of joint distributions over $T$, $H$ and $I$ that violate the causal coarsening constraint (9), is measure zero.
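The passage from the first to the second line of Eq. (11) rewrites a joint integral as an iterated one. One way to justify this step, under the assumption (not stated explicitly in the text) that the indicator $z$ is measurable, is Tonelli's theorem for nonnegative functions. The following is a sketch of that step, not a full measure-theoretic argument:

```latex
% Assumption: z is a nonnegative measurable function on A x B x Gamma,
% so Tonelli's theorem allows integrating over gamma first.
\int_{A \times B \times \Gamma} z(\alpha,\beta,\gamma)\, I_A(\alpha,\beta)\, d(\gamma,\beta,\alpha)
  = \int_{A \times B} \Big( \int_{\Gamma} z(\alpha,\beta,\gamma)\, d\gamma \Big)
    I_A(\alpha,\beta)\, d(\beta,\alpha),
\qquad
\int_{\Gamma} z(\alpha,\beta,\gamma)\, d\gamma = \mu\big(z[\alpha,\beta]\big) = 0 .
```

The inner integral vanishes for every admittable pair $(\alpha, \beta)$ because $z[\alpha, \beta]$ was shown to be measure zero, which is what makes the outer integral zero as well.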

(2) We use the same strategy as above, with some differences in the details of the

algebra of the polynomial constraint in step (iii):

(iii) Let the causal coarsening constraint be that for $j_1, j_2 \in \mathcal{J}$, we have

$$O(j_1) = O(j_2) \Rightarrow C(j_1) = C(j_2). \tag{12}$$

That is, it is not the case that two members of J are observationally equivalent but

have different likelihoods of causation.

We show that the causal coarsening constraint holds for each pair $j_1, j_2 \in \mathcal{J}$: Pick any $j_1, j_2 \in \mathcal{J}$. If $C(j_1) = C(j_2)$, then we are done with this pair. So assume that there is a causal difference, i.e., $C(j_1) \neq C(j_2)$. Our goal is now to show that then only a measure zero subset of $P'[\gamma; \alpha, \beta]$ allows for $O(j_1) = O(j_2)$.

We first show that $O(j_1) = O(j_2)$ places a polynomial constraint on $P'[\gamma; \alpha, \beta]$. By definition, we have

$$O(j_1) = O(j_2) \iff \forall i \quad \sum_h \alpha_{j_1, h, i} \, \beta_{i, h} \, \gamma_h = \sum_h \alpha_{j_2, h, i} \, \beta_{i, h} \, \gamma_h.$$

Pick an arbitrary $i$. The above equation places the following polynomial constraint, for this $i$, on $P'[\gamma; \alpha, \beta]$:

$$\sum_h \gamma_h \left( \alpha_{j_1, h, i} \, \beta_{i, h} - \alpha_{j_2, h, i} \, \beta_{i, h} \right) = 0. \tag{13}$$
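To make the form of Eq. (13) concrete, the following minimal sketch evaluates its left-hand side numerically. The array layout (`alpha[j, h, i]`, `beta[i, h]`, `gamma[h]`), the sizes, and the random values are assumptions chosen for illustration only; they are arbitrary nonnegative stand-ins rather than parameters of any model from the paper.

```python
import numpy as np

def eq13_lhs(alpha, beta, gamma, j1, j2, i):
    """Left-hand side of Eq. (13) for a fixed i.

    Assumed (hypothetical) layout: alpha[j, h, i], beta[i, h], gamma[h].
    """
    # Coefficient of gamma_h: alpha_{j1,h,i} beta_{i,h} - alpha_{j2,h,i} beta_{i,h}
    coeff = alpha[j1, :, i] * beta[i, :] - alpha[j2, :, i] * beta[i, :]
    return float(np.dot(gamma, coeff))

def weighted_sum(alpha, beta, gamma, j, i):
    """Sum over h of alpha_{j,h,i} * beta_{i,h} * gamma_h."""
    return float(np.sum(alpha[j, :, i] * beta[i, :] * gamma))

rng = np.random.default_rng(0)
n_j, K, n_i = 4, 3, 5                  # illustrative sizes, not from the paper
alpha = rng.random((n_j, K, n_i))      # arbitrary nonnegative stand-in values
beta = rng.random((n_i, K))
gamma = rng.dirichlet(np.ones(K))      # a point on the (K-1)-simplex

lhs = eq13_lhs(alpha, beta, gamma, j1=0, j2=1, i=2)
diff = weighted_sum(alpha, beta, gamma, 0, 2) - weighted_sum(alpha, beta, gamma, 1, 2)
```

Here `lhs` and `diff` agree up to floating-point error, reflecting that Eq. (13) is the difference of the two sums in the preceding equivalence. For fixed `alpha` and `beta` the expression is linear in `gamma`.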

We have thus shown that for fixed $\alpha, \beta$ and $j_1, j_2$, the violation of the causal coarsening constraint (12) places a polynomial constraint on $P'[\gamma; \alpha, \beta]$. By an algebraic lemma (proven by Okamoto 1973), the subset on which the constraint holds is measure zero if the constraint is not trivial. That is, we only need to find one $\gamma$ for which Eq. (13) does not hold to prove that it almost never holds.
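The measure-zero claim can be illustrated, though of course not proven, by sampling. The sketch below draws $\gamma$ uniformly from the simplex and reports the fraction of samples that come within a tolerance of satisfying a non-trivial linear constraint $\sum_h \gamma_h d_h = 0$. The coefficient vector `d`, the tolerances, and the sample size are hypothetical choices made for the example.

```python
import numpy as np

def fraction_near_zero(d, tols, n_samples=100_000, seed=0):
    """Fraction of uniformly sampled gamma with |sum_h gamma_h d_h| < tol."""
    rng = np.random.default_rng(seed)
    # Dirichlet(1, ..., 1) is the uniform distribution on the simplex.
    gammas = rng.dirichlet(np.ones(len(d)), size=n_samples)
    values = np.abs(gammas @ np.asarray(d, dtype=float))
    return [float(np.mean(values < tol)) for tol in tols]

d = [1.0, -1.0, 0.5]        # hypothetical non-trivial coefficients
tols = [1e-1, 1e-2, 1e-3]   # shrinking bands around the constraint surface
fractions = fraction_near_zero(d, tols)
```

The fractions shrink as the tolerance shrinks, which is the behaviour one expects when the set on which the constraint holds exactly has Lebesgue measure zero.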

To find such $\gamma$, let $\gamma_h = 1/K$ for all $h$. If for this $\gamma$ Eq. (13) does not hold, we are done. If it does hold, since we know $\alpha$ is not all 0, there must be in the sum of the equation at least one term $[\alpha_{j_1, h, i} \, \beta_{i, h} - \alpha_{j_2, h, i} \, \beta_{i, h}]$ which is positive, and one that is negative. Call the $h$ corresponding to the positive element $h^+$ and to the negative element $h^-$. Pick any positive $\epsilon < \min(1/K, 1 - 1/K)$. Set $\gamma_h = 1/K$ for all $h \neq h^+, h^-$, and set $\gamma_{h^+} = \frac{1}{K} + \epsilon$ and $\gamma_{h^-} = \frac{1}{K} - \epsilon$. In this way, we keep $\sum_h \gamma_h$ unchanged, and are guaranteed that Eq. (13) does not hold. That is, for this $\gamma$ we have $O(j_1) \neq O(j_2)$.⁴ □
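The perturbation used at the end of the proof can be sketched in a few lines. The function name `witness_gamma`, the coefficient vector `d` (standing in for the bracketed terms of Eq. (13) at a fixed $i$), and the choice of $\epsilon$ as half the allowed bound are assumptions made for this illustration.

```python
import numpy as np

def witness_gamma(d, frac=0.5):
    """Return a gamma on the simplex with sum_h gamma_h * d_h != 0.

    d[h] plays the role of the h-th bracketed term in Eq. (13).
    frac in (0, 1) scales epsilon below min(1/K, 1 - 1/K).
    """
    d = np.asarray(d, dtype=float)
    K = d.size
    gamma = np.full(K, 1.0 / K)
    if not np.isclose(d @ gamma, 0.0):
        return gamma  # the uniform gamma already violates the constraint
    h_plus, h_minus = int(np.argmax(d)), int(np.argmin(d))
    if d[h_plus] <= 0 or d[h_minus] >= 0:
        raise ValueError("d needs one positive and one negative entry")
    eps = frac * min(1.0 / K, 1.0 - 1.0 / K)
    gamma[h_plus] += eps    # raise the weight on the positive term
    gamma[h_minus] -= eps   # lower the weight on the negative term
    return gamma

d = [2.0, -1.0, -1.0]       # sums to zero under the uniform gamma
g = witness_gamma(d)
value = float(np.dot(d, g))
```

With the uniform $\gamma$ the sum is zero, and after the perturbation it equals $\epsilon (d_{h^+} - d_{h^-})$, which is strictly positive. The entries of `g` remain positive and still sum to one, so `g` stays on the simplex.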

Acknowledgements We thank an anonymous reviewer for pointing out an error in our original theorem.

This work was supported by NSF Award #1564330.

Conflict of interest On behalf of all authors, the corresponding author states that there is no conflict of

interest.

References

Bishop CM (1994) Mixture density networks. Technical report

Chaloner K, Verdinelli I (1995) Bayesian experimental design: a review. Stat Sci 273–304

⁴ It is possible that this $\gamma$ is not in $P'[\gamma; \alpha, \beta]$. However, it is guaranteed to be in $P[\gamma; \alpha, \beta]$. Since a subset of measure zero in $P[\gamma; \alpha, \beta]$ is also measure zero in $P'[\gamma; \alpha, \beta]$, this does not influence the proof.

Chalupka K, Perona P, Eberhardt F (2015) Visual causal feature learning. In: Proceedings of the thirty-

first conference on uncertainty in artificial intelligence. AUAI Press, Corvallis, pp 181–190

Chalupka K, Bischoff T, Perona P, Eberhardt F (2016a) Unsupervised discovery of El Niño using causal feature learning on microlevel climate data. In: Proceedings of the thirty-second conference on uncertainty in artificial intelligence

Chalupka K, Perona P, Eberhardt F (2016b) Multi-level cause-effect systems. In: 19th international

conference on artificial intelligence and statistics (AISTATS)

Chickering DM (2002) Learning equivalence classes of Bayesian-network structures. J Mach Learn Res 2:445–498

Claassen T, Heskes T (2012) A Bayesian approach to constraint based causal inference. In: Proceedings of UAI. AUAI Press, Corvallis, pp 207–216

Entner D, Hoyer PO (2012) Estimating a causal order among groups of variables in linear models. Artif Neural Netw Mach Learn-ICANN 2012:84–91

Hoel EP, Albantakis L, Tononi G (2013) Quantifying causal emergence shows that macro can beat micro. Proc Natl Acad Sci 110(49):19790–19795

Hoyer PO, Janzing D, Mooij JM, Peters J, Scholkopf B (2009) Nonlinear causal discovery with additive

noise models. In: Advances in neural information processing systems, pp 689–696

Hyttinen A, Eberhardt F, Hoyer PO (2012) Causal discovery of linear cyclic models from multiple

experimental data sets with overlapping variables. arXiv:1210.4879

Hyttinen A, Eberhardt F, Jarvisalo M (2014) Constraint-based causal discovery: conflict resolution with answer set programming. In: Proceedings of UAI

Jacobs KW, Hustmyer FE (1974) Effects of four psychological primary colors on GSR, heart rate and respiration rate. Percept Motor Skills 38(3):763–766

Krizhevsky A, Sutskever I, Hinton GE (2012) ImageNet classification with deep convolutional neural networks. In: Pereira F, Burges CJC, Bottou L, Weinberger KQ (eds) Advances in neural information processing systems, vol 25, pp 1097–1105

Lacerda G, Spirtes PL, Ramsey J, Hoyer PO (2012) Discovering cyclic causal models by independent

components analysis. arXiv:1206.3273

Levina E, Bickel P (2001) The earth mover's distance is the Mallows distance: some insights from statistics. In: Proceedings of the eighth IEEE international conference on computer vision (ICCV 2001), vol 2. IEEE, New York, pp 251–256

Mooij JM, Janzing D, Heskes T, Scholkopf B (2011) On causal discovery with cyclic additive noise

models. In: Advances in neural information processing systems, pp 639–647

Okamoto M (1973) Distinctness of the eigenvalues of a quadratic form in a multivariate sample. Ann Stat

1(4):763–765

Parviainen P, Kaski S (2015) Bayesian networks for variable groups. arXiv:1508.07753

Pearl J (2000) Causality: models, reasoning and inference. Cambridge University Press, Cambridge

Pearl J (2010) An introduction to causal inference. Int J Biostat 6(2)

Reise SP, Moore TM, Haviland MG (2010) Bifactor models and rotations: Exploring the extent to which

multidimensional data yield univocal scale scores. J Pers Assessm 92(6):544–559

Richardson T (1996) A discovery algorithm for directed cyclic graphs. In: Proceedings of the twelfth

international conference on uncertainty in artificial intelligence. Morgan Kaufmann Publishers Inc.,

USA, pp 454–461

Rumelhart DE, Hinton GE, Williams RJ (1985) Learning internal representations by error propagation.

Technical report, No. ICS-8506. California University of San Diego La Jolla Institute for Cognitive

Science

Shalizi CR (2001) Causal architecture, complexity and self-organization in the time series and cellular

automata. PhD thesis, University of Wisconsin at Madison

Shalizi CR, Crutchfield JP (2001) Computational mechanics: pattern and prediction, structure and

simplicity. J Stat Phys 104(3–4):817–879

Shalizi CR, Moore C (2003) What is a macrostate? Subjective observations and objective dynamics.

arXiv:cond-mat/0303625

Shimizu S, Hoyer PO, Hyvarinen A, Kerminen A (2006) A linear non-Gaussian acyclic model for causal discovery. J Mach Learn Res 7:2003–2030

Silander T, Myllymaki P (2006) A simple approach for finding the globally optimal Bayesian network structure. In: Proc UAI. AUAI Press, Oregon, pp 445–452

Silva R, Scheines R, Glymour C, Spirtes P (2006) Learning the structure of linear latent variable models.

J Mach Learn Res 7:191–246



Snoek J, Larochelle H, Adams RP (2012) Practical Bayesian optimization of machine learning algorithms. In: Advances in neural information processing systems, pp 2951–2959

Spirtes P, Scheines R (2004) Causal inference of ambiguous manipulations. Philos Sci 71(5):833–845

Spirtes P, Glymour CN, Scheines R (2000) Causation, prediction, and search, 2nd edn. Massachusetts

Institute of Technology, Massachusetts

Srinivas N, Krause A, Seeger M, Kakade SM (2010) Gaussian process optimization in the bandit setting:

no regret and experimental design. In: Proceedings of the 27th international conference on machine

learning (ICML-10), pp 1015–1022

Tong S, Koller D (2001) Support vector machine active learning with applications to text classification.

J Mach Learn Res 2:45–66

Tsao DY, Freiwald WA, Tootell RBH, Livingstone MS (2006) A cortical region consisting entirely of face-selective cells. Science 311(5761):670–674

