Feature Selection and Causal Discovery. Isabelle Guyon (Clopinet), André Elisseeff (IBM Zürich), Constantin Aliferis (Vanderbilt University)


Page 1

Feature Selection and Causal Discovery

Isabelle Guyon, Clopinet
André Elisseeff, IBM Zürich
Constantin Aliferis, Vanderbilt University

Page 2

Road Map

• What is feature selection?
• Why is it hard?
• What works best in practice?
• How to make progress using causality?
• Can causal discovery benefit from feature selection?

Page 3

Introduction

Page 4

Causal discovery

• What affects your health?
• What affects the economy?
• What affects climate change?

and…

Which actions will have beneficial effects?

Page 5

Feature Selection

Remove features Xi to improve (or at least not degrade) the prediction of Y.

[Figure: feature vector X mapped to target Y]

Page 6

Uncovering Dependencies

Factors of variability:

• actual or artifactual
• known or unknown
• observable or unobservable
• controllable or uncontrollable

Page 7

Predictions and Actions

[Figure: relating features X and target Y, under prediction and under action]

See e.g. Judea Pearl, “Causality”, 2000.

Page 8

Predictive power of causes and effects

[Figure: toy causal network over Anxiety, Smoking, Allergy, Lung disease, Coughing]

Smoking is a better predictor of lung disease than coughing.

Page 9

“Causal feature selection”

• Abandon the usual motto of predictive modeling: “we don’t care about causality”.
• Feature selection may benefit from introducing a notion of causality:
  – To be able to predict the consequences of given actions.
  – To add robustness to the predictions if the input distribution changes.
  – To get more compact and robust feature sets.

Page 10

“FS-enabled causal discovery”

Isn’t causal discovery solved with experiments?

• No! Randomized Controlled Trials (RCTs) may be:
  – Unethical (e.g. an RCT about the effects of smoking)
  – Costly and time consuming
  – Impossible (e.g. astronomy)
• Observational data may be available to help plan future experiments.

⇒ Causal discovery may benefit from feature selection.

Page 11

Feature selection basics

Page 12

Individual Feature Irrelevance

P(Xi, Y) = P(Xi) P(Y), or equivalently P(Xi | Y) = P(Xi)

[Figure: the class-conditional densities of xi coincide when the feature is individually irrelevant]
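A minimal sketch of how this criterion can be checked per feature on data (assuming numpy/scipy and binary labels in {-1, +1}; the function name is ours), using a two-sample Kolmogorov–Smirnov test to compare the class-conditional distributions of Xi:

```python
import numpy as np
from scipy.stats import ks_2samp

def irrelevance_pvalues(X, y):
    """Two-sample KS test per feature: does P(Xi | Y=+1) differ from
    P(Xi | Y=-1)?  A large p-value means no evidence against the
    individual-irrelevance criterion P(Xi | Y) = P(Xi)."""
    pvals = []
    for i in range(X.shape[1]):
        pos = X[y == 1, i]    # values of feature i in the positive class
        neg = X[y == -1, i]   # values of feature i in the negative class
        pvals.append(ks_2samp(pos, neg).pvalue)
    return np.array(pvals)
```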

Page 13

Individual Feature Relevance

[Figure: ROC curve, sensitivity vs. specificity; the area under the curve (AUC), between 0 and 1, scores individual feature relevance]
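The AUC turns directly into a univariate relevance score. A sketch using scikit-learn (the raw feature value serves as the ranking score; |AUC - 0.5| near 0 means individually irrelevant, near 0.5 means perfect separation):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def auc_relevance(X, y):
    """Univariate relevance per feature: |AUC - 0.5| when the raw value
    of feature i is used as a ranking score for the positive class."""
    return np.array([abs(roc_auc_score(y, X[:, i]) - 0.5)
                     for i in range(X.shape[1])])
```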

Page 14

Univariate selection may fail

Guyon-Elisseeff, JMLR 2004; Springer 2006

Page 15

Multivariate FS is complex

n features, 2^n possible feature subsets!

Kohavi-John, 1997

Page 16

FS strategies

• Wrappers:
  – Use the target risk functional to evaluate feature subsets.
  – Train one learning machine for each feature subset investigated.
• Filters:
  – Use an evaluation function other than the target risk functional.
  – Often no learning machine is involved in the feature selection process.

Page 17

Reducing complexity

• For wrappers:
  – Use forward or backward selection: O(n^2) steps (see the sketch after this list).
  – Mix forward and backward search, e.g. floating search.
• For filters:
  – Use a cheap evaluation function (no learning machine).
  – Make independence assumptions: n evaluations.
• Embedded methods:
  – Do not retrain the learning machine at every step: e.g. RFE, n steps.
  – Search the feature-subset space and the learning-machine parameter space simultaneously: e.g. 1-norm/Lasso approaches.
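A sketch of the forward-selection wrapper referenced above, assuming a scikit-learn-style estimator and 5-fold cross-validation as the evaluation function (function name ours). Selecting k of n features evaluates O(n·k) candidate subsets instead of 2^n:

```python
import numpy as np
from sklearn.model_selection import cross_val_score

def forward_selection(estimator, X, y, k):
    """Greedy forward selection: repeatedly add the feature whose
    inclusion yields the best cross-validated score."""
    selected, remaining = [], list(range(X.shape[1]))
    while len(selected) < k and remaining:
        best_score, best_j = max(
            (np.mean(cross_val_score(estimator, X[:, selected + [j]], y, cv=5)), j)
            for j in remaining)
        selected.append(best_j)
        remaining.remove(best_j)
    return selected
```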

Page 18

In practice…

• Univariate feature selection often yields better accuracy than multivariate feature selection.
• NO feature selection at all sometimes gives the best accuracy, even in the presence of known distracters.
• Multivariate methods usually claim only better “parsimony”.
• How can we make multivariate FS work better?

NIPS 2003 and WCCI 2006 challenges: http://clopinet.com/challenges

Page 19

Definition of “irrelevance”

• We want to determine whether a variable Xi is “relevant” to the target Y.
• Surely irrelevant feature:
  P(Xi, Y | S\i) = P(Xi | S\i) P(Y | S\i)
  for every subset S\i ⊆ X\i (the set of all features except Xi) and for every assignment of values to S\i (a brute-force check is sketched below).

Are all non-irrelevant features relevant?
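To make the definition concrete, here is the brute-force check referenced above, for small, discrete datasets only (plug-in mutual information as the dependence measure; all names ours). It enumerates all 2^(n-1) conditioning subsets, which foreshadows why exhaustive relevance checking is intractable:

```python
import itertools
import numpy as np

def empirical_mi(a, b):
    """Plug-in estimate of mutual information between two discrete arrays."""
    mi = 0.0
    for va in np.unique(a):
        for vb in np.unique(b):
            p_ab = np.mean((a == va) & (b == vb))
            p_a, p_b = np.mean(a == va), np.mean(b == vb)
            if p_ab > 0:
                mi += p_ab * np.log(p_ab / (p_a * p_b))
    return mi

def surely_irrelevant(X, y, i, tol=1e-3):
    """Empirically check P(Xi, Y | S) = P(Xi | S) P(Y | S) for every
    subset S of the other features and every assignment to S.
    Exponential in the number of features: illustration only."""
    others = [j for j in range(X.shape[1]) if j != i]
    for r in range(len(others) + 1):
        for S in itertools.combinations(others, r):
            cols = X[:, list(S)]
            for s_val in {tuple(row) for row in cols}:
                mask = (np.all(cols == np.array(s_val), axis=1)
                        if S else np.ones(len(y), dtype=bool))
                if empirical_mi(X[mask, i], y[mask]) > tol:
                    return False  # dependence found under this conditioning
    return True
```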

Page 20

Causality enters the picture

Page 21

Causal Bayesian networks

• Bayesian network:
  – Graph with random variables X1, X2, … Xn as nodes.
  – Dependencies represented by edges.
  – Allows us to compute P(X1, X2, … Xn) as ∏i P(Xi | Parents(Xi)) (sketched below).
  – Edge directions have no meaning.
• Causal Bayesian network: edge directions indicate causality.
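A minimal sketch of this factorization; the dictionary-based graph and conditional-probability-table formats are our own, purely for illustration:

```python
def joint_probability(assignment, parents, cpt):
    """P(X1,...,Xn) = prod_i P(Xi | Parents(Xi)) for one full assignment.
    assignment: {variable: value}
    parents:    {variable: tuple of parent names}  (empty tuple for roots)
    cpt:        {variable: {(value, parent_values): probability}}"""
    p = 1.0
    for var, value in assignment.items():
        parent_values = tuple(assignment[pa] for pa in parents[var])
        p *= cpt[var][(value, parent_values)]
    return p
```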

Page 22

Markov blanket

[Figure: the toy causal network over Anxiety, Smoking, Allergy, Lung disease, Coughing from Page 8]

A node is conditionally independent of all other nodes given its Markov blanket.
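In a DAG, the Markov blanket of a node consists of its parents, its children, and its children’s other parents (“spouses”). A sketch, with the edge structure of the toy network assumed for illustration:

```python
def markov_blanket(node, parents):
    """Markov blanket = parents(node) + children(node) + spouses(node).
    `parents` maps each variable to the list of its parents."""
    mb = set(parents[node])                           # parents
    children = [v for v, ps in parents.items() if node in ps]
    mb.update(children)                               # children
    for c in children:                                # spouses
        mb.update(p for p in parents[c] if p != node)
    return mb - {node}

# Assumed edges: Anxiety -> Smoking -> Lung disease -> Coughing,
# Allergy -> Coughing.
parents = {"Anxiety": [], "Smoking": ["Anxiety"], "Allergy": [],
           "Lung disease": ["Smoking"], "Coughing": ["Lung disease", "Allergy"]}
print(markov_blanket("Lung disease", parents))
# -> {'Smoking', 'Coughing', 'Allergy'}
```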

Page 23

Relevance revisited

In terms of Bayesian networks in “faithful” distributions:

• Strongly relevant features = members of the Markov Blanket

• Weakly relevant features = variables with a path to the Markov Blanket but not in the Markov Blanket

• Irrelevant features = variables with no path to the Markov Blanket

Koller-Sahami, 1996; Kohavi-John, 1997; Aliferis et al., 2002.

Page 24

Is X2 “relevant”? (example 1)

Variables: peak (X1), baseline (X2), health (Y).

P(X1, X2, Y) = P(X1 | X2, Y) P(X2) P(Y)

Under this factorization X2 || Y marginally, yet X2 and Y become dependent given X1 (X1 is a common effect of both), so X2 || Y | X1 fails.

[Figure: scatter plot of x1 (peak) vs. x2 (baseline) for normal and disease samples]

Page 25

Are X1 and X2 “relevant”? (example 2)

Variables: peak (X1), sample processing time (X2), health (Y).

P(X1, X2, Y) = P(X1 | X2, Y) P(X2) P(Y)

X1 || Y
X2 || Y
X1 || X2

[Figure: peak vs. sample processing time for normal and disease samples]

Page 26

XOR and unfaithfulness

[Figure: X1 and X2 both pointing to Y]

X1 || Y
X2 || Y
X1 || X2

Y = X1 · X2

X1   X2    Y
 1    1    1
 1   -1   -1
-1    1   -1
-1   -1    1

Example. X1 and X2: two fair coins tossed at random. Y: win if both coins end on the same side.

[Figure: alternative network structures over X1, X2, Y, all consistent with the observed independencies]
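A quick simulation (numpy assumed) confirms the point: each coin alone is uninformative about the outcome, yet given the other coin the dependence is perfect:

```python
import numpy as np

rng = np.random.default_rng(0)
x1 = rng.choice([-1, 1], size=100_000)  # fair coin 1
x2 = rng.choice([-1, 1], size=100_000)  # fair coin 2
y = x1 * x2                             # win (+1) iff both coins agree

# Marginally, each coin is uninformative about the outcome...
print(np.corrcoef(x1, y)[0, 1])  # ~ 0
print(np.corrcoef(x2, y)[0, 1])  # ~ 0
# ...yet fixing the other coin makes the dependence perfect:
print(np.corrcoef(x1[x2 == 1], y[x2 == 1])[0, 1])  # = 1.0
```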

Page 27

Adding a variable…
… can make another one irrelevant: Simpson’s paradox (example 3).

[Figure: y vs. x1 pooled, and y vs. x1 after stratifying on X2]

X1 || Y | X2

Page 28

Is chocolate good for your health? (example 3)

Variables: chocolate intake (x1), life expectancy (y), X2 = gender (Male/Female).

[Figure: scatter plots of life expectancy vs. chocolate intake, pooled and stratified by gender]

X1 || Y | X2

… conclusion: no evidence that eating chocolate makes you live longer.
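A small simulation of the gender scenario (numpy assumed; all numbers invented purely for illustration): in this toy data women both eat more chocolate and live longer, so the pooled correlation is positive even though chocolate has no effect within either group:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5000
female = rng.random(n) < 0.5
# Invented effect sizes: women eat more chocolate AND live longer,
# but within each gender chocolate is unrelated to lifespan.
chocolate = rng.normal(loc=np.where(female, 60, 30), scale=10)  # g/week
lifespan = rng.normal(loc=np.where(female, 84, 78), scale=5)    # years

print(np.corrcoef(chocolate, lifespan)[0, 1])                    # pooled: > 0
print(np.corrcoef(chocolate[female], lifespan[female])[0, 1])    # ~ 0
print(np.corrcoef(chocolate[~female], lifespan[~female])[0, 1])  # ~ 0
```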

Page 29

Is chocolate good for your health? (example 3, continued)

Variables: chocolate intake (x1), life expectancy (y), X2 = mood (Depressed/Happy).

[Figure: scatter plots of life expectancy vs. chocolate intake, pooled and stratified by mood]

X1 || Y | X2

… conclusion: eating chocolate may make you live longer! Really?

Page 30

Same independence relations, different causal relations

X1 || Y | X2 in all three cases:

P(X1, X2, Y) = P(X1 | X2) P(Y | X2) P(X2)   (common cause: X1 ← X2 → Y)
P(X1, X2, Y) = P(Y | X2) P(X2 | X1) P(X1)   (chain: X1 → X2 → Y)
P(X1, X2, Y) = P(X1 | X2) P(X2 | Y) P(Y)   (chain: Y → X2 → X1)

Page 31

Is X1 “relevant”? (example 3)

X1 || Y | X2 in both scenarios:

• chocolate intake (X1), life expectancy (Y), X2 = gender
• chocolate intake (X1), life expectancy (Y), X2 = mood

Page 32

Non-causal features may be predictive yet not “relevant”

• Example 1: peak (X1), baseline (X2), health (Y)
• Example 2: peak (X1), time (X2), health (Y)
• Example 3: chocolate intake (X1), life expectancy (Y), gender or mood (X2)

Page 33

Causal feature discovery

[Figure: two scatter plots of x2 vs. x1, one for each factorization below]

P(X, Y) = P(X | Y) P(Y)   (Y → X1, X2)
P(X, Y) = P(Y | X) P(X)   (X1, X2 → Y)

Sun-Janzing-Schoelkopf, 2005

Page 34

Conclusion

• Feature selection focuses on uncovering subsets of variables X1, X2, … predictive of the target Y.
• Taking a closer look at the type of dependencies may help refine the notion of variable relevance.
• Uncovering causal relationships may yield better feature selection, robust under distribution changes.
• These “causal features” may be better targets of action.