Feature Selection and Causal Discovery
Isabelle Guyon, Clopinet
André Elisseeff, IBM Zürich
Constantin Aliferis, Vanderbilt University
Road Map
• What is feature selection?
• Why is it hard?
• What works best in practice?
• How to make progress using causality?
• Can causal discovery benefit from feature selection?
Introduction: feature selection and causal discovery
Causal discovery
• What affects your health?
• What affects the economy?
• What affects climate change?
and…
Which actions will have beneficial effects?
Feature Selection
Remove features Xi to improve (or at least not degrade) the prediction of Y.
Uncovering Dependencies
Factors of variability:
• actual vs. artifactual
• known vs. unknown
• observable vs. unobservable
• controllable vs. uncontrollable
Predictions and Actions
See e.g. Judea Pearl, “Causality”, 2000
Predictive power of causes and effects
[Diagram: causal network over Anxiety, Smoking, Allergy, Lung disease, and Coughing.]
Smoking is a better predictor of lung disease than coughing.
“Causal feature selection”
• Abandon the usual motto of predictive modeling: “we don’t care about causality”.
• Feature selection may benefit from introducing a notion of causality:
– to be able to predict the consequences of given actions;
– to add robustness to the predictions if the input distribution changes;
– to get more compact and robust feature sets.
“FS-enabled causal discovery”
Isn’t causal discovery solved with experiments?
• No! Randomized Controlled Trials (RCTs) may be:
– unethical (e.g. an RCT on the effects of smoking);
– costly and time consuming;
– impossible (e.g. in astronomy).
• Observational data may be available to help plan future experiments. Causal discovery may benefit from feature selection.
Feature selection basics
Individual Feature Irrelevance
P(Xi, Y) = P(Xi) P(Y)
P(Xi | Y) = P(Xi)
[Figure: density of xi.]
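This irrelevance condition can be tested empirically, one feature at a time. Below is a minimal sketch in Python (not from the slides; the data, the quantile binning, and the chi-squared test are illustrative choices):

```python
# Sketch: test P(Xi, Y) = P(Xi) P(Y) per feature with a chi-squared
# independence test on binned data. All names here are illustrative.
import numpy as np
from scipy.stats import chi2_contingency

rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=1000)          # binary target
x_rel = y + rng.normal(0, 1, size=1000)    # depends on y
x_irr = rng.normal(0, 1, size=1000)        # independent of y

def independence_pvalue(x, y, bins=5):
    """Chi-squared test of independence between a binned feature and Y."""
    cuts = np.quantile(x, np.linspace(0, 1, bins + 1)[1:-1])
    x_binned = np.digitize(x, cuts)        # bin index 0..bins-1
    table = np.zeros((bins, 2))
    for xb, yv in zip(x_binned, y):
        table[xb, yv] += 1
    _, p, _, _ = chi2_contingency(table)
    return p

print(independence_pvalue(x_rel, y))   # small p-value: dependent
print(independence_pvalue(x_irr, y))   # large p-value: no evidence of dependence
```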
Individual Feature Relevance
[Figure: class-conditional distributions of a feature (classes - and +) and the resulting ROC curve (sensitivity vs. specificity), summarized by the AUC, between 0 and 1.]
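One common way to operationalize this slide is to score each feature by the AUC it achieves when used alone as a ranking score. A hedged sketch on synthetic data (all variable names are illustrative assumptions):

```python
# Sketch: rank features by the AUC of each feature used alone.
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
n = 500
y = rng.integers(0, 2, size=n)
X = rng.normal(size=(n, 4))
X[:, 0] += 1.5 * y      # strongly relevant feature
X[:, 1] += 0.5 * y      # weakly relevant; columns 2 and 3 are distractors

aucs = []
for j in range(X.shape[1]):
    a = roc_auc_score(y, X[:, j])
    aucs.append(max(a, 1 - a))   # AUC below 0.5 just means a flipped sign

ranking = np.argsort(aucs)[::-1]
print(ranking, np.round(aucs, 3))
```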
Univariate selection may fail
Guyon-Elisseeff, JMLR 2004; Springer 2006
Multivariate FS is complex
n features, 2^n possible feature subsets!
Kohavi-John, 1997
FS strategies
• Wrappers:
– use the target risk functional to evaluate feature subsets;
– train one learning machine for each feature subset investigated (see the sketch after this list).
• Filters:
– use an evaluation function other than the target risk functional;
– often no learning machine is involved in the feature selection process.
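A compact sketch contrasting the two strategies, under illustrative assumptions (synthetic data, mutual information as the filter criterion, logistic regression inside the wrapper):

```python
# Sketch: a filter scores features without training a classifier;
# a wrapper greedily grows a subset using cross-validated accuracy.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import mutual_info_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=300, n_features=8,
                           n_informative=3, random_state=0)

# Filter: rank features by mutual information with Y (no classifier trained).
mi = mutual_info_classif(X, y, random_state=0)
print("filter ranking:", np.argsort(mi)[::-1])

# Wrapper: forward selection, training one model per candidate subset.
selected, remaining = [], list(range(X.shape[1]))
for _ in range(3):   # pick 3 features
    scores = [cross_val_score(LogisticRegression(max_iter=1000),
                              X[:, selected + [j]], y, cv=5).mean()
              for j in remaining]
    best = remaining[int(np.argmax(scores))]
    selected.append(best)
    remaining.remove(best)
print("wrapper selection:", selected)
```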
Reducing complexity
• For wrappers:
– use forward or backward selection: O(n^2) steps;
– mix forward and backward search, e.g. floating search.
• For filters:
– use a cheap evaluation function (no learning machine);
– make independence assumptions: n evaluations.
• Embedded methods (sketched below):
– do not retrain the LM at every step: e.g. RFE, n steps;
– search the FS space and the LM parameter space simultaneously: e.g. 1-norm/Lasso approaches.
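Both embedded methods named above have standard scikit-learn implementations; here is a short sketch with illustrative parameters:

```python
# Sketch: RFE removes the lowest-weight feature at each step (n steps);
# an L1 penalty selects features while fitting the model.
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC

X, y = make_classification(n_samples=300, n_features=10,
                           n_informative=3, random_state=0)

# RFE: recursive feature elimination driven by the model's weights.
rfe = RFE(LinearSVC(dual=False), n_features_to_select=3).fit(X, y)
print("RFE support:", rfe.support_)

# 1-norm / Lasso-style selection: the L1-regularized fit zeroes out weights.
l1 = LogisticRegression(penalty="l1", solver="liblinear", C=0.1).fit(X, y)
print("nonzero L1 weights:", (l1.coef_ != 0).ravel())
```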
In practice…
• Univariate feature selection often yields better accuracy results than multivariate feature selection.
• Sometimes NO feature selection at all gives the best accuracy results, even in the presence of known distractors.
• Multivariate methods usually claim only better “parsimony”.
• How can we make multivariate FS work better?
NIPS 2003 and WCCI 2006 challenges: http://clopinet.com/challenges
Definition of “irrelevance”
• We want to determine whether one variable Xi is “relevant” to the target Y.
• Surely irrelevant feature:
P(Xi, Y | S\i) = P(Xi | S\i) P(Y | S\i)
for all subsets S\i ⊆ X\i and for all assignments of values to S\i (a brute-force check of this definition is sketched below).
Are all non-irrelevant features relevant?
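A rough sketch, not from the slides, of checking the definition literally on small discrete data: it enumerates every subset S\i and every value assignment, which takes exponential time, which is exactly why practical algorithms must approximate it. The data and tolerance are illustrative assumptions.

```python
# Sketch: brute-force check of "surely irrelevant" on discrete data.
import itertools
import numpy as np

def surely_irrelevant(X, y, i, tol=0.05):
    """Check P(Xi, Y | S) ~= P(Xi | S) P(Y | S) for every subset S of the
    other variables and every value assignment (exponential in n)."""
    others = [j for j in range(X.shape[1]) if j != i]
    for k in range(len(others) + 1):
        for S in itertools.combinations(others, k):
            values = [np.unique(X[:, j]) for j in S]
            for s_vals in itertools.product(*values):
                mask = (np.all(X[:, list(S)] == s_vals, axis=1)
                        if S else np.ones(len(y), dtype=bool))
                if mask.sum() < 50:
                    continue   # too few samples to estimate the probabilities
                xi, yy = X[mask, i], y[mask]
                for xv in np.unique(xi):
                    for yv in np.unique(yy):
                        p_joint = np.mean((xi == xv) & (yy == yv))
                        p_prod = np.mean(xi == xv) * np.mean(yy == yv)
                        if abs(p_joint - p_prod) > tol:
                            return False
    return True

rng = np.random.default_rng(0)
y = rng.integers(0, 2, 2000)
x0 = np.where(rng.random(2000) < 0.1, 1 - y, y)   # noisy copy of y: relevant
x1 = rng.integers(0, 2, 2000)                     # pure noise: irrelevant
X = np.column_stack([x0, x1])
print(surely_irrelevant(X, y, 0))   # False: x0 depends on y
print(surely_irrelevant(X, y, 1))   # True: no dependence detected
```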
Causality enters the picture
Causal Bayesian networks
• Bayesian network:
– graph with random variables X1, X2, …, Xn as nodes;
– dependencies represented by edges;
– allows us to compute P(X1, X2, …, Xn) as ∏i P(Xi | Parents(Xi));
– edge directions have no meaning.
• Causal Bayesian network: edge directions indicate causality.
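As an illustration of the factorization above, a tiny sketch with hypothetical probability tables and an assumed chain Anxiety → Smoking → Lung disease over the deck's node names (the structure and numbers are assumptions, not from the slides):

```python
# Sketch: the joint P(X1,...,Xn) as the product of P(Xi | Parents(Xi)).
# Hypothetical tables for Anxiety -> Smoking -> Lung disease.
p_anx = {1: 0.3, 0: 0.7}
p_smoke = {(1, 1): 0.6, (1, 0): 0.4,   # keyed by (anxiety, smoking)
           (0, 1): 0.2, (0, 0): 0.8}
p_dis = {(1, 1): 0.3, (1, 0): 0.7,     # keyed by (smoking, disease)
         (0, 1): 0.05, (0, 0): 0.95}

def joint(anx, smoke, dis):
    """P(anx, smoke, dis) as a product of each node given its parents."""
    return p_anx[anx] * p_smoke[(anx, smoke)] * p_dis[(smoke, dis)]

# Sanity check: the joint sums to 1 over all assignments.
total = sum(joint(a, s, d) for a in (0, 1) for s in (0, 1) for d in (0, 1))
print(total)   # 1.0
```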
Markov blanket
[Diagram: the lung-disease network (Anxiety, Smoking, Allergy, Lung disease, Coughing).]
A node is conditionally independent of all other nodes given its Markov blanket: its parents, its children, and its children's other parents (spouses).
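A minimal sketch of reading a Markov blanket off a known DAG; the edge directions below are assumptions for illustration, not given in the slides:

```python
# Sketch: Markov blanket = parents + children + spouses, from a known DAG.
# The edge list is an assumed orientation of the slide's network.
edges = [("Anxiety", "Smoking"), ("Smoking", "Lung disease"),
         ("Allergy", "Coughing"), ("Lung disease", "Coughing")]

def markov_blanket(node, edges):
    parents = {a for a, b in edges if b == node}
    children = {b for a, b in edges if a == node}
    spouses = {a for a, b in edges if b in children and a != node}
    return parents | children | spouses

print(markov_blanket("Lung disease", edges))
# {'Smoking', 'Coughing', 'Allergy'} under the assumed structure
```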
Relevance revisited
In terms of Bayesian networks in “faithful” distributions:
• Strongly relevant features = members of the Markov Blanket
• Weakly relevant features = variables with a path to the Markov Blanket but not in the Markov Blanket
• Irrelevant features = variables with no path to the Markov Blanket
Koller-Sahami, 1996; Kohavi-John, 1997; Aliferis et al., 2002.
Is X2 “relevant”? (example 1)
peak (X1), baseline (X2), health (Y)
P(X1, X2, Y) = P(X1 | X2, Y) P(X2) P(Y)
X2 ⊥ Y: the baseline alone carries no information about health, but given the peak X1 (a common effect of X2 and Y), X2 and Y become dependent, so X2 helps prediction.
[Figure: scatter plot of x1 (peak) vs. x2 (baseline) for the classes “normal” and “disease”.]
Are X1 and X2 “relevant”? (example 2)
peak (X1), sample processing time (X2), health (Y)
P(X1, X2, Y) = P(X1 | X2, Y) P(X2) P(Y)
X1 and Y are dependent; X2 ⊥ Y; X1 and X2 are dependent: the processing time X2 is an experimental artifact that affects the measured peak, so conditioning on it improves the prediction of Y although X2 is not causally related to Y.
[Figure: peak vs. sample processing time for the classes “normal” and “disease”.]
XOR and unfaithfulness
Y = X1 ⊕ X2; with ±1 values, Y = X1 · X2:
X1   X2    Y
 1     1     1
 1    -1    -1
-1     1    -1
-1    -1     1
X1 ⊥ Y, X2 ⊥ Y, X1 ⊥ X2: every pair is independent, yet (X1, X2) jointly determine Y. The distribution is unfaithful: the graph X1 → Y ← X2 would normally entail pairwise dependencies, and they vanish here.
Example: X1 and X2 are two fair coins tossed at random; Y: win if both coins end on the same side (simulated below).
[Diagram: candidate graphs over X1, X2, Y.]
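A quick simulation of the coin example confirms the point: all pairwise correlations vanish although (X1, X2) determine Y exactly, so any univariate criterion is blind to both features.

```python
# Sketch: two fair coins; Y wins iff both land on the same side.
import numpy as np

rng = np.random.default_rng(0)
x1 = rng.choice([-1, 1], size=100_000)
x2 = rng.choice([-1, 1], size=100_000)
y = x1 * x2   # +1 iff the coins agree

# Pairwise correlations are ~0, yet Y is a deterministic function of (X1, X2).
print(np.corrcoef(x1, y)[0, 1], np.corrcoef(x2, y)[0, 1],
      np.corrcoef(x1, x2)[0, 1])
print(np.all(y == x1 * x2))   # True
```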
Adding a variable… can make another one irrelevant: Simpson’s paradox (example 3)
Is chocolate good for your health?
chocolate intake (X1), life expectancy (Y), X2 = gender
[Figure: life expectancy (y) vs. chocolate intake (x1), stratified by gender (male/female).]
X1 ⊥ Y | X2
… conclusion: no evidence that eating chocolate makes you live longer.
Is chocolate good for your health?
chocolate intake (X1), life expectancy (Y), X2 = mood
[Figure: the same data stratified by mood (depressed/happy).]
X1 ⊥ Y | X2
… conclusion: eating chocolate may make you live longer! Really?
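The stratified picture can be simulated in a few lines; all numbers below are made up for illustration. Within each X2 group, X1 and Y are independent, yet pooling the groups creates a marginal association:

```python
# Sketch: Simpson's paradox. Conditional independence X1 _||_ Y | X2,
# but a clear pooled correlation between X1 and Y.
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
x2 = rng.integers(0, 2, n)                        # group (gender or mood)
x1 = rng.normal(loc=2.0 * x2, scale=1.0)          # group 1 eats more chocolate
y = rng.normal(loc=5.0 * x2 + 70.0, scale=1.0)    # group 1 lives longer

print("pooled corr(X1, Y):", round(np.corrcoef(x1, y)[0, 1], 2))   # clearly > 0
for g in (0, 1):
    m = x2 == g
    print(f"within group {g}:",
          round(np.corrcoef(x1[m], y[m])[0, 1], 2))                # ~0
```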
Same independence relations, different causal relations
All three models imply X1 ⊥ Y | X2:
P(X1, X2, Y) = P(X1 | X2) P(Y | X2) P(X2)   (X1 ← X2 → Y)
P(X1, X2, Y) = P(Y | X2) P(X2 | X1) P(X1)   (X1 → X2 → Y)
P(X1, X2, Y) = P(X1 | X2) P(X2 | Y) P(Y)   (X1 ← X2 ← Y)
Is X1 “relevant”? (example 3 revisited)
X1 ⊥ Y | X2 in both scenarios, but the causal structures differ:
– chocolate intake (X1) ← gender (X2) → life expectancy (Y): X2 is a confounder, so acting on chocolate intake will not change life expectancy;
– chocolate intake (X1) → mood (X2) → life expectancy (Y): X2 is a mediator, so acting on chocolate intake may change life expectancy.
Non-causal features may be predictive yet not “relevant”
Recap of the three examples:
1. baseline (X2), peak (X1), health (Y);
2. sample processing time (X2), peak (X1), health (Y);
3. chocolate intake (X1), life expectancy (Y), with X2 = gender or X2 = mood.
Causal feature discovery
Two candidate generative models for the same data:
P(X, Y) = P(X | Y) P(Y)   (Y → X1, X2: the target causes the features)
P(X, Y) = P(Y | X) P(X)   (X1, X2 → Y: the features cause the target)
Sun-Janzing-Schoelkopf, 2005
Conclusion
• Feature selection focuses on uncovering subsets of variables X1, X2, … predictive of the target Y.
• Taking a closer look at the type of dependencies may help refine the notion of variable relevance.
• Uncovering causal relationships may yield better feature selection, robust under distribution changes.
• These “causal features” may be better targets of action.