Informative Subspace Learning for Counterfactual Inference
Yale Chang, Jennifer G. Dy
Department of Electrical and Computer Engineering
Northeastern University
February 9, 2017
Motivation: Why Causal Inference?
Does the treatment affect the outcome?
Ø Healthcare: new medication → blood pressure?
Ø Economics: job training → employee's income?
Ø Advertising: advertising campaign → company's revenue?
Question of interest: what is the causal effect of the treatment on the outcome?
Challenges
Figures: Shalit & Sontag www.cs.nyu.edu/~shalit/tutorial.html
Potential Outcome Framework
Ø Only one potential outcome can be observed for each sample.
Randomized Controlled Trial vs. Observational Data
Ø Confounding factors in observational data.
Contributions of This Work
Ø Propose a novel approach for causal inference on observational data.
Ø Speed up the proposed approach (reducing quadratic to linear complexity) via randomized approximation and provide theoretical results proving an upper bound on the approximation error.
Ø Empirical results on simulated and real-world data demonstrate that our proposed approach outperforms competing methods.
Potential Outcome Framework
[Figure: blood pressure vs. age, showing each sample's control outcome, treatment outcome, and observed factual outcome.]
Potential Outcome Framework
[Figure: the same blood pressure vs. age plot, now also marking each sample's unobserved counterfactual outcome.]
Potential Outcome Framework
[Figure: blood pressure vs. age; the gap between a sample's treatment and control outcomes is its ITE.]
ITE: Individual Treatment Effect
Potential Outcome Framework
[Figure: blood pressure vs. age; the per-sample ITEs averaged over all samples give the ATE.]
ITE: Individual Treatment Effect
ATE: Average Treatment Effect
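Written out (standard potential-outcome notation assumed here: $y_i^{(1)}$ and $y_i^{(0)}$ denote sample $i$'s treatment and control outcomes):

$$\mathrm{ITE}(x_i) \;=\; y_i^{(1)} - y_i^{(0)}, \qquad \mathrm{ATE} \;=\; \frac{1}{n}\sum_{i=1}^{n} \mathrm{ITE}(x_i)$$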
Nearest Neighbor Matching
Ø Set each sample's counterfactual outcome equal to the factual outcome of its nearest neighbor in the opposite group (see the sketch below).
Ø Distance can be measured with the Euclidean metric.
[Figure: blood pressure vs. age; each sample is matched to its nearest neighbor in the opposite group.]
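A minimal sketch of this matching rule, assuming numpy arrays x_t, y_t (treated samples and their factual outcomes) and x_c, y_c (controls); the function and variable names are illustrative, not from the paper:

```python
import numpy as np

def nn_match_ite(x_t, y_t, x_c, y_c):
    """Impute each sample's counterfactual outcome as the factual outcome of its
    nearest Euclidean neighbor in the opposite group, then form ITE estimates."""
    # Pairwise Euclidean distances: rows = treated samples, columns = controls.
    dists = np.linalg.norm(x_t[:, None, :] - x_c[None, :, :], axis=2)
    y_t_cf = y_c[dists.argmin(axis=1)]  # imputed control outcome for each treated sample
    y_c_cf = y_t[dists.argmin(axis=0)]  # imputed treatment outcome for each control sample
    ite_t = y_t - y_t_cf                # treated: factual minus imputed counterfactual
    ite_c = y_c_cf - y_c                # control: imputed counterfactual minus factual
    return ite_t, ite_c
```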
Nearest Neighbor Matching
Ø Not all features affect the outcome.
Ø Need to learn informative subspaces (predictive of outcomes) for both the treatment and control groups before matching.
However!
In this case, only age affects the outcome
[Figure: blood pressure plotted against age and weight; matching in the full feature space is misled by the uninformative weight feature.]
Informative Subspace Learning
Key property: samples $x_i$ with similar outcomes $y_i$ should be close in the learned subspace.
$$K_Y = \begin{bmatrix} \mathrm{sim}(y_1, y_1) & \cdots & \mathrm{sim}(y_1, y_n) \\ \vdots & \ddots & \vdots \\ \mathrm{sim}(y_n, y_1) & \cdots & \mathrm{sim}(y_n, y_n) \end{bmatrix}$$
Learn a projection matrix $W \in \mathbb{R}^{d \times q}$ that maps each $x_i \in \mathbb{R}^d$ to its low-dimensional embedding $z_i = W^\top x_i \in \mathbb{R}^q$ while preserving the similarity structure in $Y$.
$$K_Z = \begin{bmatrix} \mathrm{sim}(z_1, z_1) & \cdots & \mathrm{sim}(z_1, z_n) \\ \vdots & \ddots & \vdots \\ \mathrm{sim}(z_n, z_1) & \cdots & \mathrm{sim}(z_n, z_n) \end{bmatrix}$$
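A minimal sketch of how these similarity matrices could be instantiated; the Gaussian (RBF) kernel used here is an assumption, since the slides only specify sim(·,·):

```python
import numpy as np

def rbf_kernel(A, B, gamma=1.0):
    """Similarity matrix with entries exp(-gamma * ||a_i - b_j||^2).
    A and B are 2-D arrays of shape (n_a, dim) and (n_b, dim)."""
    sq_dists = (np.sum(A**2, axis=1)[:, None]
                + np.sum(B**2, axis=1)[None, :]
                - 2.0 * A @ B.T)
    return np.exp(-gamma * np.maximum(sq_dists, 0.0))

# K_Y from the outcomes (stacked as an (n, 1) array), K_Z from the embeddings Z:
# K_Y = rbf_kernel(Y, Y);  K_Z = rbf_kernel(Z, Z)
```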
Maximize Hilbert-Schmidt Independence Criterion (HSIC) between 𝒁 and 𝒀!
$$\mathrm{HSIC}(Z, Y) \;=\; \frac{1}{n(n-1)}\,\mathrm{Tr}\!\left(K_Z K_Y\right) \;=\; \frac{1}{n(n-1)} \sum_{i=1}^{n} \sum_{j=1}^{n} K_Z(i, j)\, K_Y(i, j)$$
Error Bound on HSIC Approximation
Challenge: the storage and computation of the kernel matrices are quadratic in the sample size.
Solution: approximate the kernel matrices with random Fourier features (sketched below).
$$K_Z \approx F F^\top, \quad F \in \mathbb{R}^{n \times m}, \qquad K_Y \approx G G^\top, \quad G \in \mathbb{R}^{n \times l}$$
where $m$ and $l$ are the numbers of random Fourier features, with $m, l \ll n$.
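A sketch of the random Fourier feature construction (Rahimi–Recht features for an RBF kernel, which is an assumed kernel choice) and of the resulting approximate HSIC; with $F$ and $G$ built this way, $\mathrm{Tr}(K_Z K_Y) \approx \mathrm{Tr}(FF^\top GG^\top) = \|F^\top G\|_F^2$, which is computable in time linear in $n$:

```python
import numpy as np

def random_fourier_features(X, num_features, gamma=1.0, rng=None):
    """Return F of shape (n, num_features) such that F @ F.T approximates the
    RBF kernel matrix exp(-gamma * ||x_i - x_j||^2)."""
    rng = np.random.default_rng(rng)
    d = X.shape[1]
    # Frequencies sampled from the Fourier transform of the RBF kernel.
    freqs = rng.normal(scale=np.sqrt(2.0 * gamma), size=(d, num_features))
    phases = rng.uniform(0.0, 2.0 * np.pi, size=num_features)
    return np.sqrt(2.0 / num_features) * np.cos(X @ freqs + phases)

def hsic_approx(F, G):
    """Approximate HSIC = Tr(K_Z K_Y) / (n(n-1)) using F F^T ~ K_Z and G G^T ~ K_Y."""
    n = F.shape[0]
    return np.linalg.norm(F.T @ G, 'fro') ** 2 / (n * (n - 1))
```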
Approximation Error Bound
$$\mathbb{E}\,|\mathrm{error}| \;\le\; \frac{n}{n-1}\left(\,\cdots\,\right)$$
where the right-hand side involves only $\log(nml)$ factors (see the paper for the full expression).
Learning Objective
$$\max_{W}\;\; \mathrm{HSIC}(Z, Y) \;-\; \lambda\,\|W\|_F^2$$
Ø Solved with L-BFGS (see the sketch below)
Ø Time complexity: $\mathcal{O}(n(md + ml + dq))$
Ø Storage cost: $\mathcal{O}(n(d + m + l))$
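A minimal end-to-end sketch of this objective, reusing the random_fourier_features and hsic_approx helpers sketched above; the hyperparameter values, the shared kernel width, and the use of SciPy's finite-difference gradients are illustrative assumptions, not the authors' settings:

```python
import numpy as np
from scipy.optimize import minimize

def fit_projection(X, Y, q=2, lam=1e-3, m=100, l=100, gamma=1.0, seed=0):
    """Learn W in R^{d x q} maximizing approximate HSIC(XW, Y) - lam * ||W||_F^2."""
    n, d = X.shape
    G = random_fourier_features(Y.reshape(n, -1), l, gamma, rng=seed)

    def neg_objective(w_flat):
        W = w_flat.reshape(d, q)
        Z = X @ W
        # Fixed seed keeps the random frequencies identical across evaluations.
        F = random_fourier_features(Z, m, gamma, rng=seed)
        return -(hsic_approx(F, G) - lam * np.sum(W ** 2))

    w0 = np.random.default_rng(seed).standard_normal(d * q)
    res = minimize(neg_objective, w0, method='L-BFGS-B')  # finite-difference gradients
    return res.x.reshape(d, q)
```

Per the slides, such a subspace would be learned separately for the treatment and control groups before nearest-neighbor matching in the learned embeddings.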
Infant Health and Development Program (IHDP) Data
[Results figure: proposed method compared with MDM, PSM, RLP, LASSO, BART, and Causal Forest.]
News Data
[Results figure: proposed method compared with MDM, PSM, RLP, LASSO, BART, and Causal Forest.]
Summary
Ø Significantly improve nearest-neighbor matching for counterfactual inference through informative subspace learning.
Ø Speed up HSIC computation via random Fourier features and prove an upper bound on the approximation error.
Ø Empirically show state-of-the-art performance on real datasets.