Lecture 3: Dependence measures using RKHS embeddings
MLSS Tübingen, 2015
Arthur Gretton
Gatsby Unit, CSML, UCL
Outline
• Three or more variable interactions, comparison with conditional
dependence testing [Sejdinovic et al., 2013a]
• Dependence detection in detail, covariance operators
• Bayesian inference without models, comparison with approximate
Bayesian computation (ABC) [Fukumizu et al., 2013]
• Recent work (2014/2015) (not in this talk, see my webpage)
– Testing for time series Chwialkowski and Gretton [2014], Chwialkowski et al. [2014]
– Nonparametric adaptive expectation propagation Jitkrittum et al. [2015]
– Infinite dimensional exponential families Sriperumbudur et al. [2014]
– Adaptive MCMC, and adaptive Hamiltonian Monte Carlo Sejdinovic et al. [2014], Strathmann et al. [2015]
Lancaster (3-way) Interactions
Detecting a higher order interaction

• How to detect V-structures (X → Z ← Y ) with pairwise weak individual dependence?

• X ⊥⊥ Y , Y ⊥⊥ Z, X ⊥⊥ Z

[Scatter plots: X vs Y, Y vs Z, X vs Z, XY vs Z]

• X, Y i.i.d. ∼ N (0, 1)

• Z | X, Y ∼ sign(XY ) Exp(1/√2)

Faithfulness is violated here.
V-structure Discovery

Assume X ⊥⊥ Y has been established. The V-structure X → Z ← Y can then be detected by either:

• Consistent CI test: H0 : X ⊥⊥ Y | Z [Fukumizu et al., 2008, Zhang et al., 2011], or

• Factorisation test: H0 : (X,Y ) ⊥⊥ Z ∨ (X,Z) ⊥⊥ Y ∨ (Y, Z) ⊥⊥ X
(multiple standard two-variable tests)

– compute p-values for each of the marginal tests (Y, Z) ⊥⊥ X, (X,Z) ⊥⊥ Y , and (X,Y ) ⊥⊥ Z

– apply the Holm-Bonferroni (HB) sequentially rejective correction (Holm, 1979); a minimal sketch of this correction is given below.
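A minimal sketch of the Holm-Bonferroni step, assuming the three marginal p-values have already been computed by two-variable independence tests (the helper name and the significance level α = 0.05 are illustrative, not from the slides):

```python
def holm_bonferroni(p_values, alpha=0.05):
    """Holm-Bonferroni sequentially rejective correction (Holm, 1979):
    compare the i-th smallest p-value against alpha / (m - i + 1), stopping
    at the first acceptance. Returns reject/accept flags in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, idx in enumerate(order):
        if p_values[idx] <= alpha / (m - rank):
            reject[idx] = True
        else:
            break  # first acceptance: all remaining hypotheses are accepted
    return reject

# The slide applies this correction to the p-values of the three marginal tests
# for (Y,Z) independent of X, (X,Z) independent of Y, and (X,Y) independent of Z.
```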
V-structure Discovery (2)

• How to detect V-structures with pairwise weak (or nonexistent) dependence?

• X ⊥⊥ Y , Y ⊥⊥ Z, X ⊥⊥ Z

[Scatter plots: X1 vs Y1, Y1 vs Z1, X1 vs Z1, X1·Y1 vs Z1]

• X1, Y1 i.i.d. ∼ N (0, 1)

• Z1 | X1, Y1 ∼ sign(X1Y1) Exp(1/√2)

• X2:p, Y2:p, Z2:p i.i.d. ∼ N (0, Ip−1)

Faithfulness is violated here.
V-structure Discovery (3)

[Plot: null acceptance rate (Type II error) vs. dimension, "V-structure discovery: Dataset A"; curves: CI test X ⊥⊥ Y | Z and two-variable factorisation test]

Figure 1: CI test for X ⊥⊥ Y | Z from Zhang et al. (2011), and a factorisation test with a HB correction, n = 500.
Lancaster Interaction Measure

[Bahadur (1961); Lancaster (1969)] The interaction measure of (X1, . . . , XD) ∼ P is a signed measure ∆P that vanishes whenever P can be factorised in a non-trivial way as a product of its (possibly multivariate) marginal distributions.

• D = 2 : ∆LP = PXY − PXPY

• D = 3 : ∆LP = PXY Z − PXPY Z − PY PXZ − PZPXY + 2PXPY PZ

[Graphs over (X, Y, Z) illustrating the five terms PXY Z, −PXPY Z, −PY PXZ, −PZPXY, +2PXPY PZ]

In the case PX ⊥⊥ PY Z (and similarly for the other two-way factorisations), ∆LP = 0. Hence

(X,Y ) ⊥⊥ Z ∨ (X,Z) ⊥⊥ Y ∨ (Y, Z) ⊥⊥ X ⇒ ∆LP = 0.

...so what might be missed?

∆LP = 0 does not imply (X,Y ) ⊥⊥ Z ∨ (X,Z) ⊥⊥ Y ∨ (Y, Z) ⊥⊥ X.

Example:
P (0, 0, 0) = 0.2  P (0, 0, 1) = 0.1  P (1, 0, 0) = 0.1  P (1, 0, 1) = 0.1
P (0, 1, 0) = 0.1  P (0, 1, 1) = 0.1  P (1, 1, 0) = 0.1  P (1, 1, 1) = 0.2
A Test using Lancaster Measure

• Test statistic is the empirical estimate of ‖µκ(∆LP )‖²_Hκ, where κ = k ⊗ l ⊗ m:

‖µκ(PXY Z − PXY PZ − · · · )‖²_Hκ = 〈µκPXY Z , µκPXY Z〉_Hκ − 2〈µκPXY Z , µκPXY PZ〉_Hκ + · · ·

Inner Product Estimators

ν \ ν′     PXY Z             PXY PZ           PXZPY            PY ZPX           PXPY PZ
PXY Z      (K ◦ L ◦ M)++     ((K ◦ L)M)++     ((K ◦ M)L)++     ((M ◦ L)K)++     tr(K+ ◦ L+ ◦ M+)
PXY PZ                       (K ◦ L)++ M++    (MKL)++          (KLM)++          (KL)++M++
PXZPY                                         (K ◦ M)++ L++    (KML)++          (KM)++L++
PY ZPX                                                         (L ◦ M)++ K++    (LM)++K++
PXPY PZ                                                                         K++L++M++

Table 1: V-statistic estimators of 〈µκν, µκν′〉_Hκ

‖µκ(∆LP )‖²_Hκ = (1/n²) (HKH ◦ HLH ◦ HMH)++

Empirical joint central moment in the feature space
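As a minimal sketch (not the authors' reference implementation), the statistic above can be computed directly from precomputed kernel matrices; how K, L, M are built (e.g. Gaussian kernels and their bandwidths) is left as an assumption:

```python
import numpy as np

def lancaster_statistic(K, L, M):
    """Empirical Lancaster statistic (1/n^2) (HKH o HLH o HMH)_++ from the slide,
    where K, L, M are n x n kernel matrices on the X, Y, Z samples, o is the
    entrywise product and A_++ is the sum of all entries of A."""
    n = K.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n           # centering matrix
    Kc, Lc, Mc = H @ K @ H, H @ L @ H, H @ M @ H  # centred kernel matrices
    return (Kc * Lc * Mc).sum() / n ** 2
```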
Example A: factorisation tests

[Plot: null acceptance rate (Type II error) vs. dimension, "V-structure discovery: Dataset A"; curves: CI test X ⊥⊥ Y | Z, ∆L factorisation test, two-variable factorisation test]

Figure 2: Factorisation hypothesis: Lancaster statistic vs. a two-variable based test (both with HB correction); test for X ⊥⊥ Y | Z from Zhang et al. (2011), n = 500.
Example B: Joint dependence can be easier to detect

• X1, Y1 i.i.d. ∼ N (0, 1)

• Z1 = X1² + ε w.p. 1/3, Y1² + ε w.p. 1/3, X1Y1 + ε w.p. 1/3, where ε ∼ N (0, 0.1²) (a sampling sketch is given after this list)

• X2:p, Y2:p, Z2:p i.i.d. ∼ N (0, Ip−1)

• Dependence of Z on the pair (X,Y ) is stronger than on X and Y individually

• Satisfies faithfulness
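A small sketch of a sampler for this generative model (the dimension p and the seed handling are illustrative choices, not from the slides):

```python
import numpy as np

def sample_dataset_b(n, p, seed=0):
    """Dataset B from the slide: Z1 depends jointly on (X1, Y1); all remaining
    coordinates X2:p, Y2:p, Z2:p are independent standard normal noise."""
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(n, p))
    Y = rng.normal(size=(n, p))
    eps = rng.normal(0.0, 0.1, size=n)            # epsilon ~ N(0, 0.1^2)
    branch = rng.integers(0, 3, size=n)           # each branch with probability 1/3
    Z1 = np.where(branch == 0, X[:, 0] ** 2 + eps,
         np.where(branch == 1, Y[:, 0] ** 2 + eps,
                               X[:, 0] * Y[:, 0] + eps))
    Z = np.concatenate([Z1[:, None], rng.normal(size=(n, p - 1))], axis=1)
    return X, Y, Z
```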
Example B: factorisation tests

[Plot: null acceptance rate (Type II error) vs. dimension, "V-structure discovery: Dataset B"; curves: CI test X ⊥⊥ Y | Z, ∆L factorisation test, two-variable factorisation test]

Figure 3: Factorisation hypothesis: Lancaster statistic vs. a two-variable based test (both with HB correction); test for X ⊥⊥ Y | Z from Zhang et al. (2011), n = 500.
Interaction for D ≥ 4

• Interaction measure valid for all D (Streitberg, 1990):

∆SP = Σ_π (−1)^{|π|−1} (|π| − 1)! JπP

– For a partition π, Jπ associates to the joint the corresponding factorisation, e.g., J13|2|4P = PX1X3PX2PX4.

[Plot: number of partitions of {1, . . . , D} vs. D, showing Bell-number growth]

joint central moments (Lancaster interaction) vs. joint cumulants (Streitberg interaction)
Total independence test

• Total independence test:

H0 : PXY Z = PXPY PZ vs. H1 : PXY Z ≠ PXPY PZ

• For (X1, . . . , XD) ∼ PX, and κ = ⊗_{i=1}^D k^(i):

‖µκ(∆totP )‖²_Hκ = (1/n²) Σ_{a=1}^n Σ_{b=1}^n ∏_{i=1}^D K^(i)_ab
                 − (2/n^{D+1}) Σ_{a=1}^n ∏_{i=1}^D Σ_{b=1}^n K^(i)_ab
                 + (1/n^{2D}) ∏_{i=1}^D Σ_{a=1}^n Σ_{b=1}^n K^(i)_ab ,

where ∆totP := PX − ∏_{i=1}^D PXi.

• Coincides with the test proposed by Kankainen (1995) using empirical characteristic functions: similar relationship to that between dCov and HSIC [Sejdinovic et al., 2013b]
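A minimal sketch of this V-statistic, assuming the per-variable kernel matrices K^(1), ..., K^(D) have already been computed (kernel choice and bandwidths are left unspecified):

```python
import numpy as np

def total_independence_statistic(kernels):
    """Total-independence V-statistic from the slide, for a list `kernels` of
    D kernel matrices K^(i), each n x n (one per variable)."""
    n = kernels[0].shape[0]
    D = len(kernels)
    term1 = np.prod(kernels, axis=0).sum() / n ** 2                # (1/n^2) sum_ab prod_i K_ab
    row_prods = np.prod([K.sum(axis=1) for K in kernels], axis=0)  # prod_i sum_b K_ab, per a
    term2 = 2.0 * row_prods.sum() / n ** (D + 1)
    term3 = np.prod([K.sum() for K in kernels]) / n ** (2 * D)
    return term1 - term2 + term3
```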
Example B: total independence tests

[Plot: null acceptance rate (Type II error) vs. dimension, "Total independence test: Dataset B"; curves: ∆tot total independence test, ∆L total independence test]

Figure 4: Total independence: ∆totP vs. ∆LP , n = 500.
Kernel dependence measures - in detail
MMD for independence: HSIC

[Slide image: two text passages describing dog breeds, paired with images A and B; text from dogtime.com and petfinder.com]

Empirical HSIC(PXY ,PXPY ):

(1/n²) (HKH ◦ HLH)++
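A minimal sketch computing this statistic from data; the Gaussian kernel and median-heuristic bandwidth below are illustrative assumptions (any characteristic kernels can be used):

```python
import numpy as np
from scipy.spatial.distance import cdist

def gauss_kernel(X, sigma):
    """Gaussian kernel matrix on the rows of X."""
    return np.exp(-cdist(X, X, 'sqeuclidean') / (2 * sigma ** 2))

def median_bandwidth(X):
    """Median heuristic for the kernel bandwidth (an assumption, not from the slide)."""
    D = cdist(X, X)
    return np.median(D[D > 0])

def hsic_biased(X, Y):
    """Biased empirical HSIC (1/n^2)(HKH o HLH)_++ as on the slide."""
    n = X.shape[0]
    K = gauss_kernel(X, median_bandwidth(X))
    L = gauss_kernel(Y, median_bandwidth(Y))
    H = np.eye(n) - np.ones((n, n)) / n
    return ((H @ K @ H) * (H @ L @ H)).sum() / n ** 2
```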
Covariance to reveal dependence

A more intuitive idea: maximize covariance of smooth mappings:

COCO(P;F ,G) := sup_{‖f‖F=1,‖g‖G=1} (Ex,y[f(x)g(y)] − Ex[f(x)]Ey[g(y)])

[Plots: ring-shaped sample (X, Y ) with correlation −0.00; dependence witnesses f(x) and g(y); scatter of f(X) vs. g(Y ) with correlation −0.90, COCO: 0.14]

How do we define covariance in (infinite) feature spaces?
Covariance to reveal dependence

Covariance in RKHS: let's first look at the finite linear case.

We have two random vectors x ∈ R^d, y ∈ R^{d′}. Are they linearly dependent?

Compute their covariance matrix (ignore centering):

Cxy = E(xy⊤)

How to get a single “summary” number?

Solve for vectors f ∈ R^d, g ∈ R^{d′}:

argmax_{‖f‖=1,‖g‖=1} f⊤Cxy g = argmax_{‖f‖=1,‖g‖=1} Ex,y[(f⊤x)(g⊤y)]
                             = argmax_{‖f‖=1,‖g‖=1} Ex,y[f(x)g(y)] = argmax_{‖f‖=1,‖g‖=1} cov(f(x), g(y))

(maximum singular value)
Challenges in defining feature space covariance

Given features φ(x) ∈ F and ψ(y) ∈ G:

Challenge 1: Can we define a feature space analog to xy⊤?

YES:

• Given f ∈ R^d, g ∈ R^{d′}, h ∈ R^{d′}, define the matrix fg⊤ such that (fg⊤)h = f(g⊤h).

• Given f ∈ F , g ∈ G, h ∈ G, define the tensor product operator f ⊗ g such that (f ⊗ g)h = f〈g, h〉G.

• Now just set f := φ(x), g := ψ(y), to get xy⊤ → φ(x) ⊗ ψ(y)
Challenges in defining feature space covariance

Given features φ(x) ∈ F and ψ(y) ∈ G:

Challenge 2: Does a covariance “matrix” (operator) in feature space exist? I.e. is there some CXY : G → F such that

〈f, CXY g〉F = Ex,y[f(x)g(y)] = cov(f(x), g(y))

YES: via a Bochner integrability argument (as with the mean embedding). Under the condition Ex,y[√(k(x, x)l(y, y))] < ∞, we can define

CXY := Ex,y[φ(x) ⊗ ψ(y)],

which is a Hilbert-Schmidt operator (sum of squared singular values is finite).
REMINDER: functions revealing dependence

COCO(P;F ,G) := sup_{‖f‖F=1,‖g‖G=1} (Ex,y[f(x)g(y)] − Ex[f(x)]Ey[g(y)])

[Plots: ring-shaped sample (X, Y ) with correlation −0.00; dependence witnesses f(x) and g(y); scatter of f(X) vs. g(Y ) with correlation −0.90, COCO: 0.14]

How do we compute this from finite data?
Empirical covariance operator

The empirical covariance given z := {(xi, yi)}_{i=1}^n (now include centering):

CXY := (1/n) Σ_{i=1}^n φ(xi) ⊗ ψ(yi) − µx ⊗ µy,

where µx := (1/n) Σ_{i=1}^n φ(xi). More concisely,

CXY = (1/n) XHY⊤,

where H = In − n⁻¹1n, 1n is an n × n matrix of ones, and

X = [φ(x1) . . . φ(xn)]   Y = [ψ(y1) . . . ψ(yn)].

Define the kernel matrices

Kij = (X⊤X)ij = k(xi, xj)   Lij = l(yi, yj).
Functions revealing dependence

Optimization problem:

COCO(z;F ,G) := max 〈f, CXY g〉F   subject to ‖f‖F ≤ 1, ‖g‖G ≤ 1

Assume

f = Σ_{i=1}^n αi [φ(xi) − µx] = XHα   g = Σ_{j=1}^n βj [ψ(yj) − µy] = Y Hβ.

The associated Lagrangian is

L(f, g, λ, γ) = f⊤CXY g − (λ/2)(‖f‖²F − 1) − (γ/2)(‖g‖²G − 1).
Covariance to reveal dependence

• Empirical COCO(z;F ,G) is the largest eigenvalue γ of the generalized eigenvalue problem

[ 0          (1/n)K̃L̃ ] [α]       [ K̃  0 ] [α]
[ (1/n)L̃K̃        0  ] [β]  = γ  [ 0  L̃ ] [β] .

• K̃ and L̃ are matrices of inner products between centred observations in the respective feature spaces:

K̃ = HKH   where   H = I − (1/n)11⊤

• Mapping function for x:

f(x) = Σ_{i=1}^n αi [ k(xi, x) − (1/n) Σ_{j=1}^n k(xj, x) ]
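A minimal sketch of this eigenproblem, assuming the (uncentred) kernel matrices K and L are given; the small ridge added to the right-hand side is an implementation assumption, not part of the slide:

```python
import numpy as np
from scipy.linalg import eigh

def coco(K, L, ridge=1e-8):
    """Empirical COCO: largest eigenvalue gamma of the generalized eigenvalue
    problem on the slide, with centred kernel matrices Kt = HKH, Lt = HLH."""
    n = K.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    Kt, Lt = H @ K @ H, H @ L @ H
    A = np.block([[np.zeros((n, n)), Kt @ Lt / n],
                  [Lt @ Kt / n, np.zeros((n, n))]])
    B = np.block([[Kt + ridge * np.eye(n), np.zeros((n, n))],
                  [np.zeros((n, n)), Lt + ridge * np.eye(n)]])
    return eigh(A, B, eigvals_only=True).max()   # largest generalized eigenvalue
```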
Hard-to-detect dependence

[Plots: a smooth density and 500 samples from it; a rough density and 500 samples from it]

The density takes the form:

Px,y ∝ 1 + sin(ωx) sin(ωy)
Hard-to-detect dependence

• Example: sinusoids of increasing frequency, ω = 1, . . . , 6

[Plot: COCO (empirical average, 1500 samples) vs. frequency ω of the non-constant density component]
Hard-to-detect dependence

COCO vs frequency of perturbation from independence. For each case the plots show the sample (X, Y ), the dependence witnesses f and g, and the scatter of f(X) vs. g(Y ):

• ω = 1: sample correlation 0.27; correlation of f(X), g(Y ): −0.50; COCO: 0.09

• ω = 2: sample correlation 0.04; correlation of f(X), g(Y ): 0.51; COCO: 0.07

• ω = 3: sample correlation 0.03; correlation of f(X), g(Y ): −0.45; COCO: 0.03

• ω = 4: sample correlation 0.03; correlation of f(X), g(Y ): 0.21; COCO: 0.02

• ω = ?? (revealed to be uniform noise): sample correlation 0.00; correlation of f(X), g(Y ): −0.13; COCO: 0.02

• As dependence is encoded at higher frequencies, the smooth mappings f, g achieve lower linear covariance.

• Even for independent variables, COCO will not be zero at finite sample sizes, since some mild linear dependence will be induced by f, g (bias).

• This bias will decrease with increasing sample size.
More functions revealing dependence

• Can we do better than COCO?

• A second example with zero correlation

[Plots: sample (X, Y ) with correlation 0; first dependence witnesses f, g, with correlation of f(X), g(Y ): −0.80 and COCO: 0.11; second dependence witnesses f2, g2, with correlation of f2(X), g2(Y ): −0.37 and COCO2: 0.06]
Hilbert-Schmidt Independence Criterion

• Given γi := COCOi(z;F ,G), define the Hilbert-Schmidt Independence Criterion (HSIC) [ALT05, NIPS07a, JMLR10]:

HSIC(z;F ,G) := Σ_{i=1}^n γi²

• In the limit of infinite samples:

HSIC(P;F ,G) := ‖Cxy‖²HS = 〈Cxy, Cxy〉HS
              = Ex,x′,y,y′ [k(x, x′)l(y, y′)] + Ex,x′ [k(x, x′)]Ey,y′ [l(y, y′)]
                − 2Ex,y[Ex′ [k(x, x′)]Ey′ [l(y, y′)]]

– x′ an independent copy of x, y′ a copy of y

HSIC is identical to MMD(PXY ,PXPY )
When does HSIC determine independence?
Theorem: When kernels k and l are each characteristic, then HSIC = 0 iff
Px,y = PxPy [Gretton, 2015].
Weaker than MMD condition (which requires a kernel characteristic on
X × Y to distinguish Px,y from Qx,y).
Intuition: why characteristic needed on both X and Y

Question: Wouldn't it be enough just to use a rich mapping from X to Y, e.g. via ridge regression with characteristic F :

f∗ = argmin_{f∈F} (EXY (Y − 〈f, φ(X)〉F )² + λ‖f‖²F )

Counterexample: a density symmetric about the x-axis, s.t. p(x, y) = p(x,−y)

[Plot: sample from such a density; correlation −0.00]
Energy Distance and the MMD

Energy distance and MMD

Distance between probability distributions:

Energy distance [Baringhaus and Franz, 2004, Szekely and Rizzo, 2004, 2005]:

DE(P,Q) = EP‖X −X′‖^q + EQ‖Y − Y′‖^q − 2EP,Q‖X − Y‖^q,   0 < q ≤ 2

Maximum mean discrepancy [Gretton et al., 2007, Smola et al., 2007, Gretton et al., 2012]:

MMD²(P,Q;F) = EPk(X,X′) + EQk(Y, Y′) − 2EP,Qk(X,Y )

Energy distance is MMD with a particular kernel! [Sejdinovic et al., 2013b]
Distance covariance and HSIC

Distance covariance (0 < q, r ≤ 2) [Feuerverger, 1993, Szekely et al., 2007]:

V²(X,Y ) = EXY EX′Y′ [‖X −X′‖^q ‖Y − Y′‖^r]
         + EX EX′‖X −X′‖^q EY EY′‖Y − Y′‖^r
         − 2EXY [EX′‖X −X′‖^q EY′‖Y − Y′‖^r]

Hilbert-Schmidt Independence Criterion [Gretton et al., 2005, Smola et al., 2007, Gretton et al., 2008, Gretton and Gyorfi, 2010]: define RKHS F on X with kernel k, RKHS G on Y with kernel l. Then

HSIC(PXY ,PXPY ) = EXY EX′Y′ k(X,X′)l(Y, Y′) + EX EX′ k(X,X′) EY EY′ l(Y, Y′)
                 − 2EX′Y′ [EX k(X,X′) EY l(Y, Y′)].

Distance covariance is HSIC with particular kernels! [Sejdinovic et al., 2013b]
Semimetrics and Hilbert spaces

Theorem [Berg et al., 1984, Lemma 2.1, p. 74]: Let ρ : X × X → R be a semimetric (no triangle inequality) on X . Let z0 ∈ X , and denote

kρ(z, z′) = ρ(z, z0) + ρ(z′, z0) − ρ(z, z′).

Then kρ is positive definite (and, via Moore-Aronszajn, defines a unique RKHS) iff ρ is of negative type.

Call kρ a distance-induced kernel.

Negative type: The semimetric space (Z, ρ) is said to have negative type if for all n ≥ 2, z1, . . . , zn ∈ Z, and α1, . . . , αn ∈ R with Σ_{i=1}^n αi = 0,

Σ_{i=1}^n Σ_{j=1}^n αi αj ρ(zi, zj) ≤ 0.    (1)

Special case: Z ⊆ R^d and ρq(z, z′) = ‖z − z′‖^q. Then ρq is a valid semimetric of negative type for 0 < q ≤ 2.

Energy distance is MMD with a distance-induced kernel.
Distance covariance is HSIC with distance-induced kernels.
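A small sketch constructing the distance-induced kernel above and plugging it into a (biased) empirical MMD²; the toy data, the reference point z0 and the exponent q are illustrative choices:

```python
import numpy as np
from scipy.spatial.distance import cdist

def distance_induced_kernel(Z1, Z2, z0, q=1.0):
    """k_rho(z, z') = rho(z, z0) + rho(z', z0) - rho(z, z'), with rho(z, z') = ||z - z'||^q."""
    rho = lambda A, B: cdist(A, B) ** q
    return rho(Z1, z0) + rho(Z2, z0).T - rho(Z1, Z2)

def mmd2_biased(X, Y, kernel):
    """Biased (V-statistic) estimate of MMD^2(P, Q; F) from the earlier slide."""
    return kernel(X, X).mean() + kernel(Y, Y).mean() - 2 * kernel(X, Y).mean()

# Illustration: MMD^2 with the distance-induced kernel recovers the energy-distance
# geometry [Sejdinovic et al., 2013b].
rng = np.random.default_rng(0)
X = rng.normal(0.0, 1.0, size=(200, 2))
Y = rng.normal(0.5, 1.0, size=(200, 2))
z0 = np.zeros((1, 2))
print(mmd2_biased(X, Y, lambda A, B: distance_induced_kernel(A, B, z0, q=1.0)))
```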
Two-sample testing benchmark

Two-sample testing example in 1-D:

[Plots: density P (X) vs. three candidate densities Q(X)]
Two-sample test, MMD with distance kernel

Obtain more powerful tests on this problem when q ≠ 1 (exponent of distance)

Key:

• Gaussian kernel

• q = 1

• Best: q = 1/3

• Worst: q = 2
Nonparametric Bayesian inference using distribution embeddings

Motivating Example: Bayesian inference without a model

• 3600 downsampled frames of 20 × 20 RGB pixels (Yt ∈ [0, 1]^1200)

• 1800 training frames, remaining for test.

• Gaussian noise added to Yt.

Challenges:

• No parametric model of camera dynamics (only samples)

• No parametric model of the map from camera angle to image (only samples)

• Want to do filtering: Bayesian inference
ABC: an approach to Bayesian inference without a model

Bayes rule:

P(y|x) = P(x|y)π(y) / ∫ P(x|y)π(y)dy

• P(x|y) is the likelihood

• π(y) is the prior

One approach: Approximate Bayesian Computation (ABC)
ABC: an approach to Bayesian inference without a model

Approximate Bayesian Computation (ABC):

[ABC demonstration: prior samples y ∼ π(Y ) on one axis, likelihood samples x ∼ P(X|y) on the other; given an observation x∗, the prior draws whose simulated x lands near x∗ are kept, giving an approximate sample from P (Y |x∗)]

Needed: a distance measure D, and a tolerance parameter τ.
ABC: an approach to Bayesian inference without a model

Bayes rule:

P(y|x) = P(x|y)π(y) / ∫ P(x|y)π(y)dy

• P(x|y) is the likelihood

• π(y) is the prior

ABC generates a sample from P(Y |x∗) as follows (a minimal sketch is given below):

1. generate a sample yt from the prior π,

2. generate a sample xt from P(X|yt),

3. if D(x∗, xt) < τ, accept y = yt; otherwise reject,

4. go to (1).

In step (3), D is a distance measure, and τ is a tolerance parameter.
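A minimal sketch of this rejection sampler; `prior_sample`, `simulate`, `distance` and the toy Gaussian example are illustrative stand-ins for the user-supplied prior, likelihood simulator, and distance D:

```python
import numpy as np

def abc_rejection(x_star, prior_sample, simulate, distance, tau, n_accept=100):
    """ABC rejection sampler as described above: keep prior draws whose simulated
    data falls within tolerance tau of the observation x_star."""
    accepted = []
    while len(accepted) < n_accept:
        y = prior_sample()               # step 1: y_t ~ pi
        x = simulate(y)                  # step 2: x_t ~ P(X | y_t)
        if distance(x_star, x) < tau:    # step 3: accept if close to x*
            accepted.append(y)
    return np.array(accepted)            # approximate sample from P(Y | x*)

# Toy usage (hypothetical 1-D Gaussian example, not from the slides):
rng = np.random.default_rng(0)
posterior_draws = abc_rejection(
    x_star=1.0,
    prior_sample=lambda: rng.normal(0.0, 2.0),
    simulate=lambda y: rng.normal(y, 1.0),
    distance=lambda a, b: abs(a - b),
    tau=0.1,
)
```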
Motivating example 2: simple Gaussian case

• p(x, y) is N ((0,1⊤d )⊤, V ) with V a randomly generated covariance

Posterior mean on x: ABC vs kernel approach

[Plot: mean squared error vs. CPU time (sec) in 6 dimensions, for KBI, COND, and ABC, over a range of sample sizes / numbers of simulations]
Bayes again

Bayes rule:

P(y|x) = P(x|y)π(y) / ∫ P(x|y)π(y)dy

• P(x|y) is the likelihood

• π is the prior

How would this look with kernel embeddings?

Define RKHS G on Y with feature map ψy and kernel l(y, ·).

We need a conditional mean embedding: for all g ∈ G,

EY |x∗ g(Y ) = 〈g, µP(y|x∗)〉G

This will be obtained by RKHS-valued ridge regression.
Ridge regression and the conditional feature mean

Ridge regression from X := R^d to a finite vector output Y := R^{d′} (these could be d′ nonlinear features of y):

Define training data

X = [x1 . . . xm] ∈ R^{d×m}   Y = [y1 . . . ym] ∈ R^{d′×m}

Solve

A = argmin_{A∈R^{d′×d}} (‖Y − AX‖² + λ‖A‖²HS),

where

‖A‖²HS = tr(A⊤A) = Σ_{i=1}^{min{d,d′}} γ²A,i

Solution: A = CY X (CXX + mλI)⁻¹
Ridge regression and the conditional feature mean

Prediction at a new point x:

y∗ = Ax = CY X (CXX + mλI)⁻¹ x = Σ_{i=1}^m βi(x) yi

where

β(x) = (K + λmI)⁻¹ [k(x1, x) . . . k(xm, x)]⊤

and

K := X⊤X,   k(x1, x) = x⊤1 x

What if we do everything in kernel space?
Ridge regression and the conditional feature mean

Recall our setup:

• Given training pairs: (xi, yi) ∼ PXY

• F on X with feature map ϕx and kernel k(x, ·)

• G on Y with feature map ψy and kernel l(y, ·)

We define the covariance between feature maps:

CXX = EX (ϕX ⊗ ϕX)   CXY = EXY (ϕX ⊗ ψY )

and matrices of feature-mapped training data

X = [ϕx1 . . . ϕxm]   Y := [ψy1 . . . ψym]
Ridge regression and the conditional feature mean

Objective [Weston et al. (2003), Micchelli and Pontil (2005), Caponnetto and De Vito (2007), Grunewalder et al. (2012, 2013)]:

A = argmin_{A∈HS(F ,G)} (EXY ‖ψY − AϕX‖²G + λ‖A‖²HS),   ‖A‖²HS = Σ_{i=1}^∞ γ²A,i

Solution, same as in the vector case:

A = CY X (CXX + mλI)⁻¹

Prediction at a new x using kernels:

Aϕx = [ψy1 . . . ψym] (K + λmI)⁻¹ [k(x1, x) . . . k(xm, x)]⊤ = Σ_{i=1}^m βi(x) ψyi

where Kij = k(xi, xj)
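A minimal sketch of the weights βi(x) above, assuming the training kernel matrix K and the vector of test kernel evaluations have been computed (kernel choice and λ are left unspecified):

```python
import numpy as np

def conditional_mean_weights(K, k_x, lam):
    """beta(x) = (K + lambda*m*I)^{-1} [k(x_1,x), ..., k(x_m,x)]^T from the slide,
    so that the conditional mean embedding is mu_{Y|x} = sum_i beta_i(x) psi_{y_i}."""
    m = K.shape[0]
    return np.linalg.solve(K + lam * m * np.eye(m), k_x)

def conditional_expectation(g_y, K, k_x, lam):
    """Estimate E[g(Y) | X = x] as sum_i beta_i(x) g(y_i),
    given g_y = (g(y_1), ..., g(y_m))."""
    return g_y @ conditional_mean_weights(K, k_x, lam)
```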
Ridge regression and the conditional feature mean

How is the loss ‖ψY − AϕX‖²G relevant to a conditional expectation EY |x g(Y )? Define [Song et al. (2009), Grunewalder et al. (2013)]:

µY |x := Aϕx

We need A to have the property

EY |x g(Y ) ≈ 〈g, µY |x〉G = 〈g,Aϕx〉G = 〈A∗g, ϕx〉F = (A∗g)(x)

Natural risk function for the conditional mean:

L(A,PXY ) := sup_{‖g‖≤1} EX [ (EY |X g(Y ))(X) − (A∗g)(X) ]²,

where (EY |X g(Y ))(·) is the target and (A∗g)(·) is the estimator.
Ridge regression and the conditional feature mean

The squared loss risk provides an upper bound on the natural risk:

L(A,PXY ) ≤ EXY ‖ψY − AϕX‖²G

Proof: Jensen and Cauchy-Schwarz.

L(A,PXY ) := sup_{‖g‖≤1} EX [ (EY |X g(Y ))(X) − (A∗g)(X) ]²
           ≤ EXY sup_{‖g‖≤1} [ g(Y ) − (A∗g)(X) ]²
           = EXY sup_{‖g‖≤1} [ 〈g, ψY 〉G − 〈A∗g, ϕX〉F ]²
           = EXY sup_{‖g‖≤1} [ 〈g, ψY 〉G − 〈g,AϕX〉G ]²
           = EXY sup_{‖g‖≤1} 〈g, ψY − AϕX〉²G
           ≤ EXY ‖ψY − AϕX‖²G

If we assume EY [g(Y )|X = x] ∈ F then the upper bound is tight (next slide).
Conditions for ridge regression = conditional mean

Conditional mean obtained by ridge regression when EY [g(Y )|X = x] ∈ F :

Given a function g ∈ G, assume EY |X [g(Y )|X = ·] ∈ F . Then

CXX EY |X [g(Y )|X = ·] = CXY g.

Why this is useful:

EY |X [g(Y )|X = x] = 〈EY |X [g(Y )|X = ·], ϕx〉F
                   = 〈C⁻¹XX CXY g, ϕx〉F
                   = 〈g, CY X C⁻¹XX ϕx〉G   (CY X C⁻¹XX : the regression)

Proof [Fukumizu et al., 2004]: For all f ∈ F , by definition of CXX,

〈f, CXX EY |X [g(Y )|X = ·]〉F = cov(f, EY |X [g(Y )|X = ·])
                             = EX (f(X) EY |X [g(Y )|X])
                             = EXY (f(X)g(Y ))
                             = 〈f, CXY g〉,

by definition of CXY.
Kernel Bayes' law

• Prior: Y ∼ π(y)

• Likelihood: (X|y) ∼ P(x|y), with some joint P(x, y)

• Joint distribution: Q(x, y) = P(x|y)π(y)
  Warning: Q ≠ P, change of measure from P(y) to π(y)

• Marginal for x: Q(x) := ∫ P(x|y)π(y)dy.

• Bayes' law: Q(y|x) = P(x|y)π(y) / Q(x)
Kernel Bayes' law

• Posterior embedding via the usual conditional update:

µQ(y|x) = CQ(y,x) C⁻¹Q(x,x) ϕx.

• Given the mean embedding of the prior: µπ(y)

• Define the marginal covariance:

CQ(x,x) = ∫ (ϕx ⊗ ϕx) P(x|y)π(y)dx = C(xx)y C⁻¹yy µπ(y)

• Define the cross-covariance:

CQ(y,x) = ∫ (ψy ⊗ ϕx) P(x|y)π(y)dx = C(yx)y C⁻¹yy µπ(y).
Kernel Bayes' law: consistency result

• How to compute the posterior expectation from data?

• Given samples: {(xi, yi)}_{i=1}^n from Pxy, and {uj}_{j=1}^n from the prior π.

• Want to compute E[g(Y )|X = x] for g ∈ G

• For any x ∈ X ,

|g⊤y RY |X kX(x) − E[g(Y )|X = x]| = Op(n^{−4/27}),   (n → ∞),

where

– gy = (g(y1), . . . , g(yn))⊤ ∈ R^n

– kX(x) = (k(x1, x), . . . , k(xn, x))⊤ ∈ R^n

– RY |X is learned from the samples, and contains the uj

Smoothness assumptions:

• π/pY ∈ R(C^{1/2}YY ), where pY is the p.d.f. of PY,

• E[g(Y )|X = ·] ∈ R(C²Q(xx)).
Experiment: Kernel Bayes' law vs EKF

• Compare with the extended Kalman filter (EKF) on the camera orientation task

• 3600 downsampled frames of 20 × 20 RGB pixels (Yt ∈ [0, 1]^1200)

• 1800 training frames, remaining for test.

• Gaussian noise added to Yt.

Average MSE and standard errors (10 runs):

             KBR (Gauss)     KBR (Tr)        Kalman (9 dim.)  Kalman (Quat.)
σ² = 10⁻⁴    0.210 ± 0.015   0.146 ± 0.003   1.980 ± 0.083    0.557 ± 0.023
σ² = 10⁻³    0.222 ± 0.009   0.210 ± 0.008   1.935 ± 0.064    0.541 ± 0.022
Co-authors
• From UCL:
– Luca Baldassarre
– Steffen Grunewalder
– Guy Lever
– Sam Patterson
– Massimiliano Pontil
– Dino Sejdinovic
• External:
– Karsten Borgwardt, MPI
– Wicher Bergsma, LSE
– Kenji Fukumizu, ISM
– Zaid Harchaoui, INRIA
– Bernhard Schoelkopf, MPI
– Alex Smola, CMU/Google
– Le Song, Georgia Tech
– Bharath Sriperumbudur,
Cambridge
Selected references
Characteristic kernels and mean embeddings:
• Smola, A., Gretton, A., Song, L., Schoelkopf, B. (2007). A hilbert space embedding for distributions. ALT.
• Sriperumbudur, B., Gretton, A., Fukumizu, K., Schoelkopf, B., Lanckriet, G. (2010). Hilbert space
embeddings and metrics on probability measures. JMLR.
• Gretton, A., Borgwardt, K., Rasch, M., Schoelkopf, B., Smola, A. (2012). A kernel two-sample test. JMLR.
Two-sample, independence, conditional independence tests:
• Gretton, A., Fukumizu, K., Teo, C., Song, L., Schoelkopf, B., Smola, A. (2008). A kernel statistical test of
independence. NIPS
• Fukumizu, K., Gretton, A., Sun, X., Schoelkopf, B. (2008). Kernel measures of conditional dependence.
• Gretton, A., Fukumizu, K., Harchaoui, Z., Sriperumbudur, B. (2009). A fast, consistent kernel two-sample
test. NIPS.
• Gretton, A., Borgwardt, K., Rasch, M., Schoelkopf, B., Smola, A. (2012). A kernel two-sample test. JMLR.
Energy distance, relation to kernel distances
• Sejdinovic, D., Sriperumbudur, B., Gretton, A., Fukumizu, K., (2013). Equivalence of distance-based and
rkhs-based statistics in hypothesis testing. Annals of Statistics.
Three way interaction
• Sejdinovic, D., Gretton, A., and Bergsma, W. (2013). A Kernel Test for Three-Variable Interactions. NIPS.
Selected references (continued)
Conditional mean embedding, RKHS-valued regression:
• Weston, J., Chapelle, O., Elisseeff, A., Scholkopf, B., and Vapnik, V., (2003). Kernel Dependency
Estimation, NIPS.
• Micchelli, C., and Pontil, M., (2005). On Learning Vector-Valued Functions. Neural Computation.
• Caponnetto, A., and De Vito, E. (2007). Optimal Rates for the Regularized Least-Squares Algorithm.
Foundations of Computational Mathematics.
• Song, L., and Huang, J., and Smola, A., Fukumizu, K., (2009). Hilbert Space Embeddings of Conditional
Distributions. ICML.
• Grunewalder, S., Lever, G., Baldassarre, L., Patterson, S., Gretton, A., Pontil, M. (2012). Conditional mean
embeddings as regressors. ICML.
• Grunewalder, S., Gretton, A., Shawe-Taylor, J. (2013). Smooth operators. ICML.
Kernel Bayes rule:
• Song, L., Fukumizu, K., Gretton, A. (2013). Kernel embeddings of conditional distributions: A unified
kernel framework for nonparametric inference in graphical models. IEEE Signal Processing Magazine.
• Fukumizu, K., Song, L., Gretton, A. (2013). Kernel Bayes rule: Bayesian inference with positive definite
kernels, JMLR
Kernel CCA: Definition

• There exists a factorization of Cxy such that [Baker, 1973]

Cxy = C^{1/2}xx Vxy C^{1/2}yy ,   ‖Vxy‖S ≤ 1

• Regularized empirical estimate of the spectral norm [JMLR07]:

‖Vxy‖S := sup_{f∈F ,g∈G} 〈f, Cxy g〉F   subject to
    〈f, (Cxx + εnI)f〉F = 1,
    〈g, (Cyy + εnI)g〉G = 1,

– First canonical correlate

• Regularized empirical estimate of the HS norm [NIPS07b]:

NOCCO(z;F ,G) := ‖Vxy‖²HS = tr[RyRx],   Rx := Kx(Kx + nεnIn)⁻¹
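A minimal sketch of this NOCCO estimate; taking Kx and Ky to be the centred kernel matrices (centring as in the earlier slides) and eps the regularizer εn are assumptions of this sketch:

```python
import numpy as np

def nocco(Kx, Ky, eps):
    """NOCCO = tr[R_y R_x] with R_x = K_x (K_x + n*eps*I)^{-1}, as on the slide."""
    n = Kx.shape[0]
    Rx = Kx @ np.linalg.inv(Kx + n * eps * np.eye(n))
    Ry = Ky @ np.linalg.inv(Ky + n * eps * np.eye(n))
    return np.trace(Ry @ Rx)
```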
Kernel CCA: Illustration

• Ring-shaped density, first eigenvalue

[Plots: sample (X, Y ) with correlation −0.00; dependence witnesses f(x) and g(y); scatter of f(X) vs. g(Y ), correlation: 1.00]
Kernel CCA: Illustration

• Ring-shaped density, third eigenvalue

[Plots: sample (X, Y ) with correlation −0.00; dependence witnesses f(x) and g(y); scatter of f(X) vs. g(Y ), correlation: 0.97]
NOCCO: HS Norm of Normalized Cross Covariance

• Define NOCCO as

NOCCO := ‖Vxy‖²HS

• Characteristic kernels: the population NOCCO is the mean-square contingency, independent of the RKHS:

NOCCO = ∫∫_{X×Y} ( pxy(x, y) / (px(x)py(y)) − 1 )² px(x)py(y) dµ(x)dµ(y).

– µ(x) and µ(y) Lebesgue measures on X and Y; Pxy absolutely continuous w.r.t. µ(x) × µ(y), with density pxy and marginal densities px and py

• Convergence result: assume the regularization εn satisfies εn → 0 and ε³n n → ∞. Then

‖V̂xy − Vxy‖HS → 0

in probability
References

C. Baker. Joint measures and cross-covariance operators. Transactions of the American Mathematical Society, 186:273–289, 1973.

L. Baringhaus and C. Franz. On a new multivariate two-sample test. J. Multivariate Anal., 88:190–206, 2004.

C. Berg, J. P. R. Christensen, and P. Ressel. Harmonic Analysis on Semigroups. Springer, New York, 1984.

K. Chwialkowski and A. Gretton. A kernel independence test for random processes. ICML, 2014.

K. Chwialkowski, D. Sejdinovic, and A. Gretton. A wild bootstrap for degenerate kernel tests. NIPS, 2014.

Andrey Feuerverger. A consistent test for bivariate dependence. International Statistical Review, 61(3):419–433, 1993.

K. Fukumizu, F. R. Bach, and M. I. Jordan. Dimensionality reduction for supervised learning with reproducing kernel Hilbert spaces. Journal of Machine Learning Research, 5:73–99, 2004.

K. Fukumizu, A. Gretton, X. Sun, and B. Scholkopf. Kernel measures of conditional dependence. In Advances in Neural Information Processing Systems 20, pages 489–496, Cambridge, MA, 2008. MIT Press.

K. Fukumizu, L. Song, and A. Gretton. Kernel Bayes' rule: Bayesian inference with positive definite kernels. Journal of Machine Learning Research, 14:3753–3783, 2013.

A. Gretton and L. Gyorfi. Consistent nonparametric tests of independence. Journal of Machine Learning Research, 11:1391–1423, 2010.

A. Gretton, O. Bousquet, A. J. Smola, and B. Scholkopf. Measuring statistical dependence with Hilbert-Schmidt norms. In S. Jain, H. U. Simon, and E. Tomita, editors, Proceedings of the International Conference on Algorithmic Learning Theory, pages 63–77. Springer-Verlag, 2005.

A. Gretton, K. Borgwardt, M. Rasch, B. Scholkopf, and A. J. Smola. A kernel method for the two-sample problem. In Advances in Neural Information Processing Systems 15, pages 513–520, Cambridge, MA, 2007. MIT Press.

A. Gretton, K. Fukumizu, C.-H. Teo, L. Song, B. Scholkopf, and A. J. Smola. A kernel statistical test of independence. In Advances in Neural Information Processing Systems 20, pages 585–592, Cambridge, MA, 2008. MIT Press.

A. Gretton, K. Borgwardt, M. Rasch, B. Schoelkopf, and A. Smola. A kernel two-sample test. JMLR, 13:723–773, 2012.

Arthur Gretton. A simpler condition for consistency of a kernel independence test. Technical Report 1501.06103, arXiv, 2015.

Wittawat Jitkrittum, Arthur Gretton, Nicolas Heess, S. M. Ali Eslami, Balaji Lakshminarayanan, Dino Sejdinovic, and Zoltan Szabo. Kernel-based just-in-time learning for passing expectation propagation messages. UAI, 2015.

D. Sejdinovic, A. Gretton, and W. Bergsma. A kernel test for three-variable interactions. In NIPS, 2013a.

D. Sejdinovic, B. Sriperumbudur, A. Gretton, and K. Fukumizu. Equivalence of distance-based and RKHS-based statistics in hypothesis testing. Annals of Statistics, 41(5):2263–2702, 2013b.

D. Sejdinovic, H. Strathmann, M. Lomeli Garcia, C. Andrieu, and A. Gretton. Kernel adaptive Metropolis-Hastings. ICML, 2014.

A. J. Smola, A. Gretton, L. Song, and B. Scholkopf. A Hilbert space embedding for distributions. In Proceedings of the International Conference on Algorithmic Learning Theory, volume 4754, pages 13–31. Springer, 2007.

B. Sriperumbudur, K. Fukumizu, A. Gretton, and A. Hyvarinen. Density estimation in infinite dimensional exponential families. Technical Report 1312.3516, ArXiv e-prints, 2014.

Heiko Strathmann, Dino Sejdinovic, Samuel Livingstone, Zoltan Szabo, and Arthur Gretton. Gradient-free Hamiltonian Monte Carlo with efficient kernel exponential families. arXiv, 2015.

G. Szekely and M. Rizzo. Testing for equal distributions in high dimension. InterStat, 5, 2004.

G. Szekely and M. Rizzo. A new test for multivariate normality. J. Multivariate Anal., 93:58–80, 2005.

G. Szekely, M. Rizzo, and N. Bakirov. Measuring and testing dependence by correlation of distances. Ann. Stat., 35(6):2769–2794, 2007.

K. Zhang, J. Peters, D. Janzing, and B. Scholkopf. Kernel-based conditional independence test and application in causal discovery. In 27th Conference on Uncertainty in Artificial Intelligence, pages 804–813, 2011.