Kernel Thinning and Stein Thinning
Lester Mackey
Microsoft Research New England
October 15, 2021
Joint work with Raaz Dwivedi, Marina Riabiz, Wilson Ye Chen, Jon Cockayne, Pawel Swietach, Steven A. Niederer, Chris J. Oates, and Abhishek Shetty
Mackey (MSR) Kernel Thinning and Stein Thinning October 15, 2021 1 / 38
Motivation: Computational Cardiology
Computational Cardiology: Developing multiscale digital twins of human hearts to non-invasively predict disease progression and therapy response [Niederer, Sacks, Girolami, and Willcox, 2021]
Organ scale · Tissue scale · Single cell scale
Figure credit: Marina Riabiz
Example (Heartbeats and arrhythmias)
Whole-organ heartbeats are coordinated by calcium signaling in heart cells
Dysregulation known to lead to life-threatening heart arrhythmias
Goal: Model impact of calcium signaling dysregulation on heart function [Campos, Shiferaw, Prassl, Boyle, Vigmond, and Plank, 2015, Niederer, Lumens, and Trayanova, 2019, Colman, 2019]
Motivation: Computational Cardiology
Figure credit: Augustin et al. 2020
Inferential Pipeline (Impact of calcium signaling dysregulation on heart function)
1 Estimate unknown calcium signaling model parameters from patient data
2 Capture uncertainty by sampling many likely parameter configurations
Run Markov chain Monte Carlo (MCMC) to (eventually) draw sample points from the posterior distribution P over unknown parameters
May require millions of sample points to adequately explore target distribution P
3 Propagate uncertainty by simulating whole-heart model for each configuration
Problem: Each simulation requires 1000s of CPU hours!
Questions: Can we accurately summarize P using many fewer points? How?
Distribution Compression
Goal: Accurately summarize a distribution P using a small number of points
Standard solutions
i.i.d. sampling directly from P
MCMC with Markov chain converging to P
[Figure: scatter plots in (x1, x2) of an i.i.d. sample and an MCMC sample approximating the target]
Benefits: Readily available and eventually high-quality
Provide asymptotically exact sample estimates P_n f = (1/n) ∑_{i=1}^n f(x_i) for intractable expectations Pf = E_{X∼P}[f(X)]
Drawback: Samples are too large!
Typical integration error P_n f − Pf = Θ(n^{−1/2}): need n = 10000 for 1% error
Prohibitive for expensive downstream tasks and function evaluations
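The Θ(n^{−1/2}) rate above is easy to see empirically. A small sketch (my own illustration, not from the talk), assuming P = N(0, 1) and f(x) = x², so that Pf = 1 exactly:

```python
import numpy as np

# Illustration (not from the talk): Monte Carlo estimates P_n f converge at the
# Theta(n^{-1/2}) rate, so ~1% accuracy needs on the order of n = 10,000 points.
# Here P = N(0, 1) and f(x) = x^2, so Pf = E[f(X)] = 1 exactly.
rng = np.random.default_rng(0)

def mean_abs_error(n, reps=200):
    """Average |P_n f - Pf| over independent size-n samples."""
    xs = rng.standard_normal((reps, n))
    return np.mean(np.abs((xs ** 2).mean(axis=1) - 1.0))

errs = {n: mean_abs_error(n) for n in (100, 10_000)}
# Multiplying n by 100 should shrink the error by only about a factor of 10
```

A factor-100 increase in sample size buys only a factor-10 reduction in error, which is what motivates better-than-i.i.d. compression.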
Idea: Directly compress the high-quality sample approximations P_n
Reduces general problem to approximating empirical distributions
Distribution Compression
Question: How do we effectively compress an empirical distribution Pn?
Standard solutions
Uniform subsampling / i.i.d. sampling
Standard thinning: Keep every t-th point
[Figure: scatter plots in (x1, x2) of a uniformly subsampled coreset and a standard-thinned coreset]
Drawback: Large loss in accuracy, worst case integration error = Θ(√(t/n))
Compression from n to √n points increases error from Θ(n^{−1/2}) to Θ(n^{−1/4})
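Standard thinning is a one-liner. A minimal sketch (my own, with illustrative array sizes):

```python
import numpy as np

# Minimal sketch of standard MCMC thinning: keep every t-th point, so n input
# points yield a coreset of n/t points. (Array sizes here are illustrative.)
def standard_thin(points: np.ndarray, t: int) -> np.ndarray:
    """Keep every t-th point of the input sequence."""
    return points[::t]

x = np.arange(16.0)               # pretend sample of n = 16 points
coreset = standard_thin(x, t=4)   # compress to sqrt(n) = 4 points
```

The resulting coreset is cheap to form but, per the bound above, pays a Θ(√(t/n)) worst-case integration error.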
Question: Can we do better?
Minimax lower bounds for worst-case integration error to P:
Ω(n^{−1/2}) for any compression procedure returning √n points [Phillips and Tai, 2020]
Ω(n^{−1/2}) for any approximation based only on n i.i.d. points from P [Tolstikhin, Sriperumbudur, and Muandet, 2017]
This talk: Introduce a more effective compression strategy – kernel thinning – that matches these lower bounds up to log factors
Problem Setup
Given:
Input points S_in = {x_1, . . . , x_n} ⊂ R^d with empirical distribution P_n = (1/n) ∑_{i=1}^n δ_{x_i}
Pre-generated by any algorithm (i.i.d. sampling, MCMC, quadrature, kernel herding)
Target output size s (e.g., s = √n for heavy compression)
Goal: Return coreset S_out ⊂ S_in with |S_out| = s, Q = (1/s) ∑_{x∈S_out} δ_x, and o(s^{−1/2}) (better-than-i.i.d.) worst-case integration error between P_n and Q
Maximum Mean Discrepancies
Goal: Return coreset S_out ⊂ S_in with |S_out| = s, Q = (1/s) ∑_{x∈S_out} δ_x, and o(s^{−1/2}) worst-case integration error between P_n and Q
Quality measure: Maximum mean discrepancy (MMD) [Gretton, Borgwardt, Rasch, Scholkopf, and Smola, 2012]
MMD_k(P_n, Q) = sup_{‖f‖_k ≤ 1} |P_n f − Q f|
Measures maximum discrepancy between input and coreset expectations over a class of real-valued test functions (unit ball of a reproducing kernel Hilbert space)
Parameterized by a reproducing kernel k: any symmetric (k(x, y) = k(y, x)) and positive semidefinite (∑_{i,l} c_i c_l k(z_i, z_l) ≥ 0, ∀ z_i ∈ R^d, c_i ∈ R) function
Gaussian: k(x, y) = e^{−‖x−y‖²₂ / 2}; Inverse multiquadric: k(x, y) = (1 + ‖x − y‖²₂)^{−1/2}
Metrizes convergence in distribution for popular infinite-dimensional kernels (e.g., Gaussian, Matern, B-spline, inverse multiquadric, sech, and Wendland)
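Concretely, for a pair of empirical distributions MMD_k reduces to averages of kernel evaluations. A small sketch (my own, not the talk's code), assuming a Gaussian kernel with unit bandwidth:

```python
import numpy as np

# Sketch (not the talk's code): for empirical distributions,
#   MMD_k(P_n, Q)^2 = mean_ij k(x_i, x_j) + mean_ij k(y_i, y_j) - 2 mean_ij k(x_i, y_j)
# Here k is a Gaussian kernel with unit bandwidth (an illustrative choice).

def gaussian_gram(X, Y):
    """Gram matrix k(x_i, y_j) = exp(-||x_i - y_j||^2 / 2)."""
    sq = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-0.5 * sq)

def mmd(X, Y):
    """MMD_k between the empirical distributions of the rows of X and Y."""
    val = (gaussian_gram(X, X).mean() + gaussian_gram(Y, Y).mean()
           - 2 * gaussian_gram(X, Y).mean())
    return np.sqrt(max(val, 0.0))

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 2))
Y = rng.standard_normal((200, 2)) + 3.0   # shifted sample
```

Identical samples give MMD 0, while the shifted sample is far from X in MMD.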
Square-root Kernels
Definition (Square-root kernel)
A reproducing kernel krt is a square-root kernel for k if
k(x, y) = ∫_{R^d} krt(x, z) krt(y, z) dz.
Name of kernel k(x, y) = κ(x − y) | Expression for κ(z) | Fourier transform κ̂(ω) | Square-root kernel krt
Gaussian(σ), σ > 0 | exp(−‖z‖²₂ / (2σ²)) | σ^d exp(−σ² ‖ω‖²₂ / 2) | (2 / (πσ²))^{d/4} Gaussian(σ/√2)
Matern(ν, γ), ν > d, γ > 0 | c_{ν−d/2} (γ‖z‖₂)^{ν−d/2} K_{ν−d/2}(γ‖z‖₂) | φ_{d,ν,γ} (γ² + ‖ω‖²₂)^{−ν} | A_{ν,γ,d} Matern(ν/2, γ)
B-spline(2β + 1), β ∈ 2N + 1 | S_{2β+2,d} ∏_{j=1}^d ⊛^{2β+2} 1_{[−1/2, 1/2]}(z_j) | S′_{2β+2,d} ∏_{j=1}^d sin^{2β+2}(ω_j/2) / ω_j^{2β+2} | S_{β,d} B-spline(β)
Exact square-root kernel not necessary: see Dwivedi and Mackey [2021b] for convenient choices for inverse multiquadric, sech, Wendland, and all sufficiently smooth and integrable κ
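The square-root identity can be sanity-checked numerically. A d = 1 sketch (my own) for the Gaussian row of the table, where krt is a rescaled Gaussian(σ/√2):

```python
import numpy as np

# Numerical check (d = 1, my own sketch) of the square-root kernel identity
# k(x, y) = integral of krt(x, z) krt(y, z) dz for k = Gaussian(sigma), using
# the normalizing constant (2 / (pi sigma^2))^{1/4} from the table.
sigma = 1.0

def k(x, y):
    return np.exp(-(x - y) ** 2 / (2 * sigma ** 2))

def krt(x, y):
    s = sigma / np.sqrt(2)
    const = (2 / (np.pi * sigma ** 2)) ** 0.25
    return const * np.exp(-(x - y) ** 2 / (2 * s ** 2))

z = np.linspace(-10, 10, 20001)               # quadrature grid covering the tails
dz = z[1] - z[0]
x, y = 0.3, -0.7
approx = np.sum(krt(x, z) * krt(y, z)) * dz   # should match k(x, y)
```

The Riemann sum reproduces k(x, y) to high precision, confirming that the convolution of two Gaussian(σ/√2) kernels with this constant is exactly Gaussian(σ).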
Kernel Thinning
Two stages
1 Initialization: kt-split
Divides input S_in into 2^m balanced candidate coresets, each of size n/2^m, using non-uniform randomness
When m = (1/2) log₂(n), obtain √n coresets of size √n
2 Refinement: kt-swap
Selects candidate coreset closest to S_in in terms of MMD_k
Iteratively refines the coreset by replacing each coreset point in turn with the best alternative in S_in, as measured by MMD_k
Complexity
Time: dominated by O(n²) kernel evaluations
Space: O(min(nd, n²))
KT-SPLIT
[Diagram: binary splitting tree S_in → (S_{1,1}, S_{1,2}) → (S_{2,1}, …, S_{2,4}) → (S_{3,1}, …, S_{3,8}) → … → (S_{m,1}, …, S_{m,2^m})]
kt-split partitions the input S_in recursively, first dividing the input sequence in half, then halving those halves into quarters, and so on
Runs online: after i input points processed, have output coresets of size i/2^m
KT-SPLIT
[Diagram: input S_in (n points) → Kernel Halving → S(1,1) (n/2 points) → Kernel Halving → S(2,1) (n/4 points) → … → output S(m,1) (n/2^m points) after m Kernel Halving rounds]
Each output coreset S(m,ℓ) is the result of repeated kernel halving
On each halving round, remaining points are paired, and one point from each pair is selected using a new Hilbert space generalization of the self-balancing walk of Alweiss, Liu, and Sawhney [2020]
Kernel Halving with a Self-Balancing Hilbert Walk
Algorithm: Self-Balancing Hilbert Walk [Dwivedi and Mackey, 2021b]
Input: sequence of functions (f_i)_{i=1}^{n/2} in Hilbert space H, threshold sequence (a_i)_{i=1}^{n/2}
ψ_0 ← 0 ∈ H
for i = 1, 2, . . . , n/2 do
    α_i ← ⟨ψ_{i−1}, f_i⟩_H // Compute Hilbert space inner product
    if |α_i| > a_i:
        ψ_i ← ψ_{i−1} − f_i · α_i / a_i // We choose a_i to avoid this case with high probability
    else:
        η_i ← 1 with probability (1/2)(1 − α_i/a_i) and η_i ← −1 otherwise
        ψ_i ← ψ_{i−1} + η_i f_i
end
return ψ_{n/2}, sum of signed input functions // ψ_{n/2} = ∑_{i=1}^{n/2} η_i f_i with high probability
1 Kernel Halving: If f_i = krt(x_{2i−1}, ·) − krt(x_{2i}, ·), half of input points S_out given sign 1
⇒ (1/n) ψ_{n/2} = P_n krt − Q krt with Q = (2/n) ∑_{x∈S_out} δ_x
2 Balance: If H = krt RKHS, P_n krt(x) − Q krt(x) is O(√(log(n)/n)) sub-Gaussian, ∀x
In contrast, i.i.d. signs η_i give P_n krt(x) − Q krt(x) = Ω(1/√n)
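To make the kernel-halving instantiation concrete, here is a simplified sketch (my own, not the paper's implementation). It uses a fixed threshold a in place of the paper's adaptive thresholds and simply clips the sign probability when |α_i| > a, so the rescaling branch of the algorithm is omitted:

```python
import numpy as np

# Simplified kernel halving via the self-balancing walk (my sketch, not the
# paper's code): fixed threshold a, Gaussian square-root kernel, and the sign
# probability clipped instead of rescaling psi when |alpha_i| > a.
rng = np.random.default_rng(0)

def krt(x, y):
    return np.exp(-np.sum((np.asarray(x) - np.asarray(y)) ** 2))

def kernel_halve(points, a=1.0):
    """Return indices of the ~n/2 points kept (one per consecutive pair)."""
    n = len(points)
    signs = np.zeros(n // 2)
    for i in range(n // 2):
        # alpha_i = <psi_{i-1}, f_i> with f_i = krt(x_{2i}, .) - krt(x_{2i+1}, .),
        # expanded into kernel evaluations over previously signed pairs
        alpha = sum(
            signs[j] * (krt(points[2*j], points[2*i]) - krt(points[2*j], points[2*i+1])
                        - krt(points[2*j+1], points[2*i]) + krt(points[2*j+1], points[2*i+1]))
            for j in range(i))
        p_plus = 0.5 * (1.0 - np.clip(alpha / a, -1.0, 1.0))
        signs[i] = 1.0 if rng.random() < p_plus else -1.0
    # sign +1 keeps the first point of the pair, sign -1 the second
    return [2*i if s > 0 else 2*i+1 for i, s in enumerate(signs)]

x = rng.standard_normal((64, 2))
keep = kernel_halve(x)   # indices of a 32-point coreset
```

Each pair contributes exactly one point, so one round halves the input; kt-split applies m such rounds.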
Why the Square-root Kernel krt?
Theorem (L∞ coresets for (krt,Pn) are MMD coresets for (k,Pn) [Dwivedi and Mackey, 2021b])
For any scalars R, a, b ≥ 0 with a + b = 1, we have
MMD_k(P_n, Q) ≤ v_d R^{d/2} · ‖P_n krt − Q krt‖_∞ + 2 τ_krt(aR) + 2 ‖k‖_∞^{1/2} · max{τ_{P_n}(bR), τ_Q(bR)}
for v_d := π^{d/4} / Γ(d/2 + 1)^{1/2}.
L∞ error: ‖P_n krt − Q krt‖_∞ := sup_{x∈R^d} |P_n krt(x) − Q krt(x)|
Tail decay of (P_n, Q, krt): τ_{P_n}(R) := P_n(‖X‖₂ ≥ R)
Effective radius: want τ_krt(aR), τ_{P_n}(bR), τ_Q(bR) = O(1/√n)
R = O(1) for compact support, R = O(log(n)) for sub-exponential decay
When (P_n, Q, krt) are compactly supported, MMD_k(P_n, Q) = O(‖P_n krt − Q krt‖_∞)
L∞ Coresets from Kernel Halving
Theorem (L∞ guarantees for kernel halving [Dwivedi and Mackey, 2021b])
With high probability,
1 Kernel halving yields a 2-thinned L∞ coreset Q^(1)_KH satisfying
‖P_n krt − Q^(1)_KH krt‖_∞ ≤ ‖krt‖_∞ · (2/n) M_krt(P_n)
2 Repeated kernel halving yields a 2^m-thinned L∞ coreset Q^(m)_KH satisfying
‖P_n krt − Q^(m)_KH krt‖_∞ ≤ ‖krt‖_∞ · (2^m/n) M_krt(P_n)
M_krt(P_n) = O(√(log n)) for compactly supported (P, krt) and O(log n) in general
With m = (1/2) log₂(n) rounds, yields √n points with O(n^{−1/2} log(n)) L∞ error
An equal-sized i.i.d. sample has Ω(n^{−1/4}) L∞ error
Near-optimal: any procedure outputting √n points must suffer Ω(n^{−1/2}) L∞ error for some P_n [Phillips and Tai, 2020, Thm. 3.1]
MMD Coresets from Kernel Thinning
Theorem (MMD guarantee for kernel thinning [Dwivedi and Mackey, 2021b])
Kernel thinning returns a coreset Q_KT with √n points satisfying, with high probability,
MMD_k(P_n, Q_KT) =
O(√(log n / n)) for compact support (P, krt) (e.g., B-spline k)
O((log n)^{(d+2)/4} log log n / √n) for sub-Gaussian (P, krt) (e.g., Gaussian k)
O((log n)^{(d+1)/2} log log n / √n) for sub-exponential (P, krt) (e.g., Matern k)
An equal-sized i.i.d. sample has Ω(n^{−1/4}) MMD
Sub-exponential guarantees resemble the classical O((log n)^d / √n) quasi-Monte Carlo error rates for uniform P on [0, 1]^d but apply to more general distributions on R^d
See the paper for non-asymptotic bounds with explicit constants and n/2^m points
Related Work on MMD Coresets
Uniform distribution P on [0, 1]^d: O(n^{−1/2} log^d n) L2 discrepancy MMD, √n points
Quasi-Monte Carlo [Hickernell, 1998, Novak and Wozniakowski, 2010], Online Haar strategy [Dwivedi, Feldheim, Gurel-Gurevich, and Ramdas, 2019]
Order n^{−1/4} MMD coresets for general P
i.i.d. [Tolstikhin, Sriperumbudur, and Muandet, 2017], geometrically ergodic MCMC [Dwivedi and Mackey, 2021b]
Kernel herding [Chen, Welling, and Smola, 2010, Lacoste-Julien, Lindsten, and Bach, 2015], Stein points MCMC [Chen, Barp, Briol, Gorham, Girolami, Mackey, and Oates, 2019], Greedy sign selection [Karnin and Liberty, 2019]
Finite-dimensional linear kernels on R^d: O(√d n^{−1/2} log^{2.5} n), √n points
Discrepancy construction [Harvey and Samadi, 2014]: does not cover infinite-dimensional k
Unknown coreset quality
Super-sampling with a reservoir [Paige, Sejdinovic, and Wood, 2016]: coreset quality not analyzed
Support points [Mak and Joseph, 2018]
Optimal √n coreset has o(n^{−1/4}) energy distance MMD but no construction given
Practical convex-concave procedures not analyzed or shown to be optimal
Related Work on L∞ Coresets
L∞ coresets for P_n: o(n^{−1/4}) L∞ error, √n points
Series of breakthroughs due to [Joshi, Kommaraji, Phillips, and Venkatasubramanian, 2011, Phillips, 2013, Phillips and Tai, 2018, 2020, Tai, 2020]
Best known L∞ guarantees (for coreset of size √n)
Phillips and Tai [2020]: O(√d n^{−1/2} √(log n)) error, Ω(n⁴) time, Ω(n²) space
Tai [2020] (Gaussian k): O(2^d n^{−1/2} √(log(d log n))) error, Ω(max(d^{5d}, n⁴)) time
Both are offline and require rebalancing after approximate halving steps
This work: O(√d n^{−1/2} log n) error, O(n²) time, O(nd) space, online, exact halving
Sub-Gaussian (krt, P): O(√d n^{−1/2} √(log n) log log n) error
Compact support (krt, P): O(√d n^{−1/2} √(log n)) error
Kernel Thinning vs. i.i.d. Sampling: Mixture of Gaussians
[Figure: mixture-of-Gaussians log-density contours overlaid with 8, 16, and 32 i.i.d. points (top row) vs. 8, 16, and 32 KT points (bottom row)]
P = (1/k) ∑_{j=1}^k N(μ_j, I_d)
k(x, y) = exp(−‖x − y‖²₂ / (2σ²)) with σ² = 2d
Even for small sample sizes, kernel thinning (KT) provides
Better stratification across components
Less clumping and fewer gaps within components
Kernel Thinning vs. i.i.d. Sampling: Mixture of Gaussians
[Figure: a second, more widely spread mixture-of-Gaussians log-density contour plot with 8, 16, and 32 i.i.d. points (top row) vs. 8, 16, and 32 KT points (bottom row)]
P = (1/M) ∑_{j=1}^M N(μ_j, I_d)
k(x, y) = exp(−‖x − y‖²₂ / (2σ²)) with σ² = 2d
Even for small sample sizes, kernel thinning (KT) provides
Better stratification across components
Less clumping and fewer gaps within components
Kernel Thinning vs. i.i.d. Sampling: Mixture of Gaussians
[Figure: mean MMD vs. coreset size n for M = 4, 6, 8 mixture components; fitted decay rates — M=4: iid n^{−0.26}, KT n^{−0.52}; M=6: iid n^{−0.25}, KT n^{−0.51}; M=8: iid n^{−0.25}, KT n^{−0.52}]
Kernel thinning (KT) improves both rate of decay and order of magnitude of MMD_k(P, Q_KT)
P = (1/M) ∑_{j=1}^M N(μ_j, I_d), d = 2
k(x, y) = exp(−‖x − y‖²₂ / (2σ²)) with σ² = 2d
Kernel Thinning vs. i.i.d. Sampling: Higher Dimensions
[Figure: mean MMD vs. coreset size n for d = 2, 4, 10, 100; fitted decay rates — d=2: iid n^{−0.25}, KT n^{−0.51}; d=4: iid n^{−0.25}, KT n^{−0.48}; d=10: iid n^{−0.25}, KT n^{−0.43}; d=100: iid n^{−0.25}, KT n^{−0.34}]
Kernel thinning (KT) improves both rate of decay and order of magnitude of MMD_k(P, Q_KT) even for high dimensions and small sample sizes
P = N(0, I_d)
k(x, y) = exp(−‖x − y‖²₂ / (2σ²)) with σ² = 2d
Kernel Thinning vs. Standard MCMC Thinning
Posterior inference for systems of ordinary differential equations (ODEs)
P = posterior distribution of coupled ODE model parameters given observed data
Goodwin model of oscillatory enzymatic control (d = 4) [Goodwin, 1965]
Lotka-Volterra model of oscillatory predator-prey evolution (d = 4) [Lotka, 1925, Volterra, 1926]
Hinch model of cardiac calcium signalling (d = 38) [Hinch, Greenstein, Tanskanen, Xu, and Winslow, 2004]
Downstream goal: propagate model uncertainty through whole-heart simulation
Every sample point discarded via compression = 1000s of CPU hours saved
MCMC input points [Riabiz, Chen, Cockayne, Swietach, Niederer, Mackey, and Oates, 2021]
Gaussian random walk (RW), adaptive RW (adaRW) [Haario, Saksman, and Tamminen, 1999]
2 weeks of CPU time to generate each RW Hinch chain of length 4 × 10⁶
Metropolis-adjusted Langevin algorithm (MALA) [Roberts and Tweedie, 1996]
Pre-conditioned MALA (pMALA) [Girolami and Calderhead, 2011]
Discarded burn-in and standard thinned to form P_n
k(x, y) = exp(−‖x − y‖²₂ / (2σ²)) with median heuristic σ²
[Garreau, Jitkrittum, and Kanagawa, 2017]
[Figure: mean MMD vs. coreset size n for standard thinning vs. KT across twelve MCMC settings; fitted decay rates —
Goodwin RW: standard n^{−0.19}, KT n^{−0.50}; Goodwin adaRW: standard n^{−0.15}, KT n^{−0.48}; Goodwin MALA: standard n^{−0.28}, KT n^{−0.49}; Goodwin pMALA: standard n^{−0.30}, KT n^{−0.50};
Lotka-Volterra RW: standard n^{−0.14}, KT n^{−0.56}; Lotka-Volterra adaRW: standard n^{−0.24}, KT n^{−0.57}; Lotka-Volterra MALA: standard n^{−0.40}, KT n^{−0.56}; Lotka-Volterra pMALA: standard n^{−0.31}, KT n^{−0.57};
Hinch 1: standard n^{−0.46}, KT n^{−0.53}; Hinch 2: standard n^{−0.43}, KT n^{−0.49}; Hinch Tempered 1: standard n^{−0.35}, KT n^{−0.38}; Hinch Tempered 2: standard n^{−0.32}, KT n^{−0.38}]
KT improves rate of decay and magnitude of MMD, even when standard thinning is accurate
Something’s Wrong
Problem: The Hinch Markov chains haven’t mixed!
Solution: Use a more diffuse tempered posterior for faster mixing
Problem: Tempering introduces a persistent bias
MCMC points Pn will be summarizing the wrong target P
Question: Can we correct for such biases during compression?
[Figure: marginal distributions of Hinch Parameters 1–10 for Stein thinning variants (med, sclmed, smpcov) vs. MCMC output (seeds 1 and 2) and the prior]
Compression with Bias Correction
Question: Can we correct for distributional biases in Pn during compression?
e.g., Biases due to off-target sampling, tempering, approximate MCMC, or burn-in
[Figure: scatter plots in (x1, x2) of a biased input sample, the target P, and a bias-corrected coreset]
Difficulty: Pn alone is insufficient; need to measure distance to the true target P
Measuring Distance to P
Quality measure: Maximum mean discrepancy (MMD) [Gretton, Borgwardt, Rasch, Scholkopf, and Smola, 2012]
MMD_k(P, Q) = sup_{‖f‖_k ≤ 1} |Pf − Qf| = √((P × P)k + (Q × Q)k − 2(Q × P)k)
Problem: Integration under P is typically intractable!
⇒ Pk and MMDk(P,Q) cannot be computed in practice for most kernels
Idea: Only consider kernels kP with PkP known a priori to be 0
Then MMD computation only depends on Q!
Kernel Stein Discrepancies
Idea: Consider MMDkP with PkP known a priori to be 0
Kernel Stein discrepancy (KSD) [Chwialkowski, Strathmann, and Gretton, 2016, Liu, Lee, and Jordan, 2016, Gorham and Mackey, 2017]
k_P(x, y) = ∑_{j=1}^d (1 / (p(x) p(y))) ∇_{x_j} ∇_{y_j} (p(x) k(x, y) p(y)) [Oates, Girolami, and Chopin, 2017]
P has differentiable Lebesgue density p
k is a bounded base kernel with bounded continuous derivatives
Depends on P through ∇ log p: computable when normalization constant unknown
P k_P = 0 whenever ∇ log p is integrable [Gorham and Mackey, 2017]
⇒ Kernel Stein discrepancy MMDkP(P,Q) is computable!
Theorem (KSD controls convergence in distribution [Gorham and Mackey, 2017, Chen, Barp, Briol, Gorham, Girolami, Mackey, and Oates, 2019])
Consider the base kernel k(x, y) = (c² + ‖Γ(x − y)‖²₂)^{−1/2} for any c > 0 and positive definite Γ. If P has strongly log concave tails and Lipschitz ∇ log p, then Q_s ⇒ P whenever MMD_{k_P}(P, Q_s) → 0.
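Expanding the product-rule definition of k_P above with b := ∇ log p gives the standard form k_P(x, y) = ⟨b(x), b(y)⟩ k(x, y) + ⟨b(x), ∇_y k(x, y)⟩ + ⟨b(y), ∇_x k(x, y)⟩ + ∑_j ∂_{x_j} ∂_{y_j} k(x, y). A NumPy sketch (my own, not the talk's code) for the inverse multiquadric base kernel of the theorem with Γ = I and the illustrative target P = N(0, I_d):

```python
import numpy as np

# Sketch (my own) of the Langevin Stein kernel k_P with inverse multiquadric
# base kernel k(x,y) = (c^2 + ||x - y||^2)^{-1/2} (Gamma = I) and score
# b(x) = grad log p(x) = -x for the illustrative target P = N(0, I_d).
def stein_gram(X, Y, c=1.0):
    """Matrix of k_P(x_i, y_j) values."""
    bX, bY = -X, -Y                                  # scores for N(0, I)
    R = X[:, None, :] - Y[None, :, :]                # pairwise x - y
    r2 = np.sum(R ** 2, axis=2)
    s = c ** 2 + r2                                  # c^2 + ||x - y||^2
    base = s ** -0.5                                 # IMQ base kernel
    d = X.shape[1]
    dot_bb = bX @ bY.T                               # <b(x), b(y)>
    dot_br = np.einsum("ijk,ijk->ij", bX[:, None, :] - bY[None, :, :], R)
    # <b(x),b(y)> k + s^{-3/2} <b(x)-b(y), x-y> + d s^{-3/2} - 3 ||x-y||^2 s^{-5/2}
    return dot_bb * base + s ** -1.5 * dot_br + d * s ** -1.5 - 3 * r2 * s ** -2.5

def ksd(X, c=1.0):
    """KSD MMD_{k_P}(P, Q_n) of the empirical distribution of the rows of X."""
    return np.sqrt(max(stein_gram(X, X, c).mean(), 0.0))

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 2))    # on-target sample
```

Note that only the score −x is needed, never the normalizing constant of p, and the discrepancy is computed from Q alone, as claimed.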
Stein Thinning
Idea: Greedily minimize KSD using points from S_in = {x_1, . . . , x_n} [Riabiz, Chen, Cockayne, Swietach, Niederer, Mackey, and Oates, 2021]
Choose initial approximation Q_1 = δ_{y_1} with
y_1 ∈ argmin_{y∈S_in} MMD_{k_P}(P, δ_y) = argmin_{y∈S_in} k_P(y, y)
Iteratively construct Q_s = (1/s) ∑_{i=1}^s δ_{y_i} with
y_s ∈ argmin_{y∈S_in} MMD_{k_P}(P, ((s−1)/s) Q_{s−1} + (1/s) δ_y)
    = argmin_{y∈S_in} k_P(y, y) + 2 ∑_{i=1}^{s−1} k_P(y_i, y)
Same point x_i can be selected multiple times
Running time = O(n ∑_{i=1}^s r_i) for r_i ≤ i the number of distinct points selected prior to round i (worst case = O(ns²))
Stein Thinning Guarantees
Theorem (Stein thinning KSD guarantee [Riabiz, Chen, Cockayne, Swietach, Niederer, Mackey, and Oates, 2021] )
MMD_{k_P}(P, Q_s)² ≤ inf_{w∈Δ^{n−1}} MMD_{k_P}(P, ∑_{i=1}^n w_i δ_{x_i})² + ((1 + log(s)) / s) · max_{x∈S_in} k_P(x, x)
Expect max_{x∈S_in} k_P(x, x) = O(log(n)) for sub-Gaussian input and k_P(x, x) = O(‖x‖²₂)
Takeaway: Stein thinning performs nearly as well as best simplex reweighting of Sin
⇒ Nearly as well as Markov chain with burn-in removed!
⇒ Nearly as well as off-target sample after optimal importance sampling reweighting!
[Figure: scatter plots in (x1, x2) of the biased input sample, the target P, and the Stein-thinned coreset]
Stein Thinning Guarantees
Takeaway: Stein thinning performs nearly as well as best simplex reweighting of Sin
⇒ Nearly as well as Markov chain with burn-in removed!
⇒ Nearly as well as off-target sample after optimal importance sampling reweighting!
Theorem (Stein thinning corrects off-target sampling [Riabiz, Chen, Cockayne, Swietach, Niederer, Mackey, and Oates, 2021])
If S_in is drawn i.i.d. from an off-target distribution P̃, then, under mild conditions (s ≤ n, log(n) = O(s^{β/2}) for some β < 1, and E[e^{γ max(1, (dP/dP̃)(X_i)²) k_P(X_i, X_i)}] < ∞ for some γ > 0),
MMD_{k_P}(P, Q_s) → 0 almost surely as s, n → ∞.
Result extends to sufficiently ergodic Markov chains targeting P
Stein Thinning in Action: Correcting for Burn-in
Goodwin model of oscillatory enzymatic control
Projections on the first two coordinates of the MALA MCMC output
First s = 20 points from Stein thinning vs. burn-in removal + standard thinning
Substantial burn-in: b points out of 2 × 10⁶ removed for standard thinning
Stein Thinning in Action: Correcting for Burn-in
Goodwin model of oscillatory enzymatic control
[Figure: KSD, energy distance (ED), and absolute first-moment error (Parameters 1–4) vs. coreset size m for RW, ADA-RW, MALA, and P-MALA chains; methods compared: Standard Thinning (high burn-in), Standard Thinning (low burn-in), Support Points, Stein Thinning (med), Stein Thinning (sclmed), Stein Thinning (smpcov)]
Stein thinning outperforms standard thinning with high and low levels of burn-in removal in terms of KSD, energy distance (ED), and first moment estimation
Stein Thinning in Action: Correcting for Tempering
Hinch model of cardiac calcium signalling: Tempering improves mixing
[Figure: marginal distributions of Hinch Parameters 1–10 for Stein thinning variants (med, sclmed, smpcov) vs. MCMC output (seeds 1 and 2) and the prior]
Stein Thinning in Action: Correcting for Tempering
Hinch model of cardiac calcium signalling
[Figure: KSD vs. coreset size m for the RW chain; methods compared: Support Points, Support Points (tempered), Stein Thinning (med), Stein Thinning (sclmed), Stein Thinning (smpcov)]
Untempered support points compression yields poor summary due to poor mixing
Tempered SP without bias correction is even worse (due to tempering bias)
Tempering + Stein thinning bias correction improves approximation to P
Conclusions
Summary
New tools for summarizing a probability distribution more effectively than i.i.d. sampling or standard MCMC thinning
Kernel thinning compresses an n point summary into a √n point summary with better-than-i.i.d. approximation error
Stein thinning simultaneously compresses and reduces biases due to off-target sampling, tempering, or burn-in
Especially well-suited for tasks that incur substantial downstream computation costs
Kernel Thinning
Paper: arxiv.org/abs/2105.05842
Package: github.com/rzrsk/kernel_thinning
Stein Thinning
Paper: arxiv.org/abs/2105.05842
Website: stein-thinning.org
Generalized Kernel Thinning [Dwivedi and Mackey, 2021a]
Question: Do you really need a square-root kernel?
1 kt-split with target kernel k yields
Similar or better MMD guarantees for analytic kernels (like Gaussian, IMQ, & sinc)
Dimension-free O(√(log n / n)) single-function integration error for any k and P
2 kt-split with fractional power kernel k^α yields
Improved MMD for kernels without krt (like Laplace and non-smooth Matern)
3 kt-split with k + k^α yields all of the above simultaneously!
We call this kernel thinning+ (KT+)
Distribution Compression in Near-linear Time [Shetty, Dwivedi, and Mackey, 2021]
Question: Can we speed up thinning algorithms without ruining their quality?
[Figure: mean MMD and single-core runtime vs. input size n for d = 2, 4, 10, 100; fitted MMD decay rates —
d=2: ST n^{−0.24}, KT-Comp n^{−0.46}, KT-Comp++ n^{−0.50}, KT n^{−0.52};
d=4: ST n^{−0.23}, KT-Comp n^{−0.41}, KT-Comp++ n^{−0.46}, KT n^{−0.48};
d=10: ST n^{−0.25}, KT-Comp n^{−0.38}, KT-Comp++ n^{−0.41}, KT n^{−0.43};
d=100: ST n^{−0.25}, KT-Comp n^{−0.32}, KT-Comp++ n^{−0.31}, KT n^{−0.32}]
Compress++ reduces n² runtime to n log² n, applies to any thinning algorithm, and inflates error by at most a constant factor
Distribution Compression in Near-linear Time [Shetty, Dwivedi, and Mackey, 2021]
Question: Can we speed up thinning algorithms without ruining their quality?
[Figure: mean MMD and single-core runtime vs. input size n for kernel herding variants, d = 2, 4, 10, 100; fitted MMD decay rates —
d=2: ST n^{−0.27}, Herd-Comp n^{−0.45}, Herd-Comp++ n^{−0.51}, Herd n^{−0.49};
d=4: ST n^{−0.24}, Herd-Comp n^{−0.42}, Herd-Comp++ n^{−0.45}, Herd n^{−0.45};
d=10: ST n^{−0.25}, Herd-Comp n^{−0.37}, Herd-Comp++ n^{−0.42}, Herd n^{−0.46};
d=100: ST n^{−0.24}, Herd-Comp n^{−0.32}, Herd-Comp++ n^{−0.32}, Herd n^{−0.32}]
Compress++ reduces n² runtime to n log² n, applies to any thinning algorithm (e.g., kernel herding), and inflates error by at most a constant factor
References I
R. Alweiss, Y. P. Liu, and M. Sawhney. Discrepancy minimization via a self-balancing walk. arXiv preprint arXiv:2006.14009, 2020.
F. O. Campos, Y. Shiferaw, A. J. Prassl, P. M. Boyle, E. J. Vigmond, and G. Plank. Stochastic spontaneous calcium release events trigger premature ventricular complexes by overcoming electrotonic load. Cardiovascular Research, 107(1):175–183, 2015.
W. Y. Chen, A. Barp, F.-X. Briol, J. Gorham, M. Girolami, L. Mackey, and C. Oates. Stein point Markov chain Monte Carlo. In International Conference on Machine Learning, pages 1011–1021. PMLR, 2019.
Y. Chen, M. Welling, and A. Smola. Super-samples from kernel herding. In Proceedings of the Twenty-Sixth Conference on Uncertainty in Artificial Intelligence, UAI'10, pages 109–116, Arlington, Virginia, USA, 2010. AUAI Press. ISBN 9780974903965.
K. Chwialkowski, H. Strathmann, and A. Gretton. A kernel test of goodness of fit. In Proceedings of the 33rd International Conference on Machine Learning, 2016.
M. A. Colman. Arrhythmia mechanisms and spontaneous calcium release: Bi-directional coupling between re-entrant and focal excitation. PLoS Computational Biology, 15(8), 2019.
R. Dwivedi and L. Mackey. Generalized kernel thinning. arXiv preprint arXiv:2110.01593, 2021a.
R. Dwivedi and L. Mackey. Kernel thinning, 2021b.
R. Dwivedi, O. N. Feldheim, O. Gurel-Gurevich, and A. Ramdas. The power of online thinning in reducing discrepancy. Probability Theory and Related Fields, 174(1):103–131, 2019.
D. Garreau, W. Jitkrittum, and M. Kanagawa. Large sample analysis of the median heuristic. arXiv preprint arXiv:1707.07269, 2017.
M. Girolami and B. Calderhead. Riemann manifold Langevin and Hamiltonian Monte Carlo methods. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 73(2):123–214, 2011.
B. C. Goodwin. Oscillatory behavior in enzymatic control process. Advances in Enzyme Regulation, 3:318–356, 1965.
J. Gorham and L. Mackey. Measuring sample quality with kernels. In Proceedings of the 34th International Conference on Machine Learning, 2017.
A. Gretton, K. M. Borgwardt, M. J. Rasch, B. Scholkopf, and A. Smola. A kernel two-sample test. Journal of Machine Learning Research, 13(25):723–773, 2012.
H. Haario, E. Saksman, and J. Tamminen. Adaptive proposal distribution for random walk Metropolis algorithm. Computational Statistics, 14(3):375–395, 1999.
N. Harvey and S. Samadi. Near-optimal herding. In Conference on Learning Theory, pages 1165–1182, 2014.
F. Hickernell. A generalized discrepancy and quadrature error bound. Mathematics of Computation, 67(221):299–322, 1998.
References II
R. Hinch, J. Greenstein, A. Tanskanen, L. Xu, and R. Winslow. A simplified local control model of calcium-induced calcium release in cardiac ventricular myocytes. Biophysical Journal, 87(6):3723–3736, 2004.
S. Joshi, R. V. Kommaraji, J. M. Phillips, and S. Venkatasubramanian. Comparing distributions and shapes using the kernel distance. In Proceedings of the Twenty-Seventh Annual Symposium on Computational Geometry, pages 47–56, 2011.
Z. Karnin and E. Liberty. Discrepancy, coresets, and sketches in machine learning. In Conference on Learning Theory, pages 1975–1993. PMLR, 2019.
T. Karvonen, M. Kanagawa, and S. Sarkka. On the positivity and magnitudes of Bayesian quadrature weights. Statistics and Computing, 29(6):1317–1333, 2019.
S. Lacoste-Julien, F. Lindsten, and F. Bach. Sequential kernel herding: Frank-Wolfe optimization for particle filtering. In Artificial Intelligence and Statistics, pages 544–552. PMLR, 2015.
Q. Liu, J. D. Lee, and M. I. Jordan. A kernelized Stein discrepancy for goodness-of-fit tests and model evaluation. In Proceedings of the 33rd International Conference on Machine Learning, 2016.
A. J. Lotka. Elements of physical biology. Williams & Wilkins, 1925.
S. Mak and V. R. Joseph. Support points. The Annals of Statistics, 46(6A):2562–2592, 2018.
S. A. Niederer, J. Lumens, and N. A. Trayanova. Computational models in cardiology. Nature Reviews Cardiology, 16(2):100–111, 2019.
S. A. Niederer, M. S. Sacks, M. Girolami, and K. Willcox. Scaling digital twins from the artisanal to the industrial. Nature Computational Science, 1(5):313–320, 2021.
E. Novak and H. Wozniakowski. Tractability of Multivariate Problems, Volume II: Standard Information for Functionals. European Math. Soc. Publ. House, Zurich, 2010.
C. J. Oates, M. Girolami, and N. Chopin. Control functionals for Monte Carlo integration. Journal of the Royal Statistical Society, Series B, 79(3):695–718, 2017.
B. Paige, D. Sejdinovic, and F. Wood. Super-sampling with a reservoir. In Proceedings of the Thirty-Second Conference on Uncertainty in Artificial Intelligence,pages 567–576, 2016.
J. M. Phillips. ε-samples for kernels. In Proceedings of the twenty-fourth annual ACM-SIAM symposium on Discrete algorithms, pages 1622–1632. SIAM, 2013.
J. M. Phillips and W. M. Tai. Improved coresets for kernel density estimates. In Proceedings of the Twenty-Ninth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 2718–2727. SIAM, 2018.
References III
J. M. Phillips and W. M. Tai. Near-optimal coresets of kernel density estimates. Discrete & Computational Geometry, 63(4):867–887, 2020.
M. Riabiz, W. Chen, J. Cockayne, P. Swietach, S. A. Niederer, L. Mackey, and C. Oates. Optimal thinning of MCMC output. To appear: Journal of the Royal Statistical Society: Series B (Statistical Methodology), 2021.
G. O. Roberts and R. L. Tweedie. Exponential convergence of Langevin distributions and their discrete approximations. Bernoulli, 2(4):341–363, 1996.
A. Shetty, R. Dwivedi, and L. Mackey. Distribution compression in near-linear time. arXiv preprint: to appear, 2021.
W. M. Tai. New nearly-optimal coreset for kernel density estimation. arXiv preprint arXiv:2007.08031, 2020.
I. Tolstikhin, B. K. Sriperumbudur, and K. Muandet. Minimax estimation of kernel mean embeddings. The Journal of Machine Learning Research, 18(1):3002–3048, 2017.
V. Volterra. Variazioni e fluttuazioni del numero d’individui in specie animali conviventi. 1926.