Divergence matching criteria for indexing and registration
Alfred O. Hero
Dept. EECS, Dept Biomed. Eng., Dept. Statistics
University of Michigan - Ann Arbor
http://www.eecs.umich.edu/~hero
Collaborators: Olivier Michel, Bing Ma, Huzefa Heemuchwala
Outline
1. Statistical framework: entropy measures, error exponents
2. α-MI indexing using single pixel gray levels
3. α-MI indexing via coincident features
4. α-entropy and α-MI estimation via MST
(a) Image I1 (b) Image I0 (c) Registration result
Figure 1: A multidate image registration example
Statistical Framework

- $I$: an image
- $Z = Z(I)$: an image feature vector over $[0,1]^d$
- $I_R$: a reference image, with feature $Z_R$
- $\{I_i\}$: a database of $K$ images, with features $Z_i$
- $f(Z_R, Z_i)$: joint feature density

For a pair $Z_R$ and $Z_i$ we have two hypotheses:

$$H_0: f(Z_R, Z_i) = f(Z_R)\, f(Z_i), \qquad H_1: f(Z_R, Z_i) \neq f(Z_R)\, f(Z_i)$$
Divergence Measures

Refs: [Csiszar:67, Basseville:SP89]

Define densities

$$f_1 = f(Z_R, Z_i), \qquad f_0 = f(Z_R)\, f(Z_i)$$

The Rényi α-divergence of fractional order $\alpha \in [0,1]$ [Renyi:61,70]:

$$D_\alpha(f_1 \| f_0) = \frac{1}{\alpha - 1} \ln \int f_1 \left( \frac{f_1}{f_0} \right)^{\alpha - 1} dx = \frac{1}{\alpha - 1} \ln \int f_1^{\alpha}\, f_0^{1-\alpha}\, dx$$
Rényi α-Divergence: Special cases

- α-divergence vs. α-entropy:

$$H_\alpha(f_1) = \frac{1}{1-\alpha} \ln \int f_1^{\alpha}\, dx = -D_\alpha(f_1 \| f_0)\Big|_{f_0 = U([0,1]^d)}$$

- α-divergence vs. Bhattacharyya-Hellinger distance:

$$D_{1/2}(f_1 \| f_0) = -2 \ln \int \sqrt{f_1 f_0}\, dx, \qquad D^2_{BH}(f_1 \| f_0) = \int \left( \sqrt{f_1} - \sqrt{f_0} \right)^2 dx = 2\left( 1 - \int \sqrt{f_1 f_0}\, dx \right)$$

- α-divergence vs. Kullback-Leibler divergence:

$$\lim_{\alpha \to 1} D_\alpha(f_1 \| f_0) = \int f_1 \ln \frac{f_1}{f_0}\, dx.$$
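These special cases can be checked numerically. The sketch below is my own illustration (not from the talk): it integrates two unit-variance Gaussian densities on a grid and verifies that $D_{1/2}$ equals minus twice the log-Bhattacharyya coefficient, and that $D_\alpha$ approaches the Kullback-Leibler divergence as $\alpha \to 1$. Function names and grid parameters are assumptions of the sketch.

```python
import math

def d_alpha(f1, f0, alpha, lo=-10.0, hi=10.0, n=20000):
    """Renyi alpha-divergence 1/(alpha-1) * ln int f1^a f0^(1-a) dx, midpoint rule."""
    h = (hi - lo) / n
    s = sum(f1(lo + (i + 0.5) * h) ** alpha * f0(lo + (i + 0.5) * h) ** (1.0 - alpha)
            for i in range(n)) * h
    return math.log(s) / (alpha - 1.0)

def gauss(mu):
    """Unit-variance Gaussian density centered at mu."""
    return lambda x: math.exp(-0.5 * (x - mu) ** 2) / math.sqrt(2.0 * math.pi)

f1, f0 = gauss(0.0), gauss(1.0)

# alpha = 1/2 equals minus twice the log-Bhattacharyya coefficient
d_half = d_alpha(f1, f0, 0.5)
bhat = -2.0 * math.log(sum(math.sqrt(f1(-10.0 + (i + 0.5) * 1e-3) * f0(-10.0 + (i + 0.5) * 1e-3))
                           for i in range(20000)) * 1e-3)

# alpha -> 1 recovers the Kullback-Leibler divergence (KL = 1/2 for this pair)
kl_limit = d_alpha(f1, f0, 0.999)
```

For equal-variance Gaussians $D_\alpha = \alpha(\mu_1-\mu_0)^2/(2\sigma^2)$, so `d_half` should be near 0.25 and `kl_limit` near 0.5.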
Rényi α-Divergence and Error Exponents

Observe an i.i.d. sample $W = [W_1, \ldots, W_n]$:

$$H_0: W_j \sim f_0(w), \qquad H_1: W_j \sim f_1(w)$$

Bayes probability of error:

$$P_e(n) = \beta(n) P(H_1) + \alpha(n) P(H_0)$$

The large deviations principle (LDP) gives the Chernoff bound [Dembo&Zeitouni:98]:

$$\liminf_{n \to \infty} \frac{1}{n} \log P_e(n) = -\sup_{\alpha \in [0,1]} \left\{ (1-\alpha) D_\alpha(f_1 \| f_0) \right\}.$$
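The Chernoff exponent can be found by a grid search over α. A minimal sketch of my own, using the standard closed form $D_\alpha = \alpha(\mu_1-\mu_0)^2/(2\sigma^2)$ for two equal-variance Gaussians (the densities and grid are assumptions, not from the talk):

```python
# Closed-form Renyi divergence between N(mu1, s^2) and N(mu0, s^2).
def d_alpha(a, mu1=0.0, mu0=1.0, s=1.0):
    return a * (mu1 - mu0) ** 2 / (2.0 * s ** 2)

# Chernoff bound: the error exponent is sup over alpha of (1 - alpha) * D_alpha.
grid = [i / 1000.0 for i in range(1, 1000)]
exponent, a_star = max(((1.0 - a) * d_alpha(a), a) for a in grid)
# Here (1 - a) * a / 2 peaks at a = 1/2, giving exponent 1/8.
```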
Registration via α-Mutual-Information

Ref: Viola&Wells:ICCV95

1. Reference $I_R$ and target $I_T$ images.
2. Set of rigid transformations $\{T_i\}$.
3. Derived feature vectors:

$$Z_R = Z(I_R), \qquad Z_i = Z(T_i(I_T))$$

$$H_0: \{Z_R, Z_i\} \text{ independent}, \qquad H_1: \{Z_R, Z_i\} \text{ dependent}$$

The error exponent is the α-MI (Neemuchwala&etal:ICIP01, Pluim&etal:SPIE01):

$$MI_\alpha(Z_R, Z_i) = \frac{1}{\alpha - 1} \ln \int f^\alpha(Z_R, Z_i)\, \left( f(Z_R) f(Z_i) \right)^{1-\alpha} dZ_R\, dZ_i.$$
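For discrete (e.g. binned gray-level) features the integral becomes a sum over a joint histogram. A plug-in sketch of my own, run on synthetic paired samples rather than real image features (all names and parameters are assumptions):

```python
import math, random

def alpha_mi(pairs, bins=16, alpha=0.5):
    """alpha-MI of paired samples with values in [0,1), via a 2-D joint histogram."""
    n = float(len(pairs))
    joint, px, py = {}, [0.0] * bins, [0.0] * bins
    for x, y in pairs:
        i = min(int(x * bins), bins - 1)
        j = min(int(y * bins), bins - 1)
        joint[(i, j)] = joint.get((i, j), 0.0) + 1.0 / n
        px[i] += 1.0 / n
        py[j] += 1.0 / n
    s = sum(p ** alpha * (px[i] * py[j]) ** (1.0 - alpha)
            for (i, j), p in joint.items())
    return math.log(s) / (alpha - 1.0)

random.seed(0)
xs = [random.random() for _ in range(2000)]
dependent = [(x, min(abs(x + random.gauss(0.0, 0.05)), 0.999)) for x in xs]
independent = [(random.random(), random.random()) for _ in range(2000)]

mi_dep = alpha_mi(dependent)    # large: joint density far from product of marginals
mi_ind = alpha_mi(independent)  # near zero: H0 (independence) approximately holds
```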
Registration via α-Jensen Difference

Ref: Ma&etal:ICIP00, He&etal:SigProc01

- Jensen's difference between $f_0, f_1$:

$$\Delta J_\alpha = H_\alpha(\varepsilon f_1 + (1-\varepsilon) f_0) - \varepsilon H_\alpha(f_1) - (1-\varepsilon) H_\alpha(f_0) \ge 0$$

- $f_0, f_1$ are two densities; $\varepsilon$ satisfies $0 \le \varepsilon \le 1$
- Let $X, Y$ be i.i.d. features extracted from two images:

$$X_m = \{X_1, \ldots, X_m\}, \qquad Y_n = \{Y_1, \ldots, Y_n\}$$

- Each realization in the unordered sample $Z = \{X_m, Y_n\}$ has marginal

$$f_Z(z) = \varepsilon f_X(z) + (1-\varepsilon) f_Y(z), \qquad \varepsilon = \frac{m}{n+m}$$
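A histogram plug-in sketch of the α-Jensen difference for 1-D samples (my own illustration; the sample generators and bin counts are assumptions): it is near zero when the two samples share a distribution and strictly positive when they differ.

```python
import math, random

def h_alpha(sample, bins=20, alpha=0.5):
    """Histogram plug-in Renyi entropy of a 1-D sample on [0,1)."""
    n, w = float(len(sample)), 1.0 / bins
    p = [0.0] * bins
    for x in sample:
        p[min(int(x * bins), bins - 1)] += 1.0 / n
    # int fhat^alpha dx = sum_i (p_i/w)^alpha * w = sum_i p_i^alpha * w^(1-alpha)
    return math.log(sum(q ** alpha * w ** (1.0 - alpha) for q in p if q > 0)) / (1.0 - alpha)

def jensen_diff(xs, ys, alpha=0.5):
    """Delta J_alpha = H_alpha(mixture) - eps*H_alpha(X) - (1-eps)*H_alpha(Y)."""
    eps = len(xs) / float(len(xs) + len(ys))
    return (h_alpha(xs + ys, alpha=alpha)
            - eps * h_alpha(xs, alpha=alpha)
            - (1.0 - eps) * h_alpha(ys, alpha=alpha))

random.seed(5)
same_a = [random.random() for _ in range(2000)]
same_b = [random.random() for _ in range(2000)]
left = [0.5 * random.random() for _ in range(2000)]         # uniform on [0, 0.5)
right = [0.5 + 0.5 * random.random() for _ in range(2000)]  # uniform on [0.5, 1)
```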
Ultrasound Registration Example
Figure 2: Three ultrasound breast scans. From top to bottom: cases 151, 142 and 162.
(a) Image X1 (b) Image X0

Figure 3: Single pixel coincidences (left and right: 0° and 8° rotations)
Grey Level Scatterplots

[figure: four grey-level scatterplot panels, grey levels 50-250]

Figure 4: Grey level scatterplots. 1st col: target = reference slice. 2nd col: target = reference + 1 slice.
[figure: α-divergence (α-MI) vs. angle of rotation (deg), one curve per α ∈ {0, 0.1, ..., 0.9}, single-pixel intensity]

Figure 5: α-divergence as a function of angle for ultrasound image registration of image 142
[figure: curvature of the α-MI vs. α curve for single-pixel intensity, plotted against α ∈ [0, 0.9]]

Figure 6: Resolution of α-divergence as a function of α
Higher Level Features

Disadvantages of gray level features:

- Depend only on the histogram of single pixel pairs
- Insensitive to spatial reordering of pixels in each image
- Difficult to select out grey level anomalies (shadows, speckle)
- Spatial discriminants fall outside the single pixel domain

Alternative: spatial-feature-based indexing
Local Tags
(a) Image X0 (b) Image Xi

Figure 7: Local Tag Coincidences
Spatial Relations Between Local Tags

[figure: compass-direction relations (N, E, SE, S, W) about a reference tag R, shown for Image X0 (a) and Image Xi (b)]

Figure 8: Spatial Relation Coincidences
Feature: spatial tags via feature trees

[figure: root node, depth 1, depth 2; deeper branches not examined further]

Figure 9: Part of feature tree data structure.

[figure: terminal nodes (depth 16)]

Figure 10: Leaves of feature tree data structure.
Feature: projection coefficients w.r.t. an ICA basis

Figure 11: Estimated ICA basis set for ultrasound breast image database
Simple Example

[figure: two 350 x 350 composite images; plot of normalized MST length and Shannon MI vs. relative rotation between images, comparing higher-order features against single pixels]

Figure 12: 10-basis composite images with noise; MI vs. feature-divergence objective functions
US Registration Comparisons

  feature       case 151     case 142     case 162
  pixel         0.6 / 0.9    0.6 / 0.3    0.6 / 0.3
  tag           0.5 / 3.6    0.5 / 3.8    0.4 / 1.4
  spatial-tag   0.99 / 14.6  0.99 / 8.4   0.6 / 8.3

  feature       151/8        151/16       151/32
  ICA           0.7 / 4.1    0.7 / 3.9    0.99 / 7.7

Table 1: In each cell, the first number is the optimal value of α and the second is the maximum resolution of mutual α-information, for registering various images (cases 151, 142, 162) using various features (pixel, tag, spatial-tag, ICA). 151/8, 151/16, 151/32 correspond to the ICA algorithm with 8, 16 and 32 basis elements run on case 151.
Feature-based Indexing: Challenges

- How to best select discriminating features?
  - Requires a training database of images to learn the feature set
  - Apply cross-validation, bagging, boosting, or randomized selection?
- How to compute α-MI for multi-dimensional features?
  - Tag space is of high cardinality: $256^{16} \approx 10^{38}$
  - ICA projection-coefficient space is a multi-dimensional continuum
  - Soln 1: partition the feature space and count coincidences
  - Soln 2: apply kernel density estimation and plug in to the α-MI or α-Jensen formula
  - Soln 3: estimate α-MI or α-Jensen directly via the MST
Methods of Entropy/Divergence Estimation

- $Z = (Z_R, Z_T)$: a statistic (feature pair)
- $\{Z_i\}$: $n$ i.i.d. realizations from $f(Z)$

Objective: estimate

$$H_\alpha(f) = \frac{1}{1-\alpha} \ln \int f^\alpha(x)\, dx$$

1. Parametric density estimation methods
2. Non-parametric density estimation methods
3. Non-parametric minimal-graph estimation methods
Non-parametric estimation methods

Given an i.i.d. sample $X = \{X_1, \ldots, X_n\}$, the density "plug-in" estimator is

$$\hat H_\alpha(\hat f_n) = \frac{1}{1-\alpha} \ln \int_{\mathbb{R}^d} \hat f_n^\alpha(x)\, dx$$

Previous work was limited to the Shannon entropy $H(f) = -\int f(x) \ln f(x)\, dx$:

- Histogram plug-in [Gyorfi&VanDerMeulen:CSDA87]
- Kernel density plug-in [Ahmad&Lin:IT76]
- Sample-spacing plug-in [Hall:JMS86] ($d = 1$)

Drawbacks:

- Performance degrades as the density $f$ becomes non-smooth
- Unclear how to robustify $\hat f$ against outliers
- $d$-dimensional integration may be difficult
- ⇒ the function $\{f(x) : x \in \mathbb{R}^d\}$ over-parameterizes the entropy functional
Direct α-entropy estimation

- MST estimator of α-entropy [Hero&Michel:IT99]:

$$\hat H_\alpha = \frac{1}{1-\alpha} \ln \left( L_\gamma(X_n) / n^\alpha \right)$$

- Direct entropy estimator: faster convergence for non-smooth densities
- The parameter α is varied by varying the interpoint distance measure
- Optimally pruned k-MST graphs robustify the estimator against outliers
- Greedy multi-scale MST approximations reduce combinatorial complexity
Minimal Graphs: Minimal Spanning Tree (MST)

Let $\mathcal{M}_n = \mathcal{M}(X_n)$ denote the possible sets of edges in the class of acyclic graphs spanning $X_n$ (spanning trees). The Euclidean power-weighted MST achieves

$$L_{MST}(X_n) = \min_{M \in \mathcal{M}_n} \sum_{e \in M} \|e\|^\gamma.$$
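The power-weighted MST length can be computed with Prim's algorithm in a few lines. The sketch below is my own (with $\gamma = 1$, so $\alpha = (d-\gamma)/d = 1/2$ for $d = 2$); it also forms the direct entropy estimate $\ln(L_\gamma/n^\alpha)/(1-\alpha)$ from the previous slide, ignoring the constant $\ln \beta_{L_\gamma,d}$:

```python
import math, random

def mst_length(pts, gamma=1.0):
    """Total power-weighted edge length of the Euclidean MST (Prim's algorithm, O(n^2))."""
    n = len(pts)
    in_tree = [False] * n
    best = [float("inf")] * n   # cheapest edge from the tree to each vertex
    best[0] = 0.0
    total = 0.0
    for _ in range(n):
        u = min((i for i in range(n) if not in_tree[i]), key=lambda i: best[i])
        in_tree[u] = True
        total += best[u] ** gamma      # 0 for the root
        for v in range(n):
            if not in_tree[v]:
                d = math.dist(pts[u], pts[v])
                if d < best[v]:
                    best[v] = d
    return total

def h_alpha_mst(pts, gamma=1.0):
    """Direct Renyi-entropy estimate for d = 2 (constant ln beta omitted)."""
    alpha = (2.0 - gamma) / 2.0
    return math.log(mst_length(pts, gamma) / len(pts) ** alpha) / (1.0 - alpha)

random.seed(1)
uniform = [(random.random(), random.random()) for _ in range(256)]
clustered = [(0.1 * random.random(), 0.1 * random.random()) for _ in range(256)]
# Concentrated samples give a shorter MST, hence a lower entropy estimate.
```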
[figure: two sets of 128 random samples on the unit square (left panels) and their MSTs (right panels)]

Figure 13: 128 random samples and their MSTs, for two sample distributions
Convergence of MST

[figure: MST length vs. N for uniform (red) and triangular (blue) distributions; MST normalized compensated length vs. N]

Figure 14: MST length and normalized compensated MST length as functions of sample size N
[figure: illustration on the unit square]

Figure 15: A continuous quasi-additive Euclidean functional satisfies the "self-similarity" property on any scale.
Asymptotics: the BHH Theorem and entropy estimation

Theorem 1 (Beardwood&etal:Camb59, Steele:95, Redmond&Yukich:SPA96). Let $L_\gamma$ be a continuous quasi-additive Euclidean functional with power exponent $\gamma$, and let $X_n = \{X_1, \ldots, X_n\}$ be an i.i.d. sample drawn from a distribution on $[0,1]^d$ with an absolutely continuous component having (Lebesgue) density $f(x)$. Then

$$\lim_{n \to \infty} L_\gamma(X_n)/n^{(d-\gamma)/d} = \beta_{L_\gamma, d} \int f(x)^{(d-\gamma)/d}\, dx \quad (a.s.) \qquad (1)$$

Or, letting $\alpha = (d-\gamma)/d$,

$$\lim_{n \to \infty} L_\gamma(X_n)/n^\alpha = \beta_{L_\gamma, d} \exp\left( (1-\alpha) H_\alpha(f) \right) \quad (a.s.)$$
Asymptotics: Plug-in estimation of $H_\alpha(f)$

Class of Hölder continuous functions over $[0,1]^d$:

$$\Sigma_d(\kappa, c) = \left\{ f(x) : |f(z) - p_x^{\lfloor \kappa \rfloor}(z)| \le c\, \|x - z\|^\kappa \right\}$$

where $p_x^{\lfloor \kappa \rfloor}$ denotes the Taylor polynomial of $f$ of degree $\lfloor \kappa \rfloor$ about $x$.

Proposition 1 (Hero&Ma:IT01). Assume that $f^\alpha \in \Sigma_d(\kappa, c)$. Then, if $\hat f^\alpha$ is a minimax estimator,

$$\sup_{f^\alpha \in \Sigma_d(\kappa, c)} E^{1/p}\left[ \left| \int \hat f^\alpha(x)\, dx - \int f^\alpha(x)\, dx \right|^p \right] = O\left( n^{-\kappa/(2\kappa + d)} \right)$$
Asymptotics: Minimal-graph estimation of $H_\alpha(f)$

Proposition 2 (Hero&Ma:IT01). Let $d \ge 2$ and $\alpha = (d-\gamma)/d \in [1/2, (d-1)/d]$. Assume that $f^\alpha \in \Sigma_d(\kappa, c)$ where $\kappa \le 1$ and $c < \infty$. Then for any continuous quasi-additive Euclidean functional $L_\gamma$,

$$\sup_{f^\alpha \in \Sigma_d(\kappa, c)} E^{1/p}\left[ \left| \frac{L_\gamma(X_1, \ldots, X_n)}{n^\alpha} - \beta_{L_\gamma, d} \int f^\alpha(x)\, dx \right|^p \right] \le O\left( n^{-2/(3d)} \right)$$

Conclude: the minimal-graph estimator converges faster for

$$\kappa < \frac{2d}{3d - 4}$$
Observations

- Minimal-graph rates are valid for the MST, k-NN graph, TSP, Steiner tree, etc.
- An analogous rate bound holds for the progressive-resolution algorithm

$$L^G_\gamma(X_n) = \sum_{i=1}^{m^d} L_\gamma(X_n \cap Q_i)$$

where $\{Q_i\}$ is a uniform partition of $[0,1]^d$ into cells of volume $1/m^d$

- The optimal sequence of cell volumes is $m^{-d} = n^{-1/3}$
Computational Acceleration of MST

[figure: selection of nearest neighbors for the MST using a disc of radius r = 0.15 on the unit square]

Figure 16: Acceleration of Kruskal's MST algorithm from $n^2 \log n$ to $n \log n$.
[figure: execution time in seconds vs. number of uniformly distributed points N, for the standard Kruskal algorithm and the modified Kruskal algorithm (radius = 0.09)]

Figure 17: Comparison of Kruskal's MST to our $n \log n$ MST algorithm.
[figure: MST length vs. disc radius, for 10000 uniformly distributed points]

Figure 18: Bias of the $n \log n$ MST algorithm as a function of the radius parameter.
Application of MST to US Image Registration

1. Extract features from the reference and transformed target images:

$$X_m = \{X_i\}_{i=1}^{m} \quad \text{and} \quad Y_n = \{Y_i\}_{i=1}^{n}$$

2. Construct the MST on the union of $X_m$ and $Y_n$:

$$L_\gamma(X_m \cup Y_n)$$

3. Minimize $L_\gamma$ over transformations producing $Y_n$.

Note: this minimizer converges to the minimizer of the α-Jensen difference:

$$L_\gamma(X_m \cup Y_n)/(m+n)^\alpha \to \beta_{L_\gamma, d} \exp\left( (1-\alpha) H_\alpha(\varepsilon f_X + (1-\varepsilon) f_Y) \right) \quad (a.s.)$$

where $\varepsilon = \frac{m}{m+n}$
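A toy version of this registration loop, sketched under my own assumptions (synthetic 2-D point features, a small set of candidate angles). To keep the code short it scores candidates with the total nearest-neighbour distance, another continuous quasi-additive Euclidean functional, as a stand-in for the MST length:

```python
import math, random

def nn_length(pts, gamma=1.0):
    """Total power-weighted nearest-neighbour distance: a cheap stand-in for
    the MST length (both are continuous quasi-additive Euclidean functionals)."""
    total = 0.0
    for i, p in enumerate(pts):
        total += min(math.dist(p, q) for j, q in enumerate(pts) if j != i) ** gamma
    return total

def rotate(pts, theta):
    c, s = math.cos(theta), math.sin(theta)
    return [(c * x - s * y, s * x + c * y) for x, y in pts]

random.seed(2)
ref = [(random.gauss(0.0, 1.0), random.gauss(0.0, 0.2)) for _ in range(150)]
target = rotate(ref, math.radians(20.0))  # target = reference rotated by 20 degrees

# Scan candidate rotation angles; alignment minimizes the joint graph length.
scores = {a: nn_length(ref + rotate(target, math.radians(-a))) for a in range(0, 41, 5)}
best_angle = min(scores, key=scores.get)
```

When the de-rotation matches the true 20° misalignment, the two point sets coincide and the joint graph length collapses toward zero.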
[figure: intensity vs. (X, Y) pixel coordinates for misaligned points (a) and the corresponding MST (b)]

Figure 19: MST demonstration for misaligned images
[figure: intensity vs. (X, Y) pixel coordinates for aligned points (a) and the corresponding MST (b)]

Figure 20: MST demonstration for aligned images
Illustration for Case 142 and Single Pixels

[figure: normalized MST length and MI vs. relative rotation between images (degrees); single-pixel domain, vol. 142; curves: MI (α = 0.5), MST length]

Figure 21: Single-pixel objective function profiles for the MST estimator of the α-Jensen difference vs. the histogram plug-in estimator (α = 1/2).
Illustration for Case 142 and 8-D ICA Features

[figure: MST length and α-Jensen profile (4 bins/dimension) vs. relative rotation between images (degrees); 8-D ICA feature vector, vol. 142; curves: Jensen (α = 0.5), MST length]

Figure 22: 8-basis ICA objective function profiles for the MST estimator of the α-Jensen difference vs. the histogram plug-in estimator of α-MI (α = 1/2).
Illustration for Case 142 and 64-D ICA Features

[figure: normalized MST length profile vs. relative rotation between images (degrees); 64-D ICA feature vector, vol. 142]

Figure 23: 64-basis ICA objective function profiles for the MST estimator of the α-Jensen difference.
Extension of MST to Divergence Estimation

1. Let i.i.d. $\{Z_i\}_{i=1}^{n}$ have marginal density $f_1$ on $[0,1]^d$
2. Let $f_0$ dominate the density $f_1$
3. Define a measure transformation $M$ on $[0,1]^d$ which takes $f_0$ to the uniform density

Then $S_i \stackrel{\mathrm{def}}{=} M(Z_i)$ has α-entropy

$$(1-\alpha) H_\alpha(S_i) = \ln \int f_S^\alpha(s)\, ds = \ln \int \left( f_1(z)/f_0(z) \right)^\alpha f_0(z)\, dz$$

Conclude, for $S_n = \{S_1, \ldots, S_n\}$:

$$\hat D_\alpha(f_1 \| f_0) = \frac{1}{1-\alpha} \left( \ln L_\gamma(S_n)/n^\alpha - \ln \beta_{L_\gamma, d} \right) \qquad (2)$$

is a consistent estimator of the α-divergence.
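In one dimension the measure transformation $M$ is just the CDF of $f_0$. The sketch below is my own illustration: it substitutes a histogram entropy estimate for the MST length (for brevity), and uses the sign convention $D_\alpha = -H_\alpha(S)$, which for $f_1 = N(1,1)$, $f_0 = N(0,1)$ and $\alpha = 1/2$ should recover the closed-form value $\alpha(\mu_1-\mu_0)^2/2 = 0.25$:

```python
import math, random

def phi(x):
    """Standard normal CDF: the transform M taking f0 = N(0,1) to uniform."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def h_alpha_hist(s, bins=40, alpha=0.5):
    """Histogram plug-in Renyi entropy of a sample on [0,1]."""
    n, w = float(len(s)), 1.0 / bins
    p = [0.0] * bins
    for v in s:
        p[min(int(v * bins), bins - 1)] += 1.0 / n
    return math.log(sum(q ** alpha * w ** (1.0 - alpha) for q in p if q > 0)) / (1.0 - alpha)

random.seed(3)
z = [random.gauss(1.0, 1.0) for _ in range(20000)]  # f1 = N(1,1), f0 = N(0,1)
s = [phi(v) for v in z]                             # S_i = M(Z_i), uniform iff f1 = f0
d_hat = -h_alpha_hist(s)  # divergence estimate; closed form gives 0.25 here
```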
Example

[figure: original 2-D data; transformed data via the exact inverse transform; transformed data via an estimated 2-D CDF]

Figure 24: Measure transformation applied to 2-D data, using the exact inverse transform and an estimated 2-D CDF
MST Robustification Against Outliers: Pruned MST

Fix $k$, $1 \le k \le n$. Let $M_{n,k} = M(x_{i_1}, \ldots, x_{i_k})$ be a minimal graph connecting $k$ distinct vertices $x_{i_1}, \ldots, x_{i_k}$. The k-MST $T^*_{n,k} = T^*(x_{i^*_1}, \ldots, x_{i^*_k})$ is the minimum over all k-point MSTs:

$$L^*_{n,k} = L^*(X_{n,k}) = \min_{i_1, \ldots, i_k} \min_{M_{n,k}} \sum_{e \in M_{n,k}} \|e\|^\gamma$$
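A greedy approximation in the spirit of the pruning idea (my own heuristic sketch, not the optimally pruned k-MST of the slide): build the full MST, then repeatedly strip the leaf with the longest incident edge until $k$ vertices remain.

```python
import math, random

def mst_edges(pts):
    """Edges (length, u, v) of the Euclidean MST via Prim's algorithm."""
    n = len(pts)
    in_tree = [False] * n
    best = [float("inf")] * n
    parent = [-1] * n
    best[0] = 0.0
    edges = []
    for _ in range(n):
        u = min((i for i in range(n) if not in_tree[i]), key=lambda i: best[i])
        in_tree[u] = True
        if parent[u] >= 0:
            edges.append((best[u], parent[u], u))
        for v in range(n):
            if not in_tree[v]:
                d = math.dist(pts[u], pts[v])
                if d < best[v]:
                    best[v], parent[v] = d, u
    return edges

def greedy_kmst_length(pts, k, gamma=1.0):
    """Heuristic k-point MST length: strip the longest leaf edge n - k times."""
    edges = mst_edges(pts)
    deg = [0] * len(pts)
    for _, u, v in edges:
        deg[u] += 1
        deg[v] += 1
    alive = set(edges)
    for _ in range(len(pts) - k):
        e = max(e for e in alive if deg[e[1]] == 1 or deg[e[2]] == 1)
        alive.remove(e)
        deg[e[1]] -= 1
        deg[e[2]] -= 1
    return sum(w ** gamma for w, _, _ in alive)

random.seed(4)
ring = []
for _ in range(100):  # annulus density, mimicking the Figure 25 setup
    t, r = random.uniform(0.0, 2.0 * math.pi), random.uniform(0.35, 0.45)
    ring.append((0.5 + r * math.cos(t), 0.5 + r * math.sin(t)))
outliers = [(random.random(), random.random()) for _ in range(10)]
pts = ring + outliers

full = sum(w for w, _, _ in mst_edges(pts))
pruned = greedy_kmst_length(pts, 100)  # keep 100 of 110 points
```

Stripping leaves keeps the remaining graph a spanning tree of the surviving vertices, and the long edges to isolated outliers are the first to go.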
[figure: k-MST graphs on $[0,1]^2$ for k = 99, 98, 62 and 25, rejecting 1, 2, 38 and 75 outliers respectively]

Figure 25: k-MST for 2-D annulus density with and without the addition of uniform "outliers".
Extension of BHH to Pruned Graphs

Fix $\alpha \in [0,1]$ and let $k = \lfloor \alpha n \rfloor$. Then as $n \to \infty$ (Hero&Michel:IT99)

$$L(X^*_{n,k})/\lfloor \alpha n \rfloor^\nu \to \beta_{L_\gamma, d} \min_{A: P(A) \ge \alpha} \int f^\nu(x \mid x \in A)\, dx \quad (a.s.)$$

or, alternatively, with

$$H_\nu(f \mid x \in A) = \frac{1}{1-\nu} \ln \int f^\nu(x \mid x \in A)\, dx,$$

$$L(X^*_{n,k})/\lfloor \alpha n \rfloor^\nu \to \beta_{L_\gamma, d} \exp\left( (1-\nu) \min_{A: P(A) \ge \alpha} H_\nu(f \mid x \in A) \right) \quad (a.s.)$$
Water Pouring Construction

[figure: density f(x) of X; water-filled density; asymptotic density f(x | A) of X seen by the k-MST]

Figure 26: Water pouring construction of $f(x \mid A_o)$. Arrows indicate the high water marks indicating high-probability regions which are not truncated in the influence function.
k-MST Influence Function for Gaussian Density

[figure: influence curves IC(x, F, L) over the plane for the MST and for the k-MST]

Figure 27: MST and k-MST influence curves for Gaussian density on the plane.
k-MST Stopping Rule

[figure: k-MST length as a function of k; selection criterion as a function of k]

Figure 28: Left: the k-MST curve for the 2-D annulus density with added uniform "outliers" has a knee in the vicinity of $n - k = 35$. This knee can be detected using residual analysis from a linear regression line fitted to the left-most part of the curve. Right: error residual of the linear regression line.
Conclusions

1. α-divergence for indexing can be justified via decision theory
2. Applicable to feature-based image registration
3. Non-parametric estimation is possible without density estimation, via the MST
4. The MST outperforms plug-in estimation for non-smooth densities
5. A robustified MST can be defined via optimal pruning of the MST: the k-MST
Divergence vs. Jensen: Asymptotic Comparison

For $\varepsilon \in [0,1]$ and $g$ a p.d.f., define

$$f_\varepsilon = \varepsilon f_1 + (1-\varepsilon) f_0, \qquad E_g[Z] = \int Z(x)\, g(x)\, dx, \qquad \tilde f^\alpha_{1/2} = \frac{f^\alpha_{1/2}}{\int f^\alpha_{1/2}\, dx}$$

Then

$$\Delta J_\alpha = \frac{\alpha \varepsilon (1-\varepsilon)}{2} \left[ E_{\tilde f^\alpha_{1/2}}\!\left( \left[ \frac{f_1 - f_0}{f_{1/2}} \right]^2 \right) + \frac{\alpha}{1-\alpha} \left( E_{\tilde f^\alpha_{1/2}}\!\left[ \frac{f_1 - f_0}{f_{1/2}} \right] \right)^2 \right] + O(\Delta)$$

$$D_\alpha(f_1 \| f_0) = \frac{\alpha}{4} \int f_{1/2} \left[ \frac{f_1 - f_0}{f_{1/2}} \right]^2 dx + O(\Delta)$$