Upload
others
View
13
Download
1
Embed Size (px)
Citation preview
Motivation Formal Definition Current Approaches kd-tree Algorithm Algorithm Analysis Conclusions
KD-Tree Algorithm for Propensity Score Matching
PhD Qualifying Exam Defense
John R Hott
University of Virginia
May 11, 2012
1 / 62
Motivation Formal Definition Current Approaches kd-tree Algorithm Algorithm Analysis Conclusions
Motivation
• Epidemiology: Clinical Trials• Phase II and III pre-market trials• After-market Phase IV trials against 3+ treatments
• Trials requiring similar groups to avoid confounding• Participants with similar comorbidities (diseases, ...)• Participants with similar traits (age, weight, height, ...)
• Propensity Scores• Probability, given certain factors, that a person will be given a certain treatment• Want to match participants with similar propensity scores
2 / 62
Motivation Formal Definition Current Approaches kd-tree Algorithm Algorithm Analysis Conclusions
3 / 62
Motivation Formal Definition Current Approaches kd-tree Algorithm Algorithm Analysis Conclusions
4 / 62
Motivation Formal Definition Current Approaches kd-tree Algorithm Algorithm Analysis Conclusions
5 / 62
Motivation Formal Definition Current Approaches kd-tree Algorithm Algorithm Analysis Conclusions
6 / 62
Motivation Formal Definition Current Approaches kd-tree Algorithm Algorithm Analysis Conclusions
7 / 62
Motivation Formal Definition Current Approaches kd-tree Algorithm Algorithm Analysis Conclusions
8 / 62
Motivation Formal Definition Current Approaches kd-tree Algorithm Algorithm Analysis Conclusions
9 / 62
Motivation Formal Definition Current Approaches kd-tree Algorithm Algorithm Analysis Conclusions
10 / 62
Motivation Formal Definition Current Approaches kd-tree Algorithm Algorithm Analysis Conclusions
11 / 62
Motivation Formal Definition Current Approaches kd-tree Algorithm Algorithm Analysis Conclusions
Why doesn’t this nearest neighbor approach work for more groups?
• b1 and g closest points to r
• Triangle rgb2 has smaller perimeter than rgb1
12 / 62
Motivation Formal Definition Current Approaches kd-tree Algorithm Algorithm Analysis Conclusions
Current Approaches
For two treatment groups
• Propensity scores used to reduce dimensionality
• Brute force or nearest neighbor searches
For more than two groups
• Brute force• Requires O(nk log n) time using O(nk) space• Less efficient brute force uses O(n) space, but O(nk+1) time• Problem: must consider all matches• Not feasible for large n
kd-tree algorithm, under a uniform distribution, will perform in O(kdn2) time and O(n)space.
13 / 62
Motivation Formal Definition Current Approaches kd-tree Algorithm Algorithm Analysis Conclusions
The End: Spoilers
0 1 2 3 4 5 6 7 8Time to find all matches (ms) 1e7
50
100
200
300
400
500
750
1000
Part
icip
ants
in e
ach
grou
pBrute Forcekd-tree Implementation
Figure: Results in 3-dimensions with 3 groups
For 1000 participants in 3 groups with 3 dimensions
• Brute Force: 19.6 hours
• kd-tree Algorithm: 3.6 seconds (19,427x speedup)14 / 62
Motivation Formal Definition Current Approaches kd-tree Algorithm Algorithm Analysis Conclusions
Problem Statement
Informally, we want to make the smallest n disjoint matches with one participant fromeach of k groups per match.
• Start with participants in one group,
• Find their closest matched participants of each other group (nearest neighbor),
• Search within a small neighborhood of these points for a smaller match, if oneexists,
• Repeat, if necessary.
15 / 62
Motivation Formal Definition Current Approaches kd-tree Algorithm Algorithm Analysis Conclusions
Outline1 Motivation
2 Formal DefinitionDefinitionsSize Function
3 Current Approaches
4 kd-tree Algorithmkd-treesAlgorithm Description
5 Algorithm AnalysisWorst CaseExpected CaseEmpirical Study
6 Conclusions
16 / 62
Motivation Formal Definition Current Approaches kd-tree Algorithm Algorithm Analysis Conclusions
Participant Definitions
Definition (Participant)
Let the point p ∈ Rd be a participant normalized over d defining characteristics
Definition (Set of all Participants)
The set P ⊆ Rd , is the set of all participants, such that:
• P = ∪ki=1Gi , where each Gi defines a treatment group
• |Gi | = n
• |P| =∑
i |Gi | = kn.
17 / 62
Motivation Formal Definition Current Approaches kd-tree Algorithm Algorithm Analysis Conclusions
Participant Definitions
Definition (Participant)
Let the point p ∈ Rd be a participant normalized over d defining characteristics
Definition (Set of all Participants)
The set P ⊆ Rd , is the set of all participants, such that:
• P = ∪ki=1Gi , where each Gi defines a treatment group
• |Gi | = n
• |P| =∑
i |Gi | = kn.
18 / 62
Motivation Formal Definition Current Approaches kd-tree Algorithm Algorithm Analysis Conclusions
Match Definitions
Definition (Match)
A set m ⊆ P is a match if it contains exactly one point from each Gi :
• |m| = k,
• |m ∩ Gi | = 1, ∀i .
Definition (Set of all Matches)
Let M = m : m is a match be the set of all matches with
|M| =∏
i |Gi | = nk
19 / 62
Motivation Formal Definition Current Approaches kd-tree Algorithm Algorithm Analysis Conclusions
Match Definitions
Definition (Match)
A set m ⊆ P is a match if it contains exactly one point from each Gi :
• |m| = k,
• |m ∩ Gi | = 1, ∀i .
Definition (Set of all Matches)
Let M = m : m is a match be the set of all matches with
|M| =∏
i |Gi | = nk
20 / 62
Motivation Formal Definition Current Approaches kd-tree Algorithm Algorithm Analysis Conclusions
Size of a Match
DefinitionMatch measure function size(m),
size :M→ R,
independent of the order of the points in the match m, must give a consistentmeasurement of the match.
Ideal measure: minimize the sum of the distance between all points
• Fully connected graph
• Quadratic on k
21 / 62
Motivation Formal Definition Current Approaches kd-tree Algorithm Algorithm Analysis Conclusions
Match Covering
DefinitionM is a match covering of P if M is a set of disjoint matches:
• M ⊂M• |M | = n
• ∀m, l ∈ M where m 6= l then m ∩ l = ∅.
DefinitionWLOG, assume M is sorted on size(m): ∀mi ,mj ∈ M , i < j =⇒ size(mi) < size(mj).Define ordering <M such that M0 <M M1 if for some index i ,
size(m0,i) < size(m1,i) and ∀j < i , size(m0,j) = size(m1,j)
• Size of match in M0 less than size of match in M1 at the first place they differ
22 / 62
Motivation Formal Definition Current Approaches kd-tree Algorithm Algorithm Analysis Conclusions
Match Covering
DefinitionM is a match covering of P if M is a set of disjoint matches:
• M ⊂M• |M | = n
• ∀m, l ∈ M where m 6= l then m ∩ l = ∅.
DefinitionWLOG, assume M is sorted on size(m): ∀mi ,mj ∈ M , i < j =⇒ size(mi) < size(mj).Define ordering <M such that M0 <M M1 if for some index i ,
size(m0,i) < size(m1,i) and ∀j < i , size(m0,j) = size(m1,j)
• Size of match in M0 less than size of match in M1 at the first place they differ
23 / 62
Motivation Formal Definition Current Approaches kd-tree Algorithm Algorithm Analysis Conclusions
Problem Statement
Find the minimal match covering, M0, such that
∀i ,M0 ≤M Mi .
24 / 62
Motivation Formal Definition Current Approaches kd-tree Algorithm Algorithm Analysis Conclusions
Match size function
What is the best method for measuring the size of a match?Perimeter? Convex Hull? Something else?
25 / 62
Motivation Formal Definition Current Approaches kd-tree Algorithm Algorithm Analysis Conclusions
Measuring Matches: by example
Figure: Sample points for one match.
26 / 62
Motivation Formal Definition Current Approaches kd-tree Algorithm Algorithm Analysis Conclusions
Measuring Matches: Area
(a) (b) (c)
• All colinear points have 0 area, regardless of distance
• Favors colinear points
27 / 62
Motivation Formal Definition Current Approaches kd-tree Algorithm Algorithm Analysis Conclusions
Measuring Matches: Perimeter
(d) (e)
• Works for 2-dimensions, 3-colors
• Equivalent to Traveling Salesman as number of colors increases
• Not well defined in more dimensions
28 / 62
Motivation Formal Definition Current Approaches kd-tree Algorithm Algorithm Analysis Conclusions
Measuring Matches: Convex Hull
(f)
• Avoids TSP encountered with perimeter
• Ω(nbd/2c) for d > 3
more
29 / 62
Motivation Formal Definition Current Approaches kd-tree Algorithm Algorithm Analysis Conclusions
Measuring Matches: Centroid
• Linearly computable (on colors and dimensions)
• Statistical sense: distance to an average point
• Matches are intuitively small
• Possible Measurements• Max distance to centroid• Average distance to centroid (variance)• Sum of squared distances to centroid
more
30 / 62
Motivation Formal Definition Current Approaches kd-tree Algorithm Algorithm Analysis Conclusions
Perimeter Measure
For 3 or fewer groups, perimeter matches our ideal measurement. In this case,
size(m) = perimeter(m).
This definition leads to a search radius of
search(m) =1
2size(m).
31 / 62
Motivation Formal Definition Current Approaches kd-tree Algorithm Algorithm Analysis Conclusions
Proof of Correctness: Perimeter
TheoremGiven an initial match m containing point pi ∈ G1, which contains random points pr j ,one per Gj with j ≥ 2, the perimeter will define the size of the match. Let us assumethat size(m) = q ∈ R. Our search radius will then be the disc centered at pi withradius q
2. This area will contain the smallest match for pi .
32 / 62
Motivation Formal Definition Current Approaches kd-tree Algorithm Algorithm Analysis Conclusions
Proof of Correctness: Perimeter
33 / 62
Motivation Formal Definition Current Approaches kd-tree Algorithm Algorithm Analysis Conclusions
Proof of Correctness: Perimeter
34 / 62
Motivation Formal Definition Current Approaches kd-tree Algorithm Algorithm Analysis Conclusions
Proof of Correctness: Perimeter
35 / 62
Motivation Formal Definition Current Approaches kd-tree Algorithm Algorithm Analysis Conclusions
Centroid MeasureFor a d-dimensional space and k colors per match, we consider the sum of squareddistances to the centroid.
The centroid of a match m is defined as
c(m) =1
k
k∑i=1
pi .
Our size(m) function, using sum of squared distances to the centroid, is defined as
size(m) =k∑
i=1
(d∑
j=1
(pi ,j − cj(m))2
)2
.
This definition leads to a search radius of
search(m) = k ∗max distance to centroid.
36 / 62
Motivation Formal Definition Current Approaches kd-tree Algorithm Algorithm Analysis Conclusions
Proof of Correctness: Centroid
TheoremGiven an initial match m containing point pi ∈ G1, which contains random points pr j ,one per Gj with j ≥ 2, the sum of squared distances to the centroid will define the sizeof the match. Our search radius will be the disc centered at pi with radius k ∗ s where sis the max distance to centroid. This area will contain the smallest match for pi .
37 / 62
Motivation Formal Definition Current Approaches kd-tree Algorithm Algorithm Analysis Conclusions
Proof of Correctness: Centroid
38 / 62
Motivation Formal Definition Current Approaches kd-tree Algorithm Algorithm Analysis Conclusions
Proof of Correctness: Centroid
39 / 62
Motivation Formal Definition Current Approaches kd-tree Algorithm Algorithm Analysis Conclusions
Proof of Correctness: Centroid
more
40 / 62
Motivation Formal Definition Current Approaches kd-tree Algorithm Algorithm Analysis Conclusions
Outline1 Motivation
2 Formal DefinitionDefinitionsSize Function
3 Current Approaches
4 kd-tree Algorithmkd-treesAlgorithm Description
5 Algorithm AnalysisWorst CaseExpected CaseEmpirical Study
6 Conclusions
41 / 62
Motivation Formal Definition Current Approaches kd-tree Algorithm Algorithm Analysis Conclusions
Brute Force Approaches
Algorithm 2: O(nk log n) time, but O(nk) space.
Input: k sets of n pointsOutput: set of n ordered smallest matches of k points each
read inputforeach p1 ∈ G1 do
foreach p2 ∈ G2 do. . . foreach pk ∈ Gk do
M ← m = p1, p2, ..., pk
sort(M)foreach M do
if p1, p2, ..., pk clean thenMans ← m
return Mans
42 / 62
Motivation Formal Definition Current Approaches kd-tree Algorithm Algorithm Analysis Conclusions
Brute Force Approaches
Algorithm 1: O(n) space, but O(nk+1) time.
Input: k sets of n pointsOutput: set of n ordered smallest matches of k points each
read inputfor i = 1 : n do
smallest = MAXmsmallest = nullforeach p1 ∈ G1 do
foreach p2 ∈ G2 do. . .foreach pk ∈ Gk do
if size(m = p1, p2, ..., pk) < smallest thenmsmallest = msmallest = size(m)
Mans ← msmallest
remove p1, p2, ..., pk from G1,G2, ...,Gk
return Mans
43 / 62
Motivation Formal Definition Current Approaches kd-tree Algorithm Algorithm Analysis Conclusions
Voronoi Matching Algorithm
0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 10
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
1
Figure: Voronoi Cells.
44 / 62
Motivation Formal Definition Current Approaches kd-tree Algorithm Algorithm Analysis Conclusions
Voronoi Matching Algorithm
Problems with the Voronoi algorithm
• Provides only an approximation for M0
• Worst and expected case complexity O(n4) for 3 groups
• Required searching for points in polygons
45 / 62
Motivation Formal Definition Current Approaches kd-tree Algorithm Algorithm Analysis Conclusions
Outline1 Motivation
2 Formal DefinitionDefinitionsSize Function
3 Current Approaches
4 kd-tree Algorithmkd-treesAlgorithm Description
5 Algorithm AnalysisWorst CaseExpected CaseEmpirical Study
6 Conclusions
46 / 62
Motivation Formal Definition Current Approaches kd-tree Algorithm Algorithm Analysis Conclusions
kd-treesBinary tree data structure used to store points in d-dimensional space.
more
47 / 62
Motivation Formal Definition Current Approaches kd-tree Algorithm Algorithm Analysis Conclusions
kd-tree AlgorithmInput: k sets of n pointsOutput: set of n ordered smallest matches of k points each
G1 = read inputfor i ← 2 to k do
Gi = read inputTi = makeKDTree(Gi )
pq = new PriorityQueuematches = new ArrayListforeach pi ∈ G1 do
addPutativeMatches(pi , pq)
while pq not empty dom = pq.poll()if all points are unused then
foreach i ≤ k doTi .remove(m.i)
matches.add(m)
elseif point m.1 ∈ G1 is unused then
if no more matches available thenaddPutativeMatches(m.1, pq)
48 / 62
Motivation Formal Definition Current Approaches kd-tree Algorithm Algorithm Analysis Conclusions
addPutativeMatches SubroutineInput: PriorityQueue pq, current point p1 from G1, kd-trees Ti for each G2 to Gk
Output: list of 10 smallest matches for point p1
for i ← 2 to k dopi = Ti .getnearest(pi−1)
small = size(p1, p2, ... , pk )search = get search distance from smalltq = new PriorityQueuetq.add(match(p1,p2, ... ,pk ))for i ← 2 to k do
listi = Ti .getnearest(pi−1, search)
foreach list2 as p2 doforeach list3 as p3 do
... foreach listk as pk dodist = size(p1,p2, ... ,pk )if dist ≤ small then
tq.add(match(p1,p2, ... ,pk ))
for i ← 1 to 10 dom = tq.poll()pq.add(m);
49 / 62
Motivation Formal Definition Current Approaches kd-tree Algorithm Algorithm Analysis Conclusions
Empirical Study: addPutativeMatches returns
20 40 60 80Matches used in AddPutativeMatches
7500
8000
8500
9000
9500
10000
10500
Tim
e to
find
all
mat
ches
(ms)
Average Time (ms)Max Time (ms)Min Time (ms)
Figure: Empirical study varying number of matches returned by addPutativeMatches.
50 / 62
Motivation Formal Definition Current Approaches kd-tree Algorithm Algorithm Analysis Conclusions
Outline1 Motivation
2 Formal DefinitionDefinitionsSize Function
3 Current Approaches
4 kd-tree Algorithmkd-treesAlgorithm Description
5 Algorithm AnalysisWorst CaseExpected CaseEmpirical Study
6 Conclusions
51 / 62
Motivation Formal Definition Current Approaches kd-tree Algorithm Algorithm Analysis Conclusions
Worst Case
Figure: Worst case example (3 colors in 2 dimensions). Points in each Gi are coincident witheach other (∀i : ri = (−1, 0), gi = (0, 0), bi = (1, 0)); all points are within the search area ofany match (ri , gi , bi ).
52 / 62
Motivation Formal Definition Current Approaches kd-tree Algorithm Algorithm Analysis Conclusions
addPutativeMatches Worst Case
Complexity
O(k)
O(kdn1−1/d )O(1)O(kd)O(1)O(log n)
O(k)
O(kdn1−1/d )
a
O(n)×O(n)××...× O(n)
O(kdnk−1)O(nk−1 log nk−1)
abc
O(10 log nk−1)
abcd
for i ← 2 to k dopi = Ti .getnearest(pi−1)
small = size(p1, p2, ... , pk )search = get search distance from smalltq = new PriorityQueuetq.add(match(p1,p2, ... ,pk ))for i ← 2 to k do
listi = Ti .getnearest(pi−1, search)
foreach list2 as p2 doforeach list3 as p3 do
... foreach listk as pk dodist = size(p1,p2, ... ,pk )if dist ≤ small then
tq.add(match(p1,p2, ... ,pk ))
for i ← 1 to 10 dom = tq.poll()pq.add(m);
O(nk−1 log n + kdnk−1 + 2kdn1−1/d ) more
53 / 62
Motivation Formal Definition Current Approaches kd-tree Algorithm Algorithm Analysis Conclusions
kd-tree Algorithm Worst CaseComplexityO(n)
O(k)
O(kn log n)
a
O(1)O(1)
O(n)
O(nk log n)
a
O(nk )
O(nk log n)
abc
O(nk log n)O(n log n)
a
O(nk − n)ab
O(n2k−1 log n)
abc
G1 = read inputfor i ← 2 to k do
Gi = read inputTi = makeKDTree(Gi )
pq = new PriorityQueuematches = new ArrayListforeach pi ∈ G1 do
addPutativeMatches(pi , pq)
while pq not empty dom = pq.poll()if all points are unused then
foreach i ≤ k doTi .remove(m.i)
matches.add(m)
elseif point m.1 ∈ G1 is unused then
if no more matches available thenaddPutativeMatches(m.1, pq)
O(n2k−1 log n + knk+1 log n + kdnk+1
)more
54 / 62
Motivation Formal Definition Current Approaches kd-tree Algorithm Algorithm Analysis Conclusions
Expected Case Assumption
For n points, ∃δ such that ∀ε-sized areas, there are less than δεn points in that region.
• This assumes a uniform distribution, per study design
• Location of the ε-sized area is irrelevant
• As the density increases, the search radius becomes smaller
• For a small area ε ≈ 1/n, number of points appears constant in that area (δ)• Since we look at the k closest neighbors to a point
55 / 62
Motivation Formal Definition Current Approaches kd-tree Algorithm Algorithm Analysis Conclusions
Expected Case
Tapmk,d= O(2(k − 1)dn1− 1
d + log n) = O(kdn)
Tpart1k,d = O(nTapmk,d
)= O(kdn2)
Tpart2k,d = O (n log n + nk log n) = O((k + 1)n log n)
with the total time complexity reducing to
Tkdtree = Tbuildkds + Tpart1k,d + Tpart2k,d
= O((k − 1)(n log n) + kdn2 + (k + 1)n log n
)= O(kdn2).
56 / 62
Motivation Formal Definition Current Approaches kd-tree Algorithm Algorithm Analysis Conclusions
Empirical Study: Brute Force vs KD-Tree
Parameter Values
Number of Treatment Groups 3 - 4Participants per Treatment Group 50, 100, 200, 300, 400, 500, 750, 1000Confounding Factors per Participant 3
Table: Empirical test configurations.
Each configuration repeated 50 times.Centurion cluster nodes
• 1.6 GHz dual-core Opteron
• 3GB RAM
57 / 62
Motivation Formal Definition Current Approaches kd-tree Algorithm Algorithm Analysis Conclusions
Empirical Study: Brute Force vs KD-Tree
101 102 103 104 105 106 107 108
Time to find all matches (ms)
50
100
200
300
400
500
750
1000
Part
icip
ants
in e
ach
grou
p
Brute Forcekd-tree Implementation
Figure: Results in 3-dimensions with 3 groups (log-scale x-axis)
58 / 62
Motivation Formal Definition Current Approaches kd-tree Algorithm Algorithm Analysis Conclusions
Empirical Study: Brute Force vs KD-Tree
100 101 102 103 104 105 106 107 108 10950
100
200
300
Part
icip
ants
per
gro
up 4
gro
ups
100 101 102 103 104 105 106 107 108 109
Time to find all matches (ms)
50
100
200
300
Part
icip
ants
per
gro
up 3
gro
ups
Brute Forcekd-tree Implementation
Figure: Empirical study comparing brute force and the kd-tree algorithm. (log-scale x-axis)
59 / 62
Motivation Formal Definition Current Approaches kd-tree Algorithm Algorithm Analysis Conclusions
Outline1 Motivation
2 Formal DefinitionDefinitionsSize Function
3 Current Approaches
4 kd-tree Algorithmkd-treesAlgorithm Description
5 Algorithm AnalysisWorst CaseExpected CaseEmpirical Study
6 Conclusions
60 / 62
Motivation Formal Definition Current Approaches kd-tree Algorithm Algorithm Analysis Conclusions
Future Directions
• Alternative applications for the algorithm
• Combining kd-trees and voronoi cells• Some research into using voronoi cells to speed kd-tree lookups• Utilize kd-trees to build effective voronoi diagrams (completed)• Extracting matches using the effective voronoi diagrams before completion using
the kd-tree algorithm
• Other assumptions for expected cases• Alternate distributions• Slowly growing δ in our expected assumption
• Other match measure functions
• Reduce the search area once a smaller match is found
61 / 62
Motivation Formal Definition Current Approaches kd-tree Algorithm Algorithm Analysis Conclusions
Research Plan
Proposed research directions:√
Generalize the kd-tree algorithm to an arbitrary k colors in d dimensions, asdefined in the problem statement,
√Analyze the time complexity of the k-d tree algorithm for both worst-case andexpected case running times,
√Examine other methods for defining the size of a match that are not dependent orlimited by dimensionality, number of colors, or ordering of the points,
√Prove algorithm correctness.
Additional research directions:√
Perform an empirical study of addPutativeMatches return values,√
Perform an empirical study comparing brute force to the kd-tree algorithm for 3-5groups in 3-4 dimensions.
62 / 62
Questions?
63 / 62
addPutativeMatches Analysis
Complexity
O(k)O(1)O(kd)O(1)O(log n)O(k)a
O(n)×O(n)××...× O(n)abc
O(10 log nk−1)abcd
for i ← 2 to k dopi = Ti .getnearest(pi−1)
small = size(p1, p2, ... , pk )search = get search distance from smalltq = new PriorityQueuetq.add(match(p1,p2, ... ,pk ))for i ← 2 to k do
listi = Ti .getnearest(pi−1, search)
foreach list2 as p2 doforeach list3 as p3 do
... foreach listk as pk dodist = size(p1,p2, ... ,pk )if dist ≤ small then
tq.add(match(p1,p2, ... ,pk ))
for i ← 1 to 10 dom = tq.poll()pq.add(m);
O(nk−1 log n + kdnk−1 + 2kdn1−1/d )
back
64 / 62
addPutativeMatches Analysis
ComplexityO(k)
O(1)O(kd)O(1)O(log n)O(k)a
O(n)×O(n)××...× O(n)abc
O(10 log nk−1)abcd
for i ← 2 to k dopi = Ti .getnearest(pi−1)
small = size(p1, p2, ... , pk )search = get search distance from smalltq = new PriorityQueuetq.add(match(p1,p2, ... ,pk ))for i ← 2 to k do
listi = Ti .getnearest(pi−1, search)
foreach list2 as p2 doforeach list3 as p3 do
... foreach listk as pk dodist = size(p1,p2, ... ,pk )if dist ≤ small then
tq.add(match(p1,p2, ... ,pk ))
for i ← 1 to 10 dom = tq.poll()pq.add(m);
O(nk−1 log n + kdnk−1 + 2kdn1−1/d )
back
65 / 62
addPutativeMatches Analysis
ComplexityO(k)O(dn1−1/d )
O(1)O(kd)O(1)O(log n)O(k)a
O(n)×O(n)××...× O(n)abc
O(10 log nk−1)abcd
for i ← 2 to k dopi = Ti .getnearest(pi−1)
small = size(p1, p2, ... , pk )search = get search distance from smalltq = new PriorityQueuetq.add(match(p1,p2, ... ,pk ))for i ← 2 to k do
listi = Ti .getnearest(pi−1, search)
foreach list2 as p2 doforeach list3 as p3 do
... foreach listk as pk dodist = size(p1,p2, ... ,pk )if dist ≤ small then
tq.add(match(p1,p2, ... ,pk ))
for i ← 1 to 10 dom = tq.poll()pq.add(m);
O(nk−1 log n + kdnk−1 + 2kdn1−1/d )
back
66 / 62
addPutativeMatches Analysis
Complexity
O(k)
O(kdn1−1/d )O(1)O(kd)O(1)O(log n)
O(k)a
O(n)×O(n)××...× O(n)abc
O(10 log nk−1)abcd
for i ← 2 to k dopi = Ti .getnearest(pi−1)
small = size(p1, p2, ... , pk )search = get search distance from smalltq = new PriorityQueuetq.add(match(p1,p2, ... ,pk ))for i ← 2 to k do
listi = Ti .getnearest(pi−1, search)
foreach list2 as p2 doforeach list3 as p3 do
... foreach listk as pk dodist = size(p1,p2, ... ,pk )if dist ≤ small then
tq.add(match(p1,p2, ... ,pk ))
for i ← 1 to 10 dom = tq.poll()pq.add(m);
O(nk−1 log n + kdnk−1 + 2kdn1−1/d )
back
67 / 62
addPutativeMatches Analysis
Complexity
O(k)
O(kdn1−1/d )O(1)O(kd)O(1)O(log n)O(k)
a
O(n)×O(n)××...× O(n)abc
O(10 log nk−1)abcd
for i ← 2 to k dopi = Ti .getnearest(pi−1)
small = size(p1, p2, ... , pk )search = get search distance from smalltq = new PriorityQueuetq.add(match(p1,p2, ... ,pk ))for i ← 2 to k do
listi = Ti .getnearest(pi−1, search)
foreach list2 as p2 doforeach list3 as p3 do
... foreach listk as pk dodist = size(p1,p2, ... ,pk )if dist ≤ small then
tq.add(match(p1,p2, ... ,pk ))
for i ← 1 to 10 dom = tq.poll()pq.add(m);
O(nk−1 log n + kdnk−1 + 2kdn1−1/d )
back
68 / 62
addPutativeMatches Analysis
Complexity
O(k)
O(kdn1−1/d )O(1)O(kd)O(1)O(log n)O(k)O(dn1−1/d )
a
O(n)×O(n)××...× O(n)abc
O(10 log nk−1)abcd
for i ← 2 to k dopi = Ti .getnearest(pi−1)
small = size(p1, p2, ... , pk )search = get search distance from smalltq = new PriorityQueuetq.add(match(p1,p2, ... ,pk ))for i ← 2 to k do
listi = Ti .getnearest(pi−1, search)
foreach list2 as p2 doforeach list3 as p3 do
... foreach listk as pk dodist = size(p1,p2, ... ,pk )if dist ≤ small then
tq.add(match(p1,p2, ... ,pk ))
for i ← 1 to 10 dom = tq.poll()pq.add(m);
O(nk−1 log n + kdnk−1 + 2kdn1−1/d )
back
69 / 62
addPutativeMatches Analysis
Complexity
O(k)
O(kdn1−1/d )O(1)O(kd)O(1)O(log n)
O(k)
O(kdn1−1/d )
a
O(n)×O(n)××...× O(n)
abc
O(10 log nk−1)abcd
for i ← 2 to k dopi = Ti .getnearest(pi−1)
small = size(p1, p2, ... , pk )search = get search distance from smalltq = new PriorityQueuetq.add(match(p1,p2, ... ,pk ))for i ← 2 to k do
listi = Ti .getnearest(pi−1, search)
foreach list2 as p2 doforeach list3 as p3 do
... foreach listk as pk dodist = size(p1,p2, ... ,pk )if dist ≤ small then
tq.add(match(p1,p2, ... ,pk ))
for i ← 1 to 10 dom = tq.poll()pq.add(m);
O(nk−1 log n + kdnk−1 + 2kdn1−1/d )
back
70 / 62
addPutativeMatches Analysis
Complexity
O(k)
O(kdn1−1/d )O(1)O(kd)O(1)O(log n)
O(k)
O(kdn1−1/d )
a
O(n)×O(n)××...× O(n)O(kd)O(log nk−1)
abc
O(10 log nk−1)abcd
for i ← 2 to k dopi = Ti .getnearest(pi−1)
small = size(p1, p2, ... , pk )search = get search distance from smalltq = new PriorityQueuetq.add(match(p1,p2, ... ,pk ))for i ← 2 to k do
listi = Ti .getnearest(pi−1, search)
foreach list2 as p2 doforeach list3 as p3 do
... foreach listk as pk dodist = size(p1,p2, ... ,pk )if dist ≤ small then
tq.add(match(p1,p2, ... ,pk ))
for i ← 1 to 10 dom = tq.poll()pq.add(m);
O(nk−1 log n + kdnk−1 + 2kdn1−1/d )
back
71 / 62
addPutativeMatches Analysis
Complexity
O(k)
O(kdn1−1/d )O(1)O(kd)O(1)O(log n)
O(k)
O(kdn1−1/d )
a
O(n)×O(n)××...× O(n)
O(kdnk−1)O(nk−1 log nk−1)
abc
O(10 log nk−1)
abcd
for i ← 2 to k dopi = Ti .getnearest(pi−1)
small = size(p1, p2, ... , pk )search = get search distance from smalltq = new PriorityQueuetq.add(match(p1,p2, ... ,pk ))for i ← 2 to k do
listi = Ti .getnearest(pi−1, search)
foreach list2 as p2 doforeach list3 as p3 do
... foreach listk as pk dodist = size(p1,p2, ... ,pk )if dist ≤ small then
tq.add(match(p1,p2, ... ,pk ))
for i ← 1 to 10 dom = tq.poll()pq.add(m);
O(nk−1 log n + kdnk−1 + 2kdn1−1/d )
back
72 / 62
addPutativeMatches Analysis
Complexity
O(k)
O(kdn1−1/d )O(1)O(kd)O(1)O(log n)
O(k)
O(kdn1−1/d )
a
O(n)×O(n)××...× O(n)
O(kdnk−1)O(nk−1 log nk−1)
abc
O(10 log nk−1)
abcd
for i ← 2 to k dopi = Ti .getnearest(pi−1)
small = size(p1, p2, ... , pk )search = get search distance from smalltq = new PriorityQueuetq.add(match(p1,p2, ... ,pk ))for i ← 2 to k do
listi = Ti .getnearest(pi−1, search)
foreach list2 as p2 doforeach list3 as p3 do
... foreach listk as pk dodist = size(p1,p2, ... ,pk )if dist ≤ small then
tq.add(match(p1,p2, ... ,pk ))
for i ← 1 to 10 dom = tq.poll()pq.add(m);
O(nk−1 log n + kdnk−1 + 2kdn1−1/d ) back
73 / 62
kd-tree Algorithm AnalysisComplexityO(n)
O(k)a
O(1)O(1)O(n)a
O(nk )abc
O(n(k log n + log n))a
O(nk − n)ababc
G1 = read inputfor i ← 2 to k do
Gi = read inputTi = makeKDTree(Gi )
pq = new PriorityQueuematches = new ArrayListforeach pi ∈ G1 do
addPutativeMatches(pi , pq)
while pq not empty dom = pq.poll()if all points are unused then
foreach i ≤ k doTi .remove(m.i)
matches.add(m)
elseif point m.1 ∈ G1 is unused then
if no more matches available thenaddPutativeMatches(m.1, pq)
O(n2k−1 log n + knk+1 log n + kdnk+1
)
back
74 / 62
kd-tree Algorithm AnalysisComplexityO(n)O(k)
a
O(1)O(1)O(n)a
O(nk )abc
O(n(k log n + log n))a
O(nk − n)ababc
G1 = read inputfor i ← 2 to k do
Gi = read inputTi = makeKDTree(Gi )
pq = new PriorityQueuematches = new ArrayListforeach pi ∈ G1 do
addPutativeMatches(pi , pq)
while pq not empty dom = pq.poll()if all points are unused then
foreach i ≤ k doTi .remove(m.i)
matches.add(m)
elseif point m.1 ∈ G1 is unused then
if no more matches available thenaddPutativeMatches(m.1, pq)
O(n2k−1 log n + knk+1 log n + kdnk+1
)
back
75 / 62
kd-tree Algorithm AnalysisComplexityO(n)O(k)O(n + n log n)
a
O(1)O(1)O(n)a
O(nk )abc
O(n(k log n + log n))a
O(nk − n)ababc
G1 = read inputfor i ← 2 to k do
Gi = read inputTi = makeKDTree(Gi )
pq = new PriorityQueuematches = new ArrayListforeach pi ∈ G1 do
addPutativeMatches(pi , pq)
while pq not empty dom = pq.poll()if all points are unused then
foreach i ≤ k doTi .remove(m.i)
matches.add(m)
elseif point m.1 ∈ G1 is unused then
if no more matches available thenaddPutativeMatches(m.1, pq)
O(n2k−1 log n + knk+1 log n + kdnk+1
)
back
76 / 62
kd-tree Algorithm AnalysisComplexityO(n)
O(k)
O(kn log n)
a
O(1)O(1)
O(n)a
O(nk )abc
O(n(k log n + log n))a
O(nk − n)ababc
G1 = read inputfor i ← 2 to k do
Gi = read inputTi = makeKDTree(Gi )
pq = new PriorityQueuematches = new ArrayListforeach pi ∈ G1 do
addPutativeMatches(pi , pq)
while pq not empty dom = pq.poll()if all points are unused then
foreach i ≤ k doTi .remove(m.i)
matches.add(m)
elseif point m.1 ∈ G1 is unused then
if no more matches available thenaddPutativeMatches(m.1, pq)
O(n2k−1 log n + knk+1 log n + kdnk+1
)
back
77 / 62
kd-tree Algorithm AnalysisComplexityO(n)
O(k)
O(kn log n)
a
O(1)O(1)O(n)
a
O(nk )abc
O(n(k log n + log n))a
O(nk − n)ababc
G1 = read inputfor i ← 2 to k do
Gi = read inputTi = makeKDTree(Gi )
pq = new PriorityQueuematches = new ArrayListforeach pi ∈ G1 do
addPutativeMatches(pi , pq)
while pq not empty dom = pq.poll()if all points are unused then
foreach i ≤ k doTi .remove(m.i)
matches.add(m)
elseif point m.1 ∈ G1 is unused then
if no more matches available thenaddPutativeMatches(m.1, pq)
O(n2k−1 log n + knk+1 log n + kdnk+1
)
back
78 / 62
kd-tree Algorithm AnalysisComplexityO(n)
O(k)
O(kn log n)
a
O(1)O(1)O(n)O(nk−1 log n)
a
O(nk )abc
O(n(k log n + log n))a
O(nk − n)ababc
G1 = read inputfor i ← 2 to k do
Gi = read inputTi = makeKDTree(Gi )
pq = new PriorityQueuematches = new ArrayListforeach pi ∈ G1 do
addPutativeMatches(pi , pq)
while pq not empty dom = pq.poll()if all points are unused then
foreach i ≤ k doTi .remove(m.i)
matches.add(m)
elseif point m.1 ∈ G1 is unused then
if no more matches available thenaddPutativeMatches(m.1, pq)
O(n2k−1 log n + knk+1 log n + kdnk+1
)
back
79 / 62
kd-tree Algorithm AnalysisComplexityO(n)
O(k)
O(kn log n)
a
O(1)O(1)
O(n)
O(nk log n)
a
O(nk )
abc
O(n(k log n + log n))a
O(nk − n)ababc
G1 = read inputfor i ← 2 to k do
Gi = read inputTi = makeKDTree(Gi )
pq = new PriorityQueuematches = new ArrayListforeach pi ∈ G1 do
addPutativeMatches(pi , pq)
while pq not empty dom = pq.poll()if all points are unused then
foreach i ≤ k doTi .remove(m.i)
matches.add(m)
elseif point m.1 ∈ G1 is unused then
if no more matches available thenaddPutativeMatches(m.1, pq)
O(n2k−1 log n + knk+1 log n + kdnk+1
)
back
80 / 62
kd-tree Algorithm AnalysisComplexityO(n)
O(k)
O(kn log n)
a
O(1)O(1)
O(n)
O(nk log n)
a
O(nk )O(log n)
abc
O(n(k log n + log n))a
O(nk − n)ababc
G1 = read inputfor i ← 2 to k do
Gi = read inputTi = makeKDTree(Gi )
pq = new PriorityQueuematches = new ArrayListforeach pi ∈ G1 do
addPutativeMatches(pi , pq)
while pq not empty dom = pq.poll()if all points are unused then
foreach i ≤ k doTi .remove(m.i)
matches.add(m)
elseif point m.1 ∈ G1 is unused then
if no more matches available thenaddPutativeMatches(m.1, pq)
O(n2k−1 log n + knk+1 log n + kdnk+1
)
back
81 / 62
kd-tree Algorithm AnalysisComplexityO(n)
O(k)
O(kn log n)
a
O(1)O(1)
O(n)
O(nk log n)
a
O(nk )
O(nk log n)
abc
O(n(k log n + log n))
a
O(nk − n)ababc
G1 = read inputfor i ← 2 to k do
Gi = read inputTi = makeKDTree(Gi )
pq = new PriorityQueuematches = new ArrayListforeach pi ∈ G1 do
addPutativeMatches(pi , pq)
while pq not empty dom = pq.poll()if all points are unused then
foreach i ≤ k doTi .remove(m.i)
matches.add(m)
elseif point m.1 ∈ G1 is unused then
if no more matches available thenaddPutativeMatches(m.1, pq)
O(n2k−1 log n + knk+1 log n + kdnk+1
)
back
82 / 62
kd-tree Algorithm AnalysisComplexityO(n)
O(k)
O(kn log n)
a
O(1)O(1)
O(n)
O(nk log n)
a
O(nk )
O(nk log n)
abc
O(n(k log n + log n))
a
O(nk − n)
ababc
G1 = read inputfor i ← 2 to k do
Gi = read inputTi = makeKDTree(Gi )
pq = new PriorityQueuematches = new ArrayListforeach pi ∈ G1 do
addPutativeMatches(pi , pq)
while pq not empty dom = pq.poll()if all points are unused then
foreach i ≤ k doTi .remove(m.i)
matches.add(m)
elseif point m.1 ∈ G1 is unused then
if no more matches available thenaddPutativeMatches(m.1, pq)
O(n2k−1 log n + knk+1 log n + kdnk+1
)
back
83 / 62
kd-tree Algorithm AnalysisComplexityO(n)
O(k)
O(kn log n)
a
O(1)O(1)
O(n)
O(nk log n)
a
O(nk )
O(nk log n)
abc
O(n(k log n + log n))
a
O(nk − n)
ab
O(nk−1 log n)
abc
G1 = read inputfor i ← 2 to k do
Gi = read inputTi = makeKDTree(Gi )
pq = new PriorityQueuematches = new ArrayListforeach pi ∈ G1 do
addPutativeMatches(pi , pq)
while pq not empty dom = pq.poll()if all points are unused then
foreach i ≤ k doTi .remove(m.i)
matches.add(m)
elseif point m.1 ∈ G1 is unused then
if no more matches available thenaddPutativeMatches(m.1, pq)
O(n2k−1 log n + knk+1 log n + kdnk+1
)
back
84 / 62
kd-tree Algorithm AnalysisComplexityO(n)
O(k)
O(kn log n)
a
O(1)O(1)
O(n)
O(nk log n)
a
O(nk )
O(nk log n)
abc
O(n(k log n + log n))
a
O(nk − n)ab
O(n2k−1 log n)
abc
G1 = read inputfor i ← 2 to k do
Gi = read inputTi = makeKDTree(Gi )
pq = new PriorityQueuematches = new ArrayListforeach pi ∈ G1 do
addPutativeMatches(pi , pq)
while pq not empty dom = pq.poll()if all points are unused then
foreach i ≤ k doTi .remove(m.i)
matches.add(m)
elseif point m.1 ∈ G1 is unused then
if no more matches available thenaddPutativeMatches(m.1, pq)
O(n2k−1 log n + knk+1 log n + kdnk+1
)back
85 / 62
Measuring Matches: Convex Hull
(a) (b)
• Avoids TSP encountered with perimeter
• Ω(nbd/2c) for d > 3
back
86 / 62
kd-tree Data Structure
kd-trees
• Multi-dimensional data structure introduced by Bentley (1975)
• Based on binary search trees
• Each level i divides the search space in dimension i mod d
87 / 62
kd-tree Data Structure
Insert
• Search for node in the tree, if not found, add node
• Average cost: O(log n) ≈ 1.386 log2 n (by Knuth)
• Can use Insert to build kd-tree• Inserting random nodes to build kd-tree is statistically similar to building bst• Build cost: O(n log n) for sufficiently random nodes
Optimize
• Given all n nodes, build an optimal kd-tree
• Uses the median for each dimension as discriminator for that level
• O(n log n) running time
• Maximum path length: blog2 nc
88 / 62
kd-tree Data Structure
Delete
• Must replace node with j-max element of left tree or j-min element of right tree
• Worst Case Cost: O(n1−1/d), dominated by find min/max
• Average Delete Cost: O(log n)
Nearest Neighbor Queries
• Bentley’s Original algorithm: empirically O(log n) (redacted)
• Friedman and Bentley: empirically O(log2 n)
• Lee and Wong (’80): Worst case: O(n1−1/k)
back
89 / 62
Proof of Correctness: Centroid
We want to find r such that given m with k points,
s = max
(d∑
l=1
(pl − cl(m))2
),∀p ∈ m,
where s is the maximum Euclidean distance to centroid.
• First, consider size(m) ≤ ks2. Remember,
size(m) =k∑
i=1
(d∑
j=1
(pi ,j − cj(m))2
)2
Since∑d
j=1 (pi ,j − cj(m))2 ≤ s for all i , this is trivially true.
90 / 62
Proof of Correctness: Centroid• Second, there exists pj outside of r , with pj , pi ∈ m′. Let x and y be the distance
from pi and pj to c(m′), respectively. By assumption, x + y ≥ r . Sincesize(m) ≤ ks2, we show
size(m) ≤ ks2 ≤ x2 + y 2 ≤ size(m′).
With minimal x + y , x + y = r . Then we know
r 2
2≤ x2 + y 2 ≤ r 2.
Therefore
ks2 ≤ r 2
2
ks2 ≤ k2s2
22 ≤ k .
back
91 / 62
Centroid Measures
Visible differences between max, average (variance), and sum of squared distances tothe centroid.
−0.5 −0.4 −0.3 −0.2 −0.1 0 0.1 0.2 0.3 0.4 0.5−0.5
−0.4
−0.3
−0.2
−0.1
0
0.1
0.2
0.3
0.4
0.5Longest Distance to Centroid:100 matches (total match size 12.1)
(c) Max distance−0.5 −0.4 −0.3 −0.2 −0.1 0 0.1 0.2 0.3 0.4 0.5
−0.5
−0.4
−0.3
−0.2
−0.1
0
0.1
0.2
0.3
0.4
0.5Average Distance to Centroid:100 matches (total match size 8.76)
(d) Average distance−0.5 −0.4 −0.3 −0.2 −0.1 0 0.1 0.2 0.3 0.4 0.5
−0.5
−0.4
−0.3
−0.2
−0.1
0
0.1
0.2
0.3
0.4
0.5Sum of Squared Distance to Centroid:100 matches (total match size 15.9)
(e) Sum of squares
92 / 62
Centroid Measures
Equivalent matches under each measure to the centroid.
(f) Max distance (g) Average distance
back
93 / 62
kd-tree Empirical Performance
0 500 1000 1500 2000 2500 3000 3500 4000Time to find all matches (ms)
50
100
200
300
400
500
750
1000Pa
rtic
ipan
ts in
eac
h gr
oup
kd-tree Implementation
94 / 62