KD-Tree Algorithm for Propensity Score Matching PhD ...jh2jf/pdfs/quals_defense.pdf · Motivation Formal De nitionCurrent Approaches kd-tree Algorithm Algorithm AnalysisConclusions

Motivation Formal Definition Current Approaches kd-tree Algorithm Algorithm Analysis Conclusions

KD-Tree Algorithm for Propensity Score Matching

PhD Qualifying Exam Defense

John R Hott

University of Virginia

May 11, 2012

1 / 62


Motivation

• Epidemiology: Clinical Trials• Phase II and III pre-market trials• After-market Phase IV trials against 3+ treatments

• Trials requiring similar groups to avoid confounding• Participants with similar comorbidities (diseases, ...)• Participants with similar traits (age, weight, height, ...)

• Propensity Scores• Probability, given certain factors, that a person will be given a certain treatment• Want to match participants with similar propensity scores

2 / 62


3 / 62


4 / 62


5 / 62


6 / 62


7 / 62


8 / 62


9 / 62


10 / 62


11 / 62


Why doesn’t this nearest neighbor approach work for more groups?

• b1 and g closest points to r

• Triangle rgb2 has smaller perimeter than rgb1

12 / 62


Current Approaches

For two treatment groups

• Propensity scores used to reduce dimensionality

• Brute force or nearest neighbor searches

For more than two groups

• Brute force• Requires O(nk log n) time using O(nk) space• Less efficient brute force uses O(n) space, but O(nk+1) time• Problem: must consider all matches• Not feasible for large n

kd-tree algorithm, under a uniform distribution, will perform in O(kdn2) time and O(n)space.

13 / 62


The End: Spoilers

0 1 2 3 4 5 6 7 8Time to find all matches (ms) 1e7

50

100

200

300

400

500

750

1000

Part

icip

ants

in e

ach

grou

pBrute Forcekd-tree Implementation

Figure: Results in 3-dimensions with 3 groups

For 1000 participants in 3 groups with 3 dimensions

• Brute Force: 19.6 hours

• kd-tree Algorithm: 3.6 seconds (19,427x speedup)14 / 62


Problem Statement

Informally, we want to make the smallest n disjoint matches with one participant fromeach of k groups per match.

• Start with participants in one group,

• Find their closest matched participants of each other group (nearest neighbor),

• Search within a small neighborhood of these points for a smaller match, if oneexists,

• Repeat, if necessary.

15 / 62


Outline1 Motivation

2 Formal DefinitionDefinitionsSize Function

3 Current Approaches

4 kd-tree Algorithmkd-treesAlgorithm Description

5 Algorithm AnalysisWorst CaseExpected CaseEmpirical Study

6 Conclusions

16 / 62


Participant Definitions

Definition (Participant)

Let the point p ∈ Rd be a participant normalized over d defining characteristics

Definition (Set of all Participants)

The set P ⊆ Rd , is the set of all participants, such that:

• P = ∪ki=1Gi , where each Gi defines a treatment group

• |Gi | = n

• |P| =∑

i |Gi | = kn.

17 / 62


Participant Definitions

Definition (Participant)

Let the point p ∈ Rd be a participant normalized over d defining characteristics

Definition (Set of all Participants)

The set P ⊆ Rd , is the set of all participants, such that:

• P = ∪ki=1Gi , where each Gi defines a treatment group

• |Gi | = n

• |P| =∑

i |Gi | = kn.

18 / 62


Match Definitions

Definition (Match)

A set m ⊆ P is a match if it contains exactly one point from each Gi :

• |m| = k,

• |m ∩ Gi | = 1, ∀i .

Definition (Set of all Matches)

Let M = m : m is a match be the set of all matches with

|M| =∏

i |Gi | = nk

19 / 62


Match Definitions

Definition (Match)

A set m ⊆ P is a match if it contains exactly one point from each Gi :

• |m| = k,

• |m ∩ Gi | = 1, ∀i .

Definition (Set of all Matches)

Let M = m : m is a match be the set of all matches with

|M| =∏

i |Gi | = nk

20 / 62


Size of a Match

DefinitionMatch measure function size(m),

size :M→ R,

independent of the order of the points in the match m, must give a consistentmeasurement of the match.

Ideal measure: minimize the sum of the distance between all points

• Fully connected graph

• Quadratic on k

21 / 62


Match Covering

DefinitionM is a match covering of P if M is a set of disjoint matches:

• M ⊂M• |M | = n

• ∀m, l ∈ M where m 6= l then m ∩ l = ∅.

DefinitionWLOG, assume M is sorted on size(m): ∀mi ,mj ∈ M , i < j =⇒ size(mi) < size(mj).Define ordering <M such that M0 <M M1 if for some index i ,

size(m0,i) < size(m1,i) and ∀j < i , size(m0,j) = size(m1,j)

• Size of match in M0 less than size of match in M1 at the first place they differ

22 / 62


Match Covering

DefinitionM is a match covering of P if M is a set of disjoint matches:

• M ⊂M• |M | = n

• ∀m, l ∈ M where m 6= l then m ∩ l = ∅.

DefinitionWLOG, assume M is sorted on size(m): ∀mi ,mj ∈ M , i < j =⇒ size(mi) < size(mj).Define ordering <M such that M0 <M M1 if for some index i ,

size(m0,i) < size(m1,i) and ∀j < i , size(m0,j) = size(m1,j)

• Size of match in M0 less than size of match in M1 at the first place they differ

23 / 62


Problem Statement

Find the minimal match covering, M0, such that

∀i ,M0 ≤M Mi .

24 / 62


Match size function

What is the best method for measuring the size of a match?Perimeter? Convex Hull? Something else?

25 / 62


Measuring Matches: by example

Figure: Sample points for one match.

26 / 62


Measuring Matches: Area

(a) (b) (c)

• All colinear points have 0 area, regardless of distance

• Favors colinear points

27 / 62


Measuring Matches: Perimeter

(d) (e)

• Works for 2-dimensions, 3-colors

• Equivalent to Traveling Salesman as number of colors increases

• Not well defined in more dimensions

28 / 62


Measuring Matches: Convex Hull

(f)

• Avoids TSP encountered with perimeter

• Ω(nbd/2c) for d > 3

more

29 / 62


Measuring Matches: Centroid

• Linearly computable (on colors and dimensions)

• Statistical sense: distance to an average point

• Matches are intuitively small

• Possible Measurements• Max distance to centroid• Average distance to centroid (variance)• Sum of squared distances to centroid

more

30 / 62


Perimeter Measure

For 3 or fewer groups, perimeter matches our ideal measurement. In this case,

size(m) = perimeter(m).

This definition leads to a search radius of

search(m) =1

2size(m).

31 / 62


Proof of Correctness: Perimeter

TheoremGiven an initial match m containing point pi ∈ G1, which contains random points pr j ,one per Gj with j ≥ 2, the perimeter will define the size of the match. Let us assumethat size(m) = q ∈ R. Our search radius will then be the disc centered at pi withradius q

2. This area will contain the smallest match for pi .

32 / 62



33 / 62



34 / 62



35 / 62


Centroid MeasureFor a d-dimensional space and k colors per match, we consider the sum of squareddistances to the centroid.

The centroid of a match m is defined as

c(m) =1

k

k∑i=1

pi .

Our size(m) function, using sum of squared distances to the centroid, is defined as

size(m) =k∑

i=1

(d∑

j=1

(pi ,j − cj(m))2

)2

.

This definition leads to a search radius of

search(m) = k ∗max distance to centroid.

36 / 62


Proof of Correctness: Centroid

TheoremGiven an initial match m containing point pi ∈ G1, which contains random points pr j ,one per Gj with j ≥ 2, the sum of squared distances to the centroid will define the sizeof the match. Our search radius will be the disc centered at pi with radius k ∗ s where sis the max distance to centroid. This area will contain the smallest match for pi .

37 / 62



38 / 62



39 / 62



more

40 / 62


Outline1 Motivation





6 Conclusions

41 / 62


Brute Force Approaches

Algorithm 2: O(nk log n) time, but O(nk) space.

Input: k sets of n pointsOutput: set of n ordered smallest matches of k points each

read inputforeach p1 ∈ G1 do

foreach p2 ∈ G2 do. . . foreach pk ∈ Gk do

M ← m = p1, p2, ..., pk

sort(M)foreach M do

if p1, p2, ..., pk clean thenMans ← m

return Mans

42 / 62


Brute Force Approaches

Algorithm 1: O(n) space, but O(nk+1) time.

Input: k sets of n pointsOutput: set of n ordered smallest matches of k points each

read inputfor i = 1 : n do

smallest = MAXmsmallest = nullforeach p1 ∈ G1 do

foreach p2 ∈ G2 do. . .foreach pk ∈ Gk do

if size(m = p1, p2, ..., pk) < smallest thenmsmallest = msmallest = size(m)

Mans ← msmallest

remove p1, p2, ..., pk from G1,G2, ...,Gk

return Mans

43 / 62


Voronoi Matching Algorithm

0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 10

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1

Figure: Voronoi Cells.

44 / 62


Voronoi Matching Algorithm

Problems with the Voronoi algorithm

• Provides only an approximation for M0

• Worst and expected case complexity O(n4) for 3 groups

• Required searching for points in polygons

45 / 62


Outline1 Motivation





6 Conclusions

46 / 62


kd-treesBinary tree data structure used to store points in d-dimensional space.

more

47 / 62


kd-tree AlgorithmInput: k sets of n pointsOutput: set of n ordered smallest matches of k points each

G1 = read inputfor i ← 2 to k do

Gi = read inputTi = makeKDTree(Gi )

pq = new PriorityQueuematches = new ArrayListforeach pi ∈ G1 do

addPutativeMatches(pi , pq)

while pq not empty dom = pq.poll()if all points are unused then

foreach i ≤ k doTi .remove(m.i)

matches.add(m)

elseif point m.1 ∈ G1 is unused then

if no more matches available thenaddPutativeMatches(m.1, pq)

48 / 62


addPutativeMatches SubroutineInput: PriorityQueue pq, current point p1 from G1, kd-trees Ti for each G2 to Gk

Output: list of 10 smallest matches for point p1

for i ← 2 to k dopi = Ti .getnearest(pi−1)

small = size(p1, p2, ... , pk )search = get search distance from smalltq = new PriorityQueuetq.add(match(p1,p2, ... ,pk ))for i ← 2 to k do

listi = Ti .getnearest(pi−1, search)

foreach list2 as p2 doforeach list3 as p3 do

... foreach listk as pk dodist = size(p1,p2, ... ,pk )if dist ≤ small then

tq.add(match(p1,p2, ... ,pk ))

for i ← 1 to 10 dom = tq.poll()pq.add(m);

49 / 62


Empirical Study: addPutativeMatches returns

20 40 60 80Matches used in AddPutativeMatches

7500

8000

8500

9000

9500

10000

10500

Tim

e to

find

all

mat

ches

(ms)

Average Time (ms)Max Time (ms)Min Time (ms)

Figure: Empirical study varying number of matches returned by addPutativeMatches.

50 / 62


Outline1 Motivation





6 Conclusions

51 / 62


Worst Case

Figure: Worst case example (3 colors in 2 dimensions). Points in each Gi are coincident witheach other (∀i : ri = (−1, 0), gi = (0, 0), bi = (1, 0)); all points are within the search area ofany match (ri , gi , bi ).

52 / 62


addPutativeMatches Worst Case

Complexity

O(k)

O(kdn1−1/d )O(1)O(kd)O(1)O(log n)

O(k)

O(kdn1−1/d )

a

O(n)×O(n)××...× O(n)

O(kdnk−1)O(nk−1 log nk−1)

abc

O(10 log nk−1)

abcd








O(nk−1 log n + kdnk−1 + 2kdn1−1/d ) more

53 / 62


kd-tree Algorithm Worst CaseComplexityO(n)

O(k)

O(kn log n)

a

O(1)O(1)

O(n)

O(nk log n)

a

O(nk )

O(nk log n)

abc

O(nk log n)O(n log n)

a

O(nk − n)ab

O(n2k−1 log n)

abc







matches.add(m)



O(n2k−1 log n + knk+1 log n + kdnk+1

)more

54 / 62


Expected Case Assumption

For n points, ∃δ such that ∀ε-sized areas, there are less than δεn points in that region.

• This assumes a uniform distribution, per study design

• Location of the ε-sized area is irrelevant

• As the density increases, the search radius becomes smaller

• For a small area ε ≈ 1/n, number of points appears constant in that area (δ)• Since we look at the k closest neighbors to a point

55 / 62


Expected Case

Tapmk,d= O(2(k − 1)dn1− 1

d + log n) = O(kdn)

Tpart1k,d = O(nTapmk,d

)= O(kdn2)

Tpart2k,d = O (n log n + nk log n) = O((k + 1)n log n)

with the total time complexity reducing to

Tkdtree = Tbuildkds + Tpart1k,d + Tpart2k,d

= O((k − 1)(n log n) + kdn2 + (k + 1)n log n

)= O(kdn2).

56 / 62


Empirical Study: Brute Force vs KD-Tree

Parameter Values

Number of Treatment Groups 3 - 4Participants per Treatment Group 50, 100, 200, 300, 400, 500, 750, 1000Confounding Factors per Participant 3

Table: Empirical test configurations.

Each configuration repeated 50 times.Centurion cluster nodes

• 1.6 GHz dual-core Opteron

• 3GB RAM

57 / 62



101 102 103 104 105 106 107 108

Time to find all matches (ms)

50

100

200

300

400

500

750

1000

Part

icip

ants

in e

ach

grou

p

Brute Forcekd-tree Implementation

Figure: Results in 3-dimensions with 3 groups (log-scale x-axis)

58 / 62



100 101 102 103 104 105 106 107 108 10950

100

200

300

Part

icip

ants

per

gro

up 4

gro

ups

100 101 102 103 104 105 106 107 108 109

Time to find all matches (ms)

50

100

200

300

Part

icip

ants

per

gro

up 3

gro

ups

Brute Forcekd-tree Implementation

Figure: Empirical study comparing brute force and the kd-tree algorithm. (log-scale x-axis)

59 / 62


Outline1 Motivation





6 Conclusions

60 / 62


Future Directions

• Alternative applications for the algorithm

• Combining kd-trees and voronoi cells• Some research into using voronoi cells to speed kd-tree lookups• Utilize kd-trees to build effective voronoi diagrams (completed)• Extracting matches using the effective voronoi diagrams before completion using

the kd-tree algorithm

• Other assumptions for expected cases• Alternate distributions• Slowly growing δ in our expected assumption

• Other match measure functions

• Reduce the search area once a smaller match is found

61 / 62


Research Plan

Proposed research directions:√

Generalize the kd-tree algorithm to an arbitrary k colors in d dimensions, asdefined in the problem statement,

√Analyze the time complexity of the k-d tree algorithm for both worst-case andexpected case running times,

√Examine other methods for defining the size of a match that are not dependent orlimited by dimensionality, number of colors, or ordering of the points,

√Prove algorithm correctness.

Additional research directions:√

Perform an empirical study of addPutativeMatches return values,√

Perform an empirical study comparing brute force to the kd-tree algorithm for 3-5groups in 3-4 dimensions.

62 / 62

Questions?

63 / 62

addPutativeMatches Analysis

Complexity

O(k)O(1)O(kd)O(1)O(log n)O(k)a

O(n)×O(n)××...× O(n)abc

O(10 log nk−1)abcd








O(nk−1 log n + kdnk−1 + 2kdn1−1/d )

back

64 / 62


ComplexityO(k)

O(1)O(kd)O(1)O(log n)O(k)a

O(n)×O(n)××...× O(n)abc










back

65 / 62


ComplexityO(k)O(dn1−1/d )

O(1)O(kd)O(1)O(log n)O(k)a

O(n)×O(n)××...× O(n)abc










back

66 / 62


Complexity

O(k)


O(k)a

O(n)×O(n)××...× O(n)abc










back

67 / 62


Complexity

O(k)

O(kdn1−1/d )O(1)O(kd)O(1)O(log n)O(k)

a

O(n)×O(n)××...× O(n)abc










back

68 / 62


Complexity

O(k)

O(kdn1−1/d )O(1)O(kd)O(1)O(log n)O(k)O(dn1−1/d )

a

O(n)×O(n)××...× O(n)abc










back

69 / 62


Complexity

O(k)


O(k)

O(kdn1−1/d )

a

O(n)×O(n)××...× O(n)

abc










back

70 / 62


Complexity

O(k)


O(k)

O(kdn1−1/d )

a

O(n)×O(n)××...× O(n)O(kd)O(log nk−1)

abc










back

71 / 62


Complexity

O(k)


O(k)

O(kdn1−1/d )

a

O(n)×O(n)××...× O(n)


abc

O(10 log nk−1)

abcd









back

72 / 62


Complexity

O(k)


O(k)

O(kdn1−1/d )

a

O(n)×O(n)××...× O(n)


abc

O(10 log nk−1)

abcd








O(nk−1 log n + kdnk−1 + 2kdn1−1/d ) back

73 / 62

kd-tree Algorithm AnalysisComplexityO(n)

O(k)a

O(1)O(1)O(n)a

O(nk )abc

O(n(k log n + log n))a

O(nk − n)ababc







matches.add(m)




)

back

74 / 62

kd-tree Algorithm AnalysisComplexityO(n)O(k)

a

O(1)O(1)O(n)a

O(nk )abc


O(nk − n)ababc







matches.add(m)




)

back

75 / 62

kd-tree Algorithm AnalysisComplexityO(n)O(k)O(n + n log n)

a

O(1)O(1)O(n)a

O(nk )abc


O(nk − n)ababc







matches.add(m)




)

back

76 / 62


O(k)

O(kn log n)

a

O(1)O(1)

O(n)a

O(nk )abc


O(nk − n)ababc







matches.add(m)




)

back

77 / 62


O(k)

O(kn log n)

a

O(1)O(1)O(n)

a

O(nk )abc


O(nk − n)ababc







matches.add(m)




)

back

78 / 62


O(k)

O(kn log n)

a

O(1)O(1)O(n)O(nk−1 log n)

a

O(nk )abc


O(nk − n)ababc







matches.add(m)




)

back

79 / 62


O(k)

O(kn log n)

a

O(1)O(1)

O(n)

O(nk log n)

a

O(nk )

abc


O(nk − n)ababc







matches.add(m)




)

back

80 / 62


O(k)

O(kn log n)

a

O(1)O(1)

O(n)

O(nk log n)

a

O(nk )O(log n)

abc


O(nk − n)ababc







matches.add(m)




)

back

81 / 62


O(k)

O(kn log n)

a

O(1)O(1)

O(n)

O(nk log n)

a

O(nk )

O(nk log n)

abc

O(n(k log n + log n))

a

O(nk − n)ababc







matches.add(m)




)

back

82 / 62


O(k)

O(kn log n)

a

O(1)O(1)

O(n)

O(nk log n)

a

O(nk )

O(nk log n)

abc


a

O(nk − n)

ababc







matches.add(m)




)

back

83 / 62


O(k)

O(kn log n)

a

O(1)O(1)

O(n)

O(nk log n)

a

O(nk )

O(nk log n)

abc


a

O(nk − n)

ab

O(nk−1 log n)

abc







matches.add(m)




)

back

84 / 62


O(k)

O(kn log n)

a

O(1)O(1)

O(n)

O(nk log n)

a

O(nk )

O(nk log n)

abc


a

O(nk − n)ab

O(n2k−1 log n)

abc







matches.add(m)




)back

85 / 62

Measuring Matches: Convex Hull

(a) (b)

• Avoids TSP encountered with perimeter

• Ω(nbd/2c) for d > 3

back

86 / 62

kd-tree Data Structure

kd-trees

• Multi-dimensional data structure introduced by Bentley (1975)

• Based on binary search trees

• Each level i divides the search space in dimension i mod d

87 / 62


Insert

• Search for node in the tree, if not found, add node

• Average cost: O(log n) ≈ 1.386 log2 n (by Knuth)

• Can use Insert to build kd-tree• Inserting random nodes to build kd-tree is statistically similar to building bst• Build cost: O(n log n) for sufficiently random nodes

Optimize

• Given all n nodes, build an optimal kd-tree

• Uses the median for each dimension as discriminator for that level

• O(n log n) running time

• Maximum path length: blog2 nc

88 / 62


Delete

• Must replace node with j-max element of left tree or j-min element of right tree

• Worst Case Cost: O(n1−1/d), dominated by find min/max

• Average Delete Cost: O(log n)

Nearest Neighbor Queries

• Bentley’s Original algorithm: empirically O(log n) (redacted)

• Friedman and Bentley: empirically O(log2 n)

• Lee and Wong (’80): Worst case: O(n1−1/k)

back

89 / 62


We want to find r such that given m with k points,

s = max

(d∑

l=1

(pl − cl(m))2

),∀p ∈ m,

where s is the maximum Euclidean distance to centroid.

• First, consider size(m) ≤ ks2. Remember,

size(m) =k∑

i=1

(d∑

j=1

(pi ,j − cj(m))2

)2

Since∑d

j=1 (pi ,j − cj(m))2 ≤ s for all i , this is trivially true.

90 / 62

Proof of Correctness: Centroid• Second, there exists pj outside of r , with pj , pi ∈ m′. Let x and y be the distance

from pi and pj to c(m′), respectively. By assumption, x + y ≥ r . Sincesize(m) ≤ ks2, we show

size(m) ≤ ks2 ≤ x2 + y 2 ≤ size(m′).

With minimal x + y , x + y = r . Then we know

r 2

2≤ x2 + y 2 ≤ r 2.

Therefore

ks2 ≤ r 2

2

ks2 ≤ k2s2

22 ≤ k .

back

91 / 62

Centroid Measures

Visible differences between max, average (variance), and sum of squared distances tothe centroid.

−0.5 −0.4 −0.3 −0.2 −0.1 0 0.1 0.2 0.3 0.4 0.5−0.5

−0.4

−0.3

−0.2

−0.1

0

0.1

0.2

0.3

0.4

0.5Longest Distance to Centroid:100 matches (total match size 12.1)

(c) Max distance−0.5 −0.4 −0.3 −0.2 −0.1 0 0.1 0.2 0.3 0.4 0.5

−0.5

−0.4

−0.3

−0.2

−0.1

0

0.1

0.2

0.3

0.4

0.5Average Distance to Centroid:100 matches (total match size 8.76)

(d) Average distance−0.5 −0.4 −0.3 −0.2 −0.1 0 0.1 0.2 0.3 0.4 0.5

−0.5

−0.4

−0.3

−0.2

−0.1

0

0.1

0.2

0.3

0.4

0.5Sum of Squared Distance to Centroid:100 matches (total match size 15.9)

(e) Sum of squares

92 / 62

Centroid Measures

Equivalent matches under each measure to the centroid.

(f) Max distance (g) Average distance

back

93 / 62

kd-tree Empirical Performance

0 500 1000 1500 2000 2500 3000 3500 4000Time to find all matches (ms)

50

100

200

300

400

500

750

1000Pa

rtic

ipan

ts in

eac

h gr

oup

kd-tree Implementation

94 / 62

Documents

KD-Tree Algorithm for Propensity Score Matching PhD ...jh2jf/pdfs/quals_defense.pdf · Motivation Formal De nitionCurrent Approaches kd-tree Algorithm Algorithm AnalysisConclusions