
A CUDA Based Parallel Decoding Algorithm for the Leech Lattice Locality Sensitive Hash Family

Lee Carraher

Department of Computer Science, University of Cincinnati

May 8, 2012


Outline

• Thesis Goal
• Nearest Neighbors Problem and Applications
• LSH
• Parallel Leech Decoder
• Results
• Conclusions and Future Directions


Our Goal

The goal of this thesis is to present a CUDA-optimized parallel implementation of a bounded-distance Leech lattice decoder 1 for use in query-optimized c-approximate k-nearest neighbor search using the locality sensitive hash framework of E2LSH 2.

In addition, we will present an improvement to the E2LSH algorithm that improves its selectivity on real-world data problems by an order of magnitude, with no change in algorithmic or storage complexity.

1 O. Amrani, Y. Be'ery, A. Vardy, F.-W. Sun, and H. C. A. van Tilborg, "The Leech lattice and the Golay code: bounded-distance decoding and multilevel constructions," Information Theory, IEEE Transactions on, vol. 40, pp. 1030–1043, Jul. 1994

2 A. Andoni and P. Indyk, "Near-optimal hashing algorithms for approximate nearest neighbor in high dimensions," in Foundations of Computer Science, 2006. FOCS '06. 47th Annual IEEE Symposium on, pp. 459–468, Oct. 2006


The Nearest Neighbor Problem

Definition (Exact NN)

Given a set of points P in R^d and a query point q, return the point p ∈ P such that p = argmin_{p ∈ P} ρ(p, q), where ρ is some metric function.3

An extension to an arbitrary number of neighbors:

Definition (k-Nearest Neighbor Search)

Given a set of points P in R^d and a query point q, return the k nearest points p ∈ P to q.4

3 H. Samet, Foundations of Multidimensional and Metric Data Structures. Morgan Kaufmann, 2006

4 H. Samet, Foundations of Multidimensional and Metric Data Structures. Morgan Kaufmann, 2006


Applications of NN

• Pattern Recognition
• Image/Object Recognition - robot vision systems, object and motion tracking
• Genomics - similar and exact gene sequences in long DNA reads
• Biology/Medical - drug interactions and symptoms
• Recommender Systems - similar items based on user submissions
• Data Compression


Standard Solutions for NN search

Definition (Linear Search NN)

Given a set of points P in R^d and a query point q, return the nearest point p ∈ P to q by minimizing f(p) = ρ(p, q) over all p ∈ P, where ρ is some distance function.

This algorithm has complexity Θ(nd), which becomes computationally expensive for large d and n. For this reason it is rarely used in practice; kd-tree based NN search is used instead.
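As a point of reference, here is a minimal Python sketch of the linear scan (assuming Euclidean ρ; the points and query are toy values):

import numpy as np

def linear_nn(P, q):
    """Exact NN by brute force: one distance per point, Theta(nd) work total."""
    dists = np.linalg.norm(P - q, axis=1)  # n Euclidean distances, O(d) each
    return P[np.argmin(dists)]

P = np.array([[3, 5], [2, 1], [2, 2], [1, 2], [5, 5], [4, 6]], dtype=float)
print(linear_nn(P, np.array([4.6, 5.0])))  # -> [5. 5.]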


Kd Trees for NN search

Kd-trees are an extension of a binary tree in which nodes correspond to points in the database and splitting decisions are based on the value of the point along one of its axes. This method forms splitting planes that partition the points in R^d. A simple 2-d example is given below.

Example points: [3,5], [2,1], [2,2], [1,2], [5,5], [4,6]


Kd Trees for NN search

[Diagram: the kd-tree built from the example points above.]

• The search algorithm consists of recursively searching the binary tree (see the sketch below).
• In 1 dimension, this is equivalent to binary search, and is optimal.
• However, as d → ∞, the recursion approaches linear search.
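A compact Python sketch of the build and search steps for the 2-d example above (assuming the common median-split variant; the function names are ours). The pruning test against the splitting plane is exactly what stops working as d grows:

def build_kd(points, depth=0):
    """Split on axis depth % d; the median point becomes the node."""
    if not points:
        return None
    axis = depth % len(points[0])
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2
    return {"pt": points[mid], "axis": axis,
            "lo": build_kd(points[:mid], depth + 1),
            "hi": build_kd(points[mid + 1:], depth + 1)}

def nn_kd(node, q, best=None):
    """Descend toward q, backtracking only when the splitting plane is
    closer than the best point so far; in high d this prunes almost nothing."""
    if node is None:
        return best
    d2 = sum((a - b) ** 2 for a, b in zip(node["pt"], q))
    if best is None or d2 < best[0]:
        best = (d2, node["pt"])
    diff = q[node["axis"]] - node["pt"][node["axis"]]
    near, far = (node["lo"], node["hi"]) if diff < 0 else (node["hi"], node["lo"])
    best = nn_kd(near, q, best)
    if diff ** 2 < best[0]:  # the plane may still hide a closer point
        best = nn_kd(far, q, best)
    return best

tree = build_kd([(3, 5), (2, 1), (2, 2), (1, 2), (5, 5), (4, 6)])
print(nn_kd(tree, (4.6, 5.0))[1])  # -> (5, 5)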


The Curse of Dimensionality

Although there are many interpretations of the common COD with regard to nearest neighbor searching, here it refers to an exact search method's complexity approaching linear-search complexity as d grows large. This is due to the chosen splitting plane having an increasingly lower probability of partitioning the data equally.

Consider the distance between any two points x and y under the Euclidean metric:

dist(x, y) = √( ∑_{i=1}^{d} (x_i − y_i)² )

Now if we consider the vector as being composed of uniformly distributed values then, according to Ullman 5, the distances all lie very near the average distance between points. Given this, it becomes very difficult to differentiate between near and far points.

5 A. Rajaraman and J. D. Ullman, Mining of Massive Datasets. Cambridge University Press, 2011


COD Continued

COD is sometimes cited as the cause of the distance function losing its usefulness in high dimensions.6 This arises from the ratio of the volume of a hypersphere to the volume of the metric-space partition (hypercube) embedding it:

lim_{d→∞} Vol(S_d) / Vol(C_d) = π^{d/2} / (d · 2^{d−1} · Γ(d/2)) → 0

Given a single distribution, the minimum and maximum distances become indiscernible,7 i.e. the relative majority of the space lies outside of the sphere.

6 K. Beyer, J. Goldstein, R. Ramakrishnan, and U. Shaft, "When is 'nearest neighbor' meaningful?," in Int. Conf. on Database Theory, pp. 217–235, 1999

7 K. Beyer, J. Goldstein, R. Ramakrishnan, and U. Shaft, "When is 'nearest neighbor' meaningful?," in Int. Conf. on Database Theory, pp. 217–235, 1999


Approximate NN Methods

Due to COD's effects, approximate methods are often used: they not only address the growing computational complexity, but also allow formulations of the nearest neighbor problem that sidestep the space-embedding issue by casting it as a probabilistic problem. Many approximate variants of kd-trees exist, alongside approximate linear search methods.

Definition (c-Approximate Nearest Neighbor Search)

Given a set of points P in R^d and a query point q, return a point p ∈ P whose distance to q is at most c·ρ(q, p*), where p* is the exact nearest neighbor and ρ is some distance metric.8

*For kd-trees, common heuristic algorithms are best-bin-first and randomized kd-trees.

8 H. Samet, Foundations of Multidimensional and Metric Data Structures.Morgan Kaufmann, 2006


LSH Outline

Although approximate kd-tree searches offer various solutions to the growing search-complexity issue, they lack either guarantees on exact search complexity bounds or fixed approximation factors. For this reason, we consider the LSH search method of Indyk and Motwani.9

Andoni and Indyk state a space complexity of O(n) and a query time of O(n^ρ), where ρ is a parameter of the hash function that is bounded by 1/4.

9 P. Indyk and R. Motwani, "Approximate nearest neighbors: towards removing the curse of dimensionality," in Proceedings of the thirtieth annual ACM symposium on Theory of computing, STOC '98, (New York, NY, USA), pp. 604–613, ACM, 1998


LSH

The general concept of LSH-based NN lies in the definition of a locality sensitive hash function.


LSH Hash Families

Definition (Locality Sensitive Hash Function)

A family H = {h : S → U} is (r1, r2, p1, p2)-sensitive if for any u, v ∈ S:

1. if d(u, v) ≤ r1 then Pr_H[h(u) = h(v)] ≥ p1

2. if d(u, v) > r2 then Pr_H[h(u) = h(v)] ≤ p2

For this family, ρ = log p1 / log p2.


LSH

Using the above family of hash functions, the LSH algorithm can be defined as an iterative convergence of probabilities leading to a solution to the approximate near neighbor problem.

Require: X = {x1, ..., xm}, xk ∈ R^n, a set of points
         H, a locality sensitive hash function
         D, a set of buckets
for all xk ∈ X do
    add xk to the hash bucket D[H(xk)]
end for
return H, D


Require: H, D, and a query point x̂ ∈ R^n
return D[H(x̂)]

If H is defined as splitting the vectors over a particular random plane, the results are very similar to the approximate kd-tree algorithm with depth 1.

To improve this algorithm, we must come up with better locality sensitive hash functions than simply splitting the dataset:

• select multiple planes and then intersect the returned hash buckets
• to increase the selectivity of the buckets, concatenate hashes to create longer keys and more specific buckets


An Example Hash Family


Figure: Random Projection of R^2 → R^1
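In code, this family is usually realized as h(x) = ⌊(a · x + b)/w⌋ with a Gaussian direction a and a uniform offset b, as in E2LSH; a minimal sketch under that assumption:

import numpy as np

rng = np.random.default_rng(0)

def make_hash(d, w):
    """One locality sensitive hash: project onto a random Gaussian
    direction, shift by a random offset in [0, w), cut into width-w bins."""
    a = rng.standard_normal(d)
    b = rng.uniform(0, w)
    return lambda x: int(np.floor((a @ x + b) / w))

h = make_hash(d=2, w=1.0)
print(h(np.array([0.10, 0.20])), h(np.array([0.12, 0.21])))  # nearby points usually collide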


Voronoi Tiling

An optimal partitioning in 2 dimensions is the Voronoi partitioning; with a point-location structure, search can be done in Θ(log n) time and linear space.10

Figure: Voronoi Partitioning of R^2

10 M. de Berg, M. van Kreveld, M. Overmars, et al., Computational Geometry: Algorithms and Applications. Berlin: Springer, 2000


Voronoi

• Voronoi diagrams make for very efficient hash functions in 2-d because, by definition, a point within a Voronoi region is nearest to the region's representative point.
• Voronoi regions provide an optimal solution to NN partitioning in 2-d space!
• However, for arbitrary dimension d, Voronoi diagrams require Θ(n^{d/2}) space, and no optimal point-location algorithm is known.


Lattices

Instead we will consider lattices, which provide regular space partitioning, scale to arbitrarily high-dimensional spaces, and have sub-linear nearest-center search algorithms associated with them.

Definition (Lattice in Rn)

Let v1, ..., vn be n linearly independent vectors, where vi = (v_{i,1}, v_{i,2}, ..., v_{i,n}). The lattice Λ with basis {v1, ..., vn} is the set of all integer combinations of v1, ..., vn; these integer combinations are the points of the lattice:

Λ = {z1 v1 + z2 v2 + ... + zn vn | zi ∈ Z, 1 ≤ i ≤ n} 11

11 W. C. Huffman and V. Pless, Fundamentals of Error-Correcting Codes. Cambridge University Press, 1st ed., 2003
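A small Python sketch that enumerates lattice points directly from this definition; the hexagonal basis is just an illustrative choice:

import numpy as np

def lattice_points(basis, zmax=1):
    """All integer combinations z1*v1 + ... + zn*vn with |zi| <= zmax."""
    n = len(basis)
    pts = []
    for z in np.ndindex(*([2 * zmax + 1] * n)):
        pts.append((np.array(z) - zmax) @ basis)
    return np.array(pts)

# hexagonal lattice in R^2: basis vectors (1, 0) and (1/2, sqrt(3)/2)
hexagonal = np.array([[1.0, 0.0], [0.5, np.sqrt(3) / 2]])
print(lattice_points(hexagonal))  # 9 points around the origin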


Examples in 2D

Figure: Square (left) and Hexagonal (right) Lattices in R^2


Finding The Nearest Representative in Constant Time

• Certain lattices allow us to find the nearest representative point in constant time.
• For example, for the square lattice above, the nearest point can be found by simply rounding each coordinate of our real-valued point to its nearest integer (sketched below).
• With the exception of a few exceptional lattices, more complex lattices have more complex searches (exponential as d increases 12).
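For the square (integer) lattice Z^n, that decoder is literally one rounding call per coordinate; a one-line Python sketch:

import numpy as np

def nearest_Zn(x):
    """Nearest point of Z^n: round each coordinate independently."""
    return np.rint(x)

print(nearest_Zn(np.array([0.1, 0.8, 1.3, -0.6])))  # -> [ 0.  1.  1. -1.]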

12 V. Tarokh and I. F. Blake, "Trellis complexity versus the coding gain of lattices. I," Information Theory, IEEE Transactions on, vol. 42, pp. 1796–1807, Nov. 1996


Exceptional Higher Dimensional Lattices

The previous lattices work well in R^2, but our data spaces are in general of dimension ≫ 2.

• Fortunately, there are some higher dimensional lattices with efficient nearest-center search algorithms.
• E8, or Gosset's lattice, is one such lattice in R^8.
• It is also the densest lattice packing in R^8.13


Example of decoding E8

E8 can be formed by gluing two D8 integer lattices together and shifting one by a vector of 1/2. This gluing of less dense lattices and shifting by a "glue vector" is a common theme in finding dense lattices.

• Decoding D8 is simple.
• E8 = D8 ∪ (D8 + 1/2)
• Both cosets of D8 can be computed in parallel.
• D8's decoding algorithm consists of rounding all values to their nearest integer s.t. they sum to an even number.


Example of decoding E8

Define f(x) and g(x) to round the components of x, except that in g(x) we round the value furthest from an integer in the wrong direction. Let

x = ⟨0.1, 0.1, 0.8, 1.3, 2.2, −0.6, −0.7, 0.9⟩

then

f(x) = ⟨0, 0, 1, 1, 2, −1, −1, 1⟩, sum = 3

and

g(x) = ⟨0, 0, 1, 1, 2, 0, −1, 1⟩, sum = 4

Since the sum of g(x) is even, g(x) is the nearest lattice point in D8.


Example of decoding E8 cont.

• However, we want the nearest point in E8, so we also have to include the coset D8 + 1/2.
• We can do this by subtracting 1/2 from all the values of x:

f(x − 1/2) = ⟨0, 0, 0, 1, 2, −1, −1, 0⟩, sum = 1

g(x − 1/2) = ⟨−1, 0, 0, 1, 2, −1, −1, 0⟩, sum = 0

Now we find the coset representative that is closest to x using a simple distance metric:

‖x − g(x)‖² = 0.65

‖x − (g(x − 1/2) + 1/2)‖² = 0.95

So in this case the nearest E8 point is the first coset's representative:

⟨0, 0, 1, 1, 2, 0, −1, 1⟩
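A Python sketch of the whole procedure, reproducing the worked example above (decode_D8 and decode_E8 are our names; tie-breaking when a coordinate lands exactly on an integer is simplified):

import numpy as np

def decode_D8(x):
    """Round every coordinate (f); if the sum is odd, re-round the coordinate
    farthest from an integer in the wrong direction (g)."""
    f = np.rint(x)
    if int(f.sum()) % 2 == 0:
        return f
    err = x - f
    i = np.argmax(np.abs(err))        # coordinate farthest from an integer
    g = f.copy()
    g[i] += np.sign(err[i]) if err[i] != 0 else 1.0
    return g

def decode_E8(x):
    """Decode both cosets, D8 and D8 + 1/2, and keep the closer candidate."""
    c0 = decode_D8(x)
    c1 = decode_D8(x - 0.5) + 0.5
    return c0 if np.sum((x - c0) ** 2) <= np.sum((x - c1) ** 2) else c1

x = np.array([0.1, 0.1, 0.8, 1.3, 2.2, -0.6, -0.7, 0.9])
print(decode_E8(x))  # -> [ 0.  0.  1.  1.  2.  0. -1.  1.], as in the example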


Leech Lattice

By gluing sets of E8 together in a way originally conceived in Curtis's MOG, we can get an even higher dimensional dense lattice called the Leech lattice.


Leech Lattice

Here we will state some attributes of the Leech lattice, as well as give a comparison to other lattices by way of Eb/N0 and the computational cost of decoding. Some important attributes:

• Densest regular lattice packing in R^24
• Lattice construction can be based on 2 cosets of G24
• Sphere packing density: π^12 / 12! ≈ 0.00192957
• Kmin = 196560


Leech Lattice

[Plot: probability of bit error (log10 Pb, −5 to −1) vs. Eb/N0 (dB, 0–25) over 3,072,000 samples, comparing Leech hex (−0.2 dB) with QAM16, E8 with 2PSK, unencoded QAM64, and unencoded QAM16.]

Figure: Performance of Some Coded and Unencoded Data Transmission Schemes


Leech Lattice Decoding

Some information about the decoding algorithm:

• The decoding of the Leech lattice is based closely on the decoding of the Golay code.
• In general, advances in either Leech decoding or binary Golay decoding imply an advance in the other.
• The decoding method used in this implementation is based on Amrani and Be'ery's '96 publication 14 for decoding the Leech lattice; it consists of around 519 floating point operations and suffers a gain loss of only 0.2 dB.
• In general, decoding complexity scales exponentially with dimension.15

Next is an outline of the decoding process.

14 O. Amrani, Y. Be'ery, A. Vardy, F.-W. Sun, and H. C. A. van Tilborg, "The Leech lattice and the Golay code: bounded-distance decoding and multilevel constructions," Information Theory, IEEE Transactions on, vol. 40, pp. 1030–1043, Jul. 1994

15 V. Tarokh and I. F. Blake, "Trellis complexity versus the coding gain of lattices. I," Information Theory, IEEE Transactions on, vol. 42, pp. 1796–1807, Nov. 1996


Leech Decoder

[Block diagram: 24 received reals feed two parallel QAM decoders (A points and B points); block-confidence values are formed for even and odd representatives; each branch is minimized over H6 and then checked with H-parity and K-parity for the even and odd cosets; finally the candidate codewords are compared.]

Figure: Leech Lattice Decoder


CUDA

CUDA is a programming/computational platform for performing general purpose processing on GPU hardware. CUDA is implemented as an extension of C, exposing various parallelization and memory-allocation primitives. The advantages of GPGPU computing can be seen most easily in the picture on the next slide.


Advantages of CUDA in a Picture

[Diagram: a CPU multiprocessor spends most of its transistors on control logic and cache serving a few ALUs; a GPU streaming processor spends most of its transistors on large arrays of ALUs with small per-group control and cache blocks, each chip backed by DRAM.]

Figure: Transistor Allocation for CPU vs. GPU Architectures


CUDA doesn’t come without a cost

So what do we have to pay for all of these processors and the overall higher ALU density?

• We have to adapt or rewrite programs into CUDA
• Memory latency, no caching* (feeding lots of little beasts)
• Memory size per core
• Synchronization, communication, and the standard parallel programming costs


Parallel Decoder Outline

• Naive CUDA Implementation
• Improved CUDA Implementation
• CUDA Pitfalls
• Some Performance Tweaks


Preprocessing

Require: X = {x1, ..., xm}, xk ∈ R^n
 1: U(x) is a universal hash function
 2: hk(x) ∈ H is (r, cr, p1, p2)-sensitive
 3: choose l s.t.
 4: G ← i ∈ Z^l from [0, n)
 5: g(x) = {h1(x), ..., hj(x)}
 6: D ← []
 7: for all xk ∈ X do
 8:   D ← U(g(xk))
 9: end for
10: return G, D


Query: c-approx k-nearest neighbor

Require: x̂ ∈ Rn, G,H,U(),D1: for all gj(x) ∈ G do2: U(gj(x̂))3: end for4: K = []5: for all l ∈ L do6: d ← dist(x̂ ,D[l ])7: K ← {d ,D[l ]}8: end for9: sort(K )

10: return K [0 : k]
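A self-contained Python sketch of both phases. The concatenated random-projection hash g(x) stands in for the Leech decoder used in the actual implementation, and Python's dict hashing stands in for U; all names here are illustrative:

import numpy as np
from collections import defaultdict

rng = np.random.default_rng(1)

def make_g(d, j, w=1.0):
    """g(x) = (h1(x), ..., hj(x)): j concatenated projection hashes."""
    A = rng.standard_normal((j, d))
    b = rng.uniform(0, w, size=j)
    return lambda x: tuple(np.floor((A @ x + b) / w).astype(int))

def preprocess(X, l, j):
    """Build l hash tables; bucket every point under g(x) in each table."""
    G = [make_g(X.shape[1], j) for _ in range(l)]
    D = [defaultdict(list) for _ in range(l)]
    for x in X:
        for g, table in zip(G, D):
            table[g(x)].append(x)
    return G, D

def query(xq, G, D, k):
    """Gather candidates from the matching bucket of every table,
    rank them by true distance, and return the k best."""
    cands = [x for g, table in zip(G, D) for x in table[g(xq)]]
    cands.sort(key=lambda x: np.linalg.norm(xq - x))
    return cands[:k]

X = rng.standard_normal((1000, 16))
G, D = preprocess(X, l=5, j=4)
print(query(X[0] + 0.01, G, D, k=3)[0][:4])  # nearest candidate ~ X[0]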


An addition to the Standard Leech Lattice LSH Algorithm

• As suggested in Panigrahy 16, minimize database size (to Θ(n)) by delaying replication to the query phase.
• In addition, we use a suggestion of Jegou 17 for exact c-nearest neighbor search with E8: we order our 2L exact-search list by the lattice-center weights returned by the Leech lattice algorithm.

16 R. Panigrahy, "Entropy based nearest neighbor search in high dimensions," in Proceedings of the seventeenth annual ACM-SIAM symposium on Discrete algorithms, SODA '06, (New York, NY, USA), pp. 1186–1195, ACM, 2006


SIFT

Scale-invariant feature transform (SIFT) is an image transform, patented by David Lowe of the University of British Columbia, for identifying local features in an image that are invariant to scale, rotation, affine distortion, and, to some extent, lighting transformations.

SIFT maps an image to a set of feature vectors by first finding keypoints and then applying various rotational and scale transformations so that the resulting feature point is scale and rotation invariant.

These features will serve as our database vectors; a query image will be SIFT'd and its features searched in the database using the lattice LSH search algorithm.


SIFT Example

Figure: Top image is a scene and the bottom contains the SIFT image matches


Results

Tests were performed on real-world SIFT vectors formed from the Caltech256 18 database. The images in the image set were converted to their SIFT-vector equivalents and an LSH database was created from the SIFT vectors, relating them back to images. Random images were chosen and searched with our parallel LSH algorithm, as well as with an exhaustive search for nearest neighbors based on the default algorithm provided with the SIFT library.

18 G. Griffin, A. Holub, and P. Perona, “Caltech-256 object category dataset,”Tech. Rep. 7694, 2007


Query Time as a Ratio of Parallel to Total Time

The average speedup between parallel and sequential decoding is approximately 41x.

[Plot: ratio of parallel to total compute time (0.88–0.94) over 300 SIFT query samples; average ratio 0.9340, R² = 2.83e−05.]

Figure: Ratio of Parallel to Total Time for Queries on an LSH Search Database using Real World SIFT Sample Data


Extrapolated Lowe's Linear vs. Parallel LSH Search

Below is a semilog comparison of linear vs. parallel LSH search. Extrapolation is required due to the rapidly increasing cost of linear search.

[Semilog plot: average query time (10⁻² to 10⁴, log scale) vs. number of images (0–100,000) for extrapolated linear search and extrapolated LSH.]

Figure: Extrapolated Average Search Query Times


Extrapolated Lowe's Linear vs. Parallel LSH Search

[Plot: number of matches (0–250) per sample image for linear search and LSH with l = 1, 5, and 10.]

Figure: Recall Accuracy of Random SIFT Vector Image Query


Effects of our Addition 2 and Selectivity

[Plot: NN recall (0.0–1.0) vs. selectivity (10⁻⁵ to 10⁻³, log scale) for distance-sorted and unsorted LSH.]

Figure: Data Recall Selectivity for Query Adaptive and Unranked LSH


Conclusions

• Our primary conclusion is that parallel decoding is a useful method for decoding the Leech lattice; however, in this implementation its utility does not scale to arbitrarily large databases, due to the sequential bottleneck.
• On a more positive note, the query-adaptive implementation yields a considerable benefit in selectivity over the original method.


Future Directions

Although the applications of approximate nearest neighbor search are manifold, offering a wide variety of applications similar to those above, we will instead focus on directions that better accommodate this system's strengths while avoiding its weaknesses.

• Adaptive Mean Shift
• Networked Distributed KNN Search for Heterogeneous Large Scale Problems
• Finding more rigorous correlations between ECCs and locality sensitive hash functions with regard to Θ(n^ρ).


Future Directions Adaptive Mean Shift

Without explicitly defining the functions involved in mean shift and the variant adaptive mean shift, we simply show the gradient update step below, defining N(x) as the set of nearest neighbors of x.

x ← ( ∑_{xi ∈ N(x)} K(x − xi) xi ) / ( ∑_{xi ∈ N(x)} K(x − xi) )

A simple extension would be to use LSH-KNN at this step; furthermore, the approximation factor c may even be somewhat beneficial on real-world data, as it can add a useful soft margin to the clustering.
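A minimal Python sketch of one such update, assuming a Gaussian kernel K with bandwidth h and taking the neighbor set N(x) as given (e.g. as an LSH-KNN result):

import numpy as np

def mean_shift_step(x, neighbors, h=1.0):
    """One gradient update: kernel-weighted mean of the neighbors of x."""
    d2 = np.sum((neighbors - x) ** 2, axis=1)
    K = np.exp(-d2 / (2 * h ** 2))               # kernel weights K(x - xi)
    return (K[:, None] * neighbors).sum(axis=0) / K.sum()

pts = np.array([[0.0, 0.0], [0.2, 0.1], [5.0, 5.0]])
print(mean_shift_step(np.array([0.1, 0.1]), pts))  # pulled toward the dense pair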


Distributed KNN ...

• This extension may very well serve as a jumping-off point into parallel exascale computing for my further research.
• A direction I would like to pursue is to use the universally hashed, locality sensitive hash values for nearest neighbor vectors, and fold them into a search across heterogeneous systems using the same hashing scheme, supporting a query architecture similar to internet routing schemes such as RIP or the more scalable OSPF (open shortest path first) algorithm.
• As an added bonus, there seems to be a variety of tunable attributes:
  • space-time tradeoff
  • actual network architecture versus algorithmic architecture and optimizations


Closing Notes

Thank you everyone for your time! Complete references, as well as slides, source code, and the original thesis, are freely available at homepages.uc.edu/~carrahle


O. Amrani, Y. Be'ery, A. Vardy, F.-W. Sun, and H. C. A. van Tilborg, "The Leech lattice and the Golay code: bounded-distance decoding and multilevel constructions," Information Theory, IEEE Transactions on, vol. 40, pp. 1030–1043, Jul. 1994.

A. Andoni and P. Indyk, "Near-optimal hashing algorithms for approximate nearest neighbor in high dimensions," in Foundations of Computer Science, 2006. FOCS '06. 47th Annual IEEE Symposium on, pp. 459–468, Oct. 2006.

H. Samet, Foundations of Multidimensional and Metric Data Structures. Morgan Kaufmann, 2006.

A. Rajaraman and J. D. Ullman, Mining of Massive Datasets. Cambridge University Press, 2011.

K. Beyer, J. Goldstein, R. Ramakrishnan, and U. Shaft, "When is 'nearest neighbor' meaningful?," in Int. Conf. on Database Theory, pp. 217–235, 1999.

P. Indyk and R. Motwani, "Approximate nearest neighbors: towards removing the curse of dimensionality," in Proceedings of the thirtieth annual ACM symposium on Theory of computing, STOC '98, (New York, NY, USA), pp. 604–613, ACM, 1998.

M. de Berg, M. van Kreveld, M. Overmars, et al., Computational Geometry: Algorithms and Applications. Berlin: Springer, 2000.

W. C. Huffman and V. Pless, Fundamentals of Error-Correcting Codes. Cambridge University Press, 1st ed., 2003.

V. Tarokh and I. F. Blake, "Trellis complexity versus the coding gain of lattices. I," Information Theory, IEEE Transactions on, vol. 42, pp. 1796–1807, Nov. 1996.


R. Panigrahy, "Entropy based nearest neighbor search in high dimensions," in Proceedings of the seventeenth annual ACM-SIAM symposium on Discrete algorithms, SODA '06, (New York, NY, USA), pp. 1186–1195, ACM, 2006.

G. Griffin, A. Holub, and P. Perona, “Caltech-256 object categorydataset,” Tech. Rep. 7694, 2007.