
    A locality-sensitive hash for real vectors

    Tyler Neylon

    Abstract

We present a simple and practical algorithm for the c-approximate near neighbor problem (cNN): given n points P ⊂ ℝ^d and radius R, build a data structure which, given q ∈ ℝ^d, can with probability 1 - δ return a point p' ∈ P with dist(p', q) ≤ cR if there is any p ∈ P with dist(p, q) ≤ R. For c = d + 1, our algorithm deterministically (δ = 0) preprocesses in time O(nd log d), space O(dn), and answers queries in expected time O(d²); this is the first known algorithm to deterministically guarantee an O(d)-NN solution in constant time with respect to n for all ℓ_p metrics. A probabilistic version empirically achieves useful c values (c < 2) where c appears to grow minimally as d → ∞. A query time of O(d log d) is available, providing slightly less accuracy. These techniques can also be used to approximately find (pointers between) all pairs x, y ∈ P with dist(x, y) ≤ R in time O(nd log d).

The key to the algorithm is a locality-sensitive hash: a mapping h : ℝ^d → U with the property that h(x) = h(y) is much more likely for nearby x, y. We introduce a somewhat regular simplex which tessellates ℝ^d, and efficiently hash each point in any simplex of this tessellation to all d + 1 corners; any points in neighboring cells will be hashed to a shared corner and noticed as nearby points. This method is completely independent of dimension reduction, so that additional space and time savings are available by first reducing all input vectors.

    1 Introduction.

In this paper, we focus on a practical variation of the traditional nearest neighbor search problem (NNS): given points P ⊂ ℝ^d and a query point q ∈ ℝ^d, find p ∈ P which minimizes dist(p, q). The usual approach is to preprocess P so that a later query can quickly retrieve the closest point.

Although much work has been done to efficiently solve NNS, the known non-probabilistic approaches thus far are not much more efficient than a brute-force linear search over P, except in small dimensions d (see [15] for approaches in small d). Therefore, practical algorithms have been developed which can efficiently solve probabilistic approximate versions of NNS. These less exact variations still support many applications, including data mining, information retrieval, media databases (such as images or video), pattern recognition, and duplicate detection. For example, a video database may want to avoid the insertion of videos which are slightly altered versions of existing data, and may do so by solving an approximate version, such as cNN (defined below), on real vector representations of the videos. In this case, we expect a near-duplicate video q to be many times closer to a particular point in P than any other, so a fast approximate algorithm is very useful.

The authors of [11] show how to reduce a probabilistic approximate version of NNS to a version in which the allowed distances between p and q are restricted by a fixed parameter R. In particular, let us define the c-near neighbor problem (cNN) as follows:

Definition 1.1. Given points P ⊂ ℝ^d, radius R > 0, and probability tolerance δ > 0, preprocess P so that for any q ∈ ℝ^d, we can, with probability at least 1 - δ, find p' ∈ P with dist(p', q) < cR whenever there is a p ∈ P with dist(p, q) < R.

The value c ≥ 1 is the approximation factor; ideally we can solve this for c = 1 + ε, where both error terms ε and δ are small. The exact value of R is immaterial, since input points can easily be scaled to match the fixed radius of any particular cNN implementation.

The work in [11] also introduced the idea of a locality-sensitive hash: a hash function h with the property that h(x) = h(y) is more likely for nearby points x, y. A good locality-sensitive hash can serve as a basis for solving cNN, and thus for approximate approaches to NNS as well. Since [11], a number of such hashes have been proposed for use on various point sets (including strings [13] and families of subsets [3]), and in various metrics (cf. [7], [2], [8]).

In this paper, we focus on solving cNN in ℝ^d under the ℓ_p norm for p ∈ [1, ∞] (equation (2.6)), with emphasis on ℓ_2 (equation (2.12)). Previous work has centered around reducing the time complexity dependence on n = |P|. However, locality-sensitive hashes more directly solve cNN than variants of NNS, and lend themselves to approaches more sensitive to dimension d, accuracy c, and certainty 1 - δ than to speed in terms of n. We propose that any advantages previously conveyed by smaller theoretical exponents on n can be expressed in the form of better speed, probability of success, and approximation factors within this framework.

1.1 Anatomy of a locality-sensitive hash A typical locality-sensitive hashing scheme on ℝ^d consists of three components: dimension reduction, the locality hash, and amplification. It is widely utilized that a random projection of n points in ℝ^d to a smaller space ℝ^t, t = O(log n), is likely to preserve interpoint distances ([7], [12]). Thus we can apply a locality hash to a reduced vector with good results.

In addition, most locality hashes improve with some form of amplification: a repetition of some portion of the hashing scheme which improves the accuracy. Examples include using multiple hash tables with slightly different locality hash functions, or performing multiple queries via random points close to the given q ∈ ℝ^d ([14]).

The locality hash we present in §2 is independent of either dimension reduction or amplification, so that both of these techniques can be used to augment the base performance. This paper focuses on the locality hash itself; we do not directly address the question of how to optimally combine the hash with these techniques.

1.2 Our contributions We present two versions of a simple and general algorithm based on a tessellation of ℝ^d by simplices (a simplex is the d-dimensional generalization of a triangle: the convex hull of d + 1 points in general position). In each case, any given point is hashed to all d + 1 corners of the cell (one simplex) in the tessellation which contains the point. In this way, we can be sure to match any two points in neighboring cells of the tessellation.

By using simplices instead of hypercubes, we can work with O(d) corners instead of O(2^d); thus we confront the curse of dimensionality. We can achieve storage in time O(d log d) per point, and lookup in expected time O(d²), or O(d log d) at some accuracy cost, as explained in §2.4.

Our main theoretical result (in §2) is to show that one version of our hash deterministically (δ = 0) solves cNN with c = 2d in the ℓ_p norm for any p ∈ [1, ∞], and that the other version gives an approximation factor c ≤ d + 1 for ℓ_2. Previous algorithms ([4]) can solve cNN in query time independent of n, with c = O(d) in ℓ_1, and in select other ℓ_p metrics via embedding ([5], [9], [10]). To our knowledge, this is the first proven to achieve c = O(d) for all ℓ_p, p ∈ [1, ∞].

We also give empirical evidence (in §3) that the algorithm is practical as a probabilistic approach to cNN for smaller c on a wide range of dimensions. For example, using artificial data, we approximate the quantity ĉ_{.1}, an estimated upper bound on c for δ = 0.1 (90% confidence). In table 1, we see that this algorithm achieves ĉ_{.1} = 1.6 in ℝ^300 with at least 90% confidence, without dimension reduction.

hash            | d = 10  d = 100  d = 300 | d = 10    d = 20
                | Lt = 5  Lt = 5   Lt = 5  | Lt = d+1  Lt = d+1
here            | 1.6     1.7      1.6     | 1.5       1.5
p-stable [7]    | 5.6     4.7      5.1     | 3.9       2.8
sphere grid [2] | 6.0     5.1      5.1     | 4.1       3.1
unary [8]       | 8.3     6.0      5.6     | 4.4       3.3

Table 1: Empirical estimates of ĉ_{.1} = D_{.05}/D_{.95}, further explained in §3.

    2 The algorithm

In this section, we describe a general class of hash algorithms which provide guarantees of finding close enough neighbors while excluding those sufficiently far apart. Next we describe a particular version of this algorithm, which we refer to as the orthogonal version, which is easy to implement and analyze. Finally we present an improved version, the vertex-transitive version, along with some thoughts towards improved time and space complexity.

2.1 The general method In this section, we suppose that we have a partition P̂ which is a family of subsets of ℝ^d with the following properties: (i) each element c ∈ P̂, called a cell, is the convex hull of finitely many points; (ii) ∪P̂ = ℝ^d, while the intersection of any two cells has zero measure; and (iii) any bounded region contains finitely many cells. The general framework of our algorithm, described next, works for any such partition.

The partition algorithm Based on such a partition, we are now ready to define a general locality-sensitive hashing algorithm. The idea is to build a map from each corner of the partition to a list of all data points in adjacent cells. Conceptually, start with a fast-lookup mapping (such as a hash map) from every corner of every cell to an empty linked list (sparse, created lazily). For a point x ∈ ℝ^d, choose a cell c ∈ P̂ containing x, and append (a pointer to) x to the list mapped from each corner of c. This process can be repeated for an arbitrary number of points, at each step augmenting the lists mapped to by each corner. We say that two points x, y collide in this hash if any corner point maps to both x and y; we write this as x ∼ y. This general algorithm is also outlined in the pseudocode in figure 1.

    In practice, the result of a query will be a set oflists of points; it may suffice to simply work with thefirst point found, avoiding the time required to parse allthe points.

An advantage of this algorithm is that it deterministically guarantees that all close enough points are considered nearby, while all far enough are not.


from collections import defaultdict

def locality_hashes(x):
    c = cell_containing(x)
    return corners(c)

def preprocess_points(P):
    index = defaultdict(list)  # an empty mapping: corner -> list of points
    for p in P:
        for h in locality_hashes(p):
            index[h].append(p)
    return index

def lookup_query(index, q):
    neighbors = []  # an empty list
    for h in locality_hashes(q):
        neighbors.append(index.get(h, []))
    return neighbors

Figure 1: The general partition-based LSH algorithm
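To make figure 1 concrete, here is a minimal sketch (ours, not from the paper) of the two partition-specific helpers, instantiated with the naive hypercube partition criticized just below; the partitions of §2.2 and §2.3 supply O(d)-corner replacements for these helpers.

import itertools
import math

def cell_containing(x):
    # The unit hypercube whose lowest corner is the coordinatewise floor of x.
    return tuple(math.floor(xi) for xi in x)

def corners(cell):
    # All 2^d integer corners of a hypercube cell: exponential in d.
    d = len(cell)
    return [tuple(ci + bi for ci, bi in zip(cell, bits))
            for bits in itertools.product((0, 1), repeat=d)]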

The challenge from here is to preserve speed by choosing a good partition. A naive choice of P̂, such as all unit hypercubes cornered at integer points, can result in many corners (2^d) per cell, which in turn requires running time exponential in d. Accordingly, we will now consider how our choice of P̂ affects the performance of this algorithm.

Time and space complexity The resources used by this algorithm are dominated by the calls to locality_hashes. Suppose one call to locality_hashes takes time τ and returns an object of size σ = O(τ). Our preprocessing of n points then takes O(nτ) time and O(nσ) space. If we do not bother to traverse the neighbors list and simply return pointers to these lists, then a single lookup also takes time O(τ).

When we fill in the details below (§2.4), we'll see that we can achieve τ = d log d and σ = d, where d is the dimensionality of our points. Unfortunately, verifying the identity of each hash key requires O(d) time, so that queries (but not storage) are actually O(d²). If we're willing to probabilistically verify hash keys at the cost of some false collisions, we can still achieve O(d log d) lookups.

Accuracy Next we'd like to give some useful characterizations concerning which pairs of points are, or are not, considered nearby neighbors by this algorithm. Our main goal will be to discover, for a given partition P̂, the optimal values D0 and D1 such that

    dist(x, y) > D0 ⟹ x ≁ y, and
    dist(x, y) < D1 ⟹ x ∼ y.

This algorithm solves cNN where c = D0/D1 and δ = 0. The notation here is inspired by the idea, expanded below, that D_p, for p ∈ (0, 1), can represent the distance at which

    dist(x, y) = D_p ⟹ P(x ∼ y) = p

in a certain probabilistic context.

We'll work with a particular type of partition which is more conducive to analysis. A sliced partition is a division of ℝ^d into cells created by cutting the space along families of parallel hyperplanes, so that every cell is bounded by exactly two hyperplanes in each direction. We further require that the hyperplanes be oriented in finitely many directions, and that every bounded subset of ℝ^d intersects finitely many of the hyperplanes. It is not hard to verify that a sliced partition is a special case of our general partition described above.

As a quick example, the hypercube partition, in which we cut ℝ^d = {x = (x1, x2, ..., xd)} along every hyperplane xi = j for i ∈ [d], j ∈ ℤ, is a canonical example of a sliced partition (notation: [d] := {1, 2, ..., d}). We can effectively define D1 = D1(P̂) as the quantity

(2.1)    D1 := inf{dist(x, y) : x ≁ y; x, y ∈ ℝ^d}.

Let c_x ∼ c_y indicate that cells c_x and c_y share a boundary point. Toward characterizing D1 in terms of the cells of P̂, we'll say that points x, y are barely neighbors when x ∼ y yet there are cells c_x, c_y with x ∈ c_x, y ∈ c_y, and c_x ≁ c_y. This may occur when x, y are in the boundary of neighboring cells, or the same cell, and this boundary is shared with two cells which are not neighbors of each other.

Lemma 2.1. Given any points x ≁ y, there are barely neighboring points x' ∼ y' between x and y along the line segment xy.

We defer this proof, and others in this section, to the appendix. This last lemma clarifies that

(2.2)    D1 = inf{dist(x, y) : x, y are barely neighbors},

so that, in finding D1, we may focus exclusively on such point pairs. Combining this fact with the next lemma will allow us to gain a powerful tool for extracting D1 from the shape of the cells in any sliced partition.

Let corners(c) denote the minimal set of points whose convex hull is the cell c. We will say that points x, y cross a cell c iff x, y ∈ c and there are disjoint subsets C_x, C_y ⊂ corners(c) with x ∈ convex(C_x) and y ∈ convex(C_y). Intuitively, we can think of this notion as stipulating that any edge path from x to y must contain at least one full edge.


Figure 2: The orthogonal partition O in ℝ²

Lemma 2.2. Suppose x, y are points in a cell c in a sliced partition P̂. Then x, y are barely neighbors ⟺ x, y cross c.

Lemma 2.1 allowed us to restrict the definition (2.1) of D1 to the reduced form of (2.2). With lemma 2.2, we can move one step further, in sliced partitions, to

(2.3)    D1 = inf{dist(x, y) : x, y cross some cell c}.

In particular, if all the cells of a sliced partition P̂ are isometric, then D1 is exactly the minimum distance of any points which cross the canonical cell. We can now formally define D0 = D0(P̂) := sup{dist(x, y) : x ∼ y; x, y ∈ ℝ^d}.

    The next result summarizes the main points we willuse from this section.

Property 2.3. If P̂ is a sliced partition with isometric cells, then D1 = min{dist(x, y) : x, y cross c}, where c is the canonical cell. Moreover, if there exist colinear points x, y, z such that xy is the diameter of one cell, and yz the diameter of another, then D0 = 2 max{dist(x, y) : x, y ∈ c} = 2 diam(c).

Applying this result to hypercubes, we see that D0 = 2√d and D1 = 1 in ℓ_2. But each hypercube has 2^d corners, which would lead to exponential time complexity. The next two sections are devoted to finding the values of D0, D1 for two sliced partitions whose canonical cell is the convex hull of d + 1 corners, allowing for much faster algorithms.

2.2 The orthogonal partition O We define the orthogonal partition, denoted by O, as the sliced partition given by the slices {xi = z : z ∈ ℤ} for i ∈ [d], and {xi - xj = z : z ∈ ℤ} for i ≠ j ∈ [d], in ℝ^d = {x : x = (x1, x2, ..., xd)}. We will see below that all the cells in this partition are isometric, so that we can determine D0 and D1 from a single cell, using property 2.3.

To begin, notice that this partition is a refinement of the hypercube partition; it contains strictly more slices. Also notice that every integer-cornered hypercube is sliced in the same way, so we may conveniently focus on the partition of [0, 1]^d. Given any permutation π : [d] → [d], we can define the set S_π := {x : 0 ≤ x_{π(1)} ≤ x_{π(2)} ≤ ... ≤ x_{π(d)} ≤ 1}. Each inequality defines a halfspace; since S_π is a bounded, nondegenerate intersection of d + 1 halfspaces in ℝ^d, it is a simplex.

For any permutations π1 ≠ π2, the set S_{π1} ∩ S_{π2} has measure zero. It is also clear that ∪_π S_π = [0, 1]^d. These simplices are precisely the cells in [0, 1]^d of the orthogonal partition.

We also note that, for any permutations π1 ≠ π2, the mapping y = f(x) on ℝ^d which follows y_{π2(i)} = x_{π1(i)} is an isometry (in any ℓ_p) mapping S_{π1} → S_{π2}. Thus all d! cells in the hypercube are isometric, and since the same decomposition of the cube is repeated throughout space, all cells of the entire partition are isometric.

Hence we can focus on a particular cell to study. We'll choose S_id, where id is the identity permutation, so that S_id is the cell for which 0 ≤ x1 ≤ x2 ≤ ... ≤ xd ≤ 1. This cell is cornered by the points f_i := (0, 0, ..., 0, 1, 1, ..., 1), with d - i zeros followed by i ones, for i = 0, ..., d.

Whenever a ≤ b ≤ c, we have ⟨f_c, f_a⟩ = ⟨f_b, f_a⟩, so that ⟨f_c - f_b, f_a⟩ = 0. Now consider any triangle of corners f_i, f_j, f_k with i < j < k. Then ⟨f_k - f_j, f_j - f_i⟩ = ⟨f_k - f_j, f_j⟩ - ⟨f_k - f_j, f_i⟩ = 0 - 0 = 0. Every boundary triangle in every cell of this partition is a right triangle; this is why we call it the orthogonal partition.

D1(O) Next we'll find the value of D1(O) in any ℓ_p norm. To begin, we'll need the following elementary

Lemma 2.4. If (x_i)_{i=1..d}, (y_i)_{i=1..d} are nondecreasing sequences with min(x_i, y_i) ≤ max(x_{i-1}, y_{i-1}) for each i, then

    ||x - y||_1 ≥ max(x_d, y_d) - min(x_1, y_1).

Proof. Let a_i = min(x_i, y_i) and b_i = max(x_i, y_i). Then ||x - y||_1 = Σ_{i=1..d} (b_i - a_i) ≥ Σ_{i=2..d} (b_i - b_{i-1}) + (b_1 - a_1) = b_d - a_1.

We're now ready for the main result of this section,

Theorem 2.5. D1(O) = d^{1/p - 1} in ℓ_p.

Proof. We'll begin by showing that

(2.4)    D1 ≥ 1 in ℓ_1.

Since ||x||_p ≥ d^{1/p - 1}·||x||_1 for any x ∈ ℝ^d, this will give us

(2.5)    D1 ≥ d^{1/p - 1} in ℓ_p.

Suppose that x, y cross S_id. Then there are disjoint corner sets F_x, F_y ⊂ {f_0, f_1, ..., f_d} whose convex hulls contain x and y, respectively. We can see that each point can also be viewed as a nondecreasing sequence, and that x_i < x_{i+1} ⟹ y_i = y_{i+1} and y_i < y_{i+1} ⟹ x_i = x_{i+1}, since, e.g., x_i < x_{i+1} ⟹ f_{d-i} ∈ F_x, so that z_i = z_{i+1} for any point z ∈ convex(F_y), including y. Thus our points x, y meet the conditions of lemma 2.4 (at each step, at least one of the two sequences is constant, so min(x_i, y_i) ≤ max(x_{i-1}, y_{i-1})). Since x, y cross S_id, we must have min(x_1, y_1) = 0 and max(x_d, y_d) = 1, and thus ||x - y||_1 ≥ 1. By property 2.3, this gives us (2.4).

It remains to be seen that we can find crossing points x, y ∈ S_id which actually achieve the lower bound given by (2.5). To do so, let x = (1/d)·(1, 1, 3, 3, 5, 5, ..., 2⌈d/2⌉ - 1) and y = (1/d)·(0, 2, 2, 4, 4, ..., 2⌊d/2⌋). Then x ∈ convex(F_x), where f_i ∈ F_x iff the parity of i matches that of d, and y ∈ convex(F_y), with F_y the complement of F_x. So x, y do indeed cross S_id. We also have x - y = (1/d)·(1, -1, 1, -1, ..., ±1), and ||x - y||_p = d^{1/p - 1}, which completes the proof.

D0(O) Each cell in [0, 1]^d contains the line segment from (0, ..., 0) to (1, ..., 1), which is a diameter of the hypercube, and hence also of each cell, in any ℓ_p norm. We can also clearly see that this diameter does occur in a colinear fashion as required by property 2.3, so that D0(O) = 2d^{1/p} in ℓ_p.

We can summarize these results as (1/2)·D0(O) = 1 in ℓ_∞ and D1(O) = 1 in ℓ_1.

Recall that D0/D1 gives us the value of the approximation factor c for the non-probabilistic version of cNN. We can summarize our findings thus far by the surprisingly simple equation

(2.6)    c = D0/D1 = 2d in ℓ_p, for any p ∈ [1, ∞],

for the orthogonal partition.

So far, we've examined the theoretical deterministic (δ = 0) bounds provided by the orthogonal partition. In order to turn this into actual code, we still need to check how feasible it is to compute the corners of a cell containing any given point. We do this in the next section.

Computing locality_hashes(x) in the orthogonal partition In this section we suppose there is a point x ∈ ℝ^d for which we want to compute locality_hashes(x). We do this by implicitly finding c_x = cell(x), the cell in O containing x, and then finding the set corners(c_x), which is the ultimate output of locality_hashes needed for the overall locality-sensitive hash.

Any point x = (x1, x2, ..., xd) ∈ ℝ^d can be decomposed as x = x_int + x_frac, where x_int = (⌊x1⌋, ⌊x2⌋, ..., ⌊xd⌋). Since every (ℤ^d-cornered) hypercube looks the same in O, the corners of c_x are exactly x_int + corners(cell(x_frac)). Thus it suffices to find the corners of cell(x) for any x ∈ [0, 1]^d.

import math

def locality_hashes(x):
    d = len(x)
    x_int = [math.floor(xi) for xi in x]  # one corner of the cell
    a_corner = list(x_int)
    corners = [tuple(a_corner)]
    # pi sorts the fractional parts x - x_int into ascending order.
    pi = sorted(range(d), key=lambda i: x[i] - x_int[i])
    for i in range(1, d + 1):
        a_corner[pi[d - i]] += 1  # add e_{pi(d)}, then e_{pi(d-1)}, ...
        corners.append(tuple(a_corner))
    return corners

Figure 3: The orthogonal partition's algorithm

Figure 4: A rotated snapshot of V in ℝ²: a grid of equilateral triangles

Suppose x ∈ S_π for some permutation π : [d] → [d]. This means that

(2.7)    x_{π(1)} ≤ x_{π(2)} ≤ ... ≤ x_{π(d)}.

Our job is to find this permutation based on the value of x. In other words, we simply have to sort the coordinates of x, where we think of sorting x as finding the permutation π which puts the coordinates in ascending order.

Once we know π, the corners are those of [0, 1]^d which follow (2.7). Let e_i denote the canonical unit basis vector with a 1 in coordinate i and zeros elsewhere. Then

    corners(S_π) = {(0, ..., 0), e_{π(d)}, e_{π(d)} + e_{π(d-1)}, ..., (1, ..., 1)}.

The pseudocode in figure 3 computes locality_hashes(x) for a point x given as a list of coordinates.
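As a small usage example (ours, with made-up input points), figure 3's routine returns d + 1 corners, and two points in the same or neighboring cells share at least one:

x = [0.20, 0.70, 0.40]
y = [0.25, 0.68, 0.42]
hx, hy = locality_hashes(x), locality_hashes(y)
print(len(hx))            # 4, i.e., d + 1 corners
print(set(hx) & set(hy))  # nonempty, so x ~ y under this hash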

2.3 The vertex-transitive partition V In figure 2, we saw the ℝ² version of partition O, in which each cell is an isosceles right triangle. Intuitively, one may be tempted to improve this partition by searching for even more regular cells, since this would seem to give a lower D0/D1 ratio. The next partition we discuss achieves this type of improvement.

We can define the vertex-transitive partition, denoted V, as a particular linear transformation of O. For any dimension d, let the matrix T = T(d) have diagonal elements (1 + (d - 1)√(d + 1))/d and off-diagonal elements (1 - √(d + 1))/d. Then V is characterized by the slices T·h for each hyperplane h which is a slice of O.

We say that a polytope p is vertex transitive iff, for every pair of vertices v1, v2, there is an isometry σ : p → p with σ(v1) = v2. The following fact is proved in the appendix:

Property 2.6. In ℓ_2, the cells of V are isometric, and each is a vertex-transitive simplex.

It is interesting to note that, beyond ℝ², regular simplices cannot tessellate ℝ^d, so that this may be considered a reasonable compromise in which some form of regularity is maintained within an isometric, simplicial tiling of space.

In the following analysis, we restrict ourselves to ℓ_2. Since V is isometric, we can focus on a single particular cell to study; we choose T(S_id). The corners of this cell are the points p_i := T(f_i). This means that

(2.8)    p_i = (1/d)·( i(1 - √(d + 1)), ..., i(1 - √(d + 1)), i + (d - i)√(d + 1), ..., i + (d - i)√(d + 1) ),

with the first value filling the first d - i coordinates and the second filling the last i coordinates. From this it is a straightforward computation to see that ⟨p_i, p_j⟩ = i(d + 1 - j) for i ≤ j, and that ||p_i - p_j||_2^2 = |j - i|(d + 1 - |j - i|). This surprisingly simple formula reveals that the diameter of T(S_id) is clearly given between any two points p_i, p_j for which j - i best approximates (d + 1)/2:

    D0(V) = 2√(⌈d/2⌉(⌊d/2⌋ + 1)) = d + 1 for odd d; √(d(d + 2)) for even d.
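These formulas are easy to verify numerically; the following sketch (ours) builds the corners p_i = T(f_i) directly and checks the Gram and distance identities.

import math

d = 6
alpha = math.sqrt(d + 1)
beta = (1 - alpha) / d
p = []
for i in range(d + 1):
    f_i = [0.0] * (d - i) + [1.0] * i              # corner f_i of S_id
    s = sum(f_i)
    p.append([alpha * v + beta * s for v in f_i])  # p_i = T f_i, T = alpha*I + beta*J
for i in range(d + 1):
    for j in range(i, d + 1):
        dot = sum(a * b for a, b in zip(p[i], p[j]))
        assert abs(dot - i * (d + 1 - j)) < 1e-9
        sq = sum((a - b) ** 2 for a, b in zip(p[i], p[j]))
        assert abs(sq - (j - i) * (d + 1 - (j - i))) < 1e-9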

We will invoke a general lemma giving the shortest distance from any face of T(S_id) to the convex hull of the other vertices. By minimizing over all faces, this allows us to find the shortest distance between any points which cross the cell; this distance is exactly D1(V).

The following lemma deals with subsets of vertices whose indices are in N_d := {0, 1, ..., d}. It will be useful to consider the indices as residue classes modulo d + 1. To this end, we will extend the usual interval notation [a, b) ⊂ N_d to allow the case a > b, defined by: i ∈ [a, b) iff i ∈ N_d and (a ≤ i or i < b).

Lemma 2.7. Suppose A ⊂ N_d, let B = N_d \ A, and write A = ∪_{i=1..k} [a_i, b_i) so that k is minimized (i.e., there are no contiguous or overlapping intervals). Then the points

    x_A := (1/2k) Σ_i (p_{a_i} + p_{b_i - 1 (mod d+1)})  and  x_B := (1/2k) Σ_i (p_{b_i} + p_{a_i - 1 (mod d+1)})

minimize the distance between convex({p_i}_{i∈A}) and convex({p_i}_{i∈B}). Furthermore, ||x_A - x_B||_2^2 = (d + 1)/(2k).

Proof. Let y := (2k/(d + 1))·(x_A - x_B). The first goal of the proof is to see that

(2.9)    y ⊥ p_i - p_j when either i, j ∈ A or i, j ∈ B.

This suffices to show that x_A - x_B is the shortest line segment between the two affine linear subsets.

In order to decompose y, we introduce the notation e'_i := e_{d+1-i} for i ≥ 1, and e'_0 := -1, where 1 denotes the all-ones vector; we write t_i for T(e'_i). Since p_j - p_{j-1 (mod d+1)} = T(f_j - f_{j-1 (mod d+1)}) = t_j, we have

    (d + 1)·y = Σ_i (p_{a_i} - p_{a_i - 1}) - (p_{b_i} - p_{b_i - 1}) = Σ_i (t_{a_i} - t_{b_i}).

Observe that T = √(d + 1)(I - γJ), where γ = (1 - 1/√(d + 1))/d and J is the all-one matrix, so that T² = (d + 1)I - J. Then, for i > 0,

(2.10)    ⟨t_i, p_j⟩ = ⟨e'_i, T² f_j⟩ = (d + 1)⟨e'_i, f_j⟩ - ⟨e'_i, J f_j⟩ = (d + 1)(i ≤ j) - j,

where the notation n(boolean) denotes the value n if the boolean is true, and 0 otherwise.

Using (2.8), we also have ⟨t_0, p_j⟩ = ⟨-1, p_j⟩ = -j, since the coordinates of p_j sum to j. Using this, we can re-write (2.10) to include the i = 0 case as ⟨t_i, p_j⟩ = (d + 1)(j ∈ [i, 0)) - j, where [0, 0) is interpreted as the empty set.

Let g_i := t_{a_i} - t_{b_i}. Then (1/(d + 1))·⟨g_i, p_j⟩ = 1(j ∈ [a_i, b_i)) - 1(0 ∈ [a_i, b_i)). Let w := #{i : 0 ∈ [a_i, b_i)}. Notice that y = (1/(d + 1)) Σ_i g_i, so that

(2.11)    ⟨y, p_j⟩ = (1/(d + 1)) Σ_i ⟨g_i, p_j⟩ = 1 - w if j ∈ A;  -w if j ∈ B.

From this, we have that ⟨y, p_i - p_j⟩ = 0 whenever i, j ∈ A or i, j ∈ B. This finishes the demonstration of (2.9), the first half of this proof.

Next we confirm the distance between x_A and x_B. Using (2.11), (d + 1)·⟨y, y⟩ = Σ_i ⟨y, (p_{a_i} + p_{b_i - 1}) - (p_{b_i} + p_{a_i - 1})⟩ = 2k. Thus ||x_A - x_B||_2^2 = ((d + 1)/(2k))²·⟨y, y⟩ = (d + 1)/(2k), which concludes the proof.

The lemma tells us that we can achieve the minimum distance by maximizing k, the number of intervals needed to express A (or B). For instance, we could always choose A = {i ∈ N_d : i is even}, in which case k = ⌈d/2⌉. Hence D1(V) = √((d + 1)/d) for even d, and 1 for odd d.
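As a numeric sanity check of the lemma (a sketch under our own construction of the p_i), the choice A = {even indices} for even d gives intervals [d, 1), [2, 3), ..., [d - 2, d - 1), so k = d/2 and ||x_A - x_B||² = (d + 1)/d:

import math

d = 6                                   # an even dimension
alpha = math.sqrt(d + 1)
beta = (1 - alpha) / d
p = [[alpha * v + beta * i for v in [0.0] * (d - i) + [1.0] * i]
     for i in range(d + 1)]             # p_i = T f_i
k = d // 2                              # {d} and {0} merge into [d, 1) mod d+1

def centroid(vs):
    return [sum(c) / len(vs) for c in zip(*vs)]

xA = centroid([p[0], p[d]] + [p[i] for i in range(2, d, 2)] * 2)
xB = centroid([p[i] for i in range(1, d, 2)] * 2)
sq = sum((a - b) ** 2 for a, b in zip(xA, xB))
print(sq, (d + 1) / (2 * k))            # both 7/6 for d = 6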

Combining this with D0(V), we see that, in ℓ_2,

(2.12)    c = D0/D1 = d + 1 if d is odd;  d·√(1 + 1/(d + 1)) if d is even.

We always have c ≤ d + 1, so that the guaranteed locality-sensitive hashing accuracy of V is essentially twice as good as that of O; although the accuracy of V has only been verified in ℓ_2, whereas our analysis of O applies for any ℓ_p.

Computing locality_hashes(x) in V We could summarize the definition of this partition via V = T(O). Accordingly, if locality_hashes_in_O(x) computes the corners of the cell in O containing x, then locality_hashes_in_O(T^{-1} x) does the same for V.

We have T^{-1} = (1/√(d + 1))·I + γJ, where γ = (1 - 1/√(d + 1))·(1/d). If y = T^{-1} x, then y_i = x_i/√(d + 1) + γ·s_x, where s_x = Σ_i x_i. This is computable in O(d) time.

2.4 Time and space complexities As in §2.1, we will use σ to denote the size of the return value of locality_hashes, and τ to denote the time needed for its computation. The algorithms presented here need at least τ = Ω(d log d) worst-case time in order to sort the coordinates of x_frac, as described in §2.2. Unfortunately, the output has size σ = Θ(d²): there are d + 1 corners, each in ℤ^d, so that we need at least τ = Ω(d²) time simply to store and return all coordinates of each cell corner.

In the next section, we mention a method to reduce these complexities.

A trick for faster hashes The bottleneck in the algorithms thus far is the fact that we need at least Ω(d²) space to store and retrieve the list of all d + 1 corners (each in ℤ^d) of the cell containing x.

If the hashed value h(v) of each vector v ∈ ℤ^d is of the form h(v) ≡ ⟨u, v⟩ (mod N) for some constant vector u and constant modulus N, then we can achieve complexity O(d log d) for both τ and σ. To do so, it suffices to first compute h(x_int) and then subsequently add u_{π(i)} modulo N, for each i = d, d - 1, ..., 1, where π is the sorting permutation as discussed in §2.2. For fast lookups, we only need to store (x_int, π^{-1}) in a new table, and a value k associated with each hash key corresponding to the corner x_int + e_{π(d)} + ... + e_{π(k)}. The exact i-th coordinate of a key can then be retrieved in constant time as (x_int)_i + 1(π^{-1}(i) ≥ k).

Fast lookups can be achieved by only examining log d random coordinates to verify the key of each hash table entry. This allows for O(d log d) query time with a small risk of extra hash table collisions ([1]).
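A minimal sketch of this trick (ours; the modulus N is an arbitrary choice, and u is a constant random vector fixed once per hash table) emits all d + 1 corner hashes with O(1) extra work each after the sort:

import math
import random

N = (1 << 61) - 1  # an arbitrary large prime modulus (our choice)

def corner_hashes(x, u):
    # Returns the d + 1 values h(corner) = <u, corner> mod N without
    # materializing the corners: start from h(x_int), add one u-entry per step.
    d = len(x)
    x_int = [math.floor(xi) for xi in x]
    pi = sorted(range(d), key=lambda i: x[i] - x_int[i])
    h = sum(ui * vi for ui, vi in zip(u, x_int)) % N
    hashes = [h]
    for i in range(1, d + 1):
        h = (h + u[pi[d - i]]) % N  # next corner gains e_{pi[d-i]}; hash gains u_{pi[d-i]}
        hashes.append(h)
    return hashes

d = 5
u = [random.randrange(N) for _ in range(d)]  # chosen once per hash table
print(corner_hashes([0.3, 1.7, -0.2, 0.9, 2.4], u))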

    3 Empirical performance

    In this section we present empirical evidence on artificialdata suggesting that our algorithm compares favorablyagainst previously known locality hashes.

    3.1 Related hashes We compare our algorithm tothree other locality hashes for real vectors.

The first, presented in [8], is based on a projection of the points into Hamming space, followed by a projection onto ℤ^k. This algorithm is very simple to implement, although it requires a fixed bound on the input vectors; because it utilizes the unary representation of coordinates, we refer to it as the unary hash.

In [7], an algorithm was presented based primarily on dimension reduction tailored to preserve distances well in certain ℓ_p metrics. The reduced vector is then abridged in a coordinate-wise fashion into ℤ^k. Intuitively, the equivalence classes of this hash can be thought of as hypercubes in the reduced space. We refer to this as the p-stable hash.

The authors of [2] specifically design their hash so that the equivalence classes, in a reduced space, are spheres (or occluded spheres) instead of the hypercubes which always result when concatenating a series of 1-dimensional hashes. We refer to this as the sphere-grid hash, as it is built on a sequence of grids of spheres.

Also mentioned in [2] is the idea to use the Leech lattice for d = 24, or dimension reductions thereto. We have not compared against the Leech lattice primarily because, as a locality hash (i.e., without dimension reduction), it exists for only one d, while we are interested in the ability of hashes to scale to arbitrary dimensions.

3.2 Test data and performance metrics All of the locality hashes were tested by repeatedly generating a uniform random point x ∈ [0, C]^d, then a random unit vector w ∈ ℝ^d, and checking to see if x ∼ x + Dw for a given distance D, where x ∼ y means x, y are considered nearby according to whichever algorithm is being tested. In this way, we simulate data that has been randomly shifted and rotated before processing, and can estimate the probability P(x ∼ y | dist(x, y) = D), over the probability space of this random data movement, for each distance D. Effectively, we are estimating worst-case bounds on the approximation factor c for any query/point pair.
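A sketch of this test harness (ours; collide is the hash-specific nearness predicate, e.g. a shared corner for our hash):

import math
import random

def hit_rate(D, d, collide, trials=1000, C=100.0):
    # Estimates P(x ~ y | dist(x, y) = D) for the predicate collide(x, y);
    # e.g. collide = lambda x, y: bool(set(locality_hashes(x)) &
    #                                  set(locality_hashes(y)))
    hits = 0
    for _ in range(trials):
        x = [random.uniform(0, C) for _ in range(d)]
        w = [random.gauss(0, 1) for _ in range(d)]   # random direction
        norm = math.sqrt(sum(wi * wi for wi in w))
        y = [xi + D * wi / norm for xi, wi in zip(x, w)]
        hits += collide(x, y)
    return hits / trials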

Figure 5 shows P(x ∼ y | D) versus distance D for each algorithm. For easier comparison, the scale of each algorithm has been chosen so that the 50% point is at D = 1; better accuracy is indicated by a steeper descent around this threshold. We exclude sphere grid from the d = 500 graph because our implementation had difficulty scaling to this dimension.

It is also interesting to estimate which approximation factors are available at certain confidence levels for each algorithm. Let f(D) := P(x ∼ y | dist(x, y) = D). We assume that f is a strictly decreasing function when f(D) ∈ (0, 1), which appears to be true for all tested algorithms. Then we can define D_p := f^{-1}(p) for p ∈ (0, 1).


    Figure 5: Hit probabilities versus point distance fordimensions 10, 100, and 500, top to bottom.

hash        | ĉ_{.01}  ĉ_{.1}  ĉ_{.3}
unary       | 18       6.2     3.1
sphere grid | 16       5.6     2.8
p-stable    | 16       5.1     2.8
here        | 2.2      1.6     1.4

Table 2: ĉ_δ estimates for d = 20.

Next, let ĉ_δ := D_{δ/2}/D_{1-δ/2}. We know with confidence at least 1 - δ that dist(x, y) < D_{1-δ/2} ⟹ x ∼ y, and dist(x, y) > D_{δ/2} ⟹ x ≁ y; when both are true, we have c ≤ ĉ_δ for any single point p.
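Given such a harness, D_p and ĉ_δ can be estimated by inverting the empirical hit-rate curve; a sketch (ours), reusing hit_rate from above and assuming f is strictly decreasing:

import math

def D_p(p, d, collide, lo=1e-3, hi=1e3, iters=30):
    # Geometric bisection for the distance D with hit rate p.
    for _ in range(iters):
        mid = math.sqrt(lo * hi)
        if hit_rate(mid, d, collide) > p:
            lo = mid   # still colliding too often: move farther out
        else:
            hi = mid
    return math.sqrt(lo * hi)

def c_hat(delta, d, collide):
    return D_p(delta / 2, d, collide) / D_p(1 - delta / 2, d, collide)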

Figure 6 plots ĉ_{.05} and ĉ_{.1} for various dimensions. Note that ĉ grows minimally as d increases. Table 2 gives ĉ_δ estimates in d = 20 for confidence levels 99%, 90%, and 70%.

Parameters Our choices for the parameters of each algorithm were based primarily on a requirement that each query lookup be fast, and next on practical choices hinted at by the various authors.

Figure 6: Left: ĉ_{.05}; right: ĉ_{.1}; versus dimension.

Some previous algorithms use a parameter L to indicate the number of hash tables used, where each hash table t ∈ [L] is slightly altered, such as by translating the input point by some vector u_t ∈ ℝ^d. In those cases, the number of hash buckets per point (we'll call this L_b) was equal to the number of hash tables (call this L_t), but here we conceptually have d + 1 buckets per point in a single table. Because speed is a focus of this paper, we compare algorithms with the same L_t value, which is a more uniform measure of time complexity. In particular, we use the constant value L_t = 5 for most empirical data. In table 1, we see that setting L_t = d + 1 offers modest accuracy gains at the cost of slower running time (figure 7). All empirical data is run using the O(d log d) query-time version of our algorithm.

For unary, we chose to use the fixed number of projected coordinates per hash vector k = 700, a value suggested in [8]. For p-stable, we used r = 1 as the denominator for each 1-dimensional hash. In both p-stable and sphere grid, we used k = 2 log d as the reduced dimension (the reduced dimension is called t in [2]). In sphere grid, we used the value 2 for the remaining grid parameter; larger values (such as 4, as suggested in [2]) required an order of magnitude more grids to fill the space, and were thus significantly slower.

Reproducing the experiments Source code to easily reproduce all our experiments, along with scripts to produce our particular graphs and tables, is available from: http://thetangentspace.com/lsh/

    4 Future work

    A natural next step is to corroborate the empiricalperformance measurements with provable timing-vs-accuracy bounds.

Figure 7: Timing data, in ms per trial. A trial is two point storages and one lookup. Note the different scales on both axes, which further emphasize the (not unexpected) speed difference between the L_t = 5 and L_t = d + 1 cases.

Many locality-sensitive hash families H (including those considered here) can be intuitively understood as matching a point x primarily to an equivalence class near(x, i) := {y | h_i(y) = h_i(x)} for each particular hash h_i ∈ H, and ultimately to a larger set near(x, I) := ∪_{i∈I} near(x, i). It is inevitable that x ∼ y for any y ∈ core(x, I) := ∩_{i∈I} near(x, i). Ideally, we would also have

    near(x, I) = ∪_{y ∈ core(x, I)} B(y, R),

where R is the radius of detection for the hash. Intuitively, this perspective gives us a strong hint toward making sure that each near(x, i) is sphere-like: an intuition followed by [2], and, less directly, here.

It's not too difficult to check that near(x, I) for our hash is a Voronoi cell in a lattice solving the sphere-packing problem for d = 2, 3, although not for any larger d. Perhaps there is a trade-off in maintaining that the core set is a simplex while each near(·, i) set is sphere-like. The excellent sphere-packing properties of the Leech lattice ([6]) suggest it would give a good individual hash ([2]), and perhaps perform better when complemented by the other Niemeier lattices (although they could not fit exactly into our simplicial framework, since there are only 24 of them, and we need d + 1 = 25 corners, one from each hash).

It has not escaped our notice that the tessellations presented in this paper may themselves be of geometric interest. Are they, in some sense, the most regular simplices which isometrically tile space? For the purposes of locality-sensitive hashes, we propose the following optimality of our tessellation:

Conjecture 4.1. Among all sliced partitions into isometric simplices, the canonical simplex c of V minimizes the ratio between its diameter and the shortest distance between any points which cross c.

    5 Thanks

    Thanks to Dennis Shasha, Mehryar Mohri, CorinnaCortes, and Manfred Warmuth for their invaluablementoring and generous support.

    References

[1] D. Achlioptas, Database-friendly random projections, Symposium on principles of database systems, 2001.

[2] A. Andoni and P. Indyk, Near-optimal hashing algorithms for approximate nearest neighbor in high dimensions, Communications of the ACM, 51:117-122, 2008.

[3] A. Broder, M. Charikar, A. Frieze and M. Mitzenmacher, Min-wise independent permutations, Proceedings of the 30th annual ACM symposium on theory of computing, 1998.

[4] T.M. Chan, Approximate nearest neighbor queries revisited, Disc. Comp. Geom., 20 (1998), pp. 359-373.

[5] T.M. Chan, Closest-point problems simplified on the RAM, Proceedings of the ACM-SIAM symposium on discrete algorithms, 2002.

[6] J.H. Conway and N.J.A. Sloane, Sphere packings, lattices, and groups, Springer, 1993.

[7] M. Datar, N. Immorlica, P. Indyk and V. Mirrokni, Locality-sensitive hashing scheme based on p-stable distributions, Proceedings of the 20th annual symposium on computational geometry, 2004.

[8] A. Gionis, P. Indyk and R. Motwani, Similarity search in high dimensions via hashing, Proceedings of the 25th international conference on very large databases, 1999.

[9] V. Guruswami, J. Lee and A. Razborov, Almost Euclidean subspaces of ℓ_1 via expander codes, Proceedings of the ACM-SIAM symposium on discrete algorithms, 2008.

[10] P. Indyk, Uncertainty principles, extractors, and explicit embeddings of ℓ_2 into ℓ_1, Proceedings of the 39th annual ACM symposium on theory of computing, 2007.

[11] P. Indyk and R. Motwani, Approximate nearest neighbor: toward removing the curse of dimensionality, Proceedings of the 35th IEEE symposium on foundations of computer science, 1998.

[12] W. Johnson and J. Lindenstrauss, Extensions of Lipschitz maps into a Hilbert space, Contemp. Math., 26 (1984), pp. 189-206.

[13] G. Manku, A. Jain and A. Sarma, Detecting near-duplicates for web crawling, Proceedings of the 16th international conference on WWW, 2007.

[14] R. Panigrahy, Entropy-based nearest neighbor algorithm in high dimensions, Proceedings of the ACM-SIAM symposium on discrete algorithms, 2006.

[15] H. Samet, Foundations of Multidimensional and Metric Data Structures, Elsevier, 2006.


    A Appendix

Proof of lemma 2.1. Let y' be the first point on the line xy, moving from y to x, which is the entry point to a new cell; i.e., such that there is some cell c' for which y' ∈ c' but ỹ ∉ c' for any previous point ỹ on this line. Note that we may have y' = y.

Notice that y' must be on the boundary of the new cell, so that there is some cell c̃ containing both y' and y. Thus, if y' ∼ x, then we are done, since x is in some cell c_x with c_x ≁ c̃. From here on, we'll assume that y' ≁ x.

Let x_0 = x, and, similar to our choice of y', let x_1 be the next point on the line moving from x to y which is the entry point to a new cell. We can continue to define a sequence x_1, x_2, ... with each successive point being the next along xy toward y' which is the entry point to a new cell.

Let x_k be the first in this sequence such that x_k ∼ y'. Since x ≁ y', we have k > 0. As above (for y'), there must be a cell c_k containing both x_{k-1} and x_k. By our choice of x_k, we know that c_k ≁ c̃. Thus x' = x_k and y' are the desired pair of barely neighboring points.

Figure 8 shows an example case to illustrate the above proof.

Proof of lemma 2.2. We begin with the ⟹ direction. Suppose x, y don't cross c. If either point is not in the boundary of c, then clearly they cannot be barely neighbors. Instead, let's assume that the minimal corner sets C_x, C_y ⊂ corners(c), with x ∈ convex(C_x) and y ∈ convex(C_y), both contain a common point z ∈ C_x ∩ C_y.

If there were a cell c' containing x but not z, then the polytope c ∩ c' would be defined by corners excluding z, yet this shape would include x. This contradicts the necessity of z in C_x, so there cannot be any such cell. This same argument applies to all cells containing y. Therefore, any pair of cells c', c'' containing x, y must also share the point z, so that c' ∼ c'', and x, y cannot be barely neighbors.

Next we show the ⟸ direction. We will refer to any k-dimensional set of the form convex(F), for any F ⊂ corners(c), as a k-face of c. Thus a point is a 0-face, an edge is a 1-face, etc.

We claim that, for every k-face f of c, there is another cell c' such that c ≠ c' and c ∩ c' = f.

First we'll complete the proof using this claim, and then justify it. Suppose that x, y cross cell c. Let f_x, f_y denote the minimal k-faces of c which contain x and y, respectively. Now choose cells c_x, c_y so that x ∈ c_x, y ∈ c_y and c_x ∩ c = f_x, c_y ∩ c = f_y, using the claim. If c_x ∼ c_y, then there is some point z ∈ c_x ∩ c_y. By the convexity of the cells, c_x and c_y contain the line segments xz and yz respectively. This means that there can be no slicing hyperplane which intersects the triangle formed by x, y, z, which in turn implies that z ∈ c (see figure 9). By our minimal choices of the k-faces, we have that z ∈ f_x and z ∈ f_y. But this contradicts the fact that x, y cross c! Hence there can exist no common point z between c_x, c_y; and c_x ≁ c_y, so that x, y are barely neighbors.

Now let's justify the claim that there is a cell c' with c ∩ c' = f for every k-face f of cell c. Suppose f is a k-face of cell c. Then f is the intersection of d - k hyperplanes with c, where each hyperplane is a slice of the partition that determines some (d-1)-face of c. Let c' be any cell on the opposite side of all of these hyperplanes, and with c ∩ c' ≠ ∅. If x ∈ c ∩ c', then x must be a point in each of these d - k hyperplanes, so that x ∈ f. It is also clear that f ⊂ c ∩ c'; so we may conclude that f = c ∩ c', which was our goal.

Proof of property 2.6.

V is isometric in ℓ_2 Let's check that V is an isometric partition, i.e., that every pair of cells is isometric. Similar to our previous argument, we'll begin by showing that every image, under T, of a lattice-cornered hypercube is isometric to the image of [0, 1]^d.

Let u denote the unit hypercube [0, 1]^d and u' = T(u). Similarly, let c denote an offset hypercube c = u + z, z ∈ ℤ^d, and c' = T(c). We would like to find an isometry from c' → u'. Let A_z be the shift A_z(x) = x - z, so that A_z(c) = u. Then clearly A'_z := T ∘ A_z ∘ T^{-1} maps c' to u'. Notice that

    (T ∘ A_z ∘ T^{-1})(x) = T(T^{-1}(x) - z) = x - Tz,

so that A'_z is indeed an isometry.

Next we'd like to check that any two simplices T(S_{π1}), T(S_{π2}) are isometric as well. Suppose π : [d] → [d] is a permutation, and let U denote the mapping given by y_i = x_{π(i)} when y = U(x). At this point we note that T can be written in the form T = αI + βJ, where I is the identity matrix, J is the all-one matrix, α = √(d + 1), and β = (1 - √(d + 1))/d. Then

    U^T T U = U^T(αI + βJ)U = αI + β(U^T J U) = αI + βJ = T.

Using the fact that U is unitary (so U^T = U^{-1}), we can see that T U T^{-1} = U. So if U : S_{π1} → S_{π2}, then we still have U : T(S_{π1}) → T(S_{π2}), also isometrically.

As a final note, observe that any two isometries A' = T ∘ A ∘ T^{-1} and B' = T ∘ B ∘ T^{-1} may be composed, and preserve the conjugate relationship A' ∘ B' = T ∘ (A ∘ B) ∘ T^{-1}. This justifies that a series of the above transformations (translations and permutations) suffices to produce an isometry between any two cells of V, and this partition can be studied by considering a single canonical cell.

The canonical cell is vertex-transitive in ℓ_2 Next we'll choose a particular canonical cell to study, and justify the name of this partition by showing that it has the type of polytope regularity known as vertex transitivity. Recall that a polytope p is vertex transitive iff, for every pair of vertices v1, v2, there is an isometry σ : p → p with σ(v1) = v2.

Our canonical cell in V is T(S_id), with corners p_i := T(f_i). To show that this cell is vertex-transitive, it will suffice to construct mappings m_{i,j} : T(S_id) → T(S_id), for any i, j ∈ {0, 1, ..., d}, such that m_{i,j} is an isometry and m_{i,j}(p_i) = p_j.

We proceed by constructing a (non-isometric) map V : S_id → S_id which acts as a shift operator on the corners of this simplex. If we define the block matrix

    W := ( -1̄  I_{d-1} )
         ( -1   0      ),

whose first column is all -1's, with the (d-1)×(d-1) identity above a zero row in the remaining columns, and let V(x) := Wx + f_1, then

    V(f_i) = f_{i+1} for i < d;  V(f_d) = f_0.

Since V is an affine transformation, V(convex(f_0, f_1, ..., f_d)) = convex(V(f_0), V(f_1), ..., V(f_d)), verifying that V : S_id → S_id.

To operate in the new partition, we extend V via V' := T ∘ V ∘ T^{-1}. Clearly, V' maps the canonical cell to itself, and shifts the corners p_i → p_{i+1 (mod d+1)}. Hence any desired mapping p_i → p_j can be achieved by the correct number of iterations of V'; that is, we are close to seeing that m_{i,j} = V'^k, where k ≡ j - i (mod d + 1).

The last step is to confirm that V' is an isometry. This is the case iff W' = T W T^{-1} is itself isometric. Let γ := (1 - 1/√(d + 1))/d, so that T = √(d + 1)(I - γJ) and T^{-1} = (1/√(d + 1))(I + √(d + 1)γJ). Then W' = (I - γJ)W(I + √(d + 1)γJ), and from here one may tediously simplify the corresponding formulae for ⟨w'_i, w'_j⟩ (where w'_i is the i-th column of W') to confirm that W'^T W' = I. This suffices to demonstrate that T(S_id) is indeed a vertex-transitive simplex.

Figure 8: Example from the proof of lemma 2.1. Here, x ≁ y, and x' ∼ y' are barely neighbors.

Figure 9: For the proof of lemma 2.2: if xz and yz are both intact (i.e., do not intersect a slicing hyperplane) then all of triangle xyz must exist in a single cell.
