
Learning Hash Functions Using Column Generation

Xi Li∗ [email protected]

Guosheng Lin∗ [email protected]

Chunhua Shen [email protected]

Anton van den Hengel [email protected]

Anthony Dick [email protected]

Australian Centre for Visual Technologies, and School of Computer Science, The University of Adelaide, Australia

Abstract

Fast nearest neighbor searching is becoming an increasingly important tool in solving many large-scale problems. Recently a number of approaches to learning data-dependent hash functions have been developed. In this work, we propose a column generation based method for learning data-dependent hash functions on the basis of proximity comparison information. Given a set of triplets that encode the pairwise proximity comparison information, our method learns hash functions that preserve the relative comparison relationships in the data as well as possible within the large-margin learning framework. The learning procedure is implemented using column generation and hence is named CGHash. At each iteration of the column generation procedure, the best hash function is selected. Unlike most other hashing methods, our method generalizes to new data points naturally, and has a training objective which is convex, thus ensuring that the global optimum can be identified. Experiments demonstrate that the proposed method learns compact binary codes and that its retrieval performance compares favorably with state-of-the-art methods when tested on a few benchmark datasets.

* indicates equal contributions.

Proceedings of the 30th International Conference on Machine Learning, Atlanta, Georgia, USA, 2013. JMLR: W&CP volume 28. Copyright 2013 by the author(s).

1. Introduction

The explosive growth in the volume of data to be processed in applications such as web search and multimedia retrieval increasingly demands fast similarity search and efficient data indexing/storage techniques. Considerable effort has been spent on designing hashing methods which address both the issue of fast similarity search and that of efficient data storage (for example, (Andoni & Indyk, 2006; Weiss et al., 2008; Zhang et al., 2010b; Norouzi & Fleet, 2011; Kulis & Darrell, 2009; Gong et al., 2012)). A hashing-based approach constructs a set of hash functions that map high-dimensional data samples to low-dimensional binary codes. These binary codes can be easily loaded into memory in order to allow rapid retrieval of data samples. Moreover, the pairwise Hamming distance between these binary codes can be efficiently computed using bit operations, which are well supported by modern processors, thus enabling efficient similarity calculation on large-scale datasets. Hash-based approaches have thus found a wide range of applications, including object recognition (Torralba et al., 2008), information retrieval (Zhang et al., 2010b), local descriptor compression (Strecha et al., 2011), image matching (Korman & Avidan, 2011), and many more. Recently a number of effective hashing methods have been developed which construct a variety of hash functions, mainly on the assumption that semantically similar data samples should have similar binary codes, such as random projection-based locality sensitive hashing (LSH) (Andoni & Indyk, 2006), boosting learning-based similarity sensitive coding (SSC) (Shakhnarovich et al., 2003), and the spectral hashing of Weiss et al. (2008), which is inspired by the Laplacian eigenmap.

In more detail, spectral hashing (Weiss et al., 2008) optimizes a graph Laplacian based objective function such that, in the learned low-dimensional binary space, the local neighborhood structure of the original dataset is best preserved. SSC (Shakhnarovich et al., 2003) makes use of boosting to adaptively learn an embedding of the original space, represented by a set of weak learners or hash functions. This embedding aims to preserve the pairwise affinity relationships of training duplets (i.e., pairs of samples in the original space). These approaches have demonstrated that, in general, data-dependent hashing is superior to data-independent hashing, with a typical example of the latter being LSH (Andoni & Indyk, 2006).

Following this vein, here we learn hash functions using side information that is generally presented as a set of triplet-based constraints. Note that the triplets used for training can be generated in either a supervised or an unsupervised fashion. The fundamental idea is to learn optimal hash functions such that, under the learned weighted Hamming distance, relative distance comparisons of the form "point x is closer to x^+ than to x^-" are satisfied as well as possible (x^+ and x^- are samples that are relevant and irrelevant to x, respectively). This type of relative proximity comparison has been successfully applied to learning quadratic distance metrics (Schultz & Joachims, 2004; Shen et al., 2012). Usually such proximity relationships do not require explicit class labels and thus are easier to obtain than either class labels or actual distances between data points. For instance, in content-based image retrieval, to collect feedback, users may be asked to report whether image x looks more similar to x^+ than to a third image x^-; this is typically much easier than labelling each individual image. Formally, we are given a set

C = {(x_i, x_i^+, x_i^-) | d(x_i, x_i^+) < d(x_i, x_i^-)}, i = 1, 2, ...,

where d(·, ·) is some similarity measure (e.g., the Euclidean distance in the original space, or a semantic similarity measure provided by a user). As explained, one may not explicitly know d(·, ·); instead, one may only be able to provide sparse proximity relationships. Using such a set of constraints, we formulate a learning problem in the large-margin framework. By using a convex surrogate loss function, a convex optimization problem is obtained, but it has an exponentially large number of variables. Column generation is therefore employed to efficiently and optimally solve the formulated optimization problem.

The main contribution of this work is to propose a novel hash function learning framework which has the following desirable properties. (i) The formulated optimization problem can be globally optimized. We show that column generation can be used to iteratively find the optimal hash functions. The weights of all the selected hash functions for calculating the weighted Hamming distance are updated at each iteration. (ii) The proposed framework is flexible and can accommodate various types of constraints. We show how to learn hash functions based on proximity comparisons. Furthermore, the framework can accommodate different types of loss functions as well as regularization terms. Also, our hashing framework can use different types of hash functions such as linear functions, decision stumps/trees, RBF kernel functions, etc.

Related work Loosely speaking, hashing methods may be categorized into two groups: data-independent and data-dependent. Without using any training data, data-independent hashing methods usually generate a set of hash functions using randomization. For instance, LSH (Andoni & Indyk, 2006) uses random projection and thresholding to generate binary codes in the Hamming space, where data samples that are mutually close in the Euclidean space are likely to have similar binary codes. Recently, Kulis & Grauman (2009) proposed a kernelized version of LSH, which is capable of capturing the intrinsic relationships between data samples using kernels instead of linear inner products. In terms of learning methodology, data-dependent hashing methods can make use of unsupervised, supervised or semi-supervised learning techniques to learn a set of hash functions that generate compact binary codes. In the unsupervised case, two typical approaches are used to obtain such compact binary codes: thresholding real-valued low-dimensional vectors (after dimensionality reduction), and directly optimizing a Hamming distance based objective function (e.g., spectral hashing (Weiss et al., 2008), self-taught hashing (Zhang et al., 2010b)). The spectral hashing (SPH) method directly optimizes a graph Laplacian objective function in the Hamming space. Inspired by SPH, Zhang et al. (2010b) developed the self-taught hashing (STH) method. In the first step of STH, Laplacian graph embedding is used to generate a sequence of binary codes for each sample. By viewing these binary codes as binary classification labels, a set of hash functions is obtained by training a set of bit-specific linear support vector machines. Liu et al. (2011) proposed a scalable graph-based hashing method which uses a small anchor graph to approximate the original neighborhood graph, alleviating the computational limitation of spectral hashing.

As for the supervised learning case, a number of hashing methods take advantage of labeled training samples to build data-dependent hash functions. These hashing methods often formulate hash function learning as a classification problem. For example, Salakhutdinov & Hinton (2009) proposed the restricted Boltzmann machine (RBM) hashing method, which uses a multi-layer deep learning technique for binary code generation. Strecha et al. (2011) use Fisher linear discriminant analysis (LDA) to embed the original data samples into a lower-dimensional space, where the embedded data samples are binarized using thresholding. Boosting methods have also been employed to develop hashing methods such as SSC (Shakhnarovich et al., 2003) and Forgiving Hash (Baluja & Covell, 2008), both of which learn a set of weak learners as hash functions in the boosting framework. It is demonstrated in (Torralba et al., 2008) that some data-dependent hashing methods like stacked RBM and boosting SSC perform much better than LSH on large-scale databases of millions of images. Wang et al. (2012) proposed a semi-supervised hashing method, which aims to ensure the smoothness of similar data samples and the separability of dissimilar data samples. More recently, Liu et al. (2012) introduced a kernel-based supervised hashing method, where the hash functions are nonlinear kernel functions.

The closest work to ours is perhaps boosting-based SSC hashing (Shakhnarovich et al., 2003), which also learns a set of weighted hash functions through boosting. Ours differs from SSC in the learning procedure. The optimization problem of our CGHash is based on the concept of margin maximization, and we derive a meaningful Lagrange dual problem such that column generation can be applied to solve the semi-infinite optimization problem. In contrast, SSC is built on the learning procedure of AdaBoost, which employs stage-wise coordinate-descent optimization; the weights associated with the selected hash functions (corresponding to weak classifiers in AdaBoost) are not fully updated at each iteration. The information used for training is also different: we use distance comparison information, whereas SSC uses pairwise information. In addition, our work can accommodate various types of constraints, and can flexibly adapt to different types of loss functions as well as regularization terms. It is unclear, for example, how SSC could accommodate different types of regularization that may encode useful prior information. In this sense our CGHash is much more flexible. Next, we present our main results.

2. The proposed algorithm

Given a set of training samples x_m ∈ R^D (m = 1, 2, ...), we aim to learn a set of hash functions h_j(x) ∈ H, j = 1, 2, ..., ℓ, that map these training samples to a low-dimensional binary space described by a set of binary codewords b_m (m = 1, 2, ...). Here each b_m is an ℓ-dimensional binary vector. In the low-dimensional binary space, the codewords b_m are supposed to preserve the underlying proximity information of the corresponding x_m in the original high-dimensional space. Next we learn such hash functions {h_j(x)}_{j=1}^ℓ within the large-margin learning framework.

Formally, suppose that we are given a set of triplets {(x_i, x_i^+, x_i^-)}_{i=1}^{|I|} with x_i, x_i^+, x_i^- ∈ R^D and I being the triplet index set. These triplets encode the proximity comparison information: the distance/dissimilarity between x_i and x_i^+ is smaller than that between x_i and x_i^-. We define the weighted Hamming distance for the learned binary codes as

d_H(x, z) = ∑_{j=1}^{ℓ} w_j |h_j(x) − h_j(z)|,

where w_j is a non-negative weight associated with the j-th hash function. In our experiments, the triplet set is generated such that x_i and x_i^+ belong to the same class while x_i and x_i^- belong to different classes. As discussed, these triplets may be sparsely provided by users in applications such as image retrieval. We want the constraints d_H(x_i, x_i^+) < d_H(x_i, x_i^-) to be satisfied as well as possible. For notational simplicity, we define a_j^{[i]} = |h_j(x_i) − h_j(x_i^-)| − |h_j(x_i) − h_j(x_i^+)|, so that d_H(x_i, x_i^-) − d_H(x_i, x_i^+) = w^⊤ a_i with

a_i = [a_1^{[i]}, a_2^{[i]}, ..., a_ℓ^{[i]}]^⊤.        (1)

In what follows, we describe the details of our hashing algorithm using different types of convex loss functions and regularization norms. In theory, any convex loss and regularization can be used in our hashing framework. More details of our hashing algorithm can be found in Algorithm 1 and the supplementary file (Li et al., 2013).

2.1. Learning hash functions with the hinge loss

Hashing with l1 norm regularization Using the hinge loss, we define the following large-margin optimization problem:

min_{w,ξ}  ∑_{i=1}^{|I|} ξ_i + C‖w‖_1
s.t.  w ≽ 0, ξ ≽ 0;  d_H(x_i, x_i^-) − d_H(x_i, x_i^+) ≥ 1 − ξ_i, ∀i,        (2)

where ‖·‖_1 is the l1 norm, w = (w_1, w_2, ..., w_ℓ)^⊤ is the weight vector, ξ is the slack variable, C is a parameter controlling the trade-off between the training error and the model capacity, and the symbol '≽' denotes an element-wise inequality. The optimization problem (2) can be rewritten as:

min_{w,ξ}  ∑_{i=1}^{|I|} ξ_i + C 1^⊤ w
s.t.  w ≽ 0;  a_i^⊤ w ≥ 1 − ξ_i, ξ_i ≥ 0, ∀i,        (3)

where 1 is the all-one column vector.


Algorithm 1 Hashing using column generation

Input: training triplets {(x_i, x_i^+, x_i^-)}, i = 1, 2, ..., and ℓ, the number of hash functions.
Output: learned hash functions {h_j(x)}_{j=1}^ℓ and the associated weights w.
Initialize: u ← 1/|I|.
for j = 1 to ℓ do
  1. Find the best hash function h_j(·) by solving the subproblem (11).
  2. Add h_j(·) to the hash function set.
  3. Update a_i, ∀i, as in (1).
  4. Solve the primal problem for w (using LBFGS-B (Zhu et al., 1997)) and obtain the dual variable u using the KKT condition (10).
end for
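The following Python sketch (not the authors' implementation) mirrors the structure of Algorithm 1 for a smooth loss. The helpers find_best_hash and solve_primal are hypothetical placeholders standing in for the subproblem (11)/(13) and the restricted primal solve of (7); the dual update follows the KKT condition (10).

```python
import numpy as np

def cghash_train(triplets, num_bits, find_best_hash, solve_primal, loss_grad):
    """Column generation outer loop (sketch of Algorithm 1).

    triplets      : list of (x, x_plus, x_minus) arrays
    find_best_hash: (u, triplets) -> new hash function h(x), i.e. subproblem (11)
    solve_primal  : A (l x |I| matrix) -> (w, rho) for the restricted primal (7)
    loss_grad     : derivative f'(.) of the smooth convex loss
    """
    u = np.full(len(triplets), 1.0 / len(triplets))    # initialize dual variables
    hash_fns, rows = [], []
    for _ in range(num_bits):
        h = find_best_hash(u, triplets)                # 1. most violated dual constraint
        hash_fns.append(h)                             # 2. add to the hash function set
        rows.append([abs(h(x) - h(xn)) - abs(h(x) - h(xp))
                     for (x, xp, xn) in triplets])     # 3. update a_i as in (1)
        A = np.array(rows)                             #    A = (a_1, ..., a_|I|)
        w, rho = solve_primal(A)                       # 4. solve restricted primal for w
        u = -loss_grad(rho)                            #    dual update via KKT condition (10)
    return hash_fns, w
```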

The dual problem corresponding to (3) is:

max_u  1^⊤ u,   s.t.  A u ≼ C 1,  0 ≼ u ≼ 1,        (4)

where the matrix A = (a_1, a_2, ..., a_{|I|}) ∈ R^{ℓ×|I|} and the symbol '≼' denotes an element-wise inequality.
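For the hinge loss with l1 regularization, the restricted primal (3) is a linear program in [w; ξ] and can be handled by an off-the-shelf LP solver. The sketch below is an illustration only (the paper solves the restricted primal with L-BFGS-B, see Algorithm 1); it builds the LP for a given matrix A = (a_1, ..., a_{|I|}).

```python
import numpy as np
from scipy.optimize import linprog

def solve_l1_hinge_primal(A, C):
    # Problem (3): min_{w, xi}  sum_i xi_i + C 1'w
    #              s.t.  a_i' w + xi_i >= 1,  w >= 0,  xi >= 0.
    l, n = A.shape                        # l hash functions, n = |I| triplets
    c = np.concatenate([C * np.ones(l), np.ones(n)])
    # constraints -a_i' w - xi_i <= -1 in linprog's A_ub z <= b_ub form, z = [w; xi]
    A_ub = np.hstack([-A.T, -np.eye(n)])
    b_ub = -np.ones(n)
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=[(0, None)] * (l + n))
    return res.x[:l], res.x[l:]           # (w, xi)
```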

Hashing with l∞ norm regularization The primal problem is formulated as:

min_{w,ξ}  ∑_{i=1}^{|I|} ξ_i + C‖w‖_∞
s.t.  w ≽ 0;  a_i^⊤ w ≥ 1 − ξ_i, ξ_i ≥ 0, ∀i.

We can turn the l∞ regularization into a constraint:

min_{w,ξ}  ∑_{i=1}^{|I|} ξ_i
s.t.  w ≽ 0, ‖w‖_∞ ≤ C′;  a_i^⊤ w ≥ 1 − ξ_i, ξ_i ≥ 0, ∀i.        (5)

The dual form of the above optimization problem is:

min_{u,q}  −1^⊤ u + C′ 1^⊤ q
s.t.  A u ≼ q,  0 ≼ u ≼ 1,        (6)

where C′ is a positive constant.
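For completeness, here is a sketch (not spelled out in the text above) of how (6) follows from (5): introduce nonnegative multipliers u and r for the triplet and slack constraints, and p and q for the bound constraints 0 ≼ w ≼ C′1; stationarity then eliminates w and ξ.

```latex
\begin{align*}
L &= \sum_i \xi_i + \sum_i u_i\,(1 - \xi_i - a_i^\top w)
     - r^\top \xi - p^\top w + q^\top (w - C'\mathbf{1}),\\
\frac{\partial L}{\partial \xi_i} = 0 &\;\Rightarrow\; u_i + r_i = 1
     \;\Rightarrow\; \mathbf{0} \preccurlyeq u \preccurlyeq \mathbf{1},\\
\frac{\partial L}{\partial w} = 0 &\;\Rightarrow\; p = q - A u \succcurlyeq \mathbf{0}
     \;\Rightarrow\; A u \preccurlyeq q,\\
\inf_{w,\xi} L &= \mathbf{1}^\top u - C'\,\mathbf{1}^\top q .
\end{align*}
```

Maximizing 1^⊤u − C′1^⊤q over this feasible set, and writing the maximization as the minimization of its negative, gives (6).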

2.2. Hashing with a general convex loss function

Here we derive the algorithm for learning hash functions with a general convex loss. We assume that the convex loss function f(·) is smooth (e.g., the exponential, logistic, or squared hinge loss), although our algorithm can be easily extended to non-smooth loss functions.

Hashing with l1 norm regularization Assume that we want to find a set of hash functions such that the constraints d_H(x_i, x_i^-) − d_H(x_i, x_i^+) = w^⊤ a_i > 0, i = 1, 2, ..., hold as well as possible. These constraints do not all have to be strictly satisfied. We define the margin ρ_i = w^⊤ a_i, and we want to maximize the margin with regularization. Using l1 norm regularization to control the capacity, we may define the primal optimization problem as:

min_{w,ρ}  ∑_{i=1}^{|I|} f(ρ_i) + C‖w‖_1,   s.t.  w ≽ 0;  ρ_i = a_i^⊤ w, ∀i.        (7)

Here f(·) is a smooth convex loss function; w = [w_1, w_2, ..., w_ℓ]^⊤ is the weight vector that we are interested in optimizing; and C is a parameter controlling the trade-off between the training error and the model capacity.

Note that without this regularization, one could always make w arbitrarily large to drive the convex loss towards zero whenever all constraints are satisfied. Because the number of possible hash functions can be extremely large or even infinite, we are not able to solve problem (7) directly. Instead, we use the column generation technique to iteratively and approximately solve the original problem. Column generation is a technique originally developed for large-scale linear programming problems; Demiriz et al. (2002) used it to design boosting algorithms. At each iteration, one column (a variable in the primal, or equivalently a constraint in the dual problem) is added while solving the restricted problem. Once no column violating the dual constraints can be found, the solution of the restricted problem is identical to the optimal solution. Here we only need an approximate solution and, in order to learn compact codes, we only care about the first few (e.g., 60) selected hash functions. In theory, if column generation is run for a sufficient number of iterations, one obtains a sufficiently accurate solution (up to a preset precision, or until no more hash functions can be found that improve the solution).

We need to derive a meaningful Lagrange dual in order to use column generation. The Lagrangian is:

L = ∑_{i=1}^{|I|} f(ρ_i) + C 1^⊤ w − p^⊤ w + ∑_{i=1}^{|I|} u_i (a_i^⊤ w − ρ_i)
  = C 1^⊤ w − p^⊤ w + ∑_{i=1}^{|I|} u_i (a_i^⊤ w) − (u^⊤ ρ − ∑_{i=1}^{|I|} f(ρ_i)),

where p ≽ 0 and u are Lagrange multipliers. With the definition of the Fenchel conjugate (Boyd & Vandenberghe, 2004), we have the relation

inf_ρ L = −sup_ρ (u^⊤ ρ − ∑_{i=1}^{|I|} f(ρ_i)) = −∑_{i=1}^{|I|} f*(u_i),

and in order to have a finite infimum, C 1 − p + A u = 0 must hold. So we have p ≽ 0 and A u ≽ −C 1. Here the matrix A is defined as in (4). Consequently, the corresponding dual problem of (7) can be written as:

min_u  ∑_{i=1}^{|I|} f*(u_i),   s.t.  A u ≽ −C 1.        (8)


Here f*(·) is the Fenchel conjugate of f(·). By reversing the sign of u, we can reformulate (8) in the equivalent form:

min_u  ∑_{i=1}^{|I|} f*(−u_i),   s.t.  A u ≼ C 1.        (9)

Since we assume that f(·) is smooth, the Karush-Kuhn-Tucker (KKT) conditions establish the connection between (9) and (7) at optimality:

u_i^⋆ = −f′(ρ_i^⋆), ∀i.        (10)

In other words, the dual variable is determined by the gradient of the loss function in the primal. So if we solve the primal problem (7), from the primal solution w^⋆ we can calculate the dual solution u^⋆ using (10); the converse, however, does not necessarily hold.
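As a concrete illustration, consider the squared hinge loss f(ρ) = max(0, 1 − ρ)^2, which is the loss used in the experiments of Section 3; its derivative, its Fenchel conjugate, and the resulting dual update from (10) have simple closed forms:

```latex
\[
f(\rho) = \max(0,\,1-\rho)^2, \qquad
f'(\rho) = -2\,\max(0,\,1-\rho), \qquad
f^*(u) = \begin{cases} u + u^2/4, & u \le 0,\\ +\infty, & u > 0, \end{cases}
\]
\[
\text{so the KKT condition (10) gives} \qquad
u_i^\star = -f'(\rho_i^\star) = 2\,\max(0,\,1-\rho_i^\star).
\]
```

Note that f* is finite only for u ≤ 0, which is consistent with the sign reversal leading from (8) to (9): in Algorithm 1 the dual variables are simply twice the nonnegative hinge violations of the margins ρ_i.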

The core idea of column generation is to generate a small subset of variables, each of which is sequentially found by selecting the most violated dual constraint in the dual optimization problem (9). This process is equivalent to inserting primal variables into the primal optimization problem (7). Here, the subproblem for generating the most violated dual constraint (i.e., for finding the best hash function) can be defined as:

h^⋆(·) = argmax_{h(·)∈H}  ∑_{i=1}^{|I|} u_i a^{[i]}
       = argmax_{h(·)∈H}  ∑_{i=1}^{|I|} u_i (|h(x_i) − h(x_i^-)| − |h(x_i) − h(x_i^+)|).        (11)

In order to obtain a smoothly differentiable objective function, we reformulate (11) into the following equivalent form:

argmax_{h(·)∈H}  ∑_{i=1}^{|I|} u_i [(h(x_i) − h(x_i^-))^2 − (h(x_i) − h(x_i^+))^2].        (12)

The equivalence between (11) and (12) can be trivially established (since h takes binary values, the squared difference is proportional to the absolute difference).

Globally solving the optimization problem (12) is in general difficult. In the case of decision stumps as hash functions, we can usually enumerate all the possibilities exhaustively and find the globally best one. In the case of a linear perceptron as the hash function, h(x) takes the form sgn(v^⊤ x + b), where sgn(·) is the sign function; the binary hash codes are then easily computed as (1 + h(x))/2. In practice, we relax h(x) = sgn(v^⊤ x + b) to h(x) = tanh(v^⊤ x + b), with tanh(·) being the hyperbolic tangent function. For notational simplicity, let τ_{i+} and τ_{i−} denote tanh(v^⊤ x_i + b) − tanh(v^⊤ x_i^+ + b) and tanh(v^⊤ x_i + b) − tanh(v^⊤ x_i^- + b), respectively. Then we have the following optimization problem:

h^⋆(·) = argmax_{h(·)∈H}  ∑_{i=1}^{|I|} u_i [(h(x_i) − h(x_i^-))^2 − (h(x_i) − h(x_i^+))^2]
       = argmax_{v,b}  ∑_{i=1}^{|I|} u_i (τ_{i−}^2 − τ_{i+}^2).        (13)

The above optimization problem can be efficiently solved using LBFGS (Zhu et al., 1997) after feature normalization. The initialization of LBFGS can be guided by LSH (Andoni & Indyk, 2006): we first generate a set of candidate solutions with v ∼ N(0, 1) and b ∼ U(−1, 1), where N(·) and U(·) denote the normal and uniform distributions respectively, and then use as initialization the candidate that maximizes the objective function (13). In our experiments, we use linear perceptron hash functions.
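As an illustration (a sketch under the above assumptions, not the authors' code), the subproblem (13) with the tanh relaxation and LSH-guided initialization can be set up with SciPy's L-BFGS-B as follows; X, Xp, and Xn are hypothetical matrices whose rows are x_i, x_i^+, and x_i^-.

```python
import numpy as np
from scipy.optimize import minimize

def find_best_hash(u, X, Xp, Xn, n_init=10):
    """Solve subproblem (13) for one linear hash function h(x) = sgn(v'x + b),
    relaxed to tanh(v'x + b) during optimization."""
    d = X.shape[1]

    def neg_obj(theta):
        v, b = theta[:d], theta[d]
        hx, hxp, hxn = np.tanh(X @ v + b), np.tanh(Xp @ v + b), np.tanh(Xn @ v + b)
        # objective (13): sum_i u_i [ (h(x_i)-h(x_i^-))^2 - (h(x_i)-h(x_i^+))^2 ]
        return -np.sum(u * ((hx - hxn) ** 2 - (hx - hxp) ** 2))

    # LSH-guided initialization: best of several random (v, b) candidates
    cands = [np.concatenate([np.random.randn(d), np.random.uniform(-1, 1, 1)])
             for _ in range(n_init)]
    theta0 = min(cands, key=neg_obj)
    theta = minimize(neg_obj, theta0, method='L-BFGS-B').x
    v, b = theta[:d], theta[d]
    return lambda x: np.sign(x @ v + b)   # final hash function uses the sign
```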

Hashing with l∞ norm regularization We show here that other regularization terms, such as the l∞ norm, can also be used. With l∞ norm regularization, the primal problem is defined as:

min_{w,ρ}  ∑_{i=1}^{|I|} f(ρ_i) + C‖w‖_∞,   s.t.  w ≽ 0;  ρ_i = a_i^⊤ w, ∀i.        (14)

This optimization problem is equivalent to:

min_{w,ρ}  ∑_{i=1}^{|I|} f(ρ_i),   s.t.  ‖w‖_∞ ≤ C′;  w ≽ 0;  ρ_i = a_i^⊤ w, ∀i,        (15)

where C′ is a properly selected constant related to C in (14). Due to w ≽ 0 and ‖w‖_∞ ≤ C′, we have 0 ≼ w ≼ C′1. Therefore, the Lagrangian can be written as:

L = ∑_{i=1}^{|I|} f(ρ_i) + q^⊤ w − C′ q^⊤ 1 − p^⊤ w + ∑_{i=1}^{|I|} u_i (a_i^⊤ w − ρ_i),

where p, q, u are Lagrange multipliers. Similarly to the l1 norm case, we can easily derive the dual problem:

min_{u,q}  ∑_{i=1}^{|I|} f*(u_i) + C′ 1^⊤ q,   s.t.  A u ≽ −q.        (16)

By reversing the sign of u, we can reformulate (16) in the equivalent form:

min_{u,q}  ∑_{i=1}^{|I|} f*(−u_i) + C′ 1^⊤ q,   s.t.  A u ≼ q.        (17)

The KKT condition in this l∞ regularized case is the same as (10). Also, the rule for generating the best hash function (i.e., the most violated constraint in (17)) remains the same as in the l1 norm case discussed above. Note that both the primal problems (7) and (15) can be efficiently solved using quasi-Newton methods such as L-BFGS-B (Zhu et al., 1997) by eliminating the auxiliary variable ρ.


[Figure 1: plots not reproduced in this text version.]

Figure 1. The retrieval and classification performances of the proposed CGHash and 10 other hashing methods on the ISOLET dataset. The left plot shows the average precision-recall performances using 60 bits. The middle plot shows the average performances using different code lengths, measured as the proportion of the true nearest neighbors with top-50 retrieval. The right plot shows the average 3-nearest-neighbor classification performances using different code lengths.

[Figure 2: plots not reproduced in this text version.]

Figure 2. The retrieval and classification performances of the proposed CGHash and 10 other hashing methods on the SCENE-15 dataset. The description of each plot is the same as in Fig. 1.

Extension To demonstrate the flexibility of the proposed framework, we show an example that considers additional pairwise information. Assume that we have a set of duplets whose two members are neighbors of each other or belong to the same class, so that the distance between the members of each duplet should be minimized. We can easily include such a term in our objective function. Formally, denote the duplet set by D = {(x_k, x_k^+)}. We want to minimize the divergence

∑_{k=1}^{|D|} d_H(x_k, x_k^+) = ∑_j w_j (∑_k |h_j(x_k) − h_j(x_k^+)|) = ∑_j s_j w_j,

with s_j = ∑_{k=1}^{|D|} |h_j(x_k) − h_j(x_k^+)| being a nonnegative constant given h_j(·). If we use this term to replace the l1 regularization term ∑_j w_j in the primal (7), all of our analysis still holds and Algorithm 1 remains applicable with minimal modification, because the new term can simply be seen as a weighted l1 norm.
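A minimal sketch (not from the paper) of how the constants s_j could be computed, assuming the selected hash functions are available as Python callables hash_fns and the duplets as a list of (x_k, x_k^+) pairs:

```python
import numpy as np

def duplet_weights(hash_fns, duplets):
    # s_j = sum_k |h_j(x_k) - h_j(x_k^+)|; replacing the l1 penalty 1'w
    # by s'w in (7) turns the duplet term into a weighted l1 norm.
    return np.array([sum(abs(h(x) - h(xp)) for (x, xp) in duplets)
                     for h in hash_fns])
```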

3. Experimental results

Experimental setup In order to evaluate the proposed column generation hashing method (referred to as CGHash), we have conducted a set of experiments on six benchmark datasets. To train data-dependent hash functions, each dataset is randomly split into a training subset and a testing subset. This training/testing split is repeated 5 times, and the average performance over these 5 trials is reported here.

In the experiments, the proposed hashing method is implemented using the squared hinge loss function with the l1 regularization norm (as shown in the supplementary file). The triplets used for learning hash functions are generated in the same way as in (Weinberger et al., 2006). Specifically, given a training sample, we select the K nearest neighbors among the training samples with the same label as relevant samples, and then choose the K nearest neighbors among the training samples with different labels as irrelevant samples (K = 30 for the SCENE-15 dataset and K = 10 for the other datasets). The trade-off control factor C is cross-validated; we found that, within a wide range, C does not have a significant impact on the performance.
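The following is a brute-force sketch of this triplet generation, assuming Euclidean nearest neighbors in the original feature space; X (an n×D feature matrix) and y (labels) are illustrative names, and the pairing of relevant and irrelevant neighbors into triplets (here the full K×K cross product) is one natural choice rather than the paper's exact recipe.

```python
import numpy as np

def build_triplets(X, y, K=10):
    """Return a list of (x, x_plus, x_minus) triplets from labeled data."""
    triplets = []
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)  # pairwise distances
    for i in range(len(X)):
        same = np.where((y == y[i]) & (np.arange(len(X)) != i))[0]
        diff = np.where(y != y[i])[0]
        nn_same = same[np.argsort(dist[i, same])[:K]]   # K nearest relevant samples
        nn_diff = diff[np.argsort(dist[i, diff])[:K]]   # K nearest irrelevant samples
        for p in nn_same:
            for n in nn_diff:                           # one possible pairing scheme
                triplets.append((X[i], X[p], X[n]))
    return triplets
```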

Competing methods To demonstrate the effectiveness of the proposed hashing method (CGHash), we compare it quantitatively with several state-of-the-art hashing methods: LSH (Locality Sensitive Hashing (Andoni & Indyk, 2006)), SSC (Supervised Similarity Sensitive Coding (Torralba et al., 2008), a modified version of (Shakhnarovich et al., 2003)), LSI (Latent Semantic Indexing (Deerwester et al., 1990)), LCH (Laplacian Co-Hashing (Zhang et al., 2010a)), SPH (Spectral Hashing (Weiss et al., 2008)), STH (Self-Taught Hashing (Zhang et al., 2010b)), AGH (Anchor Graph Hashing (Liu et al., 2011)), BREs (Supervised Binary Reconstructive Embedding (Kulis & Darrell, 2009)), SPLH (Semi-Supervised Learning Hashing (Wang et al., 2012)), and ITQ (Iterative Quantization (Gong et al., 2012)). Comparing against these methods verifies the effect of learning the hash functions and shows the performance differences among hashing methods.


[Figure 3: plots not reproduced in this text version.]

Figure 3. The retrieval and classification performances of the proposed CGHash and 10 other hashing methods on the MNIST dataset. The description of each plot is the same as in Fig. 1.

[Figure 4: plots not reproduced in this text version.]

Figure 4. The retrieval and classification performances of the proposed CGHash and 10 other hashing methods on a subset of the LABELME dataset. The description of each plot is the same as in the previous figures.


Evaluation criteria For a quantitative performance comparison, we use the following three evaluation criteria: i) the precision-recall curve; ii) the proportion of true neighbors in top-k retrieval; and iii) K-nearest-neighbor classification. In the experiments, the retrieval performance scores are averaged over all test queries in the dataset. For i), precision and recall are computed as

precision = (#retrieved relevant samples) / (#all retrieved samples)  and  recall = (#retrieved relevant samples) / (#all relevant samples).

For ii), the proportion of true neighbors in top-k retrieval is calculated as (#retrieved true neighbors) / k. For iii), each test sample is classified by majority voting among its K nearest neighbors.
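For concreteness, the three criteria can be computed per query as in the following sketch (illustrative names only: retrieved is the ranked retrieval list, relevant and true_neighbors are ground-truth sets, and neighbor_labels holds the labels of a query's nearest neighbors).

```python
from collections import Counter

def precision_recall(retrieved, relevant):
    hits = sum(1 for r in retrieved if r in relevant)
    precision = hits / max(len(retrieved), 1)
    recall = hits / max(len(relevant), 1)
    return precision, recall

def top_k_neighbor_proportion(retrieved, true_neighbors, k=50):
    # proportion of true neighbors among the top-k retrieved samples
    return sum(1 for r in retrieved[:k] if r in true_neighbors) / k

def knn_classify(neighbor_labels, K=3):
    # majority vote over the labels of the K nearest neighbors
    return Counter(neighbor_labels[:K]).most_common(1)[0][0]
```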

Quantitative comparison results Figs. 1–6 show the retrieval and classification performances of all the hashing methods using different code lengths on the six datasets. In each of these figures, we report quantitative comparison results in three respects: 1) the average precision-recall performance using the maximum code length, with the average precisions and standard deviations shown in the legend of each figure; 2) the average proportion of true nearest neighbors in top-50 retrieval at different code lengths, with the average proportions and standard deviations at the maximum code length shown in the legend; and 3) the average K-nearest-neighbor classification performance at different code lengths, with the average classification errors and standard deviations at the maximum code length shown in the legend.

From Figs. 1–6, we can clearly see that the proposed CGHash obtains larger areas under the precision-recall curves than the competing hashing methods. In addition, we observe that CGHash achieves higher proportions of true nearest neighbors in top-50 retrieval in most cases. Moreover, CGHash has lower classification errors than the competing methods in most cases.

Fig. 7 shows the retrieval and classification performances of the proposed CGHash using different values of K on the SCENE-15 dataset. It is seen from Fig. 7 that, in general, the performance is improved as K increases.

In addition, Fig. 8 shows two retrieval examples on the MNIST and LABELME datasets. From Fig. 8, we observe that CGHash obtains visually accurate nearest-neighbor search results.


[Figure 5: plots not reproduced in this text version.]

Figure 5. The retrieval and classification performances of the proposed CGHash and 10 other hashing methods on the USPS dataset. The description of each plot is the same as in the previous figures.

[Figure 6: plots not reproduced in this text version.]

Figure 6. The retrieval and classification performances of the proposed CGHash and 10 other hashing methods on the PASCAL07 dataset. The description of each plot is the same as in the previous figures.

[Figure 7: plots not reproduced in this text version.]

Figure 7. The retrieval and classification performances of the proposed CGHash using different values of K (K ∈ {3, 10, 20, 30}) on the SCENE-15 dataset. The left plot shows the average precision-recall performances using 60 bits. The middle plot displays the average performances using different code lengths, measured as the proportion of the true nearest neighbors with top-50 retrieval. The right plot shows the average 3-nearest-neighbor classification performances using different code lengths.

Figure 8. Two retrieval examples for CGHash on the LABELME and MNIST datasets. The left part shows query samples while the right part displays the first few nearest neighbors obtained using CGHash.


Conclusion We have proposed a novel hashing method that is implemented using column generation based convex optimization. By taking into account a set of constraints on triplet-based relative rankings, the proposed hashing method is capable of learning compact hash codes. These constraints are incorporated into the large-margin learning framework, and hash functions are then learned iteratively using column generation. Experimental results on several datasets have shown that the proposed hashing method achieves improved performance compared with state-of-the-art hashing methods in nearest-neighbor classification, precision-recall, and the proportion of true nearest neighbors retrieved.

This work is in part supported by ARC grants LP120200485 and FT120100969.


References

Andoni, A. and Indyk, P. Near-optimal hashing algorithms for approximate nearest neighbor in high dimensions. In Proc. IEEE Symp. Foundations of Computer Science, pp. 459–468, 2006.

Baluja, S. and Covell, M. Learning to hash: forgiving hash functions and applications. Data Mining & Knowledge Discovery, 17(3):402–430, 2008.

Boyd, S. and Vandenberghe, L. Convex Optimization. Cambridge University Press, 2004.

Deerwester, S., Dumais, S. T., Furnas, G. W., Landauer, T. K., and Harshman, R. Indexing by latent semantic analysis. J. American Society for Information Science, 41(6):391–407, 1990.

Demiriz, A., Bennett, K. P., and Shawe-Taylor, J. Linear programming boosting via column generation. Machine Learning, 46(1):225–254, 2002.

Gong, Y., Lazebnik, S., Gordo, A., and Perronnin, F. Iterative quantization: a procrustean approach to learning binary codes for large-scale image retrieval. IEEE Trans. Pattern Analysis & Machine Intelligence, 2012.

Korman, S. and Avidan, S. Coherency sensitive hashing. In Proc. Int. Conf. Computer Vision, pp. 1607–1614, 2011.

Kulis, B. and Darrell, T. Learning to hash with binary reconstructive embeddings. In Proc. Adv. Neural Information Process. Systems, 2009.

Kulis, B. and Grauman, K. Kernelized locality-sensitive hashing for scalable image search. In Proc. Int. Conf. Computer Vision, pp. 2130–2137, 2009.

Li, X., Lin, G., Shen, C., van den Hengel, A., and Dick, A. Supplementary document: Effectively learning hash functions using column generation. Available at: http://cs.adelaide.edu.au/~chhshen/paper.html, 2013.

Liu, W., Wang, J., Kumar, S., and Chang, S. F. Hashing with graphs. In Proc. Int. Conf. Machine Learning, 2011.

Liu, W., Wang, J., Ji, R., Jiang, Y. G., and Chang, S. F. Supervised hashing with kernels. In Proc. IEEE Conf. Computer Vision & Pattern Recognition, 2012.

Norouzi, M. and Fleet, D. J. Minimal loss hashing for compact binary codes. In Proc. Int. Conf. Machine Learning, 2011.

Salakhutdinov, R. and Hinton, G. Semantic hashing. Int. J. Approximate Reasoning, 50(7):969–978, 2009.

Schultz, M. and Joachims, T. Learning a distance metric from relative comparisons. In Proc. Adv. Neural Information Processing Systems, 2004.

Shakhnarovich, G., Viola, P., and Darrell, T. Fast pose estimation with parameter-sensitive hashing. In Proc. Int. Conf. Computer Vision, pp. 750–757, 2003.

Shen, C., Kim, J., Wang, L., and van den Hengel, A. Positive semidefinite metric learning using boosting-like algorithms. J. Machine Learning Research, 13:1007–1036, 2012.

Strecha, C., Bronstein, A. M., Bronstein, M. M., and Fua, P. LDAHash: Improved matching with smaller descriptors. IEEE Trans. Pattern Analysis & Machine Intelligence, 2011.

Torralba, A., Fergus, R., and Weiss, Y. Small codes and large image databases for recognition. In Proc. IEEE Conf. Computer Vision & Pattern Recognition, pp. 1–8, 2008.

Wang, J., Kumar, S., and Chang, S. F. Semi-supervised hashing for large scale search. IEEE Trans. Pattern Analysis & Machine Intelligence, 2012.

Weinberger, K. Q., Blitzer, J., and Saul, L. K. Distance metric learning for large margin nearest neighbor classification. In Proc. Adv. Neural Information Processing Systems, 2006.

Weiss, Y., Torralba, A., and Fergus, R. Spectral hashing. In Proc. Adv. Neural Information Process. Systems, 2008.

Zhang, D., Wang, J., Cai, D., and Lu, J. Laplacian co-hashing of terms and documents. In Proc. Eur. Conf. Information Retrieval, pp. 577–580, 2010a.

Zhang, D., Wang, J., Cai, D., and Lu, J. Self-taught hashing for fast similarity search. In Proc. ACM SIGIR Conf., pp. 18–25, 2010b.

Zhu, C., Byrd, R. H., Lu, P., and Nocedal, J. Algorithm 778: L-BFGS-B: Fortran subroutines for large-scale bound-constrained optimization. ACM Trans. Math. Softw., 23(4):550–560, 1997.