
Training Invariant Support Vector Machines using Selective Sampling

Gaëlle Loosli, Stéphane Canu, Léon Bottou

November 2005

Abstract

Bordes et al. [3] describe the efficient online LASVM algorithm using selective sampling. On the other hand, Loosli et al. [24] propose a strategy for handling invariance in SVMs, also using selective sampling. This paper combines the two approaches to build a very large SVM. We present state-of-the-art results obtained on a handwritten digit recognition problem with 8 million points on a single processor. This work also demonstrates that online SVMs can effectively handle really large databases.

1 Introduction

Because many patterns are insensitive to certain transformations such as rotations or translations, it is widely admitted that the quality of a pattern recognition system can be improved by taking invariance into account. Very different ways to handle invariance in machine learning algorithms have been proposed [14, 20, 35, 22, 34].

In the case of kernel machines, three general approaches have been proposed. The first approach consists of learning orbits instead of points. It requires costly semi-definite programming algorithms [17]. The second approach involves specialized kernels. This turns out to be equivalent to mapping the patterns into a space of invariant features [8]. Such features are often difficult to construct. The third approach is the most general. It consists of artificially enlarging the training set with new examples obtained by deforming the available input data [33]. This approach suffers from high computational costs. Very large datasets can be generated this way. For instance, by generating 134 random deformations per digit, we have increased the MNIST training set size from 60,000 to more than 8 million examples. A batch optimization algorithm working on the augmented database needs to go through the entire dataset several times until convergence. Either we store the whole dataset or we compute the deformed examples on demand, trading reduced memory requirements for increased computation time.

We propose two tools to reduce these computational costs:

• Selective sampling lets our system select informative points and discard useless ones. This property reduces the number of samples to be stored.

• Unlike batch optimization algorithms, online learning algorithms do not need to access each example again and again until convergence. Such repeated accesses are particularly costly when the deformed examples are computed on demand. This is even more wasteful because most examples are readily discarded by the selective sampling methods.

1.1 Large scale learning

Running machine learning algorithms on very large scale problems is still difficult. For instance, Bennett and Campbell [1] discuss the problem of scaling SVMs “to massive datasets” and point out that learning algorithms are typically quadratic and imply several scans of the data. Three factors limit machine learning algorithms when both the sample size n and the input dimension d grow very large. First, the training time becomes unreasonable. Second, the size of the solution affects the memory requirements during both the training and recognition phases. Third, labeling the training examples becomes very expensive. To address those limitations, there have been many clever studies on solving quadratic optimization problems [7, 18], on online learning [4, 19, 11], on sparse solutions [39], and on active learning [9, 6].

The notion of computational complexity discussed in this paper is tied to the empirical performance of algorithms. Three common strategies can be distinguished to reduce this practical complexity (or observed training time). The first strategy consists in working on subsets of the training data, solving several smaller problems instead of a large one, as in the SVM decomposition method [30]. The second strategy consists in parallelizing the learning algorithm. The third strategy tries to design less complex algorithms that give an approximate solution with equivalent or superior performance. For instance, early stopping approximates the full optimization without compromising the test set accuracy. This paper focuses on this third strategy.


1.2 Learning complexity

The core of many learning algorithms essentially amounts to solving a system of linear equations of the form Ax = b where A is an n×n input matrix, b an output vector and x an unknown vector. There are many classical methods to solve such systems [16]. For instance, if we know that A is symmetric and positive semi-definite, we can factorize it as A = LL⊤ where L is a lower triangular matrix representing the square root of A. The factorization costs O(n³) operations. Similarly to this Cholesky factorization, there are many methods for other kinds of matrices A, such as QR, LU, etc. Their complexity is always O(n³). They lead to the exact solution, even though in practice some are more numerically stable or slightly faster. When the problem becomes large, a method with cubic complexity is not a realistic approach.

Iterative methods are useful for larger problems. The idea is to start from some initial vector x0 (null or random) and to iteratively update it by performing steps of size ρk along directions dk, that is xk+1 = xk + ρk dk. Choosing the direction is the difficult part. The optimization literature suggests that first order gradients yield very poor directions. Finding a second order direction costs O(n²) operations (conjugate gradients, etc.) and provides the exact solution in n iterations, for a total of O(n³) operations. However, an approximate solution can be obtained in fewer steps. Therefore these algorithms have a practical complexity of O(kn²) where k is the number of steps required to reach a good enough approximation. Hence we have algorithms for larger problems, but they remain too costly for really very large datasets.
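To make the preceding complexity discussion concrete, here is a minimal conjugate-gradient sketch in Python (our illustration, not part of the original paper); the iteration limit k_max and the tolerance are arbitrary choices. Stopping after k iterations instead of running all n of them yields the O(kn²) practical complexity mentioned above.

    import numpy as np

    def conjugate_gradient(A, b, k_max=100, tol=1e-6):
        """Approximately solve A x = b for a symmetric positive definite A.

        Each iteration costs one matrix-vector product, i.e. O(n^2) operations,
        so stopping after k iterations gives the O(k n^2) behaviour discussed
        in the text."""
        x = np.zeros_like(b, dtype=float)
        r = b - A @ x              # residual
        d = r.copy()               # search direction
        rs_old = r @ r
        for _ in range(k_max):
            Ad = A @ d
            rho = rs_old / (d @ Ad)          # step size along direction d
            x = x + rho * d
            r = r - rho * Ad
            rs_new = r @ r
            if np.sqrt(rs_new) < tol:        # good enough approximation
                break
            d = r + (rs_new / rs_old) * d    # new conjugate direction
            rs_old = rs_new
        return x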

Additional progress can be achieved by exploiting sparsity. The idea is to constrain the number of non-null coefficients in vector x. Approximate computations can indicate which coefficients will be zero with high confidence. Then the remaining subsystem can be solved with any of the previously described methods. This approach corresponds to the fact that many examples bring very little additional information. The hope is then to obtain a solution with an empirical complexity O(n^d) with d close to 2. This idea is exploited by modern SVM algorithms.

During his optimization lecture, Nemirovski [28] said that the extremely large-scale case (n ≫ 10^5) rules out all advanced convex optimization techniques. He argues that there is only one option left: first order gradient methods. We are convinced that other approximate techniques can exploit the stochastic regularities of learning problems. For instance, the LASVM method [2, 3] seems considerably more efficient than the first order online SVMs discussed in [19].


1.3 Online learning

Online learning algorithms are usually associated with problems where the complete training set is not available beforehand. However, their computational properties are very useful for large scale learning. A well designed online algorithm needs less computation to reach the same test set accuracy as the corresponding batch algorithm [26, 5].

Two overlapping frameworks have been used to study online learning algorithms: leveraging the mathematics of stochastic approximations [4], or refining the mathematics of the Perceptron [29]. The Perceptron seems a natural starting point for online SVMs. Algorithms derived from the Perceptron share common characteristics. Each iteration consists of two steps. First, one decides whether the new point xt should influence the decision rule; then one updates the decision rule. The Perceptron, for instance, only updates its parameter wt−1 if the point xt is misclassified. The updated parameters are obtained by performing a step along direction xt, that is wt = wt−1 + yt xt. Compared to maximum margin classifiers such as SVMs, the Perceptron runs much faster but does not deliver as good a generalization performance.
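As a point of reference, the classical Perceptron update described above can be sketched in a few lines of Python (an illustration, not code from the paper; the bias term and the number of passes are our own choices):

    import numpy as np

    def perceptron(X, y, epochs=10):
        """Train a linear Perceptron on patterns X (n x d) and labels y in {-1, +1}."""
        w = np.zeros(X.shape[1])
        b = 0.0
        for _ in range(epochs):
            for xt, yt in zip(X, y):
                # step 1: decide whether the point should influence the decision rule
                if yt * (w @ xt + b) <= 0:       # xt is misclassified
                    # step 2: update along direction xt, i.e. w <- w + yt * xt
                    w += yt * xt
                    b += yt
        return w, b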

Many authors [12, 13, 15, 23, 11] have modified the Perceptron algorithm to ensure a margin. Older variants of the Perceptron, such as minover and adatron in [27], are also very close to SVMs.

1.4 Active learning

Active learning addresses problems where obtaining labels is expensive [9]. The learner has access to a vast pool of unlabeled examples, but is only allowed to obtain a limited number of labels. Therefore it must carefully choose which examples deserve the labeling expense.

Even when all labels are available beforehand, active learning is useful because it leads to sparser solutions [32, 3]. Moreover, the criterion for deciding whether to ask for a label may be cheaper to compute than processing the labeled point itself.

1.5 Outline of the paper

Section 2 briefly presents SVMs and describes how the LASVM algorithm combines online and active characteristics. Section 3 discusses invariance and presents selective sampling strategies to address it. Finally, section 4 reports experiments and results on a large scale invariant problem.


2 Online algorithm with selective sampling

This section first discusses the geometry of the quadratic optimization problem and its suitability to algorithms, such as SMO [31], that iterate feasible direction searches. Then we describe how to organize feasible direction searches into an online learning algorithm amenable to selective sampling [3].

Consider a binary classification problem with training patterns x1 . . . xn and associated classes y1 . . . yn ∈ {+1, −1}. A soft margin SVM [10] classifies a pattern x according to the sign of a decision function

f(x) = ∑i αi ⟨x, xi⟩ + b        (1)

where the notation ⟨x, x′⟩ represents the dot-product of the feature vectors associated with the patterns x and x′. Such a dot-product is often defined implicitly by means of a Mercer kernel [10]. The coefficients αi in (1) are obtained by solving the following quadratic optimization (QP) problem¹:

maxα  W(α) = ∑i αi yi − (1/2) ∑i,j αi αj ⟨xi, xj⟩
subject to  ∑i αi = 0  and  0 ≤ yi αi ≤ C        (2)

¹ Note that αi is positive when yi = +1 and negative when yi = −1.

A pattern xi is called a “support vector” when the corresponding coefficient αi is non zero. The number s of support vectors asymptotically grows linearly with the number n of examples [38].
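As an illustration (not the authors' code), evaluating the decision function (1) with an RBF kernel, the kernel later used in section 4, can be sketched as follows; the sign convention of footnote 1 is assumed, so each coefficient already carries the label of its example:

    import numpy as np

    def rbf_kernel(x, xp, gamma=0.006):
        """RBF kernel exp(-gamma * ||x - x'||^2), as used in section 4."""
        return np.exp(-gamma * np.sum((x - xp) ** 2))

    def decision_function(x, support_vectors, alphas, b, gamma=0.006):
        """Evaluate f(x) = sum_i alpha_i <x, x_i> + b over the support vectors.

        With the paper's sign convention, alpha_i is positive for y_i = +1 and
        negative for y_i = -1, so the labels do not appear explicitly."""
        return sum(a * rbf_kernel(x, sv, gamma)
                   for a, sv in zip(alphas, support_vectors)) + b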

2.1 Feasible direction algorithms

The geometry of the SVM QP problem (2) is summarized in figure 1. The box constraints 0 ≤ yiαi ≤ C restrict the solutions to an n-dimensional hypercube. The equality constraint ∑i αi = 0 further restricts the solutions to a (n−1)-dimensional polytope F.

Consider a feasible point αt ∈ F. A direction ut is a feasible direction if the half-line {αt + λut, λ ≥ 0} intersects the polytope in points other than αt. Feasible direction algorithms iteratively update αt by first choosing a feasible direction ut and searching the half-line for the feasible point αt+1 that maximizes the cost function. The optimum is reached when no further improvement is possible [41].

The quadratic cost function restricted to the half-line search might reach its maximum inside or outside the polytope (see figure 2). The new feasible point αt+1 is easily derived from the differential information at αt and from the location of the bounds A and B induced by the box constraints.

αt+1 = αt + max{A, min{B, C}} ut    with    C = (dW(αt + λut)/dλ) / (d²W(αt + λut)/dλ²)        (3)

Computing these derivatives for arbitrary directions ut is potentially expensive because it involves all n² terms of the dot-product matrix ⟨xi, xj⟩. However, it is sufficient to pick ut from a well chosen finite set of directions [3, appendix], preferably with many zero coefficients. The SMO algorithm [31] exploits this opportunity by only considering feasible directions that modify just two coefficients αi and αj by opposite amounts. The most common variant selects the pairs (i, j) defining the successive search directions using a first order criterion:

i = arg maxs { ∂W/∂αs  such that  αs < max(0, ys C) }
j = arg mins { ∂W/∂αs  such that  αs > min(0, ys C) }        (4)

The time required for solving the SVM QP problem grows like n^β with 2 ≤ β ≤ 3 when the number of examples n increases [3, section 2.1]. Meanwhile the kernel matrix ⟨xi, xj⟩ becomes too large to store in memory. Computing kernel values on the fly vastly increases the computing time. Modern SVM solvers work around this problem using a cache of recently computed kernel matrix coefficients.
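The following sketch (our illustration, not the authors' implementation) puts (3) and (4) together for a precomputed kernel matrix: it selects a violating pair with the first order criterion, then performs one clipped direction search.

    import numpy as np

    def smo_step(K, y, alpha, C):
        """One SMO-style feasible direction search on a precomputed kernel matrix K.

        y holds the labels in {-1, +1} and alpha is a float vector satisfying
        sum(alpha) = 0 and 0 <= y_i alpha_i <= C."""
        g = y - K @ alpha                                # gradient dW/dalpha
        can_increase = alpha < np.where(y > 0, C, 0.0)   # alpha_s < max(0, y_s C)
        can_decrease = alpha > np.where(y > 0, 0.0, -C)  # alpha_s > min(0, y_s C)
        i = int(np.argmax(np.where(can_increase, g, -np.inf)))
        j = int(np.argmin(np.where(can_decrease, g, np.inf)))
        # unconstrained optimum of the line search along direction e_i - e_j ...
        curvature = K[i, i] + K[j, j] - 2.0 * K[i, j]
        lam = (g[i] - g[j]) / max(curvature, 1e-12)
        # ... clipped so that alpha_i and alpha_j stay inside their box constraints
        room_i = (C if y[i] > 0 else 0.0) - alpha[i]
        room_j = alpha[j] - (0.0 if y[j] > 0 else -C)
        lam = max(0.0, min(lam, room_i, room_j))
        alpha[i] += lam
        alpha[j] -= lam
        return alpha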

2.2 Learning is easier than optimizing

The SVM quadratic optimization problem (2) is only a sophisticated statistical proxy, defined on the finite training set, for the actual problem, namely classifying future patterns correctly. Therefore it makes little sense to optimize with an accuracy that vastly exceeds the uncertainty arising from the use of a finite number of training examples.

Online learning algorithms exploit this fact. Since each example is only processed once, such algorithms can rarely compute the optimum of the objective function. Nevertheless, many online learning algorithms come with formal generalization performance guarantees. In certain cases, it is even possible to prove that suitably designed online algorithms outspeed the direct optimization of the corresponding objective function [5]: they do not optimize the objective function as accurately, but they reach an equivalent test error more quickly.

Researchers therefore have sought efficient online algorithms for kernel machines. For instance, the Budget Perceptron [11] demonstrates that online kernel algorithms should be able to both insert and remove support vectors from the current kernel expansion. The Huller [2] shows that both insertion and deletion can be derived from incremental increases of the SVM objective function W(α).

2.3 The LASVM online SVM algorithm

Bordes et al. [3] describe an online SVM algorithm that incrementally increases the dual objective function W(α) using feasible direction searches. LASVM maintains a current coefficient vector αt and the associated set of support vector indices St. Each LASVM iteration receives a fresh example (xσ(t), yσ(t)) and updates the current coefficient vector αt by performing two feasible direction searches named “process” and “reprocess”.

• Process is an SMO direction search (3) along the direction defined by the pair formed with the current example index σ(t) and another index chosen among the current support vector indices St using the first order criterion (4). Example σ(t) might become a new support vector in the resulting α′t and S′t.

• Reprocess is an SMO direction search (3) along the direction defined by a pair of indices (i, j) chosen among the current support vectors S′t using the first order criterion (4). Examples i and j might no longer be support vectors in the resulting αt+1 and St+1.

Repeating LASVM iterations on randomly chosen training set examples provably converges to the SVM solution with arbitrary accuracy. Empirical evidence indicates that a single presentation of each training example is sufficient to achieve training errors comparable to those achieved by the SVM solution. After presenting all training examples in random order, it is useful to tune the final coefficients by running reprocess until convergence.


Online LASVM
1: Initialize α0
2: while there are training examples left do
3:     Select an unseen training example (xσ(t), yσ(t))
4:     Process(σ(t))
5:     Reprocess
6: end while
7: Finish: repeat Reprocess until convergence
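The loop can be re-expressed schematically in Python (a sketch only; process, reprocess and is_converged stand for the operations described above and are not actual library functions):

    def lasvm(stream, process, reprocess, is_converged):
        """Schematic LASVM main loop: one process/reprocess pair per fresh example,
        followed by a finishing step that repeats reprocess until convergence."""
        for example in stream:       # single pass over the training examples
            process(example)         # may insert the fresh example as a support vector
            reprocess()              # may remove current support vectors
        while not is_converged():    # finishing step
            reprocess()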

This single pass algorithm runs faster than SMO and needs much less memory. Useful orders of magnitude can be obtained by evaluating how large the kernel cache must be to avoid the systematic recomputation of kernel values. Let n be the number of examples and s the number of support vectors, which is smaller than n. Furthermore, let r ≤ s be the number of free support vectors, that is, support vectors such that 0 < yiαi < C. Whereas the SMO algorithm requires nr cache entries to avoid the systematic recomputation of kernel values, the LASVM algorithm only needs sr entries [3].

2.4 Selective sampling

Each iteration of the above algorithm selects a training example (xσ(t), yσ(t)) randomly. More sophisticated example selection strategies yield further scalability improvements. Bordes et al. [3] suggest four example selection strategies:

• Random selection: pick a random unseen training example.

• Gradient selection: pick the most poorly classified example (smallest value of yk f(xk)) among 50 randomly selected unseen training examples. This criterion is very close to that used in [25].

• Active selection: pick the training example that is closest to the decision boundary (smallest value of |f(xk)|) among 50 randomly selected unseen training examples. This criterion chooses a training example independently of its label.

• Auto-active selection: randomly sample at most 100 unseen training examples but stop as soon as 5 of them fall inside the margins (any of these examples would be inserted in the set of support vectors). Pick among these 5 examples the one closest to the decision boundary (smallest value of |f(xk)|). A sketch of this criterion follows the list.
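A possible rendering of the auto-active criterion is sketched below (illustrative only; f denotes the current decision function, unseen a pool of candidate labeled examples, and the margin test y f(x) < 1 is our reading of “falls inside the margins”):

    import random

    def autoactive_select(unseen, f, sample_size=100, margin_count=5):
        """Sample at most sample_size unseen examples, stop once margin_count of
        them fall inside the margins, and return the candidate closest to the
        decision boundary (smallest |f(x)|), or None if no candidate qualifies."""
        candidates = []
        for x, y in random.sample(unseen, min(sample_size, len(unseen))):
            if y * f(x) < 1.0:                       # inside the margins
                candidates.append((abs(f(x)), x, y))
                if len(candidates) >= margin_count:
                    break
        if not candidates:
            return None
        return min(candidates, key=lambda c: c[0])[1:]   # (x, y) closest to the boundary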

Empirical evidence shows that the active and auto-active criteria yield comparable or better performance levels using a smaller number of support vectors. This is understandable because the linear growth of the number of support vectors is related to the fact that soft margin SVMs make a support vector out of every misclassified training example. Selecting training examples near the boundary excludes a large number of examples that are uninformative outliers. The reduced number of support vectors further improves both the learning speed and the memory footprint.

The following section presents the challenge set by invariance and discusses how the low memory requirements and the selective sampling capabilities of LASVM are well suited to the task.

3 Invariance

Many pattern recognition problems have invariance properties: the class remains largely unchanged when specific transformations are applied to the pattern. Object recognition in images is invariant under lighting changes, translations and rotations, mild occlusions, etc. Although humans handle invariance very naturally, computers do not.

In machine learning, a priori knowledge of such invariance properties can be used to improve the pattern recognition accuracy. Many approaches have been proposed [36, 40, 34, 22]. We first illustrate the influence of invariance. Then we describe our selective sampling approach to invariance and discuss practical implementation details.

3.1 On the influence of invariance

Let us first illustrate how we propose to handle invariance. Consider points in the plane belonging to one of two classes, and assume that there is an uncertainty on the point coordinates corresponding to some rotation around the origin. The class labels are therefore expected to remain invariant when one rotates the input pattern around the origin. Figure 3(a) shows the points, their classes and a prospective decision boundary. Figure 3(b) shows the example orbits, that is, the sets of all possible positions reached by applying the invariant transformation to a specific example. All these positions should be given the same label by the classifier. Figure 3(c) shows a decision boundary that takes into account all the potential transformations of the training examples. Figure 3(d) shows that this boundary can be obtained by selecting adequate representatives for the orbits corresponding to each training example.

This simple example illustrates the complexity of the problem. Learning orbits leads to almost intractable problems [17]. Adding virtual examples [34] requires considerable memory simply to store the transformed examples. However, figure 3(d) suggests that we do not need to store all the transformed examples forming an orbit. We only need to add a few well chosen transformed examples.

The LASVM algorithm (section 2) displays interesting properties for this purpose. Because LASVM is an online algorithm, it does not require storing all the transformed examples. Because LASVM uses selective sampling strategies, it provides the means to select the few transformed examples that we think are sufficient to describe the invariant decision boundaries. We therefore hope to solve problems with multiple invariances with milder size and complexity constraints.

Our first approach was inspired by [24]. Each iteration randomly picks the next training example, generates a number of transformed examples describing the orbit of the original example, selects the best transformed example (see section 2.4), and performs the LASVM process/reprocess steps.

Since the online algorithm never revisits a previously seen training example, this first approach cannot pick more than one representative transformed example from each orbit. Problems with multiple invariances are likely to feature complicated orbits that are poorly summarized by a single transformed example. This major drawback can be remedied by revisiting training examples that have generated interesting variations in previous iterations. Alas, this remedy requires either recomputing the example transformations or storing all of them in memory.

Our final approach simply considers a huge virtual training set composed of all examples and all their transformations. Each iteration of the algorithm picks a small sample of randomly transformed training examples, selects the best one using one of the criteria described in section 2.4, and performs the LASVM process/reprocess steps.

This approach can obviously select multiple examples from each orbit. It also provides great flexibility. For instance, it is interesting to bootstrap the learning process by first learning from untransformed examples. Once we have a baseline decision function, we can apply increasingly ambitious transformations.

3.2 Invariance in practice

Learning with invariance is a well studied problem. Several papers explain how to efficiently apply pattern transformations [36, 40, 34]. We use the MNIST database of handwritten digit images because many earlier results have been reported on it [21]. This section explains how we store the original images and how we efficiently compute random transformations of these digit images on the fly.

3.2.1 Tangent vectors

Simard et al. [35] explain how to use Lie algebra and tangent vectors to apply arbitrary affine transformations to images. Affine transformations can be described as the composition of six elementary transformations: horizontal and vertical translations, rotations, horizontal and vertical scale transformations, and hyperbolic transformations. For each image, the method computes a tangent vector for each elementary transformation, that is, the normalized pixel-wise difference between an infinitesimal transformation of the image and the image itself.

Small affine transformations are then easily approximated by adding a linear combination of these six elementary tangent vectors:

xaff(i, j) = x(i, j) + ∑T∈T αT tT(i, j)

where T is the set of elementary transformations, x(i, j) represents the initial image, tT(i, j) is its tangent vector for transformation T, and αT represents the coefficient for transformation T.
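As an illustration (not the authors' code), the two translation tangent vectors can be approximated by finite differences, and a small affine transformation then reduces to the linear combination above:

    import numpy as np

    def translation_tangent_vectors(image):
        """Approximate the horizontal and vertical translation tangent vectors of a
        2-D image by centered finite differences (a simple stand-in for the smoothed
        derivatives used in the tangent vector literature)."""
        image = np.asarray(image, dtype=float)
        tx = np.zeros_like(image)
        ty = np.zeros_like(image)
        tx[:, 1:-1] = (image[:, 2:] - image[:, :-2]) / 2.0   # horizontal translation
        ty[1:-1, :] = (image[2:, :] - image[:-2, :]) / 2.0   # vertical translation
        return tx, ty

    def affine_approximation(image, tangents, coefficients):
        """x_aff = x + sum_T alpha_T t_T over the six elementary tangent vectors."""
        out = np.asarray(image, dtype=float)
        for t_T, alpha_T in zip(tangents, coefficients):
            out = out + alpha_T * t_T
        return out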

3.2.2 Deformation fields

It turns out that the tangent vectors for the six elementary affine transformations can be derived from the tangent vectors tx(i, j) and ty(i, j) corresponding to the horizontal and vertical translations. Each elementary affine transformation can be described by a vector field [fTx(i, j), fTy(i, j)] representing the displacement direction of each pixel (i, j) when one performs an infinitesimal elementary transformation T of the image. The tangent vector for transformation T is then:

tT(i, j) = fTx(i, j) tx(i, j) + fTy(i, j) ty(i, j)

This property allows extending the tangent vector approach from affine transformations to arbitrary elastic transformations, that is, transformations that can be represented by a deformation field [fx(i, j), fy(i, j)]. Small transformations of the image x(i, j) are then easily approximated using a linear operation:

xdeformed(i, j) = x(i, j) + α ∗ (fx(i, j) ∗ tx(i, j) + fy(i, j) ∗ ty(i, j))


Plausible deformation fields [fx(i, j), fy(i, j)] are easily generated by randomly drawing an independent motion vector for each pixel and applying a smoothing filter. Figure 4 shows examples of such deformation fields. Horizontal and vertical vector components are generated independently and can therefore be used interchangeably.
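A deformation field component of this kind can be sketched as follows (illustrative values for the smoothing width and amplitude; scipy's gaussian_filter plays the role of the smoothing filter):

    import numpy as np
    from scipy.ndimage import gaussian_filter

    def random_deformation_field(shape, sigma=4.0, scale=1.0, rng=None):
        """One component (fx or fy) of a plausible deformation field: independent
        random motions per pixel, smoothed and rescaled."""
        rng = np.random.default_rng() if rng is None else rng
        field = rng.uniform(-1.0, 1.0, size=shape)
        field = gaussian_filter(field, sigma=sigma)
        return scale * field / (np.abs(field).max() + 1e-12)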

We also generate transformation fields by adding a controlled amount of random smoothed noise to a transformation field representing a pure affine transformation. This mainly aims at introducing more rotations in the transformations.

3.2.3 Thickening

The horizontal and vertical translation tangent vectors can also be used to implement a thickening transformation [35] that erodes or dilates the digit images:

xthick(i, j) = x(i, j) + β ∗ √(tx(i, j)² + ty(i, j)²),

where β is a coefficient that controls the strength of the transformation. Choosing β < 0 makes the strokes thinner. Choosing β > 0 makes them thicker.

3.2.4 The infinite virtual training set

As discussed before, we cannot store all the transformed examples in memory. However, we can efficiently generate random transformations by combining the above methods:

xtrans(i, j) = x(i, j) + αx ∗ fx(i, j) ∗ tx(i, j) + αy ∗ fy(i, j) ∗ ty(i, j) + β ∗ √(tx(i, j)² + ty(i, j)²)

We only store each initial image x(i, j) along with its translation tangent vectors tx(i, j) and ty(i, j). We also store a collection of pre-computed deformation fields that can be used interchangeably as fx(i, j) or fy(i, j). The scalar coefficients αx, αy and β provide further transformation variability.
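Putting the pieces together, one random transformation can be sketched as a single linear operation on the stored quantities (our notation mirrors the formula above):

    import numpy as np

    def random_transformation(image, tx, ty, fx, fy, alpha_x, alpha_y, beta):
        """x_trans = x + alpha_x*fx*tx + alpha_y*fy*ty + beta*sqrt(tx^2 + ty^2),
        using only the stored image and its two translation tangent vectors."""
        thickening = np.sqrt(tx ** 2 + ty ** 2)
        return (image
                + alpha_x * fx * tx
                + alpha_y * fy * ty
                + beta * thickening)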

Figure 6 shows examples of all the combined transformations. We can generate as many examples as we need this way, playing on the choice of deformation fields and scalar coefficients.


3.2.5 Large translations

All the transformations described above are small sub-pixel transformations. Even though the MNIST digit images are roughly centered, experiments indicate that we still need to implement invariance with respect to translations of magnitude one or two pixels. Thus we also apply randomly chosen translations of one or two pixels. These full-pixel translations come on top of the sub-pixel translations implemented by the random deformation fields.

4 Application

This section reports experimental results achieved on the MNIST database using the techniques described in the previous section. We have obtained state-of-the-art results using 10 SVM classifiers in one-versus-rest configuration. Each classifier is trained using 8 million transformed examples with the standard RBF kernel ⟨x, x′⟩ = exp(−γ‖x − x′‖²). The soft-margin parameter C was always 1000.

As explained before, the untransformed training examples and their two translation tangent vectors are stored in memory. Transformed examples are computed on the fly and cached. We allowed 500MB for the cache of transformed examples and 6.5GB for the cache of kernel values. Indeed, despite the favorable characteristics of our algorithm, dealing with millions of examples quickly yields tens of thousands of support vectors.

4.1 Deformation settings

One could argue that handling more transformations always reduces the test set error. In fact, we simply want a classifier that is invariant to the transformations that reflect the typical discrepancies between our training examples and our test examples. Making the system invariant to stronger transformations might be useful in practice, but could reduce the performance on our particular test set.

We used cross-validation to select both the kernel bandwidth parameter and the strength of the transformations applied to our initial examples. The cross-validation parameters are the deformation strength α and the bandwidth γ of the RBF kernel. At the same time, we also estimated whether the thickening transform and the 1 or 2 pixel translations are desirable.

Figure 7 reports the SVM error rates measured for various configurations on a validation set of 10,000 points taken from the standard MNIST training set. The training set was composed by picking 5,000 other points from the MNIST training set and applying 10 random transformations to each point. We see that thickening is not a relevant transformation for the MNIST problem. We also observe that 1 pixel translations are very useful, and that it is not necessary to use 2 pixel translations.

4.2 Example selection

One of the claims of our work is the ability to implement invariant classifiers by selecting a moderate number of transformed training examples. Otherwise the size of the kernel expansion would grow quickly and make the classifier impractically slow.

We implemented the example selection criteria discussed in section 2.4. Figure 8 compares error rates (left), numbers of support vectors (center), and training times (right) for three different selection criteria: random selection, active selection, and auto-active selection. These results were obtained using 100 random transformations of each of the 60,000 MNIST training examples. The graphs also show the results obtained on the 60,000 MNIST training examples without random transformations.

The random and auto-active selection criteria give the best test errors. The auto-active criterion, however, yields a much smaller number of support vectors and trains much faster. Therefore we chose the auto-active selection criterion for the following experiments.

4.3 The finishing step

After presenting the last training example, Bordes et al. [3] suggest tuning the final solution by running reprocess until convergence. This amounts to optimizing the SVM objective function restricted to the remaining support vectors. This operation is known as the “finishing step”. In our case, we never see the last training example since we can always generate more.

At first, we simply eliminated this finishing step. However, we noticed that after processing a large number of examples (between 5 and 6 million depending on the class) the number of support vectors decreases slowly. There is in fact an implicit finishing step: after a sufficient time, the process operation seldom does anything because hardly any of the selected examples needs to be added to the expansion. Meanwhile the reprocess step remains active.

We then decided to perform a finishing step every 600,000 points. We observed the same reduction of the number of support vectors, but earlier in the learning process (see figure 9). These additional finishing steps seem useful for the global task, and we achieved our best results using this setup. However, this observation raises several questions. Is it enough to perform a single reprocess step after each example selection? Can we get faster and better results?

4.4 The number of reprocess steps

As said before, the LASVM algorithm does not explicitly define how much optimization should be performed after processing each example. To explore this issue, we ran several variants of the LASVM algorithm on 5 random transformations of 10,000 MNIST examples.

                  1R/1P    2R/1P    3R/1P    4R/1P    5R/1P
Max size          24531    21903    21436    20588    20029
Removed pts        1861     1258      934      777      537
Proportion         7.6%     5.7%     4.3%     3.7%     2.6%
Train time (s)     1621     1548     1511     1482     1441
Error rate        2.13%    2.08%    2.19%    2.09%    2.07%

                  1R each  2R each  3R each
Max size          23891    21660    20596
Removed pts        1487      548      221
Proportion         6.2%     2.5%     1.0%
Train time (s)     1857     1753     1685
Error rate        2.06%    2.17%    2.13%

Table 1: Effects of the Process/Reprocess trade-off on the performance, with an RBF kernel of bandwidth γ = 0.006. The table compares different trade-offs between Process and Reprocess: we change the number of consecutive Reprocess steps after a Process, and also after each incoming point, even if it is not processed.

The variants denoted “nR/1P” consist of performing n reprocess steps after selecting a transformed training example and performing a process step. The variants denoted “nR each” consist of performing n reprocess steps after each example, regardless of whether the example was selected for a process step.


Table 1 shows the number of support vectors before and after the finishing step, the training times, and the test performance measured on an independent validation set of 10,000 examples. Neither optimizing a lot (last column) nor optimizing very little (first column) is a good setup. In terms of training time, the best combinations for this data set are “4R/1P” and “5R/1P”. These results certainly show that achieving the right balance remains an obscure aspect of the LASVM algorithm.

4.5 Final results

Number of binary classifiers                      10
Number of examples for each binary classifier     8,100,000
Thickening transformation                         no
Additional translations                           1 pixel
RBF kernel bandwidth (γ)                          0.006
Example selection criterion                       auto-active
Finishing step                                    every 600,000 examples
Full training time                                8 days
Test set error                                    0.67%

Table 2: Summary of our final experiment.

Table 2 summarizes our final results. We first bootstrapped the system using the original MNIST training set and 4 random deformations of each example. Then we expanded the database with 130 further random transformations, performing a finishing step every 600,000 examples. The final accuracy matches the results obtained using virtual support vectors [33] on the original MNIST test set. Slightly better performances have been reported using convolutional networks [37], or using a deskewing algorithm to make the test set easier [33].

5 Conclusion

We have shown how to address large invariant pattern recognition problems using selective sampling and online algorithms. We have also demonstrated that these techniques scale remarkably well. It is now possible to run SVMs on millions of examples in a relatively high dimensional input space (here 784) using a single processor. Because we only keep a few thousand support vectors per classifier, we can handle millions of training examples.


References

[1] K. P. Bennett and C. Campbell. Support vector machines: Hype or hallelujah? SIGKDD Explorations, 2(2):1–13, 2000.

[2] A. Bordes and L. Bottou. The Huller: a simple and efficient online SVM. In Machine Learning: ECML 2005, Lecture Notes in Artificial Intelligence, LNAI 3720, pages 505–512. Springer Verlag, 2005.

[3] A. Bordes, S. Ertekin, J. Weston, and L. Bottou. Fast kernel classifiers with online and active learning. Journal of Machine Learning Research, 6:1579–1619, September 2005.

[4] L. Bottou. Online algorithms and stochastic approximations. In David Saad, editor, Online Learning and Neural Networks. Cambridge University Press, Cambridge, UK, 1998.

[5] L. Bottou and Y. LeCun. Large scale online learning. In Sebastian Thrun, Lawrence Saul, and Bernhard Schölkopf, editors, Advances in Neural Information Processing Systems 16, Cambridge, MA, 2004. MIT Press.

[6] C. Campbell, N. Cristianini, and A. Smola. Query learning with large margin classifiers. In Proc. 17th International Conf. on Machine Learning, pages 111–118. Morgan Kaufmann, San Francisco, CA, 2000.

[7] C.-C. Chang and C.-J. Lin. LIBSVM: a library for support vector machines, 2001. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.

[8] O. Chapelle and B. Schölkopf. Incorporating invariances in nonlinear SVMs. In T. G. Dietterich, S. Becker, and Z. Ghahramani, editors, Advances in Neural Information Processing Systems, volume 14, pages 609–616, Cambridge, MA, USA, 2002. MIT Press.

[9] D. A. Cohn, Z. Ghahramani, and M. I. Jordan. Active learning with statistical models. In G. Tesauro, D. Touretzky, and T. Leen, editors, Advances in Neural Information Processing Systems, volume 7, pages 705–712. The MIT Press, 1995.

[10] C. Cortes and V. Vapnik. Support vector networks. Machine Learning, 20:1–25, 1995.


[11] K. Crammer, J. Kandola, and Y. Singer. Online classification on a budget. In S. Thrun, L. Saul, and B. Schölkopf, editors, Advances in Neural Information Processing Systems 16. MIT Press, Cambridge, MA, 2004.

[12] Y. Freund and R. E. Schapire. Large margin classification using the perceptron algorithm. In J. Shavlik, editor, Machine Learning: Proceedings of the Fifteenth International Conference, San Francisco, CA, 1998. Morgan Kaufmann.

[13] T. Frieß, N. Cristianini, and C. Campbell. The kernel Adatron algorithm: a fast and simple learning procedure for support vector machines. In J. Shavlik, editor, 15th International Conf. Machine Learning, pages 188–196. Morgan Kaufmann Publishers, 1998.

[14] K. Fukushima. Neocognitron: A hierarchical neural network capable of visual pattern recognition. Neural Networks, 1(2):119–130, 1988.

[15] C. Gentile. A new approximate maximal margin classification algorithm. Journal of Machine Learning Research, 2:213–242, 2001.

[16] G. H. Golub and C. F. Van Loan. Matrix Computations. Johns Hopkins University Press, 1991. Second edition.

[17] T. Graepel and R. Herbrich. Invariant pattern recognition by semi-definite programming machines. In S. Thrun, L. Saul, and B. Schölkopf, editors, Advances in Neural Information Processing Systems 16. MIT Press, Cambridge, MA, 2004.

[18] T. Joachims. Making large-scale SVM learning practical. In B. Schölkopf, C. Burges, and A. Smola, editors, Advances in Kernel Methods - Support Vector Learning, pages 169–184. MIT Press, 1999.

[19] J. Kivinen, A. Smola, and R. Williamson. Online learning with kernels, 2002.

[20] K. J. Lang and G. E. Hinton. A time delay neural network architecture for speech recognition. Technical Report CMU-CS-88-152, Carnegie-Mellon University, Pittsburgh PA, 1988.

[21] Y. Le Cun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998. http://yann.lecun.com/exdb/mnist/.


[22] T. K. Leen. From data distributions to regularization in invariant learning. In G. Tesauro, D. Touretzky, and T. Leen, editors, Advances in Neural Information Processing Systems, volume 7, pages 223–230. The MIT Press, 1995.

[23] Y. Li and P. M. Long. The relaxed online maximum margin algorithm. Machine Learning, 46(1-3):361–387, 2002.

[24] G. Loosli, S. Canu, S. V. N. Vishwanathan, and A. J. Smola. Invariances in classification: an efficient SVM implementation. In ASMDA 2005 - Applied Stochastic Models and Data Analysis, 2005.

[25] G. Loosli, S. Canu, S. V. N. Vishwanathan, A. J. Smola, and M. Chattopadhyay. Une boîte à outils rapide et simple pour les SVM. In Michel Liquière and Marc Sebban, editors, CAp 2004 - Conférence d'Apprentissage, pages 113–128. Presses Universitaires de Grenoble, 2004.

[26] N. Murata and S.-I. Amari. Statistical analysis of learning dynamics. Signal Processing, 74(1):3–28, 1999.

[27] J.-P. Nadal. Réseaux de neurones : de la physique à la psychologie. Armand Colin, Collection 2aI, 1993.

[28] A. Nemirovski. Introduction to convex programming, interior point methods, and semi-definite programming. Machine Learning, Support Vector Machines, and Large-Scale Optimization Pascal Workshop, March 2005.

[29] A. B. J. Novikoff. On convergence proofs on perceptrons. In Proceedings of the Symposium on the Mathematical Theory of Automata, volume 12, pages 615–622. Polytechnic Institute of Brooklyn, 1962.

[30] E. Osuna, R. Freund, and F. Girosi. An improved training algorithm for support vector machines. In J. Principe, L. Giles, N. Morgan, and E. Wilson, editors, Neural Networks for Signal Processing VII - Proceedings of the 1997 IEEE Workshop, pages 276–285, 1997.

[31] J. Platt. Fast training of support vector machines using sequential minimal optimization. In B. Schölkopf, C. J. C. Burges, and A. J. Smola, editors, Advances in Kernel Methods - Support Vector Learning, pages 185–208, Cambridge, MA, 1999. MIT Press.


[32] G. Schohn and D. Cohn. Less is more: Active learning with support vector machines. In P. Langley, editor, Proceedings of the Seventeenth International Conference on Machine Learning (ICML 2000), pages 839–846. Morgan Kaufmann, June 2000.

[33] B. Schölkopf and A. Smola. Learning with Kernels. MIT Press, 2001.

[34] B. Schölkopf, C. Burges, and V. Vapnik. Incorporating invariances in support vector learning machines. In C. von der Malsburg, W. von Seelen, J. C. Vorbrüggen, and B. Sendhoff, editors, Artificial Neural Networks - ICANN'96, volume 1112, pages 47–52, Berlin, 1996. Springer Lecture Notes in Computer Science.

[35] P. Simard, Y. Le Cun, J. S. Denker, and B. Victorri. Transformation invariance in pattern recognition – tangent distance and tangent propagation. International Journal of Imaging Systems and Technology, 11(3), 2000.

[36] P. Simard, Y. Le Cun, and J. Denker. Efficient pattern recognition using a new transformation distance. In S. Hanson, J. Cowan, and L. Giles, editors, Advances in Neural Information Processing Systems, volume 5. Morgan Kaufmann, 1993.

[37] P. Simard, D. Steinkraus, and J. C. Platt. Best practices for convolutional neural networks applied to visual document analysis. In ICDAR, pages 958–962, 2003.

[38] I. Steinwart. Sparseness of support vector machines - some asymptotically sharp bounds. In Sebastian Thrun, Lawrence Saul, and Bernhard Schölkopf, editors, Advances in Neural Information Processing Systems 16. MIT Press, Cambridge, MA, 2004.

[39] P. Vincent and Y. Bengio. Kernel matching pursuit. Machine Learning, 48(1-3):165–187, 2002.

[40] J. Wood. Invariant pattern recognition: A review. Pattern Recognition, 29(1):1–19, 1996.

[41] G. Zoutendijk. Methods of Feasible Directions. Elsevier, 1960.


Figure 1: Geometry of the SVM dual QP problem (2). The box constraints 0 ≤ yiαi ≤ C restrict the solutions to an n-dimensional hypercube. The equality constraint ∑i αi = 0 further restricts the solutions to a (n−1)-dimensional polytope, that is the segment [AB] in this 2-dimensional figure.

Figure 2: Performing a line search inside the feasible space of the SVM dual QP problem. The quadratic cost function restricted to the search line might reach its maximum inside (left) or outside (right) the box constraints.


Figure 3: These plots illustrate the influence of point variations on the decision boundary for a toy example. Plot (a) shows a typical decision boundary obtained by only considering the training examples. Plot (b) shows “orbits” describing the possible transformations of the training examples. All these variations should be given the same label by the classifier. Plot (c) shows a decision boundary that takes the variations into account. Plot (d) shows how this boundary can be obtained by selecting adequate representatives for each orbit.

Figure 4: This figure shows examples of deformation fields. We represent the combination of horizontal and vertical fields. The first one is a smoothed random field and the second one is a rotation field modified by random noise.


Figure 5: This figure shows the original digit, the two translation tangent vectors and the thickening tangent vector.

Figure 6: This figure shows 16 variations of a digit obtained with all the transformations described above.


Figure 7: Effects of transformations on the performance. (The plot reports the error rate on an independent validation set as a function of the deformation strength α, with one curve for each combination of translation amount (none, 1 pixel, or 2 pixels) and thickening (with or without).) This graph is obtained on a validation set of 10,000 points, trained on 5,000 points with 10 transformations each (55,000 training points in total) with an RBF kernel of bandwidth γ = 0.006. The best configuration is elastic deformation without thickening, with α = 2 and translations of 1 pixel, which gives 1.28%. Note that α = 0 is equivalent to no elastic deformation. The baseline result for the validation set is thus 2.52%.


Figure 8: These plots compare the error rates (left), the numbers of support vectors (center) and the training times (right) of different LASVM runs. (Three bar charts with the modes NT, RS, AS and AAS on the horizontal axis: error percentage, number of support vectors, and training time in seconds.) The first bar of each graph corresponds to training on the original 60,000 MNIST examples (no transformation, NT). The other three bars were obtained using 100 random deformations of each MNIST example, that is 6 million points. The second bars report results for random selection (RS), the third for active selection (AS) and the last ones for auto-active selection (AAS). The deformation settings follow the previous results (figure 7). The auto-active run gives the best compromise.


Figure 9: Evolution of the expansion size during training (number of support vectors versus size of the training set) for the classifiers of classes 0, 1 and 2, using auto-active selection and performing a finishing step at regular intervals. Each gap corresponds to one finishing step. We notice that the number of support vectors eventually decreases even though we never intentionally run the optimization to full convergence.
