
    Machine Learning using Hyperkernels

Cheng Soon Ong [email protected]

Alexander J. Smola [email protected]

Machine Learning Group, RSISE, Australian National University, Canberra, ACT 0200, Australia

    Abstract

We expand on the problem of learning a kernel via a RKHS on the space of kernels itself. The resulting optimization problem is shown to have a semidefinite programming solution. We demonstrate that it is possible to learn the kernel for various formulations of machine learning problems. Specifically, we provide mathematical programming formulations and experimental results for the C-SVM, $\nu$-SVM and Lagrangian SVM for classification on UCI data, and novelty detection.

    1. Introduction

Kernel methods have been highly successful in solving various problems in machine learning. The algorithms work by mapping the inputs into a feature space, and finding a suitable hypothesis in this new space. In the case of the Support Vector Machine, this solution is the hyperplane which maximizes the margin in the feature space. The feature mapping in question is defined by a kernel function, which allows us to compute dot products in feature space using only the objects in the input space.

Recently, there have been many developments regarding learning the kernel function itself (Bousquet & Herrmann, 2003; Crammer et al., 2003; Cristianini et al., 2002; Lanckriet et al., 2002; Momma & Bennett, 2002; Ong et al., 2003). In this paper, we extend the hyperkernel framework introduced in Ong et al. (2003), which we review in Section 2. In particular, the contributions of this paper are:

- a general class of hyperkernels allowing automatic relevance determination (Section 3),

- explicit mathematical programming formulations of the optimization problems (Section 4),

- implementation details of various SVMs and Alignment (Section 5),

- and further experiments on binary classification and novelty detection (Section 6).

At the heart of the strategy is the idea that we learn the kernel by performing the kernel trick on the space of kernels, hence the notion of a hyperkernel.

    2. Hyper-RKHS

As motivation for the need for such a formulation, consider Figure 1, which shows the separating hyperplane and the margin for the same dataset. Figure 1(a) shows the training data and the classification function for a support vector machine using a Gaussian RBF kernel. The data has been sampled from two Gaussian distributions with standard deviation 1 in one dimension and 1000 in the other. This difference in scale creates problems for the Gaussian RBF kernel, since it is unable to find a kernel width suitable for both dimensions. Hence, the classification function is dominated by the dimension with large variance. The traditional way to handle such data is to normalize each dimension independently.

Instead of normalizing the input data, we make the kernel adaptive to allow independent scales for each dimension. This allows the kernel to handle unnormalized data. However, the resulting kernel would be difficult to tune by cross validation as there are numerous free variables (one for each dimension). We learn this kernel by defining a quantity analogous to the risk functional, called the quality functional, which measures the badness of the kernel function. The classification function for the above mentioned data is shown in Figure 1(b). Observe that it captures the scale of each dimension independently.

We review the definitions from Ong et al. (2003). Given a set of input data, $X$, and their associated labels¹, $Y$, and a class of kernels $\mathcal{K}$, we would like to select the best kernel $k \in \mathcal{K}$ for the problem.

Definition 1 (Empirical Quality Functional) Given a kernel $k$, and data $X, Y$, we define $Q_{\mathrm{emp}}(k, X, Y)$ to be an empirical quality functional if it depends on $k$ only via $k(x_i, x_j)$ where $x_i, x_j \in X$ for $1 \leq i, j \leq m$.

¹ Only for supervised learning.

    Proceedings of the Twentieth International Conference on Machine Learning (ICML-2003), Washington DC, 2003.


    (a) Standard Gaussian RBF kernel (b) RBF-Hyperkernel with adaptive widths

Figure 1. For data with highly non-isotropic variance, choosing one scale for all dimensions leads to unsatisfactory results. Plot of synthetic data, showing the separating hyperplane and the margins given for a uniformly chosen length scale (left) and an automatic width selection (right).

$Q_{\mathrm{emp}}$ tells us how well matched $k$ is to a specific dataset $X, Y$. Examples of such functionals include the Kernel Target Alignment, the regularized risk and the negative log-posterior. However, if provided with a sufficiently rich class of kernels $\mathcal{K}$ it is in general possible to find a kernel that attains the minimum of any such $Q_{\mathrm{emp}}$ regardless of the data (see Ong et al. (2003) for examples). Therefore, we would like to somehow control the complexity of the kernel function. We achieve this by using the kernel trick again on the space of kernels.

Definition 2 (Hyper RKHS) Let $X$ be a nonempty set and denote by $\underline{X} := X \times X$ the compounded index set. The Hilbert space $\mathcal{H}$ of functions $k : \underline{X} \to \mathbb{R}$, endowed with a dot product $\langle \cdot, \cdot \rangle$ (and the norm $\|k\| = \sqrt{\langle k, k \rangle}$), is called a Hyper Reproducing Kernel Hilbert Space if there exists a hyperkernel $\underline{k} : \underline{X} \times \underline{X} \to \mathbb{R}$ with the following properties:

1. $\underline{k}$ has the reproducing property

$\langle k, \underline{k}(\underline{x}, \cdot) \rangle = k(\underline{x})$ for all $k \in \mathcal{H}$; (1)

in particular, $\langle \underline{k}(\underline{x}, \cdot), \underline{k}(\underline{x}', \cdot) \rangle = \underline{k}(\underline{x}, \underline{x}')$.

2. $\underline{k}$ spans $\mathcal{H}$, i.e. $\mathcal{H} = \overline{\mathrm{span}\{\underline{k}(\underline{x}, \cdot) \mid \underline{x} \in \underline{X}\}}$, where the bar denotes the completion of the set.

3. For any fixed $x \in X$ the hyperkernel $\underline{k}$ is a kernel in its second argument, i.e. for any fixed $x \in X$, the function $k(x', x'') := \underline{k}(\underline{x}, (x', x''))$ with $x', x'' \in X$ is a kernel.

What distinguishes $\mathcal{H}$ from a normal RKHS is the particular form of its index set ($\underline{X} = X^2$) and the additional condition on $\underline{k}$ to be a kernel in its second argument for any fixed first argument. This condition somewhat limits the choice of possible kernels; on the other hand, it allows for simple optimization algorithms which consider kernels $k \in \mathcal{H}$ which are in the convex cone of $\underline{k}$. Analogous to the regularized risk functional, $R_{\mathrm{reg}}(f, X, Y) = \frac{1}{m}\sum_{i=1}^{m} l(x_i, y_i, f(x_i)) + \frac{\lambda}{2}\|f\|^2$, we regularize $Q_{\mathrm{emp}}(k, X, Y)$.

Definition 3 (Regularized Quality Functional)

$Q_{\mathrm{reg}}(k, X, Y) := Q_{\mathrm{emp}}(k, X, Y) + \frac{\lambda_Q}{2}\|k\|^2_{\mathcal{H}}$ (2)

where $\lambda_Q > 0$ is a regularization constant and $\|k\|^2_{\mathcal{H}}$ denotes the RKHS norm in $\mathcal{H}$.

Minimization of $Q_{\mathrm{reg}}$ is less prone to overfitting than minimizing $Q_{\mathrm{emp}}$, since the regularizer $\frac{\lambda_Q}{2}\|k\|^2_{\mathcal{H}}$ effectively controls the complexity of the class of kernels under consideration (this can be derived from Bousquet & Herrmann (2003)). The minimizer of (2) satisfies the representer theorem:

Theorem 4 (Representer Theorem) Denote by $X$ a set, and by $Q$ an arbitrary quality functional. Then each minimizer $k \in \mathcal{H}$ of the regularized quality functional (2) admits a representation of the form

$k(x, x') = \sum_{i,j=1}^{m} \beta_{i,j}\, \underline{k}((x_i, x_j), (x, x'))$. (3)

This shows that even though we are optimizing over a whole Hilbert space of kernels, we are still able to find the optimal solution by choosing among a finite number of terms, namely the span of the hyperkernel on the data.

Note that the minimizer (3) is not necessarily positive semidefinite. In practice, this is not what we want, since we require a positive semidefinite kernel. Therefore we need to impose additional constraints: we require that all expansion coefficients $\beta_{i,j} \geq 0$. While this may prevent us from obtaining the minimizer of the objective function, it yields a much more amenable optimization problem in practice. In the subsequent derivations of optimization problems, we choose this restriction as it provides a tractable problem.

Similar to the analogy between Gaussian Processes (GP) and SVMs (Opper & Winther, 2000), there is a Bayesian interpretation for hyperkernels which is analogous to the idea of hyperpriors. Our approach can be interpreted as drawing the covariance matrix of the GP from another GP.

    3. Designing Hyperkernels

The criteria imposed by Definition 2 guide us directly in the choice of functions suitable as hyperkernels. The first observation is that we can optimize over the space of kernel functions, hence we can take large linear combinations of parameterized families of kernels as the basic ingredients. This leads to the so-called harmonic hyperkernels (Ong et al., 2003):

Example 1 (Harmonic Hyperkernel) Denote by $k$ a kernel with $k : X \times X \to [0, 1]$, and set $c_i := (1 - \lambda_h)\lambda_h^i$ for some $0 < \lambda_h < 1$. Then we have

$\underline{k}(\underline{x}, \underline{x}') = (1 - \lambda_h) \sum_{i=0}^{\infty} \left(\lambda_h k(\underline{x}) k(\underline{x}')\right)^i = \dfrac{1 - \lambda_h}{1 - \lambda_h k(\underline{x}) k(\underline{x}')}$ (4)

A special case is $k(x, x') = \exp(-\sigma\|x - x'\|^2)$. Here we obtain for $\underline{k}((x, x'), (x'', x'''))$

$\dfrac{1 - \lambda_h}{1 - \lambda_h \exp\left(-\sigma\left(\|x - x'\|^2 + \|x'' - x'''\|^2\right)\right)}$ (5)
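As a concrete illustration (ours, not from the paper), the harmonic hyperkernel of (4)-(5) with a Gaussian RBF base kernel can be evaluated directly; the names `rbf`, `harmonic_hyperkernel`, `sigma` and `lam_h` below are our stand-ins for the base kernel, the hyperkernel, $\sigma$ and $\lambda_h$.

```python
import numpy as np

def rbf(x, xp, sigma=1.0):
    """Gaussian RBF base kernel k(x, x') = exp(-sigma * ||x - x'||^2)."""
    return np.exp(-sigma * np.sum((np.asarray(x) - np.asarray(xp)) ** 2))

def harmonic_hyperkernel(x, xp, xpp, xppp, sigma=1.0, lam_h=0.6):
    """Harmonic hyperkernel of (4)-(5), evaluated at the pair of pairs
    ((x, x'), (x'', x''')); the closed form of the geometric series."""
    prod = rbf(x, xp, sigma) * rbf(xpp, xppp, sigma)
    return (1.0 - lam_h) / (1.0 - lam_h * prod)

# tiny usage example on 2-d points
print(harmonic_hyperkernel([0, 0], [1, 0], [0, 1], [1, 1]))
```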

However, if we want the kernel to adapt automatically to different widths for each dimension, we need to perform the summation that led to (4) for each dimension in its arguments separately (similar to automatic relevance determination (MacKay, 1994)).

Example 2 (Hyperkernel for ARD) Let $k_{\Sigma}(x, x') = \exp(-d_{\Sigma}(x, x'))$, where $d_{\Sigma}(x, x') = (x - x')^{\top}\Sigma(x - x')$, and $\Sigma$ a diagonal covariance matrix. Take sums over each diagonal entry $\sigma_j = \Sigma_{jj}$ separately to obtain

$\underline{k}((x, x'), (x'', x''')) = (1 - \lambda_h) \sum_{j=1}^{d} \sum_{i=0}^{\infty} \left(\lambda_h k_{\Sigma_j}(x, x') k_{\Sigma_j}(x'', x''')\right)^i = \sum_{j=1}^{d} \dfrac{1 - \lambda_h}{1 - \lambda_h \exp\left(-\sigma_j\left((x_j - x'_j)^2 + (x''_j - x'''_j)^2\right)\right)}$.

This is a valid hyperkernel since $k(\underline{x})$ factorizes into its coordinates. A similar definition also allows us to use a distance metric $d(x, x')$ which is a generalized radial distance as defined by Haussler (1999).
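A corresponding sketch (again ours) for the ARD hyperkernel of Example 2, with `sigma` holding the diagonal entries $\sigma_j$ of $\Sigma$:

```python
import numpy as np

def ard_hyperkernel(x, xp, xpp, xppp, sigma, lam_h=0.6):
    """ARD hyperkernel of Example 2: a sum over input dimensions j of
    (1 - lam_h) / (1 - lam_h * exp(-sigma_j * ((x_j - x'_j)^2 + (x''_j - x'''_j)^2))).
    sigma is the vector of per-dimension scales, i.e. the diagonal of Sigma."""
    x, xp, xpp, xppp, sigma = map(np.asarray, (x, xp, xpp, xppp, sigma))
    d2 = (x - xp) ** 2 + (xpp - xppp) ** 2     # per-dimension squared distances
    return np.sum((1.0 - lam_h) / (1.0 - lam_h * np.exp(-sigma * d2)))

# tiny usage example with independent scales per dimension
print(ard_hyperkernel([0, 0], [1, 0], [0, 1], [1, 1], sigma=[0.5, 2.0]))
```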

    4. Semidefinite Programming

We derive Semidefinite Programming (SDP) formulations of the optimization problems arising from the minimization of the regularized risk functional. Semidefinite programming (Vandenberghe & Boyd, 1996) is the optimization of a linear objective function subject to constraints which are linear matrix inequalities and affine equalities. The following proposition allows us to derive a SDP from a class of general quadratic programs. It is an extension of the derivation in Lanckriet et al. (2002) and its proof can be found in Appendix A.

Proposition 5 (Quadratic Minimax) Let $m, n, M \in \mathbb{N}$, $H : \mathbb{R}^n \to \mathbb{R}^{m \times m}$, $c : \mathbb{R}^n \to \mathbb{R}^m$ be linear maps. Let $A \in \mathbb{R}^{M \times m}$ and $a \in \mathbb{R}^{M}$. Also, let $d : \mathbb{R}^n \to \mathbb{R}$ and $G(\theta)$ be a function and the further constraints on $\theta$. Then the optimization problem

$\begin{array}{ll} \min_{\theta} \max_{x} & -\frac{1}{2} x^{\top} H(\theta) x - c(\theta)^{\top} x + d(\theta) \\ \text{subject to} & H(\theta) \succeq 0 \\ & Ax + a \geq 0 \\ & G(\theta) \geq 0 \end{array}$ (6)

can be rewritten as

$\begin{array}{ll} \min_{t, \theta, \gamma} & \frac{1}{2} t + a^{\top}\gamma + d(\theta) \\ \text{subject to} & \gamma \geq 0, \; G(\theta) \geq 0 \\ & \begin{bmatrix} H(\theta) & A^{\top}\gamma - c(\theta) \\ (A^{\top}\gamma - c(\theta))^{\top} & t \end{bmatrix} \succeq 0 \end{array}$ (7)

Specifically, when we have the regularized quality functional, $d(\theta)$ is quadratic, and hence we obtain an optimization problem which has a mix of linear, quadratic and semidefinite constraints.

Corollary 6

$\begin{array}{ll} \min_{\theta} \max_{x} & -\frac{1}{2} x^{\top} H(\theta) x - c(\theta)^{\top} x + \frac{1}{2}\theta^{\top}\Sigma\theta \\ \text{subject to} & H(\theta) \succeq 0 \\ & Ax + a \geq 0 \\ & \theta \geq 0 \end{array}$ (8)


can be rewritten as

$\begin{array}{ll} \min_{t_1, t_2, \theta, \gamma} & \frac{1}{2} t_1 + \frac{1}{2} t_2 + a^{\top}\gamma \\ \text{subject to} & \gamma \geq 0, \; \theta \geq 0 \\ & \|\Sigma^{1/2}\theta\| \leq t_2 \\ & \begin{bmatrix} H(\theta) & A^{\top}\gamma - c(\theta) \\ (A^{\top}\gamma - c(\theta))^{\top} & t_1 \end{bmatrix} \succeq 0 \end{array}$ (9)

The proof of the above is obtained immediately from Proposition 5 by introducing an auxiliary variable $t_2$ which upper bounds the quadratic term in $\theta$.

    5. Implementation Details

When $Q_{\mathrm{emp}}$ is the regularized risk, we obtain:

$\min_{f \in \mathcal{H}_k,\, k \in \mathcal{H}} \frac{1}{m}\sum_{i=1}^{m} l(x_i, y_i, f(x_i)) + \frac{\lambda}{2}\|f\|^2_{\mathcal{H}_k} + \frac{\lambda_Q}{2}\|k\|^2_{\mathcal{H}}$ (10)

Comparing the objective function in (8) with (10), we observe that $H(\theta)$ and $c(\theta)$ are linear in $\theta$. Suppose we scale $\theta$ by some $\eta > 0$. As we vary $\eta$ the constraints are still satisfied, but the objective function scales with $\eta$. Since $\beta$ is the coefficient in the hyperkernel expansion, this implies that we have a set of possible kernels which are just scalar multiples of each other. To avoid this, we add an additional constraint on $\beta$, namely $\mathbf{1}^{\top}\beta = 1$. This breaks the scaling freedom of the kernel matrix. As a side effect, the numerical stability of the SDP problems improves considerably.

We give some examples of common SVMs which are derived from (10). The derivation is basically by application of Corollary 6. We derive the corresponding SDP for the case when $Q_{\mathrm{emp}}$ is a C-SVM (Example 3). Derivations of the other examples follow the same reasoning, and are omitted.

In this subsection, we define the following notation. For $p, q, r \in \mathbb{R}^n$, $n \in \mathbb{N}$, let $r = p \circ q$ be defined as element-by-element multiplication, $r_i = p_i \times q_i$. The pseudo-inverse (or Moore-Penrose inverse) of a matrix $K$ is denoted $K^{\dagger}$. Define the hyperkernel Gram matrix $\underline{K}$ by $\underline{K}_{ijpq} = \underline{k}((x_i, x_j), (x_p, x_q))$, the kernel matrix $K = \mathrm{reshape}(\underline{K}\beta)$ (reshaping an $m^2$ by 1 vector, $\underline{K}\beta$, to an $m$ by $m$ matrix), $Y = \mathrm{diag}(y)$ (a matrix with $y$ on the diagonal and zero everywhere else), $G(\beta) = YKY$ (the dependence on $\beta$ is made explicit), $I$ the identity matrix and $\mathbf{1}$ a vector of ones. The number of training examples is assumed to be $m$. Where appropriate, $\gamma$ and $\chi$ are Lagrange multipliers, while $\eta$ and $\xi$ are vectors of Lagrange multipliers from the derivation of the Wolfe dual for the SDP, $\beta$ are the hyperkernel coefficients, and $t_1$ and $t_2$ are the auxiliary variables.
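The notation above translates directly into code. The sketch below (our own, in NumPy) builds the hyperkernel Gram matrix $\underline{K}$, folds $\underline{K}\beta$ back into an $m \times m$ kernel matrix, and forms $G(\beta) = YKY$. The row-major pair index $(i, j) \mapsto i \cdot m + j$ is an assumption, since the paper does not fix a reshape convention, and the $O(m^4)$ loop ignores the low-rank decomposition used in the experiments.

```python
import numpy as np

def hyperkernel_gram(X, hyperkernel, **params):
    """Hyperkernel Gram matrix Kbar with entries
    Kbar[(i, j), (p, q)] = k_bar((x_i, x_j), (x_p, x_q)),
    stored as an (m^2, m^2) array with pair index (i, j) -> i*m + j."""
    m = X.shape[0]
    Kbar = np.empty((m * m, m * m))
    for i in range(m):
        for j in range(m):
            for p in range(m):
                for q in range(m):
                    Kbar[i * m + j, p * m + q] = hyperkernel(X[i], X[j], X[p], X[q], **params)
    return Kbar

def kernel_from_beta(Kbar, beta, m):
    """K = reshape(Kbar beta): fold the m^2-vector back into an m x m matrix."""
    K = (Kbar @ beta).reshape(m, m)
    return 0.5 * (K + K.T)          # symmetrise against round-off

def G_matrix(K, y):
    """G(beta) = Y K Y with Y = diag(y); equals the elementwise product with y y^T."""
    return K * np.outer(y, y)
```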

Example 3 (Linear SVM (C-style)) A commonly used support vector classifier, the C-SVM (Bennett & Mangasarian, 1992; Cortes & Vapnik, 1995), uses an $\ell_1$ soft margin, $l(x_i, y_i, f(x_i)) = \max(0, 1 - y_i f(x_i))$, which allows errors on the training set. The parameter $C$ is given by the user. Setting the quality functional $Q_{\mathrm{emp}}(k, X, Y) = \min_{f \in \mathcal{H}} \frac{1}{m}\sum_{i=1}^{m} l(x_i, y_i, f(x_i)) + \frac{1}{2C}\|w\|^2_{\mathcal{H}}$, the resulting SDP is

$\begin{array}{ll} \min_{\beta, \gamma, \eta, \xi} & \frac{1}{2} t_1 + \frac{C}{m}\mathbf{1}^{\top}\xi + \frac{C\lambda_Q}{2} t_2 \\ \text{subject to} & \eta \geq 0, \; \xi \geq 0, \; \beta \geq 0 \\ & \|\underline{K}^{1/2}\beta\| \leq t_2 \\ & \begin{bmatrix} G(\beta) & z \\ z^{\top} & t_1 \end{bmatrix} \succeq 0, \end{array}$ (11)

where $z = \gamma y + \mathbf{1} + \eta - \xi$.

The value of $\alpha$ which optimizes the corresponding Lagrange function is $G(\beta)^{\dagger} z$, and the classification function, $f = \mathrm{sign}(K(\alpha \circ y) - b_{\mathrm{offset}})$, is given by $f = \mathrm{sign}(K G(\beta)^{\dagger}(y \circ z) - \gamma)$.

Proof [Derivation of SDP for C-SVM] We begin our derivation from the regularized quality functional (10). Dividing throughout by $\lambda$ and setting the cost function to the $\ell_1$ soft margin loss, that is $l(x_i, y_i, f(x_i)) = \max(0, 1 - y_i f(x_i))$, we get the following equation:

$\begin{array}{ll} \min_{k \in \mathcal{H}} \min_{f \in \mathcal{H}_k} & \frac{1}{\lambda m}\sum_{i=1}^{m}\xi_i + \frac{1}{2}\|f\|^2_{\mathcal{H}_k} + \frac{\lambda_Q}{2\lambda}\|k\|^2_{\mathcal{H}} \\ \text{subject to} & y_i f(x_i) \geq 1 - \xi_i \\ & \xi_i \geq 0 \end{array}$ (12)

Recall the form of the C-SVM,

$\begin{array}{ll} \min_{w, \xi} & \frac{1}{2}\|w\|^2 + \frac{C}{m}\sum_{i=1}^{m}\xi_i \\ \text{subject to} & y_i(\langle x_i, w \rangle + b) \geq 1 - \xi_i \\ & \xi_i \geq 0 \text{ for all } i = 1, \dots, m \end{array}$

and its dual,

$\begin{array}{ll} \max_{\alpha \in \mathbb{R}^m} & \sum_{i=1}^{m}\alpha_i - \frac{1}{2}\sum_{i,j=1}^{m}\alpha_i\alpha_j y_i y_j k(x_i, x_j) \\ \text{subject to} & \sum_{i=1}^{m}\alpha_i y_i = 0 \\ & 0 \leq \alpha_i \leq \frac{C}{m} \text{ for all } i = 1, \dots, m. \end{array}$

By considering the optimization problem dependent on $f$ in (12), we can use the derivation of the dual problem of the standard C-SVM. Observe that $C = \frac{1}{\lambda}$, and we can rewrite $\|k\|^2_{\mathcal{H}} = \beta^{\top}\underline{K}\beta$ due to the representer theorem. Substituting the dual C-SVM problem into (12), we get the following matrix equation,

$\begin{array}{ll} \min_{\beta} \max_{\alpha} & \mathbf{1}^{\top}\alpha - \frac{1}{2}\alpha^{\top}G(\beta)\alpha + \frac{C\lambda_Q}{2}\beta^{\top}\underline{K}\beta \\ \text{subject to} & y^{\top}\alpha = 0 \\ & 0 \leq \alpha_i \leq \frac{C}{m} \text{ for all } i = 1, \dots, m \\ & \beta_i \geq 0 \end{array}$ (13)

This is of the quadratic form of Corollary 6 where $x = \alpha$, $\theta = \beta$, $H(\theta) = G(\beta)$, $c(\theta) = -\mathbf{1}$, $\Sigma = C\lambda_Q\underline{K}$, the constraints are

$A = \begin{bmatrix} y^{\top} \\ -y^{\top} \\ I \\ -I \end{bmatrix}$ and $a = \begin{bmatrix} 0 \\ 0 \\ 0 \\ \frac{C}{m}\mathbf{1} \end{bmatrix}$.

Applying Corollary 6, we obtain the SDP in Example 3. To make the different constraints explicit, we replace the matrix constraint $Ax + a \geq 0$ and its associated Lagrange multiplier with three linear constraints. We use $\gamma$ as the Lagrange multiplier for the equality constraint $y^{\top}\alpha = 0$, $\eta$ for $\alpha \geq 0$, and $\xi$ for $\alpha \leq \frac{C}{m}\mathbf{1}$.
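To make the construction concrete, here is a minimal sketch of SDP (11) using CVXPY and its SCS solver. This is our own illustration, not the authors' implementation; the paper relies on a low-rank decomposition of the hyperkernel matrix, which is not used here, the function name `csvm_hyperkernel_sdp` is ours, and the defaults mirror the parameter choices of Section 6.

```python
import numpy as np
import cvxpy as cp

def csvm_hyperkernel_sdp(Kbar, y, C=100.0, lam_Q=1.0):
    """Sketch of SDP (11): learn beta and the dual variables for the C-SVM
    quality functional.  Kbar is the (m^2, m^2) hyperkernel Gram matrix,
    y the labels in {-1, +1}."""
    m = y.shape[0]
    beta = cp.Variable(m * m, nonneg=True)   # hyperkernel coefficients
    eta = cp.Variable(m, nonneg=True)        # multiplier for alpha >= 0
    xi = cp.Variable(m, nonneg=True)         # multiplier for alpha <= (C/m) 1
    gamma = cp.Variable()                    # multiplier for y' alpha = 0
    t1, t2 = cp.Variable(), cp.Variable()

    K = cp.reshape(Kbar @ beta, (m, m))      # K = reshape(Kbar beta)
    K = 0.5 * (K + K.T)                      # keep the LMI block symmetric
    G = cp.multiply(np.outer(y, y).astype(float), K)   # G(beta) = Y K Y
    z = gamma * y + np.ones(m) + eta - xi    # z = gamma y + 1 + eta - xi
    zc = cp.reshape(z, (m, 1))

    # square root of Kbar for the constraint ||Kbar^{1/2} beta|| <= t2
    w, V = np.linalg.eigh(Kbar)
    Khalf = V @ np.diag(np.sqrt(np.clip(w, 0.0, None))) @ V.T

    constraints = [
        cp.sum(beta) == 1,                   # scale-breaking constraint 1' beta = 1
        cp.norm(Khalf @ beta) <= t2,
        cp.bmat([[G, zc], [zc.T, cp.reshape(t1, (1, 1))]]) >> 0,
    ]
    objective = cp.Minimize(0.5 * t1 + (C / m) * cp.sum(xi) + (C * lam_Q / 2) * t2)
    cp.Problem(objective, constraints).solve(solver=cp.SCS)

    # recover the learned kernel and, as in Example 3, the training decision values
    K_val = (Kbar @ beta.value).reshape(m, m)
    K_val = 0.5 * (K_val + K_val.T)
    z_val = gamma.value * y + 1.0 + eta.value - xi.value
    alpha = np.linalg.pinv(K_val * np.outer(y, y)) @ z_val   # alpha = G(beta)^+ z
    f_train = np.sign(K_val @ (alpha * y) - gamma.value)     # f = sign(K(alpha o y) - gamma)
    return beta.value, K_val, alpha, f_train
```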

Example 4 (Linear SVM ($\nu$-style)) An alternative parameterization of the $\ell_1$ soft margin was introduced by Schölkopf et al. (2000), where the user-defined parameter $\nu \in [0, 1]$ controls the fraction of margin errors and support vectors. Using the $\nu$-SVM as $Q_{\mathrm{emp}}$, that is, for a given $\nu$, $Q_{\mathrm{emp}}(k, X, Y) = \min_{f \in \mathcal{H}} \frac{1}{m}\sum_{i=1}^{m}\xi_i - \nu\rho + \frac{1}{2}\|w\|^2_{\mathcal{H}}$ subject to $y_i f(x_i) \geq \rho - \xi_i$ and $\xi_i \geq 0$ for all $i = 1, \dots, m$. The corresponding SDP is given by

$\begin{array}{ll} \min_{\beta, \gamma, \eta, \xi, \chi} & \frac{1}{2} t_1 - \chi\nu + \frac{1}{m}\mathbf{1}^{\top}\xi + \frac{\lambda_Q}{2} t_2 \\ \text{subject to} & \chi \geq 0, \; \eta \geq 0, \; \xi \geq 0, \; \beta \geq 0 \\ & \|\underline{K}^{1/2}\beta\| \leq t_2 \\ & \begin{bmatrix} G(\beta) & z \\ z^{\top} & t_1 \end{bmatrix} \succeq 0 \end{array}$ (14)

where $z = \gamma y + \chi\mathbf{1} + \eta - \xi$.

The value of $\alpha$ which optimizes the corresponding Lagrange function is $G(\beta)^{\dagger} z$, and the classification function, $f = \mathrm{sign}(K(\alpha \circ y) - b_{\mathrm{offset}})$, is given by $f = \mathrm{sign}(K G(\beta)^{\dagger}(y \circ z) - \gamma)$.

Example 5 (Quadratic SVM) Instead of using an $\ell_1$ loss class, Mangasarian & Musicant (2001) use an $\ell_2$ loss class,

$l(x_i, y_i, f(x_i)) = \begin{cases} 0 & \text{if } y_i f(x_i) \geq 1 \\ (1 - y_i f(x_i))^2 & \text{otherwise,} \end{cases}$

and regularize the weight vector as well as the bias term, that is, the empirical quality functional is set to $Q_{\mathrm{emp}}(k, X, Y) = \min_{f \in \mathcal{H}} \frac{1}{m}\sum_{i=1}^{m}\xi_i^2 + \frac{1}{2}\left(\|w\|^2_{\mathcal{H}} + b_{\mathrm{offset}}^2\right)$ subject to $y_i f(x_i) \geq 1 - \xi_i$ and $\xi_i \geq 0$ for all $i = 1, \dots, m$. This is also known as the Lagrangian SVM. The resulting dual SVM problem has fewer constraints, as is evidenced by the smaller number of Lagrange multipliers needed in the SDP below.

$\begin{array}{ll} \min_{\beta, \eta} & \frac{1}{2} t_1 + \frac{\lambda_Q}{2} t_2 \\ \text{subject to} & \eta \geq 0, \; \beta \geq 0 \\ & \|\underline{K}^{1/2}\beta\| \leq t_2 \\ & \begin{bmatrix} H(\beta) & \eta + \mathbf{1} \\ (\eta + \mathbf{1})^{\top} & t_1 \end{bmatrix} \succeq 0 \end{array}$ (15)

where $H(\beta) = Y(K + \mathbf{1}_{m \times m} + \lambda m I)Y$, and $z = \eta + \mathbf{1}$.

The value of $\alpha$ which optimizes the corresponding Lagrange function is $H(\beta)^{\dagger}(\eta + \mathbf{1})$, and the classification function, $f = \mathrm{sign}(K(\alpha \circ y) - b_{\mathrm{offset}})$, is given by $f = \mathrm{sign}(K H(\beta)^{\dagger}((\eta + \mathbf{1}) \circ y) + y^{\top}(H(\beta)^{\dagger}(\eta + \mathbf{1})))$.

Example 6 (Single class SVM) For unsupervised learning, the single class SVM computes a function which captures regions in input space where the probability density is in some sense large (Schölkopf et al., 2001). The quality functional is $Q_{\mathrm{emp}}(k, X, Y) = \min_{f \in \mathcal{H}} \frac{1}{\nu m}\sum_{i=1}^{m}\xi_i + \frac{1}{2}\|w\|^2_{\mathcal{H}} - \rho$ subject to $f(x_i) \geq \rho - \xi_i$, and $\xi_i \geq 0$ for all $i = 1, \dots, m$, and $\rho \geq 0$. The corresponding SDP for this problem, also known as novelty detection, is shown below.

$\begin{array}{ll} \min_{\beta, \gamma, \eta, \xi} & \frac{1}{2} t_1 - \gamma + \frac{1}{\nu m}\mathbf{1}^{\top}\xi + \frac{\lambda_Q}{2} t_2 \\ \text{subject to} & \eta \geq 0, \; \xi \geq 0, \; \beta \geq 0 \\ & \|\underline{K}^{1/2}\beta\| \leq t_2 \\ & \begin{bmatrix} K & z \\ z^{\top} & t_1 \end{bmatrix} \succeq 0 \end{array}$ (16)

where $z = \gamma\mathbf{1} + \eta - \xi$, and $\nu \in (0, 1]$ is a user-selected parameter controlling the proportion of the data to be classified as novel.

The score to be used for novelty detection is given by $f = K\alpha - b_{\mathrm{offset}}$, which reduces to $f = \eta - \xi$, by substituting $\alpha = K^{\dagger}(\gamma\mathbf{1} + \eta - \xi)$, $b_{\mathrm{offset}} = \gamma\mathbf{1}$ and $K = \mathrm{reshape}(\underline{K}\beta)$.
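Given $\eta$ and $\xi$ from a solved instance of (16), the novelty score is simply $\eta - \xi$. The paper does not state how the score is thresholded; the sketch below (ours) flags roughly a $\nu$-fraction of the lowest-scoring points as novel, which is one plausible reading.

```python
import numpy as np

def novelty_flags(eta, xi, nu):
    """Novelty score f = eta - xi from Example 6; flag roughly a nu-fraction
    of the points with the lowest scores as novel (thresholding choice is ours)."""
    f = eta - xi
    n_novel = int(np.ceil(nu * f.shape[0]))
    novel_idx = np.argsort(f)[:n_novel]          # lowest scores first
    flags = np.zeros(f.shape[0], dtype=bool)
    flags[novel_idx] = True
    return f, flags
```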

Example 7 ($\nu$-Regression) We derive the SDP for $\nu$-regression (Schölkopf et al., 2000), which automatically selects the $\varepsilon$-insensitive tube for regression. As in the $\nu$-SVM case in Example 4, the user-defined parameter $\nu$ controls the fraction of errors and support vectors. Using the $\varepsilon$-insensitive loss, $l(x_i, y_i, f(x_i)) = \max(0, |y_i - f(x_i)| - \varepsilon)$, and the $\nu$-parameterized quality functional, $Q_{\mathrm{emp}}(k, X, Y) = \min_{f \in \mathcal{H}} C\left(\nu\varepsilon + \frac{1}{m}\sum_{i=1}^{m}(\xi_i + \xi_i^*)\right)$ subject to $f(x_i) - y_i \leq \varepsilon + \xi_i$, $y_i - f(x_i) \leq \varepsilon + \xi_i^*$, $\xi_i^{(*)} \geq 0$ for all $i = 1, \dots, m$ and $\varepsilon \geq 0$.


Data      C-SVM      ν-SVM      Lag-SVM    Other    CV Tuned SVM
syndata   2.8±2.2    1.2±1.3    2.5±2.4    NA       15.2±3.8
pima      24.5±1.6   28.7±1.5   23.7±1.7   23.5     24.8±1.9
ionosph   7.3±1.9    7.4±1.7    7.1±2.0    5.8      6.8±1.7
wdbc      2.8±0.7    4.1±1.7    2.5±0.6    3.2      7.0±1.5
heart     19.7±2.7   19.5±2.1   19.8±2.4   16.0     23.8±3.2
thyroid   6.6±3.6    9.0±4.3    5.5±3.4    4.4      5.2±3.3
sonar     15.2±3.2   15.7±3.9   14.9±3.4   15.4     15.8±3.6
credit    14.8±1.7   13.8±1.1   15.3±1.8   22.8     24.3±1.9
glass     5.2±2.3    7.7±3.3    5.2±1.5    NA       6.0±1.7

Table 1. Hyperkernel classification: test error and standard deviation in percent. The second, third and fourth columns show the results of the hyperkernel optimizations of C-SVM (Example 3), ν-SVM (Example 4) and Lagrangian SVM (Example 5) respectively. The fifth column shows the best results from (Freund & Schapire, 1996; Rätsch et al., 2001; Meyer et al., 2003). The rightmost column shows a C-SVM tuned in the traditional way: a Gaussian RBF kernel was tuned using 10-fold cross validation on the training data. A grid search was performed on (C, σ). The values of C tested were {10⁻¹, 10⁰, …, 10⁵}. The values of the kernel width, σ, tested were between the 10% and 90% quantiles of the distance between a pair of sample points in the data. These quantiles were estimated from a random 20% sample of the training data.

The corresponding SDP is

$\begin{array}{ll} \min_{\beta, \gamma, \eta, \xi, \chi} & \frac{1}{2} t_1 + \chi\nu + \frac{1}{m}\mathbf{1}^{\top}\xi + \frac{\lambda_Q}{2} t_2 \\ \text{subject to} & \chi \geq 0, \; \eta \geq 0, \; \xi \geq 0, \; \beta \geq 0 \\ & \|\underline{K}^{1/2}\beta\| \leq t_2 \\ & \begin{bmatrix} F(\beta) & z \\ z^{\top} & t_1 \end{bmatrix} \succeq 0, \end{array}$ (17)

where $z = \begin{bmatrix} y \\ -y \end{bmatrix} + \gamma\begin{bmatrix} \mathbf{1} \\ -\mathbf{1} \end{bmatrix} - \chi\begin{bmatrix} \mathbf{1} \\ \mathbf{1} \end{bmatrix} + \eta - \xi$ and

$F(\beta) = \begin{bmatrix} K & -K \\ -K & K \end{bmatrix}$.

The Lagrange function is minimized for $\alpha = F(\beta)^{\dagger} z$, and substituting into $f = K\alpha - b_{\mathrm{offset}}$, we obtain the regression function $f = \begin{bmatrix} K & -K \end{bmatrix} F(\beta)^{\dagger} z - \gamma$.

Example 8 (Kernel Target Alignment) For the Alignment approach (Cristianini et al., 2002), $Q_{\mathrm{emp}} = y^{\top}Ky$; we directly minimize the regularized quality functional, obtaining the following optimization problem,

$\begin{array}{ll} \min_{\beta} & \frac{1}{2} t_1 + \frac{\lambda_Q}{2} t_2 \\ \text{subject to} & \beta \geq 0 \\ & \|\underline{K}^{1/2}\beta\| \leq t_2 \\ & \begin{bmatrix} K & y \\ y^{\top} & t_1 \end{bmatrix} \succeq 0 \end{array}$ (18)

Note that for the case of Alignment, $Q_{\mathrm{emp}}$ does not provide a direct formulation for the hypothesis function, but instead, it determines a kernel matrix $K$. This kernel matrix, $K$, can be utilized in a traditional SVM to obtain a classification function.
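As one way to realize that last step (our sketch, not the authors' code), the learned kernel matrix can be passed to an off-the-shelf SVM through a precomputed-kernel interface, e.g. scikit-learn's SVC; the C value is a placeholder echoing the experimental choice in Section 6.

```python
import numpy as np
from sklearn.svm import SVC

def classify_with_learned_kernel(K_train, y_train, K_test_train, C=100.0):
    """Use a learned kernel matrix in a standard SVM.

    K_train:      (m, m) kernel matrix on the training points, K = reshape(Kbar beta)
    K_test_train: (n_test, m) kernel evaluations between test and training points,
                  obtained from the hyperkernel expansion (3)
    """
    clf = SVC(C=C, kernel="precomputed")
    clf.fit(K_train, y_train)
    return clf.predict(K_test_train)
```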

6. Experiments

We used data from the UCI repository for our experiments. Where the data was numerical, we did not perform any preprocessing of the data. Boolean attributes were converted to {−1, 1}, and categorical attributes were arbitrarily assigned an order and numbered {1, 2, . . .}. The hyperkernel used was as in Example 2. This scaling freedom means that we did not have to normalize data to some arbitrary distribution. Similar to Ong et al. (2003), we used a low rank decomposition (Fine & Scheinberg, 2000; Zhang, 2001) for the hyperkernel matrix.

    6.1. Classification Experiments

A set of synthetic data sampled from two Gaussians was created, a sample of which is illustrated in Figure 1. The rest of the datasets were UCI datasets for binary classification tasks. The datasets were split into 10 random permutations of 60% training data and 40% test data. We deliberately did not attempt to tune parameters and instead made the following choices uniformly for all datasets (a configuration sketch follows the list):

- The kernel width $\sigma$ was set to 50 times the 90% quantile of the value of $|x_i - x_j|$ over all the training data, which ensures sufficient coverage.

- $\lambda$ was adjusted so that $\frac{1}{\lambda m} = 100$ (that is, $C = 100$ in the Vapnik-style parameterization of SVMs). This has commonly been reported to yield good results.

- $\nu$ was set to 0.3. While this is clearly suboptimal for many datasets, we decided to choose it beforehand to avoid having to change any parameter. We could use previous reports on generalization performance to set $\nu$ to this value for better performance.

- $\lambda_h$ for the Gaussian Harmonic Hyperkernel was chosen to be 0.6 throughout, giving adequate coverage over various kernel widths in (4) (small $\lambda_h$ focuses almost exclusively on wide kernels, $\lambda_h$ close to 1 will treat all widths equally).

- The hyperkernel regularization was set to $\lambda_Q = 1$.
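For reference, the settings above can be collected as follows. This is a sketch under our own interpretation: the 90% quantile is computed per input dimension, which the text does not spell out, and how the resulting width value enters the scales $\sigma_j$ of the RBF base kernel is left to the implementation.

```python
import numpy as np

def kernel_width(X_train):
    """50 times the 90% quantile of |x_i - x_j| over the training data,
    computed per input dimension (an interpretation on our part)."""
    diffs = np.abs(X_train[:, None, :] - X_train[None, :, :])   # (m, m, d)
    return 50.0 * np.quantile(diffs.reshape(-1, X_train.shape[1]), 0.9, axis=0)

# uniform parameter choices of Section 6.1
SETTINGS = {
    "C": 100.0,     # from 1/(lambda m) = 100
    "nu": 0.3,      # nu-SVM / single-class parameter
    "lam_h": 0.6,   # harmonic hyperkernel parameter
    "lam_Q": 1.0,   # hyperkernel regularization
}
```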

We observe (Table 1) that our method achieves state of the art results for all the datasets, except the heart dataset. We also achieve results much better than previously reported for the credit dataset. Comparing the results for C-SVM and Tuned SVM, we observe that our method is always equally good, or better than, a C-SVM tuned using 10-fold cross validation.

    6.2. Novelty Detection Experiments

To demonstrate that we can solve problems other than binary classification using the same framework, we performed novelty detection. We apply the single class support vector machine (Example 6) to detect outliers in the USPS data. A subset of 300 randomly selected USPS images for the digit 5 was used for the experiments. The parameter ν was set to 0.1 for these experiments, hence selecting up to 10% of the data as outliers. The rest of the parameters were the same as in the previous section. Since there is no quantitative method for measuring the performance of novelty detection, we cannot directly compare our results with the traditional single class SVM. We can only subjectively conclude, by visually inspecting a sample of the digits, that our approach works for novelty detection of USPS digits. Figure 2 shows a sample of the digits. We can see that the algorithm identifies novel digits, such as in the top row of Figure 2. The bottom row shows a sample of digits which have been deemed to be common.

Figure 2. Top: images of digit 5 considered novel by the algorithm; Bottom: common images of digit 5.

    7. Discussion and Conclusion

We have shown that it is possible to define a convex optimization problem which learns the best kernel given the data. The resulting problem, which has a Bayesian interpretation, is expressed as a SDP. Since we can optimize over the whole class of kernel functions, we can define more general kernels which may have many free parameters, without overfitting. The experimental results on classification and novelty detection demonstrate that it is possible to achieve the state of the art, and in certain cases (such as the credit data) improve the accuracy significantly.

This approach makes support vector based estimation approaches more automated. Parameter adjustment is less critical compared to the case when the kernel is fixed. Future work will focus on deriving improved statistical guarantees for estimates derived via hyperkernels which match the good empirical performance.

Acknowledgements This work was supported by a grant of the Australian Research Council. The authors would like to thank Laurent El Ghaoui, Michael Jordan, John Lloyd, Robert Williamson and Daniela Pucci de Farias for their helpful comments and suggestions. The authors also thank Alexandros Karatzoglou for his help with SVLAB.

    References

Albert, A. (1969). Conditions for positive and nonnegative definiteness in terms of pseudoinverses. SIAM Journal on Applied Mathematics, 17, 434-440.

Bennett, K. P., & Mangasarian, O. L. (1992). Robust linear programming discrimination of two linearly inseparable sets. Optimization Methods and Software, 1, 23-34.

Bousquet, O., & Herrmann, D. (2003). On the complexity of learning the kernel matrix. Advances in Neural Information Processing Systems 15.

Cortes, C., & Vapnik, V. (1995). Support vector networks. Machine Learning, 20, 273-297.

Crammer, K., Keshet, J., & Singer, Y. (2003). Kernel design using boosting. Advances in Neural Information Processing Systems 15.

Cristianini, N., Shawe-Taylor, J., Elisseeff, A., & Kandola, J. (2002). On kernel-target alignment. Advances in Neural Information Processing Systems 14 (pp. 367-373). Cambridge, MA: MIT Press.


Fine, S., & Scheinberg, K. (2000). Efficient SVM training using low-rank kernel representation (Technical Report). IBM Watson Research Center, New York.

Freund, Y., & Schapire, R. E. (1996). Experiments with a new boosting algorithm. Proceedings of the International Conference on Machine Learning (pp. 148-156). Morgan Kaufmann Publishers.

Haussler, D. (1999). Convolution kernels on discrete structures (Technical Report UCSC-CRL-99-10). Computer Science Department, UC Santa Cruz.

Lanckriet, G., Cristianini, N., Bartlett, P., Ghaoui, L. E., & Jordan, M. (2002). Learning the kernel matrix with semidefinite programming. Proceedings of the International Conference on Machine Learning (pp. 323-330). Morgan Kaufmann.

MacKay, D. J. C. (1994). Bayesian non-linear modelling for the energy prediction competition. American Society of Heating, Refrigerating and Air-Conditioning Engineers Transactions, 4, 448-472.

Mangasarian, O. L., & Musicant, D. R. (2001). Lagrangian support vector machines. Journal of Machine Learning Research, 1, 161-177. http://www.jmlr.org.

Meyer, D., Leisch, F., & Hornik, K. (2003). The support vector machine under test. Neurocomputing. Forthcoming.

Momma, M., & Bennett, K. P. (2002). A pattern search method for model selection of support vector regression. Proceedings of the Second SIAM International Conference on Data Mining.

Ong, C. S., Smola, A. J., & Williamson, R. C. (2003). Hyperkernels. Advances in Neural Information Processing Systems 15.

Opper, M., & Winther, O. (2000). Gaussian processes and SVM: Mean field and leave-one-out. Advances in Large Margin Classifiers (pp. 311-326). Cambridge, MA: MIT Press.

Rätsch, G., Onoda, T., & Müller, K. R. (2001). Soft margins for AdaBoost. Machine Learning, 42, 287-320.

Schölkopf, B., Platt, J., Shawe-Taylor, J., Smola, A. J., & Williamson, R. C. (2001). Estimating the support of a high-dimensional distribution. Neural Computation, 13, 1443-1471.

Schölkopf, B., Smola, A., Williamson, R. C., & Bartlett, P. L. (2000). New support vector algorithms. Neural Computation, 12, 1207-1245.

Vandenberghe, L., & Boyd, S. (1996). Semidefinite programming. SIAM Review, 38, 49-95.

Zhang, T. (2001). Some sparse approximation bounds for regression problems. Proc. 18th International Conf. on Machine Learning (pp. 624-631). Morgan Kaufmann, San Francisco, CA.

    A. Proof of Proposition 5

We prove the proposition that the solution of the quadratic minimax problem (6) is obtained by minimizing the SDP (7).

Proof Rewrite the terms of the objective function in (6) dependent on $x$ in terms of their Wolfe dual. The corresponding Lagrange function is

$L(x, \theta, \gamma) = -\frac{1}{2} x^{\top} H(\theta) x - c(\theta)^{\top} x + \gamma^{\top}(Ax + a),$ (19)

where $\gamma \in \mathbb{R}^{M}$ is a vector of Lagrange multipliers with $\gamma \geq 0$. By differentiating $L(x, \theta, \gamma)$ with respect to $x$ and setting the result to zero, one obtains that (19) is maximized with respect to $x$ for $x = H(\theta)^{\dagger}(A^{\top}\gamma - c(\theta))$ and subsequently we obtain the dual

$D(\theta, \gamma) = \frac{1}{2}(A^{\top}\gamma - c(\theta))^{\top} H(\theta)^{\dagger} (A^{\top}\gamma - c(\theta)) + \gamma^{\top}a.$ (20)

Note that $H(\theta)H(\theta)^{\dagger}H(\theta) = H(\theta)$. For equality constraints in (6), such as $Bx + b = 0$, we get correspondingly free dual variables. The dual optimization problem is given by inserting (20) into (6):

$\begin{array}{ll} \min_{\theta, \gamma} & \frac{1}{2}(A^{\top}\gamma - c(\theta))^{\top} H(\theta)^{\dagger} (A^{\top}\gamma - c(\theta)) + a^{\top}\gamma + d(\theta) \\ \text{subject to} & H(\theta) \succeq 0, \; G(\theta) \geq 0, \; \gamma \geq 0. \end{array}$ (21)

Introducing an auxiliary variable $t$, which serves as an upper bound on the quadratic objective term, gives an objective function linear in $t$ and $\gamma$. Then (21) can be written as

$\begin{array}{ll} \min_{t, \theta, \gamma} & \frac{1}{2} t + a^{\top}\gamma + d(\theta) \\ \text{subject to} & t \geq (A^{\top}\gamma - c(\theta))^{\top} H(\theta)^{\dagger} (A^{\top}\gamma - c(\theta)), \\ & H(\theta) \succeq 0, \; G(\theta) \geq 0, \; \gamma \geq 0. \end{array}$ (22)

From the properties of the Moore-Penrose inverse, we get $H(\theta)H(\theta)^{\dagger}(A^{\top}\gamma - c(\theta)) = (A^{\top}\gamma - c(\theta))$. Since $H(\theta) \succeq 0$, by the Schur complement lemma (Albert, 1969), the quadratic constraint in (22) is equivalent to

$\begin{bmatrix} H(\theta) & A^{\top}\gamma - c(\theta) \\ (A^{\top}\gamma - c(\theta))^{\top} & t \end{bmatrix} \succeq 0$ (23)
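The Schur complement step can be checked numerically. The following sketch (ours) builds a rank-deficient PSD matrix $H$ and a vector $v = A^{\top}\gamma - c(\theta)$ in its range, and verifies that the block matrix of (23) is positive semidefinite exactly when $t \geq v^{\top}H^{\dagger}v$.

```python
import numpy as np

rng = np.random.default_rng(0)

# random rank-deficient PSD H and a vector v = A'gamma - c(theta) in its range,
# as required by the identity H H^+ v = v used in the proof
B = rng.standard_normal((5, 3))
H = B @ B.T                       # PSD, rank 3
v = H @ rng.standard_normal(5)    # lies in the range of H

quad = v @ np.linalg.pinv(H) @ v  # the quadratic term bounded by t in (22)

def min_eig(t):
    """Smallest eigenvalue of the block matrix of (23) for a given t."""
    M = np.block([[H, v[:, None]], [v[None, :], np.array([[t]])]])
    return np.linalg.eigvalsh(M)[0]

print(min_eig(quad + 1e-6) >= -1e-9)   # True: t above the bound -> PSD
print(min_eig(quad - 1e-3) < 0)        # True: t below the bound -> not PSD
```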