Recent advances in Kernel methods for Classification Problems

Javier González and Alberto Muñoz

Universidad Carlos III de Madrid, Department of Statistics
December 16th, 2010
1 Motivation
2 Statistical Learning Theory and Support Vector Machines
3 Analyzed Problems:
- Classification problems with asymmetry
- Partially labeled classification problems
- Classification problems where several sources of information are available
4 Summary and Main Contributions
The learning process
Statistical Learning
The goal of Statistical Learning
The goal of learning theory is to approximate a function from data samples, perhaps perturbed by noise.
Examples of learning problems
- Optical character recognition: categorize images of handwritten characters by the letters represented.
- Face detection: find faces in images (or indicate if a face is present).
- Spam filtering: identify email messages as spam or non-spam.
Classification Problems
Ill-posed problems
Well-posed problems (Hadamard)
- A solution exists.
- The solution is unique.
- The solution is stable.
Examples of ill-posed problems
- Density estimation.
- Classification problems.
- Regression problems.
Example of ill-posed problem
(a) Smoothest interpolating polynomial of degree 10 for the original data and their perturbations.
(b) Interpolating polynomial of degree 2 for the original data and their perturbations.

An originally ill-posed problem and its transformation into a well-posed one.
Elements of the problem and notation
Elements of the problem
$X$: a compact set or manifold in a Euclidean space ($X \subset \mathbb{R}^d$).

$Y = \{-1, 1\}$ (binary classification).

$\nu$: a Borel probability measure defined on $X \times Y$.

$f_\nu(x) = \int_Y y \, d\nu(y|x) = E[y|x]$: the learning function.

A generic loss function $L(f_\nu(x), y)$.

Traditional goal of the learning process
Find the best approximation to $f_\nu: X \to Y$ given a random sample $s_n = \{(x_i, y_i)\}_{i=1}^{n} \subset X \times Y$ independently drawn from $\nu$.
Generalization and Empirical Error
Given $f: X \to Y$, its Generalization Error (or mean square error) is

$$R_\nu(f) = \int_{X \times Y} L(f(x), y) \, d\nu(x, y).$$

$f_\nu(x) = \int_Y y \, d\nu(y|x)$ is the function that minimizes $R_\nu(\cdot)$.

$R_\nu(\cdot)$ cannot be computed, since $\nu$ is unknown.

Given the sample data $s_n$ and $f: X \to Y$, the Empirical Error of $f$ is

$$R_{s_n}(f) = \frac{1}{n} \sum_{i=1}^{n} L(f(x_i), y_i).$$

$R_{s_n}(\cdot)$ converges in probability to $R_\nu(\cdot)$, so we approximate $R_\nu(\cdot)$ by $R_{s_n}(\cdot)$.

Minimizing $R_{s_n}(\cdot)$ is an ill-posed problem.

Cucker, F. and Smale, S. On the Mathematical Foundations of Learning. Bulletin of the American Mathematical Society, 39(1):1-49, 2002.
Imposing structure on the hypothesis space

We choose a space of functions in which the best approximation to $f_\nu(x)$ is to be sought.

Let $C(X)$ be the Banach space of continuous functions on $X$ with the norm $\|f\|_\infty = \sup_{x \in X} |f(x)|$.

Where to seek $f$?
In a compact set $H$ of $C(X)$: the hypothesis space.

If $H$ is compact, then the minimization of the Empirical Error over $H$ is well posed.
How to impose compactness on the hypothesis space?
By minimizing the variational functional

$$F(f) = \frac{1}{n} \sum_{i=1}^{n} L(f(x_i), y_i) + \gamma \, \Omega(f),$$

where $\gamma > 0$ and $\Omega(f)$ is a convex positive functional defined on $H$.

Our choice, following Wahba, Smale and others: $\Omega(f) = \|f\|_K^2$,
where $\|f\|_K$ is the norm of $f$ in a Reproducing Kernel Hilbert Space.
Wahba, G. Spline Models for Observational Data. Series in Applied Mathematics, vol. 59, SIAM, Philadelphia, 1990.
Cucker, F. and Smale, S. On the Mathematical Foundations of Learning. Bulletin of the American Mathematical Society, 39(1):1-49, 2002.
Mercer kernels
Definition
Let $X$ be a metric space and $K: X \times X \to \mathbb{R}$ a continuous and symmetric function. If we assume that $K$ is positive definite, that is, for any set $x = \{x_1, \dots, x_n\} \subset X$ the $n \times n$ matrix $K|_x$ with components $(K|_x)_{ij} = K(x_i, x_j)$ is positive semi-definite, then $K$ is a Mercer kernel.
Examples of Mercer kernels defined on $X$
- Linear kernel: $K(x, y) = x^T y$.
- Polynomial kernel: $K(x, y) = (a + x^T y)^b$ with $a \geq 0$ and $b \in \mathbb{N}$.
- Gaussian kernel: $K(x, y) = \exp(-\rho \|x - y\|^2)$ with $\rho \geq 0$.
- Laplace kernel: $K(x, y) = \exp(-\rho \|x - y\|)$ with $\rho \geq 0$.
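These four kernels are straightforward to vectorize; the following NumPy sketch (ours, not from the slides; function names and default parameters are assumptions) computes the kernel matrix between the rows of two data matrices.

```python
import numpy as np

def linear_kernel(X, Y):
    # K(x, y) = x^T y
    return X @ Y.T

def polynomial_kernel(X, Y, a=1.0, b=3):
    # K(x, y) = (a + x^T y)^b, with a >= 0 and b a positive integer
    return (a + X @ Y.T) ** b

def _sq_dists(X, Y):
    # ||x - y||^2 for all pairs of rows, clipped at 0 against round-off
    sq = np.sum(X**2, axis=1)[:, None] + np.sum(Y**2, axis=1)[None, :] - 2 * X @ Y.T
    return np.maximum(sq, 0.0)

def gaussian_kernel(X, Y, rho=1.0):
    # K(x, y) = exp(-rho * ||x - y||^2)
    return np.exp(-rho * _sq_dists(X, Y))

def laplace_kernel(X, Y, rho=1.0):
    # K(x, y) = exp(-rho * ||x - y||)
    return np.exp(-rho * np.sqrt(_sq_dists(X, Y)))
```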
Reproducing Kernel Hilbert Spaces (RKHS)

Definition
A RKHS $H_K$ of kernel $K$ is the completion of the space of functions spanned by finite linear combinations of the form

$$f(x) = \sum_{i=1}^{n} \alpha_i K(x_i, x),$$

where $n \in \mathbb{N}$, $x_i \in X$ and $\alpha_i \in \mathbb{R}$, equipped with the inner product

$$\langle f, g \rangle_K = \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \beta_j K(x_i, x_j),$$

where $f(x) = \sum_{i=1}^{n} \alpha_i K(x_i, x)$ and $g(x) = \sum_{j=1}^{n} \beta_j K(x_j, x)$ for $x \in X$.

Notice that $\|f\|_K = \sqrt{\langle f, f \rangle_K}$.
Wahba, G. Spline Models for Observational Data. Series in Applied Mathematics, vol. 59, SIAM, Philadelphia, 1990.
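For such finite expansions the inner product is just a quadratic form in the coefficient vectors. A minimal sketch (ours), where `kernel` is any of the functions above and the two expansions are allowed to use different centers:

```python
import numpy as np

def rkhs_inner(alpha, Xf, beta, Xg, kernel):
    # <f, g>_K = sum_ij alpha_i beta_j K(x_i, z_j)
    # for f = sum_i alpha_i K(x_i, .) and g = sum_j beta_j K(z_j, .)
    return alpha @ kernel(Xf, Xg) @ beta

def rkhs_norm(alpha, Xf, kernel):
    # ||f||_K = sqrt(<f, f>_K)
    return np.sqrt(rkhs_inner(alpha, Xf, alpha, Xf, kernel))
```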
Operators defined by a kernel

Operator associated to a kernel K
Let $K: X \times X \to \mathbb{R}$ be a continuous function. Then the (linear) map $L_K: L^2_\nu(X) \to C(X)$ defined by

$$(L_K f)(x) = \int_X K(x, t) f(t) \, d\nu(t)$$

is well defined, and the function $K$ is called the kernel of $L_K$.

In particular:
If $K$ is a Mercer kernel the Spectral Theorem applies: there exists an orthogonal basis $\{\phi_j\}$ of $H_K$ consisting of the eigenfunctions of $L_K$, where each $\phi_j$ is given by

$$\phi_j(x) = \frac{1}{\lambda_j} \int_X K(x, t) \phi_j(t) \, d\nu(t),$$

$\lambda_j > 0$ being the corresponding eigenvalue.
Mercer's theorem

Mercer's theorem
Let $K: X \times X \to \mathbb{R}$ be a Mercer kernel, let $\lambda_j$ be the $j$-th eigenvalue of $L_K$, and let $\{\phi_j\}_{j \geq 1}$ be the corresponding eigenfunctions. Then, for all $x, y \in X$,

$$K(x, y) = \sum_{j=1}^{\infty} \lambda_j \phi_j(x) \phi_j(y),$$

where the convergence is absolute (for each $(x, y) \in X \times X$) and uniform (on $X \times X$).
Mercer, J. Functions of Positive and Negative Type and the Connection with the Theory of Integral Equations. Philos. Trans. Roy. Soc. London A, 209:415-446, 1909.
Geometrical implications of Mercer's theorem

Remark
Let $\Phi: X \to \ell^2$ be the map given by $x \mapsto (\sqrt{\lambda_j}\, \phi_j(x))_{j \in \mathbb{N}}$. Then

$$K(x, y) = \langle \Phi(x), \Phi(y) \rangle.$$

$K(x, y)$ can be interpreted as an inner product in the transformed space.

Hyperplanes in the RKHS
$f(x) = \sum_{i=1}^{m} \alpha_i K(x, x_i) = \sum_{i=1}^{m} \alpha_i \langle \Phi(x), \Phi(x_i) \rangle = w^T \Phi(x)$,
so $f(x) = 0$ defines a hyperplane in the transformed space.
Aizerman, M. A., Braverman, E. M. and Rozonoer, L. I. Theoretical foundations of the potential function method in pattern recognition learning. Automation and Remote Control, 25:821-837, 1964.
Regularization

Regularization in RKHSs

$$\min_{f \in H_K} \frac{1}{n} \sum_{i=1}^{n} L(f(x_i), y_i) + \gamma \|f\|_K^2,$$

where $\gamma > 0$ and $\|f\|_K^2$ is the squared norm of $f$ in $H_K$. Minimizing this problem is equivalent to

$$\min_{f \in H_K} \frac{1}{n} \sum_{i=1}^{n} L(f(x_i), y_i) \quad \text{s.t.} \quad \|f\|_K^2 \leq \sup_{y \in Y} y^2 / \gamma.$$

Therefore, the hypothesis space where the solution is sought takes the form

$$H = \{ f \in H_K : \|f\|_K^2 \leq \sup_{y \in Y} y^2 / \gamma \},$$

that is, a convex, compact subset of $C(X)$.
Mukherjee, S., Rifkin, R. and Poggio, T. Regression and classification with regularization. Nonlinear Estimation and Classification, Lecture Notes in Statistics, 171:111-128. Springer, New York, 2003.
Solution to the learning problem

Representer Theorem
The solution to the regularization problem exists, is unique and admits a representation of the form

$$f^*(x) = \sum_{i=1}^{n} \alpha_i K(x_i, x), \quad \forall x \in X,$$

where $x_1, \dots, x_n$ are now the sample data.

Kimeldorf, G. S. and Wahba, G. A correspondence between Bayesian estimation on stochastic processes and smoothing by splines. The Annals of Mathematical Statistics, 41(2):495-502, 1970.
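As an illustration of the representer theorem at work, here is a sketch (ours) of kernel regularized least squares: with the squared loss the expansion coefficients solve a linear system (the hinge loss used by the SVM requires a QP solver instead). `gaussian_kernel`, or any of the kernels sketched earlier, can be passed in.

```python
import numpy as np

def krls_fit(X, y, gamma, kernel):
    # Representer theorem: f*(.) = sum_i alpha_i K(x_i, .).
    # For the squared loss, alpha solves (K + n*gamma*I) alpha = y.
    n = len(y)
    K = kernel(X, X)
    return np.linalg.solve(K + n * gamma * np.eye(n), y)

def krls_predict(alpha, Xtrain, Xnew, kernel):
    # Evaluate f*(x) = sum_i alpha_i K(x_i, x) at the new points
    return kernel(Xnew, Xtrain) @ alpha
```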
Support Vector Machines (SVM)
Mercer theorem: Mercer, 1909.
Geometrical interpretation of kernels: Aizerman et al., 1964.
Hyperplanes in a nonparametric context: Vapnik and Chervonenkis, 1964.

SVM origin: Boser, Guyon and Vapnik, 1992.
SVM as regularization problem: Wahba, 1999.
SVM review and open problems: Moguerza and Muñoz, 2006.
SVM as regularization method I
The SVM minimizes the risk functional

$$F(f) = \frac{1}{n} \sum_{i=1}^{n} L(y_i, f(x_i)) + \mu \, \Omega(f),$$

where:
- Loss function: the so-called hinge loss, $L(f(x_i), y_i) = (1 - y_i f(x_i))_+$, with $(x)_+ = \max(x, 0)$.
- Hypothesis space: a RKHS with reproducing kernel $K$.
SVM as regularization method II

Problem to solve

$$\min_{f \in H_K} \frac{1}{n} \sum_{i=1}^{n} (1 - y_i f(x_i))_+ + \mu \|f\|_K^2.$$

Solution to the dual problem

$$f^*(x) = \sum_{i=1}^{n} \lambda_i^* y_i K(x, x_i) + b^* = (w^*)^T \Phi(x) + b^*.$$

Remarks
- Estimating a nonlinear decision function in the input space amounts to estimating the weights of a hyperplane in the feature space.
- The empirical error of the SVM converges to the expected error.
- $\lambda_i^*$ and $b^*$ depend only on $K(x, x_i)$ and $\mu$.
- Many of the $\lambda_i^*$ are generally zero; the points with nonzero coefficients are the support vectors.
Geometrical interpretation of SVM

Origin of the SVMs
Transform the data to a high-dimensional space where the classes become separable, using a kernel.
Maximize the sum of distances from the hyperplane to the closest point of each class (the "margin").

Boser, B. E., Guyon, I. and Vapnik, V. A training algorithm for optimal margin classifiers. In Proc. Fifth ACM Workshop on Computational Learning Theory (COLT), 144-152. ACM Press, New York, 1992.

Both approaches are equivalent in terms of the final optimization problem.
Example: the solution depends on the chosen kernel
The decision function $(w^*)^T \Phi(x) + b^* = 0$ is plotted in black. The colors represent the output of the function for each point in the space (red for positive, yellow for negative); the intensity of the color represents the magnitude of the output.
(c) Linear. (d) Polynomial, degree 2. (e) Polynomial, degree 3. (f) RBF, σ = 0.1.
(g) RBF, σ = 0.5. (h) RBF, σ = 1. (i) RBF, σ = 2. (j) RBF, σ = 10.
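A comparison of this kind is easy to reproduce; the sketch below (ours, using scikit-learn as an assumed tool; note that sklearn parameterizes the RBF kernel by gamma rather than σ, with gamma = 1/(2σ²)) fits an SVM with each kernel to a toy problem.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = np.where(X[:, 0] ** 2 + X[:, 1] ** 2 > 1.0, 1, -1)  # not linearly separable

models = {
    "linear": SVC(kernel="linear"),
    "poly deg 2": SVC(kernel="poly", degree=2),
    "poly deg 3": SVC(kernel="poly", degree=3),
}
for sigma in [0.1, 0.5, 1.0, 2.0, 10.0]:
    models[f"rbf sigma={sigma}"] = SVC(kernel="rbf", gamma=1.0 / (2 * sigma**2))

for name, model in models.items():
    acc = model.fit(X, y).score(X, y)
    print(f"{name}: training accuracy {acc:.3f}")  # boundaries differ per kernel
```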
Key issue in SVM and other kernel methods
Main problem
To choose an appropriate kernel, i.e., to define a positive definite kernel matrix $K$ useful in classification.
Generalization of this approach to other classification procedures
Given a kernel matrix $K$, and using the relationship between inner products and distances,

$$D^K_{ij} = \sqrt{K_{ii} + K_{jj} - 2 K_{ij}},$$

we can apply some Multidimensional Scaling procedure to the distance matrix $D^K$ to obtain a Euclidean data representation that can be used as input in any classification procedure.
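A sketch of this pipeline (ours): distances recovered from a kernel matrix, followed by a classical MDS embedding via double centering.

```python
import numpy as np

def kernel_to_distance(K):
    # D_ij = sqrt(K_ii + K_jj - 2*K_ij): distances between feature-space images
    d = np.diag(K)
    sq = d[:, None] + d[None, :] - 2 * K
    return np.sqrt(np.maximum(sq, 0.0))

def classical_mds(D, dim=2):
    # Double-center the squared distances, then keep the top eigendirections
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n
    B = -0.5 * J @ (D ** 2) @ J
    w, V = np.linalg.eigh(B)
    idx = np.argsort(w)[::-1][:dim]
    return V[:, idx] * np.sqrt(np.maximum(w[idx], 0.0))
```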
Classification with asymmetric proximity matrices
Random sample $s_n = \{(x_1, y_1), \dots, (x_n, y_n)\}$, where $x_i \in X$ (a compact set of $\mathbb{R}^p$) and $y_i \in \{-1, 1\}$.

Asymmetric (normalized) similarity: a function $s: X \times X \to \mathbb{R}^+$ such that $s(x, x) = 1$, $s(x, y) \neq s(y, x)$ and $s(x, y) \geq 0$ for all $x, y \in X$.

Main Goal
In the classification and clustering context, to embed the data of the problem into a Euclidean space using the asymmetric similarity matrix $S$ (the matrix $S|_{s_n}$).

Main applications:
Textual Data (*).
Genetic Data.
Distances for time series.
Asymmetric similarity measure for the terms of a textual data set

$X$: terms-by-documents data matrix of dimensions $n \times p$, where $x_{ij} = 1$ if the $i$-th term appears in the $j$-th document and 0 otherwise.

$|x_i|$: number of documents indexed by the $i$-th term.

$|x_i \wedge x_j|$: number of documents indexed by both the $i$-th and $j$-th terms.

$$s_{ij} = \frac{|x_i \wedge x_j|}{|x_i|} = \frac{\sum_k \min(x_{ik}, x_{jk})}{\sum_k x_{ik}}.$$
Muñoz, A. Compound Key Word Generation from Document Databases Using a Hierarchical Clustering ART Model. Intelligent Data Analysis, 1(1-4):25-48, 1997.
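In code this is a one-liner on a binary matrix, since min(a, b) = a·b for 0/1 entries; a sketch (ours, assuming every term indexes at least one document):

```python
import numpy as np

def asymmetric_term_similarity(X):
    # X: n x p binary matrix, X[i, j] = 1 iff term i appears in document j
    inter = X @ X.T             # |x_i ^ x_j|, using min(a, b) = a*b for binary data
    counts = X.sum(axis=1)      # |x_i|, assumed > 0 for every term
    return inter / counts[:, None]  # s_ij = |x_i ^ x_j| / |x_i|; note s_ij != s_ji
```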
Terms hierarchy
First step: symmetrize S

Triangular Decomposition

$$S = \underbrace{\tfrac{1}{2}(S + S^T)}_{\text{Sym}} + \underbrace{\tfrac{1}{2}(S - S^T)}_{\text{Skew}}.$$

Then $\tfrac{1}{2}(S + S^T) = \tfrac{1}{2}(S_1 + S_2)$, where $S_1$ and $S_2$ are two symmetric matrices built from the upper and lower triangular parts of $S$.

Polar Decomposition
Let $S = U \Sigma V^T$ be the singular value decomposition of $S$, and take $M_1 = U \Sigma U^T$ and $M_2 = V \Sigma V^T$.

Idea
To combine $S_1$ and $S_2$ (or $M_1$ and $M_2$) and the labels to obtain a symmetric and positive definite kernel matrix $K^*$ useful for classification.
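Both decompositions take a few lines of NumPy; a sketch (ours, with the convention that each triangular part keeps the diagonal of S):

```python
import numpy as np

def triangular_parts(S):
    # S1, S2: symmetric matrices from the upper and lower triangles of S
    U = np.triu(S, 1)
    L = np.tril(S, -1)
    D = np.diag(np.diag(S))
    S1 = U + U.T + D
    S2 = L + L.T + D
    return S1, S2   # (S1 + S2) / 2 equals the symmetric part (S + S^T) / 2

def polar_parts(S):
    # M1 = U Sigma U^T, M2 = V Sigma V^T from the SVD S = U Sigma V^T
    U, sigma, Vt = np.linalg.svd(S)
    return U @ np.diag(sigma) @ U.T, Vt.T @ np.diag(sigma) @ Vt
```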
Combining S1, S2 and the labels

Idea
To find a matrix $S^*$ that maximizes the regularized between-groups separation criterion given by

$$G_\lambda[S] = \frac{1}{n} \sum_{i,j=1}^{n} (S)_{ij} y_i y_j - \lambda \sum_{i,j=1}^{n} \left( (S)_{ij} - \frac{S_{1,ij} + S_{2,ij}}{2} \right)^2, \qquad (1)$$

where $\lambda > 0$.

Solution to the previous problem:

$$S^* = \frac{1}{2}(S_1 + S_2) + \tau S_y,$$

where $\tau = 1/(2\lambda)$ and $S_y = y y^T$, with $y = (y_1, \dots, y_n)^T$ the label vector.

PROBLEM: $S^*$ is not positive semi-definite and its components are not available for test points.
Transforming S* to positive semi-definite

Orthogonal Projection (onto the cone of p.s.d. matrices)

$$K^* = \sum_{j=1}^{n} \max(l_j, 0) \, v_j v_j^T,$$

where $(l_j, v_j)$ are the eigenpairs of $S^*$.

Alternating Projections (AP)
Find $K^*$ in the intersection of:
- $K^n_+ = \{ K = K^T \in \mathbb{R}^{n \times n} : K \succeq 0 \}$.
- $Q^n = \{ Q \in \mathbb{R}^{n \times n} : q_{ii} = 1 \}$.

PROBLEM: $K^*$ is a kernel matrix whose components are not available for test points (the labels are unknown).
Deutsch, F. R. Best approximation in inner product spaces. Springer. CMS Books inMathematics, 2001.
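Both projections in a sketch (ours). The alternating scheme below is the plain version, without the Dykstra correction discussed in Deutsch's book; it converges to a point in the intersection rather than to the nearest one.

```python
import numpy as np

def project_psd(S):
    # Orthogonal projection onto the PSD cone: clip negative eigenvalues
    l, V = np.linalg.eigh((S + S.T) / 2)
    return (V * np.maximum(l, 0.0)) @ V.T

def alternating_projections(S, n_iter=100):
    # Alternate between the PSD cone and the unit-diagonal set {q_ii = 1}
    K = S.copy()
    for _ in range(n_iter):
        K = project_psd(K)
        np.fill_diagonal(K, 1.0)
    return K
```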
Extending K* to test points

New framework
Use a Functional Data Analysis approach to study the relationship between kernel matrices and Hilbert-Schmidt integral operators.

Approach in our case
To build a kernel function $K^*(x, y) = \sum_j \lambda_j \varphi_j(x) \varphi_j(y)$ such that $K^*(x_i, x_j) = (K^*)_{ij}$.

General idea to build the function $K^*$
Replace in $K^*$ each $\lambda_j$ and $\varphi_j$ by estimators $\hat{\lambda}_j$ and $\hat{\varphi}_j$:
- $\hat{\lambda}_j$: eigenvalues of $K^*$.
- $\hat{\varphi}_j = \sum c_j \hat{\phi}_j$: a linear combination of the eigenfunctions of the kernels associated to the sources of asymmetry.
- $\hat{\phi}_j(x) = \frac{1}{l_j \sqrt{n}} \sum_{i=1}^{n} K(x, x_i) v_{ij}$ (the Nyström formula).
Experiment: asymmetric similarities in textual data

20 Newsgroups data set: documents on Religion and Politics.

We classify the points by using:
- $S_1$ and $S_2$ (or $M_1$ and $M_2$ in the polar case).
- $\frac{1}{2}(S_1 + S_2)$ and $\frac{1}{2}(M_1 + M_2)$.
- $\frac{1}{2}(S_1 + S_2) + \tau S_y$ and $\frac{1}{2}(M_1 + M_2) + \tau S_y$.
- Combinations of $S_1$ and $S_2$ (and $M_1$, $M_2$) using semi-definite programming (S.D.P.).

Four classification algorithms: SVM, FDA(mars), FDA(bruto), LDA/PLS.

Document-frequency representation of the terms: each term $i$ is represented by a vector whose component $j$ is estimated by

$$df_{ij} = \frac{\#\,\text{times term } i \text{ appears in document } j}{\#\,\text{times term } i \text{ appears in the data base}}.$$
Lanckriet, G. et al. Learning the kernel matrix with semidefinite programming. Journal of Machine Learning Research, 5:27-72, 2004.
Classification results using asymmetry (test error rates)

Triang. Dec. +       S1       S2       (S1+S2)/2   (S1+S2)/2 + tau*Sy   S.D.P.
Orthogonal Proj.
SVM                  0.2424   0.2278   0.2357      0.1506               0.1819
FDA(bruto)           0.2345   0.2333   0.1792      0.1541               0.2580
FDA(mars)            0.2756   0.2811   0.2027      0.1722               0.2580
LDA/PLS              0.2035   0.1945   0.1929      0.1549               0.1486

Triang. Dec. +       S1       S2       (S1+S2)/2   (S1+S2)/2 + tau*Sy   S.D.P.
Alternating Proj.
SVM                  0.2471   0.2373   0.2333      0.1518               0.1819
FDA(bruto)           0.2192   0.2224   0.1608      0.1498               0.2580
FDA(mars)            0.2796   0.2576   0.2360      0.1746               0.2580
LDA/PLS              0.2086   0.1878   0.1780      0.1525               0.1486

Polar Dec.           M1       M2       (M1+M2)/2   (M1+M2)/2 + tau*Sy   S.D.P.
SVM                  0.1851   0.1965   0.1678      0.1349               0.2521
FDA(bruto)           0.1690   0.1784   0.1494      0.1424               0.1898
FDA(mars)            0.2175   0.2415   0.2462      0.1623               0.1898
LDA/PLS              0.1655   0.2137   0.1655      0.1404               0.1432
Comparison without using asymmetry

Method               SVM       FDA(bruto)   FDA(mars)   LDA/PLS
Original data, df    0.1957    0.4520       0.3416      0.1522
Best asymm. result   0.1349    0.1424       0.1623      0.1404
Relative improv.     31.06%    68.49%       52.48%      7.75%
(l) MDS of the terms using the Euclidean distance.
(m) MDS of the terms using the distance induced by the combination of S1, S2 and Sy.
(Two scatter plots of the newsgroup terms; the individual term labels are omitted.)
Application: Partially labeled classification problems
Example
In speech recognition, it costs almost nothing to record huge amounts of speech, but labeling it requires a human to listen and transcribe.

Elements of the problem
Training sample: $s_n = \{(x_1, y_1), \dots, (x_t, y_t), x_{t+1}, \dots, x_n\}$, where $x_i \in X$ (a compact set of $\mathbb{R}^p$) and $y_i \in \{-1, 1\}$.
- Subset $s_t = \{(x_1, y_1), \dots, (x_t, y_t)\}$ of $t$ labeled points.
- Subset $s^u_{n-t} = \{x_{t+1}, \dots, x_n\}$ of $n - t$ unlabeled points.

$S$: $n \times n$ similarity matrix ($S_{ii} = 1$).

Chapelle, O., Schölkopf, B., and Zien, A. Semi-Supervised Learning. MIT Press, Cambridge, MA, 2006.
Idea to solve the partially labeled classification problem

Main Goal
Transform the similarity $S$ using the information of the unlabeled points, making it more useful for classification.
Transform the similarity matrix S

STEP 1: Define S*

$$(S^*)_{ij} = \frac{1}{2}\left( (S)_{ij} + (S)_{z_i, z_j} \right) + \sigma\, y_{z_i} y_{z_j} \left| (S)_{ij} - (S)_{z_i, z_j} \right|,$$

where:
- $\sigma \geq 0$.
- $z_i$ and $z_j$ are the indices of the labeled data points closest (with $d_{ij} = 1 - s_{ij}$) to $x_i$ and $x_j$, whose labels are $y_{z_i}$, $y_{z_j}$.
- $(S^*)_{ij} = (S)_{ij}$ if the $i$-th and $j$-th points are both labeled.

STEP 2: Obtain K* by transforming S* to positive definite.
We can use one of the previously defined projections onto the cone of positive semi-definite matrices: the orthogonal projection or alternating projections.
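A sketch of STEP 1 (ours), assuming a normalized similarity matrix S and a label vector y that uses 0 to mark unlabeled points:

```python
import numpy as np

def transform_similarity(S, y, sigma):
    # y: +1/-1 for labeled points, 0 for unlabeled ones
    labeled = np.flatnonzero(y != 0)
    D = 1.0 - S                                    # d_ij = 1 - s_ij
    z = labeled[np.argmin(D[:, labeled], axis=1)]  # closest labeled point to each x_i
    Sz = S[np.ix_(z, z)]                           # (S)_{z_i, z_j}
    yy = np.outer(y[z], y[z])                      # y_{z_i} * y_{z_j}
    S_star = 0.5 * (S + Sz) + sigma * yy * np.abs(S - Sz)
    both = np.ix_(labeled, labeled)
    S_star[both] = S[both]                         # keep S where both points are labeled
    return S_star
```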
Illustrative example: description
Sample $s_n$ of $n = 1000$ data points drawn from bivariate normal distributions $N(\mu_i, I)$, with $\mu_1 = (0, 0)$ and $\mu_2 = (4, 0)$ (so that the optimal boundary is $x = 2$). 20000 additional test points.

Theoretical Bayes error: 0.027. Optimal discriminant function: $x = 2$.

Several scenarios, selecting (randomly) from $s_n$ an increasing number $t$ of labeled data (from 10 to 250).

For each scenario, we compare the averaged errors of an SVM trained on the $t$ labeled data, using a linear kernel and using the matrix $K^*$.

Test errors estimated over the 20000 additional points (30 runs).
Data points and convergence of the SVM
(n) Simulated data.
(ñ) Convergence of the SVM error for a linear kernel (black) and for K* (grey), as a function of the sample size of labeled data.
Results
(o) SVM + linear kernel and SVM + K* when t = 10.
(p) SVM + linear kernel and SVM + K* when t = 50.
Real data examples
Four data sets (UCI Repository):
- Iris (we only considered the classes versicolor and virginica).
- Connectionist Bench (Sonar, Mines vs. Rocks).
- Breast Cancer Wisconsin.
- Blood Transfusion.

We divide each data set into two subsets, one of size $n$ (training sample $s_n$) and the rest for testing.

We compare $K$-nn with $K = 1$ and an SVM using a linear kernel and using $K^*$.

We fix $\sigma$ and select the parameters by cross-validation.
Results

Data Set   Param. sigma   Tr./Test    (t, n-t)    SVM + linear   SVM + K*   1-nn
Iris       [3.5, 6.5]     (60,40)     (10,50)     0.0958         0.0833     0.1008
                                      (50,10)     0.0500         0.0250     0.0766
Sonar      [0.05, 0.1]    (170,38)    (75,95)     0.3491         0.2964     0.2596
                                      (100,70)    0.3043         0.1605     0.2368
                                      (150,20)    0.2429         0.2438     0.2228
Cancer     [1.5, 2.1]     (400,283)   (10,390)    0.1253         0.0628     0.0667
                                      (50,350)    0.0502         0.0398     0.0591
                                      (100,300)   0.0369         0.0380     0.0501
Transf.    [0.1, 1]       (400,384)   (10,390)    0.2529         0.2274     0.2917
                                      (50,350)    0.1845         0.1798     0.2717
                                      (100,300)   0.1800         0.1788     0.2720
Proximity matrix combinations

Example
In web page classification problems we have two sources of information available:

The co-citation matrix.

The term-by-document matrix.

Goal
To combine (2 or more) proximity matrices to obtain a single representation of the data (web pages) useful in classification problems.
Martín de Diego, I., Muñoz, A., and Moguerza, J. M. Methods for the combination of kernel matrices within a support vector framework. Machine Learning, 78:137-174, 2009.

Moguerza, J. M. and Muñoz, A. Rejoinder to "Support Vector Machines with Applications". Statistical Science, 21(3):358-362, 2006.
Combining kernel matrices

Kernel combination scheme
$s_n = \{(x_i, y_i)\}_{i=1}^{n}$ is a random sample where $x_i \in X$ (some subset of $\mathbb{R}^p$) and $y_i \in \{-1, 1\}$ are the labels of the data. $K_1, \dots, K_m$ is a set of $m$ kernel matrices and $Y$ is the diagonal matrix of the labels of the problem.

$$K^* = \frac{1}{m} \sum_{t=1}^{m} K_t + \tau\, Y \left( \sum_{t<l} g(K_t, K_l) \right) Y.$$

Example
Method                     tau    g(x)         # Kernels
AKM (Average)              0      -            m
MAKM (Modified Average)    > 0    g(x) = 1     m
AV (Absolute Value)        > 0    g(x) = |x|   m
PO (Pick Out)              1/2    g(x) = |x|   2
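A sketch (ours) of the Absolute Value (AV) combination from the table, reading g(K_t, K_l) as the elementwise absolute difference |K_t − K_l| (our interpretation of g(x) = |x|); it assumes at least two kernel matrices.

```python
import numpy as np

def combine_kernels_av(Ks, y, tau):
    # K* = (1/m) sum_t K_t + tau * Y (sum_{t<l} |K_t - K_l|) Y, with Y = diag(y)
    m = len(Ks)                  # assumes m >= 2
    base = sum(Ks) / m
    G = sum(np.abs(Ks[t] - Ks[l]) for t in range(m) for l in range(t + 1, m))
    Y = np.diag(y)
    return base + tau * Y @ G @ Y
```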
Open questions in kernel combinations
Problems
1 How to define a measure $g$ that captures the differences between each pair of kernels.
2 How to define a sum of kernels that takes into account the redundant information between each pair of matrices.
General idea to solve both problems
Given $K_1, K_2, \dots, K_m \in \mathbb{R}^{n \times n}$, find an orthonormal matrix $V \in \mathbb{R}^{n \times n}$ such that

$$V^T K_1 V = D_1, \quad \dots, \quad V^T K_m V = D_m,$$

where $D_1, \dots, D_m$ are diagonal (or quasi-diagonal).

Main idea
Use the matrices $D_1, \dots, D_m$ and $V$ to manage the redundant information between the matrices.
Alberto Muñoz and Javier González. Joint Diagonalization of Kernels for Information Fusion. Proceedings of the Iberoamerican Congress on Pattern Recognition, 556-563, Springer.

Javier González and Alberto Muñoz. Spectral Measures for Kernel Matrices Comparison. Proceedings of the International Conference on Neural Networks, 727-736, Springer.
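A rough sketch of the idea (ours, not the papers' algorithm): take V from the eigendecomposition of the average kernel as an approximate common basis, and measure how far each V^T K_t V is from diagonal; an exact joint diagonalization would use, e.g., a Jacobi-type routine.

```python
import numpy as np

def approximate_joint_basis(Ks):
    # Eigenvectors of the average matrix as an approximate common basis V
    _, V = np.linalg.eigh(sum(Ks) / len(Ks))
    Ds = [V.T @ K @ V for K in Ks]
    return V, Ds

def offdiag_mass(D):
    # Fraction of squared Frobenius norm off the diagonal; 0 means exactly diagonal
    off = D - np.diag(np.diag(D))
    return np.linalg.norm(off) ** 2 / np.linalg.norm(D) ** 2
```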
Main Contributions
Modifying the original data similarity using the labels is an efficient strategy to define appropriate kernel matrices for classification.

We have proposed a methodology to solve partially labeled classification problems.

A new method to deal with asymmetric proximity matrices in classification problems, with application to textual data analysis.

Extension of the previous methodology to information fusion problems where the available sources of information are a set of proximity matrices.

Study of the problems derived from the improper use of the common information of the matrices.
Other work

Representation of functional data in Reproducing Kernel Hilbert Spaces.

Alberto Muñoz and Javier González. Representing Functional Data with Support Vector Machines. Pattern Recognition Letters, 31(6):511-516.

Kernels for latent semantic extraction in text mining problems.

Alberto Muñoz, Javier González and Javier Arriero. Kernel Latent Semantic Analysis using Term Fusion Kernels. Support Vector Machines: Data Analysis, Machine Learning and Applications. NOVA Science. To appear.

Kernels (generalized covariances) for spatio-temporal data analysis.

Javier González, Stephan R. Sain and Alberto Muñoz. Spatial Temporal Data Analysis via Reproducing Kernel Regularization. Joint Statistical Meeting (JSM), Washington, USA, June 2009.

Robust methods in statistics.

Javier González, Daniel Peña and Rosario Romera. A Robust Partial Least Squares Regression Method with Applications. Journal of Chemometrics, 23:78-90, 2009.
Thanks!