Recent advances in Kernel methods for Classification Problems

Javier González and Alberto Muñoz

Universidad Carlos III de Madrid, Department of Statistics
December 16th, 2010
1 Motivation
2 Statistical Learning Theory and Support Vector Machines
3 Analyzed Problems:
- Classification problems with asymmetry
- Partially labeled classification problems
- Classification problems where several sources of information are available
4 Summary and Main Contributions
The learning process
Statistical Learning
The goal of Statistical Learning
The goal of learning theory is to approximate a function from data samples, perhaps perturbed by noise.
Examples of learning problems
- Optical character recognition: categorize images of handwritten characters by the letters represented.
- Face detection: find faces in images (or indicate if a face is present).
- Spam filtering: identify email messages as spam or non-spam.
Classification Problems
Ill-posed problems
Well-posed problems (Hadamard)
- A solution exists.
- The solution is unique.
- The solution is stable.
Examples of ill-posed problems
- Density estimation.
- Classification problems.
- Regression problems.
Example of ill-posed problem
(a) Smoothest interpolating polynomial of degree 10 for the original data and their perturbations.
(b) Interpolating polynomial of degree 2 for the original data and their perturbations.

An originally ill-posed problem and its transformation into a well-posed one.
Elements of the problem and notation
Elements of the problem
$X$: a compact set or manifold in a Euclidean space ($X \subset \mathbb{R}^d$).

$Y = \{-1, 1\}$ (binary classification).

$\nu$: a Borel probability measure defined on $X \times Y$.

$f_\nu(x) = \int_Y y \, d\nu(y|x) = E[y|x]$: the learning function.

A generic loss function $L(f_\nu(x), y)$.

Traditional goal of the learning process
Find the best approximation to $f_\nu: X \to Y$ given a random sample $s_n = \{(x_i, y_i)\}_{i=1}^{n} \subset X \times Y$ independently drawn from $\nu$.
Generalization and Empirical Error
Given $f: X \to Y$, its Generalization Error (or mean square error) is

$$R_\nu(f) = \int_{X \times Y} L(f(x), y) \, d\nu(x, y).$$

$f_\nu(x) = \int_Y y \, d\nu(y|x)$ is the function that minimizes $R_\nu(\cdot)$.

$R_\nu(\cdot)$ cannot be computed, since $\nu$ is unknown.

Given the sample data $s_n$ and $f: X \to Y$, the Empirical Error of $f$ is

$$R_{s_n}(f) = \frac{1}{n} \sum_{i=1}^{n} L(f(x_i), y_i).$$

$R_{s_n}(\cdot)$ converges in probability to $R_\nu(\cdot)$, so we approximate $R_\nu(\cdot)$ by $R_{s_n}(\cdot)$.

Minimizing $R_{s_n}(\cdot)$ is an ill-posed problem.

Cucker, F. and Smale, S. On the Mathematical Foundations of Learning. Bulletin of the American Mathematical Society, 39(1):1-49, 2002.
Imposing structure on the hypothesis space

We choose a space of functions in which the best approximation to $f_\nu(x)$ is to be sought.

Let $C(X)$ be the Banach space of continuous functions on $X$ with the norm $\|f\|_\infty = \sup_{x \in X} |f(x)|$.

Where to seek $f$?
In a compact set $H$ of $C(X)$: the hypothesis space.

If $H$ is compact, then the minimization of the Empirical Error over $H$ is well posed.
How to impose compactness on the hypothesis space?
By minimizing the variational functional

$$F(f) = \frac{1}{n} \sum_{i=1}^{n} L(f(x_i), y_i) + \gamma \, \Omega(f),$$

where $\gamma > 0$ and $\Omega(f)$ is a convex positive functional defined on $H$.

Our choice, following Wahba, Smale and others: $\Omega(f) = \|f\|_K^2$,
where $\|f\|_K$ is the norm of $f$ in a Reproducing Kernel Hilbert Space.
Wahba, G. Spline Models for Observational Data. Series in Applied Mathematics, vol. 59, SIAM, Philadelphia, 1990.
Cucker, F. and Smale, S. On the Mathematical Foundations of Learning. Bulletin of the American Mathematical Society, 39(1):1-49, 2002.
Mercer kernels
Definition
Let $X$ be a metric space and $K: X \times X \to \mathbb{R}$ a continuous and symmetric function. If we assume that $K$ is positive definite, that is, for any set $x = \{x_1, \dots, x_n\} \subset X$ the $n \times n$ matrix $K|_x$ with components $(K|_x)_{ij} = K(x_i, x_j)$ is positive semi-definite, then $K$ is a Mercer kernel.
Examples of Mercer kernels defined on $X$
- Linear kernel: $K(x, y) = x^T y$.
- Polynomial kernel: $K(x, y) = (a + x^T y)^b$ with $a \geq 0$ and $b \in \mathbb{N}$.
- Gaussian kernel: $K(x, y) = \exp(-\rho \|x - y\|^2)$ with $\rho \geq 0$.
- Laplace kernel: $K(x, y) = \exp(-\rho \|x - y\|)$ with $\rho \geq 0$.
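These four kernels are straightforward to vectorize; the following NumPy sketch (ours, not from the slides; function names and default parameters are assumptions) computes the kernel matrix between the rows of two data matrices.

```python
import numpy as np

def linear_kernel(X, Y):
    # K(x, y) = x^T y
    return X @ Y.T

def polynomial_kernel(X, Y, a=1.0, b=3):
    # K(x, y) = (a + x^T y)^b, with a >= 0 and b a positive integer
    return (a + X @ Y.T) ** b

def _sq_dists(X, Y):
    # ||x - y||^2 for all pairs of rows, clipped at 0 against round-off
    sq = np.sum(X**2, axis=1)[:, None] + np.sum(Y**2, axis=1)[None, :] - 2 * X @ Y.T
    return np.maximum(sq, 0.0)

def gaussian_kernel(X, Y, rho=1.0):
    # K(x, y) = exp(-rho * ||x - y||^2)
    return np.exp(-rho * _sq_dists(X, Y))

def laplace_kernel(X, Y, rho=1.0):
    # K(x, y) = exp(-rho * ||x - y||)
    return np.exp(-rho * np.sqrt(_sq_dists(X, Y)))
```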
Reproducing Kernel Hilbert Spaces (RKHS)

Definition
A RKHS $H_K$ of kernel $K$ is the completion of the space of functions spanned by finite linear combinations of the form

$$f(x) = \sum_{i=1}^{n} \alpha_i K(x_i, x),$$

where $n \in \mathbb{N}$, $x_i \in X$ and $\alpha_i \in \mathbb{R}$, equipped with the inner product

$$\langle f, g \rangle_K = \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \beta_j K(x_i, x_j),$$

where $f(x) = \sum_{i=1}^{n} \alpha_i K(x_i, x)$ and $g(x) = \sum_{j=1}^{n} \beta_j K(x_j, x)$ for $x \in X$.

Notice that $\|f\|_K = \sqrt{\langle f, f \rangle_K}$.
Wahba, G. Spline Models for Observational Data. Series in Applied Mathematics, vol. 59, SIAM, Philadelphia, 1990.
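For such finite expansions the inner product is just a quadratic form in the coefficient vectors. A minimal sketch (ours), where `kernel` is any of the functions above and the two expansions are allowed to use different centers:

```python
import numpy as np

def rkhs_inner(alpha, Xf, beta, Xg, kernel):
    # <f, g>_K = sum_ij alpha_i beta_j K(x_i, z_j)
    # for f = sum_i alpha_i K(x_i, .) and g = sum_j beta_j K(z_j, .)
    return alpha @ kernel(Xf, Xg) @ beta

def rkhs_norm(alpha, Xf, kernel):
    # ||f||_K = sqrt(<f, f>_K)
    return np.sqrt(rkhs_inner(alpha, Xf, alpha, Xf, kernel))
```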
Operators defined by a kernel

Operator associated to a kernel K
Let $K: X \times X \to \mathbb{R}$ be a continuous function. Then the (linear) map $L_K: L^2_\nu(X) \to C(X)$ defined by

$$(L_K f)(x) = \int_X K(x, t) f(t) \, d\nu(t)$$

is well defined, and the function $K$ is called the kernel of $L_K$.

In particular:
If $K$ is a Mercer kernel the Spectral Theorem applies: there exists an orthogonal basis $\{\phi_j\}$ of $H_K$ consisting of the eigenfunctions of $L_K$, where each $\phi_j$ is given by

$$\phi_j(x) = \frac{1}{\lambda_j} \int_X K(x, t) \phi_j(t) \, d\nu(t),$$

$\lambda_j > 0$ being the corresponding eigenvalue.
Mercer's theorem

Mercer's theorem
Let $K: X \times X \to \mathbb{R}$ be a Mercer kernel, let $\lambda_j$ be the $j$-th eigenvalue of $L_K$, and let $\{\phi_j\}_{j \geq 1}$ be the corresponding eigenfunctions. Then, for all $x, y \in X$,

$$K(x, y) = \sum_{j=1}^{\infty} \lambda_j \phi_j(x) \phi_j(y),$$

where the convergence is absolute (for each $(x, y) \in X \times X$) and uniform (on $X \times X$).
Mercer, J. Functions of Positive and Negative Type and the Connection with the Theory of Integral Equations. Philos. Trans. Roy. Soc. London A, 209:415-446, 1909.
Geometrical implications of Mercer's theorem

Remark
Let $\Phi: X \to \ell^2$ be the map given by $x \mapsto (\sqrt{\lambda_j}\, \phi_j(x))_{j \in \mathbb{N}}$. Then

$$K(x, y) = \langle \Phi(x), \Phi(y) \rangle.$$

$K(x, y)$ can be interpreted as an inner product in the transformed space.

Hyperplanes in the RKHS
$f(x) = \sum_{i=1}^{m} \alpha_i K(x, x_i) = \sum_{i=1}^{m} \alpha_i \langle \Phi(x), \Phi(x_i) \rangle = w^T \Phi(x)$,
so $f(x) = 0$ defines a hyperplane in the transformed space.
Aizerman, M. A., Braverman, E. M. and Rozonoer, L. I. Theoretical foundations of the potential function method in pattern recognition learning. Automation and Remote Control, 25:821-837, 1964.
Regularization

Regularization in RKHSs

$$\min_{f \in H_K} \frac{1}{n} \sum_{i=1}^{n} L(f(x_i), y_i) + \gamma \|f\|_K^2,$$

where $\gamma > 0$ and $\|f\|_K^2$ is the squared norm of $f$ in $H_K$. Minimizing this problem is equivalent to

$$\min_{f \in H_K} \frac{1}{n} \sum_{i=1}^{n} L(f(x_i), y_i) \quad \text{s.t.} \quad \|f\|_K^2 \leq \sup_{y \in Y} y^2 / \gamma.$$

Therefore, the hypothesis space where the solution is sought takes the form

$$H = \{ f \in H_K : \|f\|_K^2 \leq \sup_{y \in Y} y^2 / \gamma \},$$

that is, a convex, compact subset of $C(X)$.
Mukherjee, S., Rifkin, R. and Poggio, T. Regression and classification with regularization. Nonlinear Estimation and Classification, Lecture Notes in Statistics, 171:111-128. Springer, New York, 2003.
Solution to the learning problem

Representer Theorem
The solution to the regularization problem exists, is unique and admits a representation of the form

$$f^*(x) = \sum_{i=1}^{n} \alpha_i K(x_i, x), \quad \forall x \in X,$$

where $x_1, \dots, x_n$ are now the sample data.

Kimeldorf, G. S. and Wahba, G. A correspondence between Bayesian estimation on stochastic processes and smoothing by splines. The Annals of Mathematical Statistics, 41(2):495-502, 1970.
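As an illustration of the representer theorem at work, here is a sketch (ours) of kernel regularized least squares: with the squared loss the expansion coefficients solve a linear system (the hinge loss used by the SVM requires a QP solver instead). `gaussian_kernel`, or any of the kernels sketched earlier, can be passed in.

```python
import numpy as np

def krls_fit(X, y, gamma, kernel):
    # Representer theorem: f*(.) = sum_i alpha_i K(x_i, .).
    # For the squared loss, alpha solves (K + n*gamma*I) alpha = y.
    n = len(y)
    K = kernel(X, X)
    return np.linalg.solve(K + n * gamma * np.eye(n), y)

def krls_predict(alpha, Xtrain, Xnew, kernel):
    # Evaluate f*(x) = sum_i alpha_i K(x_i, x) at the new points
    return kernel(Xnew, Xtrain) @ alpha
```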
Support Vector Machines (SVM)
Mercer theorem: Mercer, 1909.
Geometrical interpretation of kernels: Aizerman et al., 1964.
Hyperplanes in a nonparametric context: Vapnik and Chervonenkis, 1964.

SVM origin: Boser, Guyon and Vapnik, 1992.
SVM as regularization problem: Wahba, 1999.
SVM review and open problems: Moguerza and Muñoz, 2006.
SVM as regularization method I
The SVM minimizes the risk functional

$$F(f) = \frac{1}{n} \sum_{i=1}^{n} L(y_i, f(x_i)) + \mu \, \Omega(f),$$

where:
- Loss function: the so-called hinge loss, $L(f(x_i), y_i) = (1 - y_i f(x_i))_+$, with $(x)_+ = \max(x, 0)$.
- Hypothesis space: a RKHS with reproducing kernel $K$.
SVM as regularization method II

Problem to solve

$$\min_{f \in H_K} \frac{1}{n} \sum_{i=1}^{n} (1 - y_i f(x_i))_+ + \mu \|f\|_K^2.$$

Solution to the dual problem

$$f^*(x) = \sum_{i=1}^{n} \lambda_i^* y_i K(x, x_i) + b^* = (w^*)^T \Phi(x) + b^*.$$

Remarks
- Estimating a nonlinear decision function in the input space amounts to estimating the weights of a hyperplane in the feature space.
- The empirical error of the SVM converges to the expected error.
- $\lambda_i^*$ and $b^*$ depend only on $K(x, x_i)$ and $\mu$.
- Many of the $\lambda_i^*$ are generally zero; the points with nonzero coefficients are the support vectors.
Geometrical interpretation of SVM

Origin of the SVMs
Transform the data to a high-dimensional space where the classes become separable, using a kernel.
Maximize the sum of distances from the hyperplane to the closest point of each class (the "margin").

Boser, B. E., Guyon, I. and Vapnik, V. A training algorithm for optimal margin classifiers. In Proc. Fifth ACM Workshop on Computational Learning Theory (COLT), 144-152. ACM Press, New York, 1992.

Both approaches are equivalent in terms of the final optimization problem.
Example: the solution depends on the chosen kernel
The decision function $(w^*)^T \Phi(x) + b^* = 0$ is plotted in black. The colors represent the output of the function for each point in the space (red for positive, yellow for negative); the intensity of the color represents the magnitude of the output.
(c) Linear. (d) Polynomial, degree 2. (e) Polynomial, degree 3. (f) RBF, σ = 0.1.
(g) RBF, σ = 0.5. (h) RBF, σ = 1. (i) RBF, σ = 2. (j) RBF, σ = 10.
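A comparison of this kind is easy to reproduce; the sketch below (ours, using scikit-learn as an assumed tool; note that sklearn parameterizes the RBF kernel by gamma rather than σ, with gamma = 1/(2σ²)) fits an SVM with each kernel to a toy problem.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = np.where(X[:, 0] ** 2 + X[:, 1] ** 2 > 1.0, 1, -1)  # not linearly separable

models = {
    "linear": SVC(kernel="linear"),
    "poly deg 2": SVC(kernel="poly", degree=2),
    "poly deg 3": SVC(kernel="poly", degree=3),
}
for sigma in [0.1, 0.5, 1.0, 2.0, 10.0]:
    models[f"rbf sigma={sigma}"] = SVC(kernel="rbf", gamma=1.0 / (2 * sigma**2))

for name, model in models.items():
    acc = model.fit(X, y).score(X, y)
    print(f"{name}: training accuracy {acc:.3f}")  # boundaries differ per kernel
```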
Key issue in SVM and other kernel methods
Main problem
To choose an appropriate kernel, i.e., to define a positive definite kernel matrix $K$ useful in classification.
Generalization of this approach to other classification procedures
Given a kernel matrix $K$, and using the relationship between inner products and distances,

$$D^K_{ij} = \sqrt{K_{ii} + K_{jj} - 2 K_{ij}},$$

we can apply some Multidimensional Scaling procedure to the distance matrix $D^K$ to obtain a Euclidean data representation that can be used as input in any classification procedure.
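A sketch of this pipeline (ours): distances recovered from a kernel matrix, followed by a classical MDS embedding via double centering.

```python
import numpy as np

def kernel_to_distance(K):
    # D_ij = sqrt(K_ii + K_jj - 2*K_ij): distances between feature-space images
    d = np.diag(K)
    sq = d[:, None] + d[None, :] - 2 * K
    return np.sqrt(np.maximum(sq, 0.0))

def classical_mds(D, dim=2):
    # Double-center the squared distances, then keep the top eigendirections
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n
    B = -0.5 * J @ (D ** 2) @ J
    w, V = np.linalg.eigh(B)
    idx = np.argsort(w)[::-1][:dim]
    return V[:, idx] * np.sqrt(np.maximum(w[idx], 0.0))
```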
Classification with asymmetric proximity matrices
Random sample $s_n = \{(x_1, y_1), \dots, (x_n, y_n)\}$, where $x_i \in X$ (a compact set of $\mathbb{R}^p$) and $y_i \in \{-1, 1\}$.

Asymmetric (normalized) similarity: a function $s: X \times X \to \mathbb{R}^+$ such that $s(x, x) = 1$, $s(x, y) \neq s(y, x)$ and $s(x, y) \geq 0$ for all $x, y \in X$.

Main Goal
In the classification and clustering context, to embed the data of the problem into a Euclidean space using the asymmetric similarity matrix $S$ (the matrix $S|_{s_n}$).

Main applications:
Textual Data (*).
Genetic Data.
Distances for time series.
Asymmetric similarity measure for the terms of a textual data set

$X$: terms-by-documents data matrix of dimensions $n \times p$, where $x_{ij} = 1$ if the $i$-th term appears in the $j$-th document and 0 otherwise.

$|x_i|$: number of documents indexed by the $i$-th term.

$|x_i \wedge x_j|$: number of documents indexed by both the $i$-th and $j$-th terms.

$$s_{ij} = \frac{|x_i \wedge x_j|}{|x_i|} = \frac{\sum_k \min(x_{ik}, x_{jk})}{\sum_k x_{ik}}.$$
Muñoz, A. Compound Key Word Generation from Document Databases Using a Hierarchical Clustering ART Model. Intelligent Data Analysis, 1(1-4):25-48, 1997.
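In code this is a one-liner on a binary matrix, since min(a, b) = a·b for 0/1 entries; a sketch (ours, assuming every term indexes at least one document):

```python
import numpy as np

def asymmetric_term_similarity(X):
    # X: n x p binary matrix, X[i, j] = 1 iff term i appears in document j
    inter = X @ X.T             # |x_i ^ x_j|, using min(a, b) = a*b for binary data
    counts = X.sum(axis=1)      # |x_i|, assumed > 0 for every term
    return inter / counts[:, None]  # s_ij = |x_i ^ x_j| / |x_i|; note s_ij != s_ji
```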
Terms hierarchy
First step: symmetrize S

Triangular Decomposition

$$S = \underbrace{\tfrac{1}{2}(S + S^T)}_{\text{Sym}} + \underbrace{\tfrac{1}{2}(S - S^T)}_{\text{Skew}}.$$

Then $\tfrac{1}{2}(S + S^T) = \tfrac{1}{2}(S_1 + S_2)$, where $S_1$ and $S_2$ are two symmetric matrices built from the upper and lower triangular parts of $S$.

Polar Decomposition
Let $S = U \Sigma V^T$ be the singular value decomposition of $S$, and take $M_1 = U \Sigma U^T$ and $M_2 = V \Sigma V^T$.

Idea
To combine $S_1$ and $S_2$ (or $M_1$ and $M_2$) and the labels to obtain a symmetric and positive definite kernel matrix $K^*$ useful for classification.
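Both decompositions take a few lines of NumPy; a sketch (ours, with the convention that each triangular part keeps the diagonal of S):

```python
import numpy as np

def triangular_parts(S):
    # S1, S2: symmetric matrices from the upper and lower triangles of S
    U = np.triu(S, 1)
    L = np.tril(S, -1)
    D = np.diag(np.diag(S))
    S1 = U + U.T + D
    S2 = L + L.T + D
    return S1, S2   # (S1 + S2) / 2 equals the symmetric part (S + S^T) / 2

def polar_parts(S):
    # M1 = U Sigma U^T, M2 = V Sigma V^T from the SVD S = U Sigma V^T
    U, sigma, Vt = np.linalg.svd(S)
    return U @ np.diag(sigma) @ U.T, Vt.T @ np.diag(sigma) @ Vt
```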
Combining S1, S2 and the labels

Idea
To find a matrix $S^*$ that maximizes the regularized between-groups separation criterion given by

$$G_\lambda[S] = \frac{1}{n} \sum_{i,j=1}^{n} (S)_{ij} y_i y_j - \lambda \sum_{i,j=1}^{n} \left( (S)_{ij} - \frac{S_{1,ij} + S_{2,ij}}{2} \right)^2, \qquad (1)$$

where $\lambda > 0$.

Solution to the previous problem:

$$S^* = \frac{1}{2}(S_1 + S_2) + \tau S_y,$$

where $\tau = 1/(2\lambda)$ and $S_y = y y^T$, with $y = (y_1, \dots, y_n)^T$ the label vector.

PROBLEM: $S^*$ is not positive semi-definite and its components are not available for test points.
Transforming S* to positive semi-definite

Orthogonal Projection (onto the cone of p.s.d. matrices)

$$K^* = \sum_{j=1}^{n} \max(l_j, 0) \, v_j v_j^T,$$

where $(l_j, v_j)$ are the eigenpairs of $S^*$.

Alternating Projections (AP)
Find $K^*$ in the intersection of:
- $K^n_+ = \{ K = K^T \in \mathbb{R}^{n \times n} : K \succeq 0 \}$.
- $Q^n = \{ Q \in \mathbb{R}^{n \times n} : q_{ii} = 1 \}$.

PROBLEM: $K^*$ is a kernel matrix whose components are not available for test points (the labels are unknown).
Deutsch, F. R. Best approximation in inner product spaces. Springer. CMS Books inMathematics, 2001.
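Both projections in a sketch (ours). The alternating scheme below is the plain version, without the Dykstra correction discussed in Deutsch's book; it converges to a point in the intersection rather than to the nearest one.

```python
import numpy as np

def project_psd(S):
    # Orthogonal projection onto the PSD cone: clip negative eigenvalues
    l, V = np.linalg.eigh((S + S.T) / 2)
    return (V * np.maximum(l, 0.0)) @ V.T

def alternating_projections(S, n_iter=100):
    # Alternate between the PSD cone and the unit-diagonal set {q_ii = 1}
    K = S.copy()
    for _ in range(n_iter):
        K = project_psd(K)
        np.fill_diagonal(K, 1.0)
    return K
```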
Extending K* to test points

New framework
Use a Functional Data Analysis approach to study the relationship between kernel matrices and Hilbert-Schmidt integral operators.

Approach in our case
To build a kernel function $K^*(x, y) = \sum_j \lambda_j \varphi_j(x) \varphi_j(y)$ such that $K^*(x_i, x_j) = (K^*)_{ij}$.

General idea to build the function $K^*$
Replace in $K^*$ each $\lambda_j$ and $\varphi_j$ by estimators $\hat{\lambda}_j$ and $\hat{\varphi}_j$:
- $\hat{\lambda}_j$: eigenvalues of $K^*$.
- $\hat{\varphi}_j = \sum c_j \hat{\phi}_j$: a linear combination of the eigenfunctions of the kernels associated to the sources of asymmetry.
- $\hat{\phi}_j(x) = \frac{1}{l_j \sqrt{n}} \sum_{i=1}^{n} K(x, x_i) v_{ij}$ (the Nyström formula).
Experiment: asymmetric similarities in textual data

20 Newsgroups data set: documents on Religion and Politics.

We classify the points by using:
- $S_1$ and $S_2$ (or $M_1$ and $M_2$ in the polar case).
- $\frac{1}{2}(S_1 + S_2)$ and $\frac{1}{2}(M_1 + M_2)$.
- $\frac{1}{2}(S_1 + S_2) + \tau S_y$ and $\frac{1}{2}(M_1 + M_2) + \tau S_y$.
- Combinations of $S_1$ and $S_2$ (and $M_1$, $M_2$) using semi-definite programming (S.D.P.).

Four classification algorithms: SVM, FDA(mars), FDA(bruto), LDA/PLS.

Document-frequency representation of the terms: each term $i$ is represented by a vector whose component $j$ is estimated by

$$df_{ij} = \frac{\#\,\text{times term } i \text{ appears in document } j}{\#\,\text{times term } i \text{ appears in the data base}}.$$
Lanckriet, G. et al. Learning the kernel matrix with semidefinite programming. Journal of Machine Learning Research, 5:27-72, 2004.
Classification results using asymmetry (test error rates)

Triang. Dec. +       S1       S2       (S1+S2)/2   (S1+S2)/2 + tau*Sy   S.D.P.
Orthogonal Proj.
SVM                  0.2424   0.2278   0.2357      0.1506               0.1819
FDA(bruto)           0.2345   0.2333   0.1792      0.1541               0.2580
FDA(mars)            0.2756   0.2811   0.2027      0.1722               0.2580
LDA/PLS              0.2035   0.1945   0.1929      0.1549               0.1486

Triang. Dec. +       S1       S2       (S1+S2)/2   (S1+S2)/2 + tau*Sy   S.D.P.
Alternating Proj.
SVM                  0.2471   0.2373   0.2333      0.1518               0.1819
FDA(bruto)           0.2192   0.2224   0.1608      0.1498               0.2580
FDA(mars)            0.2796   0.2576   0.2360      0.1746               0.2580
LDA/PLS              0.2086   0.1878   0.1780      0.1525               0.1486

Polar Dec.           M1       M2       (M1+M2)/2   (M1+M2)/2 + tau*Sy   S.D.P.
SVM                  0.1851   0.1965   0.1678      0.1349               0.2521
FDA(bruto)           0.1690   0.1784   0.1494      0.1424               0.1898
FDA(mars)            0.2175   0.2415   0.2462      0.1623               0.1898
LDA/PLS              0.1655   0.2137   0.1655      0.1404               0.1432
Comparison without using asymmetry

Method               SVM       FDA(bruto)   FDA(mars)   LDA/PLS
Original data, df    0.1957    0.4520       0.3416      0.1522
Best asymm. result   0.1349    0.1424       0.1623      0.1404
Relative improv.     31.06%    68.49%       52.48%      7.75%
(l) MDS of the terms using the Euclidean distance.
(m) MDS of the terms using the distance induced by the combination of S1, S2 and Sy.
(Two scatter plots of the newsgroup terms; the individual term labels are omitted.)
Application: Partially labeled classification problems
Example
In speech recognition, it costs almost nothing to record huge amounts of speech, but labeling it requires a human to listen and transcribe.

Elements of the problem
Training sample: $s_n = \{(x_1, y_1), \dots, (x_t, y_t), x_{t+1}, \dots, x_n\}$, where $x_i \in X$ (a compact set of $\mathbb{R}^p$) and $y_i \in \{-1, 1\}$.
- Subset $s_t = \{(x_1, y_1), \dots, (x_t, y_t)\}$ of $t$ labeled points.
- Subset $s^u_{n-t} = \{x_{t+1}, \dots, x_n\}$ of $n - t$ unlabeled points.

$S$: $n \times n$ similarity matrix ($S_{ii} = 1$).

Chapelle, O., Schölkopf, B., and Zien, A. Semi-Supervised Learning. MIT Press, Cambridge, MA, 2006.
Idea to solve the partially labeled classification problem

Main Goal
Transform the similarity $S$ using the information of the unlabeled points, making it more useful for classification.
Transform the similarity matrix S

STEP 1: Define S*

$$(S^*)_{ij} = \frac{1}{2}\left( (S)_{ij} + (S)_{z_i, z_j} \right) + \sigma\, y_{z_i} y_{z_j} \left| (S)_{ij} - (S)_{z_i, z_j} \right|,$$

where:
- $\sigma \geq 0$.
- $z_i$ and $z_j$ are the indices of the labeled data points closest (with $d_{ij} = 1 - s_{ij}$) to $x_i$ and $x_j$, whose labels are $y_{z_i}$, $y_{z_j}$.
- $(S^*)_{ij} = (S)_{ij}$ if the $i$-th and $j$-th points are both labeled.

STEP 2: Obtain K* by transforming S* to positive definite.
We can use one of the previously defined projections onto the cone of positive semi-definite matrices: the orthogonal projection or alternating projections.
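A sketch of STEP 1 (ours), assuming a normalized similarity matrix S and a label vector y that uses 0 to mark unlabeled points:

```python
import numpy as np

def transform_similarity(S, y, sigma):
    # y: +1/-1 for labeled points, 0 for unlabeled ones
    labeled = np.flatnonzero(y != 0)
    D = 1.0 - S                                    # d_ij = 1 - s_ij
    z = labeled[np.argmin(D[:, labeled], axis=1)]  # closest labeled point to each x_i
    Sz = S[np.ix_(z, z)]                           # (S)_{z_i, z_j}
    yy = np.outer(y[z], y[z])                      # y_{z_i} * y_{z_j}
    S_star = 0.5 * (S + Sz) + sigma * yy * np.abs(S - Sz)
    both = np.ix_(labeled, labeled)
    S_star[both] = S[both]                         # keep S where both points are labeled
    return S_star
```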
Illustrative example: description
Sample $s_n$ of $n = 1000$ data points drawn from bivariate normal distributions $N(\mu_i, I)$, with $\mu_1 = (0, 0)$ and $\mu_2 = (4, 0)$ (so that the optimal boundary is $x = 2$). 20000 additional test points.

Theoretical Bayes error: 0.027. Optimal discriminant function: $x = 2$.

Several scenarios, selecting (randomly) from $s_n$ an increasing number $t$ of labeled data (from 10 to 250).

For each scenario, we compare the averaged errors of an SVM trained on the $t$ labeled data, using a linear kernel and using the matrix $K^*$.

Test errors estimated over the 20000 additional points (30 runs).
Data points and convergence of the SVM
(n) Simulated data.
(ñ) Convergence of the SVM error for a linear kernel (black) and for K* (grey), as a function of the sample size of labeled data.
Results
(o) SVM + linear kernel and SVM + K* when t = 10.
(p) SVM + linear kernel and SVM + K* when t = 50.
Real data examples
Four data sets (UCI Repository):
- Iris (we only considered the classes versicolor and virginica).
- Connectionist Bench (Sonar, Mines vs. Rocks).
- Breast Cancer Wisconsin.
- Blood Transfusion.

We divide each data set into two subsets, one of size $n$ (training sample $s_n$) and the rest for testing.

We compare $K$-nn with $K = 1$ and an SVM using a linear kernel and using $K^*$.

We fix $\sigma$ and select the parameters by cross-validation.
Results

Data Set   Param. sigma   Tr./Test    (t, n-t)    SVM + linear   SVM + K*   1-nn
Iris       [3.5, 6.5]     (60,40)     (10,50)     0.0958         0.0833     0.1008
                                      (50,10)     0.0500         0.0250     0.0766
Sonar      [0.05, 0.1]    (170,38)    (75,95)     0.3491         0.2964     0.2596
                                      (100,70)    0.3043         0.1605     0.2368
                                      (150,20)    0.2429         0.2438     0.2228
Cancer     [1.5, 2.1]     (400,283)   (10,390)    0.1253         0.0628     0.0667
                                      (50,350)    0.0502         0.0398     0.0591
                                      (100,300)   0.0369         0.0380     0.0501
Transf.    [0.1, 1]       (400,384)   (10,390)    0.2529         0.2274     0.2917
                                      (50,350)    0.1845         0.1798     0.2717
                                      (100,300)   0.1800         0.1788     0.2720
Proximity matrix combinations

Example
In web page classification problems we have two sources of information available:

The co-citation matrix.

The term-by-document matrix.

Goal
To combine (2 or more) proximity matrices to obtain a single representation of the data (web pages) useful in classification problems.
Martín de Diego, I., Muñoz, A., and Moguerza, J. M. Methods for the combination of kernel matrices within a support vector framework. Machine Learning, 78:137-174, 2009.

Moguerza, J. M. and Muñoz, A. Rejoinder to "Support Vector Machines with Applications". Statistical Science, 21(3):358-362, 2006.
Combining kernel matrices

Kernel combination scheme
$s_n = \{(x_i, y_i)\}_{i=1}^{n}$ is a random sample where $x_i \in X$ (some subset of $\mathbb{R}^p$) and $y_i \in \{-1, 1\}$ are the labels of the data. $K_1, \dots, K_m$ is a set of $m$ kernel matrices and $Y$ is the diagonal matrix of the labels of the problem.

$$K^* = \frac{1}{m} \sum_{t=1}^{m} K_t + \tau\, Y \left( \sum_{t<l} g(K_t, K_l) \right) Y.$$

Example
Method                     tau    g(x)         # Kernels
AKM (Average)              0      -            m
MAKM (Modified Average)    > 0    g(x) = 1     m
AV (Absolute Value)        > 0    g(x) = |x|   m
PO (Pick Out)              1/2    g(x) = |x|   2
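A sketch (ours) of the Absolute Value (AV) combination from the table, reading g(K_t, K_l) as the elementwise absolute difference |K_t − K_l| (our interpretation of g(x) = |x|); it assumes at least two kernel matrices.

```python
import numpy as np

def combine_kernels_av(Ks, y, tau):
    # K* = (1/m) sum_t K_t + tau * Y (sum_{t<l} |K_t - K_l|) Y, with Y = diag(y)
    m = len(Ks)                  # assumes m >= 2
    base = sum(Ks) / m
    G = sum(np.abs(Ks[t] - Ks[l]) for t in range(m) for l in range(t + 1, m))
    Y = np.diag(y)
    return base + tau * Y @ G @ Y
```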
Open questions in kernel combinations
Problems
1 How to define a measure $g$ that captures the differences between each pair of kernels.
2 How to define a sum of kernels that takes into account the redundant information between each pair of matrices.
General idea to solve both problems
Given $K_1, K_2, \dots, K_m \in \mathbb{R}^{n \times n}$, find an orthonormal matrix $V \in \mathbb{R}^{n \times n}$ such that

$$V^T K_1 V = D_1, \quad \dots, \quad V^T K_m V = D_m,$$

where $D_1, \dots, D_m$ are diagonal (or quasi-diagonal).

Main idea
Use the matrices $D_1, \dots, D_m$ and $V$ to manage the redundant information between the matrices.
Alberto Muñoz and Javier González. Joint Diagonalization of Kernels for Information Fusion. Proceedings of the Iberoamerican Congress on Pattern Recognition, 556-563, Springer.

Javier González and Alberto Muñoz. Spectral Measures for Kernel Matrices Comparison. Proceedings of the International Conference on Neural Networks, 727-736, Springer.
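A rough sketch of the idea (ours, not the papers' algorithm): take V from the eigendecomposition of the average kernel as an approximate common basis, and measure how far each V^T K_t V is from diagonal; an exact joint diagonalization would use, e.g., a Jacobi-type routine.

```python
import numpy as np

def approximate_joint_basis(Ks):
    # Eigenvectors of the average matrix as an approximate common basis V
    _, V = np.linalg.eigh(sum(Ks) / len(Ks))
    Ds = [V.T @ K @ V for K in Ks]
    return V, Ds

def offdiag_mass(D):
    # Fraction of squared Frobenius norm off the diagonal; 0 means exactly diagonal
    off = D - np.diag(np.diag(D))
    return np.linalg.norm(off) ** 2 / np.linalg.norm(D) ** 2
```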
Main Contributions
Modifying the original data similarity using the labels is an efficient strategy to define appropriate kernel matrices for classification.

We have proposed a methodology to solve partially labeled classification problems.

A new method to deal with asymmetric proximity matrices in classification problems, with application to textual data analysis.

Extension of the previous methodology to information fusion problems where the available sources of information are a set of proximity matrices.

Study of the problems derived from the improper use of the common information of the matrices.
Other work

Representation of functional data in Reproducing Kernel Hilbert Spaces.

Alberto Muñoz and Javier González. Representing Functional Data with Support Vector Machines. Pattern Recognition Letters, 31(6):511-516.

Kernels for latent semantic extraction in text mining problems.

Alberto Muñoz, Javier González and Javier Arriero. Kernel Latent Semantic Analysis using Term Fusion Kernels. Support Vector Machines: Data Analysis, Machine Learning and Applications. NOVA Science. To appear.

Kernels (generalized covariances) for spatio-temporal data analysis.

Javier González, Stephan R. Sain and Alberto Muñoz. Spatial Temporal Data Analysis via Reproducing Kernel Regularization. Joint Statistical Meeting (JSM), Washington, USA, June 2009.

Robust methods in statistics.

Javier González, Daniel Peña and Rosario Romera. A Robust Partial Least Squares Regression Method with Applications. Journal of Chemometrics, 23:78-90, 2009.
Thanks!