Kernel Methods and Semi-Supervised Learning
Zalan Bodo
The aims of the short course
Top 10 data mining algorithms
Machine learning. Supervised learning
Kernel methods
Kernels
Centroid method
kNN
SVM
k-means
Semi-supervised learning
Assumptions in SSL
Classes of SSL
Self-training
Data graphs
Label propagation
Graph-based distances
Semi-supervised kNN and k-means
Cluster kernels
Laplacian SVM
Kernel Methods and Semi-Supervised Learning
Zalan Bodo
Faculty of Mathematics and Computer Science, Babes–Bolyai University, Cluj-Napoca/Kolozsvar
[figure: scatter plot of sample data]
Szeged, March 2012
1/41
Contents
The aims of the short course
  Top 10 data mining algorithms
Machine learning. Supervised learning
Kernel methods
  Kernels
  Centroid method
  kNN
  SVM
  k-means
Semi-supervised learning
  Assumptions in SSL
  Classes of SSL
  Self-training
  Data graphs
  Label propagation
  Graph-based distances
  Semi-supervised kNN and k-means
  Cluster kernels
  Laplacian SVM
2/41
The aims of the short course
I to get an overview of kernel methods
I to get an overview of semi-supervised classification
I kernel methods: change the kernel ⇒ obtain a new algorithm
I semi-supervised learning: use a large unlabeled dataset
I semi-supervised kernels: use a supervised algorithm with a semi-supervised kernel to obtain a semi-supervised learning method
I no need to develop semi-supervised methods
I use a supervised algorithm with a good – and replaceable – semi-supervised kernel
3/41
Top 10 data mining algorithms – of all time
1. C4.5
2. k-Means ←
3. SVM ←
4. Apriori
5. EM
6. PageRank (←)
7. AdaBoost
8. kNN ←
9. Naive Bayes
10. CART
I website: http://www.cs.uvm.edu/~icdm/algorithms/index.shtml
I article: http://www.cs.uvm.edu/~icdm/algorithms/10Algorithms-08.pdf
I slides with results: http://www.cs.uvm.edu/~icdm/algorithms/ICDM06-Panel.pdf
4/41
Machine learning. Supervised learning
I machine learning (ML): subdomain of AI, modeling the most important brain activities: classification, differentiation, prediction
I supervised learning: examples with teaching instructions
I example: learn to differentiate between relevant and spam emails
I unsupervised learning: no teaching instructions; group similar points into clusters
I example: find products similar to a given one
I semi-supervised learning: halfway between supervised and unsupervised learning
I example: learn to classify handwritten digits based on a few labeled and a large set of unlabeled examples
[figure: spam vs. ham email example]
5/41
[figure: scatter plot of labeled training data]
I training data: Dtrain = {(xxx i, yi) | i = 1, ..., N}, xxx i ∈ X, yi ∈ Y
I X is the input space, the xxx i's are vectors
I Y is the label set, the yi's are scalars
I goal: find f : X → Y approximating the target function described by Dtrain
I if |Y| = 2 ⇒ binary classification
I if |Y| > 2 ⇒ multi-class classification
I if f : X → Y ⇒ single-label classification
I if f : X → 2^Y ⇒ multi-label classification
6/41
Kernel methods
I James Mercer (1909): any continuous, symmetric, positive semi-definite kernel function can be expressed as a dot product in a high-dimensional space
I M. Aizerman, E. Braverman, and L. Rozonoer – 1964
I Boser, Guyon, and Vapnik – 1992 (Support Vector Machine)
I linear algorithm → non-linear algorithm
I φ: feature mapping
I kernels: k(xxx, zzz) = 〈φ(xxx), φ(zzz)〉, φ : X → H, X ⊆ Rn, i.e. (for unit-length vectors) the cosine of the angle enclosed by the vectors in a high-dimensional space
I covers all geometric constructions that can be formulated in terms of angles, lengths and distances
I any positive definite kernel is a dot product in another space; any mapping φ to a dot product space defines a positive definite kernel by 〈φ(xxx), φ(zzz)〉
7/41
[figure: the XOR data mapped to three dimensions by φ; axes X₁², X₂², X₁X₂]
I we want to solve the XOR problem (shown above) by a linear separator; data: (0, 0), (1, 1), (1, 0), (0, 1)
I easy to observe: this cannot be done in the initial space
I let's use the following mapping: φ(xxx) = [x₁², x₂², √2·x₁x₂]′
I then the dot product: φ(xxx)′φ(zzz) = x₁²z₁² + 2x₁z₁x₂z₂ + x₂²z₂² = (xxx′zzz)² = k(xxx, zzz)
I this is called the second-order homogeneous polynomial kernel
8/41
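The XOR example above can be checked numerically; a minimal NumPy sketch (the separating normal `w` and offset `b` below are found by inspection and are not from the slides):

```python
import numpy as np

# XOR data and labels from the slide
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([-1, 1, 1, -1])

def phi(x):
    # explicit feature map [x1^2, x2^2, sqrt(2)*x1*x2]
    return np.array([x[0] ** 2, x[1] ** 2, np.sqrt(2) * x[0] * x[1]])

def k_poly2(x, z):
    # second-order homogeneous polynomial kernel (x'z)^2
    return (x @ z) ** 2

# the kernel equals the dot product in the mapped space
assert all(np.isclose(phi(a) @ phi(b), k_poly2(a, b)) for a in X for b in X)

# in the mapped space the classes become linearly separable,
# e.g. by this (hand-picked) hyperplane:
Phi = np.array([phi(x) for x in X])
w = np.array([1.0, 1.0, -np.sqrt(2)])
b = -0.5
preds = np.sign(Phi @ w + b)
assert np.array_equal(preds, y)
```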
General purpose kernels
I linear (dot product): k(xxx, zzz) = 〈xxx, zzz〉
I polynomial: k(xxx, zzz) = (a〈xxx, zzz〉 + b)^c
I RBF (Gaussian): k(xxx, zzz) = exp(−‖xxx − zzz‖² / (2σ²))
I sigmoid: k(xxx, zzz) = tanh(a〈xxx, zzz〉 + r)
9/41
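These four kernels are direct to implement; a NumPy sketch, with the parameter names (a, b, c, σ, r) taken from the formulas above and default values chosen only for illustration:

```python
import numpy as np

def linear_kernel(x, z):
    # k(x, z) = <x, z>
    return x @ z

def polynomial_kernel(x, z, a=1.0, b=1.0, c=2):
    # k(x, z) = (a<x, z> + b)^c
    return (a * (x @ z) + b) ** c

def rbf_kernel(x, z, sigma=1.0):
    # k(x, z) = exp(-||x - z||^2 / (2 sigma^2))
    return np.exp(-np.sum((x - z) ** 2) / (2 * sigma ** 2))

def sigmoid_kernel(x, z, a=1.0, r=0.0):
    # k(x, z) = tanh(a<x, z> + r); not positive definite for all a, r
    return np.tanh(a * (x @ z) + r)
```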
Kernel trick
Given an algorithm which is formulated in terms of a positive definite kernel k, one can construct an alternative algorithm by replacing k with another positive definite kernel k̃.
10/41
Distances ↔ kernels
1. kernel → (squared) distance:
DDD(2) = diag(KKK) 111NN + 111NN diag(KKK) − 2KKK
2. distance → kernel:
−(1/2) JJJ DDD(2) JJJ = −(1/2) JJJ diag(KKK) 111NN JJJ − (1/2) JJJ 111NN diag(KKK) JJJ + JJJ KKK JJJ = JJJ KKK JJJ
where JJJ = III − (1/N) 111 111′ = III − (1/N) 111NN is the centering matrix (since 111NN JJJ = 000)
11/41
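The double-centering identity above can be verified numerically; a small sketch with a linear kernel on random data:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 3))
K = X @ X.T                      # linear kernel matrix, N x N
N = K.shape[0]

# kernel -> squared distances: D2_ij = K_ii + K_jj - 2 K_ij
d = np.diag(K)
D2 = d[:, None] + d[None, :] - 2 * K

# distances -> kernel via double centering, J = I - (1/N) 11'
J = np.eye(N) - np.ones((N, N)) / N
K_back = -0.5 * J @ D2 @ J

# only the centered kernel J K J is recoverable from the distances
assert np.allclose(K_back, J @ K @ J)
```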
Centroid method
Outline:
I learn: calculate the class centroids
I classify: label an unseen example as belonging to the class with the closest centroid
[figure: positive and negative points with their centroids c+ and c−, and a test point x]
12/41
Binary classification:
m+ = |{xxx i | yi = +1}|, m− = |{xxx i | yi = −1}|
centers:
ccc+ = (1/m+) ∑_{i|yi=+1} xxx i
ccc− = (1/m−) ∑_{i|yi=−1} xxx i
Let www := ccc+ − ccc− and ccc := (ccc+ + ccc−)/2; the label will be determined by 〈www, xxx − ccc〉:
y = sgn〈xxx − ccc, www〉 = sgn(〈ccc+, xxx〉 − 〈ccc−, xxx〉 + b),
where b := (‖ccc−‖² − ‖ccc+‖²)/2
13/41
Substituting the corresponding expressions for ccc+ and ccc− we obtain
y = sgn((1/m+) ∑_{i|yi=+1} 〈xxx, xxx i〉 − (1/m−) ∑_{i|yi=−1} 〈xxx, xxx i〉 + b)
  = sgn((1/m+) ∑_{i|yi=+1} k(xxx, xxx i) − (1/m−) ∑_{i|yi=−1} k(xxx, xxx i) + b),
where
b := (1/2) ((1/m−²) ∑_{(i,j)|yi=yj=−1} k(xxx i, xxx j) − (1/m+²) ∑_{(i,j)|yi=yj=+1} k(xxx i, xxx j))
and k(·, ·) is an arbitrary kernel function.
14/41
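The kernelized decision rule above can be sketched directly in NumPy; the data set and the linear kernel in the usage lines are illustrative only:

```python
import numpy as np

def kernel_centroid_predict(x, X, y, k):
    """Kernelized centroid classifier; y must hold labels in {-1, +1}."""
    pos, neg = X[y == 1], X[y == -1]
    # mean kernel value of x against each class
    s_pos = np.mean([k(x, xi) for xi in pos])
    s_neg = np.mean([k(x, xi) for xi in neg])
    # offset b from the pairwise kernel sums over each class
    b = 0.5 * (np.mean([[k(xi, xj) for xj in neg] for xi in neg])
               - np.mean([[k(xi, xj) for xj in pos] for xi in pos]))
    return np.sign(s_pos - s_neg + b)

# illustrative data: two clusters on the x-axis
X = np.array([[2.0, 0.0], [3.0, 1.0], [-2.0, 0.0], [-3.0, -1.0]])
y = np.array([1, 1, -1, -1])
lin = lambda a, b: a @ b
print(kernel_centroid_predict(np.array([1.5, 0.5]), X, y, lin))  # 1.0
```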
For multi-class classification:
ccc i = (1/|Ci|) ∑_{xxx j∈Ci} xxx j
The Euclidean distance can be written in terms of the dot product:
‖xxx − zzz‖² = xxx′xxx − 2xxx′zzz + zzz′zzz = k(xxx, xxx) − 2k(xxx, zzz) + k(zzz, zzz)
Since distances from the centroids are to be considered, we can calculate:
‖xxx − ccc i‖² = ‖xxx − (1/|Ci|) ∑_{xxx j∈Ci} xxx j‖²
  = xxx′xxx − (2/|Ci|) ∑_{xxx j∈Ci} xxx′xxx j + (1/|Ci|²) ∑_{xxx j∈Ci, xxx k∈Ci} xxx j′xxx k
  = k(xxx, xxx) − (2/|Ci|) ∑_{xxx j∈Ci} k(xxx, xxx j) + (1/|Ci|²) ∑_{xxx j∈Ci, xxx k∈Ci} k(xxx j, xxx k)
15/41
k-Nearest Neighbors
Outline:
I seemingly no separate learning phase
I classify: find the k nearest neighbors of the example in question and label with the majority class
[figure: scatter plot illustrating k-nearest-neighbor classification]
16/41
I assign the label which is in the majority among the k nearest neighbors
f(xxx) = argmax_{c=1,2,...,K} ∑_{zzz∈Nk(xxx), zzz∈Dtrain} sim(zzz, xxx) · δ(c, f(zzz))
(we use sim(zzz, xxx) = 1, ∀zzz, xxx)
δ(a, b) = 1 if a = b, 0 otherwise (Kronecker delta)
I to determine Nk(xxx) the Euclidean distance is used
I as before, we can rewrite the Euclidean distance using dot products:
‖xxx − zzz‖² = xxx′xxx − 2xxx′zzz + zzz′zzz = k(xxx, xxx) − 2k(xxx, zzz) + k(zzz, zzz)
17/41
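Since only kernel-induced distances are needed, kNN kernelizes the same way; a minimal sketch (the data set below is illustrative):

```python
import numpy as np
from collections import Counter

def kernel_knn_predict(x, X, y, k_fn, k=3):
    # squared feature-space distance of x to every training point
    d2 = [k_fn(x, x) - 2 * k_fn(x, z) + k_fn(z, z) for z in X]
    nearest = np.argsort(d2)[:k]
    # unweighted majority vote (sim = 1, as on the slide)
    return Counter(int(y[i]) for i in nearest).most_common(1)[0][0]

# illustrative data: with the linear kernel this is ordinary Euclidean kNN
X = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [6.0, 5.0]])
y = np.array([0, 0, 1, 1])
lin = lambda a, b: a @ b
print(kernel_knn_predict(np.array([0.2, 0.2]), X, y, lin))  # 0
```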
Support Vector Machines
Outline:
I learn: find a separating hyperplane – with maximal margin –that separates the positive and negative examples
I classify: on one side of the hyperplane there are the positivepoints, on the other side the negative points
I large-margin separation: minimize the probability of misclassifying an unseen example
I this hyperplane is determined only by the points that lie on the marginal hyperplanes, the support vectors → Support Vector Machines (SVMs)
I setting: binary classification
18/41
Hyperplane:
www′xxx i + b ≥ 1 ∀xxx i s.t. yi = 1
www′xxx i + b ≤ −1 ∀xxx i s.t. yi = −1
Decision function:
f(xxx) = sgn(www′xxx + b)
I margin of the separating hyperplane: 2/‖www‖
I optimization problem:
min F(www, b) = (1/2)‖www‖²
subject to yi(www′xxx i + b) ≥ 1, i = 1, ..., ℓ
19/41
Linearly non-separable case
I introduce slack variables; they show the level of violation of linear separability
I new conditions:
www′xxx i + b ≥ 1 − ξi if yi = 1
www′xxx i + b ≤ −1 + ξi if yi = −1
I new optimization problem:
min F(www, b, ξξξ) = (1/2)‖www‖² + C ∑_{i=1}^{ℓ} ξi
s.t. yi(www′xxx i + b) ≥ 1 − ξi, ξi ≥ 0, i = 1, ..., ℓ
20/41
k-means
Outline:
I clustering method; no supervised information is given
I goal: cluster data into K sets, such that similar points are grouped in the same set
I determine K cluster centers
[figure: scatter plot of data grouped into clusters]
21/41
I objective function:
F = ∑_{i=1}^{K} ∑_{j=1}^{N} Uij d(xxx j, ccc i)
where
I ccc i, i = 1, ..., K are the cluster centers
I UUU is the membership matrix of size K × N (K clusters, N points)
I Uij ∈ {0, 1}; Uij = 1 if xxx j belongs to Ci (the i-th cluster); otherwise 0
I d(·, ·) is a distance, usually the squared Euclidean: d(xxx, zzz) = ‖xxx − zzz‖²
I F is minimized if the points are assigned to the closest cluster center, provided the centers are fixed:
Uij = 1 if ‖xxx j − ccc i‖² ≤ ‖xxx j − ccc k‖², ∀k ≠ i; 0 otherwise
I on the other hand, if UUU is fixed, F is minimized by taking the centers of the groups:
ccc i = (1/∑_{j=1}^{N} Uij) ∑_{k=1}^{N} Uik xxx k
22/41
Kernel Methods andSemi-Supervised
Learning
Zalan Bodo
The aims of the shortcourse
Top 10 data miningalgorithms
Machine learning.Supervised learning
Kernel methods
Kernels
Centroid method
kNN
SVM
k-means
Semi-supervisedlearning
Assumptions in SSL
Classes of SSL
Self-training
Data graphs
Label propagation
Graph-based distances
Semi-supervised kNNand k-means
Cluster kernels
Laplacian SVM
k-Means
1. Initialize the centers.
2. Determine UUU.
3. Compute the cost function F.
4. If converged, then STOP. Else update the centers using the new UUU; GOTO step 2.
23/41
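The alternating update steps above translate to a short NumPy implementation; a sketch, not tuned for large data, with random initialization and a fallback for empty clusters:

```python
import numpy as np

def kmeans(X, K, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), K, replace=False)]
    labels = np.zeros(len(X), dtype=int)
    for _ in range(n_iter):
        # assignment step: U_ij = 1 for the closest center
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
        labels = d2.argmin(axis=1)
        # update step: move each center to the mean of its cluster
        new_centers = np.array([X[labels == i].mean(axis=0)
                                if np.any(labels == i) else centers[i]
                                for i in range(K)])
        if np.allclose(new_centers, centers):  # converged
            break
        centers = new_centers
    return centers, labels
```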
Fuzzy c-means:
I generalization of k-means
I cluster membership values are in [0, 1]
I objective function:
F′ = ∑_{i=1}^{K} ∑_{j=1}^{N} Uij^p ‖xxx j − ccc i‖²
such that Uij ∈ [0, 1], ∑_{i=1}^{K} Uij = 1, ∀j, p ∈ [1, ∞)
I if the centers are fixed, F′ is minimized by
Uij = 1 / ∑_{k=1}^{K} (‖xxx j − ccc i‖ / ‖xxx j − ccc k‖)^{2/(p−1)}
I the values which minimize F′ for fixed UUU are
ccc i = (1/∑_{j=1}^{N} Uij^p) ∑_{j=1}^{N} Uij^p xxx j
24/41
Kernel k-means
I k-means and fuzzy c-means are limited to finding (pairwise) linearly separable clusters
I to obtain non-linear cluster boundaries we use kernels
I the only expression one needs to rewrite for the kernel-based variant of the algorithm is the distance of a point to a center:
‖φ(xxx) − (1/sj) ∑_{k=1}^{N} Ujk φ(xxx k)‖²
  = k(xxx, xxx) + (1/sj²) ∑_{k,ℓ=1}^{N} Ujk Ujℓ k(xxx k, xxx ℓ) − (2/sj) ∑_{k=1}^{N} Ujk k(xxx k, xxx)
where sj is the size of the j-th cluster
25/41
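This distance needs only the kernel matrix and the membership matrix; a vectorized sketch, checked against explicit input-space centroids for the linear kernel:

```python
import numpy as np

def kernel_dist2(K, U):
    """Squared feature-space distance of every point to every cluster center.

    K: N x N kernel matrix; U: binary membership matrix, shape (#clusters, N).
    Returns a (#clusters, N) array: row j holds distances to center j."""
    s = U.sum(axis=1)                                # cluster sizes s_j
    cc = np.einsum('jk,kl,jl->j', U, K, U) / s ** 2  # <c_j, c_j> term
    xc = (U @ K) / s[:, None]                        # <x_i, c_j> term
    return np.diag(K)[None, :] + cc[:, None] - 2 * xc

# sanity check: with the linear kernel this matches explicit centroids
rng = np.random.default_rng(0)
X = rng.normal(size=(8, 3))
labels = np.array([0, 1, 0, 1, 0, 1, 0, 1])
U = np.zeros((2, 8))
U[labels, np.arange(8)] = 1.0
D2 = kernel_dist2(X @ X.T, U)
for j in range(2):
    c = X[labels == j].mean(axis=0)
    assert np.allclose(D2[j], ((X - c) ** 2).sum(axis=1))
```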
Kernel k-means
1. Generate a random UUU which satisfies the conditions, that is, make a random assignment of the points.
2. Compute the cost function F.
3. If converged, then STOP. Else update UUU using the update formula used in k-means (= get new UUU); GOTO step 2.
26/41
Semi-supervised learning (SSL)
I supervised learning: D = {(xxx i, yi) | xxx i ∈ X ⊆ Rd, yi ∈ {−1, +1}, i = 1, ..., ℓ}; find f : X → {−1, +1} which agrees with D
I semi-supervised learning: D = {(xxx i, yi) | i = 1, ..., ℓ} ∪ {xxx j | j = ℓ+1, ..., ℓ+u}, ℓ ≪ u, N = ℓ + u
I inductive: find f : X → {−1, +1} which agrees with D + use the information of DU
I transductive: find f : DU → {−1, +1} by using D = DL ∪ DU
[figure: a labeled-only data set next to the same data set with unlabeled points added]
27/41
Assumptions in SSL
I smoothness assumption: if two points xxx i and xxx j in a high-density region are close, then so should be the corresponding outputs yi and yj.
I cluster assumption: if two points are in the same cluster, they are likely to be of the same class.
I manifold assumption: the high-dimensional data lie roughly on a low-dimensional manifold. – regarding dimensionality; but if the manifold is an approximation of the high-dimensional region ⇒ smoothness assumption
[figure: scatter plot of sample data]
28/41
Classes of SSL
I Generative models
I Low-density separation
I Graph-based methods
I Change of representation
29/41
Generative models
I model the class-conditional density P(xxx|y) and the class priors P(y) and use the Bayes theorem for calculating posteriors, which are used for classification
I e.g. Naive Bayes + EM:
1. Train the classifier on the training examples; determine the probabilities P(ddd i | cj) and P(cj).
2. Repeat while there is improvement:
E-step: Classify unlabeled examples using the trained classifier.
M-step: Re-estimate the classifier using the previously classified examples; recalculate P(ddd i | cj) and P(cj).
Low-density separation
I push the decision boundary away from the labeled and unlabeled data; natural choice: use a large-margin classifier
I e.g. Transductive SVM (TSVM):
min (1/2)‖www‖²
s.t. yi(www′xxx i + b) ≥ 1, i = 1, ..., ℓ
yj(www′xxx j + b) ≥ 1, j = ℓ+1, ..., N
yj ∈ {−1, +1}, j = ℓ+1, ..., N
Graph-based methods
I use the graph of the labeled and unlabeled data, represented by the weight matrix WWW; use the graph Laplacian LLL = DDD − WWW
I e.g. Label Propagation:
1. Compute YYY(t + 1) = PPP YYY(t).
2. Reset the labeled data, YYY L(t + 1) = YYY L(0).
3. t = t + 1 and repeat the above steps until convergence.
Change of representation
I find some structure in the whole data set; use the information provided by the labeled and unlabeled data to create a new representation
1. Build the new representation – new distance, dot product or kernel – of the learning examples.
2. Use a supervised learning method for obtaining the decision function, employing the new representation obtained in the previous step.
I e.g. PCA, KPCA, LLE, ISOMAP, etc.
Self-training/Bootstrapping
Simple self-training
Until convergence:
1. Train classifier with the labeled data.
2. Classify unlabeled data with the trained classifier.
3. Add the most confident unlabeled points to the training data.
4. GOTO step 1.
Committee-based learning
Until convergence:

1. Train separate classifiers on the same labeled data.

2. Make predictions on the unlabeled data.

3. Add the points the classifiers most agree on to the training set.
4. GOTO step 1.
Final prediction: weighted majority vote among all the learners.
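The simple self-training loop above can be sketched in NumPy. The nearest-centroid base learner and its distance-based confidence score are illustrative assumptions, not prescribed by the slides; any classifier that exposes a confidence works the same way.

```python
import numpy as np

class Centroid:
    """Tiny nearest-centroid classifier, standing in for the base learner."""
    def fit(self, X, y):
        self.classes_ = np.unique(y)
        self.mu_ = np.array([X[y == c].mean(axis=0) for c in self.classes_])
        return self
    def decision(self, X):
        # negative distance to each centroid serves as a confidence score
        return -np.linalg.norm(X[:, None, :] - self.mu_[None], axis=2)
    def predict(self, X):
        return self.classes_[self.decision(X).argmax(axis=1)]

def self_train(clf, X_l, y_l, X_u, n_add=2, max_rounds=20):
    """Simple self-training: each round, classify the unlabeled points and
    move the n_add most confident ones into the labeled training set."""
    X_l, y_l, X_u = X_l.copy(), y_l.copy(), X_u.copy()
    for _ in range(max_rounds):
        if len(X_u) == 0:
            break
        clf.fit(X_l, y_l)                     # 1. train on the labeled data
        conf = clf.decision(X_u).max(axis=1)  # 2. classify the unlabeled data
        pick = np.argsort(-conf)[:n_add]      # 3. take the most confident points
        X_l = np.vstack([X_l, X_u[pick]])
        y_l = np.concatenate([y_l, clf.predict(X_u[pick])])
        X_u = np.delete(X_u, pick, axis=0)    # 4. repeat
    return clf.fit(X_l, y_l)
```

Adding only a few points per round keeps early mistakes from snowballing; in practice a confidence threshold is often used instead of a fixed n_add.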
Data graphs
I G = (V, W) undirected weighted graph (usually)
I V = X_L ∪ X_U (training data)
I W : V × V → R (similarity function); W(x_i, x_j) = sim(x_i, x_j) (def= W_ij),
  for example:
  sim(x_i, x_j) = exp(−‖x_i − x_j‖² / (2σ²))
I sparsification techniques: εNN, kNN graphs
I graph Laplacian: L = D − W, D = diag(W · 1)
I nice properties: e.g. positive semi-definite; has as many 0 eigenvalues as the graph has connected components; etc.
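A minimal NumPy sketch of these definitions; the kNN sparsification rule used here (keep the k strongest edges per node, then symmetrize) is one common choice among several:

```python
import numpy as np

def data_graph(X, sigma=1.0, k=None):
    """Weight matrix W_ij = exp(-||x_i - x_j||^2 / (2 sigma^2)) and the
    graph Laplacian L = D - W, with optional kNN sparsification."""
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    W = np.exp(-sq / (2 * sigma ** 2))
    np.fill_diagonal(W, 0.0)
    if k is not None:
        keep = np.zeros_like(W, dtype=bool)
        nn = np.argsort(-W, axis=1)[:, :k]       # k strongest edges per node
        keep[np.arange(len(X))[:, None], nn] = True
        W = np.where(keep | keep.T, W, 0.0)      # symmetrize the kNN graph
    D = np.diag(W.sum(axis=1))                   # D = diag(W * 1)
    return W, D - W
```

The returned Laplacian exhibits the listed properties: its rows sum to zero and its spectrum is non-negative.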
Label propagation
I build data graph: W_ij = exp(−‖x_i − x_j‖² / (2σ²))
I compute transition probabilities: D = diag(W1), P = D⁻¹W
I vector of labels: f = [f_L f_U]′

LP

1. i = 0; initialize f_U^(i).

2. f^(i+1) = P f^(i).

3. Clamp the labeled data, f_L^(i+1) = f_L.

4. i = i + 1; unless converged, go to step 2.
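The four LP steps translate directly to NumPy; here the labeled points are assumed to occupy the first positions of f, and a fixed iteration count stands in for the convergence test:

```python
import numpy as np

def label_propagation(W, f_L, n_iter=200):
    """Iterate f <- P f while clamping the labeled block f_L.
    W is the (N, N) weight matrix; the first len(f_L) points are labeled."""
    P = W / W.sum(axis=1, keepdims=True)    # P = D^{-1} W (row-stochastic)
    l = len(f_L)
    f = np.zeros(W.shape[0])                # 1. initialize f_U to zero
    f[:l] = f_L
    for _ in range(n_iter):
        f = P @ f                           # 2. propagate
        f[:l] = f_L                         # 3. clamp the labeled data
    return f                                # 4. repeat until convergence
```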
I can be rewritten as
  f_i = (1 / ∑_{j=1}^N W_ij) ∑_{k=1}^N W_ik f_k
I exact solution (L – graph Laplacian)
  f_U = −L_UU⁻¹ · L_UL · f_L
I minimizes
  E(f) = (1/2) ∑_{i,j=1}^N W_ij (f_i − f_j)² = f′Lf
I LP is equivalent to searching for a minimum cut in the data graph
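The exact solution is a single linear solve; as above, the labeled points are assumed first in the ordering:

```python
import numpy as np

def lp_exact(W, f_L):
    """Closed-form label propagation: f_U = -L_UU^{-1} L_UL f_L."""
    l = len(f_L)
    L = np.diag(W.sum(axis=1)) - W          # graph Laplacian L = D - W
    return -np.linalg.solve(L[l:, l:], L[l:, :l] @ f_L)
```

L_UU is invertible whenever every unlabeled component of the graph touches a labeled point, which is exactly when propagation has somewhere to pull labels from.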
Transduction → induction
I fix the other labels and compute the new one
I we minimize:
  C + ∑_i W_iy (f_i − y)²
I from setting the derivative to zero we obtain:
  y = ∑_i W_iy f_i / ∑_j W_jy
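In code, the inductive step is just a similarity-weighted average of the propagated labels f; the RBF similarity below plays the role of W_iy for a new point x:

```python
import numpy as np

def induce(x, X, f, sigma=1.0):
    """Label a new point as the similarity-weighted average of f
    over the training points X."""
    w = np.exp(-((X - x) ** 2).sum(axis=1) / (2 * sigma ** 2))
    return w @ f / w.sum()
```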
Graph-based distances
I if data lie on a manifold, graph-based distances are better
I graph-based distances = shortest path distances calculatedusing a known algorithm: Floyd–Warshall, Dijkstra etc.
Floyd–Warshall algorithm
1. D_ij = weight of the edge (i, j); if there is no edge then ∞
2. for k = 1:N
     for i = 1:N
       for j = 1:N
         if D_ij > D_ik + D_kj
           D_ij = D_ik + D_kj
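The triple loop above vectorizes naturally in NumPy, one relaxation per intermediate node k:

```python
import numpy as np

def floyd_warshall(D):
    """All-pairs shortest paths; D[i, j] = edge weight, np.inf if no edge."""
    D = D.copy()
    for k in range(D.shape[0]):
        # relax every pair (i, j) through the intermediate node k
        D = np.minimum(D, D[:, k:k + 1] + D[k:k + 1, :])
    return D
```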
Semi-Supervised kNN and k-Means
here: semi-supervised = graph-based
1. Initial distance matrix ⇐ kNN/εNN matrix
2. Compute the all-pair shortest path distance matrix – using forexample the Floyd–Warshall algorithm:
Dij = shortest path value among all paths from i to j
3. Perform kNN/k-Means using these graph distances.
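Steps 1 and 2 combine into a short pipeline; step 3 then runs plain kNN or k-means on the resulting matrix instead of Euclidean distances. A sketch, with a Floyd–Warshall relaxation inlined:

```python
import numpy as np

def graph_distances(X, k=2):
    """kNN distance matrix (step 1), then all-pairs shortest paths (step 2)."""
    d = np.sqrt(((X[:, None] - X[None]) ** 2).sum(axis=-1))
    D = np.full_like(d, np.inf)
    nn = np.argsort(d, axis=1)[:, 1:k + 1]     # k nearest neighbors per point
    for i, js in enumerate(nn):
        D[i, js] = d[i, js]
        D[js, i] = d[js, i]                    # keep the graph symmetric
    np.fill_diagonal(D, 0.0)
    for m in range(len(X)):                    # Floyd-Warshall relaxation
        D = np.minimum(D, D[:, m:m + 1] + D[m:m + 1, :])
    return D
```

On a chain of points the geodesic distance accumulates along the data, which is the manifold behavior the slide motivates.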
Cluster kernels
Neighborhood kernel
I representing each point as the average of its neighbors → the neighborhood structure influences the representation:
  φ_nbd(x) = (1 / |N(x)|) ∑_{x′∈N(x)} φ_b(x′)
I the kernel:
  k_nbd(x, z) = ∑_{x′∈N(x), z′∈N(z)} k_b(x′, z′) / (|N(x)| |N(z)|)
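Given a precomputed base Gram matrix K_b and per-point neighborhood index lists, the neighborhood kernel is a double average over the two neighborhoods; a small sketch:

```python
import numpy as np

def neighborhood_kernel(K_b, neighbors):
    """k_nbd(x_i, x_j): mean of the base kernel over N(x_i) x N(x_j)."""
    n = len(neighbors)
    K = np.empty((n, n))
    for i in range(n):
        for j in range(n):
            K[i, j] = K_b[np.ix_(neighbors[i], neighbors[j])].mean()
    return K
```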
Bagged cluster kernel
I idea: reweighting the base kernel values by the probability that the points belong to the same cluster
I use k-means clustering: the random initial cluster centers affect the output of the algorithm

BCK

1. Run k-means t times, which results in cluster assignments c_j(x_i), j = 1, …, t, i = 1, …, N, c_·(·) ∈ {1, …, K}.

2. Construct the bagged kernel in the following way:
   k_bag(x, z) = (1/t) ∑_{j=1}^t [c_j(x) = c_j(z)]

The final kernel:
  k(x, z) = k_b(x, z) · k_bag(x, z)
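The two BCK steps can be sketched as follows; the minimal k-means run inside (random-point initialization, fixed iteration count) is an illustrative stand-in for any k-means implementation:

```python
import numpy as np

def kmeans_labels(X, K, rng, n_iter=50):
    """One k-means run from a random initialization; returns cluster labels."""
    C = X[rng.choice(len(X), K, replace=False)]
    for _ in range(n_iter):
        z = ((X[:, None] - C[None]) ** 2).sum(axis=-1).argmin(axis=1)
        C = np.array([X[z == k].mean(axis=0) if (z == k).any() else C[k]
                      for k in range(K)])
    return z

def bagged_kernel(X, K=2, t=10, seed=0):
    """k_bag(x_i, x_j) = fraction of the t runs that cluster i and j together."""
    rng = np.random.RandomState(seed)
    Z = np.stack([kmeans_labels(X, K, rng) for _ in range(t)])
    return (Z[:, :, None] == Z[:, None, :]).mean(axis=0)
```

The final kernel is then the elementwise product of the base Gram matrix with this bagged matrix.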
Laplacian SVM
I use SVM + semi-supervised (smoothness) assumption
I add the following term to the objective function:
  γ_I (1/2) ∑_{i,j=1}^N w_ij (f_i − f_j)² = γ_I f′Lf
  where L is the graph Laplacian, f is the N × 1 vector containing the labels of the points; γ_I is a parameter determining how much the unlabeled points influence the result/kernel
I the optimization problem thus becomes:
  min_{w,b} (1/2)‖w‖² + C ∑_{i=1}^ℓ ξ_i + γ_I f′Lf
  s.t. y_i (w′x_i + b) ≥ 1 − ξ_i, i = 1, …, ℓ
I it can be shown that this is the same problem as the one solved in the original SVM formulation, but now using the kernel
  k̃(x, z) = k(x, z) − k_x′ ((1/(2γ_I)) I + L K_NN)⁻¹ L k_z
  where k̃(·, ·) is the new semi-supervised kernel, while k(·, ·) is the base kernel
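On the N training points this gives the deformed Gram matrix K̃ = K − K((1/(2γ_I))I + LK)⁻¹LK, which can be sketched as follows (the Gram matrix K_NN of the base kernel is written simply K here):

```python
import numpy as np

def lapsvm_gram(K, L, gamma_I):
    """Deformed Gram matrix of the semi-supervised (Laplacian SVM) kernel."""
    N = K.shape[0]
    A = np.eye(N) / (2.0 * gamma_I) + L @ K
    return K - K @ np.linalg.solve(A, L @ K)
```

As γ_I → 0 the unlabeled points stop influencing the kernel and K̃ falls back to the base Gram matrix K.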
Thank you! Questions?