Kernel Methods and Semi-Supervised Learning
Zalan Bodo
The aims of the short course
Top 10 data mining algorithms
Machine learning. Supervised learning
Kernel methods
Kernels
Centroid method
kNN
SVM
k-means
Semi-supervised learning
Assumptions in SSL
Classes of SSL
Self-training
Data graphs
Label propagation
Graph-based distances
Semi-supervised kNN and k-means
Cluster kernels
Laplacian SVM
Kernel Methods and Semi-Supervised Learning
Zalan Bodo
Faculty of Mathematics and Computer Science, Babes–Bolyai University, Cluj-Napoca/Kolozsvar
[figure: scatter plot of sample data]
Szeged, March 2012
1/41
Contents
The aims of the short course
  Top 10 data mining algorithms
Machine learning. Supervised learning
Kernel methods
  Kernels
  Centroid method
  kNN
  SVM
  k-means
Semi-supervised learning
  Assumptions in SSL
  Classes of SSL
  Self-training
  Data graphs
  Label propagation
  Graph-based distances
  Semi-supervised kNN and k-means
  Cluster kernels
  Laplacian SVM
2/41
The aims of the short course
I to get an overview of kernel methods
I to get an overview of semi-supervised classification
I kernel methods: change the kernel ⇒ obtain a new algorithm
I semi-supervised learning: use a large unlabeled dataset
I semi-supervised kernels: use a supervised algorithm with a semi-supervised kernel to obtain a semi-supervised learning method
I no need to develop semi-supervised methods
I use a supervised algorithm with a good – and replaceable – semi-supervised kernel
3/41
Top 10 data mining algorithms – of all time
1. C4.5
2. k-Means ←
3. SVM ←
4. Apriori
5. EM
6. PageRank (←)
7. AdaBoost
8. kNN ←
9. Naive Bayes
10. CART
I website: http://www.cs.uvm.edu/~icdm/algorithms/index.shtml
I article: http://www.cs.uvm.edu/~icdm/algorithms/10Algorithms-08.pdf
I slides with results: http://www.cs.uvm.edu/~icdm/algorithms/ICDM06-Panel.pdf
4/41
Machine learning. Supervised learning
I machine learning (ML): subdomain of AI, modeling the most important brain activities: classification, differentiation, prediction
I supervised learning: examples with teaching instructions
I example: learn to differentiate between relevant and spam emails
I unsupervised learning: no teaching instructions; group similar points into clusters
I example: find products similar to a given one
I semi-supervised learning: halfway between supervised and unsupervised learning
I example: learn to classify handwritten digits based on a few labeled and a large set of unlabeled examples
[figure: spam vs. ham email example]
5/41
[figure: scatter plot of labeled training data]
I training data: Dtrain = {(xxx i, yi) | i = 1, ..., N}, xxx i ∈ X, yi ∈ Y
I X is the input space, the xxx i's are vectors
I Y is the label set, the yi's are scalars
I goal: find f : X → Y approximating the target function described by Dtrain
I if |Y| = 2 ⇒ binary classification
I if |Y| > 2 ⇒ multi-class classification
I if f : X → Y ⇒ single-label classification
I if f : X → 2^Y ⇒ multi-label classification
6/41
Kernel methods
I James Mercer (1909): any continuous, symmetric, positive semi-definite kernel function can be expressed as a dot product in a high-dimensional space
I M. Aizerman, E. Braverman, and L. Rozonoer – 1964
I Boser, Guyon, and Vapnik – 1992 (Support Vector Machine)
I linear algorithm → non-linear algorithm
I φ: feature mapping
I kernels: k(xxx, zzz) = 〈φ(xxx), φ(zzz)〉, φ : X → H, X ⊆ Rn, i.e. (for unit-length vectors) the cosine of the angle enclosed by the vectors in a high-dimensional space
I covers all geometric constructions that can be formulated in terms of angles, lengths and distances
I any positive definite kernel is a dot product in another space; any mapping φ to a dot product space defines a positive definite kernel by 〈φ(xxx), φ(zzz)〉
7/41
[figure: the XOR data mapped to three dimensions by φ; axes X₁², X₂², X₁X₂]
I we want to solve the XOR problem (shown above) by a linear separator; data: (0, 0), (1, 1), (1, 0), (0, 1)
I easy to observe: this cannot be done in the initial space
I let's use the following mapping: φ(xxx) = [x₁², x₂², √2·x₁x₂]′
I then the dot product: φ(xxx)′φ(zzz) = x₁²z₁² + 2x₁z₁x₂z₂ + x₂²z₂² = (xxx′zzz)² = k(xxx, zzz)
I this is called the second-order homogeneous polynomial kernel
8/41
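The XOR example above can be checked numerically; a minimal NumPy sketch (the separating normal `w` and offset `b` below are found by inspection and are not from the slides):

```python
import numpy as np

# XOR data and labels from the slide
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([-1, 1, 1, -1])

def phi(x):
    # explicit feature map [x1^2, x2^2, sqrt(2)*x1*x2]
    return np.array([x[0] ** 2, x[1] ** 2, np.sqrt(2) * x[0] * x[1]])

def k_poly2(x, z):
    # second-order homogeneous polynomial kernel (x'z)^2
    return (x @ z) ** 2

# the kernel equals the dot product in the mapped space
assert all(np.isclose(phi(a) @ phi(b), k_poly2(a, b)) for a in X for b in X)

# in the mapped space the classes become linearly separable,
# e.g. by this (hand-picked) hyperplane:
Phi = np.array([phi(x) for x in X])
w = np.array([1.0, 1.0, -np.sqrt(2)])
b = -0.5
preds = np.sign(Phi @ w + b)
assert np.array_equal(preds, y)
```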
General purpose kernels
I linear (dot product): k(xxx, zzz) = 〈xxx, zzz〉
I polynomial: k(xxx, zzz) = (a〈xxx, zzz〉 + b)^c
I RBF (Gaussian): k(xxx, zzz) = exp(−‖xxx − zzz‖² / (2σ²))
I sigmoid: k(xxx, zzz) = tanh(a〈xxx, zzz〉 + r)
9/41
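These four kernels are direct to implement; a NumPy sketch, with the parameter names (a, b, c, σ, r) taken from the formulas above and default values chosen only for illustration:

```python
import numpy as np

def linear_kernel(x, z):
    # k(x, z) = <x, z>
    return x @ z

def polynomial_kernel(x, z, a=1.0, b=1.0, c=2):
    # k(x, z) = (a<x, z> + b)^c
    return (a * (x @ z) + b) ** c

def rbf_kernel(x, z, sigma=1.0):
    # k(x, z) = exp(-||x - z||^2 / (2 sigma^2))
    return np.exp(-np.sum((x - z) ** 2) / (2 * sigma ** 2))

def sigmoid_kernel(x, z, a=1.0, r=0.0):
    # k(x, z) = tanh(a<x, z> + r); not positive definite for all a, r
    return np.tanh(a * (x @ z) + r)
```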
Kernel trick
Given an algorithm which is formulated in terms of a positive definite kernel k, one can construct an alternative algorithm by replacing k with another positive definite kernel k̃.
10/41
Distances ↔ kernels
1. kernel → (squared) distance:
DDD(2) = diag(KKK) 111NN + 111NN diag(KKK) − 2KKK
2. distance → kernel:
−(1/2) JJJ DDD(2) JJJ = −(1/2) JJJ diag(KKK) 111NN JJJ − (1/2) JJJ 111NN diag(KKK) JJJ + JJJ KKK JJJ = JJJ KKK JJJ
where JJJ = III − (1/N) 111 111′ = III − (1/N) 111NN is the centering matrix (since 111NN JJJ = 000)
11/41
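The double-centering identity above can be verified numerically; a small sketch with a linear kernel on random data:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 3))
K = X @ X.T                      # linear kernel matrix, N x N
N = K.shape[0]

# kernel -> squared distances: D2_ij = K_ii + K_jj - 2 K_ij
d = np.diag(K)
D2 = d[:, None] + d[None, :] - 2 * K

# distances -> kernel via double centering, J = I - (1/N) 11'
J = np.eye(N) - np.ones((N, N)) / N
K_back = -0.5 * J @ D2 @ J

# only the centered kernel J K J is recoverable from the distances
assert np.allclose(K_back, J @ K @ J)
```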
Centroid method
Outline:
I learn: calculate the class centroids
I classify: label an unseen example as belonging to the class with the closest centroid
[figure: positive and negative points with their centroids c+ and c−, and a test point x]
12/41
Binary classification:
m+ = |{xxx i | yi = +1}|, m− = |{xxx i | yi = −1}|
centers:
ccc+ = (1/m+) ∑_{i|yi=+1} xxx i
ccc− = (1/m−) ∑_{i|yi=−1} xxx i
Let www := ccc+ − ccc− and ccc := (ccc+ + ccc−)/2; the label will be determined by 〈www, xxx − ccc〉:
y = sgn〈xxx − ccc, www〉 = sgn(〈ccc+, xxx〉 − 〈ccc−, xxx〉 + b),
where b := (‖ccc−‖² − ‖ccc+‖²)/2
13/41
Substituting the corresponding expressions for ccc+ and ccc− we obtain
y = sgn((1/m+) ∑_{i|yi=+1} 〈xxx, xxx i〉 − (1/m−) ∑_{i|yi=−1} 〈xxx, xxx i〉 + b)
  = sgn((1/m+) ∑_{i|yi=+1} k(xxx, xxx i) − (1/m−) ∑_{i|yi=−1} k(xxx, xxx i) + b),
where
b := (1/2) ((1/m−²) ∑_{(i,j)|yi=yj=−1} k(xxx i, xxx j) − (1/m+²) ∑_{(i,j)|yi=yj=+1} k(xxx i, xxx j))
and k(·, ·) is an arbitrary kernel function.
14/41
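The kernelized decision rule above can be sketched directly in NumPy; the data set and the linear kernel in the usage lines are illustrative only:

```python
import numpy as np

def kernel_centroid_predict(x, X, y, k):
    """Kernelized centroid classifier; y must hold labels in {-1, +1}."""
    pos, neg = X[y == 1], X[y == -1]
    # mean kernel value of x against each class
    s_pos = np.mean([k(x, xi) for xi in pos])
    s_neg = np.mean([k(x, xi) for xi in neg])
    # offset b from the pairwise kernel sums over each class
    b = 0.5 * (np.mean([[k(xi, xj) for xj in neg] for xi in neg])
               - np.mean([[k(xi, xj) for xj in pos] for xi in pos]))
    return np.sign(s_pos - s_neg + b)

# illustrative data: two clusters on the x-axis
X = np.array([[2.0, 0.0], [3.0, 1.0], [-2.0, 0.0], [-3.0, -1.0]])
y = np.array([1, 1, -1, -1])
lin = lambda a, b: a @ b
print(kernel_centroid_predict(np.array([1.5, 0.5]), X, y, lin))  # 1.0
```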
For multi-class classification:
ccc i = (1/|Ci|) ∑_{xxx j∈Ci} xxx j
The Euclidean distance can be written in terms of the dot product:
‖xxx − zzz‖² = xxx′xxx − 2xxx′zzz + zzz′zzz = k(xxx, xxx) − 2k(xxx, zzz) + k(zzz, zzz)
Since distances from the centroids are to be considered, we can calculate:
‖xxx − ccc i‖² = ‖xxx − (1/|Ci|) ∑_{xxx j∈Ci} xxx j‖²
  = xxx′xxx − (2/|Ci|) ∑_{xxx j∈Ci} xxx′xxx j + (1/|Ci|²) ∑_{xxx j∈Ci, xxx k∈Ci} xxx j′xxx k
  = k(xxx, xxx) − (2/|Ci|) ∑_{xxx j∈Ci} k(xxx, xxx j) + (1/|Ci|²) ∑_{xxx j∈Ci, xxx k∈Ci} k(xxx j, xxx k)
15/41
k-Nearest Neighbors
Outline:
I seemingly no separate learning phase
I classify: find the k nearest neighbors of the example in question and label with the majority class
[figure: scatter plot illustrating k-nearest-neighbor classification]
16/41
I assign the label which is in the majority among the k nearest neighbors
f(xxx) = argmax_{c=1,2,...,K} ∑_{zzz∈Nk(xxx), zzz∈Dtrain} sim(zzz, xxx) · δ(c, f(zzz))
(we use sim(zzz, xxx) = 1, ∀zzz, xxx)
δ(a, b) = 1 if a = b, 0 otherwise (Kronecker delta)
I to determine Nk(xxx) the Euclidean distance is used
I as before, we can rewrite the Euclidean distance using dot products:
‖xxx − zzz‖² = xxx′xxx − 2xxx′zzz + zzz′zzz = k(xxx, xxx) − 2k(xxx, zzz) + k(zzz, zzz)
17/41
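Since only kernel-induced distances are needed, kNN kernelizes the same way; a minimal sketch (the data set below is illustrative):

```python
import numpy as np
from collections import Counter

def kernel_knn_predict(x, X, y, k_fn, k=3):
    # squared feature-space distance of x to every training point
    d2 = [k_fn(x, x) - 2 * k_fn(x, z) + k_fn(z, z) for z in X]
    nearest = np.argsort(d2)[:k]
    # unweighted majority vote (sim = 1, as on the slide)
    return Counter(int(y[i]) for i in nearest).most_common(1)[0][0]

# illustrative data: with the linear kernel this is ordinary Euclidean kNN
X = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [6.0, 5.0]])
y = np.array([0, 0, 1, 1])
lin = lambda a, b: a @ b
print(kernel_knn_predict(np.array([0.2, 0.2]), X, y, lin))  # 0
```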
Support Vector Machines
Outline:
I learn: find a separating hyperplane – with maximal margin –that separates the positive and negative examples
I classify: on one side of the hyperplane there are the positivepoints, on the other side the negative points
I large-margin separation: minimize the probability of misclassifying an unseen example
I this hyperplane is determined only by the points that lie on the marginal hyperplanes, the support vectors → Support Vector Machines (SVMs)
I setting: binary classification
18/41
Hyperplane:
www′xxx i + b ≥ 1 ∀xxx i s.t. yi = 1
www′xxx i + b ≤ −1 ∀xxx i s.t. yi = −1
Decision function:
f(xxx) = sgn(www′xxx + b)
I margin of the separating hyperplane: 2/‖www‖
I optimization problem:
min F(www, b) = (1/2)‖www‖²
subject to yi(www′xxx i + b) ≥ 1, i = 1, ..., ℓ
19/41
Linearly non-separable case
I introduce slack variables; they show the level of violation of linear separability
I new conditions:
www′xxx i + b ≥ 1 − ξi if yi = 1
www′xxx i + b ≤ −1 + ξi if yi = −1
I new optimization problem:
min F(www, b, ξξξ) = (1/2)‖www‖² + C ∑_{i=1}^{ℓ} ξi
s.t. yi(www′xxx i + b) ≥ 1 − ξi, ξi ≥ 0, i = 1, ..., ℓ
20/41
k-means
Outline:
I clustering method; no supervised information is given
I goal: cluster data into K sets, such that similar points are grouped in the same set
I determine K cluster centers
[figure: scatter plot of data grouped into clusters]
21/41
I objective function:
F = ∑_{i=1}^{K} ∑_{j=1}^{N} Uij d(xxx j, ccc i)
where
I ccc i, i = 1, ..., K are the cluster centers
I UUU is the membership matrix of size K × N (K clusters, N points)
I Uij ∈ {0, 1}; Uij = 1 if xxx j belongs to Ci (the i-th cluster); otherwise 0
I d(·, ·) is a distance, usually the squared Euclidean: d(xxx, zzz) = ‖xxx − zzz‖²
I F is minimized if the points are assigned to the closest cluster center, provided the centers are fixed:
Uij = 1 if ‖xxx j − ccc i‖² ≤ ‖xxx j − ccc k‖², ∀k ≠ i; 0 otherwise
I on the other hand, if UUU is fixed, F is minimized by taking the centers of the groups:
ccc i = (1/∑_{j=1}^{N} Uij) ∑_{k=1}^{N} Uik xxx k
22/41
Kernel Methods andSemi-Supervised
Learning
Zalan Bodo
The aims of the shortcourse
Top 10 data miningalgorithms
Machine learning.Supervised learning
Kernel methods
Kernels
Centroid method
kNN
SVM
k-means
Semi-supervisedlearning
Assumptions in SSL
Classes of SSL
Self-training
Data graphs
Label propagation
Graph-based distances
Semi-supervised kNNand k-means
Cluster kernels
Laplacian SVM
k-Means
1. Initialize the centers.
2. Determine UUU.
3. Compute the cost function F.
4. If converged, then STOP. Else update the centers using the new UUU; GOTO step 2.
23/41
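The alternating update steps above translate to a short NumPy implementation; a sketch, not tuned for large data, with random initialization and a fallback for empty clusters:

```python
import numpy as np

def kmeans(X, K, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), K, replace=False)]
    labels = np.zeros(len(X), dtype=int)
    for _ in range(n_iter):
        # assignment step: U_ij = 1 for the closest center
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
        labels = d2.argmin(axis=1)
        # update step: move each center to the mean of its cluster
        new_centers = np.array([X[labels == i].mean(axis=0)
                                if np.any(labels == i) else centers[i]
                                for i in range(K)])
        if np.allclose(new_centers, centers):  # converged
            break
        centers = new_centers
    return centers, labels
```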
Fuzzy c-means:
I generalization of k-means
I cluster membership values are in [0, 1]
I objective function:
F′ = ∑_{i=1}^{K} ∑_{j=1}^{N} Uij^p ‖xxx j − ccc i‖²
such that Uij ∈ [0, 1], ∑_{i=1}^{K} Uij = 1, ∀j, p ∈ [1, ∞)
I if the centers are fixed, F′ is minimized by
Uij = 1 / ∑_{k=1}^{K} (‖xxx j − ccc i‖ / ‖xxx j − ccc k‖)^{2/(p−1)}
I the values which minimize F′ for fixed UUU are
ccc i = (1/∑_{j=1}^{N} Uij^p) ∑_{j=1}^{N} Uij^p xxx j
24/41
Kernel k-means
I k-means and fuzzy c-means are limited to finding (pairwise) linearly separable clusters
I to obtain non-linear cluster boundaries we use kernels
I the only expression one needs to rewrite for the kernel-based variant of the algorithm is the distance of a point to a center:
‖φ(xxx) − (1/sj) ∑_{k=1}^{N} Ujk φ(xxx k)‖²
  = k(xxx, xxx) + (1/sj²) ∑_{k,ℓ=1}^{N} Ujk Ujℓ k(xxx k, xxx ℓ) − (2/sj) ∑_{k=1}^{N} Ujk k(xxx k, xxx)
where sj is the size of the j-th cluster
25/41
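This distance needs only the kernel matrix and the membership matrix; a vectorized sketch, checked against explicit input-space centroids for the linear kernel:

```python
import numpy as np

def kernel_dist2(K, U):
    """Squared feature-space distance of every point to every cluster center.

    K: N x N kernel matrix; U: binary membership matrix, shape (#clusters, N).
    Returns a (#clusters, N) array: row j holds distances to center j."""
    s = U.sum(axis=1)                                # cluster sizes s_j
    cc = np.einsum('jk,kl,jl->j', U, K, U) / s ** 2  # <c_j, c_j> term
    xc = (U @ K) / s[:, None]                        # <x_i, c_j> term
    return np.diag(K)[None, :] + cc[:, None] - 2 * xc

# sanity check: with the linear kernel this matches explicit centroids
rng = np.random.default_rng(0)
X = rng.normal(size=(8, 3))
labels = np.array([0, 1, 0, 1, 0, 1, 0, 1])
U = np.zeros((2, 8))
U[labels, np.arange(8)] = 1.0
D2 = kernel_dist2(X @ X.T, U)
for j in range(2):
    c = X[labels == j].mean(axis=0)
    assert np.allclose(D2[j], ((X - c) ** 2).sum(axis=1))
```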
Kernel k-means
1. Generate a random UUU which satisfies the conditions, that is, make a random assignment of the points.
2. Compute the cost function F.
3. If converged, then STOP. Else update UUU using the update formula used in k-means (= get new UUU); GOTO step 2.
26/41
Semi-supervised learning (SSL)
I supervised learning: D = {(xxx i, yi) | xxx i ∈ X ⊆ Rd, yi ∈ {−1, +1}, i = 1, ..., ℓ}; find f : X → {−1, +1} which agrees with D
I semi-supervised learning: D = {(xxx i, yi) | i = 1, ..., ℓ} ∪ {xxx j | j = ℓ+1, ..., ℓ+u}, ℓ ≪ u, N = ℓ + u
I inductive: find f : X → {−1, +1} which agrees with D + use the information of DU
I transductive: find f : DU → {−1, +1} by using D = DL ∪ DU
[figure: a labeled-only data set next to the same data set with unlabeled points added]
27/41
Assumptions in SSL
I smoothness assumption: if two points xxx i and xxx j in a high-density region are close, then so should be the corresponding outputs yi and yj.
I cluster assumption: if two points are in the same cluster, they are likely to be of the same class.
I manifold assumption: the high-dimensional data lie roughly on a low-dimensional manifold. – regarding dimensionality; but if the manifold is an approximation of the high-dimensional region ⇒ smoothness assumption
[figure: scatter plot of sample data]
28/41
Classes of SSL
I Generative models
I Low-density separation
I Graph-based methods
I Change of representation
29/41
Generative models
I model the class-conditional density P(xxx|y) and the class priors P(y) and use the Bayes theorem for calculating posteriors, which are used for classification
I e.g. Naive Bayes + EM:
1. Train the classifier on the training examples; determine the probabilities P(ddd i | cj) and P(cj).
2. Repeat while there is improvement:
E-step: Classify unlabeled examples using the trained classifier.
M-step: Re-estimate the classifier using the previously classified examples; recalculate P(ddd i | cj) and P(cj).
Low-density separation
I push the decision boundary away from the labeled and unlabeled data; natural choice: use a large-margin classifier
I e.g. Transductive SVM (TSVM):
min (1/2)‖www‖²
s.t. yi(www′xxx i + b) ≥ 1, i = 1, ..., ℓ
yj(www′xxx j + b) ≥ 1, j = ℓ+1, ..., N
yj ∈ {−1, +1}, j = ℓ+1, ..., N
Graph-based methods
I use the graph of the labeled and unlabeled data, represented by the weight matrix WWW; use the graph Laplacian LLL = DDD − WWW
I e.g. Label Propagation:
1. Compute YYY(t + 1) = PPP YYY(t).
2. Reset the labeled data, YYY L(t + 1) = YYY L(0).
3. t = t + 1 and repeat the above steps until convergence.
Change of representation
I find some structure in the whole data set; use the information provided by the labeled and unlabeled data to create a new representation
1. Build the new representation – new distance, dot product or kernel – of the learning examples.
2. Use a supervised learning method for obtaining the decision function, employing the new representation obtained in the previous step.
I e.g. PCA, KPCA, LLE, ISOMAP, etc.
Self-training/Bootstrapping
Simple self-training
Until convergence:
1. Train classifier with the labeled data.
2. Classify unlabeled data with the trained classifier.
3. Add the most confident unlabeled points to the training data.
4. GOTO step 1.
Committee-based learning
Until convergence:

1. Train separate classifiers on the same labeled data.

2. Make predictions on the unlabeled data.

3. Add the points the classifiers most agree on to the training set.
4. GOTO step 1.
Final prediction: weighted majority vote among all the learners.
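The simple self-training loop above can be sketched in NumPy. The nearest-centroid base learner and its distance-based confidence score are illustrative assumptions, not prescribed by the slides; any classifier that exposes a confidence works the same way.

```python
import numpy as np

class Centroid:
    """Tiny nearest-centroid classifier, standing in for the base learner."""
    def fit(self, X, y):
        self.classes_ = np.unique(y)
        self.mu_ = np.array([X[y == c].mean(axis=0) for c in self.classes_])
        return self
    def decision(self, X):
        # negative distance to each centroid serves as a confidence score
        return -np.linalg.norm(X[:, None, :] - self.mu_[None], axis=2)
    def predict(self, X):
        return self.classes_[self.decision(X).argmax(axis=1)]

def self_train(clf, X_l, y_l, X_u, n_add=2, max_rounds=20):
    """Simple self-training: each round, classify the unlabeled points and
    move the n_add most confident ones into the labeled training set."""
    X_l, y_l, X_u = X_l.copy(), y_l.copy(), X_u.copy()
    for _ in range(max_rounds):
        if len(X_u) == 0:
            break
        clf.fit(X_l, y_l)                     # 1. train on the labeled data
        conf = clf.decision(X_u).max(axis=1)  # 2. classify the unlabeled data
        pick = np.argsort(-conf)[:n_add]      # 3. take the most confident points
        X_l = np.vstack([X_l, X_u[pick]])
        y_l = np.concatenate([y_l, clf.predict(X_u[pick])])
        X_u = np.delete(X_u, pick, axis=0)    # 4. repeat
    return clf.fit(X_l, y_l)
```

Adding only a few points per round keeps early mistakes from snowballing; in practice a confidence threshold is often used instead of a fixed n_add.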
Data graphs
I G = (V, W) undirected weighted graph (usually)
I V = X_L ∪ X_U (training data)
I W : V × V → R (similarity function); W(x_i, x_j) = sim(x_i, x_j) (def= W_ij),
  for example:
  sim(x_i, x_j) = exp(−‖x_i − x_j‖² / (2σ²))
I sparsification techniques: εNN, kNN graphs
I graph Laplacian: L = D − W, D = diag(W · 1)
I nice properties: e.g. positive semi-definite; has as many 0 eigenvalues as the graph has connected components; etc.
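A minimal NumPy sketch of these definitions; the kNN sparsification rule used here (keep the k strongest edges per node, then symmetrize) is one common choice among several:

```python
import numpy as np

def data_graph(X, sigma=1.0, k=None):
    """Weight matrix W_ij = exp(-||x_i - x_j||^2 / (2 sigma^2)) and the
    graph Laplacian L = D - W, with optional kNN sparsification."""
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    W = np.exp(-sq / (2 * sigma ** 2))
    np.fill_diagonal(W, 0.0)
    if k is not None:
        keep = np.zeros_like(W, dtype=bool)
        nn = np.argsort(-W, axis=1)[:, :k]       # k strongest edges per node
        keep[np.arange(len(X))[:, None], nn] = True
        W = np.where(keep | keep.T, W, 0.0)      # symmetrize the kNN graph
    D = np.diag(W.sum(axis=1))                   # D = diag(W * 1)
    return W, D - W
```

The returned Laplacian exhibits the listed properties: its rows sum to zero and its spectrum is non-negative.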
Label propagation
I build data graph: W_ij = exp(−‖x_i − x_j‖² / (2σ²))
I compute transition probabilities: D = diag(W1), P = D⁻¹W
I vector of labels: f = [f_L f_U]′

LP

1. i = 0; initialize f_U^(i).

2. f^(i+1) = P f^(i).

3. Clamp the labeled data, f_L^(i+1) = f_L.

4. i = i + 1; unless converged, go to step 2.
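The four LP steps translate directly to NumPy; here the labeled points are assumed to occupy the first positions of f, and a fixed iteration count stands in for the convergence test:

```python
import numpy as np

def label_propagation(W, f_L, n_iter=200):
    """Iterate f <- P f while clamping the labeled block f_L.
    W is the (N, N) weight matrix; the first len(f_L) points are labeled."""
    P = W / W.sum(axis=1, keepdims=True)    # P = D^{-1} W (row-stochastic)
    l = len(f_L)
    f = np.zeros(W.shape[0])                # 1. initialize f_U to zero
    f[:l] = f_L
    for _ in range(n_iter):
        f = P @ f                           # 2. propagate
        f[:l] = f_L                         # 3. clamp the labeled data
    return f                                # 4. repeat until convergence
```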
I can be rewritten as
  f_i = (1 / ∑_{j=1}^N W_ij) ∑_{k=1}^N W_ik f_k
I exact solution (L – graph Laplacian)
  f_U = −L_UU⁻¹ · L_UL · f_L
I minimizes
  E(f) = (1/2) ∑_{i,j=1}^N W_ij (f_i − f_j)² = f′Lf
I LP is equivalent to searching for a minimum cut in the data graph
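The exact solution is a single linear solve; as above, the labeled points are assumed first in the ordering:

```python
import numpy as np

def lp_exact(W, f_L):
    """Closed-form label propagation: f_U = -L_UU^{-1} L_UL f_L."""
    l = len(f_L)
    L = np.diag(W.sum(axis=1)) - W          # graph Laplacian L = D - W
    return -np.linalg.solve(L[l:, l:], L[l:, :l] @ f_L)
```

L_UU is invertible whenever every unlabeled component of the graph touches a labeled point, which is exactly when propagation has somewhere to pull labels from.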
Transduction → induction
I fix the other labels and compute the new one
I we minimize:
  C + ∑_i W_iy (f_i − y)²
I from setting the derivative to zero we obtain:
  y = ∑_i W_iy f_i / ∑_j W_jy
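In code, the inductive step is just a similarity-weighted average of the propagated labels f; the RBF similarity below plays the role of W_iy for a new point x:

```python
import numpy as np

def induce(x, X, f, sigma=1.0):
    """Label a new point as the similarity-weighted average of f
    over the training points X."""
    w = np.exp(-((X - x) ** 2).sum(axis=1) / (2 * sigma ** 2))
    return w @ f / w.sum()
```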
Graph-based distances
I if data lie on a manifold, graph-based distances are better
I graph-based distances = shortest path distances calculatedusing a known algorithm: Floyd–Warshall, Dijkstra etc.
Floyd–Warshall algorithm
1. D_ij = weight of the edge (i, j); if there is no edge then ∞
2. for k = 1:N
     for i = 1:N
       for j = 1:N
         if D_ij > D_ik + D_kj
           D_ij = D_ik + D_kj
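The triple loop above vectorizes naturally in NumPy, one relaxation per intermediate node k:

```python
import numpy as np

def floyd_warshall(D):
    """All-pairs shortest paths; D[i, j] = edge weight, np.inf if no edge."""
    D = D.copy()
    for k in range(D.shape[0]):
        # relax every pair (i, j) through the intermediate node k
        D = np.minimum(D, D[:, k:k + 1] + D[k:k + 1, :])
    return D
```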
Semi-Supervised kNN and k-Means
here: semi-supervised = graph-based
1. Initial distance matrix ⇐ kNN/εNN matrix
2. Compute the all-pair shortest path distance matrix – using forexample the Floyd–Warshall algorithm:
Dij = shortest path value among all paths from i to j
3. Perform kNN/k-Means using these graph distances.
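Steps 1 and 2 combine into a short pipeline; step 3 then runs plain kNN or k-means on the resulting matrix instead of Euclidean distances. A sketch, with a Floyd–Warshall relaxation inlined:

```python
import numpy as np

def graph_distances(X, k=2):
    """kNN distance matrix (step 1), then all-pairs shortest paths (step 2)."""
    d = np.sqrt(((X[:, None] - X[None]) ** 2).sum(axis=-1))
    D = np.full_like(d, np.inf)
    nn = np.argsort(d, axis=1)[:, 1:k + 1]     # k nearest neighbors per point
    for i, js in enumerate(nn):
        D[i, js] = d[i, js]
        D[js, i] = d[js, i]                    # keep the graph symmetric
    np.fill_diagonal(D, 0.0)
    for m in range(len(X)):                    # Floyd-Warshall relaxation
        D = np.minimum(D, D[:, m:m + 1] + D[m:m + 1, :])
    return D
```

On a chain of points the geodesic distance accumulates along the data, which is the manifold behavior the slide motivates.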
Cluster kernels
Neighborhood kernel
I representing each point as the average of its neighbors → the neighborhood structure influences the representation:
  φ_nbd(x) = (1 / |N(x)|) ∑_{x′∈N(x)} φ_b(x′)
I the kernel:
  k_nbd(x, z) = ∑_{x′∈N(x), z′∈N(z)} k_b(x′, z′) / (|N(x)| |N(z)|)
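Given a precomputed base Gram matrix K_b and per-point neighborhood index lists, the neighborhood kernel is a double average over the two neighborhoods; a small sketch:

```python
import numpy as np

def neighborhood_kernel(K_b, neighbors):
    """k_nbd(x_i, x_j): mean of the base kernel over N(x_i) x N(x_j)."""
    n = len(neighbors)
    K = np.empty((n, n))
    for i in range(n):
        for j in range(n):
            K[i, j] = K_b[np.ix_(neighbors[i], neighbors[j])].mean()
    return K
```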
Bagged cluster kernel
I idea: reweighting the base kernel values by the probability that the points belong to the same cluster
I use k-means clustering: the random initial cluster centers affect the output of the algorithm

BCK

1. Run k-means t times, which results in cluster assignments c_j(x_i), j = 1, …, t, i = 1, …, N, c_·(·) ∈ {1, …, K}.

2. Construct the bagged kernel in the following way:
   k_bag(x, z) = (1/t) ∑_{j=1}^t [c_j(x) = c_j(z)]

The final kernel:
  k(x, z) = k_b(x, z) · k_bag(x, z)
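The two BCK steps can be sketched as follows; the minimal k-means run inside (random-point initialization, fixed iteration count) is an illustrative stand-in for any k-means implementation:

```python
import numpy as np

def kmeans_labels(X, K, rng, n_iter=50):
    """One k-means run from a random initialization; returns cluster labels."""
    C = X[rng.choice(len(X), K, replace=False)]
    for _ in range(n_iter):
        z = ((X[:, None] - C[None]) ** 2).sum(axis=-1).argmin(axis=1)
        C = np.array([X[z == k].mean(axis=0) if (z == k).any() else C[k]
                      for k in range(K)])
    return z

def bagged_kernel(X, K=2, t=10, seed=0):
    """k_bag(x_i, x_j) = fraction of the t runs that cluster i and j together."""
    rng = np.random.RandomState(seed)
    Z = np.stack([kmeans_labels(X, K, rng) for _ in range(t)])
    return (Z[:, :, None] == Z[:, None, :]).mean(axis=0)
```

The final kernel is then the elementwise product of the base Gram matrix with this bagged matrix.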
Laplacian SVM
I use SVM + semi-supervised (smoothness) assumption
I add the following term to the objective function:
  γ_I (1/2) ∑_{i,j=1}^N w_ij (f_i − f_j)² = γ_I f′Lf
  where L is the graph Laplacian, f is the N × 1 vector containing the labels of the points; γ_I is a parameter determining how much the unlabeled points influence the result/kernel
I the optimization problem thus becomes:
  min_{w,b} (1/2)‖w‖² + C ∑_{i=1}^ℓ ξ_i + γ_I f′Lf
  s.t. y_i (w′x_i + b) ≥ 1 − ξ_i, i = 1, …, ℓ
I it can be shown that this is the same problem as the one solved in the original SVM formulation, but now using the kernel
  k̃(x, z) = k(x, z) − k_x′ ((1/(2γ_I)) I + L K_NN)⁻¹ L k_z
  where k̃(·, ·) is the new semi-supervised kernel, while k(·, ·) is the base kernel
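On the N training points this gives the deformed Gram matrix K̃ = K − K((1/(2γ_I))I + LK)⁻¹LK, which can be sketched as follows (the Gram matrix K_NN of the base kernel is written simply K here):

```python
import numpy as np

def lapsvm_gram(K, L, gamma_I):
    """Deformed Gram matrix of the semi-supervised (Laplacian SVM) kernel."""
    N = K.shape[0]
    A = np.eye(N) / (2.0 * gamma_I) + L @ K
    return K - K @ np.linalg.solve(A, L @ K)
```

As γ_I → 0 the unlabeled points stop influencing the kernel and K̃ falls back to the base Gram matrix K.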
Thank you! Questions?