Models from Data: a Unifying Picture
Johan Suykens
KU Leuven, ESAT-SCD/SISTA
Kasteelpark Arenberg 10
B-3001 Leuven (Heverlee), Belgium
Email: [email protected]
http://www.esat.kuleuven.be/scd/

Grand Challenges of Computational Intelligence
Nicosia, Cyprus, Sept. 14, 2012
Models from data: a unifying picture - Johan Suykens
Overview
• Models from data
• Examples
• Challenges for computational intelligence
High-quality predictive models are crucial

biomedical, energy, process industry, bio-informatics, brain-computer interfaces, traffic networks
Classical Neural Networks

[Figure: neural network with inputs x1, x2, x3, ..., xn, weights w1, ..., wn, bias b, activations h(·), output y]

Multilayer Perceptron (MLP) properties:
- universal approximation
- learning from input-output patterns: off-line & on-line
- parallel network architecture, multiple inputs and outputs

+ Flexible and widely applicable: feedforward & recurrent networks, supervised & unsupervised learning
− Many local minima, trial and error for the number of neurons
Support Vector Machines

[Figure: cost function versus weights — many local minima for the MLP, a single convex minimum for the SVM]

• Nonlinear classification and function estimation by convex optimization
• Learning and generalization in high-dimensional input spaces
• Use of kernels:
  - linear, polynomial, RBF, MLP, splines, kernels from graphical models, ...
  - application-specific kernels: e.g. bioinformatics, text mining

[Vapnik, 1995; Scholkopf & Smola, 2002; Shawe-Taylor & Cristianini, 2004]
Kernel-based models: different views
[Figure: complementary views on kernel-based models — SVM, LS-SVM, Kriging, Gaussian Processes, RKHS]
Some early history on RKHS:
1910-1920: Moore
1940: Aronszajn
1951: Krige
1970: Parzen
1971: Kimeldorf & Wahba
Complementary insights from different perspectives: kernels are used in different methodologies
- Support vector machines (SVM): optimization approach (primal/dual)
- Reproducing kernel Hilbert spaces (RKHS): variational problem, functional analysis
- Gaussian processes (GP): probabilistic/Bayesian approach
SVMs: living in two worlds ...
[Figure: input space with two classes (x, o), mapped by ϕ(x) to a feature space; a parametric model in the primal space, a nonparametric model in the dual space]

Primal space (parametric): y = sign[wTϕ(x) + b], with weights w1, ..., wnh on features ϕ1(x), ..., ϕnh(x)

Dual space (nonparametric): y = sign[∑_{i=1}^{#sv} αi yi K(x, xi) + b], with coefficients α1, ..., α#sv on K(x, x1), ..., K(x, x#sv)

K(xi, xj) = ϕ(xi)Tϕ(xj) (“kernel trick”)
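The kernel trick identity can be verified numerically for any kernel whose feature map is known in closed form; a minimal sketch, assuming a degree-2 polynomial kernel on R^2 (the function names phi and K are illustrative):

```python
import numpy as np

def phi(x):
    """Explicit degree-2 polynomial feature map for x in R^2,
    chosen so that phi(x) . phi(y) = (x . y + 1)^2."""
    x1, x2 = x
    return np.array([1.0,
                     np.sqrt(2) * x1, np.sqrt(2) * x2,
                     x1**2, x2**2,
                     np.sqrt(2) * x1 * x2])

def K(x, y):
    """Same kernel evaluated directly in input space (no feature map)."""
    return (np.dot(x, y) + 1.0) ** 2

x = np.array([0.5, -1.2])
y = np.array([2.0, 0.3])

# Primal-world inner product equals the dual-world kernel evaluation
assert np.isclose(phi(x) @ phi(y), K(x, y))
```

The dual side never materializes ϕ(x); only kernel evaluations K(x, xi) are needed, which is what makes infinite-dimensional feature maps (e.g. RBF) usable.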
Fitting models to data: alternative views

- Consider a model y = f(x; w), given input/output data (xi, yi), i = 1, ..., N:

    min_w  wTw + γ ∑_{i=1}^N (yi − f(xi; w))^2

- Rewrite the problem as

    min_{w,e}  wTw + γ ∑_{i=1}^N ei^2
    subject to ei = yi − f(xi; w), i = 1, ..., N

- Construct the Lagrangian and take the conditions for optimality
- Express the solution and the model in terms of the Lagrange multipliers
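For the linear case f(x; w) = wTx (omitting the bias term for brevity), these two steps can be written out explicitly; a short sketch of the derivation:

```latex
\mathcal{L}(w, e; \alpha) = w^T w + \gamma \sum_{i=1}^{N} e_i^2
  + \sum_{i=1}^{N} \alpha_i \, (y_i - w^T x_i - e_i)

\frac{\partial \mathcal{L}}{\partial w} = 0 \;\Rightarrow\; w = \tfrac{1}{2} \sum_{i} \alpha_i x_i , \qquad
\frac{\partial \mathcal{L}}{\partial e_i} = 0 \;\Rightarrow\; e_i = \frac{\alpha_i}{2\gamma} , \qquad
\frac{\partial \mathcal{L}}{\partial \alpha_i} = 0 \;\Rightarrow\; e_i = y_i - w^T x_i
```

Eliminating w and e leaves a linear system in α alone, yi = (1/2) ∑_j αj xjTxi + αi/(2γ); up to a rescaling of α this gives the dual model y = ∑_i αi xiTx, expressed entirely through inner products xiTx.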
Linear model: solving in primal or dual?

inputs x ∈ R^d, output y ∈ R
training set (xi, yi), i = 1, ..., N

Model:
(P): y = wTx + b,  with w ∈ R^d
(D): y = ∑_i αi xiTx + b,  with α ∈ R^N
Linear model: solving in primal or dual?

few inputs, many data points (e.g. 20 × 1,000,000):
primal: w ∈ R^20
dual: α ∈ R^1,000,000 (kernel matrix: 1,000,000 × 1,000,000)
Linear model: solving in primal or dual?

many inputs, few data points (e.g. 10,000 × 50):
primal: w ∈ R^10,000
dual: α ∈ R^50 (kernel matrix: 50 × 50)
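The dimension argument can be checked numerically: for a regularized linear model the primal and dual solutions coincide exactly. A minimal sketch, assuming ridge regression without the bias term b (variable names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
N, d, lam = 20, 300, 1e-2     # many inputs, few data points: dual is cheap
X = rng.standard_normal((N, d))
y = rng.standard_normal(N)

# Primal: solve a d x d system (here 300 x 300)
w = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

# Dual: solve an N x N system (here 20 x 20) over the linear
# kernel matrix X X^T, then recover w = sum_i alpha_i x_i
alpha = np.linalg.solve(X @ X.T + lam * np.eye(N), y)
w_dual = X.T @ alpha

assert np.allclose(w, w_dual, atol=1e-6)
```

The choice between (P) and (D) is therefore purely computational: solve in whichever space is smaller.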
Least Squares Support Vector Machines: “core models”
• Regression

    min_{w,b,e}  wTw + γ ∑_i ei^2   s.t.  yi = wTϕ(xi) + b + ei, ∀i

• Classification

    min_{w,b,e}  wTw + γ ∑_i ei^2   s.t.  yi (wTϕ(xi) + b) = 1 − ei, ∀i

• Kernel PCA (V = I), kernel spectral clustering (V = D^{−1})

    min_{w,b,e}  −wTw + γ ∑_i vi ei^2   s.t.  ei = wTϕ(xi) + b, ∀i

• Kernel canonical correlation analysis / partial least squares

    min_{w,v,b,d,e,r}  wTw + vTv + ν ∑_i (ei − ri)^2   s.t.  ei = wTϕ1(xi) + b,  ri = vTϕ2(yi) + d
[Suykens & Vandewalle, 1999; Suykens et al., 2002; Alzate & Suykens, 2010]
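For regression, the conditions for optimality reduce to a single linear system in (b, α). A minimal sketch, assuming the standard scaling with factors 1/2 on both objective terms (which yields this dual system) and an RBF kernel; function names are illustrative:

```python
import numpy as np

def rbf(A, B, sigma=1.0):
    """RBF kernel matrix K[i, j] = exp(-||a_i - b_j||^2 / sigma^2)."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / sigma**2)

def lssvm_fit(X, y, gamma=1000.0, sigma=1.0):
    """Solve the LS-SVM regression dual linear system:
       [ 0      1^T          ] [ b     ]   [ 0 ]
       [ 1   Omega + I/gamma ] [ alpha ] = [ y ]"""
    N = len(y)
    A = np.zeros((N + 1, N + 1))
    A[0, 1:] = 1.0
    A[1:, 0] = 1.0
    A[1:, 1:] = rbf(X, X, sigma) + np.eye(N) / gamma
    sol = np.linalg.solve(A, np.concatenate(([0.0], y)))
    return sol[0], sol[1:]                      # b, alpha

def lssvm_predict(Xtr, b, alpha, Xte, sigma=1.0):
    """Dual model: yhat(x) = sum_i alpha_i K(x, x_i) + b."""
    return rbf(Xte, Xtr, sigma) @ alpha + b

X = np.linspace(-3, 3, 40).reshape(-1, 1)
y = np.sinc(X).ravel()
b, alpha = lssvm_fit(X, y)
yhat = lssvm_predict(X, b, alpha, X)
assert np.max(np.abs(yhat - y)) < 0.05          # close fit on training data
```

Unlike the QP of the standard SVM, training here is one linear solve, which is what makes these formulations convenient "core models" to extend with extra constraints or regularization terms.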
Core models

[Diagram: core model + additional constraints + regularization terms → model estimate, optimal model representation]

Resulting model classes: parametric model, support vector machine, least-squares support vector machine, Parzen kernel model
Overview
• Models from data
• Examples
• Challenges for computational intelligence
Kernel spectral clustering
• Underlying model: e∗ = wTϕ(x∗), with q∗ = sign[e∗] the estimated cluster indicator at any x∗ ∈ R^d.

• Primal problem: training on given data {xi}, i = 1, ..., N

    min_{w,e}  −(1/2) wTw + (γ/2) ∑_{i=1}^N vi ei^2
    subject to ei = wTϕ(xi), i = 1, ..., N

  with weights vi (related to the inverse degree matrix: V = D^{−1}).

• Dual problem:

    Ω α = λ D α

  with Ωij = K(xi, xj) = ϕ(xi)Tϕ(xj).
• Kernel spectral clustering [Alzate & Suykens, IEEE-PAMI, 2010], related tospectral clustering [Fiedler, 1973; Shi & Malik, 2000; Ng et al. 2002; Chung, 1997]
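The dual eigenvalue problem Ωα = λDα can be solved with standard linear algebra; a minimal sketch for the binary case, assuming an RBF kernel and using the equivalent symmetric eigenproblem on D^{−1/2}ΩD^{−1/2} (data and variable names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
# Two well-separated 2-D clusters, 30 points each
X = np.vstack([rng.normal(0.0, 0.5, (30, 2)),
               rng.normal(4.0, 0.5, (30, 2))])

# RBF kernel matrix Omega and degrees deg_i = sum_j Omega_ij
d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
Omega = np.exp(-d2)
deg = Omega.sum(axis=1)

# Omega @ alpha = lam * D @ alpha has the same eigenvalues as the
# symmetric matrix S = D^{-1/2} Omega D^{-1/2}
S = Omega / np.sqrt(np.outer(deg, deg))
lam, V = np.linalg.eigh(S)
alpha = V[:, -2] / np.sqrt(deg)   # second-largest eigenvalue carries the split

truth = np.array([0] * 30 + [1] * 30)
pred = (alpha > 0).astype(int)
acc = max((pred == truth).mean(), (pred != truth).mean())
assert acc > 0.95                 # sign of alpha separates the two clusters
```

The sign pattern of α recovers the clusters; in the full method the same model e∗ = wTϕ(x∗) also assigns out-of-sample points.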
Primal and dual model representations
bias term bl for centering; k clusters, k − 1 sets of constraints (index l = 1, ..., k − 1)

Model:
(P): sign[e^(l)∗] = sign[w^(l)T ϕ(x∗) + bl]
(D): sign[e^(l)∗] = sign[∑_j α^(l)_j K(x∗, xj) + bl]

Advantages:
- out-of-sample extensions
- model selection
- solving large-scale problems
Out-of-sample extension and coding

[Figure: scatter plots in the (x(1), x(2)) plane illustrating the out-of-sample extension]
Example: image segmentation

[Figure: validation points in the (e(1)_i,val, e(2)_i,val, e(3)_i,val) projection space]
Example: image segmentation

[Figure: left — Fisher criterion versus number of clusters k; right — validation points in the (α(1)_i,val, α(2)_i,val) plane]
Kernel spectral clustering: sparse kernel models

[Figure: original image and the binary clustering obtained with the sparse kernel model]

Incomplete Cholesky decomposition: ‖Ω − GGT‖2 ≤ η with G ∈ R^{N×R} and R ≪ N

Image (Berkeley image dataset): 321 × 481 (154,401 pixels), 175 SV

e^(l)∗ = ∑_{i∈SSV} α^(l)_i K(xi, x∗) + bl
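The incomplete Cholesky step can be sketched with a pivoted decomposition that stops once the residual trace drops below η; since the residual is positive semidefinite, its trace bounds its spectral norm, so the bound ‖Ω − GGT‖2 ≤ η follows. Function and variable names are illustrative:

```python
import numpy as np

def incomplete_cholesky(K, eta=1e-8):
    """Pivoted incomplete Cholesky of a PSD matrix K: returns G (N x R)
    with K ~= G @ G.T, stopping when the residual trace <= eta."""
    N = K.shape[0]
    d = np.diag(K).astype(float).copy()   # diagonal of the residual
    G = np.zeros((N, 0))
    while d.sum() > eta:
        j = int(np.argmax(d))             # pivot: largest residual diagonal
        g = (K[:, j] - G @ G[j, :]) / np.sqrt(d[j])
        G = np.column_stack([G, g])
        d = np.maximum(d - g**2, 0.0)     # guard against tiny negatives
    return G

# RBF kernel on 1-D data has rapidly decaying spectrum, so R << N
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 1))
K = np.exp(-(X - X.T) ** 2 / 2.0)

G = incomplete_cholesky(K, eta=1e-8)
assert G.shape[1] < K.shape[0]                       # low rank: R << N
assert np.linalg.norm(K - G @ G.T, 2) <= 1e-6        # spectral-norm bound
```

Only R columns of the kernel matrix are ever touched, which is what makes the approach scale to the 154,401-pixel kernel matrix above.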
Highly sparse kernel models: image segmentation

[Figure: segmented image with support vectors marked (*), and the corresponding points in the (e(1)_i, e(2)_i, e(3)_i) projection space]
Highly sparse kernel models: image segmentation
[Figure: segmented image with the 12 support vectors marked (*), and the corresponding points in the (e(1)_i, e(2)_i, e(3)_i) projection space]

only 3k = 12 support vectors [Alzate & Suykens, Neurocomputing, 2011]
Kernel spectral clustering: adding prior knowledge
• Pair of points x†, x‡: c = 1 must-link, c = −1 cannot-link

• Primal problem [Alzate & Suykens, IJCNN 2009]

    min_{w^(l), e^(l), bl}  −(1/2) ∑_{l=1}^{k−1} w^(l)T w^(l) + (1/2) ∑_{l=1}^{k−1} γl e^(l)T D^{−1} e^(l)

    subject to  e^(1) = Φ_{N×nh} w^(1) + b1 1N
                ...
                e^(k−1) = Φ_{N×nh} w^(k−1) + b_{k−1} 1N
                w^(1)T ϕ(x†) = c w^(1)T ϕ(x‡)
                ...
                w^(k−1)T ϕ(x†) = c w^(k−1)T ϕ(x‡)

• Dual problem: yields a rank-one downdate of the kernel matrix
Kernel spectral clustering: example

[Figure: original image; segmentation without constraints versus with must-link/cannot-link constraints]
Hierarchical kernel spectral clustering
Hierarchical kernel spectral clustering:
- looking at different scales
- use of model selection and validation data

[Alzate & Suykens, Neural Networks, 2012]
Power grid: kernel spectral clustering of time-series
[Figure: three cluster prototypes of normalized load versus hour of day]

Electricity load: 245 substations in the Belgian grid (1/2 train, 1/2 validation)
xi ∈ R^43,824: spectral clustering on high-dimensional data (5 years)

3 of 7 detected clusters:
- 1: residential profile: morning and evening peaks
- 2: business profile: peaked around noon
- 3: industrial profile: increasing morning, oscillating afternoon and evening

[Alzate, Espinoza, De Moor, Suykens, 2009]
Dimensionality reduction and data visualization
• Traditionally: commonly used techniques include principal component analysis (PCA), multi-dimensional scaling (MDS), and self-organizing maps (SOM)

• More recently: isomap, locally linear embedding (LLE), Hessian locally linear embedding, diffusion maps, Laplacian eigenmaps (“kernel eigenmap methods and manifold learning”) [Roweis & Saul, 2000; Coifman et al., 2005; Belkin et al., 2006]

• Kernel maps with reference point [Suykens, IEEE-TNN 2008]: data visualization and dimensionality reduction by solving a linear system
Kernel maps with reference point: formulation

• Kernel maps with reference point [Suykens, IEEE-TNN 2008]:
  - LS-SVM core part: realize dimensionality reduction x 7→ z
  - Regularization term: (z − PDz)T(z − PDz) = ∑_{i=1}^N ‖zi − ∑_{j=1}^N sij D zj‖2^2, with D a diagonal matrix and sij = exp(−‖xi − xj‖2^2 / σ^2)
  - Reference point q (e.g. the first point; sacrificed in the visualization)

• Example: d = 2

    min_{z, w1, w2, b1, b2, ei,1, ei,2}  (1/2)(z − PDz)T(z − PDz) + (ν/2)(w1T w1 + w2T w2) + (η/2) ∑_{i=1}^N (ei,1^2 + ei,2^2)

    such that  c1,1T z = q1 + e1,1
               c1,2T z = q2 + e1,2
               ci,1T z = w1T ϕ1(xi) + b1 + ei,1, ∀i = 2, ..., N
               ci,2T z = w2T ϕ2(xi) + b2 + ei,2, ∀i = 2, ..., N

Coordinates in the low-dimensional space: z = [z1; z2; ...; zN] ∈ R^{dN}
Kernel maps: spiral example
[Figure: 3-D spiral data (x1, x2, x3) and its 2-D embeddings (z1hat, z2hat) for reference points q = [+1; −1] and q = [−1; −1]]

training data (blue *), validation data (magenta o), test data (red +)

Model selection: min ∑_{i,j} ( ziT zj / (‖zi‖2 ‖zj‖2) − xiT xj / (‖xi‖2 ‖xj‖2) )^2
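The model-selection criterion compares pairwise cosine similarities before and after the mapping; a minimal sketch (the function name cosine_criterion is illustrative):

```python
import numpy as np

def cosine_criterion(Z, X):
    """Model-selection score: squared mismatch between pairwise cosine
    similarities in the embedding Z and in the input space X."""
    def cos_sim(A):
        An = A / np.linalg.norm(A, axis=1, keepdims=True)
        return An @ An.T
    return float(np.sum((cos_sim(Z) - cos_sim(X)) ** 2))

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 3))

# A rotated, uniformly scaled copy preserves all cosines: score ~ 0
Q, _ = np.linalg.qr(rng.standard_normal((3, 3)))
Z_good = 2.0 * X @ Q
# An unrelated random embedding does not
Z_bad = rng.standard_normal((50, 3))

assert cosine_criterion(Z_good, X) < 1e-10
assert cosine_criterion(Z_bad, X) > cosine_criterion(Z_good, X)
```

Tuning parameters (σ, ν, η) are then chosen on validation data by minimizing this score.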
Kernel maps: visualizing gene distribution
[Figure: 3-D projection (z1, z2, z3) of the gene distribution]

Alon colon cancer microarray data set: 3-D projections
Dimension of the input space: 62
Number of genes: 1500 (training: 500, validation: 500, test: 500)
Kernels & Tensors

neuroscience: EEG data (time samples × frequency × electrodes)
computer vision: image (/video) compression/completion/... (pixel × illumination × expression × ...)
web mining: analyze user behavior (users × queries × webpages)

vector x, matrix X, tensor X

- Naive kernel: K(X, Y) = exp(−(1/(2σ^2)) ‖vec(X) − vec(Y)‖2^2)
- Tensorial kernel exploiting structure: learning from few examples
[Signoretto et al., Neural Networks, 2011 & IEEE-TSP, 2012]
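The naive kernel simply treats a tensor as one long vector; a minimal sketch (function names are illustrative; the structure-exploiting tensorial kernel of Signoretto et al. is not reproduced here):

```python
import numpy as np

def naive_tensor_kernel(Xt, Yt, sigma=1.0):
    """Naive kernel between tensors: RBF on the vectorized entries,
    K(X, Y) = exp(-||vec(X) - vec(Y)||^2 / (2 sigma^2)).
    Ignores the multi-way structure of the data entirely."""
    diff = Xt.ravel() - Yt.ravel()
    return float(np.exp(-diff @ diff / (2.0 * sigma**2)))

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 5, 3))   # e.g. time x frequency x electrodes
B = rng.standard_normal((4, 5, 3))

assert naive_tensor_kernel(A, A) == 1.0
assert 0.0 < naive_tensor_kernel(A, B) < 1.0
```

Because vectorization discards the mode structure, kernels that respect it can generalize from far fewer examples, which is the point of the tensorial kernels cited above.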
Tensor completion
Mass spectral imaging: sagittal section mouse brain [data: E. Waelkens, R. Van de Plas]
Tensor completion using nuclear norm regularization [Signoretto et al., IEEE-SPL, 2011]
Overview
• Models from data
• Examples
• Challenges for computational intelligence
Challenges for Computational Intelligence
- Bridging gaps between advanced methods and end-users
- New mathematical and methodological frameworks
- Scalable algorithms for large and high-dimensional data
Acknowledgements
• Colleagues at ESAT-SCD (especially research units: systems, models,control - biomedical data processing - bioinformatics):
C. Alzate, A. Argyriou, J. De Brabanter, K. De Brabanter, B. De Moor, M. Espinoza, T. Falck, D. Geebelen, X. Huang, V. Jumutc, P. Karsmakers, R. Langone, J. Lopez, J. Luts, R. Mall, S. Mehrkanoon, Y. Moreau, K. Pelckmans, J. Puertas, L. Shi, M. Signoretto, V. Van Belle, R. Van de Plas, S. Van Huffel, J. Vandewalle, C. Varon, S. Yu, and others
• Many people for joint work, discussions, invitations, organizations
• Support from ERC AdG A-DATADRIVE-B, KU Leuven, GOA-MaNet, COE Optimization in Engineering OPTEC, IUAP DYSCO, FWO projects, IWT, IBBT eHealth, COST