Models from Data: a Unifying Picture
Johan Suykens
KU Leuven, ESAT-SCD/SISTA
Kasteelpark Arenberg 10
B-3001 Leuven (Heverlee), Belgium
Email: [email protected]
http://www.esat.kuleuven.be/scd/

Grand Challenges of Computational Intelligence
Nicosia, Cyprus, Sept. 14, 2012
Models from data: a unifying picture - Johan Suykens
Overview
• Models from data
• Examples
• Challenges for computational intelligence
High-quality predictive models are crucial

biomedical, energy, process industry, bio-informatics, brain-computer interfaces, traffic networks
Classical Neural Networks

[Figure: neural network with inputs x1, x2, x3, ..., xn, weights w1, ..., wn, bias b, activations h(·), output y]

Multilayer Perceptron (MLP) properties:
- universal approximation
- learning from input-output patterns: off-line & on-line
- parallel network architecture, multiple inputs and outputs

+ Flexible and widely applicable: feedforward & recurrent networks, supervised & unsupervised learning
− Many local minima, trial and error for the number of neurons
Support Vector Machines

[Figure: cost function versus weights — many local minima for the MLP, a single convex minimum for the SVM]

• Nonlinear classification and function estimation by convex optimization
• Learning and generalization in high-dimensional input spaces
• Use of kernels:
  - linear, polynomial, RBF, MLP, splines, kernels from graphical models, ...
  - application-specific kernels: e.g. bioinformatics, text mining

[Vapnik, 1995; Scholkopf & Smola, 2002; Shawe-Taylor & Cristianini, 2004]
Kernel-based models: different views
[Figure: complementary views on kernel-based models — SVM, LS-SVM, Kriging, Gaussian Processes, RKHS]
Some early history on RKHS:
1910-1920: Moore
1940: Aronszajn
1951: Krige
1970: Parzen
1971: Kimeldorf & Wahba
Complementary insights from different perspectives: kernels are used in different methodologies
- Support vector machines (SVM): optimization approach (primal/dual)
- Reproducing kernel Hilbert spaces (RKHS): variational problem, functional analysis
- Gaussian processes (GP): probabilistic/Bayesian approach
SVMs: living in two worlds ...
[Figure: input space with two classes (x, o), mapped by ϕ(x) to a feature space; a parametric model in the primal space, a nonparametric model in the dual space]

Primal space (parametric): y = sign[wTϕ(x) + b], with weights w1, ..., wnh on features ϕ1(x), ..., ϕnh(x)

Dual space (nonparametric): y = sign[∑_{i=1}^{#sv} αi yi K(x, xi) + b], with coefficients α1, ..., α#sv on K(x, x1), ..., K(x, x#sv)

K(xi, xj) = ϕ(xi)Tϕ(xj) (“kernel trick”)
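The kernel trick identity can be verified numerically for any kernel whose feature map is known in closed form; a minimal sketch, assuming a degree-2 polynomial kernel on R^2 (the function names phi and K are illustrative):

```python
import numpy as np

def phi(x):
    """Explicit degree-2 polynomial feature map for x in R^2,
    chosen so that phi(x) . phi(y) = (x . y + 1)^2."""
    x1, x2 = x
    return np.array([1.0,
                     np.sqrt(2) * x1, np.sqrt(2) * x2,
                     x1**2, x2**2,
                     np.sqrt(2) * x1 * x2])

def K(x, y):
    """Same kernel evaluated directly in input space (no feature map)."""
    return (np.dot(x, y) + 1.0) ** 2

x = np.array([0.5, -1.2])
y = np.array([2.0, 0.3])

# Primal-world inner product equals the dual-world kernel evaluation
assert np.isclose(phi(x) @ phi(y), K(x, y))
```

The dual side never materializes ϕ(x); only kernel evaluations K(x, xi) are needed, which is what makes infinite-dimensional feature maps (e.g. RBF) usable.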
Fitting models to data: alternative views

- Consider a model y = f(x; w), given input/output data (xi, yi), i = 1, ..., N:

    min_w  wTw + γ ∑_{i=1}^N (yi − f(xi; w))^2

- Rewrite the problem as

    min_{w,e}  wTw + γ ∑_{i=1}^N ei^2
    subject to ei = yi − f(xi; w), i = 1, ..., N

- Construct the Lagrangian and take the conditions for optimality
- Express the solution and the model in terms of the Lagrange multipliers
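For the linear case f(x; w) = wTx (omitting the bias term for brevity), these two steps can be written out explicitly; a short sketch of the derivation:

```latex
\mathcal{L}(w, e; \alpha) = w^T w + \gamma \sum_{i=1}^{N} e_i^2
  + \sum_{i=1}^{N} \alpha_i \, (y_i - w^T x_i - e_i)

\frac{\partial \mathcal{L}}{\partial w} = 0 \;\Rightarrow\; w = \tfrac{1}{2} \sum_{i} \alpha_i x_i , \qquad
\frac{\partial \mathcal{L}}{\partial e_i} = 0 \;\Rightarrow\; e_i = \frac{\alpha_i}{2\gamma} , \qquad
\frac{\partial \mathcal{L}}{\partial \alpha_i} = 0 \;\Rightarrow\; e_i = y_i - w^T x_i
```

Eliminating w and e leaves a linear system in α alone, yi = (1/2) ∑_j αj xjTxi + αi/(2γ); up to a rescaling of α this gives the dual model y = ∑_i αi xiTx, expressed entirely through inner products xiTx.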
Linear model: solving in primal or dual?

inputs x ∈ R^d, output y ∈ R
training set (xi, yi), i = 1, ..., N

Model:
(P): y = wTx + b,  with w ∈ R^d
(D): y = ∑_i αi xiTx + b,  with α ∈ R^N
Linear model: solving in primal or dual?

few inputs, many data points (e.g. 20 × 1,000,000):
primal: w ∈ R^20
dual: α ∈ R^1,000,000 (kernel matrix: 1,000,000 × 1,000,000)
Linear model: solving in primal or dual?

many inputs, few data points (e.g. 10,000 × 50):
primal: w ∈ R^10,000
dual: α ∈ R^50 (kernel matrix: 50 × 50)
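The dimension argument can be checked numerically: for a regularized linear model the primal and dual solutions coincide exactly. A minimal sketch, assuming ridge regression without the bias term b (variable names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
N, d, lam = 20, 300, 1e-2     # many inputs, few data points: dual is cheap
X = rng.standard_normal((N, d))
y = rng.standard_normal(N)

# Primal: solve a d x d system (here 300 x 300)
w = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

# Dual: solve an N x N system (here 20 x 20) over the linear
# kernel matrix X X^T, then recover w = sum_i alpha_i x_i
alpha = np.linalg.solve(X @ X.T + lam * np.eye(N), y)
w_dual = X.T @ alpha

assert np.allclose(w, w_dual, atol=1e-6)
```

The choice between (P) and (D) is therefore purely computational: solve in whichever space is smaller.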
Least Squares Support Vector Machines: “core models”
• Regression

    min_{w,b,e}  wTw + γ ∑_i ei^2   s.t.  yi = wTϕ(xi) + b + ei, ∀i

• Classification

    min_{w,b,e}  wTw + γ ∑_i ei^2   s.t.  yi (wTϕ(xi) + b) = 1 − ei, ∀i

• Kernel PCA (V = I), kernel spectral clustering (V = D^{−1})

    min_{w,b,e}  −wTw + γ ∑_i vi ei^2   s.t.  ei = wTϕ(xi) + b, ∀i

• Kernel canonical correlation analysis / partial least squares

    min_{w,v,b,d,e,r}  wTw + vTv + ν ∑_i (ei − ri)^2   s.t.  ei = wTϕ1(xi) + b,  ri = vTϕ2(yi) + d
[Suykens & Vandewalle, 1999; Suykens et al., 2002; Alzate & Suykens, 2010]
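For regression, the conditions for optimality reduce to a single linear system in (b, α). A minimal sketch, assuming the standard scaling with factors 1/2 on both objective terms (which yields this dual system) and an RBF kernel; function names are illustrative:

```python
import numpy as np

def rbf(A, B, sigma=1.0):
    """RBF kernel matrix K[i, j] = exp(-||a_i - b_j||^2 / sigma^2)."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / sigma**2)

def lssvm_fit(X, y, gamma=1000.0, sigma=1.0):
    """Solve the LS-SVM regression dual linear system:
       [ 0      1^T          ] [ b     ]   [ 0 ]
       [ 1   Omega + I/gamma ] [ alpha ] = [ y ]"""
    N = len(y)
    A = np.zeros((N + 1, N + 1))
    A[0, 1:] = 1.0
    A[1:, 0] = 1.0
    A[1:, 1:] = rbf(X, X, sigma) + np.eye(N) / gamma
    sol = np.linalg.solve(A, np.concatenate(([0.0], y)))
    return sol[0], sol[1:]                      # b, alpha

def lssvm_predict(Xtr, b, alpha, Xte, sigma=1.0):
    """Dual model: yhat(x) = sum_i alpha_i K(x, x_i) + b."""
    return rbf(Xte, Xtr, sigma) @ alpha + b

X = np.linspace(-3, 3, 40).reshape(-1, 1)
y = np.sinc(X).ravel()
b, alpha = lssvm_fit(X, y)
yhat = lssvm_predict(X, b, alpha, X)
assert np.max(np.abs(yhat - y)) < 0.05          # close fit on training data
```

Unlike the QP of the standard SVM, training here is one linear solve, which is what makes these formulations convenient "core models" to extend with extra constraints or regularization terms.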
Core models

[Diagram: core model + additional constraints + regularization terms → model estimate, optimal model representation]

Resulting model classes: parametric model, support vector machine, least-squares support vector machine, Parzen kernel model
Overview
• Models from data
• Examples
• Challenges for computational intelligence
Kernel spectral clustering
• Underlying model: e∗ = wTϕ(x∗), with q∗ = sign[e∗] the estimated cluster indicator at any x∗ ∈ R^d.

• Primal problem: training on given data {xi}, i = 1, ..., N

    min_{w,e}  −(1/2) wTw + (γ/2) ∑_{i=1}^N vi ei^2
    subject to ei = wTϕ(xi), i = 1, ..., N

  with weights vi (related to the inverse degree matrix: V = D^{−1}).

• Dual problem:

    Ω α = λ D α

  with Ωij = K(xi, xj) = ϕ(xi)Tϕ(xj).
• Kernel spectral clustering [Alzate & Suykens, IEEE-PAMI, 2010], related tospectral clustering [Fiedler, 1973; Shi & Malik, 2000; Ng et al. 2002; Chung, 1997]
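The dual eigenvalue problem Ωα = λDα can be solved with standard linear algebra; a minimal sketch for the binary case, assuming an RBF kernel and using the equivalent symmetric eigenproblem on D^{−1/2}ΩD^{−1/2} (data and variable names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
# Two well-separated 2-D clusters, 30 points each
X = np.vstack([rng.normal(0.0, 0.5, (30, 2)),
               rng.normal(4.0, 0.5, (30, 2))])

# RBF kernel matrix Omega and degrees deg_i = sum_j Omega_ij
d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
Omega = np.exp(-d2)
deg = Omega.sum(axis=1)

# Omega @ alpha = lam * D @ alpha has the same eigenvalues as the
# symmetric matrix S = D^{-1/2} Omega D^{-1/2}
S = Omega / np.sqrt(np.outer(deg, deg))
lam, V = np.linalg.eigh(S)
alpha = V[:, -2] / np.sqrt(deg)   # second-largest eigenvalue carries the split

truth = np.array([0] * 30 + [1] * 30)
pred = (alpha > 0).astype(int)
acc = max((pred == truth).mean(), (pred != truth).mean())
assert acc > 0.95                 # sign of alpha separates the two clusters
```

The sign pattern of α recovers the clusters; in the full method the same model e∗ = wTϕ(x∗) also assigns out-of-sample points.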
Primal and dual model representations
bias term bl for centering; k clusters, k − 1 sets of constraints (index l = 1, ..., k − 1)

Model:
(P): sign[e^(l)∗] = sign[w^(l)T ϕ(x∗) + bl]
(D): sign[e^(l)∗] = sign[∑_j α^(l)_j K(x∗, xj) + bl]

Advantages:
- out-of-sample extensions
- model selection
- solving large-scale problems
Out-of-sample extension and coding

[Figure: scatter plots in the (x(1), x(2)) plane illustrating the out-of-sample extension]
Example: image segmentation

[Figure: validation points in the (e(1)_i,val, e(2)_i,val, e(3)_i,val) projection space]
Example: image segmentation

[Figure: left — Fisher criterion versus number of clusters k; right — validation points in the (α(1)_i,val, α(2)_i,val) plane]
Kernel spectral clustering: sparse kernel models

[Figure: original image and the binary clustering obtained with the sparse kernel model]

Incomplete Cholesky decomposition: ‖Ω − GGT‖2 ≤ η with G ∈ R^{N×R} and R ≪ N

Image (Berkeley image dataset): 321 × 481 (154,401 pixels), 175 SV

e^(l)∗ = ∑_{i∈SSV} α^(l)_i K(xi, x∗) + bl
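The incomplete Cholesky step can be sketched with a pivoted decomposition that stops once the residual trace drops below η; since the residual is positive semidefinite, its trace bounds its spectral norm, so the bound ‖Ω − GGT‖2 ≤ η follows. Function and variable names are illustrative:

```python
import numpy as np

def incomplete_cholesky(K, eta=1e-8):
    """Pivoted incomplete Cholesky of a PSD matrix K: returns G (N x R)
    with K ~= G @ G.T, stopping when the residual trace <= eta."""
    N = K.shape[0]
    d = np.diag(K).astype(float).copy()   # diagonal of the residual
    G = np.zeros((N, 0))
    while d.sum() > eta:
        j = int(np.argmax(d))             # pivot: largest residual diagonal
        g = (K[:, j] - G @ G[j, :]) / np.sqrt(d[j])
        G = np.column_stack([G, g])
        d = np.maximum(d - g**2, 0.0)     # guard against tiny negatives
    return G

# RBF kernel on 1-D data has rapidly decaying spectrum, so R << N
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 1))
K = np.exp(-(X - X.T) ** 2 / 2.0)

G = incomplete_cholesky(K, eta=1e-8)
assert G.shape[1] < K.shape[0]                       # low rank: R << N
assert np.linalg.norm(K - G @ G.T, 2) <= 1e-6        # spectral-norm bound
```

Only R columns of the kernel matrix are ever touched, which is what makes the approach scale to the 154,401-pixel kernel matrix above.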
Highly sparse kernel models: image segmentation

[Figure: segmented image with support vectors marked (*), and the corresponding points in the (e(1)_i, e(2)_i, e(3)_i) projection space]
Highly sparse kernel models: image segmentation
[Figure: segmented image with the 12 support vectors marked (*), and the corresponding points in the (e(1)_i, e(2)_i, e(3)_i) projection space]

only 3k = 12 support vectors [Alzate & Suykens, Neurocomputing, 2011]
Kernel spectral clustering: adding prior knowledge
• Pair of points x†, x‡: c = 1 must-link, c = −1 cannot-link

• Primal problem [Alzate & Suykens, IJCNN 2009]

    min_{w^(l), e^(l), bl}  −(1/2) ∑_{l=1}^{k−1} w^(l)T w^(l) + (1/2) ∑_{l=1}^{k−1} γl e^(l)T D^{−1} e^(l)

    subject to  e^(1) = Φ_{N×nh} w^(1) + b1 1N
                ...
                e^(k−1) = Φ_{N×nh} w^(k−1) + b_{k−1} 1N
                w^(1)T ϕ(x†) = c w^(1)T ϕ(x‡)
                ...
                w^(k−1)T ϕ(x†) = c w^(k−1)T ϕ(x‡)

• Dual problem: yields a rank-one downdate of the kernel matrix
Kernel spectral clustering: example

[Figure: original image; segmentation without constraints versus with must-link/cannot-link constraints]
Hierarchical kernel spectral clustering
Hierarchical kernel spectral clustering:
- looking at different scales
- use of model selection and validation data

[Alzate & Suykens, Neural Networks, 2012]
Power grid: kernel spectral clustering of time-series
[Figure: three cluster prototypes of normalized load versus hour of day]

Electricity load: 245 substations in the Belgian grid (1/2 train, 1/2 validation)
xi ∈ R^43,824: spectral clustering on high-dimensional data (5 years)

3 of 7 detected clusters:
- 1: residential profile: morning and evening peaks
- 2: business profile: peaked around noon
- 3: industrial profile: increasing morning, oscillating afternoon and evening

[Alzate, Espinoza, De Moor, Suykens, 2009]
Dimensionality reduction and data visualization
• Traditionally: commonly used techniques include principal component analysis (PCA), multi-dimensional scaling (MDS), and self-organizing maps (SOM)

• More recently: isomap, locally linear embedding (LLE), Hessian locally linear embedding, diffusion maps, Laplacian eigenmaps (“kernel eigenmap methods and manifold learning”) [Roweis & Saul, 2000; Coifman et al., 2005; Belkin et al., 2006]

• Kernel maps with reference point [Suykens, IEEE-TNN 2008]: data visualization and dimensionality reduction by solving a linear system
Kernel maps with reference point: formulation

• Kernel maps with reference point [Suykens, IEEE-TNN 2008]:
  - LS-SVM core part: realize dimensionality reduction x 7→ z
  - Regularization term: (z − PDz)T(z − PDz) = ∑_{i=1}^N ‖zi − ∑_{j=1}^N sij D zj‖2^2, with D a diagonal matrix and sij = exp(−‖xi − xj‖2^2 / σ^2)
  - Reference point q (e.g. the first point; sacrificed in the visualization)

• Example: d = 2

    min_{z, w1, w2, b1, b2, ei,1, ei,2}  (1/2)(z − PDz)T(z − PDz) + (ν/2)(w1T w1 + w2T w2) + (η/2) ∑_{i=1}^N (ei,1^2 + ei,2^2)

    such that  c1,1T z = q1 + e1,1
               c1,2T z = q2 + e1,2
               ci,1T z = w1T ϕ1(xi) + b1 + ei,1, ∀i = 2, ..., N
               ci,2T z = w2T ϕ2(xi) + b2 + ei,2, ∀i = 2, ..., N

Coordinates in the low-dimensional space: z = [z1; z2; ...; zN] ∈ R^{dN}
Kernel maps: spiral example
[Figure: 3-D spiral data (x1, x2, x3) and its 2-D embeddings (z1hat, z2hat) for reference points q = [+1; −1] and q = [−1; −1]]

training data (blue *), validation data (magenta o), test data (red +)

Model selection: min ∑_{i,j} ( ziT zj / (‖zi‖2 ‖zj‖2) − xiT xj / (‖xi‖2 ‖xj‖2) )^2
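The model-selection criterion compares pairwise cosine similarities before and after the mapping; a minimal sketch (the function name cosine_criterion is illustrative):

```python
import numpy as np

def cosine_criterion(Z, X):
    """Model-selection score: squared mismatch between pairwise cosine
    similarities in the embedding Z and in the input space X."""
    def cos_sim(A):
        An = A / np.linalg.norm(A, axis=1, keepdims=True)
        return An @ An.T
    return float(np.sum((cos_sim(Z) - cos_sim(X)) ** 2))

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 3))

# A rotated, uniformly scaled copy preserves all cosines: score ~ 0
Q, _ = np.linalg.qr(rng.standard_normal((3, 3)))
Z_good = 2.0 * X @ Q
# An unrelated random embedding does not
Z_bad = rng.standard_normal((50, 3))

assert cosine_criterion(Z_good, X) < 1e-10
assert cosine_criterion(Z_bad, X) > cosine_criterion(Z_good, X)
```

Tuning parameters (σ, ν, η) are then chosen on validation data by minimizing this score.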
Kernel maps: visualizing gene distribution
[Figure: 3-D projection (z1, z2, z3) of the gene distribution]

Alon colon cancer microarray data set: 3-D projections
Dimension of the input space: 62
Number of genes: 1500 (training: 500, validation: 500, test: 500)
Kernels & Tensors

neuroscience: EEG data (time samples × frequency × electrodes)
computer vision: image (/video) compression/completion/... (pixel × illumination × expression × ...)
web mining: analyze user behavior (users × queries × webpages)

vector x, matrix X, tensor X

- Naive kernel: K(X, Y) = exp(−(1/(2σ^2)) ‖vec(X) − vec(Y)‖2^2)
- Tensorial kernel exploiting structure: learning from few examples
[Signoretto et al., Neural Networks, 2011 & IEEE-TSP, 2012]
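The naive kernel simply treats a tensor as one long vector; a minimal sketch (function names are illustrative; the structure-exploiting tensorial kernel of Signoretto et al. is not reproduced here):

```python
import numpy as np

def naive_tensor_kernel(Xt, Yt, sigma=1.0):
    """Naive kernel between tensors: RBF on the vectorized entries,
    K(X, Y) = exp(-||vec(X) - vec(Y)||^2 / (2 sigma^2)).
    Ignores the multi-way structure of the data entirely."""
    diff = Xt.ravel() - Yt.ravel()
    return float(np.exp(-diff @ diff / (2.0 * sigma**2)))

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 5, 3))   # e.g. time x frequency x electrodes
B = rng.standard_normal((4, 5, 3))

assert naive_tensor_kernel(A, A) == 1.0
assert 0.0 < naive_tensor_kernel(A, B) < 1.0
```

Because vectorization discards the mode structure, kernels that respect it can generalize from far fewer examples, which is the point of the tensorial kernels cited above.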
Tensor completion
Mass spectral imaging: sagittal section mouse brain [data: E. Waelkens, R. Van de Plas]
Tensor completion using nuclear norm regularization [Signoretto et al., IEEE-SPL, 2011]
Overview
• Models from data
• Examples
• Challenges for computational intelligence
Challenges for Computational Intelligence
- Bridging gaps between advanced methods and end-users
- New mathematical and methodological frameworks
- Scalable algorithms for large and high-dimensional data
Acknowledgements
• Colleagues at ESAT-SCD (especially research units: systems, models,control - biomedical data processing - bioinformatics):
C. Alzate, A. Argyriou, J. De Brabanter, K. De Brabanter, B. De Moor, M. Espinoza, T. Falck, D. Geebelen, X. Huang, V. Jumutc, P. Karsmakers, R. Langone, J. Lopez, J. Luts, R. Mall, S. Mehrkanoon, Y. Moreau, K. Pelckmans, J. Puertas, L. Shi, M. Signoretto, V. Van Belle, R. Van de Plas, S. Van Huffel, J. Vandewalle, C. Varon, S. Yu, and others
• Many people for joint work, discussions, invitations, organizations
• Support from ERC AdG A-DATADRIVE-B, KU Leuven, GOA-MaNet, COE Optimization in Engineering OPTEC, IUAP DYSCO, FWO projects, IWT, IBBT eHealth, COST