Application of Supervised Learning to Neuroimaging Janaina Mourao-Miranda

Supervised Learning applied to Neuroimaging2 › staff › J.Shawe-Taylor › courses › J2.pdf · 2009-12-14 · Supervised Learning Input: X1 X2 X3 Output y1 y2 y3 Learning/Training


Page 1:

Application of Supervised Learning to Neuroimaging

Janaina Mourao-Miranda

Page 2:

Supervised Learning

Input: X1, X2, X3. Output: y1, y2, y3.

Learning/Training: generate a function or hypothesis f such that f(Xi) = yi.

Training examples: (X1, y1), (X2, y2), ..., (Xn, yn)

Test/Prediction: for a new test example Xi, the learnt function predicts f(Xi) -> yi.

Methodology: automatic procedures that learn a task from a series of examples, used when no mathematical model is available.
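The train/predict loop sketched above can be written in a few lines; scikit-learn's `SVC` is an assumption here (the slides do not name a toolbox), standing in for any supervised learner:

```python
# Minimal supervised-learning loop: training examples (Xi, yi), a learnt
# function f, and a prediction f(Xi) -> yi for unseen test examples.
from sklearn.svm import SVC

X_train = [[0.0, 0.0], [0.2, 0.1], [1.0, 1.0], [0.9, 1.1]]  # inputs X1..X4
y_train = [-1, -1, +1, +1]                                   # outputs y1..y4

f = SVC(kernel="linear")   # learning/training: generate a hypothesis f
f.fit(X_train, y_train)

X_test = [[0.1, 0.0], [1.0, 0.9]]   # unseen test examples
print(f.predict(X_test))            # predictions f(Xi)
```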

Page 3:

Machine Learning Methods

• Artificial Neural Networks

• Decision Trees

• Bayesian Networks

• Gaussian Processes

• Support Vector Machines

• ...

• SVM is a classifier derived from statistical learning theory by Vapnik and Chervonenkis

• SVMs were introduced by Boser, Guyon, and Vapnik at COLT-92

• A powerful tool for statistical pattern recognition

Page 4:

Advantages of pattern recognition analysis in Neuroimaging

Explore the multivariate nature of neuroimaging data

• MRI/fMRI data are multivariate by nature, since each scan contains information about brain activity at thousands of measured locations (voxels).

• Considering that most brain functions are distributed processes involving a network of brain regions, it seems advantageous to use the spatially distributed information contained in the data to gain a better understanding of brain function.

• Can yield greater sensitivity than conventional analysis.

Can be used to make predictions for new examples

• Enables clinical applications: previously acquired data can be used to make diagnostic or prognostic predictions for new subjects.

Page 5:

fMRI Data Analysis

Inputs: (1) voxel time series (BOLD signal intensity over time), (2) experimental design.

Classical approach: mass-univariate analysis (e.g. GLM). Output: map of activated regions, task 1 vs. task 2.

Pattern recognition approach: multivariate analysis.

• SVM training. Input: volumes from task 1 and volumes from task 2. Output: map of discriminating regions between task 1 and task 2.

• SVM test. Input: a new example. Output: prediction, task 1 or task 2.

Page 6:

fMRI data as input to a classifier

Each fMRI volume is treated as a vector in an extremely high-dimensional space (~200,000 voxels, or dimensions, after applying the brain mask). Example of a vector representing the pattern of brain activation:

[2 8 4 2 5 4 8 4 8]
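As a sketch of how a volume becomes such a vector, the following applies a boolean brain mask to a toy 3-D array with plain numpy (a real pipeline would load the image with a neuroimaging library, which is an assumption here):

```python
import numpy as np

volume = np.arange(64, dtype=float).reshape(4, 4, 4)  # toy 4x4x4 "volume"
mask = np.zeros((4, 4, 4), dtype=bool)
mask[1:3, 1:3, 1:3] = True                            # toy "in-brain" region

x = volume[mask]   # 1-D pattern vector: one entry per in-mask voxel
print(x.shape)     # (8,) here; ~(200000,) for a real whole-brain mask
```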

Page 7:

Using pattern recognition to distinguish between object categories

[Figure: data matrix of voxels × time (trials or scans), split into a training set and a test set; each test input yields a classification decision.]

Page 8:

Classification in Neuroimaging: 2D toy example

[Figure: volumes at t1-t4 plotted in the (voxel 1, voxel 2) plane with discriminating direction w. The volumes at t1-t4 are labeled task 1 or task 2; a volume from a new subject ("task ?") falls on one side of the boundary and is assigned to task 1 or task 2.]

Page 9:

Classification in High Dimensions

Data: <xi, yi>, i = 1, ..., N
Observations: xi ∈ R^d
Labels: yi ∈ {-1, +1}

All hyperplanes in R^d are parameterized by a vector w and a constant b. With a feature map φ: R^d → R^N, they can be expressed as:

⟨w, φ(x)⟩ + b = 0

In high dimensions there are many possible hyperplanes separating the data, e.g. the points (x1, +1) and (x2, -1). Our aim is to find a hyperplane/decision function

f(x) = sgn(⟨w, φ(x)⟩ + b)

that correctly classifies our data: f(x1) = +1 and f(x2) = -1.

Page 10:

Simplest Approach: Fisher Linear Discriminant

[Figure: data points X1(t1), X1(t3), X2(t2), X2(t4) in the (voxel 1, voxel 2) plane, projected onto the learnt weight vector w; the class means m1 and m2 and a threshold thr define the decision. The FLD direction is shown with and without the regularization correction.]

Page 11:

Fisher Linear Discriminant

The Fisher Discriminant is a classification function

f(x) = sgn(⟨w, φ(x)⟩ + b)

where the weight vector w is chosen to maximize the quotient

J(w) = (μ+ − μ−)² / (σ+² + σ−²)

μ+, μ−: means of the projections of the positive/negative examples
σ+, σ−: corresponding standard deviations

That is, find the direction w that maximizes the separation of the means, scaled according to the variances in that direction.

Regularized version:

J(w) = (μ+ − μ−)² / (σ+² + σ−² + λ)
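The direction maximizing the regularized quotient has the usual closed form w ∝ (S+ + S− + λI)⁻¹(μ+ − μ−); a minimal numpy sketch on toy 2-D data (the toy data and the λ value are assumptions, not from the slides):

```python
import numpy as np

def fld_weights(X_pos, X_neg, lam=1e-3):
    """Regularized Fisher Linear Discriminant direction."""
    mu_p, mu_n = X_pos.mean(axis=0), X_neg.mean(axis=0)
    # Within-class scatter (sum of class covariances), regularized by lam*I.
    S = np.cov(X_pos, rowvar=False) + np.cov(X_neg, rowvar=False)
    S += lam * np.eye(S.shape[0])
    return np.linalg.solve(S, mu_p - mu_n)

rng = np.random.default_rng(0)
X_pos = rng.normal([2.0, 2.0], 0.5, size=(20, 2))  # positive examples
X_neg = rng.normal([0.0, 0.0], 0.5, size=(20, 2))  # negative examples
w = fld_weights(X_pos, X_neg)
# Projections of the class means onto w are separated as intended.
print(X_pos.mean(axis=0) @ w > X_neg.mean(axis=0) @ w)
```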

Page 12:

Optimal Hyperplane: largest margin classifier

Among all hyperplanes separating the data there is a unique optimal hyperplane: the one with the largest margin γ (the distance of the closest points to the hyperplane).

Suppose all test points are generated by adding bounded noise r to the training examples (test and training data are assumed to have been generated by the same distribution). If the optimal hyperplane has margin γ > r, it will correctly separate the test points. As r is unknown, the best we can do is maximize the margin γ.

Page 13:

Support Vector Machine: the maximal margin classifier

Data: <xi, yi>, i = 1, ..., N
Observations: xi ∈ R²
Labels: yi ∈ {-1, +1}

Optimization problem (convex quadratic program):

min over w, b, γ, ξ:  −γ + C Σ_{i=1..N} ξi
subject to:  yi(⟨w, φ(xi)⟩ + b) ≥ γ − ξi,  ξi ≥ 0,  i = 1, ..., N,  and  ‖w‖² = 1

γ: margin; ξi: slack variables; w: weight vector.

C controls the trade-off between the margin and the size of the slack variables. In practice C is chosen by cross-validation. As the parameter C varies, the margin varies smoothly through a corresponding range.

For details on the SVM formulation see Kernel Methods for Pattern Analysis, J. Shawe-Taylor & N. Cristianini.
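Choosing C by cross-validation, as recommended above, can be sketched with a simple grid search (scikit-learn and the toy data are assumptions; the slides do not prescribe a toolbox):

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, (30, 5)),    # class -1
               rng.normal(1.5, 1.0, (30, 5))])   # class +1
y = np.array([-1] * 30 + [+1] * 30)

# 5-fold cross-validation over a grid of C values.
search = GridSearchCV(SVC(kernel="linear"), {"C": [0.01, 0.1, 1, 10, 100]}, cv=5)
search.fit(X, y)
print(search.best_params_["C"], round(search.best_score_, 2))
```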

Page 14:

SVM decision function:

f(x) = sgn( Σ_{i=1..N} αi yi K(xi, x) + b )

SVM weights:

w = Σ_{i=1..N} αi yi φ(xi)

In the linear case:

φ(xi) = xi,  K(xi, xj) = ⟨xi, xj⟩

αi ≠ 0 only for inputs that lie on the margin (i.e. the support vectors).

The trade-off parameter C between accuracy and regularization directly controls the size of the αi.
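In the linear case the relation w = Σ αi yi xi can be checked directly on a fitted model: scikit-learn's `SVC` (an assumption, as before) exposes αi yi as `dual_coef_` and the support vectors as `support_vectors_`:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 1.0, (20, 3)),
               rng.normal(2.0, 1.0, (20, 3))])
y = np.array([-1] * 20 + [+1] * 20)

clf = SVC(kernel="linear", C=1.0).fit(X, y)
# dual_coef_ holds alpha_i * y_i for the support vectors (alpha_i = 0
# elsewhere), so w is their weighted sum over the support vectors.
w = clf.dual_coef_ @ clf.support_vectors_
print(np.allclose(w, clf.coef_))   # True: matches the model's own weights
```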

Page 15:

How to interpret the SVM weight vector?

[Figure: 2-D toy example. Training examples from task 1 and task 2 in the (voxel 1, voxel 2) plane, with separating hyperplane H and weight vector (discriminating volume) W = [0.45 0.89].]

• The value of each voxel in the discriminating volume indicates the importance of that voxel in differentiating between the two classes or brain states.

Page 16:

Pattern Recognition Method: General Procedure

• Standard fMRI pre-processing: realignment, normalization, smoothing

• Dimensionality reduction and/or feature selection

• Split data into training and test sets

• Compute the kernel matrix

• ML training and test

• Output: accuracy and discriminating maps (weight vector)

Page 17:

Kernel

A kernel is a function that, given two patterns X and X*, returns a real number characterizing their similarity:

K: 𝒳 × 𝒳 → ℝ,  (X, X*) → K(X, X*)

A simple type of similarity measure between two vectors is the dot product ⟨X, X*⟩.

Page 18:

Kernel Matrix

[Figure: kernel matrix for the training examples X1, X2, ..., shown as a heat map; entry (i, j) is ⟨Xi, Xj⟩. In the linear case φ(xi) = xi and K(xi, xj) = ⟨xi, xj⟩.]

Page 19:

Kernel Approaches and Feature Space

• The original input space can be mapped to some higher-dimensional feature space where the training set is separable:

φ: x → φ(x)

Page 20:

Kernel trick

Instead of using two steps:

1. Mapping to a high-dimensional space: xi → φ(xi)

2. Computing the dot product in the high-dimensional space: ⟨φ(xi), φ(xj)⟩

one can use the kernel trick and compute these two steps together. A kernel function is defined as a function that corresponds to a dot product of two feature vectors in some expanded feature space:

K(xi, xj) := ⟨φ(xi), φ(xj)⟩
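For p = 2 in two dimensions the trick can be verified by hand: K(x, z) = (1 + xᵀz)² equals a plain dot product after the explicit degree-2 feature map φ (the concrete map below is a standard textbook construction, not from the slides):

```python
import numpy as np

def phi(x):
    # Explicit feature map whose dot product reproduces (1 + x.z)^2 in 2-D.
    x1, x2 = x
    return np.array([1.0, np.sqrt(2) * x1, np.sqrt(2) * x2,
                     x1**2, x2**2, np.sqrt(2) * x1 * x2])

x, z = np.array([1.0, 2.0]), np.array([0.5, -1.0])
lhs = (1 + x @ z) ** 2    # kernel trick: one step, 2-D dot product
rhs = phi(x) @ phi(z)     # two steps: map to 6-D, then dot product
print(np.isclose(lhs, rhs))   # True
```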

Page 21:

• Examples of commonly used kernel functions:

– Linear kernel: K(xi, xj) = xiᵀxj

– Polynomial kernel: K(xi, xj) = (1 + xiᵀxj)^p

– Gaussian (Radial Basis Function, RBF) kernel: K(xi, xj) = exp(−‖xi − xj‖² / 2σ²)

– Sigmoid: K(xi, xj) = tanh(β0 xiᵀxj + β1)

• In general, functions that satisfy Mercer's condition can be kernel functions.
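The four kernels can be written as numpy one-liners (the hyperparameter values p, σ, β0, β1 below are illustrative assumptions):

```python
import numpy as np

linear  = lambda xi, xj: xi @ xj
poly    = lambda xi, xj, p=2: (1 + xi @ xj) ** p
rbf     = lambda xi, xj, sigma=1.0: np.exp(-np.sum((xi - xj) ** 2) / (2 * sigma**2))
sigmoid = lambda xi, xj, b0=1.0, b1=0.0: np.tanh(b0 * (xi @ xj) + b1)

x = np.array([1.0, 2.0])
# rbf(x, x) is always 1: every pattern is maximally similar to itself.
print(linear(x, x), poly(x, x), rbf(x, x), sigmoid(x, x))
```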

Page 22:

How to give the data as input to the classifier?

Page 23:

First Approach: Training with the whole brain

- Additional pre-processing: removal of the baseline and low-frequency components of each voxel

- Advantages: can predict single events

- Disadvantages: low signal-to-noise ratio (SNR), stationarity assumptions

Data matrix: voxels × single volumes (C1 C1 C1 BL BL BL C2 C2 C2 BL BL BL)

Page 24:

Second Approach: Training with temporally compressed data

- Additional pre-processing: removal of the baseline and low-frequency components of each voxel

- Advantages: high SNR

- Disadvantages: stationarity assumptions

Data matrix: voxels × mean volumes or betas. Average the volumes (over blocks or over the experiment) or use the parameter estimates (betas) of the GLM model.
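The compression step itself is a reshape-and-average over the scans within each block; a numpy sketch with toy sizes (the 6 blocks × 7 scans mirror a block design, but the numbers are illustrative):

```python
import numpy as np

n_blocks, scans_per_block, n_voxels = 6, 7, 100
data = np.random.rand(n_blocks * scans_per_block, n_voxels)  # scans x voxels

# Group scans by block and average, giving one high-SNR example per block.
block_means = data.reshape(n_blocks, scans_per_block, n_voxels).mean(axis=1)
print(block_means.shape)   # (6, 100)
```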

Page 25:

Third Approach: Training with regions of interest (ROIs)

- Additional pre-processing: removal of the baseline and low-frequency components of each voxel

- Advantages: lower dimensionality

- Disadvantages: stationarity assumptions; needs an a priori hypothesis to define the ROI; does not use the whole-brain information

Data matrix: a feature-selection method reduces voxels × single volumes to selected voxels × single volumes.

Page 26:

Fourth Approach: Spatiotemporal information

- Additional pre-processing: removal of the baseline and low-frequency components of each voxel

- Advantages: uses temporal and spatial information; no stationarity assumptions

- Disadvantages: low signal-to-noise ratio (SNR)

Data matrix: voxels × spatiotemporal observations; each observation concatenates the volumes at T1, T2, T3.

Page 27:

Examples of Applications

Page 28:

Can we classify brain states using the whole-brain information from different subjects?

Page 29:

Application I: Classifying cognitive states

We applied an SVM classifier to predict from the fMRI scans whether a subject was looking at an unpleasant or a pleasant image.

Experimental design:

• Number of subjects: 16

• Tasks: viewing unpleasant and pleasant pictures (6 blocks of 7 scans per block)

Pre-processing procedures:

• Realignment, normalization to standard space, spatial filter.

• Mask to select voxels inside the brain.

Training examples:

• Mean volume per block

Leave-one-subject-out test:

• Training: 15 subjects

• Test: 1 subject

This procedure was repeated 16 times and the results (error rate) were averaged.
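The leave-one-subject-out loop maps onto grouped cross-validation, where each subject's blocks form one group (scikit-learn and the toy data below are assumptions):

```python
import numpy as np
from sklearn.model_selection import LeaveOneGroupOut, cross_val_score
from sklearn.svm import SVC

rng = np.random.default_rng(0)
n_subjects, blocks_per_subject, n_voxels = 16, 6, 50
X = rng.normal(0.0, 1.0, (n_subjects * blocks_per_subject, n_voxels))
y = np.tile([-1, +1], n_subjects * blocks_per_subject // 2)  # two conditions
X[y == +1] += 0.5                                            # injected signal
groups = np.repeat(np.arange(n_subjects), blocks_per_subject)

# 16 folds: train on 15 subjects, test on the held-out one, then average.
scores = cross_val_score(SVC(kernel="linear"), X, y,
                         groups=groups, cv=LeaveOneGroupOut())
print(len(scores), scores.mean())
```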

Page 30:

[Figure: scans of the training subjects, labeled "brain looking at a pleasant stimulus" or "brain looking at an unpleasant stimulus", train the machine learning method (Support Vector Machine); for a test subject's scan the model answers, e.g., "the subject was viewing a pleasant stimulus".]

Page 31:

Results

[Figure: spatial weight vector maps at z = -18, -6, 6, 18, 30, 42, with a color scale from -1.00 (unpleasant) to 1.00 (pleasant).]

Mourao-Miranda et al. 2006

Page 32:

Can we make use of the temporal dimension in decoding?

Page 33:

Experiment: Emotional Images (Pleasant vs. Unpleasant)

Page 34:

[Figure: one duty cycle of the block design, fixation followed by unpleasant or pleasant stimuli, spanning volumes vt1-vt14.]

Spatiotemporal observation: Vi = [v1 v2 v3 v4 v5 v6 v7 v8 v9 v10 v11 v12 v13 v14]

Page 35:

Spatiotemporal SVM: Block Design

Training example: the whole duty cycle (T1-T14) for unpleasant and pleasant blocks.

[Figure: discriminating maps over T1-T14, with a color scale from -1.00 (unpleasant) to 1.00 (pleasant).]

Mourao-Miranda et al. 2007

Page 36:

Spatial-Temporal weight vector: Dynamic discriminating map

Page 37:

Spatial-Temporal weight vector: Dynamic discriminating map

Page 38:

[Figure: discriminating map at T5, z = -18, panels A-D, color scale from -1.00 (unpleasant) to 1.00 (pleasant).]

Page 39:

[Figure: discriminating map at T5, z = -6, panels A and B, color scale from -1.00 (unpleasant) to 1.00 (pleasant).]

Page 40:

[Figure: discriminating map at T5, z = -6, panels C-F, color scale from -1.00 (unpleasant) to 1.00 (pleasant).]

Page 41:

Can we classify groups using the whole-brain information from different subjects?

Page 42:

Application II: Classifying groups of subjects

We applied SVM to classify depressed patients vs. healthy controls based on their pattern of activation for emotional stimuli (sad faces).

Experimental design:

• 19 medication-free depressed patients vs. 19 healthy controls

• The event-related fMRI paradigm consisted of affective processing of sad facial stimuli, with modulation of the intensity of the emotional expression (low, medium, and high intensity).

Pre-processing procedures:

• Realignment, normalization to standard space, spatial filter.

• GLM analysis.

Training examples:

• GLM coefficients, i.e. one example per subject

Leave-one-pair-out cross-validation test.

Page 43:

Pattern Classification of Brain Activity in Depression

(train and test with GLM coefficients)

Collaboration with Cynthia H.Y. Fu

Fu et al. 2008

Page 44:

SVM weight – Low intensity (Hap 0)

Page 45:

SVM weight – Medium intensity (Sad 2)

Page 46:

SVM – High intensity (Sad 4)

Page 47:

Can we decode subjective pain from whole-brain patterns of fMRI? (Andre Marquand)

Page 48:

Application IV: Decoding Pain Perception

We applied GP methods to predict subjective pain levels in an fMRI experiment investigating subjective responses to thermal pain.

Experimental design:

• 15 subjects scanned 6 times over three visits (repeated-measures design)

• Thermal stimulation was delivered via a thermode attached to the subjects' right forearm

• Stimulation was individually calibrated to three different subjective intensity thresholds:
  1. Sensory detection threshold (SDT; temperature at which stimulation is detectable)
  2. Pain detection threshold (PDT; temperature at which it becomes painful)
  3. Pain tolerance threshold (PTT; maximum tolerable temperature)

• Subjects rated the perceived intensity of the stimulus using a visual analogue scale (VAS): 0 = "no sensation", 100 = "worst pain imaginable"

• After calibration, the actual temperature applied was invariant throughout the experiment (within subjects and stimulus classes)

Predictive model:

• GPR was used to predict the subjective pain rating (VAS score)

• Whole-brain fMRI volumes were used as input to the model
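A minimal GP-regression sketch in the same spirit (scikit-learn's `GaussianProcessRegressor`, the kernel choice, and the toy data are all assumptions; the study used whole-brain volumes and real VAS scores):

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

rng = np.random.default_rng(0)
X = rng.normal(0.0, 1.0, (40, 5))                   # 40 scans x 5 "voxels"
y = X[:, 0] * 30.0 + 50.0 + rng.normal(0, 2.0, 40)  # toy continuous "VAS" rating

gpr = GaussianProcessRegressor(kernel=RBF() + WhiteKernel(), normalize_y=True)
gpr.fit(X[:30], y[:30])        # train on 30 scans
pred = gpr.predict(X[30:])     # predict the rating for 10 held-out scans
print(pred.shape)
```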

Page 49:

Results: GP Regression

For every stimulus class, GPR provided very accurate predictions of subjective pain intensity (SMSE = 0.51*, p < 1×10⁻¹⁰ by permutation).

[Figure: scatter plots of true vs. predicted VAS for each stimulus class: SDT ρS = 0.60, PDT ρS = 0.73, PTT ρS = 0.87.]

Marquand et al. 2009

Page 50:

Results: GP Regression

Relating brain activity to subjective pain intensity is not a novel finding; several brain regions have been shown to encode subjective pain intensity. We compared the strength of the correlation between GPR predictions and VAS scores to correlations derived from a number of intensity-coding brain regions:

Primary somatosensory cortex:

• Left: ρS = 0.26

• Right: ρS = 0.12

Secondary somatosensory cortex:

• Left: ρS = 0.27*

• Right: ρS = 0.32*

Anterior cingulate cortex:

• Left: ρS = 0.42*

• Right: ρS = 0.41*

Insula:

• Left: ρS = 0.37*

• Right: ρS = 0.36*

No single brain region produced a correlation as strong as the GPR predictions derived from the whole brain: 'The whole is greater than any of the parts.'

Marquand et al. 2009