
Support Vector Machines in Marketing

Georgi Nalbantov

MICC, Maastricht University


Contents

Purpose

Linear Support Vector Machines

Nonlinear Support Vector Machines

(Theoretical justifications of SVM)

Marketing Examples

Conclusion and Q & A

(some extensions)


Purpose

Task to be solved (The Classification Task):

Classify cases (customers) into “type 1” or “type 2” on the basis of some known attributes (characteristics)

Chosen tool to solve this task:

Support Vector Machines


The Classification Task

Given data on explanatory and explained variables, where the explained variable can take two values {−1, +1}, find a function that gives the “best” separation between the “−1” cases and the “+1” cases:

Given: (x1, y1), … , (xm, ym) ∈ ℝⁿ × {−1, +1}

Find: f : ℝⁿ → {−1, +1}

“Best” function = the function whose expected error on unseen data (xm+1, ym+1), … , (xm+k, ym+k) is minimal

Existing techniques to solve the classification task:

Linear and Quadratic Discriminant Analysis

Logit choice models (Logistic Regression)

Decision trees, Neural Networks, Least Squares SVM
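As a hedged illustration (Python with scikit-learn is assumed here; it is not part of the original slides), the task can be set up by fitting a classifier on labelled cases and estimating the expected error on held-out, unseen cases:

# Hypothetical sketch of the classification task (assumes scikit-learn; not part of the slides).
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Toy data: m cases with n = 2 attributes, labels y in {-1, +1}.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))                        # attributes (x_1, ..., x_m)
y = np.where(X[:, 0] + X[:, 1] > 0, 1, -1)           # labels (y_1, ..., y_m)

# Hold out "unseen" cases (x_{m+1}, y_{m+1}), ... to estimate the expected error.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

f = SVC(kernel="linear").fit(X_tr, y_tr)             # a candidate function f: R^n -> {-1, +1}
print("estimated error on unseen data:", 1 - f.score(X_te, y_te))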


Support Vector Machines: Definition

Support Vector Machines are a non-parametric tool for classification/regression

Support Vector Machines are used for prediction rather than description purposes

Support Vector Machines have been developed by Vapnik and co-workers


Linear Support Vector Machines

A direct marketing company wants to sell a new book:

“The Art History of Florence”

Nissan Levin and Jacob Zahavi in Lattin, Carroll and Green (2003).

Problem: how to identify buyers and non-buyers using two variables: months since last purchase and number of art books purchased.

[Figure: scatter plot of buyers (∆) and non-buyers (●); x-axis: months since last purchase, y-axis: number of art books purchased]



Linear SVM: Separable Case

Main idea of SVM: separate the groups by a line.

However: there are infinitely many lines that have zero training error…

… which line shall we choose?


Linear SVM: Separable Case

SVMs use the idea of a margin around the separating line.

The thinner the margin, the more complex the model.

The best line is the one with the largest margin.

[Figure: the buyers/non-buyers scatter with a separating line and the margin band around it]


Linear SVM: Separable Case

The line having the largest margin is:

w1x1 + w2x2 + b = 0

where
x1 = months since last purchase
x2 = number of art books purchased

Note:
w1xi1 + w2xi2 + b ≥ +1 for i ∈ ∆ (buyers)
w1xj1 + w2xj2 + b ≤ –1 for j ∈ ● (non-buyers)

[Figure: the separating line w1x1 + w2x2 + b = 0 with the margin boundaries w1x1 + w2x2 + b = +1 and w1x1 + w2x2 + b = –1, and the normal vector w]


Linear SVM: Separable Case

The width of the margin is given by:

margin = 2 / ||w||,  with ||w|| = √(w1² + w2²)

Note:
maximize the margin 2/||w||  ⇔  minimize ||w|| / 2  ⇔  minimize ||w||² / 2

[Figure: the same plot with the margin width 2/||w|| marked between the two margin boundaries]


Linear SVM: Separable Case

The optimization problem for SVM is:

minimize  L(w) = ||w||² / 2    (i.e., maximize the margin 2/||w||)

subject to:
w1xi1 + w2xi2 + b ≥ +1 for i ∈ ∆ (buyers)
w1xj1 + w2xj2 + b ≤ –1 for j ∈ ● (non-buyers)


Linear SVM: Separable Case

“Support vectors” are the points that lie on the boundaries of the margin.

The decision surface (line) is determined only by the support vectors; all other points are irrelevant.

[Figure: the same plot with the support vectors highlighted on the two margin boundaries]
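As a hedged sketch (scikit-learn assumed, toy data hypothetical), the separable-case quantities above can be read off a fitted linear SVM: the weights w, the intercept b, the support vectors, and the margin width 2/||w||:

# Hedged sketch (scikit-learn assumed): read off w, b, the support vectors and the margin width.
import numpy as np
from sklearn.svm import SVC

# Hypothetical separable toy data: buyers (+1) vs. non-buyers (-1).
X = np.array([[1.0, 5.0], [2.0, 4.0], [1.5, 6.0],    # buyers
              [8.0, 1.0], [9.0, 0.5], [7.5, 1.5]])   # non-buyers
y = np.array([1, 1, 1, -1, -1, -1])

clf = SVC(kernel="linear", C=1e6).fit(X, y)          # very large C approximates the hard margin

w, b = clf.coef_[0], clf.intercept_[0]               # the line w1*x1 + w2*x2 + b = 0
print("w =", w, " b =", b)
print("margin width = 2/||w|| =", 2 / np.linalg.norm(w))
print("support vectors (points on the margin boundaries):")
print(clf.support_vectors_)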


Linear SVM: Nonseparable Case

Non-separable case: there is no line that separates the two groups without error.

Training set: 1000 targeted customers.

Here, SVM minimizes L(w, C):

L(w, C) = ||w||² / 2 + C Σi ξi    (maximize the margin and minimize the training errors)

subject to:
w1xi1 + w2xi2 + b ≥ +1 – ξi for i ∈ ∆ (buyers)
w1xj1 + w2xj2 + b ≤ –1 + ξj for j ∈ ● (non-buyers)
ξi, ξj ≥ 0

L(w, C) = Complexity + Errors

[Figure: overlapping buyers (∆) and non-buyers (●) around the margin band w1x1 + w2x2 + b = ±1]


Linear SVM: The Role of C

Bigger C: thinner margin, fewer training errors (better fit on the data), increased complexity.

Smaller C: wider margin, more training errors (worse fit on the data), decreased complexity.

Vary both complexity and empirical error via C … by affecting the optimal w and the optimal number of training errors.

[Figure: two fits of the same data, one with C = 5 (thin margin) and one with C = 1 (wide margin)]
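A small sketch (scikit-learn assumed, data hypothetical) of the role of C: refitting the same overlapping data with a small and a large C shows the trade-off between margin width and training errors:

# Hedged sketch (scikit-learn assumed) of the role of C on overlapping (non-separable) data.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(1)
X = np.vstack([rng.normal([0.0, 0.0], 1.0, size=(50, 2)),    # non-buyers
               rng.normal([2.0, 2.0], 1.0, size=(50, 2))])   # buyers
y = np.hstack([-np.ones(50), np.ones(50)])

for C in (0.01, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    margin = 2 / np.linalg.norm(clf.coef_[0])                # typically wider for smaller C
    errors = int(np.sum(clf.predict(X) != y))                # typically fewer for bigger C
    print(f"C = {C:>6}: margin width = {margin:.2f}, training errors = {errors}")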


Nonlinear SVM: Nonseparable Case

Mapping into a higher-dimensional space

Optimization task: minimize

L(w, C) = ||w||² / 2 + C Σi ξi

subject to:
w1·xi1² + w2·√2·xi1xi2 + w3·xi2² + b ≥ +1 – ξi for i ∈ ∆ (buyers)
w1·xj1² + w2·√2·xj1xj2 + w3·xj2² + b ≤ –1 + ξj for j ∈ ● (non-buyers)

That is, each original point (x1, x2) is replaced by its image (x1², √2·x1x2, x2²), and a linear SVM is fitted in the transformed space.

[Figure: the buyers/non-buyers scatter, which is not separable by a line in the original two variables]


Nonlinear SVM: Nonseparable Case

Map the data into a higher-dimensional space: ℝ² → ℝ³

(x1, x2)  →  (x1², √2·x1x2, x2²)

[Figure: the four points (1,1), (–1,1), (1,–1), (–1,–1) of the two classes (∆, ●) in the original (x1, x2) space, and their images plotted on the axes x1², √2·x1x2, x2²]
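The four points form the classic XOR configuration, which no line separates in the original space. A minimal check (numpy assumed; the labelling of the points is hypothetical) that the quadratic map makes them linearly separable in the transformed space:

# Minimal check (numpy assumed; labels hypothetical) that the quadratic map separates the XOR pattern.
import numpy as np

X = np.array([[1, 1], [-1, -1], [1, -1], [-1, 1]], dtype=float)
y = np.array([1, 1, -1, -1])          # diagonally opposite pairs share a label (XOR pattern)

def phi(x):
    # Quadratic feature map R^2 -> R^3: (x1, x2) -> (x1^2, sqrt(2)*x1*x2, x2^2)
    return np.array([x[0] ** 2, np.sqrt(2) * x[0] * x[1], x[1] ** 2])

Z = np.array([phi(x) for x in X])
print(Z)
# All images share x1^2 = x2^2 = 1 and differ only in the middle coordinate:
# +sqrt(2) for one class, -sqrt(2) for the other, so the plane "middle coordinate = 0"
# separates them perfectly in the transformed space, although no line does so in R^2.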


Nonlinear SVM: Nonseparable Case

Find the optimal hyperplane in the transformed space.

[Figure: in the coordinates (x1², √2·x1x2, x2²) the two classes become linearly separable, and the maximal-margin hyperplane is found there]


Nonlinear SVM: Nonseparable Case

Observe the decision surface in the original space (optional).

[Figure: mapped back to the original (x1, x2) space, the linear decision surface from the transformed space appears as a nonlinear boundary between ∆ and ●]


Nonlinear SVM: Nonseparable Case

Dual formulation of the (primal) SVM minimization problem

Primal:
minimize  ||w||² / 2 + C Σi ξi
subject to  yi (w · xi + b) ≥ 1 – ξi,  ξi ≥ 0

Dual:
maximize  Σi αi – ½ Σi Σj αi αj yi yj (xi · xj)
subject to  0 ≤ αi ≤ C,  Σi αi yi = 0
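As a hedged illustration (scikit-learn assumed), the dual solution can be inspected directly: for a linear kernel, SVC exposes αi·yi for the support vectors, and w = Σi αi yi xi recovers the primal weights:

# Hedged illustration (scikit-learn assumed): inspect the dual solution of a linear SVM.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(2)
X = np.vstack([rng.normal([0.0, 0.0], 1.0, size=(40, 2)),
               rng.normal([3.0, 3.0], 1.0, size=(40, 2))])
y = np.hstack([-np.ones(40), np.ones(40)])

clf = SVC(kernel="linear", C=1.0).fit(X, y)

alpha_y = clf.dual_coef_[0]                        # alpha_i * y_i, stored for support vectors only
w_from_dual = alpha_y @ clf.support_vectors_       # w = sum_i alpha_i y_i x_i
print(np.allclose(w_from_dual, clf.coef_[0]))      # True: the primal w is recovered from the dual
print("sum_i alpha_i y_i =", alpha_y.sum())        # ~0: the dual equality constraint holds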


Nonlinear SVM: Nonseparable Case

Dual formulation of the (primal) SVM minimization problem

Dual:
maximize  Σi αi – ½ Σi Σj αi αj yi yj (xi · xj)
subject to  0 ≤ αi ≤ C,  Σi αi yi = 0

For the quadratic map φ(x) = (x1², √2·x1x2, x2²):

φ(xi) · φ(xj) = xi1²·xj1² + 2·xi1xi2·xj1xj2 + xi2²·xj2² = (xi · xj)²

K(xi, xj) = φ(xi) · φ(xj)    (kernel function)
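A quick numeric check (numpy assumed) that the kernel replaces the explicit mapping: for the quadratic map above, φ(xi)·φ(xj) equals (xi·xj)², so the dual never needs the transformed coordinates:

# Quick numeric check (numpy assumed) that the kernel replaces the explicit mapping.
import numpy as np

def phi(x):
    # Quadratic feature map (x1, x2) -> (x1^2, sqrt(2)*x1*x2, x2^2)
    return np.array([x[0] ** 2, np.sqrt(2) * x[0] * x[1], x[1] ** 2])

def K(xi, xj):
    # Homogeneous polynomial kernel of degree 2, evaluated in the original space.
    return float(np.dot(xi, xj) ** 2)

xi = np.array([3.0, -1.0])
xj = np.array([0.5, 2.0])
print(np.dot(phi(xi), phi(xj)))   # map first, then take the dot product
print(K(xi, xj))                  # same value without ever computing phi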


Nonlinear SVM: Nonseparable Case

Dual formulation of the (primal) SVM minimization problem

In the dual, the dot product xi · xj is replaced by the kernel K(xi, xj) = φ(xi) · φ(xj):

maximize  Σi αi – ½ Σi Σj αi αj yi yj (xi · xj)
becomes
maximize  Σi αi – ½ Σi Σj αi αj yi yj (φ(xi) · φ(xj))

subject to  0 ≤ αi ≤ C,  Σi αi yi = 0

The optimization is thus carried out through kernel evaluations in the original space, without computing the transformed coordinates φ(x) explicitly.


Strengths and Weaknesses of SVM

Strengths of SVM:

Training is relatively easy: no local minima
It scales relatively well to high-dimensional data
The trade-off between classifier complexity and error can be controlled explicitly via C
Robustness of the results
The “curse of dimensionality” is avoided

Weaknesses of SVM:

What is the best trade-off parameter C?
A good transformation of the original space is needed


The Ketchup Marketing Problem

Two types of ketchup: Heinz and Hunts

Seven attributes:
Feature Heinz
Feature Hunts
Display Heinz
Display Hunts
Feature & Display Heinz
Feature & Display Hunts
Log price difference between Heinz and Hunts

Training Data: 2498 cases (89.11% Heinz is chosen)

Test Data: 300 cases (88.33% Heinz is chosen)


The Ketchup Marketing Problem

Choose a kernel mapping:

Linear kernel:      K(xi, xj) = xi · xj
Polynomial kernel:  K(xi, xj) = (xi · xj + 1)^d
RBF kernel:         K(xi, xj) = exp( –||xi – xj||² / (2σ²) )

Do a (5-fold) cross-validation procedure to find the best combination of the manually adjustable parameters (here: C and σ).

[Figure: surface of cross-validation mean squared errors over the (C, σ) grid for the SVM with RBF kernel]
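A hedged sketch (scikit-learn assumed; the data here are placeholders, not the ketchup data) of the 5-fold cross-validation search over C and σ described above:

# Hedged sketch (scikit-learn assumed; data placeholders are hypothetical) of the 5-fold
# cross-validation over C and sigma for the RBF kernel. scikit-learn parameterises the
# RBF kernel by gamma = 1 / (2 * sigma^2).
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

rng = np.random.default_rng(3)
X_train = rng.normal(size=(200, 7))                          # stand-in for the 7 attributes
y_train = np.where(X_train[:, 0] + 0.5 * X_train[:, 6] > 0, 1, -1)

sigmas = np.logspace(-2, 2, 5)
param_grid = {"C": np.logspace(-2, 3, 6),
              "gamma": 1.0 / (2.0 * sigmas ** 2)}            # translate sigma into gamma

search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5)   # 5-fold cross-validation
search.fit(X_train, y_train)
print("best (C, gamma):", search.best_params_)
print("best cross-validation accuracy:", search.best_score_)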


The Ketchup Marketing Problem – Training Set

Model: Linear Discriminant Analysis                  Hit rate: 89.51%

                      Predicted Hunts   Predicted Heinz     Total
Actual Hunts (count)           68              204            272
Actual Heinz (count)           58             2168           2226
Actual Hunts (%)            25.00%           75.00%        100.00%
Actual Heinz (%)             2.61%           97.39%        100.00%


The Ketchup Marketing Problem – Training Set

Model: Logit Choice Model                            Hit rate: 77.79%

                      Predicted Hunts   Predicted Heinz     Total
Actual Hunts (count)          214               58            272
Actual Heinz (count)          497             1729           2226
Actual Hunts (%)            78.68%           21.32%        100.00%
Actual Heinz (%)            22.33%           77.67%        100.00%


The Ketchup Marketing Problem – Training Set

Model: Support Vector Machines                       Hit rate: 99.08%

                      Predicted Hunts   Predicted Heinz     Total
Actual Hunts (count)          255               17            272
Actual Heinz (count)            6             2220           2226
Actual Hunts (%)            93.75%            6.25%        100.00%
Actual Heinz (%)             0.27%           99.73%        100.00%


The Ketchup Marketing Problem – Training Set

Model: Majority Voting                               Hit rate: 89.11%

                      Predicted Hunts   Predicted Heinz     Total
Actual Hunts (count)            0              272            272
Actual Heinz (count)            0             2226           2226
Actual Hunts (%)                0%             100%        100.00%
Actual Heinz (%)                0%             100%        100.00%


The Ketchup Marketing Problem – Test Set

Model: Linear Discriminant Analysis                  Hit rate: 88.33%

                      Predicted Hunts   Predicted Heinz     Total
Actual Hunts (count)            3               32             35
Actual Heinz (count)            3              262            265
Actual Hunts (%)             8.57%           91.43%        100.00%
Actual Heinz (%)             1.13%           98.87%        100.00%


The Ketchup Marketing Problem – Test Set

Model: Logit Choice Model                            Hit rate: 77.00%

                      Predicted Hunts   Predicted Heinz     Total
Actual Hunts (count)           29                6             35
Actual Heinz (count)           63              202            265
Actual Hunts (%)            82.86%           17.14%        100.00%
Actual Heinz (%)            23.77%           76.23%        100.00%


The Ketchup Marketing Problem – Test Set

Model: Support Vector Machines                       Hit rate: 95.67%

                      Predicted Hunts   Predicted Heinz     Total
Actual Hunts (count)           25               10             35
Actual Heinz (count)            3              262            265
Actual Hunts (%)            71.43%           28.57%        100.00%
Actual Heinz (%)             1.13%           98.87%        100.00%
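A small sketch (scikit-learn assumed; the predictions are hard-coded rather than produced by a model) of how the hit rates in these tables are computed from a confusion matrix:

# Hedged sketch (scikit-learn assumed; predictions hard-coded to reproduce the SVM test-set table).
import numpy as np
from sklearn.metrics import confusion_matrix

y_true = np.array([-1] * 35 + [1] * 265)     # 35 Hunts and 265 Heinz cases in the test set
y_pred = y_true.copy()
y_pred[:10] = 1                              # 10 Hunts cases predicted as Heinz
y_pred[35:38] = -1                           # 3 Heinz cases predicted as Hunts

cm = confusion_matrix(y_true, y_pred, labels=[-1, 1])   # rows: actual, columns: predicted
hit_rate = np.trace(cm) / cm.sum()                      # share of correctly classified cases
print(cm)                                               # [[ 25  10] [  3 262]]
print(f"hit rate: {hit_rate:.2%}")                      # 95.67%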


Conclusion

Support Vector Machines (SVM) can be applied to binary and multi-class classification problems

SVM behave robustly in multivariate problems

Further research in various marketing areas is needed to justify or refute the applicability of SVM

Support Vector Regression (SVR) can also be applied

http://www.kernel-machines.org

Email: nalbantov@few.eur.nl
