
CBCL MIT

March 8, 2000

INSEAD

Data Mining with Statistical Learning

Theodoros Evgeniou Massachusetts Institute of Technology

Outline

I. What is data mining?
- Industry: why data mining?

II. Data mining projects
- E-support system
- Detecting patterns in multimedia data

III. Mathematics for complex data mining
- Statistical Learning Theory
- Data mining tools

Concluding remarks

Part I

I. What is data mining?
- Industry: why data mining?

II. Data mining projects
- E-support system
- Detecting patterns in multimedia data

III. Mathematics for complex data mining
- Statistical Learning Theory
- Data mining tools

Concluding remarks

What is Data Mining?

Goal: to classify or find trends in data in order to improve future decisions.

Examples:
- financial data modeling and forecasting
- customer profiling
- fraud detection

Example: Fraud Detection

Training examples:
Age: 24, Occ.: student, Spend: $100, Buy: ...  ->  OK
Age: 39, Occ.: engineer, Spend: $5000, Buy: ...  ->  OK
Age: 27, Occ.: ???????, Spend: $400, Buy: ...  ->  FRAUD
Age: 53, Occ.: small b., Spend: $1300, Buy: ...  ->  OK
...

[Diagram: data mining on these examples builds a fraud system; a new record (Age: .., Occ: ..) is then classified as FRAUD? or OK?]

Example: Customer Profiling

Training examples:
Age: 24, Occ.: student, Spend: $100, Buy: ...  ->  NO
Age: 39, Occ.: engineer, Spend: $5000, Buy: ...  ->  NO
Age: 27, Occ.: ???????, Spend: $400, Buy: ...  ->  BUY
Age: 53, Occ.: small b., Spend: $1300, Buy: ...  ->  NO
...

[Diagram: data mining on these examples builds a profiling system; a new record (Age: .., Occ: ..) is then classified as BUY? or NO?]

Data Mining: More Examples

• Sales analysis for inventory control

• Diagnostics (manufacturing, health, ...)

• Information filtering/retrieval (e.g. emails, multimedia)

• E-Customer Relationship Management:
- E-customer profiling (personalization, marketing ...)
- E-customer support

Market Interest

• Only 30% of Fortune 500 companies using email respond to it on time (IDC).
Email filtering/response software: $20M now, $350M in 2003 (IDC). Kana, eGain, aptex ...: ~$10b.

• 20% of e-companies use internet customer info, 70% by 2001 (Forrester R.).
Personalization, targeted marketing, collaborative filtering ... (privacy?). engage, netperceptions ...: ~$10b.

• US 1999: $12b credit card fraud, 50% on internet (IDC) (insurance, telecom ...).
Fraud detection using data mining: HNC/eHNC: 1999: ~$500m M.C.; 2000: ~$2b M.C.

Part II

I. What is data mining?
- Industry: why data mining?

II. Data mining projects
- E-support system
- Detecting patterns in multimedia data

III. Mathematics for complex data mining
- Statistical Learning Theory
- Data mining tools

Concluding remarks

An E-Support System

Companies need to respond efficiently and accurately to customers' emails ...

... how can they manage this when they receive thousands of emails a day?

1 trillion emails/year in 1999, 5 trillion by 2003 (IDC).

An Email Classification System

Training emails and labels:
... bought a piece of ... some broken part ...  ->  PROBLEM
... would like to return ... not satisfied with ...  ->  PROBLEM
... send a receipt ... previous payment ...  ->  ACCOUNT
... request a copy of the report ... balance of ...  ->  ACCOUNT
...

[Diagram: data mining on these labeled emails builds an e-support system; a new email is then classified as PROBLEM, ACCOUNT, ...]
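The pipeline sketched on this slide (learn from labeled emails, then route a new one) can be illustrated in a few lines. This is a toy stand-in, not the system described in the talk: the vocabulary, the sample emails, and the simple overlap-scoring rule are all assumptions for illustration.

```python
def word_counts(text, vocab):
    """Bag-of-words vector: how often each vocabulary word occurs in the text."""
    words = text.lower().split()
    return [words.count(w) for w in vocab]

def classify(text, vocab, labeled):
    """Assign the label whose training emails share the most vocabulary hits."""
    v = word_counts(text, vocab)
    def score(label):
        # total bag-of-words overlap with all training emails of this label
        return sum(sum(a * b for a, b in zip(v, word_counts(t, vocab)))
                   for t, lab in labeled if lab == label)
    return max({lab for _, lab in labeled}, key=score)

# illustrative vocabulary and training emails (assumed, echoing the slide)
vocab = ["broken", "return", "receipt", "payment", "balance", "report"]
labeled = [("bought a piece of some broken part", "PROBLEM"),
           ("would like to return it not satisfied", "PROBLEM"),
           ("send a receipt for a previous payment", "ACCOUNT"),
           ("request a copy of the report balance of account", "ACCOUNT")]
label = classify("the part arrived broken i want to return it", vocab, labeled)
```

A real e-support system would use a trained classifier over these feature vectors; the overlap rule above only stands in for that final step.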

An Image Mining System

How can we detect objects in an image?

An Image Mining System

[Diagram: example images are data-mined into an image system; a new image is then classified as Pedestrian, Car, ...]

General System Architecture

[Diagram: example data are data-mined into a system; new data are then classified as Decision A, Decision B, ...]

A Data Mining Process

Data exist in many different forms (text, images, web clicks ...).

STEP 1: Represent data in numerical form (feature vectors). (Problem specific.)

Raw data (text, images)  ->  feature extraction  ->  feature vector, e.g. (12, 3, ...)

A Data Mining Process (cont.)

STEP 2: Statistical analysis of the numerical data (feature vectors): regression, classification, clustering.

Step 1: Text Representation

... drive .. far .. see .. later ... left .. drive ..  ->  (2, 0, 1, 1, 1, 1, ...)

WHAT IS THE REPRESENTATION?

• Bag of words

• Bag of combinations of words

• Natural language processing features

Yang, McCallum, Joachims, ...
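The bag-of-words representation can be made concrete in a few lines; the vocabulary and text below are illustrative choices, picked to reproduce a vector like the one on the slide.

```python
def bag_of_words(text, vocabulary):
    """Map a text to a feature vector of per-word counts (bag of words)."""
    words = text.lower().split()
    return [words.count(term) for term in vocabulary]

# assumed vocabulary; "fast" never occurs, giving the 0 entry
vocabulary = ["drive", "fast", "far", "see", "later", "left"]
vector = bag_of_words("drive far see later left drive", vocabulary)  # [2, 0, 1, 1, 1, 1]
```

Bag of combinations of words and NLP features extend the same idea: a longer vector with one coordinate per word pair or per linguistic feature.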

Step 1: Image Representation (Papageorgiou et al., 1999; Evgeniou et al., 2000)

[image patch]  ->  (12, 92, 74, 0, 12, ..., 124)

WHAT IS THE REPRESENTATION?

• Pixel values

• Projections on filters (wavelets)

• PCA

Feature selection
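One of the listed representations, projections on filters (wavelets), can be sketched with a single Haar-like feature: the difference between the total brightness of the left and right halves of a patch. The 4x4 patch below is an assumed toy example, not data from the talk.

```python
def haar_vertical(patch):
    """Haar-like vertical-edge response: sum(left half) - sum(right half)."""
    n = len(patch[0])
    left = sum(sum(row[: n // 2]) for row in patch)
    right = sum(sum(row[n // 2 :]) for row in patch)
    return left - right

# toy patch: bright on the left, dark on the right -> strong edge response
patch = [[9, 9, 1, 1],
         [9, 9, 1, 1],
         [9, 9, 1, 1],
         [9, 9, 1, 1]]
response = haar_vertical(patch)  # 72 - 8 = 64
```

A wavelet representation stacks many such filter responses, at several positions and scales, into one feature vector.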

Step 2: "Learn" a Decision Surface

[Scatter plot: feature vectors such as (1,13,...), (92,10,...), (41,11,...), (19,3,...), (4,24,...), (7,33,...), (4,71,...) from two classes, separated by a decision surface]

Learning Methods

Other approaches:

• Bayesian methods

• Nearest Neighbor

• Neural Networks

• Decision Trees

• Expert systems

New approach:

• The Statistical Learning approach

Part III

I. What is data mining?
- Industry: why data mining?

II. Data mining projects
- E-support system
- Detecting patterns in multimedia data

III. Mathematics for complex data mining
- Statistical Learning Theory
- Data mining tools

Concluding remarks

Roadmap

• Formal setting of learning from examples

• Standard learning methods

• The Statistical Learning approach

• Tools and contributions

Formal Setting of the Problem

Given a set of examples (data):

(x_1, y_1), (x_2, y_2), ..., (x_l, y_l)

Question: find a function f such that f(x) = ŷ is a good predictor of y for a future input x.

The Ideal Solution

What is a "good predictor"?

If data (x, y) appear according to an (unknown) probability distribution P(x, y), then we want our solution to minimize the Expected Error:

R[f] = ∫ V(y, f(x)) P(x, y) dx dy

V(y, f(x)): loss function measuring the "cost" of predicting f(x) instead of y (e.g. (y - f(x))^2).

(I) Empirical Error Minimization

We only have example data, so go for the obvious: minimize the Empirical Error

R_emp[f] = (1/l) Σ_i V(y_i, f(x_i))

... and hope that the solution has a small expected error.

Where?
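The empirical error is straightforward to compute once a loss V and a candidate f are fixed; here is a small sketch with the squared loss and toy data (both assumed for illustration):

```python
def empirical_error(data, f, V):
    """R_emp[f] = (1/l) * sum_i V(y_i, f(x_i)) over the l examples."""
    return sum(V(y, f(x)) for x, y in data) / len(data)

square_loss = lambda y, fx: (y - fx) ** 2

data = [(0.0, 0.0), (1.0, 2.0), (2.0, 4.0)]  # toy examples (x_i, y_i) with y = 2x
perfect = empirical_error(data, lambda x: 2.0 * x, square_loss)  # 0.0
worse = empirical_error(data, lambda x: x, square_loss)          # (0 + 1 + 4) / 3
```

The open question on the slide is over which set of functions f this quantity should be minimized.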

(II) Function Space

Where do we choose f from?

Can f be any constant function? Can f be any polynomial?

Standard Learning Methods

A standard way of building learning methods:

• Step 1: define a function space H

• Step 2: define the loss function V(y, f(x))

• Step 3: find the f in H that minimizes the empirical error:

min_{f in H}  R_emp[f] = (1/l) Σ_i V(y_i, f(x_i))

Standard Learning Methods

A standard way of building learning methods:

• Step 1: define a function space H (How?)

• Step 2: define the loss function V(y, f(x))

• Step 3: find the f in H that minimizes the empirical error (Ok?)

min_{f in H}  R_emp[f] = (1/l) Σ_i V(y_i, f(x_i))

Enough?

The Central Questions

I. How do we choose the function space H?

II. What if there are many solutions in H minimizing the empirical error (ill-posed problem)?

III. Does a function f that minimizes the empirical error in H also minimize the expected error?

Statistical Learning Approach (Vapnik, Chervonenkis, 1968- )

I. Choose the function space H according to its complexity. Formal complexity measures are provided (e.g. the VC-dimension).

II. With appropriate control of the complexity of the function space, the problem becomes well-posed: there is a unique solution.

III. The theory provides necessary and sufficient conditions for the uniform convergence of the empirical error to the expected error in a function space, in terms of the complexity of the space.

Important Bound (Vapnik, Chervonenkis, 1971)

The theory provides bounds on the distance between the expected and the empirical error:

R[f] ≤ R_emp[f] + O( sqrt(h / l) )

where
R[f] = ∫ V(y, f(x)) P(x, y) dx dy : Expected error
R_emp[f] = (1/l) Σ_i V(y_i, f(x_i)) : Empirical error
h : complexity of the function space;  l : number of data

These bounds can be used to choose the function space H.

Using the Bound

[Figure: fits of increasing complexity on the same data, ranging from underfit to overfit]

Using the Bound

[Figure: Error vs. complexity h. The empirical error decreases with h, while the expected error first decreases and then increases; its minimum marks the optimal complexity h_opt]
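The figure's message, that the empirical error falls with complexity h while the bound's penalty term grows, suggests selecting h by minimizing R_emp[f] + sqrt(h/l). A numeric sketch with assumed toy error values:

```python
import math

def bound(h, l, emp_error):
    """Guaranteed-risk style bound: empirical error + O(sqrt(h/l)) penalty."""
    return emp_error + math.sqrt(h / l)

l = 100  # number of examples (assumed)
# toy empirical errors: they fall as the complexity h grows
emp_errors = {1: 0.40, 3: 0.20, 10: 0.05, 50: 0.01}
h_opt = min(emp_errors, key=lambda h: bound(h, l, emp_errors[h]))
```

With these toy numbers the bound is smallest at an intermediate complexity: very simple models underfit (large empirical error), very complex ones pay a large penalty.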

Standard Approaches

A standard way of building learning methods:

• Step 1: define a function space H (How?)

• Step 2: define the loss function V(y, f(x))

• Step 3: find the f in H that minimizes the empirical error (Ok?)

min_{f in H}  R_emp[f] = (1/l) Σ_i V(y_i, f(x_i))

Enough?

The Statistical Learning Approach

The new way of building learning methods.

Minimize: Empirical Error + Complexity

min_{f in H}  (1/l) Σ_i V(y_i, f(x_i)) + λ complexity(f)

by trying many H, guided by the bound  R[f] ≤ R_emp[f] + O( sqrt(h / l) ).
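"Empirical error plus complexity" has a closed form in one dimension: with the squared loss and a ridge-style penalty λw^2, minimizing (1/l) Σ_i (y_i - w x_i)^2 + λ w^2 gives w = Σ x_i y_i / (Σ x_i^2 + λ l). A sketch with assumed toy data (the ridge penalty is used here purely for illustration):

```python
def ridge_1d(data, lam):
    """Minimize (1/l) * sum_i (y_i - w*x_i)^2 + lam * w^2 in closed form."""
    l = len(data)
    sxy = sum(x * y for x, y in data)
    sxx = sum(x * x for x, _ in data)
    return sxy / (sxx + lam * l)

toy = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]  # toy data with y = 2x
w_unregularized = ridge_1d(toy, 0.0)   # plain least squares: w = 2
w_regularized = ridge_1d(toy, 1.0)     # penalty shrinks w toward 0
```

The penalty trades a little empirical error for a simpler (smaller-norm) solution, which is exactly what the bound above rewards.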

The Statistical Learning Approach

It solves the problems of the standard methods:

• Step 1: define a function space H

• Step 2: define the loss function V(y, f(x))

• Step 3: find the f in H that minimizes the empirical error:

min_{f in H}  R_emp[f] = (1/l) Σ_i V(y_i, f(x_i))

Example

H = { f(x) = w·x + b }

V(y, f(x)) = 0 if x is classified correctly, 1 if not.

aka Perceptron (Neural Network)

min_{f in H}  R_emp[f] = (1/l) Σ_i V(y_i, f(x_i))
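The perceptron attacks this misclassification error with a simple update rule: whenever an example is misclassified, move w and b toward it. A self-contained sketch on an assumed toy data set:

```python
def train_perceptron(data, epochs=20):
    """Perceptron: f(x) = w.x + b, updated on each misclassified example."""
    w, b = [0.0, 0.0], 0.0
    for _ in range(epochs):
        for x, y in data:
            if y * (w[0] * x[0] + w[1] * x[1] + b) <= 0:  # misclassified
                w[0] += y * x[0]
                w[1] += y * x[1]
                b += y
    return w, b

def perceptron_predict(w, b, x):
    return 1 if w[0] * x[0] + w[1] * x[1] + b > 0 else -1

# toy linearly separable set (assumed, for illustration)
data = [((2.0, 1.0), 1), ((1.0, 3.0), 1), ((-1.0, -1.0), -1), ((-2.0, 1.0), -1)]
w, b = train_perceptron(data)
```

For linearly separable data this converges; the next slides ask what happens when we also restrict (and so control) the complexity of the set of lines.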

Statistical Learning Approach

What if we restrict the set of lines, i.e. the function space (and therefore control complexity)?

Statistical Learning Approach

H = { f(x) = w·x + b ;  ||w||^2 ≤ A, plus scaling }

Distance of a point x_0 from the decision boundary w·x + b = 0:

d(x_0; w, b) = |w·x_0 + b| / ||w||

Benefits of Statistical Learning

a) The problem becomes well-posed.

b) The solution has a smaller expected error.

Empirical Error vs Complexity

What if we further restrict complexity?

Benefits of Statistical Learning

Avoid overfitting.

(Important for high-dimensional data!)

Support Vector Machines (Vapnik, Cortes, 1995)

min_w  (1/l) Σ_i |1 - y_i (w·x_i)|_+  +  λ ||w||^2

The first term is the Empirical Error (the soft-margin loss: zero when x_i is classified correctly with margin, positive otherwise); the second is the Complexity.
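SVM training is a quadratic program (as a later slide notes); purely as an illustration, the soft-margin objective (1/l) Σ_i |1 - y_i (w·x_i)|_+ + λ||w||^2 can also be minimized by subgradient descent. The data, step size, and λ below are assumptions:

```python
def train_linear_svm(data, lam=0.01, lr=0.05, epochs=200):
    """Subgradient descent on (1/l)*sum_i hinge(1 - y_i*(w.x_i)) + lam*||w||^2."""
    w = [0.0, 0.0]
    l = len(data)
    for _ in range(epochs):
        grad = [2 * lam * w[0], 2 * lam * w[1]]  # gradient of the penalty
        for x, y in data:
            if y * (w[0] * x[0] + w[1] * x[1]) < 1:  # inside margin: hinge active
                grad[0] -= y * x[0] / l
                grad[1] -= y * x[1] / l
        w = [w[0] - lr * grad[0], w[1] - lr * grad[1]]
    return w

# toy separable data (assumed, for illustration; no bias term for simplicity)
data = [((2.0, 2.0), 1), ((1.0, 3.0), 1), ((-2.0, -1.0), -1), ((-1.0, -2.0), -1)]
w = train_linear_svm(data)
```

Unlike the plain perceptron, the penalty keeps ||w|| small, which pushes the solution toward a large-margin separator.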

Non-linear Function Spaces

Generally, f can be any "linear" function in some very complex feature space:

H = { f(x) = Σ_{n=1..N} w_n φ_n(x) ;  Σ_{n=1..N} w_n^2 ≤ A }

φ_n(x) : some complex feature
N : number of features (possibly very large)

Example: Second Order Features

x = (x_1, x_2)  ->  φ(x) = (x_1^2, x_2^2, x_1 x_2, x_1, x_2)

f(x) = w_1 x_1^2 + w_2 x_2^2 + w_3 x_1 x_2 + w_4 x_1 + w_5 x_2
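This feature map is easy to write down explicitly; a small sketch (the helper names are mine, not from the talk):

```python
def second_order_features(x1, x2):
    """Map (x1, x2) to the monomials of degree <= 2 (no constant term)."""
    return (x1 * x1, x2 * x2, x1 * x2, x1, x2)

def quadratic_f(w, x1, x2):
    """A linear function in feature space = a quadratic function of the input."""
    return sum(wi * zi for wi, zi in zip(w, second_order_features(x1, x2)))
```

For example, w = (0, 0, 1, 0, 0) picks out the x1*x2 coordinate, so quadratic_f is +1 on (1, 1) and -1 on (1, -1): a decision no linear function of (x1, x2) alone can make.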

Second Order Polynomials

Using more complex features (second-order features).

Reproducing Kernel Hilbert Space

RKHS: a space of linear functions in a feature space satisfying some conditions (functional analysis ...).

Examples of features φ_n(x): monomials such as x_n x_j, trigonometric features such as cos(n x), exponentially weighted features such as e^{-n} x^n, ...

Support Vector Machines: General

min_{f in H}  (1/l) Σ_i |1 - y_i f(x_i)|_+  +  λ ||f||^2

(Empirical Error + Complexity)

Training: Quadratic Programming.

Kernel Machines

min_{f in H}  (1/l) Σ_i V(y_i, f(x_i))  +  λ ||f||^2

(Empirical Error + Complexity)

Choices to make: the loss V, the function space H, and the regularization parameter λ.
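A kernel machine never needs the features φ_n explicitly: it evaluates f(x) = Σ_i α_i y_i K(x_i, x) through the kernel K alone. As an illustration (a kernel perceptron rather than the regularized quadratic program on this slide), a second-degree polynomial kernel separates the XOR pattern, which no linear f(x) = w·x can:

```python
def poly_kernel(x, z):
    """K(x, z) = (1 + x.z)^2: the inner product of second-order features."""
    return (1 + x[0] * z[0] + x[1] * z[1]) ** 2

def train_kernel_perceptron(data, kernel, epochs=10):
    """Perceptron in the kernel's feature space: f(x) = sum_i a_i y_i K(x_i, x)."""
    alpha = [0] * len(data)
    for _ in range(epochs):
        for j, (xj, yj) in enumerate(data):
            f = sum(a * y * kernel(x, xj) for a, (x, y) in zip(alpha, data))
            if yj * f <= 0:       # misclassified: strengthen example j
                alpha[j] += 1
    return alpha

def kernel_predict(data, alpha, kernel, x):
    f = sum(a * y * kernel(xi, x) for a, (xi, y) in zip(alpha, data))
    return 1 if f > 0 else -1

# XOR pattern: not linearly separable in the input space
xor = [((1, 1), -1), ((-1, -1), -1), ((1, -1), 1), ((-1, 1), 1)]
alpha = train_kernel_perceptron(xor, poly_kernel)
```

Only kernel evaluations appear; the five-dimensional second-order feature space is implicit, which is what makes very large (even infinite) N practical.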

Some Kernel Machines (Vapnik 1998, Evgeniou et al 1999)

With appropriate choices of the complex features and the loss function V we can get:

• Support Vector Machines (SVM)

• A type of multi-layer perceptrons

• A type of radial basis functions

• A type of spline models

• A type of additive models

• A type of ridge regression models

Kernel Machines Analysis (the difficult questions)

Does the empirical error of general kernel machines converge to the expected error?

What is the distance between the empirical and the expected error for these machines?

Are these machines well-posed?

Convergence of Kernel Machines

Theorem (Evgeniou, Pontil, Poggio, 1999). For any L_p loss function (i.e. V(y, f(x)) = |y - f(x)|^p) and for the SVM loss function, the learning machine with hypothesis space in an RKHS,

H = { f(x) = Σ_{n=1..N} w_n φ_n(x) ;  Σ_{n=1..N} w_n^2 ≤ A },

converges for any A (also for N infinite).

Implications of the Theorem

The empirical error converges to the expected one for:

Support Vector Machines

and for a type of:

Multi-layer Perceptrons (i.e. Neural Networks)

Radial Basis Functions

Spline models (i.e. piece-wise linear functions)

...

Bounds on Expected Error (Evgeniou, Pontil, 2000)

Furthermore, we can get bounds on the distance between the expected error and the empirical error "of the form":

R[f] ≤ R_emp[f] + O( sqrt(h / l) )

by measuring h, the complexity of sets in an RKHS (Evgeniou, Pontil, 1999).

Kernel Machines: Contributions

Does the empirical error of general kernel machines converge to the expected error? YES!

What is the distance between the empirical and the expected error for these machines? BOUNDS!

Are these machines well-posed? YES

Characteristics of Kernel Machines

• Automatic complexity control

• Guaranteed bounds on expected error

• Unique optimal solution

• Good with very high-dimensional data

• Little parameter tuning (V, RKHS, λ)

The Email Classification System

[Diagram: the labeled training emails are data-mined into the classifier]

... bought a piece of ... some broken part ...  ->  problem
... would like to return ... not satisfied with ...  ->  problem
... send a receipt ... previous payment ...  ->  account
... request a copy of the report ... balance of ...  ->  account

Page 59: CBCL MIT March 8, 2000 INSEAD Data Mining with Statistical Learning Theodoros Evgeniou Massachusetts Institute of Technology

CBCL MIT

March 8, 2000

INSEAD

The Email Classification SystemThe Email Classification System

Representation - high dimensional feature vectors

text system

…bought a piece of… some broken part… …would like to return… not satisfied with…  →  problem
…send a receipt… previous payment… …request a copy of the report… balance of…  →  account
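One common way to build such high-dimensional feature vectors is a bag-of-words count: each email becomes a vector of word frequencies over the corpus vocabulary. The sketch below uses invented emails; a real text system would typically add stemming and term weighting (e.g. tf-idf) on top.

```python
def bag_of_words(texts):
    """Map each text to a count vector over the corpus vocabulary."""
    vocab = sorted({w for t in texts for w in t.lower().split()})
    index = {w: i for i, w in enumerate(vocab)}
    vectors = []
    for t in texts:
        v = [0] * len(vocab)
        for w in t.lower().split():
            v[index[w]] += 1
        vectors.append(v)
    return vocab, vectors

emails = [
    "bought a broken part would like to return",   # a "problem" email
    "send a receipt for the previous payment",     # an "account" email
]
vocab, vectors = bag_of_words(emails)
```

Even on this toy corpus the vector dimension equals the vocabulary size, which is why text representations grow high-dimensional so quickly.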


The Image Mining System

Representation - high-dimensional feature vectors

image system


Image System Performance

Train size: 700 pedestrians, 6000 non-pedestrians
Test size: 224 pedestrians, 3000 non-pedestrians

[Bar chart: % detect, 50 to 100, comparing representations: Pixels, Wav.29a, Wav.29b, Wav.1326]

Collaboration with C. Papageorgiou and M. Pontil

Comparing representations


Comparing learning methods (29 wavelets)

[Bar chart: % correct, 50 to 100, comparing CART vs. SVM]

Collaboration with L. Perez-Breva

Image System Performance


[Bar chart: % correct, 50 to 100, comparing Bayes vs. SVM]

Some Text System Performance

Preliminary results on a 2-class newsgroups email classification problem (800 train data, 1200 test data)

Collaboration with R. Rifkin and C. Papageorgiou – in progress


[Bar chart: % correct, 0 to 100, comparing Chance, Bayes, and SVM]

Preliminary Multi-Class Text SystemPreliminary Multi-Class Text System

20 classes (multi-class) categorization: 20 classes (multi-class) categorization: How is it done?How is it done?

Collaboration with R. Rifkin and C. Papageorgiou – in progressCollaboration with R. Rifkin and C. Papageorgiou – in progress
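A standard answer to the slide's question is one-vs-rest: train one binary classifier per class and predict the class whose classifier scores highest. The sketch below uses invented keyword-count scorers as stand-ins for the trained binary classifiers; only the reduction from many classes to binary decisions is the point.

```python
def make_scorer(keywords):
    """Stand-in for one trained binary classifier: counts class keywords."""
    def score(text):
        words = text.lower().split()
        return sum(words.count(k) for k in keywords)
    return score

# One binary scorer per class (one-vs-rest); the keywords are invented.
scorers = {
    "car": make_scorer(["drive", "engine", "wheel"]),
    "commerce": make_scorer(["buy", "price", "payment"]),
}

def classify(text):
    # Predict the class whose one-vs-rest scorer responds most strongly.
    return max(scorers, key=lambda c: scorers[c](text))
```

With 20 classes the same scheme simply trains 20 binary classifiers and takes the argmax of their outputs.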


Examples of the Image System


Summary and Contributions

I. The importance of data mining
II. Text and image systems
- Choosing representations, feature selection
III. Statistical Learning: powerful tools
- Theoretical analysis of kernel learning machines
- Unified analysis of many “standard” methods
- Important conceptual and formal tools


Further Plans

• Choosing data representations
• Multi-class categorization
• Unsupervised learning
• Text / multimedia systems
• Web click analysis
• Intelligent agents
• E-Customer support
• Personalization
• Fraud / Trust control

THEORY  SYSTEMS  MARKET


Image System ROC Curves
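For context, an ROC curve like the ones this slide showed is traced by sweeping the classifier's decision threshold and recording the detection rate against the false-positive rate at each setting. The scores below are invented, purely to show the mechanics.

```python
def roc_points(scores_pos, scores_neg):
    """Sweep the threshold; return (false-positive rate, detection rate) pairs."""
    thresholds = sorted(set(scores_pos + scores_neg), reverse=True)
    points = []
    for t in thresholds:
        tpr = sum(s >= t for s in scores_pos) / len(scores_pos)  # detect rate
        fpr = sum(s >= t for s in scores_neg) / len(scores_neg)  # false positives
        points.append((fpr, tpr))
    return points

pos = [0.9, 0.8, 0.7, 0.3]   # invented scores for true pedestrians
neg = [0.6, 0.4, 0.2, 0.1]   # invented scores for non-pedestrians
curve = roc_points(pos, neg)
```

Lowering the threshold moves along the curve toward (1, 1): more detections at the cost of more false positives.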


SVM vs Neural Networks

SVM:
• Complexity control
• Quadratic programming
• Unique solution
• Few parameters to tune
• Guaranteed performance

Neural Networks:
• Empirical error control
• Difficult training
• Many local optima
• Often many parameters
• Asymptotic analysis


Convergence of Learning Machines

R_emp[f]: empirical error of f
R[f]: expected error of f
f(x) = Σ_{n=1}^{N} w_n φ_n(x), f ∈ H
ℓ: number of examples

lim_{ℓ→∞} P( sup_{f∈H} | R[f] − R_emp[f] | > ε ) = 0

lim_{ℓ→∞} min_{f∈H} R_emp[f] = min_{f∈H} R[f]
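The convergence of empirical toward expected error can be illustrated numerically for a single fixed f (for one function it is just the law of large numbers; the slide's statement is stronger, holding uniformly over H). The classifier, true rule, and distribution below are all invented for the illustration.

```python
import random

random.seed(0)

def f(x):
    """A fixed toy classifier on [0, 1]."""
    return 1 if x > 0.5 else -1

def true_label(x):
    """The true rule; f disagrees with it exactly on (0.5, 0.6]."""
    return 1 if x > 0.6 else -1

def empirical_error(n):
    """R_emp[f] computed on n uniform random examples."""
    xs = [random.random() for _ in range(n)]
    return sum(f(x) != true_label(x) for x in xs) / n

# The expected error is R[f] = P(0.5 < x <= 0.6) = 0.1 exactly.
small_sample = empirical_error(100)
large_sample = empirical_error(100_000)
```

As the number of examples ℓ grows, R_emp[f] concentrates around R[f] = 0.1, which is the distance the bounds above quantify.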


Statistical Learning Theory

Learning from examples = given examples of an input/output relation, find a function f such that

output = f (input)

Developed mainly by Vapnik and Chervonenkis in the late 60’s, 70’s, 80’s, 90’s, …

INPUT → Function → OUTPUT


Data Mining: Driving Forces

• 1 trillion emails, 5 trillion by 2003 (IDC)
• Yahoo! collects 400 Gbytes/day of web-click info (Business Week)
• $150 billion e-commerce, $1.3 trillion by 2003 (IDC)
• 1 billion web pages, with 50% increase/year (IDC)
• 100s of millions of digital images, audio, video, …


Driving Forces

[Chart, 1997 to 2005: growth of data (emails, multimedia, e-commerce) vs. memory, computation, and data tools]


Further Studies

Bounds on the expected risk of kernel machines are developed (Evgeniou, Pontil, 2000).

Connections with other learning methods are made (Evgeniou, Pontil, Poggio, 1999).

Combinations of learning machines (e.g. many machines each using different information) are studied (Evgeniou, Perez-Breva, Pontil, Poggio, 2000).


Text Mining

How can we reliably decide if a text is about a particular topic?

…drive…used…broke…send…

CAR  COMPUTER  TRAVEL  COMMERCE


What is Data Mining?

Data Mine → Data Money