Page 1:

Support Vector Machines and

Predictive Data Modeling

Electrical and Computer Engineering

Vladimir Cherkassky University of Minnesota

cherk001@umn.edu

Presented at Tech Tune Ups, ECE Dept, June 1, 2011

Page 2:

Acknowledgements

Research on Predictive Learning supported by:
• NSF grant ECCS-0802056
• The A. Richard Newton Breakthrough Research Award from Microsoft Research

Joint work with grad students F. Cai & S. Dhar

Parts of this presentation are from the books:
• Introduction to Predictive Learning, by Cherkassky and Ma, Springer 2011
• Learning from Data, by Cherkassky and Mulier, Wiley 2007

Page 3:

OUTLINE

Introduction + Motivation

4 parts of this course:
• Philosophy, induction and predictive data modeling
• Support vector machines (SVM)
• SVM practical issues and applications
• Advanced SVM-based learning technologies

Page 4:

Motivation 1

Two critical points:
(1) Humans cannot reason about uncertainty in a rational way (examples)
(2) Humans and animals have excellent biological capabilities to cope with uncertainty and risk (examples)

Page 5:

Motivation 2

• Growth of data in the digital age
• Is it possible to extract knowledge from this data? - philosophical and cultural implications
• How to extract knowledge from data? - business and technological aspects
• Is this a natural domain of statistics?

Page 6:

Motivation 3: biological learning

• Rosenblatt's Perceptron (early 1960s) - an early attempt to simulate biological learning (a simple learning algorithm for a linear classifier)
• Young scientists in Moscow tried to understand the generalization properties of such 'machines' and developed a new statistical learning theory

Page 7:

Motivation 4: why SVM?

• Support Vector Machines
  - developed in the USSR in the mid-1960s
  - later introduced in the West in the mid-1990s
  - currently the most widely used method for modeling high-dimensional data
  - based on a new mathematical theory different from classical statistics
• VC-theory also provides a philosophical framework for 'learning from data'
• This new predictive modeling methodology is still poorly understood

Page 8:

PART 1: Philosophy, induction and predictive data modeling

• Understanding uncertainty and risk
• Induction and knowledge discovery
• Philosophy and statistical learning
• Predictive learning approach
• Introduction to VC-theory

Page 9:

Understanding Uncertainty

• Humans tend to avoid uncertainty and try to explain unpredictable events.
  Aristotle: "All men by nature desire knowledge."
• Learning ~ discovering regularities from data
• Ancient cultures, e.g. the Ancient Greeks, had no formal concepts related to randomness: unpredictable events (wars, natural disasters, etc.) were thought to be controlled by gods or Fate.
• In modern society, religion has been replaced by science and pseudo-science

Page 10:

Gods, Prophets and Shamans

Page 11:

Science and Uncertainty

• Math, logic and science are about certainty ~ deterministic rules
• Probability and empirical data involve uncertainty ~ inferior knowledge

This view dominates modern science, i.e.:

• True scientific knowledge consists of deterministic Laws of Nature

• There is a (true, causal) model explaining a given natural phenomenon (e.g., a disease)

Page 12:

Causal Determinism in Science

• Popular view of science
  - deterministic rules (laws of Nature)
  - reflects objective reality (single truth)
  - knowledge inferred from (observed) data

• Digital technology enables the growth of data
  Can expect rapid growth of knowledge by applying (statistical, data mining, etc.) algorithms to this data

• Reality is more sobering (as usual)

Page 13:

Popular Hype: the data deluge makes the scientific method obsolete

• Wired Magazine, 16/07: "We can stop looking for (scientific) models. We can analyze the data without hypotheses about what it might show. We can throw the numbers into the biggest computing clusters the world has ever seen and let statistical algorithms find patterns where science cannot."

• Early detection of cancer (or other diseases): massive data analysis of cancer samples in order to identify unique proteins for tens of thousands of types of cancer. The goal is that (in the future) we can all be screened for these proteins as early warning signals for cancer.

Page 14:

REALITY

• Many studies have questionable value
  - statistical correlation vs. causation
• Some border on stupidity / pseudoscience
  - US scientists at SUNY discovered an Adultery Gene!!! (based on a sample of 181 volunteers interviewed about their sexual life)
• Usual conclusion
  - more research is needed ...

Page 15:

Some Views on Science

• Karl Popper: Science starts from problems, and not from observations

• Werner Heisenberg: What we observe is not nature itself, but nature exposed to our method of questioning

• Albert Einstein: Reality is merely an illusion, albeit a very persistent one.

Page 16:

Scientific Discovery

• Always involves ideas (models) and facts (data)

• Classical first-principle knowledge:
  hypothesis → data → scientific theory
  Note: deterministic, simple models

• Modern data-driven discovery:
  computer program + DATA → knowledge
  Note: statistical, complex systems

• Two philosophies, poorly understood

Page 17:

COMPLEX SYSTEMS

• A. Einstein: "When the number of factors coming into play in a phenomenological complex is too large, scientific method in most cases fails us."
  Example: weather prediction

• Does digital technology make Einstein's claim obsolete?

Page 18:

Examples of Complex Systems

• Life sciences
• Healthcare
• Climate modeling
• Social systems (e.g. financial markets)

Attempts to understand and model such systems using a deterministic approach usually fail.

Page 19:

Problem of Induction in Philosophy

• Francis Bacon: advocated empirical (inductive) knowledge vs. scholastic knowledge
• David Hume: What right do we have to assume that the future will be like the past?
• Philosophy of Science tries to resolve this dilemma/contradiction between deterministic logic and the uncertain nature of empirical data.
• Digital Age: with the growth of empirical data, this dilemma becomes important in practice.

Page 20:

What is 'a good model'?

• All models are mental constructs that (hopefully) relate to the real world
• Two goals of data-driven modeling:
  - explain available data
  - predict future data
• All good (scientific) models make non-trivial predictions

Good data-driven models can predict well, so the goal is to estimate predictive models.

Page 21:

Three Types of Knowledge

• Growing role of empirical knowledge
• Classical philosophy of science differentiates only between (first-principle) science and beliefs (the demarcation problem)
• Importance of demarcation between empirical knowledge and beliefs in applications

Page 22:

Examples of Nonscientific Beliefs

• Aristotle's science
  - everything is a mix of 4 basic elements: earth, water, air and fire
• Geocentric system of the world
• Origin of life (spontaneous generation)
  - disproved by L. Pasteur in the 19th century
• Modern belief: every medical condition can be traced to genetic variations
  - is it a popular belief or a scientific theory?

Page 23:

Popper's Demarcation Principle

Karl Popper: Every true (inductive) theory prohibits certain events or occurrences, i.e. it should be falsifiable.

• First-principle scientific theories vs. beliefs or metaphysical theories
• Risky prediction, testability, falsifiability

Page 24:

Popper's conditions for a scientific hypothesis

- Should be testable
- Should be falsifiable

Example 1: Efficient Market Hypothesis (EMH) - the prices of securities reflect all known information that impacts their value

Example 2: We do not see our noses, because they all live on the Moon

Page 25:

Predictive Learning: Formalization

Given: data samples ~ training data (x, y)
Estimate: a model, or function, f(x) that
- explains this data, and
- can predict future data

Classification problem: estimate a (binary) class label y from inputs x

Learning ~ function estimation

Page 26:

Application Example: predicting gender of face images

• Training data: labeled face images
  [Figure: rows of example face images labeled Male and Female]

Page 27:

Predicting Gender of Face Images

• Input ~ 32x32 pixel image
• Model ~ indicator function f(x) separating the 1024-dimensional pixel space into two halves
• Model should predict well on new images
• A difficult machine learning problem, but easy for human recognition
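To make this setup concrete, here is a minimal Python sketch (not from the slides) of this kind of model: 32x32 images flattened into 1024-dimensional vectors and separated by a linear classifier. The data here is randomly generated as a stand-in for real labeled face images, and scikit-learn's LogisticRegression is just one possible choice of linear indicator function.

# Minimal sketch (assumed setup, not the author's code): gender classification
# from 32x32 grayscale images using a linear model on raw pixels.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Hypothetical data: 200 images of shape 32x32 with binary labels (+1 / -1).
# In practice these would be real labeled face images.
images = rng.normal(size=(200, 32, 32))
labels = rng.choice([-1, 1], size=200)

# Each image becomes a 1024-dimensional input vector x.
X = images.reshape(len(images), -1)          # shape (200, 1024)

X_train, X_test, y_train, y_test = train_test_split(X, labels, test_size=0.3, random_state=0)

# A linear decision boundary splits the 1024-dimensional pixel space into two halves.
clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# What matters is prediction on new (test) images, not the fit on training data.
print("training accuracy:", clf.score(X_train, y_train))
print("test accuracy:    ", clf.score(X_test, y_test))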

Page 28:

Learning ~ Reliable Induction

Induction ~ function estimation from data
Deduction ~ prediction for new (test) inputs

Page 29:

Common Learning Problems

• Classification
• Regression

Note: explanation does not ensure prediction

Page 30:

Common Learning Problems

• Unsupervised learning (e.g., clustering)

Note: many other types of problems exist.
All such problems ~ inductive learning setting

Page 31:

Generalization and Complexity Control

Consider regression estimation:
• Ten training samples generated as $y = x^2 + \xi$, $\xi \sim N(0, \sigma^2)$, where $\sigma^2 = 0.25$
• Fitting linear and second-order polynomial models
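A minimal sketch of this experiment, assuming the data-generating formula as reconstructed above (y = x^2 plus Gaussian noise); the sampling range and seed are illustrative choices, and exact numbers will vary.

# Minimal sketch (assumed setup): 10 noisy samples from y = x^2 + noise,
# fitted with first- and second-order polynomials.
import numpy as np

rng = np.random.default_rng(0)

x = rng.uniform(0, 1, size=10)
y = x**2 + rng.normal(0, 0.5, size=10)      # sigma^2 = 0.25 -> sigma = 0.5

for degree in (1, 2):
    coeffs = np.polyfit(x, y, deg=degree)   # least-squares polynomial fit
    y_hat = np.polyval(coeffs, x)
    mse = np.mean((y - y_hat) ** 2)
    print(f"degree {degree}: training MSE = {mse:.4f}")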

Page 32:

Complexity Control (cont'd)

The same data set:
• Using k-nn regression with k=1 and k=4

Generalization depends on model complexity.
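A minimal sketch of the same comparison using scikit-learn's KNeighborsRegressor; the data-generating process is assumed to match the previous slide, and the separate test sample is added here only to make the point measurable: with k=1 the training error is zero but the test error is larger.

# Minimal sketch (assumed setup): k-nearest-neighbor regression comparing a
# flexible model (k=1) with a smoother one (k=4).
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(0)

x = rng.uniform(0, 1, size=10).reshape(-1, 1)
y = x.ravel()**2 + rng.normal(0, 0.5, size=10)

x_test = rng.uniform(0, 1, size=100).reshape(-1, 1)
y_test = x_test.ravel()**2 + rng.normal(0, 0.5, size=100)

for k in (1, 4):
    knn = KNeighborsRegressor(n_neighbors=k).fit(x, y)
    train_mse = np.mean((y - knn.predict(x)) ** 2)
    test_mse = np.mean((y_test - knn.predict(x_test)) ** 2)
    print(f"k={k}: training MSE = {train_mse:.3f}, test MSE = {test_mse:.3f}")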

Page 33:

Complexity Control: issues

• Theoretical + conceptual
  - how to define model complexity
• Practical 1
  - high-dimensional data
• Practical 2
  - the true model is not known → resampling for choosing optimal complexity

Model selection ~ choosing optimal model complexity

Page 34:

Resampling

• Split the available data into two sets: Training + Validation
  (1) Use the training set for model estimation (via data fitting)
  (2) Use the validation data to estimate the 'prediction' error of the model
• Change the model complexity index and repeat (1) and (2)
• Select the final model providing the lowest (estimated) prediction error

BUT the results are sensitive to data splitting (see the sketch below).
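A minimal sketch of this resampling procedure, assuming a synthetic regression problem with polynomial degree as the complexity index; the target function, split ratio, and variable names are illustrative choices, not from the slides.

# Minimal sketch (assumed setup): a single training/validation split used to
# compare models of increasing complexity and select the best one.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
x = rng.uniform(0, 1, size=50).reshape(-1, 1)
y = np.sin(2 * np.pi * x.ravel()) ** 2 + rng.normal(0, 0.3, size=50)

# Steps (1)+(2): fit on the training part, estimate prediction error on validation.
x_tr, x_val, y_tr, y_val = train_test_split(x, y, test_size=0.3, random_state=0)

best_degree, best_err = None, np.inf
for degree in range(1, 11):                 # complexity index = polynomial degree
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(x_tr, y_tr)
    val_err = np.mean((y_val - model.predict(x_val)) ** 2)
    if val_err < best_err:
        best_degree, best_err = degree, val_err

print("selected degree:", best_degree, "validation MSE:", round(best_err, 4))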

Page 35:

K-fold cross-validation

1. Divide the training data Z into k (randomly selected) disjoint subsets {Z1, Z2, ..., Zk} of size n/k
2. For each 'left-out' validation set Zi:
   - use the remaining data to estimate the model $\hat{y} = \hat{f}(\mathbf{x})$
   - estimate the prediction error on Zi: $r_i = \frac{k}{n} \sum_{(\mathbf{x}_j, y_j) \in Z_i} \left( \hat{f}(\mathbf{x}_j) - y_j \right)^2$
3. Estimate the average prediction risk as $R_{cv} = \frac{1}{k} \sum_{i=1}^{k} r_i$
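A minimal NumPy sketch of the procedure above; the helper name kfold_cv_risk and the example least-squares linear model are illustrative assumptions, not from the slides.

# Minimal sketch (assumed setup) of k-fold cross-validation, written with
# plain NumPy so each step of the procedure above is explicit.
import numpy as np

def kfold_cv_risk(X, y, fit, predict, k=5, seed=0):
    """Estimate R_cv = (1/k) * sum_i r_i for a given fit/predict pair."""
    n = len(y)
    idx = np.random.default_rng(seed).permutation(n)
    folds = np.array_split(idx, k)                # k disjoint subsets Z1..Zk
    r = []
    for i in range(k):
        val = folds[i]                            # left-out validation set Zi
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        model = fit(X[train], y[train])           # estimate the model on the remaining data
        y_hat = predict(model, X[val])
        r.append(np.mean((y[val] - y_hat) ** 2))  # prediction error on Zi (mean over the fold)
    return float(np.mean(r))                      # average prediction risk R_cv

# Example usage with a simple least-squares linear model on hypothetical data.
rng = np.random.default_rng(1)
X = rng.uniform(0, 1, size=(50, 1))
y = 2.0 * X.ravel() + rng.normal(0, 0.2, size=50)

fit = lambda X, y: np.linalg.lstsq(np.c_[np.ones(len(X)), X], y, rcond=None)[0]
predict = lambda w, X: np.c_[np.ones(len(X)), X] @ w

print("estimated R_cv:", round(kfold_cv_risk(X, y, fit, predict, k=5), 4))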

Page 36:

Example of model selection

• 25 samples are generated as $y = \sin^2(2\pi x) + \xi$, with x uniformly sampled in [0, 1] and noise $\xi \sim N(0, 1)$
• Regression is estimated using polynomials of degree m = 1, 2, ..., 10
• Polynomial degree m = 5 is chosen via 5-fold cross-validation. The plot (omitted here) shows the selected polynomial model along with the training and validation data points for one partitioning.

  m    Estimated R via cross-validation
  1    0.1340
  2    0.1356
  3    0.1452
  4    0.1286
  5    0.0699
  6    0.1130
  7    0.1892
  8    0.3528
  9    0.3596
  10   0.4006
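A minimal sketch reproducing the flavor of this example with scikit-learn, assuming the target function as reconstructed above; since the original 25 samples are not available, the estimated risks and the selected degree will not match the table exactly.

# Minimal sketch (assumed setup): 25 noisy samples of sin^2(2*pi*x),
# polynomial degrees m = 1..10 compared by 5-fold cross-validation.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
x = rng.uniform(0, 1, size=25).reshape(-1, 1)
y = np.sin(2 * np.pi * x.ravel()) ** 2 + rng.normal(0, 1.0, size=25)

risks = {}
for m in range(1, 11):
    model = make_pipeline(PolynomialFeatures(degree=m), LinearRegression())
    # cross_val_score returns negative MSE, so flip the sign to get an estimated risk
    scores = cross_val_score(model, x, y, cv=5, scoring="neg_mean_squared_error")
    risks[m] = -scores.mean()

best_m = min(risks, key=risks.get)
print("estimated CV risk per degree:", {m: round(r, 3) for m, r in risks.items()})
print("selected degree m =", best_m)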

Page 37:

Statistical vs Predictive Approach

• Binary classification problem: estimate a decision boundary from training data $(\mathbf{x}_i, y_i)$, where y ~ binary class label (-1/+1)
• Assuming the distribution P(x, y) is known:

[Figure: training samples plotted in the (x1, x2) space]

Page 38:

Classical Statistical Approach

(1) The parametric form of the unknown distribution P(x, y) is known
(2) Estimate the parameters of P(x, y) from the training data
(3) Construct the decision boundary using the estimated distribution and the given misclassification costs

Modeling assumption: the distribution P(x, y) can be accurately estimated from the available data.

[Figure: estimated decision boundary in the (x1, x2) space]

Page 39:

Predictive Approach

(1) The parametric form of the decision boundary f(x, w) is given
(2) Explain the available data by fitting f(x, w), i.e. by minimizing some loss function (e.g., squared error)
(3) The function f(x, w*) providing the smallest fitting error is then used for prediction

Modeling assumptions:
- Need to specify f(x, w) and the loss function a priori
- No need to estimate P(x, y)

[Figure: estimated decision boundary in the (x1, x2) space]

Page 40:

Two Different Methodologies

• System Identification (~ classical statistics)
  - estimate a probabilistic model (class densities) from the available data
  - use this model to make predictions
• System Imitation (~ biological learning)
  - need only predict well, i.e. imitate a specific aspect of the unknown system
  - multiplicity of good models
  - can they be interpreted and/or trusted?
• Which approach works for high-dimensional data?

Page 41:

Classification with High-Dimensional Data

• Digit recognition, 5 vs 8: each example ~ 32 x 32 pixel image → 1,024-dimensional vector x
• Medical analogy:
  - each pixel ~ genetic marker
  - each patient (sample) is described by 1,024 genetic markers
  - two classes ~ presence/absence of a disease
• Estimation of P(x, y) with finite data is not possible
• Accurate estimation of the decision boundary in 1,024-dimensional space is possible, using just a few hundred samples

Page 42:

Statistical vs Predictive: Discussion

• Classical statistics has modeling goals:
  - an interpretable model explaining the data
  - few important input variables (risk factors)
  - prediction performance is not verified but (usually) assumed - why?
• Predictive modeling has different goals:
  - prediction (generalization) is the main goal
  - prediction accuracy is measured/reported
  - model interpretation is not important, as it cannot be objectively evaluated

Page 43:

PART 1: Philosophy, induction and predictive data modeling

• Understanding uncertainty and risk
• Induction and knowledge discovery
• Philosophy and statistical learning
• Predictive learning approach
• Introduction to VC-theory

Page 44:

Empirical Risk Minimization

• ERM principle for learning
  - Model parameterization: f(x, w)
  - Loss function: L(f(x, w), y)
  - Estimate the risk from data: $R_{emp}(\mathbf{w}) = \frac{1}{n} \sum_{i=1}^{n} L(f(\mathbf{x}_i, \mathbf{w}), y_i)$
  - Choose w* that minimizes $R_{emp}$, so that the model f(x, w*) explains past data (see the sketch below)
• ERM principle ~ biological approach
• Statistical Learning Theory (aka VC-theory) asks: under what conditions will ERM-style models generalize (predict) well?
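A minimal sketch of ERM for a linear classifier, as referenced above: the empirical risk (here the average logistic loss, one possible choice of L) is minimized by plain gradient descent. The synthetic data and the choice of loss are illustrative assumptions, not the author's.

# Minimal sketch (assumed setup, not the author's code) of ERM:
# choose w* minimizing R_emp(w) = (1/n) * sum_i L(f(x_i, w), y_i).
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 2-D training data with labels y in {-1, +1}.
n = 200
X = rng.normal(size=(n, 2))
y = np.sign(X[:, 0] + X[:, 1])            # labels from a simple linear rule
y[y == 0] = 1
X = np.c_[X, np.ones(n)]                  # absorb the bias term into w

def empirical_risk(w):
    # Average logistic loss over the training set.
    margins = y * (X @ w)
    return float(np.mean(np.logaddexp(0.0, -margins)))

def risk_gradient(w):
    margins = y * (X @ w)
    return -(X.T @ (y / (1.0 + np.exp(margins)))) / n

w = np.zeros(3)
for _ in range(500):                      # plain gradient descent on R_emp
    w -= 0.5 * risk_gradient(w)

print("R_emp(w*):", round(empirical_risk(w), 4))
print("training accuracy:", float(np.mean(np.sign(X @ w) == y)))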

Page 45:

Inductive Learning Setting

• The learning machine observes samples (x, y) and returns an estimated response $\hat{y} = f(\mathbf{x}, \mathbf{w})$
• Recall 'first-principles' vs 'empirical' knowledge
• Two types of inference: identification vs imitation
• Risk: $R(\mathbf{w}) = \int \mathrm{Loss}(y, f(\mathbf{x}, \mathbf{w})) \, dP(\mathbf{x}, y) \to \min$

[Diagram: a Generator of samples supplies inputs x; the System produces outputs y; the Learning Machine observes (x, y) and returns the estimate $\hat{y} = f(\mathbf{x}, \mathbf{w})$]

Page 46:

VC-theory basics - 1

Goals of Predictive Learning:
- explain (or fit) the available training data
- predict well future (yet unobserved) data
Similar to biological learning.

Example: given 1, 3, 7, ... predict the rest of the sequence.
Rule 1: $x_{k+1} = 2x_k + 1$
Rule 2: randomly chosen odd numbers
Rule 3: $x_k = k^2 - k + 1$

BUT for the sequence 1, 3, 7, 15, 31, 63, ..., Rule 1 seems very reliable (why?)
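A short sketch, assuming the rules as reconstructed above, showing that Rule 1 and Rule 3 both explain the first three terms 1, 3, 7 but then make different predictions.

# Minimal sketch (rules as reconstructed above): compare the two deterministic rules.
def rule1(k_max):
    seq = [1]
    while len(seq) < k_max:
        seq.append(2 * seq[-1] + 1)                          # x_{k+1} = 2*x_k + 1
    return seq

def rule3(k_max):
    return [k * k - k + 1 for k in range(1, k_max + 1)]      # x_k = k^2 - k + 1

print("Rule 1:", rule1(6))   # [1, 3, 7, 15, 31, 63]
print("Rule 3:", rule3(6))   # [1, 3, 7, 13, 21, 31]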

Page 47:

VC-theory basics - 2

Main practical result of VC-theory: if a model explains past data well AND is simple, then it can predict well.
• This explains why Rule 1 is a good model for the sequence 1, 3, 7, 15, 31, 63, ...
• Measure of model complexity ~ VC-dimension: Rule 1 can explain the past data 1, 3, 7, 15, 31, 63 BUT cannot explain all other possible sequences, hence low VC-dimension (~ large falsifiability)
• For linear models, VC-dimension = DoF (as in statistics)
• But for nonlinear models they are different

Page 48:

VC-theory basics - 3

Strategy for modeling high-dimensional data: find a model f(x) that explains the past data AND has low VC-dimension, even when the dimensionality is large.

SVM approach: large margin = low VC-dimension ~ easy to falsify (see the sketch below)
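A minimal sketch (illustrative, not from the slides) of the large-margin idea: a linear SVM fitted with scikit-learn on separable synthetic data, reporting the geometric margin 2/||w|| of the fitted hyperplane. The cluster locations and the value of C are arbitrary assumptions.

# Minimal sketch (assumed setup): linear SVM on synthetic 2-D data.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Two linearly separable clusters with labels -1 / +1.
X = np.vstack([rng.normal(loc=-2.0, size=(50, 2)),
               rng.normal(loc=+2.0, size=(50, 2))])
y = np.array([-1] * 50 + [+1] * 50)

svm = SVC(kernel="linear", C=10.0).fit(X, y)

w = svm.coef_[0]
margin = 2.0 / np.linalg.norm(w)              # margin width of the separating hyperplane
print("number of support vectors:", len(svm.support_vectors_))
print("margin width:", round(margin, 3))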

Page 49:

SUMMARY & DISCUSSION

• Predictive data modeling:
  - training data is similar to future (test) data
  - performance index / loss function
  - the predictive methodology is different from classical statistics
  - there may not be a single true model
  - 'conventional' model interpretation is hard
• Understanding of uncertainty and risk:
  - changing due to technological advances
  - cultural and ethical issues