
Support Vector Machines and Predictive Data Modeling

Vladimir Cherkassky
Electrical and Computer Engineering, University of Minnesota

cherk001@umn.edu

Presented at Tech Tune Ups, ECE Dept, June 1, 2011

Acknowledgements

Research on Predictive Learning supported by
• NSF grant ECCS-0802056
• The A. Richard Newton Breakthrough Research Award from Microsoft Research

Joint work with grad students F. Cai & S. Dhar

Parts of this presentation are from the books
Introduction to Predictive Learning, by Cherkassky and Ma, Springer 2011
Learning from Data, by Cherkassky and Mulier, Wiley 2007

OUTLINE

Introduction + Motivation

Four parts of this course:
• Philosophy, induction and predictive data modeling
• Support vector machines (SVM)
• SVM practical issues and applications
• Advanced SVM-based learning technologies

Motivation 1

Two critical points:
(1) Humans cannot reason about uncertainty in a rational way (examples)
(2) Humans and animals have excellent biological capabilities to cope with uncertainty and risk (examples)

Motivation 2

• Growth of data in the digital age
• Is it possible to extract knowledge from this data? – philosophical and cultural implications
• How to extract knowledge from data? – business and technological aspects
• Is this a natural domain of statistics?

Motivation 3: biological learning

• Rosenblatt's Perceptron (early 1960's)
- an early attempt to simulate biological learning (simple learning algorithm for a linear classifier)
• Young scientists in Moscow tried to understand the generalization properties of such 'machines' and developed a new statistical learning theory

Motivation 4: why SVM?

• Support Vector Machines
- developed in the USSR in the mid-1960's
- later introduced in the West in the mid-1990's
- currently the most widely used method for modeling high-dimensional data
- based on a new mathematical theory different from classical statistics
• VC-theory also provides a philosophical framework for 'learning from data'
• This new predictive modeling methodology is still poorly understood

PART 1: Philosophy, induction and predictive data modeling

• Understanding uncertainty and risk
• Induction and knowledge discovery
• Philosophy and statistical learning
• Predictive learning approach
• Introduction to VC-theory

Understanding Uncertainty

• Humans tend to avoid uncertainty, and try to explain unpredictable events. Aristotle: All men by nature desire knowledge
• Learning ~ discovering regularities from data
• Ancient cultures, e.g. the Ancient Greeks, had no formal concepts related to randomness: unpredictable events (wars, natural disasters etc.) were thought to be controlled by Gods or Fate.
• In modern society, religion has been replaced by science and pseudo-science


Gods, Prophets and Shamans

Science and Uncertainty

• Math, Logic and Science are about certainty ~ deterministic rules
• Probability and empirical data involve uncertainty ~ inferior knowledge
This view dominates modern science, i.e.:
• True scientific knowledge consists of deterministic Laws of Nature
• There is a (true, causal) model explaining a given natural phenomenon (e.g., a disease)

Causal Determinism in Science

• Popular view of science
- deterministic rules (laws of Nature)
- reflects objective reality (single truth)
- knowledge inferred from (observed) data
• Digital technology enables growth of data
→ can expect rapid growth of knowledge by applying (statistical, data mining etc.) algorithms to this data
• Reality is more sobering (as usual)


Popular Hype: the data deluge makes scientific method obsolete

• Wired Magazine, 16/07: We can stop looking for (scientific) models. We can analyze the data without hypotheses about what it might show. We can throw the numbers into the biggest computing clusters the world has ever seen and let statistical algorithms find patterns where science cannot.

• Early Detection of Cancer (or other diseases):Massive data analysis of cancer samples in order to identify unique proteins for tens of thousands of types of cancer. The goal is that (in the future) we can all be screened for these proteins as early warning signals for cancer.

REALITY

• Many studies have questionable value
- statistical correlation vs causation
• Some border on stupidity / pseudoscience
- US scientists at SUNY discovered an Adultery Gene!!! (based on a sample of 181 volunteers interviewed about their sexual life)
• Usual conclusion: more research is needed …


Some Views on Science

• Karl Popper: Science starts from problems, and not from observations

• Werner Heisenberg: What we observe is not nature itself, but nature exposed to our method of questioning

• Albert Einstein: Reality is merely an illusion, albeit a very persistent one.


Scientific Discovery

• Always involves ideas (models) and facts (data)
• Classical first-principle knowledge:
hypothesis → data → scientific theory
Note: deterministic, simple models
• Modern data-driven discovery:
computer program + DATA → knowledge
Note: statistical, complex systems
• Two philosophies, poorly understood

COMPLEX SYSTEMS

• A. Einstein: When the number of factors coming into play in a phenomenological complex is too large, scientific method in most cases fails us.

Example: weather prediction

• Does digital technology make Einstein’s claim obsolete?


Examples of Complex Systems

• Life Sciences

• Healthcare

• Climate modeling
• Social Systems (e.g. financial markets)

Attempts to understand and model such systems using a deterministic approach usually fail

Problem of Induction in Philosophy

• Francis Bacon: advocated empirical (inductive) knowledge vs scholastic knowledge
• David Hume: What right do we have to assume that the future will be like the past?
• Philosophy of Science tries to resolve this dilemma/contradiction between deterministic logic and the uncertain nature of empirical data.
• Digital Age: with the growth of empirical data, this dilemma becomes important in practice.

What is 'a good model'?

• All models are mental constructs that (hopefully) relate to the real world
• Two goals of data-driven modeling:
- explain available data
- predict future data
• All good (scientific) models make non-trivial predictions

Good data-driven models can predict well, so the goal is to estimate predictive models

Three Types of Knowledge

• Growing role of empirical knowledge
• Classical philosophy of science differentiates only between (first-principle) science and beliefs (demarcation problem)
• Importance of demarcation between empirical knowledge and beliefs in applications

Examples of Nonscientific Beliefs

• Aristotle's science
- everything is a mix of 4 basic elements: earth, water, air and fire

• Geocentric system of the world

• Origin of life (spontaneous generation)

- disproved by L. Pasteur in 19th century

• Modern belief: every medical condition can be traced to genetic variations
- is it a popular belief or a scientific theory?

Popper's Demarcation Principle

Karl Popper: Every true (inductive) theory prohibits certain events or occurrences, i.e. it should be falsifiable

• First-principle scientific theories vs beliefs or metaphysical theories
• Risky prediction, testability, falsifiability

Popper's conditions for a scientific hypothesis

- Should be testable
- Should be falsifiable

Example 1: Efficient Market Hypothesis (EMH): the prices of securities reflect all known information that impacts their value

Example 2: We do not see our noses, because they all live on the Moon

Predictive Learning: Formalization

Given: data samples ~ training data (x, y)
Estimate: a model, or function, f(x) that
- explains this data and
- can predict future data

Classification problem: estimate a function f(x) predicting a binary class label y

Learning ~ function estimation

Application Example: predicting gender of face images

• Training data: labeled face images
- Male: [example images]
- Female: [example images]

Predicting Gender of Face Images

• Input ~ 32x32 pixel image
• Model ~ indicator function f(x) separating 1024-dimensional pixel space into two halves
• Model should predict well on new images
• Difficult machine learning problem, but easy for human recognition

Learning ~ Reliable Induction

Induction ~ function estimation from data
Deduction ~ prediction for new (test) inputs

Common Learning Problems

• Classification
• Regression

Note: explanation does not ensure prediction

Common Learning Problems (cont'd)

• Unsupervised learning (e.g., clustering)

Note: many other types of problems exist.
All such problems ~ inductive learning setting

Generalization and Complexity Control

Consider regression estimation:
• Ten training samples generated as y = x² + ξ, where noise ξ ~ N(0, σ²), σ = 0.25
• Fitting linear and 2nd-order polynomial models

Complexity Control (cont'd)

The same data set:
• Using k-nn regression with k=1 and k=4

Generalization depends on model complexity
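The k-nn regression idea above can be sketched in a few lines of plain Python (an illustrative example, not from the slides; the toy data set and function names are made up):

```python
# Illustrative sketch: k-nearest-neighbor regression, where the prediction
# at x is the average of the k training responses whose inputs are closest
# to x. Small k ~ high model complexity (k=1 reproduces the training data
# exactly); larger k ~ smoother, simpler model.

def knn_regress(x, data, k):
    """Predict y at point x by averaging the k nearest training samples."""
    neighbors = sorted(data, key=lambda pt: abs(pt[0] - x))[:k]
    return sum(y for _, y in neighbors) / k

# Toy training set (x, y); y roughly follows y = x**2 plus noise.
train = [(0.0, 0.1), (0.2, 0.0), (0.4, 0.2), (0.6, 0.3), (0.8, 0.7), (1.0, 0.9)]

print(knn_regress(0.4, train, k=1))   # 0.2 -- memorizes the noisy sample
print(knn_regress(0.4, train, k=4))   # averages the 4 nearest responses
```

With k=1 the model interpolates every noisy training point; with k=4 the local noise is averaged out, which is exactly the complexity trade-off the slide refers to.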

Complexity Control: issues

• Theoretical + conceptual
- how to define model complexity
• Practical 1: high-dimensional data
• Practical 2: true model is not known → resampling for choosing optimal complexity

Model selection ~ choosing optimal model complexity

Resampling

• Split available data into 2 sets: Training + Validation
(1) Use the training set for model estimation (via data fitting)
(2) Use the validation data to estimate the 'prediction' error of the model
• Change the model complexity index and repeat (1) and (2)
• Select the final model providing the lowest (estimated) prediction error

BUT results are sensitive to data splitting
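A minimal sketch of the resampling procedure above (illustrative, not from the slides): split the data into training and validation sets, fit models of varying complexity on the training set, and pick the complexity index with the lowest validation error. Here the complexity index is the k of a simple k-nearest-neighbor regressor, and all names and data are made up for the example.

```python
import random

def knn_predict(x, train, k):
    """k-nn regression: average the k nearest training responses."""
    nearest = sorted(train, key=lambda pt: abs(pt[0] - x))[:k]
    return sum(y for _, y in nearest) / k

def validation_error(train, val, k):
    """Mean squared 'prediction' error on the held-out validation set."""
    return sum((knn_predict(x, train, k) - y) ** 2 for x, y in val) / len(val)

# Toy data: y = x**2 plus Gaussian noise.
random.seed(0)
data = [(x / 30, (x / 30) ** 2 + random.gauss(0, 0.25)) for x in range(30)]
random.shuffle(data)
train, val = data[:20], data[20:]       # split: Training + Validation

# Steps (1)-(2): estimate validation error for each complexity index k.
errors = {k: validation_error(train, val, k) for k in (1, 2, 4, 8)}

# Final step: select the complexity with the lowest estimated error.
best_k = min(errors, key=errors.get)
print(best_k, errors[best_k])
```

Re-running with a different `random.seed` (i.e., a different split) can change `best_k`, which is the sensitivity to data splitting the slide warns about.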

K-fold cross-validation

1. Divide the training data Z into k (randomly selected) disjoint subsets {Z1, Z2, …, Zk} of size n/k
2. For each 'left-out' validation set Zi:
- use the remaining data to estimate the model ŷ = f̂(x)
- estimate the prediction error on Zi:
r_i = (k/n) · Σ_{x_j ∈ Zi} (f̂(x_j) − y_j)²
3. Estimate the average prediction risk as
R_cv = (1/k) · Σ_{i=1..k} r_i

Example of model selection

• 25 samples are generated as y = sin²(2πx) + ξ, with x uniformly sampled in [0, 1] and noise ξ ~ N(0, 1)
• Regression estimated using polynomials of degree m = 1, 2, …, 10
• Polynomial degree m = 5 is chosen via 5-fold cross-validation. The curve shows the polynomial model, along with training (*) and validation (*) data points, for one partitioning.

m     Estimated R via cross-validation
1     0.1340
2     0.1356
3     0.1452
4     0.1286
5     0.0699
6     0.1130
7     0.1892
8     0.3528
9     0.3596
10    0.4006

Statistical vs Predictive Approach

• Binary classification problem: estimate a decision boundary from training data (x_i, y_i), where y ~ binary class label (−1/+1)
Assuming the distribution P(x,y) is known:

[scatter plot of training samples in (x1, x2) space]

Classical Statistical Approach

(1) The parametric form of the unknown distribution P(x,y) is known
(2) Estimate the parameters of P(x,y) from the training data
(3) Construct the decision boundary using the estimated distribution and given misclassification costs

Modeling assumption: the distribution P(x,y) can be accurately estimated from available data

[plot of the estimated decision boundary in (x1, x2) space]

Predictive Approach

(1) The parametric form of the decision boundary f(x,w) is given
(2) Explain available data via fitting f(x,w), or minimization of some loss function (e.g., squared error)
(3) The function f(x,w*) providing the smallest fitting error is then used for prediction

Modeling assumptions:
- Need to specify f(x,w) and the loss function a priori.
- No need to estimate P(x,y)

[plot of the estimated decision boundary in (x1, x2) space]

Two Different Methodologies

• System Identification (~ classical statistics)
- estimate a probabilistic model (class densities) from available data
- use this model to make predictions
• System Imitation (~ biological learning)
- need only predict well, i.e. imitate a specific aspect of the unknown system
- multiplicity of good models
- can they be interpreted and/or trusted?
• Which approach works for high-dimensional data?

Classification with High-Dimensional Data

• Digit recognition, 5 vs 8: each example ~ 32 x 32 pixel image → 1,024-dimensional vector x
• Medical analogy:
- each pixel ~ genetic marker
- each patient (sample) described by 1024 genetic markers
- two classes ~ presence/absence of a disease
• Estimation of P(x,y) with finite data is not possible
• Accurate estimation of a decision boundary in 1024-dimensional space is possible, using just a few hundred samples

Statistical vs Predictive: Discussion

• Classical statistics has modeling goals:
- interpretable model explaining the data
- few important input variables (risk factors)
- prediction performance is not verified but (usually) assumed – why?
• Predictive modeling has different goals:
- prediction (generalization) is the main goal
- prediction accuracy is measured/reported
- model interpretation is not important, as it cannot be objectively evaluated

PART 1: Philosophy, induction and predictive data modeling

• Understanding uncertainty and risk
• Induction and knowledge discovery
• Philosophy and statistical learning
• Predictive learning approach
• Introduction to VC-theory

Empirical Risk Minimization

• ERM principle for learning
– Model parameterization: f(x, w)
– Loss function: L(f(x, w), y)
– Estimate risk from data: R_emp(w) = (1/n) · Σ_{i=1..n} L(f(x_i, w), y_i)
– Choose w* that minimizes R_emp
→ model f(x, w*) explains past data
• ERM principle ~ biological approach
• Statistical Learning Theory (aka VC-theory):
under what conditions will ERM-style models generalize (predict) well?
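The ERM steps above can be illustrated numerically (a minimal sketch, not from the slides: a linear model f(x, w) = w·x with squared loss, minimized over a made-up grid of candidate w values):

```python
# ERM sketch: compute the empirical risk
#   R_emp(w) = (1/n) * sum_i L(f(x_i, w), y_i)
# over the training data, then choose w* minimizing it.

def f(x, w):
    return w * x                       # model parameterization f(x, w)

def loss(y_pred, y):
    return (y_pred - y) ** 2           # squared loss L(f(x, w), y)

def empirical_risk(w, data):
    return sum(loss(f(x, w), y) for x, y in data) / len(data)

# Training data generated by y = 3*x (no noise), so w* should be near 3.
data = [(x, 3 * x) for x in range(1, 6)]

candidates = [i / 10 for i in range(0, 61)]   # grid of w in 0.0 .. 6.0
w_star = min(candidates, key=lambda w: empirical_risk(w, data))
print(w_star, empirical_risk(w_star, data))
```

The model f(x, w*) "explains past data" in exactly the ERM sense: it attains the smallest empirical risk among the candidates considered.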

Inductive Learning Setting

• The learning machine observes samples (x, y), and returns an estimated response ŷ = f(x, w)
• Recall 'first-principles' vs 'empirical' knowledge
Two types of inference: identification vs imitation
• Risk: R(w) = ∫ Loss(y, f(x, w)) dP(x, y) → min

[diagram: Generator of samples → x → System → y; the Learning Machine observes (x, y) and outputs ŷ]

VC-theory basics - 1

Goals of Predictive Learning:
- explain (or fit) available training data
- predict well future (yet unobserved) data
Similar to biological learning.

Example: given 1, 3, 7, …, predict the rest of the sequence.
Rule 1: x_{k+1} = 2·x_k + 1
Rule 2: randomly chosen odd numbers
Rule 3: x_k = k² − k + 1

BUT for the sequence 1, 3, 7, 15, 31, 63, …, Rule 1 seems very reliable (why?)
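A small script makes the point concrete (illustrative; the garbled formulas are assumed here to be Rule 1: x_{k+1} = 2·x_k + 1 and Rule 3: the quadratic x_k = k² − k + 1, a common version of this example):

```python
# Both rules explain the short sequence 1, 3, 7, but only Rule 1 keeps
# explaining the longer sequence 1, 3, 7, 15, 31, 63.

def rule1(n):
    """x_{k+1} = 2*x_k + 1, starting from x_1 = 1."""
    seq, x = [], 1
    for _ in range(n):
        seq.append(x)
        x = 2 * x + 1
    return seq

def rule3(n):
    """x_k = k**2 - k + 1 (assumed quadratic rule)."""
    return [k * k - k + 1 for k in range(1, n + 1)]

print(rule1(3), rule3(3))   # both give [1, 3, 7]
print(rule1(6))             # [1, 3, 7, 15, 31, 63]
print(rule3(6))             # [1, 3, 7, 13, 21, 31] -- diverges after 7
```

Two different models fit the first three observations equally well; only more data separates them, which is why explaining past data alone does not guarantee prediction.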

VC-theory basics - 2

Main practical result of VC-theory:
If a model explains well past data AND is simple, then it can predict well
• This explains why Rule 1 is a good model for the sequence 1, 3, 7, 15, 31, 63, …
• Measure of model complexity ~ VC-dimension
~ can explain past data 1, 3, 7, 15, 31, 63 BUT cannot explain all other possible sequences → low VC-dimension (~ large falsifiability)
• For linear models, VC-dim = DoF (as in statistics)
• But for nonlinear models they are different

VC-theory basics - 3

Strategy for modeling high-dimensional data:
Find a model f(x) that explains past data AND has low VC-dimension, even when the dimensionality is large

→ SVM approach:
Large margin = low VC-dimension ~ easy to falsify
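The margin itself is easy to compute numerically (an illustrative sketch, not the SVM training algorithm: for a linear boundary w·x + b = 0 separating labeled points, the geometric margin is min_i y_i·(w·x_i + b) / ||w||; the boundaries and points below are made up):

```python
import math

def margin(w, b, samples):
    """Geometric margin of the boundary w.x + b = 0 on labeled samples (x, y)."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    return min(y * (sum(wi * xi for wi, xi in zip(w, x)) + b) / norm
               for x, y in samples)

# Two linearly separable classes in 2-d.
samples = [((0.0, 0.0), -1), ((0.0, 1.0), -1),
           ((2.0, 0.0), +1), ((2.0, 1.0), +1)]

# Both boundaries separate the data, but the first has a larger margin.
print(margin((1.0, 0.0), -1.0, samples))   # boundary x1 = 1.0: margin 1.0
print(margin((1.0, 0.0), -1.8, samples))   # boundary x1 = 1.8: smaller margin
```

Among all separating boundaries, SVM picks the one maximizing this margin; in VC-theory terms, the large-margin boundary is the simpler (lower VC-dimension, more falsifiable) model.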

SUMMARY & DISCUSSION

• Predictive data modeling:
- training data similar to future (test) data
- performance index / loss function
- predictive methodology is different from classical statistics
- there may not be a single true model
- 'conventional' model interpretation is hard
• Understanding of uncertainty and risk:
- changing due to technological advances
- cultural and ethical issues

Recommended