Cooley KXEN April 04



    Automation through Structured Risk Minimization

    Robert Cooley, Ph.D.

    VP Technical Operations, Knowledge Extraction Engines (KXEN), Inc.


    Personal Motivation & Background

    "When the solution is simple, God is answering." - Albert Einstein

    1991 Navy Nuclear Reactor Design

    Emphasis on simple but highly functional design

    "The key is to automate data mining as much as possible, but not more so." - Gregory Piatetsky-Shapiro

    1996 - Data mining research

    Emphasis on automation of knowledge discovery


    Personal Motivation & Background

    "When solving a problem of interest, do not solve a more general problem as an intermediate step." - Vladimir Vapnik

    1997 Structured Risk Minimization (SRM)

    Emphasis on simplicity to manage generalization

    1998 Support Vector Machines (SVM) for Text Classification

    Hoping for better generalization, ended up with increased automation

    "When all you have is a hammer, every problem looks like a nail." - Abraham Maslow

    2001 Joined KXEN: Automation through SRM


    Learning / Generalization

    Over-fit model / low robustness (no training error, high test error)

    Under-fit model / high robustness (training error = test error)

    Robust model (low training error, low test error)


    What is Robustness?

    Statistical: ability to generalize from a set of training examples
        Are the available examples sufficient?

    Training: ability to handle a wide range of situations
        Any number of variables, cardinality, missing values, etc.

    Deployment: ability to resist degradation under changing conditions
        Values not seen in training

    Engineering: ability to avoid catastrophic failure
        Return an answer without crashing under challenging conditions


    From W. of Ockham to Vapnik

    If two models explain the data equally well, then the simpler model is to be preferred

    1st discovery: VC (Vapnik-Chervonenkis) dimension measures the complexity of a set of mappings.


    From W. of Ockham to Vapnik

    If two models explain the data equally well, then the model with the smallest VC dimension has better generalization performance

    1st discovery: VC (Vapnik-Chervonenkis) dimension measures the complexity of a set of mappings.

    2nd discovery: The VC dimension can be linked to generalization results (results on new data).

    3rd discovery: Don't just observe differences between models, control them.


    From W. of Ockham to Vapnik

    To build the best model, try to optimize the performance on the training data set AND minimize the VC dimension

    1st discovery: VC (Vapnik-Chervonenkis) dimension measures the complexity of a set of mappings.

    2nd discovery: The VC dimension can be linked to generalization results (results on new data).

    3rd discovery: Don't just observe differences between models, control them.


    Structured Risk Minimization (SRM)

    Quality: How well does a model describe your existing data?

    Achieved by minimizing Error.

    Reliability: How well will a model predict future data?

    Achieved by minimizing Confidence Interval.


    [Chart: risk vs. model complexity - training error falls as complexity grows while the confidence interval rises; the best model minimizes the total risk (error plus confidence interval).]
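    The trade-off in this chart is the one described by Vapnik's generalization bound. A standard form of the bound (for classification error, holding with probability 1 - η over n training examples, where h is the VC dimension of the model family) is:

\[
R(\alpha) \;\le\; R_{emp}(\alpha) \;+\; \sqrt{\frac{h\left(\ln\frac{2n}{h} + 1\right) - \ln\frac{\eta}{4}}{n}}
\]

    The first term is the training error (Quality); the square-root term is the confidence interval (Reliability). SRM selects the model complexity that minimizes their sum, the "Best Model" point on the chart.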


    SRM is a Principle, NOT an Algorithm

    Robust Regression

    Smart Segmenter

    Consistent Coder

    Support Vector Machine


    Features of SRM

    Adding Variables is Low Cost, Potentially High Benefit
        Does not cause over-fitting
        More variables can only add to the model quality
        Random variables do not harm quality or reliability
        Highly correlated variables do not harm the modeling process
        Efficient scaling with additional variables

    Free of Distribution Assumptions
        Normal distributions aren't necessary
        Skewed distributions do not harm quality
        Resistant to outliers
        Independence of inputs isn't necessary

    Indicates Generalization Capacity of any Model
        Insufficient training examples to get a robust model
        Choose between several models with similar training errors


    SRM enables Automation without sacrificing Quality & Reliability

    [Diagram: modeling workflow boxes - Create Analytic Dataset, Prepare Data, Train Model, Validate & Understand, Select Variables]

    Analytic Dataset Creation

    (Lack of) Variable Selection

    Data Preparation

    Model Selection

    Model Testing


    Analytic Data Set Creation

    Adding Variables is Low Cost, Potentially High Benefit

    Kitchen Sink philosophy for variable creation

    Don't shoe-horn several pieces of information into a single variable (e.g. RFM)

    Break events out into several different time periods

    A single analytic data set with hundreds or thousands of columns can serve as input for hundreds of different models


    Event Aggregation Example

    [Figure: purchase histories for two customers (products A, B, and C bought at various prices, Jan-Jun), aggregated three ways: RFM aggregation, simple aggregation (total purchases per product and per month), and relative aggregation (0/1 flags for product-to-product transitions such as A to B, B to C, and C to out).]
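    To make the aggregation styles concrete, here is a minimal pandas sketch over a made-up purchase-event log (the customer, month, product, and price columns and all of the values are hypothetical, not the deck's data). It builds the "simple" per-product and per-month totals and the "relative" product-to-product transition flags:

```python
import pandas as pd

# Hypothetical purchase-event log: one row per purchase.
events = pd.DataFrame({
    "customer": [1, 1, 1, 2, 2, 2],
    "month":    ["Jan", "Mar", "May", "Feb", "Apr", "Jun"],
    "product":  ["A", "B", "C", "A", "B", "C"],
    "price":    [45, 35, 25, 30, 20, 40],
})

# Simple aggregation: total spend per product and per month, one column each.
per_product = events.pivot_table(index="customer", columns="product",
                                 values="price", aggfunc="sum", fill_value=0)
per_month = events.pivot_table(index="customer", columns="month",
                               values="price", aggfunc="sum", fill_value=0)
simple = per_product.join(per_month)

# Relative aggregation: 0/1 flags for product-to-product transitions
# (e.g. A->B) in each customer's time-ordered purchase sequence.
month_order = ["Jan", "Feb", "Mar", "Apr", "May", "Jun"]
events["month"] = pd.Categorical(events["month"], categories=month_order, ordered=True)
ordered = events.sort_values(["customer", "month"])
ordered["next_product"] = ordered.groupby("customer")["product"].shift(-1).fillna("out")
transitions = pd.crosstab(ordered["customer"],
                          ordered["product"] + "->" + ordered["next_product"]).clip(upper=1)

print(simple.join(transitions))
```

    Each row of the result is one customer with one column per aggregate, which is the kind of wide analytic dataset the previous slide advocates.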


    Data Preparation

    Free of Distribution Assumptions

    Different encodings of the same information generally lead to the same predictive quality

    No need to check for co-linearities

    No need to check for random variables

    SRM can be used within the data preparation process to automate binning of values


    Binning

    Nominal: group values
        TV, Chair, VCR, Lamp, Sofa -> [TV, VCR], [Chair, Sofa], [Lamp]

    Ordinal: group values, preserving order
        1, 2, 3, 4, 5 -> [1, 2, 3], [4, 5]

    Continuous: group values into ranges
        1.5, 2.6, 5.2, 12.0, 13.1 -> [1.5-5.2], [12.0-13.1]
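    A minimal sketch of these three operations using pandas; the group assignments and labels below are illustrative choices, not KXEN's binning output:

```python
import pandas as pd

# Nominal: group category values; no order is assumed.
nominal_map = {"TV": "[TV, VCR]", "VCR": "[TV, VCR]",
               "Chair": "[Chair, Sofa]", "Sofa": "[Chair, Sofa]", "Lamp": "[Lamp]"}
items = pd.Series(["TV", "Chair", "VCR", "Lamp", "Sofa"])
print(items.map(nominal_map))

# Ordinal: group values while preserving their order, e.g. [1, 2, 3] and [4, 5].
ratings = pd.Series([1, 2, 3, 4, 5])
print(pd.cut(ratings, bins=[0, 3, 5], labels=["[1-3]", "[4-5]"]))

# Continuous: group values into ranges.
amounts = pd.Series([1.5, 2.6, 5.2, 12.0, 13.1])
print(pd.cut(amounts, bins=[0.0, 6.0, 14.0], labels=["[1.5-5.2]", "[12.0-13.1]"]))
```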


    Why Bin Variables?

    Performance
        Captures non-linear behavior of continuous variables
        Minimizes impact of outliers
        Removes noise from large numbers of distinct values

    Explainability
        Grouped values are easier to display and understand

    Speed
        Predictive algorithms get faster as the number of distinct values decreases


    Binning Strategies

    None
        Leave each distinct value as a separate input

    Standardized
        Group values according to preset ranges
        Age: [10-19], [20-29], [30-39], ...
        Zip code: [94100-94199], [94200-94299], ...

    Target Based
        Use advanced techniques to find the right bins for a given question (one such technique is sketched below)
        DEPENDS ON THE QUESTION!
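    The deck does not say which "advanced techniques" KXEN uses, so the sketch below stands in with a common alternative: fit a shallow decision tree of the single input against the target and reuse its split thresholds as bin edges. The data is synthetic and only mirrors the "customer value by age" example:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic "customer value by age" data with two true breakpoints (29 and 40).
rng = np.random.default_rng(0)
age = rng.integers(18, 46, size=500).astype(float)
value = np.where(age < 29, 8.0, np.where(age < 40, 3.0, 6.0)) + rng.normal(0, 0.5, 500)

# A tree limited to 3 leaves makes at most 2 splits, i.e. 3 target-based bins.
tree = DecisionTreeRegressor(max_leaf_nodes=3).fit(age.reshape(-1, 1), value)
cut_points = sorted(t for t in tree.tree_.threshold if t != -2)  # -2 marks leaf nodes
print("target-based bin edges:", cut_points)  # close to the true breakpoints
```

    Because the cut points are chosen against the target, a different target produces different bins, which is the "DEPENDS ON THE QUESTION" point above.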


    No Binning

    [Chart: customer value by age, one bar per individual age from 18 to 45]


    Standardized Binning

    [Chart: customer value by age, grouped into preset bins 18-25, 26-35, and 36-45]


    Target Based Binning

    [Chart: customer value by age, grouped into target-based bins 18-28, 29-39, and 40-45]


    Predictive Benefit of Target Based Binning

    [Charts: a sine-wave training dataset (Y plotted against X) and the KXEN approximation of it built from 20 bins.]
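    A small sketch of the idea behind these charts: approximate a noisy sine wave with a piecewise-constant model built from 20 bins of X. The data is synthetic and the bucketing is plain equal-width binning, standing in for KXEN's actual encoding:

```python
import numpy as np

# Synthetic sine-wave training data.
rng = np.random.default_rng(0)
x = rng.uniform(0, 1000, size=2000)
y = np.sin(2 * np.pi * x / 500) + rng.normal(0, 0.1, size=x.size)

# Cut X into 20 equal-width bins and predict the mean of Y within each bin.
edges = np.linspace(x.min(), x.max(), 21)
bin_ids = np.clip(np.digitize(x, edges) - 1, 0, 19)
bin_means = np.array([y[bin_ids == b].mean() for b in range(20)])

prediction = bin_means[bin_ids]
print("RMSE of the 20-bin approximation:", np.sqrt(np.mean((y - prediction) ** 2)))
```

    Even this crude version follows the non-linear shape of the curve, which a single untransformed X term could not, illustrating the predictive benefit the slide claims for binning.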



    Model Testing

    Indicates Generalization Capacity of any Model

    It is not always possible to get a robust model from a limited set of data

    Determine if a given number of training examples is sufficient

    Determine amount of model degradation over time
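    One generic way (not KXEN's own test) to check whether a given number of training examples is sufficient: train on progressively larger subsets and watch the validation score. If the score is still improving at the full training size, more data would likely help; if it has flattened, the sample is probably sufficient. The data below is synthetic:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic binary-classification data.
rng = np.random.default_rng(0)
X = rng.normal(size=(5000, 10))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(0, 1, size=5000) > 0).astype(int)

X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

# Learning curve: validation accuracy as the training subset grows.
for n in (100, 500, 1000, len(X_train)):
    model = LogisticRegression().fit(X_train[:n], y_train[:n])
    print(f"{n:>5} training rows -> validation accuracy {model.score(X_val, y_val):.3f}")
```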



    Traditional Modeling Methodology

    CRISP-DM

    Business Understanding

    Data Understanding

    Data Preparation

    Modeling

    Evaluation

    Deployment

    SEMMA

    Sample

    Explore

    Modify

    Model

    Assess



    SRM Modeling Methodology

    1. Business Understanding
    2. Create Analytic Dataset
    3. Modeling
    4. Data Understanding
    5. Deployment
    6. Maintenance



    Conclusions

    SRM is a general principle that can be applied to all phases of data mining

    The main benefit of applying SRM is increased automation, not increased predictive quality

    In order to take full advantage of the benefits, the overall modeling methodology must be modified



    Thank You

    Questions?

    Contact Information: [email protected]

    651-338-3880

    Additional KXEN Information:

    www.kxen.com

    [email protected]

    5/25 NYC Discovery Workshop, 9am to 1pm
