Cooley KXEN April 04



    Automation through Structured Risk Minimization

    Robert Cooley, Ph.D.

    VP Technical Operations, Knowledge Extraction Engines (KXEN), Inc.


    Personal Motivation & Background

    "When the solution is simple, God is answering." - Albert Einstein

    1991 Navy Nuclear Reactor Design

    Emphasis on simple but highly functional design

    "The key is to automate data mining as much as possible, but not more so." - Gregory Piatetsky-Shapiro

    1996 - Data mining research

    Emphasis on automation of knowledge discovery


    Personal Motivation & Background

    "When solving a problem of interest, do not solve a more general problem as an intermediate step." - Vladimir Vapnik

    1997 Structured Risk Minimization (SRM)

    Emphasis on simplicity to manage generalization

    1998 Support Vector Machines (SVM) for Text Classification

    Hoping for better generalization, ended up with increased automation

    "When all you have is a hammer, every problem looks like a nail." - Abraham Maslow

    2001 Joined KXEN: Automation through SRM


    Learning / Generalization

    Over-fit model / low robustness (no training error, high test error)

    Under-fit model / high robustness (training error = test error)

    Robust model (low training error, low test error)


    What is Robustness?

    Statistical: ability to generalize from a set of training examples
        Are the available examples sufficient?

    Training: ability to handle a wide range of situations
        Any number of variables, cardinality, missing values, etc.

    Deployment: ability to resist degradation under changing conditions
        Values not seen in training

    Engineering: ability to avoid catastrophic failure
        Return an answer without crashing under challenging conditions


    From W. of Ockham to Vapnik

    If two models explain the data equally well, then the simpler model is to be preferred

    1st discovery: VC (Vapnik-Chervonenkis) dimension measures the complexity of a set of mappings.


    From W. of Ockham to Vapnik

    If two models explain the data equally well, then the model with the smallest VC dimension has better generalization performance

    1st discovery: VC (Vapnik-Chervonenkis) dimension measures the complexity of a set of mappings.

    2nd discovery: The VC dimension can be linked to generalization results (results on new data).

    3rd discovery: Don't just observe differences between models, control them.


    From W. of Ockham to Vapnik

    To build the best model, try to optimize the performance on the training data set AND minimize the VC dimension

    1st discovery: VC (Vapnik-Chervonenkis) dimension measures the complexity of a set of mappings.

    2nd discovery: The VC dimension can be linked to generalization results (results on new data).

    3rd discovery: Don't just observe differences between models, control them.


    Structured Risk Minimization (SRM)

    Quality: How well does a model describe your existing data?

    Achieved by minimizing Error.

    Reliability: How well will a model predict future data?

    Achieved by minimizing Confidence Interval.


    [Chart: risk vs. model complexity - training error falls as complexity grows while the confidence interval rises; the best model minimizes the total risk (error plus confidence interval).]
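    The trade-off in this chart is the one described by Vapnik's generalization bound. A standard form of the bound (for classification error, holding with probability 1 - η over n training examples, where h is the VC dimension of the model family) is:

\[
R(\alpha) \;\le\; R_{emp}(\alpha) \;+\; \sqrt{\frac{h\left(\ln\frac{2n}{h} + 1\right) - \ln\frac{\eta}{4}}{n}}
\]

    The first term is the training error (Quality); the square-root term is the confidence interval (Reliability). SRM selects the model complexity that minimizes their sum, the "Best Model" point on the chart.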


    SRM is a Principle, NOT an Algorithm

    Robust Regression

    Smart Segmenter

    Consistent Coder

    Support Vector Machine


    Features of SRM

    Adding Variables is Low Cost, Potentially High Benefit
        Does not cause over-fitting
        More variables can only add to the model quality
        Random variables do not harm quality or reliability
        Highly correlated variables do not harm the modeling process
        Efficient scaling with additional variables

    Free of Distribution Assumptions
        Normal distributions aren't necessary
        Skewed distributions do not harm quality
        Resistant to outliers
        Independence of inputs isn't necessary

    Indicates Generalization Capacity of any Model
        Insufficient training examples to get a robust model
        Choose between several models with similar training errors


    SRM enables Automation without sacrificing Quality & Reliability

    [Diagram: modeling workflow boxes - Create Analytic Dataset, Prepare Data, Train Model, Validate & Understand, Select Variables]

    Analytic Dataset Creation

    (Lack of) Variable Selection

    Data Preparation

    Model Selection

    Model Testing


    Analytic Data Set Creation

    Adding Variables is Low Cost, Potentially High Benefit

    Kitchen Sink philosophy for variable creation

    Don't shoe-horn several pieces of information into a single variable (e.g. RFM)

    Break events out into several different time periods

    A single analytic data set with hundreds or thousands of columns can serve as input for hundreds of different models


    Event Aggregation Example

    [Figure: purchase histories for two customers (products A, B, and C bought at various prices, Jan-Jun), aggregated three ways: RFM aggregation, simple aggregation (total purchases per product and per month), and relative aggregation (0/1 flags for product-to-product transitions such as A to B, B to C, and C to out).]
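    To make the aggregation styles concrete, here is a minimal pandas sketch over a made-up purchase-event log (the customer, month, product, and price columns and all of the values are hypothetical, not the deck's data). It builds the "simple" per-product and per-month totals and the "relative" product-to-product transition flags:

```python
import pandas as pd

# Hypothetical purchase-event log: one row per purchase.
events = pd.DataFrame({
    "customer": [1, 1, 1, 2, 2, 2],
    "month":    ["Jan", "Mar", "May", "Feb", "Apr", "Jun"],
    "product":  ["A", "B", "C", "A", "B", "C"],
    "price":    [45, 35, 25, 30, 20, 40],
})

# Simple aggregation: total spend per product and per month, one column each.
per_product = events.pivot_table(index="customer", columns="product",
                                 values="price", aggfunc="sum", fill_value=0)
per_month = events.pivot_table(index="customer", columns="month",
                               values="price", aggfunc="sum", fill_value=0)
simple = per_product.join(per_month)

# Relative aggregation: 0/1 flags for product-to-product transitions
# (e.g. A->B) in each customer's time-ordered purchase sequence.
month_order = ["Jan", "Feb", "Mar", "Apr", "May", "Jun"]
events["month"] = pd.Categorical(events["month"], categories=month_order, ordered=True)
ordered = events.sort_values(["customer", "month"])
ordered["next_product"] = ordered.groupby("customer")["product"].shift(-1).fillna("out")
transitions = pd.crosstab(ordered["customer"],
                          ordered["product"] + "->" + ordered["next_product"]).clip(upper=1)

print(simple.join(transitions))
```

    Each row of the result is one customer with one column per aggregate, which is the kind of wide analytic dataset the previous slide advocates.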


    Data Preparation

    Free of Distribution Assumptions

    Different encodings of the same information generally lead to the same predictive quality

    No need to check for co-linearities

    No need to check for random variables

    SRM can be used within the data preparation process to automate binning of values


    Binning

    Nominal: group values
        TV, Chair, VCR, Lamp, Sofa -> [TV, VCR], [Chair, Sofa], [Lamp]

    Ordinal: group values, preserving order
        1, 2, 3, 4, 5 -> [1, 2, 3], [4, 5]

    Continuous: group values into ranges
        1.5, 2.6, 5.2, 12.0, 13.1 -> [1.5-5.2], [12.0-13.1]
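    A minimal sketch of these three operations using pandas; the group assignments and labels below are illustrative choices, not KXEN's binning output:

```python
import pandas as pd

# Nominal: group category values; no order is assumed.
nominal_map = {"TV": "[TV, VCR]", "VCR": "[TV, VCR]",
               "Chair": "[Chair, Sofa]", "Sofa": "[Chair, Sofa]", "Lamp": "[Lamp]"}
items = pd.Series(["TV", "Chair", "VCR", "Lamp", "Sofa"])
print(items.map(nominal_map))

# Ordinal: group values while preserving their order, e.g. [1, 2, 3] and [4, 5].
ratings = pd.Series([1, 2, 3, 4, 5])
print(pd.cut(ratings, bins=[0, 3, 5], labels=["[1-3]", "[4-5]"]))

# Continuous: group values into ranges.
amounts = pd.Series([1.5, 2.6, 5.2, 12.0, 13.1])
print(pd.cut(amounts, bins=[0.0, 6.0, 14.0], labels=["[1.5-5.2]", "[12.0-13.1]"]))
```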


    Why Bin Variables?

    Performance
        Captures non-linear behavior of continuous variables
        Minimizes impact of outliers
        Removes noise from large numbers of distinct values

    Explainability
        Grouped values are easier to display and understand

    Speed
        Predictive algorithms get faster as the number of distinct values decreases


    Binning Strategies

    None
        Leave each distinct value as a separate input

    Standardized
        Group values according to preset ranges
        Age: [10-19], [20-29], [30-39], ...
        Zip code: [94100-94199], [94200-94299], ...

    Target Based
        Use advanced techniques to find the right bins for a given question (one such technique is sketched below)
        DEPENDS ON THE QUESTION!
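    The deck does not say which "advanced techniques" KXEN uses, so the sketch below stands in with a common alternative: fit a shallow decision tree of the single input against the target and reuse its split thresholds as bin edges. The data is synthetic and only mirrors the "customer value by age" example:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic "customer value by age" data with two true breakpoints (29 and 40).
rng = np.random.default_rng(0)
age = rng.integers(18, 46, size=500).astype(float)
value = np.where(age < 29, 8.0, np.where(age < 40, 3.0, 6.0)) + rng.normal(0, 0.5, 500)

# A tree limited to 3 leaves makes at most 2 splits, i.e. 3 target-based bins.
tree = DecisionTreeRegressor(max_leaf_nodes=3).fit(age.reshape(-1, 1), value)
cut_points = sorted(t for t in tree.tree_.threshold if t != -2)  # -2 marks leaf nodes
print("target-based bin edges:", cut_points)  # close to the true breakpoints
```

    Because the cut points are chosen against the target, a different target produces different bins, which is the "DEPENDS ON THE QUESTION" point above.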


    No Binning

    [Chart: customer value by age, one bar per individual age from 18 to 45]


    Standardized Binning

    [Chart: customer value by age, grouped into preset bins 18-25, 26-35, and 36-45]


    Target Based Binning

    [Chart: customer value by age, grouped into target-based bins 18-28, 29-39, and 40-45]


    Predictive Benefit of Target Based Binning

    [Charts: a sine-wave training dataset (Y plotted against X) and the KXEN approximation of it built from 20 bins.]
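    A small sketch of the idea behind these charts: approximate a noisy sine wave with a piecewise-constant model built from 20 bins of X. The data is synthetic and the bucketing is plain equal-width binning, standing in for KXEN's actual encoding:

```python
import numpy as np

# Synthetic sine-wave training data.
rng = np.random.default_rng(0)
x = rng.uniform(0, 1000, size=2000)
y = np.sin(2 * np.pi * x / 500) + rng.normal(0, 0.1, size=x.size)

# Cut X into 20 equal-width bins and predict the mean of Y within each bin.
edges = np.linspace(x.min(), x.max(), 21)
bin_ids = np.clip(np.digitize(x, edges) - 1, 0, 19)
bin_means = np.array([y[bin_ids == b].mean() for b in range(20)])

prediction = bin_means[bin_ids]
print("RMSE of the 20-bin approximation:", np.sqrt(np.mean((y - prediction) ** 2)))
```

    Even this crude version follows the non-linear shape of the curve, which a single untransformed X term could not, illustrating the predictive benefit the slide claims for binning.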



    Model Testing

    Indicates Generalization Capacity of any Model

    It is not always possible to get a robust model from a limited set of data

    Determine if a given number of training examples is sufficient

    Determine amount of model degradation over time
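    One generic way (not KXEN's own test) to check whether a given number of training examples is sufficient: train on progressively larger subsets and watch the validation score. If the score is still improving at the full training size, more data would likely help; if it has flattened, the sample is probably sufficient. The data below is synthetic:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic binary-classification data.
rng = np.random.default_rng(0)
X = rng.normal(size=(5000, 10))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(0, 1, size=5000) > 0).astype(int)

X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

# Learning curve: validation accuracy as the training subset grows.
for n in (100, 500, 1000, len(X_train)):
    model = LogisticRegression().fit(X_train[:n], y_train[:n])
    print(f"{n:>5} training rows -> validation accuracy {model.score(X_val, y_val):.3f}")
```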



    Traditional Modeling Methodology

    CRISP-DM

    Business Understanding

    Data Understanding

    Data Preparation

    Modeling

    Evaluation

    Deployment

    SEMMA

    Sample

    Explore

    Modify

    Model

    Assess



    SRM Modeling Methodology

    1. Business Understanding
    2. Create Analytic Dataset
    3. Modeling
    4. Data Understanding
    5. Deployment
    6. Maintenance



    Conclusions

    SRM is a general principle that can be applied to all phases of data mining

    The main benefit of applying SRM is increased automation, not increased predictive quality

    In order to take full advantage of the benefits, the overall modeling methodology must be modified



    Thank You

    Questions?

    Contact Information: [email protected]

    651-338-3880

    Additional KXEN Information:

    www.kxen.com

    [email protected]

    5/25 NYC Discovery Workshop, 9am to 1pm
