© Deloitte Consulting, 2004 Introduction to Data Mining James Guszcza, FCAS, MAAA CAS 2004 Ratemaking Seminar Philadelphia March 11-12, 2004




Themes

What is Data Mining?
  How does it relate to statistics?
  Insurance applications
  Data sources
The Data Mining Process
Model Design
Modeling Techniques
  Louise Francis’ Presentation


Themes

How does data mining need actuarial science?
  Variable creation
  Model design
  Model evaluation
How does actuarial science need data mining?
  Advances in computing, modeling techniques
  Ideas from other fields can be applied to insurance problems


Themes

“The quiet statisticians have changed our world; not by discovering new facts or technical developments, but by changing the ways that we reason, experiment and form our opinions.”

-- Ian Hacking

Data mining gives us new ways of approaching the age-old problems of risk selection and pricing…

…and other problems not traditionally considered ‘actuarial’.


What is Data Mining?


What is Data Mining?

My definition: “Statistics for the Computer Age”
Many new techniques have come from Computer Science, Marketing, Biology… but all can (should!) be brought under the framework of “statistics”
Not a radical break with traditional statistics
Complements, builds on traditional statistics
Statistics enriched with brute-force capabilities of modern computing
Opens the door to new techniques
Therefore Data Mining tends to be associated with industrial-sized data sets


Buzz-words

Data Mining
Knowledge Discovery
Machine Learning
Statistical Learning
Predictive Modeling
Supervised Learning
Unsupervised Learning
…etc


What is Data Mining?

Supervised learning: predict the value of a target variable based on several predictive variables
  “Predictive Modeling”
  Credit / non-credit scoring engines
  Retention, cross-sell models
Unsupervised learning: describe associations and patterns along many dimensions without any target information
  Customer segmentation
  Data clustering
  Market basket analysis (“diapers and beer”)
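As an illustration of the unsupervised case, here is a minimal sketch of the standard association-rule measures (support, confidence, lift) behind a market basket analysis. The baskets are made-up data and the function name `rule_metrics` is hypothetical; neither comes from the presentation.

```python
def rule_metrics(baskets, antecedent, consequent):
    """Support, confidence and lift for the rule antecedent -> consequent."""
    n = len(baskets)
    n_a = sum(antecedent in b for b in baskets)
    n_c = sum(consequent in b for b in baskets)
    n_both = sum(antecedent in b and consequent in b for b in baskets)
    support = n_both / n               # share of baskets containing both items
    confidence = n_both / n_a          # share of antecedent baskets that also hold the consequent
    lift = confidence / (n_c / n)      # > 1 means the items co-occur more often than chance
    return support, confidence, lift

# Made-up baskets, for illustration only.
baskets = [
    {"diapers", "beer"},
    {"diapers", "beer", "milk"},
    {"diapers", "milk"},
    {"beer"},
    {"milk", "bread"},
]
support, confidence, lift = rule_metrics(baskets, "diapers", "beer")
```

No target variable appears anywhere: the rule is judged only by how often the items occur together.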


So Why Should Actuaries Do This Stuff?

Any application of statistics requires subject-matter expertise
  Psychometricians
  Econometricians
  Bioinformaticians
  Marketing scientists
  …are all applied statisticians with a particular subject-matter expertise & area of specialty
Add actuarial modelers to this list!
  “Insurometricians”!?
Actuarial knowledge is critical to the success of insurance data mining projects


Three Concepts

Scoring engines
  A “predictive model” by any other name…
Lift curves
  How much worse than average are the policies with the worst scores?
Out-of-sample tests
  How well will the model work in the real world?
  Unbiased estimate of predictive power


Classic Application: Scoring Engines

Scoring engine: formula that classifies or separates policies (or risks, accounts, agents…) into
  profitable vs. unprofitable
  retaining vs. non-retaining…
(Non-)linear equation f( ) of several predictive variables
Produces continuous range of scores

score = f(X1, X2, …, XN)
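A minimal sketch of what such an f( ) might look like, assuming a linear form and a logistic (non-linear) variant of it. The variable names, weights and intercept below are hypothetical placeholders, not values from the presentation; a real engine would estimate them from data.

```python
import math

# Hypothetical coefficients, for illustration only.
WEIGHTS = {"prior_claims": 0.40, "vehicle_age": 0.05, "credit_tier": 0.30}
INTERCEPT = -1.0

def linear_score(policy):
    """Linear f: intercept plus a weighted sum of the predictive variables."""
    return INTERCEPT + sum(w * policy[name] for name, w in WEIGHTS.items())

def logistic_score(policy):
    """Non-linear f: squash the linear score into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-linear_score(policy)))

# Two made-up policies; the second has worse values on every variable.
policies = [
    {"prior_claims": 0, "vehicle_age": 2, "credit_tier": 1},
    {"prior_claims": 3, "vehicle_age": 10, "credit_tier": 4},
]
scores = [logistic_score(p) for p in policies]
```

Either form produces a continuous score, so policies can be ranked rather than merely sorted into two bins.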


What “Powers” a Scoring Engine?

Scoring Engine:

score = f(X1, X2, …, XN)

The X1, X2, …, XN are at least as important as the f( )!
  Again why actuarial expertise is necessary
  Think of the predictive power of credit variables
A large part of the modeling process consists of variable creation and selection
  Usually possible to generate 100’s of variables
  Steepest part of the learning curve


Model Evaluation: Lift Curves

Sort data by score
Break the dataset into 10 equal pieces
  Best “decile”: lowest score → lowest LR (loss ratio)
  Worst “decile”: highest score → highest LR
  Difference: “Lift”
Lift = segmentation power
Lift translates into ROI of the modeling project
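The decile procedure above can be sketched as follows. This is an illustrative implementation, not the presenter's: the function name `decile_relativities` is hypothetical, the data are synthetic, and it assumes at least as many policies as bins and non-zero premium in every bin.

```python
def decile_relativities(scores, losses, premiums, n_bins=10):
    """Sort policies by score, cut into n_bins equal pieces, and return each
    piece's loss ratio divided by the overall loss ratio."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    overall_lr = sum(losses) / sum(premiums)
    n = len(order)
    relativities = []
    for b in range(n_bins):
        idx = order[b * n // n_bins:(b + 1) * n // n_bins]
        bin_lr = sum(losses[i] for i in idx) / sum(premiums[i] for i in idx)
        relativities.append(bin_lr / overall_lr)
    return relativities

# Synthetic data: 20 policies whose losses rise with the score.
scores = list(range(20))
premiums = [100.0] * 20
losses = [5.0 * s for s in scores]

lift_curve = decile_relativities(scores, losses, premiums)
spread = lift_curve[-1] - lift_curve[0]  # worst decile minus best decile
```

A value below 1 in the first bin and above 1 in the last bin indicates that the score is separating better-than-average from worse-than-average business.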


Out-of-Sample Testing

Randomly divide data into 3 pieces
  Training data, Test data, Validation data
Use Training data to fit models
Score the Test data to create a lift curve
  Perform the train/test steps iteratively until you have a model you’re happy with
  During this iterative phase, validation data is set aside in a “lock box”
Once model has been finalized, score the Validation data and produce a lift curve
  Unbiased estimate of future performance
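A minimal sketch of the random three-way split. The 60/20/20 proportions, the seed and the function name `three_way_split` are assumptions made for illustration; the presentation does not specify how large each piece should be.

```python
import random

def three_way_split(records, train_frac=0.6, test_frac=0.2, seed=2004):
    """Shuffle a copy of the records and cut it into train / test / validation."""
    rng = random.Random(seed)          # fixed seed so the split is reproducible
    shuffled = records[:]
    rng.shuffle(shuffled)
    n_train = int(len(shuffled) * train_frac)
    n_test = int(len(shuffled) * test_frac)
    train = shuffled[:n_train]
    test = shuffled[n_train:n_train + n_test]
    validation = shuffled[n_train + n_test:]   # the "lock box": untouched until the end
    return train, test, validation

policy_ids = list(range(1000))
train, test, validation = three_way_split(policy_ids)
```

Splitting once, up front, and never re-drawing the validation piece is what keeps the final lift curve an honest estimate.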


Data Mining: Applications

The classic: Profitability Scoring Model
  Underwriting/Pricing applications
Credit models
Retention models
Elasticity models
Cross-sell models
Lifetime Value models
Agent/agency monitoring
Target marketing
Fraud detection
Customer segmentation
  no target variable (“unsupervised learning”)


Skills needed

Statistical
  Beyond college/actuarial exams… fast-moving field
Actuarial
  The subject-matter expertise
Programming!
  Need scalable software, computing environment
IT - Systems Administration
  Data extraction, data load, model implementation
Project Management
  Absolutely critical because of the scope & multidisciplinary nature of data mining projects


Data Sources

Company’s internal data
  Policy-level records
  Loss & premium transactions
  Billing
  VIN
  …
Externally purchased data
  Credit
  CLUE
  MVR
  Census
  …


The Data Mining Process


Raw Data

Research/Evaluate possible data sources
  Availability
  Hit rate
  Implementability
  Cost-effectiveness
Extract/purchase data
Check data for quality (QA)
At this stage, data is still in a “raw” form
  Often start with voluminous transactional data
  Much of the data mining process is “messy”


Variable Creation

Create predictive and target variables
  Need good programming skills
  Need domain and business expertise
Steepest part of the learning curve
  Discuss specifics of variable creation with company experts
    Underwriters, Actuaries, Marketers…
  Opportunity to quantify tribal wisdom
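To make variable creation concrete, here is a small sketch that rolls raw transactions up into policy-level variables. The record layout (`policy_id`, `type`, `amount`) and the derived variables are hypothetical examples, not the presenter's data model.

```python
from collections import defaultdict

# Made-up transactional records, for illustration only.
transactions = [
    {"policy_id": "P1", "type": "premium", "amount": 1200.0},
    {"policy_id": "P1", "type": "loss", "amount": 300.0},
    {"policy_id": "P1", "type": "loss", "amount": 450.0},
    {"policy_id": "P2", "type": "premium", "amount": 800.0},
]

def policy_level_variables(rows):
    """Aggregate transactions into one record of derived variables per policy."""
    out = defaultdict(lambda: {"premium": 0.0, "loss": 0.0, "claim_count": 0})
    for row in rows:
        rec = out[row["policy_id"]]
        rec[row["type"]] += row["amount"]      # accumulate premium or loss dollars
        if row["type"] == "loss":
            rec["claim_count"] += 1            # a candidate predictive variable
    for rec in out.values():
        # A candidate target variable; left undefined when there is no premium.
        rec["loss_ratio"] = rec["loss"] / rec["premium"] if rec["premium"] else None
    return dict(out)

variables = policy_level_variables(transactions)
```

Which aggregations are worth building is where the conversations with underwriters and actuaries pay off.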


Variable Transformation

Univariate analysis of predictive variables
Exploratory Data Analysis (EDA)
Data Visualization
Use EDA to cap / transform predictive variables
  Extreme values
  Missing values
  …etc
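One possible treatment of extreme and missing values is sketched below: cap at chosen percentiles and fill gaps with the median. The 1st/99th percentile cut-offs, median imputation and the name `cap_and_fill` are assumptions for illustration; in practice the choices would come out of the EDA.

```python
import statistics

def cap_and_fill(values, lower_pct=0.01, upper_pct=0.99):
    """Cap values at the given percentiles and replace None with the median."""
    present = sorted(v for v in values if v is not None)
    lo = present[int(lower_pct * (len(present) - 1))]   # simple rank-based percentile
    hi = present[int(upper_pct * (len(present) - 1))]
    median = statistics.median(present)
    return [median if v is None else min(max(v, lo), hi) for v in values]

# Made-up variable: 1..99 plus one extreme value and one missing value.
raw = list(range(1, 100)) + [10_000, None]
cleaned = cap_and_fill(raw)
```

Here the outlier 10,000 is pulled back to 99 and the missing value becomes 50.5, the median of the observed values.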


Multivariate Analysis

Examine correlations among the variables
Weed out redundant, weak, poorly distributed variables
Model design
Build candidate models
  Regression/GLM
  Decision Trees/MARS
  Neural Networks
Select final model
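A rough sketch of the correlation screen: compute pairwise Pearson correlations and flag pairs that look redundant. The 0.9 threshold, the variable names and the data are illustrative assumptions, and the code assumes no variable is constant (a constant column would cause a division by zero).

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def redundant_pairs(columns, threshold=0.9):
    """Return pairs of variable names whose absolute correlation meets the threshold."""
    names = sorted(columns)
    return [
        (a, b)
        for i, a in enumerate(names)
        for b in names[i + 1:]
        if abs(pearson(columns[a], columns[b])) >= threshold
    ]

# Made-up variables: model_year is just vehicle_age in disguise.
columns = {
    "vehicle_age": [1, 2, 3, 4, 5],
    "model_year": [2003, 2002, 2001, 2000, 1999],
    "prior_claims": [0, 2, 1, 0, 1],
}
flagged = redundant_pairs(columns)
```

Flagged pairs are candidates for dropping one member before the candidate models are built.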


Model Analysis & Implementation

Perform model analytics
  Necessary for client to gain comfort with the model
Calibrate Models
  Create user-friendly “scale” – client dictates
Implement models
  Programming skills again are critical
Monitor performance
  Distribution of scores/variables, usage of the models, …etc
Plan model maintenance schedule


Model Design

Where Data Mining Needs Actuarial Science


Model Design Issues

Which target variable to use?
  Frequency & severity
  Loss Ratio, other profitability measures
  Binary targets: defection, cross-sell
  …etc
How to prepare the target variable?
  Period - 1-year or Multi-year?
  Losses evaluated @?
  Cap large losses?
  Cat losses?
  How / whether to re-rate, adjust premium?
  What counts as a “retaining” policy?
  …etc


Model Design Issues

Which data points to include/exclude
  Certain classes of business?
  Certain states?
  …etc
Which variables to consider?
  Credit, or non-credit only?
  Include rating variables in the model?
  Exclude certain variables for regulatory reasons?
  …etc
What is the “level” of the model?
  Policy-term level, HH-level, Risk-level …etc
  Or should data be summarized into “cells” à la minimum bias?


Model Design Issues

How should model be evaluated?
  Lift curves, Gains chart, ROC curve?
  How to measure ROI?
  How to split data into train/test/validation? Or cross-validation?
  Is there enough data for lift curve to be “credible”?
  Are your “incredible” results credible?
  …etc
Not an exhaustive list – every project raises different actuarial issues!
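For binary targets, the ROC curve mentioned above is often summarized by the area under it (AUC). A minimal sketch follows, using the pairwise-comparison definition: the probability that a randomly chosen positive case outscores a randomly chosen negative one, with ties counted as half. The scores and labels are made up, and the code assumes both classes are present.

```python
def auc(scores, labels):
    """Area under the ROC curve via pairwise comparison of positives and negatives."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Made-up scores; label 1 marks the event being predicted (e.g. defection).
scores = [0.1, 0.4, 0.35, 0.8]
labels = [0, 0, 1, 1]
area = auc(scores, labels)
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect separation; this pairwise form is quadratic in the data size, so it suits small checks rather than production use.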


Reference

My favorite textbook:

The Elements of Statistical Learning, by Trevor Hastie, Robert Tibshirani, and Jerome Friedman