
Presentation at the SAS Global Forum 2012, Orlando, FL. Presenters: Alejandro Correa Bahnsen and Andrés Felipe González Montoya.

Page 1: 2012 predictive clusters

Constructing a Credit Risk Scorecard using Predictive Clusters

Darwin Amézquita, Colpatria-Scotia Bank
Alejandro Correa, Colpatria-Scotia Bank
Andrés González, Colpatria-Scotia Bank
Catherine Nieto, Colpatria-Scotia Bank

Page 2: 2012 predictive clusters

Contents

Introduction

Data description

General Concepts

Modeling

Results

Conclusions

Page 3: 2012 predictive clusters

Introduction

Innovation – Competitiveness.

New solutions using known techniques.

Cluster Analysis.

SAS®.

Page 4: 2012 predictive clusters

Objective

Improve credit risk scorecards.

Cluster analysis: a descriptive classification technique used within a predictive process.

Page 5: 2012 predictive clusters

How?

Methodology 1 (single scorecard):

Total population → credit risk scorecard (logistic regression or MLP neural network) → score.

Methodology 2 (predictive clusters):

Total population → cluster analysis → predictive allocation algorithm → credit risk scorecards (one per cluster) → final classification → score.

Page 6: 2012 predictive clusters

Data Description

Four different databases, one per financial product.

Specific default definition.

Variables from X1 to Xn.

Data         Product         Number of Goods   Number of Bads   Total     Bad Rate   Number of Variables
Database 1   Payroll         81,659            5,394            87,053    6.2%       7
Database 2   C.C.            12,065            2,258            14,323    15.8%      29
Database 3   Vehicle         50,670            3,797            54,467    7.0%       25
Database 4   No experience   71,127            54,430           125,557   43.4%      7

Page 7: 2012 predictive clusters

General Concepts

Cluster Analysis
  K-means
  Kohonen SOM

Logistic Regression

Multinomial Logistic Regression

Multi-layer Perceptron Neural Network

Classification Distances
  Minimum Euclidean Distance
  Minimum Adjusted Distance
  Minimum Mahalanobis Distance

F1 Score

Page 9: 2012 predictive clusters

General Concepts - Cluster Analysis

What is it?

Descriptive process.

Creates groups of objects that are more similar to each other than to objects in other clusters.

Objectives

Characterize the population.

Understand behaviors.

Identify opportunities.

Apply different treatments.

Page 10: 2012 predictive clusters

General Concepts - Cluster Analysis

Page 11: 2012 predictive clusters

General Concepts - Cluster Analysis

Different clustering algorithms are available.

Each requires defining a measure of similarity.

Algorithms

K-means.

Kohonen Self-organizing maps (SOM).

Page 12: 2012 predictive clusters

General Concepts - Cluster Analysis

K-means

Iterative, unsupervised algorithm.

Partitions a set of n observations into k clusters.

Each observation is assigned to the cluster with the nearest centroid.

Each observation belongs to exactly one cluster.
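
A minimal Base SAS sketch of this clustering step (the data set work.applicants, the variables x1-x7, and the choice of 4 clusters are illustrative assumptions, not the authors' setup):

   /* Standardize the predictors so that no variable dominates the distance measure */
   proc stdize data=work.applicants out=work.applicants_std method=std;
      var x1-x7;
   run;

   /* K-means clustering into 4 clusters; OUT= adds CLUSTER and DISTANCE to each
      observation, OUTSEED= keeps the final centroids for allocating new clients */
   proc fastclus data=work.applicants_std out=work.clustered
                 outseed=work.centroids maxclusters=4 maxiter=100;
      var x1-x7;
   run;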

Page 13: 2012 predictive clusters

General Concepts - Cluster Analysis

K-means

Page 14: 2012 predictive clusters

General Concepts - Cluster Analysis

Kohonen SOM

Unsupervised and iterative algorithm.

Weights are updated with Kohonen's learning law: for each training case, the codebook of the winning cluster is moved toward the case according to the learning rate.

Page 15: 2012 predictive clusters

General Concepts - Cluster Analysis

Kohonen SOM

K variables

Page 16: 2012 predictive clusters

General Concepts - Cluster Analysis

Kohonen SOM training loop:

1. Define a learning rate.
2. Initialize the weights for each node.
3. Select one training case and calculate the winning cluster.
4. Update the codebook of the winning cluster.
5. Check for convergence; if not converged, return to step 3, otherwise stop.
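
The update in step 4 is Kohonen's learning law. In its winner-only form (standard notation, not taken from the slides), the codebook vector $w_c$ of the winning cluster moves toward the training case $x$:

$w_c(t+1) = w_c(t) + \alpha(t)\,\big(x(t) - w_c(t)\big)$

where $\alpha(t)$ is the learning rate, typically decreased as training proceeds.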

Page 18: 2012 predictive clusters

General Concepts - Logistic Regression

Models the probability of occurrence of an event.

Determines the relationship between the independent variables and a dichotomous dependent variable.

Output values are bounded between 0 and 1.

Widely used in credit risk scorecards because of its:

Simplicity.

Interpretability.
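
A minimal SAS sketch of a logistic regression scorecard fitted on one cluster (the data set, the variable names default_flag and x1-x7, and the stepwise selection are illustrative assumptions, not the authors' code):

   proc logistic data=work.clustered descending;
      where cluster = 1;                   /* one scorecard per cluster        */
      model default_flag = x1-x7 / selection=stepwise;
      output out=work.scored_c1 p=score;   /* predicted probability of default */
   run;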

Page 20: 2012 predictive clusters

General Concepts - Multinomial Logistic Regression

Generalization of logistic regression.

The dependent variable is not dichotomous: more than two outcomes can be predicted.

Parameters are estimated through maximum likelihood.
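
Here the multinomial model serves as the predictive allocation algorithm: it predicts the cluster a new client belongs to. A minimal SAS sketch using the generalized logit link (data set and variable names are assumptions):

   proc logistic data=work.clustered;
      model cluster = x1-x7 / link=glogit;              /* multinomial logit     */
      score data=work.new_clients out=work.allocated;   /* cluster probabilities */
   run;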

Page 22: 2012 predictive clusters

General Concepts – MLP neural network

[Diagram: multi-layer perceptron with an input layer (X1 … Xn), hidden layers 1 to i (nodes H11 … Hij plus a bias node per layer), and an output layer Yi.]

Abstraction of the nervous system.

Collection of interconnected units.

Page 24: 2012 predictive clusters

General Concepts – Classification Distances

Three algorithms:

Minimum Euclidean distance.

Minimum adjusted distance.

Minimum Mahalanobis distance.

Used to classify a new client into its corresponding cluster.

The client is assigned to the cluster with the minimum distance.

Page 25: 2012 predictive clusters

General Concepts – Classification Distances

Minimum Euclidean distance

Assigns based on the distance to the centroids of the defined clusters.

[Diagram: example Euclidean distances from a new client to the four centroids: 3.7, 3.3, 8.9, 6.5; the client is assigned to the centroid at distance 3.3.]
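
A minimal SAS sketch of this allocation, reusing centroids saved from the original clustering run (data set names are assumptions). With MAXITER=0 and REPLACE=NONE, PROC FASTCLUS does not re-cluster: it simply assigns each observation to the nearest seed.

   proc fastclus data=work.new_clients_std seed=work.centroids
                 maxclusters=4 maxiter=0 replace=none out=work.euclid_alloc;
      var x1-x7;   /* OUT= contains the assigned CLUSTER and its DISTANCE */
   run;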

Page 26: 2012 predictive clusters

General Concepts – Classification Distances

Minimum adjusted distance

Defined by the authors.

Assigns based on the distance to each centroid adjusted by the cluster radius.

The radius is the average distance of all observations within a cluster to its centroid.

[Diagram: example adjusted distances from a new client to the four clusters: 2.1, 2.9, 6.0, 8.2; the client is assigned to the cluster at adjusted distance 2.1.]
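
A minimal SAS DATA step sketch of the adjusted-distance allocation. It assumes the Euclidean distances d1-d4 to the four centroids have already been computed; the radii values are illustrative, not taken from the paper.

   data work.adjusted_alloc;
      set work.distances;                         /* d1-d4: distance to each centroid */
      array d{4} d1-d4;
      array r{4} _temporary_ (1.8 2.3 2.0 2.6);   /* cluster radii (illustrative)     */
      adj_min = .;
      cluster = .;
      do i = 1 to 4;
         if adj_min = . or d{i}/r{i} < adj_min then do;
            adj_min = d{i}/r{i};                  /* distance adjusted by the radius  */
            cluster = i;
         end;
      end;
      drop i;
   run;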

Page 27: 2012 predictive clusters

General Concepts – Classification Distances

Minimum adjusted distance

Page 28: 2012 predictive clusters

General Concepts – Classification Distances

Minimum Mahalanobis distance

Considers the correlation between variables.

Scale-invariant.

When the covariance matrix is the identity matrix, it reduces to the Euclidean distance.
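
For reference, the standard definition (notation assumed, not shown on the slide): the Mahalanobis distance from a client $x$ to the centroid $\mu_i$ of cluster $i$ with covariance matrix $\Sigma_i$ is

$d_M(x, \mu_i) = \sqrt{(x - \mu_i)^{\mathsf{T}} \, \Sigma_i^{-1} \, (x - \mu_i)}$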

Page 30: 2012 predictive clusters

General Concepts – The F1 score

Standard measure of comparison when the outcome is binary.

Measures classification accuracy by combining two concepts:

Precision: percentage of observations predicted as positive that are truly positive.

Recall: percentage of truly positive observations that were correctly predicted.

The F1 score is the harmonic mean of precision and recall.
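
Written out (standard formulation, not shown on the slide):

$F_1 = 2 \cdot \dfrac{\text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}$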

Page 31: 2012 predictive clusters

Modeling

A logistic regression or neural network model is developed for each resulting cluster.

Weighted average of the scores, weighted by the probability of belonging to each cluster.

Vote of each score, weighted by the probability of belonging to each cluster.

Page 32: 2012 predictive clusters

Modeling

The F1 score is calculated for every model.

248 models in total.

Implemented with Base SAS® and SAS® Enterprise Miner™ procedures.

Page 33: 2012 predictive clusters

Modeling

Example of model 13

Page 34: 2012 predictive clusters

Modeling

Example of model 13

The four cluster scorecards produce scores 0.90 (Cluster 2), 0.55, 0.35, and 0.80. The client is allocated to Cluster 2, so the final score is that cluster's scorecard output: 0.90.

Page 35: 2012 predictive clusters

Modeling

Example of model 40

Page 36: 2012 predictive clusters

Modeling

Example of model 40

The cluster scorecards produce scores 0.90, 0.55, 0.35, and 0.80. The probability of belonging to each cluster is taken as proportional to the inverse of the client's distance to that cluster: 1/2.9, 1/6.0, 1/8.2, and 1/2.1, respectively. The weighted average of the scores gives a final score of 0.74.

$\text{Score} = \sum_{i} \text{score}_{\text{cluster } i} \cdot P(\text{being in cluster } i)$
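
Spelling out the weighted average with the inverse-distance weights normalized to sum to one (the normalization is implied by the result, not shown on the slide):

$\text{Score} = \dfrac{0.90/2.9 + 0.55/6.0 + 0.35/8.2 + 0.80/2.1}{1/2.9 + 1/6.0 + 1/8.2 + 1/2.1} \approx 0.74$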

Page 37: 2012 predictive clusters

Modeling

Example of model 59

Page 38: 2012 predictive clusters

Modeling

Example of model 59

Each cluster scorecard output is first converted into a binary classification: 0.90 → 1, 0.55 → 0, 0.35 → 0, 0.80 → 1. The votes are weighted by the probability of belonging to each cluster, taken as proportional to the inverse distances 1/3.2, 1/6.4, 1/7.8, and 1/2.7. Rounding the weighted vote gives a final classification of 1.

$\text{Classifier} = \operatorname{round}\left( \sum_{i} \text{classifier}_{\text{cluster } i} \cdot P(\text{being in cluster } i) \right)$
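
With the weights normalized, the clusters voting 1 carry about 71% of the total weight, so the rounded vote is 1:

$\text{Classifier} = \operatorname{round}\!\left( \dfrac{1/3.2 + 1/2.7}{1/3.2 + 1/6.4 + 1/7.8 + 1/2.7} \right) = \operatorname{round}(0.71) = 1$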

Page 39: 2012 predictive clusters

Results

240 predictive clustering models.

8 scorecards for the entire population.

The Top 25% and the Top 10% of the models (ranked by F1 score) are contrasted at each stage of the modeling process.

Page 40: 2012 predictive clusters

Results

Stage 1: Clustering methodology

Clustering Methodology   Top 25%   Top 10%
K-means                  46%       52%
Kohonen SOM              54%       48%

There is no significant difference between the two techniques; the ranking even reverses between the Top 25% and the Top 10%.

Page 41: 2012 predictive clusters

Results

Stage 2: Predictive clusters methodology

Predictive Clusters Methodology   Top 25%   Top 10%
Multinomial Logistic Regression   4%        0%
MLP Neural Network                15%       0%
Minimum Euclidean Distance        24%       28%
Minimum Adjusted Distance         28%       44%
Minimum Mahalanobis Distance      28%       28%

The three distance methods account for 81% of the Top 25% and 100% of the Top 10%: the distance methodologies are more powerful.

Page 42: 2012 predictive clusters

Results

Stage 3: Credit scoring methodology

Credit Scoring Methodology   Top 25%   Top 10%
Logistic Regression          49%       64%
MLP Neural Network           51%       36%

In the Top 25% the two techniques appear in approximately the same proportion, but in the Top 10% logistic regression clearly exceeds the neural network: logistic regression scorecards perform better on segmented populations.

Page 43: 2012 predictive clusters

Results

Stage 4: Final score methodology

Final Score Methodology            Top 25%   Top 10%
Cluster Score                      21%       8%
Score Ensemble                     28%       20%
Classifier Average Vote Ensemble   51%       72%

The undisputed winner: the best method to define the final score is the classifier average vote ensemble.

Page 44: 2012 predictive clusters

Results

Entire-population models ranking: position of each single scorecard among all models for its database, by F1 score.

Database     Logistic Regression   MLP Neural Network
Database 1   34th                  13th
Database 2   4th                   21st
Database 3   32nd                  16th
Database 4   13th                  24th

Even in the best scenario, an entire-population scorecard only reaches 4th place: the best models according to the F1 score are the ones developed with the predictive clusters methodology.

Page 45: 2012 predictive clusters

Conclusions

The two clustering methods lead to similar results.

For cluster assignment, distance methods perform better.

Logistic regression can achieve higher predictive power after applying cluster analysis.

The classifier average vote ensemble produces superior results when defining the final score.

Predictive clusters provide better results than a single scorecard.

Page 46: 2012 predictive clusters

Thank You!

Page 47: 2012 predictive clusters

Contact information

Darwin Amézquita

Colpatria – Scotia Bank

Bogotá, Colombia

(+57) 301-3372763

[email protected]

Alejandro Correa

Colpatria – Scotia Bank

Bogotá, Colombia

(+57) 320-8306606

[email protected]

Andrés González

Colpatria – Scotia Bank

Bogotá, Colombia

(+57) 310-3595239

[email protected]

Catherine Nieto

Colpatria – Scotia Bank

Bogotá, Colombia

(+57) 315-7426533

[email protected]