2012 predictive clusters

Constructing a Credit Risk Scorecard using Predictive Clusters

Darwin Amézquita, Colpatria-Scotia Bank

Alejandro Correa, Colpatria-Scotia Bank

Andrés González, Colpatria-Scotia Bank

Catherine Nieto, Colpatria-Scotia Bank

Contents

Introduction

Data description

General Concepts

Modeling

Results

Conclusions

Introduction

Innovation – Competitiveness.

New solutions using known techniques.

Cluster Analysis.

SAS®.

Objective

Improve credit risk Scorecards.

Cluster analysis.

Descriptive classification technique.

Predictive process.

How?

Methodology 1

Total Population

Credit risk scorecards

Logistic Regression

MLP Neural Network

Methodology 2

Total Population

Cluster analysis

Predictive allocation algorithm

Credit risk scorecards

Final classification

Score

Data Description

Four different databases – Financial products.

Specific default definition.

Variables from X1 to Xn.

Data         Number of Goods   Number of Bads   Total     Bad Rate   Number of Variables
Database 1   81,659            5,394            87,053    6.2%       7
Database 2   12,065            2,258            14,323    15.8%      29
Database 3   50,670            3,797            54,467    7.0%       25
Database 4   71,127            54,430           125,557   43.4%      7

Financial products shown on the slide: Payroll, C.C., Vehicle, No Experience.

General Concepts

Cluster Analysis

K-means

Kohonen SOM

Logistic Regression

Multinomial Logistic Regression

Multi-layer Perceptron Neural Network

Classification distances

Minimum Euclidean Distance

Minimum Adjusted Distance

Minimum Mahalanobis Distance

F1 Score



General Concepts - Cluster Analysis

What is it?

Descriptive process.

Create groups of objects such that the objects in a cluster are more similar to each other than to those in other clusters.

Objectives

Characterize the population.

Understand behaviors.

Identify opportunities.

Apply different treatments.



Different clustering algorithms.

Define the measures of similarity.

Algorithms

K-means.

Kohonen Self-organizing maps (SOM).


K-means

Iterative technique.

Assign a set of n observations.

k number of clusters.

Nearest centroid.

Each observation belongs to one cluster.

Unsupervised algorithm.
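The bullets above can be sketched in a few lines of Python. This is a minimal illustration, not the SAS procedure used in the study; the function name `kmeans`, the random initialization and the stopping rule are assumptions made for the example.

```python
import math
import random

def kmeans(points, k, iterations=100, seed=0):
    """Plain k-means on a list of equal-length tuples (illustrative only)."""
    rng = random.Random(seed)
    # Assumption: initial centroids are k observations drawn at random.
    centroids = rng.sample(points, k)
    assignment = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each observation goes to its nearest centroid,
        # so every observation belongs to exactly one cluster.
        new_assignment = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: each centroid becomes the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, new_assignment) if a == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
        # Stop when no observation changes cluster.
        if new_assignment == assignment:
            break
        assignment = new_assignment
    return centroids, assignment
```

Calling `kmeans(points, k=2)` on two well-separated groups of points returns one centroid per group and a cluster index for every observation. No target variable is used at any step, which is what makes the algorithm unsupervised.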




Kohonen SOM

Unsupervised and iterative algorithm.

Kohonen's learning law.

Winning cluster.

Training case.

Learning rate.


Kohonen SOM

[Figure: Kohonen SOM diagram; the inputs are labeled "K variables".]


Kohonen SOM

Define a learning rate.

Initialize the weights for each node.

Select one training case and calculate the winning cluster.

Update the codebook of the winning cluster.

Convergence check.

STOP.
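The steps of this flowchart can be sketched as follows. The sketch is simplified: it updates only the winning cluster (a full SOM also updates the winner's neighbors on the map), and the decay factor, tolerance and function names are hypothetical choices for illustration.

```python
import math

def som_step(codebook, case, learning_rate):
    """One winner-only Kohonen update; returns the index of the winning cluster."""
    # Winning cluster: the node whose weights are closest to the training case.
    winner = min(range(len(codebook)), key=lambda j: math.dist(codebook[j], case))
    # Kohonen's learning law: move the winner's weights toward the case.
    codebook[winner] = [
        w + learning_rate * (x - w) for w, x in zip(codebook[winner], case)
    ]
    return winner

def train(codebook, cases, learning_rate=0.5, epochs=20, tol=1e-6):
    """Repeat the update over all cases until the codebook stops moving."""
    for _ in range(epochs):
        before = [list(w) for w in codebook]
        for case in cases:
            som_step(codebook, case, learning_rate)
        learning_rate *= 0.9  # hypothetical decay schedule
        # Convergence check: largest movement of any node in this pass.
        shift = max(math.dist(a, b) for a, b in zip(before, codebook))
        if shift < tol:
            break
    return codebook
```

With a learning rate of 0.5, a winner at `[0.0, 0.0]` presented with the case `[2.0, 2.0]` moves halfway, to `[1.0, 1.0]`.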



General Concepts - Logistic Regression

Probability of occurrence of an event.

Determine the relationship between the independent variables and the dichotomous dependent variable.

Fit output values between 0 and 1.

Widely used in credit risk scorecards:

Simplicity.

Interpretability.
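As a small illustration of how a fitted logistic regression maps the independent variables to a value between 0 and 1, the sketch below applies the logistic function to a linear combination. The intercept and coefficients are hypothetical; in practice they come from the fitted model.

```python
import math

def logistic_probability(intercept, coefficients, x):
    """Probability of the event for one client, given fitted parameters."""
    # Linear predictor: intercept plus the weighted sum of the variables.
    z = intercept + sum(b * v for b, v in zip(coefficients, x))
    # The logistic function keeps the output strictly between 0 and 1.
    return 1.0 / (1.0 + math.exp(-z))
```

Each coefficient has a direct reading as the change in log-odds per unit of its variable, which is one reason for the interpretability noted above.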



General Concepts - Multinomial Logistic Regression

Generalization of the logistic regression.

Dependent variable is not dichotomous.

Predict more than two possibilities.

Parameter estimation through maximum likelihood.
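A minimal sketch of how a multinomial logistic regression turns fitted parameters into one probability per category, for example one per cluster. It assumes the usual reference-category parameterization; the parameter values passed in are hypothetical and would normally come from maximum likelihood estimation.

```python
import math

def multinomial_probabilities(intercepts, coefficients, x):
    """Probabilities for the reference category followed by each other category."""
    # The reference category has a linear predictor fixed at 0.
    scores = [0.0] + [
        a + sum(b * v for b, v in zip(bs, x))
        for a, bs in zip(intercepts, coefficients)
    ]
    # Subtract the maximum before exponentiating for numerical stability.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]
```

The probabilities always sum to 1, so they can be used directly as the probability of belonging to each cluster.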



General Concepts – MLP neural network

[Figure: MLP network diagram with an input layer (X1 to Xn), hidden layers 1 to i (units H11 to Hij, with bias terms Bias 1, Bias 2, Bias i) and an output layer (Yi).]

Abstraction of the nervous system.

Collection of interconnected units.
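The forward pass of such a network can be sketched as below. The activation functions (tanh in the hidden layers, logistic at the output) and all weights are assumptions for illustration; the slides do not state which were used.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def layer(inputs, weights, biases, activation):
    """One fully connected layer: each unit combines all inputs plus a bias."""
    return [
        activation(b + sum(w * x for w, x in zip(row, inputs)))
        for row, b in zip(weights, biases)
    ]

def mlp_forward(x, hidden_layers, output_layer):
    """Propagate x through the hidden layers, then through one output unit."""
    h = x
    for weights, biases in hidden_layers:
        h = layer(h, weights, biases, math.tanh)  # assumed hidden activation
    weights, biases = output_layer
    return layer(h, weights, biases, sigmoid)[0]  # assumed logistic output
```

Each hidden layer is passed as a `(weights, biases)` pair with one row of weights per unit, so adding a layer to the list adds a hidden layer to the network.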



General Concepts – Classification Distances

Three algorithms:

Minimum Euclidean distance.

Minimum adjusted distance.

Minimum Mahalanobis distance.

Classify a new client to a corresponding cluster.

Assign to the cluster with the minimum distance.


Minimum Euclidean distance

Assign based on the distance to the centroids of the defined clusters.

[Figure: a new client and its distances to the four cluster centroids: 3.7, 3.3, 8.9 and 6.5.]
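A short sketch of the minimum Euclidean distance rule. The function name and the coordinates in any call are hypothetical; only the rule itself (assign to the closest centroid) comes from the slide.

```python
import math

def nearest_centroid(client, centroids):
    """Return the index of the closest centroid and all the distances."""
    distances = [math.dist(client, c) for c in centroids]
    return distances.index(min(distances)), distances
```

Applied to the four distances shown on the slide (3.7, 3.3, 8.9, 6.5), the rule picks the cluster at distance 3.3.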


Minimum adjusted distance

Defined by the authors.

Assign based on the distance to the cluster radius.

Cluster radius: average distance of all observations within a cluster to its centroid.

[Figure: adjusted distances from the new client to the four clusters: 2.1, 2.9, 8.2 and 6.0.]
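The slides define the cluster radius but do not give the formula for the adjusted distance. The sketch below assumes one plausible reading: the Euclidean distance to the centroid minus the cluster radius. That subtraction, and the function names, are assumptions rather than the authors' stated definition.

```python
import math

def cluster_radius(members, centroid):
    """Average distance of the observations in a cluster to its centroid."""
    return sum(math.dist(m, centroid) for m in members) / len(members)

def adjusted_distances(client, centroids, radii):
    """Assumed form: distance to the centroid minus the cluster radius."""
    return [math.dist(client, c) - r for c, r in zip(centroids, radii)]

def nearest_by_adjusted_distance(client, centroids, radii):
    distances = adjusted_distances(client, centroids, radii)
    return distances.index(min(distances))
```

Under this reading a wide cluster can win even when its centroid is farther away. For a client at (5, 0), a centroid at (0, 0) with radius 3.0 gives an adjusted distance of 2.0, which beats a centroid at (8, 0) with radius 0.5 (adjusted distance 2.5), although the second centroid is closer in plain Euclidean terms.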




Minimum Mahalanobis distance

Considers the correlation between variables.

Scale-invariant.

When the covariance matrix is the identity matrix, it reduces to the Euclidean distance.
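These properties can be checked with a small sketch. To stay dependency-free it is restricted to two variables, with the 2x2 covariance matrix inverted by hand; the function name is hypothetical.

```python
import math

def mahalanobis_2d(x, mean, cov):
    """Mahalanobis distance for two variables; cov is a 2x2 nested tuple."""
    (a, b), (c, d) = cov
    det = a * d - b * c
    # Closed-form inverse of a 2x2 matrix.
    inv = ((d / det, -b / det), (-c / det, a / det))
    dx = (x[0] - mean[0], x[1] - mean[1])
    # Quadratic form dx' * inv * dx.
    q = (dx[0] * (inv[0][0] * dx[0] + inv[0][1] * dx[1])
         + dx[1] * (inv[1][0] * dx[0] + inv[1][1] * dx[1]))
    return math.sqrt(q)
```

With the identity matrix as covariance, the point (3, 4) is at distance 5 from the origin, the same as its Euclidean distance. Rescaling a variable (for example multiplying it by 100, which multiplies its variance by 10,000) leaves the result unchanged, which is the scale invariance noted above.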



General Concepts – The F1 score

Two concepts:

Precision: percentage of observations predicted as good that are actually good.

Recall: percentage of actually good observations that were correctly predicted.

Standard measure of comparison when the outcome is binary.

Classification accuracy.

Harmonic mean of precision and recall.
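A minimal sketch of the calculation, using the standard definitions of precision and recall. The function name and the choice of 1 as the positive label are assumptions for the example.

```python
def f1_score(actual, predicted, positive=1):
    """F1 score for binary labels, computed from counts of hits and misses."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == positive and p == positive)
    fp = sum(1 for a, p in pairs if a != positive and p == positive)
    fn = sum(1 for a, p in pairs if a == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    # Harmonic mean of precision and recall.
    return 2 * precision * recall / (precision + recall)
```

For actual labels `[1, 1, 1, 0, 0]` and predictions `[1, 1, 0, 1, 0]`, precision and recall are both 2/3, so the F1 score is also 2/3.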

Modeling

Logistic regression or neural network model developed for each resulting cluster.

Weighted average of the scores, weighted by the probability of belonging to each cluster.

Vote of each score, weighted by the probability of belonging to each cluster.

Modeling

F1 score is calculated for every model.

248 models.

SAS® base and SAS Enterprise Miner™ procedures.

Modeling

Example of model 13

Cluster scores: 0.90 (Cluster 2), 0.55, 0.35, 0.80

Score: 0.90

Modeling

Example of model 40

Cluster score   Inverse distance to cluster
0.90            1 / 2.9
0.55            1 / 6.0
0.35            1 / 8.2
0.80            1 / 2.1

Score: 0.74

$$\text{Score} = \sum_{i} \text{score}_{\text{cluster } i} \cdot P(\text{being in cluster } i)$$
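The figures in this example can be reproduced with the sketch below. It assumes that the probability of being in each cluster is the inverse distance normalized to sum to 1; the slides do not state this normalization, but it yields the 0.74 shown. The function names are hypothetical.

```python
def cluster_probabilities(distances):
    """Assumed form: inverse distances normalized so that they sum to 1."""
    inverse = [1.0 / d for d in distances]
    total = sum(inverse)
    return [w / total for w in inverse]

def score_ensemble(scores, distances):
    """Average of the cluster scores, weighted by cluster probability."""
    probabilities = cluster_probabilities(distances)
    return sum(s * p for s, p in zip(scores, probabilities))

score = score_ensemble([0.90, 0.55, 0.35, 0.80], [2.9, 6.0, 8.2, 2.1])
print(round(score, 2))  # 0.74
```

The closest cluster (distance 2.1) carries the largest weight, so its score of 0.80 pulls the ensemble more than any other cluster's score.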

Modeling

Example of model 59

Cluster score   Classifier   Inverse distance to cluster
0.90            1            1 / 3.2
0.55            0            1 / 6.4
0.35            0            1 / 7.8
0.80            1            1 / 2.7

Classifier: 1

$$\text{Classifier} = \operatorname{round}\left( \sum_{i} \text{classifier}_{\text{cluster } i} \cdot P(\text{being in cluster } i) \right)$$
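A sketch of the weighted vote in this example. As in the score ensemble, the probability of being in each cluster is assumed to be the normalized inverse distance. The 0/1 votes are taken as given from the slide, since the cutoff that turns each score into a vote is not stated.

```python
def vote_ensemble(votes, distances):
    """Round the probability-weighted average of the 0/1 cluster votes."""
    inverse = [1.0 / d for d in distances]  # assumed weighting
    total = sum(inverse)
    weighted_vote = sum(v * w / total for v, w in zip(votes, inverse))
    return round(weighted_vote), weighted_vote

classifier, weighted_vote = vote_ensemble([1, 0, 0, 1], [3.2, 6.4, 7.8, 2.7])
print(classifier)  # 1
```

The two clusters voting 1 are also the two closest ones, so the weighted vote is about 0.71 and rounds to 1.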

Results

240 predictive clustering models.

8 scorecards for the entire population.

Top 25% and Top 10% of the models.

Contrast on each stage of the modeling process.

Results

Stage 1: Clustering methodology

Clustering Methodology   Top 25%   Top 10%
K-Means                  46%       52%
Kohonen SOM              54%       48%

Opposite relationship.

There is no significant difference between the techniques.

Results

Stage 2: Predictive clusters methodology

Predictive Clusters Methodology   Top 25%   Top 10%
Multinomial Logistic Regression   4%        0%
MLP Neural Network                15%       0%
Minimum Euclidean Distance        24%       28%
Minimum Adjusted Distance         28%       44%
Minimum Mahalanobis Distance      28%       28%

Distance methodologies combined: 81% (Top 25%), 100% (Top 10%).

Distance methodologies are more powerful.

Results

Stage 3: Credit scoring methodology

Credit Scoring Methodology   Top 25%   Top 10%
Logistic Regression          49%       64%
MLP Neural Network           51%       36%

Top 25%: approximately the same percentage.

Top 10%: logistic regression exceeds the neural network.

Logistic regression scorecards perform better on segmented populations.

Results

Stage 4: Final score methodology

Final Score Methodology            Top 25%   Top 10%
Cluster Score                      21%       8%
Score Ensemble                     28%       20%
Classifier Average Vote Ensemble   51%       72%

Classifier Average Vote Ensemble: undisputed winner.

The best method to define the final score is the classifier average vote ensemble.

Results

Entire population models ranking

Database     Logistic Regression position   MLP Neural Network position
             (F1 Score Ranking)             (F1 Score Ranking)
Database 1   34th                           13th
Database 2   4th                            21st
Database 3   32nd                           16th
Database 4   13th                           24th

Best scenario: 4th position (Database 2, Logistic Regression).

The best models according to the F1 score statistic are the ones developed using the predictive clusters methodology.

Conclusions

Clustering methods lead to similar results.

For cluster assignment, distance methods perform better.

Logistic regression could have a higher predictive power after using cluster analysis.

The classifier average vote ensemble produces superior results on the task of defining the final score.

Predictive clusters provide better results than a single scorecard.

Thank You!

Contact information

Darwin Amézquita

Colpatria – Scotia Bank

Bogotá, Colombia

(+57) 301-3372763

[email protected]

Alejandro Correa


Bogotá, Colombia

(+57) 320-8306606

[email protected]

Andrés González


Bogotá, Colombia

(+57) 310-3595239

[email protected]

Catherine Nieto


Bogotá, Colombia

(+57) 315-7426533

[email protected]
