Presentation at the SAS Global Forum 2012, Orlando, FL. Presenters: Alejandro Correa Bahnsen and Andres Felipe Gonzalez Montoya.
Constructing a Credit Risk Scorecard using Predictive Clusters
Darwin Amézquita, Colpatria-Scotia Bank
Alejandro Correa, Colpatria-Scotia Bank
Andrés González, Colpatria-Scotia Bank
Catherine Nieto, Colpatria-Scotia Bank
Contents
Introduction
Data description
General Concepts
Modeling
Results
Conclusions
Introduction
Innovation – Competitiveness.
New solutions using known techniques.
Cluster Analysis.
SAS®.
Objective
Improve credit risk Scorecards.
Cluster analysis.
Descriptive classification technique.
Predictive process.
How?
Methodology 1
Total Population
Credit risk scorecards
Logistic Regression
MLP Neural Network
Methodology 2
Total Population
Cluster analysis
Predictive allocation algorithm
Credit risk scorecards
Final classification
Score
Data Description
Four different databases – Financial products.
Specific default definition.
Variables from X1 to Xn.
Data         Number of Goods   Number of Bads   Total     Bad Rate   Number of Variables
Database 1   81,659            5,394            87,053    6.2%       7
Database 2   12,065            2,258            14,323    15.8%      29
Database 3   50,670            3,797            54,467    7.0%       25
Database 4   71,127            54,430           125,557   43.4%      7
Product labels shown on the slide: Payroll, C.C., Vehicle, No Experience.
General Concepts
Cluster Analysis
K-means
Kohonen SOM
Logistic Regression
Multinomial Logistic Regression
Multi-layer Perceptron Neural Network
Classification distances
Minimum Euclidean Distance
Minimum Adjusted Distance
Minimum Mahalanobis Distance
F1 Score
General Concepts - Cluster Analysis
What is it?
Descriptive process.
Group objects so that those in the same cluster are more similar to each other than to those in other clusters.
Objectives
Characterize the population.
Understand behaviors.
Identify opportunities.
Apply different treatments.
General Concepts - Cluster Analysis
Different clustering algorithms.
Define the measures of similarity.
Algorithms
K-means.
Kohonen Self-organizing maps (SOM).
General Concepts - Cluster Analysis
K-means
Iterative technique.
Assigns a set of n observations to k clusters.
Each observation goes to the cluster with the nearest centroid.
Each observation belongs to one cluster.
Unsupervised algorithm.
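The K-means bullets above can be sketched in a few lines of NumPy. This is only an illustrative sketch, not the SAS procedure used in the study: the function name `kmeans`, the random initialization, and the toy data are assumptions made for the example.

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Minimal K-means sketch: iterate assignment and centroid update."""
    rng = np.random.default_rng(seed)
    # Assumption: start from k observations picked at random.
    centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(n_iter):
        # Distance of every observation to every centroid, shape (n, k).
        dist = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        # Each observation belongs to exactly one cluster: the nearest centroid.
        labels = dist.argmin(axis=1)
        # Recompute each centroid as the mean of its members.
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break  # assignments are stable
        centroids = new_centroids
    return centroids, labels

# Toy data: two well-separated groups of three points.
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
              [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]])
centroids, labels = kmeans(X, k=2)
```

No class label is used anywhere in the loop, which is what makes the algorithm unsupervised.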
General Concepts - Cluster Analysis
Kohonen SOM
Unsupervised and iterative algorithm.
Kohonen's learning law.
Winning cluster.
Training case.
Learning rate.
[Figure: Kohonen SOM network diagram; input labeled "K variables".]
General Concepts - Cluster Analysis
Kohonen SOM
Define a learning rate.
Initialize the weights for each node.
Select one training case and calculate the winning cluster.
Update the codebook of the winning cluster.
Convergence check; STOP when converged.
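The flow above can be sketched as a short training loop. This is a simplified sketch under stated assumptions: only the winning cluster is updated (no neighborhood function), the learning rate decays as `learning_rate / (1 + epoch)`, and convergence is judged by how much the codebook moved in one pass. None of these choices is given in the slides, and `train_kohonen` is a hypothetical name.

```python
import numpy as np

def train_kohonen(X, n_clusters, learning_rate=0.5, n_epochs=50, seed=0):
    """Winner-only Kohonen training sketch."""
    rng = np.random.default_rng(seed)
    # Initialize the weights (codebook) of each node from random observations.
    codebook = X[rng.choice(len(X), size=n_clusters, replace=False)].astype(float)
    for epoch in range(n_epochs):
        rate = learning_rate / (1 + epoch)  # assumption: decaying learning rate
        previous = codebook.copy()
        for x in X[rng.permutation(len(X))]:
            # Winning cluster: the node whose codebook is closest to the case.
            winner = np.argmin(np.linalg.norm(codebook - x, axis=1))
            # Kohonen's learning law: move the winner toward the training case.
            codebook[winner] = codebook[winner] * (1 - rate) + x * rate
        # Convergence check (assumed criterion): codebook barely moved.
        if np.max(np.abs(codebook - previous)) < 1e-6:
            break
    return codebook

X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
              [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]])
codebook = train_kohonen(X, n_clusters=2)
```

Each update is a convex combination of the old codebook vector and the training case, so the codebook never leaves the range spanned by the data.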
General Concepts - Logistic Regression
Models the probability of occurrence of an event.
Determines the relationship between the independent variables and the dichotomous dependent variable.
Fits output values between 0 and 1.
Widely used in credit risk scorecards:
Simplicity.
Interpretability.
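As a small illustration of how a fitted logistic regression turns variables into a value between 0 and 1, here is a sketch of the logistic function applied to a linear combination. The coefficients, intercept, and function name are made up for the example; they are not taken from the scorecards in this study.

```python
import math

def logistic_probability(x, coefficients, intercept):
    """Probability of the event given variables x (logistic model)."""
    # Linear predictor: intercept plus the weighted sum of the variables.
    z = intercept + sum(b * v for b, v in zip(coefficients, x))
    # The logistic function maps any real number into (0, 1).
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical client with two variables and hypothetical coefficients.
p = logistic_probability([1.0, 2.0], coefficients=[0.5, -0.25], intercept=0.0)
```

Each coefficient reads directly as the change in log-odds per unit of its variable, which is the interpretability the slide refers to.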
General Concepts - Multinomial Logistic Regression
Generalization of the logistic regression.
Dependent variable is not dichotomous.
Predict more than two possibilities.
Parameter estimation through maximum likelihood.
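A sketch of how a multinomial logistic model produces one probability per outcome, for instance one per cluster. The weight matrix `W`, the intercepts `b`, and the function name are hypothetical; in practice they would be the maximum likelihood estimates.

```python
import numpy as np

def multinomial_probabilities(x, W, b):
    """Softmax probabilities for each of the possible outcomes."""
    z = W @ x + b        # one linear predictor per outcome
    z = z - z.max()      # shift for numerical stability; result is unchanged
    e = np.exp(z)
    return e / e.sum()   # probabilities sum to 1

# Hypothetical model with 3 outcomes and 2 variables.
W = np.array([[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]])
b = np.zeros(3)
probs = multinomial_probabilities(np.array([2.0, 0.5]), W, b)
predicted = int(np.argmax(probs))
```

With two outcomes this reduces to the ordinary logistic regression.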
General Concepts – MLP neural network
[Figure: MLP network with an input layer (X1 ... Xn), hidden layers 1 to i (nodes H11 ... Hij, each layer with a bias unit), and an output layer (Yi).]
Abstraction of the nervous system.
Collection of interconnected units.
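A forward pass through such a network can be sketched as follows. The activation functions (tanh in the hidden layers, logistic at the output) and the layer sizes are assumptions for the example; the slides do not state them.

```python
import numpy as np

def mlp_forward(x, layers):
    """Forward pass; `layers` is a list of (weights, bias) pairs."""
    a = np.asarray(x, dtype=float)
    for W, b in layers[:-1]:
        a = np.tanh(W @ a + b)           # hidden layer (assumed tanh)
    W, b = layers[-1]
    z = W @ a + b
    return 1.0 / (1.0 + np.exp(-z))      # output in (0, 1) (assumed logistic)

# Hypothetical network: 5 inputs, two hidden layers of 3 units, 1 output.
rng = np.random.default_rng(0)
layers = [
    (rng.normal(size=(3, 5)), np.zeros(3)),
    (rng.normal(size=(3, 3)), np.zeros(3)),
    (rng.normal(size=(1, 3)), np.zeros(1)),
]
y = mlp_forward(np.ones(5), layers)
```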
General Concepts – Classification Distances
Three algorithms:
Minimum Euclidean distance.
Minimum adjusted distance.
Minimum Mahalanobis distance.
Classify a new client into its corresponding cluster.
Assign to the cluster with the minimum distance.
General Concepts – Classification Distances
Minimum Euclidean distance
Assign based on the distance to the centroids of the defined clusters.
Example (figure): distances from the new client to the four centroids: 3.7, 3.3, 8.9, 6.5.
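A minimal sketch of the minimum Euclidean distance rule. The centroid coordinates and the function name are hypothetical; only the rule itself (assign to the nearest centroid) comes from the slide.

```python
import numpy as np

def assign_min_euclidean(x, centroids):
    """Return the index of the nearest centroid and all distances."""
    distances = np.linalg.norm(np.asarray(centroids) - np.asarray(x), axis=1)
    return int(np.argmin(distances)), distances

# Hypothetical centroids of three clusters and one new client.
centroids = np.array([[0.0, 0.0], [3.0, 4.0], [10.0, 0.0]])
cluster, distances = assign_min_euclidean([3.0, 3.0], centroids)
```

Applied to the four distances shown on the slide, the same rule would pick the cluster at distance 3.3.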
General Concepts – Classification Distances
Minimum adjusted distance
Defined by the authors.
Assign based on the distance to the cluster radius.
Radius: average distance of all observations within a cluster to its centroid.
Example (figure): adjusted distances from the new client to the four clusters: 2.1, 2.9, 8.2, 6.0.
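The slides do not give the formula for the adjusted distance. One plausible reading, used in the sketch below purely as an assumption, is the Euclidean distance to the centroid minus the cluster radius. The data and function names are hypothetical.

```python
import numpy as np

def cluster_radius(members, centroid):
    """Average distance of a cluster's observations to its centroid."""
    diffs = np.asarray(members, dtype=float) - np.asarray(centroid, dtype=float)
    return float(np.mean(np.linalg.norm(diffs, axis=1)))

def adjusted_distances(x, centroids, radii):
    # ASSUMED formula: distance to the centroid minus the cluster radius.
    euclidean = np.linalg.norm(np.asarray(centroids) - np.asarray(x), axis=1)
    return euclidean - np.asarray(radii)

# Hypothetical case: a wide cluster (index 0) and a tight cluster (index 1).
centroids = np.array([[0.0, 0.0], [6.0, 0.0]])
radii = np.array([2.0, 0.5])
x = np.array([3.2, 0.0])
adjusted = adjusted_distances(x, centroids, radii)
cluster = int(np.argmin(adjusted))
```

In this toy case the client is closer to the centroid of the tight cluster (2.8 versus 3.2), yet the adjusted distance assigns it to the wide cluster, because the wide cluster's boundary is nearer.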
General Concepts – Classification Distances
Minimum Mahalanobis distance
Considers the correlation between variables.
Scale-invariant.
When the covariance matrix is the identity matrix, it reduces to the Euclidean distance.
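A short sketch of the Mahalanobis distance between a client and a cluster centroid. Whether the covariance matrix is estimated per cluster or pooled is not stated in the slides; the matrices below are hypothetical.

```python
import numpy as np

def mahalanobis(x, centroid, cov):
    """Mahalanobis distance of x to a centroid under covariance `cov`."""
    diff = np.asarray(x, dtype=float) - np.asarray(centroid, dtype=float)
    # Solve cov @ v = diff rather than inverting the matrix explicitly.
    return float(np.sqrt(diff @ np.linalg.solve(cov, diff)))

# With the identity matrix the result equals the Euclidean distance.
d_identity = mahalanobis([3.0, 4.0], [0.0, 0.0], np.eye(2))

# A variable with larger variance contributes less per unit of difference.
d_scaled = mahalanobis([2.0, 0.0], [0.0, 0.0], np.diag([4.0, 1.0]))
```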
General Concepts – The F1 score
Standard measure of comparison when the outcome is binary.
Two concepts:
Precision: classification accuracy of the positive predictions; percentage of observations predicted as good that are really good.
Recall: percentage of real good observations that were correctly predicted.
Harmonic mean between precision and recall.
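For reference, a small sketch of the F1 computation from true and predicted binary labels. The label vectors are invented for the example, and treating 1 as the positive class is an assumption.

```python
def f1_score(y_true, y_pred):
    """F1 score for binary labels, with 1 as the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    # Harmonic mean of precision and recall.
    return 2 * precision * recall / (precision + recall)

# Invented example: 2 true positives, 1 false positive, 1 false negative.
f1 = f1_score([1, 1, 1, 0, 0], [1, 1, 0, 1, 0])
```

Here precision and recall are both 2/3, so the F1 score is also 2/3.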
Modeling
Logistic regression or neural network model developed for each resulting cluster.
Weighted average of the scores, weighted by the probability of belonging to each cluster.
Vote of each score, weighted by the probability of belonging to each cluster.
Modeling
F1 score is calculated for every model.
248 models.
SAS® base and SAS Enterprise Miner™ procedures.
Modeling
Example of model 13
Cluster scores: 0.90, 0.55, 0.35, 0.80
Assigned cluster: Cluster 2
Score: 0.90
Modeling
Example of model 40
Cluster scores: 0.90, 0.55, 0.35, 0.80
Weights: 1/2.9, 1/6.0, 1/8.2, 1/2.1
Score: 0.74
$$\text{Score} = \sum_i \text{score}_{\text{cluster } i} \times P(\text{being in cluster } i)$$
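The arithmetic of this example can be checked with a few lines. The sketch assumes that the probability of being in each cluster is the inverse distance normalized to sum to 1, and that the four scores pair with the four weights in the order listed. Both are assumptions, although together they reproduce the 0.74 shown on the slide.

```python
def membership_probabilities(distances):
    """Assumed rule: normalized inverse distances."""
    inverse = [1.0 / d for d in distances]
    total = sum(inverse)
    return [w / total for w in inverse]

def score_ensemble(scores, distances):
    """Weighted average of the cluster scores."""
    probabilities = membership_probabilities(distances)
    return sum(s * p for s, p in zip(scores, probabilities))

scores = [0.90, 0.55, 0.35, 0.80]   # cluster scores from the slide
distances = [2.9, 6.0, 8.2, 2.1]    # denominators of the slide's weights
final_score = score_ensemble(scores, distances)
```

Rounded to two decimals, `final_score` is 0.74.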
Modeling
Example of model 59
Cluster scores and votes: 0.90 → 1, 0.55 → 0, 0.35 → 0, 0.80 → 1
Weights: 1/3.2, 1/6.4, 1/7.8, 1/2.7
Classifier: 1
$$\text{Classifier} = \operatorname{round}\left(\sum_i \text{score classifier}_i \times P(\text{being in cluster } i)\right)$$
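The vote in this example can be reproduced the same way. The sketch takes the four 0/1 votes as given, since the cutoff that turns each score into a vote is not stated in the slides, and it assumes the probabilities are normalized inverse distances paired with the votes in the order listed.

```python
def membership_probabilities(distances):
    """Assumed rule: normalized inverse distances."""
    inverse = [1.0 / d for d in distances]
    total = sum(inverse)
    return [w / total for w in inverse]

def vote_ensemble(votes, distances):
    """Round the probability-weighted average of the cluster votes."""
    probabilities = membership_probabilities(distances)
    weighted_vote = sum(v * p for v, p in zip(votes, probabilities))
    return round(weighted_vote), weighted_vote

votes = [1, 0, 0, 1]                # votes from the slide
distances = [3.2, 6.4, 7.8, 2.7]    # denominators of the slide's weights
classifier, weighted_vote = vote_ensemble(votes, distances)
```

The weighted vote is about 0.71, which rounds to the classifier value of 1 shown on the slide.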
Results
240 predictive clustering models.
8 scorecards for the entire population.
Top 25% and Top 10% of the models.
Comparison at each stage of the modeling process.
Results
Stage 1: Clustering methodology
Clustering Methodology   Top 25%   Top 10%
K-Means                  46%       52%
Kohonen SOM              54%       48%
Opposite relationship between the Top 25% and the Top 10%.
There is no significant difference between the techniques.
Results
Stage 2: Predictive clusters methodology
Predictive Clusters Methodology   Top 25%   Top 10%
Multinomial Logistic Regression   4%        0%
MLP Neural Network                15%       0%
Minimum Euclidean Distance        24%       28%
Minimum Adjusted Distance         28%       44%
Minimum Mahalanobis Distance      28%       28%
Distance methodologies combined: 81% of the Top 25% and 100% of the Top 10%.
Distance methodologies are more powerful.
Results
Stage 3: Credit scoring methodology
Credit Scoring Methodology   Top 25%   Top 10%
Logistic Regression          49%       64%
MLP Neural Network           51%       36%
Top 25%: approximately the same percentage.
Top 10%: logistic regression exceeds the neural network.
Logistic regression scorecards perform better on segmented populations.
Results
Stage 4: Final score methodology
Final Score Methodology            Top 25%   Top 10%
Cluster Score                      21%       8%
Score Ensemble                     28%       20%
Classifier Average Vote Ensemble   51%       72%
Classifier Average Vote Ensemble: undisputed winner.
The best method to define the final score is the classifier average vote ensemble.
Results
Entire population models ranking
Database     Logistic Regression position   MLP Neural Network position
             (F1 Score Ranking)             (F1 Score Ranking)
Database 1   34th                           13th
Database 2   4th                            21st
Database 3   32nd                           16th
Database 4   13th                           24th
Best scenario: 4th place (Database 2, Logistic Regression).
The best models according to the F1 score are the ones developed using the predictive clusters methodology.
Conclusions
Clustering methods lead to similar results.
For cluster assignment, distance methods perform better.
Logistic regression could have higher predictive power after using cluster analysis.
The classifier average vote ensemble produces superior results when defining the final score.
Predictive clusters provide better results than a single scorecard.
Thank You!
Contact information
Darwin Amézquita
Colpatria – Scotia Bank
Bogotá, Colombia
(+57) 301-3372763
Alejandro Correa
Colpatria – Scotia Bank
Bogotá, Colombia
(+57) 320-8306606
Andrés González
Colpatria – Scotia Bank
Bogotá, Colombia
(+57) 310-3595239
Catherine Nieto
Colpatria – Scotia Bank
Bogotá, Colombia
(+57) 315-7426533