AN EXL WHITE PAPER

Guide to Segmentation for Survival Models using SAS

Written by:
Swagata Majumder, Senior Manager, EXL

Contributor:
Alok Rustagi, Vice President, EXL

lookdeeper@exlservice.com
The approach relies on comparing the
survivor functions across sub-groups
through the Log-Rank Test (PROC
LIFETEST). The examples given in this
paper are from the credit card domain,
but the technique can be effectively
applied to any kind of survival data to
generate intuitive segmentation trees.
The importance of segmentation in any
kind of modelling exercise is undeniable.
Segmentation into different population
sets enables a modeller to develop
separate models for different subsets of
the population. This often outperforms a
single standalone model through higher
accuracy in predictions, lower bias, or
both. The relationship between the
predictors and the target variable often
differs across subpopulations; a segmented
model captures this effectively, leading to
better performance.
Popular segmentation tools like
Classification and Regression Trees (CART)
were originally developed to analyse
cross-sectional data, where several
subjects are observed at the same point
in time. Applying these techniques is
complex in the case of survival data, where
subjects are observed over different
periods of time, or until the event of
interest occurs. A distinguishing feature
of this kind of data, called "censoring",
can make it difficult to handle with
conventional statistical methods. In
simple terms, if subjects are observed
over a five-year duration to see whether
an event of interest (for example, default
on credit card payment) occurs, there
will be subjects at the end of the study
who do not default within the time period.
Such cases are referred to as censored.
It is not known when or if a censored
customer will experience the event, only
that he or she has not done so by the end
of the observation period.

This paper highlights how to tackle the segmentation structure in the case of survival data, and
elaborates on its implementation in SAS. This is a step prior to the actual model-building exercise:
dividing the population into segments which are homogeneous within themselves and
heterogeneous amongst themselves, so that separate probability-of-default models can be developed
on each of these segments.
Methodology

Interest in using survival analysis for credit
scoring is quite recent, and is aimed at
assessing the risk of customers who have
already been assigned credit cards¹. The
reason is that the objective of credit scoring,
also known as credit risk modelling, has
recently shifted towards choosing the
customers that will provide the highest
profit. To do so, lenders must consider
not only if a customer will default, but also
when they will default. This knowledge can
be gained through survival models. The
use of survival models also avoids the need
to define a fixed period within which the
default event is measured – a step inherent
to logistic regression. They also allow the
inclusion of behavioural and economic
risk factors over time, like macroeconomic
variables. There are several alternative
survival models to estimate the hazard/
survivor function, the most popular of them
in credit scoring literature being the Cox
Proportional Hazards (PH) model.
However, before proceeding to the actual
modelling exercise, it often makes sense
to split the data into sub groups and build
separate models for each of these groups.
This allows for a much greater level of
accuracy in predictions and portfolio
management. The question then becomes
how many models are optimal, and which
segmentation structure will deliver the
best business results for a client.
In the case of cross-sectional data, a
classification tree technique called
CHAID (Chi-Square Automatic Interaction
Detection) is very popular for segmentation.
This technique recursively partitions a
population into separate and distinct
groups defined by a set of independent
predictor variables, such that the variance
of the target variable is minimized within
the groups and maximized across the
groups. The advantage of CHAID is that
the output is highly visual and easy to
interpret. The development of the decision,
or classification tree, starts with identifying
the target variable or dependent variable
which would be considered the root.
CHAID analysis splits the target into two or
more categories that are called the initial, or
parent nodes, and then the nodes are split
using statistical algorithms into child nodes.
The methodology outlined in this paper
is somewhat inspired by CHAID, but it has
been adapted to suit the requirements of
a time series data structure. This paper
provides a step by step guide to choosing
the appropriate predictor or predictors to
segment the population, and highlights
how to use a Log-Rank test to decide the
potential candidates for segmentation so
that the underlying survivor functions of the
sub groups are statistically different from
each other.
A. DATA STRUCTURE
The first step is to develop an appropriate
segmentation structure, so that separate
survival models can be built for each
of these segments. The segmentation
structure should ensure that accounts
with similar default patterns are grouped
together.
The data available for this example analysis
is at the account and monthly level: for
each month (referred to as a snapshot
from now on), there is a dataset containing
all the non-defaulted accounts as of that
snapshot, and their characteristics like
months on book (MOB), delinquency status
(DELQ), utilization (UTIL), balance (BAL),
payments (PMT), full-payer indicator (FULL_
PAY_IND) and so on. In addition, there are
two variables which denote the default
performance of that account using the
most recent date till which data is available:
default indicator (t_PD), which is a binary
variable taking a value of 1 if the account
has ever defaulted and 0 otherwise, and
default month (def_month), which denotes
the month when the account defaulted,
taking a value of 0 if the account has never
defaulted.
An example of a snippet of the data
structure for a particular snapshot (in this
case, May 2014) is illustrated below. A similar
dataset will be available for other snapshots
as well.
Snapshot   Account ID   MOB   DELQ   UTIL   BAL    PMT   FULL_PAY_IND   t_PD   def_month
201405     A1           36    1      80     1000   20    0              1      14
201405     A2           60    0      50     800    100   1              0      0
201405     A3           5     2      90     920    40    0              1      3

Table 1: Snapshot of Account Data
For Example Purposes Only
A1, A2 and A3 have a non-default status as
of May 2014. Assuming that data is available
until December 2015, each account can
be observed over a performance window
of 19 months from the snapshot
date. A1 and A3 hit a default status in this
performance window, as indicated by the
value of t_PD. The variable def_month
takes a value of 14 for A1 and 3 for A3,
implying that they default in July 2015 and
August 2014 respectively. On the other
hand A2 does not default over the entire
performance window, and hence both t_PD
and def_month are 0 for this account. In the
context of survival analysis, A2 is a censored
case, as the study ends before default
occurs. A1 and A3, in contrast, have uncensored
default times.
The next step is to convert the data into a
format which can be easily handled by the
survival analysis procedures in SAS, be it
LIFETEST, LIFEREG or PHREG. For each
account in the sample, there must be one
variable (named event_duration in this
example) that contains either the time that
an event occurred or, for censored cases,
the last time at which that account was
observed, both measured from the chosen
origin. A second variable is required to
denote the status of the account at the time
recorded in the event_duration variable.
Fortunately, this variable is already available
in the data (t_PD) which takes a value of 1 for
uncensored cases and 0 for censored ones.
The variable event_duration can be created
by a simple data step within SAS as follows:
*CREATING EVENT_DURATION VARIABLE;
%MACRO INCL_DATE(SAMPDATE = , TERM = );
    DATA X_&SAMPDATE.;
        SET Y_&SAMPDATE.;
        IF DEF_MONTH = 0 THEN EVENT_DURATION = &TERM.;
        ELSE EVENT_DURATION = DEF_MONTH;
    RUN;
%MEND;
The macro parameter TERM in the above
macro is simply the number of months
between the snapshot date and the last
date of the study. Depending on the
snapshot date, the macro can be invoked
as follows:
%INCL_DATE(SAMPDATE = 201409, TERM = 15);
%INCL_DATE(SAMPDATE = 201203, TERM = 45);
The data for each month is then appended
to generate a master data on which the
segmentation exercise is to be carried out.
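The same transformation can be sketched outside SAS as well. The following Python sketch mirrors the macro's logic on a toy snapshot; the field names and the 19-month term match the May 2014 example above, while the list-of-dicts layout is an assumption made purely for illustration:

```python
# Sketch of the EVENT_DURATION logic: for censored accounts (def_month == 0)
# the duration is the full observation term; otherwise it is the default month.
def add_event_duration(snapshot_rows, term):
    """snapshot_rows: list of dicts with a 'def_month' key; term: months
    between the snapshot date and the last date of the study."""
    out = []
    for row in snapshot_rows:
        row = dict(row)  # copy so the input snapshot is not mutated
        row["event_duration"] = term if row["def_month"] == 0 else row["def_month"]
        out.append(row)
    return out

# Toy May 2014 snapshot, observed through December 2015 (term = 19 months).
snap_201405 = [
    {"account": "A1", "t_pd": 1, "def_month": 14},
    {"account": "A2", "t_pd": 0, "def_month": 0},
    {"account": "A3", "t_pd": 1, "def_month": 3},
]
master = add_event_duration(snap_201405, term=19)
```

As in the SAS macro, the term varies per snapshot, and the per-snapshot outputs are then appended into the master dataset.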
B. IDENTIFICATION OF POTENTIAL CANDIDATES FOR SEGMENTATION
Once the data has been converted to the
desired format, the next task is to identify
a set of potential candidates that can be
used to segment the population using the
account characteristics available. Business
intuition comes in handy at this stage, as
there may be certain variables that are
important based on policies, underwriting
strategies etc. Another approach is to
shortlist the variables by fitting a survival
model on the entire population using all
possible predictors. This can be achieved in
SAS using the stepwise selection methods
within PROC PHREG. Variables that are highly
significant in terms of p-value of Chi-Square
in this model are most likely to be good
candidates for segmenting the population.
The segmentation structure is also
governed by data availability. Certain
subgroups of the population have more
information available than others, and it
is a good idea to develop separate models
for such segments that optimally utilize
all of the extra information available.
Irrespective of whether a variable is
shortlisted through business intuition or
statistical techniques, it is essential to
convert continuous variables to categorical
in order to be able to compare the survivor
functions across different categories of the
variable. This can be achieved by grouping
the accounts into ten equal bins based
on the values of the variable concerned.
Adjacent bins can then be clubbed if
the default rate in the next X months is
similar (X can typically be 12 or 18
months, depending on the length of the
performance window of the latest snapshot
available).
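As a rough sketch of this binning-and-clubbing step (the equal-frequency edge rule and the similarity tolerance `tol` below are illustrative assumptions, not the paper's exact rule):

```python
# Group accounts into ten equal-frequency bins on a continuous variable,
# then club adjacent bins whose X-month default rates are close.
def decile_edges(values, n_bins=10):
    """Equal-frequency bin edges: the lower boundary of bins 2..n_bins."""
    s = sorted(values)
    return [s[(i * len(s)) // n_bins] for i in range(1, n_bins)]

def merge_adjacent(bin_rates, tol=0.01):
    """Club adjacent bins whose default rates differ by less than tol.
    bin_rates: list of (bin_label, default_rate) in bin order;
    returns a list of merged groups of bins."""
    groups = [[bin_rates[0]]]
    for label, rate in bin_rates[1:]:
        if abs(rate - groups[-1][-1][1]) < tol:
            groups[-1].append((label, rate))   # similar rate: club with previous
        else:
            groups.append([(label, rate)])     # dissimilar: start a new group
    return groups
```

In practice the default rates per bin would be measured over the chosen X-month window before merging.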
C. GENERATION OF SEGMENTATION STRUCTURE
I. Theoretical Background
of Statistical Tests used
A Log-Rank Test is used as the approach
for sub-segmenting the population. It is
a non-parametric hypothesis test for
comparing the survival distributions of two
or more samples. It compares estimates
of the hazard functions of the groups
at each observed event time, that is, each
unique time at which any individual from
any group experiences the event, the null
hypothesis being that the hazard functions
of all groups are equal over the entire
study time.
The idea behind segmentation is to divide
the population in a way such that the
survival functions are statistically different
across the sub categories. The following is
how the Log-Rank test can be represented
mathematically.
H0: h1(t) = h2(t) = ... = hk(t) for all t ≤ τ
H1: At least one hj(t) is different for some t ≤ τ

Here τ is the largest time at which each group has at least one individual at risk, and hj(t) represents the hazard function at time t for group j. The Log-Rank Test compares the hazards of the groups at each event time between 0 and τ.

If all hazards are equal across all groups, then the proportion of each group experiencing the event at any given time ti is expected to equal the proportion of the overall population experiencing the event at that same time:

dij / Yij = di / Yi    for all event times ti, i = 1, 2, ..., D and j = 1, 2, ..., k

Where:
dij = number of events experienced by group j at event time ti
Yij = number of individuals at risk in group j just prior to time ti
di = total number of events experienced by the entire study population at event time ti
Yi = total number of individuals at risk in the entire study population just prior to time ti
The Log-Rank test begins by calculating a
statistic representing the sum of weighted
differences between dij/Yij and di/Yi at each
event time ti for each group j = 1 through k.
For the Log-Rank test, the weights applied
to these differences are all equal to 1, so
each event time has an equal weighting
on the value of the statistic. The statistics
calculated for the k groups are linearly
dependent, and therefore only (k-1) may
be used to calculate a test statistic. To
calculate the test statistic, (k-1) of the
statistics are formed into a vector called Z.
The variances and covariances for these
(k-1) statistics are placed into a variance-
covariance matrix called ∑. A test statistic is
then calculated as:
χ² = Z Σ⁻¹ Zᵀ
This has a chi-squared distribution with
(k-1) degrees of freedom when the null
hypothesis is true.
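For the two-group case, the vector and covariance matrix collapse to scalars and the statistic can be computed directly. A minimal Python sketch, assuming exact event times and using `event = 0` to mark a censored observation:

```python
# Two-group log-rank test: chi-square = (sum(O1 - E1))^2 / sum(V), 1 df.
# Each sample is a list of (time, event) pairs, event = 1 (default) or 0 (censored).
def logrank_chisq(group1, group2):
    event_times = sorted({t for t, e in group1 + group2 if e == 1})
    num, var = 0.0, 0.0
    for t in event_times:
        y1 = sum(1 for ti, _ in group1 if ti >= t)   # at risk in group 1
        y2 = sum(1 for ti, _ in group2 if ti >= t)   # at risk in group 2
        d1 = sum(1 for ti, e in group1 if ti == t and e == 1)
        d2 = sum(1 for ti, e in group2 if ti == t and e == 1)
        y, d = y1 + y2, d1 + d2
        num += d1 - y1 * d / y                       # observed minus expected events
        if y > 1:                                    # hypergeometric variance term
            var += (y1 * y2 * d * (y - d)) / (y * y * (y - 1))
    return num * num / var
```

With k > 2 groups, the full (k-1)-vector Z and covariance matrix Σ above are required; PROC LIFETEST handles that general case, as shown in the next section.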
The Log-Rank test allows for two types of
imperfect survival data: left-truncated
data and right-censored data. If no censored
observations are present in the data,
the Wilcoxon rank-sum test is more
appropriate.
II. Application of the Log-Rank Test
The LIFETEST procedure in SAS can be
used to generate the Log-Rank test for
comparison of survival patterns across
different groups.
Continuing with the credit card example
as before, once a few categorical variables
have been shortlisted through either
business intuition or statistical techniques
as potential candidates for segmenting
the population, a Log-Rank test will be
applied to each of them to test whether
the survivor/hazard functions are different
across the categories of that variable. The
variable having the highest Chi-Square
will be used to create the first split of the
population.
Before going ahead with the
segmentation methodology, it is advisable
to summarize the data across the shortlisted
variables for ease of computation.
Assuming that the shortlisted variables
are MOB, DELQ, UTIL, BAL and
FULL_PAY_IND, this can be achieved in SAS
through a simple SQL procedure. Continuous
variables like utilization and balance first
need to be converted to categorical
variables - UTIL_FMT and BAL_FMT.
*SUMMARIZING THE DATA;
PROC SQL;
    CREATE TABLE SUMMARY1 AS
    SELECT EVENT_DURATION, UTIL_FMT, DELQ, MOB,
           BAL_FMT, FULL_PAY_IND, T_PD,
           COUNT(*) AS NUMBER,
           SUM(T_PD = 1) AS DEFAULTS
    FROM STACKED_DATA
    GROUP BY 1,2,3,4,5,6,7;
QUIT;
Next, the Log-Rank test is computed
iteratively for each of the shortlisted
variables by specifying them one at a time
in the STRATA statement of PROC LIFETEST.
A separate survivor function is then
estimated for each stratum, and tests of
the homogeneity of strata are performed.
The precise SAS code is as follows:
*LIFETEST FOR EACH VARIABLE;
ODS OUTPUT SURVDIFF = SD HOMTESTS = HT;
PROC LIFETEST DATA = SUMMARY1 METHOD = LT
    INTERVALS = 0 TO 108 BY 2;
    TIME EVENT_DURATION*T_PD(0);
    STRATA VAR_NAME/ADJUST = TUKEY;
    FREQ NUMBER;
RUN;
It is essential to configure some options of
the LIFETEST procedure before executing it:
• In the TIME statement, the survival time
variable, EVENT_DURATION, is crossed
with the censoring variable, T_PD, with
the value 0 indicating censoring. Hence
the values of EVENT_DURATION are
considered censored if the corresponding
values of T_PD are 0. Otherwise, they are
considered as event times.
• In the STRATA statement, the variable
name is specified, which indicates that the
data are to be divided into strata based
on the values of that particular variable. In
this example, a separate PROC LIFETEST
is run for each of the five shortlisted
variables - UTIL_FMT, DELQ, MOB, BAL_
FMT and FULL_PAY_IND.
• The METHOD option specifies the
method to be used to compute the
survival function estimates. LT refers to
the life table (actuarial) estimates. This
method is preferred when the number of
observations is large².
• The INTERVALS option specifies interval
endpoints for life-table estimates. Each
interval contains its lower endpoint but
does not contain its upper endpoint.
Hence the specification in the above code
produces the set of intervals
{[0, 2), [2, 4), ..., [106, 108), [108, ∞)}
• The FREQ statement is useful for
producing life tables when the data
are already in the form of a summary
data set. The FREQ statement identifies
a variable (NUMBER in this case) that
contains the frequency of occurrence of
each observation. PROC LIFETEST treats
each observation as if it appeared n times,
where n is the value of the FREQ variable
for the observation.
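To make the life-table computation concrete, here is a small Python sketch of the actuarial estimate over the same style of two-month intervals. The half-interval adjustment for within-interval censoring is the standard actuarial convention; variable names are illustrative:

```python
# Actuarial (life-table) survival estimate over fixed intervals, in the
# spirit of PROC LIFETEST METHOD=LT with INTERVALS=0 TO 108 BY 2.
def life_table_survival(durations, events, width=2, end=108):
    """durations/events: parallel lists; events[i] = 1 if default, 0 if censored.
    Returns the survival estimate at the start of each interval [lo, lo+width)."""
    survival, s = [], 1.0
    for lo in range(0, end, width):
        hi = lo + width
        at_risk = sum(1 for t in durations if t >= lo)
        d = sum(1 for t, e in zip(durations, events) if lo <= t < hi and e == 1)
        c = sum(1 for t, e in zip(durations, events) if lo <= t < hi and e == 0)
        survival.append(s)
        n_eff = at_risk - c / 2.0        # effective number exposed to risk
        if n_eff > 0:
            s *= 1.0 - d / n_eff         # conditional survival for the interval
    return survival
```

Summary-level data (as produced by the PROC SQL step) would simply weight each row by its NUMBER count, mirroring the FREQ statement.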
Once the LIFETEST procedure has been run
for each of the shortlisted variables and
the homogeneity test results stored in the
dataset named HT, the results are appended
to create a final table holding the Chi-Square
test results for each variable.
DATA HT1;
    SET HT;
    LENGTH VAR $30.;
    VAR = "VAR_SEG.";
    WHERE TEST = "Log-Rank";
RUN;
The Chi-Square test results for each
variable are then appended to create a
table like the one below, sorted in
descending order of Chi-Square values.
The top variable, DELQ, is used as the first
segmentation split.
Level 1

Test       ChiSq     DF   ProbChiSq   Var
Log-Rank   734,420   3    <.0001      DELQ
Log-Rank   457,622   5    <.0001      UTIL_FMT
Log-Rank   340,373   5    <.0001      BAL_FMT
Log-Rank   295,331   9    <.0001      MOB
Log-Rank   294,356   1    <.0001      FULL_PAY_IND

Table 2: Chi-Square Test Results
For Example Purposes Only
Now, DELQ has 4 categories: cycle 0, 1, 2 and 3.
Since cycle 0 comprises around 95%
of the non-default population, the data is
first divided into two categories: InOrder
(cycle 0) and Delinquent (cycle 1+). Since the
Delinquent population is relatively small, it
is not split further. The InOrder population
is then considered, and the segmentation
exercise is carried out on this subset using
the remaining variables.
The InOrder population is further split
into Full Payer and Revolver populations
according to the top splitter in this subset,
FULL_PAY_IND. Each of these subsets can
be split further using the remaining
variables, following the same steps as
before. It should be kept in mind that
every final node post segmentation
should have sufficient volume for the
model to be robust. In this example, the
final segmentation structure is obtained by
further splitting each of the Full Payer and
Revolver populations by MOB.
Level 2: DELQ = 0 (InOrder)

Test       ChiSq     DF   ProbChiSq   Var
Log-Rank   275,918   1    <.0001      FULL_PAY_IND
Log-Rank   230,824   5    <.0001      UTIL_FMT
Log-Rank   175,167   9    <.0001      MOB
Log-Rank   169,771   5    <.0001      BAL_FMT

Table 3: InOrder Population Segmentation
For Example Purposes Only
D. ANALYSIS OF SEGMENTATION PERFORMANCE
The blue highlighted boxes in Figure 1
are the final segments for this population.
Once the final segmentation structure has
been decided, it makes sense to check
the survival distributions across the five
segments. The following program can be
used for that, assuming that the variable
“pd_seg_ind_sm_2” captures the new
segmentation structure:
ODS OUTPUT SURVDIFF = SD HOMTESTS = HT;
PROC LIFETEST DATA = <DATA> METHOD = LT
    INTERVALS = 0 TO 108 BY 2 PLOTS = (S,H);
    TIME EVENT_DURATION*T_PD(0);
    STRATA PD_SEG_IND_SM_2/ADJUST = TUKEY;
RUN;
The ADJUST option (new in SAS 9.2) tells
PROC LIFETEST to produce p-values for all
ten pairwise comparisons of the five strata,
and then to report p-values that have been
adjusted for multiple comparisons using
Tukey's method. Results are shown in Table 4.
Table 4 shows the overall chi-square tests of the null hypothesis that the survivor functions are identical across the five segments. All three tests are highly significant, unanimously rejecting the null hypothesis and providing evidence that at least one of the five stratum hazard plots is significantly different from the others for some value of t ≤ τ.

Test        ChiSq     DF   ProbChiSq
Log-Rank    559,397   4    <.0001
Wilcoxon    643,746   4    <.0001
-2Log(LR)   455,316   4    <.0001

Table 4: Comparison of Results
For Example Purposes Only

Figure 1: Final Segmentation Tree (the blue boxes of the original figure are marked [final] below)

Cards (Stacked Sample) - Volume: 100%; 12M Default Rate: 6.43%; 60M Default Rate: 27.22%
|-- DELQ > 0 (Delinquent) [final] - Volume: 4.0%; 12M: 62.70%; 60M: 93.92%
|-- DELQ = 0 (InOrder) - Volume: 96.0%; 12M: 3.62%; 60M: 21.76%
    |-- FULL_PAY_IND = 1 (Full Payer) - Volume: 52.44%; 12M: 0.75%; 60M: 3.69%
    |   |-- MOB < X [final] - Volume: 6.11%; 12M: 3.29%; 60M: 18.71%
    |   |-- MOB >= X [final] - Volume: 46.34%; 12M: 0.33%; 60M: 3.03%
    |-- FULL_PAY_IND = 0 (Revolver) - Volume: 43.55%; 12M: 9.28%; 60M: 42.32%
        |-- MOB < X [final] - Volume: 5.41%; 12M: 24.96%; 60M: 87.68%
        |-- MOB >= X [final] - Volume: 38.15%; 12M: 6.47%; 60M: 36.43%
The second output in Table 5 shows the Log
Rank tests comparing each possible pair
of strata. All the tests are significant both
using the raw p-values and after the Tukey
adjustment, suggesting that each segment
is significantly different from the others. This
rules out the possibility of collapsing the
segments.
Adjustment for Multiple Comparisons for the Log-Rank Test

Stratum (pd_seg_ind_sm_2)   Stratum (pd_seg_ind_sm_2)   Chi-Square   Raw p-value   Tukey-Kramer p-value
1                           2                           160,134      <.0001        <.0001
1                           3                           82,894       <.0001        <.0001
1                           4                           29,425       <.0001        <.0001
1                           5                           124,817      <.0001        <.0001
2                           3                           335,678      <.0001        <.0001
2                           4                           138,374      <.0001        <.0001
2                           5                           388,489      <.0001        <.0001
3                           4                           95           <.0001        <.0001
3                           5                           3,176        <.0001        <.0001
4                           5                           393          <.0001        <.0001

Table 5: Log-Rank Test Results
For Example Purposes Only
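PROC LIFETEST performs the Tukey-Kramer adjustment internally. To illustrate the general idea of a multiplicity correction, the sketch below applies a simpler Bonferroni adjustment (deliberately not the Tukey-Kramer method SAS uses) to hypothetical raw p-values for the ten pairwise comparisons of five strata:

```python
from itertools import combinations

# Bonferroni: with k strata there are k*(k-1)/2 pairwise log-rank tests,
# and each raw p-value is multiplied by the number of tests (capped at 1).
def bonferroni_pairwise(raw_p_by_pair):
    m = len(raw_p_by_pair)
    return {pair: min(1.0, p * m) for pair, p in raw_p_by_pair.items()}

# Hypothetical raw p-values for all pairs of the five strata (10 pairs).
raw = {pair: 1e-5 for pair in combinations(range(1, 6), 2)}
adjusted = bonferroni_pairwise(raw)
```

Even under this harsher correction, all pairs in this sketch remain significant, consistent with the Table 5 conclusion that none of the segments should be collapsed.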
The graph in Figure 2 shows some evidence
of difference in survival functions across the
five strata, thereby supporting the results
already obtained from the Chi-Square tests.
Finally, the default rates of the five
segments at two horizons, 12 months
and 60 months, are considerably
different, further confirming that the
segments behave differently in terms of
default.
Conclusion and Limitations

This approach also has its own set of
limitations.

First, the test statistics for the Log-Rank test
are based on large-sample approximations
and give good results only when the sample
size is large. The number of comparison
segments should not be allowed to get too
large, to avoid having segments with too few
subjects. Each group should contain at least
30 subjects, and preferably more for the best
results.
Secondly, the Log-Rank test is most
powerful for detecting differences of the
form S1(t) = [S2(t)]^γ, where γ is some
positive number other than 1.0. This
equation defines a proportional hazards
model, and the Log-Rank test is not
particularly good at detecting differences
when survival curves cross.

Figure 2: Survival Graphs for Chosen Segments (survival distribution function against time, 0 to 100 months, for strata pd_seg_ind_sm_2 = 1 through 5)
Segmentation is a unique aspect of
modelling in that it blends art and science
in almost equal measure. There are times
when a segmentation structure based
entirely on statistical measures does not
add enough value; it becomes effective
only when the numbers are coupled with
business requirements and common sense,
as demonstrated in the example discussed
above. Dividing the population into five
different groups and building separate
survival models on each of them yielded
better results than building a single
standalone model on the entire population,
as these groups are inherently different in
terms of their survival patterns.
Survival analysis can be applied to build
models for the time of default on credit
cards. This knowledge helps the issuer
pre-empt attrition and devise customer
engagement strategies. This paper proposed
creating an intuitive segmentation
structure on a large dataset of credit card
accounts, before the onset of the actual
modelling exercise, by using the Log-Rank
test to compare the hazard functions across
the different sub-groups. The program used
in this paper serves as a fast, efficient way
to churn through a large quantity of data
and provide the client the information
needed for a final decision on modelling
splits.
References
1. Allison, P. D. (2010). Survival Analysis Using SAS®: A Practical Guide, Second Edition. Cary, NC: SAS Institute Inc.
2. Bellotti, T., & Crook, J. (2007, May 7). Credit Scoring with Macroeconomic Variables Using Survival Analysis.
3. Man, R. (2014, May 9). Survival Analysis in Credit Scoring: A Framework for PD Estimation.
4. Pazdera, J., Rychnovsky, M., & Zahradnik, P. (2009, Feb 1). Survival Analysis in Credit Scoring.
5. Sayles, H., & Soulakova, J. (n.d.). Log-Rank Test for More than Two Groups.
6. Weldon, G., & Zidun, H. (n.d.). Segmentation of Data Prior to Modeling. Atlanta: Merkle, Inc.
End Notes
1. Refer to (Bellotti & Crook, 2007), (Pazdera, Rychnovsky, & Zahradnik, 2009), (Man, 2014).
2. The Kaplan-Meier method of estimating survivor functions is more suitable when the sample size is small and event times are measured with precision. It is in fact the default method in PROC LIFETEST.
GLOBAL HEADQUARTERS
280 Park Avenue, 38th Floor, New York, NY 10017
T: +1.212.277.7100 • F: +1.212.277.7111
United States • United Kingdom • Czech Republic • Romania • Bulgaria • India • Philippines • Colombia • South Africa
Email us: lookdeeper@exlservice.com On the web: EXLservice.com
EXL (NASDAQ: EXLS) is a leading operations management and analytics company that designs and enables
agile, customer-centric operating models to help clients improve their revenue growth and profitability. Our
delivery model provides market-leading business outcomes using EXL’s proprietary Business EXLerator
Framework®, cutting-edge analytics, digital transformation and domain expertise. At EXL, we look deeper to
help companies improve global operations, enhance data-driven insights, increase customer satisfaction,
and manage risk and compliance. EXL serves the insurance, healthcare, banking and financial services,
utilities, travel, transportation and logistics industries. Headquartered in New York, New York, EXL has
more than 27,000 professionals in locations throughout the United States, Europe, Asia (primarily India and
Philippines), South America, Australia and South Africa.
© 2017 ExlService Holdings, Inc. All Rights Reserved.
For more information, see www.exlservice.com/legal-disclaimer