
Improving prediction accuracy of loan default: A case in rural credit

M.J. Xavier, Sundaramurthy, P.K. Viswanathan, G. Balasubramanian

Abstract: The application of data mining techniques to predicting loan default is of paramount importance in banking and financial services. The analytics involved pave the way for robust credit scoring models and for automation of the lending process. They also help discern the pattern of the relationship between the inputs (borrower characteristics) and the output (loan default status). When the underlying relationship is not strictly linear, routine use of factor analysis, or of a non-linear technique such as a neural network, may not improve predictive accuracy because of the complexity present in the data. The question is: can predictive accuracy be improved by judiciously combining these two algorithms? In this paper we demonstrate that it can, using the data set of a finance company lending small loans in rural areas. Factor analysis is used to generate the inputs for a neural network algorithm that predicts loan default, and the result is a substantial improvement in accuracy.

Keywords: predictive accuracy, algorithms, factor analysis, neural network, data mining


I Introduction

Lending institutions are exposed to the risk of default by borrowers, which affects their profitability and solvency. Widespread default also hampers the economic growth of a country and is therefore a major cause of concern for governments and banks. In managing loan default, prevention is better than cure. Lenders therefore exercise vigilance at the time of lending by carefully studying the characteristics of borrowers and making a judgment call about the relationship between those characteristics and the default event. Over time, the database of lending and default history captures this relationship, which can then be modeled using appropriate algorithms, and lenders use the modeled relationship to take credit decisions. Models, once tested for reliability, facilitate automation of the lending process, which helps lenders scale up their operations. Lenders here include banks, financial institutions and other agencies, such as microfinance institutions, engaged in lending.

Historical data on borrower characteristics and default status is used to predict loan default: past data is examined for patterns linking particular characteristics to the default outcome. A naive example would be a bank that found that people belonging to a particular community never default; such a bank would look at no characteristic other than community and treat everything else as irrelevant, because that is what its experience shows. In that case only a few data points were used to predict loan default. Today, with high-speed computing power, far more data can be captured and analyzed to detect hidden patterns, and more sophisticated techniques are used to improve prediction accuracy. Multiple regression is widely used to express the mathematical relationship between dependent and independent variables as coefficients that can then be used for prediction. However, multiple regression assumes that the relationship between the dependent and independent variables is linear. Moreover, when there are many independent variables, there is the problem of multicollinearity, that is, the independent variables themselves being correlated with one another, which can degrade the predictive accuracy of the equation. Factor analysis emerged as a powerful technique for handling this issue: all the correlated independent variables are reduced to factors, which are then used to predict the outcome.
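To make the multicollinearity point concrete, here is a minimal sketch of how such correlation among independent variables can be detected before modeling; the file and column names are hypothetical, and the variance inflation factor (VIF) routine from the Python statsmodels package is assumed as the tooling:

    import pandas as pd
    from statsmodels.stats.outliers_influence import variance_inflation_factor

    # Hypothetical borrower data; column names are illustrative only.
    borrowers = pd.read_csv("borrowers.csv")
    X = borrowers[["income", "rent", "age", "dependents"]].dropna()

    # A VIF well above 10 is a common rule of thumb for harmful collinearity.
    for i, col in enumerate(X.columns):
        print(col, variance_inflation_factor(X.values, i))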

The various techniques and tools used to discern the underlying relationship between dependent and independent variables are collectively known as data mining techniques. Conventional regression and other multivariate techniques rest on the assumption of a linear relationship between the dependent and independent variables, which does not hold in many situations. To overcome this limitation, non-linear analytical tools were developed whose algorithms can handle non-linearity. The neural network (NN) is one such technique, used extensively to predict outcomes where the underlying relationship between the dependent and independent variables is non-linear. Yet in some situations NN fails badly, particularly when there is a large number of independent variables. Is there, then, a way of combining conventional factor analysis with NN to obtain better predictive accuracy? We show in this paper, with the help of a real-life case, that there is.


II Review of Literature

Factor analysis is one of the most widely used tools for prediction in fields such as psychology, sociology and market research. It is a statistical method for reducing the original set of variables to a smaller set of underlying factors in a manner that retains as much as possible of the original information in the data; in statistical terms, this means finding factors that explain as much as possible of the variance in the data. While many statistical methods study the relationship between independent and dependent variables, factor analysis studies the patterns of relationship among many interdependent variables.

Timo Salmi, Ilkka Virtanen and Paavo Yli-Olli [1990] applied factor analysis and transformation analysis to a number of financial ratios of 32 Finnish firms for the period 1974-84. Fifteen financial ratios, grouped under three broad categories, namely accrual, cash flow and market-based ratios, were considered for the application of factor analysis. Factor analysis is mainly used for exploratory analysis of underlying dimensions and exploits the benefits of data reduction, replacing the original variables with factor scores for subsequent analyses such as multidimensional scaling, cluster analysis or neural network analysis. In one of the earliest applications of factor analysis, Spearman [2005], a noted psychologist, researched the issue of intelligence: eight measurable variables associated with intelligence were reduced to four dimensions. Colgate and Lang [2001] used exploratory factor analysis to assess the dimensionality of the reasons why customers do not switch banks, and thus to determine the relevance of categories unearthed in the literature. Minhas and Jacobs [1996] employed factor analysis to identify the most important attributes of new technology that have an impact on the marketing of financial services; 33 variables were reduced to six underlying dimensions. In a study predicting industrial bond ratings, Pinches and Mingo [1973] used factor analysis to identify independent dimensions for further modeling; seven underlying factors were extracted, accounting for sixty-three percent of the explanation of the results. Scannell, Safdari and Newton [2003] present an extended application of factor analysis performed on a set of 17 banks and 13 financial variables, using the results to classify banks on the basis of common characteristics.


Neural networks (NN) represent a radically different form of computation from more common algorithmic approaches such as factor analysis. The unique learning capability of NN promises benefits in many aspects of default prediction that involve pattern recognition. Developing a robust and reliable model for default prediction is important, as it enables investors, auditors and others to independently evaluate the risk of an investment. The task of predicting default can be posed as a classification problem: given a set of classes (good and bad loans) and a set of input data vectors, assign each input data vector to one of the classes. What forms the input data vector is one of the key issues to be determined in the research design. Conventional statistical approaches are of limited use in deriving an appropriate prediction model in the absence of well-defined domain models: they all require the assumption of a certain functional form relating the independent and dependent variables, so generalization can be made only with caution. NN provide a more general framework for determining relationships in the data and do not require the specification of any functional form. Wullianallur Raghupathi et al. [1991] presented the results of exploratory research in which NN was applied to bankruptcy prediction; they conclude that NN could provide a model for predicting bankruptcy. Eric Rahimian et al. [1992] compared discriminant analysis and NN on a data set of 129 firms, of which 65 were bankrupt; they report that after normalizing the data, the performance of the NN improved significantly, so it is important to consider how the input data to a NN is specified. Marcus D. Odom et al. [1992] also compared the predictive ability of a NN with a multivariate discriminant analysis model for bankruptcy risk prediction and showed promise in using NN for prediction purposes; they report that the NN performed better on both the original data and the hold-out sample. Kevin Coleman et al. [1991] and Kar Yan Tam et al. [1992] extended the application of NN in bankruptcy prediction further, into an expert system for prescribing remedial action to prevent bankruptcy. Linda Salchenberger et al. [1992] used a NN trained on 100 failed and 100 surviving thrift institutions between 1986 and December 1987 and compared its performance with a logit model. Their results show that the NN performed better than the logit model, and they also conclude that the costs of committing Type I and Type II errors are lower with the NN than with the logit model.


III Case Study

Predicting credit default for a consumer financial services company

Company background: The company under study is a financial services company and is part of a leading multi-business industrial group in South India. Its products include auto loans, personal loans and consumer durable loans. It has a customer base across all four states of South India, with customers in urban, semi-urban and rural areas. In fact, its strength is its reach into rural areas, where it is expanding its base; as elaborated above, there is large untapped potential in the rural sector. The company has a very good brand image in South India and is known for its professionalism, integrity, customer focus and commitment to social responsibility. Its customers include the lower middle class, middle class and upper middle class in the salaried segments of the private, government and joint sectors, as well as traders, small farmers and others running small enterprises of their own in urban, semi-urban and rural areas.

Process of acquiring customers, auto loans: The business group has an automobile company with a leading brand in the two-wheeler market. The automobile company has appointed authorized dealers who have their showrooms in various places and also a marketing team. Customers walk into a showroom and, once they decide to buy a vehicle, are approached by the different financing companies present there, so this financial company has to compete with others to win a customer. Once a customer is convinced, his or her personal data is recorded in the prescribed format and sent to the call centre. The call centre enters the data into the network and makes it available to the field investigation network.

Customers are also acquired when the financial services company conducts loan melas at different locations, organized either by the company directly in alliance with the dealers or by a dealer himself. Here the number of customers acquired, i.e. the hit rate, is much higher than from walk-in customers.


The assessment of a prospective customer for a loan is done by a team of staff called field investigators, who visit the customer's workplace and residence and fill in a format that helps them arrive at a score. The score is the sum of the marks given for various attributes such as income, ownership of a house, consumer durables, age and so on; if the score exceeds a cutoff, the person is deemed eligible for a loan, as in the sketch below.
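The scorecard logic just described is simple enough to sketch in Python; the attribute marks and the cutoff below are invented for illustration, not the company's actual values:

    # Hypothetical field-investigation scorecard: marks per attribute are
    # summed and compared against a cutoff. All numbers are invented.
    def score_applicant(applicant: dict) -> int:
        marks = 0
        marks += 20 if applicant["monthly_income"] >= 5000 else 10
        marks += 15 if applicant["owns_house"] else 5
        marks += 10 if applicant["owns_consumer_durables"] else 0
        marks += 10 if 25 <= applicant["age"] <= 50 else 5
        return marks

    CUTOFF = 40  # hypothetical eligibility threshold

    applicant = {"monthly_income": 6000, "owns_house": True,
                 "owns_consumer_durables": False, "age": 32}
    print("eligible" if score_applicant(applicant) >= CUTOFF else "not eligible")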

The score is sent to the call centre and from there to the concerned branch, which then handles the further formalities up to loan disbursement.

Process of acquiring customers, consumer durables: The financial services company has tie-ups with dealers of consumer durables, and a process similar to that for auto loans is followed for the assessment and disbursement of these loans. Another avenue for acquiring customers is cross-selling to customers who have already taken and repaid, say, an auto loan or a personal loan and who exhibited good repayment behavior.

Process of acquiring a personal loan: Here again, the existing auto loan or consumer durable loan customers are the prospects; good customers who have not defaulted even once are offered these loans.

Problem definition: There is tremendous competition among banks, both Indian and foreign, and among financial services companies in the private sector. As already mentioned, there is a scramble for the low-hanging fruit, which lies in the urban and semi-urban areas. All the players in this sector vie with one another, with the result that margins have become very thin. Players have been forced to look for newer markets and also to compete on the time taken to disburse loans. With this rush to sanction loans, there is a lot of pressure on the assessment process that classifies customers as "good" or "bad". Even though it is evident, it is worth stating for emphasis that neither wrongly rejecting a good customer nor wrongly accepting a bad customer is acceptable, as both affect the bottom line; generally, however, wrongly accepting a bad customer is more harmful than wrongly rejecting a good one. Unrecovered loans vary anywhere between 1% at best and 25% at worst; for the company under study, the typical figure is around 2.5%. So every company has every reason to minimize unrecovered loans. The cost of collection also matters, particularly for customers who are not prompt in repayment. Finally, when there is a need to grow aggressively, loans must be disbursed as fast as possible while still being given only to the right customers.

Hence it has become imperative for the company under study to develop a credit scoring model that should (a) identify and predict good and bad customers, (b) take all the relevant factors into consideration and assign them weights according to their relative importance with respect to repayment behavior, and (c) arrive at a final score based on the various parameters, so that customers can be ranked and the pricing of the loan, in terms of interest, initial down payment and so on, and the collection mechanism can be decided.

As a first step of the study, detailed discussions were carried out with the president, to understand the business, and with the regional managers, branch managers, risk managers, call centre managers and call centre operators, covering the process of acquiring customers, the assessment of customers, collection mechanisms, and the challenges in terms of technology, people and processes. The objectives of the study were to (a) understand the operational aspects related to risk, (b) understand the capture of customer data and the process of field investigation, (c) understand risk assessment, (d) ascertain the practical aspects of the quality of information obtained from customers and the process of validating that information, (e) assess the attitudes of the field investigators towards credit scoring models and their willingness to obtain information that may be more critical, (f) understand the transaction between the DSA and the prospective customer and the process of conversion, (g) ascertain the dealer's perspective on the scoring of customers, (h) understand the process of obtaining and validating the customer information required by the credit scoring model, and (i) check the data for completeness and consistency and make a preliminary study of any visible patterns.


Visits were made to dealers, master field investigators, field investigators, branches, direct selling associates and customers in rural, semi-urban and urban areas in different regions.

The company had a very good data warehouse of customer data, which was used to collect secondary data on the customers. The customers comprised people who had availed themselves of loans to buy a two-wheeler. Most of them are from rural and semi-urban areas, and most are self-employed, in small trade, marginal farming or some other form of small business. The vehicles are used by traders and businessmen directly in their occupation, or indirectly to support their business or farming, for example to transport farm inputs or produce.

A database of around 12,000 customers, a mixture of good and bad customers, was taken. The database had 38 fields, with data such as age, qualification, profession, income, possession of consumer durables and a house, number of dependents, down payment and advance equated monthly installments (EMI).

Data collection and sample size: The total customer database holds around 150,000 records, from which a random sample of around 12,000 data sets was drawn. The following seventeen independent variables were used in the model; 'good' or 'bad' customer is the dependent variable, represented as 1 or 0 (a loading sketch follows the list).

Data Structure

1. Income

2. Advance EMI

3. Dependents

4. Experience

5. Rent


6. Down payment

7. Consumer durables

8. Interest

9. Vehicles

10. Age

11. Other income

12. TV

13. Music System

14. Fridge

15. Two wheeler

16. Four wheeler.

17. Qualification
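A minimal sketch of how such a sample might be loaded and split into inputs and label, assuming Python with pandas; the file name and column names are assumptions, since the paper does not publish the raw data:

    import pandas as pd

    # Hypothetical column names mirroring the seventeen variables above.
    FEATURES = ["income", "advance_emi", "dependents", "experience", "rent",
                "down_payment", "consumer_durables", "interest", "vehicles",
                "age", "other_income", "tv", "music_system", "fridge",
                "two_wheeler", "four_wheeler", "qualification"]

    loans = pd.read_csv("loan_sample.csv")   # assumed file of ~12000 records
    X = loans[FEATURES]                      # independent variables
    y = loans["good_customer"]               # dependent variable: 1 good, 0 bad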

Methodology for analysis:

The data set of 12,000 customers with seventeen variables was subjected to factor analysis, which reduced the seventeen variables to the following five factors (some of the variables loaded as independent factors on their own):

1. Income and assets

2. Consumer durables

3. Initial payments (down payment, advance EMI)

4. Vehicles

5. Dependents


The rotated factor matrix is shown below.

Rotated Component Matrix (a)

                        Component
    Variable       1       2       3       4       5       6
    cfsan_emi     .198    .025    .037    .781   -.014   -.062
    ADVEMI       -.108    .008   -.084    .814    .047    .049
    dependent    -.005    .005    .453   -.007    .016   -.311
    CHILDREN     -.089   -.005   -.053    .079    .040    .833
    INCOME        .032    .970   -.006    .040    .026   -.004
    experience    .035    .022    .468   -.009    .194    .353
    RENT         -.116   -.001   -.018    .178    .206   -.305
    AGE          -.137    .010    .638   -.035    .007    .020
    othincome    -.014    .970    .034   -.005   -.015    .004
    TV            .670    .000    .545   -.021   -.008   -.039
    MS            .675    .004    .525   -.014   -.061   -.034
    FRIDGE        .803    .003   -.126    .063    .091    .039
    WM            .724    .016   -.289    .026    .099   -.001
    TW            .066   -.024    .257    .045    .655    .071
    FW            .080    .032   -.124   -.017    .770   -.080

Extraction method: principal component analysis. Rotation method: varimax with Kaiser normalization. (a) Rotation converged in 6 iterations.
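As a hedged sketch of this extraction step, the third-party Python package factor_analyzer supports principal-component extraction with varimax rotation, matching the table above; the paper does not name its software, and X here is the 17-variable input frame from the earlier loading sketch:

    from factor_analyzer import FactorAnalyzer  # third-party package, assumed

    # Principal-component extraction with varimax rotation, six components,
    # as in the rotated component matrix above.
    fa = FactorAnalyzer(n_factors=6, rotation="varimax", method="principal")
    fa.fit(X)

    loadings = fa.loadings_          # counterpart of the loading table above
    factor_scores = fa.transform(X)  # factor scores, later fed to the NN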

The resultant factor scores were applied to the past data to test their predictive accuracy; they achieved about 40 percent accuracy in prediction. Though this is considered significant, further improvement of the prediction accuracy was explored with the NN methodology.

Methodology of NN:

NN is an assumption-free, non-algorithmic approach to estimating the relationship between dependent and independent variables. That relationship can also be explained with algorithm-based techniques such as regression, discriminant analysis or factor analysis, but the predictive performance of those techniques depends on the extent to which the dependent and independent variables are linearly related. When the underlying relationship is non-linear, a different approach is required for modeling it, namely the non-algorithmic approach of NN.

The given data set is divided into training and testing sets. NN uses the training set to model the relationship between the input (independent variables) and the output (dependent variable). Starting with a randomly assigned weight matrix, NN maps the input to the output through this weight matrix and continuously refines it towards a perfect fit between the input variables and the output. A perfect fit is never reached, so a stopping rule has to be specified in the form of an error term; root mean squared error is normally used for this purpose, and NN stops refining the weight matrix when the root mean squared error reaches a certain target value. The weight matrix is then applied to the test data to check its reliability. Refining the weight matrix is done with the help of various algorithms, of which the back propagation algorithm, a method of distributing the error within the network, is the most popular.
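A minimal sketch of this train/test procedure using scikit-learn's multi-layer perceptron (an assumption: the paper does not name its NN package, and scikit-learn minimizes log-loss rather than root mean squared error, though the stopping idea is the same); factor_scores and y continue from the earlier sketches:

    from sklearn.model_selection import train_test_split
    from sklearn.neural_network import MLPClassifier

    X_train, X_test, y_train, y_test = train_test_split(
        factor_scores, y, test_size=0.3, random_state=0)

    # Back propagation with an error-based stopping rule: training stops
    # once the loss improvement falls below 'tol'.
    nn = MLPClassifier(hidden_layer_sizes=(15,), max_iter=1000, tol=1e-4)
    nn.fit(X_train, y_train)
    print("test accuracy:", nn.score(X_test, y_test))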

The performance of NN can be improved through various ways of configuring the network and by giving it the proper inputs.

The same data was first subjected to NN directly, but the network could not generalize from the raw data. When, instead of the raw data, the extracted factor scores for each case were used as the inputs, the performance of the NN improved dramatically: it converged in three hours with 98% accuracy. The classification matrix is shown below:

                   Test            Verify          Validate
                 Good    Bad     Good    Bad     Good    Bad     Total
    Total        2673   3354     1460   1553     1412   1601     12053
    Correct      2585   3349     1409   1553     1364   1599     11859  (98.39%)
    Wrong          88      5       51      0       48      2       194
    Unknown         0      0        0      0        0      0
    Good         2585      5     1409      0     1364      2
    Bad            88   3349       51   1553       48   1599

The NN package creates a test data set, tries various network topologies, selects the best-performing network and weights, and applies those weights to the verification and validation data; the overall summary is the misclassification table above. The networks evaluated are listed in the table following the sketch below:
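A sketch of producing such a misclassification summary for the fitted network of the earlier sketch, using scikit-learn's metrics (again an assumption about tooling):

    from sklearn.metrics import accuracy_score, confusion_matrix

    # Rows of the matrix are actual classes, columns are predicted classes.
    y_pred = nn.predict(X_test)
    print(confusion_matrix(y_test, y_pred))
    print("accuracy:", accuracy_score(y_test, y_pred))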


    Network   Topology   Error      Input   Hidden   Performance
    1         RBF        0.496363   2       1        0.5599071
    2         RBF        0.492150   2       2        0.5618984
    3         Linear     0.342226   3       -        0.9784268
    4         Linear     0.342162   4       -        0.9807501
    5         Linear     0.342103   5       -        0.9824096
    6         RBF        0.298466   2       4        0.9724527
    7         MLP        0.083890   2       7        0.9923664
    8         MLP        0.083850   3       17       0.9923664
    9         MLP        0.081590   4       14       0.9926983
    10 *      MLP        0.079790   4       15       0.9923664

Note:

1. The 'Topology' column shows the algorithm chosen by the NN package; 'MLP' refers to multi-layer perceptron and 'RBF' to radial basis function.

2. The reduction in error is accompanied by an improvement in performance.

3. The 'Input' and 'Hidden' columns show the numbers of input and hidden units used in the network.

IV Summary and Conclusion

NN is a powerful technique for predicting the behavior of an output based on a given set of input values, and it holds very good potential for application in loan default prediction and in lending automation through credit scoring models. But NN per se has limitations when there is a large number of input variables and the relationship among these variables is non-linear and complex. This is where a conventional algorithm-based technique like factor analysis lends a hand: it reduces the complexity to a great extent by reducing the many variables to a small number of dimensions. When these factor values are given as the inputs to NN, the network performs dramatically well. Factor analysis alone is not able to achieve 98 percent prediction accuracy; factor analysis coupled with NN is. This has been illustrated in this paper with the help of live data from a financial services company. The future lies in combinatorial approaches to prediction that draw on both linear and non-linear techniques.


References:

1. Colgate, Mark and Lang, Bodo [2001], "Switching barriers in consumer markets: an investigation of the financial services industry," Journal of Consumer Marketing, Vol. 18, No. 4, pp. 332-347.

2. Rahimian, Eric, Singh, Seema, Thammachote, Thongchai and Virmani, Rajiv, "Bankruptcy prediction by neural network."

3. Coleman, Kevin G., Graettinger, Timothy J. and Lawrence, William F. [1991], "Neural networks for bankruptcy prediction: the power to solve financial problems," AI Review, July/August 1991, pp. 48-50.

4. Tam, Kar Yan and Kiang, Melody Y. [1992], "Managerial applications of neural networks: the case of bank failure predictions," Management Science, Vol. 38, No. 7, July 1992, pp. 926-947.

5. Salchenberger, Linda M., Cinar, E. Mine and Lash, Nicholas A. [1992], "Neural networks: a new tool for predicting thrift failures," Decision Sciences, Vol. 23, No. 4, July/August 1992, pp. 899-916.

6. Odom, Marcus D. and Sharda, Ramesh [1992], "A neural network model for bankruptcy prediction," IEEE International Conference on Neural Networks, San Diego, CA, pp. II-163 to II-168.

7. Minhas, R.S. and Jacobs, E.M. [1996], "Benefit segmentation by factor analysis: an improved method of targeting customers for financial services," International Journal of Bank Marketing, Vol. 14, No. 3, March, pp. 3-13.

8. Pinches, Mingo and Caruthers [1973], "The stability of financial patterns in industrial organizations," Journal of Finance, Vol. 28, pp. 389-396.

9. Scannell, Nancy J., Safdari, Cyrus and Newton, Judy [2003], "An extended application of factor analysis in establishing peer groups among banks in Armenia," Journal of the Academy of Business and Economics, January.

10. Spearman, http://www.indiana.edu/~intell/spearman.shtml, accessed on October 22, 2005.

11. Salmi, Timo, Virtanen, Ilkka and Yli-Olli, Paavo [1990], "On the classification of financial ratios: a factor and transformation analysis of accrual, cash flow and market based ratios," No. 25, Business Administration No. 9, Accounting and Finance, Universitas Wasaensis.

12. Raghupathi, Wullianallur, Schkade, Lawrence L. and Raju, Bapi S. [1991], "A neural network approach to bankruptcy prediction," Proceedings of the IEEE 24th Annual Hawaii International Conference on System Sciences.

13. Yli-Olli, Paavo and Virtanen, Ilkka [1990], "Transformation analysis applied to long-term stability and structural invariance of financial ratio patterns: US vs. Finnish firms," American Journal of Mathematical and Management Sciences, Vol. 10.


Details of Authors:

1. M.J. Xavier:

2. Sundharamurthy:

3. P.K. Viswanathan: Faculty, Institute for Financial Management and Research, IFMR, [email protected]

4. G. Balasubramanian: Faculty, Institute for Financial Management and Research, IFMR, [email protected]