
DOI: 10.1590/1808-057x201703760
ISSN 1808-057X

*Paper presented at the XL ANDAP Congress, Costa do Sauípe, BA, Brazil, September 2016.

Geographically Weighted Logistic Regression Applied to Credit Scoring Models*

Pedro Henrique Melo Albuquerque
Universidade de Brasília, Faculdade de Economia, Administração, Contabilidade e Políticas Públicas, Departamento de Administração, Brasília, DF, Brazil

Fabio Augusto Scalet Medina
Universidade de Brasília, Faculdade de Economia, Administração, Contabilidade e Políticas Públicas, Departamento de Administração, Brasília, DF, Brazil

Alan Ricardo da Silva
Universidade de Brasília, Instituto de Ciências Exatas, Departamento de Estatística, Brasília, DF, Brazil

Received on 05.11.2016 – Desk acceptance on 06.20.2016 – 2nd version approved on 10.11.2016

ABSTRACT

This study used real data from a Brazilian financial institution on transactions involving Consumer Direct Credit (CDC), granted to clients residing in the Distrito Federal (DF), to construct credit scoring models via Logistic Regression and Geographically Weighted Logistic Regression (GWLR) techniques. The aims were: to verify whether the factors that influence credit risk differ according to the borrower’s geographic location; to compare the set of models estimated via GWLR with the global model estimated via Logistic Regression, in terms of predictive power and financial losses for the institution; and to verify the viability of using the GWLR technique to develop credit scoring models. The metrics used to compare the models developed via the two techniques were the AICc informational criterion, the accuracy of the models, the percentage of false positives, the sum of the value of false positive debt, and the expected monetary value of portfolio default compared with the monetary value of defaults observed. The models estimated for each region in the DF were distinct in their variables and coefficients (parameters), leading to the conclusion that credit risk was influenced differently in each region in the study. The Logistic Regression and GWLR methodologies presented very close results in terms of predictive power and financial losses for the institution, and the study demonstrated the viability of using the GWLR technique to develop credit scoring models for the target population in the study.

Keywords: credit risk, geographically weighted logistic regression, credit scoring.

R. Cont. Fin. – USP, São Paulo, v. 28, n. 73, p. 93-112, jan./abr. 2017



1. INTRODUCTION

The main activity of commercial banks is financial intermediation, which consists of raising financial resources and lending them to third parties under pre-established conditions, such as payment period, installment value, and interest rate (Hand & Henley, 1997). Because it involves the expectation of future receipt, all credit granted is exposed to risk.

The topic of risk management drew attention in the financial sector after the publication of the Basel accords, a set of documents that serve as a basis for regulating and monitoring the sector. Advances in technology and computing, together with the development of quantitative methods, have contributed to creating different tools for measuring risk, bringing significant gains in the financial management of institutions.

Credit risk can be defined as the possibility of financial losses occurring, associated with borrowers or counterparties not fulfilling their respective obligations in the agreed terms, with the devaluation of loan contracts because of a deterioration in borrowers’ risk classifications, with reductions in earnings or remunerations, with advantages conceded in renegotiations, and with recovery costs (Brazilian Central Bank [BACEN], 2009). It is one of the main risks to which financial institutions are exposed.

The models used to measure risk when granting credit are called credit scoring models. Because they involve lower costs and greater agility, objectivity, and predictive power in credit granting decisions, credit scoring models have become popular and are widely used in the financial sector (Hand & Henley, 1997).

Lessmann, Baesens, Seow, and Thomas (2015) carried out a comprehensive study of the classification methodologies used for developing credit scoring models and identified logistic regression as the standard methodology in the financial sector.

Logistic regression is a multivariate analysis technique that aims to explain the relationship between a random binary dependent variable and a set of independent predictive variables (Hosmer & Lemeshow, 2000).

Financial institutions use various credit scoring models, which are applied when evaluating different types of clients or credit operations to be contracted. The predictive variables that compose each model can differ, with the aim of improving predictions for the target population.

Geographical location (space) and its relationship with credit risk is the topic of some published studies. Among the most recent, Stine (2011) analyzes the evolution of defaults on real estate loans in US counties between 1993 and 2010, covering the periods before and after the subprime crisis, and finds evidence of spatial correlation between default rates in these counties.

Fernandes and Artes (2015) used the Ordinary Kriging methodology to create a variable that reflects spatial risk and applied the Logistic Regression technique to verify the existence of a spatial correlation in defaults on loans taken out by small and medium sized enterprises (SMEs), using data from the SERASA credit bureau. The authors developed models with and without the spatial risk variable and confirmed that the inclusion of this variable improves credit scoring model performance.

The Geographically Weighted Regression (GWR) technique, proposed by Brunsdon, Fotheringham, and Charlton (1996), is used to model spatially heterogeneous (non-stationary) processes; that is, processes that vary (whether in mean, median, variance, etc.) from region to region. The basic idea of GWR is to fit a regression model to each region in the data set, using the geographical location of the other observations to weight the parameter estimates. Applications of the GWR technique can be found in different areas of research, such as Geography (See et al., 2015), Health (Gilbert & Chakraborty, 2011), and Economics (Huang & Leung, 2002).

Atkinson, German, Sear, and Clark (2003) used Geographically Weighted Logistic Regression (GWLR) to analyze how the relationship between erosion and geomorphologic controls depends on geographical location in a region of Wales. The dummy variable used in that study was the presence or absence of erosion in the areas studied. Applying the GWLR technique resulted in the estimation of models with different parameters (distinct models) for each area studied, revealing the need to adopt different practices to avoid erosion, depending on the region.

This article used data related to transactions involving Consumer Direct Credit (CDC), granted by a Brazilian financial institution to clients residing in the Distrito Federal (DF). The aims were as follows: to verify whether the factors that influence credit risk differ according to borrowers’ geographical locations; to compare the set of models estimated via GWLR with the global model estimated via Logistic Regression, in terms of predictive power and financial losses for the institution; and to verify the viability of using the GWLR technique to develop credit scoring models.

Although the central idea in this article of verifying whether space influences credit risk is similar to that of Stine (2011) and Fernandes and Artes (2015), the target population and methodology used are different, and no studies were found in the literature that used the GWLR technique for the development of credit scoring models.

One advantage of applying the GWLR technique in relation to the others lies in estimating a model for each region in the study, allowing these models to be distinct in their variables and parameters (Atkinson et al., 2003), whereas a global model, represented by only one formula, may not represent local variations adequately. In relation to credit, different study regions can involve different risks, and if this phenomenon is observed, models that consider local characteristics can better differentiate the credit risk for borrowers residing there and generate financial gains for the institution.

Another difference from other studies on this topic, and an advantage of the GWLR technique, involves the use of different samples in developing each local model, giving greater weight to borrowers who are closer geographically, and not using distant information that is outside the radius defined by the weighting function.

Questions regarding endogeneity are not addressed in this study and could be raised by researchers in future papers.

In addition to this introduction, the second section of the article presents the geographically weighted logistic regression methodology and the process for developing the models, the third shows the results obtained, and the fourth sets out the conclusion.

2. METHODOLOGY

The flowchart presented in Figure 1 details all of the stages carried out in the process of developing the models in this study.

Figure 1. Flowchart of the stages in developing the models.
Source: Prepared by the authors.



2.1. Database

The data related to this study refer to transactions involving Consumer Direct Credit (CDC) granted by a Brazilian financial institution to clients residing in the Distrito Federal (DF). These transactions are paid in installments over periods between 0 and 36 months and have a maximum contract value of R$30,000.00.

The territorial division of the DF used in this study was composed of 19 regions, shown in Figure 2.

The sample included all loans granted between December 2013 and September 2014, involving 10 rounds of borrowing and a total of 22,132 different loan contracts. Payment performance on these loans was monitored in the twelve months following the contract date, and loans that exceeded 90 days in arrears in any of these months were labeled as being in default (Y=1). Because arrears performance is observed at different moments in time, this database is classified as panel data.

The predictive variables selected to compose the models were: Age, Income, Level of Education, Borrower’s Time of Relationship with the Institution, Loan Contract Period, SELIC, Unemployment Rate, and Inflation (IPCA). These variables refer to the time credit is taken out (a single point in time), and thus constitute cross-sectional data.

The latitude and longitude coordinates for the regions used in the study, which are needed to apply the GWLR technique, were obtained from the IBGE website. They refer to the central point of each region and are the same for all borrowers residing in the same region.

The database was subdivided into model development and validation samples according to the date a transaction was contracted, with the development sample composed of the first five rounds (December 2013 to April 2014) and totaling 10,944 records. The validation database is composed of the final five rounds (May to September 2014), totaling 11,188 records.

The data manipulation, as well as the univariate, bivariate, and spatial indicator calculations and the development of the global model via logistic regression, were carried out using the SAS software. The GWLR models were developed using the GWR4 software.

2.2. Spatial indicators

Figure 2. Territorial division of the Distrito Federal used in the study.
Source: Prepared by the authors.

Moran’s I (Moran, 1950) is one of the most widely used global indicators for verifying the existence of spatial correlation. Global indicators present a single measure of spatial tendency for the whole region being studied, allow the hypothesis of spatial dependency between regions to be tested for the variable of interest, and are used in exploratory data analysis. The formula is given by:

$$ I = \frac{n}{\sum_{i=1}^{n}\sum_{j=1}^{n} w_{ij}} \cdot \frac{\sum_{i=1}^{n}\sum_{j=1}^{n} w_{ij}\,(x_i-\bar{x})(x_j-\bar{x})}{\sum_{i=1}^{n}(x_i-\bar{x})^{2}} \tag{1} $$

in which n is the number of regions being studied, xi and xj are the values of the variable of interest in regions i and j, $\bar{x}$ is the mean of the variable over all regions, and wij are the elements of the spatial proximity matrix, which can be calculated in different ways, such as via the presence or absence of a border between the regions or by the Euclidean distance between them. The Moran index is restricted to the interval [-1,1], in which values close to -1 indicate negative spatial correlation, values close to 1 indicate positive spatial correlation, and a value equal to 0 indicates the absence of spatial correlation, or spatial independence, with relation to the variable tested.

Whereas the global indicators assume that all of the regions studied can be represented by a single value, the local indicators of spatial association (LISA), developed by Anselin (1995), are used to verify the existence of spatial correlation within the geographical units studied and to look for regional differences (peculiarities). The presence of areas with significant local indices is an indication of spatial heterogeneity (non-stationarity).

The Moran Local Index formula is given by:

$$ I_i = \frac{(x_i-\bar{x})\sum_{j=1}^{n} w_{ij}\,(x_j-\bar{x})}{\frac{1}{n}\sum_{j=1}^{n}(x_j-\bar{x})^{2}} \tag{2} $$

The database used in applying the Moran Global and Local Indices was the total database of records (without subdivision of samples) and the variable tested was the regional default rate, calculated via the following formula:

$$ \text{Default rate}_i = \frac{\text{number of contracts in default in region } i}{\text{total number of contracts in region } i} \tag{3} $$

In this study the Moran Global Index was used to verify the existence of spatial correlation in the default rate between the regions in the DF. The Moran Local Index was used to verify the existence of regions with different default rates in relation to the others. The existence of significant regions (the confidence level used for the Moran Local Index was 95%) may indicate that the regression models developed for these regions differ from the models for the other regions in the study, which may warrant applying the GWLR to this target population.
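As an illustration of how the two indicators behave, the following minimal sketch computes the global Moran's I and the local Moran indices using their standard textbook definitions. The four default rates and the border-based adjacency matrix are hypothetical, and the function names are chosen here for illustration; the study itself does not state which proximity matrix was used.

```python
import numpy as np


def global_moran(x, w):
    """Global Moran's I for values x and spatial weights w (n x n, zero diagonal)."""
    x = np.asarray(x, dtype=float)
    w = np.asarray(w, dtype=float)
    z = x - x.mean()                      # deviations from the mean
    return (x.size / w.sum()) * (z @ w @ z) / (z @ z)


def local_moran(x, w):
    """Local Moran index I_i for each region (Anselin, 1995)."""
    x = np.asarray(x, dtype=float)
    w = np.asarray(w, dtype=float)
    z = x - x.mean()
    m2 = (z @ z) / x.size                 # second moment of the deviations
    return z * (w @ z) / m2


# Hypothetical default rates for four regions arranged in a row;
# adjacency[i, j] = 1 when regions i and j share a border (an assumption).
rates = np.array([0.13, 0.18, 0.30, 0.35])
adjacency = np.array([
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
], dtype=float)

moran_i = global_moran(rates, adjacency)
moran_local = local_moran(rates, adjacency)
print(f"Global Moran's I: {moran_i:.4f}")
print("Local Moran indices:", np.round(moran_local, 4))
```

Because neighboring regions in this toy example have similar rates, the global index is positive; an alternating pattern such as [1, 0, 1, 0] on the same adjacency yields -1. Assessing significance, as done in the study at the 95% confidence level, would additionally require a permutation or normal-approximation test, which is not shown here.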

2.3. Geographically Weighted Regression

According to Fotheringham, Brunsdon, and Charlton (2002), given a basic linear regression model, the equivalent expression for the GWR is given by:

$$ y_i = \beta_0(u_i,v_i) + \sum_{k=1}^{p}\beta_k(u_i,v_i)\,x_{ik} + \varepsilon_i \tag{4} $$

It is noted from the expression above that the model parameters represented by the function βk(ui, vi) vary according to the values (ui, vi), which represent the latitude and longitude coordinates for observation (region) i, resulting in a different model for each region in the study. The assumptions of the classical linear regression model remain in place for GWR.

The matrix form for estimating the GWR parameters is given by:

$$ \hat{\beta}(u_i,v_i) = \left[X^{T}W(u_i,v_i)\,X\right]^{-1}X^{T}W(u_i,v_i)\,Y \tag{5} $$

in which

$$ W(u_i,v_i) = \operatorname{diag}\left(w_{i1}, w_{i2}, \ldots, w_{in}\right) \tag{6} $$

W(ui, vi) is a diagonal matrix, different for each point i of coordinates (ui, vi), containing in its main diagonal the weights wij obtained via the weighting functions, or kernels. Setting all the weights wij equal to 1 makes W(ui, vi) the identity matrix, which, substituted in (5), turns it back into the classical linear regression model.

The two main weighting functions found in the literature are the Normal or Gaussian function and the Bisquare function. The formulas for both functions are presented in Table 1.

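Table 1 Weighting functions or kernels.

| Weighting Function | Weighting Function Formula |
| --- | --- |
| Fixed Gaussian | $w_{ij} = \exp\left[-\frac{1}{2}\left(\frac{d_{ij}}{b}\right)^{2}\right]$ |
| Fixed Bisquare | $w_{ij} = \left[1-\left(\frac{d_{ij}}{b}\right)^{2}\right]^{2}$ if $d_{ij} < b$; $w_{ij} = 0$ otherwise |
| Adaptive Gaussian | $w_{ij} = \exp\left[-\frac{1}{2}\left(\frac{d_{ij}}{b_{i(k)}}\right)^{2}\right]$ |
| Adaptive Bisquare | $w_{ij} = \left[1-\left(\frac{d_{ij}}{b_{i(k)}}\right)^{2}\right]^{2}$ if $d_{ij} < b_{i(k)}$; $w_{ij} = 0$ otherwise |

Source: Fotheringham, Brunsdon, and Charlton (2002).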

It is noted from Table 1 that there are two types of expressions for each one of the Gaussian and Bisquare functions, which differ in the method of choosing the b (bandwidth) parameter to be used (whether fixed or variable). The dij parameter contained in the weighting functions represents the distance from point i to point j, the b parameter is the fixed bandwidth (smoothing parameter), and the bi(k) parameter represents the adaptive bandwidth, with the letter k representing the number of neighbors closest to point i.

The bandwidth parameter controls the spread (variance) of the weighting function; for this reason, in situations in which the data are not equally distributed between regions, use of the adaptive bandwidth is recommended. Figure 3 illustrates the bandwidth in the weighting function and Figures 4 and 5 exemplify the use of fixed or adaptive bandwidths.
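The difference between the fixed and adaptive variants can be sketched in a few lines. The functions below follow the usual Gaussian and bisquare forms attributed to Fotheringham et al. (2002); the function names, and the convention that the adaptive bandwidth is the distance to the k-th nearest neighbor, are assumptions made here for illustration.

```python
import numpy as np


def gaussian_kernel(d, b):
    """Gaussian weight: decays smoothly with distance and never reaches zero."""
    d = np.asarray(d, dtype=float)
    return np.exp(-0.5 * (d / b) ** 2)


def bisquare_kernel(d, b):
    """Bisquare weight: exactly zero for distances at or beyond the bandwidth b."""
    d = np.asarray(d, dtype=float)
    return np.where(d < b, (1.0 - (d / b) ** 2) ** 2, 0.0)


def adaptive_bandwidth(d_to_others, k):
    """Assumed convention: bandwidth = distance from point i to its k-th nearest neighbor."""
    return float(np.sort(np.asarray(d_to_others, dtype=float))[k - 1])


# Hypothetical distances from a point i to four other points.
distances = np.array([5.0, 1.0, 3.0, 2.0])

b_fixed = 2.0                                   # same bandwidth everywhere
b_adapt = adaptive_bandwidth(distances, k=3)    # widens where neighbors are sparse

print("fixed Gaussian:   ", np.round(gaussian_kernel(distances, b_fixed), 4))
print("adaptive bisquare:", np.round(bisquare_kernel(distances, b_adapt), 4))
```

With a fixed bandwidth, sparsely populated areas may contribute very few effectively weighted observations; the adaptive version keeps the number of contributing neighbors roughly constant at k.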




Figure 3. Bandwidth or Smoothing Parameter.
Source: Adapted from Fotheringham et al. (2002).

Figure 4. Spatial weighting functions with fixed Bandwidth.
Source: Adapted from Fotheringham et al. (2002).

Figure 5. Spatial weighting functions with adaptive Bandwidth.
Source: Adapted from Fotheringham et al. (2002).



When developing a model via GWR with a fixed bandwidth, the bandwidth must be specified as a value in units of distance. With an adaptive bandwidth, a fixed number k of closest neighbors to be used in the models must be defined instead, and based on this quantity k, the value of the bandwidth varies between the regions being studied.
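The local estimation step can be sketched as a weighted least squares fit at a single location. This is a minimal illustration under stated assumptions: a fixed Gaussian kernel, Euclidean distances on hypothetical planar coordinates, and synthetic data whose slope grows from west to east; the function name `gwr_local_fit` is introduced here and is not taken from GWR4.

```python
import numpy as np


def gwr_local_fit(X, y, coords, i, b):
    """Weighted least squares estimate [X'WX]^{-1} X'Wy at location i (fixed Gaussian kernel)."""
    d = np.linalg.norm(coords - coords[i], axis=1)   # distances d_ij from point i
    w = np.exp(-0.5 * (d / b) ** 2)                  # diagonal of W(u_i, v_i)
    Xw = X * w[:, None]                              # W X without building the diagonal matrix
    return np.linalg.solve(X.T @ Xw, Xw.T @ y)


# Synthetic, hypothetical data: the slope increases with the first coordinate.
rng = np.random.default_rng(0)
n = 200
coords = rng.uniform(0.0, 10.0, size=(n, 2))
x = rng.normal(size=n)
y = 2.0 + (1.0 + 0.3 * coords[:, 0]) * x + rng.normal(scale=0.1, size=n)
X = np.column_stack([np.ones(n), x])

west = int(np.argmin(coords[:, 0]))
east = int(np.argmax(coords[:, 0]))
beta_west = gwr_local_fit(X, y, coords, west, b=1.5)
beta_east = gwr_local_fit(X, y, coords, east, b=1.5)
print("local slope (west):", round(float(beta_west[1]), 3))
print("local slope (east):", round(float(beta_east[1]), 3))
```

With a very large bandwidth all weights approach 1 and the local fit coincides with the ordinary least squares fit, which mirrors the remark that the identity weight matrix recovers the classical linear regression model.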

2.4. Geographically Weighted Logistic Regression

When the response variable is binary, GWR should be applied via Geographically Weighted Logistic Regression (GWLR), in which the probability of the event of interest occurring is given by:

$$ \pi(x_j) = \frac{\exp\left[\beta_0(u_i,v_i)+\sum_{k=1}^{p}\beta_k(u_i,v_i)\,x_{jk}\right]}{1+\exp\left[\beta_0(u_i,v_i)+\sum_{k=1}^{p}\beta_k(u_i,v_i)\,x_{jk}\right]} \tag{7} $$

or, equivalently, in the form:

$$ \ln\left[\frac{\pi(x_j)}{1-\pi(x_j)}\right] = \beta_0(u_i,v_i)+\sum_{k=1}^{p}\beta_k(u_i,v_i)\,x_{jk} \tag{8} $$

in which π(xj) is the probability of the jth client defaulting and the function βk(ui, vi) represents the parameters (coefficients) of the p variables in the model, indexed by k, which vary according to the region i of latitude and longitude coordinates (ui, vi).

The GWLR parameters are estimated via the maximum likelihood method, and the GWLR likelihood function is represented by the following expression:

$$ L\left[\beta(u_i,v_i)\right] = \prod_{j=1}^{n}\pi(x_j)^{y_j}\left[1-\pi(x_j)\right]^{1-y_j} \tag{9} $$

By applying the natural logarithm transformation (ln) and developing the formula, we obtain:

$$ \ln L\left[\beta(u_i,v_i)\right] = \sum_{j=1}^{n}\left\{y_j\ln\pi(x_j)+(1-y_j)\ln\left[1-\pi(x_j)\right]\right\} \tag{10} $$

The W(ui, vi) matrix described in (6) features weights wij (calculated via the weighting functions shown in Table 1) and is used to geographically weight the observations in the estimation of each set of parameters βk(ui, vi). That is, this matrix assigns a higher weight to the observations geographically closest to region i in the estimation of its parameters βk(ui, vi), and a lower or zero weight (depending on the weighting function chosen) to the observations most distant from region i. The W(ui, vi) matrix also varies according to the location of each borrower and enters the log-likelihood function in the following way:

$$ \ln L\left[\beta(u_i,v_i)\right] = \sum_{j=1}^{n} w_{ij}\left\{y_j\ln\pi(x_j)+(1-y_j)\ln\left[1-\pi(x_j)\right]\right\} \tag{11} $$

Similar to the logistic regression model, after differentiating (11) with respect to β(ui, vi) and equating to zero, the model parameters are estimated using iterative numerical methods, such as the iteratively reweighted least squares (IRLS) method. It should be noted that this maximization procedure is carried out for each one of the functions related to each region i in the study.
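The maximization for a single region can be sketched as follows. This is a minimal illustration, not the GWR4 implementation: it uses Newton-Raphson updates on the geographically weighted log-likelihood, which for the logistic link are equivalent to IRLS steps. The data, the coordinates, the bandwidth of 3.0, and the function name `gwlr_local_fit` are all hypothetical.

```python
import numpy as np


def gwlr_local_fit(X, y, w, n_iter=25, tol=1e-8):
    """Maximize the weighted logistic log-likelihood for one region by Newton/IRLS steps."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(X @ beta)))    # current default probabilities
        grad = X.T @ (w * (y - p))               # weighted score vector
        v = w * p * (1.0 - p)                    # geographic weight times binomial variance
        hess = X.T @ (X * v[:, None])            # weighted information matrix
        step = np.linalg.solve(hess, grad)
        beta = beta + step
        if np.max(np.abs(step)) < tol:
            break
    return beta


# Synthetic, hypothetical borrowers with one predictor and planar coordinates.
rng = np.random.default_rng(1)
n = 500
coords = rng.uniform(0.0, 10.0, size=(n, 2))
x = rng.normal(size=n)
X = np.column_stack([np.ones(n), x])
p_true = 1.0 / (1.0 + np.exp(-(-1.0 + 0.8 * x)))
y = (rng.random(n) < p_true).astype(float)

# Gaussian kernel weights relative to the location of borrower 0 (assumed bandwidth).
d = np.linalg.norm(coords - coords[0], axis=1)
w = np.exp(-0.5 * (d / 3.0) ** 2)

beta_hat = gwlr_local_fit(X, y, w)
print("local coefficients:", np.round(beta_hat, 3))
```

Repeating this fit with the weights recomputed for each region's coordinates gives one set of coefficients per region, which is what distinguishes the local models from the single global model.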

Initially, four different models were developed using each one of the weighting functions presented in Table 1. The best model based on AICc was selected for comparison with the global model and to compare between the local models (the models generated for each region in the DF) in terms of significance of the variables that composed the final formula and estimations of the coefficients of the variables.

2.5. Comparison Between the Models

The metrics used to compare the models developed via GWLR and Logistic Regression were: the AICc informational criterion (Hurvich, Simonoff, & Tsai, 1998), the accuracy of the models, the percentage of false positives, the sum of the value of false positive debt, and the expected monetary value of portfolio defaults compared with the monetary value of defaults observed.
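For reference, a commonly used form of the corrected criterion is sketched below. This is an assumption for illustration: the text does not state the exact variant computed by SAS or GWR4, and for local models the parameter count k is usually replaced by an effective (possibly non-integer) number of parameters. The input values in the example are hypothetical.

```python
def aicc(log_likelihood, k, n):
    """Small-sample corrected AIC: -2 ln L + 2k + 2k(k + 1) / (n - k - 1)."""
    return -2.0 * log_likelihood + 2.0 * k + (2.0 * k * (k + 1.0)) / (n - k - 1.0)


# Hypothetical values: log-likelihood of -10 with 2 parameters and 100 observations.
print(round(aicc(-10.0, 2, 100), 4))
```

Lower values indicate a better trade-off between fit and complexity, so the model with the smallest AICc is preferred.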

The accuracy of the models and percentage of false positives were obtained via the confusion matrix, given by:

Table 2 Confusion Matrix

| | Value Observed: 0 | Value Observed: 1 |
| --- | --- | --- |
| Value Predicted: 0 | TP | FP |
| Value Predicted: 1 | FN | TN |

Note. TP (True Positive): number of good clients classified as good; TN (True Negative): number of bad clients classified as bad; FP (False Positive): number of bad clients classified as good; FN (False Negative): number of good clients classified as bad.
Source: Adapted from Crook, Edelman, and Thomas (2007).

According to Table 2, there are two types of error that a classifying model can commit: rejecting good clients (False Negative, FN), or approving bad clients (False Positive, FP). The latter, also known as a Type II Error, is considered to be the worse of the two errors, since these clients would be approved and could generate financial losses for the institution. Thus, the FP percentage was one of the metrics used to compare the models.

The sum of the outstanding balance of all borrowers classified as FP was measured to verify the monetary value that would enter into default due to classification error in the model.

The accuracy of the model is calculated using the proportion of TP and TN in relation to the total, as in the following formula:

$$ \text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \tag{12} $$
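The confusion-matrix metrics can be sketched as below, following the convention of Table 2 in which a "positive" is a good client (Y=0). The toy labels and balances are hypothetical, and expressing the FP percentage as a share of all classified borrowers is an assumption, since the denominator is not stated in the text.

```python
import numpy as np


def credit_confusion_metrics(y_true, y_pred, balance):
    """Accuracy, FP share, and FP debt, where 0 = good client and 1 = default."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    balance = np.asarray(balance, dtype=float)
    tp = int(np.sum((y_pred == 0) & (y_true == 0)))   # good classified as good
    tn = int(np.sum((y_pred == 1) & (y_true == 1)))   # bad classified as bad
    fp_mask = (y_pred == 0) & (y_true == 1)           # bad classified as good
    fp = int(np.sum(fp_mask))
    fn = int(np.sum((y_pred == 1) & (y_true == 0)))   # good classified as bad
    total = tp + tn + fp + fn
    return {
        "accuracy": (tp + tn) / total,
        "fp_share": fp / total,                       # assumed denominator: all borrowers
        "fp_debt": float(balance[fp_mask].sum()),     # balance approved in error
    }


# Hypothetical portfolio of six borrowers.
observed = [0, 0, 1, 1, 1, 0]
predicted = [0, 1, 0, 1, 0, 0]
balances = [100.0, 200.0, 300.0, 400.0, 500.0, 600.0]
metrics = credit_confusion_metrics(observed, predicted, balances)
print(metrics)
```

In this toy portfolio two defaulting borrowers are approved, so their combined balance of 800.0 is the amount exposed by classification error.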



The expected monetary value of portfolio defaults was calculated using the expected value formula for discrete distributions, given by:

$$ E = \sum_{i=1}^{n} x_i \, P(Y_i = 1) \tag{13} $$

in which n is the total number of borrowers in a portfolio, xi is the outstanding balance on the credit transaction for borrower i, and P(Yi = 1) is the probability of borrower i defaulting, resulting from the credit scoring models. This value was compared with the sum of the debts of defaulting clients, with the aim of verifying which model comes closest to the real default value.
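A small numerical sketch of this comparison follows. The balances, predicted probabilities, and observed outcomes are hypothetical, as are the function and variable names.

```python
import numpy as np


def expected_default_value(balance, p_default):
    """Sum of each outstanding balance weighted by its predicted default probability."""
    return float(np.dot(np.asarray(balance, dtype=float),
                        np.asarray(p_default, dtype=float)))


# Hypothetical portfolio of three borrowers.
balances = [1000.0, 2500.0, 400.0]
p_default = [0.10, 0.40, 0.25]
defaulted = np.array([0, 1, 0])          # observed outcome, Y = 1 means default

expected = expected_default_value(balances, p_default)
observed = float(np.dot(balances, defaulted))
print(f"expected: {expected:.2f}  observed: {observed:.2f}")
```

The model whose expected value lies closest to the observed value of defaults is the one that best anticipates portfolio losses under this metric.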

3. RESULTS

3.1. Univariate and Bivariate Analyses

The results on general default rates and those by region are shown in Tables 3 and 4 and the spatial distribution of default rates is found in Figure 6.

Table 3 Distribution of frequencies of response variable Y

| Y | Frequency | Percentage | Cumulative Frequency | Cumulative Percentage |
| --- | --- | --- | --- | --- |
| 0 | 16,011 | 72.34% | 16,011 | 72.34% |
| 1 | 6,121 | 27.66% | 22,132 | 100.00% |

Source: Prepared by the authors.

Table 4. Default rates by DF region.

| Region | Number of Defaults | Total Number | Default Rate |
| --- | --- | --- | --- |
| LAGO SUL | 79 | 597 | 13.233% |
| CRUZEIRO | 136 | 772 | 17.617% |
| BRASÍLIA | 423 | 2,203 | 19.201% |
| GUARÁ | 373 | 1,545 | 24.142% |
| LAGO NORTE | 82 | 331 | 24.773% |
| TAGUATINGA | 921 | 3,682 | 25.014% |
| NÚCLEO BANDEIRANTE | 107 | 396 | 27.020% |
| SOBRADINHO | 441 | 1,614 | 27.323% |
| GAMA | 330 | 1,136 | 29.049% |
| SAMAMBAIA | 441 | 1,488 | 29.637% |
| RIACHO FUNDO | 221 | 697 | 31.707% |
| BRAZLÂNDIA | 124 | 390 | 31.795% |
| CEILÂNDIA | 882 | 2,671 | 33.021% |
| SÃO SEBASTIÃO | 222 | 667 | 33.283% |
| PLANALTINA | 441 | 1,323 | 33.333% |
| CANDANGOLÂNDIA | 58 | 173 | 33.526% |
| SANTA MARIA | 347 | 1,031 | 33.657% |
| RECANTO DAS EMAS | 267 | 778 | 34.319% |
| PARANOÁ | 226 | 638 | 35.423% |

Figure 6. Spatial distribution of default rates in the Distrito Federal.
Source: Prepared by the authors.

As shown in Table 3, the general default rate in the DF was 27.66%, and Table 4 shows that only eight regions (Lago Sul, Cruzeiro, Brasília, Guará, Lago Norte, Taguatinga, Núcleo Bandeirante, and Sobradinho) have default rates below this general average. It is also noted that the Lago Sul region presented the lowest default rate of the regions studied, followed by the Cruzeiro and Brasília regions. As can be observed in Figure 6, these three regions are located in the center of the Distrito Federal.

Figure 6 also shows that default rates increase with distance from the central point of the DF (the highest rates are represented by the darkest areas on the map). The Santa Maria, Recanto das Emas, and Paranoá regions stand out in negative terms by presenting the worst default rates.

The frequencies were calculated, along with the mean, median, maximum, minimum, and quartile statistics for the candidate variables for composing the models, and as there were no inconsistencies, missing values, or outliers, no variable was removed in this stage of the study.

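The bivariate analysis consisted of calculating the cross frequency between the predictive variables and the response variable, with the aim of identifying the variables that differentiate credit risk among the target population in the study. The variables were categorized based on Relative Risk (14), and using this categorization, dummy variables were created to compose the models.

$$ \text{Relative Risk}_c = \frac{G_c / G}{B_c / B} \tag{14} $$

in which Gc and Bc are the numbers of good and bad clients in category c, and G and B are the total numbers of good and bad clients in the database.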

All attributes of the unemployment rate and inflation variables presented similar levels of credit risk, and for this reason, they were excluded from the study. The categories for the other variables are found in Table 5.



It is observed in Table 5 that borrowers with higher Formal Incomes presented a lower credit risk. It is also observed that the higher the borrower’s Level of Education, the lower the risk, with PhDs presenting a much higher relative risk (and therefore a much lower credit risk) than the rest. The results also indicated that the older the borrower and the shorter the loan period, the lower the credit risk. With relation to the borrower’s time of relationship with the institution, those with shorter times presented a greater credit risk.

The SELIC rate is the basic interest rate in the Brazilian economy. An increase in the SELIC makes it more expensive for financial institutions to raise funds, which consequently makes credit transactions more expensive. Higher interest rates in credit transactions reduce the purchasing power of borrowers, and because of this, it is expected that the higher the SELIC rate, the greater defaults and credit risk will be. However, as observed in Table 5, the results obtained were the opposite of those expected, with lower relative risk (greater credit risk) for SELIC values below 10.00%, and lower credit risk for values above 10.00%. Even in light of these results, the decision was made to maintain the SELIC rate variable in the study because it was the only remaining macroeconomic variable. Subsequent studies using a more comprehensive target population should be conducted to better assess this variable.

Based on this categorization, dummy variables were created to be used for composing the regression models.

3.2. Spatial Indicators

The next stage in the study involved applying the Moran Global and Local Indices with the aim of verifying the existence of a spatial correlation between the default rate variable and the individual regions in the study population.

The Moran Global Index presented a value of 0.05, indicating an almost null spatial dependency.

Table 5 Categorization and Relative Risk of the variables.

| Variable | Class | Categorization | Relative Risk | Number of Good Clients | Number of Bad Clients | Total |
| --- | --- | --- | --- | --- | --- | --- |
| Formal Income (minimum wages) | 1 | >= 7.5 | 1.4196 | 3,602 | 970 | 4,572 |
| | 2 | [3.5 ; 7.5[ | 1.1580 | 3,841 | 1,268 | 5,109 |
| | 3 | < 3.5 | 0.8435 | 8,568 | 3,883 | 12,451 |
| Level of Education | 1 | PhD | 6.1168 | 48 | 3 | 51 |
| | 2 | Masters | 2.1941 | 132 | 23 | 155 |
| | 3 | Specialization or Completed College Degree | 1.5530 | 4,570 | 1,125 | 5,695 |
| | 4 | Incomplete College Degree or lower Level of Education | 0.8662 | 11,261 | 4,970 | 16,231 |
| Age (years) | 1 | > 55 | 2.2855 | 3,019 | 505 | 3,524 |
| | 2 | ]49 ; 55] | 1.5760 | 1,954 | 474 | 2,428 |
| | 3 | ]40 ; 49] | 1.1970 | 3,610 | 1,153 | 4,763 |
| | 4 | ]30 ; 40] | 0.8634 | 4,275 | 1,893 | 6,168 |
| | 5 | <= 30 | 0.5751 | 3,153 | 2,096 | 5,249 |
| Period Contracted (months) | 1 | <= 12 | 1.9630 | 724 | 141 | 865 |
| | 2 | ]12 ; 24] | 1.4197 | 3,747 | 1,009 | 4,756 |
| | 3 | > 24 | 0.8875 | 11,540 | 4,971 | 16,511 |
| Time of Relationship | 1 | > 50 | 2.9392 | 3,798 | 494 | 4,292 |
| | 2 | ]20 ; 50] | 1.6576 | 2,337 | 539 | 2,876 |
| | 3 | ]4 ; 20] | 1.0095 | 3,343 | 1,266 | 4,609 |
| | 4 | <= 4 | 0.6535 | 6,533 | 3,822 | 10,355 |
| SELIC Rate | 1 | >= 10 | 1.0115 | 14,515 | 5,486 | 20,001 |
| | 2 | < 10 | 0.9007 | 1,496 | 635 | 2,131 |



Figure 7. Moran dispersion map.
Source: Moran (1950)

Figure 7 presents the Moran dispersion map, in which the regions colored in red tones present positive spatial dependency, whereas the regions colored in blue tones present negative spatial dependency. The “Low-Low” type regions presented the lowest default rates, followed by the “Low-High”, “High-Low”, and “High-High” regions. These results can be considered as spatial clusters of the default rate variable. This information could be used by the financial institution to define the target population in loan recovery campaigns, in which obtaining payment from clients residing in the “High-High” regions should be the initial focus of activities, with the aim of improving the company’s financial results.

The results found for the Moran Local Index, using a 95% confidence level, are presented in the Moran Map in Figure 8.

Figure 8. Moran Map with 95% confidence.
Source: Prepared by the authors.

The Moran Map indicates the existence of local correlations in some regions that are significantly different from the others, revealing indications of spatial heterogeneity. The regions that are significant in the local index, and which are labeled in Figure 8, are Brasília and Cruzeiro (Low-Low), Lago Sul (Low-High), and Candangolândia (High-Low). According to Fotheringham et al. (2002), the existence of significant values for the Moran Local Index warrants applying the GWLR technique.

3.3. Global Model via Logistic Regression

The global model was developed using the development sample, containing 10,944 records.

The variables used in developing the model were all of the dummies created based on the categorizations presented in Table 5. The variables selected by the stepwise method, with p-values under 0.10 (10% level of significance), compose the final logistic regression model (global model) and are presented in Table 6.



Table 6 Final variables in the global model and respective coefficients.

| Variables | Coefficients | Standard Error | Wald Statistic | Odds Ratio |
| --- | --- | --- | --- | --- |
| Intercept | -1.3068 | 0.0893 | -14.6338* | - |
| d_age1 | -0.5665 | 0.084 | -6.7440* | 0.567 |
| d_age2 | -0.2891 | 0.0907 | -3.1874* | 0.749 |
| d_age4 | 0.1481 | 0.0635 | 2.3323* | 1.160 |
| d_age5 | 0.5684 | 0.0653 | 8.7044* | 1.765 |
| d_education4 | 0.3019 | 0.0614 | 4.9169* | 1.352 |
| d_time_rel1 | -0.7764 | 0.0862 | -9.0070* | 0.46 |
| d_time_rel2 | -0.3529 | 0.0844 | -4.1813* | 0.703 |
| d_time_rel4 | 0.4206 | 0.0566 | 7.4311* | 1.523 |
| d_income1 | 0.3742 | 0.0705 | 5.3078* | 1.454 |
| d_income2 | 0.1135 | 0.06 | 1.8917** | 1.120 |
| d_pd_contract1 | -0.6099 | 0.1398 | -4.3627* | 0.543 |
| d_pd_contract2 | -0.4165 | 0.0541 | -7.6987* | 0.659 |

* p-value below 0.05. ** p-value below 0.10.
Source: Prepared by the authors.
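The columns of Table 6 are linked by simple relations that can be checked numerically: the last column is the exponential of the coefficient, and the Wald statistic is the coefficient divided by the value in the third column. The sketch below copies three rows from the table; the helper names and the example borrower (aged 30 or under, with a relationship time of 4 or less, and all other dummies equal to zero) are hypothetical.

```python
import math

# Coefficient and third-column value for three rows copied from Table 6.
table6 = {
    "d_age1": (-0.5665, 0.0840),
    "d_age5": (0.5684, 0.0653),
    "d_time_rel4": (0.4206, 0.0566),
}
INTERCEPT = -1.3068


def odds_ratio(beta):
    """Multiplicative change in the odds of default when the dummy equals 1."""
    return math.exp(beta)


def wald_statistic(beta, se):
    """Coefficient divided by its standard error."""
    return beta / se


def default_probability(intercept, active_betas):
    """Logistic probability of default for the dummies that are switched on."""
    eta = intercept + sum(active_betas)
    return 1.0 / (1.0 + math.exp(-eta))


for name, (beta, se) in table6.items():
    print(name, round(odds_ratio(beta), 3), round(wald_statistic(beta, se), 2))

# Hypothetical borrower: d_age5 = 1 and d_time_rel4 = 1, all other dummies 0.
p = default_probability(INTERCEPT, [table6["d_age5"][0], table6["d_time_rel4"][0]])
print(f"estimated default probability: {p:.3f}")
```

Small differences in the last decimal place relative to the table are expected, because the published coefficients are themselves rounded.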

Th e SELIC variable was not signifi cant and was not selected to compose the fi nal global regression model. One possible explanation for this fact is the use of a short loan contract period, leading to few distinct values for this variable.

Moreover, the coeffi cients for the Formal Income variable were inverted, in which the best income bands (d_income1 and d_income2) obtained worse coeffi cients with relation to the worst band (d_income4, the coeffi cient for which is zero). Th is result can be explained by the variable’s behavior, with inversions of relative risk in its value ranges when categorized granularly. Another possible explanation is that the categorization was carried out based on total records and the model was developed using the development database, which covers a smaller number of records.

Th e nomenclature for the dummy variables respects the nomenclature for the categories shown in Table 5. For example, the dummy d_age1 represents the age category “> 55 years old” and is the best category of this variable with relation to credit risk, and the dummy d_education4 represents clients in the category “Incomplete College Degree or lower level of education”, with this being the worst category for the Level of Education variable with relation to credit risk.

Response variable Y takes the occurrence of default (Y=1) as the event of interest, so the probability resulting from the logistic regression and GWLR models refers to the probability of this event occurring; that is, of the client defaulting. Thus, it can be noted in Table 6 that all of the global regression coefficients, except for those of the Formal Income variable, are coherent, since the best categories for each variable with relation to credit risk presented lower coefficients than the higher risk categories of the same variable; that is, the presence of the best categories for each variable reduces the probability of a client defaulting. This analysis is called congruence analysis; it is important for verifying whether there are inversions in the coefficients and whether categorization of the variables was carried out correctly.
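To illustrate how the coefficients in Table 6 translate into a probability of default, the sketch below applies the standard logistic function to the global model. The two borrower profiles are hypothetical, and every dummy that is not listed in a profile is assumed to be zero (the reference category).

```python
import math

# Global model coefficients copied from Table 6
GLOBAL_COEFS = {
    "intercept": -1.3068,
    "d_age1": -0.5665, "d_age2": -0.2891, "d_age4": 0.1481, "d_age5": 0.5684,
    "d_education4": 0.3019,
    "d_time_rel1": -0.7764, "d_time_rel2": -0.3529, "d_time_rel4": 0.4206,
    "d_income1": 0.3742, "d_income2": 0.1135,
    "d_pd_contract1": -0.6099, "d_pd_contract2": -0.4165,
}

def default_probability(active_dummies, coefs=GLOBAL_COEFS):
    """P(Y=1) for a borrower whose active (value 1) dummies are listed."""
    linear = coefs["intercept"] + sum(coefs[name] for name in active_dummies)
    return 1.0 / (1.0 + math.exp(-linear))

# Hypothetical profiles: best and worst categories of several variables
low_risk = ["d_age1", "d_time_rel1", "d_pd_contract1"]
high_risk = ["d_age5", "d_education4", "d_time_rel4", "d_income1"]

p_low = default_probability(low_risk)
p_high = default_probability(high_risk)
print(f"low-risk profile:  {p_low:.3f}")
print(f"high-risk profile: {p_high:.3f}")
```

Under these assumptions the two profiles yield probabilities of roughly 0.04 and 0.59, which shows how the presence of the best categories lowers the estimated probability of default.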

The value found for the AICc informational criterion of the global model was 12,098.29, with this value being used for comparison with the models estimated via GWLR, the results from which are presented below.

3.4. Local Models via Geographically Weighted Logistic Regression (GWLR)

As described in the methodology, four models using the GWLR were developed, one for each weighting function shown in Table 1. The predictive variables used were those selected by the logistic regression model, shown in Table 6.

The best model using GWLR, following the AICc criterion, was the Adaptive Gaussian model, with a value of 2,022 closest neighbors to estimate the adaptive bandwidths.
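To make the role of the 2,022 closest neighbors concrete, the sketch below computes adaptive Gaussian weights for a single regression point. It assumes the commonly used form w_ij = exp(-0.5 (d_ij / h_i)^2), where d_ij is the distance between points i and j and the bandwidth h_i is the distance from point i to its k-th nearest neighbor. The exact function used in the study is the one defined in Table 1, and the coordinates below are synthetic.

```python
import numpy as np

def adaptive_gaussian_weights(coords, i, k):
    """Gaussian kernel weights for regression point i, k-nearest-neighbor bandwidth."""
    dists = np.linalg.norm(coords - coords[i], axis=1)
    # Bandwidth: distance to the k-th nearest neighbor
    # (index 0 of the sorted distances is the point itself, at distance 0).
    bandwidth = np.sort(dists)[k]
    return np.exp(-0.5 * (dists / bandwidth) ** 2)

rng = np.random.default_rng(0)
coords = rng.uniform(0.0, 50.0, size=(5000, 2))  # synthetic borrower locations
weights = adaptive_gaussian_weights(coords, i=0, k=2022)

print(weights.min(), weights.max())
```

With an adaptive bandwidth, the kernel widens where borrowers are sparse and narrows where they are dense, so each local model is always fitted with comparable effective information.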



Table 7 Statistics of the coefficients estimated in the Adaptive Gaussian GWLR model.

| Variable | Mean | Standard Deviation | Minimum | Maximum | Range | Q1 | Median (Q2) | Q3 |
|---|---|---|---|---|---|---|---|---|
| Intercept | -1.2950 | 0.0432 | -1.3923 | -1.2006 | 0.1917 | -1.3201 | -1.2847 | -1.2689 |
| d_age1 | -0.6557 | 0.1193 | -1.0145 | -0.4850 | 0.5295 | -0.7164 | -0.6283 | -0.5676 |
| d_age2 | -0.3230 | 0.0950 | -0.4969 | -0.1507 | 0.3462 | -0.3586 | -0.3319 | -0.2660 |
| d_age4 | 0.0749 | 0.0760 | -0.0987 | 0.2164 | 0.3151 | 0.0272 | 0.0616 | 0.1320 |
| d_age5 | 0.5054 | 0.0696 | 0.3130 | 0.5910 | 0.2780 | 0.4852 | 0.5275 | 0.5605 |
| d_education4 | 0.3004 | 0.0376 | 0.2124 | 0.3518 | 0.1394 | 0.2851 | 0.2979 | 0.3347 |
| d_time_rel1 | -0.6720 | 0.1019 | -0.8264 | -0.4858 | 0.3406 | -0.7626 | -0.6894 | -0.5817 |
| d_time_rel2 | -0.3436 | 0.0513 | -0.4208 | -0.2314 | 0.1894 | -0.3716 | -0.3465 | -0.3213 |
| d_time_rel4 | 0.4614 | 0.0543 | 0.3498 | 0.5573 | 0.2075 | 0.4393 | 0.4430 | 0.5201 |
| d_income1 | 0.3272 | 0.0732 | 0.2173 | 0.4769 | 0.2596 | 0.2680 | 0.3222 | 0.3638 |
| d_income2 | 0.1255 | 0.0443 | 0.0247 | 0.1791 | 0.1544 | 0.0996 | 0.1469 | 0.1669 |
| d_pd_contract1 | -0.6241 | 0.1160 | -0.7555 | -0.3766 | 0.3789 | -0.7183 | -0.6849 | -0.5065 |
| d_pd_contract2 | -0.4134 | 0.0332 | -0.4516 | -0.3327 | 0.1189 | -0.4479 | -0.4177 | -0.3904 |


Table 7 contains the descriptive statistics of the coefficients estimated by the GWLR model, in which it is noted that the averages for the coefficients were very close to the coefficients for the global model presented in Table 6.

Table 8 contains the final formula of the models estimated via Adaptive Gaussian GWLR for the 19 regions in the DF.




Table 8 Formulas of Local Regressions estimated via the Adaptive Gaussian GWLR model.

| Region | Intercept | d_age1 | d_age2 | d_age4 | d_age5 | d_education4 | d_time_rel1 | d_time_rel2 | d_time_rel4 | d_income1 | d_income2 | d_pd_contract1 | d_pd_contract2 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| BRASÍLIA | -1.304 | -0.839 | -0.266 | 0.028 NS | 0.438 | 0.231 | -0.581 | -0.321 | 0.520 | 0.291 | 0.100 NS | -0.468 | -0.371 |
| BRAZLÂNDIA | -1.320 | -0.571 | -0.363 | 0.097 NS | 0.522 | 0.310 | -0.691 | -0.330 | 0.496 | 0.367 | 0.109 NS | -0.685 | -0.431 |
| CANDANGOLÂNDIA | -1.256 | -0.740 | -0.340 | 0.011 NS | 0.438 | 0.282 | -0.636 | -0.346 | 0.455 | 0.261 | 0.128 | -0.618 | -0.396 |
| CEILÂNDIA | -1.342 | -0.485 | -0.497 | 0.090 NS | 0.548 | 0.351 | -0.763 | -0.421 | 0.537 | 0.477 | 0.147 NS | -0.712 | -0.448 |
| CRUZEIRO | -1.326 | -1.015 | -0.351 | -0.099 NS | 0.313 | 0.234 | -0.485 | -0.231 NS | 0.557 | 0.268 | 0.107 NS | -0.426 | -0.333 |
| GAMA | -1.285 | -0.619 | -0.334 | 0.132 | 0.572 | 0.296 | -0.826 | -0.366 | 0.443 | 0.323 | 0.179 | -0.647 | -0.363 |
| GUARÁ | -1.248 | -0.758 | -0.359 NS | -0.046 NS | 0.372 | 0.323 | -0.527 | -0.265 | 0.430 | 0.217 | 0.101 NS | -0.685 | -0.418 |
| LAGO NORTE | -1.378 | -0.755 | -0.156 | 0.148 NS | 0.522 | 0.212 | -0.555 | -0.289 | 0.554 | 0.327 | 0.043 NS | -0.377 | -0.370 |
| LAGO SUL | -1.257 | -0.716 | -0.308 | 0.057 NS | 0.489 | 0.268 | -0.745 | -0.407 | 0.423 | 0.292 | 0.122 NS | -0.528 | -0.396 |
| NÚCLEO BANDEIRANTE | -1.258 | -0.678 | -0.344 | 0.060 NS | 0.492 | 0.290 | -0.709 | -0.363 | 0.442 | 0.281 | 0.145 | -0.642 | -0.390 |
| PARANOÁ | -1.289 | -0.609 | -0.172 NS | 0.176 | 0.585 | 0.283 | -0.808 | -0.409 | 0.350 | 0.374 | 0.069 NS | -0.455 | -0.428 |
| PLANALTINA | -1.315 | -0.542 | -0.205 | 0.193 | 0.591 | 0.298 | -0.771 | -0.346 | 0.363 | 0.394 | 0.072 NS | -0.556 | -0.434 |
| RECANTO DAS EMAS | -1.253 | -0.628 | -0.372 | 0.079 NS | 0.530 | 0.300 | -0.741 | -0.378 | 0.459 | 0.321 | 0.155 | -0.692 | -0.398 |
| RIACHO FUNDO | -1.201 | -0.664 | -0.357 | 0.043 NS | 0.484 | 0.278 | -0.682 | -0.372 | 0.434 | 0.271 | 0.159 | -0.739 | -0.404 |
| SAMAMBAIA | -1.269 | -0.623 | -0.408 | 0.062 NS | 0.527 | 0.317 | -0.689 | -0.364 | 0.482 | 0.346 | 0.147 | -0.718 | -0.429 |
| SANTA MARIA | -1.286 | -0.628 | -0.332 | 0.124 | 0.561 | 0.297 | -0.807 | -0.367 | 0.439 | 0.322 | 0.167 | -0.627 | -0.373 |
| SÃO SEBASTIÃO | -1.273 | -0.624 | -0.247 | 0.141 | 0.567 | 0.289 | -0.822 | -0.408 | 0.373 | 0.354 | 0.108 NS | -0.507 | -0.418 |
| SOBRADINHO | -1.392 | -0.568 | -0.151 NS | 0.216 | 0.578 | 0.285 | -0.625 | -0.273 | 0.456 | 0.364 | 0.025 NS | -0.470 | -0.412 |
| TAGUATINGA | -1.271 | -0.666 | -0.312 | 0.027 NS | 0.485 | 0.335 | -0.582 | -0.322 | 0.439 | 0.259 | 0.173 | -0.756 | -0.452 |

Note. NS: Coefficient not significant with 90% confidence (p-value above 0.10).
Source: Prepared by the authors.



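Table 10 Confusion Matrix of the models using LR and GWLR.

| Value Predicted | Value Observed (LR): 0 | Value Observed (LR): 1 | Value Observed (GWLR): 0 | Value Observed (GWLR): 1 |
|---|---|---|---|---|
| 0 | 48.7% | 11.3% | 49.0% | 11.2% |
| 1 | 24.0% | 16.0% | 23.8% | 16.0% |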


Table 9 Descriptive Analysis of Model Scores.

| Model | Mean | Minimum | Q1 | Median (Q2) | Q3 | Maximum | Range |
|---|---|---|---|---|---|---|---|
| LR | 0.277 | 0.036 | 0.172 | 0.268 | 0.392 | 0.585 | 0.551 |
| GWLR | 0.272 | 0.035 | 0.166 | 0.270 | 0.378 | 0.639 | 0.603 |


It is noted in Table 8 that the Intercept was significant for all the regions in the Distrito Federal and varied from -1.3922 to -1.2005, indicating a regional difference between the values estimated.

With relation to the borrower’s age, the variables d_age1 and d_age5 were significant for all of the regions in the Distrito Federal, whereas the variables d_age2 and d_age4 were not significant for some regions, indicating that the borrower’s age influences risk differently, depending on the region studied.

The d_education4 variable was also significant for all of the regions in the Distrito Federal, presenting a small variation in coefficients between the regions.

With relation to the borrower’s Time of Relationship with the institution, the variables d_time_rel1 and d_time_rel4 were significant for all of the regions in the Distrito Federal, whereas the d_time_rel2 variable was not significant for the Cruzeiro region.

With relation to the borrower’s Income, the d_income1 variable was significant for all of the regions in the Distrito Federal, whereas the d_income2 variable was significant only for the regions of Candangolândia, Gama, Núcleo Bandeirante, Recanto das Emas, Riacho Fundo, Samambaia, Santa Maria, and Taguatinga, indicating that the borrower’s Income also influences credit risk differently between the regions.

The variables d_pd_contract1 and d_pd_contract2, which represent the Loan Contract Period, were significant for all of the regions in the Distrito Federal.
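The regional differences described above can be illustrated by scoring the same hypothetical borrower with two of the local formulas in Table 8. The sketch below uses the Brasília and Ceilândia rows and restricts itself to variables that are significant in every region, so no assumption is needed about how NS coefficients are handled in practice. The borrower profile is an assumption made for illustration.

```python
import math

# Coefficients copied from Table 8 (only variables significant in all regions)
LOCAL_COEFS = {
    "BRASÍLIA": {
        "intercept": -1.304, "d_age1": -0.839, "d_age5": 0.438,
        "d_education4": 0.231, "d_time_rel1": -0.581, "d_time_rel4": 0.520,
        "d_income1": 0.291, "d_pd_contract1": -0.468, "d_pd_contract2": -0.371,
    },
    "CEILÂNDIA": {
        "intercept": -1.342, "d_age1": -0.485, "d_age5": 0.548,
        "d_education4": 0.351, "d_time_rel1": -0.763, "d_time_rel4": 0.537,
        "d_income1": 0.477, "d_pd_contract1": -0.712, "d_pd_contract2": -0.448,
    },
}

def local_default_probability(region, active_dummies):
    """P(Y=1) under the local model of a region for the listed active dummies."""
    coefs = LOCAL_COEFS[region]
    linear = coefs["intercept"] + sum(coefs[name] for name in active_dummies)
    return 1.0 / (1.0 + math.exp(-linear))

# Hypothetical borrower in the worst category of four variables
profile = ["d_age5", "d_education4", "d_time_rel4", "d_income1"]
scores = {region: local_default_probability(region, profile) for region in LOCAL_COEFS}
for region, p in scores.items():
    print(f"{region}: {p:.3f}")
```

For this profile the Ceilândia formula gives a noticeably higher probability of default than the Brasília formula (about 0.64 versus about 0.54), and the Ceilândia value is close to the maximum GWLR score reported in Table 9.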

3.5. Comparison Between the Models

The comparison between the Logistic Regression model and the GWLR Adaptive Gaussian model was made using the following metrics: AICc informational criterion, Accuracy, Percentage of False Positives, Sum of Value of False Positive Debt, and Expected Monetary Value of Defaults in the portfolio compared with the monetary value of defaults observed.

Except for the AICc informational criterion, calculated when developing the model, the other metrics were calculated based on the validation database, composed of 11,188 records.

Table 9 shows the descriptive statistics for the scores obtained by both the models selected in the validation sample.

The means of the model scores were very close, differing only in the third decimal place; however, the model using GWLR presented a greater range of scores. Because few predictive variables were used, the scores produced by the models did not exceed 0.585 (LR) and 0.639 (GWLR).

To calculate the confusion matrix, a cut-off point had to be defined in terms of score, so that borrowers could be classified as good or bad (0 or 1). This cut-off point was defined based on the shortest distance between Sensitivity and Specificity and its value was 0.30.
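One way to implement this criterion is to scan candidate cut-offs and keep the one that minimizes the absolute difference between Sensitivity and Specificity. The sketch below does this on synthetic scores. The data, the grid of candidate cut-offs, and the function names are assumptions made for illustration, and the result is not expected to reproduce the value of 0.30 obtained in the study.

```python
import numpy as np

def sensitivity_specificity(y_true, scores, cutoff):
    """Sensitivity and specificity when scores >= cutoff are classified as bad (1)."""
    pred_bad = scores >= cutoff
    is_bad = y_true == 1
    sensitivity = np.mean(pred_bad[is_bad])     # defaults classified as bad
    specificity = np.mean(~pred_bad[~is_bad])   # non-defaults classified as good
    return sensitivity, specificity

def balanced_cutoff(y_true, scores, grid=None):
    """Cut-off on the grid with the smallest |sensitivity - specificity|."""
    if grid is None:
        grid = np.linspace(0.01, 0.99, 99)
    gaps = []
    for cutoff in grid:
        sens, spec = sensitivity_specificity(y_true, scores, cutoff)
        gaps.append(abs(sens - spec))
    return float(grid[int(np.argmin(gaps))])

# Synthetic validation sample: scores in (0, 1) and defaults drawn from them
rng = np.random.default_rng(42)
scores = rng.beta(2.0, 5.0, size=10_000)
y_true = rng.binomial(1, scores)

cutoff = balanced_cutoff(y_true, scores)
sens, spec = sensitivity_specificity(y_true, scores, cutoff)
print(f"cut-off = {cutoff:.2f}, sensitivity = {sens:.3f}, specificity = {spec:.3f}")
```

Balancing the two rates avoids a cut-off that favors only one class, which matters when defaults are a minority of the portfolio.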

It can be noted in Table 10 that the models presented very close results with regards to client classification.




Table 11 contains all of the metrics used for comparison between the models, in which a small difference is noted between the values of the indicators in the two models.

Table 11 Comparison between the LR and GWLR models.

| Model | AICc | Accuracy | % FP | Sum of Debt Value FP | Expected Default Value |
|---|---|---|---|---|---|
| LR | 12,098.29 | 64.7% | 11.3% | R$ 5,271,027.78 | R$ 11,909,313.79 |
| GWLR | 12,091.19 | 65% | 11.2% | R$ 5,484,464.08 | R$ 11,611,161.58 |


In Table 11, all of the values obtained for the metrics of the two models were also very close. The model using GWLR had the best (lowest) AICc informational criterion and the best (highest) Accuracy, which indicates a better percentage of hits, as well as a lower percentage of False Positives. The model using LR was slightly better in the metric Sum of the Value of False Positive Debt, which can be considered an estimate of the monetary value that would be granted and enter into default, resulting in financial loss for the institution, and in the metric Expected Value of Defaults: the sum of the value of debt of all of the contracts in default (Y=1) in the validation database was R$ 12,026,290.09, and the value that comes closest is the one from the model using LR.
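The arithmetic behind two of these comparisons can be checked directly from the published figures. The sketch below recomputes the Accuracy of each model from the confusion matrix in Table 10 and the distance between the Expected Default Value in Table 11 and the observed default value of R$ 12,026,290.09. The variable and function names are illustrative only.

```python
# Confusion-matrix cells from Table 10, as percentages of the validation sample:
# (predicted 0 / observed 0, predicted 0 / observed 1,
#  predicted 1 / observed 0, predicted 1 / observed 1)
confusion = {
    "LR": (48.7, 11.3, 24.0, 16.0),
    "GWLR": (49.0, 11.2, 23.8, 16.0),
}

# Expected Default Value from Table 11 and observed default value, in R$
expected_default = {"LR": 11_909_313.79, "GWLR": 11_611_161.58}
observed_default = 12_026_290.09

def accuracy(cells):
    """Share of correct classifications: both diagonal cells added together."""
    pred0_obs0, _, _, pred1_obs1 = cells
    return pred0_obs0 + pred1_obs1

def default_gap(model):
    """Absolute distance between expected and observed default value."""
    return abs(expected_default[model] - observed_default)

for model, cells in confusion.items():
    print(f"{model}: accuracy = {accuracy(cells):.1f}%, "
          f"gap to observed defaults = R$ {default_gap(model):,.2f}")
```

The gap is about R$ 117 thousand for LR and about R$ 415 thousand for GWLR, which is the sense in which the LR model comes closest to the observed value.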

4. CONCLUSION

In this article, real data were used from a Brazilian financial institution on transactions involving Consumer Direct Credit, granted to clients residing in 19 regions in the Distrito Federal, to develop credit scoring models using two different methodologies: Logistic Regression and Geographically Weighted Logistic Regression.

The Logistic Regression methodology is quite widespread in the financial sector, and was used in this study to develop a global credit scoring model for the whole Distrito Federal.

The Geographically Weighted Logistic Regression methodology is still rarely used and relies on the borrower’s geographical location to weight observations when developing different models for each region studied.

The indicators used for comparison between the models developed via the two methodologies were very close, and based on the results obtained, the methodologies can be considered as similar in terms of their power to predict financial losses for the institution.

The study demonstrated that some variables were significant for all of the regions, whereas others were significant only for particular regions, leading to the conclusion that credit risk is influenced by different factors, depending on the region studied.

It was also observed that all of the regression models developed using GWLR (regional models) presented different values for the coefficients (parameters) of the variables, showing that the weights (importance) of the variables varied from region to region.

The results demonstrated the viability of applying the GWLR methodology for developing credit scoring models for the target population in this study. The formulas obtained are applicable only to this population; however, it is believed that this methodology could be extended to other credit transactions and spatial levels (e.g. neighborhoods, municipalities, federal units).

Due to great advances in computing and technology in recent decades, institutions granting credit have robust credit risk evaluation systems, which makes the implementation and use of a set of models estimated via GWLR viable.

With relation to the limitations of the study, the use of few predictive variables meant that the models presented low ranges of scores.

Categorization of the Formal Income variable was carried out so that the classes were monotonic with relation to relative risk; however, the values of their coefficients were inverted. Studies considering another categorization or target population should be carried out to verify the relevance of this variable for credit risk.

For future study topics, it is suggested that: the GWLR methodology be applied to develop credit scoring models for other target populations (for example, different credit transactions or geographical regions); comparisons be carried out with other methodologies (such as Support Vector Machines or Boosting); other predictive variables be used; the GWLR methodology be applied to develop models in other areas of a financial institution, such as strategy and marketing; or other functions be used, such as the Log Binomial, to develop geographically weighted models.

Address for Correspondence:
Pedro Henrique Melo Albuquerque
Universidade de Brasília, Departamento de Administração
Campus Universitário Darcy Ribeiro, Bloco A-2, 1º andar, Sala A1-54/7 – CEP: 70910-900
Asa Norte – Brasília – DF – Brazil
Email: [email protected]
