Data mining method for listed companies’ financial distress prediction


Knowledge-Based Systems 21 (2008) 1–5

Short communication


Jie Sun, Hui Li *

School of Business Administration, Zhejiang Normal University, Jinhua 321004, Zhejiang Province, PR China

Received 21 August 2006; received in revised form 1 November 2006; accepted 16 November 2006. Available online 8 December 2006

Abstract

Data mining techniques can extract valuable knowledge from large, changing databases. This paper puts forward a data mining method that combines attribute-oriented induction, information gain, and decision trees, and that is suitable for preprocessing financial data and constructing a decision tree model for financial distress prediction. Based on financial ratio attributes and one class attribute, and adopting an entropy-based discretization method, a data mining model for listed companies’ financial distress prediction is designed. An empirical experiment with 35 financial ratios and 135 pairs of listed companies as initial samples produced satisfactory results, which supports the feasibility and validity of the proposed data mining method for listed companies’ financial distress prediction. © 2006 Elsevier B.V. All rights reserved.

Keywords: Financial distress prediction; Data mining; Decision tree; Attribute-oriented induction

1. Introduction

Listed companies’ financial distress prediction is important to both listed companies and investors. Because of the uncertainty of the business environment and strong competition, even companies with sound operating mechanisms may face business failure and bankruptcy. Whether listed companies’ financial distress can be predicted effectively and in time therefore bears on companies’ development, the interests of numerous investors, and the order of the capital market.

Early studies of financial distress prediction used statistical techniques such as univariate analysis, multiple discriminant analysis, and Logit [1]. Although these methods use historical samples to build a diagnostic model, they cannot inductively learn from new data dynamically, which greatly affects prediction accuracy. More recently, many studies have demonstrated that artificial intelligence methods such as neural networks can be an alternative for financial distress prediction [2]. But a neural network is a black box: the knowledge it uses for classification is hidden in its structure and weight values, which is difficult for ordinary investors and finance majors to understand.

0950-7051/$ - see front matter © 2006 Elsevier B.V. All rights reserved.

doi:10.1016/j.knosys.2006.11.003

* Corresponding author. Tel.: +86 130 9144 2884; fax: +86 24 83500616.

E-mail addresses: [email protected] (J. Sun), [email protected] (H. Li).

In recent years, with the development of information technology, machine learning, and artificial intelligence, data mining has emerged as a new field of intelligent data analysis and has grown rapidly against the awkward background of abundant data but poor knowledge. It also brings new vitality to in-depth research on financial distress prediction methods. Working on a large database or data warehouse that stores a great amount of listed companies’ financial data, data mining techniques can dynamically mine valuable hidden knowledge, which can be applied to predict listed companies’ financial distress.

2. Data mining method for listed companies’ financial distress prediction

2.1. Choice of algorithm

Data mining is the process of mining hidden and valuable knowledge from databases, data warehouses, and other information storage media. It has several functions, such as association analysis, classification and prediction, clustering analysis, and outlier analysis, each of which may have several alternative data mining algorithms [3]. Data mining aimed at listed companies’ financial distress prediction is a classification and prediction problem, for which typical methods include the decision tree classifier, the Bayesian classifier, and the neural network classifier. The Bayesian classifier rests on the class conditional independence hypothesis, which is hard to meet in reality, and neural networks have the deficiency mentioned above. We therefore choose the decision tree method to build the classifier for listed companies’ financial distress prediction, not only because it is not subject to the independence hypothesis and is fast and accurate, but also because the knowledge it produces is easy to understand and use.

In addition, attribute-oriented induction (AOI) and attribute relevance analysis based on information gain (IG) are combined to raise the attributes’ conceptual level and to filter weakly relevant attributes out of the attribute set. This kind of data preprocessing not only improves the mining efficiency of the decision tree algorithm, but also makes the classification knowledge obtained by data mining more meaningful and valuable.

2.1.1. Data preprocessing algorithm combining AOI and IG

AOI can be used to generalize data. It first collects the data relevant to the mining task through a database query, and then generalizes the data by counting the number of distinct values of each attribute. This process is generally realized through two operations, attribute reduction and attribute generalization, and the degree to which an attribute’s concept level is raised is controlled by the attribute generalization threshold [4]. The IG method is based on entropy theory. It eliminates attributes that are irrelevant or only weakly relevant to the mining task by calculating each attribute’s IG and comparing it with an attribute relevance threshold set beforehand [5]. The detailed preprocessing algorithm is as follows.

Input. (1) relational database – DB; (2) data mining command – DMQuery; (3) attribute set – a_list; (4) concept level tree or generalization operation of attribute ai – Gen(ai); (5) generalization threshold of attribute ai – gen_thresh(ai); (6) attribute relevance threshold – rela_thresh.

Output. Relation after generalization and attribute relevance analysis – Gen_Rela_relation.

Algorithm.

(1) //Obtain data related to mining task
    get_relevant_data(DMQuery, DB, Work_relation);
(2) //Get the number of each attribute’s different values
    scan Work_relation to count tot_valu(ai);
(3) //Attribute reduction
    for each ai in a_list where tot_valu(ai) > gen_thresh(ai)
        if (Gen(ai) not exist) or (higher concept level of ai is denoted as other attribute)
            remove_attribute(ai, a_list);
(4) //Attribute generalization
    for each ai in a_list where tot_valu(ai) > gen_thresh(ai)
        while (tot_valu(ai) > gen_thresh(ai))
            generalize(ai, Gen(ai), tot_valu(ai), Work_relation);
(5) //Attribute relevance analysis
    for each ai in a_list
        IG(ai) //Get the IG of each attribute
        if IG(ai) < rela_thresh
            remove_attribute(ai, a_list).
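Steps (2) to (4) of the preprocessing algorithm can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function name `aoi_generalize`, the row-as-dictionary layout, the `max_passes` guard, and the toy attributes `company_id` and `profit` are all assumptions made for illustration. The second removal condition of step (3), a higher concept level already represented by another attribute, is not modeled here.

```python
from typing import Callable, Dict, List, Optional

Row = Dict[str, str]


def aoi_generalize(
    rows: List[Row],
    attributes: List[str],
    gen_ops: Dict[str, Optional[Callable[[str], str]]],
    gen_thresh: Dict[str, int],
    max_passes: int = 10,
) -> List[str]:
    """Apply attribute reduction and generalization in place; return kept attributes."""
    kept = []
    for attr in attributes:
        # Step (2): count the distinct values of the attribute.
        if len({row[attr] for row in rows}) <= gen_thresh[attr]:
            kept.append(attr)
            continue
        climb = gen_ops.get(attr)
        if climb is None:
            # Step (3): too many values and no generalization operation, so drop it.
            for row in rows:
                del row[attr]
            continue
        # Step (4): climb one concept level per pass until the threshold is met.
        # max_passes is an assumed guard against hierarchies that stop shrinking.
        for _ in range(max_passes):
            for row in rows:
                row[attr] = climb(row[attr])
            if len({row[attr] for row in rows}) <= gen_thresh[attr]:
                break
        kept.append(attr)
    return kept


# Hypothetical toy relation: an identifier column and a four-level profit band.
rows = [
    {"company_id": "A", "profit": "very low"},
    {"company_id": "B", "profit": "low"},
    {"company_id": "C", "profit": "high"},
    {"company_id": "D", "profit": "very high"},
]
parent = {"very low": "low", "low": "low", "high": "high", "very high": "high"}
gen_ops = {"company_id": None, "profit": lambda v: parent.get(v, v)}
gen_thresh = {"company_id": 2, "profit": 2}
kept = aoi_generalize(rows, ["company_id", "profit"], gen_ops, gen_thresh)
```

In this toy run the identifier column is removed because it has four distinct values and no concept hierarchy, while the profit band is generalized from four values to two.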

In the above algorithm, IG(ai), the IG of attribute ai, is calculated as follows.

Suppose $S$ is a data set containing $s$ samples. The class attribute has $m$ different values that correspond to $m$ different classes, denoted $C_i$, $i \in \{1, 2, 3, \ldots, m\}$, and $s_i$ is the number of samples of class $C_i$. Then the total information entropy needed to classify the given data set is $I(s_1, s_2, \ldots, s_m)$:

\[
I(s_1, s_2, \ldots, s_m) = -\sum_{i=1}^{m} p_i \log_2(p_i) \tag{1}
\]

where $p_i$ is the probability that a random sample belongs to class $C_i$, namely $p_i = s_i/s$.

If attribute $A$ has $v$ different values $\{a_1, a_2, \ldots, a_v\}$, then data set $S$ can be divided into $v$ subsets $\{S_1, S_2, \ldots, S_v\}$, where subset $S_j$ is composed of the data samples whose value of attribute $A$ equals $a_j$. Suppose $s_{ij}$ is the number of samples that belong to both subset $S_j$ and class $C_i$. Then the information entropy needed to classify the given data set according to attribute $A$ is $E(A)$:

\[
E(A) = \sum_{j=1}^{v} \frac{s_{1j} + s_{2j} + \cdots + s_{mj}}{s} \, I(s_{1j}, s_{2j}, \ldots, s_{mj}), \tag{2}
\]

\[
I(s_{1j}, s_{2j}, \ldots, s_{mj}) = -\sum_{i=1}^{m} p_{ij} \log_2(p_{ij}), \tag{3}
\]

\[
p_{ij} = \frac{s_{ij}}{s_{1j} + s_{2j} + \cdots + s_{mj}}. \tag{4}
\]

In this way, the information entropy gained by attribute $A$ is $\mathrm{Gain}(A)$:

\[
\mathrm{Gain}(A) = I(s_1, s_2, \ldots, s_m) - E(A). \tag{5}
\]
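Formulas (1) to (5), together with the relevance filter of step (5) in the preprocessing algorithm, can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions rather than the authors' code: the function names, the threshold value 0.1 for `rela_thresh`, and the toy attributes `debt_level` and `listing_board` are hypothetical.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Eq. (1): I(s1, ..., sm) = -sum_i p_i * log2(p_i), with p_i = s_i / s."""
    total = len(labels)
    return -sum(n / total * log2(n / total) for n in Counter(labels).values())


def expected_entropy(values, labels):
    """Eqs. (2)-(4): entropy of each subset S_j, weighted by its share of S."""
    total = len(labels)
    subsets = {}
    for value, label in zip(values, labels):
        subsets.setdefault(value, []).append(label)
    return sum(len(sub) / total * entropy(sub) for sub in subsets.values())


def info_gain(values, labels):
    """Eq. (5): Gain(A) = I(s1, ..., sm) - E(A)."""
    return entropy(labels) - expected_entropy(values, labels)


# Hypothetical toy data: one class attribute and two discretized attributes.
labels = ["ST", "ST", "NM", "NM"]
columns = {
    "debt_level": ["high", "high", "low", "low"],    # separates the classes
    "listing_board": ["x", "y", "x", "y"],           # carries no class information
}
rela_thresh = 0.1  # assumed value; the paper does not report its threshold
gains = {name: info_gain(col, labels) for name, col in columns.items()}
kept = [name for name, gain in gains.items() if gain >= rela_thresh]
```

Here `debt_level` splits the samples into two pure subsets, so its gain equals the full class entropy of 1 bit, while `listing_board` leaves the class mix unchanged and is filtered out.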

2.1.2. Decision tree algorithm

A decision tree is a tree-shaped decision structure learnt inductively from sample data whose class is already known. Each non-leaf node of the decision tree represents a test on an attribute value, and each leaf node represents a class [6]. The basic algorithm for generating a decision tree is as follows.

Input. Training sample data (all attributes should be discretized), candidate attribute set – attribute_list.

Output. Decision tree.

Algorithm: Gen_decision_tree(N, attribute_list)

Fig. 1. Data mining steps of financial distress prediction: creating data set → data preprocessing → construction of decision tree model → accuracy evaluation → classification and prediction.

(1) Create a node denoted as N;
(2) If all samples of node N belong to the same class C, then return N as a leaf node and denote it as class C;
(3) If attribute_list is empty, then return N as a leaf node and denote it as the class that has the most samples in node N;
(4) Choose the attribute that has the biggest IG in attribute_list, and denote it as test_attribute;
(5) Label node N with test_attribute;
(6) For each value ai of test_attribute, produce a branch from node N for the condition test_attribute = ai, and let Si be the set of samples that meet the branch condition;
(7) If Si is empty, then attach a leaf node labeled with the class that has the most samples in node N; otherwise attach the subtree recursively returned by Gen_decision_tree(Si, attribute_list – test_attribute).
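The recursive procedure Gen_decision_tree can be sketched in Python as follows. This is a minimal illustration, not the MATLAB implementation used in the experiment: the nested-dictionary tree layout, the `classify` helper, and the toy attributes `profit_growth` and `leverage` are assumptions. Because branches are created only for attribute values observed at a node, the empty-subset case of step (7) is handled at prediction time by falling back to the node's majority class.

```python
from collections import Counter
from math import log2


def entropy(labels):
    total = len(labels)
    return -sum(n / total * log2(n / total) for n in Counter(labels).values())


def info_gain(rows, labels, attr):
    total = len(labels)
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[attr], []).append(label)
    return entropy(labels) - sum(len(g) / total * entropy(g) for g in groups.values())


def gen_decision_tree(rows, labels, attribute_list):
    majority = Counter(labels).most_common(1)[0][0]
    # Steps (2)-(3): a pure node or an exhausted attribute list yields a leaf.
    if len(set(labels)) == 1 or not attribute_list:
        return majority
    # Steps (4)-(5): test the attribute with the biggest information gain.
    test_attribute = max(attribute_list, key=lambda a: info_gain(rows, labels, a))
    remaining = [a for a in attribute_list if a != test_attribute]
    node = {"attribute": test_attribute, "default": majority, "branches": {}}
    # Steps (6)-(7): one branch per observed value, built recursively.
    for value in {row[test_attribute] for row in rows}:
        idx = [i for i, row in enumerate(rows) if row[test_attribute] == value]
        node["branches"][value] = gen_decision_tree(
            [rows[i] for i in idx], [labels[i] for i in idx], remaining
        )
    return node


def classify(tree, row):
    """Walk the tree; unseen attribute values fall back to the node's majority class."""
    while isinstance(tree, dict):
        tree = tree["branches"].get(row[tree["attribute"]], tree["default"])
    return tree


# Hypothetical discretized training samples (ST = distressed, NM = normal).
rows = [
    {"profit_growth": "low", "leverage": "low"},
    {"profit_growth": "low", "leverage": "high"},
    {"profit_growth": "high", "leverage": "high"},
    {"profit_growth": "high", "leverage": "low"},
    {"profit_growth": "high", "leverage": "low"},
]
labels = ["ST", "ST", "ST", "NM", "NM"]
tree = gen_decision_tree(rows, labels, ["profit_growth", "leverage"])
predictions = [classify(tree, row) for row in rows]
```

The sketch grows the tree fully and does no pruning, so on consistent training data it reproduces every training label; the pruning by cross-validation described in Section 3.2 is a separate step.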

2.2. Discretization of continuous-values attributes

Most financial measure attributes are continuous-valued, but the decision tree algorithm requires that all attributes be discretized. So before data mining, we must convert continuous-valued attributes into discretized attributes by dividing the continuous value domain into several intervals and replacing the real data with interval symbols. This process is in fact also the process of constructing concept level trees for continuous-valued attributes, which prepares for the preprocessing algorithm in Section 2.1.1. Current discretization methods include equal-width intervals, equal-frequency intervals, clustering discretization, and entropy-based discretization. Unlike the other methods, entropy-based discretization takes class information into consideration, so the intervals it produces tend to improve classification accuracy [7]. Because data mining for listed companies’ financial distress prediction is a classification problem, and the decision tree algorithm chooses the testing attribute by the rule of biggest IG, entropy-based discretization is the natural choice for discretizing financial measure attributes.

Given a data set S, each value of attribute A can be considered as a possible interval boundary T. For example, one possible value v of attribute A may divide the data set into two subsets: S1, which meets the condition A < v, and S2, which meets the condition A ≥ v. Taking each possible value in turn as the interval boundary, we can calculate the IG of attribute A according to formulas (1)–(5). The value v* that gives attribute A the biggest IG (namely the smallest information entropy) is chosen as the interval boundary, so the value domain of attribute A is divided into two intervals: A < v* and A ≥ v*. The same method can be used to further subdivide these two intervals until the IG of attribute A is bigger than the predefined threshold.
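The boundary search described above can be sketched in Python. The sketch is illustrative only. In particular, its stopping rule is an assumption: recursion stops when the best available gain falls below `min_gain` or when `max_depth` is reached, which is one common way to bound the subdivision and may differ from the threshold test the authors applied. The function names and the toy data are hypothetical.

```python
from bisect import bisect_right
from collections import Counter
from math import log2


def entropy(labels):
    """Eq. (1): information entropy of a list of class labels."""
    total = len(labels)
    return -sum(n / total * log2(n / total) for n in Counter(labels).values())


def best_boundary(values, labels):
    """Return (v*, gain) for the split A < v / A >= v with the biggest IG."""
    total = len(labels)
    base = entropy(labels)
    best_v, best_gain = None, 0.0
    # Skip the smallest value: it would leave the subset A < v empty.
    for v in sorted(set(values))[1:]:
        left = [c for x, c in zip(values, labels) if x < v]
        right = [c for x, c in zip(values, labels) if x >= v]
        expected = len(left) / total * entropy(left) + len(right) / total * entropy(right)
        if base - expected > best_gain:
            best_v, best_gain = v, base - expected
    return best_v, best_gain


def discretize(values, labels, min_gain=0.1, max_depth=3):
    """Recursively collect interval boundaries (assumed stopping rule, see lead-in)."""
    v, gain = best_boundary(values, labels)
    if v is None or gain < min_gain or max_depth == 0:
        return []
    cuts = [v]
    for keep in (lambda x: x < v, lambda x: x >= v):
        part = [(x, c) for x, c in zip(values, labels) if keep(x)]
        xs, cs = zip(*part)
        cuts += discretize(list(xs), list(cs), min_gain, max_depth - 1)
    return sorted(cuts)


# Hypothetical ratio values with their class labels.
ratios = [0.1, 0.2, 0.3, 0.6, 0.7, 0.8]
status = ["ST", "ST", "ST", "NM", "NM", "NM"]
cuts = discretize(ratios, status)                      # [0.6]
symbols = [bisect_right(cuts, x) for x in ratios]      # interval index per sample
```

With a single boundary at 0.6, `bisect_right` maps values with A < 0.6 to interval 0 and values with A ≥ 0.6 to interval 1, which plays the role of the interval symbols that replace the real data.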

2.3. Data mining steps

Data mining for listed companies’ financial distress prediction takes five steps: creating the data set, data preprocessing, constructing the decision tree by inductive learning, accuracy evaluation, and classification and prediction, as shown in Fig. 1.

(1) Creating data set: drawing relevant data from data sources such as listed companies’ publicly disclosed information. Attributes of the data set may include financial measure attributes, the class attribute, and other essential information attributes.

(2) Data preprocessing: consists of discretization of continuous-valued attributes, data generalization, attribute relevance analysis, elimination of outliers, and so on.

(3) Construction of decision tree model: inductively learning from the preprocessed data with the decision tree algorithm stated in Section 2.1.2 and constructing a decision tree that represents the classification knowledge for listed companies’ financial distress prediction.

(4) Accuracy evaluation: evaluating the decision tree model’s prediction accuracy on the training data set and on the validation data set, respectively.

(5) Classification and prediction: if the decision tree’s accuracy is acceptable, it is used to predict listed companies’ financial distress.

3. Empirical experiment

3.1. Data collection and preprocessing

The data used in this study were obtained from the China Stock Market and Accounting Research Database (CSMAR). Following principles such as summarization, measurability, and sensitivity [8], the initial financial ratio set is composed of 35 financial ratios, which cover profitability ratios, activity ratios, short-term debt ratios, long-term debt ratios, growth ratios, and structural ratios. Companies that are specially treated (ST)¹ by the China Securities Supervision and Management Committee (CSSMC) are considered companies in financial distress, and those never specially treated are regarded as healthy ones. Based on data between 2000 and 2005, 135 pairs of companies listed on the Shenzhen Stock Exchange and the Shanghai Stock Exchange were selected as initial sample companies. To eliminate outliers, companies with financial ratios deviating from the mean by as much as three standard deviations were excluded, leaving 198 final sample companies, of which 92 are ST companies and 106 are normal (NM) ones. Then 70 ST companies and 80 NM companies (150 in total) were randomly chosen as training samples. The other 22 ST companies and 26 NM companies (48 in total) were used as validation samples.

Fig. 2. Misclassification error when decision tree is pruned at different degrees. [Plot not reproduced: cost (misclassification error, 0 to 0.5) against number of terminal nodes (0 to 10), with cross-validation and resubstitution curves and the best choice marked.]

Fig. 3. The decision tree model for listed companies’ financial distress prediction. [Tree diagram not reproduced: its split conditions are x30 < -1.525755, x14 < 1.50866, x32 < 0.05295, x17 < 0.05929, x15 < 611.156, x4 < 0.497159, and x1 < 0.865457.]

Table 1
Meaning of non-leaf nodes

Non-leaf node   Meaning
x30             Net profit growth rate
x14             Ratio of liabilities to tangible net assets
x1              Account receivable turnover
x32             Ratio of liabilities to cash flow
x15             Ratio of liabilities to equity market value
x4              Total asset turnover
x17             Gross profit rate of sales

Table 2
Result of accuracy evaluation

Method                  Sample size   Error (%)   Accuracy (%)
Resubstitution          150           4.67        95.33
Cross-validation        150           15.33       84.67
Independent validation  48            18.75       81.25
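The outlier screen used when preparing the samples in Section 3.1, which excludes companies whose ratios deviate from the mean by three standard deviations or more, might look like the following Python sketch. It is an illustration under assumptions: the paper does not say whether the population or sample standard deviation was used (the population form, `pstdev`, is assumed here), nor whether the screen was applied once or repeatedly (once is assumed). The function name and the toy data are hypothetical.

```python
from statistics import mean, pstdev


def drop_outliers(samples, k=3.0):
    """Keep companies whose every ratio lies within k standard deviations of its mean.

    samples: one list of financial ratios per company, all of the same length.
    """
    columns = list(zip(*samples))
    mus = [mean(col) for col in columns]
    sigmas = [pstdev(col) for col in columns]  # assumed: population standard deviation
    return [
        row
        for row in samples
        # A constant column (sd == 0) cannot flag anything as an outlier.
        if all(sd == 0 or abs(x - mu) < k * sd for x, mu, sd in zip(row, mus, sigmas))
    ]


# Hypothetical data: 19 ordinary companies and one with an extreme first ratio.
samples = [[1.0 + 0.01 * i, 0.5] for i in range(19)] + [[50.0, 0.5]]
kept = drop_outliers(samples)
```

Note that with very few samples no value can lie three standard deviations from the mean, so a screen of this kind is only meaningful on a reasonably large sample such as the 270 initial companies.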

3.2. Construction of decision tree model

The modeling process was implemented with the MATLAB 6.5 toolbox and its programming language. After applying the data mining method proposed in Section 2, the decision tree model for listed companies’ financial distress prediction was formed, and it was then pruned until the cross-validation error reached its minimum, as shown in Fig. 2. The decision tree has the lowest cross-validation misclassification error when the number of terminal nodes equals eight. The resulting decision tree model is shown in Fig. 3.

It is easy to transform decision tree knowledge into rule knowledge. For example, if x30 < −1.52757 then the company is financially distressed. The meaning of the non-leaf nodes in Fig. 3 is listed in Table 1.

1 The most common reason that Chinese listed companies are specially treated by CSSMC is that they have had negative net profit for two consecutive years. They are also specially treated if they purposely publish financial statements containing serious falsehoods or misstatements, but the ST samples chosen in this study are all companies that were specially treated because of negative net profit for two consecutive years.

3.3. Accuracy evaluation of the decision tree model

Resubstitution, 10-fold cross-validation, and validation with independent samples were each carried out to evaluate the decision tree model. As Table 2 shows, the classification accuracy obtained by resubstitution, 10-fold cross-validation, and independent validation is 95.33%, 84.67%, and 81.25%, respectively, indicating that the decision tree model constructed by the data mining method in Section 2 has relatively satisfactory prediction accuracy not only for training samples but also for validation samples. So this data mining method is suitable for constructing decision tree models for listed companies’ financial distress prediction.
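The three evaluation schemes differ only in which samples are used for fitting and which for scoring. The following Python sketch shows the error-rate calculation and a plain 10-fold split. It is a simplified illustration, not the MATLAB routine used in the study: the helper names are hypothetical, the folds are contiguous (shuffling or stratifying first would be usual in practice), and the misclassification counts 7, 23, and 9 are inferred from the rounded percentages in Table 2 rather than reported in the paper.

```python
def error_rate(y_true, y_pred):
    """Share of samples whose predicted class differs from the true class."""
    wrong = sum(t != p for t, p in zip(y_true, y_pred))
    return wrong / len(y_true)


def k_fold_indices(n, k=10):
    """Yield (train_idx, test_idx) pairs for k contiguous folds over n samples."""
    bounds = [round(i * n / k) for i in range(k + 1)]
    for i in range(k):
        test = list(range(bounds[i], bounds[i + 1]))
        train = [j for j in range(n) if j < bounds[i] or j >= bounds[i + 1]]
        yield train, test


def cross_val_error(fit, predict, X, y, k=10):
    """Pool the misclassifications over all folds and divide by the sample size."""
    wrong = 0
    for train, test in k_fold_indices(len(y), k):
        model = fit([X[i] for i in train], [y[i] for i in train])
        wrong += sum(predict(model, X[i]) != y[i] for i in test)
    return wrong / len(y)


# Consistency check against Table 2, using inferred misclassification counts.
table2 = {
    "Resubstitution": (7, 150),
    "Cross-validation": (23, 150),
    "Independent validation": (9, 48),
}
errors = {name: round(100 * wrong / n, 2) for name, (wrong, n) in table2.items()}
folds = list(k_fold_indices(150, 10))
```

With 150 training samples, each of the ten folds holds out 15 companies and fits on the remaining 135. The gap between the resubstitution error and the two held-out errors is the usual sign that resubstitution overstates accuracy.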

4. Conclusion

Existing financial distress prediction methods have problems such as a lack of dynamic learning ability and models that are difficult to understand. A data mining method combining AOI, IG, and decision trees can overcome those problems and effectively predict listed companies’ financial distress. By adopting an entropy-based method to discretize continuous-valued attributes, a data mining model for listed companies’ financial distress prediction can be designed to learn dynamically and inductively from a periodically changing database, and it produces an easily understandable decision tree classification model. The empirical experiment involving 35 financial ratios and 135 pairs of listed companies produced satisfactory results, which indicates that applying the proposed data mining method to listed companies’ financial distress prediction is not only theoretically feasible but also practically effective.

Acknowledgements

This research is partially supported by the Zhejiang Provincial Natural Science Foundation of China (Grant No. Y607011), the National Natural Science Foundation of China (Nos. 70573030 and 70571019), and the National Center of Technology, Policy and Management at Harbin Institute of Technology. The authors gratefully thank the anonymous referees for their useful comments and the editors for their work.

References

[1] E. Altman, G. Marco, Corporate distress diagnosis: comparisons using linear discriminant analysis and neural networks, Journal of Banking and Finance 18 (1994) 505–529.

[2] C.P. Parag, A threshold varying artificial neural network approach for classification and its application to bankruptcy prediction problem, Computers and Operations Research 32 (10) (2005) 2561–2582.

[3] P. Adriaans, D. Zantinge, Data Mining, Addison Wesley, England, 1996.

[4] Y.-L. Chen, C.-C. Shen, Mining generalized knowledge from ordered data through attribute oriented induction techniques, European Journal of Operational Research 166 (2005) 221–245.

[5] J.-W. Han, M. Kamber, Data Mining Concepts and Techniques, Morgan Kaufmann Publishers Inc., San Mateo, 2001.

[6] S.-C. Chou, C.-L. Hsu, MMDT: a multi-valued and multi-labeled decision tree classifier for data mining, Expert Systems with Applications 28 (4) (2005) 799–812.

[7] D. Janssens, T. Brijs, K. Vanhoof, et al., Evaluating the performance of cost based discretization versus entropy- and error-based discretization, Computers and Operations Research 33 (11) (2005) 1–17.

[8] X.-F. Li, J.-P. Xu, The establishment of rough-ANN model for pre-warning of enterprise financial crisis and its application, Systems Engineering – Theory and Practice 10 (2004) 8–13 (in Chinese).
