
DPS 2014 Team 1: Data Mining Project
Amer Ali, Gary Fisher, Hilda Mackin
Fall 2012 DCS860A Emerging Topics in Computer Security

Project report:

Data Mining for Credit Worthiness

Team 1: Hilda, Amer, Gary

Abstract

We evaluated loan applicant credit worthiness by learning from historical data with data mining algorithms and tools.

Description

People apply for loans, and banks or lenders need to determine each applicant’s “credit worthiness” on an “A, B, C, no credit” scale (where A is good and worthy, B is more of a risk, C is risky, and “no credit” is not worthy of a loan). There are 100 attributes to consider. This is a difficult problem for many loan officers because there are many possible combinations of attributes. [book]

We used the Weka [weka] data mining tool to evaluate loan applicant credit worthiness by learning from historical data. We tried several data mining algorithms [book] [algo] as well as iterative training [book]. We then evaluated how well the tool’s predicted credit scores matched the actual credit scores.

Data and Methodology

We used data from the “easy data mining” website [data], which provides data sets in many categories for data mining testers. The data set we chose was for determining the credit worthiness of loan applicants based on a large number of attributes.

We converted the data, as shown in Figure 1: Source data with credit scores, into a format for the Weka tool.

We used the data with the credit scores as training data. To train the tool, we ran classifiers on the training data and examined the “confusion matrix” (basically, where the mistakes were made), as shown in Figure 3: The Confusion Matrix from evaluating training data, and (when the algorithm created one) the decision tree, as shown in Figure 4: Visualizing the decision tree.

We removed the credit scores from the test data (but kept them aside). We then used the data mining tool to assign credit scores based on the applicant data in the test file, as shown in Figure 2: Running a test with training data, and massaged the output, as shown in Figure 5: Capturing the Test Data. Finally, we compared the tool’s credit scores with the actual credit scores to determine the percentage of correct evaluations, as shown in Figure 6: Determining the success rate of the instances.
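This train/test/compare cycle can also be scripted against Weka’s Java API instead of being driven through the Explorer GUI (the GUI steps are in Appendix B). The following is a minimal sketch, not the exact procedure we ran; the file names are examples from our runs, and J48 stands in for whichever classifier is being tested.

    import weka.classifiers.Evaluation;
    import weka.classifiers.trees.J48;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class CreditWorthiness {
        public static void main(String[] args) throws Exception {
            // Load the training and test sets (ARFF files prepared as in Appendix B).
            Instances train = DataSource.read("100Training.arff");
            Instances test = DataSource.read("2000Testing.arff");

            // The credit score is assumed to be the last attribute.
            train.setClassIndex(train.numAttributes() - 1);
            test.setClassIndex(test.numAttributes() - 1);

            // Build a J48 (C4.5) decision tree from the training data.
            J48 tree = new J48();
            tree.buildClassifier(train);

            // Evaluate against the held-out test set with the known scores.
            Evaluation eval = new Evaluation(train);
            eval.evaluateModel(tree, test);
            System.out.println(eval.toSummaryString()); // overall accuracy
            System.out.println(eval.toMatrixString());  // the confusion matrix
            System.out.printf("Correct: %.0f%%%n", eval.pctCorrect());
        }
    }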

We used different algorithms to evaluate the test data. The results are shown in “Observations” below. For some of the algorithms, we were able to see the decision trees that were generated.

Tools/Models

We used the Weka tool with the Naïve Bayes, J48, IB1, and Ordinal Class classifiers. [book] [help]

As explained below, out of all these classification algorithms we found J48 [algo] to be the best model for our prediction. J48 is an open-source Java implementation of the C4.5 algorithm. Below is a short description of the algorithm and its pseudocode:

Algorithm   

C4.5 builds decision trees from a set of training data in the same way as ID3, using the concept of information entropy. The training data is a set S = {s_1, s_2, ...} of already classified samples. Each sample s_i = {x_1, x_2, ...} is a vector where x_1, x_2, ... represent attributes or features of the sample. The training data is augmented with a vector C = {c_1, c_2, ...} where c_1, c_2, ... represent the class to which each sample belongs.

At each node of the tree, C4.5 chooses one attribute of the data that most effectively splits its set of samples into subsets enriched in one class or the other. Its criterion is the normalized information gain (difference in entropy) that results from choosing an attribute for splitting the data. The attribute with the highest normalized information gain is chosen to make the decision. The C4.5 algorithm then recurses on the smaller sublists.

This algorithm has a few base cases:

All the samples in the list belong to the same class. When this happens, C4.5 simply creates a leaf node for the decision tree saying to choose that class.

None of the features provide any information gain. In this case, C4.5 creates a decision node higher up the tree using the expected value of the class.

An instance of a previously unseen class is encountered. Again, C4.5 creates a decision node higher up the tree using the expected value.

Pseudocode 

In pseudocode, the general algorithm for building decision trees is [algo]:

1. Check for base cases.
2. For each attribute a:
   a. Find the normalized information gain from splitting on a.
3. Let a_best be the attribute with the highest normalized information gain.
4. Create a decision node that splits on a_best.
5. Recurse on the sublists obtained by splitting on a_best, and add those nodes as children of the node.
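To make step 2 concrete, here is a small, self-contained sketch of the entropy and normalized-information-gain (gain ratio) arithmetic. The GainRatio class and its method names are our own illustration, not part of Weka; C4.5 computes this quantity for every candidate attribute and splits on the highest.

    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    public class GainRatio {
        // Shannon entropy of a list of class labels, e.g. "A", "B", "C", "no credit".
        static double entropy(List<String> labels) {
            Map<String, Integer> counts = new HashMap<>();
            for (String label : labels) counts.merge(label, 1, Integer::sum);
            double h = 0.0;
            for (int c : counts.values()) {
                double p = (double) c / labels.size();
                h -= p * (Math.log(p) / Math.log(2));
            }
            return h;
        }

        // Gain ratio for splitting the labels into subsets, one subset per
        // value of the candidate attribute.
        static double gainRatio(List<String> labels, List<List<String>> subsets) {
            double gain = entropy(labels); // entropy before the split
            double splitInfo = 0.0;        // entropy of the split itself
            for (List<String> subset : subsets) {
                double w = (double) subset.size() / labels.size();
                gain -= w * entropy(subset); // subtract weighted subset entropy
                if (w > 0) splitInfo -= w * (Math.log(w) / Math.log(2));
            }
            return splitInfo == 0 ? 0 : gain / splitInfo;
        }
    }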


Observations

Training/test records and algorithms:

Table 1: Observations

Training records   Test records   IB1    J48    Naïve Bayes   Ordinal Class
50                 2000           -      43%    33%           -
100                2000           34%    54%    47%           33%
200                2000           27%    35%    37%           35%
500                2000           30%    40%    44%           40%

Refined:
200                300            -      48%    37%           -
300 (trimmed)      2000           -      59%    -             -

The J48 algorithm seemed to work out the best, as shown in Table 1: Observations.

We also “retrained”, or trained iteratively [book] [help]. That is, we trained on 200 records, tested on 300 records, and got 48% correct. We then trimmed the 300-record output, keeping only the records we had predicted correctly (about 144 records). Using the trimmed file as training data for the next 2000 records, we got 59% correct, increasing the success rate by 11 percentage points. We found the J48 algorithm also retrained the best; Naïve Bayes did not improve at all in our test. We did not want to retrain too many times or we would over-fit the model.
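In our runs the trimming was done by hand on the saved classifier output, but the idea can be expressed as a sketch against Weka’s Java API, assuming the class attribute is last and using example file names:

    import weka.classifiers.trees.J48;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class Retrain {
        public static void main(String[] args) throws Exception {
            Instances train = DataSource.read("200Training.arff");
            Instances test = DataSource.read("300Testing.arff");
            train.setClassIndex(train.numAttributes() - 1);
            test.setClassIndex(test.numAttributes() - 1);

            J48 tree = new J48();
            tree.buildClassifier(train);

            // Keep only the test records the first model classified correctly.
            Instances trimmed = new Instances(test, 0); // empty set, same header
            for (int i = 0; i < test.numInstances(); i++) {
                double predicted = tree.classifyInstance(test.instance(i));
                if (predicted == test.instance(i).classValue()) {
                    trimmed.add(test.instance(i));
                }
            }

            // Retrain on the trimmed records for the next, larger test set.
            J48 retrained = new J48();
            retrained.buildClassifier(trimmed);
        }
    }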

As shown in Figure 6: Determining the success rate of the instances, when we used 50 examples to train for 2000 tests we were able to correctly identify the credit rating 43% of the time.

We learned a great deal about the data mining algorithms and the “feel” of data mining (that is, successes, iteration, pruning, and other techniques). The Weka tool is very useful, but a lot of manual effort was required to massage the data, often necessitating conversion to different data formats to allow us to edit or modify the data.
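Much of that massaging was the CSV-to-ARFF conversion of Appendix B, step 1, which Weka’s converter classes can also script. A minimal sketch, assuming a hypothetical 100Training.csv:

    import java.io.File;
    import weka.core.Instances;
    import weka.core.converters.ArffSaver;
    import weka.core.converters.CSVLoader;

    public class CsvToArff {
        public static void main(String[] args) throws Exception {
            // Load the raw CSV export.
            CSVLoader loader = new CSVLoader();
            loader.setSource(new File("100Training.csv"));
            Instances data = loader.getDataSet();

            // Write it back out in Weka's native ARFF format.
            ArffSaver saver = new ArffSaver();
            saver.setInstances(data);
            saver.setFile(new File("100Training.arff"));
            saver.writeBatch();
        }
    }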

Summary

We used the Weka tool and data mining techniques to seek out better, simpler procedures for determining the “credit worthiness” of loan applicants. We compared Weka’s results with the known results to determine our success rate. We tried different classification algorithms. We “retrained” by using the revised results from previous tests to try to improve the success rate. Some algorithms did not appear to be very accurate. Also, surprisingly, the algorithms did not seem to improve much with more training data. We feel that the J48 algorithm, with the highest training accuracy and retraining success, worked out the best.


References

[algo] Detailed description of the J48 algorithm: http://en.wikipedia.org/wiki/C4.5_algorithm

[book] Ian H. Witten, Eibe Frank, and Mark A. Hall, Data Mining: Practical Machine Learning Tools and Techniques, Third Edition, ISBN-10: 0123748569

[data] Easy.Data.Mining at http://www.easydatamining.com/index.php?option=com_content&view=article&id=22&Itemid=90&lang=en

[weka] Weka Data Mining Tool at http://www.cs.waikato.ac.nz/ml/weka/

[help] Weka-supplied documentation, installed at wekapgmdir/Weka-3-6/documentation.html

Appendix A: Screen captures of the Weka Data Mining process

Figure 1: Source data with credit scores


Figure 2: Running a test with training data


Figure 3: The Confusion Matrix from evaluating training data

Figure 4: Visualizing the decision tree


Figure 5: Capturing the Test Data


Figure 6: Determining the success rate of the instances


Appendix B: Procedure

1. Process the “.csv” file
   a. Start Weka
   b. Select “Explorer”
   c. Select “Open file…”
      i. Select file type “CSV data files”
      ii. Select training file, e.g. 100Training.arff
      iii. Select “Open”
   d. Process the fields
      i. Select the check boxes for
         1. “our company”
         2. “our copyright”
         3. “our product”
         4. “our URL”
         5. “do not remove”
      ii. Keep “row” and all other fields
      iii. Select “Remove”
   e. Select “Save…”
      i. Remove the “.csv” from the filename
      ii. Keep “.arff” as the filetype
      iii. Select “Save”
   f. You might need to copy the @attribute lines from another file if you get an “incompatible” message when processing the training file

2. Run the tests
   a. Start Weka
   b. Select “Explorer”
   c. Select “Open file…”
      i. Select training file, e.g. 100Training.arff
   d. Select “Classify” tab
   e. Select “Choose”
      i. Select the test:
         1. Bayes -> NaiveBayes
         2. Lazy -> IB1
         3. Meta -> OrdinalClassClassifier
         4. Trees -> J48
   f. Select “Supplied Test Set”
   g. Select “Set…”
      i. Select “Open file…”
         1. Select the test file, e.g. 2000Testing.arff
         2. Select “Open”
      ii. Select “Close”
   h. Select “Start”
   i. Wait for the bird (bottom right corner of the Weka page) to stop moving
   j. Right-click the last entry in “result list”
   k. Select “View Classifier Errors”
      i. Select “Save”
         1. Give a name, e.g. 100TestingResults_NaiveBayes.arff
         2. Select “Save”, which will close the “view” window
      ii. Select “X” to close the “Visualize…” window
   l. Select the next test and iterate e through k above

3. Process the results
   a. Open Weka Explorer
   b. Select “Preprocess” tab
   c. Select “Open file…” to get the results.arff file we just created
      i. Select the results.arff file
      ii. Select “Open”
      iii. Select “Save…”
         1. Change the name, removing the “.arff”
         2. Select a “.csv” file type
         3. Select “Save”
   d. Iterate the steps above for all tests
4. Select “X” to exit Weka
5. Process the output

   1. Open the results.csv file (in Excel)
   2. Go to the last column, probably “CT”
      a. Enter in CT2:
         =IF(VLOOKUP(A2,Creditworthiness.csv!$A$2:$CX$2501,102,TRUE)=CR2,1,0)
      b. Copy that from CT2 to CT1692 (where 1692 is the last row of data)
      c. In CT1, enter:
         =SUM(CT2:CT1692)
      d. In CU1, enter:
         =ROWS(CT2:CT1692)
      e. In CV2, enter:
         =CT1/CU1
         That gives the successful classification rate for this classifier rule
   3. Save the file
      a. Select “X”
      b. Select “Save” to “Do you want to save the changes…”
      c. Select “Yes” to “…may contain features…”
   4. Change the name to add the percentage
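The spreadsheet arithmetic above simply counts matches between predicted and actual scores. The same computation as a Java sketch, under the assumptions that the applicant id is in the first column, the actual score is in column 102 (mirroring the VLOOKUP), the predicted score is in the last column of the results file, and no field contains a quoted comma:

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Paths;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    public class SuccessRate {
        public static void main(String[] args) throws IOException {
            // Actual credit scores keyed by applicant id (header row skipped).
            Map<String, String> actual = new HashMap<>();
            List<String> master = Files.readAllLines(Paths.get("Creditworthiness.csv"));
            for (String line : master.subList(1, master.size())) {
                String[] fields = line.split(",");
                actual.put(fields[0], fields[101]); // column 102, zero-indexed
            }

            // Count how many predicted scores match the actual ones.
            List<String> results = Files.readAllLines(Paths.get("results.csv"));
            int correct = 0, total = 0;
            for (String line : results.subList(1, results.size())) {
                String[] fields = line.split(",");
                String predicted = fields[fields.length - 1];
                if (predicted.equals(actual.get(fields[0]))) correct++;
                total++;
            }
            System.out.printf("Success rate: %.1f%%%n", 100.0 * correct / total);
        }
    }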
