Implementation In Tree
Stat 6601
November 24, 2004
Bin Hu, Philip Wong, Yu Ye
Data Background
From the SPSS Answer Tree program, we use its credit scoring example
There are 323 data points
The target variable is credit ranking (good [48%], bad [52%])
The four predictor variables are:
  age categorical (young [58%], middle [24%], old [18%])
  has AMEX card (yes [48%], no [52%])
  paid weekly/monthly (weekly pay [51%], monthly salary [49%])
  social class (management [12%], professional [49%], clerical [15%], skilled [13%], unskilled [12%])
Data Background
Useful to see how the target variable is distributed by each of the predictor variables:

CLASS            Bad   Good   % Good
management         1     38    97.4%
professional      56    102    64.6%
clerical          40      7    14.9%
skilled manual    40      1     2.4%
unskilled         31      7    18.4%

PAY_WEEK         Bad   Good   % Good
weekly pay       143     22    13.3%
monthly salary    25    133    84.2%

AGE              Bad   Good   % Good
young            151     36    19.3%
middle            16     63    79.7%
old                1     56    98.2%

AMEX             Bad   Good   % Good
no                89     78    46.7%
yes               79     77    49.4%
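As a sanity check, the "% Good" column can be recomputed from the Bad/Good counts. The sketch below (variable names are our own) does this for the CLASS table:

```python
# Recompute "% Good" from the Bad/Good counts in the CLASS table above.
# Counts come from the slide; the dictionary layout is our own.
class_counts = {
    "management": (1, 38),
    "professional": (56, 102),
    "clerical": (40, 7),
    "skilled manual": (40, 1),
    "unskilled": (31, 7),
}

pct_good = {
    cls: round(100 * good / (bad + good), 1)
    for cls, (bad, good) in class_counts.items()
}
print(pct_good["management"])  # → 97.4
```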
Data Background
Correlation Matrix: Pearson correlation coefficients, N = 323 (Prob > |r| under H0: Rho = 0)

            CREDIT_R   PAY_WEEK    AGE        AMEX
CREDIT_R     1.00000    0.70885    0.66273    0.02653
                         <.0001     <.0001     0.6348
PAY_WEEK     0.70885    1.00000    0.51930    0.08292
              <.0001                <.0001     0.1370
AGE          0.66273    0.51930    1.00000   -0.00172
              <.0001     <.0001                0.9755
AMEX         0.02653    0.08292   -0.00172    1.00000
              0.6348     0.1370     0.9755
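The matrix above comes from SAS. For illustration only, a Pearson correlation between two binary-coded columns can be computed directly from its definition; the six rows below are made up, not the actual 323 records, and the sign of r depends on how the categories are coded.

```python
import math

# Toy illustration (NOT the study data): Pearson correlation between two
# binary-coded columns, computed from the definition. Coding is our choice.
x = [1, 1, 0, 0, 1, 0]  # 1 = good credit ranking
y = [0, 0, 1, 1, 0, 1]  # 1 = weekly pay
n = len(x)
mx, my = sum(x) / n, sum(y) / n
cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / n
sx = math.sqrt(sum((a - mx) ** 2 for a in x) / n)
sy = math.sqrt(sum((b - my) ** 2 for b in y) / n)
r = cov / (sx * sy)
print(round(r, 4))  # → -1.0 (the toy columns are exact opposites)
```

With the coding used on the slide, the CREDIT_R/PAY_WEEK correlation on the real data is 0.70885.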
Objective
To create a predictive model of good credit risks
To assess the performance of the model, we randomly split the data into two parts: a training set to develop the model (60%) and the rest (40%) to validate
This is done to avoid possible "overfitting", since the validation set was not involved in deriving the model
Using the same data, we compare the results from R's tree, Answer Tree's CART, and SAS's Proc Logistic
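A minimal sketch of the 60/40 random split (the seed and names are our own choices; the original split was done in the authors' software):

```python
import random

# Randomly split the 323 cases 60/40 into training and validation index sets.
# Seed and rounding are our own choices, not from the original study.
random.seed(6601)
n = 323
indices = list(range(n))
random.shuffle(indices)
cut = round(0.6 * n)  # 60% of 323 rounds to 194 training cases
train_idx, valid_idx = indices[:cut], indices[cut:]
print(len(train_idx), len(valid_idx))  # → 194 129
```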
Logistic Regression
Let x be a vector of explanatory variables
Let y be a binary target variable (0 or 1)
p = Pr(Y = 1 | x) is the target probability
The linear logistic model has the form

  logit(p) = log(p / (1 - p)) = α + β'x

Predicted probability: phat = 1/(1 + exp(-(α + β'x)))
Note that the range of p is (0, 1), but logit(p) covers the whole real line
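The two mappings can be written out directly; this sketch (function names are ours) just checks that the phat formula inverts logit:

```python
import math

# logit maps (0, 1) onto the whole real line; its inverse gives phat.
def logit(p):
    return math.log(p / (1.0 - p))

def inv_logit(z):
    return 1.0 / (1.0 + math.exp(-z))

p = 0.3
print(round(inv_logit(logit(p)), 6))  # → 0.3
```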
Logistic Results
Using the training set, the maximum likelihood estimation failed to converge using the social class and age variables
Only the paid weekly/monthly and has AMEX card variables could be estimated
The AMEX variable was highly insignificant and so was dropped
Apparently, the tree algorithm does a better job of handling all the variables

SAS output of the model:

                           Standard       Wald
Parameter   DF  Estimate      Error  Chi-Square  Pr > ChiSq
Intercept    1    1.5856     0.2662     35.4756      <.0001
PAY_WEEK 1   1   -3.6066     0.4169     74.8285      <.0001

So the odds of a weekly-pay person being a good risk, relative to a monthly-salary person, are exp(-3.6) ≈ 0.027 to 1, or about 36 to 1 against.
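Plugging the SAS estimates into the phat formula reproduces the two predicted probabilities (0.117 and 0.830) used in the validation step, and exponentiating the PAY_WEEK coefficient gives the odds ratio (the function name and 0/1 coding are ours):

```python
import math

# SAS estimates from the table above: intercept and PAY_WEEK coefficient.
alpha, beta = 1.5856, -3.6066

def phat(pay_week):  # 1 = weekly pay, 0 = monthly salary (our coding)
    return 1.0 / (1.0 + math.exp(-(alpha + beta * pay_week)))

print(round(phat(0), 3))         # monthly salary → 0.83
print(round(phat(1), 3))         # weekly pay → 0.117
print(round(math.exp(beta), 3))  # odds ratio, weekly vs monthly → 0.027
```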
Validation Results
With only one variable in our predicted model, there are only two possible predicted probabilities: 0.117 and 0.830
Taking the higher probability as predicting a "good" account, our results are below:

                    Validation Set               Training Set
                    Actual Bad  Actual Good      Actual Bad  Actual Good
Predicted Bad           60          11               83          11
Predicted Good           8          50               17          83
Percent Agreement          85.3%                        85.6%

The better measure is the validation-set result. Note that the results are very similar, so overfitting does not appear to be a problem.
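Percent agreement is just the diagonal of each 2x2 table over its total; the helper below (our naming) reproduces both figures:

```python
# Percent agreement = correctly classified / total, from the 2x2 tables above.
# Arguments are predicted_actual counts, e.g. bad_good = predicted bad, actual good.
def pct_agreement(bad_bad, bad_good, good_bad, good_good):
    correct = bad_bad + good_good
    total = bad_bad + bad_good + good_bad + good_good
    return round(100.0 * correct / total, 1)

print(pct_agreement(60, 11, 8, 50))   # validation set → 85.3
print(pct_agreement(83, 11, 17, 83))  # training set → 85.6
```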
Growing a Tree in R (based on training data)

> credit_data <- read.csv(file = "training.csv")
> library(tree)
> credit_tree <- tree(CREDIT_R ~ CLASS + PAY_METHOD + AGE + AMEX,
+                     data = credit_data, split = c("gini"))
> tree.pr <- prune.tree(credit_tree)
> plot(tree.pr)                                                 # figure 1
> plot(credit_tree, type = "u"); text(credit_tree, pretty = 0)  # figure 2
> tree.1 <- prune.tree(credit_tree, best = 5)
> plot(tree.1, type = "u"); text(tree.1, pretty = 0)            # figures 3, 4, 5
> summary(tree.1)
Figure 1
Deviance versus tree size (2 to 10) from prune.tree; the top axis shows the cost-complexity parameter (110.00, 1.40, 0.78, 0.13).
Figure 2
The unpruned tree. The root splits on PAY_METHOD (pay by month); deeper splits use AGE (young, middle), AMEX (no), and CLASS (clerical, skilled, unskilled); leaves are labeled good or bad, with most leaves bad.
Figure 3
Pruned tree with five leaves: PAY_METHOD (pay by month) at the root, then AGE (young), CLASS (clerical, skilled, unskilled), and CLASS (clerical); leaves bad, good, bad, bad, bad.
Figure 4
Pruned tree with four leaves: PAY_METHOD (pay by month) at the root, then AGE (young) and CLASS (clerical, skilled, unskilled); leaves bad, good, bad, bad.
Figure 5
Pruned tree with three leaves: PAY_METHOD (pay by month) at the root, then AGE (young); leaves bad, good, bad.
Tree Based on Validation Data
PAY_METHOD (pay by month) at the root, then AGE (middle, young); leaves are labeled good or bad.
Implementing using SPSS ANSWER TREE
Training sample – C&RT (min impurity change .01)

Credit ranking (1 = default)

Node 0 (root): Good 48.45% (94), Bad 51.55% (100), Total 194 (100.00%)
Split on Paid Weekly/Monthly, improvement = 0.2539:
  Node 1 (monthly salary): Good 83.00% (83), Bad 17.00% (17), Total 100 (51.55%)
  Split on Age Categorical, improvement = 0.0633:
    Node 4 (Middle (25-35); Old (> 35)): Good 100.00% (68), Bad 0.00% (0), Total 68 (35.05%)
    Node 3 (Young (< 25)): Good 46.88% (15), Bad 53.13% (17), Total 32 (16.49%)
  Node 2 (weekly pay): Good 11.70% (11), Bad 88.30% (83), Total 94 (48.45%)
  Split on Social Class, improvement = 0.0172:
    Node 6 (Professional; Management): Good 32.14% (9), Bad 67.86% (19), Total 28 (14.43%)
    Node 5 (Clerical; Unskilled; Skilled Manual): Good 3.03% (2), Bad 96.97% (64), Total 66 (34.02%)
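The improvement figure at the root (0.2539) can be checked by hand: it is the root's Gini impurity minus the weighted impurity of the two children, using the node counts above (a sketch with our own helper name):

```python
# Gini improvement for the root C&RT split, from the node counts above.
def gini(good, bad):
    n = good + bad
    p = good / n
    return 2 * p * (1 - p)  # binary Gini impurity

root = gini(94, 100)     # Node 0: all 194 training cases
monthly = gini(83, 17)   # monthly-salary child: 100 cases
weekly = gini(11, 83)    # weekly-pay child: 94 cases
improvement = root - (100 / 194) * monthly - (94 / 194) * weekly
print(round(improvement, 4))  # → 0.2539
```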
Implementing using SPSS ANSWER TREE
Training sample – CHAID (Pearson Chi2, p = .05)

Credit ranking (1 = default)

Node 0 (root): Good 48.45% (94), Bad 51.55% (100), Total 194 (100.00%)
Split on Paid Weekly/Monthly, adj. p-value = 0.0000, chi-square = 98.6177, df = 1:
  Node 1 (monthly salary): Good 83.00% (83), Bad 17.00% (17), Total 100 (51.55%)
  Split on Age Categorical, adj. p-value = 0.0000, chi-square = 43.5241, df = 1:
    Node 4 (Young (< 25)): Good 46.88% (15), Bad 53.13% (17), Total 32 (16.49%)
    Node 3 (Middle (25-35); Old (> 35)): Good 100.00% (68), Bad 0.00% (0), Total 68 (35.05%)
  Node 2 (weekly pay): Good 11.70% (11), Bad 88.30% (83), Total 94 (48.45%)
  Split on Social Class, adj. p-value = 0.0009, chi-square = 16.1257, df = 1:
    Node 6 (Clerical; Unskilled; Skilled Manual): Good 3.03% (2), Bad 96.97% (64), Total 66 (34.02%)
    Node 5 (Professional; Management): Good 32.14% (9), Bad 67.86% (19), Total 28 (14.43%)
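The root chi-square statistic (98.6177) can be recomputed from the node counts with the usual Pearson formula, the sum of (O - E)^2 / E over the 2x2 table (pure-Python sketch, no continuity correction):

```python
# Pearson chi-square for the root CHAID split: rows = weekly pay / monthly
# salary, columns = good / bad, counts taken from the nodes above.
observed = [[11, 83],   # weekly pay: 11 good, 83 bad
            [83, 17]]   # monthly salary: 83 good, 17 bad
row_tot = [sum(r) for r in observed]
col_tot = [sum(c) for c in zip(*observed)]
n = sum(row_tot)

chi2 = sum(
    (observed[i][j] - row_tot[i] * col_tot[j] / n) ** 2
    / (row_tot[i] * col_tot[j] / n)
    for i in range(2) for j in range(2)
)
print(round(chi2, 4))  # → 98.6177
```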
Summary classification for training data

Level 1       Level 2       Predicted   % error
Pay-Monthly   Age-med,old   Good         0%
Pay-Monthly   Age-Young     Bad         47%
Pay-Weekly    Class-P,M     Bad         32%
Pay-Weekly    Class-C,S,U   Bad          3%
Summary of validation data, grouped by training data classification

Level 1       Level 2       Predicted   % error
Pay-Monthly   Age-med,old   Good         2%
Pay-Monthly   Age-Young     Bad         59%
Pay-Weekly    Class-P,M     Bad         23%
Pay-Weekly    Class-C,S,U   Bad         13%
Crosstabulation of predicted and actual classification

                          Training sample           Validation sample
                          Actual good  Actual bad   Actual good  Actual bad
Predict good                   68           0            67          21
Predict bad                    26         100             1          40
Agreement                         86.6%                     82.9%
Agreement in regression           85.6%                     85.3%
Summary of results

Similar trees were generated from R and SPSS ANSWER TREE
Similar results were derived using different tree generation methods, C&RT and CHAID
The classification tree has a higher percentage of agreement between predicted and actual values than logistic regression on the training data
Utilizing the grouping criteria derived from the training data, logistic regression has a higher percentage of agreement than the classification tree on the validation data
Conclusion

A classification tree is a non-parametric method that selects predictive variables sequentially and groups cases into homogeneous clusters to derive the highest predictive probability
Classification trees can be implemented in different software packages and with different tree-growing methodologies
A classification tree normally performs better than parametric models, with a higher percentage of agreement between predicted and actual values
Classification trees have special advantages in industries like credit cards and marketing research by 1) grouping individuals into homogeneous clusters and 2) assigning not only the predicted values but also the probability of prediction error
Conclusion – cont'd

As a non-parametric method, no functional form is specified and no parameters are estimated or tested
As shown in this small study, the lower percentage of agreement for the validation data suggests that "overfitting" might be a potential problem for classification trees