
Decision Trees: CHAID and CART


  • Classification Problems

    Introduction to classification.

    Decision Tree approach for classification.

    Chi-Square Automatic Interaction Detection (CHAID)

    Classification and Regression Tree (CART)

  • CHAID: Chi-Square Automatic Interaction Detection

  • Introduction to CHAID

    CHAID is a decision tree algorithm used in classification problems.

    CHAID uses the chi-square test of independence for splitting. CHAID was first presented in the article "An Exploratory Technique for Investigating Large Quantities of Categorical Data" by G. V. Kass in Applied Statistics (1980).

  • CHAID

    CHAID partitions the data into mutually exclusive, exhaustive subsets that best describe the dependent categorical variable.

    CHAID is an iterative procedure that examines the predictors (or classification variables) and uses them in the order of their statistical significance.

  • Chi-Square test of independence

    The chi-square test of independence starts with the assumption that there is no relationship between two variables.

    For example, we assume that there is no relationship between checking account balance and default.

  • Chi-Square Test of Independence: German Credit Case

    H0: There is no relationship between checking account balance and default, i.e., the two variables are independent.

    HA: There is a relationship between checking account balance and default, i.e., the two variables are dependent.

  • Contingency Table

    Checking account balance   Default = 1   Default = 0   Total
    0 DM                       135           139           274
    Other than 0 DM            165           561           726
    Total                      300           700           1000

  • Chi-Square Test of Independence: Test Statistic

    $$\chi^2 = \sum_i \frac{(O_i - E_i)^2}{E_i}$$

    where $O_i$ is the observed frequency, $E_i$ the expected frequency of cell $i$, and

    $$E = \frac{\text{row sum} \times \text{column sum}}{\text{total sum}}$$

  • Chi-Square Test of Independence

    Cell                   Observed (O)   Expected (E)   (O − E)² / E
    0 DM, default          135            82.2           33.91
    0 DM, no default       139            191.8          14.53
    Not 0 DM, default      165            217.8          12.80
    Not 0 DM, no default   561            508.2          5.48
    Total                  1000           1000           66.73

    p-value = 3.10E-16

    Since the p-value is less than 0.05, we reject the null hypothesis.
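    The same numbers can be reproduced in a few lines of Python; a minimal sketch using SciPy's chi2_contingency on the contingency table above:

    import numpy as np
    from scipy.stats import chi2_contingency

    # Rows: checking balance (0 DM / other than 0 DM); columns: default (1 / 0).
    observed = np.array([[135, 139],
                         [165, 561]])

    # correction=False gives the plain Pearson chi-square (no Yates correction),
    # matching the hand computation above.
    chi2, p, dof, expected = chi2_contingency(observed, correction=False)
    print(round(chi2, 2), dof, p)   # 66.73 1 ~3.1e-16
    print(expected.round(1))        # [[ 82.2 191.8] [217.8 508.2]]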

  • CHAID Procedure

    Step 1: Examine each predictor variable for its statistical significance with the dependent variable.

    Step 2: Determine the most significant among the predictors (predictor with smallest p value).

    Step 3: Divide the data by levels of the most significant predictor (using chi-square test of independence). Each of these groups will be examined individually further.

    Step 4: For each sub-group, determine the most significant variable from the remaining predictors and divide the data again.

    Step 5: Repeat step 4 until all statistically significant predictors have been identified.
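    Steps 1 and 2 amount to ranking predictors by p-value; a sketch in Python (the DataFrame, target, and predictor names are placeholders for illustration, not part of the CHAID specification):

    import pandas as pd
    from scipy.stats import chi2_contingency

    def most_significant_predictor(df, target, predictors):
        # Step 1: test each predictor against the dependent variable.
        p_values = {}
        for col in predictors:
            table = pd.crosstab(df[col], df[target])
            _, p, _, _ = chi2_contingency(table)
            p_values[col] = p
        # Step 2: the most significant predictor has the smallest p-value.
        return min(p_values, key=p_values.get), p_values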

  • CHAID

    CHAID uses both splitting and merging steps.

    In merging, the least significantly different groups are merged to form one class, as sketched below.

    In splitting, the values of a predictor that result in the most significantly different classes are used.

    The split selection is based on the chi-square test of independence between the grouped predictor categories and the dependent variable.
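    A sketch of the merging step (illustrative only; real CHAID implementations add Bonferroni adjustments and ordered-category constraints not shown here):

    from itertools import combinations
    import pandas as pd
    from scipy.stats import chi2_contingency

    def merge_categories(df, predictor, target, alpha_merge=0.05):
        # Each category of the predictor starts as its own group.
        groups = {c: [c] for c in df[predictor].unique()}
        while len(groups) > 2:
            to_group = {c: g for g, cs in groups.items() for c in cs}
            labels = df[predictor].map(to_group)
            # Find the least significantly different pair of groups.
            best_pair, best_p = None, -1.0
            for a, b in combinations(groups, 2):
                mask = labels.isin([a, b])
                table = pd.crosstab(labels[mask], df.loc[mask, target])
                _, p, _, _ = chi2_contingency(table)
                if p > best_p:
                    best_pair, best_p = (a, b), p
            if best_p <= alpha_merge:
                break                      # all pairs differ significantly
            a, b = best_pair
            groups[a] += groups.pop(b)     # merge the closest pair
        return groups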

  • Chi-Square Test of Independence

    H0: Two groups are independent with respect to a dependent variable.

    HA: Two groups are not independent with respect to a dependent variable.

    We accept the null hypothesis when the p-value is greater than 0.05.

    $$\chi^2 = \sum_i \frac{(O_i - E_i)^2}{E_i}, \qquad \text{degrees of freedom} = (r - 1)(c - 1)$$

  • CHAID Input

    Significance level for partitioning a variable.

    Significance level for merging.

    Minimum number of records for the cells.

  • CHAID Example: Breaking Barriers

  • Observed vs. Expected

    LTV     Observed (0)   Observed (1)   Total   Expected (0)   Expected (1)
    80      77             83             160     32.10191       127.8981
    Total   315            1255           1570

    Chi-square statistic: 10105.29
    p-value: 0

    The chi-square test confirms the relationship.

  • Business Rules (exported SPSS syntax)

    /* Node 1 */.
    DO IF (VALUE(LTV) EQ 1).
    COMPUTE nod_001 = 1.
    COMPUTE pre_001 = 1.
    COMPUTE prb_001 = 0.8686.
    END IF.
    EXECUTE.

    /* Node 2 */.
    DO IF (SYSMIS(@0DM) OR VALUE(@0DM) NE 1).
    COMPUTE nod_001 = 2.
    COMPUTE pre_001 = 1.
    COMPUTE prb_001 = 0.78431.
    END IF.
    EXECUTE.
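    Translated to Python, the exported business rules read as follows (a sketch; @0DM is renamed zero_dm because "@" is not a valid Python identifier, and SYSMIS becomes a None/NaN check):

    import math

    def score(ltv, zero_dm):
        # Node 1: LTV equals 1.
        if ltv == 1:
            return {"node": 1, "prediction": 1, "probability": 0.8686}
        # Node 2: @0DM is system-missing or not equal to 1.
        if zero_dm is None or math.isnan(zero_dm) or zero_dm != 1:
            return {"node": 2, "prediction": 1, "probability": 0.78431}
        return None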

  • CHAID optimal cut

    For a given variable (say LTV), find a cut-off value x, such that the chi-square statistic function is maximum.

    $$\text{statistic} = \max_{x \in \text{LTV}} \; \sum_{i=1}^{n} \sum_{j=1}^{m} \frac{\bigl(O_{ij}(x) - E_{ij}(x)\bigr)^2}{E_{ij}(x)}$$

    Check whether the maximum chi-square is significant.
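    A sketch of this search in Python, brute-forcing over the observed values of the predictor (function and column names are illustrative):

    import numpy as np
    import pandas as pd
    from scipy.stats import chi2_contingency

    def best_cut(df, predictor, target):
        best = (None, -np.inf, None)              # (x, chi-square, p-value)
        for x in np.unique(df[predictor])[:-1]:   # candidate cut-offs
            table = pd.crosstab(df[predictor] <= x, df[target])
            chi2, p, _, _ = chi2_contingency(table, correction=False)
            if chi2 > best[1]:
                best = (x, chi2, p)
        # Check whether the maximum chi-square is significant before splitting.
        return best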

  • Classification and Regression Trees (CART)

    In regression trees, splits are chosen based on the SSE between the observations and the mean value of the node.

    CART builds a binary tree, whereas CHAID can split the initial node into more than two branches.

    In classification trees, CART uses the Gini index to minimize the classification error.
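    scikit-learn's DecisionTreeClassifier implements this style of binary CART tree; a minimal sketch on synthetic data (the columns and target rule are made up for illustration):

    import numpy as np
    from sklearn.tree import DecisionTreeClassifier

    rng = np.random.default_rng(0)
    X = rng.uniform(size=(1000, 2))       # two toy predictors
    y = (X[:, 0] > 0.8).astype(int)       # toy target rule to recover

    # criterion="gini" chooses each binary split by Gini impurity.
    tree = DecisionTreeClassifier(criterion="gini", max_depth=3).fit(X, y)
    print(tree.score(X, y))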

  • Gini Index (Classification Impurity)

    The Gini index is used to measure the impurity at a node (in a classification problem) and is given by:

    $$\text{Gini}(k) = \sum_{j=1}^{K} P(j \mid k)\,\bigl(1 - P(j \mid k)\bigr)$$

    where $P(j \mid k)$ is the proportion of category $j$ in node $k$.

    Smaller Gini Index implies less impurity.
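    The formula translates directly into code; a sketch, using the German Credit root node (300 defaults, 700 non-defaults) as a check:

    import numpy as np

    def gini(labels):
        _, counts = np.unique(labels, return_counts=True)
        p = counts / counts.sum()          # P(j | k) for each category j
        return float(np.sum(p * (1 - p)))  # equals 1 - sum(p**2)

    print(gini([1] * 300 + [0] * 700))     # 0.42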

  • Classification Tree Logic

    Node $t$ splits into a left node $t_L$ and a right node $t_R$, with $N = N_L + N_R$.

    $$\max \;\bigl[\, N\, i(t) \;-\; N_L\, i(t_L) \;-\; N_R\, i(t_R) \,\bigr]$$

    where $N_L$ is the number of observations in the left node, $N_R$ the number of observations in the right node, and $i(\cdot)$ the impurity at node $(\cdot)$. The maximized quantity is the reduction in impurity.
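    In code, the quantity being maximized looks like this (a sketch; gini() is the helper above, repeated here so the snippet stands alone):

    import numpy as np

    def gini(labels):
        _, counts = np.unique(labels, return_counts=True)
        p = counts / counts.sum()
        return float(np.sum(p * (1 - p)))

    def impurity_reduction(parent, left, right, impurity=gini):
        # N i(t) - N_L i(t_L) - N_R i(t_R)
        return (len(parent) * impurity(parent)
                - len(left) * impurity(left)
                - len(right) * impurity(right))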

  • SHUBHAM: Classification and Regression Tree

    Reduction in impurity $= N\, i(0) - N_1\, i(1) - N_2\, i(2)$

    $= 1570 \times 0.2 \times 0.8 \;-\; 1506 \times 0.827 \times 0.173 \;-\; 64 \times 0.14 \times 0.86$

    $= 251.2 - 215.46 - 7.70 = 28.02$

  • Entropy

    Entropy is a measure of impurity and is given by:

    $$\text{Entropy}(k) = -\sum_{c=1}^{C} P(c \mid k)\,\log\bigl(P(c \mid k)\bigr)$$

    where $P(c \mid k)$ is the proportion of category $c$ in node $k$.
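    A sketch mirroring the gini() helper above (natural logarithm; another base only rescales the curve):

    import numpy as np

    def entropy(labels):
        _, counts = np.unique(labels, return_counts=True)
        p = counts / counts.sum()            # P(c | k) for each category c
        return float(-np.sum(p * np.log(p)))

    print(entropy([1] * 300 + [0] * 700))    # 0.611 for the credit root node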

  • Entropy

    [Figure: entropy as a function of the class proportion in a node]

  • [Figure: Gini index and entropy plotted against the class proportion]
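    The comparison figure can be redrawn with matplotlib; a sketch (the original axis scaling is not recoverable, so the standard binary-node curves are plotted):

    import numpy as np
    import matplotlib.pyplot as plt

    p = np.linspace(0.001, 0.999, 200)   # proportion of class 1 in the node
    gini = 2 * p * (1 - p)
    entropy = -(p * np.log(p) + (1 - p) * np.log(1 - p))

    plt.plot(p, gini, label="Gini Index")
    plt.plot(p, entropy, label="Entropy")
    plt.xlabel("proportion of class 1")
    plt.ylabel("impurity")
    plt.legend()
    plt.show()

    Both curves peak at p = 0.5 (a perfectly mixed node) and vanish at p = 0 and p = 1 (pure nodes).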

  • Tree Pruning

    Based on the percentage of data in each node (e.g., each node must contain at least 5% of the observations).

    Based on the level of the tree (e.g., at most 4 levels from the root node).

    Based on impurity functions (such as the Gini index and entropy).
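    These criteria map directly onto scikit-learn's stopping parameters; a sketch (the thresholds are the slide's examples, not recommendations):

    from sklearn.tree import DecisionTreeClassifier

    tree = DecisionTreeClassifier(
        min_samples_leaf=0.05,       # each leaf keeps at least 5% of the data
        max_depth=4,                 # at most 4 levels from the root node
        min_impurity_decrease=0.01,  # stop when the Gini/entropy gain is small
        criterion="gini",
    )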