Simafire Logistic Regression Article Digest


  • 8/19/2019 Simafire Logistic Regression Article Digest

    1/11

    Logistic Regression Digest – An eBook

    Understand, build and use logistic regression models for common business problems with RapidMiner

    Bala Deshpande, Ph.D., MBA


    Decision Tree Digest – How to build and use decision trees for business analytics – an eBook by SimaFore

    © SimaFore LLC Page 1

    Table of Contents

    Chapter 1: Basics of Logistic Regression models

    Chapter 2: Seven steps to building a Logistic Regression Model

    Chapter 3: Applying a logistic regression model built with RapidMiner


    Chapter 1: Basics of Logistic Regression models

    It can be argued that the most important step in a business analytics process is establishing a clear business objective. Once this is done, selecting the right technique becomes a matter of simple logic. At a very high level there are fundamentally two main classes of techniques: those that evolved purely from statistics (such as regression) and those that emerged from a blend of stats, computer science and mathematics (such as classification trees).

    This chapter is about logistic regression, how it compares to its twin, linear regression, and when it makes sense to use it. In chapter 2, we discuss the mechanics of logistic regression and its implementation using RapidMiner for a simple business analytics application. Finally, in chapter 3 we discuss how to apply the model to new data.

    A simple explanation of Logistic Regression

    Recall that linear regression is the process of finding a straight line that passes through a bunch of points, with the objective of being able to use the equation of the line as a model for prediction. The key assumption here is that both the predictor and target variables are continuous, as seen in the chart below. Intuitively, one can state that when X increases, Y increases along the slope of the line.
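
    As a minimal sketch of what "finding a straight line" means in practice, the ordinary least squares formulas for one predictor can be written in a few lines of Python. The function name `fit_line` and the four data points are hypothetical and chosen only for illustration; they are not taken from the eBook.

```python
def fit_line(xs, ys):
    """Ordinary least squares for one predictor: returns (slope m, intercept c)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-deviations and sum of squared deviations of X
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    m = sxy / sxx
    c = mean_y - m * mean_x
    return m, c


# Hypothetical data lying exactly on the line y = 2x + 1
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]
m, c = fit_line(xs, ys)
print("slope:", m, "intercept:", c)
```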


    What happens if the target variable is not continuous? When the target (Y) variable is discrete, the straight line is no longer a good fit, as seen in this chart. Intuitively we can still state that when X (say, advertising spend) increases, Y (say, response or no response to a mailing campaign) also increases. However, there is no gradual transition: the Y value jumps abruptly from one binary outcome to the other. Thus the straight line is a poor fit for this data.

    On the other hand, take a look at the S-shaped curve below. This is certainly a better fit for the data shown. If we know the equation of this "sigmoid" curve, we can use it as effectively as we used the straight line in the case of linear regression.
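
    To make the S-shape concrete, here is a small sketch of the standard logistic (sigmoid) function, assuming the usual form 1 / (1 + exp(-z)). The function name `sigmoid` and the sample z values are illustrative choices, not part of the original text.

```python
import math


def sigmoid(z):
    """Standard logistic function: maps any real z to a value between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


# The output stays near 0 for large negative z, passes through 0.5 at z = 0,
# and approaches 1 for large positive z, which gives the S-shape.
for z in (-6, -2, 0, 2, 6):
    print(z, round(sigmoid(z), 3))
```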


    Logistic regression is thus the process of obtaining an appropriate sigmoid curve to fit the data when the target variable is discrete.

    Key facts to keep in mind

    • Logistic Regression is the equivalent of linear regression to use when the target (or dependent) variable is discrete, i.e. not continuous

    • Logistic Regression is ideally suited for business analytics applications where the target variable is a binary decision (fail-pass, response-no response, etc.)

    • The predictors can be either continuous or categorical

    In chapter 2, we discuss the mechanics of logistic regression and also the process of implementing a simple analysis using RapidMiner.


    Chapter 2: Seven steps to building a Logistic Regression Model

    In chapter 1, we gave a brief introduction to logistic regression and indicated when it might be appropriate to use it in business analytics settings. Probably the best definition of logistic regression is this: "A mathematical modeling approach in which the best-fitting, yet least-restrictive model is desired to describe the relationship between several independent explanatory variables and a dependent dichotomous response variable".

    In this chapter we get into the details of how the model equation is developed and then show how to set up a simple analysis using RapidMiner.

    How does logistic regression find the sigmoid curve?

    A straight line can be described by only two parameters: the slope (m) and the intercept (c). The way in which the X's and Y's are related to each other can be specified simply by m and c. However, an S-shaped curve is a much more complex shape and representing it parametrically is not as easy. So how does one find a mathematical means to relate the X's to the Y's?

    It turns out that if we transform the Y's to the logarithm of the odds of Y, then the transformed target variable is linearly related to the X's. In most cases where we need to use logistic regression, the Y is usually a YES-NO type of response. This is usually interpreted as the probability of an event happening (Y=1) or not happening (Y=0).

    • If Y is an event (response, pass, etc.),

    • and p is the probability of the event happening (Y=1),

    • then (1-p) is the probability of the event not happening (Y=0),

    • and p/(1-p) is the odds of the event happening.

    • It turns out that log(p/(1-p)) is linear in the predictors, X.

    We can write the model as

    log[p/(1-p)] = mX + c          (Eq. 1)

    From the data given, we know the X and can compute the p for each value of X. After this, of course, the problem is essentially similar to linear regression. (To see the sigmoid curve, the variables need to be transformed from the p-space to the Y-space.)

    The logistic regression model from Eq. 1 ultimately delivers the probability of Y happening (i.e., Y=1), given specific value(s) of X.
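
    The chain from probability to odds to log-odds, and back again, can be sketched as follows. This is only an illustration of the definitions above; the helper names `odds`, `logit` and `inverse_logit` and the value p = 0.8 are assumptions made for the example.

```python
import math


def odds(p):
    """Odds of the event: p / (1 - p)."""
    return p / (1.0 - p)


def logit(p):
    """Log of the odds, i.e. the left-hand side of Eq. 1."""
    return math.log(odds(p))


def inverse_logit(z):
    """Recover the probability p from a log-odds value z."""
    return 1.0 / (1.0 + math.exp(-z))


p = 0.8                      # hypothetical probability of the event
z = logit(p)                 # log-odds; the odds here are about 4 to 1
print("odds:", odds(p))
print("log-odds:", z)
print("back to p:", inverse_logit(z))
```

    A probability of 0.5 corresponds to odds of 1 and a log-odds of 0, which is why the sign of mX + c indicates which outcome is the more likely one.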

    http://dspace.lib.ttu.edu/bitstream/handle/2346/19443/31295013251094.pdf?sequence=1


    7-steps to a simple logistic regression model in RapidMiner

    The data we used comes from an example here for a credit scoring exercise. The objective is to predict DEFAULT (Y or N) based on two predictors: loan age (business usage) and number of days of delinquency. There are 100 samples.

    Step 1: Load the spreadsheet into RapidMiner. Use the process described here. Remember to set the DEFAULT column as "Label".

    Step 2: Split the data into train and test samples using the Split Validation operator as shown here.

    Step 3: Add the Logistic Regression operator in the "training" window of the Split Validation operator.

    Step 4: Add the Apply Model operator in the "testing" window of the Split Validation operator in a similar manner as discussed here. Just use the default parameter values.

    Step 5: Add the Performance evaluation operator in the "testing" window of the Split Validation operator as discussed here.

    Step 6: Connect all ports as shown below.

    Step 7: Run the model and view the results. In particular, check the Kernel Model, which shows the coefficients for the two predictors and the intercept. Also check the confusion matrix for Accuracy, Sensitivity, and Specificity, and finally view the ROC curves and check the AUC.
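
    The seven steps above are carried out in RapidMiner's graphical interface. As a rough analogue only, the same flow (load data, split 70/30, train, apply, evaluate) can be sketched in plain Python. Everything in this sketch is an assumption made for illustration: the data is synthetic and randomly generated (it is not the credit scoring file), the names `loan_age` and `days_delinquent` are hypothetical, and the model is fitted by simple gradient ascent on the log-likelihood, which is a swapped-in technique and not RapidMiner's own algorithm.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(w, b, x):
    """Probability of the positive class (default = 1) for one row x."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)


def fit_logistic(rows, labels, lr=0.1, epochs=2000):
    """Fit weights and intercept by batch gradient ascent on the log-likelihood."""
    n, k = len(rows), len(rows[0])
    w, b = [0.0] * k, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * k, 0.0
        for x, y in zip(rows, labels):
            err = y - predict_proba(w, b, x)   # residual on the probability scale
            for j in range(k):
                grad_w[j] += err * x[j]
            grad_b += err
        w = [wi + lr * g / n for wi, g in zip(w, grad_w)]
        b += lr * grad_b / n
    return w, b


# "Step 1": synthetic stand-in for the 100-row data set (NOT the real data).
rng = random.Random(0)
rows, labels = [], []
for _ in range(100):
    loan_age = rng.uniform(0, 1)          # hypothetical, scaled to [0, 1]
    days_delinquent = rng.uniform(0, 1)   # hypothetical, scaled to [0, 1]
    true_p = sigmoid(4 * days_delinquent - 2 * loan_age - 1)
    rows.append([loan_age, days_delinquent])
    labels.append(1 if rng.random() < true_p else 0)

# "Step 2": 70% training, 30% testing (rows are already in random order).
train_rows, test_rows = rows[:70], rows[70:]
train_labels, test_labels = labels[:70], labels[70:]

# "Steps 3-5": train, apply to the test rows, and measure accuracy.
w, b = fit_logistic(train_rows, train_labels)
preds = [1 if predict_proba(w, b, x) >= 0.5 else 0 for x in test_rows]
accuracy = sum(p == y for p, y in zip(preds, test_labels)) / len(test_labels)
print("weights:", w, "intercept:", b, "test accuracy:", accuracy)
```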

    http://chem-eng.utoronto.ca/~datamining/dmc/datasets/credit_scoring.txt
    http://www.simafore.com/blog/bid/55751/How-to-use-decision-trees-for-credit-scoring-using-RapidMiner-Part-1
    http://www.simafore.com/blog/bid/56588/How-to-use-decision-trees-for-credit-scoring-using-RapidMiner-Part-2
    http://www.simafore.com/blog/bid/56930/How-to-use-decision-trees-for-credit-scoring-using-RapidMiner-Part-3
    http://www.simafore.com/blog/bid/57470/How-to-evaluate-classification-models-for-business-analytics-Part-2


    The accuracy of the model based on the 30% testing sample is 83%. The ROC curve has an AUC of 0.863, which is quite acceptable. The next step would be to review the kernel model and prepare for deploying this model.
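
    For readers who want to see how these measures are defined, the sketch below computes accuracy, sensitivity, specificity and AUC from a list of actual labels and predicted scores. The eight labels and scores are hypothetical and are not the results reported above; the AUC is computed with the pairwise-ranking definition (the share of positive/negative pairs in which the positive case gets the higher score), which is one standard way to obtain it.

```python
def confusion_counts(actual, predicted):
    """Return (tp, tn, fp, fn) for binary labels coded as 1 and 0."""
    tp = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 1)
    tn = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 0)
    fp = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 1)
    fn = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 0)
    return tp, tn, fp, fn


def auc(actual, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for a, s in zip(actual, scores) if a == 1]
    neg = [s for a, s in zip(actual, scores) if a == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical labels and model scores, for illustration only
actual = [1, 1, 1, 0, 0, 0, 0, 1]
scores = [0.9, 0.8, 0.4, 0.3, 0.6, 0.2, 0.1, 0.7]
predicted = [1 if s >= 0.5 else 0 for s in scores]

tp, tn, fp, fn = confusion_counts(actual, predicted)
accuracy = (tp + tn) / len(actual)
sensitivity = tp / (tp + fn)   # share of actual positives that were caught
specificity = tn / (tn + fp)   # share of actual negatives that were caught
print(accuracy, sensitivity, specificity, auc(actual, scores))
```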


    Chapter 3: Applying a logistic regression model built with RapidMiner

    In this chapter we will briefly describe how to apply the results from a logistic regression analysis with RapidMiner. Let us start by recapping the basic elements of logistic regression.

    1. Logistic regression is the equivalent of linear regression that is used when the response variable or label is binomial. A binomial response variable has two categories: Yes/No, Accept/Not Accept, Default/Not Default and so on.

    2. The logarithm of the odds of the response, Y, being a "Yes" is expressed as a function of the independent or predictor variables, X, and a constant term. That is, for example,

    log (odds of Y = "Yes") = mX + c          (this is also called the logit)

    3. The logit gives the logarithm of the odds of the "Yes" event. If we want the probability, we need to use the transformed equation below:

    p (of Y = "Yes") = 1 / [1 + exp(-(mX + c))]
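
    The step from the logit in item 2 to the probability in item 3 is a short piece of algebra. The derivation below uses the same symbols (p, m, X, c) as the equations above and introduces no new assumptions:

```latex
\begin{align*}
\log\frac{p}{1-p} &= mX + c \\
\frac{p}{1-p} &= e^{mX + c} \\
p &= (1 - p)\, e^{mX + c} \\
p \left(1 + e^{mX + c}\right) &= e^{mX + c} \\
p &= \frac{e^{mX + c}}{1 + e^{mX + c}} = \frac{1}{1 + e^{-(mX + c)}}
\end{align*}
```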

    A simple example

    Let us use a simple example of predicting if a customer will accept a bank's personal loan offer as a function of their income.


    When we run this simple dataset and build a logistic regression model, we see the following results.

    RapidMiner's implementation of logistic regression differs from many other (more conventional) approaches. The table on the left, which shows the kernel model, should not be confused with the logit model described above. In other words, w[Income] does not directly correspond to the slope "m" and Bias (offset) does not directly correspond to "c".

    The easiest way to implement the results of the analysis is to use the process below, which applies the results of the logistic regression learner to the example data set.

    When the analysis runs, simply click on the "Example Set" tab and the "Data View" radio button. You will see that for each of the cases there is a predicted result, Prediction (Personal loan), along with the confidence or probability that the loan acceptance is "No" and the corresponding complementary probability of "Yes".
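
    Outside RapidMiner, the same kind of output (a predicted class plus a confidence for each of the two categories) could be produced from a conventional logit model as sketched below. The coefficients `M` and `C`, the 0.5 cut-off and the income values are purely hypothetical; in particular they are not RapidMiner's kernel-model values, which, as noted above, do not map directly onto "m" and "c".

```python
import math


def score(income, m, c):
    """Return (prediction, confidence_yes, confidence_no) for one customer."""
    p_yes = 1.0 / (1.0 + math.exp(-(m * income + c)))
    p_no = 1.0 - p_yes
    prediction = "Yes" if p_yes >= 0.5 else "No"   # assumed 0.5 cut-off
    return prediction, p_yes, p_no


# Hypothetical slope and intercept, with income expressed in thousands
M, C = 0.1, -5.0

for income in (20, 40, 80):
    prediction, p_yes, p_no = score(income, M, C)
    print(income, prediction, round(p_yes, 3), round(p_no, 3))
```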


    The main takeaway from this chapter is that, using RapidMiner, it is easier to apply the developed model to new data to obtain the probability of the response variable being in one of the two categories than to try to interpret the model parameters in the light of traditional formulas, such as the logit.

    If you liked this e-book tutorial on analytics, sign up for visTASC, "a visual thesaurus of analytics, statistics and complex systems", for more like these. Sign up is FREE and allows you to search for techniques for other common business problems.

    http://vistasc.simafore.com/create-account