8/2/2019 Mlr Article Digest

Multiple Linear Regression Modeling

An eBook

Understand, build and use MLR models using RapidMiner for predicting sales

Bala Deshpande, Ph.D., MBA


MLR Digest: How to build and use multiple linear regression for business analytics, an eBook by SimaFore

SimaFore LLC Page 1

Table of Contents

Chapter 1: Multiple Linear Regression Business Problem and Data
Chapter 2: Setting up MLR using RapidMiner
Chapter 3: Identifying most important variables: Feature Selection
Chapter 4: Checkpoints to ensure regression model validity



Chapter 1: Multiple Linear Regression Business Problem and Data

In this eBook we will describe one of the most commonly used data mining techniques: multiple linear
regression (MLR). According to the Rexer Analytics Survey, regression models are one of
the three most common analytics tools used today by practitioners. In the first chapter we will
discuss the problem we are trying to address, describe the data, and give a quick introduction to using
regression models. In chapters 2 and 3 we will dig into the mechanics of using RapidMiner
to do the data preparation, model building, and validation. Finally, in chapter 4 we will describe
some checkpoints to ensure that MLR is used correctly.

The business problem

The fundamental issue for all businesses and specifically small and medium enterprises (SME) is

the need to grow revenues. Understanding and increasing the likelihood that someone will buy

again from the company is critical. Another important question that would help strategically is

predicting how much money a customer is likely to spend given data about their previous
purchase habits. The business problem we are looking at here is the second issue.

About predictive vs. explanatory models

One very important distinction needs to be made here: understanding why someone purchased
from the company falls into the realm of "explanatory modeling", whereas predicting how
much someone is likely to spend falls into the realm of "predictive analytics". Addressing the
second problem is predicated on the availability of large data volumes.
http://www.rexeranalytics.com/Data-Miner-Survey-Results-2010.html

Multiple linear regression can be applied in a variety of situations ranging from predicting

customer activity based on demographics and historical patterns to predicting time to failure of

machinery based on usage and operating conditions. Any situation where a numerical

prediction, such as "how much will someone spend", is required warrants the use of regression

models. This is in contrast to making a categorical prediction such as "will buy/will not buy" or "will
fail/will not fail", where we can use either decision trees or logistic regression models.

The main task is to find a linear equation that relates the predictors (independent variables or

factors) to the response (dependent variable or target). If there are two or more predictors, we

are effectively doing "multiple" regression. A note of caution before using the equation for
prediction: we have to ensure regression models are not arbitrarily deployed, and we must perform
checks to ensure regression models are valid. This is discussed in more detail in Chapter 4.
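
For reference, the linear equation described above can be written out in general form. The symbols are assumptions of this sketch rather than the ebook's own notation: y is the response (here, amount of spend), x_1 to x_p are the p predictors, b_0 is the intercept, b_1 to b_p are the regression coefficients, and the last term is the error not explained by the model.

```latex
\[
  y = b_0 + b_1 x_1 + b_2 x_2 + \cdots + b_p x_p + \varepsilon
\]
```

With p = 1 this reduces to simple linear regression. With p of two or more it is "multiple" regression.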

Data

The data consists of six predictors and one response variable. The predictors are as follows:
historical transactions, days since last transaction, online order (y/n), gender (m/f), customer

type (b2b/b2c), and region (domestic/international). The response variable is of course the

amount of spend.

Due to the small number of factors, we may not need to employ any data reduction schemes

and in chapter 2 we will be using RapidMiner to directly build the model and explore the

weakest/strongest predictors, most likely customer profile, and predictive accuracy.
http://www.simafore.com/blog/?Tag=decision+trees
http://www.simafore.com/blog/?Tag=logistic+regression+models
http://www.simafore.com/blog/bid/56752/3-checks-to-prevent-abuse-of-regression-models
http://www.simafore.com/blog/bid/54963/6-checkpoints-to-ensure-regression-model-validity-for-analytics

Chapter 2: Setting up MLR using RapidMiner

In this chapter, we will show how to set up a RapidMiner process to build a multiple linear

regression model for the sales prediction business analytics problem described in Chapter 1.

Before we do the actual modeling, let us do some initial analysis. This includes summarizing the
data by using Excel pivot tables. An excellent introduction to building pivot tables in Excel can
be found here. Additionally, we will check for correlations among the predictors, and between the
predictors and the response variable, to avoid multicollinearity issues later on.

Here is the data that was introduced in chapter 1, shown in a table.

We have 2000 records (rows) and 7 predictors. The response or label (in RapidMiner

terminology) is Column F - Purchase Amount which is what needs to be predicted. Note that the

data is "coded", which means that instead of a column for "Online order" with rows reading
either "Yes" or "No", we have a variable called "online order (Yes=0)" and the rows are either
0 (implying an online order) or 1. The same coding has been applied to the other three non-numeric
variables: gender, geographic region, and type of customer. This coding will help in

interpreting the regression coefficients later on.
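
The coding step described above can be sketched in a few lines of Python. This is only an illustration: the field name `online_order` and the sample rows are hypothetical, and only the "online order (Yes=0)" convention is taken from the text.

```python
# Hypothetical raw records; the field name and values are illustrative only.
raw_rows = [
    {"online_order": "Yes"},
    {"online_order": "No"},
    {"online_order": "yes"},
]

def code_online_order(value):
    """Return 0 for an online order ("Yes") and 1 otherwise, as in "online order (Yes=0)"."""
    return 0 if value.strip().lower() == "yes" else 1

# Replace the text values with their numeric codes.
coded = [code_online_order(row["online_order"]) for row in raw_rows]
print(coded)  # [0, 1, 0]
```

The same pattern would apply to the other two-valued variables, once it is decided which value maps to 0.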

Summarizing data using pivot tables in Excel allows us to gain some early insight into the data and

will help us understand the final model better. As seen in the summary table below, Online

orders are a bit more than 50% whereas there is not much difference between Male and

Female customers in terms of number of orders. However, B2B customers make up 82% of the

orders and Domestic customers are nearly 78% of all orders.

http://www.timeatlas.com/5_minute_tips/chunkers/learn_to_use_pivot_tables_in_excel_2007_to_organize_data
http://www.simafore.com/blog/bid/56752/3-checks-to-prevent-abuse-of-regression-models

RapidMiner model setup requires 3 steps. In Step 1, click and drag the Read Excel operator into the
main window. In Step 2, connect it to the Split Validation operator. In Step 3, place the "Linear
Regression" operator in the Training window, which opens when the nested window icon within Split
Validation is double-clicked. Full details of these steps are described here (and in our other free
ebook on Decision Trees here).

When the above model is run, RapidMiner will provide two main outputs: the actual model in

the form of a text (linear equation) and a table. The text output is useful to interpret the model

while the table form helps to explain the confidence level in each of the regression coefficients.

The graphics below illustrate this.
http://www.simafore.com/blog/bid/56588/how-to-use-decision-trees-for-credit-scoring-using-rapidminer-part-2
http://www.simafore.com/Download-ebook-Decision-Tree-Articles-Digest/

So who is most likely to become a customer and how much are they likely to spend? Finally,
which are the most important and least important variables? (Why can we not use the model
coefficients to do this?) We will explore this and the model accuracy questions in the next
chapter.



Chapter 3: Identifying most important variables: Feature Selection

In this chapter, we will explore two additional questions which were raised at the end of

Chapter 2:

1. Eliminating the least important variables from the model
2. Identifying the characteristics of the most valuable prospect based on historical data

We will show how to use RapidMiner to run a feature selection operator which will answer 1

and interpret the model to answer 2.

Feature selection or data dimension reduction or variable screening in predictive analytics
refers to the process of identifying the few most important variables or parameters which help

in predicting the outcome. In today's charged up world of high speed computing, one might be

forgiven for asking, why bother? The most important reasons all come from practicality.

Reason 1: If two or more of the independent variables (or predictors) are strongly correlated with
each other, and not only with the dependent (or predicted) variable, then the estimates of
coefficients in a regression model tend to be unstable or counter-intuitive.

Example: y = 45 + 0.8 x1 and y = 45 + 0.1 x2 are two simple linear regression models which predict y.
Both clearly indicate that if the x's increase, y also increases. If x1 and x2 are also strongly
correlated with each other, then a multiple regression model might look like y = 45 + 0.02 x1 - 0.4 x2.
In this case, because the three (x1, x2 and y) are strongly correlated, the overlap between x1 and x2
leads to a situation where x2 is in a negative relationship with y, meaning y will decrease with an
increase in x2. This is not only the reverse of what was seen in the simple model, but is also
counter-intuitive.
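
The sign reversal in this example can be reproduced with a small experiment. The sketch below uses synthetic data (an assumption, not the ebook's data set) in which x2 is almost a copy of x1, and y is generated so that it rises with x1 and falls with x2 once x1 is held fixed. The function names are illustrative, and the two-predictor fit solves the least-squares normal equations directly.

```python
import random

def simple_slope(x, y):
    """Least-squares slope of y on a single predictor x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx

def two_predictor_coefs(x1, x2, y):
    """Least-squares coefficients (b1, b2) of y on x1 and x2, with an intercept."""
    n = len(y)
    m1, m2, my = sum(x1) / n, sum(x2) / n, sum(y) / n
    s11 = sum((a - m1) ** 2 for a in x1)
    s22 = sum((b - m2) ** 2 for b in x2)
    s12 = sum((a - m1) * (b - m2) for a, b in zip(x1, x2))
    s1y = sum((a - m1) * (c - my) for a, c in zip(x1, y))
    s2y = sum((b - m2) * (c - my) for b, c in zip(x2, y))
    det = s11 * s22 - s12 ** 2  # close to zero when x1 and x2 are nearly collinear
    return (s22 * s1y - s12 * s2y) / det, (s11 * s2y - s12 * s1y) / det

# Synthetic data (an assumption): x2 is x1 plus a little noise, so the two are highly correlated.
rng = random.Random(0)
x1 = [rng.gauss(0, 1) for _ in range(2000)]
x2 = [a + rng.gauss(0, 0.1) for a in x1]
y = [45 + 1.0 * a - 0.5 * b + rng.gauss(0, 0.2) for a, b in zip(x1, x2)]

slope_x1 = simple_slope(x1, y)            # positive when x1 is used alone
slope_x2 = simple_slope(x2, y)            # also positive when x2 is used alone
b1, b2 = two_predictor_coefs(x1, x2, y)   # b2 turns negative in the joint model
print(slope_x1, slope_x2, b1, b2)
```

Each predictor on its own shows a positive slope, yet the joint model assigns x2 a negative coefficient. The near-zero determinant is also why such coefficient estimates become unstable when predictors are highly correlated.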



Reason 2: The law of averages suggests that the larger the set of predictors, the higher the
probability of having missing values in the data. If we choose to delete cases which have missing
values for some predictors, we may end up with a shortage of samples.

Example: A practical rule of thumb used by data miners is to have at least 5(p+2) samples,
where p is the number of predictors. If your data set is sufficiently large and this rule is easily
satisfied, then you may not be risking much by deleting cases. But if your data is from an
expensive market survey, for example, a systematic procedure to reduce the number of variables
may mean that you do not have to address this problem of losing samples. It is
better to lose variables which do not impact your prediction than to lose somewhat more
expensive samples.
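
The rule of thumb above is simple arithmetic, and a short sketch makes it easy to check against a data set. The function names are illustrative. The six predictors and 2000 records are the figures quoted in this ebook.

```python
def min_samples(num_predictors):
    """Smallest sample size suggested by the 5(p+2) rule of thumb."""
    return 5 * (num_predictors + 2)

def has_enough_samples(num_samples, num_predictors):
    """Check whether a data set satisfies the 5(p+2) rule."""
    return num_samples >= min_samples(num_predictors)

print(min_samples(6))               # 40
print(has_enough_samples(2000, 6))  # True
print(has_enough_samples(30, 6))    # False
```

With 2000 records the rule is easily satisfied, so deleting a few incomplete cases would cost little here.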

There are several other more technical reasons for reducing data dimensionality which will be

explored in subsequent articles. In a next article, we will discuss some common techniques for

actually implementing this process.

Backward Elimination to reduce dataset

The process logic which RapidMiner uses is not "linear", but recursive. We do not apply
operators linearly, one after another. The graphic below explains how this nesting was used in

setting up the training and testing of Linear Regression operator for the analysis we did in

chapter 2. The red arrow indicates that the training and testing process was nested within the

"Split Validation" operator.

In order to introduce the feature selection method, we need to tuck the training and testing

process inside another sub-process called the Learning Process. The learning process is nested

inside the "Backward Elimination" operator. We now have two nestings as schematically shown

below.



Finally, the image below shows how to access the Backward Elimination operator. Double-clicking
on the Backward Elimination operator opens up the Learning Process, which will now contain

the Split Validation operator used earlier.

There is one more step to complete before running this model. Simply connecting the Backward
Elimination operator's ports to the output will not show us the final regression model equation. To be
able to see that, we need to connect the "exa" port of the Backward Elimination operator to another
"Linear Regression" operator in the main process. The output of this operator will contain the
model, which can be examined in the Results perspective. The graphic below shows this.



What variables have been eliminated by Backward Elimination?

Comparing the two regression equations (above and in chapter 2) we can see that the variables
B2C=1 and Domestic=1 have been removed. What are the advantages of this, if any? It implies
that in the future, it may not be necessary to collect these two pieces of data to predict
spending amount.
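
For readers who want to see the idea outside RapidMiner, the nesting used in this chapter (an elimination loop wrapped around a train/test validation, wrapped around a linear regression) can be sketched in plain Python. This is a simplified illustration under stated assumptions, not RapidMiner's implementation. The stopping rule, the `tolerance` parameter, the 280/120 split, the feature names and the synthetic data are all hypothetical.

```python
import random

def fit_ols(rows, targets):
    """Least-squares coefficients via the normal equations (rows include a leading 1.0)."""
    k = len(rows[0])
    # Augmented matrix [X'X | X'y].
    a = [[sum(r[i] * r[j] for r in rows) for j in range(k)]
         + [sum(r[i] * t for r, t in zip(rows, targets))]
         for i in range(k)]
    for col in range(k):  # Gauss-Jordan elimination with partial pivoting
        pivot = max(range(col, k), key=lambda row: abs(a[row][col]))
        a[col], a[pivot] = a[pivot], a[col]
        for row in range(k):
            if row != col:
                factor = a[row][col] / a[col][col]
                a[row] = [v - factor * w for v, w in zip(a[row], a[col])]
    return [a[i][k] / a[i][i] for i in range(k)]

def holdout_rmse(train, test, features):
    """Fit on the training split and return root mean squared error on the test split."""
    def design(data):
        return [[1.0] + [record[f] for f in features] for record, _ in data]
    coefs = fit_ols(design(train), [target for _, target in train])
    errors = [sum(c * v for c, v in zip(coefs, x)) - target
              for x, (_, target) in zip(design(test), test)]
    return (sum(e * e for e in errors) / len(errors)) ** 0.5

def backward_elimination(train, test, features, tolerance=0.05):
    """Remove one feature per round while hold-out RMSE worsens by at most `tolerance`."""
    selected = list(features)
    best = holdout_rmse(train, test, selected)
    while len(selected) > 1:
        # Score every candidate set obtained by dropping exactly one feature.
        trials = [(holdout_rmse(train, test, [f for f in selected if f != drop]), drop)
                  for drop in selected]
        score, drop = min(trials)
        if score > best + tolerance:
            break  # every removal hurts too much, so stop
        selected.remove(drop)
        best = score
    return selected, best

# Synthetic data (an assumption): the target depends on "a" and "b" but not on "junk".
rng = random.Random(1)
data = []
for _ in range(400):
    record = {name: rng.gauss(0, 1) for name in ("a", "b", "junk")}
    data.append((record, 3 + 2 * record["a"] - record["b"] + rng.gauss(0, 0.5)))
train, test = data[:280], data[280:]

selected, rmse = backward_elimination(train, test, ["a", "b", "junk"])
print(selected, rmse)
```

On this synthetic data the loop is expected to drop the uninformative "junk" variable and keep "a" and "b", which mirrors how B2C=1 and Domestic=1 were dropped from the sales model.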

What characteristics does a high spending customer have?

Referring back to the regression model shown in the graphic above, we see that the amount of
spend increases with Purchase frequency. Customers who order online tend to spend less. More recent
purchasers tend to spend more and, finally, if contact was made with a prospect recently (Last
Update), they tend to spend more.

In the final chapter of this ebook, we will discuss some tips to make sure that regression

modeling is applied correctly.





Chapter 4: Checkpoints to ensure regression model validity

While acknowledging the general overall risk in using models, it is important to know how to
mitigate some of these risks. In this chapter, we will specifically focus on 6 checkpoints to ensure
that bivariate analyses used to develop models (such as simple regression models), or to verify
if two parameters are related, are valid. Finally, we will briefly mention some advantages of
using mutual information over simple regression models for bivariate analysis.

Checkpoint 1: The first check point to consider before accepting any simple regression model is

of course to quantify the r-squared, which is also known as the "coefficient of determination".

R-squared measures how much of the variability in the dependent parameter is explained
by the independent parameter.
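
As a concrete illustration of Checkpoint 1, r-squared can be computed from the actual and predicted values as one minus the ratio of the unexplained variation to the total variation. The sample numbers below are made up for the example.

```python
def r_squared(actual, predicted):
    """Coefficient of determination: 1 - (residual sum of squares / total sum of squares)."""
    mean_y = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_y) ** 2 for a in actual)
    return 1 - ss_res / ss_tot

# Made-up values: the predictions track the actual values closely.
actual = [1, 2, 3, 4]
predicted = [1.1, 1.9, 3.2, 3.8]
print(round(r_squared(actual, predicted), 2))  # 0.98
```

A model that always predicts the mean of the actual values scores 0, and a perfect model scores 1.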

Addendum to 1: In most cases of linear regression the r-squared value lies between 0 and 1.
The ideal range for r-squared varies across applications; for example, in social and behavioral
science models, low values are typically acceptable. Generally, very low values (below about 0.2) indicate
that the variables in your model do not explain the outcome satisfactorily. Similarly, very high values
(above 0.8) can indicate too high a dependency, which may lower the predictive ability of the model.

Checkpoint 2: Once a regression model is fit through the sample data points, the t-statistic

must be used to check if the slope of the model is different from zero. But why not simply check

the slope (even visually) of the model? The t-statistic check ensures that the population slope

(not just the sample slope) is different from zero. This of course requires the assumption of

normal distribution of all sample slopes that make up the population.
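
A minimal sketch of the slope check in Checkpoint 2 for a simple (one-predictor) regression is given below. It divides the fitted slope by its standard error, which is estimated from the residuals with n - 2 degrees of freedom. The data values are made up, and in practice the resulting statistic would be compared against a t table or the p-value reported by a statistics package.

```python
import math

def slope_t_statistic(x, y):
    """Return (slope, t) for a simple least-squares fit of y on x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = my - slope * mx
    # Residual sum of squares around the fitted line.
    ss_res = sum((b - (intercept + slope * a)) ** 2 for a, b in zip(x, y))
    # Standard error of the slope, using n - 2 degrees of freedom.
    std_err = math.sqrt(ss_res / (n - 2) / sxx)
    return slope, slope / std_err

# Made-up sample with a clear upward trend.
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
slope, t_stat = slope_t_statistic(x, y)
print(slope, t_stat)
```

A large absolute t value, as in this sample, indicates that the population slope is unlikely to be zero.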

Checkpoint 3: This brings us to the next check - which is to ensure that all error terms in the

model are normally distributed. Fortunately most standard statistical packages do this

automatically, but it is good to know that this check has been performed.
http://www.simafore.com/blog/?Tag=risk+management


Checkpoint 4: Make sure that if you are using the model to predict, the domain of the predictor

is within the range of the sample data used to build the model.
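
Checkpoint 4 can be enforced with a simple guard before scoring a new record. The sketch below is illustrative only: the function name and the "days since last transaction" values are hypothetical.

```python
def within_training_range(value, training_values):
    """Return True if value lies inside the range seen when the model was built."""
    return min(training_values) <= value <= max(training_values)

# Hypothetical "days since last transaction" values from the training sample.
training_days = [3, 10, 45, 120, 365]

print(within_training_range(60, training_days))   # True: inside the sampled range
print(within_training_range(900, training_days))  # False: the model would be extrapolating
```

With several predictors the same check would be applied to each one before trusting the prediction.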

Checkpoint 5: Passing checks 1 and 2 will ensure that the independent and dependent variables

are related. However this does not imply that the independent variable is the cause and the

dependent is the effect. Remember that correlation is not causation!

Checkpoint 6: Highly non-linear relationships will result in simple regression models failing

checks 1 through 3. However this does not mean that the two variables are not related. In such

cases it may become necessary to resort to somewhat more advanced bivariate analysis

methods. The use of mutual information for testing if two variables are related is highly

effective in such cases.

Mutual information will very simply tell you if variable X is related to variable Y, and how much
the uncertainty in predicting Y is reduced once X is known. Furthermore,

mutual information can handle jumps or discontinuities within the sample data - for example

the X data may not be uniformly spaced. Such jumps in data are well captured by mutual

information, as are non-linearities.
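
To make the contrast with simple regression concrete, the sketch below compares the linear (Pearson) correlation with the mutual information for a made-up, strongly non-linear relationship, y = x squared. The estimator shown works on discrete values only. Continuous data would first need to be binned, and the binning choice is left open here as an assumption.

```python
import math
from collections import Counter

def pearson(xs, ys):
    """Linear correlation coefficient between two sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def mutual_information(xs, ys):
    """Mutual information (in bits) between two discrete sequences."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    # Sum p(x, y) * log2(p(x, y) / (p(x) * p(y))) over the observed pairs.
    return sum((c / n) * math.log2(c * n / (px[x] * py[y]))
               for (x, y), c in pxy.items())

# Made-up data: y is fully determined by x, but the relationship is not linear.
xs = [-2, -1, 0, 1, 2] * 20
ys = [x * x for x in xs]

corr = pearson(xs, ys)
mi = mutual_information(xs, ys)
print(corr, mi)
```

Here the linear correlation is zero, so a simple regression would fail the earlier checks, while the mutual information is clearly positive and shows that the two variables are related.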

If you liked this ebook tutorial on analytics, sign up for visTASC, "a visual thesaurus of analytics,
statistics and complex systems", for more like these. Sign up is FREE and allows you to search for

techniques for other common business problems.
http://www.simafore.com/blog/?Tag=mutual+information
http://vistasc.simafore.com/create-account