8/2/2019 Mlr Article Digest
Multiple Linear Regression Modeling: An eBook
Understand, build and use MLR models using RapidMiner for predicting sales
Bala Deshpande, Ph.D., MBA
MLR Digest: How to build and use multiple linear regression for business analytics, an eBook by SimaFore
SimaFore LLC Page 1
Table of Contents
Chapter 1: Multiple Linear Regression Business Problem and Data
Chapter 2: Setting up MLR using RapidMiner
Chapter 3: Identifying most important variables: Feature Selection
Chapter 4: Checkpoints to ensure regression model validity
Chapter 1: Multiple Linear Regression Business Problem and Data
In this eBook we describe one of the most commonly used data mining techniques: multiple
linear regression (MLR). According to the Rexer Analytics Survey, regression models are one of
the three most common analytics tools used by practitioners today. In the first chapter we
discuss the problem we are trying to address and the data, and give a quick introduction to
using regression models. In chapters 2 and 3 we dig into the mechanics of using RapidMiner
for data preparation, model building, and validation. Finally, in chapter 4 we describe
some checkpoints to ensure that MLR is used correctly.
The business problem
The fundamental issue for all businesses, and specifically for small and medium enterprises
(SMEs), is the need to grow revenues. Understanding and increasing the likelihood that someone
will buy again from the company is critical. Another question that would help strategically is
predicting how much money a customer is likely to spend, given data about their previous
purchase habits. The business problem we are looking at here is the second one.
About predictive vs. explanatory models
An important distinction needs to be made here: understanding why someone purchased
from the company falls into the realm of "explanatory modeling", whereas predicting how
much someone is likely to spend falls into the realm of "predictive analytics". Addressing the
second problem is predicated on the availability of large data volumes.
Rexer Analytics Survey: http://www.rexeranalytics.com/Data-Miner-Survey-Results-2010.html
Multiple linear regression can be applied in a variety of situations, ranging from predicting
customer activity based on demographics and historical patterns to predicting time to failure of
machinery based on usage and operating conditions. Any situation where a numerical
prediction is required, such as "how much will someone spend", warrants the use of regression
models. This is in contrast to making a categorical prediction such as "will buy/will not buy" or
"will fail/will not fail", where we can use either decision trees or logistic regression models.
The main task is to find a linear equation that relates the predictors (independent variables or
factors) to the response (dependent variable or target). If there are two or more predictors, we
are effectively doing "multiple" regression. A note of caution before using the equation for
prediction: we have to ensure that regression models are not arbitrarily deployed, and we must
perform checks to ensure that regression models are valid. This is discussed in more detail in
Chapter 4.
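As a rough illustration of what "finding a linear equation" means, the sketch below fits y = b0 + b1*x1 + b2*x2 by ordinary least squares using only the Python standard library. The toy numbers and the function names (solve, fit_mlr) are assumptions made for this example; they are not the eBook's data, and this is not how RapidMiner implements its operator.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gauss-Jordan elimination."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        # Partial pivoting: move the largest entry in this column to the diagonal.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * p for x, p in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def fit_mlr(rows, y):
    """Return [intercept, b1, b2, ...] from the normal equations (X'X) b = X'y."""
    X = [[1.0] + list(r) for r in rows]  # column of ones for the intercept
    k = len(X[0])
    xtx = [[sum(x[i] * x[j] for x in X) for j in range(k)] for i in range(k)]
    xty = [sum(x[i] * yi for x, yi in zip(X, y)) for i in range(k)]
    return solve(xtx, xty)


# Hypothetical toy data generated from y = 10 + 2*x1 - 3*x2 (no noise).
rows = [(1, 0), (2, 1), (3, 0), (4, 1), (5, 1)]
y = [12, 11, 16, 15, 17]
coefficients = fit_mlr(rows, y)
print([round(c, 6) for c in coefficients])
```

Because the toy data contains no noise, the fitted coefficients should match the generating equation up to floating-point error. In practice a numerical library would normally replace the hand-written solver.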
Data
The data consists of six predictors and one response variable. The predictors are as follows:
historical transactions, days since last transaction, online order (y/n), gender (m/f), customer
type (b2b/b2c), and region (domestic/international). The response variable is the amount of
spend.
Due to the small number of factors, we may not need to employ any data reduction schemes
and in chapter 2 we will be using RapidMiner to directly build the model and explore the
weakest/strongest predictors, most likely customer profile, and predictive accuracy.
Decision trees: http://www.simafore.com/blog/?Tag=decision+trees
Logistic regression models: http://www.simafore.com/blog/?Tag=logistic+regression+models
3 checks to prevent abuse of regression models: http://www.simafore.com/blog/bid/56752/3-checks-to-prevent-abuse-of-regression-models
6 checkpoints to ensure regression model validity for analytics: http://www.simafore.com/blog/bid/54963/6-checkpoints-to-ensure-regression-model-validity-for-analytics
Chapter 2: Setting up MLR using RapidMiner
In this chapter, we will show how to set up a RapidMiner process to build a multiple linear
regression model for the sales prediction business analytics problem described in Chapter 1.
Before we do the actual modeling, let us do some initial analysis. This includes summarizing the
data by using Excel pivot tables. An excellent introduction to building pivot tables in Excel can
be found here. Additionally, we will check for correlations among the predictors, and between
the predictors and the response variable, to avoid multicollinearity issues later on.
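The correlation check can be sketched in a few lines. The example below computes Pearson correlations between every pair of predictors and flags pairs above a cut-off. The column names, the values, and the 0.8 threshold are all assumptions chosen for illustration.

```python
from itertools import combinations
from math import sqrt


def pearson(a, b):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / sqrt(var_a * var_b)


# Hypothetical predictor columns; the values are invented for illustration.
predictors = {
    "historical_transactions": [2, 5, 1, 8, 4, 7],
    "days_since_last": [40, 12, 60, 5, 20, 9],
    "online_order": [0, 1, 0, 1, 1, 0],
}

THRESHOLD = 0.8  # assumed cut-off for "strongly correlated"
flagged = [
    (a, b, round(pearson(predictors[a], predictors[b]), 2))
    for a, b in combinations(predictors, 2)
    if abs(pearson(predictors[a], predictors[b])) > THRESHOLD
]
print(flagged)
```

Any pair that appears in the flagged list is a candidate source of multicollinearity and deserves a closer look before both variables go into the model.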
Here is the data that was introduced in chapter 1, shown in a table.
We have 2000 records (rows) and 7 predictors. The response, or label in RapidMiner
terminology, is Column F (Purchase Amount), which is what needs to be predicted. Note that the
data is "coded": instead of a column for "Online order" with rows reading either "Yes" or "No",
we have a variable called "online order (Yes=0)" whose rows are either 0 (implying an online
order) or 1. The same coding has been applied to the other three non-numeric variables:
gender, geographic region, and type of customer. This coding will help in interpreting the
regression coefficients later on.
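The 0/1 coding described above can be sketched as a simple lookup. The record layout and field names below are hypothetical. The "Yes" = 0 mapping for online orders follows the text, while the gender mapping is an assumption.

```python
# Hypothetical raw records; field names and values are invented for illustration.
raw_records = [
    {"online_order": "Yes", "gender": "M", "customer_type": "B2B"},
    {"online_order": "No", "gender": "F", "customer_type": "B2C"},
]

# Each mapping says which category becomes 0 and which becomes 1.
# "Yes" -> 0 mirrors the "online order (Yes=0)" column described in the text;
# the gender mapping is an assumption.
CODES = {
    "online_order": {"Yes": 0, "No": 1},
    "gender": {"M": 0, "F": 1},
    "customer_type": {"B2B": 0, "B2C": 1},
}


def encode(record):
    """Replace each categorical value with its 0/1 code."""
    return {field: CODES[field][value] for field, value in record.items()}


coded_records = [encode(r) for r in raw_records]
print(coded_records)
```

With this kind of coding, the regression coefficient of a coded variable can be read as the difference in predicted spend between the category coded 1 and the category coded 0, with all other predictors held fixed.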
Summarizing data using pivot tables in Excel allows us to gain some early insight into the data
and will help us understand the final model better. As seen in the summary table below, online
orders make up a bit more than 50% of orders, whereas there is not much difference between
male and female customers in terms of number of orders. However, B2B customers make up 82%
of the orders, and domestic customers account for nearly 78% of all orders.
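The same kind of summary a pivot table gives can be approximated in code by counting the values in a coded column and converting the counts to shares. The column below is a hypothetical eight-row sample, not the 2000-record data set.

```python
from collections import Counter

# Hypothetical coded column: 0 = online order, 1 = not online (values invented).
online_order = [0, 0, 1, 0, 1, 0, 1, 0]

counts = Counter(online_order)
shares = {code: counts[code] / len(online_order) for code in sorted(counts)}
print(shares)
```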
Learn to use pivot tables in Excel 2007 to organize data: http://www.timeatlas.com/5_minute_tips/chunkers/learn_to_use_pivot_tables_in_excel_2007_to_organize_data
3 checks to prevent abuse of regression models: http://www.simafore.com/blog/bid/56752/3-checks-to-prevent-abuse-of-regression-models
RapidMiner model setup requires 3 steps: in Step 1, click and drag Read Excel into the main
window; in Step 2, connect it to the Split Validation operator; and in Step 3, use the "Linear
Regression" operator in the Training window, which opens when the nested window icon within
Split Validation is double-clicked. Full details of these steps are described here (and in our
other free eBook on Decision Trees here).
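Conceptually, a split validation holds back part of the data, fits the model on the rest, and measures the error on the held-back part. The sketch below shows that idea with a single predictor to keep it short. The 70/30 ratio, the seed, the helper names, and the data are assumptions, and it does not reproduce RapidMiner's Split Validation operator.

```python
import random
from math import sqrt


def split(records, train_ratio=0.7, seed=42):
    """Shuffle a copy of the records and split it into training and test parts."""
    shuffled = records[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(round(train_ratio * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]


def fit_line(pairs):
    """Least-squares slope and intercept for (x, y) pairs."""
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def rmse(pairs, slope, intercept):
    """Root mean squared error of the fitted line on the given pairs."""
    squared = [(y - (intercept + slope * x)) ** 2 for x, y in pairs]
    return sqrt(sum(squared) / len(pairs))


# Hypothetical data: spend = 20 + 3 * transactions, with no noise.
data = [(x, 20 + 3 * x) for x in range(1, 11)]
train, test = split(data)
slope, intercept = fit_line(train)
test_error = rmse(test, slope, intercept)
print(len(train), len(test), round(test_error, 6))
```

The error that matters for judging predictive accuracy is the one computed on the test part, because those records played no role in fitting the line.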
When the above model is run, RapidMiner will provide two main outputs: the actual model in
the form of a text (linear equation) and a table. The text output is useful to interpret the model
while the table form helps to explain the confidence level in each of the regression coefficients.
The graphics below illustrate this.
How to use decision trees for credit scoring using RapidMiner, part 2: http://www.simafore.com/blog/bid/56588/how-to-use-decision-trees-for-credit-scoring-using-rapidminer-part-2
Decision Tree Articles Digest eBook: http://www.simafore.com/Download-ebook-Decision-Tree-Articles-Digest/
So who is most likely to become a customer, and how much are they likely to spend? Finally,
which are the most important and least important variables? (Why can we not use the model
coefficients to answer this?) We will explore these questions, and the question of model
accuracy, in the next chapter.
Chapter 3: Identifying most important variables: Feature Selection
In this chapter, we will explore two additional questions which were raised at the end of
Chapter 2:
1. Eliminating the least important variables from the model
2. Identifying the characteristics of the most valuable prospect based on historical data
We will show how to use RapidMiner to run a feature selection operator which will answer 1
and interpret the model to answer 2.
Feature selection (also called data dimension reduction or variable screening) in predictive
analytics refers to the process of identifying the few most important variables or parameters
which help in predicting the outcome. In today's world of high speed computing, one might be
forgiven for asking, why bother? The most important reasons all come from practicality.
Reason 1: If two or more of the independent variables (or predictors) are correlated with each
other, then the estimates of the coefficients in a regression model tend to be unstable or
counter-intuitive.
Example: y = 45 + 0.8x1 and y = 45 + 0.1x2 are two simple linear regression models which
predict y. Both clearly indicate that if the x's increase, y also increases. If x1 and x2 are also
strongly correlated with each other, then a multiple regression model might look like
y = 45 + 0.02 x1 - 0.4 x2. In this case, because the three (x1, x2 and y) are strongly correlated,
the overlap between x1 and x2 leads to a situation where x2 is in a negative relationship with y,
meaning y will decrease with an increase in x2. This is not only the reverse of what was seen in
the simple model, but is also counter-intuitive.
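The sign reversal described above can be reproduced with a small constructed data set. In the sketch below x2 is almost a copy of x1, and y is generated as 2*x1 - x2. The numbers are invented for this illustration and are not the coefficients quoted in the example.

```python
def centered_sums(a, b):
    """Sum of products of deviations from the mean (n times the covariance)."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    return sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))


# Hypothetical data: x2 is almost a copy of x1, and y = 2*x1 - x2 exactly.
x1 = [1, 2, 3, 4, 5, 6]
x2 = [1.1, 1.9, 3.1, 3.9, 5.1, 5.9]
y = [2 * a - b for a, b in zip(x1, x2)]

s11, s22, s12 = centered_sums(x1, x1), centered_sums(x2, x2), centered_sums(x1, x2)
s1y, s2y = centered_sums(x1, y), centered_sums(x2, y)

# Simple regression of y on x2 alone.
simple_slope_x2 = s2y / s22

# Two-predictor regression: solve the 2x2 normal equations for the slopes.
det = s11 * s22 - s12 ** 2
b1 = (s22 * s1y - s12 * s2y) / det
b2 = (s11 * s2y - s12 * s1y) / det

print(round(simple_slope_x2, 3), round(b1, 3), round(b2, 3))
```

On its own, x2 shows a positive slope against y, yet its coefficient turns negative once x1 enters the model. The determinant in the two-predictor solution is small because x1 and x2 carry nearly the same information, which is also why such estimates are sensitive to small changes in the data.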
Reason 2: The larger the set of predictors, the higher the probability that a record has a
missing value for at least one of them. If we choose to delete cases which have missing
values for some predictors, we may end up with a shortage of samples.
Example: A practical rule of thumb used by data miners is to have at least 5(p+2) samples,
where p is the number of predictors. If your data set is sufficiently large and this rule is easily
satisfied, then you may not be risking much by deleting cases. But if your data comes from an
expensive market survey, for example, a systematic procedure to reduce the number of
variables may mean that you do not have to face this problem of losing samples at all. It is
better to lose variables which do not impact your prediction than to lose the somewhat more
expensive samples.
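The rule of thumb is simple arithmetic, sketched below. The function names and the row counts in the usage lines are hypothetical.

```python
def minimum_samples(num_predictors):
    """Rule-of-thumb minimum sample size: 5 * (p + 2)."""
    return 5 * (num_predictors + 2)


def can_afford_deletion(total_rows, rows_with_missing, num_predictors):
    """True if dropping incomplete rows still leaves enough samples."""
    return total_rows - rows_with_missing >= minimum_samples(num_predictors)


print(minimum_samples(6))                 # 6 predictors -> 40 samples
print(can_afford_deletion(2000, 150, 6))  # hypothetical: 150 incomplete rows -> True
print(can_afford_deletion(45, 10, 6))     # small survey: 35 rows left -> False
```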
There are several other, more technical reasons for reducing data dimensionality, which will be
explored in subsequent articles. In a subsequent article, we will also discuss some common
techniques for actually implementing this process.
Backward Elimination to reduce dataset
The process logic which RapidMiner uses is not "linear", but recursive. We do not apply
operators linearly, one after another; instead, operators are nested inside one another. The
graphic below explains how this nesting was used in setting up the training and testing of the
Linear Regression operator for the analysis we did in chapter 2. The red arrow indicates that
the training and testing process was nested within the "Split Validation" operator.
In order to introduce the feature selection method, we need to tuck the training and testing
process inside another sub-process called the Learning Process. The learning process is nested
inside the "Backward Elimination" operator. We now have two nestings as schematically shown
below.
Finally, the image below shows how to access the Backward Elimination operator. Double-
clicking on the Backward Elimination operator opens up the Learning Process, which will now
contain the Split Validation operator used earlier.
There is one more step to complete before running this model. Simply connecting the Backward
Elimination operator's ports to the output will not show us the final regression model
equation. To be able to see that, we need to connect the "exa" port of the Backward Elimination
operator to another "Linear Regression" operator in the main process. The output of this
operator will contain the model, which can be examined in the Results perspective. The graphic
below shows this.
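The nesting described above (a learner inside a validation, inside a feature selection loop) can be sketched as three functions that call one another. This is a generic backward elimination under stated assumptions: the data, the tolerance, and the stopping rule are invented for illustration, and RapidMiner's Backward Elimination operator has its own parameters and stopping behavior that are not reproduced here.

```python
def solve(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * p for x, p in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def design(rows, features):
    """Design matrix with an intercept column and only the chosen features."""
    return [[1.0] + [row[f] for f in features] for row in rows]


def validation_error(features, train, test):
    """Fit on the training part; return mean squared error on the test part."""
    (train_x, train_y), (test_x, test_y) = train, test
    X = design(train_x, features)
    k = len(X[0])
    xtx = [[sum(x[i] * x[j] for x in X) for j in range(k)] for i in range(k)]
    xty = [sum(x[i] * t for x, t in zip(X, train_y)) for i in range(k)]
    coef = solve(xtx, xty)
    preds = [sum(c * v for c, v in zip(coef, x)) for x in design(test_x, features)]
    return sum((p - t) ** 2 for p, t in zip(preds, test_y)) / len(test_y)


def backward_elimination(num_features, train, test, tolerance=1e-6):
    """Drop one feature at a time while the validation error does not get worse."""
    selected = list(range(num_features))
    best = validation_error(selected, train, test)
    while len(selected) > 1:
        # Evaluate every candidate set obtained by removing a single feature.
        trials = [
            (validation_error([f for f in selected if f != drop], train, test), drop)
            for drop in selected
        ]
        error, drop = min(trials)
        if error > best + tolerance:
            break  # every removal hurts, so stop
        selected.remove(drop)
        best = min(best, error)
    return selected


# Hypothetical data: y depends on features 0 and 1; feature 2 is irrelevant.
rows = [(i, (i * i) % 5, (i * 7) % 3) for i in range(12)]
y = [5 + 2 * r[0] - 3 * r[1] for r in rows]
train, test = (rows[:8], y[:8]), (rows[8:], y[8:])
kept = backward_elimination(3, train, test)
print(kept)
```

The innermost function plays the role of the learner plus split validation, and the outer loop plays the role of the feature selection wrapper. Refitting a final model on the kept features corresponds to the extra Linear Regression operator added in the main process.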
What variables have been eliminated by Backward Elimination?
Comparing the two regression equations (above and in chapter 2), we can see that the variables
B2C=1 and Domestic=1 have been removed. What are the advantages of this, if any? It implies
that in the future, it may not be necessary to collect these two pieces of data to predict
spending amount.
What characteristics does a high spending customer have?
Referring back to the regression model shown in the graphic above, we see that the amount of
spend increases with Purchase frequency. Also, customers who order online tend to spend less.
More recent purchasers tend to spend more and, finally, if a contact was made with a prospect
recently (Last Update), they tend to spend more.
In the final chapter of this ebook, we will discuss some tips to make sure that regression
modeling is applied correctly.
Chapter 4: Checkpoints to ensure regression model validity
While acknowledging the overall risk in using models, it is important to know how to
mitigate some of these risks. In this chapter, we will specifically focus on 6 checkpoints to
ensure that bivariate analyses used to develop models (such as simple regression models), or to
verify if two parameters are related, are valid. Finally, we will briefly mention some advantages
of using mutual information over simple regression models for bivariate analysis.
Checkpoint 1: The first checkpoint to consider before accepting any simple regression model is
to quantify the r-squared, which is also known as the "coefficient of determination".
R-squared measures how much of the variability in the dependent parameter is explained
by the independent parameter.
Addendum to 1: In most cases of linear regression the r-squared value lies between 0 and 1.
The ideal range for r-squared varies across applications; for example, in social and behavioral
science models, low values are typically acceptable. Generally, very low values (~ < 0.2)
indicate that the variables in your model do not explain the outcome satisfactorily. Similarly,
very high values (> 0.8) may indicate too high a dependency, which can make the predictive
ability of the model low.
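For reference, r-squared can be computed directly from the observed and predicted values as 1 - SS_residual / SS_total. The numbers below are hypothetical.

```python
def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_residual / SS_total."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot


# Hypothetical observed and model-predicted spend values.
actual = [10, 20, 30, 40]
predicted = [12, 18, 33, 37]
print(round(r_squared(actual, predicted), 3))
```

Computed this way, the value can fall below 0 when a model predicts worse than simply using the mean of the observed values, for example on data that was not used for fitting.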
Checkpoint 2: Once a regression model is fit through the sample data points, the t-statistic
must be used to check if the slope of the model is different from zero. But why not simply check
the slope (even visually) of the model? The t-statistic check ensures that the population slope
(not just the sample slope) is different from zero. This of course requires the assumption that
the slopes obtained from repeated samples of the population are normally distributed.
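A minimal sketch of the slope t-statistic for a simple regression follows: the fitted slope divided by its standard error, with n - 2 degrees of freedom. The five data points are invented for illustration.

```python
from math import sqrt


def slope_t_statistic(x, y):
    """t-statistic for the hypothesis that the slope is zero (simple regression)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ss_res = sum((b - (intercept + slope * a)) ** 2 for a, b in zip(x, y))
    standard_error = sqrt(ss_res / (n - 2) / sxx)  # standard error of the slope
    return slope / standard_error


# Hypothetical sample of five points.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
t = slope_t_statistic(x, y)
print(round(t, 3))
```

For this sample the slope is positive, but the t-statistic is about 2.12, which is below the two-sided 5% critical value of roughly 3.18 for 3 degrees of freedom. A visual check of the slope alone would have suggested a relationship that the test does not support.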
Checkpoint 3: This brings us to the next check - which is to ensure that all error terms in the
model are normally distributed. Fortunately most standard statistical packages do this
automatically, but it is good to know that this check has been performed.
Risk management: http://www.simafore.com/blog/?Tag=risk+management
Checkpoint 4: Make sure that if you are using the model to predict, the domain of the predictor
is within the range of the sample data used to build the model.
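Checkpoint 4 can be enforced with a simple guard before scoring a new record. The function name and the sample values are hypothetical.

```python
def within_training_range(value, training_values):
    """True if the value lies inside the range seen when the model was built."""
    return min(training_values) <= value <= max(training_values)


# Hypothetical predictor values used to build the model.
days_since_last = [5, 9, 12, 20, 40, 60]
print(within_training_range(30, days_since_last))  # inside 5..60 -> True
print(within_training_range(90, days_since_last))  # outside, extrapolation -> False
```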
Checkpoint 5: Passing checks 1 and 2 will ensure that the independent and dependent variables
are related. However, this does not imply that the independent variable is the cause and the
dependent variable is the effect. Remember that correlation is not causation!
Checkpoint 6: Highly non-linear relationships will result in simple regression models failing
checks 1 through 3. However, this does not mean that the two variables are not related. In such
cases it may become necessary to resort to somewhat more advanced bivariate analysis
methods. The use of mutual information for testing if two variables are related is highly
effective in such cases.
Mutual information will very simply tell you if variable X is related to variable Y, and how much
the uncertainty in predicting Y is reduced once X is known. Furthermore, mutual information
can handle jumps or discontinuities within the sample data; for example, the X data may not
be uniformly spaced. Such jumps in data are well captured by mutual information, as are
non-linearities.
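For discrete data, mutual information can be estimated from observed frequencies. The sketch below uses an invented example where y is the square of x, a relationship that is perfectly deterministic but has zero linear correlation. Continuous variables would first need to be binned, which is not shown here.

```python
from collections import Counter
from math import log2


def mutual_information(xs, ys):
    """Mutual information (in bits) between two discrete sequences."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum(
        (c / n) * log2((c / n) / ((px[x] / n) * (py[y] / n)))
        for (x, y), c in pxy.items()
    )


# Hypothetical example: y = x squared is fully determined by x but not linear in x.
xs = [-2, -1, 0, 1, 2, -2, -1, 0, 1, 2]
ys = [x * x for x in xs]
print(round(mutual_information(xs, ys), 3))
```

A value of zero would indicate that knowing X tells us nothing about Y. Here the value is clearly positive even though a straight-line fit through these points would have a slope of zero.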
If you liked this eBook tutorial on analytics, sign up for visTASC, "a visual thesaurus of
analytics, statistics and complex systems", for more like these. Sign up is FREE and allows you
to search for techniques for other common business problems.
Mutual information: http://www.simafore.com/blog/?Tag=mutual+information
visTASC sign-up: http://vistasc.simafore.com/create-account