
Abstract

The aim of this master thesis is to predict the outcome of a metric K which

describes the usage of Scania vehicles on different roads. This metric is of great

interest for the company and it is used during the process of developing vehicle

components. Through this work we discuss two well known supervised learning

methods, decision trees and neural networks, which enable us to build the

predictive models. The set of data used consists of approximately 30,000 vehicles,

and it is based on a set of features which from theoretical bases and expert

opinions in Scania were considered to contain relevant information and be related

to the output metric K. The selected data set represents the largest product segment

in Scania, long haulage vehicles.

CART (Classification and Regression Trees) and CHAID (Chi-squared Automatic Interaction Detection) regression trees of different sizes were first fitted, given their simplicity and predictive power. However, evaluation of the

performance of these algorithms, based on the Nash-Sutcliffe efficiency measure

(0.61 and 0.65 for the CART and CHAID tree respectively), demonstrates that the

tree methods were not able to extract the patterns and relationships present in the

data. Finally, knowing that given enough data, hidden units, and training time, a

feedforward multilayer perceptron (MLP) can learn to approximate virtually any

function to any degree of accuracy, an MLP neural network model with one hidden layer and four neurons was trained. An efficiency of 0.86 shows that the

predictive results obtained with the selected network were more accurate than

those acquired with the regression tree methods. Predicted values for the fraction

of the data set that did not contain the metric K as the target value were also

obtained, and the results showed that it is possible to rely on the predictive power

of the neural network model for further analysis, including other groups of vehicles

built in Scania for different purposes.


Acknowledgements

I would first like to express my deep appreciation and sincere gratitude to

Professor Anders Grimvall for encouraging me and giving me the opportunity to

take part of the Master’s Programme in Statistics, Data Analysis and Knowledge

Discovery, and especially thank him for recommending me to Scania. Thank you

Anders for your guidance, for sharing your knowledge with me, for being so

patient, for always trying to explain everything really clearly and carefully during

our lessons and our consulting sessions. Your pedagogical spirit and your never

ending stream of ideas have always inspired me and motivated me during my

studies and work.

I would also like to thank everyone who in one way or another is involved in the

success of this thesis work. I thank Scania for its permission to carry out this

project and for making the data that has been analyzed available. Ann Lindqvist,

my supervisor at Scania, who gave me the opportunity and confidence to work

with this challenging project. Thank you Ann for helping within Scania to obtain

all the necessary knowledge and for introducing me to all the people that in

different ways contributed to improving the quality of my work. I also want to

thank you for your friendship, for making sure I would always find my way in

Scania and in Södertälje, and for sharing time with me after a hard working day.

Thank you for all the help you offered me inside and outside the office.

Thank you Klas Levin, Mikael Curbo, Anders Forsen, Erik Landström and all of

you at Scania who were always interested in my work, offering me valuable

advice and helping me during the development of my thesis.

Special thanks also go to Anders Noorgard, my supervisor at Linköping University, who gave me valuable tips about the project work and report writing,


and Oleg Sysoev for taking the time to review my work and for sharing his

knowledge about machine learning.

Last, but not least, I especially want to thank my beloved family: my mother

Juneida Sánchez, for your endless love, your blessings and your wishes for the

successful culmination of all my projects. My sister Marbella Covino, for

believing in me and for always being there for me. My boyfriend Karl Aronsson,

for all your love and support, for always encouraging me, for being next to me

during happy and difficult times, for always making me laugh, and for showing

me the positive side of every situation. My Swedish family, the Aronsson family,

because you have made me feel this is my second home, thank you for supporting

me and helping me ever since I decided to move to Sweden.


Table of contents

1 Introduction
  1.1 Scania
  1.2 Background
  1.3 Objective
2 Data
3 Methodology
  3.1 Methodology step by step
  3.2 Supervised learning methods
    3.2.1 Decision Trees
      3.2.1.1 CHAID
      3.2.1.2 CART
    3.2.2 Neural Networks
  3.3 Approximation efficiency
4 Results
  4.1 Decision Trees
  4.2 Neural Networks
  4.3 Scoring Process
5 Discussion and conclusions
6 Literature
7 Appendix


1 Introduction

1.1 Scania

Scania is one of the world’s leading manufacturers of trucks and buses for heavy

transport applications. The company operates in about 100 countries and employs

almost 33,000 people. Scania’s objective is to deliver optimized heavy trucks and

buses, engines and services, provide the best total operating economy for our

customers, and thereby be the leading company in the industry. Research and

development are concentrated in Södertälje, Sweden, and production units are

located in Europe and Latin America. This master thesis has been carried out at

RESD, the department responsible for diagnostic protocols. Software modules for

diagnostic communication between electrical control unit systems and external

tools are developed in this department, as well as off board systems for remotely

retrieving and analyzing diagnostic data. (Scania Inline, 2010)

1.2 Background

The electrical system in Scania vehicles is based on a number of control units that

communicate with each other via a common network based on serial

communication. Scania’s serial communication is based on the CAN protocol. The

principal features of a CAN bus system are control and interaction. At the heart of

Scania’s CAN bus is a central control unit (coordinator) through which all

functions are monitored and managed. From here, the truck’s electrical functions

are arranged in three circuits: red, yellow and green. Red functions cover all main

management units: engine, gearbox, brakes and suspension. Yellow covers

instruments, bodywork systems, locking and alarm systems, and lights. Green

covers comfort systems, such as climate control, audio and informatics.


Figure 1. Vehicle Applications of Controller Area Network (CAN).

All control units found in the Scania electrical system can be checked with a plug-

in diagnostic software (SDP3) used by Scania’s workshops, among other purposes,

to decipher and interpret the operational data. Data about the operation of the

vehicles stored in the control units is read with SDP3 and sent via the Internet to

Scania’s servers in Södertälje for analysis. Only authorized dealer workshops and

distributors have the necessary identities and access rights to collect, use, and

transfer operational data.


Figure 2. Operational data collection system.

A huge amount of operational data has been gathered and analyzed to understand vehicle usage, for example how the accelerator pedal and vehicle momentum are utilized in varying topography. The frequency and harshness of brake applications, the efficient use of the auxiliary brake system (Scania retarder and exhaust brake), and how well gear selection matches engine revolutions have also been evaluated from the data collected.

Figure 3. Histogram of operational data collected from 2006 to 2010.


A metric used in the company to describe the usage of Scania vehicles due to the

road conditions and the driving needs (starts and stops, accelerations, etc.), and

which from now on we will call K, is calculated by using data collected from one

of the control units currently installed in the electrical system of the new

generation Scania vehicles. However, this value cannot be estimated for those

vehicles that are not equipped with the required control unit. Hence, it is of interest

to build a predictive model to estimate the values of K by making use of the data

available for all vehicles.

Given the nature of the problem we have decided to carefully select a set of data

consisting of a group of variables which are believed to contain potentially

predictive relationships with the variable K. Afterwards, different algorithms can

be implemented to capture the patterns and relations found in the data, and

generalize to unseen situations in a reasonable way. These algorithms, also called

supervised learning methods, apply various mechanisms capable of inducing

knowledge from examples of data.

1.3 Objective

The aim of this thesis work is to build a predictive model, by making use of

supervised learning methods, which could accurately predict the outcome of a

metric that describes the usage of Scania vehicles due to the road conditions and

the driving needs.


2 Data

Our analysis is concentrated on the segment of long haulage vehicles, in which Scania has had a strong market presence for many years. The

selection of the data is based on an assortment of physical components in order to

obtain a group of vehicles that are mostly dedicated to this specific product

segment. The selected group represents 78% of the total operational data collected

in the company.

As illustrated in Figure 4, approximately 43% of these vehicles are not equipped

with the required control unit from which the necessary data for the calculation of

the metric K is collected. Thus, only 35% of the vehicles are selected for building

the predictive model and the remaining data will be used during the scoring

process. Some potential predictor variables are selected based on theoretical bases

and expert opinions. In addition, the corresponding values of K for this 35% of

data are calculated from a sequence of measurements made by specialists in

Scania, and they are based on a series of studies.

Figure 4. Pie chart of operational data.

Required control units (yes) - Long haulage: 35%
Required control units (yes) - Other purposes: 6%
Required control units (no) - Long haulage: 43%
Required control units (no) - Other purposes: 16%


Once the selected data had been extracted from the different databases, it was

finally integrated into one data set consisting of approximately 30,000

observations. After removing input variables that had low or no predictive power,

the input data set was represented by four variables.

The first variable corresponds to an 11×12 matrix called L. The second and third

variables represent two vectors of 10 positions each, called S and G. The variables

L, S and G implicitly contain information of the usage of Scania vehicles. The last

variable is called E and it corresponds to the different categories for one of the

vehicle components. All input variables excluding E are represented by continuous

values, whereas the variable E contains nominal values.

Afterwards, for simplicity and to reduce the data, we performed

transformations of the raw data to create new input variables. We calculated

averaged values of the vectors S and G. However, it was not possible to make this

estimation for the values in the matrix L due to the importance of the information

contained in each of its positions; every position in the matrix is crucial for the

pattern recognition process. Hence, the matrix was just reorganized into a feature

vector of 132 positions for possible handling of the variable by the predictive

methods. As there was no need to transform the output variable K, this variable

was used in its raw form.
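As an illustration of these transformations, the following is a minimal sketch with hypothetical variable names, assuming the raw records are available as NumPy arrays; it is illustrative only and is not the tooling used in the thesis.

```python
import numpy as np

# Hypothetical raw inputs for a single vehicle; shapes follow the text.
L_matrix = np.random.rand(11, 12)   # 11x12 operational-data matrix L
S_vector = np.random.rand(10)       # 10-position vector S
G_vector = np.random.rand(10)       # 10-position vector G

# Reorganize the matrix into a feature vector of 132 positions, keeping every cell.
L_features = L_matrix.reshape(-1)   # shape (132,)

# Reduce S and G to single averaged values.
S_mean = S_vector.mean()
G_mean = G_vector.mean()

# Numeric input representation for this vehicle; the nominal variable E would be
# handled separately (e.g. dummy coded) by the modeling tool.
x = np.concatenate([L_features, [S_mean, G_mean]])   # shape (134,)
```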

A quantitative analysis of the data set is given by the descriptive statistics of the

input and output variables, shown by the color maps in Figures 5-8, the Tables 1

and 2, and the histograms in Figures 9-11. They provide simple summaries of the data set being analyzed.


Figure 5. Minimum values for the input variable Matrix L.

Figure 6. Maximum values for the input variable Matrix L.


Figure 7. Mean values for the input variable Matrix L.

Figure 8. Median values for the input variable Matrix L.

Table 1. Descriptive statistics for the input variables S and G, and for the output variable K.

Variable   Min      Max      Mean     Q1       Q3       Median
S          1.527    82.218   54.538   48.826   62.869   57.240
G          2.000    86.774   35.238   29.023   40.250   35.278
K          17.090   79.310   34.564   30.750   37.460   33.560


Table 2. Counts for the input variable E.

Variable   Count     Variable   Count     Variable   Count
Total E    28883     E15        30        E31        52
E01        162       E16        1175      E32        8
E02        13        E17        22        E33        4
E03        627       E18        297       E34        1
E04        86        E19        1096      E35        7
E05        342       E20        480       E36        13
E06        1860      E21        1389      E37        1
E07        2382      E22        529       E38        96
E08        3470      E23        1129      E39        700
E09        20        E24        149       E40        47
E10        507       E25        146       E41        504
E11        2         E26        813       E42        5701
E12        391       E27        69        E43        42
E13        10        E28        32        E44        56
E14        616       E29        8         E45        3786
                     E30        12        E46        1

In addition, as a summary of the frequency of the continuous input and output

variables, the histogram plots in Figures 9 through 11 were also obtained:


Figure 9. Histogram of the input variable S.


Figure 10. Histogram of the input variable G.


Figure 11. Histogram of the output variable K.

Further information about how each of the chosen supervised learning methods

interprets and utilizes the selected variables when building the predictive models is

given in detail in the results chapter.


3 Methodology

A sequence of steps was followed during the development of the project in order

to successfully reach the main objective of this master thesis, to build a predictive

model, by making use of supervised learning methods, which could accurately

predict the outcome of a metric that describes the usage of Scania vehicles due to

the road conditions and the driving needs.

3.1 Methodology step by step:

1. First, we gathered the training set of data which needed to be characteristic

of the real-world use of the function to be learned. Thus, approximately

30,000 observations were collected, characterized by a set of input

variables which implicitly contained descriptive information of the usage of

the vehicles, and that were considered to have enough predictive power to

be able to estimate the values of the output variable K. The corresponding

values of K were also collected for each observation from a sequence of

measurements made by specialists in Scania.

2. Second, we determined the input feature representation of the function.

During this step, the input variables were reorganized or transformed into

suitable values for the predictive methods. Thus, matrices were reorganized

into vectors where all positions were kept, and vectors were transformed

into single averaged values. The number of features should not be too large,

because of the curse of dimensionality; but should be large enough to

accurately predict the output. The output variable was used in its raw form.

3. Third, we carried out graphical representations of the data, and analysis of

the descriptive statistics which were useful for detecting spurious


observations. Inconsistent records were eliminated, thus increasing the

quality of the data.

4. Subsequently, we selected two supervised learning methods which were

thought to be appropriate for the given problem and data at hand, decision

trees and neural networks.

5. The selected predictive methods required partitioning the dataset into

training and validation sets. The training set teaches the model, and the

validation set measures and assesses the model performance and reliability

for applying the model to future unseen data. The validation process avoids

the over-fitting problem by validating the model on a different set of data.

Our model data set was split into a training data set and a validation data

set, 70% and 30% respectively, in order to create a large enough validation

data set. A validation data set that is too small might lead to erroneous

conclusions when evaluating the reliability of the model.
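A minimal sketch of such a 70/30 partition, assuming the model data set is held in arrays X and y; scikit-learn is used here purely for illustration and is not the tool used in the thesis, and the data are placeholders.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.random((30000, 134))   # placeholder input matrix (one row per vehicle)
y = rng.random(30000)          # placeholder target values of K

# 70% of the observations train the model, 30% are held out for validation.
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.30, random_state=42)
```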

6. We completed the design by running the learning algorithms on the

gathered training set. Parameters of the algorithms were adjusted to

optimize the performance on a subset (validation set) of data. A manual

forward selection method was also implemented during this step for

selecting the combination of input variables that increased the predictive

power of the learning methods.
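The manual forward selection described above can be sketched programmatically as follows; this is illustrative only, with a decision tree as a stand-in learner, hypothetical candidate variable groups and column indices, and validation average squared error as the criterion.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error

def forward_select(candidates, X_train, y_train, X_valid, y_valid):
    """Greedily add the variable group that most reduces validation error."""
    selected, best_err = [], np.inf
    improved = True
    while improved:
        improved = False
        for name, cols in candidates.items():
            if name in selected:
                continue
            trial_cols = sum((candidates[s] for s in selected), []) + cols
            model = DecisionTreeRegressor(max_depth=6, random_state=0)
            model.fit(X_train[:, trial_cols], y_train)
            err = mean_squared_error(y_valid, model.predict(X_valid[:, trial_cols]))
            if err < best_err:
                best_err, best_name, improved = err, name, True
        if improved:
            selected.append(best_name)
    return selected, best_err

# Example call: candidates maps variable names to their column positions in X,
# e.g. {"L": list(range(132)), "S": [132], "G": [133]}.
```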

7. Finally, we assessed the performance of the chosen learning algorithms

based on the produced average squared error, and compared the efficiency

of the predictions obtained based on the Nash-Sutcliffe efficiency measure.

The best predictive method was selected and applied to a new set of data in

order to compute the corresponding values of K.


3.2 Supervised learning methods:

In a typical scenario for supervised learning methods, we have an outcome

measurement, usually quantitative or categorical, that we wish to predict based on

a set of features. We also have a training set of data, in which we observe the

outcome and feature measurements for a set of objects. Using this data we build a

prediction model, or learner, which will enable us to predict the outcome for new

unseen objects. A good learner is one that accurately predicts such an outcome.

A supervised learning method is a machine learning technique for deducing a function from training data. The function fitting paradigm from a machine learning point of view is as follows. Suppose for simplicity that the errors are additive and that the model Y = f(X) + \varepsilon is a reasonable assumption. Supervised learning attempts to learn by example through a teacher. One observes the system under study, both the inputs and the outputs, and assembles a training set of observations (x_i, y_i), i = 1, \ldots, N. The observed input values x_i are also fed into an artificial system, known as a learning algorithm, which produces outputs \hat{f}(x_i) in response to the inputs. The learning algorithm has the property that it can modify its input/output relationship \hat{f} in response to the differences y_i - \hat{f}(x_i) between the original and generated outputs. This process is known as learning by example. Upon completion of the learning process the hope is that the artificial and real outputs will be close enough to be useful for all sets of inputs likely to be encountered in practice. (Hastie et al., 2001)

3.2.1 Decision Trees:

Decision trees belong to a class of data mining techniques that break up a

collection of heterogeneous records into smaller groups of more homogeneous


records using directed knowledge discovery. Directed knowledge discovery is goal-oriented: it explains the target field in terms of the rest of the input fields to find meaningful patterns in order to predict future events using a chain of decision rules. In this way, decision trees provide accurate and explanatory models, since the decision tree model is able to explain the reason for certain decisions using these decision rules. Decision trees can be used in classification problems and also in estimation problems where the output is a continuous value; in the latter case the tree is called a regression tree. (Abdullah, 2010)

For a tree to be useful, the data in the leaves (the final groups or unsplit nodes)

must be similar with respect to some target measure, so that the tree represents the

segregation of a mixture of data into purified groups. (Neville, 1999). The general

form of this modeling approach is illustrated in Figure 12.

Decision trees attempt to find a strong relationship between input values and target

values in a group of observations that form a data set. When a set of input values is

identified as having a strong relationship with a target value, then all of these

values are grouped in a bin that becomes a branch on the decision tree. These

groupings are determined by the observed form of the relationship between the bin

values and the target. Binning involves taking each input, determining how the

values in the input are related to the target, and, based on the input-target

relationship, depositing inputs with similar values into bins that are formed by the

relationship. A strong input-target relationship is formed when knowledge of the

value of an input improves the ability to predict the value of the target. (De Ville,

2006)
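A minimal sketch of this idea for a single interval input and an interval target: the candidate threshold that most reduces the within-bin sum of squared errors defines the two bins. This is the generic variance-reduction view, not the exact algorithm of any particular software.

```python
import numpy as np

def best_threshold_split(x, y):
    """Return (SSE, threshold) for the cut on x that minimizes the within-bin SSE of y."""
    order = np.argsort(x)
    x_sorted, y_sorted = x[order], y[order]
    best_sse, best_threshold = np.inf, None
    for i in range(1, len(x_sorted)):
        if x_sorted[i] == x_sorted[i - 1]:
            continue  # no valid cut point between equal values
        left, right = y_sorted[:i], y_sorted[i:]
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if sse < best_sse:
            best_sse = sse
            best_threshold = (x_sorted[i - 1] + x_sorted[i]) / 2.0
    return best_sse, best_threshold
```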


Figure 12. Illustration of a decision tree.

Decision trees have many useful features, both in the traditional fields of science

and engineering and in a range of applied areas, including business intelligence

and data mining. These useful features include (De Ville, 2006):

Decision trees produce results that communicate very well in symbolic and

visual terms. Decision trees are easy to produce, easy to understand, and

easy to use. One valuable feature is the ability to incorporate multiple

predictors in a simple, step-by-step fashion. The ability to incrementally

build highly complex rule sets (which are built on simple, single association

rules) is both simple and powerful.


Decision trees readily incorporate various levels of measurement, (nominal,

ordinal, and interval), regardless of whether it serves as the target or as an

input.

Decision trees readily adapt to various twists and turns in data (unbalanced

effects, nested effects, offsetting effects, interactions and nonlinearities)

that frequently defeat other one-way and multi-way statistical and numeric

approaches.

Trees require little data preparation and perform well with large data in a

short time.

Decision trees are nonparametric and highly robust (for example, they

readily accommodate the incorporation of missing values) and produce

similar effects regardless of the level of measurement of the fields that are

used to construct decision tree branches (for example, a decision tree of

income distribution will reveal similar results regardless of whether income

is measured in thousands, in tens of thousands, or even as a discrete range

of values from 1 to 5).

Trees also have their shortcomings (Neville, 1999):

When the data contain no simple relationship between the inputs and the target, a small tree is too simplistic. Even when a simple description is accurate, the description may not be the only accurate one.

A tree gives an impression that certain inputs uniquely explain the

variations in the target. A completely different set of inputs might give a

different explanation that is just as good.

Trees may deceive; they may fit the data well but then predict new data

worse than having no model at all. This is called over-fitting. They may fit

the data well, predict well, and convey a good story, but then, if some of the

original data are replaced with a fresh sample and a new tree is created, a


completely different tree may emerge using completely different inputs in

the splitting rules and consequently conveying a completely different story.

Specific decision tree methods include the CHAID (Chi-squared Automatic Interaction Detection) and CART (Classification and Regression Trees)

algorithms. The following discussion provides a brief description of these

algorithms for building decision trees.

3.2.1.1 CHAID

CHAID is an acronym for “Chi-Squared Automatic Interaction Detection”. This

algorithm accepts either nominal or ordinal inputs; however, some software packages, such as SAS Business Analytics and Business Intelligence software, accept

interval inputs and automatically group the values into ranges before growing the

tree.

The splitting criterion is based on P-values from the F-distribution (interval

targets) or Chi-squared distribution (nominal targets). The P-values are adjusted to

accommodate multiple testing.

Missing values are treated as separate values. For nominal inputs, a missing value

constitutes a new category. For ordinal inputs, a missing value is free of any order

restrictions.

The search for a split on an input proceeds stepwise. Initially, a branch is allocated

for each value of the input. Branches are alternately merged and re-split as seems

warranted by the P-values. The algorithm stops when no merge or re-splitting

operation creates an adequate P-value. The final split is adopted. A common

alternative, sometimes called the exhaustive method, continues merging to a


binary split and then adopts the split with the most favorable P-value among all

splits the algorithm considered.

The tests of significance are used to select whether inputs are significant

descriptors of target values and, if so, what their strengths are relative to other

inputs. Thus, after a split is adopted for an input, its P-value is adjusted, and the

input with the best adjusted P-value is selected as the splitting variable.

If the adjusted P-value is smaller than a specified threshold, then the node is split.

Tree construction ends when all the adjusted P-values of the splitting variables in

the unsplit nodes are above the user-specified threshold. (SAS Enterprise Miner

Tutorial, 2010)
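As a rough illustration of this splitting criterion for an interval target, the sketch below scores a candidate grouping of an input's values by the p-value of a one-way F test and applies a simple Bonferroni-style adjustment; the actual CHAID merge/re-split bookkeeping in SAS Enterprise Miner is more involved and is not reproduced here.

```python
from scipy.stats import f_oneway

def split_p_value(branch_targets):
    """P-value of a one-way F test comparing target values across candidate branches."""
    return f_oneway(*branch_targets).pvalue

def bonferroni_adjust(p, n_candidate_splits):
    """Crude multiple-testing adjustment of the kind applied to split p-values."""
    return min(1.0, p * n_candidate_splits)

# Example: values of the interval target K falling into three candidate branches.
branches = [[31.2, 33.5, 30.8], [36.1, 38.4, 35.9], [44.0, 42.7, 45.3]]
adjusted = bonferroni_adjust(split_p_value(branches), n_candidate_splits=3)
print(adjusted)  # the split is adopted only if this is below the chosen threshold
```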

3.2.1.2 CART

The following is a description of the Breiman, Friedman, Olshen, and Stone

Classification and Regression Trees method for building decision trees. More

detailed information can be found in the following text: Breiman, L., Friedman, J. H., Olshen, R. A., and Stone, C. J. (1984), Classification and Regression Trees, Pacific Grove: Wadsworth.

For this method, the inputs are either nominal or interval. Ordinal inputs are

treated as interval. CART trees employ a binary splitting methodology, which

produces binary decision trees. They do not embrace the kind of merge-and-split

heuristic developed in the CHAID algorithm to grow multi-way splits, so multi-

way splits are not included in this approach. Classification and Regression Trees

do not use the statistical hypothesis testing approach proposed in the CHAID

algorithm, and they rely on the empirical properties of a validation or resample

data set to guard against overfit. (De Ville, 2006)


The full methodology for growing and pruning branches in CART trees includes

the following (De Ville, 2006; SAS Enterprise Miner Tutorial, 2010):

For a continuous response field, both least squares and least absolute

deviation measures can be employed. Deviations between training and test

measures can be used to assess when the error rate has reached a point to

justify pruning the sub tree below the error-calculation point.

For a categorical-dependent response field, it is possible to use either the

Gini diversity measure or Twoing criteria.

Ordered Twoing is a criterion for splitting ordinal target fields.

Calculating misclassification costs of smaller decision trees is possible.

Selecting the decision tree with the lowest or near-lowest cost is an option.

Costs can be adjusted.

Picking the smallest decision tree within one standard error of the lowest

cost decision tree is an option.

In addition to a validated decision tree structure, CART trees also:

- work with both continuous and categorical response variables.
- omit observations with a missing value in the splitting variable when creating a split.
- create surrogate splits and use them to assign observations to branches when the primary splitting variable is missing. If missing values prevent the use of the primary and surrogate splitting rules, then the observation is assigned to the largest branch (based on the within-node training sample).
- grow a larger-than-optimal decision tree and then prune it to a final decision tree using a variety of pruning rules.
- consider misclassification costs in the desirability of a split.
- use cost-complexity rules in the desirability of a split.
- split on linear and multiple linear combinations.
- do subsampling with large data sets.
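A hedged sketch of this CART-style procedure using scikit-learn's DecisionTreeRegressor, which also grows binary splits and supports cost-complexity pruning through ccp_alpha; surrogate splits and some other CART details are not available in that library, and the data here are placeholders.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X_train, y_train = rng.random((700, 10)), rng.random(700)   # placeholder data
X_valid, y_valid = rng.random((300, 10)), rng.random(300)

# Grow a deliberately large binary regression tree, then prune it back by
# cost-complexity, keeping the pruned tree with the lowest validation error.
path = DecisionTreeRegressor(random_state=0).cost_complexity_pruning_path(X_train, y_train)

best_tree, best_ase = None, float("inf")
for alpha in path.ccp_alphas:
    candidate = DecisionTreeRegressor(ccp_alpha=alpha, random_state=0).fit(X_train, y_train)
    ase = ((candidate.predict(X_valid) - y_valid) ** 2).mean()   # average squared error
    if ase < best_ase:
        best_tree, best_ase = candidate, ase
```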

3.2.2 Neural Networks

Neural networks (NNs) form a joint framework for regression and classification

that has become widely used during the past decades, traditionally associated with

machine learning and data mining. Because of their ability to approximate virtually any function, NNs are sometimes called universal approximators (Hornik et al., 1989).

The study of artificial neural networks is motivated by their similarity to successfully working biological systems, which, in contrast to the system as a whole, consist of very simple but numerous nerve cells that work massively in parallel and have the capability to learn. There is no need to explicitly program a neural network; instead, it can learn from training examples. One result from this

learning procedure is the capability of neural networks to generalize and associate

data. After successful training, a neural network can find reasonable solutions for

similar problems of the same class that were not explicitly trained.

A technical neural network consists of simple processing units or neurons which

are connected by directed, weighted connections. Data are transferred between

neurons via connections with the connecting weight being either excitatory or

inhibitory.

A propagation function converts vector inputs to scalar network inputs. For a

neuron the propagation function receives the outputs of other neurons and

transforms them in consideration of the connecting weights into a network input

net, that can be used by the activation function.


The activation function is the “switching status” of a neuron. Based on the model

of nature every neuron is always active to a certain extent. The reactions of the

neurons to the input values depend on this activation state. Neurons get activated,

if the network input exceeds their threshold value. The threshold value is explicitly

assigned to the neurons and marks the position of the maximum gradient value of

the activation function. When centered on the threshold value, the activation

function of a neuron reacts particularly sensitive. The activation of a neuron

depends on the prior activation state of the neuron and the external input.

Finally, an output function may be used to process the activation once again. The

output function of a neuron calculates the values which are transferred to the other

connected neurons. The learning strategy is an algorithm that can be used to

change the neural network and thus such a network can be trained to produce a

desired output for a given input. An error is composed from the difference

between the desired response and the system output. This error information is fed

back to the system and adjusts the system parameters in a systematic fashion. The

process is repeated until the performance is acceptable. It is clear from this

description that the performance hinges heavily on the data. If one does not have

data that cover a significant portion of the operating conditions then neural

network technology is probably not the right solution. (Kriesel, 2005)

The term neural network has evolved to encompass a large class of models and

learning methods. Here we describe the most commonly used neural net, a

feedforward multilayer perceptron (MLP) neural network model with one hidden

layer. A more general description and analysis of the neural network framework

can be found in Bishop (1995).

This neural network is a two-stage regression or classification model typically

represented by a network diagram like the one shown in Figure 13.


Figure 13. Schematic of a single hidden layer, feed-forward neural network.

For regression, there is only one output unit Y_1; however, these networks can handle multiple responses in a seamless fashion. Derived features Z_m are created from linear combinations of the inputs X, and then the target Y_k is modeled as a function of linear combinations of the Z_m:

Z_m = \sigma(\alpha_{0m} + \alpha_m^T X),   m = 1, \ldots, M,

T_k = \beta_{0k} + \beta_k^T Z,   k = 1, \ldots, K,   (1)

f_k(X) = g_k(T),   k = 1, \ldots, K,

where Z = (Z_1, Z_2, \ldots, Z_M) and T = (T_1, T_2, \ldots, T_K).

The activation function \sigma(v) is usually chosen to be the sigmoid \sigma(v) = 1/(1 + e^{-v}). Sometimes a Gaussian radial basis function (Hastie et al., 2001) is used for the \sigma(v), producing what is known as a radial basis function network.


Neural network diagrams like the one in Figure 13 are sometimes drawn with an additional bias unit feeding into every unit in the hidden and output layers. Thinking of the constant “1” as an additional input feature, this bias unit captures the intercepts \alpha_{0m} and \beta_{0k} in model (1).

The output function g_k(T) allows a final transformation of the vector of outputs T. For regression we typically choose the identity function g_k(T) = T_k. Early work in classification also used the identity function, but this was later abandoned in favor of the softmax function g_k(T) = e^{T_k} / \sum_{\ell=1}^{K} e^{T_\ell}. This is of course exactly the transformation used in a multilogit model, and produces positive estimates that sum to one.

The units in the middle of the network, computing the derived features Z_m, are called hidden units because the values Z_m are not directly observed. In general there can be more than one hidden layer. We can think of the Z_m as a basis expansion of the original inputs X; the neural network is then a standard linear model, or a linear multilogit model, using these transformations as inputs. (Hastie et al., 2001)

The network shown in Figure 13 belongs to the class of feed-forward networks, in which the connections go from one layer to its successor only; there are no feedbacks. The fitting of the neural network model is done by searching for the weights that minimize the error function, which for regression often takes the form of a (possibly weighted) sum of squared errors, R(\theta) = \sum_{i=1}^{N} (y_i - f(x_i))^2.
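A compact sketch of this forward pass and error function for the single-output regression case, using NumPy only and randomly initialized weights; it is purely illustrative and is not the SAS Enterprise Miner implementation. Training would then adjust the weights (for example by gradient-based back-propagation) to minimize this error.

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def mlp_forward(X, alpha0, alpha, beta0, beta):
    """Single-hidden-layer MLP for regression: Z = sigmoid(alpha0 + X A), f = beta0 + Z b."""
    Z = sigmoid(alpha0 + X @ alpha)   # derived features (hidden units)
    return beta0 + Z @ beta           # identity output function for regression

def sum_of_squared_errors(y, y_hat):
    return float(((y - y_hat) ** 2).sum())

rng = np.random.default_rng(0)
X = rng.random((5, 134))                                        # 5 observations, 134 inputs
alpha0, alpha = rng.normal(size=4), rng.normal(size=(134, 4))   # M = 4 hidden units
beta0, beta = rng.normal(), rng.normal(size=4)
print(sum_of_squared_errors(rng.random(5), mlp_forward(X, alpha0, alpha, beta0, beta)))
```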

The practical use of neural networks has clear advantages but also some

limitations.


Advantages:

- NNs involve human-like thinking.
- There is no need to assume an underlying probability distribution, as is usually done in statistical modeling.
- They handle noisy or missing data.
- They can work with a large number of variables or parameters.
- They create their own relationships amongst information.
- NNs are applicable to multivariate non-linear problems. A neural network can perform tasks that a linear program cannot.
- When an element of the neural network fails, the network can continue without any problem owing to its parallel nature.
- NNs learn and do not need to be reprogrammed.
- They provide general solutions with good predictive accuracy.

Disadvantages:

- Large NNs require high processing time.
- The individual relations between the input variables and the output variables are not developed by engineering judgment; thus NN models tend to be black boxes or input/output tables without an analytical basis.

3.3 Approximation efficiency:

The efficiency of the predictions obtained by the different supervised learning

methods can be quantified in many different ways. We have decided to use the

Nash-Sutcliffe efficiency measure. The efficiency E proposed by Nash and

Sutcliffe (1970) is defined as one minus the sum of the absolute squared

differences between the predicted and observed values normalized by the variance


of the observed values during the period under investigation. E is calculated as (Krause et al., 2005):

E = 1 - \frac{\sum_{i=1}^{N} (O_i - P_i)^2}{\sum_{i=1}^{N} (O_i - \bar{O})^2},

where O_i are the observed values, P_i the predicted values, and \bar{O} the mean of the observed values.

This measure can take values from minus infinity to one, and it is close to one if

the prediction errors are small.
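A direct translation of this definition into code (a small sketch; observed and predicted values are assumed to be one-dimensional numeric arrays):

```python
import numpy as np

def nash_sutcliffe(observed, predicted):
    """Nash-Sutcliffe efficiency: 1 - SSE / total variation of the observations."""
    observed, predicted = np.asarray(observed, float), np.asarray(predicted, float)
    numerator = ((observed - predicted) ** 2).sum()
    denominator = ((observed - observed.mean()) ** 2).sum()
    return 1.0 - numerator / denominator

# A perfect prediction gives E = 1; always predicting the observed mean gives E = 0.
print(nash_sutcliffe([1.0, 2.0, 3.0], [1.1, 1.9, 3.2]))
```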


4 Results

4.1 Decision Trees:

Different techniques do better with different data, but trees should compete along with other methods. We decided to approximate the CHAID and CART regression trees by making use of the Tree node in SAS Enterprise Miner.

The SAS Enterprise Miner provides a visual programming environment for

predictive modeling. SAS algorithms incorporate and extend most of the good

ideas of the tree methods discussed in the methodology chapter.

The two chosen tree methods were run, producing a series of trees based on selected parameters. A number of common tree parameters were set

to specific values to support appropriate assessment efforts. The remaining

parameters were set according to the different algorithms performed in each tree

node. Details about the parameters setting can be found in Appendix A.

We first performed the CHAID tree method by building trees of different depths,

varying from 6 to 15. Given that the target is a continuous value we used the

average squared error as the assessment measure. The results obtained are shown

in Table 3:


Table 3. Depth and average squared error - CHAID tree.

Depth   ASE Training   ASE Validation
6       11.40          12.75
7       11.40          12.75
8       11.40          12.75
9       11.40          12.75
10      11.40          12.75
11      11.40          12.75
12      11.40          12.75
13      11.40          12.75
14      11.40          12.75
15      11.40          12.75

These results confirm that the predictive power of the tree will not be improved by building a more complex model. Thus, the depth of the CHAID tree was set to 6.
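For reference, a depth sweep of this kind can be sketched outside SAS as follows, with placeholder data; the SAS tree node's CHAID/CART settings are not replicated here, so the numbers in Tables 3 and 4 would not be reproduced.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(1)
X_train, y_train = rng.random((700, 10)), rng.random(700)   # placeholder data
X_valid, y_valid = rng.random((300, 10)), rng.random(300)

for depth in range(6, 16):
    tree = DecisionTreeRegressor(max_depth=depth, random_state=0).fit(X_train, y_train)
    ase_train = ((tree.predict(X_train) - y_train) ** 2).mean()
    ase_valid = ((tree.predict(X_valid) - y_valid) ** 2).mean()
    print(f"depth={depth:2d}  ASE training={ase_train:.2f}  ASE validation={ase_valid:.2f}")
```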

In the same fashion, when varying the values for the depth of the tree in the CART

model we obtained the results shown in Table 4:

Table 4. Depth and average squared error - CART tree.

Depth   ASE Training   ASE Validation
6       12.00          13.17
7       10.74          12.37
8       10.00          11.75
9       9.65           11.55
10      9.48           11.43
11      9.38           11.37
12      9.34           11.36
13      9.33           11.36
14      9.34           11.36
15      9.34           11.36


Figure 14 shows a plot of average squared error vs. depth for the CART model. There, we observe that the training and validation error curves keep decreasing as the depth of the tree increases; however, after a depth of 10 the reduction in the average squared error is not significant. Therefore, 10 was chosen as an appropriate value for the depth of the CART tree:

Figure 14. Average squared error vs. Depth - CART tree.

To verify that the performance of the selected trees was acceptable, we carefully

analyzed the results of the tree nodes where we could find a number of diagnostic

tools. First, we reviewed the assessment plots which show tree evaluation

information; trees are evaluated using the number of cases that are correctly

predicted. For each tree size, a tree that correctly predicts the most training cases is

selected to represent that size. The selected tree is then evaluated again with

validation cases.

The assessment plots in Figures 15 and 16 display lines of the modeling

assessment statistic between the training and validation data sets across the

number of leaves that are created. The plots allow us to evaluate the accuracy of


the decision tree models by viewing the change in the average squared error as the trees grow, based on the number of leaves in the design.

Figure 15. Assessment plot - CHAID tree. The selected tree contains 253 leaves.

Figure 16. Assessment plot - CART tree. The selected tree contains 134 leaves.

The following situations can be identified in these plots:


- The lines for the training and validation data keep improving as the number of leaves increases.
- In Figure 15, the validation data confirms the progress of the training data until the number of leaves is around 100. After this point, the line for the validation data starts to flatten out and move apart from the training data line.
- A similar situation is encountered in Figure 16, where the validation data confirms the progress of the training data until the number of leaves is around 30.

If any of the decision tree models were selected as the best and final model, these

plots would help us evaluate smaller trees that still perform well in terms of the

assessment measure but are less complex and more reliable, and therefore they

might be more appropriate for the prediction process.

The trees chosen by SAS Enterprise Miner as the best trees to use were the one

with 253 leaves for the CHAID model and the one with 134 for the CART model.

These trees were selected because they optimized the assessment value on the

training data set.

The average number of observations assigned in each leaf was around 80 and 150

for the CHAID and CART tree respectively, which represents 0.28 and 0.52

percent of the total number of cases in the model set. The appropriate value of

observations in a leaf to avoid overfitting or underfitting the training data set

depends on the context, i.e., the size of the training data set; however, as a rule-of-

thumb, an appropriate value would be between 0.25 and 1 percent of the model

set. (Berry and Linoff, 1999)

Subsequently, we constructed the color maps shown in Figures 17 and 18 in order

to illustrate the importance each of the input variables had when building the


decision trees. The higher the importance measure the better the variable

approximates the target values, and therefore variables with high importance

represent strong splits.

Figure 17. Variable importance - CHAID tree.

Figure 18. Variable importance - CART tree.


Afterwards, we analyzed graphical diagnostics of the model fit. Figures 19 and 20 are scatter plots of observed vs. predicted values for the validation sets of the CHAID and CART models, respectively.

Figure 19. Scatter plot of predicted vs. observed values - CHAID tree.

Figure 20. Scatter plot of predicted vs. observed values - CART tree.


From the two plots, it is easy to observe a large discrepancy between the observed and predicted values. Points lie far away from the 45-degree reference line that passes through the origin, indicating low predictive accuracy.

Residual plots of the tree models were also evaluated. Residuals are helpful in

evaluating the adequacy of the model itself relative to the data and any assumption

made in the analysis. If the model fits the data well, and the typical assumption of

independent normally distributed residuals is also made, the plots of the residuals

versus predicted values should not show any patterns or trends, i.e., they should be

a random scatter of points.

The plots of residuals in Figures 21 and 22 show a slightly increasing variation of

the residuals as the predicted values increase, which may suggest that the

assumption of equal variance of the residuals is not valid for this data.

Nevertheless, it is hard to confirm this assumption and it would be more natural to

consider the plots of residuals within the limits one may expect when building a

complex predictive model.

Figure 21. Scatter plot of residuals vs. predicted values - CHAID tree.


Figure 22. Scatter plot of residuals vs. predicted values - CART tree.

In addition, the histograms shown in Figures 23 and 24 provide a view of the overall distribution of the residuals. The plots appear to be bell-shaped; however, the pattern found in the plots of residuals vs. predicted values is also revealed in these histograms, which show tails too long for the residuals to be considered approximately normal.


Figure 23. Histogram of residuals - CHAID tree.



Figure 24. Histogram of residuals - CART tree.

Finally, we calculated the Nash-Sutcliffe efficiency measure for the validation sets of the CHAID and CART models to evaluate the performance of these trees. The values obtained (0.61 for the CART tree and 0.65 for the CHAID tree) are far from 1, indicating a poor fit.

4.2 Neural Networks:

The Neural Network node in SAS Enterprise Miner enables us to fit nonlinear

models such as a multilayer perceptron (MLP). NNs are flexible prediction models

that, when carefully tuned, often provide optimal performance in regression and

classification problems. There is no theory that tells us how to set the parameters


of the network to approximate any given function. It will generally be impossible

to determine the correct design without training numerous networks and

estimating the generalization error for each model. The design process and the

training process are both iterative.

We made use of the advanced user interface provided by the Neural Network node to create an MLP model. Figure 25 displays the constructed network.

Figure 25. Schematic representation of the MLP neural

network model built in SAS Enterprise Miner.

The layer on the left represents the input layer, and it consists of all interval and nominal inputs. The middle layer is the hidden layer, in which the number of hidden units (neurons) was varied from 1 to 40; 4 was selected as the optimal value based on the results shown in Table 5 and Figure 26. The layer on the right is the output layer, which corresponds to the target variable K. The propagation, activation and output functions were selected based on the default configuration specified in the methodology chapter.


Table 5. Average squared error of a feedforward MLP neural network model with one hidden layer.

Neurons    ASE Training    ASE Validation
 1         5.49            6.39
 2         4.12            5.30
 3         3.70            4.93
 4         3.57            4.25
 5         3.38            4.73
 6         3.25            4.23
 7         3.07            4.34
 8         3.19            4.39
 9         3.31            4.39
10         2.73            4.40
11         3.43            4.57
12         2.95            4.41
13         3.20            4.40
14         2.97            4.64
15         2.74            4.05
16         2.70            4.22
17         2.65            4.34
18         2.78            4.30
19         2.78            3.94
20         2.51            4.20
21         2.24            4.09
22         2.72            4.11
23         2.75            4.37
24         2.49            3.78
25         2.39            4.12
26         2.24            3.91
27         2.49            3.87
28         2.24            4.05
29         2.32            3.94
30         2.27            4.03
31         2.14            4.00
32         2.33            3.94
33         2.23            4.07
34         2.30            3.97
35         2.05            4.06
36         1.98            3.92
37         1.92            4.09
38         1.96            4.04
39         2.20            3.95
40         2.00            4.10

Figure 26. Average squared error vs. number of neurons.


The number of hidden neurons affects how well the network is able to predict the output variable. A large number of hidden neurons ensures that the network learns and predicts the data it has been trained on, but its performance on new data may be compromised. On the other hand, with too few hidden neurons the network may be unable to learn the relationships in the data. Thus, selecting the number of hidden neurons is crucial. The trial and error approach used to select an appropriate number of hidden neurons started with a small number of neurons and gradually increased it whenever the network failed to reduce the error.
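The search itself was carried out in SAS Enterprise Miner. A rough Python analogue of the same trial-and-error loop, assuming the data has already been split into hypothetical arrays X_train, y_train, X_valid and y_valid, and ignoring the specific propagation, activation and output functions used in the thesis, could look as follows.

# Rough analogue of the hidden-neuron search; not the SAS Enterprise Miner run.
from sklearn.metrics import mean_squared_error
from sklearn.neural_network import MLPRegressor

for n_hidden in range(1, 41):                        # 1 to 40 hidden neurons
    mlp = MLPRegressor(hidden_layer_sizes=(n_hidden,),
                       max_iter=100,                 # mirrors the 100 training iterations
                       random_state=0)
    mlp.fit(X_train, y_train)
    ase_train = mean_squared_error(y_train, mlp.predict(X_train))
    ase_valid = mean_squared_error(y_valid, mlp.predict(X_valid))
    # Inspect the results and keep a small network after which the validation
    # error stops improving noticeably (the thesis settled on 4 neurons).
    print(n_hidden, round(ase_train, 2), round(ase_valid, 2))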

Although one hidden layer is always sufficient provided we have enough data, there are situations where a network with two or more hidden layers may require fewer hidden units and weights than a network with one hidden layer, so using extra hidden layers can sometimes improve generalization. We built a second model with two hidden layers, with a total of two to four neurons distributed differently across the layers in each run. Estimates of the average squared error for each network are displayed in Table 6, and the results reveal that better predictions are not obtained by adding an extra hidden layer.

Table 6. Average squared error of a feedforward MLP neural network model with two hidden layers.

Neurons First Layer    Neurons Second Layer    ASE Training    ASE Validation
1                      1                       5.64            6.26
1                      2                       5.01            5.62
1                      3                       5.55            6.33
2                      1                       9.29            10.05
3                      1                       5.26            5.69
2                      2                       5.93            6.80


The plot shown in Figure 27 displays the average squared error for each iteration

of the training and validation sets of the MLP model with one hidden layer and

four neurons.

Figure 27. Assessment plot - MLP (1 hidden layer and 4 neurons).

The error for both the training and validation data continues to decrease as the number of iterations increases. By default, the node completed 100 iterations, and we could have continued the training process. However, given that the reduction in the average squared error had become less and less significant by the hundredth iteration, we decided to evaluate the default model.

Color maps of the weights were constructed and are displayed in Figures 28-33. Each input has its own relative weight, which determines the impact that input has during the training process. The weights determine the intensity of the input signals as registered by the neurons. Some input variables are considered more important than others, and the color maps illustrate the effect that each input has on the network.
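The color maps themselves come from SAS Enterprise Miner. As an illustration of the idea, a fitted one-hidden-layer network exposes an input-to-hidden weight matrix that can be shown as a heat map; the sketch below assumes a fitted scikit-learn MLPRegressor called mlp and a list feature_names of the input variables, both hypothetical.

import matplotlib.pyplot as plt

# Input-to-hidden weight matrix of a one-hidden-layer MLP, one column per hidden neuron.
input_to_hidden = mlp.coefs_[0]                  # shape: (n_inputs, n_hidden_neurons)

plt.figure(figsize=(6, 10))
plt.imshow(input_to_hidden, aspect="auto", cmap="coolwarm")
plt.colorbar(label="weight value")
plt.xlabel("Hidden neuron")
plt.ylabel("Input variable")
plt.yticks(range(len(feature_names)), feature_names)
plt.title("Input-to-hidden weights")
plt.show()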


Figure 28. Weight 1 - Variable L.

Figure 29. Weight 2 - Variable L.


Figure 30. Weight 3 - Variable L.

Figure 31. Weight 4 - Variable L.


Figure 32. Weights - Variable E.

Figure 33. Weights - Variables G and S.

Subsequently, a scatter plot of predicted vs. observed values was obtained; it is shown in Figure 34. This plot reveals that the MLP neural network model with one hidden layer and four neurons produced better predictive results for the output variable K than the CHAID and CART tree models. Observed and predicted values are very close to each other, which is expected from an accurate model. The observations lie close to the 45 degree reference line that passes through the origin, showing a high correlation between the observed and predicted values.


However, the closer we get to the minimum and especially the maximum values of the data, the more dispersed the points tend to be, indicating that predictions of those values are less accurate. These points, lying far from the diagonal line, represent cases for which only a few observations are available.

Figure 34. Scatter plot of predicted vs. observed values.

Additionally, in Figure 35 we can observe that even though the residuals are fairly evenly scattered around zero, there is a slight but discernible tendency for the residuals to increase as the predicted values increase. This indicates that the model performs less well when predicting high observed values.

Figure 35. Scatter plot of residuals vs. predicted values.


Once again, the histogram of the residuals shown in Figure 36 appears to follow a normal distribution pattern; however, the tails are too long. When building complex predictive models, such as trees or neural networks, it is acceptable to obtain residuals that behave as the ones in this figure.
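This check can be sketched in Python as well (the figures in the thesis were produced in SAS); assuming the validation residuals are available as a NumPy array, the snippet overlays a normal density with matched mean and standard deviation on the residual histogram.

import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm

# residuals = y_obs - y_pred for the validation set (assumed to be available)
mu, sigma = residuals.mean(), residuals.std()

plt.hist(residuals, bins=40, density=True, alpha=0.6)
x = np.linspace(residuals.min(), residuals.max(), 200)
plt.plot(x, norm.pdf(x, mu, sigma), color="red")    # normal curve with matched mean and std
plt.xlabel("Residual")
plt.ylabel("Density")
plt.show()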


Figure 36. Histogram of residuals.

Finally, we calculated the Nash-Sutcliffe efficiency measure for the validation set to evaluate the performance of the selected neural network. This time, the value obtained, 0.86, is much closer to 1, indicating a reasonably good fit.


4.3 Scoring Process:

The final and most important step during the process of building a predictive

model is the generalization or scoring process, i.e., how well the model makes

predictions for cases that were not available at the time of training and that do not

contain a target value.

The Score node in SAS Enterprise Miner generates and manages scoring code that

is produced by the tree or neural network nodes. The code is encapsulated and can

be used in most SAS environments even without the presence of Enterprise Miner.
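Outside SAS, the same idea, persisting the fitted model and applying it to cases without a target, can be sketched as follows; the model object, file names and feature list are hypothetical.

import joblib
import pandas as pd

# Persist the fitted model once (mlp and feature_names are assumed from training).
joblib.dump(mlp, "mlp_k_model.joblib")

# Later, possibly in a different environment, score cases that lack the target K.
scorer = joblib.load("mlp_k_model.joblib")
new_cases = pd.read_csv("unscored_vehicles.csv")         # hypothetical unscored data
new_cases["predicted_K"] = scorer.predict(new_cases[feature_names])
new_cases.to_csv("scored_vehicles.csv", index=False)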

After scoring the 43% of the collected data for which the value of the variable K could not be calculated, we produced the overlaid histograms of observed and predicted values shown in Figure 37. The distribution of the predicted values is very similar to the distribution of the observed values, which indicates that we have obtained reliable predictive results.


Figure 37. Histograms of observed and predicted values for the variable K.


5 Discussion and conclusions

Throughout this thesis work, two well known supervised learning methods, regression trees and neural networks, were applied in order to build a predictive model that could accurately predict the output of the metric K. This metric describes the usage of Scania vehicles according to road conditions and driving needs, and it is used in the company during the process of developing vehicle components.

The first major problem encountered when selecting an appropriate predictive method was the high dimensionality of the input data, as the presence of a large number of input variables can cause severe problems for pattern recognition systems. In addition, the underlying distribution of the input dataset was unknown, as were the relationships between the input variables and the output variable, and the possible relations among the input variables themselves. Given the complexity of the input dataset, methods that assume no distributional patterns in the data, and that can at the same time handle unknown high dimensional relationships, were required.

We first decided to implement CHAID and CART regression trees, as they are easy to produce, understand, and use. The ability of tree methods to incrementally build complex rules is simple and powerful, and they readily adapt to various twists and turns in the data. Nevertheless, given that the predictive results were not satisfactory, an MLP neural network model with one hidden layer and four neurons was built.

Neural networks are likewise commonly used to model complex relationships between inputs and outputs when there is little prior knowledge of these relationships. They also have the ability to detect all possible interactions


between predictor variables. Moreover, no assumptions about the form of the model have to be made; neural networks can solve difficult problems that cannot be solved quickly or accurately with conventional methods, which are limited by strict assumptions of normality, linearity, variable independence, and so on. Finally, MLPs can approximate almost any function to a high degree of accuracy given enough data, enough hidden units, and enough training time.

Evaluation of the methods' performance was based on the Nash-Sutcliffe efficiency measure, which showed that the selected neural network model was able to capture the patterns and unknown relations existing between the input data and the output metric K with an accuracy of 0.86, whereas the measures of model performance for the CHAID and CART trees were 0.61 and 0.65 respectively. One reason for the high accuracy of the neural network model is its computation of adequate weights for each of the input attributes, thereby accounting for all the predictive information each of these attributes contains. The weighted inputs are then combined, and the computed values are passed along connections to the hidden and output units, where internal computations provide the nonlinearity that makes neural networks so powerful; finally, predicted output values close to the observed values are generated.

On the other hand, both the CHAID and CART regression trees use fewer inputs than the neural network model. They attempt to find strong relationships between the input and target variables, and only relationships that are strong enough are used for building the model. Some input attributes are treated as irrelevant or redundant and are not taken into account when building the predictive tree. Thus, the patterns and relations existing between these "irrelevant" input attributes and the output variable K are not captured, and the predicted values produced are not as accurate as those obtained with the neural network model.


In addition, knowledge about the input variable matrix L indicates that some of its adjacent positions must be considered as a whole when analyzing the patterns present in the data, even when they are somewhat correlated. The tree methods do not take this special feature of the input data into account, because attributes are treated one at a time when producing the splitting rules, and in certain cases only some of them are considered important inputs.

On the contrary, neural networks take all input attributes into account when building the model, even if some of them are correlated to a certain degree. Some attempts were made to understand how the weights produced by the neural network were distributed over the input data set in a way that could capture the patterns shaped by adjacent positions of the matrix L. However, plots of the computed weights did not reveal any apparent pattern in the distribution of the weights over the entire input set, so no evident explanation of how the neural network model relates adjacent positions of the matrix was found.

One of the disadvantages of a neural network model is its "black box" nature, and therefore neural networks are often used when the prediction task is more important than the interpretation of the built model. Even though the neural network model outperformed the tree models, due to its complex structure it lacks a clear graphical representation of the results, and it also requires longer computation time.


Satisfactory results were also achieved when applying the scoring formula from the neural network model to new cases, i.e., when generating predicted values for the fraction of the data set that did not contain the metric K as the target value. The results obtained showed that it is possible to rely on the predictive power of the neural network model, and further analysis, including other groups of vehicles built by Scania for different purposes, can be made based on the proposed model.


6 Literature

1. Abdullah, M. (2010). Decision Tree Induction & Clustering Techniques in

SAS Enterprise Miner, SPSS Clementine, and IBM Intelligent Miner – A

Comparative Analysis. IABR & ITLC Conference Proceedings.

2. Berry, M.J.A. and Linoff, G. (1999). Mastering Data Mining: The Art and

Science of Customer Relationship Management. New York: John Wiley &

Sons, Inc.

3. Bishop, C. (1995). Neural Networks for Pattern Recognition. Oxford

University Press Inc., New York.

4. De Ville, B. (2006). Decision Trees for Business Intelligence and Data

Mining: Using SAS® Enterprise Miner™. SAS Publishing.

5. Hastie, T., Tibshirani, R., Friedman, J. (2001). The Elements of Statistical

Learning; Data Mining, Inference, and Prediction. Springer Series in

Statistics.

6. Krause, P., Boyle, D.P. and Bäse, F. (2005). Comparison of different efficiency criteria for hydrological model assessment. Advances in Geosciences, 5: 89-97.

7. Kriesel D. (2005). A Brief Introduction to Neural Networks.

www.dkriesel.com.

8. Neville, P. (1999). Decision Trees for Predictive Modeling. SAS Institute

Inc.

9. SAS Enterprise Miner Tutorial, retrieved in 2010.

10. Scania Inline, retrieved in 2010 from www.sacnia.inline.com.


7 Appendix

Appendix A. Parameter settings for the CHAID and CART algorithms

Tree parameter settings to support appropriate assessment efforts:

Minimum number of observations in a leaf:

The smaller this value is, the more likely it is that the tree will overfit the training

data set. If the value is too large, it is likely that the tree will underfit the training

data set and miss relevant patterns in the data. In SAS the default setting is

max (5, n/1000) where n is the number of observations in the training set. In our

case, the default value for the minimum number of observations in a leaf is 20, and

better predictive results were not obtained when trying different values for this

parameter.
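As a small worked example (assuming the fraction is truncated to a whole number), the default follows directly from the size of the training set reported later in this appendix:

# max(5, n/1000) with the training set of 20218 observations used in this thesis
n_train = 20218
min_leaf = max(5, n_train // 1000)      # 20218 // 1000 = 20, so the default is 20
print(min_leaf)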

Observations required for a split search:

This option prevents the splitting of nodes with few observations. In other words,

nodes with fewer observations than the value specified in observations required

for a split search will not be split. The default is a calculated value that depends

on the number of observations and the value stored in minimum number of

observations in a leaf. The default value for our model is 202, and better

predictive results were not obtained when trying different values.

Maximum depth of tree:

This parameter was changed from 6 to 15 to allow complex trees to be grown. The

size of a tree may be the most important single determinant of quality, more

important, perhaps, than creating good individual splits. Trees that are too small

do not describe the data well. Trees that are too large have leaves with too little

data to make any reliable predictions about the contents of the leaf when the tree is

applied to a new sample. Splits deep in a large tree may be based on too little data

to be reliable.


Parameter settings according to the different algorithms performed in each tree node:

Approximation of the CHAID algorithm by using the tree node:

The Model assessment measure property was set to Average squared error.

This measure is the average of the square of the difference between the

predicted outcome and the actual outcome, and it is used to calculate the

worth of a tree when the target is continuous. The worth of the tree is

calculated by using the validation set to compare trees of different sizes in

order to pick the tree with the optimal number of leaves.

The Splitting Criterion was set to F test to measure the degree of separation

achieved by a split.

The F test significance level was set to 0.05, as a stopping rule that

accounts for the predictive reliability of the data. Partitioning stops when no

split meets the threshold level of significance.

To avoid automatic pruning, the Subtree method was set to The most leaves.

The subtree method determines which subtree is selected from the fully

grown tree. This option selects the full tree given that other options are

relied on for stopping the training.

The Maximum number of branches from a node option was varied from 2 to 100, and 10 was chosen given that no better predictive results were obtained when this value was increased further.

The Surrogate rules saved in each node option were set to 0. A surrogate

rule is a back-up to the main splitting rule. When the main splitting rule

relies on an input whose value is missing, the first surrogate rule is invoked.

If the first surrogate also relies on an input whose value is missing, the next

surrogate is invoked. If missing values prevent the main rule and all of the


surrogates from applying to an observation, then the main rule assigns the

observation to the branch it has designated as receiving missing values.

However, since missing values are not present in the data the use of

surrogate rules was not implemented.

To force a heuristic search, the Maximum tries in an exhaustive split search option was set to 0. When its value is large enough, this option allows the optimal split to be found by evaluating every possible split on a variable; setting it to 0 forces the heuristic search to be used instead.

The Observations sufficient for split search option was set to the size of the

training data set (20218). This option sets an upper limit on the number of

observations used in the sample to determine a split. All observations in the

node are then passed to the branches and a new sample is taken within each

branch independently.

The P-value adjustment was set to Kass, and the Apply Kass after choosing

number of branches option was also selected. By choosing this option, the

P-value is multiplied by a Bonferroni factor that depends on the number of

branches, target values, and sometimes on the number of distinct input

values. The algorithm applies this factor after the split is selected. The

adjusted P-values are used in comparing splits on the same input and splits

on different inputs.

Approximation of the CART algorithm by using the tree node:

Trees created by using the tree node are very similar to the ones grown by using

the Classification and Regression Trees method without linear combination splits

or Twoing or ordered Twoing splitting criteria. The Classification and Regression

Trees method recommends using validation data unless the data set contains too

few observations. The Tree node is intended for large data sets. The options in the

Tree node were set as follows:

The Model assessment measure property was set to Average squared error.


The Splitting Criterion was set to Variance reduction. This criterion measures the reduction in the squared error from the node means; a small sketch of this criterion is given at the end of this appendix.

The Maximum number of branches from a node option was set to 2.

The Treat missing as an acceptable value check box was selected.

However, this option did not affect the results since the data did not contain

missing values.

The Surrogate rules saved in each node option was set to 5. Yet, for the same reason mentioned above, these rules were never invoked.

The Subtree method was set to Best assessment value. This option selects

the smallest subtree with the best assessment value. Validation data is used

during the selection process.

The Observations sufficient for split search option was set to 1000.

The Maximum tries in an exhaustive split search option was set to 5000. To find

the optimal split, it is sometimes necessary to evaluate every possible split

on a variable. Sometimes the number of possible splits is extremely large.

In this case, if the number for a specific variable in a specific node is larger

than 5000, then a heuristic (stepwise, hill-climbing) search algorithm is

used instead for that variable in that node.

The P-value adjustment was set to Depth. By selecting this option, the

P-values are adjusted for the number of ancestor splits where the

adjustment depends on the depth of the tree at which the split is done.

Depth is measured as the number of branches in the path from the current

node, where the splitting is taking place, to the root node. The calculated

P-value is multiplied by a depth multiplier, based on the depth in the tree of

the current node, to arrive at the depth-adjusted P-value of the split.
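To close, the variance-reduction criterion referred to above can be written down explicitly. The following is a generic Python sketch of the usual definition, not the internal SAS Enterprise Miner implementation.

import numpy as np

def variance_reduction(y_parent, y_left, y_right):
    # Reduction in squared error from the node mean achieved by a candidate split:
    # Var(parent) minus the size-weighted average of Var(left child) and Var(right child).
    n = len(y_parent)
    weighted_child_var = (len(y_left) * np.var(y_left) +
                          len(y_right) * np.var(y_right)) / n
    return np.var(y_parent) - weighted_child_var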