DECISION TREE, SOFTMAX
REGRESSION AND ENSEMBLE
METHODS IN MACHINE LEARNING
- Abhishek Vijayvargia
WHAT IS MACHINE LEARNING
Formal Approach
Field of study that gives computers the ability to learn
without being explicitly programmed.
Informal Approach
MACHINE LEARNING
Supervised Learning
Supervised learning is the machine learning task of
inferring a function from labeled training data.
Can be thought of as function approximation.
Unsupervised Learning
Trying to find hidden structure in unlabeled data.
Examples given to the learner are unlabeled, so there is no
error or reward signal to evaluate a potential solution.
Can be thought of as finding a shorter description of the data.
Reinforcement learning
Learning by interacting with an environment
SUPERVISED LEARNING
Classification
Output variable takes class labels.
Ex. Predicting whether a mail is spam or ham
Regression
Output variable is numeric or continuous.
Ex. Predicting temperature
DECISION TREES
Is this restaurant good?
(Yes/No)
DECISION TREES
What are the factors that decide whether a restaurant is
good for you or not?
Type: Italian, South Indian, French
Atmosphere: Casual, Fancy
How many people are inside? (e.g., 10 < people < 30)
Cost
Weather outside: Rainy, Sunny, Cloudy
Hungry: Yes/No
DECISION TREE
[Figure: an example decision tree. The root tests Hungry (True/False); further nodes test Rainy, People > 10, Type (French / South Indian), and Cost (More / Less), with branches ending in YES/NO leaves.]
DECISION TREE LEARNING
Pick the best attribute.
Make a decision tree node containing that attribute.
For each value of the attribute, create a descendant of the node.
Sort the training examples to the leaves.
Iterate on each subset using the remaining attributes (see the sketch below).
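A minimal sketch of this recursive procedure in Python (the classic ID3 recursion; the function names are illustrative, and examples are assumed to be dicts of discrete attribute values):

```python
import math
from collections import Counter

def entropy(labels):
    # H(S) = -sum over classes x of p(x) * log2 p(x)
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(rows, labels, attr):
    # IG(A, S) = H(S) - sum over subsets t of p(t) * H(t)
    subsets = {}
    for row, label in zip(rows, labels):
        subsets.setdefault(row[attr], []).append(label)
    n = len(labels)
    return entropy(labels) - sum(len(s) / n * entropy(s) for s in subsets.values())

def id3(rows, labels, attrs):
    # Everything in one class: return a leaf.
    if len(set(labels)) == 1:
        return labels[0]
    # No attributes left: return the majority class.
    if not attrs:
        return Counter(labels).most_common(1)[0][0]
    # Pick the best attribute (maximum information gain).
    best = max(attrs, key=lambda a: information_gain(rows, labels, a))
    tree = {best: {}}
    # For each value of the attribute, recurse on the matching examples.
    for value in {row[best] for row in rows}:
        idx = [i for i, row in enumerate(rows) if row[best] == value]
        tree[best][value] = id3([rows[i] for i in idx],
                                [labels[i] for i in idx],
                                [a for a in attrs if a != best])
    return tree

# Example call (hypothetical data, mirroring the restaurant slide):
# id3([{"Hungry": "Yes", "Rainy": "No"}, ...], ["YES", "NO", ...],
#     ["Hungry", "Rainy"])
```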
DECISION TREE : PICK BEST ATTRIBUTE
[Figure: three candidate binary splits (Graph 1, Graph 2, Graph 3) on sets of + and − examples. A good attribute sends the examples into nearly pure subsets; a poor one leaves each side as mixed as before.]
DECISION TREE : PICK BEST ATTRIBUTE
Select the attribute which gives MAXIMUM Information
Gain.
Gain measures how well a given attribute separates
training examples into targeted classes.
Entropy is a measure of the amount of uncertainty in the
(data) set:
H(S) = -\sum_{x \in X} p(x) \log_2 p(x)
S: the current data set for which entropy is calculated.
X: the set of classes in S.
p(x): the proportion of the number of elements in class x to
the number of elements in set S.
DECISION TREE : INFORMATION GAIN
Information gain IG(A) is the measure of the difference in entropy from before to after the set S is split on an attribute A.
In other words, how much uncertainty in S was reduced after splitting set S on attribute A.
IG(A, S) = H(S) - \sum_{t \in T} p(t) H(t)
H(S): entropy of set S.
T: the subsets created from splitting set S on attribute A, such that S = \bigcup_{t \in T} t.
p(t): the proportion of the number of elements in t to the number of elements in set S. A worked example follows.
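As a worked example (the counts are chosen only for illustration): for a set S with 9 positive and 5 negative examples,
H(S) = -(9/14)\log_2(9/14) - (5/14)\log_2(5/14) \approx 0.940.
If attribute A splits S into t_1 (6+, 2−, with H(t_1) \approx 0.811) and t_2 (3+, 3−, with H(t_2) = 1.0), then
IG(A, S) = 0.940 - (8/14)(0.811) - (6/14)(1.0) \approx 0.048.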
DECISION TREE ALGORITHM : BIAS
Restriction bias: the hypothesis space is all possible decision trees.
Preference bias: which decision trees does the algorithm
prefer?
Good splits near the top.
Correct over incorrect.
Shorter trees.
DECISION TREE : CONTINUOUS ATTRIBUTE
Branch on every possible value?
Include only the ages seen in the training set?
Useless when we encounter an age not present in the training
set.
Instead, represent the attribute in the form of a range.
Ex. Age: rather than branching on exact values such as 1.11 or 1.111,
branch on a range like 20 <= Age < 30 (see the threshold sketch below).
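A minimal sketch of choosing a binary split threshold for a continuous attribute, assuming candidate thresholds at midpoints between distinct sorted values (function names are illustrative):

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def best_threshold(values, labels):
    # Candidate thresholds are midpoints between consecutive distinct
    # sorted values; pick the one maximizing the binary split's gain.
    pairs = sorted(zip(values, labels))
    cands = [(a + b) / 2 for (a, _), (b, _) in zip(pairs, pairs[1:]) if a != b]
    n = len(pairs)
    def gain(th):
        left = [y for v, y in pairs if v < th]
        right = [y for v, y in pairs if v >= th]
        return (entropy(labels)
                - len(left) / n * entropy(left)
                - len(right) / n * entropy(right))
    return max(cands, key=gain)

# e.g. best_threshold([18, 22, 25, 31, 40], ["no", "yes", "yes", "no", "no"])
```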
DECISION TREE : CONTINUOUS ATTRIBUTE
Does it make sense to repeat an attribute along a
path in the tree?
[Figure: a small tree in which attributes A and B repeat along a path.]
DECISION TREE : WHEN DO WE STOP?
Everything classified correctly? (with noisy data, the same
example can appear with two different answers)
No more attributes? (not good for continuous attributes,
which offer infinitely many possible splits)
Pruning
SOFTMAX REGRESSION
Softmax regression (or multinomial logistic
regression) is a classification method that
generalizes logistic regression to multiclass
problems (i.e., with more than two possible discrete
outcomes).
Used to predict the probabilities of the different
possible outcomes of a categorically distributed
dependent variable, given a set of independent
variables (which may be real-valued, binary-valued,
categorical-valued, etc.).
LOGISTIC REGRESSION
Logistic regression is used to refer specifically to
the problem in which the dependent variable is
binary (only two categories).
As the output variable y ∈ {0, 1}, it seems natural to
choose the Bernoulli family of distributions to model the
conditional distribution of y given x.
The logistic function (which always takes on values
between zero and one):
F(t) = \frac{1}{1 + e^{-t}}, \quad \text{with } t = \theta^T x, \text{ so } h_\theta(x) = \frac{1}{1 + e^{-\theta^T x}}
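A small sketch of this function in Python (NumPy assumed; the weights below are purely illustrative):

```python
import numpy as np

def sigmoid(t):
    # F(t) = 1 / (1 + e^{-t}): squashes any real t into (0, 1).
    return 1.0 / (1.0 + np.exp(-t))

# Logistic regression hypothesis: h_theta(x) = sigmoid(theta^T x)
theta = np.array([0.5, -1.0])   # illustrative weights
x = np.array([2.0, 1.0])
print(sigmoid(theta @ x))       # estimated probability that y = 1
```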
SOFTMAX REGRESSION
Used in classification problem in which response
variable y can take on any one of k values:
y ∈ \{1, 2, \ldots, k\}.
Ex. Classify emails into three classes { Primary,
Social, Promotions }
Response variable is still discrete but can take
more than two values.
To derive a Generalized Linear Model for multinomial data,
we begin by expressing the multinomial as an
exponential family distribution.
SOFTMAX REGRESSION
To parameterize a multinomial over k possible
outcomes, we could use k parameters \phi_1, \ldots, \phi_k specifying the probability of each outcome.
These parameters are redundant because \sum_{i=1}^{k} \phi_i = 1. So with \phi_i = p(y = i; \phi),
the last one is determined: p(y = k; \phi) = 1 - \sum_{i=1}^{k-1} \phi_i
The indicator function 1\{\cdot\} takes a value of 1 if its
argument is true, and 0 otherwise:
1{True} = 1, 1{False} = 0.
SOFTMAX REGRESSION
The multinomial is a member of the exponential family:
p(y; \phi) = \phi_1^{1\{y=1\}} \, \phi_2^{1\{y=2\}} \cdots \phi_k^{1\{y=k\}}
= \phi_1^{1\{y=1\}} \, \phi_2^{1\{y=2\}} \cdots \phi_k^{1 - \sum_{i=1}^{k-1} 1\{y=i\}}
= b(y) \exp\left(\omega^T T(y) - a(\omega)\right)
where
\omega = \left[\log(\phi_1/\phi_k), \; \log(\phi_2/\phi_k), \; \ldots, \; \log(\phi_{k-1}/\phi_k)\right]^T
a(\omega) = -\log \phi_k
b(y) = 1, \quad T(y) \in \mathbb{R}^{k-1}
SOFTMAX REGRESSION
The link function is given by
\omega_i = \log\frac{\phi_i}{\phi_k}
To invert the link function and derive the response
function:
e^{\omega_i} = \frac{\phi_i}{\phi_k}
\phi_k e^{\omega_i} = \phi_i
\phi_k \sum_{i=1}^{k} e^{\omega_i} = \sum_{i=1}^{k} \phi_i = 1
SOFTMAX REGRESSION
So we get \phi_k = \frac{1}{\sum_{i=1}^{k} e^{\omega_i}},
which we can substitute back into
the equation to give the response function
\phi_i = \frac{e^{\omega_i}}{\sum_{j=1}^{k} e^{\omega_j}}
The conditional distribution of y given x is then
p(y = i \mid x; \theta) = \phi_i = \frac{e^{\omega_i}}{\sum_{j=1}^{k} e^{\omega_j}} = \frac{e^{\theta_i^T x}}{\sum_{j=1}^{k} e^{\theta_j^T x}}
SOFTMAX REGRESSION
Softmax regression is a generalization of logistic
regression.
Our hypothesis will output
h_\theta(x) = \begin{bmatrix} \phi_1 \\ \phi_2 \\ \vdots \\ \phi_k \end{bmatrix}
In other words, our hypothesis will output the
estimated probability p(y = i \mid x; \theta) for every value of
i = 1, \ldots, k (a minimal sketch follows).
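A minimal sketch of the softmax response function in Python (NumPy assumed; Theta and x below are illustrative, with one parameter row per class):

```python
import numpy as np

def softmax(scores):
    # phi_i = exp(w_i) / sum_j exp(w_j); shift by max(scores)
    # for numerical stability (does not change the result).
    z = scores - np.max(scores)
    e = np.exp(z)
    return e / e.sum()

# Hypothesis output for k = 3 classes (illustrative parameters):
Theta = np.array([[ 0.2, -0.5],
                  [ 1.0,  0.3],
                  [-0.4,  0.8]])   # one row of parameters per class
x = np.array([1.0, 2.0])
print(softmax(Theta @ x))          # estimated p(y = i | x; theta), sums to 1
```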
ENSEMBLE LEARNING
Ensemble learning uses multiple learning algorithms
to obtain better predictive performance than could
be obtained from any of the constituent learning
algorithms alone.
Ensemble learning is primarily used to improve the
prediction performance of a model, or to reduce the
likelihood of an unfortunate selection of a poor one.
HOW GOOD ARE ENSEMBLES?
Let's look at the Netflix Prize competition…
NETFLIX PRIZE : STARTED IN OCT 2006
Supervised Learning Task
Training data is a set of users and the ratings (1, 2, 3, 4, or 5
stars) those users have given to movies.
Construct a classifier that, given a user and an unrated
movie, correctly classifies that movie as either 1, 2, 3, 4, or
5 stars.
$1 million prize for a 10% improvement over Netflix's
current movie recommender/classifier.
NETFLIX PRIZE : LEADER BOARD
ENSEMBLE LEARNING : GENERAL IDEA
ENSEMBLE LEARNING : BAGGING
Given:
A training set S of N examples.
A class of learning models (decision tree, NB, SVM, RF, etc.).
Training:
At each iteration i, a training set Si of N tuples is sampled with replacement from S.
A classifier model Mi is learned for each training set Si.
Classification: to classify an unknown sample x,
Each classifier Mi returns its class prediction.
The bagged classifier M* counts the votes and assigns the class with the most votes (see the sketch below).
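A minimal sketch of this procedure in Python, assuming X and y are NumPy arrays and using scikit-learn decision trees as the base model (function names are illustrative):

```python
import numpy as np
from collections import Counter
from sklearn.tree import DecisionTreeClassifier

def train_bagged(X, y, n_models=25, seed=0):
    # Train n_models trees, each on a bootstrap sample S_i:
    # N draws from S, with replacement.
    rng = np.random.default_rng(seed)
    models = []
    N = len(X)
    for _ in range(n_models):
        idx = rng.integers(0, N, size=N)
        models.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
    return models

def predict_bagged(models, x):
    # Each model M_i votes; the bagged classifier M* returns
    # the class with the most votes.
    votes = [m.predict(x.reshape(1, -1))[0] for m in models]
    return Counter(votes).most_common(1)[0][0]
```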
ENSEMBLE LEARNING : BAGGING
Bagging reduces variance by voting/averaging.
It can help a lot when the data is noisy.
If the learning algorithm is unstable, bagging
almost always improves performance.
ENSEMBLE LEARNING : RANDOM FORESTS
Random Forests grow many classification trees.
To classify a new object from an input vector, put
the input vector down each of the trees in the
forest.
Each tree gives a classification, and we say the tree
"votes" for that class.
The forest chooses the classification having the
most votes (over all the trees in the forest).
ENSEMBLE LEARNING : RANDOM FORESTS
Each tree is grown as follows:
If the number of cases in the training set is N, sample N cases at random, but with replacement, from the original data. This sample will be the training set for growing the tree.
If there are M input variables, a number m << M is specified such that at each node, m variables are selected at random out of the M, and the best split on these m is used to split the node. The value of m is held constant while the forest grows.
Each tree is grown to the largest extent possible. There is no pruning. (A usage sketch follows.)
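For instance, with scikit-learn's RandomForestClassifier (a usage sketch on a standard dataset, not the presenter's code; max_features plays the role of m):

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)

# n_estimators: number of trees; max_features: the m << M variables
# considered at each node; trees are fully grown (no pruning) by default.
forest = RandomForestClassifier(n_estimators=100, max_features="sqrt")
forest.fit(X, y)

print(forest.predict(X[:2]))          # majority vote over all trees
print(forest.feature_importances_)    # estimate of variable importance
```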
FEATURES OF RANDOM FORESTS
It is among the most accurate of current algorithms.
It runs efficiently on large databases.
It can handle thousands of input variables without
variable deletion.
It gives estimates of which variables are important in
the classification.
It is an effective method for estimating missing data, and
maintains accuracy when a large proportion of the
data is missing.
Generated forests can be saved for future use on
other data.
ENSEMBLE LEARNING : BOOSTING
Create a sequence of classifiers, giving higher
influence to more accurate classifiers.
At each iteration, make the currently misclassified
examples more important (they get larger weight in
the construction of the next classifier).
Then combine the classifiers by weighted vote (weights
given by classifier accuracy).
ENSEMBLE LEARNING : BOOSTING
Suppose there are just 7 training examples {1, 2, 3, 4, 5, 6, 7}.
Initially each example has a 1/7 (≈ 0.143) probability of being sampled.
The 1st round of boosting samples (with replacement) 7 examples {3, 5, 5, 4, 6, 7, 3} and builds a classifier from them.
Suppose examples {2, 3, 4, 6, 7} are correctly predicted by this classifier and examples {1, 5} are wrongly predicted:
Weights of examples {1, 5} are increased.
Weights of examples {2, 3, 4, 6, 7} are decreased.
The 2nd round of boosting again takes 7 examples, but now examples {1, 5} are more likely to be sampled.
And so on, until some convergence is achieved (see the reweighting sketch below).
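A minimal sketch of this reweighting step in Python (an AdaBoost-style update; alpha and the per-example outcomes are illustrative, chosen to mirror the slide):

```python
import numpy as np

def reweight(weights, correct, alpha):
    # Multiply misclassified examples by e^{+alpha} and correct ones
    # by e^{-alpha}, then renormalize into a sampling distribution.
    w = weights * np.exp(np.where(correct, -alpha, alpha))
    return w / w.sum()

# 7 examples, initially uniform (1/7 each), as on the slide:
w = np.full(7, 1 / 7)
# Examples {1, 5} (indices 0 and 4) were wrongly predicted:
correct = np.array([False, True, True, True, False, True, True])
w = reweight(w, correct, alpha=0.5)
print(w)  # indices 0 and 4 now carry more weight, so {1, 5}
          # are more likely to be sampled in the next round
```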
ENSEMBLE LEARNING : BOOSTING
Weight models according to performance.
Encourage the new model to become an "expert" for
instances misclassified by earlier models.
Combine "weak learners" to generate a "strong
learner".
ENSEMBLE LEARNING
The Netflix Prize winner used gradient boosted decision
trees:
http://www.netflixprize.com/assets/GrandPrize2009_BPC_BellKor.pdf
THANK YOU FOR YOUR ATTENTION