DECISION TREE, SOFTMAX
REGRESSION AND ENSEMBLE
METHODS IN MACHINE LEARNING
- Abhishek Vijayvargia
WHAT IS MACHINE LEARNING
Formal Approach
Field of study that gives computers the ability to learn
without being explicitly programmed.
Informal Approach
MACHINE LEARNING
Supervised Learning
Supervised learning is the machine learning task of
inferring a function from labeled training data.
Can be thought of as function approximation.
Unsupervised Learning
Trying to find hidden structure in unlabeled data.
Examples given to the learner are unlabeled, so there is no
error or reward signal to evaluate a potential solution.
Can be thought of as finding a shorter description of the data.
Reinforcement learning
Learning by interacting with an environment
SUPERVISED LEARNING
Classification
Output variable takes class labels.
Ex. Predicting whether a mail is spam or ham
Regression
Output variable is numeric or continuous.
Ex. Predicting temperature
DECISION TREES
Is this restaurant good?
(Yes/No)
DECISION TREES
What are the factors that decide whether a restaurant is
good for you or not?
Type: Italian, South Indian, French
Atmosphere: Casual, Fancy
How many people are inside? (e.g., 10 < people < 30)
Cost
Weather outside: Rainy, Sunny, Cloudy
Hungry: Yes/No
DECISION TREE
[Figure: an example decision tree. The root tests Hungry (True/False); further nodes test Rainy, People > 10, Type (French / South Indian), and Cost (More / Less), with branches ending in YES/NO leaves.]
DECISION TREE LEARNING
Pick the best attribute.
Make a decision tree node containing that attribute.
For each value of the attribute, create a descendant of the node.
Sort the training examples to the leaves.
Iterate on each subset using the remaining attributes (see the sketch below).
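A minimal sketch of this recursive procedure in Python (the classic ID3 recursion; the function names are illustrative, and examples are assumed to be dicts of discrete attribute values):

```python
import math
from collections import Counter

def entropy(labels):
    # H(S) = -sum over classes x of p(x) * log2 p(x)
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(rows, labels, attr):
    # IG(A, S) = H(S) - sum over subsets t of p(t) * H(t)
    subsets = {}
    for row, label in zip(rows, labels):
        subsets.setdefault(row[attr], []).append(label)
    n = len(labels)
    return entropy(labels) - sum(len(s) / n * entropy(s) for s in subsets.values())

def id3(rows, labels, attrs):
    # Everything in one class: return a leaf.
    if len(set(labels)) == 1:
        return labels[0]
    # No attributes left: return the majority class.
    if not attrs:
        return Counter(labels).most_common(1)[0][0]
    # Pick the best attribute (maximum information gain).
    best = max(attrs, key=lambda a: information_gain(rows, labels, a))
    tree = {best: {}}
    # For each value of the attribute, recurse on the matching examples.
    for value in {row[best] for row in rows}:
        idx = [i for i, row in enumerate(rows) if row[best] == value]
        tree[best][value] = id3([rows[i] for i in idx],
                                [labels[i] for i in idx],
                                [a for a in attrs if a != best])
    return tree

# Example call (hypothetical data, mirroring the restaurant slide):
# id3([{"Hungry": "Yes", "Rainy": "No"}, ...], ["YES", "NO", ...],
#     ["Hungry", "Rainy"])
```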
DECISION TREE : PICK BEST ATTRIBUTE
[Figure: three candidate binary splits (Graph 1, Graph 2, Graph 3) on sets of + and − examples. A good attribute sends the examples into nearly pure subsets; a poor one leaves each side as mixed as before.]
DECISION TREE : PICK BEST ATTRIBUTE
Select the attribute which gives MAXIMUM Information
Gain.
Gain measures how well a given attribute separates
training examples into targeted classes.
Entropy is a measure of the amount of uncertainty in the
(data) set:
H(S) = -\sum_{x \in X} p(x) \log_2 p(x)
S: the current data set for which entropy is calculated.
X: the set of classes in S.
p(x): the proportion of the number of elements in class x to
the number of elements in set S.
DECISION TREE : INFORMATION GAIN
Information gain IG(A) is the measure of the difference in entropy from before to after the set S is split on an attribute A.
In other words, how much uncertainty in S was reduced after splitting set S on attribute A.
IG(A, S) = H(S) - \sum_{t \in T} p(t) H(t)
H(S): entropy of set S.
T: the subsets created from splitting set S on attribute A, such that S = \bigcup_{t \in T} t.
p(t): the proportion of the number of elements in t to the number of elements in set S. A worked example follows.
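As a worked example (the counts are chosen only for illustration): for a set S with 9 positive and 5 negative examples,
H(S) = -(9/14)\log_2(9/14) - (5/14)\log_2(5/14) \approx 0.940.
If attribute A splits S into t_1 (6+, 2−, with H(t_1) \approx 0.811) and t_2 (3+, 3−, with H(t_2) = 1.0), then
IG(A, S) = 0.940 - (8/14)(0.811) - (6/14)(1.0) \approx 0.048.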
DECISION TREE ALGORITHM : BIAS
Restriction bias: the hypothesis space is all possible decision trees.
Preference bias: which decision trees does the algorithm
prefer?
Good splits near the top.
Correct over incorrect.
Shorter trees.
DECISION TREE : CONTINUOUS ATTRIBUTE
Branch on every possible value?
Include only the ages seen in the training set?
Useless when we encounter an age not present in the training
set.
Instead, represent the attribute in the form of a range.
Ex. Age: rather than branching on exact values such as 1.11 or 1.111,
branch on a range like 20 <= Age < 30 (see the threshold sketch below).
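A minimal sketch of choosing a binary split threshold for a continuous attribute, assuming candidate thresholds at midpoints between distinct sorted values (function names are illustrative):

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def best_threshold(values, labels):
    # Candidate thresholds are midpoints between consecutive distinct
    # sorted values; pick the one maximizing the binary split's gain.
    pairs = sorted(zip(values, labels))
    cands = [(a + b) / 2 for (a, _), (b, _) in zip(pairs, pairs[1:]) if a != b]
    n = len(pairs)
    def gain(th):
        left = [y for v, y in pairs if v < th]
        right = [y for v, y in pairs if v >= th]
        return (entropy(labels)
                - len(left) / n * entropy(left)
                - len(right) / n * entropy(right))
    return max(cands, key=gain)

# e.g. best_threshold([18, 22, 25, 31, 40], ["no", "yes", "yes", "no", "no"])
```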
DECISION TREE : CONTINUOUS ATTRIBUTE
Does it make sense to repeat an attribute along a
path in the tree?
[Figure: a small tree in which attributes A and B repeat along a path.]
DECISION TREE : WHEN DO WE STOP?
Everything classified correctly? (with noisy data, the same
example can appear with two different answers)
No more attributes? (not good for continuous attributes,
which offer infinitely many possible splits)
Pruning
SOFTMAX REGRESSION
Softmax regression (or multinomial logistic
regression) is a classification method that
generalizes logistic regression to multiclass
problems (i.e., with more than two possible discrete
outcomes).
Used to predict the probabilities of the different
possible outcomes of a categorically distributed
dependent variable, given a set of independent
variables (which may be real-valued, binary-valued,
categorical-valued, etc.).
LOGISTIC REGRESSION
Logistic regression is used to refer specifically to
the problem in which the dependent variable is
binary (only two categories).
As the output variable y ∈ {0, 1}, it seems natural to
choose the Bernoulli family of distributions to model the
conditional distribution of y given x.
The logistic function (which always takes on values
between zero and one):
F(t) = \frac{1}{1 + e^{-t}}, \quad \text{with } t = \theta^T x, \text{ so } h_\theta(x) = \frac{1}{1 + e^{-\theta^T x}}
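A small sketch of this function in Python (NumPy assumed; the weights below are purely illustrative):

```python
import numpy as np

def sigmoid(t):
    # F(t) = 1 / (1 + e^{-t}): squashes any real t into (0, 1).
    return 1.0 / (1.0 + np.exp(-t))

# Logistic regression hypothesis: h_theta(x) = sigmoid(theta^T x)
theta = np.array([0.5, -1.0])   # illustrative weights
x = np.array([2.0, 1.0])
print(sigmoid(theta @ x))       # estimated probability that y = 1
```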
SOFTMAX REGRESSION
Used in classification problem in which response
variable y can take on any one of k values:
y ∈ \{1, 2, \ldots, k\}.
Ex. Classify emails into three classes { Primary,
Social, Promotions }
Response variable is still discrete but can take
more than two values.
To derive a Generalized Linear Model for multinomial data,
we begin by expressing the multinomial as an
exponential family distribution.
SOFTMAX REGRESSION
To parameterize a multinomial over k possible
outcomes, we could use k parameters \phi_1, \ldots, \phi_k specifying the probability of each outcome.
These parameters are redundant because \sum_{i=1}^{k} \phi_i = 1. So with \phi_i = p(y = i; \phi),
the last one is determined: p(y = k; \phi) = 1 - \sum_{i=1}^{k-1} \phi_i
The indicator function 1\{\cdot\} takes a value of 1 if its
argument is true, and 0 otherwise:
1{True} = 1, 1{False} = 0.
SOFTMAX REGRESSION
The multinomial is a member of the exponential family:
p(y; \phi) = \phi_1^{1\{y=1\}} \, \phi_2^{1\{y=2\}} \cdots \phi_k^{1\{y=k\}}
= \phi_1^{1\{y=1\}} \, \phi_2^{1\{y=2\}} \cdots \phi_k^{1 - \sum_{i=1}^{k-1} 1\{y=i\}}
= b(y) \exp\left(\omega^T T(y) - a(\omega)\right)
where
\omega = \left[\log(\phi_1/\phi_k), \; \log(\phi_2/\phi_k), \; \ldots, \; \log(\phi_{k-1}/\phi_k)\right]^T
a(\omega) = -\log \phi_k
b(y) = 1, \quad T(y) \in \mathbb{R}^{k-1}
SOFTMAX REGRESSION
The link function is given by
\omega_i = \log\frac{\phi_i}{\phi_k}
To invert the link function and derive the response
function:
e^{\omega_i} = \frac{\phi_i}{\phi_k}
\phi_k e^{\omega_i} = \phi_i
\phi_k \sum_{i=1}^{k} e^{\omega_i} = \sum_{i=1}^{k} \phi_i = 1
SOFTMAX REGRESSION
So we get \phi_k = \frac{1}{\sum_{i=1}^{k} e^{\omega_i}},
which we can substitute back into
the equation to give the response function
\phi_i = \frac{e^{\omega_i}}{\sum_{j=1}^{k} e^{\omega_j}}
The conditional distribution of y given x is then
p(y = i \mid x; \theta) = \phi_i = \frac{e^{\omega_i}}{\sum_{j=1}^{k} e^{\omega_j}} = \frac{e^{\theta_i^T x}}{\sum_{j=1}^{k} e^{\theta_j^T x}}
SOFTMAX REGRESSION
Softmax regression is a generalization of logistic
regression.
Our hypothesis will output
h_\theta(x) = \begin{bmatrix} \phi_1 \\ \phi_2 \\ \vdots \\ \phi_k \end{bmatrix}
In other words, our hypothesis will output the
estimated probability p(y = i \mid x; \theta) for every value of
i = 1, \ldots, k (a minimal sketch follows).
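A minimal sketch of the softmax response function in Python (NumPy assumed; Theta and x below are illustrative, with one parameter row per class):

```python
import numpy as np

def softmax(scores):
    # phi_i = exp(w_i) / sum_j exp(w_j); shift by max(scores)
    # for numerical stability (does not change the result).
    z = scores - np.max(scores)
    e = np.exp(z)
    return e / e.sum()

# Hypothesis output for k = 3 classes (illustrative parameters):
Theta = np.array([[ 0.2, -0.5],
                  [ 1.0,  0.3],
                  [-0.4,  0.8]])   # one row of parameters per class
x = np.array([1.0, 2.0])
print(softmax(Theta @ x))          # estimated p(y = i | x; theta), sums to 1
```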
ENSEMBLE LEARNING
Ensemble learning uses multiple learning algorithms
to obtain better predictive performance than could
be obtained from any of the constituent learning
algorithms alone.
Ensemble learning is primarily used to improve the
prediction performance of a model, or to reduce the
likelihood of an unfortunate selection of a poor one.
HOW GOOD ARE ENSEMBLES?
Let's look at the Netflix Prize competition…
NETFLIX PRIZE : STARTED IN OCT 2006
Supervised Learning Task
Training data is a set of users and the ratings (1, 2, 3, 4, or 5
stars) those users have given to movies.
Construct a classifier that, given a user and an unrated
movie, correctly classifies that movie as either 1, 2, 3, 4, or
5 stars.
$1 million prize for a 10% improvement over Netflix's
current movie recommender/classifier.
NETFLIX PRIZE : LEADER BOARD
ENSEMBLE LEARNING : GENERAL IDEA
ENSEMBLE LEARNING : BAGGING
Given:
A training set S of N examples.
A class of learning models (decision tree, NB, SVM, RF, etc.).
Training:
At each iteration i, a training set Si of N tuples is sampled with replacement from S.
A classifier model Mi is learned for each training set Si.
Classification: to classify an unknown sample x,
Each classifier Mi returns its class prediction.
The bagged classifier M* counts the votes and assigns the class with the most votes (see the sketch below).
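A minimal sketch of this procedure in Python, assuming X and y are NumPy arrays and using scikit-learn decision trees as the base model (function names are illustrative):

```python
import numpy as np
from collections import Counter
from sklearn.tree import DecisionTreeClassifier

def train_bagged(X, y, n_models=25, seed=0):
    # Train n_models trees, each on a bootstrap sample S_i:
    # N draws from S, with replacement.
    rng = np.random.default_rng(seed)
    models = []
    N = len(X)
    for _ in range(n_models):
        idx = rng.integers(0, N, size=N)
        models.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
    return models

def predict_bagged(models, x):
    # Each model M_i votes; the bagged classifier M* returns
    # the class with the most votes.
    votes = [m.predict(x.reshape(1, -1))[0] for m in models]
    return Counter(votes).most_common(1)[0][0]
```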
ENSEMBLE LEARNING : BAGGING
Bagging reduces variance by voting/averaging.
It can help a lot when the data is noisy.
If the learning algorithm is unstable, bagging
almost always improves performance.
ENSEMBLE LEARNING : RANDOM FORESTS
Random Forests grow many classification trees.
To classify a new object from an input vector, put
the input vector down each of the trees in the
forest.
Each tree gives a classification, and we say the tree
"votes" for that class.
The forest chooses the classification having the
most votes (over all the trees in the forest).
ENSEMBLE LEARNING : RANDOM FORESTS
Each tree is grown as follows:
If the number of cases in the training set is N, sample N cases at random, but with replacement, from the original data. This sample will be the training set for growing the tree.
If there are M input variables, a number m << M is specified such that at each node, m variables are selected at random out of the M, and the best split on these m is used to split the node. The value of m is held constant while the forest grows.
Each tree is grown to the largest extent possible. There is no pruning. (A usage sketch follows.)
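For instance, with scikit-learn's RandomForestClassifier (a usage sketch on a standard dataset, not the presenter's code; max_features plays the role of m):

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)

# n_estimators: number of trees; max_features: the m << M variables
# considered at each node; trees are fully grown (no pruning) by default.
forest = RandomForestClassifier(n_estimators=100, max_features="sqrt")
forest.fit(X, y)

print(forest.predict(X[:2]))          # majority vote over all trees
print(forest.feature_importances_)    # estimate of variable importance
```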
FEATURES OF RANDOM FORESTS
It is among the most accurate of current algorithms.
It runs efficiently on large databases.
It can handle thousands of input variables without
variable deletion.
It gives estimates of which variables are important in
the classification.
It is an effective method for estimating missing data, and
maintains accuracy when a large proportion of the
data is missing.
Generated forests can be saved for future use on
other data.
ENSEMBLE LEARNING : BOOSTING
Create a sequence of classifiers, giving higher
influence to more accurate classifiers.
At each iteration, make the currently misclassified
examples more important (they get larger weight in
the construction of the next classifier).
Then combine the classifiers by weighted vote (weights
given by classifier accuracy).
ENSEMBLE LEARNING : BOOSTING
Suppose there are just 7 training examples {1, 2, 3, 4, 5, 6, 7}.
Initially each example has a 1/7 (≈ 0.143) probability of being sampled.
The 1st round of boosting samples (with replacement) 7 examples {3, 5, 5, 4, 6, 7, 3} and builds a classifier from them.
Suppose examples {2, 3, 4, 6, 7} are correctly predicted by this classifier and examples {1, 5} are wrongly predicted:
Weights of examples {1, 5} are increased.
Weights of examples {2, 3, 4, 6, 7} are decreased.
The 2nd round of boosting again takes 7 examples, but now examples {1, 5} are more likely to be sampled.
And so on, until some convergence is achieved (see the reweighting sketch below).
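A minimal sketch of this reweighting step in Python (an AdaBoost-style update; alpha and the per-example outcomes are illustrative, chosen to mirror the slide):

```python
import numpy as np

def reweight(weights, correct, alpha):
    # Multiply misclassified examples by e^{+alpha} and correct ones
    # by e^{-alpha}, then renormalize into a sampling distribution.
    w = weights * np.exp(np.where(correct, -alpha, alpha))
    return w / w.sum()

# 7 examples, initially uniform (1/7 each), as on the slide:
w = np.full(7, 1 / 7)
# Examples {1, 5} (indices 0 and 4) were wrongly predicted:
correct = np.array([False, True, True, True, False, True, True])
w = reweight(w, correct, alpha=0.5)
print(w)  # indices 0 and 4 now carry more weight, so {1, 5}
          # are more likely to be sampled in the next round
```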
ENSEMBLE LEARNING : BOOSTING
Weight models according to performance.
Encourage the new model to become an "expert" for
instances misclassified by earlier models.
Combine "weak learners" to generate a "strong
learner".
ENSEMBLE LEARNING
The Netflix Prize winner used gradient boosted decision
trees:
http://www.netflixprize.com/assets/GrandPrize2009_BPC_BellKor.pdf
THANK YOU FOR YOUR ATTENTION