DECISION TREES & RANDOM FORESTS X
CONVOLUTIONAL NEURAL NETWORKS
Meir Dalal
Or Gorodissky
Deep Neural Decision Forests
Microsoft Research Cambridge, UK, ICCV 2015
Decision Forests, Convolutional Networks and the Models in-Between
Microsoft Research Technical Report arXiv 3 Mar. 2016
OVERVIEW OF THE PRESENTATION
MOTIVATION
DECISION TREES
RANDOM FORESTS
DECISION TREES VS CNN
COMBINING DECISION TREE & CNN
DECISION TREE - WHAT IS IT
Supervised learning algorithm used for classification
An inductive learning task - use particular facts to make more generalized conclusions
A predictive model based on a branching series of tests
These smaller tests are less complex than a one-stage classifier (Divide & Conquer)
Another way to look at it: each node either predicts the answer or passes the problem to a different node
Example…
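A minimal sketch of the "branching series of tests" idea, written as plain nested conditions. The features (`color`, `diameter_cm`), thresholds and class names are made up for illustration and do not come from the slides.

```python
# Hypothetical toy tree: every internal node asks one simple question and
# either answers directly (leaf) or passes the sample on to a child node.
def classify_fruit(sample):
    """Classify a dict with made-up 'color' and 'diameter_cm' features."""
    if sample["diameter_cm"] < 3.0:        # test at the root node
        if sample["color"] == "red":       # test at a second-level node
            return "cherry"                # leaf: answer directly
        return "grape"
    if sample["color"] == "yellow":        # the other second-level node
        return "lemon"
    return "apple"


print(classify_fruit({"color": "red", "diameter_cm": 2.0}))
```

Each individual test is trivial; the classifier's power comes from chaining them (Divide & Conquer).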
DECISION TREES - HOW TO CONSTRUCT
When to stop
All the instances have the same target class
There are no more instances
There are no more attributes
A pre-defined maximum depth is reached
How to split? Decision trees are usually constructed top-down
Gini impurity
Information gain
…
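The two split criteria named above can be sketched as follows. This is a minimal illustration using the standard textbook definitions; the function names and the toy labels are assumptions made here, not taken from the slides.

```python
from collections import Counter
from math import log2


def gini(labels):
    """Gini impurity: 1 - sum_c p_c^2 (0.0 for an empty or pure node)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def entropy(labels):
    """Shannon entropy in bits of the class distribution in a node."""
    n = len(labels)
    if n == 0:
        return 0.0
    return sum(-(c / n) * log2(c / n) for c in Counter(labels).values())


def information_gain(parent, left, right):
    """Entropy of the parent minus the size-weighted entropy of the children."""
    n = len(parent)
    children = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(parent) - children


parent = ["a", "a", "b", "b"]
print(gini(parent))
print(information_gain(parent, ["a", "a"], ["b", "b"]))   # a perfect split
print(information_gain(parent, ["a", "b"], ["a", "b"]))   # an uninformative split
```

A top-down builder would evaluate one of these scores for every candidate split at a node and keep the best one.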
DECISION TREES - STOCHASTIC ROUTING
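Input space $\mathcal{X}$, output space $\mathcal{Y}$
Decision nodes: $n \in \mathcal{N}$, each with a decision function $d_n(\cdot;\Theta)$
Prediction nodes: $\ell \in \mathcal{L}$, each holding a distribution $\pi_\ell$ over $\mathcal{Y}$
$\Theta$ - decision node parameterization
Routing function until now
$d_n$ is binary and the routing is deterministic
The leaf prediction is denoted $\pi_\ell$
Stochastic routing function
$d_n(\cdot;\Theta): \mathcal{X} \to [0,1]$
The routing decision is the outcome of a Bernoulli random variable with mean $d_n(x;\Theta)$
Each leaf node holds a probability for each class: $\pi_\ell$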
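A minimal sketch of stochastic routing through one tree. It assumes a complete binary tree stored in heap order, a scalar input, a linear score per node passed through a sigmoid, and that the decision value is the probability of going left; all of these are illustrative assumptions.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def route(x, node_params, depth, rng):
    """Sample a path from the root and return the index of the leaf reached.

    node_params[n] = (w, b) is a hypothetical linear parameterization of
    decision node n; nodes are stored in heap order (children of n are
    2n + 1 and 2n + 2), so 2**depth - 1 entries are needed.
    """
    n = 0
    for _ in range(depth):
        w, b = node_params[n]
        p_left = sigmoid(w * x + b)          # mean of the Bernoulli variable
        go_left = rng.random() < p_left      # one Bernoulli draw
        n = 2 * n + (1 if go_left else 2)
    return n - (2 ** depth - 1)              # leaf index in 0 .. 2**depth - 1


rng = random.Random(0)
params = [(1.0, 0.0), (0.5, 0.0), (-0.5, 0.0)]   # made-up parameters, depth 2
leaves = [route(0.3, params, 2, rng) for _ in range(1000)]
print([leaves.count(l) for l in range(4)])       # empirical leaf frequencies
```

With binary decision values (0 or 1) the same loop reduces to the deterministic routing of a classical decision tree.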
DECISION TREE - ENSEMBLE METHODS
If a decision tree is fully grown, it may lose some generalization capability
→ Overfitting
How to solve it?
Ensemble methods
Combine a group of predictive models to achieve better accuracy and model stability
RANDOM FOREST
When you can’t think of any algorithm, use random forest!
Algorithm (Bootstrap Aggregation)
1. Grow K different decision trees
   1. Pick a random subset of the training examples (with replacement)
   2. Pick d << D random attributes to split the data
   3. Each tree is grown to the largest extent possible and there is no pruning
2. Given a new data point $x$
   1. Classify $x$ using each of the trees $T_1 \dots T_K$
   2. Predict new data by aggregating the predictions of the trees (i.e., majority vote for classification, average for regression)
Forest decision: averaging all the trees’ predictions
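The bagging procedure above can be sketched in a few lines. To keep the example short, one-level decision stumps stand in for fully grown trees, which is a simplification and not what the algorithm prescribes; the data set and all function names are made up for illustration.

```python
import random
from collections import Counter


def train_stump(X, y, features):
    """Pick the (feature, threshold) pair with the fewest training errors."""
    best = None
    for f in features:
        for t in sorted({row[f] for row in X}):
            left = [lab for row, lab in zip(X, y) if row[f] <= t]
            right = [lab for row, lab in zip(X, y) if row[f] > t]
            l_lab = Counter(left).most_common(1)[0][0]
            r_lab = Counter(right).most_common(1)[0][0] if right else l_lab
            errors = sum(lab != l_lab for lab in left) + sum(lab != r_lab for lab in right)
            if best is None or errors < best[0]:
                best = (errors, f, t, l_lab, r_lab)
    _, f, t, l_lab, r_lab = best
    return lambda row: l_lab if row[f] <= t else r_lab


def train_forest(X, y, k, d, rng):
    """Grow k stumps, each on a bootstrap sample and d random attributes."""
    forest = []
    for _ in range(k):
        idx = [rng.randrange(len(X)) for _ in X]    # sample with replacement
        feats = rng.sample(range(len(X[0])), d)     # d out of D attributes
        forest.append(train_stump([X[i] for i in idx], [y[i] for i in idx], feats))
    return forest


def predict(forest, row):
    """Majority vote over the trees (classification case only)."""
    return Counter(tree(row) for tree in forest).most_common(1)[0][0]


# Made-up data: three attributes, two well-separated classes.
X = [[v, v + 1, 2 * v] for v in (0, 1, 2, 10, 11, 12)]
y = [0, 0, 0, 1, 1, 1]
forest = train_forest(X, y, k=25, d=2, rng=random.Random(0))
print(predict(forest, [0, 1, 0]), predict(forest, [12, 13, 24]))
```

For regression, the vote in `predict` would be replaced by the mean of the trees' outputs.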
DECISION TREES X CONV NEURAL NETS
DT | CNN
Levels | Layers
Divide & Conquer | High dimensionality
Only log2 N parameters used at test time | Uses all the parameters at test time!
No features learned (at most) | Feature learning integrated with classification
Training is done layer-wise | Training E2E with (S)GD
High efficiency | State-of-the-art accuracy
How to efficiently combine DT/RF with CNN?
DECISION TREE BY CNN FEATURES: ARCHITECTURE
[Figure: CNN features feed a random forest (RF); forest decision: averaging all the trees’ predictions]
Decision Nodes
$d_n(x;\Theta) = \sigma(f_n(x;\Theta))$
$\sigma(x) = (1 + e^{-x})^{-1}$ (sigmoid function)
$f_n(\cdot;\Theta): \mathcal{X} \to \mathbb{R}$
Prediction Probability
Prediction for sample $x$: $\mathbb{P}_T[y \mid x, \Theta, \pi] = \sum_{\ell \in \mathcal{L}} \pi_{\ell y}\, \mu_\ell(x \mid \Theta)$, where
$\pi_{\ell y}$ - probability that a sample reaching leaf $\ell$ takes class $y$
$\mu_\ell(x \mid \Theta)$ - probability that sample $x$ reaches leaf $\ell$, with $\sum_{\ell \in \mathcal{L}} \mu_\ell(x \mid \Theta) = 1$
Forest Of Decision Trees
Deliver a prediction for a sample $x$ by averaging the output of each tree: $\mathbb{P}_{\mathcal{F}}[y \mid x] = \frac{1}{K} \sum_{h=1}^{K} \mathbb{P}_{T_h}[y \mid x]$
$K$ - number of decision trees in the forest
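A minimal sketch of the tree and forest prediction formulas above. It assumes a complete binary tree in heap order, a scalar input, a hypothetical linear score `w * x + b` per decision node (in the architecture above the score would come from the CNN), and that the decision value is the probability of going left.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def leaf_reach_probs(x, node_params, depth):
    """Probability of reaching each leaf, computed level by level."""
    reach = [1.0]                  # the root is reached with probability 1
    node = 0                       # heap-order index of the current decision node
    for _ in range(depth):
        nxt = []
        for p in reach:
            w, b = node_params[node]
            d = sigmoid(w * x + b)             # decision value of this node
            nxt += [p * d, p * (1.0 - d)]      # left child, right child
            node += 1
        reach = nxt
    return reach


def tree_predict(x, node_params, pi, depth):
    """Class probabilities: sum over leaves of pi[leaf][y] * reach[leaf]."""
    mu = leaf_reach_probs(x, node_params, depth)
    return [sum(pi[l][y] * mu[l] for l in range(len(mu))) for y in range(len(pi[0]))]


def forest_predict(x, trees, depth):
    """Average the per-tree class probabilities over the K trees."""
    preds = [tree_predict(x, params, pi, depth) for params, pi in trees]
    return [sum(p[y] for p in preds) / len(preds) for y in range(len(preds[0]))]


# Two made-up depth-1 trees over two classes.
tree_a = ([(2.0, 0.0)], [[0.9, 0.1], [0.2, 0.8]])
tree_b = ([(-1.0, 0.5)], [[0.3, 0.7], [0.6, 0.4]])
print(forest_predict(0.4, [tree_a, tree_b], depth=1))
```

Because the leaf-reaching probabilities sum to one and every leaf holds a distribution, the output is itself a valid distribution over the classes.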
TWO-STEP OPTIMIZATION STRATEGY
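Objective function: the risk $R(\Theta, \pi; \mathcal{T})$ on the training set $\mathcal{T}$
(1) Learning decision nodes
Our goal: $\min_{\Theta} R(\Theta, \pi; \mathcal{T})$
$\eta > 0$ - learning rate
$\mathcal{B} \subseteq \mathcal{T}$ - random subset (mini-batch)
(2) Learning prediction nodes
Our goal: $\min_{\pi} R(\Theta, \pi; \mathcal{T})$
$Z_\ell^{(t)}$ - normalization factor
$\pi_{\ell y}^{(0)}$ - arbitrary, $> 0$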
LEARNING TREE BY BACK PROPAGATION
(1) $\Theta$
Randomly select a tree in the forest for each mini-batch
(2) $\pi$
Update the prediction nodes in each tree independently, since each tree has its own set of leaf predictions
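A sketch of one update of the prediction nodes for a single tree with the decision nodes held fixed. The update rule itself is not spelled out in the text above, so the multiplicative, normalized form used here is an assumption, chosen to be consistent with the normalization factor and the strictly positive starting point mentioned earlier. All names are hypothetical.

```python
def update_leaves(pi, mus, labels):
    """One fixed-point style update of the leaf distributions (assumed form).

    pi[l][y]   current probability of class y in leaf l (all > 0 at the start)
    mus[i][l]  probability that training sample i reaches leaf l
    labels[i]  class index of training sample i
    """
    n_leaves, n_classes = len(pi), len(pi[0])
    new = [[0.0] * n_classes for _ in range(n_leaves)]
    for mu, y in zip(mus, labels):
        # Tree prediction for the true class of this sample.
        p_y = sum(pi[l][y] * mu[l] for l in range(n_leaves))
        for l in range(n_leaves):
            new[l][y] += pi[l][y] * mu[l] / p_y
    for l in range(n_leaves):
        z = sum(new[l])                      # normalization factor of leaf l
        if z > 0:
            new[l] = [v / z for v in new[l]]
        else:
            new[l] = list(pi[l])             # leaf never reached: keep old values
    return new


# Made-up example: two leaves, two classes, perfectly separated routing.
pi0 = [[0.5, 0.5], [0.5, 0.5]]
mus = [[1.0, 0.0], [0.0, 1.0]]
labels = [0, 1]
print(update_leaves(pi0, mus, labels))
```

The decision-node step (1) would alternate with this one: a gradient step on the decision parameters over a mini-batch, followed by a leaf update like the one sketched here.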
LEARNING AND ENTROPY
How can we quantify the network’s learning process?
Decision Nodes
Measure the decision uncertainty for a given sample $x$
As the certainty of routing a sample increases, the sample will only be routed to a small subset of the available decision nodes with reasonably high probability
[Figure: histograms of the $d_n$ response on the validation set after 100, 500 and 1K epochs; x-axis: $d_n$ output values, y-axis: histogram counts]
Leaf Entropy
Measure the entropy of the leaf posterior distribution
Highly peaked distributions for the leaf predictors lead to low entropy
ℋ(flat distribution) > ℋ(peaked distribution)
[Figure: average leaf entropy (bits) during training vs. number of training epochs]
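The leaf entropy measure can be sketched with the standard Shannon entropy in bits. The leaf distributions below are made up to contrast a flat leaf with a peaked one; they are not values from the experiments.

```python
from math import log2


def entropy_bits(dist):
    """Shannon entropy in bits; zero-probability classes contribute nothing."""
    return sum(-p * log2(p) for p in dist if p > 0)


def average_leaf_entropy(pi):
    """Mean entropy over all leaf distributions of a tree."""
    return sum(entropy_bits(leaf) for leaf in pi) / len(pi)


flat_leaf = [0.25, 0.25, 0.25, 0.25]      # untrained leaf: maximally uncertain
peaked_leaf = [0.97, 0.01, 0.01, 0.01]    # trained leaf: almost one class
print(entropy_bits(flat_leaf), entropy_bits(peaked_leaf))
print(average_leaf_entropy([flat_leaf, peaked_leaf]))
```

Tracking `average_leaf_entropy` after every epoch would produce a curve of the kind described above: it falls as the leaves become more peaked.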
RESULTS 1
Algorithms
ADF - state-of-the-art stand-alone, off-the-shelf forest ensemble
sNDF - 1 fully connected layer, no hidden layers
RESULTS 2
Architecture
GoogLeNet* - GoogLeNet implementation from the Distributed (Deep) Machine Learning Common (DMLC) library
dNDF.NET - replaces each softmax layer in GoogLeNet* with a random forest consisting of 10 trees
CONCLUSIONS
Novel algorithm for learning Random Forest - sNDF (shallow neural decision forest)
A model unifying representation learning and classification using random forests - dNDF.NET (deep neural decision forest)
Train dNDFs - 2-step stochastic gradient descent
Prediction function
Routing function
No dramatic improvement in accuracy compared to regular GoogLeNet
RECAP
Before: Decision trees and random forests are efficient classifiers
CNNs are state of the art at feature extraction and classification
In “Deep Neural Decision Forests”, ICCV 2015:
All softmax layers are replaced with a random forest
GoogLeNet variation
A two-step SGD is defined for learning both the decision and prediction functions
Trained E2E, achieved (slightly) better results
Now: In “Decision Forests, Convolutional Networks and the Models in-Between”
Microsoft Research Technical Report arXiv 3 Mar. 2016
Generalize DT and CNN as Conditional Networks using routers
Reduce the compute cost of state-of-the-art architectures while maintaining accuracy
Yani Ioannou
Peter Kontschieder
SAVE THE PLANET / YOUR PHONE (MOTIVATION)
A single VGG16 forward pass uses ~30 GFLOPs
The top-ranked energy-efficient supercomputer (HPC) delivers ~10 GFLOPS/Watt
https://www.top500.org/green500/
100,000,000 US searches for an image on their cloud ~ 300 MW
After one hour:
Energy equivalent to ~45 tons of coal
https://www.euronuclear.org/info/encyclopedia/coalequivalent.htm
Nomophobia (from Wikipedia, the free encyclopedia)
is a proposed name for the phobia of being out of mobile phone contact. It is, however, arguable that the word "phobia" is misused and that in the majority of cases it is another form of anxiety disorder. Although nomophobia does not appear in the current Diagnostic and Statistical Manual of Mental Disorders, Fifth Edition (DSM-5), it has been proposed as a "specific phobia", based on definitions given in the DSM-IV.
MOTIVATION
• Neural networks are becoming deeper and more complex – carrying a quickly growing computational cost
• We would like to make more efficient neural networks by introducing ideas from decision trees
• Decide on the fly how accurate/efficient you want your prediction to be (trade-off)
[Figure: top-1 accuracy on ImageNet vs. number of operations (GFLOPs); marker size is the number of parameters. Source: https://arxiv.org/abs/1605.07678]
DECISION TREES X DEEP NEURAL NETS: TAKING A CLOSER LOOK
Actually they are similar…
DT | CNN
Decision nodes | ReLU
Random forest | Ensembles
Prediction nodes | Softmax
Deactivating branches | Dropout
More Efficient | More Accurate
But how do we combine them?
- Generalize both as Conditional Networks
POC - FROM NET TO TREE
Take 2 consecutive layers from a trained CNN (VGG)
Calculate the cross-correlation matrix of the 2 layers of a fully connected neural network
Rearrange as a block matrix (grouping the higher cross-correlation values)
Decorrelate by zeroing the block off-diagonal elements
Replot the net with the branched structure
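A small sketch of the "zero the block off-diagonal elements" step only. It assumes the units of both layers have already been assigned to groups (the rearranging step, for example by clustering, is not shown), and the matrix values are made up.

```python
def zero_off_diagonal_blocks(corr, row_groups, col_groups):
    """Keep corr[i][j] only if unit i and unit j belong to the same group.

    corr        cross-correlation matrix between the units of two layers
    row_groups  hypothetical group id of each unit in the first layer
    col_groups  hypothetical group id of each unit in the second layer
    """
    return [
        [v if row_groups[i] == col_groups[j] else 0.0 for j, v in enumerate(row)]
        for i, row in enumerate(corr)
    ]


# Made-up 4x4 matrix that is already arranged into two 2x2 blocks.
corr = [
    [0.9, 0.8, 0.1, 0.0],
    [0.7, 0.9, 0.0, 0.2],
    [0.1, 0.0, 0.8, 0.9],
    [0.0, 0.1, 0.9, 0.7],
]
groups = [0, 0, 1, 1]
for row in zero_off_diagonal_blocks(corr, groups, groups):
    print(row)
```

After this step the two groups no longer interact, so the two layers can be redrawn as two independent branches.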
INTRODUCING THE ROUTER NODE
Implemented here as a perceptron, though other choices are possible
Outputs real-valued weights that affect data routing:
Explicit Routing – data is sent conditionally to a single route or to multiple routes
Implicit Routing – data is sent unconditionally but selectively to all child nodes
Hard Routing – binary weights on branches (on/off)
Soft Routing – real weights on branches
[Figure: data enters a split node; the router outputs the weights r(1) and r(2) for the outgoing branches; the partial derivative is shown for soft routing]
Quiz: where are DTs?
The generalization is called a Conditional Network
     | Explicit | Implicit
Hard | DT       |
Soft |          |
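A minimal sketch of a perceptron router with two ways of using its weights. The softmax normalization of the router scores and the scalar branch outputs are assumptions made for illustration; the branch functions are placeholders for sub-networks.

```python
import math


def softmax(scores):
    m = max(scores)                                  # subtract max for stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def router(x, weights, biases):
    """Perceptron router: one linear score per branch, normalized to weights."""
    scores = [sum(w_i * x_i for w_i, x_i in zip(w, x)) + b
              for w, b in zip(weights, biases)]
    return softmax(scores)


def soft_route(x, branches, r):
    """Soft routing: every branch runs and the outputs are blended by r."""
    return sum(r_k * branch(x) for r_k, branch in zip(r, branches))


def hard_route(x, branches, r):
    """Hard, explicit routing: only the highest-weighted branch is evaluated."""
    k = max(range(len(r)), key=r.__getitem__)
    return branches[k](x)


# Placeholder branches standing in for two sub-networks.
branches = [lambda x: sum(x), lambda x: 10.0 * sum(x)]
x = [1.0, 2.0]
r = router(x, weights=[[0.5, -0.5], [-0.5, 0.5]], biases=[0.0, 0.0])
print(r, soft_route(x, branches, r), hard_route(x, branches, r))
```

The hard variant saves computation because the unselected branch is never run, which is the decision-tree corner of the table above.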
EXPERIMENT – CONDITIONAL GOOGLE-NET
Ensemble/Random forest architecture
Based on two GoogLeNets: a regular one and one with 10x oversampling.
This time we learn an explicit router based on a simple CNN
The router is trained jointly to predict the accuracy of each route for each image.
Purple Dots: the original networks’ accuracies.
Dashed Line: accuracy when choosing between the networks at random
Green Line: amortized cost-to-accuracy curve on the validation set
Green Point: operating point where we achieve almost the accuracy of the 10x oversampled CNN with less than half the computational cost.
We can decide at test time what accuracy we require.
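The dashed "choose at random" baseline is a straight line between the two networks, which the following sketch makes explicit. The cost and accuracy numbers are placeholders, not results from the experiment; a learned router is useful exactly when its operating points lie above this line.

```python
def random_mix(q, cheap, costly):
    """Expected (cost, accuracy) when a fraction q of images uses the costly net.

    cheap and costly are (relative_cost, accuracy) pairs; routing is random,
    so both quantities interpolate linearly in q.
    """
    cost = (1 - q) * cheap[0] + q * costly[0]
    acc = (1 - q) * cheap[1] + q * costly[1]
    return cost, acc


# Placeholder operating points: a single-crop net and a 10x oversampled net.
cheap = (1.0, 0.60)
costly = (10.0, 0.80)
for q in (0.0, 0.25, 0.5, 1.0):
    print(q, random_mix(q, cheap, costly))
```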
EFFICIENCY BENEFITS OF IMPLICIT ROUTING
Top: A standard CNN (one route). Bottom: A two-route implicit architecture.
The larger boxes denote feature maps, the smaller ones the filters
Due to branching, the depth of the second set of kernels (in yellow)
changes between the two architectures yielding lower computational cost.
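The saving from branching can be checked by counting multiply-accumulate operations of a convolution layer. The sketch assumes stride 1 and an output feature map of the same spatial size as the input, with channels split evenly between routes; the layer sizes are made up.

```python
def conv_macs(h, w, k, c_in, c_out, routes=1):
    """Multiply-accumulates of a k x k convolution on an h x w feature map.

    With several routes, each route sees c_in / routes input channels and
    produces c_out / routes output channels, so its kernels are shallower.
    """
    per_route = h * w * k * k * (c_in // routes) * (c_out // routes)
    return routes * per_route


single = conv_macs(56, 56, 3, 128, 128, routes=1)
branched = conv_macs(56, 56, 3, 128, 128, routes=2)
print(single, branched, branched / single)
```

With two routes the kernel depth halves while the total number of filters stays the same, so this layer needs half the operations and half the weights.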
EXPERIMENT – CONDITIONAL VGG11
Based on VGG11 with an additional global max pooling layer after the last convolutional layer.
Implemented as DAG
Split features into 2
Matching the original VGG11 top-5 error with less than half the compute (45%) and almost one-fifth (21%) of the parameters.
Training from scratch took twice the epochs but the overall time remained the same due to the decrease in
computations.
TL;DR
Decision trees are efficient and CNNs are accurate
Conditional NNs are the generalization of both
Trade-off - we try to find the sweet spot combining the two
By using Implicit Routing:
we could achieve a 50% reduction in computational and memory cost.
By using Explicit Routing:
we could achieve a 50% reduction in computational cost at the same accuracy
Decide on the fly how accurate/costly we want to be
*If you aren’t more accurate maybe you’re more efficient