Beyond Neural Network: New Algorithms for
Classification and Prediction
MAHESH PAL
Department of Civil Engineering
National Institute of Technology
Kurukshetra, 136119, INDIA
Neural network
Support vector machines
Relevance vector machines
Random forest classifier
Extreme learning machines
3D GEOLOGICAL MODELING: SOLVING AS A CLASSIFICATION PROBLEM
WITH THE SUPPORT VECTOR MACHINE
3-D SEISMIC-BASED LITHOLOGY PREDICTION USING IMPEDANCE
INVERSION AND NEURAL NETWORKS APPLICATION: CASE-STUDY
FROM THE MANNVILLE GROUP IN EAST-CENTRAL ALBERTA, CANADA
EVALUATING CLASSIFICATION TECHNIQUES FOR MAPPING VERTICAL
GEOLOGY USING FIELD-BASED HYPERSPECTRAL SENSORS
FLOW UNIT PREDICTION WITH LIMITED PERMEABILITY DATA USING
ARTIFICIAL NEURAL NETWORK ANALYSIS (WVU, PhD, 2002)
SUBSURFACE CHARACTERIZATION WITH SUPPORT VECTOR
MACHINES
SUPPORT VECTOR MACHINES FOR DELINEATION OF GEOLOGIC FACIES FROM POORLY DIFFERENTIATED DATA
SUPERIORITIES OF SUPPORT VECTOR MACHINE IN FRACTURE
PREDICTION AND GASSINESS EVALUATION
DYNAMICS OF WATER TRANSPORT THROUGH CATCHMENT OF DANUBE
RIVER TRACED BY 3H AND 18O - THE NEURAL NETWORK APPROACH
A COMBINED STABLE ISOTOPE AND MACHINE LEARNING APPROACH TO
QUANTIFY AND CLASSIFY NITRATE POLLUTION SOURCES IN WATER
USING GEOCHEMISTRY AND NEURAL NETWORKS TO MAP GEOLOGY
UNDER GLACIAL COVER
POROSITY AND PERMEABILITY ESTIMATION USING NEURAL NETWORK
APPROACH FROM WELL LOG DATA
ILLINOIS STATEWIDE MONITORING WELL NETWORK FOR PESTICIDES IN
SHALLOW GROUNDWATER (AQUIFER SENSITIVITY TO CONTAMINATION
BY PESTICIDE LEACHING USING NN).
APPLICATION OF ARTIFICIAL NEURAL NETWORKS IN HYDROGEOLOGY:
IDENTIFICATION OF UNKNOWN POLLUTION SOURCES IN
CONTAMINATED AQUIFERS
Classification has been a major research area using remote sensing images.
A major input in GIS based studies.
Several approaches are used.
Classification Algorithms
Supervised - requires labelled training data
Unsupervised - searches for natural groups of data, called clusters
Parametric
Maximum likelihood classifier
Nonparametric
Neural network, support vector machines, relevance vector machines, random forest classifier, extreme learning machine
For classification/regression, a training sample is made available to the learning algorithm (neural network, SVM, RVM, random forest, extreme learning machine, etc.).
After training, the learning algorithm outputs a model or function, which is called the hypothesis.
This hypothesis can be considered as a machine that outputs the prediction for new test data.
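This train-then-predict workflow can be sketched in a few lines of scikit-learn (an illustrative toy example, not from the talk; the data and the choice of SVC as the learner are arbitrary):

```python
# Minimal sketch of the train -> hypothesis -> predict workflow
# (illustrative toy data and classifier, not the talk's experiments).
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, n_features=4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

learner = SVC(kernel='rbf')                  # the learning algorithm
hypothesis = learner.fit(X_train, y_train)   # training outputs the model/hypothesis

predictions = hypothesis.predict(X_test)     # the hypothesis predicts new test data
print(hypothesis.score(X_test, y_test))
```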
(Flow diagram: training samples → learning algorithm → model/function, also called the hypothesis; testing samples → hypothesis → output values. The hypothesis can be considered as a machine that provides the prediction for test data.)
Neural Network
A major research area during 1990-2000 for classification/regression; still in use.
No assumption about data distribution.
Works well with different data, including remote sensing data.
(Diagram: a feed-forward neural network with an input layer, a hidden layer, and an output layer; weights $w_{ij}$ connect the input layer to hidden node $j$, and weights $w_k$ connect hidden node $k$ to the output layer.)
The interconnecting weights are determined during the training process.
A number of algorithms can be used to adjust the interconnecting weights.
Back-propagation is the most commonly used method.
The error between actual and predicted values is fed backwards through the network towards the input layer.
Connecting weights change in relation to the magnitude of the error.
Back-propagation uses an iterative process to minimize the error.
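For illustration, a minimal one-hidden-layer network trained by back-propagation in plain NumPy (a sketch on made-up data; the learning rate, layer sizes, and iteration count are arbitrary choices, and no momentum term is used):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                              # toy inputs
y = (X[:, 0] + X[:, 1] > 0).astype(float).reshape(-1, 1)   # toy targets

W1 = rng.normal(scale=0.5, size=(3, 5))   # input -> hidden weights
W2 = rng.normal(scale=0.5, size=(5, 1))   # hidden -> output weights
lr = 0.1                                  # learning rate

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

for it in range(1000):                    # iterative error minimisation
    h = sigmoid(X @ W1)                   # hidden activations
    out = sigmoid(h @ W2)                 # network predictions
    err = out - y                         # error between predicted and actual
    # feed the error backwards; weight changes scale with its magnitude
    grad_out = err * out * (1 - out)
    grad_h = (grad_out @ W2.T) * h * (1 - h)
    W2 -= lr * h.T @ grad_out
    W1 -= lr * X.T @ grad_h

print(np.mean((out > 0.5) == y))          # training accuracy
```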
Problems
Identifying user-defined parameters:
Number of hidden layers and nodes
Learning rate
Momentum factor
Number of iterations
Local minima, due to the use of a non-convex, unconstrained minimization problem
http://mnemstudio.org/neural-networks-multilayer-perceptron-design.htm
Support Vector Machines (SVM)
Basic theory: 1965. Margin-based classifier: 1992.
Support vector network: 1995.
Since 1998, the support vector network has been called the Support Vector Machine (SVM) - used as an alternative to neural networks.
First application: Gualtieri and Cromp (1998), for hyperspectral image classification.
SVM: structural risk minimisation (SRM), from the statistical learning theory proposed in the 1960s by Vapnik and co-workers.
SRM: minimise the probability of misclassifying unknown data drawn randomly.
Neural network: empirical risk minimisation - minimise the misclassification error on training data.
SVM
Map data from the original input feature space to a very high dimensional (even infinite) feature space.
Data becomes linearly separable, but the problem becomes computationally difficult to solve.
A kernel function allows SVM to work in the feature space without knowing the mapping and the dimensionality of the feature space.
A Kernel Function:
SVM kernels need to satisfy Mercer's theorem: any continuous, symmetric, positive semi-definite kernel function can be expressed as a dot product in a high-dimensional space.
The linear classification in the new space is equivalent to non-linear classification in the original space.
$K(\mathbf{x}_i, \mathbf{x}_j) = \phi(\mathbf{x}_i) \cdot \phi(\mathbf{x}_j)$
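This identity can be checked numerically. For the degree-2 polynomial kernel $K(\mathbf{x}, \mathbf{y}) = (\mathbf{x} \cdot \mathbf{y})^2$ on 2-D inputs, the explicit map is $\phi(\mathbf{x}) = (x_1^2, \sqrt{2}\,x_1 x_2, x_2^2)$. A NumPy sketch (illustrative, not from the talk) confirming that the kernel equals the dot product of the mapped vectors:

```python
import numpy as np

def phi(x):
    # explicit feature map for the degree-2 homogeneous polynomial kernel in 2-D
    return np.array([x[0]**2, np.sqrt(2) * x[0] * x[1], x[1]**2])

x = np.array([1.0, 2.0])
y = np.array([3.0, 0.5])

k_implicit = (x @ y) ** 2       # kernel evaluated in the original 2-D space
k_explicit = phi(x) @ phi(y)    # dot product in the 3-D feature space

print(k_implicit, k_explicit)   # both print 16.0
```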
Linearly separable classes
For a 2-class classification problem, training patterns are linearly separable if:
$\mathbf{w} \cdot \mathbf{x}_i + b \geq +1$ for all $y_i = +1$
$\mathbf{w} \cdot \mathbf{x}_i + b \leq -1$ for all $y_i = -1$
$\mathbf{w}$ provides the orientation of the discriminating plane and $b$ the offset from the origin.
The classification function will be:
$f_{\mathbf{w},b}(\mathbf{x}) = \mathrm{sign}(\mathbf{w} \cdot \mathbf{x} + b)$
To classify the dataset, there can be a large number of discriminating planes.
SVM tries to find the plane farthest from both classes.
Assume two supporting planes and maximise the distance (called the margin) between them.
A plane supports a class if all points in that class are on one side of that plane. This is posed as a convex optimisation problem.
Push the parallel planes apart until they collide with a few data points from each class.
These data points are called support vectors.
The other training examples are of no use.
(Figure: the optimal hyperplane, the weight vector $\mathbf{w}$ from the origin, the supporting plane $\mathbf{w} \cdot \mathbf{x} + b = 1$, and the margin between the supporting planes.)
The margin is defined by $2/\|\mathbf{w}\|$.
Maximising the margin is equivalent to minimising the following quadratic program:
$\min_{\mathbf{w},b} \ \tfrac{1}{2}\|\mathbf{w}\|^2$
subject to $y_i(\mathbf{w} \cdot \mathbf{x}_i + b) - 1 \geq 0$
Solved by QP techniques using Lagrange multipliers:
$L = \sum_i \alpha_i - \tfrac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j (\mathbf{x}_i \cdot \mathbf{x}_j)$ for $\alpha_i \geq 0$
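As a sketch of this optimisation in practice, scikit-learn's SVC exposes the fitted $\mathbf{w}$, $b$, and the support vectors chosen by the QP solution (illustrative toy data; the very large C approximates the hard-margin case):

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(-2, 0.5, (20, 2)),    # class -1 cluster
               rng.normal(+2, 0.5, (20, 2))])   # class +1 cluster
y = np.array([-1] * 20 + [1] * 20)

clf = SVC(kernel='linear', C=1e6).fit(X, y)     # huge C ~ hard margin

w, b = clf.coef_[0], clf.intercept_[0]
print("w =", w, " b =", b)
print("margin width 2/||w|| =", 2 / np.linalg.norm(w))
print("support vectors:\n", clf.support_vectors_)  # the few points defining the plane
```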
Linearly Non-separable data
New optimisation problem:
$\min_{\mathbf{w},b,\xi_1,\ldots,\xi_k} \ \tfrac{1}{2}\,\mathbf{w} \cdot \mathbf{w} + C \sum_{i=1}^{k} \xi_i$
subject to $y_i(\mathbf{w} \cdot \mathbf{x}_i + b) - 1 + \xi_i \geq 0$, with $\xi_i \geq 0$.
C is a positive constant ($C > 0$); larger C means a higher penalty on errors.
Cortes and Vapnik (1995)
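The effect of C can be probed by refitting the same overlapping data with different values (an illustrative sketch; the data and the C grid are arbitrary). A larger C penalises slack more heavily and typically leaves fewer support vectors:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(2)
X = np.vstack([rng.normal(-1, 1.0, (50, 2)),    # overlapping clusters ->
               rng.normal(+1, 1.0, (50, 2))])   # not linearly separable
y = np.array([-1] * 50 + [1] * 50)

for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel='linear', C=C).fit(X, y)
    # larger C = higher penalty on margin violations, usually fewer support vectors
    print(f"C={C:>6}: {clf.n_support_.sum()} support vectors")
```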
Nonlinear SVM
Final classification function:
$f(\mathbf{x}) = \mathrm{sign}\!\left(\sum_i \alpha_i y_i K(\mathbf{x}_i, \mathbf{x}) + b\right)$
where the $\alpha_i$ maximise the dual
$L = \sum_i \alpha_i - \tfrac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j K(\mathbf{x}_i, \mathbf{x}_j)$
Nonlinear classification via linear separation in higher dimensional space:
http://www.youtube.com/watch?v=9NrALgHFwTo
SVM with polynomial kernel visualization:
http://www.youtube.com/watch?v=3liCbRZPrZA
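A short scikit-learn sketch of the nonlinear case (the dataset and gamma value are illustrative): an RBF kernel separates two interleaved half-moons that a linear kernel cannot:

```python
from sklearn.datasets import make_moons
from sklearn.svm import SVC

X, y = make_moons(n_samples=200, noise=0.15, random_state=0)

linear = SVC(kernel='linear').fit(X, y)         # a single plane in input space
rbf = SVC(kernel='rbf', gamma=1.0).fit(X, y)    # linear separation in feature space

print("linear kernel accuracy:", linear.score(X, y))
print("RBF kernel accuracy:   ", rbf.score(X, y))
```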
Advantages
Margin theory suggests no effect of the dimensionality of the input space.
Uses a smaller number of the training data (called support vectors).
QP solution, so no chance of local minima.
Not many user-defined parameters.
But with real data:
(Plot: classification accuracy (%) versus number of features (5-65), for training set sizes of 8, 15, 25, 50, 75, and 100 pixels per class.)
Mahesh Pal and Giles M. Foody, 2010, Feature selection for classification of hyperspectral data by SVM. IEEE Transactions on Geoscience and Remote Sensing, 48(5), 2297-2306.
Training set size per class:             8 pixels     15 pixels    25 pixels    50 pixels    75 pixels    100 pixels
Peak accuracy, % (number of features):   74.79 (35)   81.21 (35)   84.45 (35)   88.47 (40)   91.13 (50)   92.53 (50)
Accuracy with 65 features (%):           69.79        77.05        81.66        87.58        90.63        91.76
Difference in accuracy (%):              5.00         4.16         2.79         0.89         0.50         0.77
Z value:                                 6.04         5.35         4.02         1.69         1.48         2.22
Disadvantages
Designed for two-class problems; different methods are needed to create a multi-class classifier (see the sketch after this list).
Choice of kernel function and kernel-specific parameters.
The kernel function is required to satisfy the Mercer condition.
Choice of parameter C.
Output is not naturally probabilistic.
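The sketch referenced above: scikit-learn's SVC builds one-against-one classifiers internally, and an explicit one-against-rest wrapper is also available (illustrative data; not the experiments reported on the next slide):

```python
from sklearn.datasets import load_iris
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)               # a 3-class problem

ovo = SVC(kernel='rbf')                         # SVC builds one-against-one pairs itself
ovr = OneVsRestClassifier(SVC(kernel='rbf'))    # explicit one-against-rest strategy

print("one-against-one: ", ovo.fit(X, y).score(X, y))
print("one-against-rest:", ovr.fit(X, y).score(X, y))
```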
Multiclass results

Multiclass approach             Classification accuracy (%)   Training time
One against one                 87.90                         6.4 sec
One against rest                86.55                         30.37 sec
Directed acyclic graph          87.63                         6.5 sec
Bound constrained approach      87.29                         79.6 sec
Crammer and Singer approach     87.43                         347 min 18 sec
ECOC (exhaustive approach)      89.00                         806.6 min
Choice of kernel function
Parameter selection
Grid search and trial & error methods: the most commonly used approaches, but computationally expensive.
Other approaches:
Genetic algorithms
Particle swarm optimization
Their combination with grid search.
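A minimal grid-search sketch with scikit-learn (the C/gamma ranges are arbitrary; real grids are usually much finer, which is what makes the search expensive):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# exhaustive search over a small C/gamma grid with cross-validation
grid = GridSearchCV(SVC(kernel='rbf'),
                    param_grid={'C': [0.1, 1, 10, 100],
                                'gamma': [0.01, 0.1, 1]},
                    cv=5)
grid.fit(X, y)
print(grid.best_params_, grid.best_score_)
```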
SVR
http://www.saedsayad.com/support_vector_machine_reg.htm
Relevance Vector Machines
Based on a Bayesian formulation of a linear model (Tipping, 2001).
Produces a sparser solution than that of SVM (i.e. a smaller number of relevance vectors).
Ability to use non-Mercer kernels.
Probabilistic output.
No need to define the parameter C.
For a 2-class problem, the maximum a posteriori estimate of the weights can be obtained by maximizing the following objective function:
$f(w_1, w_2, \ldots, w_n) = \sum_{i=1}^{n} \log p(c_i \mid w_i) + \sum_{i=1}^{n} \log p(w_i)$
http://www.cs.uoi.gr/~tzikas/papers/EURASIP06.pdf
http://www.tristanfletcher.co.uk/RVM%20Explained.pdf
RVM
The solution involves calculating the gradient of f with respect to w.
Only those training data having non-zero coefficients w_i (called relevance vectors) will contribute to the decision function.
An iterative analysis is followed to find the set of weights that maximizes the objective function.
Major difference from SVM
Selected points are anti-boundary (away from the boundary).
Support vectors represent the least prototypical examples (closer to the boundary, difficult to classify).
Relevance vectors are the most prototypical (more representative of the class).
Location of the useful training cases for
classifications by SVM & RVM
(Two scatter plots of Band 5 versus Band 1 for the wheat, sugar beet, and oilseed rape classes, showing the locations of the support vectors and of the relevance vectors.)
MAHESH PAL AND G.M. FOODY, Evaluation of SVM, RVM and SMLR for accurate image classification with limited ground data, IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 5(5), 2012.
Class (number of useful   Difference of two smallest   Mahalanobis distance to class centroid
training cases)           Mahalanobis distances        Wheat      Sugar beet   Oilseed rape

Support vectors
Wheat (4)                 4.8697                       15.8246    100.2179     10.9549
Sugar beet (8)            51.9803                      3.9906     47.6909      31.0740
Oilseed rape (7)          89.3444                      20.9320    6.2782       15.8113

Relevance vectors
Wheat (1)                 12.9498                      31.8135    171.6667     18.8637
Sugar beet (2)            68.8468                      4.4170     144.2734     64.4298
Oilseed rape (4)          112.0943                     35.5128    4.3981       31.1147
Disadvantages
Requires a large computation cost in comparison to SVM.
Designed for 2-class problems, similar to SVM.
Choice of kernel.
May have a problem of local minima.
Random forest algorithm
A multistage or hierarchical algorithm.
Breaks up a complex decision into a union of several simpler decisions.
Uses different subsets of features/data at various decision levels.
A tree-based algorithm.
(Diagram: a decision tree with a root node, internal nodes, and terminal nodes.)
A tree-based algorithm requires:
Splitting rules / tree creation (called attribute selection). Most popular are:
a) Gain ratio criterion (Quinlan, 1993)
b) Gini index (Breiman et al., 1984)
Termination rules / pruning rules. Most popular are:
a) Error-based pruning (Quinlan, 1993)
b) Cost-complexity pruning (Breiman et al., 1984)
Attribute selection measure   Information gain   Gain ratio   Gini index   Chi-square measure
Accuracy (%)                  83.7               84.54        83.9         83.65

Mahesh Pal and P.M. Mather, 2003, An Assessment of the Effectiveness of Decision Tree Methods for Land Cover Classification. Remote Sensing of Environment, 86, 554-565.
Random forest
An ensemble of tree-based classifiers.
Uses a random set of features (i.e. input variables) for each tree.
Uses a bootstrapped sample of the original data; a bootstrapped sample consists of ~63% of the original data.
The remaining ~37% is left out and called out-of-bag (OOB) data.
Multiclass, and requires no pruning.
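A hedged scikit-learn sketch of these ingredients - bootstrapped trees, a random feature subset per split, and the out-of-bag data reused as a built-in validation set (the dataset and parameter values are illustrative):

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)

rf = RandomForestClassifier(
    n_estimators=500,      # number of trees to grow
    max_features='sqrt',   # random subset of features tried at each split
    oob_score=True,        # score each tree on its ~37% out-of-bag samples
    random_state=0,
).fit(X, y)

print("OOB accuracy:", rf.oob_score_)
print("feature importances:", rf.feature_importances_)  # usable for feature selection
```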
Parameters
a) Number of trees to grow
b) Number of attributes (features) for each tree
(Plots: test data accuracy (%) versus the number of features used per tree - 87.78, 87.48, 88.37, 88.27, 88.07, and 87.92 for 1 to 6 features - and versus the number of trees, 0 to 14000.)
Mahesh Pal, 2005, Random Forest Classifier for Remote Sensing Classifications. International Journal of Remote Sensing, 26(1), 217-222.
Classification Results
Classifier used                  Random forest classifier   Support vector machines
Accuracy (%) and kappa value     88.37 (0.86)               87.9 (0.86)
Training time                    12.98 seconds on P-IV      0.30 minutes on Sun machine
Can be used for:
Feature selection
Clustering of data
Outlier detection
Prediction/regression
Can handle categorical data and data with missing values.
Performance comparable to SVM.
Computationally efficient.

Mahesh Pal, 2006, Support Vector Machines Based Feature Selection for land cover classification: a case study with DAIS Hyperspectral Data. International Journal of Remote Sensing, 27(14), 2877-2894.
Outliers
(Plot: outlier value for each of ~3000 samples, coloured by class 1-7.)
An outlier is an observation that lies at an abnormal distance from other values in the dataset.
Clustering
(Plot: 2nd scaling coordinate versus 1st scaling coordinate, with samples coloured by class 1-7, showing the clusters found.)
Extreme Learning Machines
Comparison of ELM with SVR for reservoir permeability prediction
Modelling permeability prediction using ELM
A neural network classifier.
Uses one hidden layer only.
No parameter except the number of hidden nodes.
Global solution.
Performance comparable to SVM and better than a back-propagation neural network.
Very fast.
http://www.ntu.edu.sg/home/egbhuang/pdf/ELM-WCCI2012.pdf
HUANG, G.-B., ZHU, Q.-Y. and SIEW, C.-K., 2006, Extreme learning machine: Theory and applications, Neurocomputing, 70, 489-501.
$\sum_{i=1}^{L} \beta_i \, g(\mathbf{w}_i \cdot \mathbf{x}_j + b_i) = o_j, \quad j = 1, \ldots, N$
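The formula suggests a very small implementation: the hidden weights $\mathbf{w}_i$ and biases $b_i$ stay random, and only the output weights $\beta$ are solved for by least squares. A NumPy sketch under those assumptions (toy data; the node count is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))                    # toy inputs
t = (X[:, 0] * X[:, 1] > 0).astype(float)        # toy targets

L = 50                                           # number of hidden nodes
W = rng.normal(size=(4, L))                      # random input weights (never trained)
b = rng.normal(size=L)                           # random biases

H = 1.0 / (1.0 + np.exp(-(X @ W + b)))           # hidden layer output matrix
beta = np.linalg.pinv(H) @ t                     # output weights by least squares

pred = (H @ beta > 0.5).astype(float)
print("training accuracy:", np.mean(pred == t))
```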
Disadvantages
Weights are randomly assigned, so there is a large variation in accuracy using the same number of hidden nodes over different trials.
Difficult to replicate results.

Mahesh Pal, 2009, Extreme learning machine based land cover classification, International Journal of Remote Sensing, 30(14), 3835-3841.
(Plot: classification accuracy (%) versus the number of hidden nodes (25-450) for the extreme learning machine (training time 1.25 sec) and a back-propagation neural network (336.20 sec).)
Kernelised ELM
A kernel function can be used in place of the hidden layer by modifying the optimization problem. Multiclass.
Can be used for classification and regression.
The same kernel functions as used with SVM/RVM can be used.
Encouraging results for classification and prediction - better than SVM in terms of accuracy and computational cost.
Huang, G.-B., Zhou, H., Ding, X. and Zhang, R., 2012, Extreme Learning Machine for Regression and Multiclass Classification. IEEE Transactions on Systems, Man, and Cybernetics - Part B: Cybernetics, 42, 513-529.
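Following the regularised formulation in Huang et al. (2012), predictions take the form $f(\mathbf{x}) = [K(\mathbf{x}, \mathbf{x}_1), \ldots, K(\mathbf{x}, \mathbf{x}_N)]\,(\mathbf{I}/C + \mathbf{K})^{-1}\mathbf{T}$, where $\mathbf{K}$ is the kernel matrix over the training data. A NumPy sketch under that reading (toy data; the kernel width and C are arbitrary):

```python
import numpy as np

def rbf_kernel(A, B, gamma=0.5):
    # pairwise RBF kernel matrix between rows of A and rows of B
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

rng = np.random.default_rng(0)
X = rng.normal(size=(150, 2))                         # toy training inputs
T = (X[:, 0] ** 2 + X[:, 1] ** 2 < 1).astype(float)   # toy targets

C = 10.0                                              # regularisation constant
K = rbf_kernel(X, X)
alpha = np.linalg.solve(np.eye(len(X)) / C + K, T)    # (I/C + K)^-1 T

X_new = rng.normal(size=(5, 2))
f = rbf_kernel(X_new, X) @ alpha                      # predictions for new points
print((f > 0.5).astype(int))
```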
No Free Lunch Theorem
No algorithm performs better than any other when their performance is averaged uniformly over all possible problems of a particular type (Wolpert and Macready, 1995).
An algorithm must be designed for a particular domain; there is no such thing as a general-purpose algorithm.
Hence the data-dependent nature of algorithm performance.
http://www.tristanfletcher.co.uk/SVM%20Explained.pdf
http://www.youtube.com/watch?v=eHsErlPJWUU
{SVM by Prof. Yaser Abu-Mostafa, Caltech}
http://www.youtube.com/watch?v=s8B4A5ubw6c
{SVM by Prof. Andrew Ng, Stanford}
http://videolectures.net/mlss03_tipping_pp/
{ RVM, Video lecture by Tipping}
http://www.ntu.edu.sg/home/egbhuang/pdf/ELM-WCCI2012.pdf
Questions?