ASSIGNMENT 2
EEL 709
DEEPALI JAIN
2012ee10082
Motive :
- Grid search for the linear, radial and polynomial kernels.
- Effect of varying the features used (15 -> 10 -> 5 -> further subsets within the best range).
- Binary classification: 4 pairs of classes analysed: {0,1}; {4,5}; {8,9}; {4,8}.
- Analysis of overfitting, best fit and underfitting.
- One-vs-one and one-vs-all multiclass classification.
- Scaling of feature values.
Approach
1. Cross-validate while varying C and the kernel parameter, initially in steps of log2(C) = 3, with 10-fold cross-validation.
2. Take the best parameters and vary them again in steps of log2(C) = 1 around the previously obtained values to obtain more accurate values.
3. Test the final parameters on the data without any cross-validation.
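The coarse-to-fine search in steps 1-2 can be sketched as below. Here `score` stands in for one 10-fold cross-validation run at a given log2(C); the function name and the default grid bounds are illustrative, not taken from the assignment.

```python
def coarse_to_fine(score, lo=-5, hi=16, coarse_step=3, fine_width=2):
    """Two-pass grid search over log2(C).

    score(log2c) -> cross-validation accuracy at C = 2**log2c.
    """
    # Pass 1: coarse grid in steps of log2(C) = 3
    coarse = range(lo, hi, coarse_step)
    best = max(coarse, key=score)
    # Pass 2: fine grid in steps of log2(C) = 1 around the coarse optimum
    fine = range(best - fine_width, best + fine_width + 1)
    return max(fine, key=score)
```

For two parameters (C and gamma, or C and degree) the same two-pass idea applies over a 2-D grid.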
ANALYSIS FOR EACH PAIR OF CLASSES :
I]. Classes : 0,1 ; Features : 1-15
Linear Kernel :
Overfitting occurs at C > 2^6; underfitting occurs at small C.
Radial Kernel :
Best C : 2^13, best g : 2^-9, accuracy = 99.4
Overfitting is not easily distinguished: for a given gamma, accuracy keeps increasing asymptotically with C over a very large range. This means that with the radial basis kernel, the hard-margin case can be approximated when all features are used. E.g., even for log2(gamma) = -6 and log2(C) = 190, accuracy is 99.2%.
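For reference, the radial-basis kernel used here is K(x, z) = exp(-gamma * ||x - z||^2); a minimal sketch (the function name is illustrative):

```python
import math

def rbf_kernel(x, z, gamma):
    # K(x, z) = exp(-gamma * ||x - z||^2).
    # Small gamma flattens the kernel (near-linear behaviour); large gamma
    # makes each training point influence only its local neighbourhood.
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * sq_dist)
```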
Polynomial Kernel :
Best C : 2^1, best degree : 2, accuracy = 99.6
Overfit at log2(C) > 4; underfit at log2(C) < -4.
[Figure: cross-validation accuracy contours (65-98.5%) for classes 0,1. Left: radial kernel, log2(gamma) (-14 to 2) vs log2(C) (-4 to 12). Right: polynomial kernel, degree (1-7) vs log2(C) (-4 to 12).]
| Features | Linear | Radial | Polynomial |
|----------|--------|--------|------------|
| 1-15 | Best C: 2^4, acc = 99.2 | Best C: 2^13, best g: 2^-9, acc = 99.4 | Best C: 2^1, best deg: 2, acc = 99.6 |
| 1-10 | Best C: 2^1, acc = 99 | Best C: 2^4, best g: 2^-6, acc = 99 | Best C: 2^1, best deg: 1, acc = 99.2 |
| 1-5 | Best C: 2^4, acc = 99.2 | Best C: 2^13, best g: 2^0, acc = 99.4 | Best C: 2^-2, best deg: 5, acc = 99.2 |
| 5-10 | Best C: 2^7, acc = 93.2 | Best C: 2^1, best g: 2^0, acc = 91.8 | Best C: 2^10, best deg: 2, acc = 92.3 |
| 10-15 | Best C: 2^4, acc = 98.6 | Best C: 2^1, best g: 2^0, acc = 99 | Best C: 2^1, best deg: 3, acc = 98.6 |
Observations:
Kernel: Changing the kernel function has no drastic effect on accuracy for any number of features; hence the sigmoid kernel was not used and the other parameters were analysed in more depth. A low-degree polynomial in general gives good enough results. The radial kernel's parameters are affected the most by a change in the number of features: as we go from 15 to 5 features, its best parameters vary widely, while the best C of the other two kernels is usually the same (2^4 and 2^1).
Fitting: With the radial basis kernel, the hard-margin case can be approximated when all features are used. For the linear and polynomial kernels, the underfit and overfit critical C occur at approximately 2^-5 and 2^5 respectively in all cases. Overfitting and underfitting are less prominent for the radial kernel (from the shape of the graph). For the polynomial kernel, overfitting and underfitting are prominent at lower degrees; at high degrees, accuracy is low for all C values.
Features: The best parameters do not vary much as the features are decreased from 15 to 10; however, decreasing further to 5 changes them more significantly. Ignoring the effect of combinations of features on accuracy (i.e., using a filter rather than a wrapper approach), features 1-5 and 10-15 appear to be the more important ones.
Next, we examine subsets of features from [1,5] and [10,15], using the linear kernel to get the accuracy for each subset.
| Features | Result |
|----------|--------|
| 2-5 | Best C: 2^4, acc = 93.8 |
| 1-4 | Best C: 2^7, acc = 95.6 |
| 10-14 | Best C: 2^10, acc = 97.2 |
| 11-15 | Best C: 2^4, acc = 98.8 |
| 11,14,15 | Best C: 2^4, acc = 98.6 |
Observations :
- Among [1,5], removing even a single feature substantially reduces accuracy.
- Features 11, 14, 15 alone give very high accuracy, and the parameter setting comes close to the many-feature case.
- When accuracy is good, the best C for the linear kernel is 2^4 irrespective of the features included.
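The filter-style comparison of feature blocks can be sketched as below; `evaluate` stands in for training a linear-kernel SVM with cross-validation on the given subset (the function and variable names are illustrative):

```python
def rank_feature_blocks(evaluate, blocks):
    """Rank feature subsets by a per-subset score.

    Filter approach: each block is scored on its own, ignoring
    interactions between features in different blocks.
    """
    scored = [(evaluate(b), b) for b in blocks]
    scored.sort(reverse=True)  # highest cross-validation accuracy first
    return [b for _, b in scored]
```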
II]. Classes : 4,5 ; Features : 1-15
Linear : overfit at C > 2^6; underfit at C < 2^(-4).
Radial : no prominent overfitting; with increasing C, accuracy never decreases substantially.
Polynomial : overfit at C > 2^4; underfit at C < 2^(-4).
| Features | Linear | Radial | Polynomial |
|----------|--------|--------|------------|
| 1-15 | Best C: 2^1, acc = 98.6028 | Best C: 2^1, best g: 2^-3, acc = 98.8024 | Best C: 2^-2, best deg: 3, acc = 98.8028 |
| 1-10 | Best C: 2^4, acc = 98.4032 | Best C: 2^13, best g: 2^-3, acc = 98.6028 | Best C: 2^4, best deg: 4, acc = 98.004 |
| 1-5 | Best C: 2^4, acc = 75.8483 | Best C: 2^7, best g: 2^-6, acc = 76.6467 | Poor accuracy |
| 5-10 | Best C: 2^4, acc = 94.61 | Best C: 2^7, best g: 2^-3, acc = 89.2216 | Poor accuracy |
| 10-15 | Best C: 2^1, acc = 98.60 | Best C: 2^1, best g: 2^-3, acc = 98.6028 | Best C: 2^3, best deg: 2, acc = 96.8064 |
Other Observations :
In this case, the best parameters for features [10,15] were closer to the all-feature ones than those for [1,10], i.e. features [10,15] are the most essential. When further subsets were taken, it was again found that {11,14,15} give accuracy 98.2% with best C = 2^(-2) for the linear kernel.
III]. Classes : 8,9
All features :
Linear : overfit at C > 2^4; underfit at C < 2^(-4).
Polynomial : overfit at C > 2^5; underfit at C < 2^(-6).
Features 1-10 :
Linear : overfit at C > 2^6; underfit at C < 2^(-4).
| Features | Linear | Radial | Polynomial |
|----------|--------|--------|------------|
| 1-15 | Best C: 2^4, acc = 95.01 | Best C: 2^10, best g: 2^-9, acc = 95.6088 | Best C: 2^4, best deg: 2, acc = 96.2076 |
| 1-10 | Best C: 2^4, acc = 95.8084 | Best C: 2^4, best g: 2^-3, acc = 95.01 | Best C: 2^7, best deg: 2, acc = 94.8104 |
| 1-5 | Best C: 2^1, acc = 72.6547 | Best C: 2^4, best g: 2^-3, acc = 72.2555 | Best C: 2^4, best deg: 2, acc = 68.0639 |
| 5-10 | Best C: 2^10, acc = 80.4391 | Best C: 2^7, best g: 2^-3, acc = 81.6367 | Poor accuracy |
| 10-15 | Best C: 2^1, acc = 94.6088 | Best C: 2^4, best g: 2^-3, acc = 95.2096 | Best C: 2^3, best deg: 3, acc = 94.2116 |
Other Observations :
Again, accuracy using all 15 features and using only the last 5 is almost the same; however, in the latter case we get a more complex model. Again, features {11,14,15} with the linear kernel give best C = 2^-2, accuracy = 92.2156.
IV]. Classes : 4,8
| Features | Linear | Radial | Polynomial |
|----------|--------|--------|------------|
| 1-15 | Best C: 2^1, acc = 97.4052 | Best C: 2^4, best g: 2^-3, acc = 97.6048 | Best C: 2^7, best deg: 2, acc = 97.2056 |
| 1-10 | Best C: 2^4, acc = 96.6068 | Best C: 2^13, best g: 2^-6, acc = 96.8064 | Best C: 2^1, best deg: 1, acc = 96.008 |
| 1-5 | Best C: 2^1, acc = 81.6367 | Best C: 2^-2, best g: 2^0, acc = 81.6367 | Poor accuracy |
| 5-10 | Best C: 2^1, acc = 83.8323 | | |
| 10-15 | Best C: 2^4, acc = 96.4072 | Best C: 2^3, best g: 2^-3, acc = 97.6096 | Best C: 2^6, best deg: 2, acc = 96.2151 |
Other Observations :
Features 10-15 are the most important. Again, features {11,14,15} with the linear kernel give best C = 2^4, accuracy = 95.4092.
ANALYSES OF PARAMETERS FOR DIFFERENT PAIRS OF CLASSES:
| Features | Classes | Linear | Radial | Polynomial |
|----------|---------|--------|--------|------------|
| 1-15 | {0,1} | Best C: 2^4, acc = 99.2 | Best C: 2^13, best g: 2^-9, acc = 99.4 | Best C: 2^1, best deg: 2, acc = 99.6 |
| 1-15 | {4,5} | Best C: 2^1, acc = 98.6028 | Best C: 2^1, best g: 2^-3, acc = 98.8024 | Best C: 2^-2, best deg: 3, acc = 98.8028 |
| 1-15 | {8,9} | Best C: 2^4, acc = 95.01 | Best C: 2^10, best g: 2^-9, acc = 95.6088 | Best C: 2^4, best deg: 2, acc = 96.2076 |
| 1-15 | {4,8} | Best C: 2^1, acc = 97.4052 | Best C: 2^4, best g: 2^-3, acc = 97.6048 | Best C: 2^7, best deg: 2, acc = 97.2056 |
| 1-10 | {4,5} | Best C: 2^4, acc = 98.4032 | Best C: 2^13, best g: 2^-3, acc = 98.6028 | Best C: 2^4, best deg: 4, acc = 98.004 |
| 1-10 | {0,1} | Best C: 2^1, acc = 99 | Best C: 2^4, best g: 2^-6, acc = 99 | Best C: 2^1, best deg: 1, acc = 99.2 |
| 1-10 | {8,9} | Best C: 2^4, acc = 95.8084 | Best C: 2^4, best g: 2^-3, acc = 95.01 | Best C: 2^7, best deg: 2, acc = 94.8104 |
| 1-10 | {4,8} | Best C: 2^4, acc = 96.6068 | Best C: 2^13, best g: 2^-6, acc = 96.8064 | Best C: 2^1, best deg: 1, acc = 96.008 |
| 10-15 | {0,1} | Best C: 2^4, acc = 98.6 | Best C: 2^1, best g: 2^0, acc = 99 | Best C: 2^1, best deg: 3, acc = 98.6 |
| 10-15 | {4,5} | Best C: 2^1, acc = 98.60 | Best C: 2^1, best g: 2^-3, acc = 98.6028 | Best C: 2^3, best deg: 2, acc = 96.8064 |
| 10-15 | {8,9} | Best C: 2^1, acc = 94.6088 | Best C: 2^4, best g: 2^-3, acc = 95.2096 | Best C: 2^3, best deg: 3, acc = 94.2116 |
| 10-15 | {4,8} | Best C: 2^4, acc = 96.4072 | Best C: 2^3, best g: 2^-3, acc = 97.6096 | Best C: 2^6, best deg: 2, acc = 96.2151 |
Observations :
- There is some correlation between the different pairs of classes; with fewer features, the similarity is higher.
- Linear : for any number of features, the best C is very similar in all cases.
- Radial : a higher C is usually preferred, while gamma changes somewhat arbitrarily.
- Polynomial : the degree obtained is low, and the best C shows more variation than for linear but less than for radial.
MULTICLASS :
ONE v/s ONE
| Features | Linear | Radial | Polynomial |
|----------|--------|--------|------------|
| 1-15 | Best C: 2^6, acc = 87.32 | Best C: 2^14, best g: 2^-9, acc = 89 | Best C: 2^4, best deg: 2, acc = 89.12 |
| 1-10 | Best C: 2^6, acc = 80.24 | Poor accuracy | Poor accuracy |
| 10-15 | Best C: 2^6, acc = 79.6 | Best C: 2^6, best g: 2^-3, acc = 71.52 | Best C: 2^4, best deg: 2, acc = 80.72 |
| 11,14,15 | Best C: 2^1, acc = 47.6 | | |
Observations :
Multiclass classification gives almost 10% lower accuracy. Also, there seems to be no strongly preferred set of features: although features 1-10 and features 10-15 give approximately the same accuracy, unlike in the binary classification these accuracies are much lower than the all-feature accuracy, so leaving out features does not make sense.
There is not a very large difference between the average binary-class parameters and the multiclass ones: e.g. linear gives best C near 2^4, radial gives a relatively high gamma near 2^-6, and polynomial gives a low degree.
One versus all :
| Class | Lin: best C | Rad: best C | Rad: best g | Poly: best C | Poly: best d |
|-------|-------------|-------------|-------------|--------------|--------------|
| 0 | 2^4 | 2^7 | 2^-6 | 2^4 | 7 |
| 1 | 2^4 | 2^7 | 2^-6 | 2^4 | 7 |
| 2 | 2^4 | 2^7 | 2^-6 | 2^4 | 7 |
| 3 | 2^7 | 2^13 | 2^-6 | 2^7 | 7 |
| 4 | 2^4 | 2^13 | 2^-9 | 2^7 | 7 |
| 5 | 2^7 | 2^10 | 2^-6 | 2^7 | 7 |
| 6 | 2^7 | 2^13 | 2^-9 | 2^7 | 7 |
| 7 | 2^7 | 2^13 | 2^-9 | 2^7 | 7 |
| 8 | 2^4 | 2^10 | 2^-6 | 2^4 | 7 |
| 9 | 2^4 | 2^13 | 2^-6 | 2^4 | 7 |
| Accuracy | 80.8% | 76.8% | | 78.24% | |
OVO vs OVA :
Accuracy with one-vs-all is lower than with one-vs-one. One-vs-one takes less computational time, since the dataset for each binary classifier is reduced. However, one-vs-all gives insight into the parameters of each class. Feature selection in OVA follows the same concept as in OVO, but the accuracy is lower than in OVO.
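A one-vs-one prediction combines the pairwise binary classifiers by majority vote; a minimal sketch, where `binary` maps each class pair to a trained decision function that returns the winning class (all names are illustrative):

```python
from itertools import combinations

def ovo_predict(x, classes, binary):
    # Each of the k*(k-1)/2 binary classifiers votes for one class;
    # the class with the most votes wins (ties broken by class order).
    votes = {c: 0 for c in classes}
    for a, b in combinations(classes, 2):
        votes[binary[(a, b)](x)] += 1
    return max(classes, key=lambda c: votes[c])
```

One-vs-all instead trains one classifier per class and picks the class whose decision value is largest, which is why it exposes per-class parameters.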
ANOTHER IDEA TO IMPROVE ACCURACY :
Scaling :
Idea :
Suppose feature F1 lies in the range (1, 2) and F2 in (1000, 2000) for a dataset. If we want the hyperplane to depend on the distribution of the points, we need to normalise. For example, consider the points (1.1, 1100) and (1.4, 1100): without scaling, the classifier will either be less sensitive to F1 or need a very large coefficient for it.
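The normalisation above can be sketched as column-wise min-max scaling to [-1, 1], the default range of libsvm's svm-scale tool (the function name is illustrative):

```python
def minmax_scale(rows, lo=-1.0, hi=1.0):
    # Scale each feature column to [lo, hi] using its own min and max.
    cols = list(zip(*rows))
    mins = [min(c) for c in cols]
    maxs = [max(c) for c in cols]
    scaled = []
    for row in rows:
        scaled.append([
            # Constant columns carry no information; pin them to lo.
            lo + (hi - lo) * (v - mn) / (mx - mn) if mx > mn else lo
            for v, mn, mx in zip(row, mins, maxs)
        ])
    return scaled
```

In practice the min and max must be computed on the training set and reused on the test set, so both are scaled consistently.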
Results (accuracy, %):

| Setting | Kernel | Non-scaled | Scaled |
|---------|--------|------------|--------|
| All 15 features, multiclass | Linear | 87.32 | 88.12 |
| | Radial | 89 | 88.68 |
| | Polynomial | 89.12 | 81.96 |
| Features 5-10, classes 0,1 | Linear | 93.4 | 94.4 |
| | Radial | 91.2 | 94.6 |
| | Polynomial | 92.3 | 93.4 |
| Features 5-10, classes 4,8 | Linear | 80.4 | 83.2669 |
| | Radial | 82.27 | 86.0558 |
| | Polynomial | 81.051 | 83.4661 |
Observations :
When the accuracy is already high, scaling gives only a very slight increase (hence results for the binary all-feature case are not shown). If the accuracy is low, scaling is a good option. For multiclass, scaling is less consequential.