
ASSIGNMENT 2
EEL 709
DEEPALI JAIN
2012ee10082

Motive:

Grid search for the linear, radial, and polynomial kernels.
Effect of varying the features used (15 -> 10 -> 5 -> further subsets within the best range).
Binary classification: 4 pairs of classes analysed: {0,1}; {4,5}; {8,9}; {4,8}.
Analysis of overfitting, best fit, and underfitting.
One-vs-one and one-vs-all multiclass classification.
Scaling of feature values.

Approach:

Cross-validate by varying C and the kernel parameter, initially in steps of log2(C) = 3, using 10-fold cross-validation.
Take the best parameters so obtained and vary them again in steps of log2(C) = 1 around those values to refine them.
Test the final parameters on the data without any cross-validation.
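A minimal sketch of this coarse-to-fine search, written here with scikit-learn's SVC (whether the original work used LIBSVM's grid tools or something else, the procedure is the same); X_train/y_train/X_test/y_test are hypothetical placeholders for the train/test split of one pair of classes:

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Coarse pass: log2(C) and log2(gamma) in steps of 3, 10-fold CV.
coarse = GridSearchCV(
    SVC(kernel="rbf"),
    param_grid={
        "C": 2.0 ** np.arange(-5, 16, 3),      # 2^-5 ... 2^13
        "gamma": 2.0 ** np.arange(-15, 4, 3),  # 2^-15 ... 2^3
    },
    cv=10,
)
coarse.fit(X_train, y_train)
c0 = np.log2(coarse.best_params_["C"])
g0 = np.log2(coarse.best_params_["gamma"])

# Fine pass: steps of 1 around the coarse optimum.
fine = GridSearchCV(
    SVC(kernel="rbf"),
    param_grid={
        "C": 2.0 ** np.arange(c0 - 2, c0 + 3),
        "gamma": 2.0 ** np.arange(g0 - 2, g0 + 3),
    },
    cv=10,
)
fine.fit(X_train, y_train)

# Final evaluation on held-out data, without cross-validation.
print(fine.best_params_, fine.score(X_test, y_test))
```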

ANALYSIS FOR EACH PAIR OF CLASSES:

I]. Classes: 0,1; Features: 1-15

Linear kernel:
Overfit occurs at C > 2^6; underfit occurs at small C.

Radial kernel:
Best C: 2^13, g: 2^-9, accuracy = 99.4.
Overfitting is not easily distinguished: for a given gamma, accuracy keeps increasing with C, asymptotically, over a very large range. This means that with the radial basis kernel, the hard-margin case can be approximated when all features are used. E.g., even for log2(gamma) = -6 and log2(C) = 190, the accuracy is 99.2%.

Polynomial kernel:
Best C: 2^1, deg: 2, accuracy = 99.6.
Overfit at C > 2^4; underfit at C < 2^(-4).

[Figure: cross-validation accuracy contours from the grid search. Left: radial kernel, log2(C) from -4 to 12 vs log2(gamma) from -14 to 2, accuracy contours 65-95%. Right: polynomial kernel, log2(C) from -4 to 12 vs degree 1-7, accuracy contours 95.5-98.5%.]

Features | Linear | Radial | Polynomial
1-15 | Best C: 2^4, accuracy = 99.2 | Best C: 2^13, g: 2^-9, accuracy = 99.4 | Best C: 2^1, deg: 2, accuracy = 99.6
1-10 | Best C: 2^1, accuracy = 99 | Best C: 2^4, g: 2^-6, accuracy = 99 | Best C: 2^1, deg: 1, accuracy = 99.2
1-5 | Best C: 2^4, accuracy = 99.2 | Best C: 2^13, g: 2^0, accuracy = 99.4 | Best C: 2^-2, deg: 5, accuracy = 99.2
5-10 | Best C: 2^7, accuracy = 93.2 | Best C: 2^1, g: 2^0, accuracy = 91.8 | Best C: 2^10, deg: 2, accuracy = 92.3
10-15 | Best C: 2^4, accuracy = 98.6 | Best C: 2^1, g: 2^0, accuracy = 99 | Best C: 2^1, deg: 3, accuracy = 98.6

Observations:

Kernel: Changing the kernel function has no drastic effect on accuracy for any number of features, so the sigmoid kernel was not used and the other parameters were analysed in more depth. A low-degree polynomial generally gives good enough results. The radial kernel's parameters are affected the most by a change in the number of features, while the best C of the other two kernels is usually the same (2^4 and 2^1). As we go from 15 features to 5, the radial kernel's parameters vary widely.

Fitting: With the radial basis kernel, the hard-margin case can be approximated when all features are used. For the linear and polynomial kernels, the underfit and overfit critical C occur at approximately 2^(-5) and 2^(+5) respectively in all cases. Overfitting and underfitting are less prominent for the radial kernel (from the shape of the graph). For the polynomial kernel, over- and underfitting are prominent at low degrees; at high degrees, accuracy is lower for all C values (a sketch for locating these critical C values follows these observations).

Features: The best parameters do not vary much as the features are reduced from 15 to 10; reducing further to 5 produces a more significant change. Ignoring the effect of combinations of two features on accuracy (i.e., using a filter rather than a wrapper approach), features 1-5 and 10-15 appear to be the more important ones.
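The critical C values quoted above can be located by scanning log2(C) at fixed kernel settings and watching where the cross-validated accuracy falls off its plateau; a minimal sketch, again with hypothetical X_train/y_train:

```python
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Scan log2(C) for a linear SVM: CV accuracy falls off the plateau
# at both ends (underfit at small C, overfit at large C).
for log2c in range(-8, 13):
    clf = SVC(kernel="linear", C=2.0 ** log2c)
    acc = cross_val_score(clf, X_train, y_train, cv=10).mean()
    print(f"log2(C) = {log2c:3d}   CV accuracy = {acc:.4f}")
```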

Now we consider subsets of features within [1,5] and [10,15], using the linear kernel to obtain the accuracy for each subset (a sketch of this evaluation loop follows the table).

Features | Results
2-5 | Best C: 2^4, accuracy = 93.8
1-4 | Best C: 2^7, accuracy = 95.6
10-14 | Best C: 2^10, accuracy = 97.2
11-15 | Best C: 2^4, accuracy = 98.8
11,14,15 | Best C: 2^4, accuracy = 98.6
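A sketch of the subset evaluation loop behind this table; the subset definitions mirror the rows above, X_train/y_train remain hypothetical placeholders, and X_train is assumed to be a NumPy array:

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Feature subsets from the table above (1-based indices).
subsets = {
    "2-5": list(range(2, 6)),
    "1-4": list(range(1, 5)),
    "10-14": list(range(10, 15)),
    "11-15": list(range(11, 16)),
    "11,14,15": [11, 14, 15],
}

for name, cols in subsets.items():
    idx = np.array(cols) - 1  # convert to 0-based column indices
    grid = GridSearchCV(
        SVC(kernel="linear"),
        param_grid={"C": 2.0 ** np.arange(-5, 16, 3)},
        cv=10,
    )
    grid.fit(X_train[:, idx], y_train)
    print(name, grid.best_params_, round(100 * grid.best_score_, 2))
```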

Observations:
Amongst [1,5], removing even a single feature reduces accuracy considerably. Features 11, 14, 15 alone give very high accuracy, and their parameter setting comes close to that of the larger feature sets. When accuracy is good, the best C for the linear kernel is 2^4 irrespective of the features included.

II]. Classes: 4,5; Features: 1-15

Linear: overfit at C > 2^6; underfit at C < 2^(-4).
Radial: no prominent overfitting; with increasing C, there is never a substantial decrease in accuracy.
Polynomial: overfit at C > 2^4; underfit at C < 2^(-4).

Features | Linear | Radial | Polynomial
1-15 | Best C: 2^1, accuracy = 98.6028 | Best C: 2^1, g: 2^-3, accuracy = 98.8024 | Best C: 2^-2, deg: 3, accuracy = 98.8028
1-10 | Best C: 2^4, accuracy = 98.4032 | Best C: 2^13, g: 2^-3, accuracy = 98.6028 | Best C: 2^4, deg: 4, accuracy = 98.004
1-5 | Best C: 2^4, accuracy = 75.8483 | Best C: 2^7, g: 2^-6, accuracy = 76.6467 | Poor accuracy
5-10 | Best C: 2^4, accuracy = 94.61 | Best C: 2^7, g: 2^-3, accuracy = 89.2216 | Poor accuracy
10-15 | Best C: 2^1, accuracy = 98.60 | Best C: 2^1, g: 2^-3, accuracy = 98.6028 | Best C: 2^3, deg: 2, accuracy = 96.8064

Other observations:
In this case, the best parameters for features [10,15] were closer to the all-feature ones than those for [1,10], i.e. features [10,15] are the most essential. When further subsets were taken, it was again found that {11,14,15} give accuracy 98.2%, with best C = 2^(-2) in the linear case.

III]. Classes: 8,9

All features:
Linear: overfit at C > 2^4; underfit at C < 2^(-4).
Polynomial: overfit at C > 2^5; underfit at C < 2^(-6).

Features 1-10:
Linear: overfit at C > 2^6; underfit at C < 2^(-4).

Features | Linear | Radial | Polynomial
1-15 | Best C: 2^4, accuracy = 95.01 | Best C: 2^10, g: 2^-9, accuracy = 95.6088 | Best C: 2^4, deg: 2, accuracy = 96.2076
1-10 | Best C: 2^4, accuracy = 95.8084 | Best C: 2^4, g: 2^-3, accuracy = 95.01 | Best C: 2^7, deg: 2, accuracy = 94.8104
1-5 | Best C: 2^1, accuracy = 72.6547 | Best C: 2^4, g: 2^-3, accuracy = 72.2555 | Best C: 2^4, deg: 2, accuracy = 68.0639
5-10 | Best C: 2^10, accuracy = 80.4391 | Best C: 2^7, g: 2^-3, accuracy = 81.6367 | Poor accuracy
10-15 | Best C: 2^1, accuracy = 94.6088 | Best C: 2^4, g: 2^-3, accuracy = 95.2096 | Best C: 2^3, deg: 3, accuracy = 94.2116

Other observations:
Again, accuracy using all 15 features and the last 5 features is almost the same; however, the latter gives a more complex model. Again, features {11,14,15} with the linear kernel give best C = 2^-2, accuracy = 92.2156.

IV]. Classes: 4,8

Features | Linear | Radial | Polynomial
1-15 | Best C: 2^1, accuracy = 97.4052 | Best C: 2^4, g: 2^-3, accuracy = 97.6048 | Best C: 2^7, deg: 2, accuracy = 97.2056
1-10 | Best C: 2^4, accuracy = 96.6068 | Best C: 2^13, g: 2^-6, accuracy = 96.8064 | Best C: 2^1, deg: 1, accuracy = 96.008
1-5 | Best C: 2^1, accuracy = 81.6367 | Best C: 2^-2, g: 2^0, accuracy = 81.6367 | Poor accuracy
5-10 | Best C: 2^1, accuracy = 83.8323 | — | —
10-15 | Best C: 2^4, accuracy = 96.4072 | Best C: 2^3, g: 2^-3, accuracy = 97.6096 | Best C: 2^6, deg: 2, accuracy = 96.2151

Other observations:
Features 10-15 are the most important. Again, features {11,14,15} with the linear kernel give best C = 2^4, accuracy = 95.4092.

    ANALYSES OF PARAMETERS FOR DIFFERENT PAIRS OF CLASSES:

Features {classes} | Linear | Radial | Polynomial
1-15 {0,1} | Best C: 2^4, accuracy = 99.2 | Best C: 2^13, g: 2^-9, accuracy = 99.4 | Best C: 2^1, deg: 2, accuracy = 99.6
1-15 {4,5} | Best C: 2^1, accuracy = 98.6028 | Best C: 2^1, g: 2^-3, accuracy = 98.8024 | Best C: 2^-2, deg: 3, accuracy = 98.8028
1-15 {8,9} | Best C: 2^4, accuracy = 95.01 | Best C: 2^10, g: 2^-9, accuracy = 95.6088 | Best C: 2^4, deg: 2, accuracy = 96.2076
1-15 {4,8} | Best C: 2^1, accuracy = 97.4052 | Best C: 2^4, g: 2^-3, accuracy = 97.6048 | Best C: 2^7, deg: 2, accuracy = 97.2056
1-10 {4,5} | Best C: 2^4, accuracy = 98.4032 | Best C: 2^13, g: 2^-3, accuracy = 98.6028 | Best C: 2^4, deg: 4, accuracy = 98.004
1-10 {0,1} | Best C: 2^1, accuracy = 99 | Best C: 2^4, g: 2^-6, accuracy = 99 | Best C: 2^1, deg: 1, accuracy = 99.2
1-10 {8,9} | Best C: 2^4, accuracy = 95.8084 | Best C: 2^4, g: 2^-3, accuracy = 95.01 | Best C: 2^7, deg: 2, accuracy = 94.8104
1-10 {4,8} | Best C: 2^4, accuracy = 96.6068 | Best C: 2^13, g: 2^-6, accuracy = 96.8064 | Best C: 2^1, deg: 1, accuracy = 96.008
10-15 {0,1} | Best C: 2^4, accuracy = 98.6 | Best C: 2^1, g: 2^0, accuracy = 99 | Best C: 2^1, deg: 3, accuracy = 98.6
10-15 {4,5} | Best C: 2^1, accuracy = 98.60 | Best C: 2^1, g: 2^-3, accuracy = 98.6028 | Best C: 2^3, deg: 2, accuracy = 96.8064
10-15 {8,9} | Best C: 2^1, accuracy = 94.6088 | Best C: 2^4, g: 2^-3, accuracy = 95.2096 | Best C: 2^3, deg: 3, accuracy = 94.2116
10-15 {4,8} | Best C: 2^4, accuracy = 96.4072 | Best C: 2^3, g: 2^-3, accuracy = 97.6096 | Best C: 2^6, deg: 2, accuracy = 96.2151

Observations:
There is some correlation between the different pairs of classes; with fewer features, the similarity is higher.
Linear: with any number of features, the best C is very similar in all cases.
Radial: a higher C is usually preferred, while gamma varies somewhat arbitrarily.
Polynomial: the degree obtained is low, and the best C shows more variation than for the linear kernel but less than for the radial.

MULTICLASS:

ONE vs ONE

Features | Linear | Radial | Polynomial
1-15 | Best C: 2^6, accuracy = 87.32 | Best C: 2^14, g: 2^-9, accuracy = 89 | Best C: 2^4, deg: 2, accuracy = 89.12
1-10 | Best C: 2^6, accuracy = 80.24 | Poor accuracy | Poor accuracy
10-15 | Best C: 2^6, accuracy = 79.6 | Best C: 2^6, g: 2^-3, accuracy = 71.52 | Best C: 2^4, deg: 2, accuracy = 80.72
11,14,15 | Best C: 2^1, accuracy = 47.6 | — | —

Observations:
Multiclass classification gives almost 10% lower accuracy. There also seems to be no strongly preferred set of features: although features 1-10 and features 10-15 give approximately the same accuracy as each other, unlike in binary classification both are well below the all-feature accuracy, so leaving out features does not make sense.
There is not a very large difference between the average binary-class parameters and the multiclass ones, e.g. linear gives best C near 2^4, radial gives a gamma near 2^-6, and polynomial gives a low degree.

ONE vs ALL

Class | Lin: best log2(C) | Rad: best log2(C) | Rad: best log2(g) | Poly: best log2(C) | Poly: best deg
0 | 4 | 7 | -6 | 4 | 7
1 | 4 | 7 | -6 | 4 | 7
2 | 4 | 7 | -6 | 4 | 7
3 | 7 | 13 | -6 | 7 | 7
4 | 4 | 13 | -9 | 7 | 7
5 | 7 | 10 | -6 | 7 | 7
6 | 7 | 13 | -9 | 7 | 7
7 | 7 | 13 | -9 | 7 | 7
8 | 4 | 10 | -6 | 4 | 7
9 | 4 | 13 | -6 | 4 | 7
Accuracy | 80.8% | 76.8% | | 78.24% |

OVO vs OVA:
Accuracy with one-vs-all is lower than with one-vs-one. One-vs-one takes less computational time, since the dataset for each binary classifier is smaller; however, one-vs-all gives insight into each class's parameters. Feature selection in OVA follows the same concept as in OVO, but the accuracy is lower (a sketch of both schemes follows).
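A sketch of the two schemes using scikit-learn's explicit wrappers (note that SVC already uses one-vs-one internally; the wrappers just make the decomposition explicit). The C values are the linear-kernel ones from the tables above, and X_train etc. remain hypothetical placeholders:

```python
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier
from sklearn.svm import SVC

# One-vs-one: k(k-1)/2 binary SVMs, each trained only on the two
# classes' data, so every subproblem is small.
ovo = OneVsOneClassifier(SVC(kernel="linear", C=2.0 ** 6))

# One-vs-all: k binary SVMs, each trained on the full dataset,
# which exposes per-class best parameters.
ova = OneVsRestClassifier(SVC(kernel="linear", C=2.0 ** 4))

for clf in (ovo, ova):
    clf.fit(X_train, y_train)
    print(type(clf).__name__, clf.score(X_test, y_test))
```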

ANOTHER IDEA TO IMPROVE ACCURACY:

Scaling:

Idea: Suppose feature F1 lies in the range (1, 2) and F2 in the range (1000, 2000). If we want the hyperplane to depend on the distribution of the points, we need to normalize. Consider (1.1, 1100) and (1.4, 1100): the difference in F1 is large relative to its range, yet the classifier will either be insensitive to F1 or need a very large coefficient for it (a sketch follows).
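A sketch of scaling before the SVM, here with standardization to zero mean and unit variance (LIBSVM's svm-scale instead maps each feature to a fixed range; either addresses the problem above):

```python
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Standardize each feature before the SVM, so F1 in (1, 2) and
# F2 in (1000, 2000) contribute comparably to the kernel's distance
# computation. The scaler is fit on training data only, then applied
# to test data inside the pipeline.
model = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
model.fit(X_train, y_train)
print(model.score(X_test, y_test))
```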

Results (accuracy %):

Setting | Kernel | Non-scaled | Scaled
All features, multiclass | L | 87.32 | 88.12
All features, multiclass | R | 89 | 88.68
All features, multiclass | P | 89.12 | 81.96
Features 5-10, classes 0,1 | L | 93.4 | 94.4
Features 5-10, classes 0,1 | R | 91.2 | 94.6
Features 5-10, classes 0,1 | P | 92.3 | 93.4
Features 5-10, classes 4,8 | L | 80.4 | 83.2669
Features 5-10, classes 4,8 | R | 82.27 | 86.0558
Features 5-10, classes 4,8 | P | 81.051 | 83.4661

Observations:
When accuracy is already high, scaling shows only a very slight increase (hence results for the binary, all-feature cases are not shown). If accuracy is low, scaling is a good option. For multiclass, scaling is less consequential.