
ASSIGNMENT 2
EEL 709
DEEPALI JAIN
2012ee10082

Motive:

Grid search for the linear, radial, and polynomial kernels.
Effect of varying the features used (15 -> 10 -> 5 -> further subsets within the best range).
Binary classification: 4 pairs of classes analysed: {0,1}; {4,5}; {8,9}; {4,8}.
Analysis of overfitting, best fit, and underfitting.
One-vs-one and one-vs-all multiclass classification.
Scaling of feature values.

Approach:

Cross-validate by varying C and the kernel parameter, initially in steps of log2(C) = 3, using 10-fold cross-validation.
Take the best parameters so obtained and vary them again in steps of log2(C) = 1 around those values to refine them.
Test the final parameters on the data without any cross-validation.
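A minimal sketch of this coarse-to-fine search, written here with scikit-learn's SVC (whether the original work used LIBSVM's grid tools or something else, the procedure is the same); X_train/y_train/X_test/y_test are hypothetical placeholders for the train/test split of one pair of classes:

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Coarse pass: log2(C) and log2(gamma) in steps of 3, 10-fold CV.
coarse = GridSearchCV(
    SVC(kernel="rbf"),
    param_grid={
        "C": 2.0 ** np.arange(-5, 16, 3),      # 2^-5 ... 2^13
        "gamma": 2.0 ** np.arange(-15, 4, 3),  # 2^-15 ... 2^3
    },
    cv=10,
)
coarse.fit(X_train, y_train)
c0 = np.log2(coarse.best_params_["C"])
g0 = np.log2(coarse.best_params_["gamma"])

# Fine pass: steps of 1 around the coarse optimum.
fine = GridSearchCV(
    SVC(kernel="rbf"),
    param_grid={
        "C": 2.0 ** np.arange(c0 - 2, c0 + 3),
        "gamma": 2.0 ** np.arange(g0 - 2, g0 + 3),
    },
    cv=10,
)
fine.fit(X_train, y_train)

# Final evaluation on held-out data, without cross-validation.
print(fine.best_params_, fine.score(X_test, y_test))
```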

ANALYSIS FOR EACH PAIR OF CLASSES:

I]. Classes: 0,1; Features: 1-15

Linear kernel:
Overfit occurs at C > 2^6; underfit occurs at small C.

Radial kernel:
Best C: 2^13, g: 2^-9, accuracy = 99.4.
Overfitting is not easily distinguished: for a given gamma, accuracy keeps increasing with C, asymptotically, over a very large range. This means that with the radial basis kernel, the hard-margin case can be approximated when all features are used. E.g., even for log2(gamma) = -6 and log2(C) = 190, the accuracy is 99.2%.

Polynomial kernel:
Best C: 2^1, deg: 2, accuracy = 99.6.
Overfit at C > 2^4; underfit at C < 2^(-4).

[Figure: cross-validation accuracy contours from the grid search. Left: radial kernel, log2(C) from -4 to 12 vs log2(gamma) from -14 to 2, accuracy contours 65-95%. Right: polynomial kernel, log2(C) from -4 to 12 vs degree 1-7, accuracy contours 95.5-98.5%.]

Features | Linear | Radial | Polynomial
1-15 | Best C: 2^4, accuracy = 99.2 | Best C: 2^13, g: 2^-9, accuracy = 99.4 | Best C: 2^1, deg: 2, accuracy = 99.6
1-10 | Best C: 2^1, accuracy = 99 | Best C: 2^4, g: 2^-6, accuracy = 99 | Best C: 2^1, deg: 1, accuracy = 99.2
1-5 | Best C: 2^4, accuracy = 99.2 | Best C: 2^13, g: 2^0, accuracy = 99.4 | Best C: 2^-2, deg: 5, accuracy = 99.2
5-10 | Best C: 2^7, accuracy = 93.2 | Best C: 2^1, g: 2^0, accuracy = 91.8 | Best C: 2^10, deg: 2, accuracy = 92.3
10-15 | Best C: 2^4, accuracy = 98.6 | Best C: 2^1, g: 2^0, accuracy = 99 | Best C: 2^1, deg: 3, accuracy = 98.6

Observations:

Kernel: Changing the kernel function has no drastic effect on accuracy for any number of features, so the sigmoid kernel was not used and the other parameters were analysed in more depth. A low-degree polynomial generally gives good enough results. The radial kernel's parameters are affected the most by a change in the number of features, while the best C of the other two kernels is usually the same (2^4 and 2^1). As we go from 15 features to 5, the radial kernel's parameters vary widely.

Fitting: With the radial basis kernel, the hard-margin case can be approximated when all features are used. For the linear and polynomial kernels, the underfit and overfit critical C occur at approximately 2^(-5) and 2^(+5) respectively in all cases. Overfitting and underfitting are less prominent for the radial kernel (from the shape of the graph). For the polynomial kernel, over- and underfitting are prominent at low degrees; at high degrees, accuracy is lower for all C values (a sketch for locating these critical C values follows these observations).

Features: The best parameters do not vary much as the features are reduced from 15 to 10; reducing further to 5 produces a more significant change. Ignoring the effect of combinations of two features on accuracy (i.e., using a filter rather than a wrapper approach), features 1-5 and 10-15 appear to be the more important ones.
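The critical C values quoted above can be located by scanning log2(C) at fixed kernel settings and watching where the cross-validated accuracy falls off its plateau; a minimal sketch, again with hypothetical X_train/y_train:

```python
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Scan log2(C) for a linear SVM: CV accuracy falls off the plateau
# at both ends (underfit at small C, overfit at large C).
for log2c in range(-8, 13):
    clf = SVC(kernel="linear", C=2.0 ** log2c)
    acc = cross_val_score(clf, X_train, y_train, cv=10).mean()
    print(f"log2(C) = {log2c:3d}   CV accuracy = {acc:.4f}")
```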

Now we consider subsets of features within [1,5] and [10,15], using the linear kernel to obtain the accuracy for each subset (a sketch of this evaluation loop follows the table).

Features | Results
2-5 | Best C: 2^4, accuracy = 93.8
1-4 | Best C: 2^7, accuracy = 95.6
10-14 | Best C: 2^10, accuracy = 97.2
11-15 | Best C: 2^4, accuracy = 98.8
11,14,15 | Best C: 2^4, accuracy = 98.6
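A sketch of the subset evaluation loop behind this table; the subset definitions mirror the rows above, X_train/y_train remain hypothetical placeholders, and X_train is assumed to be a NumPy array:

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Feature subsets from the table above (1-based indices).
subsets = {
    "2-5": list(range(2, 6)),
    "1-4": list(range(1, 5)),
    "10-14": list(range(10, 15)),
    "11-15": list(range(11, 16)),
    "11,14,15": [11, 14, 15],
}

for name, cols in subsets.items():
    idx = np.array(cols) - 1  # convert to 0-based column indices
    grid = GridSearchCV(
        SVC(kernel="linear"),
        param_grid={"C": 2.0 ** np.arange(-5, 16, 3)},
        cv=10,
    )
    grid.fit(X_train[:, idx], y_train)
    print(name, grid.best_params_, round(100 * grid.best_score_, 2))
```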

Observations:
Amongst [1,5], removing even a single feature reduces accuracy considerably. Features 11, 14, 15 alone give very high accuracy, and their parameter setting comes close to that of the larger feature sets. When accuracy is good, the best C for the linear kernel is 2^4 irrespective of the features included.

II]. Classes: 4,5; Features: 1-15

Linear: overfit at C > 2^6; underfit at C < 2^(-4).
Radial: no prominent overfitting; with increasing C, there is never a substantial decrease in accuracy.
Polynomial: overfit at C > 2^4; underfit at C < 2^(-4).

Features | Linear | Radial | Polynomial
1-15 | Best C: 2^1, accuracy = 98.6028 | Best C: 2^1, g: 2^-3, accuracy = 98.8024 | Best C: 2^-2, deg: 3, accuracy = 98.8028
1-10 | Best C: 2^4, accuracy = 98.4032 | Best C: 2^13, g: 2^-3, accuracy = 98.6028 | Best C: 2^4, deg: 4, accuracy = 98.004
1-5 | Best C: 2^4, accuracy = 75.8483 | Best C: 2^7, g: 2^-6, accuracy = 76.6467 | Poor accuracy
5-10 | Best C: 2^4, accuracy = 94.61 | Best C: 2^7, g: 2^-3, accuracy = 89.2216 | Poor accuracy
10-15 | Best C: 2^1, accuracy = 98.60 | Best C: 2^1, g: 2^-3, accuracy = 98.6028 | Best C: 2^3, deg: 2, accuracy = 96.8064

Other observations:
In this case, the best parameters for features [10,15] were closer to the all-feature ones than those for [1,10], i.e. features [10,15] are the most essential. When further subsets were taken, it was again found that {11,14,15} give accuracy 98.2%, with best C = 2^(-2) in the linear case.

III]. Classes: 8,9

All features:
Linear: overfit at C > 2^4; underfit at C < 2^(-4).
Polynomial: overfit at C > 2^5; underfit at C < 2^(-6).

Features 1-10:
Linear: overfit at C > 2^6; underfit at C < 2^(-4).

Features | Linear | Radial | Polynomial
1-15 | Best C: 2^4, accuracy = 95.01 | Best C: 2^10, g: 2^-9, accuracy = 95.6088 | Best C: 2^4, deg: 2, accuracy = 96.2076
1-10 | Best C: 2^4, accuracy = 95.8084 | Best C: 2^4, g: 2^-3, accuracy = 95.01 | Best C: 2^7, deg: 2, accuracy = 94.8104
1-5 | Best C: 2^1, accuracy = 72.6547 | Best C: 2^4, g: 2^-3, accuracy = 72.2555 | Best C: 2^4, deg: 2, accuracy = 68.0639
5-10 | Best C: 2^10, accuracy = 80.4391 | Best C: 2^7, g: 2^-3, accuracy = 81.6367 | Poor accuracy
10-15 | Best C: 2^1, accuracy = 94.6088 | Best C: 2^4, g: 2^-3, accuracy = 95.2096 | Best C: 2^3, deg: 3, accuracy = 94.2116

Other observations:
Again, accuracy using all 15 features and the last 5 features is almost the same; however, the latter gives a more complex model. Again, features {11,14,15} with the linear kernel give best C = 2^-2, accuracy = 92.2156.

IV]. Classes: 4,8

Features | Linear | Radial | Polynomial
1-15 | Best C: 2^1, accuracy = 97.4052 | Best C: 2^4, g: 2^-3, accuracy = 97.6048 | Best C: 2^7, deg: 2, accuracy = 97.2056
1-10 | Best C: 2^4, accuracy = 96.6068 | Best C: 2^13, g: 2^-6, accuracy = 96.8064 | Best C: 2^1, deg: 1, accuracy = 96.008
1-5 | Best C: 2^1, accuracy = 81.6367 | Best C: 2^-2, g: 2^0, accuracy = 81.6367 | Poor accuracy
5-10 | Best C: 2^1, accuracy = 83.8323 | — | —
10-15 | Best C: 2^4, accuracy = 96.4072 | Best C: 2^3, g: 2^-3, accuracy = 97.6096 | Best C: 2^6, deg: 2, accuracy = 96.2151

Other observations:
Features 10-15 are the most important. Again, features {11,14,15} with the linear kernel give best C = 2^4, accuracy = 95.4092.

    ANALYSES OF PARAMETERS FOR DIFFERENT PAIRS OF CLASSES:

Features {classes} | Linear | Radial | Polynomial
1-15 {0,1} | Best C: 2^4, accuracy = 99.2 | Best C: 2^13, g: 2^-9, accuracy = 99.4 | Best C: 2^1, deg: 2, accuracy = 99.6
1-15 {4,5} | Best C: 2^1, accuracy = 98.6028 | Best C: 2^1, g: 2^-3, accuracy = 98.8024 | Best C: 2^-2, deg: 3, accuracy = 98.8028
1-15 {8,9} | Best C: 2^4, accuracy = 95.01 | Best C: 2^10, g: 2^-9, accuracy = 95.6088 | Best C: 2^4, deg: 2, accuracy = 96.2076
1-15 {4,8} | Best C: 2^1, accuracy = 97.4052 | Best C: 2^4, g: 2^-3, accuracy = 97.6048 | Best C: 2^7, deg: 2, accuracy = 97.2056
1-10 {4,5} | Best C: 2^4, accuracy = 98.4032 | Best C: 2^13, g: 2^-3, accuracy = 98.6028 | Best C: 2^4, deg: 4, accuracy = 98.004
1-10 {0,1} | Best C: 2^1, accuracy = 99 | Best C: 2^4, g: 2^-6, accuracy = 99 | Best C: 2^1, deg: 1, accuracy = 99.2
1-10 {8,9} | Best C: 2^4, accuracy = 95.8084 | Best C: 2^4, g: 2^-3, accuracy = 95.01 | Best C: 2^7, deg: 2, accuracy = 94.8104
1-10 {4,8} | Best C: 2^4, accuracy = 96.6068 | Best C: 2^13, g: 2^-6, accuracy = 96.8064 | Best C: 2^1, deg: 1, accuracy = 96.008
10-15 {0,1} | Best C: 2^4, accuracy = 98.6 | Best C: 2^1, g: 2^0, accuracy = 99 | Best C: 2^1, deg: 3, accuracy = 98.6
10-15 {4,5} | Best C: 2^1, accuracy = 98.60 | Best C: 2^1, g: 2^-3, accuracy = 98.6028 | Best C: 2^3, deg: 2, accuracy = 96.8064
10-15 {8,9} | Best C: 2^1, accuracy = 94.6088 | Best C: 2^4, g: 2^-3, accuracy = 95.2096 | Best C: 2^3, deg: 3, accuracy = 94.2116
10-15 {4,8} | Best C: 2^4, accuracy = 96.4072 | Best C: 2^3, g: 2^-3, accuracy = 97.6096 | Best C: 2^6, deg: 2, accuracy = 96.2151

Observations:
There is some correlation between the different pairs of classes; with fewer features, the similarity is higher.
Linear: with any number of features, the best C is very similar in all cases.
Radial: a higher C is usually preferred, while gamma varies somewhat arbitrarily.
Polynomial: the degree obtained is low, and the best C shows more variation than for the linear kernel but less than for the radial.

MULTICLASS:

ONE vs ONE

Features | Linear | Radial | Polynomial
1-15 | Best C: 2^6, accuracy = 87.32 | Best C: 2^14, g: 2^-9, accuracy = 89 | Best C: 2^4, deg: 2, accuracy = 89.12
1-10 | Best C: 2^6, accuracy = 80.24 | Poor accuracy | Poor accuracy
10-15 | Best C: 2^6, accuracy = 79.6 | Best C: 2^6, g: 2^-3, accuracy = 71.52 | Best C: 2^4, deg: 2, accuracy = 80.72
11,14,15 | Best C: 2^1, accuracy = 47.6 | — | —

Observations:
Multiclass classification gives almost 10% lower accuracy. There also seems to be no strongly preferred set of features: although features 1-10 and features 10-15 give approximately the same accuracy as each other, unlike in binary classification both are well below the all-feature accuracy, so leaving out features does not make sense.
There is not a very large difference between the average binary-class parameters and the multiclass ones, e.g. linear gives best C near 2^4, radial gives a gamma near 2^-6, and polynomial gives a low degree.

ONE vs ALL

Class | Lin: best log2(C) | Rad: best log2(C) | Rad: best log2(g) | Poly: best log2(C) | Poly: best deg
0 | 4 | 7 | -6 | 4 | 7
1 | 4 | 7 | -6 | 4 | 7
2 | 4 | 7 | -6 | 4 | 7
3 | 7 | 13 | -6 | 7 | 7
4 | 4 | 13 | -9 | 7 | 7
5 | 7 | 10 | -6 | 7 | 7
6 | 7 | 13 | -9 | 7 | 7
7 | 7 | 13 | -9 | 7 | 7
8 | 4 | 10 | -6 | 4 | 7
9 | 4 | 13 | -6 | 4 | 7
Accuracy | 80.8% | 76.8% | | 78.24% |

OVO vs OVA:
Accuracy with one-vs-all is lower than with one-vs-one. One-vs-one takes less computational time, since the dataset for each binary classifier is smaller; however, one-vs-all gives insight into each class's parameters. Feature selection in OVA follows the same concept as in OVO, but the accuracy is lower (a sketch of both schemes follows).
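A sketch of the two schemes using scikit-learn's explicit wrappers (note that SVC already uses one-vs-one internally; the wrappers just make the decomposition explicit). The C values are the linear-kernel ones from the tables above, and X_train etc. remain hypothetical placeholders:

```python
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier
from sklearn.svm import SVC

# One-vs-one: k(k-1)/2 binary SVMs, each trained only on the two
# classes' data, so every subproblem is small.
ovo = OneVsOneClassifier(SVC(kernel="linear", C=2.0 ** 6))

# One-vs-all: k binary SVMs, each trained on the full dataset,
# which exposes per-class best parameters.
ova = OneVsRestClassifier(SVC(kernel="linear", C=2.0 ** 4))

for clf in (ovo, ova):
    clf.fit(X_train, y_train)
    print(type(clf).__name__, clf.score(X_test, y_test))
```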

ANOTHER IDEA TO IMPROVE ACCURACY:

Scaling:

Idea: Suppose feature F1 lies in the range (1, 2) and F2 in the range (1000, 2000). If we want the hyperplane to depend on the distribution of the points, we need to normalize. Consider (1.1, 1100) and (1.4, 1100): the difference in F1 is large relative to its range, yet the classifier will either be insensitive to F1 or need a very large coefficient for it (a sketch follows).
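A sketch of scaling before the SVM, here with standardization to zero mean and unit variance (LIBSVM's svm-scale instead maps each feature to a fixed range; either addresses the problem above):

```python
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Standardize each feature before the SVM, so F1 in (1, 2) and
# F2 in (1000, 2000) contribute comparably to the kernel's distance
# computation. The scaler is fit on training data only, then applied
# to test data inside the pipeline.
model = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
model.fit(X_train, y_train)
print(model.score(X_test, y_test))
```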

Results (accuracy %):

Setting | Kernel | Non-scaled | Scaled
All features, multiclass | L | 87.32 | 88.12
All features, multiclass | R | 89 | 88.68
All features, multiclass | P | 89.12 | 81.96
Features 5-10, classes 0,1 | L | 93.4 | 94.4
Features 5-10, classes 0,1 | R | 91.2 | 94.6
Features 5-10, classes 0,1 | P | 92.3 | 93.4
Features 5-10, classes 4,8 | L | 80.4 | 83.2669
Features 5-10, classes 4,8 | R | 82.27 | 86.0558
Features 5-10, classes 4,8 | P | 81.051 | 83.4661

Observations:
When accuracy is already high, scaling shows only a very slight increase (hence results for the binary, all-feature cases are not shown). If accuracy is low, scaling is a good option. For multiclass, scaling is less consequential.