

D3A. EXTREME KERNEL MACHINE

Extreme Kernel Machine
Viktor Karlsson and Erik Rosvall

Abstract—The purpose of this report is to examine the combination of an Extreme Learning Machine (ELM) with the kernel method. Kernels lie at the core of the Support Vector Machine's success in classifying non-linearly separable datasets. The hypothesis is that by combining an ELM with a kernel we will utilize features in the ELM-space that are otherwise unused. The report is intended as a proof of concept for the idea of using kernel methods in an ELM setting. This is done by running the new algorithm against five image datasets for a classification accuracy and time complexity analysis.

Results show that our extended ELM algorithm, which we have named Extreme Kernel Machine (EKM), improves classification accuracy for some datasets compared to the regularised ELM, in the best scenarios by around three percentage points. We found that the choice of kernel type and parameter values had a great effect on classification performance. The kernel does however add computational complexity, but where that is not a concern EKM has an advantage. This trade-off might give EKM a place between other neural networks and regular ELMs.

I. INTRODUCTION

EXTREME Learning Machine (ELM) was introduced by Huang et al. [1] and has shown promising results in both classification and regression settings [2]. ELMs are efficient to train since training can be done by solving a set of linear equations which, in comparison to other algorithms' iterative training approaches, is computationally efficient while still retaining excellent classification performance [3]. A version of the ELM using an auto encoder, which is a type of data pre-processing method, has been shown to reach an accuracy of 99.03% on the MNIST dataset [3].

Another algorithm used in the machine learning field is the Support Vector Machine (SVM). This algorithm, in its simplest form, aims to find a linear separation boundary between two classes of data, thus being a binary classifier. To create more complex separation boundaries one incorporates a kernel into the model, see figure 1. The kernel implicitly transforms the data to a higher dimension in the hope of being able to perform the linear separation in that space. By choosing these kernels under some general constraints, one can make use of this higher dimension without ever having to actually transform anything to it, hence the term implicit transform. Kernels can also be thought of as a similarity measure [4], which will be expanded on later in this report.

This paper looks at the possibility of integrating kernels into ELMs in order to improve classification performance. Our hypothesis is that after having transformed the input data into ELM-space we can apply a kernel to make use of additional information in the input features. This is different from how other authors have combined kernels with ELMs in the past, see section II-E.

Fig. 1. Example of a linear decision boundary being possible because of a non-linear transform φ [5].

II. THEORETICAL BACKGROUND

Before we dive into our extended algorithm, we want to present the concepts of Extreme Learning Machines and kernels. In this section we also discuss other work related to both ELMs and kernels and explain how it differs from our integration of the two.

A. Notation

Throughout this report we will let bold, lower-case letters (x, b etc.) represent column vectors while bold, upper-case letters (X, W_i, Y etc.) will represent matrices. The transpose of a matrix X will be denoted by X^T and its inverse by X^{-1}. A hat on a matrix or vector will denote that the entity is a prediction made by a model, so that if y holds the real labels of a dataset, ŷ will be the model's prediction of these labels. Finally, the dimensions of these vectors and matrices will often be described by normal-font letters (d, N, c etc.).

B. Image classification

In this report we study classification of images, which some might think of as a process of feature extraction. That is however not the case for ELMs in general. The images were simply reshaped into column feature vectors x_i, where the intensity value of each pixel is thought of as one feature, and then stored in the input feature matrix X. This treatment of images means that classifying pictures is no different from classifying any other type of data that can be represented as vectors.
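As an illustration, the following minimal NumPy sketch shows how a stack of images could be flattened into such a feature matrix; the array names and sizes are our own and only serve as an example.

    import numpy as np

    # Stand-in for 100 grayscale images of 28 x 28 pixels each.
    images = np.random.rand(100, 28, 28)
    N, height, width = images.shape
    # One column per image, one intensity value per feature: X is d x N with d = 784.
    X = images.reshape(N, height * width).T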

C. ELM

ELMs are a type of single-layer feed-forward network (SLFN) transforming the input feature matrix X = [x_1, x_2, ..., x_N] ∈ ℝ^{d×N} to a hidden layer of h neurons through multiplication with a matrix W_i ∈ ℝ^{h×d} filled with zero-mean, random Gaussian noise. This operation can be thought of as generating random rotations of the input features, since every square sub-matrix of dimension d in W_i has this ability. This transform is often regarded as taking the input features from ℝ^d to the ELM-space ℝ^h.

The output of the hidden neurons is controlled by a non-linear function f(A), often called the activation function. While many activation functions exist (sigmoid, tanh and binary step just to name a few), in this report we will only use the Rectified Linear Unit (RLU) defined by equation (1). The RLU is a very popular activation function in many different machine learning algorithms, and we use it because of its ease of computation.

f(A) = max{0, A}, where f(A)_{i,j} = 0 if A_{i,j} < 0 and f(A)_{i,j} = A_{i,j} otherwise   (1)

We will define Z ≡ f(W_i X) as the output of the activation function for future convenience.

This output Z ∈ ℝ^{h×N} is finally multiplied with the hidden-to-output-layer matrix W_o ∈ ℝ^{c×h}, where c is the number of classes in the dataset. This matrix is defined as the solution to the optimisation problem stated in equation (2).

min_{W_o ∈ ℝ^{c×h}} (1/2) ‖W_o Z − Y‖²   (2)

Here, Y ∈ ℝ^{c×N} is the one-hot label representation of the input feature matrix X. That is, if feature vector x_i has the label y_i, the i:th column of Y will have a one in the y_i:th position and zeros elsewhere.

The complete ELM algorithm for classification can be summarised as in equations (3) and (4). In (3), Ŷ ∈ ℝ^{c×N} is the collection of scores for each class assigned to each input feature vector by the ELM.

Ŷ = W_o Z = W_o f(W_i X)   (3)

The classification is then conducted by assigning each input feature vector to the class with the highest score.

ŷ^T = argmax_j Ŷ   (4)

Here ŷ^T ∈ ℝ^{1×N} is the collection of label predictions for each of the N column feature vectors x_i of X. For an illustration of the ELM architecture, see figure 2.

Fig. 2. Graphical illustration of an ELM. The dimension of each matrix is written under the corresponding circle.

Since the optimisation problem stated in equation (2) is nothing less than the sum of squared errors, the solution is given by:

W_o = Y Z^{-1}   (5)

The matrix Z is generally not square, so the Moore-Penrose pseudoinverse has to be used instead. This is defined as in equation (6), where A is a non-square matrix.

A† ≡ (A^T A)^{-1} A^T   (6)

Herein lies the efficiency of the ELM algorithm. The model can be trained by solving a set of linear equations in a non-iterative manner. This can often be many times more computationally efficient than other, iterative algorithms, especially other deep networks [1]-[3].

While we described the basic idea of the ELM above, additional constraints can be added to the optimisation in equation (2) in order to increase the model's performance. In this report we also introduce a regularisation term, the L2 regularisation [6]. The impact the regularisation has on the model is controlled by the parameter λ. The reasoning behind adding such a regularisation boils down to making the model generalise better by decreasing the amount by which it overfits the training set. With this addition, the optimisation problem is altered slightly and is now defined by equation (7).

min_{W_o ∈ ℝ^{c×h}} (1/2) ‖W_o Z − Y‖² + (λ/2) ‖W_o‖²   (7)

While both the activation function output Z and W_i remain the same, the solution for W_o becomes equation (8), where I denotes the identity matrix of appropriate dimension.

W_o = Y (Z^T Z + λI)^{-1} Z^T   (8)
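To make the training and classification steps concrete, the following minimal NumPy sketch implements equations (1), (3), (4) and (8); the function names and the random number generator are our own choices and not part of any reference implementation.

    import numpy as np

    def relu(A):
        # Activation function from equation (1).
        return np.maximum(0.0, A)

    def train_elm(X, Y, h, lam, rng=np.random.default_rng(0)):
        # X is d x N (features as columns), Y is the c x N one-hot label matrix.
        d = X.shape[0]
        W_i = rng.standard_normal((h, d))      # random input-to-hidden weights
        Z = relu(W_i @ X)                      # h x N activations in ELM-space
        N = Z.shape[1]
        # Equation (8): W_o = Y (Z^T Z + lam I)^-1 Z^T, solved without forming the inverse.
        W_o = Y @ np.linalg.solve(Z.T @ Z + lam * np.eye(N), Z.T)
        return W_i, W_o

    def predict_elm(W_i, W_o, X):
        scores = W_o @ relu(W_i @ X)           # equation (3)
        return scores.argmax(axis=0)           # equation (4)

The N × N solve mirrors equation (8) as written; when N is much larger than h, the algebraically equivalent form W_o = Y Z^T (Z Z^T + λI)^{-1} only requires an h × h solve.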

D. Kernels in SVMs

Support Vector Machines create a decision boundary between data points from two different classes in a dataset [7]. The simplest SVM, the Least Squares Support Vector Machine, tries to create a linear decision boundary. It is however clear that this will fail for many types of datasets; the XOR problem is just one example, see figure 3.

Fig. 3. Illustration of the XOR classification problem, where a linear separation boundary cannot be found. The squares and circles are thought of as belonging to different classes.

If in such a case the data were transformed by a non-linear function to a higher-dimensional space, a linear separation might be possible there. If we constrain ourselves to specific non-linear transforms and to only computing scalar products in the higher-dimensional space, we are able to compute these products using the non-transformed data. This means that we can transform our data into infinite-dimensional spaces without ever having to actually create infinite-dimensional vectors, which obviously would not be possible for a computer. Consider the following example where we use the non-linear function φ(x) to transform a vector x^T = (x_1, x_2) ∈ ℝ² to ℝ⁴ and then show that a scalar product in ℝ⁴ can be calculated using only the vectors in ℝ².

φ(x)^T = (x_1³, √3 x_1² x_2, √3 x_1 x_2², x_2³)

φ(x) · φ(y) = x_1³ y_1³ + 3 x_1² y_1² x_2 y_2 + 3 x_1 y_1 x_2² y_2² + x_2³ y_2³
            = (x_1 y_1 + x_2 y_2)³
            = (x · y)³ ≡ k(x, y)

k(x, y) is an example of a homogeneous polynomial kernel of degree three, but of course other kernels can be defined. Some of the most common kernels [8] are:

- Linear kernel

k(x_i, x_j) = 1 + x_i · x_j   (9)

- Polynomial kernel of degree p

k(x_i, x_j) = (1 + x_i · x_j)^p   (10)

- Radial Basis Function (RBF) kernel

k(x_i, x_j) = exp(−‖x_i − x_j‖² / (2σ²))   (11)

In equations (9) and (10) the addition of the constant 1 improves stability, especially when p gets large. Linear kernels are only useful when the classes are linearly separable, while polynomial or RBF kernels are often used when more complicated tasks need to be solved. The RBF kernel is generally the more popular, but the polynomial kernel has found use in natural language processing [9].
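For reference, the three kernels in equations (9)-(11) can be written as the following minimal NumPy sketch, operating on matrices whose columns are feature vectors so that entry (i, j) of the result equals k(x_i, x_j); the function and parameter names are our own.

    import numpy as np

    def linear_kernel(A, B):
        # Equation (9): k(x_i, x_j) = 1 + x_i . x_j
        return 1.0 + A.T @ B

    def polynomial_kernel(A, B, p=2):
        # Equation (10): k(x_i, x_j) = (1 + x_i . x_j)^p
        return (1.0 + A.T @ B) ** p

    def rbf_kernel(A, B, sigma=1.0):
        # Equation (11): k(x_i, x_j) = exp(-||x_i - x_j||^2 / (2 sigma^2))
        sq_dists = (np.sum(A**2, axis=0)[:, None]
                    + np.sum(B**2, axis=0)[None, :]
                    - 2.0 * A.T @ B)
        return np.exp(-sq_dists / (2.0 * sigma**2))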

Kernel functions can also be regarded as a similarity measure between two vectors, returning a bigger value if the two vectors are similar than if they are dissimilar. The two kernel types, RBF and polynomial, do however differ in exactly how similarity is evaluated. Given two vectors, the polynomial kernel will return a maximal value if the two vectors are parallel, even if they differ in length. That is not the case for the RBF kernel, which is only maximised if the two vectors end in close proximity of each other. One can think of it as evaluating a Gaussian distribution, centred at one vector, at the point where the other vector ends, see figure 4.


Fig. 4. Example of the different types of similarity measurement. a) shows the polynomial kernel's projection method and b) shows how the RBF kernel approaches the same question. The circles are here thought of as iso-levels of the Gaussian distribution centred at the end of one vector.

E. Note on Terminology

Several different definitions of the term kernel matrix are in use for ELMs [10], [11]. These mainly refer to two different applications of kernels in a normal ELM, both of which differ from the application of kernels in this report. For clarity, we will explain the other uses and then show our implementation.

Some use the same kernels that we mention in equations (9)-(11), but apply them to the input layer, as the transformation to ELM-space [3]. Others use the term kernel matrix within the Moore-Penrose pseudoinverse, eq. (6), and say that Z^T Z is the kernel matrix [11]. Z is here the matrix of all the input features in ELM-space after application of an activation function, as defined after equation (1).

While activation functions and our kernel methods are both non-linear functions, we consider them separate. Because of this unfortunate naming convention, our use of the term kernel matrix will conflict with how it is used elsewhere.

To distinguish our approach from the other uses of Kernel ELM, K-ELM and ELM with kernel, we call our implementation Extreme Kernel Machine (EKM).

III. METHOD

With an understanding of the ELM algorithm and of how kernels can be used for measuring similarity, we are now ready to introduce our algorithm.

A. Extreme Kernel Machine

Our extension of the ELM algorithm combines the strengths of the ELM, especially the ease of calculating the output weights, with the kernel's similarity-measure property. Our intuition leads us to reason that extracting further information from the training data should have a positive impact on the model's classification performance.

This extension does not change the formulation of the optimisation problem, but it does change the matrices involved. Instead of the matrix Z in equation (7) we introduce a new kernel matrix K_t. Since the optimisation problem is solved during the training phase, the entries of K_t, K_{t,i,j}, are defined as the kernel function applied to the i:th and j:th feature vectors of the training set. That is, K_{t,i,j} = k(z_{t,i}, z_{t,j}). The subscript t indicates that the entities originate from feature vectors of the training set, a distinction that will be important during classification.

Simply substituting K_t for Z in (7) gives the new optimisation problem in equation (12).

min_{W_o ∈ ℝ^{c×N_t}} (1/2) ‖W_o K_t − Y‖² + (λ/2) ‖W_o‖²   (12)

It is worth noting that W_o in equation (12) has the dimensions c × N_t, where N_t is the number of training feature vectors, instead of c × h as before. This does not, however, change the form of the solution to (12), which is found through the same substitution as above applied to equation (8), resulting in equation (13).

W_o = Y (K_t^T K_t + λI)^{-1} K_t^T   (13)
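As a sketch of how training could look in code, the following NumPy fragment builds K_t from the training activations and solves equation (13); the rbf_kernel follows the kernel sketch in section II-D, and all other names are our own.

    import numpy as np

    def relu(A):
        return np.maximum(0.0, A)

    def rbf_kernel(A, B, sigma):
        sq = (np.sum(A**2, axis=0)[:, None]
              + np.sum(B**2, axis=0)[None, :]
              - 2.0 * A.T @ B)
        return np.exp(-sq / (2.0 * sigma**2))

    def train_ekm(X_train, Y_train, h, lam, sigma, rng=np.random.default_rng(0)):
        d = X_train.shape[0]
        W_i = rng.standard_normal((h, d))
        Z_t = relu(W_i @ X_train)            # training features in ELM-space, h x N_t
        K_t = rbf_kernel(Z_t, Z_t, sigma)    # N_t x N_t training kernel matrix
        N_t = K_t.shape[1]
        # Equation (13): W_o = Y (K_t^T K_t + lam I)^-1 K_t^T
        W_o = Y_train @ np.linalg.solve(K_t.T @ K_t + lam * np.eye(N_t), K_t.T)
        return W_i, W_o, Z_t                 # Z_t must be kept for classification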


The classification is carried out in a slightly different manner for EKM than for a normal ELM, where it is given by equation (3). The EKM algorithm needs both W_i and W_o, as the ELM does, but it also needs Z_t, the collection of training features in ELM-space. These are fed through the kernel function together with the features in Z which we wish to classify. This results in another kernel matrix K with entries K_{i,j} = k(z_{t,i}, z_j). Instead of performing the classification as in equation (3), we now multiply the output matrix W_o with this kernel matrix as in equation (14).

Ŷ = W_o K   (14)

As in equation (3), Ŷ is the collection of scores assigned to each input feature vector for each class. The actual class prediction is then performed through equation (4), just as before.
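A minimal sketch of this classification step, assuming the output of the train_ekm fragment above and any kernel function k(A, B) over column-vector matrices (for example kernel = lambda A, B: rbf_kernel(A, B, sigma)); the names are our own.

    import numpy as np

    def relu(A):
        return np.maximum(0.0, A)

    def predict_ekm(W_i, W_o, Z_t, X, kernel):
        Z = relu(W_i @ X)                # features to classify, in ELM-space (h x N)
        K = kernel(Z_t, Z)               # K[i, j] = k(z_t,i, z_j), N_t x N
        scores = W_o @ K                 # equation (14), c x N
        return scores.argmax(axis=0)     # class prediction as in equation (4)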

For a graphical comparison between ELM and EKM, compare figures 2 and 5.

Fig. 5. Illustration of an EKM with its kernel applied after the transform to ELM-space. The dimension of each matrix is written under the corresponding circle.

B. Grid search for σ and λ

In order to utilise the regularisation term λ in both algorithms a grid search, an algorithm for hyperparameter optimisation, had to be conducted. In the cases where we used the RBF kernel in the EKM algorithm, a grid search over both λ and σ had to be carried out. We restricted ourselves to finding an optimal σ only for the RBF kernel, and did not search further for p in the case of the polynomial kernels, where only p = 1 and p = 2 were used. The results of our basic grid search are presented in table I.
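A minimal sketch of such a grid search, written generically so that train_fn and eval_fn stand in for training either algorithm and scoring it on held-out data; the grids shown are illustrative only, not the values used in the report.

    import numpy as np
    from itertools import product

    def grid_search(train_fn, eval_fn, lambdas, sigmas):
        # train_fn(lam, sigma) -> model, eval_fn(model) -> validation accuracy.
        best_params, best_acc = None, -np.inf
        for lam, sigma in product(lambdas, sigmas):
            acc = eval_fn(train_fn(lam, sigma))
            if acc > best_acc:
                best_params, best_acc = (lam, sigma), acc
        return best_params, best_acc

    # Illustrative logarithmic grids.
    lambdas = [10.0 ** k for k in range(-6, 7)]
    sigmas = [10.0 ** k for k in range(-3, 4)]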

C. Number of hidden nodes for ELM and EKM

Another parameter which has to be chosen for both the ELM and the EKM algorithm is the number of hidden nodes, which corresponds to the dimension of ELM-space, denoted h earlier. Because of project constraints we did not optimise this number for each dataset. We did however find that the performance of both ELM and EKM responded similarly to the number of hidden nodes on the MNIST as well as the AR dataset, see figure 6. To avoid introducing unnecessary computational complexity for marginal improvements in classification accuracy, we decided that every network we train should have twice as many hidden nodes as input features, that is, h = 2d using the notation from the description of the ELM algorithm.

Using the same dimension of the ELM-space for both algorithms also serves as an indication of the possible strengths our EKM has over the normal ELM.

TABLE I
OVERVIEW OF THE PARAMETERS CHOSEN FOR EACH DATASET. THE UPPER NUMBER IN EACH CELL IS λ AND THE LOWER NUMBER IS THE KERNEL PARAMETER, p FOR THE POLYNOMIAL KERNELS AND σ FOR THE RBF KERNEL.

                    ELM       RBF EKM    Polynomial EKM   Linear EKM
  MNIST       λ     1         1          10^4             10^2
              p/σ   -         3          2                1
  AR          λ     5·10^4    10^-6      10^21            10^9
              p/σ   -         2·10^3     2                1
  Yalefaces   λ     5·10^4    10^-4      10^21            10^8
              p/σ   -         10^3       2                1
  Caltech101  λ     0.1       0.5        0.01             0.01
              p/σ   -         4·10^-3    2                1
  Scene 15    λ     10^-4     0.01       10^-7            10^-7
              p/σ   -         10^-9      2                1

Fig. 6. Classification accuracy for the ELM and EKM algorithms on the MNIST and AR datasets as a function of the number of hidden nodes in the model.

D. Datasets and training procedure

Five datasets were used to evaluate EKM's performance compared to ELM's. These datasets were put together by Zhuolin Jiang, Zhe Lin and Larry S. Davis [12]. A summary can be found in table II.

TABLE II
PROPERTIES OF THE DATASETS USED IN THE REPORT

              Images            Features   Classes
  MNIST       60,000 + 10,000   784        10
  AR          2,600             540        126
  Yalefaces   2,414             504        38
  Caltech101  9,144             3,000      102
  Scene 15    4,485             3,000      15

MNIST is the only dataset with a dedicated test set, which differentiates its training and testing process from that of the other four datasets. With MNIST the process is straightforward: the training set is used for training and the test set for testing.

For the other four datasets, we shuffled the dataset and then split it into two sets, one for training and one for testing. In order to get a statistically sound measurement we averaged our results over a number of repeated shufflings for each given split ratio. The number of repeated shuffles varied between algorithms and datasets because of differences in computational complexity. The specific numbers of iterations are found in the results.
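A minimal sketch of this evaluation loop, where train_and_score stands in for training either ELM or EKM on the training part and returning accuracy on the held-out part; the names and defaults are our own.

    import numpy as np

    def repeated_shuffle_eval(X, y, split_ratio, n_repeats, train_and_score,
                              rng=np.random.default_rng(0)):
        # X stores one column per image, y holds the corresponding labels.
        accs = []
        N = X.shape[1]
        n_train = int(split_ratio * N)
        for _ in range(n_repeats):
            perm = rng.permutation(N)
            tr, te = perm[:n_train], perm[n_train:]
            accs.append(train_and_score(X[:, tr], y[tr], X[:, te], y[te]))
        return float(np.mean(accs))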

1) MNIST: This dataset is perhaps the most famous one and is often used as a baseline when introducing new algorithms because of its size and ease of use. It is often criticised for being too simple, but is nonetheless hugely popular. It consists of 60,000 images of handwritten digits, pre-randomised to reduce bias when selecting subsets for training, with an additional test set of 10,000 images. Each image is 28 × 28 pixels, see figure 7 for some examples.

Fig. 7. Example of images in MNIST dataset.

2) AR: This dataset consists of images of 26 people displaying a wide variety of facial expressions, taken during two sessions. The images are originally of size 768 × 576 pixels but have been resized by Jiang et al. [12] so that the total number of features is 540. Examples of images found in this dataset are presented in figure 8.

Fig. 8. Example of images in the AR dataset. The dataset consists of (a) neutral-expression, (b) expression-variant, (c) illumination-variant, (d) sunglasses and (e) scarves images. [13]

3) Yalefaces Extended: This third dataset is a collection of 2,414 images of 38 different people displaying a wide variety of facial expressions, seen in figure 9. We will frequently call the dataset Yalefaces or YF for short. Most classes consist of 64 images, but some have fewer. The original images have been resized, through methods described by Jiang et al. [12], down to the 504 pixels used here.

Fig. 9. Example of images in the Yalefaces Extended dataset. [14]

4) CalTech101: CalTech101 is a collection of 102 classes, 101 of which are different objects and one of which is a background class. See figure 10 for an example. The number of images differs greatly between classes, some having only 40 and others around 800 images; most have about 50 images. Jiang et al. [12] used spatial pyramid extraction on the images.

Fig. 10. Example of images in CalTech101 dataset. [15]

5) Scene-15: The fifth and final dataset used in this report is Scene-15, which consists of 4,485 images of 15 different settings, or scenes as its name suggests. Each class, ranging from man-made structures to nature scenes, has between 200 and 400 images. As with CalTech101, this dataset was also arranged as spatial pyramid features. A small subset of the images can be seen in figure 11.

Fig. 11. Example of the original images in Scene-15 dataset. [15]

E. Measurements

To be able to evaluate the general performance of EKM in comparison to ELM, we measured classification accuracy as a function of training data size, and how this affects the total time of training and classification.

IV. RESULTS

An overview of the best results for each algorithm and dataset is summarised in table III. For a more detailed view of the results we refer you to figures 12-14.

TABLE III
OVERVIEW OF BEST ACCURACY AND CORRESPONDING TRAINING AND CLASSIFICATION TIME.

                 ELM               RBF EKM           Polynomial EKM    Linear EKM
  MNIST          95.38%, 485 s     97.71%, 1717 s    98.19%, 1850 s    95.36%, 1805 s
  AR             96.92%, 1.05 s    98.42%, 2.44 s    98.29%, 2.03 s    97.03%, 2.03 s
  Yalefaces      97.07%, 0.83 s    97.04%, 1.96 s    96.94%, 1.65 s    96.94%, 1.65 s
  Caltech101     67.81%, 37.85 s   71.14%, 75.55 s   68.41%, 75.92 s   68.09%, 74.4 s
  Scene 15       98.62%, 12.41 s   98.68%, 21.25 s   98.68%, 19.53 s   98.68%, 19.31 s


A. MNIST

Because of MNIST’s size we were only able to train on amaximum of 30,000 images before we ran out of memory onour laptop hardware. The results are found in figure 12.

B. AR

We ran the EKM algorithm 100 times and the ELM 700 times. The average of these runs is shown in figure 13.1.

C. Yalefaces Extended

We ran the EKM algorithm 100 times and the ELM 700 times. The average of these runs is shown in figure 13.2.

D. Caltech 101

We ran the EKM algorithm 20 times and the ELM 120 times. The average of these runs is shown in figure 14.2.

E. Scene15

We ran the EKM algorithm 25 times and the ELM 100 times. The average of these runs is shown in figure 14.1.

V. DISCUSSION

A. Accuracy comparison

The results plotted in figures 12.a, 13.a and 14.a show that the EKM outperforms the normal ELM in terms of classification accuracy on three out of the five datasets. For the other two datasets only negligible improvements were found. While the computational cost increased, we were able to improve the prediction accuracy by close to three percentage points in the case of the MNIST dataset, from 95.38% to 98.19% with a polynomial kernel. Furthermore, AR saw an increase from 96.92% to 98.42% between ELM and EKM. For the Caltech101 dataset we found that the RBF kernel gave significant improvements as well, and the overall greatest improvement, 4.91 percent in relative terms, from 67.81% to 71.14%. While not all datasets saw an improvement in accuracy, we can confidently say that the method of using a kernel function in ELM-space has real merit.

Perhaps surprisingly, the results indicate that both the RBF and the polynomial kernel perform well. In SVMs the RBF kernel is considered the superior one in the general case, and we expected it to perform better in this context as well. While the RBF kernel did perform well, and often better than the polynomial kernel, we would like to highlight that we did not optimise the polynomial kernel as thoroughly as the RBF.


Fig. 12. Results on the MNIST dataset. (a) Classification accuracy on the validation set as a function of training dataset size. (b) Total training and classification time, measured in seconds, as a function of training dataset size. (c) Total training and classification time, measured in seconds, as a function of classification accuracy.



Fig. 13. Results on the AR (left) and Yalefaces (right) datasets. (a) Classification accuracy on the validation set, consisting of all images not used for training, as a function of training dataset size. (b) Total training and classification time, measured in seconds, as a function of training dataset size. (c) Total training and classification time, measured in seconds, as a function of classification accuracy on the validation set consisting of all images not used in training.



Fig. 14. Results on the Scene15 (left) and Caltech101 (right) datasets. (a) Classification accuracy on the validation set, consisting of all images not used for training, as a function of training dataset size. (b) Total training and classification time, measured in seconds, as a function of training dataset size. (c) Total training and classification time, measured in seconds, as a function of classification accuracy on the validation set consisting of all images not used in training.


B. Time complexity of ELM vs EKM

Whether the EKM algorithm performs better than the ELM is a question of both classification accuracy and time complexity. As shown in figures 12.b, 13.b and 14.b there is an obvious increase in time complexity for our extended EKM algorithm. When training the EKM model the following steps have to be performed:

1) Linear transform to ELM-space through the matrix multiplication W_i X, which has complexity O(hdN_t).

2) Application of the activation function to W_i X which, assuming the RLU, has complexity O(N_t h).

3) Application of the kernel, where the complexity depends on the kernel in our code implementation:
   a) Polynomial kernel: O(N_t² h + N_t²)
   b) RBF kernel: O(N_t² h + N_t² + 2N_t h)

4) Calculating the output weight matrix W_o through inversion of an N_t × N_t matrix, giving complexity O(N_t³).

This should be compared to the complexity of the steps in the ELM algorithm:

1) Linear transform to ELM-space through the matrix multiplication W_i X, which has complexity O(hdN_t).

2) Application of the activation function to W_i X which, assuming the RLU, has complexity O(N_t h).

3) Calculating the output weight matrix W_o through inversion of an N_t × N_t matrix, giving complexity O(N_t³).

Additionally, the EKM algorithm needs to save the training feature matrix for classification, which also increases the dimension of W_o. It is thus clear that additional computational complexity has been added to both the training and classification procedures by our extension, which explains the results. It should however be noted that this report does not claim to have produced the most efficient version of the code, and some improvements might be possible with better programming, resulting in a faster runtime or possibly lower complexity.

C. Future work

We found, during the grid search, that classification accuracy was highly sensitive to changes in the model parameters. This suggests that further optimisation, with higher granularity and additional kernels, might be very useful. Additional optimisation over larger intervals for p would also be required in a more extensive result analysis. Furthermore, a more generalised search for a proper λ should be undertaken for each subset of data; an example of an alternative optimisation problem is formulated in equation (15) with λ ∈ (0, 1). This turns our unbounded problem into a bounded one, since it is only the ratio between the two terms in (15) that matters.

min_{W_o ∈ ℝ^{c×N_t}} (1 − λ) ‖W_o K − Y‖² + λ ‖W_o‖²   (15)

It would be prudent to test other activation functions as well, since there is a plethora of functions to choose from. By the same argument of prudence, much more effort could be put into optimising the number of hidden neurons.

VI. CONCLUSION

We feel that this report clearly shows that the idea behind an Extreme Kernel Machine has potential as a great addition to the ELM family of machine learning algorithms. It retains most of the advantages of the ELM, short training time and good generalisability, while improving results in several cases. While more optimisation can be done on our current implementation, both in parameter choice and in code, we feel that it might be a competitive single-layer feed-forward option.

ACKNOWLEDGEMENT

The authors would like to thank Saikat Chatterjee for the generous use of his time and for letting us use his idea for this report. Additionally, Mostafa Sadeghi and Alireza Mahdavi Javid have been instrumental in giving their time for explanations of concepts and theory. We would not have been able to come this far without your insights.

REFERENCES

[1] G.-B. Huang, Q.-Y. Zhu, and C.-K. Siew, "Extreme learning machine: a new learning scheme of feedforward neural networks," in 2004 IEEE International Joint Conference on Neural Networks (IEEE Cat. No.04CH37541), vol. 2, July 2004, pp. 985–990.

[2] E. Cambria, G.-B. Huang, L. L. C. Kasun, H. Zhou, C. M. Vong, J. Lin, J. Yin, Z. Cai, Q. Liu, K. Li et al., "Extreme learning machines [trends & controversies]," IEEE Intelligent Systems, vol. 28, no. 6, pp. 30–59, 2013.

[3] G. Huang, G.-B. Huang, S. Song, and K. You, "Trends in extreme learning machines: A review," Neural Networks, vol. 61, pp. 32–48, January 2014.

[4] A. T. Martins, M. A. T. Figueiredo, and P. M. Q. Aguiar, "Kernels and similarity measures for text classification," in Proceedings of the 6th Conference on Telecommunications, Peniche, Portugal, 2007.

[5] Wikipedia. (2011, Apr) File: Kernel Machine.svg. [Online]. Available: https://commons.wikimedia.org/w/index.php?title=File:Kernel_Machine.svg&oldid=238012497

[6] Y. Miche, M. van Heeswijk, P. Bas, O. Simula, and A. Lendasse, "TROP-ELM: a double-regularized ELM using LARS and Tikhonov regularization," Neurocomputing, vol. 74, no. 16, pp. 2413–2421, September 2011.

[7] C. Cortes and V. Vapnik, "Support-vector networks," Machine Learning, vol. 20, no. 3, pp. 273–297, 1995.

[8] G. James, D. Witten, T. Hastie, and R. Tibshirani, An Introduction to Statistical Learning with Applications in R. New York: Springer Publishing Company, Incorporated, 2013, ch. The Support Vector Machine, pp. 365–369.

[9] Y.-W. Chang, C.-J. Hsieh, K.-W. Chang, M. Ringgaard, and C.-J. Lin, "Training and testing low-degree polynomial data mappings via linear SVM," Journal of Machine Learning Research, vol. 11, pp. 1471–1490, April 2010.

[10] A. Iosifidis, A. Tefas, and I. Pitas, "On the kernel extreme learning machine classifier," Pattern Recognition Letters, vol. 54, pp. 11–17, 2015.

[11] B. Li, X. Rong, and Y. Li, "An improved kernel based extreme learning machine for robot execution failures," The Scientific World Journal, vol. 2014, p. 7, 2014.

[12] Z. Jiang, Z. Lin, and L. S. Davis, "Label consistent K-SVD: Learning a discriminative dictionary for recognition," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 35, no. 11, pp. 2651–2664, Nov 2013.

[13] F. Kamaruzaman and A. Shafie, "Recognizing faces with normalized local Gabor features and spiking neuron patterns," Pattern Recognition, vol. 53, pp. 102–115, May 2016.

[14] K.-C. Lee. (2001, May) The Extended Yale Face Database B. [Online]. Available: http://vision.ucsd.edu/~iskwak/ExtYaleDatabase/ExtYaleB.html

[15] M. M. Farhangi, M. Soryani, and M. Fathy. (2013, Jan) Improvement the bag of words image representation using spatial information. [Online]. Available: https://www.researchgate.net/publication/264789408_Improvement_the_Bag_of_Words_Image_Representation_Using_Spatial_Information