Stock market prediction using the K Nearest Neighbours algorithm and a comparison with the moving average formula Ida Vainionpää and Sophie Davidsson Degree Project in Computer Science DD143X Supervisor: Pawel Andrzej Herman Examinator: Örjan Ekeberg CSC KTH 29 April 2014



Abstract

The stock market has a large impact on the economy of a nation, which is why it is interesting to see how stock market prediction can be used and whether or not the predicted results are valid. This report compares two prediction methods, the K Nearest Neighbour algorithm and the moving average formula, using the closing prices of four Swedish equities listed on the Stockholm stock exchange OMX. To familiarize the reader with the background of stock markets and the formulas used, the report explains these theoretical concepts. The results are presented with appropriate charts and tables. Lastly, a discussion explains the implications of the results and the conclusion that the K Nearest Neighbour algorithm produced more accurate predictions than the moving average formula.

The stock market has a large influence on a nation's economy, which is why it is interesting to see whether predictions on the stock market can be used and whether the expected result is credible. This report will compare closing prices of four equities from the Stockholm stock exchange with the help of the K Nearest Neighbours algorithm and the moving average formula. In order to properly introduce the reader to the background of the stock market and the chosen formulas, the report explains these concepts. A suitable presentation of the results is given for the collected data, with suitable charts and tables. Finally, a discussion explains why the conclusion is that the K Nearest Neighbour algorithm gives a more accurate value compared with the moving average formula.


DD143X - Bachelor Thesis 29 April 2014

Contents

1 Introduction
    1.1 Problem Statement
    1.2 Problem Scope
    1.3 Report Outline

2 Background
    2.1 Stock Markets
    2.2 Data Mining in Stock Market Analysis
    2.3 KNN Algorithm
    2.4 Moving Average

3 Method
    3.1 The Data Selection
    3.2 KNN Algorithm
    3.3 Moving Average
    3.4 Error Calculation
        3.4.1 Root Mean Square Error (RMSE)
        3.4.2 Mean Percentage Error (MPE)
        3.4.3 Average Difference

4 Results
    4.1 Graphs and Tables
        4.1.1 ABB
        4.1.2 Astrazeneca
        4.1.3 H&M
        4.1.4 Investor
    4.2 Error Calculation

5 Discussion
    5.1 KNN Algorithm
    5.2 MA Formula
    5.3 Implications of the results

6 Conclusion

7 References
    7.1 Figures

List of Figures

1 KNN algorithm explained in a picture
2 Pseudocode explaining the KNN algorithm in a simple way (KNN Classification Algorithm Implemented in Lisp)
3 The mathematical formula for the MA
4 The mathematical formula for the euclidean distance
5 The mathematical formula for the average
6 The mathematical formula for the MA
7 The mathematical formula for RMSE
8 The mathematical formula for MPE
9 The mathematical formula for AD
10 The difference between the average actual price and each prediction method, as per each equity
11 Graph displaying the actual values for ABB against the predicted values from the KNN algorithm
12 Table displaying a short cutout of the actual price values, those predicted using the KNN algorithm and their difference (deviation)
13 Graph displaying the actual values for ABB against the predicted values from the MA formula
14 Table displaying a short cutout of the actual price values, those predicted using the MA formula and their difference (deviation)
15 Graph displaying the actual values for Astrazeneca against the predicted values from the KNN algorithm
16 Table displaying a short cutout of the actual price values, those predicted using the KNN algorithm and their difference (deviation)
17 Graph displaying the actual values for Astrazeneca against the predicted values from the MA formula
18 Table displaying a short cutout of the actual price values, those predicted using the MA formula and their difference (deviation)
19 Graph displaying the actual values for H&M against the predicted values from the KNN algorithm
20 Table displaying a short cutout of the actual price values, those predicted using the KNN algorithm and their difference (deviation)
21 Graph displaying the actual values for H&M against the predicted values from the MA formula
22 Table displaying a short cutout of the actual price values, those predicted using the MA formula and their difference (deviation)
23 Graph displaying the actual values for Investor against the predicted values from the KNN algorithm
24 Table displaying a short cutout of the actual price values, those predicted using the KNN algorithm and their difference (deviation)
25 Graph displaying the actual values for Investor against the predicted values from the MA formula
26 Table displaying a short cutout of the actual price values, those predicted using the MA formula and their difference (deviation)
27 Table displaying the three different error calculation methods and the values corresponding to each prediction method, per equity


1 Introduction

Stock markets are today a large part of each nation’s financial system, and they can directly portray how a nation’s economy is developing or declining. Several methods are currently used to help market participants forecast market movement. The most commonly used market prediction methods are software-based solutions that use graphical and statistical approaches to foresee market movement (Beattle, 2011). One example of a statistical method is the moving average (MA), which is often used by traders (Interactive Data Corp, 2014). These methods, however, are not reliable enough to be depended on completely, and the opinion of an experienced market trader is often valued higher than software-based predictions (Beattle, 2011). If a valid prediction method were to exist, one could foresee large fluctuations in the market and take countermeasures to reduce the consequences of such large movements. This could also lead to less chaotic behaviour in the stock market.

If this kind of prediction method were to exist, the field of computer science would become more essential, because computers would be used more frequently. The field of computer science is constantly evolving and is becoming more and more evident in every aspect of one’s lifestyle. As the capability of computers continues to broaden, more and more tasks are being completed with the help of computers instead of humans, thus reducing the human error factor. We now see complicated tasks being achieved by computers through the use of artificial intelligence, which is coming ever closer to replacing parts of the error-prone human mind. Many examples of technology reducing human error are seen in aviation, which suggests that the same could one day be evident in more fields (Ihilliard1 2011). This reduction of human error could also be introduced in the finance field, for example when predicting stock market movements. Computational intelligence, machine learning and data mining, which find correlations in large data sets that humans are not capable of finding, can now be used as prediction methods in finance as well as in the fields of medicine and biology (Alexander, 1998).

There are several data mining algorithms that can be used for prediction purposes in the field of finance. Some examples are the naive Bayes classifier, the k nearest neighbour (KNN) algorithm and the classification and regression tree algorithm (Wu et al. 2007). All of the mentioned algorithms could serve the purpose of this paper, but it centers on the KNN algorithm, together with the MA formula, as a method of predicting stock market movements. The movements will be detected by looking at a large amount of historical data and finding patterns to establish a well estimated forecast. The KNN algorithm was chosen because it is simple to implement yet very effective when looking at large amounts of data (Berson et al. 1999). The KNN algorithm simply states: "Objects that are ’near’ to each other will have similar prediction values as well. Thus if you know the prediction value of one of the objects you can predict it for its nearest neighbours" (Berson et al. 1999). The MA formula was chosen as a comparison with the KNN algorithm. The MA formula shares its simplicity with the KNN algorithm, but it is a statistical method used frequently by traders (Interactive Data Corp, 2014).


Other techniques for stock prediction exist, among them multispectral prediction, distortion controlled prediction and Lempel-Ziv based prediction. These are based on the fact that the data representation is made more compact by removing redundancy while the essential information is kept in a format that is accessible (Azhar et al. 1994). Given the scope of the project, however, the KNN algorithm and the MA formula were judged the most suitable techniques to work with.

The most accurate way to predict the outcome of the stock market is a frequently discussed matter. It is extremely complicated to take into consideration all the factors that can influence a stock, for example internal development, world events, inflation and interest rates, exchange rates and, lastly, hype (Wolski 2014). Over the last couple of years the idea of using data mining to try to forecast the stock market has been increasingly explored because of people’s growing interest in the stock market and the possibility of making a profit.

1.1 Problem Statement

The problem statement we have chosen to work with in this project follows from the hypothesis that the KNN algorithm is a more precise way of predicting closing prices than the MA formula is. The problem statement is formulated below:

Is using the KNN algorithm a more precise way of predicting the future closing prices of equities than using the more common method of MA?

1.2 Problem Scope

This report delves into the concept of using a data mining algorithm to attempt to predict, as accurately as possible, the movement of four equities on the Stockholm stock exchange OMX. This is an important contribution to knowledge because it can give rise to many prediction techniques in the future and it introduces a link between the finance field and the field of computer science. The rest of this section explains the scope of the project and the approach taken to the research question.

This report will explore the ability of using the KNN algorithm and the MA formula as methods of predicting stock market movements. Several steps will be followed as we attempt to answer and explore the research question.

- As an initial step, the algorithm and the formula will be implemented in a programming language that fits the purpose of the research question and then tested to ensure that they perform as they should.

- Data will then be gathered from the Stockholm stock exchange, focusing on the closing prices of four Swedish companies’ equities. This data will be collected from the website of Nasdaq with help from the Nordic bank Nordea. The range of data will cover two years in order to provide enough training data to obtain sufficiently accurate results.


- The results gathered from the algorithm and the MA method will then be collected and displayed through graphs, charts and explanatory text, in order to clearly display the accuracy.

- The error calculations from running the algorithm and the MA method form the numerical quantification of the project. They are necessary in order to test the correctness of the data collected and to allow an evaluation of whether or not the KNN algorithm is a well working prediction model, compared to the MA formula, when attempting to predict stock market movements.

1.3 Report Outline

• The background (2) gives a brief overview of the different aspects involved within this project. This includes some financial theory, for example how a stock market works, a concise explanation of how the KNN algorithm is structured, an overview of what data mining involves and a description of the MA formula.

• How the results were achieved is described in method (3). This covers the process in which the algorithm was implemented and how the data was used to obtain results. It also describes how the MA formula was used to arrive at results. All the thoughts and decisions made within the implementation phase are documented and clearly displayed within the methods section.

• The findings are then presented in the results (4) section, where they are presented numerically through graphs and tables. Error calculations are also exhibited in this section.

• The results are then analyzed within the discussion (5) section. This means considering the error calculations and the results, attempting to explain why the results look the way they do, and assessing whether or not the research question can be answered based on those results.

• The conclusion (6) then consists of a final statement on whether or not the research question has been answered. It must also address what implications the results have for the future of the concerned areas.

2 Background

2.1 Stock Markets

The stock market consists of buyers and sellers and their stocks. A stock is essentially a document that proves your right to a part of a company’s current and future net assets. For example, when companies wish to expand their business but do not have the financial resources for it, they can choose to sell a share of their stock. The owners would sell a share of the company in order to raise new capital to invest in the expansion of the business (Kennon 2014).

Stock prices tend to rise when the demand for a particular stock, at the current market price, is higher than the supply of that stock. One event which would cause such a rise is a company announcing an unexpected increase in its quarterly result. This would cause the price of that stock to rise even if the market is relatively stable (Little 2009).

The Stockholm stock exchange is the largest stock exchange in the Nordic countries. It consists of approximately three hundred listed companies. The market data used in this report was gathered from the Stockholm stock exchange (Nasdaq OMX Nordic 2014).

Many market participants dedicate a lot of time to trying to forecast the market to their own or their customers’ benefit, in order to make a strong return. If they were to succeed they would end up with a large financial gain. Although the subject is interesting, rather complicated and a source of valued knowledge, the main goal is in fact to be able to predict the stock market.

2.2 Data Mining in Stock Market Analysis

Data mining consists of analyzing data from different perspectives and putting the data to use by summarizing it into helpful information. The concept of data mining is essentially finding patterns in large collections of datasets.

By recognizing these patterns in large datasets, analysts can recognize their behaviour and therefore perhaps even predict the future (Alexander, 1998). An effective use of data mining and pattern recognition can lead to more effective investment decisions on the stock market (Palace, 1996). Data mining is a helpful methodology because it helps humans detect patterns that could easily be missed without extensive analysis, and it is thus a large part of this project (Alexander, 1998).

Data mining tools help people see what is going on in the stock market that could be missed without these tools. These kinds of tools are said to be helpful because of the difficulty of stock market prediction (Nawawi, 2013).

The basic steps of using data mining as a stock market prediction tool are to define the pattern, recognize the problem, collect the data, prepare it and lastly preprocess it (Unica Technologies Inc 1997). Thus a large amount of time is spent on preparation before the user can start to predict the wanted prices.

2.3 KNN Algorithm

The KNN algorithm is a machine learning algorithm that is considered to be a lazy learning algorithm, with a low computational cost, and it is very simple to implement (Alkhatib et al. 2013). This means that the algorithm looks at the data closest or most similar to the needed estimation instead of using a function or model to interpret the data. When using the KNN algorithm you divide the data into two parts: the training data, which the algorithm bases its predictions on, and the test data, which the algorithm makes the predictions about (Imandoust et al. 2013). The test data consists of the values that are being predicted with the algorithm. The training data is divided up into vectors and then a distance from the test data to its neighbours is calculated by using one of several methods. The most common method is the weighted euclidean distance, which is essentially the straight-line distance between two points. This can be seen in figure 1, where the nearest neighbours to B? are selected using the euclidean distance.

Figure 1: KNN algorithm explained in a picture

The figure shows how the KNN algorithm uses the Euclidean metric to choose the nearest neighbours.

To explain the implementation of the KNN algorithm, pseudocode is inserted below:


Figure 2: Pseudocode explaining the KNN algorithm in a simple way (KNN Classification Algorithm Implemented in Lisp)
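The pseudocode figure is not reproduced in this text. As a rough stand-in, the following is a minimal sketch of KNN classification by majority vote. It assumes plain (unweighted) Euclidean distance, and the names `euclidean`, `knn_classify` and the sample points are hypothetical and chosen only for illustration; it is not the Lisp implementation the figure refers to.

```python
import math
from collections import Counter


def euclidean(a, b):
    """Euclidean distance between two equal-length numeric sequences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_classify(training, query, k):
    """Return the majority label among the k training points nearest to query.

    training is a list of (vector, label) pairs (hypothetical layout).
    """
    # Sort the training points by distance to the query and keep the k closest.
    nearest = sorted(training, key=lambda item: euclidean(item[0], query))[:k]
    # Count the labels of those neighbours and return the most frequent one.
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Illustrative data only: two loose clusters labelled "A" and "B".
training = [
    ((1.0, 1.0), "A"),
    ((1.2, 0.8), "A"),
    ((0.9, 1.1), "A"),
    ((5.0, 5.0), "B"),
    ((5.5, 4.5), "B"),
]
label = knn_classify(training, (1.1, 1.0), k=3)
```

Sorting the whole training set is the simplest approach and is adequate for small data sets; larger data sets would normally use a spatial index instead.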


2.4 Moving Average

The MA is a statistical method which traders have traditionally used as a tool in predicting future prices (Interactive Data Corp, 2014). The formula is based on taking a certain amount of prior data and averaging it in order to make a prediction. The formula may vary in appearance but usually follows the following structure:

Figure 3: The mathematical formula for the MA

As a common tool for traders this method will later be compared to the prediction using the KNN algorithm.

3 Method

In order to efficiently provide an answer to our research question, the correct data first had to be gathered, and then the two methods were implemented and used to predict the closing prices of the four equities.

3.1 The Data Selection

One large part of this project was choosing which data to use the prediction methods with. As previously discussed, the data was chosen from the Stockholm stock exchange as this was the most easily accessed and the most relevant stock exchange data. Four equities were then chosen from different sectors. This was important because the four equities should not all follow the same trends and movements, as is often seen for equities within the same sector. The chosen equities belong to large global companies that are often not based only in Stockholm. This means that the data would also react to changes within other stock markets and not only to the Stockholm stock exchange. The chosen four were: Investor, a Swedish based investment company; H&M, a Swedish based retail company; Astrazeneca, a Swedish and British based pharmaceutical company; and lastly ABB, a Swedish and Swiss based technical company.

Of all the prices taken of an equity during a trading day, the closing price is often the most accurate (Saint-Leger, 2014). This is not only because these prices are used for the valuation of, for example, funds, but also because these are the prices which will be historically published. That in turn means that these prices are often checked several times before they are officially published. Price manipulation is also the least evident in closing prices, as these prices are taken just as the market closes. A trader attempting to influence the closing price would have to keep an open position until the market opened the next day, which is often not desirable because the trader cannot adjust the position until the morning. Traders often have higher risk limits during the day than overnight. So if, for example, a major macroeconomic event were to happen in a different time zone, the trader would not have the chance to adjust his/her position (Saint-Leger, 2014).

When choosing the time frame for which the closing prices were to be selected, there were two major constraints to be considered. Firstly, an unlimited amount of data could not be selected, as this would cause practical problems. Secondly, the selected data set had to be large enough to ensure that the predictions would be accurate. After considering this, a time frame of two years was selected, as this was enough data to produce accurate predictions from while the amount of data was still manageable.

With this in mind, the historical daily closing prices for a time span of two years were selected as the data which the predictions would be based on and compared to.

3.2 KNN Algorithm

As an initial step the KNN algorithm had to be implemented and valid data had to be gathered from it. Several programming languages that could be used to implement the algorithm were considered. After the question was researched and a handful of languages were compared, it quickly became evident that the most suitable language was MATLAB. MATLAB has several built-in functions which were utilized and simplified the implementation phase. The built-in function which is central to our version of the algorithm is the knnsearch function, which essentially returns the nearest neighbours of the values within a matrix based on values within another matrix (the training data). This function was used and then adapted to the specific situation of predicting the closing prices. The function allows one to specify the number of neighbours to be selected, and after testing several different numbers of neighbours, K = 4 was selected as it yielded the most accurate results, meaning the smallest deviation from the actual price.

The algorithm was implemented so that all the historical data was gathered in the second matrix, which contained all the "training data", meaning the data the algorithm would select the neighbours from. The first matrix contained all the prices which were tested, meaning the prices which were used in order to predict the market the next day. The prediction was made only one day at a time, and a previously predicted value was not stored in order to make a future prediction. The reason for this was that when attempting to predict using already predicted prices, the price had a tendency to converge towards a single value and the predicted values became very far off the actual values. The one day prediction method was therefore chosen, using old data so that each prediction could be compared with the actual value. The distances between the selected price and its nearest neighbours are calculated using the Euclidean distance, which is used as a control tool to see how close the K neighbours are to the selected value used for the prediction (Greenacre, 2008).


The euclidean distance is calculated by:

Figure 4: The mathematical formula for the euclidean distance

n - number of coordinates of the two points
x and y - the coordinates of the two points
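The figure with the formula is not reproduced in this text. Assuming the standard definition of the Euclidean distance, and writing x_i and y_i for the i-th coordinates of the two points (the index i is introduced here for illustration), the formula reads:

```latex
\[
d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}
\]
```

For single closing prices (n = 1) this reduces to the absolute difference |x - y| between the two prices.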

In order to find the next day’s prediction, the indexes of the selected nearest neighbours were each increased by one in order to select the day following each neighbour. This causes an issue if a neighbour is the last price in the matrix, since adding one makes the index out of bounds. This was solved with a special for-loop. The values of the selected next-day neighbours are then averaged in order to make a prediction of how the closing price will evolve the following day (PlatinumPrep, 2009).

The average was calculated by:

Figure 5: The mathematical formula for the average

Where the x_i values were the nearest neighbours and n was the number of neighbours k.
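The figure with the formula is not reproduced in this text. Assuming it shows the ordinary arithmetic mean, with x_i the next-day prices of the selected neighbours and n = k, it can be written as:

```latex
\[
\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i
\]
```

With K = 4, as chosen in this report, the prediction is the sum of four next-day prices divided by four.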

As a last step a while loop was implemented in order to be able to predict several prices in one run, which simplified the process of gathering the prediction data for two years.
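The one day prediction described in this section can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the MATLAB code used in the project: the function name `knn_next_day` and the sample prices are hypothetical, the out-of-bounds case is handled by simply excluding the last day from the candidate neighbours (the report's own for-loop is not shown), and ties in distance are broken by the earlier date.

```python
def knn_next_day(history, today_price, k=4):
    """Predict the next closing price from the k historical prices closest to today_price.

    history is a chronologically ordered list of closing prices (hypothetical layout).
    """
    # Only days that have a following day can serve as neighbours,
    # which avoids the out-of-bounds index mentioned in the text (an assumption).
    candidates = range(len(history) - 1)
    # For single prices the Euclidean distance is the absolute difference.
    nearest = sorted(candidates, key=lambda i: abs(history[i] - today_price))[:k]
    # Shift each neighbour index by one day and average those next-day prices.
    next_day = [history[i + 1] for i in nearest]
    return sum(next_day) / len(next_day)


# Illustrative prices only, not data from the report.
history = [100.0, 102.0, 101.0, 105.0, 103.0, 104.0, 106.0]
prediction = knn_next_day(history, today_price=103.5, k=4)
```

Calling this function once per trading day, each time with the actual closing price rather than an earlier prediction, corresponds to the one-day-at-a-time scheme described above.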

3.3 Moving Average

In order to calculate the MA of the closing prices, Excel and its formula functions were utilized. The MA is calculated by taking the previous N days of data and averaging them (TA-Guru, 2010). N was set to 5 because, when different values were compared, it gave the most accurate results, where accurate means closest to the actual closing price.


The MA was calculated by:

Figure 6: The mathematical formula for the MA

Where CLOSE(i) is the closing price and N is the number of days one has selected data for.
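The figure with the formula is not reproduced in this text. A plausible reconstruction, assuming the usual simple moving average and taking CLOSE(i) to be the closing price of the i-th of the N preceding days, is:

```latex
\[
\mathrm{MA} = \frac{1}{N} \sum_{i=1}^{N} \mathrm{CLOSE}(i)
\]
```

With N = 5 this is the sum of the five most recent closing prices divided by five.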

This formula was then applied to the gathered closing data for all of the equities, where each set of 5 previous days was used to predict a sixth. As this prediction method requires a certain amount of previous data, the predictions were only possible 5 days after the first data was gathered. This means that for the KNN algorithm the predictions begin on 2012-01-02 and for the MA formula the predictions begin on 2012-01-09.
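The rolling use of five previous days to predict a sixth can be sketched in Python as follows. The project itself used Excel formulas, so this is only an illustrative equivalent; the function name `moving_average_predictions` and the sample prices are hypothetical.

```python
def moving_average_predictions(closes, n=5):
    """Predict each day's close as the mean of the n preceding closes.

    Returns a list aligned with closes; the first n entries are None
    because fewer than n earlier prices exist for those days.
    """
    predictions = [None] * min(n, len(closes))
    for t in range(n, len(closes)):
        window = closes[t - n:t]  # the n closes immediately before day t
        predictions.append(sum(window) / n)
    return predictions


# Illustrative prices only, not data from the report.
closes = [10.0, 11.0, 12.0, 13.0, 14.0, 15.0, 16.0]
preds = moving_average_predictions(closes, n=5)
```

The leading None entries mirror the delay noted above: no MA prediction is available until five closing prices have been observed.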

3.4 Error Calculation

3.4.1 Root Mean Square Error (RMSE)

RMSE is a method based on measuring the difference between an estimated value and an actual observation (Holmes, 2000).

Figure 7: The mathematical formula for RMSE

The formula can be described as follows: the differences between the observed values X_obs and the estimated values X_model are squared, summed and divided by the number of values in one data set (n), i.e. the number of observed values, which is the same as the number of estimated values, and lastly the square root of the result is taken. This universal error measurement will be used to measure the accuracy of the two different prediction methods that are implemented.
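The figure with the formula is not reproduced in this text. Assuming the standard definition of RMSE and writing X_{obs,i} and X_{model,i} for the i-th observed and estimated values (the index i is introduced here for illustration), the description corresponds to:

```latex
\[
\mathrm{RMSE} = \sqrt{\frac{\sum_{i=1}^{n} \left( X_{obs,i} - X_{model,i} \right)^2}{n}}
\]
```

Because the differences are squared before averaging, large individual deviations weigh more heavily than small ones.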

3.4.2 Mean Percentage Error (MPE)

This method is similar to the previous method, as it is designed to measure the difference between a predicted value (often by a model) and an actual observed value. However, it differs from the previous method in that it is a percentage error (Swanson et al. 2010).


Figure 8: The mathematical formula for MPE

The formula takes the difference between the forecast value $f_t$ and the actual value $a_t$, expresses it as a percentage, and averages these percentages over the number of values in the set (n). Here n is the number of forecast values, which equals the number of actual values.
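A minimal Python sketch of the percentage error follows. It assumes the common textbook form MPE = (100 / n) * sum((a_t - f_t) / a_t), which may differ in detail from the formula in Figure 8. The `absolute` switch is an added assumption that gives the mean absolute percentage error, in which positive and negative errors cannot cancel. The function name and the example values are hypothetical.

```python
from typing import Sequence


def mean_percentage_error(actual: Sequence[float],
                          forecast: Sequence[float],
                          absolute: bool = False) -> float:
    """Average percentage error of the forecasts relative to the actual values."""
    if len(actual) != len(forecast) or len(actual) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    terms = []
    for a, f in zip(actual, forecast):
        error = (a - f) / a  # relative error for one day
        terms.append(abs(error) if absolute else error)
    return 100.0 * sum(terms) / len(terms)


# Hypothetical values: one forecast is 10% too high, the other 5% too low.
signed_mpe = mean_percentage_error([100.0, 200.0], [110.0, 190.0])
absolute_mpe = mean_percentage_error([100.0, 200.0], [110.0, 190.0], absolute=True)
print(signed_mpe, absolute_mpe)
```

The signed and absolute variants can differ considerably, so it matters which one is reported when comparing methods.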

3.4.3 Average Difference

The AD is a simple error measure: the average difference between the predicted value and the actual value (World Encyclopedia, 2013).

Figure 9: The mathematical formula for AD

The formula sums the absolute differences between the predicted values (x) and their corresponding actual values (y) and divides the sum by the number of values in the data set (n). Here n is the number of predicted values, which equals the number of actual values.
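The AD can be sketched in Python as follows. The sketch assumes that the absolute value of each difference is taken before averaging, as stated for the AD in the results section; the function name `average_difference` and the example prices are hypothetical.

```python
from typing import Sequence


def average_difference(predicted: Sequence[float], actual: Sequence[float]) -> float:
    """Mean of the absolute differences between predicted and actual values."""
    if len(predicted) != len(actual) or len(predicted) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(x - y) for x, y in zip(predicted, actual)) / len(predicted)


# Hypothetical prices (SEK): the differences are 1, -2 and 0.
example_ad = average_difference([101.0, 100.0, 104.0], [100.0, 102.0, 104.0])
print(example_ad)
```

Unlike the RMSE, the AD weights every SEK of deviation equally, so it is less sensitive to occasional large misses.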

4 Results

4.1 Graphs and Tables

All four equities were first predicted with the KNN algorithm over the selected two-year period, and the process was then repeated with the MA formula. This generated a large amount of data, of which the following passage presents a summarized and compressed version. To give an initial overview of the data, an average price was calculated for each prediction method using the same method as described earlier.


Figure 10: The difference between the average actual price and the average price from each prediction method, per equity

Figure 10, a combined graph and table, presents the deviation between the average price from each prediction method and the average actual closing price. In other words, an average was taken over all predicted prices from each forecasting method and over all actual closing prices. The average price from the KNN algorithm was then compared with the actual average price, and the same was done for the MA formula. The bar chart in Figure 10 displays the difference between the predicted average price and the actual average price for each equity. As seen in Figure 10, the KNN algorithm (blue) has a consistently lower value than the MA formula (red). This suggests that, on average, the KNN prediction was closer to the actual value than the MA prediction.
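A small Python sketch may help to show what this comparison of averages measures and what it can hide. The function name `average_price_gap` and the prices are hypothetical, and taking the absolute value of the gap is an assumption. Because over- and under-predictions cancel when prices are averaged, the gap between the averages can be zero even when individual daily predictions are off.

```python
from statistics import mean
from typing import Sequence


def average_price_gap(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """Absolute difference between the mean predicted and the mean actual price."""
    return abs(mean(predicted) - mean(actual))


# Hypothetical prices (SEK), for illustration only.
actual_prices = [100.0, 102.0, 104.0]
predicted_prices = [102.0, 100.0, 104.0]  # one day 2 SEK too high, one day 2 SEK too low

gap = average_price_gap(actual_prices, predicted_prices)
daily_errors = [abs(p - a) for p, a in zip(predicted_prices, actual_prices)]
print(gap)                 # the averages coincide although two days were mispredicted
print(mean(daily_errors))  # the per-day error is clearly non-zero
```

For this reason the average price gap is best read together with per-day error measures such as the RMSE, MPE and AD.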

To give a fuller picture of the data collected, the following graphs and tables were produced. For each method, the predicted values are graphed against the actual values, showing how the forecast differs from the actual closing prices. Where the two lines diverge, the prediction deviated from the actual price. The graphs are presented per equity, since the prediction methods performed differently for each equity. A total of 500 rows of data was collected. Because this is a large amount, the tables below show only an excerpt of the raw data, covering approximately one month. Each table lists the actual price, the predicted price and the difference between the two (deviation).


4.1.1 ABB

Figure 11: Graph displaying the actual values for ABB against the predicted values from the KNN algorithm

Figure 12: Table displaying a short cutout of the actual price values, those predicted using the KNN algorithm and their difference (deviation)


Figure 13: Graph displaying the actual values for ABB against the predicted values from the MA formula

Figure 14: Table displaying a short cutout of the actual price values, those predicted using the MA formula and their difference (deviation)


4.1.2 Astrazeneca

Figure 15: Graph displaying the actual values for Astrazeneca against the predicted values from the KNN algorithm

Figure 16: Table displaying a short cutout of the actual price values, those predicted using the KNN algorithm and their difference (deviation)


Figure 17: Graph displaying the actual values for Astrazeneca against the predicted values from the MA formula

Figure 18: Table displaying a short cutout of the actual price values, those predicted using the MA formula and their difference (deviation)


4.1.3 H&M

Figure 19: Graph displaying the actual values for H&M against the predicted values from the KNN algorithm

Figure 20: Table displaying a short cutout of the actual price values, those predicted using the KNN algorithm and their difference (deviation)


Figure 21: Graph displaying the actual values for H&M against the predicted values from the MA formula

Figure 22: Table displaying a short cutout of the actual price values, those predicted using the MA formula and their difference (deviation)


4.1.4 Investor

Figure 23: Graph displaying the actual values for Investor against the predicted values from the KNN algorithm

Figure 24: Table displaying a short cutout of the actual price values, those predicted using the KNN algorithm and their difference (deviation)


Figure 25: Graph displaying the actual values for Investor against the predicted values from the MA formula

Figure 26: Table displaying a short cutout of the actual price values, those predicted using the MA formula and their difference (deviation)


4.2 Error Calculation

As described earlier, three methods of error calculation were used to clearly state the accuracy of each prediction method: RMSE, MPE and AD.

Figure 27: Table displaying the three error calculation methods and the corresponding values for each prediction method, per equity

The RMSE method squares the difference between each actual value and predicted value, averages the squares and takes the square root. As Figure 27 shows, the KNN algorithm has a lower RMSE than the MA formula for each of the four equities. The KNN values are almost half of the MA values, meaning that the KNN predictions deviated roughly half as much as the MA predictions.

The MPE method takes the difference between the actual values and the predicted values and expresses it as a percentage. As with the RMSE, the MPE values for the KNN algorithm were lower than those for the MA formula. This implies that the KNN predictions deviated less from the actual prices in percentage terms than the MA predictions did. One interesting observation is that the equity Astrazeneca, which received the highest errors under the other two methods, received the lowest error under this one. This observation is discussed further later in the paper.

Lastly, the AD method takes the absolute value of the difference between the actual price and the predicted price and averages it. As with the previous two methods, the errors for the KNN algorithm were lower than those for the MA formula.

5 Discussion

5.1 KNN Algorithm

After viewing the graphs and tables in the results section, a clear trend becomes evident, especially in the average prices graph and the error calculation table. The average prices graph shows how close the average predicted closing price from the KNN algorithm was to the actual average closing price: the difference was smaller than 0.1 SEK. This can be compared with the AD, the average difference per day, whose values were between 1.09 SEK and 2.05 SEK, which is also a small difference per day. However, these measurements are only averages, and the equity graphs (showing the actual price against the price predicted with KNN) show that the price levels were constantly changing. To answer the question of whether the KNN prediction method was more precise than the MA formula, one therefore cannot take only the average prices into consideration.

Another set of results that must be considered is the error calculations. The RMSE values for the KNN algorithm were lower, meaning that the deviations between the actual values and the predicted values were smaller for the KNN algorithm than for the MA formula. This was also true for the MPE, which is a percentage error and therefore accounts for the different price levels of the equities, something the other two error calculations do not do. Here the percentage error per day was also low, with a maximum of 0.8% per day (the ABB equity). This could also answer the previously raised question of why the Astrazeneca equity performed worst in the other two error calculations but best for the MPE. Astrazeneca's average price is 319.914 SEK, almost twice as high as the other equities. Because the price is much higher, the deviations appear larger when the size of the total price is not taken into consideration. As the MPE expresses the error as a percentage of the total price, it may be the most appropriate indicator, since it accounts for the large price difference between the equities. However, this makes no difference to the comparison with the MA formula, as the KNN errors were considerably lower than the MA errors for every equity, including Astrazeneca.
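The effect of the price level on the two kinds of error can be illustrated with a short worked example in Python. The prices and deviations below are invented for illustration and are not taken from the study's data; only the rough price levels (about 320 SEK against about half of that) follow the text. The function name `percentage_deviation` is hypothetical.

```python
def percentage_deviation(actual: float, predicted: float) -> float:
    """Absolute deviation expressed as a percentage of the actual price."""
    return 100.0 * abs(actual - predicted) / actual


# Hypothetical high-priced equity: a 2.00 SEK miss on a 320 SEK price.
high_priced = percentage_deviation(320.0, 322.0)
# Hypothetical lower-priced equity: a 1.50 SEK miss on a 160 SEK price.
low_priced = percentage_deviation(160.0, 161.5)

print(high_priced, low_priced)
# The larger absolute miss (2.00 SEK) is the smaller relative miss.
```

In this example the high-priced equity looks worse in SEK terms but better in percentage terms, which is the pattern described for Astrazeneca.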

Another matter to discuss is the choice of the value of K in the KNN algorithm. As the value was chosen through experimentation and evaluation, there is no way of knowing whether it was optimal without a systematic evaluation, which was not done. However, since the data already supports our hypothesis that the KNN algorithm would be the more accurate prediction method, a better value of K would not have changed the conclusions reached. It could, however, have made the KNN results even more accurate. This would have to be researched further.

As mentioned in the method section, a special case was added for when the algorithm arrives at the last price in the training data, where the next index would fall out of bounds because there are no more days to use. In that case the next index is taken to be the initial closing price, which may have no direct relation to the last closing price in the matrix. It is nevertheless used to predict the closing price after the last price, which can lead to an invalid prediction. However, as this affects only one prediction out of 499, its impact on the overall accuracy of the algorithm is relatively small. It could have been avoided by supplying more data.

The way the predictions were made must also be considered in detail. The method was based on one single prediction at a time. A single price was selected and compared with the training data to find its nearest neighbours, and the prices following those neighbours were taken. Predicted prices were not stored and used as input for further predictions: each price was predicted from the existing training data only. Whether this method was optimal can be argued, and predicting more than one day ahead would require another method. During implementation, using already predicted prices to predict further was tested, but the predictions tended to converge towards a single value, so single-day prediction was favoured. The method used can easily be adapted to predict further into the future, but it then also risks converging towards a single value if one tries to predict too far ahead. If the convergence problem could be overcome, both methods could be used to predict closing values on the stock market.
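The single-step prediction described above can be sketched in Python. This is a simplified illustration under several assumptions and not the implementation used in the study: distance is taken as the absolute difference between prices, the neighbours' following prices are combined by averaging, and k = 2 is an arbitrary example value. The wrap-around from the last training price to the first mirrors the special case discussed above. The function name `knn_predict_next` and the prices are hypothetical.

```python
from typing import List


def knn_predict_next(training: List[float], query: float, k: int = 3) -> float:
    """Predict the price that follows `query` from its k nearest training prices."""
    if k <= 0 or k > len(training):
        raise ValueError("k must be between 1 and len(training)")
    # Rank training indices by distance to the query price (assumed 1-D distance).
    ranked = sorted(range(len(training)), key=lambda i: abs(training[i] - query))
    neighbours = ranked[:k]
    # Take the price that followed each neighbour; the last index wraps to the first.
    next_prices = [training[(i + 1) % len(training)] for i in neighbours]
    # Combining the neighbours' next prices by averaging is an assumption.
    return sum(next_prices) / k


# Hypothetical training prices (SEK), for illustration only.
training_prices = [100.0, 102.0, 101.0, 103.0, 104.0, 102.5]
knn_prediction = knn_predict_next(training_prices, query=102.0, k=2)
print(knn_prediction)
```

Here the two nearest neighbours of 102.0 are the prices 102.0 and 102.5. The second is the last training price, so its "next" price wraps around to 100.0, which illustrates how the special case can introduce an unrelated value into a prediction.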

5.2 MA Formula

After reviewing the results, one can safely say that the prices predicted with the MA formula deviated more than those predicted with the KNN algorithm. This can be seen in the average prices graph, where the difference between the average actual price and the average MA-predicted price was almost ten times as large as for the KNN algorithm's predicted prices. In the error calculations, however, the difference between the two prediction methods is not as great. For example, the MPE values, which ranged from 1.07% to 1.4%, were only slightly higher than the values for the KNN algorithm. An average error of approximately 1% over 2 years of data is still a very accurate result, even though the KNN algorithm performed more accurately.

There can be many reasons why the MA formula was not as precise as the KNN algorithm, but one main reason could be the difference in how the methods consider previous data. While the KNN method searches for relevant data, the MA formula regards only the previous N days. The MA formula bases the entire prediction on those N days, whilst the KNN algorithm uses the entire 2 years' worth of data to find relevant neighbours. This is a limitation of the MA formula, as it does not take longer historic trends into consideration but looks only at the most recent data.

The value of N was chosen as 5 after some experimenting and comparison. However, the question arises whether this was the most effective way of choosing N. Only whole numbers were considered, which could have limited how well N was chosen in this case. A further study evaluating more values of N could yield more accurate results from the MA formula. However, given the large difference in deviation between the two methods, a different value of N might improve the MA results but would most probably not make the method more accurate than the KNN algorithm.
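One way such a comparison of N values could be organised is sketched below in Python. The study does not state how the comparison was carried out, so this is only one possible approach: each candidate N is scored by its mean absolute deviation on the same days, and the N with the lowest score is selected. The function names and the price series are hypothetical.

```python
from typing import Dict, List


def ma_error_for_window(closes: List[float], n: int, start: int) -> float:
    """Mean absolute deviation of n-day moving-average predictions from day `start` on."""
    errors = []
    for day in range(start, len(closes)):
        prediction = sum(closes[day - n:day]) / n
        errors.append(abs(prediction - closes[day]))
    return sum(errors) / len(errors)


def compare_windows(closes: List[float], candidates: List[int]) -> Dict[int, float]:
    """Score every candidate N on the same days so the comparison is fair."""
    start = max(candidates)
    if start >= len(closes):
        raise ValueError("not enough data for the largest candidate window")
    return {n: ma_error_for_window(closes, n, start) for n in candidates}


# Hypothetical, steadily rising prices: shorter windows lag less and score better here.
closes = [10.0, 11.0, 12.0, 13.0, 14.0, 15.0, 16.0, 17.0]
scores = compare_windows(closes, [1, 2, 3])
best_n = min(scores, key=scores.get)
print(scores, best_n)
```

On real price series the best N depends on the data, so the result of this toy example says nothing about whether N = 5 was the best choice in the study.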

5.3 Implications of the results

As the results show that the KNN algorithm is a more accurate way of predicting future closing prices, the question arises why traders do not usually use more technical methods to predict stock market changes and in turn earn more money. One possible explanation, assuming that most traders have a purely economic background, is that they perceive the world of algorithms and programming as intimidating and complex.

Another implication that must be considered is that if algorithms were able to predict market movements extremely accurately and were available to all market participants, the potential for making an extraordinary return from trading would disappear.

Lastly, algorithms cannot predict dramatic world events that will impact the stock market. At present, an algorithm used to predict future prices cannot foresee large unexpected events that can alter prices. The algorithm cannot understand the meaning of a terror attack or other events on the same scale, or that they impact the stock market, as they are not included in the historical data. With this in mind, the prediction cannot be based only on the algorithm, as social aspects, which can only be incorporated by a trader, also need to be considered. A combination of the two, however, could only be beneficial to the trader.

6 Conclusion

Our problem statement asked: is using the KNN algorithm a more precise way of predicting the future closing prices of equities than using the more common MA method? To answer this question, the two prediction methods were implemented and data was produced. The results were graphed and partially displayed in tables, and calculations were then made to measure the accuracy of the data. After taking all of the data and the presented calculations into account, the answer to the stated question was clear: the KNN algorithm was more accurate on all counts.

This was not very surprising, as the KNN algorithm considered the entire two years of data for each prediction and thereby took historic trends into consideration. It looked at the historic movement of the days that resembled the value one was attempting to predict. This method of prediction proved to be effective and more accurate than a method commonly used by traders.

The implication is that traders should explore prediction using algorithms and machine learning more than they do today. As computer science continues to evolve and to simplify the environment around us, maybe it is time for the trading environment to rely more on computer science than on statistical methods.

7 References

• Alexander, D. (1998). Data mining. Available at: http://www.laits.utexas.edu/~anorman/BUS.FOR/course.mat/Alex/ [Accessed 3rd March 2014].

• Alkhatib, K., Najadat, H., Hmeidi, I., Ali Shantnawi, M.K. (2013) Stock Price Prediction Using K-Nearest Neighbor (kNN) Algorithm, International Journal of Business, Humanities and Technology, 3 (3), pp. 34-44. Available at: http://www.ijbhtnet.com/journals/Vol_3_No_3_March_2013/4.pdf [Accessed: 28th February 2014].

• Azhar, S., Badros, G., Glodjo, A., Kao, M.Y. and Reif, J. (1994) ’Data compression techniques for stock market prediction’. Data Compression Conference, pp. 1-11.

• Beattle, A. (2011) The Basics Of Business Forecasting. Available at: http://www.investopedia.com/articles/financial-theory/11/basics-business-forcasting.asp [Accessed 01 March 2014].

• Berson, A., Smith, S. and Thearling, K. (1999) Building Data Mining Applications for CRM. New York: McGraw-Hill.

• Dominique, G., Huck, N. (2005) On the Use of Nearest Neighbors in finance. Available at: http://hal.archives-ouvertes.fr/docs/00/18/08/58/PDF/Guegan-Huck_finance2005.pdf [Accessed: 21st February 2014].


• Greenacre, M. (2008) ’Measures of distance between samples: Euclidean’. In: Greenacre, M. ed. Correspondence Analysis and Related Methods. Stanford University.

• Holmes, S. (2000) RMS Error [RMS Error]. Stanford University, 28th November.

• Ihilliard1. (2011) Faa guidance in human error for air traffic control and pilot crew. Available at: http://www.studymode.com/essays/Faa-Guidance-In-Human-Error-For-704108.html [Accessed 1st April 2014].

• Imandoust, S.B., Bolandraftar, M. (2013) ’Application of K-Nearest Neighbor (KNN) Approach for Predicting Economic Events: Theoretical Background’, S B Imandoust Et Al. Int. Journal of Engineering Research and Applications, 3 (5), pp. 605-610. Available at: http://www.ijera.com/papers/Vol3_issue5/DI35605610.pdf [Accessed: 28th February 2014].

• Interactive Data Corp. (2014) Moving Averages - Simple and Exponential. Available at: http://stockcharts.com/help/doku.php?id=chart_school:technical_indicators:moving_averages [Accessed 5th March 2014].

• Kennon, J. (2014) An introduction to the stock market. Available at: http://beginnersinvest.about.com/cs/newinvestors/l/bl_lesson1c.htm [Accessed: 4th March 2014].

• Little, K. (2009) How stock prices are changed. Available at: http://stocks.about.com/od/tradingbasics/a/032909prices.htm [Accessed: 4th March 2014].

• Nasdaq OMX Nordic. (2014) Companies listed on Nasdaq OMX Stockholm. Available at: http://www.nasdaqomxnordic.com/aktier/listed-companies/stockholm [Accessed: 4th March 2014].

• Nawawi, A. (2013) ’Stock market tip: use Google Trends’, The Conversation, 26th April. Available at: http://theconversation.com/stock-markettipusegoogletrends13745 [Accessed: 1st April 2014].

• Palace, B. (1996) Data Mining. Available at: http://www.anderson.ucla.edu/faculty/jason.frand/teacher/technologies/palace/datamining.htm [Accessed: 3rd March 2014].

• PlatinumPrep, LLC. (2009) Arithmetic Mean (Average) - GMAT Math Study Guide. Available at: http://www.platinumgmat.com/gmat_study_guide/statistics_mean [Accessed: 8th April 2014].

• Saint-Leger, R. (2014) What Is the Significance of a Closing Price on a Stock? Available at: http://finance.zacks.com/significance-closing-price-stock-3007.html [Accessed: 9th April 2014].

• Swanson, D.A., Tayman, J., Bryan, T.M. (2010) Mape-R: a rescaled measure of accuracy for cross-sectional forecasts. University of California Riverside and University of California San Diego.

• TA-Guru. (2010) Moving average. Available at: http://www.ta-guru.com/Book/TechnicalAnalysis/TechnicalIndicators/MovingAverage.php5 [Accessed: 9th April 2014].

• Unica Technologies Inc. (1997) Solving Data Mining Problems Using Pattern Recognition Software with Cdrom. Prentice Hall PTR, Upper Saddle River, NJ, USA.

• Wolski, C., Media, D. (2014). Five Factors or Events that Affect the Stock Market. Available: http://smallbusiness.chron.com/five-factorseventsaffectstockmarket3384.html [Last accessed: 1st April 2014].


• World Encyclopedia. (2013) Genomsnittlig avvikelse [Average deviation]. Available at: http://sv.swewe.com/word_show.htm/?26143_1&Genomsnittlig%7Cavvikelse [Accessed: 9th April 2014].

• Wu, X., Kumar, V., Quinlan, J.R., Ghosh, J., Yang, Q., Motoda, H., McLachlan, J.G., Ng, A., Liu, B., Yu, S.P., Zhou, Z.H., Steinbach, M., Hand, J.D. and Steinberg, D. (2007) ’Top 10 Algorithms in Data Mining’, Springer-Verlag London Limited, 14 (1), pp. 1-37. Available at: http://www.cs.umd.edu/~samir/498/10Algorithms08.pdf [Accessed 01 Mar. 2014].

7.1 Figures:

• Figure 1: Sergei Savchenko (1998). Editing Nearest Neighbour Decision Rules. Available: http://jeff.cs.mcgill.ca/~godfried/teaching/projects.pr.98/sergei/project.html Last accessed: 1st April 2014

• Figure 2: k-Nearest Neighbor Classification Algorithm Implemented in Lisp. Available: http://web.bahcesehir.edu.tr/atabey_kaygun/other/knn.html Last accessed: 1st April 2014
