
Predicting the price of Bitcoin using Machine Learning

MSc Research Project

Data Analytics

Sean McNally
x15021581

School of Computing

National College of Ireland

Supervisor: Dr. Jason Roche

www.ncirl.ie


National College of Ireland
Project Submission Sheet – 2015/2016

School of Computing

Student Name: Sean McNally
Student ID: x15021581
Programme: Data Analytics
Year: 2016
Module: MSc Research Project
Lecturer: Dr. Jason Roche
Submission Due Date: 22/08/2016
Project Title: Predicting the price of Bitcoin using Machine Learning
Word Count: XXX

I hereby certify that the information contained in this (my submission) is information pertaining to research I conducted for this project. All information other than my own contribution will be fully referenced and listed in the relevant bibliography section at the rear of the project.

ALL internet material must be referenced in the bibliography section. Students are encouraged to use the Harvard Referencing Standard supplied by the Library. To use other authors' written or electronic work is illegal (plagiarism) and may result in disciplinary action. Students may be required to undergo a viva (oral examination) if there is suspicion about the validity of their submitted work.

Signature:

Date: 9th September 2016

PLEASE READ THE FOLLOWING INSTRUCTIONS:
1. Please attach a completed copy of this sheet to each project (including multiple copies).
2. You must ensure that you retain a HARD COPY of ALL projects, both for your own reference and in case a project is lost or mislaid. It is not sufficient to keep a copy on computer. Please do not bind projects or place in covers unless specifically requested.
3. Assignments that are submitted to the Programme Coordinator office must be placed into the assignment box located outside the office.

Office Use Only
Signature:

Date:
Penalty Applied (if applicable):


Predicting the price of Bitcoin using Machine Learning

Sean McNally
x15021581

MSc Research Project in Data Analytics

9th September 2016

Abstract

This research is concerned with predicting the price of Bitcoin using machine learning. The goal is to ascertain with what accuracy the direction of the Bitcoin price in USD can be predicted. The price data is sourced from the Bitcoin Price Index. The task is achieved with varying degrees of success through the implementation of a Bayesian optimised recurrent neural network (RNN) and a Long Short Term Memory (LSTM) network. The LSTM achieves the highest classification accuracy of 52% and an RMSE of 8%. The popular ARIMA model for time series forecasting is implemented as a comparison to the deep learning models. As expected, the non-linear deep learning methods outperform the ARIMA forecast, which performs poorly. Wavelets are explored as part of the time series narrative but not implemented for prediction purposes. Finally, both deep learning models are benchmarked on both a GPU and a CPU, with the training time on the GPU outperforming the CPU implementation by 67.7%.

1 Introduction

Time series prediction is not a new phenomenon. Prediction of mature financial markets such as the stock market has been researched at length (1)(2). Bitcoin presents an interesting parallel to this, as it is a time series prediction problem in a market still in its transient stage. As a result, there is high volatility in the market (3), and this provides an opportunity in terms of prediction. In addition, Bitcoin is the leading cryptocurrency in the world, with adoption growing consistently over time. Due to its open nature, Bitcoin also poses another paradigm as opposed to traditional financial markets. It operates on a decentralised, peer-to-peer and trustless system in which all transactions are posted to an open ledger called the Blockchain.

This type of transparency is unheard of in other financial markets.

Traditional time series prediction methods such as Holt-Winters exponential smoothing models rely on linear assumptions and require data that can be broken down into trend, seasonal and noise components to be effective (4). This type of methodology is more suitable for a task such as forecasting sales, where seasonal effects are present. Due to the lack of seasonality in the Bitcoin market and its high volatility, these methods are not very effective for this task. Given the complexity of the task, deep learning makes for an interesting technological solution based on its performance in similar areas.



Tasks such as natural language processing, which are also sequential in nature, have shown promising results (5). Such tasks use data of a sequential nature and as a result are similar to a price prediction task. The recurrent neural network (RNN) and the long short term memory (LSTM) flavour of artificial neural networks are favoured over the traditional multilayer perceptron (MLP) due to the temporal nature of the more advanced algorithms (6).

The aim of this research is to ascertain with what accuracy the price of Bitcoin can be predicted using machine learning. Section one addresses the project specification, which includes the research question, sub research questions, the purpose of the study and the research variables. A brief overview of Bitcoin, machine learning and time series analysis concludes section one. Section two examines related work in the area of both Bitcoin price prediction and other financial time series prediction. Literature on using machine learning to predict Bitcoin price is limited. Out of approximately 653 papers published on Bitcoin (7), only 7 relate to machine learning for prediction. As a result, literature relating to other financial time series prediction using deep learning is also assessed, as these tasks can be considered analogous. Section 3 focuses on the design and implementation of the solution to the research question. Section 4 analyses the results of the implemented solution. A forecast using a traditional ARIMA time series model is also developed for performance comparison purposes with the neural network models. Section 5 concludes the paper with reference to future work in the area.

1.1 Project Specification

1.1.1 Research Question

Research question: With what accuracy can the direction of the price of Bitcoin be predicted using machine learning?

Sub research question: What magnitude of performance improvement can be achieved from parallelisation of algorithms on a GPU compared to a CPU?

1.1.2 Purpose

The purpose of this study is to find out with what accuracy the direction of the price of Bitcoin can be predicted using machine learning methods. This is fundamentally a time series prediction problem. While much research exists surrounding the use of different machine learning techniques for time series prediction, research in this area relating specifically to Bitcoin is lacking. In addition, Bitcoin as a currency is in a transient stage and as a result is considerably more volatile than other currencies such as the USD. Interestingly, it was the top performing currency in four of the last five years1. Thus, its prediction offers great potential, and this provides motivation for research in the area. As evidenced by an analysis of the existing literature, running machine learning algorithms on a GPU as opposed to a CPU can offer significant performance improvements. This is explored by benchmarking the training of the RNN and LSTM network on both the GPU and CPU, which provides an answer to the sub research question. Finally, each chosen predictor variable's importance is assessed using a random forest algorithm. This body of research builds on the existing literature in the area, which is assessed in section 2.

In addition, the ability to predict the direction of the price of an asset such as Bitcoin offers the opportunity for profit to be made by trading the asset.

1. The Economist: https://www.economist.com/


To implement a full trading strategy based on the results of the models is worthy of a dissertation in itself, so this paper focuses solely on the accuracy with which the price direction can be predicted. In basic terms, the model would initiate a long position if the price was predicted to go up and a short position if the price was predicted to go down. Several Bitcoin exchanges offer margin trading accounts to facilitate this2. The profitability of this strategy would be based not only on the accuracy of the model, but also on the size of the positions taken. This is outside the scope of this research but could be addressed in future work.

1.2 Research Variables

The dependent variable for this study is the closing price of Bitcoin in US Dollars, taken from the CoinDesk Bitcoin Price Index. Rather than focusing on one specific exchange, this price index takes the average price from five major Bitcoin exchanges: Bitstamp, Bitfinex, Coinbase, OKCoin and itBit. If one were to implement trades based on the signals, it would be beneficial to focus on one exchange. However, the average price is more suitable for this research, as some exchanges suffer isolated drops in price from internal issues, such as Bitfinex, which was recently hacked3. As a result, there is less noise in the averaged dataset. The closing price is chosen over a three-class dummy classification variable representing the price going up, down or staying the same for the following reason: the use of a regression model over a classification model offers further model comparison potential through the capture of the root mean squared error (RMSE) of the models. Classifications are then made based on the prediction of the regression model, e.g. price up, price down or no change.

Additional performance metrics include accuracy, specificity, sensitivity and precision. These are discussed in the implementation section.

The independent variables are taken from the CoinDesk website, Blockchain.info and from the process of feature engineering. In addition to the closing price, the opening price, daily high and daily low are also included. Data taken from the Blockchain includes mining difficulty and hash rate. The features which have been engineered are considered technical analysis indicators (8) and include two simple moving averages (SMA) and a de-noised closing price. The rationale behind the selection of variables is discussed in chapter 2.

1.3 Bitcoin

Bitcoin is the world's most valuable cryptocurrency, introduced following the release of a whitepaper published in 2008 under the pseudonym Satoshi Nakamoto (9). The currency is built on a decentralised, peer-to-peer network, with the creation of money and transaction management carried out by the members of the network. The result of this is that no central authority controls Bitcoin. All Bitcoin transactions are posted in blocks to an open ledger known as the Blockchain, to be verified by miners using cryptographic proof. This verification takes place in a trustless system with no intermediary required to pass the funds from sender to receiver. Bitcoin offers a novel opportunity for prediction due to its relatively young age and resulting volatility. In addition, it is unique in relation to traditional fiat currencies in terms of its open nature. In comparison, no complete data exists regarding cash transactions or money in circulation of fiat currencies. The well-known efficient market hypothesis (10) suggests the price of assets such as currencies reflects all available information, and as a result they trade at their fair value.

2. Poloniex Exchange: https://www.poloniex.com/
3. Bitfinex Exchange: https://www.bitfinex.com/


Although there is an abundance of data available relating to Bitcoin and its network, the author argues that not all market participants will utilise all this information effectively, and as a result it may not be reflected in the price. This paper aims to take advantage of this assumption through various machine learning methods.

Bitcoin is traded on over 40 exchanges worldwide accepting over 30 different currencies, and has a current market capitalisation of 9 billion dollars4. Interest in Bitcoin has grown significantly, with over 250,000 transactions now taking place per day. In addition to the regular use of Bitcoin by private individuals, its lack of correlation with other assets has made it an attractive hedging option for investors. Some research has found that the price volatility of Bitcoin is far greater than that of fiat currencies (3). This offers significant potential in comparison to mature financial markets.

1.4 Machine Learning

Data mining can be defined as the extraction of implicit, previously unknown and potentially useful information from data. Machine learning provides the technical basis for data mining (11). A dataset is comprised of observations, known as instances, which contain one or more variables, known as attributes. Broadly speaking, machine learning can be split into two categories. Supervised learning involves the modelling of datasets with labelled instances. Each instance can be represented as x and y, with x a set of independent predictor attributes and y the dependent target attribute. The target attribute can be continuous or discrete; this has an effect on the model. If the target variable is continuous then a regression model is used, and if the target variable is discrete then a classification model is used (8). Examples of supervised methods include neural networks and support vector machines.

Unsupervised learning involves the modelling of datasets with no known outcome or attribute. The purpose of these techniques is to group similar data into clusters or groups. The purpose of this research is to predict the direction of the price of Bitcoin. As this is a task with a known target it is a supervised machine learning task, although some pre-processing can take advantage of unsupervised learning methods. The techniques explored include wavelets and the discrete wavelet transform, and several types of artificial neural networks, including the multilayer perceptron (MLP), Elman recurrent neural network (RNN) and long short term memory (LSTM) network. In terms of pre-processing, random forests were used for feature selection, while Bayesian optimisation was performed to optimise some of the parameters of the LSTM.

1.4.1 Multilayer Perceptron

Figure 1: Multilayer Perceptron

Simple feed forward neural networks are known as multilayer perceptrons, and they form the basis for other neural network models. In terms of neural network terminology, examples fed to the model are known as inputs and the predictions are known as outputs. Each modular sub-function is a layer.

4. Blockchain: https://www.blockchain.info/


A model consists of an input and output layer, with layers between these known as hidden layers. Each output of one of these layers is a unit, which can be considered analogous to a neuron in the brain. The connections between these units are known as weights, analogous to synapses in the brain. The weights define the function of the model, as they are the parameters adjusted when training a model. Non-linear element-wise operators such as the hyperbolic tangent or rectified linear unit (ReLU) are applied to the input to perform the non-linear transformations between layers. An example of an MLP can be seen in figure 1. One limitation of the MLP, and similarly the RNN, is that they are affected by the vanishing gradient problem. The issue is that, as layers and time steps of the network relate to each other through multiplication, derivatives are susceptible to exploding or vanishing gradients. Vanishing gradients are more of a concern, as they can become too small for the network to learn from, whereas exploding gradients can be limited using regularisation. Another limitation of the MLP is that its signals only pass forward in the network in a static nature. As a result, it does not recognise the temporal element of a time series task effectively, as its memory can be considered frozen in time5. One can consider an MLP to treat all input as a bucket of objects with no order in time. As a result, the same weights are applied to all incoming data, which is a naive approach. The recurrent neural network, sometimes known as a dynamic neural network, addresses some of these limitations.
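To make the terminology concrete, the following minimal Keras sketch builds an MLP with an input layer, two hidden layers using ReLU and tanh non-linearities, and a single regression output. The layer sizes and input dimension are illustrative, not the configuration used in this project.

    from keras.models import Sequential
    from keras.layers import Dense

    # Input of 8 features, two hidden layers, one regression output.
    model = Sequential()
    model.add(Dense(20, activation="relu", input_dim=8))  # hidden layer with ReLU
    model.add(Dense(20, activation="tanh"))               # hidden layer with tanh
    model.add(Dense(1))                                   # output unit (regression)
    model.compile(optimizer="rmsprop", loss="mse")
    model.summary()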

1.5 Recurrent Neural Network

The recurrent neural network (RNN) was first developed by Elman (6). The RNN is structured similarly to the MLP, with the exception that signals can flow both forward and backwards in an iterative manner.

To facilitate this, another layer known as the context layer is added. In addition to passing input between layers, the output of each layer is fed to the context layer, to be fed into the next layer with the next input. In this design, the state is overwritten at each timestep. This offers the benefit of allowing the network to assign particular weights to events that occur in a series, rather than the same weights to all input as with the MLP. This results in a dynamic network. The length of the temporal window is, in a sense, the length of the network's memory. As is discussed in the related work section, it is an appropriate technique for a time series prediction task (5)(12). While this addresses the temporal issue faced with a time series task, vanishing gradients can still be an issue. In addition, some research has found that while RNNs are capable of handling long-term dependencies, in practice they often fail to learn them due to the difficulties between gradient descent and long term dependencies (13)(14). An example of an RNN can be seen below in figure 2. Note the context layer is represented as z-i.

Figure 2: Recurrent Neural Network

5. Deeplearning4j: http://deeplearning4j.org/lstm


1.5.1 Long Short Term Memory

Long short term memory (LSTM) units address both of these issues. Developed by Hochreiter and Schmidhuber (15), they allow the preservation of the weights that are forward and back-propagated through layers. This is in contrast to the Elman RNN, in which the state gets overwritten at each step. They also allow the network to continue to learn over many time steps by maintaining a more constant error. This allows the network to learn long term dependencies. An LSTM cell contains forget and remember gates which allow the cell to decide what information to block or pass based on its strength and importance. As a result, weak signals can be blocked, which prevents vanishing gradients. A comparison of a regular RNN and LSTM can be seen below in figure 3. Note the addition of the gates on the right and left of the LSTM block.

Figure 3: Long Short Term Memory6
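For reference, the gating mechanism described above can be written out explicitly. The equations below are the standard LSTM formulation rather than notation taken from this paper: sigma denotes the sigmoid function, * denotes element-wise multiplication, and x_t, h_t and c_t are the input, hidden state and cell state at time t.

    f_t = sigma(W_f x_t + U_f h_(t-1) + b_f)                         (forget gate)
    i_t = sigma(W_i x_t + U_i h_(t-1) + b_i)                         (input/remember gate)
    o_t = sigma(W_o x_t + U_o h_(t-1) + b_o)                         (output gate)
    c_t = f_t * c_(t-1) + i_t * tanh(W_c x_t + U_c h_(t-1) + b_c)    (cell state)
    h_t = o_t * tanh(c_t)                                            (hidden state)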

1.6 Time Series Analysis

Based on the abundance of literature on the topic, the ARIMA model is a popular approach to time series forecasting (16)(17). However, these models rely on linear assumptions regarding the data. Due to the highly non-linear nature of the Bitcoin price, it was not expected to perform well. The data was differenced to make it stationary to meet the necessary assumptions. As a result, the ARIMA model does not form an integral part of this research. However, a forecast was made using an ARIMA

model for comparison purposes with the final neural network models in the evaluation section. Wavelets and the discrete wavelet transform (DWT) were also considered in the time series analysis domain. Wavelets are mathematical functions useful for analysing and breaking down signals (18). A time series can be considered in the same regard as a signal. The intention was to investigate using wavelets as an input to the neural network model. This was not possible due to time constraints but could be addressed in future work. However, wavelet coherence analysis was used to analyse the correlation between variables in the dataset on a temporal scale. This is a useful exploratory exercise in analysing the relationship between variables over time (19). An example of one of the wavelet coherence analysis spectrum graphs can be seen below in figure 4. This graph represents the cross correlation between Bitcoin and the hash rate over time. From the graph it is apparent that the Bitcoin closing price is correlated with the network hash rate on a long term horizon.

Figure 4: Wavelet Power Spectrum

6. Deeplearning4j: http://deeplearning4j.org/lstm


2 Related Work

Research on predicting the price of Bitcoin using machine learning algorithms specifically is lacking. Shah et al. (20) implemented a latent source model as developed by Chen et al. (17) to predict the price of Bitcoin. The model achieved an impressive 89 percent return in 50 days with a Sharpe ratio of 4.1. The Sharpe ratio examines the performance of an investment while adjusting for its risk. One point of note from this study is that the period selected for testing enjoyed growth of 33 percent using a buy and hold strategy alone. Attempts made to re-create this study independently were unsuccessful. One reason provided for this was that the original authors hand selected 20 patterns observed in their clusters which were similar to those they had seen in trading books, e.g. the head and shoulders. This highlights the danger of cherry-picking data to obtain good results.

Georgoula et al. (21) investigated the determinants of the price of Bitcoin while also implementing sentiment analysis using support vector machines. The authors found that the frequency of Wikipedia views and the network hash rate had a positive correlation with the price of Bitcoin. Sentiment has also been utilised as a predictor of Bitcoin in other research. Matta et al. (22) investigated the relationship between Bitcoin price, tweets and views for Bitcoin on Google Trends. The authors found a weak to moderate correlation between Bitcoin price and both Google Trends views and positive tweets on Twitter, and took this as evidence that they can be used as predictors. However, one limitation of this study is that the sample size is only 60 days. Sentiment was considered as a variable for this research. However, with a variable of this nature the question of trust arises. Misinformation can be spread through various social media channels such as Twitter or on message boards such as Reddit.

In what is known as a pump-and-dump scheme, an investor looking to take advantage could spread misinformation through the channels discussed in order to buy at an artificially deflated price or sell at an artificially inflated price (23). Liquidity on Bitcoin exchanges is considerably limited, so the market suffers from a greater risk of manipulation. For this reason, sentiment from social media is not considered further. Another study by the same authors (24) implemented a similar methodology, except that instead of predicting Bitcoin price they tried to predict trading volume. This study found Google Trends views to be strongly correlated with Bitcoin price. This data sample covered a period of just under one year. This data source was considered for implementation. However, as only the past year's worth of data is available from Google, this variable was not deemed suitable. Some papers have utilised wavelets to find similar results (25)(19). Kristoufek found a positive correlation between search engine views, network hash rate and mining difficulty and the price of Bitcoin in the long run, using wavelet coherence analysis of the Bitcoin price. Wavelets are useful in exploring cross correlation between time series as they have a temporal dimension. As a result, one can see correlation between variables at specific points in time.

Greaves et al. (26) analysed the Bitcoin Blockchain to predict the price of Bitcoin using SVM and ANN. The authors reported price direction accuracy of 55 percent with a regular ANN. They concluded that there was limited predictability in Blockchain data alone, as price is technically dictated by exchanges whose behaviour lies outside of the realm of the Blockchain. Building on these findings, data from the Blockchain, namely hash rate and difficulty, are included in the analysis along with data from the major exchanges provided by CoinDesk. Similarly, Madan et al. (27) attempted to predict Bitcoin's price using Blockchain data. They implemented SVM, Random Forests and a Binomial GLM using data from the Blockchain.


The authors reported accuracy of over 97 percent. One limitation of this study is that the results were not cross-validated. As a result, they may have overfit the data and one cannot be sure the model will generalise.

In terms of a machine learning task, predicting the price of Bitcoin can be considered analogous to other financial time series prediction tasks such as forex and stock prediction. The idea of using an ANN for such a task is not a new notion. Since the discovery of the backpropagation algorithm (28), much research has been undertaken in the area, concluding that ANNs are suitable for modelling and forecasting non-linear time series (29)(30). Several bodies of research implemented the multilayer perceptron for stock price prediction (2)(31). White found limited value in their IBM stock price prediction MLP due to a lack of data. That study utilised a trial and error network parameter search process; one limitation of this approach is that there is no guarantee that a global maximum has been found. Bergstra et al. (32) found a random search process to be more effective than grid search, due to random searches being able to find as good or better models within a small fraction of the computation time. What the authors do not recognise is that it is difficult to know when an optimal model has been found. For this reason, grid search is favoured over random search in this research where Bayesian optimisation is not suitable (33). Bayesian optimisation searches the parameter space for an optimal solution: the fitness is calculated for each candidate model and the fittest model is kept. An example of a Bayesian optimiser is the Python library Hyperopt7. This approach is also implemented. This should reduce the chances of overfitting the model, which is a common problem in the literature: often models perform well on training data but not on unseen data.
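A minimal sketch of such a search with Hyperopt is shown below. The search space and the placeholder objective are illustrative only; a real objective would train the network for a candidate configuration and return its validation loss.

    from hyperopt import fmin, tpe, hp

    # Illustrative search space: dropout rate and choice of optimiser.
    space = {
        "dropout": hp.uniform("dropout", 0.1, 0.9),
        "optimizer": hp.choice("optimizer", ["rmsprop", "adam"]),
    }

    def objective(params):
        # Placeholder for building, training and scoring a network;
        # here it simply favours dropout values near 0.5.
        return (params["dropout"] - 0.5) ** 2

    # TPE is Hyperopt's Bayesian-style sequential search algorithm.
    best = fmin(fn=objective, space=space, algo=tpe.suggest, max_evals=50)
    print(best)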

Another approach to reducing the risk of overfitting is dropout regularisation, as outlined by Wager et al. (34). This approach can be considered a noising scheme that controls for overfitting by corrupting training data. This increases the generalisability of the model, which as a result should perform better on unseen data. Another factor to consider when trying to predict time series data is that the time when certain features appear may contain important information. One limitation of the MLP is that it can only learn an input to output mapping that is static, and thus it has no notion of order in time. As a result, the only input it considers is the current example it has been exposed to (35). In contrast, the output from each layer in an RNN is stored in a context layer to be looped back in with the output from the next layer. In this sense one may consider that the network gains a memory of sorts, as opposed to the MLP. The length of the network is known as the temporal window length. Giles et al. (36) found that explicitly modelling the temporal relationship of the series via the internal states can contribute significantly to a model's effectiveness. Rather et al. (37) took this approach in predicting stock returns. They incorporated an RNN in a hybrid model with a genetic algorithm for optimisation of weight selection. In this case the authors reported successful results and credited this to optimal network parameter selection. Optimisers such as RMSprop are suitable for recurrent neural networks in terms of weight selection and updates.

Another form of RNN is the long short term memory (LSTM) network. LSTMs differ from Elman RNNs in that, in addition to having a memory, they can choose which data to remember and which data to forget based on the weight and importance of that feature. Gers et al. (12) implemented an LSTM for a time series prediction task. The authors found that the LSTM performed as well as the RNN for this task.

7. Hyperopt: https://github.com/hyperopt/hyperopt


This type of model is implemented here also. One limitation in training both the RNN and LSTM is the significant computation required. For example, a network with a window of 50 days is comparable to training 50 individual MLP models. Since the development of the CUDA framework by NVIDIA in 2006, the development of applications that take advantage of the extremely parallel capabilities of the GPU has grown greatly, including in the area of machine learning. Steinkrau et al. (38) reported over three times faster training and testing of their ANN model when implemented on a GPU rather than a CPU. Catanzaro et al. (39) reported an increase in classification speed of the magnitude of eighty times when implementing an SVM on a GPU over an alternative SVM algorithm run on a CPU; in addition, training time was nine times greater for the CPU implementation. Ciresan et al. (40) also reported speeds forty times faster when training deep neural networks for image recognition on a GPU as opposed to a CPU. Due to the apparent benefits of utilising a GPU, the LSTM model is implemented on both the CPU and GPU. The performance of both implementations is analysed in the results section.

3 Methodology

This body of research follows the CRISP-DM data mining methodology8. This is an incremental approach made up of four levels of tasks which are completed in an iterative manner. As a result, it also reflects an Agile methodology (41).

The dataset ranges from the 19th of August 2013 until the 19th of July 2016. A time series plot of this can be seen in figure 5 below, in the observed window. Data from before August 2013 was excluded, as the author feels that due to the immaturity of the network at that time, it is not an accurate representation of the network at present.

In addition to the Open, High, Low, Close (OHLC) data from CoinDesk and the difficulty and hash rate taken from the Blockchain, features were engineered to provide the dataset with more predictive data to learn from. The data was standardised in R using the scale function. This transforms the dataset to give it a mean of 0 and a standard deviation of 1. Before the dataset was standardised, the neural network performed poorly on it. Standardisation was chosen over normalising the data, for example between 0 and 1, as this methodology is robust to most activation functions.

Figure 5: Decomposition of Time Series
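A Python equivalent of this standardisation step might look as follows; the column values are illustrative. Both R's scale and the pandas std method use the sample standard deviation, so the two give matching results.

    import pandas as pd

    # Illustrative columns standing in for the real dataset.
    df = pd.DataFrame({"close": [455.0, 460.2, 458.1, 470.5],
                       "hash_rate": [1.20e6, 1.30e6, 1.25e6, 1.40e6]})

    # Centre each column to mean 0 and rescale to standard deviation 1.
    standardised = (df - df.mean()) / df.std()
    print(standardised)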

3.1 Feature Engineering

Feature engineering is the art of extracting useful patterns from data to make it easier for machine learning models to perform their predictions. It can be considered one of the most important skills for achieving good results in prediction tasks (42). Wind (43) investigated the behaviour of consistent top performers in Kaggle data mining competitions and found that feature engineering is often the most important part. It is quite a subjective process, requiring domain knowledge to be effective, and is also considered an art. Engineered features should represent what one is trying to teach the network.

8. CRISP-DM 1.0: https://www.the-modeling-agency.com/crisp-dm.pdf


However, chosen features must be evaluated to ensure they improve the dataset's predictive power. The addition of technical analysis indicators offers the benefit of describing and quantifying trends which may exist in the price. The rationale for this is that quantitative technical indicators provide a numerical description of previous trends in the data, which can be used as an attribute in a machine learning algorithm for both regression and classification. Several papers in recent years have included indicators such as the Simple Moving Average (SMA) for machine learning classification tasks (44)(45). The SMA records the average price over the previous n days:

SMA = (p1 + p2 + ... + pn) / n
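As a sketch, such moving averages can be computed with a rolling mean in pandas; the series below is illustrative rather than the CoinDesk data, and the 5 and 10 day windows match the two SMAs used in this project.

    import pandas as pd

    close = pd.Series([455.0, 460.2, 458.1, 470.5, 468.9, 472.3,
                       475.0, 471.8, 478.2, 480.1, 482.5])

    # Average of the previous n closing prices, per the formula above.
    sma_5 = close.rolling(window=5).mean()
    sma_10 = close.rolling(window=10).mean()
    print(sma_5.tail(), sma_10.tail())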

Fortunately, not much feature engineering is required for deep learning models, as the hidden layers learn non-linear hierarchical features and, in the case of the LSTM, detect very long term dependencies in sequential data (42). The reason one would still be required to engineer features is that there is no way of knowing what non-linear features the model has learned. When feature engineering, one may explicitly choose features which are thought to be important. It is accepted that feature engineering is a subjective process, highly individual to the task at hand (43). As a result, utilising domain knowledge is necessary. The rationale for selecting an SMA is that it can allow a model to more easily recognise trends by smoothing the data; it is also a popular technical indicator. In addition to the SMA, a de-noised closing price was created in R. This involved decomposing the time series into trend, seasonal and noise components, as can be seen in figure 5 above. The importance of features can be evaluated using various feature evaluation methods.

3.2 Feature Evaluation

Features must be evaluated once selected. The reason for this is that dealing with too large a feature set will considerably increase training time. In addition, machine learning algorithms can suffer from decreased accuracy if the number of variables is significantly higher than the optimal number (46). Several methods of feature evaluation exist, including filter based selection and wrapper based selection. Filter based selectors filter features based on a particular statistical property of the feature, e.g. correlation. Wrapper based methods perform a heuristic search of solutions for a classifier.

The Boruta algorithm in R is one such wrapper based method. This algorithm is a wrapper built around the random forest classification algorithm, an ensemble classification method in which classification is performed by the voting of multiple classifiers. The algorithm works on a similar principle as the random forest classifier. It adds randomness to the model and collects results from the ensemble of randomised samples to evaluate attributes. This extra randomness provides a clearer view of which attributes are important (47). All features were deemed important to the model based on the random forest, with the 5 day and 10 day averages of the highest importance among the tested averages. The de-noised closing price was also one of the most important variables.
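Boruta itself was run in R. A rough Python analogue of the underlying idea, ranking attributes by random forest importance with scikit-learn, is sketched below on synthetic placeholder data; the feature names are illustrative.

    import numpy as np
    from sklearn.ensemble import RandomForestRegressor

    rng = np.random.default_rng(0)
    X = rng.normal(size=(500, 4))  # stand-ins for sma_5, sma_10, hash_rate, difficulty
    y = 0.6 * X[:, 0] + 0.3 * X[:, 1] + rng.normal(scale=0.1, size=500)

    forest = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, y)
    for name, importance in zip(["sma_5", "sma_10", "hash_rate", "difficulty"],
                                forest.feature_importances_):
        print(f"{name}: {importance:.3f}")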

The dimensionality reduction technique of principal component analysis (PCA) was also explored. The result was four principal groups to which all attributes belonged. The results of the PCA were not included in the final model, as computation was not an issue and the original data performed reasonably well.

3.3 RNN

Appropriate design of deep learning models in terms of network parameters is imperative to their success.


The three main options available when choosing how to select parameters for deep learning models are random search, grid search and heuristic search methods such as genetic algorithms. As mentioned in the related work section, manual grid search and Bayesian optimisation are utilised in this study. Grid search, implemented for the Elman RNN, is the process of selecting two hyperparameters with a minimum and maximum for each, then searching that space for the best performing parameters. This approach was taken for parameters which were unsuitable for Bayesian optimisation. This model was built using Keras in the Python programming language (48).
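A sketch of such a manual grid search is given below; build_and_evaluate is a hypothetical stand-in for training the Elman RNN with a given configuration and returning its validation loss, and the grid values are illustrative.

    import itertools

    def build_and_evaluate(window, nodes):
        # Placeholder for training a network and returning validation loss;
        # here it simply prefers a window of 24 and 20 hidden nodes.
        return abs(window - 24) * 0.01 + abs(nodes - 20) * 0.001

    # Exhaustively try every combination of the two hyperparameters.
    grid = itertools.product(range(2, 21), [20, 50, 100])
    best = min(grid, key=lambda cfg: build_and_evaluate(*cfg))
    print("best (window, nodes):", best)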

3.4 LSTM

Similar to the RNN, Bayesian optimisation was chosen for selecting parameters for this model where possible. This is a heuristic search method which works by assuming the objective function was sampled from a Gaussian process, maintaining a posterior distribution for this function as the results of different hyperparameter selections are observed. One can then optimise the expected improvement over the best result to pick hyperparameters for the next experiment (49). The performance of both the RNN and LSTM network is evaluated on validation data with significant overfitting measures in place. Dropout is implemented in both layers. In addition, an early stopper is programmed into the model to prevent overfitting. This stops training if the model's validation loss does not improve for 5 epochs.
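A minimal Keras sketch of these overfitting measures is shown below, assuming a two-layer LSTM with the 0.5 dropout, 20 percent holdout and 5 epoch patience described in this paper; the array shapes, feature count and batch size are illustrative.

    import numpy as np
    from keras.models import Sequential
    from keras.layers import LSTM, Dropout, Dense
    from keras.callbacks import EarlyStopping

    X = np.random.rand(500, 100, 9)  # (samples, timesteps, features), illustrative
    y = np.random.rand(500, 1)

    model = Sequential()
    model.add(LSTM(20, return_sequences=True, input_shape=(100, 9)))
    model.add(Dropout(0.5))          # dropout on the first recurrent layer
    model.add(LSTM(20))
    model.add(Dropout(0.5))          # dropout on the second recurrent layer
    model.add(Dense(1))
    model.compile(optimizer="rmsprop", loss="mse")

    # Stop training if validation loss fails to improve for 5 epochs.
    stopper = EarlyStopping(monitor="val_loss", patience=5)
    model.fit(X, y, epochs=100, batch_size=32,
              validation_split=0.2,  # last 20% withheld as the holdout set
              callbacks=[stopper])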

4 Implementation

4.1 RNN

The first parameter for selection was the temporal window length. As suggested by supporting literature (14), these types of networks may struggle to learn long term dependencies using gradient based optimisation.

However, this could not be assumed. An autocorrelation function (ACF) was run on the closing price time series. This assesses the relationship between the current closing price and previous or future closing prices. While this is not a guarantee of predictive power for a given length, it was a better choice than random selection. The ACF for closing price can be seen below in figure 6. Closing price is correlated with a lag of up to 20 days in many cases, with isolated cases at 34, 45 and 47 days. This led the grid search for the temporal window to test from 2 to 20 days, as well as lags of 34, 45 and 47 days. To ensure a robust search, larger time periods of up to 100 days were also tested in increments of five. The most effective temporal window length was 24 days.

Figure 6: ACF
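The ACF itself can be computed with statsmodels, as in the following sketch; the random-walk series here is an illustrative stand-in for the actual closing price series.

    import numpy as np
    from statsmodels.tsa.stattools import acf

    close = np.cumsum(np.random.randn(1066)) + 400  # stand-in price series
    correlations = acf(close, nlags=50)             # lag-0 through lag-50
    for lag, r in enumerate(correlations[:6]):
        print(f"lag {lag}: {r:.3f}")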

The learning rate is the parameter that guides stochastic gradient descent (SGD), i.e. how the network learns. Momentum updates the learning rate to avoid getting stuck in local minima and attempts to move towards the global minimum of the function (50). However, several optimisers are available to improve this process. The optimiser RMSprop improves on SGD with momentum, as it keeps a running average and as a result is more robust to information loss (51).


RMSprop keeps a running average of its recent gradient magnitudes, and the next gradient is divided by this average so that gradient values are loosely normalised. Other optimisers such as Adagrad and Adadelta are available, but the Keras documentation recommends RMSprop for recurrent neural networks. The number of hidden layers and hidden nodes was the next parameter choice. According to Heaton, one hidden layer is enough to approximate the vast majority of non-linear functions (52). Two hidden layers were also explored and were chosen, as they achieved lower validation error. Heaton also recommends selecting a number of hidden nodes between the number of input and output nodes. In this case, fewer than 20 nodes per layer resulted in poor performance. 50 and 100 nodes were tested with good performance; however, too many nodes can increase the chances of overfitting. As 20 nodes performed sufficiently well, this was chosen for the final model. Activation functions are the non-linear equations that pass signals between layers. The options explored were tanh and ReLU, which can be seen below.

tanh(x) = sinh(x) / cosh(x) = (e^x − e^(−x)) / (e^x + e^(−x))

ReLU: f(x) = max(0, x)

These functions perform better on different types of tasks, so it is important to test several. Tanh performed the best, but the differences were not significant. The final parameters for selection were batch size and number of training epochs. Batch size was found to have little effect on accuracy but, in this case, a considerable effect on training time when using smaller batches. The number of epochs tested ranged from 10 to 10000. Too many training epochs can result in overfitting, a common issue with neural networks.

To reduce the risk of overfitting, dropout can be implemented as discussed in the related work section. Optimal dropout between 0.1 and 1 was searched for both layers, with 0.5 dropout the optimal solution for both. An early stopper was also implemented in the code. This Keras callback method stops the training of the model if its performance on validation data does not improve after 5 epochs, which helps to prevent overfitting. Generally the RNN converged between 20 and 40 epochs with early stopping, while LSTM models converged between 50 and 100 epochs.

4.2 LSTM

In terms of temporal length, the LSTM is considerably better at learning long term dependencies. As a result, picking a long window for this parameter was less detrimental for the LSTM than for the RNN. This followed a similar process to the RNN, in which autocorrelation lag was used as a guideline. The LSTM performed poorly on smaller window sizes; its most effective length was found to be 100 days.
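The windowing this implies, turning the daily series into overlapping sequences of temporal length T shaped (samples, timesteps, features), might look like the following sketch; the data is illustrative and this is not the project's exact pre-processing code.

    import numpy as np

    def make_windows(data, target, T):
        # Each sample holds T consecutive days of features; the label is
        # the target value on the day following the window.
        X = np.array([data[i:i + T] for i in range(len(data) - T)])
        y = target[T:]
        return X, y

    data = np.random.rand(1066, 9)  # 1066 days of 9 illustrative features
    target = np.random.rand(1066)
    X, y = make_windows(data, target, T=100)
    print(X.shape, y.shape)         # (966, 100, 9) (966,)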

Two hidden LSTM layers were chosen. For a time series task, two layers are enough to find the non-linear relationships among the data. Three and four layers were tested but did not improve validation performance. Several layers can be required for tasks such as image recognition, where the number of features is significantly large. 20 hidden nodes were also chosen for both layers, as per the RNN model. The Hyperas library9 was used to implement the Bayesian optimisation of the network parameters. The optimiser searched for the optimal model in terms of how much dropout to apply per layer and which optimiser to use; batch size, the number of data points to train on before updating weights, was also searched. RMSprop performed the best for this task, as indicated by the Keras documentation. The LSTM model activation functions were not changed, as the LSTM has a particular sequence of tanh and sigmoid activation functions for the different gates within its cell.

9. Hyperas: https://github.com/maxpumperla/hyperas



LSTM models converged between 50 and 100 epochs with early stopping, in comparison to 20 to 40 epochs for the RNN. This indicates the LSTM continues to learn for longer than the RNN. Batch size, which determines how often weights are updated using RMSprop, was found to have a greater effect on execution time than on accuracy. This may be due to the relatively small size of the dataset.


4.3 Model Comparison

A confusion matrix representing the ratio of true/false and positive/negative classifications is used to derive the rating metrics. The formulae for the four metrics can be seen below. Accuracy can be defined as the proportion of correctly classified predictions.

This metric can be misleading, for example when there is an imbalanced dataset. As a result, the metrics sensitivity, specificity and precision are also analysed. Sensitivity represents how good a test is at detecting positives. Specificity represents how good the model is at avoiding false alarms. Finally, precision represents how many positively classified predictions were relevant. Root Mean Square Error (RMSE) is used to evaluate and compare the regression accuracy.

Sensitivity = TP / (TP + FN)

Specificity = TN / (FP + TN)

Precision = TP / (TP + FP)

Accuracy = (TP + TN) / (P + N)

RMSE = sqrt( (1/n) Σ (x_i − y_i)^2 )
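These metrics are straightforward to compute from the confusion matrix counts, as in the following sketch; the counts passed in are illustrative.

    import math

    def classification_metrics(tp, tn, fp, fn):
        return {
            "sensitivity": tp / (tp + fn),
            "specificity": tn / (fp + tn),
            "precision": tp / (tp + fp),
            "accuracy": (tp + tn) / (tp + tn + fp + fn),
        }

    def rmse(actual, predicted):
        n = len(actual)
        return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / n)

    print(classification_metrics(tp=37, tn=61, fp=39, fn=63))
    print(rmse([1.0, 2.0, 3.0], [1.1, 1.9, 3.2]))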

4.4 Validation

Holdout validation is used by default in Keras: the last 20 percent of the dataset is withheld for validation while the model trains on the remaining 80 percent. This is the only method natively supported in Keras. K-fold cross validation was investigated through the Python Scikit-Learn library. However, it was not implemented due to the significant training times that result from it; training times are k times greater than for holdout. Sliding window validation was tested but performed poorly in a Theano version of the RNN implementation. As a result, it was not implemented for the LSTM.


4.5 ARIMA

The ARIMA forecast was created by splitting the data into 5 periods and then predicting 30 days into the future. The data was differenced before being fit with several ARIMA models. The best fit was found by auto.arima from the R forecast package10.
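The model itself was fit in R with auto.arima; a rough Python analogue using statsmodels, with a manually chosen order on an illustrative series, is sketched below.

    import numpy as np
    from statsmodels.tsa.arima.model import ARIMA

    close = np.cumsum(np.random.randn(300)) + 400  # stand-in price series
    model = ARIMA(close, order=(1, 1, 1))          # d=1 differences the series once
    fit = model.fit()
    forecast = fit.forecast(steps=30)              # predict 30 days ahead
    print(forecast[:5])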

5 Evaluation

5.1 Model Evaluation

As can be seen in table 1, the LSTM achieved the highest accuracy while the RNN achieved the lowest RMSE. The ARIMA prediction performed poorly in terms of accuracy and RMSE. This was to be expected. Upon analysis, the ARIMA forecast predicted that the price would gradually rise each day. There were no false positives from the model. One reason for this may be the class imbalance in the predictive portion of the ARIMA forecast. This contributed to the specificity and precision being so high (specificity, precision = 1), which does not necessarily suggest good performance.

Based on the results on the validation data, it is apparent that all models struggled to learn effectively from the data. On training data the models reduced error to below 1%. On validation data the LSTM achieved an error of 8.07%, while the RNN achieved an error of 7.15%.

The 50.25% and 52.78% accuracy achieved by the neural network models is a marginal improvement over the odds one has in a binary classification task, i.e. 50%. The RNN was effectively of no use when using a temporal length over 50 days. In contrast, the LSTM performed better in the 50 to 100 day range, with 100 days producing the best performance. Both models were combined to create a network with one LSTM layer and one RNN layer. This performed marginally worse than the pure LSTM model but required less training time. The ARIMA model appears to perform well based on sensitivity, specificity and precision. However, on analysis of the model's high RMSE, it is apparent its forecast was poor. The ARIMA forecast followed a stable path with very little variance. The fact that it appears to have performed well in terms of specificity and precision appears to have been down to chance, as it failed to recognise any trends in the data and in general predicted the price would rise.

The figures listed are those thought to represent the generalisability of the models, as appropriate prevention measures were taken in terms of overfitting. Dropout was considerably high on both hidden layers, with the training dataset being shuffled in groups based on temporal window length. In addition, an early stopper was programmed into the model to stop training if the validation loss did not improve for 5 epochs. With this turned off, it is possible to attain considerably better results, but only as a result of overfitting.


5.2 Performance Evaluation

The CPU utilised was an Intel Core i7 2.6GHz, a relatively high specification processor for a laptop. The GPU used was an NVIDIA GeForce 940M 2GB. Both were running on Ubuntu 14.04 LTS installed on an SSD.

For comparability, the same batch size and a temporal length of 50 were chosen for both the RNN and LSTM. Performance can be seen in table 2 below. The GPU considerably outperformed the CPU, as can be seen in figure 7. In terms of overall training time for both networks, the GPU trained 67.7% faster than the CPU. The RNN trained 58.8% faster on the GPU, while the LSTM trained 70.7% faster on the GPU. From monitoring performance in Glances, the CPU spread the algorithm out over 7 threads. The GPU has 384 CUDA cores, which provide it with greater parallelism. These models were relatively small, with two layers. For deeper models with more layers, or bigger datasets, the benefits of implementing on a GPU are even greater.

The LSTM and RNN were also compared in terms of total training time on both the CPU and GPU combined, as can be seen in figure 8. The LSTM took 3.1 times longer to train than the RNN with the same network parameters. One reason for this may be the increased number of activation functions, and thus the increased number of equations to be computed by the LSTM. Given the increased computation, this raises the question of the value of using an LSTM over an RNN.

In financial market prediction, small margins can make all the difference, and as a result the use of an LSTM is justified. In other areas, the slight improvement in performance is not justifiable given the increase in computation.

Figure 7: CPU vs GPU

(Bar chart: total training time in seconds, CPU vs GPU.)

Figure 8: LSTM vs RNN

(Bar chart: total training time in seconds, LSTM vs RNN.)

Table 1: Model Results

Model  | Temporal Length | Sensitivity | Specificity | Precision | Accuracy | RMSE
LSTM   | 100             | 37%         | 61.30%      | 35.50%    | 52.78%   | 6.87%
RNN    | 20              | 40.40%      | 56.65%      | 39.08%    | 50.25%   | 5.45%
ARIMA  | 170             | 14.7%       | 1           | 1         | 50.05%   | 53.74%


Table 2: Performance Comparison

Model | Epochs | Intel Core i7 6700 2.6GHz | NVIDIA GeForce 940M 2GB
RNN   | 50     | 56.71s                    | 33.15s
LSTM  | 50     | 59.71s                    | 38.85s
RNN   | 500    | 462.31s                   | 258.1s
LSTM  | 500    | 1505s                     | 888.34s
RNN   | 1000   | 918.03s                   | 613.21s
LSTM  | 1000   | 3001.69s                  | 1746.87s

6 Conclusion and Future Work

Deep learning models such as the RNN and LSTM are evidently effective learners on training data, with the LSTM more capable of recognising longer-term dependencies. However, the high variance of a task of this nature makes it difficult to translate this into impressive validation results, and as a result it remains a difficult task. There is a fine line between overfitting a model and preventing it from learning sufficiently. Dropout is a valuable feature to assist in improving this; however, despite using Bayesian optimisation to optimise the selection of dropout, good validation results still could not be guaranteed. Despite the metrics of sensitivity, specificity and precision indicating good performance, the actual performance of the ARIMA forecast based on error was significantly worse than the neural network models. The LSTM outperformed the RNN marginally, but there was no significant difference in the results of the two. However, the LSTM takes considerably longer to train.

The performance benefits gained from the parallelisation of machine learning algorithms on a GPU are evident, with a 70.7% performance improvement for training the LSTM model on the GPU as opposed to the CPU. This confirmed the findings indicated by the related work. It appears from the results that Andrej Karpathy was correct in his article on the unreasonable effectiveness of recurrent neural networks (5): they can reduce validation error sufficiently low for a difficult task like this one.

Looking at the task from purely a classification perspective, one may be able to achieve better results. One limitation of the research is that the model has not been implemented in a practical or real time setting for predicting into the future, as opposed to learning what has already happened. In addition, the ability to predict on streaming data would improve the model. Sliding window validation is an approach not implemented here, but it may be explored in future work. One problem that will arise is that the data is inherently shrouded in noise.

Wavelets offer an interesting approach to time series analysis due to the way they break down a signal on a temporal level to get rid of noise. Some research exists on merging wavelets with the MLP flavour of neural networks (53). To the best of the author's knowledge, no implementation exists of a more advanced neural network such as an LSTM or RNN being integrated with wavelets. This is to be explored in future work. In terms of the dataset, based on an analysis of the weights of the model, the difficulty and hash rate variables could be considered for pruning. Deep learning models require a significant amount of data to learn effectively. The dataset utilised contained 1066 time steps, one per day. If the granularity of the data were changed to per minute, this would provide 525,600 data points in a year. Data of this nature is not available for the past but is currently being gathered from CoinDesk on a daily basis for future use.


Finally, parallelisation of algorithms is not limited to GPU devices. Field Programmable Gate Arrays (FPGAs) are an interesting alternative to GPUs in terms of parallelisation. Under some circumstances, machine learning models have been shown to perform better on an FPGA than on a GPU (54). This warrants further investigation for deep learning models.


Acknowledgements

Firstly, I would like to express my sincere gratitude to my supervisor Dr. Jason Roche for accompanying me on this deep journey of mine. I learned a tremendous amount from you in this short space of time. Thank you for believing in me and making me believe in myself and my abilities. I would also like to thank my girlfriend Katie and my parents for their unfaltering support and counsel. Finally, thank you to Dr. Simon Caton, Jonathan McNicholas and Patrick McHale for your technical advice and guidance when called upon.


References

[1] I. Kaastra and M. Boyd, “Designing a neural network for forecasting financial and economic time series,” Neurocomputing, vol. 10, no. 3, pp. 215–236, 1996.

[2] H. White, “Economic prediction using neural networks: The case of ibm daily stock returns,” in Neural Networks, 1988., IEEE International Conference on. IEEE, 1988, pp. 451–458.

[3] M. Briere, K. Oosterlinck, and A. Szafarz, “Virtual currency, tangible return: Portfolio diversification with bitcoins,” Tangible Return: Portfolio Diversification with Bitcoins (September 12, 2013), 2013.

[4] C. Chatfield and M. Yar, “Holt-winters forecasting: some practical issues,” The Statistician, pp. 129–140, 1988.

[5] A. Karpathy, “The unreasonable effectiveness of recurrent neural networks,” Andrej Karpathy blog, 2015.

[6] J. L. Elman, “Finding structure in time,” Cognitive science, vol. 14, no. 2, pp. 179–211, 1990.

[7] B. Scott, “Bitcoin academic paper database,” suitpossum blog, 2016.

[8] M. D. Rechenthin, “Machine-learning classification techniques for the analysis and prediction of high-frequency stock direction,” 2014.

[9] S. Nakamoto, “Bitcoin: A peer-to-peer electronic cash system,” 2008.

[10] B. G. Malkiel and E. F. Fama, “Efficient capital markets: A review of theory and empirical work,” The Journal of Finance, vol. 25, no. 2, pp. 383–417, 1970.

[11] I. H. Witten and E. Frank, Data Mining: Practical machine learning tools and techniques. Morgan Kaufmann, 2005.

[12] F. A. Gers, D. Eck, and J. Schmidhuber, “Applying lstm to time series predictable through time-window approaches,” pp. 669–676, 2001.

[13] Y. Bengio, P. Simard, and P. Frasconi, “Learning long-term dependencies with gradient descent is difficult,” IEEE Transactions on Neural Networks, vol. 5, no. 2, pp. 157–166, 1994.


[14] Y. Bengio, “Learning deep architectures for ai,” Foundations and Trends® in Machine Learning, vol. 2, no. 1, pp. 1–127, 2009.

[15] S. Hochreiter and J. Schmidhuber, “Lstm can solve hard long time lag problems,” Advances in neural information processing systems, pp. 473–479, 1997.

[16] V. S. Ediger and S. Akar, “Arima forecasting of primary energy demand by fuel in turkey,” Energy Policy, vol. 35, no. 3, pp. 1701–1708, 2007.

[17] G. H. Chen, S. Nikolov, and D. Shah, “A latent source model for nonparametric time series classification,” in Advances in Neural Information Processing Systems, 2013, pp. 1088–1096.

[18] I. Daubechies et al., Ten lectures on wavelets. SIAM, 1992, vol. 61.

[19] L. Kristoufek, “What are the main drivers of the bitcoin price? evidence from wavelet coherence analysis,” PloS one, vol. 10, no. 4, p. e0123923, 2015.

[20] D. Shah and K. Zhang, “Bayesian regression and bitcoin,” in Communication, Control, and Computing (Allerton), 2014 52nd Annual Allerton Conference on. IEEE, 2014, pp. 409–414.

[21] I. Georgoula, D. Pournarakis, C. Bilanakos, D. N. Sotiropoulos, and G. M. Giaglis, “Using time-series and sentiment analysis to detect the determinants of bitcoin prices,” Available at SSRN 2607167, 2015.

[22] M. Matta, I. Lunesu, and M. Marchesi, “Bitcoin spread prediction using social and web search media,” Proceedings of DeCAT, 2015.

[23] B. Gu, P. Konana, A. Liu, B. Rajagopalan, and J. Ghosh, “Identifying information in stock message boards and its implications for stock market efficiency,” in Workshop on Information Systems and Economics, Los Angeles, CA, 2006.

[24] M. Matta, I. Lunesu, and M. Marchesi, “The predictor impact of web search media on bitcoin trading volumes.”

[25] R. Delfin Vidal, “The fractal nature of bitcoin: Evidence from wavelet power spectra,” The Fractal Nature of Bitcoin: Evidence from Wavelet Power Spectra (December 4, 2014), 2014.

[26] A. Greaves and B. Au, “Using the bitcoin transaction graph to predict the price of bitcoin,” 2015.

[27] I. Madan, S. Saluja, and A. Zhao, “Automated bitcoin trading via machine learning algorithms,” 2015.

[28] D. E. Rumelhart, G. E. Hinton, and R. J. Williams, “Learning internal representation by back propagation,” Parallel distributed processing: exploration in the microstructure of cognition, vol. 1, 1986.

[29] A. S. Weigend, B. A. Huberman, and D. E. Rumelhart, “Predicting the future: A connectionist approach,” International journal of neural systems, vol. 1, no. 03, pp. 193–209, 1990.


[30] Z. Tang, C. de Almeida, and P. A. Fishwick, “Time series forecasting using neural networks vs. box-jenkins methodology,” Simulation, vol. 57, no. 5, pp. 303–310, 1991.

[31] Y. Yoon and G. Swales, “Predicting stock price performance: A neural network approach,” in System Sciences, 1991. Proceedings of the Twenty-Fourth Annual Hawaii International Conference on, vol. 4. IEEE, 1991, pp. 156–162.

[32] J. Bergstra and Y. Bengio, “Random search for hyper-parameter optimization,” Journal of Machine Learning Research, vol. 13, no. Feb, pp. 281–305, 2012.

[33] J. L. Ticknor, “A bayesian regularized artificial neural network for stock market forecasting,” Expert Systems with Applications, vol. 40, no. 14, pp. 5501–5506, 2013.

[34] S. Wager, S. Wang, and P. S. Liang, “Dropout training as adaptive regularization,” in Advances in neural information processing systems, 2013, pp. 351–359.

[35] T. Koskela, M. Lehtokangas, J. Saarinen, and K. Kaski, “Time series prediction with multilayer perceptron, fir and elman neural networks,” in Proceedings of the World Congress on Neural Networks. Citeseer, 1996, pp. 491–496.

[36] C. L. Giles, S. Lawrence, and A. C. Tsoi, “Noisy time series prediction using recurrent neural networks and grammatical inference,” Machine Learning, vol. 44, no. 1-2, pp. 161–183, 2001.

[37] A. M. Rather, A. Agarwal, and V. Sastry, “Recurrent neural network and a hybrid model for prediction of stock returns,” Expert Systems with Applications, vol. 42, no. 6, pp. 3234–3241, 2015.

[38] D. Steinkrau, P. Y. Simard, and I. Buck, “Using gpus for machine learning algorithms,” in Proceedings of the Eighth International Conference on Document Analysis and Recognition. IEEE Computer Society, 2005, pp. 1115–1119.

[39] B. Catanzaro, N. Sundaram, and K. Keutzer, “Fast support vector machine training and classification on graphics processors,” in Proceedings of the 25th international conference on Machine learning. ACM, 2008, pp. 104–111.

[40] D. C. Ciresan, U. Meier, L. M. Gambardella, and J. Schmidhuber, “Deep, big, simple neural nets for handwritten digit recognition,” Neural Computation, vol. 22, no. 12, pp. 3207–3220, 2010.

[41] R. C. Martin, Agile software development: principles, patterns, and practices. Prentice Hall PTR, 2003.

[42] T. Dettmers, “Deep learning in a nutshell: Core concepts,” NVIDIA Devblogs, 2015.

[43] D. K. Wind, “Concepts in predictive machine learning,” Master's thesis, 2014.

[44] Y. Wang, “Stock price direction prediction by directly using prices data: an empirical study on the kospi and hsi,” International Journal of Business Intelligence and Data Mining, vol. 9, no. 2, pp. 145–160, 2014.


[45] S. Lauren and S. D. Harlili, “Stock trend prediction using simple moving average supported by news classification,” in Advanced Informatics: Concept, Theory and Application (ICAICTA), 2014 International Conference of. IEEE, 2014, pp. 135–139.

[46] R. Kohavi and G. H. John, “Wrappers for feature subset selection,” Artificial Intelligence, vol. 97, no. 1, pp. 273–324, 1997.

[47] M. B. Kursa, W. R. Rudnicki et al., “Feature selection with the boruta package,” 2010.

[48] F. Chollet, “Keras,” https://github.com/fchollet/keras, 2015.

[49] J. Snoek, H. Larochelle, and R. P. Adams, “Practical bayesian optimization of machine learning algorithms,” in Advances in neural information processing systems, 2012, pp. 2951–2959.

[50] L. Yu, S. Wang, and K. K. Lai, “A novel nonlinear ensemble forecasting model incorporating glar and ann for foreign exchange rates,” Computers and Operations Research, vol. 32, no. 10, pp. 2523–2541, 2005.

[51] G. Hinton, N. Srivastava, and K. Swersky, “Lecture 6a: Overview of mini-batch gradient descent,” Coursera lecture slides, https://class.coursera.org/neuralnets-2012-001/lecture, [Online].

[52] J. Heaton, “Introduction to neural networks for java,” Heaton Research, Inc., St. Louis, 2005.

[53] L. Wang, K. K. Teo, and Z. Lin, “Predicting time series with wavelet packet neural networks,” in Neural Networks, 2001. Proceedings. IJCNN'01. International Joint Conference on, vol. 3. IEEE, 2001, pp. 1593–1597.

[54] M. Papadonikolakis, C.-S. Bouganis, and G. Constantinides, “Performance comparison of gpu and fpga architectures for the svm training problem,” in Field-Programmable Technology, 2009. FPT 2009. International Conference on. IEEE, 2009, pp. 388–391.