
A Comparative Study of Support Vector Machines and Artificial Neural Networks for

Short Term Load Forecasting

By Oussama Saad

A thesis submitted to the Faculty of Engineering at the University of Monastir and the University of Kassel in partial fulfillment of the requirements for the degree of Master of Science in

Renewable Energy and Energy Efficiency

University of Kassel - Kassel, Germany
University of Monastir - Monastir, Tunisia

July 2018


A Comparative Study of Support Vector Machines and Artificial Neural Networks for

Short Term Load Forecasting

By Oussama Saad

A thesis submitted to the Faculty of Engineering at the University of Monastir and the University of Kassel in partial fulfillment of the requirements for the degree of Master of Science in

Renewable Energy and Energy Efficiency

Under the supervision of

Prof. Dr. sc. techn. Dirk Dahlhaus, University of Kassel, Germany
Dr. Walid Hassen, University of Monastir, Tunisia
Dr. Nour Mansour, University of Kassel, Germany

July 2018


Declaration

To the best of my knowledge, I do hereby declare that this thesis is my own work. It has not been submitted in any form for another degree or diploma to any other university or other institution of education. Information derived from the published or unpublished work of others has been acknowledged in the text and a list of references is given.

Oussama Saad

Kassel, 11 July 2018


Abstract

With the deregulation of the electricity market, short term load forecasting (STLF) plays an increasingly important role in the management of electric power systems, especially in planning unit commitment, demand side management (DSM) interventions and energy trading operations. Hence, the accuracy of load forecasts has become more and more of a necessity. In this thesis, we aim to build a STLF model in order to support the Tunisian governorate of Bizerte in its transition towards a sustainable energy system. Two machine learning (ML) models, namely ν-support vector regression (ν-SVR) and recurrent neural networks (RNN), were selected for comparative assessment. Both models take into account the influence of the temperature and the characteristics of the calendar on the electric load to improve their forecast accuracy. Furthermore, given the importance of data quality for the success of ML models, special attention was given to the preprocessing of the load time series and in particular to the detection of outliers. Using different performance metrics, it has been demonstrated that the ν-SVR model outperforms the RNN for day-ahead and week-ahead out-of-sample load forecasting and that both models accurately predict the hour of the daily load peak. This thesis also investigates the performance of a combined RNN-SVR model, which averages the forecasts of both ML models. The experimental results show that this model slightly improves the STLF accuracy for day-ahead load forecasting. Nevertheless, it is recommended to use the ν-SVR model for STLF given that it outperforms both the RNN and the combined model for week-ahead load forecasting.

Key words: Short term load forecasting, Support vector machines, Support vector regression, Artificial neural networks, Recurrent neural networks, Outlier detection.


Acknowledgements

When I started this journey, it was difficult for me to imagine how exciting and challenging this experience would be, and getting through it would not have been possible without the support and encouragement of many people.

First and foremost, I would like to extend my sincere gratitude to Prof. Dirk Dahlhaus, Dr. Walid Hassen and Dr. Nour Mansour for accepting to supervise and review my thesis and for their continuous support and guidance. My work benefited considerably from their valuable advice and the enriching discussions we had throughout this thesis.

Special thanks go to my colleagues at Ramboll, in particular to Ing. Stefen Chun for welcoming me into his team, my supervisor Ing. Peter Ritter for the enthusiasm he expressed for my work and the flexibility and freedom he provided in allowing me to choose the subject of my thesis, and finally M.Sc. Johannes Herbert for his advice and helpful suggestions.

I would like also to express my appreciation for the support provided by the "Société Tunisienne de l'Électricité et du Gaz (STEG)" and the "National Institute of Meteorology of Tunisia" in collecting the required data for my thesis.

My deepest gratitude goes to my parents and siblings for their encouragement and support throughout my academic career. I am forever indebted to them and to their sacrifices.

I am eternally grateful to my life partner, Yosr, for encouraging me to realize my dreams and for the patience and support she has shown even in the most difficult times.

I am also appreciative of the friendships I developed during the past two years, especially with Wael, Kami and Alkaff, with whom I shared this incredible journey.


Lastly, I gratefully acknowledge the funding received towards my M.Sc. from the "Deutsche Gesellschaft für Internationale Zusammenarbeit (GIZ)" on behalf of the "OPEC Fund for International Development (OFID)".


Disclaimer

The master thesis entitled "A Comparative Study Between Support Vector Machines and Artificial Neural Networks For Short Term Load Forecasting" contains confidential information of the Tunisian electric utility Société Tunisienne de l'Électricité et du Gaz (STEG). The sharing of the contents of the work in whole or in part, as well as the making of copies or transcriptions, also in digital form, is strictly prohibited. Exceptions require the written approval of STEG.


Contents

Declaration
Abstract
Acknowledgements
Disclaimer
Table of Contents
List of Figures
List of Tables
Notation

1 Introduction
  1.1 Thesis Background and Motivation
  1.2 Literature Review
    1.2.1 Classification of Electricity Load Forecasts
    1.2.2 Short Term Load Forecast Approaches
    1.2.3 Model Comparison
  1.3 Thesis Contribution
  1.4 Thesis Organization

2 Support Vector Machines
  2.1 Support Vector Classification
    2.1.1 SVC for Linearly Separable Sets
    2.1.2 SVC for Non-Linearly Separable Sets
  2.2 Kernel Functions
  2.3 Support Vector Regression
  2.4 Discussion

3 Artificial Neural Networks
  3.1 Artificial Neural Networks Architectures
    3.1.1 Artificial Neurons
    3.1.2 Layers of Neurons
    3.1.3 Recurrent Neural Networks
  3.2 Learning Algorithms
    3.2.1 Online Learning and Batch Learning
    3.2.2 Widrow-Hoff Learning Rule
    3.2.3 Backpropagation Algorithm
  3.3 Discussion

4 Methodology
  4.1 Data
    4.1.1 Electric Load Time Series
    4.1.2 Exogenous Factors
  4.2 Data Exploration with Descriptive Statistics
  4.3 Data Preprocessing
    4.3.1 Data Aggregation
    4.3.2 Data Cleansing
    4.3.3 Data Normalization
  4.4 Model Selection
    4.4.1 ν-Support Vector Regression
    4.4.2 Recurrent Neural Networks
  4.5 Training and Testing
  4.6 Forecasting
  4.7 Performance Measures

5 Performance Analysis
  5.1 Data Preprocessing Results
    5.1.1 Descriptive Statistics
    5.1.2 Outliers
  5.2 Load Characteristics
  5.3 Selected Models
    5.3.1 Selected ν-Support Vector Regression Model
    5.3.2 Selected Recurrent Neural Networks
  5.4 Forecasting Results
    5.4.1 Comparison of ν-SVR and RNN Forecasts
    5.4.2 Combined Model Forecasts
    5.4.3 Discussion

6 Conclusion and Future Work

Acronyms

Nomenclature

Bibliography

List of Figures

1.1 Bizerte governorate, Tunisia.
1.2 Load forecasting classification.
2.1 Support vector classification.
2.2 Support vectors and decision boundaries for linear SVC.
2.3 Overlapping classes.
2.4 Support vector regression.
3.1 Block diagram of an artificial neuron.
3.2 Feedforward neural network architecture.
3.3 Recurrent neural network architecture.
3.4 Unfolded recurrent neural network.
4.1 Adopted methodology for STLF.
4.2 Time series of the electric load in Bizerte in hourly resolution.
4.3 Box and whisker plot.
4.4 Flowchart of the out-of-sample forecasting algorithm.
5.1 Detected outliers.
5.2 Daily profile of the load in Bizerte for 2017.
5.3 Contour plot of the coefficient of determination (R²) for ν = 0.8.
5.4 Architecture of the selected recurrent neural network.
5.5 RNN training and test results.
5.6 One week-ahead load forecasting using ν-SVR and RNN.
5.7 One week-ahead load forecast using combined ν-SVR and RNN.
5.8 Forecast accuracy of the STLF models.


List of Tables

4.1 Grid search intervals for the ν-SVR model.
5.1 Descriptive statistics for the time series of the electrical load.
5.2 Number of outliers per year according to the IQR and LOF methods.
5.3 Best configuration of the SVR model.
5.4 Forecast accuracy of ν-SVR and RNN models.


Notation

In this thesis, we use the following mathematical notation. Bold upper-case letters (X) are used for matrices. Vectors are denoted by bold lower-case letters (x). Italic lower-case letters (x) are used for scalars. ⟨·, ·⟩ corresponds to the inner product of two vectors, whereas ⊗ refers to their outer product. The norm of a vector is denoted by ||·||. We refer to a set of N ordered natural numbers by {1, ..., N}, and a set of N pairs of input and target variables is denoted by T = {(xi, ti), i = 1, ..., N}. We use card(·) to refer to the set cardinality. Lastly, |·|ε refers to the ε-insensitive loss function.


Chapter 1

Introduction

“Begin with the determination to succeed and the work is half done already.”

Mark Twain

1.1 Thesis Background and Motivation

The past decades have witnessed a rapid expansion of cities all around the world. This expansion has often come at the expense of our planet. Indeed, although cities nowadays occupy only approximately 2% of the total land, they are accountable for over 60% of the global energy consumption and up to 70% of greenhouse gas emissions [70]. Facing this situation, both industrialized and developing countries became increasingly aware of the necessity to shift towards a more sustainable development of their cities, especially their energy systems.

In this context, the consortium Bizerte Sustainable City, composed of the Tunisian government, civil society and research institutes, launched a set of projects aiming at improving the management of energy in the northern governorate of Bizerte, Tunisia, with a focus on electricity production and consumption. These projects include, inter alia, the implementation of advanced energy management systems (EMS) to optimize the operation of the electric network [6].

The EMS, which consist of computer-aided tools, are widely used by electricity utilities to monitor, control, and optimize the performance of electricity generation and transmission systems [46]. In order to function effectively, EMS are essentially dependent on electric load forecasts [21], which motivates the experts in the field to develop new techniques in order to generate ever more accurate and reliable forecasts.

Figure 1.1: Bizerte governorate, Tunisia. (Source: Wikipedia)

Load forecasting has always played a vital role in the planning and management of electric power systems. Utilities rely on this type of analysis to make important decisions ranging from scheduling day-to-day operations to planning long-term extensions of their network.

Moreover, with the deregulation of the power industry, load forecasting is becoming increasingly important. In fact, today's utilities are called to innovate in the way they operate their business in order to maintain their presence in a very open and competitive marketplace. Having a good prior knowledge of the electrical demand to meet leads to an effective management of electric power purchasing and generation, thus providing subscribers with kilowatt-hours at competitive prices.

Given the importance of demand forecasting in the viability and security of electrical system operations, we felt that it would be of great interest to focus on this subject in order to provide practical and feasible solutions for the governorate of Bizerte and support it in its initiative towards a sustainable energy system.


1.2 Literature Review

The field of electric load forecasting has always aroused great interest among experts and researchers. Many papers and reports have been published on this subject, illustrating various approaches to predicting the load.

In this section, we review some of these publications with the aim of surveying the latest developments in this field and defining the scope of this thesis.

1.2.1 Classification of Electricity Load Forecasts

The purpose of load forecasting can be viewed differently even within the same electric utility. In fact, each department has its specific tasks and expects different outputs from the forecast analysis. For instance, whereas the operations department is only interested in the small time-window variation of the load, or seasonality, the planning department needs to assess the long-term variation, or trend. However, no single forecast can satisfy the needs of the different departments [33]. Thus, in practice, the load forecast is performed for different time intervals.

Experts classify load forecasting into four categories according to the forecast horizon (see figure 1.2). Each of these categories requires different inputs and approaches and serves different purposes [54].

Figure 1.2: Load forecasting classification (forecast horizons on a scale from seconds to years: VSTLF up to one hour, STLF up to one week, MTLF up to one year, LTLF beyond).

The four load forecast categories are:

• Very short term load forecasting (VSTLF): This operation covers a period of time ranging from one minute to one hour [3] and requires only inputs from the previous load [33, 54]. Its purpose is to provide information about the expected load in a small time window, allowing a real-time management of the power system [3].

• Short term load forecasting (STLF): The forecast horizon for this category ranges between one hour and one week [3]. In addition to the previous load values, weather-related information such as hourly values of the temperature is required to analyse and predict the future electricity demand [54]. The results of this analysis are important for unit commitment scheduling, demand side management (DSM) operations and energy trading [3, 33, 69].

• Medium term load forecasting (MTLF): Scheduling maintenance activities on the power system components or planning for fuel allocation generally requires knowledge of the future electrical load that extends up to a few months [3]. Forecasting for such horizons is referred to as medium term load forecasting. MTLF and STLF share the same input requirements. However, for MTLF, socio-economic indicators are also included in the forecasting process [33, 54].

• Long term load forecasting (LTLF): This forecasting category targets the variation of the load in the far future, i.e., more than one year [3]. It takes as inputs only socio-economic indicators and electric energy consumption [33, 54]. LTLF is applied mainly for planning infrastructural development of the power system and defining energy policies [3, 33].

After analysing the different categories of load forecasting and their applications, it was decided to focus on STLF, since our main interest is to provide a forecasting model that facilitates the management of daily operations on the power system. Thus, the remaining sections of the literature review concentrate exclusively on the approaches applied for STLF.

1.2.2 Short Term Load Forecast Approaches

The publications on STLF can be traced back to the 1960s, with perhaps one of the first conclusive studies on the subject conducted by Heinemann et al. in 1966. This study investigated the relationship between the temperature and the load during summer using regression analysis [30]. Since then, many other studies on STLF were published using various approaches and methods with varying degrees of success.

A survey of these approaches was conducted by Srivastava et al. in 2016. This survey, which covers the recent publications on the subject, classifies STLF methods into four categories [66]:

• Statistical techniques,

• Artificial intelligence (AI) techniques,


• Knowledge based expert systems, and

• Hybrid models.

In the following paragraphs, we discuss the aforementioned categories based on some publications.

1.2.2.1 Statistical Techniques

Statistical models were considered for a long time as the state of the art for STLF [33]. A statistical model is defined as an idealized mathematical representation of the process that generated the observations [67]. The statistical approach for STLF consists in determining the explicit mathematical relationship between the load and other exogenous factors such as the temperature or the humidity [66].

Many statistical techniques were discussed in the literature, such as multiple regression, exponential smoothing, adaptive filtering and stochastic time series analysis [66]. Two of these statistical techniques are discussed below.

The regression method consists in assuming a linear or non-linear mathematical relationship between a dependent variable (the load) and a set of selected independent variables (weather variables, weekday, ...). A regression analysis is then applied to determine the coefficients of the independent variables in the assumed model. For instance, Amral et al. used a multiple linear regression model to forecast the electric load up to 24 hours ahead for Sulawesi Island, Indonesia, by selecting the current and previous hourly values of the temperature as independent variables [2]. The conclusion of this study highlights the importance of including accurate values of the temperature in order to reduce the error between model predictions and actual values of the load [2].
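To make the regression approach concrete, the following is a minimal sketch in Python with numpy and scikit-learn (assumed tooling, not taken from the thesis or from [2]); it fits a linear model of the load on synthetic current and previous-hour temperatures:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
temp_now = rng.uniform(10, 35, 500)            # current hourly temperature (degrees C)
temp_prev = temp_now + rng.normal(0, 1, 500)   # previous-hour temperature
# synthetic hourly load with a linear dependence on both temperatures
load = 200 + 4.0 * temp_now + 1.5 * temp_prev + rng.normal(0, 5, 500)

X = np.column_stack([temp_now, temp_prev])     # independent variables
model = LinearRegression().fit(X, load)        # regression analysis
print(model.intercept_, model.coef_)           # estimated coefficients
```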

Time series analysis is another popular statistical technique for STLF. The main idea of time series models is that future observations can be accurately predicted by analysing the correlations between past observations. Some of the most widely used time series models for STLF are the autoregressive integrated moving-average (ARIMA) models [39], which are based on the Box and Jenkins methodology for time series analysis [8]. In [40], Juberias et al. investigated these models for day-ahead electric load forecasting and demonstrated that ARIMA models provide satisfactory results.
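For the time series approach, a minimal day-ahead sketch using an ARIMA model is given below, assuming statsmodels and synthetic hourly data; the order (2, 1, 2) is purely illustrative and not the configuration used in [40]:

```python
import numpy as np
from statsmodels.tsa.arima.model import ARIMA

rng = np.random.default_rng(0)
hours = np.arange(24 * 60)   # 60 days of hourly observations
# synthetic load with a daily cycle plus noise
load = 300 + 50 * np.sin(2 * np.pi * hours / 24) + rng.normal(0, 5, hours.size)

fit = ARIMA(load, order=(2, 1, 2)).fit()   # fit the model to past observations
day_ahead = fit.forecast(steps=24)         # predict the next 24 hours
print(day_ahead[:4])
```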

Despite the initial enthusiasm for statistical models, their use has declined considerably over time, mainly due to the challenges that experts faced in applying these techniques to practical problems. In fact, statistical models cannot be automatically updated; hence, a new modelling is required for each new observation. Furthermore, these models, which require a substantial computational workload, tend to provide unstable results in practice [39, 75]. Thanks to the advances in the field of artificial intelligence (AI), new and more effective techniques were introduced, allowing to overcome the limitations of statistical methods.

1.2.2.2 Artificial Intelligence Models

Machine Learning Models

Machine learning (ML) is a multidisciplinary field based on theories from other scientific fields such as AI, statistics and even neuroscience. In this context, the term "learning" designates the ability of a machine to improve its performance in executing a certain class of tasks by relying on its past experience [52].

The first step in solving a ML problem is to identify the type of learning that the machine should follow. We distinguish three types of learning, namely supervised, semi-supervised and unsupervised learning.

In the case of supervised learning, the machine is provided with a training set including both input and output variables. The aim of this learning is to estimate a function that maps the input to the output variables. In ML terminology, this function is called the target function and it is chosen from a set of mapping functions, or hypotheses, known as the hypothesis space.

For unsupervised learning, the machine is provided only with input variables and an algorithm is used to identify the distribution of the data. This type of learning is applied in clustering and association problems [12].

Semi-supervised learning is a mixture of supervised and unsupervised learning. In other words, not all the input variables are fed to the machine with corresponding outputs.

Although relatively new in the field of load forecasting (mid-1990s), ML models became increasingly popular among experts due to their capacity to perform non-linear modelling between the load and other exogenous factors [33]. Some of the notable techniques that were tested include artificial neural networks (ANN) and support vector machines (SVM).

ANN models were originally inspired by biological neural networks, though designed in a simpler manner. They are composed of multiple processing units, called neurons, capable of learning non-linear relationships between past and current observations. As mentioned in [18], ANN are one of the main techniques for STLF. It has been reported that over 30 electric utilities in the United States use ANN for load forecasting. Different ANN models were proposed over the years for STLF. For instance, Park et al. used a feedforward ANN to model the relationship between the load and past and current values of temperature. The proposed model was then tested to provide one hour ahead and one day ahead forecasts. The results showed a significant improvement in the accuracy of the forecasts in comparison to other conventional methods used by utilities at the date of the publication [58]. The latest developments in STLF with ANN include the use of more complex models such as recurrent neural networks (RNN) [74].

SVM are supervised ML models introduced in their current form by Vapnik in 1995 (see chapter 2). These models rely on kernel functions for mapping data into high dimensional spaces and solving non-linear problems. Using SVM for STLF is relatively new in comparison with ANN. One of the first applications of SVM models to this subject was during the European network on intelligent technologies for smart adaptive systems (EUNITE) worldwide competition on electricity load prediction in 2001. The task proposed in this competition was to develop an STLF model that uses past values of the load, average daily temperatures and the dates of holidays to predict future values of the load [16]. The performance of the proposed SVM model for this task was documented by Chen et al. in [17]. It was reported that SVM outperformed many other techniques such as ANN and ARIMA models. Other research projects on SVM for STLF include the work of Wang et al. for daily load forecasting [56] and Afshin et al. concerning SVM modelling optimization [1].

Other Models

Besides machine learning models, researchers investigated other AI techniques for STLF, such as fuzzy logic (FL) and genetic algorithms (GA).

FL techniques mimic the Boolean logic used for the design of digital circuits in order to map the input to the output variables [39]. In this sense, the Boolean operators are replaced by other logic functions and the mapping is achieved through a set of IF...THEN rules. In [44], Liu et al. tested an FL model for one day ahead load forecasting and reported satisfactory results.

GA are evolutionary algorithms developed to apply the biological rule of survival of the fittest to optimisation problems [66]. Promising results were reported for this approach in [31], where GA were used to optimize an ANN for one day ahead load forecasting. Nevertheless, only a few STLF projects are dedicated to GA.

1.2.2.3 Knowledge Based Expert Systems

Knowledge based expert systems (KBES) combine the advances in the field of AI and the experience of utility experts [66]. As it is difficult to cover all the factors influencing the load in one model, the knowledge acquired through the experience of experts is incorporated in the decision making process. Such techniques permit more accurate load prediction for particular days, which might be difficult for the machine to achieve by itself. In [53], Rahman et al. stated that KBES can provide more accurate STLF results than other statistical models.

1.2.2.4 Hybrid Models

Many publications have reported an improvement in STLF accuracy from combining two or more models [38, 60]. Such an approach permits overcoming certain limitations of the original methods [66]. This was demonstrated in [55] using an ARIMA-SVM model, which showed that the hybrid model is more accurate than both models separately. Some other examples of hybrid models include combining wavelet transform, ARIMA and ANN [25] or GA and ANN [31, 36].

1.2.3 Model Comparison

The primary interest in STLF remains the accuracy of the prediction. It is on this basis that models are compared in this section, with a focus on statistical and machine learning models. This comparison is intended to guide our choice of the models that might be most useful to our study.

In [48], Alvarez et al. (2015) compared the accuracy of different STLF models such as seasonal autoregressive integrated moving-average (SARIMA), SVM, ANN and FL models. Their findings show that machine learning models are the best alternative for STLF. In fact, in the different case studies mentioned in [48], SVM and ANN outperformed stochastic time series models, with an advantage for the SVM models over ANN.

Another comparison of ML and statistical models is given in [55]. In this example as well, SVM provided more accurate results than those predicted by the ARIMA models.

In [51], ANN were found to be slightly more accurate than SVM. However, the author expressed a preference for support vector machines given their capacity to always provide the global minimum of the error and their stability, unlike artificial neural networks.

1.3 Thesis Contribution

This thesis project aims to develop a STLF model to predict the electrical load in the governorate of Bizerte in hourly resolution. This model will have to take into account the influence of the temperature as well as the characteristics of the calendar to forecast the future load. In the light of the previous literature review, both SVM and ANN approaches were chosen to achieve this task.

The particularity of this thesis is that it studies a class of SVM that has not been used for STLF before, namely the ν-support vector regression (ν-SVR). The predictions of this model are confronted with those of a recurrent neural network (RNN), which is a class of ANN, for a comparative assessment of their accuracy and stability.

Given the influence of data quality on the success of ML models [42], several data pre-processing techniques are discussed in this thesis and a novel methodology for data cleansing is proposed.

1.4 Thesis Organization

The remainder of this thesis is structured as follows:

• Chapters 2 & 3 explore the theoretical background of SVM and ANN, respectively. These chapters serve as an introduction for the reader to the ML techniques that will be used for electric load forecasting.

• Chapter 4 describes the data sets used and the methodology followed in this study. Data preprocessing techniques, STLF model design and implementation, as well as the performance metrics employed for accuracy assessment are presented in this chapter.

• Chapter 5 is dedicated to the presentation and analysis of the experimental results. This chapter discusses the effectiveness of the methodology proposed for outlier detection and provides a comparison of the accuracy and stability of the designed STLF models.

• Chapter 6 summarizes the work carried out throughout this thesis and presents its conclusions. Some proposals for future work are presented to conclude this chapter.


Chapter 2

Support Vector Machines

“There is nothing so practical as a good theory.”

Kurt Lewin

SVM were originally invented by Vladimir N. Vapnik and Alexey Ya. Chervonenkis in 1963. However, their current form was developed by Vapnik and his co-workers at the AT&T Bell Laboratories in 1995 [62].

SVM are a class of supervised ML models grounded in statistical learning theory (SLT). These models were originally developed for classification problems. Nevertheless, thanks to their empirical performance, they were quickly extended to solve regression problems such as time series prediction.

The formulation of SVM is based on the structural risk minimization (SRM) principle. In contrast to the empirical risk minimization (ERM) principle, which is used by most ML models, SRM permits avoiding data over-fitting problems by defining an upper bound on the expected risk [37]. This results in a better generalization of the model to new observations.

In the following sections, we discuss the reasoning and mathematical formulation behind these models.

2.1 Support Vector Classification

Support vector classification (SVC) machines are decision machines that define a hyperplane to separate observations belonging to different classes. The choice of this hyperplane is based on the maximization of the margin that separates it from each class. Therefore, they are often referred to as maximum margin classifiers [5].

Figure 2.1: Support vector classification. (a) Multiple linear classifiers. (b) Maximum margin classifier.

Figure 2.1a shows that there exist multiple linear functions that can separate the red dots from the blue ones. However, the optimal solution for classification and future generalization on new data is the one that has the largest distance to both classes, as shown in figure 2.1b.

To better understand the working process of SVC, we first focus on the case where the data is linearly separable.

2.1.1 SVC for Linearly Separable Sets

Let us consider a two-class classification problem defined by the data set T = {(xi, ti), i = 1, ..., N} ⊂ R^n × {−1, 1}. We assume that T is linearly separable; then there exists at least one hyperplane which shatters (separates) all the training instances. The equation of the separating hyperplane can be written as [5]:

⟨w, x⟩ + b = 0,   (2.1)

where x ∈ R^n, w is a non-zero vector normal to the hyperplane and b is a scalar. w and b are called the weight vector and the bias, respectively.

Since there may exist multiple solutions that classify the data set correctly, we would rather select the one that reduces the generalization error on unseen data points. For SVM, this problem is equivalent to selecting the optimal pair of weight vector and bias that maximizes the distance between the hyperplane and both classes.

To ensure that all the data points or observations are classified correctly, the SVC algorithm applies an additional constraint on the parameters of the hyperplane. This constraint is called the hard-margin constraint and is defined by inequality 2.2:

ti(⟨w, xi⟩ + b) ≥ 1,   ∀i ∈ {1, ..., N}   (2.2)

This constraint implies that all the observations must be located outside the region delimited by the following normalized hyperplanes:

⟨w, x⟩ + b = ±1   (2.3)

To locate the position of these two hyperplanes, or boundaries, the machine relies on support vectors. They are defined as the critical points of the data set verifying equation 2.3. Figure 2.2 illustrates the support vectors as they define the position of the boundaries.

Consequently, the margin ∆ is defined as the distance between the separating hyperplane and each boundary, as shown in figure 2.2. From references [5, 71], ∆ can be written as:

∆ = 1 / ||w||   (2.4)

The optimal solution for any classification problem using SVM is characterized by the largest margin. Thus, the learning algorithm is designed to maximize the term 1/||w||, which is equivalent to minimizing (1/2)||w||² [5]. The task of the SVC is then reduced to solving the following convex quadratic programming problem (QPP):

minimize   (1/2)||w||²
subject to   ti(⟨w, xi⟩ + b) ≥ 1,   ∀i ∈ {1, ..., N}   (2.5)

As stated in [71], the solution for this optimization QPP can be obtained by introducing Lagrange multipliers αi ≥ 0 and solving the new problem in the dual space under the Karush-Kuhn-Tucker (KKT) conditions. It is important to note that the non-zero Lagrange coefficients correspond to the support vectors. Assuming that there are ℓ support vectors, the solution for the optimal weight vector and bias is given by:

w0 = ∑_{i=1}^{ℓ} αi ti xi,
αi ( ti(⟨w0, xi⟩ + b0) − 1 ) = 0,
∑_{i=1}^{ℓ} αi ti = 0.   (2.6)

Figure 2.2: Support vectors and decision boundaries for linear SVC (the separating hyperplane ⟨w, x⟩ + b = 0 between the boundaries ⟨w, x⟩ + b = ±1, each at distance ∆).

Hence, the optimal separating hyperplane or decision surface equation can be written as:

f0(x, α) = ∑_{i=1}^{ℓ} αi ti ⟨xi, x⟩ + b0 = 0   (2.7)

Then, for any input vector x, its classification is determined by the decision rule expressed in equation 2.8:

f0(x) = sgn( ∑_{i=1}^{ℓ} αi ti ⟨xi, x⟩ + b0 )   (2.8)

2.1.2 SVC for Non-Linearly Separable Sets

In practice, the data set is often not linearly separable and the different classes might overlap, as shown in figure 2.3. To overcome this problem, the learning algorithm of the SVC is redefined in a way that allows the misclassification of some observations while penalizing them according to their position relative to the decision surface. This is achieved through the introduction of slack variables ξi, which are defined as follows:

• ξi = 0 for data points that are correctly classified;

• 0 < ξi ≤ 1 for data points that lie inside the margin but on the correct side of the decision surface;

• ξi > 1 for misclassified data points.

Figure 2.3: Overlapping classes (data points annotated with ξi = 0, 0 < ξi ≤ 1 and ξi > 1).

With the introduction of ξi, the data set is now subject to a new constraint, called the soft-margin constraint, given by:

ti(⟨w, xi⟩ + b) ≥ 1 − ξi,   ∀i ∈ {1, ..., N}   (2.9)

Considering the above-mentioned constraint, the optimization QPP for non-linearly separable sets can be expressed as:

minimize   (1/2)||w||² + C ∑_{i=1}^{N} ξi
subject to   ti(⟨w, xi⟩ + b) ≥ 1 − ξi,   ∀i ∈ {1, ..., N} and ξi ≥ 0   (2.10)

The parameter C is introduced to control the trade-off between the maximization of the margin and the minimization of the training errors [5].

The solution for the QPP in equation 2.10 is obtained in the same manner as described in section 2.1.1 but with different constraints on the Lagrange multipliers. The expressions of the optimal weight vector and bias are similar to those defined in equation 2.6, with 0 ≤ αi ≤ C. Furthermore, the optimal hyperplane equation and the decision rule for the non-linearly separable case are similar to those presented in equations 2.7 and 2.8.
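As an illustration of the role of C in equation 2.10, the following minimal sketch (assuming scikit-learn; the data are synthetic) fits a linear soft-margin SVC on two overlapping classes:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# two overlapping Gaussian classes in the plane
X = np.vstack([rng.normal(-1, 1, (50, 2)), rng.normal(1, 1, (50, 2))])
t = np.array([-1] * 50 + [1] * 50)

for C in (0.1, 10.0):
    clf = SVC(kernel="linear", C=C).fit(X, t)
    # a smaller C tolerates more margin violations, which typically
    # leaves more support vectors per class
    print(C, clf.n_support_)
```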


2.2 Kernel Functions

In certain applications, the data set classes can be deeply overlapping, which makes it impossible to perform a linear classification in the feature space even by introducing slack variables. The solution for these applications can be obtained by applying Cover's theorem, which stipulates that a non-linear classification problem can very probably be solved by linear classifiers after projecting the input set into a higher dimensional space using a non-linear transformation function φ [20].

From the previous results in sections 2.1.1 and 2.1.2, it is clear that the equation of the optimal hyperplane and the decision rule are functions of the inner product between the support vectors and the new input vector. By mapping the input set into a higher dimensional space, we now need to compute the high dimensional inner product of their transformations, which requires a good knowledge of the mapping function.

According to the Hilbert-Schmidt theory for inner products in high dimensional spaces, computing ⟨φ(xi), φ(xj)⟩ is equivalent to computing a symmetric function K(xi, xj) satisfying Mercer's theorem [5, 71]. K is called the kernel function. Its main advantage is that it does not require any knowledge of the mapping function. Therefore, the use of the function K is commonly referred to as the kernel trick.

The choice of the kernel function depends essentially on the data set, and in certain cases several trials must be performed before choosing the appropriate one. Nevertheless, a popular choice of K is the radial basis function (RBF) [65], defined by:

K(xi, xj) = exp( −γ ||xi − xj||² )   (2.11)

where γ is the inverse of the standard deviation of the RBF. This parameter needs to be tuned, i.e., adjusted, to obtain a good performance from the model.
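The kernel trick can be made concrete in a few lines of numpy; the following is an illustrative sketch of equation 2.11 only, not code from the thesis:

```python
import numpy as np

def rbf_kernel(xi, xj, gamma=0.5):
    # K(xi, xj) = exp(-gamma * ||xi - xj||^2): an inner product in a
    # high-dimensional feature space, computed without an explicit mapping
    return np.exp(-gamma * np.sum((xi - xj) ** 2))

xi, xj = np.array([1.0, 2.0]), np.array([2.0, 0.5])
print(rbf_kernel(xi, xj))   # kernel value for the chosen gamma
```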

2.3 Support Vector Regression

SVM are also used to solve regression problems. In this context, they are called support vector regression (SVR) machines. SVR uses the same principles as SVC with slight differences. The main idea of SVR is to estimate a function f for which the maximum deviation between the target variables ti and the images of the input variables, yi = f(xi), does not exceed a certain threshold ε. The parameter ε is an indicator of the precision of the approximation function f.

Figure 2.4: Support vector regression (the ε-tube around the regression function, with slack variables ξi and ξi* measuring the error vectors of points outside the tube and the support vectors lying on its boundaries).

The region delimited by the upper and lower values of the deviation is called the ε-tube (the shaded region in figure 2.4) and the support vectors are the points that lie on its boundaries.

As it might be practically infeasible to construct an SVR that respects the constraint made on the deviation ε, slack variables ξi and ξi* are introduced, similarly to the classification case, in order to allow the presence of some training errors. These variables represent the excess of deviation beyond the ε-tube.

To measure the approximation errors made by the SVR, Vapnik introduced the ε-insensitive loss function, given by [71]:

|t − y|ε = 0 if |t − y| ≤ ε, and |t − y|ε = |t − y| − ε otherwise.   (2.12)
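Equation 2.12 translates directly into code; a minimal numpy sketch:

```python
import numpy as np

def eps_insensitive_loss(t, y, eps=0.1):
    # deviations inside the epsilon-tube cost nothing; outside it, only
    # the excess beyond epsilon is penalized
    return np.maximum(np.abs(t - y) - eps, 0.0)

print(eps_insensitive_loss(np.array([1.0, 1.05, 2.0]), np.array([1.0, 1.0, 1.0])))
# -> [0.  0.  0.9]
```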

For a linear regression problem, the approximation function takes the same form as presented in equation 2.1. Therefore, to estimate this function for a given data set T = {(xi, ti), i = 1, ..., N} ⊂ R^n × R, the task for SVR is to find the weight vector and bias that minimize the empirical risk relative to |t − y|ε. This can be expressed by the following convex optimization QPP [71]:

minimize   (1/2)||w||² + C ∑_{i=1}^{N} (ξi + ξi*)
subject to   ti − ⟨w, xi⟩ − b ≤ ε + ξi,
             ⟨w, xi⟩ + b − ti ≤ ε + ξi*,
             ξi, ξi* ≥ 0,   ∀i ∈ {1, ..., N}   (2.13)

where C is a positive parameter that expresses the trade-off between the minimization of the norm of the weight vector and the tolerated fraction of deviations larger than ε [64].

As for the previous QPPs, the system (2.13) can be solved in the dual space under the KKT conditions with the introduction of Lagrange multipliers αi, αi* ∈ [0, C].

The optimal solution for the weight vector is given as a function of the non-zero Lagrange multipliers of the support vectors. For ℓ support vectors, it can be written as:

w0 = ∑_{i=1}^{ℓ} (αi − αi*) xi   (2.14)

Hence, the optimal approximation function f0 is:

f0(x) = ∑_{i=1}^{ℓ} (αi − αi*) ⟨xi, x⟩ + b0   (2.15)

The above solution can be generalized to the case of non-linear problems using the kernel trick as follows:

f0(x) = ∑_{i=1}^{ℓ} (αi − αi*) K(xi, x) + b0   (2.16)

Over the past decades, much research has been conducted to improve the performance of SVR machines. Therefore, one can find many variants of and alternative methods to the ε-SVR. These methods include the least squares support vector regression (LS-SVR) proposed by Suykens in [68], which solves a linear system instead of the convex QPP proposed by Vapnik. Another method, called ν-support vector regression (ν-SVR), was proposed by Schölkopf et al. in [64]. This method is the focus of the next section.


ν-Support Vector Regression

ν-SVR is a variant of the SVR algorithm that introduces a new parameter ν into the QPP proposed by Vapnik for the ε-SVR. Considering a data set T = {(xi, ti), i = 1, ..., N} ⊂ R^n × R, the new formulation proposed by Schölkopf et al. is given by the following system [64]:

minimize   (1/2)||w||² + C ( Nνε + ∑_{i=1}^{N} (ξi + ξi*) )
subject to   ti − ⟨w, xi⟩ − b ≤ ε + ξi,
             ⟨w, xi⟩ + b − ti ≤ ε + ξi*,
             ξi, ξi* ≥ 0,   0 < ν ≤ 1,   ∀i ∈ {1, ..., N}   (2.17)

The above optimization QPP is solved in the same manner as the one in equation 2.13. However, for the ν-SVR, the Lagrange multipliers are subject to the following constraints [64]:

∑_{i=1}^{N} (αi + αi*) ≤ NνC,
∑_{i=1}^{N} (αi − αi*) = 0,
αi, αi* ∈ [0, C]   (2.18)

From these constraints, it follows that the number of support vectors is a function of the parameter ν. Indeed, as specified in [64], ν constitutes a lower bound on the fraction of support vectors. Furthermore, it is also an upper bound on the fraction of training errors. Therefore, by adjusting ν, the performance of the SVR machine can be improved.
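As an illustration of the role of ν, the following minimal sketch uses scikit-learn's NuSVR on synthetic data (an assumed tool; the implementation used in the thesis is not specified in this chapter):

```python
import numpy as np
from sklearn.svm import NuSVR

rng = np.random.default_rng(0)
X = rng.uniform(0, 6, (200, 1))
t = np.sin(X).ravel() + rng.normal(0, 0.1, 200)   # noisy sine curve

svr = NuSVR(nu=0.8, C=1.0, kernel="rbf", gamma=0.5).fit(X, t)
frac_sv = svr.support_.size / X.shape[0]
print(frac_sv)   # fraction of support vectors, bounded below by nu = 0.8
```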

2.4 Discussion

As we have seen in this chapter, SVM are designed to solve convex QPPs, which guarantees their convergence to an optimal solution (the global minimum). However, the computational cost of these models might be extremely high for certain problems. Indeed, as stated in [23], many factors can increase the SVM computational cost, especially the number of support vectors, the value of the error cost C, as well as the kernel configuration. Thus, the success of the optimization process depends highly on the adequate selection of the kernel function and the SVM parameters (C, ε, ν).

In this context, different approaches were proposed to optimize the selection process. For instance, one method suggests the use of the cross-validation technique, which randomly splits the original data into many folds and tests different combinations of the SVM parameters on them [57]. Another method, proposed by Cherkassky and Ma, includes the mean and the standard deviation of the data set in the selection process [19].
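A minimal sketch of the cross-validation approach, assuming scikit-learn's GridSearchCV; the parameter grid is illustrative and not the search intervals reported in table 4.1:

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import NuSVR

rng = np.random.default_rng(0)
X = rng.uniform(0, 6, (200, 1))
t = np.sin(X).ravel() + rng.normal(0, 0.1, 200)

# every combination of the grid is scored on 5 folds of the data
grid = {"C": [0.1, 1.0, 10.0], "gamma": [0.1, 0.5, 1.0], "nu": [0.4, 0.8]}
search = GridSearchCV(NuSVR(kernel="rbf"), grid, cv=5).fit(X, t)
print(search.best_params_)   # best combination found on the folds
```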


Chapter 3

Artificial Neural Networks

“Neurons wire together if they fire together.”

Lowel and Singer

Artificial neural networks (ANN) are ML models that were inspired by the human brain and designed to mimic its functioning. The first ANN was developed by Warren McCulloch and Walter Pitts in 1943. Although their model was only capable of solving simple arithmetic and logical problems, it attracted the attention of many researchers. One of these researchers was the psychologist Frank Rosenblatt, who in 1958 achieved the first practical application of ANN by introducing a new model called the perceptron [27].

Nowadays, ANN are applied in various fields such as banking, robotics and telecommunications. They are also quite a popular choice when dealing with time series modelling and forecasting. This is mainly due to the fact that ANN models are data-driven and self-adaptive methods [73]. In other words, they do not require any prior knowledge of the statistical distribution of the data, unlike statistical models for example. Furthermore, ANN have the ability to treat non-linear problems and to infer unseen data with high accuracy.

In this chapter, we give an overview of the basic architectures of ANN and discuss some of their learning algorithms.


3.1 Artificial Neural Networks Architectures

3.1.1 Artificial Neurons

The principal element of any ANN is the artificial neuron, illustrated in figure 3.1. It is an information-processing unit that defines the final output of the network. It can be modeled using three main elements [28]:

Synapses

They are connecting links that act like the dendrites of biological neurons. Each of these links is connected to an input variable xi and has a specific synaptic weight wi. In these links, the input variables are multiplied by the corresponding synaptic weights. The resultant values represent the influence of the input variables in the determination of the final output.

Summing junction

It is where all the weighted input variables are summed together. This junction also includes an additional term in the summation process called the bias b.

Activation function

This function plays an important role in the working process of ANN. Since it can be a linear or a non-linear function, it allows solving even non-linear problems, which extends the scope of applications of these models. Some of the most popular activation functions are listed below (a minimal numpy sketch follows the list):

• Heaviside step function:

f(x) = 0 if x < 0, and f(x) = 1 if x ≥ 0   (3.1)

• Sigmoid function:

f(x) = 1 / (1 + exp(−x))   (3.2)

• Hyperbolic tangent function:

f(x) = (1 − exp(−2x)) / (1 + exp(−2x))   (3.3)

• Rectified linear unit (ReLU):

f(x) = max(0, x)   (3.4)
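In numpy form, the four activation functions above can be written as follows; this is an illustrative sketch, not code from the thesis:

```python
import numpy as np

def heaviside(x):
    # Heaviside step function (3.1): 0 for x < 0, 1 for x >= 0
    return np.where(x >= 0, 1.0, 0.0)

def sigmoid(x):
    # Sigmoid function (3.2)
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):
    # Hyperbolic tangent (3.3); numerically equal to np.tanh(x)
    return (1.0 - np.exp(-2 * x)) / (1.0 + np.exp(-2 * x))

def relu(x):
    # Rectified linear unit (3.4)
    return np.maximum(0.0, x)
```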


Figure 3.1: Block diagram of an artificial neuron (inputs x1, ..., xn weighted by w1, ..., wn, a summing junction with bias b, and an activation function f producing the output y).

When the activation function is a Heaviside step function, the artificial neuron is called a perceptron [4].

Given the above description of the different components of an artificial neuron, the mathematical relationship between the input vector x, the weight vector w, the bias b, the activation function f and the output variable y can be written as:

y = f(⟨x, w⟩ + b) = f( ∑_{i=1}^{n} xi wi + b )   (3.5)
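Equation 3.5 corresponds to a one-line forward pass. The following is a minimal numpy sketch; the input, weights and bias are illustrative values, not taken from the thesis:

```python
import numpy as np

def neuron(x, w, b, f=np.tanh):
    # y = f(<x, w> + b): weighted sum of the inputs passed through
    # the activation function f
    return f(np.dot(x, w) + b)

x = np.array([0.5, -1.0, 2.0])   # input vector
w = np.array([0.2, 0.4, -0.1])   # synaptic weights
print(neuron(x, w, b=0.05))      # scalar output of the neuron
```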

3.1.2 Layers of Neurons

A single artificial neuron is capable of solving simple problems such as the linear two-class classification problem. However, for more difficult tasks, multiple neurons are required. In the context of ANN, a layer of neurons is defined as a set of neurons working in parallel [27]. We distinguish three types of layers:

• The input layer: it is where the input vectors are fed to the neural network. Thus, the number of neurons, or nodes, in this first layer is defined by the dimension of the input vector. These neurons are passive; their only function is to relay the input variables to the neurons of the next layer.

• The hidden layer: it is where the input variables are processed. The neurons of this layer are active and each one has its own bias.


• The output layer: the neurons in this layer are also active and their number depends on the number of outputs specified in the problem.

Figure 3.2: Feedforward neural network architecture (an input layer I1, ..., In, a hidden layer h1, ..., hm and an output layer O1, ..., Op, connected by the weight matrices W(1) and W(2) with bias vectors b(1) and b(2)).

When the neurons of each layer are connected to all the neurons of the subsequent layer, the ANN is said to be fully connected. Moreover, when the information moves in one direction only, the network is called a feedforward ANN.

For a feedforward ANN with a single hidden layer, as shown in figure 3.2, and considering an input vector x ∈ R^n, the output vector y is given in equation 3.6:

y = f^(2)( W^(2) f^(1)( W^(1) x + b^(1) ) + b^(2) )   (3.6)

f^(1) and f^(2) are the activation functions of the hidden and output layers, respectively, and they can be different from each other.

W^(1), b^(1) and W^(2), b^(2) are the weight matrices and bias vectors of the hidden and output layers, respectively, and they are given by:

W^(i) =
| w^i_{1,1}  w^i_{1,2}  w^i_{1,3}  ...  w^i_{1,n} |
| w^i_{2,1}  w^i_{2,2}  w^i_{2,3}  ...  w^i_{2,n} |
|    ...        ...        ...     ...     ...    |
| w^i_{k,1}  w^i_{k,2}  w^i_{k,3}  ...  w^i_{k,n} |

b^(i) = ( b^i_1, b^i_2, ..., b^i_k )^T   (3.7)

where k is the number of neurons in the layer. The coefficients of the weight matrix and the bias vector are defined as shown in figure 3.2. For instance, w^1_{2,1} corresponds to the weight of the synapse linking the first input neuron to the second hidden neuron, and b^1_1 is the bias of the first hidden neuron.
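To make equation 3.6 concrete, the following sketch (ours, not the thesis implementation) computes the output of a fully connected network with one hidden layer; the dimensions and the choice of tanh activations are arbitrary for the example:

import numpy as np

def feedforward(x, W1, b1, W2, b2, f1, f2):
    # Forward pass of a single-hidden-layer network (equation 3.6)
    hidden = f1(W1 @ x + b1)       # hidden layer activations
    return f2(W2 @ hidden + b2)    # network output

# Example with n = 3 inputs, m = 4 hidden neurons and p = 2 outputs
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=4)
W2, b2 = rng.normal(size=(2, 4)), rng.normal(size=2)
y = feedforward(rng.normal(size=3), W1, b1, W2, b2, np.tanh, np.tanh)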

There exist different types of feedforward neural networks, such as RBF or multilayer perceptron (MLP) networks. Considered one of the most influential ANN models, MLP networks were initially defined as layers of perceptrons using the Heaviside step function. However, this definition was later extended to cover many other models, regardless of the neurons' activation function. The most common activation functions for the MLP are the hyperbolic tangent and the logistic-sigmoid functions [27, 43, 73].

Feedforward neural networks with one or multiple hidden layers are very useful in time series forecasting. This is mainly due to their capacity to arbitrarily map the input to the output variables [73].

3.1.3 Recurrent Neural Networks

Feedforward neural networks are quite efficient in many real-life applications. However, they have a major drawback, which consists in their inability to keep track of previously treated information. Indeed, each neuron treats newly incoming information without any consideration for past inputs. This proves problematic when analysing data sets that present dependencies between their current and past values. This issue is addressed by a different type of ANN called recurrent neural networks (RNN). A simple representation of these networks is given in figure 3.3.

Using an additional loop in their architecture, RNN are able to store the information treated at time t and use it again to analyse the input vectors of the next time steps. Thus, the hidden neurons of RNN are often assimilated to memory units [28].

For a better understanding of the working mechanism of RNN, we unfold the diagram given in figure 3.3 over the data set. Let us consider a set of



Figure 3.3: Recurrent neural network architecture.

input and target vectors T = {(x_t, t_t), t = 1, 2, …, N}, and let y_t be the output of the neural network corresponding to x_t. Figure 3.4 shows that an RNN can be seen as a set of identical feedforward neural networks connected to each other in such a way that the output of the hidden layer in one network is transmitted to the hidden layer of the subsequent network. In other words, the state of a hidden layer does not only depend on the input vector at this step but also on the state of the previous hidden layer. Consequently, the outputs at each step are updated with the treated information of the previous steps, and for t ∈ {2, …, N} we can write [13, 22]:

h_t = f^(h)( W x_t + U h_{t−1} + b^(h) )
y_t = f^(o)( V h_t + b^(o) )    (3.8)

where:

• W is the weight matrix connecting the input layer to the hidden layer.

• U is the weight matrix connecting the hidden layers across time steps.

• V is the weight matrix connecting the hidden layer to the output layer.

• b^(h) and b^(o) are the bias vectors of the hidden and output layers, respectively.

• h_t is the hidden layer vector.

• f^(h) and f^(o) are the activation functions of the hidden and output layers, respectively.



Figure 3.4: Unfolded recurrent neural network.

The input vector x1 is used to initialize the RNN.
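As an illustrative sketch (ours, with generic parameters), the recurrence of equation 3.8 can be unrolled over a sequence as follows:

import numpy as np

def rnn_forward(xs, W, U, V, bh, bo, fh=np.tanh, fo=np.tanh):
    # Unrolled forward pass of the RNN in equation 3.8
    h = np.zeros(U.shape[0])           # initial hidden state
    ys = []
    for x in xs:                       # xs: sequence of input vectors
        h = fh(W @ x + U @ h + bh)     # hidden state carries past information
        ys.append(fo(V @ h + bo))      # output at the current time step
    return ys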

We have now covered the basic architectures of ANN; in the next section, we focus on their learning algorithms.

3.2 Learning Algorithms

The learning algorithm of an ANN is a set of rules and equations defining the way to update the system parameters, namely the weights and biases, in order to optimize the performance of the network. ANN learning can be viewed as an optimization problem where the selected model is the one that minimizes a chosen loss function.

The learning process usually starts by initializing the network's parameters to random values and then updating them at each iteration with regard to the error between the targets t_i and the outputs of the network y_i.

Different optimization algorithms have been proposed to train ANN, such as gradient descent, Newton's method or the Levenberg-Marquardt algorithm. In this thesis, we focus on the gradient descent method and restrict our study to the case of supervised learning.


3.2.1 Online Learning and Batch Learning

The learning process of ANN can be performed following two different modes, namely online learning and batch learning.

To understand the difference between the two modes, we first need to introduce three key terms:

• Epoch: it corresponds to exposing the ANN model to the entire training set once.

• Batches: as the training set may be too large to feed to the learning machine at once, it is more suitable to divide an epoch into smaller sets called batches.

• Iterations: the number of batches in one epoch.

In online learning, the ANN parameters are adjusted with every new training input-target vector introduced. Thus, the number of iterations corresponds to the size of the training sample. For batch learning, however, the weights and biases are adjusted on a batch-by-batch basis.

Both methods have their advantages and drawbacks. While online learning is simple to implement and provides good solutions for a wide range of problems, batch learning gives more accurate results but also requires significant storage capacity [28].

Independently of the mode chosen to fit the training set, it is often necessary to use several epochs to obtain good results from the ANN model.

3.2.2 Widrow-Hoff Learning Rule

The Widrow-Hoff learning rule, also called the delta rule (DR), is an iterative learning method for a single-layer neural network. It is based on the gradient descent optimization algorithm to adjust the weights and biases of the network.

Although the delta rule is defined for any activation function, it is often presented in a simplified manner for a neuron with a linear activation function. To update the network parameters after each iteration k in the online mode, the DR is given by [27]:

W(k+1) = W(k) + β (t_i − y_i) ⊗ x_i
b(k+1) = b(k) + β (t_i − y_i)    (3.9)

where β is the learning rate.
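A minimal sketch of one online delta-rule update for a linear layer (our illustration, following equation 3.9):

import numpy as np

def delta_rule_step(W, b, x, t, beta):
    # One online Widrow-Hoff update for a linear neuron layer (equation 3.9)
    y = W @ x + b                        # linear neuron output
    error = t - y                        # target minus output
    W = W + beta * np.outer(error, x)    # beta * (t - y) ⊗ x
    b = b + beta * error
    return W, b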


3.2.3 Backpropagation Algorithm

The backpropagation (BP) algorithm is a generalization of the DR to multilayer feedforward neural networks; thus, it is also based on the gradient descent method. The mathematical description of this algorithm is given below, in accordance with [27].

Let us consider a feedforward neural network with M layers and a pair of input-target vectors (x_i, t_i). The equation connecting the outputs of the different layers can be written as:

a^(m+1) = f^(m+1)( W^(m+1) a^(m) + b^(m+1) ),  for m = 0, 1, …, M−1    (3.10)

a^(m) is the output vector of layer m. a^(0) and a^(M) correspond to the input vector x_i and the output vector y_i, respectively.

We also define the input of the ith neuron in layer m as follows:

n^m_i = ∑_{j∈D} w^m_{i,j} a^{m−1}_j + b^m_i    (3.11)

where D is the set of neurons in layer (m−1) and a^{m−1}_j is the output of the j-th neuron.

Equation 3.10 describes the forward propagation of the input vector x_i to obtain the output vector y_i. The latter is then compared to the target vector t_i to estimate the error using a loss function L.

The main idea of the BP algorithm is to iteratively adjust the network parameters in each layer by propagating backwards, from the output to the input layer, the gradient of L with regard to the weights and biases of each layer. Using the gradient descent method, the adjustment of the weights and biases in a layer m can be expressed as follows:

w^m_{i,j}(k+1) = w^m_{i,j}(k) − β ∂L/∂w^m_{i,j}
b^m_i(k+1) = b^m_i(k) − β ∂L/∂b^m_i    (3.12)

where k is the iteration number.

Using the chain rule and equation 3.11, the gradient of L with regard to the weights and biases can be written as:


∂L/∂w^m_{i,j} = s^m_i a^{m−1}_j
∂L/∂b^m_i = s^m_i
s^m_i = ∂L/∂n^m_i    (3.13)

s^m_i is called the sensitivity and represents the influence of the i-th neuron in layer m on the output error [27].

Given equations 3.12 and 3.13, the backpropagation algorithm is then given by the following system:

W^(m)(k+1) = W^(m)(k) − β s^(m) ⊗ a^(m−1)
b^(m)(k+1) = b^(m)(k) − β s^(m)
for m = M, …, 2, 1    (3.14)

where s^(m) is the vector of sensitivities in layer m.
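To fix ideas, here is a minimal sketch of one BP update for a two-layer network with tanh hidden units, a linear output and a squared-error loss (our illustrative choices, not the thesis configuration):

import numpy as np

def backprop_step(x, t, W1, b1, W2, b2, beta=0.01):
    # Forward pass (equation 3.10)
    a1 = np.tanh(W1 @ x + b1)          # hidden layer output
    y = W2 @ a1 + b2                   # linear output layer
    # Backward pass: sensitivities (equation 3.13)
    s2 = y - t                         # dL/dn2 for L = 0.5 * ||t - y||^2
    s1 = (1.0 - a1**2) * (W2.T @ s2)   # tanh'(n1) times backpropagated error
    # Parameter updates (equation 3.14)
    W2 -= beta * np.outer(s2, a1)
    b2 -= beta * s2
    W1 -= beta * np.outer(s1, x)
    b1 -= beta * s1
    return W1, b1, W2, b2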

3.2.3.1 Backpropagation Through Time

The backpropagation through time (BPTT) algorithm is the application of the BP algorithm to recurrent neural networks. Indeed, in section 3.1.3, we saw that an RNN can be defined as a set of feedforward neural networks sharing the same weight matrices and bias vectors; thus, the BP algorithm can be used for training RNN. However, due to the particularity of their hidden layers, the gradient of the loss function with regard to the network parameters differs for recurrent networks from the one defined in equation 3.13.

By unfolding the RNN over time, it is clear that we actually have a loss at each time step t, denoted by L_t. Since the output of the hidden layer h_t is fed as input to the next hidden layer, the loss L_t is also transmitted to the next time step. The total loss of the RNN is then defined as the sum of all the losses across time.

To better understand this dependency, we consider the RNN shown in figure 3.4 and propose to calculate the gradient of the loss function with regard to the weight matrix W connecting the input layer to the hidden layer at t = 3. Using the chain rule, we obtain:

∂L_3/∂W = (∂L_3/∂y_3) (∂y_3/∂h_3) (∂h_3/∂W)    (3.15)


However, from equation 3.8, it is clear that h_3 is a function of h_2, which is in turn a function of h_1. Thus, we can write:

∂L_3/∂W = ∑_{t=1}^{3} (∂L_3/∂y_3) (∂y_3/∂h_3) (∂h_3/∂h_t) (∂h_t/∂W)    (3.16)

Equation 3.16 expresses the contribution of the coefficients of W to the error at time step t = 3.

As with the BP algorithm, to adjust the RNN parameters the BPTT algorithm propagates backwards the gradient of the loss function across all the time steps.

3.3 Discussion

Unlike SVM models, ANN are hard to train. Indeed, the learning process of neural networks needs to be performed several times before obtaining acceptable results. This can be explained by the fact that the learning algorithms for these models have a tendency to converge to local minima of the loss function; by repeating the learning process several times, the global minimum can be approached.

The design of the neural network plays a major role in improving the performance of the machine. The number of hidden layers and neurons must be chosen in accordance with the complexity of the task to ensure that all the information in the data set is captured.

For our problem, we chose to work with RNN. This choice is motivated by the fact that electric load time series usually display an important dependency between their observations as well as a certain seasonal behaviour; RNN therefore seem a more appropriate choice for load forecasting.


Chapter 4

Methodology

“Methodology is intuitionreconstructed in tranquillity.”

Paul Lazarsfeld

This chapter describes the approach used in this thesis for STLF in the governorate of Bizerte. The main task at this level is to build a model that can accurately predict the electric load at an hour H using electric load data as well as other exogenous factors from the previous hour (H−1). The study approach can be summarized in five steps, as shown in figure 4.1 below:

Data pre-processing → Model selection → Training & testing → Forecasting → Results analysis

Figure 4.1: Adopted methodology for STLF.

A detailed description of each of these steps is given in the next sections, but we first start by presenting the data sets used in this study.

4.1 Data

4.1.1 Electric Load Time Series

The time series of the electric load were collected from the national electric utility "Société Tunisienne de l'Electricité et du Gaz" (STEG) during a mission to Bizerte from January 8th to January 22nd, 2018. To better understand the nature of the collected data, it is first necessary to explain how


the governorate is supplied with electricity. Bizerte is served by the national electric grid via high-voltage transmission lines. These lines are connected to five HV/MV substations which feed into the distribution network.

The electric utility provided us with the electric load data of each substation for the period from 01-01-2013 to 31-12-2017. In total, 60 Excel files were collected, containing monthly data with a 15-minute resolution. For each file, the electric loads were summed synchronously over the five substations. Then, all the data sets were regrouped in one file to create one continuous time series.

4.1.2 Exogenous Factors

The electric load is often influenced by several external factors such as weather conditions and calendar events. As mentioned in section 1.2.1, it is important to include some of these factors in the STLF process in order to obtain more accurate results. For our application, both the temperature and the calendar data were included in the forecasting process.

4.1.2.1 Temperature

The hourly values of the temperature for the region of Bizerte were collected from the national institute of meteorology of Tunisia. The time series of the temperature covers the same period as that of the electric load.

4.1.2.2 Calendar Data

The calendar data set was created in Excel according to STEG's classification of the calendar year. It includes the following items:

• Day (D);

• Month (M);

• Year (Y );

• Hour (H);

• Day of week (DOW): Monday, Tuesday, · · · , Sunday;

• Type of day (TOD): Monday, working day (from Tuesday to Friday),weekend and holidays;

The different items of the calendar data set were encoded as natural numbers, since the SVR and ANN take only numerical data as inputs.
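As an illustration of such an encoding (a sketch of ours; the specific numeric codes are an assumption, and public holidays would additionally require an explicit holiday list):

import pandas as pd

# Hourly calendar features over the study period
idx = pd.date_range("2013-01-01", "2017-12-31 23:00", freq="H")
cal = pd.DataFrame(index=idx)
cal["D"], cal["M"], cal["Y"], cal["H"] = idx.day, idx.month, idx.year, idx.hour
cal["DOW"] = idx.dayofweek               # Monday = 0, ..., Sunday = 6
cal["TOD"] = 1                           # working day (Tuesday to Friday)
cal.loc[cal["DOW"] == 0, "TOD"] = 0      # Monday
cal.loc[cal["DOW"] >= 5, "TOD"] = 2      # weekend (holidays would be added here)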


4.2 Data Exploration with Descriptive Statistics

Descriptive statistics is one of the most convenient techniques to summarize quantitative data in a few key measures. In opposition to inferential statistics, where the aim is to extrapolate the sample results to a population, descriptive statistics simply describe the existing data. They give great insight into the central tendency, the variability and the shape of the data distribution. To cover these aspects, we usually look at the following statistics:

• The mean value: it is a measure of the central tendency and can be defined as the arithmetic average of a set of observations. For a finite dataset T = {X_t, t = 1, …, n}, the mean value µ is given by:

µ = (1/n) ∑_{t=1}^{n} X_t    (4.1)

• The median or second quartile X: it is defined as the value separating the upper part of an ordered data set from the lower part.

• The first quartile Q1 and the third quartile Q3: they are defined as the median of the lower part and of the upper part of an ordered dataset, respectively. The distance between Q1 and Q3 is called the interquartile range (IQR).

• The standard deviation: the values taken by the data set's observations usually deviate from the mean value. To estimate this deviation, we refer to the variance. For a discrete time series of n observations, the variance σ² can be written as:

σ² = (1/n) ∑_{t=1}^{n} (X_t − µ)²    (4.2)

However, since the variance is expressed in squared units, it is more practical to have a measure of the deviation in the same units as the observations. The standard deviation is then defined as the square root of the variance and simply noted σ.
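In practice, all of these measures can be obtained in a single call with pandas (a sketch; `load` stands for the hourly load Series):

import pandas as pd

# `load` is assumed to hold the hourly load time series as a pandas Series
stats = load.describe()              # count, mean, std, min, quartiles, max
q1, q3 = stats["25%"], stats["75%"]
iqr = q3 - q1                        # interquartile range, used in section 4.3.2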

4.3 Data Preprocessing

Time series generated by real-life processes often contain inconsistencies within their observations. Since the quality and the format of the input data can strongly affect the performance of ML models [42], pre-processing the data is of great importance.


Data pre-processing is a set of techniques that are applied to the raw data set in order to ensure its reliability and to prepare it for further processing. It includes:

• Data cleansing: removing outliers and filling missing values;

• Data integration: combining data from different sources to create a coherent representation of them;

• Data transformation: scaling, standardizing, aggregating or generalizing the data;

• Data reduction: reducing the volume of the stored data by transforming it into a simplified form.

In this section, we present the techniques used to pre-process the time series of the load.

4.3.1 Data Aggregation

From the description of the data sets in section 4.1, it is clear that the time series of the electric load and of the exogenous factors do not have the same temporal resolution. Therefore, it was necessary to decrease the resolution of the load time series from 15 minutes to one hour. This was achieved by considering only the hourly peaks of the load. The resulting time series is presented in figure 4.2.
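With pandas, this aggregation is a one-liner (sketch; `load_15min` is an assumed name for the 15-minute series with a timestamp index):

import pandas as pd

# Keep the hourly peak of the 15-minute load series
load_hourly = load_15min.resample("H").max()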

Figure 4.2: Time series of the electric load in Bizerte in hourly resolution.


4.3.2 Data Cleansing

4.3.2.1 Outlier Detection

The definition of an outlier depends on the nature of the process being monitored. However, we can adopt the definition given by Hawkins, which states that "an outlier is an observation that deviates so much from other observations as to arouse suspicion that it was generated by a different mechanism" [47].

The detection of outliers is an important step in time series analysis. It should be performed at the early stages of the analysis due to the influence these observations may have on the forecasting process.

Outlier detection methods can be divided into two main categories, namely global and local methods. While the former compare each observation with the entire dataset, the latter only consider a limited number of neighbouring points.

In general, removing outliers from a dataset is a critical procedure. Indeed, outlier detection techniques might provide biased results by mislabelling certain points as outliers. Thus, the labelled observations need to be examined carefully, case by case if needed, before proceeding to their removal. For that, a good knowledge of the characteristics of the process that generated the time series is often necessary to analyse the results of the applied outlier detection techniques.

For our project, after a first examination of the plot of the electric load, it seemed sensible to detect outliers for each year separately and to use global and local methods successively. This choice is justified by the fact that the electric load in Bizerte is subject to important changes over long periods of time (uptrend), so certain values of the last year (2017) might be considered outliers in comparison to values from previous years. Furthermore, the plot of the electric load in figure 4.2 illustrates the presence of important variations of the load over short periods of time (only a few hours), hence the need to investigate these values with local methods.

In this study, two methods were selected to detect outliers, viz. the IQR method (global) and the local outlier factor (LOF) method (local).



Figure 4.3: Box and whisker plot

4.3.2.1.1 The IQR Method

This method is based on the box and whisker plot introduced by John W. Tukey in 1969. Basically, this plot uses quartiles to represent the shape of the data set. It permits visualizing the skewness, the central tendency (median) and the spread (first and third quartiles), and puts in evidence the presence of outliers if they exist. The IQR method relies on two types of fences for the detection of outliers:

• Inner fences: defined at a distance of 1.5 IQR below the first quartile (lower inner fence) and above the third quartile (upper inner fence).

• Outer fences: defined at a distance of 3 IQR below the first quartile (lower outer fence) and above the third quartile (upper outer fence).

Considering these fences, outliers are then divided into two categories:

• Observations that lie between the inner and outer fences are possible outliers; they are called mild outliers (red dots in figure 4.3).

• Observations that lie beyond the outer fences have a high probability of being outliers; these points are called extreme outliers (black dots in figure 4.3).

For the implementation of the IQR method, we used the pandas library [50] in the programming language Python to extract the first and third quartiles. An algorithm was then designed to identify and remove only the extreme outliers as defined above.
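A minimal sketch of such an algorithm (ours, not the thesis script; `load` is the assumed name of the hourly series):

import pandas as pd

def extreme_outliers(series):
    # Flag observations beyond Tukey's outer fences (3 IQR)
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return (series < q1 - 3 * iqr) | (series > q3 + 3 * iqr)

# Applied year by year, as in the thesis:
# flags = pd.concat(extreme_outliers(s) for _, s in load.groupby(load.index.year))
# cleaned = load[~flags]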


4.3.2.1.2 Local Outlier Factor

The local outlier factor (LOF) algorithm is a density-based method developed by Breunig et al. in 2000 to detect local outliers in data sets. It is based on the concept of local reachability density discussed below.

Considering an observation X in the dataset T, the k-distance of X, noted δk(X), is defined as the distance between this observation and its k-th nearest neighbour. The set of k-nearest neighbours Nk(X) is then defined as:

Nk(X) = {X′ ∈ T | d(X, X′) ≤ δk(X)}    (4.3)

where d(X, X′) is the distance between X and X′.

In order to reduce the statistical fluctuations of d(X, X′), Breunig et al. introduced the concept of reachability distance, given by [9]:

ηk(X′, X) = max{δk(X), d(X, X′)}    (4.4)

The local reachability density Θk of X is then defined as the inverse of the average reachability distance between X and its k-nearest neighbours:

Θk(X) = 1 / ( ∑_{X′∈Nk(X)} ηk(X′, X) / card(Nk(X)) )    (4.5)

The LOF method consists in comparing the local reachability density of an observation to the local reachability densities of its k-nearest neighbours. For each observation X, a coefficient called the local outlier factor and noted Ωk(X) is computed. This factor corresponds to the degree to which X is an outlier [9]:

Ωk(X) = ( ∑_{X′∈Nk(X)} Θk(X′) / card(Nk(X)) ) × ( ∑_{X′∈Nk(X)} ηk(X′, X) / card(Nk(X)) )    (4.6)

Data points with a significantly high local outlier factor are considered outliers [45]. For our load time series, the LOF method was implemented using the scikit-learn package in Python [59], considering the Euclidean distance defined as:

d(x, y) = ( ∑_{i=1}^{n} |x_i − y_i|² )^{1/2},  for x, y ∈ R^n    (4.7)

After several simulations and examination of the results, k was set to 24.
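A sketch of this detection step with scikit-learn (our illustration; `values` stands for the 1-D array of load observations of one year):

import numpy as np
from sklearn.neighbors import LocalOutlierFactor

X = np.asarray(values, dtype=float).reshape(-1, 1)   # one feature: the load
lof = LocalOutlierFactor(n_neighbors=24, metric="euclidean")
labels = lof.fit_predict(X)                          # -1 for outliers, 1 for inliers
outliers = labels == -1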


4.3.2.2 Missing Data Imputation

Despite the technological advancements in the control and monitoring of electric networks, utilities still face several challenges in metering the load. It is frequent that load data sets present gaps in their observations, ranging from a few minutes to a few hours. These gaps are often attributed to failures in the metering and transmission equipment. The data sets retrieved from STEG were no exception to this trend.

Furthermore, removing outliers increases the number of missing values. Since our aim is to forecast new values of the load based on the previous observations, the missing values need to be filled in.

Different techniques have been proposed for missing data imputation, such as the k-nearest neighbours (KNN) method, clustering methods [26] or simply interpolation [41]. In our case, we used a first-order spline interpolation method to fill in the gaps in the load time series.
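With pandas, this imputation can be expressed as follows (sketch; `load` is assumed to be a Series with NaN at the missing hours, and the scipy-backed method is our choice for a first-order spline):

# First-order spline interpolation of the gaps (requires scipy)
load_filled = load.interpolate(method="slinear")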

4.3.3 Data Normalization

A common practice in ML modelling is to normalize the data [10]. This transformation can significantly improve the efficiency of ML models and prevent numerical difficulties during the calculation [35]. Normalization consists in scaling the data set's observations X_t to fit in the range [0, 1]. The normalized observation X′_t is calculated using the following equation:

X′_t = (X_t − min) / (max − min)    (4.8)

where max and min represent the maximum and minimum values of the data set, respectively.
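A minimal sketch of this transformation and its inverse (ours; the inverse is needed later to scale the forecasts back to MW):

import numpy as np

def normalize(x):
    # Min-max scaling to [0, 1] (equation 4.8); keep min/max to invert later
    x = np.asarray(x, dtype=float)
    return (x - x.min()) / (x.max() - x.min()), x.min(), x.max()

def scale_back(x_scaled, x_min, x_max):
    # Inverse transformation, back to the original units
    return x_scaled * (x_max - x_min) + x_min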

4.4 Model Selection

The accuracy of the forecast depends essentially on the adequate selection of the model parameters. Different techniques and methodologies have been proposed for the design and implementation of SVM and ANN. As the selection process can be time-demanding and is often subject to trial and error, one needs to rely on similar research projects and publications, at least to define the search boundaries.

4.4.1 ν-Support Vector Regression

The selection and implementation of the ν-SVR model were conducted in Python using the library LIBSVM [14]. The choice of this library was driven


by its good performance in modelling and forecasting the electric load. Indeed, LIBSVM was the winner of the EUNITE worldwide network competition on electricity load prediction in 2001.

The first step in the selection of any SVM model is to choose the appropriate kernel function. In LIBSVM's tutorial, one can find a set of empirical rules to do so. For instance, when the number of features is small in comparison to the cardinality of the training set, non-linear kernels are used to map the observations to higher dimensional spaces, and the RBF kernel is often a good choice for such problems [35]. In our case, the training dataset comprises thousands of instances and we have only eight features (previous load and exogenous factors); thus we chose the RBF kernel for our model.

The next step consists in defining the appropriate kernel and model parameters. This process is called hyperparameter optimization. The common practice at this level is to adopt a grid search approach, which consists in performing an exhaustive search through a subset of the parameter space. The choice of the optimal parameters is guided by a well-defined metric which evaluates the model performance for each combination of the considered hyperparameters.

Since we are using the ν-SVR model with the RBF kernel, the parameters to be tuned are the cost of errors C, the fraction of errors ν and the RBF parameter γ [14]. Finding the best combination of these parameters can be time-consuming, especially when the training set is big. In [16], it is stipulated that the search process can be improved by first testing random values of these parameters in order to reduce the search space. After testing different values of C, ν and γ, we found it sufficient to define the search space for our problem as indicated in table 4.1.

Table 4.1: Grid search intervals for the ν-SVR model.

Hyperparameter    Interval
γ                 ]0, 5]
ν                 ]0, 1]
C                 ]0, 10]

For each combination of the aforementioned parameters, the performance of the model was evaluated through a training and testing procedure, and the combination that gave the lowest error between the output of the model and the actual values was retained for forecasting. The evaluation process is discussed in more detail in section 4.5.
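The grid search can be sketched as follows (our illustration using a coarse grid over the intervals of table 4.1; scikit-learn's NuSVR wraps LIBSVM, whereas the thesis used the LIBSVM interface directly, and X_train, y_train, X_test, y_test are assumed to be prepared beforehand):

import numpy as np
from sklearn.svm import NuSVR

best_score, best_params = -np.inf, None
for gamma in np.linspace(0.5, 5.0, 10):
    for nu in np.linspace(0.1, 1.0, 10):
        for C in np.linspace(1.0, 10.0, 10):
            model = NuSVR(kernel="rbf", gamma=gamma, nu=nu, C=C)
            model.fit(X_train, y_train)
            score = model.score(X_test, y_test)   # coefficient of determination R^2
            if score > best_score:
                best_score, best_params = score, (gamma, nu, C)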


4.4.2 Recurrent Neural Networks

The design of an artificial neural network consists mainly in defining the following parameters:

• The number of neurons in each layer;

• The number of hidden layers;

• The activation function for the hidden and output layers.

Defining the number of neurons in the input and output layers is relatively simple, since it is determined by the dimension of the input and desired output vectors, respectively.

For the number of hidden layers, as mentioned in [29], it is actually rare to need more than one, since a single hidden layer is usually "enough to approximate any function that contains a continuous mapping from one finite space to another". Thus, in our design we considered only one hidden layer.

Concerning the hidden neurons, given their influence on the stability of ANN models, many research projects have been conducted in order to provide a methodology for identifying their number [63]. In our design, we used the geometric pyramid rule proposed by Masters. It stipulates that the number of hidden neurons in a three-layer neural network can be obtained using the following equation [49]:

N_hid = √(N_in · N_out)    (4.9)

where N_in, N_hid and N_out represent the number of neurons in the input, hidden and output layers, respectively.

Concerning the choice of the activation function for the hidden and output neurons, several trials were made using different functions, and it was found that the sigmoid function (equation 3.2) is the most suitable choice for our application.

The design and the implementation of the chosen RNN model were performed using the pybrain library in Python [61].
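As a sketch only (our illustration; the exact training script with pybrain's dataset and trainer classes is not reproduced here), such a topology can be assembled with pybrain's buildNetwork shortcut:

from pybrain.structure import SigmoidLayer
from pybrain.tools.shortcuts import buildNetwork

# 8 input neurons (load and exogenous factors), 3 hidden neurons, 1 output;
# recurrent=True adds the feedback connection on the hidden layer
net = buildNetwork(8, 3, 1, hiddenclass=SigmoidLayer, outclass=SigmoidLayer,
                   recurrent=True)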

4.5 Training and Testing

In order to validate an ML model, we need to be able to evaluate its performance on unseen data. For large data sets, the ideal strategy is to divide the data set into two folds called the training and test sets [11]. While


the former is used to fit the model, the latter serves to test its capacity to generalize to unseen observations.

Once the model has learnt from the training set the appropriate function to approximate the data, it is used to predict new values using inputs from the test set. The predicted values are then compared to the expected values using a well-defined performance metric. Based on the metric's results, the training process is repeated in order to improve the model accuracy until it converges to a minimum error.

There is no rule on how to choose the proportions to consider for training and testing [11]. However, one should keep in mind that ML models perform better when they are exposed to more training instances. We therefore divided our data sets in such a way that the models are trained on the first four years, i.e. from 2013 to 2016, and tested on the values of the last year, 2017 (80% of the data for training and 20% for testing).
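With a timestamp-indexed DataFrame, this chronological split is straightforward (sketch; `data` is an assumed name):

# Train on 2013-2016, test on 2017
train = data.loc["2013":"2016"]
test = data.loc["2017"]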

4.6 Forecasting

After tuning the model's parameters and validating its training, the next step is to forecast out-of-sample values. This procedure is considered more trustworthy for checking the performance of the model, since it permits assessing the impact of uncertainty propagation [24]. Indeed, during the testing phase, the inputs are always selected from the original data set, hence each instance has its own error, independent of the previous time steps. In practice, however, the forecast is performed in such a way that each new value is used to predict the next ones; thus the error in one time step will affect the results of the subsequent ones. To generate new forecast values, we adopted the method presented in figure 4.4.


[Flowchart: Start → Initialization (input the load and exogenous factors at H = −1; set counter i = 0; set max-iterations) → Forecast: estimate the load for the next hour (H + 1) → Append the forecast list → Increment i → if i ≤ max-iterations, concatenate the new value with the exogenous factors of the same time index and forecast again; otherwise scale back the forecast list → End.]

Figure 4.4: Flowchart of the out-of-sample forecasting algorithm.
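The loop of figure 4.4 can be sketched as follows (our illustration; `model` is a fitted regressor with a scikit-learn-style predict method, inputs are already scaled to [0, 1], and scale_back is the inverse transformation from the normalization sketch in section 4.3.3):

import numpy as np

def recursive_forecast(model, load_scaled, exog_scaled, max_iterations):
    # Each prediction is fed back as the load input of the next hour
    forecasts = []
    load = load_scaled                                 # load at H = -1
    for i in range(max_iterations):
        x = np.concatenate(([load], exog_scaled[i]))   # load + exogenous factors
        load = float(model.predict(x.reshape(1, -1))[0])
        forecasts.append(load)
    return np.array(forecasts)                         # still in [0, 1]

The returned values are then scaled back to MW with the inverse of equation 4.8.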


4.7 Performance Measures

It is of crucial importance to monitor the performance of the forecasting model at the different stages of selection and design. Different metrics have been proposed in the literature to estimate the forecast accuracy and compare different model settings. However, the choice of metrics is sometimes limited by the specification of the used package or library. For instance, LIBSVM uses the coefficient of determination R² to assess the accuracy of the model. This coefficient ranges from 0 to 1, and higher values of R² usually indicate better accuracy. It is defined as:

R² = 1 − ( ∑_{i=1}^{n} (t_i − y_i)² ) / ( ∑_{i=1}^{n} (t_i − t̄)² )    (4.10)

where t_i is the target or expected value, y_i is the predicted value and t̄ is the mean value of the target set.

On the other hand, the BPTT algorithm in pybrain uses the mean squared error (MSE) to adjust the weights and biases of the RNN. The MSE is given by:

MSE = (1/n) ∑_{i=1}^{n} (t_i − y_i)²    (4.11)

As it is recommended to use different metrics to check the out-of-sample forecast accuracy, and after reviewing various publications on similar tasks, we chose to include the three following metrics:

• The mean absolute percentage error (MAPE): this metric is scale-independent, which makes it a popular choice in ML problems [15, 34, 72]. It is defined as follows:

MAPE = ( (1/n) ∑_{i=1}^{n} |t_i − y_i| / |t_i| ) × 100    (4.12)

• The mean absolute error (MAE): it measures the average magnitude of the error with no consideration of its direction. It has an advantage over the MSE in being expressed in the same unit as the data. The equation of this metric is given below:

MAE = (1/n) ∑_{i=1}^{n} |t_i − y_i|    (4.13)

• The root mean square error (RMSE): it is defined as the standard deviation of the errors between the target and predicted values. The RMSE is simply the square root of the MSE; thus it is calculated using the following equation:

RMSE = √( (1/n) ∑_{i=1}^{n} (t_i − y_i)² )    (4.14)
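These three metrics are straightforward to implement (a sketch of ours, with t the targets and y the predictions):

import numpy as np

def mape(t, y):
    # Mean absolute percentage error (equation 4.12), in percent
    t, y = np.asarray(t, float), np.asarray(y, float)
    return np.mean(np.abs(t - y) / np.abs(t)) * 100

def mae(t, y):
    # Mean absolute error (equation 4.13), in the units of the data (MW)
    return np.mean(np.abs(np.asarray(t, float) - np.asarray(y, float)))

def rmse(t, y):
    # Root mean square error (equation 4.14)
    return np.sqrt(np.mean((np.asarray(t, float) - np.asarray(y, float)) ** 2))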


Chapter 5

Performance Analysis

‘The probable is what usuallyhappens.’

Aristotle

In this chapter, we present the results of the implementation of the previously described methodology. After preprocessing the data and training the ν-SVR and RNN models, we performed a load forecast for one week. The results of this forecast are presented and discussed in section 5.4.

5.1 Data Preprocessing Results

5.1.1 Descriptive Statistics

Using the pandas library in Python, we were able to extract the descriptive statistics of the time series of the electrical load. These statistics, presented in table 5.1, gave us great insight into the load data set. For instance, it was noted that about 1% of the observations are missing and need to be filled in. Additionally, from the values of Q1 and Q3, we can see that 50% of the electric demand ranges between 80 MW and 110 MW. This information is important for electric utilities to estimate the base load that needs to be covered at all times and to schedule the production units accordingly.

On the other hand, by running this type of analysis on different subsets of the data, we determined the values of the first and third quartiles for each year, which permitted us to calculate the yearly IQRs. These values will be used later to identify outliers.


Table 5.1: Descriptive statistics for the time series of the electrical load.

Statistic    Result
count        43 335
µ            95.8
σ²           19.5
minimum      44.9
Q1           79.8
X            97.8
Q3           109.8
maximum      168.5

5.1.2 Outliers

As mentioned in section 4.3.2.1, two methods were applied to detect outliers, namely the IQR and LOF methods. The results of the implementation of both methods are given in table 5.2. In total, 96 data points were labelled as outliers, which amounts to only 0.2% of the observations in the load data set.

Table 5.2: Number of outliers per year according to the IQR and LOF methods.

Year     IQR method    LOF method    Total
2013     12            9             21
2014     26            14            40
2015     0             10            10
2016     10            0             10
2017     0             15            15
Total    48            48            96

As shown in figure 5.1, these outliers do not conform to the expected behaviour of the electric load. For instance, the drop registered in the load in 2014 corresponds to a failure in the power system that occurred on August 31st of that year and caused a major power outage. Although this kind of event can strongly affect the stability of the electric grid, it remains an isolated case in the context of Tunisia; hence the importance of removing such data points.


Figure 5.1: Detected outliers.


Special attention must be paid while removing outliers to avoid changing the data set distribution. In our case, the removal of the detected outliers did not affect the central tendency or the variability of the load distribution. Indeed, the mean value, the median as well as the standard deviation remained practically unchanged.

5.2 Load Characteristics

After the imputation of the removed outliers as well as the originally missing data points, we finally obtain a cleaned time series of the electric load. Since we are interested in predicting the load in the short term, it is the daily profile of the load that interests us the most. For this analysis, we considered the subset of data covering the year 2017. From figure 5.2, it is clear that the daily load profile in Bizerte exhibits a seasonal behaviour. In fact, during summer (June, July and August), the electric demand is higher than during the rest of the year. Furthermore, the duration of the peak demand is extended during this season in comparison to the other seasons. This behaviour is mainly influenced by weather conditions, especially the temperature; hence the importance of introducing the temperature in the forecasting process. Furthermore, by including calendar data, our aim is to provide the forecasting model with enough information to detect the seasonal behaviour of the load.


Figure 5.2: Daily profile of the load in Bizerte for 2017

5.3 Selected Models

5.3.1 Selected ν-Support Vector Regression Model

In the search for the most accurate ν-SVR model for forecasting the load in Bizerte, a grid search algorithm was developed in Python to test different combinations of ν, γ and C within the search intervals given in table 4.1. For each combination, a score is calculated for the corresponding model. This score represents the coefficient of determination R², indicating the accuracy with which the model predicts the electric load in the test dataset. A contour plot of model accuracy is obtained for each value of ν, showing the region where the model performs best. An example of these plots is given in figure 5.3, which illustrates the accuracy of the models with ν = 0.8.

Figure 5.3: Contour plot of the coefficient of determination (R2) for ν = 0.8.


Another important factor to take into consideration for the selection of the model parameters is the computation time. In our experiments, some complex models performed slightly better than simpler ones; however, it was noted that the running time increases considerably with model complexity.

After many tests, the best trade-off between model accuracy and the required computation time was found for the following values:

Table 5.3: Best configuration of the SVR model.

Hyperparameter    Value
γ                 5
ν                 0.8
C                 7.743

The ν-SVR model with the above configuration required about 90 minutes of computation to converge and achieved an accuracy on the test data set of R² = 0.96.

5.3.2 Selected Recurrent Neural Networks

Concerning the design of the RNN model, given that the input vector contains 8 features, namely the current values of the load and the exogenous factors, and that our target is to forecast only the next value of the load, the numbers of input and output neurons were fixed to 8 and 1, respectively. Using equation 4.9, the number of hidden neurons was set to 3. The topology of the designed RNN is illustrated in figure 5.4.

To train the model, we used an online BPTT algorithm with a learning rate of β = 0.01. Initially, the number of epochs was set to 50; however, the training was stopped after 20 epochs with no further improvement in the error, as shown in figure 5.5. The trained RNN model presented a final MSE on the scaled test dataset equal to 2 × 10⁻³.


Figure 5.4: Architecture of the selected recurrent neural network: the 8 input neurons receive L(t), T(t), D(t), M(t), Y(t), H(t), DOW(t) and TOD(t), feed 3 hidden neurons, and the single output neuron produces L(t+1).

Figure 5.5: RNN training and test results


5.4 Forecasting Results

After completing their learning process, the ν-SVR and RNN models are ready for STLF. In order to check the stability of these models, we propose to test their out-of-sample forecast accuracy by predicting the hourly load peaks for one week according to the diagram given in figure 4.4 (max-iterations = 168 observations).

5.4.1 Comparison of ν-SVR and RNN Forecasts

The load curves predicted by the selected ν-SVR and RNN are illustrated in figure 5.6. It shows that both models gave satisfactory results for the first 48 hours. However, unlike for the ν-SVR model, the accuracy of the RNN predictions degrades as we move forward in time. This illustrates that our RNN is more sensitive to the propagation of uncertainty than the ν-SVR.

Figure 5.6: One week-ahead load forecasting using ν-SVR and RNN.

Another important observation can be drawn from the predicted load curves: both models predict accurately the hour of the daily load peak. This aspect is highly important for electric utilities because of its influence on electricity prices and DSM activities. Indeed, having good knowledge of the period of peak load allows utilities to


schedule their cheapest means of production to cover it, thus reducing the cost of electricity production. In addition, decisions regarding peak shaving and the discharge schedule of storage units are based on the forecast of the daily peak period.

As mentioned in section 4.7, three performance metrics were used to benchmark the accuracy of the model predictions, namely RMSE, MAE and MAPE. The results of these metrics are given in table 5.4 for one day-ahead and one week-ahead load forecasts.

Table 5.4: Forecast accuracy of the ν-SVR and RNN models.

Metric        Day-ahead forecast       Week-ahead forecast
              ν-SVR       RNN          ν-SVR       RNN
RMSE (MW)     3.9         7.2          5.3         9.4
MAE (MW)      2.9         5.9          3.6         7.6
MAPE (%)      3.0         6.2          3.5         8.2

The results shown above confirm our initial observations and demonstrate the superiority of the ν-SVR model over the RNN. In fact, the ν-SVR model achieved a MAPE of 3% and 3.5% for the day-ahead and week-ahead load forecasts, respectively, compared to a MAPE of 6.2% and 8.2% achieved by the RNN for the same forecast horizons. The results of the RMSE and MAE also testify that the ν-SVR outperforms the RNN for the considered forecast horizons. It is also important to note that the average absolute error of the ν-SVR predictions does not exceed 4 MW for the week-ahead load forecast. Covering this deviation can easily be handled by STEG, since most of its power plants are gas turbines, which are flexible generation units.

5.4.2 Combined Model Forecasts

As discussed in section 1.2.2.4, by combining the results of different models, one might expect better forecast accuracy. In this thesis, after a close analysis of the predicted load curves, we considered a combined model that takes the arithmetic mean of the ν-SVR and RNN forecasts at each hour. The load curve generated by the combined model is presented in figure 5.7. It is clear from this figure that there is an improvement in the accuracy of the predictions for some periods of time, e.g. the load forecasts for the fifth day. Nevertheless, for a better assessment of the combined model's performance, and in order to compare its accuracy with that of the previous models, we turn


Figure 5.7: One week ahead load forecast using combined ν-SVR and RNN.

to the performance metrics.
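The combination itself is a single averaging operation (sketch; y_svr and y_rnn denote the two hourly forecast arrays):

# Combined model: hour-by-hour arithmetic mean of the two forecasts
y_combined = (y_svr + y_rnn) / 2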

Overall, figure 5.8 shows that averaging the forecasts of the ν-SVR model and the RNN slightly improves the day-ahead forecast accuracy. However, for the week-ahead load forecast, the ν-SVR model remains the best choice for our application. Indeed, the latter presents a MAPE of 3.5% compared to 4.2% for the combined model.

(a) One day-ahead load forecasting. (b) One week-ahead load forecasting.

Figure 5.8: Forecast accuracy of the STLF models.


5.4.3 Discussion

The superiority of the ν-SVR model over the RNN in forecasting the electrical load in Bizerte in the short term can be attributed to two main limiting factors in the training algorithm of the latter.

The first limitation is the vanishing gradient problem, first discussed by Hochreiter in [32]. This problem reflects the incapacity of RNN to accurately capture information in a long sequence of input data. Indeed, the RNN learns by computing and back-propagating the gradient of the error with regard to the weights and biases of the network. An increase in the size of the training data set means that the chain rule in equation 3.16 gets longer and longer in order to cover all the time steps. Since the weights and biases are usually small numbers in the range [0,1], as we move forward in the training sequence the gradients of the errors of the first time steps become increasingly small, which causes the RNN to remember only the errors of the last time steps.

The second limitation is that training an RNN with the gradient descent method is not guaranteed to reach the global minimum of the error. In contrast, since SVM are designed to solve a convex QPP, they always provide the optimal solution.

Concerning the results of the combined model, although better accuracy was achieved for the day-ahead forecast, it is recommended to consider only the forecasts of the ν-SVR model. This recommendation is based on the conclusions of Borges et al. presented in [7], where it is stipulated that "an error-respective linear combinator suffices to improve the forecast iff there is not a forecasting algorithm that clearly outperforms the rest". In our case, the ν-SVR model outperforms the RNN for both day-ahead and week-ahead forecasting.


Chapter 6

Conclusion and Future Work

“One worthwhile task carried toa successful conclusion is worthhalf-a-hundred half-finishedtasks.”

Malcolm S. Forbes

In order to support Bizerte in its transition towards a sustainable energy system, this thesis sought to provide a reliable STLF model whose results could be utilised for better management of the operations of the electrical system.

To achieve this task, historical data for the electric load in Bizerte covering the period from 01/01/2013 to 31/12/2017 were collected. These data were then preprocessed, inter alia, to detect and remove outliers using the IQR and LOF methods. The preprocessed load time series was then combined with other data sets comprising the hourly values of the temperature and the characteristics of the calendar. The resulting data set was used to optimize the design of a ν-SVR model and an RNN.

Following a grid search approach, the optimal configuration for the ν-SVR model was found for γ, C and ν equal to 5, 7.743 and 0.8, respectively. Concerning the RNN, it comprises three layers with a total of 12 neurons (8 input neurons, 3 hidden neurons and 1 output neuron). Both models were trained on the data covering the first four years and tested on the observations of the last year (2017).

Good results were achieved on the test set. Nevertheless, for a better assessment of the accuracy of the models, we tested both models on one day-ahead


and one week-ahead out-of-sample load forecast. The results of the performance metrics show that the ν-SVR model outperforms the RNN for both forecast horizons. Indeed, for the day-ahead load forecast, the ν-SVR model achieved a MAPE of 3% in comparison to 6.2% for the RNN. Similarly, the ν-SVR model presented a lower MAPE than the RNN for the week-ahead load forecast (3.5% and 8.2%, respectively).

The out-of-sample forecast results illustrate the superiority of the SRM principle applied by SVM over the ERM principle used by ANN in generalizing to unseen data. In fact, the SRM principle tends to minimize an upper bound on the generalization error instead of focusing on the training error. This usually leads to better forecasting results.

The under-performance of the RNN model can also be attributed to the vanishing gradient problem, which prevents it from learning from large data sets. Furthermore, when the optimisation problem is not convex, the BPTT learning algorithm of the RNN may get trapped in local minima during gradient descent, whereas SVM are designed to solve a convex QPP and hence always converge to the global minimum of the loss function.

An attempt was made to improve the accuracy of the load forecasts by averaging the predictions of the ν-SVR and RNN models. This resulted in a slight improvement in the accuracy of the day-ahead load forecast (MAPE = 2.7% for the combined model). However, for the week-ahead load forecast, the ν-SVR model remains more accurate than the combined model, which achieved a MAPE of 4.2% for this forecast horizon.

In conclusion, the ν-SVR model outperforms the RNN for day-ahead and week-ahead load forecasting, and although averaging the predictions of both models may improve the overall forecast accuracy for certain periods of time, we recommend considering only the forecasts generated by the ν-SVR model, following the conclusions of Borges et al. in [7].

Future Work

The design and training of the ν-SVR and RNN models were time-consuming and computationally expensive, which can be explained in part by the high number of observations used in our experiments. A future study might focus on determining the optimal size of the training and test sets for similar tasks. It is also recommended to test other RNN models for STLF, such as long short-term memory (LSTM) neural networks, which were introduced to overcome the vanishing gradient problem. It would also be interesting to estimate the expected energy savings from the implementation of such STLF models.


Acronyms

ν-SVR    ν-support vector regression
AI       artificial intelligence
ANN      artificial neural networks
ARIMA    autoregressive integrated moving-average
BP       backpropagation
BPTT     backpropagation through time
DOW      day of week
DR       delta rule
DSM      demand side management
EMS      energy management systems
ERM      empirical risk minimization
EUNITE   European network on intelligent technologies for smart adaptive systems
FL       fuzzy logic
GA       genetic algorithms
IQR      interquartile range
KBES     knowledge based expert systems
KKT      Karush-Kuhn-Tucker conditions
KNN      k-nearest neighbours
LOF      local outlier factor
LS-SVR   least squares support vector regression
LSTM     long short term memory
LTLF     long term load forecasting
MAE      mean absolute error
MAPE     mean absolute percentage error
ML       machine learning
MLP      multilayer perceptron
MSE      mean squared error
MTLF     medium term load forecasting
QPP      quadratic programming problem
RBF      radial basis function
ReLU     rectified linear unit
RMSE     root mean square error
RNN      recurrent neural networks
SARIMA   seasonal autoregressive integrated moving-average
SLT      statistical learning theory
SRM      structural risk minimization
STEG     Société Tunisienne de l'Electricité et du Gaz
STLF     short term load forecasting
SVC      support vector classification
SVM      support vector machines
SVR      support vector regression
TOD      type of day
VSTLF    very short term load forecasting


Nomenclature

Roman Symbols

X median value

C cost of error

D day

d distance

H hour

K kernel function

L loss function

M month

Nhid number of hidden neurons

Nin number of input neurons

Nout number of output neurons

X observation in a data set

Y year

Greek Symbols

α Lagrange multiplier

β learning rate

∆ margin

δk k−distance

ε error threshold for ε- SVR


ηk k-reachability distance

γ inverse of the standard deviation of the radial basis function

µ mean value

ν upper bound on the fraction of errors for ν-SVR

Ωk local outlier factor

φ transformation function

σ standard deviation

Θk local reachability density

ξ slack variable

Sets

R real numbers

Nk set of k neighbours

T data set


Bibliography

[1] Mohammadreza Afshin, Alireza Sadeghian, and Kaamran Raahemifar. “On efficient tuning of LS-SVM hyper-parameters in short-term load forecasting: A comparative study”. In: Power Engineering Society General Meeting, 2007. IEEE. 2007, pp. 1–6.

[2] N Amral, CS Ozveren, and D King. “Short term load forecasting using multiple linear regression”. In: Universities Power Engineering Conference, 2007. UPEC 2007. 42nd International. IEEE. 2007, pp. 1192–1198.

[3] Saeed M Badran. “Neural network integrated with regression methods to forecast electrical load”. In: (2012).

[4] Christoph Berger. Perceptrons - the most basic form of a neural network. 2016. url: https://appliedgo.net/perceptron/.

[5] Christopher M. Bishop. Pattern recognition and machine learning. Information science and statistics. New York: Springer, 2006. isbn: 978-0387-31073-2.

[6] Bizerte Smart City. url: http://www.bizertesmartcity.com.

[7] Cruz E Borges, Yoseba K Penya, and Ivan Fernandez. “Optimal combined short-term building load forecasting”. In: Innovative Smart Grid Technologies Asia (ISGT) (2011), pp. 1–7.

[8] GEP Box and GM Jenkins. Time Series Analysis: Forecasting and Control. San Francisco: Holden-Day, 1970.

[9] Markus M Breunig et al. “LOF: identifying density-based local outliers”. In: ACM SIGMOD Record. Vol. 29. 2. ACM. 2000, pp. 93–104.


[10] Jason Brownlee. How To Prepare Your Data For Machine Learning in Python with Scikit-Learn. url: https://machinelearningmastery.com/prepare-data-machine-learning-python-scikit-learn/.

[11] Jason Brownlee. Machine Learning Mastery With Python: Understand Your Data, Create Accurate Models and Work Projects End-To-End. 2016.

[12] Jason Brownlee. Supervised and Unsupervised Machine Learning Algorithms. url: https://machinelearningmastery.com/.

[13] John A. Bullinaria. Recurrent Neural Networks. 2015. url: http://www.cs.bham.ac.uk/~jxb/INC/l12.pdf.

[14] Chih-Chung Chang and Chih-Jen Lin. “LIBSVM: A library for support vector machines”. In: ACM Transactions on Intelligent Systems and Technology 2 (3 2011). Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm, 27:1–27:27.

[15] Jeng-Fung Chen, Shih-Kuei Lo, and Quang Hung Do. “Forecasting monthly electricity demands: an application of neural networks trained by heuristic algorithms”. In: Information 8.1 (2017), p. 31.

[16] Bo-Juen Chen. “Two Applications of Support Vector Machine: Load Forecasting and Bank Clients’ Modeling”. PhD thesis. National Taiwan University, 2003.

[17] Bo-Juen Chen, Ming-Wei Chang, et al. “Load forecasting using support vector machines: A study on EUNITE competition 2001”. In: IEEE Transactions on Power Systems 19.4 (2004), pp. 1821–1830.

[18] Kunjin Chen et al. “Short-term Load Forecasting with Deep Residual Networks”. In: PP (2018).

[19] Vladimir Cherkassky and Filip Mulier. Learning from Data. Hoboken, NJ, USA: John Wiley & Sons, Inc, 2007. isbn: 9780470140529. doi: 10.1002/9780470140529.

[20] Thomas M. Cover. “Geometrical and Statistical Properties of Systems of Linear Inequalities with Applications in Pattern Recognition”. In: IEEE Transactions on Electronic Computers EC-14.3 (1965), pp. 326–334. issn: 0367-7508. doi: 10.1109/PGEC.1965.264137.


[21] Vasudev Dehalwar et al. “Electricity load forecasting for urban area using weather forecast information”. In: 2016 IEEE International Conference on Power and Renewable Energy (ICPRE). IEEE. 2016, pp. 355–359.

[22] Jeffrey L. Elman. “Finding Structure in Time”. In: Cognitive Science 14.2 (1990), pp. 179–211. issn: 03640213. doi: 10.1207/s15516709cog1402_1.

[23] Seyda Ertekin. “Learning in extreme conditions: Online and active learning with massive, imbalanced and noisy data”. In: (2009).

[24] Eurostat. Glossary: In-sample vs. out-of-sample forecasts. url: http://ec.europa.eu/eurostat/statistics-explained/index.php/Glossary:In-sample_vs._out-of-sample_forecasts.

[25] Abdollah Kavousi Fard and Mohammad-Reza Akbari-Zadeh. “A hybrid method based on wavelet, ANN and ARIMA model for short-term load forecasting”. In: Journal of Experimental & Theoretical Artificial Intelligence 26.2 (2014), pp. 167–182.

[26] G Grigoras et al. “Missing data treatment of the load profiles in distribution networks”. In: PowerTech, 2009 IEEE Bucharest. IEEE. 2009, pp. 1–5.

[27] Martin T. Hagan, Howard B. Demuth, and Mark H. Beale. Neural network design. 1st ed. Boston: PWS Pub, 1996. isbn: 9780534952594.

[28] Simon S. Haykin. Neural networks and learning machines. 3rd ed. New York: Prentice Hall, 2009. isbn: 9780131471399.

[29] Jeff Heaton. Introduction to neural networks with Java. Heaton Research, Inc., 2008.

[30] G. Heinemann, D. Nordman, and E. Plant. “The Relationship Between Summer Weather and Summer Loads - A Regression Analysis”. In: IEEE Transactions on Power Apparatus and Systems PAS-85.11 (1966), pp. 1144–1154. issn: 0018-9510. doi: 10.1109/TPAS.1966.291535.


[31] E. T. H. Heng, D. Srinivasan, and A. C. Liew. “Short term load forecasting using genetic algorithm and neural networks”. In: Energy Management and Power Delivery, 1998. Proceedings of EMPD ’98. 1998 International Conference on. Vol. 2. Mar. 1998, 576–581 vol. 2. doi: 10.1109/EMPD.1998.702749.

[32] Sepp Hochreiter. “Untersuchungen zu dynamischen neuronalen Netzen”. Diploma thesis. Technische Universität München, 1991.

[33] Tao Hong. Short term electric load forecasting. North Carolina State University, 2010.

[34] Wei-Chiang Hong. “Electric load forecasting by support vector model”. In: Applied Mathematical Modelling 33.5 (2009), pp. 2444–2454.

[35] Chih-Wei Hsu, Chih-Chung Chang, Chih-Jen Lin, et al. “A practical guide to support vector classification”. In: (2003).

[36] B. ul Islam et al. “A hybrid neuro-genetic approach for STLF: A comparative analysis of model parameter variations”. In: 2014 IEEE 8th International Power Engineering and Optimization Conference (PEOCO2014). Mar. 2014, pp. 526–531. doi: 10.1109/PEOCO.2014.6814485.

[37] Vikramaditya Jakkula. “Tutorial on support vector machine (SVM)”. In: School of EECS, Washington State University 37 (2006).

[38] Xin Jin et al. “Application of a Hybrid Model to Short-Term Load Forecasting”. In: Proceedings - 2010 International Conference of Information Science and Management Engineering, ISME 2010 1 (2010). doi: 10.1109/ISME.2010.122.

[39] Medha Joshi and Rajiv Singh. “Short-term load forecasting approaches: A review”. In: International Journal of Recent Engineering Research and Development (IJRERD) 01.3 (2017), pp. 9–17.

[40] G Juberias et al. “A new ARIMA model for hourly load forecasting”. In: Transmission and Distribution Conference, 1999 IEEE. Vol. 1. IEEE. 1999, pp. 314–319.


[41] Heikki Junninen et al. “Methods for imputation of missing values in air quality data sets”. In: Atmospheric Environment 38.18 (2004), pp. 2895–2907.

[42] SB Kotsiantis, D Kanellopoulos, and PE Pintelas. “Data preprocessing for supervised leaning”. In: International Journal of Computer Science 1.2 (2006), pp. 111–117.

[43] Rudolf Kruse et al. Computational intelligence: A methodological introduction. London: Springer, 2013. isbn: 978-1-4471-5012-1.

[44] K Liu et al. “Comparison of very short-term load forecasting techniques”. In: IEEE Transactions on Power Systems 11.2 (1996), pp. 877–882.

[45] Local outlier factor. url: https://en.wikipedia.org/wiki/Local_outlier_factor.

[46] Zofia Lukszo, Geert Deconinck, and Margot PC Weijnen. Securing electricity supply in the cyber age: exploring the risks of information and communication technology in tomorrow’s electricity infrastructure. Vol. 15. Springer Science & Business Media, 2009.

[47] Oded Z. Maimon and Lior Rokach. Data mining and knowledge discovery handbook. Ramat-Aviv and Great Britain: Springer, 2005. isbn: 0-387-24435-2.

[48] Francisco Martínez-Álvarez et al. “A Survey on Data Mining Techniques Applied to Electricity-Related Time Series Forecasting”. In: Energies 8 (2015), pp. 13162–13193.

[49] Timothy Masters. Practical neural network recipes in C++. Morgan Kaufmann, 1993.

[50] Wes McKinney. “pandas: a foundational Python library for data analysis and statistics”. In: Python for High Performance and Scientific Computing (2011), pp. 1–9.


[51] Glen Mitchell et al. “A comparison of artificial neural networks and support vector machines for short-term load forecasting using various load types”. In: PowerTech, 2017 IEEE Manchester. IEEE. 2017, pp. 1–4.

[52] Tom M. Mitchell. Machine Learning. McGraw-Hill series in computer science. New York and London: McGraw-Hill, 1997. isbn: 0070428077.

[53] I. Moghram and S. Rahman. “Analysis and evaluation of five short-term load forecasting techniques”. In: IEEE Transactions on Power Systems 4.4 (Nov. 1989), pp. 1484–1491. issn: 0885-8950. doi: 10.1109/59.41700.

[54] M. Mustapha et al. “Classification of electricity load forecasting based on the factors influencing the load consumption and methods used: An overview”. In: 2015 IEEE Conference on Energy Conversion (CENCON). IEEE, Oct. 2015, pp. 442–447. isbn: 978-1-4799-8598-2. doi: 10.1109/CENCON.2015.7409585.

[55] Hongzhan Nie et al. “Hybrid of ARIMA and SVMs for short-term load forecasting”. In: Energy Procedia 16 (2012), pp. 1455–1460.

[56] Dongxiao Niu et al. “Daily load forecasting using support vector machine and case-based reasoning”. In: Industrial Electronics and Applications, 2007. ICIEA 2007. 2nd IEEE Conference on. IEEE. 2007, pp. 1271–1274.

[57] Tapio Pahikkala, Jorma Boberg, and Tapio Salakoski. “Fast n-fold cross-validation for regularized least-squares”. In: Proceedings of the ninth Scandinavian conference on artificial intelligence (SCAI 2006). Espoo. 2006, pp. 83–90.

[58] Dong C Park et al. “Electric load forecasting using an artificial neural network”. In: IEEE Transactions on Power Systems 6.2 (1991), pp. 442–449.

[59] F. Pedregosa et al. “Scikit-learn: Machine Learning in Python”. In: Journal of Machine Learning Research 12 (2011), pp. 2825–2830.


[60] Hossein Javedani Sadaei et al. “Short-term load forecasting using a hybrid model with a refined exponentially weighted fuzzy time series and an improved harmony search”. In: International Journal of Electrical Power & Energy Systems 62 (2014), pp. 118–129.

[61] Tom Schaul et al. “PyBrain”. In: Journal of Machine Learning Research 11 (2010), pp. 743–746.

[62] Martin Sewell. Support Vector Machines (SVMs). url: http://www.svms.org/.

[63] K Gnana Sheela and Subramaniam N Deepa. “Review on methods to fix number of hidden neurons in neural networks”. In: Mathematical Problems in Engineering 2013 (2013).

[64] Alex J Smola and Bernhard Schölkopf. “A tutorial on support vector regression”. In: Statistics and Computing 14.3 (2004), pp. 199–222.

[65] Cesar Souza. Kernel Functions for Machine Learning Applications. url: http://crsouza.com/.

[66] AK Srivastava, Ajay Shekhar Pandey, and Devender Singh. “Short-term load forecasting methods: A review”. In: Emerging Trends in Electrical Electronics & Sustainable Energy Systems (ICETEESES), International Conference on. IEEE. 2016, pp. 130–138.

[67] Statistical model. url: https://en.wikipedia.org/wiki/Statistical_model.

[68] J.A.K. Suykens and J. Vandewalle. “Least Squares Support Vector Machine Classifiers”. In: Neural Processing Letters 9.3 (1999), pp. 293–300. issn: 13704621. doi: 10.1023/A:1018628609742.

[69] Tao Hong and Shu Fan. “Probabilistic electric load forecasting: A tutorial review”. In: International Journal of Forecasting 32.3 (2016), pp. 914–938. issn: 0169-2070. doi: 10.1016/j.ijforecast.2015.11.011. url: http://www.sciencedirect.com/science/article/pii/S0169207015001508.

[70] The new urban agenda. url: http://habitat3.org/the-new-urban-agenda/.


[71] Vladimir N. Vapnik. Statistical learning theory. Adaptive and learning systems for signal processing, communications, and control. New York and Chichester: Wiley, 1998. isbn: 0471030031.

[72] Guoqiang Zhang, B Eddy Patuwo, and Michael Y Hu. “Forecasting with artificial neural networks: The state of the art”. In: International Journal of Forecasting 14.1 (1998), pp. 35–62.

[73] Guoqiang Zhang, B. Eddy Patuwo, and Michael Y. Hu. The state of the art. 1997.

[74] Huiting Zheng, Jiabin Yuan, and Long Chen. “Short-term load forecasting using EMD-LSTM neural networks with a Xgboost algorithm for feature importance evaluation”. In: Energies 10.8 (2017), p. 1168.

[75] Xianghe Zhu and Min Shen. “Based on the ARIMA model with grey theory for short term load forecasting model”. In: Systems and Informatics (ICSAI), 2012 International Conference on. IEEE. 2012, pp. 564–567.
