Filling gaps in time series in urban hydrologylib.ugent.be/fulltxt/RUG01/002/153/908/RUG01-002153908... · 2014. 10. 16. · 1 Filling gaps in time series in urban hydrology Project


Filling gaps in time series in urban hydrology

Project of research – June 2014

ROCHTUS Yannick

Co-supervisors: AUBIN Jean-Baptiste, BERTRAND-KRAJEWSKI Jean-Luc

Département Génie Civil & Urbanisme

Laboratoire de Génie Civil et d'Ingénierie Environnementale (LGCIE)

20 Avenue A. Einstein, 69621 Villeurbanne Cedex, France

[email protected]

ABSTRACT. Many research projects in urban hydrology rely on time series, in particular to gain knowledge of processes and to support modelling. Consequently, the quality and completeness of these time series are essential. However, time series may contain gaps for various reasons. For some applications it is therefore important to fill these gaps and replace the missing values. The goal is to find a method that relies as much as possible on the measured data and as little as possible on theoretical model assumptions. To that end, different methods were studied to identify the most appropriate one. One method was then implemented in Matlab so that its results could be compared with the actual values. It uses the K-means algorithm to form clusters and fills each gap with the median of the corresponding cluster. The results are promising, and the method works equally well for small and large gaps.

KEYWORDS: clustering, filling gaps, K-means, missing values, time series, urban hydrology.

1. Introduction

Since 2004, the LGCIE, within the OTHU project (Field Observatory on Urban Hydrology, see www.othu.org), has monitored stormwater quality in two urban catchments in Lyon. Time series of rainfall, water level, flow velocity and discharge, turbidity (used as a surrogate measurement of TSS, Total Suspended Solids, and COD, Chemical Oxygen Demand), conductivity, pH, temperature, etc. are collected with a time step of 2 minutes. Many research projects rely on such time series, in particular to gain knowledge of processes and to support modelling. Consequently, the quality and completeness of these time series are essential. However, time series may contain gaps for various reasons (maintenance, sensor failure, power failure, human error, rejection after a validation test, etc.). For some applications it is therefore important to fill these gaps and replace the missing values. If gaps are small (a few minutes), solutions are simple and already implemented; interpolation is one of them. However, if gaps extend over hours or days, specific methods need to be developed.

It is important that the method used to fill these gaps relies as much as possible on the measured data and as little as possible on theoretical model assumptions. It is also very important that it provides unbiased estimates. To find the most appropriate method, different methods were therefore studied and their results compared. This was done by creating artificial gaps in the time series, so that the values produced by the various methods could afterwards be compared with the actual values that had purposely been left out, showing how well each method restores them.

On this basis, suitable methods can be selected to fill the real gaps. More than one method may be used, depending on the duration of the gap, since different methods are expected to perform better for different gap lengths. The methods are implemented in Matlab, and this code will be used further in the context of a PhD.


PIRD - GCU/INSAL Rochtus Yannick June 2014


2. Literature review

2.1. Overview of the different methods

Many gap-filling techniques exist, and a full overview of all of them is beyond the scope of this paper. A brief discussion of several of the most common techniques is nevertheless in order.

2.2. Mean diurnal variation (MDV)

The MDV method is an empirical method, meaning that conclusions are drawn from observations. Missing values are replaced with the average of the values measured on adjacent days at exactly the same time of day, so the gap is estimated from the days close to it. A window of predefined length is chosen depending on the data. This window contains the several days or weeks used to estimate the missing data, and the mean over this window is used to fill the gap.
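The MDV idea can be illustrated with a minimal Python sketch. It assumes the series has already been cut into days of equal length, that missing values are marked with `None`, and that the window is given as a number of days on each side of the gap; the function name `mdv_fill` and the toy values are hypothetical and do not come from the studies cited here.

```python
from statistics import mean

def mdv_fill(days, target, window=3):
    """Fill None values in days[target] with the mean of the values measured
    at the same time of day on the neighbouring days inside the window."""
    lo = max(0, target - window)
    hi = min(len(days), target + window + 1)
    filled = list(days[target])
    for i, value in enumerate(filled):
        if value is None:
            neighbours = [days[d][i] for d in range(lo, hi)
                          if d != target and days[d][i] is not None]
            # Leave the gap open when no neighbouring day has a value here.
            if neighbours:
                filled[i] = mean(neighbours)
    return filled

# Three toy "days" of four measurements each; the middle day has a gap.
days = [[1.0, 2.0, 3.0, 4.0],
        [1.5, None, None, 4.5],
        [2.0, 3.0, 5.0, 5.0]]
filled_day = mdv_fill(days, target=1, window=1)
```

Each missing value is replaced independently, which is why the method is easy to implement but cannot follow day-specific variations inside a long gap.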

The main advantages are that the method is easy to implement and that previous studies showed that MDV produces decent results. These tests were done on data sets with on average 35% missing or rejected data, which amounted to about 6000 half-hour values per year. Over 50% of the gaps were shorter than 2 h, and less than 4% were longer than 1000 periods (21 days). The smaller the gap, the better the results; the results also depended on whether the gaps occurred during the night or during the day. For the exact numbers, the reader is referred to Falge et al. (2001).

Examples of applications of this method include ecosystem exchange, volatile organic compound fluxes and carbon fluxes (Falge et al., 2001; Moffat et al., 2007).

2.3. Singular spectrum analysis (SSA)

Singular spectrum analysis is a well-known method for analysing time series. It combines elements of classical time series analysis, multivariate statistics, dynamical systems and signal processing (Kondrashov et al., 2010). The method is iterative and uses both spatial and temporal correlations. In earlier applications it produced decent results, for single missing values as well as for longer continuous gaps (Karelmo, 2010). The method is based on an eigenvalue decomposition of a lag-covariance matrix $C_X$ obtained from the original data series $X_t$, $t = 1, \dots, N$.
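As an illustration of the first step only, the sketch below builds a lag-covariance matrix for a short series in Python. It assumes the common Toeplitz estimate, in which the covariance at lag j is averaged over the N - j available pairs; the text does not state which estimator the cited studies use, and the window length `m` and the toy series are hypothetical. The eigenvalue decomposition and the iterative gap filling are left out and would normally be delegated to a linear algebra library.

```python
def lag_covariance(x, m):
    """Return the m x m Toeplitz lag-covariance matrix of the series x."""
    n = len(x)
    mu = sum(x) / n
    centred = [v - mu for v in x]
    # c[j] is the covariance estimate at lag j, averaged over n - j pairs.
    c = [sum(centred[t] * centred[t + j] for t in range(n - j)) / (n - j)
         for j in range(m)]
    # Entry (i, j) depends only on the lag |i - j|.
    return [[c[abs(i - j)] for j in range(m)] for i in range(m)]

series = [1.0, 3.0, 2.0, 4.0, 3.0, 5.0, 4.0, 6.0]
C = lag_covariance(series, m=3)
```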

Previous studies by Kondrashov et al. have shown that this method produces decent results. For 5% and 17% of missing values respectively, the reported RMSE ranged from 5.97 to 48.37 passengers. Applications include the analysis of climatic, meteorological and geophysical time series, as well as solar-wind analysis and air pollution (Kondrashov et al., 2010; Zhigljavsky, 2013).

2.4. Kohonen self-organizing maps

The Kohonen self-organizing map clustering analysis is probably one of the best methods for filling gaps, but it relies on difficult mathematics and requires a lot of computing power. The method uses polygons as a first approximation to fill the gaps where data are missing. These curves are defined by a set of model points (a kernel), and every point is mapped onto the closest point of the kernel. The domain of points mapped to a certain kernel is called its taxon. An iterative, convergent algorithm then forms the best-fitting curve. In previous studies, data with up to 50% missing values were filled accurately with a polygon.

Previous studies have also shown that this method is not easy to implement, and easier methods would probably give decent results as well (Dergachev et al., 2001).

2.5. Look-up tables

This gap-filling method groups the measured parameters into predefined classes covering a specific period that depends on the data. The mean and standard deviation of each group are used to determine the replacement value of the missing data. The results reported by Falge et al. made it clear that this method is probably not the best option.

Look-up tables are useful in many situations, but here large gaps have to be filled, and other methods are more suitable for the specific situation of this paper. For results, the reader is referred to Falge et al. (2001) and Moffat et al. (2007).

2.6. Multiple imputation method (MI)

The MI method predicts the missing values from the values of other, correlated measured variables. The predicted values are called “imputes” and are inserted where data are missing. The imputes are created by a regression model. The process is performed multiple times, producing multiple imputed data sets that are similar but not identical. The average of all these results then forms one spline that fills the missing values, so that a complete data set is obtained (Falge et al., 2001; Wayman, 2003).

Since our parameters are all expected to be correlated, they can all be used as imputes. However, if the most important variable, the velocity, is not measured, the other variables measured by the same installation will most likely be unavailable too and cannot be used as imputes. The more imputes are available, the more reliable the results; the minimum number of imputes is 3 (Falge et al., 2001).

Multiple imputation often yields filled-in data of good quality and, because it is easy to implement, can be considered a good choice for filling gaps. Studies by Graham et al. have shown that the method performs better with smaller gaps. With 3 imputes and a gap of 25%, the reported results were: mean 37.83, standard error 0.138. For the mathematical background and more results, the reader is referred to the specific literature (Graham et al., 2007; Wayman, 2003).

2.7. Multiple regression analysis

Multiple regression analysis was already tested for filling in missing values in previous work done in the laboratory, and is therefore not discussed further here. Only a comparison of its results with those of the newly tried methods is mentioned in the results of this paper (Métadier, 2011).

2.8. Empirical mode decomposition (EMD)

With EMD, the original function is broken down by an algorithm into a sum of simpler functions. To fill a gap in the time series, four extrema on the left and four on the right side of the gap are used. The distances along the time axis between the extrema on each side are calculated, and the minimum of these distances serves as the basis for the distance between extrema inside the gap. The technique relies on fractional Brownian motion, a Markov model, Cholesky decomposition and the Hurst exponent to find the polygon that fills the gap. Previous work has shown that EMD is very useful for short gaps, containing at most 20 extrema, and gives good results there. In the time series of this paper, however, the extrema are very close to each other, so only small gaps can be filled. For larger gaps, the estimates given by this method will not meet the requirements (Seitchik, 2012).

2.9. Partitioning Around Medoids (PAM)

The next two methods are very similar and differ in the way they group the splines. The PAM method produces k clusters, each containing elements with the highest possible degree of similarity, so that k clusters, each with a certain number of curves, are found. Each cluster is characterised by a medoid, defined as the object of the cluster for which the average dissimilarity to all the objects in the cluster is minimal. The gaps can then be filled with the part of the medoid that covers the same interval as the gap (Kaufman & Rousseeuw, 1990).
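The definition of a medoid and the subsequent fill step can be sketched in a few lines of Python. This is only an illustration under assumptions: the dissimilarity is taken to be the Euclidean distance between curves, the cluster is assumed to be already known, and the search over clusterings that PAM performs is not shown. The toy curves and the names `medoid` and `fill_values` are hypothetical.

```python
from math import dist

def medoid(curves):
    """Return the index of the curve with the smallest average
    Euclidean dissimilarity to all curves in the cluster."""
    def avg_dissimilarity(i):
        return sum(dist(curves[i], c) for c in curves) / len(curves)
    return min(range(len(curves)), key=avg_dissimilarity)

# One toy cluster of four short curves.
cluster = [[1.0, 2.0, 3.0], [1.2, 2.1, 3.1], [4.0, 5.0, 6.0], [1.1, 1.9, 2.9]]
m = medoid(cluster)

# A gap covering indices 1 and 2 is filled with the same part of the medoid.
gap = range(1, 3)
fill_values = [cluster[m][i] for i in gap]
```

Unlike a cluster mean, the medoid is always one of the measured curves, so the filled values are values that were actually observed on some day.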


3. The implemented method

3.1 Introduction & Research Questions

First, the K-means method is explained. K-means clustering is used to form the clusters; afterwards, the gap is filled by shifting and rotating the median of the cluster to which the curve with the gap belongs. A test is performed to determine which cluster this curve belongs to. A verification is then performed to decide on the optimal number of clusters for our data set. According to Kaufman & Rousseeuw, the best way to test this is the silhouette plot.

Artificial gaps are then made in 45% of the days, and the other 55% is used as the data for clustering. The parameter used is discharge (L/s). It was chosen because methods had already been tested on this parameter in previous work in the laboratory, so that using the same parameter makes it possible to compare the results. For all the test curves, the artificial gap begins at the eighth measured value. First, 14 measurements are left out, which gives an artificial gap of 28 minutes, and the RMSE is calculated. This is repeated for gap lengths from 14 up to 700 measurements (about 23 h), in steps of 14 measurements. The gaps are thus 14, 28, 42, ..., 700 measurements long, i.e. 28, 56, 84, ..., 1400 minutes.
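The arithmetic behind these gap lengths can be checked with a short Python sketch. It only restates the numbers given in the text (2-minute time step, gap starting at the eighth value, steps of 14 measurements up to 700); the variable names are hypothetical.

```python
STEP_MINUTES = 2          # time step of the series
GAP_START = 8             # the gap begins at the eighth measured value

gap_lengths = list(range(14, 701, 14))             # gap lengths in measurements
gap_minutes = [n * STEP_MINUTES for n in gap_lengths]
gap_ends = [GAP_START + n for n in gap_lengths]    # t = 8 + 14k, k = 1..50
```

This gives 50 gap lengths, from 14 measurements (28 minutes) to 700 measurements (1400 minutes, a little over 23 hours).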

The RMSE is then plotted in a diagram to analyse the error, and conclusions are drawn about the gap lengths for which the method is useful.

3.2 K-means clustering

This is the method that was implemented and tested. K-means is similar to PAM: it also forms clusters, but uses a slightly different approach. The days used to form the clusters are the dry days of 2007 without missing values.

The K in the K-means clustering algorithm refers to the number of clusters the algorithm looks for: applied to a data set, the algorithm partitions it into K clusters. If it is unable to find K clusters, it splits the data set into K-1 clusters.

So if K is set to 3, the clustering algorithm splits the population into 3 clusters. The value of K must be specified in advance, so the number of clusters has to be decided before the clustering process starts. How this is decided is explained later in the paper.

The plus signs in figure 1 represent the summed values of the splines. This follows from the intention to cluster splines and not just points: each plus sign represents one spline to be clustered.

For example, to obtain three clusters (see figure 1), the procedure is as follows. The first step is to define three random means, one per required cluster; these are used to form the first clusters (left part of figure 1). Each pair of means is connected by a line, and the perpendicular bisector of that line is drawn. This is done three times, so that every mean is connected with every other mean. The clusters are now defined for the first time (their boundaries are the yellow lines). Next, the mean of each cluster is calculated, and these new means become the centres from which the clusters are determined again (right part of figure 1), using the same construction with connecting lines and perpendicular bisectors. The boundaries of the clusters (yellow lines) change, so the elements of the clusters may change as well. This is repeated until the elements stop changing cluster, with every element belonging to a cluster. The random means of the first step sometimes influence the outcome, for example when all the initial means are close to each other or all lie within what would visually be identified as a single cluster. To obtain the optimal clustering, the whole process is therefore run several times with different initial means. The cluster sizes of the runs are then compared, and the sizes that occur most often are taken as the definitive clusters. In our case the cluster sizes were always the same; I repeated the process 5 times just to be sure.
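The procedure described above can be sketched in Python as follows. This is not the Matlab code used in the study; it is a minimal illustration that assumes the Euclidean distance between whole curves, picks the initial means among the curves themselves, and keeps the first run whose cluster sizes occur most often. The function names, the seed and the toy curves are hypothetical.

```python
import random
from math import dist

def kmeans(curves, k, rng):
    """One K-means run: returns (labels, centres)."""
    centres = rng.sample(curves, k)          # random initial means
    labels = None
    while True:
        # Assign every curve to its nearest centre.
        new_labels = [min(range(k), key=lambda c: dist(curve, centres[c]))
                      for curve in curves]
        if new_labels == labels:             # no element changed cluster
            return labels, centres
        labels = new_labels
        # Recompute each centre as the point-wise mean of its members.
        for c in range(k):
            members = [cv for cv, lab in zip(curves, labels) if lab == c]
            if members:
                centres[c] = [sum(col) / len(members) for col in zip(*members)]

def kmeans_restarts(curves, k, runs=5, seed=0):
    """Repeat K-means and keep a run with the most frequent cluster sizes."""
    rng = random.Random(seed)
    results = [kmeans(curves, k, rng) for _ in range(runs)]
    sizes = [tuple(sorted(labels.count(c) for c in range(k)))
             for labels, _ in results]
    best = max(set(sizes), key=sizes.count)
    return results[sizes.index(best)]

# Six toy curves forming two clearly separated groups.
curves = [[0.0, 0.0, 0.0], [1.0, 1.0, 1.0], [2.0, 2.0, 2.0],
          [10.0, 10.0, 10.0], [11.0, 11.0, 11.0], [12.0, 12.0, 12.0]]
labels, centres = kmeans_restarts(curves, k=2)
```

Assigning each curve to its nearest centre is equivalent to the construction with perpendicular bisectors in figure 1, since a bisector separates the points closer to one mean from those closer to the other.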


Figure 1. K-means forming clusters

The distance between the points of a cluster mean (or centre) and the points of a spline is calculated as the Euclidean distance summed over all the points of the curve; this distance to the centre of the cluster decides which cluster the curve belongs to. In figure 2 these distances are plotted against the spline number; they are calculated with the following formula, where C(i) is the value of the points of the centre of the cluster (which is not a curve of the data set), Y(i) is the value of the points of the curve, and i runs up to t = 720, the total number of measurements (Santhi & Bhaskaran, 2014; Tarpey, 2007).

$$d(Y) = \sqrt{\sum_{i=1}^{t} \left( C(i) - Y(i) \right)^2}, \qquad t = 720 \qquad [1]$$

Figure 2 visualises which cluster the curves belong to. The blue dots give the distance to the centre of cluster 1 and the red dots the distance to the centre of cluster 2. A spline is an element of the cluster to which its distance is smallest: the lower the distance to the centre of a cluster, the better the curve fits that centre. The values are displayed only for the first 16 curves, but the same principle applies to the other curves. In figure 2 one can see that 15 of the 16 curves are closest to the red centre and only 1 to the blue centre. This is not a coincidence: water usage depends on the season and on the part of the week (weekdays or weekend). Days in the same season and the same part of the week therefore have a higher chance of belonging to the same cluster.

Figure 2. Distance to the centre of the cluster (done for two clusters)


For our data it was not known in advance what the best number of clusters would be. A silhouette plot was therefore used to test and validate the number of clusters that leads to the best results. Its outcome is a graphical display of how well each object fits its cluster.

The method works as follows. The data are clustered by the K-means method. In the formula below, a(i) is the average dissimilarity of curve i to all other objects of its own cluster, say cluster 1; mathematically, D(i,1) = a(i). With two clusters, b(i) is the average dissimilarity of i to the objects of cluster 2, which is not its own cluster (D(i,2) = b(i)). With more than two clusters, b(i) is the lowest average dissimilarity of i to any cluster of which i is not a member. The cluster with this lowest average dissimilarity is called the neighbouring cluster of i, because it is the next best fitting cluster for curve i. The silhouette value s(i) is a number between -1 and 1 that indicates how well the element has been classified: if it is close to 1, curve i is well clustered.

$$s(i) = \begin{cases} 1 - a(i)/b(i) & \text{if } a(i) < b(i) \\ 0 & \text{if } a(i) = b(i) \\ b(i)/a(i) - 1 & \text{if } a(i) > b(i) \end{cases} \qquad [2]$$

Or shorter:

$$s(i) = \frac{b(i) - a(i)}{\max\{a(i), b(i)\}} \qquad [3]$$

When the silhouette value is close to 1, a(i) is much smaller than b(i): curve i is very similar to its own cluster and very dissimilar to the other cluster, so it has been placed in the right cluster. When the silhouette value is close to -1, the element should belong to another cluster. The average silhouette value of all the curves of a cluster is therefore an indication of how tight the cluster is, and the overall average is an indication of whether the chosen number of clusters is appropriate: the right number of clusters gives the highest average silhouette value (Kaufman & Rousseeuw, 1990; Merkuryeva, 2012).
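A minimal Python sketch of the silhouette computation is given here for illustration. It assumes the Euclidean distance between curves as dissimilarity and assigns the value 0 to curves that are alone in their cluster, which is a common convention rather than something stated in the text. The function name and the toy data are hypothetical.

```python
from math import dist

def silhouette_values(curves, labels):
    """Return s(i) for every curve, with Euclidean dissimilarity."""
    clusters = sorted(set(labels))
    values = []
    for i, curve in enumerate(curves):
        def avg_to(cluster):
            # Average dissimilarity of curve i to the members of a cluster.
            others = [c for j, c in enumerate(curves)
                      if labels[j] == cluster and j != i]
            return sum(dist(curve, o) for o in others) / len(others)
        if labels.count(labels[i]) == 1:
            values.append(0.0)            # convention for singleton clusters
            continue
        a = avg_to(labels[i])                                   # own cluster
        b = min(avg_to(c) for c in clusters if c != labels[i])  # neighbour
        values.append((b - a) / max(a, b))
    return values

curves = [[0.0, 0.0, 0.0], [1.0, 1.0, 1.0], [2.0, 2.0, 2.0],
          [10.0, 10.0, 10.0], [11.0, 11.0, 11.0], [12.0, 12.0, 12.0]]
good = silhouette_values(curves, [0, 0, 0, 1, 1, 1])
bad = silhouette_values(curves, [0, 0, 1, 1, 1, 1])
```

With the natural split all values in `good` are high, whereas in `bad` the third curve, which has been put in the wrong cluster, gets a negative value. Comparing the average of these values for different K is the selection criterion described in the text.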

In the upper silhouette plot of figure 3, three "knife blade" shapes can be seen, which indicates that three clusters are formed. The first knife blade contains all the silhouette values of the first cluster, the second those of the second cluster, and so on.

The third cluster contains many points with low silhouette values, and the second and third clusters contain negative values, indicating that the separation of those two clusters is not optimal. The average silhouette value is 0.4317. In the bottom silhouette plot the silhouette values are higher, with very few negative values, which indicates that the clusters are tighter. The average silhouette value is also higher, namely 0.6087. Tests were also performed for four and five clusters, but these gave low average silhouette values (0.27 and 0.19 respectively).

After several runs with K = 2, 3, 4 and 5 clusters, it became clear that the best results are obtained with two clusters. The corresponding average silhouette value of 0.6087 is relatively high and indicates that two clusters is a good choice (Rousseeuw, 1987).


Figure 3. Silhouette plots for three and two clusters, with average silhouette values of 0.4317 and 0.6087 respectively.

The gap can be filled with the mean or the median of the cluster; these are not the same as the means used for clustering mentioned above. To avoid confusion: the means and medians used in the rest of the paper are the standard mathematical means and medians, whereas the means used for clustering were chosen randomly by the K-means algorithm and then, as explained above, recalculated from the elements of the clusters until those elements stopped changing. They are therefore not necessarily the same as the mathematical means.

To decide whether the mean or the median is the better choice, the errors obtained with the mean and with the median were calculated and plotted in a diagram (figure 4). Each blue circle represents the error of the median on the x-axis and the error of the mean on the y-axis. This was done for several gap lengths and different starting positions, because either the mean or the median might give the lowest error depending on the gap length or starting position. One of these results is given in figure 4. The conclusion in our case is that the median is the best option, because its RMSE, calculated with formula [6], was slightly lower than that of the mean. This formula is discussed later in the paper; for now it suffices to know that the median of the clusters is used to fill the gaps. In the figure one can observe that, for the average of all the results (red star), the value for the median is slightly lower than that for the mean.

Figure 4. Error of the median plotted against the error of the mean, used to decide whether the median or the mean is the more appropriate option to fill the gaps


Now that it is known that two clusters and the median are the best option, the gap-filling procedure can start. Two groups are made: the dry days of 2007 without missing values, used for the clustering described above, and the dry days of 2008 without missing values, used as the curves to which the method is applied. This gives 70 days for clustering and 55 test days.

For all the test curves, the artificial gap begins at the eighth measured value. First, 14 measurements are left out, which gives an artificial gap of 28 minutes, and the RMSE is calculated. This is repeated for gap lengths from 14 up to 700 measurements (about 23 h), in steps of 14 measurements (see figure 5). The gaps are thus 14, 28, 42, ..., 700 measurements long, i.e. 28, 56, 84, ..., 1400 minutes.

The curves with gaps are filled with the median of the cluster to which they belong, but a test is needed to decide which cluster a curve with a gap belongs to. This is done as follows. In the formulas below, Y' is the curve with the gap, $M_k$ is the median of cluster k, and k is the number of the cluster. $\alpha_k$ and $\beta_k$ are the parameters of a linear model, specific to each median, fitted on the parts before and after the gap; they are used to map the median onto the curve with the gap as well as possible. $t_0$ and $t_1$ indicate the start and end points of the gap, and n is the number of measurements outside the gap. The root mean squared error (RMSE) is then calculated for each median, the results are compared, and the median with the smallest error is assigned to fill the gap in that curve.

$$\hat{Y}_k(t) = \alpha_k + \beta_k M_k(t), \qquad k = 1, 2 \qquad [4]$$

$$RMSE_k = \sqrt{\frac{1}{n} \sum_{t} \left( Y'(t) - \left( \alpha_k + \beta_k M_k(t) \right) \right)^2}, \qquad t \notin [t_0, t_1] \qquad [5]$$
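The cluster test can be sketched in Python as follows. The sketch assumes that the two linear model parameters are an intercept and a slope obtained by ordinary least squares on the measurements outside the gap; the text does not specify the fitting procedure, so this is one plausible reading and not necessarily the one used in the study. The function names and the toy medians are hypothetical, and the sketch fails if a median is constant outside the gap.

```python
from math import sqrt

def fit_outside_gap(curve, median, t0, t1):
    """Least-squares fit curve ~ alpha + beta * median on the points
    outside the gap [t0, t1]; returns (alpha, beta, rmse)."""
    idx = [t for t in range(len(curve)) if t < t0 or t > t1]
    n = len(idx)
    mx = sum(median[t] for t in idx) / n
    my = sum(curve[t] for t in idx) / n
    sxx = sum((median[t] - mx) ** 2 for t in idx)
    sxy = sum((median[t] - mx) * (curve[t] - my) for t in idx)
    beta = sxy / sxx
    alpha = my - beta * mx
    rmse = sqrt(sum((curve[t] - (alpha + beta * median[t])) ** 2
                    for t in idx) / n)
    return alpha, beta, rmse

def assign_cluster(curve, medians, t0, t1):
    """Index of the median whose fitted version has the smallest RMSE."""
    errors = [fit_outside_gap(curve, m, t0, t1)[2] for m in medians]
    return errors.index(min(errors))

medians = [[1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
           [6.0, 4.0, 5.0, 1.0, 3.0, 2.0]]
# Curve equal to 1 + 2 * medians[0], with a gap at indices 2 and 3.
curve = [3.0, 5.0, None, None, 11.0, 13.0]
alpha, beta, rmse = fit_outside_gap(curve, medians[0], 2, 3)
best = assign_cluster(curve, medians, 2, 3)
```

Only the measurements outside the gap enter the fit, so the missing values never need to be read.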

The part that fills the gap is first moved up or down to the starting point of the gap. It is then rotated over an angle that depends on the starting and ending points of the gap, with the starting point as the centre of rotation. Finally, the curve is stretched, because the length of the gap depends on the slope of the gap and will normally not be the same as the length of the moved and rotated curve. In this way, continuity with the first and last points of the gap is achieved (Jørgensen & Goegebeu, 2007).
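The effect of this adjustment, continuity at both ends of the gap, can be illustrated with a simplified Python sketch. It does not implement the shift, rotation and stretch described above; instead it swaps in a simpler technique, a vertical shift plus a linear correction along the gap, which also makes the filled part meet the known values on both sides. The function name and the toy values are hypothetical, and the measurements just before and just after the gap are assumed to be available.

```python
def fill_gap(curve, median, t0, t1):
    """Fill curve[t0..t1] from the cluster median so that the filled part
    joins the known values at t0 - 1 and t1 + 1 continuously."""
    a, b = t0 - 1, t1 + 1                      # last and next known points
    off_a = curve[a] - median[a]               # mismatch at the start
    off_b = curve[b] - median[b]               # mismatch at the end
    filled = list(curve)
    for t in range(t0, t1 + 1):
        w = (t - a) / (b - a)                  # 0 at the start, 1 at the end
        # Median value plus a correction that moves linearly from the
        # start mismatch to the end mismatch.
        filled[t] = median[t] + off_a + w * (off_b - off_a)
    return filled

median = [1.0, 2.0, 4.0, 3.0, 5.0, 6.0]
curve = [2.0, 3.0, None, None, 7.0, 8.0]
filled = fill_gap(curve, median, t0=2, t1=3)
```

As with the method in the text, this sketch cannot be used when the value before or after the gap is missing.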

The error is obtained by comparing the real values in the artificially made gaps with the values filled in by the K-means method, using the groups mentioned above. In formula [6], R_x are the values of the real curve, i.e. the curve before the artificial gap was made, and Y'_x are the values of the filled-in part. x starts at the first value of the gap, and t indicates where the gap ends. The sum over all the measurements is divided by the number of measurements (t-8). This results in 55 RMSE curves, because there are 55 test days.

Figure 5. The curve with the gap and the indication of the gap length


$$RMSE(t) = \sqrt{\frac{1}{t-8} \sum_{x=8}^{t} \left( R_x - Y'_x \right)^2}, \qquad t = 8 + 14k;\ k = 1, 2, \dots, 50 \qquad [6]$$
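A minimal Python sketch of how one RMSE curve could be computed for a test day is given here. It assumes 0-based indexing, so the eighth value has index 7, and it divides by the number of values in the gap. The callable `fill` stands for any gap-filling procedure and is hypothetical, as are the toy data, in which every filled value is off by exactly 1 L/s.

```python
from math import sqrt

def gap_rmse(real, filled, start, length):
    """RMSE between real and filled values over one artificial gap."""
    idx = range(start, start + length)
    return sqrt(sum((real[x] - filled[x]) ** 2 for x in idx) / length)

def rmse_curve(real, fill, start=7, step=14, n_steps=50):
    """RMSE for gaps of step, 2*step, ..., n_steps*step measurements.
    `fill(start, length)` is a hypothetical callable returning the day
    with the gap of that length filled in."""
    return [gap_rmse(real, fill(start, k * step), start, k * step)
            for k in range(1, n_steps + 1)]

# Toy day of 720 measurements and a dummy filler that is off by 1 L/s.
real = [float(i % 10) for i in range(720)]
dummy_fill = lambda start, length: [v + 1.0 for v in real]
curve = rmse_curve(real, dummy_fill)
```

Repeating this for every test day gives one RMSE curve per day, each with one value per gap length.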

4. Results and discussions

Once all the results are obtained, the 5% percentiles on both sides are removed, so that figure 6 shows the 90% confidence interval. The mean of all the RMSE curves obtained with formula [6] is plotted as the red line; this error is always around 1 L/s. The upper and lower boundaries (blue dashed lines) indicate the range within which the actual accuracy of the filled-in measurements lies. The highest error is lower than 3 L/s and the lowest error is around 0.25 L/s.
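The construction of such a band can be sketched in Python as follows. The sketch assumes that, for each gap length, the band runs from the 5th to the 95th percentile of the RMSE values of all test days, with percentiles obtained by linear interpolation between sorted values; the exact percentile definition used for figure 6 is not stated in the text. The function names and the toy data are hypothetical.

```python
from statistics import mean

def percentile(values, p):
    """Percentile by linear interpolation between sorted values."""
    s = sorted(values)
    pos = (len(s) - 1) * p / 100
    lo = int(pos)
    hi = min(lo + 1, len(s) - 1)
    return s[lo] + (pos - lo) * (s[hi] - s[lo])

def rmse_band(rmse_curves):
    """Per gap length: (5th percentile, mean, 95th percentile) over all days."""
    return [(percentile(col, 5), mean(col), percentile(col, 95))
            for col in zip(*rmse_curves)]

# Toy input: 21 "days", each with RMSE values for two gap lengths.
rmse_curves = [[float(d), 2.0 * d] for d in range(21)]
band = rmse_band(rmse_curves)
```

Each tuple in `band` corresponds to one gap length and gives the lower boundary, the mean and the upper boundary.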

Figure 6. The upper and lower boundaries of the RMSE and the mean of all the RMSE curves

5. Conclusion & perspectives

The aim of this study was to find a method to fill gaps that extend over hours or days. The method was tested on discharge values. Discharge was chosen because previous methods had already been tested on this parameter, so that using the same parameter makes it possible to compare the results.

One of the conclusions is that, for our data set, the results for 2 clusters are better than the results obtained with 3 clusters; this depends on the data set. The hypothesis is that the usage of water depends on the season and on the part of the week (weekdays or weekend). Days that are in the same season and in the same part of the week therefore have a higher chance of being in the same cluster. I found that the curves from winter and autumn are mostly in the same cluster and that the curves from spring and summer are also mostly in the same cluster. The exceptions occurred mainly on weekend days. The hypothesis is therefore that the usage of water on winter and autumn weekdays is very similar, and that the same applies to the weekdays of spring and summer. The weekend days are more difficult to cluster.
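One way to check this hypothesis is to count how the days of each season and each part of the week are distributed over the clusters. The sketch below is only an illustration: the helper names are hypothetical, the cluster labels are assumed to come from the K-means step, and the seasons are assumed to be the meteorological seasons of the northern hemisphere (December to February as winter, and so on).

```python
from collections import Counter
from datetime import date

# Assumption: meteorological seasons, northern hemisphere.
SEASON_BY_MONTH = {
    12: "winter", 1: "winter", 2: "winter",
    3: "spring", 4: "spring", 5: "spring",
    6: "summer", 7: "summer", 8: "summer",
    9: "autumn", 10: "autumn", 11: "autumn",
}


def week_part(day):
    """Return 'weekend' for Saturday and Sunday, otherwise 'weekday'."""
    return "weekend" if day.weekday() >= 5 else "weekday"


def tally_clusters(days, labels):
    """Count the days per (season, part of the week, cluster label)."""
    if len(days) != len(labels):
        raise ValueError("days and labels must have the same length")
    return Counter(
        (SEASON_BY_MONTH[d.month], week_part(d), label)
        for d, label in zip(days, labels)
    )
```

If the hypothesis holds, most winter and autumn weekdays end up under one cluster label and most spring and summer weekdays under the other, while the weekend days are spread over both.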

The results are promising and the method works equally well for small gaps and for large gaps, although the spread of the confidence interval is large. The first 250 measurements in particular have a wide confidence interval. The hypothesis is that this is due to the sharp rise in water usage in the morning hours. The confidence interval is widest between 5 and 8 o’clock, because the rise in water demand is largest there.

The method needs at least the starting and ending points of the curve to do the moving and rotating. It is therefore not possible to apply this method if the starting and/or ending point of a day is missing. Both points are necessary because the starting point is the centre of the rotation, and the starting and ending points of the gap together determine the angle of the rotation. This dependence on the starting and ending points is a downside of the method. For these situations there is still no solution, and a method should be developed.

This method can only fill gaps with a maximum length of 700 measurements. Methods to fill gaps that extend over a period of days should therefore still be developed.

People in other laboratories are working on the same problem, and it would be interesting to test their methods on our data and compare the results with those obtained with the method implemented here.

Yannick Rochtus

Filling gaps in time series

Master's dissertation submitted in order to obtain the academic degree of Master of Science in de industriële wetenschappen: bouwkunde

Supervisors: Prof. Patrick Ampe, Mr. Jean-Luc Bertrand-Krajewski

Department of Industrial Technology and Construction

Chairman: Prof. Marc Vanhaelst

Faculty of Engineering and Architecture

Academic year 2013-2014