Filling gaps in time series in urban hydrology Project of research – June 2014 ROCHTUS Yannick Co-supervisors: AUBIN Jean-Baptiste, BERTRAND-KRAJEWSKI Jean-Luc
Département Génie Civil & Urbanisme
Laboratoire de Génie Civil et d'Ingénierie Environnementale (LGCIE)
20 Avenue A. Einstein, 69621 Villeurbanne Cedex, France
ABSTRACT. Many research projects in urban hydrology are based on time series, especially for knowledge on processes,
modelling, etc. Consequently, the quality and the completeness of these time series are essential. However, time series may
show gaps for various reasons. For some applications it is then important to fill these gaps and replace the missing values.
The goal is to find a method that relies as much as possible on the measured data and as little as possible on model-theoretic
assumptions. Different methods have therefore been studied to identify the most appropriate one. A method was then
implemented in Matlab so that its results could be compared with the actual values. It uses the K-means algorithm to form
clusters and fills the gaps with the median of the respective cluster. The results are promising and the method works equally
well for small and large gaps.
KEYWORDS: clustering, filling gaps, K-means, missing values, time series, urban hydrology.
1. Introduction
Since 2004, the LGCIE, in the OTHU project (Field Observatory on Urban Hydrology - see www.othu.org)
has monitored stormwater quality in two urban catchments in Lyon. With a time step of 2 minutes, time series on
rainfall, water level, flow velocity and discharge, turbidity (used as a surrogate measurement of TSS – Total
Suspended Solids and COD – Chemical Oxygen Demand), conductivity, pH, temperature, etc. are collected.
Many research projects are based on time series, especially for knowledge on processes, modelling, etc.
Consequently, the quality and the completeness of these time series are essential. However, time series may
show gaps, due to various reasons (maintenance, sensor failure, power failure, human error, rejection after
validation test, etc.). It is then important, for some applications, to fill these gaps and to replace missing values.
If gaps are small (a few minutes), solutions are simple and already implemented; one such method is interpolation.
However, if gaps extend over hours or days, specific methods need to be developed.
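For short gaps, the interpolation mentioned above can be sketched in a few lines. This is an illustrative Python sketch, not the laboratory's Matlab code; the function name and NaN-based gap marking are assumptions.

```python
import numpy as np

def fill_small_gap(values):
    """Fill NaN gaps in a 1-D series by linear interpolation.

    Suitable only for short gaps (a few time steps). Illustrative
    sketch; the paper's actual implementation is in Matlab.
    """
    values = np.asarray(values, dtype=float)
    idx = np.arange(len(values))
    missing = np.isnan(values)
    filled = values.copy()
    # np.interp linearly interpolates the missing positions from the
    # surrounding measured points.
    filled[missing] = np.interp(idx[missing], idx[~missing], values[~missing])
    return filled
```

For example, `fill_small_gap([1.0, nan, 3.0])` reconstructs the middle value as 2.0; for gaps of hours or days such straight-line estimates are clearly inadequate, which motivates the methods below.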
It is important that the method used to fill these gaps relies as much as possible on the measured data and as little as
possible on model-theoretic assumptions. It is also very important that it provides unbiased estimates. To find the most
appropriate method, different methods were therefore studied and their results compared. This was done by making
artificial gaps in the time series, so that the results of the various methods could afterwards be compared with the actual
values, demonstrating how well each method reconstructs the data that were purposely left out.
Thus, various methods can be selected to fill the real gaps; possibly more than one method will be used depending on
the duration of the gap, since different methods are expected to work better for different gap lengths. The methods are
implemented in Matlab and this code will be further used in the context of a PhD.
PIRD - GCU/INSAL Rochtus Yannick June 2014
2. Literature review
2.1. Overview of the different methods
Many gap-filling techniques exist, and a full overview of all of them is beyond the scope of this paper, but a brief
discussion of several of the most common techniques is in order.
2.2. Mean diurnal variation (MDV)
The MDV method is an empirical method: conclusions are drawn from observations. Missing values are replaced with
the values of the adjacent days averaged at exactly that time of day, so the gap is estimated from the days close to it. A
predefined window is chosen depending on the data; this window contains the several days or weeks used to estimate the
missing data, and the mean over this window fills the gaps.
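The MDV replacement rule can be sketched as follows. This is a minimal Python illustration under stated assumptions (missing values marked as NaN, a regular time step, illustrative function and parameter names), not code from Falge et al.

```python
import numpy as np

def mdv_fill(series, steps_per_day, window_days=7):
    """Mean diurnal variation (MDV) gap filling, as a minimal sketch.

    A missing value at a given time of day is replaced by the mean of
    the measured values at the same time of day within +/- window_days
    adjacent days. NaN marks missing data.
    """
    x = np.asarray(series, dtype=float)
    filled = x.copy()
    for i in np.flatnonzero(np.isnan(x)):
        # Candidate indices: the same time of day on neighbouring days.
        offsets = np.arange(-window_days, window_days + 1) * steps_per_day
        j = i + offsets
        j = j[(j >= 0) & (j < len(x)) & (j != i)]
        neighbours = x[j]
        neighbours = neighbours[~np.isnan(neighbours)]
        if neighbours.size:
            filled[i] = neighbours.mean()
    return filled
```

The choice of `window_days` corresponds to the predefined window mentioned above and must be tuned to the data.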
The most important advantages are that this method is easy to implement and that previous studies showed that MDV
produces decent results. Those tests were done on data sets with on average 35% missing or rejected data, amounting to
about 6000 half-hour values per year. Over 50% of the gaps were shorter than 2 h, and less than 4% were longer than 1000
periods (21 days). The smaller the gap, the better the results; the results also depended on whether the gaps occurred
during the night or during the day. For the exact numbers, the reader is referred to Falge et al. (2001).
Some examples where this method is used are ecosystem exchange, volatile organic compound fluxes, carbon fluxes
and so on (Falge et al., 2001; Moffat et al., 2007).
2.3. Singular spectrum analysis (SSA)
Singular spectrum analysis is a well-known method to analyse time series. It combines elements of classical
time series analysis, multivariate statistics, dynamical systems and signal processing (Kondrashov et al., 2010).
The method is iterative and uses both spatial and temporal correlations. In earlier applications it produced decent
results, for single missing values as well as for longer continuous gaps (Karelmo, 2010). The method is based on an
eigenvalue decomposition of a lag-covariance matrix C_x obtained from the original data series X_t, t = 1, ..., N.
Previous studies by Kondrashov et al. have shown that this method produces decent results: for 5% and 17% of missing
values respectively, the RMSE ranged from 5.97 to 48.37 passengers. Examples in practice include the analysis of climatic,
meteorological and geophysical time series; more applied examples are solar-wind analysis and air pollution (Kondrashov
et al., 2010; Zhigljavsky, 2013).
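The embedding and eigendecomposition step at the heart of SSA can be sketched as below. This is an illustrative numpy sketch only; the full iterative gap-filling scheme of Kondrashov et al. is not reproduced, and the function name and window parameter are assumptions.

```python
import numpy as np

def ssa_components(x, M):
    """Eigendecomposition step of singular spectrum analysis (sketch).

    Embeds the series x (length N) with window M into a trajectory
    matrix, builds the lag-covariance matrix C_x, and returns its
    eigenvalues and eigenvectors sorted by decreasing variance.
    """
    x = np.asarray(x, dtype=float)
    N = len(x)
    K = N - M + 1
    # Trajectory (Hankel) matrix: K lagged windows of length M, one per row.
    X = np.column_stack([x[i:i + M] for i in range(K)]).T   # shape (K, M)
    C = X.T @ X / K                                         # lag-covariance, (M, M)
    eigval, eigvec = np.linalg.eigh(C)                      # ascending order
    order = np.argsort(eigval)[::-1]
    return eigval[order], eigvec[:, order]
```

For a purely oscillatory series, two leading eigen-pairs capture essentially all the variance, which is what makes reconstruction from a few components (and hence gap filling) possible.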
2.4. Kohonen self-organizing maps
Kohonen self-organizing map clustering is probably one of the best methods to fill gaps, but it relies on quite difficult
mathematics and needs a lot of computing power. The method uses polygonal lines as a first approximation to fill the gaps
where data are missing. These curves are defined by a set of model points (a kernel), and every data point is mapped onto
the closest point of the kernel; the domain of points mapped to a given kernel point is called its taxon. A convergent
iterative algorithm then forms the best-fitting curve. In previous studies, data with up to 50% missing values have been
filled accurately with such a polygonal line.
Previous studies have also shown that this method is not easy to implement, and easier methods would probably give
comparably decent results (Dergachev et al., 2001).
2.5. Look-up tables
This gap-filling method clusters the measured parameters into predefined groups over a specific period that depends on
the data. The mean and standard deviation of each group are used to determine the replacement value for the missing
data. Analysing the results in the paper of Falge et al. makes clear that this method is probably not the best option here.
Look-up tables are useful in many situations, but in our case large gaps have to be filled, so other methods are more
useful for the specific situation of this paper. For results, the reader is referred to Falge et al. (Falge et al., 2001; Moffat et
al., 2007).
2.6. Multiple imputation method (MI)
The MI method predicts the missing values by using the values of other correlated measured variables. These predicted
values are called "imputes" and are inserted where the data are missing. The imputes are created by a regression model.
This process is performed multiple times, producing multiple imputed data sets which are similar but not identical; the
average of these results then forms one spline that fills the missing values, so a complete data set is obtained (Falge et al.,
2001; Wayman, 2003).
Since our parameters are all expected to be correlated, they can all be used as imputes. However, if the most important
variable, the velocity, is not measured, the other variables measured by the same installation will likely not be available
either and cannot be used as imputes. The more imputes are available, the more reliable the results, and the minimum
number of imputes is 3 (Falge et al., 2001).
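The core of the approach, regression-based imputes averaged over several draws, can be sketched as below. This is a deliberately simplified Python illustration with illustrative names; real MI also propagates the uncertainty of the fitted regression parameters between imputations.

```python
import numpy as np

def multiple_impute(y, x, m=5, rng=None):
    """Multiple imputation of missing y-values from a correlated variable x.

    A linear regression is fitted on the complete pairs; each of the m
    imputed data sets adds residual noise to the prediction, and the m
    imputes are averaged into one filled value per gap. Minimal sketch.
    """
    rng = np.random.default_rng(rng)
    y = np.asarray(y, dtype=float)
    x = np.asarray(x, dtype=float)
    obs = ~np.isnan(y)
    slope, intercept = np.polyfit(x[obs], y[obs], 1)
    resid = y[obs] - (slope * x[obs] + intercept)
    sigma = resid.std()
    filled = y.copy()
    miss = ~obs
    pred = slope * x[miss] + intercept
    # m noisy imputes, then their average fills each gap.
    imputes = pred + rng.normal(0.0, sigma, size=(m, miss.sum()))
    filled[miss] = imputes.mean(axis=0)
    return filled
```

With perfectly correlated variables the residual noise vanishes and the imputes coincide with the regression prediction; in practice the spread between imputes is what quantifies the imputation uncertainty.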
Multiple imputation often fills the data in with good quality and, because it is easy to implement, it can be considered a
good choice for filling gaps. Studies by Graham et al. have shown that this method performs better with smaller gaps; with
3 imputes and 25% missing data, the results were a mean of 37.83 with a standard error of 0.138. For the mathematical
background and more results, the reader is referred to the specific literature (Graham et al., 2007; Wayman, 2003).
2.7. Multiple regression analysis
Multiple regression analysis was already tested for filling missing values in previous work in the laboratory and will
therefore not be discussed further here; only a comparison of its results with the newly tried methods is given in the results
of this paper (Métadier, 2011).
2.8. Empirical mode decomposition (EMD)
EMD works as follows: the original function is broken down into a sum of simpler functions by an algorithm. To fill a
gap in the time series, four extrema on the left and four on the right side of the gap are used. The distance along the time
axis between the extrema on each side is calculated, and the minimum of these distances serves as the base distance
between extrema inside the gap. The technique relies on fractional Brownian motion, a Markov model, Cholesky
decomposition and the Hurst exponent to find the polygonal line that fills the gap. Using EMD to fill short gaps, with a
maximum of 20 extrema, is very useful and gives good results, as shown in previous work. In the time series of this paper,
however, the extrema lie very close to each other, so only small gaps can be filled; for larger gaps the estimates from this
method will not fulfil the requirements (Seitchik, 2012).
2.9. Partitioning Around Medoids (PAM)
The next two methods are very similar and differ in the way they group the splines. The result of the PAM method is k
clusters, each containing elements with the highest possible degree of similarity; k clusters with a certain number of curves
are thus found. Each cluster is defined by a medoid: the object of the cluster for which the average dissimilarity to all other
objects in the cluster is minimal. The gaps can then be filled with the parts of the medoids covering the same interval as
the gaps (Kaufman & Rousseeuw, 1990).
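The medoid definition above translates directly into code. The sketch below (illustrative names, given a precomputed dissimilarity matrix) only shows the medoid-selection step, not the full PAM swap algorithm of Kaufman & Rousseeuw.

```python
import numpy as np

def medoid_index(D, members):
    """Return the medoid of one cluster given a dissimilarity matrix D.

    The medoid is the member whose total (equivalently, average)
    dissimilarity to all other members is minimal, matching the PAM
    definition in the text.
    """
    members = np.asarray(members)
    sub = D[np.ix_(members, members)]          # pairwise dissimilarities inside the cluster
    return members[np.argmin(sub.sum(axis=1))]
```

Unlike a K-means centre, the medoid is always an actual curve of the data set, which is why gaps can be filled with pieces of it directly.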
3. The implemented method
3.1 Introduction & Research Questions
First the K-means method is explained. K-means clustering is used to form the clusters; afterwards the gap is filled by
shifting and rotating the median of the cluster to which the gapped curve belongs. To determine which cluster this curve
belongs to, a test is performed. A verification is then carried out to decide the optimal number of clusters for our data set;
Kaufman & Rousseeuw's paper indicates that the best way to test this is the silhouette plot.
Artificial gaps are then made in 45% of the days, and the other 55% of the days are used as the data for clustering. The
parameter used is the discharge (L/s); it was chosen because methods were already tested on this parameter in previous
work in the laboratory, so using the same parameter makes it possible to compare the results. For all the testing curves,
the beginning of the artificial gap is chosen at the eighth measured value. Then 14 measurements are left out, which
results in an artificial gap of 28 minutes. Next, the RMSE is calculated. This is done for gaps from the initial length of 14
measurements up to a length of 700 measurements (about 23 h), in steps of 14 measurements; the gaps are thus
respectively 14, 28, 42, ..., 700 measurements long.
The RMSE is then calculated and plotted in a diagram to analyse the error, and conclusions are drawn about the gap
lengths for which this method is useful.
3.2 K-means clustering
This is the tested and used method. The K-means method is similar to the PAM method: it also forms clusters, but with
a slightly different approach. The days used to form the clusters are the dry days of 2007 without missing values.
The K in the K-means algorithm refers to the fact that the algorithm looks for K different clusters: applied to a data set,
it breaks the data set into K clusters (if one cluster ends up empty, the data are effectively partitioned into K-1 clusters).
So if K is set to 3, the clustering algorithm breaks the population into 3 clusters. The value of K must be specified to the
algorithm in advance, so the desired number of clusters has to be decided before starting the clustering process; how this
is decided is explained later in the paper.
The plus signs in figure 1 represent the summed values of the splines; this follows from the intention to cluster splines
and not just points, so each plus sign represents a spline to be clustered.
For example, to obtain three clusters (see figure 1), the procedure is as follows. The first step is to define three random
means, which are used to form the first clusters (left part of figure 1); three random means are used because three clusters
are needed. Two of these three points are then connected with a line and the perpendicular bisector of this line is drawn.
This is done three times, so that every random mean is connected with each other one; the clusters are now defined for
the first time (the cluster boundaries are the yellow lines). The new mean of each cluster is then calculated and forms the
new centre from which the clusters are computed again (right part of figure 1), repeating the same construction of
connecting lines and perpendicular bisectors. One can observe that the boundaries (yellow lines) of the clusters change,
so the elements of the clusters may change as well. This is repeated until no element changes cluster any more and every
element is part of a cluster. The whole process is run several times because the random means in the first step sometimes
influence the clustering, for example when all the initial means are close to each other or all lie in what one would visually
identify as a single cluster. To obtain the optimal clustering, the whole process is therefore repeated with different initial
means; the cluster sizes from each run are compared and the sizes that occur most often are chosen as the definitive
clusters. In our case the cluster sizes were always the same; the process was nevertheless repeated 5 times to be sure.
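The iterative procedure just described is Lloyd's K-means algorithm and can be sketched as below. This is an illustrative Python version (the paper's implementation is in Matlab; function and parameter names are assumptions), with rows of `X` playing the role of daily curves.

```python
import numpy as np

def kmeans_curves(X, k, n_restarts=5, n_iter=100, rng=None):
    """Lloyd's K-means on daily curves, a sketch of the procedure above.

    X has one row per daily curve (e.g. 720 discharge values). The
    algorithm is restarted several times with different random initial
    means, and the run with the lowest total within-cluster distance
    is kept.
    """
    rng = np.random.default_rng(rng)
    X = np.asarray(X, dtype=float)
    best = None
    for _ in range(n_restarts):
        centres = X[rng.choice(len(X), size=k, replace=False)]
        for _ in range(n_iter):
            # Assign each curve to its closest centre (summed Euclidean distance).
            d = np.linalg.norm(X[:, None, :] - centres[None, :, :], axis=2)
            labels = d.argmin(axis=1)
            new = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                            else centres[j] for j in range(k)])
            if np.allclose(new, centres):   # no element changes cluster any more
                break
            centres = new
        cost = d.min(axis=1).sum()
        if best is None or cost < best[0]:
            best = (cost, labels, centres)
    return best[1], best[2]
```

The multiple restarts correspond to the repeated runs with different initial means described in the text.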
Figure 1. K-means forming clusters
The distance between the points of a cluster's mean (or centre) and the points of a spline is calculated with the
Euclidean distance; summed over the whole curve it gives the distance to the centre of the cluster, which decides to which
cluster the curve belongs. In figure 2 these distances are plotted per spline number; they are calculated with the following
formula, in which the distance for each curve is the Euclidean distance summed over all points of the curve, with C(i) the
values of the points of the cluster centre (which is not a curve of the data set), Y(i) the values of the points of the curve, and
i running from 1 to 720, the total number of measurements (Santhi & Bhaskaran, 2014; Tarpey, 2007).

D = sqrt( sum_{i=1}^{720} ( C(i) - Y(i) )^2 )      [1]
Figure 2 visualises to which cluster the curves belong. The blue dots give the distance to the centre of cluster 1 and the
red dots the distance to the centre of cluster 2; a spline is an element of the cluster to which its distance is smallest. The
values are only displayed for the first 16 curves, but the same principle applies to the other curves. The lowest of these
values decides to which cluster a curve is assigned: the smaller the distance to the centre of a cluster, the better the curve
fits that cluster. In figure 2 one can see that 15 of the dots are red and only 1 is blue. This is no coincidence: water usage
depends on the season and on the part of the week (weekdays or weekends), so days in the same season and the same part
of the week have a bigger chance of belonging to the same cluster.
Figure 2. Distance to the centre of the cluster (done for two clusters)
For our data specifically, the best number of clusters was not known in advance. A silhouette plot was therefore used to
test and validate which number of clusters leads to the best results; its outcome is a graphical display of how well each
object coheres with its cluster.
The method works as follows. The data are clustered by the K-means method. In the formula below, a(i) is the average
dissimilarity of i to all other objects of its own cluster, cluster 1, mathematically D(i,1) = a(i); b(i) is the average
dissimilarity of i to cluster 2 (D(i,2) = b(i)), which is not its own cluster. For more than two clusters, b(i) is the lowest
average dissimilarity of i to any cluster of which i is not a member. The cluster with that lowest average dissimilarity is
called the neighbouring cluster of i, because it is the next best fitting cluster for curve i. s(i) is a number between -1 and 1;
if it is close to one, curve i is well clustered, so it indicates how well the element has been classified.
s(i) = 1 - a(i)/b(i)     if a(i) < b(i)
s(i) = 0                 if a(i) = b(i)
s(i) = b(i)/a(i) - 1     if a(i) > b(i)      [2]

Or shorter:

s(i) = ( b(i) - a(i) ) / max( a(i), b(i) )      [3]
When the silhouette value is close to one, a(i) is much smaller than b(i), meaning that i is very similar to its own cluster:
the curve is clustered in the right cluster and is very dissimilar to the other cluster. When the silhouette value is close to
minus one, the element should belong to another cluster. The average silhouette value over all curves of a cluster is
therefore an indication of how tight the cluster is, and hence also of whether the chosen number of clusters is correct: the
curves will have the best silhouette values when the number of clusters is optimal, so the right number of clusters yields the
highest average silhouette value (Kaufman & Rousseeuw, 1990; Merkuryeva, 2012).
Visually, three "knife blade" shapes are observed in the upper silhouette plot of figure 3, indicating that three clusters
are formed: the first knife blade contains all the silhouette values of the first cluster, the second those of the second cluster,
and so on.
The third cluster contains many points with low silhouette values, and the 2nd and 3rd clusters contain negative values,
indicating that the separation of those two clusters is not optimal; the average silhouette value is 0.4317. In the bottom
silhouette plot the silhouette values are higher, with very few negative values, indicating tighter clusters; the average
silhouette value is also higher, namely 0.6087. Tests were also performed for four and five clusters, but those gave low
average silhouette values (0.27 and 0.19 respectively).
After several runs with K = 2, 3, 4 and 5 clusters, it became clear that the best results are obtained with two clusters,
with an average silhouette value of 0.6087, which is relatively high and indicates that the choice of two clusters is a good
option (Rousseeuw, 1987).
Figure 3. Silhouette plots for three and two clusters, with average silhouette values of 0.4317 and 0.6087 respectively.
The gap can be filled with the mean or the median of the cluster (not to be confused with the means used above for
clustering). To avoid confusion: the means and medians used further in this paper are the standard mathematical ones,
whereas the means used for clustering were chosen randomly by the K-means algorithm and then, as explained above,
recalculated from the elements of the clusters until those elements stopped changing; they are therefore not necessarily
the same as the mathematical means.
To determine whether the mean or the median is the better choice, the error of both is calculated and plotted in a
diagram (figure 4): each blue circle gives the error of the median on the x-axis and the error of the mean on the y-axis.
This is done for several gap lengths and different starting positions, because for different gap lengths or starting positions
either the mean or the median could give the lowest value. One of these results is shown in figure 4. The conclusion in our
case is that the median is the best option, because the RMSE of the median, computed with formula [6], was slightly lower
than that of the mean. This formula is discussed later in the paper; for now it suffices to know that the median of the
clusters is used to fill the gaps. In the figure one can observe that the average of all the results (red star) has a slightly
lower value for the median than for the mean.
Figure 4. Error of the median and the mean, plotted to decide whether the median or the mean is the more appropriate
option to fill the gaps.
Now that it is known that two clusters and the median are the best options, the gap-filling procedure can start. Two
groups are made: the dry days of 2007 without missing values serve as the days used for the clustering described above,
and the dry days of 2008 without missing values are the curves to which the method is applied. This results in 70 days for
clustering and 55 testing days.
For all the testing curves, the beginning of the artificial gap is chosen at the eighth measured value. Then 14
measurements are left out, which results in an artificial gap of 28 minutes. Next, the RMSE is calculated. This is done for
gaps from the initial length of 14 measurements up to a length of 700 measurements (about 23 h), in steps of 14
measurements (see figure 5); the gaps are thus respectively 14, 28, 42, ..., 700 measurements long.
The curves with gaps are filled with the median of the cluster to which they belong, but a test is needed to decide which
cluster the gapped curve belongs to. This is done as follows. In the formula, Y' is the curve with the gap, M_k are the
medians of the clusters, and k is the number of the cluster. alpha_k and beta_k are linear model parameters, fitted for each
median on the parts before and after the gap; they are used to match the median as closely as possible to the curve with
the gap. The root mean squared error (RMSE) is then calculated for all medians, where t0 and t1 indicate the start and
end points of the gap. These results are compared, and the median with the smallest error is assigned to fill the gaps in
that curve.
Y'(t) ~ alpha_k + beta_k * M_k(t),   k = 1, 2      [4]

RMSE_k = sqrt( (1/(t1 - t0)) * sum_{t in [t0, t1]} ( Y'(t) - ( alpha_k + beta_k * M_k(t) ) )^2 )      [5]
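The cluster-selection test of formulas [4] and [5] can be sketched as below. This is an illustrative Python version under stated assumptions: the linear model is fitted and scored on the measured points outside the gap (marked by a boolean mask), and all names are illustrative.

```python
import numpy as np

def pick_median(curve, medians, gap):
    """Assign a gapped curve to a cluster via formulas [4]-[5] (sketch).

    For each cluster median M_k, a linear model alpha_k + beta_k * M_k(t)
    is fitted on the measured points outside the gap, and the median
    with the lowest RMSE on those points is selected. `gap` is a boolean
    mask marking the missing samples.
    """
    curve = np.asarray(curve, dtype=float)
    obs = ~np.asarray(gap)
    best_k, best_rmse = None, np.inf
    for k, M in enumerate(medians):
        M = np.asarray(M, dtype=float)
        beta, alpha = np.polyfit(M[obs], curve[obs], 1)   # fit alpha_k, beta_k
        rmse = np.sqrt(np.mean((curve[obs] - (alpha + beta * M[obs])) ** 2))
        if rmse < best_rmse:
            best_k, best_rmse = k, rmse
    return best_k
```

The winning median then supplies the segment that will be shifted and rotated into the gap, as described next.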
The part that fills the gap is first shifted up or down to the starting point of the gap, then rotated over an angle that
depends on the starting and ending points of the gap, with the starting point as the rotation centre. Finally the curve is
stretched, because the required length depends on the slope over the gap and will normally not match that of the shifted
and rotated curve. In this way, continuity with the first and last points of the gap is achieved (Jørgensen & Goegebeur,
2007).
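The shift-rotate-stretch adjustment described above can be expressed compactly as a shift plus a linearly increasing correction that pins both endpoints. This is a sketch of that endpoint-matching idea, not the paper's exact geometric construction; names are illustrative.

```python
import numpy as np

def fit_segment(median_seg, y_start, y_end):
    """Shift and tilt a median segment so it joins the gap endpoints.

    The segment is first shifted so its first point equals the measured
    value just before the gap; a linearly increasing correction (the
    rotation/stretch of the text, in sketch form) then makes its last
    point equal the value just after the gap, giving continuity at both
    ends.
    """
    seg = np.asarray(median_seg, dtype=float)
    seg = seg + (y_start - seg[0])             # shift to the gap start
    ramp = np.linspace(0.0, 1.0, len(seg))
    seg = seg + ramp * (y_end - seg[-1])       # tilt to hit the gap end
    return seg
```

Because the correction is zero at the first point and full at the last, the filled segment always passes exactly through both boundary measurements.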
The accuracy is assessed by comparing the real values of the artificially made gaps with the values filled in by the
K-means method, using the groups mentioned above. In formula [6], R_x are the values of the real curve (the curve before
the artificial gap was made) and Y'_x the filled-in values; x = 8 is the starting index of the gap and t indicates where the
gap ends. The sum over all measurements is divided by the number of measurements (t - 8). This yields 55 RMSE curves,
because there are 55 testing days.
Figure 5. The curve with the gap and the indication of the gap length
RMSE = sqrt( (1/(t - 8)) * sum_{x=8}^{t} ( R_x - Y'_x )^2 ),   ( t = 8 + 14k ; k = 1, 2, ..., 50 )      [6]
4. Results and discussions
Once all the results are obtained, the 5% percentiles on both sides are removed to take the confidence interval into
consideration, so figure 6 gives the 90% confidence interval. The mean of all the RMSE curves, obtained with formula [6],
is plotted (red line); the error is always around 1 L/s. The upper and lower boundaries (blue dashed lines) indicate the
range in which the actual accuracy of the filled-in measurements lies: the highest error is below 3 L/s and the lowest error
is around 0.25 L/s.
Figure 6. The upper and lower boundaries of the RMSE and the mean of all the RMSE curves
5. Conclusion & perspectives
The target of this study was to find a method to fill gaps that extend over hours or days. The method is tested on
discharge values; discharge was chosen because previous methods were already tested on this parameter, so using the
same parameter makes it possible to compare the results.
One conclusion is that the results for 2 clusters are better than those obtained with 3 clusters; this holds for our data
set and will depend on the data set. The hypothesised reason is that water usage depends on the season and on the part of
the week (weekdays or weekends), so days in the same season and the same part of the week have a bigger chance of being
in the same cluster. The curves from winter and autumn mostly fall in one cluster and the curves from spring and summer
mostly in the other; the exceptions occurred mainly for weekend days. By hypothesis, water usage on winter and autumn
weekdays is therefore very similar, and the same applies to the weekdays of spring and summer, whereas weekend days
are more difficult to cluster.
The results are promising and the method works equally well for small and large gaps, although the spread of the
confidence interval is large, especially over the first 250 measurements. The hypothesised reason is the big jump in water
usage in the morning hours: between 5 and 8 o'clock the confidence interval is largest, because there the rise in water
demand is largest.
The method needs at least the starting and ending points around the gap to perform the shifting and rotating, so it
cannot be applied if the starting and/or ending point of a day is missing: the starting point is the centre of the rotation,
and both points are needed to determine the rotation angle. This dependence on the start and end points is a downside of
the method; for these situations there is still no solution and a method should be developed.
This method can only fill gaps up to a maximum length of 700 measurements, so methods to fill gaps spanning several
days still need to be developed.
People in other laboratories are working on the same problem, and it would be interesting to test their methods on our
data and compare the results with those obtained by the method we implemented.
References
Falge E., Baldocchi D., Olson R., Anthoni P., & Aubinet M., « Gap filling strategies for defensible annual sums of net
ecosystem exchange », Agricultural and Forest Meteorology, vol. 107, n° 1, Los Angeles, CA, U.S.A., March 2001, p.
43-69.
Graham J., Olchowski A., & Gilreath T., « How Many Imputations are Really Needed? Some Practical Clarifications of
Multiple Imputation Theory », Prevention Science, vol. 8, n° 3, Department of Biobehavioral Health, Penn State
University, September 2007, p. 206-213.
Jørgensen B., & Goegebeur Y., « Prediction and validation, Multivariate Data Analysis and Chemometrics », Lecture Notes
in Computer Science, Department of Statistics, University of Southern Denmark, January 2007.
Kaufman L., & Rousseeuw P., « Finding Groups In Data - An introduction to cluster analysis », Wiley Series in Probability
and Statistics, Hoboken, New Jersey, U.S.A., 1990, p. 321-333.
Kondrashov D., Shprits Y., & Ghil M., « Gap Filling of Solar Wind Data by Singular Spectrum Analysis », Geophysical
research letters, vol. 37, n° 15, August 2010.
Merkuryeva G., « Integrated delivery planning and scheduling built on cluster analysis and simulation optimisation »,
Proceedings 26th European Conference on Modelling and Simulation Ecms, 2012, p. 164-168.
Moffat A., Papale D., Reichstein M., Hollinger D., Richardson A., Barr A., Stauch V., « Comprehensive comparison of gap-
filling techniques for eddy covariance net carbon fluxes », Agricultural and Forest Meteorology, vol. 147, n° 3-4,
December 2007, p. 209-232.
Métadier M., Traitement et analyse de séries chronologiques continues de turbidité pour la formulation et le test de modèles
des rejets urbains par temps de pluie, Thèse de doctorat, INSA, Lyon, 2011.
Rousseeuw P. J., « Silhouettes: a Graphical Aid to the Interpretation and Validation of Cluster Analysis », Journal of
Computational and Applied Mathematics, vol. 20, University of Fribourg, November 1987, p. 53-65.
Santhi P., & Bhaskaran M., « Improving the Efficiency of Image Clustering using Modified Non Euclidean Distance
Measures in Data Mining », International Journal of Computers Communications & Control, 2014, p. 56-61.
Seitchik E., « Trend and Detail: Gap Filling with the Empirical Mode Decomposition », A Senior Project submitted to The
Division of Science, Mathematics and Computing, Bard College, New York, 2012.
Tarpey T., « Linear Transformations and the k-Means Clustering Algorithm: Applications to Clustering Curves », The
American Statistician, 2007.
Wayman J., « Multiple Imputation For Missing Data: What Is It And How Can I Use It? », Annual Meeting of the American
Educational Research Association, Johns Hopkins University, Chicago, IL, U.S.A., 2003.
Zhigljavsky A., « Singular spectrum analysis for time series », International Journal of Forecasting, vol. 25, n° 1, School of
Mathematics, Cardiff University, UK, 2013, p. 103-118.
Yannick Rochtus
Filling gaps in time series
Master's dissertation submitted in order to obtain the academic degree of Master of Science in industrial engineering: construction
Academic year 2013-2014
Faculty of Engineering and Architecture, Department of Industrial Technology and Construction
Chairman: Prof. Marc Vanhaelst
Supervisors: Prof. Patrick Ampe, Mr. Jean-Luc Bertrand-Krajewski