Unsupervised Anomaly Detection in Sensor Data used for … · 2019-01-31 · Unsupervised Anomaly Detection in Sensor Data used for Predictive Maintenance MASTER THESIS Author: MariaErdmann

Unsupervised Anomaly Detection in SensorData used for Predictive Maintenance

MASTER THESIS

Author: Maria ErdmannFaculty supervisor: Prof. Dr. Christian Heumann

Department of StatisticsFaculty of Mathematics, Informatics and StatisticsLudwig–Maximilians–University München

External supervisor: Dr. Sebastian KaiserMunich ReKöniginstrstraße 10780802 München

Abgabe: München, December 3, 2018

III

Statutory DeclarationI declare that I have developed and written the enclosed Master’s Thesis completely bymyself, and have not used sources or means without declaration in the text. Any thoughtsfrom others or literal quotations are clearly marked. The Master’s Thesis was not usedin the same or in a similar version to achieve an academic grading or is being publishedelsewhere.

Munich, December 03, 2018

................................................................Maria Erdmann

IV

Abstract

With the emergence of “Industry 4.0” and advances in technology, anomaly detection onsensor data has become increasingly important within the area of predictive maintenance.This thesis deals with sensor data that are unlabeled, unevenly spaced time series, whichmakes anomaly detection a challenging task. It provides a literature overview on unsuper-vised anomaly detection methods suitable for unevenly spaced time series and introducestwo new methods for anomaly detection, which are based on the Pattern Anomaly Value(PAV) algorithm proposed by Chen and Zhan (2008). The PAV algorithm is the onlymethod explicitly described by their authors as being suitable for unsupervised anomalydetection on unevenly spaced time series. However, it has some limitations, which the newmodifications aim to overcome. The PAV and its modifications are compared with fourbaseline methods from Statistical Process Control, which are adapted according to thepresent application. All methods are implemented in Python and are applied to the sen-sor data. Comparative analyses indicate low similarity between the results of the baselinemethods and those of the PAV variants, which induced the idea to combine the results of allmethods in an ensemble approach. An experiment on simulated data with manually addedoutliers enables performance evaluation and shows promising results for the ensemble.

CONTENTS V

Contents

1 Introduction 11.1 Predictive Maintenance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21.2 Anomaly Detection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4

2 Literature Review 102.1 General Approaches . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 132.2 Sliding Window Approaches . . . . . . . . . . . . . . . . . . . . . . . . . . 152.3 Time Series Approaches . . . . . . . . . . . . . . . . . . . . . . . . . . . . 172.4 Approaches For Streaming Data . . . . . . . . . . . . . . . . . . . . . . . . 18

3 Data 223.1 Sensor Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22

3.1.1 Data Access and Preparation . . . . . . . . . . . . . . . . . . . . . 233.1.2 Descriptive Statistics . . . . . . . . . . . . . . . . . . . . . . . . . . 25

3.2 Simulated Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29

4 Baseline Methods for Anomaly Detection 314.1 The 3σ-Rule . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 314.2 The 3σ-Rule Based on a Rolling Average . . . . . . . . . . . . . . . . . . . 344.3 Exponentially Weighted Moving Average . . . . . . . . . . . . . . . . . . . 374.4 Percentile Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40

5 PAV: Anomaly Detection for Unevenly Spaced Time Series 41

6 Anomaly Detection with Modifications of PAV 446.1 Kernel Density Estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . 456.2 Copulas . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49

7 Experimental Analysis and Results 547.1 Application to Simulated Data . . . . . . . . . . . . . . . . . . . . . . . . . 547.2 Application to Sensor Data . . . . . . . . . . . . . . . . . . . . . . . . . . . 647.3 Implementation and Application of the Copula PAV . . . . . . . . . . . . . 76

8 Conclusion 84

CONTENTS VI

References 95

A Appendix 102A.1 Benchmark Experiment: Bandwidth Selection . . . . . . . . . . . . . . . . 102A.2 Simulated Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106

A.2.1 Performance Measures . . . . . . . . . . . . . . . . . . . . . . . . . 106A.3 Sensor Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109

A.3.1 Additional Figures and Tables for all Channels . . . . . . . . . . . . 109A.3.2 Additional Figures and Tables for Temperature . . . . . . . . . . . 112A.3.3 Additional Figures and Tables for Object Temperature . . . . . . . 113A.3.4 Additional Figures and Tables for Humidity . . . . . . . . . . . . . 114A.3.5 Additional Figures and Tables for Pressure . . . . . . . . . . . . . . 116A.3.6 Additional Figures and Tables for Magnetic Field x-Direction . . . 117A.3.7 Additional Figures and Tables for Magnetic Field y-Direction . . . 119A.3.8 Additional Figures and Tables for Magnetic Field z-Direction . . . . 120

A.4 Digital Appendix . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 122

1. INTRODUCTION 1

1 Introduction

Anomaly detection is the process of finding unusual patterns that deviate from their ex-pected behavior. The research field of anomaly detection is complex and broad with manyfields of application. One application is predictive maintenance, the process of collectingand evaluating data from machines to predict machine failure and degradation of the pro-duction capacity. In particular, data from various sources, such as equipment sensors, areanalyzed to identify unusual patterns and to predict these undesirable events.This thesis aims to present suitable methods for anomaly detection that can be used onsensor data within the application of predictive maintenance. Such data was provided byMunichRE, the collaboration partner of this thesis. The data originates from about 500sensor devices that were either attached directly to machines, machine parts or placedwithin the surrounding facilities such as machinery rooms. The sensor devices recordenvironmental quantities like ambient temperature, object temperature, humidity, pressureand strength of the magnetic field. The resulting data can be briefly described as unlabeled,continuous, unevenly spaced time series. This data structure makes anomaly detection achallenging task.The objectives of this thesis are to provide a literature review on anomaly detection tech-niques that are suitable for the specific data at hand and to implement those methods, thatwere identified to be appropriate, in Python. These methods include the Pattern AnomalyValue (PAV) algorithm by Chen and Zhan (2008), which is the only unsupervised anomalydetection method explicitly described as being suitable for unevenly spaced time series.Based on the work of Chen and Zhan (2008) two new methods are presented, which mod-ify the PAV algorithm in a way that its limitations can be overcome. The methods arecalled Kernel PAV and Copula PAV.The PAV algorithm and its modifications were compared to four baseline methods, namelya method based on Shewhart charts, which is called the 3σ-rule, a modification of the3σ-rule using a rolling average, the exponentially weighted moving average (EWMA) anda method based on percentiles. Since anomaly detection methods within the context ofpredictive maintenance should be able to handle streaming data, the selected methods wereimplemented in a way that they could be used for streaming applications. Subsequently,they were applied to the sensor data and analyzed. In particular, the results of the differentmethods were compared against each other. The results induced the idea to constructan ensemble of anomaly detection methods. The performance of this ensemble and the

1.1 Predictive Maintenance 2

individual methods were assessed within a simulation study.The outline of this thesis is as follows: Chapter 1 provides - in addition to this briefintroduction - some background on the concepts of predictive maintenance (Chapter 1.1)and anomaly detection (Chapter 1.2) as well as the link between these two concepts.In Chapter 2, a literature overview on anomaly detection methods for temporal data isprovided. Chapter 3 introduces the sensor data as well as the simulated data. Chapter4 to 6 elaborate on the theory of the methods used in this thesis. These methods wereapplied to the sensor data as well as the simulated data and evaluated accordingly. Theimplementation and its results are reported in Chapter 7. Finally, Chapter 8 summarizesall findings, mentions various difficulties as well as drawbacks and gives an outlook.

1.1 Predictive Maintenance

Predictive maintenance is a maintenance strategy that monitors the actual operating con-dition of a machine, a system or plant equipment1, and uses this data to schedule allmaintenance activities (Schreiner and J., 2018). Hence, this is a condition-based and data-driven approach to estimate when servicing of equipment is needed. To understand thenature of predictive maintenance, it is valuable to take a look at the main maintenancephilosophies. These are: run-to-failure maintenance, preventive maintenance and predic-tive maintenance (Scheffer and Girdhar, 2004).Run-to-failure maintenance, which is also called breakdown maintenance, is a reactiveapproach, where the equipment is serviced, i.e. it is repaired or replaced, when it breaksdown. This approach is usually applied if the shutdown of the equipment for the servicedoes not affect production and if material costs do not matter (Scheffer and Girdhar, 2004).The disadvantage of run-to-failure maintenance is its high costs, which come from eitherhigh spare part inventory costs or costs due to a required fast delivery of spare parts fromother vendors. Additionally, there are usually costs from overtime labor, high machinedowntime and a low production availability (Mobley, 2002).Applying preventive or time-based maintenance means to schedule all maintenance activi-ties at predetermined time intervals, which can be specific calendar days, run-time hours ofa machine (Scheffer and Girdhar, 2004), or are based on mean-time-between-failure (Mob-ley, 2002). The goal is to repair or rebuild the equipment before a failure occurs. The maindisadvantage of this approach is that the maintenance service may be applied too early or

1In the following, the term “equipment” will be used for all objects, where predictive maintenance canbe performed, such as machines, systems or plant equipment.

1.1 Predictive Maintenance 3

too late, which leads to reduced productivity (Scheffer and Girdhar, 2004). Additionally,scheduling is based on the experience of the maintenance manager (Mobley, 2002).Predictive maintenance is often understood as condition-driven preventive maintenanceapproach (Mobley, 2002). However, others, such as Nagandhi et al. (2015) and Feldmannet al. (2017), see condition monitoring more as a philosophy on its own, which is theprecursor and the basis of predictive maintenance. Condition monitoring checks the equip-ment’s actual mechanical condition, its efficiency and other indicators that determine themean-time-to-failure or loss of efficiency (Mobley, 2002). Common monitoring tools involvenondestructive tests like vibration monitoring, process parameter monitoring, thermogra-phy, tribology and visual inspection. Which diagnostic tool is suitable depends on the typeof industry, the type of machinery and the availability of trained man power (Scheffer andGirdhar, 2004). Prognostic techniques are applied to the data from the monitoring pro-cess(es) in order to determine the mean-time-to-failure and other performance indicators(Lee et al., 2006). These are then used to schedule all maintenance activities.The great advantage of this approach is that scheduling takes place in an orderly fashion,and that it saves costs, since spare parts can be ordered in a timely manner and do notneed to be stocked. Additionally, it increases production capacity. However, an incorrectassessment of the deterioration process leads to potentially unnecessary maintenance ser-vices or services provided at the wrong time or not at all, which in turn leads to increasedcosts. Moreover, condition-based predictive maintenance requires specialized equipment,as well as trained and skilled personnel or, in case the maintenance service is outsourced,the willingness to pay for it (Scheffer and Girdhar, 2004).In many cases, however, the concept of predictive maintenance is taken one step further.Thus, Feldmann et al. (2017) denote predictive maintenance as key innovation of Industry4.0. In addition to the traditional diagnostic tools, this concept includes process perfor-mance data as well as data from sensors to assess the health of the equipment. It isabout smart machines that are networked, and about automated methods to align the sys-tem’s operations and maintenance services. It involves an information flow infrastructureand automation in triggering the maintenance process, starting with ordering spare partsneeded for repairing the equipment. Lee et al. (2014) terms this “e-maintenance” based on“intelligent prognostics” rather than predictive maintenance.“Intelligent prognostics” is a systematic approach to continuously track the health or thedegradation of a machine, and to extrapolate its temporal behavior in order to predict risksand unacceptable (anomalous) behavior, which is not only the failure of the equipment but

1.2 Anomaly Detection 4

also its loss of efficiency. This more holistic and systematic approach, which also can befound in Scheffer and Girdhar (2004) and Mobley (2002) in its basic ideas, aims to improveproductivity, overall effectiveness, system safety, product quality and profitability. It isassociated with a large amount of data that needs to be captured, stored and processed.In addition, it is advantageous to network the sensors and diagnostic processes. In manycases, the technical basis for advanced sensor technology is provided by the Internet ofThings (IoT) (Schreiner and J., 2018).

1.2 Anomaly Detection

Anomaly detection, which is also denoted as outlier detection, has neither a universally ac-cepted name nor definition. The reason for this heterogeneity is that research in this fieldcomes from a wide variety of disciplines (Hodge and Austin, 2004). One of the commonlyused definitions can be found in Chandola et al. (2009), who denote anomaly detectionas the “problem of finding patterns in data that do not conform to the expected behav-ior”. These patterns are called outliers, outlying or discordant observations, anomalies,abnormalities, deviants, exceptions, aberrations, surprises, peculiarities or contaminants(Aggarwal, 2017; Hodge and Austin, 2004). In this thesis, the term outliers or anomalies,as well as outlier detection and anomaly detection will be used interchangeably. Outliersare observations that deviate so much from the other observations that one may assumethat these observations originate from a different data-generating process than the remain-der of the data (Hawkins, 1980).Anomaly detection is also sometimes called novelty detection or noise removal, althoughthese terms are not the same. Novelty detection is more concerned about finding a patternthat has been hitherto unknown. Hence, novelty detection tries to find patterns in newobservations that were not included in the training data. Noise removal, which is alsocalled data cleansing (Goldstein and Uchida, 2016), refers to the pre-processing step, whereunfavorable observations are removed in order to get a signal, which can be used for analysisand inference (Choudhary, 2017).In the recent applications of anomaly detection, the outlying observation itself is of partic-ular interest. Here, outliers are indicators of possible major adverse effects or deviations,whose detection is crucial for the application at hand (Suri et al., 2012). This is theunderstanding also this thesis focuses on.Anomaly detection has many applications. In intrusion detection, the goal is to detectan unauthorized access in computer networks. Fraud detection plays a crucial role with


respect to credit cards, but it is also a topic in health insurance. On the other hand,anomaly detection could be used by equity and commodity traders to monitor individualshares and markets in order to detect novel trends. Based on this, they identify buyingand selling opportunities. In military, anomaly detection is used for the surveillance ofenemy activities.In connection with predictive maintenance, anomaly detection is critical. By monitoringcomplete manufacturing lines or single mechanical components such as motors or turbinesand by applying anomaly detection techniques, a fault or considerable performance degra-dation can be detected early (Hodge and Austin, 2004). Chandola et al. (2009) call thissystem health management or fault detection in mechanical units. They distinguish betweenthis application and the detection of defects in physical structures, such as cracks in beamsor strains in airframes, which they refer to as structural defect detection. This distinctionby Chandola et al. (2009) is made, because in system health management normal datafor training is often readily available and used to model normality, while structural defectdetection is about novelty or change point detection. There are many more applicationsthan those already mentioned. A good overview is provided by Hodge and Austin (2004).At first glance, implementing anomaly detection seems simple. By defining a representationof normal behavior of the data, it is straightforward to identify any observation that doesnot belong to this region. However, Chandola et al. (2009) names several factors thatactually make outlier detection challenging: First, to define a region for normal behaviorany notion of this normal behavior should be considered. This is difficult, since normalbehavior very often keeps evolving. Secondly, the boundary between normal and anomalousbehavior is very often continuous. In particular, observations close to the boundary cannotbe unambiguously assigned to the normal or anomalous region. Additionally, the definitionof an outlier is application-dependent, which often makes the methods used for anomalydetection not transferable or generalizable. Moreover, it is difficult to distinguish betweenanomalies and noise. Finally, one of the biggest challenges in anomaly detection is thatlabeled data is often not available or simply to expensive to acquire.Hence, there is no straightforward approach for anomaly detection. On the contrary, mostanomaly detection problems solve a specific problem formulation, which depends on theapplication at hand.The anomaly detection problem consists of different aspects such as the nature of the data,the availability of labels, the type of anomaly that should be detected, and the desiredoutput. Techniques to solve the specific problems originate from different areas such as


4 · Chandola, Banerjee and Kumar

which the anomalies need to be detected. Researchers have adopted concepts fromdiverse disciplines such as statistics, machine learning, data mining, informationtheory, spectral theory, and have applied them to specific problem formulations.Figure 2 shows the above mentioned key components associated with any anomalydetection technique.

Anomaly Detection Technique

Application Domains

Medical Informatics

Intrusion Detection

. . .

Fault/Damage Detection

Fraud Detection

Research Areas

Information Theory

Machine Learning

Spectral Theory

Statistics

Data Mining

. . .

Problem Characteristics

Labels Anomaly TypeNature of Data Output

Fig. 2. Key components associated with an anomaly detection technique.

1.3 Related Work

Anomaly detection has been the topic of a number of surveys and review articles,as well as books. Hodge and Austin [2004] provide an extensive survey of anomalydetection techniques developed in machine learning and statistical domains. Abroad review of anomaly detection techniques for numeric as well as symbolic datais presented by Agyemang et al. [2006]. An extensive review of novelty detectiontechniques using neural networks and statistical approaches has been presentedin Markou and Singh [2003a] and Markou and Singh [2003b], respectively. Patchaand Park [2007] and Snyder [2001] present a survey of anomaly detection techniques

To Appear in ACM Computing Surveys, 09 2009.

Figure 1: Key components associated with anomaly detection. The application domain,e.g. damage detection, determines the characteristics of the problem, i.e. the nature ofthe data, whether labels are available, etc. These problem characteristics again deter-mine which anomaly detection methods, which can be from various areas such as machinelearning, are suitable for the task. Reprinted from Chandola et al. (2009).


statistics, machine learning, data mining, information theory, spectral theory and others.The interaction of the key components associated with anomaly detection are depicted inFigure 1.The four different aspects of anomaly detection are (Chandola et al., 2009):

Nature of Input Data Data can be either continuous, categorical or binary. Each datainstance can be either univariate or multivariate, where in the latter case the data canbe either of the same type (e.g. all data instances are continuous) or of different types.Data instances can be independent of or related to each other. If data instances arerelated, data are sequence data, spatial data or graph data. In sequence data, datainstances are linearly ordered. Examples include genome sequences, protein sequencesand time series data, the latter being of superior interest in this work. Time seriesdata are characterized by their temporal continuity, which means that data is notchanging abruptly, unless there are anomalous processes at work (Aggarwal, 2017).

Type of Anomaly Anomalies can be classified into three categories: point outliers, con-textual outliers and collective outliers (see Figure 2).

Point outlier If a single instance is anomalous with respect to the remainder of thedata, the observation is called a point outlier.

Contextual outlier If a data instance is anomalous in a specific context, it is termeda contextual or conditional anomaly. In case of temporal data, these are in-stances that deviate remarkably from their adjacent values. Thus, these arepoint anomalies but with respect to their immediate neighborhood.

Collective outlier Collective outliers, which are also termed anomalous subsequences,are a collection of data instances that behave anomalous with respect to the en-tire data set. Here, individual data instances may not be anomalous by them-selves, but the pattern, which they are part of, is anomalous.

While point outliers can occur in any type of data, contextual outliers can only befound in data where the instances are related to each other, for example, in temporaldata. Furthermore, a point outlier or collective outliers can be transformed to contex-tual outliers, when analyzed with respect to its context. Aggarwal (2017) states thatin time series data there are either contextual or collective outliers, while Gupta et al.(2014a) differentiates between point outliers and anomalous subsequences. Hence,


one can assume that the terms point outliers and contextual outliers are sometimesused interchangeably in temporal data.

Data Labels When data labels are available, techniques operate in a supervised mode.Typically, anomaly detection is done by classification. This involves to train themodel on labeled data, which is called training data, and to predict or identify theanomalies on some new data, called test data. In a semi-supervised setting, thereare only labels available for the normal class. These labels, or in other words thisnormal data, are used for training, i.e. to establish a model of normality. This modelis then used to identify anomalies in new data points (test data). If there are nolabels available, techniques must operate in an unsupervised mode. In this case, notraining data is needed. However, often unsupervised techniques can be implementedin a way that they suit the semi-supervised task. Thus, normal data can be used fortraining and prediction takes place on test data.

Output Anomalies can be either reported in the form of a score or as binary label (normalor anomalous). Most techniques output a score, which is a measure reflecting thedegree of outlierness of a data instance. This score can be used to create a rankedlist of anomalies. From this list, the user can select the top k anomalies or use acut-off threshold to select the anomalies and to assign labels (normal or anomalous)to each data instance.

The four key factors, which are determined by the application, lead to a specific problemformulation. Given the data provided for this thesis and the application of predictivemaintenance, also the problem of this thesis can be formulated with the presented four keycharacteristics. The sensor data can be describes as being continuous, unevenly spaced timeseries data, representing a strong temporal continuity. Although various environmentalparameters are measured at the same time, the focus is on identifying outliers in univariatedata streams. As there are no labels available, methods that can be used in an unsupervisedmode are of primary interest. Additionally, in the application of predictive maintenanceone is not interested in a specific type of outlier. All anomalies that indicate degradationor an effect of a major fault are of importance.


2018-03-22

2018-03-29

2018-04-05

2018-04-12

2018-04-19

2018-04-26

2018-05-03

Time

20

25

30

35

40

45

Tem

pera

ture

[°C]

(a) Point outlier (red).

2009-01-01

2010-01-01

2011-01-01

2012-01-01

2013-01-01

Time

5

0

5

10

15

20

25

Tem

pera

ture

[°C]

(b) Contextual outlier (red).

2018-042018-05

2018-062018-07

2018-082018-09

Time

18

20

22

24

26

28

Tem

pera

ture

[°C]

(c) Collective outlier (red).

Figure 2: Types of anomalies. (a) and (c) are sensor temperature data provided for thisthesis. (b) is the daily average temperature in Germany from 2009-2013 displaying anunusual temperature in April, 2012. Data was taken from Trading Economics (2018).

2. LITERATURE REVIEW 10

2 Literature Review

There has been extensive work on anomaly detection since the beginning of the 19thcentury (Chandola et al., 2009) and a large quantity of various methods has been developedin several research disciplines. Aggarwal (2017), Chandola et al. (2009) and Hodge andAustin (2004) provide comprehensive overviews on various anomaly detection methods.However, most of the work do not specifically consider the temporal aspect of the data.Research in anomaly detection for temporal data, especially time series data, is fragmentedover different application domains, and a thorough understanding is missing. Particularly,insight is lacking, how the various methods are related to each other and differ fromeach other. An abstract classification of these anomaly detection methods based on theirtheoretical principles has not yet been provided. However, there is some work that tries toorganize the multitude of methods.Golmohammadi and Zaiane (2015) classify outlier detection into two orthogonal directions:the transformation dimension and the dimension of the anomaly detection technique. Thetransformation dimension refers to the way the data is preprocessed, i.e. transformed priorto applying a specific technique that aims to identify anomalies. The various types oftransformations have different objectives:

Aggregation Aggregation reduces dimensionality by replacing a certain number of suc-cessive values with a representative value (e.g. average) of them. This refers toapplying a low-pass filter, which suppresses the high frequency signals in order toreveal the low-frequency features.

Discretization Discretizising a time series means to convert the continuous data into cat-egorical data. For example, continuous values are mapped to letters of an alphabet.One such approach was proposed by Lin et al. (2007). The objective for discretiza-tion is reducing computational complexity, as it is, for example, used in Keogh et al.(2005)s’ HOTSAX algorithm.

Signal Processing With the help of, for example, Fourier transforms or wavelet trans-forms, data is mapped to a different space. This helps to detect anomalies at adifferent scale and may reduce dimensionality.

Differencing Differencing is used to stabilize the mean, hence, to make non-stationarytime series stationary. For example, computing the change between consecutive ob-servations (1st-order differencing) helps to eliminate or reduce trend and seasonality


Figure 3: Overview on outlier detection in temporal data based on the various types ofdata. Reprinted from Gupta et al. (2014a,b).

in time series. Besides 1st-order differencing, it is occasionally necessary to compute2nd-order differences to consider “the change in the changes”, or to do some sort ofseasonal differencing (Hyndman and Athanasopoulos, 2018).

There are some disadvantages associated with transformation of the data. Firstly, theresults of anomaly detection on the aggregated time series needs to be mapped back tothe original time series in order to provide meaningful information. Discretization doesnot reduce dimensionality. Additionally, most discretization techniques require the wholetime series in order to provide a meaningful alphabet, and thus, cannot be applied in astreaming setting. Moreover, distance measures may no longer be reasonable when datahas been discretized or processed.However, transformations provide the possibility to investigate different aspects of thedata. Thus, combining transformation with anomaly detection methods results in a greatvariety of anomaly detection approaches.Regarding the dimension of anomaly techniques, there are a couple of suggestions forcategorizing or organizing the plentitude of methods. For example, Gupta et al. (2014a,b)provide an overview on outlier detection for temporal data, where methods are classifiedwith respect to the actual type of data (see Figure 3). Thus, they distinguish betweenmethods for time series data, data streams, distributed data, spatio-temporal data ornetwork data. Furthermore, anomaly detection in time series data can be partitioned intoanomaly detection on time series databases or in a single time series. For the latter, theyclaim that there are two types of anomalies: point anomalies and subsequences as outliers,which refer to anomalous shapes in the time series and are also called collective outliers.


Aggarwal (2017), as well, organizes the methods for outlier detection in time series by clas-sifying them according to the type of anomaly they detect. There, methods for contextualoutliers are distinguished from methods for collective outliers.A reasonable taxonomy of outlier detection methods that should be highlighted was pro-posed by Ben-Gal (2005) and adopted by Suri et al. (2011), although it is not specifically fortemporal data. Ben-Gal (2005) distinguishes between univariate and multivariate meth-ods, and between parametric and non-parametric methods. Parametric methods, which arealso called statistical methods, assume that there is a known underlying distribution of thedata. This distribution can be characterized by the probability density function f(x; Θ),whose model parameters Θ – if there are any – are estimated from the data (Eskin et al.,2001, quoted from Chandola et al., 2009). Using statistical methods, one assumes thatnormal data is generated by a specific stochastic model, while anomalous data do not orig-inate from this model. This means, that normal data is presumed to occur in regions ofhigh probability of the stochastic model, while anomalous data tend to appear in regionsof low probability. Hence, data instances that have a low probability are declared outliers.For the non-parametric techniques, no assumptions on the underlying distribution aremade. The model structure is defined solely by the data (Chandola et al., 2009). Thenon-parametric techniques are further classified into distance-based, density-based andclustering-based methods (Suri et al., 2011). The latter classification appears to be some-how artificial, as there are great overlaps between the individual classes. Hence, for thepurpose of this work, it is sufficient to distinguish between parametric and non-parametricmethods.

All structural organizations provided by the surveys mentioned above seem reasonable.However, there are several reasons why it is not possible to simply use one of these organi-zations to provide a comprehensive overview on unsupervised anomaly detection methodsfor unevenly spaced time series. Firstly, since the sensor data used in this work is not la-beled, only methods for unsupervised anomaly detection will be reviewed. Secondly, manyunsupervised methods presented by Aggarwal (2017), Chandola et al. (2009) and Guptaet al. (2014a,b) assume that normal training data is readily available. If this would be thecase a model of normality can be fitted to the data, and anomaly detection takes placeon some test data. However, in the case of this work, there is no a priori knowledge onwhether the data is normal or not. Hence, also training needs to be done in an unsuper-vised manner. Additionally, the peculiarity of the sensor data is an unevenly spaced time

2.1 General Approaches 13

Anomaly Detection Methods

General Approaches Sliding Window Approaches Time Series Approaches

Evenly spacedtime series

Unevenly spacedtime series

Figure 4: Structure of the literature overview provided in this work

series data, which considerably narrows down the methods that are applicable. Lastly, themethods presented should suit the predictive maintenance context or should actually becreated for this specific application. Hence, this work tries to provide a distinctive orga-nization of methods for completely unsupervised anomaly detection on time series datafor the application of predictive maintenance, based on the surveys mentioned so far. Thestructure is schematically presented in Figure 4.Three types of anomaly detection techniques are distinguished: The first type does notconsider the temporal aspect of the time series data and comprises approaches for anomalydetection for i.i.d. data. Hence, these are general approaches. They can be modifiedby using a sliding window, with the result that the temporal aspect of the data is nowconsidered. The third type of methods, considers the time series structure specifically.Methods for evenly and unevenly spaced time series will be presented.To round off this literature overview, a variety of anomaly detection methods for a stream-ing environment is presented, as streaming is of particular interest in predictive mainte-nance.

2.1 General Approaches

Although there is certainly a main drawback in applying techniques that do not considerthe temporal correlation of adjacent data points, these methods are widely used in practice.With regards to the parametric techniques for anomaly detection, it is very common toassume that the data is normally distributed. Methods that are based on this assumptionare Shewhart charts, the box plot rule and a couple of statistical tests.Shewhart charts have their roots in Statistical Process Control (SPC). The Shewhartmethod declares data instances as outliers, if they are more than 3σ (three times thestandard deviation) away from the mean µ. Assuming a normal distribution, the µ ± 3σ

2.1 General Approaches 14

region contains 99.7% of the data instances (Shewhart, 1931, quoted from Chandola et al.,2009). More details on this method are provided in Chapter 4.1.The box plot rule (Laurikkala et al., 2000; Horn et al., 2001; Solberg and Lahti, 2005;Guttormsson et al., 1999, quoted from Chandola et al., 2009) is a graphical approachto identify outliers. The boxplot is a 5-point plot displaying the median, the lower andupper quartile (Q1, Q3), as well as the largest non-outlying observations. Data points aredeclared outliers, if they are 1.5 · IQR lower than Q1 or 1.5 · IQR higher than Q3, whereIQR is the interquartile range (Q3 − Q1). The 1.5 · IQR rule is a heuristic proposed byLaurikkala et al. (2000) that makes the box plot rule equal to the Shewhart method withits 3σ-limits.The Grubb’s test, also called maximum normed residual test, uses the z-score

z = |x− x|s

(1)

as statistic, where x is the mean and s the standard deviation of the sample. A datainstance is declared an outlier, if

z >n− 1√n

√√√√ t2α/(2n),n−2

n− 2− t2α/(2n),n−2(2)

where n is the size of the sample and t2α/(2n),n−2 is the α/(2n) quantile of the t-distributionwith n−2 degrees of freedom. Thus, the hypothesis of no outliers is rejected at a significancelevel of α.Other statistical tests are the student’s t-test (Surace et al., 1997; Surace and Worden,1998, quoted from Chandola et al., 2009), the Rosner test (Rosner, 1983, quoted fromChandola et al., 2009) and the Dixon test (Hawkins, 1980, quoted from Chandola et al.,2009).The disadvantage of these parametric methods is that they rely on the assumption that thedata was generated from a specific distribution (here: Gaussian distribution). However,this assumption does not hold true in most cases (Chandola et al., 2009). In particular,these methods cannot detect outliers in small samples. Moreover, the mean, as a measureof central tendency, and the standard deviation, as a measure of scale, are heavily impactedby outliers. Hence, the indicator that guides anomaly detection, is itself altered by thepresence of outliers. Thus, using non-parametric approaches for anomaly detection, suchas using the median and the mean absolute deviation (MAD) instead of the mean and the

2.2 Sliding Window Approaches 15

standard deviation, is more robust (Rousseeuw and Croux, 1993). The MAD is defined asMAD = c ·median(|xi−median(x)|), i = 1, . . . , n, where median(x) is the median of thesample x = (x1, . . . , xn) and c is a constant determined by c = 1/Q3. If data is normallydistributed, c = 1.4826. For anomaly detection, a common choice for the threshold is:

|xi −median(x)|MAD

> | ± 3|, i = 1, . . . , n (3)

(Leys et al., 2013).

2.2 Sliding Window Approaches

The problem with methods that do not consider the temporal dependency of the data isthat they disregard possible seasonality, trends or change points in the data. Computingthe standard deviation of a long time series may result in a large value (see Figure 10) suchthat only global outliers can be detected. If the goal is to detect outliers that are anomalouswith respect to their neighboring values, it is necessary to consider this neighborhood, i.e.to divide the time series into subsequences of a fixed window length m. By using a slidingor rolling window, all possible subsequences are considered. Methods that use a slidingwindow can be applied to evenly and unevenly spaced time series.Hence, Shewhart’s 3σ-rule can be modified using a rolling mean, as it is described in moredetail in Chapter 4.2.An approach to use the median on time series data has been proposed by Basu andMeckesheimer (2007). They propose a two-sided and a one-sided median method, whichthey apply to sensor data.For the two-sided median method, the median mt = median(xt−κ, . . . , xt, . . . , xt+κ) of thevalues in a window of size 2κ is computed. The window starts at time point t−κ and endsat t + κ. By using the absolute value of the difference between mt and xt and comparingthis difference to a threshold, outliers are identified. In case |xt−mt| ≥ τ , xt is labeled anoutlier.The one-sided median method is applied to identify outliers, when solely historic data isavailable. Here a one-sided median of the raw data x and a one sided median of the firstdifferences z = xt − xt−1 are computed:

m(x)t = median(xt−2κ, . . . , xt−1) (4)

m(z)t = median(zt−2κ, . . . , zt−1) (5)

2.2 Sliding Window Approaches 16

Then, the predicted value for xt is computed as a weighted average:

xt = m(x)t + κ · m(z)

t (6)

If |xt − xt| ≥ τ , the value xt is denoted an outlier. Although these methods providedgood results, they are sensitive to the appropriate choice of threshold τ and τ , respectively,which needs to be determined by the user and therefore requires some knowledge on theapplication.Hill and Minsker (2010) propose three anomaly detection methods that use a movingwindow. Particularly, a moving window of the last m measurements

Dt = xt−m+1, . . . , xt (7)

is used to predict the next measurement xt+1 and to classify this measurement as nor-mal or anomalous. Hence, their approach uses a one-step-ahead prediction model thattakes Dt as input to predict xt+1. Then a prediction interval is computed and used todetermine whether xt+1 is an anomaly. The first method they propose for the one step-ahead-prediction is the nearest cluster (NC) predictor. Similar training data are groupedto clusters based on their moving windows and the Euclidean distance metric is computedbetween xt+1 and each training sample2. The predicted value of xt+1 is calculated based onthe average of all measurements in the cluster that xt+1 maps to. The two other methodsfor the one-step-ahead prediction are the single-layer linear network (LN) predictor andthe multilayer perceptron (MLP) predictor. The LN predicts xt+1 as a linear combinationof m previous measurements:

xt+1 = bm−1∑i=0

wixt−i (8)

where b and {wo, . . . , wm−2} are the weights that are learned by the delta learning rule.The MLP is a feed-forward network with sigmoid activation functions in the hidden layerand a linear activation function in the output layer. It is trained with the backpropagationalgorithm based on gradient descent and momentum. The number of hidden layers, andnodes in the hidden layers, as well as the learning rate, the momentum and the number ofepochs are hyperparameters that were selected by a trial-and-error approach. The approach

2In this work, a training sample consists of a input vector, which is the moving window Dt and anoutput vector, which is the value to be predicted xt+1.

2.3 Time Series Approaches 17

by Hill and Minsker (2010) has the great advantage that the threshold for differentiatingbetween normal and anomalous data is not user-defined, but rather a prediction intervalthat is based on an estimate of the standard deviation, for which 10fold cross-validation isused. Additionally, it is fast and scalable to large data sets.One additional algorithm should be mentioned here, although this method is intendedto be applied on evenly spaced time series and no work has provided evidence yet thatthis method works well on unevenly spaced time series. However, investigation of themethodology used suggests that modification of the algorithm is possible, such that itcan be used for unevenly spaced time series. Wei et al. (2005) use a lead and lag slidingwindow, and within each window, the time series is discretized via Symbolic AggregateApproximation (SAX) (Lin et al., 2007). A modification of SAX to unevenly spaced timeseries data may be achieved by applying time horizons instead of specifying the numberof observations in the aggregation step. The SAX representation is then used to createso called chaos game bitmaps, which are matrices representing the count of single lettersor subwords occurring in the SAX representation of the time window. Subsequently, thedistance between the bitmaps is measured and reported as anomaly score at each timeinstance.

2.3 Time Series Approaches

Time series approaches are techniques that consider the internal structure of the time seriesdata. This means, that they consider that data may have a trend or seasonal variation andthat observations close together are correlated (autocorrelation). But these approachescan solely be applied to evenly spaced time series. Additionally, these methods oftenrequire that normal training data is available. So far, little work has been published onunsupervised anomaly detection for unevenly spaced time series. Namely, there is a singlework by Chen and Zhan (2008), which claims to be applicable to unevenly spaced timeseries. Chen and Zhan (2008) propose an approach where infrequent patterns within a timeseries are indicative for anomalous patterns. More details on this method are provided inChapter 5.Traditionally, unevenly spaced time series were transformed to evenly spaced time series byuse of some form of interpolation. Then the common time series methods can be appliedto the transformed data. Since this is still an approach commonly used in practice, somemethods for anomaly detection on evenly spaced time series are shortly presented.For detecting point outliers, it is prevalent to use a parametric statistical approach, which

2.4 Approaches For Streaming Data 18

is also often referred to as prediction-based or regression model based outlier detection. Itconsists of two steps: First, a regression model is fitted to the data, such as an autoregres-sive (AR), moving average (MA), autoregresssive moving average (ARMA) or integratedautoregressive moving average (ARIMA) model. Second, the anomaly score is computedbased on the residual of the test instance. Hence, the difference between the predictedand the observed value of the test instance is computed. The magnitude of this differenceserves as anomaly score. As the presence of anomalies during training might influence theresult of the model, and may lead to inaccurate results in the worst case, using robustregression is advisable (Rousseeuw and Leroy, 1987, quoted from Chandola et al., 2009).For example, in Bianco et al. (2001, quoted from Chandola et al., 2009) and Chang et al.(1988, quoted from Aggarwal, 2017) robust anomaly detection on ARIMA models are ap-plied. The robustness originates in a robust parameter estimation and robust filtering.Chen and Liu (1993) jointly estimate model parameters and outlier effects by successivelyanalyzing the data with adjusted ARMA models and by removing potential outliers thatwere detected in a previous step.For detecting collective outliers (anomalous subsequences) within a time series, discorddiscovery by Keogh et al. (2005) is a common method. A time series discord is definedas being maximally different to the remaining time series subsequences. In their paper,they propose two types of algorithms. The first approach is brute force, since the discordsearch, i.e. the search for the subsequence that has the largest distance to its nearestnon-self match, is done by computing the Euclidean distances between each pair of sub-sequence. The second approach is HOTSAX, which improves computational efficiency byusing a data structure that is based on SAX.

2.4 Approaches For Streaming Data

In this chapter, some methods for anomaly detection in streaming data are introduced,since ultimately anomaly detection in the application of predictive maintenance aims toforecast future trends and to detect anomalies at an early stage in a streaming context.Streaming data adds additional challenges to the demanding task of unsupervised anomalydetection. Firstly, data must be processed in real-time. Batch processing and a lookinto the future is not possible. Secondly, often there is a multitude of streams providinginformation on the application under investigation. Thirdly, anomaly detection should beperformed in an unsupervised and automated fashion, where the latter means that the


hyperparameters of the methods should not be adjusted manually. Moreover, anomaliesshould be detected as early as possible and, at the same time, the false positive rate aswell as the false negative rate, should be kept small. In addition, every application has itsown constraints.The following methods are consolidated by the Numenta Anomaly Benchmark (NAB)project, which is an open source framework that provides a “controlled and repeatableenvironment of [...] tools to test and measure anomaly detection algorithms on streamingdata” (Lavin and Ahmad, 2015). They provide data from different application domainsthat are partially labeled and a scoring system. Both can be used to evaluate perfor-mance of anomaly detection algorithms for streaming data. The project is designed tohelp researchers to evaluate their own methods and to compare them with existing meth-ods for streaming data. The methods implemented include Hierarchical Temporal Memory(HTM), Skyline developed by Etsy.com, Twitter’s anomaly detection methods, BayesianOnline Changepoint detection, EXPoSE and Multinomial Relative Entropy (Ahmad et al.,2017).HTM is a theory inspired by neuroscience, especially by the structure and functionalityof the neocortex of the mammalian brain. In the context of anomaly detection, HTM isused to make multiple predictions for an instreaming observation xt+1. At time t+ 1 thesepredictions are compared to the observed value and an anomaly score is calculated. Theanomaly detection system retains information on the distribution of the anomaly scorescomputed so far. Therefore, at every time step a likelihood, on whether the current anomalyscore is from the respective distribution, can be calculated. The anomaly likelihood servesas a threshold to finally identify whether the instreaming data point is an anomaly (Lavinand Ahmad, 2015).Skyline is a real-time anomaly detection system based on an ensemble of (simple) detectorsand a voting scheme. A data instance is flagged as being anomalous if the majority of thedetectors agree that this data instance is an anomaly. A couple of simple detectors arebuilt-in, such as the deviation from moving average (this refers to the Shewhart methodusing a rolling average explained in 2.2), the deviation from the least square estimate andthe deviation from the histogram of passed values (Lavin and Ahmad, 2015; Stanway,2013).Twitter provides two algorithms that make use of the combination of various statisticaltechniques. Namely, these are Seasonal-ESD (S-ESD) and Seasonal-Hybrid-ESD (S-H-ESD). In S-ESD, the time series data is decomposed to eliminate trend and seasonality.


Then ESD (Extreme Studentized Deviant) (Rosner, 1983) is applied to the residual compo-nent of the time series. ESD, also called Rosner’s test, is a further development of Grubb’stest. While Grubb’s test only detects single outliers in a univariate data set, which followsan approximately normal distribution (NIST/SEMATECH, 2013b), the ESD can detectmultiple outliers and requires only an upper bound k for the suspected number of outliers.Depending on this upper bound, k tests are performed, with the following hypothesis:H0: There are no outliers in the data set.H1: There are k outliers in the data set.The test statistic is computed for the k most extreme values in the data set:

Rk = maxk|xk − x|s

(9)

where x and s denotes the sample mean and the sample standard deviation, respectively.The most extreme values are those that maximize |xk − x|. The statistic is compared toa critical value λk, and the data instance is removed from the data set, if it is flaggedan anomaly. Then the critical value λk is recalculated. This process is repeated k times(Hochenbaum et al., 2017; NIST/SEMATECH, 2013b).S-ESD is suitable for detecting local and global outliers, but it suffers from the problem thatmean and standard deviation – as explained earlier – are highly sensitive to outliers. Thisresults in a high rate of false negatives. To solve this problem, S-H-ESD was introduced,which uses robust statistics such as the median and the MAD (Hochenbaum et al., 2017).Bayesian Changepoint detection proposed by Prescott Adams and MacKay (2007) is aBayesian algorithm, that identifies online, whether there is a state transition in a sequenceof data. It uses the Bayes’ theorem to calculate the posterior probability of the currentrun length ri, which refers to the time that has elapsed since the last change point. Giventhe run length at time t, the run length at time t + 1 is either be set back to zero (if achange point is detected) or increased by 1 (if current state continuous).EXPoSE (EXpected Similarity Estimation) by Schneider et al. (2016) is a kernel-basedestimator that computes the similarity between a new data point and the distribution ofregular data seen so far. A score, which is the likelihood for belonging to the normal class,is computed as the inner product between the feature map φ(z) and the kernel mean mapµ[P] of the distribution of the normal data P:

η(z) = 〈φ(z), µ[P]〉 (10)


The estimator can be learned incrementally and is therefore suitable for the streamingapplication.Wang et al. (2011) introduced the Multinomial Relative Entropy, which uses the relativeentropy statistic to test multiple hypotheses. Each observed value is compared against mul-tiple null hypotheses. If the new data instance does not agree with one of the hypotheses itis declared anomalous and a new hypothesis is declared. For rejecting or accepting the hy-pothesis, the relative entropy statistic is computed. Its value is compared with a thresholdthat represents an acceptable level of false negatives of the Chi-square distribution.

3. DATA 22

3 Data

This chapter introduces the data basis for the present work. The sensor data provided byMunichRe are introduced in Chapter 3.1, along with some summary statistics. As thesedata are not labeled, performance evaluation, as is commonly used in supervised learning,is not possible. In order to assess performance of the methods presented in this thesis,data was simulated. These data are presented in Chapter 3.2.

3.1 Sensor Data

The data base for this work was provided by the IoT Department of MunichRe. It com-prises sensor data from machinery, technical equipment and surrounding facilities, such asmachinery rooms. In 2017, several battery-powered sensor devices were installed in variousplaces of 17 small and medium sized enterprises (SMEs), manly from manufacturing in-dustry. Popular locations were electrical cabinets, server rooms or cooling compressors. Asensor device can measure multiple quantities. These quantities, which are called channels,are listed and described in Table 1

Table 1: Channels of the sensor devices. The unit of measurement is depicted in squarebrackets.

Channel Channel Description

Temperature Ambient temperature [◦C]Object temperature Object temperature (sensor that directly points to an object

and measures its temperature) [◦C]Humidity Relative humidity [%]Magnetic field, x/y/z Strength of magnetic field in three dimensions [Gs]Pressure Atmospheric pressure [hPa]Battery voltage Battery voltage of the sensor device [V]

Depending on the application, i.e. whether the sensor is attached to a compressor, orplaced within an electrical cabinet or is aimed to measure the room’s temperature, certainchannels record data, while others are switched off, because they are not relevant to theapplication.

3.1 Sensor Data 23

Overall, there were 567 sensor devices installed, each of them having a unique identifier(uid).Recording of the data started in February, 2018 for some data sets, while for others thestart was a little later.The sensor device took measurements on a very high frequency, however a signal to theplatform was sent on a less frequent basis to safe battery of the sensor device. But if changesin measurements between the regular transmission times were above or below a certainthreshold, then the affected measurement was sent to the platform immediately. The rulesfor measuring and sending were defined and implemented for each channel individually,but are not available for this work.The result of the data generating process is that data are represented as time series.Formally, a time series is defined as a series of data points indexed with time stamps:

X = 〈v1 = (x1, t1), v2 = (x2, t2), . . . , vn = (xn, tn)〉 , (11)

where vi = (xi, ti) refers to a data point xi at time stamp ti.In particular, the sensor data is characterized by varying sampling intervals ∆t = ti − ti−1

between successive time points. Time series with varying sampling frequencies, i.e. ∆t 6=const, are called unevenly spaced time series. Since the sensor data are unlabeled, theywere not split into training set Xtrain and test set Xtest for analysis, rather the completetime series was used for training and predictions, hence X = Xtrain = Xtest applies.

3.1.1 Data Access and Preparation

Data was accessed per channel from a platform. For each channel, data were stored underthe uid of the sensor device from which the data originated.Not all sensor devices sent information to the platform and not all sensor devices could beaccessed while dumping the data. Therefore, the total number of data sets per channelvaries. Table 2 shows the resulting total number of data sets for each channel. Datasets that contained either no or just one observations were removed from the data basefor analysis. In addition, some data sets dropped out, due to errors encountered duringanalyses. The number of data sets with one or no observation and the final number of datasets for analysis are shown in column 3 and 4, respectively of Table 2.Channel battery voltage, which indicates the battery status of the sensor device, was notused for any analysis.

3.1 Sensor Data 24

Table 2: Number of data sets per channel.

Channel Noverall Nremoved Nafter filtering

Temperature 544 48 495Object temperature 547 350 193Humidity 547 50 496Pressure 547 51 493Magnetic field, x 529 286 259Magnetic field, y 528 286 259Magnetic field, z 529 286 260Battery voltage 490 0 490

Due to a firmware update on the data platform, which lead to invalid measurements,specific data points needed to be erased. In particular, ambient and object temperaturedata greater than 150 degrees, a humidity greater than 100 percent and a pressure ofgreater than 1500 hPa were considered to be indicative of the firmware update. The af-fected period and all data points 2 hours before and 2 hours after the period were removed.

Since the anomaly detection methods implemented within the course of this thesis shouldnot only be applied to the “raw” data, but also “transformed data”, data was transformedby using first-, second- and third-order differences of the raw data.Differencing of time series is commonly used to make non-stationary time-series stationary.First-order differencing is defined by

∆xt = xt − xt−1 (12)

(Cowpertwait and Metcalfe, 2009). Hence, this is the change between each observation inthe original time series (the raw data). Writing this with the backward shift operator B,results in ∆xt = (1−B)xt. Higher ordering differencing can be expressed with

∆n = (1−B)n (13)

(Cowpertwait and Metcalfe, 2009).

3.1 Sensor Data 25

Table 3: Summary statistics describing the data sets’ length per channel.

Mean Std Min 25% perc. 50% perc. 75% perc. Max

Temperature 3698.79 5135.35 13 1953.75 2902.50 4773.50 83249Object temperature 5339.93 11599.73 3 2132.00 3255.00 5062.00 102964Humidity 3410.21 4493.79 6 1922.00 2800.00 4599.00 83080Pressure 2769.80 3876.51 4 1643.50 2292.00 3913.25 83022Mag x 26007.10 20683.17 7 11596.00 18268.00 42295.00 144597Mag y 25451.36 19642.96 7 11595.50 18268.00 42419.00 144011Mag z 26054.48 20718.48 7 11596.00 18268.00 42419.00 144729Battery 202.04 366.28 2 105.00 164.00 231.00 5720

std: standard deviation, min: minimum, perc: percentile, max: maximum

3.1.2 Descriptive Statistics

The average number of observations in the individual data sets varies with respect to thechannel (see Table 3). While the average number of observations is between 25 451 and26 054 for the sensors capturing the strength of the magnetic field, the average number ofobservations for the other channels ranges between 2770 and 5340. On contrary, the datasets for battery voltage has on average 202 observations. From the second column, it canbe derived that the number of observations per data set varies quite a bit. The maximumnumber of observations observed in the data is 144 729 (for strength of magnetic field inthe z direction).

Table 4 presents summary statistics on the sampling frequencies of all data for each chan-nel. Sampling frequencies vary between 1 and maximum 16 764 342 seconds (= 194 days).Large sampling frequencies originate from data sets, where measurements needed to beremoved within the course of data pre-processing. The median value shows that for mostchannels the regular sampling frequency was one hour (= 3600 seconds) or 300 seconds.

Each individual time series was plotted and summary statistics were computed. All plotsand summary statistics can be found in the electronic appendix. As an example, the resultsof one sensor device, where all channels of the sensor device recorded data, are plotted inFigure 5. The corresponding summary statistics are provided in Table 5. Additionally, thefirst-order differences of the example are shown in Figure 6.

3.1 Sensor Data 26

Table 4: Summary statistics for sampling frequencies (in seconds) per channel.

Mean Std Min % perc. 50% perc. 75% perc. Max

Temperature 3466.44 56520.40 1 1200 3600 3600 16764342Object temperature 2397.84 4493.79 1 118 1620 4599 12397776Humidity 3994.96 54499.42 1 3600 3600 3601 12414106Pressure 4822.94 80349.83 1 3600 3600 3658 15703482Magnetic field x 314.30 11150.63 1 300 300 300 12234182Magnetic field y 317.31 11209.19 1 300 300 300 12234182Magnetic field z 314.15 11141.86 1 300 300 300 12234182Battery voltage 68548.08 193052.22 1 1800 85904 86400 12485576


Table 5: Summary statistics for the example data of sensor device 001BC50C70000CB7presented in Figure 5.

count Mean Std Min 25% perc. 50% perc. 75% perc. Max

Temperature 4860 20.8975 1.6222 17.3000 19.5000 21.1000 22.2000 30.1000Object temperature 5067 21.0020 1.5868 18.3000 19.6000 20.9000 22.4000 28.1000Humidity 5756 42.6117 8.0001 13.5000 40.0000 43.0000 47.0000 63.0000Pressure 4142 974.4305 7.3922 951.0000 971.0000 975.0000 979.0000 979.0000Magnetic field x 42766 0.0007 0.0002 0.0003 0.0005 0.0006 0.0009 0.0056Magnetic field y 42272 0.0007 0.0001 0.0004 0.0007 0.0007 0.0008 0.0001Magnetic field z 42890 0.0011 0.0002 0.0005 0.0009 0.0010 0.0013 0.0021Battery voltage 412 27.9150 14.0606 3.5400 27.7500 35.7000 36.2000 48.0000


3.1 Sensor Data 27

2018-032018-04

2018-052018-06

2018-072018-08

2018-092018-10

Time

18

20

22

24

26

28

30

Tem

pera

ture

[°C]

2018-032018-04

2018-052018-06

2018-072018-08

2018-092018-10

Time

18

20

22

24

26

28

Obje

ct te

mpe

ratu

re [°

C]

2018-032018-04

2018-052018-06

2018-072018-08

2018-092018-10

Time

20

30

40

50

60

Hum

idity

[%]

2018-032018-04

2018-052018-06

2018-072018-08

2018-092018-10

Time

950

960

970

980

990

Pres

sure

[hPa

]

2018-062018-07

2018-082018-09

2018-10

Time

0.004

0.002

0.000

0.002

0.004

Mag

netic

fiel

d x

[Gs]

2018-062018-07

2018-082018-09

2018-10

Time

0.0004

0.0005

0.0006

0.0007

0.0008

0.0009

0.0010

0.0011

0.0012

Mag

netic

fiel

d y

[Gs]

2018-062018-07

2018-082018-09

2018-10

Time

0.0006

0.0008

0.0010

0.0012

0.0014

0.0016

0.0018

0.0020

Mag

netic

fiel

d z [

Gs]

2018-032018-04

2018-052018-06

2018-072018-08

2018-092018-10

Time

10

20

30

40

50

Batte

ry v

olta

ge [V

]

Figure 5: Example time series produced by the channels of sensor device001BC50C70000CB7.

3.1 Sensor Data 28

2018-032018-04

2018-052018-06

2018-072018-08

2018-092018-10

Time

2

0

2

4

6

1. D

iff. T

empe

ratu

re [°

C]

2018-032018-04

2018-052018-06

2018-072018-08

2018-092018-10

Time

2

0

2

4

6

1. D

iff. O

bjec

t Tem

p. [°

C]

2018-032018-04

2018-052018-06

2018-072018-08

2018-092018-10

Time

7.5

5.0

2.5

0.0

2.5

5.0

7.5

10.0

12.5

1. D

iff. H

umid

ity [%

]

2018-032018-04

2018-052018-06

2018-072018-08

2018-092018-10

Time

4

2

0

2

4

6

1. D

iff. P

ress

ure

[hPa

]

2018-062018-07

2018-082018-09

2018-10

Time

0.004

0.002

0.000

0.002

0.004

1. D

iff. M

ag. f

ield

x [G

s]

2018-062018-07

2018-082018-09

2018-10

Time

0.0004

0.0002

0.0000

0.0002

0.0004

1. D

iff. M

ag. f

ield

y [G

s]

2018-062018-07

2018-082018-09

2018-10

Time

0.0015

0.0010

0.0005

0.0000

0.0005

0.0010

0.0015

1. D

iff. M

ag. f

ield

z [G

s]

2018-032018-04

2018-052018-06

2018-072018-08

2018-092018-10

Time

30

20

10

0

10

1. D

iff. B

atte

ry V

olta

ge [V

]

Figure 6: An example of first-order differences on the time series produced by the channelsof sensor device 001BC50C70000CB7.

3.2 Simulated Data 29

3.2 Simulated Data

Because the sensor data are not labeled, the anomaly detection methods (Chapter 4 to 6)cannot be evaluated with respect to performance metrics like accuracy. Also verificationof the detected anomalies by a domain expert was not available within the project time.To assess the performance of these methods, nevertheless a simulated data base was estab-lished. This data base could not be build upon the sensor data, since there is no certainty,whether the sensor data is normal or contains anomalies. Once there is normal sensordata available, i.e. data that contains no anomalies, this data can be used as a basis for asimulation setting. Anomalies can then be deliberately placed (simulated) into the normaldata. For this work, the normal data will be simulated with the following approach.The basis of the simulated time series are autoregressive processes of order 1, abbreviatedto AR(1). In an autoregressive process, xt depends on its previous values and a stochasticerror term, which is usually called “(white) noise”. Hence, the AR(1) process for someconstant µ is

Xt − µ = φ(Xt−1 − µ) + εt ,∀t (14)

Parameter µ is the mean of the process, thus, Xt−µ is zero for all t, in case of a stationaryprocess. In an AR(1) process, Xt−µ depends on the deviation of Xt−1 from its mean, wheremodel parameter φ determines how strong this dependence is (Ruppert and Matteson,2011).In order to transform this AR(1) process to an unevenly spaced time series, the values of theAR(1) process are matched to time stamps, whose time intervals are irregular. These timeintervals are generated by bootstrapping the sampling intervals. Bootstrapping, which is arandom sampling method with replacement (Efron and Tibshirani, 1993), was consideredto be an appropriate method to estimate the distribution of the sampling intervals in thesensor data. The bootstrap of the sampling intervals is based on all temperature sensordata that was available for this work.21 time series were simulated, using a range of different parameters

φ ∈ ±{0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 0.99}

and a φ close to zero, i.e. φ = −2.22 · 10−16 for the AR(1) process. For each AR(1) processa new bootstrap sample of time intervals was generated. This leads to 21 distinct time

3.2 Simulated Data 30

2018-042018-05

2018-062018-07

2018-082018-09

7.5

5.0

2.5

0.0

2.5

5.0

7.5

Figure 7: Simulated data: AR(1) process with φ = 0.9 and irregular sampling frequencies.

2018-042018-05

2018-062018-07

2018-082018-09

Time

7.5

5.0

2.5

0.0

2.5

5.0

7.5

10.0

(a) Complete simulated time series with outliers

2018-06-05

2018-06-19

2018-07-03

2018-07-17

2018-07-31

2018-08-14

2018-08-28

2018-09-11

Time

7.5

5.0

2.5

0.0

2.5

5.0

7.5

10.0

(b) Zoom in: Test data with outliers

Figure 8: Simulated data: AR(1) process with φ = 0.9. The manually added outliers arehighlighted by a red star.

series, one of them shown in Figure 7.Subsequently, each simulated time series was divided into a training and a test data set.To the test data, five outliers in terms of the PAV algorithm were injected, i.e. datapoints were added, which lead to a linear pattern with a slope and a length that havenot yet occurred in the complete simulated time series. The result for this approach isdemonstrated in Figure 8. All other simulated data can be found in the appendix.

4. BASELINE METHODS FOR ANOMALY DETECTION 31

4 Baseline Methods for Anomaly Detection

This Chapter provides a conceptual introduction on the methods that were identified tobe suitable to serve as a baseline. These baseline methods originate from traditionalStatistical Process Control (SPC) and have been adapted in a way that they suit the taskof unsupervised anomaly detection on unevenly spaced time series. These methods includethe 3σ-rule (Chapter 4.1), a modification of the 3σ-rule based on a rolling average (Chapter4.2), the exponentially weighted moving average (EWMA) (Chapter 4.3) and an approachbased on percentiles (Chapter 4.4).

4.1 The 3σ-Rule

In many applications, anomaly detection is based on statistical properties, like mean,median, mode, percentiles and standard deviations (Choudhary, 2017). Particularly, thefield of Statistical Process Control (SPC), which is closely related to univariate outlierdetection (Ben-Gal, 2005), makes use of these statistical properties to monitor the qualityof a manufacturing or a business process. SPC was pioneered by Walter A. Shewhart inthe 1920s. He developed the concept of statistical control and the control chart, which isnowadays also called the Shewhart chart. A stable process, or in other words a processin statistical control, displays variation that is natural to the process, i.e. has commonsources of variation. On the other hand, a process, which would be identified as being notin statistical control, displays special sources of variation. This variation either originatesfrom assignable sources or the variation is indicative of a change in the process. A controlchart is a map of quality features or a map of statistics of these quality features computedfrom a sample of measurements. Periodically, a sample is taken (e.g. a sample of products)and quality features, such as the fraction of defective products, are computed. Then, thevalue of the quality feature itself or a statistic thereof (e.g. the mean of the features) aremapped on the y-axis of a two-dimensional coordinate system. The horizontal axis refersto the time point when the sample was taken. Furthermore, a center line and a symmetricupper and lower control limit (see Figure 9) are drawn. In most use cases, the value ofthe center line and the control limits are computed based on the average and the standarddeviation of historic samples, when the process was labeled to be in statistical control. Thecontrol limit is often set to three times the standard deviation below or above the centralline, which refers to Shewhart’s 3-sigma limits. The purpose of the control chart is todifferentiate between common source variation and assignable, special sources of variation.

4.1 The 3σ-Rule 32

0 5 10 15 20 25

5

0

5

10

15

20

25

Original Time Series Center line 2 3

Figure 9: Shewhart control chart with its three key elements: the center line, the lower andupper control limit, which is typically 3 standard deviations above and below the centerline. Sometimes, 2 standard deviations are either used for the control limits or as warningcontrol limits.

A point outside the control limits is said to signal the presence of special source variation.In order to recreate a stable process, the assignable source of variation should be identifiedand removed (Shewhart, 1931; Faes, 2009).At first glance, SPC differs fundamentally from the data situation described in this work.For example, for SPC, samples are taken regularly and their quality is assessed, while inthe present data case, sensors are measuring environmental features such as temperatureand humidity. On a second glance, however, a link between these two data situationscan be established: In the 1920s the quality of the product was the only quantity thatcould be measured to monitor the manufacturing process. Today, sensors attached tothe machine may actually provide similar information. Specifically, one wants to identifyunusual variation in the environmental parameters in order to identify an actual processchange. Hence, a simple threshold model established on normal data could be used toidentify unusual variation for new incoming sensor data points. This threshold modelincludes to compute the mean x and the standard deviation σ of a time series, known tobe in “statistical control”. A new incoming data point xi will be flagged as an anomaly if

4.1 The 3σ-Rule 33

it deviates from the mean more than k times the standard deviation :

|xi − x| > k · σ (15)

Because k is very often set to 3, this threshold model is called the 3σ-rule.The 3σ-rule was implemented in such a way that data, which are known to be normal or in“statistical control” can be used for training, while new data points can be evaluated in theprediction step. But it was also implemented such that it can be used for the unsupervisedcontext, where the same data are used for training and prediction. Then the method willonly flag very strong individual deviations as anomalies. In this sense, the threshold modeldetects “global” outliers. The pseudocode of the 3σ-rule is displayed in 1.

Algorithm 1 3σ-rule1: Input: Time series data split into Xtrain and Xtest or X = Xtrain = Xtest, multiplier k2: Training:3: Compute the mean µ0 and standard deviation s of the training data Xtrain

4: Prediction:5: Compute boundaries:

lower_boundary = µ0 − k · supper_boundary = µ0 + k · s

6: Predict, if observation is an inlier/outlier:7: if Xtest,i < lower_boundary orXtest,i > upper_boundary then8: Inlier = False

9: else10: Inlier = True

11: end if12: Return Boolean array of length ntest indicating if the corresponding observation is an

inlier/outlier

However, it needs to be pointed out that this technique is susceptible to the number ofobservations in the sample. Hodge and Austin (2004) show that “The higher the numberof records the more statistically representative the sample is likely to be”. Hence, in along time series, abrupt changes indicating a special cause of variation, will presumablylie within the control limits and not be identified as an anomaly. Figure 10 demonstratesthat the peaks between April and June, 2018 that might indicate some special source ofvariation are within the control limits, due to the fact that mean and standard deviation

4.2 The 3σ-Rule Based on a Rolling Average 34

2018-032018-04

2018-052018-06

2018-072018-08

Time

0

5

10

15

20

25

30

35Te

mpe

ratu

re [°

C]

Figure 10: Result of the 3σ-rule on temperature sensor data. All points outside the greenshaded area, i.e. points where xi > x+ 3σ or xi < x− 3σ, are flagged outliers. x = 17.66,σ = 5.25, Lower bound: 1.87, Upper bound: 33.45

were computed on the complete sample. Imagine, one would divide the presented sampleinto two parts, one before and one after April, 2018. This would result in two different setsof control limits and would lead to completely different results.

4.2 The 3σ-Rule Based on a Rolling Average

To overcome the limitation of the 3σ-rule of detecting only the largest deviations within testdata, which may probably mask smaller abrupt changes, one could traverse the statisticsover the time-series. This means to compute the average across the data points within arolling window and to use a stationary standard deviation to identify outlying data points(Choudhary, 2017). In particular, the so-called “rolling average” or “rolling mean” iscomputed, which is usually applied on time series to smooth short-term variations and is aform of a low-pass filter. Then the residual, i.e. the difference between the observed valuexi at time point ti and the corresponding moving average is taken. Now, the variationin the distribution of the residuals is calculated, by computing the stationary standard


deviation of the residuals. With the help of these two statistics, data points are flagged asanomalies with the same set of rules that were applied in the 3σ-rule (Choudhary, 2017).The procedure is presented as pseudocode in Algorithm 2.

Algorithm 2 3σ-rule using rolling average and stationary standard deviation1: Input: Time series X, rolling window size τ , multiplier k2: Compute rolling average µrolling of time series X using a rolling window size τ3: Compute residual, i.e. the deviation between observation xt and µt,rolling

rt = xt − µt,rolling4: Compute standard deviation of all residuals sr5: Compute anomaly score

scoret = rt

sr

6: Predict, if observation is an inlier/outlier:7: if scoret > k then8: Inlier = True

9: else10: Inlier = False

11: end if12: Return Boolean array of length n indicating if the corresponding observation is an

inlier/outlier

Rolling operators for evenly spaced time series differ slightly from those applied to unevenlyspaced time series in the way they are implemented. The mathematical framework ispredominantly the same, however, a complete framework for processing and analyzingunevenly spaced time series data, was written by Eckner (2010). In many implementationsof rolling operators, the window width of the operator is specified in terms of the numberof observations (e.g. 6 observations) instead of the temporal duration (e.g. 6 hours). Inevenly spaced time series the two are equivalent. However, in unevenly spaced time series,this does not apply. The implementation for rolling operators must allow for a temporalduration for its rolling time window (Eckner, 2018). Therefore, a brief introduction on thealgorithm for computing an moving average on unevenly spaced time series will be given.Let τ be the width of the rolling time window. In general, the window of a rolling operatorcan be aligned in three different ways. If the window is “right”-aligned, this means that thecurrent point is the rightmost end of the window. This is a rolling time window, which ishalf-open on the right. If the window is “centered”, the current point will be in the center


of the window, while it would be on the leftmost point of the window, when the window is“left”-aligned. Since the outlier detection methods implemented within this thesis shouldalso be suitable for the streaming context, only “right” aligned windows are of interest.This type of alignment keeps track of observations, which are relevant to the calculationof the rolling operator and which occur before time point t.The pseudocode for the rolling average for a “right”-aligned time window is displayed inAlgorithm 3. Let right be the index that corresponds to the right edge of the rollingwindow and left the index that corresponds to the left-most observation inside the rollingwindow. For each observation i = 1, . . . , n a rolling sum of observations will be computedby first expanding the window on the right and by shrinking it on the left, if necessary.Subsequently, the rolling average is computed based on the rolling sum and the number ofobservations within the time window.

Algorithm 3 Rolling average for unevenly spaced time series with “right” aligned rollingtime window1: Input: Time series X = 〈(x1, t1), (x2, t2), . . . (xn, tn)〉 of length n, window width τ2: Initialize: left = 1, rolling_sum = 03: for right=1 to n do4: Expand window on the right end

rolling_sum = rolling_sum+ xright

5: Shrink window on the left end6: while tleft ≤ tright − τ do7: rolling_sum = rolling_sum− xleft8: left = left+ 19: end while

10: Save average values for current time window:

outright = rolling_sum/(right− left+ 1)

11: end for12: Return: out, a vector of length n containing the rolling average of X

For the data used in this work, the rolling average as depicted in Algorithm 3 is used inAlgorithm 2.

4.3 Exponentially Weighted Moving Average 37

4.3 Exponentially Weighted Moving Average

The exponentially weighted moving average (EWMA) is a frequently used alternative tothe Shewhart chart in the area of Statistical Process Control, since it is more sensitive tosmall or gradual shifts in the data. One drawback of the Shewhart chart is that the resultonly depends on the most recent measure. This can be either solved by using a rollingaverage as the center line or by using an exponentially weighted moving average. Thisaverage incorporates the information of all previous data, however, the weight of previousinformation decreases the further the observation is away from the current observation.The EWMA at time point t is defined as follows (NIST/SEMATECH, 2013a):

EWMAt = λxt + (1− λ)EWMAt−1 for t = 1, 2, . . . , n (16)

This is a weighted average of the current observation xt and the EWMA of all previoustime points EWMAt−1. n refers to the length of the time series including EWMA0, whichis called the target value. The target value is either specified by some domain expertwith EWMA0 = µ or set to the mean of the historical data. Using EWMA in anomalydetection, the target value is computed based on the mean of the time series used fortraining. In an unsupervised setting, this means to compute the mean of the whole timeseries. 0 < λ ≤ 1 is a weighting constant that determines the depth of the memory of theEWMA. Specifically, λ specifies how much weight older values of the time series obtain.Values near 1 put almost all weight on the current observation and the method wouldbecome similar to the Shewhart method. Values near 0 give more weight to older datapoints (NIST/SEMATECH, 2013a). Usually, λ is set between 0.2 and 0.3 (Hunter, 1986,quoted from NIST/SEMATECH, 2013a).In Equation 16, EWMAt−1 can be recursively be substituted, such that EWMAt can bewritten as (Borkowksi, 2015):

EWMAt = λt−1∑j=0

(1− λ)jxt + (1− λ)tEWMA0 (17)

In order to establish a statistical control chart or an anomaly detection method it is nec-essary to define the statistical control limits. The basis is the center line, i.e. the target


value. The control limits for time point t are then as follows (NIST/SEMATECH, 2013a):

UCLt = EWMA0 + k · st (18)

LCLt = EWMA0 − k · st (19)

where UCLt refers to the upper control limit and LCLt to the lower control limit. st

is the standard error of the EWMA at time point t and k is the multiple of it. k is ahyperparameter with a default value of 3, which results in the known 3σ limits.If observations xt would be independent with a common variance, the standard error ofEWMAt is (Borkowksi, 2015):

st = σ

√λ

2− λ(1− (1− λ)2t (20)

This results from calculating V ar(λ∑t−1j=0(1− λ)jxt + (1− λ)tEWMA0)−1. Since λ

2−λ(1−(1− λ)2t → λ

2−λ as t→∞, it is also possible to work with approximate control limits:

UCLapprox = EWMA0 + k · σ√

λ

2− λ (21)

LCLapprox = EWMA0 − k · σ√

λ

2− λ (22)

These control limits are the same for all time points t. Usually, variance σ2 is computedfrom historical data (NIST/SEMATECH, 2013a). It should be noted that the assumptionof independent xt is certainly not met in case of time series data. However, it was decidedto use the standard error as presented in Equation 20 and 21.In Algorithm 4 it is demonstrated how the EWMA is used for outlier detection.


Algorithm 4 Exponentially weighted moving average (EWMA)1: Input: Time series split into Xtrain and Xtest or X = Xtrain = Xtest, smoothing

parameter λ, multiplier k2: Training:3: Compute mean EWMA0 and standard deviation st of training data Xtrain

4: Prediction:5: Add EWMA0 to Xtest as first observation6: Compute the exponentially weighted average of Xtest:

EWMAt = λt−1∑j=0

(1− λ)jxt + (1− λ)tEWMA0

7: Compute the standard error of the EWMA:8: if approximate then9: st = σ

√λ

2−λ10: else11: st = σ

√λ

2−λ(1− (1− λ)2t

12: end if13: Compute the the lower and upper control limit:

UCLt = EWMA0 + k · stLCLt = EWMA0 − k · st

14: Predict, if observation is an inlier/outlier:15: if EWMAt < LCLt or EWMAt > UCLt then16: Inlier = False


19: end if20: Return Boolean array of length Xtest indicating if the corresponding observation is an

inlier/outlier

4.4 Percentile Method 40

4.4 Percentile Method

Percentiles, another common statistics, are the basis of the anomaly detection methodpresented in Algorithm 5. The threshold q, which is used to compute the percentile, mustbe determined by the user. Here, it is a number in the (0, 0.5) range. Choosing, forexample, q = 0.05 would flag all observations below the 5th percentile and above the 95thpercentile as anomalies. Side remark: If the data is normally distributed, there is a directlink between the percentile model and the 3σ-rule. In this case, the percentiles refer tothe area below the normal distribution curve. Three standard deviations below the meanis equal to the 0.13th percentile, three standard deviations above the mean is equal to the99.87th percentile (Herrnstein and Murray, 1994).

Algorithm 5 Percentile approach1: Input: Time series X split into training Xtrain or Xtest or X = Xtrain = Xtest, param-

eter q2: Derive, which percentiles will be used for the control limits based on q:

lower_percentile = q · 100upper_percentile = 100− lower_percentile

3: Compute boundaries on Xtrain:

lower_boundary = Percentile(lower_percentile)upper_boundary = Percentile(upper_percentile)

4: Predict, if observation is an inlier/outlier:5: if Xtest,t < lower_boundary orXtest,t > upper_boundary then6: Inlier = False


9: end if10: Return Inlier, a boolean array of length of complete time series indicating if the

corresponding observation is an inlier/outlier

5. PAV: ANOMALY DETECTION FOR UNEVENLY SPACED TIME SERIES 41

5 PAV: Anomaly Detection for Unevenly Spaced TimeSeries

The Pattern Anomaly Value (PAV) algorithm, proposed by Chen and Zhan (2008) aims tofind anomalous patterns, i.e. patterns that are different from the other patterns for a giventime series. It is based on the computation of the Pattern Anomaly Value (PAV) and can beapplied also to unevenly spaced time series, which makes it unique among all unsupervisedanomaly detection methods for time series. The PAV algorithm, which is presented aspseudocode in Algorithm 6, starts with splitting the time series into linear patterns, wherea linear pattern, is defined as the segment between two consecutive sampling points, i.e.Yi = 〈xi, xi+1〉, i = 1, 2, . . . , n− 1. Then the slope and the length for each linear pattern iscomputed. The the slope is obtained by

si = xi+1 − xiti+1 − ti

, i = 1, . . . , n (23)

and the length by

li =√

(xi+1 − xi)2 + (ti+1 − ti)2 , i = 1, . . . , n . (24)

Subsequently, the Pattern Anomaly Value of the patterns is calculated. This is achievedby comparing all patterns in the time series and checking if two patterns are the same withrespect to length and slope. If this is the case, the support count will be increased by 1.The support count of a linear pattern Yi, i = 1, 2, . . . , n − 1 counts how many times thispattern occurs in the time series:

δ(Yi) = |{Yi | Yi = 〈xi, xi+1〉 ∧ xi ∈ X ∧ xi+1 ∈ X}| (25)

In case the time series is very volatile and irregular, it is likely that only a few linearpatterns are the same with respect to slope and length. Therefore, Chen and Zhan (2008)suggest reducing the precision of the slope, by retaining d decimal places of the slope’svalue, where d is suggested to be set to 1 or 2. However, they only apply this to the slope.For unevenly spaced time series reducing the precision of the length seems reasonable,which is why the implementation used in this thesis reduces precision of both the lengthand the slope.After computation of the support count for each linear pattern, the support count will be


mapped to the [0, 1] interval by computing

δi = δi −minj(δj)maxj(δj)−minj(δj)

, i = 1, . . . , n . (26)

Based on the normalized support count, we obtain the Pattern Anomaly Value (PAV) scoreor AV score by computing

PAVi = 1− δi , i = 1, . . . , n . (27)

Since the support count for rare linear patterns will be low, the anomaly value will beclose to 1 for these patterns, indicating a higher anomaly degree of Yi. Hence, the higherthe anomaly value the higher the likelihood for an anomalous pattern. In conclusion, theanomaly value captures the anomaly degree of a linear pattern. However, a pattern is onlydetermined as anomalous, if the anomaly value exceeds a user-defined threshold minvalue.Often setting this threshold would require some kind of domain knowledge. Alternatively,one can sort the patterns by decreasing order of the AV score and declare the top k patternsof this ordering as outliers. Nevertheless, also this approach depends on the choice of k.


Algorithm 6 Pattern Anomaly Value (PAV) algorithm1: Input: Time series X, precision parameter d, threshold minvalue or k, the number of

the top k anomalous patterns2: Initialize support count: δ = (δ1, · · · , δn−1) = (0, · · · , 0)3: Compute the support count δi of each linear pattern:

• Compute the slope si of pattern Yi, i = 1, . . . , n and retain d digits after thedecimal point• Compute the length li of pattern Yi, i = 1, . . . , n and retain d digits after the

decimal point• Compare pattern Yi with pattern Yj, i = 1, 2, . . . , n− 1; j = i, i+ 1, . . . , n− 1

and increase support count, if slope and length of pattern Yi and Yj are equal:

δi = δi + 14: Standardize support count δ = (δ1, . . . , δn−1):

δi = δi−minj δj

minj δj−maxj δj, i = 1, · · · , n− 1

5: Compute pattern anomaly value: PAVi = 1− δi6: Return Anomalous patterns with PAVi ≥ minvalue or sort PAVi, i = 1, . . . , n− 1 in

ascending order to derive the top k anomalous patterns.

6. ANOMALY DETECTION WITH MODIFICATIONS OF PAV 44

6 Anomaly Detection with Modifications of PAV

In the original PAV algorithm, the slope and the length for each linear pattern are computedto determine, whether a pattern is similar to other patterns. The drawback of this approachis that every new and unseen pattern will be flagged an anomaly, if it is not identical withone of the patterns from the training set. In this case the new, unseen pattern will have aPAV score of 1. This is also the case, when the slope and the length of the new pattern isclose but not identical to slopes and lengths of patterns in the training data. This behavioris not desirable, because relatively similar patterns should not be flagged as anomalies. Thegoal is to flag only those patterns as anomalous that are really different from the rest ofthe patterns.Chen and Zhan (2008) try to solve this by keeping d digits after the decimal point ofthe slope, which is – as described above – probably not a sufficient approach in unevenlyspaced time series. Therefore, precision parameter d was applied to both, the slope and thelength. Chen and Zhan (2008) claim that d can in most cases be set to 1 or 2. However,the degree to which precision needs to be reduced depends on the data. Additionally, theproblem that similar but not identical patterns may be flagged as outliers, might prevaildespite introduction of precision parameter d. Another disadvantage of Chen and Zhan(2008)s’ approach is, that all linear patterns with a pattern anomaly value of 1 cannot bedifferentiated from each other in terms of their outlierness. All patterns with PAV = 1are “equally anomalous”.An approach to overcome these limitations is to estimate the joint density f of the slopesand the lengths of the linear patterns. This can be either done nonparametrically with adensity estimator or by assuming that f belongs to a parametric family of distributionsand then by estimating the unknown parameters. For nonparametric density estimation,the kernel density estimator can be used. For a parametric model, the concept of copulasprovides a suitable framework. Hence, PAV is modified in two ways: on the one hand,bivariate kernel density estimation is used to estimate the joint density of slopes and lengthsof the linear patterns for the reasons just mentioned. This implementation is called theKernel PAV. On the other hand, the univariate marginal distributions and a copula areused to model the dependence structure between the slopes and the lengths of the linearpatterns.

6.1 Kernel Density Estimation 45

6.1 Kernel Density Estimation

Let Z1,Z2, . . . ,Zn be a sample of d-dimensional random vectors drawn from a distributionwhich is described by density function f . Then the kernel density estimator is defined as:

f(z; H) = 1n

n∑i=1

KH(z− Zi) , (28)

where z = (z1, z2, . . . , zd)T and Zi = (Zi1, Zi2, . . . , Zid)T are d-dimensional vectors. KH

is the d-variate kernel function and H is the bandwidth or smoothing matrix, which is asymmetric and positive definite matrix of size d× d:

KH(z) = |H|−1/2K(H−1/2z) (29)

(Wand and Jones, 1995). There are various kernels that lead to estimators with differentcharacteristics (VanderPlas, 2013). Often it is a probability density function. A commonchoice for the kernel is the standard multivariate kernel

KH(z) = (2π)(−d/2) |H|−1/2 exp(−12zTH−1z) (30)

where H refers to the covariance matrix (Wand and Jones, 1995). A corresponding visu-alization is given in Figure 11.While the choice of the kernel is not crucial for the accuracy of the estimation, findingthe right bandwidth matrix H is essential. Bandwidth matrix H determines the amountof smoothing and the orientation of that smoothing (Wand and Jones, 1995). There is ahierarchical class of three smoothing parametrizations from which one could choose.Let F be the class of symmetric and positive definite d × d matrices. If H ∈ F andthere are no additional restrictions, then there are different amounts of smoothing in alldimensions possible. Using this type of parametrization is computationally costly, whichis why they were used less in the past. But there is a substantial improvement in accuracyby using it. Imposing a restriction on H ∈ D, where D ⊆ F leads to the first subclass ofsmoothing parametrizations. It is the class of diagonal, positive definite d × d matrices,where H = diag(h2

1, . . . , h2d). In this subclass different amounts of smoothing are applied

to the coordinate directions. But smoothing in another direction than the coordinate is no


z1

3 2 1 0 1 2 3z2

32

10

12

3

K(z)

0.020.040.060.080.100.120.14

Figure 11: Bivariate standard normal kernel

longer possible. Applying the restriction to H leads to a simplified version of Equation 28:

f(z;h) = 1n

( d∏l=1

hl

)−1 n∑i=1

K(z1 − Zi1h1

· · · zd − Zidhd

)(31)

The third subclass H ∈ S is the class of positive scalars multiplied with the identity matrix:S = {h2I : h > 0}. Hence, there is the same amount of smoothing applied to all coordinatedirections. The great advantage of this parametrization is, that there is only one smoothingparameter. However, it comes at the cost of the least flexibility in smoothing. This leadsto the single bandwidth estimator

f(z;h) = 1n

(1h

)−d n∑i=1

K(z− Zi

h

)(32)

(Wand and Jones, 1995). The three parametrization classes are visualized with the help ofcontour plots of the Gaussian kernel, as shown in Figure 12.

There are various methods for selecting bandwidth matrix H, of which the ones relevant


Figure 12: Three bandwidth parametrization classes. Left: H ∈ S, positive scalars multi-plied by identity matrix. Middle: H ∈ D, diagonal and positive definite matrices. Right:H ∈ F , symmetric and positive definite matrix. Reprinted from Wand and Jones (1995).

to this work are now explained. Some of them use an optimality criterion for selecting thebandwidth, namely the MISE (mean integrated square error):

MISE(H) = E[ ∫ (

f(z; H)− f(z))2dz]

(33)

(Wand and Jones, 1995). However, the MISE has no closed-form expression. Thus, anasymptotic approximation of the MISE, the AMISE, is usually used:

AMISE(H) = 1n|H|−1/2R(K) + 1

4m(K)2(vecTH)ψ(vecH) (34)

where

• R(K) =∫K(z)2 dz is usually set to R(K) = (4π)−d/2, if K refers to a Gaussian

kernel.

• m(K)Id =∫

z zTK(z) dz

• ψ =∫

(vecD2f(z))(vecTD2f(z))dz, with vec being the operator which stacks thecolumns of a matrix into a vector, and D2 being the Hessian matrix of second orderpartial derivatives of f .

The rate of convergence of infh>0 MISE{f(·;h)} is of order n−4/(d+4). Thus, for n → ∞,the MISE → 0, which means that the kernel density estimate converges in probability tothe true density f (Wand and Jones, 1995).


A bandwidth selector that minimizes the AMISE

H = argminH∈F

AMISE(H) (35)

would be an ideal approach for bandwidth selection. But, as demonstrated in Equation34 the computation of the AMISE involves the unknown function f . Therefore, variousestimators for the AMISE exist, that are used for bandwidth selection. One of them is theleast squares cross-validation selector (LSCV). It is defined as follows:

HLSCV = argminH∈F

LSCV (H) (36)

where

LSCV (H) =∫

f(z; H)2 dz− 2n

n∑i=1

f−i(Zi,H) (37)

with f−i(·; H) is the kernel estimator which uses the sample without Zi. Hence, f−i(·; H) isthe leave-one-out (LOO) kernel estimator (Wand and Jones, 1995). Although this methodselects H optimally by minimizing the MISE (Li and Racine, 2006), it has some disadvan-tages (Wand and Jones, 1995). Because the LSCV can have more than one minimum, Hcan be very variable. Thus, applying the LSCV in practice is not very popular (Wand andJones, 1995).The second approach for bandwidth selection is the maximum likelihood cross-validation(MLCV) selector. Here, the bandwidth is selected by maximizing the multivariate leave-one-out (LOO) maximum likelihood:

L = lnL =n∑i=1

ln f−i(Zi,H) (38)

The disadvantage of the MLCV is, that for fat-tailed distributions, this selector may leadto inconsistent results (Li and Racine, 2006).The simplest approach to bandwidth selection is the rule-of-thumb:

h = c σ n−1/(4+d) (39)

where σ is the sample standard deviation and d refers to the number of dimensions. c is aconstant and usually set to 1.06 for the Gaussian kernel. This approach is computationally

6.2 Copulas 49

attractive. However, it treats the Zid’s symmetrically, and thus, cannot capture whendimensions behave differently (Li and Racine, 2006). Additionally, this variant is onlyAMISE-optimal in case the true density f follows a multivariate normal distribution (Wandand Jones, 1995).The bandwidth selection options that were implemented in the Kernel PAV are based onthe options for kernel density estimation provided in Python libraries. Thus, the KernelPAV provides two options to estimate kernel density: statsmodels’ KDEMultivariate andScikit-learn’s KernelDensity. Additionally, one can choose between three options to esti-mate the bandwidth if kernel density is estimated by KDEmultivariate: the rule-of-thumb,maximum likelihood cross-validation and least-squares cross-validation. Bandwidth esti-mation with the Scikit-learn estimator provides the option to use cross-validation, but theresult is a scalar, i.e. the same amount of smoothing is applied to all coordinate directions.The implementation of Kernel PAV is depicted in Algorithm 7 in the form of a pseudocode.As in the original PAV slope and length of the time series are computed. After a min-max-normalization of slope and length to the [0, 1] range, the bivariate kernel density isestimated based on these two features. The result is a probability density function (pdf),which is used in the prediction step to evaluate the slopes and lengths of new incomingdata points. Finally, the pdf evaluated at the slope and length of the test data is checkedagainst the threshold minvalue. If the pdf is smaller than the threshold, the data point isflagged as an outlier.

6.2 Copulas

The bivariate kernel density estimation is a reliable, non-parametric approach to estimatethe probability density function of the slope and the length. However, a small benchmarkexperiment, which is provided in the Appendix A.1 demonstrates that computation timestrongly depends on the method, which is chosen to select the bandwidth matrix H. Ad-ditionally, the quality of the kernel density estimate depends on this bandwidth matrix.Hence, choosing the rule-of-thumb might decrease computation time but comes at costsof decreasing accuracy. This is where a parametric approach might bridge the gap. Here,a family of distributions is chosen and its parameters are estimated based on the givendata. Once, the bivariate distribution of the slope and the length with its correspondingparameters is known, it is possible to evaluate the resulting probability density functionfor new incoming data points in a fast manner. However, a simple model of the multi-variate distribution can only be fitted, if the random variables are independent. Since the

6.2 Copulas 50

Algorithm 7 Kernel PAV1: Input: Time series X split into Xtrain and Xtest or X = Xtrain = Xtest, thresholdminvalue

2: Training:3: Compute slope si,train and length li,train for each linear pattern Yi,train of the training

data4: Scale slope si,train and length li,train to a feature range between 0 and 15: Bivariate kernel density estimation using slope strain and length ltrain:pdftrain(strain, ltrain)

6: Prediction:7: Compute slope si,test and length li,test for each linear pattern Yi,test of the test data8: Scale slope si,test and length li,test such that it is in the given range of the training data9: Compute score:

Evaluate probability density function of stest and ltest: pdftrain(stest, ltest)10: if pdf(stest, ltest) < minvalue then11: Inlier = False



inlier/outlier

6.2 Copulas 51

assumption of independent data does not hold for slope and length (see Equation 23 and24) an alternative approach to model the dependence structure between slope and lengthis needed. A great tool to flexibly model a multivariate distribution, where the randomvariables are correlated is provided by the theory of copulas. Particularly, copulas are usedto model the dependence structure between random variables. It is a function, which “cou-ples the multivariate distribution function to its marginal distribution function” (Michy,2015).In the following, the basic theory on copulas is presented, followed by a section where theapproach of using a copula for estimating the bivariate distribution of the slope and thelength is demonstrated.A d-dimensional copula C : [0, 1]d :→ [0, 1] is a joint cumulative distribution function(CDF) of a random vector U = (U1, U2, . . . , Ud) with uniformly distributed marginals(Haugh, 2016):

C(u1, . . . , ud) = P(U1 ≤ u1, . . . , Ud ≤ ud) (40)

The fact that any continuous random variable can be transformed to uniform by theprobability integral transform, makes it possible to model dependence structure betweenrandom variables and the marginals separately (Yan, 2007).Let X be a random vector with a continuous CDF FX , then the probability integraltransform proves that the random vector

U = FX(X) (41)

has a uniform distribution. Applying the transform to each component of a d-dimensionalrandom vector X = (X1, . . . , Xd), then the random vector

(U1, U2, . . . , Ud) = (F1(X1), F2(X2), . . . Fd(Xd)) (42)

has uniformly distributed marginals. Sklar’s Theorem makes use of this transformationand allows to decouple the modeling of the marginal distributions from the dependencestructure (Brechmann and Schepsmeier, 2013). Let X be a d-dimensional random vectorwith a joint CDF

F (x1, . . . , xd) = P(X1 ≤ x1, . . . , Xd ≤ xd) (43)

6.2 Copulas 52

and marginals

Fj(x) = P(Xj ≤ xj) , j = 1, . . . , d (44)

then there exists a copula C such that

F (x1, . . . , xd) = C(F1(x1), . . . , Fd(xd)) (45)

for all xj ∈ [−∞,∞] , j = 1, . . . , d. Moreover, if the margins Fj(xj) are continuous, thenC is unique. Otherwise, the copula is unique on RanF1 × RanF2 × · · · × RanFd, whereRanFj = Fj([−∞,∞]) is the range of Fj (Haugh, 2016).Conversely, if C is a copula and Fj(Xj) , j = 1, . . . , d are univariate distributions, thenF (x1, . . . , xd) in Equation 45 defines a d-dimensional joint distribution with marginalsFj(xj) , j = 1, . . . , d . This implies that a copula combined with any set of marginal dis-tributions results in a multivariate distribution (Brechmann and Schepsmeier, 2013).

The density, i.e. the probability density function (pdf), of a copula c(u) can be derivedfrom (Haugh, 2016):

c(u) = ∂dC(u1, . . . , ud)∂u1 · · · ∂ud

(46)

When modeling multivariate distributions with the help of copulas, the challenge is toidentify an appropriate copula family. Fortunately, for the bivariate case there is a greatvariety of families. The two major classes of bivariate copula families are elliptical andArchimedean copulas. Elliptical copulas can be obtained by inverting Sklar’s Theorem(Equation 45):

C(u1, u2) = F (F−11 (u1), F−1

2 (u2)) (47)

is a bivariate copula for u1, u2 ∈ [0, 1], where F is a bivariate distribution function andF−1j , j = 1, 2 are the inverted margins. The copula C is called elliptical when F is elliptical.

This is the case for the Gaussian and Student-t distribution.Archimedean copulas are defined as:

C(u1, u2) = ρ[−1](ρ(u1) + ρ(u2)) (48)

where ρ : [0, 1] → [0,∞] is the generator function of the copula and ρ[−1] is the pseudo-

6.2 Copulas 53

inverse. Popular Archimedean copula families are Clayton, Gumbel, Frank and Joe. Inaddition there are copulas that extend the Gumbel copula, such as the Tawn Copula.After establishing a solid theoretical background on copulas, the following section providesthe procedure on how to use the copula theory for modifying the PAV. As in the originalPAV algorithm, the slope and the length of each linear pattern Yi, i = 1, . . . , n are com-puted. Then the joint distribution is estimated by modeling the marginal distributionsand the dependence structure. Once the model has been established, the correspondingpdf can be derived and used to determine if an observation is anomalous by evaluating thepdf for new linear patterns.The complete procedure is displayed as pseudocode in Algorithm 8.

Algorithm 8 Copula PAV1: Input: Time series X split into Xtrain and Xtest or X = Xtrain = Xtest, thresholdminvalue

2: Training:3: Compute slope si,train and length li,train for each linear pattern Yi,train of the training

data4: Scale slope si,train and length li,train to a feature range between 0 and 15: Identify a suitable marginal distribution for slope s and length l and estimate respective

parameters6: Identify a suitable copula family and estimate respective parameters7: Model bivariate distribution based on the marginals and the copula8: Compute the probability density function of the bivariate distribution:pdftrain(strain, ltrain)

9: Prediction:10: Compute slope si,test and length li,test for each linear pattern Yi,test of the test data11: Scale slope si,test and length li,test such that it is in the given range of the training data12: Compute score:

Evaluate probability density function of stest and ltest: pdftest(stest, ltest)13: if pdf(stest, ltest) < minvalue then14: Inlier = False



inlier/outlier

7. EXPERIMENTAL ANALYSIS AND RESULTS 54

7 Experimental Analysis and Results

Performance evaluation of unsupervised anomaly detection methods is not as straight for-ward as it is in the classical supervised classification task. For the latter, a classificationmodel is fitted to some training data, and the model’s prediction is compared to the testdata, which is labeled. Based on this, classification metrics like accuracy, a confusionmatrix and derived measures, can be calculated, and provide an assessment on how wellthe model performs on a specific task. This makes it easy to compare different modelsexecuting the same task.However, in an unsupervised setting the lack of labels makes evaluation challenging. Addi-tionally, the method of evaluating performance depends on the application task. Usually,the unsupervised model is compared against one or more well-known baseline models withthe aid of an application-specific metric.In this work, where the data is unlabeled, a two-tier approach was realized. On the onehand, data was simulated and some outliers manually added, such that evaluation, as in asupervised setting, is possible. Chapter 7.1 describes how the anomaly detection methodswere applied to and evaluated on the simulated data. On the other hand, the anomalydetection methods were applied to the sensor data, which is described in Chapter 7.2.Additionally, an ensemble consisting of all anomaly detection methods was applied to andevaluated on the sensor data and the simulated data.The Copula PAV was not part of the ensemble and not evaluated together with the othermethods. The reason that lead this decision is that Python was used as the programminglanguage in this work. Python does not provide the necessary functionalities in form ofbuilt-in libraries such that implementation of the Copula PAV could be achieved withinthe time scope of this thesis. Particularly, only a couple of copula families are implementedand a functionality to automatically find a suitable copula family is missing. Fortunately,there are R packages meeting those needs. Therefore, Chapter 7.3 illustrates, how to usecopulas to estimate the joint distribution of the linear patterns’ slope and length and howto apply Copula PAV to the sensor data by using the programming language R.

7.1 Application to Simulated Data

The simulated data (see Chapter 3.2 for the data generation process) are based on anAR(1) process with irregular sampling frequencies. After splitting the data into trainingand test data sets, outliers were added to the test data. These are outliers in the sense of

7.1 Application to Simulated Data 55

Table 6: Parameter settings of the anomaly detection methods used for evaluation.

Method Type of parameter Default parameter setting

Baseline methods3σ-rule multiplier k k = 33σ-rule, rolling average multiplier k k = 3

rolling window size window_length = 3 hoursPercentile percentile q = 0.05EWMA multiplier k k = 3

Smoothing parameter α = 0.25

PAV algorithmsOriginal PAV threshold parameter minvalue = 0.9Kernel PAV threshold parameter -

bandwidth selector ”normal reference”

the PAV algorithm: they have a slope and a length that have not been seen in the data sofar. Hence, the PAV algorithm should be able to detect those outliers.For the evaluation of the methods on the simulated data, each anomaly detection algorithmwas applied to each of the 21 simulated data sets, which differ by their parameters andsampling frequencies. Specifically, the algorithms were fitted to the training data andanomalies should be detected in the test data. The anomaly detection methods that wereused, are the baseline methods explained in Chapter 4 and the PAV variants explained inChapter 6 without the Copula PAV.For the Kernel PAV the threshold was chosen in a way that PAV and PAV Kernel outputtedthe same amount of outliers. For the PAV the threshold was set to 1.0, i.e. all observationswith a Pattern Anomaly Value score (PAV score) of 1.0 were flagged as outliers. Otherwise,the algorithms were used with their default settings, which are presented in Table 6. Thesesettings are, except for the information on the multiplier, choices that seemed reasonablewith respect to the sensor data. The choice for the default parameter of the bandwidthselector is based on a comparative analysis, which is provided in the Appendix A.1.The bandwidth selection plays a crucial role in kernel density estimation. In the imple-mentation of the Kernel PAV, there are four options available: rule-of-thumb (“normalreference”), maximum likelihood cross-validation selector (“MLCV”), least squares cross-validation selector (“LSCV”) and sklearn’s cross-validation selection (“sklearn”). The com-


parative analysis aims to compare the time for training and prediction needed by the KernelPAV when using each of these bandwidth selectors. Moreover, results were plotted andtabulated against each other. The results suggest that the rule-of-thumb might be a time-saving alternative to the cross-validation selectors.The default values for the multiplier k are suggestions taken from literature (Shewhart,1931; NIST/SEMATECH, 2013a).The results of each algorithm on each simulated data set are visualized by plotting thetime series and by color highlighting the identified outliers. An example is presented inFigure 13. The presented simulated time series is based on an AR(1) process with φ = 0.9.It is the same time series as presented in Figure 8 in Chapter 3.2. The results for the othersimulated data sets can be found in the electronic appendix. In Figure 13, PAV and KernelPAV identify 112 outliers out of 3704 data points (equals to 3.024% outliers), while the3σ-rule with and without a rolling average as well as the percentile and EWMA approachidentify substantially less outliers (5, 6, 8 and 0 respectively).Whether the identified outliers are those that were intentionally added, can be seen inTable 7. The first column depicts the time stamps of the outliers that were added to thesimulated data. Column 2 to 7 indicate whether the corresponding method was able tofind the outlier (−1) or not (1). The column for PAV will always contain −1, since thenature of the added outliers is to have an unique slope and length. The average acrossthe binary results of the individual methods is demonstrated in the last column and canbe interpreted as the anomaly score of the ensemble consisting of the respective anomalydetection methods. Averaging across the binary labels results in a score ranging from −1to 1, where 1 means that no outlier detection algorithm was able to detect the anomaly,and −1 means that all algorithms identify the particular observation as an outlier3.A graphical representation of Table 7 is shown in Figure 14a. There, the outliers arehighlighted by color and marker size, depending on how many algorithms of the ensembleclassified the observation as an outlier. In case the ensemble agrees little on whetherthe data point is an outlier, the marker size is smaller and the color turns bluish. In thepresented example, the outlier that obviously deviates from the rest of the data is identifiedby the majority of the methods. The other anomalies are mostly just identified by the PAValgorithms.In addition, Figure 14b shows the result of the ensemble on each data point of the completetest data. Again, the color and the size of the marker indicate how much the algorithms

3Technically, the score will never be 1 in the reported case, since at least the PAV algorithm will alwaysfind the manually added outliers.


2018-042018-05

2018-062018-07

2018-082018-09

Time

7.5

5.0

2.5

0.0

2.5

5.0

7.5

10.0

PAV (n_outliers = 112)

(a) PAV (noutliers = 112)

2018-042018-05

2018-062018-07

2018-082018-09

Time

7.5

5.0

2.5

0.0

2.5

5.0

7.5

10.0

PavKernel (n_outliers = 112)

(b) Kernel PAV (noutliers = 112)

2018-042018-05

2018-062018-07

2018-082018-09

Time

7.5

5.0

2.5

0.0

2.5

5.0

7.5

10.0

3Sigma (n_outliers = 5)

(c) 3σ-rule (noutliers = 5)

2018-042018-05

2018-062018-07

2018-082018-09

Time

7.5

5.0

2.5

0.0

2.5

5.0

7.5

10.0

3Sigma-RollingAverage (n_outliers = 6)

(d) 3σ-rule, rolling average (noutliers = 6)

2018-042018-05

2018-062018-07

2018-082018-09

Time

7.5

5.0

2.5

0.0

2.5

5.0

7.5

10.0

EWMA (n_outliers = 0)

(e) EWMA (noutliers = 0)

2018-042018-05

2018-062018-07

2018-082018-09

Time

7.5

5.0

2.5

0.0

2.5

5.0

7.5

10.0

PercentileModel (n_outliers = 8)

(f) Percentiles (noutliers = 8)

Figure 13: Results of each anomaly detection method on simulated data (φ = 0.9). Thesub-captions contain the name of the anomaly detection method and the number of anoma-lies that were identified by the respective method.


Table 7: Capability of the anomaly detection algorithm (columns 2 − 6) to identify theintentionally added outliers on a simulated database (φ = 0.9). The last column representsthe ensemble score. It is the average across the binary results of all methods.

Time stamp PAV Kernel PAV 3σ-rule 3σ-rule EWMA Percentile MeanRoll. Ave.

2018-07-27 13:15:28 -1 1 1 -1 -1 -1 −0.332018-08-16 10:50:33 -1 -1 1 1 1 1 0.332018-08-24 16:30:47 -1 -1 1 1 1 -1 0.002018-08-31 19:47:06 -1 -1 1 1 1 1 0.332018-09-09 05:30:09 -1 -1 1 1 1 1 0.33

agree on whether the data point is an outlier.A reliable outlier detection algorithm is not only characterized by its capability to identifythe true outliers, but also, for example, by its ability to correctly identify inliers. Theconfusion matrix and statistical measures derived from it, provide more insight on theperformance of the outlier detection algorithm. Based on the ground truth (inlier/outlier)and the predicted label (inlier/outlier) a contingency table or confusion matrix, as in Table8 (Kohavi and Provost, 1998), can be constructed. The confusion matrix helps to describethe performance of the outlier detection methods, given that the truth is known, i.e. thedata is labeled, which is the case for this simulation setting.For each outlier detection method that was applied to a simulated time series, a confusionmatrix was calculated and plotted. An example plot is demonstrated in Figure 15. Absolutefrequencies within the cells are color highlighted, i.e. with darker color schemes indicatingmore observations within one cell. The plot shows that the simulated data are quiteimbalanced with many inliers and just five outliers. One can derive that the error rate,also called misclassification rate,

error rate = (FP + FN)/total (49)

is lower for the 3σ-rule approaches, the EWMA and percentile method (0.007, 0.004 and0.007, respectively) than the error rate of the PAV algorithms (0.087, 0.089). However,accuracy and misclassification rate are less suitable measures in imbalanced data. Addi-tionally, the plot demonstrates that the amount of false positives for PAV and Kernel PAV


2018-042018-05

2018-062018-07

2018-082018-09

Time

7.5

5.0

2.5

0.0

2.5

5.0

7.5

10.0

12.5

1.00

0.75

0.50

0.25

0.00

0.25

0.50

0.75

1.00

Mea

n sc

ore

(a)

2018-06-04

2018-06-18

2018-07-02

2018-07-16

2018-07-30

2018-08-13

2018-08-27

2018-09-10

Time

10

5

0

5

10

1.00

0.75

0.50

0.25

0.00

0.25

0.50

0.75

1.00

Mea

n sc

ore

outli

erne

ss

(b)

Figure 14: Simulated time series based on an AR(1) process with φ = 0.9 and manuallyadded outliers. Data points are highlighted by color and size of the marker. The greaterthe size of the marker and the darker the color, the more the ensemble of methods agree,that this observation is an outlier. (a) Result of the ensemble wrt. the manually addedoutliers. (b) Result of the ensemble wrt. all data points in the test set.


inlier outlierPredicted label

inlier

outlier

True

labe

l

1111 107

0 5

0

150

300

450

600

750

900

1050

(a) PAV


inlier

outlier

True

labe

l

1110 108

1 4

150

300

450

600

750

900

1050

(b) Kernel PAV


inlier

outlier

True

labe

l

1214 4

4 1

150

300

450

600

750

900

1050

1200

(c) 3σ-rule


inlier

outlier

True

labe

l

1213 5

4 1

150

300

450

600

750

900

1050

1200

(d) 3σ-rule, rolling average


inlier

outlier

True

labe

l

1218 0

5 0

0

150

300

450

600

750

900

1050

1200

(e) EWMA


inlier

outlier

True

labe

l

1212 6

3 2

150

300

450

600

750

900

1050

1200

(f) Percentile

Figure 15: Confusion matrices of each anomaly detection method (a-f) applied to a simu-lated time series based on an AR(1) process with φ = 0.9. The x-axis presents predictedlabels, while y-axis presents the truth. The numbers within the cells present the absolutefrequency under the specific conditions.


Prediction

Predicted condition negative Predicted condition positiveInlier Outlier

Truth

True condition negative True Negative (TN) False Positive (FP)Inlier Type I error

True condition positive False negative (FN) True positive (TP)Outlier Type II error Power

Table 8: Confusion matrix

is quite high compared to the other methods. The confusion matrices for the 20 othersimulated data sets can be derived from the electronic appendix.Based on the confusion matrix, one can derive further statistical measures like false positiverate, recall, precision and the F1 score.False positive rate (FPR) is defined by

FPR = FP

FP + TN= FP

negatives(50)

and indicates how often the method predicts an outlier, when actually the object is aninlier. Hence, this is the probability of a false alarm. A FPR of 1 would mean thatall inliers were predicted to be outliers. Whereas, a FPR of 0 means that no inlier waspredicted to be an outlier.Recall, also called true positive rate (TPR) or sensitivity, is the proportion of correctlyidentified outliers among all outliers (positives):

TPR = TP

TP + FN= TP

positives(51)

A TPR of 1 means that all outliers were correctly identified as outliers, while a TPR of 0means that no outlier was identified as such.Precision addresses the question, how often the prediction was correct, when the method


predicts “outlier”.

Precision = TP

TP + FP(52)

A precision of 1 means that no actual inliers were predicted to be outliers . A precision of0 means that no outlier was identified as such.Finally, the F1 score is the harmonic mean of precision and recall:

F1 = 2 · (precision · recall)(precision+ recall) (53)

A F1 score of 1 refers to the best value that can be obtained, while a value of 0 is the worstresult.Based on the computation of the mentioned metrics for each anomaly detection algorithmevaluated on the 21 simulated data sets, the distribution of the results gives an indicationon how the algorithms perform with respect to recall (Figure 16a), FPR (Figure 16b),precision (Figure 16c) and F1 score (Figure 16d). Additionally, the four metrics were alsocomputed for the ensemble. This ensemble works based on the majority vote principle.Hence, if the mean score4 is smaller or equal to −0.5, the ensemble will flag the observationas an outlier, otherwise it will be an inlier.Recall is always 1 for the PAV, as the added outliers are shaped such that the PAV mustdetect them. As one would expect, also the Kernel PAV has a high average recall of 0.819,while the ensemble has a comparatively small recall with 0.390. Except for the PAV, thevariability in recall is quite high.The average false positive rates are highest for the PAV algorithms (0.077 and 0.077,respectively), while it is on average 0 for the ensemble.While average precision is low for the PAV algorithms (between 0.053 and 0.043), it is ishigh for the ensemble with an average of 0.952. Furthermore, the variability of precision forthe PAV algorithms is low, whereas it is quite high for the baseline methods. For example,the precision of the EWMA approach ranges from 0 to 1, depending on the data set.For two data sets, the 3σ-rule and the EWMA approach did not detect any of the addedoutliers. This means that precision and, consequently, the F1 score cannot be calculated.These two observations were not considered for plotting precision and recall. For the

4As previously mentioned this mean score is computed from the binary flags of all anomaly detectionmethods. Each method outputs either a 1 for inlier or a −1 for outlier. The average is taken based onthese binary flags.


PAV

Kernel PAV3Sigma

3Sigma

Rolling Average EWMA

Percentiles

Ensemble

0.0

0.2

0.4

0.6

0.8

1.0

Reca

ll

(a) Recall, TPR

PAV

Kernel PAV3Sigma

3Sigma


Percentiles

Ensemble

0.00

0.05

0.10

0.15

0.20

0.25

0.30

FPR

(b) False positive rate

PAV

Kernel PAV3Sigma

3Sigma


Percentiles

Ensemble

0.0

0.2

0.4

0.6

0.8

1.0

Prec

ision

(c) Precision

PAV

Kernel PAV3Sigma

3Sigma


Percentiles

Ensemble

0.0

0.2

0.4

0.6

0.8

1.0F1

(d) F1

Figure 16: Distribution of Recall, FPR, Precision and F1-Score of each anomaly detectionalgorithm and an ensemble thereof across the 21 simulated data sets. A tabulated versionof the results is provided in the Appendix A.2.1 in Table 15.

ensemble, this even applies to 7 data sets.The average F1 score is overall low for the PAV algorithms with almost no variability.When comparing the individual algorithms, average F1 score is highest for the 3σ-rulewith a mean of 0.420, followed by the 3σ-rule based on a rolling average with a mean of0.393, the percentile with a mean of 0.288 and the EWMA approach with a mean of 0.284).For the baseline methods, variability of F1 scores is high. The ensemble has on averagethe best F1 score with 0.686.

7.2 Application to Sensor Data 64

7.2 Application to Sensor Data

This chapter describes the application of the anomaly detection methods to the sensordata, as well as the evaluation of those methods. The anomaly detection methods thatwere used are the baseline methods explained in Chapter 4, including the 3σ-rule, the 3σ-rule based on a rolling average and the EWMA. The percentile approach was not appliedto the data, since the evaluation process (see below) was designed in a way that the 3σ-rulewill cover the same outliers as the percentile method. Moreover, the PAV (see Chapter5) and the modification based on kernel density estimation (see Chapter 6.1) were appliedand analyzed on the sensor data.The PAV was run with a threshold value of minvalue = 1.0. In addition, the time interval∆t was transformed by a min-max-normalization to a feature range of [0, 1]. The aim ofthis transformation was to ensure reasonable results, although precision parameter d isalways set to its default and not with respect to the data.The hyperparameters of all other methods did not need any further specification, becauseevaluation process was established in such a way that the anomaly detection methodscould be compared in a reasonable way. This involved to fix the number of outliers foreach method to the same value. As a benchmark for the number of outliers served thenumber of observations with a PAV score of 1, which is the maximum possible value. Forexample, if the original PAV predicts a PAV score of 1 for 100 observations, this meansthat the PAV flags 100 observations as outliers. Among these 100 observations the PAValgorithm cannot make any further gradation on outlierness. As shown in Table ?? theaverage fraction of observations with a PAV score of 1 is substantial. Here, summary

Table 9: Summary statistics on fraction [%] of observations with PAV score of 1 for alltemperature time series. Statistics are presented for the raw and the differences data.

Mean Std Min 25% Perc. 50% Perc. 75% Perc. Max

Raw 2.785 6.097 0.018 0.018 0.900 2.018 50.000First-order differences 4.413 8.039 0.037 0.037 1.386 4.603 68.000Second-order differences 6.321 9.502 0.055 0.055 1.970 6.907 80.000Third-order differences 9.582 12.101 0.119 0.119 3.378 10.698 96.552



statistics for a PAV score of 1 has been computed based on all temperature data sets(first row). The same statistics are also presented for the first-, second- and third-orderdifferences of the temperature data sets. The summary statistics for the other channels areprovided in the Appendix in Table 20, 23, 26, 29, 32 and 35. The results are similar to thoseof the temperature. However, strength of magnetic field has on average less observationswith a PAV score of 1 compared to the other channels.Because of this “limitation” in the granularity of the score, it appeared natural to use thePAV score of 1 as a benchmark. When the number of outliers were set to a fixed numberbased on the result of the PAV with a threshold of 1.0, for example, 100, then also all othermethods are forced to detect 100 outliers. This is achieved by taking the 100 observationswith the top 100 outlier scores. For the 3σ rule, for example, this includes that all ob-servations are sorted in descending order according to the anomaly score and the top 100observations are flagged as anomalies. For this reason, specifying hyperparameters in thisevaluation setting was redundant.

For the evaluation of the anomaly detection methods on the sensor data, the time serieswere not split into training and test. Rather the whole time series was used in the trainingand in the prediction step.The data evaluation process aims to answer the following questions:

• What are the results of the individual anomaly detection methods? How do theresults look like?

• Do the methods provide similar results, i.e. flag the same observations as outliers,given the number of anomalies is fixed?

• How much do the methods agree with one another? How can the results of theindividual methods be combined? What is the result of such an ensemble approach?

• Is there a difference between the PAV methods and the baseline methods, if the PAVmethods were applied to the raw data, while the baseline models were applied to thefirst-order differences of the data?

• Do the methods vary with respect to training time and prediction time?

A schematic presentation of the evaluation process is shown in Figure 17. More details onthe individual steps and the corresponding results follow in the subsequent subchapters.


Figure 17: Each iteration of the evaluation process started with a single data set, for whichfirst-, second- and third-order differences were computed. The anomaly detection methodswere applied to the data sets and important measures were taken. Finally, the results ofthe individual methods were compared against each and were aggregated in the form of anensemble’s majority vote.

Starting point: data set

The starting point of a complete iteration through the evaluation process is a data set, i.e.a sensor time series. The same procedure is applied to every single time series, for eachtype of channel.

Differencing

For each time series first-, second- and third-order differences were computed. In general,differencing, but preliminary first-order differencing, aims to produce stationary time series.In the context of this work, differencing also pursues another goal: the core of the PAVand its variants is to compute the slopes and lengths of linear patterns Yi, i = 1, . . . , n.This means to compute the differences between consecutive values, thus, this includesdifferencing. Therefore, a comparison of the results of the PAV applied to the raw dataand the results of the baseline models applied to the first-order differences seems to bemore reasonable. Additionally, the results of the PAV applied to the first-order differencesof the data are compared to the results of the baseline models applied to the second-order differences. Moreover, the results of the PAV variants on second-order differences


were compared to the results of the baseline methods on the second-order differences.These types of comparisons are called “cross-comparisons” throughout the remainder ofthis thesis. Subsequently, the anomaly detection methods were run on all data, includingthe raw and the first- to third-order differences of the data.In order to increase readability of the next paragraphs, it will be referred to all “datatypes”, if the goal is to address all data, including the raw data and the first- to third-orderdifferences of the data. For example, if aggregation were made on all temperature data andon all data types, this would mean that aggregation took place on all the raw temperaturedata sets and all the first-, second-, and third-order differences of the temperature datasets.

Apply methods and compute measures

When the methods were applied to the data, several key measures were computed, includingtime for training, time for predicting, the anomaly score for each observation, and thepositions (indices) of the observations that were flagged as outliers. With the latter it ispossible to derive exactly which time stamps are affected.The result was then visualized by plotting the time series and by color-highlighting theanomalous points or linear patterns, as demonstrated exemplarily in Figure 18 for thetemperature time series of sensor device 001BC50C70000AEC. All other result plots canbe derived from the electronic appendix.

Run time analysis of the algorithms was performed for each data set. A summary of thetime for training is shown in Figure 19a. A tabulated version is provided in the AppendixA.3.1, in Table 16. Average training time is lowest for the 3σ-rule based on a rolling average(<0.001 seconds), followed by the 3σ-rule (0.003 seconds), the EWMA (0.003 seconds) andthe Kernel PAV (0.020 seconds). The original PAV needs the most time with an averagetraining time of 0.031 seconds.Average prediction time (see Figure 19b) is shortest for the 3σ-rule (0.005 seconds), fol-lowed by the PAV (0.040 seconds), the 3σ-rule based on a rolling average (0.050 seconds)and the EWMA (0.075 seconds). Time needed for prediction by the Kernel PAV is inrelation to all other methods quite high with an average of 111.682 seconds. Particularly,prediction can take up to 75 minutes, when using the Kernel PAV. However, one shouldkeep in mind that the implementation of the anomaly detection methods did not aim foroptimal performance in terms of computation time.


2018-032018-04

2018-052018-06

2018-072018-08

2018-092018-10

Time

15

20

25

30

35

40

45

Tem

pera

ture

[°C]

(a) PAV

2018-032018-04

2018-052018-06

2018-072018-08

2018-092018-10

Time

15

20

25

30

35

40

45

Tem

pera

ture

[°C]

(b) Kernel PAV

2018-032018-04

2018-052018-06

2018-072018-08

2018-092018-10

Time

15

20

25

30

35

40

45

Tem

pera

ture

[°C]

(c) 3σ-rule

2018-032018-04

2018-052018-06

2018-072018-08

2018-092018-10

Time

15

20

25

30

35

40

45

Tem

pera

ture

[°C]

(d) 3σ-rule with rolling average

2018-032018-04

2018-052018-06

2018-072018-08

2018-092018-10

Time

15

20

25

30

35

40

45

Tem

pera

ture

[°C]

(e) EWMA

Figure 18: Result of each anomaly detection methods on temperature time series of sensordevice 001BC50C70000AEC (raw data). Anomalies detected by the PAV variants arehighlighted by red segments, those detected by the baseline methods are highlighted by ared star.


PAV

Kernel PAV3-Simga

3-Sigma


0.00

0.05

0.10

0.15

0.20

0.25

Trai

ning

tim

e (s

econ

ds)

(a) Time for training

PAV

Kernel PAV3-Simga

3-Sigma


0

1000

2000

3000

4000

Pred

ictio

n tim

e (s

econ

ds)

(b) Time for prediction

Figure 19: Time for training (left) and prediction (right) on all data sets.

Based on the anomaly scores that were recorded for each observation, new data sets werecreated that join the results of all anomaly detection methods.In these new data sets, the columns refer to the anomaly detection methods and the rowsrefer to the observations. For each observation and each anomaly detection method, thescore of the method or the binary outcomes (outlier/inlier) are recorded. Table 10 showsan excerpt of those data sets. The table on the top shows the anomaly scores for eachobservation resulting from applying the anomaly detection algorithms. The magnitude ofthe scores differ between the methods. While for the PAV the score ranges between 0 and1, the score for the other methods may be any positive real number and their magnitudesdepend strongly on the data. The table at the bottom, shows the same observations as inthe upper table, but instead of the scores, the binary labels are displayed. 1 means thatthe observation is flagged as an inlier and −1 means that the observation is flagged asan outlier. The two flags that are highlighted in red are artificially produced. The PAVaims to find anomalous patterns based on the computation of slopes and lengths. Sincethe anomaly score always relates to the second point of the data point tuple that forms alinear pattern, there will be no anomaly score for the first observation (see top). Since ananomalous pattern is shaped by two points, the first point will never be an outlier as ithas no predecessor. Hence, the first observation will be flagged with 1.The complete data sets are included in the electronic appendix.


Table 10: Example excerpt of a data set that contains the anomaly score (top) or thebinary flag (bottom) for each observation produced by the individual methods. For thelatter 1 indicates that the observation is flagged as an inlier, while −1 indicates an outlier.The flags in red are added manually.

Time PAV Kernel PAV 3σ 3σ, rolling average EWMA

16.02.2018 15:20:29 − − 0.345 0 0.08616.02.2018 16:21:27 0.810 2612.120 0.454 0.224 0.09217.02.2018 01:30:09 1 1647.867 0.635 0 0.11417.02.2018 03:32:05 0 3107.969 0.635 0 0.08617.02.2018 04:33:03 0.945 1539.789 0.852 0.448 0.118

... ... ... ... ... ...

Time PAV Kernel PAV 3σ 3σ, rolling average EWMA

16.02.2018 15:20:29 1 1 1 1 116.02.2018 16:21:27 1 1 1 1 117.02.2018 01:30:09 −1 1 1 1 117.02.2018 03:32:05 1 1 1 1 117.02.2018 04:33:03 1 1 1 1 1

... ... ... ... ... ...


Compute derived metrics

Based on the measures that were recorded, while applying the methods to the data, twoadditional metrics were computed. These include the Jaccard index and an overall scorefor each observation.The Jaccard index is a similarity coefficient, usually used within the area of clustering. Itcompares two sets of elements and provides information which elements are shared betweenthe data sets and which are distinct. The formula is as follows:

J(A,B) = |A ∩B||A ∪B|

(54)

where A and B refer to the two sets of elements, that shall be compared. The Jaccardindex takes values between 0 and 1. The higher the Jaccard index, the higher the numberof equal elements, hence, the more similar are the two sets (Deviant, 2018). The Jaccardindex is used to compare the anomaly detection methods regarding their similarity. Inparticular, the positions (indices) of outliers identified by one method are compared tothe positions (indices) identified by another method. For example, the PAV algorithmhas identified 3 outliers, which occur on positions [5, 100, 112]. As described above, theKernel PAV will also identify 3 outliers, which occur on positions [5, 80, 90]. Comparingthese two sets of outlier positions provides information on how similar the PAV and theKernel PAV perform, given they flag the same amount of outliers. The Jaccard index inthis example is 1

5 = 0.2, which is small and indicates a low similarity between the resultof the two anomaly detection methods. The Jaccard indices were computed for each dataset. Having five methods to evaluate, this results in 10 pair-by-pair comparisons, whichare presented in the form of an upper triangular matrix, as shown in Table 11. Here,the average of the Jaccard indices over all temperature data sets, including all types ofdata, was taken. This aggregation aims to provide an accessible overview on how similarthe methods perform. The results for the other channels are presented in the Appendixin Tables 21, 24, 27, 30, 33 and 36. More granular results, i.e. results with respect toindividual data sets, are provided in the electronic appendix.In Table 11, the on average highest Jaccard index was achieved when comparing the resultsof the 3σ-rule with the EWMA approach. While the Jaccard index indicates that thesimilarity between the baseline methods is quite high, the similarity for the PAV with itskernel variant is within a low to medium range. Comparing the results of the PAV variantswith the results of the baseline models provide quite low Jaccard indices. This tendency


Table 11: Average Jaccard indices for temperature data, presented as an upper triangularmatrix.

PAV Kernel PAV 3σ 3σ, rolling average EWMA

PAV − 0.333 0.199 0.173 0.181Kernel PAV − − 0.095 0.092 0.0883σ − − − 0.581 0.7093σ, rolling average − − − − 0.650EWMA − − − − −

in average Jaccard index is also present in the results of the other channels (see Appendix,Tables 21, 24, 27, 30, 33 and 36).It was assumed that the low similarity between the results of the PAV variants with theresults of the baseline methods is due to the fact that the core principle of the PAV is tocompute the slopes and lengths of the linear patterns in the data. Hence, this includes tocompute data differences. Therefore, cross-comparisons5 between the results from the PAVand the results from the baseline models were implemented. This cross-comparisons leadto 6 pair-by-pair comparisons, which can be presented in the shape of a small data frame,as shown in Table 12. The table demonstrates the mean of all pair-wise cross-comparisonscomputed across all temperature data sets. The mean of the Jaccard indices of the cross-comparisons are not substantially higher than when the PAV variants were compared tothe baseline methods on the same type of data. For some channels, similarity when cross-comparing even decreased, as it is the case for object temperature and the strength ofmagnetic field in the x-direction (see Appendix, Table 22 and 31, respectively).

The overall score or ensemble score for each observation was computed by averaging theresults of the anomaly detection methods, which the ensemble consists of. Based on theresults, which are presented in Table 10 on the bottom, row-wise means were computed,such that the ensemble score ranges between −1 and 1. An ensemble score of −1 denotesthat all anomaly detection methods agree that this observation is an outlier, while an

5See Chapter 7.2 for a definition of “cross-comparison”.


Table 12: Mean Jaccard indices for cross-comparisons between the baseline models andthe PAV variants.

PAV Kernel PAV

3σ 0.200 0.1133σ, rolling average 0.196 0.113EWMA 0.235 0.139

ensemble score of 1 denotes that all methods agree that the observation is an inlier. Valuesin between indicate the degree of agreement or disagreement, where a value close to zeroindicates more disagreement than a value close to 1 or −1. Additionally, a categorizedversion of the ensemble score was created consisting of three categories, where −1 refersto “100% agreement outlier”, 1 to “100% agreement inlier” and 0 to “disagreement”.Table 13 shows the distribution of the agreement, where an average of agreement wascomputed across all data sets belonging to one channel and a specific data type. On thetop, there are the results on the raw data, followed by the results on the differenced data.Average agreements, when the ensemble consists either only of the PAV variants or of thebaseline models, are shown in Table 17 and Table 18 in the Appendix A.3.1.Disagreement of the anomaly detection methods is on average highest on time series ofobject temperature, independent of data type. The second and third highest disagreementcan be found on time series of temperature and humidity. The anomaly detection methodstend to disagree little on the time series for strength of magnetic field. There is a tendency,that the anomaly detection methods disagree more the higher the order of differencing gets.

There are several visualizations available for the ensemble score and its categorized ver-sion. Figure 20 displays the temperature time series of sensor device 001BC50C70000AEC.Highlighted by a red star are data points, where the ensemble agrees fully that it is anoutlier. On the left of Figure 20, the ensemble consists of the PAV variants. On the right,the ensemble consists of all methods.Another type of visualization is illustrated in Figure 21 on the left. Here, each data pointis highlighted by color and size corresponding to the ensemble score. For this kind ofvisualization it is advisable to zoom in to be able to view single data points (right ofFigure 21).


Table 13: Distribution of the levels of the agreement variable averaged across all databelonging to a certain channel and a certain data type.

Channel inlier agreement disagreement outlier agreement

Temperature 91.400 8.448 1.216Object temperature 88.550 11.531 1.302Humidity 92.308 7.718 1.082Pressure 94.734 5.244 0.423Magnetic field, x 98.468 1.504 0.138Magnetic field, y 98.295 1.690 0.058Magnetic field, z 98.649 1.343 0.030

(a) Raw data



(b) First-order differences



(c) Second-order differences


2018-032018-04

2018-052018-06

2018-072018-08

2018-092018-10

Time

15

20

25

30

35

40

45

Tem

pera

ture

[°C]

Outlier agreement

(a) PAV and Kernel PAV

2018-032018-04

2018-052018-06

2018-072018-08

2018-092018-10

Time

15

20

25

30

35

40

45

Tem

pera

ture

[°C]

Outlier agreement

(b) All methods

Figure 20: Example temperature time series, where data points are highlighted by a redstar if the methods agree fully that this observation is an outlier. In (a) the ensembleconsists of PAV and PAV Kernel. In (b) the ensemble consists of all methods.

2018-032018-04

2018-052018-06

2018-072018-08

2018-092018-10

Time

15

20

25

30

35

40

45

Tem

pera

ture

[°C]

1.00

0.75

0.50

0.25

0.00

0.25

0.50

0.75

1.00

Mea

n sc

ore

outli

erne

ss

03-31 1603-31 22

04-01 0404-01 10

04-01 1604-01 22

04-02 0404-02 10

04-02 1604-02 22

Time

15

20

25

30

35

40

45

Tem

pera

ture

[°C]

1.00

0.75

0.50

0.25

0.00

0.25

0.50

0.75

1.00

Mea

n sc

ore

outli

erne

ss

Figure 21: Temperature time series for sensor device 001BC50C70000AEC with each datapoint highlighted by color and size corresponding to the ensemble score of the anomalydetection ensemble consisting of all methods. Left: Complete time series, right: Zoom inof the left graph.

7.3 Implementation and Application of the Copula PAV 76

7.3 Implementation and Application of the Copula PAV

This exemplary representation is based on the temperature time series recorded by sensordevice 01BC50C70000B1B, which consists of 31 140 data points. The following resultsare tied to that specific data, which is presented in Figure 22. However, the presentedprocedure and the conclusions that are made can be generalized to other data.The Copula PAV starts, corresponding to the other PAV variants with the computationof the slopes and lengths of all linear patterns Yi, i = 1, . . . , n. Subsequently, slope andlength are transformed in the range [0.001, 0.999]. The reason for this transformation isthat for a large number of univariate distributions (which will be fitted to the data toget the marginals), a positive codomain is needed, and for some - for example, the Betadistribution - a value range from 0 to 1 is required. This range was deliberately chosen, toavoid problems during optimization, which is used when estimating the parameters of themarginals. The 2-dimensional distribution of the normalized slope and length is visualizedwith a scatter plot (Figure 23a).

2018-032018-04

2018-052018-06

2018-072018-08

2018-092018-10

Time

18

20

22

24

26

Tem

pera

ture

[°C]

Figure 22: Temperature time series of sensor device 01BC50C70000B1B.

To model the dependence structure, the R package VineCopula (Schepsmeier et al., 2018)provides a function BiCopSelect that selects an appropriate bivariate copula family byusing the Akaike or Bayesian Information Criterion (AIC or BIC, respectively). This func-tion selects bivariate copula families from a set of families, by estimating the parameters ofeach copula family using maximum likelihood estimation (Genest et al., 1995) and select-ing a family based on the smallest AIC or BIC, respectively. If no set of family is passed,


0.00

0.25

0.50

0.75

1.00

0.00 0.25 0.50 0.75 1.00

Normalized slope

Nor

mal

ized

leng

th

(a) Normalized slope and length.

0.00

0.25

0.50

0.75

1.00

0.00 0.25 0.50 0.75 1.00

pseudo − obsslope

pse

udo

−ob

s len

gth

(b) Pseudo observations for slope and length.

Figure 23: Scatter plots for temperature time series of sensor device 01BC50C70000B1B.

selection among all available families is performed.Since maximum likelihood estimation of the parameters assumes that the input data fol-lows an uniform distribution (Brechmann and Schepsmeier, 2013), the first step was to es-timate the marginal distributions. This can be done parametrically or non-parametrically.Parametric estimation is known as inference functions for margins (Joe and Xu, 1996,quoted from Hofert et al., 2012), while the non-parametric estimation is called the pseudomaximum-likelihood estimation (Genest et al., 1995, quoted from Hofert et al., 2012). Forthis work, the non-parametric estimation of the marginal distributions was used. Here, themarginal distribution functions are estimated based on their empirical distribution func-tions Fj(x) = 1

n

∑nk=1 1{xkj≤x} , j ∈ {1, . . . , n}. This leads to so-called pseudo-observations:

uij = n

n+ 1Fj(xij) = rijn+ 1 , i ∈ {1, . . . , n}, j ∈ {1, . . . , d} (55)

where rij denotes the rank of all xij among all xkj, k ∈ {1, . . . , n}, j ∈ {1, . . . , d}. Thescaling factor n/(n + 1) is used to avoid evaluation problems at the boundaries of thehypercube [0, 1]d. It forces the observation to be within an open unit hypercube. Thepseudo-observations for the temperature sensor data set are presented in Figure 23b.The copula that results from applying BiCopSelect with AIC as selection criteria, is aTawn copula. The density and contours are displayed in Figure 24.To model the marginals, the R package fitdistrplus provides functions to fit univariatedistributions to the data. This package was used to find appropriate marginals for the slopeand the length. With the help of descriptive statistics, including skewness and kurtosis,as well as graphical tools, it is possible to restrict the choice of possible distributions.


slope

0.0

0.2

0.4

0.60.8

1.0

length

0.0

0.2

0.4

0.60.8

1.0

dCopula

0

1

2

3

4

(a) Perspective plot

slope

leng

th

0.0 0.2 0.4 0.6 0.8 1.0

0.0

0.2

0.4

0.6

0.8

1.0

(b) Contour plot

Figure 24: Perspective and contour plot of the estimated Tawn copula.

Graphical tools include a histogram, the cumulative distribution function (CDF) plot aswell as a Cullen-Frey graph (also called Skewness-Kurtosis plot). For the latter the squaredskewness is plotted against the kurtosis. In this plot, possible values for skewness andkurtosis for a couple of distributions are displayed. For some distributions, for examplethe normal distribution, there is only one possible value for skewness and kurtosis (skewness= 0, kurtosis = 3). For other distributions a complete range of values is possible. In orderto reduce uncertainty in the estimated values of skewness and kurtosis from the data, thedata can be bootstrapped. Skewness and kurtosis are estimated on these bootstraps andare highlighted in yellow color. Skewness and kurtosis of the original data is highlightedin blue. The Cullen-Frey graphs for the normalized slope and the normalized length areshown in Figure 25.Figure 25a shows that a lognormal distribution might be suitable for the marginal distri-bution of the normalized slope. For comparison, also a normal distribution was fitted. Thegoodness-of-fit can be checked graphically with a density plot along with a histogram, aCDF plot, a Q-Q plot and a P-P plot. While for the Q-Q plot the empirical quantiles areplotted against the theoretical quantiles, for the P-P plot the empirical distribution evalu-ated at each data point is plotted against the fitted distribution. The result for the slopeis shown in Figure 26a. Both distributions demonstrate poor performance in modeling theslope’s distribution. However, for the sake of simplicity a normal distribution with mean0.469 and a standard deviation of 0.008 for the normalized slope was assumed.For the normalized length, Figure 25b suggests to test three distributions: Beta, Gammaand Weibull. The result of the graphical assessment of the goodness-of-fit is illustratedin Figure 26b. These plots reveal that the suggested distributions are not appropriate.


0 1000 2000 3000 4000

Cullen and Frey graph

square of skewness

kurt

osis

8088

6775

5462

4149

2836

1523

329

Observationbootstrapped values

Theoretical distributionsnormaluniformexponentiallogisticbeta

lognormalgamma

(Weibull is close to gamma and lognormal)

(a) Skewness-Kurtosis plot for normalized slope.

0 5000 10000 15000 20000 25000 30000

Cullen and Frey graph

square of skewness

kurt

osis

3094

325

007

1907

113

135

7655

2632 Observation

bootstrapped valuesTheoretical distributions

normaluniformexponentiallogisticbeta

lognormalgamma

(Weibull is close to gamma and lognormal)

(b) Skewness-Kurtosis plot for normalized length.

Figure 25: Skewness-Kurtosis plot for normalized slope (top) and length (bottom).


Histogram and theoretical densities

data

Den

sity

0.0 0.2 0.4 0.6 0.8 1.0

020

40

Lnorm

Norm

0.0 0.2 0.4 0.6 0.8 1.0

0.0

0.4

0.8

Empirical and theoretical CDFs

data

CD

F Lnorm

Norm

0.40 0.45 0.50 0.55

0.0

0.4

0.8

Q−Q plot

Theoretical quantiles

Em

piric

al q

uant

iles

Lnorm

Norm

0.0 0.2 0.4 0.6 0.8 1.0

0.0

0.4

0.8

P−P plot

Theoretical probabilities

Em

piric

al p

roba

bilit

ies

Lnorm

Norm

(a) Normalized slope

Histogram and theoretical densities

data

Den

sity

0.0 0.2 0.4 0.6 0.8 1.0

0e+

006e

+06 Beta

Gamma

Weibull

0.0 0.2 0.4 0.6 0.8 1.0

0.0

0.4

0.8

Empirical and theoretical CDFs

data

CD

F

Beta

Gamma

Weibull

0.000 0.005 0.010 0.015 0.020

0.0

0.4

0.8

Q−Q plot

Theoretical quantiles

Em

piric

al q

uant

iles

Beta

Gamma

Weibull

0.0 0.2 0.4 0.6 0.8 1.0

0.0

0.4

0.8

P−P plot

Theoretical probabilities

Em

piric

al p

roba

bilit

ies

Beta

Gamma

Weibull

(b) Normalized length

Figure 26: Graphical analysis of the goodness-of-fit for the univariate distribution of nor-malized slope (a) and normalized length (b). From left to right, top to bottom: Densityplot aligned with histogram, CDF plot, Q-Q plot and P-P plot


slope

0.00.2

0.40.6

0.81.0

leng

th

0.0

0.2

0.4

0.60.81.0

density

0.0e+00

5.0e−06

1.0e−05

1.5e−05

2.0e−05

2.5e−05

Figure 27: Perspective plot for the density of the joint bivariate distribution of normalizedslope and length.

However, for the remaining analysis, a Beta distribution with shape parameters 0.307 and361.536 for the normalized length was assumed.Having modeled the dependency structure and the margins, a joint bivariate distributionwith function mvdc of the VineCopula package was constructed. A graphical impressionof this bivariate distribution is given in Figure 27.Looking at Figure 27, the impression might arise, that all linear patterns, that do not havea slope and a length that fall within the area of the steep peak are flagged as outliers. Butsetting the threshold of the Copula PAV to 0, resulted in 16 outliers. These are highlightedin Figure 28 on the top left.On the bottom of Figure 28, the results of the original PAV and the Kernel PAV are shownin order to allow for a visual comparison of the PAV variants. A comparison between theresult by the Copula PAV with a threshold of 0 and the original PAV, reveals that 4 outliersidentified by the Copula PAV are also identified by the original PAV. The Jaccard indexin this case is 0.053. This quite low similarity measure has its cause in the PAV, whichidentifies overall 64 outliers. Comparing the Copula PAV with the Kernel PAV, shows thatthe outliers of the Copula PAV are also all identified by the Kernel PAV. The Jaccardindex of this comparison is 0.25.In a second comparison, the Copula PAV with a different threshold was compared againstthe original PAV and the Kernel PAV. The threshold for the Copula PAV matches thethreshold for the Kernel PAV. And the Kernel PAV threshold was set, such that it detects


2018-032018-04

2018-052018-06

2018-072018-08

2018-092018-10

Time

18

20

22

24

26

Tem

pera

ture

[°C]

(a) Copula PAV with threshold = 0

2018-032018-04

2018-052018-06

2018-072018-08

2018-092018-10

Time

18

20

22

24

26

Tem

pera

ture

[°C]

(b) Copula PAV with threshold = 630.89

2018-032018-04

2018-052018-06

2018-072018-08

2018-092018-10

Time

18

20

22

24

26

Tem

pera

ture

[°C]

(c) PAV

2018-032018-04

2018-052018-06

2018-072018-08

2018-092018-10

Time

18

20

22

24

26

Tem

pera

ture

[°C]

(d) Kernel PAV

Figure 28: Temperature time series with outliers highlighted in color. Outlying linearpatterns are additionally highlighted with a red star at the start of the linear pattern.


the same amount of outliers as the original PAV. The Copula PAV with a threshold of630.89 flags 41 observations as outliers. If this Copula PAV is compared to the originalPAV, 24 outliers are identified by both methods. The Jaccard index is 0.296. There are 41outliers identified by both the Copula PAV and the Kernel PAV, which results in a Jaccardindex of 0.641. This demonstrates that similarity between Kernel PAV and Copula PAVis quite high.

8. CONCLUSION 84

8 Conclusion

This thesis provided an overview of literature on unsupervised anomaly detection methodsthat suit the application of predictive maintenance and the special data at hand. Thisdata are sensor data, which are unlabeled, and unevenly spaced time series data. The lit-erature review revealed that only one unsupervised anomaly detection method is explicitlydescribed as being suitable for unevenly spaced time series: the PAV algorithm proposedby Chen and Zhan (2008). In order to handle some of the limitations of the PAV algorithm,two new methods were introduced, that modify the PAV algorithm correspondingly. ThePAV and its modification - namely, the Kernel PAV and the Copula PAV - were com-pared against four baseline methods from Statistical Process Control. For this purpose, allmethods were implemented in Python in a way that they can be used within the sklearnecosystem (Pedregosa et al., 2011). A comparative analysis on the sensor data revealedthat the similarity of the PAV algorithms is low to medium, while similarity of the baselinemethods is quite high. However, similarity between the PAV algorithms and the baselinemethods is low, which induced the idea to construct an ensemble. The performance ofthis ensemble and of the individual methods were evaluated in a simulation study, wherelabels were produced manually. Performance metrics like FPR, Precision and F1-Scorewere assessed and the results indicate that the ensemble might be a promising approach.The literature review of this thesis can be considered as an approach of “categorizing”the methods that are suitable to detect anomalies in unevenly spaced time series in anunsupervised setting. Abstract and consistent categorizations within the area of anomalydetection are rare and often only available for a specific field of application or a specific typeof data. In particular, there has been no formal categorization of unsupervised methodsfor (unevenly spaced) time series data.An important result of this literature review was that there is only one algorithm capableof handling unevenly spaced time series data: the PAV algorithm proposed by Chen andZhan (2008). Therefore, this thesis focused on the PAV algorithm and introduced twomodifications improving the PAV by handling its limitations.The central aspect of the PAV algorithms are linear patterns of the time series, which arecharacterized by their slope and length. An anomalous pattern has a slope and lengththat has either not yet occurred or only occurred rarely in the data. By counting howoften a linear pattern occurs in the time series, the PAV computes a score, which servesas a basis to identify anomalies. In volatile time series or time series with very different

8. CONCLUSION 85

sampling frequencies, it is very likely that all linear patterns are unique with respect toslope and length. Thus, all patterns would be flagged outliers. Chen and Zhan (2008) tryto overcome this by introducing the precision parameter d, which reduces the precision ofthe slope by only keeping d decimal places after the decimal point. They claim that itis sufficient to set d to 1 or 2. However, they do not apply this reduction of precision tothe length, which would be necessary, in case the time series has greatly varying samplingfrequencies. Moreover, their simple rule of thumb disregards that magnitude of d mustbe set according to the data, as magnitude of slope and length can vary between differentdata sets. Because many different sensor data sets were compared within the evaluationof the anomaly detection methods, setting precision parameter manually for each data setwould have been infeasible. It was necessary to set d to its default value. In order tostill produce reasonable results, a min-max-normalization was applied to the time interval.This transformation converted the values of the time interval to a range between 1 and60. Some experiments on the impact of this transformation on the magnitude of slopeand length as well as on the results were implemented. However, more experiments areneeded to get a clearer understanding on the effect of transformations. Moreover, researchon alternative approaches to set precision parameter d automatically, would enable theapplication of the PAV on different data without manually adjusting parameters.Another disadvantage of the PAV algorithm is, that it tends to assign the maximum scorequite often, making it impossible to gradate outlierness for these data points.The drawbacks of the PAV can be avoided by using a bivariate kernel density estimationfor the slopes and lengths of the linear patterns. This concept was called Kernel PAVand was developed within the scope of this thesis. However, accuracy of bivariate ker-nel density estimation depends on the estimation of bandwidth matrix H, which can becomputationally expensive. A less expensive alternative to estimate the bivariate distribu-tion of slope and length, is a parametric model, which was realized with the introductionand implementation of Copula PAV. The joint distribution was modeled by estimating themarginal distributions separately from the estimation of the dependence structure. Thegreat advantages of the Kernel PAV and the Copula PAV are, firstly, that the score allowsfor granular assessment of the outlierness of each data point. Secondly, patterns, that fallwithin the highest density, will not be flagged as outliers (unless the threshold is set toan unreasonably high value), even if the pattern’s slope and length has not yet occurredin the training data. Thirdly, the modifications also allow for probability statements. Infuture research, performance of these two new methods should be evaluated on real labeled

8. CONCLUSION 86

data. It would be also interesting to compare these methods against other state-of-the artanomaly detection methods. Furthermore, next steps within this research would includeto also implement Copula PAV in Python. For this purpose, additional analysis to findappropriate marginals and dependence structure would be desirable.The four baseline methods that were compared to the PAV algorithms also have theirlimitations. The disadvantage of the 3σ-rule and the percentile method is that they donot account for the time series structure of the data. On contrary, the 3σ-rule based ona rolling average and the EWMA incorporate previous data points in their final result.In future applications, the percentile method could be modified in the same manner asthe 3σ-rule to account for the time series structure of the data. However, the percentilemethod requires loading and sorting of the entire data set, which is CPU and memoryintensive. Additionally, there is a guarantee for finding anomalies, in case the same data isused for training and prediction. This could be avoided, if training took place on normaltraining data, while prediction would be made on unseen data. Moreover, the upperand lower control limits of the percentile method are constructed symmetrically, which isbased on the assumption that the data is symmetrically distributed. In particular, thepercentile method equals the 3σ-rule, if the data follows a normal distribution. One ofthe main disadvantages of the 3σ-rule approaches is the computation of the mean andthe standard deviation of the data, which are measures that are themselves vulnerable tooutliers. Therefore, it would be interesting for future applications to replace the mean bythe median and the standard deviation by the median absolute deviation (Rousseeuw andCroux, 1993). The problem of the measures’ vulnerability to outliers could also be avoidedby again using normal data for training. The same applies to the EWMA method.The PAV, Kernel PAV and the 4 baseline models were evaluated on the sensor data andsimulated data. Evaluation on the simulated data allowed to assess the performance of theindividual anomaly detection methods, as well as the ensemble. As expected, average recallfor the PAV variants is high, as is the average FPR. Hence, the PAV variants can reliablyfind the manually added outliers, but they also flag a lot of normal patterns as anomalous.This results in a low average F1-score. The baseline methods have a low average FPR.However, the methods also tend to miss the manually added outliers. Moreover, recall andprecision vary greatly depending on the data set, which the baseline methods were appliedto. This variability should be investigated in more detail in future research.Although the simulation study revealed interesting findings, it needs to be debated, ifthe approach that generated the simulated data meets scientific requirements. On the one

8. CONCLUSION 87

hand, there is no knowledge on how the normal data generating process really looks like, asnormal data is lacking. Assuming an AR(1) process for the normal data generating processmay simplify a complex situation. In particular, it assumes that the current observationonly depends on the previous observation. This is a strong assumption, given the sensordata usually has a high sampling frequency rate of 1 second and given the fact that theseare, for example, temperature data. On the other hand, the manually added outliers arechosen such that at least the PAV finds them, which also needs to be discussed. Furthersimulations should ideally be based on normal training data. Furthermore, other types ofoutliers, such as change point outliers and anomalous subsequences could be added to thedata.In addition, the simulation setting could be used to find the best hyperparameters by,for example using ROC AUC or a grid search. However, hyperparameter, optimized onartificial data, may differ substantially from the optimal hyperparameters in a real setting.Application of anomaly detection methods to real sensor data should be reliable, but alsoas fast as possible. Therefore, training and prediction time was assessed when applying theanomaly detection methods to the data. In general, average training time is small for allmethods (<< 1 second). This is also true for prediction time, for the most part. However,the implementations of the algorithms, in particular the one of the Kernel PAV, could beoptimized in terms of performance.Similarity between each pair of methods was assessed via the Jaccard index. This analy-sis revealed that on average similarity of the baseline methods is medium to high, whilesimilarity of the PAV variants is low to medium. The Jaccard index for the pairwisecomparisons of the PAV variants with the baseline methods is very low. It could not beapproved, that this dissimilarity originates from the fact, that the core concept of the PAVvariants is to compute the slope and the length, which includes to compute differences ofthe data. This lead to the assumption that similarity between the PAV variants appliedon the raw data and the baseline methods applied to the first-order differences of the datashould improve. However, the evaluation of the cross-comparisons showed that the Jaccardindex improves either only slightly or it even decreases.Finally, an ensemble approach was tested on the real sensor data. For this purpose, thebinary results of the individual methods were averaged to provide an ensemble score.The result of the ensemble on the sensor data and the simulated data shows that thisapproach might provide promising results. Pasillas-Díaz and Ratté (2016) and Zimek et al.(2014) also propose, that the construction of an ensemble might increase the individual

8. CONCLUSION 88

capacity of single methods. Most methods cover a certain aspect of the data, for example,the PAV algorithm searches for infrequent linear patterns characterized by a rare slopeand length. By combining the results of methods that cover those different aspects and byproducing a consensus or majority vote, the detection rate of outliers might increase, whilethe variance introduced by each anomaly detection method might decrease (Pasillas-Díazand Ratté, 2016).Considering the decisive aspects for the proper construction of an ensemble (according toZimek et al. (2014) and Pasillas-Díaz and Ratté (2016)), it appears that there is room forimprovement and future research in the area of ensemble construction using the methodspresented in this thesis.The construction of an ensemble involves two important considerations: the choice ofalgorithms (model selection) and the combination of methods (combination method).For the choice of algorithms, two principles are paramount: Firstly, the methods shouldbe accurate (at least better than random) and secondly, the methods should be as diverseas possible. Usually, assessment of accuracy requires labeled data or, at least, knowledgeon how many anomalies are in the data. The assessment of diversity is not straightfor-ward either. Zimek et al. (2014) suggest using a weighted similarity measure, such as theweighted Pearson correlation, and propose a greedy heuristic for model selection. The ideais to create an ensemble, where the members make uncorrelated errors, i.e. the score vectorsof the methods are uncorrelated. This requirement is certainly not met by the ensemblebuilt within the scope of this thesis, since there is a strong correlation between the PAVvariants, which are based on the same core principle. There might also be a correlationbetween the 3σ-rule and the percentile approach, since they are the same if data followsa normal distribution. The drawback of the approach proposed by Zimek et al. (2014) isagain that it requires labels. On contrary, Pasillas-Díaz and Ratté (2016) introduce anapproach that manages ensemble construction without labels. Here, more weight is put tothe more suitable algorithms for a specific data representation, and differentiation betweenoutliers and inliers is improved by increasing the relative distances between the scores ofthe outliers and the scores of the inliers.In contrast to the approach used in this thesis, where combination was based on the binaryflags, Zimek et al. (2014) combine methods by aggregating the scores of the methods. Thisrequires that the scores of all methods have the same co-domain to be comparable, forexample within the [0, 1] interval. This normalization of scores is not straight forward anddiscussed in the paper of Zimek et al. (2014). In addition, the type of aggregation should

8. CONCLUSION 89

be deliberately thought about and adapted to the application. Other aggregation methodsbesides the mean can be the minimum or the maximum or measures that include somekind of cost function or weighting.Although the lack of labels makes performance assessment and hyperparameter optimiza-tion in unsupervised anomaly detection difficult, a supervised setting would not be prefer-able to an unsupervised setting within the context of anomaly detection. Supervisedanomaly detection methods may be more accurate on the data they were learned on,but they lack the explorative character of unsupervised methods. Supervised anomaly de-tection is restricted to those anomalies that have been recorded, when training data wascollected and occurrence of anomalies might be rare. On contrary, unsupervised methodscan detect potentially new anomalies. To get the best out of out of both worlds (thesupervised and the unsupervised one), the most complete approach would be to combineunsupervised and supervised methods.An alternative to this, would be to train the unsupervised anomaly detection methodson normal training data. With normal training data, the unsupervised methods could becalibrated, some problems of the baseline methods could be avoided and finally, this wouldopen up for other anomaly detection methods. These methods use the normal trainingdata to create a model of normality and to compare each new incoming data point to thenormal profile. Additionally, a database of anomalous data points can be used to compareeach new incoming data point against the anomalous points and patterns. For example,this database could be established by getting the anomalies detected by the unsupervisedmethods verified or rejected by a domain expert. Ultimately, this would result in a three-fold system: Firstly, there is an unsupervised anomaly detection system comprised of themethods proposed in this work to detect potentially new anomalies (novelty detection).Secondly, new data points are compared against a “normal profile”, i.e. the model that isbased on normal training data. Thirdly, a new incoming data point would be comparedagainst the anomalous database. Each system outputs scores, which could be weightedand aggregated for a final score.In addition to the completely unsupervised learning setting, there was a second majorchallenge: the unevenly spaced time series structure. There is only little research on thistype of data and only one unsupervised anomaly detection method was described as beingsuitable for unevenly spaced time series. Moreover, the way that sampling frequenciesvaried was not the same for the channels belonging to one sensor device. Hence, a multi-variate analysis was not possible. In future applications, avoiding such heterogeneous time

8. CONCLUSION 90

series structures would increase the amount of possibilities to analyze the sensor data andwould enable a multivariate approach that incorporates information from all environmentalsources at the same time. However, there are practical reasons for the irregular structureof the time series: The sensors were battery-driven and the frequency of sending signals tothe platform affected the lifetime of the battery. Hence, the present approach is a trade-offbetween sending signals on a regular, but not too frequent basis and sending additionalsignals, whenever something unusual occurs. In addition, frequent transmissions are costlyand require connectivity.To solve several of the mentioned problems, edge computing could be used. Processingor computing at the edge means that the analytics are taken from the central servers tothe devices - in this case, to the sensors (Gillespie and Gupta, 2017). In this approach,pre-processing of data and anomaly detection are deployed on the edge. This would enableusing all data, applying a multivariate approach and having a real-time stream processingof the data. Schneible and Lu (2017) use autoencoders and deep learning neural networksdeployed to the edges to detect anomalies, where the autoencoders are used to identify newtrends by learning from new observations. The centralized server aggregates the updatedmodels and redistributes all updates back to the edges. Additionally, the autoencodersand all anomalous observations are stored on the server.

As technological advances in hardware and software of IoT devices evolve quickly, analyticaltechnologies, such as machine learning, statistics and deep learning become increasinglyimportant to analyze the IoT-generated data. This thesis demonstrates several possibilitiesof detecting anomalies in sensor data. If these techniques are applied and deployed in asmart way, predictive maintenance might look into a bright future. A future, where wewill be able to harvest the benefits of these solutions, which will be increasingly reliable,safe and with a high quality standard.

LIST OF FIGURES 91

List of Figures

1 Key components associated with anomaly detection . . . . . . . . . . . . . 62 Types of anomalies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93 Overview on outlier detection in temporal data . . . . . . . . . . . . . . . 114 Structure of the literature overview provided in this work . . . . . . . . . . 135 Example time series produced by the channels of a sensor device . . . . . . 276 An example of first-order differences on the time series produced by the

channels of a sensor device . . . . . . . . . . . . . . . . . . . . . . . . . . . 287 Example of simulated data . . . . . . . . . . . . . . . . . . . . . . . . . . . 308 Example of simulated data with added outliers . . . . . . . . . . . . . . . . 309 Shewhart control chart . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3210 Result of the 3σ-rule on temperature sensor data . . . . . . . . . . . . . . 3411 Bivariate standard normal kernel . . . . . . . . . . . . . . . . . . . . . . . 4612 Three bandwidth parametrization classes . . . . . . . . . . . . . . . . . . . 4713 Results of the individual anomaly detection methods on simulated data . . 5714 Simulated time series with anomalies highlighted by color and size . . . . . 5915 Confusion matrices for each anomaly detection algorithms . . . . . . . . . 6016 Distribution of evaluation metrics of the individual anomaly detection algo-

rithms and the ensemble . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6317 Evaluation process . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6618 Result of each anomaly detection methods on an exemplary temperature

time series . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6819 Time for training and prediction on all temperature data sets . . . . . . . 6920 Example time series with highlighted data points if the ensemble agrees . . 7521 Example time series with each point highlighted corresponding to the en-

semble score . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7522 Temperature time series of sensor device 01BC50C70000B1B . . . . . . . . 7623 Scatter plots for temperature time series of sensor device 01BC50C70000B1B 7724 Perspective and contour plot of the estimated Tawn copula . . . . . . . . . 7825 Skewness-Kurtosis plot for normalized slope and length . . . . . . . . . . . 7926 Graphical analysis of the goodness-of-fit for the univariate distribution of

normalized slope and length . . . . . . . . . . . . . . . . . . . . . . . . . . 80

LIST OF FIGURES 92

27 Perspective plot for the density of the joint bivariate distribution of normal-ized slope and length . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81

28 Temperature time series with the results of the PAV variants highlighted . 8229 Comparison of training and testing time among the various bandwidth se-

lection methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10330 Two-dimensional estimated bandwidth parameters for each bandwidth se-

lection method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104

LIST OF TABLES 93

List of Tables

1 Description of the channels of the sensor devices . . . . . . . . . . . . . . . 222 Number of data sets per channel . . . . . . . . . . . . . . . . . . . . . . . . 243 Summary statistics describing the data sets’ length per channel . . . . . . 254 Summary statistics for sampling frequencies (in seconds) per channel . . . 265 Summary statistics for time series data of sensor device 001BC50C70000CB7 266 Default parameter settings of the anomaly detection methods . . . . . . . . 557 Capability of anomaly detection algorithms to identify injected outliers . . 588 Confusion matrix . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 619 Summary statistics on the fraction of observations with PAV score of 1 across

temperature data sets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6410 Excerpt of data sets containing the anomaly scores or binary flags outputted

by each anomaly detection methods . . . . . . . . . . . . . . . . . . . . . . 7011 Average Jaccard indices for temperature data . . . . . . . . . . . . . . . . 7212 Mean Jaccard indices for cross-comparisons between the baseline models

and the PAV variants . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7313 Distribution of agreement levels . . . . . . . . . . . . . . . . . . . . . . . . 7414 Summary statistics of the estimated bandwidth parameters for each band-

width selection method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10515 Summary statistics for Recall, FPR, Precision and F1-Score . . . . . . . . 10715 Summary statistics for Recall, FPR, Precision and F1-Score, continued . . 10816 Summary statistics for training and prediction time of individual anomaly

detection methods on sensor data . . . . . . . . . . . . . . . . . . . . . . . 10917 Average agreement of PAV and Kernel PAV across all data sets belonging

to one channel . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11018 Average agreement of baseline methods across all data sets belonging to one

channel . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11119 Average Jaccard index for the cross-comparison between the baseline meth-

ods and the PAV variants on temperature time series . . . . . . . . . . . . 11220 Summary statistics on the fraction of observations with PAV score of 1 across

object temperature data sets . . . . . . . . . . . . . . . . . . . . . . . . . . 11321 Average Jaccard indices for object temperature data . . . . . . . . . . . . . 11322 Average Jaccard index for object temperature data, cross comparison . . . 114

LIST OF TABLES 94

23 Summary statistics on the fraction of observations with PAV score of 1 acrossobject humidity data sets . . . . . . . . . . . . . . . . . . . . . . . . . . . . 114

24 Average Jaccard indices for humidity data . . . . . . . . . . . . . . . . . . 11525 Average Jaccard index for humidity data, cross comparison . . . . . . . . . 11526 Summary statistics on the fraction of observations with PAV score of 1 across

pressure data sets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11627 Average Jaccard indices for pressure data . . . . . . . . . . . . . . . . . . . 11628 Average Jaccard index on pressure data, cross comparison . . . . . . . . . 11729 Summary statistics on the fraction of observations with PAV score of 1 across

magnetic field x-direction data sets . . . . . . . . . . . . . . . . . . . . . . 11730 Average Jaccard indices for magnetic field x-direction data . . . . . . . . . 11831 Average Jaccard index of data on magnetic strength, x-direction, cross com-

parison . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11832 Summary statistics on the fraction of observations with PAV score of 1 across

magnetic field y-direction data sets . . . . . . . . . . . . . . . . . . . . . . 11933 Average Jaccard indices for magnetic field y-direction data . . . . . . . . . 11934 Average Jaccard index of data on magnetic strength, y-direction, cross com-

parison . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12035 Summary statistics on the fraction of observations with PAV score of 1 across

magnetic field z-direction data sets . . . . . . . . . . . . . . . . . . . . . . 12036 Average Jaccard indices for magnetic field z-direction data . . . . . . . . . 12137 Average Jaccard index of data on magnetic strength, z-direction, cross com-

parison . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 121

REFERENCES 95

References

Aggarwal, C. C. (2017). Outlier Analysis. Springer.

Ahmad, S., Lavin, A., Purdy, S., and Agha, Z. (2017). Unsupervised real-time anomalydetection for streaming data. Neurocomputing, 262:134 – 147. Online Real-Time LearningStrategies for Data Streams.

Basu, S. and Meckesheimer, M. (2007). Automatic outlier detection for time series: anapplication to sensor data. Knowledge and Information Systems, 11(2):137–154.

Ben-Gal, I. (2005). Outlier detection. In Maimon, O. and Rockach, L., editors, Data miningand discovery handbook: a complete guide for practicioners and researchers, pages 1 –16. Kluwer Academic Publishers.

Bianco, A. M., Ben, M. G., Martinez, E. J., and Yohai, V. J. (2001). Outlier detectionin regression models with arima errors using robust estimates. Journal of Forecasting,20:565–579.

Borkowksi, J. (2015). Lecture notes: Statistical quality control. Retrieved from http://www.math.montana.edu/jobo/st528/documents/chap9d.pdf.

Brechmann, E. C. and Schepsmeier, U. (2013). Modeling dependence with C- and D-Vinecopulas: the R package CDVine. Journal of Statistical Software, 52.

Chandola, V., Banerjee, A., and Kuma, V. (2009). Anomaly detection: a survey. ACMComputing Surveys, 41:15:1–15:58.

Chang, I., Tiao, G. C., and Chen, C. (1988). Estimation of time series parameters in thepresence of outliers. Technometrics, 30:193 – 204.

Chen, C. and Liu, L.-M. (1993). Joint estimation of model parameters and outlier effectsin time series. Journal of the American Statistical Association, 88:284–297.

Chen, X.-Y. and Zhan, Y.-Y. (2008). Multi-scale anomaly detection algorithm based oninfrequent pattern of time series. Journal of Computational and Applied Mathematics,214(1):227 – 237.

Choudhary, P. (2017). Introduction to anomaly detection. Retrieved from https://www.datascience.com/blog/python-anomaly-detection.

http://www.math.montana.edu/jobo/st528/documents/chap9d.pdf

http://www.math.montana.edu/jobo/st528/documents/chap9d.pdf

https://www.datascience.com/blog/python-anomaly-detection

https://www.datascience.com/blog/python-anomaly-detection

REFERENCES 96

Cowpertwait, P. S. P. and Metcalfe, A. V. (2009). Introductory time series with R. SpringerScience+Business Media.

Deviant, S. (2018). Statistics how to. Retrieved from https://www.statisticshowto.datasciencecentral.com/jaccard-index/.

Eckner, A. (2010). Algorithms for unevenly spaced time series: moving averages andother rolling operators. Retrieved from http://www.eckner.com/papers/Algorithms%20for%20Unevenly%20Spaced%20Time%20Series.pdf.

Eckner, A. (2018). Unevenly spaced time series in r. Retrieved from https://github.com/andreas50/uts.

Efron, B. and Tibshirani, R. (1993). An introduction to the Bootstrap. Boca Raton, FL:Chapman & Hall/CRC.

Eskin, E., Lee, W., and Stolfo, S. J. (2001). Modeling system calls for intrusion detectionwith dynamic window sizes. In Proceedings DARPA Information Survivability Confer-ence and Exposition II. DISCEX’01, volume 1, pages 165–175.

Faes, G. (2009). Shewhart-Regelkarte. Retrieved from http://www.faes.de/Basis/Basis-Statistik/Basis-Statistik-Regelkarten/Basis-Statistik-Regel-Shewhart/basis-statistik-regel-shewhart.html.

Feldmann, S., Herweg, O., Rauen, H., and Synek, P.-M. (2017). PredictiveMaintenance. Retrieved from https://www.rolandberger.com/de/Publications/Predictive-Maintenance.html.

Genest, C., Ghoudi, K., and Rivest, L.-P. (1995). A semiparametric estimation procedureof dependence parameters in multivariate families of distributions. Biometrika, 82:543–552.

Gillespie, R. and Gupta, S. (2017). Real-time analytics at the edge : Identifying ab-normal equipment behavior and filtering data near the edge for internet of things ap-plications. In Paper SAS645. Retrieved from https://support.sas.com/resources/papers/proceedings17/SAS0645-2017.pdf.

Goldstein, M. and Uchida, S. (2016). A comparative evaluation of unsupervised anomalydetection algorithms for multivariate data. PLoS ONE, 11.

https://www.statisticshowto.datasciencecentral.com/jaccard-index/

https://www.statisticshowto.datasciencecentral.com/jaccard-index/

http://www.eckner.com/papers/Algorithms%20for%20Unevenly%20Spaced%20Time%20Series.pdf

http://www.eckner.com/papers/Algorithms%20for%20Unevenly%20Spaced%20Time%20Series.pdf

https://github.com/andreas50/uts

https://github.com/andreas50/uts

http://www.faes.de/Basis/Basis-Statistik/Basis-Statistik-Regelkarten/Basis-Statistik-Regel-Shewhart/basis-statistik-regel-shewhart.html



https://www.rolandberger.com/de/Publications/Predictive-Maintenance.html

https://www.rolandberger.com/de/Publications/Predictive-Maintenance.html

https://support.sas.com/resources/papers/proceedings17/SAS0645-2017.pdf

https://support.sas.com/resources/papers/proceedings17/SAS0645-2017.pdf

REFERENCES 97

Golmohammadi, K. and Zaiane, O. R. (2015). Time series contextual anomaly detection fordetecting market manipulation in stock market. In 2015 IEEE International Conferenceon Data Science and Advanced Analytics (DSAA), pages 1–10.

Gupta, M., Gao, J., Aggarwal, C., and Han, J. (2014a). Outlier detection for temporaldata. Morgan & Claypool Publishers.

Gupta, M., Gao, J., Aggarwal, C. C., and Han, J. (2014b). Outlier detection for temporaldata: A survey. IEEE Transactions on Knowledge and Data Engineering, 26:2250–2267.

Guttormsson, S., Marks II, R., El-Sharkawi, M., and Kerszenbaum, I. (1999). Ellipticalnovelty grouping for on-line short-turn detection of excited running rotors. In IEEETransactions on Energy Conversion, volume 14, pages 16–22.

Haugh, M. (2016). Lecture notes for quantitative risk management. Retrieved from http://www.columbia.edu/~mh2078/QRM/Copulas.pdf.

Hawkins, D. (1980). Identification of outliers. Chapman & Hall.

Herrnstein, R. J. and Murray, C. (1994). The Bell Curve. Free Press Paperbacks.

Hill, D. J. and Minsker, B. S. (2010). Anomaly detection in streaming environmentalsensor data: a data-driven modeling approach. Environmental Modelling and Software,25:1014–1022.

Hochenbaum, J., Vallis, O. S., and Kejariwal, A. (2017). Automatic anomaly detection inthe cloud via statistical learning. CoRR, abs/1704.07706.

Hodge, V. J. and Austin, J. (2004). A survey of outlier detection methodologies. ArtificialIntelligence Review, 22:85–126.

Hofert, M., Mächler, M., and McNeil, A. J. (2012). Likelihood inference for archimedeancopulas. Journal of Multivariate Analysis, 110:133–150.

Horn, P. S., Feng, L., Li, Y., and Pesce, A. J. (2001). Effect of outliers and nonhealthyindividuals on reference interval estimation. Clinical Chemistry, 47:2137–2145.

Hunter, S. J. (1986). The exponentially weighted moving average. Journal of QualityTechnology, 18:203–210.

http://www.columbia.edu/~mh2078/QRM/Copulas.pdf

http://www.columbia.edu/~mh2078/QRM/Copulas.pdf

REFERENCES 98

Hyndman, R. J. and Athanasopoulos, G. (2018). Forecasting. https://otexts.org/fpp2/.

Joe, H. and Xu, J. J. (1996). The estimation method of inference functions for marginsfor multivariate models. Technical Report no. 166, Department of Statistics, Universityof British Columbia.

Keogh, E., Lin, J., and Fu, A. (2005). HOT SAX: efficiently finding the most unusual timeseries subsequence. In Fifth IEEE International Conference on Data Mining (ICDM’05).Jan, J. and Wah, B. W. and Vijay, R. and Wu, X. and Rastogi, R.

Kohavi, R. and Provost, F. (1998). Glossary of terms. Machine Learning, 30.

Laurikkala, J., Juhola, M., and Kentala, E. (2000). Informal identification of outliers inmedical data. In Fifth International Workshop on Intelligent Data Analysis in Medicineand Pharmacology, pages 20–24.

Lavin, A. and Ahmad, S. (2015). Evaluating real-time anomaly detection algorithms - thenumenta anomaly benchmark. CoRR, abs/1510.03336.

Lee, J., Kao, H.-A., and Yang, S. (2014). Service innovation and smart analytics forIndustry 4.0. Procedia CIRP, 16:3–8.

Lee, Y., Ni, J., Djurdjanovic, D., Qiu, H., and Liao, H. (2006). Intelligent prognostics toolsand e-maintenance. Computers in Industry, 57:476–489.

Leys, C., Ley, C., Klein, O., Bernard, P., and Licata, L. (2013). Do not use standarddeviation around the mean, use absolute deviation around the median. Journal of Ex-perimental Social Psychology, 49:764–766.

Li, Q. and Racine, J. S. (2006). Nonparametric Econometrics. Princeton University Press.

Lin, J., Keogh, E., Wei, L., and Lonardi, S. (2007). Experiencing SAX: a novel symbolicrepresentation of time series. Data Mining and Knowledge Discovery, 15:107–144.

Michy, A. (2015). Modelling dependence with copulas in R. Retrieved from https://datascienceplus.com/modelling-dependence-with-copulas/.

Mobley, R. (2002). An introduction to Predictive Maintenance. Butterworth-Heinemann.

https://otexts.org/fpp2/

https://otexts.org/fpp2/

https://datascienceplus.com/modelling-dependence-with-copulas/

https://datascienceplus.com/modelling-dependence-with-copulas/

REFERENCES 99

Nagandhi, V., Sreenivasan, L., Giffen, R., Sewak, M., and Rajasekharan, A. (2015). IBMPredictive Maintenance and quality 2.0 Technical Overview. International Business Ma-chines Cooperation 2013.

NIST/SEMATECH (2013a). e-Handbook of statistical methods. ewma control charts. Re-trieved from https://itl.nist.gov/div898/handbook/pmc/section3/pmc324.htm.

NIST/SEMATECH (2013b). e-Handbook of statistical methods. grubbs’ test for out-liers. Retrieved from https://www.itl.nist.gov/div898/handbook/eda/section3/eda35h1.htm.

Pasillas-Díaz, J. R. and Ratté, S. (2016). An unsupervised approach for combining scoresof outlier detection techniques, based on similarity measures. Electronic Notes in Theo-retical Computer Science, 329:61–77.

Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel,M., Prettenhofer, P.and Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau,D., Brucher, M., Perrot, M., and Duchesnay, E. (2011). Scikit-learn: Machine learningin Python. Journal of Machine Learning Research, 12:2825–2830.

Prescott Adams, R. and MacKay, D. J. C. (2007). Bayesian online changepoint detection.ArXiv e-prints.

Rosner, B. (1983). Percentage points for a generalized esd many-outlier procedure. Tech-nometrics, 25(2):165–172.

Rousseeuw, P. J. and Croux, C. (1993). Alternatives to the median absolute deviation.Journal of the American Statistical Association, 88:1273–1283.

Rousseeuw, P. J. and Leroy, A. M. (1987). Robust regression and outlier detection. JohnWiley & Sons, Inc., New York, NY, USA.

Ruppert, D. and Matteson, D. R. (2011). Statistics and data analysis for financial engi-neering with R examples. Springer Science+Business Media, New York.

Scheffer, C. and Girdhar, P. (2004). Practical Machinery Vibration Analysis and PredictiveMaintenance. Elsevier.

Schepsmeier, U., Stoeber, J., Brechmann, E. C., Graeler, B., Nagler, T., and Erhardt, T.(2018). VineCopula: Statistical Inference of Vine Copulas. R package version 2.1.8.

https://itl.nist.gov/div898/handbook/pmc/section3/pmc324.htm

https://www.itl.nist.gov/div898/handbook/eda/section3/eda35h1.htm

https://www.itl.nist.gov/div898/handbook/eda/section3/eda35h1.htm

REFERENCES 100

Schneible, J. and Lu, A. (2017). Anomaly detection on the edge. In MILCOM 2017 - 2017IEEE Military Communications Conference (MILCOM), pages 678–682.

Schneider, M., Ertel, W., and Ramos, F. T. (2016). Expected similarity estimation forlarge-scale batch and streaming anomaly detection. CoRR, abs/1601.06602.

Schreiner, J. and J., M. (2018). IoT-Basics: Wie funktioniert Predic-tive Maintenance? Retrieved from https://www.industry-of-things.de/iot-basics-wie-funktioniert-predictive-maintenance-a-693842/.

Shewhart, W. A. (1931). Economic control of quality of manufactured product. LancasterPress.

Solberg, H. E. and Lahti, A. (2005). Detection of outliers in reference distributions: Per-formance of horn’s algorithm. Clinical chemistry, 51:2326–2332.

Stanway, A. (2013). etsy/skyline [online code repository]. Retrieved from https://github.com/etsy/skyline.

Surace, C. and Worden, K. (1998). Novelty detection method to diagnose damage instructures: an application to an offshore platform. In Proceedings of the InternationalOffshore and Polar Engineering Conference, volume 4, pages 64–70.

Surace, C., Worden, K., and Tomlinson, G. (1997). A novelty detection approach todiagnose damage in a cracked beam. In Proceedings of SPIE, pages 947–953.

Suri, N. N. R., Murty, M. N., and Athithan, G. (2011). Data mining techniques for outlierdetection. In Zhang, Q., Segall, R., and Cao, M., editors, Visual analytics and interactivetechnologies, pages 19 – 34. Information Science Reference, New York.

Suri, N. N. R., Murty, M. N., and Athithan, G. (2012). Data mining techniques for outlierdetection. In Data Mining: Concepts, methodologies, tools and applications - Volume 1,pages 159–178. IGI Global.

Trading Economics (2018). Germany average temperature. Data retrieved from TradingEconomics, https://tradingeconomics.com/germany/temperature.

VanderPlas, J. (2013). Kernel density estimation in python. Retrieved from http://jakevdp.github.io/blog/2013/12/01/kernel-density-estimation/.

https://www.industry-of-things.de/iot-basics-wie-funktioniert-predictive-maintenance-a-693842/

https://www.industry-of-things.de/iot-basics-wie-funktioniert-predictive-maintenance-a-693842/

https://github.com/etsy/skyline

https://github.com/etsy/skyline

https://tradingeconomics.com/germany/temperature

http://jakevdp.github.io/blog/2013/12/01/kernel-density-estimation/

http://jakevdp.github.io/blog/2013/12/01/kernel-density-estimation/

REFERENCES 101

Wand, M. P. and Jones, M. (1995). Kernel Smoothing. London: Chapman & Hall/CRC.

Wang, C., Viswanathan, K., Choudur, L., Talwar, V., Satterfield, W., and Schwan, K.(2011). Statistical techniques for online anomaly detection in data centers. In 12thIFIP/IEEE International Symposium on Integrated Network Management (IM 2011)and Workshops, pages 385–392.

Wei, L., Kumar, N., Lolla, V., Keogh, E. J., Lonardi, S., and Ratanamahatana, C. (2005).Assumption-free anomaly detection in time series. In Proceedings of the 17th Inter-national Conference on Scientific and Statistical Database Management, SSDBM’2005,pages 237–240, Berkeley, CA, US. Lawrence Berkeley Laboratory.

Yan, J. (2007). Enjoy the joy of copulas: with a package copula. Journal of StatisticalSoftware, 21.

Zimek, A., Campello, R. J., and Sander, J. (2014). Ensembles for unsupervised outlierdetection: Challenges and research questions a position paper. SIGKDD Explor. Newsl.,15(1):11–22.

A. APPENDIX 102

A Appendix

A.1 Benchmark Experiment: Bandwidth Selection

The objective of the following experiment is to compare different approaches for selectingbandwidth matrix H using Kernel PAV. For this purpose 50 temperature time series wererandomly sampled from the 496 available temperature data sets.On average, the temperature time series contain 3492 observations, with a variability of1861 observations. The largest time series contains 8575 data points, the smallest 89data points. To these time series the PAV as suggested by Chen and Zhan (2008) (seeChapter 5), and the Kernel PAV (see Chapter 6.1) were applied. For the latter, thereare four different options to estimate bandwidth matrix H. The bandwidth selectors thatwere compared were those that were available as Python modules for multivariate kerneldensity estimation. These are a simple rule-of-thumb (statsmodels’ normal reference),maximum likelihood cross-validation selector (statsmodels’ MLCV), least squares cross-validation selector (statsmodels’ LSCV) and cross-validation by sklearn, which results ina one dimensional scalar.As Figure 29a shows time for training differs quite a lot between the different bandwidthselection variants. On average training time is shortest for the PAV Kernel with the rule-of-thumb selector (0.0025 seconds), followed by the original PAV with 0.0078 seconds.The sklearn cross-validation variant ranges in third place with an average training timeof 557.1897 seconds. A substantially longer training time is needed by the Kernel PAVwith bandwidth selector based on the leave-on-out approaches LSCV and MLCV. Averagetraining time amounts to 1053.0771 seconds for the MLCV selector and to 4530.7041 sec-onds for the LSCV selector.Time needed for prediction is compared to training time negligibly small (see Figure 29b).Average prediction time ranges between 0.0097 seconds for the original PAV and 2.1832seconds for the LSCV selector. The largest prediction time comes from the Kernel PAVvariant with the LSCV selector with 13.3599 seconds.Figure 30 presents the estimated bandwidth parameters in each dimension for each band-width selection method. It is noticeable that for sklearn (green star) only one bandwidthparameter at (0.1, 0.1) is presented. In the sklearn approach, bandwidth parameters areestimated by using cross-validation and by conducting an exhaustive search over specifiedparameter values. The specified parameter values (the search grid) are a range of values

A.1 Benchmark Experiment: Bandwidth Selection 103

PAV

PAV Kernel

Normal ReferencePAV Kernel

LSCV PAV Kernel

MLCV PAV Kernel

Sklearn

0

5000

10000

15000

20000

25000

Trai

n tim

e (s

econ

ds)

(a) Train time

PAV

PAV Kernel

Normal ReferencePAV Kernel

LSCV PAV Kernel

MLCV PAV Kernel

Sklearn

0

2

4

6

8

10

12

14

Pred

ictio

n tim

e (s

econ

ds)

(b) Test time

Figure 29: Comparison of training and testing time among the various bandwidth selectionmethods.


0.00 0.02 0.04 0.06 0.08 0.10h1

0.00

0.02

0.04

0.06

0.08

0.10

0.12

0.14

h 2

LSCVMLCVsklearnnormal reference

Figure 30: Two-dimensional estimated bandwidth parameters for each bandwidth selectionmethod (highlighted by a different color).

between −1 and 1, which are evenly spaced on a log scale. The result is the following array:[0.1, 0.1274275, 0.16237767, . . . , 7.8475997, 10.0] If one compares the values from this arraywith the bandwidth parameters that are estimated by the other selectors, it gets clearthat the search grid is not a good choice. The estimated bandwidth parameters are alwayssmaller than 0.1, hence, the result produced by cross-validation is always the smallest valueof the search grid, namely 0.1.More details on the distribution of the selected bandwidths are given in Table 14. TheLSCV selector demonstrates some extreme estimations. Additionally, the distribution ofestimations for h1 and h2 in Figure 30 appear conspicuous. It can be suspected, that theLSCV ended up in a local minimum for at least one dimension during optimization. Theseresults and the huge training time confirm the statement made by Wand and Jones (1995),that this bandwidth selection method might not be very attractive in practice.On contrary, the statistics of the rule-of-thumb and the MLCV selector are more moderateand quite similar, which also becomes evident in the proximity of estimations in the scatterplot. Given that the computation time for the rule-of-thumb is much shorter than forMLCV, it might be most convenient to use the kernel density estimation with a rule-of-thumb bandwidth selector in practice. These findings supported the choice to use therule-of-thumb as default for the bandwidth selector. It is a trade-off decision between


Table 14: Summary statistics of the estimated bandwidth parameters for each bandwidthselection method.

Normal reference LSCV MLCVH1 H2 H1 H2 H1 H2

Mean 0.013 0.014 0.003 0.013 0.009 0.011Std 0.012 0.015 0.005 0.018 0.006 0.020Min 0.006 0.002 4.1 · 10−75 8.6 · 10−73 0.004 0.00125% percentile 0.008 0.005 3.5 · 10−68 6.5 · 10−62 0.006 0.00450% percentile 0.010 0.009 4.3 · 10−60 5.4 · 10−03 0.007 0.00575% percentile 0.013 0.019 7.8 · 10−03 1.7 · 10−02 0.008 0.009Max 0.083 0.074 0.015 0.086 0.037 0.137


computational resources and accuracy, where it was assumed that the MLCV selectorwould provide more accurate results. For a proper statement on accuracy, it would benecessary to compare the bandwidth selection methods in a supervised setting, where theamount and position of outliers are known.

A.2 Simulated Data 106

A.2 Simulated Data

A.2.1 Performance Measures


Table 15: Summary statistics for (a) Recall, (b) FPR, (c) Precision and (d) F1-Score.Statistics were computed across the 21 simulated data sets for each anomaly detectionmethod and an ensemble thereof.

PAV Kernel PAV 3σ-rule 3σ-rule, rolling average EWMA Percentile Ensemble

Missing 0 0 0 0 0 0 0Mean 1.000 0.819 0.429 0.457 0.276 0.524 0.390Std 0 0.227 0.365 0.364 0.279 0.355 0.371Min 1.000 0.200 0 0 0 0 025% 1.000 0.800 0.200 0.200 0 0.200 050% 1.000 0.800 0.400 0.400 0.200 0.600 0.40075% 1.000 1.000 0.800 0.800 0.400 0.800 0.600Max 1.000 1.000 1.000 1.000 1.000 1.000 1.000

std: standard deviation, min: minimum, perc: percentile, max: maximum(a) Recall


Missing 0 0 0 0 0 0 0Mean 0.077 0.077 0.003 0.003 0.040 0.012 0Std 0.024 0.024 0.002 0.001 0.076 0.012 0Min 0.059 0.061 0 0.001 0 0 025% 0.064 0.066 0.001 0.002 0.001 0.007 050% 0.071 0.073 0.002 0.003 0.002 0.010 075% 0.081 0.081 0.003 0.004 0.053 0.013 0Max 0.168 0.169 0.006 0.007 0.311 0.060 0.001

std: standard deviation, min: minimum, perc: percentile, max: maximum(b) FPR


Table 15: Summary statistics for (a) Recall,(b) FPR, (c) Precision and (d) F1-Score,continued.


Missing 0 0 2 0 2 0 7Mean 0.053 0.043 0.397 0.353 0.344 0.216 0.952Std 0.010 0.013 0.282 0.268 0.398 0.224 0.138Min 0.024 0.012 0 0 0 0 0.50025% 0.048 0.038 0.225 0.167 0.001 0.071 1.00050% 0.054 0.042 0.333 0.333 0.111 0.188 1.00075% 0.060 0.053 0.563 0.625 0.657 0.250 1.000Max 0.065 0.063 1.000 0.833 1.000 1.000 1.000

std: standard deviation, min: minimum, perc: percentile, max: maximum(c) Precision


Missing 0 0 2 0 2 0 7Mean 0.100 0.081 0.420 0.393 0.284 0.288 0.686Std 0.018 0.025 0.230 0.306 0.309 0.233 0.247Min 0.047 0.023 0 0 0 0 0.28625% 0.0912 0.071 0.211 0.182 0.003 0.105 0.57150% 0.103 0.081 0.364 0.364 0.143 0.286 0.75075% 0.114 0.100 0.692 0.727 0.586 0.400 0.889Max 0.122 0.119 0.889 0.909 0.833 0.889 1.000

std: standard deviation, min: minimum, perc: percentile, max: maximum(d) F1

A.3 Sensor Data 109

A.3 Sensor Data

A.3.1 Additional Figures and Tables for all Channels

Table 16: Summary statistics for training (top) and prediction (bottom) for each anomalydetection algorithm. Aggregation was made over all sensor data sets, including all channels,and first-, second-, third-order differences

Method Mean Std min 25% Perc. 50% Perc 75% Perc. Max

PAV 0.031 0.035 0 0.006 0.016 0.046 0.266Kernel PAV 0.02 0.02 0 0.004 0.016 0.031 0.1563σ-rule 0.003 0.005 0 0 0.001 0.003 0.0193σ-rule, rolling average 0 0 0 0 0 0 0.016EWMA 0.003 0.005 0 0 0.001 0.004 0.019

std: standard deviation, min: minimum, perc: percentile, max: maximum(a) Training time

Method Mean Std min 25% Perc. 50% Perc 75% Perc. Max

PAV 0.04 0.046 0 0.008 0.016 0.056 0.344Kernel PAV 111.682 258.886 0 1.728 15.671 82.206 4513.8693σ-rule 0.005 0.005 0 0 0.003 0.006 0.1043σ-rule, rolling average 0.05 0.094 0 0.012 0.02 0.047 1.34EWMA 0.075 0.132 0 0.016 0.031 0.064 1.595

std: standard deviation, min: minimum, perc: percentile, max: maximum(b) Prediction time

A.3 Sensor Data 110

Table 17: Average agreement of PAV and Kernel PAV across all data sets belonging to onechannel.



(a) Raw data







A.3 Sensor Data 111

Table 18: Average agreement of baseline methods across all data sets belonging to onechannel.



(a) Raw data







A.3 Sensor Data 112

A.3.2 Additional Figures and Tables for Temperature

Table 19: Average Jaccard index for the cross-comparison between the baseline methodsand the PAV variants on temperature time series.

PAV Kernel PAV


(a) Baseline models on raw data vs. PAV variants on first-order differences

PAV Kernel PAV


(b) Baseline models on first order-differences vs. PAV variants on second-order differences

PAV Kernel PAV


(c) Baseline models on second-order differences vs. PAV variants on third-order differences

A.3 Sensor Data 113

A.3.3 Additional Figures and Tables for Object Temperature

Table 20: Summary statistics on fraction [%] of observations with PAV score of 1 for allobject temperature time series. Statistics for the raw and differenced data.




Table 21: Average Jaccard indices for object temperature data.



A.3 Sensor Data 114

Table 22: Average Jaccard index for a cross-comparison between the baseline methods andthe PAV variants on object temperature data.

PAV Kernel PAV


A.3.4 Additional Figures and Tables for Humidity

Table 23: Summary statistics on fraction [%] of observations with PAV score of 1 for allhumidity time series. Statistics for the raw and differenced data.




A.3 Sensor Data 115

Table 24: Average Jaccard indices for humidity data, presented as an upper triangularmatrix.



Table 25: Average Jaccard index between the baseline methods and the PAV variants onhumidity data, accounting for the fact that for the PAV variants differences are alreadycomputed.

PAV Kernel PAV


A.3 Sensor Data 116

A.3.5 Additional Figures and Tables for Pressure

Table 26: Summary statistics on fraction [%] of observations with PAV score of 1 for allpressure time series. Statistics are presented for the raw and the differenced data.




Table 27: Average Jaccard indices for object pressure data, presented as an upper triangularmatrix.



A.3 Sensor Data 117

Table 28: Average Jaccard index between the baseline methods and the PAV variants onpressure data, accounting for the fact that for the PAV variants differences are alreadycomputed.

PAV Kernel PAV


A.3.6 Additional Figures and Tables for Magnetic Field x-Direction

Table 29: Summary statistics on fraction [%] of observations with PAV score of 1 forall magnetic field x-direction time series. Statistics are presented for the raw and thedifferenced data.




A.3 Sensor Data 118

Table 30: Average Jaccard indices for magnetic field x-direction data, presented as anupper triangular matrix.



Table 31: Mean Jaccard index between the baseline methods and the PAV variants ofdata on magnetic strength in x-direction, accounting for the fact that for the PAV variantsdifferences are already computed.

PAV Kernel PAV


A.3 Sensor Data 119

A.3.7 Additional Figures and Tables for Magnetic Field y-Direction

Table 32: Summary statistics on fraction [%] of observations with PAV score of 1 forall magnetic field y-direction time series. Statistics are presented for the raw and thedifferenced data.




Table 33: Average Jaccard indices for magnetic field y-direction data, presented as anupper triangular matrix.



A.3 Sensor Data 120

Table 34: Average Jaccard index between the baseline methods and the PAV variants ofdata on magnetic strength in y-direction, accounting for the fact that for the PAV variantsdifferences are already computed.

PAV Kernel PAV


A.3.8 Additional Figures and Tables for Magnetic Field z-Direction

Table 35: Summary statistics on fraction [%] of observations with PAV score of 1 forall magnetic field z-direction time series. Statistics are presented for the raw and thedifferenced data.




A.3 Sensor Data 121

Table 36: Average Jaccard indices for magnetic field z-direction data, presented as an uppertriangular matrix.



Table 37: Average Jaccard index between the baseline methods and the PAV variants ofdata on magnetic strength in z-direction, accounting for the fact that for the PAV variantsdifferences are already computed.

PAV Kernel PAV


A.4 Digital Appendix 122

A.4 Digital Appendix

The digital appendix of this thesis is stored on three USB Flash devices, which are markedwith the number 1, 2 and 3. Device 1 and 2 contain the scripts for the implemented algo-rithms, the simulation of data, the scripts for evaluation of the algorithms and the resultsproduced while evaluating the data. Results are available for each data set, originatingfrom a channel of a sensor device. All results are stored in a way that their name containsthe unique identifier of the sensor device that produced the data. The original data pro-vided by MunichRe cannot be passed to third parties, but an example data set is providedto test the scripts. The password for these USB Flash devices is forwarded via e-mail tothe supervisor of this thesis. The 3rd USB Flash device contains this Master’s thesis as apdf. This Flash device is not encrypted with a password.

Documents

Unsupervised Anomaly Detection in Sensor Data used for … · 2019-01-31 · Unsupervised Anomaly Detection in Sensor Data used for Predictive Maintenance MASTER THESIS Author: MariaErdmann