Degree Project in Electrical Engineering, Second Cycle, 30 credits
Stockholm, Sweden 2019

A Comparative Study of Databases for Storing Sensor Data

JIMMY FJÄLLID

KTH School of Electrical Engineering and Computer Science

A Comparative Study of Databases for Storing Sensor Data



A Comparative Study of Databases for Storing Sensor Data

Jimmy Fjällid

Master of Science Thesis

Communication Systems
School of Information and Communication Technology
KTH Royal Institute of Technology
Stockholm, Sweden

29 May 2019

Examiner: Peter Sjödin
Supervisor: Markus Hidell


© Jimmy Fjällid, 29 May 2019


Abstract

More than 800 Zettabytes of data is predicted to be generated per year by the Internet of Things by 2021. Storing this data necessitates highly scalable databases. Many different data storage solutions exist that specialize in specific use cases, and designing a system to accept arbitrary sensor data while remaining scalable presents a challenge.

The problem was approached through a comparative study of six common databases, inspecting documented features and evaluations, followed by the construction of a prototype system. Elasticsearch was found to be the best suited data storage system for the specific use case presented in this report, and a flexible prototype system was designed. No single database was determined to be best suited for sensor data in general, but with more specific requirements and knowledge of future use, a decision could be made.

Keywords: IoT, NoSQL



Sammanfattning

Över 800 Zettabytes av data är förutspått att genereras av Sakernas Internet vid år 2021. Lagring av denna data gör det nödvändigt med synnerligen skalbara databaser. Det finns många olika datalagringslösningar som specialiserar sig på specifika användningsområden, och att designa ett system som ska kunna ta emot godtycklig sensordata och samtidigt vara skalbart är en utmaning.

Problemet angreps genom en jämförande studie av sex populära databaser som jämfördes utifrån dokumenterad funktionalitet och fristående utvärderingar. Detta följdes av utvecklingen av ett prototypsystem. Elasticsearch bedömdes vara bäst lämpad för det specifika användningsområde som presenteras i denna rapport, och ett flexibelt prototypsystem utvecklades. Ingen enskild databas bedömdes vara bäst lämpad för att hantera sensordata i allmänhet, men med mer specifika krav och vetskap om framtida användning kan en databas väljas ut.

Nyckelord: IoT, NoSQL



Contents

1 Introduction
  1.1 Overview
  1.2 Problem Description
  1.3 Problem Context
  1.4 Requirements
  1.5 Purpose
  1.6 Goals
  1.7 Deliverables
  1.8 Research Methodology
  1.9 Delimitations
  1.10 Structure of This Thesis

2 Background
  2.1 Data Storage Models
    2.1.1 ACID
    2.1.2 BASE
  2.2 Database Types
    2.2.1 Relational Databases
    2.2.2 NoSQL Databases
    2.2.3 Time Series Databases
  2.3 B+-tree vs LSM
  2.4 Indexing
  2.5 Database Replication Architectures
  2.6 Scaling
    2.6.1 Master-Master vs Master-Slave
    2.6.2 Sharding
  2.7 Query Interfaces
    2.7.1 Information Query Interfaces
      2.7.1.1 SOAP
      2.7.1.2 REST
      2.7.1.3 GraphQL


    2.7.2 Database Query Interfaces
    2.7.3 System Integration
  2.8 Related Work

3 Comparison
  3.1 Comparison Criteria
  3.2 Scalability and Backups
  3.3 Maintenance
  3.4 Support for New Data Types
  3.5 Query Language
  3.6 Long Term Storage
  3.7 Summary

4 Prototype development
  4.1 First Iteration
    4.1.1 Data Structure
    4.1.2 Dynamic Mapping of Data Types
    4.1.3 Parsing SenML Messages
    4.1.4 Calculating Measurement Time
    4.1.5 Message Transport Protocol
  4.2 Second Iteration
    4.2.1 Long Term Storage and Backups
    4.2.2 Scaling Through Modularity
    4.2.3 Multi Threading and Asynchronous Requests
    4.2.4 Index Lifecycles
  4.3 Third Iteration
    4.3.1 Data Retrieval Interface
    4.3.2 SenML Parsing
    4.3.3 Automated Deployment

5 Evaluation
  5.1 Basic Suitability
  5.2 Data Retrieval
  5.3 Licensing, Cost, and Support
  5.4 Parts Not Evaluated
    5.4.1 Scalability
    5.4.2 Performance
    5.4.3 Methods of Transport
    5.4.4 Accessing Old Data

6 Analysis
  6.1 Prototype
    6.1.1 Index Rotation
    6.1.2 UUID and Duplicated Measurements
    6.1.3 Data Structure
    6.1.4 Data Retrieval Interface

7 Conclusions
  7.1 Conclusion
    7.1.1 Goals
    7.1.2 Requirements
    7.1.3 Insights
  7.2 Future Work
    7.2.1 What Has Been Left Undone?
    7.2.2 Security
    7.2.3 Next Obvious Things to Be Done

Bibliography

A Elementary evaluation
  A.1 Overview
    A.1.1 Basic Functionality
    A.1.2 Varying Data Fields
    A.1.3 Variable Data Structure
    A.1.4 Retiring of Data


List of Figures

2.1 Replication architectures
2.2 Splitting a collection into multiple shards

4.1 Architecture of the prototype system


List of Tables

3.1 Database comparison

List of Listings

1 Resolving of a SenML message
A.1 A SenML structured message
A.2 A SenML message after ingestion into elastic
A.3 A SenML message with multiple occurrences of the same base fields
A.4 A resolved SenML message with multiple occurrences of the same base fields after ingestion into elastic
A.5 A JSON object sent to elastic
A.6 A JSON document after ingestion by elastic

List of Acronyms and Abbreviations

ACID Atomicity, Consistency, Isolation, Durability; a data consistency model with strong consistency guarantees.

API Application Programming Interface

BASE Basically Available, Soft state, Eventually consistent; a data consistency model with a focus on availability.

CoAP Constrained Application Protocol

CPU Central Processing Unit

CQL Cassandra Query Language

DBMS Database Management System

DSL Domain Specific Language

HATEOAS Hypermedia as the Engine of Application State

HTTP Hypertext Transfer Protocol

InfluxQL Influx Query Language

IoT Internet of Things

JDBC Java Database Connectivity

LSM Log-structured Merge-tree

MQTT Message Queuing Telemetry Transport

ODBC Open Database Connectivity

RAC Real Application Cluster

RAM Random Access Memory

REST Representational State Transfer


SenML Sensor Measurement Lists

SOAP Simple Object Access Protocol

SQL Structured Query Language

UUID Universally Unique Identifier

XML Extensible Markup Language


Chapter 1

Introduction

Most application developers will at some point require persistent data storage. There are many data storage solutions to choose from, and choosing which type of storage to use for a specific application is not always straightforward. There used to be a clear distinction between databases working with structured data and those working with unstructured data, but the distinction has become harder to make because some structured databases now also support unstructured data, and vice versa.

Cisco has predicted that the Internet of Things (IoT) will generate more than 800 Zettabytes per year by 2021 [1]. Environment sensors, smart parking systems, self-driving cars, and similar devices can all generate data that could be useful. Cars could be equipped with various sensors that report information such as outside temperature and current emission levels. In 2017, Stockholm, Sweden was estimated to have 375 cars for every 1000 citizens, which with a population of about 950 000 amounts to more than 350 000 cars [2]. Collecting sensor data from only the cars in a city would put a high load on a database; also including stationary sensors throughout the city, and potentially sensor data generated by wearables and smartphones, requires a scalable system.

Information about air quality could be a factor in deciding where to go, and a smart parking system could inform a self-driving car where to find a parking slot at the destination. Furthermore, the car could retrieve information about current road traffic conditions to make an informed choice of route. Connecting these information sources together to make the data useful presents a challenge. While multiple cloud providers already offer solutions to collect, store, and analyze sensor data, the focus of this thesis is how to design an open system that is not locked to a specific company or product. This is to promote system longevity and flexibility in what types of sensors to use, how they transmit data, and how that data is stored.

To understand how to build such a system, this report contains two parts: a comparative study and the development of a prototype system. In the first part, a selection of different types of databases is compared to find the most suitable one. Next, a query interface is designed to enable retrieval of the collected data and to facilitate integration with other systems. Once a choice has been made, a prototype back-end system is developed to evaluate the chosen solution.

1.1 Overview

Any computer program that requires persistent data storage must use a database. This might be an embedded database bundled with the program, or an external database that the program interacts with through some remote interface. Regardless of which type of database is used, a database interface and some form of query language is required for the interaction. This could be a database driver that integrates the database with a program, or a simple Application Programming Interface (API) leveraging the Hypertext Transfer Protocol (HTTP) to connect arbitrary devices.

1.2 Problem Description

There are many data storage solutions that specialize in different use cases, and selecting which one to use can be hard. How can an open and scalable system be built to support a large quantity of sensors from varying manufacturers and to continuously collect, store, and make the data available? How can different data storage solutions be compared, and which metrics are relevant?

Problem statement: How to design a data storage solution and build a scalable back-end system for receiving and handling large amounts of sensor data, and how to make that data readily available?

1.3 Problem Context

IoT enables easier and more cost-effective deployment of sensors. Every connected machine is a potential information source that can collect and report different kinds of metrics. 42 billion IoT devices are forecast to be connected by 2022 [3]. A system to collect sensor data would thus have to scale well to support the growing number of data sources. A sensor might produce a regular stream of data that adds an even load to a database, but it might also send data in bursts, highly irregularly. This means that it is not safe to assume that there will be periods of lesser load where the database can work on a backlog and catch up. A system would thus have to be able to handle both steady and variable loads.

Previously, the distinction between relational and NoSQL databases was clearer. The relational databases all supported SQL, while the NoSQL databases had their own individual query languages. The relational databases only handled structured data, and the NoSQL databases only handled unstructured data. If there was a need for huge data sets, NoSQL was the way to go due to its better scalability. More recently, however, the distinction has become blurred as some relational databases began to support unstructured data as well as clustering to scale better (at least for read operations) [4][5].

1.4 Requirements

To select a suitable database and build a prototype system, a requirement specification is needed to provide some of the selection criteria. This project aims to design a system for the collection of various types of sensor data, such as environmental data in a city. It should be able to handle the huge number of connected devices expected in the future, and the increasing amount of generated data. The system should thus be scalable and extendable to support new types of sensor data and methods of collection.

When environmental data is collected, there is no clear point in time after which the data becomes irrelevant. To support this, the system should be able to store data for an undefined amount of time, and the collected data should be accessible to the public.

The following list contains the formal requirements for the chosen solution:

• Any sensor can be connected to supply complex data structures.

• New types of sensor readings can be added in the future.

• The collected data must be publicly accessible.

• A new method of transporting a report from sensor to back-end can be implemented in the future.

• A sensor can report different types of information at separate time intervals.


• The system should support scaling the number of sensors to match the growing number of connected devices.

1.5 Purpose

The purpose of this thesis is to answer the questions raised in Section 1.2 by looking into how to design a generic data storage solution: one that is suitable for receiving large amounts of sensor data and that promotes interoperability. Part of the purpose is also to look into how to make the data readily available and accessible, and to implement a prototype back-end system.

1.6 Goals

The goal of this project is to determine how to store sensor data and make it available, and why the chosen method is best suited. The project is also expected to deliver a prototype implementation of a back-end system that can receive large amounts of data, store it, and make it available to the public.

The aforementioned goals are listed below:

1. Determine the best way to store sensor data.

2. Determine the most suitable way to access stored sensor data.

3. Design a prototype back-end system to receive sensor data, store it, and make it available.

To aid in the design of the prototype, a use case is chosen: the GreenIoT research project [6], where IoT sensors are deployed in a city environment to monitor air quality. This is meant to benefit the GreenIoT project, and by extension the citizens where the project is deployed, by making the collected data freely and readily available.

1.7 Deliverables

This project is expected to deliver a comprehensive guide to choosing an appropriate database for storing IoT sensor data. The project is also expected to deliver a prototype back-end system that handles IoT sensor data for the GreenIoT project.


1.8 Research Methodology

This research project will be carried out using an inductive approach [7], and it will consist of two parts. The first part will collect and establish metrics for comparison and then form a decision based on the collected data. The first step is to gather background knowledge about databases and query interfaces to identify important properties. Next, a subset of the identified properties is selected and used as a basis for comparison. The conclusion of which database and query interface is most suitable is formed based on how well each matches the selection criteria.

The second part of the project focuses on the implementation of the chosen database and query interface in a prototype back-end system. The prototype is developed iteratively: a simple prototype is built first, and features are then added continuously in new iterations until the prototype is complete.

1.9 Delimitations

The database comparison will be theoretical and based on information available in product documentation and research papers. No practical tests will be performed while comparing the databases.

There are more databases available than can be covered by a thesis project. Therefore, this project focuses on some of the most popular SQL, NoSQL, and time-series databases. Regarding query interfaces, only REST and GraphQL are examined, as they are considered the top candidates.

1.10 Structure of This Thesis

Chapter 1 introduces the reader to the problem and its context. Chapter 2 provides the background necessary to understand the basis for the choice, along with additional knowledge useful for the rest of this thesis. Chapter 3 compares the different alternatives to find the most suitable choice. Chapter 4 then implements the selected database and query interface in a prototype system. The solution is evaluated in Chapter 5 and analyzed in Chapter 6. Finally, Chapter 7 presents the conclusions and provides suggestions for future work.


Chapter 2

Background

This chapter provides the background information required to understand the properties of different databases, and serves as the basis for the comparison. It also contains information about how data can be retrieved and manipulated using various query interfaces.

2.1 Data Storage Models

A database is a collection of data that is stored electronically and is accessible in various ways. When designing a distributed database, there is a trade-off to be made because of the CAP theorem, first described by Eric A. Brewer in a paper published in 1999 [8]. The paper explains that a distributed system can only guarantee two out of three of:

• Consistency: A read will always return the last written value.

• Availability: With data replicated over multiple network nodes, a client can always reach some replica.

• Partition-resilience: Given a partition in the network, the system as a whole will still be operational.

The choice has often been between consistency and availability, because the trade-off is only relevant at the time of a network partition. When there is no partition in the network, a database can have both consistency and availability. Traditionally, consistency has been the preferred choice, which led to the development of ACID.


2.1.1 ACID

ACID is a data consistency model that promises:

• Atomic: Either all operations in a transaction are completed or everything is rolled back.

• Consistent: After a successful transaction, the database is consistent and structurally sound.

• Isolated: Transactions do not compete for access to specific data and are isolated from each other.

• Durable: The committed data is stored so that it will be available in the correct state even after a failure and system restart.

This model was defined by Härder and Reuter in 1983 [9], and has been frequently used when strong data consistency is a top requirement. A typical example is found in the banking industry: whenever a bank transaction occurs, it is of paramount importance that the transaction record is stored consistently and permanently.
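The atomicity guarantee can be illustrated with a small sketch. The example below uses SQLite (chosen only because it ships with Python; the account names and amounts are made up for illustration) to show that a transfer either commits as a whole or is rolled back entirely.

```python
import sqlite3

# Minimal sketch of atomic transactions; illustrative data, not from the thesis.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)",
                 [("alice", 100), ("bob", 50)])
conn.commit()

def transfer(conn, src, dst, amount):
    try:
        with conn:  # transaction: commits on success, rolls back on error
            conn.execute("UPDATE accounts SET balance = balance - ? WHERE name = ?",
                         (amount, src))
            (balance,) = conn.execute("SELECT balance FROM accounts WHERE name = ?",
                                      (src,)).fetchone()
            if balance < 0:
                # Aborting here undoes the deduction above as well.
                raise ValueError("insufficient funds")
            conn.execute("UPDATE accounts SET balance = balance + ? WHERE name = ?",
                         (amount, dst))
        return True
    except ValueError:
        return False

transfer(conn, "alice", "bob", 30)   # succeeds: balances become 70 / 80
transfer(conn, "alice", "bob", 500)  # rolled back: would overdraw the account
```

The second call leaves both balances untouched, which is exactly the "all or nothing" behavior the Atomic property describes.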

2.1.2 BASE

Another model, not as strict as ACID, is BASE. The BASE data semantics consist of:

• Basically Available: Guarantees availability as described by the CAP theorem [8]. However, while a response is guaranteed, it might be a response of failure, or the returned data might be inconsistent.

• Soft state: The system state might change over time, without input, as the result of eventual consistency.

• Eventually consistent: System consistency will eventually be achieved provided that the user input stops, but the system will not wait for consistency before processing new requests [10].

The BASE model was first defined by Eric A. Brewer in 1997 [11]. It is a looser specification than ACID, shifting the focus from consistency to availability. This enables a database to handle a higher insertion load and to scale more easily to multiple machines.
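As a rough illustration of these semantics, the toy model below (purely hypothetical; the class and method names are invented and do not correspond to any real database's API) shows how a write can be acknowledged before all replicas have seen it, so a read may briefly return stale data until a synchronization pass converges the replicas.

```python
class Replica:
    def __init__(self):
        self.data = {}

class EventuallyConsistentStore:
    """Toy model of BASE semantics: writes go to one replica and
    propagate lazily, so another replica may briefly serve stale data."""
    def __init__(self, n):
        self.replicas = [Replica() for _ in range(n)]
        self.pending = []  # updates not yet propagated (soft state)

    def write(self, key, value):
        self.replicas[0].data[key] = value   # acknowledged immediately
        self.pending.append((key, value))

    def read(self, key, replica=1):
        return self.replicas[replica].data.get(key)

    def anti_entropy(self):
        # Propagate pending updates: consistency is reached *eventually*.
        for key, value in self.pending:
            for r in self.replicas:
                r.data[key] = value
        self.pending.clear()

store = EventuallyConsistentStore(3)
store.write("temperature", 21.5)
stale = store.read("temperature")   # None: replica 1 has not been updated yet
store.anti_entropy()
fresh = store.read("temperature")   # 21.5 after the replicas converge
```

The point is that the write is accepted (basically available) even though a subsequent read from another replica is temporarily inconsistent.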


2.2 Database Types

There are many different kinds of databases, and a rough classification could be: structured and unstructured databases, also known as relational and NoSQL databases. A third classification that exists somewhere in between is time series databases.

2.2.1 Relational Databases

Relational databases build on the relational model [12]. The data is stored in tables, consisting of rows and columns, that can be connected to other tables to create relationships between the data. A relational database is governed by strict schemas that must be defined before any data is inserted. This means that knowledge about the data structure is required to configure the database before any data can be stored. Relational databases are typically ACID compliant, meaning that they guarantee strong consistency.

2.2.2 NoSQL Databases

NoSQL databases come in different flavors; the main ones are [13]:

• key-value store: data is stored as key-value pairs, where the key is some name that is mapped to a simple value.

• document database: similar to a key-value store, but the value is a more complex data structure known as a document.

• column-based store: data is stored in columns instead of rows, so that each row can have a varying number and type of columns.

• graph database: built on graph theory; treats relationships between data as equally important as the data itself.

A key difference from relational databases is that NoSQL databases do not require a predefined schema to be configured before data can be inserted. They are built around the looser BASE model instead of ACID to support high availability.

When document databases are discussed in this report, a document is the basic item that is stored. At its base it is a JSON object, but it can contain regular fields, nested objects, and nested arrays. A JSON object is thus referred to as a document when discussed in relation to a document database.
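To make this concrete, the sketch below shows what such a document might look like for a sensor report; all field names and values are invented for illustration and are not taken from the thesis.

```python
import json

# A hypothetical sensor report shaped as a document: plain fields,
# a nested object, and a nested array inside one JSON object.
document = {
    "sensor_id": "abc-123",
    "timestamp": "2019-05-29T12:00:00Z",
    "location": {                      # nested object
        "city": "Stockholm",
        "lat": 59.33,
        "lon": 18.07,
    },
    "measurements": [                  # nested array of objects
        {"name": "temperature", "unit": "Cel", "value": 21.5},
        {"name": "humidity", "unit": "%RH", "value": 40},
    ],
}

# Serialized form, as it would be sent to a document database:
payload = json.dumps(document)
```

A document database can store and index this whole object as one item, without a table schema being declared first.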


2.2.3 Time Series Databases

A time series database has a shifted focus compared to relational and NoSQL databases in that it specializes in append operations. A key characteristic of time series data is that each value is tightly coupled with a timestamp. The data is primarily inserted, not updated, as it is mainly relevant together with the timestamp from when it was collected.

Because each new data point carries the current time of the measurement, data is almost always inserted at the end, and if records are deleted, it is done in bulk. Since every value is coupled with a timestamp, there is seldom any reason to update an old value; instead, new data is inserted with the current timestamp. Because data is always inserted at the end, random inserts are allowed to be slow as long as inserts at the end are fast. Another characteristic of time series data is that queries are typically over ranges of values rather than dispersed data points.
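The access pattern described above can be sketched as follows; this is a toy illustration of append-only inserts and range queries, not a real time series database.

```python
import bisect

# Points arrive in timestamp order, so appending keeps the data sorted
# "for free"; a range query then resolves to one contiguous slice.
timestamps, values = [], []

def insert(ts, value):
    # Append-only insert at the end (the fast path a TSDB optimizes for).
    timestamps.append(ts)
    values.append(value)

def range_query(start, end):
    # Binary search finds the contiguous window [start, end].
    lo = bisect.bisect_left(timestamps, start)
    hi = bisect.bisect_right(timestamps, end)
    return list(zip(timestamps[lo:hi], values[lo:hi]))

for ts in range(0, 100, 10):     # a measurement every 10 time units
    insert(ts, ts / 10)

window = range_query(20, 50)     # one contiguous slice of the series
```

Because the query touches one contiguous region, it maps to sequential reads on disk, which is why range queries over time are cheap for this layout while random inserts would not be.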

2.3 B+-tree vs LSM

When data is stored, a data structure is used to enforce a layout that promotes a desired trait such as fast queries or quick inserts. Two well-known data structures are the B+-tree and the log-structured merge-tree (LSM).

A B+-tree is a generalization of the binary search tree [14] that is often used, in some variant, to store database indices. It is a self-balancing tree structure designed to be quick for both random and sequential access [14]. However, while it is fast at read operations even for very large data sets, it is not as fast at inserting new data.

The LSM tree, on the other hand, is a data structure designed for fast inserts and indexing, making it well suited for applications where writes are more common than reads [15]. It specializes in sequential disk access to exploit the fact that disks are commonly much faster at sequential access than at random access, since there is no extra overhead of seeking to different data locations [15].
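A minimal sketch of the LSM idea, with invented class and method names, might look as follows: writes land in an in-memory buffer (memtable); when the buffer fills up, it is flushed as a sorted, immutable run, i.e. one sequential write; reads consult the buffer first and then the runs from newest to oldest. Real LSM implementations add compaction, write-ahead logging, and Bloom filters, all omitted here.

```python
class TinyLSM:
    """Toy LSM sketch: buffered writes, sorted immutable runs, newest-wins reads."""
    def __init__(self, memtable_limit=4):
        self.memtable = {}
        self.runs = []                # sorted runs, newest last
        self.memtable_limit = memtable_limit

    def put(self, key, value):
        self.memtable[key] = value    # cheap in-memory write
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self):
        # One sequential write of a sorted run; the run is never modified.
        self.runs.append(sorted(self.memtable.items()))
        self.memtable = {}

    def get(self, key):
        if key in self.memtable:      # freshest data first
            return self.memtable[key]
        for run in reversed(self.runs):   # then runs, newest to oldest
            for k, v in run:
                if k == key:
                    return v
        return None

db = TinyLSM()
for i in range(10):
    db.put(f"sensor-{i}", i)
db.put("sensor-3", 33)   # newer value shadows the one in a flushed run
```

Note how an update never touches the flushed runs: the new value simply lands in the memtable and shadows the old one, which is what makes writes sequential and fast at the cost of extra work on reads.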


2.4 Indexing

An index is used to increase the performance of specific database queries. Typically, data is stored on disks that are slow to access compared to main memory. To retrieve a record without an index, the database management system (DBMS) would have to scan through every stored record to find the right one, which can become prohibitively expensive as the database grows.

An index can be used to create a smaller table containing all the keys from the main table together with pointers to the exact location of each database record. This smaller table is much faster to scan through to find the location of the sought entry, resulting in fewer disk reads.

As an example, consider a company database containing the employee number, first name, last name, and email address of every employee. Assuming that the employee numbers are ordered, binary search could be used to access a specific record when using the employee number as the key. However, if the query looks up an employee by last name, all of the records might have to be scanned to find the right one. In this case an index could be created by putting all the last names, together with pointers to the full employee records, in a new table.

This new table would be sorted by last name, resulting in a fast lookup of a specific employee by last name. Multiple indices might be needed to support fast queries using other keys such as first name or email address. However, creating a new index requires extra storage space. Another drawback is that insertion of new records and deletion of old ones are slowed down by the process of updating the indices.
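The lookup-versus-scan trade-off above can be sketched as follows. The employee data is made up, and a Python dictionary stands in for the index; a real DBMS would typically use a sorted structure such as a B+-tree rather than a hash map, but the effect is the same: one lookup instead of a full scan.

```python
# Made-up employee records: (employee number, first name, last name, email).
employees = [
    (1001, "Anna", "Berg", "anna.berg@example.com"),
    (1002, "Erik", "Lind", "erik.lind@example.com"),
    (1003, "Sara", "Holm", "sara.holm@example.com"),
]

# Building the index costs extra space and must be maintained on every
# insert or delete, as noted above.
last_name_index = {last: i for i, (_, _, last, _) in enumerate(employees)}

def find_by_last_name(name):
    pos = last_name_index.get(name)      # one lookup instead of scanning all rows
    return employees[pos] if pos is not None else None

record = find_by_last_name("Lind")
```

A query by email address would still require a scan unless a second index is built over that column, which is why each access pattern tends to need its own index.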


2.5 Database Replication Architectures

When a database is hosted on a single machine, a crash might result in the loss of all data. One method of avoiding this is database replication. Two common database replication architectures are master-slave and master-master (also known as multi-master).

When a group of machines hosting a database is configured for master-slave, only one of the machines, the master, performs any writes [16]. The master replicates the database to the slaves, and any update written to the master is pushed to the slaves, as seen in Figure 2.1a. In this setup, any of the machines can handle queries, which allows for load distribution, but because only the master is allowed to update the database, there is still a performance bottleneck. Master-slave can also be implemented with strong consistency requirements such as ACID, as only one machine owns the data.

The second architecture is master-master, sometimes called multi-master, seen in Figure 2.1b. This architecture allows any node to handle both database writes and reads [17], which improves availability, but it cannot guarantee consistency as strong as that required by ACID.

(a) Master-Slave (b) Multi-Master

Figure 2.1: Replication architectures


2.6 Scaling

Two methods of scaling a database are vertical and horizontal scaling. With vertical scaling, the resources of a single machine are increased to handle a larger load, while horizontal scaling focuses on scaling across multiple machines [18]. It might be easy to scale a single machine by adding memory, Central Processing Units (CPU), and disk storage. However, this method of scaling cannot continue indefinitely, and when the resources of a single machine can no longer handle the load, the alternative is to implement horizontal scaling. Adding additional machines to a system provides great scalability, but there are some inherent problems with horizontal scaling that must be dealt with. There are a few ways to scale horizontally depending on what part of the system requires more resources.

2.6.1 Master-Master vs Master-Slave

If there are many read requests, it might be enough to add slaves that have a replica of the database but can only handle read requests. This would be the case in a master-slave architecture, as described in Section 2.5. If the system needs to handle a larger amount of write operations, a multi-master architecture would be more suitable. With this method, each machine has a full copy of the database and can handle both read and write requests, but issues concerning data consistency between the machines must be resolved. In both of these methods, the full database is stored on each machine, but if the data set is too large for one machine, a solution is to use sharding.
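The read-scaling idea behind master-slave can be sketched as a query router that sends every write to the master and spreads reads over all nodes. This is illustrative Python, not tied to any particular database; all names are invented:

```python
import itertools

class MasterSlaveRouter:
    """Route writes to the single master; distribute reads round-robin."""

    def __init__(self, master, slaves):
        self.master = master
        # Reads may be served by any replica, including the master itself.
        self._read_cycle = itertools.cycle([master] + list(slaves))

    def route_write(self):
        return self.master             # only the master accepts writes

    def route_read(self):
        return next(self._read_cycle)  # load-distributed reads

router = MasterSlaveRouter("db-master", ["db-slave-1", "db-slave-2"])
assert router.route_write() == "db-master"
reads = {router.route_read() for _ in range(6)}
# All three nodes end up serving reads, while writes stay a bottleneck
# on the single master -- exactly the trade-off described above.
```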

2.6.2 Sharding

With sharding, as seen in Figure 2.2, the data is split into multiple parts that are stored on different machines. This way, only a fraction of the data is stored on each machine, and all machines can handle both write and read requests. A challenge here is to find a way to split the data evenly across the machines and to make sure that read and write requests are sent to the correct database instance.
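One common way to meet both challenges is hash-based sharding: a stable hash of the record key decides which shard each read or write goes to. A minimal sketch, where the shard count and key names are made up for illustration:

```python
import hashlib

NUM_SHARDS = 4

def shard_for(key: str) -> int:
    """Deterministically map a record key to a shard number.

    A stable hash (not Python's randomized built-in hash()) ensures that
    every node routes the same key to the same database instance.
    """
    digest = hashlib.sha1(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % NUM_SHARDS

# The same key always routes to the same shard, so reads find their data.
assert shard_for("sensor-42") == shard_for("sensor-42")

# A good hash spreads many keys roughly evenly over all shards.
shards = {shard_for(f"sensor-{i}") for i in range(1000)}
assert shards == set(range(NUM_SHARDS))
```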


Figure 2.2: Splitting a collection into multiple shards. (Adapted from https://blog.pythian.com/sharding-sql-server-database/)


2.7 Query Interfaces

Two broad categorizations of query interfaces are those used to retrieve information from a database, and the more general interfaces used to retrieve information from any source. For the rest of this report they will be referred to as database query interfaces and information query interfaces respectively.

2.7.1 Information Query Interfaces

In order to expose a system to a client to enable data retrieval and manipulation, a common method is to introduce an API that defines a set of methods available to the client to interact with the system. There are multiple different types of APIs, and the more well-known ones are: Simple Object Access Protocol (SOAP) [19], Representational State Transfer (REST), and GraphQL [20].

2.7.1.1 SOAP

SOAP is an Extensible Markup Language (XML) based protocol created to exchange information in a distributed environment [19]. It is designed to be lightweight and operating system agnostic, and enables simple communication for programs over HTTP*. It is usually combined with the Web Services Description Language, which defines the available methods and how to call them [21], resulting in a tight coupling between server and client.

2.7.1.2 REST

REST is a well-known design architecture for an API that was first described by Roy T. Fielding in his doctoral dissertation [22]. In the dissertation, Fielding describes how REST is designed to promote longevity and enable independent evolution of server and client through the use of hypermedia as the engine of application state (HATEOAS). This means that the client should not have any out-of-band knowledge about the server other than how to handle hypermedia and the entry point location for the API. Some of the key characteristics of the REST architecture described by Fielding are a uniform and stateless design that enables great scalability, further promoting API longevity.

* SOAP is not dependent on HTTP and works over other protocols as well, such as the Simple Mail Transfer Protocol.


2.7.1.3 GraphQL

GraphQL is a graph query language developed by Facebook to give clients more control over what data they retrieve [20]. It is presented as an alternative to the RESTful approach, and it is contract-driven through the use of schemas that define the functionality. A client can specify which attributes to retrieve, which enables a high granularity of the request as well as a potential reduction in network bandwidth.
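The field-selection idea can be illustrated with a query where the client asks only for a sensor's id and temperature, and the server returns exactly those attributes. The schema and field names below are invented for the example, and the resolver is a toy stand-in for a real GraphQL library:

```python
# A GraphQL-style query selecting only two of the sensor's attributes:
QUERY = """
{
  sensor(id: "42") {
    id
    temperature
  }
}
"""

# The full record as stored on the server.
SENSOR = {"id": "42", "temperature": 21.5, "humidity": 0.61, "battery": 0.93}

def resolve(record: dict, requested: list) -> dict:
    """Toy resolver: return only the attributes the client asked for."""
    return {field: record[field] for field in requested}

response = resolve(SENSOR, ["id", "temperature"])
assert response == {"id": "42", "temperature": 21.5}
# The client never receives humidity or battery, shrinking the payload.
```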

2.7.2 Database Query Interfaces

There are multiple methods developed to access and manipulate a database. One such method is the Structured Query Language (SQL), which is based on the relational model and is typically used to manipulate data in relational databases such as MySQL and OracleDB*. It was defined by D. D. Chamberlin and R. F. Boyce in 1974 [23] and has since been through multiple revisions, with the latest one called SQL:2016 [24].

For non-relational databases, no standard interface exists. Every database implements an interface tailored to itself, and while some are very similar to SQL, they often lack some features due to the strong consistency requirements of ACID-compliant databases [25][26][27]. There are also databases such as Elasticsearch that use their own query strings instead of SQL to provide access and support custom queries [28].

2.7.3 System Integration

A database driver can be used to connect a database to another system. Drivers such as Open Database Connectivity (ODBC) and Java Database Connectivity (JDBC) enable the execution of SQL queries from programming languages such as C, C++, and Java [29], thereby facilitating integration with other systems.
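Python's DB-API 2.0 plays the same role as ODBC or JDBC: a uniform driver interface for issuing SQL from application code. A sketch using the bundled sqlite3 driver; a DB-API-compliant driver for another database would look nearly identical, and the table name here is only an example:

```python
import sqlite3  # a DB-API 2.0 driver, analogous to a JDBC/ODBC driver

# connect() and cursor()/execute() are the standard driver entry points.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE readings (sensor_id TEXT, value REAL)")

# Parameterized queries are part of the driver interface as well.
cur.execute("INSERT INTO readings VALUES (?, ?)", ("temp-1", 21.5))
conn.commit()

cur.execute("SELECT value FROM readings WHERE sensor_id = ?", ("temp-1",))
print(cur.fetchone())  # (21.5,)
```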

Another way to access a database is through a REST API [22], using the Hypertext Transfer Protocol (HTTP) to communicate with the database. Using a REST API introduces an abstraction between the database and the client, allowing the database back-end to be changed without updating the client.

* Multiple dialects of SQL exist and they can differ much between implementations, but they are all based on the original SQL definition created by Chamberlin and Boyce.


2.8 Related Work

LittleTable is a relational database developed by Cisco Meraki that specializes in storing time-series data from IoT devices [30]. It is designed around the assumptions of single-writer, append-only operation, and that the most recent data can be retrieved again from the IoT device in the event of a crash. This allows for weaker consistency and durability guarantees, which simplifies the implementation. However, there is no support for horizontal scaling, and any such requirement is handled by the use of independent LittleTable instances and external systems to shard the data [30].

Nayak, Poriya, and Poojary discussed different types of NoSQL databases in their paper Type of NOSQL Databases and its Comparison with Relational Databases [31]. In the paper they describe the four common categorizations of NoSQL databases and give an overview of how they compare to relational databases. They conclude that one of the major drawbacks of NoSQL databases, which has resulted in lower usage than that of relational databases, is the lack of a common query language.

The authors of [32] compare the NoSQL database MongoDB to Microsoft SQL Server for a modest-sized data set to find out if NoSQL databases are beneficial for data sets smaller than what is referred to as "Big data". In the paper they come to the conclusion that MongoDB performs as well as or better than SQL Server for everything except aggregate queries, where SQL Server is shown to be up to 23 times faster.

In [33] the authors compare 14 different NoSQL databases based on five aspects: data model, query possibilities, concurrency control, partitioning, and replication. They argue that the use case is the dominant factor in determining which database to use, because the various NoSQL databases were developed to solve specific problems that a relational database could not handle efficiently. They also state that the choice of database has to be made based on what types of queries are required, as databases such as key-value stores do not support complex queries.

The developers of InfluxDB explain in their documentation some of the design insights and tradeoffs that are specific to time series data such as sensor data [34]. They argue that operations such as delete and update are very rare and thus do not require high performance, and that if records are deleted, it is often over large ranges of old data. They also state that time series data mostly consists of append operations, because the timestamps of new inserts are primarily very recent. Finally, they identify that when it comes to time series data, no single point of data is too important. In effect, the main focus is on larger data aggregates and not on individual data points.

Page 38: A Comparative Study of Databases for Storing Sensor Data

Chapter 3

Comparison

The first part of finding a good candidate was to limit the number of databases to choose from. In this report only a few candidates are selected from the three categories: relational, NoSQL, and time series. The candidates are mainly chosen based on basic suitability for the task, popularity, and longevity of the database*, with some exceptions for promising newcomers. The selected candidates for comparison are: InfluxDB, Elasticsearch, Cassandra, MongoDB, TimescaleDB, and OracleDB.

The following sections will compare the databases, starting in Section 3.1 with a list of the selected comparison criteria. After that, a discussion of each of the criteria will follow, ending with a summary and conclusion.

3.1 Comparison Criteria

To find a suitable database for a given project, a set of requirements and a few chosen properties are required as a basis for comparison. The databases were compared based on the following properties:

• Scalability and Backups

• Maintenance

• Support for new data types

• Query language: How to get access to the data?

• Long term storage

* A strong community, or low risk of being discontinued by the developers.


A system to handle sensor data from a city environment must be able to grow to support an increasing number of connected sensors, and an increasing number of users as the project is integrated into other products. Another aspect is that historical data could be valuable and should be kept in storage indefinitely. As the system grows in size, the required storage space will be ever increasing. Thus, an important aspect is how to add more storage and whether doing so will temporarily disrupt the service.

In order to ensure the longevity of the data and to protect against failures, clear procedures for backup and restoration are important. Although there are technologies such as replica shards that protect against database corruption, they do not protect against user errors such as accidental deletion of data, or insertion of data that corrupts the database. While performing full backups every time might be feasible for small-scale databases, incremental backups better support growth and encourage shorter backup intervals.

Part of building a robust system that promotes longevity is to strive for low maintenance requirements and, more importantly, to keep potential downtime to a minimum. Another aspect is flexibility, because there is no broadly adopted standard for how sensor data should be formatted when reported. This means that the way data is represented might change in the future, and a system that can adapt to this without losing data in the old format has a clear advantage.

While details about how the data is stored are very important, another consideration is how to access the data. What query language is used, and does it support queries that include data processing? When requesting aggregates of the data, it might be more efficient to perform the aggregation at the database than to perform multiple queries and let the caller compute the result.

When considering whether a specific feature is supported by a database, only built-in features are considered fully supported. While some databases might have extended functionality when combined with other products, that functionality is not considered to be supported by the database itself, but rather a possible extension. If a certain feature is only supported while running in the cloud, it is not considered to be supported by the database, as it depends on cloud features not available when running a self-hosted database.


3.2 Scalability and Backups

NoSQL databases have been the clear choice from the start when great scalability is needed. The main method of scaling is horizontal scaling through the addition of extra nodes.

In a blog post, Netflix demonstrates how Cassandra achieves linear scalability when adding new nodes to the cluster [35]. To scale a Cassandra cluster for read and write operations, it is enough to add more nodes to the cluster, which can be done seamlessly without any downtime [36]. Backups are performed using a snapshot operation that can be completed on the whole cluster or a single node, and after the first snapshot, further backups can be incremental [37].

MongoDB scales through sharding, which is based on a shard key, as explained in the official documentation [38]. According to the documentation, both reads and writes are scaled by adding a new shard. MongoDB can begin with a single machine and then at a later time implement sharding to scale horizontally. However, the initial sharding can only take place if the current database size does not exceed certain limits [39].

Multiple options exist to back up the data, and one of them is to use a built-in tool called mongodump to take a backup of a cluster node. Using this method, a backup of each cluster node has to be performed manually while the load balancing feature is disabled [40]. A drawback with this approach is that it only takes a backup of the documents and not of the index data. The recommended approach is to use snapshot features of the underlying file system to back up the database [40]. Another approach is to use MongoDB Cloud Manager, which can automatically back up clusters running both in the cloud and on local infrastructure using incremental backups [41].

TimescaleDB supports clustering for replication of data and for sharding read queries. However, scaling out to multiple nodes to support higher write workloads is currently unsupported [42]. Using replication, the full data set is available on each replica, and in the case of a failure of the primary node, one of the slaves can take over*. However, only manual failover is available natively, and a third-party solution is required for automatic failover [43]. To protect against user errors that could result in data loss, such as the accidental deletion of data, it is also possible to back up the database using built-in backup functions from PostgreSQL [42].

* If the primary node fails, the time to elect a new master might result in unavailability for new writes and thus data loss.


OracleDB can be scaled both to support more read and write operations, and larger storage, using different techniques. Oracle Real Application Clusters (Oracle RAC) enables horizontal scaling so that multiple database instances share the same storage and can thus handle more read and write operations, provided that the bottleneck is CPU and Random Access Memory [44]. Because all the instances share the same database storage, a bottleneck could still be disk access, and there is still a single point of failure in the database storage back-end.

In order to utilize horizontal scaling for extended storage, Oracle Exadata can be used [45]. However, a drawback with both Oracle RAC and Oracle Exadata is that they are quite expensive, and the cost might not always be justifiable for the collection of sensor data [46][47]. To protect against data loss, a utility called RMAN can be used to perform incremental backups and restore operations [48].

InfluxDB scales through clustering and by adding additional nodes. Adding new nodes to the cluster can help to scale both read and write operations, and increase available disk space [49][50]. The clustering feature is only available in the enterprise edition, which means that if it might be required in the future, the enterprise version should be used from the start instead of starting with the free version. InfluxDB supports both full and incremental backups of the database to recover from data loss [51].

Elasticsearch scales through clustering by adding additional nodes and utilizing sharding. There are different types of nodes that can be added to the cluster to improve read and write performance, and replica shards can be created and distributed among the nodes to keep backups of the data [52]. There is also a snapshot API that can be used to back up the entire cluster to protect against catastrophic failures that cannot be recovered from by replica shards. The snapshot API will take a full backup of the cluster the first time, and then incremental backups for subsequent calls [53]. An Elasticsearch cluster can be created on a single node and then expanded as required through the introduction of additional nodes [52].

While backups and replica shards protect against losing saved data, there is also the aspect of losing future data if the cluster crashes, preventing new data from being written. A solution to protect against this could be to write the data to multiple locations for each insertion. The data ingestion system could write every piece of data both to the database and to some cheap long-term storage solution that is not necessarily optimized for fast queries. In the event that the database becomes temporarily unavailable for new insertions, the data would still be written to the other storage solution. Later, when the database is available again, it could ingest whatever data it missed from the secondary storage.
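This dual-write idea can be sketched as an ingestion step that always appends to a durable secondary log, treats the primary database as best-effort, and replays the log once the database is reachable again. All class and method names below are invented for illustration:

```python
class DualWriteIngestor:
    """Write each record to a cheap durable log first, then to the database.

    If the database is down, records still land in the log and are
    replayed by catch_up() when the database comes back.
    """

    def __init__(self, database, archive_log):
        self.db = database        # fast, query-optimized store
        self.log = archive_log    # cheap append-only storage
        self._pending = []

    def ingest(self, record):
        self.log.append(record)   # never lost, even if the db is down
        try:
            self.db.insert(record)
        except ConnectionError:
            self._pending.append(record)

    def catch_up(self):
        """Replay records the database missed while it was unavailable."""
        while self._pending:
            self.db.insert(self._pending.pop(0))

# A fake database that is initially down, to demonstrate the behavior:
class FlakyDB:
    def __init__(self):
        self.rows, self.up = [], False
    def insert(self, record):
        if not self.up:
            raise ConnectionError("database unavailable")
        self.rows.append(record)

db, log = FlakyDB(), []
ingestor = DualWriteIngestor(db, log)
ingestor.ingest({"sensor": "t1", "value": 20})  # db down: only the log gets it
db.up = True
ingestor.catch_up()                             # db recovers and is backfilled
assert log == db.rows == [{"sensor": "t1", "value": 20}]
```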

3.3 Maintenance

Regardless of which database is in use, there will likely be some form of maintenance required. Typical maintenance tasks might be to upgrade to the next version of the database, recalculate or compact the indices, extend disk storage, or update a schema. How to perform these tasks and minimize downtime varies among the different databases.

With MongoDB, most maintenance tasks can be performed without downtime when running a cluster. A MongoDB cluster has a collection of nodes with one elected as the master, and each secondary node can be taken offline for maintenance and then reintegrated into the cluster [54]. Once all the secondary nodes are done, a manual fail-over can be performed to elect a new master, which allows the previous master to be taken down for maintenance.

Like MongoDB, InfluxDB also supports maintenance tasks such as upgrades of clusters without downtime by updating one node at a time in the cluster [51]. However, it is still recommended to schedule a maintenance window for an offline upgrade.

Cassandra has two maintenance tasks described in the documentation [55] that should be performed regularly to keep the database healthy: repair and compaction. Repair is used to synchronize missed writes to a cluster node that has been temporarily offline, to enforce data consistency. Compaction is performed to remove any expired data and free up disk space. This task can be automated in Cassandra by enabling the option for autocompaction. Cassandra also supports version upgrades without downtime in a cluster by updating one node at a time [56].

Some of the primary maintenance tasks for Elasticsearch are to rotate indices, delete old indices, and upgrade the cluster to newer versions. The index maintenance tasks can be automated by configuring Elasticsearch to automatically rotate the index and using a tool called Curator [57] together with a cron job to regularly move or delete old indices. The index maintenance can also be automated directly in Elasticsearch using a pipeline and an index lifecycle management policy [53]. Thus, the only manual maintenance required is to upgrade to new versions, and this can be done without downtime by updating one node at a time.


TimescaleDB does not currently fully support cluster deployments, and any maintenance task that requires a restart, such as an upgrade, will thus have to be performed during a scheduled downtime. As TimescaleDB is based on PostgreSQL [58], the same maintenance tasks apply, but most of them can be automated using external tools such as a cron job [59].

Oracle provides instructions to help minimize downtime when performing planned maintenance. Features such as online patching and Automatic Storage Management [60] help achieve zero downtime for some of the common tasks [61]. Other maintenance tasks, such as updating statistics and database tuning, can be fully automated [62].

3.4 Support for New Data Types

Building a system for the collection of sensor data with complex data types that is expected to grow requires a database that can be updated to handle both new types of sensors and new types of readings. A sensor might be added to the system that measures a new type of reading that should be collected, and the database should support this without the need to recreate the database or severely disrupt operations. When considering whether a database supports new data formats, a new format should not lead to performance degradation.

InfluxDB organizes data into tags for meta-data and fields for measured values, and only the tags are indexed. The schema cannot be created in advance; it is automatically created based on the inserted data. Once the schema has been created, it cannot be updated with new tags. Thus, if a new sensor is introduced to the system that reports some new fields or tags, its data will automatically be inserted into a new series [63].
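The tag/field split is visible in InfluxDB's write format, the line protocol: tags (indexed meta-data) follow the measurement name, and fields (unindexed values) come after a space. A small helper that formats a point this way; the measurement and key names are examples, and escaping of special characters, which the real protocol requires, is omitted:

```python
def to_line_protocol(measurement, tags, fields, timestamp_ns):
    """Format one data point as InfluxDB-style line protocol (no escaping)."""
    tag_part = ",".join(f"{k}={v}" for k, v in sorted(tags.items()))
    field_part = ",".join(f"{k}={v}" for k, v in sorted(fields.items()))
    return f"{measurement},{tag_part} {field_part} {timestamp_ns}"

line = to_line_protocol(
    "air_quality",
    tags={"city": "stockholm", "sensor": "s42"},  # indexed meta-data
    fields={"pm25": 7.1, "no2": 12.4},            # measured values, not indexed
    timestamp_ns=1556813561000000000,
)
print(line)
# air_quality,city=stockholm,sensor=s42 no2=12.4,pm25=7.1 1556813561000000000
```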

Elasticsearch is a search engine designed to index JSON objects, which means that it does not depend on any fixed schema to store data other than the use of JSON formatting [64]. Elasticsearch stores the raw data and indexes everything* with the help of mapping templates [66]. According to the documentation, Elasticsearch uses a dynamic mapping template by default, which means that new fields are automatically added and their data types are guessed [66]. It further specifies that a dynamic mapping template is applied to new indices, and can be configured to specify the existence and type of a subset, or of all, of the fields. Thus, when a sensor reports a new type of field, it will automatically be discovered and handled.

* Elasticsearch can be configured explicitly to not index certain fields [65].


To change the recognized type or properties of a dynamically mapped field, the mapping template can be updated so that the changes apply at the next index rotation.

Cassandra supports the addition of new columns by altering the schema before insertion [27]. Because each row can have a different set of columns, a new sensor with different measurements will simply use another set of columns.

MongoDB is a document store and supports different structures for each document. There are multiple ways to store sensor data in a document store, but one alternative is to create one document per sensor per measurement. Using this approach means that the addition of new sensor types will just result in differently structured documents, but it will not affect the other measurements, as they are in separate documents and each document is independent.

TimescaleDB supports storing semi-structured data using a binary JSON format. Using this format, only the fields that are mandatory for every sensor, such as identifier and timestamp, are defined as columns; the rest are stored as a binary JSON object [42]. This means that a new sensor can report any type of measurement as long as it also reports the mandatory sensor identifier and timestamp, thereby enabling automated addition of new sensor types to the system. However, since the rest of the fields are stored as a JSON object in a binary format, the whole object has to be deserialized and reconstructed just to access a single field.

The JSON object can be indexed either using a GIN index, or by indexing individual fields [42]. Using a GIN index will only optimize certain queries that inspect top-level fields in the JSON object, which might negatively affect performance depending on the data structure of the collected sensor data. The alternative is to index individual fields, but this is limited to indexing only the fields that are common to all JSON objects [42].
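The fixed-columns-plus-JSON pattern can be tried out with SQLite's built-in JSON functions as a stand-in for PostgreSQL/TimescaleDB JSONB; the real syntax differs, and the column and field names are illustrative:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Mandatory sensor fields as real columns, everything else as JSON.
conn.execute("CREATE TABLE readings (sensor_id TEXT, ts INTEGER, data TEXT)")
conn.execute("INSERT INTO readings VALUES (?, ?, ?)",
             ("s1", 1556813561, '{"temperature": 21.5, "humidity": 0.6}'))
# A new sensor type needs no schema change; it just reports other fields.
conn.execute("INSERT INTO readings VALUES (?, ?, ?)",
             ("s2", 1556813570, '{"pm25": 7.1}'))

# Accessing one field still means parsing the whole stored JSON object.
row = conn.execute("SELECT json_extract(data, '$.temperature') "
                   "FROM readings WHERE sensor_id = 's1'").fetchone()
print(row)  # (21.5,)
```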

OracleDB supports storing time series data using a special schema described in [67]. However, the schema is created before data insertion and is very strict. Sensor reports that do not match the predefined columns will not be inserted into the table. A way to store complex data is to use the same approach as TimescaleDB and use JSON to store it as objects. However, as with TimescaleDB, using JSON to store data imposes some restrictions, and performance might not be equal to storing the data using less complex data types [68].


3.5 Query Language

It is important that the data can be stored efficiently and that the system can be scaled as needed, but it is also important that the data can be queried properly to make use of the information. When collecting data from various sensors, it might be hard to know what the data will be used for in the future, or what types of queries might be run against it. Thus, a flexible solution that does not require knowledge in advance to optimize for certain queries could be preferable.

The importance of this criterion depends on how the database will be used. Some of the query languages support powerful queries that put the heavy load on the database and thereby enable the use of weak clients that cannot aggregate the data themselves. Another aspect is that a well-known language such as SQL might be important if various applications should communicate with the database directly.

On the other hand, even a language supporting powerful queries might not be enough. In that case, an additional service would have to be implemented that queries the database, performs the heavy operations, and exposes whatever interface is best suited for the clients.

InfluxDB implements a query language called Influx Query Language (InfluxQL) that is used for data exploration. It is designed to look and feel similar to SQL, but it does not implement some relational-database-specific features such as table joins; instead it implements some new features useful for time series data [51]. One of the specific features of InfluxQL is continuous queries, which enable the creation of automated periodic queries. A typical use case for this is to periodically run a query that calculates some aggregate of the data inside a sliding time window and inserts it into a new table.

As mentioned in Section 3.4, InfluxDB splits data into tags and fields, and only the tags are indexed. This means that upon data insertion, the user should know what parts of the data will be used to formulate future queries, since the tags cannot be changed without deleting the data and reinserting it with new tags. While it is still possible to query based on fields, this comes with a performance impact, as all the data entries must be searched without an index.
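What a continuous query automates is essentially a windowed aggregation. The computation itself can be sketched in plain Python: bucket raw readings into fixed time windows and average each bucket. The window size and data points are made up for the example:

```python
from collections import defaultdict

def downsample(points, window_s):
    """Average (timestamp, value) points per fixed time window.

    This is the kind of periodic aggregate an InfluxQL continuous
    query would compute and insert into a new measurement on its own.
    """
    buckets = defaultdict(list)
    for ts, value in points:
        buckets[ts - ts % window_s].append(value)   # window start time
    return {start: sum(vs) / len(vs) for start, vs in sorted(buckets.items())}

points = [(0, 10.0), (60, 20.0), (1800, 30.0), (1860, 50.0)]
print(downsample(points, window_s=1800))
# {0: 15.0, 1800: 40.0}
```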

Elasticsearch implements a custom query language called Query Domain Specific Language (Query DSL) that utilizes JSON-style formatting and enables a wide range of queries to be performed on the data to find specific information [53]. Query DSL supports, among other things, aggregate queries such as calculating the average, sum, or percentiles of specific fields, as well as finding min and max values across collected data points. Elasticsearch also supports a subset of SQL that can be used to read data using common SQL SELECT statements [69].
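A Query DSL request is itself a JSON document. The following sketch shows the shape of an aggregation request that averages a field over the last hour; the index and field names are invented, while the overall query/aggs structure follows the documented format:

```python
import json

# Body of a search request: filter on a time range, then aggregate.
request_body = {
    "size": 0,  # return no individual hits, only the aggregate
    "query": {"range": {"timestamp": {"gte": "now-1h"}}},
    "aggs": {"avg_temperature": {"avg": {"field": "temperature"}}},
}

# The body would be sent as JSON over HTTP, e.g. to POST /sensors/_search
print(json.dumps(request_body, indent=2))
```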

Cassandra implements its own query language called Cassandra Query Language (CQL), which is similar to SQL but lacks certain features. Some of the differences from SQL are that CQL does not support joins, nested queries, or transactions, and it does not support logical operators such as OR and NOT [27]. Queries can be filtered using the WHERE clause, but only on columns that are either the primary key or a secondary index [27]. Thus, future types of queries might not be trivial to perform unless the primary key is used to select the data.

MongoDB implements its own rich query language that can be used to extract information from stored documents. It has built-in support for finding data based on constraints, as well as aggregation methods to calculate sums, averages, min, and max values on documents [40].

OracleDB implements SQL, which brings the whole range of powerful queries available to relational databases. SQL supports many built-in functions that can be performed on the data, such as calculating sums and averages, and finding minimum and maximum values. However, some other functions, such as finding the median value, are not supported and have to be implemented using complex nested queries, which can impact performance.
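Such a median workaround looks roughly like the following, demonstrated with SQLite for runnability (the Oracle SQL would differ in detail): the subquery sorts the values, and LIMIT/OFFSET arithmetic picks out the middle one or two rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE readings (value REAL)")
conn.executemany("INSERT INTO readings VALUES (?)",
                 [(v,) for v in (1.0, 2.0, 3.0, 4.0, 100.0)])

# Median as a nested query: sort, then pick the middle row(s) and average.
MEDIAN_SQL = """
SELECT AVG(value) FROM (
    SELECT value FROM readings ORDER BY value
    LIMIT 2 - (SELECT COUNT(*) FROM readings) % 2      -- 1 row if odd, 2 if even
    OFFSET (SELECT (COUNT(*) - 1) / 2 FROM readings)   -- skip the lower half
)
"""
print(conn.execute(MEDIAN_SQL).fetchone())  # (3.0,) -- unaffected by the outlier
```

Note that unlike AVG, this cannot use a simple built-in aggregate; the sort-and-offset subquery is what makes the median comparatively expensive.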

Indices are important for good query performance, and they are created together with the schema before data insertion. Relational databases store data across multiple tables connected with relations to avoid, among other things, duplicated data. Deciding how to store the data across tables and which columns should be indexed affects which queries can be run efficiently.

TimescaleDB, like OracleDB, implements SQL, which enables a wide range of queries to be run on the data. It also implements some extra functions to support more advanced analytics, such as median, percentile calculations, and histograms. Because TimescaleDB is based on PostgreSQL, it also requires the database administrator to create appropriate schemas and indices before data insertion to support efficient queries.


3.6 Long Term Storage

Regularly collecting sensor data from hundreds of thousands to millions of nodes is likely to consume storage space rapidly. All of the databases examined in this report support some form of data retention policy that dictates how long inserted data is kept before it is removed from the database. Storage can be scaled to support longer retention periods, but there is a trade-off between keeping data for a long time and having a fast system.

Elasticsearch supports time-based indices, where new indices can be created regularly to keep the index size manageable. The documentation describes that an index has a lifecycle in which it can transition through hot, warm, cold, and delete stages [53]. When an index is created, it is in the hot stage with much read and write activity. Once a new index has been created and data is no longer inserted into the old index, it can be moved to the warm stage, where queries are still fast but data is read-only. The next step is to move the index to the cold stage when it is seldom queried and the query response time can be slower. After a certain amount of time, an index can be deleted from the cold stage if it is no longer relevant, or it can be archived to some long-term storage solution. If it is archived, it can be re-imported to support queries.

MongoDB has a feature called tag-aware sharding that enables the placement of a shard on a specific machine [70]. This could be used to tag shards so that new documents are placed on a fast machine and automatically moved to a slower machine as they age. This enables fast queries of the most recent data, but a drawback is that the cutoff date between new and old documents must be manually specified and regularly updated.

In a whitepaper, Oracle describes how table partitioning can be used to move less frequently accessed data to slower storage [71]. This could be used in OracleDB to move old data to longer-term storage and make room for new data on the fast storage.

None of the other databases have a feature that facilitates long-term storage of data through separation of recent and old data. Cassandra has a feature to compress the data, reducing storage by up to 33%, but it is only suitable when the inserted data uses the same fields or columns [72]. Sensor data, which might vary greatly in which fields are reported, could thus be hard to compress. Both InfluxDB and TimescaleDB also support data compression to reduce storage requirements. However, it is unclear whether the compression is significant on non-uniform data.


3.7 Summary

Table 3.1 contains a summary of how the databases match most of the chosen criteria; some criteria, such as maintenance and support for complex queries, do not fit in the table. In the use case for this project, described in Section 1.4, the main focus is on scalability, support for new data structures, and long-term storage of the collected data. For this use case, Elasticsearch was the top candidate because it offered the greatest flexibility in all of these areas.

Table 3.1: Database comparison

                      InfluxDB   Elasticsearch   Cassandra   MongoDB      TimescaleDB   OracleDB
Scale Read*           yes        yes             yes         yes          yes           yes
Scale Write*          yes        yes             yes         yes          no            yes
Scale Storage*        yes        yes             yes         yes          no            Exadata
Incremental Backups   yes        yes             yes
Cloud Manager         no         yes
Query Language        InfluxQL   Query DSL,      CQL         JavaScript   SQL           SQL
                                 Partial SQL
Add new data types†   yes        yes             yes         yes          no            no
Long term Storage     no         yes             no          yes          no            yes

* Scaling here means scaling out.
† Without imposing restrictions or performance degradation.


Chapter 4

Prototype development

This chapter explains the approach taken to implement a prototype system for storing sensor data using Elasticsearch, henceforth referred to as elastic, as the database back-end. The prototype was created to handle sensor data generated by the GreenIoT project, which utilizes the Sensor Measurement Lists (SenML) message structure [73].

The prototype was developed over three iterations, each of which brought some new extended functionality. The first iteration implemented only basic functionality to parse and insert data into a single-node elastic cluster, without any functionality for scaling or setup for long-term storage. The second iteration added scalability, support for long-term storage using index rotation, and unique identifiers to enable correlation between a document stored in elastic and the original message stored elsewhere. The last iteration moved to a three-node cluster, added a data retrieval interface, and created an automated deployment method.

4.1 First Iteration

Elastic can be installed directly on a server or in a containerized environment. To facilitate migration to other hosts, and extension from a single-node to a multi-node cluster running on the same host, elastic was installed in a Docker container. Using a Docker container to encapsulate elastic had the benefit of bundling all the dependencies with elastic in an isolated environment, thereby preventing conflicts with other system libraries. In this first iteration, elastic was set up using default settings for a single-node cluster, and elastic version 6.7 was used.


4.1.1 Data Structure

The first design choice to make was how to internally structure the data. The SenML format used by the GreenIoT project defines a message as a list of measurements (also known as records). A record contains regular fields with values such as the sensor name, the time of measurement, and the measured value.

Multiple representations are defined, one of which is the JSON representation. This representation is used by the GreenIoT project, and it is defined as a JSON array containing multiple records as JSON objects. Elastic stores data as JSON objects and cannot directly ingest a SenML-formatted JSON array [73]. Thus, a SenML message must be restructured into a JSON object before insertion into the database.

According to the SenML specification, a record can also contain base fields (in addition to regular fields), as seen in Listing 1a. These fields start with the letter "b", such as "bn" for base name and "bt" for base time. A base field applies to every subsequent regular field with a corresponding name until it is overridden by the specification of a new base field with the same name. Thus, to get the full record, the regular fields first have to be combined with the base fields to form what the SenML specification refers to as resolved records, as seen in Listing 1b.

The choice was either to store the SenML message unresolved and let the client retrieve and resolve the records when querying, or to store the measurements in resolved form. A problem with storing the measurements in unresolved form is that a client would have to retrieve the whole SenML message in order to resolve it, even when only interested in a single field. This is because a base field might have been specified earlier in the message that must be applied to the regular field to be retrieved. Another drawback is that elastic cannot be used to aggregate the measured values, as they are not correct until they are resolved.

Every SenML message could potentially contain multiple definitions of a base field, each of which should be applied to subsequent regular fields. A structure for storing unresolved SenML records would thus also have to keep track of the record order, and the structure would be unnecessarily complex to support potentially multiple occurrences of the same base fields. Because of these drawbacks, the SenML messages were first resolved and then split into multiple documents before ingestion by elastic.
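The resolution step can be sketched as a small Python function. This is a minimal sketch covering only the base fields that appear in Listing 1 ("bn", "bt", "bu"), not the full RFC 8428 specification, and it is not the prototype's actual library code:

```python
def resolve_senml(records):
    """Resolve a SenML record list by applying the base fields
    "bn" (name), "bt" (time), and "bu" (unit) to the regular
    fields that follow them."""
    base = {}
    resolved = []
    for record in records:
        # A base field applies to this and every subsequent record
        # until a new base field with the same name overrides it.
        for key in ("bn", "bt", "bu"):
            if key in record:
                base[key] = record[key]
        resolved.append({
            "n": base.get("bn", "") + record.get("n", ""),
            "t": base.get("bt", 0) + record.get("t", 0),
            "u": record.get("u", base.get("bu")),
            "v": record.get("v"),
        })
    return resolved

# The message from Listing 1a.
message = [
    {"bn": "urn:dev:ow:10a10240b1020085;", "bt": 1554098400.0,
     "bu": "%RH", "n": "humidity", "v": 55.85},
    {"n": "humidity", "t": -5, "v": 55.80},
    {"n": "temp", "u": "Cel", "t": -5, "v": 18.5},
]
resolved = resolve_senml(message)
```

Applied to the message from Listing 1a, the function produces records matching Listing 1b: the base name is prepended to each name, relative times are added to the base time, and the base unit fills in records that lack their own unit.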


(a) Original SenML message:

    [
      {
        "bn": "urn:dev:ow:10a10240b1020085;",
        "bt": 1.554098400e+09,
        "bu": "%RH",
        "n": "humidity",
        "v": 55.85
      },
      {
        "n": "humidity",
        "t": -5,
        "v": 55.80
      },
      {
        "n": "temp",
        "u": "Cel",
        "t": -5,
        "v": 18.5
      }
    ]

(b) Resolved SenML message:

    [
      {
        "n": "urn:dev:ow:10a10240b1020085;humidity",
        "t": 1.554098400e+09,
        "u": "%RH",
        "v": 55.85
      },
      {
        "n": "urn:dev:ow:10a10240b1020085;humidity",
        "t": 1.554098395e+09,
        "u": "%RH",
        "v": 55.80
      },
      {
        "n": "urn:dev:ow:10a10240b1020085;temp",
        "t": 1.554098395e+09,
        "u": "Cel",
        "v": 18.5
      }
    ]

Listing 1: Resolving a SenML message


4.1.2 Dynamic Mapping of Data Types

Elastic by default attempts to automatically detect the data type of each field in a document. This makes elastic very flexible, as new fields in the data are automatically detected and mapped to some type, such as integer, float, or string, to enable aggregations and efficient search. However, once a field has been mapped to a type and data has been stored for that field, the type cannot be changed.

A problem was encountered when first inserting sensor data into the database, because all sensors using the SenML format store numerical values in a field named v. The first measurement to arrive was from an air quality sensor that reported a value of 1 ug/m3. The dynamic mapping feature of elastic used this to determine that the field named v should contain integer values. Later, when a measurement from an air pressure sensor arrived with a value of 933.66 hPa, it could not be inserted into elastic because of the conflicting data type float.

Two ways to solve this would be either to convert all integers to floats, so that 1 ug/m3 would be stored as 1.0 ug/m3, or to continuously rename the field to something else when it contains a float before inserting it into elastic. A decision was made to always convert the received numeric sensor values to floats to keep them consistent in the database.
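The chosen convention amounts to a one-line normalization step before insertion. A minimal sketch (the function name is illustrative; the bool check exists because bool is a subclass of int in Python):

```python
def normalize_record(record):
    """Coerce the SenML value field "v" to float before insertion,
    so that elastic's dynamic mapping never locks the field to the
    integer type."""
    value = record.get("v")
    if isinstance(value, int) and not isinstance(value, bool):
        record["v"] = float(value)
    return record

first = normalize_record({"n": "pm10", "v": 1})          # 1 ug/m3
second = normalize_record({"n": "pressure", "v": 933.66})  # 933.66 hPa
```

With this in place, the first document to arrive maps v as a float, and later float values such as 933.66 no longer conflict.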

4.1.3 Parsing SenML Messages

A SenML library [74] written in Python was used to parse received messages as SenML. While it supported parsing and resolving of SenML messages, it did not perform adequate sanity checks to verify that the whole message conforms to the SenML specification. Another problem was that the library was written for an earlier specification of SenML, which meant that certain new requirements were not enforced. To solve this, the library was forked and updated to implement the required checks.

A design decision here was whether a message that mostly conforms to the SenML specification, containing multiple conforming records but a single malformed record, should be rejected as a SenML message, or whether it should be parsed as SenML with the malformed record stripped away. The obvious drawback of rejecting the whole message is that many correctly formatted measurements might not be stored in the database because of a single malformed measurement. However, by rejecting it as SenML, it can be parsed as some other format by trying multiple parsers before giving up on storing the data.


For this prototype, the decision was made to reject such messages as SenML, to force either an update at the source to send properly SenML-formatted messages, or the implementation of an additional parser that can fully understand the received message.
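The reject-then-fall-back behaviour can be sketched as a chain of parsers tried in turn. The parser registry and its error convention below are illustrative assumptions, not the prototype's actual code:

```python
import logging

logger = logging.getLogger("parser")

def parse_report(payload, parsers):
    """Try each registered parser in turn and return the first
    successful result. Reports no parser understands are logged
    and dropped rather than crashing the service."""
    for parse in parsers:
        try:
            return parse(payload)
        except ValueError:
            continue  # not this format: try the next parser
    logger.warning("no parser accepted the report: %r", payload)
    return None
```

A strict SenML parser that raises ValueError on any malformed record would be the first entry in the parser list, with more permissive formats behind it.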

4.1.4 Calculating Measurement Time

SenML defines a field for specification of measurement time. This field can either be absolute, in seconds since the Epoch*, or relative to the current time for situations where the sensor does not have access to a real-time clock. Because of this, the SenML messages had to be tagged with a timestamp when received, to be used for calculation of measurement time.

To differentiate between absolute and relative time, the SenML specification defines a cut-off point, where values less than that are considered relative time. A design decision here was what to do with a message that reports a time greater than zero but less than the cut-off point.

When a message arrives, the time field is overwritten with the calculated absolute time if relative time was used. If the reported time was -10, the measurement occurred roughly ten seconds before the message was received. However, a positive time value such as 200 would refer to 200 seconds in the future from when the message was received. This is relevant for the use of actuators, where something is scheduled to occur. In a system that only retrieves sensor data, a positive value that is not absolute time should never occur.

However, in this project some of the test sensors reported their uptime as the measurement time instead of a zero to represent that the measurement was taken roughly now. Because this would be interpreted as a timestamp in the future, the decision was made to replace those values with the current time to indicate that the measurement was taken roughly now.
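The rules above can be sketched as a single function. The cut-off value 2**28 is the one defined by RFC 8428; the function name and the uptime-clamping behaviour follow the decision described in this section:

```python
# RFC 8428: time values at or above 2**28 are absolute (seconds
# since the Epoch); smaller values are relative to "now".
SENML_TIME_CUTOFF = 2 ** 28

def absolute_time(reported, received_at):
    """Compute the absolute measurement time from the SenML time
    field and the reception timestamp added by the tagger.
    Positive relative values (sensor uptime, in this project) are
    replaced with the reception time."""
    if reported >= SENML_TIME_CUTOFF:
        return reported             # already an absolute timestamp
    if reported > 0:
        return received_at          # uptime reported: treat as "now"
    return received_at + reported   # relative offset, e.g. -5 seconds
```

For example, a reported time of -10 with a reception timestamp of 1554098400.0 resolves to 1554098390.0.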

4.1.5 Message Transport Protocol

The test sensors used for the prototype reported their measurements using the Message Queuing Telemetry Transport (MQTT) protocol [75] together with a message broker. To retrieve the sensor measurements from the broker, an MQTT client called paho-mqtt [76] was used. A client library for elastic written in Python was used to insert data into the database [77]. Whenever an MQTT message was received, it was parsed as SenML if possible, and then inserted into the elastic database.

* Time since the Epoch, 1970-01-01 00:00:00 +0000 (UTC)

4.2 Second Iteration

For the second iteration, the focus shifted to building a more robust and scalable system. As two of the requirements from Section 3.1 were to support long-term data storage and data backups, a solution was to also store the data in an external system.

4.2.1 Long Term Storage and Backups

The prototype was designed to enable writing the original sensor data to a separate storage system before restructuring it to fit in elastic. A benefit of this approach is that, since the original message is stored, it can be reinserted into elastic in the event of data loss or corruption. If desirable, the original data can also be inserted into some other data storage solution at a later date.

A challenge to overcome was that the received sensor data does not necessarily contain anything uniquely identifying the message. Thus, there was a problem of how to link the data in elastic to the original message in the external storage system. Another related problem was that running multiple instances of the data ingestion system for scalability might result in duplicate data in the database. To solve the problem of linking the data in elastic to the original message, a Universally Unique Identifier (UUID) [78] was generated. It was appended to each received sensor message before the message was sent to the external storage system and the parsing functions.
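The tagging step can be sketched as wrapping each raw message in a small envelope. The envelope field names below are illustrative assumptions; the report does not specify them:

```python
import time
import uuid

def tag_report(raw_message):
    """Wrap a received sensor message in an envelope carrying a
    UUID and a reception timestamp, as the tagger service does."""
    return {
        "uuid": str(uuid.uuid4()),  # links elastic docs to the original
        "received": time.time(),    # used to resolve relative SenML times
        "payload": raw_message,
    }

tagged = tag_report('[{"n":"temp","v":18.5}]')
```

The same envelope is forwarded both to the external storage system and to the parsers, so a document in elastic and its original message share one identifier.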

4.2.2 Scaling Through Modularity

To better support scaling, the data ingestion system was split into multiple subsystems, as seen in Figure 4.1. In the figure, a report changes color as it traverses the system, where a new color indicates that the report has been modified.


The first part of the system was a tagger service that received the MQTT messages, tagged them with a UUID and timestamp, and then sent them to another service for processing. This service was developed to be very fast, as it is the single data ingestion point that cannot be duplicated without using different data sources to prevent duplication in the database. Thus, to add another MQTT broker as a data source, it is enough to run an additional tagger service connected to that source.

This new tagger service will perform the same operations as the other tagger, and then send the report to the pool of parsers for further processing. If a protocol other than MQTT, such as the Constrained Application Protocol (CoAP), should be used, it is enough to write a new tagger service that implements a CoAP library instead of the paho-mqtt library used here.

Figure 4.1: Architecture of the prototype system


The second part was the parser service that received the tagged messages, parsed them, and inserted them into elastic and, optionally, a separate storage system. Functionality to enable sending the messages before parsing to a separate storage system, as illustrated in Figure 4.1, was implemented but not used in the current prototype.

The parser service performed the comparatively heavy work but could be scaled to support higher loads by introducing a load balancer between the tagger service and multiple instances of the parser service. The parser service exposes an API, and the passing of messages between the tagger and parser services was performed via HTTP POST requests.

The third component was the elastic database, which exposed a REST API for data insertion, querying, and management. This separation of subsystems via APIs allowed each service to be developed separately, hosted on a different machine, and potentially hosted on different networks.

4.2.3 Multi Threading and Asynchronous Requests

To improve performance and to minimize the waiting time introduced by potential network delays, both multi-threading and asynchronous requests were introduced. The MQTT library in the tagger service defines a callback method that is called every time a message is received.

The library is single-threaded, which meant that only a single message could be processed at a time. Because of this, the next message could not be processed until the parser service had accepted the previous message and responded to the request, which included the network round-trip delay.

A design decision here was either to create a new thread every time an MQTT message was received, or to implement asynchronous requests. The tagger service could create a thread pool and a queue, where each thread continuously retrieved the next message from the queue, processed it, and sent it to the parser service. Every time a new MQTT message was received, it would be placed in the queue for the other threads to handle. Using this method, the thread receiving messages from the MQTT broker would not have to wait for the delay in sending the message to the parser service and waiting for the reply.

The other approach was to implement asynchronous requests using a Python library called asyncio [79]. This library enables the tagger service to tag a received MQTT message, send it to the parser service, and then process the next message without waiting for a response. Asyncio utilizes a single separate thread that handles the transmission of messages to the parser service and receives the responses without blocking the handling of new messages.

With multi-threading, a thread with nothing to do can potentially waste time waiting for another thread to perform some work. With asynchronous requests, the program itself decides when to switch to another task and can do so with more insight than the CPU. Asynchronous requests were used for the tagger service, but to further improve performance on a multi-core CPU, a combination with multi-threading could be used.
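The fire-and-forget pattern used in the tagger can be sketched with the standard asyncio library. In this sketch the HTTP POST to the parser service is simulated with a sleep, and the function names are illustrative, not taken from the prototype code:

```python
import asyncio

async def send_to_parser(report):
    """Stand-in for the HTTP POST to the parser service; the
    network round trip is simulated with a short sleep."""
    await asyncio.sleep(0.01)
    return report["uuid"]

async def tagger(reports):
    # Dispatch every report without waiting for the previous
    # response: all sends run concurrently on one event loop,
    # so total time is roughly one round trip, not one per report.
    tasks = [asyncio.create_task(send_to_parser(r)) for r in reports]
    return await asyncio.gather(*tasks)

acked = asyncio.run(tagger([{"uuid": str(i)} for i in range(5)]))
```

With five reports and a 10 ms simulated round trip, the whole batch completes in roughly 10 ms rather than 50 ms, which is the behaviour the section describes.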

The parser service utilized the Flask library to implement the API, and every incoming request automatically spawns a new thread to handle it [80]. To minimize the time the tagger service has to wait for the message to be parsed and sent to elastic, an additional thread is spawned whenever a message is received, which parses the data and sends it to elastic. The first thread can thus return a response to the tagger service as soon as the new thread is spawned, without waiting for the result, thereby reducing the time the tagger service has to keep the network connection open.

4.2.4 Index Lifecycles

When a document is sent to elastic, it is stored in an index, which is a collection of documents. An index has a limit on the number of documents that can be stored, and it is therefore necessary either to delete documents once it is full, or to start writing to a new index. When a document is sent to elastic, the index can be specified, enabling the client to decide when to move to a new index.

Another option is to define a pipeline that will automatically create a new index based on the timestamp. Using this method, a new index can automatically be created to contain documents within a given time period, such as every hour or every day. A third alternative is to use a rollover policy that automatically creates a new index based on age, size, or number of documents. Using the rollover method, the index names would be a base name with an appended number, such as "measurements-000001".

In order to reduce the load on the elastic cluster, old indices can automatically be deleted using an index lifecycle management policy [53]. This policy can be used to delete indices that are older than some specified maximum age, such as 6 months. For this prototype, the pipeline method was used to generate more predictable index names, such as "measurements-2019-03-16", while retaining automatic index creation. A policy was defined to automatically delete any index older than 6 months.
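The naming scheme and retention rule can be made concrete with two small helpers. These are sketches for illustration: in the prototype the ingest pipeline and the lifecycle policy perform these steps inside elastic, and the 183-day constant is an assumption standing in for "roughly 6 months":

```python
from datetime import datetime, timedelta, timezone

def index_name(ts):
    """Daily index name of the form the pipeline produces,
    e.g. "measurements-2019-03-16"."""
    day = datetime.fromtimestamp(ts, tz=timezone.utc)
    return "measurements-" + day.strftime("%Y-%m-%d")

def expired(name, now, max_age=timedelta(days=183)):
    """Check whether a daily index has passed the ~6-month
    retention period and should be deleted."""
    day = datetime.strptime(name[len("measurements-"):], "%Y-%m-%d")
    return now - day.replace(tzinfo=timezone.utc) > max_age
```

A timestamp taken on 2019-03-16 maps to "measurements-2019-03-16", and that index counts as expired once roughly six months have passed.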

4.3 Third Iteration

The goals of the third iteration were to deploy an elastic cluster, build a data retrieval interface, and facilitate prototype deployment. The elastic cluster was deployed as Docker containers using docker-compose. The move from a single-node cluster to a three-node cluster was simply a matter of changing some start parameters and adding two more containers. A Docker network was created to connect the containers and to emulate a real network.

4.3.1 Data Retrieval Interface

The first two modules developed were the tagger and the parser. They were built to handle incoming sensor data and send it to elastic. A third module was developed in this iteration to provide a demo interface for data retrieval. This module creates a bridge between the user and elastic and defines a few GraphQL queries exposed through a basic web interface.

The free version of elastic does not define any users or security features, which means that anyone who can send an API request directly to elastic has full administrative control. Because of this, elastic is not exposed externally, but only reachable through the demo interface for queries and through the tagger for data insertion. The exposed interface enables the user to write GraphQL queries against elastic, and provides feedback on what information can be retrieved and which parameters can be specified.

4.3.2 SenML Parsing

A problem encountered during testing was that the SenML library did not support redefinition of base fields. This is because it was written for an earlier version of SenML that mandated that all base fields be defined in the first object of the array. To work around this problem, the library was updated to support this feature and to accommodate some new restrictions imposed by the latest draft.

However, not every feature from the latest draft was implemented, and therefore the library does not fully implement the SenML specification. The only representation implemented was the JSON representation, and some features, such as Sensor Streaming Measurement Lists, have not been implemented.

4.3.3 Automated Deployment

Each of the different modules was developed to be independent and to communicate via HTTP requests, to facilitate the exchange of any module. The source code* of the three modules was published on GitHub as free and open source.

As mentioned before, the prototype was split into multiple modules to scale better and to support independent development of the modules. At the same time, more modules complicate deployment, as each module needs to be deployed in the correct manner. As a way to simplify the setup and enable rapid deployment, a set of deployment scripts, written for an automation tool called Ansible, was created [84].

The scripts were developed to deploy the prototype on an arbitrary Linux machine running a recent version of Ubuntu†. The scripts were configured to handle the complete setup, including installation and configuration of elastic and of the three modules. However, the elastic cluster nodes are deployed to the same machine, which is not suitable for a production environment. The Ansible scripts‡ were published to GitHub as free and open source.

* jfjallid/GreenIoT-MQTT-Tagger [81], jfjallid/GreenIoT-SenML-Parser [82], and jfjallid/GreenIoT-GraphQL-Demo [83].
† Ubuntu version 18.04 or later.
‡ jfjallid/GreenIoT-Ansible [85]


Chapter 5

Evaluation

This chapter contains the qualitative evaluation of the system design and implementation. The system was designed to be generic and work for multiple use cases by following the specification from Section 1.4, but the prototype was implemented and tailored for a specific one. As such, one part of the evaluation was based on functionality test results, and another part was a subjective evaluation of the system architecture and its suitability for the given use case.

In the following sections, the prototype will be evaluated based on a set of functionality tests designed to verify that a given solution can handle standard tasks such as data insertion and data retrieval. If the prototype can handle the functionality tests, it is considered a success. Following the functionality tests, the prototype will be evaluated from a cost perspective and with regard to licensing and support. Finally, a discussion will follow of which parts were not evaluated and how they could be evaluated in the future.

5.1 Basic Suitability

To evaluate the basic suitability of the system design, the prototype was deployed and fed data produced by real sensors while monitoring how it behaved. Artificial data was generated to simulate the introduction of new types of sensors and new data formats. The system was monitored to see if it behaved as expected and if it handled unrecognized data gracefully*. The success of the prototype was determined based on whether it could handle the functionality tests without any problems. More details about the prototype testing can be found in Appendix A.

* Gracefully here means that the unrecognized measurement report is logged and does not crash the system.


5.2 Data Retrieval

Due to security considerations, a user is not allowed to pose arbitrary queries to the system. Instead, all queries are predefined and exposed through the data retrieval interface. As the designer of a generic system, it is difficult to predict every future use case and the ways in which a user might want to query the collected data. As such, only a demo interface was implemented for this prototype, with two predefined queries available and some parameters that can be adjusted.

While these two queries cover one use case and enable a user to retrieve any stored data, this system architecture means that any query must first be implemented by the system designer before it can be asked of the data. It also means that there is not much to evaluate until the data retrieval interface is extended. Further evaluation could look into how to perform queries that combine data from sources that use different formats for their reports, such as SenML and non-SenML, but a prerequisite is that the prototype is extended with this type of query.

5.3 Licensing, Cost, and Support

Elastic is released under multiple licenses, where all the base functionality used in this project is free and open source under an Apache 2.0 license. Other features, such as security and machine learning, are made available only through a subscription model. The other parts of the prototype, such as the parser module, are also free and open source under the Apache 2.0 license.

Because of the free licensing, hosting an elastic cluster and the developed prototype in the way that was done in this project does not incur any initial or recurring costs for the software. However, if support or more advanced features are required, or if hosting of the elastic cluster should be outsourced, there are various subscriptions from elastic to accommodate that [86].


5.4 Parts Not Evaluated

This section covers the parts of the prototype that were not evaluated, and some of the difficulties in evaluating certain aspects.

5.4.1 Scalability

One of the sought properties of the designed system was for it to be scalable, to support a growing number of connected devices and potential data sources. Scalability went into the system design, but it is inherently hard to evaluate, as scaling the system requires more resources. Due to this difficulty, combined with limited time and resources, the system's scalability was not evaluated.

To evaluate the system's scalability, the elastic cluster should at the very least be deployed on separate machines. It could also include more cluster nodes with dedicated roles, such as coordinating nodes that handle requests, and ingest nodes that process the data before indexing.

Other things to consider for the elastic cluster would be to reduce the data retention period from 6 months to a shorter period, such as one week, and to change the index rotation to every hour instead of every day. A load balancer should be placed between the tagger module and the parser modules from Figure 4.1, and multiple parser modules should be deployed on separate machines to better support a higher ingestion rate.

Finally, a high workload would have to be generated to evaluate how the system operates at a larger scale. This could be accomplished by simulating the desired number of sensors generating random measurements and publishing them to MQTT brokers. Things to test could include dividing the simulated data sources among multiple MQTT brokers, and perhaps different topics, to enable the use of multiple tagger modules.
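A load generator of this kind only needs to produce syntactically valid SenML reports. A minimal sketch of one such generator (the "urn:dev:sim:" name prefix and the value range are illustrative assumptions; a driver loop would publish the output to the MQTT brokers under different topics):

```python
import json
import random
import time

def random_senml(sensor_id):
    """Generate one artificial SenML report for load testing, in
    the shape the GreenIoT sensors use (see Listing 1a)."""
    return json.dumps([{
        "bn": "urn:dev:sim:%d;" % sensor_id,
        "bt": time.time(),
        "n": "temp",
        "u": "Cel",
        "v": round(random.uniform(-20.0, 35.0), 2),
    }])
```

Because each simulated sensor has its own identifier, the generated sources can be divided among brokers and topics to exercise multiple tagger modules.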


5.4.2 Performance

An important aspect of a back-end system designed to handle large amounts of data and users is benchmarking its performance. Due to limited resources, this prototype was built and deployed on a single machine, including the elastic cluster. This meant that multiple cluster nodes were running on the same machine, and because of that, and because of limited time, this type of evaluation was not performed.

At the very least, to evaluate the performance, the elastic cluster should be deployed on multiple machines running on the same network. Things to test would include how many insertions the system can handle per second, and the response time for various queries under varying load conditions.

5.4.3 Methods of Transport

Only a single method of transport, MQTT, was implemented in the prototype, but the prototype was designed to support the implementation of other methods of transport as well. Because of this, there was not much to evaluate except that MQTT worked as it should.

To evaluate other methods of transport, they would first have to be implemented as new tagger modules. After implementation, they could be evaluated individually or together by sending sensor data through the different methods of transport.

5.4.4 Accessing Old Data

The current prototype does not store the original messages in a secondary storage. However, everything is prepared by assigning each measurement a unique identifier that can be used to identify a given measurement in the secondary storage system.

The procedure to access data that is no longer present in elastic would be to retrieve the records from the secondary storage system and ingest them into a new index to enable a user to query that data. Then, once the data is no longer needed, the index would be manually deleted. To evaluate this, a secondary storage would have to be implemented, and the parser module would have to be modified to send a copy of the original message there.


Chapter 6

Analysis

This chapter contains the qualitative analysis of the system design and the implemented prototype. It covers the design choices, apparent weaknesses, and alternative solutions.

6.1 Prototype

This section provides an analysis of the difficulties encountered in relation to the system design and the implemented prototype. It covers some of the choices that had to be made while developing the prototype, and how they could have been made differently. The analysis is structured as one topic per subsection, and each subsection attempts to convey the importance of the specific choice.

6.1.1 Index Rotation

For the prototype development, one of the design choices was whether to use time-based indices or some other setup. Elastic supports automatic index rotation based on time, number of documents, or size of the index. Since the index is specified when inserting the data, some other scheme could also be implemented outside of elastic. As a system designer, it is difficult to imagine every possible scenario for how the product might be used and, by extension, the most suitable index rotation scheme.

Which method is used to rotate the indices mainly affects how the data can be retrieved, but rotation based on index size leads to a predictable index size. A predictable index size could be useful in determining how many indices to retain before deletion to stay within the available storage. It could also be useful in the case of external processing, where multiple indices are retrieved from the database, to calculate how many indices can fit in memory.

Although a time based index might result in evenly sized indices, provided that the load is even and the message size is constant over time, a more varying load and message size would make the index size unpredictable. The advantage of rotating the indices based on time is that it narrows the search space when only a subset of the measurements is of interest.

When interested in the measurements within a time range, only the relevant indices can be searched instead of looking through every index to see if it contains measurements within the specified time range. A time based index that rotates every day could be named ”measurements-yyyy-mm-dd”, e.g., ”measurements-2019-04-10”. This enables the clients to be more specific in which index to retrieve, as the specification of index names in the elastic API supports wildcards. For example, ”measurements-2019-04-1*” can be used to retrieve all the indices between 2019-04-10 and 2019-04-19 inclusive.
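As an illustration, the daily index name can be derived from a measurement's epoch timestamp; this is a sketch of the naming scheme, not the prototype's actual code:

```python
from datetime import datetime, timezone

def index_name(ts: float, prefix: str = "measurements") -> str:
    """Daily index name of the form measurements-yyyy-mm-dd (UTC)."""
    day = datetime.fromtimestamp(ts, tz=timezone.utc).strftime("%Y-%m-%d")
    return f"{prefix}-{day}"

# A client can then narrow a search to ten days with a wildcard pattern
# instead of querying every index:
pattern = "measurements-2019-04-1*"

print(index_name(1554446379.78))  # measurements-2019-04-05
```

The wildcard pattern is passed as the index argument of a search request against the elastic REST API.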

6.1.2 UUID and Duplicated Measurements

When inserting data into the cluster, elastic takes care of generating a unique identifier for the document if none is specified, and that is the only thing that is verified to be unique. This means that if the same document is sent to elastic twice, it will be stored under two different unique identifiers*.

To handle this, the prototype contains a tagger module that generates a unique identifier for every message and must be the single entry point of all the messages from a given source. This is the implemented solution to prevent duplicated measurements in the database. A drawback is that this tagger module cannot be duplicated for a single source, i.e., the messages from a given sensor must not pass through more than one tagger module, to avoid duplication in the database.

In this project, the sensor messages are published to an MQTT broker. If the load from a broker is too high for a single tagger module, the sensor messages could be divided into discrete MQTT topics. This way, multiple taggers can be connected to the same broker but listen on different topics.
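One way to divide the messages is to partition sensors over topics deterministically; the following sketch assumes a hypothetical topic scheme ”sensors/part-N” that is not part of the prototype:

```python
import zlib

def topic_for(sensor_id: str, partitions: int = 4) -> str:
    """Map a sensor to one of a fixed set of MQTT topics.

    The mapping is deterministic, so a given sensor always publishes to
    the same topic, and each tagger instance subscribes to exactly one
    partition -- no message ever passes two taggers.
    """
    part = zlib.crc32(sensor_id.encode()) % partitions
    return f"sensors/part-{part}"
```

The gateway would publish each measurement to topic_for(sensor_id), and tagger i would subscribe only to ”sensors/part-i”.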

A better approach is to generate this unique identifier at the point of origin so that any copies of the message written to elastic will use the same unique identifier and thus not create a duplicate. If the sensor cannot generate a unique identifier, it should be generated as close to the sensor as possible, before there is any risk of duplication of the message. In this project, a good place to generate a unique identifier would be at the gateway for the sensor network that publishes the measurements to the message broker, as illustrated in Figure 4.1.

* Provided that elastic generates the unique identifier
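Tagging at the gateway could be sketched as follows; the envelope fields mirror the ”uuid” and ”timestamp” fields that the prototype's tagger adds, but the function itself is illustrative, not the prototype's code:

```python
import json
import time
import uuid

def tag_at_gateway(raw: bytes) -> bytes:
    """Attach a UUID and receive time before the first point where the
    message could be duplicated; all downstream copies share the id."""
    envelope = {
        "uuid": uuid.uuid4().hex,   # same id on every copy of this message
        "timestamp": time.time(),
        "payload": json.loads(raw),
    }
    return json.dumps(envelope).encode()
```

Any retransmission or fan-out after this point carries the same UUID, so elastic can recognize and discard duplicates.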

Another way is to use a hash of the message content as the document identifier. The same document would then always generate the same hash, preventing storage of duplicated documents. However, the inherent problem with hashes is that two distinct documents might generate the same hash due to hash collisions, which would require longer hashes to reduce the likelihood of a collision.
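A content hash as the document identifier could be sketched like this; the choice of SHA-256 is an assumption, any sufficiently long collision-resistant hash would do:

```python
import hashlib

def document_id(message: bytes) -> str:
    """Derive the document _id from the message content itself, so a
    re-delivered copy maps to the same document instead of a new one."""
    return hashlib.sha256(message).hexdigest()
```

Inserting with an explicit _id makes a duplicate delivery overwrite the existing document rather than create a second one.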

6.1.3 Data Structure

In Section 4.1 a discussion was held about why the SenML messages had to be resolved before being stored in the database. A design decision was made to split each SenML message into multiple documents so that a document only contains a single measurement. This resulted in some SenML messages transforming into a dozen documents to be inserted into the database, which meant that more disk space was used to store metadata about the documents.
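The split can be illustrated with a simplified resolver that applies SenML base fields and emits one document per measurement; only ”bn” and ”bt” are handled here, whereas the real SenML library covers every base field defined in RFC 8428:

```python
def resolve_senml(records: list) -> list:
    """Turn one SenML message into standalone measurement documents."""
    bn, bt = "", 0
    docs = []
    for rec in records:
        bn = rec.get("bn", bn)   # a base field applies to its own record
        bt = rec.get("bt", bt)   # and to every record that follows it
        doc = {k: v for k, v in rec.items() if k not in ("bn", "bt")}
        doc["n"] = bn + doc.get("n", "")   # resolved name
        doc["t"] = bt + doc.get("t", 0)    # resolved absolute time
        docs.append(doc)
    return docs
```

Each returned dictionary is self-contained and can be inserted as its own document, at the cost of repeating the resolved name and time in every one.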

An alternative would have been to resolve every measurement and then combine them into a single document so that every received SenML message only generates one document. This would amount to less stored metadata than the previous solution. However, a drawback of storing multiple measurements in a single document is that queries get more complex. This is because each document would likely contain nested structures to handle identical field names in every measurement.

6.1.4 Data Retrieval Interface

To access the measurements stored in elastic, a REST API is used to send queries. This is the same API that is used to insert new documents and to manage the elastic cluster. As mentioned in Section 4.3, anyone with access to this API has full administrative control, and for that reason security has to be implemented outside of elastic.

Page 69: A Comparative Study of Databases for Storing Sensor Data

50 CHAPTER 6. ANALYSIS

To solve this, the prototype isolates the elastic cluster from the outside via network separation, with only a subset of the modules allowed to communicate with it. One of those modules is the data retrieval module, which enables the user to send GraphQL queries to elastic.

The current GraphQL interface is limited to two queries: one retrieves individual sensor measurements and the other returns the average value of a specified sensor type. Placing a separate API in front of the elastic REST API achieves the desired limitation in what a user can do with the database, but with the great drawback that each query has to be implemented in code before it can be used.

With a paid subscription, as mentioned in Section 5.3, this network separation and extra interface are not necessary. Instead, user access control could be used to only allow read queries on the indices. This way, a user is not limited by which queries were implemented in advance and can utilize the full query capabilities of elastic. However, it could be useful to implement a more generic interface regardless, to unburden the user of having to know the query language supported by elastic.

Page 70: A Comparative Study of Databases for Storing Sensor Data

Chapter 7

Conclusions

This chapter concludes the report by presenting the conclusions reached from the design, implementation, and evaluation described in this thesis. Furthermore, it presents gained insights and suggests areas for future work.

7.1 Conclusion

In this section the project goals and requirements will be reiterated, and gained insights will be presented.

7.1.1 Goals

The project met two of the three initial goals. The second goal, to determine the method best suited to make the collected sensor data available, was not met due to the breadth of the other two goals. Although the topic was covered to some extent in Section 2.7, it was not properly discussed throughout the rest of the paper.

7.1.2 Requirements

The requirements listed in Section 1.4 were all met by the prototype. Three out of six requirements focused on the possibility to add new types of sensors, changing the data structure, and reporting different types of information at separate time intervals. All of these requirements were fulfilled through the use of a document database that stores each measurement as a separate document, so that each measurement can be completely different as long as it is a JSON object. However, the use of different structures might complicate queries. The requirement that the data must be publicly accessible was fulfilled through the implementation of the GraphQL demo interface that enables a user to ask a set of predefined queries against the data.

The two requirements that a new method of transportation and a growing number of connected devices should be supported were fulfilled through the separation into multiple modules. The separation of the tagger module from the parser enables the parser to scale horizontally to handle higher parsing loads. This separation also enables the implementation of the modules on different machines instead of sharing the hardware.

Furthermore, the separation of the tagger module, which only handles retrieval of data from sources and tagging with a unique identifier and timestamp, means that only that module has to be replaced to support a new method of transport. As an example, a new tagger module could be developed to support the CoAP protocol and send the tagged measurements to the same parser modules as the current MQTT tagger.

7.1.3 Insights

An important insight gained from this project is that there is no single database best suited to handle sensor data. Although a flexible database such as Elasticsearch was determined to be best suited for this project, other databases such as InfluxDB and Cassandra might outperform Elasticsearch when less flexibility is required. Aspects that have an impact on the choice of database include: whether the queries that will be asked of the data are known from the start, whether the primary load consists of writes or reads, and whether the data format might change in the future or stays consistent.


7.2 Future Work

Due to the extent of the problem introduced in Section 1.2, one of the initial goals was not met. In this section the remaining issues will be presented together with suggested areas for future work.

7.2.1 What Has Been Left Undone?

The main thing left undone in the database comparison is performance measurements under realistic loads. For this project, the comparison was restricted to documented features and properties of the databases, and not how fast they insert or retrieve data, or how many concurrent clients they can handle.

The second goal, to determine the best suited method to make the sensor data available, was not addressed and is left for future work.

7.2.2 Security

The basic edition of elasticsearch does not include security features. Any security measures have to be implemented in the surrounding systems, such as the public interface and the parser. The design and implementation of security features are potential areas for future work. One thing to consider is how to restrict the types of queries that can be made, to limit the attack vector for attacks such as denial of service using queries that put a high load on the server.

The project prototype is built around an open system where anyone can publish sensor measurements, which means that a malicious user could corrupt the database by inserting fake measurements. A way to secure this end would be to enforce authentication before publishing a sensor measurement. Another aspect is that all the communication between the services is implemented using HTTP; this should be replaced with HTTPS to add integrity to the data in transit.

7.2.3 Next Obvious Things to Be Done

One of the main goals of the prototype was to collect and store sensor data. In the current implementation, the prototype only supports collecting sensor data from MQTT brokers. The next obvious thing to design and implement is a new tagger module that supports additional sources such as CoAP.


Appendix A

Elementary evaluation

This appendix contains the elementary evaluation of the prototype system and verification of the basic functionality.

A.1 Overview

To evaluate the functionality of the solution and the developed prototype, it was examined on the following parameters:

• Basic functionality: Do MQTT messages published on the broker get ingested into elastic and become queryable?

• Varying data fields: What happens when inserting data using different types of fields?

• Variable data structure: What happens when the data structure is completely changed, e.g., is not SenML?

• Retiring of data: How to access data that is very old?


A.1.1 Basic Functionality

The first part to evaluate was whether the basic functionality of the prototype worked as intended, meaning that it can ingest sensor measurements and then answer queries about them. The prototype was developed to ingest sensor measurements via MQTT, structured according to the SenML specification. To test this, a SenML message, seen in Listing A.1, was generated and published via MQTT to be ingested by the system.

[
  {
    "bn":"urn:dev:mac:18a6f71ca5d2;",
    "bt":1554446379.78,
    "ut":120
  },
  {
    "n":"uptime",
    "u":"s",
    "v":4992987
  },
  {
    "n":"hostname",
    "vs":"GIoTgw1"
  },
  {
    "n":"SoC:temp",
    "u":"Cel",
    "v":26.8
  }
]

Listing A.1: A SenML structured message


The message contains a base name of ”urn:dev:mac:18a6f71ca5d2;”, and one of the measurements in the list has a name of ”SoC:temp”. A query was sent for a document with a name of ”urn:dev:mac:18a6f71ca5d2;SoC:temp”, which is the resolved name of that measurement, and the result can be seen in Listing A.2. Note that the field ”bt” is renamed ”t” and added to the document as part of resolving the measurement.

{
  "_index": "measurements-2019-04-05",
  "_type": "_doc",
  "_id": "ULU57GkBhl2DXs92gj-6",
  "_score": 1,
  "_source": {
    "t": 1554446379.78,
    "u": "Cel",
    "v": 26.8,
    "uuid": "5f4c2f8331bc4005a32add520c92d751",
    "n": "urn:dev:mac:18a6f71ca5d2;SoC:temp",
    "timestamp": "2019-04-05T06:39:45.817+00:00"
  }
}

Listing A.2: A SenML message after ingestion into elastic

To get a list of all the documents created from the message in Listing A.1, a query could be sent for documents containing a uuid of ”5f4c2f8331bc4005a32add520c92d751”, which is created by the tagger service and present in all related documents.
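Such a query could be expressed with the standard elastic query DSL; this is a sketch, and whether ”uuid” needs a ”.keyword” suffix depends on the index mapping:

```python
def uuid_query(uuid_hex: str) -> dict:
    """Query DSL body matching every document tagged with one UUID,
    i.e. every measurement resolved from the same original message."""
    return {"query": {"term": {"uuid": uuid_hex}}}

# Used with the elasticsearch-py client against the daily indices:
# from elasticsearch import Elasticsearch
# es = Elasticsearch()
# hits = es.search(index="measurements-*",
#                  body=uuid_query("5f4c2f8331bc4005a32add520c92d751"))
```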


A.1.2 Varying Data Fields

Another SenML message was published using other fields to test whether it was parsed correctly as well, this time re-defining the base fields to verify that the updated SenML library works as it should. Listing A.3 contains the new message published to the broker.

[
  {
    "bn": "urn:dev:mac:0b92569229fc9e68;",
    "bt": 1554447851,
    "n": "temp",
    "u": "Cel",
    "v": 18.5
  },
  {
    "n": "location",
    "vs": "Baker street"
  },
  {
    "bn": "urn:dev:mac:0b92569229fc1234;",
    "bt": 1554448521,
    "n": "humidity",
    "u": "%RH",
    "v": 20
  },
  {
    "n": "lon",
    "v": 17.6486023
  },
  {
    "n": "lat",
    "v": 59.854562
  }
]

Listing A.3: A SenML message with multiple occurrences of the same base fields.


As seen in Listing A.3, there are two occurrences of the base fields ”bn” and ”bt”. This means that the first two objects in the list should resolve using the first occurrence of the base fields, and the last three objects should resolve using the second occurrence of the base fields. Listing A.4 contains the response from elastic, showing that the whole message was parsed and ingested correctly.

[
  {
    "_index": "measurements-2019-04-05",
    "_type": "_doc",
    "_id": "u7W-BmoBhl2DXs9276xt",
    "_score": 0,
    "_source": {
      "t": 1554448521,
      "u": "%RH",
      "v": 20,
      "uuid": "3bfaca5a973449939c3a3638ce4c7d51",
      "n": "urn:dev:mac:0b92569229fc1234;humidity",
      "timestamp": "2019-04-05T07:15:21.440+00:00"
    }
  },
  {
    "_index": "measurements-2019-04-05",
    "_type": "_doc",
    "_id": "ubW-BmoBhl2DXs9276xt",
    "_score": 0,
    "_source": {
      "t": 1554447851,
      "u": "Cel",
      "v": 18.5,
      "uuid": "3bfaca5a973449939c3a3638ce4c7d51",
      "n": "urn:dev:mac:0b92569229fc9e68;temp",
      "timestamp": "2019-04-05T07:04:11.360+00:00"
    }
  },
  {
    "_index": "measurements-2019-04-05",
    "_type": "_doc",
    "_id": "urW-BmoBhl2DXs9276xt",
    "_score": 0,
    "_source": {
      "t": 1554447851,
      "vs": "Baker street",
      "uuid": "3bfaca5a973449939c3a3638ce4c7d51",
      "n": "urn:dev:mac:0b92569229fc9e68;location",
      "timestamp": "2019-04-05T07:04:11.360+00:00"
    }
  },
  {
    "_index": "measurements-2019-04-05",
    "_type": "_doc",
    "_id": "vLW-BmoBhl2DXs9276xt",
    "_score": 0,
    "_source": {
      "t": 1554448521,
      "v": 17.6486023,
      "uuid": "3bfaca5a973449939c3a3638ce4c7d51",
      "n": "urn:dev:mac:0b92569229fc1234;lon",
      "timestamp": "2019-04-05T07:15:21.440+00:00"
    }
  },
  {
    "_index": "measurements-2019-04-05",
    "_type": "_doc",
    "_id": "vbW-BmoBhl2DXs9276xt",
    "_score": 0,
    "_source": {
      "t": 1554448521,
      "v": 59.854562,
      "uuid": "3bfaca5a973449939c3a3638ce4c7d51",
      "n": "urn:dev:mac:0b92569229fc1234;lat",
      "timestamp": "2019-04-05T07:15:21.440+00:00"
    }
  }
]

Listing A.4: A resolved SenML message with multiple occurrences of the same base fields after ingestion into elastic.



A.1.3 Variable Data Structure

The next thing to test was what happens when a non-SenML message is published via MQTT. To test this, a JSON object was published instead of the JSON array mandated by the SenML specification. Listing A.5 displays the message published on the MQTT broker.

{
  "UDP": {
    "Dropped packets ": " 0",
    "Checksum errors ": " 0",
    "Sent packets ": " 231",
    "Received packets ": " 231"
  },
  "TCP": {
    "Dropped packets ": " 0",
    "Checksum errors ": " 0",
    "retransmitted segments ": " 0",
    "Sent packets ": " 18472",
    "Dropped SYNs ": " 0",
    "Received RST ": " 0",
    "Received packets ": " 19185",
    "SYNs for closed ports ": " 0",
    "Ack errors ": " 0"
  }
}

Listing A.5: A JSON object sent to elastic

Because this data did not have any uniquely identifying fields such as the resolved names from Listing A.1, a query was sent to elastic for any document containing the field "UDP". As this document was the only non-SenML message ingested at the time, only that document was returned; it can be seen in Listing A.6.
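A query of this kind can be sketched with the exists query. Again, this is an illustration rather than the exact request used: the index pattern is an assumption, and depending on the mapping it may be necessary to target a concrete sub-field (for example "UDP.Sent packets ") instead of the object field itself.

```json
GET measurements-*/_search
{
  "query": {
    "exists": {
      "field": "UDP"
    }
  }
}
```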



{
  "_index": "measurements-2019-04-05",
  "_type": "_doc",
  "_id": "ebVO62kBhl2DXs92QDsl",
  "_source": {
    "UDP": {
      "Dropped packets ": " 0",
      "Checksum errors ": " 0",
      "Sent packets ": " 231",
      "Received packets ": " 231"
    },
    "TCP": {
      "Dropped packets ": " 0",
      "Checksum errors ": " 0",
      "retransmitted segments ": " 0",
      "Sent packets ": " 18472",
      "Dropped SYNs ": " 0",
      "Received RST ": " 0",
      "Received packets ": " 19185",
      "SYNs for closed ports ": " 0",
      "Ack errors ": " 0"
    },
    "uuid": "8a5cf5dd81944a0bb476641770706553",
    "timestamp": "2019-04-05T02:22:47.816+00:00"
  }
}

Listing A.6: A JSON document after ingestion by elastic

A.1.4 Retiring of Data

To evaluate the feature to retire old data, elastic was configured with a policy to delete any index older than four hours. This was tested together with a pipeline that creates a new index every hour, to see how it behaved. As expected, the number of indices was kept at four, except for the brief transition period at the beginning of a new hour before the oldest index had been deleted.
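A retention policy of this kind can be expressed with Elasticsearch's index lifecycle management, available from version 6.6. This is a sketch under assumptions: the policy name is hypothetical, and the thesis does not state whether ILM or another mechanism (such as Curator) was used.

```json
PUT _ilm/policy/measurement-retention
{
  "policy": {
    "phases": {
      "delete": {
        "min_age": "4h",
        "actions": {
          "delete": {}
        }
      }
    }
  }
}
```

Attaching such a policy to the hourly indices would delete each index once it is four hours old, which matches the observed steady state of four live indices.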


TRITA-EECS-EX-2019:174

www.kth.se