
Ilse Taleman

Generic data collectors for NoSQL document stores

Academic year 2017-2018
Faculty of Engineering and Architecture
Chair: Prof. dr. ir. Herwig Bruneel
Department of Telecommunications and Information Processing

Master's dissertation submitted in order to obtain the academic degree of
Master of Science in Computer Science Engineering

Counsellor: Ir. Christophe Billiet
Supervisor: Prof. dr. Guy De Tré


Permission for usage

The author gives permission to make this master dissertation available for consultation and to copy parts of this master dissertation for personal use. In the case of any other use, the copyright terms have to be respected, in particular with regard to the obligation to state expressly the source when quoting results from this master dissertation.

Ilse Taleman, May 13, 2018


Preface

It is now about three years ago that I first heard of databases and Big Data. It was the storage methods that could store and retrieve such huge amounts of data at high speed that sparked my interest. This topic became the subject of my VOP (bachelor's dissertation), my internship, my holiday job and, finally, my master's dissertation. This interesting and instructive master's dissertation, however, would not have been completed without the support of a number of people:

First, I would like to express a special word of thanks to my supervisor, Prof. dr. Guy De Tré, for the large amount of time he spent helping me bring this work to a good end. Without his help, hints and feedback during the biweekly meetings and by email, I would never have achieved the same result.

Second, I would like to thank my family, and especially my parents, sister and brother, for their ongoing support during this entire curriculum.


Generic data collectors for NoSQL document stores

by ILSE TALEMAN

Supervisor: Prof. dr. Guy De Tré

Master’s dissertation submitted in order to obtain the academic degree of

Master of Science in Computer Science Engineering

Department of Telecommunications and Information Processing

Chair: Prof. dr. ir. Herwig Bruneel

Faculty of Engineering and Architecture, Ghent University

Academic year 2017-2018

Summary

Nowadays, big data is everywhere, and it is evolving. NoSQL (Not only SQL) data storages are used to handle this big data. NoSQL storage methods have the advantage that they can insert data very efficiently and fast. However, this advantage is afterwards detrimental for querying: NoSQL solutions do not efficiently support ad hoc querying.

In this master's dissertation, it is studied whether and how more generic data collectors for feeding a NoSQL document store can be constructed. This generic data collector should contribute to better query facilities while still keeping the insertion of data efficient.

The generic data collector consists of two large components. The first component takes care of gathering information about the database. After that, those properties are fed to the second component, which handles the insertion into the NoSQL document database.

The generic data collector is applied to two use cases. They give a good indication that the generic data collector is practically useful.

Keywords

Big Data, NoSQL, Document Stores, Generic Data Collector


Generic data collectors for NoSQL document stores
ILSE TALEMAN

Supervisor: Prof. dr. Guy De Tré, Ghent University

May 31, 2018

Abstract—The amount of data that is being created and stored on a global level is almost inconceivable, and it just keeps growing. Handling this big data in information management systems faces several challenges. NoSQL solutions have the advantage that they insert data very efficiently and fast, but a disadvantage is that they do not efficiently support ad hoc querying. Therefore, a generic data collector for feeding a NoSQL document store is developed and described in this study. It tries to optimize the insert execution times vs. the query execution times. The generic data collector program is tested bottom-up. The two explored cases give a good indication that the generic data collector program is practically useful.

Index Terms—Big Data, NoSQL, Document Stores, Generic Data Collector

I. INTRODUCTION

Nowadays, big data is everywhere, and it is evolving very quickly. NoSQL data storages are used to handle this big data. NoSQL storage methods have the advantage that they can insert data very efficiently and fast. However, this advantage is afterwards detrimental for querying: NoSQL solutions do not efficiently support ad hoc querying. In this master's dissertation, it is studied whether and how more generic data collectors for feeding a NoSQL document store can be constructed. This generic data collector should contribute to better query facilities while still keeping the insertion of data efficient. The generic data collector consists of two large components. The first component takes care of gathering information about the database. After that, those properties are fed to the second component, which handles the insertion into the NoSQL document database.

The remainder of this paper consists of the following. In section II, an introduction to big data is given. In section III, the generic data collector program is described, and the results of its application to three use cases are listed in section IV. This paper is concluded in section V.

II. BIG DATA

In this section, the definition of big data will first be explained. Second, NoSQL databases, which are used to handle big data, will be described.

A. What is big data?

Big data encompasses a wide range of the tremendous data generated from various sources such as mobile devices, digital repositories, and enterprise applications. It is characterized by at least one of the four dimensions: volume, variety, velocity and veracity.

The main characteristic that makes data big is the sheer volume. In the era of the Internet, social networks, mobile devices and sensors that produce data continuously, the notion of size associated with data collections has evolved very quickly. 90% of all data ever created was created in the past two years [1].

The second dimension of big data is the variety of data. In the past, all the created data was structured data; it neatly fitted in columns and rows. Nowadays, 90% of the data that is generated by organizations is unstructured data [2]. Data comes in many different formats: structured data, semi-structured data, unstructured data and even complex structured data.

The third dimension of big data is the velocity of data, which refers to the enormous frequency at which the data is created, stored, analyzed and visualized. The speed at which data is currently created is almost unimaginable: every minute we upload 100 hours of video to YouTube, and every minute over 200 million emails are sent [3].

Veracity refers to the trustworthiness of the data. Data ages quickly and much information shared via the Internet and social media does not necessarily have to be correct.

B. NoSQL databases

NoSQL stands for Not Only SQL. It is a non-relational database architecture that is used for handling huge amounts of data in the multi-terabyte and petabyte range. Rather than the strict table/row structure of the relational databases that are traditionally used in enterprises, NoSQL databases are field-oriented and more flexible for many applications. They scale horizontally, which means that inconsistent amounts of data can be stored in an individual item/record (the equivalent of a relational row).

Furthermore, NoSQL databases have the advantages of high availability, flexibility, performance and scalability. However, they have the disadvantage of slower execution of ad hoc queries, and they lack standardization.

NoSQL databases can be divided into four main categories: key-value store databases, document store databases, column-oriented databases and graph databases. Key-value NoSQL databases contain pairs of keys and values. The data model is simple: a map or a dictionary that allows the user to request the values according to the key specified. Graph databases are databases which store data in the form of a graph. The graph consists of nodes and edges, where nodes act as the objects and edges act as the relationships between the objects. Document store databases store their data in the form of documents (rows of data). Documents in the same collection can each have a different structure. In column-oriented NoSQL databases, data is stored in cells grouped in columns of data rather than as rows of data.

III. GENERIC DATA COLLECTOR

In this section, the definition and implementation of the generic data collector are explained.

A. Definition

The research in this master's dissertation focuses on document databases because they are commonly used NoSQL databases [4]. With document databases, you have a lot of freedom, because there are many ways to store and retrieve the data. First, document stores are schema free (i.e. documents in the same collection can each have a different structure). Thus, the possibilities to structure the data in the database are endless. Second, document stores can insert data in multiple ways: by inserting document after document, or via a bulk insert (i.e. multiple documents at once), and these inserts can each be synchronous or asynchronous. Third, document stores support indexing for faster query execution times. Fourth, document stores also support horizontal sharding to get more parallel work and thus faster query execution times.

As the previous four points indicate, a programmer has a lot of possibilities when creating a document database. He/she will need a lot of prior knowledge and will have to make a lot of decisions. Also important during the creation of the database is the decision of how much structure to add to the database. A lot of structure leads to efficient query times, but less efficient insert times. Adding less structure leads to more efficient insert times, but slower query times. Therefore, a generic data collector is created that makes programmers' work easier. The generic data collector will search for a good solution to both store and retrieve this data in a document store as fast as possible.

More specifically, the generic data collector makes the ETL-process generic. ETL is short for extract, transform, load. These are three database functions that are combined to pull data out of a source and place it into a database:

1) Extract is the process of reading data from a source.
2) Transform is the process of converting the extracted data from its previous form into the desired form to place it into the database.
3) Load is the process of writing the data into the target database.

The generic data collector is thus a generic ETL-process. This is shown in figure 1. It will extract, transform and load the input data automatically into the database in such a way that the insert and query execution times are optimized.

Fig. 1. Representation of the generic data collector that preprocesses and stores the data into a document store (MongoDB) as well as possible vs. a direct dump of the data into MongoDB.
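To make the idea of a generic ETL-process concrete, the following is a minimal sketch in C# of how such a pipeline could be decomposed; the interface and class names are invented for illustration and are not the thesis's actual implementation.

using System.Collections.Generic;

// Extract: read raw records from an input source.
interface IExtractor<TRaw>
{
    IEnumerable<TRaw> Extract();
}

// Transform: convert a raw record into the document form chosen for the collection.
interface ITransformer<TRaw, TDocument>
{
    TDocument Transform(TRaw raw);
}

// Load: write the transformed documents into the target document store.
interface ILoader<TDocument>
{
    void Load(IEnumerable<TDocument> documents);
}

// A generic data collector chains the three steps together.
class EtlPipeline<TRaw, TDocument>
{
    private readonly IExtractor<TRaw> _extractor;
    private readonly ITransformer<TRaw, TDocument> _transformer;
    private readonly ILoader<TDocument> _loader;

    public EtlPipeline(IExtractor<TRaw> extractor,
                       ITransformer<TRaw, TDocument> transformer,
                       ILoader<TDocument> loader)
    {
        _extractor = extractor;
        _transformer = transformer;
        _loader = loader;
    }

    public void Run()
    {
        var documents = new List<TDocument>();
        foreach (var raw in _extractor.Extract())
            documents.Add(_transformer.Transform(raw));
        _loader.Load(documents); // e.g. a bulk insert into MongoDB
    }
}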

B. Implementation

To implement the generic data collector, MongoDB is used as the document store database because it is currently the leading document store [4]. The programming language that I use during this thesis is C#, in the Visual Studio programming environment. MongoDB has provided an open-source driver [5] to connect from C# to the database.
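As a point of reference, opening a connection with that driver only takes a few lines; a minimal sketch (the connection string, database name and collection name are placeholders):

using MongoDB.Bson;
using MongoDB.Driver;

class ConnectionDemo
{
    static void Main()
    {
        // NuGet package: MongoDB.Driver
        var client = new MongoClient("mongodb://localhost:27017");
        IMongoDatabase database = client.GetDatabase("messageDatabase");
        IMongoCollection<BsonDocument> collection =
            database.GetCollection<BsonDocument>("operatorActions");

        // Simple round trip to verify that the connection works.
        System.Console.WriteLine(collection.CountDocuments(new BsonDocument()));
    }
}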

The generic data collector consists of two parts: the information gathering part and the execution part. First, information about the input database must be given to the generic collector by a programmer. This information can be used by the collector to search for efficient storage and query methods. Next, in the execution part, when a programmer calls the generic insert function of the generic data collector, the input data will be inserted in an efficient way. Analogously, when a programmer calls the generic query function of the generic data collector, the data will be queried in an efficient way, i.e. with a short query execution time. These two parts are explained in the following two sections.

C. Part 1: Information gathering

In the first part of the generic data collector, information about the database is collected. This information is stored in a skeleton of the collection, which the programmer has to define. The goal of this skeleton is to give the generic data collector information about the layout and properties of the collection.

The definition of a collection is limited to a list of elements that belong to one class. So, all documents in a collection are of the same class type; they are homogeneous. The reason for this is simplicity: in this way, the loading and extraction of the data to and from the database is straightforward, and the definition of a collection can be captured in a single class struct.

The skeleton is defined by the class struct in C# that defines a collection with all possible fields of a document, and by a custom attribute that specifies the properties and details of the collection. These two parts of the skeleton provide information, i.e. metadata, to the generic data collector system. In listing 1, the general abstract definition of the collection with all custom attribute parameters is given. The custom attribute is called GenericCollection, and it has 5 parameters:

• collectionName: a string that defines the exact name of that collection in the MongoDB database. If this field is empty, the generic system will create a name automatically.

• uniqueFields: an array of strings consisting of all fields in the collection that are unique key fields (i.e. primary key fields).

• mostQueriedFields: an array of strings consisting of all (combinations of) fields in the collection that are frequently queried.

• mostSortedFields: an array of strings consisting of fields on which queries are often sorted.

• servers: an array of strings consisting of fields that define server(s).

Listing 1. General skeleton of the definition of a collection with all the custom attribute parameters.

class MongoBase { id }

[GenericCollection(collectionName = <collection name>,
                   uniqueFields = <unique key fields>,
                   mostQueriedFields = <array of (multiple) fields>,
                   mostSortedFields = <array of fields>,
                   servers = <array of fields>)]
class Collection : MongoBase
{
    <properties>
    <subclasses>
}
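For illustration, a hypothetical instantiation of this skeleton could look as follows; the class and field names are invented for the example, and GenericCollection and MongoBase are assumed to be provided by the generic data collector library:

using System;

[GenericCollection(collectionName = "OperatorActions",
                   uniqueFields = new[] { "MessageId" },
                   mostQueriedFields = new[] { "Timestamp", "OperatorName" },
                   mostSortedFields = new[] { "Timestamp" },
                   servers = new[] { "ServerName" })]
class OperatorAction : MongoBase
{
    public string MessageId { get; set; }      // unique key field
    public DateTime Timestamp { get; set; }    // frequently queried and sorted on
    public string OperatorName { get; set; }   // frequently queried
    public string ServerName { get; set; }     // field identifying the server
    public string Description { get; set; }
}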

D. Part 2: Execution of generic methods

In the second part of the generic data collector, the generic insert and query methods are discussed. First, an efficient insertion method is considered. Second, the generic query method is made efficient. However, optimizing the insertion method can have a negative influence on the query method and vice versa. This trade-off is discussed in the third and last section.

1) Generic insert method: The goal of the generic insert method of the generic system is to insert the input data into the MongoDB database as fast as possible.

There are two ways to insert data into a collection: by inserting document per document, i.e. the insertOne() method, or by inserting many documents at once, i.e. bulk insertion. Tests show that one bulk insertion is on average 43% faster than performing multiple insertOne() calls. This can be explained by the fact that with a bulk insertion, the program only needs to go to the database once, insert all documents in one bulk, and return to the application. In contrast, with multiple individual inserts, the program needs to go to the database, insert that single document, then go back to the application, and it repeats this process for every single insert operation.

Therefore, the generic data collector program will always perform bulk insertions if possible.
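A minimal sketch of the difference between the two insert paths with the MongoDB .NET driver (not the thesis code; the class, database and collection names are hypothetical):

using System;
using System.Collections.Generic;
using System.Diagnostics;
using MongoDB.Bson;
using MongoDB.Driver;

class Measurement
{
    public ObjectId Id { get; set; }
    public DateTime Timestamp { get; set; }
    public double Value { get; set; }
}

class InsertDemo
{
    static List<Measurement> MakeBatch(int n)
    {
        var docs = new List<Measurement>();
        for (int i = 0; i < n; i++)
            docs.Add(new Measurement { Timestamp = DateTime.UtcNow, Value = i });
        return docs;
    }

    static void Main()
    {
        var collection = new MongoClient("mongodb://localhost:27017")
            .GetDatabase("demo").GetCollection<Measurement>("measurements");

        var sw = Stopwatch.StartNew();
        collection.InsertMany(MakeBatch(10000));   // one round trip for the whole batch
        Console.WriteLine($"Bulk insert: {sw.ElapsedMilliseconds} ms");

        sw.Restart();
        foreach (var doc in MakeBatch(10000))      // one round trip per document
            collection.InsertOne(doc);
        Console.WriteLine($"Single inserts: {sw.ElapsedMilliseconds} ms");
    }
}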

2) Generic query method: The goal of the generic query method of the generic system is to query data as fast as possible. MongoDB supports indexing and sharding to lower the query execution time. The definition and properties of both indexing and sharding, and the way in which the generic data collector implements them, are discussed in the following two paragraphs.

a) Indexing: Indexes are special data structures that store a small portion of the collection's data set in an easy-to-traverse form. MongoDB indexes use a B-tree data structure.

Indexes support the efficient execution of queries in MongoDB [6]. Without indexes, MongoDB must perform a collection scan, i.e. scan every document in a collection, to select the documents that match the query statement. If an appropriate index exists for a query, MongoDB can use that index to limit the number of documents it must inspect.

MongoDB defines indexes at the collection level and supports indexes on any field or sub-field of the documents in a MongoDB collection. Generally, MongoDB only uses one index to fulfill most queries. However, each clause of an OR-query may use a different index, and starting in version 2.6, MongoDB can use an intersection of multiple indexes. An index supports a query when the index contains all the fields scanned by the query, because in that case the query scans the index and not the entire collection.

The indexing strategy implemented in the generic query method is as follows:

• A first good index strategy is to set an index on fields that most likely will be queried. Therefore, the custom attribute GenericCollection from the skeleton contains an array called mostQueriedFields to store frequently queried fields.

• A second index implementation is to add indexes on unique fields. Queries on these fields will execute fast because of their uniqueness. Therefore, the custom attribute GenericCollection includes an array called uniqueFields to store these fields.

• A third good index strategy is to set indexes on fields that are used to sort on in queries. For that reason, the custom attribute GenericCollection also includes an array called mostSortedFields to store all these fields.

• A fourth index strategy is to ensure that indexes fit in RAM. This is a less important strategy because MongoDB has solved this by itself by minimizing the amount of RAM usage. Hence, the generic data collector does not take this strategy into consideration.

• The last index strategy is to ensure high selectivity of the fields that will be indexed. Selectivity is the ability of a query to narrow results using the index. For example, all records contain the field gender with only two different values, man and woman. If you add an index on gender, then this index is a low-selectivity index. This is something that the generic system cannot avoid. But in fact, if a lot of queries use that field, the index on it will not be fully disadvantageous.

In summary, mostQueriedFields, uniqueFields and mostSortedFields are three arrays that belong to the custom attribute GenericCollection. These three arrays contain string values that refer to one or more fields; a minimal sketch of how the corresponding indexes could be created is given below.
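The following sketch shows how the generic data collector could turn these three arrays into MongoDB indexes with the .NET driver. It is an assumed illustration, not the thesis code; combinations of fields would become compound indexes, which are omitted here for brevity.

using System.Collections.Generic;
using MongoDB.Bson;
using MongoDB.Driver;

static class IndexSetup
{
    public static void CreateIndexes(IMongoCollection<BsonDocument> collection,
                                     string[] uniqueFields,
                                     string[] mostQueriedFields,
                                     string[] mostSortedFields)
    {
        var models = new List<CreateIndexModel<BsonDocument>>();

        // Unique key fields: ascending index with a unique constraint.
        foreach (var field in uniqueFields)
            models.Add(new CreateIndexModel<BsonDocument>(
                Builders<BsonDocument>.IndexKeys.Ascending(field),
                new CreateIndexOptions { Unique = true }));

        // Frequently queried and frequently sorted fields: plain ascending indexes.
        foreach (var field in mostQueriedFields)
            models.Add(new CreateIndexModel<BsonDocument>(
                Builders<BsonDocument>.IndexKeys.Ascending(field)));
        foreach (var field in mostSortedFields)
            models.Add(new CreateIndexModel<BsonDocument>(
                Builders<BsonDocument>.IndexKeys.Ascending(field)));

        collection.Indexes.CreateMany(models);
    }
}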

b) Sharding: Sharding is a method for distributing data across multiple machines. Database systems with large data sets or high-throughput applications can challenge the capacity of a single server. For example, high query rates can exhaust the CPU capacity of the server. Working set sizes larger than the system's RAM stress the I/O capacity of disk drives. There are two methods for addressing system growth: vertical and horizontal scaling.

Vertical scaling increases the capacity of a single server, e.g. by using a more powerful CPU, adding more RAM, or increasing the amount of storage space. However, there is a practical maximum: single machines can have hard ceilings based on available hardware configurations.

Horizontal scaling involves dividing the system dataset and load over multiple servers, adding additional servers to increase capacity as required. While the overall speed or capacity of a single machine may not be high, each machine handles a subset of the overall workload, potentially providing better efficiency than a single high-speed, high-capacity server.

MongoDB supports horizontal scaling through sharding [7]. A sharded cluster consists of shards, config servers and the mongos. A shard contains a subset of the sharded data. The mongos acts as a query router, providing an interface between client applications and the sharded cluster. Config servers store metadata and configuration settings for the cluster. The partitioning of the collection data across multiple shards is done using the shard key. The shard key consists of one or more immutable fields, chosen by the programmer, that exist in every document in the target collection. It cannot be changed after sharding, and every sharded collection can have only one shard key.

MongoDB supports two sharding strategies for distributing data across sharded clusters: hashed sharding and ranged sharding. Hashed sharding computes a hash of the shard key field's value to partition data across a sharded cluster. Ranged sharding involves dividing data into ranges based on the shard key values.

The generic data collector will always use hashed sharding, because the hash function leads to a better distribution of the data across the shards. If the generic data collector system used ranged sharding on a monotonically increasing or decreasing key, the data would be badly divided over the shards. Measuring whether a key is monotonically increasing or decreasing is hard. Therefore, it is better to use a hash function and thus implement hashed sharding.

The choice of the key for hashed sharding is made by the generic data collector in the following way: in general, the data collector system chooses the ObjectID key _id as hash key because it has good cardinality and changes monotonically, unless a unique key is given that is also frequently queried. This can be detected by comparing the arrays uniqueFields and mostQueriedFields and searching for a matching (combination of) key(s), because in that case the unique key surely has low frequency and high cardinality.
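In the thesis, the sharded cluster itself is configured manually in the Mongo shell; purely for illustration, the equivalent admin commands for hashed sharding can also be issued from C#, as in the following sketch (database, collection and key names are hypothetical):

using MongoDB.Bson;
using MongoDB.Driver;

static class ShardingSetup
{
    public static void EnableHashedSharding(MongoClient mongosClient,
                                            string databaseName,
                                            string collectionName,
                                            string shardKeyField)
    {
        var admin = mongosClient.GetDatabase("admin");

        // Allow sharding for the database.
        admin.RunCommand(new BsonDocumentCommand<BsonDocument>(
            new BsonDocument("enableSharding", databaseName)));

        // Shard the collection on a hashed key, e.g. "_id" or a unique, frequently queried field.
        admin.RunCommand(new BsonDocumentCommand<BsonDocument>(new BsonDocument
        {
            { "shardCollection", $"{databaseName}.{collectionName}" },
            { "key", new BsonDocument(shardKeyField, "hashed") }
        }));
    }
}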

E. Trade-off between the generic insert and query method

The insert execution time is the fastest when doing a direct dump, i.e. writing the given input data directly into the MongoDB database without changing the input data or adding indexes.

The query execution time is the fastest when the input data is written into the database in a structured way. Adding structure to the data means storing the input data in such a way that queries can avoid broadcasts over all documents in a collection and instead read a reduced group of specific documents in a collection. Structuring the data in MongoDB is done by defining a good logical structure for each document in a collection, by adding indexing, and by adding sharding.

Making the insert and query execution times both equal to their own minimum is impossible, because they depend on each other. To get the insert execution time as low as possible, input data may not be changed but must be written directly into the database. Conversely, to get the query execution time as low as possible, input data must be changed (i.e. structure defined, indexing and sharding added). Thus, optimizing the query method will de-optimize the insert method and vice versa.

As stated in the introduction, the goal of this thesis is to find a generic data collector that makes these two methods, i.e. insertion and querying, as good as possible. A bottom-up approach is used to test its suitability, namely by considering two implementations and evaluating their results. This is discussed in the following section.

IV. APPLICATION OF THE GENERIC DATA COLLECTOR

The generic data collector is tested on a practical use case, with data from a company. The input dataset can be divided into two groups. The programmer defines two collections with their properties, in the same way as explained in section III-C.

The performance of the insertion and query methods of the generic data collector is measured by timing their execution times. These times are then compared to the insert and query execution times of data that is directly dumped into the database (see figure 1). The comparison is done by calculating the overall suitability through a cost vs. preference analysis model [8]. The suitability value ranges from 0 to 1; a suitability equal to 1 means that the user is fully satisfied, and a suitability equal to 0 means the opposite. This method is displayed in figure 2. The overall suitability is computed from the combination of the overall preference and the total cost. The total cost is equal to the insertion time, and the overall preference is calculated by the Logic Scoring of Preference (LSP) model [9]. The inputs to the LSP model are performance variables, in this case the query execution times of expected, i.e. regular, and unexpected, i.e. ad hoc, queries.


Fig. 2. Computation of the overall suitability using a cost/performance model.

First, the generic data collector is tested on a non-sharded database with two collections.

The first collection does not need any preprocessing. This means that the generic data collector program only applies smart indexing before insertion compared to a direct dump. Comparing the suitability of both programs results in a suitability value of 0,934 for the generic data collector program and a suitability ranging from 0,839 to 0,940 for the dump program. Thus, the generic data collector program almost always has a higher suitability and is thus better suited for users than the dump program. Only in the case where the programmer had completely wrongly estimated the query usage is the suitability of the dump program higher than the suitability of the generic data collector program (0,940 vs. 0,934).

The second collection that is tested on a non-sharded database needs preprocessing. This means that data is not only inserted, but sometimes also removed or updated in the collection. The suitability value of the generic data collector program is equal to 1, the highest value possible. The suitability value of the dump program lies in the range 0,797 to 0,887. These values are moderately high, but never as high as the suitability value of the generic data collector program. Thus, even when a programmer has completely wrongly estimated the expected queries, the suitability of the generic data collector is higher than the suitability of the dump program.

Next, the generic data collector program is tested on a sharded database. Setting up a sharded database and choosing the shard key cannot be done by the generic data collector program. This has to be configured in the Mongo shell. However, indexing can be applied to the sharded database by the generic data collector program. To test the influence of a sharded database, a sharded database is set up in the Mongo shell with one config server, one mongos router and two shards at different ports on the localhost. Because they are all running on the localhost, they have to share the system resources. This limitation leads to slower insert and query execution times than in the other, non-sharded database test sets. However, the tests indicate that queries on the shard key are faster, as expected, and that adding indexing by the generic data collector also leads to faster query execution times, but almost twice as slow insert execution times.

The three explored cases give a good indication that the generic data collector program is practically useful.

V. DISCUSSION AND CONCLUSION

In this study, it is investigated whether and how more generic data collectors for feeding a NoSQL document store can be constructed. Such a data collector should extract, transform and load data from incoming data streams into multiple document collections. A challenge is to keep the ETL-process minimal to avoid velocity problems. It is a trade-off between the processing of the data before inserting it into the database and the query execution time. The generic data collector is a mechanism that helps the programmer determine how to set up the document store so that data is stored in an efficient way.

A drawback of the program is that unexpected queries that do not involve any indexed fields are still slow. A possible solution for this problem is to store a list of non-indexed fields that occur in these ad hoc queries. When a certain field from this list occurs often, the generic data collector system will create an index on it. Of course, attention is required because too many indexes may slow down the system. Therefore, the insert execution times must simultaneously be monitored.

Another possible feature that could be implemented in the future is the use of asynchronous insertions. This would reduce the insert execution time, since MongoDB then does not wait for a confirmation. However, asynchronous writes are 'unsafe': you do not receive feedback on whether the insert was successful or not.

The generic data collector system can also serve as a core tuning system. In the future, this system can be extended with possibilities to generate the necessary statistical information on the basis of live inserts and queries and thus automatically improve the index pattern of a collection.

This thesis is a first step towards a solution for a generic data collector. The use of the generic data collector program is tested by working bottom-up. The three explored cases give a good indication that the generic data collector program is practically useful.

REFERENCES

[1] Dragland, A.: Big Data, for better or worse: 90% of world's data generated over last two years. ScienceDaily (2013)

[2] Spire: Why Organizations Should Explore Their Unstructured Data. Retrieved from: http://spiretechnologies.com/organizations-explore-unstructured-data (2016)

[3] IBM Big Data & Analytics Hub: The Four V's of Big Data. Retrieved from: https://lebigdata.com/en/volume-of-big-data (2016)

[4] Solid IT: Knowledge Base of Relational and NoSQL Database Management Systems: Database Engines Ranking. Retrieved from: https://db-engines.com/en/ranking (2018)

[5] MongoDB Documentation: MongoDB Drivers: C# and .NET MongoDB Driver. Retrieved from: https://docs.mongodb.com/ecosystem/drivers/csharp/ (2008)

[6] MongoDB Documentation: Indexes. Retrieved from: https://docs.mongodb.com/manual/indexes/ (2008)

[7] MongoDB Documentation: Sharding. Retrieved from: https://docs.mongodb.com/manual/sharding/ (2008)

[8] De Tré, G., Bronselaer, A.: Information management. UGent, Gent, Belgium, pp. 345–367 (2017)


[9] Dujmovic, J. J.: A Method For Evaluation And Selection Of Complex Hardware And Software Systems. In: Int. CMG Conference, San Francisco, USA, pp. 368–378 (1996)


Contents

Permission for usage ii

Preface iii

Abstract iv

Extended abstract iv

List of abbreviations 1

1 Introduction 2

1.1 Problem statement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2

1.2 Goal . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2

1.3 Thesis overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3

2 Big Data 4

2.1 What is meant by big data? . . . . . . . . . . . . . . . . . . . . . . . . . . . 4

2.1.1 Volume . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4

2.1.2 Variety . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5

2.1.3 Velocity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5

2.1.4 Veracity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5

2.2 Relational Databases . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6

2.2.1 Definition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6

2.2.2 ACID . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7

2.2.3 Limitations of RDBMS to support big data . . . . . . . . . . . . . . . 7

2.3 NoSQL databases . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8

2.3.1 Definition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8

2.3.2 CAP theorem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10

2.3.3 BASE . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10

2.3.4 Data models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11

2.3.4.1 Key-value store databases . . . . . . . . . . . . . . . . . . . 11


2.3.4.2 Graph databases . . . . . . . . . . . . . . . . . . . . . . . . 12

2.3.4.3 Document store databases . . . . . . . . . . . . . . . . . . 12

2.3.4.4 Column-oriented databases . . . . . . . . . . . . . . . . . . 13

3 Generic data collector 14

3.1 Definition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14

3.2 Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15

4 Part 1: Information gathering 17

4.1 Basic principles . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17

4.2 General skeleton . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19

4.3 Example skeleton . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20

5 Part 2: Execution of generic methods 22

5.1 Generic insert method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22

5.1.1 Insert one vs. many . . . . . . . . . . . . . . . . . . . . . . . . . . . 22

5.1.2 Performance measure . . . . . . . . . . . . . . . . . . . . . . . . . . 23

5.2 Generic query method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23

5.2.1 Indexing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23

5.2.1.1 Index types in MongoDB . . . . . . . . . . . . . . . . . . . 24

5.2.1.2 Indexing strategies . . . . . . . . . . . . . . . . . . . . . . . 26

5.2.1.3 Indexing implemented in the generic data collector . . . . . . 27

5.2.2 Sharding . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28

5.2.2.1 Sharding strategies . . . . . . . . . . . . . . . . . . . . . . 30

5.2.2.2 Sharding implemented in the generic data collector . . . . . . 32

5.2.3 Performance measure . . . . . . . . . . . . . . . . . . . . . . . . . . 33

5.3 Trade-off between insert and query execution times . . . . . . . . . . . . . . . . 34

6 Application of the generic data collector 35

6.1 Description of the database . . . . . . . . . . . . . . . . . . . . . . . . . . . 35

6.1.1 Process alarms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35

6.1.2 Operator actions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36

6.2 Implementation of the message database . . . . . . . . . . . . . . . . . . . . 36

6.2.1 Input data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36

6.2.2 Storing the data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36

6.2.3 Adding the custom attribute parameters . . . . . . . . . . . . . . . . 37

6.3 Insertion in the message database . . . . . . . . . . . . . . . . . . . . . . . . 41

6.3.1 Insertion times of the operator actions collection . . . . . . . . . . . . 41


6.3.2 Insertion times of the process alarms collection . . . . . . . . . . . . . 43

6.4 Querying in the message database . . . . . . . . . . . . . . . . . . . . . . . . 46

6.4.1 Query times of the operator actions collection . . . . . . . . . . . . . 47

6.4.1.1 Execution times of expected queries . . . . . . . . . . . . . 47

6.4.1.2 Execution times of unexpected queries . . . . . . . . . . . . 49

6.4.2 Query times of the process alarms collection . . . . . . . . . . . . . . 51

6.4.2.1 Execution times of expected queries . . . . . . . . . . . . . 51

6.4.2.2 Execution times of unexpected queries . . . . . . . . . . . . 53

6.5 Discussion of the results in the message database . . . . . . . . . . . . . . . . 55

6.5.1 Cost vs. preference analysis model . . . . . . . . . . . . . . . . . . . 55

6.5.2 Evaluation of the operator actions collection . . . . . . . . . . . . . . 57

6.5.2.1 Suitability of the generic data collector program . . . . . . . 57

6.5.2.2 Suitability of the dump program . . . . . . . . . . . . . . . 59

6.5.2.3 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . 61

6.5.3 Evaluation of the process alarms collection . . . . . . . . . . . . . . . 62

6.5.3.1 Suitability of the generic data collector program . . . . . . . 62

6.5.3.2 Suitability of the dump program . . . . . . . . . . . . . . . 63

6.5.3.3 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . 65

6.5.4 General discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66

6.6 The influence of sharding in the message database . . . . . . . . . . . . . . . 66

6.6.1 Creating the sharded database . . . . . . . . . . . . . . . . . . . . . . 66

6.6.2 Insertion in the sharded operator actions database . . . . . . . . . . . 67

6.6.3 Querying in the sharded operator actions database . . . . . . . . . . . 68

6.6.4 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68

7 Conclusion 70

Bibliography 71


List of abbreviations

RDBMS Relational Database Management System

SQL Structured Query Language

NoSQL Not only SQL

ETL Extract Transform Load

CLR Common Language Runtime

IL Intermediate language

DB Database

LSP Logic Scoring of Preference method

QET Query Execution Time


Chapter 1

Introduction

1.1 Problem statement

The amount of data that's being created and stored on a global level is almost inconceivable, and it just keeps growing. Handling this big data in information management systems faces several challenges. Big data challenges include storing and analyzing large, rapidly growing, diverse and noisy data stores, and then deciding precisely how to best handle that data.

For data storage, traditional relational databases are extended with NoSQL data storages that especially focus on solving problems like huge data volumes, large varieties of data structure, high velocity requirements and the need to guarantee the veracity of the data.

NoSQL solutions have the advantage that inserts are very efficient and fast. More specifically, the input data can be directly written (i.e. dumped) into the database without any preprocessing, in contrast to SQL solutions, where prior preprocessing is required.

However, this advantage of efficient insert dumps is afterwards detrimental for querying: NoSQL solutions do not efficiently support ad hoc querying. Typically, a NoSQL database, like older operational databases, is designed in view of the optimization of the most important operation(s) of a specific application. Other operations will execute less efficiently. Thus, the NoSQL database will be optimized in view of important queries; other, ad hoc queries will execute less efficiently. Therefore, a more generic solution is desired.

1.2 Goal

In this master's thesis it is studied whether and how more generic data collectors for feeding a NoSQL document store can be constructed.


Such a data collector should extract, transform and load data (ETL) from incoming data streams into multiple document collections. Each document collection still has to be optimized in view of the efficient handling of a specific operation, but all collections together support the optimization of multiple operations. The operations on a database that will be discussed in this master's thesis are inserts and queries. As such, the application becomes less dependent on the structure of the inserted data.

A challenge is to keep the ETL-process minimal to avoid velocity problems. It is a trade-off between the processing of the data before inserting it into the database and the query execution time. The more the data is preprocessed, the longer insertions into the database take but the faster queries are executed. On the other hand, the less the data is preprocessed, the faster insertions are done but the slower queries will execute.

Theoretical results need to be verified with a prototype implementation of the proposed solution.

1.3 Thesis overview

In the remainder of this study, the system design and performance are explained in detail. In chapter 2, the definition of big data and how it is stored is explained. In chapter 3, the definition and implementation of the generic data collector are described. In the following two chapters, chapters 4 and 5, details of the implementation of the generic data collector are discussed. Next, in chapter 6, the generic data collector is applied to use cases and its performance is tested. Finally, in chapter 7, a conclusion is given.


Chapter 2

Big Data

2.1 What is meant by big data?

Big data encompasses a wide range of the tremendous data generated from various sources such as mobile devices, digital repositories, and enterprise applications. It is characterized by at least one of the four dimensions: volume, variety, velocity and veracity. These four V's are explained in the next sections.

2.1.1 Volume

The main characteristic that makes data big is the sheer volume. In the era of the Internet, social networks, mobile devices and sensors producing data continuously, the notion of size associated with data collections has evolved very quickly. 90% of all data ever created was created in the past two years [3]. From now on, the amount of data in the world will double every two years. By 2020, we will have 50 times the amount of data that we had in 2011.

Airplanes generate approximately 2,5 billion Terabyte of data each year from the sensors installed in the engines. Self-driving cars will generate 2 Petabyte of data every year. The agricultural industry also generates massive amounts of data with sensors installed in tractors. And the Square Kilometer Array telescope will generate 1 Exabyte of data per day [4].

Managing large and rapidly increasing volumes of data has been a challenging issue for many decades. In the past, this challenge was mitigated by processors getting faster, following Moore's law, providing us with the resources needed to cope with increasing volumes of data. However, there is a fundamental shift underway now: data volume is scaling faster than computing resources, and CPU speeds are static [2].


2.1.2 Variety

The second dimension of big data is the variety of data. In the past, all the created data was structured data; it neatly fitted in columns and rows. Nowadays, 90% of the data that is generated by organizations is unstructured data [5]. Data today comes in many different formats: structured data, semi-structured data, unstructured data and even complex structured data.

This wide variety of data originates from the fact that big data is present in all sectors, from the medical world to construction and business. Consider, for example, the electronic patient files in healthcare, which contribute many trillions of gigabytes of data. And then we have not even mentioned the videos we watch on YouTube, the posts we share via Facebook and the blog articles we write.

2.1.3 Velocity

Velocity is the enormous frequency at which the data is created, stored, analyzed and visualized. In the past, when batch processing was common practice, it was normal to receive an update from the database every night or even every week. Computers and servers required substantial time to process the data and update the databases. In the big data era, data is created in real time or near real time. With the availability of Internet-connected devices, wireless or wired, machines and devices can pass on their data the moment it is created.

The speed at which data is created currently is almost unimaginable: every minute we upload 100 hours of video to YouTube. In addition, every minute over 200 million emails are sent, around 20 million photos are viewed and 30.000 uploaded on Flickr, almost 300.000 tweets are sent and almost 2,5 million queries on Google are performed [1].

2.1.4 Veracity

Veracity refers to the trustworthiness of the data. Data ages quickly and much information shared via the Internet and social media does not necessarily have to be correct. Having a lot of data in different volumes coming in at high speed is worthless if that data is incorrect. Therefore, organizations need to ensure that the data is correct, as well as that the analyses performed on the data are correct. Especially in automated decision-making, where no human is involved anymore, you need to be sure that both the data and the analyses are correct.

Nowadays, there is much uncertainty about data: 1 in 3 business leaders do not trust the information they use to make decisions, and poor data quality costs the US economy around 2.5 trillion euros a year [1].


2.2 Relational Databases

To manage this data, relational database systems can be used to structure the data, store it into a database and query it. In this section, the definition and most important properties of relational databases are explained. Next, the utility of relational databases in handling big data is discussed.

2.2.1 Definition

A relational database management system (RDBMS) is a program that lets you create, update, and administer a relational database. A relational database is a collection of data items organized as a set of formally described tables from which data can be accessed or reassembled in many different ways without having to reorganize the database tables.

E. F. Codd developed the relational model at IBM in the 1970s. A relational database is a set of tables containing data fitted into predefined categories. Each table contains one or more data categories in columns. These data categories are called attributes. Each row, also called a tuple, contains a unique instance of data for the categories defined by the columns. A representation of an RDBMS is given in figure 2.1. For example, a typical business order entry database would include a table orders that describes orders by columns for productID, productType, customer and orderDate. The column customer refers to entries from the table customers, with columns for name, address and phone number. The column productType contains references to another table called productTypes that contains columns for the product's name, brand, and other information about the product type. In this table, the column brand in turn contains references to another table that defines brands.

Most relational database management systems use the Structured Query Language (SQL) to access the database. SQL statements are used both for interactive queries for information from a relational database (i.e. queries through direct user interaction) and for gathering data for reports (i.e. queries through an API). Interactive queries are widely divergent; they can be anything. Queries from an API, on the other hand, are mostly known in advance; they are more predictable.


Figure 2.1: Representation of a relational database.

2.2.2 ACID

A very important feature of all relational database systems is that they are transaction-based and support ACID. A database transaction is a sequence of one or more SQL operations that are treated as a unit. ACID assures that all the transactions are reliably processed. ACID stands for Atomicity, Consistency, Isolation and Durability:

• Atomicity: The database transaction must completely succeed or completely fail. This is the all-or-none principle. Partial success is not allowed: if one element of a transaction fails, the entire transaction fails.

• Consistency: The database transaction must meet all protocols or rules defined by the system at all times. During a transaction, the RDBMS progresses from one valid state to another. The state is never invalid, and there are never any half-completed transactions.

• Isolation: The client's database transaction must occur in isolation from other clients attempting to transact with the RDBMS. No transaction has access to any other transaction that is in an intermediate or unfinished state. Thus, each transaction is independent unto itself.

• Durability: Once the transaction is complete, it will persist as complete and cannot be undone. It will survive system failure, power loss and other types of system breakdowns.

2.2.3 Limitations of RDBMS to support big data

RDBMS had been the one solution for all database needs. Oracle, IBM, and Microsoft are the leading players of RDBMS. However, the volume, velocity, variety and veracity of business data have changed dramatically in the last couple of years and are skyrocketing every day. In this section, I will discuss three limitations of RDBMS in supporting big data.


First, the data size has increased tremendously, to the range of petabytes (one petabyte = 1,024 terabytes). RDBMS finds it challenging to handle such huge data volumes. To address this, RDBMS added more central processing units or more memory to the database management system to scale up vertically.

Second, the majority of the data comes in a semi-structured or unstructured format, from social media, audio, video, texts, and emails. However, unstructured data is outside the purview of RDBMS, because relational databases just can't categorize unstructured data. They are designed and structured to accommodate a fixed database schema, which leads to expensive database transactions.

Third, big data is generated at a very high velocity. RDBMS lacks high velocity because it is designed for steady data retention rather than rapid growth. With its ACID properties, it keeps the data consistent and stable on disk. This write consistency can be a wonderful thing for application developers, but it also requires sophisticated locking, which delays the data storage velocity.

As a result, these limitations of relational databases led to the emergence of new technologies, namely NoSQL databases.

2.3 NoSQL databases

NoSQL databases were specifically designed to handle big data, unlike relational databases. Most NoSQL systems have removed the multi-platform support and the features that guarantee the ACID properties of RDBMS, making them much more lightweight and efficient than their RDBMS counterparts. In this section, the definition and properties of NoSQL databases are given. Next, the different types of NoSQL databases are listed.

2.3.1 Definition

NoSQL stands for Not Only SQL. It is a non-relational database architecture that is used for handling huge amounts of data in the multi-terabyte and petabyte range. Rather than the strict table/row structure of the relational databases that are traditionally used in enterprises, NoSQL databases are field-oriented and more flexible for many applications. They scale horizontally, which means that inconsistent amounts of data can be stored in an individual item/record (the equivalent of a relational row).

The benefits of developing and using NoSQL databases are:


• Scalability: NoSQL databases use a horizontal scale-out methodology that makes it easy to add or reduce capacity quickly and non-disruptively with commodity hardware. This eliminates the tremendous cost and complexity of manual sharding that is necessary when attempting to scale RDBMS.

• Cost-efficiency: Because NoSQL uses inexpensive commodity hardware, cost savings versus RDBMS become more dramatic over time as greater capacity is needed to accommodate petabytes and exabytes of big data. Also, organizations only need to deploy the amount of hardware that is required to meet current capacity requirements rather than making large purchases ahead of need.

• Flexibility: Whether an organization is developing web, mobile, or IoT applications, the fixed data models of RDBMS prevent or dramatically slow down an organization's ability to adapt to evolving big data application requirements. NoSQL enables developers to use the data types and query options that best fit the specific application use case, enabling faster and more agile development.

• Performance: As mentioned, with RDBMS, increasing performance incurs tremendous expense and the overhead of manual sharding. On the other hand, when compute resources are added to a NoSQL database, performance increases in a proportional manner, so that organizations can continue to deliver a reliably fast user experience.

• High Availability: Typical RDBMS systems rely on primary/secondary architectures that are complex and can create single points of failure. Some "distributed" NoSQL databases use a masterless architecture that automatically distributes data equally among multiple resources, so that the application remains available for both read and write operations even when one node fails.

• Global Availability: By automatically replicating data across multiple servers, data centers, or cloud resources, distributed NoSQL databases can minimize latency and ensure a consistent application experience wherever users are located. An added benefit is a significantly reduced database management burden from manual RDBMS configuration, freeing operations teams to focus on other business priorities.

Disadvantages of NoSQL databases are:

• Lack of standardization NoSQL is not a specific type of database or programming

interface. The design and query languages of NoSQL databases vary widely between

different NoSQL products. There is no standard query language provided, such as

by traditional SQL database systems. This means that the learning curve for NoSQL


databases is steeper, since a programmer who knows one type of NoSQL database may

not be prepared to work with a different one.

• Ad hoc queries NoSQL can dump data into a database very efficiently, but this is

detrimental for querying afterwards. NoSQL solutions do not efficiently support ad hoc

querying, because the data is stored in a more unstructured way. Programmers have to

add indexes, sharding or extra structure to get more efficient ad hoc queries.

2.3.2 CAP theorem

Eric Brewer made the conjecture that a distributed system cannot provide all three of the following guarantees at once:

• Consistency Every read receives the most recent write or an error.

• Availability Every request receives a (non-error) response – without guarantee that it

contains the most recent write.

• Partition tolerance The system continues to operate despite an arbitrary number of

messages being dropped (or delayed) by the network between nodes.

In a networked, distributed system, failures can and will occur. Thus, partition tolerance has to be accommodated, and designers are forced to decide between consistency and availability.

RDBMS have ACID database transactions (see section 2.2.2) that provide consistency and

partition tolerance. But ACID transactions are mostly far more pessimistic, i.e. they’re more

worried about data consistency, than the domain actually requires. Hence, an alternative to

ACID has come up: BASE. This term will be explained in the next section.

2.3.3 BASE

BASE provides the availability choice for partitioned databases, instead of the consistency

choice of ACID. It is diametrically opposed to ACID: Where ACID is pessimistic and forces

consistency at the end of every operation, BASE is optimistic and accepts that the database

consistency will be in a state of flux. BASE is quite manageable and leads to levels of scalability

that cannot be obtained with ACID.

BASE stands for Basically Available, Soft state and Eventual consistency:

• Basically Available The system does guarantee the availability of the data as regards

CAP Theorem; there will be a response to any request. But, that response could still be


‘failure’ to obtain the requested data or the data may be in an inconsistent or changing

state.

• Soft state The state of the system could change over time, so even during times without

input there may be changes going on due to ‘eventual consistency,’ thus the state of the

system is always ‘soft.’

• Eventual consistency The system will eventually become consistent once it stops

receiving input. The data will propagate to everywhere it should sooner or later, but

the system will continue to receive input and is not checking the consistency of every

transaction before it moves onto the next one.

NoSQL systems can be seen as ‘lightweight’ versions of traditional SQL systems in which

the focus is on BASE support instead of ACID support. This results in more flexibility and

scalability.

2.3.4 Data models

Several different varieties of NoSQL databases have been created to support specific needs

and use cases. These fall into four main categories: key-value store databases, document store

databases, column-oriented databases and graph databases.

2.3.4.1 Key-value store databases

Key-value NoSQL databases contain pairs of keys and values. The key can be synthetic or

auto-generated, and the values can be any type of binary object (text, video, JSON document,

etc.). These stores are similar to hash tables where the keys are used as indexes, thus making lookups faster than in RDBMS. The data model is simple: a map or a dictionary that allows the user

to request the values according to the key specified. This is represented in figure 2.3.
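Conceptually, such a key-value store can be compared to an ordinary in-memory dictionary, as the following C# sketch illustrates (the key and value shown are made up; a real key-value store additionally persists and distributes the pairs):

using System.Collections.Generic;
using System.Text;

public class KeyValueAnalogy
{
    public void Demo()
    {
        // A key-value store behaves like a dictionary: values are written and
        // read through their key, without any schema on the value itself.
        var store = new Dictionary<string, byte[]>();
        store["owner:42"] = Encoding.UTF8.GetBytes("{\"firstname\":\"Ann\"}"); // put
        byte[] value = store["owner:42"];                                       // fast lookup by key
    }
}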

The modern key value data stores prefer high scalability over consistency. Hence ad hoc

querying and analytics features like joins and aggregate operations have been omitted. High

concurrency, fast lookups and options for mass storage are provided by key-value stores. A

disadvantage of a key value data store is the lack of schema which makes it much more difficult

to create custom views of the data. Another disadvantage is that all key-value pairs are stored

together in a single namespace, which can become very large and thus slower for querying.

The top three key-value stores in March 2018 are Redis, Amazon DynamoDB and Memcached [10].


2.3.4.2 Graph databases

Graph databases are databases which store data in the form of a graph. The graph consists of

nodes and edges, where nodes act as the objects and edges act as the relationship between the

objects. This is shown in figure 2.2. The graph can also contain properties related to the nodes. Graph databases use a technique called index-free adjacency, meaning that every node contains direct pointers to its adjacent nodes. Millions of records can be traversed using this technique. In graph databases, the main emphasis is on the connections between data. Graph databases provide schema-less and efficient storage of semi-structured data. The queries are expressed as traversals, thus making graph databases faster than relational databases for this kind of workload. They are easy to scale and whiteboard friendly. Graph databases are ACID compliant and offer rollback support.

The top three graph databases in March 2018 are Neo4j, Microsoft Azure Cosmos DB and Datastax Enterprise [10].

Figure 2.2: Representation of a graph store.

Figure 2.3: Representation of a key-value store.

2.3.4.3 Document store databases

Document store databases store their data in the form of documents, see figure 2.4. The

documents are of standard formats such as XML, BSON and JSON. They are somewhat

similar to records in relational databases, but they are much more flexible since they are

schema free. Schema free means that documents in the same collection can each have a different structure. Documents in the database are addressed using a unique key that represents that document. These keys may be a simple string or a string that refers to a URI or path. Document stores are slightly more complex than key-value stores, as they allow the key-value pairs to be encased in documents, also known as key-document pairs. Another advantage

of document stores is that they support horizontal sharding to improve query performance.

This means that large collections can be split over multiple servers, allocating a subset of work

to each server.
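To illustrate the schema freedom, the following two hypothetical documents (in the JSON notation used for sample documents elsewhere in this thesis) could both reside in the same collection, even though they have a different structure:

{ "_id": 1, "firstname": "Ann", "lastname": "Peeters" }
{ "_id": 2, "firstname": "Tom", "address": { "city": "Ghent", "zip": "9000" }, "phones": ["052/123456"] }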

The top three document stores in March 2018 are MongoDB, Amazon DynamoDB and Couch-

base [10].


2.3.4.4 Column-oriented databases

In a column-oriented NoSQL database, data is stored in cells grouped in columns of data rather than as rows of data. Columns are logically grouped into column families. Column families can contain a virtually unlimited number of columns that can be created at runtime or during the definition of the schema. This is shown in figure 2.5.

A first advantage of column stores is that when a query involves only a small number of

columns, less data will be read compared to document stores. A second advantage is that

they also deliver good compression, because values in a column are often very similar. A

disadvantage is that writes are expensive: because each row is broken apart into many columns, each write of a row ends up being many separate writes.

The top three column-oriented databases in March 2018 are Cassandra, HBase and Microsoft

Azure Cosmos DB [10].

Figure 2.4: Representation of a document store.

Figure 2.5: Representation of a column store (vs. a document store).


Chapter 3

Generic data collector

In this chapter, the definition and implementation of the generic data collector are explained.

3.1 Definition

The research in this master thesis focuses on document databases because they are commonly

used NoSQL databases [10]. With document databases, you have a lot of freedom. There

are many ways to store and retrieve the data. First, document stores are schema free. Thus, the possibilities to structure the data in the database are endless. Second, document stores can

insert data in multiple ways: via inserting document after document, or via a batch-insert (i.e.

multiple documents at once), and these inserts can each be synchronous or asynchronous.

Third, document stores support indexing for faster query execution times. Indexes can be

set on all desired fields. But indexes have the disadvantage that they delay the insert time. This is the case when there are too many indexes, when some indexes become too large to handle/store in memory, or when they are set on the wrong fields and are thus not useful. Fourth,

document stores also support horizontal sharding to get more parallel work and thus faster

query execution times. Again, on which field(s) should the horizontal sharding be applied to

get the best performance?

Programmers that want to implement a document store need a lot of prior knowledge of document stores before creating one for their database. But not everyone has the time or insight to do this. Programmers also have to decide how much structure they will create in their database. A lot of structure in the database leads to efficient searches, but adding less structure, i.e. doing a data dump, leads to more efficient insert times. Therefore, I will create a generic data collector that makes their work easier.

The generic data collector will search for a good solution to store and retrieve this data in a

document store as fast as possible.


More specifically, the generic data collector makes the ETL-process generic. ETL is short for

extract, transform, load. These are three database functions that are combined to pull data

out of a source and place it into a database:

• Extract is the process of reading data from a source (database, file, ...). In this stage,

the data is collected, often from multiple and different types of sources.

• Transform is the process of converting the extracted data from its previous form into

the desired form to place it into the database. Transformation occurs by using rules or

lookup tables or by combining the data with other data.

• Load is the process of writing the data into the target database.
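As an illustration of these three steps, the following minimal C# sketch extracts records from a hypothetical CSV source file, transforms them into BSON documents and loads them into a MongoDB collection (the file name, field positions, database name and connection string are assumptions):

using System.IO;
using System.Linq;
using MongoDB.Bson;
using MongoDB.Driver;

public class EtlSketch
{
    public void Run()
    {
        // Extract: read raw records from a (hypothetical) CSV source file.
        var lines = File.ReadLines("cars.csv").Skip(1); // skip the header row

        // Transform: convert each CSV line into a BSON document.
        var documents = lines.Select(line =>
        {
            var parts = line.Split(',');
            return new BsonDocument
            {
                { "idNumber", parts[0] },
                { "brand", parts[1] },
                { "model", parts[2] }
            };
        }).ToList();

        // Load: write the documents into the target MongoDB collection.
        var client = new MongoClient("mongodb://localhost:27017");
        var collection = client.GetDatabase("demo").GetCollection<BsonDocument>("cars");
        collection.InsertMany(documents);
    }
}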

The generic data collector is thus a generic ETL-process. This is shown in figure 3.1. It will

extract, transform and load the input data automatically into the database in such a way that

the insert and query execution times are optimized.

Figure 3.1: Representation of the generic data collector that preprocesses and stores the data into a document store (MongoDB) as well as possible vs. a direct dump of the data into MongoDB.

3.2 Implementation

To implement the generic data collector, I have chosen to use MongoDB as the document store database, because it is currently the leading document store [10]. The programming language that I use during this thesis is C#, in the Visual Studio programming environment. MongoDB provides an open-source driver [11] to connect from C# to the database.


The generic data collector consists of two parts: the information gathering part and the

execution part. First, information about the input database must be given to the generic

collector by a programmer. This information can be used by the collector to search for efficient

storage and query methods. Next, in the execution part, when a programmer calls the generic

insert function of the generic data collector, the input data will be inserted in an efficient way.

Analogously, when a programmer calls the generic query function of the generic data collector, the data will be queried in an efficient way, i.e. with a short query execution time.

In the following two chapters, the implementation of the generic data collector will be discussed. First, the structure of the skeleton that defines the information gathering part will be explained. Second, the principles of the generic insert and query methods, i.e. the execution part, will be listed.


Chapter 4

Part 1: Information gathering

In this chapter, the first part of the implementation of the generic data collector is explained.

This information gathering phase is characterized by a general skeleton that a programmer

needs to define. This skeleton delivers the generic data collector information about the input

data.

In the first section, the basics of a C# implementation of a database in MongoDB are given that will be used in the definition of the skeleton. In the second section, the general skeleton

will be explained. And in the third section, an example is applied to the skeleton to show how

it works.

4.1 Basic principles

An example of the definition of two MongoDB collections ”Owner” and ”Car” in C# is given in listing 4.1. Both collections inherit from the class MongoBase so that every document in these collections gets a unique, automatically generated ObjectId field, called ”_id”. The collection ”Owner” consists of documents that can have 3 fields filled in, namely ”firstname”, ”lastname” and ”passportnumber”. The collection ”Car” consists of documents that can have 5 fields filled in, namely ”idNumber”, ”brand”, ”model”, ”deliverday” and ”owner”, where ”owner” has a type equal to the class ”Owner”, because it refers to an existing document in the ”Owner” collection.

Listing 4.1: Definition of the MongoDB collections Owner and Car in C#.

using MongoDB.Bson;

public abstract class MongoBase
{
    public ObjectId _id;
}

public class Owner : MongoBase
{
    public string firstname { get; set; }
    public string lastname { get; set; }
    public string passportnumber { get; set; }
}

public class Car : MongoBase
{
    public string idNumber { get; set; }
    public string brand { get; set; }
    public string model { get; set; }
    public DateTime deliverday { get; set; }
    public Owner owner { get; set; }
}

A more abstract and compact representation of a MongoDB collection in C# is given in listing 4.2. Here, ”Class MongoBase { _id }” refers to the creation of the abstract class MongoBase with one ObjectId field ”_id”, and the collection, called ”collection”, contains properties and subclasses. <properties> refers to the declaration of fields with basic value types such as string, DateTime, long, bool, int and float (e.g. ”public string firstname”), and <subclasses> refers to the declaration of fields with value types that are defined by custom classes (e.g. ”public Owner owner”).

Listing 4.2: Abstract representation of a MongoDB collection in C#.

Class MongoBase { _id }

Class collection : MongoBase
{
    <properties>
    <subclasses>
}


4.2 General skeleton

In this section, the general abstract skeleton of the generic data collector is discussed. A

programmer has to define it. The goal of this skeleton is to give the generic data collector

information about the layout and properties of the collection.

I limited the definition of a collection to a list of elements that belong to one class. So, all

documents in a collection are of the same class type. They are homogeneous. The reason

why I limited the documents to be homogeneous is simplicity. In this way, the loading and extraction of the data to and from the database is straightforward. Also, the definition of a collection can be captured in a single class definition.

The compact representation of a MongoDB collection was given in listing 4.2. This definition has to be extended by the programmer. He/she has to specify the details and properties of

the collection by filling in custom attribute parameters. These custom attributes then provide

information, i.e. metadata, to the general data collector system. In listing 4.3, the general

abstract definition of the collection with all custom attribute parameters is given. The custom

attribute is called GenericCollection, and it has 5 parameters:

• collectionName: is a string and it defines the exact name of that collection in the

MongoDB database. If this field is empty, the generic system will create a name auto-

matically.

• uniqueFields: is an array of strings and it consists of all fields in the collection that are

unique key fields (i.e. primary key fields).

• mostQueriedFields: is an array of strings and it consists of all (combinations of) fields

in the collection that are frequently queried.

• mostSortedFields: is an array of strings and it consists of fields on which queries are

often sorted.

• servers: is an array of strings and it consists of fields that define servers.

The reason why these five parameters are chosen will be explained later (section 5.2.1).

A limitation of using custom attributes is that only primitive constants or arrays of primitives

can be used as attribute parameters. This is a Common Language Runtime (CLR) restriction.

The reason is that an attribute must be encoded entirely in metadata. This is different from

a method body which is coded in Intermediate Language (IL). Using MetaData only severely

restricts the scope of values that can be used. In the current version of the CLR, metadata

values are limited to primitives, null, types and arrays of primitives [13].


Because of that CLR restriction, a matrix or a list of strings is not a valid attribute parameter, but an array of strings is allowed. Thus, in the generic data collector system the properties are of type string array. When you want to add one field to the array, you just add the field name. But when you want to add multiple fields that belong together, these field names have to be separated by commas and added as a single array element. E.g. if the fields ”firstname” and ”lastname” are often queried together, you add the string ”firstname,lastname” to the array ”mostQueriedFields”.

Listing 4.3: General skeleton of the definition of a collection with all the custom attribute

parameters.

Class MongoBase { _id }

[GenericCollection(collectionName = <collection name>,
                   uniqueFields = <unique key fields>,
                   mostQueriedFields = <array of (multiple) fields>,
                   mostSortedFields = <array of fields>,
                   servers = <array of fields>)]
Class collection : MongoBase
{
    <properties>
    <subclasses>
}
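The skeleton in listing 4.3 presumes a custom attribute named GenericCollection. A minimal sketch of how such an attribute class could be declared is given below; the property names follow the skeleton, but the actual implementation in the generic data collector may differ. Because of the CLR restriction mentioned above, every property is a string or an array of strings:

using System;

[AttributeUsage(AttributeTargets.Class)]
public class GenericCollectionAttribute : Attribute
{
    // Only primitives, strings and arrays of these are allowed as attribute parameters.
    public string collectionName { get; set; }
    public string[] uniqueFields { get; set; }
    public string[] mostQueriedFields { get; set; }
    public string[] mostSortedFields { get; set; }
    public string[] servers { get; set; }
}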

4.3 Example skeleton

In this section, the general abstract skeleton of the generic data collector will be applied to an example to clarify its use. The example case is the same as the one given in section 4.1: the

collection ”Car” with fields ”idNumber”, ”brand”, ”model”, ”deliverday” and ”owner”.

Listing 4.1 showed how to define the classes. The generic system works best when the programmer defines as many properties of the custom attribute GenericCollection as possible. The programmer can set the property collectionName equal to ”cars” if he/she wants the collection in the MongoDB database to be called ”cars”. He/she can add all the unique fields to the array uniqueFields; in this case, only idNumber is unique. Next, he/she can also add (combinations of) fields to the array mostQueriedFields if he/she knows which (combinations of) fields are queried most often. In this case, ”idNumber” and ”brand” are single fields that are often queried, and ”brand” and ”model” are also frequently queried together. Finally, the


programmer can also add fields to the array mostSortedFields that are often sorted on during

queries. In this example, queries are often sorted on ”deliverday” or ”brand”. The programmer

has only one server, so in this case he/she does not have to define the array servers. All these

properties of the collection ”Car” are noted in the GenericCollection attribute in listing 4.4 (lines 15 to 18).

Listing 4.4: Definition of the MongoDB collection Car in C# with custom attribute GenericCollection.

 1  using MongoDB.Bson;
 2
 3  public abstract class MongoBase
 4  {
 5      public ObjectId _id;
 6  }
 7
 8  public class Owner : MongoBase
 9  {
10      public string firstname { get; set; }
11      public string lastname { get; set; }
12      public string passportnumber { get; set; }
13  }
14
15  [GenericCollection(collectionName = "cars",
16      uniqueFields = new[] { "idNumber" },
17      mostQueriedFields = new[] { "idNumber", "brand", "brand,model" },
18      mostSortedFields = new[] { "deliverday", "brand" })]
19  public class Car : MongoBase
20  {
21      public string idNumber { get; set; }
22      public string brand { get; set; }
23      public string model { get; set; }
24      public DateTime deliverday { get; set; }
25      public Owner owner { get; set; }
26  }


Chapter 5

Part 2: Execution of generic methods

In this chapter, the principles of the generic insert and query method of the generic data

collector are discussed. These methods respectively search for the best insertion and query

methods. Optimizing the insertion method can have a negative influence on the query method

and vice versa. This trade-off will be discussed in the last section.

5.1 Generic insert method

The goal of the generic insert method of the generic system is to insert given data into the

MongoDB database as fast as possible. There are two ways to insert data into a collection,

and these will be explained in the next section. Next, the way performance of insertions can

be measured is given.

5.1.1 Insert one vs. many

There are two ways to insert data into a collection: by inserting document per document, i.e. the insertOne-method, or by inserting many documents at once, i.e. the insertMany-method. As shown in figure 5.1, the insertMany-method is always faster than multiple insertOne-methods.

More specifically, one insertMany-method is on average 43% faster than performing multiple

insertOne-methods. This can be explained by the fact that in the insertMany-method, the

program only needs to go to the database once, insert all documents in one bulk, and return

to the application. In contrast, with multiple individual inserts, the program needs to go to the database, do the insert of that single document, then go back to the application, and it repeats this process for every single insert operation. Thus, executing multiple


individual inserts is not as efficient as performing a single bulk insert. Therefore, the generic insert method will always do bulk insertions if possible.

Figure 5.1: Insert execution time of the InsertOne and InsertMany methods for different numbers of documents.
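With the official MongoDB C# driver, the two insertion variants look as follows (a sketch; the Car class is the one from listing 4.1 and the database name and connection string are assumptions):

using System.Collections.Generic;
using MongoDB.Driver;

public class InsertComparison
{
    public void Insert(List<Car> newCars)
    {
        var client = new MongoClient("mongodb://localhost:27017"); // assumed connection string
        var cars = client.GetDatabase("demo").GetCollection<Car>("cars");

        // In practice only one of the two variants below would be executed.

        // Variant 1: one round trip to the database per document.
        foreach (var car in newCars)
            cars.InsertOne(car);

        // Variant 2: a single round trip for the whole batch
        // (the approach preferred by the generic insert method).
        cars.InsertMany(newCars);
    }
}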

5.1.2 Performance measure

MongoDB does not deliver information about specific insertions. Thus, the performance of an insert method will be measured by timing it in the application itself. This is done by a class, called ”Timing”, that consists of a constructor and a save function. In the constructor, a Timing object is made that contains a name, a description and a start time. When the save function is called on such a Timing object, the elapsed time between the creation of the object and the current time is stored into a collection, together with the name and description.
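A minimal sketch of such a Timing class is given below (the collection and field names are assumptions; the actual implementation may differ):

using System;
using MongoDB.Bson;
using MongoDB.Driver;

public class Timing
{
    private readonly string name;
    private readonly string description;
    private readonly DateTime startTime;

    // The constructor records the moment the measured operation starts.
    public Timing(string name, string description)
    {
        this.name = name;
        this.description = description;
        this.startTime = DateTime.UtcNow;
    }

    // Save stores the elapsed time, together with name and description, in a collection.
    public void Save(IMongoCollection<BsonDocument> timings)
    {
        var elapsedMs = (DateTime.UtcNow - startTime).TotalMilliseconds;
        timings.InsertOne(new BsonDocument
        {
            { "name", name },
            { "description", description },
            { "elapsedMilliseconds", elapsedMs }
        });
    }
}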

5.2 Generic query method

The goal of the generic query method of the generic system is to query data as fast as possible.

MongoDB supports indexing and sharding to lower the query execution time. The definition

and use of these two concepts will be explained in the next two sections. Finally, the way

performance will be measured by the generic system is explained.

5.2.1 Indexing

Indexes support the efficient execution of queries in MongoDB [12]. Without indexes, Mon-

goDB must perform a collection scan, i.e. scanning every document in a collection, to select

the documents that match the query statement. If an appropriate index exists for a query,

MongoDB can use that index to limit the number of documents it must inspect.


Indexes are special data structures that store a small portion of the collection’s data set in an

easy-to-traverse form. Specifically, MongoDB indexes use a B-tree data structure. A B-tree is a

self-balancing tree data structure that keeps data sorted and allows searches, sequential access,

insertions, and deletions in logarithmic time. MongoDB defines indexes at the collection level

and supports indexes on any field or sub-field of the documents in a MongoDB collection.

The index stores the value of a specific field or set of fields, ordered by the value of that field.

The ordering of the index entries supports efficient equality matches and range-based query

operations of the field(s) on which the index is defined. In addition, MongoDB can return

sorted results by using the ordering in the index.

5.2.1.1 Index types in MongoDB

MongoDB provides a number of different index types to support specific types of data and

queries.

Default _id index

MongoDB creates a unique index on the _id field during the creation of a collection. This _id index ensures that every document can be distinguished, even if the values of their other fields are the same, and it prevents clients from inserting two documents with the same value for the _id field. You cannot drop this index on the _id field. The value of this default field can be generated automatically by the MongoDB system as an ObjectId.

Single field index

In addition to the MongoDB-defined _id index, MongoDB supports the creation of user-defined ascending/descending indexes on a single field of a document. The field on which the index is defined can be a single field, an embedded field or an embedded document. To explain the difference between these three, an example is given. Consider a collection named “BlockItem” whose documents resemble the following sample document:

{
  "_id": ObjectId("570c04a4ad233577f97dc459"),
  "processor": "AV3Z01",
  "paramItem": { "paramName": "BSTATIC", "value": 1 }
}

In this example, “processor” is a single field and “paramItem” is an embedded document. “paramItem” contains 2 embedded fields, namely paramName and value.

For a single-field index and sort operations, the sort order (i.e. ascending or descending) of

the index key does not matter because MongoDB can traverse the index in either direction.


If the index build takes too long, it is possible to set an extra option, i.e. ”background”, while creating the index in MongoDB. When this background option is set to true, the index will be created in the background so that other database operations can run while the index is being built. However, the mongo connection in which the index is being created will block until

the index build is complete. To continue issuing commands to the database, open another

connection or mongo instance. By default, background is set to false for building MongoDB

indexes. An important note is that queries will not use partially-built indexes. So, the index

will only be usable once the index build is complete.
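With the C# driver, creating such a single field index with the background option could look as follows (a sketch; the Car collection is the one from the earlier example):

using MongoDB.Driver;

public class IndexExample
{
    public void CreateBackgroundIndex(IMongoCollection<Car> cars)
    {
        // Ascending single field index on "idNumber", built in the background
        // so that other database operations can continue during the build.
        var keys = Builders<Car>.IndexKeys.Ascending(c => c.idNumber);
        var options = new CreateIndexOptions { Background = true };
        cars.Indexes.CreateOne(new CreateIndexModel<Car>(keys, options));
    }
}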

Compound index

MongoDB also supports user-defined compound indexes. A compound index is a single index

structure that holds references to multiple fields within a collection’s documents. The number of fields in any compound index is limited to 31.

The order of fields listed in a compound index has significance. For instance, if a compound index consists of { compound: 1, block: -1 }, the index sorts first by compound (ascending) and then, within each compound value, sorts by block (descending).

Compound indexes can support queries that match on multiple fields. For example, consider

the following compound index:

{“compound” : 1, “block” : 1, “parameter” : 1}

In this example, MongoDB can use the index for queries on the following fields:

• the compound field

• the compound field and the block field

• the compound field and the block field and the parameter field

MongoDB cannot use the compound index to support queries that do not include the compound field. This is a disadvantage of compound indexes, because the use of a compound index depends on both the list order (i.e. the order in which the keys are listed in the index) and the sort order (i.e. ascending or descending).

A solution for this problem is using index intersection (available in MongoDB version 2.6):

instead of creating one index on multiple fields, you can create one index per field (so multiple

single field indexes). These multiple single field indexes can, either individually or through index

intersection, support all queries without depending on the list order or sort order. The choice

between creating compound indexes that support your queries or relying on index intersection

depends on the specifics of the system.


The advantage of compound indexes is that they can cover queries very efficiently. When the query criteria and the projection of a query include only the indexed fields, MongoDB returns

results directly from the index without scanning any documents or bringing documents into

memory.

Multikey index

To index a field that holds an array value, MongoDB creates an index key for each element in

the array. These multikey indexes support efficient queries against array fields by matching on

element or elements of the arrays. Multikey indexes can be constructed over arrays that hold

both scalar values (e.g. strings, numbers) and nested documents. MongoDB automatically

creates a multikey index if any indexed field is an array.

Geospatial index

To support efficient queries of geospatial coordinate data, MongoDB provides two special indexes: 2d indexes that use planar geometry when returning results and 2dsphere indexes

that use spherical geometry to return results.

Text index

To support efficient searching for string content in a collection, MongoDB provides text in-

dexes. These text indexes can include any field whose value is a string or an array of string

elements. They do not store language-specific stop words (e.g. “the”, “a”, “or”) and stem

the words in a collection to only store root words. An important note is that a collection can

have at most 1 text index. When creating a text index on multiple fields, you can also use

the wildcard specifier ($**). With a wildcard text index, MongoDB indexes every field that

contains string data for each document in the collection.

Hashed index

To support hash based sharding, MongoDB provides a hashed index type, which indexes the

hash of the value of a field. These indexes have a more random distribution of values along

their range, but only support equality matches and cannot support range-based queries.

5.2.1.2 Indexing strategies

A good index strategy for an application must take a number of factors into account: the

kinds of queries you expect and the amount of free memory on your system.

Generally, MongoDB only uses one index to fulfill most queries. However, each clause of

an OR-query may use a different index, and starting in version 2.6, MongoDB can use an

intersection of multiple indexes. An index supports a query when the index contains all the


fields scanned by the query, because in that case the query scans the index and not the entire collection. This reduction in the number of documents to scan leads to increased query performance. If many queries use the same single key, then it is advantageous to create a

single-key index. If you sometimes query on only one key and at other times query on that key

combined with a second key, then creating a compound index is more efficient than creating

a single-key index. Because in those cases, MongoDB will use the compound index for both

queries.

Another indexing strategy is to use indexes to sort query results. In MongoDB sort operations

can obtain the sort order by retrieving documents based on the ordering in an index. If the

query planner cannot obtain the sort order from an index, it will sort the results in memory.

Thus, sort operations that use an index may lead to better performance than those that do

not use an index.

A third indexing strategy is to ensure that indexes fit in RAM, so that the system can avoid reading the index from disk and, as a result, will process queries faster. But indexes do not have to fit entirely into RAM in all cases. If the value of the indexed field increments with every insert and most queries select recently added documents, then MongoDB only needs to keep the parts of the index that hold the most recent values in RAM. This allows for efficient index use for read and write operations and minimizes the amount of RAM required to support the index.

A fourth indexing strategy is to create queries that ensure selectivity. Selectivity is the ability of a query to narrow results using the index. E.g. suppose all records contain the field “gender” with only 2 different values, “man” and “woman”. If you add an index on “gender”, then this index is a low-selectivity index. It will be of little help in locating records, because it only has 2 different values.

5.2.1.3 Indexing implemented in the generic data collector

The generic data collector will use indexing to optimize the generic query method. As mentioned in section 5.2.1.2, a first good index strategy is to set an index on fields that will most likely be queried. Therefore, the custom attribute GenericCollection from the skeleton contains an array called ”mostQueriedFields” to store frequently queried fields.

A second index strategy is to add indexes on unique fields. Queries on these fields will execute quickly because of their uniqueness. Therefore, the custom attribute GenericCollection includes an array called ”uniqueFields” to store these fields.

A third good index strategy, mentioned in section 5.2.1.2, is to set indexes on fields that are often sorted on in queries. For that reason, the custom attribute GenericCollection also includes an array called ”mostSortedFields” to store all these fields.


Two other good index strategies that were mentioned in section 5.2.1.2 are to ensure that indexes fit in RAM and to ensure high selectivity. The first of these is not handled explicitly by the generic system, because MongoDB largely solves it by itself by minimizing the amount of RAM an index requires. The other strategy, i.e. ensuring high selectivity of the fields that will be indexed, is something that the generic system cannot enforce. But in fact, if a lot of queries use such a field, an index on it will not be fully disadvantageous even when its selectivity is low.

In summary, ”mostQueriedFields”, ”uniqueFields” and ”mostSortedFields” are three arrays that belong to the custom attribute GenericCollection. These three arrays contain string values

that refer to one or more fields.

As mentioned in section 5.2.1.1, there are a lot of index types in MongoDB. The generic data collector system will support all of them except the hashed index, because this index is not as advantageous as the others (hashed indexes cannot handle ranges). The generic data collector system supports the default _id index automatically. Single field indexes will be created when an element of an array contains only one field. Compound indexes will be made when multiple fields are given in one element of an array (fields separated by commas). Multikey indexes will be created automatically, because MongoDB handles them internally. Geospatial indexes can be created by the generic data collector system, but then it must be indicated that the field is geospatial. This is done by prefixing the field name with ”geo:”. E.g. if the field ”location” requires a geospatial index, write the element as ”geo:location”. Finally, text indexes can also be created by the generic data collector system, but again this needs to be indicated, by prefixing the string with ”text:”. E.g. when queries often search for documents that contain the string ”aab”, you can add the element ”text:aab” to the array ”mostQueriedFields”.
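As a sketch, the translation of such skeleton entries into MongoDB indexes with the C# driver could look as follows (the Car collection and the chosen fields are the ones from the example in section 4.3; the actual implementation may differ):

using MongoDB.Driver;

public class GenericIndexCreation
{
    public void CreateIndexes(IMongoCollection<Car> cars)
    {
        var keys = Builders<Car>.IndexKeys;

        // "idNumber" in uniqueFields -> unique single field index.
        cars.Indexes.CreateOne(new CreateIndexModel<Car>(
            keys.Ascending(c => c.idNumber),
            new CreateIndexOptions { Unique = true }));

        // "brand,model" in mostQueriedFields -> compound index on both fields.
        cars.Indexes.CreateOne(new CreateIndexModel<Car>(
            keys.Combine(keys.Ascending(c => c.brand), keys.Ascending(c => c.model))));

        // A "text:..." entry -> text index, here on the "model" field as an example.
        cars.Indexes.CreateOne(new CreateIndexModel<Car>(keys.Text(c => c.model)));
    }
}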

5.2.2 Sharding

Sharding is a method for distributing data across multiple machines. Database systems with

large data sets or high throughput applications can challenge the capacity of a single server.

For example, high query rates can exhaust the CPU capacity of the server. Working set sizes

larger than the system’s RAM stress the I/O capacity of disk drives. There are two methods

for addressing system growth: vertical and horizontal scaling.

Vertical scaling increases the capacity of a single server, e.g. using a more powerful CPU,

adding more RAM, or increasing the amount of storage space. Limitations in available tech-

nology may restrict a single machine from being sufficiently powerful for a given workload.


Also, single machines can have hard ceilings based on available hardware configurations. As a

result, there is a practical maximum for vertical scaling.

Horizontal scaling involves dividing the system dataset and load over multiple servers, adding

additional servers to increase capacity as required. While the overall speed or capacity of

a single machine may not be high, each machine handles a subset of the overall workload,

potentially providing better efficiency than a single high-speed high-capacity server. Expanding

the capacity of the deployment only requires adding additional servers as needed, which can

be a lower overall cost than high-end hardware for a single machine. The trade-off is increased complexity in infrastructure and maintenance for the deployment.

MongoDB supports horizontal scaling through sharding [14]. Sharding is implemented in

MongoDB using a sharded cluster. A sharded cluster consists of shards, config servers and

the mongos. A shard contains a subset of the sharded data. Each shard can be deployed

as a replica set. The mongos acts as a query router, providing an interface between client

applications and the sharded cluster. Config servers store metadata and configuration settings

for the cluster. These components are visualized in figure 5.2, which is retrieved from the

MongoDB manual.

Figure 5.2: Representation of a sharded cluster in MongoDB (MongoDB Manual 2008).

MongoDB shards data at the collection level; this means that documents of one collection get distributed across the shards in the cluster. The partitioning of the collection data across

multiple shards is done using the shard key. The shard key consists of one or more immutable

fields, chosen by the programmer, that exist in every document in the target collection. It

cannot be changed after sharding, and every sharded collection can have only one shard key.

To shard a non-empty collection, the collection must have an index that starts with the shard

key. Thus if the collection is not empty, you first have to create an index. This index can be


an index on the shard key or a compound index where the shard key is a prefix of the index. In the other case, when the collection is empty, MongoDB creates the index on the shard key itself if such an index does not already exist.

Using sharding in MongoDB has several advantages. The first advantage is that sharding can

lead to faster insert and query times. This is because MongoDB distributes the read and write

workload across the shards in the sharded cluster. This allows each shard to process a subset

of cluster operations. For queries that include the shard key or the prefix of a compound

shard key, mongos (i.e. the query router) can target the query at a specific shard or set

of shards. These targeted operations are generally more efficient than broadcasting to every

shard in the cluster. Second, sharding utilizes the storage capacity of the cluster optimally. This is because sharding distributes data across the shards in the cluster, allowing each shard to

contain a subset of the total cluster data. As the data set grows, additional shards increase the

storage capacity of the cluster. Third, sharding delivers high availability. A sharded cluster can

continue to perform partial read or write operations even if one or more shards are unavailable.

While the subset of data on the unavailable shards cannot be accessed during the downtime,

reads or writes directed at the available shards can still succeed.

Next to the advantages of sharding, there are also some disadvantages/considerations to take

into account. Using sharding requires careful planning, execution, and maintenance. First, the

choice of the shard key is essential for ensuring cluster performance and efficiency. The shard

key cannot be changed after sharding, nor can a sharded collection be unsharded. Second,

sharding has certain operational requirements and restrictions [15]. One important restriction to note is that sharding does not support geoSearch, i.e. sharding does not support queries

on geospatial data. Third, if queries do not include the shard key or the prefix of a compound

shard key, mongos performs a broadcast operation, querying all shards in the sharded cluster.

These scatter/gather queries can be long running operations.

5.2.2.1 Sharding strategies

MongoDB supports two sharding strategies for distributing data across sharded clusters:

hashed sharding and ranged sharding.

Hashed sharding

Hashed sharding computes a hash of the shard key field’s value to partition data across a

sharded cluster. The process goes as follows: first, a hash of the shard key is computed.

Second, the data is divided into chunks based on this hash value. Each chunk has an inclusive

lower and exclusive upper range based on that hashed shard key. Finally, the chunks are divided as evenly as possible over the shards. An example of this process is given in figure 5.3.


In that figure, the shard key is ”x”. Next, the hash of its value is computed. Finally, the document is inserted in the corresponding chunk.

Figure 5.3: Representation of hashed sharding in MongoDB (MongoDB Manual 2008).

An important property of hashed sharding is that a range of shard keys may be “close”, but

their hashed values are unlikely to be on the same chunk. This leads to the advantage of more

evenly distributed data, especially in data sets where the shard key changes monotonically.

However, this also means that ranged-based queries on the shard key are less likely to target

a single shard, resulting in more cluster wide broadcast operations. Thus, hashed sharding

provides more even data distribution across the sharded cluster at the cost of reducing query

isolation.

The choice of which field will be used for the shard key is essential. It should have a good

cardinality, i.e. a large number of different values. Also, fields that change monotonically are

very useful, e.g. ObjectId values or timestamps. A good example of this is the default ”_id” field, assuming it only contains ObjectId values.

Ranged sharding

The other type of sharding in MongoDB is called ranged sharding. Ranged sharding involves

dividing data into ranges based on the shard key values. Each chunk is then assigned a range

based on the shard key values. This is shown in figure 5.4.


Figure 5.4: Representation of ranged sharding in MongoDB (MongoDB Manual 2008).

An important property of ranged sharding is that shard keys that are ”close” are more likely to reside on the same chunk. This leads to the advantage of query isolation.

Namely, this allows for efficient queries where reads target documents within a contiguous

range. However, both read and write performance may decrease with poor shard key selection,

because the data may not be distributed evenly.

The efficiency of ranged sharding depends on the shard key chosen. Ranged sharding is most

efficient when the shard key has a large cardinality (i.e. the key values have a large range),

a low frequency (i.e. the same value does not often occur in the key values), and is non-

monotonically changing (because otherwise all new data would be appended to the chunk covering the highest range of values).

5.2.2.2 Sharding implemented in the generic data collector

The generic data collector will also use sharding to optimize the generic query method. Shard-

ing is only possible when multiple servers are available to divide the storage and workload.

Therefore, the programmer needs to specify the name and port of all te servers in the cus-

tom attribute GenericCollection property ”servers”. ”servers” is a string array, in which each

element contains the specifics of one server in the form < servername >:< portnumber >.

As mentioned in section 5.2.2.1, there are two strategies to implement sharding: hashed

sharding and ranged sharding. The generic data collector will always use hashed sharding,

because the hash function leads to better distribution of the data across the shards. If the

generic data collector system were to use ranged sharding on a monotonically increasing or decreasing key, then the data would be badly divided over the shards. Detecting whether a key is monotonically increasing or decreasing is hard. Therefore, it is better to use a hash function

and thus implement hashed sharding.

Which key should the generic data collector choose to implement hashed sharding? In general, the data collector system chooses the ObjectId key ”_id” as hash key. This is a good example


of a hash key, as already mentioned in section 5.2.2.1, unless a unique key is given that is also frequently queried. This can be detected by comparing the arrays ”uniqueFields” and ”mostQueriedFields” and searching for a matching (combination of) key(s), because in that case the unique key surely has low frequency and high cardinality.
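A sketch of how hashed sharding on the ”_id” key could be enabled through the standard admin commands from C# is given below (the database name, collection name and router address are assumptions):

using MongoDB.Bson;
using MongoDB.Driver;

public class ShardingSetup
{
    public void EnableHashedSharding()
    {
        // Connect to the mongos query router (assumed address).
        var client = new MongoClient("mongodb://mongos-host:27017");
        var admin = client.GetDatabase("admin");

        // Allow sharding for the database.
        admin.RunCommand<BsonDocument>(new BsonDocument("enableSharding", "demo"));

        // Shard the collection on a hashed _id key.
        admin.RunCommand<BsonDocument>(new BsonDocument
        {
            { "shardCollection", "demo.cars" },
            { "key", new BsonDocument("_id", "hashed") }
        });
    }
}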

5.2.3 Performance measure

MongoDB supports a command to analyse a specific query, called explain(). It is used as

follows:

BsonDocument res = collection.find(filter).explain("executionStats");

The result of this explain-function can contain information regarding 3 topics. The first topic is queryPlanner, which details the plan selected by the query optimizer and lists the rejected plans. The second topic is executionStats, which details the execution of the winning plan and the rejected plans. The third and last topic is serverInfo, which provides information on the MongoDB instance.

The most useful result of the explain-function for the performance measure is the topic executionStats. Its most important properties are:

• explain.executionStats.nReturned: the number of documents that match the query

condition.

• explain.executionStats.executionTimeMillis: the total time in milliseconds required

for query plan selection and query execution.

• explain.executionStats.totalKeysExamined: the number of index entries scanned.

• explain.executionStats.totalDocsExamined: the number of documents scanned.

• explain.executionStats.executionStages.needYield: the number of times that the

reader yielded its lock because data was not in RAM. If this number is greater than zero,

then the data that is queried comes not only from RAM, but also from the disk memory.

For sharded collections, the explain-function returns the core query planner (i.e. queryPlanner) and server information (i.e. serverInfo) for each accessed shard:

• explain.queryPlanner.winningPlan.shards: Array of documents that contain queryPlanner and serverInfo for each accessed shard.

• explain.executionStats.executionStages.shards: Array of documents that contain

executionStats for each accessed shard.
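As an alternative to the shell-style syntax shown above, the same information can be requested from C# through the generic ”explain” database command, for example as in the following sketch (the collection name and filter are assumptions):

using MongoDB.Bson;
using MongoDB.Driver;

public class ExplainExample
{
    public BsonDocument ExplainFind(IMongoDatabase db)
    {
        // Wrap a find command in an explain command with executionStats verbosity.
        var explainCommand = new BsonDocument
        {
            { "explain", new BsonDocument
                {
                    { "find", "cars" },
                    { "filter", new BsonDocument("brand", "Volvo") }
                }
            },
            { "verbosity", "executionStats" }
        };
        return db.RunCommand<BsonDocument>(explainCommand);
    }
}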


Next to the usage of the explain-function, measuring the execution time of a query will be done in the same way as timing the execution time of an insert, namely by measuring the time to get a result back from the query. I measure the time performance of queries and inserts in the same way so that they can be compared to each other.

5.3 Trade-off between insert and query execution

times

The insert execution time is the fastest when doing a direct dump, i.e. writing the given input

data directly into the MongoDB database without changing the input data nor adding indexes.

The query execution time is the fastest when the input data is written in a structured way in

the database. Adding structure to the data means storing the input data in such a way that

queries can avoid broadcasts over all documents in a collection but instead read a reduced

group of specific documents in a collection. Structuring the data in MongoDB is done by

defining a good logical structure for each document in a collection, by adding indexing, and

by adding sharding.

Making the insert and query execution time both equal to their own minimum time is impossible, because they depend on each other. To get the insert execution time as low as possible, input data may not be changed but must be written directly into the database. On the contrary, to get the query execution time as low as possible, input data must be changed (i.e. structure must be defined, and indexing and sharding must be added). Thus, optimizing the query method will de-optimize the insert method and vice versa.

As stated in the introduction, the goal of this thesis is to find a generic data collector that

optimizes these two methods, i.e. insertion and querying, as well as possible. I will use a

bottom-up approach to test the suitability of the generic data collector, namely by considering

two implementations and evaluating their results. This will be done in the following chapter.


Chapter 6

Application of the generic data collector

In this chapter, the generic data collector will be implemented and tested on a practical use

case.

6.1 Description of the database

The given database contains messages of a company that has several production units where

products are created and where operators keep an eye on the production process. These

messages represent notifications or alarms generated in the company. The database is also

called the message database. Its messages can be divided into two groups: process alarms and

operator actions.

6.1.1 Process alarms

The first group, called process alarms, consists of all the alarms from each production unit in

a company. An alarm is a notification of an exceptional condition of a machine, for example "Level of the water is too low." or "Temperature is too high.".

Process alarms are defined by i.a. their name, zone, port, priority, message and loopID. Further,

process alarms can have four different states: in alarm, disabled, returned or acknowledged.

When the state of an alarm is equal to in alarm, then the alarm is active at that time. When

the state of an alarm is acknowledged, this means that the operator has indicated that he

has noticed this alarm. The states disabled and returned both indicate that the alarm has

stopped. In the case of a returned state, the alarm went off by itself. This could be the case

when an operator of the company has lowered the temperature in a tank, causing the alarm


”Temperature is too high.” to stop. In the case of a disabled alarm state, the alarm is turned

off manually by the operator. For example, the alarm ”Tank is empty.” is manually disabled

by the operator because he knows that the tank is under repair and thus empty.

6.1.2 Operator actions

The second message group, called operator actions, consists of all actions that are taken by an operator in a control room and are logged. For example, when the operator changes the

setpoint in a proportional–integral–derivative controller (i.e. PID controller), then this change

is logged by an operator action message. This message includes i.a. additional information

about the action itself, the station on which this action took place, who took this action and

at what time.

6.2 Implementation of the message database

6.2.1 Input data

The input data that contains all the messages is given in text files. Each text file contains

messages of only one message group (i.e. process or operator message group). This is indicated

by the name of the text file, which starts with "process" or "operator". In total, there are 465 files with a combined size of 1.29 GB. The process alarm files have a total size of 0.97 GB, the operator action files 320,2 MB. Each line from a text file describes the properties of one message. These

properties are pre-defined in a fixed structure for each of the two message groups.

6.2.2 Storing the data

This section describes how the input data will be stored in a MongoDB database.

The message database consists of two message groups that are independent of each other: their input files are separated, and they cover two different parts of the system. Queries can relate to both groups at the same time, but this is rarely done. Therefore, each message group

will be stored in a different collection: process alarms in ”data alarm process”, and operator

actions in ”data alarm oper”.

Operator actions do not need any preprocessing because they only contain properties. Thus, a document in the "data alarm oper" collection will contain a value for every property whose field is filled in in its text line.

Unlike operator actions, process alarms do need to be preprocessed. As described in section 6.1.1, each process alarm line in the input text file contains, next to its properties, also a state.


All the states of one process alarm together form a state cycle: the time when the alarm went

on, the time at which the alarm was acknowledged, and the time when the alarm went out.

Many queries involve this state cycle. Therefore, an efficient storage structure would be to

store per document the properties and the full state cycle of a process alarm.

Thus, the collection "data alarm process" contains on the one hand documents with finished alarms, namely documents that contain an alarm's properties and a full state cycle. On the other hand, the collection "data alarm process" contains currently active alarms, namely documents that contain an alarm's properties and a partially filled state cycle.

Inserting process alarms into the "data alarm process" collection will therefore always imply a read (query), namely searching for a currently active alarm with the same ID. If such a currently active alarm exists, it gets updated. If it does not exist, the incoming alarm is inserted into the collection.
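A minimal sketch of this read-then-insert-or-update step is given below. It assumes that the identity of an active alarm is matched on its compound, block, pointNr and almType fields and that an open alarm is recognised by almCleared being false; the exact matching and state-cycle update logic of the generic data collector is not reproduced here. The AlarmItem class is the one defined later in listing 6.2.

using MongoDB.Driver;

public static class AlarmInsertSketch
{
    // Insert a new document for an ON alarm, or update the currently active
    // alarm with the same identity for acknowledged/returned/disabled messages.
    public static void Process(IMongoCollection<AlarmItem> collection, AlarmItem incoming)
    {
        var builder = Builders<AlarmItem>.Filter;
        var activeAlarm = builder.Eq(a => a.compound, incoming.compound)
                        & builder.Eq(a => a.block, incoming.block)
                        & builder.Eq(a => a.pointNr, incoming.pointNr)
                        & builder.Eq(a => a.almType, incoming.almType)
                        & builder.Eq(a => a.almCleared, false);   // assumed "still active" marker

        var existing = collection.Find(activeAlarm).FirstOrDefault();
        if (existing == null)
        {
            collection.InsertOne(incoming);                       // new ON alarm
        }
        else
        {
            // Merge the new state into the stored cycle (details omitted) and replace.
            existing.almAcked = incoming.almAcked;
            existing.ackTimestamp = incoming.ackTimestamp;
            collection.ReplaceOne(activeAlarm, existing);
        }
    }
}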

A schematic representation of the final message database structure is given in figure 6.1.

Figure 6.1: Schema of the message database.

6.2.3 Adding the custom attribute parameters

After choosing the storage structure of the database, each collection has custom attribute

parameters that need to be filled in.

First, I will discuss how the operator actions collection is completed with custom attribute parameters. The desired name for the collection is "data alarm oper". Queries are mostly executed on the combination of the "compound" and "block" fields, and on the single field

”letterbug”. The ”compound” field refers to a certain process of a product, the ”block” field


refers to a part of the process, and the field letterbug is a specific name for the process. Oper-

ator actions have no unique fields; only the combination of all their properties is unique. Because MongoDB automatically creates an ID for each document, the uniqueFields array is left empty.

Results of queries are often sorted on ”timestamp”, ”compound” and ”msgNew”. The field

”timestamp” refers to the time when the operator performed that action, ”compound” refers

to all points in a certain process, and ”msgNew” refers to the new value of the parameter that

the operator has changed in that action. The C# code of this collection with custom attribute

parameters is shown in listing 6.1.

Listing 6.1: Definition of the collection of operator actions with custom attribute parameters.

[GenericCollection(collectionName = "data alarm oper",
    mostQueriedFields = new string[] { "compound, block", "letterbug" },
    mostSortedFields = new string[] { "timestamp", "compound", "msgNew" })]
public class OperItem : MongoBase
{
    public DateTime timestamp { get; set; }
    public int timeTenth { get; set; }
    public string alarmName { get; set; }
    public string alarmPort { get; set; }
    public string host { get; set; }
    public string letterbug { get; set; }
    public string actionType { get; set; }
    public string compound { get; set; }
    public string block { get; set; }
    public string point { get; set; }
    public string msgText { get; set; }
    public string msgOld { get; set; }
    public string msgNew { get; set; }

    public DateTime sysTimestamp { get; set; }
    public string msgType { get; set; }
    public int sysCounter { get; set; }
}

Second, I will explain how the process alarms collection is completed by custom attribute


parameters. The desired name for the collection is ”data alarm process”. Each process alarm

is defined by a unique identifier, namely by combination of the values of its ”timestamp”,

”compound”, ”block”, ”pointNr” and ”almType” fields. The ”timestamp” field refers to the

time when the alarm is generated, the "compound" field refers to a certain process of a product,

the ”block” field refers to a part of the process, the ”pointNr” to a specific point in that part

of that process, and the ”almType” refers to the type of alarm. The fields which are mostly

queried on are the combination of ”compound” and ”block” and the single fields ”ackSec”,

"returnSec" and "timestamp". The fields "ackSec" and "returnSec" refer respectively to the

time until acknowledging the alarm and the time until returning the alarm in seconds. Queries

are frequently sorted on ”timestamp” and ”returnTimestamp”, where ”timestamp” is the time

when the alarm started and ”returnTimestamp” is the time when the alarm is returned. The

C# code of this collection with custom attribute parameters is shown in listing 6.2.

Listing 6.2: Definition of the collection of process alarms with custom attribute parameters.

[GenericCollection(collectionName = "data alarm process",
    uniqueFields = new string[] { "timestamp, compound, block, pointNr, almType" },
    mostQueriedFields = new string[] { "compound, block", "ackSec", "returnSec", "timestamp" },
    mostSortedFields = new string[] { "timestamp", "returnTimestamp" })]
public class AlarmItem : MongoBase
{
    public DateTime timestamp { get; set; }
    public string compound { get; set; }
    public string block { get; set; }
    public int pointNr { get; set; }
    public string almType { get; set; }

    public string inOut { get; set; }
    public string alarmZone { get; set; }
    public string alarmName { get; set; }
    public string alarmPort { get; set; }
    public string host { get; set; }
    public string letterbug { get; set; }
    public string loopId { get; set; }

    public int prio { get; set; }
    public string ackState { get; set; }
    public int timeTenth { get; set; }
    public string tagName { get; set; }
    public string tagType { get; set; }
    public string blockDesc { get; set; }
    public string realValue { get; set; }
    public string alarmLimit { get; set; }
    public string engUnits { get; set; }
    public string pointName { get; set; }
    public string msgText { get; set; }
    public string stateText { get; set; }

    public bool almCleared { get; set; }
    public DateTime clearTimestamp { get; set; }

    public bool almAcked { get; set; }
    public DateTime ackTimestamp { get; set; }
    public int ackSec { get; set; }
    public bool almReturned { get; set; }
    public DateTime returnTimestamp { get; set; }
    public int returnSec { get; set; }

    public string userInfo { get; set; }
    public string userAction { get; set; }
    public string userLocation { get; set; }

    public int almGroup { get; set; }
    public string devGroup { get; set; }
    public string blockType { get; set; }

    public DateTime sysTimestamp { get; set; }
    public string msgType { get; set; }
    public int sysCounter { get; set; }
}


6.3 Insertion in the message database

In this section, the insertion times into the message database will be given and discussed.

First, the insertion times of the collection ”data alarm oper” will be calculated. Next, the

insertion times of the collection ”data alarm process” will be discussed.

6.3.1 Insertion times of the operator actions collection

Now that the collection of operator actions, called ”data alarm oper”, is fully defined, the

insertion times into the database will be measured and compared against direct data dumps.

The insertion times are measured by the "Timing" class, as explained in section 5.1.2. When measuring the insertion time of 1 operator action, this results in 0 ms. This means that MongoDB saves this action almost instantly, with or without indexes. Therefore, I will first investigate how many messages (operator actions) should be inserted together so that the difference in insertion times between inserting with and without indexes is clearly visible.
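A sketch of such a batched, timed insert is given below. It uses InsertMany and a plain Stopwatch rather than the thesis' Timing class; the batch size is a parameter.

using System;
using System.Collections.Generic;
using System.Diagnostics;
using MongoDB.Driver;

public static class BatchInsertSketch
{
    // Inserts the operator actions in batches of 'batchSize' and prints the
    // time per batch, so that dump and indexed collections can be compared.
    public static void InsertInBatches(IMongoCollection<OperItem> collection,
                                       IEnumerable<OperItem> actions, int batchSize)
    {
        var batch = new List<OperItem>(batchSize);
        var sw = new Stopwatch();
        foreach (var action in actions)
        {
            batch.Add(action);
            if (batch.Count == batchSize)
            {
                sw.Restart();
                collection.InsertMany(batch);
                sw.Stop();
                Console.WriteLine($"Inserted {batchSize} actions in {sw.ElapsedMilliseconds} ms");
                batch.Clear();
            }
        }
        if (batch.Count > 0)
            collection.InsertMany(batch);   // remaining partial batch
    }
}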

First, I created a plot of insertion times with 1000 operator actions at every insertion, with

and without indexing. This is shown in figure 6.2. This plot shows that dump insertion times

are faster than insertion with indexing, which is explainable because MongoDB needs extra

time to apply indexing on the data. The mean of a dump insertion time of 1000 actions is

28,08 ms, and the mean of one insertion with indexing is 56,12 ms. The plot also contains outliers for both insertion methods. This is explained by the fact that MongoDB stores data first in memory, and every 60 s (from MongoDB version 3.6) the data gets synced to disk. Not shown in the plot is an outlier of 5045 ms measured for a dump insert. This outlier is not due to MongoDB, but due to the Visual Studio Managed Debugging Assistant, which warned about a potential deadlock in the code during the program run. The application then went into break mode, and I had to click "continue execution", which took about 5 s. To resolve this

problem, I disabled the ContextSwitchDeadlock option of the Managed Debugging Assistant.

To render the plot, I replaced the outlier value of 5045 ms by 43 ms using linear interpolation.


Figure 6.2: Insertion times with and without indexing of 1000 operator actions at each insertion.

Because the insertion times with indexing of 1000 operator actions at once cannot be separated

from the insertion times without indexing, I did two other measurements with more operator

actions at once. These two measurement plots are shown in figures 6.3 and 6.4. The first figure shows that inserting 10.000 operator actions at once is not enough to separate the insertion times of a dump insert and an index insert. Inserting 100.000 operator actions at once is plotted in the second figure, and this gives a clear difference between dump and index insert. There is no possible confusion between the two: one dump insert of 100.000 actions

takes on average 2,79 s, and one index insert takes on average 5,69 s.

Figure 6.3: Insertion times with and without indexing of 10.000 operator actions at each insertion.

Figure 6.4: Insertion times with and without indexing of 100.000 operator actions at each insertion.

Inserting 1 operator action without indexing takes 0,028 ms, and with indexing it takes 0,056

ms. Thus, adding indexing on operator actions almost doubles the insertion time. Because

MongoDB inserts very fast, the difference is almost not noticeable until inserting 100.000 or

more operator actions at once. The three figures (figs. 6.2 to 6.4) also show that the insertion

without indexing has a smaller deviation from its mean execution time than inserting with

indexing. The deviation of insertion times with indexing of 10.000 operator actions is 462,07

ms, and without indexing the deviation is equal to 166,75 ms.


6.3.2 Insertion times of the process alarms collection

In this section, the insertion times of the generic data collector program into the database will

be measured and compared against insertions into a database without indexes.

As with the insertion time measurements of the operator actions collection, the insertion times into the process alarms collection will be measured by the "Timing" class. The difference between a process alarm document and an operator action document is that a process alarm document contains, next to its properties, also an alarm cycle (see section 6.2.2). One complete alarm cycle is constructed from 3 process alarm documents with the same ID, namely one alarm document with the time that it went on, one with the time that it was acknowledged, and one with the time at which and how it went out. Thus, every incoming alarm depends on all the previous

alarms. Therefore, an insertion into the process alarms collection can only apply to one alarm

at a time.

Insertions of multiple process alarms at a time are not possible. The insertMany method, as

used in the operator actions collection, is not applicable here. Hence, an insertion will execute

slower, as shown in section 5.1.1.

When measuring the insertion time of 1 process alarm in the dumped and indexed collection, this often results in 0 ms. This is because MongoDB can insert a process alarm in less than a millisecond. Here, this problem cannot be solved by measuring the insertion time of multiple documents. Therefore, the total time to process an alarm will be saved. This means that the measured time of an alarm in this collection will be seen as the time to preprocess the alarm plus the time to insert/update the alarm. Figures 6.7 and 6.8 show the process (i.e. measured) times in the indexed collection and in the dumped collection respectively. Figures 6.5 and 6.6 show the process times of the indexed and dumped collection zoomed in. Each figure contains points or process times for each of the four different types of alarm documents:

• alarm ON means that the document contains an alarm that is on.

• alarm ACK means that the document contains an alarm that is acknowledged or seen

by an operator.

• alarm OUT by returning means that the document contains an alarm that went off

by itself.

• alarm OUT by disabling means that the document contains an alarm that went off

by an operator who disabled it manually.

To generate these four figures, 7123 input documents were added, of which 5860 were good ones. With a good input document, I mean an alarm document that is a new ON alarm or is related to an alarm with the same ID stored in the database. The reason why some input documents are not good is that the input data does not start from scratch. From

these 5860 good input documents, there are 2716 new alarms, 582 acknowledged alarms,

2513 returned alarms and 49 disabled alarms. The original number of alarm documents in the

collection before adding these 7123 input documents is 29863.

These four types of alarm documents show, in each collection, the same characteristics in preprocessing plus insertion time. The "alarm ON"-type is always processed in a roughly constant time:

on average 1,77 ms in the indexed collection and 52,89 ms in the dumped collection. The

processing time for the ”alarm ACK”-type has the second highest variance of the 4 types. In

the indexed collection, its processing times stay within a boundary of 600 ms with a mean

of 18,33 ms. But in the dumped collection, the variance is a lot bigger: processing times go

up to 15000 ms with a mean of 278,86 ms. The "alarm OUT by returning"-type has the third largest variance in processing times of the 4 types. In the indexed collection, the mean

of the processing times is equal to 54,29 ms. In the dumped collection, the mean is equal to

133,52 ms. The processing times for the ”alarm OUT by disabling” have the largest variance

of the 4 types. In the indexed collection, the values stay within a boundary of 400 ms, but

in the dumped collection the values may go up to 50000 ms. It has a mean of 202,18 ms in

the indexed collection, and 5167,86 ms in the dumped collection. The mean process times are

also listed in table 6.1.

                          Indexed collection   Dumped collection
alarm ON                  1,94 ms              55,80 ms
alarm ACK                 18,33 ms             278,86 ms
alarm OUT by returning    54,29 ms             133,52 ms
alarm OUT by disabling    202,18 ms            5167,86 ms

Table 6.1: Mean process times of the alarm documents in the indexed and dumped collection.

Table 6.1 shows that the mean processing times in the indexed collection are always lower than the processing times in the dumped collection. These results can be explained logically

by the fact that every input alarm document of type ”alarm ON” implies only one insertion,

and that every input alarm document of another type implies one search with deletion and one

insertion (i.e. an update).


Figure 6.5: Preprocessing plus insertion times with indexing of one process alarm at each insertion, zoomed in.

Figure 6.6: Preprocessing plus insertion times without indexing of one process alarm at each insertion, zoomed in.


Figure 6.7: Preprocessing plus insertion times with indexing of one process alarm at each insertion, original.

Figure 6.8: Preprocessing plus insertion times without indexing of one process alarm at each insertion, original.

6.4 Querying in the message database

In this section, the query execution times in the message database will be investigated.

The query execution times are measured in the same way that insert execution times were

measured, namely by the Timing class. Next to the Timing class, the MongoDB Explain()-

function is also applied to each query. This function gives detailed insight into the executed query, e.g. the number of indexes used and the number of documents scanned (see

section 5.2.3).

Measuring the query execution time with the Timing class can be done in two ways. The

first way measures the time for MongoDB to search all the corresponding documents and return them in a cursor, called an IFindFluent cursor. The second way measures the time for MongoDB to search all the corresponding documents and return them in a list. As a programmer, to be able to do extra operations on, or to iterate over, all the returned documents,

you need this list structure. Converting the cursor to a list is almost always done. Therefore,

I define the query execution time as the time to find all the corresponding elements plus the

time to convert them into a list.
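The difference between the two measurement ways can be sketched as follows (again with a plain Stopwatch instead of the Timing class; the filter is a placeholder):

using System;
using System.Diagnostics;
using MongoDB.Driver;

public static class QueryTimingSketch
{
    public static void Compare<T>(IMongoCollection<T> collection, FilterDefinition<T> filter)
    {
        // Way 1: only build the fluent find object, do not enumerate it.
        var sw = Stopwatch.StartNew();
        var fluent = collection.Find(filter);
        sw.Stop();
        Console.WriteLine($"Way 1 (cursor only): {sw.ElapsedMilliseconds} ms");

        // Way 2: enumerate all matching documents into a list.
        sw.Restart();
        var list = fluent.ToList();
        sw.Stop();
        Console.WriteLine($"Way 2 (to list, {list.Count} docs): {sw.ElapsedMilliseconds} ms");
    }
}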

Another advantage of this choice is that the difference in query execution times between the dump collection and the indexed collection is clearer and larger. An example is given in table 6.2. This table compares the query execution times of a query in the indexed collection and in the dump collection. The query applies to an indexed field and returns 5% of the data.

The query execution time is measured in three ways. The first two ways are explained in the

paragraph above. The third way, called ”executionTimeMillis”, is the result after running the

MongoDB Explain()-function on this query and reading the property ”executionTimeMillis”.


Table 6.2: Query execution times of a query that returns 5% (i.e. 120000 documents) ofthe operator actions data between the indexed collection and the dump collection.

                               Indexed collection   Dumped collection
Way 1 query execution time     0 ms                 3 ms
Way 2 query execution time     2226 ms              4086 ms
"executionTimeMillis"          306 ms               2173 ms

The MongoDB Explain()-function can only be called on a MongoCursor object in C#. In

the latest version of MongoDB, version 3.6, which I use in my program, only an IMongoCursor exists and no MongoCursor. Thus, it was necessary to create a second, separate project in Visual Studio which references an older version of the MongoDB Driver (version 1.10). In this new project, the Explain()-function can be applied to queries. The results from this function are always written to an extra collection in the general MongoDB database. In this

way, the results of the Timing class function in my first project and the Explain()-function in

my second project can be combined and analysed.

Further in this section, first the query times of the collection "data alarm oper" will be calculated. Next, the query times of the collection "data alarm process" will be discussed.

6.4.1 Query times of the operator actions collection

In this section, the query execution times of the operator actions collection will be calculated

and analysed for expected and unexpected queries. The number of operator action documents

in ”data alarm oper” is equal to 2.400.000.

6.4.1.1 Execution times of expected queries

Expected queries are queries that are typical and regularly executed on the collection. The

fields that these queries relate to were also listed by the programmer in the custom attribute properties of the generic data collector program. Upon creation of the collection, the generic data collector has taken these fields into account to optimize queries on them.

The collection "data alarm oper" has no unique fields; "compound, block" and "letterbug" are the fields that are queried most often, and queries are mostly sorted on the fields "timestamp" and

”msgNew”. Thus, the generic data collector program has created indexes on the single fields

”compound”, ”block”, ”letterbug”, ”timestamp” and ”msgNew” to optimize the execution

times of queries on these fields. Tables 6.3 and 6.4 show the results of tests to verify whether

the intention of the generic data collector program for expected queries succeeds.
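For illustration, index creation along these lines might look as follows with the C# driver. This is only a sketch under the assumption that every field name in the attribute arrays gets its own ascending single-field index; it is not necessarily the exact code of the generic data collector program.

using MongoDB.Driver;

public static class IndexCreationSketch
{
    // Creates one ascending single-field index per field name, e.g. for
    // "compound", "block", "letterbug", "timestamp" and "msgNew".
    public static void CreateSingleFieldIndexes(IMongoCollection<OperItem> collection,
                                                params string[] fieldNames)
    {
        foreach (var field in fieldNames)
        {
            var keys = Builders<OperItem>.IndexKeys.Ascending(field);
            collection.Indexes.CreateOne(keys);
        }
    }
}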

Table 6.3 shows the mean query execution times (QET) of different queries on indexed fields

(e.g. ”compound”, ”block” and ”letterbug”). The QETs are listed for the indexed collection,


which is the collection with indexed fields that is created by the generic data collector program,

and the dumped collection, which is the collection created by directly dumping the data. For

each query, the fastest QET is colored in green, and the amount of returned data compared

to the total amount of data is also given. The QETs in the indexed collection differ strongly from the QETs in the dumped collection when the expected query is specific and returns a small amount of the total data. In this case, the QETs in the dumped collection almost always take double the time compared to the indexed collection. On the other hand, QETs in

the indexed collection are almost equal to (but a little bit faster than) those in the dumped

collection when the expected query is not so specific and returns a large amount of the total

data (e.g. 63,12%). Thus, from this table it can be concluded that expected queries on indexed fields in the indexed collection are the fastest.

Table 6.3: QETs on indexed fields in the indexed collection and dumped collection.

Query                                       Amount of data   QET Indexed        QET Dumped
                                            returned [%]     collection [ms]    collection [ms]
letterbug equals                            4,8              2226               4086
compound block letterbug equals             0,005            1335               2301
compound equals                             63,12            25132              26125
compound equals                             1,16             1419               2187
letterbug startswith                        45               19864              20568
compound startswith and block contains      63,12            26622              27233
compound startswith and block startswith    0                981                2337

Table 6.4 lists the mean query execution times (QET) of different queries on indexed fields that

are sorted on expected fields (e.g. ”timestamp” and ”msgNew”). Again, the QETs between

the indexed collection and the dumped collection are compared. For each query, the fastest

QET is colored in green, and the amount of returned data is also given. From this table it can be concluded that when a low amount of data is returned, sorting on an indexed field is always faster than sorting on a non-indexed field. Moreover, even if the query is sorted in the opposite direction of the sorting direction of the index, this does not result in longer QETs. What stands out in this table is that when a high amount of the total data is returned (e.g. 63,12% or 1514880 documents), sorting on a non-indexed field results in a RAM error. This RAM error indicates that the sort operation used more than the maximum 33554432 bytes of RAM, and it proposes to add an index or to specify a smaller limit. But no one can understand and find what he/she needs when querying for a sorted list of 1514880 documents; he/she will always need extra refinement. Thus, from this table it can be concluded that expected queries


on indexed fields that are sorted on indexed fields are executed faster than the same query in

the dump collection. Moreover, when returning a high amount of sorted data, it is required to add an index on the sorted field(s) or to specify a query which returns a smaller amount of sorted data.

Table 6.4: QETs on indexed fields in the indexed collection and dumped collection, where the resulting data is sorted on indexed fields.

Query                    Amount of data   Sort definition    QET Indexed       QET Dumped
                         returned [%]                        collection [ms]   collection [ms]
compound block equals    0,01             /                  901               2351
                                          asc. timestamp     1056              2361
                                          desc. timestamp    1509              2270
                                          asc. msgNew        1271              2239
compound block equals    63,12            /                  25524             25809
                                          asc. timestamp     25294             RAM ERROR
                                          desc. timestamp    26416             RAM ERROR

6.4.1.2 Execution times of unexpected queries

In this section, the execution times of several unexpected queries are measured. Unexpected

queries are ad hoc queries, and they can be divided into 6 types:

• Type 1: queries on a non-indexed field.

• Type 2: queries sorting on a non-indexed field.

• Type 3: query that returns the whole database.

• Type 4: query on a part of the fields that are specified in one element/string in the

mostQueriedFields or uniqueFields array.

• Type 5: queries on combinations of fields that were defined by the programmer as separate (see the sketch after this list).

• Type 6: queries on fields that are defined in the mostSortedFields array.
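As a sketch of what a type 5 query could look like in code, the example below combines two fields that were defined separately and therefore each received their own single-field index (the field values are placeholders):

using MongoDB.Driver;

public static class Type5QuerySketch
{
    // "block" and "letterbug" were listed separately by the programmer, so each
    // got its own single-field index; a combined filter can still use them.
    public static long CountBlockAndLetterbug(IMongoCollection<OperItem> collection)
    {
        var f = Builders<OperItem>.Filter;
        var filter = f.Eq(x => x.block, "B01") & f.Eq(x => x.letterbug, "AW0123"); // placeholder values

        return collection.Find(filter).ToList().Count;
    }
}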

Example queries of these 6 types are listed in table 6.5 together with their QET in the indexed

and dumped collection.


Table 6.5: QETs in the indexed collection and dumped operator collection.

Type   Query                                      Amount of data   Sort definition   QET Indexed       QET Dumped
                                                  returned [%]                       collection [ms]   collection [ms]
1      msgText equals                             22,21            /                 13683             9541
       msgOld equals                              11,5             /                 7365              5337
2      msgText equals 1                           22,21            asc. msgOld       RAM ERROR         RAM ERROR
       msgText equals 2                           11,5             asc. msgOld       3462              1568
       compound equals                            0,01             asc. msgOld       24364             24316
3      all                                        100              /                 37599             38749
4      compound equals                            0,011            /                 1876              2438
       compound startswith                        63,14            /                 24973             25758
5      block letterbug equals                     0,00054          /                 1473              3681
       block startswith and letterbug equals      0,023            /                 1877,5            2016
6      timestamp lte                              0,0090           /                 934               1832
       msgNew equals                              0,0023           /                 1052              3230

Queries of type 1 search on fields that are not indexed. The Explain()-function indicates that such queries lead to a full scan of the database in both collections. The QETs in the

dumped collection are on average 1,4 times faster than the QETs in the indexed collection.

Queries of type 2 are ordered by non-indexed fields. If the amount of returned data is too

large (here 11,5%), then ordering on a non-indexed field leads to the same RAM ERROR as

in the previous section: the sort operation used more than the maximum 33554432 bytes of

RAM. If the amount of returned data is smaller, then executing a query on non-indexed fields and ordering it by a non-indexed field is faster in the dumped collection. In the case of

executing a query on an indexed field and ordering it by a non-indexed field, the resulting time

measurements on the indexed and dumped collection are similar.

A query of type 3, namely a query with an empty filter, returns the whole database. The

Explain()-function shows that this query examines all documents in each collection. The QET

in the indexed collection is on average about 1,3 s faster than in the dumped collection.

Queries of type 4 relate to a part of a combined-fields element in the mostQueriedFields or

uniqueFields arrays. QETs of this kind of query are executed twice as fast in the indexed collection if the amount of returned data is not too high. In the other case, when the amount of returned data is high (63,14%), the QETs in both collections differ on average by only 1 s.

Running the Explain()-function on the first query indicates that the time to examine all the

2400000 documents (thus executing the query in the dumped collection) is on average twice


as slow as examining 27741 keys and 27741 documents (this is executing the query in the

indexed collection) to return 27741 documents in a list.

Queries of type 5 relate to queries that combine fields which were defined separately by the programmer. QETs of this kind of query are always executed faster in the indexed collection.

This is because MongoDB will use the indexes on the fields of the queries in the indexed

collection.

Queries of type 6 relate to queries that search on fields that are listed in the mostSortedFields

array. QETs of these queries are faster in the indexed collection, because the fields are indexed.

To conclude, queries of type 1 and 2 are always executed faster in the dumped collection. This is because no key can be used, and a full scan of the database has to be done in each collection. Queries of type 3 to 6 are executed faster in the indexed collection. This is because

the generic data collector program has set indexes on each singular field, even when the

programmer defined an element as a combination of fields in the custom attribute properties.

When a query relates to multiple indexed fields, MongoDB combines these fields automatically.

6.4.2 Query times of the process alarms collection

In this section, the query execution times of the process alarms collection will be calculated and

analyzed for expected and unexpected queries, in the same way as for the operator actions collection in the previous section. The number of process alarm documents in "data alarm process" is equal to 154.809.

6.4.2.1 Execution times of expected queries

The collection ”data alarm process” has one unique compound field: ”timestamp, compound,

block, pointNr, almType”. Further, the compound field ”compound, block” and the single

fields ”ackSec”, ”returnSec” and ”timestamp” are very often queried, and the single fields

”timestamp” and ”returnTimestamp” are very often sorted on in the collection of process

alarms. Thus, the generic data collector program has created one unique compound index

on "timestamp, compound, block, pointNr, almType", and indexes on the single fields "compound", "block", "ackSec", "returnSec", "timestamp" and "returnTimestamp" to optimize

the execution times of queries that relate to these fields. Tables 6.6 and 6.7 show the results

of tests to verify whether this intention succeeds.

Table 6.6 lists the QETs of expected queries on indexed fields in the indexed and dumped

collection. On average, the QETs in the indexed collection are lower than those in the

dumped collection. This is because MongoDB makes use of the index in the indexed collection.


The difference in QETs between both collections is not large (on average 353 ms) because the number of process alarm documents is not as large as the number of documents in the "data alarm oper" collection: 154.809 vs 2.400.000 documents.

Table 6.6: QETs on indexed fields in the indexed collection and dumped collection.

Query                                             Number of docs   QET Indexed       QET Dumped
                                                  returned         collection [ms]   collection [ms]
timestamp compound block pointNr almType equals   1                1047              1674
compound block equals                             72               1296              1369
ackSec greater than                               140764           6436              7129
returnSec greater than                            11591            1801              2223
returnSec equals                                  3                1528              1771
timestamp equals                                  3                1100              1472
timestamp greater than                            417984           4157              4800

Table 6.7 shows the results of execution times of queries that are sorted by an indexed field.

The first query ”letterbug equals” contains a search on a non-indexed field ”letterbug”. When

the results of this query are sorted by an indexed field, the difference between the QET of the

indexed and dumped collection is almost zero. The reason why the QETs are almost equal is revealed by the Explain()-function. The results of executing the Explain()-function on this query are given in table 6.8. From this it can be deduced that the QETs without sorting are the same, and that a full search is done in both collections. For the queries with sorting, the execution time in the indexed collection is greater than in the dumped collection. This is because a full search is done in both collections, and in the indexed collection all the keys are also examined, whereas in the dumped collection they are not.

The second query "returnSec equals" concerns an indexed field, namely "returnSec". Executing this query with and without sorting in the indexed collection is on average faster than in

the dumped collection. Again, the difference between the QETs when sorted is not so big.

The third and last query ”ackSec greater than” returns a lot of documents, namely 140476.

When the result of this query is also sorted, this gives a RAM ERROR when executed in the

dumped collection but not in the indexed collection. This RAM ERROR is the same error as

in the operator actions collection: the sort operation used more than the maximum 33554432

bytes of RAM.


Table 6.7: QETs in the indexed collection and dumped collection, where the resulting data is sorted on indexed fields.

Query                  Number of docs   Sort definition    QET Indexed       QET Dumped
                       returned                            collection [ms]   collection [ms]
letterbug equals       34387            /                  2795              3219
                                        asc. timestamp     3546              3582
returnSec equals       15559            /                  1190              1651
                                        asc. timestamp     1485              1554
ackSec greater than    140476           asc. timestamp     6569              RAM ERROR
                                        desc. timestamp    7273              RAM ERROR

Table 6.8: Results of the Explain()-function on the query "letterbug equals" from table 6.7.

                      Sort definition    Indexed collection   Dumped collection
executionTimeMillis   /                  362                  362
                      asc. timestamp     783                  535
totalKeysExamined     /                  0                    0
                      asc. timestamp     154836               0
totalDocsExamined     /                  154836               154835
                      asc. timestamp     154836               154835

6.4.2.2 Execution times of unexpected queries

The tests on unexpected or ad hoc queries are again divided into 6 types. These types are

given in section 6.4.1.2. Table 6.9 shows the results.

Queries of type 1 concern fields that are not indexed. The Explain()-function indicates that

such queries lead to a full scan of the database in both collections. The QETs in the

dumped collection are on average 1,03 times faster than the QETs in the indexed collection.

Queries of type 2 are sorted by non-indexed fields. When a query on an indexed field is sorted by a non-indexed or an indexed field, the QETs are on average of equal size: the QETs in the indexed collection are on average 1,03 times faster than those in the dumped collection.

A query of type 3 returns the whole database. The Explain()-function shows that this query

examines all documents in each collection. The QETs in the indexed and dumped collection

are almost the same: the difference is equal to 47 ms on average.

Queries of type 4 relate to a part of a combined-fields element in the mostQueriedFields or

uniqueFields arrays. On average, this type of query in the indexed collection is 55 ms faster


executed than in the dumped collection. This difference is barely noticeable to a person.

The reason why the operator actions collection showed a larger difference between both QETs is probably that the amount of data in the process alarms collection is not as large (154.809 vs. 2.400.000 documents).

Queries of type 5 relate to queries that combine fields which were defined separately by the programmer. QETs of this kind of query are always executed faster in the indexed collection.

This is because MongoDB will use the indexes on the fields of the queries in the indexed

collection.

Queries of type 6 relate to queries that search on fields that are listed in the mostSortedFields

array. QETs of these queries are faster in the indexed collection, because the fields are indexed.

In the table, this query is about 300 ms faster in the indexed collection than in the dumped collection.

To conclude, queries of type 1 and 2 are on average executed faster in the dumped collection. This is because no key can be used, and a full scan of the database has to be done in each collection. Queries of type 3 to 6 are on average executed faster in the indexed collection.

This is because the generic data collector program has set indexes on each singular field, even

when the programmer defined an element as a combination of fields in the custom attribute

properties. When a query relates to multiple indexed fields, MongoDB will combine these

fields automatically. This conclusion is the same as the one for operator actions.


Table 6.9: QETs in the indexed collection and dumped process alarms collection.

Type   Query                       Number of docs   Sort definition         QET Indexed       QET Dumped
                                   returned                                 collection [ms]   collection [ms]
1      letterbug equals            31235            /                       3097              3043
       msgText equals              11392            /                       2324              2308
       msgText letterbug equals    411              /                       1677              1488
2      returnSec greater than      11591            asc. returnTimestamp    2074              2418
                                                    desc. returnTimestamp   2537              2462
                                                    desc. letterbug         2104              2042
3      all                         154809           /                       6858              6811
4      timestamp compound equals   1                /                       1141              1298
       block pointNr equals        281              /                       1325              1616
       pointNr almType equals      14358            /                       2548              2264
5      returnSec ackSec            5801             /                       1696              2169
       greater than
6      timestamp equals            1                /                       1205              1521

6.5 Discussion of the results in the message database

In this section, I will discuss the results of the generic data collector program and the dump

program in the message database by calculating the overall suitability through a cost vs.

preference analysis model [16]. First, the cost vs. preference analysis model will be briefly explained, and next it will be applied to the operator actions and process alarms collections.

6.5.1 Cost vs. preference analysis model

The overall suitability of a system can be calculated by a cost vs. preference analysis model.

The inputs to this model are the overall preference Eglobal and the total cost C of the system.

The overall preference Eglobal is calculated by the LSP preference model. LSP stands for Logic

Scoring of Preference. It is a general quantitative decision method for evaluation, comparison

and selection of complex hardware and software systems [17]. The inputs to the LSP model

are performance variables, in this case the query execution time of expected (regular) and

unexpected (ad hoc) queries. The overall preference E is calculated as follows:

E_{global} = W_1 \cdot E_{expQ} + W_2 \cdot E_{unexpQ} \qquad (6.1)


With W1 and W2 the weights, for which W1 + W2 = 1 holds. Eglobal always lies in the

interval [0, 1], where Eglobal = 0 denotes a complete dissatisfaction and Eglobal = 1 means a

complete satisfaction of the user. EexpQ and EunexpQ denote respectively the overall preference

of expected and unexpected queries. The preference is inversely proportional to the query

execution time, because the longer the query execution time, the less preferable this is for

users. These two values can be written in formula form as:

E = \frac{QET_{min}}{QET} \qquad (6.2)

Where QET stands for query execution time, and QET_{min} is the smallest value that QET can

be.

The total cost C is equal to the insert execution time. The longer the insertion time, the

higher the cost.

In the cost/preference analysis model, C is mapped to the inexpensiveness indicator P by:

P = \frac{C_{min}}{C} \qquad (6.3)

Where C_{min} is the minimal cost, i.e. the smallest insert execution time that is possible, so

that P lies in the interval [0, 1].

And Eglobal is mapped to the usefulness indicator U by:

U = \frac{E_{global}}{(E_{global})_{max}} \qquad (6.4)

With (E_{global})_{max} the largest value of E_{global}.

Combining U and P leads to the overall suitability S:

S = [W_U \cdot U^r + W_P \cdot P^r]^{1/r} \qquad (6.5)

where r can be any number and W_U and W_P are in the interval [0, 1] and their sum equals 1.

This process is also shown in figure 6.9.


Figure 6.9: Computation of the overall suitability using a cost/performance model.
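To make the computation concrete, the sketch below evaluates equations 6.1 to 6.5 for given preferences, costs and weights. It is only a worked illustration of the formulas, not part of the generic data collector program.

using System;

public static class SuitabilitySketch
{
    // Implements equations 6.1-6.5: global preference, usefulness from the
    // preference ratio, inexpensiveness from the cost ratio, and suitability S.
    public static double Suitability(double eExpQ, double eUnexpQ, double w1, double w2,
                                     double eGlobalMax, double cost, double costMin,
                                     double wU, double wP, double r)
    {
        double eGlobal = w1 * eExpQ + w2 * eUnexpQ;                             // eq. 6.1
        double u = eGlobal / eGlobalMax;                                        // eq. 6.4
        double p = costMin / cost;                                              // eq. 6.3
        return Math.Pow(wU * Math.Pow(u, r) + wP * Math.Pow(p, r), 1.0 / r);    // eq. 6.5
    }

    public static void Main()
    {
        // Example with the values from section 6.5.2.1 (generic data collector,
        // W1 = W2 = 0,5): EexpQ = 1, EunexpQ = 0,859, cost 0,056 ms vs 0,028 ms.
        double eGlobal = 0.5 * 1.0 + 0.5 * 0.859;
        double s = Suitability(1.0, 0.859, 0.5, 0.5, eGlobalMax: eGlobal,
                               cost: 0.056, costMin: 0.028, wU: 0.7, wP: 0.3, r: 5);
        Console.WriteLine(s); // approximately 0,934 (cf. eq. 6.14)
    }
}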

6.5.2 Evaluation of the operator actions collection

In this section, the suitability of the generic data collector program and of the dump program

will be compared.

6.5.2.1 Suitability of the generic data collector program

The total cost C is equal to 0,056, because an insertion of 1 document takes on average 0,056

ms. The minimal cost is the minimal insertion time, namely the insertion time of a dump:

0,028. Filling in equation 6.3 gives for the inexpensiveness indicator P:

P = \frac{0,028}{0,056} = 0,5 \qquad (6.6)

Calculating the preference for expected queries EexpQ by equation 6.2 results in:

E_{expQ} = \frac{(QET_{expQ})_{min}}{QET_{expQ}} = 1 \qquad (6.7)

because the QETs of the generic data collector program are always executed faster than the QETs of the dump program. EexpQ is very high (i.e. 100%). This means that the

satisfaction of the user for expected queries is very good.

Remember from section 6.4.1.2 that 6 types of unexpected queries were defined. For queries of type 3 to 6 (group 1), the QETs of the generic data collector program were always faster than the dump QETs because these types of queries happened to relate to indexed fields. For queries of type 1 and 2 (group 2), the QETs of the generic data collector


program were always slower than the dump QETs because these types of queries did not relate

to indexed fields. I assume that queries of group 1 are executed as often as queries of group

2. Therefore, the preference of unexpected queries EunexpQ is calculated by taking the mean

of the preference of unexpected queries of these 2 groups EunexpQ-g1 and EunexpQ-g2.

EunexpQ-g1 can again be calculated by equation 6.2 and the fact that the QETs of the generic

data collector program are always the smallest:

E_{unexpQ-g1} = \frac{(QET_{unexpQ-g1})_{min}}{QET_{unexpQ-g1}} = \frac{QET_{unexpQ-g1}}{QET_{unexpQ-g1}} = 1 \qquad (6.8)

EunexpQ-g2 can also be calculated by equation 6.2 and the fact that the QETs of the dump

program are the smallest:

E_{unexpQ-g2} = \frac{(QET_{unexpQ-g2})_{min}}{QET_{unexpQ-g2}} = \frac{QET_{DUMP-unexpQ-g2}}{QET_{unexpQ-g2}} < 1 \qquad (6.9)

Filling in the QETs from section 6.4.1.2 gives:

E_{unexpQ-g2} = \mathrm{MEAN}\left(\frac{9541}{13683}, \frac{5337}{7365}, \frac{1568}{3462}, \frac{24316}{24364}\right) = 0,718 \qquad (6.10)

Taking the mean value of these two results leads to the preference of unexpected queries

EunexpQ:

E_{unexpQ} = \frac{E_{unexpQ-g1} + E_{unexpQ-g2}}{2} = \frac{1 + 0,718}{2} = 0,859 \qquad (6.11)

EunexpQ is reasonably high, thus the satisfaction of the user is good for unexpected queries.

Calculating the value of Eglobal is done by taking the weighted mean of EexpQ and EunexpQ (see

equation 6.1). This results in:

E_{global} = W_1 \cdot 1 + W_2 \cdot 0,859 \qquad (6.12)

Filling in W1 and W2 in this formula for five different scenarios gives:

W1      W2      Eglobal
1       0       1
0,75    0,25    0,965
0,5     0,5     0,930
0,25    0,75    0,894
0       1       0,859

The usefulness indicator is calculated by equation 6.4 and the fact that Eglobal has the highest value when calculated for the generic data collector program vs. the dump program:

U = \frac{E_{global}}{(E_{global})_{max}} = \frac{E_{global}}{E_{global}} = 1 \qquad (6.13)

Now, the suitability can be calculated by equation 6.5. W_P and W_U are chosen as 0,3 and 0,7 respectively, because the insertion time of 1 document in the generic data collector is still fast enough for the user. Namely, every day 35000 operator actions are created on average. Thus, importing all the messages of one day takes 1,96 s with the generic data collector program and 0,98 s with the dump program: only one second of difference between the two import methods. The variable r is chosen equal to 5. This results in:

S = [0,7 \cdot U^5 + 0,3 \cdot P^5]^{1/5} = 0,934 \qquad (6.14)

6.5.2.2 Suitability of the dump program

A dump insert takes 0,028 ms on average, and this is also the minimal insertion time. Thus the inexpensiveness P is:

P = \frac{0,028}{0,028} = 1 \qquad (6.15)

The preference values of the expected queries can again be calculated by equation 6.2:

E_{expQ} = \frac{(QET_{expQ})_{min}}{QET_{expQ}} = \frac{QET_{GDC-expQ}}{QET_{expQ}} \qquad (6.16)

E_{expQ} = \mathrm{MEAN}\left(\frac{QET^{\,i}_{GDC-expQ}}{QET^{\,i}_{expQ}}\right) = 0,679 \qquad (6.17)

for all i = 1..n, with n the number of tests.

The preference of the unexpected queries is again calculated by taking the mean over the unexpected queries of groups 1 and 2, because I assume that both groups are queried an equal number of times:

E_{unexpQ-g1} = \frac{(QET_{unexpQ-g1})_{min}}{QET_{unexpQ-g1}} = \mathrm{MEAN}\left(\frac{QET^{\,i}_{GDC-unexpQ-g1}}{QET^{\,i}_{unexpQ-g1}}\right) = 0,697 \qquad (6.18)

for all i = 1..n, with n the number of tests.

For the unexpected queries in group 2, the QETs of the dump program are always the fastest; therefore equation 6.2 results in:

E_{unexpQ-g2} = \frac{(QET_{unexpQ-g2})_{min}}{QET_{unexpQ-g2}} = 1 \qquad (6.19)


Thus, the total preference of unexpected queries is:

E_{unexpQ} = 0,78 \qquad (6.20)

Combining EexpQ and EunexpQ results in the global preference:

E_{global} = W_1 \cdot 0,679 + W_2 \cdot 0,78 \qquad (6.21)

Filling in W1 and W2 in this formula for five scenarios results in:

W1      W2      Eglobal
1       0       0,697
0,75    0,25    0,718
0,5     0,5     0,739
0,25    0,75    0,759
0       1       0,78

The usefulness indicator is equal to:

U = \frac{E_{global}}{(E_{global})_{max}} \qquad (6.22)

Where (Eglobal)max is equal to the Eglobal of the Generic Data Collector program. Thus, for the different cases of W1 and W2, U is equal to:

W1      W2      U
1       0       0,697
0,75    0,25    0,744
0,5     0,5     0,795
0,25    0,75    0,849
0       1       0,908

Finally, the suitability can be calculated with W_P, W_U and r equal to 0,3, 0,7 and 5 respectively:

S = [0,7 \cdot U^5 + 0,3 \cdot P^5]^{1/5} \qquad (6.23)

Filling the different values for U into this formula gives:

W1      W2      U       S
1       0       0,697   0,839
0,75    0,25    0,744   0,856
0,5     0,5     0,795   0,878
0,25    0,75    0,849   0,906
0       1       0,908   0,940


6.5.2.3 Conclusion

Insertion in the dumped collection is twice as fast as in the generic data collector collection,

namely 0,028 ms vs. 0,056 ms. But both insertion times are very fast, and inserting one

day of operator action information takes 0,98 s and 1,96 s respectively. Thus insertion by the

generic data collector is still fast enough for a user.

Regularly executed queries (i.e. expected queries) are on average twice as fast in the generic data collector when a small amount of data is returned. When a lot of data is returned (more than 1 million records), the QETs are almost equal. Such queries will not be executed often, however, because a result of a million records or more is of little practical use to anyone.

Ad hoc queries are executed faster in the generic data collector when they happen to relate to one or more indexed fields. If an ad hoc query does not relate to any indexed field, both collections need to do a full scan and the dumped collection is always faster.

Since the inserts of both programs are certainly fast enough (the information of a whole day can be inserted in a few seconds), the QETs are the most important to the users. From the fact that the generic data collector program accelerates the query times of regular queries, it can be deduced that the generic data collector gives a gain in storing and retrieving operator actions.

Comparing the suitability of both programs gives a suitability value of 0,934 for the generic data collector program and a suitability ranging from 0,839 to 0,940 for the dump program. The generic data collector has a higher suitability value than the dump program whenever the weight for expected queries in the global preference is greater than zero. When the weight for expected queries W1 equals zero, the suitability value of the generic data collector is slightly lower than that of the dump program: 0,934 vs. 0,940. This is to be expected: when W1 is zero, only the preference of the collection for ad hoc queries influences the global preference, and the preference for expected queries no longer matters. The indexed and dumped collections then have on average the same QETs and thus the same preference value, but the cost of the indexed collection is higher, so the suitability of the dump program comes out slightly better. In general, W1 will be greater than zero because expected queries are executed the most; the special case where W1 equals zero corresponds to a programmer who has completely misestimated the query usage. So even this case gives a good indication that the generic data collector can deliver a good gain to a collection.


6.5.3 Evaluation of the process alarms collection

In this section, the suitability of the generic data collector program and of the dump program is calculated and the two are compared with each other, following the formulas from section 6.5.1.

6.5.3.1 Suitability of the generic data collector program

The total cost C is equal to the time to insert one alarm document with the generic data collector program. This total cost C of the indexed collection is calculated by taking the weighted mean of the insert times of the four types of alarm documents:

\[
C = W_{ON} \cdot 1,94 + W_{ACK} \cdot 18,33 + W_{RETURN} \cdot 54,29 + W_{DISABLE} \cdot 202,18 \tag{6.24}
\]

WON, WACK, WRETURN and WDISABLE are set to 2716/5860 = 0,46, 582/5860 = 0,1, 2513/5860 = 0,43 and 49/5860 = 0,01 respectively. These values are obtained by calculating how often each type of alarm document occurs (see section 6.3.2). This results in the total cost C of the generic data collector program:

\[
C = 28,09 \tag{6.25}
\]
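Written out with the weights above, this weighted mean is:

\[
C = 0,46 \cdot 1,94 + 0,1 \cdot 18,33 + 0,43 \cdot 54,29 + 0,01 \cdot 202,18 = 0,892 + 1,833 + 23,345 + 2,022 \approx 28,09 \text{ ms}
\]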

The minimal cost Cmin is the minimal insertion time. The insertion time of the generic data collector program is always smaller than the insertion time of the dump program. Therefore, the inexpensiveness indicator P can be rewritten as:

\[
P = \frac{C_{min}}{C} = \frac{C}{C} = 1 \tag{6.26}
\]

Calculating the preference for expected queries EexpQ by equation 6.2 results in:

\[
E_{expQ} = \frac{(QET_{expQ})_{min}}{QET_{expQ}} = 1 \tag{6.27}
\]

because the QETs of the generic data collector program are always lower than the QETs of the dump program. EexpQ is very high (i.e. 100 %), which means that the satisfaction of the user for expected queries in the indexed collection is very good.

The preference for unexpected queries EunexpQ is calculated with equation 6.2:

\[
E_{unexpQ} = \frac{1}{n}\sum_{i=1}^{n} \frac{(QET^{i}_{unexpQ})_{min}}{QET^{i}_{unexpQ}} = 0,974 \tag{6.28}
\]

with n the number of tests that were done on unexpected queries. EunexpQ is reasonably high,

thus the satisfaction of the user is good for unexpected queries on the indexed collection.


Filling in equation 6.1 results in the global preference of the generic data collector program:

\[
E_{global} = W_1 \cdot 1 + W_2 \cdot 0,974 \tag{6.29}
\]

Filling in W1 and W2 in this formula for five different scenarios gives:

W1     W2     Eglobal
1      0      1
0,75   0,25   0,994
0,5    0,5    0,987
0,25   0,75   0,981
0      1      0,974

The usefulness indicator is calculated with equation 6.4 and the fact that Eglobal has the highest value when calculated for the generic data collector program vs. the dump program:

\[
U = \frac{E_{global}}{(E_{global})_{max}} = \frac{E_{global}}{E_{global}} = 1 \tag{6.30}
\]

With equation 6.5, the suitability can be calculated. WP and WE are again chosen as 0,3 and 0,7 respectively, because the insertion time of one document in the generic data collector is still fast enough for the user: on average 15000 process alarms are created every day, so importing all the alarms of one day takes on average 6,8 minutes with the generic data collector program and 39,6 minutes with the dump program. The variable r is chosen equal to 5. This results in:

\[
S = \left[0,7 \cdot U^5 + 0,3 \cdot P^5\right]^{1/5} = 1 \tag{6.31}
\]

Because U and P are equal to 1, the suitability is also equal to 1. Thus, the suitability of

the generic data collector program is 100 % for process alarms. The satisfaction of the users

is very good when working with the process alarms database generated by the generic data

collector program.

6.5.3.2 Suitability of the dump program

The total cost C for the dump program is equal to the time to insert 1 process alarm document

with the dump program. This total cost C of the dumped collection is calculated by taking

the weighted mean of the insert times of the four types of alarm documents:

\[
C = W_{ON} \cdot 1,94 + W_{ACK} \cdot 18,33 + W_{RETURN} \cdot 54,29 + W_{DISABLE} \cdot 202,18 \tag{6.32}
\]

WON, WACK, WRETURN and WDISABLE are set to 0,46, 0,1, 0,43 and 0,01 respectively. These values represent how often each type of alarm document occurs (see section 6.3.2). This results in the total cost C of the dump program:

\[
C = 162,65 \tag{6.33}
\]

Filling in equation 6.3 gives:

\[
P = \frac{C_{min}}{C} = \frac{28}{162,65} = 0,17 \tag{6.34}
\]

with Cmin equal to the minimal insertion time, which is the insertion time of the generic data collector: 28 ms for one alarm document. The inexpensiveness indicator P is very low for the dump program, which means that the dump program is very expensive.

The preference values of the expected queries can again be calculated by equation 6.2:

\[
E_{expQ} = \frac{(QET_{expQ})_{min}}{QET_{expQ}} = \frac{QET_{GDCexpQ}}{QET_{expQ}} \tag{6.35}
\]

with QET_GDCexpQ the query execution time of an expected query in the indexed collection (i.e. the one produced by the generic data collector program).

\[
E_{expQ} = \mathrm{MEAN}\!\left(\frac{QET^{i}_{GDCexpQ}}{QET^{i}_{expQ}}\right) = 0,856 \tag{6.36}
\]

for all i = 1..n, with n the number of tests.

The preference for unexpected queries EunexpQ is calculated with equation 6.2:

\[
E_{unexpQ} = \frac{1}{n}\sum_{i=1}^{n} \frac{(QET^{i}_{unexpQ})_{min}}{QET^{i}_{unexpQ}} = 0,928 \tag{6.37}
\]

with n the number of tests that were done on unexpected queries. EunexpQ is high, thus the

satisfaction of the user is very good for unexpected queries on the dumped collection.

Combining EexpQ and EunexpQ results in the global preference:

\[
E_{global} = W_1 \cdot 0,856 + W_2 \cdot 0,928 \tag{6.38}
\]

Filling in W1 and W2 in this formula for five scenarios results in:


W1     W2     Eglobal
1      0      0,856
0,75   0,25   0,874
0,5    0,5    0,892
0,25   0,75   0,91
0      1      0,928

The usefulness indicator is equal to:

\[
U = \frac{E_{global}}{(E_{global})_{max}} \tag{6.39}
\]

where (Eglobal)max is equal to the Eglobal of the generic data collector program (this holds for every value of W1 and W2).

Thus, U is for the different cases of W1 and W2 equal to:

W1     W2     U
1      0      0,856
0,75   0,25   0,878
0,5    0,5    0,904
0,25   0,75   0,928
0      1      0,953

Finally, the suitability can be calculated with WP, WE and r equal to 0,3, 0,7 and 5 respectively:

\[
S = \left[0,7 \cdot U^5 + 0,3 \cdot P^5\right]^{1/5} \tag{6.40}
\]

Filling in the different values for U gives:

W1     W2     U       S
1      0      0,856   0,797
0,75   0,25   0,878   0,819
0,5    0,5    0,904   0,842
0,25   0,75   0,928   0,864
0      1      0,953   0,887

6.5.3.3 Conclusion

Preprocessing and insertion of a process alarm document takes on average 28,09 ms with the generic data collector and 162,65 ms with the dump program. Thus, the generic data collector program can preprocess and insert a process alarm document 5,78 times faster than the dump program. This is because every input alarm document of type "alarm ON" implies only one insertion, whereas every input alarm document of another type implies one search with deletion and one insertion (i.e. an update). Because searching in the dumped collection is slower than in the indexed collection, the preprocessing and insertion are also slower than in the indexed collection.

Querying in the dumped collection and in the indexed collection shows no such pronounced differences as in the operator actions collections. A reason for this can be that the number of documents in each process alarms collection is 154.809, which is small in comparison with the 2.400.000 documents in each operator actions collection. Nevertheless, the query times of expected queries in the indexed collection are on average 353 ms faster than those in the dumped collection, and the query times of unexpected queries in the indexed collection are on average almost as fast as those in the dumped collection: they differ on average by only 71,17 ms.

The suitability value of a program gives a good indication of how appropriate the program is for a user. The suitability value of the generic data collector program is equal to 100 %, the highest value possible. The suitability value of the dump program lies in the range 0,797 to 0,887. These values are moderately high, but never as high as the suitability value of the generic data collector. Thus, even when a programmer has completely misestimated the expected queries (the case where W1 = 0 and W2 = 1), the suitability of the generic data collector is higher than the suitability of the dump program. So, this case too gives a good indication that the generic data collector can deliver a good gain to a collection.

6.5.4 General discussion

I am aware of the complexity of this topic. This thesis is a first step towards a solution for a generic data collector, and I have limited my research to the aspects discussed above.

I tested the use of my generic data collector program by working bottom-up. The two explored cases give a good indication that the generic data collector program is practically useful.

6.6 The influence of sharding in the message database

In this section, the use of sharding is investigated. First, the setup of the sharded database is discussed. Next, the insert execution time is measured. Finally, several query execution times are measured.

6.6.1 Creating the sharded database

As discussed in section 5.2.2.2, a part of the purpose of the generic data collector program

was to use sharding to get more efficient query and insert execution times.


After searching, I discovered that it is not possible to create a sharded database from a C# program. This type of MongoDB database has to be configured in the Mongo shell itself: the config server, the mongos router and the shards all have to be configured by hand. Also, the shard key and the type of sharding (hashed or ranged) cannot be chosen by the generic data collector program, because they have to be configured in the Mongo shell.

Now, I will test whether a dump in a sharded database that is configured as in my proposal in section 5.2.2.2 leads to advantages in speed compared to a direct dump in a single MongoDB database. My proposal is to use hashed sharding on the "_id" field, or on one or multiple fields that are defined in both the uniqueFields and mostSortedFields arrays.

The sharded database of the test set consists of one config server, one mongos router and two shards. They are all configured at different ports on the localhost. Hashed sharding on the "_id" field is applied. The imported data is all the data of the operator actions.
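For reference, a minimal command-line sketch of such a setup is given below. The port numbers, data paths and database and collection names are assumptions for illustration, not the exact test configuration, and depending on the MongoDB version the config server (and possibly each shard) must be run as a replica set.

    # Start one config server, two shards and a mongos router (ports and paths assumed):
    mongod --configsvr --replSet cfgrs --port 27019 --dbpath /data/cfg
    mongod --shardsvr --port 27018 --dbpath /data/shard1
    mongod --shardsvr --port 27020 --dbpath /data/shard2
    mongos --configdb cfgrs/localhost:27019 --port 27017

    # In the Mongo shell, connected to the mongos (after rs.initiate() on the config set):
    sh.addShard("localhost:27018")
    sh.addShard("localhost:27020")
    sh.enableSharding("messages")                                       // assumed database name
    sh.shardCollection("messages.operatorActions", { _id: "hashed" })   // hashed shard key on "_id"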

6.6.2 Insertion in the sharded operator actions database

As explained in section 6.3.1, insertion times are best measured with a bulk insert of 100.000 operator actions, because the insertion time of one operator action is less than a millisecond and timing a single insert is not accurate enough.
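As an illustration (this is a minimal sketch, not the thesis program itself), such a bulk insert with the C# MongoDB driver could look as follows; the connection string, database and collection names and the document fields are assumptions:

    using System;
    using System.Collections.Generic;
    using MongoDB.Bson;
    using MongoDB.Driver;

    class BulkInsertSketch
    {
        static void Main()
        {
            // Connect to the mongos router (or a single mongod) on the default port.
            var client = new MongoClient("mongodb://localhost:27017");
            var database = client.GetDatabase("messages");                             // assumed name
            var collection = database.GetCollection<BsonDocument>("operatorActions");  // assumed name

            // Build one batch of 100.000 illustrative operator action documents.
            var batch = new List<BsonDocument>();
            for (int i = 0; i < 100000; i++)
            {
                batch.Add(new BsonDocument
                {
                    { "letterbug", "AW" + (i % 50) },
                    { "msgText", "operator action " + i },
                    { "timestamp", DateTime.UtcNow }
                });
            }

            // One bulk insert of 100.000 documents; its duration is what is timed here.
            collection.InsertMany(batch);
        }
    }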

Inserting all operator actions via bulk inserts of 100.000 documents at a time into the sharded database leads to a total of 2.495.706 documents. These documents are approximately equally divided over the two shards: the first and second shard contain 1.362.695 and 1.133.011 documents respectively. The config server indicates that there are in total 28 chunks over the two shards, which means that the total range of "_id" values is divided into 28 intervals.

The insert execution times of bulk inserts of 100.000 documents are given in figure 6.10. The mean bulk insertion time in the sharded database without indexing is 16677 ms, so the mean insertion time of one operator action document is about 0,17 ms in the sharded database without indexing. When the generic data collector program also applies indexes, the mean bulk insertion time is 28442 ms, i.e. about 0,28 ms per operator action. Comparing these results with my earlier measurements on operator actions in the non-sharded database leads to the conclusion that an insert in a sharded database takes a lot longer: on average one dump bulk insert takes 2,79 s and one index bulk insert takes 5,69 s, compared to 16,7 s for a sharded bulk insert and 28,4 s for a sharded index insert. The reason why the bulk inserts into the sharded database take longer is probably that my test sharded database is stored on a single computer: the two shards are both stored on the localhost and thus have to compete for the same system resources.


Figure 6.10: Insertion times with and without sharding and/or indexing of 100.000 operator actions at each insertion.

6.6.3 Querying in the sharded operator actions database

The query execution times on my sharded database are also longer than those on a database without sharding, just like the insertion times. This again can be explained by the fact that both shards have to compete for the same system resources. The QETs of the dumped and indexed sharded databases are shown in table 6.10. Despite the long QETs, it is clear that a query on the "_id" field is executed faster than the other queries. This is because the "_id" field is the shard key and documents are stored according to it: a query on the shard key is routed directly to the corresponding shard server(s). However, if a query relates to another field than the shard key, that query has to go to every server; thus, in this test case, every such query is executed twice (i.e. once on every shard server). The QETs in the indexed collection are faster in all five test cases.

Table 6.10: QETs in the sharded collection.

Query                   QET dump sharded collection [ms]   QET index sharded collection [ms]
letterbug equals        7481                               5208
letterbug startswith    25457                              24027
compound equals         41593                              29069
_id equals              6194                               1243
msgText equals          23108                              17575
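The difference between a targeted query on the shard key and a scatter-gather query on another field can be illustrated with a minimal C# sketch (the field value below is an assumption):

    using MongoDB.Bson;
    using MongoDB.Driver;

    class ShardedQuerySketch
    {
        static void Run(IMongoCollection<BsonDocument> collection, ObjectId someId)
        {
            // Query on the shard key "_id": the mongos routes it directly to the shard
            // that owns the corresponding hash range.
            var byId = collection.Find(Builders<BsonDocument>.Filter.Eq("_id", someId)).ToList();

            // Query on another field: the mongos sends it to every shard (scatter-gather),
            // here to both shards.
            var byLetterbug = collection.Find(Builders<BsonDocument>.Filter.Eq("letterbug", "AW01")).ToList();
        }
    }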

6.6.4 Conclusion

Setting up a sharded database and choosing the shard key cannot be done by the generic data collector program; this has to be configured in the Mongo shell. However, indexing can be applied to the sharded database by the generic data collector program.

To test the influence of a sharded database, a sharded database is set up in the Mongo shell with one config server, one mongos router and two shards at different ports on the localhost. Because they are all running on the localhost, they have to share the system resources. This limitation leads to slower insert and query execution times than in the other, non-sharded databases. However, the tests indicate that queries on the shard key are faster, and adding indexing by the generic data collector also led to faster query execution times, at the cost of insert execution times that are almost twice as slow.


Chapter 7

Conclusion

This study investigates if and how more generic data collectors for feeding a NoSQL document store can be constructed. Such a data collector extracts, transforms and loads data from incoming data streams into multiple document collections. A challenge is to keep the ETL process minimal to avoid velocity problems: there is a trade-off between the processing of the data before inserting it into the database and the query execution time. The generic data collector is a mechanism that helps the programmer determine how to set up the document store so that data is stored in an efficient way.

The generic data collector consists of two parts: the information gathering part and the

execution part. First, information about the input database must be given to the generic data

collector by a programmer. This information can be used by the collector to search for efficient

storage and query methods. Next, in the execution part, when a programmer calls the generic

insert function of the generic data collector, the input data will be inserted in a way such

that insert and query execution times are efficient. The generic data collector supplies these

efficient insertion and query times by using bulk insertions, smart indexes and/or sharding.

The generic data collector is applied to a use case, and its results are evaluated with the cost vs. preference analysis model. From the values of the cost vs. preference model for this use case, it can be concluded that the generic data collector delivers sufficient gain to that database.

A drawback of the program is that unexpected queries that do not involve any indexed fields are still slow. A possible solution for this problem is to store a list of the non-indexed fields that occur in these ad hoc queries. When a certain field from this list occurs often, the generic data collector system can create an index on it. Of course, attention is required because too many indexes may slow down the system; therefore, the insert execution times must be monitored at the same time.
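A minimal C# sketch of what such an extension could look like is given below; the class, method names and threshold value are assumptions and not part of the current program:

    using System.Collections.Generic;
    using MongoDB.Bson;
    using MongoDB.Driver;

    class AdHocIndexAdvisor
    {
        // Counts how often a non-indexed field occurs in ad hoc queries.
        private readonly Dictionary<string, int> _fieldCounts = new Dictionary<string, int>();
        private const int Threshold = 100;  // assumed value, to be tuned against the insert times

        public void RegisterAdHocQueryField(IMongoCollection<BsonDocument> collection, string field)
        {
            int count;
            _fieldCounts.TryGetValue(field, out count);
            _fieldCounts[field] = ++count;

            // Once the field occurs often enough, create a single-field ascending index on it.
            // The insert execution times should be monitored afterwards, as noted above.
            if (count == Threshold)
            {
                var keys = Builders<BsonDocument>.IndexKeys.Ascending(field);
                collection.Indexes.CreateOne(keys);
            }
        }
    }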


Another possible feature that could be implemented in the future is the use of asynchronous insertions. This would reduce the insert execution time, since MongoDB then does not wait for a confirmation. However, asynchronous writes are 'unsafe': no feedback is received on whether the insert was successful or not.
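One way to realise such fire-and-forget inserts with the C# MongoDB driver is an unacknowledged write concern, as in the following minimal sketch (not part of the current program):

    using MongoDB.Bson;
    using MongoDB.Driver;

    static class UnsafeInsertSketch
    {
        static void InsertUnacknowledged(IMongoCollection<BsonDocument> collection, BsonDocument document)
        {
            // WithWriteConcern returns a new handle; the original collection keeps its
            // default (acknowledged, 'safe') write concern.
            var fireAndForget = collection.WithWriteConcern(WriteConcern.Unacknowledged);

            // The driver returns without waiting for the server's confirmation,
            // so there is no feedback on whether the insert succeeded.
            fireAndForget.InsertOne(document);
        }
    }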

The generic data collector system can also serve as a core tuning system. In the future, this system can be extended with the possibility to generate the necessary statistical information on the basis of live inserts and queries, and thus automatically improve the index pattern of a collection.

This thesis is a first step towards a solution for a generic data collector. The use of the generic data collector program is tested by working bottom-up. The three explored cases give a good indication that the generic data collector program is practically useful.

