
Investigation and implementation of a developer-friendly and efficient API for Database Management Systems

Izak Nordin

Master’s Thesis in Computing Science, 30 ECTS Credits
Spring 2019
Supervisor: Daniel Sandstrom
External supervisor: Jan-Erik Mostrom
Examiner: Henrik Bjorklund
Master’s degree in Engineering and Computer Science

Abstract

New databases and query-languages are created every year. Developers using these technologies have to learn all the different ways to access the databases. When using a query-language to retrieve information it is easy for the query to become really long and complex.

Different ways to implement an abstraction layer on top of databases were investigated. Based on what is easiest to learn and use, a Java implementation was chosen. In the resulting implementation the developers create a query-object which is sent into a converter that creates the correct query-syntax. After the implementation was shown to different developers at Cinnober, they felt that an abstraction layer like this one could be really useful. The solution would provide a uniform way of creating queries, and it would be easier to use and understand once you got used to it. The performance of the implementation was high enough for the standards set by Cinnober; it was also deemed easy to understand, which was an important criterion for usage in production. A new way of communicating with databases is presented and could be used in production if developers choose to improve the current implementation.


Acknowledgements

First of all I would like to thank the people at Cinnober for giving me the opportunity to work on this thesis, providing the idea and a place to work. Whenever I had questions I had a lot of help from my supervisors Daniel Sandstrom at Cinnober and Jan-Erik Mostrom from the university. I would also like to thank my girlfriend, family and friends for proof-reading and supporting me whenever I got stuck or had other problems.


Contents

1 Introduction

2 Background
2.1 Abstraction Layer
2.2 Databases
2.3 Query Language
2.4 Relational Databases
2.5 Non-Relational Databases

3 Problem Description
3.1 Purpose
3.2 Goal

4 Methods
4.1 Outline
4.2 Limitations
4.3 Research
4.4 Analysis and discussion
4.5 Testing

5 Theory
5.1 Abstraction layer
5.1.1 Structure
5.2 Backus-Naur Form
5.3 Query Languages
5.4 Debate
5.4.1 Related Work
5.4.2 Conclusion
5.5 Restrictions and Known Issues
5.6 Approaches
5.6.1 Requirements
5.6.2 Old Query Language
5.6.3 New Query Language
5.6.4 Natural Query Language
5.6.5 Java API
5.6.6 Conclusion

6 Implementations
6.1 Disclaimer
6.2 Java API Abstraction
6.3 Tools and Resources
6.4 Structure
6.5 Predicates
6.6 Creating a Query
6.7 Converter
6.8 History Caching

7 Results
7.1 Performance
7.2 Usability

8 Evaluation
8.1 Evaluation of Performance
8.2 Evaluation of Usage

9 Conclusions
9.1 Restrictions
9.2 Limitations
9.3 Usage areas

10 Future Work

11 Appendix
11.1 Resources
11.2 Figures


Abbreviations

• CDB Cinnober Database

• API Application Programming Interface

• SQL Structured Query Language

• RDBS Relational Database System

• NoSQL Refers to Non-Relational Databases

• BNF Backus-Naur Form

• NLIDB Natural Language Interface Database


1 Introduction

Companies use many different software products in their operations. The most important are perhaps the databases used for their business data. Every company needs somewhere to store its data, and this is where Database Management Systems (DBMS) come in. The stored data can be called the lifeblood of the company. Because of this it is important that the database is used correctly in order to save time and prevent data loss. DBMS today can look very different, each having its own way of storing data or sending queries to the database. Developers using these new DBMS have to learn and understand all the different ways of sending a query to the databases. This creates a delay in production while the new database is being learned. An abstraction layer could be implemented in order to help the developers communicate with the different DBMS using a single query-language.

Software products developed by companies need to be fast and easy to understand in order to keep the customers happy. In order to create the best solution, Cinnober has developed its own column-based database which is optimal for analytic use-cases. There is however still a problem with the Cinnober Database (CDB), as mentioned above, which is how it is used. All databases need some way to retrieve the information and this is most often done through a query-language. Query-languages are made to look like a normal language, where most of the key-phrases are similar to how a question or statement would be formulated in natural language. The problem with this is that if a developer wants to be really specific when retrieving information, they end up with a long and complex query which might not be easy to understand. These long queries could introduce unnecessary bugs in the system. Another problem with query-languages is that there are a lot of them and they can look completely different from one another. Because of this it is important that new ways of interacting with databases are invented to improve performance and usability.


2 Background

Technology is always advancing, improving in different areas. One of the most important technologies is the DBMS, which is used to store all kinds of data. With new DBMSs being developed, new query-languages are created to access the data. Given all the different query-languages out there, it is hard for developers to get comfortable with every single one. One way to make it easier for developers is to introduce an abstraction layer. This abstraction layer would use one query-language which is converted into the correct query for the given database. By creating an abstraction layer it becomes easier for developers to create the correct queries without knowing the details of the underlying database.

2.1 Abstraction Layer

Abstraction layers are used to hide details and provide a simpler way of using the functionality of an application. An abstraction layer can be used in many different scenarios. The most common implementation is the API, which translates high-level requests into low-level commands in order to interact with the application. There are however problems with abstraction layers, in that the developer or user who interacts with the new layer might not have as much control over all the small decisions that could be made [4].


2.2 Databases

A database is an organized collection of data which is usually stored in long-term memory. The most commonly used database type is the relational database [6], which breaks up the information into rows, columns and tables. As an example, let's say that we want to store information about different accounts. Accounts would be a table which contains all the different accounts. For each account we have fields describing the account id, account name and create date. These fields are called attributes or columns. When an account is stored in the table it is entered as a row. The table described is displayed in figure 2.1.

Figure 2.1: Example of how a table with accounts could look in a relational database.

2.3 Query Language

Query languages were created so that developers can write commands in order to access, create or modify data in a database. These commands are made to look much like a question or statement in English, e.g. "Select accountId From Accounts" [13]. This command would return all the account ids from the Accounts-table.

2.4 Relational Databases

Relational Databases (RDBS) store information in tables, also called relations or schemes [12]. An example of an RDBS can be seen in section 2.2. Each row has a unique key that is used to find that row or to link it to other tables. All tables follow a predetermined schema of how they should look. This is both a strength and a weakness depending on what the database is supposed to do. The primary language used to communicate with relational databases is SQL. RDBS support transactions, which are one or more SQL statements that are executed in a sequence. The idea is that either all the instructions that make up a transaction are executed successfully or nothing happens (all-or-nothing). All database transactions must be ACID:

• Atomicity - Everything in the transaction has to be executed or nothing will be written to memory.

• Consistency - Transactions will not violate defined rules and restrictions, including constraints, cascades and triggers.

• Isolation - Results are independent of concurrent transactions.

• Durability - Once the transaction is complete all changes have to be permanent.

Some of the biggest relational databases are Oracle, MySQL, PostgreSQL and Amazon Aurora.


2.5 Non-Relational Databases

Non-Relational Databases, also called NoSQL, have grown in popularity, and the main reason for this is that they try to overcome the limitations that relational databases have when dealing with Big Data. Big Data refers to data that is growing, constantly updating and does not have a fixed structure. NoSQL is not bound to schemes as RDBS are; it can be schema-agnostic, which means that it allows unstructured and semi-structured data to be stored and manipulated. Generally NoSQL is more flexible and easier to administrate given that there are fewer restrictions. There are however drawbacks: NoSQL is not as mature as RDBS, meaning that it requires specific expertise to use, and the way data is stored can vary greatly between database types. Common NoSQL databases are MongoDB, Cassandra and Amazon DynamoDB.


3 Problem Description

This chapter covers the purpose and goal of the thesis.

3.1 Purpose

Databases in a system often contain many tables, each containing a large number of rows and columns. If a client wants to create a request that requires tables to be joined, they end up with a long and complex SQL-query. One way to make it easier for developers to send requests to a database is to have an abstraction layer. The abstraction takes the request from the developer and translates it to the correct query-language for the database. The problem with this is that abstraction layers are usually specific to one database and cannot be used for other databases.

To make it easier for developers to write queries in different DBMSs, different alternatives found in the investigation were tested. The goal was to find a way to implement an abstraction layer that takes the user's request and translates it to fit a certain database implementation. The developers should at the same time have the ability to implement their own translation for their database without having to change the way they use the abstraction layer.

One reason why companies like Cinnober use column-based databases is how fast they are. Because of this, one of the main goals of this thesis is to find a way to implement the abstraction layer without losing too much of the performance. The resulting abstraction layer should be efficient enough to use in production while making the database easier for developers to debug and use, so that it is worth the added delay.


3.2 Goal

The goal of the thesis is to investigate different approaches that could be used to build a database abstraction layer on top of a database currently accessible through a query-language. The abstraction layer should make it easier for developers to enter requests to the database without significantly increasing the computational cost. If this is possible then the developers do not have to know the query-language in order to use the database.

The questions we want answered with this project are:

• Is it possible to create an abstraction layer where the translation can be swapped between databases?

• Is it possible to make the abstraction layer easy enough to use for it to be efficient when developing in a new DBMS?

• Can the time cost of a database abstraction layer be reduced to the point where it is within the time requirements, set by the developers, for accessing the database?

• Could an abstraction layer make it easier for developers to debug and use in production?

• Where is an abstraction layer most useful given that it adds some delay?

If this is possible then the developers do not need to know the specifics of the database in order to use it.


4 Methods

4.1 Outline

Chapter 5 introduces the underlying theory and explains why an abstraction layer could help in production. At the end of the theory chapter, alternatives are discussed in order to find the optimal approach for this thesis. Chapter 6 contains a description of the implementation and how it works. Chapter 7 explains how the abstraction performed in the tests. Chapter 8 evaluates how well the implementation of the abstraction layer worked given the requirements put on it. Chapter 9 discusses whether the abstraction layer would be something that could work in production.

4.2 Limitations

The implementation described in this thesis only covers conversion for the CDB and an SQL database, since each database or query-language requires a substantial implementation effort to create a conversion to the required query-language.

4.3 Research

A literature review, focused on performance and ease of use, was done to investigate previous work in this area. Different approaches found during the research are presented with their advantages and disadvantages.

4.4 Analysis and discussion

The analysis part of this thesis mainly checks the performance of the implementation to make sure that the delay-cost of the abstraction layer is low enough for use in production. Using the research, different ways of implementing an abstraction layer are presented, where pros and cons are stated for each solution. After analyzing and discussing the alternatives, the most promising solution was implemented. This way we make sure that the optimal solution is tested for this thesis.

4.5 Testing

A big part of this master thesis is making sure that the implementation of the abstraction layer works well enough for it to be used in production. The tests were created to make sure that the abstraction layer is fast, easy to use and works correctly. Once this is established, a conclusion can be drawn on how useful such an implementation can be, given the different databases that exist.


Testing on Query-generation

The abstraction layer was used to generate different queries. It is therefore important that the generated queries are correct and that the returned data is identical to the data returned by the current way of using queries. The testing begins with easy queries, like retrieving a table where some condition has to be met, and once the basics are tested it is important to test things like joined tables where several conditions have to be met.

Performance Testing

Identical queries were created in both the abstraction layer and the current way of entering queries. Both solutions retrieved 100 result-sets, where each retrieval was timed. Once all the requests were finished the average time was calculated and compared. Other tests were performed to make sure that the correct result had been achieved. In order to check different scenarios, different queries were used to find out what kind of requests are optimal for the abstraction layer and where, if anywhere, it is too slow to use.
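The averaging procedure described above could be sketched roughly as follows. This is a minimal sketch and not the harness used in the thesis: `TimingSketch`, `averageNanos` and the two placeholder workloads are hypothetical names, and the workloads only stand in for "send the query directly" and "send the query through the abstraction layer".

```java
import java.util.function.Supplier;

// Hypothetical timing helper; not the harness used in the thesis.
final class TimingSketch {

    // Runs the retrieval `runs` times and returns the mean wall-clock time in nanoseconds.
    static <T> double averageNanos(Supplier<T> retrieval, int runs) {
        if (runs <= 0) {
            throw new IllegalArgumentException("runs must be positive");
        }
        long total = 0;
        for (int i = 0; i < runs; i++) {
            long start = System.nanoTime();
            retrieval.get();                       // one result-set retrieval
            total += System.nanoTime() - start;
        }
        return (double) total / runs;
    }

    public static void main(String[] args) {
        // Placeholder workloads; real tests would execute queries against the database.
        Supplier<String> direct = () -> "Select * from Accounts;";
        Supplier<String> viaLayer = () -> String.join(" ", "Select", "*", "from", "Accounts;");

        double directAvg = averageNanos(direct, 100);
        double layerAvg = averageNanos(viaLayer, 100);
        System.out.printf("direct: %.0f ns, via layer: %.0f ns%n", directAvg, layerAvg);
    }
}
```

A real benchmark on the JVM would normally also include warm-up runs before measuring, since just-in-time compilation can distort the first iterations.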

Usage Testing

This is a hard thing to test given that it is mostly up to the developer who has to use it. One thing that can be said is that even if the implementation is somewhat harder to use than regular SQL, it might be more beneficial in the long run. Using this solution would mean that the developers only have to learn the query-abstraction language. The main test that can be done in this area is showing it to different developers at Cinnober and having them evaluate the usability.

Debug Testing

Since it is important that the developer gets the correct result when entering a query, the abstraction layer needs to be easy to debug. This part is easy to test since all we need to do is make sure that the query sent into the abstraction layer can be split up into smaller requests which can be validated.


5 Theory

This chapter covers the theory behind abstraction layers and how they can be used.

5.1 Abstraction layer

Database abstraction layers are nothing new. Developers have always had problems with using query syntax to retrieve information, and because of this they have searched for ways to make it easier. A common strategy to make the database more usable is creating a general API where developers can call methods like

• getAccounts()

Which creates the query:

• Select * from Accounts;

This implementation may work but it is not scalable or optimal when developing bigger projects. For each get/insert/update someone has to create an API-point for the query, meaning that it is not abstract enough for all the different DBMS out there. Let us assume that a developer wants to retrieve all Accounts that were created between two dates, but not if the account exists in the blacklist table. This would require a very specific API-point for something that might not be used that often.

There are many points that can be made both for and against a database abstraction layer.

Pros:

• Easier for developers to use the database.

• Swapping to a new database will not require the developers to learn a new query language.

• Developers will not have to know about the underlying database.

Cons:

• Increased delay given that there is more code which has to be executed.

• Hard to abstract away all operations included in the original query-language.

• Developers have to create their own converter between the abstraction layer and the database.


Baron Schwartz wrote in a blog-post about the four different types of abstraction layers for databases [9]. It is worth mentioning the different types since they are often used in different scenarios. Schwartz classifies the different types as:

1. A software library to connect to a database server and issue queries, fetch results etc.

2. A software library to present a common API to different database servers.

3. A software library to automatically generate portable SQL queries.

4. A software library to map Object-Oriented Programming to a relational database (Object-Relational Mapping, or ORM)

This thesis will focus on the third type, which is described in the Goal section, with some modifications.

5.1.1 Structure

A simple description of how the abstraction layer will be implemented is shown in figure 5.1. In the figure we see how a developer can either choose to send a query-string directly to one of the databases or use the abstraction layer to interact with it. In the figure we assume that the two databases DB1 and DB2 use different query-languages, and because of this the abstraction layer will need two different conversion algorithms in order to interact with them.

Figure 5.1: Basic structure of how the abstraction layer will be implemented.


5.2 Backus-Naur Form

All languages need rules in order for them to make sense, and these rules are called a grammar [10]. When it comes to programming languages, the most common way to describe the grammar is the Backus-Naur Form (BNF), a notation for context-free grammars. BNF describes how the constructs of a language can be built up recursively; an example of how it is used can be seen below.

Looking at Figure 5.2 we see how different expressions can expand. In Figure 5.3 we see an example of how the expression 5 * 4 is reduced, step by step, to expr using these rules.

expr → term + expr
expr → term
term → term × factor
term → factor
factor → (expr)
factor → const
const → integer

Figure 5.2: How expressions can expand

5 ∗ 4 → 5 ∗ const
5 ∗ const → const ∗ const
const ∗ const → const ∗ factor
const ∗ factor → factor ∗ factor
factor ∗ factor → term ∗ factor
term ∗ factor → term
term → expr

Figure 5.3: Example of BNF
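To make the grammar of Figure 5.2 more concrete, the sketch below turns each rule into one method of a small recursive-descent evaluator. It is only an illustration and not part of the thesis implementation: the class name `ExprParserSketch` is hypothetical, `*` stands in for `×`, and the left-recursive rule term → term × factor is rewritten as a loop, since a recursive-descent parser cannot call itself on the left without consuming input.

```java
// Hypothetical evaluator for the grammar in Figure 5.2; '*' stands in for '×'.
final class ExprParserSketch {
    private final String input;
    private int pos = 0;

    private ExprParserSketch(String input) {
        this.input = input.replace(" ", "");
    }

    // Parses and evaluates the whole text, rejecting trailing characters.
    static long evaluate(String text) {
        ExprParserSketch parser = new ExprParserSketch(text);
        long value = parser.expr();
        if (parser.pos != parser.input.length()) {
            throw new IllegalArgumentException("Unexpected input at position " + parser.pos);
        }
        return value;
    }

    // expr -> term + expr | term
    private long expr() {
        long left = term();
        if (peek() == '+') {
            pos++;
            return left + expr();
        }
        return left;
    }

    // term -> term * factor | factor  (left recursion rewritten as a loop)
    private long term() {
        long value = factor();
        while (peek() == '*') {
            pos++;
            value *= factor();
        }
        return value;
    }

    // factor -> ( expr ) | const
    private long factor() {
        if (peek() == '(') {
            pos++;
            long value = expr();
            expect(')');
            return value;
        }
        return constant();
    }

    // const -> integer
    private long constant() {
        int start = pos;
        while (pos < input.length() && Character.isDigit(input.charAt(pos))) {
            pos++;
        }
        if (start == pos) {
            throw new IllegalArgumentException("Expected integer at position " + pos);
        }
        return Long.parseLong(input.substring(start, pos));
    }

    private char peek() {
        return pos < input.length() ? input.charAt(pos) : '\0';
    }

    private void expect(char c) {
        if (peek() != c) {
            throw new IllegalArgumentException("Expected '" + c + "' at position " + pos);
        }
        pos++;
    }
}
```

With this sketch, `ExprParserSketch.evaluate("5 * 4")` follows the same rules as the reduction in Figure 5.3, only top-down instead of bottom-up.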

It is also important to show how the context-free grammar [11] behind BNF works. A context-free grammar consists of a finite set of grammar rules and is defined as a quadruple (N, T, P, S) where:

• N is a set of non-terminal symbols.

• T is a set of terminals where N ∩ T = ∅.

• P is a set of production rules, P : N → (N ∪ T)*, i.e., the left-hand side of a production rule is a single non-terminal and does not have any right context or left context.

• S is the start symbol.


To show how this works consider

A = (a + b) ∗ (c − d)

which would be written as

X → |X + X| ∗ |X − X|

This can be expanded into a tree to better show how it looks, see figure 5.4 below.

Figure 5.4: Tree of how the instruction is broken down using context-free grammar.

Query languages follow these rules and because of this it should be possible to break them up in order to fit them into an abstraction layer that could translate to different languages.

5.3 Query Languages

Query languages can have all kinds of syntax and semantics to retrieve information. The most common query-language is SQL, mainly used in RDBS. Query languages for non-relational databases often vary more in how you make calls to the database. Some of them, like MongoDB [7], use objects to decide what database and collection/table the call should go to. In MongoDB you define the parameters by creating a method-call with JSON-like fields. The code example below shows how MongoDB sends a request to the database "db", looking in the collection "people" in order to find people with first name "Joe", last name "Schmoe" and age greater than 18. Note that this is not how you write the request in Java, but it is how queries are built.

db.people.find({"firstname" : "Joe", "lastname" : "Schmoe", "age" : {"$gt" : 18}})

Building a query using MongoDB is not the same as writing an SQL-string, but it is important to note that it still follows the BNF-structure mentioned in section 5.2. The query is split up into database, collection, select-fields and requirements. The same query could be written in SQL as:

Select firstname, lastname, age
From people
Where firstname="Joe" and lastname="Schmoe" and age>18;
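The point that both queries share the same underlying structure can be illustrated with a small sketch that stores the restrictions once and renders them in both syntaxes. Everything here is an assumption made for illustration: `TwoSyntaxSketch`, `Restriction`, `toSql` and `toMongoShell` are hypothetical names, only the operators `=` and `>` are handled, and values are inserted without escaping.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: one structured query rendered in two query syntaxes.
final class TwoSyntaxSketch {

    // One restriction such as age > 18; `op` is either "=" or ">".
    static final class Restriction {
        final String field;
        final String op;
        final Object value;

        Restriction(String field, String op, Object value) {
            this.field = field;
            this.op = op;
            this.value = value;
        }
    }

    // Quotes strings and leaves numbers as they are (no escaping in this sketch).
    private static String literal(Object value) {
        return (value instanceof String) ? "\"" + value + "\"" : String.valueOf(value);
    }

    static String toSql(String table, List<String> fields, List<Restriction> restrictions) {
        List<String> parts = new ArrayList<>();
        for (Restriction r : restrictions) {
            parts.add(r.field + r.op + literal(r.value));
        }
        return "Select " + String.join(", ", fields)
                + " From " + table
                + " Where " + String.join(" and ", parts) + ";";
    }

    static String toMongoShell(String db, String collection, List<Restriction> restrictions) {
        List<String> parts = new ArrayList<>();
        for (Restriction r : restrictions) {
            // "greater than" becomes a nested {"$gt" : value} document, equality a plain value.
            String rhs = r.op.equals(">")
                    ? "{\"$gt\" : " + literal(r.value) + "}"
                    : literal(r.value);
            parts.add("\"" + r.field + "\" : " + rhs);
        }
        return db + "." + collection + ".find({" + String.join(", ", parts) + "})";
    }
}
```

Given the three restrictions from the example, the two methods produce the SQL string and the MongoDB call shown in this section from the same list of `Restriction` objects.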


5.4 Debate

Database abstraction layers have been a hot topic among developers for a long time. On one side developers argue that an abstraction is required in order to keep databases from dominating how the application should be structured and from becoming too hard to grasp. On the other side they argue that abstraction layers cost too much in time and that they are too simple to be used in bigger productions. The goal of this thesis was to see if there is a way to create an abstraction layer without making it too specific and slow. The abstraction layer should provide the option to implement new converters for other databases. When investigating different approaches it is important to keep in mind what developers think about the usability and performance. Because of this, some of the discussion within this area is summarized in section 5.4.1.

5.4.1 Related Work

In 2006 Jack Herrington wrote an article on five common PHP database problems [5]. This article mainly discussed bad practices when communicating with a MySQL database using PHP and how a simple abstraction could be used to improve performance. The article spread, and other developers felt that they had to respond and explain the pros and cons of using a database through different approaches.

Kristian Kohntopp responded in an article [2] where he wrote that abstraction layers can make things easier but at a price. He explained that they tested how many operations per minute (OPM) could be executed with and without the abstraction Herrington mentioned. In their tests they got 219 OPM with Herrington's abstraction layer compared to 444 OPM without, a substantial performance loss.

In response to Kohntopp's article, Peter Zaitsev wrote his own article [3] where he discussed some of the points mentioned earlier. He wrote that he uses wrappers for some queries sent to the database and that he does not think that the abstraction layer is the main performance problem; rather, it is the PHP connections to the database that could be improved.

5.4.2 Conclusion

This thesis will in broad terms not implement the kind of abstraction layer mentioned in the articles above. The reason for bringing up the articles is so that the reader understands that abstraction for databases has been a big topic for a long time. The articles mainly discuss how the calls to the databases could be abstracted with different connections and how this affects the performance and usability. They are mentioned because the problems brought up will still be a factor that affects the abstraction layer, since the layer will use the existing connections to retrieve and update information. It has been known from the beginning that the latency will increase; it is just not known by how much. One of the articles reported roughly twice as many OPM without abstraction as with it, but this was almost 13 years ago and technology has changed drastically since then. The resulting implementation of this thesis is not expected to be as slow as the abstraction Kohntopp tested.


5.5 Restrictions and Known Issues

The articles mentioned in section 5.4 [5][2][3] all bring up some of the most common problems with a database abstraction layer. The best-known issues are the increase in delay, portability between databases and lack of functionality [8][14][5]. It is important that readers of this thesis understand that the point is not to create a perfect database abstraction layer. The goal is to investigate different approaches to abstracting away the query-language, making it easier to write queries while knowing that the latency will increase. When it comes to portability it is also important to know that the resulting implementation will not be something that can be installed and used for any database. In order for the abstraction layer to work, the developer will have to create their own conversion before connecting it to the database. As mentioned in section 5.3, most of the syntax and semantics are similar between different query languages, which means that most of the syntax and functionality will exist at a basic level. It is impossible to predict all kinds of syntax and semantics that could be used. Developers who want to implement the abstraction on a database that uses a more specific syntax or semantics will have to add these features to the layer before it can work.

5.6 Approaches

There are a lot of different ways a database abstraction layer could be implemented, but before moving into the different approaches it is important to set up the basic requirements.

5.6.1 Requirements

Apart from the structure displayed in figure 5.1 there are some other criteria that have to be met. The implemented abstraction layer has to:

• Be easy to use.

• Provide an easy way to implement own query-syntax in the abstraction layer.

• Allow developers to implement their own converter between the abstraction layer and the database.

• Be relatively fast, given that it will increase latency.

5.6.2 Old Query Language

A more standard query language like SQL could be used for writing the queries, which would then be sent to the converter. This, however, does not make it easier to write long queries; it only makes it possible to convert from SQL to another query language which would be sent to the database.

5.6.3 New Query Language

A new query language is something that could work. The problem with this approach is that even if the syntax and semantics are made easier to understand, it would still require long strings filled with select-fields and restrictions that would normally be in the where/select-clause in SQL.


5.6.4 Natural Query Language

Using a natural query-language to interact with a database would make it easier to use since it would not require a lot of training. The developer could simply write something like "Show me all data on accounts where create date is later than 2014". G.D. Ritchie and P. Thanisch published the paper Natural Language Interfaces to Databases - An Introduction [1], where they went through, among other things, advantages and disadvantages of an NLIDB. Some advantages were that it requires little learning since it does not use artificial languages, and that an NLIDB is better for some questions. An example they gave was "Which department has no programmers?", which is easier to express in natural language than in graphical or form-based interfaces. Disadvantages mentioned in the paper were that users found it difficult to understand and remember what kind of questions the NLIDBs could handle. Another problem with NLIDBs is that they require tedious configuration before they can be used for a database. As mentioned above, this approach would make the database easier to use, but at a cost. Using a natural query language would be much slower given that a lot more translation and learning is required.

5.6.5 Java API

If a query language like SQL were broken up into different segments, you could say that it contains the fields Select, From, Where and Join. A class could be created, in Java for example, where the developers enter a query in a more structured way; the query would then be sent to a converter which creates the requested query. An example of how this would work can be seen in figure 5.5.

Figure 5.5: Example of how an application interface could abstract away the query language.

Writing this query in SQL might be easier, but if more attributes are required, an abstraction will make things easier. This approach would make queries easier for developers to structure and use while at the same time providing a simple way to implement their own converter. A downside of this approach is that the developer will have to change the code if the underlying database requires specific syntax that does not exist in other databases. An example of this would be a database with a certain command Foo which is only used in one query language and not another.
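As a rough illustration of the idea, a query object with one field per SQL segment and a separate converter might look like the sketch below. This is not the code of the thesis implementation: `QuerySketch`, `QueryConverterSketch` and `SqlConverterSketch` are hypothetical names, and the converter inserts values verbatim, so it is not safe for untrusted input.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical query object with one field per SQL segment.
final class QuerySketch {
    final List<String> select = new ArrayList<>();
    String from;
    final List<String[]> joins = new ArrayList<>();   // {table, leftColumn, rightColumn}
    final List<String[]> where = new ArrayList<>();   // {column, operator, value}

    QuerySketch select(String... columns) {
        select.addAll(Arrays.asList(columns));
        return this;
    }

    QuerySketch from(String table) {
        this.from = table;
        return this;
    }

    QuerySketch join(String table, String leftColumn, String rightColumn) {
        joins.add(new String[] {table, leftColumn, rightColumn});
        return this;
    }

    QuerySketch where(String column, String operator, String value) {
        where.add(new String[] {column, operator, value});
        return this;
    }
}

// Hypothetical converter contract: one implementation per query language.
interface QueryConverterSketch {
    String convert(QuerySketch query);
}

// Renders the query object as an SQL string; an empty select list means "*".
final class SqlConverterSketch implements QueryConverterSketch {
    @Override
    public String convert(QuerySketch q) {
        StringBuilder sql = new StringBuilder("Select ")
                .append(q.select.isEmpty() ? "*" : String.join(", ", q.select))
                .append(" From ").append(q.from);
        for (String[] j : q.joins) {
            sql.append(" Join ").append(j[0])
               .append(" On ").append(j[1]).append(" = ").append(j[2]);
        }
        for (int i = 0; i < q.where.size(); i++) {
            String[] w = q.where.get(i);
            sql.append(i == 0 ? " Where " : " And ")
               .append(w[0]).append(' ').append(w[1]).append(' ').append(w[2]);
        }
        return sql.append(';').toString();
    }
}
```

A caller would chain calls such as `new QuerySketch().select("accountId").from("Accounts")` and hand the result to a converter. Supporting another query language then only means writing another `QueryConverterSketch` implementation; the code that builds the query stays the same.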

5.6.6 Conclusion

The approach which looked most promising was the Java API, since it would make sending requests more natural for programmers given that they know Java. This thesis will explain how such an implementation could be created in order to satisfy all the requirements.


6 Implementations

For this thesis the Java API approach was chosen for implementation. The reason for this is that Java is easier to learn and read compared to a new query-language, which might be either too similar to existing query-languages or too different, making it harder to understand.

6.1 Disclaimer

As stated previously, the focus of this report is to investigate whether an abstraction layer could be implemented for different databases. The query-languages currently supported are SQL and Cinnober's own SQL-like query language. A lot of improvements have to be made to the implementation for it to be useful in production. If future developers want to add support for another database or query-language, they have to create a new converter by following the Converter-interface. The goal of this thesis was not to create a fully working abstraction layer, only to test whether it is possible.

6.2 Java API Abstraction

An overview of how the implementation is structured can be seen in Figure 6.1. The developer using the abstraction creates a query-object and sets the conditions in the desired fields. Once the query is done it can be sent into the converter, which takes the query-object and converts it to the correct query-language for the database. The converted query is sent to the database and a result-set is returned.

Figure 6.1: Overview of how the abstraction-implementation is structured.

6.3 Tools and Resources

The Java API Abstraction is written in Java 1.8.0.172 in a Gradle 2.3 project. Since Cinnober is interested in this thesis and its results, the tests were done using their systems, which include the Cinnober Database (CDB). The specific system used was Cinnober's clearinghouse application, which receives data such as positions, risks and portfolios in order to calculate, among other things, risk. Everything done in the system is stored in the CDB.

6.4 Structure

The two main parts in the implementation are the query and the converter. The query is an object that contains the fields displayed in Figure 6.2.

Figure 6.2: UML of the fields required in the abstract query class.

This structure contains the same fields as in SQL; the difference is how you create and set the conditions for the different fields. Looking at Figure 6.2 we see that all the fields use the interface Condition. The Condition interface is implemented by the field-classes, e.g. Where (Where.java), because the converter needs certain fields in order to create the query. It is up to the developer who implements this solution to decide how they are used. Figure 6.3 shows how the Condition-interface is structured.

Figure 6.3: Condition interface that is used by the field-classes.
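
To make the relation between Condition and a field-class concrete, a minimal sketch follows. The names Condition and Where come from the text, but the three accessor methods are assumptions made for illustration; the actual interface in Figure 6.3 may expose different members.

```java
// Assumed shape of the Condition contract; the real interface in Figure 6.3 may differ.
interface Condition {
    String getField();      // column or expression the condition applies to
    String getPredicate();  // name of the operation, e.g. "GREATER_THAN"
    Object getValue();      // operand; may be null for predicates without one
}

// Field-class for the where-clause, analogous to Where.java in the text.
final class Where implements Condition {
    private final String field;
    private final String predicate;
    private final Object value;

    Where(String field, String predicate, Object value) {
        this.field = field;
        this.predicate = predicate;
        this.value = value;
    }

    @Override public String getField() { return field; }
    @Override public String getPredicate() { return predicate; }
    @Override public Object getValue() { return value; }
}

class ConditionSketch {
    // A converter only sees the Condition contract, never the concrete class.
    static String describe(Condition c) {
        return c.getField() + " " + c.getPredicate() + " " + c.getValue();
    }

    public static void main(String[] args) {
        System.out.println(describe(new Where("createDate", "GREATER_THAN", "2014-01-01")));
    }
}
```

Because the converter depends only on the interface, a new field-class can be added without touching existing converters, as long as it implements Condition.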

6.5 Predicates

Each of the different fields uses predicates to indicate what the operation is supposed to do. Figure 6.4 shows some predicates that can be used when setting conditions in the Where-field; more predicates exist in the implementation.

Figure 6.4: Some predicates that exist for the Where-field.

Most of the basic predicates exist in the original implementation. If a database requires special predicates that are not included in the implementation, the developer will have to go into the predicates-class and add them. It is impossible to predict all the different predicates, and because of this the implementation is created in such a way that nothing will break if conditions or predicates are added, since it is up to the converter to make sense of the query.
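
One way to picture this division of responsibility is sketched below. It is a hypothetical illustration: the enum values, the class name SqlPredicateTable and the choice to throw UnsupportedOperationException are assumptions, and FOO stands in for a database-specific predicate added later.

```java
import java.util.EnumMap;
import java.util.Map;

// Illustrative predicate list; FOO stands for a database-specific addition.
enum Predicate { EQUALS, NOT_EQUALS, GREATER_THAN, LESS_THAN, IS_NOT_NULL, FOO }

class SqlPredicateTable {
    private static final Map<Predicate, String> SQL = new EnumMap<>(Predicate.class);
    static {
        SQL.put(Predicate.EQUALS, "=");
        SQL.put(Predicate.NOT_EQUALS, "<>");
        SQL.put(Predicate.GREATER_THAN, ">");
        SQL.put(Predicate.LESS_THAN, "<");
        SQL.put(Predicate.IS_NOT_NULL, "is not null");
        // FOO is deliberately unmapped: this converter does not understand it.
    }

    // Lets callers check support before attempting a conversion.
    static boolean supports(Predicate p) {
        return SQL.containsKey(p);
    }

    // Returns the SQL text for a predicate, or fails loudly if it is unknown here.
    static String toSql(Predicate p) {
        String text = SQL.get(p);
        if (text == null) {
            throw new UnsupportedOperationException("No SQL form for " + p);
        }
        return text;
    }
}
```

Adding FOO to the enum does not affect the existing mappings; only a converter that is asked to translate FOO has to decide what to do with it.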

6.6 Creating a Query

When a developer wants to create a query they instantiate a query-object and then call methods such as select, from and where to set the fields. Most of the field-methods accept objects that are converted to the class for that field, e.g. the from-method converts the objects to a From-object (From.java) which implements the Condition-interface. A short example of this can be seen in Figure 6.5.

Figure 6.5: Short example of how a query could be written.

The same query could be written in SQL like below:

    Select first(accountId), accountName as name, createDate
    From accounts
    Where createDate > 2014-01-01;
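
As a rough illustration of the select/from/where call style, the sketch below builds the same query as a query-object. It is not the thesis implementation: the class names Query and QueryExample, the public fields and the three-argument where-method are assumptions, and From is reduced to a wrapper around a table name instead of a full Condition implementation.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified stand-in for From.java: wraps whatever object names the source.
class From {
    final String table;

    From(Object source) {
        this.table = String.valueOf(source);
    }
}

// Hypothetical query-object whose field-methods return this so calls can be chained.
class Query {
    final List<String> selected = new ArrayList<>();
    final List<From> sources = new ArrayList<>();
    final List<String> conditions = new ArrayList<>();

    Query select(Object... columns) {
        for (Object column : columns) {
            selected.add(String.valueOf(column));
        }
        return this;
    }

    // Each argument is converted to the class for this field, here From.
    Query from(Object... tables) {
        for (Object table : tables) {
            sources.add(new From(table));
        }
        return this;
    }

    Query where(String field, String operator, Object value) {
        conditions.add(field + " " + operator + " " + value);
        return this;
    }
}

class QueryExample {
    static Query accountsCreatedAfter(String date) {
        return new Query()
                .select("first(accountId)", "accountName as name", "createDate")
                .from("accounts")
                .where("createDate", ">", date);
    }
}
```

Each clause lives in its own field of the object, so a converter can process select, from and where separately instead of parsing one long string.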

The advantage of this approach is that it is easier to keep track of longer queries and nested joins. Below is an example of how a longer query is written in SQL, where a few fields are selected, several conditions are set in the where-clause and an inner join exists.

    Select last(price.TRINSTRID) as TRINSTRID, last(price.PRICE) as PREMIUM,
        last(price.CURRENCY) as CURRENCY, last(greeks.DELTA) as DELTA,
        last(greeks.VOLATILITY) as VOLATILITY
    From RTC_MARKETPRICEEVENTS as price
    inner join RTC_GREEKS as greeks on price.DATE = greeks.DATE and
        price.EXT_GREEKS_PK = greeks.GREEKS_PK
    Where price.DATE between 2012-04-30 and 2012-04-30 and
        greeks.DATE between 2012-04-30 and 2012-04-30 and
        price.EXT_MARKET_PRICE_TAG_TYPE = 1 and
        price.EXT_BUSINESS_DATE between 2012-04-30 and 2012-04-30 and
        price.EXT_GREEKS_PK is not null and
        price.EXT_INSTRUMENT_TYPE = OptionOnForward and
        greeks.GREEKS_PK is not null
    group by TRINSTRID;

The same query can be written in the implementation as shown in Figure 6.6.

Figure 6.6: Longer and more complex example of how a query could be written.

Writing the query like this requires more code, but it makes the structure clearer. Using this implementation makes it easier for the developer to separate the different fields and joins without having to remember, for example, all the parentheses.

6.7 Converter

All converters need to follow the Converter-interface, which makes it easier for developers to implement their own query-converter. Figure 6.7 shows the structure of the Converter-interface.

Figure 6.7: Converter-interface used by all converters.

In the currently existing converters (SQL and CDB) the QueryRequest-object is sent into the converter, and from there the query-fields are sent into the correct method for conversion. Since both SQL and CDB use string-queries, it is fairly simple to implement a solution that satisfies the normal query-language user, given that most predicates, conditions and structures are currently supported.
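
The dispatch of each query-field to its own conversion method could look roughly like the sketch below. The names Converter and QueryRequest come from the text; the per-field method names, the string-based fields and the class SqlConverter are assumptions for illustration, and the real interface in Figure 6.7 may be organised differently.

```java
import java.util.List;

// Hypothetical query-object with three string-based fields.
class QueryRequest {
    final List<String> select;
    final String from;
    final String where; // may be null when no condition is set

    QueryRequest(List<String> select, String from, String where) {
        this.select = select;
        this.from = from;
        this.where = where;
    }
}

// Assumed contract: one entry point plus one method per query-field.
interface Converter {
    String convert(QueryRequest request);
    String convertSelect(List<String> columns);
    String convertFrom(String source);
    String convertWhere(String condition);
}

// String-based converter in the style of SQL; it does no validation or escaping.
class SqlConverter implements Converter {
    @Override
    public String convert(QueryRequest request) {
        StringBuilder out = new StringBuilder();
        out.append(convertSelect(request.select))
           .append(' ')
           .append(convertFrom(request.from));
        if (request.where != null) {
            out.append(' ').append(convertWhere(request.where));
        }
        return out.append(';').toString();
    }

    @Override
    public String convertSelect(List<String> columns) {
        return "Select " + String.join(", ", columns);
    }

    @Override
    public String convertFrom(String source) {
        return "From " + source;
    }

    @Override
    public String convertWhere(String condition) {
        return "Where " + condition;
    }
}
```

A converter for a different string-based language would implement the same four methods and only change the text each one produces.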

When it comes to databases such as MongoDB it gets a little trickier. MongoDB has different ways to communicate with the database; the most common way is shown in section 5.3. Databases like MongoDB append objects to objects, e.g.

    database.accounts.find({"createDate": {"$gt": "2014-01-01"}}, {"accountId": 1, "createDate": 1})

In order to create a converter for MongoDB using the QueryRequest, the implementation needs to dynamically find the collection and from there create the correct string and insert it into one of the alternatives provided by MongoDB, e.g. "find", "group" or "aggregate".
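
A very reduced sketch of such a string-building step is given below. It assumes that MongoDB's find takes the filter document first and the projection second, handles only a single condition, and treats the value as a quoted string; the class and method names are hypothetical and no such converter exists in the thesis implementation.

```java
import java.util.List;

class MongoFindSketch {
    // Builds the text of a find-call from a collection, a field list and one condition.
    static String toFind(String collection, List<String> fields,
                         String field, String operator, String value) {
        // Projection document: every selected field is switched on with 1.
        StringBuilder projection = new StringBuilder("{");
        for (int i = 0; i < fields.size(); i++) {
            if (i > 0) {
                projection.append(", ");
            }
            projection.append('"').append(fields.get(i)).append("\": 1");
        }
        projection.append('}');

        // Filter document: the operator object is nested inside the field object.
        String filter = "{\"" + field + "\": {\"" + operator + "\": \"" + value + "\"}}";

        return "database." + collection + ".find(" + filter + ", " + projection + ")";
    }
}
```

The nesting of one object inside another is what separates this from the SQL and CDB converters, where each field can simply be appended to a flat string.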

6.8 History Caching

Query-history is used in order to keep the implementation as fast as possible. When a query is received in the converter, the converter first checks the history to see if the query has been converted before. If it is stored, the converted query is retrieved; if not, the query is converted and then stored in the history. In most cases one query does not look like another; however, some fields, e.g. Select and Where, might contain the same conditions. Because of this the history-storage also checks whether the field-conditions have been converted before, and if they have, the stored conversion is used.

Once a query is created, each field is given an identifier which is the hash-value of all its content. When using the history-storage this identifier is used as the key.
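
The look-up described above can be sketched as a map keyed on the hash-value of a field's content. This is an illustrative sketch only: the class name ConversionHistory, the use of String.hashCode as the identifier and the conversion counter are assumptions.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

class ConversionHistory {
    // Key: hash-value of the field content. Value: the converted text.
    private final Map<Integer, String> history = new HashMap<>();
    private int conversions = 0;

    // Returns the stored conversion if one exists, otherwise converts and stores it.
    String convert(String fieldContent, Function<String, String> converter) {
        int identifier = fieldContent.hashCode();
        String stored = history.get(identifier);
        if (stored != null) {
            return stored;
        }
        conversions++;
        String converted = converter.apply(fieldContent);
        history.put(identifier, converted);
        return converted;
    }

    // Number of times the converter actually ran, i.e. the history misses.
    int conversionCount() {
        return conversions;
    }
}
```

Keying on the hash-value alone means that two different contents with equal hashes would share an entry, so a more defensive variant would also compare the stored content before reusing a conversion.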

Running the query-conversion in Figure 6.6 100 times with history-storage gives an average conversion-time of 288 391 ns. Running the same conversion without history gives an average of 835 035 ns. With history the conversion thus takes about 35% of the original time, a decrease of roughly 65% or about 0.55 ms. This might not seem like a lot, but for a big system that runs long queries several times it could be considered a significant performance improvement.

7 Results

In this chapter the results for performance and usability when using the Java API Abstraction solution are presented.

7.1 Performance

Tests were made using the CDB-database in the Cinnober system. The size of the data retrieved by the queries was 4.9 MB. Figure 7.1 shows how long the query took to execute in milliseconds. Spikes exist both for the standard way and when using the converter. The spikes are irrelevant since they can be caused by almost anything in the system. One of the reasons was that the garbage collector was doing its job, slowing down the request.

Figure 7.1: 100 runs with the same query using the standard way and the abstraction layer. Result is displayed in ms.

The average time to complete the query over 100 runs is shown in Figure 7.2.

Figure 7.2: Time to complete the task using standard way and the abstraction layer. Average time is calculated and displayed in milliseconds.

After 100 runs of the same query using the CDB query-language the average landed on 664 ms, while the average time over 100 runs using the abstraction-implementation was 674 ms.
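
A measurement of this kind can be sketched as follows. The harness below is an assumption about how such averages might be collected, not the code used in the thesis; it times a task with System.nanoTime and expresses the difference between two averages as a percentage.

```java
class TimingSketch {
    // Runs the task the given number of times and returns the mean duration in ms.
    static double averageMillis(Runnable task, int runs) {
        long totalNanos = 0;
        for (int i = 0; i < runs; i++) {
            long start = System.nanoTime();
            task.run();
            totalNanos += System.nanoTime() - start;
        }
        return totalNanos / (double) runs / 1_000_000.0;
    }

    // Extra time of the measured variant relative to the baseline, in percent.
    static double relativeOverheadPercent(double baselineMs, double measuredMs) {
        return (measuredMs - baselineMs) / baselineMs * 100.0;
    }

    public static void main(String[] args) {
        // Averages reported in the text: 664 ms standard way, 674 ms abstraction layer.
        System.out.printf("%.2f%%%n", relativeOverheadPercent(664.0, 674.0));
    }
}
```

With the reported averages the abstraction layer adds 10 ms, which corresponds to roughly 1.5% of the standard execution time.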

7.2 Usability

When demonstrating the abstraction layer, the developers at Cinnober got a short presentation on what the goal of this thesis was and how the abstraction worked. The demonstration was one-on-one so that one developer's opinion would not affect the others. Once the developers understood what the goal was, Figures 11.1 and 11.2 were used to compare two different ways of writing a long and complex query.

After demonstrating the abstraction layer to software developers at Cinnober I received some notes on how they felt about the implementation. The major thing the developers noted was that the converter will require a lot of validation and implementation for it to work with all the different kinds of queries. When it came to usability and debugging they mostly felt indifferent between writing queries in SQL and in the abstraction layer. When asked if they would rather use the abstraction layer than a new query-language that they did not know, most of them felt that the abstraction would help a lot, given that they would not have to learn all the different syntax and semantics of the new query language, although they would still have to know how the new database worked.

When the developers looked at the way the abstraction layer was used they pointed out that it was easy to use and understand, but that there was room for improvement. When comparing the statement to the same query in SQL they stated that it was easier to structure and use. At the same time they said that they were used to writing queries in SQL, which was why they did not feel that the abstraction was a major step up. They said that if they were given more time with the abstraction it would most likely be an improvement, given that the structure would not vary as much as it can when writing SQL-queries.

One of the main reasons the developers would want an abstraction layer like this implemented was that it would provide a uniform way of creating a query. In the system they are currently working in, strings are used to create queries, and with that come many ways to structure the query-string. Standards for how to indent or divide the sub-queries are not always followed, which makes this way hard to learn and makes it hard to keep track of what is happening. If the abstraction layer was used there would no longer be big unreadable blocks of queries with different indentations etc.

At the beginning of the presentation, none of the developers felt that the abstraction layer would improve debugging, and to some extent this is true. The abstraction will not help with the results returned, nor will it catch semantic errors. After some discussion they understood that the abstraction will prevent syntax-errors from occurring, and that for semantic errors it will make it easier to check the content of each field compared to reading a long string.

8 Evaluation

This chapter contains a short evaluation of the results and how they can be interpreted.

8.1 Evaluation of Performance

Looking at the results in section 7.1 we can see that the increase in delay is about 10 milliseconds (664 ms versus 674 ms). This is not surprising given that the only difference between the two alternatives is that the abstraction-implementation first has to convert the query before executing it. Because the conversion does not depend on the amount of data retrieved, the added delay should not grow if there was more data. Another important observation is that the query executed was not a small one. It contained a large select-field and a nested join with several where-conditions.

Looking at how much extra time it took in this scenario and how long it takes to convert a query, as mentioned in section 6.8, we can safely say that the increase in delay is not enough for this solution to be discarded. As long as the developers feel that the abstraction layer is intuitive and easy to use, this could be a valid alternative when writing queries to the database.

The developers at Cinnober did not think that this delay was big enough to be a problem in production, given that most of the latency comes from the look-up in the database, something the abstraction layer has nothing to do with. Because of this feedback the goal "Where is an abstraction layer most useful given that it has longer delay?" in section 3.2 is no longer as relevant. The abstraction layer could be used basically everywhere as long as the developers know that there is a small delay.

8.2 Evaluation of Usage

Looking at the feedback received from developers at Cinnober in section 7.2 we see that the major concern about this abstraction was the conversion. They felt that in order for this to work, a lot of validation and error-handling has to be implemented to make sure that the correct query can be created without losing important information. The converter also has to be prepared for the different kinds of special cases developers might throw at it. This is all true, but it is important to keep in mind that the main focus of these meetings was to check whether the queries created for the abstraction were intuitive enough to make querying the database easier. The converter is of course something that will have to be further developed in order for it to work in real-life practice, but the most important part is how the developers create the query and whether it is easier or harder to understand than the normal way.

Overall the developers felt optimistic that this implementation was something that could be used in production, given that the converter was built out and had more validation. The way you use the abstraction was easy to understand, but since they had a lot of experience in SQL and CDB-SQL they felt that it would take a while for them to get used to working this way given the learning curve. As mentioned in section 7.2, the main reason they wanted an abstraction in their production was so that they would have a uniform way of creating queries, with fewer variations than CDB-SQL allows.

9 Conclusions

The goal of this thesis was to investigate the best approaches to implementing an abstraction layer on top of different databases in order to provide one query-language which could be used to communicate with several databases. In the end, all the goals set in section 3.2 were reached and the implementation is explained in section 6. Using the implementation, tests of performance and usability were done, which are presented in section 7 and discussed in section 8.

The finished implementation did have an extra delay, but since it was so small it will not really cause a problem, especially on systems which have no, or low, requirements on the database. The end result was an easy-to-use interface, but this was not what the developers from Cinnober felt most optimistic about. They said that the main reason for implementing an abstraction layer was so that there would be a uniform way of writing queries to the database, making it easier to, among other things, read other developers' code. The abstraction was preferred when swapping between databases since all you needed to know was one way to create queries.

9.1 Restrictions

The solution has the following restrictions.

• Converters for databases not supported will have to be created.

• Some predicates and structures might have to be implemented since everything cannot be predicted.

9.2 Limitations

The implementation can currently only convert to SQL and CDB-SQL. As of now the converter does not send the converted query to any database, since no database was used. When using the implementation in the Cinnober system, a query was converted and the converted query was then sent directly to the database. There is support for a database to be implemented, but since this was not a big priority for the thesis it was left out. The SQL and CDB-SQL languages are huge, and because of this not all features exist. Features required to create most of the regular expressions or queries do exist, but it is close to impossible to predict all the ways a query can be structured.

9.3 Usage areas

At the beginning of this thesis the idea was for this abstraction to be used everywhere in order to make it easier for developers to create queries. After demoing for employees at Cinnober the usage-area shifted more towards using it to get a uniform way of creating queries. It was not mainly about making it easier, although it was in most cases, but about making sure that every developer writes code in the same way, in order to make it easier to jump into code written by someone else.

This abstraction layer shines when there are a lot of different developers in one project or when the queries get really long and complex. It provides a uniform way of writing queries that can easily be structured and understood. Take the query below as an example:

    Select *
    From accounts
    Where createDate > 2014-01-01;

This SQL-query is easier to write in plain text, so in such cases the abstraction might not be as useful. It was however stated by an employee at Cinnober that these kinds of queries were very unusual, so this was not really relevant in bigger systems.

10 Future Work

The implementation of the abstraction layer is a fairly basic program which converts a query-object into a query-language. Most of the standard features exist and it does work when converting to SQL and CDB-SQL, as shown in Chapter 7. One thing to keep in mind for future work is that not all query-languages have been tested, and because of this the API might not contain all the features required for them to work. The results indicate that this abstraction layer could be used in production, and if more time is invested to improve on the points mentioned by the developers at Cinnober this solution could improve development of different products.

Most of the developers at Cinnober mentioned that a lot of work remains to be put into the converter in order to make it robust enough to be used in production. In order to catch all the different syntax-errors that could occur when creating a query in the abstraction layer, exceptions have to be implemented. The current state of the implementation accepts all kinds of arguments without checking whether they are the correct ones, so exceptions are a must for the implementation to be used in production.
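
What such argument checking might look like is sketched below for a single where-condition. The exception type InvalidQueryException, the class WhereValidator and the list of accepted operators are all hypothetical; they only illustrate rejecting bad arguments at the point where the query is created.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical exception type for queries that cannot be converted safely.
class InvalidQueryException extends RuntimeException {
    InvalidQueryException(String message) {
        super(message);
    }
}

class WhereValidator {
    // Illustrative operator list; a real implementation would derive it from the predicates.
    private static final Set<String> OPERATORS =
            new HashSet<>(Arrays.asList("=", "<>", ">", "<", ">=", "<="));

    // Throws instead of silently accepting arguments the converter cannot handle.
    static void validate(String field, String operator, Object value) {
        if (field == null || field.trim().isEmpty()) {
            throw new InvalidQueryException("Where-condition needs a field name");
        }
        if (!OPERATORS.contains(operator)) {
            throw new InvalidQueryException("Unknown operator: " + operator);
        }
        if (value == null) {
            throw new InvalidQueryException("Operator " + operator + " needs a value");
        }
    }
}
```

Failing at creation time gives the developer an error message that names the offending field, instead of a syntax-error returned later by the database.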

One developer at Cinnober recommended implementing builders to make it easier to create queries. This could greatly improve usability at the cost of abstraction. In other words, the builders would work for one kind of database but not others. An example of how builders can be used is shown in Figure 11.3, which works the same way as Figure 11.1 does.
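
The trade-off can be illustrated with a small builder that emits SQL directly. This is a hypothetical sketch and not the builder from Figure 11.3: the class name SqlQueryBuilder and its methods are assumptions.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Builder tied to one query-language: it produces SQL text directly.
class SqlQueryBuilder {
    private final List<String> columns = new ArrayList<>();
    private final List<String> conditions = new ArrayList<>();
    private String table;

    SqlQueryBuilder select(String... names) {
        columns.addAll(Arrays.asList(names));
        return this;
    }

    SqlQueryBuilder from(String name) {
        this.table = name;
        return this;
    }

    SqlQueryBuilder where(String condition) {
        conditions.add(condition);
        return this;
    }

    // Assembles the final string; several conditions are joined with "and".
    String build() {
        if (columns.isEmpty() || table == null) {
            throw new IllegalStateException("select and from are required");
        }
        StringBuilder sql = new StringBuilder("Select ")
                .append(String.join(", ", columns))
                .append(" From ")
                .append(table);
        if (!conditions.isEmpty()) {
            sql.append(" Where ").append(String.join(" and ", conditions));
        }
        return sql.append(';').toString();
    }
}
```

Because build returns SQL text, the calling code is short but bound to SQL; supporting another database would need a second builder instead of a second converter.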

11 Appendix

11.1 Resources

While working at Cinnober I had access to their test databases, which use their own SQL-like language. The data sets included, among other things, clearing members and their positions. I had access to Cinnober's clearinghouse LMEClear, which implements their CDB.

While investigating how an abstraction layer could be implemented I created tests for it on different sorts of databases which contain a lot of data. As mentioned above, Cinnober has different databases which use SQL and SQL-like languages. I used these to see if the conversion could easily be swapped out in order to fit different databases.

11.2 Figures

Figure 11.1: Converter-query run in section 7.1.

Figure 11.2: Query run in section 7.1.

Figure 11.3: Query displayed in 11.2 created using builders.

Bibliography

[1] Androutsopoulos and Ritchie. Natural Language Interfaces to Databases - An Introduction. Mar.1995. URL: https://arxiv.org/abs/cmp-lg/9503016 (visited on 04/11/2019).

[2] Annotations to ”Five Common PHP database problems”. URL: http://mysqldump.azundris.com/archives/57-Annotations-to-Five-Common-PHP-database-problems.html (visitedon 04/11/2019).

[3] Database problems in MySQL/PHP Applications. Aug. 2006. URL: https://www.percona.com/blog/2006/08/11/database-problems-in-mysqlphp-applications/ (visited on04/11/2019).

[4] Encyclopedia. URL: https://www.pcmag.com/encyclopedia/term/37353/abstraction-layer (visited on 04/11/2019).

[5] Five common PHP database problems. Aug. 2006. URL: https://www.ibm.com/developerworks/library/os-php-dbmistake/index.html (visited on 04/11/2019).

[6] John Hammink. The Types of Modern Databases. URL: https://www.alooma.com/blog/types-of-modern-databases (visited on 04/11/2019).

[7] Open Source Document Database. URL: https://www.mongodb.com/ (visited on 04/11/2019).

[8] Tim Quax. Column: Encapsulation, DB Abstraction: Performance vs Scalability. URL: http://www.bytemods.com/news/227/encapsulation,-db-abstraction-performance-vs-scalability (visited on 04/23/2019).

[9] Baron Schwartz. Four types of database abstraction layers. URL: https://www.xaprb.com/blog/2006/08/13/four-types-of-database-abstraction-layers/ (visited on 04/11/2019).

[10] The language of languages. URL: http://matt.might.net/articles/grammars-bnf-ebnf/(visited on 04/11/2019).

[11] Tutorialspoint.com. Context-Free Grammar Introduction. URL: https://www.tutorialspoint.com/automata_theory/context_free_grammar_introduction.htm (visited on 04/11/2019).

[12] What is a Relational Database? – Amazon Web Services (AWS). URL: https://aws.amazon.com/relational-database/ (visited on 04/11/2019).

[13] What is Query Language? URL: https://www.techopedia.com/definition/3948/query-language (visited on 04/11/2019).

[14] Jeremy Zawodny. Database Abstraction Layers Must Die! URL: http://jeremy.zawodny.com/blog/archives/002194.html (visited on 04/23/2019).
