
© Springer-Verlag Berlin Heidelberg 2011

PROJECT REPORT

CS 670

Comparative NoSQL DBs

for Online Shopping

Name ID

Aesam AbdulrahmanAl Malky 433037907

Ayoob Al Essa 2311044

Salah Al Shammary 432032124

Dr. Mohamed-Foued Sriti


Table of Contents

1 INTRODUCTION
1.1 Project Overview
1.2 Importance
1.3 Scope
1.4 Problematic
1.5 Objective
2. Literature review
2.1 Definition of the domain
2.2 Similar works
3 Comparative NoSQL DBs for online shopping
3.1 Research methodology
3.2 Issues
3.3 Database design and Modeling
3.4 Document-oriented Database (CouchDB)
CouchDB Data Model
CouchDB Architecture
3.5 Document-oriented database (MongoDB)
MongoDB Data Model
MongoDB Architecture
4 Discussion and Conclusion
5 References


1 INTRODUCTION

Electronic commerce, also known as e-commerce, is the buying of goods and services through the Internet. In other words, e-commerce is the use of electronic communications and digital information processing in transactions to build value-creating relationships among businesses, and between businesses and customers.

Nowadays many people use e-commerce on a daily basis: paying bills online, purchasing goods from Amazon, and booking a plane ticket are all examples. E-commerce first appeared about 40 years ago and has kept growing since, with new innovations, technologies, and start-ups joining the online market every year. Even though its widespread web-based form is not older than two decades, life without e-commerce already seems difficult today.

1.1 Project Overview

The project, titled "Comparative NoSQL DBs for Online Shopping", is an analytical study of current e-commerce portals and their issues. The objectives of this study are to improve the services of e-commerce portals by using NoSQL DBs instead of the relational DBs used in legacy web-based applications, and to find which class of NoSQL DBs is the most suitable for web-based applications such as online shopping. We report a comparison between two NoSQL database stores in order to achieve high responsiveness, design flexibility, and easy scalability for online shopping sites.

1.2 Importance

A Web database is an integrated system of Web servers and database servers that enables users to access online information in a platform-independent manner through Web browsers. Because the Web server and the database server work simultaneously, the response time for a database request cannot be seen simply as the Web server service time plus the database service time. Performance metrics and optimization suggestions are therefore based on an analysis of the relationship between the two. Good Web database performance gives a company a definite edge over its competition, while poor performance seriously handicaps it. Ensuring good Web database performance is hence essential for business institutions and for any other type of enterprise.

Interactive applications have changed dramatically over the last 15 years. In the late 1990s, large web companies emerged with dramatic increases in scale on many dimensions:

─ The number of concurrent users grew as applications increasingly became accessible via the web and on mobile devices.
─ The amount of data collected and processed soared as it became easier and increasingly valuable to capture all kinds of data.
─ The amount of unstructured or semi-structured data grew, and its use became integral to the value and richness of applications.

As e-commerce portals have grown worldwide, so has the size of their data, and Relational Database Management Systems (RDBMS) have not been sufficient to store and handle such large volumes of data efficiently.

Therefore, in this project, a NoSQL database is proposed to maintain the data for online shopping. The comparative study lets us conclude which types of NoSQL DBs are the most suitable for such applications.


1.3 Scope

Due to time limitations, our project focuses on two document-oriented NoSQL DBs: CouchDB and MongoDB.

1.4 Problematic

Legacy e-commerce sites built on relational database technology suffer from several issues, such as:

Response time. Figure 1 shows the test results for the same requests issued against databases of different sizes. A large database has a slightly longer response time than a small one [3].

Fig. 1. Web database system response time versus result file size, from 10k to 100k, in a single-query test

1.5 Objective

The objectives of this study are to improve the services of e-commerce portals by using NoSQL DBs instead of the relational DBs used in legacy web-based applications, and to find which type of NoSQL DB is the most suitable for web-based applications such as online shopping. In this study we aim to propose a type of database for online shopping sites that achieves:

─ Better application development productivity through a more flexible data model;
─ Greater ability to scale dynamically to support more users and data;
─ Improved performance to satisfy users who expect highly responsive applications and to allow more complex processing of data.

2. Literature review

2.1 Definition of the domain

Some of the biggest companies in the world, such as Google, Amazon, Facebook, and LinkedIn, were among the first to discover the serious limitations of relational database technology for supporting these new application requirements. Commercial alternatives did not exist, so they invented new data management approaches themselves. Their pioneering work generated tremendous interest because a growing number of companies faced similar problems. Open source NoSQL database projects formed to leverage the work of the pioneers, and commercial companies associated with these projects soon followed.

2.2 Similar works

The challenges and issues of databases for web-based applications have been discussed by various researchers. The authors of [2] discussed the new generation of e-commerce applications and their data schema requirements. "Performance Issues of a Web Database" [3] analyzed the performance of a typical web database system with different sizes of web pages and different sizes of database tables. "A Survey and Comparison of Relational and Non-Relational Database" [6] compared how the two leading types of database storage prevailing in the industry (relational and non-relational) handle large amounts of data, and concluded that a major difference between them is that NoSQL databases offer higher data throughput and greater scalability than relational databases.


3 Comparative NoSQL DBs for online shopping

3.1 Research methodology

The project presented in this report is based on theoretical research into best practices for how a website should be developed. To this end, an analytical comparative methodology was followed.

3.2 Issues

Relational databases do not scale easily: up to a certain point better hardware can be employed, but beyond that point the database must be distributed. A major disadvantage is that data is stored in tables, a structure that can give rise to high complexity when the data cannot easily be encapsulated in a table. Many of the features provided by relational databases go unused and hence simply add to the cost and complexity of the database. Relational databases use SQL, which is designed for structured data and can become highly complex when working with unstructured data. When the amount of data becomes huge, the database has to be partitioned across multiple servers, and this partitioning poses several problems because joining tables across distributed servers is not an easy task.

3.3 Database design and Modeling

In a relational database system we must define a schema before adding records to a database. The schema is a structure described in a formal language supported by the database; it provides a blueprint for the tables in the database and for the relationships between them. Within a table, we need to define constraints in terms of rows and named columns, as well as the type of data that can be stored in each column.

In contrast, a document-oriented database contains documents, which are records that hold both a description of the data and the actual data. Documents can be as complex as we choose: we can use nested data to provide additional subcategories of information about an object, and we can use one or more documents to represent a real-world object.
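The contrast between the two models can be sketched in a few lines of Python. This is only an illustration under assumed names: the `product` and `review` tables, the field names, and the sample values are hypothetical and do not come from any of the systems studied. SQLite stands in for a generic relational database, and a plain dictionary stands in for a document.

```python
import json
import sqlite3

# Relational side: the schema must exist before any row can be inserted.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE product (id INTEGER PRIMARY KEY, name TEXT NOT NULL, price REAL NOT NULL)"
)
conn.execute(
    "CREATE TABLE review (product_id INTEGER REFERENCES product(id), stars INTEGER, comment TEXT)"
)
conn.execute("INSERT INTO product VALUES (1, 'Laptop', 999.0)")
conn.execute("INSERT INTO review VALUES (1, 5, 'Fast delivery')")
conn.execute("INSERT INTO review VALUES (1, 4, 'Good value')")

# Reading a product together with its reviews requires a join across tables.
rows = conn.execute(
    "SELECT p.name, r.stars FROM product p "
    "JOIN review r ON r.product_id = p.id ORDER BY r.stars"
).fetchall()

# Document side: one self-describing record with nested data and no schema
# declared up front. Field names describe the values they carry.
product_doc = {
    "type": "product",
    "name": "Laptop",
    "price": 999.0,
    "reviews": [
        {"stars": 5, "comment": "Fast delivery"},
        {"stars": 4, "comment": "Good value"},
    ],
}
doc_stars = sorted(review["stars"] for review in product_doc["reviews"])
serialized = json.dumps(product_doc)  # the form in which a document store would keep it
```

In the relational version the reviews live in a second table and are reassembled by a join; in the document version they travel with the product as nested data.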

3.4 Document-oriented Database (CouchDB)

In a document-oriented model, data objects are stored as documents; each document stores the data and enables us to update or delete it. Instead of columns with names and data types, we describe the data in the document and provide the value for that description.

CouchDB is a document-oriented database server, accessible through REST APIs. Couch is an acronym for "Cluster Of Unreliable Commodity Hardware", emphasizing the distributed nature of the database. CouchDB is designed for document-oriented applications such as forums, bug tracking, wikis, and email. The CouchDB project is part of the Apache Software Foundation and is written in Erlang. Erlang was chosen as the programming language because its lightweight processes and functional programming paradigm make it very well suited for concurrent applications. CouchDB is ad hoc and schema-free, with a flat address space.

CouchDB is not only a NoSQL database but also a web server for applications written in JavaScript. The advantage of using CouchDB as a web server is that applications can be deployed simply by putting them into the database, and that they can access the database directly without the overhead of a query protocol.


CouchDB Data Model:

Data in CouchDB is organized into documents. Each document can have any number of attributes, and each attribute can itself contain lists or even objects. Documents are stored and accessed as JSON objects, which is why CouchDB supports the data types String, Number, Boolean, and Array, as well as nested objects.

Each CouchDB document has a unique identifier and, because CouchDB uses optimistic replication on both the server side and the client side, a revision identifier as well. CouchDB updates the revision id every time a document is rewritten. Update operations in CouchDB are performed on whole documents: a client that wants to modify a value in a document first has to load the document, make the modification, and then send the whole document back to the database. CouchDB uses the revision id included in the document for concurrency control and can therefore detect whether another client has made an update in the meantime.

The query model of CouchDB consists of two concepts: Views, which are built using MapReduce functions, and the HTTP query API, which allows clients to access and query the views. A View in CouchDB is basically a collection of key-value pairs ordered by key. Views are built by user-specified MapReduce functions, which are called incrementally whenever a document in the database is created or updated. Views should be specified before runtime, because introducing a new View requires its MapReduce functions to be invoked for every document in the database. This is why CouchDB does not support dynamic queries.
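The load, modify, and write-back cycle with a revision check, described earlier in this subsection, can be sketched as follows. This is a minimal in-memory illustration, not CouchDB's implementation: `TinyDocStore` and `ConflictError` are hypothetical names, and the revision is simplified to an integer counter stored in a `_rev` field.

```python
import copy


class ConflictError(Exception):
    """Raised when a client writes a document with a stale revision id."""


class TinyDocStore:
    """In-memory stand-in for a document store that checks revisions on write."""

    def __init__(self):
        self._docs = {}

    def get(self, doc_id):
        # Hand out a deep copy so each client edits its own version.
        return copy.deepcopy(self._docs[doc_id])

    def put(self, doc):
        current = self._docs.get(doc["_id"])
        # A write is accepted only if it is based on the latest stored revision.
        if current is not None and doc.get("_rev") != current["_rev"]:
            raise ConflictError(f"stale revision for {doc['_id']}")
        new_rev = 1 if current is None else current["_rev"] + 1
        stored = copy.deepcopy(doc)
        stored["_rev"] = new_rev
        self._docs[doc["_id"]] = stored
        return new_rev


store = TinyDocStore()
store.put({"_id": "cart:42", "items": ["book"]})

# Two clients load the same revision of the document.
client_a = store.get("cart:42")
client_b = store.get("cart:42")

# Client A modifies its copy and sends the whole document back: accepted.
client_a["items"].append("pen")
store.put(client_a)

# Client B still holds the old revision, so its write is rejected.
client_b["items"].append("lamp")
try:
    store.put(client_b)
    conflict_detected = False
except ConflictError:
    conflict_detected = True
```

After a conflict, client B would have to reload the document, reapply its change, and write again.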


CouchDB Architecture:

Fig. 2. Simple Architecture of Apache CouchDB database

CouchDB has three main components: the Storage Engine, the View Engine, and the Replicator.

Storage Engine: It is B-tree based and is the core of the system, managing the storage of internal data, documents, and views. Data in CouchDB is accessed by keys or key ranges, which map directly to the underlying B-tree operations. This direct mapping improves speed significantly.
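Access by key and by key range can be illustrated with a sorted index. The sketch below is an assumption-laden stand-in: it uses a sorted Python list with the standard `bisect` module rather than a real B-tree, and `SortedIndex` and the `order:` and `product:` key prefixes are hypothetical. It only shows why ordered keys make range lookups cheap.

```python
import bisect


class SortedIndex:
    """Sorted key list standing in for a B-tree (not CouchDB's actual engine)."""

    def __init__(self):
        self._keys = []    # keys kept in sorted order
        self._values = {}  # key -> stored value

    def put(self, key, value):
        if key not in self._values:
            bisect.insort(self._keys, key)
        self._values[key] = value

    def get(self, key):
        """Single-key lookup."""
        return self._values[key]

    def range(self, start_key, end_key):
        """Return (key, value) pairs with start_key <= key <= end_key, in key order."""
        lo = bisect.bisect_left(self._keys, start_key)
        hi = bisect.bisect_right(self._keys, end_key)
        return [(key, self._values[key]) for key in self._keys[lo:hi]]


index = SortedIndex()
for key in ["order:003", "order:001", "product:7", "order:002"]:
    index.put(key, {"id": key})

# All keys that start with "order:" form one contiguous slice of the index.
orders = [key for key, _ in index.range("order:", "order:\uffff")]
```

Because the keys are ordered, a range query locates its two end points and reads the slice between them instead of scanning every document.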

View Engine: It is based on Mozilla SpiderMonkey, and view functions are written in JavaScript. It allows creating ad hoc views that are made of MapReduce jobs. Definitions of the views are stored in design documents. When a user reads data in a view, CouchDB makes sure the result is up to date. Views can be used to create indices and extract data from documents.
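The way a view is derived from documents by a map step and a reduce step can be sketched as below. As stated above, real CouchDB view functions are written in JavaScript and stored in design documents; this Python version is only an analogy, the sample documents and the names `map_order_total`, `reduce_sum`, and `build_view` are hypothetical, and the view is rebuilt from scratch here instead of being updated incrementally.

```python
from collections import defaultdict

docs = [
    {"_id": "o1", "type": "order", "customer": "alice", "total": 30.0},
    {"_id": "o2", "type": "order", "customer": "bob", "total": 12.5},
    {"_id": "o3", "type": "order", "customer": "alice", "total": 7.5},
    {"_id": "p1", "type": "product", "name": "Laptop"},
]


def map_order_total(doc):
    """Map step: emit a (customer, total) pair for order documents only."""
    if doc.get("type") == "order":
        yield doc["customer"], doc["total"]


def reduce_sum(values):
    """Reduce step: combine all values emitted under one key."""
    return sum(values)


def build_view(documents, map_fn, reduce_fn):
    """Apply map to every document, group by key, reduce, and order by key."""
    grouped = defaultdict(list)
    for doc in documents:
        for key, value in map_fn(doc):
            grouped[key].append(value)
    return sorted((key, reduce_fn(values)) for key, values in grouped.items())


view = build_view(docs, map_order_total, reduce_sum)
```

The result is a collection of key-value pairs ordered by key, one entry per customer, which matches the shape of a view as described in the data model.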

Replicator: It is responsible for replicating data to a local or remote database and synchronizing design documents.

Fig. 3. Simple database model for online shopping on the CouchDB database

3.5 Document-oriented database (MongoDB)

MongoDB is a schemaless, document-oriented database developed by 10gen and an open source community. The name MongoDB comes from "humongous". The database is intended to be scalable and fast and is written in C++. In addition to its document-oriented database features, MongoDB can be used to store and distribute large binary files such as images and videos. It is fault tolerant and persistent, and it provides a complex query language as well as an implementation of MapReduce.

MongoDB Data Model:

MongoDB stores documents as BSON (Binary JSON) objects, which are binary-encoded JSON-like objects. Like JSON, BSON supports nested object structures with embedded objects and arrays. MongoDB supports in-place modification of attributes, so if the application changes a single attribute, only that attribute is sent back to the database. Each document has an ID field, which is used as a primary key. To enable fast queries, the developer can create an index for each queryable field in a document. MongoDB also supports indexing over embedded objects and arrays. For arrays it has a special feature called "multikeys", which allows an array to be used as an index; the array could, for example, contain tags for a document. With such an index, documents can be searched by their associated tags.

Documents in MongoDB can be organized in so-called "collections". Each collection can contain any kind of document, but queries and indexes can only be made against one collection. Because MongoDB limited a collection to 40 indexes at the time of writing, and because queries against smaller collections perform better, it is advisable to use one collection for each type of document.

Relations in MongoDB can be modeled using embedded objects and arrays, so the data model has to be a tree. When a relation does not fit the tree form, there are two options. The first is to copy the related data into each document that needs it, which implies that some documents are replicated inside the database; this solution should only be used if the replicated documents do not need very frequent updates. The second option is to use client-side joins for all relations that cannot be put into the tree form. This requires more work in the application layer and increases the network traffic with the database.
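The two options for relations that do not fit the tree form can be compared in a small sketch. It uses plain Python dictionaries in place of MongoDB collections, and the order and product documents, field names, and helper functions are all hypothetical.

```python
# Option 1: the product data is copied (replicated) into each order document.
order_embedded = {
    "_id": "order:1",
    "customer": "alice",
    "items": [
        {"product": {"name": "Laptop", "price": 999.0}, "qty": 1},
        {"product": {"name": "Mouse", "price": 19.0}, "qty": 2},
    ],
}

# Option 2: orders hold only product ids; products live in their own collection.
products = {
    "p1": {"_id": "p1", "name": "Laptop", "price": 999.0},
    "p2": {"_id": "p2", "name": "Mouse", "price": 19.0},
}
order_referenced = {
    "_id": "order:1",
    "customer": "alice",
    "items": [{"product_id": "p1", "qty": 1}, {"product_id": "p2", "qty": 2}],
}


def total_embedded(order):
    """Everything needed is inside the one document that was read."""
    return sum(item["product"]["price"] * item["qty"] for item in order["items"])


def total_client_side_join(order, product_collection):
    """Client-side join: each lookup stands for an extra query to the database."""
    return sum(
        product_collection[item["product_id"]]["price"] * item["qty"]
        for item in order["items"]
    )


embedded_total = total_embedded(order_embedded)
joined_total = total_client_side_join(order_referenced, products)
```

Both versions compute the same total. The embedded form needs one read but must be rewritten in every order if a product changes, while the referenced form keeps one copy of each product at the cost of extra lookups in the application.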


MongoDB Architecture:

A MongoDB cluster consists of three components: shard nodes, configuration servers, and routing services (mongos).

Shard nodes: Shard nodes are responsible for storing the actual data. Each shard can consist of either one node or a replication pair. In future versions of MongoDB, one shard may consist of more than two nodes for better redundancy and read performance.

Configuration servers: The config servers store the metadata and routing information of the MongoDB cluster and are accessed by the shard nodes and by the routing services.

Routing services or mongos: The mongos routing processes are responsible for performing the tasks requested by the clients. Depending on the type of operation, the mongos send the requests to the necessary shard nodes and merge the results before returning them to the client. The mongos themselves are stateless and can therefore be run in parallel.

For storage, MongoDB uses memory-mapped files, which let the operating system's virtual memory manager decide which parts of the database are kept in memory and which only on disk. This is why MongoDB cannot control when the data is written to the hard disk. The motivation for using memory-mapped files is to use as much of the available memory as possible to boost performance. In some cases this might eliminate the need for a separate cache layer on the client side.
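The routing role of the mongos described above, sending a request to the necessary shards and merging the partial results, can be sketched as follows. This is a simplified illustration and not MongoDB code: `Shard` and `Router` are hypothetical classes, and placement by a CRC32 hash of the document id is an assumption made to keep the sketch short, whereas in the cluster described here the routing information comes from the configuration servers.

```python
import zlib


class Shard:
    """Holds the actual documents for one part of the data set."""

    def __init__(self, name):
        self.name = name
        self.docs = {}

    def insert(self, doc):
        self.docs[doc["_id"]] = doc

    def find(self, predicate):
        return [doc for doc in self.docs.values() if predicate(doc)]


class Router:
    """Stateless router: it stores no documents, only the list of shards."""

    def __init__(self, shards):
        self.shards = shards

    def _shard_for(self, doc_id):
        # Assumed placement rule: a hash of the id that is stable across runs.
        return self.shards[zlib.crc32(doc_id.encode()) % len(self.shards)]

    def insert(self, doc):
        self._shard_for(doc["_id"]).insert(doc)

    def find_by_id(self, doc_id):
        # A lookup by id is sent to exactly one shard.
        return self._shard_for(doc_id).docs.get(doc_id)

    def find(self, predicate):
        # A general query goes to every shard; the partial results are merged.
        results = []
        for shard in self.shards:
            results.extend(shard.find(predicate))
        return sorted(results, key=lambda doc: doc["_id"])


router = Router([Shard("s0"), Shard("s1"), Shard("s2")])
for n in range(10):
    router.insert({"_id": f"product:{n}", "price": n * 10})

cheap = router.find(lambda doc: doc["price"] < 30)
```

Because the router keeps no data of its own, several routers could be created over the same shards and used in parallel.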


Fig. 4. Typical architecture of a MongoDB cluster


4 Discussion and Conclusion

First, we summarize the reasons to adopt a NoSQL database for web-based applications as follows:

1. Avoidance of Unneeded Complexity

Relational databases provide a variety of features and strict data consistency, but this rich feature set and the ACID properties implemented by RDBMSs might be more than necessary for particular applications and use cases.

2. High Throughput

Some NoSQL databases provide significantly higher data throughput than traditional RDBMSs. For instance, the column store Hypertable, which follows Google's Bigtable approach, allows the local search engine Zvent to store one billion data cells per day. To give another example, Google is able to process 20 petabytes a day stored in Bigtable via its MapReduce approach.

3. Avoidance of Expensive Object-Relational Mapping

Most NoSQL databases (such as key/value stores and document stores) are designed to store data structures that are either simple or closer to those of object-oriented programming languages than relational data structures are. They therefore do not require expensive object-relational mapping.
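The third point can be illustrated with a short sketch: an application object whose structure is already close to a document can be stored without a mapping layer. The `Cart` and `LineItem` classes and their fields are hypothetical, and JSON text stands in for whatever format a document store would use.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class LineItem:
    sku: str
    qty: int


@dataclass
class Cart:
    customer: str
    items: list = field(default_factory=list)


cart = Cart("alice", [LineItem("laptop-15", 1), LineItem("mouse-usb", 2)])

# The nested object converts directly to a nested document: no tables,
# foreign keys, or mapping configuration are involved.
document = json.dumps(asdict(cart))
restored = json.loads(document)
```

With a relational database, the same cart would be split across a cart table and a line-item table and reassembled on every read, which is the mapping cost this point refers to.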


Finally, we conclude that both document-oriented databases satisfy the web-based requirements considered here, but MongoDB has the advantage of being able to store and distribute large binary files such as images and videos. It therefore makes sense to use MongoDB to handle the many pictures and clips for the different products on online shopping websites.

No.  Web issue                                   CouchDB  MongoDB
1    Scalability                                 OK       OK
2    Flexibility                                 OK       OK
3    Support for many users                      OK       OK
4    Support for big data                        OK       OK
5    Support for distribution                    OK       OK
6    Support for parallel computation            OK       OK
7    Support for binary files (images, video)    N/A      OK
8    High data throughput                        OK       OK
9    Workloads with more writes than reads       N/A      OK

Table 1. Comparison between the two document-oriented databases


5 References

1. "NoSQL Databases", C. Strauch, Lecture: Selected Topics on Software-Technology, Ultra-Large Scale Sites.

2. "Storage and Querying of E-Commerce Data", R. Agrawal, A. Somani and Y. Xu, VLDB Conference, 2001.

3. "Performance Issues of a Web Database", Y. Li and K. Lü, chapter in Database and Expert Systems Applications, Springer, Volume 1873, 2000, pp. 825-834.

4. Couchbase whitepapers, www.couchbase.com/

5. Couchbase Developer's Guide 2.0, www.couchbase.com/

6. "A Survey and Comparison of Relational and Non-Relational Database", N. Jatana, S. Puri, M. Ahuja, I. Kathuria and D. Gosain, International Journal of Engineering Research & Technology (IJERT), August 2012.

7. "Semi-Structured Data Modeling For A Web-Enabled Engineering Application", E. Rasys and N. Dawood, International Conference on Computing in Civil Engineering, June 2012.

8. "RDBMS to NoSQL: Reviewing Some Next-Generation Non-Relational Database's", R. Padhy, M. Patra and S. Satapathy, International Journal of Advanced Engineering Sciences and Technologies, 2011.