Sridhar Valaguru (c) Copy right 2013 contact [email protected]

Extend db

DESCRIPTION

eXTend DB: an embedded, extensible document database that can be extended with custom queries and object modifiers. Morph DB: a key-value pair database that allows fast in-place updates and object expansion. Block Manager: an innovative library that manages on-disk blocks inside a file and provides a very simple interface for a variety of on-disk data structures. http://sscreation.net.in

Page 1: Extend db

Sridhar Valaguru

Page 2: Extend db

Motivation

Example use cases

eXTend DB: design, extensibility, limitations

Morph DB (key-value pair database): design and implementation

Block management: design, implementation, caches

Unique approach towards the database

Page 3: Extend db

NoSQL document databases like MongoDB are steadily becoming popular.

MongoDB has features that suit a wide variety of applications better than traditional SQL databases:

JSON-style documents with dynamic schemas offer simplicity and power.

Rich, document-based queries.

Index on any attribute.

Fast in-place updates and atomic modifiers.

Features like replication, sharding, high availability, and MapReduce are not applicable in this context.

Page 4: Extend db

The features mentioned previously are also useful for stand-alone applications installed and running on user machines.

There are a few problems with using MongoDB in such applications:

External dependency on MongoDB.

The user needs to install it separately.

The user has to manage MongoDB for the application to work.

Possibility of namespace collisions among unrelated applications.

Unnecessary client-server communication hurts performance.

So there is a need for a document database embedded into the application, with features similar to MongoDB: basically the SQLite equivalent of MongoDB.

An extensible database is a plus.

Page 5: Extend db

Logging library:

Each log file entry could be an object in the database.

Indexes could be created at a later point in time to analyze log files using rich querying.

File tagging application:

Each file's information could be stored as an object in the DB.

Tags could be attached and removed dynamically.

Indexed data could extend the object with new fields.

Querying and searching could be based on tags or indexed data.

Page 6: Extend db

Single-node user-space NFS server:

Stores all metadata in the database.

Maps file handles to object/file attributes.

Objects are accessed by file handle and/or by parent file handle and name.

File data is stored separately, outside the database, using an object-id based namespace.

Any other stand-alone application.

Page 7: Extend db

NoSQL document database.

Stores BSON documents.

Embedded into the process.

MongoDB-like querying interface.

Extensible.

Each database collection is stored in a set of files in a user-specified directory.

Page 8: Extend db

Architecture layers, top to bottom:

Application

Database API: query-related API and management API

Query Optimizer

Extensible Query Module

Storage Layer

Key-value DB backends: Tokyo Cabinet, Morph DB, in-memory

Page 9: Extend db

Data is stored in 3 types of files, each backed by a key-value database in the storage layer.

Descriptor DB: holds the list of indexes in the database.

Main DB: stores all documents, with generated BSON object ids as keys.

A BSON object id uniquely identifies an object in the collection.

Index DB: stores references to objects, with a particular field value as the key and the list of matching object ids as the value.
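
The three databases can be pictured with plain Python dictionaries standing in for the key-value backend. This is only a sketch under assumptions: the names `descriptor_db`, `main_db`, `index_db`, and `insert` are hypothetical, and a random hex string stands in for a generated BSON object id.

```python
import uuid

descriptor_db = {"indexes": ["tag"]}   # Descriptor DB: which fields are indexed
main_db = {}                            # Main DB: object id -> document
index_db = {"tag": {}}                  # Index DB: field -> {value -> [object ids]}

def insert(doc):
    """Store a document and update every index listed in the descriptor DB."""
    oid = uuid.uuid4().hex              # stand-in for a generated BSON object id
    main_db[oid] = doc
    for field in descriptor_db["indexes"]:
        if field in doc:
            index_db[field].setdefault(doc[field], []).append(oid)
    return oid

a = insert({"name": "report.txt", "tag": "work"})
b = insert({"name": "photo.jpg", "tag": "home"})
c = insert({"name": "notes.txt", "tag": "work"})

# Looking up tag == "work" touches only the index entry, then the main DB.
work_docs = [main_db[oid] for oid in index_db["tag"]["work"]]
```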

Page 10: Extend db

Simple weight-based query optimizer.

The index with the least number of objects is chosen.
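
The selection rule can be sketched in a few lines. The function name `choose_index` and the shape of `candidates` (index entry name mapped to its list of object ids) are assumptions made for illustration, not the database's actual interface.

```python
def choose_index(candidates):
    """Return the candidate index entry with the fewest object ids.

    Returns None when no index applies, meaning a full scan is needed.
    """
    if not candidates:
        return None
    return min(candidates, key=lambda name: len(candidates[name]))

candidates = {
    "tag=work": ["id1", "id2", "id3"],   # weight 3
    "owner=alice": ["id2"],              # weight 1
}
best = choose_index(candidates)
```

Here `best` is `"owner=alice"`, since scanning one candidate object is cheaper than scanning three.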

Page 11: Extend db

Provides 2 functions to the database engine:

Given a query in BSON object format, returns the list of indexes that can be used for that query.

▪ This is in turn used by the query optimizer to find the best index to use.

Given a BSON object and a query BSON object, returns whether the object matches the query.

Page 12: Extend db

The query module implements the comparison operator between 2 BSON elements.

It has no knowledge of the storage layer; it just operates on the given BSON objects.

The comparison can be overridden by registering user-specified comparison operators.

This could be very useful for custom binary data stored in the database.

Different query operators are implemented in the module to provide complex querying.

Page 13: Extend db

Operators let an object be selected in ways other than checking that a field's value equals the value in the query.

E.g. {'a':3} will match all documents that have a field a with value 3. This is a simple query. But if we want all objects whose value is greater than 3, a simple query cannot express it.

{'a':{'$gt':3}} is the query that matches all documents where the value of a is greater than 3. Here the operator '$gt' means "greater than".

Any field name starting with "$" is treated as an operator, and the rest of the name gives the name of the operator.

The querying function looks up the operator in the registered list and invokes its handler to check whether the field matches the criteria in the query.

By default, various operators such as $lt, $lte, $nin, $all, $in, and $exists are implemented.
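
The lookup-and-dispatch step can be sketched as follows, with Python dictionaries standing in for BSON objects. The names `OPERATORS` and `matches` are hypothetical, only a few operators are shown, and treating a missing field as a non-match is a simplification of this sketch.

```python
# Registered operators, keyed by the name that follows the "$".
OPERATORS = {
    "gt": lambda value, arg: value > arg,
    "lt": lambda value, arg: value < arg,
    "lte": lambda value, arg: value <= arg,
    "in": lambda value, arg: value in arg,
    "nin": lambda value, arg: value not in arg,
}

def is_operator_clause(cond):
    """True when every key of the condition starts with "$"."""
    return isinstance(cond, dict) and bool(cond) and all(k.startswith("$") for k in cond)

def matches(doc, query):
    """Return whether the document satisfies every field of the query."""
    for field, cond in query.items():
        if field not in doc:
            return False
        value = doc[field]
        if is_operator_clause(cond):
            for name, arg in cond.items():
                handler = OPERATORS.get(name[1:])   # strip the "$"
                if handler is None:
                    raise ValueError(f"unknown operator {name}")
                if not handler(value, arg):
                    return False
        elif value != cond:                          # simple equality query
            return False
    return True

docs = [{"a": 1}, {"a": 3}, {"a": 7}]
equal_3 = [d for d in docs if matches(d, {"a": 3})]
greater_3 = [d for d in docs if matches(d, {"a": {"$gt": 3}})]
```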

Page 14: Extend db

Custom operators can be registered with the query module.

When a query uses such an operator, the corresponding user callback is invoked.

The callback takes the value of the field as one parameter and the value given in the query as the other, and returns a boolean.

This way the query language of eXTend DB can be extended without editing the database code or waiting for the developer to implement the feature.
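
A registration hook of this kind might look like the sketch below. The names `register_operator` and `field_matches` and the `$startswith` operator are invented for illustration; the real registration call is not shown in the slides.

```python
OPERATORS = {}

def register_operator(name, callback):
    """Register callback(field_value, query_value) -> bool under an operator name."""
    OPERATORS[name] = callback

def field_matches(value, condition):
    """Check one field value against a condition such as {"$startswith": "error"}."""
    for key, arg in condition.items():
        callback = OPERATORS[key[1:]]    # name after the "$"
        if not callback(value, arg):
            return False
    return True

# A user-defined operator, added without touching the query module itself.
register_operator(
    "startswith",
    lambda value, arg: isinstance(value, str) and value.startswith(arg),
)

hit = field_matches("error: disk full", {"$startswith": "error"})
miss = field_matches("info: started", {"$startswith": "error"})
```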

Page 15: Extend db

Abstract layer that provides key-value storage.

Isolates data storage from the rest of the database engine.

It is the only place where data is stored.

The backend can be any key-value pair database, e.g.:

▪ Tokyo Cabinet

▪ Morph DB

▪ In-memory key-value pairs

Currently Tokyo Cabinet is the default key-value backend; it stores all data in files.

The Morph DB backend is almost complete.

Page 16: Extend db

Different backends can be chosen depending on the type of data stored.

E.g.

Index databases can be stored completely in memory, which provides fast access.

The Main DB could be stored using the Tokyo Cabinet backend.

For persistent indexes, Morph DB could be used.
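
One way to picture pluggable backends is a small interface that every store implements. Everything named here (`KeyValueBackend`, `InMemoryBackend`, `Collection`) is hypothetical, and only an in-memory backend is shown; a Tokyo Cabinet or Morph DB backend would implement the same two methods.

```python
from abc import ABC, abstractmethod

class KeyValueBackend(ABC):
    """Hypothetical interface each storage backend would implement."""

    @abstractmethod
    def put(self, key, value):
        ...

    @abstractmethod
    def get(self, key):
        ...

class InMemoryBackend(KeyValueBackend):
    """Keeps everything in a dict: fast, but nothing survives a restart."""

    def __init__(self):
        self._data = {}

    def put(self, key, value):
        self._data[key] = value

    def get(self, key):
        return self._data.get(key)

class Collection:
    """Uses one backend for documents and another for index entries."""

    def __init__(self, main_backend, index_backend):
        self.main = main_backend
        self.index = index_backend

    def insert(self, oid, doc, indexed_field):
        self.main.put(oid, doc)
        key = (indexed_field, doc[indexed_field])
        self.index.put(key, (self.index.get(key) or []) + [oid])

# Both stores are in-memory here; a persistent backend could replace either one.
coll = Collection(main_backend=InMemoryBackend(), index_backend=InMemoryBackend())
coll.insert("id1", {"tag": "work"}, "tag")
coll.insert("id2", {"tag": "work"}, "tag")
```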

Page 17: Extend db

Easy-to-use, MongoDB-like embedded database.

Extensible storage backends.

Extensible query language.

Completely customizable query behavior.

Page 18: Extend db

Tokyo Cabinet updates are not in-place.

▪ Every time an object is expanded, its old space in the file is discarded and new space is found.

▪ This is a serious problem for heavy update workloads.

Tokyo Cabinet by default writes to memory and needs a sync call to write the data to the file.

▪ If the application crashes without a sync, data is lost.

▪ Sync calls are costly. If sync is called after every insert, performance is very low.

Page 19: Extend db

Morph DB is a key-value pair database aimed at solving the limitations of Tokyo Cabinet.

Aims of Morph DB:

Fast in-place updates and object expansion.

A fast block management layer that can reuse storage freed by deleted objects.

Once data is written, reads should not be slowed down by the block management layer.

Writes all data directly to the file while maintaining performance.

Page 20: Extend db

B+ tree implementation on top of the block management layer.

Provides generation-based cursors.

Cursors can work while the DB is being modified.

Can search for values in a range of keys.

Page 21: Extend db

Provides 2 basic functions:

Data write:

▪ Finds and allocates resources in the file.

▪ Writes the data to suitable location(s).

▪ Returns an address where the data is written.

▪ The upper layer must store this reference to read the data back.

▪ Data is not interpreted.

Data read:

▪ Given the address returned earlier by the data write, reads the data from the offset or chain of linked offsets.

▪ Verifies the checksum of each piece.

▪ Returns the stitched object to the caller.
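
The write and read paths can be illustrated with a toy store in which a Python list plays the role of the file and a list index plays the role of an address. The class name `BlockStore`, the fixed `piece_size`, and the use of CRC32 as the checksum are all assumptions of this sketch, not details taken from the block manager.

```python
import zlib

class BlockStore:
    """Toy block store: each piece is (checksum, next_address, payload)."""

    END = -1  # marks the last piece of a chain

    def __init__(self, piece_size=8):
        self.pieces = []              # list index acts as the "address"
        self.piece_size = piece_size

    def write(self, data: bytes) -> int:
        """Write uninterpreted bytes and return the address of the first piece."""
        size = self.piece_size
        chunks = [data[i:i + size] for i in range(0, len(data), size)] or [b""]
        next_addr = self.END
        for chunk in reversed(chunks):        # write the tail first so links exist
            self.pieces.append((zlib.crc32(chunk), next_addr, chunk))
            next_addr = len(self.pieces) - 1
        return next_addr

    def read(self, addr: int) -> bytes:
        """Follow the links, verify each piece, and return the stitched data."""
        out = []
        while addr != self.END:
            checksum, next_addr, chunk = self.pieces[addr]
            if zlib.crc32(chunk) != checksum:
                raise IOError(f"checksum mismatch at address {addr}")
            out.append(chunk)
            addr = next_addr
        return b"".join(out)

store = BlockStore()
address = store.write(b"hello block manager")   # the caller must keep this address
restored = store.read(address)
```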

Page 22: Extend db

File storage is managed in terms of resource clusters.

Each resource cluster contains some header information followed by its resources.

A unique property of resources is that they are of variable size, instead of the single fixed block size used in various other solutions.

Individual resource (block) sizes vary from 128 bytes to 4 MB.

This range of block sizes makes it suitable for data of various sizes, from very small values up to 16 MB.

Page 23: Extend db

Clusters are allocated on demand for a particular type of resource.

Cluster sizes start at 128 KB, and each subsequent cluster is double the size of the previous one, capped at 32 MB.

Increasing cluster sizes keep the database file small initially and let it grow along with the data.

In small clusters, the header information can be significant in size compared to the resources.
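
The doubling rule is easy to check numerically. The helper name `cluster_size` is hypothetical, and the sketch assumes the slide's "K" and "MB" are binary units (1024-based).

```python
KIB = 1024
MIB = 1024 * KIB

FIRST_CLUSTER = 128 * KIB
MAX_CLUSTER = 32 * MIB

def cluster_size(n):
    """Size in bytes of the n-th cluster allocated (n = 0 is the first)."""
    return min(FIRST_CLUSTER << n, MAX_CLUSTER)

sizes = [cluster_size(n) for n in range(10)]
# 128 KiB, 256 KiB, 512 KiB, ... the cap of 32 MiB is reached at n = 8.
```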

Page 24: Extend db

Data is stored in a list of blocks, each of which stores a reference to the next block in the list.

Each chunk stores the checksum of the entire data.

This helps in identifying corrupt or partially updated links.

When data is expanded, a suitable block is allocated according to the expanded data size and linked.

There is a cap on the link count: there can be a maximum of 4 links.

Once data spreads across 4 links, it is automatically defragmented: a suitable bigger block is found for the entire data, which reduces the number of links.
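
The expansion and defragmentation behaviour can be modelled in memory. This sketch makes several assumptions the slides do not confirm: block sizes are powers of two starting at 128 bytes, free room in the last block is filled before a new link is added, and defragmentation is triggered when a fifth link would be needed. The names `StoredValue` and `block_size_for` are hypothetical.

```python
MAX_LINKS = 4
MIN_BLOCK = 128

def block_size_for(length):
    """Smallest power-of-two block (at least 128 bytes) that holds length bytes."""
    size = MIN_BLOCK
    while size < length:
        size *= 2
    return size

class StoredValue:
    """A value held as a chain of (block_size, payload) links."""

    def __init__(self, data: bytes):
        self.links = [(block_size_for(len(data)), data)]

    def read(self) -> bytes:
        return b"".join(payload for _, payload in self.links)

    def expand(self, extra: bytes):
        # Fill any free room in the last block first (in-place expansion).
        size, payload = self.links[-1]
        room = size - len(payload)
        self.links[-1] = (size, payload + extra[:room])
        extra = extra[room:]
        if extra:
            # Link a new block sized for the remaining data.
            self.links.append((block_size_for(len(extra)), extra))
        if len(self.links) > MAX_LINKS:
            # Defragment: move everything into one bigger block.
            whole = self.read()
            self.links = [(block_size_for(len(whole)), whole)]

value = StoredValue(b"a" * 128)
for _ in range(3):
    value.expand(b"b" * 128)          # chain grows to 4 links
links_before = len(value.links)
value.expand(b"c" * 128)              # a fifth link triggers defragmentation
```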

Page 25: Extend db

Block allocation takes a block size parameter.

A free block of the specified size is found in the bitmap residing in the cluster header, and its address is returned.

The DiskAddr structure identifying a resource is a 64-bit bit-field structure.

A 56-bit component directly gives the address of the resource, so there is no address translation in the IO path.

A 4-bit type field indicates the resource size: 0 for 128, 1 for 256, and so on.

The type field helps identify the resource when freeing.
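
The packing can be sketched with integer bit operations. The bit positions are an assumption: the slides give only the field widths, so this sketch places the 56-bit address in the low bits and the 4-bit type directly above it, leaving the top 4 bits unused. The function names are hypothetical.

```python
ADDR_BITS = 56
ADDR_MASK = (1 << ADDR_BITS) - 1
TYPE_MASK = 0xF

def pack_disk_addr(offset, rtype):
    """Pack a file offset and a resource type into one 64-bit integer."""
    if not 0 <= offset <= ADDR_MASK:
        raise ValueError("offset does not fit in 56 bits")
    if not 0 <= rtype <= TYPE_MASK:
        raise ValueError("type does not fit in 4 bits")
    return (rtype << ADDR_BITS) | offset

def unpack_disk_addr(addr):
    """Return (offset, rtype); the offset is usable directly, with no translation."""
    return addr & ADDR_MASK, (addr >> ADDR_BITS) & TYPE_MASK

def resource_size(rtype):
    """Type 0 is 128 bytes, type 1 is 256 bytes, doubling with each type."""
    return 128 << rtype

addr = pack_disk_addr(offset=4096, rtype=3)
offset, rtype = unpack_disk_addr(addr)
```

With this numbering, the largest 4-bit type (15) corresponds to 128 << 15 bytes, which is 4 MiB and agrees with the stated maximum block size.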

Page 26: Extend db

Block allocation needs to be extremely fast.

Caches are used to remember the last cluster from which data was allocated.

There is one such cache for each resource type.

The cache state makes allocation O(1) for a series of allocations.

Freeing a resource sets the cache state to point to the lowest-offset resource.

The search always continues in the next clusters.
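
One reading of the cache behaviour is sketched below for a single resource type. The details are assumptions: each cluster is modelled as a list of used/free flags instead of an on-disk bitmap, the cache holds only a cluster index, and a free moves that index back to the lowest cluster known to have space. The class name `TypeAllocator` is hypothetical.

```python
class TypeAllocator:
    """Toy allocator for one resource type with a last-cluster cache."""

    def __init__(self, slots_per_cluster=4):
        self.slots_per_cluster = slots_per_cluster
        self.clusters = []     # each cluster: list of booleans, True = used
        self.cached = 0        # cluster to try first on the next allocation

    def allocate(self):
        """Return (cluster_index, slot) of a newly used resource."""
        # Search forward only, starting from the cached cluster.
        for ci in range(self.cached, len(self.clusters)):
            bitmap = self.clusters[ci]
            for slot, used in enumerate(bitmap):
                if not used:
                    bitmap[slot] = True
                    self.cached = ci
                    return ci, slot
        # No free slot found: allocate a new cluster on demand.
        self.clusters.append([False] * self.slots_per_cluster)
        self.cached = len(self.clusters) - 1
        self.clusters[-1][0] = True
        return self.cached, 0

    def free(self, ci, slot):
        """Release a resource and pull the cache back if it lies lower."""
        self.clusters[ci][slot] = False
        self.cached = min(self.cached, ci)

alloc = TypeAllocator(slots_per_cluster=2)
first = [alloc.allocate() for _ in range(3)]   # fills cluster 0, starts cluster 1
alloc.free(0, 1)
reused = alloc.allocate()                       # the freed slot is found again
```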

Page 27: Extend db

System calls (mostly pread/pwrite were used) are very fast on some machines (Core i3 processors). Doing a large number of small writes was not a problem there.

On other machines (Core 2 Duo), system calls were significantly slower and a huge percentage of time was spent in system calls.

Memory-mapped IO was significantly faster.

Page 28: Extend db

Mapping the entire file has a few problems:

▪ File sizes can grow.

▪ On 32-bit machines it limits the database size.

▪ Unused regions could be mapped, and the kernel could choose to evict the wrong set of pages.

To avoid these drawbacks, a list of mmapped blocks is used.

The number is limited to 10 to limit virtual address usage.

The least recently used mmapped region is removed when a new region is to be mmapped.

Whenever a cluster is allocated, the whole cluster is mmapped.

For each IO this list is checked: on a hit a simple memcpy is done; otherwise it falls back to the old system call.

This improved performance by almost 50% on slow machines.
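
A read path with a bounded list of mapped regions might look like the following sketch for Unix-like systems. It simplifies the scheme described: regions have a fixed size equal to the mmap allocation granularity instead of whole clusters, only reads are shown, and the class name `RegionCache` is hypothetical.

```python
import mmap
import os
import tempfile
from collections import OrderedDict

MAX_REGIONS = 10   # cap that limits virtual address usage

class RegionCache:
    """Least-recently-used list of mapped regions with a pread fallback."""

    def __init__(self, fd, region_size=mmap.ALLOCATIONGRANULARITY):
        self.fd = fd
        self.region_size = region_size
        self.regions = OrderedDict()   # region start offset -> mmap object

    def map_region(self, start):
        """Map the region at start, evicting the least recently used if full."""
        if start in self.regions:
            self.regions.move_to_end(start)
            return
        if len(self.regions) >= MAX_REGIONS:
            _, oldest = self.regions.popitem(last=False)
            oldest.close()
        self.regions[start] = mmap.mmap(
            self.fd, self.region_size, offset=start, access=mmap.ACCESS_READ
        )

    def read(self, offset, length):
        start = offset - offset % self.region_size
        region = self.regions.get(start)
        if region is not None and offset + length <= start + self.region_size:
            self.regions.move_to_end(start)        # hit: copy from mapped memory
            rel = offset - start
            return region[rel:rel + length]
        return os.pread(self.fd, length, offset)   # miss: fall back to a system call

    def close(self):
        for region in self.regions.values():
            region.close()
        self.regions.clear()

with tempfile.TemporaryFile() as f:
    size = mmap.ALLOCATIONGRANULARITY
    f.write(bytes(range(256)) * (2 * size // 256))   # two regions of test data
    f.flush()
    cache = RegionCache(f.fileno())
    cache.map_region(0)
    hit = cache.read(16, 4)            # served from the mapped region
    miss = cache.read(size + 16, 4)    # region not mapped, served by pread
    cache.close()
```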

Page 29: Extend db

The B+ tree uses the block management layer to store its internal nodes and data.

The block manager has no information about how the blocks are going to be used.

It provides a slot for the upper layer to store a reference to its superblock.

Internal nodes store all keys of the node and references to the corresponding child nodes/values.

The parent pointer is not maintained on disk; this makes splitting nodes fast.

The parent-child relationship is established during search.

Page 30: Extend db

All the nodes being modified are in memory.

Nodes are pinned in the cache.

After each modification the node is written back to the file.

Page 31: Extend db

Concurrent modifications can be allowed by taking a write lock on the root of the sub-tree that could be modified by an insert/delete.

An insert into a B+ tree could modify anywhere from a few to all of the nodes on the path from the root to the leaf.

The highest level that will be modified can be found by checking whether a child could overflow because of the insert.

If the child overflows, then its parent will be modified.

So instead of locking the root, we just need to lock the subtree whose root is the topmost parent that could be modified.

Similar speculation could be done for deletes.

All the nodes from the root to the first child that could be modified will be locked for read.
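
The speculation for inserts can be sketched as a function over the key counts of the nodes on the search path. The name `write_lock_depth` and the representation of a path as a list of key counts are assumptions for illustration; the idea is that a node with spare room absorbs any split coming from below, so nothing above it changes.

```python
def write_lock_depth(path_key_counts, max_keys):
    """Return the depth (0 = root) of the subtree root to write-lock for an insert.

    path_key_counts lists the number of keys in each node on the path from
    the root to the leaf. Nodes above the returned depth need only read locks.
    """
    top = 0   # if every node on the path is full, the root itself may split
    for depth, count in enumerate(path_key_counts):
        if count < max_keys:
            top = depth   # the deepest node with room stops split propagation
    return top

leaf_has_room = write_lock_depth([2, 3, 3], max_keys=4)      # only the leaf changes
parent_absorbs = write_lock_depth([2, 3, 4], max_keys=4)     # leaf splits into parent
root_absorbs = write_lock_depth([2, 4, 4], max_keys=4)       # splits reach the root
```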

Page 33: Extend db

Tokyocabinet - http://fallabs.com/tokyocabinet/spex-en.html

Mongo DB - http://www.mongodb.org/
