Demystifying datastores

Vishnu Rao

MySQL Enthusiast

Doodle maker

Senior Data Engineer @ DataSpark

Formerly @ flipkart.com

The comma-separated list ...

● Hadoop, HBase, RocksDB
● MySQL, MariaDB, Postgres
● Cassandra, MongoDB
● Druid, Redis, MemSQL
● Elasticsearch, Solr
● CockroachDB, CouchDB
● Vertica, Infobright
● Redshift, DynamoDB
● S3, OpenStack Swift ...

The FUN-damental question: which one should I use?

Let's try to look at the problem from the point of view of the database.

First, let's play some baseball ...

Base 0 : The Data itself

● Row having columns
● Key - Value
  ○ Key - Blob (you think object)
  ○ Key - Document (you think JSON / XML)
● Graph (nodes / edges, kind of like key-value)
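To make the shapes above concrete, here is a minimal sketch of the same order held in each form. Plain Python dicts stand in for a real datastore, and the field names (`order_id`, `customer`, `bill_amount`) and the graph edges are assumptions made up for illustration.

```python
import json
import pickle

# 1. Row having columns: a fixed set of named attributes.
row = {"order_id": "order-id-123", "customer": "customer-1", "bill_amount": 5}

# 2a. Key - Blob: the store only sees opaque bytes behind the key.
key_blob = {"order-id-123": pickle.dumps(row)}

# 2b. Key - Document: the value is structured text (JSON here) that a
#     document store could parse and query by field.
key_document = {"order-id-123": json.dumps(row)}

# 3. Graph: nodes and edges, kept here as node -> list of neighbours,
#    which is why it looks a lot like key-value.
graph = {
    "customer-1": ["order-id-123"],
    "order-id-123": ["Bugis Street"],
}

# Reading back: the blob must be deserialised by the application,
# the document can be inspected field by field.
blob_order = pickle.loads(key_blob["order-id-123"])
doc_customer = json.loads(key_document["order-id-123"])["customer"]
```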

Base 1 : How is the Data Stored ?

Let's consider a sample data record / row:

order-id-123 | customer-1 | 5$ bill amount | Bugis Street | 1$ tax | 3 items

In this record, order-id-123 is a possible primary key; every field, including the key, is a column / attribute.

Approach 1

● Store all columns of the row side by side (i.e. TOGETHER) on disk.
● This is generally referred to as a ROW based datastore.

● Useful for use cases like "showing the ENTIRE order on the UI".
● Because the columns sit together, the entire row can typically be fetched in one disk access.

order-id-123 | customer-1 | 5$ bill amount | Bugis Street | 1$ tax | 3 items
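A rough sketch of the row-based idea, assuming an in-memory dict in place of disk pages: every column of an order lives in one tuple, so one lookup by primary key returns the whole order. The column names and the `fetch_order` helper are hypothetical.

```python
# Column names are assumptions; the values come from the sample records.
COLUMNS = ("customer", "bill_amount", "street", "tax", "items")

# Row-based layout: all columns of a row are stored together under its key.
row_store = {
    "order-id-123": ("customer-1", 5, "Bugis Street", 1, 3),
    "order-id-121": ("customer-1", 2, "Bugis Street", 2, 1),
}

def fetch_order(order_id):
    """One lookup returns every column of the order, ready for the UI."""
    return dict(zip(COLUMNS, row_store[order_id]))

order = fetch_order("order-id-123")
```

The flip side shows up with aggregates: summing one column still means touching every full row.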

Approach 2

● Store columns SEPARATELY, so that they can be accessed independently.
● This is generally referred to as a COLUMN based datastore.

● Consider aggregates like Avg(billing_amount) or Sum(items).
● Instead of fetching the entire row, fetch only the columns needed for the computation.
  ○ I.e. less data fetched from disk = REDUCED IO

order-id-123 | customer-1 | 5$ bill amount | Bugis Street | 1$ tax | 3 items
order-id-121 | customer-1 | 2$ bill amount | Bugis Street | 2$ tax | 1 item
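The same two orders laid out column-wise might look like the sketch below. It assumes one Python list per column as a stand-in for one file or segment per column; the column names and the `avg` helper are hypothetical.

```python
# Column-based layout: each column is stored on its own.
# Position i in every list belongs to the same logical row.
column_store = {
    "order_id": ["order-id-123", "order-id-121"],
    "customer": ["customer-1", "customer-1"],
    "bill_amount": [5, 2],
    "street": ["Bugis Street", "Bugis Street"],
    "tax": [1, 2],
    "items": [3, 1],
}

def avg(column):
    """Average of one column; no other column is read."""
    values = column_store[column]
    return sum(values) / len(values)

avg_bill = avg("bill_amount")             # reads only bill_amount
total_items = sum(column_store["items"])  # reads only items
```

With the two sample orders this gives an average bill of 3.5 and 4 items in total, without ever touching the customer or street columns.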

● What are the other optimisations for a column store?
  ○ Imagine 4 rows with a column, say 'age':
    ■ Row 1 - 28
    ■ Row 2 - 30
    ■ Row 3 - 28
    ■ Row 4 - 28

● While storing on disk, if you SORT the values before storing them, you can also think of compression:

28,28,28,30 (sorted -> good for search now)
28(3),30 (now compressed -> 28 stored once, with a count)
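The sort-then-compress step is essentially run-length encoding. A minimal sketch, assuming the four 'age' values from the example; the helper names are made up, and real column stores combine this with other encodings.

```python
from itertools import groupby

def run_length_encode(values):
    """Collapse each run of equal neighbours into a (value, count) pair."""
    return [(value, len(list(run))) for value, run in groupby(values)]

def run_length_decode(runs):
    """Expand (value, count) pairs back into the full column."""
    return [value for value, count in runs for _ in range(count)]

ages = [28, 30, 28, 28]       # Row 1 .. Row 4
sorted_ages = sorted(ages)    # [28, 28, 28, 30]

encoded_sorted = run_length_encode(sorted_ages)   # [(28, 3), (30, 1)]
encoded_unsorted = run_length_encode(ages)        # [(28, 1), (30, 1), (28, 2)]
```

Sorting first is what makes the runs long: the unsorted column needs three pairs, the sorted one only two.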

Typically:

● MySQL / Postgres = ROW based
● Vertica / Infobright / Druid = COLUMN based

Approach 2.5

● Store a group of columns TOGETHER, but store each group separately.
● This is generally referred to as a COLUMN-family based datastore.

Logically group the columns:

order-id-123
customer-1
5$ bill amount, Bugis Street
1$ tax, 3 items

Typically: HBase / Cassandra
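A rough sketch of the column-family idea using nested dicts. The family names (`customer`, `billing`) and the way the columns are split between them are assumptions for illustration only, and this is not the HBase or Cassandra API.

```python
# family -> row key -> columns of that family.
# Columns inside a family are stored together; families are stored apart.
column_families = {
    "customer": {
        "order-id-123": {"customer": "customer-1", "street": "Bugis Street"},
    },
    "billing": {
        "order-id-123": {"bill_amount": 5, "tax": 1, "items": 3},
    },
}

def read_family(family, row_key):
    """Touch a single family, e.g. only the billing columns."""
    return column_families[family][row_key]

def read_row(row_key):
    """Rebuild the full row by visiting every family."""
    merged = {}
    for rows in column_families.values():
        merged.update(rows.get(row_key, {}))
    return merged

billing = read_family("billing", "order-id-123")
full_row = read_row("order-id-123")
```

Reading one family behaves like a small row store; rebuilding the whole row costs one access per family.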

Base 2 : The Indexing

● What kind of data structure is used?
  ○ B-tree, inverted index, fractal tree, clustered key, bitmap, no index?
● Certain types of queries like certain indexes.
  ○ Range queries like B-trees; inserts like fractal trees.
● What's the index loading mechanism?
  ○ Redis is memory bound.

Base 3 : The CAP Theorem

● Most datastores do
  ○ Horizontal scaling
  ○ Sharding
● So here is the catch: in the event of a network partition,
  ○ How is Consistency / Availability handled?
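A toy illustration of that trade-off, assuming just two replicas and a flag that simulates the partition. The `TwoNodeStore` class and its "CP" / "AP" modes are hypothetical names; real datastores use quorums, leader election and repair mechanisms that this sketch leaves out.

```python
class TwoNodeStore:
    """Two replicas; clients write through replica a."""

    def __init__(self, mode):
        if mode not in ("CP", "AP"):
            raise ValueError("mode must be 'CP' or 'AP'")
        self.mode = mode
        self.a = {}
        self.b = {}
        self.partitioned = False  # True = a and b cannot talk to each other

    def write(self, key, value):
        """Return True if the write was accepted."""
        if not self.partitioned:
            self.a[key] = value
            self.b[key] = value   # replication succeeds
            return True
        if self.mode == "CP":
            return False          # stay consistent: refuse the write
        self.a[key] = value       # stay available: accept, replicas diverge
        return True

cp = TwoNodeStore("CP")
cp.write("stock", 10)
cp.partitioned = True
cp_accepted = cp.write("stock", 9)   # rejected, both replicas still say 10

ap = TwoNodeStore("AP")
ap.write("stock", 10)
ap.partitioned = True
ap_accepted = ap.write("stock", 9)   # accepted, a says 9 while b says 10
```

The consistent store gives up availability during the partition; the available store gives up consistency and has to reconcile the replicas afterwards.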

Base 4 : Apart from CAP theorem

● ACID?
  ○ Transaction commit / rollback support
● BASE?
  ○ Basically Available, Soft state, Eventual consistency?
● Can I do joins if the data is sharded?
  ○ What about distribution awareness?
● The query interface (a major concern?)
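As a small illustration of the commit / rollback point, the sketch below uses Python's built-in `sqlite3` module with an in-memory database. The `orders` table and its columns are assumptions for the example; the second statement fails on purpose so that the whole transaction is rolled back.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (order_id TEXT PRIMARY KEY, bill_amount INTEGER)"
)
conn.execute("INSERT INTO orders VALUES ('order-id-123', 5)")
conn.commit()

try:
    # Used as a context manager, the connection commits if the block
    # succeeds and rolls back if it raises.
    with conn:
        conn.execute(
            "UPDATE orders SET bill_amount = 50 WHERE order_id = 'order-id-123'"
        )
        # Duplicate primary key: raises IntegrityError.
        conn.execute("INSERT INTO orders VALUES ('order-id-123', 7)")
except sqlite3.IntegrityError:
    pass

# The UPDATE was part of the failed transaction, so it is undone too.
amount = conn.execute(
    "SELECT bill_amount FROM orders WHERE order_id = 'order-id-123'"
).fetchone()[0]
```

All-or-nothing behaviour like this is what the ACID question is asking about; a BASE-style store may accept each write independently and converge later.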

The bases...

So, try to cover the bases and decide whether you need it.

PS: There is no Silver Bullet

Thank you.

Vishnu Rao

jaihind213

sweetweet213

mash213.wordpress.com

linkedin.com/in/213vishnu
