NoSQL, SQL, NewSQL - methods of structuring data

Multiple ways of storing -> Data

<-SQL -> NOSQL -> NEWSQL

Tony Rogerson, @tonyrogerson

[email protected]/tonyrogerson

Agenda

Data structures
◦ Relational, Key/Value pair, Document, Graph, Column/Column Family Store

Key Concepts
◦ Hashing, Partitioning, Sharding, ACID, BASE

Technology Areas
◦ SQL, NoSQL, NewSQL

Who-am-I

Freelance SQL Server professional and Data Specialist

Fellow BCS, MSc in BI, PGCert in Data Science

Started out in 1986 – VSAM, System W, Application System, DB2, Oracle, SQL Server since 4.21a

Awarded SQL Server MVP yearly since 97

Founded UK SQL Server User Group back in ’99, founder member of DDD, SQL Bits, SQL Relay, SQL Santa

Interested in commodity based distributed processing of Data.

Data Structures: WAYS OF STRUCTURING DATA

What is data?

Tony Rogerson

[email protected]

Harpenden

36 on 2014-01-01, 36 on 2014-05-01, 38 on 2014-10-15

46

44


Data needs context and structure

FullName: Tony Rogerson
Email: [email protected]
PostalTown: Harpenden
{WaistInches, RecordedOn}: 36 on 2014-01-01, 36 on 2014-05-01, 38 on 2014-10-15
ChestInches: 46
Age: 44

Schema gives Context


Relational [Tables]

People
FullName (PK) | Email | PostalTown | WaistInches | ChestInches | AgeYears
Tony Rogerson | [email protected] | Harpenden | | 46 | 44

WaistInches
FullName (FK) | WaistInches | RecordedDate
Tony Rogerson | 36 | 2014-01-01




Key/Value pair (EAV)

Entity Attribute Value

Person FullName Tony Rogerson

Person Email [email protected]

Person PostalTown Harpenden

Person ChestInches 46

Person Age 44

WaistInches FullName Tony Rogerson

WaistInches WaistInches 36

WaistInches RecordedDate 2014-01-01
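To make the EAV layout concrete, here is a minimal Python sketch that pivots the rows above back into one record per entity. The function name `pivot` is hypothetical, the Email row is left out because its value is not recoverable here, and every value is kept as a string because a plain EAV store carries no type information.

```python
# EAV rows as (entity, attribute, value) triples, copied from the slide.
eav_rows = [
    ("Person", "FullName", "Tony Rogerson"),
    ("Person", "PostalTown", "Harpenden"),
    ("Person", "ChestInches", "46"),
    ("Person", "Age", "44"),
    ("WaistInches", "FullName", "Tony Rogerson"),
    ("WaistInches", "WaistInches", "36"),
    ("WaistInches", "RecordedDate", "2014-01-01"),
]


def pivot(rows):
    """Group attribute/value pairs under their entity name (hypothetical helper)."""
    entities = {}
    for entity, attribute, value in rows:
        entities.setdefault(entity, {})[attribute] = value
    return entities


records = pivot(eav_rows)
print(records["Person"])
print(records["WaistInches"])
```

The application, not the store, has to know that "46" is a number and "2014-01-01" is a date.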








Examples: Riak, Dynamo, Redis, FoundationDB etc.



Document

JSON Schema

    {
      "FullName" : "string",
      "Email" : "string",
      "PostalTown" : "string",
      "WaistInches" : {
        "WaistInches" : "number",
        "RecordedDate" : "string"
      },
      "ChestInches" : "number",
      "Age" : "number"
    }

JSON Document

    {
      "FullName" : "Tony Rogerson",
      "Email" : "[email protected]",
      "PostalTown" : "Harpenden",
      "WaistInches" : [
        { "WaistInches" : 36, "RecordedDate" : "2014-01-01" },
        { "WaistInches" : 36, "RecordedDate" : "2014-05-01" }
      ],
      "ChestInches" : 46,
      "Age" : 44
    }

JSON vs XML discussion: http://stackoverflow.com/questions/4862310/json-and-xml-comparison


Examples: MongoDB, Couchbase, CouchDB etc.


Schema Design

Document Database compared with Normal Form (Relational)

E.g. 100 machine cluster

    {
      "firstName": "John",
      "lastName": "Smith",
      "isAlive": true,
      "age": 25,
      "height_cm": 167.6,
      "address": {
        "streetAddress": "21 2nd Street",
        "city": "New York",
        "state": "NY",
        "postalCode": "10021-3100"
      },
      "phoneNumbers": [
        { "type": "home", "number": "212 555-1234" },
        { "type": "office", "number": "646 555-4567" }
      ]
    }

Document Database: object data stored together (collection)

Normal Form (Relational): object data stored separately (tables): person, address, phoneNumbers
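The difference between the two designs can be sketched in Python: the document keeps the whole object together, while the relational form splits it into person, address and phoneNumbers rows tied together by a key. The function `normalise` and the surrogate key `person_id` are assumptions made for illustration, not something the slide defines.

```python
# The document as one object (document database style).
doc = {
    "firstName": "John",
    "lastName": "Smith",
    "isAlive": True,
    "age": 25,
    "height_cm": 167.6,
    "address": {
        "streetAddress": "21 2nd Street",
        "city": "New York",
        "state": "NY",
        "postalCode": "10021-3100",
    },
    "phoneNumbers": [
        {"type": "home", "number": "212 555-1234"},
        {"type": "office", "number": "646 555-4567"},
    ],
}


def normalise(person_id, document):
    """Split one document into rows for three tables (hypothetical helper)."""
    nested = ("address", "phoneNumbers")
    person = {"person_id": person_id}
    person.update({k: v for k, v in document.items() if k not in nested})
    address = {"person_id": person_id, **document.get("address", {})}
    phones = [{"person_id": person_id, **p} for p in document.get("phoneNumbers", [])]
    return person, address, phones


person, address, phones = normalise(1, doc)
print(person)
print(address)
print(phones)
```

Reading the object back from the relational form needs a join across three tables; on a large cluster those rows may sit on different machines, which is the cost the document form avoids.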

MongoDB Example

Use ESTER for MongoVUE

What do documents look like?

Graph

SQL (inherently very poor performance):
◦ Nested Sets
◦ Recursive CTE

Represents “connected” data

All about understanding and exploring relationships

Examples: Neo4j, Virtuoso, AllegroGraph.

[Diagram: nodes Dave, Tony, Fred and Sid, connected by relationships. Legend: Node, Relationship]
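Exploring relationships usually means walking the graph outwards from a starting node. A minimal Python sketch follows; the slide names the nodes but not who is connected to whom, so the edges in `knows` are invented purely for illustration.

```python
from collections import deque

# Hypothetical relationships: the diagram's actual edges are not recoverable.
knows = {
    "Tony": ["Dave", "Fred"],
    "Dave": ["Sid"],
    "Fred": [],
    "Sid": [],
}


def reachable(graph, start):
    """Breadth-first walk: every node reachable from start, with its hop count."""
    hops = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbour in graph.get(node, []):
            if neighbour not in hops:
                hops[neighbour] = hops[node] + 1
                queue.append(neighbour)
    return hops


print(reachable(knows, "Tony"))
```

A graph database does this kind of traversal natively; in SQL the same walk needs a recursive CTE or a nested-set encoding.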

Column

Values stored as a key-value pair

Column Name (unique)

Value

Timestamp

Important bit: It may not appear in each row!

Column Family is: a container for columns and rows (like, but not the same as, a relational table)

Relational Table: Fixed Columns

Column Family: determined by application – flexible
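A toy Python model of a column family, assuming each row is a map of column name to a (value, timestamp) pair. The class name `ColumnFamily`, the last-write-wins rule and the row for "fred" are illustrative assumptions, not the API of Cassandra or HBase.

```python
import time


class ColumnFamily:
    """Toy column family: row key -> {column name: (value, timestamp)}."""

    def __init__(self):
        self.rows = {}

    def put(self, row_key, column, value, timestamp=None):
        ts = time.time() if timestamp is None else timestamp
        row = self.rows.setdefault(row_key, {})
        current = row.get(column)
        # Assumption: the write with the newest timestamp wins.
        if current is None or ts >= current[1]:
            row[column] = (value, ts)

    def get(self, row_key, column):
        cell = self.rows.get(row_key, {}).get(column)
        return None if cell is None else cell[0]


people = ColumnFamily()
people.put("tony", "PostalTown", "Harpenden", timestamp=1)
people.put("tony", "ChestInches", 46, timestamp=1)
# Hypothetical second row that simply has no ChestInches column.
people.put("fred", "PostalTown", "London", timestamp=1)

print(people.get("tony", "ChestInches"))
print(people.get("fred", "ChestInches"))
```

The second row shows the "important bit": a column that exists for one row need not exist for another, and nothing has to store a NULL for it.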

Examples:Cassandra, Druid, HBase

Column storage

http://www.datastax.com/docs/1.1/ddl/column_family

Stored as…

SQL Server Columnstore

Table sliced into rowgroups (a group of rows – a batch)

Each rowgroup compressed in column-wise manner

Column segment is a column of data from within the rowgroup

Column segment per column in table which is then compressed onto storage.

So: a table has rows (sliced into rowgroups), and each rowgroup has columns (each column stored as its own column segment)
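The rowgroup and column segment layout can be sketched in Python. This is an illustration of the idea only: the rowgroup size of 2 is far smaller than anything a real columnstore would use, and run-length encoding stands in for whatever compression SQL Server actually applies. Both function names are hypothetical.

```python
def to_rowgroups(rows, columns, rowgroup_size):
    """Slice rows into rowgroups, then pivot each rowgroup into column segments."""
    rowgroups = []
    for start in range(0, len(rows), rowgroup_size):
        batch = rows[start:start + rowgroup_size]
        segments = {col: [row[i] for row in batch] for i, col in enumerate(columns)}
        rowgroups.append(segments)
    return rowgroups


def run_length_encode(segment):
    """Simple stand-in for column-wise compression: [value, repeat count] pairs."""
    encoded = []
    for value in segment:
        if encoded and encoded[-1][0] == value:
            encoded[-1][1] += 1
        else:
            encoded.append([value, 1])
    return encoded


columns = ["FullName", "WaistInches", "RecordedDate"]
rows = [
    ("Tony Rogerson", 36, "2014-01-01"),
    ("Tony Rogerson", 36, "2014-05-01"),
    ("Tony Rogerson", 38, "2014-10-15"),
]

rowgroups = to_rowgroups(rows, columns, rowgroup_size=2)
for number, segments in enumerate(rowgroups, start=1):
    for column, segment in segments.items():
        print(number, column, run_length_encode(segment))
```

Values within one column tend to repeat, which is why compressing per column segment works better than compressing whole rows.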

Demo: SQL Sparse columns

Key Concepts: SHARDING, PARTITIONING, HASHING

Hashing

Distributed Database Cluster has fixed number of data nodes

Your data is spread across the database cluster
◦ 10 node cluster; each data item may reside on 3 nodes
◦ Which 3 nodes?

Data key is Hashed to a number – hashing algorithm is deterministic

data-node = f( data-key )
◦ print ( checksum( 'All hale to the ale' ) * 1.) % 10
◦ print ( checksum( 'And a glass of wine for the ladies' ) * 1.) % 10
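The same idea as the checksum example, sketched in Python for a 10 node cluster with 3 copies of each item. Two assumptions are made here: MD5 is used only because it is deterministic across runs (Python's built-in `hash` is not for strings), and the extra copies are placed on the next nodes in sequence, which is one common scheme rather than the rule of any particular product.

```python
import hashlib

NODE_COUNT = 10  # fixed number of data nodes
REPLICAS = 3     # each data item resides on 3 nodes


def nodes_for(key, node_count=NODE_COUNT, replicas=REPLICAS):
    """Return the node numbers that hold the given data key (hypothetical helper)."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    first = int.from_bytes(digest[:8], "big") % node_count
    # Assumption: replicas live on the nodes that follow the first one.
    return [(first + i) % node_count for i in range(replicas)]


print(nodes_for("All hale to the ale"))
print(nodes_for("And a glass of wine for the ladies"))
```

Any client can run the function and find the data without asking a central directory, which only works while the number of nodes stays fixed.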

Partitioning

Chop big table up into “horizontal partitions”

Partition key required

Each partition is self-contained binding rows by the partitioning key

Access all data through logical view over all partitions

Table by table basis
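A small Python sketch of horizontal partitioning, assuming the partition key is the year and month of RecordedDate. The class `PartitionedTable` and its method names are hypothetical; the point is that rows are bound to a partition by the key and that one logical view reads across all partitions.

```python
from collections import defaultdict
from itertools import chain


class PartitionedTable:
    """One logical table stored as several self-contained partitions."""

    def __init__(self, partition_key):
        self.partition_key = partition_key  # function: row -> partition id
        self.partitions = defaultdict(list)

    def insert(self, row):
        self.partitions[self.partition_key(row)].append(row)

    def rows_in(self, partition):
        """Read a single partition without touching the others."""
        return list(self.partitions.get(partition, []))

    def all_rows(self):
        """Logical view over all partitions."""
        return chain.from_iterable(self.partitions[p] for p in sorted(self.partitions))


waist = PartitionedTable(partition_key=lambda row: row["RecordedDate"][:7])
waist.insert({"FullName": "Tony Rogerson", "WaistInches": 36, "RecordedDate": "2014-01-01"})
waist.insert({"FullName": "Tony Rogerson", "WaistInches": 36, "RecordedDate": "2014-05-01"})
waist.insert({"FullName": "Tony Rogerson", "WaistInches": 38, "RecordedDate": "2014-10-15"})

print(sorted(waist.partitions))
print(list(waist.all_rows()))
```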

Shared Nothing Partitioning+

Each Shard is self-contained and has all the procs, meta-data and of course your partition of data

Shard Key common to multiple tables, for example CustomerID, Email Address.

Greater autonomy across the distributed database

Seeing the entire database as a logical unit is more difficult – joining is a nightmare
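A sketch of sharding in Python, assuming an integer CustomerID as the shard key shared by a customers table and an orders table. All class names, the modulo routing and the sample data are hypothetical. Because both tables use the same shard key, a customer and their orders land on the same shard and can be joined there without crossing nodes.

```python
class Shard:
    """A self-contained node holding its own slice of every table."""

    def __init__(self, name):
        self.name = name
        self.tables = {"customers": {}, "orders": []}


class ShardedDatabase:
    def __init__(self, shard_names):
        self.shards = [Shard(name) for name in shard_names]

    def shard_for(self, customer_id):
        # Assumption: simple modulo routing on an integer shard key.
        return self.shards[customer_id % len(self.shards)]

    def add_customer(self, customer_id, email):
        self.shard_for(customer_id).tables["customers"][customer_id] = email

    def add_order(self, customer_id, item):
        self.shard_for(customer_id).tables["orders"].append((customer_id, item))

    def orders_with_email(self, customer_id):
        """In-shard join: both tables for this customer live on one node."""
        shard = self.shard_for(customer_id)
        email = shard.tables["customers"][customer_id]
        return [(email, item) for cid, item in shard.tables["orders"] if cid == customer_id]


db = ShardedDatabase(["Node 1", "Node 2", "Node 3"])
db.add_customer(7, "someone@example.com")  # hypothetical customer
db.add_order(7, "book")
print(db.shard_for(7).name, db.orders_with_email(7))
```

A query that spans customers on different shards has no such shortcut; it must visit every shard and merge the results, which is where joining becomes painful.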

Node 1

Node 2

Node 3

Sharding and Replication

[Diagram: Node 1, Node 2, Node 3]

Sharding: each node holds a subset of data

Replication: each node holds a full copy of data, kept in sync

ACID (Atomicity, Consistency, Isolation, Durability)

BASE (Basically Available, Soft-state, Eventually Consistent)

ACID is a transactional model

Not specific to the relational database
◦ e.g. Hive (interface to Hadoop) provides ACID facilities

Durability: write-ahead logging is expensive (latency from serialisation of writes)

Distributed transactions – Two Phase Commit (2PC)
◦ Poor scalability because of latency
◦ ACID across distributed nodes is a bad design choice
◦ Partition/shard the database and keep ACID in-node only

[Diagram: a Coordinator and two Subordinates; an INSERT runs as a 2PC transaction, all or nothing]
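A minimal Python sketch of the two phases, with the coordinator written as a plain function. The class and method names are hypothetical, and real 2PC also has to log decisions and cope with timeouts and crashed nodes, none of which is modelled here.

```python
class Subordinate:
    """A node taking part in the transaction (names are illustrative)."""

    def __init__(self, name, can_commit=True):
        self.name = name
        self.can_commit = can_commit
        self.staged = None
        self.rows = []

    def prepare(self, row):
        # Phase 1: stage the work and vote yes or no.
        if self.can_commit:
            self.staged = row
        return self.can_commit

    def commit(self):
        # Phase 2 (success): make the staged row permanent.
        self.rows.append(self.staged)
        self.staged = None

    def rollback(self):
        # Phase 2 (failure): discard the staged row.
        self.staged = None


def two_phase_insert(subordinates, row):
    """Coordinator: commit everywhere only if every subordinate votes yes."""
    votes = [s.prepare(row) for s in subordinates]
    if all(votes):
        for s in subordinates:
            s.commit()
        return True
    for s in subordinates:
        s.rollback()
    return False


a, b = Subordinate("a"), Subordinate("b")
print(two_phase_insert([a, b], "row 1"), a.rows, b.rows)

c = Subordinate("c", can_commit=False)
print(two_phase_insert([a, c], "row 2"), a.rows, c.rows)
```

Every insert costs two round trips to every subordinate and waits for the slowest one, which is the latency problem the bullets describe.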


BASE is a transactional model (ish)

Specific to Distributed database model

Basically Available – all or some of the system is available

Node 1 Node 2 Node 3


Soft-state, Eventually Consistent

System may change over time [as replicas become up-to-date (consistent)]

Node 1 Node 2 Node 3

Insert value ‘A’
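Soft state and eventual consistency can be simulated in a few lines of Python: a write is accepted by one node straight away and copied to the others later. The `Cluster` class and its queue of pending copies are assumptions made for illustration; real systems add conflict resolution, retries and failure handling.

```python
from collections import deque


class Cluster:
    """Toy cluster where replication happens after the write returns."""

    def __init__(self, node_names):
        self.nodes = {name: {} for name in node_names}
        self.pending = deque()  # copies still waiting to be applied

    def write(self, node, key, value):
        self.nodes[node][key] = value
        for other in self.nodes:
            if other != node:
                self.pending.append((other, key, value))

    def read(self, node, key):
        return self.nodes[node].get(key)

    def replicate_one(self):
        """Apply one pending copy, as a background process would."""
        if self.pending:
            target, key, value = self.pending.popleft()
            self.nodes[target][key] = value

    def is_consistent(self):
        states = list(self.nodes.values())
        return all(state == states[0] for state in states)


cluster = Cluster(["Node 1", "Node 2", "Node 3"])
cluster.write("Node 1", "value", "A")
print(cluster.read("Node 1", "value"), cluster.read("Node 3", "value"), cluster.is_consistent())

cluster.replicate_one()
cluster.replicate_one()
print(cluster.read("Node 3", "value"), cluster.is_consistent())
```

Between the write and the last copy, a reader on Node 3 sees nothing while a reader on Node 1 sees 'A'; the system is available throughout but only consistent once replication has caught up.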

SQL: AH – THE COMMON DENOMINATOR OF AN ACCESS LAYER

What is SQL?

SQL is NOT a method of storing data!

SQL is a language, it’s just syntax

Relational Theory = thinking in sets

SQL is a language that follows (but does not obey) relational theory

With SQL we associate ACID (but durability is now optional in SQL Server 2014)

NoSQL: NOT ONLY SQL / NO SQL

Origins of NoSQL?

First NoSQL database was an open source relational database

NoSQL (really NoREL) started in the mid-2000s

Realisation that ACID doesn’t scale easily

Should really be NoACID (Mutually exclusive for some 70’s developers)

Hadoop – came out of Yahoo

Cassandra, Riak and others are derivatives of Amazon Dynamo

NoSQL basically means: ACID doesn’t scale, SQL is too restrictive, and I’m a developer and I like complexity.

But why the need for “NoSQL”?

Feb 2001

◦ BigData - http://blogs.gartner.com/doug-laney/files/2012/01/ad949-3D-Data-Management-Controlling-Data-Volume-Velocity-and-Variety.pdf

Basically Scale-Up (SAN) costs too much and doesn’t scale well

Sick of vendor lock in and associated costs – open source software running across cheap commodity machines (Redundant Array of Inexpensive Servers)

Availability, Resilience – by design – by software and not expensive hardware

Existing Relational Databases (with SQL as their only language) are expensive and too slow (ACID)

BASE v ACID

SQL implements a rigid and inflexible framework (or does it?)




Eventual Consistency in SQL Server

Asynchronous Availability Groups/Database Mirroring

Replication

Eventual / Causal Consistency
◦ Eventual: no good for order-specific [and important] transactions
◦ Like Merge replication
◦ Causal: deliver messages in correct order [e.g. service broker]
◦ Like Transactional Replication
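Delivering messages in the correct order can be sketched in Python with sequence numbers: a receiver holds back anything that arrives early until the gap is filled. This shows in-order delivery within a single conversation only, not a full causal consistency protocol, and the `OrderedInbox` class is hypothetical rather than how Service Broker is implemented.

```python
class OrderedInbox:
    """Hold back early arrivals so messages are applied in sequence order."""

    def __init__(self):
        self.next_seq = 1
        self.held = {}       # sequence number -> message waiting for its turn
        self.delivered = []

    def receive(self, seq, message):
        self.held[seq] = message
        # Deliver as many consecutive messages as are now available.
        while self.next_seq in self.held:
            self.delivered.append(self.held.pop(self.next_seq))
            self.next_seq += 1


inbox = OrderedInbox()
inbox.receive(2, "debit account")   # arrives first, but must wait
print(inbox.delivered)
inbox.receive(1, "credit account")  # fills the gap, both are delivered
print(inbox.delivered)
```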

MongoDB – Replica Set

$ mongo --host 10.0.0.1 --port 27017

ESTER 10.0.0.1
ROSIE 10.0.0.2
HAZEL 10.0.0.3

[Diagram: one primary and the secondaries, with replication from the primary to each secondary and a heart-beat between members]

• 1 Master (primary) – multiple secondaries
• 1 R/W – multiple readers
• Setup:
  • Use replication.replSetName in the mongo config file
  • On Primary:
    • rs.initiate()
    • rs.add( "---secondary address" )
    • rs.add( "---secondary address" )
    • rs.status()

MongoDB - Sharding

Standalone or Replica-Set MongoDB instances(data storage)

Shards of data (data chopped up into multiple ranges; the range a key falls in determines where it sits)

Stores configuration information about the Shards.
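Range-based routing can be sketched in Python with a sorted list of boundaries, which plays the part of the configuration information: given a shard key, it says which shard owns the range. The boundaries and shard names below are invented for illustration and are not taken from the demo cluster.

```python
from bisect import bisect_right

# Hypothetical range map: keys below "h" go to shard-a, "h" up to (not
# including) "q" go to shard-b, and everything from "q" onwards to shard-c.
boundaries = ["h", "q"]
shards = ["shard-a", "shard-b", "shard-c"]


def shard_for(key):
    """Look up which shard owns the range containing this shard key."""
    return shards[bisect_right(boundaries, key)]


for name in ["alice", "harry", "zed"]:
    print(name, "->", shard_for(name))
```

A query router only needs this small map to send a query for one key to one shard; a query without the shard key has to be sent to all of them.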

MongoDB – Sharding (with Replica-Set)

Config servers, port 27019 (shard information points to replica sets):
DAISY 10.0.0.4
POPPY 10.0.0.5
KARLI 10.0.0.6

Replica set rsDemo (one primary, the rest secondaries):
ESTER 10.0.0.1
ROSIE 10.0.0.2
HAZEL 10.0.0.3
mongod: port 27017, replSet: rsDemo
mongos: port 27020 (on ESTER, HAZEL, ROSIE)

Replica set rsDemoRS2 (one primary, the rest secondaries):
CONISTON 10.0.0.11
ULLSWATER 10.0.0.12
THIRLMERE 10.0.0.13
mongod: port 27017, replSet: rsDemoRS2

[Diagram labels: DAISY 10.0.0.4, Query Balancer, Query]

NewSQL: SCALABLE ACID!

Relational Databases catch up

Maintains ACID

Same scalability and performance as NoSQL systems

Some Vendors: Clustrix, MemSQL, NuoDB, VoltDB, Postgres-XL

Auto-sharding, auto-partitioning

Queries need to take place on same box to save latency

http://www.postgres-xl.org/overview/

Summary / Q & A / Discuss