
NoSQL, SQL, NewSQL - methods of structuring data


DESCRIPTION

Today’s environment is polyglot: it is made up of a number of different database sources and possibly types. In this session we’ll look at some of the options for storing data – relational, key/value, document etc. I’ll give an overview of SQL, NoSQL and NewSQL to give you some context for today’s world of data storage.


Page 1: NoSQL, SQL, NewSQL - methods of structuring data

Multiple ways of storing -> Data

<-SQL -> NOSQL -> NEWSQL

Tony Rogerson

@tonyrogerson

[email protected]

dataidol.com/tonyrogerson

Page 2

Agenda

Data structures
◦ Relational, Key/Value pair, Document, Graph, Column/Column Family Store

Key Concepts
◦ Hashing, Partitioning, Sharding, ACID, BASE

Technology Areas
◦ SQL, NoSQL, NewSQL

Page 3

Who-am-I

Freelance SQL Server professional and Data Specialist

Fellow BCS, MSc in BI, PGCert in Data Science

Started out in 1986 – VSAM, System W, Application System, DB2, Oracle, SQL Server since 4.21a

Awarded SQL Server MVP yearly since 97

Founded UK SQL Server User Group back in ’99, founder member of DDD, SQL Bits, SQL Relay, SQL Santa

Interested in commodity based distributed processing of Data.

Page 4

Data Structures

WAYS OF STRUCTURING DATA

Page 5

What is data?

Tony Rogerson

[email protected]

Harpenden

36 on 2014-01-01, 36 on 2014-05-01, 38 on 2014-10-15

46

44

Page 6

Data needs context and structure

Value                                                   Context
Tony Rogerson                                           FullName
[email protected]                                        Email
Harpenden                                               PostalTown
36 on 2014-01-01, 36 on 2014-05-01, 38 on 2014-10-15    {WaistInches, RecordedOn}
46                                                      ChestInches
44                                                      Age

Schema gives Context

Page 7

Relational [Tables]

People

FullName (PK)    Email               PostalTown    ChestInches    AgeYears
Tony Rogerson    [email protected]    Harpenden     46             44

WaistInches

FullName (FK)    WaistInches    RecordedDate
Tony Rogerson    36             2014-01-01
Tony Rogerson    36             2014-05-01
Tony Rogerson    38             2014-10-15


Page 8

Key/Value pair (EAV)

Entity         Attribute       Value
Person         FullName        Tony Rogerson
Person         Email           [email protected]
Person         PostalTown      Harpenden
Person         ChestInches     46
Person         Age             44
WaistInches    FullName        Tony Rogerson
WaistInches    WaistInches     36
WaistInches    RecordedDate    2014-01-01
WaistInches    FullName        Tony Rogerson
WaistInches    WaistInches     36
WaistInches    RecordedDate    2014-05-01
WaistInches    FullName        Tony Rogerson
WaistInches    WaistInches     38
WaistInches    RecordedDate    2014-10-15


Examples: Riak, Dynamo, Redis, FoundationDB etc.
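To make the entity/attribute/value layout concrete, here is a minimal Python sketch that groups EAV rows back into one record per entity instance. The `entity_id` column is an assumption added for illustration (the slide shows only entity, attribute and value), and `pivot` is a hypothetical helper name.

```python
from collections import defaultdict

# Hypothetical EAV rows: (entity, entity_id, attribute, value).
# entity_id is an assumption; it lets rows of the same record be grouped.
rows = [
    ("Person", 1, "FullName", "Tony Rogerson"),
    ("Person", 1, "PostalTown", "Harpenden"),
    ("Person", 1, "ChestInches", 46),
    ("WaistInches", 1, "FullName", "Tony Rogerson"),
    ("WaistInches", 1, "WaistInches", 36),
    ("WaistInches", 1, "RecordedDate", "2014-01-01"),
    ("WaistInches", 2, "FullName", "Tony Rogerson"),
    ("WaistInches", 2, "WaistInches", 36),
    ("WaistInches", 2, "RecordedDate", "2014-05-01"),
]


def pivot(eav_rows):
    """Group attribute/value pairs into one dict per (entity, entity_id)."""
    records = defaultdict(dict)
    for entity, entity_id, attribute, value in eav_rows:
        records[(entity, entity_id)][attribute] = value
    return dict(records)


records = pivot(rows)
print(records[("Person", 1)])
print(records[("WaistInches", 2)])
```

The schema lives in the data itself: adding a new attribute is just another row, at the cost of reassembling records on every read.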

Page 9

Document

JSON Schema:

    {
      "FullName" : "string",
      "Email" : "string",
      "PostalTown" : "string",
      "WaistInches" : {
        "WaistInches" : "number",
        "RecordedDate" : "string" },
      "ChestInches" : "number",
      "Age" : "number"
    }

JSON Document:

    {
      "FullName" : "Tony Rogerson",
      "Email" : "[email protected]",
      "PostalTown" : "Harpenden",
      "WaistInches" : [
        { "WaistInches" : 36, "RecordedDate" : "2014-01-01" },
        { "WaistInches" : 36, "RecordedDate" : "2014-05-01" } ],
      "ChestInches" : 46,
      "Age" : 44
    }

JSON vs XML discussion: http://stackoverflow.com/questions/4862310/json-and-xml-comparison


Examples: MongoDB, Couchbase, CouchDB etc.

Page 10

Schema Design

E.g. 100 machine cluster

Document Database: object data stored together (collection)

    {
      "firstName": "John",
      "lastName": "Smith",
      "isAlive": true,
      "age": 25,
      "height_cm": 167.6,
      "address": {
        "streetAddress": "21 2nd Street",
        "city": "New York",
        "state": "NY",
        "postalCode": "10021-3100"
      },
      "phoneNumbers": [
        { "type": "home", "number": "212 555-1234" },
        { "type": "office", "number": "646 555-4567" }
      ]
    }

Normal Form (Relational): object data stored separately (tables)

◦ person
◦ address
◦ phoneNumbers
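The contrast between the two designs can be sketched in Python: the same person document is split into the three tables named on the slide. The `person_id` surrogate key and the `normalise` helper are assumptions introduced here for illustration, not part of the original deck.

```python
import json

document = json.loads("""
{ "firstName": "John", "lastName": "Smith", "isAlive": true, "age": 25,
  "height_cm": 167.6,
  "address": { "streetAddress": "21 2nd Street", "city": "New York",
               "state": "NY", "postalCode": "10021-3100" },
  "phoneNumbers": [ { "type": "home", "number": "212 555-1234" },
                    { "type": "office", "number": "646 555-4567" } ] }
""")


def normalise(doc, person_id):
    """Split one document into person, address and phoneNumbers rows.

    person_id is a hypothetical surrogate key used to link the rows.
    """
    # Scalar fields stay on the person row; nested objects become tables.
    person = {k: v for k, v in doc.items() if not isinstance(v, (dict, list))}
    person["person_id"] = person_id
    address = dict(doc["address"], person_id=person_id)
    phones = [dict(phone, person_id=person_id) for phone in doc["phoneNumbers"]]
    return person, address, phones


person, address, phones = normalise(document, person_id=1)
print(person)
print(address)
print(phones)
```

Reading the whole object back from the relational form needs a join across three tables, which is cheap on one box but costly when the tables live on different machines of a cluster.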

Page 11

MongoDB Example

Use ESTER for MongoVUE

What do documents look like?

Page 12

Graph

SQL (inherently very poor performance for graph queries):
◦ Nested Sets
◦ Recursive CTE

Represents “connected” data

All about understanding and exploring relationships

Examples: Neo4j, Virtuoso, AllegroGraph.

Diagram: nodes Dave, Tony, Fred and Sid; legend: Node, Relationship
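Exploring relationships usually means traversing from node to node. The following Python sketch does a breadth-first search over an adjacency list; the edges between the four people are hypothetical, since the slide does not say who is connected to whom.

```python
from collections import deque

# Hypothetical relationships between the people on the slide.
knows = {
    "Tony": ["Dave", "Fred"],
    "Dave": ["Sid"],
    "Fred": [],
    "Sid": [],
}


def hops(graph, start, target):
    """Return the fewest relationships from start to target, or None."""
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, distance = queue.popleft()
        if node == target:
            return distance
        for neighbour in graph.get(node, []):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, distance + 1))
    return None


print(hops(knows, "Tony", "Sid"))
print(hops(knows, "Sid", "Tony"))
```

A graph database keeps this adjacency directly in storage, whereas a recursive CTE has to rediscover it with a join at every level.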

Page 13

Column

Values stored as a key-value pair:
◦ Column Name (unique)
◦ Value
◦ Timestamp

Important bit: a column may not appear in each row!

A Column Family is a container for columns and rows (like, but not, a relational table)

Relational Table: fixed columns

Column Family: columns determined by the application – flexible

Examples: Cassandra, Druid, HBase
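A column family can be pictured as a map of row keys to sparse maps of columns. This Python sketch is an illustration under assumptions: `put` and `get` are hypothetical helpers, the second row is invented, and the "newest timestamp wins" rule is a simplification of how such stores resolve competing writes.

```python
import time

# Column family: row key -> {column name: (value, timestamp)}.
people = {}


def put(family, row_key, column, value, timestamp=None):
    """Write one column; the write with the newest timestamp wins."""
    ts = time.time() if timestamp is None else timestamp
    row = family.setdefault(row_key, {})
    current = row.get(column)
    if current is None or ts >= current[1]:
        row[column] = (value, ts)


def get(family, row_key, column):
    """Return the column value, or None if this row lacks the column."""
    cell = family.get(row_key, {}).get(column)
    return None if cell is None else cell[0]


put(people, "tony", "PostalTown", "Harpenden", timestamp=2)
put(people, "tony", "ChestInches", 46, timestamp=2)
put(people, "dave", "PostalTown", "Leeds", timestamp=2)      # hypothetical row
put(people, "tony", "PostalTown", "Old Town", timestamp=1)   # stale, ignored

print(get(people, "tony", "PostalTown"))
print(get(people, "dave", "ChestInches"))
```

Row "dave" simply has no ChestInches column, which is the flexibility the slide contrasts with a fixed relational schema.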

Page 14

Column storage

Examples: Cassandra, Druid, HBase

http://www.datastax.com/docs/1.1/ddl/column_family

Stored as…

Page 15

SQL Server Columnstore

Table sliced into rowgroups (a group of rows – a batch)

Each rowgroup compressed in column-wise manner

Column segment is a column of data from within the rowgroup

Column segment per column in table which is then compressed onto storage.

So: a table has rows (sliced into rowgroups), and each rowgroup holds one compressed column segment per column in the table
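The rowgroup and column segment layout can be sketched in Python as follows. This is an illustration only: the tiny rowgroup size is chosen for readability, `to_rowgroups` is a hypothetical helper, and run-length encoding stands in for compression rather than reproducing SQL Server's actual algorithms.

```python
def to_rowgroups(rows, columns, rowgroup_size):
    """Slice rows into rowgroups, then store each rowgroup column-wise."""
    rowgroups = []
    for start in range(0, len(rows), rowgroup_size):
        batch = rows[start:start + rowgroup_size]
        # One column segment per column: all values of that column in the batch.
        segments = {name: [row[i] for row in batch]
                    for i, name in enumerate(columns)}
        rowgroups.append(segments)
    return rowgroups


def run_length_encode(segment):
    """Stand-in compression: collapse runs of equal values to [value, count]."""
    encoded = []
    for value in segment:
        if encoded and encoded[-1][0] == value:
            encoded[-1][1] += 1
        else:
            encoded.append([value, 1])
    return encoded


columns = ("FullName", "WaistInches", "RecordedDate")
rows = [
    ("Tony Rogerson", 36, "2014-01-01"),
    ("Tony Rogerson", 36, "2014-05-01"),
    ("Tony Rogerson", 38, "2014-10-15"),
]

rowgroups = to_rowgroups(rows, columns, rowgroup_size=2)
print(rowgroups[0]["WaistInches"])
print(run_length_encode(rowgroups[0]["FullName"]))
```

Values within one column tend to repeat, which is why compressing column-wise usually beats compressing whole rows.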

Page 16

Demo: SQL Sparse columns

Page 17

Key Concepts

SHARDING, PARTITIONING, HASHING

Page 18

Hashing

Distributed Database Cluster has a fixed number of data nodes

Your data is spread across the database cluster
◦ 10 node cluster; each data item may reside on 3 nodes
◦ Which 3 nodes?

Data key is hashed to a number – the hashing algorithm is deterministic

data-node = f( data-key )
◦ print abs( checksum( 'All hale to the ale' ) * 1. ) % 10
◦ print abs( checksum( 'And a glass of wine for the ladies' ) * 1. ) % 10

(abs() keeps the node number non-negative, because checksum() can return a negative integer)
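The same idea in Python, as a minimal sketch: a deterministic hash of the key picks the first node, and the remaining copies go to the next nodes in sequence. The MD5 hash and the "next nodes along" replica placement are assumptions chosen for illustration; real systems use their own hash functions and placement strategies.

```python
import hashlib

NODE_COUNT = 10   # fixed number of data nodes, as on the slide
REPLICAS = 3      # each data item resides on 3 nodes


def nodes_for(key, node_count=NODE_COUNT, replicas=REPLICAS):
    """Map a data key to the list of nodes that should hold it."""
    # MD5 is deterministic across runs, unlike Python's built-in hash() for str.
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    first = int(digest, 16) % node_count
    # Assumed placement: the first node plus the following ones, wrapping round.
    return [(first + offset) % node_count for offset in range(replicas)]


print(nodes_for("All hale to the ale"))
print(nodes_for("And a glass of wine for the ladies"))
```

Because the function is deterministic, any client can work out where a key lives without asking a central directory. Note that a plain modulo remaps most keys if the node count changes, which is why the slide stresses a fixed number of nodes.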

Page 19

Partitioning

Chop big table up into “horizontal partitions”

Partition key required

Each partition is self-contained binding rows by the partitioning key

Access all data through logical view over all partitions

Table by table basis
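A minimal Python sketch of horizontal partitioning, assuming the year of `RecordedDate` as the partition key: each partition holds only its own rows, and a logical view chains all partitions together. The partition boundaries and helper names are hypothetical.

```python
from itertools import chain

# One self-contained partition per year (the assumed partition key).
partitions = {2013: [], 2014: [], 2015: []}


def insert(row):
    """Route a row to its partition using the partition key."""
    year = int(row["RecordedDate"][:4])
    partitions[year].append(row)   # raises KeyError if no partition covers it


def all_rows():
    """Logical view over every partition, in partition-key order."""
    return chain.from_iterable(partitions[year] for year in sorted(partitions))


def rows_for_year(year):
    """Touch only the one partition that can contain the answer."""
    return partitions.get(year, [])


insert({"FullName": "Tony Rogerson", "WaistInches": 36, "RecordedDate": "2014-01-01"})
insert({"FullName": "Tony Rogerson", "WaistInches": 38, "RecordedDate": "2014-10-15"})
insert({"FullName": "Tony Rogerson", "WaistInches": 39, "RecordedDate": "2015-02-01"})  # invented row

print(len(list(all_rows())))
print(len(rows_for_year(2014)))
```

Queries that filter on the partition key read one partition; everything else goes through the view and scans them all.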

Page 20

Shared Nothing Partitioning+

Each Shard is self-contained and has all the procs, meta-data and of course your partition of data

Shard Key common to multiple tables, for example CustomerID, Email Address.

Greater autonomy across the distributed database

Seeing the entire database as a logical unit is more difficult – joining is a nightmare

Diagram: Node 1, Node 2, Node 3
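Sharding on a key common to several tables can be sketched in Python like this. The assumptions are labelled: `CustomerID` modulo the shard count picks the shard, each shard carries its own `customers` and `orders` tables, and all names are hypothetical.

```python
SHARD_COUNT = 3


def new_shard():
    """Each shard is self-contained: its own copy of every table."""
    return {"customers": {}, "orders": []}


shards = [new_shard() for _ in range(SHARD_COUNT)]


def shard_for(customer_id):
    """The shard key (CustomerID) decides where all of a customer's rows live."""
    return shards[customer_id % SHARD_COUNT]


def add_customer(customer_id, name):
    shard_for(customer_id)["customers"][customer_id] = name


def add_order(customer_id, item):
    shard_for(customer_id)["orders"].append((customer_id, item))


def orders_with_names(customer_id):
    """A join that stays inside one shard because both tables share the key."""
    shard = shard_for(customer_id)
    name = shard["customers"][customer_id]
    return [(name, item) for cid, item in shard["orders"] if cid == customer_id]


add_customer(7, "Tony")
add_order(7, "ale")
add_customer(8, "Dave")
add_order(8, "wine")

print(orders_with_names(7))
```

A question such as "all orders for every customer" has to visit every shard and merge the results, which is the cross-shard joining pain the slide warns about.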

Page 21

Sharding

Diagram: Node 1, Node 2, Node 3; labels: Sync, Replication, Full copy of data, Subset of data

Page 22

ACID (Atomicity, Consistency, Isolation, Durability)
BASE (Basically Available, Soft-state, Eventually Consistent)

ACID is a transactional model

Not specific to the relational database
◦ e.g. Hive (interface to Hadoop) provides ACID facilities

Durability: write-ahead logging is expensive (latency from serialisation of writes)

Distributed transactions – Two Phase Commit (2PC)
◦ Poor scalability because of latency
◦ ACID across distributed nodes is a bad design choice
◦ Partition/shard the database and keep ACID in-node only

Diagram: a 2PC transaction; the Coordinator sends an INSERT to two Subordinates; all or nothing
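The coordinator/subordinate exchange can be sketched in Python as below. This is a simplified in-memory model under assumptions: there is no network, logging or timeout handling, and the class and method names are hypothetical.

```python
class Subordinate:
    def __init__(self, name, can_commit=True):
        self.name = name
        self.can_commit = can_commit   # simulates a node that may vote "no"
        self.pending = None
        self.rows = []

    def prepare(self, row):
        """Phase 1: stage the row and vote yes or no."""
        if self.can_commit:
            self.pending = row
        return self.can_commit

    def commit(self):
        """Phase 2 (success): make the staged row permanent."""
        self.rows.append(self.pending)
        self.pending = None

    def rollback(self):
        """Phase 2 (failure): discard the staged row."""
        self.pending = None


def two_phase_insert(subordinates, row):
    """Coordinator: commit everywhere only if every subordinate votes yes."""
    votes = [node.prepare(row) for node in subordinates]
    if all(votes):
        for node in subordinates:
            node.commit()
        return True
    for node in subordinates:
        node.rollback()
    return False


a, b = Subordinate("A"), Subordinate("B")
print(two_phase_insert([a, b], "row 1"))

c = Subordinate("C", can_commit=False)
print(two_phase_insert([a, b, c], "row 2"))
print(a.rows, b.rows, c.rows)
```

Every insert costs two round trips to every participant, and the slowest node sets the pace, which is where the latency and scalability problem comes from.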

Page 23

ACID (Atomicity, Consistency, Isolation, Durability)
BASE (Basically Available, Soft-state, Eventually Consistent)

BASE is a transactional model (of sorts)

Specific to the distributed database model

Basically Available – all or some of the system is available

Diagram: Node 1, Node 2, Node 3

Page 24

ACID (Atomicity, Consistency, Isolation, Durability)
BASE (Basically Available, Soft-state, Eventually Consistent)

Soft-state, Eventually Consistent

System may change over time [as replicas become up-to-date (consistent)]

Diagram: Node 1, Node 2, Node 3; Insert value ‘A’
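Soft state and eventual consistency can be illustrated with a small Python simulation. It is a sketch under assumptions: the write lands on the first node, the other replicas catch up one at a time from a backlog, and the `Cluster` class is hypothetical.

```python
class Replica:
    def __init__(self, name):
        self.name = name
        self.value = None


class Cluster:
    def __init__(self, names):
        self.replicas = [Replica(name) for name in names]
        self.backlog = []   # (replica, value) pairs not yet delivered

    def insert(self, value):
        """Acknowledge the write after updating only the first replica."""
        first, *others = self.replicas
        first.value = value
        self.backlog.extend((replica, value) for replica in others)

    def replicate_one(self):
        """Deliver one outstanding update; called some time later."""
        if self.backlog:
            replica, value = self.backlog.pop(0)
            replica.value = value

    def read_all(self):
        return [replica.value for replica in self.replicas]


cluster = Cluster(["Node 1", "Node 2", "Node 3"])
cluster.insert("A")
print(cluster.read_all())   # readers may still see stale (empty) replicas
cluster.replicate_one()
cluster.replicate_one()
print(cluster.read_all())   # all replicas have converged
```

Between the insert and the last replication step, the answer to a read depends on which node serves it; that window is the soft state.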

Page 25

SQL

AH – THE COMMON DENOMINATOR OF AN ACCESS LAYER

Page 26

What is SQL?

SQL is NOT a method of storing data!

SQL is a language, it’s just syntax

Relational Theory = thinking in sets

SQL is a language that follows (but does not obey) relational theory

With SQL we associate ACID (but durability is now optional in SQL Server 2014, through delayed durability)

Page 27

NoSQL

NOT ONLY SQL

NO SQL

Page 28

Origins of NoSQL?

The first database called NoSQL was an open source relational database

NoSQL (really NoREL) started in the mid-2000s

Realisation that ACID doesn’t scale easily

Should really be NoACID (mutually exclusive for some ’70s developers)

Hadoop – came out of Yahoo

Cassandra, Riak and others are derivatives of Amazon Dynamo

NoSQL basically means: ACID doesn’t scale, SQL is too restrictive, and I’m a developer and I like complexity.

Page 29

But why the need for “NoSQL”?

Feb 2001
◦ BigData - http://blogs.gartner.com/doug-laney/files/2012/01/ad949-3D-Data-Management-Controlling-Data-Volume-Velocity-and-Variety.pdf

Basically Scale-Up (SAN) costs too much and doesn’t scale well

Sick of vendor lock in and associated costs – open source software running across cheap commodity machines (Redundant Array of Inexpensive Servers)

Availability, Resilience – by design – by software and not expensive hardware

Existing Relational Databases (with SQL as their only language) are expensive and too slow (ACID)

BASE v ACID

SQL implements a rigid and inflexible framework (or does it?)

Page 30

Eventual Consistency in SQL Server

Asynchronous Availability Groups/Database Mirroring

Replication

Eventual / Causal Consistency
◦ Eventual: no good for order-specific [and important] transactions
◦ Like Merge Replication
◦ Causal: deliver messages in correct order [e.g. Service Broker]
◦ Like Transactional Replication

Page 31

MongoDB – Replica Set

$ mongo --host 10.0.0.1 --port 27017

Diagram: replica set of ROSIE (10.0.0.2), ESTER (10.0.0.1) and HAZEL (10.0.0.3); labels: primary, secondaries, replication, heart-beat

• 1 Master – Multiple Secondaries
• 1 R/W – Multiple Readers
• Setup:
  • Use replication.replSetName in mongo config file
  • On Primary:
    • rs.initiate()
    • rs.add( "---secondary address" )
    • rs.add( "---secondary address" )
    • rs.status()

Page 32

MongoDB - Sharding

Standalone or Replica-Set MongoDB instances (data storage)

Shards of data (data chopped up into multiple ranges; the range a document falls in determines which shard it sits on)

Stores configuration information about the Shards.

Page 33

MongoDB – Sharding (with Replica-Set)

Diagram components:
◦ Config servers, port 27019 (shard information points to the replica sets): DAISY 10.0.0.4, POPPY 10.0.0.5, KARLI 10.0.0.6
◦ Replica set rsDemo (mongod: port 27017): ROSIE 10.0.0.2, ESTER 10.0.0.1, HAZEL 10.0.0.3; labels: primary, secondaries, replication, heart-beat
◦ mongos: port 27020 (on ESTER, HAZEL, ROSIE)
◦ Replica set rsDemoRS2 (mongod: port 27017): THIRLMERE 10.0.0.13, CONISTON 10.0.0.11, ULLSWATER 10.0.0.12; labels: primary, secondaries, replication, heart-beat
◦ Query Balancer: DAISY 10.0.0.4
◦ Query

Page 34

NewSQL

SCALABLE ACID!

Page 35

Relational Databases catch up

Maintains ACID

Same scalability and performance as NoSQL systems

Some Vendors: Clustrix, MemSQL, NuoDB, VoltDB, Postgres-XL

Auto-sharding, auto-partitioning

Queries need to take place on same box to save latency

http://www.postgres-xl.org/overview/

Page 36

Summary / Q & A / Discuss