Today’s data environment is polyglot: it is made up of a number of different database sources and possibly types. In this session we’ll look at some of the options for storing data – relational, key/value, document etc. I’ll give an overview of what SQL, NoSQL and NewSQL are, to give you some context for today’s world of data storage.
Multiple ways of storing -> Data
<-SQL -> NOSQL -> NEWSQL
Tony Rogerson
@tonyrogerson
[email protected]/tonyrogerson
Agenda
Data structures
◦ Relational, Key/Value pair, Document, Graph, Column/Column Family Store
Key Concepts
◦ Hashing, Partitioning, Sharding, ACID, BASE
Technology Areas
◦ SQL, NoSQL, NewSQL
Who-am-I
Freelance SQL Server professional and Data Specialist
Fellow BCS, MSc in BI, PGCert in Data Science
Started out in 1986 – VSAM, System W, Application System, DB2, Oracle, SQL Server since 4.21a
Awarded SQL Server MVP yearly since 97
Founded UK SQL Server User Group back in ’99, founder member of DDD, SQL Bits, SQL Relay, SQL Santa
Interested in commodity-based distributed processing of data.
Data Structures
WAYS OF STRUCTURING DATA
What is data?
Tony Rogerson
Harpenden
36 on 2014-01-01, 36 on 2014-05-01, 38 on 2014-10-15
46
44
Data needs context and structure
Tony Rogerson FullName
[email protected] Email
Harpenden PostalTown
36 on 2014-01-01, 36 on 2014-05-01, 38 on 2014-10-15 {WaistInches, RecordedOn}
46 ChestInches
44 Age
Schema gives Context
Relational [Tables]
People table:
FullName (PK) Email PostalTown WaistInches ChestInches AgeYears
Tony Rogerson [email protected] Harpenden 46 44
WaistInches table:
FullName (FK) WaistInches RecordedDate
Tony Rogerson 36 2014-01-01
Tony Rogerson 36 2014-05-01
Tony Rogerson 38 2014-10-01
Key/Value pair (EAV)
Entity Attribute Value
Person FullName Tony Rogerson
Person Email [email protected]
Person PostalTown Harpenden
Person ChestInches 46
Person Age 44
WaistInches FullName Tony Rogerson
WaistInches WaistInches 36
WaistInches RecordedDate 2014-01-01
WaistInches FullName Tony Rogerson
WaistInches WaistInches 36
WaistInches RecordedDate 2014-05-01
WaistInches FullName Tony Rogerson
WaistInches WaistInches 38
WaistInches RecordedDate 2014-10-01
Examples: Riak, Dynamo, Redis, FoundationDB etc.
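As a rough illustration of how an application reads EAV data back, the sketch below pivots (entity, attribute, value) triples into one record per entity. It is a minimal sketch, not how any of the listed products work internally; the `pivot` helper is a hypothetical name. Only the Person rows are used, because the waist readings on the slide share one entity name and would overwrite one another without an extra row identifier.

```python
from collections import defaultdict

# (entity, attribute, value) triples copied from the slide's Person rows.
eav_rows = [
    ("Person", "FullName", "Tony Rogerson"),
    ("Person", "PostalTown", "Harpenden"),
    ("Person", "ChestInches", 46),
    ("Person", "Age", 44),
]


def pivot(rows):
    """Group EAV triples into one attribute -> value dict per entity."""
    records = defaultdict(dict)
    for entity, attribute, value in rows:
        records[entity][attribute] = value
    return dict(records)


person = pivot(eav_rows)["Person"]
print(person)
```

The schema lives entirely in the attribute names, which is what makes the model flexible and also what makes typing and validation the application's problem.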
Document
JSON Schema
{
  "FullName" : "string",
  "Email" : "string",
  "PostalTown" : "string",
  "WaistInches" : [ {
    "WaistInches" : "number",
    "RecordedDate" : "string" } ],
  "ChestInches" : "number",
  "Age" : "number"
}
JSON Document
{
  "FullName" : "Tony Rogerson",
  "Email" : "[email protected]",
  "PostalTown" : "Harpenden",
  "WaistInches" : [
    { "WaistInches" : 36, "RecordedDate" : "2014-01-01" },
    { "WaistInches" : 36, "RecordedDate" : "2014-05-01" } ],
  "ChestInches" : 46,
  "Age" : 44
}
JSON vs XML discussion: http://stackoverflow.com/questions/4862310/json-and-xml-comparison
Examples: MongoDB, Couchbase, CouchDB etc.
Schema Design
Document Database vs. Normal Form (Relational)
E.g. 100 machine cluster
{
  "firstName": "John",
  "lastName": "Smith",
  "isAlive": true,
  "age": 25,
  "height_cm": 167.6,
  "address": {
    "streetAddress": "21 2nd Street",
    "city": "New York",
    "state": "NY",
    "postalCode": "10021-3100"
  },
  "phoneNumbers": [
    { "type": "home", "number": "212 555-1234" },
    { "type": "office", "number": "646 555-4567" }
  ]
}
Document Database: object data stored together (collection)
Normal Form (Relational): object data stored separately (tables: person, address, phoneNumbers)
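To make the contrast concrete, the sketch below takes the slide's single document and splits it into the three relational row sets (person, address, phoneNumbers). It is only an illustration: the `normalise` function and the `personId` surrogate key are assumptions added here to link the rows, not part of the original deck.

```python
import json

document = json.loads("""
{ "firstName": "John", "lastName": "Smith", "isAlive": true, "age": 25,
  "height_cm": 167.6,
  "address": { "streetAddress": "21 2nd Street", "city": "New York",
               "state": "NY", "postalCode": "10021-3100" },
  "phoneNumbers": [ { "type": "home", "number": "212 555-1234" },
                    { "type": "office", "number": "646 555-4567" } ] }
""")


def normalise(doc, person_id):
    """Split one document into person, address and phoneNumbers rows.

    person_id is a hypothetical surrogate key used as the foreign key.
    """
    person = {"personId": person_id}
    # Scalar fields stay on the person row; nested objects/arrays move out.
    person.update({k: v for k, v in doc.items()
                   if not isinstance(v, (dict, list))})
    address = {"personId": person_id, **doc["address"]}
    phone_numbers = [{"personId": person_id, **p} for p in doc["phoneNumbers"]]
    return person, address, phone_numbers


person, address, phone_numbers = normalise(document, 1)
print(person)
print(address)
print(phone_numbers)
```

Reading the whole object back from the relational form needs a join across three tables; in the document form it is one read from one collection, which matters once the data is spread over many machines.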
MongoDB Example
Use ESTER for MongoVUE
What do documents look like?
Graph
SQL (inherently very poor performance):
◦ Nested Sets
◦ Recursive CTE
Represents “connected” data
All about understanding and exploring relationships
Examples: Neo4j, Virtuoso, AllegroGraph.
[Figure: graph of four nodes (Dave, Tony, Fred, Sid) joined by relationships; legend: Node, Relationship]
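A small sketch of what "exploring relationships" means in practice: nodes and relationships held as an adjacency map, and a breadth-first search that counts the hops between two people. The node names come from the slide, but the relationships between them are not recoverable from the transcript, so the edge list below is purely hypothetical.

```python
from collections import deque

# Hypothetical relationships; the slide's actual edges are not known.
relationships = [("Tony", "Dave"), ("Tony", "Fred"), ("Fred", "Sid")]


def build_graph(edges):
    """Build an undirected adjacency map: node -> set of neighbours."""
    graph = {}
    for a, b in edges:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    return graph


def hops(graph, start, goal):
    """Breadth-first search; returns the fewest hops, or None if unconnected."""
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, distance = queue.popleft()
        if node == goal:
            return distance
        for neighbour in graph.get(node, ()):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, distance + 1))
    return None


graph = build_graph(relationships)
print(hops(graph, "Dave", "Sid"))
```

A graph database keeps the neighbour links with each node, so each hop is a direct lookup; the SQL approaches listed above have to rediscover the links with a self-join per level.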
Column
Values stored as a key-value pair
◦ Column Name (unique)
◦ Value
◦ Timestamp
Important bit: It may not appear in each row!
Column Family is a container for columns and rows (like, but not, a relational table)
Relational Table: Fixed Columns
Column Family: determined by application – flexible
Examples: Cassandra, Druid, HBase
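The points above can be sketched as a tiny in-memory column family: each row holds only the columns the application chose to write, and each column carries a name, a value and a timestamp. This is a minimal sketch with hypothetical class names; the "newest timestamp wins" rule is an assumption modelled on Cassandra-style stores, not something stated on the slide.

```python
import time
from dataclasses import dataclass


@dataclass
class Column:
    name: str
    value: object
    timestamp: float


class ColumnFamily:
    """Container of rows; each row is a sparse map of column name -> Column."""

    def __init__(self):
        self.rows = {}

    def put(self, row_key, name, value, timestamp=None):
        ts = time.time() if timestamp is None else timestamp
        columns = self.rows.setdefault(row_key, {})
        current = columns.get(name)
        # Assumption: the write with the newest timestamp wins.
        if current is None or ts >= current.timestamp:
            columns[name] = Column(name, value, ts)

    def get(self, row_key, name):
        column = self.rows.get(row_key, {}).get(name)
        return None if column is None else column.value


people = ColumnFamily()
people.put("tony", "PostalTown", "Harpenden", timestamp=1)
people.put("tony", "ChestInches", 46, timestamp=1)
people.put("fred", "ChestInches", 40, timestamp=1)  # no PostalTown column at all
print(people.get("fred", "PostalTown"))
```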
Column storage
Examples: Cassandra, Druid, HBase
http://www.datastax.com/docs/1.1/ddl/column_family
Stored as…
SQL Server Columnstore
Table sliced into rowgroups (a group of rows – a batch)
Each rowgroup compressed in column-wise manner
Column segment is a column of data from within the rowgroup
There is one column segment per column in the table, which is then compressed onto storage.
SO: a table has rows (sliced into rowgroups), rowgroups have columns (each column having a column segment)
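The rowgroup and column-segment layout can be sketched as follows, using the waist readings from earlier in the deck. The function names are hypothetical, the rowgroup size of 2 is chosen only to keep the example small, and run-length encoding is used purely as a stand-in for compression; it is not a claim about the algorithms SQL Server actually applies.

```python
rows = [
    {"FullName": "Tony Rogerson", "WaistInches": 36, "RecordedDate": "2014-01-01"},
    {"FullName": "Tony Rogerson", "WaistInches": 36, "RecordedDate": "2014-05-01"},
    {"FullName": "Tony Rogerson", "WaistInches": 38, "RecordedDate": "2014-10-01"},
]


def to_rowgroups(table, rowgroup_size):
    """Slice the table into batches of rows (rowgroups)."""
    return [table[i:i + rowgroup_size] for i in range(0, len(table), rowgroup_size)]


def to_column_segments(rowgroup):
    """Turn one rowgroup into one list of values (segment) per column."""
    return {name: [row[name] for row in rowgroup] for name in rowgroup[0]}


def run_length_encode(values):
    """Stand-in compression: collapse runs of equal values into (value, count)."""
    encoded = []
    for value in values:
        if encoded and encoded[-1][0] == value:
            encoded[-1] = (value, encoded[-1][1] + 1)
        else:
            encoded.append((value, 1))
    return encoded


rowgroups = to_rowgroups(rows, rowgroup_size=2)
segments = [to_column_segments(group) for group in rowgroups]
print(run_length_encode(segments[0]["WaistInches"]))
```

Values within one column tend to repeat, which is why compressing column by column works better than compressing row by row.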
Demo: SQL Sparse columns
Key Concepts
SHARDING, PARTITIONING, HASHING
Hashing
Distributed Database Cluster has fixed number of data nodes
Your data is spread across the database cluster
◦ 10 node cluster; each data item may reside on 3 nodes
◦ Which 3 nodes?
Data key is Hashed to a number – hashing algorithm is deterministic
data-node = f( data-key )
◦ print ( checksum( 'All hale to the ale' ) * 1.) % 10
◦ print ( checksum( 'And a glass of wine for the ladies' ) * 1.) % 10
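The same idea, data-node = f( data-key ), can be sketched in Python. This is an illustration under assumptions: MD5 is used only because it is deterministic across runs (Python's built-in `hash()` is salted per process), and placing the three copies on consecutive nodes is one simple placement rule chosen here, not the rule any particular product uses. T-SQL's CHECKSUM returns a signed integer, so the modulo on the slide can come out negative; the unsigned hash below avoids that.

```python
import hashlib


def node_for_key(data_key, node_count=10):
    """Deterministically map a key to a node number in 0..node_count-1."""
    digest = hashlib.md5(data_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % node_count


def replica_nodes(data_key, node_count=10, replicas=3):
    """Assumed placement rule: the hashed node plus the next two, wrapping round."""
    first = node_for_key(data_key, node_count)
    return [(first + offset) % node_count for offset in range(replicas)]


for key in ("All hale to the ale", "And a glass of wine for the ladies"):
    print(key, "->", replica_nodes(key))
```

Because the function is deterministic, any client can work out which nodes hold a key without asking a central directory.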
Partitioning
Chop big table up into “horizontal partitions”
Partition key required
Each partition is self-contained, binding rows by the partitioning key
Access all data through logical view over all partitions
Table by table basis
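A minimal sketch of horizontal partitioning for a single table: rows are routed to a partition by comparing the partition key against range boundaries, and a logical view reads across all partitions. The class name, the choice of RecordedDate as the partition key and the boundary values are all assumptions made for illustration.

```python
import bisect
from itertools import chain


class PartitionedTable:
    def __init__(self, partition_key, boundaries):
        self.partition_key = partition_key
        self.boundaries = sorted(boundaries)
        # One more partition than boundaries: below, between, and above.
        self.partitions = [[] for _ in range(len(self.boundaries) + 1)]

    def partition_index(self, key_value):
        """A boundary value belongs to the partition on its right."""
        return bisect.bisect_right(self.boundaries, key_value)

    def insert(self, row):
        index = self.partition_index(row[self.partition_key])
        self.partitions[index].append(row)

    def view(self):
        """Logical view over all partitions."""
        return chain.from_iterable(self.partitions)


waist = PartitionedTable("RecordedDate", ["2014-05-01", "2014-10-01"])
for inches, recorded in [(36, "2014-01-01"), (36, "2014-05-01"), (38, "2014-10-01")]:
    waist.insert({"WaistInches": inches, "RecordedDate": recorded})

print([len(partition) for partition in waist.partitions])
print(list(waist.view()))
```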
Shared Nothing Partitioning+
Each Shard is self-contained and has all the procs, meta-data and of course your partition of data
Shard Key common to multiple tables, for example CustomerID, Email Address.
Greater autonomy across the distributed database
Seeing the entire database as a logical unit is more difficult – joining is a nightmare
[Figure: Sharding vs Replication across Node 1, Node 2 and Node 3. Sharding: each node holds a subset of the data. Replication: each node holds a full copy of the data, kept in sync.]
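The sharding points above can be sketched as follows: one shard key (CustomerID) routes rows of more than one table to the same shard, so per-customer work stays local, while any whole-database question has to visit every shard. The class names, the orders table and the modulo routing rule are assumptions made for illustration.

```python
class Shard:
    """Self-contained unit: its own slice of every sharded table."""

    def __init__(self):
        self.customers = {}  # CustomerID -> name
        self.orders = []     # (CustomerID, amount)


class ShardedDatabase:
    def __init__(self, shard_count):
        self.shards = [Shard() for _ in range(shard_count)]

    def shard_for(self, customer_id):
        # Assumed routing rule: shard key modulo the number of shards.
        return self.shards[customer_id % len(self.shards)]

    def add_customer(self, customer_id, name):
        self.shard_for(customer_id).customers[customer_id] = name

    def add_order(self, customer_id, amount):
        self.shard_for(customer_id).orders.append((customer_id, amount))

    def orders_for(self, customer_id):
        """Single-shard query: customer and orders live on the same shard."""
        shard = self.shard_for(customer_id)
        return [amount for cid, amount in shard.orders if cid == customer_id]

    def total_sales(self):
        """Cross-shard query: every shard must be asked and results combined."""
        return sum(amount for shard in self.shards for _, amount in shard.orders)


db = ShardedDatabase(shard_count=3)
db.add_customer(1, "Tony")
db.add_customer(2, "Fred")
db.add_order(1, 10)
db.add_order(1, 5)
db.add_order(2, 20)
print(db.orders_for(1), db.total_sales())
```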
ACID (Atomicity, Consistency, Isolation, Durability)
BASE (Basically Available, Soft-state, Eventually Consistent)
ACID is a Transactional model
Not specific to the relational database
◦ e.g. Hive (an interface to Hadoop) provides ACID facilities
Durability: write-ahead logging is expensive (latency from serialisation of writes)
Distributed transactions – Two Phase Commit (2PC)
◦ Poor scalability because of latency
◦ ACID across distributed nodes is a bad design choice
◦ Partition/shard the database and keep ACID in-node only
[Figure: 2PC transaction. A Coordinator sends an INSERT to two Subordinates; the outcome is all or nothing.]
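A stripped-down sketch of the two-phase commit flow: the coordinator first asks every subordinate to prepare, and only if all vote yes does it tell them to commit; otherwise everyone rolls back. It ignores timeouts, crashes and logging, which are exactly the parts that make real 2PC slow, and all class and function names are hypothetical.

```python
class Subordinate:
    def __init__(self, name, can_commit=True):
        self.name = name
        self.can_commit = can_commit  # simulates a node that votes "no"
        self.staged = None
        self.committed = []

    def prepare(self, row):
        """Phase 1: stage the row and vote yes, or vote no."""
        if not self.can_commit:
            return False
        self.staged = row
        return True

    def commit(self):
        """Phase 2 (all voted yes): make the staged row permanent."""
        self.committed.append(self.staged)
        self.staged = None

    def rollback(self):
        """Phase 2 (someone voted no): discard the staged row."""
        self.staged = None


def two_phase_insert(row, subordinates):
    """Coordinator logic: all or nothing across every subordinate."""
    votes = [node.prepare(row) for node in subordinates]
    if all(votes):
        for node in subordinates:
            node.commit()
        return True
    for node in subordinates:
        node.rollback()
    return False


healthy = [Subordinate("node-1"), Subordinate("node-2")]
print(two_phase_insert({"value": "A"}, healthy))

broken = [Subordinate("node-1"), Subordinate("node-2", can_commit=False)]
print(two_phase_insert({"value": "A"}, broken))
```

Every insert costs at least two round trips to every participant, which is where the latency and the poor scalability come from.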
ACID (Atomicity, Consistency, Isolation, Durability)
BASE (Basically Available, Soft-state, Eventually Consistent)
BASE is a transactional model(ish)
Specific to Distributed database model
Basically Available – all or some of the system is available
[Figure: three-node cluster (Node 1, Node 2, Node 3)]
ACID (Atomicity, Consistency, Isolation, Durability)
BASE (Basically Available, Soft-state, Eventually Consistent)
Soft-state
Eventually Consistent
System may change over time [as replicas become up-to-date (consistent)]
[Figure: value ‘A’ is inserted into a cluster of Node 1, Node 2 and Node 3]
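A toy model of soft state and eventual consistency: a write lands on one node immediately and is queued for the others, so for a while different nodes give different answers, and after the queue drains they agree. The class names and the explicit `sync()` step are assumptions for illustration; real systems propagate changes in the background.

```python
class Replica:
    def __init__(self, name):
        self.name = name
        self.data = {}


class Cluster:
    def __init__(self, names):
        self.replicas = [Replica(name) for name in names]
        self.pending = []  # changes not yet applied to every replica

    def write(self, key, value):
        """Apply to the first node now; queue the change for the rest."""
        self.replicas[0].data[key] = value
        for replica in self.replicas[1:]:
            self.pending.append((replica, key, value))

    def read(self, node_index, key):
        """Read from one specific node; may return stale (or no) data."""
        return self.replicas[node_index].data.get(key)

    def sync(self):
        """Drain the queue; afterwards all replicas are consistent."""
        for replica, key, value in self.pending:
            replica.data[key] = value
        self.pending.clear()


cluster = Cluster(["Node 1", "Node 2", "Node 3"])
cluster.write("k", "A")
before = [cluster.read(i, "k") for i in range(3)]
cluster.sync()
after = [cluster.read(i, "k") for i in range(3)]
print(before, after)
```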
SQL
AH – THE COMMON DENOMINATOR OF AN ACCESS LAYER
What is SQL?
SQL is NOT a method of storing data!
SQL is a language, it’s just syntax
Relational Theory = thinking in sets
SQL is a language that follows (but does not obey) relational theory
With SQL we associate ACID (but durability is now optional in SQL Server 2014)
NoSQL
NOT ONLY SQL
NO SQL
Origins
NoSQL? First NoSQL database was an open source relational database
NoSQL (really NoREL) started in the mid-2000s
Realisation that ACID doesn’t scale easily
Should really be NoACID (Mutually exclusive for some ’70s developers)
Hadoop – came out of Yahoo
Cassandra, Riak and others are derivatives of Amazon Dynamo
NoSQL basically means: ACID doesn’t scale, SQL is too restrictive, and I’m a developer and I like complexity.
But why the need for “NoSQL”?
Feb 2001
◦ BigData - http://blogs.gartner.com/doug-laney/files/2012/01/ad949-3D-Data-Management-Controlling-Data-Volume-Velocity-and-Variety.pdf
Basically Scale-Up (SAN) costs too much and doesn’t scale well
Sick of vendor lock in and associated costs – open source software running across cheap commodity machines (Redundant Array of Inexpensive Servers)
Availability, Resilience – by design – by software and not expensive hardware
Existing Relational Databases (with SQL as their only language) are expensive and too slow (ACID)
BASE v ACID
SQL implements a rigid and inflexible framework (or does it?)
Eventual Consistency in SQL Server
Asynchronous Availability Groups/Database Mirroring
Replication
Eventual / Causal Consistency
◦ Eventual: no good for order-specific [and important] transactions
  ◦ Like Merge replication
◦ Causal: deliver messages in correct order [e.g. Service Broker]
  ◦ Like Transactional Replication
MongoDB – Replica Set
$ mongo --host 10.0.0.1 --port 27017
[Figure: replica set of three hosts, ESTER (10.0.0.1), ROSIE (10.0.0.2) and HAZEL (10.0.0.3): one primary and two secondaries, linked by replication and heart-beat]
• 1 Master – Multiple Secondaries
• 1 R/W – Multiple Readers
• Setup:
  • Use replication.replSetName in mongo config file
  • On Primary:
    • rs.initiate()
    • rs.add( "---secondary address" )
    • rs.add( "---secondary address" )
    • rs.status()
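For orientation, the end result of the setup steps is a replica set configuration document: a set name plus a numbered list of member hosts. The sketch below only builds that document as a Python dict and prints it as JSON; it does not talk to MongoDB. The set name rsDemo and the three addresses are taken from the deck's diagrams, while the helper name and the choice to list ESTER first are assumptions.

```python
import json


def replica_set_config(set_name, hosts, port=27017):
    """Build a replica set configuration document (_id plus members)."""
    return {
        "_id": set_name,
        "members": [
            {"_id": index, "host": f"{host}:{port}"}
            for index, host in enumerate(hosts)
        ],
    }


# Addresses from the diagram: ESTER, ROSIE, HAZEL.
config = replica_set_config("rsDemo", ["10.0.0.1", "10.0.0.2", "10.0.0.3"])
print(json.dumps(config, indent=2))
```

The set name has to match the replication.replSetName value in each member's config file.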
MongoDB - Sharding
Standalone or Replica-Set MongoDB instances (data storage)
Shards of data (data chopped up into multiple ranges; where a piece of data sits depends on its range)
Stores configuration information about the Shards.
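Range-based routing can be sketched as a lookup table of key ranges, which is roughly the kind of information the configuration store has to hold. This is a simplified illustration: the numeric shard key and the range boundaries are invented for the example, and the shard names rsDemo and rsDemoRS2 are borrowed from the replica set names used elsewhere in the deck.

```python
# Each entry: keys from "min" (inclusive) up to "max" (exclusive) live on "shard".
# Boundaries are hypothetical; shard names reuse the deck's replica set names.
ranges = [
    {"min": float("-inf"), "max": 1000, "shard": "rsDemo"},
    {"min": 1000, "max": 2000, "shard": "rsDemoRS2"},
    {"min": 2000, "max": float("inf"), "shard": "rsDemo"},
]


def route(shard_key_value):
    """Return the shard whose range contains the given shard key value."""
    for entry in ranges:
        if entry["min"] <= shard_key_value < entry["max"]:
            return entry["shard"]
    raise LookupError(f"no range covers {shard_key_value!r}")


def shards_for_range(low, high):
    """A range query [low, high) only needs the shards whose ranges overlap it."""
    return sorted({entry["shard"] for entry in ranges
                   if entry["min"] < high and low < entry["max"]})


print(route(42), route(1500), shards_for_range(900, 1100))
```

Range sharding keeps neighbouring keys together, so a range query can often be answered by one shard; the trade-off is that a popular range can overload that one shard.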
MongoDB – Sharding (with Replica-Set)
[Figure: sharded cluster built from replica sets]
Config servers (port 27019; shard information points to the replica sets): DAISY 10.0.0.4, POPPY 10.0.0.5, KARLI 10.0.0.6
Replica set rsDemo (mongod: port 27017; mongos: port 27020 on ESTER, HAZEL, ROSIE): ESTER 10.0.0.1, ROSIE 10.0.0.2, HAZEL 10.0.0.3; one primary and two secondaries, linked by replication and heart-beat
Replica set rsDemoRS2 (mongod: port 27017): CONISTON 10.0.0.11, ULLSWATER 10.0.0.12, THIRLMERE 10.0.0.13; one primary and two secondaries, linked by replication and heart-beat
Query and Balancer (shown against DAISY 10.0.0.4)
NewSQL
SCALABLE ACID!
Relational Databases catch up
Maintains ACID
Same scalability and performance as NoSQL systems
Some Vendors: Clustrix, MemSQL, NuoDB, VoltDB, Postgres-XL
Auto-sharding, auto-partitioning
Queries need to take place on same box to save latency
http://www.postgres-xl.org/overview/
Summary / Q & A / Discuss