
Copyright © 2011-2013 Curt Hill

NoSQL Databases

No SQL or Not Only SQL


Historically …

• Typical relational databases live in a pleasant niche
– Their data is relatively small and usually on one machine
– The meaning of the data is well-understood
– Schemas are tightly defined
– Transactional consistency (ACID) is maintained
– Results of transactions are accurate


Things can be different

• Extremely large amounts of data
• Data is spread over many machines, possibly geographically distant
• Change to this data is continuous
• Data quality may be poor, obtained from many sources
• Schemas are fuzzy and uncertain
– Or completely lacking


Relaxing principles

• Classic database principles have been left behind
• Locking is usually absent
• Schemas are often inconsistent or lacking
• Data come from many sources
– How does this get integrated into a rigid schema?
• Accuracy of the data is missing
– By the time we update, it has already changed


People and Business

• A normal relational database gives us accuracy
– The limitation is the accuracy of the data
• People are used to making decisions without all the facts
• Businesses often make decisions without all the facts or complete analysis
– Otherwise the window of opportunity has passed


CAP Theorem

• A distributed database or web service cannot guarantee all three of the following at the same time:
• Consistency
– Every read sees the most recent write, as if there were a single copy of the data
• Availability
– Every request to a working node terminates with a response
• Partition tolerance
– The system keeps operating even if messages between nodes are lost or individual components fail


ACID absent

• ACID, in particular, is in danger
• The goal of a transaction is to make it look like it occurs by itself, without considering other transactions
• When multiple computers are communicating and each has its own data, this is in danger
• Locking and unlocking is a problem
– Things are changing too fast to let one transaction lock data
– Without locking, serializability is in danger


Now and Then

• Suppose a transaction is made
• One computer messages all the others
• By the time that message arrives it reflects a past state
• By the time it is processed that state may have changed
• Virtually everything on the Internet represents a past state, not the current one


Now and Then Again

• A single computer may think of its data as current

• It must accept all messages from other computers as in the past

• Absolute consistency cannot be obtained

• Eventual consistency is now the norm
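
The idea of replicas that disagree for a while and then converge can be sketched as below. This is only an illustration under stated assumptions: the `Replica` class, the `sync` helper and the last-write-wins rule are hypothetical and are not taken from any of the products discussed later.

```python
from dataclasses import dataclass


@dataclass
class Replica:
    """One node holding its own copy of a single value (hypothetical model)."""
    value: str = ""
    timestamp: int = 0  # logical time of the write that produced `value`

    def write(self, value: str, timestamp: int) -> None:
        # A local write is applied at once: the node treats it as current.
        self.value, self.timestamp = value, timestamp

    def receive(self, value: str, timestamp: int) -> None:
        # A message from another node describes a past state; keep it only
        # if it is newer than what this node already holds (last write wins).
        if timestamp > self.timestamp:
            self.value, self.timestamp = value, timestamp


def sync(a: Replica, b: Replica) -> None:
    """Exchange states in both directions so the two replicas converge."""
    a_state = (a.value, a.timestamp)
    b_state = (b.value, b.timestamp)
    a.receive(*b_state)
    b.receive(*a_state)


east, west = Replica(), Replica()
east.write("price=10", timestamp=1)
west.write("price=12", timestamp=2)
diverged = east.value != west.value   # the replicas disagree for a while
sync(east, west)
converged = east.value == west.value  # once messages arrive they agree
```

Between the two writes and the `sync` call, a reader may see either price depending on which node it asks. That window is what "eventual" refers to.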


BASE not ACID

• BASE is an alternative to ACID
• Basically Available, Soft state, Eventually consistent
– Clearly contrived to complement ACID

• This is acknowledging that when the data becomes too widely distributed something has to give


Not the only relaxation of requirements

• NoSQL databases usually abandon the whole relational format

• They may also include the relational database as a subset of the entire database

• The most common form is the data store
– AKA key-value store


NoSQL Databases

• Must provide APIs to various programming languages
• Must scale well to very large sizes
• Indexing is the key to rapid access
• These NoSQL databases are targeted at different niches
• Generally not interchangeable
– Unlike most RDBMS


Kinds of NoSQL

• Key Value
• Columnar
• Document
• Graph


Key Value

• Simplest model
• There is a key (which must be unique) linked to a group of values
• It gets more interesting if the values may include key value pairs as well
• Often not much of a schema
• Think of a database with one table
– Unlimited string as key
– Unlimited string as second field
• Two examples: Riak and Redis


Key-Value stores

• A relational table is a restricted form of key-value
– The key is the primary key
– The data is all the fields associated with that key
– However, it may not be even in First Normal Form
• There is only one table
– Key is unrestricted size string
– Data is whatever needs to be there
– The values may be completely different


KV Picture


Key Value Again

• In a relational database we always know what the value extracted from a cell is

• It has the same meaning as everything else in the column

• This is no longer the case in key value stores
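
A small sketch of the "one table, string key, anything as value" idea, using a plain Python dict as a stand-in. The `put` and `get` helpers and the sample keys are hypothetical; this is not the API of Riak, Redis or any other product.

```python
# A dict stands in for the single "table": string keys, arbitrary values.
store = {}


def put(key, value):
    store[key] = value  # a later put with the same key replaces the value


def get(key, default=None):
    return store.get(key, default)


# Three keys, three unrelated kinds of value: nothing like a column type
# tells the reader what a value means.
put("user:42", {"name": "Ada", "langs": ["Python", "SQL"]})
put("session:9f3", "user:42")
put("hits:/index.html", 1031)

kinds = [type(get(k)).__name__ for k in sorted(store)]
```

The caller has to know, by convention, what sits behind each key. In a relational table that knowledge lives in the column definition.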


Columnar

• Also known as a column store
• A lot of similarity to relational, but the dominant item is the column not the row
• We lack the rectangularity that relational has
• Columns are stored together
• Halfway between Relational and Key Value
• HBase, Cassandra, Hypertable, Calpont, MonetDB are examples


Columnar


Columnar again

• Often used in data warehouses
• Since the columns are stored together (rather than the rows) and since each column has only one data type, there is an opportunity to compress a column that row-oriented relational DBs lack
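
The compression opportunity can be illustrated with a toy example. The sample rows and the `run_length_encode` helper are made up for illustration, and run-length encoding is just one simple scheme; real column stores use a variety of encodings.

```python
from itertools import groupby

rows = [
    ("Hays", "KS", 2012),
    ("Salina", "KS", 2012),
    ("Fargo", "ND", 2012),
    ("Minot", "ND", 2013),
]

# Row store: values of different types are interleaved record by record.
# Column store: transpose so that each column's values sit together.
city, state, year = (list(col) for col in zip(*rows))


def run_length_encode(column):
    """Collapse runs of equal neighbours into (value, count) pairs."""
    return [(value, len(list(run))) for value, run in groupby(column)]


state_rle = run_length_encode(state)
year_rle = run_length_encode(year)
```

Because a column holds one type and often repeats values, four entries shrink to two pairs here. Stored row by row, the same values would be separated by the other fields and no runs would form.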


Document

• The basic object is now a document instead of a simple field like a number
– Document is often XML or JSON
• Each document has an ID and other identifying values
• A document is an arbitrary and complicated item
– As if every field were a BLOB
• Examples: MongoDB, CouchDB, Oracle NoSQL, Amazon’s SimpleDB


Graph

• A mathematical graph consists of nodes (the data) and links between these
– This is the network model revisited
• Used for highly interconnected data
• Processing rides the links
• Neo4J and Zope are examples
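
"Processing rides the links" can be sketched with a breadth-first search over an adjacency list. The `links` data and the `hops` function are hypothetical; a real graph database would offer its own traversal facilities.

```python
from collections import deque

# Hypothetical social graph: each node maps to the nodes it links to.
links = {
    "ann": ["bob", "cy"],
    "bob": ["ann", "dee"],
    "cy": ["ann"],
    "dee": ["bob", "eve"],
    "eve": ["dee"],
}


def hops(start, goal):
    """Breadth-first search: number of links on a shortest path, or None."""
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, distance = queue.popleft()
        if node == goal:
            return distance
        for neighbour in links.get(node, []):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, distance + 1))
    return None  # the goal cannot be reached from start
```

For example, `hops("ann", "eve")` follows ann, bob, dee, eve. Each step is a direct walk along a stored link rather than a join between tables.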


Commentary

• These classifications are incomplete

• Many examples exist that are combinations of several

• We next look at some example databases
– Most of these are open source


Riak

• Key value store designed to be distributed over many nodes
• Designed to be fault-tolerant
– Peer to peer architecture – no master
– All the data is scattered over many servers and disks
– One or more failures do not compromise the data
• Everything is done through web queries
• Used by a quarter of the Fortune 50
• Includes Best Buy, GitHub, Comcast


Redis

• Key value store, optimized for speed
• Creator is Salvatore Sanfilippo, who calls it a data structure server
– Data could be more than a string or number linked to a key
• May also treat data as a sorted or unsorted set of strings
– This enables set operations on keys
• Keeps data in memory and occasionally updates disk
– No ACID guarantees as a result
• Used by Craigslist, Flickr


MongoDB

• Designed to be a very scalable document model database
– Used by CERN for Large Hadron Collider data
• Data is formatted as JavaScript objects – JavaScript Object Notation (JSON)
• Attributes are indexed
• Queries now become JavaScript functions
• APIs in the major languages
• Who is Mongo?


JSON

• A lightweight data interchange format
• Derived from JavaScript but used outside of JavaScript
• Most languages have a subroutine to parse and assimilate JSON
• A short JSON presentation
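
As an illustration of "a subroutine to parse and assimilate JSON", here is a short sketch using Python's standard `json` module. The sample text is made up for the example.

```python
import json

# A JSON document as it might arrive over the network: just text.
text = (
    '{"title": "NoSQL Databases", "pages": 45, '
    '"kinds": ["Key Value", "Columnar", "Document", "Graph"]}'
)

doc = json.loads(text)       # JSON text -> dict, list, str, int, ...
doc["reviewed"] = True       # from here on it is ordinary Python data
round_trip = json.loads(json.dumps(doc))  # back to text and in again
```

JSON objects become dicts and JSON arrays become lists, so no schema has to be declared before the data can be read.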


MongoDB and ACID

• Atomicity – yes
• Consistency – no schema, so no consistency or inconsistency
• Isolation – good, but not perfect
• Durability – yes


Terms

RDBMS        MongoDB
Table        Collection
Row          JSON Document
Index        Index
Join         Embedding and linking
Partition    Shard
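
The "Embedding and linking" row can be illustrated with plain Python dicts standing in for JSON documents. This is a sketch only: the field names and the `customer_name` helper are hypothetical, and no MongoDB driver calls are shown.

```python
# Embedding: the related data lives inside the parent document.
order_embedded = {
    "_id": 1,
    "customer": {"name": "Ada", "city": "Hays"},
    "items": [{"sku": "A1", "qty": 2}, {"sku": "B7", "qty": 1}],
}

# Linking: the document stores only the other document's identifier.
customers = {"c9": {"_id": "c9", "name": "Ada", "city": "Hays"}}
order_linked = {"_id": 2, "customer_id": "c9", "items": [{"sku": "A1", "qty": 2}]}


def customer_name(order):
    """Resolve the customer name for either layout."""
    if "customer" in order:
        # Embedded: everything arrives with the one document.
        return order["customer"]["name"]
    # Linked: a second lookup replaces what a join would do in an RDBMS.
    return customers[order["customer_id"]]["name"]
```

Embedding trades redundancy for a single read. Linking avoids the copy but costs an extra lookup done by the application.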


CouchDB

• Document based with JSON content

• Each document has a set of keys that link to it

• Written in Erlang, but with JavaScript API
– Other languages interface to that
• Very fault tolerant
• Used by LinkedIn, Orbitz


HBase

• A columnar database
• Very scalable – designed for big data
• Each field is versioned, making it 3D rather than 2D
– Columns are stored together
– Rows are the related data
– Depth is the older versions
• Used by Facebook, Twitter, Yahoo, eBay
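
The "3D rather than 2D" idea (row, column, version) can be sketched with nested dictionaries. The `put` and `get` helpers, the column name and the sample values are hypothetical and only mimic the concept; this is not the HBase client API.

```python
from collections import defaultdict

# table[row][column] is a list of (timestamp, value) pairs, oldest first.
table = defaultdict(lambda: defaultdict(list))


def put(row, column, value, timestamp):
    cell = table[row][column]
    cell.append((timestamp, value))
    cell.sort()  # keep the versions ordered by timestamp


def get(row, column, at=None):
    """Newest value, or the newest one written at or before `at`."""
    versions = [(t, v) for t, v in table[row][column] if at is None or t <= at]
    return versions[-1][1] if versions else None


put("user1", "info:email", "a@old.example", timestamp=1)
put("user1", "info:email", "a@new.example", timestamp=5)
```

A write never overwrites the old value. It adds depth to the cell, so a reader can ask for the value as of an earlier time.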


Cassandra

• Project started by Facebook to power its inbox search
• Became an Apache project
• Intended to create a network of equal nodes
• Eventual consistency, not perfect consistency
• Mostly written in Java but provides APIs in Python, Ruby, PHP among others
• Used by IBM, HP, Netflix among others


Neo4J

• Graph database
– Network of nodes and links
• Data is information on a person or thing
• Links are the connections between one datum and another
• Numerous graph algorithms have been implemented
– Consider Facebook connections
• Used by Adobe, Lufthansa, Mozilla


CAP

• Several of these are distributed
• Since they cannot do all three they generally are good at two of the three

• See the following picture


CAP

(Picture: Consistency, Availability and Partition tolerance, with Riak, MongoDB, HBase and CouchDB positioned among them)


Niches

• For a product to be successful it must find one or more niches where it may do well

• A niche is a particular set of circumstances and requirements

• Next we want to consider some of these products and what they do well and what they do poorly


Relational

• Layout and form of the data is well known in advance and relatively stable
– We do not need to know in advance what will be done with the data, but we do need to know the form
– Most business processes have this kind of requirement

• Not as effective for deeply hierarchical and widely varying data


Key Value


• Easy to make fast or horizontally scalable or both

• Useful where data does not conform to a well known schema or the data is not very well related

• Searches are easy but more complicated queries are not
– No indices
– No linkages, i.e. foreign keys


Columnar

• Horizontal scalability is based on storing columns in different nodes
– Thus good for big data
• Allows for versioning
• Like relational, schema needs to be done in advance
– Based on what queries are needed
– Does poorly with ad hoc data and queries


Document

• Works well with data that is highly variable and not known in advance

• Content is often JSON, so these are object oriented databases

• No normalization is possible, so redundancies are mostly unavoidable

• Most interesting queries are not possible


Graph

• Particularly useful for modeling networking

• For social networking applications
– Nodes are people and edges their relationships
– Hard to model this in other models

• Not easy to partition, so not easy to scale

• No common query language


Déjà vu?

• In the early 1970s the database world was in some disarray
• There were several models
• None had achieved dominance
• Commercial offerings were present, but a theoretical foundation was lacking
• There was no uniformity to these products
• Interchanging products was very difficult


The End or Start of an Era

• Codd changed that by the development of a theoretical foundation for relational databases
• SQL became the common language
• For several decades now Relational Databases have been the undisputed king
• RDBMS is a 32 billion dollar industry
• The products are to some degree interchangeable


Again

• The situation around NoSQL databases has a lot of the same feel as in the 1970s
• They are not interchangeable and not even directed towards the same ends
• Is this the end of the RDBMS era?
• Unlikely we will soon get rid of RDBMS, but it is not likely to be as exclusive as it has been


Finally

• Some of the motivations of the NoSQL movement are:
– Big Data
– Requirements to be distributed
– Volatility of data, largely caused by the web
• Check out the following link
– DB-Engines.com rates the popularity of databases
