Introduction to NoSQL Databases

Jianfeng Zhan, 2012.6.25

A quick introduction to DB

Models of Reality

[Figure: REALITY (structures, processes); DATABASE SYSTEM containing DATABASE; DDL; DML]

• A database is a model of structures of reality
• The use of a database reflects processes of reality
• A database system is a software system which supports the definition and use of a database
• DDL: Data Definition Language
• DML: Data Manipulation Language

Data Modeling

[Figure: REALITY (structures, processes); DATABASE SYSTEM; MODEL; data modeling]

• The model represents a perception of structures of reality
• The data modeling process is to fix a perception of structures of reality and represent this perception
• In the data modeling process we select aspects and we abstract

Process Modeling

[Figure: REALITY (structures, processes); DATABASE SYSTEM; MODEL; process modeling; PROG; DML]

• The use of the model reflects processes of reality
• Processes may be represented by programs with embedded database queries and updates
• Processes may be represented by ad-hoc database queries and updates at run-time

Data Model

A data model consists of notations for expressing:
• data structures
• integrity constraints
• operations

Data Model - Data Structures

All data models have notation for defining:
• attribute types
• entity types
• relationship types

FLIGHT-SCHEDULE

FLIGHT#  AIRLINE       WEEKDAY  PRICE
101      delta         mo       156
545      american      we       110
912      scandinavian  fr       450
242      usair         mo       231

DEPT-AIRPORT

FLIGHT#  AIRPORT-CODE
101      atl
912      cph
545      lax

Data Model - Constraints

Constraints express rules that cannot be expressed by the data structures alone:
• Static constraints apply to the database state
• Dynamic constraints apply to changes of the database state
• E.g., "All FLIGHT-SCHEDULE entities must have precisely one DEPT-AIRPORT relationship"

FLIGHT-SCHEDULE

FLIGHT#  AIRLINE       WEEKDAY  PRICE
101      delta         mo       156
545      american      we       110
912      scandinavian  fr       450
242      usair         mo       231

DEPT-AIRPORT

FLIGHT#  AIRPORT-CODE
101      atl
912      cph
545      lax
242      bos

Data Model - Operations

Operations support change and retrieval of data:

insert FLIGHT-SCHEDULE(97, delta, tu, 258);
insert DEPT-AIRPORT(97, atl);

select FLIGHT#, WEEKDAY
from FLIGHT-SCHEDULE
where AIRLINE='delta';

FLIGHT-SCHEDULE

FLIGHT#  AIRLINE       WEEKDAY  PRICE
101      delta         mo       156
545      american      we       110
912      scandinavian  fr       450
242      usair         mo       231
97       delta         tu       258

DEPT-AIRPORT

FLIGHT#  AIRPORT-CODE
101      atl
912      cph
545      lax
242      bos
97       atl
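
The insert and select operations above can be tried out with a small sketch using Python's built-in sqlite3 module. The table and column names (flight_schedule, flight_no, and so on) are adapted for illustration, since '#' and '-' are not valid in unquoted SQL identifiers; the slide's notation is not tied to any particular DBMS.

```python
import sqlite3

# In-memory database; identifiers are adapted from the slide (an assumption).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE flight_schedule "
    "(flight_no INT, airline TEXT, weekday TEXT, price INT)"
)
conn.execute("CREATE TABLE dept_airport (flight_no INT, airport_code TEXT)")

# Initial contents of the two tables, as shown on the slide.
conn.executemany(
    "INSERT INTO flight_schedule VALUES (?, ?, ?, ?)",
    [(101, "delta", "mo", 156), (545, "american", "we", 110),
     (912, "scandinavian", "fr", 450), (242, "usair", "mo", 231)],
)
conn.executemany(
    "INSERT INTO dept_airport VALUES (?, ?)",
    [(101, "atl"), (912, "cph"), (545, "lax"), (242, "bos")],
)

# The two insert operations from the slide.
conn.execute("INSERT INTO flight_schedule VALUES (97, 'delta', 'tu', 258)")
conn.execute("INSERT INTO dept_airport VALUES (97, 'atl')")

# The select operation from the slide (ORDER BY added for a stable result).
delta_flights = conn.execute(
    "SELECT flight_no, weekday FROM flight_schedule "
    "WHERE airline = 'delta' ORDER BY flight_no"
).fetchall()
print(delta_flights)  # [(97, 'tu'), (101, 'mo')]
```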

Data Model - Operations from Programs

declare C cursor for
  select FLIGHT#, WEEKDAY
  from FLIGHT-SCHEDULE
  where AIRLINE='delta';
open C;
repeat
  fetch C into :FLIGHT#, :WEEKDAY;
  do your thing;
until done;
close C;

FLIGHT-SCHEDULE

FLIGHT#  AIRLINE       WEEKDAY  PRICE
101      delta         mo       156
545      american      we       110
912      scandinavian  fr       450
242      usair         mo       231
97       delta         tu       258

Keys and Identifiers

Keys (or identifiers) are uniqueness constraints
• A key on FLIGHT# in FLIGHT-SCHEDULE will force all FLIGHT#'s to be unique in FLIGHT-SCHEDULE
• Consider the following keys on DEPT-AIRPORT:

[Figure: four candidate key choices over the DEPT-AIRPORT attributes FLIGHT# and AIRPORT-CODE]

DEPT-AIRPORT

FLIGHT#  AIRPORT-CODE
101      atl
912      cph
545      lax
242      bos

FLIGHT-SCHEDULE

FLIGHT#  AIRLINE       WEEKDAY  PRICE
101      delta         mo       156
545      american      we       110
912      scandinavian  fr       450
242      usair         mo       231

Integrity and Consistency

• Integrity: does the model reflect reality well?
• Consistency: is the model without internal conflicts?
  -- a FLIGHT# in FLIGHT-SCHEDULE cannot be null because it models the existence of an entity in the real world
  -- a FLIGHT# in DEPT-AIRPORT must exist in FLIGHT-SCHEDULE because it doesn't make sense for a non-existing FLIGHT-SCHEDULE entity to have a DEPT-AIRPORT

DEPT-AIRPORT

FLIGHT#  AIRPORT-CODE
101      atl
912      cph
545      lax
242      bos

FLIGHT-SCHEDULE

FLIGHT#  AIRLINE       WEEKDAY  PRICE
101      delta         mo       156
545      american      we       110
912      scandinavian  fr       450
242      usair         mo       231
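
As a rough illustration of the key and integrity rules on these slides, the following sketch declares them in SQLite through Python's sqlite3 module and checks that violating statements are rejected. The identifiers and the helper `rejected` are hypothetical, chosen only for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite enforces foreign keys only when this pragma is switched on.
conn.execute("PRAGMA foreign_keys = ON")

# "INT" rather than "INTEGER" keeps SQLite from treating the key as an
# auto-assigned rowid alias, so a NULL key is really rejected.
conn.execute(
    "CREATE TABLE flight_schedule ("
    " flight_no INT NOT NULL PRIMARY KEY,"
    " airline TEXT, weekday TEXT, price INT)"
)
conn.execute(
    "CREATE TABLE dept_airport ("
    " flight_no INT NOT NULL REFERENCES flight_schedule(flight_no),"
    " airport_code TEXT)"
)
conn.execute("INSERT INTO flight_schedule VALUES (101, 'delta', 'mo', 156)")
conn.execute("INSERT INTO dept_airport VALUES (101, 'atl')")


def rejected(sql):
    """Return True if the statement violates an integrity constraint."""
    try:
        conn.execute(sql)
        return False
    except sqlite3.IntegrityError:
        return True


# A FLIGHT# cannot be null.
null_key = rejected("INSERT INTO flight_schedule VALUES (NULL, 'usair', 'mo', 231)")
# A key on FLIGHT# forces uniqueness.
duplicate_key = rejected("INSERT INTO flight_schedule VALUES (101, 'usair', 'mo', 231)")
# A FLIGHT# in DEPT-AIRPORT must exist in FLIGHT-SCHEDULE.
dangling_ref = rejected("INSERT INTO dept_airport VALUES (999, 'bos')")

print(null_key, duplicate_key, dangling_ref)  # True True True
```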

Triggers and Stored Procedures

Triggers can be defined to enforce constraints on a database, e.g.,

DEFINE TRIGGER DELETE-FLIGHT-SCHEDULE
  ON DELETE FROM FLIGHT-SCHEDULE WHERE FLIGHT#='X'
  ACTION DELETE FROM DEPT-AIRPORT WHERE FLIGHT#='X';

DEPT-AIRPORT

FLIGHT#  AIRPORT-CODE
101      atl
912      cph
545      lax
242      bos

FLIGHT-SCHEDULE

FLIGHT#  AIRLINE       WEEKDAY  PRICE
101      delta         mo       156
545      american      we       110
912      scandinavian  fr       450
242      usair         mo       231
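
The trigger above is written in a generic notation. One way it might look in a concrete system is sketched below using SQLite's CREATE TRIGGER syntax through Python's sqlite3 module; the identifiers are adapted for illustration and are not taken from the slides.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE flight_schedule "
    "(flight_no INT PRIMARY KEY, airline TEXT, weekday TEXT, price INT)"
)
conn.execute("CREATE TABLE dept_airport (flight_no INT, airport_code TEXT)")
conn.executemany(
    "INSERT INTO flight_schedule VALUES (?, ?, ?, ?)",
    [(101, "delta", "mo", 156), (545, "american", "we", 110),
     (912, "scandinavian", "fr", 450), (242, "usair", "mo", 231)],
)
conn.executemany(
    "INSERT INTO dept_airport VALUES (?, ?)",
    [(101, "atl"), (912, "cph"), (545, "lax"), (242, "bos")],
)

# When a flight is deleted, delete its departure airport as well.
# OLD refers to the row being deleted from flight_schedule.
conn.execute(
    """
    CREATE TRIGGER delete_flight_schedule
    AFTER DELETE ON flight_schedule
    FOR EACH ROW
    BEGIN
        DELETE FROM dept_airport WHERE flight_no = OLD.flight_no;
    END
    """
)

conn.execute("DELETE FROM flight_schedule WHERE flight_no = 242")
remaining = conn.execute(
    "SELECT flight_no FROM dept_airport ORDER BY flight_no"
).fetchall()
print(remaining)  # [(101,), (545,), (912,)]
```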

Normalization

FLIGHT-SCHEDULE (repeating group in WEEKDAYS)

FLIGHT#  AIRLINE       WEEKDAYS  PRICE
101      delta         mo,fr     156
545      american      mo,we,fr  110
912      scandinavian  fr        450

FLIGHT-SCHEDULE (one row per weekday; AIRLINE and PRICE repeated)

FLIGHT#  AIRLINE       WEEKDAY  PRICE
101      delta         mo       156
545      american      mo       110
912      scandinavian  fr       450
101      delta         fr       156
545      american      we       110
545      american      fr       110

FLIGHT-SCHEDULE (normalized)

FLIGHT#  AIRLINE       PRICE
101      delta         156
545      american      110
912      scandinavian  450

FLIGHT-WEEKDAY (normalized)

FLIGHT#  WEEKDAY
101      mo
545      mo
912      fr
101      fr
545      we
545      fr
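
The decomposition shown in these tables can be sketched in a few lines of Python. The function name `normalize` and the tuple layout are assumptions made for this example; the data is the slide's.

```python
# Unnormalized rows: WEEKDAYS holds a comma-separated list (a repeating group).
unnormalized = [
    (101, "delta", "mo,fr", 156),
    (545, "american", "mo,we,fr", 110),
    (912, "scandinavian", "fr", 450),
]


def normalize(rows):
    """Split rows into FLIGHT-SCHEDULE and FLIGHT-WEEKDAY relations."""
    # One row per flight, without the repeating group.
    flight_schedule = [(flight, airline, price) for flight, airline, _, price in rows]
    # One row per (flight, weekday) pair.
    flight_weekday = [
        (flight, day) for flight, _, days, _ in rows for day in days.split(",")
    ]
    return flight_schedule, flight_weekday


flight_schedule, flight_weekday = normalize(unnormalized)
print(flight_schedule)
print(flight_weekday)
```

Each airline and price is now stored once per flight, so changing a price touches a single row instead of one row per weekday.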

ANSI/SPARC 3-Level DB Architecture - separating concerns

[Figure: database system; database divided into schema and data; DDL; DML]

• a database is divided into schema and data
• the schema describes the intension (types)
• the data describes the extension (data)
• Why? Effective! Efficient!

ANSI/SPARC 3-Level DB Architecture - separating concerns

[Figure: schema and data; the schema is split into external schema, conceptual schema, and internal schema]

ANSI/SPARC 3-Level DB Architecture

[Figure: external schema 1, external schema 2, external schema 3; conceptual schema; internal schema; database]

• external schema: use of data
• conceptual schema: meaning of data
• internal schema: storage of data

Conceptual Schema

Describes all conceptually relevant, general, time-invariant structural aspects of the universe of discourse

Excludes aspects of data representation and physical organization, and access

NAME ADDR SEX AGE

CUSTOMER

An object-oriented conceptual schema would also describe all process aspects

External Schema

Describes parts of the information in the conceptual schema in a form convenient to a particular user group’s view

Is derived from the conceptual schema

NAME ADDR SEX AGE

CUSTOMER

NAME ADDR

MALE-TEEN-CUSTOMER

MALE-TEEN-CUSTOMER(X, Y) = CUSTOMER(X, Y, S, A) WHERE S=M AND 12<A<20;

Internal Schema

Describes how the information described in the conceptual schema is physically represented to provide the overall best performance

NAME ADDR SEX AGE

CUSTOMER

NAME ADDR SEX AGE

CUSTOMER

B+-tree on AGE NAME PTR

index on NAME

Indexing

• Why bother?
• Disk access time: 0.01-0.03 sec
• Memory access time: 0.000001-0.000003 sec
• Databases are I/O bound
• Rate of improvement of (memory access time)/(disk access time) >> 1
• Things won't get better anytime soon!
• Indexing helps reduce I/O!

Indexing (cont.)

Clustering vs. non-clustering
• Clustering index: alters the data blocks into a distinct order to match the index, resulting in the row data being stored in order.
• Non-clustering index: the data is present in arbitrary order, but the logical ordering is specified by the index.

Primary and secondary indices
• Primary index: an index structure that is defined on the ordering field
• Secondary index: an index defined on fields that are neither ordering fields nor key fields

I/O cost for lookup:
• Heap: N/2
• Sorted file: log2(N)
• Single-level index: log2(n) + 1
• Multi-level index; B+-tree: log_fanout(n) + 1
• Hashing: 2-3
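
To get a feel for these formulas, the sketch below evaluates them for one made-up configuration. It assumes N is the number of data blocks and n the number of index blocks, which the slide does not state explicitly; the concrete numbers are purely illustrative.

```python
import math


def lookup_costs(num_blocks, index_blocks, fanout):
    """Approximate block accesses per lookup, following the slide's formulas.

    num_blocks   -- N, assumed to be the number of data blocks
    index_blocks -- n, assumed to be the number of index blocks
    fanout       -- branching factor of the multi-level index
    """
    return {
        "heap": num_blocks / 2,
        "sorted_file": math.log2(num_blocks),
        "single_level_index": math.log2(index_blocks) + 1,
        "b_plus_tree": math.log(index_blocks, fanout) + 1,
    }


# Illustrative numbers only: one million data blocks, ten thousand index blocks.
costs = lookup_costs(num_blocks=1_000_000, index_blocks=10_000, fanout=100)
for structure, cost in costs.items():
    print(f"{structure}: {cost:.1f}")
```

With these inputs the heap scan needs about half a million block reads, binary search on the sorted file about 20, and the multi-level index about 3, which is the gap the slide is pointing at.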

Concurrency Control

reserv(flight#, date, customer#)

flight-inst(flight#, date, #avail-seats)

T1:
read(flight-inst(flight#,date))
seats := #avail-seats
if seats > 0 then {
  seats := seats - 1
  write(reserv(flight#,date,customer1))
  write(flight-inst(flight#,date,seats))
}

T2:
read(flight-inst(flight#,date))
seats := #avail-seats
if seats > 0 then {
  seats := seats - 1
  write(reserv(flight#,date,customer2))
  write(flight-inst(flight#,date,seats))
}

overbooking!
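
The overbooking anomaly can be reproduced deterministically by replaying the two transactions under different schedules. The sketch below is a simplified model, not a real DBMS: each transaction is reduced to one "read" step and one "write" step, and the function `run` and the schedule format are assumptions of this example.

```python
def run(schedule):
    """Replay (transaction, step) pairs against a flight with one free seat."""
    db = {"avail_seats": 1, "reservations": []}
    local_seats = {}  # each transaction's private copy of #avail-seats
    for txn, step in schedule:
        if step == "read":
            local_seats[txn] = db["avail_seats"]
        elif step == "write" and local_seats[txn] > 0:
            # The transaction believes a seat is free and books it.
            db["reservations"].append(txn)
            db["avail_seats"] = local_seats[txn] - 1
    return db


# Serial execution: T2 sees that the last seat is gone.
serial = run([("T1", "read"), ("T1", "write"), ("T2", "read"), ("T2", "write")])

# Interleaved execution: both read before either writes.
interleaved = run([("T1", "read"), ("T2", "read"), ("T1", "write"), ("T2", "write")])

print(serial["reservations"])       # ['T1']
print(interleaved["reservations"])  # ['T1', 'T2']  -> two bookings for one seat
```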

ACID Transactions

An ACID transaction is a sequence of database operations that has the following properties:
• Atomicity
  -- Either all operations are carried out, or none is
  -- This property is the responsibility of the concurrency control and the recovery sub-systems
• Consistency
  -- A transaction maps a correct database state to another correct state
  -- This requires that the transaction is correct, which is the responsibility of the application programmer

Concurrency Control (cont.)

• Isolation
  -- Although multiple transactions execute concurrently, i.e. interleaved, not parallel, they appear to execute sequentially
  -- This is the responsibility of the concurrency control sub-system
• Durability
  -- The effect of a completed transaction is permanent
  -- This is the responsibility of the recovery manager

Concurrency Control (cont.)

• Serializability is a good definition of correctness
• A variety of concurrency control protocols exist
  -- Two-phase locking (2PL)
     ---- deadlock and livelock possible
     ---- deadlock prevention: wait-die, wound-wait
     ---- deadlock detection: rollback a transaction
  -- Optimistic protocol: proceed optimistically; back up and repair if needed
  -- Pessimistic protocol: do not proceed until knowing that no back up is needed

Recovery

Storage types:
• Volatile: main memory
• Nonvolatile: disk

Errors:
• Logical error: transaction fails; e.g. bad input, overflow
• System error: transaction fails; e.g. deadlock
• System crash: power failure; main memory lost, disk survives
• Disk failure: head crash, sabotage, fire; disk lost

What to do?

Recovery (cont.)

• Deferred update (NO-UNDO/REDO):
  -- don't change database until ready to commit
  -- write-ahead to log on disk
  -- change the database
• Immediate update (UNDO/NO-REDO):
  -- write-ahead to log on disk
  -- update database anytime
  -- commit not allowed until database is completely updated
• Immediate update (UNDO/REDO):
  -- write-ahead to log on disk
  -- update database anytime
  -- commit allowed before database is completely updated
• Shadow paging (NO-UNDO/NO-REDO):
  -- write-ahead to log on disk
  -- keep shadow page; update copy only; swap at commit
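
As a minimal sketch of the deferred-update (NO-UNDO/REDO) scheme, the code below rebuilds the database after a crash by redoing only the logged writes of committed transactions. The log record format and the function `recover` are assumptions made for this example; real systems also handle checkpoints and log truncation, which are left out.

```python
def recover(log):
    """Rebuild the database state from a write-ahead log after a crash.

    With deferred update the database is never touched before commit, so
    nothing has to be undone: writes of committed transactions are redone
    in log order and all other writes are simply ignored.
    """
    committed = {record[1] for record in log if record[0] == "commit"}
    db = {}
    for record in log:
        if record[0] == "write" and record[1] in committed:
            _, _txn, key, value = record
            db[key] = value
    return db


# Hypothetical log contents at the time of the crash.
log = [
    ("write", "T1", "seats", 9),
    ("commit", "T1"),
    ("write", "T2", "seats", 8),  # T2 had not committed when the system crashed
]

state = recover(log)
print(state)  # {'seats': 9}
```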

Parallel Databases

A database in which a single query may be executed by multiple processors working together in parallel

There are three types of systems:
• Shared memory
• Shared disk
• Shared nothing

Parallel Databases - Shared Memory

• processors share memory via bus
• extremely efficient processor communication via memory writes
• bus becomes the bottleneck
• not scalable beyond 32 or 64 processors

[Figure: several processors (P) sharing one memory (M) and disk]

Parallel Databases - Shared Disk

• processors share disk via interconnection network
• memory bus not a bottleneck
• fault tolerance wrt. processor or memory failure
• scales better than shared memory
• interconnection network to disk subsystem is a bottleneck

[Figure: processors (P), each with its own memory (M), sharing disks]

Parallel Databases - Shared Nothing

• scales better than shared memory and shared disk
• main drawbacks:
  -- higher processor communication cost
  -- higher cost of non-local disk access
• used in the Teradata database machine

[Figure: nodes, each with its own processor (P), memory (M), and disk]

OUTLINE

• NoSQL Definition
• Motivation
• Data Store Introduction
  -- Key-value Stores
  -- Document Stores
  -- Extensible Record Stores
  -- New Relational Database
• Conclusion

NoSQL: The Name

Changing environment

Internet network latency is increasing

Time out

Partial failures

Network partitions

RDBMS

Web apps can (usually) do without
  -- Transactions / strong consistency / integrity
  -- Complex queries

Web apps have different needs (than the apps that RDBMS were designed for)
  -- Scalability & elasticity (at low cost)
  -- High availability
  -- Flexible schemas / semi-structured data
  -- Geographic distribution (multiple datacenters)

NoSQL Systems

• No declarative query language: more programming
• Relaxed consistency: fewer guarantees

NoSQL Systems

The idea behind NoSQL: by giving up ACID constraints, one can achieve much higher performance and scalability.
• ACID = Atomicity, Consistency, Isolation, and Durability
• BASE = Basically Available, Soft state, Eventually consistent

CAP Theorem

A system can have only two out of three of the following properties: consistency, availability, and partition-tolerance.

CAP details

Consistency (all nodes see the same data at the same time)

Availability (a guarantee that every request receives a response about whether it was successful or failed)

Partition tolerance (the system continues to operate despite arbitrary message loss or failure of part of the system)

The easiest way to understand CAP

• Consider two nodes on opposite sides of a partition.
• Allowing at least one node to update state will cause the nodes to become inconsistent, forfeiting C.
• Likewise, if the choice is to preserve consistency, one side of the partition must act as if it is unavailable, forfeiting A.
• Only when nodes communicate is it possible to preserve both consistency and availability, forfeiting P.

Classification of NoSQL systems and tradeoffs (1).

Read performance versus write performance
• HBase is optimized for write performance.
• Records on disk are never overwritten; instead, updates are written to a buffer in memory, and the entire buffer is written sequentially to disk.

Latency versus durability
• Favoring durability: writes are synched to disk before the system returns success to users.
• Favoring latency: writes are stored in memory at write time and synched to disk later.

Classification of NoSQL systems and tradeoffs (2).

Synchronous versus asynchronous replication
• Replication improves system availability, avoids data loss, and improves performance.

Data partition
• Row-based storage: efficient access of an entire record.
• Column storage: efficient for accessing a subset of the columns.

Design decisions of various systems.

Types of NoSQL Databases

Key-value stores

Document stores

Extensible record stores

NoSQL systems differ mainly in their data models

Specific implementations differ in the persistence mechanism and additional functionalities:
• Replication
• Versioning
• Locking
• Transactions
• etc.

Types of NoSQL Databases

Key-Value Stores

• Global Collection of Key/Value Pairs

• Inspired by Amazon’s Dynamo and Distributed Hashtables

•Operations

•void Put(string key, byte[] data);

•byte[] Get(string key);

•void Remove(string key);
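
The three operations above are the whole interface of a key-value store. A toy in-memory version, assuming a plain Python dictionary as the backing store, might look like the sketch below; real systems add persistence, partitioning, and replication on top of this interface.

```python
from typing import Dict


class KeyValueStore:
    """A toy in-memory key-value store with the Put/Get/Remove interface."""

    def __init__(self) -> None:
        self._data: Dict[str, bytes] = {}

    def put(self, key: str, data: bytes) -> None:
        # The store treats the value as an opaque byte string.
        self._data[key] = data

    def get(self, key: str) -> bytes:
        # Raises KeyError if the key is absent.
        return self._data[key]

    def remove(self, key: str) -> None:
        # Removing a missing key is a no-op in this sketch.
        self._data.pop(key, None)


store = KeyValueStore()
store.put("user:42", b'{"name": "Bob"}')
value = store.get("user:42")
store.remove("user:42")
```

Because values are opaque, the store cannot query inside them: every access goes through the key.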

Key-Value Stores: Examples

Project Voldemort

• Advanced key-value store
• Created by LinkedIn, now open source
• Written in Java
• Provides MVCC (multiversion concurrency control)
• Asynchronous replication
• Sharding + consistent hashing
• Automatic failure detection and recovery

MVCC

Snapshot one

Time  Object 1  Object 2
t1    "Hello"   "Bar"
t0    "Foo"     "Bar"

Snapshot two

Time  Object 1  Object 2   Object 3
t2    "Hello"   (deleted)  "Foo-Bar"
t1    "Hello"   "Bar"
t0    "Foo"     "Bar"
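
The two snapshots can be modeled with a store that keeps every version of every object and answers reads relative to a snapshot time. The class `MVCCStore` below is a simplified sketch for illustration, not how Voldemort or CouchDB implement MVCC; timestamps are plain integers standing in for t0, t1, t2.

```python
DELETED = object()  # marker recording that a key was deleted at some time


class MVCCStore:
    """Keeps all versions; readers see the state as of their snapshot time."""

    def __init__(self):
        self._versions = {}  # key -> list of (timestamp, value)

    def write(self, key, timestamp, value):
        # Writers append a new version instead of overwriting the old one.
        self._versions.setdefault(key, []).append((timestamp, value))

    def delete(self, key, timestamp):
        self.write(key, timestamp, DELETED)

    def read(self, key, snapshot_time):
        """Return the newest value written at or before snapshot_time."""
        visible = [(t, v) for t, v in self._versions.get(key, []) if t <= snapshot_time]
        if not visible:
            return None
        _, value = max(visible, key=lambda version: version[0])
        return None if value is DELETED else value


store = MVCCStore()
store.write("object1", 0, "Foo")
store.write("object2", 0, "Bar")
store.write("object1", 1, "Hello")
store.delete("object2", 2)
store.write("object3", 2, "Foo-Bar")

# A reader holding snapshot one (time t1) is unaffected by the changes at t2.
print(store.read("object2", 1))  # Bar
print(store.read("object2", 2))  # None (deleted)
print(store.read("object3", 2))  # Foo-Bar
```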

A Solution: Hashing

Example: y = ax + b (mod n)

Intuition:
• Assigns items to "random" caches
• Few items per cache
• Easy to compute which cache holds an item

[Figure: server; items assigned to caches by hash function; users use the hash to compute the cache for an item]

Adding Caches: why consistent hashing?

• Suppose a new cache arrives. How do we work it into the hash function?
• Natural change: y = ax + b (mod n+1)
• Problem: changes the bucket for almost every item
  -- every cache will be flushed
  -- servers get swamped with new requests
• Goal: when a bucket is added, few items move
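
The difference between the two schemes can be measured directly. The sketch below compares a mod-n assignment with a simple consistent-hashing ring when a fifth cache is added. MD5 is used as a stand-in for the slide's hash function, and the ring with 100 virtual points per cache is one common construction, not necessarily the one used by any particular system; all names are hypothetical.

```python
import bisect
import hashlib


def h(text):
    """Map a string to a large integer (a point on the ring)."""
    return int(hashlib.md5(text.encode()).hexdigest(), 16)


def mod_assign(items, n):
    """Mod-n scheme: the cache index is the hash value modulo n."""
    return {item: h(item) % n for item in items}


class Ring:
    """Consistent hashing: an item belongs to the next cache point clockwise."""

    def __init__(self, caches, replicas=100):
        # Each cache is placed at several pseudo-random points on the ring.
        self._points = sorted(
            (h(f"{cache}#{i}"), cache) for cache in caches for i in range(replicas)
        )
        self._keys = [point for point, _ in self._points]

    def lookup(self, item):
        index = bisect.bisect(self._keys, h(item)) % len(self._keys)
        return self._points[index][1]


items = [f"item{i}" for i in range(1000)]
caches = ["cache0", "cache1", "cache2", "cache3"]

# Mod-n hashing: going from 4 to 5 caches reassigns most items.
before, after = mod_assign(items, 4), mod_assign(items, 5)
moved_mod = sum(before[item] != after[item] for item in items)

# Consistent hashing: only items claimed by the new cache move.
old_ring, new_ring = Ring(caches), Ring(caches + ["cache4"])
moved_ring = [item for item in items if old_ring.lookup(item) != new_ring.lookup(item)]

print(moved_mod, len(moved_ring))
```

On the ring, every item that changes owner moves to the newly added cache; the existing caches never exchange items among themselves, so none of them is flushed.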

Project Voldemort

Operations:

value = store.get(key)

store.put(key, value)

store.delete(key)

Pros? & Cons?

Document Stores: Document?

What is a document?
• Semi-structured data
• Encapsulates and encodes data (or information) in some standard formats or encodings
• Encodings:
  -- XML
  -- YAML
  -- JSON
  -- BSON
  -- Binary forms: PDF, Microsoft Office documents, etc.

Document Stores: Document?

Documents are like rows or records in relational databases, BUT

Row: schema
Document: no schema

FirstName:"Bob", Address:"5 Oak St.", Hobby:"sailing"

FirstName:"Jonathan", Address:"15 Wanamassa Point Road", Children:[{Name:"Michael",Age:10}, {Name:"Jennifer", Age:8}, {Name:"Samantha", Age:5}, {Name:"Elena", Age:2}]

Document Stores

Similar to key-value stores, but with one major difference: the value is a document
• Generally support secondary indexes
• Flexible schema
  -- Any number of fields can be added
  -- Multiple types of documents (objects) and nested documents or lists
• Documents stored in JSON or Binary JSON (BSON)
• No ACID property

Document Stores: Examples

TERRASTOREby Google

CouchDB

• Apache project since 2008
• Schema-free, document-oriented database
  -- Documents are stored in JSON format
  -- Supports secondary indexes
  -- B-tree storage engine
  -- MVCC model, no locking
  -- No joins, no PK/FK
• Incremental replication

CouchDB

REST API

Libraries for various languages (Java, C, PHP, etc.) convert native API calls into the RESTful calls

CRUD    HTTP    Params
Create  PUT     /db/docid
Read    GET     /db/docid
Update  PUT     /db/docid (with the current revision)
Delete  DELETE  /db/docid

CouchDB: Views

Views
• Filter, sort, "join", aggregate, report
• Map/Reduce based
• K/V pairs from Map/Reduce are also stored in the B-tree engine
• Built on demand
• Can be materialized & incrementally updated
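
To illustrate what a Map/Reduce-based view computes, the sketch below models one in plain Python. CouchDB views are normally written as JavaScript functions and stored in design documents; the documents, the function names, and the in-memory dictionary used here are assumptions for illustration only.

```python
from collections import defaultdict

# Hypothetical documents; note that they do not share a fixed schema.
docs = [
    {"_id": "1", "FirstName": "Bob", "Hobby": "sailing"},
    {"_id": "2", "FirstName": "Jonathan", "Hobby": "sailing"},
    {"_id": "3", "FirstName": "Ann", "Hobby": "chess"},
    {"_id": "4", "FirstName": "Eve"},  # no Hobby field: emits nothing
]


def map_doc(doc):
    """Map step: emit (key, value) pairs for one document."""
    if "Hobby" in doc:
        yield doc["Hobby"], 1


def reduce_values(values):
    """Reduce step: combine all values emitted under one key."""
    return sum(values)


def build_view(documents):
    """Run map over every document, group by key, then reduce each group."""
    grouped = defaultdict(list)
    for doc in documents:
        for key, value in map_doc(doc):
            grouped[key].append(value)
    # View rows are kept in key order, which is what makes range queries cheap.
    return {key: reduce_values(values) for key, values in sorted(grouped.items())}


view = build_view(docs)
print(view)  # {'chess': 1, 'sailing': 2}
```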

CouchDB: Views

CouchDB: Local Consistency

• CouchDB uses Multi-Version Concurrency Control (MVCC)

CouchDB: “Global” Consistency

• Incremental Replication

Extensible record stores

• Extensible record stores are also called column stores.
• Each key is associated with multiple attributes (i.e., columns)
• Hybrid row/column stores
• Inspired by Google BigTable
• Examples: HBase, Cassandra

Column: HBase

• Based on Google's BigTable
• Apache Project TLP
• Cloudera (certification, EC2 AMIs, etc.)
• Layered over HDFS (Hadoop Distributed File System)
• Input/Output for MapReduce jobs
• APIs
  -- Thrift, REST

Thrift API

• Thrift is an interface definition language that is used to define and create services for numerous languages.
• It is used as a remote procedure call (RPC) framework and was developed at Facebook for "scalable cross-language services development".
• It combines a software stack with a code generation engine to build services that work efficiently to a varying degree and seamlessly between different languages.
• Although developed at Facebook, it is now an open source project in the Apache Software Foundation.
• To put it simply, Apache Thrift is a binary communication protocol.

Thrift Architecture

REST architecture style (1)

• Client-server
  -- separation of concerns
• Stateless
  -- The client-server communication is further constrained by no client context being stored on the server between requests.
  -- Each request from any client contains all of the information necessary to service the request, and any session state is held in the client.

REST architecture style (2)

• Cacheable
  -- As on the World Wide Web, clients can cache responses.
  -- Responses must therefore, implicitly or explicitly, define themselves as cacheable, or not, to prevent clients reusing stale or inappropriate data in response to further requests.

REST architecture style (3)

• Layered system
  -- A client cannot ordinarily tell whether it is connected directly to the end server, or to an intermediary along the way.
• Code on demand (optional)
  -- Servers are able temporarily to extend or customize the functionality of a client by the transfer of executable code.
• Uniform interface

Column: HBase

• Automatic partitioning
• Automatic re-balancing/re-partitioning
• Fault tolerant
  -- HDFS
  -- Multiple replicas
• Highly distributed

Column: HBase

Column: Cassandra

• Created at Facebook for Inbox search
• Facebook, then Google Code, then ASF
• Commercial support available from Riptano
• Features taken from both Dynamo and BigTable
  -- Dynamo: consistent hashing, partitioning, replication
  -- BigTable: column families, MemTables, SSTables

Column: Cassandra

• Symmetric nodes
  -- No single point of failure
  -- Linearly scalable
  -- Ease of administration
• Flexible/automated provisioning
• Flexible replica replacement
• High availability
  -- Eventual consistency
  -- However, consistency is tunable

Column: Cassandra

• Partitioning
  -- Random
     ---- Good distribution of data between nodes
     ---- Range scans not possible
  -- Order preserving
     ---- Can lead to unbalanced nodes
     ---- Range scans, natural order
• Extremely fast reads/writes (low latency)
• Thrift API

Column-oriented NoSQL

Name | Producer | Data Model | Querying
BigTable | Google | Set of couples (key, values) | Selection (by combination of row, column, and time stamp ranges)
HBase | Apache | Groups of columns (a BigTable clone) | JRuby IRB-based shell (similar to SQL)
Hypertable | Hypertable | Like BigTable | HQL (Hypertable Query Language)
Cassandra | Apache | Columns, groups of columns corresponding to a key (supercolumns) | Simple selection on key, range queries, column or column ranges
PNUTS | Yahoo | (hashed or ordered) tables, typed arrays, flexible schema | Selection and projection from a single table (retrieve an arbitrary single record by primary key, range queries, complex predicates, ordering, top-k)

Scalable Relational Systems

• Also called NewSQL
• SQL
• ACID
• Performance and scalability through modern innovative software architecture

Scalable Relational Systems

RDBMS will provide scalability:
• Use small-scope operations
• Use small-scope transactions

MySQL Cluster

shared-nothing clusters

NDB storage engine (replaces InnoDB)

Replication(2PC)

Horizontal data partitioning

Two-phase commit protocol

• The commit-request phase (or voting phase)
  -- a coordinator process attempts to prepare all the transaction's participating processes to take the necessary steps for either committing or aborting the transaction and to vote, either "Yes": commit, or "No": abort
• The commit phase
  -- based on voting of the cohorts, the coordinator decides whether to commit (only if all have voted "Yes") or abort the transaction (otherwise), and notifies the result to all the cohorts. The participating processes then follow with the needed actions.
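
The two phases can be sketched as a small in-process simulation. The `Participant` class and `two_phase_commit` function below are hypothetical names for this example; the sketch leaves out what makes real implementations hard (timeouts, coordinator failure, and durable logging of votes and decisions).

```python
class Participant:
    """A cohort that votes in phase 1 and obeys the decision in phase 2."""

    def __init__(self, name, can_commit=True):
        self.name = name
        self.can_commit = can_commit
        self.state = "init"

    def prepare(self):
        """Commit-request phase: vote Yes (True) or No (False)."""
        self.state = "prepared" if self.can_commit else "aborted"
        return self.can_commit

    def commit(self):
        self.state = "committed"

    def abort(self):
        self.state = "aborted"


def two_phase_commit(participants):
    """Coordinator: collect all votes, then broadcast a single decision."""
    # Phase 1: ask every participant to prepare and vote.
    votes = [participant.prepare() for participant in participants]
    # Commit only if all voted Yes; otherwise abort.
    decision = all(votes)
    # Phase 2: notify every participant of the outcome.
    for participant in participants:
        if decision:
            participant.commit()
        else:
            participant.abort()
    return "commit" if decision else "abort"


nodes_ok = [Participant("node1"), Participant("node2")]
outcome_ok = two_phase_commit(nodes_ok)

nodes_bad = [Participant("node1"), Participant("node2", can_commit=False)]
outcome_bad = two_phase_commit(nodes_bad)

print(outcome_ok, outcome_bad)  # commit abort
```

A single "No" vote aborts the transaction on every node, including those that were ready to commit.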

Horizontal data partitioning

Data within NDB tables is automatically partitioned across all of the data nodes in the system.

This is done by a hashing algorithm applied to the PRIMARY KEY of the table, and is transparent to the end application.

MySQL Cluster

Scalable Relational Systems

CONCLUSION: NoSQL pros/cons

Advantages
• Massive scalability
• High availability
• Lower cost (than competitive solutions at that scale)
• (usually) predictable elasticity
• Schema flexibility, sparse & semi-structured data

Disadvantages
• Limited query capabilities (so far)
• Eventual consistency is not intuitive to program for
  -- Makes client applications more complicated
• No standardization
  -- Portability might be an issue

CONCLUSION

• For now, NoSQL databases are still far from advanced database technologies
• NoSQL will not replace traditional relational DBMS
• NoSQL databases are good for specialized applications involving large unstructured distributed data with high requirements on scaling

A reading list

E. Brewer, "CAP Twelve Years Later: How the 'Rules' Have Changed," IEEE Computer, 2012

Thank you