Introduction to NoSQL Databases Jianfeng Zhan 2012.6.25

Introduction to NoSQL Databases - BigDataBenchprof.ict.ac.cn/DComputing/uploads/2012/DC_3_1_NoSQL.pdf · FLIGHT# AIRLINE WEEKDAY PRICE FLIGHT-SCHEDULE ... used in the Teradata database

A quick introduction to DB

Models of Reality

[Figure: REALITY (structures, processes) modeled by a DATABASE inside a DATABASE SYSTEM, accessed through DDL and DML]

• A database is a model of structures of reality
• The use of a database reflects processes of reality
• A database system is a software system which supports the definition and use of a database
• DDL: Data Definition Language
• DML: Data Manipulation Language

Data Modeling

[Figure: REALITY (structures, processes) is mapped by data modeling to a MODEL in the DATABASE SYSTEM]

The model represents a perception of structures of reality

The data modeling process is to fix a perception of structures of reality and represent this perception

In the data modeling process we select aspects and we abstract

Process Modeling

[Figure: REALITY (structures, processes) linked to the MODEL in the DATABASE SYSTEM by process modeling; DML is issued both directly and from a program (PROG)]

• The use of the model reflects processes of reality
• Processes may be represented by programs with embedded database queries and updates
• Processes may be represented by ad-hoc database queries and updates at run-time

Data Model

A data model consists of notations for expressing:
• data structures
• integrity constraints
• operations

Data Model - Data Structures

All data models have notation for defining:
• attribute types
• entity types
• relationship types

FLIGHT-SCHEDULE
FLIGHT#  AIRLINE       WEEKDAY  PRICE
101      delta         mo       156
545      american      we       110
912      scandinavian  fr       450
242      usair         mo       231

DEPT-AIRPORT
FLIGHT#  AIRPORT-CODE
101      atl
912      cph
545      lax

Data Model - Constraints

Constraints express rules that cannot be expressed by the data structures alone:
• Static constraints apply to database state
• Dynamic constraints apply to change of database state
• E.g., “All FLIGHT-SCHEDULE entities must have precisely one DEPT-AIRPORT relationship”

FLIGHT-SCHEDULE
FLIGHT#  AIRLINE       WEEKDAY  PRICE
101      delta         mo       156
545      american      we       110
912      scandinavian  fr       450
242      usair         mo       231

DEPT-AIRPORT
FLIGHT#  AIRPORT-CODE
101      atl
912      cph
545      lax
242      bos

Data Model - Operations

Operations support change and retrieval of data:

    insert FLIGHT-SCHEDULE(97, delta, tu, 258);
    insert DEPT-AIRPORT(97, atl);

    select FLIGHT#, WEEKDAY
    from FLIGHT-SCHEDULE
    where AIRLINE=‘delta’;

FLIGHT-SCHEDULE
FLIGHT#  AIRLINE       WEEKDAY  PRICE
101      delta         mo       156
545      american      we       110
912      scandinavian  fr       450
242      usair         mo       231
97       delta         tu       258

DEPT-AIRPORT
FLIGHT#  AIRPORT-CODE
101      atl
912      cph
545      lax
242      bos
97       atl
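
The operations above can be tried against a real engine. The following is a minimal sketch using Python's built-in sqlite3 module. The identifiers flight_schedule, dept_airport, flight_no and airport_code are assumed SQL-friendly spellings of the slide's FLIGHT-SCHEDULE, DEPT-AIRPORT, FLIGHT# and AIRPORT-CODE; they do not come from the original.

```python
import sqlite3

# In-memory database; all identifiers are assumed renderings of the slide's names.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE flight_schedule "
             "(flight_no INTEGER, airline TEXT, weekday TEXT, price INTEGER)")
conn.execute("CREATE TABLE dept_airport (flight_no INTEGER, airport_code TEXT)")
conn.executemany(
    "INSERT INTO flight_schedule VALUES (?, ?, ?, ?)",
    [(101, "delta", "mo", 156), (545, "american", "we", 110),
     (912, "scandinavian", "fr", 450), (242, "usair", "mo", 231)])
conn.executemany(
    "INSERT INTO dept_airport VALUES (?, ?)",
    [(101, "atl"), (912, "cph"), (545, "lax"), (242, "bos")])

# The two inserts from the slide (change of data).
conn.execute("INSERT INTO flight_schedule VALUES (97, 'delta', 'tu', 258)")
conn.execute("INSERT INTO dept_airport VALUES (97, 'atl')")

# The select from the slide (retrieval of data).
delta_flights = conn.execute(
    "SELECT flight_no, weekday FROM flight_schedule "
    "WHERE airline = 'delta' ORDER BY flight_no").fetchall()
print(delta_flights)
```

The ORDER BY clause is an addition that makes the result order deterministic; the slide's query does not specify an order.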

Data Model - Operations from Programs

    declare C cursor for
        select FLIGHT#, WEEKDAY
        from FLIGHT-SCHEDULE
        where AIRLINE=‘delta’;
    open C;
    repeat
        fetch C into :FLIGHT#, :WEEKDAY;
        do your thing;
    until done;
    close C;

FLIGHT-SCHEDULE
FLIGHT#  AIRLINE       WEEKDAY  PRICE
101      delta         mo       156
545      american      we       110
912      scandinavian  fr       450
242      usair         mo       231
97       delta         tu       258
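
The declare/open/fetch/close pattern above maps closely onto a cursor loop in a host language. This is a hedged sketch with Python's sqlite3 module; the table and column names are assumed renderings of the slide's identifiers, and the "do your thing" step is replaced by a placeholder that just records each row.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE flight_schedule "
             "(flight_no INTEGER, airline TEXT, weekday TEXT, price INTEGER)")
conn.executemany(
    "INSERT INTO flight_schedule VALUES (?, ?, ?, ?)",
    [(101, "delta", "mo", 156), (545, "american", "we", 110),
     (912, "scandinavian", "fr", 450), (242, "usair", "mo", 231),
     (97, "delta", "tu", 258)])

# "declare C cursor for ..." and "open C" happen in one call here.
cursor = conn.execute(
    "SELECT flight_no, weekday FROM flight_schedule "
    "WHERE airline = 'delta' ORDER BY flight_no")

seen = []
while True:
    row = cursor.fetchone()          # fetch C into :FLIGHT#, :WEEKDAY
    if row is None:                  # until done
        break
    flight_no, weekday = row
    seen.append(f"{flight_no}:{weekday}")   # placeholder for "do your thing"
cursor.close()                       # close C
```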

Keys and Identifiers

Keys (or identifiers) are uniqueness constraints

A key on FLIGHT# in FLIGHT-SCHEDULE will force all FLIGHT#’s to be unique in FLIGHT-SCHEDULE

Consider the following keys on DEPT-AIRPORT:

[Figure: four alternative key choices on DEPT-AIRPORT(FLIGHT#, AIRPORT-CODE)]

DEPT-AIRPORT
FLIGHT#  AIRPORT-CODE
101      atl
912      cph
545      lax
242      bos

FLIGHT-SCHEDULE
FLIGHT#  AIRLINE       WEEKDAY  PRICE
101      delta         mo       156
545      american      we       110
912      scandinavian  fr       450
242      usair         mo       231
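
To see a key acting as a uniqueness constraint, the sketch below declares keys in SQLite through Python's sqlite3 module. The identifiers are assumed renderings of the slide's names, and the second airport code ('jfk') is invented for illustration. It contrasts a key on the flight number alone with a composite key on DEPT-AIRPORT, which is one of several possible key choices.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Key on flight_no: every flight number may appear only once.
conn.execute("CREATE TABLE flight_schedule "
             "(flight_no INTEGER PRIMARY KEY, airline TEXT, weekday TEXT, price INTEGER)")
conn.execute("INSERT INTO flight_schedule VALUES (101, 'delta', 'mo', 156)")
try:
    conn.execute("INSERT INTO flight_schedule VALUES (101, 'usair', 'fr', 99)")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

# Composite key on (flight_no, airport_code): the pair must be unique,
# so one flight may be listed with several airports.
conn.execute("CREATE TABLE dept_airport "
             "(flight_no INTEGER, airport_code TEXT, "
             "PRIMARY KEY (flight_no, airport_code))")
conn.execute("INSERT INTO dept_airport VALUES (101, 'atl')")
conn.execute("INSERT INTO dept_airport VALUES (101, 'jfk')")   # hypothetical row, accepted
airports_for_101 = conn.execute(
    "SELECT COUNT(*) FROM dept_airport WHERE flight_no = 101").fetchone()[0]
```

With a key on flight_no alone in dept_airport, the second insert would be rejected, which is how the choice of key encodes "at most one departure airport per flight".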

Integrity and Consistency

• Integrity: does the model reflect reality well?
• Consistency: is the model without internal conflicts?

• a FLIGHT# in FLIGHT-SCHEDULE cannot be null because it models the existence of an entity in the real world
• a FLIGHT# in DEPT-AIRPORT must exist in FLIGHT-SCHEDULE because it doesn’t make sense for a non-existing FLIGHT-SCHEDULE entity to have a DEPT-AIRPORT

DEPT-AIRPORT
FLIGHT#  AIRPORT-CODE
101      atl
912      cph
545      lax
242      bos

FLIGHT-SCHEDULE
FLIGHT#  AIRLINE       WEEKDAY  PRICE
101      delta         mo       156
545      american      we       110
912      scandinavian  fr       450
242      usair         mo       231
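
The two rules above correspond to a NOT NULL constraint and a foreign-key (referential integrity) constraint. A minimal sketch in SQLite via Python's sqlite3 module follows; the identifiers are assumed renderings of the slide's names, and the helper `rejected` is hypothetical. Note that SQLite only enforces foreign keys after `PRAGMA foreign_keys = ON`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")   # SQLite checks foreign keys only when enabled

conn.execute("CREATE TABLE flight_schedule "
             "(flight_no INTEGER NOT NULL UNIQUE, airline TEXT)")
conn.execute("CREATE TABLE dept_airport "
             "(flight_no INTEGER NOT NULL REFERENCES flight_schedule(flight_no), "
             "airport_code TEXT)")
conn.execute("INSERT INTO flight_schedule VALUES (101, 'delta')")
conn.execute("INSERT INTO dept_airport VALUES (101, 'atl')")

def rejected(sql):
    """Hypothetical helper: True if the statement violates a constraint."""
    try:
        conn.execute(sql)
        return False
    except sqlite3.IntegrityError:
        return True

# A flight without a flight number models nothing in the real world.
null_key_rejected = rejected("INSERT INTO flight_schedule VALUES (NULL, 'usair')")
# A departure airport for a flight that does not exist makes no sense.
dangling_ref_rejected = rejected("INSERT INTO dept_airport VALUES (999, 'bos')")
```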

Triggers and Stored Procedures

Triggers can be defined to enforce constraints on a database, e.g.,

    DEFINE TRIGGER DELETE-FLIGHT-SCHEDULE
    ON DELETE FROM FLIGHT-SCHEDULE WHERE FLIGHT#=‘X’
    ACTION DELETE FROM DEPT-AIRPORT WHERE FLIGHT#=‘X’;

DEPT-AIRPORT
FLIGHT#  AIRPORT-CODE
101      atl
912      cph
545      lax
242      bos

FLIGHT-SCHEDULE
FLIGHT#  AIRLINE       WEEKDAY  PRICE
101      delta         mo       156
545      american      we       110
912      scandinavian  fr       450
242      usair         mo       231
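
The trigger above is written in a generic notation. As one possible concrete form, the sketch below expresses the same rule in SQLite's CREATE TRIGGER syntax, run through Python's sqlite3 module. The identifiers are assumed renderings of the slide's names; `OLD.flight_no` plays the role of the slide's ‘X’.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE flight_schedule (flight_no INTEGER, airline TEXT, weekday TEXT, price INTEGER);
CREATE TABLE dept_airport (flight_no INTEGER, airport_code TEXT);
INSERT INTO flight_schedule VALUES (101, 'delta', 'mo', 156);
INSERT INTO flight_schedule VALUES (545, 'american', 'we', 110);
INSERT INTO dept_airport VALUES (101, 'atl');
INSERT INTO dept_airport VALUES (545, 'lax');

-- When a flight is deleted, delete its departure airport as well.
CREATE TRIGGER delete_flight_schedule
AFTER DELETE ON flight_schedule
BEGIN
    DELETE FROM dept_airport WHERE flight_no = OLD.flight_no;
END;
""")

conn.execute("DELETE FROM flight_schedule WHERE flight_no = 101")
remaining = conn.execute(
    "SELECT flight_no, airport_code FROM dept_airport").fetchall()
```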

Normalization

FLIGHT-SCHEDULE
FLIGHT#  AIRLINE       WEEKDAYS  PRICE
101      delta         mo,fr     156
545      american      mo,we,fr  110
912      scandinavian  fr        450

FLIGHT-SCHEDULE
FLIGHT#  AIRLINE       WEEKDAY  PRICE
101      delta         mo       156
545      american      mo       110
912      scandinavian  fr       450
101      delta         fr       156
545      american      we       110
545      american      fr       110

FLIGHT-SCHEDULE
FLIGHT#  AIRLINE       PRICE
101      delta         156
545      american      110
912      scandinavian  450

FLIGHT-WEEKDAY
FLIGHT#  WEEKDAY
101      mo
545      mo
912      fr
101      fr
545      we
545      fr
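
The tables above show the same data in three shapes: a repeating group (WEEKDAYS), a flat table with repeated AIRLINE and PRICE values, and a decomposition into FLIGHT-SCHEDULE and FLIGHT-WEEKDAY. The plain-Python sketch below performs that decomposition and then joins the two tables back together; the variable names are illustrative only.

```python
# Unnormalized rows: WEEKDAYS holds a comma-separated list.
unnormalized = [
    (101, "delta", "mo,fr", 156),
    (545, "american", "mo,we,fr", 110),
    (912, "scandinavian", "fr", 450),
]

# Decompose: facts about the flight itself are stored once per flight ...
flight_schedule = [(flight, airline, price)
                   for flight, airline, _, price in unnormalized]
# ... and each (flight, weekday) pair gets its own row.
flight_weekday = [(flight, day)
                  for flight, _, days, _ in unnormalized
                  for day in days.split(",")]

# Joining on the flight number reproduces the flat four-column table.
flat = [(flight, airline, day, price)
        for flight, airline, price in flight_schedule
        for other_flight, day in flight_weekday
        if flight == other_flight]
```

In the decomposed form a price change touches one row, whereas the flat table would need every weekday row of that flight updated consistently.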

ANSI/SPARC 3-Level DB Architecture - separating concerns

[Figure: a database system with DDL and DML interfaces over a database that is divided into schema and data]

• a database is divided into schema and data
• the schema describes the intension (types)
• the data describes the extension (data)
• Why? Effective! Efficient!

ANSI/SPARC 3-Level DB Architecture - separating concerns

[Figure: the schema above the data is refined step by step into an external schema, a conceptual schema, and an internal schema]

ANSI/SPARC 3-Level DB Architecture

[Figure: external schemas 1, 2, and 3 on top of the conceptual schema, which sits on the internal schema and the database]

• external schema: use of data
• conceptual schema: meaning of data
• internal schema: storage of data

Conceptual Schema

Describes all conceptually relevant, general, time-invariant structural aspects of the universe of discourse

Excludes aspects of data representation and physical organization, and access

CUSTOMER
NAME  ADDR  SEX  AGE

An object-oriented conceptual schema would also describe all process aspects

External Schema

Describes parts of the information in the conceptual schema in a form convenient to a particular user group’s view

Is derived from the conceptual schema

CUSTOMER
NAME  ADDR  SEX  AGE

MALE-TEEN-CUSTOMER
NAME  ADDR

MALE-TEEN-CUSTOMER(X, Y) = CUSTOMER(X, Y, S, A) WHERE S=M AND 12<A<20;
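
In SQL systems an external schema of this kind is commonly realized as a view derived from the conceptual-level table. The sketch below defines such a view in SQLite through Python's sqlite3 module. The customer rows are invented sample data, and the identifiers are assumed renderings of the slide's names.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer (name TEXT, addr TEXT, sex TEXT, age INTEGER)")
conn.executemany(
    "INSERT INTO customer VALUES (?, ?, ?, ?)",
    [("Al", "1 Elm St", "M", 15),     # hypothetical sample rows
     ("Bea", "2 Oak St", "F", 16),
     ("Cy", "3 Ash St", "M", 34),
     ("Dan", "4 Fir St", "M", 12)])

# The view exposes only NAME and ADDR of male customers with 12 < AGE < 20.
conn.execute("CREATE VIEW male_teen_customer AS "
             "SELECT name, addr FROM customer "
             "WHERE sex = 'M' AND age > 12 AND age < 20")

teens = conn.execute("SELECT * FROM male_teen_customer").fetchall()
```

Users of the view never see SEX or AGE, and the view stays correct as rows are added to the underlying table because it is derived rather than stored.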

Internal Schema

Describes how the information described in the conceptual schema is physically represented to provide the overall best performance

CUSTOMER
NAME  ADDR  SEX  AGE

[Figure: the CUSTOMER file with a B+-tree on AGE and an index on NAME holding (NAME, PTR) entries]

Indexing

Why Bother?
• Disk access time: 0.01-0.03 sec
• Memory access time: 0.000001-0.000003 sec
• Databases are I/O bound
• Rate of improvement of (memory access time)/(disk access time) >> 1
• Things won’t get better anytime soon!

Indexing helps reduce I/O!

Indexing (cont.)

Clustering vs. non-clustering
• Clustering: alters the data blocks into a certain distinct order to match the index, resulting in the row data being stored in order.
• Non-clustering: the data is present in arbitrary order, but the logical ordering is specified by the index.

Primary and secondary indices
• Primary: an index structure that is defined on the ordering field
• Secondary: an index on fields that are neither ordering fields nor key fields

I/O cost for lookup:
• Heap: N/2
• Sorted file: log2(N)
• Single-level index: log2(n)+1
• Multi-level index; B+-tree: log_fanout(n)+1
• Hashing: 2-3
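
To get a feel for these lookup costs, the sketch below evaluates the formulas for one set of numbers. The slide does not define N and n; the code assumes N is the number of data blocks and n the number of index blocks, and the block counts and fanout are invented. Logarithms are rounded up, since a lookup reads whole blocks.

```python
def ceil_log(value, base):
    """Smallest k with base**k >= value (integer arithmetic, no float error)."""
    k, power = 0, 1
    while power < value:
        power *= base
        k += 1
    return k

def lookup_costs(data_blocks, index_blocks, fanout):
    """Block reads per lookup, following the formulas listed above."""
    return {
        "heap": data_blocks / 2,                             # N/2 on average
        "sorted file": ceil_log(data_blocks, 2),             # binary search
        "single-level index": ceil_log(index_blocks, 2) + 1, # search index, then 1 data block
        "B+-tree": ceil_log(index_blocks, fanout) + 1,       # tree levels, then 1 data block
    }

# Assumed sizes: 1,000,000 data blocks, 10,000 index blocks, fanout 100.
costs = lookup_costs(data_blocks=1_000_000, index_blocks=10_000, fanout=100)
for structure, reads in costs.items():
    print(f"{structure}: {reads}")
```

Under these assumptions the heap scan reads 500,000 blocks on average while the B+-tree reads 3, which is the point of the slide. Hashing is omitted because its cost (2-3) does not depend on the file size.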

Concurrency Control

reserv(flight#, date, customer#)
flight-inst(flight#, date, #avail-seats)

T1:
    read(flight-inst(flight#,date))
    seats:=#avail-seats
    if seats>0 then {
        seats:=seats-1
        write(reserv(flight#,date,customer1))
        write(flight-inst(flight#,date,seats))}

T2:
    read(flight-inst(flight#,date))
    seats:=#avail-seats
    if seats>0 then {
        seats:=seats-1
        write(reserv(flight#,date,customer2))
        write(flight-inst(flight#,date,seats))}

If both transactions read #avail-seats before either one writes: overbooking!
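
The overbooking anomaly can be reproduced deterministically by fixing the interleaving by hand. The sketch below uses plain Python dictionaries as a stand-in for the database; the flight number, date, and function names are hypothetical.

```python
# Shared state: exactly one seat left on a (hypothetical) flight instance.
key = ("SK912", "2012-06-25")
flight_inst = {key: 1}
reserv = []

def read_seats(flight_key):
    return flight_inst[flight_key]

def reserve(flight_key, customer, seats_read):
    # Acts on the value read earlier, exactly as T1 and T2 do.
    if seats_read > 0:
        reserv.append((flight_key, customer))
        flight_inst[flight_key] = seats_read - 1

# Interleaved schedule: both reads happen before either write.
seats_t1 = read_seats(key)          # T1 sees 1 seat
seats_t2 = read_seats(key)          # T2 also sees 1 seat
reserve(key, "customer1", seats_t1)
reserve(key, "customer2", seats_t2)

print(len(reserv), flight_inst[key])
```

Two reservations now exist for a single seat, and the seat counter reads 0 instead of revealing the conflict. Running the two transactions one after the other would have rejected the second customer.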

ACID Transactions

An ACID transaction is a sequence of database operations that has the following properties:

• Atomicity
  Either all operations are carried out, or none is
  This property is the responsibility of the concurrency control and the recovery sub-systems
• Consistency
  A transaction maps a correct database state to another correct state
  This requires that the transaction is correct, which is the responsibility of the application programmer

Concurrency Control (cont.)

• Isolation
  Although multiple transactions execute concurrently, i.e. interleaved, not parallel, they appear to execute sequentially
  This is the responsibility of the concurrency control sub-system
• Durability
  The effect of a completed transaction is permanent
  This is the responsibility of the recovery manager
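
Atomicity is the easiest of these properties to observe from application code. The sketch below uses Python's sqlite3 module, where a `with conn:` block commits if it completes and rolls back if it raises. The table names follow the reservation example; the `book` function and its `fail_midway` flag are hypothetical and exist only to simulate a failure between the two writes.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE flight_inst (flight_no INTEGER, date TEXT, avail_seats INTEGER)")
conn.execute("CREATE TABLE reserv (flight_no INTEGER, date TEXT, customer TEXT)")
conn.execute("INSERT INTO flight_inst VALUES (912, '2012-06-25', 1)")
conn.commit()

def book(customer, fail_midway=False):
    # Both writes belong to one transaction: all or nothing.
    with conn:
        conn.execute("INSERT INTO reserv VALUES (912, '2012-06-25', ?)", (customer,))
        if fail_midway:
            raise RuntimeError("simulated failure between the two writes")
        conn.execute("UPDATE flight_inst SET avail_seats = avail_seats - 1 "
                     "WHERE flight_no = 912")

try:
    book("customer1", fail_midway=True)
except RuntimeError:
    pass
reservations_after_failure = conn.execute("SELECT COUNT(*) FROM reserv").fetchone()[0]

book("customer2")
reservations_after_success = conn.execute("SELECT COUNT(*) FROM reserv").fetchone()[0]
seats_left = conn.execute("SELECT avail_seats FROM flight_inst").fetchone()[0]
```

The failed booking leaves no reservation row behind, so the database never shows a reservation without the matching seat decrement.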

Concurrency Control (cont.)

• Serializability is a good definition of correctness
• A variety of concurrency control protocols exist
• Two-phase locking (2PL)
  deadlock and livelock possible
  deadlock prevention: wait-die, wound-wait
  deadlock detection: rollback a transaction
• Optimistic protocol: proceed optimistically; back up and repair if needed
• Pessimistic protocol: do not proceed until knowing that no back up is needed

Recovery

Storage types:
• Volatile: main memory
• Nonvolatile: disk

Errors:
• Logical error: transaction fails; e.g. bad input, overflow
• System error: transaction fails; e.g. deadlock
• System crash: power failure; main memory lost, disk survives
• Disk failure: head crash, sabotage, fire; disk lost

What to do?

Recovery (cont.)

• Deferred update (NO-UNDO/REDO):
  don’t change database until ready to commit
  write-ahead to log on disk
  change the database
• Immediate update (UNDO/NO-REDO):
  write-ahead to log on disk
  update database anytime
  commit not allowed until database is completely updated
• Immediate update (UNDO/REDO):
  write-ahead to log on disk
  update database anytime
  commit allowed before database is completely updated
• Shadow paging (NO-UNDO/NO-REDO):
  write-ahead to log on disk
  keep shadow page; update copy only; swap at commit

Parallel Databases

A database in which a single query may be executed by multiple processors working together in parallel

There are three types of systems:
• Shared memory
• Shared disk
• Shared nothing

Parallel Databases - Shared Memory

• processors share memory via bus
• extremely efficient processor communication via memory writes
• bus becomes the bottleneck
• not scalable beyond 32 or 64 processors

[Figure: several processors (P) attached by a bus to one shared memory (M) and disk]

Parallel Databases - Shared Disk

• processors share disk via interconnection network
• memory bus not a bottleneck
• fault tolerance wrt. processor or memory failure
• scales better than shared memory
• interconnection network to disk subsystem is a bottleneck

[Figure: four processors (P), each with its own memory (M), sharing the disk subsystem]

Parallel Databases - Shared Nothing

• scales better than shared memory and shared disk
• main drawbacks:
  higher processor communication cost
  higher cost of non-local disk access
• used in the Teradata database machine

[Figure: four nodes, each pairing a processor with its own memory (PM)]

OUTLINE

• NoSQL Definition
• Motivation
• Data Store Introduction
  -- Key-value Stores
  -- Document Stores
  -- Extensible Record Stores
  -- New Relational Database
• Conclusion

NoSQL: The Name

Changes in the environment

Network latency on the Internet increases

Time out

Partial failures

Network partitions

RDBMS

Web apps can (usually) do without
-- Transactions / strong consistency / integrity
-- Complex queries

Web apps have different needs (than the apps that RDBMS were designed for)
-- Scalability & elasticity (at low cost)
-- High availability
-- Flexible schemas / semi-structured data
-- Geographic distribution (multiple datacenters)

NoSQL Systems

• No declarative query language: more programming
• Relaxed consistency: fewer guarantees

NoSQL Systems

The idea behind NoSQL: by giving up ACID constraints, one can achieve much higher performance and scalability.

ACID = Atomicity, Consistency, Isolation, and Durability
BASE = Basically Available, Soft state, Eventually consistent

CAP Theorem

A system can have only two out of three of the following properties: consistency, availability, and partition-tolerance.

CAP details

Consistency (all nodes see the same data at the same time)

Availability (a guarantee that every request receives a response about whether it was successful or failed)

Partition tolerance (the system continues to operate despite arbitrary message loss or failure of part of the system)

The easiest way to understand CAP

• Two nodes on opposite sides of a partition.
• Allowing at least one node to update state will cause the nodes to become inconsistent, forfeiting C.
• Likewise, if the choice is to preserve consistency, one side of the partition must act as if it is unavailable, forfeiting A.
• Only when nodes communicate is it possible to preserve both consistency and availability, forfeiting P.
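
The three cases above can be played through with a toy model of two replicas. This is a deliberately simplified sketch, not a real replication protocol: the `Replica` class, the `write` function, and its `prefer` parameter are all hypothetical names introduced for illustration.

```python
class Replica:
    def __init__(self):
        self.value = "v0"

def write(replicas, target, new_value, partitioned, prefer):
    """Apply a write at replicas[target]; return True if it was accepted."""
    if not partitioned:
        # Nodes can communicate: update every replica (consistent and available).
        for replica in replicas:
            replica.value = new_value
        return True
    if prefer == "availability":
        # Accept the write locally; the replicas now disagree (C is forfeited).
        replicas[target].value = new_value
        return True
    # Prefer consistency: refuse the write during the partition (A is forfeited).
    return False

ap = [Replica(), Replica()]
ap_accepted = write(ap, 0, "v1", partitioned=True, prefer="availability")

cp = [Replica(), Replica()]
cp_accepted = write(cp, 0, "v1", partitioned=True, prefer="consistency")

connected = [Replica(), Replica()]
connected_accepted = write(connected, 0, "v1", partitioned=False, prefer="consistency")
```

During the partition, the availability-first system answers but diverges, the consistency-first system stays in agreement but rejects the request, and only the unpartitioned system gets both.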

Classification of NoSQL systems and tradeoffs (1).

Read performance versus write performance
• HBase is optimized for write performance.
• Records on disk are never overwritten; instead, updates are written to a buffer in memory, and the entire buffer is written sequentially to disk.

Latency versus durability
• Writes are synched to disk before the system returns success to users.
• Writes are stored in memory at write time and synched to disk later.
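
The write path described above (buffer updates in memory, then write the whole buffer out sequentially, never overwriting old records) can be sketched in a few lines. This is an illustrative model only, not HBase's implementation; the class name and its `segments` list, which stands in for immutable files on disk, are assumptions.

```python
class LogStructuredStore:
    def __init__(self, buffer_limit):
        self.buffer_limit = buffer_limit
        self.buffer = {}      # in-memory write buffer
        self.segments = []    # stand-in for immutable, sequentially written disk files

    def put(self, key, value):
        self.buffer[key] = value
        if len(self.buffer) >= self.buffer_limit:
            self.flush()

    def flush(self):
        """Write the entire buffer out as one new segment; old segments are untouched."""
        if self.buffer:
            self.segments.append(dict(sorted(self.buffer.items())))
            self.buffer = {}

    def get(self, key):
        if key in self.buffer:
            return self.buffer[key]
        for segment in reversed(self.segments):   # newest segment first
            if key in segment:
                return segment[key]
        return None

store = LogStructuredStore(buffer_limit=2)
store.put("a", 1)
store.put("b", 2)     # buffer is full: flushed as the first segment
store.put("a", 3)     # newer version of "a" lives only in the buffer for now
```

Writes stay cheap because they only touch memory and append new segments, while a read may have to consult the buffer and several segments. Until `flush` runs, the buffered value would be lost in a crash, which is the latency versus durability tradeoff from the same slide.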

Classification of NoSQL systems and tradeoffs (2).

Synchronous versus asynchronous replication
• Improve system availability, avoid data loss, and improve performance.

Data partition
• Row-based storage: efficient access of an entire record.
• Column storage: efficient for accessing a subset of the columns.
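
The difference between the two layouts shows up even with in-memory Python structures. In the sketch below the same three records are held once as a list of rows and once as one list per column; the sample values reuse the flight data from earlier slides.

```python
# Row layout: each record is stored together.
rows = [
    {"flight": 101, "airline": "delta", "price": 156},
    {"flight": 545, "airline": "american", "price": 110},
    {"flight": 912, "airline": "scandinavian", "price": 450},
]

# Column layout: all values of one attribute are stored together.
columns = {name: [row[name] for row in rows] for name in rows[0]}

# Fetching an entire record is a single access in the row layout ...
whole_record = rows[1]
# ... while an aggregate over one attribute touches only that column.
total_price = sum(columns["price"])
```

Reassembling a full record from the column layout needs one access per column, which is why the choice depends on whether queries read whole records or a few attributes of many records.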

Design decisions of various systems.

Types of NoSQL Databases

Key-value stores

Document stores

Extensible record stores

Types of NoSQL Databases

NoSQL systems differ mainly in their data models

Specific implementations differ in the persistence mechanism and additional functionalities:
• Replication
• Versioning
• Locking
• Transactions
• etc.

Key-Value Stores

• Global Collection of Key/Value Pairs

• Inspired by Amazon’s Dynamo and Distributed Hashtables

• Operations
    void Put(string key, byte[] data);
    byte[] Get(string key);
    void Remove(string key);
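
The three operations above are the whole interface of a key-value store. A minimal single-process sketch in Python follows; the class name is hypothetical, and a real store would add persistence, partitioning, and replication behind the same interface.

```python
class KeyValueStore:
    """Toy in-memory store exposing only put, get, and remove."""

    def __init__(self):
        self._data = {}

    def put(self, key: str, data: bytes) -> None:
        self._data[key] = data

    def get(self, key: str) -> bytes:
        return self._data[key]

    def remove(self, key: str) -> None:
        self._data.pop(key, None)

kv = KeyValueStore()
kv.put("user:42", b'{"name": "Bob"}')   # the value is opaque bytes to the store
fetched = kv.get("user:42")
kv.remove("user:42")
```

Because the store treats the value as an uninterpreted byte string, it cannot query by anything except the key. That limitation is what document stores and extensible record stores relax.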

Key-Value Stores: Examples

Project Voldemort

• Advanced key-value store
• Created by LinkedIn, now open source
• Written in Java
• Provides MVCC (multiversion concurrency control)
• Asynchronous replication
• Sharding + consistent hashing
• Automatic failure detection and recovery

MVCC

Snapshot one
Time  Object 1  Object 2
t1    "Hello"   "Bar"
t0    "Foo"     "Bar"

Snapshot two
Time  Object 1  Object 2   Object 3
t2    "Hello"   (deleted)  "Foo-Bar"
t1    "Hello"   "Bar"
t0    "Foo"     "Bar"
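
The snapshot tables above can be reproduced with a small versioned store: every write adds a new version stamped with a time, and a reader picks the newest version that is not later than its snapshot. This is an illustrative sketch, not Voldemort's implementation; the class and method names are hypothetical.

```python
class MVCCStore:
    def __init__(self):
        self.versions = {}   # key -> list of (timestamp, value); None marks a deletion
        self.clock = 0

    def commit(self, changes):
        """Write a batch of {key: value} as one new version; return its timestamp."""
        timestamp = self.clock
        for key, value in changes.items():
            self.versions.setdefault(key, []).append((timestamp, value))
        self.clock += 1
        return timestamp

    def read(self, key, snapshot):
        """Newest value of key whose timestamp is <= snapshot (None if absent or deleted)."""
        result = None
        for timestamp, value in self.versions.get(key, []):
            if timestamp <= snapshot:
                result = value
        return result

store = MVCCStore()
t0 = store.commit({"object1": "Foo", "object2": "Bar"})
t1 = store.commit({"object1": "Hello"})
t2 = store.commit({"object2": None, "object3": "Foo-Bar"})
```

A reader that started at t1 keeps seeing "Bar" for object2 even after the deletion at t2, because old versions are kept rather than overwritten. Readers therefore never block writers.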

A Solution: Hashing

Example: y = ax+b (mod n)

Intuition: assigns items to “random” caches
• few items per cache
• easy to compute which cache holds an item

[Figure: a server's items are assigned to caches by the hash function; users use the hash to compute the cache for an item]

Adding Caches: why consistent hashing?

• Suppose a new cache arrives. How do we work it into the hash function?
• Natural change: y = ax+b (mod n+1)
• Problem: changes the bucket for almost every item
  every cache will be flushed
  servers get swamped with new requests
• Goal: when a bucket is added, few items move
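
The sketch below compares the two schemes when a fifth cache joins four existing ones. It is an illustration under stated assumptions: an MD5-based hash stands in for the slide's y = ax+b, the `Ring` class is a hypothetical minimal consistent-hash ring with 100 points per cache, and the item and cache names are invented.

```python
import bisect
import hashlib

def h(text):
    """Stable 64-bit hash (Python's built-in hash() varies between processes)."""
    return int.from_bytes(hashlib.md5(text.encode()).digest()[:8], "big")

def mod_cache(item, n):
    """Plain modular hashing: the cache index depends directly on n."""
    return h(item) % n

class Ring:
    """Consistent hashing: caches own points on a ring; an item goes to the next point."""

    def __init__(self, caches, points_per_cache=100):
        self.points = sorted((h(f"{cache}#{i}"), cache)
                             for cache in caches for i in range(points_per_cache))
        self.keys = [point for point, _ in self.points]

    def cache_for(self, item):
        index = bisect.bisect(self.keys, h(item)) % len(self.points)
        return self.points[index][1]

items = [f"item{i}" for i in range(1000)]

# Modular hashing: going from n=4 to n=5 reassigns most items.
moved_mod = sum(mod_cache(x, 4) != mod_cache(x, 5) for x in items)

# Consistent hashing: only items that now fall before a point of the new cache move.
old_ring = Ring(["cache0", "cache1", "cache2", "cache3"])
new_ring = Ring(["cache0", "cache1", "cache2", "cache3", "cache4"])
moved_ring = [x for x in items if old_ring.cache_for(x) != new_ring.cache_for(x)]

print(f"mod n: {moved_mod} items moved, ring: {len(moved_ring)} items moved")
```

With the ring, every item that moves ends up on the new cache, and the existing caches keep everything else, so no cache is flushed wholesale.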

Project Voldemort

Operations:

value = store.get(key)

store.put(key, value)

store.delete(key)

Pros? & Cons?

Document Stores: Document?

What is a document?
• Semi-structured data
• Encapsulates and encodes data (or information) in some standard formats or encodings
• Encodings: XML, YAML, JSON, BSON
• Binary forms: PDF, Microsoft Office documents, etc.

Document Stores: Document?

Documents are like rows or records in relational databases, BUT

Row: schema
Document: no schema

FirstName:"Bob", Address:"5 Oak St.", Hobby:"sailing"

FirstName:"Jonathan", Address:"15 Wanamassa Point Road", Children:[{Name:"Michael",Age:10}, {Name:"Jennifer", Age:8}, {Name:"Samantha", Age:5}, {Name:"Elena", Age:2}]

Document Stores

• Similar to key-value stores, but with one major difference: the value is a document
• Generally support secondary indexes
• Flexible schema
  Any number of fields can be added
  Multiple types of documents (objects) and nested documents or lists
• Documents stored in JSON or Binary JSON (BSON)
• No ACID property

Document Stores: Examples

[Figure: logos of example document stores; legible text: "TERRASTORE", "by Google"]

CouchDB

• Apache project since 2008
• Schema free, document oriented database
  Documents are stored in JSON format
  Support secondary indexes
  B-tree storage engine
  MVCC model, no locking
  No joins, no PK/FK
• Incremental replication

CouchDB

REST API

Libraries for various languages that convert native API calls into the RESTful calls: Java, C, PHP, etc.

CRUD    HTTP    Params
Create  PUT     /db/docid
Read    GET     /db/docid
Update  POST    /db/docid
Delete  DELETE  /db/docid
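
A client library of the kind mentioned above mostly turns a CRUD call into the corresponding HTTP request. The sketch below builds such requests with Python's standard urllib, without sending them. The verb mapping is copied from the table on this slide; the base address (CouchDB's default port on localhost), the database name, and the `build_request` helper are assumptions made for illustration. Revision handling for updates and deletes is left out and should be checked against the CouchDB HTTP API reference.

```python
import json
import urllib.request

BASE = "http://localhost:5984"   # assumed address of a local CouchDB instance
CRUD_TO_HTTP = {"create": "PUT", "read": "GET", "update": "POST", "delete": "DELETE"}

def build_request(operation, db, docid, document=None):
    """Build (but do not send) the HTTP request for one CRUD operation."""
    body = json.dumps(document).encode() if document is not None else None
    headers = {"Content-Type": "application/json"} if body else {}
    return urllib.request.Request(
        f"{BASE}/{db}/{docid}", data=body, headers=headers,
        method=CRUD_TO_HTTP[operation])

# Hypothetical database "flights" and document id "101".
create = build_request("create", "flights", "101", {"airline": "delta", "weekday": "mo"})
read = build_request("read", "flights", "101")
delete = build_request("delete", "flights", "101")

for request in (create, read, delete):
    print(request.get_method(), request.full_url)
```

Passing one of these objects to `urllib.request.urlopen` would send it; that step is omitted here because it requires a running server.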

CouchDB: Views

Views
• Filter, sort, “join”, aggregate, report
• Map/Reduce based
• K/V pairs from Map/Reduce are also stored in the B-tree engine
• Built on demand
• Can be materialized & incrementally updated
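
The idea of a Map/Reduce-based view can be shown without CouchDB itself. In the Python sketch below, a map function emits key/value pairs per document, the pairs are sorted by key (mirroring how they would sit in a B-tree), and a reduce function aggregates the values for each key. CouchDB views are normally written in JavaScript; the function names and sample documents here are hypothetical.

```python
from itertools import groupby

docs = [
    {"_id": "101", "airline": "delta", "price": 156},
    {"_id": "97", "airline": "delta", "price": 258},
    {"_id": "545", "airline": "american", "price": 110},
]

def map_fn(doc):
    """Emit one (key, value) pair per document: airline -> price."""
    yield doc["airline"], doc["price"]

def reduce_fn(values):
    """Aggregate all values emitted for one key."""
    return sum(values)

# Map phase, with the emitted pairs kept in key order.
emitted = sorted(pair for doc in docs for pair in map_fn(doc))

# Reduce phase: one result per distinct key.
view = {key: reduce_fn([value for _, value in group])
        for key, group in groupby(emitted, key=lambda pair: pair[0])}
```

Keeping the emitted pairs sorted by key is what makes range queries over a view cheap, and storing them allows the view to be updated incrementally when a document changes instead of being recomputed from scratch.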

CouchDB: Views

CouchDB: Local Consistency

• CouchDB uses Multi-Version Concurrency Control (MVCC)

CouchDB: “Global” Consistency

• Incremental Replication

Extensible record stores

• Extensible record stores are also called column stores.
• Each key is associated with multiple attributes (i.e. columns)
• Hybrid row/column stores
• Inspired by Google BigTable
• Examples: HBase, Cassandra
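
One way to picture this data model is as nested maps: row key, then column family, then column, then value, the grouping of columns into families being the organization BigTable uses. The Python sketch below models a table that way. The row keys, family names, and departure times are invented, and `get_cell` and `put_cell` are hypothetical helpers.

```python
# row key -> column family -> column -> value
table = {
    "flight:101": {
        "info": {"airline": "delta", "price": "156"},
        "schedule": {"mo": "08:15"},          # invented departure time
    },
    "flight:545": {
        "info": {"airline": "american"},      # rows need not share the same columns
    },
}

def get_cell(row_key, family, column):
    """Return the cell value, or None if the row, family, or column is absent."""
    return table.get(row_key, {}).get(family, {}).get(column)

def put_cell(row_key, family, column, value):
    """Add or overwrite one cell; new columns can appear at any time."""
    table.setdefault(row_key, {}).setdefault(family, {})[column] = value

put_cell("flight:545", "schedule", "we", "17:40")   # invented departure time
```

Each row can carry a different set of columns, which is the "extensible" part, while lookups still start from a single row key as in a key-value store.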

Column: HBase

• Based on Google’s BigTable
• Apache Project TLP
• Cloudera (certification, EC2 AMI’s, etc.)
• Layered over HDFS (Hadoop Distributed File System).
• Input/Output for MapReduce Jobs
• APIs
  ---Thrift, REST

Thrift API

• Thrift is an interface definition language that is used to define and create services for numerous languages.
• It is used as a remote procedure call (RPC) framework and was developed at Facebook for "scalable cross-language services development".
• It combines a software stack with a code generation engine to build services that work efficiently to a varying degree and seamlessly between different languages.
• Although developed at Facebook, it is now an open source project in the Apache Software Foundation.
• To put it simply, Apache Thrift is a binary communication protocol.

Thrift Architecture

REST architecture style (1)

Client–server separation of concerns

Stateless The client–server communication is further

constrained by no client context being stored on the server between requests.

Each request from any client contains all of the information necessary to service the request, and any session state is held in the client.
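A minimal sketch of the stateless constraint, assuming a paged listing in which the client carries its own position. `ITEMS`, `handle_request` and the `offset`/`limit` fields are hypothetical names chosen for illustration.

```python
# Hypothetical data set standing in for a server-side resource.
ITEMS = ["a", "b", "c", "d", "e"]


def handle_request(request):
    """Serve one page of ITEMS using only what the request itself carries."""
    # The paging position (session state) travels with the request, so the
    # server keeps no per-client context between calls.
    offset = request.get("offset", 0)
    limit = request.get("limit", 2)
    page = ITEMS[offset:offset + limit]
    next_offset = offset + limit if offset + limit < len(ITEMS) else None
    return {"items": page, "next_offset": next_offset}


# The client holds the session state: it remembers where it is in the listing.
first = handle_request({"offset": 0, "limit": 2})
second = handle_request({"offset": first["next_offset"], "limit": 2})
```

Because nothing is remembered on the server, any server replica could answer either request, and repeating a request gives the same answer.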


REST architecture style (2)

Cacheable
-- As on the World Wide Web, clients can cache responses.
-- Responses must therefore, implicitly or explicitly, define themselves as cacheable or not, to prevent clients from reusing stale or inappropriate data in response to further requests.
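A minimal sketch of the cacheable constraint, assuming each response labels itself with a `max_age` (seconds) or a `no_store` flag. These fields are simplified stand-ins for HTTP cache headers, and `CachingClient` and `fake_server` are hypothetical names. The current time is passed in explicitly so the behaviour is easy to follow.

```python
class CachingClient:
    """Toy client cache that honours a max_age / no_store hint on each response."""

    def __init__(self, fetch):
        self.fetch = fetch   # callable(url) -> response dict
        self.cache = {}      # url -> (expires_at, body)

    def get(self, url, now):
        entry = self.cache.get(url)
        if entry is not None and now < entry[0]:
            return entry[1]  # still fresh: reuse without contacting the server
        response = self.fetch(url)
        if not response.get("no_store") and response.get("max_age", 0) > 0:
            self.cache[url] = (now + response["max_age"], response["body"])
        else:
            self.cache.pop(url, None)
        return response["body"]


calls = []


def fake_server(url):
    # Hypothetical origin: /price is cacheable for 60 s, /balance must not be stored.
    calls.append(url)
    if url == "/price":
        return {"body": f"price-v{calls.count(url)}", "max_age": 60}
    return {"body": f"balance-v{calls.count(url)}", "no_store": True}


client = CachingClient(fake_server)
```

A second `client.get("/price", now=30)` is answered from the cache, while every `client.get("/balance", ...)` goes back to the server.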


REST architecture style (3)

Layered system
-- A client cannot ordinarily tell whether it is connected directly to the end server or to an intermediary along the way.

Code on demand (optional)
-- Servers can temporarily extend or customize the functionality of a client by transferring executable code.

Uniform interface


Column: HBase

Automatic partitioning
Automatic re-balancing/re-partitioning
Fault tolerant
-- HDFS
---- Multiple replicas
Highly distributed


Column: HBase


Column: Cassandra

Created at Facebook for Inbox search
Facebook -> Google Code -> ASF
Commercial support available from Riptano
Features taken from both Dynamo and BigTable
-- Dynamo: consistent hashing, partitioning, replication
-- BigTable: column families, MemTables, SSTables
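A minimal sketch of consistent hashing, the Dynamo-derived technique listed above, assuming one token per node and MD5 as a convenient stable hash. `HashRing`, `add_node` and `node_for` are hypothetical names. This is not Cassandra's implementation.

```python
import bisect
import hashlib


def _hash(value):
    # MD5 is used only to get a stable, well-spread integer; any hash would do.
    return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)


class HashRing:
    def __init__(self, nodes=()):
        self._tokens = []   # sorted token positions on the ring
        self._owner = {}    # token -> node name
        for node in nodes:
            self.add_node(node)

    def add_node(self, node):
        token = _hash(node)
        bisect.insort(self._tokens, token)
        self._owner[token] = node

    def node_for(self, key):
        # Walk clockwise to the first token at or after the key's position,
        # wrapping around to the start of the ring.
        idx = bisect.bisect_left(self._tokens, _hash(key))
        return self._owner[self._tokens[idx % len(self._tokens)]]


keys = [f"user:{i}" for i in range(1000)]
ring = HashRing(["node-a", "node-b", "node-c"])
before = {k: ring.node_for(k) for k in keys}
ring.add_node("node-d")
after = {k: ring.node_for(k) for k in keys}
moved = [k for k in keys if before[k] != after[k]]
```

Every key in `moved` now belongs to `node-d`. Adding a node takes keys only from its neighbour on the ring and leaves all other assignments unchanged, which is the point of the technique.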


Column: Cassandra

Symmetric nodes
-- No single point of failure
-- Linearly scalable
-- Ease of administration
Flexible/automated provisioning
Flexible replica replacement
High availability
-- Eventual consistency
-- However, consistency is tunable
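A minimal sketch of what "tunable" consistency means, assuming the usual quorum rule: with N replicas, a read that waits for R replicas is guaranteed to see a write acknowledged by W replicas whenever R + W > N. The function names are hypothetical.

```python
def is_strongly_consistent(n_replicas, write_acks, read_acks):
    """True when every read quorum overlaps every write quorum (R + W > N)."""
    if not (1 <= write_acks <= n_replicas and 1 <= read_acks <= n_replicas):
        raise ValueError("ack counts must be between 1 and the replica count")
    return read_acks + write_acks > n_replicas


def quorum(n_replicas):
    # Smallest majority of the replicas.
    return n_replicas // 2 + 1


N = 3
fast_but_eventual = is_strongly_consistent(N, write_acks=1, read_acks=1)
quorum_both_sides = is_strongly_consistent(N, quorum(N), quorum(N))
write_all_read_one = is_strongly_consistent(N, write_acks=N, read_acks=1)
```

With three replicas, reading and writing at a single replica is only eventually consistent. Quorum reads with quorum writes, or writing to all replicas and reading from one, give overlapping replica sets at the cost of higher latency and lower availability.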


Column: Cassandra

Partitioning
-- Random
---- Good distribution of data between nodes
---- Range scans not possible
-- Order preserving
---- Can lead to unbalanced nodes
---- Range scans, natural order

Extremely fast reads/writes (low latency)
Thrift API
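A minimal sketch contrasting the two partitioning strategies, assuming three nodes and string keys. The node names, split points and function names are hypothetical.

```python
import bisect
import hashlib

NODES = ["n0", "n1", "n2"]


def random_partition(key):
    # Hash the key, then pick a node; neighbouring keys scatter across nodes.
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return NODES[int(digest, 16) % len(NODES)]


# Hypothetical split points: key < "h" -> n0, "h" <= key < "q" -> n1, key >= "q" -> n2.
SPLITS = ["h", "q"]


def ordered_partition(key):
    # Keys keep their natural order, so a key range maps to adjacent nodes.
    return NODES[bisect.bisect_right(SPLITS, key)]


def nodes_for_range(start, end):
    """Nodes an order-preserving layout must contact to scan [start, end]."""
    lo = bisect.bisect_right(SPLITS, start)
    hi = bisect.bisect_right(SPLITS, end)
    return NODES[lo:hi + 1]
```

With the order-preserving layout, a scan from "a" to "c" touches only `n0`, but a workload whose keys all start with "a" also lands entirely on `n0`. The random layout spreads such keys out, at the price that a range scan has no small set of nodes to target.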


Column-oriented NoSQL

Name | Producer | Data model | Querying
BigTable | Google | Set of (key, values) pairs | Selection (by combination of row, column, and time stamp ranges)
HBase | Apache | Groups of columns (a BigTable clone) | JRuby IRB-based shell (similar to SQL)
Hypertable | Hypertable | Like BigTable | HQL (Hypertable Query Language)
Cassandra | Apache | Columns, groups of columns corresponding to a key (supercolumns) | Simple selection on key, range queries, column or column ranges
PNUTS | Yahoo | (Hashed or ordered) tables, typed arrays, flexible schema | Selection and projection from a single table (retrieve an arbitrary single record by primary key, range queries, complex predicates, ordering, top-k)


Scalable Relational Systems

Also called NewSQL
SQL
ACID
Performance and scalability through modern, innovative software architecture


Scalable Relational Systems

An RDBMS can provide scalability, provided that applications:
-- Use small-scope operations
-- Use small-scope transactions


MySQL Cluster

Shared-nothing clusters

NDB storage engine (replaces InnoDB)

Replication (2PC)

Horizontal data partitioning


Two-phase commit protocol

The commit-request phase (or voting phase)
-- A coordinator process attempts to prepare all the transaction's participating processes to take the necessary steps for either committing or aborting the transaction, and to vote either "Yes" (commit) or "No" (abort).

The commit phase
-- Based on the votes of the cohorts, the coordinator decides whether to commit (only if all have voted "Yes") or abort the transaction (otherwise), and notifies all the cohorts of the result. The participating processes then follow with the needed actions.
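The two phases above can be sketched as follows. This is a minimal in-process illustration that assumes reliable participants and leaves out timeouts, logging and crash recovery. `Participant` and `two_phase_commit` are hypothetical names.

```python
class Participant:
    def __init__(self, name, can_commit=True):
        self.name = name
        self.can_commit = can_commit
        self.state = "init"

    def prepare(self):
        # Phase 1: do the local work needed to commit, then vote.
        self.state = "prepared" if self.can_commit else "aborted"
        return self.can_commit

    def commit(self):
        self.state = "committed"

    def abort(self):
        self.state = "aborted"


def two_phase_commit(participants):
    """Run both phases and return the coordinator's decision."""
    # Commit-request (voting) phase: collect a vote from every participant.
    votes = [p.prepare() for p in participants]
    # Commit phase: commit only on a unanimous "Yes", otherwise abort.
    decision = "commit" if all(votes) else "abort"
    for p in participants:
        if decision == "commit":
            p.commit()
        else:
            p.abort()
    return decision
```

A single "No" vote is enough to abort the transaction on every participant, including those that voted "Yes".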



Horizontal data partitioning

Data within NDB tables is automatically partitioned across all of the data nodes in the system.

Partitioning uses a hashing algorithm applied to the table's PRIMARY KEY and is transparent to the end application.


MySQL Cluster


Scalable Relational Systems


CONCLUSION: NoSQL pros/cons

Advantages
-- Massive scalability
-- High availability
-- Lower cost (than competitive solutions at that scale)
-- (Usually) predictable elasticity
-- Schema flexibility, sparse & semi-structured data

Disadvantages
-- Limited query capabilities (so far)
-- Eventual consistency is not intuitive to program for
---- Makes client applications more complicated
-- No standardization
---- Portability might be an issue


CONCLUSION

For now, NoSQL databases are still far from being advanced database technologies

NoSQL will not replace traditional relational DBMSs

NoSQL databases are good for specialized applications involving large, unstructured, distributed data with high requirements on scaling


A reading list

E. Brewer, "CAP Twelve Years Later: How the 'Rules' Have Changed", IEEE Computer, 2012


Thank you