NoSQL Database: Apache Cassandra
www.folio3.com @folio_3
Folio3 – Overview
Who We Are
We are a Development Partner for our customers
Design software solutions, not just implement them
Focus on the solution – Platform and technology agnostic
Expertise in building applications that are:
Mobile Social Cloud-based Gamified
What We Do
Areas of Focus
Enterprise
Custom enterprise applications
Product development targeting the enterprise
Mobile
Custom mobile apps for iOS, Android, Windows Phone, BB OS
Mobile platform (server-to-server) development
Social Media
CMS based websites for consumers and enterprise (corporate, consumer,
community & social networking)
Social media platform development (enterprise & consumer)
Folio3 At a Glance
Founded in 2005
Over 200 full time employees
Offices in the US, Canada, Bulgaria & Pakistan
Palo Alto, CA
Sofia, Bulgaria
Karachi, Pakistan
Toronto, Canada
Areas of Focus: Enterprise
Automating workflows
Cloud based solutions
Application integration
Platform development
Healthcare
Mobile Enterprise
Digital Media
Supply Chain
Areas of Focus: Mobile
Serious enterprise applications for banks and businesses
Fun consumer apps for app discovery, interaction, exercise gamification and play
Educational apps
Augmented Reality apps
Mobile Platforms
Some of Our Mobile Clients
Areas of Focus: Web & Social Media
Community sites based on Content Management Systems
Enterprise Social Networking
Social Games for Facebook & Mobile
Companion Apps for games
Some of Our Web Clients
NoSQL Database: Apache Cassandra
Agenda
What is NOSQL?
Motivations for NOSQL?
Brewer’s CAP Theorem
Taxonomy of NOSQL databases
Apache Cassandra
Features
Data Model
Consistency
Operations
Cluster Membership
What Does NOSQL Mean for RDBMS?
What is NOSQL?
Refers to databases that differ from traditional relational database
management systems (RDBMS)
Distributed, flexible, horizontally scalable data stores
Confusion with the term NOSQL
NOSQL != No SQL (or Anti-SQL)
NOSQL = Not Only SQL
NOSQL is an inaccurate term since it is commonly used to refer to
"non-relational" databases but the term has stuck
Motivations for NOSQL
Classical RDBMS unsuitable for today's web applications
because:
Performance (Latency): Variable
Flexibility: Low
Scalability: Variable
Functionality
Brewer's CAP Theorem
Consistency (C)
Availability (A)
Partition Tolerance (P)
Pick any two
Most NOSQL databases sacrifice Consistency
in favor of high Availability and Performance
Taxonomy of NOSQL
Key/Value Stores - Distributed Hash Tables (DHT)
Memcached, Amazon’s Dynamo, Redis, PStore
Document Stores
Semi structured data (stores entire documents)
CouchDB, MongoDB, RDDB, Riak
Graph Databases *
Based on graph theory
ActiveRDF, AllegroGraph, Neo4J
Object Database *
Versant, Objectivity
Column-oriented Stores
* These are considered soft NOSQL databases and are usually placed in the NOSQL category because they are
"non-relational".
Column-Oriented Data Stores
Semi-structured column-based data stores
Stores each column separately so that aggregate operations for one column
of the entire table are significantly quicker than the traditional row storage
model
Popular examples
Hadoop/HBASE
Apache Cassandra
Google's BigTable
HyperTable
Amazon's SimpleDB
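The storage-layout point made on this slide (aggregating one column is cheaper when columns are stored separately) can be illustrated with a small in-memory sketch. The table contents and the helper name `to_columns` are made up for illustration; real column stores work on disk files, not Python lists.

```python
# Row storage: one record per row, every field kept together.
rows = [
    {"id": 1, "price": 10.0, "qty": 2},
    {"id": 2, "price": 20.0, "qty": 1},
    {"id": 3, "price": 30.0, "qty": 5},
]

def to_columns(records):
    """Pivot row records into one list per column (hypothetical helper)."""
    columns = {}
    for record in records:
        for name, value in record.items():
            columns.setdefault(name, []).append(value)
    return columns

columns = to_columns(rows)

# Row layout: the aggregate has to visit every full record.
total_from_rows = sum(record["price"] for record in rows)

# Column layout: the aggregate reads only the one column it needs.
total_from_columns = sum(columns["price"])
```

Both totals are the same; the difference is how much unrelated data has to be read to compute them.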
Apache Cassandra
Fully distributed column oriented data store
Also provides Map Reduce implementation using Hadoop (increased
performance)
Based on Google's BigTable (Data Model) and Amazon's Dynamo
(Consistency & Partition Tolerance)
Cassandra values Availability and Partitioning tolerance (AP) while
providing tunable consistency levels.
History
Developed at Facebook
Released as open source project on Google Code in July 2008
Became an Apache Incubator Project in March 2009
Became a top level Apache project in February 2010
Rumors of Facebook having started working on its own separate
version of Cassandra
Features
Fully Distributed
Highly Scalable
Fault Tolerant (No single point of failure)
Tunable Consistency (Eventually Consistent)
Semi-structured key-value store
High Availability
No Referential Integrity
No Joins
Data Model
KeySpace (Uppermost namespace)
Column Family / Super Column Family (analogous to table)
Super Column
Column (Name, Value, Timestamp)
Rows are referenced through keys
Each column family is stored in its own separate physical files
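The nesting described above can be pictured as nested maps. The following is only a sketch of the logical model using Python dictionaries; the column family names, row keys and the helper names `make_column` and `get_column` are hypothetical, and this is not how Cassandra lays data out on disk.

```python
def make_column(name, value, timestamp):
    """A column is a (name, value, timestamp) triple."""
    return {"name": name, "value": value, "timestamp": timestamp}

# keyspace -> column family -> row key -> column name -> column
keyspace = {
    "Users": {  # standard column family (hypothetical name)
        "alice": {  # row key
            "email": make_column("email", "alice@example.com", 1),
            "city": make_column("city", "Karachi", 1),
        },
    },
    "UserPosts": {  # super column family: one extra level of nesting
        "alice": {  # row key
            "post-1": {  # super column name
                "title": make_column("title", "Hello", 1),
            },
        },
    },
}

def get_column(ks, column_family, row_key, column_name):
    """Look up a column value in a standard column family by row key."""
    return ks[column_family][row_key][column_name]["value"]
```

Rows are always reached through their key, which is why the row key sits directly under the column family in this picture.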
Standard Column Family
Super Column Family
Super Column Family: Static/Static
Super Column Family: Static/Dynamic
Super Column Family: Dynamic/Static
Super Column Family: Dynamic/Dynamic
Apache Cassandra: Consistency
Consistency refers to whether a system is left in a consistent state
after an operation. In distributed data systems like Cassandra, this
usually means that once a writer has written, all readers will see that
write.
If W + R > N, you will have strongly consistent behavior; that is, readers
will always see the most recent write
W is the number of nodes to block for on write
R is the number to block for on reads
N is the replication factor (number of replicas)
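The W + R > N rule can be checked with a few lines of arithmetic. This is a minimal sketch; the function name `is_strongly_consistent` is made up for illustration and is not part of any Cassandra API.

```python
def is_strongly_consistent(w: int, r: int, n: int) -> bool:
    """Return True when every read set must overlap every write set.

    w: replicas that must acknowledge a write
    r: replicas that must answer a read
    n: replication factor
    """
    if not (1 <= w <= n and 1 <= r <= n):
        raise ValueError("w and r must be between 1 and n")
    # If w + r > n, any r replicas include at least one of the w
    # replicas that acknowledged the latest write.
    return w + r > n
```

With N = 3, writing to 2 replicas and reading from 2 satisfies the rule, while writing to 1 and reading from 1 does not.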
Apache Cassandra: Consistency
Relational databases provide strong consistency (ACID)
Cassandra provides eventual consistency (BASE), meaning the database
will eventually reach a consistent state
QUORUM reads and writes give consistency while still allowing
availability
Q = floor(N / 2) + 1 (simple majority)
If latency is more important than consistency, you can lower values
for either or both W and R.
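The quorum formula can be worked through as follows, reading N / 2 as integer division. The helper names `quorum` and `tolerated_failures` are hypothetical and only illustrate the arithmetic.

```python
def quorum(n: int) -> int:
    """Simple majority of n replicas: floor(n / 2) + 1."""
    if n < 1:
        raise ValueError("replication factor must be at least 1")
    return n // 2 + 1

def tolerated_failures(n: int) -> int:
    """Replicas that can be down while a quorum is still reachable."""
    return n - quorum(n)
```

Because two majorities of the same N replicas always share at least one member, QUORUM writes combined with QUORUM reads satisfy W + R > N. For N = 3 the quorum is 2, so one replica can be unavailable.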
Apache Cassandra: Consistency Levels
Write: ZERO, ANY, ONE, QUORUM, ALL
Read: ONE, QUORUM, ALL (ZERO and ANY apply to writes only)
Write Operation
Client sends a write request to a random node; the random node
forwards the request to the proper node (1st replica responsible for
the partition - coordinator)
Coordinator sends requests to N replicas
If W replicas confirm the write operation then OK
Always writable, hinted handoff (If a replica node for the key is down,
Cassandra will write a hint to the live replica node indicating that the
write needs to be replayed to the unavailable node.)
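The write path and hinted handoff described above can be sketched as a toy in-memory simulation. Everything here (the `Replica` class, `coordinate_write`, `replay_hints`, and storing hints on the first live replica) is an assumption made for illustration, not Cassandra's actual implementation.

```python
class Replica:
    """Toy replica node holding data and hints for unavailable peers."""

    def __init__(self, name):
        self.name = name
        self.up = True
        self.data = {}   # key -> (value, timestamp)
        self.hints = []  # (target replica name, key, value, timestamp)

def coordinate_write(replicas, key, value, timestamp, w):
    """Send the write to all N replicas; succeed if at least w confirm."""
    live = [node for node in replicas if node.up]
    down = [node for node in replicas if not node.up]
    for node in live:
        node.data[key] = (value, timestamp)
    # Hinted handoff: a live replica records the write for each down one.
    if live:
        for node in down:
            live[0].hints.append((node.name, key, value, timestamp))
    return len(live) >= w

def replay_hints(holder, replicas_by_name):
    """Replay stored hints to targets that are back up; keep the rest."""
    remaining = []
    for target, key, value, timestamp in holder.hints:
        node = replicas_by_name[target]
        if node.up:
            current = node.data.get(key)
            if current is None or current[1] < timestamp:
                node.data[key] = (value, timestamp)
        else:
            remaining.append((target, key, value, timestamp))
    holder.hints = remaining
```

In this sketch a write with W = 2 still succeeds when one of three replicas is down, and the missed write reaches that replica once it is back and the hint is replayed.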
Read Operation
Coordinator sends requests to N replicas, if R replicas respond then
OK
If different versions are returned then reconcile and write back the
reconciled version (Read Repair)
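Read repair can be sketched in the same toy style. Replicas are modeled as plain dictionaries that all respond, and versions are reconciled by keeping the one with the highest timestamp; the function name `read_with_repair` and this data layout are assumptions for illustration only.

```python
def read_with_repair(replicas, key, r):
    """Read key from the responding replicas and repair stale copies.

    replicas: list of dicts mapping key -> (value, timestamp),
              one dict per replica that responded
    r: minimum number of responses required for the read to succeed
    """
    if len(replicas) < r:
        raise RuntimeError("not enough replicas responded")
    versions = [store[key] for store in replicas if key in store]
    if not versions:
        return None
    # Reconcile: the version with the newest timestamp wins.
    newest = max(versions, key=lambda cell: cell[1])
    # Read repair: write the reconciled version back to stale replicas.
    for store in replicas:
        if store.get(key) != newest:
            store[key] = newest
    return newest[0]
```

After one such read, every replica that took part holds the same, most recent version of the value.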
Cluster Membership
Gossip Protocol
Every T seconds each node increments its heartbeat counter
and gossips to another node about the state of the cluster;
the receiving node merges the cluster info with its own copy
Cluster state (node in/out, failure) propagates quickly:
O(log N), where N is the number of nodes in the cluster
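The heartbeat-and-merge step can be simulated in a few lines. This is a simplified sketch: each node keeps a map of node name to the highest heartbeat it has seen, gossips to one random peer per round, and the receiver keeps the larger counter for each node. The function names, the seeded random generator and the `max_rounds` guard are assumptions for illustration.

```python
import random

def merge_state(mine, theirs):
    """Merge a peer's view into mine, keeping the higher heartbeat."""
    for node, heartbeat in theirs.items():
        if heartbeat > mine.get(node, -1):
            mine[node] = heartbeat

def gossip_round(views, rng):
    """Each node bumps its own heartbeat and gossips to one random peer."""
    names = list(views)
    for name in names:
        views[name][name] += 1
        peer = rng.choice([other for other in names if other != name])
        merge_state(views[peer], views[name])

def rounds_until_converged(num_nodes, seed=0, max_rounds=1000):
    """Count rounds until every node has heard of every other node."""
    if num_nodes < 2:
        raise ValueError("need at least two nodes to gossip")
    rng = random.Random(seed)
    views = {f"n{i}": {f"n{i}": 0} for i in range(num_nodes)}
    for rounds in range(1, max_rounds + 1):
        gossip_round(views, rng)
        if all(len(view) == num_nodes for view in views.values()):
            return rounds
    raise RuntimeError("did not converge within max_rounds")
```

Running `rounds_until_converged` for growing cluster sizes gives a feel for how slowly the number of rounds grows compared with the number of nodes.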
Storage Ring
Cassandra cluster nodes are organized in a virtual ring.
Each node has a single unique token that defines its place in the ring
and which keys it is responsible for
Key ranges are adjusted when the nodes join or leave
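A token ring lookup can be sketched with a sorted list of tokens. In this sketch a node owns the range from the previous node's token (exclusive) up to its own token (inclusive), and lookups wrap around at the end of the ring. The `Ring` class, the MD5-based `token_for` helper and the small integer tokens in the usage note are assumptions for illustration.

```python
import bisect
import hashlib

def token_for(key: str) -> int:
    """Hash a row key to a position on the ring (MD5 is an assumption)."""
    return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)

class Ring:
    """Toy storage ring: one unique token per node."""

    def __init__(self):
        self._tokens = []  # sorted node tokens
        self._owners = {}  # token -> node name

    def add_node(self, name, token):
        if token in self._owners:
            raise ValueError("token already in use")
        bisect.insort(self._tokens, token)
        self._owners[token] = name

    def remove_node(self, token):
        self._tokens.remove(token)
        del self._owners[token]

    def owner_of_token(self, token):
        """First node whose token is >= the given token, wrapping around."""
        index = bisect.bisect_left(self._tokens, token)
        if index == len(self._tokens):
            index = 0
        return self._owners[self._tokens[index]]

    def owner_of_key(self, key):
        return self.owner_of_token(token_for(key))
```

With nodes A, B and C at tokens 0, 100 and 200, token 50 belongs to B. If B leaves, the same token falls into C's range, which is the key-range adjustment the slide refers to.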
Apache Cassandra: MySQL Comparison
MySQL (> 50 GB data)
Read Average: ~ 350 ms
Write Average: ~ 300 ms
Cassandra (> 50 GB data)
Read Average: 15 ms
Write Average: 0.12 ms
Apache Cassandra: Client API
Low level API
Thrift
High Level API
Java
Hector, Pelops, Kundera
.NET
FluentCassandra, Aquiles
Python
Telephus, Pycassa
PHP
phpcassa, SimpleCassie
Apache Cassandra: Where to Use?
Use Cassandra if you need:
High write throughput
Near-linear scalability
Automated replication/fault tolerance
And if you can tolerate:
Weaker (eventual) consistency
Missing RDBMS features
Apache Cassandra: Users
Facebook (of course)
To power inbox search (previously)
Twitter
To handle user relationships, analytics (but not for tweets)
Digg & Reddit
Both use Cassandra to handle user comments and votes
Rackspace
IBM
To build scalable email system
Cisco's WebEx
To store user feed and activity in near real time
What does NOSQL mean for the future of RDBMS?
No worries! RDBMSs are here to stay for the foreseeable future
NOSQL data stores can be used in combination with RDBMS in some
situations
NOSQL still has a long way to go, in order to reach the widespread
(mainstream) use and support of the RDBMS
Weakness of NOSQL
No or limited support for complex queries
No multi-operation transactions available (only individual operations are atomic)
No standard interface for NOSQL databases (like SQL in relational
databases)
No or limited administrative features available for NOSQL databases
Not suitable (yet) for mainstream use
Why Still Use RDBMS?
All the weaknesses of NOSQL
Relational databases are widely used and understood
RDBMS DBAs and developers are easily available in the market
For big business, relational databases are a safe choice because they
have heavily invested in relational technology
Many database design and development tools available
Contact
For more details about our
services, please get in touch with
us.
US Office: (408) 365-4638
www.folio3.com