Presentation on Big Data, NoSQL with MongoDB and Cassandra
NOSQL INTRO WITH MONGODB & CASSANDRA
NOSQL Intro with MongoDB and Cassandra
Big Data and NoSQL with MongoDB & Cassandra
Requisite Slide – Who Am I?
- Brian Enochson
- SW Engineer who has worked as designer / developer on NOSQL (Mongo, Cassandra, Hadoop)
- Specialize in SW Development, architecture and training
Brian Enochson [email protected] Twitter @benochso Google Plus https://plus.google.com/+BrianEnochson
Agenda
• Presentation Intro
• Introduction to Big Data
• Introduction to NoSQL
• Relational Database to NoSQL technology contrast & compare
• NoSQL landscape
Agenda
• Introduction to MongoDB
• MongoDB Components, capabilities and common use cases
• JSON & BSON
• Documents, collections, references and Mongo ID
• Querying
• Data Modeling/Schema Design
• Replication & Sharding
Agenda
• Cassandra
• Architecture
• Data Model
• Data Modeling
• Application Development
• Wrap-up and final Q & A
Big Data
http://www.cloudtweaks.com/2014/01/hand-writing-data-data-everywhere-but-lets-just-stop-and-think/
Big Data – Why Needed
Why are databases like Mongo or Cassandra needed?
• To understand, one needs to look at
  • The history of databases
  • How systems were built in the past
• Then examine modern applications
  • Web scale
  • Data acquisition
• Other factors like cost of H/W
History of the Database
• 1960’s – Hierarchical and Network type (IMS and CODASYL)
• 1970’s – Beginnings of theory behind relational model. Codd
• 1980’s – Rise of the relational model. SQL. E/R Model (Chen)
• 1990’s – Access/Excel and MySQL. ODMS began to appear
• 2000’s – Two forces: large enterprise and open source. Google and Amazon. CAP Theorem (more on that to come…)
• 2010’s – Emergence of NoSQL as an industry player and viable alternative
Why were alternatives needed?
• Developers today are faced with Internet scale
  • 100,000’s of users
  • Low cost of storage
  • Increased processing power
  • Ability (and need) to capture millions of events. Caching solves it to an extent but brings other complexities
  • Real-time
  • Need to scale out and not up (add any number of low cost machines vs. replace with a more powerful machine)
• Cost
  • Let’s not forget that for enterprise DB’s Internet scale can become expensive
  • Open source DB’s may solve license cost, but don’t ignore operational costs
A lot of data
Some facts from http://www.storagenewsletter.com/rubriques/market-reportsresearch/ibm-cmo-study/
Approximately 90 percent of all the real-time information being created today is unstructured data
Every day we create 2.5 quintillion (10 to the 18th) bytes of data (this is 18 zeroes!!)
90 percent of the world's data today has been created in the last two years alone
Relational vs. NoSQL
• Relational
• Divide into tables, relate via foreign keys, DB constraints, normalized data, the interface is SQL
• NoSQL
• Store in schemaless format, redundancy encouraged, application access (your queries) determines the storage format. Interface varies and is optimized for the implementation, no forced DB constraints.
Are Tradeoffs Bad?
Luckily, due to the large number of compromises made when attempting to scale their existing
relational databases, these tradeoffs were not so foreign or distasteful as they might have been.
Greg Burd - https://www.usenix.org/legacy/publications/login/2011-10/openpdfs/Burd.pdf
What Are Tradeoffs?
Eventual consistency
Application has increased responsibility, such as maintaining consistency & handling transactions
Store redundant data
3 V’s – Describing the Big Data Problem
The driving force behind the need for new technology is often referred to as the “3 V’s”.
• Volume – amount of data
• Variety – range of data types and sources
• Velocity – speed of data in and out
NoSQL is not Big Data
NoSQL != Big Data
NoSQL products were created to help solve the big data problem.
Big data is a much larger problem than just storage. Analysis tools like Hadoop, messaging systems like Kafka, real time processing engines like Storm and machine learning (Mahout) all help solve the big data problem.
NoSQL Types
Document DB
• MongoDB, CouchDB
Wide Column – Column Family
• Cassandra, HBase, Amazon SimpleDB
Key Value
• Riak, Redis, DynamoDB, Voldemort, MemcacheDB
Graph
• Neo4J, OrientDB
Search (search can also be a persistence store)
• Lucene, Solr, ElasticSearch
Many, many, many more! (http://nosql-database.org/)
Choosing the right one…
Choosing the right NoSQL type and eventual product depends on…
Type of Data
• One key and a lot of data?
• Schema variance
• High volume of data?
• Storing media, blobs
• Document oriented?
• Tracking relationships?
• Combination?
• Multi-Datacenter
Type of Access
Volumes of Data (there is big data and there is BIG DATA)
Need/want support/services/training
Some Basic Concepts
• ACID
• CAP Theorem
• BASE
ACID
Probably have heard of ACID
• Atomic – All or None
• Consistency – What is written is valid
• Isolation – One operation at a time
• Durability – Once committed to the DB, it stays
This is the world we have lived in for a long time…
CAP Theorem (Brewer’s)
Many may have heard this one
CAP stands for Consistency, Availability and Partition Tolerance
• Consistency – every read sees the latest write, so all nodes return the same data. Not the same as the C in ACID.
• Availability – service is available.
• Partition Tolerance – No failure other than complete network failure causes the system not to respond
** http://www.cs.berkeley.edu/~brewer/cs262b-2004/PODC-keynote.pdf
You can only have 2 of them
In Mongo terms you can have 2 of 3. Availability, Partition-Tolerance or Eventual Consistency.
VISUAL GUIDE – USING THE CAP THEOREM
http://blog.nahurst.com/visual-guide-to-nosql-systems
Big Data Wrap up
• So we are talking about large amounts of data
• High velocity of acquisition
• A lot of variety that we need to store. We will worry later about how to handle it (or not)
• Need to scale and not break the bank
• Want the database to support agile development, not hinder it
Still Wrapping
• Maybe consider going relational if
• Highly transactional (FoundationDB?)
• Business Intelligence Systems (Hadoop may make this not true)
• Don’t be fooled by fear of losing ACID… http://highscalability.com/blog/2013/5/1/myth-eric-brewer-on-why-banks-are-base-not-acid-availability.html
And now let’s look at MongoDB
DB Popularity
http://db-engines.com/en/ranking_definition
Mongo Overview
A few high-level points
• Document Oriented
• Storage format is JSON (actually BSON)
• Replication built in
• Master / slave architecture
• Strong querying support
• Name from "humongous"
Meet Mongo
• Open Source
• Schemaless
• Scalable
• Document Level Atomicity
• Easy Installation
• Relative Ease Of Use
• Great (!!!!) Documentation
And…
• No cross document transactions
• No joins
• Replication – master / slave
• Sharding
Mongo Advantage
-
* Credit – Dwight Merriman, Founder and CEO – MongoDB (was 10Gen)
Mongo Consistency
Master Slave and Secondary Reads
** http://docs.mongodb.org/manual/core/replication-introduction/
Replica Sets
Primary
• Receives all write requests
• Replica set can only have one primary
• Mongo stores all changes in the oplog
Secondary
• Replicates the primary’s oplog
• Clients can prefer to read from secondaries
• If the primary goes down a new primary is elected (after 10 seconds with no response)
Sharding http://docs.mongodb.org/manual/core/sharding-introduction/
Sharding Clusters
Shards
• Store the data; normally in production each shard is a replica set
Routers
• Route client operations to shards based on shard key; can have more than one for availability
• Shard key is range based or hashed
Config Servers
• Contain cluster metadata
• In production there are 3 config servers
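To make the difference between a range-based and a hashed shard key more concrete, here is a rough sketch of how a router might pick a shard. It is a toy model in Python, not MongoDB's implementation; the shard names, the range boundaries and the use of MD5 are all assumptions made for the example.

```python
import bisect
import hashlib

SHARDS = ["shard0", "shard1", "shard2"]  # hypothetical shard names

# Range based: each shard owns a contiguous range of key values.
# Assumed boundaries: keys < "h" -> shard0, keys < "p" -> shard1, rest -> shard2.
RANGE_UPPER_BOUNDS = ["h", "p"]


def route_by_range(shard_key: str) -> str:
    """Pick the shard whose key range contains shard_key."""
    return SHARDS[bisect.bisect_right(RANGE_UPPER_BOUNDS, shard_key)]


def route_by_hash(shard_key: str) -> str:
    """Pick a shard from a hash of the key, so adjacent keys get spread out."""
    digest = hashlib.md5(shard_key.encode("utf-8")).hexdigest()
    return SHARDS[int(digest, 16) % len(SHARDS)]


if __name__ == "__main__":
    for key in ["alice", "bob", "mike", "zoe"]:
        print(key, route_by_range(key), route_by_hash(key))
```

With range-based routing, neighbouring keys such as "alice" and "bob" land on the same shard, which helps range queries but can create hot spots. Hashing tends to trade that locality for a more even spread.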
Mongo Document
At its simplest form, Mongo is a document oriented database
• MongoDB stores all data in documents, which are JSON-style data structures composed of field-and-value pairs.
• MongoDB stores documents on disk in the BSON serialization format. BSON is a binary representation of JSON documents. BSON contains more data types than does JSON.
** For in-depth BSON information, see bsonspec.org.
What does a Document Look Like
{ "_id" : "52a602280f2e642811ce8478",
"ratingCode" : "PG13", "country" : "USA", "entityType" : "Rating" }
Mongo Documents
Rules for a document
Documents have the following rules:
The maximum BSON document size is 16 megabytes.
The field name _id is reserved for use as a primary key; its value must be unique in the collection.
The field names cannot start with the $ character.
The field names cannot contain the . character.
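The rules above can be sketched as a small client-side check. This is only an illustration in Python: the function name `check_document` is hypothetical, the JSON-encoded size is used as a rough stand-in for the real BSON size, and uniqueness of _id is not checked because that needs the whole collection.

```python
import json

MAX_DOCUMENT_BYTES = 16 * 1024 * 1024  # 16 megabytes, per the rules above


def check_document(doc: dict) -> list:
    """Return a list of rule violations for the top-level fields (empty if none)."""
    problems = []
    for name in doc:
        if name.startswith("$"):
            problems.append(f"field name {name!r} starts with '$'")
        if "." in name:
            problems.append(f"field name {name!r} contains '.'")
    # Assumption: JSON size is only an approximation of the BSON size.
    if len(json.dumps(doc).encode("utf-8")) > MAX_DOCUMENT_BYTES:
        problems.append("document exceeds 16 megabytes")
    return problems


if __name__ == "__main__":
    print(check_document({"_id": "52a602280f2e642811ce8478", "ratingCode": "PG13"}))
    print(check_document({"$bad": 1, "a.b": 2}))
```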
Mongo Install
Windows
http://docs.mongodb.org/manual/tutorial/install-mongodb-on-windows/
MAC
http://docs.mongodb.org/manual/tutorial/install-mongodb-on-os-x/
Create Data Directory, Defaults
• C:\data\db
• /data/db/ (make sure you have permissions)
Or set it using --dbpath
C:\mongodb\bin\mongod.exe --dbpath d:\test\mongodb\data
Start It!
Database
mongod
Shell
mongo
show dbs
show collections
db.stats()
Basic Operations
1_simpleinsert.txt
Insert
Find
• Find all
• Find One
• Find with criteria
Indexes
Explain()
More Mongo Shell
2_arrays_sort.txt
• Embedded documents
• Limit, Sort
• Using regex in query
• Removing documents
• Drop collection
Import / Export
3_imp_exp.txt
Mongo provides tools for getting data in and out of the database
• Data can be exported to JSON files
• JSON files can then be imported
Conditional Operators
4_cond_ops.txt
• $lt
• $gt
• $gte
• $lte
• $or
• Also $not, $exists, $type, $in
(for $type refer to http://docs.mongodb.org/manual/reference/operator/query/type/#_S_type )
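To show what a query document using these operators means, the following sketch evaluates a small subset ($lt, $lte, $gt, $gte, $or and plain equality) against Python dictionaries. It is a toy emulation for illustration, not how MongoDB evaluates queries; the `matches` function and the sample movie data are assumptions.

```python
import operator

# Comparison operators from the list above mapped to Python functions.
COMPARATORS = {
    "$lt": operator.lt,
    "$lte": operator.le,
    "$gt": operator.gt,
    "$gte": operator.ge,
}


def matches(doc: dict, query: dict) -> bool:
    """Return True if doc satisfies query (toy subset of the query language)."""
    for field, condition in query.items():
        if field == "$or":
            # $or takes a list of sub-queries; at least one must match.
            if not any(matches(doc, sub) for sub in condition):
                return False
        elif isinstance(condition, dict):
            # e.g. {"year": {"$gte": 2000, "$lt": 2010}}: all operators must hold.
            if field not in doc:
                return False
            for op, operand in condition.items():
                if not COMPARATORS[op](doc[field], operand):
                    return False
        elif doc.get(field) != condition:
            # A plain value means equality.
            return False
    return True


if __name__ == "__main__":
    movies = [  # hypothetical sample documents
        {"title": "A", "year": 1999},
        {"title": "B", "year": 2005},
        {"title": "C", "year": 2012},
    ]
    query = {"year": {"$gte": 2000, "$lt": 2010}}
    print([m["title"] for m in movies if matches(m, query)])
```

Operators outside this subset (such as $not, $exists, $type and $in) are deliberately left out and would raise a KeyError here.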
Analytics
Aggregation Framework
• Uses a pipeline model to perform a series of operations on data.
• Common is a match phase (selection) and then grouping (create result)
Map Reduce
• Two phases
  • Map, which creates one or more documents from each input document
  • Reduce, which combines output from Map into some result
• Finalize – optional phase that can perform some logic (e.g. sorting) on reduce output
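The two models can be sketched side by side in plain Python. This is only an illustration of the phases described above, not the MongoDB API; the `posts` sample data and both function names are assumptions.

```python
from collections import defaultdict

posts = [  # hypothetical sample documents
    {"author": "ann", "status": "published", "views": 10},
    {"author": "bob", "status": "draft", "views": 3},
    {"author": "ann", "status": "published", "views": 5},
    {"author": "bob", "status": "published", "views": 7},
]


def pipeline_total_views(docs):
    """Pipeline style: a match phase (selection) followed by a group phase."""
    matched = (d for d in docs if d["status"] == "published")  # match
    totals = defaultdict(int)
    for d in matched:                                          # group
        totals[d["author"]] += d["views"]
    return dict(totals)


def map_reduce_total_views(docs):
    """Map emits (key, value) pairs, reduce combines them, finalize sorts."""
    emitted = defaultdict(list)
    for d in docs:                                       # map phase
        emitted[d["author"]].append(d["views"])
    reduced = {k: sum(v) for k, v in emitted.items()}    # reduce phase
    # finalize phase: order authors by total views, highest first
    return sorted(reduced.items(), key=lambda kv: kv[1], reverse=True)


if __name__ == "__main__":
    print(pipeline_total_views(posts))
    print(map_reduce_total_views(posts))
```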
Admin Commands
5_admin.txt
• show dbs
• show collections
• db.stats()
• db.posts.stats()
• db.posts.drop()
• db.system.indexes.find()
Data Modeling
• Remember with NoSQL redundancy is not evil
• Applications ensure consistency, not the DB
• Applications join data; joins are not defined in the DB
• Data model is schema-less
• Data model is usually built to support queries
Questions to ask
• Your basic units of data (what would be a document)?
• How are these units grouped / related?
• How does Mongo let you query this data, what are the options?
• Finally, maybe most importantly, what are your application’s access patterns?
  • Reads vs. writes
  • Queries
  • Updates
  • Deletions
  • How structured is it
Data Model - Normalized
Normalized
• Similar to relational model.
• One collection per entity type
• Little or no redundancy
• Allows clean updates, familiar to many SQL users, easier to understand
Normalized documents
References
• From parent to child
{ name: "O'Reilly Media",
books: [123456789, 234567890, ...]}
• From child to parent
{ _id: 123456789, title: "MongoDB: The Definitive Guide", publisher_id: "oreilly"}
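Because the database does not join for you, resolving a reference is the application's job. A minimal sketch of following the child-to-parent reference, using in-memory dictionaries in place of real collections (the `publishers` lookup table and the `with_publisher` helper are assumptions for illustration):

```python
# Stand-ins for two collections; publishers is keyed by _id for quick lookup.
publishers = {
    "oreilly": {"_id": "oreilly", "name": "O'Reilly Media"},
}
books = [
    {"_id": 123456789, "title": "MongoDB: The Definitive Guide", "publisher_id": "oreilly"},
]


def with_publisher(book: dict) -> dict:
    """Resolve the child-to-parent reference in application code."""
    publisher = publishers.get(book["publisher_id"])  # the second "query"
    return {**book, "publisher": publisher}


if __name__ == "__main__":
    joined = with_publisher(books[0])
    print(joined["title"], "is published by", joined["publisher"]["name"])
```

Against a real database this costs a second round trip per lookup, which is one reason embedding is often preferred for data that is read together.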
Data Model - Embedded
An often-used pattern in Mongo is to embed information as subdocuments.
• Used when there is a contains relationship
• Easier querying (when related data is often used together)
• Need to keep 16 MB document size in mind
Embedded
Other considerations For Data Modeling
Many or few collections
• Many Collections
  • As seen in normalized
  • Clean and little redundancy
  • May not provide best performance
  • May require frequent updates to application if new types added
• Multiple Collections
  • Middle ground, partially normalized
• Not many collections
  • One large generic collection
  • Contains many types
  • Use type field
Consideration Continued
• Document Growth – a document will be relocated if it exceeds its allocated size
• Atomicity
  • Atomic at document level
  • Consideration for insertions, removes and multi-document updates
• Sharding – collections distributed across mongod instances, uses a shard key.
• Indexes – index fields that are often queried; indexes affect write performance slightly
• Consider using TTL to automatically expire documents
Common Uses For Mongo
CMS Systems
Log Collection
• https://code.google.com/p/log4mongo/
Caching
Queues / Messaging
• Capped Collections - fixed-size collections that support high-throughput operations that insert, retrieve, and delete documents based on insertion order.
Analytics
Prototyping
MongoDB Development with Java
Mongo Driver
• Supplied by MongoDB itself
• Easy to set up
• Housed on Maven repo
Morphia
• Uses App Model
• Handles References Well
Spring Mongo
• Great if using Spring already
Other
Node
• Javascript (JSON), Coffeescript
• MEAN Stack
Scala
• Casbah
• Reactive Mongo
MEAN Stack
Get MEAN
Mongo, Express, Angular and Node
http://bitnami.com/stack/mean
http://mean.io
Can be installed in a VM or even in the cloud
The cloud
Database in the cloud
https://mongolab.com/
Can access using shell, GUI Mongo explorer, mongoimport, mongoexport and use in application
Amazon, Rackspace, Joyent or Azure
Books
MongoDB: The Definitive Guide, 2nd Edition
By: Kristina Chodorow
Publisher: O'Reilly Media, Inc.
Pub. Date: May 23, 2013
Print ISBN-13: 978-1-4493-4468-9
Pages in Print Edition: 432

MongoDB in Action
By: Kyle Banker
Publisher: Manning Publications
Pub. Date: December 16, 2011
Print ISBN-10: 1-935182-87-0
Print ISBN-13: 978-1-935182-87-0
Pages in Print Edition: 312

The Definitive Guide to MongoDB: The NoSQL Database for Cloud and Desktop Computing
By: Eelco Plugge; Peter Membrey; Tim Hawkins
Publisher: Apress
Pub. Date: September 2010
ISBN: 9781430230519
Pages: 327
Books Cont.
MongoDB Applied Design Patterns
By: Rick Copeland
Publisher: O'Reilly Media, Inc.
Pub. Date: March 18, 2013
Print ISBN-13: 978-1-4493-4004-9
Pages in Print Edition: 176

MongoDB for Web Development (rough cut!)
By: Mitch Pirtle
Publisher: Addison-Wesley Professional
Last Updated: 14-JUN-2013
Pub. Date: March 11, 2015 (Estimated)
Print ISBN-10: 0-321-70533-5
Print ISBN-13: 978-0-321-70533-4
Pages in Print Edition: 360

Instant MongoDB
By: Amol Nayak
Publisher: Packt Publishing
Pub. Date: July 26, 2013
Print ISBN-13: 978-1-78216-970-3
Pages in Print Edition: 72
Important Sites
• http://www.mongodb.org/
• https://mongolab.com/welcome/
• https://education.mongodb.com/
• http://blog.mongodb.org/
• http://stackoverflow.com/questions/tagged/mongodb
• http://bitnami.com/stack/mean
Cassandra
Let’s look briefly at Cassandra as an alternative to Mongo
Cassandra History
• Developed At Facebook, based on Google Big Table and Amazon Dynamo **
• Open Sourced in mid 2008
• Apache Project March 2009
• Commercial Support through Datastax (originally known as Riptano, founded 2010)
• Used at Netflix, eBay and many more. Reportedly 300 TB on 400 machines largest installation
• Current version is 2.0.3
C* Basics
• No Single Point of Failure – highly available.
• Peer to Peer – no master
• Data Center Aware – distributed architecture
• Linear Scaling – just add hardware
• Eventual Consistency, tunable tradeoff between latency and consistency
• Architecture is optimized for writes.
• Can have 2 billion columns (cells) per row!
• Data modeling for reads. Design starts with looking at your queries. (sound familiar?)
• With CQL became more SQL-like, but no joins, no subqueries, limited ordering (but very useful)
• Column names can be part of data, e.g. Time Series
C* Eventual Consistency
** Important Term **
Quorum: Q = floor(N / 2) + 1 (integer division, so N = 3 gives Q = 2).
We get consistency in a BASE world by satisfying W + R > N
3 obvious ways:
1. W = 1, R = N
2. W = N, R = 1
3. W = Q, R = Q
(N is replication factor, R = read replica count, W = write replica count)
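The arithmetic behind W + R > N can be checked with a few lines of Python. This is a small sketch of the rule as stated above; the function names are made up for the example.

```python
def quorum(n: int) -> int:
    """Quorum for replication factor n: floor(n / 2) + 1."""
    return n // 2 + 1


def reads_see_latest_write(w: int, r: int, n: int) -> bool:
    """True when every read set must overlap every write set (W + R > N)."""
    return w + r > n


if __name__ == "__main__":
    n = 3
    q = quorum(n)
    # The three "obvious ways" from above, plus W = R = 1 as a counter-example.
    for w, r in [(1, n), (n, 1), (q, q), (1, 1)]:
        print(f"N={n} W={w} R={r} consistent={reads_see_latest_write(w, r, n)}")
```

The intuition is that if W replicas acknowledged the write and R replicas answer the read, then W + R > N forces at least one replica to be in both groups.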
C* Data Model
C* data model is made of these:
• Column – a name, a value and a timestamp. Applications can use the name as the data and not use value. (RDBMS like a column).
• Row – a collection of columns identified by a unique key. Key is called a partition key (RDBMS like a row).
• Column Family – container for an ordered collection of rows. Each row is an ordered collection of columns. Each column has a key and maybe a value. (RDBMS like a table). This is also known as a table now in C* terms.
• Keyspace – administrative container for CF’s. It is a namespace. Also has a replication strategy – more later. (RDBMS like a DB or schema).
How Does This Look?
Tokens
Tokens – partitioner dependent element on the ring.
• Each node has a single unique token assigned.
• Each node claims a range of tokens that is from its token to the token of the previous node on the ring.
Use this formula
Initial_Token = Zero_Indexed_Node_Number * ((2^127) / Number_Of_Nodes)
In cassandra.yaml
initial_token: 42535295865117307932921825928971026432
** http://blog.milford.io/cassandra-token-calculator/
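The token formula is easy to evaluate with Python's arbitrary-precision integers. The sketch below assumes a token space of 2^127, as in the formula above; the function name is made up for the example.

```python
RING_SIZE = 2 ** 127  # token space assumed by the formula above


def initial_tokens(number_of_nodes: int) -> list:
    """Evenly spaced initial tokens, one per zero-indexed node."""
    return [i * (RING_SIZE // number_of_nodes) for i in range(number_of_nodes)]


if __name__ == "__main__":
    for node, token in enumerate(initial_tokens(4)):
        print(f"node {node}: initial_token: {token}")
```

For a 4-node cluster, node 1 gets 42535295865117307932921825928971026432, which matches the cassandra.yaml value shown above.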
Replication
• Replication is how many copies of each piece of data should be stored. In C* terms it is Replication Factor or “RF”.
• In C* RF is set at the keyspace level:
CREATE KEYSPACE drg_compare WITH replication = {'class':'SimpleStrategy', 'replication_factor':3};
• How the data is replicated is called the Replication Strategy
  • SimpleStrategy – returns nodes “next” to each other on ring. Assumes single DC
  • NetworkTopologyStrategy – for configuring per data center. Rack and DC aware.
update keyspace UserProfile with strategy_options=[{DC1:3, DC2:3}];
C* Ring Topology
SimpleStrategy
Using token generation values from before. 4 node cluster. Write value with token 32535295865117307932921825928971026432
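A rough sketch of where that write would land, assuming the four evenly spaced tokens from the formula, a replication factor of 3 (an assumption taken from the keyspace example) and a SimpleStrategy-like walk to the "next" nodes on the ring. This is a simplified model in Python, not Cassandra's code.

```python
import bisect

RING_SIZE = 2 ** 127
NODE_TOKENS = [i * (RING_SIZE // 4) for i in range(4)]  # 4-node cluster


def replicas(token: int, replication_factor: int = 3) -> list:
    """Return the indexes of the nodes that would hold data with this token."""
    # A node claims the range from the previous node's token (exclusive) up to
    # its own token (inclusive), so the owner is the first node whose token is
    # >= the data token, wrapping back to node 0 past the last token.
    owner = bisect.bisect_left(NODE_TOKENS, token) % len(NODE_TOKENS)
    # Further replicas are placed on the next nodes around the ring.
    return [(owner + i) % len(NODE_TOKENS) for i in range(replication_factor)]


if __name__ == "__main__":
    print(replicas(32535295865117307932921825928971026432))
```

The token 32535295865117307932921825928971026432 falls between node 0's token (0) and node 1's token, so in this model node 1 owns it and nodes 2 and 3 hold the other two copies.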
SimpleStrategy (Cont)
Coordinator and CL
• When writing, a Coordinator Node will be selected. Selected at write (or read) time. Not a SPOF!
• Using Gossip Protocol nodes share information with each other. Who is up, who is down, who is taking which token ranges, etc. Every second, each node shares with 1 to 3 nodes.
• Consistency Level (CL) – says how many nodes must agree before an operation is a success. Set at read or write operation.
• ONE – coordinator will wait for one node to ack write (also TWO, THREE). One is default if none provided.
• QUORUM – we saw that before. N / 2 + 1. LOCAL_QUORUM, EACH_QUORUM
• ANY – waits for some replica. If all are down, still succeeds. Only for writes. Doesn’t guarantee it can be read.
• ALL – Blocks waiting for all replicas
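As a sketch of what the consistency level means for the coordinator, the function below maps a level to the number of replica acknowledgements it waits for. It is a simplified illustration with hypothetical function names. ANY is left out because it can succeed with no replica up, and LOCAL_QUORUM and EACH_QUORUM are left out because they depend on the data center layout.

```python
def required_acks(consistency_level: str, replication_factor: int) -> int:
    """Number of replicas that must acknowledge before the operation succeeds."""
    levels = {
        "ONE": 1,
        "TWO": 2,
        "THREE": 3,
        "QUORUM": replication_factor // 2 + 1,
        "ALL": replication_factor,
    }
    return levels[consistency_level]


def operation_succeeds(consistency_level: str, replication_factor: int,
                       replicas_up: int) -> bool:
    """True if enough replicas are reachable to satisfy the consistency level."""
    return replicas_up >= required_acks(consistency_level, replication_factor)


if __name__ == "__main__":
    # RF = 3 with one replica down: QUORUM still works, ALL does not.
    for cl in ["ONE", "QUORUM", "ALL"]:
        print(cl, operation_succeeds(cl, replication_factor=3, replicas_up=2))
```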
Ensuring Consistency
3 important concepts:
• Read Repair - At time of read, inconsistencies are noticed between nodes and replicas are updated. Direct and background. Direct is determined by CL.
• Anti-Entropy Node Repair - For data that is not read frequently, or to update data on a node that has been down for a while, run the nodetool repair process (also called anti-entropy repair). Builds Merkle trees, compares nodes and does repair.
• Hinted Handoff - Writes are always sent to all replicas for the specified row regardless of the consistency level specified by the client. If a node happens to be down at the time of write, its corresponding replicas will save hints about the missed writes, and then hand off the affected rows once the node comes back online. This notification happens via Gossip. Default 1 hour.
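The Merkle tree idea behind anti-entropy repair can be sketched in a few lines: hash every row, hash pairs of hashes upward, and compare only the roots. This is a much simplified illustration in Python, not Cassandra's implementation, which compares subtrees to narrow down the differing ranges. The use of SHA-256, the row format and the function names are all assumptions.

```python
import hashlib


def _hash(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def merkle_root(rows: list) -> bytes:
    """Hash each row, then hash pairs upward until a single root remains."""
    level = [_hash(repr(row).encode("utf-8")) for row in rows] or [_hash(b"")]
    while len(level) > 1:
        if len(level) % 2:  # odd count: duplicate the last hash
            level.append(level[-1])
        level = [_hash(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
    return level[0]


def needs_repair(replica_a: list, replica_b: list) -> bool:
    """Two replicas are in sync only if their Merkle roots are equal."""
    return merkle_root(replica_a) != merkle_root(replica_b)


if __name__ == "__main__":
    a = [("k1", "v1"), ("k2", "v2"), ("k3", "v3")]
    b = [("k1", "v1"), ("k2", "stale"), ("k3", "v3")]
    print(needs_repair(a, a), needs_repair(a, b))
```

Comparing one root hash per range is what lets nodes detect divergence without shipping all of their data to each other.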
Application Development
• Interaction with Cassandra can be done using one of the supplied clients such as CLI or CQL. Otherwise client applications are built using a language client library.
• Many clients in multiple languages, including Java, .NET, Python, Scala, Go, PHP, Node.js, Perl, Ruby, etc.
• Java:
  • Hector wraps the underlying Thrift API. Hector is one of the most commonly used client libraries.
  • Astyanax is a client library developed by Netflix.
  • Datastax CQL – newest CQL driver, will be very familiar to JDBC developers
  • And many more … (JPA)
• There is also Datastax OpsCenter, various other GUIs and a REST API (Virgil)
Cassandra Summary
Many More Topics / Information Related to C* not covered
Great for Fast Writes
No Single Point of Failure
Data Center Aware
Also Relative Ease Of Use
That’s All Folks
Questions?
Comments?
Thank You!!!!!! [email protected]