dataversity
NoSQL: Introduction
Asya Kamsky
• 1970's: Relational databases invented
– Storage is expensive
– Data is normalized
– Data storage is abstracted away from the app
• 1980's: RDBMS commercialized
– Client/server model
– SQL becomes the standard
• 1990's: Things begin to change
– Client/server => 3-tier architecture
– Rise of the Internet and the Web
• 2000's: Web 2.0
– Rise of "social media"
– Acceptance of e-commerce
– Constant decrease of hardware prices
– Massive increase of collected data
• Result
– Constant need to scale dramatically
– How can we scale?
OLTP / operational
+ complex transactions
+ tabular data
+ ad hoc queries
- O<->R mapping hard
- speed/scale problems
- not super agile
(a lot more issues here; typical workarounds: caching, flat files, map/reduce, app-layer partitioning)

BI / reporting
+ ad hoc queries
+ SQL standard protocol between clients and servers
+ scales horizontally better than operational DBs
- some limits at massive scale
- schemas are rigid
- no real time; great at bulk nightly data loads
(fewer issues here)
• Agile development methodology
– Shorter development cycles
– Constant evolution of requirements
– Flexibility at design time
• Relational schema
– Hard to evolve
– Long, painful migrations
– Must stay in sync with the application
– Few developers interact with it directly
• Horizontal scaling
• More real time requirements
• Faster development time
• Flexible data model
• Low upfront cost
• Low cost of ownership
Relational
vs
Non-Relational
What is NoSQL?
Scalable non-relational ("NoSQL") stores target the OLTP / operational side:
+ speed and scale
+ fits OO well
+ agile
- ad hoc query limited
- not very transactional
- no SQL / no standard
Non-relational, next-generation operational data stores and databases: a collection of very different products.
• Different data models (not relational)
• Most do not use SQL for queries
• No predefined schema
• Some allow flexible data structures
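The "no predefined schema" point can be sketched in plain Python (dicts standing in for documents; this is an illustration of the idea, not a real database):

```python
# Plain-Python sketch of a schemaless collection: documents stored
# together can carry different fields, with no table definition to
# conform to and no migration needed when fields are added.
collection = []

def insert(doc):
    """Append a document; nothing validates its shape."""
    collection.append(doc)

insert({"title": "Too Big to Fail", "author": "joe"})
insert({"title": "Another Post", "author": "jane",
        "tags": ["news"], "votes": 3})

# Both documents coexist even though only one has votes/tags:
assert collection[0].get("votes") is None
assert collection[1]["votes"] == 3
```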
• Relational: ACID, two-phase commit, joins
• Key-Value, Document, XML, Graph, Column (non-relational): BASE, atomic transactions at the document level, no joins
• Transaction rate
• Reliability
• Maintainability
• Ease of Use
• Scalability
• Cost
MongoDB: Introduction
• Designed and developed by founders of DoubleClick, ShopWiki, Gilt Groupe, etc.
• Coding started fall 2007
• First production site March 2008 - businessinsider.com
• Open Source – AGPL, written in C++
• Version 0.8 – first official release February 2009
• Version 1.0 – August 2009
• Version 2.0 – September 2011
MongoDB
Design Goals
• Document-oriented storage
– Based on JSON documents
– Flexible schema
• Scalable architecture
– Auto-sharding
– Replication & high availability
• Key features include:
– Full-featured indexes
– Query language
– Map/Reduce & aggregation
• Rich data models
• Seamlessly map to native programming
language types
• Flexible for dynamic data
• Better data locality
{
  _id : ObjectId("4e2e3f92268cdda473b628f6"),
  title : "Too Big to Fail",
  when : Date("2011-07-26"),
  author : "joe",
  text : "blah"
}
{
  _id : ObjectId("4e2e3f92268cdda473b628f6"),
  title : "Too Big to Fail",
  when : Date("2011-07-26"),
  author : "joe",
  text : "blah",
  tags : ["business", "news", "north america"]
}

> db.posts.find( { tags : "news" } )
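The query above matches because, when the stored field is an array, the match succeeds if the array contains the queried value. A plain-Python sketch of that matching rule (an illustration, not MongoDB's implementation):

```python
# Sketch of equality matching against array-valued fields:
# a scalar query value matches an array field if it is contained in it.
def matches(doc, query):
    for field, value in query.items():
        stored = doc.get(field)
        if isinstance(stored, list):
            if value not in stored:   # array: containment test
                return False
        elif stored != value:         # scalar: plain equality
            return False
    return True

post = {"title": "Too Big to Fail",
        "tags": ["business", "news", "north america"]}

assert matches(post, {"tags": "news"})
assert not matches(post, {"tags": "sports"})
```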
{
  _id : ObjectId("4e2e3f92268cdda473b628f6"),
  title : "Too Big to Fail",
  when : Date("2011-07-26"),
  author : "joe",
  text : "blah",
  tags : ["business", "news", "north america"],
  votes : 3,
  voters : ["dmerr", "sj", "jane"],
  comments : [
    { by : "tim157", text : "great story" },
    { by : "gora", text : "i don't think so" },
    { by : "dmerr", text : "also check out..." }
  ]
}

> db.posts.find( { "comments.by" : "gora" } )
> db.posts.ensureIndex( { "comments.by" : 1 } )
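Dot notation like "comments.by" walks a path into embedded documents, and an array of subdocuments matches if any element matches. A plain-Python sketch of that semantics (an assumption-level illustration, not MongoDB code):

```python
# Sketch of dot-notation path matching over nested documents and arrays.
def match_path(doc, path, value):
    head, _, rest = path.partition(".")
    current = doc.get(head)
    if isinstance(current, list):
        if rest:   # array of subdocuments: match if any element matches
            return any(match_path(sub, rest, value)
                       for sub in current if isinstance(sub, dict))
        return value in current
    if rest:       # descend into an embedded document
        return isinstance(current, dict) and match_path(current, rest, value)
    return current == value

post = {"comments": [{"by": "tim157", "text": "great story"},
                     {"by": "gora", "text": "i don't think so"}]}

assert match_path(post, "comments.by", "gora")
assert not match_path(post, "comments.by", "nobody")
```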
Seek = 5+ ms; read = really, really fast.
Disk seeks and data locality: with the author and comments embedded in the post document, one seek retrieves the whole post, instead of separate seeks for each related row.
• Sophisticated secondary indexes
• Dynamic queries
• Sorting
• Rich updates, upserts
• Easy aggregation
• Viable primary data store
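"Rich updates, upserts" refers to update-or-insert semantics: modify the matching document if one exists, otherwise insert a new one. A plain-Python sketch of that behavior (names here are illustrative, not MongoDB's API):

```python
# Sketch of upsert semantics: update the first document matching the
# query, or insert a new document built from query + changes.
collection = []

def upsert(query, changes):
    for doc in collection:
        if all(doc.get(k) == v for k, v in query.items()):
            doc.update(changes)       # found: update in place
            return doc
    doc = dict(query, **changes)      # no match: insert merged document
    collection.append(doc)
    return doc

upsert({"author": "joe"}, {"votes": 1})   # inserts a new document
upsert({"author": "joe"}, {"votes": 2})   # updates the same document

assert len(collection) == 1
assert collection[0]["votes"] == 2
```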
• Scale linearly
• High Availability
• Increase capacity with no downtime
• Transparent to the application
Replica Sets
• High Availability/Automatic Failover
• Data Redundancy
• Disaster Recovery
• Transparent to the application
• Perform maintenance with no down time
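Automatic failover can be sketched conceptually: when the primary is unreachable, the surviving majority elects a new primary, typically favoring the most up-to-date member. This is a simplified illustration, not MongoDB's actual election protocol:

```python
# Conceptual sketch of automatic failover in a replica set: if a majority
# of members is healthy, promote the healthy member with the most recent
# oplog position ("optime"); otherwise no primary can be elected.
members = [
    {"name": "A", "state": "primary",   "optime": 105, "healthy": False},
    {"name": "B", "state": "secondary", "optime": 104, "healthy": True},
    {"name": "C", "state": "secondary", "optime": 101, "healthy": True},
]

def elect(members):
    healthy = [m for m in members if m["healthy"]]
    if len(healthy) <= len(members) // 2:
        return None  # no majority: the set has no primary
    winner = max(healthy, key=lambda m: m["optime"])
    for m in members:
        m["state"] = "primary" if m is winner else "secondary"
    return winner

assert elect(members)["name"] == "B"   # most up-to-date healthy member wins
```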
Asynchronous replication keeps the secondaries up to date with the primary; when the primary becomes unavailable, an automatic election promotes a secondary to primary.
• Increase capacity with no downtime
• Transparent to the application
• Range-based partitioning
• Partitioning and balancing is automatic
Write scalability: a single mongod starts with the full key range 0..100; as data grows, the range is split across two mongods (0..50 and 51..100), then four (0..25, 26..50, 51..75, 76..100).
Each shard (key ranges 0..25, 26..50, 51..75, 76..100) is a replica set of a primary and two secondaries. The application sends all operations to MongoS routing processes, which consult three Config servers to route each request to the correct shard.
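The routing step in range-based partitioning can be sketched in plain Python: map a document's shard-key value to the shard whose range contains it. The ranges mirror the diagrams above; this is an illustration of the idea, not the MongoS implementation:

```python
# Sketch of range-based routing: each shard owns a contiguous key range,
# and a router picks the shard whose range covers the document's key.
SHARDS = [
    (0, 25, "shard1"),
    (26, 50, "shard2"),
    (51, 75, "shard3"),
    (76, 100, "shard4"),
]

def route(key):
    """Return the shard owning `key`, like a router consulting chunk metadata."""
    for low, high, shard in SHARDS:
        if low <= key <= high:
            return shard
    raise KeyError(f"no chunk covers key {key}")

assert route(10) == "shard1"
assert route(60) == "shard3"
assert route(100) == "shard4"
```

Because the routing metadata, not the application, decides placement, ranges can be split and rebalanced across shards without application changes.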
• Few configuration options
• Does the right thing out of the box
• Easy to deploy and manage
MySQL:

START TRANSACTION;
INSERT INTO contacts VALUES
  (NULL, 'joeblow');
INSERT INTO contact_emails VALUES
  ( NULL, '[email protected]', LAST_INSERT_ID() ),
  ( NULL, '[email protected]', LAST_INSERT_ID() );
COMMIT;

MongoDB:

db.contacts.save( {
  userName: "joeblow",
  emailAddresses: [ "[email protected]" ] } );

• Native drivers for dozens of languages
• Data maps naturally to OO data structures
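The contrast above can be sketched in plain Python: the relational model needs two tables linked by a generated key, while the document model embeds the addresses in one record (dicts and lists stand in for tables and collections):

```python
# Sketch: normalized two-table insert vs. a single embedded document.
contacts, contact_emails = [], []          # the two "tables"

def insert_relational(user, emails):
    contact_id = len(contacts) + 1         # stands in for LAST_INSERT_ID()
    contacts.append({"id": contact_id, "userName": user})
    for e in emails:
        contact_emails.append({"address": e, "contact_id": contact_id})

documents = []                             # the "collection"

def insert_document(user, emails):
    documents.append({"userName": user, "emailAddresses": emails})

insert_relational("joeblow", ["a", "b"])
insert_document("joeblow", ["a", "b"])

assert len(contacts) + len(contact_emails) == 3  # three rows, two tables
assert len(documents) == 1                       # one document, no join key
```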
MongoDB Usage Examples
• User data management
• High-volume data feeds
• Content management
• Operational intelligence
• E-commerce
Wordnik uses MongoDB as the foundation for its "live" dictionary that stores its entire text corpus: 3.5TB of data in 20 billion records.

Problem
• Analyze a staggering amount of data for a system built on a continuous stream of high-quality text pulled from online sources
• Initially launched entirely on MySQL but quickly hit performance roadblocks
• Adding too much data too quickly resulted in outages; tables locked for tens of seconds during inserts

Why MongoDB
• Migrated 5 billion records in a single day with zero downtime
• MongoDB powers every website request: 20M API calls per day
• Ability to eliminate the memcached layer, creating a simplified system that required fewer resources and was less prone to error

Impact
• Reduced code by 75% compared to MySQL
• Fetch time cut from 400ms to 60ms
• Sustained insert speed of 8K words per second, with frequent bursts of up to 50K per second
• Significant cost savings and a 15% reduction in servers

"Life with MongoDB has been good for Wordnik. Our code is faster, more flexible and dramatically smaller. Since we don't spend time worrying about the database, we can spend more time writing code for our application." -Tony Tam, Vice President of Engineering and Technical Co-founder
Intuit relies on a MongoDB-powered real-time analytics tool for small businesses to derive interesting and actionable patterns from their customers' website traffic.

Problem
• Intuit hosts more than 500,000 websites and wanted to collect and analyze data to recommend conversion and lead-generation improvements to customers
• With 10 years' worth of user data, it took several days to process the information using a relational database

Why MongoDB
• MongoDB's querying and Map/Reduce functionality could serve as a simpler, higher-performance solution than a complex Hadoop implementation
• The strength of the MongoDB community

Impact
• In one week Intuit was able to become proficient in MongoDB development
• Developed application features more quickly for MongoDB than for relational databases
• MongoDB was 2.5 times faster than MySQL

"We did a prototype for one week, and within one week we had made big progress. Very big progress. It was so amazing that we decided, 'Let's go with this.'" -Nirmala Ranganathan, Intuit
Shutterfly uses MongoDB to safeguard more than six billion images for millions of customers in the form of photos and videos, and to turn everyday pictures into keepsakes.

Problem
• Managing 20TB of data (six billion images for millions of customers), partitioning by function
• A home-grown key-value store on top of their Oracle database offered sub-par performance
• The codebase for this hybrid store became hard to manage
• High licensing and hardware costs

Why MongoDB
• JSON-based data structure
• Provided Shutterfly with an agile, high-performance, scalable solution at a low cost
• Works seamlessly with Shutterfly's services-based architecture

Impact
• 500% cost reduction and 900% performance improvement compared to the previous Oracle implementation
• Accelerated time-to-market for nearly a dozen projects on MongoDB
• Improved performance by reducing average latency for inserts from 400ms to 2ms

The "really killer reason" for using MongoDB is its rich JSON-based data structure, which offers Shutterfly an agile approach to developing software. With MongoDB, the Shutterfly team can quickly develop and deploy new applications, especially Web 2.0 and social features. -Kenny Gorman, Director of Data Services
Open source, high performance database