Intro to NoSQL and MongoDB


NoSQL: Introduction

Asya Kamsky


• 1970's: Relational databases invented

– Storage is expensive

– Data is normalized

– Data storage is abstracted away from the app

• 1980's: RDBMS commercialized

– Client/server model

– SQL becomes the standard

• 1990's: Things begin to change

– Client/server => 3-tier architecture

– Rise of the Internet and the Web

• 2000's: Web 2.0

– Rise of "social media"

– Acceptance of e-commerce

– Constant decrease of hardware prices

– Massive increase of collected data

• Result

– Constant need to scale dramatically

– How can we scale?

OLTP / operational (a lot more issues here):

  + complex transactions
  + tabular data
  + ad hoc queries
  - O<->R mapping is hard
  - speed/scale problems
  - not super agile

  Typical workarounds: caching, flat files, map/reduce, app-layer partitioning

BI / reporting (fewer issues here):

  + ad hoc queries
  + SQL is a standard protocol between clients and servers
  + scales horizontally better than operational DBs
  - some limits at massive scale
  - schemas are rigid
  - no real time; great at bulk nightly data loads

• Agile development methodology

– Shorter development cycles

– Constant evolution of requirements

– Flexibility at design time

• Relational schema

– Hard to evolve: long, painful migrations

– Must stay in sync with the application

– Few developers interact with it directly

What modern applications need:

• Horizontal scaling

• More real-time requirements

• Faster development time

• Flexible data model

• Low upfront cost

• Low cost of ownership

Relational vs Non-Relational: What is NoSQL?

Scalable non-relational ("NoSQL") stores sit alongside the OLTP/operational and BI/reporting systems above:

  + speed and scale
  + fits OO well
  + agile
  - ad hoc queries limited
  - not very transactional
  - no SQL / no standard

Non-relational, next-generation operational data stores and databases: a collection of very different products

• Different data models (not relational)

• Most do not use SQL for queries

• No predefined schema

• Some allow flexible data structures
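
For example, in a document store (MongoDB shell syntax, used throughout the rest of this deck), two documents in the same collection can have entirely different shapes; a minimal sketch with a hypothetical things collection:

  db.things.insert( { name : "a", size : 10 } );                          // flat document
  db.things.insert( { name : "b", tags : ["x", "y"], nested : { level : 1 } } );  // arrays and sub-documents, no schema change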

Data models and their transaction semantics:

• Relational

– ACID

– Two-phase commit

– Joins

• Non-relational (Key-Value, Document, XML, Graph, Column)

– BASE

– Atomic transactions at the document level

– No joins

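To make "atomic transactions at the document level" concrete, here is a hedged MongoDB shell sketch (collection and field names are hypothetical): several fields of one document change together or not at all, with no multi-statement transaction:

  db.accounts.update(
    { _id : 1 },
    { $inc : { balance : -10, pending : 10 } }   // both increments apply atomically to this one document
  );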

Trade-offs to weigh when choosing a database:

• Transaction rate

• Reliability

• Maintainability

• Ease of use

• Scalability

• Cost

MongoDB: Introduction


• Designed and developed by founders of DoubleClick, ShopWiki, Gilt Groupe, etc.

• Coding started fall 2007

• First production site March 2008 - businessinsider.com

• Open source - AGPL, written in C++

• Version 0.8 - first official release, February 2009

• Version 1.0 - August 2009

• Version 2.0 - September 2011

MongoDB Design Goals

• Document-oriented storage

– Based on JSON documents

– Flexible schema

• Scalable architecture

– Auto-sharding

– Replication & high availability

• Key features include:

– Full-featured indexes

– Query language

– Map/Reduce & aggregation
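
For instance, secondary indexes and ad hoc queries work much as in an RDBMS; a minimal shell sketch (the posts collection anticipates the examples that follow):

  > db.posts.ensureIndex( { author : 1 } )                      // secondary index on a field
  > db.posts.find( { author : "joe" } ).sort( { when : -1 } )   // ad hoc query plus sort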

• Rich data models

• Seamlessly map to native programming-language types

• Flexible for dynamic data

• Better data locality


{
  _id : ObjectId("4e2e3f92268cdda473b628f6"),
  title : "Too Big to Fail",
  when : ISODate("2011-07-26"),
  author : "joe",
  text : "blah"
}

Adding a tags array requires no schema change:

{
  _id : ObjectId("4e2e3f92268cdda473b628f6"),
  title : "Too Big to Fail",
  when : ISODate("2011-07-26"),
  author : "joe",
  text : "blah",
  tags : ["business", "news", "north america"]
}

Queries match directly against the array's elements:

> db.posts.find( { tags : "news" } )
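
An index on an array field is "multikey": every element is indexed. A minimal sketch using the shell's ensureIndex helper:

> db.posts.ensureIndex( { tags : 1 } )   // one index entry per array element
> db.posts.find( { tags : "news" } )     // now served by the index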

The schema keeps evolving in place: a vote count, a voters array, and embedded comments are simply new fields:

{
  _id : ObjectId("4e2e3f92268cdda473b628f6"),
  title : "Too Big to Fail",
  when : ISODate("2011-07-26"),
  author : "joe",
  text : "blah",
  tags : ["business", "news", "north america"],
  votes : 3,
  voters : ["dmerr", "sj", "jane"],
  comments : [
    { by : "tim157", text : "great story" },
    { by : "gora", text : "i don't think so" },
    { by : "dmerr", text : "also check out..." }
  ]
}

Embedded documents are queried and indexed with dot notation:

> db.posts.find( { "comments.by" : "gora" } )
> db.posts.ensureIndex( { "comments.by" : 1 } )
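
Updates modify such a document in place and atomically; $inc and $push are standard update operators (the voter name below is hypothetical):

> db.posts.update(
    { _id : ObjectId("4e2e3f92268cdda473b628f6") },
    { $inc : { votes : 1 }, $push : { voters : "newvoter" } }
  )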

Disk seeks and data locality: a seek costs 5+ ms, while sequential reads are really fast. Keeping a post, its author info, and all of its comments in one document (rather than in separate Post, Author, and Comment tables) means a single seek retrieves everything together.

• Sophisticated secondary indexes

• Dynamic queries

• Sorting

• Rich updates, upserts

• Easy aggregation (see the sketch below)

• Viable as a primary data store
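For aggregation in this era of MongoDB, mapReduce is the workhorse; a hedged sketch that counts posts per tag (the output collection name is arbitrary):

> db.posts.mapReduce(
    function() { this.tags.forEach(function(t) { emit(t, 1); }); },   // map: emit one pair per tag
    function(key, values) { return Array.sum(values); },              // reduce: sum the counts
    { out : "tag_counts" }
  )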

• Scale linearly

• High Availability

• Increase capacity with no downtime

• Transparent to the application


Replica Sets

• High Availability/Automatic Failover

• Data Redundancy

• Disaster Recovery

• Transparent to the application

• Perform maintenance with no down time
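
A replica set might be brought up from the shell like this (a minimal sketch; the set name and hostnames are hypothetical):

> rs.initiate( {
    _id : "rs0",
    members : [
      { _id : 0, host : "db1.example.net:27017" },
      { _id : 1, host : "db2.example.net:27017" },
      { _id : 2, host : "db3.example.net:27017" }
    ]
  } )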


Replication is asynchronous: writes go to the primary and are replicated to the secondaries. If the primary becomes unavailable, the remaining members hold an automatic election and promote a secondary to primary, transparently to the application.

Sharding:

• Increase capacity with no downtime

• Transparent to the application

• Range-based partitioning

• Partitioning and balancing are automatic (see the sketch below)
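Turning sharding on is a pair of admin commands issued through mongos (a hedged sketch; the database, collection, and shard key are hypothetical):

> db.adminCommand( { enableSharding : "blog" } )
> db.adminCommand( { shardCollection : "blog.posts", key : { author : 1 } } )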

Write scalability through key ranges: a single mongod owns the entire key range (0..100). Splitting across two mongods gives ranges 0..50 and 51..100; with four mongods each owns a quarter (0..25, 26..50, 51..75, 76..100), so writes spread across all of them.

In a full deployment each shard is itself a replica set (a primary plus two secondaries) owning one key range (0..25, 26..50, 51..75, 76..100). The application connects to one or more MongoS routers, which direct every operation to the right shard, while three config servers store the cluster metadata.
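
The processes involved, as a minimal sketch (hostnames, ports, and names are hypothetical; flags are from the 2.x era):

  mongod --shardsvr --replSet shard1 --port 27018       # one member of a shard's replica set
  mongod --configsvr --port 27019                       # one of three config servers
  mongos --configdb cfg1:27019,cfg2:27019,cfg3:27019    # query router the application talks to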

• Few configuration options

• Does the right thing out of the box

• Easy to deploy and manage


MySQL:

START TRANSACTION;
INSERT INTO contacts VALUES (NULL, 'joeblow');
INSERT INTO contact_emails VALUES
  ( NULL, 'joe@blow.com',    LAST_INSERT_ID() ),
  ( NULL, 'joseph@blow.com', LAST_INSERT_ID() );
COMMIT;

MongoDB:

db.contacts.save( {
  userName : "joeblow",
  emailAddresses : [
    "joe@blow.com",
    "joseph@blow.com" ]
} );

• Native drivers for dozens of languages

• Data maps naturally to OO data structures (see the sketch below)
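
Reading the document back returns a native object in the shell (or in any driver); a minimal sketch:

> var contact = db.contacts.findOne( { userName : "joeblow" } );
> print( contact.emailAddresses[0] );   // prints joe@blow.com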

MongoDB Usage Examples


• User data management

• High-volume data feeds

• Content management

• Operational intelligence

• E-commerce

Wordnik uses MongoDB as the foundation for its "live" dictionary, which stores its entire text corpus: 3.5TB of data in 20 billion records.

Problem:

• Analyze a staggering amount of data for a system built on a continuous stream of high-quality text pulled from online sources

• Initially launched entirely on MySQL but quickly hit performance roadblocks

• Adding too much data too quickly resulted in outages; tables locked for tens of seconds during inserts

Why MongoDB:

• Migrated 5 billion records in a single day with zero downtime

• MongoDB powers every website request: 20M API calls per day

• Eliminated the memcached layer, creating a simplified system that required fewer resources and was less prone to error

Impact:

• Reduced code by 75% compared to MySQL

• Fetch time cut from 400ms to 60ms

• Sustained insert speed of 8k words per second, with frequent bursts of up to 50k per second

• Significant cost savings and a 15% reduction in servers

"Life with MongoDB has been good for Wordnik. Our code is faster, more flexible and dramatically smaller. Since we don't spend time worrying about the database, we can spend more time writing code for our application." - Tony Tam, Vice President of Engineering and Technical Co-founder

Intuit relies on a MongoDB-powered real-time analytics tool that helps small businesses derive interesting and actionable patterns from their customers' website traffic.

Problem:

• Intuit hosts more than 500,000 websites and wanted to collect and analyze data to recommend conversion and lead-generation improvements to customers

• With 10 years' worth of user data, it took several days to process the information using a relational database

Why MongoDB:

• MongoDB's querying and Map/Reduce functionality could serve as a simpler, higher-performance solution than a complex Hadoop implementation

• The strength of the MongoDB community

Impact:

• In one week, Intuit was able to become proficient in MongoDB development

• Developed application features more quickly for MongoDB than for relational databases

• MongoDB was 2.5 times faster than MySQL

"We did a prototype for one week, and within one week we had made big progress. Very big progress. It was so amazing that we decided, 'Let's go with this.'" - Nirmala Ranganathan, Intuit

Shutterfly uses MongoDB to safeguard more than six billion images for millions of customers in the form of photos and videos, and to turn everyday pictures into keepsakes.

Problem:

• Managing 20TB of data (six billion images for millions of customers), partitioned by function

• A home-grown key-value store on top of their Oracle database offered sub-par performance

• The codebase for this hybrid store became hard to manage

• High licensing and hardware costs

Why MongoDB:

• JSON-based data structure

• Provided Shutterfly with an agile, high-performance, scalable solution at a low cost

• Works seamlessly with Shutterfly's services-based architecture

Impact:

• 500% cost reduction and 900% performance improvement compared to the previous Oracle implementation

• Accelerated time-to-market for nearly a dozen projects on MongoDB

• Improved performance by reducing average latency for inserts from 400ms to 2ms

"The 'really killer reason' for using MongoDB is its rich JSON-based data structure, which offers Shutterfly an agile approach to develop software. With MongoDB, the Shutterfly team can quickly develop and deploy new applications, especially Web 2.0 and social features." - Kenny Gorman, Director of Data Services

MongoDB: an open source, high-performance database.