20121023 mongodb schema-design

SCHEMA DESIGN WORKSHOP

Jeremy Mikola (@jmikola)

AGENDA
1. Basic schema design principles for MongoDB
2. Schema design over an application's lifetime
3. Common design patterns
4. Sharding

GOALS
Learn the schema design process in MongoDB
Practice applying common principles via exercises
Understand the implications of sharding

WHAT IS A SCHEMA AND WHY IS IT IMPORTANT?

SCHEMA
Map concepts and relationships to data
Set expectations for the data
Minimize overhead of iterative modifications
Ensure compatibility

NORMALIZATION
users: username, first_name, last_name
books: title, isbn, language, created_by (→ users), author (→ authors)
authors: first_name, last_name

DENORMALIZATION
users: username, first_name, last_name
books: title, isbn, language, created_by (→ users), author { first_name, last_name }

WHAT IS SCHEMA DESIGN LIKE IN MONGODB?

Schema is defined at the application level
Design is part of each phase in its lifetime
There is no magic formula

MONGODB DOCUMENTS
Storage in BSON → BSONSpec.org

Scalars: doubles, integers (32 or 64-bit), UTF-8 strings, UTC date, timestamp, binary, regex, code, ObjectId, null

Rich types: objects, arrays

TERMINOLOGY
{
    "mongodb"    : "relational db",
    "database"   : "database",
    "collection" : "table",
    "document"   : "row",
    "index"      : "index",
    "sharding" : {
        "shard"     : "partition",
        "shard key" : "partition key"
    }
}

THREE CONSIDERATIONS IN MONGODB SCHEMA DESIGN

1. The data your application needs
2. Your application's read usage of the data
3. Your application's write usage of the data

CASE STUDY
LIBRARY WEB APPLICATION

Different schemas are possible

AUTHOR SCHEMA
{
    "_id": int,
    "first_name": string,
    "last_name": string
}

USER SCHEMA
{
    "_id": int,
    "username": string,
    "password": string
}

BOOK SCHEMA
{
    "_id": int,
    "title": string,
    "slug": string,
    "author": int,
    "available": boolean,
    "isbn": string,
    "pages": int,
    "publisher": {
        "city": string,
        "date": date,
        "name": string
    },
    "subjects": [ string, string ],
    "language": string,
    "reviews": [
        { "user": int, "text": string },
        { "user": int, "text": string }
    ]
}

EXAMPLE DOCUMENTS

AUTHOR DOCUMENT
> db.authors.findOne()
{
    _id: 1,
    first_name: "F. Scott",
    last_name: "Fitzgerald"
}

USER DOCUMENT
> db.users.findOne()
{
    _id: 1,
    username: "emily@10gen.com",
    password: "slsjfk4odk84k209dlkdj90009283d"
}

BOOK DOCUMENT
> db.books.findOne()
{
    _id: 1,
    title: "The Great Gatsby",
    slug: "9781857150193-the-great-gatsby",
    author: 1,
    available: true,
    isbn: "9781857150193",
    pages: 176,
    publisher: {
        name: "Everyman's Library",
        date: ISODate("1991-09-19T00:00:00Z"),
        city: "London"
    },
    subjects: ["Love stories", "1920s", "Jazz Age"],
    language: "English",
    reviews: [
        { user: 1, text: "One of the best…" },
        { user: 2, text: "It's hard to…" }
    ]
}

EMBEDDED OBJECTS
AKA EMBEDDED OR SUB-DOCUMENTS

What advantages do they have?

When should they be used?

EMBEDDED OBJECTS
> db.books.findOne()
{
    _id: 1,
    title: "The Great Gatsby",
    // The publisher is an embedded object
    publisher: {
        name: "Everyman's Library",
        date: ISODate("1991-09-19T00:00:00Z"),
        city: "London"
    }
    // Other fields follow…
}

EMBEDDED OBJECTS
Great for read performance
One seek to load the entire document
One round trip to the database
Writes can be slow if constantly adding to objects

LINKED DOCUMENTS
What advantages does this approach have?

When should they be used?

LINKED DOCUMENTS
> db.books.findOne()
{
    _id: 1,
    title: "The Great Gatsby",
    // Link to a document in db.authors
    author: 1,
    publisher: {
        publisher_name: "Everyman's Library",
        date: ISODate("1991-09-19T00:00:00Z"),
        publisher_city: "London"
    },
    // Each review links to a document in db.users
    reviews: [
        { user: 1, text: "One of the best…" },
        { user: 2, text: "It's hard to…" }
    ]
    // Other fields follow…
}

LINKED DOCUMENTS
More, smaller documents
Can make queries by ID very simple
Accessing linked document data requires an extra read
What effect does this have on the system?

DATA, RAM AND DISK

ARRAYS
When should they be used?

ARRAY OF SCALARS
> db.books.findOne()
{
    _id: 1,
    title: "The Great Gatsby",
    subjects: ["Love stories", "1920s", "Jazz Age"]
    // Other fields follow…
}

ARRAY OF OBJECTS
> db.books.findOne()
{
    _id: 1,
    title: "The Great Gatsby",
    reviews: [
        { user: 1, text: "One of the best…" },
        { user: 2, text: "It's hard to…" }
    ]
    // Other fields follow…
}

EXERCISE #1
Design a schema for users and their book reviews

Users
    username (string)
    email (string)

Reviews
    text (string)
    rating (integer)
    created_at (date)

Usernames are immutable

EXERCISE #1: SOLUTION A
Reviews may be queried by user or book

// db.users (one document per user)
{
    _id: ObjectId("…"),
    username: "bob",
    email: "bob@example.com"
}

// db.reviews (one document per review)
{
    _id: ObjectId("…"),
    user: ObjectId("…"),
    book: ObjectId("…"),
    rating: 5,
    text: "This book is excellent!",
    created_at: ISODate("2012-10-10T21:14:07.096Z")
}

EXERCISE #1: SOLUTION B
Optimized to retrieve reviews by user

// db.users (one document per user with all reviews)
{
    _id: ObjectId("…"),
    username: "bob",
    email: "bob@example.com",
    reviews: [
        {
            book: ObjectId("…"),
            rating: 5,
            text: "This book is excellent!",
            created_at: ISODate("2012-10-10T21:14:07.096Z")
        }
    ]
}

EXERCISE #1: SOLUTION C
Optimized to retrieve reviews by book

// db.users (one document per user)
{
    _id: ObjectId("…"),
    username: "bob",
    email: "bob@example.com"
}

// db.books (one document per book with all reviews)
{
    _id: ObjectId("…"),
    // Other book fields…
    reviews: [
        {
            user: ObjectId("…"),
            rating: 5,
            text: "This book is excellent!",
            created_at: ISODate("2012-10-10T21:14:07.096Z")
        }
    ]
}

SCHEMA DESIGN OVER AN APPLICATION'S LIFETIME

Development
Production
Iterative Modifications

DEVELOPMENT PHASE
Basic CRUD functionality

CREATE
The _id field is unique and automatically indexed
MongoDB will generate an ObjectId if not provided

author = {
    _id: 2,
    first_name: "Arthur",
    last_name: "Miller"
};

db.authors.insert(author);

READ
> db.authors.find({ "last_name": "Miller" })
{
    _id: 2,
    first_name: "Arthur",
    last_name: "Miller"
}

READS AND INDEXING
Examine the query after creating an index.

> db.books.ensureIndex({ "slug": 1 })

> db.books.find({ "slug": "9781857150193-the-great-gatsby" }).explain()
{
    "cursor": "BtreeCursor slug_1",
    "isMultiKey" : false,
    "n" : 1,
    "nscannedObjects" : 1,
    "nscanned" : 1,
    "scanAndOrder" : false,
    "indexOnly" : false,
    "nYields" : 0,
    "nChunkSkips" : 0,
    "millis" : 0,
    // Other fields follow…
}

MULTI-KEY INDEXES
Index all values in an array field.

> db.books.ensureIndex({ "subjects": 1 });

INDEXING EMBEDDED FIELDS
Index an embedded object's field.

> db.books.ensureIndex({ "publisher.name": 1 })

QUERY OPERATORS
Conditional operators: $gt, $gte, $lt, $lte, $ne, $all, $in, $nin, $size, $and, $or, $nor, $mod, $type, $exists
Regular expressions
Value in an array: $elemMatch
Cursor methods and modifiers: count(), limit(), skip(), snapshot(), sort(), batchSize(), explain(), hint()

UPDATE
review = {
    user: 1,
    text: "I did NOT like this book."
};

db.books.update(
    { _id: 1 },
    { $push: { reviews: review }}
);

ATOMIC MODIFIERS
Update specific fields within a document

$set, $unset
$push, $pushAll
$addToSet, $pop
$pull, $pullAll
$rename
$bit

DELETE
> db.books.remove({ _id: 1 })

PRODUCTION PHASE
Evolve schema to meet the application's read and write patterns

READ USAGE
Finding books by an author's first name

authors = db.authors.find({ first_name: /^f.*/i }, { _id: 1 });

authorIds = authors.map(function(x) { return x._id; });

db.books.find({ author: { $in: authorIds }});

READ USAGE
"Cache" the author name in an embedded document

> db.books.findOne()
{
    _id: 1,
    title: "The Great Gatsby",
    author: {
        first_name: "F. Scott",
        last_name: "Fitzgerald"
    }
    // Other fields follow…
}

Queries are now one step (the dotted field name must be quoted)

> db.books.find({ "author.first_name": /^f.*/i })

WRITE USAGE
Users can review a book

Document size limit (16MB)
Storage fragmentation after many updates/deletes

review = {
    user: 1,
    text: "I thought this book was great!",
    rating: 5
};

> db.books.update(
    { _id: 3 },
    { $push: { reviews: review }}
);

EXERCISE #2
Display the 10 most recent reviews by a user
Make efficient use of memory and disk seeks

EXERCISE #2: SOLUTION
Store users' reviews in monthly buckets

// db.reviews (one document per user per month)
{
    _id: "bob-2012-10",
    reviews: [
        {
            _id: ObjectId("…"),
            rating: 5,
            text: "This book is excellent!",
            created_at: ISODate("2012-10-10T21:14:07.096Z")
        },
        {
            _id: ObjectId("…"),
            rating: 2,
            text: "I didn't really enjoy this book.",
            created_at: ISODate("2012-10-11T20:12:50.594Z")
        }
    ]
}

EXERCISE #2: SOLUTION
Adding a new review to the appropriate bucket

myReview = {
    _id: ObjectId("…"),
    rating: 3,
    text: "An average read.",
    created_at: ISODate("2012-10-13T12:26:11.502Z")
};

> db.reviews.update(
    { _id: "bob-2012-10" },
    { $push: { reviews: myReview }}
);
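The bucket `_id` used in the update above has to be derived from the username and the review's creation date. A minimal sketch in plain JavaScript, assuming the `<username>-<YYYY>-<MM>` format and UTC months; `reviewBucketId` is a hypothetical helper, not part of the original slides:

```javascript
// Hypothetical helper: builds the "<username>-<YYYY>-<MM>" bucket _id.
// Assumes buckets are keyed by the UTC month of the review's created_at.
function reviewBucketId(username, createdAt) {
    var year = createdAt.getUTCFullYear();
    var month = createdAt.getUTCMonth() + 1; // getUTCMonth() is zero-based
    var paddedMonth = month < 10 ? "0" + month : "" + month;
    return username + "-" + year + "-" + paddedMonth;
}

var bucketId = reviewBucketId("bob", new Date("2012-10-13T12:26:11.502Z"));
console.log(bucketId); // bob-2012-10
```

Zero-padding the month keeps the string ordering of bucket IDs the same as their chronological ordering, which a descending sort on `_id` depends on.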

EXERCISE #2: SOLUTION
Display the 10 most recent reviews by a user

cursor = db.reviews.find(
    { _id: /^bob-/ },
    // $push appends, so the newest reviews are the last array elements
    { reviews: { $slice: -10 }}
).sort({ _id: -1 });

num = 0;

while (cursor.hasNext() && num < 10) {
    doc = cursor.next();

    // Walk each bucket backwards so the newest review prints first
    for (var i = doc.reviews.length - 1; i >= 0 && num < 10; --i, ++num) {
        printjson(doc.reviews[i]);
    }
}
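The ordering logic can be checked without a database. The sketch below is plain JavaScript over in-memory arrays, assuming buckets arrive newest first and each `reviews` array is in insertion (oldest first) order, as `$push` produces; `mostRecentReviews` is a hypothetical name:

```javascript
// buckets: array of { _id, reviews } objects, newest bucket first.
// Each reviews array is assumed to be oldest-first (append order).
function mostRecentReviews(buckets, limit) {
    var result = [];
    for (var b = 0; b < buckets.length && result.length < limit; ++b) {
        var reviews = buckets[b].reviews;
        // Walk the bucket backwards so its newest review is taken first.
        for (var i = reviews.length - 1; i >= 0 && result.length < limit; --i) {
            result.push(reviews[i]);
        }
    }
    return result;
}

var sampleBuckets = [
    { _id: "bob-2012-10", reviews: [{ text: "c" }, { text: "d" }] },
    { _id: "bob-2012-09", reviews: [{ text: "a" }, { text: "b" }] }
];
var texts = mostRecentReviews(sampleBuckets, 3).map(function (r) { return r.text; });
console.log(texts.join(",")); // d,c,b
```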

EXERCISE #2: SOLUTION
Deleting a review

db.reviews.update(
    { _id: "bob-2012-10" },
    { $pull: { reviews: { _id: ObjectId("…") }}}
);

ITERATIVE MODIFICATIONS

Schema design is evolutionary

ALLOW USERS TO BROWSE BY BOOK SUBJECT

How can you search this collection?
Be aware of document size limitations
Benefit from hierarchy being in same document

> db.subjects.findOne()
{
    _id: 1,
    name: "American Literature",
    sub_category: {
        name: "1920s",
        sub_category: { name: "Jazz Age" }
    }
}

TREE STRUCTURES
> db.subjects.find()
{
    _id: "American Literature"
}

{
    _id: "1920s",
    ancestors: ["American Literature"],
    parent: "American Literature"
}

{
    _id: "Jazz Age",
    ancestors: ["American Literature", "1920s"],
    parent: "1920s"
}

{
    _id: "Jazz Age in New York",
    ancestors: ["American Literature", "1920s", "Jazz Age"],
    parent: "Jazz Age"
}

TREE STRUCTURES
Find sub-categories of a given subject

> db.subjects.find({ ancestors: "1920s" })
{
    _id: "Jazz Age",
    ancestors: ["American Literature", "1920s"],
    parent: "1920s"
}

{
    _id: "Jazz Age in New York",
    ancestors: ["American Literature", "1920s", "Jazz Age"],
    parent: "Jazz Age"
}
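Each subject's `ancestors` array is its parent's `ancestors` plus the parent's own `_id`. A small plain-JavaScript sketch of how an application might build these documents before inserting them; `makeSubject` is a hypothetical helper, not something shown in the slides:

```javascript
// Hypothetical helper: derive a subject document from its parent document.
function makeSubject(name, parent) {
    if (!parent) {
        return { _id: name }; // root subject has no ancestors or parent
    }
    return {
        _id: name,
        // Copy the parent's ancestor chain and append the parent itself.
        ancestors: (parent.ancestors || []).concat(parent._id),
        parent: parent._id
    };
}

var root = makeSubject("American Literature");
var twenties = makeSubject("1920s", root);
var jazzAge = makeSubject("Jazz Age", twenties);
console.log(JSON.stringify(jazzAge));
// {"_id":"Jazz Age","ancestors":["American Literature","1920s"],"parent":"1920s"}
```

Storing the full ancestor chain trades extra space per document for the single-query subtree lookup shown above.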

EXERCISE #3
Allow users to borrow library books

User sends a loan request
Library approves or not
Requests time out after seven days

Approval process is asynchronous
Requests may be prioritized

EXERCISE #3: SOLUTION
Need to maintain order and state
Ensure that updates are atomic

// Create a new loan request
> db.loans.insert({
    _id: { borrower: "bob", book: ObjectId("…") },
    pending: false,
    approved: false,
    priority: 1
});

// Find the highest priority request and mark as pending approval
request = db.loans.findAndModify({
    query: { pending: false },
    sort: { priority: -1 },
    update: { $set: { pending: true, started: new ISODate() }},
    new: true
});

EXERCISE #3: SOLUTION
Updated and added fields
Modified document was returned

{
    _id: { borrower: "bob", book: ObjectId("…") },
    pending: true,
    approved: false,
    priority: 1,
    started: ISODate("2012-10-11T22:09:42.542Z")
}

EXERCISE #3: SOLUTION
// Library approves the loan request
> db.loans.update(
    { _id: { borrower: "bob", book: ObjectId("…") }},
    { $set: { pending: false, approved: true }}
);

EXERCISE #3: SOLUTION
// Request times out after seven days
limit = new Date();
limit.setDate(limit.getDate() - 7);

> db.loans.update(
    { pending: true, started: { $lt: limit }},
    { $set: { pending: false, approved: false }}
);
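The state transitions in this exercise can be traced with an in-memory stand-in. The sketch below is plain JavaScript over an array, not MongoDB; it mirrors the slides' filters (`pending: false` sorted by priority, then `pending: true` with `started` older than seven days). `claimNextLoan` and `expireStaleLoans` are hypothetical names, and the single-threaded loop only imitates what `findAndModify` does atomically on the server:

```javascript
var SEVEN_DAYS_MS = 7 * 24 * 60 * 60 * 1000;

// Stand-in for findAndModify: pick the highest-priority non-pending
// request, mark it pending, and return the modified object (or null).
function claimNextLoan(loans, now) {
    var best = null;
    loans.forEach(function (loan) {
        if (!loan.pending && (best === null || loan.priority > best.priority)) {
            best = loan;
        }
    });
    if (best !== null) {
        best.pending = true;
        best.started = now;
    }
    return best;
}

// Stand-in for the timeout update: reset requests pending for over 7 days.
function expireStaleLoans(loans, now) {
    var cutoff = new Date(now.getTime() - SEVEN_DAYS_MS);
    var expired = 0;
    loans.forEach(function (loan) {
        if (loan.pending && loan.started < cutoff) {
            loan.pending = false;
            loan.approved = false;
            ++expired;
        }
    });
    return expired;
}

var sampleLoans = [
    { _id: "bob", pending: false, approved: false, priority: 1 },
    { _id: "amy", pending: false, approved: false, priority: 5 }
];
var claimed = claimNextLoan(sampleLoans, new Date("2012-10-11T22:09:42.542Z"));
console.log(claimed._id); // amy
console.log(expireStaleLoans(sampleLoans, new Date("2012-10-19T22:09:42.542Z"))); // 1
```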

EXERCISE #4
Allow users to recommend books

Users can recommend each book only once
Display a book's current recommendations

EXERCISE #4: SOLUTION
// db.recommendations (one document per user per book)
> db.recommendations.insert({
    book: ObjectId("…"),
    user: ObjectId("…")
});

// Unique index ensures users can't recommend twice
> db.recommendations.ensureIndex(
    { book: 1, user: 1 },
    { unique: true }
);

// Count the number of recommendations for a book
> db.recommendations.count({ book: ObjectId("…") });

EXERCISE #4: SOLUTION
MongoDB indexes do not maintain counts
Counts are computed via index scans
Denormalize totals on books

> db.books.update(
    { _id: ObjectId("…") },
    { $inc: { recommendations: 1 }}
);

COMMON DESIGN PATTERNS

ONE-TO-ONE RELATIONSHIP

Let's pretend that authors only write one book.

LINKING
Either side, or both, can track the relationship.

> db.books.findOne()
{
    _id: 1,
    title: "The Great Gatsby",
    slug: "9781857150193-the-great-gatsby",
    author: 1,
    // Other fields follow…
}

> db.authors.findOne({ _id: 1 })
{
    _id: 1,
    first_name: "F. Scott",
    last_name: "Fitzgerald",
    book: 1
}

EMBEDDED OBJECT
> db.books.findOne()
{
    _id: 1,
    title: "The Great Gatsby",
    slug: "9781857150193-the-great-gatsby",
    author: {
        first_name: "F. Scott",
        last_name: "Fitzgerald"
    }
    // Other fields follow…
}

ONE-TO-MANY RELATIONSHIP

In reality, authors may write multiple books.

ARRAY OF IDS
The "one" side tracks the relationship.

Flexible and space-efficient
Additional query needed for non-ID lookups

> db.authors.findOne()
{
    _id: 1,
    first_name: "F. Scott",
    last_name: "Fitzgerald",
    books: [1, 3, 20]
}

SINGLE FIELD WITH ID
The "many" side tracks the relationship.

> db.books.find({ author: 1 })
{
    _id: 1,
    title: "The Great Gatsby",
    slug: "9781857150193-the-great-gatsby",
    author: 1,
    // Other fields follow…
}

{
    _id: 3,
    title: "This Side of Paradise",
    slug: "9780679447238-this-side-of-paradise",
    author: 1,
    // Other fields follow…
}

ARRAY OF OBJECTS

Use $slice operator to return a subset of books

> db.authors.findOne()
{
    _id: 1,
    first_name: "F. Scott",
    last_name: "Fitzgerald",
    books: [
        { _id: 1, title: "The Great Gatsby" },
        { _id: 3, title: "This Side of Paradise" }
    ]
    // Other fields follow…
}

MANY-TO-MANY RELATIONSHIP

Some books may also have co-authors.

ARRAY OF IDS ON BOTH SIDES
> db.books.findOne()
{
    _id: 1,
    title: "The Great Gatsby",
    authors: [1, 5]
    // Other fields follow…
}

> db.authors.findOne()
{
    _id: 1,
    first_name: "F. Scott",
    last_name: "Fitzgerald",
    books: [1, 3, 20]
}

ARRAY OF IDS ON BOTH SIDES
Query for all books by a given author

> db.books.find({ authors: 1 });

Query for all authors of a given book

> db.authors.find({ books: 1 });

ARRAY OF IDS ON ONE SIDE
> db.books.findOne()
{
    _id: 1,
    title: "The Great Gatsby",
    authors: [1, 5]
    // Other fields follow…
}

> db.authors.find({ _id: { $in: [1, 5] }})
{
    _id: 1,
    first_name: "F. Scott",
    last_name: "Fitzgerald"
}

{
    _id: 5,
    first_name: "Unknown",
    last_name: "Co-author"
}

ARRAY OF IDS ON ONE SIDE
Query for all books by a given author

> db.books.find({ authors: 1 });

Query for all authors of a given book

book = db.books.findOne(
    { title: "The Great Gatsby" },
    { authors: 1 }
);

db.authors.find({ _id: { $in: book.authors }});

EXERCISE #5
Tracking time series data

Graph recommendations per unit of time
Count by: day, hour, minute

EXERCISE #5: SOLUTION A
// db.rec_ts (time series buckets, hour and minute sub-docs)
> db.rec_ts.insert({
    book: ObjectId("…"),
    day: ISODate("2012-10-11T00:00:00.000Z"),
    total: 0,
    hour:   { "0": 0, "1": 0, /* … */ "23": 0 },
    minute: { "0": 0, "1": 0, /* … */ "1439": 0 }
});

// Record a recommendation created one minute before midnight
> db.rec_ts.update(
    { book: ObjectId("…"), day: ISODate("2012-10-11T00:00:00.000Z") },
    { $inc: { total: 1, "hour.23": 1, "minute.1439": 1 }}
);

BSON STORAGE
Sequence of key/value pairs
Not a hash map
Optimized to scan quickly

minute: [0] [1] … [1439]

What is the cost of updating the minute before midnight?

BSON STORAGE
We can skip sub-documents

hour0: [0] [1] … [59]
…
hour23: [1380] … [1439]

How could this change the schema?
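A rough way to compare the two layouts is to count how many keys must be passed over before the target minute is reached. The sketch below is a simplified model in plain JavaScript, assuming one unit of work per skipped key or skipped sub-document; it ignores byte sizes and any extra fields, so the numbers are illustrative only:

```javascript
// Flat layout: every earlier minute key is scanned before the target.
function keysSkippedFlat(minuteOfDay) {
    return minuteOfDay;
}

// Nested layout: skip whole earlier hour sub-documents, then scan the
// earlier minutes inside the target hour.
function keysSkippedNested(minuteOfDay) {
    var hour = Math.floor(minuteOfDay / 60);
    var minuteInHour = minuteOfDay % 60;
    return hour + minuteInHour;
}

console.log(keysSkippedFlat(1439));   // 1439
console.log(keysSkippedNested(1439)); // 82
```

Under this model the worst case drops from 1439 skipped keys to 23 skipped hours plus 59 skipped minutes.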

EXERCISE #5: SOLUTION B
// db.rec_ts (time series buckets, each hour a sub-doc)
> db.rec_ts.insert({
    book: ObjectId("…"),
    day: ISODate("2012-10-11T00:00:00.000Z"),
    total: 148,
    hour: {
        "0": { total: 7, "0": 0, /* … */ "59": 2 },
        "1": { total: 3, "60": 1, /* … */ "119": 0 },
        // Other hours…
        "23": { total: 12, "1380": 0, /* … */ "1439": 3 }
    }
});

// Record a recommendation created one minute before midnight
> db.rec_ts.update(
    { book: ObjectId("…"), day: ISODate("2012-10-11T00:00:00.000Z") },
    { $inc: { total: 1, "hour.23.total": 1, "hour.23.1439": 1 }}
);
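The field paths in the `$inc` document follow directly from the timestamp. A minimal plain-JavaScript sketch of how an application might compute them, assuming UTC day boundaries and the minute-of-day key numbering shown above; `dayBucket` and `recommendationIncrement` are hypothetical helper names:

```javascript
// Midnight (UTC) of the day containing createdAt, used as the "day" field.
function dayBucket(createdAt) {
    return new Date(Date.UTC(
        createdAt.getUTCFullYear(),
        createdAt.getUTCMonth(),
        createdAt.getUTCDate()
    ));
}

// Builds the update document for one recommendation at createdAt.
function recommendationIncrement(createdAt) {
    var hour = createdAt.getUTCHours();
    var minuteOfDay = hour * 60 + createdAt.getUTCMinutes();
    var inc = { total: 1 };
    inc["hour." + hour + ".total"] = 1;
    inc["hour." + hour + "." + minuteOfDay] = 1;
    return { $inc: inc };
}

var when = new Date("2012-10-11T23:59:30.000Z");
console.log(dayBucket(when).toISOString());
// 2012-10-11T00:00:00.000Z
console.log(JSON.stringify(recommendationIncrement(when)));
// {"$inc":{"total":1,"hour.23.total":1,"hour.23.1439":1}}
```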

SINGLE-COLLECTION INHERITANCE
Take advantage of MongoDB's features

Documents need not all have the same fields
Sparsely index only present fields

SCHEMA FLEXIBILITY

Find all books that are part of a series

> db.books.findOne()
{
    _id: 47,
    title: "The Wizard Chase",
    type: "series",
    series_title: "The Wizard's Trilogy",
    volume: 2
    // Other fields follow…
}

> db.books.find({ type: "series" });

> db.books.find({ series_title: { $exists: true }});

> db.books.find({ volume: { $gt: 0 }});

INDEX ONLY PRESENT FIELDS
Documents without these fields will not be indexed.

> db.books.ensureIndex({ series_title: 1 }, { sparse: true })

> db.books.ensureIndex({ volume: 1 }, { sparse: true })

EXERCISE #6
Users can recommend at most 10 books

EXERCISE #6: SOLUTION
// db.user_recs (track user's remaining and given recommendations)
> db.user_recs.insert({
    _id: "bob",
    remaining: 8,
    books: [3, 10]
});

// Record a recommendation if possible
> db.user_recs.update(
    { _id: "bob", remaining: { $gt: 0 }, books: { $ne: 4 }},
    { $inc: { remaining: -1 }, $push: { books: 4 }}
);

EXERCISE #6: SOLUTION
One less unassigned recommendation remaining
Newly-recommended book is now linked

> db.user_recs.findOne()
{
    _id: "bob",
    remaining: 7,
    books: [3, 10, 4]
}
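The update only applies when both conditions in its query hold: recommendations remain and the book is not already listed. A plain-JavaScript sketch of that rule on an in-memory object, assuming the same field names; `recommend` is a hypothetical function, and unlike the MongoDB update it is not atomic across concurrent clients:

```javascript
// Applies the recommendation only if the guard conditions hold.
// Returns true when the document was modified, false otherwise.
function recommend(userRecs, bookId) {
    var allowed = userRecs.remaining > 0 &&
                  userRecs.books.indexOf(bookId) === -1;
    if (allowed) {
        userRecs.remaining -= 1;     // mirrors $inc: { remaining: -1 }
        userRecs.books.push(bookId); // mirrors $push: { books: bookId }
    }
    return allowed;
}

var bobRecs = { _id: "bob", remaining: 8, books: [3, 10] };
console.log(recommend(bobRecs, 4)); // true
console.log(recommend(bobRecs, 4)); // false
console.log(JSON.stringify(bobRecs));
// {"_id":"bob","remaining":7,"books":[3,10,4]}
```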

EXERCISE #7
Statistic buckets

Each book has a listing page in our application
Record referring website domains for each book
Count each domain independently

EXERCISE #7: SOLUTION A
> db.book_refs.findOne()
{
    book: 1,
    referrers: [
        { domain: "google.com", count: 4 },
        { domain: "yahoo.com", count: 1 }
    ]
}

> db.book_refs.update(
    { book: 1, "referrers.domain": "google.com" },
    { $inc: { "referrers.$.count": 1 }}
);

EXERCISE #7: SOLUTION A
Update the position of the first matched element.

What if a new referring website is used?

> db.book_refs.update(
    { book: 1, "referrers.domain": "google.com" },
    { $inc: { "referrers.$.count": 1 }}
);

> db.book_refs.findOne()
{
    book: 1,
    referrers: [
        { domain: "google.com", count: 5 },
        { domain: "yahoo.com", count: 1 }
    ]
}

EXERCISE #7: SOLUTION B

Replace dots with underscores for key names
Increment to add a new referring website
Upsert in case this is the book's first referrer

> db.book_refs.findOne()
{
    book: 1,
    referrers: {
        "google_com": 5,
        "yahoo_com": 1
    }
}

> db.book_refs.update(
    { book: 1 },
    { $inc: { "referrers.bing_com": 1 }},
    true
);
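Dots separate path segments in update field names, so a raw domain cannot be used as a key. A short plain-JavaScript sketch of building the `$inc` document from a domain, assuming the dots-to-underscores rule stated above; `referrerIncrement` is a hypothetical helper:

```javascript
// Builds { $inc: { "referrers.<domain_with_underscores>": 1 } }.
function referrerIncrement(domain) {
    // Replace every dot so the domain becomes a single path segment.
    var key = "referrers." + domain.replace(/\./g, "_");
    var inc = {};
    inc[key] = 1;
    return { $inc: inc };
}

console.log(JSON.stringify(referrerIncrement("bing.com")));
// {"$inc":{"referrers.bing_com":1}}
```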

SHARDING

SHARDING
Ad-hoc partitioning
Consistent hashing: Amazon DynamoDB
Range based partitioning: Google BigTable, Yahoo! PNUTS, MongoDB

SHARDING IN MONGODB
Automated management
Range based partitioning
Convert to sharded system with no downtime
Fully consistent

SHARDING A COLLECTION

Keys range from −∞ to +∞
Ranges are stored as chunks

> db.runCommand({ addshard : "shard1.example.com" });

> db.runCommand({
    shardCollection: "library.books",
    key: { _id : 1 }
});

SHARDING DATA BY CHUNKS

[−∞, +∞)
→ [−∞, 40) [40, +∞)
→ [−∞, 40) [40, 50) [50, +∞)

Ranges are split into chunks as data is inserted

> db.books.save({ _id: 35, title: "Call of the Wild" });
> db.books.save({ _id: 40, title: "Tropic of Cancer" });
> db.books.save({ _id: 45, title: "The Jungle" });
> db.books.save({ _id: 50, title: "Of Mice and Men" });
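Chunk ranges are half-open, so a key equal to a split point belongs to the chunk that starts there. The plain-JavaScript sketch below models that lookup for numeric keys and a sorted list of split points; it illustrates the range rule only and is not how MongoDB is implemented. `chunkIndexFor` and `chunkRange` are hypothetical names:

```javascript
// splitPoints must be sorted ascending. Chunk i covers
// [splitPoints[i - 1], splitPoints[i]), with -Infinity and +Infinity
// as the outer bounds.
function chunkIndexFor(splitPoints, key) {
    var index = 0;
    while (index < splitPoints.length && key >= splitPoints[index]) {
        ++index;
    }
    return index;
}

// Human-readable range for a chunk index.
function chunkRange(splitPoints, index) {
    var min = index === 0 ? -Infinity : splitPoints[index - 1];
    var max = index === splitPoints.length ? Infinity : splitPoints[index];
    return "[" + min + ", " + max + ")";
}

var splits = [40, 50];
[35, 40, 45, 50].forEach(function (id) {
    console.log(id + " -> " + chunkRange(splits, chunkIndexFor(splits, id)));
});
// 35 -> [-Infinity, 40)
// 40 -> [40, 50)
// 45 -> [40, 50)
// 50 -> [50, Infinity)
```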

ADDING NEW SHARDS
shard1: [−∞, 40) [40, 50) [50, 60) [60, +∞)

ADDING NEW SHARDS
Chunks are migrated to balance shards

> db.runCommand({ addshard : "shard2.example.com" });

shard1: [−∞, 40) [50, 60)
shard2: [40, 50) [60, +∞)

ADDING NEW SHARDS

> db.runCommand({ addshard : "shard3.example.com" });

shard1: [−∞, 40)
shard2: [40, 50) [60, +∞)
shard3: [50, 60)

SHARDING COMPONENTS
mongos
Config servers
Shards: mongod, replica sets

SHARDED WRITES
Inserts
    Shard key required
    Routed
Updates and removes
    Shard key optional
    May be routed or scattered

SHARDED READS
Queries
    By shard key: routed
    Without shard key: scatter/gather
Sorted queries
    By shard key: routed in order
    Without shard key: distributed merge sort

EXERCISE #8
Users can upload images for books

images
    image_id: ???
    data: binary

The collection will be sharded by image_id.

What should image_id be?

EXERCISE #8: SOLUTIONS
What's the best shard key for our use case?

Auto-increment (ObjectId): right-balanced access
MD5 of data: random access
Time (e.g. month) and MD5: segmented access
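The second and third options can be sketched with Node's built-in `crypto` module. The example below assumes a `YYYY-MM-<md5>` string layout for the combined key; that layout and the helper names `md5Hex` and `monthPrefixedImageId` are illustrative choices, not something the slides specify:

```javascript
var crypto = require("crypto");

// MD5 of the image data as a hex string: keys are spread evenly across
// the key space, so reads and writes touch chunks at random.
function md5Hex(data) {
    return crypto.createHash("md5").update(data).digest("hex");
}

// Month prefix plus MD5: writes for the current month land in one
// segment of the key space and are spread randomly within it.
function monthPrefixedImageId(data, uploadedAt) {
    var month = uploadedAt.getUTCMonth() + 1;
    var prefix = uploadedAt.getUTCFullYear() + "-" +
                 (month < 10 ? "0" : "") + month;
    return prefix + "-" + md5Hex(data);
}

console.log(md5Hex("abc"));
// 900150983cd24fb0d6963f7d28e17f72
console.log(monthPrefixedImageId("abc", new Date("2012-10-23T00:00:00Z")));
// 2012-10-900150983cd24fb0d6963f7d28e17f72
```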

SUMMARY
Schema design is different in MongoDB.
Basic data design principles apply.
It's about your application.
It's about your data and how it's used.
It's about the entire lifetime of your application.

THANKS!
QUESTIONS?
