SCHEMA DESIGN WORKSHOP
Jeremy Mikola (@jmikola)
AGENDA
1. Basic schema design principles for MongoDB
2. Schema design over an application's lifetime
3. Common design patterns
4. Sharding
GOALS
Learn the schema design process in MongoDB
Practice applying common principles via exercises
Understand the implications of sharding
WHAT IS A SCHEMA AND WHY IS IT IMPORTANT?
SCHEMA
Map concepts and relationships to data
Set expectations for the data
Minimize overhead of iterative modifications
Ensure compatibility
NORMALIZATION
users: username, first_name, last_name
books: title, isbn, language, created_by (→ users), author (→ authors)
authors: first_name, last_name
DENORMALIZATION
users: username, first_name, last_name
books: title, isbn, language, created_by (→ users), author { first_name, last_name }
WHAT IS SCHEMA DESIGN LIKE IN MONGODB?
Schema is defined at the application level
Design is part of each phase in its lifetime
There is no magic formula
MONGODB DOCUMENTS
Storage in BSON → BSONSpec.org
Scalars
  Doubles
  Integers (32 or 64-bit)
  UTF-8 strings
  UTC Date, timestamp
  Binary, regex, code
  Object ID
  null
Rich types
  Objects
  Arrays
TERMINOLOGY
{
  "mongodb" : "relational db",
  "database" : "database",
  "collection" : "table",
  "document" : "row",
  "index" : "index",
  "sharding" : {
    "shard" : "partition",
    "shard key" : "partition key"
  }
}
THREE CONSIDERATIONS IN MONGODB SCHEMA DESIGN
1. The data your application needs
2. Your application's read usage of the data
3. Your application's write usage of the data
CASE STUDY
LIBRARY WEB APPLICATION
Different schemas are possible
AUTHOR SCHEMA
{
  "_id": int,
  "first_name": string,
  "last_name": string
}
USER SCHEMA
{
  "_id": int,
  "username": string,
  "password": string
}
BOOK SCHEMA
{
  "_id": int,
  "title": string,
  "slug": string,
  "author": int,
  "available": boolean,
  "isbn": string,
  "pages": int,
  "publisher": { "city": string, "date": date, "name": string },
  "subjects": [ string, string ],
  "language": string,
  "reviews": [
    { "user": int, "text": string },
    { "user": int, "text": string }
  ]
}
EXAMPLE DOCUMENTS
AUTHOR DOCUMENT
> db.authors.findOne()
{
  _id: 1,
  first_name: "F. Scott",
  last_name: "Fitzgerald"
}
USER DOCUMENT
> db.users.findOne()
{
  _id: 1,
  username: "[email protected]",
  password: "slsjfk4odk84k209dlkdj90009283d"
}
BOOK DOCUMENT
> db.books.findOne()
{
  _id: 1,
  title: "The Great Gatsby",
  slug: "9781857150193-the-great-gatsby",
  author: 1,
  available: true,
  isbn: "9781857150193",
  pages: 176,
  publisher: {
    name: "Everyman's Library",
    date: ISODate("1991-09-19T00:00:00Z"),
    city: "London"
  },
  subjects: ["Love stories", "1920s", "Jazz Age"],
  language: "English",
  reviews: [
    { user: 1, text: "One of the best…" },
    { user: 2, text: "It's hard to…" }
  ]
}
EMBEDDED OBJECTS
AKA EMBEDDED OR SUB-DOCUMENTS
What advantages do they have?
When should they be used?
EMBEDDED OBJECTS
> db.books.findOne()
{
  _id: 1,
  title: "The Great Gatsby",
  slug: "9781857150193-the-great-gatsby",
  author: 1,
  available: true,
  isbn: "9781857150193",
  pages: 176,
  publisher: {
    name: "Everyman's Library",
    date: ISODate("1991-09-19T00:00:00Z"),
    city: "London"
  },
  subjects: ["Love stories", "1920s", "Jazz Age"],
  language: "English",
  reviews: [
    { user: 1, text: "One of the best…" },
    { user: 2, text: "It's hard to…" }
  ]
}
EMBEDDED OBJECTS
Great for read performance
One seek to load the entire document
One round trip to the database
Writes can be slow if constantly adding to objects
LINKED DOCUMENTS
What advantages does this approach have?
When should they be used?
LINKED DOCUMENTS
> db.books.findOne()
{
  _id: 1,
  title: "The Great Gatsby",
  slug: "9781857150193-the-great-gatsby",
  author: 1,
  available: true,
  isbn: "9781857150193",
  pages: 176,
  publisher: {
    publisher_name: "Everyman's Library",
    date: ISODate("1991-09-19T00:00:00Z"),
    publisher_city: "London"
  },
  subjects: ["Love stories", "1920s", "Jazz Age"],
  language: "English",
  reviews: [
    { user: 1, text: "One of the best…" },
    { user: 2, text: "It's hard to…" }
  ]
}
LINKED DOCUMENTS
More, smaller documents
Can make queries by ID very simple
Accessing linked document data requires an extra read
What effect does this have on the system?
DATA, RAM AND DISK
ARRAYS
When should they be used?
ARRAY OF SCALARS
> db.books.findOne()
{
  _id: 1,
  title: "The Great Gatsby",
  slug: "9781857150193-the-great-gatsby",
  author: 1,
  available: true,
  isbn: "9781857150193",
  pages: 176,
  publisher: {
    name: "Everyman's Library",
    date: ISODate("1991-09-19T00:00:00Z"),
    city: "London"
  },
  subjects: ["Love stories", "1920s", "Jazz Age"],
  language: "English",
  reviews: [
    { user: 1, text: "One of the best…" },
    { user: 2, text: "It's hard to…" }
  ]
}
ARRAY OF OBJECTS
> db.books.findOne()
{
  _id: 1,
  title: "The Great Gatsby",
  slug: "9781857150193-the-great-gatsby",
  author: 1,
  available: true,
  isbn: "9781857150193",
  pages: 176,
  publisher: {
    name: "Everyman's Library",
    date: ISODate("1991-09-19T00:00:00Z"),
    city: "London"
  },
  subjects: ["Love stories", "1920s", "Jazz Age"],
  language: "English",
  reviews: [
    { user: 1, text: "One of the best…" },
    { user: 2, text: "It's hard to…" }
  ]
}
EXERCISE #1
Design a schema for users and their book reviews
Users
  username (string)
  email (string)
Reviews
  text (string)
  rating (integer)
  created_at (date)
Usernames are immutable
EXERCISE #1: SOLUTION A
Reviews may be queried by user or book
// db.users (one document per user)
{
  _id: ObjectId("…"),
  username: "bob",
  email: "[email protected]"
}
// db.reviews (one document per review)
{
  _id: ObjectId("…"),
  user: ObjectId("…"),
  book: ObjectId("…"),
  rating: 5,
  text: "This book is excellent!",
  created_at: ISODate("2012-10-10T21:14:07.096Z")
}
EXERCISE #1: SOLUTION B
Optimized to retrieve reviews by user
// db.users (one document per user with all reviews)
{
  _id: ObjectId("…"),
  username: "bob",
  email: "[email protected]",
  reviews: [
    {
      book: ObjectId("…"),
      rating: 5,
      text: "This book is excellent!",
      created_at: ISODate("2012-10-10T21:14:07.096Z")
    }
  ]
}
EXERCISE #1: SOLUTION C
Optimized to retrieve reviews by book
// db.users (one document per user)
{
  _id: ObjectId("…"),
  username: "bob",
  email: "[email protected]"
}
// db.books (one document per book with all reviews)
{
  _id: ObjectId("…"),
  // Other book fields…
  reviews: [
    {
      user: ObjectId("…"),
      rating: 5,
      text: "This book is excellent!",
      created_at: ISODate("2012-10-10T21:14:07.096Z")
    }
  ]
}
SCHEMA DESIGN OVER AN APPLICATION'S LIFETIME
Development
Production
Iterative Modifications
DEVELOPMENT PHASE
Basic CRUD functionality
CREATE
The _id field is unique and automatically indexed
MongoDB will generate an ObjectId if not provided
author = {
  _id: 2,
  first_name: "Arthur",
  last_name: "Miller"
};
db.authors.insert(author);
READ
> db.authors.find({ "last_name": "Miller" })
{
  _id: 2,
  first_name: "Arthur",
  last_name: "Miller"
}
READS AND INDEXING
Examine the query after creating an index.
> db.books.ensureIndex({ "slug": 1 })
> db.books.find({ "slug": "9781857150193-the-great-gatsby" }).explain()
{
  "cursor": "BtreeCursor slug_1",
  "isMultiKey" : false,
  "n" : 1,
  "nscannedObjects" : 1,
  "nscanned" : 1,
  "scanAndOrder" : false,
  "indexOnly" : false,
  "nYields" : 0,
  "nChunkSkips" : 0,
  "millis" : 0,
  // Other fields follow…
}
MULTI-KEY INDEXES
Index all values in an array field.
> db.books.ensureIndex({ "subjects": 1 });
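As a rough illustration of what "index all values in an array field" means, the plain JavaScript sketch below builds one index entry per array element. The `buildMultiKeyEntries` helper and the in-memory `books` array are hypothetical; they only model the idea and are not MongoDB's actual index implementation.

```javascript
// Hypothetical in-memory model of a multi-key index (not MongoDB internals).
function buildMultiKeyEntries(docs, field) {
  const entries = [];
  for (const doc of docs) {
    const value = doc[field];
    // An array field contributes one entry per element; a scalar contributes one.
    const keys = Array.isArray(value) ? value : [value];
    for (const key of keys) {
      entries.push({ key: key, _id: doc._id });
    }
  }
  // Keep entries ordered by key, as a sorted index would.
  entries.sort((a, b) => (a.key < b.key ? -1 : a.key > b.key ? 1 : 0));
  return entries;
}

const books = [
  { _id: 1, subjects: ["Love stories", "1920s", "Jazz Age"] },
  { _id: 2, subjects: ["1920s"] }
];
const subjectEntries = buildMultiKeyEntries(books, "subjects");
console.log(subjectEntries.length); // 4 entries for 2 documents
```

A query such as `{ subjects: "1920s" }` can then be answered by looking up a single key, regardless of where the value sits in each document's array.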
INDEXING EMBEDDED FIELDS
Index an embedded object's field.
> db.books.ensureIndex({ "publisher.name": 1 })
QUERY OPERATORS
Conditional operators
  $gt, $gte, $lt, $lte, $ne, $all, $in, $nin, $size,
  $and, $or, $nor, $mod, $type, $exists
Regular expressions
Value in an array
  $elemMatch
Cursor methods and modifiers
  count(), limit(), skip(), snapshot(), sort(),
  batchSize(), explain(), hint()
UPDATE
review = {
  user: 1,
  text: "I did NOT like this book."
};
db.books.update(
  { _id: 1 },
  { $push: { reviews: review }}
);
ATOMIC MODIFIERS
Update specific fields within a document
$set, $unset
$push, $pushAll
$addToSet, $pop
$pull, $pullAll
$rename
$bit
DELETE
> db.books.remove({ _id: 1 })
PRODUCTION PHASE
Evolve schema to meet the application's read and write patterns
READ USAGE
Finding books by an author's first name
authors = db.authors.find({ first_name: /^f.*/i }, { _id: 1 });
authorIds = authors.map(function(x) { return x._id; });
db.books.find({ author: { $in: authorIds }});
READ USAGE
"Cache" the author name in an embedded document
Queries are now one step
> db.books.findOne()
{
  _id: 1,
  title: "The Great Gatsby",
  author: {
    first_name: "F. Scott",
    last_name: "Fitzgerald"
  }
  // Other fields follow…
}
> db.books.find({ "author.first_name": /^f.*/i })
WRITE USAGE
Users can review a book
Document size limit (16MB)
Storage fragmentation after many updates/deletes
review = {
  user: 1,
  text: "I thought this book was great!",
  rating: 5
};
> db.books.update(
    { _id: 3 },
    { $push: { reviews: review }}
  );
EXERCISE #2
Display the 10 most recent reviews by a user
Make efficient use of memory and disk seeks
EXERCISE #2: SOLUTION
Store users' reviews in monthly buckets
// db.reviews (one document per user per month)
{
  _id: "bob-201210",
  reviews: [
    {
      _id: ObjectId("…"),
      rating: 5,
      text: "This book is excellent!",
      created_at: ISODate("2012-10-10T21:14:07.096Z")
    },
    {
      _id: ObjectId("…"),
      rating: 2,
      text: "I didn't really enjoy this book.",
      created_at: ISODate("2012-10-11T20:12:50.594Z")
    }
  ]
}
EXERCISE #2: SOLUTION
Adding a new review to the appropriate bucket
myReview = {
  _id: ObjectId("…"),
  rating: 3,
  text: "An average read.",
  created_at: ISODate("2012-10-13T12:26:11.502Z")
};
> db.reviews.update(
    { _id: "bob-201210" },
    { $push: { reviews: myReview }}
  );
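Choosing "the appropriate bucket" comes down to deriving the bucket `_id` from the username and the review date. A minimal sketch, assuming the `<username>-<YYYYMM>` format of the bucket document shown earlier and UTC timestamps; the `reviewBucketId` helper is hypothetical and not part of the original slides.

```javascript
// Hypothetical helper: derive the monthly bucket _id for a user's review.
// Assumes the "<username>-<YYYYMM>" format and UTC dates.
function reviewBucketId(username, createdAt) {
  const year = createdAt.getUTCFullYear();
  const month = String(createdAt.getUTCMonth() + 1).padStart(2, "0");
  return username + "-" + year + month;
}

const bucketId = reviewBucketId("bob", new Date("2012-10-13T12:26:11.502Z"));
console.log(bucketId); // "bob-201210"
```

Zero-padding the month keeps the ids in chronological order when sorted as strings, which a descending sort on `_id` depends on.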
EXERCISE #2: SOLUTION
Display the 10 most recent reviews by a user
cursor = db.reviews.find(
  { _id: /^bob-/ },
  { reviews: { $slice: -10 }}
).sort({ _id: -1 });
num = 0;
while (cursor.hasNext() && num < 10) {
  doc = cursor.next();
  // $push appends, so the newest reviews sit at the end of each bucket
  for (var i = doc.reviews.length - 1; i >= 0 && num < 10; --i, ++num) {
    printjson(doc.reviews[i]);
  }
}
EXERCISE #2: SOLUTION
Deleting a review
db.reviews.update(
  { _id: "bob-201210" },
  { $pull: { reviews: { _id: ObjectId("…") }}}
);
ITERATIVE MODIFICATIONS
Schema design is evolutionary
ALLOW USERS TO BROWSE BY BOOK SUBJECT
How can you search this collection?
Be aware of document size limitations
Benefit from hierarchy being in same document
> db.subjects.findOne()
{
  _id: 1,
  name: "American Literature",
  sub_category: {
    name: "1920s",
    sub_category: { name: "Jazz Age" }
  }
}
TREE STRUCTURES
> db.subjects.find()
{ _id: "American Literature" }
{ _id: "1920s",
  ancestors: ["American Literature"],
  parent: "American Literature" }
{ _id: "Jazz Age",
  ancestors: ["American Literature", "1920s"],
  parent: "1920s" }
{ _id: "Jazz Age in New York",
  ancestors: ["American Literature", "1920s", "Jazz Age"],
  parent: "Jazz Age" }
TREE STRUCTURES
Find sub-categories of a given subject
> db.subjects.find({ ancestors: "1920s" })
{ _id: "Jazz Age",
  ancestors: ["American Literature", "1920s"],
  parent: "1920s" }
{ _id: "Jazz Age in New York",
  ancestors: ["American Literature", "1920s", "Jazz Age"],
  parent: "Jazz Age" }
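The ancestors array has to be computed when a subject is inserted. One possible way to do that, sketched in plain JavaScript: copy the parent's ancestors and append the parent itself. The `buildSubject` helper and the `subjectsById` lookup are hypothetical stand-ins for application code and a parent lookup.

```javascript
// Hypothetical helper: compute the ancestors array for a new subject,
// given the already-known subject documents keyed by _id.
function buildSubject(id, parentId, known) {
  if (parentId === null) {
    return { _id: id }; // root subjects carry no ancestors or parent
  }
  const parent = known[parentId];
  const parentAncestors = parent.ancestors || [];
  return {
    _id: id,
    ancestors: parentAncestors.concat([parentId]),
    parent: parentId
  };
}

const subjectsById = {};
subjectsById["American Literature"] = buildSubject("American Literature", null, subjectsById);
subjectsById["1920s"] = buildSubject("1920s", "American Literature", subjectsById);
subjectsById["Jazz Age"] = buildSubject("Jazz Age", "1920s", subjectsById);
console.log(JSON.stringify(subjectsById["Jazz Age"].ancestors));
```

The trade-off is that moving a subject to a new parent requires rewriting the ancestors of all of its descendants.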
EXERCISE #3
Allow users to borrow library books
User sends a loan request
Library approves or not
Requests time out after seven days
Approval process is asynchronous
Requests may be prioritized
EXERCISE #3: SOLUTION
Need to maintain order and state
Ensure that updates are atomic
// Create a new loan request
> db.loans.insert({
    _id: { borrower: "bob", book: ObjectId("…") },
    pending: false,
    approved: false,
    priority: 1
  });
// Find the highest priority request and mark as pending approval
request = db.loans.findAndModify({
  query: { pending: false },
  sort: { priority: -1 },
  update: { $set: { pending: true, started: new ISODate() }},
  new: true
});
EXERCISE #3: SOLUTION
Updated and added fields
Modified document was returned
{
  _id: { borrower: "bob", book: ObjectId("…") },
  pending: true,
  approved: false,
  priority: 1,
  started: ISODate("2012-10-11T22:09:42.542Z")
}
EXERCISE #3: SOLUTION
// Library approves the loan request
> db.loans.update(
    { _id: { borrower: "bob", book: ObjectId("…") }},
    { $set: { pending: false, approved: true }}
  );
EXERCISE #3: SOLUTION
// Request times out after seven days
limit = new Date();
limit.setDate(limit.getDate() - 7);
> db.loans.update(
    { pending: true, started: { $lt: limit }},
    { $set: { pending: false, approved: false }}
  );
EXERCISE #4
Allow users to recommend books
Users can recommend each book only once
Display a book's current recommendations
EXERCISE #4: SOLUTION
// db.recommendations (one document per user per book)
> db.recommendations.insert({
    book: ObjectId("…"),
    user: ObjectId("…")
  });
// Unique index ensures users can't recommend twice
> db.recommendations.ensureIndex(
    { book: 1, user: 1 },
    { unique: true }
  );
// Count the number of recommendations for a book
> db.recommendations.count({ book: ObjectId("…") });
EXERCISE #4: SOLUTION
MongoDB indexes do not store counts
Counts are computed via index scans
Denormalize totals on books
> db.books.update(
    { _id: ObjectId("…") },
    { $inc: { recommendations: 1 }}
  );
COMMON DESIGN PATTERNS
ONE-TO-ONE RELATIONSHIP
Let's pretend that authors only write one book.
LINKING
Either side, or both, can track the relationship.
> db.books.findOne()
{
  _id: 1,
  title: "The Great Gatsby",
  slug: "9781857150193-the-great-gatsby",
  author: 1,
  // Other fields follow…
}
> db.authors.findOne({ _id: 1 })
{
  _id: 1,
  first_name: "F. Scott",
  last_name: "Fitzgerald",
  book: 1
}
EMBEDDED OBJECT
> db.books.findOne()
{
  _id: 1,
  title: "The Great Gatsby",
  slug: "9781857150193-the-great-gatsby",
  author: {
    first_name: "F. Scott",
    last_name: "Fitzgerald"
  }
  // Other fields follow…
}
ONE-TO-MANY RELATIONSHIP
In reality, authors may write multiple books.
ARRAY OF IDS
The "one" side tracks the relationship.
Flexible and space-efficient
Additional query needed for non-ID lookups
> db.authors.findOne()
{
  _id: 1,
  first_name: "F. Scott",
  last_name: "Fitzgerald",
  books: [1, 3, 20]
}
SINGLE FIELD WITH ID
The "many" side tracks the relationship.
> db.books.find({ author: 1 })
{
  _id: 1,
  title: "The Great Gatsby",
  slug: "9781857150193-the-great-gatsby",
  author: 1,
  // Other fields follow…
}
{
  _id: 3,
  title: "This Side of Paradise",
  slug: "9780679447238-this-side-of-paradise",
  author: 1,
  // Other fields follow…
}
ARRAY OF OBJECTS
Use the $slice operator to return a subset of books
> db.authors.findOne()
{
  _id: 1,
  first_name: "F. Scott",
  last_name: "Fitzgerald",
  books: [
    { _id: 1, title: "The Great Gatsby" },
    { _id: 3, title: "This Side of Paradise" }
  ]
  // Other fields follow…
}
MANY-TO-MANY RELATIONSHIP
Some books may also have co-authors.
ARRAY OF IDS ON BOTH SIDES
> db.books.findOne()
{
  _id: 1,
  title: "The Great Gatsby",
  authors: [1, 5]
  // Other fields follow…
}
> db.authors.findOne()
{
  _id: 1,
  first_name: "F. Scott",
  last_name: "Fitzgerald",
  books: [1, 3, 20]
}
ARRAY OF IDS ON BOTH SIDES
Query for all books by a given author
> db.books.find({ authors: 1 });
Query for all authors of a given book
> db.authors.find({ books: 1 });
ARRAY OF IDS ON ONE SIDE
> db.books.findOne()
{
  _id: 1,
  title: "The Great Gatsby",
  authors: [1, 5]
  // Other fields follow…
}
> db.authors.find({ _id: { $in: [1, 5] }})
{
  _id: 1,
  first_name: "F. Scott",
  last_name: "Fitzgerald"
}
{
  _id: 5,
  first_name: "Unknown",
  last_name: "Co-author"
}
ARRAY OF IDS ON ONE SIDE
Query for all books by a given author
> db.books.find({ authors: 1 });
Query for all authors of a given book
book = db.books.findOne(
  { title: "The Great Gatsby" },
  { authors: 1 }
);
db.authors.find({ _id: { $in: book.authors }});
EXERCISE #5
Tracking time series data
Graph recommendations per unit of time
Count by: day, hour, minute
EXERCISE #5: SOLUTION A
// db.rec_ts (time series buckets, hour and minute sub-docs)
> db.rec_ts.insert({
    book: ObjectId("…"),
    day: ISODate("2012-10-11T00:00:00.000Z"),
    total: 0,
    hour: { "0": 0, "1": 0, /* … */ "23": 0 },
    minute: { "0": 0, "1": 0, /* … */ "1439": 0 }
  });
// Record a recommendation created one minute before midnight
> db.rec_ts.update(
    { book: ObjectId("…"), day: ISODate("2012-10-11T00:00:00.000Z") },
    { $inc: { total: 1, "hour.23": 1, "minute.1439": 1 }}
  );
BSON STORAGE
Sequence of key/value pairs
Not a hash map
Optimized to scan quickly
minute: [0] [1] … [1439]
What is the cost of updating the minute before midnight?
BSON STORAGE
We can skip sub-documents
hour 0: [0] [1] … [59]
…
hour 23: [1380] … [1439]
How could this change the schema?
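To make the cost question concrete, here is a deliberately simplified model in plain JavaScript. It assumes that reaching a key means stepping over every earlier key at the same level, and that a whole sub-document can be skipped in one step; the function names are hypothetical and the numbers are a counting exercise, not a benchmark.

```javascript
// Simplified cost model (an assumption, not a measurement): count how many
// keys or sub-documents precede the target when scanning sequentially.
function flatScanCost(minuteOfDay) {
  // Flat layout: keys "0".."1439" in a single sub-document.
  return minuteOfDay;
}

function nestedScanCost(minuteOfDay) {
  // Nested layout: 24 hour sub-documents, each holding 60 minute keys.
  const hour = Math.floor(minuteOfDay / 60);
  const minuteWithinHour = minuteOfDay % 60;
  // Skip the earlier hour sub-documents, then the earlier minutes in this hour.
  return hour + minuteWithinHour;
}

console.log(flatScanCost(1439));   // 1439
console.log(nestedScanCost(1439)); // 82
```

Under this model the worst case drops from 1439 skipped keys to 23 skipped hours plus 59 skipped minutes.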
EXERCISE #5: SOLUTION B
// db.rec_ts (time series buckets, each hour a sub-doc)
> db.rec_ts.insert({
    book: ObjectId("…"),
    day: ISODate("2012-10-11T00:00:00.000Z"),
    total: 148,
    hour: {
      "0": { total: 7, "0": 0, /* … */ "59": 2 },
      "1": { total: 3, "60": 1, /* … */ "119": 0 },
      // Other hours…
      "23": { total: 12, "1380": 0, /* … */ "1439": 3 }
    }
  });
// Record a recommendation created one minute before midnight
> db.rec_ts.update(
    { book: ObjectId("…"), day: ISODate("2012-10-11T00:00:00.000Z") },
    { $inc: { total: 1, "hour.23.total": 1, "hour.23.1439": 1 }}
  );
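The field paths in the `$inc` above can be derived from the event's timestamp. A minimal sketch, assuming minute keys are numbered 0 to 1439 across the whole day (as in Solution B) and that buckets are keyed by UTC day; `recommendationInc` is a hypothetical helper name.

```javascript
// Hypothetical helper: build the $inc document for Solution B's layout.
// Assumes minute keys are numbered 0..1439 across the whole day (UTC).
function recommendationInc(createdAt) {
  const hour = createdAt.getUTCHours();
  const minuteOfDay = hour * 60 + createdAt.getUTCMinutes();
  const inc = { total: 1 };
  inc["hour." + hour + ".total"] = 1;
  inc["hour." + hour + "." + minuteOfDay] = 1;
  return { $inc: inc };
}

const update = recommendationInc(new Date("2012-10-11T23:59:30.000Z"));
console.log(JSON.stringify(update));
// {"$inc":{"total":1,"hour.23.total":1,"hour.23.1439":1}}
```

The returned object could be passed as the second argument of the update call shown above.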
SINGLE-COLLECTION INHERITANCE
Take advantage of MongoDB's features
Documents need not all have the same fields
Sparsely index only present fields
SCHEMA FLEXIBILITY
Find all books that are part of a series
> db.books.findOne()
{
  _id: 47,
  title: "The Wizard Chase",
  type: "series",
  series_title: "The Wizard's Trilogy",
  volume: 2
  // Other fields follow…
}
> db.books.find({ type: "series" });
> db.books.find({ series_title: { $exists: true }});
> db.books.find({ volume: { $gt: 0 }});
INDEX ONLY PRESENT FIELDS
Documents without these fields will not be indexed.
> db.books.ensureIndex({ series_title: 1 }, { sparse: true })
> db.books.ensureIndex({ volume: 1 }, { sparse: true })
EXERCISE #6
Users can recommend at most 10 books
EXERCISE #6: SOLUTION
// db.user_recs (track user's remaining and given recommendations)
> db.user_recs.insert({
    _id: "bob",
    remaining: 8,
    books: [3, 10]
  });
// Record a recommendation if possible
> db.user_recs.update(
    { _id: "bob", remaining: { $gt: 0 }, books: { $ne: 4 }},
    { $inc: { remaining: -1 }, $push: { books: 4 }}
  );
EXERCISE #6: SOLUTION
One less unassigned recommendation remaining
Newly-recommended book is now linked
> db.user_recs.findOne()
{
  _id: "bob",
  remaining: 7,
  books: [3, 10, 4]
}
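The guard lives in the query part of the update: if either condition fails, no document matches and nothing changes. The plain JavaScript sketch below models that behaviour on an in-memory object; `recommend` is a hypothetical function for illustration, not a driver call.

```javascript
// In-memory model of the conditional update (hypothetical helper).
// Mirrors the query { remaining: { $gt: 0 }, books: { $ne: bookId } }.
function recommend(userRecs, bookId) {
  const matches = userRecs.remaining > 0 && !userRecs.books.includes(bookId);
  if (!matches) {
    return false; // no document matched, so nothing is modified
  }
  userRecs.remaining -= 1;     // $inc: { remaining: -1 }
  userRecs.books.push(bookId); // $push: { books: bookId }
  return true;
}

const bob = { _id: "bob", remaining: 8, books: [3, 10] };
console.log(recommend(bob, 4)); // true
console.log(recommend(bob, 4)); // false, book 4 was already recommended
console.log(bob.remaining);     // 7
```

In MongoDB the check and the modification happen as one atomic operation on the document, which the in-memory version cannot demonstrate.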
EXERCISE #7
Statistic buckets
Each book has a listing page in our application
Record referring website domains for each book
Count each domain independently
EXERCISE #7: SOLUTION A
> db.book_refs.findOne()
{
  book: 1,
  referrers: [
    { domain: "google.com", count: 4 },
    { domain: "yahoo.com", count: 1 }
  ]
}
> db.book_refs.update(
    { book: 1, "referrers.domain": "google.com" },
    { $inc: { "referrers.$.count": 1 }}
  );
EXERCISE #7: SOLUTION A
Update the position of the first matched element.
What if a new referring website is used?
> db.book_refs.update(
    { book: 1, "referrers.domain": "google.com" },
    { $inc: { "referrers.$.count": 1 }}
  );
> db.book_refs.findOne()
{
  book: 1,
  referrers: [
    { domain: "google.com", count: 5 },
    { domain: "yahoo.com", count: 1 }
  ]
}
EXERCISE #7: SOLUTION B
Replace dots with underscores for key names
Increment to add a new referring website
Upsert in case this is the book's first referrer
> db.book_refs.findOne()
{
  book: 1,
  referrers: {
    "google_com": 5,
    "yahoo_com": 1
  }
}
> db.book_refs.update(
    { book: 1 },
    { $inc: { "referrers.bing_com": 1 }},
    true
  );
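The key substitution can be wrapped in a small helper so that every caller builds the field path the same way. A sketch in plain JavaScript; `referrerUpdate` and the shape of its return value are assumptions made for illustration.

```javascript
// Hypothetical helper: build the upsert arguments for one referring domain.
function referrerUpdate(bookId, domain) {
  // Dots would be read as path separators, so swap them for underscores.
  const key = domain.replace(/\./g, "_");
  const inc = {};
  inc["referrers." + key] = 1;
  return { query: { book: bookId }, update: { $inc: inc }, upsert: true };
}

const args = referrerUpdate(1, "bing.com");
console.log(JSON.stringify(args.update)); // {"$inc":{"referrers.bing_com":1}}
```

The three returned values line up with the three arguments of the update call above: query, modifier, and the upsert flag.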
SHARDING
SHARDING
Ad-hoc partitioning
Consistent hashing
  Amazon DynamoDB
Range based partitioning
  Google BigTable
  Yahoo! PNUTS
  MongoDB
SHARDING IN MONGODB
Automated management
Range based partitioning
Convert to sharded system with no downtime
Fully consistent
SHARDING A COLLECTION
Keys range from −∞ to +∞
Ranges are stored as chunks
> db.runCommand({ addshard : "shard1.example.com" });
> db.runCommand({ shardCollection: "library.books", key: { _id : 1}});
SHARDING DATA BY CHUNKS
[−∞, +∞)
→ [−∞, 40) [40, +∞)
→ [−∞, 40) [40, 50) [50, +∞)
Ranges are split into chunks as data is inserted
> db.books.save({ _id: 35, title: "Call of the Wild" });
> db.books.save({ _id: 40, title: "Tropic of Cancer" });
> db.books.save({ _id: 45, title: "The Jungle" });
> db.books.save({ _id: 50, title: "Of Mice and Men" });
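To see which chunk each of these inserts lands in, the ranges can be modelled with a list of split points. This is a simplified sketch in plain JavaScript, not MongoDB's routing code; `chunkFor` is a hypothetical helper and the split points `[40, 50]` are taken from the ranges above.

```javascript
// Simplified model of range-based chunks (not MongoDB's routing code).
// Split points [40, 50] describe [-inf, 40), [40, 50), [50, +inf).
function chunkFor(key, splitPoints) {
  let lower = -Infinity;
  for (const point of splitPoints) {
    if (key < point) {
      return { min: lower, max: point };
    }
    lower = point;
  }
  return { min: lower, max: Infinity };
}

for (const id of [35, 40, 45, 50]) {
  const chunk = chunkFor(id, [40, 50]);
  console.log(id + " -> [" + chunk.min + ", " + chunk.max + ")");
}
```

Because each range includes its lower bound and excludes its upper bound, `_id: 40` belongs to `[40, 50)` and `_id: 50` to `[50, +∞)`.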
ADDING NEW SHARDS
shard1: [−∞, 40) [40, 50) [50, 60) [60, +∞)
ADDING NEW SHARDS
shard1: [−∞, 40) [50, 60)
shard2: [40, 50) [60, +∞)
Chunks are migrated to balance shards
> db.runCommand({ addshard : "shard2.example.com" });
ADDING NEW SHARDS
shard1: [−∞, 40)
shard2: [40, 50) [60, +∞)
shard3: [50, 60)
> db.runCommand({ addshard : "shard3.example.com" });
SHARDING COMPONENTS
mongos
Config servers
Shards
  mongod
  Replica sets
SHARDED WRITES
Inserts
  Shard key required
  Routed
Updates and removes
  Shard key optional
  May be routed or scattered
SHARDED READS
Queries
  By shard key: routed
  Without shard key: scatter/gather
Sorted queries
  By shard key: routed in order
  Without shard key: distributed merge sort
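The routed versus scatter/gather distinction can be sketched as a function that decides which shards a query must visit. This is a simplified model in plain JavaScript that only handles equality matches on the shard key; `targetShards` is hypothetical, and the chunk layout reuses the three-shard distribution shown earlier.

```javascript
// Simplified model: decide which shards a query must visit.
// `chunks` maps key ranges to shard names; all names here are illustrative.
function targetShards(query, shardKey, chunks) {
  const allShards = Array.from(new Set(chunks.map((c) => c.shard)));
  if (!(shardKey in query)) {
    return allShards; // scatter/gather: every shard is asked
  }
  const value = query[shardKey];
  const owner = chunks.find((c) => value >= c.min && value < c.max);
  return [owner.shard]; // routed: only the owning shard is asked
}

const chunks = [
  { min: -Infinity, max: 40, shard: "shard1" },
  { min: 40, max: 50, shard: "shard2" },
  { min: 50, max: 60, shard: "shard3" },
  { min: 60, max: Infinity, shard: "shard2" }
];
console.log(targetShards({ _id: 45 }, "_id", chunks));
console.log(targetShards({ title: "The Jungle" }, "_id", chunks));
```

The first query names the shard key and reaches one shard; the second does not and has to be sent to all three.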
EXERCISE #8
Users can upload images for books
images
  image_id: ???
  data: binary
The collection will be sharded by image_id.
What should image_id be?
EXERCISE #8: SOLUTIONS
What's the best shard key for our use case?
Auto-increment (ObjectId): right-balanced access
MD5 of data: random access
Time (e.g. month) and MD5: segmented access
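As one possible reading of the "time and MD5" option, the sketch below builds an `image_id` from a month prefix followed by the MD5 of the image data, using Node.js's built-in `crypto` module. The `segmentedImageId` name, the `YYYYMM-<md5>` layout, and the use of UTC are all assumptions made for illustration.

```javascript
const crypto = require("crypto");

// Hypothetical helper: image_id made of a month prefix plus the MD5 of the data.
function segmentedImageId(data, uploadedAt) {
  const year = uploadedAt.getUTCFullYear();
  const month = String(uploadedAt.getUTCMonth() + 1).padStart(2, "0");
  const md5 = crypto.createHash("md5").update(data).digest("hex");
  return year + month + "-" + md5;
}

console.log(segmentedImageId(Buffer.from("abc"), new Date("2012-10-11T00:00:00Z")));
```

Ids from the same month share a prefix and therefore a key range, while the hash spreads writes within that range.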
SUMMARY
Schema design is different in MongoDB.
Basic data design principles apply.
It's about your application.
It's about your data and how it's used.
It's about the entire lifetime of your application.
THANKS!
QUESTIONS?