Data Modeling for Performance Mongo Boulder January 21, 2010 Michael Dwan Snapjoy

My talk for Mongo Boulder on data modeling.

Page 1: Data Modeling for Performance

Data Modeling for Performance

Mongo Boulder
January 21, 2010

Michael Dwan
Snapjoy

Page 2: Data Modeling for Performance

i’m michael dwan
@michaeldwan on the twitter

Page 3: Data Modeling for Performance

the project
Company X

Page 4: Data Modeling for Performance

application spec

• find business details (web + api)

• search by category/keyword + geo (web + api)

• update (api)

Page 5: Data Modeling for Performance

why is this interesting?

15,000,000 businesses

30,000 partners

100,000 geo areas

2,300 categories

2,000,000 requests daily

24,000,000 urls in sitemap

100,000,000 tags

Page 6: Data Modeling for Performance

updates

• infrequent changes

• monthly updates w/ 12M monthly changes

• “zero downtime”

Page 7: Data Modeling for Performance

the problem
mo’ data, mo’ problems

Page 8: Data Modeling for Performance

complexity

Page 9: Data Modeling for Performance

businesses

phone_numbers

businesses_phone_numbers

cities

states

zips

neighborhoods

businesses_neighborhoods

tags

taggings

assets

users

categories

categorizations

providers mappings

Page 10: Data Modeling for Performance

architecture

Page 11: Data Modeling for Performance

read performance

Page 12: Data Modeling for Performance

solr

downtime

Page 13: Data Modeling for Performance

solr getting fussy

Page 14: Data Modeling for Performance

migrations

downtime

Page 15: Data Modeling for Performance

the solution

Page 16: Data Modeling for Performance

> gem install acts_as_web_scale

Page 19: Data Modeling for Performance

a business...

{
  "_id" : ObjectId("4ce838ef4a882579960001b9"),
  "name" : "Acme Glass Co",
  "tagline" : "Your trusty glass hole",
  "description" : "Glass repair...",
  "hours" : "Mon Fri 8 5",
  "url" : "http://acmeglasshole.biz"
}

Page 21: Data Modeling for Performance

a business... has many phone numbers

{
  "_id" : ObjectId("4ce838ef4a882579960001b9"),
  "name" : "Acme Glass Co",
  "tagline" : "Your trusty glass hole",
  "description" : "Glass repair...",
  "hours" : "Mon Fri 8 5",
  "url" : "http://acmeglasshole.biz",
  "phone_numbers" : [ "5035550091", "8005555456" ]
}

Page 23: Data Modeling for Performance

a business... has coordinates

{
  "_id" : ObjectId("4ce838ef4a882579960001b9"),
  "name" : "Acme Glass Co",
  "tagline" : "Your trusty glass hole",
  "description" : "Glass repair...",
  "hours" : "Mon Fri 8 5",
  "url" : "http://acmeglasshole.biz",
  "phone_numbers" : [ "5035550091", "8005555456" ],
  "coordinates" : [ 45.559294, -122.644053 ]
}

Page 25: Data Modeling for Performance

a business... has many tags

{
  "_id" : ObjectId("4ce838ef4a882579960001b9"),
  "name" : "Acme Glass Co",
  "tagline" : "Your trusty glass hole",
  "description" : "Glass repair...",
  "hours" : "Mon Fri 8 5",
  "url" : "http://acmeglasshole.biz",
  "phone_numbers" : [ "5035550091", "8005555456" ],
  "coordinates" : [ 45.559294, -122.644053 ],
  "tags" : [ "glass", "mirrors", "flat glass" ]
}

Page 27: Data Modeling for Performance

a business... has an address

{
  "_id" : ObjectId("4ce838ef4a882579960001b9"),
  "name" : "Acme Glass Co",
  "tagline" : "Your trusty glass hole",
  "description" : "Glass repair...",
  "hours" : "Mon Fri 8 5",
  "url" : "http://acmeglasshole.biz",
  "phone_numbers" : [ "5035550091", "8005555456" ],
  "coordinates" : [ 45.559294, -122.644053 ],
  "tags" : [ "glass", "mirrors", "flat glass" ],
  "location" : {
    "street_address" : "2035 NE Alberta St"
  }
}

Page 28: Data Modeling for Performance

belongs to?

Page 29: Data Modeling for Performance

a state

{
  "_id" : ObjectId("4ce82937961552247900000f"),
  "name" : "Illinois",
  "slug" : "il",
  ...
}

Page 32: Data Modeling for Performance

a business... belongs to a state

{
  "_id" : ObjectId("4ce838ef4a882579960001b9"),
  "name" : "Acme Glass Co",
  "tagline" : "Your trusty glass hole",
  "description" : "Glass repair...",
  "hours" : "Mon Fri 8 5",
  "url" : "http://acmeglasshole.biz",
  "phone_numbers" : [ "5035550091", "8005555456" ],
  "coordinates" : [ 45.559294, -122.644053 ],
  "tags" : [ "glass", "mirrors", "flat glass" ],
  "location" : {
    "street_address" : "2035 NE Alberta St",
    "state" : {
      "_id" : ObjectId("4ce829379615522479000026"),
      "meta" : { "slug" : "or" },
      "display_name" : "Oregon"
    }
  }
}

Page 34: Data Modeling for Performance

a business... belongs to a city

{
  "_id" : ObjectId("4ce838ef4a882579960001b9"),
  "name" : "Acme Glass Co",
  "tagline" : "Your trusty glass hole",
  "description" : "Glass repair...",
  "hours" : "Mon Fri 8 5",
  "url" : "http://acmeglasshole.biz",
  "phone_numbers" : [ "5035550091", "8005555456" ],
  "coordinates" : [ 45.559294, -122.644053 ],
  "tags" : [ "glass", "mirrors", "flat glass" ],
  "location" : {
    "street_address" : "2035 NE Alberta St",
    "state" : {
      "_id" : ObjectId("4ce829379615522479000026"),
      "meta" : { "slug" : "or" },
      "display_name" : "Oregon"
    },
    "city" : {
      "_id" : ObjectId("4ce82abdd3dfaa10f8006faa"),
      "meta" : { "slug" : "portland" },
      "display_name" : "Portland, OR"
    }
  }
}

Page 36: Data Modeling for Performance

a business... belongs to a zip code

{
  "_id" : ObjectId("4ce838ef4a882579960001b9"),
  "name" : "Acme Glass Co",
  "tagline" : "Your trusty glass hole",
  "description" : "Glass repair...",
  "hours" : "Mon Fri 8 5",
  "url" : "http://acmeglasshole.biz",
  "phone_numbers" : [ "5035550091", "8005555456" ],
  "coordinates" : [ 45.559294, -122.644053 ],
  "tags" : [ "glass", "mirrors", "flat glass" ],
  "location" : {
    "street_address" : "2035 NE Alberta St",
    "state" : {
      "_id" : ObjectId("4ce829379615522479000026"),
      "meta" : { "slug" : "or" },
      "display_name" : "Oregon"
    },
    "city" : {
      "_id" : ObjectId("4ce82abdd3dfaa10f8006faa"),
      "meta" : { "slug" : "portland" },
      "display_name" : "Portland, OR"
    },
    "zip" : {
      "_id" : ObjectId("4ce82c29d3dfaa116b006dfa"),
      "display_name" : "97211"
    }
  }
}

Page 37: Data Modeling for Performance

many-to-many?

Page 38: Data Modeling for Performance

a category

{
  "_id" : ObjectId("4ce82e64d3dfaa16360014eb"),
  "name" : "Auto Glass",
  "slug" : "3063-auto-glass",
  "tags" : [ "windshields" ],
  ...
}

Page 41: Data Modeling for Performance

a business... belongs to many categories

{
  "_id" : ObjectId("4ce838ef4a882579960001b9"),
  "name" : "Acme Glass Co",
  "tagline" : "Your trusty glass hole",
  "description" : "Glass repair...",
  "hours" : "Mon Fri 8 5",
  "url" : "http://acmeglasshole.biz",
  "phone_numbers" : [ "5035550091", "8005555456" ],
  "coordinates" : [ 45.559294, -122.644053 ],
  "tags" : [ "glass", "mirrors", "flat glass" ],
  "location" : {
    "street_address" : "2035 NE Alberta St",
    "state" : {
      "_id" : ObjectId("4ce829379615522479000026"),
      "meta" : { "slug" : "or" },
      "display_name" : "Oregon"
    },
    "city" : {
      "_id" : ObjectId("4ce82abdd3dfaa10f8006faa"),
      "meta" : { "slug" : "portland" },
      "display_name" : "Portland, OR"
    },
    "zip" : {
      "_id" : ObjectId("4ce82c29d3dfaa116b006dfa"),
      "display_name" : "97211"
    }
  },
  "categories" : [
    {
      "_id" : ObjectId("4ce82e50d3dfaa16360004f2"),
      "meta" : {
        "slug" : "282-glass",
        "tags" : [ "windows" ]
      },
      "display_name" : "Glass"
    },
    {
      "_id" : ObjectId("4ce82e64d3dfaa16360014eb"),
      "meta" : {
        "slug" : "3063-auto-glass",
        "tags" : [ "windshields" ]
      },
      "display_name" : "Auto Glass"
    }
  ]
}
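The embedded category entries carry only a few fields copied from the full category document. A minimal sketch of that copy step, in plain JavaScript so it runs outside the mongo shell. The helper name embedCategory is hypothetical, and plain strings stand in for ObjectId values; the slides do not show how the embedding was actually done.

```javascript
// Hypothetical helper: copy the fields a business needs from a full
// category document into the embedded shape shown on the slide.
function embedCategory(category) {
  return {
    _id: category._id,
    meta: { slug: category.slug, tags: category.tags.slice() },
    display_name: category.name
  };
}

// Plain strings stand in for ObjectId values (an assumption for portability).
var autoGlass = {
  _id: "4ce82e64d3dfaa16360014eb",
  name: "Auto Glass",
  slug: "3063-auto-glass",
  tags: ["windshields"]
};

var business = { name: "Acme Glass Co", categories: [] };
business.categories.push(embedCategory(autoGlass));

console.log(JSON.stringify(business.categories, null, 2));
```

The trade-off is the usual one for denormalized data: reads need no join, but renaming a category means rewriting every business that embeds it.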

Page 42: Data Modeling for Performance

queries & indexes
know what you want

Page 43: Data Modeling for Performance

#1 find a business
I want *that* one

Page 44: Data Modeling for Performance

find a business

// single business
db.businesses.findOne({
  _id: ObjectId("4ce838ef4a882579960001b9")
})

Page 45: Data Modeling for Performance

#2 find by location
Businesses in San Francisco, CA

Page 48: Data Modeling for Performance

find businesses by state/city/zip

// find all within state
db.businesses.find({
  "location.state._id": ObjectId("4ce82937961552247900000f")
})

// find all within city
db.businesses.find({
  "location.city._id": ObjectId("4ce82aa0d3dfaa10f8004a95")
})

// find all within zip
db.businesses.find({
  "location.zip._id": ObjectId("4ce82b5ed3dfaa116b0026f0")
})

Page 49: Data Modeling for Performance

indexes

// the indexes
db.businesses.ensureIndex({"location.city._id": 1})
db.businesses.ensureIndex({"location.zip._id": 1})

skip “location.state._id” -- only 51 possibilities

1.5GB each

Page 50: Data Modeling for Performance

#3 find by category
Businesses in the Auto Repair category

Page 51: Data Modeling for Performance

businesses by category

// find by category id
db.businesses.find({
  "categories._id": ObjectId("4ce82e50d3dfaa16360004f2")
})

// the index
db.businesses.ensureIndex({
  "categories._id": 1
})

Page 52: Data Modeling for Performance

#4 - find by category + location
Businesses in the Plumbing category in Chicago, IL

Page 53: Data Modeling for Performance

businesses by category + city

// find by city id and category id
db.businesses.find({
  "location.city._id": ObjectId("4ce82aa0d3dfaa10f8004a95"),
  "categories._id": ObjectId("4ce82e50d3dfaa16360004f2")
})

Page 54: Data Modeling for Performance

which index should we use?

// city id
{"location.city._id":1}

~ or ~

// category id
{"categories._id":1}

answer: both suck

we need a compound index

Page 55: Data Modeling for Performance

which order?

db.businesses.ensureIndex({
  "location.city._id" : 1,
  "categories._id" : 1
})

~ or ~

db.businesses.ensureIndex({
  "categories._id" : 1,
  "location.city._id" : 1
})

answer: cities → categories

35,000 cities & 2,500 categories

create one for zip codes and categories too!

Page 56: Data Modeling for Performance

don’t we have 2 indexes on city id?

answer: yes

{"location.city._id" : 1}
{"location.city._id" : 1, "categories._id" : 1}

db.businesses.dropIndex("location.city._id_1")
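Dropping the single-field city index is safe because a compound index can also serve queries on its leading field alone. A toy model of why, in plain JavaScript: the "index" is a sorted array of city + category keys, and both query shapes become one contiguous range scan. This is a sketch of the idea only, not MongoDB's actual storage format, and every id and business name in it is made up.

```javascript
// Build a sortable key: city first, then category, each ended by a
// NUL character so "pizza" cannot match as a prefix of "pizzeria".
function keyOf(city, category) {
  return city + "\u0000" + category + "\u0000";
}

// Toy compound index on (city, category); all values are invented.
var index = [
  ["portland", "glass", "acme-glass"],
  ["portland", "auto-glass", "acme-glass"],
  ["portland", "pizza", "slice-shop"],
  ["chicago", "pizza", "deep-dish-co"],
  ["chicago", "plumbing", "pipe-pros"]
].map(function (row) {
  return { key: keyOf(row[0], row[1]), business: row[2] };
}).sort(function (a, b) {
  return a.key < b.key ? -1 : a.key > b.key ? 1 : 0;
});

// Binary search: position of the first entry whose key is >= target.
function lowerBound(target) {
  var lo = 0, hi = index.length;
  while (lo < hi) {
    var mid = (lo + hi) >> 1;
    if (index[mid].key < target) lo = mid + 1; else hi = mid;
  }
  return lo;
}

// Walk the contiguous run of entries whose key starts with the prefix.
function scan(prefix) {
  var results = [];
  for (var i = lowerBound(prefix); i < index.length; i++) {
    if (index[i].key.indexOf(prefix) !== 0) break;
    results.push(index[i].business);
  }
  return results;
}

var inChicago = scan("chicago\u0000");              // city only
var chicagoPizza = scan(keyOf("chicago", "pizza")); // city + category
console.log(inChicago, chicagoPizza);
```

A query on category alone gets no such contiguous range from this ordering, which is why the separate "categories._id" index stays.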

Page 57: Data Modeling for Performance

#5 - find by keyword
“something awesome” in Boulder, CO

Page 58: Data Modeling for Performance

find businesses in city by keyword

{
  "_id" : ObjectId("4ce838ef4a882579960001b9"),
  "name" : "Acme Glass Co",
  "keywords" : [ "glass", "repair", "acme", ... ]
}

db.businesses.ensureIndex({
  "location.city._id": 1,
  "keywords": 1
})

db.businesses.find({
  "location.city._id": ObjectId("4ce82aa0d3dfaa10f8004a95"),
  "keywords": /glass/i
})
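The slides do not say how the keywords array was built. One plausible way, sketched in plain JavaScript: lowercase the text fields, split on anything that is not a letter or digit, and drop duplicates. The function name extractKeywords and the one-character cutoff are assumptions for illustration.

```javascript
// Hypothetical keyword builder: lowercase, split on non-alphanumerics,
// skip one-character fragments, and keep each word once.
function extractKeywords(fields) {
  var seen = Object.create(null); // no inherited keys such as "constructor"
  var keywords = [];
  fields.join(" ").toLowerCase().split(/[^a-z0-9]+/).forEach(function (word) {
    if (word.length > 1 && !seen[word]) {
      seen[word] = true;
      keywords.push(word);
    }
  });
  return keywords;
}

var keywords = extractKeywords(["Acme Glass Co", "Glass repair..."]);
console.log(keywords); // [ 'acme', 'glass', 'co', 'repair' ]
```

Storing keywords already lowercased would allow an exact match such as "keywords": "glass" in place of the case-insensitive regex, at the cost of losing substring matches.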

Page 59: Data Modeling for Performance

chat with Kyle Banker

me: we’re switching from postgres+solr to mongo

kyle: oh wow, you can replace solr with mongo?

me: with some creativity

kyle: seems like it’d still be hard to get just right

me: it works well

kyle: gotcha

Page 60: Data Modeling for Performance

i was wrong, kyle was right

Page 61: Data Modeling for Performance

I’ll never leave you again

...until MongoDB supports full text later this year :)

Page 62: Data Modeling for Performance

aggregation
map/reduce to the rescue

Page 63: Data Modeling for Performance

sitemaps
big list of every url

Page 64: Data Modeling for Performance

sitemaps

• xml files containing each unique url ~ 24M

• 50,000 urls per file, about 500 files

• urls are generated from live data

• http://companyx.com/sitemaps/1.xml

Page 65: Data Modeling for Performance

partition by consistent hash

>> "hello!".hash % 6 #=> 5

>> "/ny/new-york/c/apartments".hash % 6 #=> 5

returns an integer from 0 up to (but not including) the number specified
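The same idea in plain JavaScript, as a sketch. JavaScript strings have no built-in hash method, so this swaps in the 32-bit FNV-1a hash in place of Ruby's String#hash; the function names are made up, and the partition numbers it produces will not match the Ruby output above.

```javascript
// 32-bit FNV-1a over the string's UTF-16 code units
// (identical to hashing bytes for ASCII-only urls).
function fnv1a(str) {
  var hash = 0x811c9dc5;                       // FNV offset basis
  for (var i = 0; i < str.length; i++) {
    hash ^= str.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0;  // multiply by FNV prime, keep unsigned
  }
  return hash >>> 0;
}

// Map a url to one of `partitions` buckets, numbered 0 .. partitions - 1.
function partitionFor(url, partitions) {
  return fnv1a(url) % partitions;
}

console.log(partitionFor("/ny/new-york/c/apartments", 6));
```

What matters is that the hash is deterministic: the same url must land in the same partition on every run and on every machine, so each sitemap file keeps a stable set of urls.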

Page 66: Data Modeling for Performance

map/reduce

1. map each url in the site to a partition

2. reduce all partitions to a single document containing all urls in that partition

3. save to a permanent collection
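The three steps can be sketched in plain JavaScript without a database. This is a simulation of the technique, not the talk's actual map/reduce job: partition numbers are copied from the example on the map slide instead of being computed by a hash, and runMapReduce is a made-up stand-in for the server's grouping step.

```javascript
// Partition numbers taken from the map slide; a real job would hash each url.
var assigned = {
  "/il/chicago/c/pizza": 4,
  "/ny/new-york/c/apartments": 1,
  "/12347500-allstate-auto-insurance": 1,
  "/mn/savage/c/audio-visual-equipment": 1
};

// Step 1: emit one (partition, value) pair per url.
function mapUrl(url, emit) {
  emit(assigned[url], { total: 1, urls: [url] });
}

// Step 2: merge values for one partition. The output has the same shape
// as the input, so partial results can be reduced again.
function reduce(key, values) {
  var result = { total: 0, urls: [] };
  values.forEach(function (value) {
    result.total += value.total;
    result.urls = result.urls.concat(value.urls);
  });
  result.urls.sort();
  return result;
}

// Step 3 stand-in: group emitted values by key and build one document each.
function runMapReduce(urls) {
  var groups = {};
  urls.forEach(function (url) {
    mapUrl(url, function (key, value) {
      (groups[key] = groups[key] || []).push(value);
    });
  });
  return Object.keys(groups).map(function (key) {
    return { _id: Number(key), value: reduce(Number(key), groups[key]) };
  });
}

var sitemaps = runMapReduce(Object.keys(assigned));
console.log(JSON.stringify(sitemaps[0], null, 2));
```

Keeping the reduce output in the same shape as the emitted values is what lets MongoDB call reduce more than once for the same key.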

Page 67: Data Modeling for Performance

map

url                                        partition
/il/chicago/c/pizza                        4
/ny/new-york/c/apartments                  1
/nd/rugby/c/apartments                     6
/14076500-bayside-marina                   2
/13401000-comtrak-logistics-inc            3
/12347500-allstate-auto-insurance          1
/il/downers-grove/c/computer-web-design    6
/1009500-heidelberg-lodges                 5
/mn/redwood-falls/c/food-service           4
/14077000-bank-of-america                  5
/mn/savage/c/audio-visual-equipment        1
...

partitions: 1, 2, 3, 4, 5, 6

Page 68: Data Modeling for Performance

reduce

{
  "total" : 2,
  "urls" : [
    "/12347500-allstate-auto-insurance",
    "/ny/new-york/c/apartments"
  ]
}

{
  "total" : 1,
  "urls" : [
    "/mn/savage/c/audio-visual-equipment"
  ]
}

{
  "_id" : 1,
  "value" : {
    "total" : 3,
    "urls" : [
      "/12347500-allstate-auto-insurance",
      "/mn/savage/c/audio-visual-equipment",
      "/ny/new-york/c/apartments"
    ]
  }
}

Page 69: Data Modeling for Performance

usage

db.sitemaps.findOne({_id: 1}).value.urls

[
  "/12347500-allstate-auto-insurance",
  "/mn/savage/c/audio-visual-equipment",
  "/ny/new-york/c/apartments"
]

Page 70: Data Modeling for Performance

wrap up

Page 71: Data Modeling for Performance

2 months later

115ms average response times

Page 72: Data Modeling for Performance

thank you
@michaeldwan