In order to build robust, multi-tenant, highly available storage services that meet the business's SLAs, your databases have to be sharded. But if your service has to scale continuously through incremental additions of storage, without service interruption or human intervention, basic static sharding is not enough. At eBay, we are building MStore to solve this problem, with MongoDB as the storage engine. In this presentation, we dive into the key design concepts of this solution.
Elastic metadata store for eBay media platform
Yuri Finkelstein, architect, eBay Global Platform Services
MongoDB SF 2014
Introduction
• eBay items have media:
• pictures, 360 views, overlays, video, etc.
• binary content + metadata
• metadata is rich and is best modeled as a document
• 99% reads, 1% writes, ~1/100th deleted daily
• MongoDB is a reasonable fit
• But we need a service on top of it
What is a data service?
• Data service vs database instances
• SLA
• data lifecycle management automation instead of DBA excellence
• no downtime during hardware repair, maintenance
• no downtime as data grows and hardware is added
• multiple tenants
• tenants come and go, grow (and shrink) at different rates
• different tenant requirements for cross DC replication and data access latencies
What is wrong with this picture?
• Vertical scalability only
• Prone to coarse-grain outages
• No service model, no SLA
• Limited number of connections
• etc
[Diagram: applications from tenant A and tenant B, each with its own driver, connect directly to replicated DB instances]
Is MongoDB sharding the answer?
• Can scale out in theory, but at the time of expansion we either need downtime or are likely to breach the SLA
• Mongo chunks are logical
• migrating them causes a flurry of I/O, or is too slow to engage new hardware
• Still no service boundary
• Other problems mentioned earlier
[Diagram: applications connect through drivers to MongoS routers, which fan out to shards, each a replica set of DBs; a "new shard?" is being added]
On the effect of chunk migration: great slide by Kenny Gorman/Object Rocket
Buckets
• Need a smaller unit of data granularity: the bucket
• ~100 buckets per tenant to begin with
• bucket algebra (sketched after the diagram):
• create / delete
• split / merge
• compact
• move (to another RS)
[Diagram: one storage host runs MongoD processes from four replica sets (RS1–RS4), each holding many buckets, e.g. b1–b4, b25–b28, b49–b52, b73–b76]
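For illustration, the bucket algebra can be thought of as a small administrative interface. A minimal hypothetical sketch in Python; every method name and signature here is an assumption, not MStore's actual API:

```python
# Hypothetical bucket-algebra interface; names and signatures are
# illustrative assumptions, not MStore's actual API.
class BucketAdmin:
    def create(self, tenant: str) -> str: ...             # returns the new bucket's name
    def delete(self, bucket: str) -> None: ...
    def split(self, bucket: str) -> "tuple[str, str]": ...   # one bucket becomes two
    def merge(self, a: str, b: str) -> str: ...              # two buckets become one
    def compact(self, bucket: str) -> None: ...              # purge tombstoned/expired docs
    def move(self, bucket: str, target_rs: str) -> None: ... # relocate to another replica set
```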
_id => BucketName?
• Can be done in a number of different ways, based on the use case
• if range queries on _id are needed, use an “order-preserving partitioner”, e.g. as in HBase
• if access is by _id only, consistent hashing works well, e.g. Memcached, Cassandra (a minimal sketch follows)
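A minimal sketch of the consistent-hashing option in Python; the bucket names, virtual-node count, and hash function are illustrative assumptions, not MStore's actual scheme:

```python
import bisect
import hashlib

VNODES = 64  # virtual nodes per bucket smooth out the key distribution

def _hash(key: str) -> int:
    # Any stable, well-mixed hash works; md5's first 8 bytes are enough here.
    return int.from_bytes(hashlib.md5(key.encode()).digest()[:8], "big")

class ConsistentHashRing:
    def __init__(self, buckets):
        self._ring = sorted(
            (_hash(f"{b}#{v}"), b) for b in buckets for v in range(VNODES)
        )
        self._points = [p for p, _ in self._ring]

    def bucket_for(self, _id: str) -> str:
        # First ring point clockwise from hash(_id), wrapping past the end.
        i = bisect.bisect(self._points, _hash(_id)) % len(self._ring)
        return self._ring[i][1]

ring = ConsistentHashRing([f"b{i}" for i in range(1, 101)])  # ~100 buckets per tenant
print(ring.bucket_for("item:123456789"))  # stable: same _id always maps to the same bucket
```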
MStore DB Proxy
• proxy/manage DB connections
• runs on each host
• lightweight and efficient
• connects to mongo over unix socket
• BSON in, BSON out
• performs “logical address translation” in BSON messages (sketched after the diagram)
• bucketName => mongoDB dbName
• dbName changes after each compaction
[Diagram: as above, but a Proxy on the storage host now fronts the MongoD processes of RS1–RS4 and their buckets]
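A minimal sketch of the proxy's logical address translation, assuming a simple in-memory table; the class and method names are hypothetical:

```python
import threading

class BucketTable:
    """Maps a stable bucket name to the current physical MongoDB db name."""

    def __init__(self, mapping):
        self._lock = threading.RLock()
        self._db_for_bucket = dict(mapping)  # e.g. {"b42": "b42_gen7"}

    def resolve(self, bucket: str) -> str:
        with self._lock:
            return self._db_for_bucket[bucket]

    def translate_ns(self, ns: str) -> str:
        # Rewrite the "bucketName.collection" namespace inside a BSON message
        # to the physical "dbName.collection" before forwarding to mongod.
        bucket, _, coll = ns.partition(".")
        return f"{self.resolve(bucket)}.{coll}"

    def flip(self, bucket: str, new_db: str) -> None:
        # Called at the end of compaction: repoint the bucket atomically.
        with self._lock:
            self._db_for_bucket[bucket] = new_db
```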
MStore Service Tier
• stateless REST service
• domain API
• route calculation (sketched below):
• _id=>BucketName
• BucketName=>ReplicaSetId
• isWrite or !staleReadOk?
• primary MongoD or some secondary MongoD
• MongoD=>host
• request goes to proxy@host
[Diagram: applications talk to the mstore service over HTTP/JSON; the service talks to the storage servers over HTTP/BSON]
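A minimal sketch of the route calculation in Python, chaining the lookups listed above; the cluster_map layout and the `ring` object (from the earlier hashing sketch) are assumptions:

```python
import random

def route(cluster_map, ring, _id, is_write, stale_read_ok):
    bucket = ring.bucket_for(_id)                       # _id => BucketName
    rs = cluster_map["bucket_to_rs"][bucket]            # BucketName => ReplicaSetId
    members = cluster_map["rs_members"][rs]
    if is_write or not stale_read_ok:
        mongod = members["primary"]                     # writes / fresh reads hit the primary
    else:
        mongod = random.choice(members["secondaries"])  # stale-tolerant reads spread out
    host = cluster_map["mongod_to_host"][mongod]        # MongoD => host
    return bucket, mongod, host                         # request goes to proxy@host
```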
Connections, Protocols, Payload formats
• Too many connections problem with MongoDB
• The service forms BSON and sends it to the proxy over HTTP
• The proxy needs only a few connections to mongo
• Fair request scheduling (sketched below)
[Diagram: mstore service → Proxy over BSON/HTTP 1.1 with keep-alive; Proxy → MongoD over the native transport on a Unix socket]
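A minimal sketch of one way fair request scheduling could work, assuming per-tenant queues drained round-robin onto the proxy's small mongo connection pool; all names are hypothetical:

```python
import collections

class FairScheduler:
    def __init__(self):
        self._queues = collections.OrderedDict()  # tenant -> deque of requests

    def submit(self, tenant, request):
        self._queues.setdefault(tenant, collections.deque()).append(request)

    def next_request(self):
        # Scan tenants in order; serve one request, then rotate that tenant
        # to the back so no single tenant can monopolize the mongo pool.
        for tenant, q in list(self._queues.items()):
            if q:
                self._queues.move_to_end(tenant)
                return tenant, q.popleft()
        return None  # nothing pending
```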
MStore Coordination Tier
• Manages the cluster map (its shape is sketched below)
• Serves queries and pushes changes to map cache in the service nodes and in proxies
• Functionally similar to MongoDB Config server or ZooKeeper
• Backed by a transactional, highly available repository
[Diagram: coordinator service (crd) instances, backed by a replicated Coordinator DB, serve a GET of the cluster map on init and push updates on change to the map caches in MStore service instances and proxies]
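For illustration, a plausible shape for the cluster map the coordinators serve and push; every key name below is an assumption:

```python
# Illustrative cluster-map shape; all field names are assumptions.
cluster_map = {
    "version": 1042,                      # monotonic; caches apply pushes in order
    "bucket_to_rs": {"b42": "RS3"},       # BucketName => ReplicaSetId
    "bucket_to_db": {"b42": "b42_gen7"},  # changes after each compaction
    "rs_members": {"RS3": {"primary": "mongod-3a",
                           "secondaries": ["mongod-3b", "mongod-3c"]}},
    "mongod_to_host": {"mongod-3a": "host17.example.com"},
}
```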
The big picture
[Diagram: applications → mstore service → storage servers, with the MStore coordination tier (crd instances and their DB) plus workflow, management, and automation tools alongside]
Bucket Compaction
• Document deletes are expensive
• We prefer marking documents with tombstones, hence the need to compact
• Compaction is done on an AUX storage node so as not to disturb ongoing operations
• When the new bucket image is ready, in the proxy:
• hold new writes
• flush pending writes
• “flip the switch” : BucketName->DB
• resume writes
• This is not easy and is implemented as a multi-step workflow (the switch-flip is sketched below)
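A self-contained sketch of the switch-flip sequence inside the proxy, following the four steps above; all names are illustrative assumptions:

```python
import threading
from collections import deque

class BucketSwitch:
    def __init__(self, bucket, db, send):
        self.bucket, self._db, self._send = bucket, db, send
        self._held, self._pending = False, deque()
        self._lock = threading.RLock()

    def write(self, msg):
        with self._lock:
            if self._held:
                self._pending.append(msg)  # held until the flip completes
            else:
                self._send(self._db, msg)

    def flip(self, new_db):
        with self._lock:
            self._held = True              # 1. hold new writes
        # (the real workflow also drains writes already in flight here)
        with self._lock:
            self._db = new_db              # 2. BucketName -> new DB name
            while self._pending:           # 3. release held writes to the new DB
                self._send(self._db, self._pending.popleft())
            self._held = False             # 4. resume normal writes
```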
Other workflows
• The compaction workflow is quite hard to master
• But the good news is that the other workflows are very similar:
• bucket move is ~the same except the target RS is different
• bucket split creates 2 new buckets
• bucket merge is like 2 moves
• etc
What are we achieving?
• Elastic expansion of the storage tier
• Full control over what/when/how fast to rebalance
• Efficiency of rebalancing
• Smooth and predictable operation
• Intelligent DB connection management
• SLA measurement
Final words
• Open source?
• Looking for feedback
• Contact us if interested
• Thank you!
Appendix
Buckets
• Need a smaller unit of data granularity: the bucket
• sizeof(bucket) << sizeof(data set)
• bucket is a single MongoDB DB
• one Replica Set has many buckets
• Bucket operations:
• create / delete
• split / merge
• compact
• move (to another RS)
• Multiple MongoD processes from different replica sets on the same physical host for best storage utilization (on big bare metal)
• These could be LXC containers
[Diagram: one storage host runs MongoD processes from four replica sets (RS1–RS4), each holding buckets b1–b4, b25–b28, b49–b52, b73–b76]
Bucket Compaction
• Document deletes are expensive
• We prefer marking documents with tombstones
• Compaction is the process of generating a new image of the bucket after purging deleted and expired documents
• Compaction is done on an AUX storage node so as not to disturb ongoing operations
• When the new bucket image is ready, just “flip the switch” in the proxy
• This is not easy and is implemented as a multi-step workflow
Compaction Workflow
1. mark oplog time; take a snapshot
2. copy snapshot to Aux node
3. start 2 stand-alone mongod: source and destination
4. bulk-scan source: skip deleted or expired docs; insert live docs into destination db
5. transfer compacted bucket image to all nodes in the original replica set
6. replay oplog from old db to new db
“Flip the switch” phase:
7. pause writes to old db in proxy
8. keep replaying oplog until all queues are drained
9. tell proxy to enable writes to new DB
10. update Coordinator map
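A minimal sketch of driving these ten steps as a resumable workflow, with a checkpoint after each step so a failed run can retry or resume; the step names and engine are illustrative assumptions, not MStore's actual workflow system:

```python
# Hypothetical step names mirroring the numbered list above.
COMPACTION_STEPS = [
    "mark_oplog_time_and_snapshot", "copy_snapshot_to_aux",
    "start_source_and_destination_mongod", "bulk_scan_and_copy_live_docs",
    "transfer_image_to_replica_set", "replay_oplog",
    "pause_writes_in_proxy", "drain_oplog_queues",
    "enable_writes_to_new_db", "update_coordinator_map",
]

def run_workflow(steps, state, execute, save_checkpoint):
    # Resume from the last checkpointed step instead of starting over.
    for i in range(state.get("next_step", 0), len(steps)):
        execute(steps[i], state)      # may raise; a retry re-enters right here
        state["next_step"] = i + 1
        save_checkpoint(state)
```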