In order to build robust, multi-tenant, highly available storage services that meet the business's SLAs, your databases have to be sharded. But if your service has to scale continuously through incremental additions of storage, without service interruption or human intervention, basic static sharding is not enough. At eBay, we are building MStore to solve this problem, with MongoDB as the storage engine. In this presentation, we dive into the key design concepts of this solution.
Elastic metadata store for eBay media platform
Yuri Finkelstein, architect, eBay Global Platform Services
MongoDB SF 2014
Introduction
• eBay items have media:
• pictures, 360 views, overlays, video, etc.
• binary content + metadata
• metadata is rich and is best modeled as a document
• 99% reads, 1% writes, ~1/100th deleted daily
• MongoDB is a reasonable fit
• But we need a service on top of it
What is a data service?
• Data service vs database instances
• SLA
• data lifecycle management automation instead of DBA excellence
• no downtime during hardware repair, maintenance
• no downtime as data grows and hardware is added
• multiple tenants
• tenants come and go, grow (and shrink) at different rates
• different tenant requirements for cross DC replication and data access latencies
What is wrong with this picture?
• Vertical scalability only
• Prone to coarse-grain outages
• No service model, no SLA
• Limited number of connections
• etc
[Diagram: applications from tenant A and tenant B, each with its own driver, connect directly to replicated DB instances]
Is MongoDB sharding the answer?
• Can scale out in theory, but at the time of expansion we either need downtime or are likely to breach the SLA
• Mongo chunks are logical
• migrating them causes a flurry of I/O, or is too slow to engage new hardware
• Still no service boundary
• Other problems mentioned earlier
[Diagram: applications connect through drivers to MongoS routers, which fan out to shards, each a replica set of DBs; a "new shard?" is being added]
On the effect of chunk migration: great slide by Kenny Gorman/Object Rocket
Buckets
• Need a smaller unit of data granularity: the bucket
• ~100 buckets per tenant to begin with
• bucket algebra (sketched after the diagram):
• create / delete
• split / merge
• compact
• move (to another RS)
[Diagram: one storage host runs MongoD processes from four replica sets (RS1–RS4), each holding many buckets, e.g. b1–b4, b25–b28, b49–b52, b73–b76]
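For illustration, the bucket algebra can be thought of as a small administrative interface. A minimal hypothetical sketch in Python; every method name and signature here is an assumption, not MStore's actual API:

```python
# Hypothetical bucket-algebra interface; names and signatures are
# illustrative assumptions, not MStore's actual API.
class BucketAdmin:
    def create(self, tenant: str) -> str: ...             # returns the new bucket's name
    def delete(self, bucket: str) -> None: ...
    def split(self, bucket: str) -> "tuple[str, str]": ...   # one bucket becomes two
    def merge(self, a: str, b: str) -> str: ...              # two buckets become one
    def compact(self, bucket: str) -> None: ...              # purge tombstoned/expired docs
    def move(self, bucket: str, target_rs: str) -> None: ... # relocate to another replica set
```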
_id => BucketName?
• Can be done in a number of different ways, based on the use case
• if range queries on _id are needed, use an “order-preserving partitioner”, e.g. as in HBase
• if access is by _id only, consistent hashing works well, e.g. Memcached, Cassandra (a minimal sketch follows)
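A minimal sketch of the consistent-hashing option in Python; the bucket names, virtual-node count, and hash function are illustrative assumptions, not MStore's actual scheme:

```python
import bisect
import hashlib

VNODES = 64  # virtual nodes per bucket smooth out the key distribution

def _hash(key: str) -> int:
    # Any stable, well-mixed hash works; md5's first 8 bytes are enough here.
    return int.from_bytes(hashlib.md5(key.encode()).digest()[:8], "big")

class ConsistentHashRing:
    def __init__(self, buckets):
        self._ring = sorted(
            (_hash(f"{b}#{v}"), b) for b in buckets for v in range(VNODES)
        )
        self._points = [p for p, _ in self._ring]

    def bucket_for(self, _id: str) -> str:
        # First ring point clockwise from hash(_id), wrapping past the end.
        i = bisect.bisect(self._points, _hash(_id)) % len(self._ring)
        return self._ring[i][1]

ring = ConsistentHashRing([f"b{i}" for i in range(1, 101)])  # ~100 buckets per tenant
print(ring.bucket_for("item:123456789"))  # stable: same _id always maps to the same bucket
```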
MStore DB Proxy
• proxy/manage DB connections
• runs on each host
• lightweight and efficient
• connects to mongo over unix socket
• BSON in, BSON out
• performs “logical address translation” in BSON messages (sketched after the diagram)
• bucketName => mongoDB dbName
• dbName changes after each compaction
[Diagram: as above, but a Proxy on the storage host now fronts the MongoD processes of RS1–RS4 and their buckets]
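A minimal sketch of the proxy's logical address translation, assuming a simple in-memory table; the class and method names are hypothetical:

```python
import threading

class BucketTable:
    """Maps a stable bucket name to the current physical MongoDB db name."""

    def __init__(self, mapping):
        self._lock = threading.RLock()
        self._db_for_bucket = dict(mapping)  # e.g. {"b42": "b42_gen7"}

    def resolve(self, bucket: str) -> str:
        with self._lock:
            return self._db_for_bucket[bucket]

    def translate_ns(self, ns: str) -> str:
        # Rewrite the "bucketName.collection" namespace inside a BSON message
        # to the physical "dbName.collection" before forwarding to mongod.
        bucket, _, coll = ns.partition(".")
        return f"{self.resolve(bucket)}.{coll}"

    def flip(self, bucket: str, new_db: str) -> None:
        # Called at the end of compaction: repoint the bucket atomically.
        with self._lock:
            self._db_for_bucket[bucket] = new_db
```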
MStore Service Tier
• stateless REST service
• domain API
• route calculation (sketched below):
• _id=>BucketName
• BucketName=>ReplicaSetId
• isWrite or !staleReadOk?
• primary MongoD or some secondary MongoD
• MongoD=>host
• request goes to proxy@host
[Diagram: applications talk to the mstore service over HTTP/JSON; the service talks to the storage servers over HTTP/BSON]
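A minimal sketch of the route calculation in Python, chaining the lookups listed above; the cluster_map layout and the `ring` object (from the earlier hashing sketch) are assumptions:

```python
import random

def route(cluster_map, ring, _id, is_write, stale_read_ok):
    bucket = ring.bucket_for(_id)                       # _id => BucketName
    rs = cluster_map["bucket_to_rs"][bucket]            # BucketName => ReplicaSetId
    members = cluster_map["rs_members"][rs]
    if is_write or not stale_read_ok:
        mongod = members["primary"]                     # writes / fresh reads hit the primary
    else:
        mongod = random.choice(members["secondaries"])  # stale-tolerant reads spread out
    host = cluster_map["mongod_to_host"][mongod]        # MongoD => host
    return bucket, mongod, host                         # request goes to proxy@host
```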
Connections, Protocols, Payload formats
• Too many connections problem with MongoDB
• The service forms BSON and sends it to the proxy over HTTP
• The proxy needs only a few connections to mongo
• Fair request scheduling (sketched below)
[Diagram: mstore service → Proxy over BSON/HTTP 1.1 with keep-alive; Proxy → MongoD over the native transport on a Unix socket]
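A minimal sketch of one way fair request scheduling could work, assuming per-tenant queues drained round-robin onto the proxy's small mongo connection pool; all names are hypothetical:

```python
import collections

class FairScheduler:
    def __init__(self):
        self._queues = collections.OrderedDict()  # tenant -> deque of requests

    def submit(self, tenant, request):
        self._queues.setdefault(tenant, collections.deque()).append(request)

    def next_request(self):
        # Scan tenants in order; serve one request, then rotate that tenant
        # to the back so no single tenant can monopolize the mongo pool.
        for tenant, q in list(self._queues.items()):
            if q:
                self._queues.move_to_end(tenant)
                return tenant, q.popleft()
        return None  # nothing pending
```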
MStore Coordination Tier
• Manages the cluster map (its shape is sketched below)
• Serves queries and pushes changes to map cache in the service nodes and in proxies
• Functionally similar to MongoDB Config server or ZooKeeper
• Backed by a transactional, highly available repository
[Diagram: coordinator service (crd) instances, backed by a replicated Coordinator DB, serve a GET of the cluster map on init and push updates on change to the map caches in MStore service instances and proxies]
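For illustration, a plausible shape for the cluster map the coordinators serve and push; every key name below is an assumption:

```python
# Illustrative cluster-map shape; all field names are assumptions.
cluster_map = {
    "version": 1042,                      # monotonic; caches apply pushes in order
    "bucket_to_rs": {"b42": "RS3"},       # BucketName => ReplicaSetId
    "bucket_to_db": {"b42": "b42_gen7"},  # changes after each compaction
    "rs_members": {"RS3": {"primary": "mongod-3a",
                           "secondaries": ["mongod-3b", "mongod-3c"]}},
    "mongod_to_host": {"mongod-3a": "host17.example.com"},
}
```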
The big picture
[Diagram: applications → mstore service → storage servers, with the MStore coordination tier (crd instances and their DB) plus workflow, management, and automation tools alongside]
Bucket Compaction
• Document deletes are expensive
• We prefer marking documents with tombstones, hence the need to compact
• Compaction is done on an AUX storage node so as not to disturb ongoing operations
• When the new bucket image is ready, in the proxy:
• hold new writes
• flush pending writes
• “flip the switch” : BucketName->DB
• resume writes
• This is not easy and is implemented as a multi-step workflow (the switch-flip is sketched below)
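A self-contained sketch of the switch-flip sequence inside the proxy, following the four steps above; all names are illustrative assumptions:

```python
import threading
from collections import deque

class BucketSwitch:
    def __init__(self, bucket, db, send):
        self.bucket, self._db, self._send = bucket, db, send
        self._held, self._pending = False, deque()
        self._lock = threading.RLock()

    def write(self, msg):
        with self._lock:
            if self._held:
                self._pending.append(msg)  # held until the flip completes
            else:
                self._send(self._db, msg)

    def flip(self, new_db):
        with self._lock:
            self._held = True              # 1. hold new writes
        # (the real workflow also drains writes already in flight here)
        with self._lock:
            self._db = new_db              # 2. BucketName -> new DB name
            while self._pending:           # 3. release held writes to the new DB
                self._send(self._db, self._pending.popleft())
            self._held = False             # 4. resume normal writes
```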
Other workflows
• The compaction workflow is quite hard to master
• But the good news is that the other workflows are very similar:
• bucket move is ~the same except the target RS is different
• bucket split creates 2 new buckets
• bucket merge is like 2 moves
• etc
What are we achieving?
• Elastic expansion of the storage tier
• Full control over what/when/how fast to rebalance
• Efficiency of rebalancing
• Smooth and predictable operation
• Intelligent DB connection management
• SLA measurement
Final words
• Open source?
• Looking for feedback
• Contact us if interested
• Thank you!
Appendix
Buckets
• Need a smaller unit of data granularity: the bucket
• sizeof(bucket) << sizeof(data set)
• bucket is a single MongoDB DB
• one Replica Set has many buckets
• Bucket operations:
• create / delete
• split / merge
• compact
• move (to another RS)
• Multiple MongoD processes from different replica sets on the same physical host for best storage utilization (on big bare metal)
• These could be LXC containers
[Diagram: one storage host runs MongoD processes from four replica sets (RS1–RS4), each holding buckets b1–b4, b25–b28, b49–b52, b73–b76]
Bucket Compaction
• Document deletes are expensive
• We prefer marking documents with tombstones
• Compaction is the process of generating a new image of the bucket after purging deleted and expired documents
• Compaction is done on an AUX storage node so as not to disturb ongoing operations
• When the new bucket image is ready, just “flip the switch” in the proxy
• This is not easy and is implemented as a multi-step workflow
Compaction Workflow
1. mark oplog time; take a snapshot
2. copy snapshot to Aux node
3. start 2 stand-alone mongod: source and destination
4. bulk-scan source: skip deleted or expired docs; insert live docs into destination db
5. transfer compacted bucket image to all nodes in the original replica set
6. replay oplog from old db to new db
“Flip the switch” phase:
7. pause writes to old db in proxy
8. keep replaying oplog until all queues are drained
9. tell proxy to enable writes to new DB
10. update Coordinator map
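A minimal sketch of driving these ten steps as a resumable workflow, with a checkpoint after each step so a failed run can retry or resume; the step names and engine are illustrative assumptions, not MStore's actual workflow system:

```python
# Hypothetical step names mirroring the numbered list above.
COMPACTION_STEPS = [
    "mark_oplog_time_and_snapshot", "copy_snapshot_to_aux",
    "start_source_and_destination_mongod", "bulk_scan_and_copy_live_docs",
    "transfer_image_to_replica_set", "replay_oplog",
    "pause_writes_in_proxy", "drain_oplog_queues",
    "enable_writes_to_new_db", "update_coordinator_map",
]

def run_workflow(steps, state, execute, save_checkpoint):
    # Resume from the last checkpointed step instead of starting over.
    for i in range(state.get("next_step", 0), len(steps)):
        execute(steps[i], state)      # may raise; a retry re-enters right here
        state["next_step"] = i + 1
        save_checkpoint(state)
```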