BSON MAD SCIENCE FOR FUN AND PROFIT Alessandro Molina @__amol__ [email protected]

MongoTorino 2013 - BSON Mad Science for fun and profit


DESCRIPTION

The talk covers how to use BSON directly as an exchange protocol to gain speed and advanced types. BSON is the underlying serialization format used by MongoDB to store and represent data. Whenever we retrieve data from MongoDB we get it as BSON, and our drivers decode it only so that our web service can encode it back into JSON. We will see how to take advantage of BSON for fun and speed, skipping this double step by fetching BSON directly and decoding it on the client side.


Page 1: MongoTorino 2013 - BSON Mad Science for fun and profit

BSON MAD SCIENCE FOR FUN AND PROFIT

Alessandro Molina @__amol__

[email protected]

Page 2: MongoTorino 2013 - BSON Mad Science for fun and profit

Who am I

● CTO @ Axant.it, mostly Python company, with some iOS and Android development.

● Mostly relying on MySQL, MongoDB, Redis (and sqlite!) for day by day data storage

● TurboGears web framework team member

● Contributions to Ming MongoDB ODM

Page 3: MongoTorino 2013 - BSON Mad Science for fun and profit

The Reason

● EuroPython 2013

○ JSON WebServices with Python best practices talk

● Question raised

○ “We have a service where our bottleneck is actually the JSON encoding itself, what can we do?”

Page 4: MongoTorino 2013 - BSON Mad Science for fun and profit

First obvious answer

● Avoid encoding the whole data set in memory

○ iterencode yields the encoded output one chunk at a time instead of encoding everything at once.

● Use a faster encoder!

○ There are projects with custom encoders, like GPSD, that are very fast and very memory conservative.
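As a minimal sketch of the first suggestion, the standard library's json.JSONEncoder.iterencode can be used to stream a response in chunks rather than building one large string. The payload and the stream_json helper below are made up for illustration.

```python
import json

# Hypothetical payload, standing in for a real query result.
payload = {"posts": [{"author": "Mike", "text": "My first blog post!"}] * 3}

def stream_json(obj):
    """Yield the JSON encoding of obj chunk by chunk, as UTF-8 bytes."""
    for chunk in json.JSONEncoder().iterencode(obj):
        yield chunk.encode("utf-8")

# A WSGI server could iterate over stream_json(payload) directly;
# here the chunks are joined only to show they form valid JSON.
body = b"".join(stream_json(payload))
```

Streaming lowers peak memory use, but every value still goes through the encoder, so it does not remove the encoding cost the question was about.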

Page 5: MongoTorino 2013 - BSON Mad Science for fun and profit

Mad answer

● If the JSON encoder is too slow for you

○ Remove JSON encoding

● Looking for the fastest encoding?

○ Don’t encode data at all!

Page 6: MongoTorino 2013 - BSON Mad Science for fun and profit

MongoDB flow

● BSON is the serialization format used by MongoDB to talk with its clients

● Involves decoding BSON and then re-encoding JSON

MongoDB → (BSON) → Driver → (native) → WebService → (JSON) → Client

Page 7: MongoTorino 2013 - BSON Mad Science for fun and profit

Using BSON!

● Can we totally skip the “BSON decoding” and “JSON encoding” dance and directly use BSON?

“BSON [bee · sahn], short for Binary JSON, is a binary-encoded serialization of JSON-like documents. Like JSON, BSON supports the embedding of documents and arrays within other documents and arrays. BSON also contains extensions that allow representation of data types that are not part of the JSON spec. For example, BSON has a Date type and a BinData type.”
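To make the "extra types" point concrete, here is a small sketch using only the standard library. It shows that the stock JSON encoder rejects a datetime, and then hand-builds the BSON bytes for a one-field document holding a UTC datetime (element type 0x09, an int64 of milliseconds since the Unix epoch, per the BSON spec). The field name and the date are arbitrary choices for the example.

```python
import json
import struct
from datetime import datetime, timezone

when = datetime(2013, 10, 23, tzinfo=timezone.utc)  # arbitrary example date

# JSON has no date type, so the standard encoder refuses a datetime.
try:
    json.dumps({"when": when})
    json_ok = True
except TypeError:
    json_ok = False

# BSON: type byte 0x09, NUL-terminated key, int64 milliseconds since epoch.
millis = int(when.timestamp() * 1000)
element = b"\x09" + b"when\x00" + struct.pack("<q", millis)

# Document = int32 total size (including itself) + elements + trailing NUL.
document = struct.pack("<i", 4 + len(element) + 1) + element + b"\x00"
```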

Page 8: MongoTorino 2013 - BSON Mad Science for fun and profit

Target Flow

● BSON decoding on the client can happen using the js-bson library (or equivalent)

● Skipping BSON decoding on the server is hard

○ It’s built into the MongoDB driver

MongoDB → (BSON) → Driver → (BSON) → WebService → (BSON) → Client

Page 9: MongoTorino 2013 - BSON Mad Science for fun and profit

The Python Driver

MongoDB → Cursor → _unpack_response → bson.decode_all → _elements_to_dict → _element_to_dict

Page 10: MongoTorino 2013 - BSON Mad Science for fun and profit

Custom decoding

● bson.decode_all is the function in charge of decoding BSON objects.

● We need a decoder that partially decodes the query response but leaves the actual documents encoded.

● Full BSON spec available at bsonspec.org

Page 11: MongoTorino 2013 - BSON Mad Science for fun and profit

Custom bson.decode_all

$ python test.py
{u'text': u'My first blog post!', u'_id': ObjectId('5267f71a0e9ce56fe55bdc4b'), u'author': u'Mike'}

$ python test.py
'E\x00\x00\x00\x07_id\x00Rg\xf7\x1a\x0e\x9c\xe5o\xe5[\xdcK\x02text\x00\x14\x00\x00\x00My first blog post!\x00\x02author\x00\x05\x00\x00\x00Mike\x00\x00'

Page 12: MongoTorino 2013 - BSON Mad Science for fun and profit

BSON format

Document: SIZE | ONE OR MORE KEY-VALUE ENTRIES | \0

Entry: TYPE | KEY NAME | \0 | VALUE
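The layout above can be checked against the raw bytes printed on the earlier slide. The following is a minimal sketch, not a full BSON parser: walk_document is a hypothetical helper that only understands the two element types present in that sample (0x02 string and 0x07 ObjectId), and it is written for Python 3 bytes.

```python
import struct

# The raw document from the earlier slide, as Python 3 bytes.
RAW = (b'E\x00\x00\x00\x07_id\x00Rg\xf7\x1a\x0e\x9c\xe5o\xe5[\xdcK'
       b'\x02text\x00\x14\x00\x00\x00My first blog post!\x00'
       b'\x02author\x00\x05\x00\x00\x00Mike\x00\x00')

def walk_document(data):
    """Yield (type, key, value) for each entry of a single BSON document."""
    size = struct.unpack("<i", data[:4])[0]   # SIZE counts the whole document
    if size != len(data) or data[-1] != 0:
        raise ValueError("not a single well-formed BSON document")
    pos = 4
    while pos < size - 1:                     # stop before the trailing \0
        elem_type = data[pos]                 # TYPE
        key_end = data.index(b"\x00", pos + 1)
        key = data[pos + 1:key_end].decode("utf-8")  # KEY NAME, then \0
        pos = key_end + 1
        if elem_type == 0x02:                 # string: int32 length + bytes + \0
            str_len = struct.unpack("<i", data[pos:pos + 4])[0]
            value = data[pos + 4:pos + 4 + str_len - 1].decode("utf-8")
            pos += 4 + str_len
        elif elem_type == 0x07:               # ObjectId: 12 raw bytes
            value = data[pos:pos + 12].hex()
            pos += 12
        else:
            raise ValueError("element type 0x%02x not handled in this sketch" % elem_type)
        yield elem_type, key, value

entries = list(walk_document(RAW))
```

The leading 'E' in the dump is simply the size field: 0x45 is 69, the total length of the document in bytes.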

Page 13: MongoTorino 2013 - BSON Mad Science for fun and profit

Custom bson.decode_all

# Original loop body in bson.decode_all: decode each document into a dict
obj_size = struct.unpack("<i", data[position:position + 4])[0]
elements = data[position + 4:position + obj_size - 1]
position += obj_size
docs.append(_elements_to_dict(elements, as_class, ...))

# Custom loop body: keep each document as raw BSON bytes
obj_size = struct.unpack("<i", data[position:position + 4])[0]
elements = data[position:position + obj_size]
position += obj_size
docs.append(elements)
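For reference, a self-contained version of the custom loop might look like the sketch below. The slide only shows the loop body, so the surrounding function, the bounds check, and the name custom_decode_all (borrowed from later slides) are assumptions about how the pieces fit together.

```python
import struct

def custom_decode_all(data):
    """Split a buffer of concatenated BSON documents without decoding them.

    Returns a list of bytes objects, one per document.
    """
    docs = []
    position = 0
    end = len(data)
    while position < end:
        # Each document starts with its own total size as a little-endian int32.
        obj_size = struct.unpack("<i", data[position:position + 4])[0]
        if obj_size < 5 or position + obj_size > end:
            raise ValueError("truncated or corrupt BSON data")
        docs.append(data[position:position + obj_size])
        position += obj_size
    return docs

# Two tiny documents back to back: {} and {"a": "b"}.
empty = b"\x05\x00\x00\x00\x00"
small = b"\x0e\x00\x00\x00\x02a\x00\x02\x00\x00\x00b\x00\x00"
parts = custom_decode_all(empty + small)
```

Because only the 4-byte size prefix of each document is read, the cost per document is independent of how many fields it contains.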

Page 14: MongoTorino 2013 - BSON Mad Science for fun and profit

Enforcing in PyMongo

● Now that we have a custom decoding function that leaves the documents encoded in BSON, we need to force PyMongo to use it

● _unpack_response is the function in charge of calling decode_all; we must convince it to call our version

Page 15: MongoTorino 2013 - BSON Mad Science for fun and profit

MonkeyPatching

and this is the reason why it’s mad science and you should avoid doing it!
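If monkeypatching is unavoidable, one way to limit the damage is to scope the patch and always restore the original. The sketch below illustrates the idea on a stand-in object; the helpers namespace and the patched context manager are hypothetical and do not touch PyMongo itself.

```python
import contextlib
import types

# Hypothetical stand-in for a module whose function we want to replace.
helpers = types.SimpleNamespace(unpack=lambda data: ("decoded", data))

@contextlib.contextmanager
def patched(target, name, replacement):
    """Temporarily replace target.<name>, restoring it on exit."""
    original = getattr(target, name)
    setattr(target, name, replacement)
    try:
        yield original
    finally:
        # Restore even if the body raised, so the patch never leaks.
        setattr(target, name, original)

with patched(helpers, "unpack", lambda data: ("raw", data)):
    inside = helpers.unpack(b"x")   # uses the replacement
outside = helpers.unpack(b"x")      # original behaviour is back
```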

Page 16: MongoTorino 2013 - BSON Mad Science for fun and profit

Hijacking decoding

● _unpack_response

○ Called by PyMongo to unpack responses retrieved from the server.

○ Some information is provided, such as the current cursor id in case of getMore, along with other parameters

○ We can use the provided parameters to guess whether we are decoding a query response or something else.

Page 17: MongoTorino 2013 - BSON Mad Science for fun and profit

Custom unpack_response

import struct
import pymongo.helpers

# Keep a reference to the original so other responses are still handled by it.
_real_unpack_response = pymongo.helpers._unpack_response

def custom_unpack_response(response, cursor_id=None, as_class=None, *args, **kw):
    if as_class is None:
        # Not a query, here lies the real trick
        return _real_unpack_response(response, cursor_id, dict, *args, **kw)

    response_flag = struct.unpack("<i", response[:4])[0]
    if response_flag & 2:
        # In case it's an error report
        return _real_unpack_response(response, cursor_id, as_class, *args, **kw)

    result = {}
    result["cursor_id"] = struct.unpack("<q", response[4:12])[0]
    result["starting_from"] = struct.unpack("<i", response[12:16])[0]
    result["number_returned"] = struct.unpack("<i", response[16:20])[0]
    result["data"] = custom_decode_all(response[20:])
    return result

pymongo.helpers._unpack_response = custom_unpack_response
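The offsets used above correspond to the fixed 20-byte prefix of a reply: an int32 of flags, an int64 cursor id, and two int32 counters, followed by the documents. The sketch below unpacks that prefix in one call and can be tried without a database; parse_reply_header is a hypothetical helper, and the fake response is built by hand on the assumption that the buffer starts at the flags field, as the slide's offsets imply.

```python
import struct

def parse_reply_header(response):
    """Split a reply buffer into its fixed 20-byte prefix and document bytes."""
    flags, cursor_id, starting_from, number_returned = struct.unpack(
        "<iqii", response[:20])
    return {
        "query_failure": bool(flags & 2),   # same bit the slide's code tests
        "cursor_id": cursor_id,
        "starting_from": starting_from,
        "number_returned": number_returned,
        "data": response[20:],              # still-encoded BSON documents
    }

# Hand-built reply: no flags set, cursor 42, one empty document.
fake_response = struct.pack("<iqii", 0, 42, 0, 1) + b"\x05\x00\x00\x00\x00"
header = parse_reply_header(fake_response)
```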

Page 18: MongoTorino 2013 - BSON Mad Science for fun and profit

Fetching BSON

● Our PyMongo queries will now return BSON-encoded data we can then push to the client

● Let’s fetch the data from the client to close the loop
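On the server side, pushing the raw documents to the client can be as simple as concatenating them into the response body. The following is a sketch of a bare WSGI application, not the web service used in the talk: make_bson_app and fetch_raw_documents are hypothetical names, and application/octet-stream is an assumed content type chosen because the client reads the response as an arraybuffer.

```python
def make_bson_app(fetch_raw_documents):
    """Build a WSGI app that returns concatenated raw BSON documents.

    fetch_raw_documents is any callable returning an iterable of bytes,
    for example the "data" list produced by the patched driver.
    """
    def app(environ, start_response):
        body = b"".join(fetch_raw_documents())
        start_response("200 OK", [
            ("Content-Type", "application/octet-stream"),
            ("Content-Length", str(len(body))),
        ])
        return [body]
    return app

# Stand-in data source: two empty BSON documents.
app = make_bson_app(lambda: [b"\x05\x00\x00\x00\x00"] * 2)
```

The app could be served locally with wsgiref.simple_server.make_server("localhost", 8080, app).serve_forever() from the standard library.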

Page 19: MongoTorino 2013 - BSON Mad Science for fun and profit

Fetching BSON

function fetch_bson() {
    var BSON = bson().BSON;

    var oReq = new XMLHttpRequest();
    oReq.open("GET", 'http://localhost:8080/results_bson', true);
    oReq.responseType = "arraybuffer";
    oReq.onload = function(e) {
        var data = new Uint8Array(oReq.response);
        var offset = 0;
        var results = [];

        while (offset < data.length)
            offset = BSON.deserializeStream(data, offset, 1, results, results.length, {});

        show_output(results);
    }

    oReq.send();
}

Page 20: MongoTorino 2013 - BSON Mad Science for fun and profit

See it in action

Page 21: MongoTorino 2013 - BSON Mad Science for fun and profit

Performance Gain

● It all started as a search for a performance boost: how much did it improve?

Format   Throughput
JSON     1239.72 req/sec
BSON     2079.75 req/sec

Page 22: MongoTorino 2013 - BSON Mad Science for fun and profit

False Benchmark

● The benchmark is actually pointless

○ as usual ;)

● It replaces bson.decode_all, which is written in C, with custom_decode_all, which is written in Python

○ The two are not really comparable

● Wanna try with PyPy?
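Anyone who wants to repeat the comparison on their own interpreter could start from a micro-benchmark like the sketch below. It is an assumption-laden simplification: it times only the encoding step of the web service (JSON-encoding decoded documents versus joining already-encoded byte strings), uses made-up documents, and uses JSON bytes as a stand-in for raw BSON, so its numbers are not comparable to the req/sec figures above.

```python
import json
import timeit

# Made-up documents standing in for a decoded query result.
docs = [{"author": "Mike", "text": "My first blog post!", "n": i}
        for i in range(1000)]

# Stand-in for the raw per-document bytes the patched driver would return.
raw_docs = [json.dumps(d).encode("utf-8") for d in docs]

def encode_path():
    """Classic flow: native objects are encoded to JSON for the response."""
    return json.dumps(docs).encode("utf-8")

def passthrough_path():
    """Mad flow: already-encoded documents are simply concatenated."""
    return b"".join(raw_docs)

encode_time = timeit.timeit(encode_path, number=50)
passthrough_time = timeit.timeit(passthrough_path, number=50)
```

Running the same script under CPython and PyPy gives a first impression of how much the interpreter changes the picture.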

Page 23: MongoTorino 2013 - BSON Mad Science for fun and profit

Questions?