amol
DESCRIPTION
The talk will cover how to use BSON directly as an exchange protocol to gain speed and advanced types. BSON is the underlying serialization protocol used by MongoDB to store and represent data. Whenever we retrieve data from MongoDB we get it as BSON; our drivers then decode it, only for our web service to encode it again as JSON. We will see how to take advantage of BSON for fun and speed, skipping this double step by fetching BSON directly and decoding it on the client side.
Who am I
● CTO @ Axant.it, mostly Python company,
with some iOS and Android development.
● Mostly relying on MySQL, MongoDB, Redis
(and sqlite!) for day by day data storage
● TurboGears web framework team member
● Contributions to Ming MongoDB ODM
The Reason
● EuroPython 2013
○ JSON WebServices with Python best practices
talk
● Question raised
○ “We have a service where our bottleneck is
actually the JSON encoding itself, what can we
do?”
First obvious answer
● Avoid encoding the whole data set in memory
○ iterencode yields the encoded output one chunk at a
time instead of building the whole string at once.
● Use a faster encoder!
○ There are projects with custom encoders like
GPSD that are very fast and very memory
conservative.
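The first suggestion can be sketched with the standard library alone. This is a minimal illustration, not code from the talk: `stream_json` and the sample document are hypothetical names, and the input is assumed to be a list that is already in memory (only the encoded output is produced incrementally).

```python
import json


def stream_json(documents):
    """Yield the JSON encoding of `documents` piece by piece.

    JSONEncoder.iterencode produces string chunks, so a web framework
    can send each chunk as soon as it is ready instead of holding the
    whole encoded string in memory.
    """
    encoder = json.JSONEncoder()
    for chunk in encoder.iterencode(documents):
        yield chunk


# Hypothetical sample data, only for illustration.
docs = [{"author": "Mike", "text": "My first blog post!"}]
chunks = list(stream_json(docs))
```

Joining the chunks gives the same string as `json.dumps(docs)`, so the client sees no difference; only the server's peak memory use changes.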
Mad answer
● If the JSON encoder is too slow for you
○ Remove JSON encoding
● Looking for the fastest encoding?
○ Don’t encode data at all!
MongoDB flow
● BSON is the serialization format used by
MongoDB to talk with its clients
● Serving a query as JSON involves decoding BSON
and then re-encoding the result as JSON
MongoDB →[BSON]→ Driver →[NATIVE]→ WebService →[JSON]→ Client
Using BSON!
● Can we totally skip the “BSON decoding” and
“JSON encoding” dance and use BSON
directly?
“BSON [bee · sahn], short for Binary JSON, is a binary-encoded serialization
of JSON-like documents. Like JSON, BSON supports the embedding of
documents and arrays within other documents and arrays. BSON also contains
extensions that allow representation of data types that are not part of the JSON
spec. For example, BSON has a Date type and a BinData type.”
Target Flow
● BSON decoding on the client can happen
using the js-bson library (or equivalent)
● Skipping BSON decoding on the server is hard
○ It’s built into the MongoDB driver
MongoDB →[BSON]→ Driver →[BSON]→ WebService →[BSON]→ Client
The Python Driver
MongoDB → Cursor → _unpack_response → bson.decode_all → _elements_to_dict → _element_to_dict
Custom decoding
● bson.decode_all is the function in charge
of decoding BSON objects.
● We need a decoder that partially decodes
the query response but leaves the actual
documents encoded.
● Full BSON spec available on bsonspec.org
Custom bson.decode_all
$ python test.py
{u'text': u'My first blog post!', u'_id': ObjectId('5267f71a0e9ce56fe55bdc4b'), u'author': u'Mike'}
$ python test.py
'E\x00\x00\x00\x07_id\x00Rg\xf7\x1a\x0e\x9c\xe5o\xe5[\xdcK\x02text\x00\x14\x00\x00\x00My first blog post!\x00\x02author\x00\x05\x00\x00\x00Mike\x00\x00'
BSON format
Document: SIZE | ONE OR MORE KEY-VALUE ENTRIES | \0
Entry:    TYPE | KEY NAME | \0 | VALUE
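To make the layout concrete, the sketch below walks the top-level entries of the example document shown earlier in the slides. It is only an illustration under stated assumptions: `list_keys` is a hypothetical helper, it handles just the two value types present in that document (0x02 string and 0x07 ObjectId, as defined on bsonspec.org), and the document is written as a Python 3 bytes literal.

```python
import struct

# The example blog post from the slides, as Python 3 bytes.
DOC = (b'E\x00\x00\x00\x07_id\x00Rg\xf7\x1a\x0e\x9c\xe5o\xe5[\xdcK'
       b'\x02text\x00\x14\x00\x00\x00My first blog post!\x00'
       b'\x02author\x00\x05\x00\x00\x00Mike\x00\x00')


def list_keys(doc):
    """Return (type, key) pairs for the top-level entries of one document."""
    # SIZE: little-endian int32 counting the whole document.
    size = struct.unpack("<i", doc[:4])[0]
    if size != len(doc) or doc[-1] != 0:
        raise ValueError("not a single well-formed BSON document")

    keys = []
    pos = 4
    while pos < size - 1:                    # stop before the trailing \0
        elem_type = doc[pos]                 # TYPE: one byte
        end = doc.index(b"\x00", pos + 1)    # KEY NAME ends at the next \0
        key = doc[pos + 1:end].decode("utf-8")
        pos = end + 1
        if elem_type == 0x02:                # string: int32 length + bytes + \0
            length = struct.unpack("<i", doc[pos:pos + 4])[0]
            pos += 4 + length
        elif elem_type == 0x07:              # ObjectId: 12 raw bytes
            pos += 12
        else:
            raise ValueError("type 0x%02x not handled in this sketch" % elem_type)
        keys.append((elem_type, key))
    return keys


entries = list_keys(DOC)
```

The leading `E` in the byte string is the SIZE field: 0x45 is 69, the total length of the document in bytes.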
Custom bson.decode_all
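# Original bson.decode_all loop body: each document is decoded to a dict
obj_size = struct.unpack("<i", data[position:position + 4])[0]
elements = data[position + 4:position + obj_size - 1]
position += obj_size
docs.append(_elements_to_dict(elements, as_class, ...))

# Custom version: each document is kept as raw BSON (size prefix and trailing \0 included)
obj_size = struct.unpack("<i", data[position:position + 4])[0]
elements = data[position:position + obj_size]
position += obj_size
docs.append(elements)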
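The slide shows only the loop body. A complete function built around it might look like the sketch below; the surrounding loop and the size check are assumptions added for illustration, not the exact code from the talk.

```python
import struct


def custom_decode_all(data):
    """Split a buffer of concatenated BSON documents without decoding them.

    Each returned item is the raw bytes of one document, including its
    4-byte size prefix and trailing NUL, ready to be forwarded as is.
    """
    docs = []
    position = 0
    while position < len(data):
        obj_size = struct.unpack("<i", data[position:position + 4])[0]
        # The smallest valid document is 5 bytes (size prefix + NUL).
        if obj_size < 5 or position + obj_size > len(data):
            raise ValueError("corrupt BSON stream at offset %d" % position)
        docs.append(data[position:position + obj_size])
        position += obj_size
    return docs


# Two empty BSON documents back to back.
EMPTY = b"\x05\x00\x00\x00\x00"
split = custom_decode_all(EMPTY + EMPTY)
```

Because only the size prefix is read, the cost per document is one `struct.unpack` and one slice, regardless of how many fields the document holds.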
Enforcing in PyMongo
● Now that we have a custom decoding
function that leaves the documents
encoded in BSON, we need to make
PyMongo use it
● _unpack_response is the function in
charge of calling decode_all, so
we must convince it to call our version
MonkeyPatching
and this is the reason why it’s mad science and you should avoid doing it!
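If a patch like this is applied anyway, one way to limit the damage is to make it reversible. The sketch below is a generic pattern, not something shown in the talk: `patched` is a hypothetical helper, and `fake_helpers` is a stand-in namespace used so the example runs without PyMongo installed.

```python
import contextlib
import types


@contextlib.contextmanager
def patched(owner, name, replacement):
    """Temporarily replace owner.<name>, restoring the original on exit."""
    original = getattr(owner, name)
    setattr(owner, name, replacement)
    try:
        yield original
    finally:
        setattr(owner, name, original)


# Hypothetical stand-in for a driver module with a private function.
fake_helpers = types.SimpleNamespace(_unpack_response=lambda response: "decoded")

with patched(fake_helpers, "_unpack_response", lambda response: response):
    inside = fake_helpers._unpack_response(b"raw")   # patched behaviour
outside = fake_helpers._unpack_response(b"raw")      # original restored
```

Scoping the patch this way keeps the altered behaviour confined to the code that needs it, although it remains a dependency on a private function that can change between driver releases.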
Hijacking decoding
● _unpack_response
○ Called by pymongo to unpack responses retrieved
by the server.
○ Some information is provided, such as the current
cursor id in case of getMore, plus other parameters
○ We can use the provided parameters to guess whether
we are decoding a query response or something else.
Custom unpack_response
import struct
import pymongo.helpers

_real_unpack_response = pymongo.helpers._unpack_response

def custom_unpack_response(response, cursor_id=None, as_class=None, *args, **kw):
    if as_class is None:
        # Not a query, here lies the real trick
        return _real_unpack_response(response, cursor_id, dict, *args, **kw)

    response_flag = struct.unpack("<i", response[:4])[0]
    if response_flag & 2:
        # In case it's an error report
        return _real_unpack_response(response, cursor_id, as_class, *args, **kw)

    result = {}
    result["cursor_id"] = struct.unpack("<q", response[4:12])[0]
    result["starting_from"] = struct.unpack("<i", response[12:16])[0]
    result["number_returned"] = struct.unpack("<i", response[16:20])[0]
    result["data"] = custom_decode_all(response[20:])
    return result

pymongo.helpers._unpack_response = custom_unpack_response
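The four `struct.unpack` calls read a fixed 20-byte header: a 4-byte flags field, an 8-byte cursor id, and two 4-byte counters. As a sketch of the same parsing done in one step, the code below uses a single `struct.Struct`; `parse_reply_header` and the hand-built `fake_reply` are hypothetical and exist only to exercise the layout without a server.

```python
import struct

# flags (int32), cursor id (int64), starting from (int32), number returned (int32)
HEADER = struct.Struct("<iqii")


def parse_reply_header(response):
    """Split a reply body into its header fields and the raw document bytes."""
    flags, cursor_id, starting_from, number_returned = HEADER.unpack_from(response)
    return {
        "query_failed": bool(flags & 2),   # same bit the slide code tests
        "cursor_id": cursor_id,
        "starting_from": starting_from,
        "number_returned": number_returned,
        "documents": response[HEADER.size:],
    }


# Hand-built reply: no flags set, cursor 1234, one empty BSON document.
fake_reply = HEADER.pack(0, 1234, 0, 1) + b"\x05\x00\x00\x00\x00"
parsed = parse_reply_header(fake_reply)
```

The `<` prefix disables alignment padding, so the packed size is exactly 20 bytes and matches the `response[20:]` slice used above.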
Fetching BSON
● Our PyMongo queries will now return
BSON-encoded data that we can then push to
the client
● Let’s fetch the data from the client to close
the loop
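On the server side, pushing the data to the client can be as simple as concatenating the raw documents into the response body. The sketch below assumes a plain WSGI application; `make_app` and `fetch_raw_documents` are hypothetical names, and the talk does not say which framework or content type was actually used.

```python
def make_app(fetch_raw_documents):
    """Build a WSGI app that serves raw BSON documents back to back.

    `fetch_raw_documents` is any callable returning an iterable of
    bytes objects, one per still-encoded BSON document.
    """
    def app(environ, start_response):
        body = b"".join(fetch_raw_documents())
        start_response("200 OK", [
            ("Content-Type", "application/octet-stream"),
            ("Content-Length", str(len(body))),
        ])
        return [body]
    return app


# Exercise the app without a real server or database.
seen = {}

def fake_start_response(status, headers):
    seen["status"] = status
    seen["headers"] = dict(headers)

EMPTY_DOC = b"\x05\x00\x00\x00\x00"
app = make_app(lambda: [EMPTY_DOC, EMPTY_DOC])
body = b"".join(app({}, fake_start_response))
```

Since every BSON document starts with its own length, the client can split the concatenated body again without any extra framing.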
Fetching BSON
function fetch_bson() {
  var BSON = bson().BSON;

  var oReq = new XMLHttpRequest();
  oReq.open("GET", 'http://localhost:8080/results_bson', true);
  oReq.responseType = "arraybuffer";
  oReq.onload = function(e) {
    var data = new Uint8Array(oReq.response);
    var offset = 0;
    var results = [];

    while (offset < data.length)
      offset = BSON.deserializeStream(data, offset, 1, results, results.length, {});

    show_output(results);
  }

  oReq.send();
}
See it in action
Performance Gain
● It all started as a search for a performance boost;
how much did it improve?
JSON: 1239.72 req/sec
BSON: 2079.75 req/sec
False Benchmark
● Benchmark is actually pointless
○ as usual ;)
● We replaced bson.decode_all, which is
written in C, with custom_decode_all, which
is written in Python
○ The two are not really comparable
● Wanna try with PyPy?
Questions?