analytics
about me
who am i ...
Davy Suvee (@DSUVEE)
➡ big data architect @ datablend - continuum
• provide big data and nosql consultancy
• 5 years of hands-on expertise
from data to insights
data analytics in mongodb
chemical similarity use-case
native api
aggregation framework
map/reduce
chemical similarity (1)
★ 31 million compounds available
➡ pubchem
➡ Question:
★ find compounds similar to a particular other compound
chemical similarity (2)
0[N]1[C O]2[C C C]
0[N]1[C O]2[C C C]3[C C C C C]
0[C]1[C C C]2[C C N O]3[C C C C O O]
0[C]1[C C]2[C C C C O]3[C C N O]
0[O]1[C]2[C O]3[C C C]
0[C]1[C O O]2[C C C O]
0[C]1[C C]2[C C]
0[C]1[C]2[C]3[C O]
0[C]1[C C N]2[C C C C O]3[C C C O]
...
chemical similarity (3)
0[N]1[C O]2[C C C]
0[N]1[C O]2[C C C]3[C C C C C]
0[C]1[C C C]2[C C N O]3[C C C C O O]
0[C]1[C C]2[C C C C O]3[C C N O]
0[O]1[C]2[C O]3[C C C]
0[C]1[C O O]2[C C C O]
0[C]1[C C]2[C C]
0[C]1[C]2[C]3[C O]
0[C]1[C C N]2[C C C C O]3[C C C O]
...

0[N]1[C O]2[C C C]3[C C C C C C]
0[C]1[C C C]2[C C N O]3[C C C C O O]
0[C]1[C C]2[C C C C O]3[C C N O]
0[O]1[C]2[C O]3[C C C C]
0[C]1[C O O]2[C C C O]
0[C]1[C C]2[C C]
0[N]1[C O]2[C C C]
0[C]1[C]2[C]3[C O]
0[C]1[C C N]2[C C C C O]3[C C C O]
...
similarity via the tanimoto coefficient
but 31 million calculations?
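For reference, the Tanimoto coefficient of two fingerprint sets is the number of fingerprints they share divided by the number of distinct fingerprints across both. The following is a minimal Java sketch of that definition; the class and method names are illustrative and are not taken from the talk's code.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

class TanimotoSketch {

    // Tanimoto coefficient: |A ∩ B| / (|A| + |B| - |A ∩ B|)
    static double tanimoto(Set<String> a, Set<String> b) {
        Set<String> common = new HashSet<>(a);
        common.retainAll(b); // keep only the shared fingerprints
        int union = a.size() + b.size() - common.size();
        return union == 0 ? 0.0 : (double) common.size() / union;
    }

    public static void main(String[] args) {
        Set<String> first = new HashSet<>(Arrays.asList(
                "0[N]1[C O]2[C C C]", "0[C]1[C C]2[C C]", "0[C]1[C]2[C]3[C O]"));
        Set<String> second = new HashSet<>(Arrays.asList(
                "0[N]1[C O]2[C C C]", "0[C]1[C C]2[C C]", "0[O]1[C]2[C O]3[C C C C]"));
        // 2 shared fingerprints, 4 distinct ones in total
        System.out.println(tanimoto(first, second)); // prints 0.5
    }
}
```

Running this pairwise against every stored compound is exactly the cost the question above is worried about.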
mongodb datamodel (1) compound collection
{ "compound_cid" : "46200001" ,
  "smiles" : "CCC1C(C(C(C(=NOCC=CCN2CCCCC2)C(CC(C(C(C(C(C(=O)O1)C)OC3C" ,
  "fingerprint_count" : 120 ,
  "fingerprints" : [ "0[N]1[C O]2[C C C]" ,
                     "0[N]1[C O]2[C C C]3[C C C C C]" ,
                     "0[C]1[C C C]2[C C N O]3[C C C C O O]" ,
                     "0[C]1[C C]2[C C C C O]3[C C N O]" ,
                     "0[O]1[C]2[C O]3[C C C]" ,
                     "0[C]1[C O O]2[C C C O]" ,
                     "0[C]1[C C]2[C C]" ,
                     "0[C]1[C]2[C]3[C O]" ,
                     "0[C]1[C C N]2[C C C C O]3[C C C O]" ,
                     ... ] ,
}
mongodb datamodel (2) fingerprint collection
{ "fingerprint" : "0[N]1[C O]2[C C C]", "count" : 472}
{ "fingerprint" : "0[N]1[C O]2[C C C]3[C C C C C]", "count" : 41}
{ "fingerprint" : "0[O]1[C]2[C O]3[C C C]", "count" : 1343}
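The fingerprint collection can be read as a precomputed tally of how many compounds contain each fingerprint. A minimal in-memory sketch of that tally follows, assuming each compound lists a fingerprint at most once; the class and method names are hypothetical, and the talk does not show how the collection was actually populated.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class FingerprintCountsSketch {

    // For every fingerprint, count the number of compounds that contain it.
    static Map<String, Integer> countFingerprints(List<List<String>> compoundFingerprints) {
        Map<String, Integer> counts = new HashMap<>();
        for (List<String> fingerprints : compoundFingerprints) {
            for (String fingerprint : fingerprints) {
                counts.merge(fingerprint, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        List<List<String>> compounds = Arrays.asList(
                Arrays.asList("0[N]1[C O]2[C C C]", "0[C]1[C C]2[C C]"),
                Arrays.asList("0[N]1[C O]2[C C C]"));
        Map<String, Integer> counts = countFingerprints(compounds);
        System.out.println(counts.get("0[N]1[C O]2[C C C]")); // prints 2
        System.out.println(counts.get("0[C]1[C C]2[C C]"));   // prints 1
    }
}
```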
querying pattern (1)
★ from 31 million -> potential match
➡ narrow down the search space
➡ imagine an 80% similarity search for a compound with 40 features
➡ candidates need at least 32 features (40 x 0.8)
➡ and at most 50 features (40 / 0.8)
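querying pattern (2)
★ from 31 million -> potential match
➡ narrow down the search space
➡ imagine an 80% similarity search for a compound with 40 features
➡ a match shares at least 32 of the 40 features, so it contains at least one of any 9 of them (40 - 32 + 1 = 9 fingerprints)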
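The bounds on these two slides follow from simple arithmetic on the query compound's fingerprint count and the similarity threshold. Below is a minimal Java sketch, with hypothetical class and method names. It expresses the threshold as a whole percentage to avoid floating-point rounding, and it rounds down, which appears to match the bounds 4 and 1780 used later in the aggregation pipeline for a compound with 89 fingerprints at 0.05.

```java
class SearchSpaceSketch {

    // Lower bound on a candidate's fingerprint count: n * t, rounded down
    static int minFingerprints(int queryCount, int similarityPercent) {
        return queryCount * similarityPercent / 100;
    }

    // Upper bound on a candidate's fingerprint count: n / t, rounded down
    static int maxFingerprints(int queryCount, int similarityPercent) {
        return queryCount * 100 / similarityPercent;
    }

    // How many of the query's fingerprints must be put in the $in clause
    // so that every possible match contains at least one of them
    static int fingerprintsToConsider(int queryCount, int similarityPercent) {
        return queryCount - minFingerprints(queryCount, similarityPercent) + 1;
    }

    public static void main(String[] args) {
        System.out.println(minFingerprints(40, 80));        // prints 32
        System.out.println(maxFingerprints(40, 80));        // prints 50
        System.out.println(fingerprintsToConsider(40, 80)); // prints 9
    }
}
```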
find the fingerprints (1)
// Retrieve the particular compound
DBObject object = compoundsCollection.findOne(QueryBuilder.start("compound_cid").is(compound).get());
// Retrieve the relevant properties
String pubchemcid = (String) object.get(COMPOUNDCID_PROPERTY);
List<Integer> fingerprintstofind = Arrays.asList(((BasicDBList) object.get(FINGERPRINTS_PROPERTY)).toArray(new Integer[]{}));
// Sort the fingerprints on total number of occurrences
fingerprintstofind = findSortedFingerprints(fingerprintstofind);
find the fingerprints (2)
List<Integer> sortedFingerprintsToFind = new ArrayList<Integer>();
// Find all fingerprint count documents
DBObject fingerprintcountquery = QueryBuilder.start(FINGERPRINT_PROPERTY).in(fingerprintsToFind.toArray()).get();
// Only retrieve the fingerprint string itself
DBObject fingerprintcountselection = QueryBuilder.start(FINGERPRINT_PROPERTY).is(1).get();
// Sort the result on count
DBObject fingerprintcountsort = QueryBuilder.start(COUNT_PROPERTY).is(1).get();
// Execute the query on the fingerprint counts collection
DBCursor fingerprintcounts = fingerprintCountsCollection.find(fingerprintcountquery, fingerprintcountselection).sort(fingerprintcountsort);
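The point of sorting on count is that the rarest fingerprints make the most selective query terms. The same idea is sketched below in plain Java against an in-memory map of counts instead of the fingerprint counts collection. The class name, method name, and the use of string fingerprints (as in the data model slides) are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class RarestFingerprintsSketch {

    // Order the fingerprints by ascending occurrence count and keep the first howMany.
    static List<String> rarestFirst(List<String> fingerprints, Map<String, Integer> counts, int howMany) {
        List<String> sorted = new ArrayList<>(fingerprints);
        sorted.sort(Comparator.comparingInt((String f) -> counts.getOrDefault(f, 0)));
        return new ArrayList<>(sorted.subList(0, Math.min(howMany, sorted.size())));
    }

    public static void main(String[] args) {
        // Counts taken from the fingerprint collection slide
        Map<String, Integer> counts = new HashMap<>();
        counts.put("0[N]1[C O]2[C C C]", 472);
        counts.put("0[N]1[C O]2[C C C]3[C C C C C]", 41);
        counts.put("0[O]1[C]2[C O]3[C C C]", 1343);

        List<String> query = Arrays.asList(
                "0[N]1[C O]2[C C C]",
                "0[N]1[C O]2[C C C]3[C C C C C]",
                "0[O]1[C]2[C O]3[C C C]");

        // Keeps the fingerprint with count 41 first, then the one with count 472
        System.out.println(rarestFirst(query, counts, 2));
    }
}
```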
native query (1)
// Find the matching compounds
DBObject compoundquery = QueryBuilder.
    start(FINGERPRINTS_PROPERTY).in(fingerprintsToConsider).
    and(FINGERPRINTCOUNT_PROPERTY).lessThanEquals(maxnumberofcompoundfingerprints).
    and(FINGERPRINTCOUNT_PROPERTY).greaterThanEquals(minnumberofcompoundfingerprints).
    get();
native query (2)
// Execute the query
DBCursor compounds = compoundsCollection.find(compoundquery);
// Let's process the found compounds locally
while (compounds.hasNext()) {
    DBObject compound = compounds.next();
    BasicDBList fingerprints = ((BasicDBList) compound.get(FINGERPRINTS_PROPERTY));
    // Calculate the intersection on the total list of fingerprints
    fingerprints.retainAll(fingerprintsToFind);
    if (fingerprints.size() >= minnumberofcompoundfingerprints) {
        // Calculate the tanimoto coefficient
        ...
    }
}
map/reduce query (1)
map/reduce query (2)
// Find all compounds
DBObject compoundquery = ...
// The map function
String map = "function() { " +
             "  var found = 0; " +
             "  var fingerprintslength = this.fingerprints.length; " +
             "  for (var i = 0; i < fingerprintslength; i++) { " +
             "    if (fingerprintstofind[this.fingerprints[i]] === true) { found++; } " +
             "  } " +
             "  if (found >= minnumberofcompoundfingerprints) { " +
             "    emit (this.compound_cid, {found : found, " +
             "                              total : this.fingerprint_count} ); } " +
             "}";
// Execute the map reduce function (empty reduce, no output collection, inline results)
MapReduceCommand mr = new MapReduceCommand(compoundsCollection, map, "", null, MapReduceCommand.OutputType.INLINE, compoundquery);
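The map function reads fingerprintstofind and minnumberofcompoundfingerprints as global JavaScript variables, so they presumably have to be handed to the server together with the function. The slide does not show that step. The sketch below builds such a lookup in plain Java, assuming it is then passed through the driver's scope mechanism (for example MapReduceCommand.setScope); the class and method names are hypothetical.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class MapReduceScopeSketch {

    // Build the variables the map function expects to find in its scope.
    static Map<String, Object> buildScope(List<Integer> fingerprintsToFind, int minFingerprints) {
        // JavaScript object keys are strings, so each fingerprint becomes a string key
        // mapped to true; the map function then tests lookup[fingerprint] === true.
        Map<String, Boolean> lookup = new HashMap<>();
        for (Integer fingerprint : fingerprintsToFind) {
            lookup.put(String.valueOf(fingerprint), true);
        }
        Map<String, Object> scope = new HashMap<>();
        scope.put("fingerprintstofind", lookup);
        scope.put("minnumberofcompoundfingerprints", minFingerprints);
        return scope;
    }

    public static void main(String[] args) {
        Map<String, Object> scope = buildScope(Arrays.asList(1960, 15111, 94, 26), 4);
        System.out.println(scope.keySet());
    }
}
```

A keyed lookup keeps the per-fingerprint test in the map function at constant cost, instead of scanning an array for every fingerprint of every compound.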
aggregation framework (1)
aggregation framework (2)
{ "aggregate" : "compounds" ,
  "pipeline" : [
    { "$match" : { "fingerprint_count" : { "$gte" : 4 , "$lte" : 1780}}} ,
    { "$unwind" : "$fingerprints"} ,
    { "$match" : { "fingerprints" : { "$in" : [ 1960 , 15111 , ... , 94 , 26]}}} ,
    { "$group" : { "_id" : "$compound_cid" ,
                   "fingerprintmatches" : { "$sum" : 1} ,
                   "totalcount" : { "$first" : "$fingerprint_count"}}} ,
    { "$project" : { "_id" : 1 ,
                     "tanimoto" : { "$divide" : [ "$fingerprintmatches" ,
                                                  { "$subtract" : [ { "$add" : [ 89 , "$totalcount"]} , "$fingerprintmatches"]}]}}} ,
    { "$match" : { "tanimoto" : { "$gte" : 0.05}}}
  ]
}
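To make the stages easier to follow, here is a minimal in-memory Java sketch that mirrors what the pipeline computes for each compound: filter on fingerprint count, count the fingerprints shared with the query, derive the Tanimoto coefficient, and filter on the threshold. It does not call MongoDB. The Compound class and all names are hypothetical, and it assumes a compound lists each fingerprint at most once.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

class PipelineSketch {

    // Hypothetical in-memory stand-in for a document in the compounds collection
    static final class Compound {
        final String cid;
        final List<Integer> fingerprints;

        Compound(String cid, Integer... fingerprints) {
            this.cid = cid;
            this.fingerprints = Arrays.asList(fingerprints);
        }
    }

    static Map<String, Double> similar(List<Compound> compounds, Set<Integer> query,
                                       int min, int max, double threshold) {
        Map<String, Double> result = new LinkedHashMap<>();
        for (Compound c : compounds) {
            int total = c.fingerprints.size();            // fingerprint_count
            if (total < min || total > max) continue;     // first $match
            int matches = 0;
            for (Integer f : c.fingerprints) {            // $unwind, second $match, $group with $sum
                if (query.contains(f)) matches++;
            }
            if (matches == 0) continue;                   // no matching rows, so no group is produced
            // $project: matches / (queryCount + totalcount - matches)
            double tanimoto = (double) matches / (query.size() + total - matches);
            if (tanimoto >= threshold) {                  // final $match
                result.put(c.cid, tanimoto);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Set<Integer> query = new HashSet<>(Arrays.asList(1, 2, 3, 4));
        List<Compound> compounds = Arrays.asList(
                new Compound("a", 1, 2, 3, 5),    // 3 shared: 3 / (4 + 4 - 3) = 0.6
                new Compound("b", 7, 8, 9, 10),   // nothing shared
                new Compound("c", 1, 2, 3, 4));   // identical: 4 / (4 + 4 - 4) = 1.0
        System.out.println(similar(compounds, query, 2, 8, 0.5)); // prints {a=0.6, c=1.0}
    }
}
```

In the pipeline on the slide, the constant 89 plays the role of query.size() here: it is the fingerprint count of the compound being searched for.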
benchmark results
➡ 100K compounds, 0.8 tanimoto
★ native -> 202 ms
★ map/reduce -> 214 ms
★ aggregation framework -> 609 ms
➡ 100K compounds, 0.05 tanimoto
★ native -> 1909 ms
★ map/reduce -> 20978 ms
★ aggregation framework -> 1613 ms
diy mongodb analytics ...
➡ http://datablend.be/?p=256
➡ the joy of algorithms and nosql: a mongodb example
➡ http://github.com/datablend/mongo-compound-comparison-revisited
Questions?
Follow us
twitter.com/data_blend
www.datablend.be
0499/05.00.89
datablend - continuum