Using MongoDB as a graph database - 2014 redux


DESCRIPTION

** An update to the 2012 MongoUK presentation, given at the NoSQL Birmingham/London meetup ** This presentation charts how Talis implemented tripod, a library that runs on top of MongoDB, to provide very high performance query access to large-scale graph datasets. As Talis' own applications became web-scale, the company used tripod as a replacement for its earlier, general-purpose RDF triple store, and maintained the graph model in the code line whilst swapping in MongoDB underneath. By prioritising what really mattered to those applications, and discarding what did not, the company was able to extract extreme performance from graph-based datasets using MongoDB running on commodity hardware. https://github.com/talis/tripod-php https://github.com/talis/tripod-node


Using MongoDB as a Graph Database

Chris Clarke
NoSQL Birmingham
16th October 2014

Graphs 101: For the uninitiated

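[Diagram: two nodes, John and Jane, joined by an edge labelled "knows"]

One edge, no direction: John knows Jane, Jane knows John

One directed edge (John -> Jane): John knows Jane, Jane ? John

Two directed edges (John -> Jane, Jane -> John): John knows Jane, Jane knows John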

RDF

Entity   Property   Value
John     knows      Jane

Subject  Predicate  Object
John     knows      Jane
Jane     knows      John

PREFIX foaf: <http://xmlns.com/foaf/0.1/>

Subject                  Predicate   Object
http://example.com/John  foaf:knows  http://example.com/Jane

PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>

Subject                  Predicate   Object
http://example.com/John  foaf:knows  http://example.com/Jane
http://example.com/John  foaf:name   "John"
http://example.com/John  rdf:type    foaf:Person
http://example.com/Jane  foaf:name   "Jane"
http://example.com/Jane  rdf:type    foaf:Person
http://example.com/Jane  foaf:knows  http://example.com/John

[Diagram: the same six triples drawn as a graph. example:John and example:Jane each point to foaf:Person via rdf:type and to their names ("John", "Jane") via foaf:name, and point to each other via foaf:knows.]

“WTF! Surely this is easier in JSON!”
– Jack Fullstack

> db.people.find()
{ _id: ObjectId("123"), name: "John", knows: [ObjectId("456")] }
{ _id: ObjectId("456"), name: "Jane", knows: [ObjectId("123")] }

foaf:Person

Dataset A:    example:John  foaf:name  "John"
Dataset B:    example:John  foaf:age   24
Dataset A+B:  example:John  foaf:name  "John"
              example:John  foaf:age   24
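The merge pictured above works because both datasets name the node with the same URI, so combining them is just a union of triples. A minimal sketch, assuming triples are plain {s, p, o} objects; that shape and the name mergeGraphs are illustrative and not part of tripod:

```javascript
// Merge any number of triple lists into one graph, dropping exact duplicates.
// Triples are plain objects {s, p, o}; this shape is an assumption for illustration.
function mergeGraphs(...graphs) {
  const seen = new Set();
  const merged = [];
  for (const graph of graphs) {
    for (const t of graph) {
      // Serialise the triple so that identical statements compare equal.
      const key = JSON.stringify([t.s, t.p, t.o]);
      if (!seen.has(key)) {
        seen.add(key);
        merged.push(t);
      }
    }
  }
  return merged;
}

const datasetA = [{ s: "example:John", p: "foaf:name", o: "John" }];
const datasetB = [{ s: "example:John", p: "foaf:age", o: 24 }];

// Both triples now describe the single node example:John.
const merged = mergeGraphs(datasetA, datasetB);
console.log(merged);
```

No join key or schema mapping is needed: the shared subject URI is what ties the two statements together.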

SPARQL: An RDF Query Language

PREFIX foaf: <http://xmlns.com/foaf/0.1/>
SELECT ?name ?email
WHERE {
  ?person a foaf:Person .
  ?person foaf:name ?name .
  ?person foaf:mbox ?email .
}
ORDER BY ?name
LIMIT 50

Query form  Result
CONSTRUCT   Graph
DESCRIBE    Graph
SELECT      Tabular
ASK         Boolean

Graphs and Talis: A bit of history

Over time…

• Our apps became popular. Last week they averaged 4M requests per day, with 600k+ per hour at peak times

• Our dataset is growing in size: about 350M triples this week

• Our apps needed more queries, and more expensive queries

• Our in-house triple store was EoL and out of date

Project Tripod

http://github.com/talis/tripod-php
http://github.com/talis/tripod-node

System characteristics

• 99:1 read:write

• Well shared, tenant based system. Our largest single customer has 35M triples

• Graph data structures and operations (merges, sub-graphs etc.) well entrenched in the codebase, over 2M lines of code (inc. libraries)

• Actually not that many distinct query shapes

Simple Queries, and how they influenced our core data model

Give me all the triples about John as a graph:

DESCRIBE <http://example.com/John>

Give me properties name, age of John as tabular data:

SELECT ?name ?age WHERE {
  <http://example.com/John> foaf:name ?name .
  <http://example.com/John> foaf:age ?age .
}


Concise Bound Description of http://example.com/John

http://example.com/John  foaf:knows  http://example.com/Jane
http://example.com/John  foaf:name   "John"
http://example.com/John  rdf:type    foaf:Person

Concise Bound Description of http://example.com/Jane

http://example.com/Jane  foaf:name   "Jane"
http://example.com/Jane  rdf:type    foaf:Person
http://example.com/Jane  foaf:knows  http://example.com/John

Concise Bound Description of http://example.com/John

http://example.com/John  foaf:knows  http://example.com/Jane
http://example.com/John  foaf:name   "John"
http://example.com/John  rdf:type    foaf:Person

{
  _id: "example:John",
  "foaf:knows": { u: "example:Jane" },
  "rdf:type": { u: "foaf:Person" },
  "foaf:name": { l: "John" }
}

_id is the unique primary key. There can only be one John

u means value is a uri, or another node

l means value is a literal text value
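To make the mapping from triples to documents concrete, here is a minimal sketch that groups triples by subject into one document per CBD. The input shape {s, p, o}, the function name triplesToCbds, and the rule that a repeated predicate becomes an array are assumptions for illustration; tripod's own code may differ:

```javascript
// Group triples by subject into one document per subject (its CBD).
// Each object value is already tagged as {u: ...} (URI) or {l: ...} (literal).
function triplesToCbds(triples) {
  const docs = new Map(); // subject -> document, in first-seen order
  for (const { s, p, o } of triples) {
    if (!docs.has(s)) docs.set(s, { _id: s });
    const doc = docs.get(s);
    if (!Object.prototype.hasOwnProperty.call(doc, p)) {
      doc[p] = o;            // first value for this predicate
    } else if (Array.isArray(doc[p])) {
      doc[p].push(o);        // third and later values
    } else {
      doc[p] = [doc[p], o];  // second value: promote to an array (assumed convention)
    }
  }
  return [...docs.values()];
}

const triples = [
  { s: "example:John", p: "foaf:knows", o: { u: "example:Jane" } },
  { s: "example:John", p: "rdf:type", o: { u: "foaf:Person" } },
  { s: "example:John", p: "foaf:name", o: { l: "John" } },
  { s: "example:Jane", p: "foaf:name", o: { l: "Jane" } },
  { s: "example:Jane", p: "rdf:type", o: { u: "foaf:Person" } },
  { s: "example:Jane", p: "foaf:knows", o: { u: "example:John" } },
];

const cbds = triplesToCbds(triples);
console.log(JSON.stringify(cbds, null, 2));
```

The six triples collapse into two documents, one keyed example:John and one keyed example:Jane, so everything known about a subject sits behind a single primary-key lookup.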

DESCRIBE <http://example.com/John>

mongo$ col.findOne({_id: "example:John"});

SELECT ?name ?age WHERE {
  <http://example.com/John> foaf:name ?name .
  <http://example.com/John> foaf:age ?age .
}

mongo$ col.findOne({_id: "example:John"}, {"foaf:name.l": 1, "foaf:age.l": 1});
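Both simple query shapes reduce to a primary-key filter plus an optional projection. A small sketch of that translation follows; the helper names describeArgs and selectLiteralsArgs are hypothetical, and the returned objects are simply the two arguments you would pass to findOne:

```javascript
// DESCRIBE <uri>: fetch the whole CBD document by primary key.
function describeArgs(id) {
  return { filter: { _id: id }, projection: {} };
}

// SELECT of literal properties of one subject: project only the literal values.
function selectLiteralsArgs(id, properties) {
  const projection = {};
  for (const p of properties) {
    projection[`${p}.l`] = 1; // "foaf:name" -> "foaf:name.l"
  }
  return { filter: { _id: id }, projection };
}

const args = selectLiteralsArgs("example:John", ["foaf:name", "foaf:age"]);
console.log(JSON.stringify(args));
// prints {"filter":{"_id":"example:John"},"projection":{"foaf:name.l":1,"foaf:age.l":1}}
```

Either way the database does one lookup on _id, which is why the one-document-per-subject model suits these two shapes.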

{ s: "example:John", p: "foaf:knows", o: { u: "example:Jane" } },
{ s: "example:John", p: "rdf:type", o: { u: "foaf:Person" } },
{ s: "example:John", p: "foaf:name", o: { l: "John" } }

DESCRIBE <http://example.com/John>

mongo$ var s = col.find({s: "example:John"});
mongo$ while (s.hasNext()) { addToGraph(s.next()); }

SELECT ?name ?age WHERE {
  <http://example.com/John> foaf:name ?name .
  <http://example.com/John> foaf:age ?age .
}

mongo$ col.find({s: "example:John", p: "foaf:name"}, {o: 1});
mongo$ col.find({s: "example:John", p: "foaf:age"}, {o: 1});

DESCRIBE ?person WHERE { ?person foaf:name "John" . }

mongo$ var s = col.find({p: "foaf:name", "o.l": "John"}); // BasicCursor = slow

{ _id: "example:John", "foaf:knows": { u: "example:Jane" }, "rdf:type": { u: "foaf:Person" }, "foaf:name": { l: "John" } }

DESCRIBE ?person WHERE { ?person foaf:name "John" . }

mongo$ col.ensureIndex({"foaf:name.l": 1});
mongo$ var s = col.find({"foaf:name.l": "John"}); // BTreeCursor = fast

Complex Queries

DESCRIBE <http://example.com/foo> ?sectionOrItem ?resource ?document ?authorList ?author ?usedBy ?creator ?libraryNote ?publisher
WHERE {
  OPTIONAL {
    <http://example.com/foo> resource:contains ?sectionOrItem .
    OPTIONAL {
      ?sectionOrItem resource:resource ?resource .
      OPTIONAL { ?resource dcterms:isPartOf ?document . }
      OPTIONAL {
        ?resource bibo:authorList ?authorList .
        OPTIONAL { ?authorList ?p ?author . }
      }
      OPTIONAL { ?resource dcterms:publisher ?publisher . }
    }
    OPTIONAL { ?libraryNote bibo:annotates ?sectionOrItem }
  } .
  OPTIONAL { <http://example.com/foo> resource:usedBy ?usedBy } .
  OPTIONAL { <http://example.com/foo> sioc:has_creator ?creator }
}

“We don’t need dynamic queries”
– Project Tripod Team, sometime 2012

Precomputed views: Remember those from the RDBMS?

{ _id: "example:John", "foaf:knows": { u: "example:Jane" }, "rdf:type": { u: "foaf:Person" }, "foaf:name": { l: "John" } }

{ _id: "example:Jane", "foaf:knows": { u: "example:John" }, "rdf:type": { u: "foaf:Person" }, "foaf:name": { l: "Jane" } }

DESCRIBE example:John ?knownPerson WHERE { example:John foaf:knows ?knownPerson . }

mongo$ var john = col.findOne({_id: "example:John"});
mongo$ var knows = [].concat(john["foaf:knows"]); // normalise a single value to an array
mongo$ for (var i = 0; i < knows.length; i++) {
         var knownPerson = col.findOne({_id: knows[i].u}); // one extra round trip per joined node
       }

System characteristics

• 99:1 read:write

• Well shared, tenant based system. Our largest single customer has 35M triples

• Graph data structures and operations (merges, sub-graphs etc.) well entrenched in the codebase, over 2M lines of code (inc. libraries).

• Actually not that many distinct query shapes.

{
  _id: { r: "example:John", t: "v_knows" },
  graphs: [
    { _id: "example:John", "foaf:knows": { u: "example:Jane" }, "rdf:type": { u: "foaf:Person" }, "foaf:name": { l: "John" } },
    { _id: "example:Jane", "foaf:knows": { u: "example:John" }, "rdf:type": { u: "foaf:Person" }, "foaf:name": { l: "Jane" } }
  ],
  _impactIndex: ["example:Jane", "example:John"]
}

DESCRIBE example:John ?knownPerson WHERE { example:John foaf:knows ?knownPerson . }

mongo$ viewsCol.findOne({_id: {r: "example:John", t: "v_knows"}})

View specification

{
  "_id": "v_knows",
  "type": ["foaf:Person"],
  "from": "CBD_people",
  "joins": {
    "foaf:knows": {}
  }
}
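A rough sketch of how a view document could be generated from a specification like this: start at the root resource, follow each predicate named under "joins", and collect every CBD reached. The in-memory Map standing in for a collection and the names valuesOf and buildView are assumptions, and followSequence/maxJoins are not handled. The _impactIndex here is read as "every resource the view contains", which is an interpretation of the documents shown rather than a statement of tripod's implementation:

```javascript
// Return the values of a predicate as an array, whether stored singly or as an array.
function valuesOf(doc, predicate) {
  const v = doc[predicate];
  if (v === undefined) return [];
  return Array.isArray(v) ? v : [v];
}

// Build a view document for one root resource by following the joins in a spec.
// `cbds` is a Map of _id -> CBD document, standing in for a MongoDB collection.
function buildView(cbds, spec, rootId) {
  const graphs = new Map(); // _id -> CBD, de-duplicated by _id

  // Recursion depth is bounded by how deeply the spec nests its joins.
  function follow(id, joins) {
    const doc = cbds.get(id);
    if (!doc) return;
    graphs.set(id, doc);
    for (const [predicate, rule] of Object.entries(joins || {})) {
      for (const value of valuesOf(doc, predicate)) {
        if (value.u !== undefined) follow(value.u, rule.joins); // only URIs can be joined
      }
    }
  }

  follow(rootId, spec.joins);
  return {
    _id: { r: rootId, t: spec._id },
    graphs: [...graphs.values()],
    _impactIndex: [...graphs.keys()],
  };
}

const cbds = new Map([
  ["example:John", { _id: "example:John", "foaf:knows": { u: "example:Jane" }, "foaf:name": { l: "John" } }],
  ["example:Jane", { _id: "example:Jane", "foaf:knows": { u: "example:John" }, "foaf:name": { l: "Jane" } }],
]);
const spec = { _id: "v_knows", type: ["foaf:Person"], from: "CBD_people", joins: { "foaf:knows": {} } };

const view = buildView(cbds, spec, "example:John");
console.log(JSON.stringify(view, null, 2));
```

The cost of the joins is paid once, at write time; reads then fetch the whole precomputed result with a single findOne on the view's _id.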

More complex example

{
  "_id": "v_resources",
  "type": ["resourcelist:Resource"],
  "from": "CBD_resources",
  "joins": {
    "dct:partOf": {
      "joins": {
        "bibo:authorList": { "joins": { "followSequence": { "maxJoins": 50 } } },
        "bibo:editorList": { "joins": { "followSequence": { "maxJoins": 50 } } },
        "dct:publisher": {}
      }
    },
    "dct:isPartOf": {
      "joins": {
        "bibo:authorList": { "joins": { "followSequence": { "maxJoins": 50 } } },
        "bibo:editorList": { "joins": { "followSequence": { "maxJoins": 50 } } },
        "dct:publisher": {}
      }
    },
    "bibo:authorList": { "joins": { "followSequence": { "maxJoins": 50 } } },
    "bibo:editorList": { "joins": { "followSequence": { "maxJoins": 50 } } },
    "dct:publisher": {}
  }
}

What about tabular data?

• We also have tables and table specs

• Conceptually the same as views

• Instead of an array of graphs we have computed columns for complex tabular queries

• You can page, limit, offset results just like you’d expect

{
  "_id": {
    "r": "http://example.com/users/FC44E153-161C-C199-DBAB-4DDE13F76F9B/bookmarks/1ABE1B4B-A68C-90E4-41DB-AF132854770F",
    "type": "t_user_resources"
  },
  "value": {
    "_impactIndex": [
      {
        "r": "http://example.com/users/FC44E153-161C-C199-DBAB-4DDE13F76F9B/bookmarks/1ABE1B4B-A68C-90E4-41DB-AF132854770F",
        "c": "tenantContexts:DefaultGraph"
      },
      {
        "r": "tenantResources:7AB1D8E3-5D74-D07F-41E7-56206CFEC8EE",
        "c": "tenantContexts:DefaultGraph"
      }
    ],
    "collection": "http://example.com/users/FC44E153-161C-C199-DBAB-4DDE13F76F9B/bookmarks",
    "createdDate": "2011-02-08T15:59:45+00:00",
    "resourceUri": "tenantResources:7AB1D8E3-5D74-D07F-41E7-56206CFEC8EE",
    "note": "ELECTRONIC",
    "title": "Feminism & psychology",
    "type": [
      "resourcelist:Resource",
      "bibo:Journal"
    ]
  }
}

Database layout

talis-rs:PRIMARY> show collections
CBD_config
CBD_draft
CBD_events
CBD_jobs
CBD_lists
CBD_nodes
CBD_resources
CBD_reviews
CBD_service
CBD_user_lists
CBD_user_resources
CBD_users
table_rows
views

CBD_* collections: r/w
table_rows, views: read only

Fast and slow saves, you decide.

Tripod save()

• Based on change sets, you supply the old and new graphs

• CBDs updated immediately. Write-ahead transaction log for multi-CBD writes

• Choice per save on whether to update views/tables sync or async (eventually consistent)

• Async adds jobs to a Mongo based queue
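The save flow above can be sketched as follows: diff the old and new graphs into a change set, find the views whose _impactIndex mentions a changed subject, then either regenerate them straight away or queue a job. All function names, the {s, p, o} triple shape, and the plain array standing in for the Mongo-based queue are assumptions for illustration, not tripod's API:

```javascript
// Compute a change set between the old and new versions of a graph.
function diffGraphs(oldGraph, newGraph) {
  const key = (t) => JSON.stringify([t.s, t.p, t.o]);
  const oldKeys = new Set(oldGraph.map(key));
  const newKeys = new Set(newGraph.map(key));
  return {
    removals: oldGraph.filter((t) => !newKeys.has(key(t))),
    additions: newGraph.filter((t) => !oldKeys.has(key(t))),
  };
}

// Find the views whose _impactIndex mentions any changed subject.
function impactedViews(views, changeSet) {
  const changed = new Set(
    [...changeSet.removals, ...changeSet.additions].map((t) => t.s)
  );
  return views.filter((v) => v._impactIndex.some((id) => changed.has(id)));
}

// Per-save choice: regenerate now (sync) or queue a job (async, eventually consistent).
function updateViews(views, changeSet, { async, queue, regenerate }) {
  for (const view of impactedViews(views, changeSet)) {
    if (async) queue.push({ viewId: view._id });
    else regenerate(view);
  }
}

const oldGraph = [{ s: "example:John", p: "foaf:name", o: { l: "John" } }];
const newGraph = [{ s: "example:John", p: "foaf:name", o: { l: "Johnny" } }];
const views = [
  { _id: { r: "example:John", t: "v_knows" }, _impactIndex: ["example:Jane", "example:John"] },
  { _id: { r: "example:Bob", t: "v_knows" }, _impactIndex: ["example:Bob"] },
];

const changeSet = diffGraphs(oldGraph, newGraph);
const queue = [];
updateViews(views, changeSet, { async: true, queue, regenerate: () => {} });
console.log(changeSet, queue);
```

Only John's view is queued; Bob's view is untouched because none of its resources changed. With a 99:1 read:write ratio, doing this extra work on each write is a cheap price for single-lookup reads.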

Measure everything

[Chart: Query volume, complex vs. simple]

[Chart: Query volume, graph vs. tabular]

[Chart: Query speed, complex vs. simple graph query]

Hardware

• Real tin, 2x Dell low-end rack mount servers

• 96GB RAM, 24 cores

• RAID-10 disks, non-SSD

• Keep ‘em on the same LAN as your app servers

• About the same to lease per month as a couple of c3.4xlarge (30GB, 32vCPU)

• We’re about to add a similar second cluster, 144GB

Why Mongo? RTFM, not HN comment feeds.

But seriously it could have been n other document DBs

There’s lots more: Search, named graphs (quads), data functions

Future roadmap

• Multi-cluster <- IN PROGRESS

• NodeJS port <- IN PROGRESS

• Choose better solution for tlog, probably PostgreSQL

• Background queue -> redis and resque

• Chainable API

• Spout of updates for Apache Storm

• Versioned views/tables config

Aperture: Annotate your models to persist to graph

[Code screenshots: tripod-php code… …same in aperture]

@talis
facebook.com/talisgroup
+44 (0) 121 374 2740
talis.com
info@talis.com
48 Frederick Street, Birmingham, B1 3HN
