FASTER LAP TIMES WITH NEO4J Srinivas Suravarapu Chief Architect Scribestar Ltd @srinivas_s
Scribestar A content collaboration platform for the legal community.
The solution is targeted at lawyers and how they draft legal content.
From an information systems point of a view it is a collaboration platform and is concerned with getting related users changing related content to work effectively.
We are about 20 people and the platform is built on on .NET and uses Neo4j as its primary store. (now)
Relational Stores Looking back at some Content management systems. They tend to base themselves on a Relational DBMS and
serve content using BLOBs As content grows, a monolithic store pretty quickly starts affecting users ability to perform functions which do not have anything to do with content itself. BLOBs are good for a few pieces of content but when all you have is content, you have to go back to the drawing board on where you store it.
NOSQL Stores Some have managed to use document oriented databases Able to serve large content to the web quickly. When you combine content in some form of ML format with fragments of relational data inevitably present in every system, they rely heavily on how you model your aggregates. Working with multiple aggregates sitting behind service boundaries tend to bring up consistency issues, and we tend to offload a lot of implementation complexity to the application tier.
Polyglot Persistence Using multiple storage technologies to store the information is inevitable. The type of data and how its consumed by parts of your application should be the driver to choose your data store. We store our relation information using a graph and store content in a file store. The ability of the user to collaborate effectively on the content is isolated and not affected by users collaborating on the metadata.
Wish list Constantly being compared to the capabilities of desktop
publishing tools Should be fast and secure We cannot loose content Corruption of content is not an option Everyone should be able to change the same piece of
content at the time (You Only think you need it)
Modeling the domain
Modeling the domain Stay as close as possible to the domain and let the graph
be a reflection of the users actions in the system over time Bounded contexts still apply and the rules of how you
share information between two aggregates remain, think how you can have multiple graphs that are smaller.
Keeping the graph acyclic and directed keeps it simple however this is entirely based on the context of your problem.
Code Tips Principles of how to interact with a database havent
changed The notion of using parameters, indexes or constraints exists. Dont read the same information repeatedly if it doesnt change Writes, the principles of concurrency haven't changed. Watch out for queries that are reused, its easier to write separate
queries for separate concerns, duplication is fine. The unexpected side effects of query reuse for different concerns
turns out to be a killer. Use the profile and explain options to analyse your queries
Cypher Tuning Neo4j
Switch query logging on to capture slow running queries, threshold is subject to what you want
Switch metrics to be output to graphite or CSV files Have a suite of tests which run regularly and test concurrency and
load. Use the feedback and tweak any slow running queries. Repeat the exercise until you dont find any queries being written
into the querylog, that should ensure you have fast queries Always look at getting to the node you are interested first like
SELECT on SELECT in SQL , MATCH on MATCH is effective.
The business benefits We built our new solution in 8 months, compared to the
former that was built for about 2+ years We did this with half the size of the original team The system is at least X times quicker than where it used
to be, where X is a two digit number J The complexity of work has reduced Indicators
Team hasnt come up with I dont know how big this is in a while With the definition of the cycle time widening, the cycle times have
dropped for the same complexity
The business benefits The complexity of work reduced
With the definition of the cycle time widening, the cycle times to drop the same level of complexity reduced significantly.
Being able to visualize the data real time provides valuable analytics for the user
Cypher is absolutely powerful in its ability to get you to where you need to on the graph
Reduces the need to understand the implementation immediately to some degree.
Moving into the future Visualizing the information using tools to get some insight into
user behavior, this will help us evolve the product. Some of the principles we have used should help us scale out
without contention famous last words , remains to be seen. Who knows we may be able to store large files in hybrid
technology of Neo and something else, alternative stores like Riak or any self hosted S3 styled products
Would be great to have a light-weight Linkurious plugin on the neo dashboard
Precedents and taxonomy subject to research
The agility you obtain using a graph is great, changing the underlying model is no where near as painful or dreadful, the value of visualization simply exceeds any cost involved in the transition
See Some Code?