OrientDB: Unlock the Value of Document Data Relationships
Fabrizio Fortino @fabriziofortino, 11th April 2016, #HUGIreland
@boistartups
The world is changing
Unstructured Data
Big Data Explosion
Connected Data
Mobile, IoT
http://destinhaus.com/internet-of-things-the-rise-of-smart-manufacturing/
“… starting a new strategic enterprise application you should no longer be assuming that your persistence should be relational. The relational option might be the right one - but you should seriously look at other alternatives.”
Polyglot Persistence [2011] Martin Fowler
Rethink how we store data
A Polyglot Persistence example
E-commerce Application
Primary Store+
Financial Data (RDBMS)
Recommendations (Graph)
Products Catalog (Document)
User Sessions (Key-Value)
ETL Jobs / Data Synchronisation
• Hire experts for each database type
• No standards across NoSQL products
• Increased overall complexity
• High TCO
• Write and maintain ETL and data synchronisation
• Hard to refactor
• Testing can be tough
More flexibility, at what price?
Entering Multi-Model Databases
Graph
Document
Object
Key/Value
Full-Text
Spatial
Multi-Model represents the intersection
of multiple models in a single product
Product Positioning Quadrant
[Figure: Relationship Complexity (vertical axis) against Data Complexity (horizontal axis), positioning the following models]
Relational
Key Value
Column
Graph
Document
Multi-Model
• First Multi-Model DBMS with a Graph Engine
• Community Edition FREE (Apache v2 License)
• Enterprise Edition (profiler, live monitor, telereporter, etc)
• Vibrant community (≈ 100 contributors, ≈ 15K commits)
• Easy to install and use
• Zero configuration Multi-Master Architecture
• ACID
• Reactive (Live Queries)
OrientDB at a Glance
Quite a long journey
[Timeline figure. Years marked: 1998, 2009, 2010, 2011, 2012, 2013, 2014, 2015, 2016]
• Orient ODBMS: first ever ODBMS with index-free adjacency
• R&D
• OrientDB: first ever multi-model DBMS released as Open Source
• OrientDB Enterprise Launch
• Downloads / month (values shown on the chart): 0, 200, 1K, 3K, 12K, 70K
Under the hood
Storage
• Memory: works in memory only (ideal for integration testing)
• PLocal: writes/reads to/from the file system
• Remote: delegates all operations to a remote server
• Document API: handles records as documents
• Graph API: TinkerPop Blueprints implementation
• Object API: POJO to document mapping
User Application
• Embedded (in-process)
• Single, Standalone Node
• Multi-Master Replica
• Mixed
Deployment options
Document API
• Lowest level API
• A document (record) is the unit of storage
• An immutable id (ORID) is automatically assigned to each document
• Documents can contain key-value pairs or nested/embedded documents (no ORID)
• Transactions support (optimistic mode with MVCC)
• Classes are logical sets of documents
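The optimistic (MVCC) transaction mode in the list above can be illustrated with a toy in-memory store. This is a minimal sketch and not OrientDB's API: `ToyStore`, `ConcurrentModificationError`, and the integer version counter are hypothetical names chosen for illustration. Each record carries a version, and an update is accepted only if the writer read the latest version.

```python
class ConcurrentModificationError(Exception):
    """Raised when a writer holds a stale version (hypothetical name)."""


class ToyStore:
    """Toy in-memory document store with per-record versions (not OrientDB's API)."""

    def __init__(self):
        self._records = {}        # rid -> (version, document)
        self._next_position = 0

    def create(self, cluster_id, document):
        # The record id is assigned by the store and never changes afterwards.
        rid = f"#{cluster_id}:{self._next_position}"
        self._next_position += 1
        self._records[rid] = (0, dict(document))
        return rid

    def read(self, rid):
        version, document = self._records[rid]
        return version, dict(document)

    def update(self, rid, expected_version, document):
        current_version, _ = self._records[rid]
        if current_version != expected_version:
            # Someone else committed first: reject instead of overwriting.
            raise ConcurrentModificationError(rid)
        self._records[rid] = (current_version + 1, dict(document))
        return current_version + 1


store = ToyStore()
rid = store.create(12, {"name": "Fabrizio"})

# Two writers read the same version of the record.
version_a, doc_a = store.read(rid)
version_b, doc_b = store.read(rid)

doc_a["city"] = "Dublin"
store.update(rid, version_a, doc_a)      # first writer commits

doc_b["nationality"] = "IT"
try:
    store.update(rid, version_b, doc_b)  # stale version: rejected
    conflict = False
except ConcurrentModificationError:
    conflict = True
```

No locks are held while the writers work; the conflict is detected only at commit time, and the losing writer is expected to re-read and retry.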
Schema-less, Schema-full or Hybrid?
• Schema-less: relaxed model, the type of each field is inferred for each document
• Schema-full: strict model, schema with constraints on fields and validation rules
• Hybrid: mixed model, schema with mandatory and optional fields with constraints and validation rules
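The three modes can be sketched as a small validation function. This is an illustration under assumptions, not OrientDB's validation engine: the `validate` function, the `(type, mandatory)` schema tuples, and the choice that only the schema-full mode rejects undeclared fields are all hypothetical simplifications.

```python
def validate(document, schema, mode):
    """Validate a document against a toy schema.

    schema maps field name -> (type, mandatory).
    mode is "schema-less", "schema-full" or "hybrid".
    Returns a list of error strings (empty when the document is accepted).
    """
    if mode == "schema-less":
        return []  # nothing is declared, so nothing is checked

    errors = []
    for field, (field_type, mandatory) in schema.items():
        if field not in document:
            if mandatory:
                errors.append(f"missing mandatory field: {field}")
        elif not isinstance(document[field], field_type):
            errors.append(f"wrong type for field: {field}")

    if mode == "schema-full":
        # Assumption: the strict mode also rejects fields the schema does not declare.
        for field in document:
            if field not in schema:
                errors.append(f"undeclared field: {field}")
    return errors


schema = {"name": (str, True), "city": (str, False)}
document = {"name": "Fabrizio", "nickname": "fab"}

results = {
    mode: validate(document, schema, mode)
    for mode in ("schema-less", "hybrid", "schema-full")
}
```

With this toy schema the same document passes in the schema-less and hybrid modes and is rejected only in the schema-full mode, because of the undeclared `nickname` field.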
• Classes can inherit from other classes, forming a tree (similar to RDF Schema)
• A sub-class inherits all the schema fields from the parents
• An abstract class is used as the foundation for other classes (it cannot have records)
• Class hierarchies allow native polymorphic queries
• 1 to 1 mapping with domain objects
Class concept is taken from OOP
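A polymorphic query over a class hierarchy can be sketched as follows. The `SchemaClass` type, the `polymorphic_select` helper, and the `person` and `speaker` classes are hypothetical and exist only for this illustration; they are not part of OrientDB's API.

```python
class SchemaClass:
    """Toy schema class with single inheritance (not OrientDB's API)."""

    def __init__(self, name, parent=None, abstract=False):
        self.name = name
        self.parent = parent
        self.abstract = abstract
        self.records = []

    def is_a(self, other):
        # Walk up the hierarchy until `other` is found or the root is passed.
        cls = self
        while cls is not None:
            if cls is other:
                return True
            cls = cls.parent
        return False

    def add(self, document):
        if self.abstract:
            raise ValueError(f"abstract class {self.name} cannot have records")
        self.records.append(document)


def polymorphic_select(target, all_classes):
    """Return the records of `target` and of every class that inherits from it."""
    return [doc for cls in all_classes if cls.is_a(target) for doc in cls.records]


person = SchemaClass("person", abstract=True)
user = SchemaClass("user", parent=person)
speaker = SchemaClass("speaker", parent=user)
meetup = SchemaClass("meetup")

user.add({"name": "Uli"})
speaker.add({"name": "Fabrizio"})
meetup.add({"name": "HUG Ireland"})

classes = [person, user, speaker, meetup]
names = [doc["name"] for doc in polymorphic_select(person, classes)]
```

Selecting from the abstract `person` class returns the records of `user` and `speaker` but not those of `meetup`, which sits outside the hierarchy.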
Let’s create a Document
{
  "@rid": "#12:216",
  "@class": "user",
  "name": "Fabrizio",
  "meetups": [
    { "name": "HUG Ireland", "city": "Dublin", "since": "14-03-2014" }
  ],
  "details": {
    "@type": "d",
    "@class": "user_details",
    "city": "Dublin",
    "nationality": "IT"
  }
}
"@rid": Immutable Record ID
"@class": Logical set
"name": Property
"meetups": Array of objects
"details": Embedded document
With a traditional Document DB you have to duplicate your data to some degree. The degree depends on how complex the interdependencies of the application domain are.
OrientDB combines the unique flexibility of documents with the power of graphs to unlock the business value of Document Data Relationships.
Graphs: everything old is new again
https://en.wikipedia.org/wiki/Seven_Bridges_of_Königsberg
What is a Graph Database?
“A Graph Database is any storage system that provides index-free adjacency”
The Graph Traversal Pattern [2010] Marco A. Rodriguez
G = (V, E): a Graph G is a set of Vertices V connected by a set of Edges E
• Given a User (Fabrizio)
• Find Fabrizio (id=10) in member table O(log n)
• Find 18 and 24 (HUG Ireland & Microservices) in Meetup table O(log n)
What’s wrong with joins?
User
name      id
Fabrizio  10
Uli       12
John      13
Eddie     88

member
user_id  meetup_id
10       18
10       24
13       18
88       66

Meetup
id  name
18  HUG Ireland
57  AWS Users
24  Microservices
66  Scala
• Joins are computed every time you cross relationships
• Time complexity grows with data: O(log n)
• Joining 3-4 tables with millions of records can produce billions of combinations
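The lookups described for the join can be sketched with sorted lists and binary search standing in for relational indexes. This is a simplified illustration using the tables from the slide; the `meetups_of` function is hypothetical, and the user lookup is reduced to a dictionary for brevity.

```python
from bisect import bisect_left

# Tables from the slide, stored as rows.
users = [("Fabrizio", 10), ("Uli", 12), ("John", 13), ("Eddie", 88)]
member = [(10, 18), (10, 24), (13, 18), (88, 66)]
meetups = sorted([(18, "HUG Ireland"), (57, "AWS Users"), (24, "Microservices"), (66, "Scala")])

# Sorted key lists stand in for the indexes a relational engine would maintain.
member_by_user = sorted(member)
meetup_ids = [meetup_id for meetup_id, _ in meetups]


def meetups_of(user_name):
    """Resolve a user's meetups by searching the join table, then the meetup index."""
    user_id = dict(users)[user_name]  # simplified user lookup
    # Binary search in the join table: O(log n) to find the first matching row.
    start = bisect_left(member_by_user, (user_id,))
    names = []
    for uid, meetup_id in member_by_user[start:]:
        if uid != user_id:
            break
        # Another binary search per matching row, this time in the meetup index.
        position = bisect_left(meetup_ids, meetup_id)
        names.append(meetups[position][1])
    return names


result = meetups_of("Fabrizio")
```

Every relationship crossed costs one more search whose cost depends on the table size, which is the behaviour the bullets above describe.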
• Given a User (Fabrizio)
• Traverse the member edges to reach HUG Ireland O(1) and Microservices O(1)
• Fabrizio is the index to reach the linked Meetups!
The Graph as an Index
• Every vertex and edge is “hard wired” to its adjacent vertex or edge
• Traversing an edge does not require complex computation, near O(1)
• The traversal time is not affected by the database size
Fabrizio --member--> HUG Ireland
Fabrizio --member--> Microservices
Easier to sketch!
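Index-free adjacency can be sketched with vertices that hold direct references to their neighbours. The `Vertex` class below is a hypothetical toy, not the OrientDB or TinkerPop API, and the extra `user-N` vertices are invented filler data.

```python
class Vertex:
    """Toy vertex holding direct references to its adjacent vertices."""

    def __init__(self, name):
        self.name = name
        self.out_edges = {}  # edge label -> list of adjacent Vertex objects

    def add_edge(self, label, target):
        self.out_edges.setdefault(label, []).append(target)

    def out(self, label):
        # No global index is consulted: the neighbours are referenced right here.
        return self.out_edges.get(label, [])


fabrizio = Vertex("Fabrizio")
hug = Vertex("HUG Ireland")
micro = Vertex("Microservices")
fabrizio.add_edge("member", hug)
fabrizio.add_edge("member", micro)

# Unrelated vertices grow the database but not the work done by the traversal.
others = [Vertex(f"user-{i}") for i in range(10_000)]

meetup_names = [v.name for v in fabrizio.out("member")]
```

The traversal touches only the starting vertex and its two neighbours, no matter how many other vertices exist.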
Combine Documents with Graphs
{
  "@rid": "12:216", "@class": "user", "name": "Fabrizio",
  "details": {
    "@type": "d", "@class": "user_detail",
    "city": "Dublin", "nationality": "IT"
  }
}
{ "@rid": "13:12", "@class": "meetup", "name": "HUG Ireland", "city": "Dublin" }
{ "@rid": "14:32", "@class": "member", "since": "14-03-2014", "in": "12:216", "out": "13:12" }
out_member=14:32 in_member=14:32
{ "@rid": "15:79", "@class": "talk", "title": "OrientDB", "on": "11-04-2016", "in": "12:216", "out": "13:12" }
out_talk=15:79 in_talk=15:79
Multi-relational Document Graph
Would you believe me if I said you can query documents and graphs with an SQL-like syntax?
Show me something now! OK, time for a quick demo.
http://www.sharegoodstuffs.com/2011_12_12_archive.html
Use Case: raise standards in Irish Public Office
• Aggressive deadline
• Large amount of data from different sources with different formats
• Messy, dirty data
• Connecting records from different sources that represent the same thing without a common identifier
• Multi-step traversal of fixed and inferred links to identify disparate entities connected by a path
The challenges
The solution
OrientDB
Fuzzy Inference Engine
• Main Language: Groovy
• Database Type: OrientDB Embedded
• Fuzzy Inference Engine: Duke
• minHash proximity index based on Lucene to avoid cartesian product
• probabilistic model with configurable statistical algorithms (Levenshtein, NGram, Soundex, Custom, etc) to identify the same entities despite differences
• End-To-End Process Time < 10 min
• Deliverable: Database
• Preset of queries to answer the main questions (analysts can add or modify WHERE conditions on their own)
• Graph View to visually search and visualise data
Technical Details
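The idea of identifying the same entity despite differences can be sketched with a plain Levenshtein comparison. This is not Duke's API or configuration: the `similarity` and `same_entity` helpers, the lower-casing step, and the 0.8 threshold are assumptions made for illustration only.

```python
def levenshtein(a, b):
    """Edit distance between two strings (classic dynamic programming)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution
            ))
        previous = current
    return previous[-1]


def similarity(a, b):
    """Normalise the distance to a 0..1 score, ignoring case and outer spaces."""
    a, b = a.strip().lower(), b.strip().lower()
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def same_entity(a, b, threshold=0.8):
    # The threshold is an arbitrary illustrative value, not taken from the project.
    return similarity(a, b) >= threshold


typo_match = same_entity("Fabrizio Fortino", "Fabrizo Fortino")
different = same_entity("HUG Ireland", "AWS Users")
```

Comparing every record with every other record this way is a cartesian product, which is why the bullets above mention a proximity index to narrow down the candidate pairs first.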
What people from home perceived
≈ 20K tweets
Top hashtag in Ireland for 24 hours: #rteinvestigates
“While we’ve long understood the value of Big Data to better understand how people interact with us, we’ve noticed an alarming trend of Big Data envy: organizations using complex tools to handle “not-really-that-big” Data. Distributed map-reduce algorithms are a handy technique for large data sets, but many data sets we see could easily fit in a single node relational or graph database. Even if you do have more data than that, usually the best thing to do is to first pick out the data you need, which can often then be processed on such a single node”
OK but what about Big Data?
ThoughtWorks Technology Radar, 5 April 2016
Begin the journey!
https://www.udemy.com/orientdb-getting-started/
• http://martinfowler.com/bliki/PolyglotPersistence.html
• https://en.wikipedia.org/wiki/Multi-model_database
• http://orientdb.com/
• https://en.wikipedia.org/wiki/Seven_Bridges_of_Königsberg
• http://arxiv.org/pdf/1004.1001.pdf
• https://www.udemy.com/orientdb-getting-started/
• http://www.rte.ie/news/investigations-unit/2015/1207/751833-rte-investigates/
• https://github.com/larsga/Duke
• https://www.thoughtworks.com/radar
Resources
Q & A
Thank you!