OrientDB: Unlock the Value of Document Data Relationships

Fabrizio Fortino (@fabriziofortino), 11th April 2016, #HUGIreland

@boistartups

The world is changing

Unstructured Data

Big Data Explosion

Connected Data

Mobile, IoT

http://destinhaus.com/internet-of-things-the-rise-of-smart-manufacturing/

“… starting a new strategic enterprise application you should no longer be assuming that your persistence should be relational. The relational option might be the right one - but you should seriously look at other alternatives.”

Polyglot Persistence [2011] Martin Fowler

Rethink how we store data

A Polyglot Persistence example

E-commerce Application

Primary Store+

Financial Data (RDBMS)

Recommendations (Graph)

Products Catalog (Document)

User Sessions (Key-Value)

ETL Jobs / Data Synchronisation

More flexibility, at what price?

• Hire experts for each database type

• No standards between NoSQL products

• Increased overall complexity

• High TCO

• Write and maintain ETL and data synchronisation

• Hard to refactor

• Testing can be tough

Entering Multi-Model Databases

Graph

Document

Object

Key/Value

Full-Text

Spatial

Multi-Model represents the intersection of multiple models in a single product

Product Positioning Quadrant

Axes: Relationship Complexity > and Data Complexity >

Products positioned: Relational, Key Value, Column, Graph, Document, Multi-Model

OrientDB at a Glance

• First Multi-Model DBMS with a Graph Engine

• Community Edition FREE (Apache v2 License)

• Enterprise Edition (profiler, live monitor, teleporter, etc.)

• Vibrant community (≈ 100 contributors, ≈ 15K commits)

• Easy to install and use

• Zero configuration Multi-Master Architecture

• ACID

• Reactive (Live Queries)

Quite a long journey

Timeline: 1998, 2009, 2010, 2011, 2012, 2013, 2014, 2015, 2016

Milestones shown:

• Orient ODBMS: First ever ODBMS with index-free adjacency

• R&D

• OrientDB: First ever multi-model DBMS released as Open Source

• OrientDB Enterprise Launch

Downloads / month (chart values): 0, 200, 1K, 3K, 12K, 70K

Under the hood

Storage

Memory: Works in Memory Only (Ideal for Integration Testing)

PLocal: Write/Read to/from File System

Remote: Delegates all Operations to a Remote Server

Document API: Handles Records as Documents

Graph API: TinkerPop Blueprints Implementation

Object API: POJO to Document mapping

User Application

Deployment options

• Embedded (in-process)

• Single, Standalone Node

• Multi-Master Replica

• Mixed

Document API

• Lowest level API

• The document (record) is the unit of storage

• An immutable id (ORID) is automatically assigned to each document

• Documents can contain key-value pairs or nested/embedded documents (no ORID)

• Transaction support (optimistic mode with MVCC)

• Classes are logical sets of documents
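
The bullets above can be sketched with a toy in-memory store. This is only an illustration under stated assumptions: `ToyDocumentStore`, `Record` and `ConcurrentModificationError` are hypothetical names, not the OrientDB Document API. It shows an id that is assigned once at creation and an optimistic update that is rejected when the writer holds a stale version.

```python
from dataclasses import dataclass, field


class ConcurrentModificationError(Exception):
    """Raised when a writer tries to update from a stale version."""


@dataclass
class Record:
    rid: str                      # assigned once at creation, never changed
    class_name: str               # logical set the record belongs to
    fields: dict = field(default_factory=dict)
    version: int = 0              # incremented on every successful update


class ToyDocumentStore:
    """Hypothetical in-memory store; not the OrientDB API."""

    def __init__(self):
        self._records = {}
        self._next_position = {}  # per-cluster position counter

    def create(self, class_name, cluster_id, fields):
        position = self._next_position.get(cluster_id, 0)
        self._next_position[cluster_id] = position + 1
        rid = f"#{cluster_id}:{position}"
        self._records[rid] = Record(rid, class_name, dict(fields))
        return rid

    def read(self, rid):
        rec = self._records[rid]
        # Hand out a copy so callers cannot mutate stored state directly.
        return Record(rec.rid, rec.class_name, dict(rec.fields), rec.version)

    def update(self, rid, new_fields, expected_version):
        rec = self._records[rid]
        # Optimistic check: fail instead of locking if someone else wrote first.
        if rec.version != expected_version:
            raise ConcurrentModificationError(rid)
        rec.fields.update(new_fields)
        rec.version += 1
        return rec.version
```

A writer reads a record, keeps its `version`, and passes it back on update; a second writer still holding the old version gets a `ConcurrentModificationError` and has to re-read and retry.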

Schema-less, Schema-full or Hybrid?

Schema-less: relaxed model, the type of each field is inferred for each document

Schema-full: strict model, schema with constraints on fields and validation rules

Hybrid: mixed model, schema with mandatory and optional fields with constraints and validation rules
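
To make the three modes concrete, here is a minimal sketch of one validator that behaves differently depending on how much schema it is given. The schema layout (`field name -> (type, mandatory)`), the `strict` flag and `USER_SCHEMA` are assumptions made for this illustration; they are not OrientDB's schema API.

```python
class ValidationError(Exception):
    pass


def validate(document, schema=None, strict=False):
    """Check a dict against a hypothetical schema description.

    schema maps a field name to (python_type, mandatory).
    schema=None   -> schema-less: every document is accepted
    strict=True   -> schema-full: undeclared fields are rejected
    strict=False  -> hybrid: declared fields are checked, extras are allowed
    """
    if schema is None:
        return True
    for name, (expected_type, mandatory) in schema.items():
        if name not in document:
            if mandatory:
                raise ValidationError(f"missing mandatory field: {name}")
            continue
        if not isinstance(document[name], expected_type):
            raise ValidationError(f"wrong type for field: {name}")
    if strict:
        extras = set(document) - set(schema)
        if extras:
            raise ValidationError(f"undeclared fields: {sorted(extras)}")
    return True


# Hypothetical schema: "name" is mandatory, "city" is optional.
USER_SCHEMA = {"name": (str, True), "city": (str, False)}
```

With this sketch, `{"name": "Fabrizio", "nationality": "IT"}` passes in schema-less and hybrid mode but is rejected in schema-full mode because `nationality` is not declared.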

Class concept is taken from OOP

• A class can inherit from other classes, creating a tree (similar to RDF Schema)

• A sub-class inherits all the schema fields from its parents

• An abstract class is used as the foundation for other classes (it cannot have records)

• Class hierarchies allow native polymorphic queries

• 1 to 1 mapping with domain objects
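
A rough sketch of what a polymorphic query over a class tree means: selecting from a class also returns records of its sub-classes, and abstract classes hold no records. `ToyClassRegistry` and the class names `person`, `user` and `speaker` are hypothetical, chosen only for this example.

```python
class ToyClassRegistry:
    """Hypothetical in-memory model of classes and records; not the OrientDB API."""

    def __init__(self):
        self._parent = {}        # class name -> parent class name (or None)
        self._abstract = set()   # classes that cannot hold records
        self._records = []       # (class name, document) pairs

    def create_class(self, name, parent=None, abstract=False):
        if parent is not None and parent not in self._parent:
            raise KeyError(f"unknown parent class: {parent}")
        self._parent[name] = parent
        if abstract:
            self._abstract.add(name)

    def insert(self, class_name, document):
        if class_name not in self._parent:
            raise KeyError(f"unknown class: {class_name}")
        if class_name in self._abstract:
            raise ValueError(f"abstract class cannot have records: {class_name}")
        self._records.append((class_name, document))

    def _is_a(self, class_name, ancestor):
        # Walk up the inheritance tree until the ancestor or the root is reached.
        while class_name is not None:
            if class_name == ancestor:
                return True
            class_name = self._parent[class_name]
        return False

    def select(self, class_name):
        """Polymorphic query: records of class_name and of all its sub-classes."""
        return [doc for cls, doc in self._records if self._is_a(cls, class_name)]


registry = ToyClassRegistry()
registry.create_class("person", abstract=True)
registry.create_class("user", parent="person")
registry.create_class("speaker", parent="user")
registry.insert("user", {"name": "Uli"})
registry.insert("speaker", {"name": "Fabrizio"})
```

Here `registry.select("person")` returns both documents, while `registry.select("speaker")` returns only the speaker.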

Let’s create a Document

{
  "@rid": "#12:216",
  "@class": "user",
  "name": "Fabrizio",
  "meetups": [
    { "name": "HUG Ireland", "city": "Dublin", "since": "14-03-2014" }
  ],
  "details": {
    "@type": "d",
    "@class": "user_details",
    "city": "Dublin",
    "nationality": "IT"
  }
}

• @rid: Immutable Record ID

• @class: Logical set

• name: Property

• meetups: Array of objects

• details: Embedded document

With a traditional Document DB you have to duplicate your data to some degree. The degree depends on how complex the interdependencies of the application domain are.

OrientDB combines the unique flexibility of documents with the power of graphs to unlock the business value of Document Data Relationships.

Graphs: everything old is new again

https://en.wikipedia.org/wiki/Seven_Bridges_of_Königsberg

What is a Graph Database?

“A Graph Database is any storage system that provides index-free adjacency”

The Graph Traversal Pattern [2010] Marco A. Rodriguez

G = (V, E)

G: Graph, V: Vertex, E: Edge

What’s wrong with joins?

User
name      id
Fabrizio  10
Uli       12
John      13
Eddie     88

member
user_id  meetup_id
10       18
10       24
13       18
88       66

Meetup
id  name
18  HUG Ireland
57  AWS Users
24  Microservices
66  Scala

• Given a User (Fabrizio)

• Find Fabrizio (id=10) in member table O(log n)

• Find 18 and 24 (HUG Ireland & Microservices) in Meetup table O(log n)

• Joins are computed every time you cross relationships

• Time complexity grows with data: O(log n)

• Joining 3-4 tables with millions of records could create billions of combinations
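
The lookup steps above can be sketched as follows. The rows are copied from the slide's tables; the sorted lists stand in for a tree-based index, and binary search makes the O(log n) cost of each hop explicit. `index_lookup` and `meetups_of` are hypothetical helper names, and the linear scan by name is only a placeholder for the initial WHERE clause.

```python
from bisect import bisect_left

# Rows from the slide, kept sorted by their first column like an index would be.
USERS = [(10, "Fabrizio"), (12, "Uli"), (13, "John"), (88, "Eddie")]
MEMBER = [(10, 18), (10, 24), (13, 18), (88, 66)]   # (user_id, meetup_id)
MEETUPS = [(18, "HUG Ireland"), (24, "Microservices"),
           (57, "AWS Users"), (66, "Scala")]


def index_lookup(sorted_rows, key):
    """Return all rows whose first column equals key, found by binary search."""
    i = bisect_left(sorted_rows, (key,))
    rows = []
    while i < len(sorted_rows) and sorted_rows[i][0] == key:
        rows.append(sorted_rows[i])
        i += 1
    return rows


def meetups_of(user_name):
    """Resolve user -> member rows -> meetup rows, one index lookup per hop."""
    user_id = next(uid for uid, name in USERS if name == user_name)
    names = []
    for _, meetup_id in index_lookup(MEMBER, user_id):
        names.extend(name for _, name in index_lookup(MEETUPS, meetup_id))
    return names
```

Every call to `meetups_of` repeats the index searches, which is the point of the slide: the relationship is recomputed each time it is crossed.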

The Graph as an Index

• Given a User (Fabrizio)

• Traverse the member edges to reach HUG Ireland O(1) & Microservices O(1)

• Fabrizio is the index to reach the linked Meetups!

• Every vertex and edge is “hard wired” to its adjacent vertex or edge

• Traversing an edge does not require complex computation, near O(1)

• The traversal time is not affected by the database size
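
A minimal sketch of index-free adjacency, assuming plain Python object references play the role of the "hard wired" links: each vertex keeps direct references to its outgoing edges, so a hop never consults a global index. `Vertex`, `Edge` and `out` are hypothetical names for this illustration.

```python
class Vertex:
    def __init__(self, name):
        self.name = name
        self.out_edges = {}   # edge label -> list of Edge objects


class Edge:
    def __init__(self, label, source, target):
        self.label = label
        self.source = source
        self.target = target
        # Wire the edge into its source vertex at creation time.
        source.out_edges.setdefault(label, []).append(self)


def out(vertex, label):
    """Follow outgoing edges with the given label; no table or index lookup."""
    return [edge.target for edge in vertex.out_edges.get(label, [])]


fabrizio = Vertex("Fabrizio")
hug = Vertex("HUG Ireland")
micro = Vertex("Microservices")
Edge("member", fabrizio, hug)
Edge("member", fabrizio, micro)
```

The cost of `out(fabrizio, "member")` depends only on how many `member` edges Fabrizio has, not on how many vertices exist in total.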

Fabrizio -(member)-> HUG Ireland

Fabrizio -(member)-> Micro Services

Easier to sketch!

Combine Documents with Graphs

{ "@rid": "12:216", "@class": "user", "name": "Fabrizio", "details": { "@type": "d", "@class": "user_detail", "city": "Dublin", "nationality": "IT" } }

{ "@rid": "13:12", "@class": "meetup", "name": "HUG Ireland", "city": "Dublin" }

{ "@rid": "14:32", "@class": "member", "since": "14-03-2014", "in": "12:216", "out": "13:12" }

out_member=14:32 in_member=14:32

{ "@rid": "15:79", "@class": "talk", "title": "OrientDB", "on": "11-04-2016", "in": "12:216", "out": "13:12" }

out_talk=15:79 in_talk=15:79
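
The records above can be modelled as plain dictionaries to show how edges that are themselves documents link other documents. The field values are copied from the slide. Which vertex carries `out_member` and which carries `in_member` is an assumption derived from each edge's `in` and `out` values, and `neighbours` is a hypothetical helper, not an OrientDB call.

```python
RECORDS = {
    "12:216": {"@class": "user", "name": "Fabrizio",
               "in_member": ["14:32"], "in_talk": ["15:79"]},
    "13:12": {"@class": "meetup", "name": "HUG Ireland", "city": "Dublin",
              "out_member": ["14:32"], "out_talk": ["15:79"]},
    # Edges are documents too, so they can carry their own properties.
    "14:32": {"@class": "member", "since": "14-03-2014",
              "in": "12:216", "out": "13:12"},
    "15:79": {"@class": "talk", "title": "OrientDB", "on": "11-04-2016",
              "in": "12:216", "out": "13:12"},
}


def neighbours(rid, edge_class):
    """Follow edges of one class in both directions.

    Returns (edge document, vertex document at the far end) pairs.
    """
    vertex = RECORDS[rid]
    result = []
    # An out_<class> field points to edges whose far end is the edge's "in";
    # an in_<class> field points to edges whose far end is the edge's "out".
    for field_name, far_end in ((f"out_{edge_class}", "in"),
                                (f"in_{edge_class}", "out")):
        for edge_rid in vertex.get(field_name, []):
            edge = RECORDS[edge_rid]
            result.append((edge, RECORDS[edge[far_end]]))
    return result
```

Starting from the user `12:216`, `neighbours("12:216", "member")` yields the `member` edge (with its `since` property) together with the HUG Ireland document, and `neighbours("12:216", "talk")` yields the talk given there.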

Multi-relational Document Graph

Would you believe me if I said you can query documents and graphs with an SQL-like syntax?

Show me something now! OK, time for a quick demo.

http://www.sharegoodstuffs.com/2011_12_12_archive.html

Use Case: raise standards in Irish Public Office

The challenges

• Aggressive deadline

• Large amount of data from different sources with different formats

• Messy, dirty data

• Connecting records from different sources that represent the same thing without a common identifier

• Multi-step traversal of fixed and inferred links to identify disparate entities connected by a path

The solution

OrientDB

Fuzzy Inference Engine

Technical Details

• Main Language: Groovy

• Database Type: OrientDB Embedded

• Fuzzy Inference Engine: Duke

  • minHash proximity index based on Lucene to avoid a cartesian product

  • probabilistic model with configurable statistical algorithms (Levenshtein, NGram, Soundex, Custom, etc.) to identify the same entities despite differences

• End-To-End Process Time < 10 min

• Deliverable: Database

  • Preset of queries to answer the main questions (analysts can add or modify WHERE conditions on their own)

  • Graph View to visually search and visualise data
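
To illustrate the two ideas named above, here is a small pure-Python sketch: MinHash signatures over character trigrams limit which pairs are compared at all, and a Levenshtein-based similarity then scores the surviving pairs. This is not Duke's API or configuration; the function names, the trigram choice, `num_hashes=8` and the 0.8 threshold are all assumptions made for the example.

```python
import hashlib
from itertools import combinations


def levenshtein(a, b):
    """Edit distance computed row by row with dynamic programming."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def similarity(a, b):
    """Normalise edit distance to a score between 0.0 and 1.0."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def trigrams(text):
    text = text.lower()
    return {text[i:i + 3] for i in range(max(1, len(text) - 2))}


def minhash_signature(text, num_hashes=8):
    """Keep the minimum hash of the trigram set for each seeded hash function."""
    grams = trigrams(text)
    return tuple(
        min(hashlib.md5(f"{seed}:{gram}".encode()).hexdigest() for gram in grams)
        for seed in range(num_hashes)
    )


def candidate_pairs(names, num_hashes=8):
    """Pair only names that share a signature slot, instead of all n * n pairs."""
    buckets = {}
    for name in names:
        for slot, value in enumerate(minhash_signature(name, num_hashes)):
            buckets.setdefault((slot, value), set()).add(name)
    pairs = set()
    for bucket in buckets.values():
        pairs.update(combinations(sorted(bucket), 2))
    return pairs


def match(names, threshold=0.8):
    """Score candidate pairs and keep those at or above the threshold."""
    return sorted((a, b) for a, b in candidate_pairs(names)
                  if similarity(a, b) >= threshold)
```

Names with no trigram in common never land in the same bucket, so they are never compared; that is how the blocking step avoids the cartesian product before the more expensive string comparison runs.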

What people from home perceived

≈ 20K tweets

Top hashtag in Ireland for 24 hours: #rteinvestigates

OK but what about Big Data?

“While we’ve long understood the value of Big Data to better understand how people interact with us, we’ve noticed an alarming trend of Big Data envy: organizations using complex tools to handle “not-really-that-big” Data. Distributed map-reduce algorithms are a handy technique for large data sets, but many data sets we see could easily fit in a single node relational or graph database. Even if you do have more data than that, usually the best thing to do is to first pick out the data you need, which can often then be processed on such a single node”

ThoughtWorks Technology Radar, 5 April 2016

Begin the journey!

https://www.udemy.com/orientdb-getting-started/

Resources

• http://martinfowler.com/bliki/PolyglotPersistence.html

• https://en.wikipedia.org/wiki/Multi-model_database

• http://orientdb.com/

• https://en.wikipedia.org/wiki/Seven_Bridges_of_Königsberg

• http://arxiv.org/pdf/1004.1001.pdf

• https://www.udemy.com/orientdb-getting-started/

• http://www.rte.ie/news/investigations-unit/2015/1207/751833-rte-investigates/

• https://github.com/larsga/Duke

• https://www.thoughtworks.com/radar

Q & A

Thank you!
