On Storing Big Data

Ilias Flaounas

Intelligent Systems Lab

30 October 2012

I. Flaounas (Intelligent Systems Lab) 30 October 2012 1 / 16

http://flaounas.net

Storing Big Data

Data play an increasingly important role in business and science.


Storing, searching, sharing, analysing and visualising big data has become a challenge.


Storage in particular is often overlooked as an issue.


Note that sometimes a MySQL database is not enough.


Hadoop offers an out-of-the-box distributed filesystem for storing data files. However, the challenge appears when someone needs database capabilities, frequent updates or real-time processing.


The Problems

Traditional relational databases can now reach their performance limits.


Data keep arriving at high velocity, in high volume, and in high variety.


Common practices to increase performance fail after a while: buying a faster server, getting more RAM, using materialised views, fine-tuning queries...


Furthermore, “alter table” is impractical on tables holding lots of data.


Backups and data availability become issues.


NoSQL Movement

The term is too broad and too new to define precisely.


Wikipedia: “NoSQL (Not only SQL) DB systems are often highly optimized for retrieve and append operations and often offer little functionality beyond record storage.”


No schema


No joins between tables


No common scripting language (like SQL)


No ACID (atomicity, consistency, isolation, durability)


On the other hand, you gain horizontal scalability and high performance. Also, most NoSQL systems are Map/Reduce ready and/or integrate with Hadoop.
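To make the Map/Reduce idea concrete, here is a minimal single-process sketch in plain Python. It does not use Hadoop or any NoSQL client library; the function names and the sample records are hypothetical and only illustrate the map, shuffle and reduce steps that such systems distribute across machines.

```python
from collections import defaultdict


def map_phase(documents, mapper):
    """Apply the mapper to every document and collect (key, value) pairs."""
    pairs = []
    for doc in documents:
        pairs.extend(mapper(doc))
    return pairs


def shuffle(pairs):
    """Group values by key, as a framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups, reducer):
    """Collapse each key's list of values into a single result."""
    return {key: reducer(key, values) for key, values in groups.items()}


def tag_mapper(doc):
    # Emit one (tag, 1) pair per tag; documents without tags emit nothing.
    return [(tag, 1) for tag in doc.get("Tags", [])]


def count_reducer(key, values):
    return sum(values)


# Hypothetical sample records; the third one has no "Tags" field at all.
articles = [
    {"Title": "A", "Tags": ["News", "Scotland"]},
    {"Title": "B", "Tags": ["News"]},
    {"Title": "C"},
]
tag_counts = reduce_phase(shuffle(map_phase(articles, tag_mapper)), count_reducer)
```

Because each mapper call looks at one document and each reducer call at one key, both phases can in principle run on different machines, which is what makes the model fit horizontally scaled stores.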


NoSQL DBs

There are lots of different systems under the NoSQL ‘umbrella’. Each one is optimised with different application scenarios in mind, and with different choices on trade-offs.


Document based: CouchDB, MongoDB,...


Key-value: Cassandra, Dynamo, Riak,...


Tabular based: BigTable, HBase,...


Memory based: Memcached, Redis, others optimised for solid-state disks...


Specialised for graphs: Neo4j, InfiniteGraph,...


Specialised for full-text search: Lucene, Solr...


Understand your requirements and then make a choice.


Oracle response


May 2011: Oracle issues a white paper titled “Debunking the NoSQL Hype”.


The conclusion:

“Go for the tried and true path. Don’t be risking your data on NoSQL databases.”


October 2011: Oracle releases the “Oracle NoSQL Database”. The white paper is now reachable only via Google archives.


Example: MongoDB

MongoDB (from “humongous”) is an open source, high-performance, schema-free, document-oriented database.


Document-Oriented storage


No predefined schema


High Performance


Easy to add new “columns” in data rows


Easy to scale horizontally: Auto-Sharding


Automatic fail-over: invisible to applications


Full Index Support


Map/Reduce ready - Can bind with Hadoop


Eventually consistent


Open Source but developed and maintained by company “10gen”


Document based DB

A document is represented in JSON format:

{
  "_id": 12345678,
  "Link": "http://news.scotsman.com/abc.html",
  "Title": "Blah blah blah",
  "Content": "More blah blah",
  "OutletID": 14,
  "Date": ISODate("2011-11-17T20:33:15.097Z"),
  "Hash": 550973592,
  "Tags": ["International", "News", "Scotland"]
}
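The practical effect of having no predefined schema can be sketched with a toy in-memory collection in Python. This is not MongoDB's API or driver: the Collection class, its insert and find methods, and the "Language" field are all hypothetical, chosen only to show that a new field can appear on some documents without any "alter table" step.

```python
class Collection:
    """Toy in-memory stand-in for a document collection (not a MongoDB API)."""

    def __init__(self):
        self._docs = []

    def insert(self, doc):
        # Store a copy so later changes to the caller's dict do not leak in.
        self._docs.append(dict(doc))

    def find(self, **criteria):
        """Return documents whose fields equal every given criterion."""
        return [
            doc for doc in self._docs
            if all(key in doc and doc[key] == value
                   for key, value in criteria.items())
        ]


articles = Collection()
# The first document has no "Language" field.
articles.insert({"_id": 1, "Title": "Blah blah blah", "OutletID": 14})
# The second one simply carries the extra field; no migration is needed.
articles.insert({"_id": 2, "Title": "More news", "OutletID": 14, "Language": "en"})

with_language = articles.find(Language="en")
from_outlet = articles.find(OutletID=14)
```

Queries on the new field skip older documents that lack it, so old and new records can coexist in the same collection.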


Single Server

A single machine stores the DB, e.g. MySQL.

Master/Slave

Two machines in Master/Slave configuration.

MongoDB - Replication

Automatic Fail Over - The Master is elected among servers.
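The idea of electing a master after a failure can be illustrated with a toy sketch. This is not MongoDB's actual election protocol: the Server class, the last_write counter and the tie-breaking rule are assumptions made for illustration. The one property it tries to show is that a new master is chosen only when a majority of the servers can still see each other.

```python
from dataclasses import dataclass


@dataclass
class Server:
    name: str
    reachable: bool
    last_write: int  # hypothetical counter: higher means more recent data


def elect_master(servers):
    """Return the name of the new master, or None without a reachable majority."""
    alive = [s for s in servers if s.reachable]
    # Require a strict majority so two isolated groups cannot both elect a master.
    if 2 * len(alive) <= len(servers):
        return None
    # Prefer the most up-to-date server; break ties by name for determinism.
    return max(alive, key=lambda s: (s.last_write, s.name)).name


# The old master "a" has failed; "b" and "c" still form a majority of three.
after_failure = elect_master([
    Server("a", False, 120),
    Server("b", True, 118),
    Server("c", True, 119),
])

# A single surviving server out of three is not a majority.
isolated = elect_master([
    Server("a", True, 120),
    Server("b", False, 120),
    Server("c", False, 120),
])
```

In the first case "c" is chosen because it holds the most recent data among the reachable servers; in the second case no master is elected.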

MongoDB - Sharding

Data is spread horizontally.
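One simple way to spread documents across machines is to hash a key and take the result modulo the number of shards. The sketch below assumes that scheme purely for illustration; MongoDB's own partitioning of data may work differently, and shard_for is a hypothetical helper, not a library call.

```python
import hashlib
from collections import Counter


def shard_for(key, shard_count):
    """Map a document key to a shard index in the range [0, shard_count)."""
    digest = hashlib.sha256(str(key).encode("utf-8")).digest()
    # Use the first 8 bytes of the digest as an unsigned integer.
    return int.from_bytes(digest[:8], "big") % shard_count


# Route 1000 hypothetical document ids across 4 shards and count each shard's load.
load = Counter(shard_for(doc_id, 4) for doc_id in range(1000))
```

The same key always lands on the same shard, so reads can be routed without a lookup table. The drawback of plain modulo hashing is that changing the shard count remaps most keys.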

MongoDB

If a new shard is added, data is balanced automatically.
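Automatic balancing can be pictured as moving blocks of data from the fullest shard to the emptiest one until the load is even. The following is a minimal sketch under that assumption: the balance function, the shard names and the integer "chunk" ids are hypothetical and do not reproduce MongoDB's balancer.

```python
def balance(shards):
    """Move chunks from the fullest to the emptiest shard until counts differ by at most one.

    `shards` maps a shard name to its list of chunk ids and is modified in place.
    Returns the number of chunks that were migrated.
    """
    moves = 0
    while True:
        fullest = max(shards, key=lambda name: len(shards[name]))
        emptiest = min(shards, key=lambda name: len(shards[name]))
        if len(shards[fullest]) - len(shards[emptiest]) <= 1:
            return moves
        shards[emptiest].append(shards[fullest].pop())
        moves += 1


# Two shards hold six chunks each; a third, empty shard is then added.
cluster = {"shard0": list(range(0, 6)), "shard1": list(range(6, 12))}
cluster["shard2"] = []
migrated = balance(cluster)
```

After balancing, each of the three shards holds four chunks, and only the chunks that had to move were touched.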

MongoDB

No single point of failure, distributed read/writes.


Big Data come with Big Problems

Maintenance of infrastructure - It is easier to manage one server than ten




Need to adapt legacy software





Training people on the new technologies






Designing the DB – splitting data among machines for maximum I/O







Bugs or ‘simple’ features may be missing, new versions come out too often...








Security


Thank you!

