On Storing Big Data

Ilias Flaounas

Intelligent Systems Lab

30 October 2012

I. Flaounas (Intelligent Systems Lab) 30 October 2012 1 / 16

Storing Big Data

Data play an increasingly important role in business and science.

Storing, searching, sharing, analysing and visualising big data has become a challenge.

Storage in particular is often overlooked as an issue.

Note that sometimes a MySQL database is not enough.

Hadoop offers an out-of-the-box distributed filesystem for storing data files. However, the challenge appears when someone needs database capabilities, frequent updates or real-time processing.

The Problems

Nowadays traditional relational databases can reach their performance limits.

Data keep coming in at high velocity, in high volumes, and with high variety.

Common practices to increase performance fail after a while: buying a faster server, getting more RAM, using materialised views, fine-tuning queries...

Furthermore, “alter table” becomes impractical with lots of data.

Backups and data availability become an issue.

NoSQL Movement

The term is too broad and too new to define precisely.

Wikipedia: “NoSQL (Not only SQL) DB systems are often highly optimized for retrieve and append operations and often offer little functionality beyond record storage.”

No schema

No joins between tables

No common scripting language (like SQL)

No ACID (atomicity, consistency, isolation, durability)

On the other hand, you gain horizontal scalability and high performance. Also, most NoSQL systems are Map/Reduce ready and/or bind with Hadoop.

NoSQL DBs

Many different systems sit under the NoSQL ‘umbrella’. Each one is optimised with different application scenarios in mind and makes different trade-offs.

Document based: CouchDB, MongoDB, ...

Key-value: Cassandra, Dynamo, Riak, ...

Tabular based: BigTable, HBase, ...

Memory based: Memcached, Redis, others optimised for solid state disks, ...

Specialised for graphs: Neo4j, InfiniteGraph, ...

Specialised for full-text search: Lucene, Solr, ...

Understand your requirements and then make a choice.

Oracle response

May 2011: Oracle issues a white paper titled “Debunking the NoSQL Hype”.

The conclusion:

“Go for the tried and true path. Don’t be risking your data on NoSQL databases.”

October 2011: Oracle releases the “Oracle NoSQL Database”. The white paper is now reachable only via Google archives.

Example: MongoDB

MongoDB (from “humongous”) is an open source, high-performance, schema-free, document-oriented database.

Document-Oriented storage

No predefined schema

High Performance

Easy to add new “columns” in data rows

No joins between tables

Easy to scale horizontally: Auto-Sharding

Automatic fail-over: invisible to applications

Full Index Support

Map/Reduce ready - Can bind with Hadoop

Eventually consistent

Open source, but developed and maintained by the company “10gen”
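
The “Map/Reduce ready” point in the list above can be illustrated with a small sketch in plain Python. It does not use MongoDB or Hadoop; the function names and the "Tags" field are assumptions made for illustration. The mapper emits a (tag, 1) pair for every tag of a document, and the reducer sums the values collected per tag.

```python
from collections import defaultdict

def map_tags(doc):
    """Map step: emit (tag, 1) for every tag of a document (hypothetical "Tags" field)."""
    for tag in doc.get("Tags", []):
        yield tag, 1

def reduce_counts(key, values):
    """Reduce step: sum all values emitted for one key."""
    return sum(values)

def map_reduce(docs, mapper, reducer):
    """Group the mapper output by key, then apply the reducer to each group."""
    groups = defaultdict(list)
    for doc in docs:
        for key, value in mapper(doc):
            groups[key].append(value)
    return {key: reducer(key, values) for key, values in groups.items()}

docs = [
    {"Tags": ["News", "Scotland"]},
    {"Tags": ["News"]},
    {},  # schema-free: a document without tags is simply skipped
]
tag_counts = map_reduce(docs, map_tags, reduce_counts)
```

In a real deployment the map and reduce steps would run on the servers that hold the data; here everything runs in one process to keep the idea visible.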

Document based DB

A document is represented in JSON format:

{
  "_id" : 12345678,
  "Link" : "http://news.scotsman.com/abc.html",
  "Title" : "Blah blah blah",
  "Content" : "More blah blah",
  "OutletID" : 14,
  "Date" : ISODate("2011-11-17T20:33:15.097Z"),
  "_Hash" : 550973592,
  "Tags" : ["International", "News", "Scotland"]
}
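
As a rough illustration of what “document based” and “no predefined schema” mean in practice, the sketch below models a collection as a plain Python list of dictionaries. It does not talk to a MongoDB server; the names `articles`, `to_json` and `find_by_tag`, and the extra "Language" field, are assumptions made for the example.

```python
import json
from datetime import datetime, timezone

# Hypothetical in-memory stand-in for a collection: a list of dicts.
articles = []

article = {
    "_id": 12345678,
    "Link": "http://news.scotsman.com/abc.html",
    "Title": "Blah blah blah",
    "OutletID": 14,
    "Date": datetime(2011, 11, 17, 20, 33, 15, 97000, tzinfo=timezone.utc),
    "Tags": ["International", "News", "Scotland"],
}
articles.append(article)

# A second document may carry a field the first one lacks:
# no "alter table" is needed to add a new "column".
articles.append({"_id": 12345679, "Title": "Another story", "Language": "en"})

def to_json(doc):
    """Serialise a document, writing datetimes as ISO 8601 strings."""
    return json.dumps(doc, default=lambda v: v.isoformat(), sort_keys=True)

def find_by_tag(docs, tag):
    """Return the documents whose optional "Tags" list contains the tag."""
    return [d for d in docs if tag in d.get("Tags", [])]
```

Queries have to tolerate missing fields, which is why `find_by_tag` uses `d.get("Tags", [])` rather than `d["Tags"]`.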

Single Server

A single machine stores the DB, e.g. MySQL.

Master/Slave

Two machines in a Master/Slave configuration.

MongoDB - Replication

Automatic fail-over: the master is elected among the servers.
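
The election idea can be sketched as follows. This is a deliberately simplified model, not MongoDB's actual election protocol: the `Member` class, the `last_op` counter and the tie-breaking rule are assumptions for illustration. A new master is chosen only if a strict majority of the servers is reachable, and the most up-to-date reachable server wins.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Member:
    name: str
    reachable: bool
    last_op: int  # hypothetical counter of the last replicated operation

def elect_master(members: List[Member]) -> Optional[Member]:
    """Pick a new master, or None if no strict majority of members is reachable."""
    alive = [m for m in members if m.reachable]
    if len(alive) * 2 <= len(members):
        return None  # without a majority, two partitions could each elect a master
    # Prefer the most up-to-date member; ties go to the greater name for determinism.
    return max(alive, key=lambda m: (m.last_op, m.name))

members = [
    Member("a", reachable=False, last_op=10),  # the failed master
    Member("b", reachable=True, last_op=9),
    Member("c", reachable=True, last_op=7),
]
new_master = elect_master(members)
```

With one of three servers down, the remaining two still form a majority and "b" takes over; if a second server became unreachable, no master would be elected.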

MongoDB - Sharding

Data is spread horizontally.
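
A minimal sketch of horizontal partitioning, assuming documents are routed by ranges of a shard key. The split points, the choice of `OutletID` as key and the shard names are hypothetical, and a real cluster keeps this mapping in its own metadata rather than in constants.

```python
import bisect

# Hypothetical split points on the shard key (here: OutletID).
# Chunk 0 covers keys below 10, chunk 1 covers [10, 20),
# chunk 2 covers [20, 30), chunk 3 covers 30 and above.
SPLIT_POINTS = [10, 20, 30]
CHUNK_TO_SHARD = ["shard0", "shard1", "shard0", "shard1"]

def shard_for(key: int) -> str:
    """Return the shard that stores the chunk containing the given key."""
    chunk = bisect.bisect_right(SPLIT_POINTS, key)
    return CHUNK_TO_SHARD[chunk]
```

For example, `shard_for(14)` falls in the chunk [10, 20) and is routed to "shard1", so reads and writes for different key ranges land on different machines.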

MongoDB

If a new shard is added, data is balanced automatically.
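
Automatic balancing can be pictured as moving chunks of data from the fullest shard to the emptiest one until the shards hold roughly the same number of chunks. The sketch below is a simplified model under that assumption; the `rebalance` function, the chunk ids and the stopping rule are hypothetical and not MongoDB's actual balancer.

```python
from typing import Dict, List

def rebalance(shards: Dict[str, List[int]]) -> int:
    """Move chunks from the fullest to the emptiest shard until the chunk
    counts differ by at most one. Returns the number of migrations."""
    moves = 0
    while True:
        fullest = max(shards, key=lambda s: len(shards[s]))
        emptiest = min(shards, key=lambda s: len(shards[s]))
        if len(shards[fullest]) - len(shards[emptiest]) <= 1:
            return moves
        shards[emptiest].append(shards[fullest].pop())
        moves += 1

# Two loaded shards plus a freshly added, empty one (chunk ids are made up).
cluster = {"shard0": [0, 1, 2], "shard1": [3, 4, 5], "shard2": []}
migrations = rebalance(cluster)
```

After the call each shard holds two chunks, and only two chunk migrations were needed to get there.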

MongoDB

No single point of failure, distributed reads and writes.

Big Data come with Big Problems

Maintenance of infrastructure - it is easier to manage one server than ten

Need to adapt legacy software

Training people on the new technologies

Designing the DB - splitting data among machines for maximum I/O

Bugs may be present, ‘simple’ features may be missing, and new versions come out too often...

Security

Thank you!
