Cloud Computing Workshop 2013, ITU
10: Taxonomy of Data and Storage
Zubair Nabi
April 20, 2013
Zubair Nabi 10: Taxonomy of Data and Storage April 20, 2013 1 / 27
Outline
1 Datasets
2 Storage
3 Beyond RDBMS
4 NoSQL Taxonomy
5 NewSQL
Introduction
Data is everywhere and is the driving force behind our lives
The address book on your phone is data
So is the newspaper that you read every morning
Everything you see around you is a potential source of data which might be useful for a certain application
We use this data to share information and make a more informed decision about different events
Datasets can easily be classified on the basis of their structure
1 Structured
2 Unstructured
3 Semi-structured
Structured Data
Formatted in a universally understandable and identifiable way
In most cases, structured data is formally specified by a schema
Your phone's address book is structured because it has a schema consisting of name, phone number, address, email address, etc.
Most traditional databases contain structured data laid out across columns and rows
Each field also has an associated type
  - Possible to search for items based on their data types
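As a rough illustration of a schema with typed fields, the address book could be sketched in Python as follows. The `Contact` class, its field names, and the `fields_of_type` helper are hypothetical and only serve the example; the `age` field is added purely to have a non-string type to search for.

```python
from dataclasses import dataclass, fields


@dataclass
class Contact:
    # Hypothetical schema for one address-book entry.
    name: str
    phone: str
    address: str
    email: str
    age: int  # assumption: extra field, present only to show a second type


def fields_of_type(record_cls, wanted_type):
    """Return the names of the schema fields declared with the given type."""
    return [f.name for f in fields(record_cls) if f.type is wanted_type]


entry = Contact("Alice", "555-0100", "1 Main Street", "alice@example.com", 30)

# Because every field has a declared type, the schema itself can be queried.
text_fields = fields_of_type(Contact, str)
numeric_fields = fields_of_type(Contact, int)
print(text_fields, numeric_fields)
```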
Unstructured Data
Data without any conceptual definition or type
Can vary from raw text to binary data
Processing unstructured data requires parsing and tagging on the fly
In most cases, consists of simple log files
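A minimal sketch of what parsing and tagging on the fly might look like for a log file. The log layout, the `LOG_PATTERN` regular expression, and the `tag_line` helper are assumptions made for illustration; real logs vary widely.

```python
import re

# Assumed layout: "<date> <time> <LEVEL> <free-text message>"
LOG_PATTERN = re.compile(r"^(\S+) (\S+) ([A-Z]+) (.*)$")


def tag_line(line):
    """Parse one raw log line and attach field names (tags) to its parts."""
    match = LOG_PATTERN.match(line)
    if match is None:
        # No structure could be recovered, so keep the raw text as-is.
        return {"raw": line}
    date, time, level, message = match.groups()
    return {"date": date, "time": time, "level": level, "message": message}


raw_lines = [
    "2013-04-20 10:15:02 ERROR disk quota exceeded",
    "free-form text with no recognisable layout",
]
tagged = [tag_line(line) for line in raw_lines]
for item in tagged:
    print(item)
```

The structure exists only in the parser, not in the data, which is why every consumer of the file has to repeat this work.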
Semi-structured Data
Occupies the space between the structured and unstructured ends of the data spectrum
For instance, while binary data has no structure, audio and video files have metadata which has structure, such as author, time of creation, etc.
Can also be described as having a self-describing structure
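One way to picture self-describing structure is a set of JSON metadata records, sketched below. The file names, keys, and values are made up for the example; the point is only that each record names its own fields, so no external schema is needed to read it.

```python
import json

# Hypothetical metadata for two media files; the keys differ per record.
records_json = """
[
  {"file": "talk.mp4", "author": "A. Speaker", "created": "2013-04-20T09:00:00"},
  {"file": "intro.mp3", "author": "B. Host", "bitrate_kbps": 128}
]
"""

records = json.loads(records_json)

# Each record carries its own field names, so its attributes can be
# discovered by inspecting the record itself.
for record in records:
    print(record["file"], sorted(record.keys()))

# The union of attributes is only known after reading the data.
all_keys = sorted({key for record in records for key in record})
```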
Database Management Systems (DBMS)
Used to store and manage data
Support for large amounts of data
Ensure concurrency, sharing, and locking
Security is useful too, to enable fine-grained access control
Ability to keep working in the face of failure
Relational Database Management Systems (RDBMS)
The most popular and predominant storage system in use
Data in different files is connected by using a key field
Data is laid out in different tables, with a key field that identifies each row
The same key field is used to connect one table to another
For instance, a relation might have customer ID as key and her details as data; another table might have the same key but different data, say her purchases; yet another table with the same key might have a breakdown of her preferences
Examples include Oracle Database, MS SQL Server, MySQL, IBM DB2, and Teradata
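The customer example can be sketched with Python's built-in `sqlite3` module, used here only because it needs no server; it is not one of the systems listed above. The table names, column names, and sample rows are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE purchases (customer_id INTEGER, item TEXT);
    CREATE TABLE preferences (customer_id INTEGER, category TEXT);
    INSERT INTO customers VALUES (1, 'Alice');
    INSERT INTO purchases VALUES (1, 'laptop'), (1, 'headphones');
    INSERT INTO preferences VALUES (1, 'electronics');
""")

# The shared key field, customer_id, connects the three tables.
rows = conn.execute("""
    SELECT c.name, p.item, f.category
    FROM customers AS c
    JOIN purchases AS p ON p.customer_id = c.customer_id
    JOIN preferences AS f ON f.customer_id = c.customer_id
    ORDER BY p.item
""").fetchall()
conn.close()

for row in rows:
    print(row)
```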
Structured Query Language (SQL)
Non-procedural language used for data retrieval and manipulation in RDBMS
Adds a layer of abstraction over relational algebra, which enables set operations, selections, etc.
Due to its declarative nature, users operate in terms of their expected output while the underlying system decides the actual query execution plan
Instructions consist of a specific SQL statement and additional parameters and operands
For instance, the SELECT statement retrieves certain records, INSERT adds a record, and so on
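A small sketch of INSERT and SELECT with parameters, again using Python's `sqlite3` module as a stand-in engine. The `contacts` table and its rows are hypothetical. `EXPLAIN QUERY PLAN` is SQLite-specific; other systems expose their plans through different commands.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE contacts (name TEXT, city TEXT)")

# INSERT adds records; the "?" placeholders are the statement's parameters.
conn.executemany(
    "INSERT INTO contacts (name, city) VALUES (?, ?)",
    [("Alice", "Lahore"), ("Bob", "Karachi"), ("Carol", "Lahore")],
)

# SELECT states which records are wanted, not how to find them.
query = "SELECT name FROM contacts WHERE city = ? ORDER BY name"
names = [row[0] for row in conn.execute(query, ("Lahore",))]

# The engine chooses the execution plan; SQLite reports its choice here.
plan = conn.execute("EXPLAIN QUERY PLAN " + query, ("Lahore",)).fetchall()
conn.close()

print(names)
print(plan)
```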
RDBMS and Structured Data
As structured data follows a predefined schema, it naturally maps on to a relational database system
  - The schema defines the type and structure of the data and its relations
Schema design is an arduous process and needs to be done before the database can be populated
Another consequence of a strict schema is that it is non-trivial to extend it
For instance, adding a new attribute to an existing row necessitates adding a new column to the entire table
  - Extremely suboptimal in tables with millions of rows
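The schema-extension point can be illustrated with a short sketch, assuming a hypothetical `customers` table and a made-up `loyalty_tier` attribute that only one customer needs. It shows that the new column applies to every row; it does not measure cost, which depends on the database engine.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany(
    "INSERT INTO customers (name) VALUES (?)",
    [("Alice",), ("Bob",), ("Carol",)],
)

# Only Alice has a loyalty tier, but the column is added to the whole table.
conn.execute("ALTER TABLE customers ADD COLUMN loyalty_tier TEXT")
conn.execute("UPDATE customers SET loyalty_tier = 'gold' WHERE name = 'Alice'")

rows = conn.execute(
    "SELECT name, loyalty_tier FROM customers ORDER BY customer_id"
).fetchall()
conn.close()

# Every other row now carries an empty (NULL) value for the new attribute.
print(rows)
```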
RDBMS and Semi- and Un-structured Data
Unstructured data has no notion of schema while semi-structured data only has a weak one
Data within such datasets also has an associated type
  - In fact, types are application-centric: It might be possible to interpret a field as a float in one application and as a string in another
While it is possible, with human intervention, to glean structure from unstructured data, it is an extremely expensive task
Structureless data generated by real-time sources can change the number of attributes and their types on the fly
  - RDBMS would require the creation of a new table each time such a change takes place
Therefore, unstructured and semi-structured data do not fit the relational model
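A tiny sketch of application-centric typing: the same raw field is read as a float by one consumer and as a string by another. The record layout and both "applications" are hypothetical.

```python
# Hypothetical raw record from a real-time source: "<source>,<value>"
raw_record = "sensor-7,3.50"
_, raw_value = raw_record.split(",")

# Application A interprets the field as a number and does arithmetic on it.
as_number = float(raw_value) * 2

# Application B interprets the same field as an opaque label and pads it.
as_label = raw_value.rjust(6, "0")

print(as_number, as_label)
```

Neither reading is the correct one for the data itself; the type is imposed by whichever application consumes the field.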
Motivation
Different semantics:
  - RDBMS provide ACID semantics:
      1 Atomic: The entire transaction either succeeds or fails
      2 Consistent: Data within the database remains consistent after each transaction
      3 Isolated: Transactions are sandboxed from each other
      4 Durable: Transactions are persistent across failures and restarts
  - Overkill in case of most user-facing applications
  - Most applications are more interested in availability and willing to sacrifice consistency leading to eventual consistency
  - This basically available, soft state, eventually consistent (BASE) model enables applications to function even in the face of partial failure
High Throughput: Most NoSQL databases sacrifice consistency for availability leading to higher throughput (in some cases an order of magnitude)
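The atomicity guarantee can be sketched with Python's `sqlite3` module. The `accounts` table, the `CHECK` constraint, and the `transfer` helper are assumptions made for the example; it demonstrates only the "A" in ACID, not isolation, durability, or the BASE model.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts ("
    "owner TEXT PRIMARY KEY, balance INTEGER CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 50)])
conn.commit()


def transfer(conn, src, dst, amount):
    """Move money between accounts as one transaction; return True on success."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE owner = ?",
                (amount, dst),
            )
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE owner = ?",
                (amount, src),
            )
        return True
    except sqlite3.IntegrityError:
        # The debit violated the CHECK constraint, so the credit is undone too.
        return False


ok = transfer(conn, "alice", "bob", 30)
failed = transfer(conn, "alice", "bob", 500)
balances = dict(conn.execute("SELECT owner, balance FROM accounts"))
conn.close()

print(ok, failed, balances)
```

In the second call the credit to the destination account is applied first and then rolled back when the debit fails, so no partial transfer is ever visible after the transaction ends.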
Zubair Nabi 10: Taxonomy of Data and Storage April 20, 2013 15 / 27
Motivation
Different semantics:I RDBMS provide ACID semantics:
1 Atomic: The entire transaction either succeeds or fails2 Consistent: Data within the database remains consistent after each
transaction
3 Isolation: Transactions are sandboxed from each other4 Durable: Transactions are persistent across failures and restarts
I Overkill in case of most user-facing applicationsI Most applications are more interested in availability and willing to
sacrifice consistency leading to eventual consistencyI This basically available, soft state, eventually consistent (BASE) model
enables applications to function even in the face of partial failure
High Throughput: Most NoSQL databases sacrifice consistency foravailability leading to higher throughput (in some cases an order ofmagnitude)
Zubair Nabi 10: Taxonomy of Data and Storage April 20, 2013 15 / 27
Motivation
Different semantics:I RDBMS provide ACID semantics:
1 Atomic: The entire transaction either succeeds or fails2 Consistent: Data within the database remains consistent after each
transaction3 Isolation: Transactions are sandboxed from each other
4 Durable: Transactions are persistent across failures and restartsI Overkill in case of most user-facing applicationsI Most applications are more interested in availability and willing to
sacrifice consistency leading to eventual consistencyI This basically available, soft state, eventually consistent (BASE) model
enables applications to function even in the face of partial failure
High Throughput: Most NoSQL databases sacrifice consistency foravailability leading to higher throughput (in some cases an order ofmagnitude)
Zubair Nabi 10: Taxonomy of Data and Storage April 20, 2013 15 / 27
Motivation
Different semantics:I RDBMS provide ACID semantics:
1 Atomic: The entire transaction either succeeds or fails2 Consistent: Data within the database remains consistent after each
transaction3 Isolation: Transactions are sandboxed from each other4 Durable: Transactions are persistent across failures and restarts
I Overkill in case of most user-facing applicationsI Most applications are more interested in availability and willing to
sacrifice consistency leading to eventual consistencyI This basically available, soft state, eventually consistent (BASE) model
enables applications to function even in the face of partial failure
High Throughput: Most NoSQL databases sacrifice consistency foravailability leading to higher throughput (in some cases an order ofmagnitude)
Motivation (2)
Horizontal Scalability: To cater for more data, NoSQL stores can be scaled out by simply adding more machines; the underlying system automatically redistributes the data
Commodity Hardware: A large number of RDBMS require specialized, proprietary hardware to operate. In contrast, NoSQL databases run on commodity off-the-shelf hardware
Programming Language Support: Over the years, programming languages have started providing abstractions for database support (LINQ, etc.) that bypass SQL. NoSQL databases provide abstractions that map directly onto these language abstractions, leading to tighter coupling
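One common way to get the automatic redistribution mentioned under Horizontal Scalability is consistent hashing. The slides do not name a specific technique, so the sketch below is an assumption chosen for illustration; the HashRing class, the node names, and the number of virtual nodes are all hypothetical.

```python
import bisect
import hashlib


def _hash(value: str) -> int:
    """Map a string to a large integer position on the ring."""
    return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)


class HashRing:
    """Minimal consistent-hash ring with virtual nodes (illustrative only)."""

    def __init__(self, nodes=(), vnodes=64):
        self.vnodes = vnodes
        self._ring = []  # sorted list of (position, node name)
        for node in nodes:
            self.add_node(node)

    def add_node(self, node: str) -> None:
        # Each machine is placed at several positions to even out the load.
        for i in range(self.vnodes):
            bisect.insort(self._ring, (_hash(f"{node}#{i}"), node))

    def node_for(self, key: str) -> str:
        if not self._ring:
            raise LookupError("ring has no nodes")
        # First ring position at or after the key's hash, wrapping around.
        idx = bisect.bisect_left(self._ring, (_hash(key), ""))
        return self._ring[idx % len(self._ring)][1]


keys = [f"user:{i}" for i in range(1000)]
ring = HashRing(["node-a", "node-b", "node-c"])
before = {k: ring.node_for(k) for k in keys}

ring.add_node("node-d")  # scale out by adding one machine
after = {k: ring.node_for(k) for k in keys}

moved = [k for k in keys if before[k] != after[k]]
print(f"{len(moved)} of {len(keys)} keys moved to the new node")
```

Only the keys claimed by the new machine change owner; everything else stays where it was, which is what keeps scaling out cheap.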
Motivation (3)
The Rise of Cloud Computing: Cloud Computing applications require horizontal scalability and low administration overhead. Both requirements are naturally satisfied by NoSQL stores
NoSQL Taxonomy
Introduction
NoSQL databases can be classified on the basis of:
1 Data Model: How data is represented
2 Scalability: How scalable the system is
3 Query Model: What type of API it exposes
4 Persistence: How persistent the data is
Classification by Data Model
Based on the data model, NoSQL databases can be grouped roughly into three categories:
1 Key/value Stores: A map/dictionary allowing put/get semantics per key
2 Document Stores: Complex data structures that encapsulate a document's key/value pairs
3 Column-Oriented Stores: Data laid out by column
Key/value Stores
Data is stored within a large hash map
Simple get/put API
Favour scalability over consistency
Keys are typically limited in size
Examples include Amazon’s Dynamo, LinkedIn’s Voldemort, Redis, and Memcached
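The get/put interface described above can be sketched as a thin wrapper around a hash map. The class name KeyValueStore and the 250-byte key limit are assumptions for illustration and are not the API or configuration of any of the systems listed.

```python
class KeyValueStore:
    """Toy in-memory key/value store with a get/put API (illustrative only)."""

    MAX_KEY_BYTES = 250  # assumed limit, chosen only for this example

    def __init__(self):
        self._data = {}  # the "large hash map"

    def put(self, key: str, value: bytes) -> None:
        if len(key.encode("utf-8")) > self.MAX_KEY_BYTES:
            raise ValueError("key exceeds the maximum key size")
        self._data[key] = value  # the value is opaque to the store

    def get(self, key: str, default=None):
        return self._data.get(key, default)

    def delete(self, key: str) -> None:
        self._data.pop(key, None)


store = KeyValueStore()
store.put("session:42", b'{"user": "zubair"}')
print(store.get("session:42"))
print(store.get("session:99"))  # a missing key yields the default
```

The store never inspects the value, which is why lookups are possible only by key.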
Document Stores
Key/value semantics but based on documents
A document encapsulates data in a standard format, such as JSON, XML, PDF, etc.
Documents themselves can be heterogeneous
Documents can also be retrieved based on their content
Examples include Apache CouchDB and MongoDB
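To make the difference from a plain key/value store concrete, the sketch below keeps heterogeneous documents as Python dictionaries and retrieves them either by identifier or by content. The DocumentStore class and its insert, get, and find methods are hypothetical and do not mirror the CouchDB or MongoDB APIs.

```python
import itertools


class DocumentStore:
    """Toy document store: lookup by id or by field values (illustrative only)."""

    def __init__(self):
        self._docs = {}
        self._ids = itertools.count(1)

    def insert(self, doc: dict) -> int:
        doc_id = next(self._ids)
        self._docs[doc_id] = dict(doc)  # store a copy of the document
        return doc_id

    def get(self, doc_id: int):
        return self._docs.get(doc_id)

    def find(self, **criteria):
        """Return every document whose fields match all given criteria."""
        return [
            doc
            for doc in self._docs.values()
            if all(k in doc and doc[k] == v for k, v in criteria.items())
        ]


db = DocumentStore()
# Documents in the same store do not need to share a schema.
first = db.insert({"type": "talk", "title": "Taxonomy of Data and Storage", "year": 2013})
db.insert({"type": "paper", "title": "Dynamo", "venue": "SOSP"})
db.insert({"type": "talk", "title": "NewSQL", "speaker": "unknown"})

print(db.get(first)["title"])                      # key/value style access
print([d["title"] for d in db.find(type="talk")])  # content-based retrieval
```

Because the store understands the structure of each document, it can answer queries on fields, something an opaque-value store cannot do.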
Column-Oriented Stores
Data is stored and processed by column
Useful for read-mostly and read-intensive data
Data within the same column is of the same type, enabling opportunities for efficient compression
Columns are stored separately so they can be loaded in parallel
Examples include Google’s BigTable (Apache HBase is its open source clone) and Cassandra (originally developed at Facebook)
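A small sketch of why a column layout helps compression and read-heavy queries: the same records are pivoted from rows into columns, and a column with repeated values is run-length encoded. The sample records and the helper names to_columns and run_length_encode are made up for this example; run-length encoding is only one of several compression schemes such stores may use.

```python
from itertools import groupby

# Illustrative records; every row is assumed to have the same fields.
rows = [
    {"user": "a", "country": "PK", "age": 30},
    {"user": "b", "country": "PK", "age": 25},
    {"user": "c", "country": "PK", "age": 41},
    {"user": "d", "country": "US", "age": 25},
]


def to_columns(records):
    """Pivot a list of row dictionaries into one list per column."""
    columns = {}
    for record in records:
        for name, value in record.items():
            columns.setdefault(name, []).append(value)
    return columns


def run_length_encode(values):
    """Collapse runs of equal adjacent values into (value, count) pairs."""
    return [(value, len(list(group))) for value, group in groupby(values)]


columns = to_columns(rows)

# An aggregate over one attribute touches a single column, not whole rows.
average_age = sum(columns["age"]) / len(columns["age"])

# Values in a column share a type and often repeat, so they compress well.
encoded_country = run_length_encode(columns["country"])
print(average_age, encoded_country)
```

Each column is an independent list here, which also hints at why columns can be loaded in parallel.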
NewSQL
Introduction
A hybrid of traditional RDBMS and NoSQL
  Scalability and performance of NoSQL and ACID guarantees of RDBMS
Use SQL as the primary language
Ability to scale out and run over commodity hardware
Classified into:
1 New Databases: Designed from scratch
2 New MySQL Storage Engines: Keep MySQL as interface but replace the storage engine
3 Transparent Clustering: Add pluggable features to existing databases to ensure scalability
New Databases
1 Query Distribution:
  Each node holds a subset of the data
  Queries are split and shipped to nodes that own the data
  Examples include Google’s Spanner and NuoDB
2 Pull Data:
  A central node (possibly replicated) holds all data
  A set of processing nodes receives queries and pulls in required data from the central node
  Examples include VMware’s SQLFire
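The query distribution approach above can be sketched as a scatter/gather step: each node evaluates the query against its own subset of the data and a coordinator combines the partial results. The Node class, the orders data, and the modulo partitioning are assumptions for illustration and do not reflect how Spanner or NuoDB are implemented.

```python
from concurrent.futures import ThreadPoolExecutor


class Node:
    """A node that owns a subset of the rows (illustrative only)."""

    def __init__(self):
        self.rows = []

    def partial_sum(self, predicate):
        # The query runs where the data lives; only a number is sent back.
        return sum(row["amount"] for row in self.rows if predicate(row))


def distributed_sum(nodes, predicate):
    """Ship the query to every node and merge the partial results."""
    with ThreadPoolExecutor(max_workers=len(nodes)) as pool:
        partials = pool.map(lambda node: node.partial_sum(predicate), nodes)
        return sum(partials)


# Made-up data, partitioned across three nodes by order id.
orders = [{"id": i, "amount": i * 10, "paid": i % 2 == 0} for i in range(1, 13)]
nodes = [Node() for _ in range(3)]
for order in orders:
    nodes[order["id"] % len(nodes)].rows.append(order)


def is_paid(row):
    return row["paid"]


total_paid = distributed_sum(nodes, is_paid)
print(total_paid)
```

In the pull data approach, by contrast, the processing nodes would fetch the matching rows from a central node and compute the sum themselves.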
References
1 NoSQL Databases: https://oak.cs.ucla.edu/cs144/handouts/nosqldbs.pdf
2 NewSQL – The New Way to Handle Big Data: http://www.linuxforu.com/2012/01/newsql-handle-big-data/