
CA485 Ray Walshe 2015

NoSQL

BASE vs ACID

Summary

• Traditional relational database management systems (RDBMS) are difficult to scale across many machines because they adhere to ACID. A strong movement within cloud computing is to utilize non-traditional data stores (sometimes poorly dubbed NoSQL or NewSQL) for managing large amounts of data. This article contrasts the traditional ACID with the new-style BASE approach.


Scaling

"If your application relies upon persistence, then data storage will probably become your bottleneck."

• Many websites are I/O-bound. That is, they are limited by how quickly they can access data from their data storage system (normally a SQL database).

• To scale or improve performance, you have two options:

– Vertical Scaling: Get a stronger, faster, better machine.

  • Easiest, but also expensive

  • Limited to the largest single system available

– Horizontal Scaling: Spread data across multiple machines.

  • More flexible, but also more complex

  • Functional Scaling: group data by function and spread functional groups across databases.

  • Sharding: splitting data within functional areas across multiple databases
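
As a minimal sketch of sharding, the routine below routes each key to one of several databases by hashing it. The shard names and the choice of a hash-based rule are assumptions made for illustration; they are not taken from these notes.

```python
import hashlib

# Hypothetical shard names; a real deployment would hold connection details here.
SHARDS = ["db0", "db1", "db2", "db3"]


def shard_for(key: str) -> str:
    """Pick a shard by hashing the key, so the same key always maps to the same database."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    index = int.from_bytes(digest[:4], "big") % len(SHARDS)
    return SHARDS[index]
```

Calling shard_for("alice") twice returns the same shard name, which is what lets reads find the data that earlier writes stored.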


CAP

A theorem (originally a conjecture) stating that web services cannot ensure all three of the following properties at once:

• Consistency: All operations appear to occur at once.

• Availability: Every operation must terminate in an intended response.

• Partition Tolerance: Operations will complete, even if individual components are unavailable.


ACID

Traditional databases utilize transactions that adhere to the following guarantees:

• Atomicity: All operations in the transaction will complete or none will.

• Consistency: Database will be in a consistent state before and after a transaction.

• Isolation: Transaction will behave as if it is the only operation being performed.

• Durability: Upon completion of the transaction, the operation will not be reversed.


To ensure these properties when using partitioned databases, traditional RDBMS utilize two-phase commit:

1. First the transaction coordinator asks each database involved to precommit and indicate if the commit is possible.

2. If all agree, then coordinator instructs each database to commit.

This method ensures consistency over availability (if any databases are down, then we can't commit). Likewise, this locking and coordination serves as a bottleneck and prevents the system from scaling to large numbers of nodes.
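
The two steps above can be sketched as follows. This is a simplified, in-memory illustration: the Participant class and its available flag are hypothetical, and the sketch leaves out the logging, timeouts, and recovery that a real coordinator needs.

```python
class Participant:
    """A hypothetical database node taking part in a distributed transaction."""

    def __init__(self, name, available=True):
        self.name = name
        self.available = available
        self.state = "idle"

    def prepare(self):
        # Phase 1: vote yes only if this node is able to commit.
        if self.available:
            self.state = "prepared"
            return True
        return False

    def commit(self):
        self.state = "committed"

    def abort(self):
        self.state = "aborted"


def two_phase_commit(participants):
    """Return True if the transaction commits on every participant."""
    # Phase 1: ask every participant to precommit and collect the votes.
    votes = [p.prepare() for p in participants]
    if all(votes):
        # Phase 2: everyone agreed, so instruct each participant to commit.
        for p in participants:
            p.commit()
        return True
    # A single "no" vote (for example, an unavailable node) aborts everywhere.
    for p in participants:
        p.abort()
    return False
```

With one participant marked available=False, two_phase_commit returns False and no node commits, which is the consistency-over-availability trade-off in miniature.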


BASE

The current trend in cloud computing data storage is to loosen or relax the requirements of consistency in favor of more availability. This is embodied in the BASE approach:

• Basically available: system guarantees the availability of your data; but the response can be "failure" if the data is in the middle of changing.

• Soft State: the state of the system is constantly changing.

• Eventually Consistent: the system will eventually become consistent once it stops receiving input.


BASE is optimistic and accepts that the database consistency will be in a state of flux. It achieves availability by supporting partial failures without total system failure (i.e. partition tolerance).

To implement BASE, many systems rely on some sort of message queue to persistently store and route data to various storage services that perform the actual database operations.
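
A rough sketch of the queue-based idea, assuming a hypothetical EventuallyConsistentStore class: writes are accepted into a queue immediately and applied to the replicas later. The queue here is an in-memory deque, so unlike a real message queue it is not persistent.

```python
from collections import deque


class EventuallyConsistentStore:
    """Hypothetical store: writes are queued first and applied to replicas later."""

    def __init__(self, replica_count=2):
        self.queue = deque()  # pending writes, not yet applied
        self.replicas = [{} for _ in range(replica_count)]

    def write(self, key, value):
        # Accept the write immediately; the replicas are updated later.
        self.queue.append((key, value))

    def read(self, replica, key):
        # May return stale data (or None) while writes are still queued.
        return self.replicas[replica].get(key)

    def drain(self):
        # Apply every queued write to every replica, in arrival order.
        while self.queue:
            key, value = self.queue.popleft()
            for replica in self.replicas:
                replica[key] = value
```

A read issued between write and drain may miss the new value; once the queue is empty, every replica agrees, which is the "eventually consistent" property.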


BigTable

• BigTable is a distributed storage system created by Google for managing structured data. It is structured as a large table that may be petabytes in size and distributed across tens of thousands of machines.

• HBase is an open source version of BigTable that works on top of Hadoop.


BigTable is a large, persistent, distributed, sparse, sorted, and multidimensional map.

Map

• A map is an associative array or data structure that allows one to look up the value for a corresponding key quickly (e.g. hash table, binary search tree, etc.); in other words, it's a collection of (key, value) pairs.

In BigTable, the key consists of the following:

row key: string, column key: string, timestamp: int64

while the value is simply an array of bytes that is interpreted by the application (up to 64KB).
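
As an illustration of this key structure, a plain Python dictionary can stand in for the map. The put and get helpers are hypothetical names for this sketch and are not BigTable's API.

```python
# Key: (row key, column key, timestamp) -> value: uninterpreted bytes
table = {}


def put(row: str, column: str, timestamp: int, value: bytes) -> None:
    """Store a value under the three-part key."""
    table[(row, column, timestamp)] = value


def get(row: str, column: str, timestamp: int):
    """Return the bytes stored under the exact key, or None if absent."""
    return table.get((row, column, timestamp))


put("ie.dcu.computing", "users:ray", 1, b"Ray Walshe")
```

The application decides how to interpret the bytes; here they happen to be an ASCII name.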


Sorted

Normally, associative arrays are not sorted (keys are hashed to a position in the map). In BigTable, however, data is sorted by row to keep related data close together. This means that we must be careful in choosing row names such that related data is sorted near each other.

For example, to store data about websites, Google's WebTable reverses the domain names of web pages:

ie.dcu.computing

ie.dcu.eeng

ie.dcu.meng

This keeps DCU website rows close together.
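
The reversal can be sketched in a few lines. The row_key helper is a hypothetical name, and www.example.com is added only to show an unrelated site sorting elsewhere.

```python
def row_key(hostname: str) -> str:
    """Reverse the dot-separated labels: computing.dcu.ie -> ie.dcu.computing."""
    return ".".join(reversed(hostname.split(".")))


hosts = ["eeng.dcu.ie", "www.example.com", "computing.dcu.ie", "meng.dcu.ie"]
keys = sorted(row_key(h) for h in hosts)
# keys == ["com.example.www", "ie.dcu.computing", "ie.dcu.eeng", "ie.dcu.meng"]
```

After sorting, the three DCU rows are adjacent, while the unrelated site sorts apart from them.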


• Data Locality

Sorting the rows is a mechanism for improving data locality. With pure hashing it is possible for related data to be spread across multiple machines. Sorting and then partitioning the data allows all the data for one key subset to reside on one machine. A similar technique is used to shuffle data to reducers in MapReduce.
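
A small sketch of partitioning sorted keys by range, assuming two hypothetical split points that divide the key space into three partitions. Real systems choose split points from the data; the values here are made up for illustration.

```python
import bisect

# Hypothetical split points. Partition 0 holds keys below "f", partition 1 holds
# keys from "f" up to (but not including) "n", and partition 2 holds the rest.
SPLIT_POINTS = ["f", "n"]


def partition_for(row_key: str) -> int:
    """Return the index of the partition whose key range contains row_key."""
    return bisect.bisect_right(SPLIT_POINTS, row_key)
```

Because ie.dcu.computing and ie.dcu.eeng share a prefix, they fall into the same range and therefore the same partition, whereas a hash of each key could place them on different machines.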


Multidimensional

• Each table is indexed by rows. Each row contains one or more named column families which are defined when the table is first created. Within a column family, there can be one or more named columns which can be created on the fly. With rows, column families, and columns, we have a three-level naming hierarchy to identify data.

For example:

ie.dcu.computing:             # Row
  - users:                    # Column Family
      - ray: Ray Walshe       # Column
      - cdaly: Charlie        # Column
  - system:                   # Column Family
      - : Linux 3.2           # Column (Null name)


• To get data, we first access the row via the row name and then specify the column key, which is in the form column-family:column. In the example above, we first get the row ie.dcu.computing and then get a particular user with users:ray. To get multiple users, we can use a regular expression (or glob) to fetch multiple values: users:*.
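
The glob-style lookup can be imitated with Python's fnmatch module. The row dictionary mirrors the example row above, and get_columns is a hypothetical helper, not a BigTable call.

```python
from fnmatch import fnmatchcase

# One row, keyed by column key in the form column-family:column.
row = {
    "users:ray": b"Ray Walshe",
    "users:cdaly": b"Charlie",
    "system:": b"Linux 3.2",
}


def get_columns(row, pattern):
    """Return the columns of a row whose column key matches the glob pattern."""
    return {col: val for col, val in row.items() if fnmatchcase(col, pattern)}
```

Here get_columns(row, "users:*") returns both user columns and leaves out system:, while get_columns(row, "users:ray") returns a single column.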

• In addition to row and column, the data is also versioned by timestamps (either real time or application defined time) and sorted such that the most recent cell is first. To help manage these multiple versions, BigTable provides a mechanism to remove entries either by date (keep versions since some time t) or by amount (keep only the latest n versions). These garbage collection settings can be specified per column-family.
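
The two garbage collection policies can be sketched as simple filters over a cell's versions. The function names and the list-of-tuples representation are assumptions for this sketch.

```python
def keep_latest(versions, n):
    """Keep only the n most recent versions, newest first.

    versions is a list of (timestamp, value) pairs in any order.
    """
    return sorted(versions, key=lambda v: v[0], reverse=True)[:n]


def keep_since(versions, t):
    """Keep only versions with timestamp >= t, newest first."""
    recent = [v for v in versions if v[0] >= t]
    return sorted(recent, key=lambda v: v[0], reverse=True)
```

Both helpers return the surviving versions with the most recent first, matching the sort order described above.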



Sparse

While the number of column-families is fixed at creation, the number of columns can grow arbitrarily. This means that within a particular row, it is possible for many columns to be empty.

ie.dcu.computing:
  - language:
      - : EN
  - contents:
      - : <html>...
  - anchor:
      - dcu.ie: Dublin City University
      - microsoft.com: Microsoft

ie.dcu.computing.ftp:
  - language:
      - : EN
  - contents:
      - : <html>...
  - anchor:
      - dcu.ie: Dublin City University
      - kernel.org: Linux
      - computing.dcu.ie: Vinson
      - reddit.com: Reddit
      - freenode.net: Freenode


• Distributed

• BigTable's data is spread across many independent machines. Tables are broken up into collections of rows called tablets such that each tablet has a set of consecutive rows. This allows for distribution of a table onto multiple machines and for load balancing (large tablets are split into smaller ones).

• Persistent

• BigTable uses GFS to store data and log files persistently.

• Large

• Can handle upwards of a Petabyte of data. Hooks into MapReduce (can be used as either input or output) and is utilized by a variety of applications.
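
To illustrate the tablet idea from the Distributed point above, the sketch below splits an oversized tablet (a sorted run of consecutive row keys) in two. The max_rows threshold and the split-at-the-midpoint rule are assumptions; the notes do not say how BigTable picks a split point.

```python
def split_tablet(rows, max_rows):
    """Split a tablet in half if it holds more than max_rows rows.

    rows is a sorted list of consecutive row keys. The result is a list of
    tablets, each of which is still a sorted run of consecutive rows.
    """
    if len(rows) <= max_rows:
        return [rows]
    mid = len(rows) // 2
    return [rows[:mid], rows[mid:]]
```

Each half remains a contiguous key range, so the two new tablets can be assigned to different machines without breaking the sort order.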


Implementation

Architecturally, BigTable resembles GFS: a master that coordinates activity and a large number of tablet servers that store and manage the data. These tablet servers can be added or removed dynamically.

Master

The master assigns tablets to tablet servers and balances tablet server load. It also manages garbage collection of files in GFS and handles schema changes.

Tablet Server

A tablet server manages a set of tablets (10-1,000 per server) and handles read/write requests to the tablets. Internally, this data is stored in Google's SSTable format, which is a persistent, ordered, immutable key-value map file.


Chubby

To coordinate the various servers, BigTable uses Chubby, a highly available and persistent distributed lock service. Chubby manages leases for resources and stores configuration by providing a namespace of files and directories that the user can lock atomically. It is used to:

• Ensure there is only one active master.

• Discover tablet servers.

• Store BigTable schema information.

• Store access control lists.

Example of how it is used:

When a tablet server starts, it creates and acquires an exclusive lock on a uniquely-named file in the servers directory. The master can monitor this directory for new servers.
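
The registration pattern can be sketched with a hypothetical in-memory LockService class. It only imitates the idea of exclusive locks on named files in a directory; it is not Chubby's API, and it is neither distributed nor thread-safe.

```python
class LockService:
    """Hypothetical in-memory stand-in for a lock service with a file namespace."""

    def __init__(self):
        self.locks = {}  # file path -> owner

    def acquire(self, path, owner):
        # Exclusive lock: succeeds only if nobody holds the lock on this path.
        if path in self.locks:
            return False
        self.locks[path] = owner
        return True

    def release(self, path, owner):
        # Only the current owner may release the lock.
        if self.locks.get(path) == owner:
            del self.locks[path]

    def list_dir(self, directory):
        # What a master could poll to discover registered servers.
        prefix = directory.rstrip("/") + "/"
        return sorted(p for p in self.locks if p.startswith(prefix))
```

A tablet server would call acquire("/servers/ts-1", "ts-1") at startup, and the master would see it by calling list_dir("/servers"). The paths and server names are made up for the example.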


• Replication

• A BigTable can be configured for replication to multiple BigTable clusters in different data centers to ensure availability. Data is propagated asynchronously, which results in an eventually consistent model.


Applications

BigTable, like GFS and MapReduce, is utilized internally by Google for many of their operations.

Google Analytics

This is a service that helps webmasters analyze traffic patterns at their website. BigTable is used to maintain raw click information (200 TB).

Google Earth

BigTable is used to store the raw image data.

Personalized Search

User data for personalized search is stored in BigTable.


How is it NoSQL?

• A BigTable cluster may contain several large tables, but it does not support operations across multiple tables (non-relational, no joining).

• No SQL! Perform key lookups to access data.

• Columns have no type (just a bunch of bytes) and may be quite large.

• Columns can be added dynamically.

• Columns within a row may be quite sparse; that is we may have a large number of columns, but each row may only have a tiny fraction of them populated.

• Availability is increased by asynchronously propagating data to multiple clusters in different data centers.
