15
Whitepaper The Modern Database Landscape: Technology Innovations Power Convergence of Transactions and Analytics

Modern Database Landscape

Embed Size (px)

Citation preview

Page 1: Modern Database Landscape

Whitepaper

The Modern Database Landscape: Technology Innovations Power Convergence of Transactions and Analytics

Page 2: Modern Database Landscape

2

The Modern Database Landscape Whitepaper

June 2015

Table of Content

TABLE OF CONTENT ............................................................................................................ 2

ABSTRACT .............................................................................................................................. 3

THE DATABASE LANDSCAPE ............................................................................................. 3

Legacy Database Landscape ............................................................................................................................ 3

The Modern Database Landscape ................................................................................................................. 4

HYBRID TRANSACTIONAL AND ANALYTICAL PROCESSING ....................................... 6

New Sources of Revenue .................................................................................................................................... 7

Simplified Computing Structure ...................................................................................................................... 9

MEMSQL IS DESIGNED FOR HTAP ..................................................................................... 11

In-Memory Storage ................................................................................................................................................ 11

Access to Real-Time and Historical Data ................................................................................................... 11

Code Generation and Compiled Query Execution Plans ................................................................... 12

Lock-Free Data Structures and Multiversion Concurrency Control ............................................. 13

Two-Tiered Shared Nothing Architecture ................................................................................................ 13

Fault-Tolerance and ACID Compliance ...................................................................................................... 14

Integration ................................................................................................................................................................. 14

CONCLUSION ....................................................................................................................... 15

Page 3: Modern Database Landscape

3

The Modern Database Landscape Whitepaper

June 2015

Abstract It is no secret that we live in a data-driven world. For many companies, the technology of the past is still running this data, forcing companies to decide between fast and correct, between easy and global, between saving data and serving data. Modern businesses that seek to transform themselves into real-time enterprises recognize that instantaneous access to data is non negotiable. The paradigm, as set forth by Gartner, for being able to maximize the value of megabytes, even petabytes of data, is HTAP: hybrid transactional and analytical processing, or the ability to perform analysis on data directly in an operational data store. With the concept of HTAP at its crux, MemSQL helps companies eliminate the compromises of the past with its distributed in-memory database, which simultaneously ingests and analyzes enormous quantities of data at sub-second speeds to power business applications. MemSQL provides the performance, speed and high throughput needed by real time enterprise applications.

The Database Landscape

Legacy Database Landscape

Historically, databases fit into one of two categories: those optimized for online transactional processing (OLTP) and those optimized for online analytical processing (OLAP).

An OLTP-optimized database is essentially a database of record, used to store information for an application with rapidly changing data. Running an application with maximum performance on a database requires valid transactions. If a transaction is invalid, the database is compromised in real-time. OLTP databases frequently power large transactional workloads, like retail sales and inventory management requiring an accurate record as shipments arrive and sales complete.

An OLAP-optimized database is comparable to a data warehouse, designed to store large amounts of unchanging data. These databases use storage formats that allow

Page 4: Modern Database Landscape

4

The Modern Database Landscape Whitepaper

June 2015

them to perform operations like table scans and aggregations quickly. In order to transform data into a format that facilitates these queries, operations like individual inserts and updates take longer than in a transactional database. Data warehouses optimize speed by loading data in batches from a flat file or data pipeline. Depending on the size of the data warehouse, companies may use an additional data mart to house a relevant subset of the data. Analysts will often work with an analytic mart while they build and test models.

Transactional and analytical workloads typically run in separate data stores. Applications are often built on an OLTP database of record, but periodically, the data is shifted to an OLAP-optimized data warehouse for long-term storage and analysis.

The Modern Database Landscape

When data volume began to grow exponentially, enterprises struggled to scale their legacy RDBMSs (Relational Database Management Systems). OLTP and OLAP databases ran on a single server, and could not provide the storage capacity or data processing speed required for Big Data applications. Some organizations tried to retrofit their existing database infrastructure by manually partitioning data across multiple database instances or using complex caching schemes. However, this strategy increased administrative overhead and still lacked the necessary speed to process very large datasets. As a result, high technology companies sought new scalable, performance-oriented database architectures.

Figure 1: The Modern Database Evolution

Relational Column Store

Relational Row Store

Key Value Store

NoSQL Document

Stores

NoSQL Row

Stores

SQL on Hadoop

Relational

Non-relational attempts

Page 5: Modern Database Landscape

5

The Modern Database Landscape Whitepaper

June 2015

Some companies turned to non-relational NoSQL databases designed for scalability and high throughput. There are various NoSQL design paradigms, but most are intended for sparse or unstructured data and use either a key-value or JSON-like data model. They are often used for data caching and other time-sensitive key-value workloads.

However, NoSQL databases do not provide a fully relational model needed for many enterprise applications. They may offer analytic capabilities in the form of a map-reduce framework or user-defined functions, but often lack the ability to run ad hoc analytical queries that users have come to expect from SQL databases. NoSQL data models and storage engines are designed for fast inserts and reads, but do not lend themselves to aggregations or operations like JOINs. To address this limitation, some NoSQL vendors have expanded their query languages to provide functionality similar to that of SQL. Overall, however, NoSQL query languages have not caught up to the expressive power of SQL.

Figure 2: The Modern Database Landscape

Key Value Store SQL on Hadoop NoSQL Document Store NoSQL Row Store Relational Row

Store Relational Column Store

Relational Row and Column Store Legend

In-Memory

Memcached Shark/Spark MongoDB Aerospike VoltDB Oracle In-Memory MemSQL HTAP Workloads

Redis Oracle Times Ten SAP HANA

Clustrix SQL Server

MySQL Cluster Real-Time Workloads

IBM BLU

Flash Optimized

Aerospike MemSQL

Disk

Impala MongoDB Aerospike IBM Netezza AWS Redshift MemSQL

Hive Riak Cassandra Teradata Exadata SQL Server

CouchDB MySQL Vertica Batch Workloads

Postgres ParAccel

Oracle Sybase IQ

Clustrix Operational Workloads

NuoDB

Sybase

AWS RDS

Simple Reads and Writes

Batch Loading No Transactions

Reads and Writes with Database Level Locking

Fast Inserts and Lookups, Eventual Consistency

Transactional Reads and Writes

Batch Loading, Analytic Reads and Writes

Transactional and Analytical Reads and Writes

No Analytics Batch Analytics Limited Analytics Limited Analytics Simple Analytics Deep Analytics Real-Time Analytics

Horizontal Scalability

Horizontal Scalability

Limited Horizontal Scalability

Horizontal Scalability Limited Scalability Varied Scalability Horizontal

Scalability

Page 6: Modern Database Landscape

6

The Modern Database Landscape Whitepaper

June 2015

Around the same time NoSQL databases became popular, the Hadoop Distributed File System (HDFS) emerged as a scalable store to house increasingly large volumes of data. HDFS is a file system, not a database, and although it leverages a map-reduce framework for data analysis, it does not provide a familiar query interface for ad hoc analysis. To close this gap, various companies and open source developers created “SQL on Hadoop” query engines that sit on top of HDFS. These query engines are used primarily for batched analytic processing of large, static datasets, but lack basic transactional features like the ability to update or delete data.

On the transactional side, a new breed of SQL databases arose to address concerns for high throughput and scalability, also key motivations behind the NoSQL movement. Unlike the previous generation of RDBMSs, these OLTP-optimized SQL databases, sometimes referred to as NewSQL, were designed to run in a distributed environment. They are most commonly used as operational databases, maintaining application state like a traditional OLTP database but with more horsepower.

Even with innovation in database design, the transactional/analytical database split remains common. Most of the NoSQL and NewSQL offerings are optimized for OLTP or OLTP-like workloads and, in practice, are used for operations like time-sensitive record keeping. They are unable to perform mixed transactional and analytical workloads because of impediments like write locks, which block query execution during data ingest; lack of support for ad hoc analytical queries, or both. On the other side, data warehouses and Hadoop are not optimized for high throughput and lack the read and write speed to serve an application.

Hybrid Transactional and Analytical Processing

Database architects have sought the technology to perform transactional and analytical processing in a single database. Today, advances in in-memory computing technology make HTAP a reality. An HTAP-oriented system performs transactional and analytical operations in a single database, often running analysis on real-time,

Page 7: Modern Database Landscape

7

The Modern Database Landscape Whitepaper

June 2015

streaming data. It complements an organization’s existing analytics infrastructure and can be used in conjunction with other systems.

The two key benefits of HTAP are:

• Enabling new sources of revenue • Reducing administrative and development overhead by simplifying computing

infrastructure

New Sources of Revenue

Many databases promise to speed up applications and analytics, but simply increasing speed does not always open up new revenue channels. Real-time analytics transcends speed to truly capture the value of data before it reaches a time threshold. Historical technological limitations led companies create workload-specific data stores, which introduces latency, complexity, and a limit to realizing the full value of real-time data. HTAP systems address these issues and allow enterprises to harvest new business value such as:

1. Automate data-driven decision-making

In many industries, “real-time” implies a time window small enough to necessitate automation. For example, digital advertisers must select and display an advertisement in less than a second, before the potential viewer scrolls past the banner or navigates to a different page. The simplest solution is to pre-select the advertisements to display. However, dynamic targeted advertising, ads selected based on knowledge about the viewer’s demographic information, increases click-through rates and leads to more revenue for ad buyers.

The process of personalized ad serving is complex. It requires retrieving customer profile information, selecting an ad based on that profile, and displaying the ad to the customer. It requires fast transactional processing to record information like customer ID, what ad was shown, and whether they clicked on the ad. It also requires analytical processing to choose an ad based on profile information and previous clickstream data. Maintaining separate transactional and analytical data stores introduces latency that limits the speed and accuracy with which ads can be chosen. An HTAP-capable

Page 8: Modern Database Landscape

8

The Modern Database Landscape Whitepaper

June 2015

system can perform both operations within a single database of record, providing fast access to the most recent data.

Automated data-driven decision-making has applications in many industries, not just digital advertising. This model can be used to increase customer engagement in online retail with personalized recommendations, or to optimize systems of shipping routes based on traffic and availability. While real-time automation techniques are advancing rapidly, these industries are ripe for growth and innovation.

2. Perform Real-Time Operational Analytics

Even the most sophisticated business analytics software is only as powerful as its underlying database. Legacy operational databases lack the analytic capabilities to serve as an operational data store and an analytics platform simultaneously. Instead, most businesses periodically offload data from their operational data store and batch load it into a data warehouse. This becomes a time and compute-intensive process, and only gives data analysts access to static historical data.

The HTAP model gives analysts instantaneous access to real-time data, providing faster and more targeted insights. The ability to analyze data as it is generated allows business to identify operational trends as they develop, For applications like data center monitoring, this helps reduce or eliminate downtime. For applications that require monitoring of complex dynamic systems, for instance a shipping network, HTAP allows analysts to “direct traffic” and make real-time decisions to boost revenue.

3. Detect and Respond to Anomalies in Real Time

Unchecked anomalies, like fraud or hardware failures, can lead to sustained loss of revenue until the issue is addressed.

Detecting anomalies in real time means spotting patterns and trends. Anomaly detection requires more than processing streams of individual records. In many cases, the context for the event is equally important for identifying and addressing the anomaly.

Page 9: Modern Database Landscape

9

The Modern Database Landscape Whitepaper

June 2015

Fraud detection requires processing an event, like a credit card purchase, by running a series of queries to determine how ordinary the event is. The system must pull records of previous purchases with a given credit card, information like purchase sizes and locations, to determine whether that new purchase fits with previously established purchasing patterns. In this case, an HTAP system enables a user to load and query streaming purchase data while simultaneously querying historical data to process purchases as they come in. Time is a game changer in fraud detection and there are usually strict SLAs. This makes it extremely advantageous to be able to perform both record keeping and analytics in a single database. If historical data is siloed in a separate data store, data transfer and inter-system communication introduce latency that may allow fraudulent transactions to execute unknowingly.

With an application, like monitoring IT infrastructure, the value of HTAP lies in the ability to analyze the data around the time of the anomaly and diagnose the problem. Detecting a server failure may not be a big challenge, but figuring out why the server failed and getting it back online can be, especially with volumes and volumes of machine-generated data. When an anomaly is detected, there is no time to export monitoring data to a data warehouse and the monitoring database must stay online to record logs from the rest of the data center. An HTAP database enables the user to pull critical data around the anomaly quickly and seamlessly to avoid revenue loss.

Simplified Computing Structure

Because of limitations in legacy database technology, organizations have turned to in-memory computing tools like data grids, stream processing engines, and other distributed computing frameworks to process data in a real-time. and ensure low latency. While these tools have their uses, they introduce additional complexity to an organization’s computing infrastructure and run the risk of being misused. It is generally more effective to use a powerful database over separate data processing tools to streamline business processes.

An HTAP system can dramatically simplify an organization’s data processing infrastructure by functioning as the core of the day-to-day operational workload. It serves a dual purpose: database of record and analytical warehouse.

Page 10: Modern Database Landscape

10

The Modern Database Landscape Whitepaper

June 2015

There are many advantages to maintaining a simple computing infrastructure:

1. Increased uptime. Simple infrastructure has fewer potential points of failure. As a result, components fail less often and it is easier to diagnose the problem if they do.

2. Reduced latency. Data transfer between data stores takes time. Moreover, in a complex computing infrastructure with many separate components, not all components necessarily use the same data storage formats. Data transfer then necessitates ETL, which is time-consuming and introduces further opportunities for error.

3. Faster development cycles. Developers work faster when they can build on fewer, more versatile tools. Different data stores will likely have different query languages, requiring developers to spend additional time familiarizing themselves with the separate systems. When they also have different storage formats, developers must spend time writing ETL tools, connectors, and synchronization mechanisms.

Beyond the major benefits offered by simple infrastructure, HTAP systems provide unique value:

1. Save development time and prevent disasters by eliminating the need for synchronized copies of “live” operational data. Synchronizing data is invariably difficult. It is especially difficult when two systems must have current operational data, and a composite infrastructure has been chosen specifically because one system has insufficient throughput. To further complicate matters, systems like stream processing engines and data grids may not be fully fault tolerant, which places additional importance on the database of record. An HTAP system eliminates the need for multiple copies of real-time data.

2. Reduce hardware costs and administrative overhead by eliminating unnecessary duplicated data. Maintaining multiple operational data stores (i.e. not for disaster recovery) requires paying for additional hardware and DBAs. HTAP eliminates the need to manually synchronize state across separate data stores because users can run analytics directly in the database of record. HTAP systems must be fault-tolerant and ACID compliant.

Page 11: Modern Database Landscape

11

The Modern Database Landscape Whitepaper

June 2015

MemSQL is Designed for HTAP

MemSQL is the real-time database for transactions and analytics. As a fully relational, distributed database, MemSQL enables enterprises to concurrently process and analyze large volumes of data, scaling across commodity hardware or in the cloud. MemSQL has a combination of innovative features that allows us to deliver on Big Data promises, including:

• In-Memory Operation • Access to Real-Time and Historical Data • Code Generation and Compiled Query Execution Plans • Lock-Free Data Structures and Multiversion Concurrency Control • Two Tiered Shared Nothing Architecture • Fault Tolerance and ACID Compliance • Integration

In-Memory Storage

MemSQL stores data in memory, which allows both reads and writes to happen orders of magnitude faster than on legacy disk-based technology. Storing data in memory is especially valuable for running concurrent workloads, whether transactional or analytical, because it eliminates the inevitable bottleneck caused by disk contention. Storing data in memory is not sufficient for making a system capable of HTAP, and MemSQL is not the only database to use memory as the primary storage layer. However, in-memory operation should be considered necessary for HTAP as no purely disk-based system will be able to provide the required I/O with any reasonable amount of hardware.

Access to Real-Time and Historical Data

MemSQL has two separate storage engines, each designed to facilitate a different kind of workloads. The in-memory row store is designed for high-throughput operational workloads with fast data loading and row lookups. The disk-based column store is designed for fast analytical queries involving large table scans and aggregations, as well as for compressed storage of historical data. Together these

Page 12: Modern Database Landscape

12

The Modern Database Landscape Whitepaper

June 2015

storage engines converge real-time and historical data in a unified database platform. For example, JOINs can take place between row-based tables and column-based tables.

HTAP requires not only speed but also the ability to compare real-time data to statistical models and aggregations of historical data. With legacy database infrastructure, historical data is siloed in a separate data warehouse and querying it requires data transfer and ETL. MemSQL makes real-time and historical data available through a single interface.

Code Generation and Compiled Query Execution Plans

Without disk I/O, queries execute so quickly that dynamic SQL interpretation can become a bottleneck. MemSQL addresses this by taking SQL statements and generating a compiled query execution plan. With each new query, MemSQL automatically removes the parameters and generates an execution plan, written in C++, which is compiled to machine code. MemSQL stores compiled query plans in a repository called the plan cache. When future queries match an existing parameterized query plan template, MemSQL bypasses code generation and executes the query immediately using the cached plan. Executing a compiled query plan is much faster than interpreting SQL as there are several lower level optimizations along with the inherent performance advantage of executing compiled versus interpreted code.

Compiled query plans are key to providing performance advantages during HTAP workloads. Some companies use a caching layer on top of their RDBMS, usually a key-value store with SQL statements mapped to query results. This strategy may improve performance for queries on immutable datasets, but this approach runs into problems with a rapidly changing dataset. When the dataset changes, the cache is invalidated and must be repopulated with updated query results, a process which is rate-limited by the underlying database. In addition to the performance degradation, synchronizing the state across multiple data stores poses a difficult engineering problem. The MemSQL query plan strategy provides an advantage by executing a query on the in-memory database directly, rather than fetching cached results, which

Page 13: Modern Database Landscape

13

The Modern Database Landscape Whitepaper

June 2015

is part of why MemSQL maintains remarkable query performance even when data is frequently updated.

Figure 3: MemSQL Query Execution Plan

Lock-Free Data Structures and Multiversion Concurrency Control

MemSQL achieves high throughput due to its use of lock-free data structures and multiversion concurrency control (MVCC), which allows the database to avoid locking on both reads and writes. Traditional databases manage concurrency with locks, which results in some processes blocking others until they complete and release the lock. In MemSQL, writes never block reads and vice versa.

The absence of locks allows multiple queries to access the same data simultaneously. This is why MemSQL can maintain analytical query performance even during heavy write workloads such as loading streaming data.

Two-Tiered Shared Nothing Architecture

MemSQL has a two-tiered distributed architecture composed of aggregators, which are cluster-aware query routers, and leaves, which are storage and compute nodes. Clusters can be deployed on-premises on physical servers, in the cloud, or in a

Parse Parse

Execute Execute

Code Gen

Plan Cache

Code Gen Plan Cache

Page 14: Modern Database Landscape

14

The Modern Database Landscape Whitepaper

June 2015

virtualized environment. When a query is issued from a client machine to an aggregator, the query is broken down into partial-result statements that are distributed between the leaf nodes. The distributed query optimizer ensures consistent workload utilization and balancing across the cluster. No two nodes share storage or compute resources. Aggregators and leaves can be added at any time with the cluster online, even while running a workload

This architecture enables massively parallel processing, which dramatically reduces query response times. Both aggregators and leaves can be added or removed at any time, making it easy to tailor your cluster to accommodate the types and volume of client requests. Unlike manually sharded single box databases that experience performance degradation as they scale, MemSQL performance scales linearly with cluster size.

Moreover, the tiered architecture facilitates managing concurrent workloads. Increasing the number of aggregators will improve operations like data loading and will allow MemSQL to process more client requests concurrently. Application workloads can be divided amongst groups of aggregators, where one group of aggregators serves Application A, another group serves Application B, and so forth. For instance, one group of aggregators can process and load data from a streaming source, and another group of aggregators can be connect to business intelligence software or another analytic application.

Fault-Tolerance and ACID Compliance

Operational data stores cannot lose data. This makes fault tolerance and ACID compliance prerequisites for any HTAP system. MemSQL ensures that data is never lost with redundancy in the cluster and cross-data center replication for disaster recovery. MemSQL also writes logs and complete database snapshots to disk, and is fully ACID compliant.

Integration

Gartner reports that, while HTAP systems will drive innovation and enable new sources of revenue, they will not entirely displace existing computing infrastructure in

Page 15: Modern Database Landscape

15

The Modern Database Landscape Whitepaper

June 2015

the near future. As a result, HTAP databases must easily integrate with existing hardware and software.

MemSQL is designed for seamless interoperability. MemSQL uses SQL, the most widely used and best understood query language. Moreover, MemSQL uses the MySQL wire protocol, which allows developers to connect with familiar MySQL connectors and development tools such as the MySQL command line client, ODBC, and JDBC. MemSQL supports geospatial data types that follow the OpenGIS standard MemSQL also natively supports JSON storage and manipulation so organizations can process relational and semi-structured data together in a single data store.

Conclusion Technological advances are dramatically reshaping the database landscape. HTAP adoption is advancing as companies move to capitalize on new revenue sources made possible by real-time analytics. Implementing HTAP requires reexamining the traditional OLTP/OLAP binary and deploying technology capable of performing analytics in a true operational database. MemSQL, with its ability to carry out concurrent transactional and analytical workloads, is the right tool for turning real-time data into revenue.

To learn more or download MemSQL, visit www.memsql.com.

To discuss your real-time analytics needs with a MemSQL expert, please contact us at 855-4-MEMSQL (463-6775) or [email protected].

MemSQL Headquarters 534 4th Street, San Francisco, CA 94107