Shard-Query, an MPP database for the cloud using the LAMP stack

Shard-QueryAN MPP DATABASE FOR THE CLOUD USING THE

LAMP STACK

Introduction

Presenter

• Justin Swanhart

• Principal Support Engineer at Percona

• Previously a trainer and consultant at Percona too

Developer

• Swanhart-tools

• Shard-Query – MPP sharding middleware for MySQL

• Flexviews – Materialized views (fast refresh) for MySQL

• bcmath UDF – arbitrary precision math for MySQL

http://www.percona.com/

http://github.com/greenlion/swanhart-tools

Intended Audience

• MySQL users with data too large to query efficiently using a single machine

• Big Data

• Analytics / OLAP

• User generated content analysis

• People interested in distributed database processing

Terms

MPP – Massively Parallel Processing

• An MPP system is a system that can process a SQL statement in parallel on a single machine or even many machines

• A collection of machines is often called a Grid

• MPP is also sometimes called Grid Computing

MPP (cont)

• Not many open source databases (none?) support MPP

• Community editions of closed source offerings are limited

• Some closed source databases include Vertica, Greenplum, Redshift

The Cloud

• Managed collection of virtual servers

• Easy to add servers on demand

• Ideal for a federated, distributed database grid

• Easy to “scale up” by moving to a VM with more cores

• Easy to “scale out” by adding machines

• Amazon is one of the most popular cloud environments

LAMP stack

• Linux

• Amazon Linux

• RHEL

• Ubuntu LTS, etc.

• Apache Web Server

• Most popular web server on the planet

• MySQL

• The world’s most popular open source database

• PHP

• High level language makes development easier

Database Middleware

• A piece of software that sits between an end-user application and the database

• Operates on the queries submitted by the application, then returns the results to the application

• Usually a proxy of some sort

• MySQL proxy is the open source user configurable proxy for MySQL

• Supports Lua scripts which intercept queries

• Shard-Query can use MySQL Proxy out of the box

Message Queue / Job Server

• Accepts jobs or messages and places them in a queue

• A worker reads jobs/messages from the queue and acts on them

• Offers support for asynchronous jobs

• Gearman

• My job server of choice for PHP

• Has two different PHP interfaces (pear and pecl)

• SQ comes bundled with a modified version* of the pear interface

• Excellent integration with MySQL as well (UDF)

* Removes warnings triggered by modern PHP strict mode

Sharding

• It is a short for Shared Nothing

• Means splitting up your data onto more than one machine

• Tables that are split up are called sharded tables

• Lookup tables are not sharded. In other words, they must be duplicated on all nodes

• Shard-Query supports directory based or hash based sharding

Shard mapper

• Shard-Query supports DIRECTORY and HASH mapping out of the box

• DIRECTORY based sharding allows you to add or remove shards from the system, but lookups may go over the network, reducing performance* compared to HASH mapping

• HASH based sharding uses a hash algorithm to balance rows over the sharded database. However, since a HASH algorithm is used, the number of database shards can not change after initial data loading.

* But only for queries like “select count(*) from table where customer_id = 50”

What is “big data”

Most machine generated data

• Line order information for a large organization like Wal-Mart™

• Any data so large that you can’t effectively operate on it on one machine

• For example, an important query that needs to run daily executes in greater than 24 hours. It is impossible to meet the daily goal unless you can find a way to make the query execute faster.

• These kind of problems can happen on relatively small amounts of data (tens of gigabytes)

Analytics(OLAP) versus OLTP

• OLTP is focused on short lived small transactions that read or write small amounts of data

• OLAP is focused on bulk loading and reading large amounts of data in a single query.

• Aggregation queries are OLAP queries

• Shard-Query is designed for analytics (OLAP) not OLTP

• must parse all commands sent to it (and make multiple round trips)

• Minium query time of around 20ms

PROBLEM: Single Threaded QueriesTHE BIGGEST BOTTLENECK IN ANALYTICAL QUERIES IS THE SPEED OF A SINGLE CORE

Single thread queries in the database

• MySQL, PostgreSQL, Firebird and all other major open source databases have single threaded queries

• This means that a single query can only ever utilize the resources of a single core

• As the data size grows, analytical queries get slower and slower

• In memory, as the data grows the speed decreases because the data is accessed in a single query

• As the number of rows to be examined increases, performance decreases

Why single threaded

• MySQL is optimized for getting small amounts of data quickly(OLTP)

• It was created at a time when having more than one CPU was not common

• Adding parallelism now is a very complex task, particularly since MySQL supports multiple storage engines

• So adding parallel query is not a high priority (not even on the roadmap)

• Designed to run LOTS of small queries simultaneously, not one big query

Single Threading – bad for IO

• If the data set is significantly larger than memory, single threaded queries often cause the buffer pool to "churn“

• For example, small lookup tables can easily be pushed out of the buffer pool, resulting in frequent IO to look up values

• While SSD may helps somewhat, one database thread can not read from an SSD at maximum device capacity

• While the disk may be capable of 1000+ MB/sec, a single thread is generally limited to <100MB/sec (usually 30-40)

• This is because a single thread shares doing IO AND running the query on one CPU (MySQL does not use read threads for queries)

The OLAP Example

• A large company maintains a star schema of their sales history for analytics purposes

• This company likes to present a sum total of orders for all time on the dashboard

• In the beginning the query is very fast

• It gets slower, though, as months of data are added and as the business grows, data increases too

• Eventually the query takes more than 24 hours to run, which means it can no longer be updated daily

• “Drill down” gets slower as data increases

What can be done?

• Caching?

• Materialized views?

• Partitioning?

• Sharding?

Making OLAP more like OLTP!

• Shard-Query breaks on big query up into smaller queries that can access the database in parallel

• Partitioning and sharding are used to keep data size for any single query to a minimum

• If your table has 16 partitions, you can get up to 16 way parallelism

• If you also have 2 nodes, you get 32 way parallelism, and so on

• You can use multiple database schema on a single server instead (a form of sharding) if you don’t partition your data

Shard-QueryADDING PARALLELISM TO QUERIES

Sharding Reviewed

• A sharded database contains multiple nodes or databases called shards

• One physical machine might host many shards

• Each shard has identical schema

Sharding Reviewed (cont)

• The multiple shards function together as one RDBMS system.

• You can think of the shards as a big UNION ALL of the data, with only a portion of the data on any one machine

• A mechanism must control which server on which to place particular pieces of data.

• In Shard-Query a particular column controls data placement – this is called the shard key

Sharding – Data distribution

• There are usually one or two large tables that are sharded

• These are usually called FACT tables

• An example might be blogs, blog_posts and blog_comments. All three share a “shard key” of blog_id

• Most common case is one big table with smaller lookup tables

Sharding Reviewed (cont)

• The shard key is very important!

• Since a specific column acts as the “shard key”, all sharded tables must contain the shard key.

• For example: blog_id might be the shard key.

• The rows for a specific blog_id are then located on the same shard in any table that has the blog_id column

Optimization - Shard Elimination

• When Shard-Query sees an expression on the shard key it looks up* the shard that contains the appropriate data and only sends queries to the necessary shards.

• Equality lookup is most efficient, but IN, BETWEEN and other operators are allowed as well

• Lookups may not use subqueries (ie, blog_id IN (1,2,3) is okay, not blog_id in (select …))

• This is called “shard elimination”

• Shard elimination is analogous to partition elimination.

• where blog_id = 10, for example

Can Shard-Query help on 1 machine?

• Yes! - Use MySQL partitioning on a single machine

• Shard-Query can access the partitions of a table in parallel!

• This means that if you have many partitions, then Shard-Query can utilize many cores to answer the query

Use partitions for parallelism

How does that work?

• Shard-Query executes an EXPLAIN PLAN on the query

• This EXPLAIN PLAN shows the partitions that MySQL will access when running the query

• Shard-Query uses the 5.6 PARTITION hint to generate one query per partition

• These queries can execute in parallel

Sharding can help too

• How?

• Shard-Query adds parallelism to queries by spreading them over nodes in parallel

• Spread the data over four nodes and queries are 4x faster

MySQL database shards

Shard-Query

Sharding + Partitioning is best

• Why?

• Partition the tables to add parallelism to each node

• Use sharding to have multiple nodes working together

• 4 nodes with 3 partitions each = 12 way parallelism

Shard-Query


Partitions

Shard-QueryARCHITECTURE

Configuration Repository

• Shard-Query stores all configuration information in a MySQL database called the configuration repository

• This should be a highly available replication pair (or XtraDBcluster) for HA

• Web interface can change the settings

• Manual settings changes can be done via SQL

• schemata_config table in Shard-Query repository

• Makes using Shard-Query easier, especially when using more than one node

PHP OO Apache Web

Interface

MySQL Proxy

Gearman Message Queue

Worker Worker Worker Worker


Shard-Query Architecture

Interfaces

Communication

Workers

Storage

ConfigRepository

Configuration Management

PHP OO Apache Web

Interface

MySQL Proxy




Shard-Query ArchitectureGearman job server

• Provides the parallel mechanism for Shard-Query

• Multiple Gearman are supported for HA

• Enables Shard-Query to use a map/reduce like architecture

• Sends jobs to workers when they arrive at the queue

• If all workers are busy the job waits

Gearman at a glance

Shard-Query OO

Store-resultsetLoader workerSQ run SQL worker

PHP OO Apache Web

Interface

MySQL Proxy




Shard-Query ArchitectureThree kinds of workers

• loader_worker – Listens for loader jobs and executes them. Used by parallel loader.

• shard_query_worker – Listens for SQL jobs, runs the job via Shard-Query and returns the results as JSON. Used by web and proxy interfaces.

• store_resultset_worker – Main worker used by Shard-Query. It runs SQL and stores the result in a table.

PHP OO Apache Web

Interface

MySQL Proxy




Shard-Query ArchitecturePHP Object Oriented Interface

• Very simple to use

• Constructor parameters not even usually needed

• Just one function to run a SQL query and get results back

• Complete example comes with Shard-Query as: bin/run_query

PHP OO Example (from bin/run_query):

$shard_query = new ShardQuery();$stime = microtime(true);$stmt = $shard_query->query($sql);$endtime = microtime(true);

if(!empty($shard_query->errors)) {if(!empty($shard_query->errors)) {echo "ERRORS RETURNED BY OPERATION:\n";print_r($shard_query->errors);

}}

if(is_resource($stmt) || is_object($stmt)) {$count=0;while($row = $shard_query->DAL->my_fetch_assoc($stmt)) {

print_r($row);++$count;

}echo "$count rows returned\n";$shard_query->DAL->my_free_result($stmt);

} else {if(!empty($shard_query->info)) print_r($shard_query->info);echo "no query results\n";

}echo "Exec time: " . ($endtime - $stime) . "\n";

Simple data access layercomes with Shard-Query

Errors are returned as a memberof the object

Run the query

PHP OO Apache Web

Interface

MySQL Proxy




Shard-Query ArchitectureApache web interface

• GUI

• Easy to set up

• Run queries and get results

• Serves as an example of using Shard-Query in a web app with asynchronous queries

• Submits queries via Gearman

• Simple HTTP authentication

PHP OO Apache Web

Interface

MySQL Proxy




Shard-Query ArchitectureMySQL Proxy Interface

• LUA script for MySQL Proxy

• Supports most SHOW commands

• Intercepts queries, and sends them to Shard-Query using the MySQL Gearman UDF

• Serves as another example of using Gearman to execute queries.

• Behaves slightly differently than MySQL for some commands

Query submitted

SQL is parsed

Query rewritefor parallelismyields multiple

queries

Gearman Jobs (map/combine)

Final Aggregation (reduce)

Return result

Shard-Query Data FlowMap/reduce like workflow

Query submitted

SQL is parsed


queries



Return result

Shard-Query Data Flow

SQL Parser

• Find it at http://github.com/greenlion/php-sql-parser

• Supports

• SELECT/INSERT/UPDATE/DELETE

• REPLACE

• RENAME

• SHOW/SET

• DROP/CREATE INDEX/CREATE TABLE

• EXPLAIN/DESCRIBE

Used by SugarCRM too, aswell as other open source projects.

http://github.com/greenlion/php-sql-parser

Query submitted

SQL is parsed


queries



Return result


Query Rewrite for parallelism

• Shard-Query has to manipulate the SQL statement so that it can be executed over more than on partition or machine

• COUNT() turns into SUM of COUNTs from each query

• AVG turns into SUM and COUNT

• SEMI-JOIN is turned into a materialized join

• STDDEV/VARIANCE are rewritten as well use the sum of squares method

• Push down LIMIT when possible

Query Rewrite for parallelism (cont)

• Because lookup tables are duplicated on all shards, the query executes in a shared-nothing way

• All joins, filtering and aggregation are pushed down

• Mean very little data must flow between nodes in most cases

• High performance

• Meets or beats Amazon Redshift in testing at 200GB of data

Query submitted

SQL is parsed


queries



Return result


Map/Combine

• The store_resultset gearman worker runs SQL and stores the result in a table

• To keep the number of rows in the table (and the time it takes to aggregate results in the end) small, an INSERT … ON DUPLICATE KEY UPDATE (ODKU) statement is used when inserting the rows

• There is a UNIQUE KEY over the GROUP BY attributes to facilitate the upsert

Query submitted

SQL is parsed


queries



Return result


Final aggregation

• Shard-Query has to return a proper result, combining the results in the result table together to return the correct answer

• Again, for example COUNT must be rewritten as SUM to combine all the counts (from each shard) in the result table

• Aggregated result is returned to the client

Shard-Query Flow as SQL[justin@localhost bin]$ ./run_query --verboseselect count(*) from lineorder;Shard-Query optimizer messages:

SQL TO SEND TO SHARDS:Array(

[0] => SELECT COUNT(*) AS expr_2913896658FROM lineorder PARTITION(p0) AS `lineorder` WHERE 1=1



[3] => SELECT COUNT(*) AS expr_2913896658FROM lineorder PARTITION(p3) AS `lineorder` WHERE 1=1 )SQL TO SEND TO COORDINATOR NODE:SELECT SUM(expr_2913896658) AS ` count `FROM `aggregation_tmp_58392079`

Array(

[count ] => 0)1 rows returnedExec time: 0.03083610534668

Initial query

Query rewrite / map

Final aggregation / reduce

Final result

Map/Combine exampleselect LO_OrderDateKey, count(*) from lineorder group by LO_OrderDateKey;Shard-Query optimizer messages:* The following projections may be selected for a UNIQUE CHECK on the storage node operation:expr$0

* storage node result set merge optimization enabled:ON DUPLICATE KEY UPDATE

expr_2445085448=expr_2445085448 + VALUES(expr_2445085448)

SQL TO SEND TO SHARDS:Array(

[0] => SELECT LO_OrderDateKey AS expr$0,COUNT(*) AS expr_2445085448FROM lineorder PARTITION(p0) AS `lineorder` WHERE 1=1 GROUP BY expr$0



[3] => SELECT LO_OrderDateKey AS expr$0,COUNT(*) AS expr_2445085448FROM lineorder PARTITION(p3) AS `lineorder` WHERE 1=1 GROUP BY expr$0)SQL TO SEND TO COORDINATOR NODE:SELECT expr$0 AS `LO_OrderDateKey`,SUM(expr_2445085448) AS ` count `FROM `aggregation_tmp_12033903` GROUP BY expr$0

combine

reduce

Use cases

Machine generated data

• Sensor readings

• Metrics

• Logs

• Any large table with short lookup tables

Star schema are ideal

Call detail records

• Shard-Query is used in the billing system of a large cellular provider

• CDRs generate a lot of data

• Shard-Query includes a fast PERCENTILE function

Green energy meter processing

• High volume of data means sharding is necessary

• With Shard-Query, reporting is possible over all the shards, making queries possible that would not work with Fabric or other sharding solutions

• Used in India for reporting on a green power grid

Log analysis

• Performance logs from a web application for example

• Aggregate many different statistics and shard if log volumes are high enough

• Search text logs with regular expressions

Performance

Star Schema Benchmark – SF 20

• 119 million rows of data (12GB)

• Infobright Community Database

• Only 1st query from each “flight” selected

• Unsharded compared to four shards (box has 4 cpu - Amazon m1.xlarge)

COLD

• MySQL – 35.39s

• Shard-Query – 11.62s

HOT

• MySQL – 10.99s


Query 1

select sum(lo_extendedprice*lo_discount) as revenuefrom lineorder join dim_date on lo_orderdatekey = d_datekeywhere d_year = 1993and lo_discount between 1 and 3and lo_quantity < 25;

COLD

• MySQL – 34.24s


HOT

• MySQL – 12.74s


Query 2

select sum(lo_revenue), d_year, p_brandfrom lineorderjoin dim_date on lo_orderdatekey = d_datekeyjoin part on lo_partkey = p_partkeyjoin supplier on lo_suppkey = s_suppkeywhere p_category = 'MFGR#12'and s_region = 'AMERICA'group by d_year, p_brandorder by d_year, p_brand;

COLD

• MySQL – 27.29s


HOT

• MySQL – 18.89


Query 3

select c_nation, s_nation, d_year, sum(lo_revenue) as revenuefrom customer join lineorderon lo_custkey = c_customerkeyjoin supplier on lo_suppkey = s_suppkeyjoin dim_date on lo_orderdatekey = d_datekeywhere c_region = 'ASIA'and s_region = 'ASIA'and d_year >= 1992 and d_year <= 1997group by c_nation, s_nation, d_yearorder by d_year asc, revenue desc;

COLD

• MySQL – 23.02s


HOT

• MySQL – 14.77


Query 4

select d_year, c_nation, sum(lo_revenue - lo_supplycost) as profitfrom lineorder join dim_date on lo_orderdatekey = d_datekeyjoin customer on lo_custkey = c_customerkeyjoin supplier on lo_suppkey = s_suppkeyjoin part on lo_partkey = p_partkeywhere c_region = 'AMERICA'and s_region = 'AMERICA'and (p_mfgr = 'MFGR#1'or p_mfgr = 'MFGR#2')group by d_year, c_nationorder by d_year, c_nation;

Data & Analytics

Shard-Query, an MPP database for the cloud using the LAMP stack