
Database Analytics: A Reconfigurable-Computing Approach




This article presents a highly pipelined, high-throughput query-processing engine on a field-programmable gate array to offload and accelerate expensive database-analytics queries. The solution presented here provides a mechanism for a database management system to seamlessly harness the FPGA accelerator without requiring any changes in the application or the existing data layout. The system achieves an order-of-magnitude speedup on various real-life queries.

Real-time analytics—performing analytics directly on transactional data—has seen widespread adoption in the business world in recent years. Whether to set airline ticket prices to beat competition or to stock store shelves just in time, rapid analysis of transactional data has become a business necessity. Snapshot warehousing, where a snapshot of data is taken from an online transaction processing (OLTP) system for decision-support analysis, is no longer sufficient for making executive decisions. Widespread use of mobile devices and social-media networking has changed the marketplace dynamics, requiring businesses to respond to market changes in real time. For example, just the appropriate coupon for a coffeehouse must be sent to a consumer’s smartphone within seconds of the customer swiping his or her credit card at a nearby store.

Injecting expensive analytics queries involving operations such as sort and join into OLTP environments leads to sharing of system resources such as the CPU and I/O between transactional and analytical workloads. Transactional workloads relate directly to revenue generation and have strict service-level agreements (SLAs); analytical workloads must run against the same data without impacting these SLAs.

Parallelization techniques commonly applied to speed up query processing include multicore architectures,1 single-instruction, multiple-data (SIMD) operations,2 and accelerators such as GPUs3 and field-programmable gate arrays (FPGAs).4 Efficient use of multicore demands careful data partitioning, and exploiting SIMD operations and GPUs requires extra preprocessing on database tables.5 Prior work using FPGAs involves synthesizing each query into the FPGA, making such approaches impractical. Commercial data-warehousing appliances—such as Netezza (www.netezza.com), Exadata

Bharat Sukhwani, Hong Min, Mathew Thoennes, Parijat Dube, Bernard Brezzo, Sameh Asaad, Donna Eng Dillenberger
IBM T.J. Watson Research Center

0272-1732/14/$31.00 © 2014 IEEE. Published by the IEEE Computer Society.



(www.oracle.com/us/products/database/exadata), and Greenplum (http://gopivotal.com/products/pivotal-greenplum-database)—focus solely on analytics workloads and operate on more uniformly formatted data compared to OLTP workloads.

To address these challenges, we propose a hardware acceleration approach to offload and accelerate the most CPU-intensive operations in analytics queries on an FPGA. Our system performs analytics on OLTP data within an OLTP environment. Unlike most existing approaches, the FPGA in our system operates on a database management system’s in-memory data, which is the most up-to-date copy of the data, for real-time analytics alongside OLTP without any preprocessing or partitioning of data.

Relational databases and query processing

In relational database management systems (DBMSs), data is organized in the form of records in a table, where each record (row) contains one or more fields (columns). Tables are stored on the disk as a collection of pages, each containing multiple rows.6 In most enterprise DBMSs, rows are compressed for storage and I/O savings, and are decompressed only while the DBMS evaluates the queries. A designated memory space, called a buffer cache or buffer pool, is used for caching the data pages, and the I/O operations between the buffer pool and the disk are managed transparently. Data residing in the buffer pool is the most up-to-date hot copy of the data.

Structured Query Language (SQL) is the de facto standard for schema definition, data manipulation, and querying of data from relational DBMSs. SQL queries use various clauses to represent different data-retrieval operations (see Figure 1). The “where” clause represents predicates used to qualify a subset of table records via logical inequality or equality comparisons, or to test set containment for a field in the record. The select clause refers to projection and indicates the record fields to be extracted. Finally, the order by clause specifies the fields for sorting the results. Other common query operations include table joins, which match records with common column values from two or more tables, and group-by aggregations, which involve clustering the output (for example, sum or average) according to certain fields.
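These clauses map directly onto a runnable query. The sketch below uses Python’s built-in sqlite3 module with data modeled on Figure 1; the table and column names follow the figure, and the specific row values are illustrative assumptions.

```python
import sqlite3

# Illustrative "employee" table modeled on Figure 1 (values are assumptions).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Employee (Emp_Id INTEGER, Firstname TEXT, "
             "Lastname TEXT, Dept TEXT, Joining_year INTEGER)")
conn.executemany("INSERT INTO Employee VALUES (?, ?, ?, ?, ?)", [
    (1201, "Robert", "Morris", "Engineering", 2001),
    (1202, "Jack", "Smith", "Sales", 2003),
    (1203, "Anna", "Rose", "HR", 2001),
])

# Predicate (WHERE), projection (SELECT), and sort (ORDER BY) in one query:
# employee ID and department of everyone who joined before 2002.
rows = conn.execute("SELECT Emp_Id, Dept FROM Employee "
                    "WHERE Joining_year < 2002 ORDER BY Emp_Id").fetchall()
print(rows)  # [(1201, 'Engineering'), (1203, 'HR')]
```

These three clauses correspond to the predicate-evaluation, projection, and sort engines that the accelerator implements in hardware.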

DBMSs commonly employ two approaches to traversing a table: indexing and table scanning. Indexing is efficient for locating one or a few records, as is the case for OLTP queries. Scanning involves sifting through the entire table and is commonly used in analytics, where a large number of records typically match the search criteria. It is often expensive to scan large tables on general-purpose CPUs. A mechanism to accelerate the analytics queries involving table scan is thus highly desirable.

Database analytics accelerator on FPGAs

Our FPGA-accelerated database analytics system consists of an off-the-shelf host server with a PCI Express (PCIe)-attached FPGA accelerator card, and it runs a commercial DBMS. Our accelerator is connected directly to the processor’s main memory instead of being in the I/O path, and it processes the latest in-memory data in the DBMS’s buffer pool. To offload a query, the DBMS issues a command to the FPGA via the control software (see Figure 2). The command contains

[Figure 1. An example “employee” table and a Structured Query Language (SQL) query. SQL queries use various clauses to represent different data-retrieval operations. This query retrieves the employee ID and department of all employees who joined the company before 2002. The query reads: Select Emp_Id, Dept; From Employee; Where Joining_year < 2002; Order by Emp_Id.]

RECONFIGURABLE COMPUTING


the query specification and pointers to the data that is to be processed. The FPGA pulls standard database pages from the main memory, parses the pages to extract the rows, processes the rows, and writes the qualifying rows back to the main memory in database-formatted pages. All the data transfers are managed by the accelerator without any DBMS involvement. Standard techniques such as block DMA operations, job queuing, and double buffering have been implemented to achieve high accelerator utilization and peak PCIe bandwidth. Also, the query-processing engine is designed to match the sustained bus bandwidth and to allow trading off query-processing throughput for the complexity needed to handle complex queries.

Below, we present an overview of the query acceleration pipeline on the FPGA, followed by the different query operations implemented on the FPGA.

Query acceleration pipeline on FPGAs

Query processing in our FPGA accelerator proceeds in a streaming fashion; rows stream through a series of pipelined execution units, each implementing one query operation (see Figure 2). Operations such as join and sort are not purely streaming, and they require the rows to be temporarily held in the accelerator. Because the FPGA block

RAM (BRAM) is limited, only the columns necessary for these operations are held in BRAM, whereas the full records are stored in on-card DRAM. Because predicate evaluation and projection are performed before join and sort, disqualified rows and unwanted columns are eliminated, significantly reducing the amount of data to be stored. The filtered rows that pass the query operations are read from the DRAM and streamed back to the host in uncompressed page format for direct consumption by the application.

Because of architectural limitations, current FPGA devices operate at an order-of-magnitude slower clock than high-end CPUs. To bridge this gap and achieve further performance improvement, FPGA designs rely heavily on exploiting a high degree of parallelism. To that end, our design is architected to exploit parallelism at various levels. Our FPGA design obtains page-level parallelism by maintaining two or more concurrent page streams (tiles). Within each tile, the FPGA achieves row-level parallelism by concurrently evaluating multiple records from a page. Finally, our design obtains the finest-grained parallelism by concurrently evaluating multiple predicates against different columns within the row.

Depending on the target workload (set of queries), our design can be scaled to achieve

[Figure 2. Field-programmable gate array (FPGA)-accelerated database system. Database pages are copied from the host memory into the FPGA over PCIe. Rows from these pages stream through a series of pipelined query operators, and qualifying rows are packed back in database-formatted pages and copied back to the host memory.]

JANUARY/FEBRUARY 2014


maximum performance, given the available FPGA resources. In other words, the number of parallel page streams and parallel row-processing engines can be traded against the type and complexity of the query operations. The best-matching configuration can be selected from a library of prebuilt configurations. Loading a new FPGA configuration typically takes a few hundred milliseconds, which is insignificant compared to the runtime of typical analytics queries.

Row decompression

Offloading decompression enables the FPGA to directly consume compressed database pages and accelerate the per-row decompression task while also providing higher “effective” PCIe data-transfer bandwidth. The bandwidth advantage is especially important because the overall accelerator performance is often limited by the bus bandwidth.

Our target DBMS uses a dictionary-based modified Lempel-Ziv compression algorithm, in which a compressed row consists of a series of symbols. Decompression is inherently serial, involving symbol expansion using dictionary lookups, where the next lookup depends on the current one. Moreover, being a variable-byte operation with random lookups, decompression consumes significant CPU cycles and often suffers from frequent cache misses.

On the FPGA, an optimized decompression datapath results in a highly efficient decompression operation. For deterministic, single-cycle lookups, the decompression dictionary is preloaded into FPGA block RAMs. The dictionary must be reloaded only if the database table changes. For improved pipeline efficiency, symbol prefetch logic is implemented; compared to the nonprefetch FPGA design, prefetching helps reduce the decompression time by more than 50 percent in some cases.7 The decompression engine has a very small footprint, allowing the instantiation of multiple engines on the FPGA for row-level parallelism.

Predicate evaluation

Predicate evaluation selects the rows of interest from a table by applying filtering criteria (predicates) on one or more columns

[Figure 3. Row scanner for predicate evaluation. PEs operate concurrently and independently of one another.]



of the rows. From a CPU consumption perspective, predicate evaluation against each row becomes an intensive operation as the number of predicates increases, especially for large analytics queries, which often involve filtering millions to billions of rows.

In our FPGA design, predicates are evaluated by a chain of predicate evaluation units (PEs) inside the row scanner (see Figure 3). Each PE evaluates a single predicate by comparing a stored predicate value, P(i), supplied by the query against up to a 64-bit-long column of the streaming database row. PEs operate concurrently and independently of one another. A configurable, binary reduction network (RN) combines the individual outputs of the PEs to implement complex filter criteria according to the query, generating a 1-bit qualify signal. The number of PEs inside a row scanner and the size of the reduction network are configurable at synthesis time, to allow for area-versus-complexity trade-offs. Software can further customize a given FPGA configuration by loading the PE operation, the predicate value, and arbitrary reduction patterns on a per-query basis.
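The PE chain and reduction network can be sketched in software as follows. The column indices, comparison operators, and the AND/OR reduction used here are illustrative assumptions, not the authors’ exact per-query configuration.

```python
from functools import reduce
import operator

# Sketch of the row scanner: each PE compares one column of the row against
# a stored predicate value P(i); a reduction network then combines the
# 1-bit PE outputs into a single qualify signal.
def row_scanner(row, predicates, combine=operator.and_):
    # Each PE works on its own column, so all comparisons are independent
    # (concurrent in hardware).
    bits = [op(row[col], value) for col, op, value in predicates]
    return reduce(combine, bits)   # binary reduction to 1-bit qualify

# Illustrative per-query configuration loaded into the PEs.
predicates = [
    (0, operator.lt, 2002),   # PE 0: joining_year < 2002
    (1, operator.eq, "HR"),   # PE 1: dept == "HR"
]
print(row_scanner((2001, "HR"), predicates))                 # True
print(row_scanner((2003, "HR"), predicates))                 # False
print(row_scanner((2003, "HR"), predicates, operator.or_))   # True
```

Swapping `combine` mirrors loading a different reduction pattern into the RN without resynthesizing the design.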

Table joins

The join operation matches records from one table against the other, using the join column. Because the naive nested-loop implementation has quadratic computational complexity, most DBMSs employ sort-merge-join or hash-join for efficient computation. Hash-join often performs better because of its linear algorithmic complexity, though hashing introduces false positives that must be resolved. In addition, hash-join on CPUs suffers from high cache misses due to

[Figure 4. Build and probe phases of hash-join on FPGA. Bit vectors are used to quickly eliminate most of the nonmatching rows, whereas the address tables are used for eliminating the false positives. Only the true positive matches result in off-chip memory access.]



random memory accesses based on the hash value.

We implement a hash-based join in our FPGA, which performs hashing, hash conflict resolution, and row materialization (creation of the joined rows) at the PCIe streaming rate.8 Two or more join operations can be performed concurrently using parallel join channels.

Figure 4 shows the two phases of FPGA hash-join. During the build phase, one of the join tables (usually the smaller one) is streamed through, and the join columns are hashed to populate a bit vector. The full rows are stored in off-chip DRAM, whereas the join columns and the row addresses are stored in the address table in the FPGA BRAM; rows hashing to the same position are chained in the address table.

The second table is streamed during the probe phase; rows are hashed for probing the bit vector, for quick elimination of nonmatching rows. The FPGA subsequently removes the false positives by comparing against the values in the address table. Only true matches issue reads from off-chip memory and are materialized. This multilevel filtering significantly reduces off-chip accesses and is critical to handling streaming rows without stalls.
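A software sketch of the build and probe phases described above: a bit vector gives the cheap first-pass filter, the address table resolves hash collisions (false positives), and a plain list stands in for the off-chip DRAM row store. The bucket count, table contents, and function names are illustrative assumptions.

```python
# Build phase: hash the join column of the (smaller) table into a bit vector
# and an address table; store full rows in "dram".
def build(rows, key_col, nbits=64):
    bitvec = 0
    addr_table = {}                    # hash bucket -> [(key, row_addr), ...]
    dram = []
    for row in rows:
        key = row[key_col]
        h = hash(key) % nbits
        bitvec |= 1 << h               # mark the bucket as occupied
        addr_table.setdefault(h, []).append((key, len(dram)))  # chained
        dram.append(row)
    return bitvec, addr_table, dram

# Probe phase: stream the second table; the bit vector quickly eliminates
# nonmatching rows, and the address table removes false positives. Only
# true matches touch "dram" and are materialized.
def probe(rows, key_col, bitvec, addr_table, dram, nbits=64):
    for row in rows:
        key = row[key_col]
        h = hash(key) % nbits
        if not (bitvec >> h) & 1:      # cheap first-level filter
            continue
        for k, addr in addr_table.get(h, []):   # collision resolution
            if k == key:
                yield dram[addr] + row          # materialize joined row

depts = [("Engineering", "Bldg A"), ("HR", "Bldg B")]
emps = [(1201, "Engineering"), (1202, "Sales"), (1203, "HR")]
bv, at, dr = build(depts, 0)
print(list(probe(emps, 1, bv, at, dr)))
```

As in the hardware, the multilevel filter means only the true matches cost a row-store access; everything else is rejected from on-chip state.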

Column projection

Queries often request only a few columns from each record for reporting or analytics purposes. Including projection in the accelerator’s pipeline thus provides significant bandwidth, processing, and storage savings by removing unwanted columns in early stages. Moreover, projection is required to extract the columns for sort and join operations and to format the output for consumption by the DBMS.

Projection in our FPGA is performed concurrently with predicate evaluation, and the projected row is carried forward if it qualifies. The FPGA design maintains row-level parallelism by instantiating multiple copies of the projection logic, one for each row scanner.

The DBMS specifies the projection columns by loading their lengths and positions in the row into the FPGA BRAM during the query load phase. We store this information in the order of those columns in the record, thereby requiring only one comparison at a time. This allows for a very efficient implementation of the projection logic, independent of the number of columns to be projected. As the rows stream over the projection unit, the required columns are captured while the rest are discarded.

Database tables often contain columns whose length and position vary from row to row (for example, a string). Projecting such columns requires computing this information for each row from the row metadata. This can significantly affect the performance. Our FPGA projects variable-length columns at the full streaming rate by employing a two-phase hybrid streaming scheme. First, the row is staged in a buffer, and the length and position for all the variable columns to be projected are computed. We refer to this as resolving the variable columns into fixed columns. Next, the row is streamed through, skipping over the metadata, as it is no longer required; variable columns are then projected, much like the fixed-length columns. The savings from skipping the metadata more than compensate for the extra cycles spent during the resolution step.

Database sort

Database sort keys are created by extracting one or more fields from a much larger row. These keys are thus generated once every few cycles, obviating the need for the fastest sort implementation. Instead, the main requirements for sort in databases are support for long sort keys (tens of bytes or longer), handling large payloads (rows) associated with each key, and generating large sorted batches (millions of records) in multiphased sort. These requirements point toward a highly hardware-efficient algorithm, ideally having hardware resource requirements independent of the number of records to be sorted. Furthermore, database applications prefer streaming sort for large-query processing.

For these reasons, we find the tournament tree sort algorithm most suitable.9 Among different hardware sorters, tournament tree sort requires the least number of comparators and can generate large sorted batches. A tree with N leaf nodes guarantees a minimum batch of size N; however, for almost-sorted



data, which happens frequently in databases, much larger batches can be generated.

To the best of our knowledge, this is the first hardware implementation of the tournament tree algorithm. Our FPGA design implements two independent tournament tree sorters, each with 16 thousand nodes, followed by a merger, thereby generating minimum batches of 32 thousand records.

For each tree, the leaf nodes containing the sort keys are stored in the key BRAM, whereas the corresponding rows are stored in off-chip DRAM. Row pointers are stored with the keys and are used to fetch the rows as sorted keys get emitted. The nonleaf nodes contain the pointers to the losing keys (leaf nodes) at different tree levels and are held in the loser pointer BRAM (see Figure 5).

Each new key is entered into the leaf node of the last sorted key and moves up the tree, being compared with the parent nodes while the loser pointers and the overall winning key are updated as needed. The tree is traversed using simple modulo arithmetic to compute the parent key’s address. The entire tree is serviced using a single comparator. To meet timing, the large comparator must be implemented as a multistage pipeline. For pipelining of dependent comparisons, we implement a speculative comparator using two independent two-stage comparators. The overall sort design uses just five two-stage comparators—two per tree, plus one for the merger—while sustaining full PCIe bandwidth for most row sizes. We can achieve higher throughputs by further splitting the trees at the expense of more comparators.

A tournament tree can continue to sort beyond the minimum batch size, provided that the incoming keys don’t violate the order of the already-sorted keys. Unless handled in a special way, such violating keys require flushing the keys in the tree and starting a new batch from an empty tree, incurring extra tree setup and teardown costs and significantly reducing the sorting throughput. We address this by coloring the keys as they enter the tree. In other words, we attach a prefix to each key, thereby implicitly binning the keys on the fly into different sorted batches. A differently colored key always loses against any key of the current color. Coloring thus allows the violating keys to participate in sorting without corrupting the current sorted batch, thereby eliminating the need to drain the tree between batches.
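The effect of key coloring can be sketched with a software analogue. The FPGA uses a loser-style tournament tree serviced by a single pipelined comparator; the heap-based replacement-selection sketch below is a stand-in, not the authors’ structure, but it reproduces the same run-formation behavior: ordering on a (color, key) pair defers order-violating keys to the next batch instead of flushing the tree. The tree size and inputs are illustrative.

```python
import heapq

# Sorted-run generation with key "coloring". A key that would violate the
# current run's order gets the next color, so it is implicitly binned into
# the next sorted batch and the tree never has to be drained.
def sorted_runs(keys, tree_size=4):
    it = iter(keys)
    # Fill the tree; all initial keys carry color 0 (current batch).
    heap = [(0, k) for _, k in zip(range(tree_size), it)]
    heapq.heapify(heap)
    runs = []
    while heap:
        color, key = heapq.heappop(heap)  # overall winner: smallest pair
        if color == len(runs):
            runs.append([])               # first key of a new batch
        runs[color].append(key)
        nxt = next(it, None)
        if nxt is not None:
            # Replace the emitted key; a smaller incoming key is recolored
            # and "always loses" against the current color.
            heapq.heappush(heap, (color if nxt >= key else color + 1, nxt))
    return runs

print(sorted_runs([5, 1, 9, 3, 2, 8, 0, 7], tree_size=4))
```

With almost-sorted input, nearly all keys keep the current color, which is why the batches often grow far beyond the tree’s guaranteed minimum of N.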

Acceleration enablement of DBMS

Leveraging the FPGA-accelerated query processing requires two modifications to a standard DBMS: restructuring of the DBMS for seamless plug-in of the accelerator into the DBMS’ query operation flow, and transforming the different query operations to map to the hardware-accelerated functions.

[Figure 5. Tournament tree sort implementation on an FPGA. The tree is implemented using block RAMs and only two comparators. Key coloring is used to avoid the need to flush the tree between consecutive sorted runs.]



For efficient accelerator utilization and high query-processing throughput, the interactions between the DBMS and the I/O-attached accelerator must happen at the block level as opposed to the traditional one-row-at-a-time flow. For this purpose, we restructured the DBMS to introduce block-level data operations within the DBMS query flow. The restructured DBMS communicates with the FPGA through a series of control blocks.7 A long-running query is divided into multiple jobs. For each job, the DBMS obtains a list of buffer pool pages to be processed and locks them in the host memory; the list of page pointers is sent to the FPGA. Note that locking can affect the OLTP queries, and issuing multiple lightweight jobs is critical to maintaining a fine locking granularity. Multiple outstanding jobs are issued and queued to the accelerator, which processes them sequentially. The FPGA returns the results in the format expected by the DBMS processing engine for further downstream processing; hence, no additional data copying or formatting is required.

For mapping the query operations onto the FPGA, the query is transformed into a query control block (QCB), a data structure that contains information about the record structure, the query predicates, and other query operations. A QCB can be interpreted by the FPGA to tailor the application logic to a specific query. Based on the target query, the DBMS must transform a query into QCB specifications. Because SQL provides rich syntax to express data relations, simply parsing the SQL query is not sufficient to provide all the required information in the QCB.

Existing SQL parsing and transformation routines in the DBMS provide some of the intermediate expressions required for generating the query specifications for the FPGA. We devised techniques to efficiently parse these internal expressions to extract the relevant information for creating the QCB. Moreover, because certain queries might not benefit from FPGA offload, the query transformation function also decides whether a query should be directed to the FPGA.
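The article does not give the QCB’s exact layout; as a loose illustration of the kind of per-query state it carries (record structure, predicates, reduction pattern, projection and sort columns), one might sketch it as follows. All field names and contents here are assumptions.

```python
from dataclasses import dataclass, field

# Hypothetical sketch of a query control block (QCB); the real layout is
# defined by the authors' control-block interface, not shown in the article.
@dataclass
class QueryControlBlock:
    column_offsets: list          # record structure: (offset, length) per column
    predicates: list              # (column, operator, value) triples for the PEs
    reduction_pattern: str        # how PE outputs combine, e.g. "AND"
    projection_columns: list      # columns to extract
    sort_columns: list = field(default_factory=list)

# Illustrative QCB for "Emp_Id, Dept of employees who joined before 2002".
qcb = QueryControlBlock(
    column_offsets=[(0, 4), (4, 16), (20, 4)],
    predicates=[(2, "<", 2002)],
    reduction_pattern="AND",
    projection_columns=[0, 1],
)
print(qcb.reduction_pattern)
```

The point of such a structure is that the FPGA interprets it at query-load time, so one synthesized bitstream serves many different queries.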

Performance evaluation

Our prototype is built on a commercial DBMS running on a 3.8-GHz multicore superscalar system with a PCIe-attached FPGA card with an Altera Stratix 5SGXA7 and 8 Gbytes of DDR3 at 1,333 megabits per second (Mbps). The FPGA design runs at 200 MHz. Our experimental workload is derived from real customer tables, and we evaluated four types of queries:

1. predicate evaluation only;
2. decompression and predicate evaluation;
3. decompression, predicate evaluation, projection, and sort; and
4. decompression, predicate evaluation, and join.

These queries resemble the TPC-H Q1 and TPC-DS template-3 queries.

Figure 6 shows the CPU savings from offloading analytics queries with different row qualification ratios. Higher offload means more CPU resources being freed up for OLTP. As shown, the savings are higher when a smaller fraction of rows qualify, gradually decreasing with increasing qualification ratio. This is due to the CPU requirements for post-processing the qualified rows—for example, moving data to the application buffer. Overall, CPU savings are higher on type 2 queries compared to type 1 queries; the offload for type 4 queries would be even higher due to a larger fraction of the query task being offloaded. In other words, larger CPU savings can be achieved on such queries, even with a large number of rows qualifying. Type 3 queries, however, could require a merge step on the CPU, depending on the sort batch size generated by the FPGA, potentially resulting in a slight reduction in the overall savings. This depends on the particular workload’s characteristics.

From the perspective of performance improvement on the offloaded queries, Table 1 compares the row processing throughput in the FPGA against the baseline unmodified DBMS running on a single core of our system. Overall, the FPGA achieves speedups in the range of 7× to 14× for most queries, except for type 1 queries. Type 1 queries operate on uncompressed data, and the performance is limited by the data transfer bandwidth over the PCIe bus. This is evident in going from type 1 to type 2 queries: throughput in the FPGA increases significantly,



owing to the increase in effective bandwidth, whereas the CPU throughput falls drastically because of the cost of decompressing each row. On the FPGA, better compression results in higher speedups, and very high compression ratios cause the performance bottleneck to shift from PCIe to the FPGA’s query-processing capabilities.

For more complex type 3 and type 4 queries, the throughput on the baseline CPU

[Figure 6. CPU savings in OLTP system from offloading type 1 and type 2 queries to the FPGA accelerator. Higher query qualification ratios reduce the overall CPU savings due to the CPU requirements for post-processing of qualified data.]

Table 1. Row processing throughput and FPGA speedup.

Query   Row length   Compression   Throughput (million rows/s)   FPGA
type*   (bytes)      factor        Baseline        FPGA          speedup
1       170          1×            12.57           14.60         1.1×
1       235          1×            8.16            10.15         1.2×
2       170          5×            3.57            38.20         10.7×
2       235          2×            3.16            21.20         6.7×
3       80           2×            2.48            28.00         11.3×
3       170          2×            2.05            28.00         13.6×
3       235          2×            1.62            21.00         12.9×
3       420          2×            1.30            19.00         14.6×
4       170          5×            1.60            18.00         11.2×

* Query type 1: predicate evaluation only; query type 2: decompression and predicate evaluation; query type 3: decompression, predicate evaluation, projection, and sort; query type 4: decompression, predicate evaluation, and join.
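As an illustration of what a type 2 query offloads, a software analogue of the decompress-then-filter stages might look like the following sketch. It uses zlib purely as a stand-in for the DBMS's row compression, and the rows and predicate are invented for the example:

```python
import zlib

def type2_scan(compressed_rows, predicate):
    """Software analogue of a type 2 query: decompress each row, then
    evaluate the predicate; only qualifying rows are returned for
    post-processing. On the FPGA these stages run as a pipeline, so
    decompression overlaps predicate evaluation instead of serializing."""
    qualified = []
    for blob in compressed_rows:
        row = zlib.decompress(blob).decode()
        if predicate(row):
            qualified.append(row)
    return qualified

# Illustrative rows: "name,amount" records, compressed per row.
rows = ["alice,120", "bob,45", "carol,300"]
compressed = [zlib.compress(r.encode()) for r in rows]
result = type2_scan(compressed, lambda r: int(r.split(",")[1]) > 100)
print(result)  # -> ['alice,120', 'carol,300']
```

On the CPU this loop pays the full decompression cost per row, which is exactly why the baseline throughput drops from type 1 to type 2 in Table 1 while the pipelined FPGA's does not.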

JANUARY/FEBRUARY 2014 27

further decreases, whereas the FPGA throughput is relatively stable. This is due to the pipelined execution of these operations in the FPGA. The main factor that affects the FPGA throughput is the row size; the smaller the row, the higher the throughput, because more rows can be transferred over a given PCIe bandwidth. For type 3 queries, however, the performance is limited by the sort tree for rows up to 180 bytes, and doesn't improve for smaller row sizes. In terms of execution complexity, "join" is the most expensive operation, in both the CPU and the FPGA. For similar-sized rows, the throughput for the type 4 query is about half that of other queries.
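A crude model of the type 3 behavior combines the two limits described above: throughput scales inversely with row size until the sort tree's fixed rows-per-second limit takes over for small rows. The 28 million rows/s cap matches Table 1, but the effective bandwidth figure is an assumption chosen to make the crossover land near 180-byte rows:

```python
def type3_throughput_mrows(row_bytes, eff_bw_gbps=5.0,
                           sort_cap_mrows=28.0):
    """Type 3 (sort) throughput model: transfer-limited for large rows,
    sort-tree-limited for small rows. The bandwidth value is assumed;
    the sort cap echoes the 28 Mrows/s plateau in Table 1."""
    transfer_mrows = eff_bw_gbps * 1e9 / row_bytes / 1e6
    return min(transfer_mrows, sort_cap_mrows)

# Small rows hit the sort-tree cap; larger rows are transfer-bound.
for rb in (80, 170, 235, 420):
    print(rb, round(type3_throughput_mrows(rb), 1))
```

Under these assumptions, the 80-byte and 170-byte rows both land on the 28 Mrows/s plateau while larger rows fall off roughly as 1/row_size, which is the qualitative shape of the type 3 rows in Table 1.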

Offloading analytics queries to an FPGA-based accelerator provides a viable, attractive solution for running analytics queries against transactional data, for real-time market responsiveness. This approach provides significant improvement in response time of offloaded analytics queries while saving CPU resources for mission-critical OLTP workloads. Our solution can be easily scaled across multiple accelerator nodes, either for a single query or across different queries, without any explicit data partitioning.

References
1. J. Krueger et al., "Fast Updates on Read Optimized Databases Using Multicore CPUs," Proc. VLDB Endowment, vol. 5, no. 1, 2012, pp. 61-72.
2. R. Johnson et al., "Row-Wise Parallel Predicate Evaluation," Proc. VLDB Endowment, vol. 1, no. 1, 2008, pp. 622-634.
3. N. Satish et al., "Fast Sort on CPUs and GPUs: A Case for Bandwidth Oblivious SIMD Sort," Proc. ACM SIGMOD Int'l Conf. Management of Data (SIGMOD 10), 2010, pp. 351-362.
4. R. Mueller, J. Teubner, and G. Alonso, "Glacier: A Query-to-Hardware Compiler," Proc. ACM SIGMOD Int'l Conf. Management of Data (SIGMOD 10), 2010, pp. 1159-1162.
5. T. Horikawa, "An Unexpected Scalability Bottleneck in a DBMS: A Hidden Pitfall in Implementing Mutual Exclusion," Proc. Parallel and Distributed Computing and Systems (PDCS 2011), 2011, doi:10.2316/P.2011.757-036.
6. R. Ramakrishnan and J. Gehrke, Database Management Systems, 3rd ed., McGraw-Hill, 2002.
7. B. Sukhwani et al., "Database Analytics Acceleration Using FPGAs," Proc. 21st Int'l Conf. Parallel Architectures and Compilation Techniques, ACM, 2012, pp. 411-420.
8. R. Halstead et al., "Accelerating Join Operation for Relational Databases with FPGAs," Proc. IEEE Symp. FCCM, IEEE CS, 2013, pp. 17-20.
9. D.E. Knuth, The Art of Computer Programming, Vol. 3: Sorting and Searching, Addison-Wesley, 1973.

Bharat Sukhwani is a research staff member at the IBM T.J. Watson Research Center. His research interests include computer architecture; parallel computing; application-specific, high-performance computing systems; and computing using reconfigurable and graphics processors. Sukhwani has a PhD in electrical engineering from Boston University. He is a member of IEEE.

Hong Min is a senior technical staff member at the IBM T.J. Watson Research Center. Her research interests include database systems and acceleration technologies for data processing. Min has a PhD in mechanical engineering from Drexel University.

Mathew Thoennes is a senior engineer at the IBM T.J. Watson Research Center. His research interests include software and hardware for Series z systems. Thoennes has an MS in computer science from the University of Massachusetts. He's a member of the ACM and a senior member of IEEE.

Parijat Dube is a senior engineer at the IBM T.J. Watson Research Center. His research interests include performance modeling, analysis, and optimization of systems. Dube has a PhD in computer science from INRIA.

Bernard Brezzo is a senior engineer at the IBM T.J. Watson Research Center. His


research interests include high-performance computing, FPGA emulation, and Synapse. Brezzo has a Diploma in electronics engineering from Conservatoire National des Arts et Metiers, Paris.

Sameh Asaad is a research staff member at the IBM T.J. Watson Research Center, where he manages the Digital Design Group in the Communication and Computation Subsystems Department. His research interests include domain-optimized architectures and systems, reconfigurable computing, and FPGA-based application acceleration. Asaad has a PhD in electrical engineering from Vanderbilt University.

Donna Eng Dillenberger leads IBM's global research on enterprise systems at the IBM T.J. Watson Research Center. She's also IBM's Chief Technology Officer of IT Optimization, an IBM Distinguished Engineer and Master Inventor, and an adjunct professor at Columbia University. Her research interests include hybrid architectures, Cloud, and IT optimization. Dillenberger has an MS in computer science from Columbia University.

Direct questions and comments about this article to Bharat Sukhwani, IBM Research, 1101 Kitchawan Road, Yorktown Heights, NY 10598; [email protected].
