Database Analytics: A Reconfigurable-Computing Approach
This article presents a highly pipelined, high-throughput query-processing engine on a field-programmable gate array to offload and accelerate expensive database-analytics queries. The solution presented here provides a mechanism for a database management system to seamlessly harness the FPGA accelerator without requiring any changes in the application or the existing data layout. The system achieves an order-of-magnitude speedup on various real-life queries.
Real-time analytics—performing analytics directly on transactional data—has seen widespread adoption in the business world in recent years. Whether to set airline ticket prices to beat competition or to stock store shelves just in time, rapid analysis of transactional data has become a business necessity. Snapshot warehousing, where a snapshot of data is taken from an online transaction processing (OLTP) system for decision-support analysis, is no longer sufficient for making executive decisions. Widespread use of mobile devices and social-media networking has changed the marketplace dynamics, requiring businesses to respond to market changes in real time. For example, just the appropriate coupon for a coffeehouse must be sent to a consumer's smartphone within seconds of the customer swiping his or her credit card at a nearby store.
Injecting expensive analytics queries involving operations such as sort and join into OLTP environments leads to sharing of system resources such as the CPU and I/O between transactional and analytical workloads. Transactional workloads relate directly to revenue generation and have strict service-level agreements (SLAs); analytical workloads must run against the same data without impacting these SLAs.
Parallelization techniques commonly applied to speed up query processing include multicore architectures;1 single-instruction, multiple-data (SIMD) operations;2 and accelerators such as GPUs3 and field-programmable gate arrays (FPGAs).4 Efficient use of multicore demands careful data partitioning, and exploiting SIMD operations and GPUs requires extra preprocessing on database tables.5 Prior work using FPGAs involves synthesizing each query into the FPGA, making those approaches impractical. Commercial data-warehousing appliances—such as Netezza (www.netezza.com), Exadata
Bharat Sukhwani
Hong Min
Mathew Thoennes
Parijat Dube
Bernard Brezzo
Sameh Asaad
Donna Eng Dillenberger
IBM T.J. Watson Research Center
0272-1732/14/$31.00 © 2014 IEEE Published by the IEEE Computer Society
(www.oracle.com/us/products/database/exadata), and Greenplum (http://gopivotal.com/products/pivotal-greenplum-database)—focus solely on analytics workloads and operate on more uniformly formatted data compared to OLTP workloads.
To address these challenges, we propose a hardware acceleration approach to offload and accelerate the most CPU-intensive operations in analytics queries on an FPGA. Our system performs analytics on OLTP data within an OLTP environment. Unlike most existing approaches, the FPGA in our system operates on a database management system's in-memory data, which is the most up-to-date copy of the data, for real-time analytics alongside OLTP without any preprocessing or partitioning of data.
Relational databases and query processing

In relational database management systems (DBMSs), data is organized in the form of records in a table, where each record (row) contains one or more fields (columns). Tables are stored on the disk as a collection of pages, each containing multiple rows.6 In most enterprise DBMSs, rows are compressed for storage and I/O savings, and are decompressed only while the DBMS evaluates the queries. A designated memory space, called a buffer cache or buffer pool, is used for caching the data pages, and the I/O operations between the buffer pool and the disk are managed transparently. Data residing in the buffer pool is the most up-to-date hot copy of the data.
Structured Query Language (SQL) is the de facto standard for schema definition, data manipulation, and querying of data from relational DBMSs. SQL queries use various clauses to represent different data-retrieval operations (see Figure 1). The "where" clause represents predicates used to qualify a subset of table records via logical inequality or equality comparisons, or to test set containment for a field in the record. The select clause refers to projection and indicates the record fields to be extracted. Finally, the order by clause specifies the fields for sorting the results. Other common query operations include table joins, which match records with common column values from two or more tables, and group-by aggregations, which involve clustering the output (for example, sum or average) according to certain fields.
DBMSs commonly employ two approaches to traversing a table: indexing and table scanning. Indexing is efficient for locating one or a few records, as is the case for OLTP queries. Scanning involves sifting through the entire table and is commonly used in analytics, where a large number of records typically match the search criteria. It is often expensive to scan large tables on general-purpose CPUs. A mechanism to accelerate analytics queries involving table scans is thus highly desirable.
Database analytics accelerator on FPGAs

Our FPGA-accelerated database analytics system consists of an off-the-shelf host server with a PCI Express (PCIe)-attached FPGA accelerator card, and it runs a commercial DBMS. Our accelerator is connected directly to the processor's main memory instead of being in the I/O path, and it processes the latest in-memory data in the DBMS's buffer pool. To offload a query, the DBMS issues a command to the FPGA via the control software (see Figure 2). The command contains
Figure 1. An example "employee" table (columns: EmployeeID, Firstname, Lastname, Department, Year of joining) and a Structured Query Language (SQL) query. SQL queries use various clauses to represent different data-retrieval operations. This query retrieves the employee ID and department of all employees who joined the company before 2002:

    Select:   Emp_Id, Dept
    From:     Employee
    Where:    Joining_year < 2002
    Order by: Emp_Id
RECONFIGURABLE COMPUTING
20 IEEE MICRO
the query specification and pointers to the data that is to be processed. The FPGA pulls standard database pages from the main memory, parses the pages to extract and process the rows, and writes the qualifying rows back to the main memory in database-formatted pages. All the data transfers are managed by the accelerator without any DBMS involvement. Standard techniques such as block DMA operations, job queuing, and double buffering have been implemented to achieve high accelerator utilization and peak PCIe bandwidth. Also, the query-processing engine is designed to match the sustained bus bandwidth and to allow trading off query-processing throughput for the complexity needed to handle complex queries.
Below, we present an overview of the query acceleration pipeline on the FPGA, followed by the different query operations implemented on the FPGA.
Query acceleration pipeline on FPGAs

Query processing in our FPGA accelerator proceeds in a streaming fashion; rows stream through a series of pipelined execution units, each implementing one query operation (see Figure 2). Operations such as join and sort are not purely streaming, and they require the rows to be temporarily held in the accelerator. Because the FPGA block RAM (BRAM) is limited, only the columns necessary for these operations are held in BRAM, whereas the full records are stored in on-card DRAM. Because predicate evaluation and projection are performed before join and sort, disqualified rows and unwanted columns are eliminated, significantly reducing the amount of data to be stored. The filtered rows that pass the query operations are read from the DRAM and streamed back to the host in uncompressed page format for direct consumption by the application.
Because of architectural limitations, current FPGA devices operate at an order-of-magnitude slower clock than high-end CPUs. To bridge this gap and achieve further performance improvement, FPGA designs rely heavily on exploiting a high degree of parallelism. To that end, our design is architected to exploit parallelism at various levels. Our FPGA design obtains page-level parallelism by maintaining two or more concurrent page streams (tiles). Within each tile, the FPGA achieves row-level parallelism by concurrently evaluating multiple records from a page. Finally, our design obtains the finest-grained parallelism by concurrently evaluating multiple predicates against different columns within the row.
Depending on the target workload (set of queries), our design can be scaled to achieve
Figure 2. Field-programmable gate array (FPGA)-accelerated database system. Database pages are copied from the host memory into the FPGA over PCIe. Rows from these pages stream through a series of pipelined query operators (page parser, decompression, row scanner, projection, join/sort, and output page formatter), and qualifying rows are packed back in database-formatted pages and copied back to the host memory.
JANUARY/FEBRUARY 2014 21
maximum performance, given the available FPGA resources. In other words, the number of parallel page streams and parallel row-processing engines can be traded against the type and complexity of the query operations. The best-matching configuration can be selected from a library of prebuilt configurations. Loading a new FPGA configuration typically takes a few hundred milliseconds, which is insignificant compared to the runtime of typical analytics queries.
Row decompression

Offloading decompression enables the FPGA to directly consume compressed database pages and accelerate the per-row decompression task while also providing higher "effective" PCIe data-transfer bandwidth. The bandwidth advantage is especially important because the overall accelerator performance is often limited by the bus bandwidth.
Our target DBMS uses a dictionary-based modified Lempel-Ziv compression algorithm, in which a compressed row consists of a series of symbols. Decompression is inherently serial, involving symbol expansion using dictionary lookups, where the next lookup depends on the current one. Moreover, being a variable-byte operation with random lookups, decompression consumes significant CPU cycles and often suffers from frequent cache misses.
On the FPGA, an optimized decompression datapath results in a highly efficient decompression operation. For deterministic, single-cycle lookups, the decompression dictionary is preloaded into FPGA block RAMs. The dictionary must be reloaded only if the database table changes. For improved pipeline efficiency, symbol prefetch logic is implemented; compared to the nonprefetch FPGA design, prefetching reduces the decompression time by more than 50 percent in some cases.7 The decompression engine has a very small footprint, allowing the instantiation of multiple engines on the FPGA for row-level parallelism.
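The serial, dictionary-driven expansion described above can be sketched in software. The toy symbol encoding and dictionary below are illustrative assumptions, not the DBMS's actual compression format; only the structure (literal bytes plus dictionary references whose expansion is inherently serial) mirrors the text.

```python
# Software analogue of dictionary-based symbol expansion; the FPGA
# preloads the dictionary into block RAM for single-cycle lookups,
# whereas here a plain dict stands in. The symbol encoding (literal int
# bytes vs. named dictionary references) is an illustrative assumption.

def decompress_row(symbols, dictionary):
    """Expand a sequence of symbols into a row (bytes).

    Expansion is inherently serial: a dictionary entry may itself
    contain references, so the next lookup depends on the current one.
    """
    out = bytearray()
    for sym in symbols:
        if isinstance(sym, int):                  # literal byte
            out.append(sym)
        else:                                     # dictionary reference
            out.extend(decompress_row(dictionary[sym], dictionary))
    return bytes(out)

# Toy dictionary: entry "A" expands to literals; "B" reuses "A".
dictionary = {"A": [0x68, 0x69], "B": ["A", 0x21]}
row = decompress_row(["B", 0x20, "A"], dictionary)   # b"hi! hi"
```

The chained lookup ("B" expands through "A") is what makes the operation serial and lookup-latency-bound on a CPU, and why single-cycle BRAM lookups and prefetching pay off on the FPGA.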
Predicate evaluation

Predicate evaluation selects the rows of interest from a table by applying filtering criteria (predicates) on one or more columns
Figure 3. Row scanner for predicate evaluation. A row stream feeds a chain of predicate evaluation units (PEs) whose comparator outputs are combined by a reduction network; the PEs operate concurrently and independently of one another.
of the rows. From a CPU-consumption perspective, predicate evaluation against each row becomes an intensive operation as the number of predicates increases, especially for large analytics queries, which often involve filtering millions to billions of rows.
In our FPGA design, predicates are evaluated by a chain of predicate evaluation units (PEs) inside the row scanner (see Figure 3). Each PE evaluates a single predicate by comparing a stored predicate value, P(i), supplied by the query against up to a 64-bit-long column of the streaming database row. PEs operate concurrently and independently of one another. A configurable, binary reduction network combines the individual outputs of the PEs to implement complex filter criteria according to the query, generating a 1-bit qualify signal. The number of PEs inside a row scanner and the size of the reduction network are configurable at synthesis time, to allow for area-versus-complexity trade-offs. Software can further customize a given FPGA configuration by loading the PE operation, the predicate value, and arbitrary reduction patterns on a per-query basis.
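As a rough software model of this row scanner, the sketch below wires up a chain of PEs and a reduction function; the operator table, dict-based row representation, and function names are illustrative assumptions, not the hardware interface.

```python
# Software model of the row scanner: a chain of predicate-evaluation
# units (PEs), each comparing one stored predicate value against one
# column, feeding a reduction network that combines their 1-bit outputs
# into a single qualify signal.

import operator

OPS = {"<": operator.lt, "<=": operator.le, "==": operator.eq,
       "!=": operator.ne, ">": operator.gt, ">=": operator.ge}

def make_pe(column, op, value):
    """Configure one PE with its column, operation, and predicate value."""
    cmp = OPS[op]
    return lambda row: cmp(row[column], value)

def row_scanner(row, pes, reduce_fn):
    # In hardware the PEs evaluate concurrently; here we collect their
    # independent outputs and reduce them to a single qualify bit.
    return reduce_fn(pe(row) for pe in pes)

# Per-query configuration: year < 2002 AND dept == "Engineering".
pes = [make_pe("year", "<", 2002), make_pe("dept", "==", "Engineering")]
qualify = row_scanner({"year": 2001, "dept": "Engineering"}, pes, all)
```

Passing `all` models an AND reduction; an arbitrary reduction pattern (as the text describes) corresponds to substituting a different combining function.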
Table joins

The join operation aims to match records from one table against the other, using the join column. Because the naive nested-loop implementation has quadratic computational complexity, most DBMSs employ sort-merge-join or hash-join for efficient computation. Hash-join often performs better because of its linear algorithmic complexity, though hashing introduces false positives that must be resolved. Besides, hash-join on CPUs suffers from high cache misses due to
Figure 4. Build and probe phases of hash-join on the FPGA. Bit vectors are used to quickly eliminate most of the nonmatching rows, whereas the address tables are used for eliminating the false positives. Only the true positive matches result in off-chip memory access.
random memory accesses based on the hash value.
We implement a hash-based join in our FPGA, which performs hashing, hash-conflict resolution, and row materialization (creation of the joined rows) at the PCIe streaming rate.8 Two or more join operations can be performed concurrently using parallel join channels.
Figure 4 shows the two phases of FPGA hash-join. During the build phase, one of the join tables (usually the smaller one) is streamed through, and the join columns are hashed to populate a bit vector. The full rows are stored in off-chip DRAM, whereas the join columns and the row addresses are stored in the address table in the FPGA BRAM; rows hashing to the same position are chained in the address table.
The second table is streamed during the probe phase; rows are hashed for probing the bit vector, for quick elimination of nonmatching rows. The FPGA subsequently removes the false positives by comparing against the values in the address table. Only true matches issue reads from off-chip memory and are materialized. This multilevel filtering significantly reduces off-chip accesses and is critical to handling streaming rows without stalls.
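The build and probe phases can be sketched in software as follows. The bit-vector size, hash function, and table layouts are illustrative assumptions; unlike the FPGA design, which keeps full rows in off-chip DRAM and only (join column, row address) pairs in BRAM, this sketch chains the rows themselves in the address table for brevity.

```python
# Two-phase hash-join sketch: build populates a bit vector and an
# address table from the smaller table; probe uses the bit vector for
# quick rejection and the address table to eliminate hash false
# positives before materializing joined rows.

BITS = 64                                  # bit-vector size (illustrative)

def h(key):
    return hash(key) % BITS

def build(rows, key_col):
    bitvec = [False] * BITS
    addr_table = {}                        # slot -> chained (key, row) pairs
    for row in rows:
        slot = h(row[key_col])
        bitvec[slot] = True
        addr_table.setdefault(slot, []).append((row[key_col], row))
    return bitvec, addr_table

def probe(rows, key_col, bitvec, addr_table):
    for row in rows:
        slot = h(row[key_col])
        if not bitvec[slot]:               # quick elimination of nonmatches
            continue
        for key, built in addr_table[slot]:
            if key == row[key_col]:        # false positives removed here
                yield {**built, **row}     # row materialization

depts = [{"dept_id": 1, "dept": "Engineering"}, {"dept_id": 2, "dept": "Sales"}]
emps = [{"dept_id": 1, "name": "Anna"}, {"dept_id": 3, "name": "Jack"}]
bitvec, addr_table = build(depts, "dept_id")   # build on the smaller table
joined = list(probe(emps, "dept_id", bitvec, addr_table))
```

The multilevel filtering is visible in `probe`: most nonmatching rows fall out at the bit-vector check, and only true matches reach materialization, mirroring why only true positives cost off-chip memory accesses in the FPGA.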
Column projection

Queries often request only a few columns from each record for reporting or analytics purposes. Including projection in the accelerator's pipeline thus provides significant bandwidth, processing, and storage savings by removing unwanted columns in early stages. Moreover, projection is required to extract the columns for sort and join operations and to format the output for consumption by the DBMS.
Projection in our FPGA is performed concurrently with predicate evaluation, and the projected row is carried forward if it qualifies. The FPGA design maintains row-level parallelism by instantiating multiple copies of the projection logic, one for each row scanner.
The DBMS specifies the projection columns by loading their lengths and positions in the row into the FPGA BRAM during the query load phase. We store this information in the order of those columns in the record, thereby requiring only one comparison at a time. This allows for a very efficient implementation of the projection logic, independent of the number of columns to be projected. As the rows stream over the projection unit, the required columns are captured while the rest are discarded.
Database tables often contain columns whose length and position vary from row to row (for example, a string). Projecting such columns requires computing this information for each row from the row metadata, which can significantly affect performance. Our FPGA projects variable-length columns at the full streaming rate by employing a two-phase hybrid streaming scheme. First, the row is staged in a buffer, and the length and position of all the variable columns to be projected are computed. We refer to this as resolving the variable columns into fixed columns. Next, the row is streamed through, skipping over the metadata, as it is no longer required; variable columns are then projected, much like the fixed-length columns. The savings from skipping the metadata more than compensate for the extra cycles spent during the resolution step.
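A minimal software sketch of the two-phase scheme, assuming a hypothetical row format in which each variable-length column is preceded by a 1-byte length: phase 1 resolves variable columns into fixed (offset, length) descriptors, and phase 2 captures only the wanted columns.

```python
# Two-phase hybrid projection sketch. The row format (1-byte length
# prefix per variable column) is a hypothetical stand-in for the DBMS's
# row metadata. Descriptors are kept in column order, so each phase
# checks only one descriptor at a time, as in the FPGA design.

def resolve_columns(row, columns):
    """Phase 1: resolve variable columns into fixed (offset, length)."""
    resolved, pos = [], 0
    for name, kind, length in columns:
        if kind == "var":
            length = row[pos]              # length from row metadata
            pos += 1                       # metadata byte gets skipped later
        resolved.append((name, pos, length))
        pos += length
    return resolved

def project(row, resolved, wanted):
    """Phase 2: stream the row, capturing only the wanted columns."""
    return {name: row[off:off + length]
            for name, off, length in resolved if name in wanted}

# Row: fixed 2-byte id, variable-length name (length-prefixed), fixed 2-byte dept.
columns = [("id", "fix", 2), ("name", "var", 0), ("dept", "fix", 2)]
row = bytes([0x12, 0x01]) + bytes([3]) + b"Ann" + b"HR"
out = project(row, resolve_columns(row, columns), {"id", "dept"})
```

Once phase 1 has run, phase 2 is identical for fixed and variable columns, which is the point of the resolution step.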
Database sort

Database sort keys are created by extracting one or more fields from a much larger row. These keys are thus generated once every few cycles, obviating the need for the fastest sort implementation. Instead, the main requirements for sort in databases are support for long sort keys (tens of bytes or longer), handling large payloads (rows) associated with each key, and generating large sorted batches (millions of records) in multiphased sort. These requirements point toward a highly hardware-efficient algorithm, ideally having hardware resource requirements independent of the number of records to be sorted. Furthermore, database applications prefer streaming sort for large-query processing.
For these reasons, we find the tournament tree sort algorithm most suitable.9 Among different hardware sorters, tournament tree sort requires the fewest comparators and can generate large sorted batches. A tree with N leaf nodes guarantees a minimum batch of size N; however, for almost-sorted
data, which happens frequently in databases, much larger batches can be generated.
To the best of our knowledge, this is the first hardware implementation of the tournament tree algorithm. Our FPGA design implements two independent tournament tree sorters, each with 16 thousand nodes, followed by a merger, thereby generating minimum batches of 32 thousand records.
For each tree, the leaf nodes containing the sort keys are stored in the key BRAM, whereas the corresponding row is stored in off-chip DRAM. Row pointers are stored with the keys and are used to fetch the rows as sorted keys get emitted. The nonleaf nodes contain the pointers to the losing keys (leaf nodes) at different tree levels and are held in the loser pointer BRAM (see Figure 5).
Each new key is entered into the leaf node of the last sorted key and moved up the tree by comparing it with the parent nodes, updating the loser pointers and the overall winning key as needed. The tree is traversed using simple modulo arithmetic to compute the parent key's address. The entire tree is serviced using a single comparator. To meet timing, the large comparator must be implemented as a multistage pipeline. For pipelining of dependent comparisons, we implement a speculative comparator using two independent two-stage comparators. The overall sort design uses just five two-stage comparators—two per tree, plus one for the merger—while sustaining full PCIe bandwidth for most row sizes. We can achieve higher throughputs by further splitting the trees at the expense of more comparators.
A tournament tree can continue to sort beyond the minimum batch size, provided that the incoming keys don't violate the order of the already-sorted keys. Unless handled in a special way, such violating keys require flushing the keys in the tree and starting a new batch from an empty tree, incurring extra tree setup and teardown costs and significantly reducing the sorting throughput. We address this by coloring the keys as they enter the tree. In other words, we attach a prefix to each key, thereby implicitly binning the keys on the fly into different sorted batches. A differently colored key always loses against any key of the current color. Coloring thus allows the violating keys to participate in sorting without corrupting the current sorted batch, thereby eliminating the need to drain the tree between batches.
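The key-coloring idea can be modeled in software with replacement selection, the classic run-generation analogue of tournament tree sort; here Python's heapq stands in for the pipelined comparator tree, and the "color" is simply a batch-number prefix on each key. This is a behavioral sketch under those assumptions, not the FPGA implementation.

```python
# Behavioral model of key coloring via replacement selection. Each key
# carries a (color, key) prefix, so a key that would violate the current
# batch's order simply enters with the next color -- it loses to every
# current-color key -- instead of forcing a flush of the tree.

import heapq
from itertools import islice

def colored_sort(stream, capacity):
    """Yield (color, key) pairs; each color's keys come out sorted, and
    every batch holds at least `capacity` keys (more for almost-sorted
    input)."""
    it = iter(stream)
    heap = [(0, key) for key in islice(it, capacity)]  # fill the tree
    heapq.heapify(heap)
    for key in it:
        color, winner = heapq.heappop(heap)
        yield color, winner
        # A new key smaller than the emitted winner would corrupt the
        # current batch, so it joins the next batch instead.
        heapq.heappush(heap, (color if key >= winner else color + 1, key))
    while heap:                                        # drain at end of input
        yield heapq.heappop(heap)

batches = list(colored_sort([5, 1, 4, 2, 3], capacity=2))
# batch 0 emits 1, 4, 5; batch 1 emits 2, 3 -- no flush in between
```

Because the tuple comparison puts the color first, a differently colored key always loses against any current-color key, exactly the property the text relies on.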
Acceleration enablement of the DBMS

Leveraging the FPGA-accelerated query processing requires two modifications to a standard DBMS: restructuring the DBMS for seamless plug-in of the accelerator into the DBMS's query operation flow, and transforming the different query operations to map to the hardware-accelerated functions.
Figure 5. Tournament tree sort implementation on an FPGA. The tree is implemented using block RAMs and only two comparators. Key coloring is used to avoid the need to flush the tree between consecutive sorted runs.
For efficient accelerator utilization and high query-processing throughput, the interactions between the DBMS and the I/O-attached accelerator must happen at the block level as opposed to the traditional one-row-at-a-time flow. For this purpose, we restructured the DBMS to introduce block-level data operations within the DBMS query flow. The restructured DBMS communicates with the FPGA through a series of control blocks.7 A long-running query is divided into multiple jobs. For each job, the DBMS obtains a list of buffer pool pages to be processed and locks them in the host memory; the list of page pointers is sent to the FPGA. Note that locking can affect the OLTP queries, and issuing multiple lightweight jobs is critical to maintaining a fine locking granularity. Multiple outstanding jobs are issued and queued to the accelerator, which processes them sequentially. The FPGA returns the results in the format expected by the DBMS processing engine for further downstream processing; hence, no additional data copying or formatting is required.
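The block-level job flow described above might be sketched as follows; the Job structure, queue, and function names are hypothetical, not the DBMS's actual control-block API.

```python
# Sketch of the block-level job flow: a long-running query is split into
# lightweight jobs, each carrying the pointers of a small batch of
# pinned buffer-pool pages; jobs are queued and the accelerator drains
# them sequentially.

from collections import deque
from dataclasses import dataclass

@dataclass
class Job:
    query_id: int
    page_pointers: list        # pages locked in host memory for this job

job_queue = deque()

def issue_jobs(query_id, all_pages, pages_per_job):
    """DBMS side: queue one job per small batch of pages. Small batches
    keep the locking granularity fine, so OLTP is minimally affected."""
    for i in range(0, len(all_pages), pages_per_job):
        job_queue.append(Job(query_id, all_pages[i:i + pages_per_job]))

def accelerator_loop(process_page):
    """Accelerator side: process queued jobs sequentially."""
    results = []
    while job_queue:
        job = job_queue.popleft()
        results.extend(process_page(p) for p in job.page_pointers)
    return results

issue_jobs(query_id=1, all_pages=list(range(5)), pages_per_job=2)
results = accelerator_loop(lambda page: page * 10)
```

The trade-off the text describes is visible in `pages_per_job`: smaller jobs hold fewer page locks at a time, at the cost of more queue operations per query.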
For mapping the query operations onto the FPGA, the query is transformed into a query control block (QCB), a data structure that contains information about the record structure, the query predicates, and other query operations. A QCB can be interpreted by the FPGA to tailor the application logic to a specific query. The DBMS must transform each target query into QCB specifications. Because SQL provides rich syntax to express data relations, simply parsing the SQL query is not sufficient to provide all the required information in the QCB.
Existing SQL parsing and transformation routines in the DBMS provide some of the intermediate expressions required for generating the query specifications for the FPGA. We devised techniques to efficiently parse these internal expressions to extract the relevant information for creating the QCB. Moreover, because certain queries might not benefit from FPGA offload, the query transformation function also decides whether a query should be directed to the FPGA.
Performance evaluation

Our prototype is built on a commercial DBMS running on a 3.8-GHz multicore superscalar system with a PCIe-attached FPGA card with an Altera Stratix 5SGXA7 and 8 Gbytes of DDR3-1333 memory. The FPGA design runs at 200 MHz. Our experimental workload is derived from real customer tables, and we evaluated four types of queries:
1. predicate evaluation only;
2. decompression and predicate evaluation;
3. decompression, predicate evaluation, projection, and sort; and
4. decompression, predicate evaluation, and join.
These queries resemble the TPC-H Q1 and TPC-DS template-3 queries.
Figure 6 shows the CPU savings from offloading analytics queries with different row qualification ratios. A higher offload means more CPU resources are freed up for OLTP. As shown, the savings are higher when a smaller fraction of rows qualify, gradually decreasing with increasing qualification ratio. This is due to the CPU requirements for post-processing the qualified rows—for example, moving data to the application buffer. Overall, CPU savings are higher on type 2 queries compared to type 1 queries; the offload for type 4 queries would be even higher due to a larger fraction of the query task being offloaded. In other words, larger CPU savings can be achieved on such queries, even with a large number of rows qualifying. Type 3 queries, however, could require a merge step on the CPU, depending on the sort batch size generated by the FPGA, potentially resulting in a slight reduction in the overall savings. This depends on the particular workload's characteristics.
From the perspective of performance improvement on the offloaded queries, Table 1 compares the row-processing throughput in the FPGA against the baseline unmodified DBMS running on a single core of our system. Overall, the FPGA achieves speedups in the range of 7× to 14× for most queries, except for type 1 queries. Type 1 queries operate on uncompressed data, and the performance is limited by the data-transfer bandwidth over the PCIe bus. This is evident in going from type 1 to type 2 queries: throughput in the FPGA increases significantly,
owing to the increase in effective bandwidth, whereas the CPU throughput falls drastically because of the cost of decompressing each row. On the FPGA, better compression results in higher speedups, and very high compression ratios cause the performance bottleneck to shift from PCIe to the FPGA's query-processing capabilities.
For more complex type 3 and type 4 queries, the throughput on the baseline CPU
Figure 6. CPU savings in the OLTP system from offloading type 1 and type 2 queries to the FPGA accelerator, at qualification ratios from 0.30% to 99.50%. Higher query qualification ratios reduce the overall CPU savings due to the CPU requirements for post-processing of qualified data.
Table 1. Row processing throughput and FPGA speedup.

Query  Row length  Compression  Throughput (million rows/s)  FPGA
type*  (bytes)     factor       Baseline        FPGA         speedup
1      170         1×           12.57           14.60         1.1×
1      235         1×            8.16           10.15         1.2×
2      170         5×            3.57           38.20        10.7×
2      235         2×            3.16           21.20         6.7×
3       80         2×            2.48           28.00        11.3×
3      170         2×            2.05           28.00        13.6×
3      235         2×            1.62           21.00        12.9×
3      420         2×            1.30           19.00        14.6×
4      170         5×            1.60           18.00        11.2×

* Query type 1: predicate evaluation only; query type 2: decompression and predicate evaluation; query type 3: decompression, predicate evaluation, projection, and sort; query type 4: decompression, predicate evaluation, and join.
further decreases, whereas the FPGA throughput is relatively stable. This is due to the pipelined execution of these operations in the FPGA. The main factor that affects the FPGA throughput is the row size; the smaller the row, the higher the throughput, because more rows can be transferred over a given PCIe bandwidth. For type 3 queries with rows up to 180 bytes, however, the performance is limited by the sort tree and doesn't improve for smaller row sizes. In terms of execution complexity, join is the most expensive operation, in both the CPU and the FPGA. For similar-sized rows, the throughput for the type 4 query is about half that of other queries.
Offloading analytics queries to an FPGA-based accelerator provides a viable, attractive solution for running analytics queries against transactional data, for real-time market responsiveness. This approach provides significant improvement in response time of offloaded analytics queries while saving CPU resources for mission-critical OLTP workloads. Our solution can be easily scaled across multiple accelerator nodes, either for a single query or across different queries, without any explicit data partitioning.
References
1. J. Krueger et al., "Fast Updates on Read Optimized Databases Using Multicore CPUs," Proc. VLDB Endowment, vol. 5, no. 1, 2012, pp. 61-72.
2. R. Johnson et al., "Row-Wise Parallel Predicate Evaluation," Proc. VLDB Endowment, vol. 1, no. 1, 2008, pp. 622-634.
3. N. Satish et al., "Fast Sort on CPUs and GPUs: A Case for Bandwidth Oblivious SIMD Sort," Proc. ACM SIGMOD Int'l Conf. Management of Data (SIGMOD 10), 2010, pp. 351-362.
4. R. Mueller, J. Teubner, and G. Alonso, "Glacier: A Query-to-Hardware Compiler," Proc. ACM SIGMOD Int'l Conf. Management of Data (SIGMOD 10), 2010, pp. 1159-1162.
5. T. Horikawa, "An Unexpected Scalability Bottleneck in a DBMS: A Hidden Pitfall in Implementing Mutual Exclusion," Proc. Parallel and Distributed Computing and Systems (PDCS 2011), 2011, doi:10.2316/P.2011.757-036.
6. R. Ramakrishnan and J. Gehrke, Database Management Systems, 3rd ed., McGraw-Hill, 2002.
7. B. Sukhwani et al., "Database Analytics Acceleration Using FPGAs," Proc. 21st Int'l Conf. Parallel Architectures and Compilation Techniques, ACM, 2012, pp. 411-420.
8. R. Halstead et al., "Accelerating Join Operation for Relational Databases with FPGAs," Proc. IEEE Symp. FCCM, IEEE CS, 2013, pp. 17-20.
9. D.E. Knuth, The Art of Computer Programming, Vol. 3—Sorting and Searching, Addison-Wesley, 1973.
Bharat Sukhwani is a research staff member at the IBM T.J. Watson Research Center. His research interests include computer architecture; parallel computing; application-specific, high-performance computing systems; and computing using reconfigurable and graphics processors. Sukhwani has a PhD in electrical engineering from Boston University. He is a member of IEEE.
Hong Min is a senior technical staff member at the IBM T.J. Watson Research Center. Her research interests include database systems and acceleration technologies for data processing. Min has a PhD in mechanical engineering from Drexel University.
Mathew Thoennes is a senior engineer at the IBM T.J. Watson Research Center. His research interests include software and hardware for Series z systems. Thoennes has an MS in computer science from the University of Massachusetts. He's a member of the ACM and a senior member of IEEE.
Parijat Dube is a senior engineer at the IBM T.J. Watson Research Center. His research interests include performance modeling, analysis, and optimization of systems. Dube has a PhD in computer science from INRIA.
Bernard Brezzo is a senior engineer at the IBM T.J. Watson Research Center. His
research interests include high-performance computing, FPGA emulation, and Synapse. Brezzo has a Diploma in electronics engineering from Conservatoire National des Arts et Métiers, Paris.
Sameh Asaad is a research staff member at the IBM T.J. Watson Research Center, where he manages the Digital Design Group in the Communication and Computation Subsystems Department. His research interests include domain-optimized architectures and systems, reconfigurable computing, and FPGA-based application acceleration. Asaad has a PhD in electrical engineering from Vanderbilt University.
Donna Eng Dillenberger leads IBM's global research on enterprise systems at the IBM T.J. Watson Research Center. She's also IBM's Chief Technology Officer of IT Optimization, an IBM Distinguished Engineer and Master Inventor, and an adjunct professor at Columbia University. Her research interests include hybrid architectures, Cloud, and IT optimization. Dillenberger has an MS in computer science from Columbia University.
Direct questions and comments about this article to Bharat Sukhwani, IBM Research, 1101 Kitchawan Road, Yorktown Heights, NY 10598; [email protected].