
Page 1: GPU Processing in Database Systems - AMD

24.06.2011 DIMA – TU Berlin 1

Fachgebiet Datenbanksysteme und Informationsmanagement, Technische Universität Berlin

http://www.dima.tu-berlin.de/

GPU Processing in Database Systems

Max Heimel, Nikolaj Leischner, Volker Markl, Michael Säcker

Page 2: GPU Processing in Database Systems - AMD


■ Database Systems Background
■ Overview of GPU Technology
■ Architecture of a Hybrid CPU/GPU System
■ Database Operations on the GPU
■ Alternative Architectures
■ Research Challenges of Hybrid Architectures
■ Current GPU/CPU Hybrid DBMS Systems

Agenda

Page 3: GPU Processing in Database Systems - AMD


Database Query

■ SQL is declarative: specifies what data is needed, not how to get it

SELECT DISTINCT o.name, a.driver
FROM   owner o, car c, demographics d, accidents a
WHERE  c.ownerid = o.id
  AND  o.id = d.ownerid
  AND  c.id = a.id
  AND  c.make = 'Mazda' AND c.model = '323'
  AND  o.country3 = 'EG' AND o.city = 'Cairo'
  AND  d.age < 30;

Find the owners and drivers of Mazda 323s that have been involved in accidents in Cairo, Egypt, where the driver was younger than 30.

Page 4: GPU Processing in Database Systems - AMD


Relational Database Operators

Selection/Filtering: σ_city='Berlin'(owner)
Projection: π_city(owner)
Join: cars ⋈ owner on cars.ownerid = owner.ownerid

cars:
make   | model  | ownerid
Honda  | Accord | 1
Toyota | Camry  | 1
Mazda  | 323    | 2

owner:
ownerid | city
1       | Berlin
2       | Potsdam
3       | Berlin

σ_city='Berlin'(owner):
ownerid | city
1       | Berlin
3       | Berlin

π_city(owner):
city
Berlin
Potsdam

cars ⋈ owner (on ownerid):
make   | model  | ownerid | city
Honda  | Accord | 1       | Berlin
Toyota | Camry  | 1       | Berlin
Mazda  | 323    | 2       | Potsdam

Further important database operators: union, intersection, set difference, Cartesian cross product, grouping/aggregation, sort

Page 5: GPU Processing in Database Systems - AMD


■ Bottom-up data flow graph

■ Strings together operators

■ Many QEPs for a query
□ physical operators
□ operator order
□ pipelining vs. bulk transfer

■ Query optimization
□ determines "best" QEP

■ Cost model
□ for each operator
□ for operator combination

Query Execution Plan (QEP)

[Execution plan tree with per-operator cardinality estimates: RETURN over a nested-loop join (NLJN); its outer input is a SORT over a hash join (HSJN) of a DEMOGRAPHICS scan with another HSJN of index-based FETCHes on CAR and OWNER; its inner input is an index scan (ISCAN) plus FETCH on ACCIDENTS]

SELECT o.name, a.driver
FROM   owner o, car c, demographics d, accidents a
WHERE  c.ownerid = o.id
  AND  o.id = d.ownerid
  AND  c.id = a.id
  AND  c.make = 'Mazda' AND c.model = '323'
  AND  o.country3 = 'EG' AND o.city = 'Cairo'
  AND  d.age < 30;

Query → Execution Plan

Page 6: GPU Processing in Database Systems - AMD


■ Database Systems Background■ Overview of GPU Technology■ Architecture of a Hybrid CPU/GPU System■ Database Operations on the GPU■ Alternative Architectures■ Research Challenges of Hybrid Architectures■ Current GPU/CPU Hybrid DBMS Systems

Agenda

Page 7: GPU Processing in Database Systems - AMD


A Case for GPU architectures

■ Challenge in processor architecture
□ Moore's law is hitting a wall
− power wall
− memory wall
− ILP wall
− …
□ Solution: parallel processing

■ GPGPUs offer massively parallel processing
□ Many more cores
□ High GFLOPS and memory bandwidth
□ Lower cost per core
□ Can be used to build and test massively parallel systems

Page 8: GPU Processing in Database Systems - AMD


■ GFLOPS & GB/s for different CPU & GPU chips..

■ Beware: theoretical peak numbers

GFLOPS & GB/s

Wikipedia, Intel, AMD, Nvidia, IBM

[Two charts plotting release year (2006-2012) against peak memory bandwidth in GB/s (0-160) and peak GFLOP/s (0-600) for Intel Xeon, AMD Opteron, IBM Power7, SPARC64 VIIIfx, AMD FireStream 9270/9370, and Nvidia Tesla C870/C1060/C2050 chips]

Page 9: GPU Processing in Database Systems - AMD


■ Most die space devoted to control logic & caches

■ Maximize performance for arbitrary, sequential programs

What is the difference? Look at a modern CPU..

AMD K8L die photo (source: www.chip-architect.com)

Page 10: GPU Processing in Database Systems - AMD


■ Little control logic, a lot of execution logic

■ Maximize parallel computational throughput

And at a GPU..

ATI RV770 die photo (source: ixbtlabs.com)

Page 11: GPU Processing in Database Systems - AMD


■ SIMD cores with small local memories (16-48 kB)
■ Shared high-bandwidth + high-latency DRAM (1-6 GB)
■ Multi-threading hides memory latency
■ Bulk-synchronous execution model (see the minimal kernel sketch below)

GPU architecture

[Diagram: an x86 host attached to the GPU's shared DRAM, which serves several SIMD cores; each core holds an array of ALUs plus a small local memory]
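To make the execution model concrete, here is a minimal CUDA sketch (not from the original slides; all names and sizes are illustrative assumptions): the x86 host allocates device DRAM, pushes data across PCIe, launches a grid of thread blocks in which every thread handles one element, and copies the result back.

#include <cstdio>
#include <cuda_runtime.h>

// Each GPU thread processes exactly one array element (data-parallel map).
__global__ void scaleKernel(float *data, float factor, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // global thread id
    if (i < n) data[i] *= factor;
}

int main() {
    const int n = 1 << 20;
    float *host = new float[n];
    for (int i = 0; i < n; ++i) host[i] = 1.0f;

    float *dev = nullptr;
    cudaMalloc(&dev, n * sizeof(float));                                // GPU DRAM
    cudaMemcpy(dev, host, n * sizeof(float), cudaMemcpyHostToDevice);   // over PCIe

    scaleKernel<<<(n + 255) / 256, 256>>>(dev, 2.0f, n);                // grid of 256-thread blocks

    cudaMemcpy(host, dev, n * sizeof(float), cudaMemcpyDeviceToHost);   // back over PCIe
    printf("first element: %f\n", host[0]);                             // prints 2.000000

    cudaFree(dev);
    delete[] host;
    return 0;
}

The two cudaMemcpy calls are exactly the PCIe transfers that later slides identify as the main bottleneck.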

Page 12: GPU Processing in Database Systems - AMD


■ Most database operations are well-suited for massively parallel processing

■ Efficient algorithms for many database primitives exist
□ Scatter, gather/reduce, prefix sum, sort, …

■ Find ways to utilize higher bandwidth & arithmetic throughput

GPUs & databases: opportunities

Page 13: GPU Processing in Database Systems - AMD


■ Find a way to live with architectural constraints
□ GPU-to-CPU bandwidth (PCIe) smaller than CPU-to-RAM bandwidth
□ Fast GPU memory smaller than CPU RAM

■ Technical hurdles
□ GPU is a co-processor:
− needs the CPU to orchestrate work
− to get data from storage devices
− etc.
□ GPU programming models (e.g., OpenCL, CUDA) are low level
− need architecture-specific tuning
− lack database primitives
□ Limited support for multi-tasking
− extra work such as aggregating operations into bulks required

GPUs & databases: challenges & constraints

Page 14: GPU Processing in Database Systems - AMD


■ J. Owens: GPU architecture overview. In ACM SIGGRAPH 2007 courses (SIGGRAPH '07). ACM, New York, NY, USA, Article 2.

■ H. Sutter: The Free Lunch is Over: A Fundamental Turn Toward Concurrency in Software, Dr. Dobb's Journal, 30(3), March 2005, http://www.gotw.ca/publications/concurrency-ddj.htm (Visited May 2011)

■ V. W. Lee, C. Kim, J. Chhugani, M. Deisher, D. Kim, A. D. Nguyen, N. Satish, M. Smelyanskiy, S. Chennupaty, P. Hammarlund, R. Singhal, and P. Dubey: Debunking the 100X GPU vs. CPU myth: an evaluation of throughput computing on CPU and GPU. SIGARCH Comput. Archit. News 38, 3 (June 2010), 451-460.

References & Further Reading

Page 15: GPU Processing in Database Systems - AMD


■ Database Systems Background
■ Overview of GPU Technology
■ Architecture of a Hybrid CPU/GPU System
■ Database Operations on the GPU
■ Alternative Architectures
■ Research Challenges of Hybrid Architectures
■ Current GPU DBMS Systems

Agenda

Page 16: GPU Processing in Database Systems - AMD


Hybrid CPU+GPU system

■ 1 or several (interconnected) multicore CPUs

■ 1 or several GPUs

[Diagram: two sockets, each with a CPU attached to four RAM modules and an I/O hub (IOH) that connects a GPU and an HDD]

Page 17: GPU Processing in Database Systems - AMD


Bottlenecks

■ PCIe bandwidth & latency

■ GPU memory size

■ GPU has no direct access to storage & network

[Same two-socket CPU/GPU diagram as on the previous slide]

Page 18: GPU Processing in Database Systems - AMD


■ Fully-fledged GPU query processor
□ GPU hash indices, B+ trees, hash join, sort-merge join
□ Handles data sizes larger than GPU memory, supports non-numeric data types

GDB: GPU query processing

Relational query coprocessing on graphics processors

Bingsheng He, Mian Lu, Ke Yang, Rui Fang, Naga K. Govindaraju, Qiong Luo, Pedro V. Sander


Page 19: GPU Processing in Database Systems - AMD


■ 2x-7x speedup for compute-intensive operators if data fits inside GPU memory

■ Mixed-mode operators provide no speedup
■ No speedup if data does not fit into GPU memory

GDB: results

[Bar chart "Performance of SQL queries (seconds)": non-equijoin (NEJ) and equijoin/hash-join (EJ (HJ)) queries, CPU only vs. GPU only]

Page 20: GPU Processing in Database Systems - AMD


■ Offload large bulks of OLTP queries to GPU

■ CPU generates processing plan based on a transaction dependency graph, GPU does actual query processing

□ System with 1 CPU + 1 GPU

□ Data must fit inside GPU memory

□ Only supports stored procedures

GPUTx: GPU OLTP query processing

High‐Throughput Transaction Executions on Graphics Processors

Bingsheng He and  Jeffrey Xu Yu

Page 21: GPU Processing in Database Systems - AMD


■ Implementation based on H-Store

GPUTx: results

[Bar charts: normalized throughput for TPC-B at scale factors 128-2048 and for TPC-C at scale factors 20-80, comparing CPU (4 cores) with GPUTx]

Page 22: GPU Processing in Database Systems - AMD


■ B. He and J. Xu Yu: High-throughput transaction executions on graphics processors. Proc. VLDB Endow. 4, 5 (February 2011), 314-325.

■ B. He, M. Lu, K. Yang, R. Fang, N. K. Govindaraju, Q. Luo, and P. V. Sander: Relational query coprocessing on graphics processors. ACM Trans. Database Syst. 34, 4, Article 21 (December 2009), 39 pages.

References & Further Reading

Page 23: GPU Processing in Database Systems - AMD


■ No dynamic memory allocation

■ Divergence
□ Threads taking different code paths ⇒ serialization (see the snippet below)

■ Limited synchronization possibilities
□ Only on block level
□ Limited block size

Challenges of GPU Programming
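A tiny CUDA snippet (illustrative only, not taken from the slides; kernel and parameter names are made up) showing the divergence issue mentioned above: threads of the same SIMD group that take different branches are executed one path after the other.

// Threads with even and odd keys take different branches; within one warp
// the hardware serializes the two paths, reducing effective throughput.
__global__ void divergentKernel(const int *keys, int *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    if (keys[i] % 2 == 0)
        out[i] = keys[i] * 2;    // path A
    else
        out[i] = keys[i] + 1;    // path B, executed after path A for the same warp
}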

Page 24: GPU Processing in Database Systems - AMD


■ Database Systems Background
■ Overview of GPU Technology
■ Architecture of a Hybrid CPU/GPU System
■ Database Operations on the GPU

□ Index Operations

□ Compression

□ Sorting

□ Relational Operators

□ Further Operations

■ Alternative Architectures
■ Research Challenges of Hybrid Architectures
■ Current GPU DBMS Systems

Agenda

Page 25: GPU Processing in Database Systems - AMD


► GPGPU architectures offer massively data-parallel processing

► Fruitfly architecture: code may run on future multicore CPUs

Opportunities and Challenges

► Bottlenecks (PCIe, local memory, disk and network access)

► Ensuring correctness while fully utilizing data parallelism is hard

► Synchronization mechanism for read/write conflicts is limited

► Cost modeling hard - some CPU details are unknown or change quickly

Page 26: GPU Processing in Database Systems - AMD


■ Second order functions on arrays

□ Map

□ Scatter (indexed write)

□ Gather (indexed read)

□ Prefix-Scan (combines/reduces entries)

□ Split (partitioning, e.g., hash or range)

□ Sort (e.g., bitonic sort/merge sort, radix sort, quicksort)

■ Library of primitives (building blocks) for GPGPU query processing (a sketch of two such primitives follows below)

Primitives for DBMS Operator Implementations

Relational joins on graphics processors

Bingsheng He, Mian Lu, Ke Yang, Rui Fang, Naga K. Govindaraju, Qiong Luo, Pedro V. Sander
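As referenced above, a minimal CUDA sketch of two of these primitives, gather and scatter (kernel names and the plain int payloads are assumptions for illustration). Split and sort are typically composed from such primitives together with a prefix scan that turns per-partition counts into write positions.

// Gather: indexed read -- out[i] = in[idx[i]]
__global__ void gatherKernel(const int *in, const int *idx, int *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[idx[i]];
}

// Scatter: indexed write -- out[pos[i]] = in[i]; pos must not contain duplicates
__global__ void scatterKernel(const int *in, const int *pos, int *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[pos[i]] = in[i];
}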

Page 27: GPU Processing in Database Systems - AMD


■ Classic: B+ Tree

■ Directory accelerates access, exploiting the fact that the data is sorted and organized in a hierarchical way

■ Compression to map variable-length data to fixed size
□ Dictionary encoding

■ CSS-Tree: encode directory and payload in an array, cache-aware

Indexing

[Diagram: index tree with a directory of keys in the inner nodes and the payload stored in the leaves]

Page 28: GPU Processing in Database Systems - AMD


■ Path of a key within the tree defined by absolute value

■ Split key into equally sized sub-keys

■ Tree depth depends on key length

Prefix Tree Index

INT = 7939 (decimal)
    = 0001 1111 0000 0011 (binary)
    → 4-bit sub-keys: 1, 15, 0, 3, …
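A small sketch of the key-splitting step (assuming 16-bit keys and 4-bit sub-keys, matching the 7939 example above; the function name is made up): the resulting sub-keys are the child indices followed on the path from the root to the leaf.

// 7939 = 0001 1111 0000 0011 (binary) -> sub-keys 1, 15, 0, 3
__host__ __device__ void splitKey16(unsigned short key, unsigned char subKeys[4]) {
    subKeys[0] = (key >> 12) & 0xF;   // 1  -> child index at tree level 0
    subKeys[1] = (key >> 8)  & 0xF;   // 15 -> level 1
    subKeys[2] = (key >> 4)  & 0xF;   // 0  -> level 2
    subKeys[3] =  key        & 0xF;   // 3  -> level 3 (leaf)
}

With 4-bit sub-keys a 16-bit key always yields a tree of depth 4, i.e., the depth depends only on the key length, as stated above.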

Page 29: GPU Processing in Database Systems - AMD


■ Idea: group processing to exploit faster local memory
■ Group computational trees into blocks
■ Tree block is executed by a GPU thread block

□ use shared memory for intermediate results

■ Generate global result by traversing internal result list

Speculative Hierarchical Index Traversal

Page 30: GPU Processing in Database Systems - AMD


GPUs
□ very limited memory capacity
□ high computation capabilities and bandwidth
⇒ Compression reduces the overhead of data transfer
⇒ Compression allows GPUs to address bigger problems

Usually: combine compression schemes for higher compression ratios, e.g., auxiliary scheme RLE and main scheme zero suppression

Goal: carry operations (e.g., filtering) out on the compressed representation of the data

W. Fang, B. He, Q. Luo: Database Compression on Graphics Processors, Proc. VLDB Endow. 3, 1-2 (September 2010), 670-680

Database Compression

1000,1003,1005 → 1000,+0003,+0002 → 1000,+3,+2 (a delta-encoding sketch follows below)
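A minimal CUDA sketch of the delta step illustrated by the example above (the kernel name and the choice of scheme combination are assumptions): a second scheme such as zero/null suppression would then pack the small differences into fewer bits.

// Delta-encode a sorted/clustered column: keep the first value, store differences.
// 1000,1003,1005 -> 1000,+3,+2
__global__ void deltaEncode(const int *in, int *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    out[i] = (i == 0) ? in[0] : in[i] - in[i - 1];
}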

Page 31: GPU Processing in Database Systems - AMD


Sorting with GPUs

■ Numerous GPU algorithms published□ Bitonic sort, bucket sort, merge sort, quicksort, radix sort, sample

sort..

■ Existing work: focus on in-memory sorting with 1 GPU

■ State of the art: merge sort, radix sort

■ GPUs outperform multicore CPUs when data transfer over PCIe is not considered

Page 32: GPU Processing in Database Systems - AMD


Radix Sort @ GPGPU

Revisiting Sorting for GPGPU Stream Architectures (Duane Merrill, Andrew Grimshaw)

[Pipeline diagram: one radix-sort pass consists of a bottom-level reduction (each core bins its keys and accumulates per-digit counts), a top-level scan across the per-core counts, and a bottom-level scan phase in which each core bins its keys again, scans locally, and scatters them to the output; the figure shows this for cores 0-3]
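A strongly simplified CUDA sketch of the "binning" step of one radix-sort pass (4-bit digits, one histogram per thread block held in local memory); the per-block histograms would then be prefix-scanned to obtain scatter offsets. The real Merrill/Grimshaw kernels are far more elaborate, so this only illustrates the structure, and all names are assumptions.

// One 4-bit digit is histogrammed per pass; 'shift' selects the digit (0, 4, 8, ...).
__global__ void binningPass(const unsigned int *keys, int n, int shift,
                            unsigned int *blockHist /* gridDim.x x 16 counters */) {
    __shared__ unsigned int hist[16];              // per-block digit counts
    if (threadIdx.x < 16) hist[threadIdx.x] = 0;
    __syncthreads();

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        unsigned int digit = (keys[i] >> shift) & 0xF;
        atomicAdd(&hist[digit], 1u);               // count in fast local memory
    }
    __syncthreads();

    if (threadIdx.x < 16)                          // publish this block's histogram
        blockHist[blockIdx.x * 16 + threadIdx.x] = hist[threadIdx.x];
}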

Page 33: GPU Processing in Database Systems - AMD


■ For GPU radix sort the PCIe transfer dominates running time

GPU sorting performance

Revisiting Sorting for GPGPU Stream Architectures (Duane Merrill, Andrew Grimshaw)

Faster Radix Sort via Virtual Memory and Write-Combining (Jan Wassenberg, Peter Sanders)

[Bar chart: sorting throughput in million items per second for GPU radix sort with PCIe transfer, GPU radix sort alone, and CPU radix sort, for 32-bit keys and for 32-bit keys with 32-bit values]

Page 34: GPU Processing in Database Systems - AMD


1. Split relations into blocks

2. Join smaller blocks in parallel

Example: Non-indexed Nested-Loop Join

[Diagram: relations R and S partitioned into blocks R' and S'; each pair of blocks (i, j) is joined by its own thread group, and threads 1…T share the work within a block pair]

Problem: to allocate memory for the result, the join has to be performed twice: once to count the result size and once to write it (see the two-pass sketch below).

Aka "fragment-and-replicate join" (symmetric/asymmetric). There is more: indexed NLJ, hash join, sort-merge join, shuffling rings, etc.
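A hedged CUDA sketch of the two-pass approach referred to above: pass 1 only counts matches per outer tuple, the counts are turned into write offsets by a prefix scan (or on the host), and pass 2 re-evaluates the join and scatters the result pairs. The simple integer-key schema and all names are assumptions for illustration.

// Pass 1: each thread scans S for its R tuple and counts the matches.
__global__ void nljCount(const int *rKeys, int rN, const int *sKeys, int sN,
                         int *matchCount) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= rN) return;
    int cnt = 0;
    for (int j = 0; j < sN; ++j)
        if (rKeys[i] == sKeys[j]) ++cnt;
    matchCount[i] = cnt;
}

// Pass 2: writeOffset is the exclusive prefix sum of matchCount; the join is
// evaluated again and each match is scattered to its precomputed position.
__global__ void nljWrite(const int *rKeys, int rN, const int *sKeys, int sN,
                         const int *writeOffset, int2 *result) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= rN) return;
    int pos = writeOffset[i];
    for (int j = 0; j < sN; ++j)
        if (rKeys[i] == sKeys[j])
            result[pos++] = make_int2(i, j);   // (R row, S row) pair
}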

Page 35: GPU Processing in Database Systems - AMD


■ Brent-Kung circuit strategy
■ Only the upsweep phase is necessary because only the final result is needed
■ Permutation of elements to minimize memory bank conflicts
■ Separate thread group to combine the results of blocks (see the reduction sketch below)

Example: Aggregation

[Diagram: Brent-Kung upsweep over 16 inputs x0…x15]
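A minimal CUDA sketch of the aggregation pattern described above (SUM as the aggregate, 256-thread blocks; all names are assumptions): each block reduces its tile in local memory using only an upsweep, and a second, tiny launch or thread group combines the per-block partial sums.

// Launch with 256-thread blocks; requires blockDim.x == 256 (a power of two).
__global__ void blockAggregate(const int *in, int *blockSums, int n) {
    __shared__ int buf[256];                          // tile in local memory
    int gid = blockIdx.x * blockDim.x + threadIdx.x;
    buf[threadIdx.x] = (gid < n) ? in[gid] : 0;       // 0 = identity for SUM
    __syncthreads();

    // Upsweep only (Brent-Kung style): pairwise combine with growing stride.
    for (int stride = 1; stride < blockDim.x; stride *= 2) {
        int idx = (threadIdx.x + 1) * 2 * stride - 1;
        if (idx < blockDim.x) buf[idx] += buf[idx - stride];
        __syncthreads();
    }
    if (threadIdx.x == 0)
        blockSums[blockIdx.x] = buf[blockDim.x - 1];  // tile total
}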

Page 36: GPU Processing in Database Systems - AMD


■ Map/Reduce

■ Regular Expression Matching

■ K-Means Clustering

■ Apriori (frequent itemset mining)

■ Exact String matching

■ …

There is more...

Page 37: GPU Processing in Database Systems - AMD


■ J. Rao, K. A. Ross: Cache Conscious Indexing for Decision-Support in Main Memory, In Proceedings of the 25th International Conference on Very Large Data Bases (VLDB '99), Malcolm P. Atkinson, Maria E. Orlowska, Patrick Valduriez, Stanley B. Zdonik, and Michael L. Brodie (Eds.). Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 78-89

■ C. Kim, J. Chhugani, N. Satish, E. Sedlar, A. D. Nguyen, T. Kaldewey, V. W. Lee, S. Brandt, P. Dubey: FAST: fast architecture sensitive tree search on modern CPUs and GPUs, ACM SIGMOD ‘10, 339-350, 2010

■ P. B. Volk, D. Habich, W. Lehner: GPU-Based Speculative Query Processing for Database Operations, First International Workshop on Accelerating Data Management Systems Using Modern Processor and Storage Architectures In conjunction with VLDB 2010

References & Further Reading: Indexing

Page 38: GPU Processing in Database Systems - AMD


■ D. G. Merrill and A. S. Grimshaw: Revisiting sorting for GPGPU stream architectures. In Proceedings of the 19th international conference on Parallel architectures and compilation techniques (PACT '10). ACM, New York, NY, USA, 545-546

■ J. Wassenberg, P. Sanders: Faster Radix Sort via Virtual Memory and Write-Combining, CoRR 2010, http://arxiv.org/abs/1008.2849 (Visited May 2011)

■ N. Satish, C. Kim, J. Chhugani, A. D. Nguyen, V. W. Lee, D. Kim, and P. Dubey: Fast sort on CPUs and GPUs: a case for bandwidth oblivious SIMD sort. In Proceedings of the 2010 international conference on Management of data (SIGMOD '10). ACM, New York, NY, USA, 351-362.

References & Further Reading: Sorting

Page 39: GPU Processing in Database Systems - AMD


■ B. He, K. Yang, R. Fang, M. Lu, N. Govindaraju, Q. Luo, P. Sander: Relational joins on graphics processors. In Proceedings of the 2008 ACM SIGMOD international conference on Management of data (SIGMOD '08). ACM, New York, NY, USA, 511-524.

■ D. Merrill, A. Grimshaw: Parallel Scan for Stream Architectures. Technical Report CS2009-14, Department of Computer Science, University of Virginia. December 2009.

References & Further Reading: Rel. Operators

Page 40: GPU Processing in Database Systems - AMD


■ N. Cascarano, P. Rolando, F. Risso, R. Sisto: iNFAnt: NFA Pattern Matching on GPGPU Devices. SIGCOMM Comput. Commun. Rev. 40, 5 20-26.

■ M.C. Schatz, C. Trapnell: Fast Exact String Matching on the GPU, Technical Report.

■ W. Fang, K. K. Lau, M. Lu, X. Xiao, C. K. Lam, P. Y. Yang, B. He, Q. Luo, P. V. Sander, K. Yang: Parallel Data Mining on Graphics Processors, Technical Report HKUST-CS08-07, Oct 2008.

■ B. He, W. Fang, Q. Luo, N. K. Govindaraju, and T. Wang. 2008. Mars: a MapReduce framework on graphics processors. In Proceedings of the 17th international conference on Parallel architectures and compilation techniques (PACT '08). ACM, New York, NY, USA, 260-269.

■ many more …

References & Further Reading: Further Ops

Page 41: GPU Processing in Database Systems - AMD


■ Database Systems Background
■ Overview of GPU Technology
■ Architecture of a Hybrid CPU/GPU System
■ Database Operations on the GPU
■ Alternative Architectures
■ Research Challenges of Hybrid Architectures
■ Current GPU DBMS Systems

Agenda

Page 42: GPU Processing in Database Systems - AMD


A blast from the past: DIRECT

■ Specialized database processors already in 1978
■ Back then could not keep up with the rise of commodity CPUs
■ Today is different: memory wall, ILP wall, power wall, …

[Diagram of DIRECT: a host and a controller in front of several query processors, each with its own memory module, attached to mass storage]

DIRECT ‐ a multiprocessor organization for supporting relational data base management systems

David J. DeWitt

Page 43: GPU Processing in Database Systems - AMD


■ Massively parallel & reconfigurable processors
■ Lower clock speeds, but configuration for specific tasks makes up for this
■ Very difficult to program
■ Used in real-world appliances: Kickfire/Teradata, Netezza/IBM, …

FPGAs

FPGA: what's in it for a database?

Jens Teubner and Rene Mueller

[Diagram: a system combining a CPU and RAM with an FPGA, network, and HDD]

Page 44: GPU Processing in Database Systems - AMD


■ CPU & GPU integrated on one die, shared memory
■ Uses the GPU programming model (OpenCL)
■ No PCIe bottleneck, but only CPU-like memory bandwidth

CPU/GPU hybrid (AMD Fusion)

[Diagram: x86 cores and GPU SIMD cores (arrays of ALUs with local memories) sharing one memory controller and the same RAM]

AMD Whitepaper: AMD Fusion Family of APUs

Page 45: GPU Processing in Database Systems - AMD


■ Network processors
□ Many-core architectures
□ Associative memory (constant-time search)

■ CELL
□ Similar to a CPU/GPU hybrid: CPU core + several wide SIMD cores with local scratchpad memories

There is more...

Page 46: GPU Processing in Database Systems - AMD


■ Similar approaches
□ Many-core, massively parallel
□ Distributed on-chip memory
□ Hope for better perf/$ and perf/watt than traditional CPUs

■ Similar difficulties
□ Memory bandwidth always a bottleneck
□ Hard to program
− parallel
− synchronization
− low-level & architecture-specific

Common goals, common problems

Page 47: GPU Processing in Database Systems - AMD


■ D. J. DeWitt: DIRECT - a multiprocessor organization for supporting relational data base management systems. In Proceedings of the 5th annual symposium on Computer architecture (ISCA '78). ACM, New York, NY, USA, 182-189.

■ R. Mueller and J. Teubner: FPGA: what's in it for a database?. In Proceedings of the 35th SIGMOD international conference on Management of data (SIGMOD '09), Carsten Binnig and Benoit Dageville (Eds.). ACM, New York, NY, USA, 999-1004.

■ AMD White paper: AMD Fusion family of APUs, http://sites.amd.com/us/Documents/48423B_fusion_whitepaper_WEB.pdf (Visited May 2011)

■ N. Bandi, A. Metwally, D. Agrawal, and A. El Abbadi: Fast data stream algorithms using associative memories. In Proceedings of the 2007 ACM SIGMOD international conference on Management of data (SIGMOD '07). ACM, New York, NY, USA, 247-256.

■ B. Gedik, P. S. Yu, and R. R. Bordawekar: Executing stream joins on the cell processor. In Proceedings of the 33rd international conference on Very large data bases (VLDB '07). VLDB Endowment 363-374.

References & Further Reading

Page 48: GPU Processing in Database Systems - AMD


■ Database Systems Background
■ Overview of GPU Technology
■ Architecture of a Hybrid CPU/GPU System
■ Database Operations on the GPU
■ Alternative Architectures
■ Research Challenges of Hybrid Architectures
■ Current GPU DBMS Systems

Agenda

Page 49: GPU Processing in Database Systems - AMD


Architectural constraints

■ PCIe bottleneck
□ Direct access to storage devices, network, etc.?
□ Caching strategies for device memory

■ GPU memory size
□ Deeper memory hierarchy (e.g., a few GB of fast GDDR + large "slow" DRAM)?

Page 50: GPU Processing in Database Systems - AMD


■ Want forward scalability: performance should scale with the next generations of GPUs
□ Existing work often optimized for exactly one type of GPU chip

■ Need higher-level programming models
□ Hide hardware details (processor count, SIMD width, local memory size, …)

■ Cost estimation of operators

Performance portability challenges

Page 51: GPU Processing in Database Systems - AMD


■ Big data volume and limited memory

■ Execution Plan consists of multiple operators

■ Where to execute each operator?

□ Trade off CPU-GPU transfer time against the computational advantage (a toy placement heuristic is sketched below)

□ Cost Models for GPGPU, CPU and hybrid

□ Amdahl’s law

Database-specific challenges
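A toy host-side sketch of the placement trade-off mentioned above (the PCIe figure and all names are assumptions, not measurements): offload an operator only if its estimated GPU time plus the transfer time undercuts its estimated CPU time.

// Per-operator cost estimates coming from CPU and GPU cost models.
struct OperatorEstimate {
    double bytesToTransfer;   // input + output that must cross PCIe
    double cpuSeconds;        // estimated CPU execution time
    double gpuSeconds;        // estimated GPU execution time, data already resident
};

// Decide where to run one operator; ~6 GB/s is an assumed effective PCIe rate.
bool runOnGpu(const OperatorEstimate &op, double pcieBytesPerSec = 6e9) {
    double transferSeconds = op.bytesToTransfer / pcieBytesPerSec;
    return op.gpuSeconds + transferSeconds < op.cpuSeconds;
}

A real optimizer would additionally account for data already cached in device memory and for whole plan fragments, since the serial CPU parts bound the overall speedup (Amdahl's law).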

Page 52: GPU Processing in Database Systems - AMD


■ Database Systems Background
■ Overview of GPU Technology
■ Architecture of a Hybrid CPU/GPU System
■ Database Operations on the GPU
■ Alternative Architectures
■ Research Challenges of Hybrid Architectures
■ Current GPU DBMS Systems
□ MonetDB/CUDA
□ ParStream

Agenda

Page 53: GPU Processing in Database Systems - AMD


■ Joint Project of TU Berlin and CWI

■ Goal: investigate GPGPU in the context of database systems
□ We extend the open-source database system MonetDB using OpenCL

■ Why MonetDB?
□ Open source
□ Column-oriented in-memory database system

■ Why OpenCL?
□ Open standard with support from many big players
□ Not restricted to graphics cards

General Overview: Hybrid MonetDB

Page 54: GPU Processing in Database Systems - AMD


■ Proof-of-concept implementation
□ Supports only fixed-size columns fitting into device memory
□ Implemented operators
− Table Scan
− Nested-Loop Join
− Aggregation
− Group By

■ Implementation consists of:
□ OpenCL context handling
□ GPU memory management
□ Integration into the MonetDB Assembly Language (MAL)
□ Support for multiple GPUs

Hybrid MonetDB System Outline

Page 55: GPU Processing in Database Systems - AMD


ParStream High Performance Indexing

[Diagram: building blocks labeled Big Data, Partition, Compression, Index, Parallel Execution]

Page 56: GPU Processing in Database Systems - AMD


ParStream combines state-of-the-art database technologies with unique technologies:

□ Column Store: fast data access for analytical processing
□ Custom Query Operators: SQL, JDBC and a powerful C++ API enable fast query processing
□ In-Memory Technology: data and indices can stay in memory due to efficient compression
□ High Performance Index: unique index structure allows highly parallel execution of queries (patent pending, high USP)

ParStream – Building Blocks

Page 57: GPU Processing in Database Systems - AMD


■ S. Hong and H. Kim: An analytical model for a GPU architecture with memory-level and thread-level parallelism awareness. SIGARCH Comput. Archit. News 37, 3 (June 2009), 152-163.

■ L. Bic and R. L. Hartmann: AGM: a dataflow database machine. ACM Trans. Database Syst. 14, 1 (March 1989), 114-146.

■ H. Chafi, A. K. Sujeeth, K. J. Brown, H. Lee, A. R. Atreya, and K. Olukotun: A domain-specific approach to heterogeneous parallelism. In Proceedings of the 16th ACM symposium on Principles and practice of parallel programming (PPoPP '11). ACM, New York, NY, USA, 35-46.

■ http://www.dima.tu-berlin.de
■ http://www.monetdb.nl
■ http://www.parstream.com

References & Further Reading

Page 58: GPU Processing in Database Systems - AMD


Thank You

Merci, Grazie, Gracias, Obrigado, Danke, … (thank you in Japanese, English, French, Russian, German, Italian, Spanish, Brazilian Portuguese, Arabic, Traditional Chinese, Simplified Chinese, Hindi, Tamil, Thai, Korean)

Page 59: GPU Processing in Database Systems - AMD


■ Disclaimer & Attribution
■ The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions and typographical errors.
■ The information contained herein is subject to change and may be rendered inaccurate for many reasons, including but not limited to product and roadmap changes, component and motherboard version changes, new model and/or product releases, product differences between differing manufacturers, software changes, BIOS flashes, firmware upgrades, or the like. There is no obligation to update or otherwise correct or revise this information. However, we reserve the right to revise this information and to make changes from time to time to the content hereof without obligation to notify any person of such revisions or changes.

■ NO REPRESENTATIONS OR WARRANTIES ARE MADE WITH RESPECT TO THE CONTENTS HEREOF AND NO RESPONSIBILITY IS ASSUMED FOR ANY INACCURACIES, ERRORS OR OMISSIONS THAT MAY APPEAR IN THIS INFORMATION.

■ ALL IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE ARE EXPRESSLY DISCLAIMED. IN NO EVENT WILL ANY LIABILITY TO ANY PERSON BE INCURRED FOR ANY DIRECT, INDIRECT, SPECIAL OR OTHER CONSEQUENTIAL DAMAGES ARISING FROM THE USE OF ANY INFORMATION CONTAINED HEREIN, EVEN IF EXPRESSLY ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.

■ AMD, the AMD arrow logo, and combinations thereof are trademarks of Advanced Micro Devices, Inc. All other names used in this presentation are for informational purposes only and may be trademarks of their respective owners.

■ The contents of this presentation were provided by individual(s) and/or company listed on the title page. The information and opinions presented in this presentation may not represent AMD’s positions, strategies or opinions. Unless explicitly stated, AMD is not responsible for the content herein and no endorsements are implied.