Sort vs. Hash Revisited: Fast Join Implementation on Modern Multi-Core CPUs

Changkyu Kim¹, Eric Sedlar², Jatin Chhugani¹, Tim Kaldewey², Anthony D. Nguyen¹, Andrea Di Blas², Victor W. Lee¹, Nadathur Satish¹ and Pradeep Dubey¹

1. Throughput Computing Lab, Intel Corporation
2. Special Projects Group, Oracle Corporation




Introduction

• Join is a widely used database operation

• Two common join algorithms
  – Hash join, sort-merge join

• Increasing memory capacity and faster I/O
  – Emphasizes the importance of exploiting modern CPU features (multi-core + SIMD)
  – 4-core CPUs with 128-bit wide SIMD are commonplace

• Revisit these two join algorithms in the context of modern CPUs and compare them w.r.t. future CPU trends


Summary

• In-memory hash join that exploits features of modern CPUs
  – Fastest reported hash join performance
  – Stable across a wide range of input sizes and data skews

• An analytical model to estimate compute and bandwidth usage
  – Compare hash join and sort-merge join and project future performance

• Sort-merge join has better scaling potential on future CPUs
  – Wider SIMD, limited per-core memory bandwidth


Outline

• CPU Architecture Trends and Related Work

• Hash Join for Modern CPUs

• Exploiting Parallelism in Hash Join

• Sort-merge join for Modern CPUs

• Results

• Conclusions



CPU Architecture Trends and Optimization

• Memory subsystem
  – Latency: cache, TLB, prefetch
  – Bandwidth: cache

• Thread-level parallelism (TLP)
  – Increasing number of cores and threads
  – Memory bandwidth

• Data-level parallelism (DLP)
  – SIMD execution units perform one operation on multiple data elements
  – Currently 128-bit wide (SSE) and growing wider (256-bit: Intel AVX, 512-bit: Intel Larrabee)


Related Work

• Hash join
  – Grace hash join [Kitsuregawa et al.], hybrid hash join [Zeller et al.]
  – Partitioned hash [Shatdal et al.]
  – Radix-clustering [Manegold et al.]

• Join on new architectures
  – Cell [Gedik et al.]
  – GPU [He et al.]

• Various database operations on CMP
  – Aggregation [Cieslewicz et al.]
  – Sort [Chhugani et al.]

• Sort versus Hash [Graefe et al.]


Outline

• CPU Architecture Trends and Related Work

• Hash Join for Modern CPUs

• Exploiting Parallelism in Hash Join

• Sort-merge join for Modern CPUs

• Results

• Conclusions


Hash Join

• Basic operation
  Q: SELECT … FROM R, S WHERE R.key = S.key
  – Build the hash table with keys of the inner relation (S)
  – Iterate over the outer tuples (R) and find matching keys in the hash table
  – Append matching tuples to the output table

• Challenges of operating on a large hash table
  – Cache-line-unfriendly access
    • Wastes memory bandwidth
  – TLB-unfriendly access
    • Increases memory latency

=> Can we make the problem compute bound?
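The basic build/probe operation above can be sketched in a few lines. This is a minimal illustrative version of a hash join on (key, rid) pairs, not the cache-partitioned implementation the later slides describe; the relation names R and S follow the slide, everything else is an assumption.

```python
# Minimal sketch of the basic hash join on (key, rid) pairs:
# build on the inner relation S, probe with the outer relation R.
from collections import defaultdict

def hash_join(R, S):
    # Build phase: hash table maps key -> list of inner rids
    table = defaultdict(list)
    for key, rid_s in S:
        table[key].append(rid_s)
    # Probe phase: look up each outer tuple and emit all matches
    out = []
    for key, rid_r in R:
        for rid_s in table.get(key, []):
            out.append((key, rid_r, rid_s))
    return out

R = [(1, 0), (2, 1), (3, 2)]
S = [(2, 0), (2, 1), (5, 2)]
print(hash_join(R, S))  # [(2, 1, 0), (2, 1, 1)] -- key 2 matches twice
```

With a hash table much larger than the caches, the random lookups in the probe loop are exactly the cache- and TLB-unfriendly accesses the slide warns about.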


Overview

• Perform join operations inside caches

• Partition the two input tables into sub-tables (partition phase)
  – Partition so that each sub-table pair (Rn, Sn) fits in the L2 cache

• Join the corresponding sub-tables (join phase)
  – Build the hash table from the inner sub-table (build phase)
  – Probe the outer sub-table against the hash table (probe phase)

[Figure: relations R and S, each a table of (key, rid) pairs, are partitioned into sub-tables R1…Rn and S1…Sn; each corresponding pair (Ri, Si) is joined independently.]


Handling Various Sizes of Keys/Payloads

• Prolog
  – Assumes the # of distinct keys and records is less than 4 billion (2^32)
  – Read keys and compact them into (4B partial key, 4B record id) pairs
  – Use a hash function to generate partial keys

• Join with (4B key, 4B rid) tuples

• Epilog
  – Read each record-id pair (Rid1, Rid2) and access the tuples
  – Compare full keys to reject false positives
  – Write the output result (key, payload1, payload2)
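The prolog/epilog idea can be sketched as follows: join on 4-byte partial keys derived by a hash function, then re-check the full keys in the epilog to reject false positives. The hash choice, record layout, and sample data here are illustrative assumptions; a nested loop stands in for the real in-cache join.

```python
# Sketch of the prolog/epilog around a join on 4-byte partial keys.

def partial_key(full_key):
    # Any hash that folds the key into 32 bits works; this is one choice.
    return hash(full_key) & 0xFFFFFFFF

# Prolog: compact (full key, payload) records into (partial key, rid) pairs
records_r = [("alpha", "p0"), ("beta", "p1")]
records_s = [("beta", "q0"), ("gamma", "q1")]
R = [(partial_key(k), rid) for rid, (k, _) in enumerate(records_r)]
S = [(partial_key(k), rid) for rid, (k, _) in enumerate(records_s)]

# Join on the 4B partial keys (nested loop stands in for the real join)
candidates = [(r_rid, s_rid) for pk_r, r_rid in R
              for pk_s, s_rid in S if pk_r == pk_s]

# Epilog: fetch the tuples, compare full keys, emit (key, payload1, payload2)
output = []
for r_rid, s_rid in candidates:
    k_r, p_r = records_r[r_rid]
    k_s, p_s = records_s[s_rid]
    if k_r == k_s:  # reject hash collisions (false positives)
        output.append((k_r, p_r, p_s))
print(output)  # [('beta', 'p1', 'q0')]
```

Note that even if two distinct keys collided on their partial keys, the full-key comparison in the epilog filters the spurious pair out, so the output is unaffected.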


Partition Phase

• Step P1: build a histogram
• Step P2: perform a prefix sum to compute the scatter addresses
• Step P3: permute (scatter) the tuples

• How many partitions?
  – Level 1: 64-entry fully associative TLB
  – Level 2: 512-entry 4-way set associative TLB
  – TLB misses slow down the partition phase when more than 128 partitions are used
    => Multi-pass partitioning is necessary

[Figure: a four-tuple example. Step P1 builds a histogram over the key radix, Step P2 prefix-sums it into scatter offsets, and Step P3 permutes the (key, rid) tuples into their partitions.]
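The three steps P1–P3 can be sketched directly. The 2-bit radix, the tiny input, and the partitioning function are illustrative assumptions; the real implementation chooses the radix to respect the TLB limits above.

```python
# Sketch of the partition phase on (key, rid) tuples:
# P1 histogram, P2 exclusive prefix sum, P3 scatter.

def partition(tuples, num_parts, part_of):
    # Step P1: count tuples per partition
    hist = [0] * num_parts
    for key, _ in tuples:
        hist[part_of(key)] += 1
    # Step P2: exclusive prefix sum gives each partition's start offset
    offsets = [0] * num_parts
    for p in range(1, num_parts):
        offsets[p] = offsets[p - 1] + hist[p - 1]
    # Step P3: scatter each tuple to its partition's next free slot
    out = [None] * len(tuples)
    cursor = list(offsets)
    for t in tuples:
        p = part_of(t[0])
        out[cursor[p]] = t
        cursor[p] += 1
    return out, offsets

tuples = [(0b1100, 0), (0b1001, 1), (0b0010, 2), (0b0100, 3)]
out, offsets = partition(tuples, 4, lambda k: k & 0b11)  # low 2 bits
print(out)      # tuples grouped by partition
print(offsets)  # [0, 2, 3, 4]
```

A multi-pass variant simply reapplies the same three steps within each partition using the next group of radix bits.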


Build Phase

• Build the hash table from the inner sub-table

• Collided/duplicate elements are placed consecutively in the hash table
  – Speeds up comparing join keys in the probe phase

• Similar to the partition phase

[Figure: the inner sub-table of (key, rid) tuples is run through the hash function; the resulting histogram Hist[j] is prefix-summed, and the tuples are permuted so that each hash bucket's entries are contiguous.]


Probe Phase

• Probe the outer sub-table against the hash table of the inner sub-table
  – For each outer tuple, apply the hash function to find the histogram index j
  – Compare the outer key with the inner keys between Hist[j] and Hist[j+1]
  – Duplicated/collided keys are potential matches
  – When matching keys are found, append the tuples to the output table

[Figure: each outer (key, rid) tuple is hashed to a histogram index j; the probe compares it against the permuted inner tuples in positions Hist[j] to Hist[j+1].]
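The build and probe phases together can be sketched as follows: the build step permutes the inner sub-table so tuples in the same hash bucket sit consecutively, and the probe step scans only the range Hist[j]..Hist[j+1] for each outer tuple. The 4-bucket modulo hash and the sample data are illustrative assumptions.

```python
# Sketch of the build phase (histogram + permute, as in partitioning)
# and the probe phase over contiguous bucket ranges.

def build(inner, num_buckets, h):
    # Histogram shifted by one so the prefix sum yields bucket starts
    hist = [0] * (num_buckets + 1)
    for key, _ in inner:
        hist[h(key) + 1] += 1
    for j in range(1, num_buckets + 1):
        hist[j] += hist[j - 1]          # hist[j] = start of bucket j
    permuted = [None] * len(inner)
    cursor = list(hist[:num_buckets])
    for t in inner:
        b = h(t[0])
        permuted[cursor[b]] = t
        cursor[b] += 1
    return permuted, hist

def probe(outer, permuted, hist, h):
    out = []
    for key, rid_o in outer:
        j = h(key)
        for k in range(hist[j], hist[j + 1]):  # candidates share bucket j
            ikey, rid_i = permuted[k]
            if ikey == key:                    # reject collisions
                out.append((key, rid_o, rid_i))
    return out

h = lambda k: k % 4
inner = [(12, 0), (9, 1), (2, 2), (4, 3)]
outer = [(12, 0), (0, 1), (14, 2), (1, 3)]
permuted, hist = build(inner, 4, h)
print(probe(outer, permuted, hist, h))  # [(12, 0, 0)]
```

Because each bucket's entries are contiguous, the inner scan touches one cache-resident range instead of chasing pointers, which is the point of the consecutive layout.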


Outline

• CPU Architecture Trends and Related Work

• Hash Join for Modern CPUs

• Exploiting Parallelism in Hash Join– Exploiting TLP– Exploiting DLP

• Sort-merge join for Modern CPUs

• Results

• Conclusions


Parallelizing Partition Phase

• Equally divide the input tuples among threads; each thread maintains a local histogram

• Explicit barrier at the end of each partition step

• Use task queueing for dynamic load balancing
  – An idle thread steals tasks from other threads

[Figure: two threads build local histograms over their halves of the input (Step 1), combine them via a prefix sum (Step 2), and scatter their tuples into disjoint regions of partitions P0 and P1 (Step 3), with a barrier after each step.]
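The parallel scheme above can be sketched deterministically: each "thread" is modeled as a chunk of the input processed in turn, the global prefix sum runs over (partition, thread) pairs so every thread gets disjoint write regions, and the barriers between steps correspond to the sequential loops here. Thread count, partition function, and data are illustrative.

```python
# Sketch of the parallelized partition phase with local histograms:
# because the prefix sum assigns disjoint output regions per
# (partition, thread) pair, the final scatter needs no locks.

def parallel_partition(tuples, num_threads, num_parts, part_of):
    chunk = (len(tuples) + num_threads - 1) // num_threads
    chunks = [tuples[i * chunk:(i + 1) * chunk] for i in range(num_threads)]
    # Step 1: each thread builds a local histogram over its chunk
    local = [[0] * num_parts for _ in range(num_threads)]
    for t_id, ch in enumerate(chunks):
        for key, _ in ch:
            local[t_id][part_of(key)] += 1
    # (barrier)
    # Step 2: prefix sum in (partition, thread) order -> write offsets
    offset = [[0] * num_parts for _ in range(num_threads)]
    total = 0
    for p in range(num_parts):
        for t_id in range(num_threads):
            offset[t_id][p] = total
            total += local[t_id][p]
    # (barrier)
    # Step 3: each thread scatters its chunk into its own regions
    out = [None] * len(tuples)
    for t_id, ch in enumerate(chunks):
        cursor = list(offset[t_id])
        for t in ch:
            p = part_of(t[0])
            out[cursor[p]] = t
            cursor[p] += 1
    return out

tuples = [(3, 0), (2, 1), (0, 2), (1, 3)]
print(parallel_partition(tuples, 2, 2, lambda k: k & 1))
```

In the real implementation the three loops over `t_id` run concurrently, one thread each, with an explicit barrier replacing each comment.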


Parallelizing the join phase

• Input distribution can cause severe load imbalance in this phase
  => We propose a three-phase fine-grained parallelization

– Phase I
  • Each thread picks a pair of sub-tables and performs the join independently
  • Sub-tables larger than a threshold are handled in the next phase

– Phase II
  • All threads work together to join a pair of “large” sub-tables
  • Build the hash table in parallel
  • Equally divide the outer sub-table and probe in parallel
  • Tuples with a large number of potential matches are handled in the next phase

– Phase III
  • All threads work together to compare a key against its potentially matching inner tuples
  • Used when data are heavily skewed (Zipf distribution)
  • Equally divide the potential matches among threads and compare in parallel
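The routing between phases is a simple size threshold. This tiny sketch shows only that routing decision; the threshold value, pair sizes, and names are illustrative assumptions, not numbers from the paper.

```python
# Sketch of the size-threshold routing behind the three-phase scheme:
# small sub-table pairs go to phase I (one thread per pair); oversized
# pairs are deferred so all threads can cooperate on them in phase II.
THRESHOLD = 1000  # illustrative

pairs = [("p0", 120), ("p1", 5000), ("p2", 80), ("p3", 2400)]  # (id, size)
phase1 = [p for p in pairs if p[1] <= THRESHOLD]  # joined independently
phase2 = [p for p in pairs if p[1] > THRESHOLD]   # joined cooperatively
print(phase1, phase2)
```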


Outline

• CPU Architecture Trends and Related Work

• Hash Join for Modern CPUs

• Exploiting Parallelism in Hash Join– Exploiting TLP– Exploiting DLP

• Sort-merge join for Modern CPUs

• Results

• Conclusions


Data-level Parallelism

• Perform the same operation on K elements (K = 4 with current SSE)
  – Operate on K consecutive tuples simultaneously

• Challenges
  – Data gather
    • Pack elements together from K distinct memory locations
    • Compare K join keys to check for matches
  – Data scatter
    • Write elements to non-contiguous memory locations
    • Permute K tuples in the partition phase
  – SIMD update conflict
    • Elements in different SIMD lanes are written to the same memory location
    • Histogram update

• Current CPU SSE architectures do not support these features
  – Negligible SIMD benefit for hash join


Outline

• CPU Architecture Trends and Related Work

• Hash Join for Modern CPUs

• Exploiting Parallelism in Hash Join– Exploiting TLP– Exploiting DLP

• Sort-merge join for Modern CPUs

• Results

• Conclusions


Sort-merge Join

• Based on the merge sort of Chhugani et al. [VLDB’08]
  – Extended to handle (key, rid) pairs

• Exploiting DLP
  – Use a bitonic merge network
  – Good SIMD benefit

• Exploiting TLP
  – Perform a parallel merge

• Bandwidth-friendly multiway merge
  – Data is read from / written to main memory only once
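The bitonic merge network at the heart of the SIMD sort can be sketched in scalar form: two sorted runs of 4 (key, rid) pairs are merged with log2(8) = 3 levels of compare-exchange, where each level maps to SIMD min/max plus shuffle instructions in the real implementation. The run length of 4 matches the 4-wide SSE case; the data is illustrative.

```python
# Scalar sketch of a 4-wide bitonic merge network.

def bitonic_merge_4x4(a, b):
    # Reversing b makes a + reversed(b) a bitonic sequence
    seq = a + b[::-1]
    for stride in (4, 2, 1):          # three compare-exchange levels
        for i in range(8):
            j = i + stride
            # Compare-exchange the pairs that this level connects
            if i % (stride * 2) < stride and j < 8 and seq[i] > seq[j]:
                seq[i], seq[j] = seq[j], seq[i]
    return seq

a = [(1, 0), (4, 1), (6, 2), (9, 3)]
b = [(2, 0), (3, 1), (7, 2), (8, 3)]
print(bitonic_merge_4x4(a, b))  # the 8 pairs in sorted key order
```

The fixed, data-independent compare pattern is what makes this network SIMD-friendly: every level does the same min/max on all lanes, with no branches.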


Experimental Setup

• Tuples are (4B key, 4B rid) pairs

• Vary the number of tuples from 64K to 128M

• Various distributions
  – Change the percentage of tuples with matches
  – Change cardinalities
  – Use heavily skewed data (Zipf distribution)

• Intel Core i7
  – 3.2 GHz, quad-core, 2 SMT threads per core
  – 32KB L1, 256KB L2, 6MB L3
  – 64-entry L1 TLB, 512-entry L2 TLB

• Metric: cycles per tuple
  – 32 cycles per tuple = 100M tuples per second at 3.2 GHz
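The conversion behind the metric is one division:

```python
# At 3.2 GHz, 32 cycles per tuple corresponds to
# 3.2e9 / 32 = 100M tuples per second.
clock_hz = 3.2e9
cycles_per_tuple = 32
tuples_per_second = clock_hz / cycles_per_tuple
print(tuples_per_second)  # 100000000.0
```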


Benefit of Partitioning

• 128M tuples, uniform distribution
• The 16K-partition configuration performs best
  – 7X better than a non-partitioned join
• Fastest reported hash join performance
  – 100M tuples per second
  – 8X faster than the best reported GPU (8800GTX) number [SIGMOD’08, He et al.]

[Chart: cycles per tuple vs. number of partitions, for 1-pass, 2-pass, and 3-pass partitioning.]


Various Input Distributions

=> Stable across various input distributions

[Charts: cycles per tuple vs. input size (64K to 64M), percentage of matched tuples (0 to 100), cardinality (0 to 120M), and θ value of the Zipf distribution (0 to 1); performance stays roughly flat in all four.]


Current Sort vs. Hash

• Sort-merge join
  – 2.3X SIMD speedup
  – 3.6X parallel speedup

• Fastest reported (key, rid) sorting performance
  – 1.25 (cycles per element per iteration) × N log N (N: # of elements)
  – 50M tuples per second (two tables of 128M tuples)

• Hash join is still up to 2X faster than sort-merge join

[Chart: cycles per tuple vs. number of tuples in a relation (64K to 128M) for sort join without SSE, sort join with SSE, and hash join.]


Future Sort vs. Hash (Projected)

• Future architecture trends favor sort-merge join
  – Wider SIMD (256-bit, then 512-bit)
  – Limited per-core memory bandwidth
    • A merge-sort-based join makes fewer memory accesses than a hash-based join

[Chart: projected cycles per tuple vs. number of tuples in a relation (64K to 128M) for hash join and for sort join with 128-bit, 256-bit, and 512-bit SIMD.]


Conclusions

• Optimized hash join and sort-merge join for modern CPUs
  – Exploit TLP and DLP

• Fastest reported hash join and sort-merge join performance

• Future architecture trends show better performance potential for sort-merge join

• Ongoing: optimizing join algorithms for the upcoming Intel Larrabee architecture


Thank you!

Questions?