Fast Multi-Column Sorting in Main-Memory Column-Stores
Wenjian Xu†, Ziqiang Feng†, Eric Lo‡
†The Hong Kong Polytechnic University, ‡The Chinese University of Hong Kong
Background
Analytic database
Read-most queries
Main memory
Column store
Column compression
De-normalization
Sort
• Implementing SQL operators like
  • GROUP BY
  • ORDER BY
  • PARTITION BY
SIMD-Sort
256-bit SIMD register, 64-bit bank: 4x data parallelism
44-bit column:
0xBBB0000000F
0x22200000001
0x333000F0009
0x8000000000E
0x11000300001
0x1020FF00000
0x10800000090
0x1000200000E
…
Register contents (one 44-bit code per 64-bit bank):
0xBBB0000000F 0x22200000001 0x333000F0009 0x8000000000E
Bank size could be 8-bit, 16-bit, 32-bit, or 64-bit
SIMD-Sort
Column:
0xA000 0x1000 0x2000 0x7200 0x0000 0x020F 0x0800 0x0002 …
Register contents:
0xA000 0x1000 0x2000 0x7200 0x0000 0x020F 0x0800 0x0002 0xBB00 0x1C00 0x0022 0x7200 0x00F0 0xFFFF 0xBBCF 0x1000
44-bit column 16-bit bank
16x data parallelism
256-bit SIMD register
Parallelism degree depends on the code width of the column
Bank size could be 8-bit, 16-bit, 32-bit, or 64-bit
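The relationship between code width, bank size, and data parallelism on these two slides can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not the authors' implementation; the helper names `bank_bits` and `parallelism` are hypothetical.

```python
SIMD_REGISTER_BITS = 256          # register width shown on the slides
BANK_SIZES = (8, 16, 32, 64)      # bank sizes listed on the slides


def bank_bits(code_width: int) -> int:
    """Return the smallest listed bank size that can hold one code."""
    for bank in BANK_SIZES:
        if code_width <= bank:
            return bank
    raise ValueError(f"a {code_width}-bit code does not fit in any bank")


def parallelism(code_width: int) -> int:
    """Number of codes that fit in one SIMD register (hypothetical helper)."""
    return SIMD_REGISTER_BITS // bank_bits(code_width)


for width in (12, 20, 44):
    print(f"{width}-bit column: {bank_bits(width)}-bit bank, "
          f"{parallelism(width)}x parallelism")
```

A 44-bit column needs a 64-bit bank and gets 4x parallelism, while a column of at most 16 bits fits a 16-bit bank and gets 16x.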
Multi-Column Sorting
Example query:
SELECT ... FROM orders
ORDER BY order_date, retail_price
(ORDER BY on multiple attributes)
[Figure: TPC-H queries Q1, Q2, Q3, Q7, Q9, Q10, Q16, Q18; legend: Multi-Column Sorting, Scan+Lookup+Aggregation]
Multi-Column Sorting becomes the bottleneck
Widespread in workloads: 45% of TPC-H queries, 72% of TPC-DS queries
Our work: Optimizing Multi-Column Sorting
State-of-the-Art Implementation: Column-at-a-Time
Order by X, Y
Input:
oid         1        2        3        4        5        6        7
X (20-bit)  0xEEEEE  0x00000  0xEEEEE  0x00000  0xEEEEE  0x00000  0xEEEEE
Y (12-bit)  0xAAA    0xCCC    0xBBB    0xAAA    0xAAA    0xFFF    0xCCC
Round 1: SIMD-sort X with a 32-bit bank (8x parallelism)
oid         2        4        6        1        3        5        7
X           0x00000  0x00000  0x00000  0xEEEEE  0xEEEEE  0xEEEEE  0xEEEEE
LOOKUP: fetch Y in the new oid order
oid         2        4        6        1        3        5        7
Y           0xCCC    0xAAA    0xFFF    0xAAA    0xBBB    0xAAA    0xCCC
Round 2: SIMD-sort Y within each group of equal X values with a 16-bit bank (16x parallelism)
oid         4        2        6        1        5        3        7
Y           0xAAA    0xCCC    0xFFF    0xAAA    0xAAA    0xBBB    0xCCC
Can we do better?
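As a rough illustration of the column-at-a-time flow (sort X, look up Y, then sort Y inside each run of equal X values), here is a minimal Python sketch on the example data from the slides. Python's built-in `sorted` stands in for the SIMD sort, and the function name `column_at_a_time` is hypothetical.

```python
from itertools import groupby

# Example data from the slides; oids are 1-based.
X = [0xEEEEE, 0x00000, 0xEEEEE, 0x00000, 0xEEEEE, 0x00000, 0xEEEEE]
Y = [0xAAA, 0xCCC, 0xBBB, 0xAAA, 0xAAA, 0xFFF, 0xCCC]


def column_at_a_time(x, y):
    """Return oids ordered by (x, y), sorting one column per round."""
    oids = range(1, len(x) + 1)

    # Round 1: sort X, carrying the oids along (stand-in for SIMD-sort).
    round1 = sorted(zip(x, oids))

    # LOOKUP: fetch the Y value of each row in the new oid order.
    looked_up = [(xv, y[oid - 1], oid) for xv, oid in round1]

    # Round 2: sort Y separately inside each run of equal X values.
    result = []
    for _, group in groupby(looked_up, key=lambda row: row[0]):
        result.extend(oid for _, _, oid in sorted(group, key=lambda row: row[1]))
    return result


print(column_at_a_time(X, Y))  # [4, 2, 6, 1, 5, 3, 7]
```

The printed oid order matches the final order shown on the slides.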
Option 1: Stitch Together
Stitch X (20-bit) and Y (12-bit) into one 32-bit supercolumn, e.g. 0xEEEEE and 0xAAA become 0xEEEEEAAA
oid                   1           2           3           4           5           6           7
Supercolumn (32-bit)  0xEEEEEAAA  0x00000CCC  0xEEEEEBBB  0x00000AAA  0xEEEEEAAA  0x00000FFF  0xEEEEECCC
SIMD-sort the supercolumn with a 32-bit bank (8x parallelism)
oid                   4           2           6           1           5           3           7
Supercolumn (32-bit)  0x00000AAA  0x00000CCC  0x00000FFF  0xEEEEEAAA  0xEEEEEAAA  0xEEEEEBBB  0xEEEEECCC
Correctness proved!
Column-at-a-Time vs. Stitch X and Y:
• Save one LOOKUP operation
• Save one round of sorting
• Stitch overhead
WIN
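The stitching step can be sketched as a shift and a bitwise OR, followed by a single sort. The snippet below is a minimal illustration on the slides' example data, with Python's `sorted` standing in for the SIMD sort; the function name `stitch_and_sort` is hypothetical.

```python
# Example data from the slides; oids are 1-based.
X = [0xEEEEE, 0x00000, 0xEEEEE, 0x00000, 0xEEEEE, 0x00000, 0xEEEEE]  # 20-bit
Y = [0xAAA, 0xCCC, 0xBBB, 0xAAA, 0xAAA, 0xFFF, 0xCCC]                # 12-bit


def stitch_and_sort(x, y, y_bits):
    """Stitch x and y into one supercolumn and sort it in a single round."""
    # Put x in the high bits and y in the low bits, so that comparing the
    # stitched integers is the same as comparing (x, y) pairs.
    supercolumn = [(xv << y_bits) | yv for xv, yv in zip(x, y)]
    oids = range(1, len(x) + 1)
    order = [oid for _, oid in sorted(zip(supercolumn, oids))]
    return supercolumn, order


supercolumn, order = stitch_and_sort(X, Y, y_bits=12)
print([hex(v) for v in supercolumn[:3]])  # ['0xeeeeeaaa', '0xccc', '0xeeeeebbb']
print(order)                              # [4, 2, 6, 1, 5, 3, 7]
```

One sort on the 32-bit supercolumn yields the same oid order as the two rounds of column-at-a-time sorting.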
Is Stitch Together always good? Let’s consider another example.
Option 1: Stitch Together
X (24-bit)  0xEEEEEE  0x000000  0xEEEEEE  0x000000  0xEEEEEE  0x000000  0xEEEEEE
Y (20-bit)  0xAAAAA   0xCCCCC   0xAAAAA   0xCCCCC   0xCCCCC   0xAAAAA   0xCCCCC
Column-at-a-Time: X (24 bits) and Y (20 bits) are each SIMD-sorted with a 32-bit bank (8x parallelism), with a LOOKUP in between
Stitch X and Y: the supercolumn is 24 + 20 = 44 bits wide, so it needs a 64-bit bank (4x parallelism)
Supercolumn (44-bit)  0xEEEEEEAAAAA  0x000000CCCCC  0xEEEEEEAAAAA  0x000000CCCCC  0xEEEEEECCCCC  0x000000AAAAA  0xEEEEEECCCCC
Lower data parallelism: LOSE
Any alternatives other than stitching X and Y in this example?
Option 2: Bit Borrowing
X (24-bit)  0xEEEEEE  0x000000  0xEEEEEE  0x000000  0xEEEEEE  0x000000  0xEEEEEE
Y (20-bit)  0xAAAAA   0xCCCCC   0xAAAAA   0xCCCCC   0xCCCCC   0xAAAAA   0xCCCCC
Borrowing bits from Y to X: shift X left by 4 bits (<< 4) and append the top 4 bits of Y
Borrowed bits  A  C  A  C  C  A  C
X (28-bit)  0xEEEEEEA  0x000000C  0xEEEEEEA  0x000000C  0xEEEEEEC  0x000000A  0xEEEEEEC
Y (16-bit)  0xAAAA     0xCCCC     0xAAAA     0xCCCC     0xCCCC     0xAAAA     0xCCCC
Round 1: SIMD-sort X (28-bit) with a 32-bit bank (8x parallelism)
X sorted    0x000000A  0x000000C  0x000000C  0xEEEEEEA  0xEEEEEEA  0xEEEEEEC  0xEEEEEEC
LOOKUP
Round 2: SIMD-sort Y (16-bit) with a 16-bit bank (16x parallelism)
Comparison of bank sizes per round:
Column-at-a-Time: 32-bit bank (8x), then 32-bit bank (8x)
Option 1: Stitch Together: 64-bit bank (4x)
Option 2: Bit Borrowing: 32-bit bank (8x), then 16-bit bank (16x)
Improved parallelism
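Bit borrowing can be sketched with shifts and masks: the top bits of Y are appended to X, and the remaining low bits of Y form a narrower second column. The snippet below is a minimal illustration on this slide's example data, not the authors' implementation; the function name `borrow_bits` is hypothetical, and a plain Python sort on `(x, y)` tuples stands in for the two SIMD-sort rounds.

```python
X = [0xEEEEEE, 0x000000, 0xEEEEEE, 0x000000, 0xEEEEEE, 0x000000, 0xEEEEEE]  # 24-bit
Y = [0xAAAAA, 0xCCCCC, 0xAAAAA, 0xCCCCC, 0xCCCCC, 0xAAAAA, 0xCCCCC]         # 20-bit


def borrow_bits(x, y, y_bits, borrowed):
    """Move the top `borrowed` bits of each y value onto the end of x."""
    keep = y_bits - borrowed                      # bits left in y
    x_wide = [(xv << borrowed) | (yv >> keep) for xv, yv in zip(x, y)]
    y_narrow = [yv & ((1 << keep) - 1) for yv in y]
    return x_wide, y_narrow


x_wide, y_narrow = borrow_bits(X, Y, y_bits=20, borrowed=4)
print([hex(v) for v in sorted(x_wide)])
print(max(x_wide).bit_length(), max(y_narrow).bit_length())  # 28 16

# The row order is unchanged: sorting by the massaged columns gives the
# same order as sorting by the original (X, Y) pairs.
rows = range(len(X))
original_order = sorted(rows, key=lambda i: (X[i], Y[i]))
borrowed_order = sorted(rows, key=lambda i: (x_wide[i], y_narrow[i]))
print(original_order == borrowed_order)  # True
```

The first round still fits a 32-bit bank (28 bits), while the second round now fits a 16-bit bank instead of a 32-bit one.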
Optimal Plan
• Given 3 columns with 11-bit, 14-bit, and 21-bit to be sorted:
  • Stitch together?
  • Bit borrowing?
  • Split into more rounds?
• Num. of possible plans: 2^(11+14+21)
• In the paper:
  • Cost model
  • Plan enumeration and search
Experiments
• Setup: Intel Xeon E5 10-core and Intel i7 quad-core; AVX2 instruction set (256 bits)
• Data sets: TPC-H, TPC-H Skew, TPC-DS, real data (Airline Origin and Destination Survey)
Speedup over Column-at-a-Time
1.8X to 5.5X speedup
[Figure panels: TPC-H, TPC-H Skew, TPC-DS, Real Data]
Data Size Scalability
Linear data size scalability
Our solution for Multi-Column Sorting
Core/thread Scalability
Linear core/thread scalability
Our solution for Multi-Column Sorting
Summary
• First work to pinpoint and tackle the issue of multi-column sorting
• Our technique: manipulate the bits across input columns
• Up to 5.5X speedup in query execution
Thank you