Fast Multi-Column Sorting in Main-Memory Column-Stores. Wenjian Xu, Ziqiang Feng, Eric Lo. The Hong Kong Polytechnic University; The Chinese University of Hong Kong


Page 1: fast multi-column sorting in main-memory column-stores

Fast Multi-Column Sorting in Main-Memory Column-Stores

Wenjian Xu†, Ziqiang Feng†, Eric Lo‡

† The Hong Kong Polytechnic University
‡ The Chinese University of Hong Kong

Page 2: fast multi-column sorting in main-memory column-stores


Background

Analytic database

Read-mostly queries

Main memory

Column store

Column compression

De-normalization

Page 3: fast multi-column sorting in main-memory column-stores


Sort

• Implementing SQL operators like
  • GROUP BY
  • ORDER BY
  • PARTITION BY

Page 4: fast multi-column sorting in main-memory column-stores

SIMD-Sort

44-bit column:
0xBBB0000000F, 0x22200000001, 0x333000F0009, 0x8000000000E, 0x11000300001, 0x1020FF00000, 0x10800000090, 0x1000200000E

256-bit SIMD register, 64-bit banks:
0xBBB0000000F | 0x22200000001 | 0x333000F0009 | 0x8000000000E

4x data parallelism

Bank size could be 8-bit, 16-bit, 32-bit, or 64-bit

Page 5: fast multi-column sorting in main-memory column-stores

SIMD-Sort

16-bit column:
0xA000, 0x1000, 0x2000, 0x7200, 0x0000, 0x020F, 0x0800, 0x0002

256-bit SIMD register, 16-bit banks:
0xA000 | 0x1000 | 0x2000 | 0x7200 | 0x0000 | 0x020F | 0x0800 | 0x0002 | 0xBB00 | 0x1C00 | 0x0022 | 0x7200 | 0x00F0 | 0xFFFF | 0xBBCF | 0x1000

16x data parallelism

Parallelism degree depends on the code width of the column

Bank size could be 8-bit, 16-bit, 32-bit, or 64-bit
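The relationship between code width, bank size, and parallelism on the two SIMD-Sort slides can be sketched as follows. This is a minimal illustration, assuming that a column is always placed in the smallest bank that can hold its codes; the names `bank_width` and `parallelism` are hypothetical and not taken from the slides.

```python
BANK_WIDTHS = (8, 16, 32, 64)  # bank sizes listed on the slide
REGISTER_BITS = 256            # SIMD register width shown on the slide


def bank_width(code_width: int) -> int:
    """Smallest bank (assumption) that can hold a code of `code_width` bits."""
    for bank in BANK_WIDTHS:
        if code_width <= bank:
            return bank
    raise ValueError(f"code width {code_width} exceeds the largest bank")


def parallelism(code_width: int) -> int:
    """Number of codes processed per 256-bit register."""
    return REGISTER_BITS // bank_width(code_width)


for width in (12, 16, 20, 44):
    print(f"{width}-bit code: {bank_width(width)}-bit bank, "
          f"{parallelism(width)}x parallelism")
```

Under this assumption a 44-bit column lands in 64-bit banks (4x), while a 16-bit column fits 16-bit banks (16x), matching the two slides.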

Page 6: fast multi-column sorting in main-memory column-stores

Multi-Column Sorting

Example Query:

SELECT * FROM orders
ORDER BY order_date, retail_price

Multiple attributes

[Figure: TPC-H queries Q1, Q2, Q3, Q7, Q9, Q10, Q16, Q18; legend: Multi-Column Sorting, Scan+Lookup+Aggregation]

Multi-Column Sorting becomes the bottleneck

Widespread in workloads: 45% of TPC-H queries, 72% of TPC-DS queries

Our work: Optimizing Multi-Column Sorting

Page 7: fast multi-column sorting in main-memory column-stores

State-of-the-Art Implementation: Column-at-a-Time

Order by X, Y

X (20-bit): 0xEEEEE, 0x00000, 0xEEEEE, 0x00000, 0xEEEEE, 0x00000, 0xEEEEE
oid:        1, 2, 3, 4, 5, 6, 7
Y (12-bit): 0xAAA, 0xCCC, 0xBBB, 0xAAA, 0xAAA, 0xFFF, 0xCCC

Input to the 32-bit bank SIMD-sort (8x parallelism), as (X, oid) pairs:
(0xEEEEE, 1), (0x00000, 2), (0xEEEEE, 3), (0x00000, 4), (0xEEEEE, 5), (0x00000, 6), (0xEEEEE, 7)

Page 8: fast multi-column sorting in main-memory column-stores

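State-of-the-Art Implementation: Column-at-a-Time

Order by X, Y

X (20-bit): 0xEEEEE, 0x00000, 0xEEEEE, 0x00000, 0xEEEEE, 0x00000, 0xEEEEE
oid:        1, 2, 3, 4, 5, 6, 7
Y (12-bit): 0xAAA, 0xCCC, 0xBBB, 0xAAA, 0xAAA, 0xFFF, 0xCCC

After the 32-bit bank SIMD-sort on X (8x parallelism):
X:   0x00000, 0x00000, 0x00000, 0xEEEEE, 0xEEEEE, 0xEEEEE, 0xEEEEE
oid: 2, 4, 6, 1, 3, 5, 7

Y values fetched by oid (first four shown, for oid 2, 4, 6, 1):
0xCCC, 0xAAA, 0xFFF, 0xAAA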

Page 9: fast multi-column sorting in main-memory column-stores

State-of-the-Art Implementation: Column-at-a-Time

Order by X, Y

X (20-bit): 0xEEEEE, 0x00000, 0xEEEEE, 0x00000, 0xEEEEE, 0x00000, 0xEEEEE
oid:        1, 2, 3, 4, 5, 6, 7
Y (12-bit): 0xAAA, 0xCCC, 0xBBB, 0xAAA, 0xAAA, 0xFFF, 0xCCC

After the 32-bit bank SIMD-sort on X (8x parallelism):
X:   0x00000, 0x00000, 0x00000, 0xEEEEE, 0xEEEEE, 0xEEEEE, 0xEEEEE
oid: 2, 4, 6, 1, 3, 5, 7

LOOKUP of Y by oid:
Y:   0xCCC, 0xAAA, 0xFFF, 0xAAA, 0xBBB, 0xAAA, 0xCCC

After a 16-bit bank SIMD-sort (16x parallelism) within each group of equal X:
Y:   0xAAA, 0xCCC, 0xFFF | 0xAAA, 0xAAA, 0xBBB, 0xCCC
oid: 4, 2, 6 | 1, 5, 3, 7

Can we do better?
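As a baseline for that question, the column-at-a-time procedure on the preceding slides can be sketched in plain Python. This is only an illustration of the data flow: Python's built-in `sorted` stands in for the SIMD-sort, and the function name `column_at_a_time` is hypothetical.

```python
from itertools import groupby

# Example data from the slides; oids are 1-based.
X = [0xEEEEE, 0x00000, 0xEEEEE, 0x00000, 0xEEEEE, 0x00000, 0xEEEEE]
Y = [0xAAA, 0xCCC, 0xBBB, 0xAAA, 0xAAA, 0xFFF, 0xCCC]


def column_at_a_time(x, y):
    """Sort oids by (X, Y), one column per round."""
    # Round 1: sort (X, oid) pairs by X.
    round1 = sorted(zip(x, range(1, len(x) + 1)))
    result = []
    # Round 2: for each run of equal X, LOOKUP Y by oid and sort the run.
    for _, run in groupby(round1, key=lambda pair: pair[0]):
        looked_up = [(y[oid - 1], oid) for _, oid in run]
        result.extend(oid for _, oid in sorted(looked_up))
    return result


print(column_at_a_time(X, Y))  # [4, 2, 6, 1, 5, 3, 7]
```

The printed oid order matches the final order shown on the slide.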

Page 10: fast multi-column sorting in main-memory column-stores

Option 1: Stitch Together

X (20-bit): 0xEEEEE, 0x00000, 0xEEEEE, 0x00000, 0xEEEEE, 0x00000, 0xEEEEE
oid:        1, 2, 3, 4, 5, 6, 7
Y (12-bit): 0xAAA, 0xCCC, 0xBBB, 0xAAA, 0xAAA, 0xFFF, 0xCCC

Column-at-a-Time (State-of-the-Art Implementation):
32-bit bank SIMD-sort on X, 8x parallelism:
X:   0x00000, 0x00000, 0x00000, 0xEEEEE, 0xEEEEE, 0xEEEEE, 0xEEEEE
oid: 2, 4, 6, 1, 3, 5, 7
LOOKUP:
Y:   0xCCC, 0xAAA, 0xFFF, 0xAAA, 0xBBB, 0xAAA, 0xCCC
16-bit bank SIMD-sort per group, 16x parallelism:
Y:   0xAAA, 0xCCC, 0xFFF | 0xAAA, 0xAAA, 0xBBB, 0xCCC
oid: 4, 2, 6 | 1, 5, 3, 7

Stitch X and Y:
Stitch (first three rows shown): 0xEEEEEAAA, 0x00000CCC, 0xEEEEEBBB

Page 11: fast multi-column sorting in main-memory column-stores

Option 1: Stitch Together

X (20-bit): 0xEEEEE, 0x00000, 0xEEEEE, 0x00000, 0xEEEEE, 0x00000, 0xEEEEE
oid:        1, 2, 3, 4, 5, 6, 7
Y (12-bit): 0xAAA, 0xCCC, 0xBBB, 0xAAA, 0xAAA, 0xFFF, 0xCCC

Column-at-a-Time:
32-bit bank SIMD-sort on X, 8x parallelism:
X:   0x00000, 0x00000, 0x00000, 0xEEEEE, 0xEEEEE, 0xEEEEE, 0xEEEEE
oid: 2, 4, 6, 1, 3, 5, 7
LOOKUP:
Y:   0xCCC, 0xAAA, 0xFFF, 0xAAA, 0xBBB, 0xAAA, 0xCCC
16-bit bank SIMD-sort per group, 16x parallelism:
Y:   0xAAA, 0xCCC, 0xFFF | 0xAAA, 0xAAA, 0xBBB, 0xCCC
oid: 4, 2, 6 | 1, 5, 3, 7

Stitch X and Y:
Supercolumn (32-bit), X followed by Y:
0xEEEEEAAA, 0x00000CCC, 0xEEEEEBBB, 0x00000AAA, 0xEEEEEAAA, 0x00000FFF, 0xEEEEECCC
oid: 1, 2, 3, 4, 5, 6, 7
32-bit bank SIMD-sort, 8x parallelism:
0x00000AAA, 0x00000CCC, 0x00000FFF, 0xEEEEEAAA, 0xEEEEEAAA, 0xEEEEEBBB, 0xEEEEECCC
oid: 4, 2, 6, 1, 5, 3, 7

Correctness proved!

Stitching compared with Column-at-a-Time:
• Save one LOOKUP operation
• Save one round of sorting
• Stitch overhead

WIN
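The stitching step for this example can be sketched as a shift and a bitwise OR, followed by a single sort. This is a hedged illustration rather than the authors' implementation: `sorted` replaces the SIMD-sort, and `stitch_sort` is a hypothetical name.

```python
# Example data from the slides; oids are 1-based.
X = [0xEEEEE, 0x00000, 0xEEEEE, 0x00000, 0xEEEEE, 0x00000, 0xEEEEE]  # 20-bit
Y = [0xAAA, 0xCCC, 0xBBB, 0xAAA, 0xAAA, 0xFFF, 0xCCC]                # 12-bit
Y_BITS = 12


def stitch_sort(x, y, y_bits):
    """Sort oids by (X, Y) in one round over a stitched supercolumn."""
    # Stitch: X occupies the high bits, Y the low `y_bits` bits.
    super_column = [(xv << y_bits) | yv for xv, yv in zip(x, y)]
    # One sorting round on (supercolumn, oid) pairs.
    ordered = sorted(zip(super_column, range(1, len(x) + 1)))
    return [oid for _, oid in ordered], [key for key, _ in ordered]


oids, keys = stitch_sort(X, Y, Y_BITS)
print(oids)                          # [4, 2, 6, 1, 5, 3, 7]
print([f"0x{k:08X}" for k in keys])  # starts with 0x00000AAA
```

Because X sits in the more significant bits, comparing the stitched integers is equivalent to comparing (X, Y) pairs, which is why the single round yields the same oid order as the two-round approach.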

Page 12: fast multi-column sorting in main-memory column-stores

Is stitching together always good? Let's consider another example.

Page 13: fast multi-column sorting in main-memory column-stores

Option 1: Stitch Together

X (24-bit): 0xEEEEEE, 0x000000, 0xEEEEEE, 0x000000, 0xEEEEEE, 0x000000, 0xEEEEEE
Y (20-bit): 0xAAAAA, 0xCCCCC, 0xAAAAA, 0xCCCCC, 0xCCCCC, 0xAAAAA, 0xCCCCC

Column-at-a-Time:
32-bit bank SIMD-sort on X, 8x parallelism
LOOKUP
32-bit bank SIMD-sort on Y per group, 8x parallelism

Stitch X and Y:
Supercolumn (44-bit), X followed by Y:
0xEEEEEEAAAAA, 0x000000CCCCC, 0xEEEEEEAAAAA, 0x000000CCCCC, 0xEEEEEECCCCC, 0x000000AAAAA, 0xEEEEEECCCCC
64-bit bank SIMD-sort, 4x parallelism

Lower data parallelism: LOSE

Any alternatives other than stitching X and Y in this example?

Page 14: fast multi-column sorting in main-memory column-stores

Option 2: Bit Borrowing

X (24-bit): 0xEEEEEE, 0x000000, 0xEEEEEE, 0x000000, 0xEEEEEE, 0x000000, 0xEEEEEE
Y (20-bit): 0xAAAAA, 0xCCCCC, 0xAAAAA, 0xCCCCC, 0xCCCCC, 0xAAAAA, 0xCCCCC

Column-at-a-Time:
32-bit bank SIMD-sort on X, 8x parallelism
LOOKUP
32-bit bank SIMD-sort on Y per group, 8x parallelism

Borrowing bits from Y to X (<< 4 bits):
X with 4 bits borrowed from Y (28-bit); borrowed bits: A, C, A, C, C, A, C
Remaining Y (16-bit)
32-bit bank SIMD-sort on the 28-bit column, 8x parallelism:
0x000000A, 0x000000C, 0x000000C, 0xEEEEEEA, 0xEEEEEEA, 0xEEEEEEC, 0xEEEEEEC
LOOKUP
16-bit bank SIMD-sort on the remaining 16 bits per group, 16x parallelism

Improved parallelism
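Bit borrowing on this example can be sketched as moving the top 4 bits of Y onto the low end of X, then sorting in two rounds as before. The sketch assumes the borrowed bits are the most significant bits of Y, which is what the `<< 4 bits` label and the 28/16 widths suggest; `bit_borrow_sort` is a hypothetical name and `sorted` stands in for the SIMD-sort.

```python
from itertools import groupby

# Example data from the slides; oids are 1-based.
X = [0xEEEEEE, 0x000000, 0xEEEEEE, 0x000000, 0xEEEEEE, 0x000000, 0xEEEEEE]  # 24-bit
Y = [0xAAAAA, 0xCCCCC, 0xAAAAA, 0xCCCCC, 0xCCCCC, 0xAAAAA, 0xCCCCC]         # 20-bit
Y_BITS, BORROW = 20, 4


def bit_borrow_sort(x, y, y_bits, borrow):
    """Sort oids by (X, Y) after moving `borrow` high bits of Y into X."""
    rest = y_bits - borrow
    # Widened first column: X followed by the top bits of Y (24 + 4 = 28 bits).
    x_wide = [(xv << borrow) | (yv >> rest) for xv, yv in zip(x, y)]
    # Narrowed second column: the remaining low bits of Y (16 bits).
    y_narrow = [yv & ((1 << rest) - 1) for yv in y]
    # Round 1 on the 28-bit column (still fits a 32-bit bank).
    round1 = sorted(zip(x_wide, range(1, len(x) + 1)))
    result = []
    # Round 2 on the 16-bit column (now fits a 16-bit bank), per group.
    for _, run in groupby(round1, key=lambda pair: pair[0]):
        looked_up = [(y_narrow[oid - 1], oid) for _, oid in run]
        result.extend(oid for _, oid in sorted(looked_up))
    return result, x_wide, y_narrow


order, x_wide, y_narrow = bit_borrow_sort(X, Y, Y_BITS, BORROW)
print(order)  # [6, 2, 4, 1, 3, 5, 7]
```

The first round stays in 32-bit banks, while the second round drops from 32-bit to 16-bit banks, which is the parallelism gain the slide points to.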

Page 15: fast multi-column sorting in main-memory column-stores

Optimal Plan

• Given 3 columns with 11-bit, 14-bit, and 21-bit to be sorted:
  • Stitch together?
  • Bit borrowing?
  • Split into more rounds?

Num. of possible plans: 2^(11+14+21)

In the paper:
• Cost model
• Plan enumeration and search
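To make the three questions on this slide concrete, a plan can be written as a list of per-round key widths that together cover all 46 bits. The sketch below only reports the bank size and parallelism each round would get; it is not the paper's cost model or search procedure, and the representation, the name `describe_plan`, and the third candidate plan are assumptions made for illustration.

```python
BANK_WIDTHS = (8, 16, 32, 64)  # bank sizes from the SIMD-Sort slides
REGISTER_BITS = 256


def describe_plan(round_widths, total_bits):
    """Return (bank width, parallelism) for every sorting round of a plan."""
    if sum(round_widths) != total_bits:
        raise ValueError("round widths must cover every key bit exactly once")
    rounds = []
    for width in round_widths:
        if width > BANK_WIDTHS[-1]:
            raise ValueError(f"a {width}-bit round does not fit any bank")
        bank = next(b for b in BANK_WIDTHS if width <= b)
        rounds.append((bank, REGISTER_BITS // bank))
    return rounds


TOTAL_BITS = 11 + 14 + 21
candidate_plans = {
    "column-at-a-time": [11, 14, 21],
    "stitch all columns": [46],
    # Hypothetical mix: borrow 5 bits into round 1, stitch the rest.
    "borrow then stitch": [16, 30],
}
for name, widths in candidate_plans.items():
    print(name, describe_plan(widths, TOTAL_BITS))
```

Which of these is fastest depends on costs this sketch does not capture (number of rounds, lookups, stitching overhead), which is the role of the cost model named on the slide.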

Page 16: fast multi-column sorting in main-memory column-stores

Experiments

• Setup
  • Intel Xeon E5 10-core & Intel i7 quad-core
  • AVX2 instruction set (256 bits)

• Data sets
  • TPC-H
  • TPC-H Skew
  • TPC-DS
  • Real data (Airline Origin and Destination Survey)

Page 17: fast multi-column sorting in main-memory column-stores

Speedup over Column-at-a-Time

1.8X ~ 5.5X speedup

[Figure panels: TPC-H, TPC-H Skew, TPC-DS, Real Data]

Page 18: fast multi-column sorting in main-memory column-stores


Data Size Scalability

Linear data size scalability

Our solution for Multi-Column Sorting

Page 19: fast multi-column sorting in main-memory column-stores


Core/thread Scalability

Linear core/thread scalability

Our solution for Multi-Column Sorting

Page 20: fast multi-column sorting in main-memory column-stores

Summary

• First work to pinpoint and tackle the issue of multi-column sorting
• Our technique: manipulate the bits across input columns
• Up to 5.5X speedup in query execution

Thank you