Vectorized Execution (Part I) @Andy_Pavlo // 15-721 // Spring 2018 ADVANCED DATABASE SYSTEMS Lecture #21

CMU 15-721 · 2018. 5. 21.

CMU 15-721 (Spring 2018)

Background

Hardware

Vectorized Algorithms (Columbia)


VECTORIZATION

The process of converting an algorithm's scalar implementation that processes a single pair of operands at a time, to a vector implementation that processes one operation on multiple pairs of operands at once.


WHY THIS MATTERS

Say we can parallelize our algorithm over 32 cores.

Each core has 4-wide SIMD registers.

Potential Speed-up: 32x × 4x = 128x


MULTI-CORE CPUS

Use a small number of high-powered cores.
→ Intel Xeon Skylake / Kaby Lake
→ High power consumption and area per core.

Massively superscalar and aggressive out-of-order execution
→ Instructions are issued from a sequential stream.
→ Check for dependencies between instructions.
→ Process multiple instructions per clock cycle.


MANY INTEGRATED CORES (MIC)

Use a larger number of low-powered cores.
→ Intel Xeon Phi
→ Low power consumption and area per core.
→ Expanded SIMD instructions with larger register sizes.

Knights Ferry (Columbia Paper)
→ Non-superscalar and in-order execution
→ Cores = Intel P54C (aka Pentium from the 1990s).

Knights Landing (Since 2016)
→ Superscalar and out-of-order execution.
→ Cores = Silvermont (aka Atom)


SINGLE INSTRUCTION, MULTIPLE DATA

A class of CPU instructions that allow the processor to perform the same operation on multiple data points simultaneously.

All major ISAs have microarchitecture support for SIMD operations.
→ x86: MMX, SSE, SSE2, SSE3, SSE4, AVX, AVX2, AVX512
→ PowerPC: Altivec
→ ARM: NEON


SIMD EXAMPLE

X + Y = Z, computed element by element: (x1, x2, ..., xn) + (y1, y2, ..., yn) = (x1+y1, x2+y2, ..., xn+yn)

for (i=0; i<n; i++) {
  Z[i] = X[i] + Y[i];
}

X = 8 7 6 5 4 3 2 1
Y = 1 1 1 1 1 1 1 1

SISD: each add instruction handles one pair of elements, producing Z = 9 8 7 6 5 4 3 2 one value at a time.

SIMD: each 128-bit SIMD register holds four elements, so one add instruction handles four pairs at once.
8 7 6 5 + 1 1 1 1 = 9 8 7 6
4 3 2 1 + 1 1 1 1 = 5 4 3 2


STREAMING SIMD EXTENSIONS (SSE)

SSE is a collection of SIMD instructions that target special 128-bit SIMD registers.

These registers can be packed with four 32-bit scalars after which an operation can be performed on each of the four elements simultaneously.

First introduced by Intel in 1999.
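The packing described above can be sketched with compiler intrinsics. This is a minimal illustration, not code from the lecture: the helper name add_packed is hypothetical, and it uses the SSE2 header because the 128-bit integer operations arrived with SSE2 rather than the original SSE. It assumes the element count is a multiple of four.

```cpp
#include <cassert>
#include <cstdint>
#include <emmintrin.h>  // SSE2 intrinsics (integer ops on 128-bit registers)

// Adds n 32-bit integers four at a time.
// add_packed is a hypothetical helper name; n must be a multiple of 4.
void add_packed(const int32_t* x, const int32_t* y, int32_t* z, int n) {
    assert(n % 4 == 0);
    for (int i = 0; i < n; i += 4) {
        // Unaligned loads pack four 32-bit scalars into one 128-bit register.
        __m128i vx = _mm_loadu_si128(reinterpret_cast<const __m128i*>(x + i));
        __m128i vy = _mm_loadu_si128(reinterpret_cast<const __m128i*>(y + i));
        // A single instruction adds all four lanes.
        __m128i vz = _mm_add_epi32(vx, vy);
        _mm_storeu_si128(reinterpret_cast<__m128i*>(z + i), vz);
    }
}
```

Calling add_packed on X = 8 7 6 5 4 3 2 1 and Y = 1 1 1 1 1 1 1 1 takes two loop iterations instead of the eight a scalar loop would need.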


SIMD INSTRUCTIONS (1)

Data Movement
→ Moving data in and out of vector registers

Arithmetic Operations
→ Apply operation on multiple data items (e.g., 2 doubles, 4 floats, 16 bytes)
→ Example: ADD, SUB, MUL, DIV, SQRT, MAX, MIN

Logical Instructions
→ Logical operations on multiple data items
→ Example: AND, OR, XOR, ANDN, ANDPS, ANDNPS


SIMD INSTRUCTIONS (2)

Comparison Instructions
→ Comparing multiple data items (==,<,<=,>,>=,!=)

Shuffle Instructions
→ Move data in between SIMD registers

Miscellaneous
→ Conversion: Transform data between x86 and SIMD registers.
→ Cache Control: Move data directly from SIMD registers to memory (bypassing CPU cache).


INTEL SIMD EXTENSIONS

Year  Extension  Width     Integers  Single-P  Double-P
1997  MMX        64 bits   ✔
1999  SSE        128 bits  ✔         ✔(×4)
2001  SSE2       128 bits  ✔         ✔         ✔(×2)
2004  SSE3       128 bits  ✔         ✔         ✔
2006  SSSE3      128 bits  ✔         ✔         ✔
2006  SSE4.1     128 bits  ✔         ✔         ✔
2008  SSE4.2     128 bits  ✔         ✔         ✔
2011  AVX        256 bits  ✔         ✔(×8)     ✔(×4)
2013  AVX2       256 bits  ✔         ✔         ✔
2017  AVX-512    512 bits  ✔         ✔(×16)    ✔(×8)

Source: James Reinders


WHY NOT GPUS?

Moving data back and forth between DRAM and the GPU is slow over the PCI-E bus.

There are some newer GPU-enabled DBMSs
→ Examples: MapD, SQream, Kinetica

Emerging co-processors that can share the CPU’s memory may change this.
→ Examples: AMD’s APU, Intel’s Knights Landing


VECTORIZATION

Choice #1: Automatic Vectorization

Choice #2: Compiler Hints

Choice #3: Explicit Vectorization

Ease of use decreases and programmer control increases from Choice #1 to Choice #3.

Source: James Reinders


AUTOMATIC VECTORIZATION

The compiler can identify when instructions inside of a loop can be rewritten as a vectorized operation.

Works for simple loops only and is rare in database operators. Requires hardware support for SIMD instructions.


This loop is not legal to automatically vectorize.

The code is written such that the addition is described as being done sequentially.

void add(int *X, int *Y, int *Z) {
  for (int i=0; i<MAX; i++) {
    Z[i] = X[i] + Y[i];
  }
}

The pointers X, Y, and Z might point to the same address! (e.g., *Z=*X+1)
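To see why overlap matters, the sketch below runs the slide's loop twice on the same aliased buffer: once element by element, and once in a way that emulates a 4-wide vector (read four inputs, then write four outputs). It assumes the slide's annotation means that Z overlaps X shifted by one element. The function names add_four_wide and run_aliased are hypothetical.

```cpp
#include <cassert>
#include <vector>

// Scalar loop as written on the slide: one element at a time.
void add_sequential(int* X, int* Y, int* Z, int n) {
    for (int i = 0; i < n; i++) Z[i] = X[i] + Y[i];
}

// Emulates a 4-wide vectorized loop: read four inputs, then write four outputs.
// n must be a multiple of 4.
void add_four_wide(int* X, int* Y, int* Z, int n) {
    assert(n % 4 == 0);
    for (int i = 0; i < n; i += 4) {
        int tmp[4];
        for (int lane = 0; lane < 4; lane++) tmp[lane] = X[i + lane] + Y[i + lane];
        for (int lane = 0; lane < 4; lane++) Z[i + lane] = tmp[lane];
    }
}

// Runs one of the two loops with Z aliased to X + 1 and returns the buffer.
std::vector<int> run_aliased(bool four_wide) {
    std::vector<int> buf = {1, 0, 0, 0, 0};  // X = buf, Z = buf + 1
    std::vector<int> ones = {1, 1, 1, 1};
    if (four_wide) add_four_wide(buf.data(), ones.data(), buf.data() + 1, 4);
    else           add_sequential(buf.data(), ones.data(), buf.data() + 1, 4);
    return buf;
}
```

The sequential loop feeds each freshly written value into the next iteration and yields 1 2 3 4 5, while the 4-wide version reads all inputs before any write and yields 1 2 1 1 1. Because the two orders disagree, the compiler cannot vectorize the loop unless it can prove that the pointers do not overlap.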


COMPILER HINTS

Provide the compiler with additional information about the code to let it know that it is safe to vectorize.

Two approaches:
→ Give explicit information about memory locations.
→ Tell the compiler to ignore vector dependencies.


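The restrict keyword tells the compiler that the arrays are distinct locations in memory. It is standard in C99; C++ compilers provide it as an extension (__restrict__ in GCC and Clang, __restrict in MSVC).

void add(int *restrict X,
         int *restrict Y,
         int *restrict Z) {
  for (int i=0; i<MAX; i++) {
    Z[i] = X[i] + Y[i];
  }
}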


This pragma tells the compiler to ignore loop dependencies for the vectors.

It’s up to you to make sure that this is correct.

void add(int *X, int *Y, int *Z) {
  #pragma ivdep
  for (int i=0; i<MAX; i++) {
    Z[i] = X[i] + Y[i];
  }
}


EXPLICIT VECTORIZATION

Use CPU intrinsics to manually marshal data between SIMD registers and execute vectorized instructions.

Potentially not portable.


Store the vectors in 128-bit SIMD registers.

Then invoke the intrinsic to add together the vectors and write them to the output location.

void add(int *X, int *Y, int *Z) {
  // Assumes 16-byte-aligned arrays and MAX divisible by 4.
  __m128i *vecX = (__m128i*)X;
  __m128i *vecY = (__m128i*)Y;
  __m128i *vecZ = (__m128i*)Z;
  for (int i=0; i<MAX/4; i++) {
    // Advance all three pointers so each iteration handles the next four ints.
    _mm_store_si128(vecZ++, _mm_add_epi32(*vecX++, *vecY++));
  }
}


VECTORIZATION DIRECTION

Approach #1: Horizontal
→ Perform operation on all elements together within a single vector.
→ Example: SIMD Add of (0 1 2 3) yields 6.

Approach #2: Vertical
→ Perform operation in an elementwise manner on elements of each vector.
→ Example: SIMD Add of (0 1 2 3) and (1 1 1 1) yields (1 2 3 4).

Source: Przemysław Karpiński
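Both directions can be sketched with SSE2 intrinsics. The function names here are hypothetical, and the horizontal sum uses a shuffle-and-add reduction, which is one common way to do it rather than the only one.

```cpp
#include <cassert>
#include <cstdint>
#include <emmintrin.h>  // SSE2 intrinsics

// Vertical: lane i of the result is a[i] + b[i].
void vertical_add(const int32_t* a, const int32_t* b, int32_t* out) {
    assert(a && b && out);
    __m128i va = _mm_loadu_si128(reinterpret_cast<const __m128i*>(a));
    __m128i vb = _mm_loadu_si128(reinterpret_cast<const __m128i*>(b));
    _mm_storeu_si128(reinterpret_cast<__m128i*>(out), _mm_add_epi32(va, vb));
}

// Horizontal: reduce the four lanes of one vector to a single sum.
int32_t horizontal_add(const int32_t* a) {
    assert(a);
    __m128i v = _mm_loadu_si128(reinterpret_cast<const __m128i*>(a));
    // Swap the two 64-bit halves and add: lane 0 becomes a0+a2, lane 1 a1+a3.
    __m128i s = _mm_add_epi32(v, _mm_shuffle_epi32(v, _MM_SHUFFLE(1, 0, 3, 2)));
    // Swap adjacent lanes and add: every lane now holds the full sum.
    s = _mm_add_epi32(s, _mm_shuffle_epi32(s, _MM_SHUFFLE(2, 3, 0, 1)));
    return _mm_cvtsi128_si32(s);  // extract lane 0
}
```

The vertical add is a single instruction, whereas the horizontal sum needs two shuffles and two adds to combine lanes of the same register.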


EXPLICIT VECTORIZATION

Linear Access Operators
→ Predicate evaluation
→ Compression

Ad-hoc Vectorization
→ Sorting
→ Merging

Composable Operations
→ Multi-way trees
→ Bucketized hash tables

Source: Orestis Polychroniou


VECTORIZED DBMS ALGORITHMS

Principles for efficient vectorization by using fundamental vector operations to construct more advanced functionality.
→ Favor vertical vectorization by processing different input data per lane.
→ Maximize lane utilization by executing different things per lane subset.

RETHINKING SIMD VECTORIZATION FOR IN-MEMORY DATABASES, SIGMOD 2015


FUNDAMENTAL OPERATIONS

Selective Load

Selective Store

Selective Gather

Selective Scatter


FUNDAMENTAL VECTOR OPERATIONS

Selective Load
Vector: A B C D
Mask: 0 1 0 1
Memory: U V W X Y Z • • •
→ Lanes whose mask bit is 1 are filled, in order, with consecutive values from memory (U, then V).
→ Resulting vector: A U C V

Selective Store
Vector: A B C D
Mask: 0 1 0 1
Memory: U V W X Y Z • • •
→ Lanes whose mask bit is 1 are written, in order, to consecutive memory locations (B, then D).
→ Resulting memory: B D W X Y Z • • •
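The semantics of the two operations can be emulated with scalar loops. This is only a sketch of the behavior, not how the hardware implements it; the 4-lane width, the type aliases, and the function names are assumptions for illustration.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

constexpr std::size_t kLanes = 4;  // assumed vector width for illustration
using Vec  = std::array<char, kLanes>;
using Mask = std::array<bool, kLanes>;

// Selective load: lanes whose mask bit is set take the next consecutive
// value from memory; the other lanes keep their old value.
// Returns the number of values read.
std::size_t selective_load(Vec& v, const Mask& m, const char* mem) {
    assert(mem != nullptr);
    std::size_t read = 0;
    for (std::size_t lane = 0; lane < kLanes; lane++)
        if (m[lane]) v[lane] = mem[read++];
    return read;
}

// Selective store: lanes whose mask bit is set are written contiguously
// to memory. Returns the number of values written.
std::size_t selective_store(const Vec& v, const Mask& m, char* mem) {
    assert(mem != nullptr);
    std::size_t written = 0;
    for (std::size_t lane = 0; lane < kLanes; lane++)
        if (m[lane]) mem[written++] = v[lane];
    return written;
}
```

The returned counts let a caller advance its input or output cursor by the number of active lanes, which is how a scan can append only matching tuples to its output.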


Selective Gather
Value Vector: A B C D
Index Vector: 2 1 5 3
Memory: U V W X Y Z • • • (offsets 0 1 2 3 4 5)
→ Each lane reads the memory location named by its index.
→ Resulting value vector: W V Z X

Selective Scatter
Value Vector: A B C D
Index Vector: 2 1 5 3
Memory: U V W X Y Z • • • (offsets 0 1 2 3 4 5)
→ Each lane writes its value to the memory location named by its index.
→ Resulting memory: U B A D Y C • • •
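A scalar emulation makes the index-driven access pattern explicit. The names and the 4-lane width below are assumptions for illustration, and the per-lane mask is omitted because the slide's example uses every lane.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

constexpr std::size_t kLanes = 4;  // assumed vector width for illustration
using Vec = std::array<char, kLanes>;
using Idx = std::array<std::size_t, kLanes>;

// Gather: lane i reads mem[idx[i]].
Vec gather(const char* mem, const Idx& idx) {
    assert(mem != nullptr);
    Vec out{};
    for (std::size_t lane = 0; lane < kLanes; lane++) out[lane] = mem[idx[lane]];
    return out;
}

// Scatter: lane i writes v[i] to mem[idx[i]]. If two lanes share an index,
// this emulation lets the higher-numbered lane win.
void scatter(char* mem, const Idx& idx, const Vec& v) {
    assert(mem != nullptr);
    for (std::size_t lane = 0; lane < kLanes; lane++) mem[idx[lane]] = v[lane];
}
```

With index vector 2 1 5 3, gather returns W V Z X and scatter of A B C D turns the memory into U B A D Y C, matching the example.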


ISSUES

Gathers and scatters are not really executed in parallel because the L1 cache only allows one or two distinct accesses per cycle.

Gathers are only supported in newer CPUs.

Selective loads and stores are also emulated in Xeon CPUs using vector permutations.


VECTORIZED OPERATORS

Selection Scans

Hash Tables

Partitioning

Paper provides additional info:
→ Joins, Sorting, Bloom filters.

RETHINKING SIMD VECTORIZATION FOR IN-MEMORY DATABASES, SIGMOD 2015


SELECTION SCANS

SELECT * FROM table
 WHERE key >= $(low) AND key <= $(high)

Scalar (Branching)

i = 0
for t in table:
  key = t.key
  if (key≥low) && (key≤high):
    copy(t, output[i])
    i = i + 1

Scalar (Branchless)

i = 0
for t in table:
  copy(t, output[i])
  key = t.key
  m = (key≥low ? 1 : 0) & (key≤high ? 1 : 0)
  i = i + m

Source: Bogdan Raducanu

Vectorized

i = 0
for vt in table:
  simdLoad(vt.key, vk)
  vm = (vk≥low ? 1 : 0) & (vk≤high ? 1 : 0)
  simdStore(vt, vm, output[i])
  i = i + |vm≠false|

Example:

SELECT * FROM table
 WHERE key >= "O" AND key <= "U"

ID  KEY
1   J
2   O
3   Y
4   S
5   U
6   X

Key Vector: J O Y S U X
SIMD Compare → Mask: 0 1 0 1 1 0
All Offsets: 0 1 2 3 4 5
SIMD Store → Matched Offsets: 1 3 4
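The branchless and vectorized scans can be sketched in C++ with SSE2. This is an illustration under stated assumptions rather than the paper's implementation: keys are 32-bit integers, the key count is a multiple of four, and the function names are hypothetical. SSE2 has no selective store, so matching offsets are written out lane by lane from the comparison mask.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>
#include <emmintrin.h>  // SSE2 intrinsics

// Branchless scalar scan: always write the offset, advance only on a match.
std::vector<uint32_t> scan_branchless(const std::vector<int32_t>& keys,
                                      int32_t low, int32_t high) {
    std::vector<uint32_t> out(keys.size());
    std::size_t i = 0;
    for (std::size_t t = 0; t < keys.size(); t++) {
        out[i] = static_cast<uint32_t>(t);
        i += (keys[t] >= low) & (keys[t] <= high);  // 0 or 1, no branch
    }
    out.resize(i);
    return out;
}

// Vectorized scan: compare four keys per iteration, then emit the offsets
// of the matching lanes. keys.size() must be a multiple of 4.
std::vector<uint32_t> scan_sse2(const std::vector<int32_t>& keys,
                                int32_t low, int32_t high) {
    assert(keys.size() % 4 == 0);
    std::vector<uint32_t> out;
    const __m128i vlow = _mm_set1_epi32(low);
    const __m128i vhigh = _mm_set1_epi32(high);
    for (std::size_t t = 0; t < keys.size(); t += 4) {
        __m128i vk = _mm_loadu_si128(reinterpret_cast<const __m128i*>(&keys[t]));
        // SSE2 only offers signed greater-than, so a lane misses when
        // low > key or key > high.
        __m128i miss = _mm_or_si128(_mm_cmpgt_epi32(vlow, vk),
                                    _mm_cmpgt_epi32(vk, vhigh));
        // One bit per 32-bit lane; after inversion a set bit means a match.
        int mask = ~_mm_movemask_ps(_mm_castsi128_ps(miss)) & 0xF;
        for (int lane = 0; lane < 4; lane++)
            if (mask & (1 << lane)) out.push_back(static_cast<uint32_t>(t + lane));
    }
    return out;
}
```

On the keys J O Y S U X (as character codes, padded with two non-matching keys to fill the last vector) and the range "O" to "U", both functions return the offsets 1, 3, 4.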


[Figure: selection scan throughput (billion tuples / sec) versus selectivity (%), from 0 to 100, for Scalar (Branching), Scalar (Branchless), Vectorized (Early Mat), and Vectorized (Late Mat). Panels: MIC (Xeon Phi 7120P – 61 Cores + 4×HT) and Multi-Core (Xeon E3-1275v3 – 4 Cores + 2×HT). Each panel marks the memory bandwidth limit.]


HASH TABLES PROBING

Scalar
→ Input key k1 → hash(key) → hash index h1.
→ Linear probing hash table with KEY and PAYLOAD columns.
→ Starting at h1, compare k1 against one stored key at a time (k9, k3, k8, k1) until a match is found.

Vectorized (Horizontal)
→ Input key k1 → hash(key) → hash index h1.
→ Linear probing bucketized hash table with KEYS and PAYLOAD columns.
→ One SIMD Compare checks k1 against all keys in the bucket (k9 k3 k8 k1) at once.
→ Matched Mask: 0 0 0 1
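Horizontal probing can be sketched with SSE2 as follows. The bucket layout, the modulo hash, the use of key 0 as the empty-slot marker, and all names are assumptions made for this illustration, not details taken from the paper.

```cpp
#include <cassert>
#include <cstdint>
#include <emmintrin.h>  // SSE2 intrinsics

constexpr int32_t kEmpty = 0;  // assumed marker for an unused slot

// Hypothetical bucket layout: four 32-bit keys and their payloads.
struct Bucket {
    int32_t keys[4];
    int32_t payloads[4];
};

// Compares one probe key against all four keys of a bucket at once.
// Bit i of the result is set when keys[i] == key.
int match_mask(const Bucket& b, int32_t key) {
    __m128i stored = _mm_loadu_si128(reinterpret_cast<const __m128i*>(b.keys));
    __m128i eq = _mm_cmpeq_epi32(stored, _mm_set1_epi32(key));
    return _mm_movemask_ps(_mm_castsi128_ps(eq));
}

// Linear probing over buckets. Returns true and sets *payload on a hit.
bool probe(const Bucket* table, uint32_t num_buckets, int32_t key,
           int32_t* payload) {
    assert(num_buckets > 0 && key != kEmpty);
    uint32_t h = static_cast<uint32_t>(key) % num_buckets;  // toy hash
    for (uint32_t step = 0; step < num_buckets; step++) {
        const Bucket& b = table[(h + step) % num_buckets];
        int m = match_mask(b, key);
        if (m != 0) {
            int lane = 0;
            while (!(m & (1 << lane))) lane++;  // first matching lane
            *payload = b.payloads[lane];
            return true;
        }
        // A bucket with a free slot ends the probe sequence: key is absent.
        if (match_mask(b, kEmpty) != 0) return false;
    }
    return false;
}
```

For a bucket holding the keys 9, 3, 8, 1, probing for key 1 produces a mask with only the last lane set, which corresponds to the matched mask 0 0 0 1 in the example.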


HASH TABLES PROBING

Vectorized (Vertical)
→ Input Key Vector: k1 k2 k3 k4
→ hash(key) on every lane → Hash Index Vector: h1 h2 h3 h4
→ Linear probing hash table with KEY and PAYLOAD columns; the stored keys include k99, k1, k6, k4, k5, k88.
→ SIMD Gather fetches the stored key at each hash index: k1 k99 k88 k4
→ SIMD Compare against the input keys (k1 k2 k3 k4) → Mask: 1 0 0 1
→ Lanes that matched are refilled with new input keys (k5, k6) and their hash indexes; lanes that did not match advance to the next slot.
→ Next Input Key Vector: k5 k2 k3 k6
→ Next Hash Index Vector: h5 h2+1 h3+1 h6


[Figure: hash table probing throughput (billion tuples / sec) versus hash table size for Scalar, Vectorized (Horizontal), and Vectorized (Vertical). Panels: MIC (Xeon Phi 7120P – 61 Cores + 4×HT) and Multi-Core (Xeon E3-1275v3 – 4 Cores + 2×HT). Each panel marks the "Out of Cache" region.]


PARTITIONING HISTOGRAM

Use scatter and gathers to increment counts.

Replicate the histogram to handle collisions.

Input Key Vector: k1 k2 k3 k4
SIMD Radix → Hash Index Vector: h1 h2 h3 h4

Single histogram
→ SIMD Add applies +1 per lane, but when two lanes map to the same bucket only one increment survives (three +1 updates for four keys).

Replicated histogram (one copy per vector lane, # of Vector Lanes copies in total)
→ SIMD Scatter lets each lane apply its +1 to its own copy, so no update is lost.
→ SIMD Add across the copies produces the final histogram (+1, +2, +1).
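The replication idea can be sketched with scalar loops that stand in for the gather, add, and scatter steps. The 4-lane width, the interleaved layout of the copies, and the function name are assumptions for illustration; the key count is assumed to be a multiple of the lane count.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr std::size_t kLanes = 4;  // assumed vector width for illustration

// Builds a histogram over the low `bits` bits of each key, using one private
// histogram copy per lane so that lanes never update the same counter.
std::vector<uint32_t> histogram(const std::vector<uint32_t>& keys, uint32_t bits) {
    assert(bits > 0 && bits < 16 && keys.size() % kLanes == 0);
    const std::size_t buckets = std::size_t{1} << bits;
    // Layout: the copy of bucket b owned by lane L lives at b * kLanes + L.
    std::vector<uint32_t> replicated(buckets * kLanes, 0);
    for (std::size_t t = 0; t < keys.size(); t += kLanes) {
        std::array<std::size_t, kLanes> idx;
        std::array<uint32_t, kLanes> count;
        for (std::size_t lane = 0; lane < kLanes; lane++)   // radix
            idx[lane] = (keys[t + lane] & (buckets - 1)) * kLanes + lane;
        for (std::size_t lane = 0; lane < kLanes; lane++)   // gather
            count[lane] = replicated[idx[lane]];
        for (std::size_t lane = 0; lane < kLanes; lane++)   // add
            count[lane] += 1;
        for (std::size_t lane = 0; lane < kLanes; lane++)   // scatter
            replicated[idx[lane]] = count[lane];
    }
    // Sum the per-lane copies into the final histogram.
    std::vector<uint32_t> hist(buckets, 0);
    for (std::size_t b = 0; b < buckets; b++)
        for (std::size_t lane = 0; lane < kLanes; lane++)
            hist[b] += replicated[b * kLanes + lane];
    return hist;
}
```

Because every lane owns a distinct counter, two keys in the same vector that fall into the same bucket both get counted, at the cost of a histogram that is kLanes times larger.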


JOINS

No Partitioning
→ Build one shared hash table using atomics
→ Partially vectorized

Min Partitioning
→ Partition building table
→ Build one hash table per thread
→ Fully vectorized

Max Partitioning
→ Partition both tables repeatedly
→ Build and probe cache-resident hash tables
→ Fully vectorized


[Figure: join time (sec) for the scalar and vector implementations of No Partitioning, Min Partitioning, and Max Partitioning, broken down into Partition, Build, Probe, and Build+Probe. Workload: 200M ⨝ 200M tuples (32-bit keys & payloads) on Xeon Phi 7120P – 61 Cores + 4×HT.]


PARTING THOUGHTS

Vectorization is essential for OLAP queries.

These algorithms don’t work when the data exceeds your CPU cache.

We can combine all the intra-query parallelism optimizations we’ve talked about in a DBMS.
→ Multiple threads processing the same query.
→ Each thread can execute a compiled plan.
→ The compiled plan can invoke vectorized operations.


NEXT CLASS

Vectorization (Part II)
