0
10
20
30
40
50
60
70
80
90
100
110
Our approach
Key-index approach
Direct approach
Radix sort STLstable_sort
STL sort(unstable)
executio
n ti
me (
sec)
with SIMD without SIMD
multiway mergesort
faste
r
Hiroshi Inoue ([email protected])† ‡ Kenjiro Taura‡
†IBM Research – Tokyo ‡University of Tokyo
SIMD- and Cache-Friendly Algorithm for Sorting an Array of Structures
VLDB 2015 at Hawai‘i, USA
• SIMD-based multiway mergesort has been used for in-memory sorting of integers
• We extend this for sorting structures (key + payload)
Existing approaches
• Key-index approach:
1. encode key and index for each record into an integer,
2. sort the key-index pairs with SIMD, and
3. rearrange the records based on the sorted key-index pairs
SIMD friendly but NOT cache friendly
Costly due to random accesses for memory
• Direct approach:
1. sort records directly without encoding into an integer
Cache friendly but NOT SIMD friendly
Inefficient with SIMD due to gather for keys
Key idea:
• to execute rearranging of records more frequently, e.g. once per m merge stages (m > 1), instead of only once at the last in key-index approach
Benefits
• Cache friendly: the rearrange operation reads from k = 2m input streams and write to one output stream; hence the memory accesses are sequential unless m is too large
• SIMD friendly: most of the merge operations are done for integers; reading keys from records, which is costly with SIMD, only once per m stages
Implementation in multiway merge
We integrate key encoding and rearrange into multiway merge operation
• multiway merge, which merges k (k > 2) input streams into one output stream, is a common technique to reduce memory bandwidth in mergesort
Steps of our multiway merge operation
1. at the first stage, read records from system memory and encode key and streamID into an integer
2. merge integer values using SIMD
3. at the last stage, rearrange records based on the encoded streamID
Performance results
Sorting 512M records of 16 byte
Sorting 16M records of various record sizes
processor’s cache memory
system memory
input streams
rearrange records
stream0
output stream
2-waymerge
2-waymerge
2-waymerge
encode {key, streamID} pair into an integer value
2-waymerge
2-waymerge
2-waymerge
2-waymerge
stream1
stream2
stream3
stream4
stream5
stream6
stream7
8-way merge operation for structures
move structures move integer values
SIMD merge operation for integers
stage 1
stage 2
stage 3
rearrange without random accesses
unsorted input
sorted output
Direct approachKey-index approach
encode
rearrange
Our approach (m = 3)
unsorted input
sorted output
unsorted input
sorted output
encode
rearrange
encode
rearrange
merge operation
for structures
cache
unfriendly
random
accesses
SIMD
unfriendly
merge operation
for integers
cache
friendly
sequential
accesses
cache
friendly
sequential
accesses
SIMD
friendly
SIMD
friendly
Introduction
struct Record {
int32_t key;
int32_t dataX;
int32_t dataY;...
};
struct Record array[N];
payload
key payload key payload
record i record i +1
ki i ki+1 i+1
Our approach
record i
64-bit integer encode
System
• 2.9-GHz Xeon (SandyBridge) / RHEL 6.4 / gcc-4.8 / using SSE (128-bit SIMD)
0
1
2
3
4
5
6
7
8 16 24 32 48 64 96 128 192 256
executin tim
e (
sec)
record size (byte)
Our approach
Key-index approach
Direct approach
Radix sort
STL stable_sort
STL sort (unstable)
cache line size
faste
r
sorted
unsorted input
sorted output
k-way merge k-way merge k-way merge k-way merge
k-way merge
Step 1: divide into small blocks that can fir in (L2) cache
Step 2: sort each block by vectorized combsort
Step 3: repeatedly execute multiway merge to merge all blocks
Optimizations and overall scheme Optimizations:
exploiting 4-wide SIMD by encoding a {key, id} pair into 32-bit integer
• we encode streamID (up to k) accompanied with its key into an intermediate integer; the streamID is much smaller than the index (up to N) we can use more bits for the key
• we use a 32-bit integer instead of a 64-bit integer to encode (a part of ) key and streamID to use higher data parallelism when the number of elements to merge is smaller than a threshold
• if we use only a part of keys for merging, we check the order by using the entire key when we rearrange records (without using SIMD)
vectorized combsort for initial sorting
• because mergesort is not efficient for small amount of data, we switch to vectorized combsort if a block to sort is small enough to fit into L2 cache
• combsort is efficient with SIMD but shows very poor memory access locality good for initial sorting of small blocks
Overall sorting scheme:
0
5
10
15
20
25
30
35
40
45
0 4 8 12 16re
lative p
erf
orm
ane o
ver S
TL's
std
::sta
ble
_so
rt o
n 1
co
renumber of cores
Our approach
Key-index approach
Direct approach
Radix sort
STL stable_sort
STL sort (unstable)
faste
r
Scalability with multiple cores
0.01
0.1
1
10
100
1M 2M 4M 8M 16M 32M 64M 128M 256M 512M
executio
n tim
e (
sec)
# records
Our approach
Key-index approach
Direct approach
Radix sort
STL stable_sort
STL sort (unstable)
faste
r
Sorting various numbers of 16-byte records
0
10
20
30
40
50
60
70
80
2 4 8
16
32
64
128
256
512
1k
2k
executio
n ti
me (
sec)
number of ways
faste
r
Effect of number of ways (k)