SuRF: PRACTICAL RANGE FILTERING WITH FAST SUCCINCT TRIES
Huanchen Zhang, Hyeontaek Lim, Viktor Leis, David G. Andersen
Michael Kaminsky, Kimberly Keeton, Andrew Pavlo
Filters answer approximate membership queries
Example membership set: Billionaire
Key in the set: YES, 100% (No False Negatives)
Key not in the set: NO, 99%; YES, 1% (False Positive Rate)
Filters pre-reject most negative queries
Local Memory: the filter answers each query with NO or Probably YES
Slow Devices: accessed only when the answer is Probably YES
Existing filters only support point filtering
Point Filtering
Bloom Filter (1970)
Quotient Filter (2012)
Cuckoo Filter (2014)
SELECT * FROM Billionaires WHERE LastName = 'Pavlo'
Range Filtering
SELECT * FROM Billionaires WHERE LastName LIKE 'Pav%'
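To illustrate why a point filter cannot serve the LIKE 'Pav%' query, here is a minimal Bloom filter sketch. The class name, the sizes, and the use of salted SHA-256 digests as hash functions are illustrative assumptions, not details from the talk.

```python
import hashlib


class BloomFilter:
    """Toy Bloom filter: k hash probes into an m-bit array (illustrative only)."""

    def __init__(self, num_bits: int, num_hashes: int) -> None:
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, key: str):
        # Derive k bit positions from salted SHA-256 digests (an assumed choice).
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key: str) -> None:
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def may_contain(self, key: str) -> bool:
        # False means "definitely absent"; True means "probably present".
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(key)
        )


bf = BloomFilter(num_bits=1024, num_hashes=7)
for name in ["Pavlo", "Andersen", "Kaminsky"]:
    bf.add(name)

print(bf.may_contain("Pavlo"))  # True: an inserted key is never rejected
```

Because hashing scatters "Pavlo" and "Pavlov" to unrelated bit positions, the filter has nothing to probe for the prefix 'Pav%' short of testing every possible key with that prefix.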
Our solution: Succinct Range Filters (SuRF)
First practical, general-purpose range filter
SMALL: close to theoretic minimum (64-bit integer keys, 1% false positive rate: ≈12 bits per key)
FAST: comparable to fastest trees (10 million 64-bit integer keys: ≈200 ns per query)
USEFUL: evaluated in RocksDB (speeds up range queries by up to 5x)
Starting point: a complete trie
[Figure: complete trie with the shared path S, I, G followed by the branches M-O-D, K-D-D, and O-P-S, i.e. the keys SIGMOD, SIGKDD, SIGOPS]
TOO BIG
Make it smaller: a truncated trie
[Figure: the complete trie next to its truncated form, which keeps the shared path S, I, G and only the first distinguishing label of each key: M, K, O]
Example lookups: SIGMOD, SIGMETRICS
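The truncation step can be sketched as follows, assuming the three keys read off the trie figure (SIGMOD, SIGKDD, SIGOPS). The function names are hypothetical, and a flat list of prefixes stands in for the actual trie.

```python
def lcp_len(a: str, b: str) -> int:
    """Length of the longest common prefix of a and b."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def truncated_prefixes(keys):
    """Shortest prefix of each key that distinguishes it from its sorted neighbours."""
    keys = sorted(set(keys))
    prefixes = []
    for i, key in enumerate(keys):
        shared = 0
        if i > 0:
            shared = max(shared, lcp_len(keys[i - 1], key))
        if i + 1 < len(keys):
            shared = max(shared, lcp_len(key, keys[i + 1]))
        prefixes.append(key[: shared + 1])  # slicing caps at the key length
    return prefixes


def may_contain(prefixes, query: str) -> bool:
    # A query passes if it extends any stored prefix; the discarded tail is never checked.
    return any(query.startswith(p) for p in prefixes)


prefixes = truncated_prefixes(["SIGMOD", "SIGKDD", "SIGOPS"])
print(prefixes)                             # ['SIGK', 'SIGM', 'SIGO']
print(may_contain(prefixes, "SIGMOD"))      # True (stored key)
print(may_contain(prefixes, "SIGMETRICS"))  # True (false positive: shares SIGM)
print(may_contain(prefixes, "SIGCOMM"))     # False
```

Truncation never drops a stored key, so there are still no false negatives; the cost is that any query sharing a stored prefix, such as SIGMETRICS, is accepted.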
Use suffix bits to reduce false positive rate
[Figure: the truncated trie (S, I, G, then leaves M, K, O) shown twice, once per suffix type]
Hashed Suffix Bits: the leaves store 0x20, 0xC8, 0x06
Lookup SIGMETRICS: its hash 0x18 does not match the stored hash bits
Each bit reduces FPR by half
Cannot help range queries
Real Suffix Bits: the leaves store O, D, P
Lookup SIGMETRICS: its next character E does not match the stored suffix
Benefit point & range queries
Weaker distinguishability
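The trade-off between the two suffix types can be sketched as below. This is a toy under stated assumptions: the leaves are hard-coded from the truncated-trie example, the real suffix is one character per key, and toy_hash8 (a byte sum) is a deliberately simple stand-in for a real hash function, so its values differ from the 0x20, 0xC8, 0x06 on the slide.

```python
def toy_hash8(key: str) -> int:
    """Hypothetical 8-bit hash: sum of the key's bytes modulo 256."""
    return sum(key.encode()) % 256


# (truncated prefix, full key) pairs assumed from the trie example.
LEAVES = [("SIGK", "SIGKDD"), ("SIGM", "SIGMOD"), ("SIGO", "SIGOPS")]

hashed = {prefix: toy_hash8(key) for prefix, key in LEAVES}
real = {prefix: key[len(prefix):len(prefix) + 1] for prefix, key in LEAVES}


def find_leaf(query: str):
    for prefix, _ in LEAVES:
        if query.startswith(prefix):
            return prefix
    return None


def point_hashed(query: str) -> bool:
    leaf = find_leaf(query)
    return leaf is not None and hashed[leaf] == toy_hash8(query)


def point_real(query: str) -> bool:
    leaf = find_leaf(query)
    return leaf is not None and query[len(leaf):len(leaf) + 1] == real[leaf]


def range_real(lo: str, hi: str) -> bool:
    # Real suffixes extend each stored prefix, which tightens range checks.
    for prefix, _ in LEAVES:
        p = prefix + real[prefix]
        if lo[:len(p)] <= p <= hi:
            return True
    return False


print(point_hashed("SIGMETRICS"))         # False: 0xFA != 0xC3
print(point_real("SIGMETRICS"))           # False: 'E' != 'O'
print(point_real("SIGMOBILE"))            # True: shares the suffix 'O' (false positive)
print(point_hashed("SIGMOBILE"))          # False: hash bits differ
print(range_real("SIGMA", "SIGMETRICS"))  # False: SIGMO sorts after SIGMETRICS
```

The SIGMOBILE lookup shows the weaker distinguishability of real suffixes: keys that share the stored characters always collide, whereas hash bits collide only by chance. In exchange, real suffixes preserve key order, which is what lets the range check reject ["SIGMA", "SIGMETRICS"].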
Succinct Data Structure
… uses an amount of space that is “close” to the information-theoretic lower bound, but still allows efficient query operations. [wikipedia]
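To make the "information-theoretic lower bound" concrete for tries, the sketch below counts the distinct tries with n nodes and takes the base-2 logarithm. The fanout of 256 (one byte per label) and the function names are assumptions for illustration; the counting formula is the standard Fuss-Catalan count of k-ary trees.

```python
from math import comb, log2


def exact_bound_bits(n: int, k: int = 256) -> float:
    """log2 of the number of distinct k-ary tries with n nodes."""
    count = comb(k * n, n) // ((k - 1) * n + 1)  # Fuss-Catalan number
    return log2(count)


def asymptotic_bits_per_node(k: int = 256) -> float:
    """Large-n limit of the bound, in bits per node."""
    return k * log2(k) - (k - 1) * log2(k - 1)


print(round(asymptotic_bits_per_node(), 2))  # 9.44
print(exact_bound_bits(1000) / 1000)         # slightly below the limit for finite n
```

An encoding is called succinct when its size stays within a lower-order term of this bound while still supporting queries directly on the encoded form.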
SuRF’s encoding is small and fast
Small: ≈10 + suffix bits per key for 64-bit integers; ≈14 + suffix bits per key for emails
Fast: matches state-of-the-art pointer-based trees
Bloom filters speed up point queries in RocksDB
[Figure: LSM-tree levels, each an SSTable with a Bloom filter (B) held in the filter cache]
LN-2: …, 6, 20, …
LN-1: …, 12, 21, …
LN: …, 11, 19, …
Cached Filters: B, B, B, …
GET(16): NO
Bloom filters can’t help range queries in RocksDB
[Figure: the same LSM-tree levels and cached Bloom filters (B)]
Cached Filters: B, B, B, …
SEEK(14, 18)
SuRFs can benefit both point and range queries
[Figure: the same LSM-tree levels, with a SuRF (S) cached per SSTable in place of each Bloom filter]
Cached Filters: S, S, S, …
GET(16) and SEEK(14, 18): NO
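The effect of a per-SSTable range filter can be sketched with the sample keys from the figure. ExactRangeFilter is a hypothetical stand-in that keeps the sorted keys themselves (so it has no false positives), not SuRF's actual encoding; the level names and the assumption that SEEK bounds are inclusive are also illustrative.

```python
from bisect import bisect_left


class ExactRangeFilter:
    """Stand-in range filter backed by the sorted keys (illustrative only)."""

    def __init__(self, keys) -> None:
        self.keys = sorted(keys)

    def may_contain_range(self, lo: int, hi: int) -> bool:
        # Find the smallest key >= lo and check that it does not exceed hi.
        i = bisect_left(self.keys, lo)
        return i < len(self.keys) and self.keys[i] <= hi

    def may_contain(self, key: int) -> bool:
        return self.may_contain_range(key, key)


# Only the keys visible on the slide; the real SSTables hold more.
filters = {
    "L(N-2)": ExactRangeFilter([6, 20]),
    "L(N-1)": ExactRangeFilter([12, 21]),
    "L(N)": ExactRangeFilter([11, 19]),
}


def sstables_to_read(lo: int, hi: int):
    """SSTables that must be fetched from disk for a query over [lo, hi]."""
    return [name for name, f in filters.items() if f.may_contain_range(lo, hi)]


print(sstables_to_read(16, 16))  # []  GET(16): every filter answers NO
print(sstables_to_read(14, 18))  # []  SEEK(14, 18): no SSTable is read
print(sstables_to_read(10, 13))  # ['L(N-1)', 'L(N)']
```

With point-only filters, the SEEK(14, 18) case would have to read one block from every level before discovering that the range is empty.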
Evaluation setup: a time-series benchmark
Key: 64-bit timestamp + 64-bit sensor ID
Value: 1KB payload
Queries: GET(t), SEEK(t1, t2)
[Figure: time axis marking t for GET(t) and the interval t1 to t2 for SEEK(t1, t2)]
Filter Config
Bloom filter: 14 bits per key
SuRF: 4-bit real suffix
System Config
Dataset: ≈100 GB on SSD
DRAM: 32 GB
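One way to lay out such a key is sketched below, assuming the two 64-bit fields are concatenated in big-endian byte order so that bytewise comparison matches (timestamp, sensor ID) order. The function names and the inclusive SEEK bounds are assumptions, not details from the benchmark.

```python
import struct


def encode_key(timestamp: int, sensor_id: int) -> bytes:
    """16-byte key: big-endian timestamp followed by big-endian sensor ID."""
    return struct.pack(">QQ", timestamp, sensor_id)


def seek_bounds(t1: int, t2: int):
    """Byte-string bounds covering every sensor ID for timestamps in [t1, t2]."""
    return encode_key(t1, 0), encode_key(t2, 2**64 - 1)


a = encode_key(1000, 7)
b = encode_key(1001, 3)
lo, hi = seek_bounds(1000, 1000)

print(len(a))         # 16
print(a < b)          # True: ordered by timestamp first
print(lo <= a <= hi)  # True: the key falls inside SEEK(1000, 1000)
```

Little-endian packing would break this property, because the low-order byte would dominate the comparison.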
SuRFs still benefit point queries in RocksDB
[Figure: bar chart, all-false point queries. Throughput (Kops/s), axis 0 to 40, for No Filter, Bloom Filter, and SuRF; annotated "Worst-case Gap".]
SuRFs speed up range queries in RocksDB
[Figure: line chart. Throughput (Kops/s), axis 0 to 10, versus percent of queries with empty results (10 to 99), for SuRF and for No Filter/Bloom Filter; annotated "5x".]
Conclusion
SuRF is a fast and compact data structure optimized for range filtering
github.com/efficient/SuRF
rangefilter.io [Demo]
Backup Slides
Comparing ARF to SuRF
Experiment: insert 5M 64-bit integers, 10M Zipf-distributed range queries (ARF uses 2M queries for training)

                                   ARF     SuRF    Improvement
Bits per Key (held constant)       14      14      -
Range Query Throughput (Mops/s)    0.16    3.3     20x
False Positive Rate                25.7    2.2     12x
Build Time (s)                     118     1.2     98x
Build Memory (GB)                  26      0.02    1300x
Training Time (s)                  117     N/A     N/A
Training Throughput (Mops/s)       0.02    N/A     N/A
LOUDS-Sparse encoding example
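[Figure: example trie. The root has labels a and i; the child of a has labels d, h, t; the child of i has labels f, t; the five leaves hold values v1 to v5.]
LOUDS-Sparse
Label:      a  i  d  h  t  f  t
Has-child:  1  1  0  0  0  0  0
Structure:  1  0  1  0  0  1  0
Value:      v1 v2 v3 v4 v5
Space: 10N bits; Theoretic Limit ≈ 9.4N bits
moveToChild(p) = select(S, rank(HC, p) + 1)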
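The moveToChild formula can be traced on this example with a small sketch. It assumes the example reads as a root with labels a, i whose children hold d, h, t and f, t, with 0-based positions and an inclusive rank. The linear-time rank1 and select1 helpers are for illustration only; practical implementations answer both in constant time with precomputed tables.

```python
LABELS = list("aidhtft")
HAS_CHILD = [1, 1, 0, 0, 0, 0, 0]  # HC: 1 if the label leads to a child node
LOUDS = [1, 0, 1, 0, 0, 1, 0]      # S: 1 marks the first label of each node
VALUES = ["v1", "v2", "v3", "v4", "v5"]


def rank1(bits, pos):
    """Number of 1s in bits[0..pos], inclusive."""
    return sum(bits[: pos + 1])


def select1(bits, count):
    """Position of the count-th 1 (count is 1-based)."""
    seen = 0
    for i, b in enumerate(bits):
        seen += b
        if b and seen == count:
            return i
    raise ValueError("fewer than count set bits")


def move_to_child(pos):
    return select1(LOUDS, rank1(HAS_CHILD, pos) + 1)


def node_end(start):
    """One past the last label of the node that starts at position start."""
    end = start + 1
    while end < len(LOUDS) and LOUDS[end] == 0:
        end += 1
    return end


def lookup(key):
    start = 0
    for depth, ch in enumerate(key):
        span = range(start, node_end(start))
        pos = next((p for p in span if LABELS[p] == ch), None)
        if pos is None:
            return None
        last = depth == len(key) - 1
        if HAS_CHILD[pos]:
            if last:
                return None  # key ends on an internal label
            start = move_to_child(pos)
        else:
            # Leaves are numbered by counting the 0 bits of HC before pos.
            return VALUES[pos - rank1(HAS_CHILD, pos)] if last else None
    return None


print(move_to_child(0))  # 2: the child of 'a' starts at label 'd'
print(move_to_child(1))  # 5: the child of 'i' starts at label 'f'
print(lookup("at"))      # v3
print(lookup("it"))      # v5
print(lookup("ax"))      # None
```

Each label costs 8 bits plus one Has-child bit and one Structure bit, which is where the 10N figure comes from.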
LOUDS-DS trades small space for performance
LOUDS-Dense: Hot
LOUDS-Sparse: Cold
space overhead: < 1%
speed-up: 3x