Indexing Delight
Thinking Cap of Fractal-tree Indexes

BohuTANG @2012 / [email protected]
B-tree
Invented in 1972, 40 years ago!
B-tree

[Diagram: a B-tree with root Block0, internal nodes Block1/Block2/Block3, and leaves Block4/Block5]

File on disk: ... Block0 ... ... Block3 ... Block5 ...
B-tree Insert

[Diagram, 3 animation steps: inserting x walks root-to-leaf, Block0 -> Block3 -> Block5, with a random seek at each level]

File on disk: ... Block0 ... ... Block3 ... Block5 ...

Insert one item causes many random seeks!
B-tree Search

[Diagram: searching for x walks root-to-leaf, Block0 -> Block3 -> Block5, one seek per level]

Query is fast: the I/O cost is O(log_B N)
B-tree Conclusions

● Search: O(log_B N) block transfers.
● Insert: O(log_B N) block transfers (slow).
● B-tree range queries are slow.
● IMPORTANT: parent and child blocks are scattered sparsely on disk.
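As a rough back-of-the-envelope check of the O(log_B N) bound, here is a minimal Python sketch. The 1-billion-key and fanout-256 numbers are illustrative assumptions (roughly a 4KB block holding 16-byte entries), not figures from the slides:

```python
import math

def btree_block_transfers(n_keys, fanout):
    """Worst-case block transfers for one B-tree point query: O(log_B N)."""
    return math.ceil(math.log(n_keys, fanout))

# Hypothetical example: 1 billion keys, fanout 256.
# Only a handful of block reads per search -- but an insert pays the
# same root-to-leaf walk, and each block read may be a random seek.
print(btree_block_transfers(10**9, 256))  # 4
```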
A Simplified Fractal-tree
Cache-Oblivious Lookahead Array (COLA), invented by MITers
COLA

[Diagram: log_2 N sorted arrays of doubling sizes]

Binary search in each level costs O(log_2 N); searching all log_2 N levels costs O(log_2^2 N).
COLA (Using Fractional Cascading)

[Diagram: log_2 N sorted arrays of doubling sizes, linked by fractional-cascading pointers]

● Search: O(log_2 N) block transfers.
● Insert: O((1/B) log_2 N) amortized block transfers.
● Data is stored in log_2 N arrays of sizes 2, 4, 8, 16, ...
● Behaves like a balanced binary search tree.
COLA Conclusions

● Search: O(log_2 N) block transfers (using Fractional Cascading).
● Insert: O((1/B) log_2 N) amortized block transfers.
● Data is stored in log_2 N arrays of sizes 2, 4, 8, 16, ...
● Behaves like a balanced binary search tree.
● Lookahead (prefetch) friendly; data-intensive!
● BUT the bottom level grows bigger and bigger, so merges become expensive.
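The insert path above can be sketched as a binary-counter-style merge of doubling arrays. This is a toy in-memory model for illustration only (not nessDB or TokuDB code), and `sorted()` stands in for the linear-time merge a real COLA would use:

```python
from bisect import bisect_left

class COLA:
    """Toy Cache-Oblivious Lookahead Array: log_2 N sorted arrays of
    sizes 1, 2, 4, ...; inserting merges full levels downward, like
    carrying in binary addition."""
    def __init__(self):
        self.levels = []  # levels[k] is None or a sorted list of length 2**k

    def insert(self, key):
        carry = [key]
        k = 0
        while True:
            if k == len(self.levels):
                self.levels.append(None)
            if self.levels[k] is None:
                self.levels[k] = carry
                return
            # Level k is full: merge it into the carry and push down.
            # This is the amortized O((1/B) log N) part -- and why the
            # deepest (biggest) level makes merges expensive.
            carry = sorted(self.levels[k] + carry)
            self.levels[k] = None
            k += 1

    def search(self, key):
        # Without fractional cascading: binary-search every level, O(log^2 N).
        for lvl in self.levels:
            if lvl:
                i = bisect_left(lvl, key)
                if i < len(lvl) and lvl[i] == key:
                    return True
        return False

c = COLA()
for x in [5, 3, 9, 1, 7]:
    c.insert(x)
print(c.search(7), c.search(2))  # True False
```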
COLA vs B-tree

● Search: (log_2 N) / (log_B N) = log_2 B times slower than B-tree (in theory).
● Insert: (log_B N) / ((1/B) log_2 N) = B / (log_2 B) times faster than B-tree (in theory).

If B = 4KB:
COLA search is 12 times slower than B-tree.
COLA insert is 341 times faster than B-tree.
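The 12x and 341x figures follow directly from the two ratios; a quick arithmetic check, treating B = 4096 as the slide does:

```python
import math

B = 4096  # the slide's "4KB" block, treated as entries per block
search_slowdown = math.log2(B)       # (log_2 N)/(log_B N) = log_2 B
insert_speedup = B / math.log2(B)    # (log_B N)/((1/B) log_2 N) = B / log_2 B
print(round(search_slowdown), round(insert_speedup))  # 12 341
```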
LSM-tree

[Diagram: an in-memory buffer on top, feeding a cascade of on-disk levels of buffers]

● Lazy insertion; data is sorted before being pushed down.
● Level_i is the buffer of Level_i+1.
● Search: O(log_B N) * O(log N)
● Insert: O((log_B N)/B)
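The lazy-insertion idea can be modeled with a dict as the in-memory buffer and sorted lists standing in for on-disk runs. The class name, flush threshold, and run layout are all made up for illustration:

```python
from bisect import bisect_left

class TinyLSM:
    """Toy LSM-tree: writes land in an in-memory buffer, which is sorted
    and flushed to an on-"disk" run when full. No compaction policy here."""
    def __init__(self, memtable_limit=4):
        self.memtable = {}
        self.limit = memtable_limit
        self.runs = []  # newest first; each run is a sorted list of (key, value)

    def put(self, key, value):
        self.memtable[key] = value       # lazy insertion: O(1), in memory
        if len(self.memtable) >= self.limit:
            self.flush()

    def flush(self):
        run = sorted(self.memtable.items())  # sorted before hitting "disk"
        self.runs.insert(0, run)
        self.memtable = {}

    def get(self, key):
        if key in self.memtable:
            return self.memtable[key]
        for run in self.runs:            # O(log N) runs, binary search in each
            keys = [k for k, _ in run]
            i = bisect_left(keys, key)
            if i < len(keys) and keys[i] == key:
                return run[i][1]
        return None

db = TinyLSM()
for i in range(10):
    db.put(i, i * i)
print(db.get(7), db.get(42))  # 49 None
```

Note how `get` must consult every run in the worst case; that multiplicative O(log N) factor is exactly what fractional cascading removes on the next slide.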
LSM-tree (Using Fractional Cascading)

[Diagram: in-memory buffer on top of cascaded on-disk levels, linked by fractional-cascading pointers]

● Search: O(log_B N) (using FC).
● Insert: O((log_B N)/B).
● A 0.618 Fractal-tree? But NOT cache-oblivious...
LSM-tree (Merging)

[Diagram: merges running between every pair of adjacent levels while the in-memory buffer sleeps]

A lot of I/O is wasted during merging!
Like a headless fly buzzing around aimlessly...
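The level-to-level merging itself is just a k-way merge of sorted runs; in Python, `heapq.merge` streams all inputs in one pass. The slide's complaint is that an LSM-tree repeats such merges level after level, re-reading the same data many times:

```python
import heapq

# Three sorted runs, as produced by earlier flushes (toy data).
runs = [[1, 4, 7], [2, 5, 8], [3, 6, 9]]

# One streaming pass over all runs produces the next, bigger run.
merged = list(heapq.merge(*runs))
print(merged)  # [1, 2, 3, 4, 5, 6, 7, 8, 9]
```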
Fractal-tree Indexes
Just fractal. Patented by Tokutek...
Fractal-tree Indexes

Search: O(log_B N)
Insert: O((log_B N)/B) (amortized)

Search is the same as a B-tree, but insert is faster than a B-tree.
Fractal-tree Indexes (Block size)

[Diagram, animation: B is 4MB. When a block fills up, its contents are pushed down into child blocks, which fill up and split in turn, recursively]

Fractal! 4MB, one seek...
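The fill-then-flush behavior can be sketched with a node that buffers incoming keys and pushes a full buffer down to its children in a single pass. The names, pivot values, and the tiny buffer limit are illustrative; a real fractal-tree block would be on the order of 4MB, so each flush costs one seek:

```python
from bisect import bisect_right

class BufferedNode:
    """Toy fractal-tree node: a pivot-partitioned node with a message buffer.
    Inserts just append to the buffer; a full buffer is flushed to the
    children in one pass."""
    def __init__(self, pivots=None, buffer_limit=4):
        self.buffer = []
        self.limit = buffer_limit
        self.pivots = pivots or []   # k pivots -> k + 1 children
        self.children = ([BufferedNode(buffer_limit=buffer_limit)
                          for _ in range(len(self.pivots) + 1)]
                         if pivots else [])

    def insert(self, key):
        self.buffer.append(key)      # O(1): land in this block's buffer
        if len(self.buffer) >= self.limit and self.children:
            self.flush()

    def flush(self):
        # One pass pushes every buffered key one level down --
        # amortizing the block write over many inserts.
        for key in self.buffer:
            self.children[bisect_right(self.pivots, key)].insert(key)
        self.buffer = []

root = BufferedNode(pivots=[10, 20])
for x in [3, 15, 25, 8, 12, 30]:
    root.insert(x)
print(root.children[0].buffer, root.children[1].buffer, root.children[2].buffer)
```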
Bε-tree
Just a constant factor on block fanout...
Bε-tree

[Figure: insert/search trade-off curve. B-tree (ε=1): fast search, slow inserts. AOF (append-only file): fast inserts, slow search. ε=1/2 sits on the optimal curve between them]
Bε-tree

               insert              search
B-tree (ε=1)   O(log_B N)          O(log_B N)
ε=1/2          O((log_B N)/√B)     O(log_B N)
ε=0            O((log N)/B)        O(log N)

If we want optimal point queries plus very fast inserts, we should choose ε=1/2.
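Plugging the table's formulas into concrete numbers (N = 10^9 and B = 4096 are assumed values, not from the slides; the helper only covers ε > 0, since the ε=0 row uses a different formula):

```python
import math

def be_tree_costs(n, block, eps):
    """Asymptotic block-transfer estimates for a B^epsilon-tree, eps > 0:
    insert ~ log_B N / B^(1 - eps),  search ~ (1/eps) * log_B N."""
    log_b_n = math.log(n, block)
    return log_b_n / (block ** (1 - eps)), log_b_n / eps

n, B = 10**9, 4096
for eps in (1.0, 0.5):
    ins, srch = be_tree_costs(n, B, eps)
    print(f"eps={eps}: insert ~ {ins:.4f}, search ~ {srch:.2f} block transfers")
```

For ε=1/2 the search cost only doubles versus a B-tree (still O(log_B N)), while the insert cost drops by a factor of √B = 64: the "optimal point queries + very fast inserts" point on the curve.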
Bε-tree

So, if the block size is B, the fanout should be √B.
Cache-Oblivious Data Structures
All of the above are JUST cache-oblivious data structures...
Cache-Oblivious Data Structures

Question:
Reading a sequence of k consecutive blocks at once is not much more expensive than reading a single block. How can we take advantage of this?
Cache-Oblivious Data Structures

My Questions (translated from Chinese):
Q1: With only 1MB of memory, how do you merge two 64MB sorted files into one sorted file?
Q2: On most mechanical disks, reading several consecutive blocks costs about the same as reading a single block. How can Q1 exploit this?
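One possible answer to Q1 and Q2, sketched in Python: stream both files through large sequential read buffers and write the merge in one pass. The function name, file names, and buffer size are illustrative (three 1MB buffers is looser than Q1's strict 1MB budget, but the shape of the answer is the same), and the demo uses tiny files:

```python
import heapq

def merge_sorted_files(path_a, path_b, path_out, buf=1 << 20):
    """Merge two sorted files of lines using constant memory beyond the
    I/O buffers. Large per-file buffers exploit Q2: reading many
    consecutive blocks costs about the same as reading one."""
    with open(path_a, buffering=buf) as a, \
         open(path_b, buffering=buf) as b, \
         open(path_out, "w", buffering=buf) as out:
        out.writelines(heapq.merge(a, b))  # streaming 2-way merge of lines

# Tiny demo (Q1 would use two 64MB inputs; the code is the same):
with open("a.txt", "w") as f:
    f.write("1\n3\n5\n")
with open("b.txt", "w") as f:
    f.write("2\n4\n6\n")
merge_sorted_files("a.txt", "b.txt", "out.txt")
print(open("out.txt").read().split())  # ['1', '2', '3', '4', '5', '6']
```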
nessDB
https://github.com/shuttler/nessDB
You should agree that the VFS does caching better than you can yourself!
nessDB

[Diagram: a tree of blocks]

Each block is a small, splittable tree.
nessDB, What's going on?

[Diagram: blocks laid out in a plane rather than along a single line]

From the line to the plane...
Thanks!
Most of the references are from: Tokutek, MIT CSAIL & Stony Brook.

Drafted by BohuTANG using Google Drive, @2012/12/12