VLDB 2009 Tutorial Column-Oriented Database Systems
Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)
Column-Oriented Database Systems
Part 1: Stavros Harizopoulos (HP Labs)
Part 2: Daniel Abadi (Yale)
Part 3: Peter Boncz (CWI)
VLDB 2009
Tutorial
What is a column-store?
row-store: Date, Store, Product, Customer, Price stored together, record by record
+ easy to add/modify a record
- might read in unnecessary data
column-store: Date, Store, Product, Customer, Price each stored separately
+ only need to read in relevant data
- tuple writes require multiple accesses
=> suitable for read-mostly, read-intensive, large data repositories
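The trade-off above can be sketched in a few lines of Python. This is only an illustration under simple assumptions: the table is a toy in-memory list, the sample values are made up, and the helper names (`to_columns`, `total_price_row_store`, `total_price_column_store`, `append_record`) are hypothetical. It counts how many values each layout touches for a single-attribute aggregate and how many separate writes an insert needs.

```python
# Toy table with the five attributes named on the slide; values are made up.
ATTRS = ("date", "store", "product", "customer", "price")

# Row-store: each record is kept together.
rows = [
    ("01/01", "BOS", "Table", "Mesa", 20),
    ("01/01", "NYC", "Chair", "Lutz", 13),
    ("01/01", "BOS", "Bed", "Mudd", 79),
]


def to_columns(rows):
    """Column-store: one list per attribute, aligned by position."""
    return {name: [r[i] for r in rows] for i, name in enumerate(ATTRS)}


columns = to_columns(rows)


def total_price_row_store(rows):
    # Every record is read in full even though only `price` is needed.
    values_read = 0
    total = 0
    price_idx = ATTRS.index("price")
    for r in rows:
        values_read += len(r)
        total += r[price_idx]
    return total, values_read


def total_price_column_store(columns):
    # Only the `price` list is read.
    price = columns["price"]
    return sum(price), len(price)


def append_record(rows, columns, record):
    # One write for the row-store, one write per attribute for the column-store.
    rows.append(record)
    for name, value in zip(ATTRS, record):
        columns[name].append(value)
    return 1, len(ATTRS)
```

Both scans return the same total, but the row layout touches 15 values where the column layout touches 3; the insert needs 1 write versus 5.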
Are these two fundamentally different?
• The only fundamental difference is the storage layout
• However: we need to look at the big picture
[Timeline: ‘70s, ‘80s, ‘90s, ‘00s, today. row-stores, then row-stores++, then row-stores++; different storage layouts proposed; new applications, new bottleneck in hardware; column-stores; converge?]
• How did we get here, and where are we heading? (Part 1)
• What are the column-specific optimizations? (Part 2)
• How do we improve CPU efficiency when operating on Cs (Part 3)
Outline
• Part 1: Basic concepts (Stavros)
  • Introduction to key features
  • From DSM to column-stores and performance tradeoffs
  • Column-store architecture overview
  • Will rows and columns ever converge?
• Part 2: Column-oriented execution (Daniel)
• Part 3: MonetDB/X100 and CPU efficiency (Peter)
Telco Data Warehousing example
• Typical DW installation
• Real-world example
[Figure: star schema with fact table usage and dimension tables source, toll, account; storage label "or RAM"]

QUERY 2
  SELECT account.account_number,
         sum (usage.toll_airtime),
         sum (usage.toll_price)
  FROM usage, toll, source, account
  WHERE usage.toll_id = toll.toll_id
    AND usage.source_id = source.source_id
    AND usage.account_id = account.account_id
    AND toll.type_ind in ('AE', 'AA')
    AND usage.toll_price > 0
    AND source.type != 'CIBER'
    AND toll.rating_method = 'IS'
    AND usage.invoice_date = 20051013
  GROUP BY account.account_number

           Column-store   Row-store
  Query 1  2.06           300
  Query 2  2.20           300
  Query 3  0.09           300
  Query 4  5.24           300
  Query 5  2.88           300

Why? Three main factors (next slides)
“One Size Fits All? - Part 2: Benchmarking Results” Stonebraker et al. CIDR 2007
Telco example explained (1/3): read efficiency
row store: read pages containing entire rows
  • one row = 212 columns!
  • is this typical? (it depends)
column store: read only columns needed
  • in this example: 7 columns
caveats:
  • “select *” not any faster
  • clever disk prefetching
  • clever tuple reconstruction
What about vertical partitioning? (it does not work with ad-hoc queries)
Telco example explained (2/3): compression efficiency
• Columns compress better than rows
  • Typical row-store compression ratio 1 : 3
  • Column-store 1 : 10
• Why?
  • Rows contain values from different domains => more entropy, difficult to dense-pack
  • Columns exhibit significantly less entropy
  • Examples:
      Male, Female, Female, Female, Male
      1998, 1998, 1999, 1999, 1999, 2000
• Caveat: CPU cost (use lightweight compression)
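To make the entropy argument concrete, the sketch below computes the order-0 Shannon entropy (bits per value) of the two example columns and of a stream that mixes both domains, as a row-oriented page would. It is a rough proxy only: real compressors exploit more than symbol frequencies, and the function name `entropy_bits` is an assumption of this sketch.

```python
from collections import Counter
from math import log2


def entropy_bits(values):
    """Shannon entropy, in bits per value, of the empirical distribution."""
    counts = Counter(values)
    n = len(values)
    return -sum((c / n) * log2(c / n) for c in counts.values())


# The two example columns from the slide.
gender = ["Male", "Female", "Female", "Female", "Male"]
year = [1998, 1998, 1999, 1999, 1999, 2000]

# A row-oriented page mixes values from both domains in one stream.
mixed = gender + year

h_gender = entropy_bits(gender)
h_year = entropy_bits(year)
h_mixed = entropy_bits(mixed)
```

For these values the mixed stream needs more bits per value than either column on its own, which is the effect the slide attributes to rows containing values from different domains.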
Telco example explained (3/3): sorting & indexing efficiency
• Compression and dense-packing free up space
  • Use multiple overlapping column collections
  • Sorted columns compress better
  • Range queries are faster
  • Use sparse clustered indexes
What about heavily-indexed row-stores? (works well for single column access, cross-column joins become increasingly expensive)
Additional opportunities for column-stores
• Block-tuple / vectorized processing (more in Part 3)
  • Easier to build block-tuple operators
    • Amortizes function-call cost, improves CPU cache performance
  • Easier to apply vectorized primitives
    • Software-based: bitwise operations
    • Hardware-based: SIMD
• Opportunities with compressed columns (more in Part 2)
  • Avoid decompression: operate directly on compressed
  • Delay decompression (and tuple reconstruction)
    • Also known as: late materialization
• Exploit columnar storage in other DBMS components
  • Physical design (both static and dynamic). See: Database Cracking, from CWI
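As a small illustration of "operate directly on compressed", the sketch below assumes a column stored as run-length triples (value, start_pos, run_length) with 1-based positions. It computes an aggregate and evaluates an equality predicate per run rather than per value. The function names are hypothetical; this is not code from any of the systems discussed.

```python
# Each run is (value, start_pos, run_length); positions are 1-based.
runs = [(1, 1, 5), (2, 6, 2), (1, 8, 3), (3, 11, 1)]


def decompress(runs):
    """Expand the runs back into one value per position."""
    out = []
    for value, _start, length in runs:
        out.extend([value] * length)
    return out


def sum_on_compressed(runs):
    # One multiply-add per run instead of one add per value.
    return sum(value * length for value, _start, length in runs)


def positions_equal(runs, target):
    # The predicate is evaluated once per run. The result is a list of
    # (first_pos, last_pos) ranges that later operators can consume before
    # any tuple is reconstructed.
    return [(start, start + length - 1)
            for value, start, length in runs if value == target]
```

`sum_on_compressed(runs)` gives the same answer as summing `decompress(runs)`, and `positions_equal(runs, 1)` yields position ranges that can be passed downstream, which is the idea behind delaying decompression and tuple reconstruction.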
Effect on C-Store performance
[Figure: Time (sec), average for SSBM queries on C-Store. Bar labels: original C-Store; enable late materialization; enable compression & operate on compressed; column-oriented join algorithm]
“Column-Stores vs Row-Stores: How Different are They Really?” Abadi, Hachem, and Madden. SIGMOD 2008.
Summary of column-store key features
• Storage layout
  • columnar storage
  • header/ID elimination
  • compression
  • multiple sort orders
• Execution engine
  • column operators
  • avoid decompression
  • late materialization
  • vectorized operations (Part 3)
• Design tools, optimizer
Each feature is covered in Part 1, Part 2, or Part 3.
From DSM to Column-stores
70s - 1985:
TOD: Time Oriented Database – Wiederhold et al. "A Modular, Self-Describing Clinical Databank System," Computers and Biomedical Research, 1975
More 1970s: Transposed files, Lorie, Batory,
Cantor - Column Store System early 80s
• “An overview of cantor: a new system for data analysis” Karasalo, Svensson, SSDBM 1983
• “The Design of Cantor - a new system for data analysis” Karasalo, Svensson, SSDBM 1986
[Figure: 1983 vector graphics]
• Dynamic programming algorithm to choose compression method and parameters:
  • zero suppression
  • delta coding
  • RLE
  • delta RLE
From DSM to Column-stores
70s - 1985:
  TOD: Time Oriented Database – Wiederhold et al. "A Modular, Self-Describing Clinical Databank System," Computers and Biomedical Research, 1975
  More 1970s: Transposed files, Lorie, Batory, Svensson.
  “An overview of cantor: a new system for data analysis” Karasalo, Svensson, SSDBM 1983
1985: DSM paper
  “A decomposition storage model” Copeland and Khoshafian. SIGMOD 1985.
1990s: Commercialization through SybaseIQ
Late 90s – 2000s: Focus on main-memory performance
  • DSM “on steroids” [1997 – now]. CWI: MonetDB
  • Hybrid DSM/NSM [2001 – 2004]. Wisconsin: PAX, Fractured Mirrors; Michigan: Data Morphing; CMU: Clotho
2005 – : Re-birth of read-optimized DSM as “column-store”
  MIT: C-Store; CWI: MonetDB/X100; 10+ startups
The original DSM paper
• Proposed as an alternative to NSM
• 2 indexes: clustered on ID, non-clustered on value
• Speeds up queries projecting few columns
• Requires more storage
“A decomposition storage model” Copeland and Khoshafian. SIGMOD 1985.
[Figure: one DSM column stored as (ID, value) pairs]
  ID:    1     2     3     4  ..
  value: 0100  0962  1000  ..
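The decomposition can be sketched as follows. This is a minimal in-memory illustration, not the paper's implementation: each attribute becomes a binary (ID, value) relation kept in two orders, standing in for the clustered-on-ID and non-clustered-on-value indexes. The names `decompose`, `project`, and `ids_with_value` and the sample data are assumptions of this sketch.

```python
import bisect


def decompose(rows, attrs):
    """Split an NSM table into one binary (ID, value) relation per attribute."""
    relations = {}
    for col, name in enumerate(attrs):
        by_id = [(tid, row[col]) for tid, row in enumerate(rows, start=1)]
        by_value = sorted(by_id, key=lambda pair: (pair[1], pair[0]))
        relations[name] = {"by_id": by_id, "by_value": by_value}
    return relations


def project(relations, names, tid):
    """Fetch only the requested attributes of one tuple via the ID-ordered copy."""
    result = []
    for name in names:
        by_id = relations[name]["by_id"]
        # (tid,) sorts just before (tid, value), so this finds the entry for tid.
        i = bisect.bisect_left(by_id, (tid,))
        result.append(by_id[i][1])
    return tuple(result)


def ids_with_value(relations, name, value):
    """Find the IDs holding a value by binary search on the value-ordered copy."""
    by_value = relations[name]["by_value"]
    values = [v for _tid, v in by_value]
    lo = bisect.bisect_left(values, value)
    hi = bisect.bisect_right(values, value)
    return [tid for tid, _v in by_value[lo:hi]]


attrs = ("name", "age")
rows = [("Joe", 45), ("Sue", 37), ("Ann", 45)]
relations = decompose(rows, attrs)
```

A query projecting one attribute reads only that relation, which is where the speedup comes from. The extra storage is visible too: every value is stored with its ID, and in two orders.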
Memory wall and PAX
• 90s: Cache-conscious research
  from: “Cache Conscious Algorithms for Relational Query Processing.” Shatdal, Kant, Naughton. VLDB 1994.
  to: “Database Architecture Optimized for the New Bottleneck: Memory Access.” Boncz, Manegold, Kersten. VLDB 1999.
  and: “DBMSs on a modern processor: Where does time go?” Ailamaki, DeWitt, Hill, Wood. VLDB 1999.
• PAX: Partition Attributes Across
  • Retains NSM I/O pattern
  • Optimizes cache-to-RAM communication
  “Weaving Relations for Cache Performance.” Ailamaki, DeWitt, Hill, Skounakis, VLDB 2001.
More hybrid NSM/DSM schemes
• Dynamic PAX: Data Morphing
  “Data morphing: an adaptive, cache-conscious storage technique.” Hankins, Patel, VLDB 2003.
• Clotho: custom layout using scatter-gather I/O
  “Clotho: Decoupling Memory Page Layout from Storage Organization.” Shao, Schindler, Schlosser, Ailamaki, and Ganger. VLDB 2004.
• Fractured mirrors
  • Smart mirroring with both NSM/DSM copies
  “A Case For Fractured Mirrors.” Ramamurthy, DeWitt, Su, VLDB 2002.
MonetDB (more in Part 3)
• Late 1990s, CWI: Boncz, Manegold, and Kersten
• Motivation:
  • Main-memory
  • Improve computational efficiency by avoiding expression interpreter
  • DSM with virtual IDs natural choice
  • Developed new query execution algebra
• Initial contributions:
  • Pointed out memory-wall in DBMSs
  • Cache-conscious projections and joins
  • …
2005: the (re)birth of column-stores
• New hardware and application realities
  • Faster CPUs, larger memories, disk bandwidth limit
  • Multi-terabyte Data Warehouses
• New approach: combine several techniques
  • Read-optimized, fast multi-column access, disk/CPU efficiency, light-weight compression
• C-store paper:
  • First comprehensive design description of a column-store
• MonetDB/X100
  • “proper” disk-based column store
• Explosion of new products
Performance tradeoffs: columns vs. rows
DSM traditionally was not favored by technology trends
How has this changed?
• Optimized DSM in “Fractured Mirrors,” 2002
• “Apples-to-apples” comparison
  “Performance Tradeoffs in Read-Optimized Databases” Harizopoulos, Liang, Abadi, Madden, VLDB’06
• Follow-up study
  “Read-Optimized Databases, In-Depth” Holloway, DeWitt, VLDB’08
• Main-memory DSM vs. NSM
  “DSM vs. NSM: CPU performance tradeoffs in block-oriented query processing” Boncz, Zukowski, Nes, DaMoN’08
• Flash-disks: a come-back for PAX?
  “Fast Scans and Joins Using Flash Drives” Shah, Harizopoulos, Wiener, Graefe. DaMoN’08
  “Query Processing Techniques for Solid State Drives” Tsirogiannis, Harizopoulos, Shah, Wiener, Graefe, SIGMOD’09
Fractured mirrors: a closer look
• Store DSM relations inside a B-tree
  • Leaf nodes contain values
  • Eliminate IDs, amortize header overhead
  • Custom implementation on Shore
“A Case For Fractured Mirrors” Ramamurthy, DeWitt, Su, VLDB 2002.
Similar: storage density comparable to column stores
“Efficient columnar storage in B-trees” Graefe. Sigmod Record 03/2007.
[Figure: DSM records, each with tuple header, TID, and column data (1 a1, 2 a2, 3 a3, 4 a4, 5 a5), indexed by a sparse B-tree on ID; the optimized leaf pages keep one TID followed by densely packed values: "1 a1 a2 a3" and "4 a4 a5"]
Fractured mirrors: performance
• Chunk-based tuple merging
  • Read in segments of M pages
  • Merge segments in memory
  • Becomes CPU-bound after 5 pages
[Figure: time vs. columns projected (1 to 5); curves: row, regular DSM (from PAX paper), optimized DSM; two curves annotated "column?"]
Column-scanner implementation
SELECT name, age WHERE age > 40
[Figure: row scanner vs. column scanner, both using Direct I/O and prefetching ~100ms worth of data.
 row scanner: reads rows (1 Joe 45; 2 Sue 37; …); one scan node S applies the predicate(s) and outputs (Joe, 45), …
 column scanner: reads the name column (Joe, Sue, …) and the age column (45, 37, …); one scan node S applies predicate #1 and outputs (#POS, 45), …; the next scan node S outputs (Joe, 45), …]
“Performance Tradeoffs in Read-Optimized Databases” Harizopoulos, Liang, Abadi, Madden, VLDB’06
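The two scanners can be sketched as below for the query on the slide. This is a simplified in-memory illustration, assuming 0-based positions and made-up extra rows; the function names are hypothetical and the prefetching and direct I/O parts are left out entirely.

```python
names = ["Joe", "Sue", "Ann", "Bob"]
ages = [45, 37, 52, 40]
rows = [(1, "Joe", 45), (2, "Sue", 37), (3, "Ann", 52), (4, "Bob", 40)]


def row_scan(rows, predicate):
    # Row scanner: one pass applies the predicate and emits the projection.
    return [(name, age) for _id, name, age in rows if predicate(age)]


def scan_with_predicate(column, predicate):
    # First column scan node: emit (position, value) for qualifying values.
    return [(pos, v) for pos, v in enumerate(column) if predicate(v)]


def scan_at_positions(column, upstream):
    # Next scan node: read this column only at the positions received and
    # attach its value to the partial tuple.
    return [(pos, column[pos]) + tuple(rest) for pos, *rest in upstream]


def column_scan(names, ages, predicate):
    qualifying = scan_with_predicate(ages, predicate)   # (pos, age)
    joined = scan_at_positions(names, qualifying)       # (pos, name, age)
    return [(name, age) for _pos, name, age in joined]
```

With the predicate `age > 40`, both scanners return the same tuples, but the column scanner reads the name column only at the two qualifying positions.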
Scan performance
• Large prefetch hides disk seeks in columns
• Column-CPU efficiency with lower selectivity
• Row-CPU suffers from memory stalls
• Memory stalls disappear in narrow tuples
• Compression: similar to narrow
(not shown, details in the paper)
Even more results
• Same engine as before
• Additional findings
Non-selective queries, narrow tuples, favor well-compressed rows
Materialized views are a win
Scan times determine early materialized joins (column-joins are covered in Part 2!)
[Figure: Time (s), 0 to 35, vs. Columns Returned, 1 to 10; series C-25%, C-10%, R-50%; annotations: "wide attributes: same as before" and "narrow & compressed tuple: CPU-bound!"]
“Read-Optimized Databases, In-Depth” Holloway, DeWitt, VLDB’08
Speedup of columns over rows
• Rows favored by narrow tuples and low cpdb
• Disk-bound workloads have higher cpdb
[Figure: speedup of columns over rows as a function of tuple width (8 to 36) and cycles per disk byte (cpdb: 9, 18, 36, 72, 144); regions marked _, =, +, ++, +++]
“Performance Tradeoffs in Read-Optimized Databases” Harizopoulos, Liang, Abadi, Madden, VLDB’06
Varying prefetch size
• No prefetching hurts columns in single scans
[Figure: time (sec), 0 to 40, vs. selected bytes per tuple, 4 to 32, with no competing disk traffic; series: Row (any prefetch size), Column 48 (x 128KB), Column 16, Column 8, Column 2]
• Under competing traffic, columns outperform rows for any prefetch size
[Figure: two panels with competing disk traffic, time (sec), 0 to 40, vs. selected bytes per tuple, 4 to 28; one panel compares Column, 48 with Row, 48; the other compares Column, 8 with Row, 8]
CPU Performance
• Benefit in on-the-fly conversion between NSM and DSM
• DSM: sequential access (block fits in L2), random in L1
• NSM: random access, SIMD for grouped Aggregation
“DSM vs. NSM: CPU performance tradeoffs in block-oriented query processing” Boncz, Zukowski, Nes, DaMoN’08
New storage technology: Flash SSDs
• Performance characteristics
  • very fast random reads, slow random writes
  • fast sequential reads and writes
• Price per bit (capacity follows)
  • cheaper than RAM, order of magnitude more expensive than Disk
• Flash Translation Layer introduces unpredictability
  • avoid random writes!
• Form factors not ideal yet
  • SSD (→ small reads still suffer from SATA overhead/OS limitations)
  • PCI card (→ high price, limited expandability)
• Boost Sequential I/O in a simple package
  • Flash RAID: very tight bandwidth/cm³ packing (4GB/sec inside the box)
• Column Store Updates
  • useful for delta structures and logs
• Random I/O on flash fixes unclustered index access
  • still suboptimal if I/O block size > record size
  • therefore column stores profit much less than horizontal stores
• Random I/O useful to exploit secondary, tertiary table orderings
  • the larger the data, the deeper clustering one can exploit
Even faster column scans on flash SSDs
• New-generation SSDs
  • Very fast random reads, slower random writes
  • Fast sequential RW, comparable to HDD arrays
  • 30K Read IOps, 3K Write IOps; 250MB/s Read BW, 200MB/s Write
• No expensive seeks across columns
• FlashScan and FlashJoin: PAX on SSDs, inside Postgres
  • mini-pages with no qualified attributes are not accessed
“Query Processing Techniques for Solid State Drives” Tsirogiannis, Harizopoulos, Shah, Wiener, Graefe, SIGMOD’09
Column-scan performance over time
[Figure: progression of column-scan performance: from 7x slower, ..to 2x slower, ..to 1.2x slower, ..to same, and 3x faster! Systems shown: regular DSM (2001), optimized DSM (2002), column-store (2006), SSD Postgres/PAX (2009)]
Architecture of a column-store
storage layout
• read-optimized: dense-packed, compressed
• organize in extents, batch updates
• multiple sort orders
• sparse indexes
engine
• block-tuple operators
• new access methods
• optimized relational operators
system-level
• system-wide column support
• loading / updates
• scaling through multiple nodes
• transactions / redundancy
C-Store
• Compress columns
• No alignment
• Big disk blocks
• Only materialized views (perhaps many)
• Focus on Sorting not indexing
• Data ordered on anything, not just time
• Automatic physical DBMS design
• Optimize for grid computing
• Innovative redundancy
• Xacts – but no need for Mohan
• Column optimizer and executor
“C-Store: A Column-Oriented DBMS.” Stonebraker et al. VLDB 2005.
C-Store: only materialized views (MVs)
• Projection (MV) is some number of columns from a fact table
• Plus columns in a dimension table – with a 1-n join between Fact and Dimension table
• Stored in order of a storage key(s)
• Several may be stored!
• With a permutation, if necessary, to map between them
• Table (as the user specified it and sees it) is not stored!
• No secondary indexes (they are a one column sorted MV plus a permutation, if you really want one)
User view:
  EMP (name, age, salary, dept)
  Dept (dname, floor)
Possible set of MVs:
  MV-1 (name, dept, floor) in floor order
  MV-2 (salary, age) in age order
  MV-3 (dname, salary, name) in salary order
Continuous Load and Query (Vertica)
Hybrid Storage Architecture
> Write Optimized Store (WOS), fed by Trickle Load
  • Memory based
  • Unsorted / Uncompressed
  • Segmented
  • Low latency / Small quick inserts
> TUPLE MOVER: Asynchronous Data Transfer
> Read Optimized Store (ROS)
  • On disk
  • Sorted / Compressed
  • Segmented
  • Large data loaded direct
[Figure: columns A B C in the WOS; A B C and (A B C | A) in the ROS]
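The interplay of the two stores can be sketched with a toy single-column model. This is an assumption-laden illustration, not Vertica's design: the class name `HybridStore` and its methods are hypothetical, the ROS is a sorted in-memory list standing in for sorted, compressed disk segments, and the tuple mover is called explicitly rather than running asynchronously.

```python
import bisect
import heapq


class HybridStore:
    """Toy single-column store with a write-optimized and a read-optimized part."""

    def __init__(self):
        self.wos = []  # in memory, unsorted, append-only
        self.ros = []  # stand-in for the on-disk store, kept sorted

    def insert(self, value):
        # Trickle load: a cheap append, no sorting or compression.
        self.wos.append(value)

    def move_tuples(self):
        # Tuple mover: sort the WOS batch and merge it into the sorted ROS.
        batch = sorted(self.wos)
        self.wos = []
        self.ros = list(heapq.merge(self.ros, batch))

    def count_in_range(self, lo, hi):
        # Queries must see both stores: binary search on the sorted ROS,
        # linear scan of the (small) WOS.
        in_ros = bisect.bisect_right(self.ros, hi) - bisect.bisect_left(self.ros, lo)
        in_wos = sum(1 for v in self.wos if lo <= v <= hi)
        return in_ros + in_wos
```

Inserts stay cheap because sorting is deferred to `move_tuples`, while range queries on the bulk of the data still benefit from sorted order.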
Loading Data (Vertica)
> INSERT, UPDATE, DELETE
> Bulk and Trickle Loads
  • COPY
  • COPY DIRECT
> User loads data into logical Tables
> Vertica loads atomically into storage
[Figure: Write-Optimized Store (WOS), in-memory; Automatic Tuple Mover; Read-Optimized Store (ROS), on-disk]
Applications for column-stores
• Data Warehousing
  • High end (clustering)
  • Mid end/Mass Market
  • Personal Analytics
• Data Mining
  • E.g. Proximity
• Google BigTable
• RDF
  • Semantic web data management
• Information retrieval
  • Terabyte TREC
• Scientific datasets
  • SciDB initiative
  • SLOAN Digital Sky Survey on MonetDB
List of column-store systems
• Cantor (history)
• Sybase IQ
• SenSage (former Addamark Technologies)
• Kdb
• 1010data
• MonetDB
• C-Store/Vertica
• X100/VectorWise
• KickFire
• SAP Business Accelerator
• Infobright
• ParAccel
• Exasol
Simulate a Column-Store inside a Row-Store
  Date   Store  Product  Customer  Price
  01/01  BOS    Table    Mesa      $20
  01/01  NYC    Chair    Lutz      $13
  01/01  BOS    Bed      Mudd      $79

Option A: Vertical Partitioning (one TID/Value table per column)
  Date:     (1, 01/01), (2, 01/01), (3, 01/01)
  Store:    (1, BOS), (2, NYC), (3, BOS)
  Product:  (1, Table), (2, Chair), (3, Bed)
  Customer: (1, Mesa), (2, Lutz), (3, Mudd)
  Price:    (1, $20), (2, $13), (3, $79)

Option B: Index Every Column
  Date Index, Store Index, …
Can explicitly run-length encode date:
  Date (Value, StartPos, Length): (01/01, 1, 3)
“Teaching an Old Elephant New Tricks.” Bruno, CIDR 2009.
Experiments
• Star Schema Benchmark (SSBM)
  “Adjoined Dimension Column Index (ADC Index) to Improve Star Schema Query Performance”. O’Neil et. al. ICDE 2008.
• Implemented by professional DBA
• Original row-store plus 2 column-store simulations on same row-store product

  Average time (seconds)
  Normal Row-Store                   25.7
  Vertically Partitioned Row-Store   79.9
  Row-Store With All Indexes         221.2

“Column-Stores vs Row-Stores: How Different are They Really?” Abadi, Hachem, and Madden. SIGMOD 2008.
What’s Going On? Vertical Partitions
• Vertical partitions in row-stores:
  • Work well when workload is known
  • ..and queries access disjoint sets of columns
  • See automated physical design
• Do not work well as full-columns
  • TupleID overhead significant
  • Excessive joins
[Figure: each vertical-partition record holds a tuple header, a TID (1, 2, 3, …), and the column data]
Queries touch 3-4 foreign keys in fact table, 1-2 numeric columns
Complete fact table takes up ~4 GB (compressed)
Vertically partitioned tables take up 0.7-1.1 GB (compressed)
“Column-Stores vs. Row-Stores: How Different Are They Really?” Abadi, Madden, and Hachem. SIGMOD 2008.
What’s Going On? All Indexes Case
• Tuple construction
  • Common type of query:
      SELECT store_name, SUM(revenue)
      FROM Facts, Stores
      WHERE Facts.store_id = Stores.store_id
        AND Stores.country = 'Canada'
      GROUP BY store_name
  • Result of lower part of query plan is a set of TIDs that passed all predicates
  • Need to extract SELECT attributes at these TIDs
    • BUT: index maps value to TID
    • You really want to map TID to value (i.e., a vertical partition)
→ Tuple construction is SLOW
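The mismatch between the two access paths can be illustrated with a toy column. This sketch rests on simple assumptions: 0-based TIDs, an index modelled as a value-ordered list of (value, TID) entries, and hypothetical function names. A real row-store would more likely join the index with the TID set (for example with a hash join) than scan it like this, but the direction problem is the same.

```python
# Hypothetical fact-table column, addressed by TID (0-based here).
revenue_by_tid = [120, 75, 300, 75, 40]


def build_index(column):
    """Secondary index: (value, tid) entries in value order."""
    return sorted((value, tid) for tid, value in enumerate(column))


def fetch_via_index(index, tids):
    # The index is ordered by value, so recovering the value for a given TID
    # means examining entries until every wanted TID has been seen.
    wanted = set(tids)
    found = {}
    examined = 0
    for value, tid in index:
        examined += 1
        if tid in wanted:
            found[tid] = value
            if len(found) == len(wanted):
                break
    return [found[t] for t in tids], examined


def fetch_via_partition(column, tids):
    # A vertical partition maps TID to value, so each request is answered directly.
    return [column[t] for t in tids], len(tids)


index = build_index(revenue_by_tid)
```

For TIDs 0 and 2 the index path examines all five entries, while the TID-to-value path touches two.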
So….
• All indexes approach is a poor way to simulate a column-store
• Problems with vertical partitioning are NOT fundamental
  • Store tuple header in a separate partition
  • Allow virtual TIDs
  • Combine clustered indexes, vertical partitioning
• So can row-stores simulate column-stores?
  • Might be possible, BUT:
    • Need better support for vertical partitioning at the storage layer
    • Need support for column-specific optimizations at the executor level
    • Full integration: buffer pool, transaction manager, ..
• When will this happen?
  • Most promising features = soon (see Part 2, Part 3 for most promising features)
  • ..unless new technology / new objectives change the game (SSDs, Massively Parallel Platforms, Energy-efficiency)
End of Part 1
• Basic concepts (Stavros)
  • Introduction to key features
  • From DSM to column-stores and performance tradeoffs
  • Column-store architecture overview
  • Will rows and columns ever converge?
• Part 2: Column-oriented execution (Daniel)
• Part 3: MonetDB/X100 and CPU efficiency (Peter)
Part 2 Outline
• Compression
• Tuple Materialization
• Joins
Compression
“Integrating Compression and Execution in Column-Oriented Database Systems” Abadi, Madden, and Ferreira, SIGMOD ’06
“Super-Scalar RAM-CPU Cache Compression”Zukowski, Heman, Nes, Boncz, ICDE’06
“Query optimization in compressed database systems” Chen, Gehrke, Korn, SIGMOD’01
Compression
• Trades I/O for CPU
• Increased column-store opportunities:
  • Higher data value locality in column stores
  • Techniques such as run length encoding far more useful
  • Can use extra space to store multiple copies of data in different sort orders
Run-length Encoding
[Figure: run-length encoding of a sorted table. Each run is stored as a (value, start_pos, run_length) triple.]
Quarter:    Q1 Q1 Q1 Q1 Q1 Q1 Q1 … Q2 Q2 Q2 Q2 …  ->  (Q1, 1, 300) (Q2, 301, 350) (Q3, 651, 500) (Q4, 1151, 600)
Product ID: 1 1 1 1 1 2 2 … 1 1 1 2 …  ->  (1, 1, 5) (2, 6, 2) … (1, 301, 3) (2, 304, 1) …
Price:      5 7 2 9 6 8 5 … 3 8 1 4 …  ->  5 7 2 9 6 8 5 … 3 8 1 4 … (stored as-is)
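The (value, start_pos, run_length) triples above can be produced with a single pass over the column. The following is a minimal sketch, assuming an integer column and 1-based positions as on the slide; the names RleRun and rle_encode are illustrative and not taken from any of the cited systems.

```c
#include <assert.h>
#include <stddef.h>

/* One run: the slide's (value, start_pos, run_length) triple.
   Positions are 1-based, as on the slide. */
typedef struct {
    int value;
    size_t start_pos;
    size_t run_length;
} RleRun;

/* Encode col[0..n) into runs and return the number of runs written.
   `out` must have room for n runs (worst case: no two neighbours are equal). */
size_t rle_encode(const int *col, size_t n, RleRun *out)
{
    assert(col != NULL && out != NULL);
    size_t nruns = 0;
    for (size_t i = 0; i < n; i++) {
        if (nruns > 0 && out[nruns - 1].value == col[i]) {
            out[nruns - 1].run_length++;   /* extend the current run */
        } else {
            out[nruns].value = col[i];     /* start a new run */
            out[nruns].start_pos = i + 1;
            out[nruns].run_length = 1;
            nruns++;
        }
    }
    return nruns;
}
```

Applied to the first seven Product ID values of the figure (1 1 1 1 1 2 2), this yields the two runs (1, 1, 5) and (2, 6, 2).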
Bit-vector Encoding
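[Figure: bit-vector encoding of the Product ID column, one bit-vector per distinct value.]
Product ID: 1 1 1 1 1 2 2 … 1 1 2 3 …
ID: 1       1 1 1 1 1 0 0 … 1 1 0 0 …
ID: 2       0 0 0 0 0 1 1 … 0 0 1 0 …
ID: 3       0 0 0 0 0 0 0 … 0 0 0 1 …
…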
“Integrating Compression and Execution in Column-Oriented Database Systems” Abadi et al., SIGMOD ’06
• For each unique value, v, in column c, create bit-vector b
  - b[i] = 1 if c[i] = v
• Good for columns with few unique values
• Each bit-vector can be further compressed if sparse
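The rule b[i] = 1 if c[i] = v maps directly onto a packed array of machine words. A minimal sketch, assuming an integer column and 64-bit words; bitvector_encode and bitvector_get are hypothetical helper names, and no further compression of sparse vectors is attempted.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Build the bit-vector for value v over col[0..n):
   bit i of the result is 1 exactly when col[i] == v.
   `bits` must hold (n + 63) / 64 words; bit i lives in word i / 64. */
void bitvector_encode(const int *col, size_t n, int v, uint64_t *bits)
{
    assert(col != NULL && bits != NULL);
    memset(bits, 0, ((n + 63) / 64) * sizeof(uint64_t));
    for (size_t i = 0; i < n; i++) {
        if (col[i] == v)
            bits[i / 64] |= (uint64_t)1 << (i % 64);
    }
}

/* Read bit i back (returns 0 or 1). */
int bitvector_get(const uint64_t *bits, size_t i)
{
    return (int)((bits[i / 64] >> (i % 64)) & 1u);
}
```

Calling bitvector_encode once per distinct value gives one vector per value; every position is set in exactly one of them, which is why the scheme suits columns with few unique values.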
Dictionary Encoding
[Figure: two ways to dictionary-encode the Quarter column.]
Quarter:        Q1 Q2 Q4 Q1 Q3 Q1 Q1 Q1 Q2 Q4 Q3 Q3 …
Dictionary:     0: Q1, 1: Q2, 2: Q3, 3: Q4
Encoded column: 0 1 3 0 2 0 0 0 1 3 2 2 …
OR
Dictionary:     24: Q1, Q2, Q4, Q1;  128: Q3, Q1, Q1, Q1;  122: Q2, Q4, Q3, Q3
Encoded column: 24 128 122 …
“Integrating Compression and Execution in Column-Oriented Database Systems” Abadi et al., SIGMOD ’06
• For each unique value create dictionary entry
• Dictionary can be per-block or per-column
• Column-stores have the advantage that dictionary entries may encode multiple values at once
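The simpler of the two variants (one dictionary entry per unique value) can be sketched as follows. This is an illustration under assumptions, not the cited paper's implementation: the Dictionary type, the 256-entry cap, and the linear lookup are all hypothetical choices made to keep the example short. Codes are handed out in order of first appearance, so they can differ from the sorted dictionary drawn on the slide.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define DICT_MAX 256   /* hypothetical cap, so that a code fits in one byte */

typedef struct {
    const char *entries[DICT_MAX];  /* code -> value (pointers borrowed from the input) */
    size_t size;
} Dictionary;

/* Return the code for `value`, adding a new entry on first sight. */
static unsigned char dict_code(Dictionary *d, const char *value)
{
    for (size_t c = 0; c < d->size; c++)
        if (strcmp(d->entries[c], value) == 0)
            return (unsigned char)c;
    assert(d->size < DICT_MAX);
    d->entries[d->size] = value;
    return (unsigned char)d->size++;
}

/* Encode col[0..n) as one-byte codes; the dictionary is (re)built as a side effect. */
void dict_encode(const char *const *col, size_t n, Dictionary *d, unsigned char *codes)
{
    d->size = 0;
    for (size_t i = 0; i < n; i++)
        codes[i] = dict_code(d, col[i]);
}
```

Decoding a code c is the array lookup d.entries[c]. A production dictionary would likely use a hash table for the value-to-code direction.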
Frame Of Reference Encoding
[Figure: frame of reference encoding of the Price column.]
Price:   45 54 48 55 51 53 40 50 49 62 52 50 …
Frame: 50, 4 bits per value
Encoded: -5 4 -2 5 1 3 ∞ 40 0 -1 ∞ 62 2 0 …
Exceptions: ∞ marks the escape code, followed by the original value (see Part 3 for a better way to deal with exceptions)
• Encodes values as b bit offset from chosen frame of reference
• Special escape code (e.g. all bits set to 1) indicates a difference larger than can be stored in b bits
  - After escape code, original (uncompressed) value is written
“Compressing Relations and Indexes” Goldstein, Ramakrishnan, Shaft, ICDE ’98
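A small sketch of the escape mechanism, assuming b = 4 as in the figure. To keep the logic visible, each code occupies one int rather than being bit-packed, and the signed offset is stored with a bias of 7 so that codes 0..14 cover offsets -7..7 and code 15 (all four bits set) is the escape. The bias and the names for_encode and for_decode are assumptions for illustration.

```c
#include <assert.h>
#include <stddef.h>

enum {
    FOR_BITS = 4,
    FOR_ESCAPE = (1 << FOR_BITS) - 1,  /* 15: all four bits set */
    FOR_BIAS = 7                       /* codes 0..14 stand for offsets -7..7 */
};

/* Encode col[0..n) against `frame`. Returns the number of ints written to
   `out`, which needs room for 2 * n ints (every value could be an exception). */
size_t for_encode(const int *col, size_t n, int frame, int *out)
{
    assert(col != NULL && out != NULL);
    size_t k = 0;
    for (size_t i = 0; i < n; i++) {
        int off = col[i] - frame;
        if (off >= -FOR_BIAS && off <= FOR_BIAS) {
            out[k++] = off + FOR_BIAS;   /* fits in 4 bits */
        } else {
            out[k++] = FOR_ESCAPE;       /* escape code ... */
            out[k++] = col[i];           /* ... then the original value */
        }
    }
    return k;
}

/* Inverse of for_encode. Returns the number of values written to `col`. */
size_t for_decode(const int *in, size_t len, int frame, int *col)
{
    size_t n = 0;
    for (size_t k = 0; k < len; k++) {
        if (in[k] == FOR_ESCAPE) {
            assert(k + 1 < len);
            col[n++] = in[++k];          /* raw value follows the escape */
        } else {
            col[n++] = frame + in[k] - FOR_BIAS;
        }
    }
    return n;
}
```

On the Price values of the figure with frame 50, only 40 and 62 take the escape path, matching the two exceptions shown.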
Differential Encoding
[Figure: differential encoding of the Time column.]
Time:    5:00 5:02 5:03 5:03 5:04 5:06 5:07 5:08 5:10 5:15 5:16 5:16 …
2 bits per value
Encoded: 5:00 2 1 0 1 2 1 1 2 ∞ 5:15 1 0 …
Exception: ∞ marks the escape code, followed by the original value (see Part 3 for a better way to deal with exceptions)
• Encodes values as b bit offset from previous value
• Special escape code (just like frame of reference encoding) indicates a difference larger than can be stored in b bits
  - After escape code, original (uncompressed) value is written
• Performs well on columns containing increasing/decreasing sequences
  - inverted lists
  - timestamps
  - object IDs
  - sorted / clustered columns
“Improved Word-Aligned Binary Compression for Text Indexing” Anh, Moffat, TKDE ’06
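Differential encoding differs from frame of reference only in what the offset is measured against. A minimal sketch with b = 2, assuming non-negative deltas 0..2 are stored directly, code 3 is the escape, and the first value is stored as-is; times are represented as minutes since midnight. As before, codes are not bit-packed and the function names are illustrative.

```c
#include <assert.h>
#include <stddef.h>

enum {
    DELTA_BITS = 2,
    DELTA_ESCAPE = (1 << DELTA_BITS) - 1   /* 3: both bits set */
};

/* Encode col[0..n) as deltas from the previous value. Returns the number of
   ints written to `out`, which needs room for 2 * n ints. */
size_t delta_encode(const int *col, size_t n, int *out)
{
    assert(col != NULL && out != NULL);
    if (n == 0)
        return 0;
    size_t k = 0;
    out[k++] = col[0];                    /* first value stored as-is */
    for (size_t i = 1; i < n; i++) {
        int d = col[i] - col[i - 1];
        if (d >= 0 && d < DELTA_ESCAPE) {
            out[k++] = d;                 /* delta 0, 1 or 2 fits in 2 bits */
        } else {
            out[k++] = DELTA_ESCAPE;      /* escape code ... */
            out[k++] = col[i];            /* ... then the original value */
        }
    }
    return k;
}

/* Inverse of delta_encode. Returns the number of values written to `col`. */
size_t delta_decode(const int *in, size_t len, int *col)
{
    if (len == 0)
        return 0;
    size_t n = 0;
    col[n++] = in[0];
    for (size_t k = 1; k < len; k++) {
        if (in[k] == DELTA_ESCAPE) {
            assert(k + 1 < len);
            col[n++] = in[++k];
        } else {
            col[n] = col[n - 1] + in[k];
            n++;
        }
    }
    return n;
}
```

On the Time column of the figure, the only exception is the jump from 5:10 to 5:15.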
What Compression Scheme To Use?
[Figure: yes/no decision tree for choosing a compression scheme.]
Questions: Does column appear in the sort key? Is the average run-length > 2? Are number of unique values < ~50000? Is the data numerical and exhibit good locality? Does this column appear frequently in selection predicates?
Outcomes: RLE, Differential Encoding, Frame of Reference Encoding, Bit-vector Compression, Dictionary Compression, Heavyweight Compression OR Leave Data Uncompressed
Heavy-Weight Compression Schemes
• Modern disk arrays can achieve > 1GB/s
• 1/3 CPU for decompression → 3GB/s needed
→ Lightweight compression schemes are better
→ Even better: operate directly on compressed data
“Super-Scalar RAM-CPU Cache Compression”Zukowski, Heman, Nes, Boncz, ICDE’06
Operating Directly on Compressed Data
• I/O - CPU tradeoff is no longer a tradeoff
• Reduces memory–CPU bandwidth requirements
• Opens up possibility of operating on multiple records at once
“Integrating Compression and Execution in Column-Oriented Database Systems” Abadi et al., SIGMOD ’06
Operating Directly on Compressed Data
SELECT ProductID, Count(*)
FROM table
WHERE (Quarter = Q2)
GROUP BY ProductID
[Figure: the query is answered without decompressing either column.]
Quarter (RLE): (Q1, 1, 300) (Q2, 301, 6) (Q3, 307, 500) (Q4, 807, 600)
Index Lookup + Offset jump: Quarter = Q2 selects positions 301-306
Product ID (bit-vectors; the first six bits shown fall in the selected range):
  1: 1 0 0 1 1 0 0 1 0 0 0 …
  2: 0 0 1 0 0 0 1 0 0 1 0 …
  3: 0 1 0 0 0 1 0 0 1 0 1 …
Result (ProductID, COUNT(*)): (1, 3) (2, 1) (3, 2)
“Integrating Compression and Execution in Column-Oriented Database Systems” Abadi et al., SIGMOD ’06
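The query on this slide needs only two operations on the compressed data: one lookup in the RLE runs of Quarter to find the position range for Q2, and one bit count per Product ID bit-vector restricted to that range. A minimal sketch under assumptions: quarters are represented by the integers 1..4, the RLE column is sorted so each value has a single run, and the helper names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct {
    int value;
    size_t start_pos;    /* 1-based */
    size_t run_length;
} RleRun;

/* Set the bit for 1-based position pos. */
void bitvector_set(uint64_t *bits, size_t pos)
{
    assert(pos >= 1);
    bits[(pos - 1) / 64] |= (uint64_t)1 << ((pos - 1) % 64);
}

/* Find the run holding `value`. On success returns 1 and stores the
   inclusive 1-based position range in *first and *last; otherwise returns 0. */
int rle_find(const RleRun *runs, size_t nruns, int value, size_t *first, size_t *last)
{
    for (size_t r = 0; r < nruns; r++) {
        if (runs[r].value == value) {
            *first = runs[r].start_pos;
            *last = runs[r].start_pos + runs[r].run_length - 1;
            return 1;
        }
    }
    return 0;
}

/* COUNT(*) for one group: number of set bits at positions first..last. */
size_t bitvector_count_range(const uint64_t *bits, size_t first, size_t last)
{
    assert(first >= 1 && first <= last);
    size_t count = 0;
    for (size_t pos = first; pos <= last; pos++)
        count += (size_t)((bits[(pos - 1) / 64] >> ((pos - 1) % 64)) & 1u);
    return count;
}
```

Neither column is expanded into individual values at any point; the loop could be tightened further with word-level population counts, which this sketch leaves out.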
Operating Directly on Compressed Data
SELECT ProductID, Count(*)
FROM table
WHERE (Quarter = Q2)
GROUP BY ProductID
[Figure: a Compression-Aware Scan Operator exposes compressed blocks through a Block API to the Selection Operator and the Aggregation Operator.]
Data: (Q1, 1, 300) (Q2, 301, 6) (Q3, 307, 500) (Q4, 807, 600)
Block API:
  isOneValue(), isValueSorted(), isPosContiguous(), isSparse()
  getNext(), decompressIntoArray(), getValueAtPosition(pos)
  getMin(), getMax(), getSize()
“Integrating Compression and Execution in Column-Oriented Database Systems” Abadi et al., SIGMOD ’06
Tuple Materialization and Column-Oriented Join Algorithms
“Materialization Strategies in a Column-Oriented DBMS” Abadi, Myers, DeWitt, and Madden. ICDE 2007.
“Query Processing Techniques for Solid State Drives” Tsirogiannis, Harizopoulos, Shah, Wiener, and Graefe. SIGMOD 2009.
“Column-Stores vs Row-Stores: How Different are They Really?” Abadi, Madden, and Hachem. SIGMOD 2008.
“Cache-Conscious Radix-Decluster Projections”, Manegold, Boncz, Nes, VLDB ’04
“Self-organizing tuple reconstruction in column-stores”, Idreos, Manegold, Kersten, SIGMOD ’09
When should columns be projected?
• Where should column projection operators be placed in a query plan?
• Row-store:
  - Column projection involves removing unneeded columns from tuples
  - Generally done as early as possible
• Column-store:
  - Operation is almost completely opposite from a row-store
  - Column projection involves reading needed columns from storage and extracting values for a listed set of tuples
    - This process is called “materialization”
  - Early materialization: project columns at beginning of query plan
    - Straightforward since there is a one-to-one mapping across columns
  - Late materialization: wait as long as possible for projecting columns
    - More complicated since selection and join operators on one column obfuscate the mapping to other columns from the same table
  - Most column-stores construct tuples at column projection time
    - Many database interfaces expect output in regular tuples (rows)
    - Rest of discussion will focus on this case
When should tuples be constructed?
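• Solution 1: Create rows first (EM). But:
  - Need to construct ALL tuples
  - Need to decompress data
  - Poor memory bandwidth utilization
QUERY:
SELECT custID, SUM(price)
FROM table
WHERE (prodID = 4) AND (storeID = 1)
GROUP BY custID
[Figure: early materialization. All four columns are stitched into rows before any predicate is applied.]
prodID (RLE): (4, 1, 4)
storeID:      2 1 3 1
custID:       2 3 3 3
price:        7 13 42 80
Construct -> rows (prodID, storeID, custID, price): (4, 2, 2, 7) (4, 1, 3, 13) (4, 3, 3, 42) (4, 1, 3, 80)
Select + Aggregate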
Solution 2: Operate on columns
QUERY:
SELECT custID, SUM(price)
FROM table
WHERE (prodID = 4) AND (storeID = 1)
GROUP BY custID
[Figure, built up over four slides: DataSource(prodID) and DataSource(storeID) feed AND; the result of AND drives DataSource(custID) and DataSource(price), which feed AGG.]
Columns: prodID 4 4 4 4, storeID 2 1 3 1, custID 2 3 3 3, price 7 13 42 80
1. DataSource prodID (prodID = 4) -> 1 1 1 1; DataSource storeID (storeID = 1) -> 0 1 0 1
2. AND: 1 1 1 1 and 0 1 0 1 -> 0 1 0 1
3. DataSource custID with 0 1 0 1 -> 3 3; DataSource price with 0 1 0 1 -> 13 80
4. AGG: custID 3, SUM(price) 93
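The column-at-a-time plan above can be sketched as three small loops: a predicate scan per filter column, an AND of the resulting match vectors, and an aggregation that touches custID and price only at the surviving positions. This is a minimal illustration, assuming uncompressed integer columns, one byte per match flag, and small non-negative group ids; the function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* DataSource with an equality predicate: match[i] = 1 if col[i] == key, else 0. */
void scan_eq(const int *col, size_t n, int key, unsigned char *match)
{
    assert(col != NULL && match != NULL);
    for (size_t i = 0; i < n; i++)
        match[i] = (unsigned char)(col[i] == key);
}

/* AND: combine two match vectors position by position. */
void and_positions(const unsigned char *a, const unsigned char *b, size_t n,
                   unsigned char *out)
{
    for (size_t i = 0; i < n; i++)
        out[i] = (unsigned char)(a[i] & b[i]);
}

/* AGG: SUM(value) GROUP BY group, reading only the selected positions.
   Assumes 0 <= group[i] < ngroups; the caller zeroes sums[]. */
void sum_by_group(const int *group, const int *value, const unsigned char *sel,
                  size_t n, long *sums, size_t ngroups)
{
    for (size_t i = 0; i < n; i++) {
        if (sel[i]) {
            assert(group[i] >= 0 && (size_t)group[i] < ngroups);
            sums[group[i]] += value[i];
        }
    }
}
```

With the four-row table from the slide, the selection vector is 0 1 0 1 and the only non-zero sum is 93 for custID 3. No row is ever constructed for the two tuples that fail the predicate.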
For plans without joins, late materialization is a win
• Ran on 2 compressed columns from TPC-H scale 10 data
QUERY:
SELECT C1, SUM(C2)
FROM table
WHERE (C1 < CONST) AND (C2 < CONST)
GROUP BY C1
[Figure: bar chart of Time (seconds), axis 0 to 10, comparing Early Materialization and Late Materialization at low, medium, and high selectivity.]
“Materialization Strategies in a Column-Oriented DBMS” Abadi, Myers, DeWitt, and Madden. ICDE 2007.
Even on uncompressed data, late materialization is still a win
• Materializing late still works best
QUERY:
SELECT C1, SUM(C2)
FROM table
WHERE (C1 < CONST) AND (C2 < CONST)
GROUP BY C1
[Figure: bar chart of Time (seconds), axis 0 to 14, comparing Early Materialization and Late Materialization at low, medium, and high selectivity.]
“Materialization Strategies in a Column-Oriented DBMS” Abadi, Myers, DeWitt, and Madden. ICDE 2007.
What about for plans with joins?
Select R1.B, R1.C, R2.E, R2.H, R3.F
From R1, R2, R3
Where R1.A = R2.D AND R2.G = R3.K
[Figure: two query plans over R1 (A, B, C), R2 (D, E, G, H), and R3 (F, K, L).]
Early materialization: Scan R1 produces A, B, C and Scan R2 produces D, E, G, H; Join 1 (A = D) outputs B, C, E, G, H; Scan R3 produces F, K; Join 2 (G = K) outputs B, C, E, H, F.
Late materialization: Scan R1 produces A and Scan R2 produces D; Join (A = D); Project G; Scan R3 produces K; Join (G = K); Project fetches B, C, then E, H, then F to output B, C, E, H, F.
Early Materialization Example
QUERY:
SELECT C.lastName, SUM(F.price)
FROM facts AS F, customers AS C
WHERE F.custID = C.custID
GROUP BY C.lastName
[Figure, built up over two slides: both inputs are constructed into rows before the join.]
Facts (prodID, storeID, quantity, custID, price); prodID is stored as the run (4, 1, 4). Columns used by the query:
  custID: 2 3 3 3
  price:  7 13 42 80
Customers (custID, lastName):
  custID:   1 2 3
  lastName: Green White Brown
Construct -> Facts rows (custID, price): (2, 7) (3, 13) (3, 42) (3, 80)
Construct -> Customers rows (custID, lastName): (1, Green) (2, White) (3, Brown)
Join -> (price, lastName): (7, White) (13, Brown) (42, Brown) (80, Brown)
Late Materialization Example
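QUERY:
SELECT C.lastName, SUM(F.price)
FROM facts AS F, customers AS C
WHERE F.custID = C.custID
GROUP BY C.lastName
[Figure: only the join columns are fed to the join, which outputs positions rather than rows.]
Facts (prodID, storeID, quantity, custID, price); prodID is stored as the run (4, 1, 4).
  custID: 2 3 1 3
  price:  7 13 42 80
Customers (custID, lastName):
  custID:   1 2 3
  lastName: Green White Brown
Join -> Facts positions 1 2 3 4, Customers positions 2 3 1 3
Late materialized join causes out of order probing of projected columns from the inner relation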
Late Materialized Join Performance
• Naïve LM join about 2X slower than EM join on typical queries (due to random I/O)
  - This number is very dependent on
    - Amount of memory available
    - Number of projected attributes
    - Join cardinality
• But we can do better
  - Invisible Join
  - Jive/Flash Join
  - Radix cluster/decluster join
Invisible Join
• Designed for typical joins when data is modeled using a star schema
  - One (“fact”) table is joined with multiple dimension tables
• Typical query:
select c_nation, s_nation, d_year, sum(lo_revenue) as revenue
from customer, lineorder, supplier, date
where lo_custkey = c_custkey
  and lo_suppkey = s_suppkey
  and lo_orderdate = d_datekey
  and c_region = 'ASIA'
  and s_region = 'ASIA'
  and d_year >= 1992 and d_year <= 1997
group by c_nation, s_nation, d_year
order by d_year asc, revenue desc;
“Column-Stores vs Row-Stores: How Different are They Really?” Abadi, Madden, and Hachem. SIGMOD 2008.
Invisible Join
[Figure: each dimension predicate is applied to its dimension table, producing a set of matching keys.]
Apply “region = ‘Asia’” On Customer Table
custkey  region  nation  …
1        ASIA    CHINA   …
2        ASIA    INDIA   …
3        ASIA    INDIA   …
4        EUROPE  FRANCE  …
-> Hash Table (or bit-map) Containing Keys 1, 2 and 3
Apply “region = ‘Asia’” On Supplier Table
suppkey  region  nation  …
1        ASIA    RUSSIA  …
2        EUROPE  SPAIN   …
3        ASIA    JAPAN   …
-> Hash Table (or bit-map) Containing Keys 1, 3
Apply “year in [1992,1997]” On Date Table
dateid    year  …
01011997  1997  …
01021997  1997  …
01031997  1997  …
-> Hash Table Containing Keys 01011997, 01021997, and 01031997
“Column-Stores vs Row-Stores: How Different are They Really?” Abadi, Madden, and Hachem. SIGMOD 2008.
[Figure: each foreign-key column of the fact table is probed against the matching key set, giving one bit-vector per predicate; the bit-vectors are then ANDed.]
Original Fact Table
orderkey  custkey  suppkey  orderdate  revenue
1         3        1        01011997   43256
2         3        2        01011997   33333
3         4        3        01021997   12121
4         1        1        01021997   23233
5         4        2        01021997   45456
6         1        2        01031997   43251
7         3        2        01031997   34235
custkey 3 3 4 1 4 1 3 + Hash Table Containing Keys 1, 2 and 3 -> 1 1 0 1 0 1 1
suppkey 1 2 3 1 2 2 2 + Hash Table Containing Keys 1 and 3 -> 1 0 1 1 0 0 0
orderdate 01011997 01011997 01021997 01021997 01021997 01031997 01031997 + Hash Table Containing Keys 01011997, 01021997, and 01031997 -> 1 1 1 1 1 1 1
1101011 & 1011000 & 1111111 = 1001000
“Column-Stores vs Row-Stores: How Different are They Really?” Abadi et al. SIGMOD 2008.
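The probing and intersection shown here can be sketched in a few lines. This is a simplified illustration, not the implementation from the cited paper: the hash table is replaced by a linear search over a small key array, match vectors use one byte per position, and dates are written as integers without the leading zero (a leading zero would make them octal literals in C). The names probe_keys and intersect3 are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Probe a fact-table foreign-key column against the keys that survived a
   dimension predicate: match[i] = 1 if fk[i] is one of keys[0..nkeys).
   The linear search stands in for the hash table (or bit-map) on the slide. */
void probe_keys(const int *fk, size_t n, const int *keys, size_t nkeys,
                unsigned char *match)
{
    assert(fk != NULL && match != NULL);
    for (size_t i = 0; i < n; i++) {
        match[i] = 0;
        for (size_t k = 0; k < nkeys; k++) {
            if (fk[i] == keys[k]) {
                match[i] = 1;
                break;
            }
        }
    }
}

/* AND three match vectors and emit the surviving 1-based fact positions,
   in position order. Returns how many positions were written. */
size_t intersect3(const unsigned char *a, const unsigned char *b,
                  const unsigned char *c, size_t n, size_t *positions)
{
    size_t count = 0;
    for (size_t i = 0; i < n; i++) {
        if (a[i] & b[i] & c[i])
            positions[count++] = i + 1;
    }
    return count;
}
```

On the seven-row fact table above, the surviving positions are 1 and 4, which is the bit-vector 1001000.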
[Figure: the final bit-vector 1001000 selects fact-table positions 1 and 4; the foreign keys at those positions are used to look up the dimension columns, and the results are combined by JOIN.]
custkey 3 3 4 1 4 1 3 + 1001000 -> 3, 1; lookup in nation (CHINA, INDIA, INDIA, FRANCE) -> INDIA, CHINA
suppkey 1 2 3 1 2 2 2 + 1001000 -> 1, 1; lookup in nation (RUSSIA, SPAIN, JAPAN) -> RUSSIA, RUSSIA
orderdate 01011997 01011997 01021997 01021997 01021997 01031997 01031997 + 1001000 -> 01011997, 01021997; lookup in year (1997, 1997, 1997) by dateid (01011997, 01021997, 01031997) -> 1997, 1997
Invisible Join
[Figure: the same dimension tables, predicates, and hash tables as on the earlier Invisible Join slide, with one addition:]
Range [1-3] (between-predicate rewriting)
“Column-Stores vs Row-Stores: How Different are They Really?” Abadi, Madden, and Hachem. SIGMOD 2008.
Invisible Join
• Bottom Line
  - Many data warehouses model data using star/snowflake schemas
  - Joins of one (fact) table with many dimension tables are common
  - Invisible join takes advantage of this by making sure that the table that can be accessed in position order is the fact table for each join
  - Position lists from the fact table are then intersected (in position order)
  - This reduces the amount of data that must be accessed out of order from the dimension tables
  - “Between-predicate rewriting” trick not relevant for this discussion
[Figure: the same lookup of dimension columns from fact-table positions 1 and 4 as on the earlier slide, annotated:]
Still accessing table out of order
Jive/Flash Join
[Figure: position list 3, 1 probes the column CHINA, INDIA, INDIA, FRANCE.]
Still accessing table out of order
“Fast Joins using Join Indices”. Li and Ross, VLDBJ 8:1-24, 1999.
“Query Processing Techniques for Solid State Drives”. Tsirogiannis, Harizopoulos et al. SIGMOD 2009.
Jive/Flash Join
1. Add column with dense ascending integers from 1
2. Sort new position list by second column
3. Probe projected column in order using new sorted position list, keeping first column from position list around
4. Sort new result by first column
[Figure: the four steps applied to position list 3, 1 and column CHINA, INDIA, INDIA, FRANCE.]
After step 1: (1, 3) (2, 1)
After step 2: (2, 1) (1, 3)
After step 3: (2, CHINA) (1, INDIA)
After step 4: (1, INDIA) (2, CHINA)
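The four steps can be sketched with two calls to the C library's qsort. This is an in-memory illustration only, assuming 1-based positions and a column of string pointers; the JiveEntry type and jive_probe function are hypothetical names, and the cited papers describe disk- and flash-oriented variants that this sketch does not attempt to reproduce.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

typedef struct {
    size_t id;          /* step 1: dense ascending integer, remembers the original order */
    size_t pos;         /* 1-based position to probe in the projected column */
    const char *value;  /* filled in by the probe in step 3 */
} JiveEntry;

static int by_pos(const void *a, const void *b)
{
    const JiveEntry *x = a, *y = b;
    return (x->pos > y->pos) - (x->pos < y->pos);
}

static int by_id(const void *a, const void *b)
{
    const JiveEntry *x = a, *y = b;
    return (x->id > y->id) - (x->id < y->id);
}

/* Computes out[i] = column[positions[i] - 1] for all i, but reads `column`
   in ascending position order instead of in join-output order. */
void jive_probe(const size_t *positions, size_t n,
                const char *const *column, size_t column_len, const char **out)
{
    if (n == 0)
        return;
    JiveEntry *e = malloc(n * sizeof *e);
    assert(e != NULL);
    for (size_t i = 0; i < n; i++) {       /* step 1: add dense ids */
        e[i].id = i + 1;
        e[i].pos = positions[i];
        e[i].value = NULL;
    }
    qsort(e, n, sizeof *e, by_pos);        /* step 2: sort by position */
    for (size_t i = 0; i < n; i++) {       /* step 3: probe in position order */
        assert(e[i].pos >= 1 && e[i].pos <= column_len);
        e[i].value = column[e[i].pos - 1];
    }
    qsort(e, n, sizeof *e, by_id);         /* step 4: restore the original order */
    for (size_t i = 0; i < n; i++)
        out[i] = e[i].value;
    free(e);
}
```

For the position list 3, 1 from the figure, the column is read at position 1 and then 3, and the output is returned as INDIA, CHINA.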
Jive/Flash Join
• Bottom Line
  - Instead of probing projected columns from inner table out of order:
    - Sort join index
    - Probe projected columns in order
    - Sort result using an added column
  - LM vs EM tradeoffs:
    - LM has the extra sorts (EM accesses all columns in order)
    - LM only has to fit join columns into memory (EM needs join columns and all projected columns)
      - Results in big memory and CPU savings (see Part 3 for why there are CPU savings)
    - LM only has to materialize relevant columns
  - In many cases LM advantages outweigh disadvantages
• LM would be a clear winner if not for those pesky sorts … can we do better?
Radix Cluster/Decluster
• The full sort from the Jive join is actually overkill
  - We just want to access the storage blocks in order (we don’t mind random access within a block)
  - So do a radix sort and stop early
  - By stopping early, data within each block is accessed out of order, but in the order specified in the original join index
    - Use this pseudo-order to accelerate the post-probe sort as well
“Cache-Conscious Radix-Decluster Projections”, Manegold, Boncz, Nes, VLDB ’04
• “Database Architecture Optimized for the New Bottleneck: Memory Access”, VLDB ’99
• “Generic Database Cost Models for Hierarchical Memory Systems”, VLDB ’02 (all Manegold, Boncz, Kersten)
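One way to picture "a radix sort that stops early" is a single stable pass that clusters join-index entries on the high bits of the position, that is, on the block number, and leaves the low bits unsorted. The sketch below makes several assumptions for illustration: positions are 0-based, a block holds 2^block_bits positions, and clustering is done in one counting-sort pass. It is not the multi-pass, cache-tuned algorithm of the cited paper, and the names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

typedef struct {
    size_t id;    /* original place in the join index */
    size_t pos;   /* 0-based position to probe */
} JoinEntry;

/* Cluster in[0..n) by block number (pos >> block_bits) into out[0..n).
   Blocks come out in ascending order; inside a block the entries keep the
   order they had in the original join index, because the pass is stable. */
void radix_cluster(const JoinEntry *in, size_t n, unsigned block_bits,
                   size_t nblocks, JoinEntry *out)
{
    assert(nblocks > 0);
    size_t *start = calloc(nblocks + 1, sizeof *start);
    assert(start != NULL);
    for (size_t i = 0; i < n; i++) {        /* histogram of block numbers */
        size_t b = in[i].pos >> block_bits;
        assert(b < nblocks);
        start[b + 1]++;
    }
    for (size_t b = 0; b < nblocks; b++)    /* prefix sum: first slot of each block */
        start[b + 1] += start[b];
    for (size_t i = 0; i < n; i++)          /* stable scatter */
        out[start[in[i].pos >> block_bits]++] = in[i];
    free(start);
}
```

With blocks of four positions (block_bits = 2), the positions 9, 2, 13, 0, 10, 3 come out as 2, 0, 3, 9, 10, 13: blocks are visited in order, while positions within block 0 stay in join-index order.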
Radix Cluster/Decluster
• Bottom line
  - Both sorts from the Jive join can be significantly reduced in overhead
  - Only been tested when there is sufficient memory for the entire join index to be stored three times
    - Technique is likely applicable to larger join indexes, but utility will go down a little
  - Only works if random access is possible within a storage block
    - Don’t want to use radix cluster/decluster if you have variable-width column values or compression schemes that can only be decompressed starting from the beginning of the block
LM vs EM joins
• Invisible, Jive, Flash, Cluster, Decluster techniques contain a bag of tricks to improve LM joins
• Research papers show that LM joins become 2X faster than EM joins (instead of 2X slower) for a wide array of query types
Tuple Construction Heuristics
• For queries with selective predicates, aggregations, or compressed data, use late materialization
• For joins:
  - Research papers:
    - Always use late materialization
  - Commercial systems:
    - Inner table to a join often materialized before join (reduces system complexity)
    - Some systems will use LM only if columns from inner table can fit entirely in memory
Outline
• Computational Efficiency of DB on modern hardware
  - how column-stores can help here
  - Keynote revisited: MonetDB & VectorWise in more depth
• CPU efficient column compression
  - vectorized decompression
• Conclusions
  - future work
40 years of hardware evolution vs. DBMS computational efficiency
CPU Architecture
Elements:
• Storage
  - CPU caches L1/L2/L3
• Registers
• Execution Unit(s)
  - Pipelined
  - SIMD
CPU Metrics
[Two figure slides.]
DRAM Metrics
[Figure slide.]
Super-Scalar Execution (pipelining)
[Figure slide.]
Hazards
• Data hazards
  - Dependencies between instructions
  - L1 data cache misses
• Control Hazards
  - Branch mispredictions
  - Computed branches (late binding)
  - L1 instruction cache misses
Result: bubbles in the pipeline
Out-of-order execution addresses data hazards
• control hazards typically more expensive
SIMD
• Single Instruction Multiple Data
  - Same operation applied on a vector of values
  - MMX: 64 bits, SSE: 128 bits, AVX: 256 bits
  - SSE, e.g. multiply 8 short integers
A Look at the Query Pipeline
SELECT id, name, (age-30)*50 AS bonus
FROM employee
WHERE age > 30
[Figure: tuple-at-a-time pipeline SCAN -> SELECT -> PROJECT; each operator pulls from the one below it with next().]
SCAN produces (101, alice, 22) and (102, ivan, 37)
SELECT evaluates 22 > 30 ? FALSE and 37 > 30 ? TRUE, and passes on (102, ivan, 37)
PROJECT computes 37 – 30 = 7 and 7 * 50 = 350, and produces (102, ivan, 350)
A Look at the Query Pipeline
[Figure: the same SCAN -> SELECT -> PROJECT pipeline, with the operators highlighted.]
Operators
Iterator interface
- open()
- next(): tuple
- close()
A Look at the Query Pipeline
[Figure: the same SCAN -> SELECT -> PROJECT pipeline, with the expression evaluation 7 * 50 highlighted.]
Primitives
Provide computational functionality
All arithmetic allowed in expressions, e.g. Multiplication
mult(int,int) → int
Database Architecture causes Hazards
• Data hazards
  - Dependencies between instructions
  - L1 data cache misses
• Control Hazards
  - Branch mispredictions
  - Computed branches (late binding)
  - L1 instruction cache misses
[Figure annotations on the SCAN, SELECT, PROJECT operators (iterator interface: open(), next(): tuple, close()):]
  - Complex NSM record navigation
  - Large Tree/Hash Structures
  - Code footprint of all operators in query plan exceeds L1 cache
  - Data-dependent conditions
  - Next() late binding method calls
  - Tree, List, Hash traversal
  - SIMD, Out-of-order Execution: work on one tuple at a time
“DBMSs On A Modern Processor: Where Does Time Go?” Ailamaki, DeWitt, Hill, Wood, VLDB ’99
DBMS Computational Efficiency
TPC-H 1GB, query 1
• selects 98% of fact table, computes net prices and aggregates all
• Results:
  - C program: ?
  - MySQL: 26.2s
  - DBMS “X”: 28.1s
“MonetDB/X100: Hyper-Pipelining Query Execution” Boncz, Zukowski, Nes, CIDR ’05
DBMS Computational Efficiency
TPC-H 1GB, query 1
• selects 98% of fact table, computes net prices and aggregates all
• Results:
  - C program: 0.2s
  - MySQL: 26.2s
  - DBMS “X”: 28.1s
“MonetDB/X100: Hyper-Pipelining Query Execution” Boncz, Zukowski, Nes, CIDR ’05
MonetDB
• “save disk I/O when scan-intensive queries need a few columns”
• “avoid an expression interpreter to improve computational efficiency”
a column-store
RISC Database Algebra
SELECT id, name, (age-30)*50 as bonus
FROM people
WHERE age > 30

/* res[i] = col[i] - val for every i in [0, n) */
void batcalc_minus_int(int *res, const int *col, int val, int n)
{
    for (int i = 0; i < n; i++)
        res[i] = col[i] - val;
}

CPU happy? Give it “nice” code!
- few dependencies (control, data)
- CPU gets out-of-order execution
- compiler can e.g. generate SIMD
One loop for an entire column
- no per-tuple interpretation
- arrays: no record navigation
- better instruction cache locality
Simple, hard-coded semantics in operators
RISC Database Algebra
[Same query, batcalc_minus_int loop, and annotations as on the previous slide, with the res array highlighted:]
MATERIALIZED intermediate results
a column-store
• “save disk I/O when scan-intensive queries need a few columns”
• “avoid an expression interpreter to improve computational efficiency”
How?
• RISC query algebra: hard-coded semantics
• Decompose complex expressions in multiple operations
• Operators only handle simple arrays
• No code that handles slotted buffered record layout
• Relational algebra becomes array manipulation language
• Often SIMD for free
• Plus: use of cache-conscious algorithms for Sort/Aggr/Join
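To make "decompose complex expressions in multiple operations" concrete, the running query (age > 30, then (age-30)*50) can be written as a chain of simple array loops, each materializing its full result before the next one starts. This is a sketch in the spirit of the batcalc_minus_int loop shown earlier in the deck; select_gt_int, fetch_int, and batcalc_mult_int are hypothetical names chosen for illustration and are not claimed to be MonetDB's actual primitives.

```c
#include <assert.h>

/* Selection: store the positions i with col[i] > val in pos; returns how many. */
int select_gt_int(int *pos, const int *col, int val, int n)
{
    assert(n >= 0);
    int k = 0;
    for (int i = 0; i < n; i++)
        if (col[i] > val)
            pos[k++] = i;
    return k;
}

/* Positional fetch: res[i] = col[pos[i]]. */
void fetch_int(int *res, const int *col, const int *pos, int n)
{
    for (int i = 0; i < n; i++)
        res[i] = col[pos[i]];
}

/* res[i] = col[i] - val */
void batcalc_minus_int(int *res, const int *col, int val, int n)
{
    for (int i = 0; i < n; i++)
        res[i] = col[i] - val;
}

/* res[i] = col[i] * val */
void batcalc_mult_int(int *res, const int *col, int val, int n)
{
    for (int i = 0; i < n; i++)
        res[i] = col[i] * val;
}
```

A query then becomes a fixed sequence of calls: select_gt_int on age, fetch_int for each needed column, batcalc_minus_int with 30, batcalc_mult_int with 50. Each call is one tight loop over an array, and each intermediate array is fully materialized, which is the cost the deck turns to next.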
a Faustian pact
• You want efficiency
  - Simple hard-coded operators
• I take scalability
  - Result materialization
• C program: 0.2s
• MonetDB: 3.7s
• MySQL: 26.2s
• DBMS “X”: 28.1s
as a research platform
SIGMOD 1985
MonetDB BAT Algebra
MonetDB supports SQL, XML, ODMG, .. RDF
RDF support on C-STORE / SW-Store
• “MIL Primitives for Querying a Fragmented World”, Boncz, Kersten, VLDBJ ’98
• “Flattening an Object Algebra to Provide Performance”, Boncz, Wilschut, Kersten, ICDE ’98
• “MonetDB/XQuery: a fast XQuery processor powered by a relational engine”, Boncz, Grust, van Keulen, Rittinger, Teubner, SIGMOD ’06
• “SW-Store: a vertically partitioned DBMS for Semantic Web data management”, Abadi, Marcus, Madden, Hollenbach, VLDBJ ’09
SIGMOD 1985
MonetDB never materializes tuples
“Materialization Strategies in a Column-Oriented DBMS”, Abadi, Myers, DeWitt, Madden, ICDE’07
The Software Stack
XQuery, SQL 03, RDF, SOAP, Arrays, OGIS, .. SGL?
Optimizers
MonetDB kernel: MonetDB 4, MonetDB 5
compile
Extensible query language front-end framework
Extensible dynamic run-time QOPT framework!
Extensible architecture-conscious execution platform
as a research platform
- Cache-Conscious Joins
  - Cost Models, Radix-cluster, Radix-decluster
- MonetDB/XQuery:
  - structural joins exploiting positional column access
- Cracking:
  - on-the-fly automatic indexing without workload knowledge
- Recycling:
  - using materialized intermediates
- Run-time Query Optimization:
  - correlation-aware run-time optimization without cost model
optimizer frontend backend
•“MonetDB/XQuery: a fast XQuery processor powered by a relational engine”, Boncz, Grust, vanKeulen, Rittinger, Teubner, SIGMOD’06
•“Database Architecture Optimized for the New Bottleneck: Memory Access”, VLDB’99
•“Generic Database Cost Models for Hierarchical Memory Systems”, VLDB’02 (all Manegold, Boncz, Kersten)
•“Cache-Conscious Radix-Decluster Projections”, Manegold, Boncz, Nes, VLDB’04
•“Database Cracking”, CIDR’07
•“Updating a cracked database”, SIGMOD’07
•“Self-organizing tuple reconstruction in column-stores”, SIGMOD’09 (all Idreos, Manegold, Kersten)
•“An architecture for recycling intermediates in a column-store”, Ivanova, Kersten, Nes, Goncalves, SIGMOD’09
•“ROX: run-time optimization of XQueries”, Abdelkader, Boncz, Manegold, vanKeulen, SIGMOD’09
Cache-Conscious Joins: Radix-Cluster
- Radix-Partitioned Hash Join
  - create partitions << CPU cache
  - small partitions ⇒ many partitions
  - many partitions ⇒ multiple passes needed
•“Database Architecture Optimized for the New Bottleneck: Memory Access”, VLDB’99
•“Generic Database Cost Models for Hierarchical Memory Systems”, VLDB’02 (all Manegold, Boncz, Kersten)
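The partitioned hash join named above can be sketched in Python as follows. This is an illustrative single-pass version under stated assumptions, not the MonetDB implementation: the helper names `partition` and `partitioned_hash_join`, the use of Python's built-in `hash`, and the sample keys are all made up for the example.

```python
from collections import defaultdict


def partition(keys, bits):
    """Split (position, key) pairs into 2**bits partitions on the low hash bits."""
    mask = (1 << bits) - 1
    parts = [[] for _ in range(1 << bits)]
    for pos, key in enumerate(keys):
        parts[hash(key) & mask].append((pos, key))
    return parts


def partitioned_hash_join(left_keys, right_keys, bits):
    """Equi-join two key columns; returns (left position, right position) pairs."""
    matches = []
    pairs = zip(partition(left_keys, bits), partition(right_keys, bits))
    for left_part, right_part in pairs:
        # Build a hash table on one small partition only, so that it can
        # stay cache-resident while the matching partition probes it.
        table = defaultdict(list)
        for pos, key in right_part:
            table[key].append(pos)
        for pos, key in left_part:
            for right_pos in table.get(key, ()):
                matches.append((pos, right_pos))
    return matches


left = [3, 1, 4, 1]
right = [1, 3, 5]
result = sorted(partitioned_hash_join(left, right, bits=2))
print(result)  # [(0, 1), (1, 0), (3, 0)]
```

Equal keys always hash to the same partition number, so only partitions with the same index need to be joined with each other.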
- Radix-Cluster
  - Radix-Sort with early stopping
  - Each pass looks at B_i higher-most radix bits
  - Splitting each input cluster into 2^(B_i) output clusters
  - leaves relation partially ordered
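A small Python sketch of multi-pass radix-clustering as described above. It is an illustration under assumptions: the name `radix_cluster`, the default identity hash, and the choice to cluster on the lowest sum(B_i) bits (each pass taking the highest bits not yet used) are made for the example.

```python
def radix_cluster(values, bits_per_pass, hash_fn=lambda v: v):
    """Cluster values on sum(bits_per_pass) radix bits, B_i bits per pass."""
    clusters = [list(values)]
    shift = sum(bits_per_pass)
    for bits in bits_per_pass:
        shift -= bits                 # this pass looks at the next B_i bits
        mask = (1 << bits) - 1
        next_clusters = []
        for cluster in clusters:
            # Split each input cluster into 2**B_i output clusters.
            parts = [[] for _ in range(1 << bits)]
            for v in cluster:
                parts[(hash_fn(v) >> shift) & mask].append(v)
            next_clusters.extend(parts)
        clusters = next_clusters
    return clusters


values = [5, 2, 7, 0, 3, 6, 1, 4]
two_pass = radix_cluster(values, [1, 1])   # two passes of 1 bit each
one_pass = radix_cluster(values, [2])      # one pass of 2 bits

print(two_pass)  # [[0, 4], [5, 1], [2, 6], [7, 3]]
```

Both variants produce the same four clusters; the two-pass version only ever writes to two output clusters at a time, which is the point of limiting the fan-out per pass.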
- Multiple clustering passes
  - Limit number of clusters per pass
  - Avoid cache/TLB thrashing
  - Trade memory cost for CPU cost
  - Any data type (hashing)
Cache-Conscious Joins: Accurate Cache Miss Cost Modeling
Cache-Conscious Joins: Partitioned Hash Join
Cache-Conscious Joins: Radix-Clustered Hash-Join: overall perf (64,000,000 tuples)
Cache-Conscious Joins: Pre-Projection vs. Post-Projection
- Radix-Decluster = cache-conscious post-projection
Cache-Conscious Joins: Post-Projection
- Partitioned Hash-Join
  Join Index! (“Join Indices”, Valduriez, TODS’87)
- Cluster on Left
  [Figure: dense position column 0 1 2 3 4 ... N-1]
- Project Left
- Cluster on Right
  [Figure: clustered position columns]
- Project Right
  ⇒ Radix-Decluster
•“Cache-Conscious Radix-Decluster Projections”, Manegold, Boncz, Nes, VLDB’04
Cache-Conscious Joins: Radix-Decluster
- Red column forms a dense domain
  - {0,1,2,…,N-1}
- All subsequences are ordered
  - e.g. [Figure: clustered position columns]
- Task: merge subsequences into dense sequence
•“Cache-Conscious Radix-Decluster Projections”, Manegold, Boncz, Nes, VLDB’04
- Approach 1: merge H = 2^B lists
  - drawback: cost O(N log(H))
  - many (H) cursors needed ⇒ cache thrashing
- Approach 2: insert by position
  - drawback: many (H) sparse passes over the result ⇒ no cache reuse
- Radix-Decluster: insert by position with sliding window
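The sliding-window idea can be sketched in Python as below. This is a simplified illustration, not the algorithm as published: the function name `radix_decluster`, the `(position, value)` pair representation, and the sample data are assumptions. Each cluster is sorted on position, and all positions together form the dense domain 0..N-1.

```python
def radix_decluster(clusters, n, window):
    """Merge clusters of (position, value) pairs into one dense result column."""
    result = [None] * n
    cursors = [0] * len(clusters)          # one read cursor per cluster
    for low in range(0, n, window):
        high = low + window                # current insertion window [low, high)
        for c, cluster in enumerate(clusters):
            i = cursors[c]
            # Each cluster is read sequentially; writes are random, but only
            # inside the current window of the result column.
            while i < len(cluster) and cluster[i][0] < high:
                pos, value = cluster[i]
                result[pos] = value
                i += 1
            cursors[c] = i
    return result


# Hypothetical input: 3 clusters, each ordered on destination position.
clusters = [
    [(2, "c"), (4, "e"), (6, "g")],
    [(1, "b"), (3, "d"), (7, "h")],
    [(0, "a"), (5, "f"), (8, "i")],
]
column = radix_decluster(clusters, n=9, window=2)
print("".join(column))  # abcdefghi
```

Because positions inside a cluster are ascending, a cursor never has to move backwards, so the whole merge is one sequential pass per cluster plus random writes confined to a window that is small enough to stay in cache.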
Cache-Conscious Joins: Radix-Decluster In Action
[Figure sequence: 3 clusters; destination positions; projection column (still) in wrong order; result column to fill; insertion window of size 2]
•“Cache-Conscious Radix-Decluster Projections”, Manegold, Boncz, Nes, VLDB’04
Cache-Conscious Joins: Radix-Decluster Memory Access Pattern
- Only compulsory misses ⇒ cache-conscious
  - Random access only to sliding window (<< cache size)
•“Cache-Conscious Radix-Decluster Projections”, Manegold, Boncz, Nes, VLDB’04
Cache-Conscious Joins: Radix-Decluster Performance Tradeoff
- Radix-Decluster prefers small W (i.e. vertical fragmentation)
Read Also:
- Jive Join
- Slam Join
“Fast Joins Using Join Indices”, Ross, Lei, VLDBJ’98
Cache-Conscious Joins: Pre-Projection vs. Post-Projection
“Query Processing Techniques for Solid State Drives”, Tsirogiannis, Harizopoulos, Shah, Wiener, Graefe, SIGMOD’09
Cracking: Self Organizing Databases
“Database Cracking”, Idreos, Manegold, Kersten, CIDR’07
Cracking: how it works
“Updating a cracked database”, Idreos, Manegold, Kersten, SIGMOD’07
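The figures for these slides did not survive extraction. As a rough stand-in, here is a toy Python sketch of the general cracking idea (on-the-fly indexing as a side effect of range selections), written under assumptions and not taken from the cited papers' code: the class name `CrackerColumn`, its fields, and the half-open range semantics are all made up for illustration.

```python
import bisect


class CrackerColumn:
    """Copy of a column that is physically reorganized by each range query."""

    def __init__(self, values):
        self.values = list(values)
        self.pivots = []   # sorted pivot values seen so far
        self.starts = []   # starts[i]: first index holding a value >= pivots[i]

    def _crack(self, pivot):
        """Ensure a piece boundary at pivot; return its index in self.values."""
        k = bisect.bisect_left(self.pivots, pivot)
        if k < len(self.pivots) and self.pivots[k] == pivot:
            return self.starts[k]          # boundary already exists
        # Only the single piece that contains the pivot is touched.
        lo = self.starts[k - 1] if k > 0 else 0
        hi = self.starts[k] if k < len(self.pivots) else len(self.values)
        piece = self.values[lo:hi]
        smaller = [v for v in piece if v < pivot]
        larger = [v for v in piece if v >= pivot]
        self.values[lo:hi] = smaller + larger
        split = lo + len(smaller)
        self.pivots.insert(k, pivot)
        self.starts.insert(k, split)
        return split

    def select_range(self, low, high):
        """Return all values v with low <= v < high as one contiguous slice."""
        start = self._crack(low)
        end = self._crack(high)
        return self.values[start:end]


original = [13, 16, 4, 9, 2, 12, 7, 1, 19, 3]
column = CrackerColumn(original)
first = column.select_range(5, 13)
second = column.select_range(3, 10)
print(sorted(first), sorted(second))  # [7, 9, 12] [3, 4, 7, 9]
```

No index is built up front and no workload knowledge is needed: each query partitions only the pieces its bounds fall into, and later queries reuse the boundaries recorded in `pivots` and `starts`.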
Cracking: sideways cracking in column stores
“Self-organizing tuple reconstruction in column-stores”, Idreos, Manegold, Kersten, SIGMOD’09
Cracking: TPC-H Performance
“Self-organizing tuple reconstruction in column-stores”, Idreos, Manegold, Kersten, SIGMOD’09
Cracking: Future Work
ROX: run-time query optimization
ROX: sampling & incremental materialization
ROX: on-the-fly correlation detection & exploitation
“ROX: run-time optimization of XQueries”, Abdelkader, Boncz, Manegold, vanKeulen, SIGMOD’09
ROX: avoid local minima using Chain Sampling
ROX: detecting deep correlations
“ROX: run-time optimization of XQueries”, Abdelkader, Boncz, Manegold, vanKeulen, SIGMOD’09
ROX: robust query performance
- >800 DBLP co-authorship 4-way join queries
  - Grouped by area: DB, IR, Graphics, Bio
  - 2:2, 3:1, 4:0 cross-area combinations
“ROX: run-time optimization of XQueries”, Abdelkader, Boncz, Manegold, vanKeulen, SIGMOD’09
Recycler: motivation & idea
Motivation:
- scientific databases, data analytics
- Terabytes of data (observational, transactional)
- Prevailing read-only workload
- Ad-hoc queries with commonalities
Background:
- Operator-at-a-time execution paradigm
  - Automatic materialization of intermediates
- Canonical column-store organization
  - Intermediates have reduced dimensionality and finer granularity
  - Simplified overlap analysis
- Recycling idea: instead of garbage collecting, keep the intermediates and reuse them
  - speed up query streams with commonalities
  - low cost and self-organization
“An architecture for recycling intermediates in a column-store”, Ivanova, Kersten, Nes, Goncalves, SIGMOD’09
Recycler: fit into MonetDB
[Figure labels: SQL, XQuery, MAL, MonetDB Server, Tactical Optimizer, Recycler, Optimizer, MonetDB Kernel, Run-time Support, Recycle Pool, Admission & Eviction]
function user.s1_2(A0:date, ...):void;
  X5 := sql.bind("sys","lineitem",...);
  X10 := algebra.select(X5,A0);
  X12 := sql.bindIdx("sys","lineitem",...);
  X15 := algebra.join(X10,X12);
  X25 := mtime.addmonths(A1,A2);
  ...
“An architecture for recycling intermediates in a column-store”, Ivanova, Kersten, Nes, Goncalves, SIGMOD’09
Recycler: instruction matching
Run time comparison of
- instruction types
- argument values

Name  Value     Data type         Size
X1    10        :bat[:oid,:date]
T1    “sys”     :str
T2    “orders”  :str
…

X1 := sql.bind("sys","orders","o_orderdate",0);
…
Y3 := sql.bind("sys","orders","o_orderdate",0);
Exact matching
“An architecture for recycling intermediates in a column-store”, Ivanova, Kersten, Nes, Goncalves, SIGMOD’09
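A toy Python sketch of exact matching on instruction type and argument values. It is an assumption-laden illustration, not the MonetDB recycler: the class `RecyclePool`, the `Intermediate` handle, and the Python stand-ins `bind` and `select` (with inclusive bounds) are hypothetical; only the instruction names are taken from the slide.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Intermediate:
    """Handle to a cached result; its identity is the recycle pool entry."""
    entry_id: int


class RecyclePool:
    def __init__(self):
        self._by_signature = {}   # (instruction name, argument values) -> entry id
        self._results = {}        # entry id -> materialized result
        self.hits = 0

    def run(self, op, fn, *args):
        """Execute op, or reuse the cached result of an exactly matching one."""
        signature = (op, args)
        entry_id = self._by_signature.get(signature)
        if entry_id is None:
            # Arguments that are intermediates are resolved to their values.
            resolved = [self._results[a.entry_id] if isinstance(a, Intermediate)
                        else a for a in args]
            entry_id = len(self._results)
            self._results[entry_id] = fn(*resolved)
            self._by_signature[signature] = entry_id
        else:
            self.hits += 1
        return Intermediate(entry_id)

    def value(self, handle):
        return self._results[handle.entry_id]


# Hypothetical stand-ins for the two instructions used below.
TABLES = {("sys", "orders", "o_orderdate"): [3, 8, 5, 1]}

def bind(schema, table, column, access):
    return TABLES[(schema, table, column)]

def select(column, low, high):
    return [v for v in column if low <= v <= high]

def query(pool):
    x1 = pool.run("sql.bind", bind, "sys", "orders", "o_orderdate", 0)
    x2 = pool.run("algebra.select", select, x1, 2, 6)
    return pool.value(x2)


pool = RecyclePool()
first = query(pool)    # both instructions are computed and cached
second = query(pool)   # both instructions match exactly and are reused
print(first, second, pool.hits)  # [3, 5] [3, 5] 2
```

The select only matches because its argument `x1` resolved to the same pool entry both times, which is the lineage dependency: an operator can match only if its predecessors matched.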
Recycler: instruction subsumption

Name  Value  Data type        Size
X1    10     :bat[:oid,:int]  2000
X3    130    :bat[:oid,:int]  700
X5    150    :bat[:oid,:int]  350
…

X3 := algebra.select(X1,10,80);
…
Y3 := algebra.select(X1,20,45);
X5 := algebra.select(X1,20,60);
[Figure label: X5]
“An architecture for recycling intermediates in a column-store”, Ivanova, Kersten, Nes, Goncalves, SIGMOD’09
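A hedged Python sketch of range-select subsumption: a new selection can be answered from a cached selection whose range contains it, and the smallest such intermediate is the cheapest to scan. The names `CachedSelect`, `range_select`, and `smallest_subsuming`, the inclusive bounds, and the data are assumptions for illustration; the ranges mirror the slide's X3 and X5.

```python
from dataclasses import dataclass


@dataclass
class CachedSelect:
    name: str
    low: int
    high: int
    values: list


def range_select(values, low, high):
    """Stand-in for a range selection with inclusive bounds (an assumption)."""
    return [v for v in values if low <= v <= high]


def smallest_subsuming(pool, low, high):
    """Pick the smallest cached selection whose range contains [low, high]."""
    candidates = [e for e in pool if e.low <= low and high <= e.high]
    return min(candidates, key=lambda e: len(e.values), default=None)


base = list(range(100))                      # hypothetical column X1
pool = [
    CachedSelect("X3", 10, 80, range_select(base, 10, 80)),
    CachedSelect("X5", 20, 60, range_select(base, 20, 60)),
]

# New instruction: select the range [20, 45] from the base column.
source = smallest_subsuming(pool, 20, 45)
answer = range_select(source.values if source else base, 20, 45)
print(source.name, len(answer))  # X5 26
```

Both X3 and X5 subsume the requested range; scanning X5 touches fewer values, and the answer is the same as selecting from the base column.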
Recycler: motivation & idea
Q1 (operator tree: algebra.join over algebra.select(sql.bind("C1")) and sql.bind("C2"))
X1 := sql.bind("C1")
X2 := algebra.select(X1)
X3 := sql.bind("C2")
X4 := algebra.join(X2,X3)
…
• Entire sub-tree cached
• Operator dependencies represented through intermediates (operator lineage)
“An architecture for recycling intermediates in a column-store”, Ivanova, Kersten, Nes, Goncalves, SIGMOD’09
Recycler: example
Q2
[Figure: operator tree of Q2 with nodes algebra.join, sql.bind("C3") and the Q1 sub-tree; cached intermediates X1, X2, X3, X4]
X1 := sql.bind("C1")
X2 := algebra.select(X1)
X3 := sql.bind("C2")
X4 := algebra.join(X2,X3)
…
• Entire sub-tree reused
• Successful matching of an operator depends on successful matching of its predecessors (operator lineage)
“An architecture for recycling intermediates in a column-store”, Ivanova, Kersten, Nes, Goncalves, SIGMOD’09
Recycler: admission policies
Decide about storing the results
- KEEPALL
  - all instructions advised by the optimizer
- CREDIT
  - instructions supplied with credits
  - storage ‘paid’ with 1 credit
  - reuse returns credits
  - lack of reuse limits admission and resource claims
“An architecture for recycling intermediates in a column-store”, Ivanova, Kersten, Nes, Goncalves, SIGMOD’09
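The CREDIT policy can be sketched in Python as follows. This is only an illustration of the bookkeeping described in the bullets: the class name `CreditAdmission`, the initial credit count of 2, and returning exactly one credit per reuse are assumptions, not values from the paper.

```python
class CreditAdmission:
    """Admit an instruction's result only while the instruction has credits."""

    def __init__(self, initial_credits=2):      # initial amount is an assumption
        self.initial_credits = initial_credits
        self.credits = {}

    def admit(self, instruction):
        """Storing a result is 'paid' with 1 credit; no credit, no admission."""
        remaining = self.credits.setdefault(instruction, self.initial_credits)
        if remaining == 0:
            return False
        self.credits[instruction] = remaining - 1
        return True

    def reused(self, instruction):
        """A reuse of a stored result returns a credit to the instruction."""
        current = self.credits.get(instruction, self.initial_credits)
        self.credits[instruction] = current + 1


policy = CreditAdmission(initial_credits=2)
decisions = [policy.admit("algebra.select") for _ in range(3)]
policy.reused("algebra.select")
after_reuse = policy.admit("algebra.select")
print(decisions, after_reuse)  # [True, True, False] True
```

An instruction whose results are never reused runs out of credits and stops claiming pool space, while one that is reused keeps earning admission.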
Recycler: cache policies
- Decide about eviction of intermediates
  - Filter ‘top’ instructions without dependents
  - Pick instructions with smallest utility
    - LRU: time of computation or last reuse
    - BENEFIT: estimated contribution to performance: CPU and I/O costs, recycling
- Triggered by resource limitations (memory or entries)
“An architecture for recycling intermediates in a column-store”, Ivanova, Kersten, Nes, Goncalves, SIGMOD’09
VLDB 2009 Tutorial Column-Oriented Database Systems 203
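As a rough sketch of the eviction step under the LRU policy: only ‘top’ entries (those without dependents) are candidates, and the one with the oldest timestamp is picked. The `Entry` struct and `pick_lru_victim` are hypothetical names introduced for illustration; the BENEFIT policy would replace the timestamp comparison with a cost-based utility.

```c
#include <stddef.h>

/* Hypothetical cache entry for one intermediate. */
typedef struct {
    int in_use;       /* 1 if the slot holds an intermediate */
    int dependents;   /* number of cached entries that consume this one */
    long last_used;   /* logical time of computation or last reuse */
} Entry;

/* Among entries without dependents, choose the one with the oldest
 * last_used value. Returns the slot index, or -1 if nothing is evictable. */
int pick_lru_victim(const Entry *e, size_t n)
{
    int victim = -1;
    for (size_t i = 0; i < n; i++) {
        if (!e[i].in_use || e[i].dependents > 0)
            continue;
        if (victim < 0 || e[i].last_used < e[victim].last_used)
            victim = (int)i;
    }
    return victim;
}
```

Restricting eviction to entries without dependents keeps the lineage intact: an intermediate is never removed while a cached result still refers to it.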
Recycler: SkyServer evaluation
Sloan Digital Sky Survey / SkyServer
http://cas.sdss.org
• 100 GB subset of DR4
• 100-query batch from January 2008 log
• 1.5GB intermediates, 99% reuse
• Join intermediates major consumer of memory and major contributor to savings
[Figure: bar chart of Time (sec) for Naïve, CRD/1GB and KeepAll/Unlim; numbers as extracted: 785, 296, 140, 200, 400, 600, 800]
“An architecture for recycling intermediates in a column-store”, Ivanova, Kersten, Nes, Goncalves, SIGMOD’09
Recycler: summary and future work
• Provides self-organizing cache of intermediates
• Significant performance gains in SkyServer and TPC-H
• Transforms materialization overhead into benefit
• Future work:
  - Refining and developing admission and cache policies
  - Automatic switch of suitable policies
“An architecture for recycling intermediates in a column-store”, Ivanova, Kersten, Nes, Goncalves, SIGMOD’09
“MonetDB/X100”: vectorized query processing
Materialization vs Pipelining
MonetDB spin-off: MonetDB/X100
[Figure: SCAN, SELECT and PROJECT operators connected by next() calls; SCAN delivers the vectors (101, 102, 104, 105), (alice, ivan, peggy, victor) and (22, 37, 45, 25)]
“MonetDB/X100: Hyper-Pipelining Query Execution” Boncz, Zukowski, Nes, CIDR’05
[Figure: SCAN, SELECT, PROJECT pipeline; SELECT evaluates “> 30 ?” on the vector (22, 37, 45, 25), giving (FALSE, TRUE, TRUE, FALSE)]
“MonetDB/X100: Hyper-Pipelining Query Execution” Boncz, Zukowski, Nes, CIDR’05
[Figure: SCAN, SELECT, PROJECT pipeline; after SELECT “> 30 ?” the vectors (37, 45), (ivan, peggy) and (102, 104) remain; PROJECT computes “- 30”, giving (7, 15), and “* 50”, giving (350, 750)]
Observations:
• next() called much less often => more time spent in primitives, less in overhead
• primitive calls process an array of values in a loop
“Vectorized In Cache Processing”: vector = array of ~100, processed in a tight loop, CPU cache resident
“MonetDB/X100: Hyper-Pipelining Query Execution” Boncz, Zukowski, Nes, CIDR’05
CPU Efficiency depends on “nice” code
- out-of-order execution
- few dependencies (control, data)
- compiler support
Compilers like simple loops over arrays
- loop-pipelining
- automatic SIMD
“MonetDB/X100: Hyper-Pipelining Query Execution” Boncz, Zukowski, Nes, CIDR’05
[Figure: primitives on the example vectors: “> 30 ?” gives (FALSE, TRUE, TRUE, FALSE); “- 30” gives (7, 15); “* 50” gives (350, 750)]
/* > 30 ? */
for (i = 0; i < n; i++)
    res[i] = (col[i] > x);
/* - 30 */
for (i = 0; i < n; i++)
    res[i] = (col[i] - x);
/* * 50 */
for (i = 0; i < n; i++)
    res[i] = (col[i] * x);
“MonetDB/X100: Hyper-Pipelining Query Execution” Boncz, Zukowski, Nes, CIDR’05
[Figure: SCAN, SELECT, PROJECT pipeline over the vectors (101, 102, 104, 105), (alice, ivan, peggy, victor) and (22, 37, 45, 25); SELECT “> 30 ?” outputs only the selection vector (1, 2)]
Tricks being played:
- Late materialization
- Materialization avoidance using selection vectors
“MonetDB/X100: Hyper-Pipelining Query Execution” Boncz, Zukowski, Nes, CIDR’05
[Figure: SCAN, SELECT, PROJECT pipeline; PROJECT reads (22, 37, 45, 25) through the selection vector (1, 2) and computes “- 30”, giving (7, 15), and “* 50”, giving (350, 750)]
void map_mul_flt_val_flt_col(float *res, int *sel, float val, float *col, int n)
{
    for (int i = 0; i < n; i++)
        res[i] = val * col[sel[i]];
}
Selection vectors are used to reduce vector copying; they contain the selected positions.
“MonetDB/X100: Hyper-Pipelining Query Execution” Boncz, Zukowski, Nes, CIDR’05
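The interplay between a selection primitive and map primitives can be sketched as below, using the slide's example values. The function names are hypothetical, modelled on the naming pattern of `map_mul_flt_val_flt_col`, and integers are used instead of floats so the numbers match the figure; this is an illustration, not the X100 source.

```c
/* Selection primitive: store the positions i with col[i] > val in sel.
 * sel must have room for n entries. Returns the number of selected
 * positions. The loop body contains no if-then-else. */
int select_gt_int_col_int_val(int *sel, const int *col, int val, int n)
{
    int k = 0;
    for (int i = 0; i < n; i++) {
        sel[k] = i;
        k += (col[i] > val);
    }
    return k;
}

/* Map primitive that reads its input through a selection vector and
 * writes a dense result of n values. */
void map_sub_int_col_int_val(int *res, const int *sel, const int *col, int val, int n)
{
    for (int i = 0; i < n; i++)
        res[i] = col[sel[i]] - val;
}

/* Map primitive on a dense input vector. */
void map_mul_int_val_int_col(int *res, int val, const int *col, int n)
{
    for (int i = 0; i < n; i++)
        res[i] = val * col[i];
}
```

With the input (22, 37, 45, 25) and the predicate `> 30`, the selection vector holds positions 1 and 2; subtracting 30 through it gives (7, 15) and multiplying by 50 gives (350, 750). The name and id columns are never copied by SELECT.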
MonetDB/X100
• Both efficiency
  - Vectorized primitives
• and scalability..
  - Pipelined query evaluation
• C program: 0.2s
• MonetDB/X100: 0.6s
• MonetDB: 3.7s
• MySQL: 26.2s
• DBMS “X”: 28.1s
“MonetDB/X100: Hyper-Pipelining Query Execution” Boncz, Zukowski, Nes, CIDR’05
Memory Hierarchy
[Figure: memory hierarchy: (RAID) disk(s), RAM with ColumnBM (buffer manager), CPU cache with the X100 query engine]
Vectors are only the in-cache representation
RAM & disk representation might actually be different (Vectorwise uses both PAX & DSM)
“MonetDB/X100: Hyper-Pipelining Query Execution” Boncz, Zukowski, Nes, CIDR’05
Optimal Vector size?
[Figure: memory hierarchy: (RAID) disk(s), RAM with ColumnBM (buffer manager), CPU cache with the X100 query engine]
All vectors together should fit the CPU cache
Optimizer should tune this, given the query characteristics.
“MonetDB/X100: Hyper-Pipelining Query Execution” Boncz, Zukowski, Nes, CIDR’05
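One simple way to read “all vectors together should fit the CPU cache” is as an upper bound on the vector length. The sketch below is an assumption for illustration, not how the X100 optimizer actually tunes the size: `max_vector_size` is a hypothetical helper that divides a cache budget by the total width of all vectors a query keeps in flight.

```c
#include <stddef.h>

/* Largest vector length (in values) such that all vectors of a query fit
 * in cache_bytes. widths[i] is the value width in bytes of the i-th vector. */
size_t max_vector_size(size_t cache_bytes, const size_t *widths, size_t nvectors)
{
    size_t bytes_per_position = 0;
    for (size_t i = 0; i < nvectors; i++)
        bytes_per_position += widths[i];
    if (bytes_per_position == 0)
        return 0;
    return cache_bytes / bytes_per_position;
}
```

For example, with a 256 KB budget and four vectors of 4, 4, 8 and 4 bytes per value, at most 13107 values per vector fit. A query with more or wider intermediate vectors gets a smaller bound, which is why the choice depends on the query characteristics.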
Varying the Vector size
• Less and less iterator.next() and primitive function calls (“interpretation overhead”)
• Vectors start to exceed the CPU cache, causing additional memory traffic
• The benefit of selection vectors
“MonetDB/X100: Hyper-Pipelining Query Execution” Boncz, Zukowski, Nes, CIDR’05
MonetDB/MIL materializes columns
[Figure: memory hierarchy with (RAID) disk(s), RAM, CPU cache and ColumnBM (buffer manager)]
“MonetDB/X100: Hyper-Pipelining Query Execution” Boncz, Zukowski, Nes, CIDR’05
Benefits of Vectorized Processing
• 100x fewer function calls
  - iterator.next(), primitives
• No instruction cache misses
  - High locality in the primitives
• Fewer data cache misses
  - Cache-conscious data placement
• No tuple navigation
  - Primitives are record-oblivious, only see arrays
• Vectorization allows algorithmic optimization
  - Move activities out of the loop (“strength reduction”)
• Compiler-friendly function bodies
  - Loop-pipelining, automatic SIMD
“Buffering Database Operations for Enhanced Instruction Cache Performance” Zhou, Ross, SIGMOD’04
“Block oriented processing of relational database operations in modern computer architectures” Padmanabhan, Malkemus, Agarwal, ICDE’01
“MonetDB/X100: Hyper-Pipelining Query Execution” Boncz, Zukowski, Nes, CIDR’05
Vectorizing Relational Operators
• Project
• Select
  - Exploit selectivities, test buffer overflow
• Aggregation
  - Ordered, Hashed
• Sort
  - Radix-sort nicely vectorizes
• Join
  - Merge-join + Hash-join
“Balancing Vectorized Query Execution with Bandwidth Optimized Storage” Zukowski, CWI 2008
Efficient Column Store Compression
Compression in Column Stores
Column stores:
• More CPU bound thanks to less I/O
• Provide good compression opportunities
Reduce CPU effort by operating directly on compressed data
• “Integrating Compression and Execution in Column-Oriented Database Systems” Abadi, Madden, Ferreira, SIGMOD’06
• “Query optimization in compressed database systems” Chen, Gehrke, Korn, SIGMOD’01
Compression in MonetDB/X100
MonetDB/X100 highly I/O bound:
• 1GB/s query consumption per 2GHz core
• 1/3 CPU for decompression => 3GB/s needed
=> new lightweight compression schemes
“Super-Scalar RAM-CPU Cache Compression” Zukowski, Heman, Nes, Boncz, ICDE’06
Key Ingredients
• Compress relations on a per-column basis
  - Easy to exploit redundancy
• Decompress small vectors of tuples from a column into the CPU cache
  - Minimize main-memory overhead
• Use light-weight, CPU-efficient algorithms
  - Exploit processing power of modern CPUs
“Super-Scalar RAM-CPU Cache Compression” Zukowski, Heman, Nes, Boncz, ICDE’06
Patched Frame Of Reference (PFOR)
“Compressing Relations and Indexes” Goldstein, Ramakrishnan, Shaft, ICDE’98
“Super-Scalar RAM-CPU Cache Compression” Zukowski, Heman, Nes, Boncz, ICDE’06
PFOR-Delta
• PFOR on the differences (deltas) between subsequent values in an attribute column (a[i] – a[i-1])
• Performs well on columns containing increasing/decreasing sequences
  - inverted lists
  - timestamps
  - object IDs
  - sorted / clustered columns
“Improved Word-Aligned Binary Compression for Text Indexing” Anh, Moffat, TKDE’06
“Super-Scalar RAM-CPU Cache Compression” Zukowski, Heman, Nes, Boncz, ICDE’06
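The delta step a[i] – a[i-1] and its inverse can be sketched as follows. This shows only the delta transformation, not the PFOR bit-packing applied to the resulting small values; `delta_encode` and `delta_decode` are hypothetical names.

```c
#include <stddef.h>
#include <stdint.h>

/* Replace a[i] by a[i] - a[i-1], in place; a[0] is kept as the base value.
 * Iterates from the back so each a[i-1] is still the original value. */
void delta_encode(int32_t *a, size_t n)
{
    for (size_t i = n; i-- > 1; )
        a[i] -= a[i - 1];
}

/* Inverse of delta_encode: a running sum restores the original values. */
void delta_decode(int32_t *a, size_t n)
{
    for (size_t i = 1; i < n; i++)
        a[i] += a[i - 1];
}
```

A sorted column such as (100, 103, 104, 110, 111) becomes (100, 3, 1, 6, 1): the deltas fall in a much narrower range than the original values, which is what makes a frame-of-reference encoding of them effective.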
Patched Dictionary (PDICT)
• Code words represent indices into a dictionary (hash table) containing the 2^b most frequent source values.
• Dictionary stored in block header
• Store colliding values as exceptions
“Super-Scalar RAM-CPU Cache Compression” Zukowski, Heman, Nes, Boncz, ICDE’06
Key Ingredients
• Compress relations on a per-column basis
  - Columns compress well
• Decompress small vectors of tuples from a column into the CPU cache
  - Minimize main-memory overhead
• Use light-weight, CPU-efficient algorithms
  - Exploit processing power of modern CPUs
“Super-Scalar RAM-CPU Cache Compression” Zukowski, Heman, Nes, Boncz, ICDE’06
Vectorized Decompression
[Figure: memory hierarchy: (RAID) disk(s), RAM with ColumnBM (buffer manager), CPU cache with the X100 query engine]
Idea: decompress a vector only
Compression:
- between CPU and RAM
- instead of disk and RAM (classic)
RAM-Cache Decompression
• Decompress vectors on-demand into the cache
• RAM-Cache boundary only crossed once
• More (compressed) data cached in RAM
• Less bandwidth use
“Super-Scalar RAM-CPU Cache Compression” Zukowski, Heman, Nes, Boncz, ICDE’06
Multi-Core Bandwidth & Compression
[Figure: “Performance Degradation with Concurrent Queries”: avg query speed (clock normalized, 0 to 1.2) against number of concurrent queries (1 to 8); series: Intel® Xeon E5410 (Harpertown) Q6b Normal, Intel® Xeon E5410 (Harpertown) Q6b Compressed, Intel® Xeon X5560 (Nehalem) Q6b Normal, Intel® Xeon X5560 (Nehalem) Q6b Compressed]
CPU Efficient Decompression
• Decoding loop over cache-resident vectors of code words
• Avoid control dependencies within decoding loop
  - no if-then-else constructs in loop body
• Avoid data dependencies between loop iterations
“Super-Scalar RAM-CPU Cache Compression” Zukowski, Heman, Nes, Boncz, ICDE’06
Disk Block Layout
• Forward growing section of arbitrary size code words (code word size fixed per block)
• Backwards growing exception list
“Super-Scalar RAM-CPU Cache Compression” Zukowski, Heman, Nes, Boncz, ICDE’06
Naïve Decompression Algorithm
• Use reserved value from code word domain (MAXCODE) to mark exception positions
“Super-Scalar RAM-CPU Cache Compression” Zukowski, Heman, Nes, Boncz, ICDE’06
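A naive decoder of this kind might look like the sketch below. It is a simplified illustration under stated assumptions, not the code from the paper: code words are stored as whole bytes rather than bit-packed, values are decoded as `base + code` (frame of reference), and the exception list is a plain forward array; `decode_naive` and its parameters are hypothetical names.

```c
#include <stddef.h>
#include <stdint.h>

#define MAXCODE 255u  /* reserved code word that marks an exception */

/* Naive decoding: every iteration tests for the exception marker and
 * either fetches the next exception or adds the code word to the base. */
void decode_naive(int32_t *out, const uint8_t *code, size_t n,
                  int32_t base, const int32_t *exceptions)
{
    size_t next_exc = 0;
    for (size_t i = 0; i < n; i++) {
        if (code[i] == MAXCODE)
            out[i] = exceptions[next_exc++];  /* data-dependent branch */
        else
            out[i] = base + code[i];
    }
}
```

The if-then-else sits inside the decoding loop, so its outcome depends on the data; as the exception rate grows, the branch becomes harder for the CPU to predict.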
Deterioration With Exception%
• 1.2GB/s deteriorates to 0.4GB/s
• Perf Counters: CPU mispredicts if-then-else
“Super-Scalar RAM-CPU Cache Compression” Zukowski, Heman, Nes, Boncz, ICDE’06
Patching
• Maintain a patch-list through code word section that links exception positions
• After decoding, patch up the exception positions with the correct values
“Super-Scalar RAM-CPU Cache Compression” Zukowski, Heman, Nes, Boncz, ICDE’06
Patched Decompression
“Super-Scalar RAM-CPU Cache Compression” Zukowski, Heman, Nes, Boncz, ICDE’06
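Patched decompression can be sketched in two passes, as below. The same simplifying assumptions apply as for any byte-oriented illustration: code words are whole bytes rather than bit-packed, values are decoded as `base + code`, the exception list is a forward array, and every gap between consecutive exceptions is assumed to fit in one code word. `decode_patched` and its parameters are hypothetical names.

```c
#include <stddef.h>
#include <stdint.h>

/* Two-pass decoding. At an exception position the code word does not hold
 * a value; it holds the distance to the next exception (the patch list). */
void decode_patched(int32_t *out, const uint8_t *code, size_t n,
                    int32_t base, const int32_t *exceptions,
                    size_t n_exceptions, size_t first_exception)
{
    /* Pass 1: decode every position without any test; the output at
     * exception positions is wrong for now. */
    for (size_t i = 0; i < n; i++)
        out[i] = base + code[i];

    /* Pass 2: walk the patch list and overwrite the exception positions. */
    size_t pos = first_exception;
    for (size_t k = 0; k < n_exceptions; k++) {
        out[pos] = exceptions[k];
        pos += code[pos];  /* distance to the next exception */
    }
}
```

Neither loop contains an if-then-else, and the first loop has no dependency between iterations, so it is the kind of simple array loop that compilers and out-of-order CPUs handle well. The second loop only touches the exception positions.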
Decompression Bandwidth
• Patching makes two passes, but is faster!
Patching can be applied to:
• Frame of Reference (PFOR)
• Delta Compression (PFOR-DELTA)
• Dictionary Compression (PDICT)
Makes these methods much more applicable to noisy data => without additional cost
“Super-Scalar RAM-CPU Cache Compression” Zukowski, Heman, Nes, Boncz, ICDE’06
TPC-H 100GB – Slow RAID
• Gains equal to compression ratio
TPC-H 100GB – Faster RAID
• Still some gains with 450MB/s RAID
TPC-H 100GB – Final Result
• Compares well to DBMS with 8x more resources
“Super-Scalar RAM-CPU Cache Compression” Zukowski, Heman, Nes, Boncz, ICDE’06
Terabyte TREC 2005
Run | P@20 | msec/q followed by CPUs (digits fused in extraction)
MonetDB/X100 | 0.547 | 3.216
MU05TBy3 | 0.555 | 248
uwmtEwteD10 | 0.390 | 272
MU05TBy1 | 0.562 | 428
Zetdist | 0.530 | 588
pisaEff4 | 0.342 | 14323
Zetdir | 0.541 | 2311
• Many systems used BM25 (X100 also)
• Kept 9GB index in 8x2GB RAM thanks to PFOR-delta compression (ratio~=3)
“Super-Scalar RAM-CPU Cache Compression” Zukowski, Heman, Nes, Boncz, ICDE’06
Column Store Disk I/O
[Figure panels: TPC-H 100GB cost per component; Magnetic Disk Trends; DRAM Trend]
The trend for more spindles per CPU core (8-20) is hard to sustain given many-core CPUs. In OLTP, the situation is worse.
Disk Scans: Unit Size
[Figure annotations: “Only 1 disk used in RAID”; “Stripe size”; “2MB: all 12 disks used”]
Non-sequential scans: 1-8MB needed
Using Disk in Data Warehousing
Goals:
1. large-chunk disk access *only*
2. minimize bandwidth, prepare for concurrency
Database strategies:
• replicate tables in multiple orders (goal 1)
• clustering join-tables in foreign-key order (goal 1)
• keep dimension tables in RAM (goal 1&2)
• use a column-store (goal 1&2)
• increase disk bandwidth with lightweight compression (goal 2)
• coordinate concurrent disk access (goal 1&2)
• “Adjoined Dimension Column Clustering to Improve Data Warehouse Query Performance” O’Neil, O’Neil, Chen, ICDE’08
• “C-Store: A Column-oriented DBMS”, Stonebraker et al, VLDB’05
Column-Store I/O
• Columns with various physical sizes: where are corresponding tuples located?
• More prefetch buffers needed than for rows.
• Multiple cursors: columns fight among themselves for disk
“Performance Tradeoffs in Read-Optimized Databases” Harizopoulos, Liang, Abadi, Madden, VLDB’06
Scans in a DBMS
• Scan-based processing:
  - Large queries
  - Clustered indices
  - No useful indices
• Types of scans:
  - Full-table scans
  - Range-scans
  - Multi-range scans
Concurrent scans
• Multiple queries scanning the same table
• Different start times
• Different scan ranges
• Compete for disk access and buffer space
• FCFS request scheduling: poor latency
Chunks
• Pages clustered on disk
• Large I/O units
• Amortizes random-seek with large-reads
• Result: “random” system bandwidth close to sequential
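The amortization argument can be made concrete with a small calculation. The model below is a simplifying assumption (one seek per chunk followed by a sequential transfer), and `effective_bandwidth` is a hypothetical helper; the disk parameters in the usage note are illustrative, not measurements from the tutorial.

```c
/* Effective bandwidth in MB/s when each random access reads chunk_mb
 * megabytes, given an average seek time in milliseconds and the
 * sequential bandwidth of the device in MB/s. */
double effective_bandwidth(double chunk_mb, double seek_ms, double seq_mb_per_s)
{
    double transfer_s = chunk_mb / seq_mb_per_s;
    double seek_s = seek_ms / 1000.0;
    return chunk_mb / (seek_s + transfer_s);
}
```

Assuming a 10 ms seek and 100 MB/s sequential bandwidth, random 64 KB reads reach under 6 MB/s under this model, while random 8 MB chunks reach almost 89 MB/s, which is close to the sequential rate.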
“Normal” policy
• Strictly sequential read order
“Normal” performance
• Data read many times – poor sharing
• Many scans fight for bandwidth
• Both latency and throughput bad
“Normal” in real life
Shared scans
• Observation: queries often do not need data in a sequential order
• Idea: make queries “share” the scanning process
• Two existing types:
  - Attach
  - Elevator
“Attach” policy
• Attach to a running query with data overlap
“Attach” performance
• Better than Normal
• Only one overlapping range is used
• Queries with different speeds can “detach”
“Increasing buffer-locality for multiple relational table scans through grouping and throttling” Lang, Bhattacharjee, Malkemus, Padmanabhan, Wong, ICDE’07
“Attach” in real life
“Elevator” policy
• A single sliding window over a table
“Elevator” performance
• Maximizes sharing, minimizes I/Os
• Good for long queries and uniform loads
• Short queries wait for the window
• Fast queries wait for the slow ones
“Elevator” in real life
Shared scans – main problem
• Query read sequence in shared scans:
  - Broken into 2 parts
  - Then fully static
  - Misses opportunities in a dynamic environment
Cooperative scans
• Core ideas:
  - Dynamically adapt to the current situation
  - Allow fully arbitrary chunk order
• Goals:
  - Maximize data sharing
  - Optimize latency and throughput
  - Work for different types of queries
“Cooperative Scans: Dynamic Bandwidth Sharing in a DBMS”, Zukowski, Heman, Nes, Boncz, VLDB’07
Re-use permitted when acknowledging the original © Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)
Active Buffer Manager
• ABM knows the status of all the queries
• Queries investigate ABM content:
  • If some chunks buffered, choose one to use
  • If not, wait for ABM to provide a new chunk
• ABM in a loop:
  • Chooses a query to serve
  • Chooses a chunk to load
  • If out of space, chooses what to keep
  • Loads a chunk and notifies the queries
“Cooperative Scans: Dynamic Bandwidth Sharing in a DBMS”, Zukowski, Heman, Nes, Boncz VLDB’07
“Relevance” functions
• ABM knows the status of all the queries
• Queries investigate ABM content:
  • If some chunks buffered, choose one to use: useRelevance()
  • If not, wait for ABM to provide a new chunk
• ABM in a loop:
  • Chooses a query to serve: queryRelevance()
  • Chooses a chunk to load: loadRelevance()
  • If out of space, chooses what to keep: keepRelevance()
  • Loads a chunk and notifies the queries
“Cooperative Scans: Dynamic Bandwidth Sharing in a DBMS”, Zukowski, Heman, Nes, Boncz VLDB’07
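One iteration of the ABM loop can be sketched as follows. This is an illustration only: the data structures (`needed`, `buffered`), the function `abm_step`, and the three deliberately naive placeholder policies are assumptions, standing in for the relevance functions named on the slide.

```python
def abm_step(needed, buffered, capacity, pick_query, pick_load, pick_evict):
    """One iteration of the ABM loop.

    needed:   dict mapping query -> set of chunk ids it still has to read
    buffered: set of chunk ids currently in memory (mutated in place)
    Returns (loaded_chunk, notified_queries), or None when no query is starved.
    """
    # A query is starved when none of its remaining chunks is buffered.
    starved = [q for q, chunks in needed.items() if chunks and not (chunks & buffered)]
    if not starved:
        return None
    query = pick_query(starved, needed)              # choose a query to serve
    chunk = pick_load(query, needed, buffered)       # choose a chunk to load
    if len(buffered) >= capacity:                    # out of space: choose what to keep
        buffered.discard(pick_evict(buffered, needed))
    buffered.add(chunk)                              # load the chunk
    notified = sorted(q for q, chunks in needed.items() if chunk in chunks)
    return chunk, notified                           # notify the interested queries


# Naive placeholder policies (assumptions); relevance functions would replace them.
def first_starved(starved, needed):
    return starved[0]


def lowest_missing_chunk(query, needed, buffered):
    return min(needed[query] - buffered)


def evict_lowest(buffered, needed):
    return min(buffered)


needed = {"q1": {0, 1}, "q2": {1, 2}}
buffered = set()
result = abm_step(needed, buffered, 2, first_starved, lowest_missing_chunk, evict_lowest)
```

Passing the policies in as functions keeps the control flow separate from the decisions, which is where the four relevance functions plug in.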
useRelevance()
• Choose a chunk with a minimal number of queries interested
  • Allows early chunk eviction
“Cooperative Scans: Dynamic Bandwidth Sharing in a DBMS”, Zukowski, Heman, Nes, Boncz VLDB’07
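A minimal sketch of this choice, assuming each query tracks the set of chunks it still needs. The names `use_relevance` and `choose_chunk_to_use` and the scoring (negated interest count, ties broken by lowest chunk id) are illustrative assumptions.

```python
def use_relevance(chunk, needed):
    """Higher is better: chunks wanted by fewer queries score higher."""
    interested = sum(1 for chunks in needed.values() if chunk in chunks)
    return -interested


def choose_chunk_to_use(query, needed, buffered):
    """Pick the buffered chunk this query should consume next, or None if it must wait."""
    candidates = needed[query] & buffered
    if not candidates:
        return None
    # sorted() makes tie-breaking deterministic (lowest chunk id wins).
    return max(sorted(candidates), key=lambda c: use_relevance(c, needed))


needed = {"q1": {0, 1, 2}, "q2": {1, 2}, "q3": {2}}
buffered = {0, 1, 2}
choice = choose_chunk_to_use("q1", needed, buffered)
```

Here q1 consumes chunk 0 first. Since no other query wants chunk 0, it can be evicted as soon as q1 is done with it, while the more widely shared chunks stay buffered.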
queryRelevance()
• Choose “starved” queries only
  • Queries having data are doing fine
• Promote short queries
  • Better query-stream throughput
  • Avoid round-robin request scheduling
• Promote long-waiting queries
  • Don’t let short queries starve the long ones
“Cooperative Scans: Dynamic Bandwidth Sharing in a DBMS”, Zukowski, Heman, Nes, Boncz VLDB’07
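One plausible way to combine these three rules into a single score is sketched below. The exact formula and weighting are assumptions made for illustration (the paper should be consulted for the real definition), and the parameter names are hypothetical.

```python
import math


def query_relevance(chunks_needed, waiting_time, starved, running_queries):
    """Score a query for service; the highest score is served first."""
    if not starved:
        return -math.inf    # queries that already have data are doing fine
    # Fewer remaining chunks -> higher score (promotes short queries).
    # Longer waiting time -> higher score (long queries are not starved forever).
    return -chunks_needed + waiting_time / running_queries


queries = {
    "short":        dict(chunks_needed=2,  waiting_time=0.0,   starved=True),
    "long_waiting": dict(chunks_needed=50, waiting_time=200.0, starved=True),
    "long_fresh":   dict(chunks_needed=50, waiting_time=0.0,   starved=True),
    "busy":         dict(chunks_needed=1,  waiting_time=0.0,   starved=False),
}
scores = {name: query_relevance(running_queries=len(queries), **q)
          for name, q in queries.items()}
ranking = sorted(scores, key=scores.get, reverse=True)
```

With these made-up numbers the long query that has waited a long time overtakes the short one, the freshly arrived long query comes next, and the non-starved query is never chosen.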
loadRelevance()
• Load chunks interesting for the maximum number of starved queries
  • Keep many queries busy
  • Amortize loading cost among many queries
“Cooperative Scans: Dynamic Bandwidth Sharing in a DBMS”, Zukowski, Heman, Nes, Boncz VLDB’07
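A small sketch of this rule, assuming the ABM already knows which queries are starved. `load_relevance` and `choose_chunk_to_load` are hypothetical names, and ties are broken by lowest chunk id.

```python
def load_relevance(chunk, needed, starved):
    """Number of starved queries that would be unblocked by loading `chunk`."""
    return sum(1 for q in starved if chunk in needed[q])


def choose_chunk_to_load(query, needed, buffered, starved):
    """Among the chunks `query` still misses, load the one most starved queries want."""
    candidates = sorted(needed[query] - buffered)
    return max(candidates, key=lambda c: load_relevance(c, needed, starved))


needed = {"q1": {3, 4, 5}, "q2": {4, 5}, "q3": {5, 6}}
buffered = {6}               # q3 can work on chunk 6, so it is not starved
starved = ["q1", "q2"]
chunk = choose_chunk_to_load("q1", needed, buffered, starved)
```

Loading chunk 4 serves both starved queries with one I/O. If q3 were starved as well, chunk 5 would win because all three queries want it.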
keepRelevance()
• Keep chunks interesting for the maximum number of (almost) starved queries
  • Avoid blocking queries
“Cooperative Scans: Dynamic Bandwidth Sharing in a DBMS”, Zukowski, Heman, Nes, Boncz VLDB’07
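The eviction side can be sketched in the same style. The definition of "almost starved" used here (at most `threshold` of a query's needed chunks are buffered) is an assumption made for this example, as are all function names.

```python
def almost_starved(query, needed, buffered, threshold=1):
    """Assumption: a query is almost starved if few of its chunks are buffered."""
    return len(needed[query] & buffered) <= threshold


def keep_relevance(chunk, needed, buffered, threshold=1):
    """Number of (almost) starved queries that still want `chunk`."""
    return sum(1 for q in needed
               if chunk in needed[q] and almost_starved(q, needed, buffered, threshold))


def choose_chunk_to_evict(needed, buffered, threshold=1):
    """Evict the buffered chunk with the lowest keep relevance."""
    return min(sorted(buffered),
               key=lambda c: keep_relevance(c, needed, buffered, threshold))


needed = {"q1": {0, 1, 2}, "q2": {2}, "q3": {1}}
buffered = {0, 1, 2}
victim = choose_chunk_to_evict(needed, buffered)
```

Chunk 0 is evicted: only q1 wants it, and q1 still has two other buffered chunks to work on, whereas dropping chunk 1 or 2 would block q3 or q2.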
“Relevance policy”
• Follow the relevance functions
“Cooperative Scans: Dynamic Bandwidth Sharing in a DBMS”, Zukowski, Heman, Nes, Boncz VLDB’07
“Relevance” in real life
“Cooperative Scans: Dynamic Bandwidth Sharing in a DBMS”, Zukowski, Heman, Nes, Boncz VLDB’07
Extensive benchmark
• TPC-H SF-10 dataset
• MonetDB/X100, PAX storage
• Two query speeds: Fast (Q6), Slow (Q1)
• Varying scan ranges: 1%, 10%, 50%, 100%, at random positions
• 16 concurrent streams, 3 seconds delay
• 4 random queries in each stream
“Cooperative Scans: Dynamic Bandwidth Sharing in a DBMS”, Zukowski, Heman, Nes, Boncz VLDB’07
Global results
                                Normal   Attach   Elevator   Relevance
Avg. stream time (sec)             283      160        138          99
Avg. normalized query latency     6.42     3.72      13.52        1.96
Total time (sec)                   453      281        244         238
CPU usage (%)                    53.20    81.31      90.20       93.94
Number of I/Os                    4186     2325       1404        1842
“Cooperative Scans: Dynamic Bandwidth Sharing in a DBMS”, Zukowski, Heman, Nes, Boncz VLDB’07
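To make the relative gains easier to read, the ratios against the Normal policy can be computed directly. The sketch below uses the slide's figures as read from the results table; the dictionary layout and the helper `improvement_over_normal` are illustrative.

```python
# Figures as read from the slide; Normal is the baseline policy.
results = {
    "Normal":    {"stream_time": 283, "total_time": 453, "ios": 4186},
    "Attach":    {"stream_time": 160, "total_time": 281, "ios": 2325},
    "Elevator":  {"stream_time": 138, "total_time": 244, "ios": 1404},
    "Relevance": {"stream_time": 99,  "total_time": 238, "ios": 1842},
}


def improvement_over_normal(metric):
    """Ratio Normal / policy for a metric where lower is better."""
    base = results["Normal"][metric]
    return {policy: round(base / row[metric], 2) for policy, row in results.items()}


total_time_gain = improvement_over_normal("total_time")
stream_time_gain = improvement_over_normal("stream_time")
io_gain = improvement_over_normal("ios")
```

Under this reading, Relevance finishes the whole workload about 1.9 times faster than Normal and issues fewer than half the I/Os, while Elevator issues the fewest I/Os of all but has the worst normalized latency.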
Results
“Cooperative Scans: Dynamic Bandwidth Sharing in a DBMS”, Zukowski, Heman, Nes, Boncz VLDB’07
Cooperative Scans with DSM
• Reduced sharing opportunities
• Physical-logical data mismatch
• More complex ABM implementation and relevance functions
“Cooperative Scans: Dynamic Bandwidth Sharing in a DBMS”, Zukowski, Heman, Nes, Boncz VLDB’07
Sharing opportunities in DSM
• Both vertical and horizontal overlap needed
“Cooperative Scans: Dynamic Bandwidth Sharing in a DBMS”, Zukowski, Heman, Nes, Boncz VLDB’07
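The two-dimensional condition can be made concrete with a small sketch: two scans can share I/O only for columns they both read (vertical overlap) within chunk ranges they both cover (horizontal overlap). The query descriptions, column names, and `shared_work` helper are hypothetical.

```python
def shared_work(q1, q2):
    """Return (columns, chunks) two scans can share; empty if either dimension misses."""
    columns = q1["columns"] & q2["columns"]          # vertical overlap
    chunks = set(q1["chunks"]) & set(q2["chunks"])   # horizontal overlap
    if not columns or not chunks:
        return set(), set()
    return columns, chunks


a = {"columns": {"price", "date"},  "chunks": range(0, 10)}
b = {"columns": {"price", "store"}, "chunks": range(5, 15)}
c = {"columns": {"customer"},       "chunks": range(0, 10)}

ab = shared_work(a, b)   # overlap in both dimensions
ac = shared_work(a, c)   # same row range, but no common column
```

Queries a and c scan exactly the same rows yet share nothing, which is why sharing opportunities shrink when moving from a row-wise (or PAX) layout to DSM.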
Chunks in DSM
• Larger I/O requirements
• Columns with various physical sizes
• Logical chunks overlap physically
• Chunks as logical concepts
“Cooperative Scans: Dynamic Bandwidth Sharing in a DBMS”, Zukowski, Heman, Nes, Boncz VLDB’07
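The physical-logical mismatch can be illustrated by mapping a logical chunk (a row range) to the disk pages that hold it in each column. The sketch assumes fixed-width, uncompressed values and a 1 MiB page; both are simplifications, and `pages_for_chunk` is a hypothetical helper.

```python
def pages_for_chunk(chunk, rows_per_chunk, value_bytes, page_bytes):
    """Inclusive (first_page, last_page) range holding one logical chunk of a column."""
    first_byte = chunk * rows_per_chunk * value_bytes
    last_byte = (chunk + 1) * rows_per_chunk * value_bytes - 1
    return first_byte // page_bytes, last_byte // page_bytes


ROWS_PER_CHUNK = 100_000
PAGE_BYTES = 1 << 20    # 1 MiB pages (assumption)

# An 8-byte column and a 1-byte column of the same table.
wide = [pages_for_chunk(c, ROWS_PER_CHUNK, 8, PAGE_BYTES) for c in range(3)]
narrow = [pages_for_chunk(c, ROWS_PER_CHUNK, 1, PAGE_BYTES) for c in range(3)]
```

In the wide column, neighbouring chunks share boundary pages; in the narrow column, all three logical chunks live on the same physical page. A chunk is therefore a logical row range whose I/O cost differs per column.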
ABM in DSM
• More complex policies
  • 2-dimensional decisions: chunk + column
  • Columns with different queries interested
  • Even column loading order matters!
• Still, it works
  • Results depend on the overlap (paper)
“Cooperative Scans: Dynamic Bandwidth Sharing in a DBMS”, Zukowski, Heman, Nes, Boncz VLDB’07
DSM – global results
                                Normal   Attach   Elevator   Relevance
Avg. stream time (sec)             536      338        352         264
Avg. normalized query latency     7.05     4.77      15.11        2.96
Total time (sec)                   805      621        562         515
CPU usage (%)                       61       77         82          92
Number of I/Os                    6490     4413       2297        3639
“Cooperative Scans: Dynamic Bandwidth Sharing in a DBMS”, Zukowski, Heman, Nes, Boncz VLDB’07
Conclusion
• Presented MonetDB/X100 (now VectorWise)
  • A new database kernel developed at CWI
• Uses Blocked Iterator Model (Vectorization)
  • works amazingly well
  • So fast ⇒ must reduce hunger for hard disks
• Storage manager specializes in large sequential column I/O
  • + Lightweight compression schemes (give ≈ factor 3)
  • + Cooperative Bandwidth Sharing (gives ≈ factor 2)
• Good performance results
  • Very fast 100GB TPC-H performance
  • Beats IR systems on Terabyte TREC
Column-Oriented Database Systems
Conclusion
Summary (1/2)
• Column-stores and row-stores: different?
  • No fundamental differences
  • Can current row-stores simulate column-stores now?
    • not efficiently: row-stores need change
• On-disk layout vs execution layout
  • actually independent issues, on-the-fly conversion pays off
  • column favors sequential access, row favors random access
• Mixed layout schemes
  • Fractured mirrors
  • PAX, Clotho
  • Data morphing
Summary (2/2)
• Crucial Columnar Techniques
  • Storage
    • Lean headers, sparse indices, fast positional access
  • Compression
    • Operating on compressed data
    • Lightweight, vectorized decompression
  • Late vs Early materialization
    • Non-join: LM always wins
    • Naïve/Invisible/Jive/Flash/Radix Join (LM often wins)
  • Execution
    • Vectorized in-cache execution
    • Exploiting SIMD
Future Work
• looking at write/load tradeoffs in column-stores
  • read-only vs batch loads vs trickle updates vs OLTP
Updates (1/3)
• Column-stores are update-in-place averse
  • In-place: I/O for each column
    • + re-compression
    • + multiple sorted replicas
    • + sparse tree indices
Update-in-place is infeasible!
Updates (2/3)
• Column-stores use differential mechanisms instead
  • Differential lists/files or more advanced (e.g. PDTs)
  • Updates buffered in RAM, merged on each query
  • Checkpointing merges differences in bulk sequentially
• I/O trends favor this anyway
  • trade RAM for converting random into sequential I/O
  • this trade is also needed in Flash (do not write randomly!)
• How high a load can it sustain?
  • Depends on available RAM for buffering (how long until full)
  • Checkpoint must be done within that time
  • The longer it can run, the less it disturbs queries
  • Using Flash for buffering differences buys a lot of time
  • Hundreds of GBs of differences per server
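The differential approach can be sketched with a toy table: the stable columns are never modified in place, updates accumulate in in-memory buffers, every scan merges the buffers on the fly, and a checkpoint rewrites each column once, sequentially. This is a simplified illustration (plain differential lists addressed by stable row position, not PDTs); the class and method names are hypothetical.

```python
class DifferentialColumnTable:
    def __init__(self, columns):
        # Read-optimized column storage, untouched between checkpoints.
        self.stable = {name: list(values) for name, values in columns.items()}
        self.inserts = []       # buffered new rows (dicts)
        self.deleted = set()    # positions of deleted stable rows
        self.modified = {}      # position -> {column: new value}

    def insert(self, row):
        self.inserts.append(dict(row))

    def delete(self, pos):
        self.deleted.add(pos)

    def update(self, pos, column, value):
        self.modified.setdefault(pos, {})[column] = value

    def scan(self, column):
        """Read one column, merging the buffered differences on the fly."""
        for pos, value in enumerate(self.stable[column]):
            if pos in self.deleted:
                continue
            yield self.modified.get(pos, {}).get(column, value)
        for row in self.inserts:
            yield row[column]

    def checkpoint(self):
        """Rewrite every column in one sequential pass and clear the buffers."""
        self.stable = {name: list(self.scan(name)) for name in self.stable}
        self.inserts, self.deleted, self.modified = [], set(), {}


t = DifferentialColumnTable({"store": ["A", "B", "C"], "price": [10, 20, 30]})
t.update(0, "price", 11)
t.delete(1)
t.insert({"store": "D", "price": 40})
before = list(t.scan("price"))   # merged view, stable data untouched
t.checkpoint()
after = list(t.scan("price"))
```

Queries see the same result before and after the checkpoint. The checkpoint only moves the differences from RAM into the stable columns, so it can be deferred for as long as the buffers have room.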
Updates (3/3)
• Differential transactions favored by hardware trends
• Snapshot semantics accepted by the user community
  • can always be converted to serializable
⇒ Row stores could also use differential transactions and be efficient!
⇒ Implies a departure from ARIES
⇒ Implies a full rewrite
My conclusion: a system that combines rows and columns needs differentially implemented transactions. Starting from a pure column-store, this is a limited add-on. Starting from a pure row-store, this implies a full rewrite.
“Serializable Isolation For Snapshot Databases” Alomari, Cahill, Fekete, Roehm, SIGMOD’08
Future Work
• looking at write/load tradeoffs in column-stores
  • read-only vs batch loads vs trickle updates vs OLTP
• database design for column-stores
• column-store specific optimizers
  • compression/materialization/join tricks ⇒ cost models?
• hybrid column-row systems
  • can row-stores learn new column tricks?
    • Study of the minimal number of changes one needs to make to a row-store to get the majority of the benefits of a column-store
  • Alternative: add features to column-stores that make them more like row-stores
Conclusion
• Columnar techniques provide clear benefits for:
  • Data warehousing, BI
  • Information retrieval, graphs, e-science
• A number of crucial techniques make them effective
  • Without these, existing row systems do not benefit
• Row-stores and column-stores could be combined
  • Row-stores may adopt some column-store techniques
  • Column-stores add row-store (or PAX) functionality
• Many open issues to do research on!