26
C-Store: How Different are Column- Stores and Row- Stores? Jianlin Feng School of Software SUN YAT-SEN UNIVERSITY May. 8, 2009

C-Store: How Different are Column-Stores and Row-Stores?

  • Upload
    janna

  • View
    93

  • Download
    0

Embed Size (px)

DESCRIPTION

C-Store: How Different are Column-Stores and Row-Stores?. Jianlin Feng School of Software SUN YAT-SEN UNIVERSITY May. 8, 2009. Row and Column Stores. Limitation of Current Comparison. Assuming a row store does not use any column-oriented physical design. - PowerPoint PPT Presentation

Citation preview

Page 1: C-Store: How Different are Column-Stores and Row-Stores?

C-Store: How Different are Column-Stores and Row-Stores?

Jianlin FengSchool of SoftwareSUN YAT-SEN UNIVERSITYMay. 8, 2009

Page 2: C-Store: How Different are Column-Stores and Row-Stores?

Row and Column Stores

Page 3: C-Store: How Different are Column-Stores and Row-Stores?

Limitation of Current Comparison Assuming a row store does not use any colu

mn-oriented physical design. Columns of the same logical tuple is traditionally s

tored together.

In fact it is possible to simulate a column store in a row store.

Page 4: C-Store: How Different are Column-Stores and Row-Stores?

Simulating a Column Store in a Row Store Vertical Partitioning Indexing Every Column Materialized views C-Tables

Vertical Partitioning + Run-Length Encoding

Page 5: C-Store: How Different are Column-Stores and Row-Stores?

Simulation 1: Vertical Partitioning Each column makes its own table

N columns result in N tables, each table stores (tuple-id, attr) pairs.

Need to do tuple reconstruction In C-store, columns are sorted in the same order.

In MonetDB, base columns are kept in insertion order, updates only happen to cracker columns.

In this “simulation” scenario, an integer “tuple-id” column is associated to each column. Use hash join on “tuple-id” column to reconstruct tuples.

Page 6: C-Store: How Different are Column-Stores and Row-Stores?

Two Problems of Vertical Partitioning 1. It requires the “tuple-id” column to be explici

tly stored in each column. Waste space and disk bandwidth.

2. Most row-stores store a relatively large header on every tuple further waste space.

Page 7: C-Store: How Different are Column-Stores and Row-Stores?

Tuple Header A tuple header provides metadata about the t

uple.

For example, in Postgres, each tuple contains a 27 bytes header including information such as

Insert transaction timestamp. Number of attributes in the tuple. NULL flags Length of tuple header

Page 8: C-Store: How Different are Column-Stores and Row-Stores?

Tuple Header in a Column Store A column-store puts tuple header in separate

columns Some information in the tuple header can be

removed For example, in MonetDB, each logical column st

ores (key, attr) pairs. Hence, the number of attributes is always 2. And there is no need for NULL flags, say we simpl

y don’t store the tuple with NULL value.

Page 9: C-Store: How Different are Column-Stores and Row-Stores?

Simulation 2: Indexing Every Column Base relations are stored using a standard, ro

w-oriented physical design.

And an additional secondary B-Tree is built on every column of every table. Join columns on tuple-id.

This approach answers queries by reading values directly from the indexes.

Page 10: C-Store: How Different are Column-Stores and Row-Stores?

One Problems of Indexing Every Column If a column has no predicate on it, this approach req

uires its index to be scanned to extract the needed values.

One optimization is to create indexes with composite keys

SELECT AVG(salary) FROM emp WHERE age > 40;

If we have a composite index with an (age, salary) key, we can answer this query directly from the index.

Page 11: C-Store: How Different are Column-Stores and Row-Stores?

Simulation 3: Materialized Views For every query in the benchmark workload,

create an optimal set of materialized views. Involve only the columns needed to answer queries.

Do not pre-join columns from different tables in these materialized views.

Problem of materialized views. It requires query workload to be known in advance.

Page 12: C-Store: How Different are Column-Stores and Row-Stores?

Comparison Results by Abadi et al. (SIGMOD’2008) Using a simplified version of TPC-H called the Star Schema Benchmark (SSBM).

Major Results: None of the three attemps to simulate a column st

ore in a row store are particularly effective. The Materialized View approach is the best The Vertical Partitioning approach is the moderate The Indexing Every Column approach is the worst.

Page 13: C-Store: How Different are Column-Stores and Row-Stores?

Relevant Factors for Successful Simulation For successfully simulating a column store in

a row store, we may have to Store tuple headers separately Use virtual tuple-id to do tuple reconstruction (i.e.,

joins). Maintain heap files (for base relations) in guarante

ed position order.

Page 14: C-Store: How Different are Column-Stores and Row-Stores?

Simulation 4: the C-Table Approach(By Bruno, CIDR’2009)The main idea is to extend the Vertical Partitioning approach to explicitly the Run-Length Encoding (RLE) of tuple values.

A C-Table is a sequence of (f, v, c) triples:

f is the starting tuple-id or position.

v is the data value.

c is the count (length).First sort by a, then b, and finally c.

Page 15: C-Store: How Different are Column-Stores and Row-Stores?

An Interesting Property of C-Table For any pair of tuples t1 and t2, possibly on different

c-tables, the range [f1, f1 + c1 -1] and [f2, f2 + c2 -1] do not partially

overlap, i.e., they are either disjoint, or the one associated with the c

olumn deeper in the sort order is included in the other.

This property allows us to use specific query rewriting to combine information from different c-tables to answer queries.

Page 16: C-Store: How Different are Column-Stores and Row-Stores?

Query Rewriting for C-Tables

The Schema D1: (lineitem | l_shipdate, l_suppkey)

Page 17: C-Store: How Different are Column-Stores and Row-Stores?

Comparison Results by Bruno. (CIDR’2009) Using TPC-H. Using for comparison a loose lower bound on any

column store implementation Manually compute how many pages in disk need to be read

by any column store execution plan, And measure the time taken to just read the pages.

Results: The C-Table approach can compete with column stores.

Vision: Row-stores should be able to incorporate column specific

optimizations.

Page 18: C-Store: How Different are Column-Stores and Row-Stores?

Column-Specific Optimizations Late materialization

Construct tuples as late as possible. Block iteration

Multiple-tuples-at-a-time Column-specific compression

Run-Length Encoding Invisible join on star schema (New)

Like semi-join in distributed DBMS.

Page 19: C-Store: How Different are Column-Stores and Row-Stores?

Schema of the SSBM Benchmark

Page 20: C-Store: How Different are Column-Stores and Row-Stores?

An Example Query for Illustrating Invisible Join

Page 21: C-Store: How Different are Column-Stores and Row-Stores?

Invisible Join: First PhaseEach predicate is applied to the appropriate dimension table to extract a list of dimension table keys that satisfy the predicate.

Page 22: C-Store: How Different are Column-Stores and Row-Stores?

Invisible Join: Second PhaseEach hash table is used to extract the positions of tuples in the fact table that satisfy the corresponding predicate.

Page 23: C-Store: How Different are Column-Stores and Row-Stores?

Invisible Join: Third PhaseThe third phase uses the list of satisfying positions P in the fact table to get foreign key values and hence needed data values from the corresponding dimension table.

Page 24: C-Store: How Different are Column-Stores and Row-Stores?

Which Column-Specific Optimization is most significant? Late materialization

Improves performance by a factor of 3. Block iteration

Improves performance by 5% - 50%. Column-specific compression

Improves performance by a factor of 2 on average Improves performance by a factor of 10 on queries that acc

ess sorted data. Invisible join on star schema

Improves performance by 50% - 70%.

Page 25: C-Store: How Different are Column-Stores and Row-Stores?

Open Question

Building a complete row-store that can transform into a column-store on workloads where column-stores perform well.

Page 26: C-Store: How Different are Column-Stores and Row-Stores?

References

Daniel J. Abadi, Samuel R. Madden, and Nabil Hachem. Column-Stores vs. Row-Stores: How Different Are They Really?. In SIGMOD, 2008.

Allison L. Holloway, David J. DeWitt. Read-optimized databases, in depth. In VLDB, 2008.

N Bruno. Teaching an Old Elephant New Tricks. In CIDR’2009.