Difference Between Column-Stores and OLAP Data Cubes

Embed Size (px)

DESCRIPTION

Info

Citation preview

Understanding the Difference Between Column-Stores and OLAP Data Cubesby smadden on July 7th, 2008 in big dataPrevious PostNext PostBoth column-stores and data cubes are designed to provide high performance on analytical database workloads (often referred to as Online Analytical Processing, or OLAP.) These workloads are characterized by queries that select a subset of tuples, and then aggregate and group along one or more dimensions. For example, in a sales database, one might wish to find the sales of technology products by month and storethe SQL query to do this would look like:SELECT month, store, COUNT(*)FROM sales, productsWHERE productType = technologyAND products.id = sales.productIDGROUP BY month, storeIn this post, we study how column-stores and data cubes would evaluate this query on a sample database:Column Store AnalysisIn column-stores, this query would be answered by scanning the productType column of the products table to find the ids that have type technology. These ids would then be used to filter the productID column of the sales table to find positions of records with the appropriate product type. Finally, these positions would be used to select data from themonths and stores columns for input into the GROUP BY operator. Unlike in a row-store, the column-store only has to read a few columns of the sales table (which, in most data warehouses, would contain tens of columns), making it significantly faster than most commercial relational databases that use row-based technology.Also, if the table is sorted on some combination of the attributes used in the query (or if a materialized view or projection of the table sorted on these attributes is available), then substantial performance gains can be obtained both from compression and the ability to directly offset to ranges of satisfying tuples. For example, notice that the sales table is sorted on productID, then month, then storeID. Here, all of the records for a givenproductID are co-located, so the extraction of matching productIDs can be done very quickly using binary search or a sparse index that gives the first record of each distinctproductID. Furthermore, the productID column can be effectively run-length encoded to avoid storing repeated values, which will use much less storage space. Run-length encoding will also be effective on the month and storeID columns, since for a group of records representing a specific productID, month is sorted, and for a group of records representing a given (productID,month) pair, storeID is sorted. For example, if there are 1,000,000 sales records of about 1,000 products sold by 10 stores, with sales uniformly distributed across products, months and stores, then the productID column can be stored in 1,000 records (one entry per product), the month column can be stored in 1,000 x 12 = 12,000 records, and the storeID column can be stored in and 1,000 x 12 x 10 = 120,000 records. This compression means that less the amount of data read from disk is less than 5% of its uncompressed size.Data Cube AnalysisData cube-based solutions (sometimes referred to as MOLAP systems for multidimensional online analytical processing), are represented by commercial products such as EssBase. They store data in array-like structures, where the dimensions of the array represent columns of the underlying tables, and the values of the cells represent pre-computed aggregates over the data. A data cube on the product, store, and month attributes of the sales table, for example, would be stored in an array format as shown in the figure above. Here, the cube includes roll-up cells that summarize the values of the cells in the same row, column, or stack (x,y position.) If we want to use a cube to compute the values of the COUNT aggregate, as in the query above, the cells of this cube would look like: Here, each cell contains the count of the number of records with a given (productID,month,storeID) value. For example, there is one record with storeID=1, productID=2, and month=April. The sum fields indicate the values of the COUNT rolled up on specific dimensions; for example, looking at the lower left hand corner of the cube for Store 1, we can see that in storeID 1, productID 1 was sold twice across all months. Thus, to answer the above query using a data cube, we first identify the subset of the cube that satisfies the WHERE clause (here, products 3, 4, and 5 are technology productsthis is indicated by their dark shading in the above figure.) Then, the system reads the pre-aggregated values from sum fields for the unrestricted attributes (store and month), which gives the result that store 2 had 1 technology sale in Feburary and 1 in June, and that store 3 had 1 technology sale in February and 1 in October.The advantages of a data cube should be clearit contains pre-computed aggregate values that make it a very compact and efficient way to retrieve answers for specific aggregate queries. It can be used to efficiently compute a hierarchy of aggregatesfor example, the sum columns in the above cube make it is very fast to compute the number of sales in a given month across all stores, or the number of sales or a particular product across the entire year in a given store. Because the data is stored in an array-structure, and each element is the same size, direct offsetting to particular values may be possible. However, data cubes have several limitations:Sparsity: Looking at the above cube, most of the cells are empty. This is not simply an artifact our sample data set being smallthe number of cells in a cube is the product of the cardinalities of the dimensions in the cube. Our 3D cube with 10 stores and 1,000 products would have 120,000 cells, and adding a fourth dimension, such as customerID (with, say, 10,000 values), would cause the number of cells to balloon to 1.2 billion! Such high dimensionality cubes cannot be stored without compression. Unfortunately, compression can limit performance somewhat, as direct offsetting is no longer possible. For example, a common technique is to store them as a table with the values and positions of the non-empty cells, resulting in an implementation much like a row-oriented relational database!Inflexible, Limited ad-hoc query support: Data cubes work great when a cube aggregated on the dimensions of interest and using the desired aggregation functions is available. Consider, however, what happens in the above example if the user wants to compute the average sale price rather than the count of sales, or if the user wants to include aggregates on customerID in addition to the other attributes. If no cube is available, the user has no choice but to fall back to queries on an underlying relational system. Furthermore, if theuser wants to drill down into the underlying dataasking, for example who was the customer who bought a technology product at store 2 in February?the cube cannot be used (one could imagine storing entire tuples, or pointers to tuples, in the cells of a cube, but like sparse representations, this significantly complicates the representation of a cube and can lead to storage space explosions.) To deal with these limitations, some cube systems support what is called HOLAP or hybrid online analytical processing, where they will automatically redirect queries that cannot be answered with cubes to a relational system, but such queries run as fast as whatever relational system executes them.Long load times: Computing a cube requires a complex aggregate query over all of the data in a warehouse (essentially, every record has to be read from the database.) Though it is possible to incrementally update cubes as new data arrives, it is impractical to dynamically create new cubes to answer ad-hoc queries.Summary and DiscussionData cubes work well in environments where the query workload is predictable, so that cubes needed to answer specific queries can be pre-computed. They are inappropriate for ad-hoc queries or in situations where complex relational expressions are needed.In contrast, column-stores provide very good performance across a much wider range of queries (all of SQL!) However, for low-dimensionality pre-computed aggregates, it is likely that a data-cube solution will outperform a column store. For many-dimensional aggregates, the tradeoff is less clear, as sparse cube representations are unlikely to perform any better than a column store.Finally, it is worth noting that there is no reason that cubes cannot be combined with column-stores, especially in a HOLAP-style configuration where queries not directly answerable from a cube are redirected to an underlying column-store system. That said, given that column-stores will typically get very good performance on simple aggregate queries (even if cubes are slightly faster), it is not clear if the incremental cost of maintaining and loading an additional cube system to compute aggregates is ever worthwhile in a column-store world. Furthermore, existing HOLAP products, which are based on row-stores, are likely to be an order of magnitude or more slower than column-stores on ad-hoc queries that cannot be answered by the MOLAP system, for the same reasons discussed elsewhere in this blog.